
Optimizer

Definition

An optimization problem refers to finding $x_{\ast}$ such that the value of a function $f : \mathbb{R}^{n} \to \mathbb{R}$ is minimized.

$$ x_{\ast} = \argmin\limits_{x} f(x) $$

A series of algorithms used to solve an optimization problem is called an optimizer.

Explanation

In machine learning and deep learning, the function $f$ is referred to as the loss function, and in this context $x$ becomes the parameters of the neural network, that is, the weights.

Stochastic Gradient Descent

The optimizers used in deep learning are almost all variants of stochastic gradient descent. Let’s denote the loss function by $L$ and the parameters by $\boldsymbol{\theta}$. Gradient descent refers to the following type of optimizer:

$$ \boldsymbol{\theta}_{i+1} = \boldsymbol{\theta}_{i} - \alpha \nabla L(\boldsymbol{\theta}_{i}) $$
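As a rough sketch of this update rule (not from the original article), the loop below runs plain gradient descent on a toy quadratic loss; the loss, the step size $\alpha$, and the iteration count are assumptions chosen only for illustration.

```python
import numpy as np

# Toy loss L(theta) = ||theta - c||^2 / 2, whose exact minimizer is c (illustrative choice).
c = np.array([3.0, -1.0])

def grad_L(theta):
    # Gradient of the toy loss: nabla L(theta) = theta - c.
    return theta - c

alpha = 0.1              # learning rate (assumed value)
theta = np.zeros(2)      # initial parameters theta_0

for i in range(100):
    # theta_{i+1} = theta_i - alpha * nabla L(theta_i)
    theta = theta - alpha * grad_L(theta)

print(theta)  # converges toward c = [3, -1]
```

In deep learning the gradient is usually computed on a random mini-batch of the training data rather than the full dataset, which is what makes the method stochastic.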

Momentum Technique

The momentum technique is an optimizer that accumulates gradients from previous steps as follows:

$$ \boldsymbol{\theta}_{i+1} = \boldsymbol{\theta}_{i} - \alpha \sum\limits_{j=0}^{i} \beta^{j} \nabla L(\boldsymbol{\theta}_{i-j}) $$
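A sketch of the same rule in its usual recursive (velocity) form, which unrolls to the geometric sum above; the toy loss and the values of $\alpha$ and $\beta$ are assumptions for illustration.

```python
import numpy as np

c = np.array([3.0, -1.0])          # toy loss L(theta) = ||theta - c||^2 / 2

def grad_L(theta):
    return theta - c

alpha, beta = 0.1, 0.9             # step size and momentum coefficient (assumed values)
theta = np.zeros(2)
p = np.zeros(2)                    # exponentially decayed accumulation of gradients, p_0 = 0

for i in range(200):
    # p_{i+1} = beta * p_i - alpha * nabla L(theta_i); unrolling this recursion
    # gives the sum of beta^j-weighted past gradients in the equation above.
    p = beta * p - alpha * grad_L(theta)
    theta = theta + p

print(theta)  # converges toward c = [3, -1]
```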

Nesterov Momentum

Nesterov momentum is a slight modification of the momentum technique. For $\mathbf{p}_{0} = \mathbf{0}$,

$$ \mathbf{p}_{i+1} = \beta \mathbf{p}_{i} - \alpha \nabla L(\boldsymbol{\theta}_{i} + \beta \mathbf{p}_{i}) $$

$$ \boldsymbol{\theta}_{i+1} = \boldsymbol{\theta}_{i} + \mathbf{p}_{i+1} $$
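The only change from the momentum sketch above is where the gradient is evaluated: at the look-ahead point $\boldsymbol{\theta}_{i} + \beta \mathbf{p}_{i}$ instead of $\boldsymbol{\theta}_{i}$. The toy loss and hyperparameter values are again assumptions.

```python
import numpy as np

c = np.array([3.0, -1.0])          # toy loss L(theta) = ||theta - c||^2 / 2

def grad_L(theta):
    return theta - c

alpha, beta = 0.1, 0.9             # assumed values
theta = np.zeros(2)
p = np.zeros(2)                    # p_0 = 0

for i in range(200):
    lookahead = theta + beta * p               # theta_i + beta * p_i
    p = beta * p - alpha * grad_L(lookahead)   # p_{i+1}
    theta = theta + p                          # theta_{i+1} = theta_i + p_{i+1}

print(theta)  # converges toward c = [3, -1]
```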

AdaGrad

AdaGrad is an optimizer that applies an adaptive learning rate. If we denote the gradient simply as $\nabla L_{i} = \nabla L (\boldsymbol{\theta}_{i})$, then for $\mathbf{r}_{0} = \mathbf{0}$,

$$ \begin{align*} \mathbf{r}_{i} &= \mathbf{r}_{i-1} + (\nabla L_{i}) \odot (\nabla L_{i}) = \sum_{j=1}^{i} (\nabla L_{j}) \odot (\nabla L_{j}) \\ \boldsymbol{\alpha}_{i} &= \dfrac{\epsilon}{\delta + \sqrt{\mathbf{r}_{i}}} \\ \boldsymbol{\theta}_{i+1} &= \boldsymbol{\theta}_{i} - \boldsymbol{\alpha}_{i} \odot \nabla L_{i} \end{align*} $$
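A sketch of the update above on the same toy loss; the values of $\epsilon$ and $\delta$ are assumptions, and the square, square root, and division are all element-wise, matching $\odot$.

```python
import numpy as np

c = np.array([3.0, -1.0])          # toy loss L(theta) = ||theta - c||^2 / 2

def grad_L(theta):
    return theta - c

eps, delta = 0.5, 1e-7             # global step size and stabilizer (assumed values)
theta = np.zeros(2)
r = np.zeros(2)                    # running sum of element-wise squared gradients, r_0 = 0

for i in range(500):
    g = grad_L(theta)
    r = r + g * g                          # r_i = r_{i-1} + grad ⊙ grad
    alpha = eps / (delta + np.sqrt(r))     # per-coordinate adaptive learning rate
    theta = theta - alpha * g

print(theta)  # converges toward c = [3, -1]
```

Because $\mathbf{r}_{i}$ only grows, the effective learning rate shrinks monotonically over the course of training, which is AdaGrad's characteristic behavior.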

RMSProp

RMSProp is a variation of AdaGrad that replaces the plain sum of squared gradients with an exponentially weighted average, so that older terms in the accumulation decay.

$$ \begin{align*} \mathbf{r}_{i} &= \rho \mathbf{r}_{i-1} + (1-\rho) (\nabla L_{i}) \odot (\nabla L_{i}) = (1-\rho) \sum_{j=1}^{i} \rho^{i-j} (\nabla L_{j}) \odot (\nabla L_{j}) \\ \boldsymbol{\alpha}_{i} &= \dfrac{\epsilon}{\delta + \sqrt{\mathbf{r}_{i}}} \\ \boldsymbol{\theta}_{i+1} &= \boldsymbol{\theta}_{i} - \boldsymbol{\alpha}_{i} \odot \nabla L_{i} \end{align*} $$
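The same sketch with the decayed accumulation; $\rho$, $\epsilon$, and $\delta$ are assumed values.

```python
import numpy as np

c = np.array([3.0, -1.0])          # toy loss L(theta) = ||theta - c||^2 / 2

def grad_L(theta):
    return theta - c

eps, delta, rho = 0.01, 1e-7, 0.9  # assumed values
theta = np.zeros(2)
r = np.zeros(2)                    # exponentially weighted average of squared gradients, r_0 = 0

for i in range(1000):
    g = grad_L(theta)
    r = rho * r + (1 - rho) * g * g        # decayed accumulation of grad ⊙ grad
    alpha = eps / (delta + np.sqrt(r))     # per-coordinate adaptive learning rate
    theta = theta - alpha * g

print(theta)  # ends up close to c = [3, -1]
```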

Adam

Adam is an optimizer that combines adaptive learning rates and momentum. For $\mathbf{p}_{0} = \mathbf{r}_{0} = \mathbf{0}$,

$$ \begin{align*} \mathbf{p}_{i} &= \beta_{1}\mathbf{p}_{i-1} + (1-\beta_{1}) \nabla L_{i} \\[0.5em] \hat{\mathbf{p}}_{i} &= \dfrac{\mathbf{p}_{i}}{1-(\beta_{1})^{i}} \\[0.5em] \mathbf{r}_{i} &= \beta_{2} \mathbf{r}_{i-1} + (1-\beta_{2}) \nabla L_{i} \odot \nabla L_{i} \\[0.5em] \hat{\mathbf{r}}_{i} &= \dfrac{\mathbf{r}_{i}}{1-(\beta_{2})^{i}} \\[0.5em] \hat{\boldsymbol{\alpha}}_{i} &= \dfrac{\epsilon}{\delta + \sqrt{\hat{\mathbf{r}}_{i}}} \\[0.5em] \boldsymbol{\theta}_{i+1} &= \boldsymbol{\theta}_{i} - \hat{\boldsymbol{\alpha}}_{i} \odot \hat{\mathbf{p}}_{i} \end{align*} $$
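A sketch of the six equations above on the toy loss. The iteration index starts at 1 so the bias-correction denominators $1-(\beta_{1})^{i}$ and $1-(\beta_{2})^{i}$ are nonzero; the hyperparameter values are common defaults, assumed here for illustration.

```python
import numpy as np

c = np.array([3.0, -1.0])          # toy loss L(theta) = ||theta - c||^2 / 2

def grad_L(theta):
    return theta - c

eps, delta = 0.1, 1e-8             # assumed values
beta1, beta2 = 0.9, 0.999          # common default decay rates (assumed)
theta = np.zeros(2)
p = np.zeros(2)                    # first moment (momentum), p_0 = 0
r = np.zeros(2)                    # second moment, r_0 = 0

for i in range(1, 501):
    g = grad_L(theta)
    p = beta1 * p + (1 - beta1) * g            # p_i
    r = beta2 * r + (1 - beta2) * g * g        # r_i
    p_hat = p / (1 - beta1 ** i)               # bias-corrected first moment
    r_hat = r / (1 - beta2 ** i)               # bias-corrected second moment
    theta = theta - eps / (delta + np.sqrt(r_hat)) * p_hat

print(theta)  # ends up close to c = [3, -1]
```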

Others

Graduate Student Descent

Graduate Student Descent refers to using graduate students as an optimizer. It has been in use since ancient times and is still actively employed worldwide. Performance varies, but relative to its cost it is known to deliver remarkably efficient optimization.

Monte Carlo

Monte Carlo search simply tries as many randomly sampled points as possible and keeps the best one.

Grid Search means, as its name suggests, dividing the Euclidean space $\mathbb{R}^{n}$ into a grid and trying many points to find the optimal solution.
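A sketch contrasting the two approaches on a toy objective over a bounded box in $\mathbb{R}^{2}$; the search range, sample count, and grid resolution are assumptions chosen for illustration.

```python
import numpy as np

def f(x):
    # Toy black-box objective with minimizer [3, -1]; any function of x works here.
    return np.sum((x - np.array([3.0, -1.0])) ** 2)

# Monte Carlo: sample random points in the box [-5, 5]^2 and keep the best one.
rng = np.random.default_rng(0)
samples = rng.uniform(-5.0, 5.0, size=(10_000, 2))
mc_best = min(samples, key=f)

# Grid search: evaluate f on a regular grid over the same box.
axis = np.linspace(-5.0, 5.0, 101)               # grid spacing of 0.1 per axis
grid = np.array([[x, y] for x in axis for y in axis])
grid_best = min(grid, key=f)

print(mc_best, grid_best)   # both land near the true minimizer [3, -1]
```

Both approaches require a number of evaluations that grows quickly with the dimension $n$, so in deep learning they are typically used for low-dimensional hyperparameter search rather than for the weights themselves.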