
Optimizer

Definition

An optimization problem refers to finding $x_{\ast}$ such that the value of a function $f : \mathbb{R}^{n} \to \mathbb{R}$ is minimized.

$$ x_{\ast} = \argmin\limits_{x} f(x) $$

An algorithm used to solve such an optimization problem is called an optimizer.

Explanation

In machine learning and deep learning, the function $f$ is referred to as the loss function, and in this context $x$ is the set of parameters of the neural network, that is, the weights.

Stochastic Gradient Descent

The optimizers used in deep learning are almost all variants of stochastic gradient descent. Let’s denote the loss function as $L$ and the parameters as $\boldsymbol{\theta}$. Gradient descent refers to the following type of optimizer, where $\alpha$ is the learning rate:

$$ \boldsymbol{\theta}_{i+1} = \boldsymbol{\theta}_{i} - \alpha \nabla L(\boldsymbol{\theta}_{i}) $$

It is called stochastic when the gradient is estimated on a randomly sampled mini-batch of the training data rather than on the full dataset.
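For concreteness, here is a minimal NumPy sketch of this update rule on a toy quadratic loss; the function `grad_L`, the initial point, and the learning rate are illustrative assumptions, not part of the definition above.

```python
import numpy as np

# Toy loss L(theta) = ||theta||^2 / 2, whose gradient is grad L(theta) = theta.
# In deep learning, grad_L would instead be a gradient estimate computed on a mini-batch.
def grad_L(theta):
    return theta

theta = np.array([3.0, -2.0])  # initial parameters theta_0
alpha = 0.1                    # learning rate

for i in range(100):
    theta = theta - alpha * grad_L(theta)  # theta_{i+1} = theta_i - alpha * grad L(theta_i)

print(theta)  # moves toward the minimizer [0, 0]
```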

Momentum Technique

The momentum technique is an optimizer that accumulates the gradients of previous steps with an exponential decay factor $\beta \in [0, 1)$:

$$ \boldsymbol{\theta}_{i+1} = \boldsymbol{\theta}_{i} - \alpha \sum\limits_{j=0}^{i} \beta^{j} \nabla L(\boldsymbol{\theta}_{i-j}) $$

Equivalently, with an accumulated update $\mathbf{p}_{i+1} = \beta \mathbf{p}_{i} - \alpha \nabla L(\boldsymbol{\theta}_{i})$ and $\mathbf{p}_{0} = \mathbf{0}$, the update is $\boldsymbol{\theta}_{i+1} = \boldsymbol{\theta}_{i} + \mathbf{p}_{i+1}$.
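A minimal sketch of the equivalent recursive form, under the same toy-loss assumptions as above (`grad_L`, the initial point, and the hyperparameter values are illustrative):

```python
import numpy as np

def grad_L(theta):
    return theta  # toy quadratic loss L(theta) = ||theta||^2 / 2

theta = np.array([3.0, -2.0])
p = np.zeros_like(theta)       # accumulated update, p_0 = 0
alpha, beta = 0.1, 0.9         # learning rate and momentum coefficient

for i in range(100):
    # Unrolling this recursion reproduces the geometric sum of past gradients above.
    p = beta * p - alpha * grad_L(theta)
    theta = theta + p
```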

Nesterov Momentum

Nesterov momentum is a slight modification of the momentum technique. For $\mathbf{p}_{0} = \mathbf{0}$,

$$ \mathbf{p}_{i+1} = \beta \mathbf{p}_{i} - \alpha \nabla L(\boldsymbol{\theta}_{i} + \beta \mathbf{p}_{i}) $$

$$ \boldsymbol{\theta}_{i+1} = \boldsymbol{\theta}_{i} + \mathbf{p}_{i+1} $$
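A minimal sketch under the same toy assumptions; the only change from the plain momentum sketch is that the gradient is evaluated at the look-ahead point $\boldsymbol{\theta}_{i} + \beta \mathbf{p}_{i}$.

```python
import numpy as np

def grad_L(theta):
    return theta  # toy quadratic loss L(theta) = ||theta||^2 / 2

theta = np.array([3.0, -2.0])
p = np.zeros_like(theta)       # p_0 = 0
alpha, beta = 0.1, 0.9         # illustrative hyperparameters

for i in range(100):
    # Gradient is evaluated at the look-ahead point theta + beta * p.
    p = beta * p - alpha * grad_L(theta + beta * p)
    theta = theta + p
```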

AdaGrad

AdaGrad is an optimizer that applies an adaptive, per-parameter learning rate. Denote the gradient simply as $\nabla L_{i} = \nabla L (\boldsymbol{\theta}_{i})$, and let $\epsilon$ be a global step size and $\delta$ a small constant that prevents division by zero. Starting from $\mathbf{r}_{0} = \mathbf{0}$, AdaGrad accumulates the squares of all past gradients:

$$ \begin{align*} \mathbf{r}_{i} &= \mathbf{r}_{i-1} + \nabla L_{i} \odot \nabla L_{i} = \sum_{j=1}^{i} \nabla L_{j} \odot \nabla L_{j} \\ \boldsymbol{\alpha}_{i} &= \dfrac{\epsilon}{\delta + \sqrt{\mathbf{r}_{i}}} \\ \boldsymbol{\theta}_{i+1} &= \boldsymbol{\theta}_{i} - \boldsymbol{\alpha}_{i} \odot \nabla L_{i} \end{align*} $$

The square root and division are applied element-wise, so parameters whose gradients have been large receive smaller steps.
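A minimal sketch of these updates, again with the toy loss and illustrative values for $\epsilon$ and $\delta$:

```python
import numpy as np

def grad_L(theta):
    return theta  # toy quadratic loss L(theta) = ||theta||^2 / 2

theta = np.array([3.0, -2.0])
r = np.zeros_like(theta)       # running sum of squared gradients, r_0 = 0
eps, delta = 0.5, 1e-8         # global step size and stabilizing constant (illustrative)

for i in range(100):
    g = grad_L(theta)
    r = r + g * g                         # r_i = r_{i-1} + grad (.) grad
    alpha = eps / (delta + np.sqrt(r))    # per-parameter learning rate
    theta = theta - alpha * g
```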

RMSProp

RMSProp is a variation of AdaGrad that replaces the sum of squared gradients with an exponentially weighted moving average, so that old gradients are gradually forgotten.

$$ \begin{align*} \mathbf{r}_{i} &= \rho \mathbf{r}_{i-1} + (1-\rho) \nabla L_{i} \odot \nabla L_{i} = (1-\rho) \sum_{j=1}^{i} \rho^{i-j} \nabla L_{j} \odot \nabla L_{j} \\ \boldsymbol{\alpha}_{i} &= \dfrac{\epsilon}{\delta + \sqrt{\mathbf{r}_{i}}} \\ \boldsymbol{\theta}_{i+1} &= \boldsymbol{\theta}_{i} - \boldsymbol{\alpha}_{i} \odot \nabla L_{i} \end{align*} $$
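A minimal sketch; compared with the AdaGrad sketch, only the update of `r` changes (hyperparameter values are illustrative):

```python
import numpy as np

def grad_L(theta):
    return theta  # toy quadratic loss L(theta) = ||theta||^2 / 2

theta = np.array([3.0, -2.0])
r = np.zeros_like(theta)
eps, delta, rho = 0.01, 1e-8, 0.9   # step size, stabilizer, decay rate (illustrative)

for i in range(200):
    g = grad_L(theta)
    r = rho * r + (1 - rho) * g * g      # exponentially weighted average of squared gradients
    alpha = eps / (delta + np.sqrt(r))
    theta = theta - alpha * g
```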

Adam

Adam is an optimizer that combines adaptive learning rates and momentum. For $\mathbf{p}_{0} = \mathbf{r}_{0} = \mathbf{0}$ and decay rates $\beta_{1}, \beta_{2} \in [0, 1)$,

$$ \begin{align*} \mathbf{p}_{i} &= \beta_{1}\mathbf{p}_{i-1} + (1-\beta_{1}) \nabla L_{i} \\[0.5em] \hat{\mathbf{p}}_{i} &= \dfrac{\mathbf{p}_{i}}{1-(\beta_{1})^{i}} \\[0.5em] \mathbf{r}_{i} &= \beta_{2} \mathbf{r}_{i-1} + (1-\beta_{2}) \nabla L_{i} \odot \nabla L_{i} \\[0.5em] \hat{\mathbf{r}}_{i} &= \dfrac{\mathbf{r}_{i}}{1-(\beta_{2})^{i}} \\[0.5em] \hat{\boldsymbol{\alpha}}_{i} &= \dfrac{\epsilon}{\delta + \sqrt{\hat{\mathbf{r}}_{i}}} \\[0.5em] \boldsymbol{\theta}_{i+1} &= \boldsymbol{\theta}_{i} - \hat{\boldsymbol{\alpha}}_{i} \odot \hat{\mathbf{p}}_{i} \end{align*} $$

The hats denote bias correction, which compensates for the moving averages being initialized at zero.
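A minimal sketch of the full procedure, using the commonly quoted defaults $\beta_{1} = 0.9$ and $\beta_{2} = 0.999$; the toy loss and remaining values are illustrative assumptions.

```python
import numpy as np

def grad_L(theta):
    return theta  # toy quadratic loss L(theta) = ||theta||^2 / 2

theta = np.array([3.0, -2.0])
p = np.zeros_like(theta)       # first moment (momentum), p_0 = 0
r = np.zeros_like(theta)       # second moment (squared gradients), r_0 = 0
eps, delta = 0.01, 1e-8        # illustrative step size and stabilizer
beta1, beta2 = 0.9, 0.999

for i in range(1, 1001):                     # i starts at 1 for the bias correction
    g = grad_L(theta)
    p = beta1 * p + (1 - beta1) * g          # momentum term
    r = beta2 * r + (1 - beta2) * g * g      # adaptive-scaling term
    p_hat = p / (1 - beta1 ** i)             # bias correction
    r_hat = r / (1 - beta2 ** i)
    theta = theta - eps * p_hat / (delta + np.sqrt(r_hat))
```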

Others

Graduate Student Descent

Graduate Student Descent refers to using graduate students as an optimizer. It has been in use since ancient times and is still actively employed worldwide. Performance varies, but in terms of cost it is known to deliver remarkably cost-effective optimization.

Monte Carlo

The Monte Carlo approach simply evaluates the objective at as many randomly sampled points as possible and keeps the best one.

Grid Search

Grid Search means, as its name suggests, dividing a bounded region of $\mathbb{R}^{n}$ into a grid and evaluating the objective at each grid point to find the best one.
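A minimal sketch of both approaches on a toy two-dimensional objective; the objective, the search region, and the sample counts are illustrative assumptions.

```python
import numpy as np

def f(x):
    return np.sum(x ** 2, axis=-1)   # toy objective on R^2, minimized at the origin

rng = np.random.default_rng(0)

# Monte Carlo: evaluate f at many random points and keep the best one.
samples = rng.uniform(-5.0, 5.0, size=(10_000, 2))
x_best_mc = samples[np.argmin(f(samples))]

# Grid search: evaluate f on a regular grid over a bounded region of R^2.
axis = np.linspace(-5.0, 5.0, 101)
xx, yy = np.meshgrid(axis, axis)
grid = np.stack([xx.ravel(), yy.ravel()], axis=-1)
x_best_grid = grid[np.argmin(f(grid))]

print(x_best_mc, x_best_grid)  # both should land near [0, 0]
```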