Momentum Method in Gradient Descent
Overview
In gradient descent, the momentum technique means using not only the current gradient but all previously computed gradients when updating the parameters. That is the essence, and really the whole explanation. Explanations that instead start from unfamiliar update equations, or from the physics analogy of momentum with the mass set to $1$ and the initial velocity set to $0$, only complicate understanding. This article explains the momentum technique as plainly as possible.
Build-Up
Let’s denote the parameters as $\boldsymbol{\theta}$ and the loss function as $L$. The standard gradient descent method updates the parameters iteratively as follows:
$$ \boldsymbol{\theta}_{i+1} = \boldsymbol{\theta}_{i} - \alpha \nabla L_{i} $$
Here, $L_{i} = L(\boldsymbol{\theta}_{i})$ denotes the loss function evaluated at the $i$th iteration. The momentum technique simply subtracts, in addition to the current gradient, the gradients $\nabla L_{i-1}, \ldots, \nabla L_{0}$ computed in all previous iterations.
$$ \boldsymbol{\theta}_{i+1} = \boldsymbol{\theta}_{i} - \alpha \nabla L_{i} - \alpha \nabla L_{i-1} - \cdots - \alpha \nabla L_{0} $$
To reduce the influence of older gradients as the iterations progress, and to keep the sum of gradients from diverging, a coefficient $\beta \in (0,1)$ is introduced as follows:
$$ \begin{align} \boldsymbol{\theta}_{i+1} &= \boldsymbol{\theta}_{i} - \alpha \nabla L_{i} - \beta \alpha \nabla L_{i-1} - \cdots - \beta^{i}\alpha \nabla L_{0} \nonumber \\ &= \boldsymbol{\theta}_{i} - \alpha\sum_{j=0}^{i} \beta^{j} \nabla L_{i-j} \end{align} $$
Definition
Updating the parameters as in $(1)$ is called the momentum method, and the added term $\alpha\sum\limits_{j=0}^{i} \beta^{j} \nabla L_{i-j}$ is referred to as momentum.
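As a minimal sketch of update $(1)$ (the one-dimensional quadratic loss and the values of $\alpha$, $\beta$ below are assumptions chosen purely for illustration), the momentum term can be computed literally as the decayed sum of all stored gradients:

```python
# Assumed toy loss for illustration: L(theta) = theta^2, so grad L(theta) = 2 * theta
def grad_L(theta):
    return 2.0 * theta

alpha, beta = 0.1, 0.9        # hand-picked learning rate and decay coefficient
theta = 5.0                   # initial parameter
grads = []                    # stores grad L_0, ..., grad L_i

for i in range(200):
    grads.append(grad_L(theta))
    # momentum term of (1): alpha * sum_j beta^j * grad L_{i-j}
    momentum = alpha * sum(beta**j * grads[i - j] for j in range(i + 1))
    theta = theta - momentum

print(theta)  # close to the minimizer 0
```

Storing every past gradient is only for matching $(1)$ term by term; the Algorithm section below gives the equivalent recursive form that keeps a single momentum vector.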
Explanation
According to the definition, the momentum method is a generalization of gradient descent: gradient descent is recovered as the special case $\beta = 0$. The closer $\beta$ is to $1$, the more strongly previous gradients are reflected; the closer it is to $0$, the less they matter.
Gradient descent updates parameters in the direction of the steepest current gradient and is thus a greedy algorithm. The momentum method mitigates this greediness, enabling choices that may not be the best in the short term but are more beneficial in the long run. It also helps prevent sudden and excessive changes in gradient direction.
Since the update step accumulates past gradients and is therefore generally larger than a single gradient step, the momentum method has the advantage of faster convergence compared to standard gradient descent. Empirically, it is also known to escape local minima more effectively, much as a ball rolling down a hill can overcome small bumps if it has enough speed.
It is important to note that there is no absolute superiority among these optimizers, including adaptive learning rate techniques. The optimal optimizer varies by field and task, so it’s unwise to ask or judge which is the best. It’s helpful to learn what is commonly used in your field, or if unsure, to try SGD+momentum or Adam.
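For reference, in a framework such as PyTorch (used here only as an example of common practice; the tiny `model` below is a placeholder), choosing between SGD+momentum and Adam is a one-line decision:

```python
import torch

model = torch.nn.Linear(10, 1)   # placeholder model, assumed for illustration

# SGD with momentum (the coefficient beta of this article is the `momentum` argument)
opt_sgd = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# Adam, an adaptive learning-rate method
opt_adam = torch.optim.Adam(model.parameters(), lr=1e-3)
```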
Nesterov Momentum
To summarize the momentum method: to obtain the next parameter $\boldsymbol{\theta}_{i+1}$, the gradient evaluated at the current parameter, $\alpha \nabla L(\boldsymbol{\theta}_{i})$, is accumulated with the decayed previous gradients and applied to the current parameter $\boldsymbol{\theta}_{i}$.
Nesterov momentum, or Nesterov accelerated gradient (NAG), instead evaluates the gradient at the look-ahead point obtained by shifting the current parameter along the accumulated momentum, and applies that gradient to the current parameter to obtain the next one. This might sound complicated, but if you understand the momentum method, the following algorithms may make Nesterov momentum easier to grasp.
Algorithm
Let’s denote the momentum term as $\mathbf{p}$.
Algorithm: Momentum Method | ||
---|---|---|
Input | Learning rate $\alpha$, momentum parameter $\beta$, epochs $N$ | |
1. | Initialize parameters $\boldsymbol{\theta}$ to random values. | |
2. | Initialize momentum $\mathbf{p} = \mathbf{0}$. | |
3. | for $k = 1, \cdots, N$ do | |
4. | $\mathbf{p} \leftarrow \beta \mathbf{p} - \alpha \nabla L(\boldsymbol{\theta})$ | # Update momentum with calculated gradient |
5. | $\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} + \mathbf{p}$ | # Update parameters |
6. | end for |
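The table above translates almost line for line into code. A sketch under an assumed toy loss $L(\boldsymbol{\theta}) = \tfrac{1}{2}\lVert\boldsymbol{\theta}\rVert^{2}$ (so $\nabla L(\boldsymbol{\theta}) = \boldsymbol{\theta}$), with hand-picked $\alpha$, $\beta$, $N$:

```python
import numpy as np

def grad_L(theta):
    # assumed toy loss: L(theta) = ||theta||^2 / 2, so grad L(theta) = theta
    return theta

alpha, beta, N = 0.1, 0.9, 200     # learning rate, momentum parameter, epochs
theta = np.random.randn(2)         # 1. initialize parameters randomly
p = np.zeros_like(theta)           # 2. initialize momentum p = 0

for k in range(N):                 # 3. loop over epochs
    p = beta * p - alpha * grad_L(theta)   # 4. update momentum with the current gradient
    theta = theta + p                      # 5. update parameters

print(theta)   # close to the minimizer [0, 0]
```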
Algorithm: Nesterov Momentum | ||
---|---|---|
Input | Learning rate $\alpha$, momentum parameter $\beta$, epochs $N$ | |
1. | Initialize parameters $\boldsymbol{\theta}$ to random values. | |
2. | Initialize momentum $\mathbf{p} = \mathbf{0}$. | |
3. | for $k = 1, \cdots, N$ do | |
4. | $\mathbf{p} \leftarrow \beta \mathbf{p} - \alpha \nabla L(\boldsymbol{\theta} + \beta \mathbf{p})$ | # Update momentum with calculated gradient |
5. | $\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} + \mathbf{p}$ | # Update parameters |
6. | end for |
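Nesterov momentum changes only where the gradient is evaluated: the look-ahead point $\boldsymbol{\theta} + \beta\mathbf{p}$ instead of $\boldsymbol{\theta}$. A sketch under the same toy assumptions as above:

```python
import numpy as np

def grad_L(theta):
    # same assumed toy loss: L(theta) = ||theta||^2 / 2
    return theta

alpha, beta, N = 0.1, 0.9, 200
theta = np.random.randn(2)         # 1. initialize parameters randomly
p = np.zeros_like(theta)           # 2. initialize momentum p = 0

for k in range(N):                 # 3. loop over epochs
    # 4. the only change: gradient at the look-ahead point theta + beta * p
    p = beta * p - alpha * grad_L(theta + beta * p)
    theta = theta + p              # 5. update parameters

print(theta)   # again approaches [0, 0]
```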
Writing out the first few iterations of both methods side by side gives the following. If we simply denote $\mathbf{p}_{i} = \alpha \nabla L_{i}$, and $\mathbf{p}^{i} = \alpha \nabla L(\boldsymbol{\theta}_{i} - \beta^{1}\mathbf{p}^{i-1} - \beta^{2}\mathbf{p}^{i-2} - \cdots - \beta^{i}\mathbf{p}^{0})$ (here $\mathbf{p}^{0} = \mathbf{p}_{0}$), then
Momentum | Nesterov Momentum |
---|---|
$\boldsymbol{\theta}\_{0} =$ initial | $\boldsymbol{\theta}\_{0} =$ initial |
$\boldsymbol{\theta}\_{1} = \boldsymbol{\theta}\_{0} - \alpha \nabla L\_{0} \\ \quad\ = \boldsymbol{\theta}\_{0} - \mathbf{p}\_{0}$ | $\boldsymbol{\theta}\_{1} = \boldsymbol{\theta}\_{0} - \alpha \nabla L\_{0} \\ \quad\ = \boldsymbol{\theta}\_{0} - \mathbf{p}^{0}$ |
$\boldsymbol{\theta}\_{2} = \boldsymbol{\theta}\_{1} - \alpha\nabla L\_{1} - \beta \mathbf{p}\_{0} \\ \quad\ = \boldsymbol{\theta}\_{1} - \mathbf{p}\_{1} - \beta \mathbf{p}\_{0}$ | $\boldsymbol{\theta}\_{2} = \boldsymbol{\theta}\_{1} - \alpha \nabla L(\boldsymbol{\theta}\_{1} - \beta \mathbf{p}^{0}) - \beta \mathbf{p}^{0} \\ \quad\ = \boldsymbol{\theta}\_{1} - \mathbf{p}^{1} - \beta \mathbf{p}^{0}$ |
$\boldsymbol{\theta}\_{3} = \boldsymbol{\theta}\_{2} - \mathbf{p}\_{2} - \beta \mathbf{p}\_{1} - \beta^{2} \mathbf{p}\_{0} \\ \quad\ = \boldsymbol{\theta}\_{2} - \sum\limits\_{j=0}^{2}\beta^{j}\mathbf{p}\_{2-j}$ | $\boldsymbol{\theta}\_{3} = \boldsymbol{\theta}\_{2} - \alpha \nabla L(\boldsymbol{\theta}\_{2} - \beta \mathbf{p}^{1} - \beta^{2}\mathbf{p}^{0}) - \beta \mathbf{p}^{1} - \beta^{2}\mathbf{p}^{0} \\ \quad\ = \boldsymbol{\theta}\_{2} - \mathbf{p}^{2} - \beta \mathbf{p}^{1} - \beta^{2} \mathbf{p}^{0}$ |
$$\vdots$$ | $$\vdots$$ |
$\boldsymbol{\theta}\_{i+1} = \boldsymbol{\theta}\_{i} - \sum\limits\_{j=0}^{i}\beta^{j}\mathbf{p}\_{i-j}$ | $\boldsymbol{\theta}\_{i+1} = \boldsymbol{\theta}\_{i} - \sum\limits\_{j=0}^{i}\beta^{j}\mathbf{p}^{i-j}$ |
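As a quick numerical sanity check (again under an assumed toy loss $L(\theta) = \theta^{2}$), the recursive momentum algorithm and the expanded sum in the last row of the table produce identical iterates:

```python
import numpy as np

def grad_L(theta):
    return 2.0 * theta                 # assumed toy loss L(theta) = theta^2

alpha, beta = 0.1, 0.9
theta_rec = theta_sum = 5.0            # both forms start from the same point
p = 0.0
grads = []                             # p_i = alpha * grad L_i for the expanded form

for i in range(10):
    # recursive form (Algorithm: Momentum Method)
    p = beta * p - alpha * grad_L(theta_rec)
    theta_rec = theta_rec + p
    # expanded form: theta_{i+1} = theta_i - sum_j beta^j p_{i-j}
    grads.append(alpha * grad_L(theta_sum))
    theta_sum = theta_sum - sum(beta**j * grads[i - j] for j in range(i + 1))

print(np.isclose(theta_rec, theta_sum))   # True
```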