
Momentum Method in Gradient Descent

Overview¹ ²

In gradient descent, the momentum technique involves using all previous gradients when updating parameters. This is its essence and the end of the explanation. However, explanations involving strange update equations, the motivation from physics' momentum, and conventions like setting the mass to $1$ and the initial velocity to $0$ only complicate understanding. This article explains the momentum technique as plainly as possible.

Build-Up

Let's denote the parameters by $\boldsymbol{\theta}$ and the loss function by $L$. The standard gradient descent method updates the parameters iteratively as follows:

$$ \boldsymbol{\theta}_{i+1} = \boldsymbol{\theta}_{i} - \alpha \nabla L_{i} $$

Here, $L_{i} = L(\boldsymbol{\theta}_{i})$ denotes the loss function evaluated in the $i$th iteration. The momentum technique simply adds the gradients computed in the previous iterations, $\nabla L_{i-1}, \dots, \nabla L_{0}$:

$$ \boldsymbol{\theta}_{i+1} = \boldsymbol{\theta}_{i} - \alpha \nabla L_{i} - \alpha \nabla L_{i-1} - \cdots - \alpha \nabla L_{0} $$

To reduce the influence of older gradients as iterations progress and to keep the accumulated sum from diverging, a coefficient $\beta \in (0,1)$ is introduced as follows:

$$
\begin{align}
\boldsymbol{\theta}_{i+1} &= \boldsymbol{\theta}_{i} - \alpha \nabla L_{i} - \beta \alpha \nabla L_{i-1} - \cdots - \beta^{i}\alpha \nabla L_{0} \nonumber \\
&= \boldsymbol{\theta}_{i} - \alpha\sum_{j=0}^{i} \beta^{j} \nabla L_{i-j} \tag{1}
\end{align}
$$

Definition

Updating the parameters as in $(1)$ is called the momentum method, and the added term $\alpha\sum\limits_{j=0}^{i} \beta^{j} \nabla L_{i-j}$ is referred to as the momentum.
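In practice, the sum in $(1)$ need not be recomputed from scratch at every iteration. Writing $\mathbf{m}_{i} = \sum\limits_{j=0}^{i} \beta^{j} \nabla L_{i-j}$ for the accumulated gradient (a notation introduced here only for illustration), it satisfies a one-step recursion:

$$
\mathbf{m}_{i} = \nabla L_{i} + \beta \sum_{j=1}^{i} \beta^{j-1} \nabla L_{i-j} = \nabla L_{i} + \beta \mathbf{m}_{i-1},
\qquad
\boldsymbol{\theta}_{i+1} = \boldsymbol{\theta}_{i} - \alpha \mathbf{m}_{i}
$$

so only the previous accumulated value needs to be stored. This is exactly the two-line update in the Algorithm section below, with $\mathbf{p} = -\alpha \mathbf{m}$.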

Explanation

According to the definition, the momentum method is a generalized form of gradient descent. In fact, gradient descent can be seen as the special case of the momentum method where $\beta = 0$. The closer $\beta$ is to $1$, the more strongly previous gradients are reflected; the closer it is to $0$, the less they matter.

Gradient descent updates the parameters in the direction of steepest descent at the current point and is thus a greedy algorithm. The momentum method mitigates this greediness, enabling choices that may not be the best in the short term but are more beneficial in the long run. It also helps prevent sudden, excessive changes in the direction of the update.

Since the accumulated update is typically larger in magnitude than a single gradient step (as long as successive gradients roughly agree in direction), the momentum method often converges faster than standard gradient descent. Empirically, it is also known to escape local minima more effectively, much as a ball rolling down a hill can overcome small bumps if it has enough speed.

It is important to note that there is no absolute superiority among these optimizers, including adaptive learning rate techniques. The optimal optimizer varies by field and task, so it’s unwise to ask or judge which is the best. It’s helpful to learn what is commonly used in your field, or if unsure, to try SGD+momentum or Adam.
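For reference, here is a minimal sketch of how this looks in practice with PyTorch; the model, data, and hyperparameters below are placeholders. `torch.optim.SGD` exposes momentum through its `momentum` and `nesterov` arguments, and `torch.optim.Adam` is the usual adaptive alternative. Note that libraries may fold the learning rate into the momentum buffer slightly differently from $(1)$, but the behavior is analogous.

```python
import torch

model = torch.nn.Linear(10, 1)                   # placeholder model
x, y = torch.randn(32, 10), torch.randn(32, 1)   # placeholder batch

# SGD + momentum; set nesterov=True for Nesterov momentum
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, nesterov=True)
# adaptive alternative: optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(100):
    optimizer.zero_grad()                                 # clear old gradients
    loss = torch.nn.functional.mse_loss(model(x), y)      # compute the loss
    loss.backward()                                       # compute gradients
    optimizer.step()                                      # momentum update of the parameters
```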

Nesterov Momentum

To summarize the momentum method: to obtain the next parameter $\boldsymbol{\theta}_{i+1}$, the current gradient $\alpha \nabla L(\boldsymbol{\theta}_{i})$ is accumulated into the momentum, which is then added to the current parameter $\boldsymbol{\theta}_{i}$.

Nesterov momentum, or Nesterov accelerated gradient (NAG), instead computes the gradient at "the current parameter shifted by the decayed previous momentum" and adds the resulting momentum to the current parameter to obtain the next one. This might sound complicated, but if you understand the momentum method, the following algorithms should make Nesterov momentum easier to grasp.

Algorithm

Let's denote the momentum term by $\mathbf{p}$.

Algorithm: Momentum Method

Input: learning rate $\alpha$, momentum parameter $\beta$, epochs $N$

1. Initialize parameters $\boldsymbol{\theta}$ to random values.
2. Initialize momentum $\mathbf{p} = \mathbf{0}$.
3. for $k = 1, \cdots, N$ do
4. $\quad \mathbf{p} \leftarrow \beta \mathbf{p} - \alpha \nabla L(\boldsymbol{\theta})$ # update momentum with the computed gradient
5. $\quad \boldsymbol{\theta} \leftarrow \boldsymbol{\theta} + \mathbf{p}$ # update parameters
6. end for

Algorithm: Nesterov Momentum

Input: learning rate $\alpha$, momentum parameter $\beta$, epochs $N$

1. Initialize parameters $\boldsymbol{\theta}$ to random values.
2. Initialize momentum $\mathbf{p} = \mathbf{0}$.
3. for $k = 1, \cdots, N$ do
4. $\quad \mathbf{p} \leftarrow \beta \mathbf{p} - \alpha \nabla L(\boldsymbol{\theta} + \beta \mathbf{p})$ # update momentum with the gradient at the look-ahead point
5. $\quad \boldsymbol{\theta} \leftarrow \boldsymbol{\theta} + \mathbf{p}$ # update parameters
6. end for
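As a concrete illustration, here is a minimal NumPy sketch of the two loops above; the function names, the toy quadratic loss, and the default hyperparameters are illustrative choices, not part of the original algorithms.

```python
import numpy as np

def momentum_descent(grad_L, theta, alpha=0.1, beta=0.9, n_epochs=100):
    """Momentum method: p <- beta*p - alpha*grad L(theta); theta <- theta + p."""
    p = np.zeros_like(theta)                     # initialize momentum to 0
    for _ in range(n_epochs):
        p = beta * p - alpha * grad_L(theta)     # update momentum with the computed gradient
        theta = theta + p                        # update parameters
    return theta                                 # note: beta=0 recovers plain gradient descent

def nesterov_descent(grad_L, theta, alpha=0.1, beta=0.9, n_epochs=100):
    """Nesterov momentum: the gradient is evaluated at the look-ahead point theta + beta*p."""
    p = np.zeros_like(theta)
    for _ in range(n_epochs):
        p = beta * p - alpha * grad_L(theta + beta * p)  # gradient at theta + beta*p
        theta = theta + p
    return theta

# Toy usage: L(theta) = 0.5 * ||theta||^2, so grad L(theta) = theta; the minimum is the origin.
theta0 = np.array([5.0, -3.0])
print(momentum_descent(lambda th: th, theta0))   # should approach [0, 0]
print(nesterov_descent(lambda th: th, theta0))   # should approach [0, 0]
```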

Looking at the first few iterations of both methods, we can write them out as follows. If we denote $\mathbf{p}_{i} = \alpha \nabla L_{i}$ and $\mathbf{p}^{i} = \alpha \nabla L(\boldsymbol{\theta}_{i} - \beta^{1}\mathbf{p}^{i-1} - \beta^{2}\mathbf{p}^{i-2} - \cdots - \beta^{i}\mathbf{p}^{0})$ (with $\mathbf{p}^{0} = \mathbf{p}_{0}$), then:

| Momentum | Nesterov Momentum |
| --- | --- |
| $\boldsymbol{\theta}_{0} =$ initial | $\boldsymbol{\theta}_{0} =$ initial |
| $\boldsymbol{\theta}_{1} = \boldsymbol{\theta}_{0} - \alpha \nabla L_{0} = \boldsymbol{\theta}_{0} - \mathbf{p}_{0}$ | $\boldsymbol{\theta}_{1} = \boldsymbol{\theta}_{0} - \alpha \nabla L_{0} = \boldsymbol{\theta}_{0} - \mathbf{p}^{0}$ |
| $\boldsymbol{\theta}_{2} = \boldsymbol{\theta}_{1} - \alpha\nabla L_{1} - \beta \mathbf{p}_{0} = \boldsymbol{\theta}_{1} - \mathbf{p}_{1} - \beta \mathbf{p}_{0}$ | $\boldsymbol{\theta}_{2} = \boldsymbol{\theta}_{1} - \alpha \nabla L(\boldsymbol{\theta}_{1} - \beta \mathbf{p}^{0}) - \beta \mathbf{p}^{0} = \boldsymbol{\theta}_{1} - \mathbf{p}^{1} - \beta \mathbf{p}^{0}$ |
| $\boldsymbol{\theta}_{3} = \boldsymbol{\theta}_{2} - \mathbf{p}_{2} - \beta \mathbf{p}_{1} - \beta^{2} \mathbf{p}_{0} = \boldsymbol{\theta}_{2} - \sum\limits_{j=0}^{2}\beta^{j}\mathbf{p}_{2-j}$ | $\boldsymbol{\theta}_{3} = \boldsymbol{\theta}_{2} - \alpha \nabla L(\boldsymbol{\theta}_{2} - \beta \mathbf{p}^{1} - \beta^{2}\mathbf{p}^{0}) - \beta \mathbf{p}^{1} - \beta^{2}\mathbf{p}^{0} = \boldsymbol{\theta}_{2} - \mathbf{p}^{2} - \beta \mathbf{p}^{1} - \beta^{2} \mathbf{p}^{0}$ |
| $\vdots$ | $\vdots$ |
| $\boldsymbol{\theta}_{i+1} = \boldsymbol{\theta}_{i} - \sum\limits_{j=0}^{i}\beta^{j}\mathbf{p}_{i-j}$ | $\boldsymbol{\theta}_{i+1} = \boldsymbol{\theta}_{i} - \sum\limits_{j=0}^{i}\beta^{j}\mathbf{p}^{i-j}$ |
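The unrolled momentum column can be checked numerically against the recursive algorithm; the following sketch does so for a toy quadratic loss (the loss, step count, and hyperparameters are arbitrary illustrative choices).

```python
import numpy as np

# Toy quadratic loss L(theta) = 0.5 * ||theta||^2, so grad L(theta) = theta.
grad = lambda th: th
alpha, beta = 0.1, 0.9

# Recursive form (Algorithm: Momentum Method), storing theta_i and p_i = alpha * grad L_i.
theta, p = np.array([1.0, -2.0]), np.zeros(2)
thetas, ps = [theta.copy()], []
for _ in range(4):
    ps.append(alpha * grad(theta))           # p_i = alpha * grad L(theta_i)
    p = beta * p - alpha * grad(theta)       # momentum update
    theta = theta + p                        # parameter update
    thetas.append(theta.copy())

# Unrolled form from the table: theta_{i+1} = theta_i - sum_{j=0}^{i} beta^j p_{i-j}.
i = 3
unrolled = thetas[i] - sum(beta**j * ps[i - j] for j in range(i + 1))
print(np.allclose(thetas[i + 1], unrolled))  # expected: True
```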

  1. Ian Goodfellow, Deep Learning, Ch 8.3.2 Momentum, 8.3.3 Nesterov Momentum ↩︎

  2. O Ilseok, Machine Learning (2017), Ch 5.3 Momentum ↩︎