Momentum Method in Gradient Descent
Overview
In gradient descent, the momentum technique amounts to using all previous gradients, not just the current one, when updating the parameters. That is the whole idea. However, explanations that start from odd-looking update equations, the physics analogy of momentum, and conventions like setting the mass to 1 and the initial velocity to 0 only complicate understanding. This article explains the momentum technique as plainly as possible.
Build-Up
Let’s denote the parameters as θ and the loss function as L. The standard gradient descent method updates the parameters iteratively as follows:
$$ \theta_{i+1} = \theta_{i} - \alpha \nabla L_{i} $$
Here, $L_{i} = L(\theta_{i})$ denotes the loss function evaluated at the $i$th iteration. The momentum technique simply adds $\nabla L_{i-1}, \dots, \nabla L_{0}$, the gradients of the loss function calculated in the previous iterations:
$$ \theta_{i+1} = \theta_{i} - \alpha \nabla L_{i} - \alpha \nabla L_{i-1} - \cdots - \alpha \nabla L_{0} $$
To reduce the influence of older gradients as iterations progress, and to prevent the sum of gradients from diverging, a coefficient $\beta \in (0, 1)$ is introduced as follows:
$$ \theta_{i+1} = \theta_{i} - \alpha \nabla L_{i} - \beta \alpha \nabla L_{i-1} - \cdots - \beta^{i} \alpha \nabla L_{0} = \theta_{i} - \alpha \sum_{j=0}^{i} \beta^{j} \nabla L_{i-j} \tag{1} $$
Definition
Updating the parameters as in $(1)$ is called the momentum method, and the added term $\alpha \sum_{j=0}^{i} \beta^{j} \nabla L_{i-j}$ is referred to as the momentum.
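To make the sum in $(1)$ concrete, here is a minimal Python sketch, assuming a scalar parameter and a toy loss $L(\theta) = \theta^{2}$ (both chosen purely for illustration, not part of the definition). It applies equation $(1)$ directly, keeping every past gradient:

```python
def momentum_step(theta, grads, alpha=0.1, beta=0.9):
    """Equation (1): theta_{i+1} = theta_i - alpha * sum_j beta^j * grad L_{i-j}."""
    i = len(grads) - 1
    return theta - alpha * sum(beta**j * grads[i - j] for j in range(i + 1))

# Toy example: L(theta) = theta^2, so grad L(theta) = 2*theta (illustration only).
theta, grads = 5.0, []
for _ in range(200):
    grads.append(2 * theta)              # grad L_i, evaluated at the current theta_i
    theta = momentum_step(theta, grads)  # theta_{i+1}
print(theta)                             # close to the minimizer theta = 0
```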
Explanation
According to the definition, the momentum method is a generalized form of gradient descent. In fact, gradient descent can be seen as a special case of the momentum method where β=0. The closer β is to 1, the more it reflects previous gradients, and vice versa when it’s closer to 0.
Gradient descent updates parameters in the direction of the steepest current gradient and is thus a greedy algorithm. The momentum method mitigates this greediness, enabling choices that may not be the best in the short term but are more beneficial in the long run. It also helps prevent sudden and excessive changes in gradient direction.
Obviously, since the magnitude of the update applied to the parameters is larger than in standard gradient descent, the momentum method has the advantage of faster convergence. Empirically, it is also known to escape local minima more effectively, similar to how a ball rolling down a hill can overcome small bumps if it has enough speed.
It is important to note that there is no absolute superiority among these optimizers, including adaptive learning rate techniques. The optimal optimizer varies by field and task, so it’s unwise to ask or judge which is the best. It’s helpful to learn what is commonly used in your field, or if unsure, to try SGD+momentum or Adam.
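For example, in PyTorch (used here only as a familiar example, not something the article prescribes), switching between the two is a one-line choice of optimizer:

```python
import torch

model = torch.nn.Linear(10, 1)   # any model; a stand-in for illustration

# SGD with momentum: the `momentum` argument plays the role of beta above
opt_sgd = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# Adam, the other common default choice
opt_adam = torch.optim.Adam(model.parameters(), lr=1e-3)
```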
Nesterov Momentum
To summarize the momentum method: to obtain the next parameter $\theta_{i+1}$, the gradient $\alpha \nabla L(\theta_{i})$ evaluated at the current parameter is accumulated together with the previous gradients and subtracted from the current parameter $\theta_{i}$.
Nesterov momentum, or Nesterov accelerated gradient (NAG), computes the gradient at “the current parameter plus the previous momentum” rather than at the current parameter itself, and accumulates it in the same way to obtain the next parameter. This might sound complicated, but if you understand the momentum method, the following algorithms should make Nesterov momentum easier to grasp.
Algorithm
Let’s denote the momentum term as p.
| Algorithm: Momentum Method | | |
| --- | --- | --- |
| Input | Learning rate $\alpha$, momentum parameter $\beta$, epochs $N$ | |
| 1. | Initialize parameters $\theta$ to random values. | |
| 2. | Initialize momentum $p = 0$. | |
| 3. | for $k = 1, \cdots, N$ do | |
| 4. | $p \leftarrow \beta p - \alpha \nabla L(\theta)$ | # Update momentum with the calculated gradient |
| 5. | $\theta \leftarrow \theta + p$ | # Update parameters |
| 6. | end for | |
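The table above translates almost line by line into code. The following is a minimal sketch, assuming a one-dimensional toy loss $L(\theta) = \theta^{2}$ with gradient `grad_L` (both are illustrative choices, not part of the algorithm):

```python
import random

def grad_L(theta):
    # Toy gradient: L(theta) = theta^2, so dL/dtheta = 2*theta (illustration only)
    return 2 * theta

def momentum_descent(alpha=0.1, beta=0.9, N=200):
    theta = random.uniform(-10, 10)              # 1. initialize parameters randomly
    p = 0.0                                      # 2. initialize momentum
    for _ in range(N):                           # 3. for k = 1, ..., N do
        p = beta * p - alpha * grad_L(theta)     # 4. update momentum with the calculated gradient
        theta = theta + p                        # 5. update parameters
    return theta                                 # 6. end for

print(momentum_descent())   # close to the minimizer theta = 0
```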
| Algorithm: Nesterov Momentum | | |
| --- | --- | --- |
| Input | Learning rate $\alpha$, momentum parameter $\beta$, epochs $N$ | |
| 1. | Initialize parameters $\theta$ to random values. | |
| 2. | Initialize momentum $p = 0$. | |
| 3. | for $k = 1, \cdots, N$ do | |
| 4. | $p \leftarrow \beta p - \alpha \nabla L(\theta + \beta p)$ | # Update momentum with the calculated gradient |
| 5. | $\theta \leftarrow \theta + p$ | # Update parameters |
| 6. | end for | |
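Only step 4 differs: the gradient is evaluated at the look-ahead point $\theta + \beta p$. Here is a sketch under the same toy assumptions as before (in PyTorch, this variant corresponds to passing `nesterov=True` to `torch.optim.SGD`):

```python
import random

def grad_L(theta):
    # Same toy gradient as above: L(theta) = theta^2 (illustration only)
    return 2 * theta

def nesterov_descent(alpha=0.1, beta=0.9, N=200):
    theta = random.uniform(-10, 10)                      # 1. initialize parameters randomly
    p = 0.0                                              # 2. initialize momentum
    for _ in range(N):                                   # 3. for k = 1, ..., N do
        p = beta * p - alpha * grad_L(theta + beta * p)  # 4. gradient at the look-ahead point
        theta = theta + p                                # 5. update parameters
    return theta                                         # 6. end for

print(nesterov_descent())   # close to the minimizer theta = 0
```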
Writing out the first few calculations of both methods gives the table below. If we simply denote $p_{i} = \alpha \nabla L_{i}$ for the momentum method and $\mathbf{p}_{i} = \alpha \nabla L\left( \theta_{i} - \beta \mathbf{p}_{i-1} - \beta^{2} \mathbf{p}_{i-2} - \cdots - \beta^{i} \mathbf{p}_{0} \right)$ for Nesterov momentum (with $\mathbf{p}_{0} = p_{0}$), then
| Momentum | Nesterov Momentum |
| --- | --- |
| $\theta_{0} =$ initial value | $\theta_{0} =$ initial value |
| $\theta_{1} = \theta_{0} - \alpha \nabla L_{0} = \theta_{0} - p_{0}$ | $\theta_{1} = \theta_{0} - \alpha \nabla L_{0} = \theta_{0} - \mathbf{p}_{0}$ |
| $\theta_{2} = \theta_{1} - \alpha \nabla L_{1} - \beta p_{0} = \theta_{1} - p_{1} - \beta p_{0}$ | $\theta_{2} = \theta_{1} - \alpha \nabla L(\theta_{1} - \beta \mathbf{p}_{0}) - \beta \mathbf{p}_{0} = \theta_{1} - \mathbf{p}_{1} - \beta \mathbf{p}_{0}$ |
| $\theta_{3} = \theta_{2} - p_{2} - \beta p_{1} - \beta^{2} p_{0} = \theta_{2} - \sum_{j=0}^{2} \beta^{j} p_{2-j}$ | $\theta_{3} = \theta_{2} - \alpha \nabla L(\theta_{2} - \beta \mathbf{p}_{1} - \beta^{2} \mathbf{p}_{0}) - \beta \mathbf{p}_{1} - \beta^{2} \mathbf{p}_{0} = \theta_{2} - \mathbf{p}_{2} - \beta \mathbf{p}_{1} - \beta^{2} \mathbf{p}_{0}$ |
| $\vdots$ | $\vdots$ |
| $\theta_{i+1} = \theta_{i} - \sum_{j=0}^{i} \beta^{j} p_{i-j}$ | $\theta_{i+1} = \theta_{i} - \sum_{j=0}^{i} \beta^{j} \mathbf{p}_{i-j}$ |
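As a quick sanity check, the recursive update of the momentum algorithm and the expanded sum in the last row of the table can be compared numerically. A minimal sketch, again assuming the toy loss $L(\theta) = \theta^{2}$:

```python
def grad_L(theta):
    return 2 * theta   # toy gradient: L(theta) = theta^2 (illustration only)

alpha, beta = 0.1, 0.9
theta_rec = theta_sum = 1.0   # start both forms from the same point
p, grads = 0.0, []

for i in range(20):
    # Recursive form (algorithm): p <- beta*p - alpha*grad,  theta <- theta + p
    p = beta * p - alpha * grad_L(theta_rec)
    theta_rec = theta_rec + p

    # Expanded form: theta_{i+1} = theta_i - alpha * sum_j beta^j * grad L_{i-j}
    grads.append(grad_L(theta_sum))
    theta_sum = theta_sum - alpha * sum(beta**j * grads[i - j] for j in range(i + 1))

print(theta_rec, theta_sum)   # the two values agree up to floating-point rounding
```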