Gradient Descent and Stochastic Gradient Descent in Machine Learning
Overview
Gradient descent is the simplest of the algorithms that find a local minimum of a loss function by using the gradient of that function.
Description
Here, the loss function $L$ is regarded as a function of the weights and biases, with the dataset $X$ held fixed. If the input data is $\mathbf{x} \in \mathbb{R}^{m}$, then $L$ becomes a function of $(w_{1} , w_{2} , \cdots , w_{m} , b) \in \mathbb{R}^{m+1}$. Even with the same data, the value of the loss function varies depending on the weights and biases, and a smaller loss indicates that a better model has been obtained.
Gradient descent follows the surface defined by this function $L$ to find the weights and biases that yield a local minimum. To understand this principle in more detail, it is helpful to study gradient descent in numerical analysis.
For an initial choice of weights and biases $\mathbf{w}_{1} \in \mathbb{R}^{m+1}$, the value of the loss function can be reduced by moving to $$ \mathbf{w}_{2} := \mathbf{w}_{1} - \alpha \nabla L (\mathbf{w}_{1} ) $$ for some appropriate positive number $\alpha$. Repeating this update, $$ \mathbf{w}_{n+1} := \mathbf{w}_{n} - \alpha \nabla L (\mathbf{w}_{n} ), $$ further decreases the value of the loss function. In neural network training, the computation of the gradient $\nabla L (\mathbf{w}_{n})$ needed for this update is called Backpropagation. In machine learning, $\alpha$ is referred to as the Learning Rate, and depending on this value, gradient descent may or may not succeed.
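To make the update rule concrete, here is a minimal sketch of gradient descent on a toy least-squares problem. Everything in it (the dataset, the `loss` and `grad` functions, the value of the learning rate) is hypothetical and chosen only for illustration; the loss is the mean squared error viewed as a function of $(w, b)$ with the data held fixed, exactly as described above.

```python
import numpy as np

# Toy fixed dataset (hypothetical values): L is a function of (w, b) with X, y held fixed.
X = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])   # generated by y = 2x + 1

def loss(w, b):
    """Mean squared error L(w, b) over the fixed dataset."""
    return np.mean((w * X + b - y) ** 2)

def grad(w, b):
    """Gradient of L with respect to (w, b), derived analytically."""
    err = w * X + b - y
    return np.array([2 * np.mean(err * X), 2 * np.mean(err)])

alpha = 0.1                       # learning rate
w_vec = np.array([0.0, 0.0])      # initial parameters (w_1 in the text)

for n in range(500):
    # w_{n+1} := w_n - alpha * grad L(w_n)
    w_vec = w_vec - alpha * grad(*w_vec)

print(w_vec, loss(*w_vec))        # approaches (2, 1) with near-zero loss
```

In a real framework the gradient would come from automatic differentiation (backpropagation) rather than a hand-derived formula, but the update itself is exactly the rule above.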
In the successful case depicted above, the repeated updates accurately find the weights and biases at which $L$ attains a local minimum. In the picture this point happens to be both a local and a global minimum, but in general a local minimum need not be the global minimum.
If $\alpha$ is large, the weights change by large amounts, which speeds up learning; but if it is excessively large, the iteration can fail to converge, as shown above. This is referred to as Overshooting.
Conversely, if $\alpha$ is too small, convergence is mathematically guaranteed, but the updates are so small that learning takes far too long, and once the iterate falls near a local minimum it cannot escape from its vicinity.
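The effect of the learning rate can be seen numerically on the one-dimensional toy loss $L(w) = w^{2}$, whose gradient is $2w$; the values of $\alpha$ below are illustrative only.

```python
def descend(alpha, steps=20, w=10.0):
    """Iterate w <- w - alpha * L'(w) for the toy loss L(w) = w**2, with L'(w) = 2w."""
    for _ in range(steps):
        w = w - alpha * 2 * w
    return w

print(descend(alpha=1.1))    # |1 - 2*alpha| > 1: overshoots and diverges
print(descend(alpha=0.5))    # well-chosen step: reaches the minimum w = 0 immediately
print(descend(alpha=0.001))  # converges in theory, but barely moves in 20 steps
```

Since each update multiplies $w$ by $(1 - 2\alpha)$, this particular iteration converges exactly when $|1 - 2\alpha| < 1$, i.e. $0 < \alpha < 1$, which is the overshooting condition made explicit for this toy loss.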
This is the basic concept of gradient descent, and in practice, various techniques are utilized to mitigate such issues.
Stochastic Gradient Descent
Applying gradient descent after each mini-batch is called Stochastic Gradient Descent, or SGD. Some sources distinguish the variants as follows:
- Learning on the full dataset at once (batch learning): Batch gradient descent
- Learning on mini-batches (mini-batch learning): Mini-batch gradient descent
- Learning one sample at a time (online learning): Stochastic gradient descent
However, this distinction is of little practical use. In deep learning, essentially only mini-batch learning is used, and setting the batch size to $1$ simply turns it into online learning. Thus, in actual deep learning practice, reading “gradient descent = stochastic gradient descent = mini-batch gradient descent” is an acceptable interpretation.
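Indeed, a single training loop can realize all three variants just by changing the batch size. The sketch below assumes the same toy least-squares setup as before; the function name `sgd` and its parameters are hypothetical and not taken from any library.

```python
import numpy as np

def sgd(X, y, alpha=0.05, batch_size=32, epochs=20, seed=0):
    """Mini-batch SGD for least squares. batch_size = len(X) gives batch gradient
    descent, batch_size = 1 gives online learning, and anything in between is the
    usual mini-batch gradient descent."""
    rng = np.random.default_rng(seed)
    w, b, n = 0.0, 0.0, len(X)
    for _ in range(epochs):
        idx = rng.permutation(n)                      # reshuffle each epoch
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            err = w * X[batch] + b - y[batch]
            w -= alpha * 2 * np.mean(err * X[batch])  # gradient of the mini-batch MSE
            b -= alpha * 2 * np.mean(err)
    return w, b

rng = np.random.default_rng(0)
X = rng.uniform(0, 3, size=200)
y = 2 * X + 1 + 0.1 * rng.normal(size=200)            # hypothetical noisy data, y ~ 2x + 1

print(sgd(X, y, batch_size=200))   # batch gradient descent
print(sgd(X, y, batch_size=16))    # mini-batch gradient descent
print(sgd(X, y, batch_size=1))     # online learning / SGD in the narrow sense
```

With a batch size of $1$ the gradient estimate is noisiest but the parameters are updated most often; with the full batch each update is exact but expensive, which is why mini-batches are the usual compromise.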
The term “stochastic” does not need to be emphasized too heavily. If the entire dataset is viewed as a population, then learning on mini-batches amounts to repeatedly learning from random samples drawn from that population, which is why the word “stochastic” is appropriate.