
Gradient Descent and Stochastic Gradient Descent in Machine Learning

Overview

The Gradient Descent Algorithm is the simplest among the algorithms that find a local minimum of the loss function by using its gradient.

Description

Here, the loss function $L$ is considered a function of the weights and biases, with the dataset $X$ held fixed. If the input data looks like $\mathbf{x} \in \mathbb{R}^{m}$, then $L$ becomes a function of $(w_{1}, w_{2}, \cdots, w_{m}, b) \in \mathbb{R}^{m+1}$. Even with the same data, the value of the loss function varies depending on the weights and biases, and a decrease in the loss function indicates that a better model has been obtained.
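As a purely illustrative sketch of this viewpoint, the following Python snippet treats the mean squared error of a linear model as a function of the weights and bias alone, with the dataset held fixed; the MSE loss, the toy data, and all names here are assumptions for the example, not something prescribed by the text above.

```python
import numpy as np

def make_loss(X, y):
    """Return L(w, b): the MSE of a linear model on the fixed dataset (X, y).

    X : (n, m) input data, y : (n,) targets. Once the data are fixed,
    L depends only on the m weights and the bias, i.e. on a point in R^{m+1}.
    """
    def L(w, b):
        predictions = X @ w + b                 # linear model on the fixed data
        return np.mean((predictions - y) ** 2)  # mean squared error
    return L

# Fixed toy dataset: the loss value now varies only with (w, b).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                   # each x lives in R^3
y = X @ np.array([1.0, -2.0, 0.5]) + 3.0        # generated by a known linear rule

L = make_loss(X, y)
print(L(np.zeros(3), 0.0))                      # loss at one choice of weights and bias
print(L(np.array([1.0, -2.0, 0.5]), 3.0))       # loss at the true parameters (~ 0)
```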

Gradient descent follows the manifold created by this function $L$ to find the optimal weights that result in a local minimum. To understand this principle in more detail, it is helpful to study gradient descent in numerical analysis.

Starting from an initially chosen vector of weights and biases $\mathbf{w}_{1} \in \mathbb{R}^{m+1}$, the value of the loss function can be reduced by moving to $$ \mathbf{w}_{2} := \mathbf{w}_{1} - \alpha \nabla L (\mathbf{w}_{1}) $$ for some appropriate positive number $\alpha$. Repeating this update, $$ \mathbf{w}_{n+1} := \mathbf{w}_{n} - \alpha \nabla L (\mathbf{w}_{n}), $$ further decreases the value of the loss function. This process of updating $\mathbf{w}_{n}$ is called Backpropagation. In machine learning, $\alpha$ is referred to as the Learning Rate, and depending on this value, gradient descent may or may not succeed.
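A minimal sketch of this update rule is shown below, repeating $\mathbf{w}_{n+1} := \mathbf{w}_{n} - \alpha \nabla L(\mathbf{w}_{n})$ on a toy quadratic loss. The numerical-gradient helper, the learning rate $0.1$, and the number of steps are all illustrative choices, not details taken from the discussion above.

```python
import numpy as np

def numerical_gradient(f, theta, eps=1e-6):
    """Central-difference approximation of the gradient of f at theta (illustrative helper)."""
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        step = np.zeros_like(theta)
        step[i] = eps
        grad[i] = (f(theta + step) - f(theta - step)) / (2 * eps)
    return grad

def gradient_descent(f, theta0, alpha=0.1, n_steps=1000):
    """Repeat theta_{n+1} := theta_n - alpha * grad f(theta_n)."""
    theta = theta0.astype(float)
    for _ in range(n_steps):
        theta = theta - alpha * numerical_gradient(f, theta)
    return theta

# Example: minimize a simple quadratic loss whose minimizer is (1, -2).
f = lambda theta: (theta[0] - 1.0) ** 2 + (theta[1] + 2.0) ** 2
print(gradient_descent(f, np.zeros(2)))   # approximately [ 1. -2.]
```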

(Figure: gradient descent converging to the minimum of $L$)

Success looks like the figure above: repeated updates accurately find the weights and biases at which $L$ attains a local minimum. In the picture, this point happens to be both a local and a global minimum, but in general a local minimum need not be the global minimum.

(Figure: overshooting, where the iterates fail to converge because $\alpha$ is too large)

If $\alpha$ is too large, the parameters change significantly at each step, which speeds up learning; but if it is excessively large, the iterates can fail to converge, as shown above. This is referred to as Overshooting.

(Figure: with $\alpha$ too small, progress toward the minimum is very slow)

Conversely, if $\alpha$ is too small, convergence is mathematically guaranteed, but each step changes the parameters so little that training takes too long; moreover, once the iterates fall near a local minimum, they cannot escape its vicinity.
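A one-dimensional worked example, chosen purely for illustration, makes the role of $\alpha$ concrete. For the loss $L(w) = w^{2}$, the gradient is $L^{\prime}(w) = 2w$, so the update rule becomes
$$ w_{n+1} = w_{n} - 2 \alpha w_{n} = (1 - 2\alpha) w_{n}, \qquad w_{n+1} = (1 - 2\alpha)^{n} w_{1}. $$
If $0 < \alpha < 1$, then $\left| 1 - 2\alpha \right| < 1$ and the iterates converge to the minimizer $0$. If $\alpha > 1$, then $\left| 1 - 2\alpha \right| > 1$ and the iterates diverge, which is exactly the overshooting described above. If $\alpha$ is very small, the factor is close to $1$ and convergence is correspondingly slow.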

This is the basic concept of gradient descent, and in practice, various techniques are utilized to mitigate such issues.

Stochastic Gradient Descent

Applying the gradient descent update after each mini-batch is called stochastic gradient descent, or SGD. Some sources draw a finer distinction: batch gradient descent computes the gradient over the entire dataset before each update, stochastic (online) gradient descent updates after every single sample, and mini-batch gradient descent updates after each mini-batch.

However, this distinction is practically useless. In deep learning, essentially only mini-batch learning is used, and setting the batch size to $1$ turns it into online learning. Thus, in actual deep learning practice, the interpretation “gradient descent = stochastic gradient descent = mini-batch gradient descent” is acceptable.
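The following sketch shows what a mini-batch SGD loop might look like for linear regression with an MSE loss; the batch size, learning rate, number of epochs, and the analytic gradient used here are illustrative assumptions, not details fixed by the text. Setting `batch_size=1` gives online learning, while `batch_size=len(X)` recovers full-batch gradient descent.

```python
import numpy as np

def sgd_linear_regression(X, y, alpha=0.05, batch_size=32, n_epochs=50, seed=0):
    """Mini-batch SGD for MSE linear regression: one update per mini-batch."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    w, b = np.zeros(m), 0.0
    for _ in range(n_epochs):
        order = rng.permutation(n)                 # reshuffle the data each epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            error = Xb @ w + b - yb                # residuals on the mini-batch
            grad_w = 2 * Xb.T @ error / len(idx)   # gradient of the mini-batch MSE w.r.t. w
            grad_b = 2 * error.mean()              # gradient of the mini-batch MSE w.r.t. b
            w -= alpha * grad_w                    # gradient descent step on the mini-batch
            b -= alpha * grad_b
    return w, b

# Toy data generated by a known linear rule; SGD should roughly recover it.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 3.0
print(sgd_linear_regression(X, y))   # approximately ([1., -2., 0.5], 3.0)
```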

The term “stochastic” does not need to be emphasized too heavily. If the entire dataset is viewed as a population, then learning on mini-batches amounts to repeatedly learning from random samples drawn from it, hence the term “stochastic” is appropriate.

See Also