Gradient Descent Learning of Linear Regression Models in Machine Learning
Overview
This article introduces gradient descent, one of the methods for training a linear regression model.
Description
Let the data set be $X = \left\{ \mathbf{x}_{i} \right\}_{i=1}^{N}$ and the label set be $Y = \left\{ y_{i} \right\}_{i=1}^{N}$, and consider the following linear regression model.
$$ \hat{y} = \sum\limits_{j=0}^{n} w_{j}x_{j} = \mathbf{w}^{T} \mathbf{x} $$
Here $\mathbf{x} = \begin{bmatrix} x_{0} & \dots & x_{n} \end{bmatrix}^{T}$ and $\mathbf{w} = \begin{bmatrix} w_{0} & \dots & w_{n} \end{bmatrix}^{T}$, typically with $x_{0} = 1$ so that $w_{0}$ serves as the bias (intercept) term. Gradient descent is a learning method that exploits the fact that the gradient of a function points in the direction in which the function value increases the most. Let’s set the loss function $J$ to half the sum of squared errors, as below.
$$ J(\mathbf{w}) = \dfrac{1}{2} \sum\limits_{i=1}^{N} (y_{i} - \mathbf{w}^{T}\mathbf{x}_{i})^{2} $$
The constant $\frac{1}{2}$ in front is there to cancel the $2$ that appears when differentiating $J$; minimizing a function and minimizing half of it give the same minimizer, so nothing is lost. In other words, the goal is to find the following optimal solution $\mathbf{w}_{\ast}$ given $X$ and $Y$.
$$ \mathbf{w}_{\ast} = \argmin\limits_{\mathbf{w}} \left( J(\mathbf{w}) = \dfrac{1}{2} \sum\limits_{i=1}^{N} (y_{i} - \mathbf{w}^{T}\mathbf{x}_{i})^{2} \right) $$
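Before deriving the update rule, the loss $J(\mathbf{w})$ can be evaluated directly from its definition. Below is a minimal pure-Python sketch; the toy data set (generated from $y = 1 + 2x$) and the function names are illustrative, not from the original:

```python
# Loss J(w) = 1/2 * sum_i (y_i - w^T x_i)^2 for a toy data set.
# Each x_i includes the constant feature x_0 = 1 for the bias weight w_0.

def predict(w, x):
    """Linear model: y_hat = w^T x."""
    return sum(wj * xj for wj, xj in zip(w, x))

def loss(w, X, Y):
    """Half the sum of squared residuals over the data set."""
    return 0.5 * sum((y - predict(w, x)) ** 2 for x, y in zip(X, Y))

X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]  # x_0 = 1, one real feature
Y = [1.0, 3.0, 5.0]                        # generated by y = 1 + 2x
print(loss([1.0, 2.0], X, Y))  # perfect fit: 0.0
print(loss([0.0, 0.0], X, Y))  # 0.5 * (1 + 9 + 25) = 17.5
```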
Notation
In mathematics, physics, and engineering, the gradient of a scalar function is denoted as follows.
$$ \nabla J (\mathbf{w}) = \begin{bmatrix} \dfrac{\partial J(\mathbf{w})}{\partial w_{0}} & \dots & \dfrac{\partial J(\mathbf{w})}{\partial w_{n}} \end{bmatrix} $$
In machine learning, the following notation is also commonly used.
$$ \dfrac{\partial J(\mathbf{w})}{\partial \mathbf{w}} = \begin{bmatrix} \dfrac{\partial J(\mathbf{w})}{\partial w_{0}} & \dots & \dfrac{\partial J(\mathbf{w})}{\partial w_{n}} \end{bmatrix} $$
In other words, $\nabla J = \dfrac{\partial J}{\partial \mathbf{w}}$. In this article, we will use $\nabla J$ to denote the gradient.
Algorithm
Due to the nature of the gradient, $\nabla J$ points in the direction in which $J$ increases the most; conversely, $-\nabla J$ points in the direction in which $J$ decreases the most. Since our goal is to reduce the value of $J$, we move the weight vector $\mathbf{w}$ in the direction of $-\nabla J$. In other words, we update as follows.
$$ \mathbf{w} - \alpha \nabla J \to \mathbf{w} $$
Here $\alpha$ is the learning rate. Computing $\nabla J$ component by component goes as follows. Write $\mathbf{x}_{i} = \begin{bmatrix} x_{i0} & \dots & x_{in} \end{bmatrix}$. Since $\dfrac{\partial \mathbf{w}^{T}\mathbf{x}_{i}}{\partial w_{j}} = \dfrac{\partial \sum_{k}w_{k}x_{ik}}{\partial w_{j}} = x_{ij}$, the chain rule gives $\dfrac{\partial}{\partial w_{j}}(y_{i} - \mathbf{w}^{T}\mathbf{x}_{i})^{2} = -2(y_{i} - \mathbf{w}^{T}\mathbf{x}_{i})x_{ij}$, and therefore
$$ \begin{align*} \nabla J &= \begin{bmatrix} \dfrac{1}{2}\sum\limits_{i=1}^{N} \dfrac{\partial}{\partial w_{0}}(y_{i} - \mathbf{w}^{T}\mathbf{x}_{i})^{2} & \cdots & \dfrac{1}{2}\sum\limits_{i=1}^{N} \dfrac{\partial}{\partial w_{n}}(y_{i} - \mathbf{w}^{T}\mathbf{x}_{i})^{2} \end{bmatrix} \\ &= \begin{bmatrix} -\sum\limits_{i=1}^{N} (y_{i} - \mathbf{w}^{T}\mathbf{x}_{i})x_{i0} & \cdots & -\sum\limits_{i=1}^{N} (y_{i} - \mathbf{w}^{T}\mathbf{x}_{i})x_{in} \end{bmatrix} \\ &= -\sum\limits_{i=1}^{N} (y_{i} - \mathbf{w}^{T}\mathbf{x}_{i}) \begin{bmatrix} x_{i0} & \cdots & x_{in} \end{bmatrix} \\ &= -\sum\limits_{i=1}^{N} (y_{i} - \mathbf{w}^{T}\mathbf{x}_{i}) \mathbf{x}_{i} \end{align*} $$
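One way to gain confidence in a derivation like this is to check the closed-form gradient against finite differences. The pure-Python sketch below is illustrative (toy data and names are not from the original); it implements $\nabla J$ with the minus sign produced by the chain rule, $\partial J / \partial w_{j} = -\sum_{i}(y_{i} - \mathbf{w}^{T}\mathbf{x}_{i})x_{ij}$, and compares it with a central-difference approximation:

```python
# Gradient of J(w) = 1/2 * sum_i (y_i - w^T x_i)^2, checked numerically.

def predict(w, x):
    """Linear model prediction w^T x."""
    return sum(wj * xj for wj, xj in zip(w, x))

def loss(w, X, Y):
    """Half the sum of squared residuals."""
    return 0.5 * sum((y - predict(w, x)) ** 2 for x, y in zip(X, Y))

def gradient(w, X, Y):
    """Closed form: grad_j = -sum_i (y_i - w^T x_i) * x_ij."""
    g = [0.0] * len(w)
    for x, y in zip(X, Y):
        r = y - predict(w, x)  # residual for this example
        for j, xj in enumerate(x):
            g[j] -= r * xj     # note the minus sign
    return g

def numeric_gradient(w, X, Y, h=1e-6):
    """Central differences: (J(w + h e_j) - J(w - h e_j)) / 2h."""
    g = []
    for j in range(len(w)):
        wp, wm = list(w), list(w)
        wp[j] += h
        wm[j] -= h
        g.append((loss(wp, X, Y) - loss(wm, X, Y)) / (2 * h))
    return g

X = [[1.0, 0.5], [1.0, -1.0], [1.0, 2.0]]  # x_0 = 1 is the bias feature
Y = [2.0, -0.5, 4.5]
w = [0.3, -0.7]
print(gradient(w, X, Y))
print(numeric_gradient(w, X, Y))  # should closely agree with the line above
```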
Therefore, since $\nabla J = -\sum_{i}(y_{i} - \mathbf{w}^{T}\mathbf{x}_{i})\mathbf{x}_{i}$, the minus sign in $-\alpha \nabla J$ cancels it, and the concrete formula to update $\mathbf{w}$ is as follows.
$$ \mathbf{w} + \alpha \sum\limits_{i}(y_{i} - \mathbf{w}^{T}\mathbf{x}_{i}) \mathbf{x}_{i} \to \mathbf{w} $$
This is called the Widrow-Hoff, or LMS (least mean squares), algorithm. The formula shows that $\mathbf{w}$ is updated strongly when the error $y_{i} - \mathbf{w}^{T}\mathbf{x}_{i}$ is large and only slightly when it is small.
Update Method
There are two main methods of updating the weights. With an appropriately chosen learning rate $\alpha$ (small enough, or decreasing over time in the online case), both are known to converge to the optimal solution $\mathbf{w}_{\ast}$.
Batch Learning
Batch learning refers to modifying the weights all at once, using the error accumulated over the entire data set $X$, exactly as derived above.
$$ \begin{align*} &\text{Repeat until convergence: }\\ &\quad \mathbf{w} + \alpha \sum\limits_{i}(y_{i} - \mathbf{w}^{T}\mathbf{x}_{i}) \mathbf{x}_{i} \to \mathbf{w} \end{align*} $$
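The batch rule can be sketched in a few lines of pure Python; the learning rate, step count, and toy data below are illustrative choices, not values from the original:

```python
# Batch gradient descent for linear regression:
# w <- w + alpha * sum_i (y_i - w^T x_i) * x_i  (whole data set per step)

def predict(w, x):
    """Linear model prediction w^T x."""
    return sum(wj * xj for wj, xj in zip(w, x))

def batch_gd(X, Y, alpha=0.05, steps=2000):
    w = [0.0] * len(X[0])
    for _ in range(steps):
        # Accumulate the full-data-set update before touching w.
        delta = [0.0] * len(w)
        for x, y in zip(X, Y):
            r = y - predict(w, x)  # error on example i
            for j, xj in enumerate(x):
                delta[j] += r * xj
        w = [wj + alpha * dj for wj, dj in zip(w, delta)]
    return w

X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]  # x_0 = 1 is the bias feature
Y = [1.0, 3.0, 5.0]                        # y = 1 + 2x exactly
w = batch_gd(X, Y)
print(w)  # approaches [1.0, 2.0]
```

Because the whole sum $\sum_{i}(y_{i} - \mathbf{w}^{T}\mathbf{x}_{i})\mathbf{x}_{i}$ is accumulated before $\mathbf{w}$ changes, every example within one pass is evaluated against the same weights.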
Online Learning
Online learning refers to modifying the weights using the error of each individual example $(\mathbf{x}_{i}, y_{i})$.
$$ \begin{align*} &\text{Repeat until convergence: } \\ &\quad \text{For } i = 1 \text{ to } N: \\ &\qquad \mathbf{w} + \alpha (y_{i} - \mathbf{w}^{T}\mathbf{x}_{i}) \mathbf{x}_{i} \to \mathbf{w} \end{align*} $$
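Correspondingly, the online (per-example) rule might look like the following pure-Python sketch, again with illustrative hyperparameters and toy data:

```python
# Online (LMS) learning: update w after every single example.
# w <- w + alpha * (y_i - w^T x_i) * x_i

def predict(w, x):
    """Linear model prediction w^T x."""
    return sum(wj * xj for wj, xj in zip(w, x))

def online_lms(X, Y, alpha=0.05, epochs=500):
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        for x, y in zip(X, Y):
            r = y - predict(w, x)  # error on this one example
            w = [wj + alpha * r * xj for wj, xj in zip(w, x)]
    return w

X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]  # x_0 = 1 is the bias feature
Y = [1.0, 3.0, 5.0]                        # y = 1 + 2x, exact solution [1, 2]
w = online_lms(X, Y)
print(w)  # approaches [1.0, 2.0]
```

Unlike the batch version, $\mathbf{w}$ changes after every single example, so later examples within the same pass already see the updated weights.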