
Gradient Descent Learning of Linear Regression Models in Machine Learning

Overview¹

This article introduces gradient descent, one of the methods for training a linear regression model.

Description

Let the data set be $X = \left\{ \mathbf{x}_{i} \right\}_{i=1}^{N}$ and the label set be $Y = \left\{ y_{i} \right\}_{i=1}^{N}$, and consider the following linear regression model.

$$ \hat{y} = \sum\limits_{j=0}^{n} w_{j}x_{j} = \mathbf{w}^{T} \mathbf{x} $$

Here, $\mathbf{x} = \begin{bmatrix} x_{0} & \dots & x_{n} \end{bmatrix}^{T}$ and $\mathbf{w} = \begin{bmatrix} w_{0} & \dots & w_{n} \end{bmatrix}^{T}$. Gradient descent is a learning method that exploits the fact that the gradient of a function points in the direction in which the function value increases most rapidly. Let the loss function $J$ be the MSE-style squared-error loss below.

$$ J(\mathbf{w}) = \dfrac{1}{2} \sum\limits_{i=1}^{N} (y_{i} - \mathbf{w}^{T}\mathbf{x}_{i})^{2} $$

The constant $\frac{1}{2}$ in front is there to cancel the $2$ that appears when differentiating $J$; after all, minimizing $J$ and minimizing $\frac{1}{2}J$ are the same thing. In other words, given $X$ and $Y$, the goal is to find the following optimal solution $\mathbf{w}_{\ast}$.

$$ \mathbf{w}_{\ast} = \argmin\limits_{\mathbf{w}} \left( J(\mathbf{w}) = \dfrac{1}{2} \sum\limits_{i=1}^{N} (y_{i} - \mathbf{w}^{T}\mathbf{x}_{i})^{2} \right) $$
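As a concrete illustration, below is a minimal NumPy sketch of the model and this loss. The helper names `predict` and `loss`, and the convention that each row of `X` is one input vector with the bias component $x_{0} = 1$ already included, are assumptions made for this example rather than anything specified in the text.

```python
import numpy as np

def predict(w, X):
    # Model output y_hat_i = w^T x_i for every row x_i of X;
    # X has shape (N, n+1), with a leading column of ones playing the role of x_0.
    return X @ w

def loss(w, X, y):
    # J(w) = (1/2) * sum_i (y_i - w^T x_i)^2
    residual = y - predict(w, X)
    return 0.5 * np.sum(residual ** 2)
```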

Notation

In mathematics, physics, and engineering, the gradient of a scalar function is denoted as follows.

$$ \nabla J (\mathbf{w}) = \begin{bmatrix} \dfrac{\partial J(\mathbf{w})}{\partial w_{0}} & \dots & \dfrac{\partial J(\mathbf{w})}{\partial w_{n}} \end{bmatrix} $$

In machine learning, the following notation is also commonly used.

$$ \dfrac{\partial J(\mathbf{w})}{\partial \mathbf{w}} = \begin{bmatrix} \dfrac{\partial J(\mathbf{w})}{\partial w_{0}} & \dots & \dfrac{\partial J(\mathbf{w})}{\partial w_{n}} \end{bmatrix} $$

In other words, $\nabla J = \dfrac{\partial J}{\partial \mathbf{w}}$. In this article, we will use $\nabla J$ for the notation of the gradient.

Algorithm

Due to the nature of the gradient, $\nabla J$ points in the direction where $J$ increases the most; conversely, $-\nabla J$ points in the direction where $J$ decreases the most. Since our goal is to reduce the value of $J$, we move the weights $\mathbf{w}$ in the direction of $-\nabla J$. In other words, we update as follows.

$$ \mathbf{w} - \alpha \nabla J \to \mathbf{w} $$

Here $\alpha$ is the learning rate. Computing $\nabla J$ explicitly gives the following. Write $\mathbf{x}_{i} = \begin{bmatrix} x_{i0} & \dots & x_{in} \end{bmatrix}$. Since $\dfrac{\partial \mathbf{w}^{T}\mathbf{x}_{i}}{\partial w_{j}} = \dfrac{\partial \sum_{k}w_{k}x_{ik}}{\partial w_{j}} = x_{ij}$,

$$ \begin{align*} \nabla J &= \begin{bmatrix} \dfrac{1}{2}\sum\limits_{i=1}^{N} \dfrac{\partial}{\partial w_{0}}(y_{i} - \mathbf{w}^{T}\mathbf{x}_{i})^{2} & \cdots & \dfrac{1}{2}\sum\limits_{i=1}^{N} \dfrac{\partial}{\partial w_{n}}(y_{i} - \mathbf{w}^{T}\mathbf{x}_{i})^{2} \end{bmatrix} \\ &= \begin{bmatrix} -\sum\limits_{i=1}^{N} (y_{i} - \mathbf{w}^{T}\mathbf{x}_{i})x_{i0} & \cdots & -\sum\limits_{i=1}^{N} (y_{i} - \mathbf{w}^{T}\mathbf{x}_{i})x_{in} \end{bmatrix} \\ &= -\sum\limits_{i=1}^{N} (y_{i} - \mathbf{w}^{T}\mathbf{x}_{i}) \begin{bmatrix} x_{i0} & \cdots & x_{in} \end{bmatrix} \\ &= -\sum\limits_{i=1}^{N} (y_{i} - \mathbf{w}^{T}\mathbf{x}_{i}) \mathbf{x}_{i} \end{align*} $$
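As a sanity check on this derivation, the sketch below (continuing the earlier example and reusing the hypothetical `loss` helper) computes the closed-form gradient and compares it against a finite-difference approximation.

```python
def grad_J(w, X, y):
    # Closed form derived above: grad J = -sum_i (y_i - w^T x_i) x_i
    residual = y - X @ w          # shape (N,)
    return -(X.T @ residual)      # shape (n+1,)

def grad_J_numeric(w, X, y, h=1e-6):
    # Finite-difference approximation of each partial derivative dJ/dw_j.
    g = np.zeros_like(w, dtype=float)
    for j in range(w.size):
        e = np.zeros_like(w, dtype=float)
        e[j] = h
        g[j] = (loss(w + e, X, y) - loss(w - e, X, y)) / (2 * h)
    return g

# The two should agree up to numerical error:
# np.allclose(grad_J(w, X, y), grad_J_numeric(w, X, y))
```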

Therefore, substituting this gradient into $\mathbf{w} - \alpha \nabla J \to \mathbf{w}$, the two minus signs cancel and the concrete update formula becomes the following.

$$ \mathbf{w} + \alpha \sum\limits_{i}(y_{i} - \mathbf{w}^{T}\mathbf{x}_{i}) \mathbf{x}_{i} \to \mathbf{w} $$

This is called the Widrow-Hoff rule or the LMS (least mean squares) algorithm. From the formula, you can see that when the error $y_{i} - \mathbf{w}^{T}\mathbf{x}_{i}$ is large, $\mathbf{w}$ receives a large update, and when the error is small, it receives only a small one.

Update Methods²

There are two main ways to update the weights. Both are known to converge to the optimal solution for an appropriately chosen learning rate $\alpha$.

Batch Learning

Batch learning refers to updating the weights all at once using the error accumulated over the entire data set $X$, exactly as derived above.

$$ \begin{align*} &\text{Repeat until convergence: }\\ &\quad \mathbf{w} + \alpha \sum\limits_{i}(y_{i} - \mathbf{w}^{T}\mathbf{x}_{i}) \mathbf{x}_{i} \to \mathbf{w} \end{align*} $$
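A minimal sketch of batch learning, assuming the helpers from the earlier snippets; the iteration cap and the norm-of-step stopping criterion are stand-ins for a proper convergence test.

```python
def fit_batch(X, y, alpha=1e-3, n_iters=10_000, tol=1e-8):
    # Batch gradient descent: one weight update per full pass over the data set.
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        residual = y - X @ w                # errors over the entire data set
        step = alpha * (X.T @ residual)     # alpha * sum_i (y_i - w^T x_i) x_i
        w = w + step                        # w <- w - alpha * grad J
        if np.linalg.norm(step) < tol:      # crude "until convergence" check
            break
    return w
```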

Online Learning

Online learning refers to updating the weights using the error of each individual data point $\mathbf{x}_{i}$.

$$ \begin{align*} &\text{Repeat until convergence: } \\ &\quad \text{For } i = 1 \text{ to } N: \\ &\qquad \mathbf{w} + \alpha (y_{i} - \mathbf{w}^{T}\mathbf{x}_{i}) \mathbf{x}_{i} \to \mathbf{w} \end{align*} $$
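And a corresponding sketch of online learning, sweeping the examples one at a time as in the pseudocode above; the number of passes (epochs) over the data is a hypothetical parameter used here in place of an explicit convergence test.

```python
def fit_online(X, y, alpha=1e-3, n_epochs=100):
    # Online (LMS) learning: update the weights after every single example.
    w = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        for x_i, y_i in zip(X, y):
            error = y_i - w @ x_i           # scalar error for one data point
            w = w + alpha * error * x_i     # w <- w + alpha * (y_i - w^T x_i) x_i
    return w
```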


  1. Simon Haykin, Neural Networks and Learning Machines (3rd Edition, 2009), pp. 91–96

  2. Simon Haykin, Neural Networks and Learning Machines (3rd Edition, 2009), p. 127