Gradient Descent Learning of Linear Regression Models in Machine Learning
Overview
This article introduces gradient descent, one of the methods for learning a linear regression model.
Description
Let the data set be $X = \{\mathbf{x}_i\}_{i=1}^{N}$ and the label set be $Y = \{y_i\}_{i=1}^{N}$, and consider the following linear regression model:

$$\hat{y} = \sum_{j=0}^{n} w_j x_j = \mathbf{w}^T\mathbf{x}$$

Here $\mathbf{x} = [x_0\ \dots\ x_n]^T$ and $\mathbf{w} = [w_0\ \dots\ w_n]^T$. Gradient descent is a learning method that exploits the fact that the gradient of a function points in the direction in which the function value increases most steeply. Let the loss function $J$ be the MSE:

$$J(\mathbf{w}) = \frac{1}{2}\sum_{i=1}^{N}\left(y_i - \mathbf{w}^T\mathbf{x}_i\right)^2$$

The constant $\frac{1}{2}$ in front is there to cancel the factor of $2$ that appears when differentiating $J$; after all, minimizing $J$ and minimizing $\frac{1}{2}J$ are equivalent. In other words, the goal is to find the following optimal solution $\mathbf{w}^*$ given $X$ and $Y$:

$$\mathbf{w}^* = \underset{\mathbf{w}}{\arg\min}\ J(\mathbf{w}) = \underset{\mathbf{w}}{\arg\min}\ \frac{1}{2}\sum_{i=1}^{N}\left(y_i - \mathbf{w}^T\mathbf{x}_i\right)^2$$
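To make these definitions concrete, here is a minimal NumPy sketch of the model $\hat{y} = \mathbf{w}^T\mathbf{x}$ and the loss $J(\mathbf{w})$ above. The synthetic data, the convention $x_0 = 1$ for a bias term, and the names `predict` and `mse_loss` are illustrative assumptions, not part of the original derivation.

```python
import numpy as np

# Hypothetical synthetic data set: N examples, n features, plus a bias column x_0 = 1.
rng = np.random.default_rng(0)
N, n = 100, 3
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, n))])   # shape (N, n + 1)
true_w = rng.normal(size=n + 1)
y = X @ true_w + 0.1 * rng.normal(size=N)                    # labels with a little noise

def predict(w, X):
    """y_hat = w^T x for every row x of X."""
    return X @ w

def mse_loss(w, X, y):
    """J(w) = 1/2 * sum_i (y_i - w^T x_i)^2."""
    residual = y - predict(w, X)
    return 0.5 * np.sum(residual ** 2)

print(mse_loss(np.zeros(n + 1), X, y))   # loss of the all-zero weight vector
```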
Notation
In mathematics, physics, and engineering, the gradient of a scalar function is denoted as follows.
$$\nabla J(\mathbf{w}) = \left[\frac{\partial J(\mathbf{w})}{\partial w_0}\ \dots\ \frac{\partial J(\mathbf{w})}{\partial w_n}\right]$$
In machine learning, the following notation is also commonly used.
$$\frac{\partial J(\mathbf{w})}{\partial \mathbf{w}} = \left[\frac{\partial J(\mathbf{w})}{\partial w_0}\ \dots\ \frac{\partial J(\mathbf{w})}{\partial w_n}\right]$$
In other words, $\nabla J = \frac{\partial J}{\partial \mathbf{w}}$. In this article, we will use $\nabla J$ to denote the gradient.
Algorithm
By the nature of the gradient, $\nabla J$ points in the direction in which $J$ increases most steeply. Conversely, $-\nabla J$ points in a direction in which $J$ decreases. Since our goal is to reduce the value of $J$, we move the weights $\mathbf{w}$ in the direction of $-\nabla J$; that is, we update as follows:

$$\mathbf{w} - \alpha\nabla J \to \mathbf{w}$$
Here $\alpha$ is the learning rate. Computing $\nabla J$ explicitly gives the following. Write $\mathbf{x}_i = [x_{i0}\ \dots\ x_{in}]^T$. Since $\dfrac{\partial\,\mathbf{w}^T\mathbf{x}_i}{\partial w_j} = \dfrac{\partial}{\partial w_j}\sum_k w_k x_{ik} = x_{ij}$,

$$\begin{aligned}
\nabla J &= \left[\frac{1}{2}\sum_{i=1}^{N}\frac{\partial}{\partial w_0}\left(y_i - \mathbf{w}^T\mathbf{x}_i\right)^2\ \cdots\ \frac{1}{2}\sum_{i=1}^{N}\frac{\partial}{\partial w_n}\left(y_i - \mathbf{w}^T\mathbf{x}_i\right)^2\right] \\
&= \left[-\sum_{i=1}^{N}\left(y_i - \mathbf{w}^T\mathbf{x}_i\right)x_{i0}\ \cdots\ -\sum_{i=1}^{N}\left(y_i - \mathbf{w}^T\mathbf{x}_i\right)x_{in}\right] \\
&= -\sum_{i=1}^{N}\left(y_i - \mathbf{w}^T\mathbf{x}_i\right)\left[x_{i0}\ \cdots\ x_{in}\right] \\
&= -\sum_{i=1}^{N}\left(y_i - \mathbf{w}^T\mathbf{x}_i\right)\mathbf{x}_i
\end{aligned}$$

Note the minus sign from the chain rule: differentiating $(y_i - \mathbf{w}^T\mathbf{x}_i)^2$ with respect to $w_j$ brings out a factor of $-x_{ij}$.
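As a quick sanity check on the sign and shape of this gradient, the following sketch compares the analytic expression $-\sum_i (y_i - \mathbf{w}^T\mathbf{x}_i)\mathbf{x}_i$ with a central finite-difference approximation; the data and function names here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 50, 4                      # d weights, including a bias weight for x_0 = 1
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d - 1))])
y = rng.normal(size=N)
w = rng.normal(size=d)

def loss(w):
    return 0.5 * np.sum((y - X @ w) ** 2)

def analytic_grad(w):
    # grad J = -sum_i (y_i - w^T x_i) x_i
    return -(y - X @ w) @ X

def numeric_grad(w, eps=1e-6):
    # Central finite differences, one coordinate at a time.
    g = np.zeros_like(w)
    for j in range(len(w)):
        e = np.zeros_like(w)
        e[j] = eps
        g[j] = (loss(w + e) - loss(w - e)) / (2 * eps)
    return g

print(np.max(np.abs(analytic_grad(w) - numeric_grad(w))))   # tiny, on the order of 1e-6 or smaller
```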
Therefore, the concrete formula for updating $\mathbf{w}$ is as follows.

$$\mathbf{w} + \alpha\sum_{i}\left(y_i - \mathbf{w}^T\mathbf{x}_i\right)\mathbf{x}_i \to \mathbf{w}$$
This is called the Widrow-Hoff or LMS (least mean squares) rule. From the formula above, you can see that when the error $y_i - \mathbf{w}^T\mathbf{x}_i$ is large, $\mathbf{w}$ receives a large update, and when the error is small, $\mathbf{w}$ is updated only slightly.
Update Methods
There are two main methods of updating the weights. Both are known to converge to the optimal solution for an appropriate learning rate $\alpha$.
Batch Learning
Batch learning updates the weights all at once using the error over the entire data set $X$, exactly as derived above.
Repeat until convergence:

$$\mathbf{w} + \alpha\sum_{i}\left(y_i - \mathbf{w}^T\mathbf{x}_i\right)\mathbf{x}_i \to \mathbf{w}$$
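Below is a minimal batch-learning sketch of this rule, assuming NumPy; the function name `batch_gradient_descent`, the learning rate, the iteration count, and the synthetic data are illustrative choices rather than part of the algorithm itself.

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.005, num_iters=1000):
    """Repeat w <- w + alpha * sum_i (y_i - w^T x_i) x_i over the whole data set."""
    w = np.zeros(X.shape[1])
    for _ in range(num_iters):
        residual = y - X @ w            # errors of every example at once
        w = w + alpha * (residual @ X)  # full-batch LMS update
    return w

# Hypothetical usage on synthetic data (bias column x_0 = 1 plus two features).
rng = np.random.default_rng(0)
X = np.hstack([np.ones((100, 1)), rng.normal(size=(100, 2))])
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.05 * rng.normal(size=100)
print(batch_gradient_descent(X, y))     # should be close to true_w
```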
Online Learning
Online learning updates the weights using the error of each individual example $\mathbf{x}_i$.
Repeat until convergence:

For $i = 1$ to $N$:

$$\mathbf{w} + \alpha\left(y_i - \mathbf{w}^T\mathbf{x}_i\right)\mathbf{x}_i \to \mathbf{w}$$
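And a corresponding online-learning sketch, where the same update is applied one example at a time; again the function name, the fixed learning rate, and the number of passes over the data are hypothetical choices.

```python
import numpy as np

def online_gradient_descent(X, y, alpha=0.01, num_epochs=50):
    """For each example i: w <- w + alpha * (y_i - w^T x_i) x_i."""
    w = np.zeros(X.shape[1])
    for _ in range(num_epochs):
        for x_i, y_i in zip(X, y):
            error = y_i - w @ x_i          # per-example error
            w = w + alpha * error * x_i    # single-example LMS update
    return w

# Hypothetical usage on the same kind of synthetic data as above.
rng = np.random.default_rng(0)
X = np.hstack([np.ones((200, 1)), rng.normal(size=(200, 2))])
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.05 * rng.normal(size=200)
print(online_gradient_descent(X, y))       # should be close to true_w
```

With a fixed learning rate and noisy labels the iterates only hover near the optimum; decaying $\alpha$ over the passes is a common refinement, left out here to keep the sketch short.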
See Also