Least Squares Learning of Linear Regression Models in Machine Learning
Overview [1]
We introduce the least squares method, a learning technique for linear regression models.
Description
Let the dataset be $X = \left\{ \mathbf{x}_{i} \right\}_{i=1}^{N}$, and the label set be $Y = \left\{ y_{i} \right\}_{i=1}^{N}$. Assume the following linear regression model:
$$ \hat{y} = \sum\limits_{j=0}^{n-1} w_{j}x_{j} = \mathbf{w}^{T} \mathbf{x} $$
Here, $\mathbf{x} = \begin{bmatrix} x_{0} & \dots & x_{n-1} \end{bmatrix}^{T}$ and $\mathbf{w} = \begin{bmatrix} w_{0} & \dots & w_{n-1} \end{bmatrix}^{T}$. The least squares method is a learning method that minimizes the squared error between the model’s prediction $\hat{y}_{i} = \mathbf{w}^{T} \mathbf{x}_{i}$ and the actual label $y_{i}$. Thus, the loss function $J$ is set to MSE as follows:
$$ J(\mathbf{w}) = \dfrac{1}{2} \sum\limits_{i=1}^{N} (y_{i} - \mathbf{w}^{T}\mathbf{x}_{i})^{2} = \dfrac{1}{2} \left\| \mathbf{y} - \mathbf{X}\mathbf{w} \right\|^{2} $$
Here, $\mathbf{y} = \begin{bmatrix} y_{1} & \dots & y_{N} \end{bmatrix}^{T}$ and $\mathbf{X} = \begin{bmatrix} \mathbf{x}_{1} & \dots & \mathbf{x}_{N} \end{bmatrix}^{T}$. Then $\mathbf{X}\mathbf{w}$ is the following.
$$ \mathbf{X}\mathbf{w} = \begin{bmatrix} \mathbf{x}_{1}^{T} \mathbf{w} \\ \vdots \\ \mathbf{x}_{N}^{T} \mathbf{w} \end{bmatrix} $$
The constant $\frac{1}{2}$ in front is there so that the factor of $2$ produced by differentiation cancels, which simplifies the derivative of $J$; minimizing the squared error is equivalent to minimizing half of it. The goal is to find the least squares solution $\mathbf{w}_{\text{LS}}$ for the given $X$ and $Y$.
$$ \mathbf{w}_{\text{LS}} = \argmin\limits_{\mathbf{w}} \left( J(\mathbf{w}) = \dfrac{1}{2} \left\| \mathbf{y} - \mathbf{X}\mathbf{w} \right\|^{2} \right) $$
Unlike learning with gradient descent, which finds the optimal weights through an iterative procedure, the method introduced here is analytical: it obtains the optimal solution in a single step.
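As a concrete reference point, here is a minimal NumPy sketch (with arbitrary toy data and variable names of my own choosing) that evaluates the loss both as the per-sample sum and as the matrix-norm expression, confirming that the two forms of $J$ coincide.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: N samples, n features (first feature fixed to 1 as a bias term).
N, n = 20, 3
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, n - 1))])  # rows are x_i^T
y = rng.normal(size=N)
w = rng.normal(size=n)

# Loss as an explicit sum over samples: J(w) = 1/2 * sum_i (y_i - w^T x_i)^2
J_sum = 0.5 * sum((y[i] - w @ X[i]) ** 2 for i in range(N))

# Loss in matrix form: J(w) = 1/2 * ||y - Xw||^2
J_mat = 0.5 * np.linalg.norm(y - X @ w) ** 2

print(np.isclose(J_sum, J_mat))  # True: the two forms coincide
```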
Learning
Case 1: Exact and Unique Solution Exists
Suppose the number of data points $N$, the dimension of weights $n$, and the rank of $\mathbf{X}$ are all equal. This means $\mathbf{X}$ is of full rank.
$$ N = n = \rank (\mathbf{X}) $$
In this situation, $\mathbf{X}$ is an $N \times N$ square matrix, which has an inverse since its rank is $N$. Therefore, the system of equations $\mathbf{y} = \mathbf{X} \mathbf{w}$ has a unique solution.
$$ \mathbf{w} = \mathbf{X}^{-1}\mathbf{y} $$
For such a weight $\mathbf{w}$, the loss function attains the value $0$.
$$ J(\mathbf{X}^{-1}\mathbf{y}) = \dfrac{1}{2} \left\| \mathbf{y} - \mathbf{X}\mathbf{X}^{-1}\mathbf{y}\right\|^{2} = 0 $$
In this ideal case, one can find the optimal weights without resorting to gradient descent; however, such a case rarely occurs in practice.
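A minimal NumPy sketch of this case, assuming a randomly generated square, full-rank $\mathbf{X}$: the system is solved exactly and the loss evaluates to zero.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical case-1 setup: N = n and X is full rank (hence invertible).
N = n = 4
X = rng.normal(size=(N, n))
y = rng.normal(size=N)

assert np.linalg.matrix_rank(X) == N  # full rank, so X^{-1} exists

# Unique exact solution w = X^{-1} y (np.linalg.solve avoids forming the inverse explicitly).
w = np.linalg.solve(X, y)

# The loss J(w) = 1/2 * ||y - Xw||^2 vanishes at this w.
J = 0.5 * np.linalg.norm(y - X @ w) ** 2
print(np.isclose(J, 0.0))  # True
```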
Case 2: Full Rank and Overdetermined
Assume the number of data points $N$, the dimension of weights $n$, and $\rank (\mathbf{X})$ meet the following condition. This assumption holds in many real-world scenarios.
$$ N \gt n = \rank (\mathbf{X}) $$
When the linear system $\mathbf{X} \mathbf{w} = \mathbf{y}$ is overdetermined, an exact solution generally does not exist, so we instead look for the $\mathbf{w}$ that minimizes $J$. Expanding $J$, we compute the norm of the vector as follows:
$$ \begin{align*} J(\mathbf{w}) &= \dfrac{1}{2} \left\| \mathbf{y} - \mathbf{X}\mathbf{w}\right\|^{2} = \dfrac{1}{2} \left( \mathbf{y} - \mathbf{X}\mathbf{w} \right)^{T} \left( \mathbf{y} - \mathbf{X}\mathbf{w} \right) \\ &= \dfrac{1}{2} \left( \mathbf{y}^{T} - \mathbf{w}^{T}\mathbf{X}^{T} \right) \left( \mathbf{y} - \mathbf{X}\mathbf{w} \right) \\ &= \dfrac{1}{2} \left( \mathbf{y}^{T}\mathbf{y} - \mathbf{y}^{T}\mathbf{X}\mathbf{w} - \mathbf{w}^{T}\mathbf{X}^{T}\mathbf{y} + \mathbf{w}^{T}\mathbf{X}^{T}\mathbf{X}\mathbf{w} \right) \end{align*} $$
Since $\mathbf{w}^{T}\mathbf{X}^{T}\mathbf{y}$ is a scalar,
$$ \mathbf{w}^{T}\mathbf{X}^{T}\mathbf{y} = (\mathbf{w}^{T}\mathbf{X}^{T}\mathbf{y})^{T} = \mathbf{y}^{T}\mathbf{X}\mathbf{w} $$
Therefore, we obtain the following:
$$ J(\mathbf{w}) = \dfrac{1}{2} \left( \mathbf{y}^{T}\mathbf{y} - 2\mathbf{y}^{T}\mathbf{X}\mathbf{w} + \mathbf{w}^{T}\mathbf{X}^{T}\mathbf{X}\mathbf{w} \right) $$
Although only vectors and matrices are shown, note that the above value is a scalar. To find when this value is minimized, we compute the gradient:
$$ \begin{align*} \nabla J(\mathbf{w}) = \dfrac{\partial J(\mathbf{w})}{\partial \mathbf{w}} &= \dfrac{1}{2} \dfrac{\partial }{\partial \mathbf{w}}\left( \mathbf{y}^{T}\mathbf{y} - 2\mathbf{y}^{T}\mathbf{X}\mathbf{w} + \mathbf{w}^{T}\mathbf{X}^{T}\mathbf{X}\mathbf{w} \right) \\ &= \dfrac{1}{2} \left( -2\mathbf{X}^{T}\mathbf{y} + 2\mathbf{X}^{T}\mathbf{X}\mathbf{w} \right) \\ &= \mathbf{X}^{T}\mathbf{X}\mathbf{w} - \mathbf{X}^{T}\mathbf{y} \end{align*} $$
Refer to here for the derivative computation. Setting the gradient to $\mathbf{0}$ yields the following equation.
$$ \nabla J = \mathbf{0} \implies \mathbf{X}^{T}\mathbf{X}\mathbf{w} = \mathbf{X}^{T}\mathbf{y} $$
This equation $\mathbf{X}^{T}\mathbf{X}\mathbf{w} = \mathbf{X}^{T}\mathbf{y}$ is known as the normal equation. Since $\mathbf{X}$ has full rank, this problem has a unique solution: $\mathbf{X}^{T} \mathbf{X}$ is an $n \times n$ matrix with the same rank as $\mathbf{X}$.
$$ \rank (\mathbf{X}^{T} \mathbf{X}) = n $$
Thus, an inverse matrix exists, and we derive the following least squares solution.
$$ \mathbf{w} = (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y} $$
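A minimal NumPy sketch of case 2, assuming a randomly generated tall, full-column-rank $\mathbf{X}$: it solves the normal equation, checks that the gradient vanishes at the solution, and cross-checks against `np.linalg.lstsq`.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical case-2 setup: N > n and X has full column rank.
N, n = 50, 3
X = rng.normal(size=(N, n))
y = rng.normal(size=N)

# Least squares solution from the normal equation: (X^T X) w = X^T y.
# Solving the linear system is preferred over forming (X^T X)^{-1} explicitly.
w_ls = np.linalg.solve(X.T @ X, X.T @ y)

# The gradient X^T X w - X^T y vanishes at the solution.
grad = X.T @ X @ w_ls - X.T @ y
print(np.allclose(grad, 0.0))  # True

# Cross-check against NumPy's least squares routine.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(w_ls, w_lstsq))  # True
```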
Case 3: Full Rank and Underdetermined
Suppose the number of data points $N$, the dimension of weights $n$, and $\rank (\mathbf{X})$ meet the following condition.
$$ n \gt N = \rank (\mathbf{X}) $$
Then, the system of equations $\mathbf{X}\mathbf{w} = \mathbf{y}$ is underdetermined: it has infinitely many exact solutions, so the least squares solution is not unique. Among these, our goal is to find the $\mathbf{w}$ with the smallest norm.
$$ \argmin\limits_{\mathbf{w}} \dfrac{1}{2} \left\| \mathbf{w} \right\|^{2} \quad \text{subject to} \quad \mathbf{X}\mathbf{w} = \mathbf{y} $$
We approach this constrained problem with the method of Lagrange multipliers: we minimize the function obtained by adding to the objective the product of the multiplier $\boldsymbol{\lambda} = \begin{bmatrix} \lambda_{1} & \dots & \lambda_{N} \end{bmatrix}^{T}$ and the constraint $\mathbf{y} - \mathbf{X}\mathbf{w}$:
$$ L(\mathbf{w}, \boldsymbol{\lambda}) = \dfrac{1}{2} \left\| \mathbf{w} \right\|^{2} + \boldsymbol{\lambda}^{T}(\mathbf{y} - \mathbf{X}\mathbf{w}) $$
The gradient of $L$ is
$$ \nabla L = \dfrac{\partial L}{\partial \mathbf{w}} = \mathbf{w} - \mathbf{X}^{T}\boldsymbol{\lambda} $$
Refer to here for the derivative computation. The weights making the gradient $\mathbf{0}$ are $\mathbf{w} = \mathbf{X}^{T}\boldsymbol{\lambda}$. Substituting this into the constraint gives the following:
$$ \mathbf{y} = \mathbf{X}\mathbf{w} = \mathbf{X}\mathbf{X}^{T}\boldsymbol{\lambda} $$
Since $\mathbf{X}$ has full row rank, $\mathbf{X}\mathbf{X}^{T}$ is invertible, so $\boldsymbol{\lambda} = (\mathbf{X}\mathbf{X}^{T})^{-1}\mathbf{y}$, and the solution we seek is:
$$ \mathbf{w} = \mathbf{X}^{T}(\mathbf{X}\mathbf{X}^{T})^{-1}\mathbf{y} $$
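A minimal NumPy sketch of case 3, assuming a randomly generated wide, full-row-rank $\mathbf{X}$: it computes the minimum-norm solution $\mathbf{w} = \mathbf{X}^{T}(\mathbf{X}\mathbf{X}^{T})^{-1}\mathbf{y}$ and checks that it satisfies the system exactly; `np.linalg.lstsq` also returns the minimum-norm solution for underdetermined full-rank systems, so the two should agree.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical case-3 setup: n > N and X has full row rank.
N, n = 3, 10
X = rng.normal(size=(N, n))
y = rng.normal(size=N)

# Minimum-norm solution w = X^T (X X^T)^{-1} y, via a linear solve for lambda.
lam = np.linalg.solve(X @ X.T, y)   # lambda = (X X^T)^{-1} y
w_min = X.T @ lam                   # w = X^T lambda

# It solves the system exactly (J = 0) ...
print(np.allclose(X @ w_min, y))    # True

# ... and matches NumPy's lstsq, which also returns the minimum-norm solution here.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(w_min, w_lstsq))  # True
```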
Case 4: Rank Deficient
Assume $\mathbf{X}$ is rank deficient.
$$ \text{overdetermined: }N \gt n \gt \rank(\mathbf{X}) \\ \text{underdetermined: } n \gt N \gt \rank(\mathbf{X}) $$
In this scenario, since $\mathbf{X}$ is not of full rank, neither $\mathbf{X}^{T}\mathbf{X}$ nor $\mathbf{X}\mathbf{X}^{T}$ has an inverse, and there are infinitely many least squares solutions. Therefore, the problem cannot be solved as in case 2 or case 3.
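A minimal sketch illustrating the rank-deficient situation, with a duplicated column as an artificial source of rank deficiency: $\mathbf{X}^{T}\mathbf{X}$ is singular, so the case-2 formula is not applicable.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical rank-deficient setup: N > n but rank(X) < n, produced here
# by making the last column an exact copy of the first.
N, n = 20, 4
X = rng.normal(size=(N, n))
X[:, -1] = X[:, 0]
y = rng.normal(size=N)

print(np.linalg.matrix_rank(X))        # 3, i.e. less than n = 4
print(np.linalg.matrix_rank(X.T @ X))  # also 3, so X^T X is singular

# Since X^T X (and likewise X X^T) has no inverse, the closed-form solutions
# of case 2 and case 3 do not apply; infinitely many w minimize J.
```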
[1] Christopher M. Bishop, Pattern Recognition and Machine Learning (2006), pp. 140-147