Least Squares Learning of Linear Regression Models in Machine Learning

Overview¹

We introduce the least squares method, a learning method for linear regression models.

Description

Let the dataset be $X = \left\{ \mathbf{x}_{i} \right\}_{i=1}^{N}$, and the label set be $Y = \left\{ y_{i} \right\}_{i=1}^{N}$. Assume the following linear regression model:

$$ \hat{y} = \sum\limits_{j=0}^{n-1} w_{j}x_{j} = \mathbf{w}^{T} \mathbf{x} $$

Here, $\mathbf{x} = \begin{bmatrix} x_{0} & \dots & x_{n-1} \end{bmatrix}^{T}$ and $\mathbf{w} = \begin{bmatrix} w_{0} & \dots & w_{n-1} \end{bmatrix}^{T}$. The least squares method is a learning method that minimizes the squared error between the model's prediction $\hat{y}_{i} = \mathbf{w}^{T} \mathbf{x}_{i}$ and the actual label $y_{i}$. Thus, the loss function $J$ is set to the MSE as follows:

$$ J(\mathbf{w}) = \dfrac{1}{2} \sum\limits_{i=1}^{N} (y_{i} - \mathbf{w}^{T}\mathbf{x}_{i})^{2} = \dfrac{1}{2} \left\| \mathbf{y} - \mathbf{X}\mathbf{w} \right\|^{2} $$

Here, $\mathbf{y} = \begin{bmatrix} y_{1} & \dots & y_{N} \end{bmatrix}^{T}$ and $\mathbf{X} = \begin{bmatrix} \mathbf{x}_{1} & \dots & \mathbf{x}_{N} \end{bmatrix}^{T}$. Then $\mathbf{X}\mathbf{w}$ is the following:

$$ \mathbf{X}\mathbf{w} = \begin{bmatrix} \mathbf{x}_{1}^{T} \mathbf{w} \\ \vdots \\ \mathbf{x}_{N}^{T} \mathbf{w} \end{bmatrix} $$

The constant $\frac{1}{2}$ in front is there only to cancel the $2$ that appears when differentiating $J$; minimizing $J$ and minimizing $\frac{1}{2}J$ give the same minimizer. The goal is to find the least squares solution $\mathbf{w}_{\text{LS}}$ for the given $X$ and $Y$.

$$ \mathbf{w}_{\text{LS}} = \argmin\limits_{\mathbf{w}} \left( J(\mathbf{w}) = \dfrac{1}{2} \left\| \mathbf{y} - \mathbf{X}\mathbf{w} \right\|^{2} \right) $$

Unlike learning with gradient descent, which finds the optimal weights iteratively, the method introduced here is analytical: it computes the optimal solution in a single step.
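
As a quick check, the following is a minimal sketch with hypothetical random data (not from the text) verifying numerically that the summation form and the norm form of $J$ agree.

```python
import numpy as np

# Minimal sketch with hypothetical random data: verify that the two forms of the
# loss J(w) = (1/2) * sum_i (y_i - w^T x_i)^2 = (1/2) * ||y - Xw||^2 agree.
rng = np.random.default_rng(0)
N, n = 5, 3                      # number of data points, weight dimension
X = rng.normal(size=(N, n))      # rows of X are the x_i^T
y = rng.normal(size=N)
w = rng.normal(size=n)

J_sum = 0.5 * sum((y[i] - w @ X[i]) ** 2 for i in range(N))
J_norm = 0.5 * np.linalg.norm(y - X @ w) ** 2
assert np.isclose(J_sum, J_norm)
```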

Learning

Case 1: Exact and Unique Solution Exists

Suppose the number of data points $N$, the dimension of the weights $n$, and the rank of $\mathbf{X}$ are all equal. This means $\mathbf{X}$ is of full rank.

$$ N = n = \rank (\mathbf{X}) $$

In this situation, $\mathbf{X}$ is an $N \times N$ square matrix, and since its rank is $N$ it has an inverse. Therefore, the system of equations $\mathbf{y} = \mathbf{X} \mathbf{w}$ has a unique solution.

$$ \mathbf{w} = \mathbf{X}^{-1}\mathbf{y} $$

For this weight $\mathbf{w}$, the value of the loss function is $0$.

$$ J(\mathbf{X}^{-1}\mathbf{y}) = \dfrac{1}{2} \left\| \mathbf{y} - \mathbf{X}\mathbf{X}^{-1}\mathbf{y}\right\|^{2} = 0 $$

In this ideal case, one can find the optimal weights without resorting to gradient descent, but the case itself rarely occurs in practice.
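
Below is a minimal sketch of this case with hypothetical random data: a square, full-rank $\mathbf{X}$ lets us recover $\mathbf{w}$ exactly and drive the loss to zero.

```python
import numpy as np

# Minimal sketch of case 1 with hypothetical random data: X is square and
# (almost surely) full rank, so w = X^{-1} y fits the labels exactly.
rng = np.random.default_rng(1)
N = 4
X = rng.normal(size=(N, N))
y = rng.normal(size=N)

w = np.linalg.solve(X, y)        # solves Xw = y without forming X^{-1} explicitly
loss = 0.5 * np.linalg.norm(y - X @ w) ** 2
print(loss)                      # ~0 up to floating point error
```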


Case 2: Full Rank and Overdetermined

Assume the number of data points $N$, the dimension of the weights $n$, and $\rank (\mathbf{X})$ meet the following condition. This assumption holds in many real-world scenarios.

$$ N \gt n = \rank (\mathbf{X}) $$

When the linear system $\mathbf{X} \mathbf{w} = \mathbf{y}$ is overdetermined, it generally has no exact solution, so we minimize $J$ instead. Expanding $J$ by writing out the norm gives the following:

$$
\begin{align*}
J(\mathbf{w}) &= \dfrac{1}{2} \left\| \mathbf{y} - \mathbf{X}\mathbf{w} \right\|^{2} = \dfrac{1}{2} \left( \mathbf{y} - \mathbf{X}\mathbf{w} \right)^{T} \left( \mathbf{y} - \mathbf{X}\mathbf{w} \right) \\
&= \dfrac{1}{2} \left( \mathbf{y}^{T} - \mathbf{w}^{T}\mathbf{X}^{T} \right) \left( \mathbf{y} - \mathbf{X}\mathbf{w} \right) \\
&= \dfrac{1}{2} \left( \mathbf{y}^{T}\mathbf{y} - \mathbf{y}^{T}\mathbf{X}\mathbf{w} - \mathbf{w}^{T}\mathbf{X}^{T}\mathbf{y} + \mathbf{w}^{T}\mathbf{X}^{T}\mathbf{X}\mathbf{w} \right)
\end{align*}
$$

Since $\mathbf{w}^{T}\mathbf{X}^{T}\mathbf{y}$ is a scalar,

$$ \mathbf{w}^{T}\mathbf{X}^{T}\mathbf{y} = (\mathbf{w}^{T}\mathbf{X}^{T}\mathbf{y})^{T} = \mathbf{y}^{T}\mathbf{X}\mathbf{w} $$

Therefore, we obtain the following:

$$ J(\mathbf{w}) = \dfrac{1}{2} \left( \mathbf{y}^{T}\mathbf{y} - 2\mathbf{y}^{T}\mathbf{X}\mathbf{w} + \mathbf{w}^{T}\mathbf{X}^{T}\mathbf{X}\mathbf{w} \right) $$

Although the expression is written entirely in vectors and matrices, note that its value is a scalar. To find where it is minimized, we compute the gradient:

$$
\begin{align*}
\nabla J(\mathbf{w}) = \dfrac{\partial J(\mathbf{w})}{\partial \mathbf{w}} &= \dfrac{1}{2} \dfrac{\partial }{\partial \mathbf{w}}\left( \mathbf{y}^{T}\mathbf{y} - 2\mathbf{y}^{T}\mathbf{X}\mathbf{w} + \mathbf{w}^{T}\mathbf{X}^{T}\mathbf{X}\mathbf{w} \right) \\
&= \dfrac{1}{2} \left( -2\mathbf{X}^{T}\mathbf{y} + 2\mathbf{X}^{T}\mathbf{X}\mathbf{w} \right) \\
&= \mathbf{X}^{T}\mathbf{X}\mathbf{w} - \mathbf{X}^{T}\mathbf{y}
\end{align*}
$$

Refer to here for the derivative computations. Setting the gradient to zero gives the following equation.

$$ \nabla J = \mathbf{0} \implies \mathbf{X}^{T}\mathbf{X}\mathbf{w} = \mathbf{X}^{T}\mathbf{y} $$

The equation $\mathbf{X}^{T}\mathbf{X}\mathbf{w} = \mathbf{X}^{T}\mathbf{y}$ is known as the normal equation. Since $\mathbf{X}$ is full rank, this problem has a unique solution: $\mathbf{X}^{T} \mathbf{X}$ is an $n\times n$ matrix with the same rank as $\mathbf{X}$.

$$ \rank (\mathbf{X}^{T} \mathbf{X}) = n $$

Thus, $\mathbf{X}^{T}\mathbf{X}$ is invertible, and we obtain the following least squares solution.

$$ \mathbf{w} = (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y} $$
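
The sketch below (hypothetical random data) solves the normal equation directly and compares the result with NumPy's least squares routine; both compute the same $\mathbf{w}$.

```python
import numpy as np

# Minimal sketch of case 2 with hypothetical random data: N > n and X full rank.
# The least squares solution solves the normal equation X^T X w = X^T y.
rng = np.random.default_rng(2)
N, n = 100, 3
X = rng.normal(size=(N, n))
y = rng.normal(size=N)

w_ls = np.linalg.solve(X.T @ X, X.T @ y)      # (X^T X)^{-1} X^T y

# np.linalg.lstsq gives the same answer and is numerically more robust.
w_ref, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(w_ls, w_ref)
```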


Case 3: Full Rank and Underdetermined

Suppose the number of data points $N$, the dimension of the weights $n$, and $\rank (\mathbf{X})$ meet the following condition.

$$ n \gt N = \rank (\mathbf{X}) $$

Then the system of equations $\mathbf{X}\mathbf{w} = \mathbf{y}$ is underdetermined: unlike in case 2, it does have solutions, but infinitely many of them rather than a unique one. Among these, our goal is to find the $\mathbf{w}$ with the smallest norm.

$$ \argmin\limits_{\mathbf{w}} \dfrac{1}{2} \left\| \mathbf{w} \right\|^{2} $$

We approach this problem with the Lagrange multiplier method: we minimize the function obtained by adding to the objective the product of the multiplier $\boldsymbol{\lambda} = \begin{bmatrix} \lambda_{1} & \dots & \lambda_{N} \end{bmatrix}^{T}$ and the constraint $\mathbf{y} - \mathbf{X}\mathbf{w}$:

$$ L(\mathbf{w}, \boldsymbol{\lambda}) = \dfrac{1}{2} \left\| \mathbf{w} \right\|^{2} + \boldsymbol{\lambda}^{T}(\mathbf{y} - \mathbf{X}\mathbf{w}) $$

The gradient of $L$ with respect to $\mathbf{w}$ is

$$ \nabla L = \dfrac{\partial L}{\partial \mathbf{w}} = \mathbf{w} - \mathbf{X}^{T}\boldsymbol{\lambda} $$

Refer to here for the derivative computation. The weights making the gradient $\mathbf{0}$ satisfy $\mathbf{w} = \mathbf{X}^{T}\boldsymbol{\lambda}$. Substituting this into the constraint $\mathbf{X}\mathbf{w} = \mathbf{y}$ gives the following:

$$ \mathbf{y} = \mathbf{X}\mathbf{w} = \mathbf{X}\mathbf{X}^{T}\boldsymbol{\lambda} $$

Thus $\boldsymbol{\lambda} = (\mathbf{X}\mathbf{X}^{T})^{-1}\mathbf{y}$, and the solution we seek is:

$$ \mathbf{w} = \mathbf{X}^{T}(\mathbf{X}\mathbf{X}^{T})^{-1}\mathbf{y} $$
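
Here is a minimal sketch of this case with hypothetical random data: the minimum-norm weights satisfy the constraint $\mathbf{X}\mathbf{w} = \mathbf{y}$ exactly.

```python
import numpy as np

# Minimal sketch of case 3 with hypothetical random data: n > N and X has full
# row rank. The minimum-norm solution is w = X^T (X X^T)^{-1} y.
rng = np.random.default_rng(3)
N, n = 3, 10
X = rng.normal(size=(N, n))
y = rng.normal(size=N)

lam = np.linalg.solve(X @ X.T, y)     # Lagrange multipliers (X X^T)^{-1} y
w_min = X.T @ lam                     # w = X^T lambda

assert np.allclose(X @ w_min, y)      # the constraint Xw = y holds exactly
```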


Case 4: Rank Deficient

Assume $\mathbf{X}$ is rank deficient.

$$
\begin{align*}
\text{overdetermined: } & N \gt n \gt \rank(\mathbf{X}) \\
\text{underdetermined: } & n \gt N \gt \rank(\mathbf{X})
\end{align*}
$$

In this scenario, since $\mathbf{X}$ is not full rank, there are infinitely many least squares solutions and no unique one. Therefore, the problem cannot be solved as in case 2 or case 3.
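
The sketch below (hypothetical data with a duplicated column) illustrates why the case 2 formula breaks down here: $\mathbf{X}^{T}\mathbf{X}$ becomes singular, so it cannot be inverted.

```python
import numpy as np

# Minimal sketch of case 4 with hypothetical data: duplicating a column makes X
# rank deficient, so X^T X is singular and the normal equation of case 2 fails.
rng = np.random.default_rng(4)
N = 50
col = rng.normal(size=(N, 1))
X = np.hstack([col, col, rng.normal(size=(N, 1))])   # rank(X) = 2 < n = 3
y = rng.normal(size=N)

print(np.linalg.matrix_rank(X))                      # 2
try:
    np.linalg.solve(X.T @ X, X.T @ y)
except np.linalg.LinAlgError:
    print("X^T X is singular; no unique least squares solution")
```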


  1. Christopher M. Bishop, Pattern Recognition and Machine Learning (2006), pp. 140-147