Let the dataset be $X = \{x_i\}_{i=1}^{N}$, and the label set be $Y = \{y_i\}_{i=1}^{N}$. Assume the following linear regression model:
$$\hat{y} = \sum_{j=0}^{n-1} w_j x_j = w^T x$$
Here, $x = [x_0 \dots x_{n-1}]^T$ and $w = [w_0 \dots w_{n-1}]^T$. The least squares method is a learning method that minimizes the squared error between the model's prediction $\hat{y}_i = w^T x_i$ and the actual label $y_i$. Thus, the loss function $J$ is set to the (halved) sum of squared errors:
$$J(w) = \frac{1}{2}\sum_{i=1}^{N}\left(y_i - w^T x_i\right)^2 = \frac{1}{2}\|y - Xw\|^2$$
Here, $y = [y_1 \dots y_N]^T$ and $X = [x_1 \dots x_N]^T$. Then $Xw$ is the following.
$$Xw = \begin{bmatrix} x_1^T w \\ \vdots \\ x_N^T w \end{bmatrix}$$
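As a quick numerical check (a minimal numpy sketch with randomly generated data, purely illustrative), the per-example sum and the vector-norm form of $J$ agree:

```python
import numpy as np

rng = np.random.default_rng(0)
N, n = 5, 3
X = rng.normal(size=(N, n))   # rows of X are the data points x_i^T
y = rng.normal(size=N)
w = rng.normal(size=n)

# J(w) written as a sum over the N data points
J_sum = 0.5 * sum((y[i] - w @ X[i]) ** 2 for i in range(N))

# J(w) written with the stacked matrix X and the vector norm
J_norm = 0.5 * np.linalg.norm(y - X @ w) ** 2

print(np.isclose(J_sum, J_norm))  # True
```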
The constant $\frac{1}{2}$ in front is there to cancel the factor of 2 that appears when differentiating $J$. In essence, minimizing $J$ is the same as minimizing $\frac{1}{2}J$. The goal is to find the least squares solution $w_{LS}$ for given $X$ and $Y$.
$$w_{LS} = \underset{w}{\operatorname{arg\,min}}\, J(w) = \underset{w}{\operatorname{arg\,min}}\, \frac{1}{2}\|y - Xw\|^2$$
Unlike gradient descent, which finds the optimal weights through an iterative procedure, the method introduced here is analytical: it finds the optimal solution in a single closed-form step.
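To make the contrast concrete, here is a small numpy sketch (random data; the step size and iteration count are illustrative choices, not part of the derivation) comparing gradient descent with numpy's built-in least squares solver; both arrive at essentially the same weights:

```python
import numpy as np

rng = np.random.default_rng(1)
N, n = 100, 4
X = rng.normal(size=(N, n))
y = rng.normal(size=N)

# Analytical route: numpy's least squares solver
w_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

# Iterative route: gradient descent on J(w) = 0.5 * ||y - Xw||^2
w = np.zeros(n)
lr = 1.0 / np.linalg.norm(X, 2) ** 2   # safe step size: 1 / lambda_max(X^T X)
for _ in range(5000):
    grad = X.T @ (X @ w - y)           # gradient of J
    w -= lr * grad

print(np.allclose(w, w_ls))            # True: both reach the same minimizer
```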
Learning
Case 1: Exact and Unique Solution Exists
Suppose the number of data points N, the dimension of weights n, and the rank of X are all equal. This means X is of full rank.
$$N = n = \operatorname{rank}(X)$$
In this situation, $X$ is an $N \times N$ square matrix, and it is invertible since its rank is $N$. Therefore, the system of equations $y = Xw$ has a unique solution.
$$w = X^{-1}y$$
For this weight $w$, the value of the loss function is 0.
$$J(X^{-1}y) = \frac{1}{2}\|y - XX^{-1}y\|^2 = 0$$
In this ideal case, the optimal weights can be found without resorting to gradient descent; however, such a case rarely arises in real situations.
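A minimal numpy illustration of this case, assuming a randomly generated square $X$ (which is full rank with probability 1):

```python
import numpy as np

rng = np.random.default_rng(2)
N = 4                           # N = n: as many data points as weights
X = rng.normal(size=(N, N))     # random square X; full rank with probability 1
y = rng.normal(size=N)

w = np.linalg.solve(X, y)       # w = X^{-1} y (solve is preferred over explicit inversion)
loss = 0.5 * np.linalg.norm(y - X @ w) ** 2

print(loss)                     # ~0, up to floating-point error
```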
Case 2: Full Rank and Overdetermined
Assume the number of data points N, the dimension of weights n, and rank(X) meet the following condition. This assumption holds in many real-world scenarios.
$$N > n = \operatorname{rank}(X)$$
When the linear system $Xw = y$ is overdetermined, it generally has no exact solution, so we instead look for the $w$ that minimizes $J$. Expanding $J$ as a vector norm gives

$$J(w) = \frac{1}{2}(y - Xw)^T(y - Xw) = \frac{1}{2}\left(y^T y - 2 w^T X^T y + w^T X^T X w\right)$$

Refer to here for the derivative computation results. Setting the gradient of $J$ to 0 then gives the following equation.
$$\nabla J = 0 \implies X^T X w = X^T y$$
This equation $X^T X w = X^T y$ is known as the normal equation. $X^T X$ is an $n \times n$ matrix with the same rank as $X$, so under the full-rank assumption it is invertible and the least squares solution is unique:

$$w_{LS} = (X^T X)^{-1} X^T y$$
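A minimal numpy sketch of the overdetermined case (random data, for illustration), solving the normal equation directly and checking the result against numpy's least squares routine:

```python
import numpy as np

rng = np.random.default_rng(3)
N, n = 50, 3                      # N > n: overdetermined
X = rng.normal(size=(N, n))
y = rng.normal(size=N)

# Solve the normal equation X^T X w = X^T y
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Reference: numpy's least squares solver
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(w_normal, w_lstsq))  # True
```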
Case 3: Full Rank and Underdetermined
Suppose the number of data points $N$, the dimension of weights $n$, and $\operatorname{rank}(X)$ meet the following condition.
$$n > N = \operatorname{rank}(X)$$
Then, the system of equations $Xw = y$ is underdetermined. Unlike in case 2, the least squares solution is not unique: there are infinitely many weight vectors that satisfy $Xw = y$ exactly and thus achieve $J(w) = 0$. Among these, our goal is to find the $w$ with the smallest norm.
$$\underset{w}{\operatorname{arg\,min}}\; \frac{1}{2}\|w\|^2 \quad \text{subject to} \quad Xw = y$$
We approach this problem using the Lagrange multiplier method: we minimize the function obtained by adding to the objective the product of the multiplier $\lambda = [\lambda_1 \dots \lambda_N]^T$ and the constraint $y - Xw$:
$$L(w, \lambda) = \frac{1}{2}\|w\|^2 + \lambda^T (y - Xw)$$
The gradient of L is
$$\nabla L = \frac{\partial L}{\partial w} = w - X^T \lambda$$
Refer to here for the derivative computation. Setting the gradient to 0 gives $w = X^T \lambda$. Substituting this back into the constraint results in the following:
$$y = Xw = XX^T \lambda$$
Since $XX^T$ is an $N \times N$ matrix of rank $N$, it is invertible, so $\lambda = (XX^T)^{-1}y$, and the minimum-norm least squares solution is:

$$w_{LS} = X^T \lambda = X^T (XX^T)^{-1} y$$
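A minimal numpy sketch of the underdetermined case (random data, for illustration): the minimum-norm solution $w = X^T(XX^T)^{-1}y$ reproduces $y$ exactly, and here it coincides with the pseudoinverse solution:

```python
import numpy as np

rng = np.random.default_rng(4)
N, n = 3, 6                        # n > N: underdetermined
X = rng.normal(size=(N, n))
y = rng.normal(size=N)

# Minimum-norm solution from the Lagrangian result: w = X^T (X X^T)^{-1} y
lam = np.linalg.solve(X @ X.T, y)  # lambda = (X X^T)^{-1} y
w_min = X.T @ lam

print(np.allclose(X @ w_min, y))                    # True: fits the data exactly
print(np.allclose(w_min, np.linalg.pinv(X) @ y))    # True: matches the pseudoinverse solution
```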
Case 4: Not Full Rank
Finally, suppose $X$ is not of full rank, i.e. $\operatorname{rank}(X) < \min(N, n)$. In this scenario, as in case 3, there are infinitely many least squares solutions and no unique one, because $X$ is not full rank. Since neither $X^T X$ nor $XX^T$ is invertible, the problem cannot be solved as in case 2 or case 3.
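To see the non-uniqueness concretely, here is a small numpy sketch with a deliberately rank-deficient $X$ (the construction is illustrative): two different weight vectors achieve the same minimal loss:

```python
import numpy as np

rng = np.random.default_rng(5)
N, n = 6, 4
X = rng.normal(size=(N, n))
X[:, 3] = X[:, 0] + X[:, 1]        # one column is a combination of two others: rank(X) = 3 < n
y = rng.normal(size=N)

print(np.linalg.matrix_rank(X))    # 3: X is not full rank, so X^T X is not invertible

# Any vector in the null space of X can be added to a least squares solution
# without changing the loss, so the minimizer is not unique.
w0, *_ = np.linalg.lstsq(X, y, rcond=None)
null_dir = np.array([1.0, 1.0, 0.0, -1.0])   # X @ null_dir = 0 by construction
w1 = w0 + 2.5 * null_dir

def loss(w):
    return 0.5 * np.linalg.norm(y - X @ w) ** 2

print(np.isclose(loss(w0), loss(w1)))  # True: two different weights, same minimal loss
```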