Gradient of the Sum of Squared Residuals
Overview
In many regression problems in statistics and machine learning, the sum of squared residuals (RSS) is used as the objective function. In particular, when $f$ is a linear combination, the RSS can be expressed succinctly in matrix form.
$$
\begin{aligned}
\text{RSS} &= \sum_{k} \left( y_{k} - f(x_{k}) \right)^{2} \\
&= \sum_{k} \left( y_{k} - (s_{0} + s_{1} x_{k1} + \cdots + s_{p} x_{kp}) \right)^{2} \\
&= (\mathbf{y} - X \mathbf{s})^{T} (\mathbf{y} - X \mathbf{s}) \\
&= \left\| \mathbf{y} - X \mathbf{s} \right\|_{2}^{2}
\end{aligned}
$$
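As a quick sanity check, the chain of equalities above can be verified numerically. The sketch below (the random data and variable names are illustrative assumptions, not part of the result) evaluates the RSS as a component-wise sum, as $(\mathbf{y} - X\mathbf{s})^{T}(\mathbf{y} - X\mathbf{s})$, and as a squared norm.

```python
# Minimal numerical check (assumed setup): the component-wise sum of squared
# residuals equals the matrix form (y - Xs)^T (y - Xs) = ||y - Xs||_2^2.
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3

X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # design matrix with intercept column
s = rng.normal(size=p + 1)                                   # coefficients (s_0, ..., s_p)
y = rng.normal(size=n)                                       # responses

residuals = y - X @ s
rss_sum    = np.sum((y - X @ s) ** 2)        # sum_k (y_k - f(x_k))^2
rss_matrix = residuals @ residuals           # (y - Xs)^T (y - Xs)
rss_norm   = np.linalg.norm(residuals) ** 2  # ||y - Xs||_2^2

assert np.allclose(rss_sum, rss_matrix) and np.allclose(rss_sum, rss_norm)
```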
To generalize this further, for a matrix $R \in \mathbb{R}^{n \times n}$, consider the gradient of a scalar function of the following form.
$$ f(\mathbf{s}) := (\mathbf{y} - X \mathbf{s})^{T} R (\mathbf{y} - X \mathbf{s}) $$
For a vector $\mathbf{y} \in \mathbb{R}^{n}$ and matrices $X \in \mathbb{R}^{n \times p}$ and $R \in \mathbb{R}^{n \times n}$ that are independent of $\mathbf{s}$, the following holds:
$$ \frac{\partial f(\mathbf{s})}{\partial \mathbf{s}} = -X^{T} (R + R^{T}) (\mathbf{y} - X \mathbf{s}) $$
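The stated gradient can be checked against central finite differences. The following sketch uses randomly generated data and an arbitrary, not necessarily symmetric $R$; the setup is an assumption for illustration only.

```python
# Finite-difference check of  d f / d s = -X^T (R + R^T)(y - Xs)  (illustrative data).
import numpy as np

rng = np.random.default_rng(1)
n, p = 20, 4
y = rng.normal(size=n)
X = rng.normal(size=(n, p))
R = rng.normal(size=(n, n))      # general square matrix, not necessarily symmetric
s = rng.normal(size=p)

f = lambda s: (y - X @ s) @ R @ (y - X @ s)   # f(s) = (y - Xs)^T R (y - Xs)

analytic = -X.T @ (R + R.T) @ (y - X @ s)

eps = 1e-6
numeric = np.array([
    (f(s + eps * np.eye(p)[j]) - f(s - eps * np.eye(p)[j])) / (2 * eps)
    for j in range(p)
])

assert np.allclose(analytic, numeric, atol=1e-4)
```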
Derivation
Properties of the transpose: for $r, s \in \mathbb{R}$ and matrices $A, B$ whose sizes make the operations well-defined, the following holds:
- (a) Linearity: $(rA + sB)^{T} = r A^{T} + s B^{T}$
Gradient of vectors and matrices:
$$ \frac{\partial}{\partial \mathbf{w}} \mathbf{w}^{T} \mathbf{x} = \frac{\partial}{\partial \mathbf{w}} \mathbf{x}^{T} \mathbf{w} = \mathbf{x} $$
$$ \frac{\partial}{\partial \mathbf{w}} \left( \mathbf{w}^{T} R \mathbf{w} \right) = (R + R^{T}) \mathbf{w} $$
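Both auxiliary gradient identities (and the transpose linearity above) admit a quick numerical spot-check; the data and names below are purely illustrative assumptions.

```python
# Spot-check of (rA + tB)^T = rA^T + tB^T, d(w^T x)/dw = x, and d(w^T R w)/dw = (R + R^T) w.
import numpy as np

rng = np.random.default_rng(2)
m = 5
w = rng.normal(size=m)
x = rng.normal(size=m)
R = rng.normal(size=(m, m))
eps = 1e-6

# (a) linearity of the transpose (r, t are arbitrary scalars)
A, B = rng.normal(size=(3, 4)), rng.normal(size=(3, 4))
r, t = 2.0, -0.5
assert np.allclose((r * A + t * B).T, r * A.T + t * B.T)

# d(w^T x)/dw = x, via central differences
grad_lin = np.array([
    ((w + eps * np.eye(m)[j]) @ x - (w - eps * np.eye(m)[j]) @ x) / (2 * eps)
    for j in range(m)
])
assert np.allclose(grad_lin, x, atol=1e-6)

# d(w^T R w)/dw = (R + R^T) w, via central differences
q = lambda w: w @ R @ w
grad_quad = np.array([
    (q(w + eps * np.eye(m)[j]) - q(w - eps * np.eye(m)[j])) / (2 * eps)
    for j in range(m)
])
assert np.allclose(grad_quad, (R + R.T) @ w, atol=1e-4)
```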
$$
\begin{aligned}
\frac{\partial f(\mathbf{s})}{\partial \mathbf{s}} &= \frac{\partial}{\partial \mathbf{s}} (\mathbf{y} - X \mathbf{s})^{T} R (\mathbf{y} - X \mathbf{s}) \\
&= \frac{\partial}{\partial \mathbf{s}} \left( \mathbf{y}^{T} - \mathbf{s}^{T} X^{T} \right) R (\mathbf{y} - X \mathbf{s}) \\
&= \frac{\partial}{\partial \mathbf{s}} \left( \mathbf{y}^{T} R \mathbf{y} - \mathbf{s}^{T} X^{T} R \mathbf{y} - \mathbf{y}^{T} R X \mathbf{s} + \mathbf{s}^{T} X^{T} R X \mathbf{s} \right) \\
&= - X^{T} R \mathbf{y} - X^{T} R^{T} \mathbf{y} + X^{T} (R + R^{T}) X \mathbf{s} \\
&= - X^{T} (R + R^{T}) \mathbf{y} + X^{T} (R + R^{T}) X \mathbf{s} \\
&= - X^{T} (R + R^{T}) (\mathbf{y} - X \mathbf{s})
\end{aligned}
$$
Here the constant term $\mathbf{y}^{T} R \mathbf{y}$ differentiates to zero, and the remaining terms are differentiated using the gradient identities above.
■
Corollary 1
As a corollary, if $R$ is a symmetric matrix, then
$$ \frac{\partial f(\mathbf{s})}{\partial \mathbf{s}} = -2 X^{T} R (\mathbf{y} - X \mathbf{s}) $$
and if $R$ is the identity matrix, then we obtain:
$$ \frac{\partial f(\mathbf{s})}{\partial \mathbf{s}} = -2 X^{T} (\mathbf{y} - X \mathbf{s}) $$
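For $R = I$ this is the familiar least-squares gradient: setting it to zero yields the normal equations $X^{T} X \mathbf{s} = X^{T} \mathbf{y}$. The small sketch below (with assumed random data) confirms that this gradient vanishes at the least-squares solution.

```python
# Illustrative check: the gradient -2 X^T (y - Xs) is zero at the least-squares solution.
import numpy as np

rng = np.random.default_rng(3)
n, p = 100, 3
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

s_hat, *_ = np.linalg.lstsq(X, y, rcond=None)   # ordinary least-squares solution
grad_at_hat = -2 * X.T @ (y - X @ s_hat)

assert np.allclose(grad_at_hat, np.zeros(p), atol=1e-8)
```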
Corollary 2
For the Hadamard product $\odot$, define $f(\mathbf{s}) := \left\| X (\boldsymbol{\tau} \odot \mathbf{s}) - \mathbf{y} \right\|_{2}^{2}$. Since $X (\boldsymbol{\tau} \odot \mathbf{s}) = X \operatorname{diag}(\boldsymbol{\tau}) \mathbf{s}$, we obtain:
$$
\begin{aligned}
\frac{\partial f(\mathbf{s})}{\partial \mathbf{s}} &= 2 \left( X \operatorname{diag}(\boldsymbol{\tau}) \right)^{T} \left( X \operatorname{diag}(\boldsymbol{\tau}) \mathbf{s} - \mathbf{y} \right) \\
&= 2 \operatorname{diag}(\boldsymbol{\tau})^{T} X^{T} \left( X (\boldsymbol{\tau} \odot \mathbf{s}) - \mathbf{y} \right) \\
&= 2 \boldsymbol{\tau} \odot X^{T} \left( X (\boldsymbol{\tau} \odot \mathbf{s}) - \mathbf{y} \right)
\end{aligned}
$$
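A numerical sanity check (with an assumed, illustrative setup) that the three expressions above coincide:

```python
# Check that the three forms of the gradient of f(s) = ||X(tau ⊙ s) - y||_2^2 agree.
import numpy as np

rng = np.random.default_rng(4)
n, p = 30, 4
X = rng.normal(size=(n, p))
y = rng.normal(size=n)
tau = rng.normal(size=p)
s = rng.normal(size=p)

D = np.diag(tau)                    # diag(tau)
resid = X @ (tau * s) - y           # X(tau ⊙ s) - y

g1 = 2 * (X @ D).T @ (X @ D @ s - y)
g2 = 2 * D.T @ X.T @ resid
g3 = 2 * tau * (X.T @ resid)        # tau ⊙ X^T (X(tau ⊙ s) - y)

assert np.allclose(g1, g2) and np.allclose(g2, g3)
```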