
Gradient of the Sum of Squared Residuals

Overview

In many regression problems in statistics and machine learning, the sum of squared residuals serves as the objective function. In particular, when $f$ is a linear combination, it can be expressed succinctly in matrix form:
$$
\begin{align*} RSS =& \sum_{k} \left( y_{k} - f \left( \mathbf{x}_{k} \right) \right)^{2} \\ =& \sum_{k} \left( y_{k} - \left( s_{0} + s_{1} x_{k1} + \cdots + s_{p} x_{kp} \right) \right)^{2} \\ =& \left( \mathbf{y} - X \mathbf{s} \right)^{T} \left( \mathbf{y} - X \mathbf{s} \right) \\ =& \left\| \mathbf{y} - X \mathbf{s} \right\|_{2}^{2} \end{align*}
$$
To generalize this further, consider the gradient of a scalar function defined with a matrix $R \in \mathbb{R}^{n \times n}$, which takes the following form.
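To see the equivalence of the summation and matrix forms concretely, here is a minimal NumPy check; the data, sizes, and seed are arbitrary stand-ins, with the intercept $s_{0}$ absorbed into a leading column of ones in $X$.

```python
import numpy as np

# Hypothetical data: n observations, p features, with the intercept s_0
# absorbed into a leading column of ones in the design matrix X.
rng = np.random.default_rng(0)
n, p = 50, 3
features = rng.normal(size=(n, p))
X = np.hstack([np.ones((n, 1)), features])   # n x (p+1)
s = rng.normal(size=p + 1)                   # (s_0, s_1, ..., s_p)
y = rng.normal(size=n)

# Summation form: sum_k (y_k - (s_0 + s_1 x_k1 + ... + s_p x_kp))^2
rss_sum = sum((y[k] - (s[0] + features[k] @ s[1:])) ** 2 for k in range(n))
# Matrix form: ||y - Xs||_2^2
rss_mat = np.sum((y - X @ s) ** 2)

print(np.isclose(rss_sum, rss_mat))          # True
```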

Formula 1

$$ f \left( \mathbf{s} \right) := \left( \mathbf{y} - X \mathbf{s} \right)^{T} R \left( \mathbf{y} - X \mathbf{s} \right) $$
For a vector $\mathbf{y} \in \mathbb{R}^{n}$ and matrices $X \in \mathbb{R}^{n \times p}$ and $R \in \mathbb{R}^{n \times n}$ that are independent of $\mathbf{s}$, the following holds:
$$ {{ \partial f \left( \mathbf{s} \right) } \over { \partial \mathbf{s} }} = - X^{T} \left( R + R^{T} \right) \left( \mathbf{y} - X \mathbf{s} \right) $$

Derivation

Properties of transpose matrices: Assuming $r, s \in \mathbb{R}$ and that $A, B$ have sizes for which the matrix operations are well-defined, the following holds:

  • (a) Linearity: $\left( rA + sB \right)^{T} = r A^{T} + s B^{T}$

Gradient of vectors and matrices:
$$ \frac{ \partial \mathbf{w}^{T}\mathbf{x}}{ \partial \mathbf{w} } = \frac{ \partial \mathbf{x}^{T}\mathbf{w}}{ \partial \mathbf{w} } = \mathbf{x} $$
$$ \frac{ \partial }{ \partial \mathbf{w} }\left( \mathbf{w}^{T}\mathbf{R}\mathbf{w} \right) = \left( \mathbf{R} + \mathbf{R}^{T} \right) \mathbf{w} $$
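These identities are easy to sanity-check numerically. The sketch below uses arbitrary test data; `num_grad` is a small helper written just for this check, comparing each identity against a central-difference approximation.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
w = rng.normal(size=n)
x = rng.normal(size=n)
R = rng.normal(size=(n, n))                  # deliberately non-symmetric

def num_grad(f, w, h=1e-6):
    """Central-difference approximation of the gradient of f at w."""
    g = np.zeros_like(w)
    for i in range(len(w)):
        e = np.zeros_like(w); e[i] = h
        g[i] = (f(w + e) - f(w - e)) / (2 * h)
    return g

# d(w^T x)/dw = x
print(np.allclose(num_grad(lambda v: v @ x, w), x))
# d(w^T R w)/dw = (R + R^T) w
print(np.allclose(num_grad(lambda v: v @ R @ v, w), (R + R.T) @ w))
```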

Expanding the quadratic form, dropping $\mathbf{y}^{T} R \mathbf{y}$, which does not depend on $\mathbf{s}$, and noting that $\mathbf{y}^{T} R X \mathbf{s} = \left( X^{T} R^{T} \mathbf{y} \right)^{T} \mathbf{s}$, the two lemmas above give
$$ \begin{align*} {{ \partial } \over { \partial \mathbf{s} }} f \left( \mathbf{s} \right) =& {{ \partial } \over { \partial \mathbf{s} }} \left( \mathbf{y} - X \mathbf{s} \right)^{T} R \left( \mathbf{y} - X \mathbf{s} \right) \\ =& {{ \partial } \over { \partial \mathbf{s} }} \left( \mathbf{y}^{T} - \mathbf{s}^{T} X^{T} \right) R \left( \mathbf{y} - X \mathbf{s} \right) \\ =& {{ \partial } \over { \partial \mathbf{s} }} \left( - \mathbf{s}^{T} X^{T} R \mathbf{y} - \mathbf{y}^{T} R X \mathbf{s} + \mathbf{s}^{T} X^{T} R X \mathbf{s} \right) \\ =& - X^{T} R \mathbf{y} - X^{T} R^{T} \mathbf{y} + X^{T} \left( R + R^{T} \right) X \mathbf{s} \\ =& - X^{T} \left( R + R^{T} \right) \mathbf{y} + X^{T} \left( R + R^{T} \right) X \mathbf{s} \\ =& - X^{T} \left( R + R^{T} \right) \left( \mathbf{y} - X \mathbf{s} \right) \end{align*} $$
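As a sanity check, the following sketch compares Formula 1 against a central-difference approximation of the gradient, using arbitrary data and a deliberately non-symmetric $R$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 6, 3
X = rng.normal(size=(n, p))
y = rng.normal(size=n)
R = rng.normal(size=(n, n))                  # non-symmetric on purpose
s = rng.normal(size=p)

# f(s) = (y - Xs)^T R (y - Xs)
f = lambda v: (y - X @ v) @ R @ (y - X @ v)
# Formula 1: grad f(s) = -X^T (R + R^T)(y - Xs)
analytic = -X.T @ (R + R.T) @ (y - X @ s)

h = 1e-6
numeric = np.array([
    (f(s + h * e) - f(s - h * e)) / (2 * h)
    for e in np.eye(p)
])
print(np.allclose(analytic, numeric))        # True
```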

Corollary 1

As a corollary, if $R$ is a symmetric matrix, then
$$ {{ \partial f \left( \mathbf{s} \right) } \over { \partial \mathbf{s} }} = - 2 X^{T} R \left( \mathbf{y} - X \mathbf{s} \right) $$
and if $R$ is the identity matrix, we obtain:
$$ {{ \partial f \left( \mathbf{s} \right) } \over { \partial \mathbf{s} }} = - 2 X^{T} \left( \mathbf{y} - X \mathbf{s} \right) $$
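Setting the $R = I$ gradient to zero recovers the normal equations $X^{T} X \mathbf{s} = X^{T} \mathbf{y}$, and the same gradient drives descent methods for least squares. Below is a minimal gradient-descent sketch on arbitrary data, with the step size chosen by hand for this example; it is an illustration, not a robust solver.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 100, 4
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

s = np.zeros(p)
lr = 1e-3                                    # hand-tuned for this data
for _ in range(2000):
    grad = -2 * X.T @ (y - X @ s)            # Corollary 1 with R = I
    s -= lr * grad

# Compare against the closed-form least-squares solution
s_exact = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.allclose(s, s_exact, atol=1e-6))    # converges to the OLS solution
```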

Corollary 2

For the Hadamard product $\odot$, define $f(\mathbf{s}) := \left\| X(\boldsymbol{\tau} \odot \mathbf{s}) - \mathbf{y} \right\|_{2}^{2}$. Since $X(\boldsymbol{\tau} \odot \mathbf{s}) = X \diag(\boldsymbol{\tau}) \mathbf{s}$, applying Corollary 1 with $X \diag(\boldsymbol{\tau})$ in place of $X$ yields

$$ \begin{align*} \dfrac{\partial f(\mathbf{s})}{\partial \mathbf{s}} & = 2 \left( X \diag(\boldsymbol{\tau}) \right)^{T} \left( X \diag(\boldsymbol{\tau})\mathbf{s} - \mathbf{y}\right) \\ & = 2 \diag(\boldsymbol{\tau})^{T} X^{T} \left( X (\boldsymbol{\tau} \odot \mathbf{s}) - \mathbf{y}\right) \\ & = 2 \boldsymbol{\tau} \odot X^{T} \left( X (\boldsymbol{\tau} \odot \mathbf{s}) - \mathbf{y}\right) \end{align*} $$
where the last equality uses $\diag(\boldsymbol{\tau})^{T} = \diag(\boldsymbol{\tau})$ and $\diag(\boldsymbol{\tau}) \mathbf{v} = \boldsymbol{\tau} \odot \mathbf{v}$.
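As with Formula 1, this can be verified against finite differences. The sketch below uses arbitrary data and the fact that `tau * s` computes $\boldsymbol{\tau} \odot \mathbf{s}$ in NumPy.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 6, 3
X = rng.normal(size=(n, p))
y = rng.normal(size=n)
tau = rng.normal(size=p)
s = rng.normal(size=p)

# f(s) = ||X(tau ⊙ s) - y||_2^2
f = lambda v: np.sum((X @ (tau * v) - y) ** 2)
# Corollary 2: grad f(s) = 2 tau ⊙ X^T (X(tau ⊙ s) - y)
analytic = 2 * tau * (X.T @ (X @ (tau * s) - y))

h = 1e-6
numeric = np.array([(f(s + h * e) - f(s - h * e)) / (2 * h) for e in np.eye(p)])
print(np.allclose(analytic, numeric))        # True
```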


  1. Petersen. (2008). The Matrix Cookbook: p10.