Matrix Calculus of Quadratic and Bilinear Forms

Formula

  • For two vectors $\mathbf{a} \in \mathbb{R}^{m}$, $\mathbf{b} \in \mathbb{R}^{n}$ and a matrix $\mathbf{X} \in \mathbb{R}^{m \times n}$, the gradient matrix of the bilinear form $\mathbf{a}^{\mathsf{T}} \mathbf{X} \mathbf{b}$ is as follows.

    $$ \nabla_{\mathbf{X}} (\mathbf{a}^{\mathsf{T}} \mathbf{X} \mathbf{b}) = \dfrac{\partial (\mathbf{a}^{\mathsf{T}} \mathbf{X} \mathbf{b})}{\partial \mathbf{X}} = \mathbf{a}\mathbf{b}^{\mathsf{T}} \tag{1} $$
    As a corollary, for the quadratic form $\mathbf{a}^{\mathsf{T}} \mathbf{X} \mathbf{a}$ (the case $m = n$ and $\mathbf{b} = \mathbf{a}$), the following holds.
    $$ \dfrac{\partial (\mathbf{a}^{\mathsf{T}} \mathbf{X} \mathbf{a})}{\partial \mathbf{X}} = \mathbf{a}\mathbf{a}^{\mathsf{T}} $$

  • For two vectors $\mathbf{a} \in \mathbb{R}^{n}$, $\mathbf{b} \in \mathbb{R}^{n}$ and a matrix $\mathbf{X} \in \mathbb{R}^{m \times n}$, the gradient matrix of $\mathbf{a}^{\mathsf{T}} \mathbf{X}^{\mathsf{T}}\mathbf{X} \mathbf{b}$ is as follows.

    $$ \nabla_{\mathbf{X}} (\mathbf{a}^{\mathsf{T}} \mathbf{X}^{\mathsf{T}} \mathbf{X} \mathbf{b}) = \dfrac{\partial (\mathbf{a}^{\mathsf{T}} \mathbf{X}^{\mathsf{T}} \mathbf{X} \mathbf{b})}{\partial \mathbf{X}} = \mathbf{X}(\mathbf{a} \mathbf{b}^{\mathsf{T}} + \mathbf{b}\mathbf{a}^{\mathsf{T}} ) \tag{2} $$

    If $\mathbf{b} = \mathbf{a}$, then

    $$ \dfrac{\partial (\mathbf{a}^{\mathsf{T}} \mathbf{X}^{\mathsf{T}} \mathbf{X} \mathbf{a})}{\partial \mathbf{X}} = 2\mathbf{X}\mathbf{a} \mathbf{a}^{\mathsf{T}} $$
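
Both formulas can be sanity-checked numerically. The following is a minimal sketch in plain Python (no external libraries) that compares central finite differences with the closed forms for one random $\mathbf{X}$. The helper names (`matvec`, `outer`, `numerical_gradient`, and so on) and the vector names `c`, `d` used for the second formula are illustrative assumptions, not part of the original text.

```python
import random

def matvec(X, v):
    """Matrix-vector product X v, with X stored as a list of rows."""
    return [sum(x * vj for x, vj in zip(row, v)) for row in X]

def dot(u, v):
    """Inner product u^T v."""
    return sum(ui * vi for ui, vi in zip(u, v))

def outer(u, v):
    """Outer product u v^T as a list of rows."""
    return [[ui * vj for vj in v] for ui in u]

def add(A, B):
    """Entrywise sum of two matrices of equal shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(A, B)]

def numerical_gradient(f, X, h=1e-6):
    """Central-difference estimate of the matrix of partials df/dx_ij."""
    grad = [[0.0] * len(X[0]) for _ in X]
    for i in range(len(X)):
        for j in range(len(X[0])):
            old = X[i][j]
            X[i][j] = old + h
            f_plus = f(X)
            X[i][j] = old - h
            f_minus = f(X)
            X[i][j] = old  # restore the entry before moving on
            grad[i][j] = (f_plus - f_minus) / (2 * h)
    return grad

def max_abs_diff(A, B):
    """Largest entrywise absolute difference between two matrices."""
    return max(abs(x - y) for ra, rb in zip(A, B) for x, y in zip(ra, rb))

random.seed(0)
m, n = 3, 4
X = [[random.uniform(-1, 1) for _ in range(n)] for _ in range(m)]

# Formula (1): a in R^m, b in R^n, gradient of a^T X b is a b^T.
a = [random.uniform(-1, 1) for _ in range(m)]
b = [random.uniform(-1, 1) for _ in range(n)]
num1 = numerical_gradient(lambda M: dot(a, matvec(M, b)), X)
err1 = max_abs_diff(num1, outer(a, b))

# Formula (2): c, d in R^n, gradient of c^T X^T X d is X (c d^T + d c^T),
# expanded here as (X c) d^T + (X d) c^T.
c = [random.uniform(-1, 1) for _ in range(n)]
d = [random.uniform(-1, 1) for _ in range(n)]
num2 = numerical_gradient(lambda M: dot(matvec(M, c), matvec(M, d)), X)
ana2 = add(outer(matvec(X, c), d), outer(matvec(X, d), c))
err2 = max_abs_diff(num2, ana2)

print(err1, err2)  # both should be close to zero
```

Central differences are used because they are exact, up to rounding, for functions that are at most quadratic in the entries of $\mathbf{X}$, which covers both cases here.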

Explanation

The result is similar to the differentiation of polynomial functions.

  • $(1)$: This is analogous to differentiating a linear function, so only the coefficients remain. The one point that can be confusing is that the result must be an $m \times n$ matrix, the same shape as $\mathbf{X}$, so it is the outer product $\mathbf{a}\mathbf{b}^{\mathsf{T}}$ rather than the inner product $\mathbf{a}^{\mathsf{T}}\mathbf{b}$.
  • $(2)$: Both $\mathbf{a}^{\mathsf{T}} \mathbf{X} \mathbf{a}$ and $\mathbf{a}^{\mathsf{T}} \mathbf{X}^{\mathsf{T}}\mathbf{X} \mathbf{a}$ are quadratic forms in $\mathbf{a}$, but their derivatives with respect to $\mathbf{X}$ differ because $\mathbf{X}$ appears once in the first and twice in the second. The second case therefore corresponds to differentiating a quadratic function.
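
The analogy can be made concrete by taking $m = n = 1$, so that $\mathbf{a}$, $\mathbf{b}$ and $\mathbf{X}$ reduce to scalars $a$, $b$ and $x$ (an illustrative special case, not part of the original statement). Formulas $(1)$ and $(2)$ then become the familiar rules:

$$ \dfrac{d}{dx}(a x b) = ab, \qquad \dfrac{d}{dx}(a x^{2} b) = x(ab + ba) = 2abx $$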

In the proof below, $(2)$ is first shown by direct computation, but the method called the trace trick gives a simpler calculation. For a general expression in $\mathbf{X}$, or one containing many occurrences of $\mathbf{X}$, direct computation quickly becomes impractical and the trace trick is needed.

More formulas can be found in the Matrix Differentiation Table for Scalar Functions.

Proof

$(1)$

The bilinear form can be expressed as $\mathbf{a}^{\mathsf{T}} \mathbf{X} \mathbf{b} = \sum\limits_{i=1}^{m}\sum\limits_{j=1}^{n} a_{i} x_{ij} b_{j}$. Therefore, $\dfrac{\partial \left( \sum\limits_{i=1}^{m}\sum\limits_{j=1}^{n} a_{i} x_{ij} b_{j} \right)}{\partial x_{k\ell}} = a_{k}b_{\ell}$, and

\begin{align*}
\nabla_{\mathbf{X}} (\mathbf{a}^{\mathsf{T}} \mathbf{X} \mathbf{b})
&= \begin{bmatrix} \dfrac{\partial (\mathbf{a}^{\mathsf{T}} \mathbf{X} \mathbf{b})}{\partial x_{11}} & \cdots & \dfrac{\partial (\mathbf{a}^{\mathsf{T}} \mathbf{X} \mathbf{b})}{\partial x_{1n}} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial (\mathbf{a}^{\mathsf{T}} \mathbf{X} \mathbf{b})}{\partial x_{m1}} & \cdots & \dfrac{\partial (\mathbf{a}^{\mathsf{T}} \mathbf{X} \mathbf{b})}{\partial x_{mn}} \end{bmatrix} \\
&= \begin{bmatrix} a_{1}b_{1} & a_{1}b_{2} & \cdots & a_{1}b_{n} \\ a_{2}b_{1} & a_{2}b_{2} & \cdots & a_{2}b_{n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m}b_{1} & a_{m}b_{2} & \cdots & a_{m}b_{n} \end{bmatrix} \\
&= \begin{bmatrix} a_{1} \\ \vdots \\ a_{m} \end{bmatrix} \begin{bmatrix} b_{1} & \cdots & b_{n} \end{bmatrix} \\
&= \mathbf{a} \mathbf{b}^{\mathsf{T}} = \mathbf{a} \otimes \mathbf{b}
\end{align*}

Here $\otimes$ denotes the outer product of two vectors; the same symbol is also used for the Kronecker product of two matrices.
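
As a small worked instance (the numbers are chosen purely for illustration), take $m = 2$, $n = 3$, $\mathbf{a} = \begin{bmatrix} 1 & 2 \end{bmatrix}^{\mathsf{T}}$ and $\mathbf{b} = \begin{bmatrix} 3 & 4 & 5 \end{bmatrix}^{\mathsf{T}}$. Expanding the double sum and reading off the coefficient of each $x_{k\ell}$ gives

$$ \mathbf{a}^{\mathsf{T}} \mathbf{X} \mathbf{b} = 3x_{11} + 4x_{12} + 5x_{13} + 6x_{21} + 8x_{22} + 10x_{23}, \qquad \nabla_{\mathbf{X}} (\mathbf{a}^{\mathsf{T}} \mathbf{X} \mathbf{b}) = \begin{bmatrix} 3 & 4 & 5 \\ 6 & 8 & 10 \end{bmatrix} = \begin{bmatrix} 1 \\ 2 \end{bmatrix} \begin{bmatrix} 3 & 4 & 5 \end{bmatrix} = \mathbf{a}\mathbf{b}^{\mathsf{T}} $$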

$(2)$

Direct Calculation

The product of two matrices is $[AB]_{ij} = \sum\limits_{k=1}^{n} a_{ik}b_{kj}$, so $[\mathbf{X}^{\mathsf{T}}\mathbf{X}]_{ij} = \sum\limits_{s=1}^{m} x_{si} x_{sj}$. The following holds:

$$ \mathbf{a}^{\mathsf{T}} \mathbf{X}^{\mathsf{T}}\mathbf{X} \mathbf{b} = \sum\limits_{k, \ell = 1}^{n} a_{k} \left(\sum\limits_{s=1}^{m} x_{sk} x_{s\ell}\right) b_{\ell} = \sum\limits_{k,\ell = 1}^{n} \sum\limits_{s=1}^{m} a_{k} x_{sk} x_{s\ell} b_{\ell} $$

Hence, the partial derivative with respect to $x_{ij}$ is as follows:

$$ \dfrac{\partial \left( \sum\limits_{k,\ell = 1}^{n} \sum\limits_{s=1}^{m} a_{k} x_{sk} x_{s\ell} b_{\ell} \right)}{\partial x_{ij}} = \sum\limits_{\ell=1}^{n}a_{j}x_{i\ell}b_{\ell} + \sum\limits_{k = 1}^{n} a_{k}x_{ik}b_{j} $$

The gradient matrix is as follows:

\begin{align*}
& \nabla_{\mathbf{X}} (\mathbf{a}^{\mathsf{T}} \mathbf{X}^{\mathsf{T}} \mathbf{X} \mathbf{b}) \\
&= \begin{bmatrix}
\sum\limits_{\ell=1}^{n}a_{1}x_{1\ell}b_{\ell} + \sum\limits_{k = 1}^{n} a_{k}x_{1k}b_{1} & \sum\limits_{\ell=1}^{n}a_{2}x_{1\ell}b_{\ell} + \sum\limits_{k = 1}^{n} a_{k}x_{1k}b_{2} & \cdots & \sum\limits_{\ell=1}^{n}a_{n}x_{1\ell}b_{\ell} + \sum\limits_{k = 1}^{n} a_{k}x_{1k}b_{n} \\
\sum\limits_{\ell=1}^{n}a_{1}x_{2\ell}b_{\ell} + \sum\limits_{k = 1}^{n} a_{k}x_{2k}b_{1} & \sum\limits_{\ell=1}^{n}a_{2}x_{2\ell}b_{\ell} + \sum\limits_{k = 1}^{n} a_{k}x_{2k}b_{2} & \cdots & \sum\limits_{\ell=1}^{n}a_{n}x_{2\ell}b_{\ell} + \sum\limits_{k = 1}^{n} a_{k}x_{2k}b_{n} \\
\vdots & \vdots & \ddots & \vdots \\
\sum\limits_{\ell=1}^{n}a_{1}x_{m\ell}b_{\ell} + \sum\limits_{k = 1}^{n} a_{k}x_{mk}b_{1} & \sum\limits_{\ell=1}^{n}a_{2}x_{m\ell}b_{\ell} + \sum\limits_{k = 1}^{n} a_{k}x_{mk}b_{2} & \cdots & \sum\limits_{\ell=1}^{n}a_{n}x_{m\ell}b_{\ell} + \sum\limits_{k = 1}^{n} a_{k}x_{mk}b_{n}
\end{bmatrix} \\
&= \begin{bmatrix}
\sum\limits_{\ell=1}^{n}a_{1}x_{1\ell}b_{\ell} & \cdots & \sum\limits_{\ell=1}^{n}a_{n}x_{1\ell}b_{\ell} \\
\vdots & \ddots & \vdots \\
\sum\limits_{\ell=1}^{n}a_{1}x_{m\ell}b_{\ell} & \cdots & \sum\limits_{\ell=1}^{n}a_{n}x_{m\ell}b_{\ell}
\end{bmatrix}
+ \begin{bmatrix}
\sum\limits_{k = 1}^{n} a_{k}x_{1k}b_{1} & \cdots & \sum\limits_{k = 1}^{n} a_{k}x_{1k}b_{n} \\
\vdots & \ddots & \vdots \\
\sum\limits_{k = 1}^{n} a_{k}x_{mk}b_{1} & \cdots & \sum\limits_{k = 1}^{n} a_{k}x_{mk}b_{n}
\end{bmatrix} \\
&= \begin{bmatrix}
\sum\limits_{\ell=1}^{n}x_{1\ell}b_{\ell}a_{1} & \cdots & \sum\limits_{\ell=1}^{n}x_{1\ell}b_{\ell}a_{n} \\
\vdots & \ddots & \vdots \\
\sum\limits_{\ell=1}^{n}x_{m\ell}b_{\ell}a_{1} & \cdots & \sum\limits_{\ell=1}^{n}x_{m\ell}b_{\ell}a_{n}
\end{bmatrix}
+ \begin{bmatrix}
\sum\limits_{k = 1}^{n} x_{1k}a_{k}b_{1} & \cdots & \sum\limits_{k = 1}^{n} x_{1k}a_{k}b_{n} \\
\vdots & \ddots & \vdots \\
\sum\limits_{k = 1}^{n} x_{mk}a_{k}b_{1} & \cdots & \sum\limits_{k = 1}^{n} x_{mk}a_{k}b_{n}
\end{bmatrix}
\end{align*}

Thus, the following is obtained:

$$ \nabla_{\mathbf{X}} (\mathbf{a}^{\mathsf{T}} \mathbf{X}^{\mathsf{T}} \mathbf{X} \mathbf{b}) = \mathbf{X}\mathbf{b}\mathbf{a}^{\mathsf{T}} + \mathbf{X}\mathbf{a}\mathbf{b}^{\mathsf{T}} = \mathbf{X} ( \mathbf{a}\mathbf{b}^{\mathsf{T}} + \mathbf{b}\mathbf{a}^{\mathsf{T}} ) $$
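
The step from the entrywise expression to the matrix form can also be checked mechanically. The sketch below, in plain Python with randomly chosen test data (an assumption made only for illustration), builds the gradient entry by entry from the two sums derived above and compares it with $\mathbf{X} ( \mathbf{a}\mathbf{b}^{\mathsf{T}} + \mathbf{b}\mathbf{a}^{\mathsf{T}} )$. The names `entry`, `entrywise`, `S` and `matrix_form` are hypothetical.

```python
import random

random.seed(1)
m, n = 3, 4
X = [[random.uniform(-1, 1) for _ in range(n)] for _ in range(m)]
a = [random.uniform(-1, 1) for _ in range(n)]
b = [random.uniform(-1, 1) for _ in range(n)]

def entry(i, j):
    """Partial derivative with respect to x_ij, as two sums over l and k."""
    first = sum(a[j] * X[i][l] * b[l] for l in range(n))
    second = sum(a[k] * X[i][k] * b[j] for k in range(n))
    return first + second

# Gradient assembled one entry at a time (an m x n matrix).
entrywise = [[entry(i, j) for j in range(n)] for i in range(m)]

# Matrix form: S = a b^T + b a^T is n x n, and the gradient is X S.
S = [[a[k] * b[j] + b[k] * a[j] for j in range(n)] for k in range(n)]
matrix_form = [[sum(X[i][k] * S[k][j] for k in range(n)) for j in range(n)]
               for i in range(m)]

gap = max(abs(entrywise[i][j] - matrix_form[i][j])
          for i in range(m) for j in range(n))
print(gap)  # differs from zero only by floating-point rounding
```

Note that the gradient has $m$ rows and $n$ columns, matching the shape of $\mathbf{X}$.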

Trace Trick

A scalar can be regarded as a $1\times 1$ matrix, on which the trace acts as the identity function. Since the value of the bilinear form is a scalar, the following holds:

$$ \mathbf{a}^{\mathsf{T}} \mathbf{X}^{\mathsf{T}} \mathbf{X} \mathbf{b} = \operatorname{Tr}(\mathbf{a}^{\mathsf{T}} \mathbf{X}^{\mathsf{T}} \mathbf{X} \mathbf{b}) $$

Additionally, since the trace has the cyclic property $\operatorname{Tr}(ABC) = \operatorname{Tr}(BCA) = \operatorname{Tr}(CAB)$, the factor $\mathbf{b}$ can be moved to the front:

$$ \dfrac{\partial (\mathbf{a}^{\mathsf{T}} \mathbf{X}^{\mathsf{T}} \mathbf{X} \mathbf{b})}{\partial \mathbf{X}} = \dfrac{\partial \operatorname{Tr}(\mathbf{a}^{\mathsf{T}} \mathbf{X}^{\mathsf{T}} \mathbf{X} \mathbf{b})}{\partial \mathbf{X}} = \dfrac{\partial \operatorname{Tr}(\mathbf{b}\mathbf{a}^{\mathsf{T}} \mathbf{X}^{\mathsf{T}} \mathbf{X})}{\partial \mathbf{X}} $$

By the trace differentiation formula, the following is obtained:

$$ \dfrac{\partial \operatorname{Tr}(\mathbf{b}\mathbf{a}^{\mathsf{T}} \mathbf{X}^{\mathsf{T}} \mathbf{X})}{\partial \mathbf{X}} = \mathbf{X}(\mathbf{b}\mathbf{a}^{\mathsf{T}} + \mathbf{a}\mathbf{b}^{\mathsf{T}}) $$
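
The trace differentiation formula invoked here is $\dfrac{\partial \operatorname{Tr}(\mathbf{A} \mathbf{X}^{\mathsf{T}} \mathbf{X})}{\partial \mathbf{X}} = \mathbf{X}(\mathbf{A} + \mathbf{A}^{\mathsf{T}})$ for a constant $n \times n$ matrix $\mathbf{A}$, applied with $\mathbf{A} = \mathbf{b}\mathbf{a}^{\mathsf{T}}$. One way to recover it, sketched here with differentials and the convention $df = \operatorname{Tr}(\mathbf{G}^{\mathsf{T}} d\mathbf{X}) \Rightarrow \nabla_{\mathbf{X}} f = \mathbf{G}$ (an assumption about notation, since the original does not spell out the derivation), is:

\begin{align*}
d \operatorname{Tr}(\mathbf{A} \mathbf{X}^{\mathsf{T}} \mathbf{X})
&= \operatorname{Tr}(\mathbf{A} \, d\mathbf{X}^{\mathsf{T}} \mathbf{X}) + \operatorname{Tr}(\mathbf{A} \mathbf{X}^{\mathsf{T}} d\mathbf{X}) \\
&= \operatorname{Tr}(\mathbf{A}^{\mathsf{T}} \mathbf{X}^{\mathsf{T}} d\mathbf{X}) + \operatorname{Tr}(\mathbf{A} \mathbf{X}^{\mathsf{T}} d\mathbf{X}) \\
&= \operatorname{Tr}\left( \left[ \mathbf{X}(\mathbf{A} + \mathbf{A}^{\mathsf{T}}) \right]^{\mathsf{T}} d\mathbf{X} \right)
\end{align*}

The second line uses the cyclic property together with $\operatorname{Tr}(\mathbf{M}) = \operatorname{Tr}(\mathbf{M}^{\mathsf{T}})$.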

In particular, if $\mathbf{b} = \mathbf{a}$, then

$$ \dfrac{\partial (\mathbf{a}^{\mathsf{T}} \mathbf{X}^{\mathsf{T}} \mathbf{X} \mathbf{a})}{\partial \mathbf{X}} = 2\mathbf{X}\mathbf{a}\mathbf{a}^{\mathsf{T}} $$

Using the properties of the trace in this way, to reduce matrix differentiation to straightforward matrix operations, is called the trace trick.
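
The two trace facts the trick relies on can be illustrated with small integer matrices. The sketch below uses plain Python and hand-picked values (all of them illustrative assumptions) to show that a cyclic rotation of the factors preserves the trace, that reversing the order of the factors in general does not, and that the scalar $\mathbf{a}^{\mathsf{T}} \mathbf{X}^{\mathsf{T}} \mathbf{X} \mathbf{b}$ equals $\operatorname{Tr}(\mathbf{b}\mathbf{a}^{\mathsf{T}} \mathbf{X}^{\mathsf{T}} \mathbf{X})$.

```python
def matmul(A, B):
    """Matrix product A B for matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    """Transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*A)]

def trace(A):
    """Sum of the diagonal entries of a square matrix."""
    return sum(A[i][i] for i in range(len(A)))

A = [[1, 2], [3, 4]]
B = [[0, 1], [0, 0]]
C = [[0, 0], [1, 0]]

tr_abc = trace(matmul(matmul(A, B), C))  # original order
tr_cab = trace(matmul(matmul(C, A), B))  # cyclic rotation: same trace
tr_cba = trace(matmul(matmul(C, B), A))  # reversed order: different here

# Scalar identity used in the trace trick, with column vectors as n x 1 matrices.
X = [[1, 2], [3, 4], [5, 6]]  # a 3 x 2 matrix
a = [[1], [-1]]
b = [[2], [3]]
Xt = transpose(X)
scalar = matmul(matmul(matmul(transpose(a), Xt), X), b)[0][0]
rotated = trace(matmul(matmul(matmul(b, transpose(a)), Xt), X))

print(tr_abc, tr_cab, tr_cba)  # 1 1 4
print(scalar, rotated)         # -54 -54
```

Integer inputs keep every comparison exact, so no tolerance is needed.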