
Matrix Calculus of Quadratic and Bilinear Forms 📂Vector Analysis

Formula

  • For two vectors $\mathbf{a} \in \mathbb{R}^{m}$, $\mathbf{b} \in \mathbb{R}^{n}$ and a matrix $\mathbf{X} \in \mathbb{R}^{m \times n}$, the gradient matrix of the bilinear form $\mathbf{a}^{\mathsf{T}} \mathbf{X} \mathbf{b}$ is as follows.

    $$ \nabla_{\mathbf{X}} (\mathbf{a}^{\mathsf{T}} \mathbf{X} \mathbf{b}) = \dfrac{\partial (\mathbf{a}^{\mathsf{T}} \mathbf{X} \mathbf{b})}{\partial \mathbf{X}} = \mathbf{a}\mathbf{b}^{\mathsf{T}} \tag{1} $$
    As a corollary, for the quadratic form $\mathbf{a}^{\mathsf{T}} \mathbf{X} \mathbf{a}$, the following holds.
    $$ \dfrac{\partial (\mathbf{a}^{\mathsf{T}} \mathbf{X} \mathbf{a})}{\partial \mathbf{X}} = \mathbf{a}\mathbf{a}^{\mathsf{T}} $$

  • For two vectors $\mathbf{a} \in \mathbb{R}^{n}$, $\mathbf{b} \in \mathbb{R}^{n}$ and a matrix $\mathbf{X} \in \mathbb{R}^{m \times n}$, the gradient matrix for $\mathbf{a}^{\mathsf{T}} \mathbf{X}^{\mathsf{T}}\mathbf{X} \mathbf{b}$ is as follows.

    $$ \nabla_{\mathbf{X}} (\mathbf{a}^{\mathsf{T}} \mathbf{X}^{\mathsf{T}} \mathbf{X} \mathbf{b}) = \dfrac{\partial (\mathbf{a}^{\mathsf{T}} \mathbf{X}^{\mathsf{T}} \mathbf{X} \mathbf{b})}{\partial \mathbf{X}} = \mathbf{X}(\mathbf{a} \mathbf{b}^{\mathsf{T}} + \mathbf{b}\mathbf{a}^{\mathsf{T}} ) \tag{2} $$

    If $\mathbf{b} = \mathbf{a}$, then

    $$ \dfrac{\partial (\mathbf{a}^{\mathsf{T}} \mathbf{X}^{\mathsf{T}} \mathbf{X} \mathbf{a})}{\partial \mathbf{X}} = 2\mathbf{X}\mathbf{a} \mathbf{a}^{\mathsf{T}} \tag{3} $$
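
Both formulas can be sanity-checked numerically with central finite differences. Below is a minimal NumPy sketch; the shapes, the seed, and the helper `num_grad` are all illustrative assumptions, not part of the derivation.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 3, 4
a, b = rng.standard_normal(m), rng.standard_normal(n)
X = rng.standard_normal((m, n))

def num_grad(f, X, h=1e-6):
    """Central finite-difference gradient of a scalar function f w.r.t. matrix X."""
    G = np.zeros_like(X)
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            E = np.zeros_like(X)
            E[i, j] = h
            G[i, j] = (f(X + E) - f(X - E)) / (2 * h)
    return G

# (1): d(a^T X b)/dX = a b^T
G1 = num_grad(lambda X: a @ X @ b, X)
assert np.allclose(G1, np.outer(a, b), atol=1e-5)

# (2): d(a^T X^T X b)/dX = X (a b^T + b a^T), with a, b now in R^n
a2, b2 = rng.standard_normal(n), rng.standard_normal(n)
G2 = num_grad(lambda X: a2 @ X.T @ X @ b2, X)
assert np.allclose(G2, X @ (np.outer(a2, b2) + np.outer(b2, a2)), atol=1e-5)
```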

Explanation

The result is similar to the differentiation of polynomial functions.

  • $(1)$: Essentially, this is like differentiating a linear function, so only the coefficient remains. What can be confusing is that the result must be a matrix of the same shape as $\mathbf{X}$, which is why it is the $m \times n$ matrix $\mathbf{a}\mathbf{b}^{\mathsf{T}}$ rather than $\mathbf{a}^{\mathsf{T}}\mathbf{b}$.
  • $(2)$: Both $\mathbf{a}^{\mathsf{T}} \mathbf{X} \mathbf{a}$ and $\mathbf{a}^{\mathsf{T}} \mathbf{X}^{\mathsf{T}}\mathbf{X} \mathbf{a}$ are quadratic forms in $\mathbf{a}$, but as functions of $\mathbf{X}$ the former is linear while the latter is quadratic, since $\mathbf{X}$ appears twice; this is why their derivatives differ. Essentially, the latter corresponds to differentiating a quadratic function.

In the proof below, the result is shown by direct computation, but the so-called trace trick allows a much simpler calculation. For an expression involving an arbitrary form of $\mathbf{X}$, or one containing many instances of $\mathbf{X}$, direct computation becomes practically intractable, and the trace trick must be used.

More formulas can be found in the Matrix Differentiation Table for Scalar Functions.

Proof

$(1)$

The bilinear form can be expressed as $\mathbf{a}^{\mathsf{T}} \mathbf{X} \mathbf{b} = \sum\limits_{i=1}^{m}\sum\limits_{j=1}^{n} a_{i} x_{ij} b_{j}$. Therefore $\dfrac{\partial \left( \sum\limits_{i=1}^{m}\sum\limits_{j=1}^{n} a_{i} x_{ij} b_{j} \right)}{\partial x_{k\ell}} = a_{k}b_{\ell}$, and

$$ \begin{align*} \nabla_{\mathbf{X}} (\mathbf{a}^{\mathsf{T}} \mathbf{X} \mathbf{b}) &= \begin{bmatrix} \dfrac{\partial (\mathbf{a}^{\mathsf{T}} \mathbf{X} \mathbf{b})}{\partial x_{11}} & \cdots & \dfrac{\partial (\mathbf{a}^{\mathsf{T}} \mathbf{X} \mathbf{b})}{\partial x_{1n}} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial (\mathbf{a}^{\mathsf{T}} \mathbf{X} \mathbf{b})}{\partial x_{m1}} & \cdots & \dfrac{\partial (\mathbf{a}^{\mathsf{T}} \mathbf{X} \mathbf{b})}{\partial x_{mn}} \end{bmatrix} \\ &= \begin{bmatrix} a_{1}b_{1} & a_{1}b_{2} & \cdots & a_{1}b_{n} \\ a_{2}b_{1} & a_{2}b_{2} & \cdots & a_{2}b_{n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m}b_{1} & a_{m}b_{2} & \cdots & a_{m}b_{n} \end{bmatrix} \\ &= \begin{bmatrix} a_{1} \\ \vdots \\ a_{m} \end{bmatrix} \begin{bmatrix} b_{1} & \cdots & b_{n} \end{bmatrix} \\ &= \mathbf{a} \mathbf{b}^{\mathsf{T}} = \mathbf{a} \otimes \mathbf{b} \end{align*} $$

$\otimes$ denotes the outer product when applied to two vectors, or the Kronecker product when applied to two matrices.
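
For column vectors the two notions coincide up to reshaping; a quick NumPy check (the vectors here are arbitrary examples):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])  # a in R^3
b = np.array([4.0, 5.0])       # b in R^2

outer = np.outer(a, b)                              # a b^T, shape (3, 2)
kron = np.kron(a.reshape(-1, 1), b.reshape(1, -1))  # Kronecker product of (3,1) and (1,2)

assert np.array_equal(outer, kron)
```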

$(2)$

Direct Calculation

Since the product of two matrices satisfies $[AB]_{ij} = \sum\limits_{k} A_{ik}B_{kj}$, we have $[\mathbf{X}^{\mathsf{T}}\mathbf{X}]_{ij} = \sum\limits_{s=1}^{m} x_{si} x_{sj}$. The following holds:

$$ \mathbf{a}^{\mathsf{T}} \mathbf{X}^{\mathsf{T}}\mathbf{X} \mathbf{b} = \sum\limits_{k, \ell = 1}^{n} a_{k} \left(\sum\limits_{s=1}^{m} x_{sk} x_{s\ell}\right) b_{\ell} = \sum\limits_{k,\ell = 1}^{n} \sum\limits_{s=1}^{m} a_{k} x_{sk} x_{s\ell} b_{\ell} $$

Hence, the differentiation is as follows:

$$ \dfrac{\partial \left( \sum\limits_{k,\ell = 1}^{n} \sum\limits_{s=1}^{m} a_{k} x_{sk} x_{s\ell} b_{\ell} \right)}{\partial x_{ij}} = \sum\limits_{\ell=1}^{n}a_{j}x_{i\ell}b_{\ell} + \sum\limits_{k = 1}^{n} a_{k}x_{ik}b_{j} $$

The gradient matrix is as follows:

$$ \begin{align*} & \nabla_{\mathbf{X}} (\mathbf{a}^{\mathsf{T}} \mathbf{X}^{\mathsf{T}} \mathbf{X} \mathbf{b}) \\ &= \begin{bmatrix} \sum\limits_{\ell=1}^{n}a_{1}x_{1\ell}b_{\ell} + \sum\limits_{k = 1}^{n} a_{k}x_{1k}b_{1} & \sum\limits_{\ell=1}^{n}a_{2}x_{1\ell}b_{\ell} + \sum\limits_{k = 1}^{n} a_{k}x_{1k}b_{2} & \cdots & \sum\limits_{\ell=1}^{n}a_{n}x_{1\ell}b_{\ell} + \sum\limits_{k = 1}^{n} a_{k}x_{1k}b_{n} \\ \sum\limits_{\ell=1}^{n}a_{1}x_{2\ell}b_{\ell} + \sum\limits_{k = 1}^{n} a_{k}x_{2k}b_{1} & \sum\limits_{\ell=1}^{n}a_{2}x_{2\ell}b_{\ell} + \sum\limits_{k = 1}^{n} a_{k}x_{2k}b_{2} & \cdots & \sum\limits_{\ell=1}^{n}a_{n}x_{2\ell}b_{\ell} + \sum\limits_{k = 1}^{n} a_{k}x_{2k}b_{n} \\ \vdots & \vdots & \ddots & \vdots \\ \sum\limits_{\ell=1}^{n}a_{1}x_{m\ell}b_{\ell} + \sum\limits_{k = 1}^{n} a_{k}x_{mk}b_{1} & \sum\limits_{\ell=1}^{n}a_{2}x_{m\ell}b_{\ell} + \sum\limits_{k = 1}^{n} a_{k}x_{mk}b_{2} & \cdots & \sum\limits_{\ell=1}^{n}a_{n}x_{m\ell}b_{\ell} + \sum\limits_{k = 1}^{n} a_{k}x_{mk}b_{n} \end{bmatrix} \\ &= \begin{bmatrix} \sum\limits_{\ell=1}^{n}a_{1}x_{1\ell}b_{\ell} & \cdots & \sum\limits_{\ell=1}^{n}a_{n}x_{1\ell}b_{\ell} \\ \vdots & \ddots & \vdots \\ \sum\limits_{\ell=1}^{n}a_{1}x_{m\ell}b_{\ell} & \cdots & \sum\limits_{\ell=1}^{n}a_{n}x_{m\ell}b_{\ell} \end{bmatrix} + \begin{bmatrix} \sum\limits_{k = 1}^{n} a_{k}x_{1k}b_{1} & \cdots &\sum\limits_{k = 1}^{n} a_{k}x_{1k}b_{n} \\ \vdots & \ddots & \vdots \\ \sum\limits_{k = 1}^{n} a_{k}x_{mk}b_{1} & \cdots &\sum\limits_{k = 1}^{n} a_{k}x_{mk}b_{n} \end{bmatrix} \\ &= \begin{bmatrix} \sum\limits_{\ell=1}^{n}x_{1\ell}b_{\ell}a_{1} & \cdots & \sum\limits_{\ell=1}^{n}x_{1\ell}b_{\ell}a_{n} \\ \vdots & \ddots & \vdots \\ \sum\limits_{\ell=1}^{n}x_{m\ell}b_{\ell}a_{1} & \cdots & \sum\limits_{\ell=1}^{n}x_{m\ell}b_{\ell}a_{n} \end{bmatrix} + \begin{bmatrix} \sum\limits_{k = 1}^{n} x_{1k}a_{k}b_{1} & \cdots &\sum\limits_{k = 1}^{n} x_{1k}a_{k}b_{n} \\ \vdots & \ddots & \vdots \\ \sum\limits_{k = 1}^{n} x_{mk}a_{k}b_{1} & \cdots &\sum\limits_{k = 1}^{n} x_{mk}a_{k}b_{n} \end{bmatrix} \\ \end{align*} $$

Thus, the following is obtained:

$$ \nabla_{\mathbf{X}} (\mathbf{a}^{\mathsf{T}} \mathbf{X}^{\mathsf{T}} \mathbf{X} \mathbf{b}) = \mathbf{X}\mathbf{b}\mathbf{a}^{\mathsf{T}} + \mathbf{X}\mathbf{a}\mathbf{b}^{\mathsf{T}} = \mathbf{X} ( \mathbf{a}\mathbf{b}^{\mathsf{T}} + \mathbf{b}\mathbf{a}^{\mathsf{T}} ) $$
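
The entrywise partial derivative obtained above can be cross-checked against this matrix form; a minimal NumPy sketch with arbitrary shapes (the loop is purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 4, 3
X = rng.standard_normal((m, n))
a, b = rng.standard_normal(n), rng.standard_normal(n)

# Entrywise formula: d/dx_ij = a_j * (sum_l x_il b_l) + (sum_k a_k x_ik) * b_j
G = np.empty((m, n))
for i in range(m):
    for j in range(n):
        G[i, j] = a[j] * (X[i] @ b) + (X[i] @ a) * b[j]

# Matrix form: X (a b^T + b a^T)
assert np.allclose(G, X @ (np.outer(a, b) + np.outer(b, a)))
```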

Trace Trick

A scalar can be thought of as a $1\times 1$ matrix, and thus, the trace essentially acts as an identity function. In other words, the value of the bilinear form is a scalar, so the following holds:

$$ \mathbf{a}^{\mathsf{T}} \mathbf{X}^{\mathsf{T}} \mathbf{X} \mathbf{b} = \Tr(\mathbf{a}^{\mathsf{T}} \mathbf{X}^{\mathsf{T}} \mathbf{X} \mathbf{b}) $$

Additionally, since the trace has the cyclic property $\Tr(ABC) = \Tr(CAB) = \Tr(BCA)$,

$$ \dfrac{\partial (\mathbf{a}^{\mathsf{T}} \mathbf{X}^{\mathsf{T}} \mathbf{X} \mathbf{b})}{\partial \mathbf{X}} = \dfrac{\partial \Tr(\mathbf{a}^{\mathsf{T}} \mathbf{X}^{\mathsf{T}} \mathbf{X} \mathbf{b})}{\partial \mathbf{X}} = \dfrac{\partial \Tr(\mathbf{b}\mathbf{a}^{\mathsf{T}} \mathbf{X}^{\mathsf{T}} \mathbf{X})}{\partial \mathbf{X}} $$

By the trace differentiation formula, the following is obtained:

$$ \dfrac{\partial \Tr(\mathbf{b}\mathbf{a}^{\mathsf{T}} \mathbf{X}^{\mathsf{T}} \mathbf{X})}{\partial \mathbf{X}} = \mathbf{X}(\mathbf{b}\mathbf{a}^{\mathsf{T}} + \mathbf{a}\mathbf{b}^{\mathsf{T}}) $$

At this point, if $\mathbf{b} = \mathbf{a}$, then

$$ \dfrac{\partial (\mathbf{a}^{\mathsf{T}} \mathbf{X}^{\mathsf{T}} \mathbf{X} \mathbf{a})}{\partial \mathbf{X}} = 2\mathbf{X}\mathbf{a}\mathbf{a}^{\mathsf{T}} $$

In this way, the method of exploiting the properties of the trace to reduce matrix differentiation to simple matrix algebra is called the trace trick.
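
The two properties used above — a scalar equals its own trace, and the trace is cyclic — can be checked numerically; a small NumPy sketch with arbitrary shapes:

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 4, 3
X = rng.standard_normal((m, n))
a, b = rng.standard_normal(n), rng.standard_normal(n)

s = a @ X.T @ X @ b                     # scalar bilinear form a^T X^T X b
t = np.trace(np.outer(b, a) @ X.T @ X)  # Tr(b a^T X^T X), after cycling
assert np.isclose(s, t)
```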