Matrix Calculus of Quadratic and Bilinear Forms
📂Vector Analysis Matrix Calculus of Quadratic and Bilinear Forms For two vectors a ∈ R m \mathbf{a} \in \mathbb{R}^{m} a ∈ R m , b ∈ R n \mathbf{b} \in \mathbb{R}^{n} b ∈ R n and a matrix X ∈ R m × n \mathbf{X} \in \mathbb{R}^{m \times n} X ∈ R m × n , the gradient matrix of the bilinear form a T X b \mathbf{a}^{\mathsf{T}} \mathbf{X} \mathbf{b} a T Xb is as follows.
∇ X ( a T X b ) = ∂ ( a T X b ) ∂ X = a b T (1)
\nabla_{\mathbf{X}} (\mathbf{a}^{\mathsf{T}} \mathbf{X} \mathbf{b}) =
\dfrac{\partial (\mathbf{a}^{\mathsf{T}} \mathbf{X} \mathbf{b})}{\partial \mathbf{X}} =
\mathbf{a}\mathbf{b}^{\mathsf{T}} \tag{1}
∇ X ( a T Xb ) = ∂ X ∂ ( a T Xb ) = a b T ( 1 ) As a corollary, for quadratic form a T X a \mathbf{a}^{\mathsf{T}} \mathbf{X} \mathbf{a} a T Xa , the following holds.∂ ( a T X a ) ∂ X = a a T
\dfrac{\partial (\mathbf{a}^{\mathsf{T}} \mathbf{X} \mathbf{a})}{\partial \mathbf{X}} =
\mathbf{a}\mathbf{a}^{\mathsf{T}}
∂ X ∂ ( a T Xa ) = a a T
For two vectors a ∈ R n \mathbf{a} \in \mathbb{R}^{n} a ∈ R n , b ∈ R n \mathbf{b} \in \mathbb{R}^{n} b ∈ R n and a matrix X ∈ R m × n \mathbf{X} \in \mathbb{R}^{m \times n} X ∈ R m × n , the gradient matrix for a T X T X b \mathbf{a}^{\mathsf{T}} \mathbf{X}^{\mathsf{T}}\mathbf{X} \mathbf{b} a T X T Xb is as follows.
∇ X ( a T X T X b ) = ∂ ( a T X T X b ) ∂ X = X ( a b T + b a T ) (2)
\nabla_{\mathbf{X}} (\mathbf{a}^{\mathsf{T}} \mathbf{X}^{\mathsf{T}} \mathbf{X} \mathbf{b})
= \dfrac{\partial (\mathbf{a}^{\mathsf{T}} \mathbf{X}^{\mathsf{T}} \mathbf{X} \mathbf{b})}{\partial \mathbf{X}}
= \mathbf{X}(\mathbf{a} \mathbf{b}^{\mathsf{T}} + \mathbf{b}\mathbf{a}^{\mathsf{T}} ) \tag{2}
∇ X ( a T X T Xb ) = ∂ X ∂ ( a T X T Xb ) = X ( a b T + b a T ) ( 2 )
If b = a \mathbf{b} = \mathbf{a} b = a , then
∂ ( a T X T X a ) ∂ X = 2 X a a T (2)
\dfrac{\partial (\mathbf{a}^{\mathsf{T}} \mathbf{X}^{\mathsf{T}} \mathbf{X} \mathbf{a})}{\partial \mathbf{X}} =
2\mathbf{X}\mathbf{a} \mathbf{a}^{\mathsf{T}} \tag{2}
∂ X ∂ ( a T X T Xa ) = 2 Xa a T ( 2 )
Explanation The result is similar to the differentiation of polynomial functions.
( 1 ) (1) ( 1 ) : Essentially, it’s similar to the differentiation of a linear function, so it results in the form where only the coefficients remain. What can be a bit confusing is that the computation result should be a matrix, so it’s a b T \mathbf{a}\mathbf{b}^{\mathsf{T}} a b T rather than a T b \mathbf{a}^{\mathsf{T}}\mathbf{b} a T b .( 2 ) (2) ( 2 ) : From the perspective of quadratic forms, both a T X a \mathbf{a}^{\mathsf{T}} \mathbf{X} \mathbf{a} a T Xa and a T X T X a \mathbf{a}^{\mathsf{T}} \mathbf{X}^{\mathsf{T}}\mathbf{X} \mathbf{a} a T X T Xa are just quadratic forms, but they show different results due to the variable being multiplied several times when differentiating. Essentially, it corresponds to the differentiation of a quadratic function.In the proof below, we’ve shown it via direct computation, but using the method called the trace trick allows for a simpler calculation. Direct computation for an arbitrary form of X \mathbf{X} X or expressions containing many X \mathbf{X} X instances is practically too difficult, and the trace trick needs to be used.
More formulas can be found in the Matrix Differentiation Table for Scalar Functions .
Proof ( 1 ) (1) ( 1 ) The bilinear form can be expressed as a T X b = ∑ i = 1 m ∑ j = 1 n a i x i j b j \mathbf{a}^{\mathsf{T}} \mathbf{X} \mathbf{b} = \sum\limits_{i=1}^{m}\sum\limits_{j=1}^{n} a_{i} x_{ij} b_{j} a T Xb = i = 1 ∑ m j = 1 ∑ n a i x ij b j . Therefore, it’s ∂ ( ∑ i = 1 m ∑ j = 1 n a i x i j b j ) ∂ x k ℓ = a k b ℓ \dfrac{\partial \left( \sum\limits_{i=1}^{m}\sum\limits_{j=1}^{n} a_{i} x_{ij} b_{j} \right)}{\partial x_{k\ell}} = a_{k}b_{\ell} ∂ x k ℓ ∂ ( i = 1 ∑ m j = 1 ∑ n a i x ij b j ) = a k b ℓ , and
∇ X ( a T X b ) = [ ∂ ( a T X b ) ∂ x 11 ⋯ ∂ ( a T X b ) ∂ x 1 n ⋮ ⋱ ⋮ ∂ ( a T X b ) ∂ x m 1 ⋯ ∂ ( a T X b ) ∂ x m n ] = [ a 1 b 1 a 1 b 2 ⋯ a 1 b n a 2 b 1 a 2 b 2 ⋯ a 2 b n ⋮ ⋮ ⋱ ⋮ a m b 1 a m b 1 ⋯ a m b n ] = [ a 1 ⋮ a n ] [ b 1 ⋯ b n ] = a b T = a ⊗ b
\begin{align*}
\nabla_{\mathbf{X}} (\mathbf{a}^{\mathsf{T}} \mathbf{X} \mathbf{b})
&= \begin{bmatrix} \dfrac{\partial (\mathbf{a}^{\mathsf{T}} \mathbf{X} \mathbf{b})}{\partial x_{11}} & \cdots & \dfrac{\partial (\mathbf{a}^{\mathsf{T}} \mathbf{X} \mathbf{b})}{\partial x_{1n}} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial (\mathbf{a}^{\mathsf{T}} \mathbf{X} \mathbf{b})}{\partial x_{m1}} & \cdots & \dfrac{\partial (\mathbf{a}^{\mathsf{T}} \mathbf{X} \mathbf{b})}{\partial x_{mn}} \end{bmatrix} \\
&= \begin{bmatrix} a_{1}b_{1} & a_{1}b_{2} & \cdots & a_{1}b_{n} \\ a_{2}b_{1} & a_{2}b_{2} & \cdots & a_{2}b_{n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m}b_{1} & a_{m}b_{1} & \cdots & a_{m}b_{n} \end{bmatrix} \\
&= \begin{bmatrix} a_{1} \\ \vdots \\ a_{n} \end{bmatrix} \begin{bmatrix} b_{1} & \cdots & b_{n} \end{bmatrix} \\
&= \mathbf{a} \mathbf{b}^{\mathsf{T}} = \mathbf{a} \otimes \mathbf{b}
\end{align*}
∇ X ( a T Xb ) = ∂ x 11 ∂ ( a T Xb ) ⋮ ∂ x m 1 ∂ ( a T Xb ) ⋯ ⋱ ⋯ ∂ x 1 n ∂ ( a T Xb ) ⋮ ∂ x mn ∂ ( a T Xb ) = a 1 b 1 a 2 b 1 ⋮ a m b 1 a 1 b 2 a 2 b 2 ⋮ a m b 1 ⋯ ⋯ ⋱ ⋯ a 1 b n a 2 b n ⋮ a m b n = a 1 ⋮ a n [ b 1 ⋯ b n ] = a b T = a ⊗ b
⊗ \otimes ⊗ is the outer product of two vectors, or the Kronecker product of two matrices.
■
( 2 ) (2) ( 2 ) Direct Calculation The product of two matrices is [ A B ] i j = ∑ k = 1 n a i k b k j [AB]_{ij} = \sum\limits_{k=1}^{n} a_{ik}b_{kj} [ A B ] ij = k = 1 ∑ n a ik b kj , so it’s [ X T X ] i j = ∑ s = 1 m x s i x s j [X^{\mathsf{T}}X]_{ij} = \sum\limits_{s=1}^{m} x_{si} x_{sj} [ X T X ] ij = s = 1 ∑ m x s i x s j . The following holds:
a T X T X b = ∑ k , ℓ = 1 n a k ( ∑ s = 1 m x s k x s ℓ ) b ℓ = ∑ k , ℓ = 1 n ∑ s = 1 m a k x s k x s ℓ b ℓ
\mathbf{a}^{\mathsf{T}} \mathbf{X}^{\mathsf{T}}\mathbf{X} \mathbf{b} = \sum\limits_{k, \ell = 1}^{n} a_{k} \left(\sum\limits_{s=1}^{m} x_{sk} x_{s\ell}\right) b_{\ell} = \sum\limits_{k,\ell = 1}^{n} \sum\limits_{s=1}^{m} a_{k} x_{sk} x_{s\ell} b_{\ell}
a T X T Xb = k , ℓ = 1 ∑ n a k ( s = 1 ∑ m x s k x s ℓ ) b ℓ = k , ℓ = 1 ∑ n s = 1 ∑ m a k x s k x s ℓ b ℓ
Hence, the differentiation is as follows:
∂ ( ∑ k , ℓ = 1 n ∑ s = 1 m a k x s k x s ℓ b ℓ ) ∂ x i j = ∑ ℓ = 1 n a j x i ℓ b ℓ + ∑ k = 1 n a k x i k b j
\dfrac{\partial \left( \sum\limits_{k,\ell = 1}^{n} \sum\limits_{s=1}^{m} a_{k} x_{sk} x_{s\ell} b_{\ell} \right)}{\partial x_{ij}} = \sum\limits_{\ell=1}^{n}a_{j}x_{i\ell}b_{\ell} + \sum\limits_{k = 1}^{n} a_{k}x_{ik}b_{j}
∂ x ij ∂ ( k , ℓ = 1 ∑ n s = 1 ∑ m a k x s k x s ℓ b ℓ ) = ℓ = 1 ∑ n a j x i ℓ b ℓ + k = 1 ∑ n a k x ik b j
The gradient matrix is as follows:
∇ X ( a T X T X b ) = [ ∑ ℓ = 1 n a 1 x 1 ℓ b ℓ + ∑ k = 1 n a k x 1 k b 1 ∑ ℓ = 1 n a 2 x 1 ℓ b ℓ + ∑ k = 1 n a k x 1 k b 2 ⋯ ∑ ℓ = 1 n a n x 1 ℓ b ℓ + ∑ k = 1 n a k x 1 k b n ∑ ℓ = 1 n a 1 x 2 ℓ b ℓ + ∑ k = 1 n a k x 2 k b 1 ∑ ℓ = 1 n a 2 x 2 ℓ b ℓ + ∑ k = 1 n a k x 2 k b 2 ⋯ ∑ ℓ = 1 n a n x 2 ℓ b ℓ + ∑ k = 1 n a k x 2 k b n ⋮ ⋮ ⋱ ⋮ ∑ ℓ = 1 n a 1 x n ℓ b ℓ + ∑ k = 1 n a k x n k b 1 ∑ ℓ = 1 n a 2 x n ℓ b ℓ + ∑ k = 1 n a k x n k b 2 ⋯ ∑ ℓ = 1 n a n x n ℓ b ℓ + ∑ k = 1 n a k x n k b n ] = [ ∑ ℓ = 1 n a 1 x 1 ℓ b ℓ ⋯ ∑ ℓ = 1 n a n x 1 ℓ b ℓ ⋮ ⋱ ⋮ ∑ ℓ = 1 n a 1 x n ℓ b ℓ ⋯ ∑ ℓ = 1 n a n x n ℓ b ℓ ] + [ ∑ k = 1 n a k x 1 k b 1 ⋯ ∑ k = 1 n a k x 1 k b n ⋮ ⋱ ⋮ ∑ k = 1 n a k x n k b 1 ⋯ ∑ k = 1 n a k x n k b n ] = [ ∑ ℓ = 1 n x 1 ℓ b ℓ a 1 ⋯ ∑ ℓ = 1 n x 1 ℓ b ℓ a n ⋮ ⋱ ⋮ ∑ ℓ = 1 n x n ℓ b ℓ a 1 ⋯ ∑ ℓ = 1 n x n ℓ b ℓ a n ] + [ ∑ k = 1 n x 1 k a k b 1 ⋯ ∑ k = 1 n x 1 k a k b n ⋮ ⋱ ⋮ ∑ k = 1 n x n k a k b 1 ⋯ ∑ k = 1 n x n k a k b n ]
\begin{align*}
& \nabla_{\mathbf{X}} (\mathbf{a}^{\mathsf{T}} \mathbf{X}^{\mathsf{T}} \mathbf{X} \mathbf{b}) \\
&=
\begin{bmatrix}
\sum\limits_{\ell=1}^{n}a_{1}x_{1\ell}b_{\ell} + \sum\limits_{k = 1}^{n} a_{k}x_{1k}b_{1} & \sum\limits_{\ell=1}^{n}a_{2}x_{1\ell}b_{\ell} + \sum\limits_{k = 1}^{n} a_{k}x_{1k}b_{2} & \cdots & \sum\limits_{\ell=1}^{n}a_{n}x_{1\ell}b_{\ell} + \sum\limits_{k = 1}^{n} a_{k}x_{1k}b_{n} \\
\sum\limits_{\ell=1}^{n}a_{1}x_{2\ell}b_{\ell} + \sum\limits_{k = 1}^{n} a_{k}x_{2k}b_{1} & \sum\limits_{\ell=1}^{n}a_{2}x_{2\ell}b_{\ell} + \sum\limits_{k = 1}^{n} a_{k}x_{2k}b_{2} & \cdots & \sum\limits_{\ell=1}^{n}a_{n}x_{2\ell}b_{\ell} + \sum\limits_{k = 1}^{n} a_{k}x_{2k}b_{n} \\
\vdots & \vdots & \ddots & \vdots \\
\sum\limits_{\ell=1}^{n}a_{1}x_{n\ell}b_{\ell} + \sum\limits_{k = 1}^{n} a_{k}x_{nk}b_{1} & \sum\limits_{\ell=1}^{n}a_{2}x_{n\ell}b_{\ell} + \sum\limits_{k = 1}^{n} a_{k}x_{nk}b_{2} & \cdots & \sum\limits_{\ell=1}^{n}a_{n}x_{n\ell}b_{\ell} + \sum\limits_{k = 1}^{n} a_{k}x_{nk}b_{n}
\end{bmatrix} \\
&=
\begin{bmatrix}
\sum\limits_{\ell=1}^{n}a_{1}x_{1\ell}b_{\ell} & \cdots & \sum\limits_{\ell=1}^{n}a_{n}x_{1\ell}b_{\ell} \\
\vdots & \ddots & \vdots \\
\sum\limits_{\ell=1}^{n}a_{1}x_{n\ell}b_{\ell} & \cdots & \sum\limits_{\ell=1}^{n}a_{n}x_{n\ell}b_{\ell}
\end{bmatrix} +
\begin{bmatrix}
\sum\limits_{k = 1}^{n} a_{k}x_{1k}b_{1} & \cdots &\sum\limits_{k = 1}^{n} a_{k}x_{1k}b_{n} \\
\vdots & \ddots & \vdots \\
\sum\limits_{k = 1}^{n} a_{k}x_{nk}b_{1} & \cdots &\sum\limits_{k = 1}^{n} a_{k}x_{nk}b_{n}
\end{bmatrix} \\
&=
\begin{bmatrix}
\sum\limits_{\ell=1}^{n}x_{1\ell}b_{\ell}a_{1} & \cdots & \sum\limits_{\ell=1}^{n}x_{1\ell}b_{\ell}a_{n} \\
\vdots & \ddots & \vdots \\
\sum\limits_{\ell=1}^{n}x_{n\ell}b_{\ell}a_{1} & \cdots & \sum\limits_{\ell=1}^{n}x_{n\ell}b_{\ell}a_{n}
\end{bmatrix} +
\begin{bmatrix}
\sum\limits_{k = 1}^{n} x_{1k}a_{k}b_{1} & \cdots &\sum\limits_{k = 1}^{n} x_{1k}a_{k}b_{n} \\
\vdots & \ddots & \vdots \\
\sum\limits_{k = 1}^{n} x_{nk}a_{k}b_{1} & \cdots &\sum\limits_{k = 1}^{n} x_{nk}a_{k}b_{n}
\end{bmatrix} \\
\end{align*}
∇ X ( a T X T Xb ) = ℓ = 1 ∑ n a 1 x 1 ℓ b ℓ + k = 1 ∑ n a k x 1 k b 1 ℓ = 1 ∑ n a 1 x 2 ℓ b ℓ + k = 1 ∑ n a k x 2 k b 1 ⋮ ℓ = 1 ∑ n a 1 x n ℓ b ℓ + k = 1 ∑ n a k x nk b 1 ℓ = 1 ∑ n a 2 x 1 ℓ b ℓ + k = 1 ∑ n a k x 1 k b 2 ℓ = 1 ∑ n a 2 x 2 ℓ b ℓ + k = 1 ∑ n a k x 2 k b 2 ⋮ ℓ = 1 ∑ n a 2 x n ℓ b ℓ + k = 1 ∑ n a k x nk b 2 ⋯ ⋯ ⋱ ⋯ ℓ = 1 ∑ n a n x 1 ℓ b ℓ + k = 1 ∑ n a k x 1 k b n ℓ = 1 ∑ n a n x 2 ℓ b ℓ + k = 1 ∑ n a k x 2 k b n ⋮ ℓ = 1 ∑ n a n x n ℓ b ℓ + k = 1 ∑ n a k x nk b n = ℓ = 1 ∑ n a 1 x 1 ℓ b ℓ ⋮ ℓ = 1 ∑ n a 1 x n ℓ b ℓ ⋯ ⋱ ⋯ ℓ = 1 ∑ n a n x 1 ℓ b ℓ ⋮ ℓ = 1 ∑ n a n x n ℓ b ℓ + k = 1 ∑ n a k x 1 k b 1 ⋮ k = 1 ∑ n a k x nk b 1 ⋯ ⋱ ⋯ k = 1 ∑ n a k x 1 k b n ⋮ k = 1 ∑ n a k x nk b n = ℓ = 1 ∑ n x 1 ℓ b ℓ a 1 ⋮ ℓ = 1 ∑ n x n ℓ b ℓ a 1 ⋯ ⋱ ⋯ ℓ = 1 ∑ n x 1 ℓ b ℓ a n ⋮ ℓ = 1 ∑ n x n ℓ b ℓ a n + k = 1 ∑ n x 1 k a k b 1 ⋮ k = 1 ∑ n x nk a k b 1 ⋯ ⋱ ⋯ k = 1 ∑ n x 1 k a k b n ⋮ k = 1 ∑ n x nk a k b n
Thus, the following is obtained:
∇ X ( a T X T X b ) = X b a T + X a b T = X ( a b T + b a T )
\nabla_{\mathbf{X}} (\mathbf{a}^{\mathsf{T}} \mathbf{X}^{\mathsf{T}} \mathbf{X} \mathbf{b})
= \mathbf{X}\mathbf{b}\mathbf{a}^{\mathsf{T}} + \mathbf{X}\mathbf{a}\mathbf{b}^{\mathsf{T}} = \mathbf{X} ( \mathbf{a}\mathbf{b}^{\mathsf{T}} + \mathbf{b}\mathbf{a}^{\mathsf{T}} )
∇ X ( a T X T Xb ) = Xb a T + Xa b T = X ( a b T + b a T )
■
Trace Trick A scalar can be thought of as a 1 × 1 1\times 1 1 × 1 matrix, and thus, the trace essentially acts as an identity function. In other words, the value of the bilinear form is a scalar, so the following holds:
a T X T X b = Tr ( a T X T X b )
\mathbf{a}^{\mathsf{T}} \mathbf{X}^{\mathsf{T}} \mathbf{X} \mathbf{b} = \Tr(\mathbf{a}^{\mathsf{T}} \mathbf{X}^{\mathsf{T}} \mathbf{X} \mathbf{b})
a T X T Xb = Tr ( a T X T Xb )
Additionally, since the trace has the cyclic property Tr ( A B C ) = Tr ( C B A ) \Tr(ABC) = \Tr(CBA) Tr ( A BC ) = Tr ( CB A ) ,
∂ ( a T X T X b ) ∂ X = ∂ Tr ( a T X T X b ) ∂ X = ∂ Tr ( b a T X T X ) ∂ X
\dfrac{\partial (\mathbf{a}^{\mathsf{T}} \mathbf{X}^{\mathsf{T}} \mathbf{X} \mathbf{b})}{\partial \mathbf{X}}
= \dfrac{\partial \Tr(\mathbf{a}^{\mathsf{T}} \mathbf{X}^{\mathsf{T}} \mathbf{X} \mathbf{b})}{\partial \mathbf{X}}
= \dfrac{\partial \Tr(\mathbf{b}\mathbf{a}^{\mathsf{T}} \mathbf{X}^{\mathsf{T}} \mathbf{X})}{\partial \mathbf{X}}
∂ X ∂ ( a T X T Xb ) = ∂ X ∂ Tr ( a T X T Xb ) = ∂ X ∂ Tr ( b a T X T X )
By the trace differentiation formula , the following is obtained:
∂ ( b a T X T X ) ∂ X = X ( b a T + a b T )
\dfrac{\partial (\mathbf{b}\mathbf{a}^{\mathsf{T}} \mathbf{X}^{\mathsf{T}} \mathbf{X})}{\partial \mathbf{X}}
= \mathbf{X}(\mathbf{b}\mathbf{a}^{\mathsf{T}} + \mathbf{a}\mathbf{b}^{\mathsf{T}})
∂ X ∂ ( b a T X T X ) = X ( b a T + a b T )
At this point, if b = a \mathbf{b} = \mathbf{a} b = a , then
∂ ( a T X T X a ) ∂ X = 2 X a a T
\dfrac{\partial (\mathbf{a}^{\mathsf{T}} \mathbf{X}^{\mathsf{T}} \mathbf{X} \mathbf{a})}{\partial \mathbf{X}}
= 2\mathbf{X}\mathbf{a}\mathbf{a}^{\mathsf{T}}
∂ X ∂ ( a T X T Xa ) = 2 Xa a T
In this manner, the method that utilizes the properties of trace to simplify matrix differentiation calculations through straightforward matrix operations is called the trace trick .