Total Differential of Function of a Matrix
Background
Scalar Differentiation
Let’s consider a scalar function $f : \mathbb{R} \to \mathbb{R}$ and ordinary differentiation.
$$ \dfrac{d f}{d x} \tag{1} $$
This notation resembles a fraction, and in calculations it can indeed be treated as one. For example, the chain rule, the differentiation rule for composite functions, can be computed intuitively as if canceling fractions, as shown below.
$$ \dfrac{d f}{d t} = \dfrac{d f}{d x} \dfrac{d x}{d t} $$
The ability to treat $(1)$ as a fraction means regarding $df$ and $dx$ as meaningful entities in themselves. In this context, $df$ (or $dx$) signifies a very small change in $f$ (or $x$).
$$ \begin{align*} df &= \dfrac{d f}{d x} dx \tag{2} \\ \text{Small change in $f$} &= \text{(Rate of change of $f$ with respect to $x$) $\times$ (Small change in $x$) } \end{align*} $$
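As a quick numerical sanity check of $(2)$, here is a minimal sketch; the choice $f(x) = \sin x$ and the step size are arbitrary assumptions made purely for illustration.

```python
import numpy as np

# Check df ≈ (df/dx) dx for the arbitrary example f(x) = sin(x).
f = np.sin
dfdx = np.cos                  # exact derivative of sin

x = 1.0
dx = 1e-6                      # a small change in x
df_exact = f(x + dx) - f(x)    # actual small change in f
df_approx = dfdx(x) * dx       # (rate of change of f) × (small change in x)

print(df_exact, df_approx)     # the two values agree closely
```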
Vector Differentiation
Now let’s consider a multivariable function $f : \mathbb{R}^{n} \to \mathbb{R}$ and vector differentiation. The derivative of $f$ is given by the gradient, with the $i$-th component being the partial derivative with respect to the $i$-th variable.
$$ \nabla_{\mathbf{x}}f = \dfrac{\partial f}{\partial \mathbf{x}} = \begin{bmatrix} \dfrac{\partial f}{\partial x_{1}} & \cdots & \dfrac{\partial f}{\partial x_{n}} \end{bmatrix}^{\mathsf{T}} $$
In this case, the small change in $f$, denoted by $df$, is defined as the sum, over all variables, of the rate of change $\dfrac{\partial f}{\partial x_{i}}$ multiplied by the corresponding small change $dx_{i}$. This is referred to as the total differential, complete differential, or simply the differential, of $f$.
$$ df = \dfrac{\partial f}{\partial x_{1}}dx_{1} + \cdots + \dfrac{\partial f}{\partial x_{n}}dx_{n} \tag{3} $$
Dividing both sides by $dx_{i}$ and using the independence of the variables (so that $\dfrac{dx_{j}}{dx_{i}} = 0$ for $j \neq i$), we recover the rate of change of $f$ with respect to each variable, confirming that $df$ is well defined.
$$ \dfrac{d f}{d x_{i}} = \dfrac{\partial f}{\partial x_{1}} \dfrac{dx_{1}}{dx_{i}} + \cdots + \dfrac{\partial f}{\partial x_{n}} \dfrac{dx_{n}}{dx_{i}} = \dfrac{\partial f}{\partial x_{i}} \dfrac{dx_{i}}{dx_{i}} = \dfrac{\partial f}{\partial x_{i}} $$
Moreover, $(3)$ can be expressed as a matrix product, as shown below.
$$ df = \begin{bmatrix} \dfrac{\partial f}{\partial x_{1}} & \cdots & \dfrac{\partial f}{\partial x_{n}} \end{bmatrix} \begin{bmatrix} d x_{1} \\ \vdots \\ d x_{n} \end{bmatrix} \tag{4} $$
The row vector is the transposed gradient $(\nabla_{\mathbf{x}}f)^{\mathsf{T}}$. Let us denote the column vector by $d\mathbf{x} = \begin{bmatrix} d x_{1} & \cdots & d x_{n} \end{bmatrix}^{\mathsf{T}}$; it signifies a small change in the variables $(x_{1}, x_{2}, \dots, x_{n})$. Then $(4)$ can be represented as a dot product of the two vectors, as shown below.
$$ df = (\nabla_{\mathbf{x}}f)^{\mathsf{T}} d\mathbf{x} = \nabla_{\mathbf{x}}f \cdot d\mathbf{x} = \braket{\nabla_{\mathbf{x}}f, d\mathbf{x}} \tag{5} $$
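Before moving on, $(5)$ can also be checked numerically; the sketch below assumes an arbitrary three-variable function and a hand-computed gradient.

```python
import numpy as np

# Check df ≈ ⟨∇f, dx⟩ for the arbitrary example f(x) = x₁² + 3x₁x₂ + sin(x₃).
def f(x):
    return x[0]**2 + 3*x[0]*x[1] + np.sin(x[2])

def grad_f(x):
    # Gradient assembled from the partial derivatives of f
    return np.array([2*x[0] + 3*x[1], 3*x[0], np.cos(x[2])])

x = np.array([1.0, 2.0, 0.5])
dx = 1e-6 * np.array([0.3, -0.7, 0.2])   # a small change in each variable

df_exact = f(x + dx) - f(x)
df_approx = grad_f(x) @ dx               # the dot product (∇f)ᵀdx from (5)

print(df_exact, df_approx)               # the two values agree closely
```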
If, in the scalar case, one writes $(2)$ as $df = \nabla_{x}f \, dx$ with $\nabla_{x}f = \frac{d f}{d x}$, it aligns perfectly with $(5)$, and the generalization from scalars to vectors becomes naturally evident. Let us now derive the generalization to matrix differentiation in the same manner.
Matrix Differentiation
Consider a function $f : \mathbb{R}^{n \times n} \to \mathbb{R}$ defined over a matrix space, where the function variable is represented by a matrix.
$$ \mathbf{X} = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1n} \\ x_{21} & x_{22} & \cdots & x_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nn} \end{bmatrix} $$
Just like in the case of vectors, it would make sense to define $d \mathbf{X}$ as shown below.
$$ d \mathbf{X} = \begin{bmatrix} d x_{11} & d x_{12} & \cdots & d x_{1n} \\ d x_{21} & d x_{22} & \cdots & d x_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ d x_{n1} & d x_{n2} & \cdots & d x_{nn} \end{bmatrix} $$
Moreover, it is natural to define $df$ similarly to $(3)$, as follows.
$$ df = \sum\limits_{i, j=1}^{n} \dfrac{\partial f}{\partial x_{ij}} d x_{ij} $$
The right-hand side resembles the inner product of two matrices. Therefore, it can be expressed as follows.
$$ df = \Braket{ \left[ \dfrac{\partial f}{\partial x_{ij}} \right], \left[ dx_{ij} \right] } = \Tr \left( \left[ \dfrac{\partial f}{\partial x_{ij}} \right]^{\mathsf{T}} \left[ dx_{ij} \right] \right) $$
Here, the first matrix is the gradient matrix $\nabla_{\mathbf{X}}f = \dfrac{\partial f}{\partial \mathbf{X}}$, and the second matrix is $d\mathbf{X}$. Hence, the following theorem is obtained.
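The matrix inner product identity used here, $\Braket{A, B} = \sum_{i,j} a_{ij} b_{ij} = \Tr(A^{\mathsf{T}} B)$, is a standard fact and is easy to verify numerically, as in the sketch below (the random matrices are arbitrary).

```python
import numpy as np

# Verify ⟨A, B⟩ = Σᵢⱼ aᵢⱼbᵢⱼ = Tr(AᵀB) for arbitrary matrices A, B.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
B = rng.standard_normal((4, 4))

elementwise_sum = np.sum(A * B)       # Σᵢⱼ aᵢⱼbᵢⱼ
trace_form = np.trace(A.T @ B)        # Tr(AᵀB)

print(np.isclose(elementwise_sum, trace_form))  # True
```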
Theorem
For a scalar function $f : \mathbb{R}^{n \times n} \to \mathbb{R}$ defined over a matrix space, the total differential $df$ is given by the following:
$$ df = \Braket{\nabla_{\mathbf{X}}f, d \mathbf{X}} = \Tr \left( \left( \nabla_{\mathbf{X}}f \right)^{\mathsf{T}} d \mathbf{X} \right) \tag{6} $$
Here, $\nabla_{\mathbf{X}}f$ is the gradient matrix, and $d \mathbf{X}$ is the differential of $\mathbf{X}$.
$$ \nabla_{\mathbf{X}} f = \dfrac{\partial f}{\partial \mathbf{X}} = \begin{bmatrix} \dfrac{\partial f}{\partial x_{11}} & \cdots & \dfrac{\partial f}{\partial x_{1n}} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial f}{\partial x_{n1}} & \cdots & \dfrac{\partial f}{\partial x_{nn}} \end{bmatrix} $$
$$ d \mathbf{X} = \begin{bmatrix} d x_{11} & d x_{12} & \cdots & d x_{1n} \\ d x_{21} & d x_{22} & \cdots & d x_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ d x_{n1} & d x_{n2} & \cdots & d x_{nn} \end{bmatrix} $$
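The theorem can be sanity-checked numerically. The sketch below assumes the example $f(\mathbf{X}) = \Tr(\mathbf{X}^{\mathsf{T}}\mathbf{X})$, whose gradient matrix is known to be $2\mathbf{X}$.

```python
import numpy as np

# Check (6) for the example f(X) = Tr(XᵀX), whose gradient matrix is 2X.
rng = np.random.default_rng(1)
X = rng.standard_normal((3, 3))
dX = 1e-6 * rng.standard_normal((3, 3))   # a small change in X

def f(X):
    return np.trace(X.T @ X)

grad = 2 * X                              # ∇_X f for this particular f
df_exact = f(X + dX) - f(X)
df_approx = np.trace(grad.T @ dX)         # Tr((∇_X f)ᵀ dX) as in (6)

print(df_exact, df_approx)                # the two values agree closely
```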
Explanation
$(6)$ is extremely useful in differential calculations, since $\nabla_{\mathbf{X}}f$ can be obtained without computing each partial derivative individually: one only needs to rearrange $df$ into the form $df = \Tr (A^{\mathsf{T}} d \mathbf{X})$, and the matrix $A$ is then the gradient matrix of $f$. For more detailed explanations, refer to the following (a short worked example is given after the list).
- Trace Trick
- Determinant Trick
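As a simple worked example of this trick, take $f(\mathbf{X}) = \Tr(\mathbf{B}\mathbf{X})$ for a fixed matrix $\mathbf{B}$ (a choice made here purely for illustration). Since $f$ is linear in $\mathbf{X}$,

$$ df = \Tr\left( \mathbf{B}(\mathbf{X} + d\mathbf{X}) \right) - \Tr\left( \mathbf{B}\mathbf{X} \right) = \Tr\left( \mathbf{B} \, d\mathbf{X} \right) = \Tr\left( \left( \mathbf{B}^{\mathsf{T}} \right)^{\mathsf{T}} d\mathbf{X} \right) $$

and comparing with $(6)$ immediately gives $\nabla_{\mathbf{X}} f = \mathbf{B}^{\mathsf{T}}$, without computing any $\dfrac{\partial f}{\partial x_{ij}}$ individually.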