This notation resembles a fraction, and it can in fact be treated as one in calculations. For example, the chain rule, the differentiation rule for composite functions, can be computed intuitively as if canceling fractions, as shown below.
$$ \frac{df}{dt} = \frac{df}{dx} \frac{dx}{dt} \tag{1} $$
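As a quick sanity check of this cancellation picture, take for instance $f(x) = x^{2}$ and $x(t) = \sin t$. Then

$$ \frac{df}{dt} = \frac{df}{dx} \frac{dx}{dt} = 2x \cdot \cos t = 2 \sin t \cos t, $$

which matches what one gets by differentiating the composite $f(x(t)) = \sin^{2} t$ directly.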
The ability to treat $(1)$ as a fraction means regarding $df$ and $dx$ as meaningful entities in themselves. In this context, $df$ (or $dx$) signifies a very small change in $f$ (or $x$).
$$ \underbrace{df}_{\text{Small change in } f} = \frac{df}{dx}\, dx = (\text{Rate of change of } f \text{ with respect to } x) \times (\text{Small change in } x) \tag{2} $$
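To attach concrete numbers to $(2)$, take for instance $f(x) = x^{2}$ at $x = 3$ with $dx = 0.01$: then $df = 2x\, dx = 6 \times 0.01 = 0.06$, while the exact change is $f(3.01) - f(3) = 9.0601 - 9 = 0.0601$. The discrepancy vanishes as $dx$ shrinks, which is the sense in which $df$ is a "small change."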
Vector Differentiation
Now let’s consider a multivariable function $f : \mathbb{R}^{n} \to \mathbb{R}$ and vector differentiation. The derivative of $f$ is given by the gradient, whose $i$-th component is the partial derivative with respect to the $i$-th variable.

$$ \nabla_{\mathbf{x}} f = \begin{bmatrix} \dfrac{\partial f}{\partial x_{1}} & \dfrac{\partial f}{\partial x_{2}} & \cdots & \dfrac{\partial f}{\partial x_{n}} \end{bmatrix}^{\mathsf{T}} $$
In this case, the small change in $f$, denoted by $df$, is defined as the sum over all variables of the rate of change $\dfrac{\partial f}{\partial x_{i}}$ of $f$ with respect to each variable multiplied by the corresponding small change $dx_{i}$.

$$ df = \sum_{i=1}^{n} \frac{\partial f}{\partial x_{i}} dx_{i} = \frac{\partial f}{\partial x_{1}} dx_{1} + \frac{\partial f}{\partial x_{2}} dx_{2} + \cdots + \frac{\partial f}{\partial x_{n}} dx_{n} \tag{3} $$

This is referred to as the total differential, complete differential, or simply the differential, of $f$.
Dividing both sides of $(3)$ by $dx_{i}$ while holding the other variables fixed ($dx_{j} = 0$ for $j \neq i$, since the variables are independent) recovers the rate of change $\dfrac{\partial f}{\partial x_{i}}$ of $f$ with respect to each variable, confirming that $df$ is well defined.
The sum in $(3)$ is precisely the product of a row vector and a column vector.

$$ df = \begin{bmatrix} \dfrac{\partial f}{\partial x_{1}} & \cdots & \dfrac{\partial f}{\partial x_{n}} \end{bmatrix} \begin{bmatrix} dx_{1} \\ \vdots \\ dx_{n} \end{bmatrix} \tag{4} $$

The row vector is the transposed gradient $(\nabla_{\mathbf{x}}f)^{\mathsf{T}}$. Let’s denote the column vector by $d\mathbf{x} = \begin{bmatrix} dx_{1} & \cdots & dx_{n} \end{bmatrix}^{\mathsf{T}}$; it signifies a small change in the variable $(x_{1}, x_{2}, \dots, x_{n})$. Then $(4)$ can be represented in the form of a dot product of the two vectors, as shown below.

$$ df = \Braket{\nabla_{\mathbf{x}}f, d\mathbf{x}} = (\nabla_{\mathbf{x}}f)^{\mathsf{T}} d\mathbf{x} \tag{5} $$
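As a concrete illustration, take for instance $f(x_{1}, x_{2}) = x_{1}^{2} x_{2}$. Then

$$ \nabla_{\mathbf{x}}f = \begin{bmatrix} 2 x_{1} x_{2} \\ x_{1}^{2} \end{bmatrix}, \qquad df = 2 x_{1} x_{2}\, dx_{1} + x_{1}^{2}\, dx_{2} = (\nabla_{\mathbf{x}}f)^{\mathsf{T}} d\mathbf{x}, $$

exactly as $(3)$ and $(5)$ prescribe.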
If, in the scalar case, one writes the derivative as $\nabla_{x}f = \dfrac{df}{dx}$, then $(2)$ becomes $df = \nabla_{x}f\, dx$, aligning perfectly with $(5)$, so the generalization to vectors is entirely natural. Let us now derive the generalization to matrix differentiation in the same manner.
Matrix Differentiation
Consider a function $f : \mathbb{R}^{n \times n} \to \mathbb{R}$ defined over a matrix space, where the variable of the function is a matrix.
Just like in the case of vectors, it would make sense to define $d\mathbf{X}$ as shown below.
$$ d\mathbf{X} = \begin{bmatrix} dx_{11} & dx_{12} & \cdots & dx_{1n} \\ dx_{21} & dx_{22} & \cdots & dx_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ dx_{n1} & dx_{n2} & \cdots & dx_{nn} \end{bmatrix} $$
Moreover, it is natural to define $df$ similarly to $(3)$, as follows.

$$ df = \sum_{i=1}^{n} \sum_{j=1}^{n} \frac{\partial f}{\partial x_{ij}} dx_{ij} = \Braket{ \begin{bmatrix} \dfrac{\partial f}{\partial x_{11}} & \cdots & \dfrac{\partial f}{\partial x_{1n}} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial f}{\partial x_{n1}} & \cdots & \dfrac{\partial f}{\partial x_{nn}} \end{bmatrix}, \begin{bmatrix} dx_{11} & \cdots & dx_{1n} \\ \vdots & \ddots & \vdots \\ dx_{n1} & \cdots & dx_{nn} \end{bmatrix} } $$
Here, the first matrix is the gradient matrix $\nabla_{\mathbf{X}}f = \dfrac{\partial f}{\partial \mathbf{X}}$, and the second matrix is $d\mathbf{X}$. Hence, the following theorem is obtained.
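The passage from this element-wise sum to the trace form in the theorem below uses the standard identity $\Braket{A, B} = \Tr(A^{\mathsf{T}} B)$ for the inner product of matrices; a one-line verification:

$$ \Tr(A^{\mathsf{T}} B) = \sum_{j=1}^{n} (A^{\mathsf{T}} B)_{jj} = \sum_{j=1}^{n} \sum_{i=1}^{n} a_{ij} b_{ij} $$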
Theorem
For a scalar function $f : \mathbb{R}^{n \times n} \to \mathbb{R}$ defined over a matrix space, the total differential $df$ is given by the following:
$$ df = \Braket{\nabla_{\mathbf{X}}f, d\mathbf{X}} = \Tr \left( \left( \nabla_{\mathbf{X}}f \right)^{\mathsf{T}} d\mathbf{X} \right) \tag{6} $$
where

$$ d\mathbf{X} = \begin{bmatrix} dx_{11} & dx_{12} & \cdots & dx_{1n} \\ dx_{21} & dx_{22} & \cdots & dx_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ dx_{n1} & dx_{n2} & \cdots & dx_{nn} \end{bmatrix} $$
Explanation
$(6)$ is extremely useful for differential calculations: to obtain $\nabla_{\mathbf{X}}f$, one only needs to arrange the differential into the form $df = \Tr(A^{\mathsf{T}} d\mathbf{X})$, without computing the partial derivatives of $f$ entry by entry. The matrix $A$ then becomes the gradient matrix of $f$. For more detailed explanations, refer to the following.
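As a minimal worked example of this trick, with a function of my own choosing, take $f(\mathbf{X}) = \Tr(A\mathbf{X})$ for a constant matrix $A$. Since the trace and the differential are both linear,

$$ df = \Tr(A\, d\mathbf{X}) = \Tr \left( (A^{\mathsf{T}})^{\mathsf{T}} d\mathbf{X} \right) \quad \Longrightarrow \quad \nabla_{\mathbf{X}} \Tr(A\mathbf{X}) = A^{\mathsf{T}}. $$

The result can also be checked numerically. Below is a quick sketch assuming NumPy; the helper `numerical_gradient` is a name of my own, not a library function. It compares the analytic gradient $A^{\mathsf{T}}$ against central finite differences.

```python
import numpy as np

def numerical_gradient(f, X, eps=1e-6):
    """Approximate the gradient matrix of a scalar function f of a
    matrix X by central finite differences, one entry at a time."""
    grad = np.zeros_like(X)
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            E = np.zeros_like(X)
            E[i, j] = eps
            grad[i, j] = (f(X + E) - f(X - E)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
n = 4
A = rng.standard_normal((n, n))
X = rng.standard_normal((n, n))

f = lambda Z: np.trace(A @ Z)   # f(X) = Tr(AX)
analytic = A.T                  # gradient predicted by the trace trick
numeric = numerical_gradient(f, X)
print(np.allclose(analytic, numeric, atol=1e-5))  # expected: True
```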