
Total Differential of Function of a Matrix

Background

Scalar Differentiation

Let’s consider a scalar function $f : \mathbb{R} \to \mathbb{R}$ and ordinary differentiation.

$$ \dfrac{d f}{d x} \tag{1} $$

This notation resembles a fraction, and it can in fact be treated as one in calculations. For example, the chain rule, the differentiation rule for composite functions, can be computed intuitively as if canceling fractions, as shown below.

$$ \dfrac{d f}{d t} = \dfrac{d f}{d x} \dfrac{d x}{d t} $$

The ability to treat $(1)$ as a fraction means regarding $df$ and $dx$ as meaningful entities in themselves. In this context, $df$ (or $dx$) signifies a very small change in $f$ (or $x$).

$$ \begin{align*} df &= \dfrac{d f}{d x} dx \tag{2} \\ \text{Small change in } f &= \text{(Rate of change of } f \text{ with respect to } x) \times \text{(Small change in } x) \end{align*} $$
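
As a quick numerical illustration of $(2)$ (a minimal sketch; the choice $f = \sin$ and the step sizes are arbitrary), the approximation $df \approx \frac{df}{dx} dx$ improves as $dx$ shrinks:

```python
import numpy as np

# A minimal sketch (f = sin and the points are arbitrary illustrative
# choices): the smaller dx gets, the better df ≈ (df/dx) dx holds.
f, dfdx = np.sin, np.cos   # f and its exact derivative

x = 1.0
for dx in [1e-1, 1e-2, 1e-3]:
    df_actual = f(x + dx) - f(x)   # actual small change in f
    df_linear = dfdx(x) * dx       # (rate of change of f) × (small change in x)
    print(f"dx={dx:.0e}, error={abs(df_actual - df_linear):.2e}")
```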

Vector Differentiation

Now let’s consider a multivariable function $f : \mathbb{R}^{n} \to \mathbb{R}$ and vector differentiation. The derivative of $f$ is given by the gradient, with the $i$-th component being the partial derivative with respect to the $i$-th variable.

$$ \nabla_{\mathbf{x}}f = \dfrac{\partial f}{\partial \mathbf{x}} = \begin{bmatrix} \dfrac{\partial f}{\partial x_{1}} & \cdots & \dfrac{\partial f}{\partial x_{n}} \end{bmatrix}^{\mathsf{T}} $$

In this case, the small change in $f$, denoted by $df$, is defined as the sum over all variables of the rate of change $\dfrac{\partial f}{\partial x_{i}}$ multiplied by the corresponding small change $dx_{i}$. This is referred to as the total differential, complete differential, or simply the differential, of $f$.

$$ df = \dfrac{\partial f}{\partial x_{1}}dx_{1} + \cdots + \dfrac{\partial f}{\partial x_{n}}dx_{n} \tag{3} $$

Dividing both sides by $dx_{i}$, and noting that the variables are independent (so that $dx_{j}/dx_{i} = 0$ for $j \neq i$), we recover the rate of change of $f$ with respect to each variable, confirming that $df$ is well defined.

$$ \dfrac{d f}{d x_{i}} = \dfrac{\partial f}{\partial x_{1}} \dfrac{dx_{1}}{dx_{i}} + \cdots + \dfrac{\partial f}{\partial x_{n}} \dfrac{dx_{n}}{dx_{i}} = \dfrac{\partial f}{\partial x_{i}} \dfrac{dx_{i}}{dx_{i}} = \dfrac{\partial f}{\partial x_{i}} $$

Moreover, $(3)$ can be expressed as a matrix product, as shown below.

$$ df = \begin{bmatrix} \dfrac{\partial f}{\partial x_{1}} & \cdots & \dfrac{\partial f}{\partial x_{n}} \end{bmatrix} \begin{bmatrix} d x_{1} \\ \vdots \\ d x_{n} \end{bmatrix} \tag{4} $$

The row vector is the transpose of the gradient, $(\nabla_{\mathbf{x}}f)^{\mathsf{T}}$. Let’s denote the column vector by $d\mathbf{x} = \begin{bmatrix} d x_{1} & \cdots & d x_{n} \end{bmatrix}^{\mathsf{T}}$, which signifies a small change in the variables $(x_{1}, x_{2}, \dots, x_{n})$. Then $(4)$ can be written as a dot product of the two vectors, as shown below.

$$ df = (\nabla_{\mathbf{x}}f)^{\mathsf{T}} d\mathbf{x} = \nabla_{\mathbf{x}}f \cdot d\mathbf{x} = \braket{\nabla_{\mathbf{x}}f, d\mathbf{x}} \tag{5} $$

If, in the scalar case, one writes $\nabla_{x}f = \frac{d f}{d x}$ so that $(2)$ reads $df = \nabla_{x}f \, dx$, this aligns perfectly with $(5)$, and the generalization to vectors becomes naturally evident. Let us now derive the generalization to matrix differentiation in the same manner.
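
The identity $(5)$ is easy to check numerically. Below is a minimal sketch assuming the test function $f(\mathbf{x}) = \mathbf{x}^{\mathsf{T}}\mathbf{x}$, whose gradient $2\mathbf{x}$ is known in closed form; the function and perturbation are arbitrary illustrative choices.

```python
import numpy as np

# Minimal sketch: check df ≈ ⟨∇f, dx⟩ for f(x) = xᵀx, whose gradient is 2x.
rng = np.random.default_rng(0)

f = lambda x: x @ x          # test function (arbitrary choice)
grad_f = lambda x: 2 * x     # its closed-form gradient

x = rng.standard_normal(5)
dx = 1e-6 * rng.standard_normal(5)   # small change in x

df_actual = f(x + dx) - f(x)
df_inner = grad_f(x) @ dx            # (∇f)ᵀ dx
print(abs(df_actual - df_inner))     # ≈ 0 up to second-order terms
```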

Matrix Differentiation

Consider a function $f : \mathbb{R}^{n \times n} \to \mathbb{R}$ defined over a matrix space, where the function variable is represented by a matrix.

$$ \mathbf{X} = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1n} \\ x_{21} & x_{22} & \cdots & x_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nn} \end{bmatrix} $$

Just like in the case of vectors, it would make sense to define $d \mathbf{X}$ as shown below.

$$ d \mathbf{X} = \begin{bmatrix} d x_{11} & d x_{12} & \cdots & d x_{1n} \\ d x_{21} & d x_{22} & \cdots & d x_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ d x_{n1} & d x_{n2} & \cdots & d x_{nn} \end{bmatrix} $$

Moreover, it is natural to define $df$ similarly to $(3)$, as follows.

$$ df = \sum\limits_{i, j=1}^{n} \dfrac{\partial f}{\partial x_{ij}} d x_{ij} $$

The right-hand side has the form of an inner product of two matrices (the Frobenius inner product), so it can be expressed as follows.

$$ df = \Braket{ \left[ \dfrac{\partial f}{\partial x_{ij}} \right], \left[ dx_{ij} \right] } = \Tr \left( \left[ \dfrac{\partial f}{\partial x_{ij}} \right]^{\mathsf{T}} \left[ dx_{ij} \right] \right) $$
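
To spell out the step from the double sum to the trace (a one-line check using only the definition of matrix multiplication): for $A = [a_{ij}]$ and $B = [b_{ij}]$,

$$ \Tr \left( A^{\mathsf{T}} B \right) = \sum_{j=1}^{n} \left( A^{\mathsf{T}} B \right)_{jj} = \sum_{j=1}^{n} \sum_{i=1}^{n} a_{ij} b_{ij} $$

so taking $a_{ij} = \dfrac{\partial f}{\partial x_{ij}}$ and $b_{ij} = d x_{ij}$ recovers exactly the sum above.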

Here, the first matrix is the gradient matrix $\nabla_{\mathbf{X}}f = \dfrac{\partial f}{\partial \mathbf{X}}$, and the second matrix is $d\mathbf{X}$. Hence, the following theorem is obtained.

Theorem

For a scalar function $f : \mathbb{R}^{n \times n} \to \mathbb{R}$ defined over a matrix space, the total differential $df$ is given by the following:

$$ df = \Braket{\nabla_{\mathbf{X}}f, d \mathbf{X}} = \Tr \left( \left( \nabla_{\mathbf{X}}f \right)^{\mathsf{T}} d \mathbf{X} \right) \tag{6} $$

Here, $\nabla_{\mathbf{X}}f$ is the gradient matrix, and $d \mathbf{X}$ is the differential of $\mathbf{X}$.

$$ \nabla_{\mathbf{X}} f = \dfrac{\partial f}{\partial \mathbf{X}} = \begin{bmatrix} \dfrac{\partial f}{\partial x_{11}} & \cdots & \dfrac{\partial f}{\partial x_{1n}} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial f}{\partial x_{n1}} & \cdots & \dfrac{\partial f}{\partial x_{nn}} \end{bmatrix} $$

$$ d \mathbf{X} = \begin{bmatrix} d x_{11} & d x_{12} & \cdots & d x_{1n} \\ d x_{21} & d x_{22} & \cdots & d x_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ d x_{n1} & d x_{n2} & \cdots & d x_{nn} \end{bmatrix} $$
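
As a sanity check of $(6)$ (a minimal numerical sketch; the test function $f(\mathbf{X}) = \det \mathbf{X}$, whose gradient matrix $\det(\mathbf{X}) \mathbf{X}^{-\mathsf{T}}$ follows from Jacobi's formula, is an illustrative choice not tied to the text), one can compare both sides under a small random perturbation.

```python
import numpy as np

# Minimal sketch: verify df ≈ Tr((∇_X f)ᵀ dX) for f(X) = det(X).
# By Jacobi's formula, ∇_X det(X) = det(X) · X⁻ᵀ.
rng = np.random.default_rng(1)

n = 4
X = rng.standard_normal((n, n))
dX = 1e-7 * rng.standard_normal((n, n))        # small change in X

grad = np.linalg.det(X) * np.linalg.inv(X).T   # gradient matrix ∇_X f

df_actual = np.linalg.det(X + dX) - np.linalg.det(X)
df_trace = np.trace(grad.T @ dX)               # Tr((∇_X f)ᵀ dX)
print(abs(df_actual - df_trace))               # ≈ 0 up to second-order terms
```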

Explanation

$(6)$ is extremely useful for gradient calculations: to obtain $\nabla_{\mathbf{X}}f$, one only needs to arrange the differential into the form $df = \Tr (A^{\mathsf{T}} d \mathbf{X})$, without computing the partial derivatives of $f$ entry by entry. The matrix $A$ is then the gradient matrix of $f$. For more detailed explanations, refer to the following.
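
For instance (a standard worked example, not taken from the text above), let $f(\mathbf{X}) = \mathbf{a}^{\mathsf{T}} \mathbf{X} \mathbf{b}$ for fixed vectors $\mathbf{a}, \mathbf{b} \in \mathbb{R}^{n}$. Since a scalar equals its own trace and the trace is cyclic,

$$ df = \mathbf{a}^{\mathsf{T}} (d\mathbf{X}) \mathbf{b} = \Tr \left( \mathbf{b} \mathbf{a}^{\mathsf{T}} \, d\mathbf{X} \right) = \Tr \left( \left( \mathbf{a} \mathbf{b}^{\mathsf{T}} \right)^{\mathsf{T}} d\mathbf{X} \right) $$

so reading off $A = \mathbf{a} \mathbf{b}^{\mathsf{T}}$ gives $\nabla_{\mathbf{X}} f = \mathbf{a} \mathbf{b}^{\mathsf{T}}$ without computing a single partial derivative.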