
Generalizing Differentiation: Gradient Matrices and Matrix Calculus

Definition

We define the gradient matrix $\nabla_{\mathbf{X}} f$ for a scalar function $f : \mathbb{R}^{n \times n} \to \mathbb{R}$ and a matrix $\mathbf{X} = [x_{ij}] \in \mathbb{R}^{n \times n}$ as follows.

$$ [\nabla_{\mathbf{X}} f]_{ij} = \dfrac{\partial f}{\partial x_{ij}} \quad (i,j=1,\dots,n) $$

$$ \nabla_{\mathbf{X}} f = \dfrac{\partial f}{\partial \mathbf{X}} = \begin{bmatrix} \dfrac{\partial f}{\partial x_{11}} & \cdots & \dfrac{\partial f}{\partial x_{1n}} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial f}{\partial x_{n1}} & \cdots & \dfrac{\partial f}{\partial x_{nn}} \end{bmatrix} $$
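As a sanity check, the definition can be verified numerically. The sketch below is a minimal example assuming NumPy; the test function $f(\mathbf{X}) = \operatorname{Tr}(\mathbf{X}^{\mathsf{T}}\mathbf{X})$, the helper name `grad_matrix`, and the step size are arbitrary choices. Since $\operatorname{Tr}(\mathbf{X}^{\mathsf{T}}\mathbf{X}) = \sum_{i,j} x_{ij}^{2}$, the analytic gradient matrix is $2\mathbf{X}$.

```python
import numpy as np

def grad_matrix(f, X, h=1e-6):
    """Central-difference approximation of [df/dx_ij], entry by entry."""
    G = np.zeros_like(X)
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            E = np.zeros_like(X)
            E[i, j] = h
            G[i, j] = (f(X + E) - f(X - E)) / (2 * h)
    return G

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 3))

f = lambda X: np.trace(X.T @ X)  # f(X) = tr(X^T X) = sum of squared entries
print(np.allclose(grad_matrix(f, X), 2 * X, atol=1e-5))  # analytic gradient: 2X
```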

Explanation

If you think of a matrix as a reshaped vector, this is a natural generalization of the gradient. A gradient is, after all, nothing more than the vector obtained by differentiating with respect to each component of a vector, not a derivative of the vector itself. In the same way, differentiating with respect to a matrix does not introduce a new kind of derivative; it means collecting into a matrix the derivatives with respect to each component of the matrix. In essence, this is no different from differentiating a multivariable function. Note the phrasing: here we differentiate "with respect to" a matrix; differentiating "a" matrix would mean differentiating a matrix-valued function, which is a different operation.

In fields dealing with data, including machine learning and deep learning, computations with vectors and matrices and their derivatives appear frequently. It is therefore appropriate to call the operator $\nabla_{\mathbf{X}} = \dfrac{\partial}{\partial \mathbf{X}}$ the matrix gradient and the computed matrix $\nabla_{\mathbf{X}} f$ the gradient matrix.

Generalization

As the explanation above suggests, what matters is only whether the quantity being differentiated and the variable of differentiation are scalars or multivariable objects, so the idea need not be limited to matrices. It can be defined for the various cases in the table below.

| Variable \ Value | Scalar $y$ | Vector $\mathbf{y}$ | Matrix $\mathbf{Y}$ |
| --- | --- | --- | --- |
| Scalar | $\dfrac{\partial y}{\partial x}$ | $\dfrac{\partial \mathbf{y}}{\partial x}$ | $\dfrac{\partial \mathbf{Y}}{\partial x}$ |
| Vector | $\dfrac{\partial y}{\partial \mathbf{x}}$ | $\dfrac{\partial \mathbf{y}}{\partial \mathbf{x}}$ | |
| Matrix | $\dfrac{\partial y}{\partial \mathbf{X}}$ | | |

However, all this really does is unify existing concepts under a single, elegant notation.

| Variable \ Value | Scalar $y$ | Vector $\mathbf{y}$ | Matrix $\mathbf{Y}$ |
| --- | --- | --- | --- |
| Scalar | Differentiation | Vector-valued function differentiation | Matrix function differentiation |
| Vector | Gradient | Jacobian | |
| Matrix | Matrix gradient | | |

Considering the Jacobian or the chain rule, it is natural to treat the function value $\mathbf{y}$ as a column vector and the variable $\mathbf{x}$ as a row vector.

$$ \mathbf{y} = \begin{bmatrix} y_{1} \\ y_{2} \\ \vdots \\ y_{m} \end{bmatrix}, \quad \mathbf{x} = \begin{bmatrix} x_{1} & x_{2} & \cdots & x_{n} \end{bmatrix} $$

Hence $\dfrac{\partial y}{\partial \mathbf{x}}$ is a row vector, $\dfrac{\partial \mathbf{y}}{\partial x}$ is a column vector, and $\dfrac{\partial y}{\partial x}$ is a scalar, so the chain rule can be expressed naturally as a matrix product. This is why physics textbooks sometimes write gradients as row vectors. Let $y = y(\mathbf{x})$ and $\mathbf{x} = \mathbf{x}(t)$. Then

$$ \dfrac{\partial y}{\partial t} = \dfrac{\partial y}{\partial \mathbf{x}} \dfrac{d \mathbf{x}}{d t} = \begin{bmatrix} \dfrac{\partial y}{\partial x_{1}} & \cdots & \dfrac{\partial y}{\partial x_{n}} \end{bmatrix} \begin{bmatrix} \dfrac{d x_{1}}{d t} \\ \vdots \\ \dfrac{d x_{n}}{d t} \end{bmatrix} = \sum\limits_{i=1}^{n} \dfrac{\partial y}{\partial x_{i}} \dfrac{d x_{i}}{d t} $$
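This identity is easy to check numerically. Below is a minimal sketch assuming NumPy; the particular $y(\mathbf{x})$ and curve $\mathbf{x}(t)$ are arbitrary choices. The row gradient times the column derivative is compared against a direct finite difference of the composite.

```python
import numpy as np

y = lambda x: x[0]**2 + np.sin(x[1])       # scalar field y(x)
x = lambda t: np.array([t**2, np.cos(t)])  # curve x(t)

t = 0.7
grad_y = np.array([2 * x(t)[0], np.cos(x(t)[1])])  # row vector dy/dx at x(t)
dx_dt = np.array([2 * t, -np.sin(t)])              # column vector dx/dt

h = 1e-6
fd = (y(x(t + h)) - y(x(t - h))) / (2 * h)         # derivative of the composite
print(np.isclose(grad_y @ dx_dt, fd))              # chain rule: row . column
```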

Even with unified notation, it is easy to confuse a gradient with the derivative of a vector-valued function. In such cases, regard $\dfrac{\partial }{\partial x}$ itself as a scalar, $\dfrac{\partial }{\partial \mathbf{x}}$ itself as a vector, and $\dfrac{\partial }{\partial \mathbf{X}}$ itself as a matrix, and note that the derivatives above are connected to the Kronecker product $\otimes$ of matrices. The cases are worked out one by one below.
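Before the case-by-case treatment, the shape bookkeeping of the Kronecker view can be checked mechanically. This is a throwaway sketch assuming NumPy, with arrays of ones as mere stand-ins for the shapes of $\partial/\partial\mathbf{x}$ ($1 \times n$) and $\mathbf{y}$ ($m \times 1$):

```python
import numpy as np

m, n = 3, 4
op = np.ones((1, n))   # stand-in for the row operator d/dx (1 x n)
val = np.ones((m, 1))  # stand-in for the column vector y (m x 1)

# Kronecker product of a 1 x n object with an m x 1 object is m x n:
# exactly the Jacobian layout (rows indexed by y, columns by x).
print(np.kron(op, val).shape)  # (3, 4)
```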

Scalar Values

Differentiated by a Scalar

Differentiating a scalar by a scalar is just ordinary differentiation.

$$ \dfrac{\partial y}{\partial x} = \lim\limits_{h \to 0} \dfrac{y(x+h) - y(x)}{h} $$

In terms of the Kronecker product, it can be expressed as,

$$ \dfrac{\partial y}{\partial x} = \dfrac{\partial }{\partial x} \otimes y = \dfrac{\partial y}{\partial x} $$

Differentiated by a Vector

This corresponds to the gradient. Since the value is a scalar and the variable is a row vector, the gradient is naturally regarded as a row vector as well.

$$ \dfrac{\partial y}{\partial \mathbf{x}} = \nabla_{\mathbf{x}} y = \begin{bmatrix} \dfrac{\partial y}{\partial x_{1}} & \cdots & \dfrac{\partial y}{\partial x_{n}} \end{bmatrix} $$

In terms of the Kronecker product, it can be expressed as,

$$ \dfrac{\partial y}{\partial \mathbf{x}} = \dfrac{\partial }{\partial \mathbf{x}} \otimes y = \begin{bmatrix} \dfrac{\partial }{\partial x_{1}} & \cdots & \dfrac{\partial }{\partial x_{n}} \end{bmatrix} \otimes \begin{bmatrix} y \end{bmatrix} = \begin{bmatrix} \dfrac{\partial y}{\partial x_{1}} & \cdots & \dfrac{\partial y}{\partial x_{n}} \end{bmatrix} $$
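Numerically, the row-vector convention looks like this. A minimal sketch assuming NumPy; the helper name `grad_row` and the test function are arbitrary choices.

```python
import numpy as np

def grad_row(y, x, h=1e-6):
    """Gradient of scalar y at x, returned as a 1 x n row vector."""
    g = np.zeros((1, x.size))
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        g[0, i] = (y(x + e) - y(x - e)) / (2 * h)
    return g

y = lambda x: x[0]**2 * x[1] + np.exp(x[2])
x0 = np.array([1.0, 2.0, 0.5])
exact = np.array([[2 * x0[0] * x0[1], x0[0]**2, np.exp(x0[2])]])  # [2*x1*x2, x1^2, e^x3]
print(np.allclose(grad_row(y, x0), exact, atol=1e-6))
```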

The Laplacian $\Delta y = \nabla^{2} y$ is as follows.

$$ \Delta y = \nabla^{2} y = \sum\limits_{i=1}^{n} \dfrac{\partial^{2} y}{\partial x_{i}^{2}} = \left( \dfrac{\partial y}{\partial \mathbf{x}} \right) \left( \dfrac{\partial y}{\partial \mathbf{x}} \right)^{\mathsf{T}} = \operatorname{Tr} \left( \left( \dfrac{\partial y}{\partial \mathbf{x}} \right)^{\mathsf{T}} \left( \dfrac{\partial y}{\partial \mathbf{x}} \right) \right) $$

$\left( \dfrac{\partial y}{\partial \mathbf{x}} \right)^{\mathsf{T}} \left( \dfrac{\partial y}{\partial \mathbf{x}} \right)$ is the Hessian. Here these products are read as operator compositions: the differential operators in the left factor act on the entries of the right factor. (Read as products of numeric gradients, they would instead give $\lvert \nabla y \rvert^{2}$ and the outer product of first derivatives.)

$$ H(y) = \left( \dfrac{\partial y}{\partial \mathbf{x}} \right)^{\mathsf{T}} \left( \dfrac{\partial y}{\partial \mathbf{x}} \right) = \begin{bmatrix} \dfrac{\partial^{2} y}{\partial x_{1}^{2}} & \cdots & \dfrac{\partial^{2} y}{\partial x_{1} \partial x_{n}} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial^{2} y}{\partial x_{n} \partial x_{1}} & \cdots & \dfrac{\partial^{2} y}{\partial x_{n}^{2}} \end{bmatrix} $$
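Under that operator reading, both identities can be checked numerically. A sketch assuming NumPy; the test function, helper name `hessian`, and step size are arbitrary. It builds $H_{ij} = \partial^{2} y / \partial x_{i} \partial x_{j}$ by nested central differences and confirms $\Delta y = \operatorname{Tr}(H)$.

```python
import numpy as np

def hessian(y, x, h=1e-4):
    """H[i, j] = d^2 y / dx_i dx_j via nested central differences."""
    n = x.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei = np.zeros(n); ei[i] = h
            ej = np.zeros(n); ej[j] = h
            H[i, j] = (y(x + ei + ej) - y(x + ei - ej)
                       - y(x - ei + ej) + y(x - ei - ej)) / (4 * h * h)
    return H

y = lambda x: x[0]**2 * x[1] + np.sin(x[1])
x0 = np.array([1.0, 0.3])
H = hessian(y, x0)
laplacian = 2 * x0[1] - np.sin(x0[1])  # analytic: y_{x1 x1} + y_{x2 x2}
print(np.isclose(np.trace(H), laplacian, atol=1e-4))
```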

Differentiated by a Matrix

This is the matrix gradient defined above, here written for a general $m \times n$ matrix $\mathbf{X}$.

$$ \nabla_{\mathbf{X}} y = \dfrac{\partial y}{\partial \mathbf{X}} = \begin{bmatrix} \dfrac{\partial y}{\partial x_{11}} & \cdots & \dfrac{\partial y}{\partial x_{1n}} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial y}{\partial x_{m1}} & \cdots & \dfrac{\partial y}{\partial x_{mn}} \end{bmatrix} $$

In terms of the Kronecker product, it can be expressed as,

$$ \dfrac{\partial y}{\partial \mathbf{X}} = \dfrac{\partial }{\partial \mathbf{X}} \otimes y = \begin{bmatrix} \dfrac{\partial }{\partial x_{11}} & \cdots & \dfrac{\partial }{\partial x_{1n}} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial }{\partial x_{m1}} & \cdots & \dfrac{\partial }{\partial x_{mn}} \end{bmatrix} \otimes \begin{bmatrix} y \end{bmatrix} = \begin{bmatrix} \dfrac{\partial y}{\partial x_{11}} & \cdots & \dfrac{\partial y}{\partial x_{1n}} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial y}{\partial x_{m1}} & \cdots & \dfrac{\partial y}{\partial x_{mn}} \end{bmatrix} $$
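A classic identity in this layout is $\dfrac{\partial}{\partial \mathbf{X}} \operatorname{Tr}(\mathbf{A}\mathbf{X}) = \mathbf{A}^{\mathsf{T}}$, since $\operatorname{Tr}(\mathbf{A}\mathbf{X}) = \sum_{i,j} a_{ji} x_{ij}$. Below is a numerical spot check, a sketch assuming NumPy with the same kind of finite-difference helper as before; the sizes and seed are arbitrary.

```python
import numpy as np

def grad_matrix(f, X, h=1e-6):
    """Central-difference approximation of [df/dx_ij], entry by entry."""
    G = np.zeros_like(X)
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            E = np.zeros_like(X)
            E[i, j] = h
            G[i, j] = (f(X + E) - f(X - E)) / (2 * h)
    return G

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 3))  # A is 4 x 3, X is 3 x 4, so AX is square
X = rng.standard_normal((3, 4))

f = lambda X: np.trace(A @ X)
print(np.allclose(grad_matrix(f, X), A.T, atol=1e-6))  # gradient is A^T
```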

Vector Values

Differentiated by a Scalar

This corresponds to the differentiation of a vector-valued function.

$$ \dfrac{\partial \mathbf{y}}{\partial x} = \mathbf{y}^{\prime}(x) = \lim\limits_{h \to 0} \dfrac{\mathbf{y}(x+h) - \mathbf{y}(x)}{h} $$

As the function value $\mathbf{y}$ is a column vector and the variable $x$ is a scalar, the result is a column vector. In terms of the Kronecker product, it can be expressed as,

$$ \dfrac{\partial \mathbf{y}}{\partial x} = \dfrac{\partial }{\partial x} \otimes \mathbf{y} = \begin{bmatrix} \dfrac{\partial }{\partial x} \end{bmatrix} \otimes \begin{bmatrix} y_{1} \\ \vdots \\ y_{m} \end{bmatrix} = \begin{bmatrix} \dfrac{\partial y_{1}}{\partial x} \\ \vdots \\ \dfrac{\partial y_{m}}{\partial x} \end{bmatrix} $$
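A short numerical illustration, assuming NumPy; the curve $\mathbf{y}(t)$ is an arbitrary choice, and the derivative is taken componentwise.

```python
import numpy as np

y = lambda t: np.array([np.sin(t), t**3, np.exp(2 * t)])  # vector-valued y(t)

t, h = 0.4, 1e-6
fd = (y(t + h) - y(t - h)) / (2 * h)                      # componentwise derivative
exact = np.array([np.cos(t), 3 * t**2, 2 * np.exp(2 * t)])
print(np.allclose(fd, exact, atol=1e-6))
```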

Differentiated by a Vector

Differentiating a vector by a vector yields the Jacobian. From the definition of the Jacobian,

$$ \dfrac{\partial \mathbf{y}}{\partial \mathbf{x}} = \dfrac{\partial (y_{1}, y_{2}, \dots, y_{m})}{\partial (x_{1}, x_{2}, \dots, x_{n})} = \begin{bmatrix} \dfrac{\partial y_{1}}{\partial x_{1}} & \cdots & \dfrac{\partial y_{1}}{\partial x_{n}} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial y_{m}}{\partial x_{1}} & \cdots & \dfrac{\partial y_{m}}{\partial x_{n}} \end{bmatrix} $$

This matches the result obtained with the Kronecker product.

$$ \dfrac{\partial \mathbf{y}}{\partial \mathbf{x}} = \dfrac{\partial }{\partial \mathbf{x}} \otimes \mathbf{y} = \begin{bmatrix} \dfrac{\partial }{\partial x_{1}} & \cdots & \dfrac{\partial }{\partial x_{n}} \end{bmatrix} \otimes \begin{bmatrix} y_{1} \\ \vdots \\ y_{m} \end{bmatrix} = \begin{bmatrix} \dfrac{\partial y_{1}}{\partial x_{1}} & \cdots & \dfrac{\partial y_{1}}{\partial x_{n}} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial y_{m}}{\partial x_{1}} & \cdots & \dfrac{\partial y_{m}}{\partial x_{n}} \end{bmatrix} $$
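The same $m \times n$ layout falls out of a numerical Jacobian. A minimal sketch assuming NumPy; the helper name `jacobian` and the map $\mathbf{y} : \mathbb{R}^{2} \to \mathbb{R}^{3}$ are arbitrary choices.

```python
import numpy as np

def jacobian(y, x, h=1e-6):
    """J[i, j] = d y_i / d x_j via central differences (m x n layout)."""
    m, n = y(x).size, x.size
    J = np.zeros((m, n))
    for j in range(n):
        e = np.zeros(n)
        e[j] = h
        J[:, j] = (y(x + e) - y(x - e)) / (2 * h)
    return J

y = lambda x: np.array([x[0] * x[1], np.sin(x[0]), x[1]**2])  # R^2 -> R^3
x0 = np.array([0.5, 2.0])
exact = np.array([[x0[1], x0[0]],
                  [np.cos(x0[0]), 0.0],
                  [0.0, 2 * x0[1]]])
print(np.allclose(jacobian(y, x0), exact, atol=1e-6))
```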

Matrix Values

Differentiated by a Scalar

This corresponds to the differentiation of a matrix function. In terms of the Kronecker product, it can be expressed as,

$$ \dfrac{\partial \mathbf{Y}}{\partial x} = \dfrac{\partial }{\partial x} \otimes \mathbf{Y} = \begin{bmatrix} \dfrac{\partial }{\partial x} \end{bmatrix} \otimes \begin{bmatrix} y_{11} & y_{12} & \cdots & y_{1n} \\ y_{21} & y_{22} & \cdots & y_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ y_{m1} & y_{m2} & \cdots & y_{mn} \end{bmatrix} = \begin{bmatrix} \dfrac{\partial y_{11}}{\partial x} & \cdots & \dfrac{\partial y_{1n}}{\partial x} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial y_{m1}}{\partial x} & \cdots & \dfrac{\partial y_{mn}}{\partial x} \end{bmatrix} $$
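A final numerical illustration, assuming NumPy; the matrix function $\mathbf{Y}(x)$ is an arbitrary choice, and the derivative is again just entrywise.

```python
import numpy as np

Y = lambda x: np.array([[x, x**2],
                        [np.sin(x), np.exp(x)]])  # matrix-valued Y(x)

x, h = 1.2, 1e-6
fd = (Y(x + h) - Y(x - h)) / (2 * h)              # entrywise derivative
exact = np.array([[1.0, 2 * x],
                  [np.cos(x), np.exp(x)]])
print(np.allclose(fd, exact, atol=1e-6))
```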