
Generalizing Differentiation: Gradient Matrices and Matrix Calculus

Definition

We define the gradient matrix $\nabla_{\mathbf{X}} f$ for a scalar function $f : \mathbb{R}^{n \times n} \to \mathbb{R}$ and a matrix $\mathbf{X} = [x_{ij}] \in \mathbb{R}^{n \times n}$ as follows.

$$ [\nabla_{\mathbf{X}} f]_{ij} = \dfrac{\partial f}{\partial x_{ij}} \quad (i,j=1,\dots,n) $$

$$ \nabla_{\mathbf{X}} f = \dfrac{\partial f}{\partial \mathbf{X}} = \begin{bmatrix} \dfrac{\partial f}{\partial x_{11}} & \cdots & \dfrac{\partial f}{\partial x_{1n}} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial f}{\partial x_{n1}} & \cdots & \dfrac{\partial f}{\partial x_{nn}} \end{bmatrix} $$
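As a sanity check, the definition can be verified numerically. The sketch below is a minimal example assuming NumPy; the test function $f(\mathbf{X}) = \operatorname{Tr}(\mathbf{X}^{\mathsf{T}}\mathbf{X})$, the helper name `grad_matrix`, and the step size are arbitrary choices. Since $\operatorname{Tr}(\mathbf{X}^{\mathsf{T}}\mathbf{X}) = \sum_{i,j} x_{ij}^{2}$, the analytic gradient matrix is $2\mathbf{X}$.

```python
import numpy as np

def grad_matrix(f, X, h=1e-6):
    """Central-difference approximation of [df/dx_ij], entry by entry."""
    G = np.zeros_like(X)
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            E = np.zeros_like(X)
            E[i, j] = h
            G[i, j] = (f(X + E) - f(X - E)) / (2 * h)
    return G

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 3))

f = lambda X: np.trace(X.T @ X)  # f(X) = tr(X^T X) = sum of squared entries
print(np.allclose(grad_matrix(f, X), 2 * X, atol=1e-5))  # analytic gradient: 2X
```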

Explanation

If you think of a matrix as a reshaped vector, this is a natural generalization of the gradient. A gradient is, after all, nothing more than the vector obtained by differentiating with respect to each component of a vector, not a derivative of the vector itself. In the same way, differentiating with respect to a matrix does not introduce a new kind of derivative; it means collecting into a matrix the derivatives with respect to each component of the matrix. In essence, this is no different from differentiating a multivariable function. Note the phrasing: here we differentiate "with respect to" a matrix; differentiating "a" matrix would mean differentiating a matrix-valued function, which is a different operation.

In fields dealing with data, including machine learning and deep learning, computations with vectors and matrices and their derivatives appear frequently. It is therefore appropriate to call the operator $\nabla_{\mathbf{X}} = \dfrac{\partial}{\partial \mathbf{X}}$ the matrix gradient and the computed matrix $\nabla_{\mathbf{X}} f$ the gradient matrix.

Generalization

As the explanation above suggests, what matters is only whether the quantity being differentiated and the variable of differentiation are scalars or multivariable objects, so the idea need not be limited to matrices. It can be defined for the various cases in the table below.

| Variable \ Value | Scalar $y$ | Vector $\mathbf{y}$ | Matrix $\mathbf{Y}$ |
| --- | --- | --- | --- |
| Scalar | $\dfrac{\partial y}{\partial x}$ | $\dfrac{\partial \mathbf{y}}{\partial x}$ | $\dfrac{\partial \mathbf{Y}}{\partial x}$ |
| Vector | $\dfrac{\partial y}{\partial \mathbf{x}}$ | $\dfrac{\partial \mathbf{y}}{\partial \mathbf{x}}$ | |
| Matrix | $\dfrac{\partial y}{\partial \mathbf{X}}$ | | |

However, all this really does is unify existing concepts under a single, elegant notation.

| Variable \ Value | Scalar $y$ | Vector $\mathbf{y}$ | Matrix $\mathbf{Y}$ |
| --- | --- | --- | --- |
| Scalar | Differentiation | Vector-valued function differentiation | Matrix function differentiation |
| Vector | Gradient | Jacobian | |
| Matrix | Matrix gradient | | |

Considering the Jacobian or the chain rule, it is natural to treat the function value $\mathbf{y}$ as a column vector and the variable $\mathbf{x}$ as a row vector.

$$ \mathbf{y} = \begin{bmatrix} y_{1} \\ y_{2} \\ \vdots \\ y_{m} \end{bmatrix}, \quad \mathbf{x} = \begin{bmatrix} x_{1} & x_{2} & \cdots & x_{n} \end{bmatrix} $$

Hence $\dfrac{\partial y}{\partial \mathbf{x}}$ is a row vector, $\dfrac{\partial \mathbf{y}}{\partial x}$ is a column vector, and $\dfrac{\partial y}{\partial x}$ is a scalar, so the chain rule can be expressed naturally as a matrix product. This is why physics textbooks sometimes write gradients as row vectors. Let $y = y(\mathbf{x})$ and $\mathbf{x} = \mathbf{x}(t)$. Then

$$ \dfrac{\partial y}{\partial t} = \dfrac{\partial y}{\partial \mathbf{x}} \dfrac{d \mathbf{x}}{d t} = \begin{bmatrix} \dfrac{\partial y}{\partial x_{1}} & \cdots & \dfrac{\partial y}{\partial x_{n}} \end{bmatrix} \begin{bmatrix} \dfrac{d x_{1}}{d t} \\ \vdots \\ \dfrac{d x_{n}}{d t} \end{bmatrix} = \sum\limits_{i=1}^{n} \dfrac{\partial y}{\partial x_{i}} \dfrac{d x_{i}}{d t} $$
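This identity is easy to check numerically. Below is a minimal sketch assuming NumPy; the particular $y(\mathbf{x})$ and curve $\mathbf{x}(t)$ are arbitrary choices. The row gradient times the column derivative is compared against a direct finite difference of the composite.

```python
import numpy as np

y = lambda x: x[0]**2 + np.sin(x[1])       # scalar field y(x)
x = lambda t: np.array([t**2, np.cos(t)])  # curve x(t)

t = 0.7
grad_y = np.array([2 * x(t)[0], np.cos(x(t)[1])])  # row vector dy/dx at x(t)
dx_dt = np.array([2 * t, -np.sin(t)])              # column vector dx/dt

h = 1e-6
fd = (y(x(t + h)) - y(x(t - h))) / (2 * h)         # derivative of the composite
print(np.isclose(grad_y @ dx_dt, fd))              # chain rule: row . column
```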

Even with unified notation, it is easy to confuse a gradient with the derivative of a vector-valued function. In such cases, regard $\dfrac{\partial }{\partial x}$ itself as a scalar, $\dfrac{\partial }{\partial \mathbf{x}}$ itself as a vector, and $\dfrac{\partial }{\partial \mathbf{X}}$ itself as a matrix, and note that the derivatives above are connected to the Kronecker product $\otimes$ of matrices. The cases are worked out one by one below.
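Before the case-by-case treatment, the shape bookkeeping of the Kronecker view can be checked mechanically. This is a throwaway sketch assuming NumPy, with arrays of ones as mere stand-ins for the shapes of $\partial/\partial\mathbf{x}$ ($1 \times n$) and $\mathbf{y}$ ($m \times 1$):

```python
import numpy as np

m, n = 3, 4
op = np.ones((1, n))   # stand-in for the row operator d/dx (1 x n)
val = np.ones((m, 1))  # stand-in for the column vector y (m x 1)

# Kronecker product of a 1 x n object with an m x 1 object is m x n:
# exactly the Jacobian layout (rows indexed by y, columns by x).
print(np.kron(op, val).shape)  # (3, 4)
```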

Scalar Values

Differentiated by a Scalar

Differentiating a scalar by a scalar is just ordinary differentiation.

$$ \dfrac{\partial y}{\partial x} = \lim\limits_{h \to 0} \dfrac{y(x+h) - y(x)}{h} $$

In terms of the Kronecker product, it can be expressed as,

$$ \dfrac{\partial y}{\partial x} = \dfrac{\partial }{\partial x} \otimes y = \dfrac{\partial y}{\partial x} $$

Differentiated by a Vector

This corresponds to the gradient. Since the value is a scalar and the variable is a row vector, the gradient is naturally regarded as a row vector as well.

$$ \dfrac{\partial y}{\partial \mathbf{x}} = \nabla_{\mathbf{x}} y = \begin{bmatrix} \dfrac{\partial y}{\partial x_{1}} & \cdots & \dfrac{\partial y}{\partial x_{n}} \end{bmatrix} $$

In terms of the Kronecker product, it can be expressed as,

$$ \dfrac{\partial y}{\partial \mathbf{x}} = \dfrac{\partial }{\partial \mathbf{x}} \otimes y = \begin{bmatrix} \dfrac{\partial }{\partial x_{1}} & \cdots & \dfrac{\partial }{\partial x_{n}} \end{bmatrix} \otimes \begin{bmatrix} y \end{bmatrix} = \begin{bmatrix} \dfrac{\partial y}{\partial x_{1}} & \cdots & \dfrac{\partial y}{\partial x_{n}} \end{bmatrix} $$
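Numerically, the row-vector convention looks like this. A minimal sketch assuming NumPy; the helper name `grad_row` and the test function are arbitrary choices.

```python
import numpy as np

def grad_row(y, x, h=1e-6):
    """Gradient of scalar y at x, returned as a 1 x n row vector."""
    g = np.zeros((1, x.size))
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        g[0, i] = (y(x + e) - y(x - e)) / (2 * h)
    return g

y = lambda x: x[0]**2 * x[1] + np.exp(x[2])
x0 = np.array([1.0, 2.0, 0.5])
exact = np.array([[2 * x0[0] * x0[1], x0[0]**2, np.exp(x0[2])]])  # [2*x1*x2, x1^2, e^x3]
print(np.allclose(grad_row(y, x0), exact, atol=1e-6))
```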

The Laplacian $\Delta y = \nabla^{2} y$ is as follows.

$$ \Delta y = \nabla^{2} y = \sum\limits_{i=1}^{n} \dfrac{\partial^{2} y}{\partial x_{i}^{2}} = \left( \dfrac{\partial y}{\partial \mathbf{x}} \right) \left( \dfrac{\partial y}{\partial \mathbf{x}} \right)^{\mathsf{T}} = \operatorname{Tr} \left( \left( \dfrac{\partial y}{\partial \mathbf{x}} \right)^{\mathsf{T}} \left( \dfrac{\partial y}{\partial \mathbf{x}} \right) \right) $$

$\left( \dfrac{\partial y}{\partial \mathbf{x}} \right)^{\mathsf{T}} \left( \dfrac{\partial y}{\partial \mathbf{x}} \right)$ is the Hessian. Here these products are read as operator compositions: the differential operators in the left factor act on the entries of the right factor. (Read as products of numeric gradients, they would instead give $\lvert \nabla y \rvert^{2}$ and the outer product of first derivatives.)

$$ H(y) = \left( \dfrac{\partial y}{\partial \mathbf{x}} \right)^{\mathsf{T}} \left( \dfrac{\partial y}{\partial \mathbf{x}} \right) = \begin{bmatrix} \dfrac{\partial^{2} y}{\partial x_{1}^{2}} & \cdots & \dfrac{\partial^{2} y}{\partial x_{1} \partial x_{n}} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial^{2} y}{\partial x_{n} \partial x_{1}} & \cdots & \dfrac{\partial^{2} y}{\partial x_{n}^{2}} \end{bmatrix} $$
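Under that operator reading, both identities can be checked numerically. A sketch assuming NumPy; the test function, helper name `hessian`, and step size are arbitrary. It builds $H_{ij} = \partial^{2} y / \partial x_{i} \partial x_{j}$ by nested central differences and confirms $\Delta y = \operatorname{Tr}(H)$.

```python
import numpy as np

def hessian(y, x, h=1e-4):
    """H[i, j] = d^2 y / dx_i dx_j via nested central differences."""
    n = x.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei = np.zeros(n); ei[i] = h
            ej = np.zeros(n); ej[j] = h
            H[i, j] = (y(x + ei + ej) - y(x + ei - ej)
                       - y(x - ei + ej) + y(x - ei - ej)) / (4 * h * h)
    return H

y = lambda x: x[0]**2 * x[1] + np.sin(x[1])
x0 = np.array([1.0, 0.3])
H = hessian(y, x0)
laplacian = 2 * x0[1] - np.sin(x0[1])  # analytic: y_{x1 x1} + y_{x2 x2}
print(np.isclose(np.trace(H), laplacian, atol=1e-4))
```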

Differentiated by a Matrix

This is the matrix gradient defined above, here written for a general $m \times n$ matrix $\mathbf{X}$.

$$ \nabla_{\mathbf{X}} y = \dfrac{\partial y}{\partial \mathbf{X}} = \begin{bmatrix} \dfrac{\partial y}{\partial x_{11}} & \cdots & \dfrac{\partial y}{\partial x_{1n}} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial y}{\partial x_{m1}} & \cdots & \dfrac{\partial y}{\partial x_{mn}} \end{bmatrix} $$

In terms of the Kronecker product, it can be expressed as,

$$ \dfrac{\partial y}{\partial \mathbf{X}} = \dfrac{\partial }{\partial \mathbf{X}} \otimes y = \begin{bmatrix} \dfrac{\partial }{\partial x_{11}} & \cdots & \dfrac{\partial }{\partial x_{1n}} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial }{\partial x_{m1}} & \cdots & \dfrac{\partial }{\partial x_{mn}} \end{bmatrix} \otimes \begin{bmatrix} y \end{bmatrix} = \begin{bmatrix} \dfrac{\partial y}{\partial x_{11}} & \cdots & \dfrac{\partial y}{\partial x_{1n}} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial y}{\partial x_{m1}} & \cdots & \dfrac{\partial y}{\partial x_{mn}} \end{bmatrix} $$
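A classic identity in this layout is $\dfrac{\partial}{\partial \mathbf{X}} \operatorname{Tr}(\mathbf{A}\mathbf{X}) = \mathbf{A}^{\mathsf{T}}$, since $\operatorname{Tr}(\mathbf{A}\mathbf{X}) = \sum_{i,j} a_{ji} x_{ij}$. Below is a numerical spot check, a sketch assuming NumPy with the same kind of finite-difference helper as before; the sizes and seed are arbitrary.

```python
import numpy as np

def grad_matrix(f, X, h=1e-6):
    """Central-difference approximation of [df/dx_ij], entry by entry."""
    G = np.zeros_like(X)
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            E = np.zeros_like(X)
            E[i, j] = h
            G[i, j] = (f(X + E) - f(X - E)) / (2 * h)
    return G

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 3))  # A is 4 x 3, X is 3 x 4, so AX is square
X = rng.standard_normal((3, 4))

f = lambda X: np.trace(A @ X)
print(np.allclose(grad_matrix(f, X), A.T, atol=1e-6))  # gradient is A^T
```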

Vector Values

Differentiated by a Scalar

This corresponds to the differentiation of a vector-valued function.

$$ \dfrac{\partial \mathbf{y}}{\partial x} = \mathbf{y}^{\prime}(x) = \lim\limits_{h \to 0} \dfrac{\mathbf{y}(x+h) - \mathbf{y}(x)}{h} $$

As the function value $\mathbf{y}$ is a column vector and the variable $x$ is a scalar, the result is a column vector. In terms of the Kronecker product, it can be expressed as,

$$ \dfrac{\partial \mathbf{y}}{\partial x} = \dfrac{\partial }{\partial x} \otimes \mathbf{y} = \begin{bmatrix} \dfrac{\partial }{\partial x} \end{bmatrix} \otimes \begin{bmatrix} y_{1} \\ \vdots \\ y_{m} \end{bmatrix} = \begin{bmatrix} \dfrac{\partial y_{1}}{\partial x} \\ \vdots \\ \dfrac{\partial y_{m}}{\partial x} \end{bmatrix} $$
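A short numerical illustration, assuming NumPy; the curve $\mathbf{y}(t)$ is an arbitrary choice, and the derivative is taken componentwise.

```python
import numpy as np

y = lambda t: np.array([np.sin(t), t**3, np.exp(2 * t)])  # vector-valued y(t)

t, h = 0.4, 1e-6
fd = (y(t + h) - y(t - h)) / (2 * h)                      # componentwise derivative
exact = np.array([np.cos(t), 3 * t**2, 2 * np.exp(2 * t)])
print(np.allclose(fd, exact, atol=1e-6))
```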

Differentiated by a Vector

Differentiating a vector by a vector yields the Jacobian. From the definition of the Jacobian,

$$ \dfrac{\partial \mathbf{y}}{\partial \mathbf{x}} = \dfrac{\partial (y_{1}, y_{2}, \dots, y_{m})}{\partial (x_{1}, x_{2}, \dots, x_{n})} = \begin{bmatrix} \dfrac{\partial y_{1}}{\partial x_{1}} & \cdots & \dfrac{\partial y_{1}}{\partial x_{n}} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial y_{m}}{\partial x_{1}} & \cdots & \dfrac{\partial y_{m}}{\partial x_{n}} \end{bmatrix} $$

This matches the result obtained with the Kronecker product.

$$ \dfrac{\partial \mathbf{y}}{\partial \mathbf{x}} = \dfrac{\partial }{\partial \mathbf{x}} \otimes \mathbf{y} = \begin{bmatrix} \dfrac{\partial }{\partial x_{1}} & \cdots & \dfrac{\partial }{\partial x_{n}} \end{bmatrix} \otimes \begin{bmatrix} y_{1} \\ \vdots \\ y_{m} \end{bmatrix} = \begin{bmatrix} \dfrac{\partial y_{1}}{\partial x_{1}} & \cdots & \dfrac{\partial y_{1}}{\partial x_{n}} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial y_{m}}{\partial x_{1}} & \cdots & \dfrac{\partial y_{m}}{\partial x_{n}} \end{bmatrix} $$
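The same $m \times n$ layout falls out of a numerical Jacobian. A minimal sketch assuming NumPy; the helper name `jacobian` and the map $\mathbf{y} : \mathbb{R}^{2} \to \mathbb{R}^{3}$ are arbitrary choices.

```python
import numpy as np

def jacobian(y, x, h=1e-6):
    """J[i, j] = d y_i / d x_j via central differences (m x n layout)."""
    m, n = y(x).size, x.size
    J = np.zeros((m, n))
    for j in range(n):
        e = np.zeros(n)
        e[j] = h
        J[:, j] = (y(x + e) - y(x - e)) / (2 * h)
    return J

y = lambda x: np.array([x[0] * x[1], np.sin(x[0]), x[1]**2])  # R^2 -> R^3
x0 = np.array([0.5, 2.0])
exact = np.array([[x0[1], x0[0]],
                  [np.cos(x0[0]), 0.0],
                  [0.0, 2 * x0[1]]])
print(np.allclose(jacobian(y, x0), exact, atol=1e-6))
```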

Matrix Values

Differentiated by a Scalar

This corresponds to the differentiation of a matrix function. In terms of the Kronecker product, it can be expressed as,

$$ \dfrac{\partial \mathbf{Y}}{\partial x} = \dfrac{\partial }{\partial x} \otimes \mathbf{Y} = \begin{bmatrix} \dfrac{\partial }{\partial x} \end{bmatrix} \otimes \begin{bmatrix} y_{11} & y_{12} & \cdots & y_{1n} \\ y_{21} & y_{22} & \cdots & y_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ y_{m1} & y_{m2} & \cdots & y_{mn} \end{bmatrix} = \begin{bmatrix} \dfrac{\partial y_{11}}{\partial x} & \cdots & \dfrac{\partial y_{1n}}{\partial x} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial y_{m1}}{\partial x} & \cdots & \dfrac{\partial y_{mn}}{\partial x} \end{bmatrix} $$
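A final numerical illustration, assuming NumPy; the matrix function $\mathbf{Y}(x)$ is an arbitrary choice, and the derivative is again just entrywise.

```python
import numpy as np

Y = lambda x: np.array([[x, x**2],
                        [np.sin(x), np.exp(x)]])  # matrix-valued Y(x)

x, h = 1.2, 1e-6
fd = (Y(x + h) - Y(x - h)) / (2 * h)              # entrywise derivative
exact = np.array([[1.0, 2 * x],
                  [np.cos(x), np.exp(x)]])
print(np.allclose(fd, exact, atol=1e-6))
```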