Automatic differentiation
Definition
Automatic differentiation, commonly abbreviated as AD or autodiff, refers to a method for obtaining the derivative of a function defined by computer code.
Explanation
Automatic differentiation uses the chain rule to compute the derivative of a composite function built from functions whose derivatives are already known. In simple terms, it is the chain rule itself: implementing the chain rule in code constitutes automatic differentiation. Therefore, unlike methods that numerically approximate derivatives, automatic differentiation computes derivative coefficients exactly. It is used in deep learning to optimize neural networks through backpropagation. As you will see below, automatic differentiation does not derive a symbolic expression for the derivative; it simply adds, multiplies, and composes derivatives that are already known, one operation at a time.
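For reference, for two differentiable functions $u$ and $v$ the chain rule reads:

$$ \dfrac{d}{dx}\, u\big( v(x) \big) = u^{\prime}\big( v(x) \big)\, v^{\prime}(x) $$

If $u^{\prime}$ and $v^{\prime}$ are already known, the right-hand side only requires evaluations and a multiplication, and this is the pattern that automatic differentiation repeats operation by operation.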
Depending on the method of computation, it can be broadly classified into forward mode and reverse mode.
About $f : \mathbb{R}^{n} \to \mathbb{R}$
Let us consider a bivariate function $f$.
$$ f(x_{1}, x_{2}) = x_{1}x_{2} + \sin(x_{1}) + 2(x_{1})^{2} $$
The partial derivative $\dfrac{\partial f}{\partial x_{1}}$ is computed as follows:
$$ \dfrac{\partial f}{\partial x_{1}} = \dfrac{\partial (x_{1}x_{2})}{\partial x_{1}} + \dfrac{\partial \sin(x_{1})}{\partial x_{1}} + \dfrac{\partial 2(x_{1})^{2}}{\partial x_{1}} $$
Each term here is already known as shown below:
$$ \dfrac{\partial (x_{1}x_{2})}{\partial x_{1}} = x_{2},\qquad \dfrac{\partial \sin(x_{1})}{\partial x_{1}} = \cos(x_{1}),\qquad \dfrac{\partial 2(x_{1})^{2}}{\partial x_{1}} = 4x_{1} $$
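Adding these terms gives the full partial derivative:

$$ \dfrac{\partial f}{\partial x_{1}} = x_{2} + \cos(x_{1}) + 4x_{1} $$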
In other words, once the values of $x_{1}$ and $x_{2}$ are fixed, the derivative coefficient is immediately determined. Implementing this in code would look as follows:
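As a minimal sketch in Python (the function name `df_dx1` is an illustration, not taken from the original), this amounts to nothing more than evaluating the known derivative expressions at the given point:

```python
import math

def df_dx1(x1, x2):
    # Sum of the known term-by-term derivatives: x2 + cos(x1) + 4*x1
    return x2 + math.cos(x1) + 4 * x1

print(df_dx1(2.0, 5.0))  # 13 + cos(2) ≈ 12.5839
```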
Forward Mode
Let’s set $w_{1} = x_{1}$, $w_{2} = x_{2}$, and consider the case $(x_{1}, x_{2}) = (2, 5)$ to compute the derivative coefficient $\dfrac{\partial f}{\partial x_{1}}(2,5)$ of $f$. Each time we add, multiply, or apply a function, we record the derivative. Let us denote this as $\dfrac{\partial w}{\partial x_{1}} = \dot{w}$.
$$ \begin{array}{|l|l|} \hline \textbf{Forward calculations} & \textbf{Derivatives} \\ \hline w_{1} = x_{1} = 2 & \dot{w}_{1} = 1 \\ w_{2} = x_{2} = 5 & \dot{w}_{2} = 0 \\ \hline w_{3} = w_{1}w_{2} = 10 & \dot{w}_{3} = w_{2} = 5 \\ w_{4} = \sin(w_{1}) = \sin 2 & \dot{w}_{4} = \cos 2 \\ w_{5} = (w_{1})^{2} = 4 & \dot{w}_{5} = 2w_{1} = 4 \\ w_{6} = 2w_{5} = 8 & \dot{w}_{6} = 2\dot{w}_{5} = 8 \\ \hline \begin{align*} w_{7} &= w_{3} + w_{4} + w_{6} \\ &= 10 + \sin 2 + 8 \\ &= y \end{align*} & \begin{align*} \dot{w}_{7} &= \dot{w}_{3} + \dot{w}_{4} + \dot{w}_{6} \\ &= 5 + \cos 2 + 8 \\ &= \dot{y} \end{align*} \\ \hline \end{array} $$
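A forward pass like the one in the table can be sketched in Python by carrying each value together with its derivative. The `Dual` class and the helper `sin` below are hypothetical illustrations (not code from the cited reference); they record exactly the pairs $(w_{i}, \dot{w}_{i})$ above:

```python
import math

class Dual:
    """A value w paired with its derivative ẇ with respect to the chosen input."""

    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Sum rule: (u + v)' = u' + v'
        return Dual(self.value + other.value, self.deriv + other.deriv)

    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Product rule: (uv)' = u'v + uv'
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

    __rmul__ = __mul__

def sin(w):
    # Chain rule: (sin u)' = cos(u) * u'
    return Dual(math.sin(w.value), math.cos(w.value) * w.deriv)

def f(x1, x2):
    return x1 * x2 + sin(x1) + 2 * x1 * x1

# Seed: dx1/dx1 = 1, dx2/dx1 = 0, evaluated at (x1, x2) = (2, 5)
y = f(Dual(2.0, 1.0), Dual(5.0, 0.0))
print(y.value)  # 18 + sin(2) ≈ 18.9093
print(y.deriv)  # 13 + cos(2) ≈ 12.5839
```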
Therefore, we have $\dfrac{\partial f}{\partial x_{1}}(2,5) = \dot{y} = 13 + \cos 2$. If the derivative with respect to $x_{2}$ were also recorded while computing $w_{7}$, we would obtain the gradient $\nabla f = [\partial_{x_{1}}f \quad \partial_{x_{2}}f]^{T}$.
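Continuing the same sketch, the gradient at $(2, 5)$ can be obtained with one forward pass per input variable, seeding the derivative of the variable of interest with $1$ and the other with $0$ (this reuses the hypothetical `Dual` class and `f` defined above):

```python
# One forward pass per input variable, reusing the Dual class and f from above.
df_dx1 = f(Dual(2.0, 1.0), Dual(5.0, 0.0)).deriv  # 13 + cos(2)
df_dx2 = f(Dual(2.0, 0.0), Dual(5.0, 1.0)).deriv  # x1 = 2
grad = [df_dx1, df_dx2]
print(grad)  # [12.5839..., 2.0]
```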
See Also
Mykel J. Kochenderfer, Algorithms for Optimization (2019), pp. 27–32