
Automatic differentiation

Definition [1] [2]

Automatic differentiation refers to a method for obtaining the derivative of a function defined by computer programming code. It is also abbreviated as AD or autodiff.

Explanation

Automatic differentiation uses the chain rule to compute the derivative of a composite function built from functions whose derivatives are already known. In simple terms, it is the chain rule itself: implementing the chain rule in programming code constitutes automatic differentiation. Therefore, unlike methods that numerically approximate derivatives, automatic differentiation computes derivative values exactly (up to floating-point error). In deep learning, it is used to optimize neural networks through backpropagation. As you will see below, automatic differentiation does not produce a symbolic expression for the derivative; it simply adds, multiplies, and composes derivatives that are already known, one step at a time.

Depending on the method of computation, it can be broadly classified into forward mode and reverse mode.

About $f : \mathbb{R}^{n} \to \mathbb{R}$

Let us consider a bivariate function $f$.

$$ f(x_{1}, x_{2}) = x_{1}x_{2} + \sin(x_{1}) + 2(x_{1})^{2} $$

The partial derivative $\dfrac{\partial f}{\partial x_{1}}$ is computed as follows:

$$ \dfrac{\partial f}{\partial x_{1}} = \dfrac{\partial (x_{1}x_{2})}{\partial x_{1}} + \dfrac{\partial \sin(x_{1})}{\partial x_{1}} + \dfrac{\partial 2(x_{1})^{2}}{\partial x_{1}} $$

Each term here is already known as shown below:

$$ \dfrac{\partial (x_{1}x_{2})}{\partial x_{1}} = x_{2},\qquad \dfrac{\partial \sin(x_{1})}{\partial x_{1}} = \cos(x_{1}),\qquad \dfrac{\partial 2(x_{1})^{2}}{\partial x_{1}} = 4x_{1} $$

In other words, once the values of $x_{1}$ and $x_{2}$ are fixed, the derivative value follows immediately. Implementing this in programming code might look like the sketch below.
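This is a minimal Python sketch; the function name `df_dx1` is an illustrative assumption, not something fixed by the text. It simply evaluates the known partial derivatives above at a given point.

```python
import math

def df_dx1(x1, x2):
    """Partial derivative of f(x1, x2) = x1*x2 + sin(x1) + 2*x1**2
    with respect to x1, assembled from the known partials above."""
    return x2 + math.cos(x1) + 4 * x1

# Once (x1, x2) is fixed, the derivative value is determined immediately.
print(df_dx1(2, 5))  # 5 + cos(2) + 8 = 13 + cos(2) ≈ 12.584
```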

Forward Mode

Let us set $w_{1} = x_{1}$ and $w_{2} = x_{2}$, and compute the derivative $\dfrac{\partial f}{\partial x_{1}}(2,5)$ of $f$ at $(x_{1}, x_{2}) = (2, 5)$. Each time we add, multiply, or apply a function, we record the derivative of the intermediate result, denoting it by $\dfrac{\partial w}{\partial x_{1}} = \dot{w}$.

$$
\begin{array}{|l|l|}
\hline
\textbf{Forward calculations} & \textbf{Derivatives} \\
\hline
w_{1} = x_{1} = 2 & \dot{w}_{1} = 1 \\
w_{2} = x_{2} = 5 & \dot{w}_{2} = 0 \\
\hline
w_{3} = w_{1}w_{2} = 10 & \dot{w}_{3} = w_{2} = 5 \\
w_{4} = \sin(w_{1}) = \sin 2 & \dot{w}_{4} = \cos 2 \\
w_{5} = (w_{1})^{2} = 4 & \dot{w}_{5} = 2w_{1} = 4 \\
w_{6} = 2w_{5} = 8 & \dot{w}_{6} = 2\dot{w}_{5} = 8 \\
\hline
\begin{aligned} w_{7} &= w_{3} + w_{4} + w_{6} \\ &= 10 + \sin 2 + 8 \\ &= y \end{aligned} & \begin{aligned} \dot{w}_{7} &= \dot{w}_{3} + \dot{w}_{4} + \dot{w}_{6} \\ &= 5 + \cos 2 + 8 \\ &= \dot{y} \end{aligned} \\
\hline
\end{array}
$$

Therefore, we have $\dfrac{\partial f}{\partial x_{1}}(2,5) = \dot{y} = \cos 2 + 13$. If the derivative with respect to $x_{2}$ is also recorded while computing $w_{7}$, we obtain the gradient $\nabla f = [\partial_{x_{1}}f \quad \partial_{x_{2}}f]^{T}$.
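For illustration, here is a minimal forward-mode sketch in Python using dual numbers. The `Dual` class and the `sin` helper are names assumed for this example, not a standard API; each intermediate value carries its tangent $\dot{w}$, exactly as in the table above.

```python
import math

class Dual:
    """Forward-mode AD value: a pair (value, dot) where dot = ∂(value)/∂x1."""
    def __init__(self, value, dot=0.0):
        self.value = value
        self.dot = dot

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Sum rule: (u + v)' = u' + v'
        return Dual(self.value + other.value, self.dot + other.dot)

    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Product rule: (uv)' = u'v + uv'
        return Dual(self.value * other.value,
                    self.dot * other.value + self.value * other.dot)

    __rmul__ = __mul__

def sin(w):
    # Chain rule: (sin w)' = cos(w) * w'
    return Dual(math.sin(w.value), math.cos(w.value) * w.dot)

def f(x1, x2):
    return x1 * x2 + sin(x1) + 2 * (x1 * x1)

# Seed the tangents: ẇ1 = 1, ẇ2 = 0 gives ∂f/∂x1 at (2, 5).
y = f(Dual(2.0, 1.0), Dual(5.0, 0.0))
print(y.value)  # 18 + sin(2)
print(y.dot)    # 13 + cos(2) ≈ 12.584
```

Seeding the tangents as $(\dot{w}_{1}, \dot{w}_{2}) = (0, 1)$ instead would give $\dfrac{\partial f}{\partial x_{2}}(2,5)$ in the same single pass.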

  1. Mykel J. Kochenderfer, Algorithms for Optimization (2019), pp. 27-32

  2. https://en.wikipedia.org/wiki/Automatic_differentiation