Back Propagation Algorithm
This article explains the principles of the backpropagation algorithm for readers with a background in mathematics.
Notation
Consider an artificial neural network like the one shown above. Let $\mathbf{x} = (x_{1}, x_{2}, \dots, x_{n_{0}})$ be the input, let $y_{j}^{l}$ be the $j$th node of the $l$th layer, and let $\hat{\mathbf{y}} = (\hat{y}_{1}, \hat{y}_{2}, \dots, \hat{y}_{\hat{n}})$ be the output.
Let $L \in \mathbb{N}$ be the number of hidden layers, and let the components of $\mathbf{n}=(n_{0}, n_{1}, \dots, n_{L}, \hat{n}) \in \mathbb{N}^{L+2}$ be the numbers of nodes in the input layer, the $L$ hidden layers, and the output layer, in that order. For convenience, we also refer to the input layer as the $0$th layer and to the output layer as the $(L+1)$th layer.
Let $w_{ji}^{l}$ denote the weight connecting the $i$th node in the $l$th layer to the $j$th node in the next layer. Propagation from each layer to the next then occurs as shown below, where $\phi$ is an arbitrary activation function and $v_{j}^{l}$ denotes the linear combination passed from the $l$th layer to the $j$th node of the next layer.
$$ \begin{align*} v_{j}^{l} &= \sum _{i=1}^{n_{l}} w_{ji}^{l}y_{i}^{l} \\ y_{j}^{l+1} &= \phi ( v_{j}^{l} ) = \phi \left( \sum \nolimits_{i=1}^{n_{l}} w_{ji}^{l}y_{i}^{l} \right) \end{align*} $$
To summarize:
Symbol | Meaning |
---|---|
$\mathbf{x}=(x_{1}, x_{2}, \dots, x_{n_{0}})$ | Input |
$y^{l}_{j}$ | The $j$th node in the $l$th layer |
$\hat{\mathbf{y}} = (\hat{y}_{1}, \hat{y}_{2}, \dots, \hat{y}_{\hat{n}} )$ | Output |
$n_{l}$ | Number of nodes in the $l$th layer |
$w_{ji}^{l}$ | The weight connecting the $i$th node in the $l$th layer to the $j$th node in the next layer |
$\phi$ | Activation function |
$v_{j}^{l} = \sum \limits _{i=1} ^{n_{l}} w_{ji}^{l}y_{i}^{l}$ | Linear combination |
$y^{l+1}_{j} = \phi (v_{j}^{l})$ | Propagation from the $l$th layer to the next layer |
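As a concrete illustration, here is a minimal NumPy sketch of this forward propagation under the conventions above. The layer sizes, random weights, and sigmoid activation are hypothetical placeholders, not values prescribed by this article.

```python
import numpy as np

def phi(v):
    """Activation function; the sigmoid is only an illustrative choice."""
    return 1.0 / (1.0 + np.exp(-v))

def forward(x, weights):
    """Forward pass. weights[l][j, i] corresponds to w_{ji}^{l}:
    it connects node i of layer l to node j of the next layer."""
    ys = [np.asarray(x)]          # y^0 = x (input layer)
    vs = []                       # linear combinations v^l
    for W in weights:
        v = W @ ys[-1]            # v_j^l = sum_i w_{ji}^l y_i^l
        vs.append(v)
        ys.append(phi(v))         # y_j^{l+1} = phi(v_j^l)
    return ys, vs                 # ys[-1] is the output y-hat

# Hypothetical example: n = (3, 4, 2), i.e. one hidden layer
rng = np.random.default_rng(0)
weights = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]
ys, vs = forward(rng.normal(size=3), weights)
print(ys[-1])                     # network output
```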
Theorem
Let $E = E(\hat{\mathbf{y}})$ be a suitable differentiable loss function. Then $E$ is optimized by updating the weights $w_{ji}^{l}$ at each layer as follows. $$ \begin{equation} w_{ji}^{l} \leftarrow w_{ji}^{l} + \alpha \delta^{l}_{j} y_{i}^{l} \label{thm} \end{equation} $$
Here $\alpha$ is the learning rate, and $\delta_{j}^{l}$ is given as follows. When $l=L$,
$$ -\delta_{j}^{L} = \phi ^{\prime} (v_{j}^{L}) \dfrac{\partial E(\hat{\mathbf{y}})}{\partial \hat{y}_{j}} $$
For $l \in \left\{ 0,\dots, L-1 \right\}$,
$$ \delta_{j}^{l} = \phi ^{\prime} (v_{j}^{l}) \sum_{i} \delta_{i}^{l+1} w_{i j}^{l+1} $$
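As a concrete instance, suppose $E$ is the squared error $E(\hat{\mathbf{y}}) = \frac{1}{2}\sum_{k}(\hat{y}_{k} - t_{k})^{2}$ for some target vector $\mathbf{t}$ (an illustrative choice; the theorem only requires that $E$ be differentiable). Then $\partial E / \partial \hat{y}_{j} = \hat{y}_{j} - t_{j}$, so the output-layer delta is
$$ \delta_{j}^{L} = -\phi^{\prime}(v_{j}^{L})(\hat{y}_{j} - t_{j}), $$
and the update $(1)$ for the last layer of weights reads $w_{ji}^{L} \leftarrow w_{ji}^{L} - \alpha\, \phi^{\prime}(v_{j}^{L})(\hat{y}_{j} - t_{j})\, y_{i}^{L}$.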
Explanation
Let’s look at $(1)$. It says that the weights between the $l$th and $(l+1)$th layers are updated using the $l$th-layer nodes $y_{i}^{l}$, which makes sense since the output of each layer ultimately determines the output $\hat{\mathbf{y}}$. Moreover, the $y_{i}^{l}$ can be viewed as inputs as they propagate from the $l$th to the $(l+1)$th layer, so the update resembles how a linear regression model is trained with LMS.
$$ \mathbf{w} \leftarrow \mathbf{w} - \alpha (\mathbf{w}^{T}\mathbf{x} - y) \mathbf{x} $$
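For comparison, a minimal NumPy sketch of that single-sample LMS step might look as follows; the data and learning rate are made up purely for illustration.

```python
import numpy as np

def lms_step(w, x, y, alpha=0.1):
    """One least-mean-squares update: w <- w - alpha * (w^T x - y) * x."""
    error = w @ x - y             # scalar prediction error
    return w - alpha * error * x

# Hypothetical data: fit w so that w^T x reproduces y
w = np.zeros(3)
x = np.array([1.0, 2.0, -1.0])
y = 0.5
for _ in range(100):
    w = lms_step(w, x, y)
print(w @ x)                      # approaches y = 0.5
```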
This optimization technique is called the back propagation algorithm because the outputs $y_{j}^{l}$ at each layer are computed forward from the input layer to the output layer, while the $\delta_{j}^{l}$ used for optimization are computed backwards from the output layer to the input layer, as follows.
$$ \begin{align*} \delta_{j}^{L} &= - \phi ^{\prime} (v_{j}^{L}) \dfrac{\partial E(\hat{\mathbf{y}})}{\partial \hat{y}_{j}} \\ \delta_{j}^{L-1} &= \phi ^{\prime} (v_{j}^{L-1}) \sum _{i} \delta_{i}^{L} w_{ij}^{L} \\ \delta_{j}^{L-2} &= \phi ^{\prime} (v_{j}^{L-2}) \sum _{i} \delta_{i}^{L-1} w_{ij}^{L-1} \\ \delta_{j}^{L-3} &= \phi ^{\prime} (v_{j}^{L-3}) \sum _{i} \delta_{i}^{L-2} w_{ij}^{L-2} \\ &\vdots \\ \delta_{j}^{1} &= \phi ^{\prime} (v_{j}^{1}) \sum _{i} \delta_{i}^{2} w_{ij}^{2} \\ \delta_{j}^{0} &= \phi ^{\prime} (v_{j}^{0}) \sum _{i} \delta_{i}^{1} w_{ij}^{1} \end{align*} $$
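The backward sweep can be coded directly from this recursion. The sketch below continues the hypothetical NumPy setup from the forward-pass snippet above; `dE_dyhat` stands for $\partial E(\hat{\mathbf{y}}) / \partial \hat{\mathbf{y}}$ and is assumed to be supplied by whatever loss is chosen.

```python
import numpy as np

def phi_prime(v):
    """Derivative of the sigmoid used in the forward-pass sketch."""
    s = 1.0 / (1.0 + np.exp(-v))
    return s * (1.0 - s)

def backward(ys, vs, weights, dE_dyhat, alpha=0.1):
    """Compute delta^l from the output layer back to the input layer,
    then apply w_{ji}^l <- w_{ji}^l + alpha * delta_j^l * y_i^l in place."""
    deltas = [None] * len(weights)
    deltas[-1] = -phi_prime(vs[-1]) * dE_dyhat               # delta^L
    for l in range(len(weights) - 2, -1, -1):                # l = L-1, ..., 0
        deltas[l] = phi_prime(vs[l]) * (weights[l + 1].T @ deltas[l + 1])
    for l, W in enumerate(weights):
        W += alpha * np.outer(deltas[l], ys[l])              # rank-one update
    return deltas
```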
Proof
Suppose the forward computation from the input layer to the output layer is complete. Using gradient descent, we can then modify the weights so that the loss function $E$ decreases.
$$ \begin{equation} w_{ji}^{l} \leftarrow w_{ji}^{l} - \alpha \dfrac{\partial E(\hat{\mathbf{y}})}{\partial w_{ji}^{l} } \label{gradesent} \end{equation} $$
Since each $y_{i}^{l}$ is a known value from the forward computation, the partial derivative can be written in a computable form. The partial derivative on the right-hand side is given by the chain rule.
$$ \begin{equation} \dfrac{\partial E(\hat{\mathbf{y}})}{\partial w_{ji}^{l}} = \dfrac{\partial E(\hat{\mathbf{y}}) }{\partial v_{j}^{l}} \dfrac{\partial v_{j}^{l}}{\partial w_{ji}^{l}} = \dfrac{\partial E(\hat{\mathbf{y}})}{\partial v_{j}^{l}} y_{i}^{l} \label{chainrule} \end{equation} $$
Letting $-\delta_{j}^{l}$ denote the partial derivative $\partial E(\hat{\mathbf{y}}) / \partial v_{j}^{l}$ appearing on the right-hand side of $(3)$, substituting into $(2)$ gives $(1)$.
$$ w_{ji}^{l} \leftarrow w_{ji}^{l} + \alpha \delta^{l}_{j} y_{i}^{l} $$
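Equation $(3)$ says that $\partial E / \partial w_{ji}^{l} = -\delta_{j}^{l} y_{i}^{l}$, which can be checked numerically with a finite difference. The sketch below reuses the hypothetical `forward`, `phi_prime`, and `weights` from the earlier snippets, together with a squared-error loss and target chosen only for illustration.

```python
# Check (3) numerically: dE/dw_{ji}^l should equal -delta_j^l * y_i^l.
x = np.array([0.2, -0.4, 0.9])
t = np.array([0.3, 0.7])                       # illustrative target
E = lambda out: 0.5 * np.sum((out - t) ** 2)   # illustrative loss

ys, vs = forward(x, weights)
deltas = [None] * len(weights)
deltas[-1] = -phi_prime(vs[-1]) * (ys[-1] - t)             # delta^L
for l in range(len(weights) - 2, -1, -1):
    deltas[l] = phi_prime(vs[l]) * (weights[l + 1].T @ deltas[l + 1])

l, j, i, eps = 0, 2, 1, 1e-6
analytic = -deltas[l][j] * ys[l][i]                        # -delta_j^l y_i^l
weights[l][j, i] += eps                                    # perturb one weight
numeric = (E(forward(x, weights)[0][-1]) - E(ys[-1])) / eps
weights[l][j, i] -= eps                                    # restore it
print(analytic, numeric)                                   # nearly identical
```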
We now find $\delta_{j}^{l}$ at each layer as follows.
Case $l = L$
For $j \in \left\{ 1, \dots, \hat{n} \right\}$, the following holds.
$$ \begin{equation} -\delta_{j}^{L} = \dfrac{\partial E (\hat{\mathbf{y}})}{\partial v_{j}^{L}} = \dfrac{\partial E ( \hat{\mathbf{y}} ) } {\partial \hat{y}_{j}} \dfrac{d \hat{y}_{j}}{d v_{j}^{L}} \label{deltamL} \end{equation} $$
Since $\hat{y}_{j} =\phi (v_{j}^{L})$, we get
$$ -\delta_{j}^{L} =\phi ^{\prime} (v_{j}^{L}) \dfrac{\partial E (\hat{\mathbf{y}})}{\partial \hat{y}_{j}} $$
■
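As a further concrete instance (again an illustrative choice, not one fixed by this article), if $\phi$ is the logistic sigmoid then $\phi^{\prime}(v_{j}^{L}) = \hat{y}_{j}(1 - \hat{y}_{j})$, and combined with the squared-error example above the output-layer delta becomes
$$ \delta_{j}^{L} = -\hat{y}_{j}(1 - \hat{y}_{j})(\hat{y}_{j} - t_{j}) $$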
Case $l = L-1$
For $j \in \left\{ 1, \dots, n_{L} \right\}$, we have
$$ -\delta_{j}^{L-1} = \dfrac{\partial E (\hat{\mathbf{y}})}{\partial v_{j}^{L-1}} = \dfrac{\partial E ( \hat{\mathbf{y}} ) } {\partial y_{j}^{L}} \dfrac{d y_{j}^{L}}{d v_{j}^{L-1}} $$
Since $y_{j}^{L} =\phi (v_{j}^{L-1})$, we get
$$ -\delta_{j}^{L-1} = \dfrac{\partial E (\hat{\mathbf{y}})}{\partial y_{j}^{L}} \dfrac{\partial y_{j}^{L}}{\partial v_{j}^{L-1}} = \phi ^{\prime} (v_{j}^{L-1}) \dfrac{\partial E(\hat{\mathbf{y}})}{\partial y_{j}^{L}} $$
The partial derivative on the right-hand side is computed by the chain rule as follows.
$$ \begin{align*} -\delta_{j}^{L-1} &= \phi ^{\prime} (v_{j}^{L-1}) \dfrac{\partial E(\hat{\mathbf{y}})}{\partial y_{j}^{L}} \\ &= \phi ^{\prime} (v_{j}^{L-1}) \sum _{i} \dfrac{\partial E(\hat{\mathbf{y}})}{\partial \hat{y}_{i}} \dfrac{\partial \hat{y}_{i}}{\partial y_{j}^{L}} \\ &= \phi ^{\prime} (v_{j}^{L-1}) \sum _{i} \dfrac{\partial E(\hat{\mathbf{y}})}{\partial \hat{y}_{i}} \dfrac{d \hat{y}_{i}}{d v_{i}^{L}} \dfrac{\partial v_{i}^{L}}{\partial y_{j}^{L}} \end{align*} $$
Here, by $(4)$ and ${\color{green}v_{i}^{L}=\sum_{j}w_{ij}^{L}y_{j}^{L}}$, we get the following.
$$ \begin{align} && -\delta_{j}^{L-1} &= \phi ^{\prime} (v_{j}^{L-1}) \sum _{i} {\color{blue}\dfrac{\partial E(\hat{\mathbf{y}})}{\partial \hat{y}_{i}} \dfrac{d \hat{y}_{i}}{d v_{i}^{L}}} {\color{green} \dfrac{\partial v_{i}^{L}}{\partial y_{j}^{L}} } \nonumber \\ && &= \phi ^{\prime} (v_{j}^{L-1}) \sum _{i} {\color{blue} -\delta_{i}^{L}} {\color{green} w_{ij}^{L} }\nonumber \\ {}\nonumber \\ \implies && \delta_{j}^{L-1} &= \phi ^{\prime} (v_{j}^{L-1}) \sum _{i} \delta_{i}^{L} w_{ij}^{L} \label{deltajL-1} \end{align} $$
■
Case $l = L-2$
For $j \in \left\{ 1, \dots, n_{L-1} \right\}$
$$ -\delta_{j}^{L-2} = \dfrac{\partial E (\hat{\mathbf{y}})}{\partial v_{j}^{L-2}} = \dfrac{\partial E ( \hat{\mathbf{y}} ) } {\partial y_{j}^{L-1}} \dfrac{d y_{j}^{L-1}}{d v_{j}^{L-2}} $$
Since $y_{j}^{L-1} =\phi (v_{j}^{L-2})$, we get
$$ -\delta_{j}^{L-2} = \dfrac{\partial E (\hat{\mathbf{y}})}{\partial y_{j}^{L-1}} \dfrac{d y_{j}^{L-1}}{d v_{j}^{L-2}} = \phi ^{\prime} (v_{j}^{L-2}) \dfrac{\partial E(\hat{\mathbf{y}})}{\partial y_{j}^{L-1}} $$
The partial derivative on the right-hand side is computed by the chain rule as follows.
$$ \begin{align*} -\delta_{j}^{L-2} &= \phi ^{\prime} (v_{j}^{L-2}) \dfrac{\partial E(\hat{\mathbf{y}})}{\partial y_{j}^{L-1}} \\ &= \phi ^{\prime} (v_{j}^{L-2}) \sum _{i} \dfrac{\partial E(\hat{\mathbf{y}})}{\partial \hat{y}_{i}} \dfrac{\partial \hat{y}_{i}}{\partial y_{j}^{L-1}} \\ &= \phi ^{\prime} (v_{j}^{L-2}) \sum _{i} \dfrac{\partial E(\hat{\mathbf{y}})}{\partial \hat{y}_{i}} \dfrac{d \hat{y}_{i}}{d v_{i}^{L}} \dfrac{\partial v_{i}^{L}}{\partial y_{j}^{L-1}} \\ &= \phi ^{\prime} (v_{j}^{L-2}) \sum _{i} \dfrac{\partial E(\hat{\mathbf{y}})}{\partial \hat{y}_{i}} \dfrac{d \hat{y}_{i}}{d v_{i}^{L}} \sum _{k} \dfrac{\partial v_{i}^{L}}{\partial y_{k}^{L}} \dfrac{\partial y_{k}^{L}}{\partial y_{j}^{L-1}} \\ &= \phi ^{\prime} (v_{j}^{L-2}) \sum _{i} \dfrac{\partial E(\hat{\mathbf{y}})}{\partial \hat{y}_{i}} \dfrac{d \hat{y}_{i}}{d v_{i}^{L}} \sum _{k} \dfrac{\partial v_{i}^{L}}{\partial y_{k}^{L}} \dfrac{d y_{k}^{L}}{d v_{k}^{L-1}} \dfrac{\partial v_{k}^{L-1}}{\partial y_{j}^{L-1}} \\ &= \phi ^{\prime} (v_{j}^{L-2}) \sum _{k} \sum _{i} {\color{blue}\dfrac{\partial E(\hat{\mathbf{y}})}{\partial \hat{y}_{i}} \dfrac{d \hat{y}_{i}}{d v_{i}^{L}}} {\color{red}\dfrac{\partial v_{i}^{L}}{\partial y_{k}^{L}} } {\color{green}\dfrac{d y_{k}^{L}}{d v_{k}^{L-1}}} {\color{purple}\dfrac{\partial v_{k}^{L-1}}{\partial y_{j}^{L-1}}} \\ &= \phi ^{\prime} (v_{j}^{L-2}) \sum _{k} \sum _{i} {\color{blue} -\delta_{i}^{L}} {\color{red} w_{ik}^{L}} {\color{green} \phi^{\prime}(v_{k}^{L-1})} {\color{purple} w_{kj}^{L-1}} \end{align*} $$
So we get the following
$$ \delta_{j}^{L-2} = \phi ^{\prime} (v_{j}^{L-2}) \sum _{k} \sum _{i} \delta_{i}^{L} w_{ik}^{L} \phi^{\prime}(v_{k}^{L-1}) w_{kj}^{L-1} $$
Then, by $(5)$, the following holds.
$$ \sum _{i} \delta_{i}^{L} w_{ik}^{L} \phi^{\prime}(v_{k}^{L-1}) = \phi^{\prime}(v_{k}^{L-1}) \sum _{i} \delta_{i}^{L} w_{ik}^{L} = \delta_{k}^{L-1} $$
Therefore, we get the following.
$$ \begin{align*} \delta_{j}^{L-2} &= \phi ^{\prime} (v_{j}^{L-2}) \sum _{k} \delta_{k}^{L-1} w_{kj}^{L-1} \\ &= \phi ^{\prime} (v_{j}^{L-2}) \sum _{i} \delta_{i}^{L-1} w_{ij}^{L-1} \end{align*} $$
■
Generalization: $l \in \left\{ 0, \dots, L-1 \right\}$
Based on the above results, we can generalize as follows for $j \in \left\{ 1, \dots, n_{l+1} \right\}$,
$$ -\delta_{j}^{l} = \phi ^{\prime} (v_{j}^{l}) \dfrac{\partial E(\hat{\mathbf{y}})}{\partial y_{j}^{l+1}} $$
Expanding the partial derivative on the right-hand side by the chain rule gives the following.
$$ \begin{align*} &\quad \delta_{j}^{l} \\ &= -\phi ^{\prime} (v_{j}^{l}) \dfrac{\partial E(\hat{\mathbf{y}})}{\partial y_{j}^{l+1}} \\ &= -\phi ^{\prime} (v_{j}^{l}) \sum_{i_{(1)}} \frac{\partial E(\hat{\mathbf{y}})}{\partial \hat{y}_{i_{(1)}}} \frac{\partial \hat{y}_{i_{(1)}}}{\partial y_{j}^{l+1}} \\ &= -\phi ^{\prime} (v_{j}^{l}) \sum_{i_{(1)}} \frac{\partial E(\hat{\mathbf{y}})}{\partial \hat{y}_{i_{(1)}}} \frac{d \hat{y}_{i_{(1)}}}{d v_{i_{(1)}}^{L}} \frac{\partial v_{i_{(1)}}^{L}}{\partial y_{j}^{l+1}} \\ &= -\phi ^{\prime} (v_{j}^{l}) \sum_{i_{(2)}} \sum_{i_{(1)}} \frac{\partial E(\hat{\mathbf{y}})}{\partial \hat{y}_{i_{(1)}}} \frac{d \hat{y}_{i_{(1)}}}{d v_{i_{(1)}}^{L}} \frac{\partial v_{i_{(1)}}^{L}}{\partial y_{i_{(2)}}^{L}} \frac{\partial y_{i_{(2)}}^{L}}{\partial y_{j}^{l+1}} \\ &= -\phi ^{\prime} (v_{j}^{l}) \sum_{i_{(2)}} \sum_{i_{(1)}} \frac{\partial E(\hat{\mathbf{y}})}{\partial \hat{y}_{i_{(1)}}} \frac{d \hat{y}_{i_{(1)}}}{d v_{i_{(1)}}^{L}} \frac{\partial v_{i_{(1)}}^{L}}{\partial y_{i_{(2)}}^{L}} \frac{d y_{i_{(2)}}^{L}}{d v_{i_{(2)}}^{L-1}} \frac{\partial v_{i_{(2)}}^{L-1}}{\partial y_{j}^{l+1}} \\ &= -\phi ^{\prime} (v_{j}^{l}) \sum_{i_{(3)}} \sum_{i_{(2)}} \sum_{i_{(1)}} \frac{\partial E(\hat{\mathbf{y}})}{\partial \hat{y}_{i_{(1)}}} \frac{d \hat{y}_{i_{(1)}}}{d v_{i_{(1)}}^{L}} \frac{\partial v_{i_{(1)}}^{L}}{\partial y_{i_{(2)}}^{L}} \frac{d y_{i_{(2)}}^{L}}{d v_{i_{(2)}}^{L-1}} \frac{\partial v_{i_{(2)}}^{L-1}}{\partial y_{i_{(3)}}^{L-1} } \frac{\partial y_{i_{(3)}}^{L-1} }{ \partial y_{j}^{l+1}} \\ & \quad \vdots \\ &= -\phi ^{\prime} (v_{j}^{l}) \sum_{i_{(L-l)}} \cdots \sum_{i_{(2)}} \sum_{i_{(1)}} \frac{\partial E(\hat{\mathbf{y}})}{\partial \hat{y}_{i_{(1)}}} \frac{d \hat{y}_{i_{(1)}}}{d v_{i_{(1)}}^{L}} \frac{\partial v_{i_{(1)}}^{L}}{\partial y_{i_{(2)}}^{L}} \frac{d y_{i_{(2)}}^{L}}{d v_{i_{(2)}}^{L-1}} \frac{\partial v_{i_{(2)}}^{L-1}}{\partial y_{i_{(3)}}^{L-1} } \cdots \frac{d y_{i_{(L-l)}}^{l+2} }{d v_{i_{(L-l)}}^{l+1} } \frac{\partial v_{i_{(L-l)}}^{l+1} }{ \partial y_{j}^{l+1}} \\ &= -\phi ^{\prime} (v_{j}^{l}) \sum_{i_{(L-l)}} \cdots \sum_{i_{(2)}} \sum_{i_{(1)}} \left( -\delta_{i_{(1)}}^{L} \right) w_{i_{(1)}i_{(2)}}^{L} \phi^{\prime}(v_{i_{(2)}}^{L-1}) w_{i_{(2)} i_{(3)}}^{L-1} \phi^{\prime}( v_{i_{(3)}}^{L-2} ) w_{i_{(3)} i_{(4)}}^{L-2} \cdots \phi^{\prime}(v_{i_{(L-l)}}^{l+1}) w_{i_{(L-l)} j}^{l+1} \\ &= \phi ^{\prime} (v_{j}^{l}) \sum_{i_{(L-l)}} \cdots \sum_{i_{(3)}} \sum_{i_{(2)}} \delta_{i_{(2)}}^{L-1}w_{i_{(2)} i_{(3)}}^{L-1} \phi^{\prime}( v_{i_{(3)}}^{L-2} ) w_{i_{(3)} i_{(4)}}^{L-2} \cdots \phi^{\prime}(v_{i_{(L-l)}}^{l+1}) w_{i_{(L-l)} j}^{l+1} \\ &= \phi ^{\prime} (v_{j}^{l}) \sum_{i_{(L-l)}} \cdots \sum_{i_{(3)}} \delta_{i_{(3)}}^{L-2} w_{i_{(3)} i_{(4)}}^{L-2} \cdots \phi^{\prime}(v_{i_{(L-l)}}^{l+1}) w_{i_{(L-l)} j}^{l+1} \\ &\quad \vdots \\ &= \phi ^{\prime} (v_{j}^{l}) \sum_{i_{(L-l)}} \delta_{i_{(L-l)}}^{l+1} w_{i_{(L-l)} j}^{l+1} \end{align*} $$
Therefore, to summarize,
$$ \delta_{j}^{l} = \phi ^{\prime} (v_{j}^{l}) \sum_{i} \delta_{i}^{l+1} w_{ij}^{l+1} $$
■
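Finally, here is a short end-to-end sketch that trains the small hypothetical network from the earlier snippets on a single input-target pair using the update rule of the theorem; with a small enough learning rate the squared error should shrink steadily. As before, the data, loss, and learning rate are illustrative assumptions, not values given in this article.

```python
# Train on one (x, t) pair with the update w <- w + alpha * delta * y.
x = np.array([0.2, -0.4, 0.9])
t = np.array([0.3, 0.7])

for step in range(1001):
    ys, vs = forward(x, weights)
    dE_dyhat = ys[-1] - t                      # gradient of the squared error
    backward(ys, vs, weights, dE_dyhat, alpha=0.5)
    if step % 200 == 0:
        print(step, 0.5 * np.sum((ys[-1] - t) ** 2))
```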