
Back Propagation Algorithm

This article is written to help readers with a mathematics background understand the principles of the backpropagation algorithm.

Notation

[Figure: a feedforward artificial neural network with an input layer, hidden layers, and an output layer]

Consider an artificial neural network like the one shown above. Let $\mathbf{x} = (x_{1}, x_{2}, \dots, x_{n_{0}})$ be the input, let $y_{j}^{l}$ be the $j$th node of the $l$th layer, and let $\hat{\mathbf{y}} = (\hat{y}_{1}, \hat{y}_{2}, \dots, \hat{y}_{\hat{n}})$ be the output.

Let $L \in \mathbb{N}$ be the number of hidden layers, and let the components of $\mathbf{n} = (n_{0}, n_{1}, \dots, n_{L}, \hat{n}) \in \mathbb{N}^{L+2}$ be the numbers of nodes in the input layer, the $L$ hidden layers, and the output layer, in that order. For convenience, we also refer to the input layer as the $0$th layer and to the output layer as the $(L+1)$th layer.

Let $w_{ji}^{l}$ denote the weight connecting the $i$th node in the $l$th layer to the $j$th node in the next layer. Propagation from each layer to the next then occurs as shown in the image below.

[Animation: propagation from the $l$th layer to the $(l+1)$th layer]

where $\phi$ is an arbitrary activation function. Let $v_{j}^{l}$ denote the linear combination passed from the $l$th layer to the $j$th node of the next layer.

$$
\begin{align*}
v_{j}^{l} &= \sum _{i=1}^{n_{l}} w_{ji}^{l} y_{i}^{l} \\
y_{j}^{l+1} &= \phi ( v_{j}^{l} ) = \phi \left( \sum \nolimits_{i=1}^{n_{l}} w_{ji}^{l} y_{i}^{l} \right)
\end{align*}
$$

To summarize, the notation is as follows.

| Symbol | Meaning |
| --- | --- |
| $\mathbf{x}=(x_{1}, x_{2}, \dots, x_{n_{0}})$ | Input |
| $y^{l}_{j}$ | The $j$th node in the $l$th layer |
| $\hat{\mathbf{y}} = (\hat{y}_{1}, \hat{y}_{2}, \dots, \hat{y}_{\hat{n}})$ | Output |
| $n_{l}$ | Number of nodes in the $l$th layer |
| $w_{ji}^{l}$ | Weight connecting the $i$th node in the $l$th layer to the $j$th node in the next layer |
| $\phi$ | Activation function |
| $v_{j}^{l} = \sum \limits_{i=1}^{n_{l}} w_{ji}^{l} y_{i}^{l}$ | Linear combination |
| $y^{l+1}_{j} = \phi (v_{j}^{l})$ | Propagation from the $l$th layer to the next layer |
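
As a concrete reading of this notation, here is a minimal NumPy sketch of the forward pass. The layer sizes, the random weights, and the choice of a sigmoid for $\phi$ are assumptions made only for illustration, not something prescribed by the article.

```python
import numpy as np

def phi(v):
    """Sigmoid activation, used here as one example of phi."""
    return 1.0 / (1.0 + np.exp(-v))

def forward(x, weights):
    """Propagate the input x through the network.

    weights[l] is the matrix with entries w_{ji}^l connecting layer l to layer l+1,
    so the layer sizes are n = (n_0, n_1, ..., n_L, n_hat).
    Returns the layer outputs y^0, ..., y^{L+1} and the linear combinations v^0, ..., v^L.
    """
    ys = [x]                  # y^0 = x (input layer)
    vs = []
    for W in weights:
        v = W @ ys[-1]        # v_j^l = sum_i w_{ji}^l y_i^l
        vs.append(v)
        ys.append(phi(v))     # y_j^{l+1} = phi(v_j^l)
    return ys, vs

# Example with n = (3, 4, 2), i.e. one hidden layer (L = 1).
rng = np.random.default_rng(0)
weights = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]
x = np.array([0.5, -0.1, 0.3])
ys, vs = forward(x, weights)
y_hat = ys[-1]                # the output y^
```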

Theorem

Let $E = E(\hat{\mathbf{y}})$ be a suitable differentiable loss function. Then $E$ is optimized by updating the weights $w_{ji}^{l}$ of each layer as follows.

$$
\begin{equation}
w_{ji}^{l} \leftarrow w_{ji}^{l} + \alpha \delta^{l}_{j} y_{i}^{l}
\label{thm}
\end{equation}
$$

Here $\alpha$ is the learning rate, and $\delta_{j}^{l}$ is given as follows. For $l = L$,

$$
-\delta_{j}^{L} = \phi ^{\prime} (v_{j}^{L}) \dfrac{\partial E(\hat{\mathbf{y}})}{\partial \hat{y}_{j}}
$$

For $l \in \left\{ 0, \dots, L-1 \right\}$,

$$
\delta_{j}^{l} = \phi ^{\prime} (v_{j}^{l}) \sum_{i} \delta_{i}^{l+1} w_{ij}^{l+1}
$$

where $i$ runs over all components of $\delta^{l+1}$.
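
Stated compactly, if $W^{l}$ denotes the matrix with entries $w_{ji}^{l}$, $\mathbf{v}^{l}$ and $\boldsymbol{\delta}^{l}$ the corresponding vectors, and $\odot$ the elementwise product (notation introduced here only for convenience, not used in the rest of the article), the theorem reads

$$
\boldsymbol{\delta}^{L} = -\phi^{\prime}(\mathbf{v}^{L}) \odot \nabla_{\hat{\mathbf{y}}} E, \qquad \boldsymbol{\delta}^{l} = \phi^{\prime}(\mathbf{v}^{l}) \odot \left( (W^{l+1})^{T} \boldsymbol{\delta}^{l+1} \right), \qquad W^{l} \leftarrow W^{l} + \alpha\, \boldsymbol{\delta}^{l} (\mathbf{y}^{l})^{T}
$$

so each backward step amounts to a single matrix-vector product followed by an outer-product update.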

Explanation

Let us look at $(1)$. It says that the weights between the $l$th and $(l+1)$th layers are updated using the nodes $y_{i}^{l}$ of the $l$th layer, which makes sense because the output of each layer ultimately determines the output $\hat{\mathbf{y}}$. Moreover, $y_{i}^{l}$ acts as the input when propagating from the $l$th to the $(l+1)$th layer, so the update has the same form as training a linear regression model with LMS.

$$
\mathbf{w} \leftarrow \mathbf{w} - \alpha (\mathbf{w}^{T}\mathbf{x} - \mathbf{y}) \mathbf{x}
$$
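
For comparison, a minimal sketch of the LMS update above, assuming a scalar target for simplicity (function and variable names are illustrative only):

```python
import numpy as np

def lms_step(w, x, y, alpha=0.01):
    """One LMS update: w <- w - alpha * (w^T x - y) * x, with a scalar target y."""
    error = w @ x - y          # prediction error
    return w - alpha * error * x
```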

This optimization technique is called the backpropagation algorithm because the outputs $y_{j}^{l}$ of each layer are computed forward, from the input layer to the output layer, while the $\delta_{j}^{l}$ used for the update are computed backward, from the output layer to the input layer, as follows.

$$
\begin{align*}
\delta_{j}^{L} &= - \phi ^{\prime} (v_{j}^{L}) \dfrac{\partial E(\hat{\mathbf{y}})}{\partial \hat{y}_{j}} \\
\delta_{j}^{L-1} &= \phi ^{\prime} (v_{j}^{L-1}) \sum _{i} \delta_{i}^{L} w_{ij}^{L} \\
\delta_{j}^{L-2} &= \phi ^{\prime} (v_{j}^{L-2}) \sum _{i} \delta_{i}^{L-1} w_{ij}^{L-1} \\
\delta_{j}^{L-3} &= \phi ^{\prime} (v_{j}^{L-3}) \sum _{i} \delta_{i}^{L-2} w_{ij}^{L-2} \\
&\ \ \vdots \\
\delta_{j}^{1} &= \phi ^{\prime} (v_{j}^{1}) \sum _{i} \delta_{i}^{2} w_{ij}^{2} \\
\delta_{j}^{0} &= \phi ^{\prime} (v_{j}^{0}) \sum _{i} \delta_{i}^{1} w_{ij}^{1}
\end{align*}
$$
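
Put together, the theorem yields a complete training step. The following is a minimal NumPy sketch of one such step — forward pass, backward computation of the $\delta_{j}^{l}$, and the update $(1)$ — assuming, as in the earlier sketch, a sigmoid $\phi$ and the squared-error loss $E = \frac{1}{2}\lVert \hat{\mathbf{y}} - \mathbf{t} \rVert^{2}$; the function and variable names are illustrative, not from the article.

```python
import numpy as np

def phi(v):
    return 1.0 / (1.0 + np.exp(-v))

def phi_prime(v):
    s = phi(v)
    return s * (1.0 - s)

def backprop_step(x, target, weights, alpha=0.1):
    """One step of w_{ji}^l <- w_{ji}^l + alpha * delta_j^l * y_i^l,
    assuming the squared-error loss E = 0.5 * ||y_hat - target||^2."""
    # Forward pass: store y^0, ..., y^{L+1} and v^0, ..., v^L.
    ys, vs = [x], []
    for W in weights:
        vs.append(W @ ys[-1])
        ys.append(phi(vs[-1]))

    # Output layer: -delta^L = phi'(v^L) * dE/dy_hat, with dE/dy_hat = y_hat - target.
    deltas = [None] * len(weights)
    deltas[-1] = -phi_prime(vs[-1]) * (ys[-1] - target)

    # Hidden layers: delta^l = phi'(v^l) * sum_i delta_i^{l+1} w_{ij}^{l+1}.
    for l in range(len(weights) - 2, -1, -1):
        deltas[l] = phi_prime(vs[l]) * (weights[l + 1].T @ deltas[l + 1])

    # Update (1): w^l <- w^l + alpha * delta^l (y^l)^T  (outer product).
    for l in range(len(weights)):
        weights[l] += alpha * np.outer(deltas[l], ys[l])
    return weights
```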

Proof

Suppose the computation from the input layer to the output layer has been completed. Using gradient descent, we can modify the weights so that the loss function $E$ decreases.

$$
\begin{equation}
w_{ji}^{l} \leftarrow w_{ji}^{l} - \alpha \dfrac{\partial E(\hat{\mathbf{y}})}{\partial w_{ji}^{l}}
\label{gradesent}
\end{equation}
$$

Since each $y_{i}^{l}$ is a known value, the partial derivative can be rewritten in a computable form. By the chain rule, the partial derivative on the right-hand side is

$$
\begin{equation}
\dfrac{\partial E(\hat{\mathbf{y}})}{\partial w_{ji}^{l}} = \dfrac{\partial E(\hat{\mathbf{y}})}{\partial v_{j}^{l}} \dfrac{\partial v_{j}^{l}}{\partial w_{ji}^{l}} = \dfrac{\partial E(\hat{\mathbf{y}})}{\partial v_{j}^{l}} y_{i}^{l}
\label{chainrule}
\end{equation}
$$

where the last equality holds because $v_{j}^{l} = \sum_{i} w_{ji}^{l} y_{i}^{l}$ gives $\partial v_{j}^{l} / \partial w_{ji}^{l} = y_{i}^{l}$. Letting $-\delta_{j}^{l}$ denote the partial derivative $\partial E(\hat{\mathbf{y}}) / \partial v_{j}^{l}$ on the right-hand side of $(3)$, we obtain $(1)$ from $(2)$.

$$
w_{ji}^{l} \leftarrow w_{ji}^{l} + \alpha \delta^{l}_{j} y_{i}^{l}
$$

We now find $\delta_{j}^{l}$ for each layer as follows.

  • Case $l = L$

    For $j \in \left\{ 1, \dots, \hat{n} \right\}$, the following holds.

    $$
    \begin{equation}
    -\delta_{j}^{L} = \dfrac{\partial E (\hat{\mathbf{y}})}{\partial v_{j}^{L}} = \dfrac{\partial E ( \hat{\mathbf{y}} )}{\partial \hat{y}_{j}} \dfrac{d \hat{y}_{j}}{d v_{j}^{L}}
    \label{deltamL}
    \end{equation}
    $$

    Since $\hat{y}_{j} = \phi (v_{j}^{L})$, we get

    $$
    -\delta_{j}^{L} = \phi ^{\prime} (v_{j}^{L}) \dfrac{\partial E (\hat{\mathbf{y}})}{\partial \hat{y}_{j}}
    $$
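
    For instance, if the loss is assumed to be the squared error $E(\hat{\mathbf{y}}) = \frac{1}{2} \sum_{j} (\hat{y}_{j} - t_{j})^{2}$ for some target $\mathbf{t}$ (an assumption made here purely as an example), then $\partial E / \partial \hat{y}_{j} = \hat{y}_{j} - t_{j}$, so

    $$
    \delta_{j}^{L} = \phi^{\prime} (v_{j}^{L}) (t_{j} - \hat{y}_{j})
    $$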

  • Case $l = L-1$

    For $j \in \left\{ 1, \dots, n_{L} \right\}$, we have

    $$
    -\delta_{j}^{L-1} = \dfrac{\partial E (\hat{\mathbf{y}})}{\partial v_{j}^{L-1}} = \dfrac{\partial E ( \hat{\mathbf{y}} )}{\partial y_{j}^{L}} \dfrac{d y_{j}^{L}}{d v_{j}^{L-1}}
    $$

    Since $y_{j}^{L} = \phi (v_{j}^{L-1})$, we get

    $$
    -\delta_{j}^{L-1} = \dfrac{\partial E (\hat{\mathbf{y}})}{\partial y_{j}^{L}} \dfrac{\partial y_{j}^{L}}{\partial v_{j}^{L-1}} = \phi ^{\prime} (v_{j}^{L-1}) \dfrac{\partial E(\hat{\mathbf{y}})}{\partial y_{j}^{L}}
    $$

    The partial derivative on the right-hand side is computed by the chain rule as follows.

    $$
    \begin{align*}
    -\delta_{j}^{L-1} &= \phi ^{\prime} (v_{j}^{L-1}) \dfrac{\partial E(\hat{\mathbf{y}})}{\partial y_{j}^{L}} \\
    &= \phi ^{\prime} (v_{j}^{L-1}) \sum _{i} \dfrac{\partial E(\hat{\mathbf{y}})}{\partial \hat{y}_{i}} \dfrac{\partial \hat{y}_{i}}{\partial y_{j}^{L}} \\
    &= \phi ^{\prime} (v_{j}^{L-1}) \sum _{i} \dfrac{\partial E(\hat{\mathbf{y}})}{\partial \hat{y}_{i}} \dfrac{d \hat{y}_{i}}{d v_{i}^{L}} \dfrac{\partial v_{i}^{L}}{\partial y_{j}^{L}}
    \end{align*}
    $$

    Here, by $(4)$ and ${\color{green} v_{i}^{L} = \sum_{j} w_{ij}^{L} y_{j}^{L}}$, we get the following.

    $$
    \begin{align}
    -\delta_{j}^{L-1} &= \phi ^{\prime} (v_{j}^{L-1}) \sum _{i} {\color{blue}\dfrac{\partial E(\hat{\mathbf{y}})}{\partial \hat{y}_{i}} \dfrac{d \hat{y}_{i}}{d v_{i}^{L}}} {\color{green} \dfrac{\partial v_{i}^{L}}{\partial y_{j}^{L}}} \nonumber \\
    &= \phi ^{\prime} (v_{j}^{L-1}) \sum _{i} \left( {\color{blue} -\delta_{i}^{L}} \right) {\color{green} w_{ij}^{L}} \nonumber \\
    \implies \delta_{j}^{L-1} &= \phi ^{\prime} (v_{j}^{L-1}) \sum _{i} \delta_{i}^{L} w_{ij}^{L} \label{deltajL-1}
    \end{align}
    $$

  • Case $l = L-2$

    For $j \in \left\{ 1, \dots, n_{L-1} \right\}$,

    $$
    -\delta_{j}^{L-2} = \dfrac{\partial E (\hat{\mathbf{y}})}{\partial v_{j}^{L-2}} = \dfrac{\partial E ( \hat{\mathbf{y}} )}{\partial y_{j}^{L-1}} \dfrac{d y_{j}^{L-1}}{d v_{j}^{L-2}}
    $$

    Since $y_{j}^{L-1} = \phi (v_{j}^{L-2})$, we get

    $$
    -\delta_{j}^{L-2} = \dfrac{\partial E (\hat{\mathbf{y}})}{\partial y_{j}^{L-1}} \dfrac{d y_{j}^{L-1}}{d v_{j}^{L-2}} = \phi ^{\prime} (v_{j}^{L-2}) \dfrac{\partial E(\hat{\mathbf{y}})}{\partial y_{j}^{L-1}}
    $$

    The partial derivative on the right-hand side is computed by the chain rule as follows.

    $$
    \begin{align*}
    -\delta_{j}^{L-2} &= \phi ^{\prime} (v_{j}^{L-2}) \dfrac{\partial E(\hat{\mathbf{y}})}{\partial y_{j}^{L-1}} \\
    &= \phi ^{\prime} (v_{j}^{L-2}) \sum _{i} \dfrac{\partial E(\hat{\mathbf{y}})}{\partial \hat{y}_{i}} \dfrac{\partial \hat{y}_{i}}{\partial y_{j}^{L-1}} \\
    &= \phi ^{\prime} (v_{j}^{L-2}) \sum _{i} \dfrac{\partial E(\hat{\mathbf{y}})}{\partial \hat{y}_{i}} \dfrac{d \hat{y}_{i}}{d v_{i}^{L}} \dfrac{\partial v_{i}^{L}}{\partial y_{j}^{L-1}} \\
    &= \phi ^{\prime} (v_{j}^{L-2}) \sum _{i} \dfrac{\partial E(\hat{\mathbf{y}})}{\partial \hat{y}_{i}} \dfrac{d \hat{y}_{i}}{d v_{i}^{L}} \sum _{k} \dfrac{\partial v_{i}^{L}}{\partial y_{k}^{L}} \dfrac{\partial y_{k}^{L}}{\partial y_{j}^{L-1}} \\
    &= \phi ^{\prime} (v_{j}^{L-2}) \sum _{i} \dfrac{\partial E(\hat{\mathbf{y}})}{\partial \hat{y}_{i}} \dfrac{d \hat{y}_{i}}{d v_{i}^{L}} \sum _{k} \dfrac{\partial v_{i}^{L}}{\partial y_{k}^{L}} \dfrac{d y_{k}^{L}}{d v_{k}^{L-1}} \dfrac{\partial v_{k}^{L-1}}{\partial y_{j}^{L-1}} \\
    &= \phi ^{\prime} (v_{j}^{L-2}) \sum _{k} \sum _{i} {\color{blue}\dfrac{\partial E(\hat{\mathbf{y}})}{\partial \hat{y}_{i}} \dfrac{d \hat{y}_{i}}{d v_{i}^{L}}} {\color{red}\dfrac{\partial v_{i}^{L}}{\partial y_{k}^{L}}} {\color{green}\dfrac{d y_{k}^{L}}{d v_{k}^{L-1}}} {\color{purple}\dfrac{\partial v_{k}^{L-1}}{\partial y_{j}^{L-1}}} \\
    &= \phi ^{\prime} (v_{j}^{L-2}) \sum _{k} \sum _{i} \left( {\color{blue} -\delta_{i}^{L}} \right) {\color{red} w_{ik}^{L}} {\color{green} \phi^{\prime}(v_{k}^{L-1})} {\color{purple} w_{kj}^{L-1}}
    \end{align*}
    $$

    Thus we obtain the following.

    $$
    \delta_{j}^{L-2} = \phi ^{\prime} (v_{j}^{L-2}) \sum _{k} \sum _{i} \delta_{i}^{L} w_{ik}^{L} \phi^{\prime}(v_{k}^{L-1}) w_{kj}^{L-1}
    $$

    Then, by $(5)$, the following holds.

    $$
    \sum _{i} \delta_{i}^{L} w_{ik}^{L} \phi^{\prime}(v_{k}^{L-1}) = \phi^{\prime}(v_{k}^{L-1}) \sum _{i} \delta_{i}^{L} w_{ik}^{L} = \delta_{k}^{L-1}
    $$

    Therefore, we get the following.

    $$
    \begin{align*}
    \delta_{j}^{L-2} &= \phi ^{\prime} (v_{j}^{L-2}) \sum _{k} \delta_{k}^{L-1} w_{kj}^{L-1} \\
    &= \phi ^{\prime} (v_{j}^{L-2}) \sum _{i} \delta_{i}^{L-1} w_{ij}^{L-1}
    \end{align*}
    $$

  • Generalization: $l \in \left\{ 0, \dots, L-1 \right\}$

    Based on the above results, we can generalize as follows. For $j \in \left\{ 1, \dots, n_{l+1} \right\}$,

    $$
    -\delta_{j}^{l} = \phi ^{\prime} (v_{j}^{l}) \dfrac{\partial E(\hat{\mathbf{y}})}{\partial y_{j}^{l+1}}
    $$

    Expanding the partial derivative on the right-hand side by the chain rule, and then collapsing the sums using the results of the previous cases, gives the following.

    $$
    \begin{align*}
    \delta_{j}^{l} &= -\phi ^{\prime} (v_{j}^{l}) \dfrac{\partial E(\hat{\mathbf{y}})}{\partial y_{j}^{l+1}} \\
    &= -\phi ^{\prime} (v_{j}^{l}) \sum_{i_{(1)}} \frac{\partial E(\hat{\mathbf{y}})}{\partial \hat{y}_{i_{(1)}}} \frac{\partial \hat{y}_{i_{(1)}}}{\partial y_{j}^{l+1}} \\
    &= -\phi ^{\prime} (v_{j}^{l}) \sum_{i_{(1)}} \frac{\partial E(\hat{\mathbf{y}})}{\partial \hat{y}_{i_{(1)}}} \frac{d \hat{y}_{i_{(1)}}}{d v_{i_{(1)}}^{L}} \frac{\partial v_{i_{(1)}}^{L}}{\partial y_{j}^{l+1}} \\
    &= -\phi ^{\prime} (v_{j}^{l}) \sum_{i_{(2)}} \sum_{i_{(1)}} \frac{\partial E(\hat{\mathbf{y}})}{\partial \hat{y}_{i_{(1)}}} \frac{d \hat{y}_{i_{(1)}}}{d v_{i_{(1)}}^{L}} \frac{\partial v_{i_{(1)}}^{L}}{\partial y_{i_{(2)}}^{L}} \frac{\partial y_{i_{(2)}}^{L}}{\partial y_{j}^{l+1}} \\
    &= -\phi ^{\prime} (v_{j}^{l}) \sum_{i_{(2)}} \sum_{i_{(1)}} \frac{\partial E(\hat{\mathbf{y}})}{\partial \hat{y}_{i_{(1)}}} \frac{d \hat{y}_{i_{(1)}}}{d v_{i_{(1)}}^{L}} \frac{\partial v_{i_{(1)}}^{L}}{\partial y_{i_{(2)}}^{L}} \frac{d y_{i_{(2)}}^{L}}{d v_{i_{(2)}}^{L-1}} \frac{\partial v_{i_{(2)}}^{L-1}}{\partial y_{j}^{l+1}} \\
    &= -\phi ^{\prime} (v_{j}^{l}) \sum_{i_{(3)}} \sum_{i_{(2)}} \sum_{i_{(1)}} \frac{\partial E(\hat{\mathbf{y}})}{\partial \hat{y}_{i_{(1)}}} \frac{d \hat{y}_{i_{(1)}}}{d v_{i_{(1)}}^{L}} \frac{\partial v_{i_{(1)}}^{L}}{\partial y_{i_{(2)}}^{L}} \frac{d y_{i_{(2)}}^{L}}{d v_{i_{(2)}}^{L-1}} \frac{\partial v_{i_{(2)}}^{L-1}}{\partial y_{i_{(3)}}^{L-1}} \frac{\partial y_{i_{(3)}}^{L-1}}{\partial y_{j}^{l+1}} \\
    &\quad \vdots \\
    &= -\phi ^{\prime} (v_{j}^{l}) \sum_{i_{(L-l)}} \cdots \sum_{i_{(2)}} \sum_{i_{(1)}} \frac{\partial E(\hat{\mathbf{y}})}{\partial \hat{y}_{i_{(1)}}} \frac{d \hat{y}_{i_{(1)}}}{d v_{i_{(1)}}^{L}} \frac{\partial v_{i_{(1)}}^{L}}{\partial y_{i_{(2)}}^{L}} \frac{d y_{i_{(2)}}^{L}}{d v_{i_{(2)}}^{L-1}} \frac{\partial v_{i_{(2)}}^{L-1}}{\partial y_{i_{(3)}}^{L-1}} \cdots \frac{d y_{i_{(L-l)}}^{l+2}}{d v_{i_{(L-l)}}^{l+1}} \frac{\partial v_{i_{(L-l)}}^{l+1}}{\partial y_{j}^{l+1}} \\
    &= \phi ^{\prime} (v_{j}^{l}) \sum_{i_{(L-l)}} \cdots \sum_{i_{(2)}} \sum_{i_{(1)}} \delta_{i_{(1)}}^{L} w_{i_{(1)} i_{(2)}}^{L} \phi^{\prime}(v_{i_{(2)}}^{L-1}) w_{i_{(2)} i_{(3)}}^{L-1} \phi^{\prime}(v_{i_{(3)}}^{L-2}) w_{i_{(3)} i_{(4)}}^{L-2} \cdots \phi^{\prime}(v_{i_{(L-l)}}^{l+1}) w_{i_{(L-l)} j}^{l+1} \\
    &= \phi ^{\prime} (v_{j}^{l}) \sum_{i_{(L-l)}} \cdots \sum_{i_{(3)}} \sum_{i_{(2)}} \delta_{i_{(2)}}^{L-1} w_{i_{(2)} i_{(3)}}^{L-1} \phi^{\prime}(v_{i_{(3)}}^{L-2}) w_{i_{(3)} i_{(4)}}^{L-2} \cdots \phi^{\prime}(v_{i_{(L-l)}}^{l+1}) w_{i_{(L-l)} j}^{l+1} \\
    &= \phi ^{\prime} (v_{j}^{l}) \sum_{i_{(L-l)}} \cdots \sum_{i_{(3)}} \delta_{i_{(3)}}^{L-2} w_{i_{(3)} i_{(4)}}^{L-2} \cdots \phi^{\prime}(v_{i_{(L-l)}}^{l+1}) w_{i_{(L-l)} j}^{l+1} \\
    &\quad \vdots \\
    &= \phi ^{\prime} (v_{j}^{l}) \sum_{i_{(L-l)}} \delta_{i_{(L-l)}}^{l+1} w_{i_{(L-l)} j}^{l+1}
    \end{align*}
    $$

    Therefore, to summarize,

    $$
    \delta_{j}^{l} = \phi ^{\prime} (v_{j}^{l}) \sum_{i} \delta_{i}^{l+1} w_{ij}^{l+1}
    $$
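
As a sanity check on this result, one can compare the deltas produced by the backward recursion against numerical derivatives of the loss. The sketch below does this for a small random network, again assuming a sigmoid $\phi$ and the squared-error loss purely for illustration; all names and sizes are made up for the example.

```python
import numpy as np

def phi(v):
    return 1.0 / (1.0 + np.exp(-v))

def phi_prime(v):
    s = phi(v)
    return s * (1.0 - s)

def loss(weights, x, target):
    """E = 0.5 * ||y_hat - target||^2 after a full forward pass."""
    y = x
    for W in weights:
        y = phi(W @ y)
    return 0.5 * np.sum((y - target) ** 2)

rng = np.random.default_rng(1)
sizes = (3, 5, 4, 2)                      # n = (n_0, n_1, n_2, n_hat)
weights = [rng.normal(size=(m, n)) for n, m in zip(sizes, sizes[1:])]
x, target = rng.normal(size=3), rng.normal(size=2)

# Backward recursion for the deltas, as in the theorem.
ys, vs = [x], []
for W in weights:
    vs.append(W @ ys[-1])
    ys.append(phi(vs[-1]))
deltas = [None] * len(weights)
deltas[-1] = -phi_prime(vs[-1]) * (ys[-1] - target)
for l in range(len(weights) - 2, -1, -1):
    deltas[l] = phi_prime(vs[l]) * (weights[l + 1].T @ deltas[l + 1])

# Numerical check: dE/dw_{ji}^l should equal -delta_j^l * y_i^l.
eps, l, j, i = 1e-6, 0, 2, 1
W_plus = [W.copy() for W in weights]
W_plus[l][j, i] += eps
W_minus = [W.copy() for W in weights]
W_minus[l][j, i] -= eps
numeric = (loss(W_plus, x, target) - loss(W_minus, x, target)) / (2 * eps)
analytic = -deltas[l][j] * ys[l][i]
print(numeric, analytic)   # the two values should agree closely
```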