Back Propagation Algorithm
This article is written for math majors to understand the principles of the backpropagation algorithm.
Notation

Consider an artificial neural network like the one shown above. Let $x = (x_1, x_2, \dots, x_{n_0})$ be the input, let $y_j^l$ denote the $j$th node of the $l$th layer, and let $\hat{y} = (\hat{y}_1, \hat{y}_2, \dots, \hat{y}_{\hat{n}})$ be the output.
Let $L \in \mathbb{N}$ be the number of hidden layers, and let the components of $n = (n_0, n_1, \dots, n_L, \hat{n}) \in \mathbb{N}^{L+2}$ be the numbers of nodes in the input layer, the $L$ hidden layers, and the output layer, in that order. For convenience, call the input layer the $0$th hidden layer and the output layer the $(L+1)$th hidden layer.
Let $w_{ji}^l$ denote the weight connecting the $i$th node in the $l$th layer to the $j$th node in the next layer. Propagation from each layer to the next then occurs as shown in the image below.

where $\phi$ is an arbitrary activation function. Let $v_j^l$ denote the linear combination passed from the $l$th layer to the $j$th node of the next layer.

$$
\begin{aligned}
v_j^l &= \sum_{i=1}^{n_l} w_{ji}^l y_i^l \\
y_j^{l+1} &= \phi(v_j^l) = \phi\left( \sum_{i=1}^{n_l} w_{ji}^l y_i^l \right)
\end{aligned}
$$
To summarize:

| Symbol | Meaning |
|---|---|
| $x = (x_1, x_2, \dots, x_{n_0})$ | Input |
| $y_j^l$ | The $j$th node in the $l$th layer |
| $\hat{y} = (\hat{y}_1, \hat{y}_2, \dots, \hat{y}_{\hat{n}})$ | Output |
| $n_l$ | Number of nodes in the $l$th layer |
| $w_{ji}^l$ | The weight connecting the $i$th node in the $l$th layer to the $j$th node in the next layer |
| $\phi$ | Activation function |
| $v_j^l = \sum_{i=1}^{n_l} w_{ji}^l y_i^l$ | Linear combination |
| $y_j^{l+1} = \phi(v_j^l)$ | Propagation from the $l$th layer to the next layer |
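As a concrete illustration of the propagation rule above, here is a minimal forward-pass sketch in NumPy. The sigmoid activation and the random weights are assumptions made only for this example; the article itself leaves $\phi$ and the weights arbitrary.

```python
import numpy as np

def phi(v):
    """Sigmoid activation, chosen only for illustration."""
    return 1.0 / (1.0 + np.exp(-v))

def forward(x, weights):
    """Compute y^0, ..., y^{L+1} and v^0, ..., v^L layer by layer.

    weights[l] is the matrix (w_ji^l) mapping layer l to layer l+1.
    """
    ys = [np.asarray(x, dtype=float)]   # y^0 = x (input layer)
    vs = []
    for W in weights:
        v = W @ ys[-1]                  # v_j^l = sum_i w_ji^l y_i^l
        vs.append(v)
        ys.append(phi(v))               # y_j^{l+1} = phi(v_j^l)
    return ys, vs

# Example: n = (3, 4, 2), i.e. one hidden layer with 4 nodes.
rng = np.random.default_rng(0)
weights = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]
ys, vs = forward([0.1, -0.2, 0.3], weights)
print(ys[-1])   # the output y-hat
```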
Theorem
Let $E = E(\hat{y})$ be a suitable differentiable loss function. Then $E$ can be optimized by updating the weights $w_{ji}^l$ at each layer as follows.

$$
w_{ji}^l \gets w_{ji}^l + \alpha \delta_j^l y_i^l \tag{1}
$$

Here $\alpha$ is the learning rate and $\delta_j^l$ is defined as follows: for $l = L$,

$$
-\delta_j^L = \phi'(v_j^L) \frac{\partial E(\hat{y})}{\partial \hat{y}_j}
$$

and for $l \in \{0, \dots, L-1\}$,

$$
\delta_j^l = \phi'(v_j^l) \sum_i \delta_i^{l+1} w_{ij}^{l+1}
$$
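In vectorized form, update (1) adds $\alpha$ times the outer product of $\delta^l$ and $y^l$ to the weight matrix of layer $l$. A minimal sketch, assuming $\delta^l$ and $y^l$ have already been computed (`delta`, `y`, and `lr` are illustrative names):

```python
import numpy as np

def update_weights(W, delta, y, lr=0.1):
    """Apply w_ji <- w_ji + lr * delta_j * y_i to a whole layer at once.

    W     : the matrix (w_ji^l); row j corresponds to delta_j^l, column i to y_i^l
    delta : vector of delta_j^l values for this layer
    y     : vector of node values y_i^l entering this layer
    """
    return W + lr * np.outer(delta, y)
```

Calling this once per layer with the corresponding $\delta^l$ and $y^l$ performs one full step of (1).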
Explanation
Let’s look at (1). It says that the nodes $y_i^l$ of the $l$th layer are used to update the weights between the $l$th and $(l+1)$th layers, which makes sense, since the output of each layer ultimately determines the output $\hat{y}$. Also, the $y_i^l$ can be viewed as inputs as they propagate from the $l$th to the $(l+1)$th layer, which is similar to how a linear regression model is trained with LMS:

$$
w \gets w - \alpha \left( w^{T} x - y \right) x
$$
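For comparison, a single LMS step for one training pair $(x, y)$ can be sketched as follows (names are illustrative):

```python
import numpy as np

def lms_step(w, x, y, lr=0.01):
    """One least-mean-squares update: w <- w - lr * (w^T x - y) * x."""
    error = w @ x - y          # scalar residual w^T x - y
    return w - lr * error * x
```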
This optimization technique is called the backpropagation algorithm because the outputs $y_j^l$ of each layer are computed forward, from the input layer to the output layer, while the $\delta_j^l$ used for optimization are computed backwards, from the output layer to the input layer, as follows.

$$
\begin{aligned}
\delta_j^L &= -\phi'(v_j^L) \frac{\partial E(\hat{y})}{\partial \hat{y}_j} \\
\delta_j^{L-1} &= \phi'(v_j^{L-1}) \sum_i \delta_i^L w_{ij}^L \\
\delta_j^{L-2} &= \phi'(v_j^{L-2}) \sum_i \delta_i^{L-1} w_{ij}^{L-1} \\
\delta_j^{L-3} &= \phi'(v_j^{L-3}) \sum_i \delta_i^{L-2} w_{ij}^{L-2} \\
&\;\;\vdots \\
\delta_j^{1} &= \phi'(v_j^{1}) \sum_i \delta_i^{2} w_{ij}^{2} \\
\delta_j^{0} &= \phi'(v_j^{0}) \sum_i \delta_i^{1} w_{ij}^{1}
\end{aligned}
$$
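The backward sweep above translates directly into code. Below is a minimal sketch that assumes the forward pass has cached every $v^l$, a sigmoid $\phi$ (so that $\phi'(v) = \phi(v)(1 - \phi(v))$), and a precomputed gradient `dE_dyhat` of the loss with respect to the output; all of these are assumptions of the example, not requirements of the theorem.

```python
import numpy as np

def phi(v):
    return 1.0 / (1.0 + np.exp(-v))

def phi_prime(v):
    s = phi(v)
    return s * (1.0 - s)

def backward(vs, weights, dE_dyhat):
    """Compute delta^L, delta^{L-1}, ..., delta^0 from output to input.

    vs[l]      : the vector (v_j^l) cached during the forward pass
    weights[l] : the matrix (w_ji^l)
    dE_dyhat   : the vector of partial derivatives dE/d(y-hat_j)
    """
    L = len(weights) - 1                      # weights holds w^0, ..., w^L
    deltas = [None] * (L + 1)
    deltas[L] = -phi_prime(vs[L]) * dE_dyhat  # -delta^L = phi'(v^L) dE/d(y-hat)
    for l in range(L - 1, -1, -1):
        # delta_j^l = phi'(v_j^l) * sum_i delta_i^{l+1} w_ij^{l+1}
        deltas[l] = phi_prime(vs[l]) * (weights[l + 1].T @ deltas[l + 1])
    return deltas
```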
Proof
Suppose the computation from the input layer to the output layer is finished. Using gradient descent, we can modify the weights so that the loss function $E$ decreases.

$$
w_{ji}^l \gets w_{ji}^l - \alpha \frac{\partial E(\hat{y})}{\partial w_{ji}^l} \tag{2}
$$

Since each $y_i^l$ is a known value, the partial derivative can be written in a computable form. By the chain rule, the partial derivative on the right-hand side is

$$
\frac{\partial E(\hat{y})}{\partial w_{ji}^l} = \frac{\partial E(\hat{y})}{\partial v_j^l} \frac{\partial v_j^l}{\partial w_{ji}^l} = \frac{\partial E(\hat{y})}{\partial v_j^l} \, y_i^l \tag{3}
$$
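The second factor in (3) comes straight from the definition of $v_j^l$: when differentiating with respect to $w_{ji}^l$, every term of the sum except the one containing $w_{ji}^l$ vanishes.

$$
\frac{\partial v_j^l}{\partial w_{ji}^l} = \frac{\partial}{\partial w_{ji}^l} \sum_{k=1}^{n_l} w_{jk}^l y_k^l = y_i^l
$$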
Writing $-\delta_j^l$ for the partial derivative $\dfrac{\partial E(\hat{y})}{\partial v_j^l}$ on the right-hand side of (3), we obtain (1) from (2).

$$
w_{ji}^l \gets w_{ji}^l + \alpha \delta_j^l y_i^l
$$
It remains to find $\delta_j^l$ at each layer, as follows.
Case $l = L$
For $j \in \{1, \dots, \hat{n}\}$, the following holds.

$$
-\delta_j^L = \frac{\partial E(\hat{y})}{\partial v_j^L} = \frac{\partial E(\hat{y})}{\partial \hat{y}_j} \frac{d \hat{y}_j}{d v_j^L}
$$
Since $\hat{y}_j = \phi(v_j^L)$, we get

$$
-\delta_j^L = \phi'(v_j^L) \frac{\partial E(\hat{y})}{\partial \hat{y}_j} \tag{4}
$$
■
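As a worked instance of this case (the squared-error loss and the sigmoid are assumptions of the example, not part of the theorem): take $E(\hat{y}) = \frac{1}{2} \sum_j (\hat{y}_j - t_j)^2$ for a target $t$ and let $\phi$ be the sigmoid, so that $\phi'(v_j^L) = \hat{y}_j (1 - \hat{y}_j)$. Then

$$
\frac{\partial E(\hat{y})}{\partial \hat{y}_j} = \hat{y}_j - t_j
\qquad \Longrightarrow \qquad
\delta_j^L = -\phi'(v_j^L)(\hat{y}_j - t_j) = \hat{y}_j (1 - \hat{y}_j)(t_j - \hat{y}_j),
$$

and update (1) for the last layer becomes $w_{ji}^L \gets w_{ji}^L + \alpha\, \hat{y}_j (1 - \hat{y}_j)(t_j - \hat{y}_j)\, y_i^L$.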
Case $l = L-1$
For $j \in \{1, \dots, n_L\}$, we have

$$
-\delta_j^{L-1} = \frac{\partial E(\hat{y})}{\partial v_j^{L-1}} = \frac{\partial E(\hat{y})}{\partial y_j^L} \frac{d y_j^L}{d v_j^{L-1}}
$$
Since $y_j^L = \phi(v_j^{L-1})$, we get

$$
-\delta_j^{L-1} = \frac{\partial E(\hat{y})}{\partial y_j^L} \frac{\partial y_j^L}{\partial v_j^{L-1}} = \phi'(v_j^{L-1}) \frac{\partial E(\hat{y})}{\partial y_j^L}
$$
The partial derivative on the right-hand side is computed by the chain rule as follows.
$$
\begin{aligned}
-\delta_j^{L-1} &= \phi'(v_j^{L-1}) \frac{\partial E(\hat{y})}{\partial y_j^L} \\
&= \phi'(v_j^{L-1}) \sum_i \frac{\partial E(\hat{y})}{\partial \hat{y}_i} \frac{\partial \hat{y}_i}{\partial y_j^L} \\
&= \phi'(v_j^{L-1}) \sum_i \frac{\partial E(\hat{y})}{\partial \hat{y}_i} \frac{d \hat{y}_i}{d v_i^L} \frac{\partial v_i^L}{\partial y_j^L}
\end{aligned}
$$

Here, by (4) and $v_i^L = \sum_j w_{ij}^L y_j^L$, we get the following.

$$
\begin{aligned}
-\delta_j^{L-1} &= \phi'(v_j^{L-1}) \sum_i \frac{\partial E(\hat{y})}{\partial \hat{y}_i} \frac{\partial \hat{y}_i}{\partial v_i^L} \frac{d v_i^L}{d y_j^L} = \phi'(v_j^{L-1}) \sum_i \left( -\delta_i^L \right) w_{ij}^L \\
\implies \delta_j^{L-1} &= \phi'(v_j^{L-1}) \sum_i \delta_i^L w_{ij}^L
\end{aligned}
\tag{5}
$$
■
Case $l = L-2$
For $j \in \{1, \dots, n_{L-1}\}$,

$$
-\delta_j^{L-2} = \frac{\partial E(\hat{y})}{\partial v_j^{L-2}} = \frac{\partial E(\hat{y})}{\partial y_j^{L-1}} \frac{d y_j^{L-1}}{d v_j^{L-2}}
$$
Since $y_j^{L-1} = \phi(v_j^{L-2})$, we get

$$
-\delta_j^{L-2} = \frac{\partial E(\hat{y})}{\partial y_j^{L-1}} \frac{d y_j^{L-1}}{d v_j^{L-2}} = \phi'(v_j^{L-2}) \frac{\partial E(\hat{y})}{\partial y_j^{L-1}}
$$
The partial derivative on the right-hand side is computed by the chain rule as follows.
$$
\begin{aligned}
-\delta_j^{L-2} &= \phi'(v_j^{L-2}) \frac{\partial E(\hat{y})}{\partial y_j^{L-1}} \\
&= \phi'(v_j^{L-2}) \sum_i \frac{\partial E(\hat{y})}{\partial \hat{y}_i} \frac{\partial \hat{y}_i}{\partial y_j^{L-1}} \\
&= \phi'(v_j^{L-2}) \sum_i \frac{\partial E(\hat{y})}{\partial \hat{y}_i} \frac{d \hat{y}_i}{d v_i^L} \frac{\partial v_i^L}{\partial y_j^{L-1}} \\
&= \phi'(v_j^{L-2}) \sum_i \frac{\partial E(\hat{y})}{\partial \hat{y}_i} \frac{d \hat{y}_i}{d v_i^L} \sum_k \frac{\partial v_i^L}{\partial y_k^L} \frac{\partial y_k^L}{\partial y_j^{L-1}} \\
&= \phi'(v_j^{L-2}) \sum_i \frac{\partial E(\hat{y})}{\partial \hat{y}_i} \frac{d \hat{y}_i}{d v_i^L} \sum_k \frac{\partial v_i^L}{\partial y_k^L} \frac{d y_k^L}{d v_k^{L-1}} \frac{\partial v_k^{L-1}}{\partial y_j^{L-1}} \\
&= \phi'(v_j^{L-2}) \sum_k \sum_i \frac{\partial E(\hat{y})}{\partial \hat{y}_i} \frac{d \hat{y}_i}{d v_i^L} \frac{\partial v_i^L}{\partial y_k^L} \frac{d y_k^L}{d v_k^{L-1}} \frac{\partial v_k^{L-1}}{\partial y_j^{L-1}} \\
&= \phi'(v_j^{L-2}) \sum_k \sum_i \left( -\delta_i^L \right) w_{ik}^L \, \phi'(v_k^{L-1}) \, w_{kj}^{L-1}
\end{aligned}
$$
So we get the following.

$$
\delta_j^{L-2} = \phi'(v_j^{L-2}) \sum_k \sum_i \delta_i^L w_{ik}^L \, \phi'(v_k^{L-1}) \, w_{kj}^{L-1}
$$
Then, by (5), the following holds.
$$
\sum_i \delta_i^L w_{ik}^L \, \phi'(v_k^{L-1}) = \phi'(v_k^{L-1}) \sum_i \delta_i^L w_{ik}^L = \delta_k^{L-1}
$$

Therefore we get the following.

$$
\delta_j^{L-2} = \phi'(v_j^{L-2}) \sum_k \delta_k^{L-1} w_{kj}^{L-1} = \phi'(v_j^{L-2}) \sum_i \delta_i^{L-1} w_{ij}^{L-1}
$$
■
Generalization: $l \in \{0, \dots, L-1\}$
Based on the above results, we can generalize as follows. For $j \in \{1, \dots, n_{l+1}\}$,

$$
-\delta_j^l = \phi'(v_j^l) \frac{\partial E(\hat{y})}{\partial y_j^{l+1}}
$$

Expanding the partial derivative on the right-hand side by the chain rule gives the following.

$$
\begin{aligned}
\delta_j^l &= -\phi'(v_j^l) \frac{\partial E(\hat{y})}{\partial y_j^{l+1}} \\
&= -\phi'(v_j^l) \sum_{i(1)} \frac{\partial E(\hat{y})}{\partial \hat{y}_{i(1)}} \frac{\partial \hat{y}_{i(1)}}{\partial y_j^{l+1}} \\
&= -\phi'(v_j^l) \sum_{i(1)} \frac{\partial E(\hat{y})}{\partial \hat{y}_{i(1)}} \frac{d \hat{y}_{i(1)}}{d v_{i(1)}^L} \frac{\partial v_{i(1)}^L}{\partial y_j^{l+1}} \\
&= -\phi'(v_j^l) \sum_{i(2)} \sum_{i(1)} \frac{\partial E(\hat{y})}{\partial \hat{y}_{i(1)}} \frac{d \hat{y}_{i(1)}}{d v_{i(1)}^L} \frac{\partial v_{i(1)}^L}{\partial y_{i(2)}^L} \frac{\partial y_{i(2)}^L}{\partial y_j^{l+1}} \\
&= -\phi'(v_j^l) \sum_{i(2)} \sum_{i(1)} \frac{\partial E(\hat{y})}{\partial \hat{y}_{i(1)}} \frac{d \hat{y}_{i(1)}}{d v_{i(1)}^L} \frac{\partial v_{i(1)}^L}{\partial y_{i(2)}^L} \frac{d y_{i(2)}^L}{d v_{i(2)}^{L-1}} \frac{\partial v_{i(2)}^{L-1}}{\partial y_j^{l+1}} \\
&= -\phi'(v_j^l) \sum_{i(3)} \sum_{i(2)} \sum_{i(1)} \frac{\partial E(\hat{y})}{\partial \hat{y}_{i(1)}} \frac{d \hat{y}_{i(1)}}{d v_{i(1)}^L} \frac{\partial v_{i(1)}^L}{\partial y_{i(2)}^L} \frac{d y_{i(2)}^L}{d v_{i(2)}^{L-1}} \frac{\partial v_{i(2)}^{L-1}}{\partial y_{i(3)}^{L-1}} \frac{\partial y_{i(3)}^{L-1}}{\partial y_j^{l+1}} \\
&\;\;\vdots \\
&= -\phi'(v_j^l) \sum_{i(L-l)} \cdots \sum_{i(2)} \sum_{i(1)} \frac{\partial E(\hat{y})}{\partial \hat{y}_{i(1)}} \frac{d \hat{y}_{i(1)}}{d v_{i(1)}^L} \frac{\partial v_{i(1)}^L}{\partial y_{i(2)}^L} \frac{d y_{i(2)}^L}{d v_{i(2)}^{L-1}} \frac{\partial v_{i(2)}^{L-1}}{\partial y_{i(3)}^{L-1}} \cdots \frac{d y_{i(L-l)}^{l+2}}{d v_{i(L-l)}^{l+1}} \frac{\partial v_{i(L-l)}^{l+1}}{\partial y_j^{l+1}} \\
&= \phi'(v_j^l) \sum_{i(L-l)} \cdots \sum_{i(2)} \sum_{i(1)} \delta_{i(1)}^L w_{i(1)i(2)}^L \, \phi'(v_{i(2)}^{L-1}) \, w_{i(2)i(3)}^{L-1} \, \phi'(v_{i(3)}^{L-2}) \, w_{i(3)i(4)}^{L-2} \cdots \phi'(v_{i(L-l)}^{l+1}) \, w_{i(L-l)j}^{l+1} \\
&= \phi'(v_j^l) \sum_{i(L-l)} \cdots \sum_{i(2)} \delta_{i(2)}^{L-1} w_{i(2)i(3)}^{L-1} \, \phi'(v_{i(3)}^{L-2}) \, w_{i(3)i(4)}^{L-2} \cdots \phi'(v_{i(L-l)}^{l+1}) \, w_{i(L-l)j}^{l+1} \\
&= \phi'(v_j^l) \sum_{i(L-l)} \cdots \sum_{i(3)} \delta_{i(3)}^{L-2} w_{i(3)i(4)}^{L-2} \cdots \phi'(v_{i(L-l)}^{l+1}) \, w_{i(L-l)j}^{l+1} \\
&\;\;\vdots \\
&= \phi'(v_j^l) \sum_{i(L-l)} \delta_{i(L-l)}^{l+1} w_{i(L-l)j}^{l+1}
\end{aligned}
$$
Therefore, to summarize:

$$
\delta_j^l = \phi'(v_j^l) \sum_i \delta_i^{l+1} w_{ij}^{l+1}
$$
■
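As a numerical sanity check of the theorem, the sketch below compares the backpropagation quantity $\delta_j^l y_i^l$ with a central finite-difference estimate of $-\partial E(\hat{y}) / \partial w_{ji}^l$ on a small random network. The sigmoid activation, squared-error loss, layer sizes, and all variable names are assumptions of this example.

```python
import numpy as np

def phi(v):
    return 1.0 / (1.0 + np.exp(-v))

def phi_prime(v):
    s = phi(v)
    return s * (1.0 - s)

def forward(x, weights):
    ys, vs = [x], []
    for W in weights:
        vs.append(W @ ys[-1])           # v^l
        ys.append(phi(vs[-1]))          # y^{l+1}
    return ys, vs

def loss(yhat, t):
    return 0.5 * np.sum((yhat - t) ** 2)   # E(y-hat) = 1/2 ||y-hat - t||^2

def backward(vs, weights, dE_dyhat):
    L = len(weights) - 1
    deltas = [None] * (L + 1)
    deltas[L] = -phi_prime(vs[L]) * dE_dyhat
    for l in range(L - 1, -1, -1):
        deltas[l] = phi_prime(vs[l]) * (weights[l + 1].T @ deltas[l + 1])
    return deltas

rng = np.random.default_rng(0)
sizes = [3, 5, 4, 2]                        # n = (n_0, n_1, n_2, n-hat)
weights = [rng.normal(size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
x, t = rng.normal(size=3), rng.normal(size=2)

ys, vs = forward(x, weights)
deltas = backward(vs, weights, ys[-1] - t)  # dE/d(y-hat) = y-hat - t

# Finite-difference estimate of -dE/dw_ji^l versus delta_j^l * y_i^l
l, j, i, eps = 1, 2, 3, 1e-6
W_plus = [W.copy() for W in weights]
W_minus = [W.copy() for W in weights]
W_plus[l][j, i] += eps
W_minus[l][j, i] -= eps
numeric = -(loss(forward(x, W_plus)[0][-1], t) - loss(forward(x, W_minus)[0][-1], t)) / (2 * eps)
print(numeric, deltas[l][j] * ys[l][i])     # the two values should agree closely
```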