Perceptron Convergence Theorem
Suppose we have a linearly separable training set represented by $X^{+}$ and $X^{-}$, with labels $y$ given as follows.
$$ y_{i} = \pm 1 \qquad ( x_{i} \in X^{\pm} ) $$
Suppose the entire training set $X = X^{+} \cup X^{-}$ consists of $N$ data points, and that the inputs are presented in the following order.
$$ x^{(1)}, x^{(2)}, \cdots, x^{(N)}, x^{(1)}, x^{(2)}, \cdots, x^{(N)}, x^{(1)}, x^{(2)}, \cdots $$
That is, once the last data point has been processed, the sequence starts over again from the beginning.
Theorem
Denote the inputs on which the prediction fails, in order, as follows.
$$ x_{1}, x_{2}, \cdots, x_{n}, \cdots $$
Assume the weight is initialized as $w_{1} = 0$ and that, on the $n$-th prediction failure, it is updated as follows.
$$ w_{n+1} = w_{n} + y_{n} x_{n} $$
Then there exists an $n_{0}$ satisfying the following.
$$ w_{n_{0}} = w_{n_{0}+1} = w_{n_{0}+2} = \cdots $$
$$ \begin{cases} w_{n_{0}}^{T} x > 0 & \text{if } x \in X^{+} \\ w_{n_{0}}^{T} x \le 0 & \text{if } x \in X^{-} \end{cases} $$
Description
To put the theorem in one sentence: 'the perceptron completes learning within a finite number of updates.' In other words, the weight no longer changes after the $n_{0}$-th failure, which means that $y = \hat{y}$ holds for every $x$.
The key points are the weight update rule, which simply adds the misclassified data point to the weight, and the fact that, mathematically, learning is guaranteed to end within a finite number of updates.
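To make the update rule concrete, here is a minimal sketch of the procedure in Python with NumPy. The dataset, the function name, and the loop structure are illustrative assumptions and not part of the theorem; the theorem itself only concerns the update $w_{n+1} = w_{n} + y_{n} x_{n}$ applied whenever a prediction fails.

```python
import numpy as np

def train_perceptron(X, y, max_epochs=1000):
    """Sketch of the perceptron learning rule w <- w + y_n x_n.

    X is an (N, d) array of inputs and y an (N,) array of labels in {+1, -1}.
    Returns the learned weight and the number of failed predictions.
    """
    w = np.zeros(X.shape[1])                    # initial weight w_1 = 0
    failures = 0
    for _ in range(max_epochs):                 # present x^(1), ..., x^(N) over and over
        any_failure = False
        for x_i, y_i in zip(X, y):
            y_hat = 1 if w @ x_i > 0 else -1    # predict +1 iff w^T x > 0
            if y_hat != y_i:                    # prediction failed
                w = w + y_i * x_i               # update: add the misclassified point
                failures += 1
                any_failure = True
        if not any_failure:                     # a full pass with no failures: done
            break
    return w, failures

# Toy, linearly separable data (illustrative only)
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w, n_failures = train_perceptron(X, y)
print(w, n_failures)
```

As in the statement above, the sketch uses a bias-free (homogeneous) perceptron with $w_{1} = 0$, and it stops as soon as a full pass over the data produces no failed predictions.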
Proof
Let $n$ be the number of prediction failures. We show that, under the update rule above, $n$ has an upper bound.
Since the prediction on $x_{n}$ failed, the weight is updated as follows.
$$ w_{n+1} = w_{n} + y_{n} x_{n} \tag{eq1} $$
And because $X^{+}$ and $X^{-}$ were assumed to be separable, there exists a solution $w_{\ast}$ satisfying the following.
$$ \begin{cases} w_{\ast}^{T} x > 0 & \text{if } x \in X^{+} \\ w_{\ast}^{T} x \le 0 & \text{if } x \in X^{-} \end{cases} \qquad \text{and} \qquad \left\| w_{\ast} \right\| = 1 $$
Then, the following holds true.
$$ w_{\ast}^{T} w_{n+1} = w_{\ast}^{T} w_{n} + y_{n} w_{\ast}^{T} x_{n} $$
And let’s define α>0 as follows.
$$ \alpha := \min_{x \in X} y\, w_{\ast}^{T} x $$
Then, the following inequality holds.
$$ w_{\ast}^{T} w_{n+1} \ge w_{\ast}^{T} w_{n} + \alpha $$
Since the initial weight was $w_{1} = 0$, applying this repeatedly gives the following.
$$ w_{\ast}^{T} w_{n+1} \ge w_{\ast}^{T} w_{n} + \alpha \ge w_{\ast}^{T} w_{n-1} + 2\alpha \ge \cdots \ge w_{\ast}^{T} w_{1} + n\alpha = n\alpha \tag{eq2} $$
And by the Cauchy–Schwarz inequality, together with $\left\| w_{\ast} \right\| = 1$, the following holds.
$$ \left\| w_{n+1} \right\|^{2} = \left\| w_{\ast} \right\|^{2} \left\| w_{n+1} \right\|^{2} \ge \left| w_{\ast}^{T} w_{n+1} \right|^{2} \tag{eq3} $$
By (eq2) and (eq3), the following holds true.
$$ \left\| w_{n+1} \right\|^{2} \ge n^{2} \alpha^{2} \tag{eq4} $$
On the other hand, taking the squared norm of both sides of (eq1), and using $y_{n}^{2} = 1$, gives the following.
$$ \left\| w_{n+1} \right\|^{2} = \left\| w_{n} \right\|^{2} + \left\| x_{n} \right\|^{2} + 2 y_{n} w_{n}^{T} x_{n} $$
But since the prediction on $x_{n}$ failed, $y_{n} w_{n}^{T} x_{n} \le 0$, and we obtain the following inequality.
$$ \left\| w_{n+1} \right\|^{2} \le \left\| w_{n} \right\|^{2} + \left\| x_{n} \right\|^{2} \le \left\| w_{n} \right\|^{2} + \beta $$
where $\beta := \max_{x \in X} \left\| x \right\|^{2}$. Since the initial weight was $w_{1} = 0$, applying this repeatedly gives the following.
$$ \left\| w_{n+1} \right\|^{2} \le \left\| w_{n} \right\|^{2} + \beta \le \left\| w_{n-1} \right\|^{2} + 2\beta \le \left\| w_{n-2} \right\|^{2} + 3\beta \le \cdots \le \left\| w_{1} \right\|^{2} + n\beta = n\beta \tag{eq5} $$
Combining the two inequalities (eq4) and (eq5) gives the following.
$$ n^{2} \alpha^{2} \le \left\| w_{n+1} \right\|^{2} \le n\beta $$
Solving for $n$ gives the following.
$$ n^{2} \alpha^{2} \le n\beta \implies n \le \frac{\beta}{\alpha^{2}} $$
Therefore, we obtain the following.
$$ n_{\max} = \frac{\beta}{\alpha^{2}} $$
This means that the perceptron can fail in its predictions at most $\beta / \alpha^{2}$ times; after that, it predicts every training data point correctly.
■
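As a small sanity check of the bound, one can compute $\alpha$ and $\beta$ for a toy dataset and compare $\beta / \alpha^{2}$ with the number of failures actually observed (e.g. `n_failures` from the sketch above). The data and the reference solution `w_star` below are assumptions chosen purely for illustration.

```python
import numpy as np

# Same toy data as in the sketch above; w_star is an assumed separating solution.
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w_star = np.array([1.0, 1.0])
w_star = w_star / np.linalg.norm(w_star)   # enforce ||w_*|| = 1

alpha = np.min(y * (X @ w_star))           # alpha = min_{x in X} y w_*^T x
beta = np.max(np.sum(X ** 2, axis=1))      # beta  = max_{x in X} ||x||^2
print(beta / alpha ** 2)                   # upper bound on the number of failures
```

On this particular data the bound evaluates to roughly $2.2$, and the sketch above observes a single failed prediction, which stays below the bound as the theorem guarantees.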