
Perceptron Convergence Theorem

Let’s say we have a training set that is linearly separable, represented by $X^{+}$ and $X^{-}$. Consider the labels $y$ given as follows.

$$ y_{i} = \pm 1\ (\mathbf{x}_{i} \in X^{\pm}) $$

Suppose the entire training set $X = X^{+} \cup X^{-}$ has $N$ data points. Then, let’s say we insert input values in the following order.

$$ \mathbf{x}(1), \mathbf{x}(2), \cdots, \mathbf{x}(N), \mathbf{x}(1), \mathbf{x}(2), \cdots, \mathbf{x}(N), \mathbf{x}(1), \mathbf{x}(2), \cdots $$

That is, once the last data point has been processed, the pass starts over again from the beginning.

Theorem¹

Denote the input vectors for which the prediction fails, in the order they occur, as follows.

$$ \mathbf{x}_{1}, \mathbf{x}_{2}, \cdots, \mathbf{x}_{n}, \cdots $$

Let’s say the initial value of the weight is $\mathbf{w}_{1} = \mathbf{0}$. For the $n$th prediction failure, the weight update is assumed to be as follows.

$$ \mathbf{w}_{n+1} = \mathbf{w}_{n} + y_{n}\mathbf{x}_{n} $$

Then, there exists an $n_{o}$ that satisfies the following.

$$ \mathbf{w}_{n_{o}} = \mathbf{w}_{n_{o} + 1} = \mathbf{w}_{n_{o} + 2} = \cdots $$

$$ \begin{cases} \mathbf{w}_{n_{o}}^{T}\mathbf{x} > 0 & \text{if } \mathbf{x} \in X^{+} \\ \mathbf{w}_{n_{o}}^{T}\mathbf{x} \le 0 & \text{if } \mathbf{x} \in X^{-} \end{cases} $$

Description

To summarize the theorem in one sentence: ‘the perceptron completes learning within a finite number of iterations.’ This means that the weight is no longer updated after $n_{o}$, implying that $y = \hat{y}$ holds for all $\mathbf{x}$.

The key points are the weight update rule, which simply adds the misclassified data point to the weight, and the fact that learning is shown mathematically to end after a finite number of updates.
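To make the procedure concrete, here is a minimal sketch of the update rule in Python. The toy data set and the stopping criterion (a full pass with no updates) are illustrative choices and not part of the theorem itself.

```python
import numpy as np

# Toy linearly separable training set: first two points are X^+, last two are X^-.
X = np.array([[2.0, 1.0], [1.0, 3.0],      # X^+ (label +1)
              [-1.0, -2.0], [-2.0, 0.5]])  # X^- (label -1)
y = np.array([1, 1, -1, -1])

w = np.zeros(2)   # w_1 = 0
failures = 0

# Cycle through x(1), ..., x(N) repeatedly; stop after a full pass with no update.
while True:
    updated = False
    for x_i, y_i in zip(X, y):
        # Prediction rule: w^T x > 0 -> class +1, otherwise class -1.
        y_hat = 1 if w @ x_i > 0 else -1
        if y_hat != y_i:              # prediction failed
            w = w + y_i * x_i         # w_{n+1} = w_n + y_n x_n
            failures += 1
            updated = True
    if not updated:
        break

print(f"weights: {w}, number of updates: {failures}")
```

Because this toy data is linearly separable through the origin, the theorem guarantees the loop terminates after finitely many updates.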

Proof

Assuming the number of prediction failures on the input data is $n$, it will be shown that there is an upper limit for $n$ when updating the weight as above.


Since the prediction failed for $\mathbf{x}_{n}$, the weight is updated as follows.

\begin{equation} \mathbf{w}_{n+1} = \mathbf{w}_{n} + y_{n}\mathbf{x}_{n} \label{eq3} \end{equation}

And because it was assumed that $X^{+}$ and $X^{-}$ are linearly separable, a solution $\mathbf{w}_{\ast}$ that satisfies the following exists.

$$ \begin{cases} \mathbf{w}_{\ast}^{T}\mathbf{x} > 0 & \text{if } \mathbf{x} \in X^{+} \\ \mathbf{w}_{\ast}^{T}\mathbf{x} \le 0 & \text{if } \mathbf{x} \in X^{-} \end{cases} \quad \text{and} \quad \| \mathbf{w}_{\ast} \| = 1 $$

Then, the following holds true.

$$ \mathbf{w}_{\ast}^{T}\mathbf{w}_{n+1} = \mathbf{w}_{\ast}^{T}\mathbf{w}_{n} + y_{n}\mathbf{w}_{\ast}^{T}\mathbf{x}_{n} $$

And let’s define $\alpha > 0$ as follows.

$$ \alpha := \min \limits_{\mathbf{x} \in X} y\, \mathbf{w}_{\ast}^{T}\mathbf{x} $$

Then, the following inequality holds.

$$ \mathbf{w}_{\ast}^{T}\mathbf{w}_{n+1} \ge \mathbf{w}_{\ast}^{T}\mathbf{w}_{n} + \alpha $$

Applying this inequality repeatedly, and since the initial value of the weight was $\mathbf{w}_{1} = \mathbf{0}$, the following holds.

\begin{equation} \begin{aligned} \mathbf{w}_{\ast}^{T}\mathbf{w}_{n+1} &\ge \mathbf{w}_{\ast}^{T}\mathbf{w}_{n} + \alpha \\ &\ge \mathbf{w}_{\ast}^{T}\mathbf{w}_{n-1} + 2\alpha \\ &\ \ \vdots \\ &\ge \mathbf{w}_{\ast}^{T}\mathbf{w}_{1} + n\alpha \\ &= n\alpha \end{aligned} \label{eq1} \end{equation}

And since $\| \mathbf{w}_{\ast} \| = 1$, the Cauchy-Schwarz inequality gives the following.

\begin{equation} \| \mathbf{w}_{n+1} \|^{2} = \| \mathbf{w}_{\ast} \|^{2} \| \mathbf{w}_{n+1} \|^{2} \ge | \mathbf{w}_{\ast}^{T} \mathbf{w}_{n+1} |^{2} \label{eq2} \end{equation}

By $\eqref{eq1}$ and $\eqref{eq2}$, the following holds.

\begin{equation} \| \mathbf{w}_{n+1} \|^{2} \ge n^{2}\alpha^{2} \label{eq4} \end{equation}

Again, from $\eqref{eq3}$, we obtain the equation below.

$$ \| \mathbf{w}_{n+1} \|^{2} = \| \mathbf{w}_{n} \|^{2} + \| \mathbf{x}_{n} \|^{2} + 2 y_{n}\mathbf{w}^{T}_{n}\mathbf{x}_{n} $$

But since the prediction for $\mathbf{x}_{n}$ failed, $y_{n}\mathbf{w}^{T}_{n}\mathbf{x}_{n} \le 0$, and we obtain the following inequality.

$$ \| \mathbf{w}_{n+1} \|^{2} \le \| \mathbf{w}_{n} \|^{2} + \| \mathbf{x}_{n} \|^{2} \le \| \mathbf{w}_{n} \|^{2} + \beta $$

Here, $\beta := \max \limits_{\mathbf{x} \in X} \| \mathbf{x} \|^{2}$. Since the initial value of the weight was $\mathbf{w}_{1} = \mathbf{0}$, repeatedly applying this inequality gives the following.

\begin{equation} \begin{aligned} \| \mathbf{w}_{n+1} \|^{2} &\le \| \mathbf{w}_{n} \|^{2} + \beta \\ &\le \| \mathbf{w}_{n-1} \|^{2} + 2\beta \\ &\le \| \mathbf{w}_{n-2} \|^{2} + 3\beta \\ &\ \ \vdots \\ &\le \| \mathbf{w}_{1} \|^{2} + n\beta \\ &= n\beta \end{aligned} \label{eq5} \end{equation}

Combining the two inequalities $\eqref{eq4}$ and $\eqref{eq5}$ gives the following.

$$ n^{2} \alpha^{2} \le \| \mathbf{w}_{n+1} \|^{2} \le n\beta $$

Rearranging in terms of $n$ gives the following.

$$ \begin{align*} && n^{2} \alpha^{2} &\le n\beta \\ \implies && n &\le \dfrac{\beta}{\alpha^{2}} \end{align*} $$

Therefore, we obtain the following.

$$ n_{\text{max}} = \dfrac{\beta}{\alpha^{2}} $$

This means that the perceptron can fail in its predictions at most $\dfrac{\beta}{\alpha^{2}}$ times, and thereafter it predicts all of the training data correctly.
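As a quick sanity check (a minimal example chosen here for illustration, not taken from the source), consider $X^{+} = \{(1, 0)\}$, $X^{-} = \{(-1, 0)\}$, and $\mathbf{w}_{\ast} = (1, 0)$.

$$ \alpha = \min_{\mathbf{x} \in X} y\,\mathbf{w}_{\ast}^{T}\mathbf{x} = \min\{ (+1)(1),\ (-1)(-1) \} = 1, \qquad \beta = \max_{\mathbf{x} \in X} \| \mathbf{x} \|^{2} = 1 $$

So the bound is $n_{\text{max}} = \beta / \alpha^{2} = 1$. Indeed, starting from $\mathbf{w}_{1} = \mathbf{0}$, the first input $(1, 0) \in X^{+}$ is misclassified since $\mathbf{w}_{1}^{T}\mathbf{x} = 0 \not> 0$, the single update gives $\mathbf{w}_{2} = (1, 0)$, and every training point is classified correctly afterward, matching the bound exactly.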


  1. Simon Haykin, Neural Networks and Learning Machines (3rd Edition, 2009), pp. 50–53