Perceptron Convergence Theorem
Let’s say we have a training set that is linearly separable, with the two classes denoted $X^{+}$ and $X^{-}$. Define the labels $y_{i}$ as follows.
$$ y_{i} = \pm 1\ (\mathbf{x}_{i} \in X^{\pm}) $$
Suppose the entire training set $X = X^{+} \cup X^{-}$ has $N$ data points, and suppose the inputs are presented in the following order.
$$ \mathbf{x}(1), \mathbf{x}(2), \cdots, \mathbf{x}(N), \mathbf{x}(1), \mathbf{x}(2), \cdots, \mathbf{x}(N), \mathbf{x}(1), \mathbf{x}(2), \cdots $$
That is, once the last data point has been presented, the sequence starts over from the beginning.
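As a minimal sketch of this presentation order (the list `data` and its placeholder entries below are illustrative, not names from the source), the cyclic repetition can be written with Python's `itertools.cycle`:

```python
from itertools import cycle, islice

# Placeholder inputs standing in for x(1), x(2), x(3), so N = 3 here.
data = ["x(1)", "x(2)", "x(3)"]

# cycle(data) yields x(1), x(2), x(3), x(1), x(2), x(3), ... indefinitely;
# islice merely truncates the infinite stream so this example terminates.
for x in islice(cycle(data), 8):
    print(x)
```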
Theorem 1
Denote the inputs on which the prediction fails, in the order in which the failures occur, as follows.
$$ \mathbf{x}_{1}, \mathbf{x}_{2}, \cdots, \mathbf{x}_{n}, \cdots $$
Let’s say the initial value of the weight is $\mathbf{w}_{1}=\mathbf{0}$, and that on the $n$th prediction failure the weight is updated as follows.
$$ \mathbf{w} _{n+1} = \mathbf{w}_{n} + y_{n}\mathbf{x}_{n} $$
Then, there exists an $n_{o}$ satisfying the following.
$$ \mathbf{w}_{n_{o}} = \mathbf{w}_{n_{o} + 1} = \mathbf{w}_{n_{o} + 2} = \cdots $$
$$ \begin{cases} \mathbf{w}_{n_{o}}^{T}\mathbf{x} > 0 & \text{if } \mathbf{x} \in X^{+} \\ \mathbf{w}_{n_{o}}^{T}\mathbf{x} \le 0 & \text{if } \mathbf{x} \in X^{-} \end{cases} $$
Description
To summarize the above in one sentence: ‘the perceptron completes learning within a finite number of iterations.’ In other words, the weight no longer changes after $n_{o}$, which means that $y = \hat{y}$ holds for every $\mathbf{x}$ in the training set.
The key points are the weight-update rule, which simply adds the misclassified data point (scaled by its label) to the current weight, and the fact that this rule can be shown mathematically to finish learning after a finite number of updates.
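The rule just described translates directly into a short mistake-driven loop. Below is a minimal sketch, assuming the inputs are stored in a NumPy array `X` of shape $(N, d)$ and the labels `y` take values in $\{+1, -1\}$; the function name `perceptron_train` and the pass-based stopping test are illustrative choices, not taken from the source.

```python
import numpy as np

def perceptron_train(X, y):
    """Mistake-driven perceptron training with the update w <- w + y_n * x_n.

    X : (N, d) array of inputs, assumed linearly separable.
    y : (N,) array of labels in {+1, -1}.
    Returns the final weight vector and the number of updates (mistakes).
    """
    N, d = X.shape
    w = np.zeros(d)                            # w_1 = 0
    mistakes = 0
    while True:                                # keep cycling through x(1), ..., x(N)
        errors_this_pass = 0
        for i in range(N):
            y_hat = 1 if w @ X[i] > 0 else -1  # predict +1 iff w^T x > 0
            if y_hat != y[i]:                  # prediction failed
                w = w + y[i] * X[i]            # w_{n+1} = w_n + y_n * x_n
                mistakes += 1
                errors_this_pass += 1
        if errors_this_pass == 0:              # a full pass with no mistakes: done
            return w, mistakes
```

If the data are linearly separable, the theorem above guarantees that this loop terminates; the `mistakes` counter is exactly the $n$ that is bounded in the proof below.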
Proof
Let $n$ denote the number of prediction failures on the input data. We will show that $n$ has an upper bound when the weight is updated as above.
Since the prediction failed for $\mathbf{x}_{n}$, the weight is updated as follows.
$$ \begin{equation} \mathbf{w}_{n+1} = \mathbf{w}_{n} + y_{n}\mathbf{x}_{n} \label{eq3} \end{equation} $$
And because $X^{+}$ and $X^{-}$ were assumed to be linearly separable, there exists a unit vector $\mathbf{w}_{\ast}$ that separates them strictly; this strict inequality is what guarantees $\alpha > 0$ below.
$$ \begin{cases} \mathbf{w}_{\ast}^{T}\mathbf{x} > 0 & \text{if } \mathbf{x} \in X^{+} \\ \mathbf{w}_{\ast}^{T}\mathbf{x} < 0 & \text{if } \mathbf{x} \in X^{-} \end{cases} \quad \text{and} \quad \| \mathbf{w}_{\ast} \| = 1 $$
Then, the following holds true.
$$ \mathbf{w}_{\ast}^{T}\mathbf{w} _{n+1} = \mathbf{w}_{\ast}^{T}\mathbf{w}_{n} + y_{n}\mathbf{w}_{\ast}^{T}\mathbf{x}_{n} $$
And define $\alpha$ as follows; since $X$ is finite and $\mathbf{w}_{\ast}$ separates it strictly, $\alpha > 0$.
$$ \alpha := \min \limits_{\mathbf{x} \in X} y \mathbf{w}_{\ast}^{T}\mathbf{x} $$
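As a concrete illustration (the toy data and the choice of $\mathbf{w}_{\ast}$ here are made up for this example), take $X^{+} = \{ (2, 1)^{T} \}$, $X^{-} = \{ (-1, -2)^{T} \}$ and the unit vector $\mathbf{w}_{\ast} = \frac{1}{\sqrt{2}}(1, 1)^{T}$, which separates the two classes strictly. Then
$$ \alpha = \min\left\{ (+1) \cdot \frac{2 + 1}{\sqrt{2}},\ (-1) \cdot \frac{-1 - 2}{\sqrt{2}} \right\} = \frac{3}{\sqrt{2}} > 0 $$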
Now, since $y_{n}\mathbf{w}_{\ast}^{T}\mathbf{x}_{n} \ge \alpha$ by the definition of $\alpha$, the following inequality holds.
$$ \mathbf{w}_{\ast}^{T}\mathbf{w} _{n+1} \ge \mathbf{w}_{\ast}^{T}\mathbf{w}_{n} + \alpha $$
Since the initial value of the weight was $\mathbf{w}_{1}=\mathbf{0}$, applying this inequality repeatedly gives the following.
$$ \begin{equation} \begin{aligned} \mathbf{w}_{\ast}^{T}\mathbf{w}_{n+1} &\ge \mathbf{w}_{\ast}^{T}\mathbf{w}_{n} + \alpha \\ &\ge \mathbf{w}_{\ast}^{T}\mathbf{w}_{n-1} + 2\alpha \\ &\;\;\vdots \\ &\ge \mathbf{w}_{\ast}^{T}\mathbf{w}_{1} + n\alpha \\ &= n\alpha \end{aligned} \label{eq1} \end{equation} $$
And since $\| \mathbf{w}_{\ast} \| = 1$, the Cauchy-Schwarz inequality gives the following.
$$ \begin{equation} \| \mathbf{w}_{n+1} \|^{2} = \| \mathbf{w}_{\ast} \|^{2} \| \mathbf{w}_{n+1} \|^{2} \ge | \mathbf{w}_{\ast}^{T} \mathbf{w}_{n+1} |^{2} \label{eq2} \end{equation} $$
Combining $\eqref{eq1}$ and $\eqref{eq2}$ gives the following.
$$ \begin{equation} \| \mathbf{w}_{n+1} \|^{2} \ge n^{2}\alpha^{2} \label{eq4} \end{equation} $$
Again, taking the squared norm of both sides of $\eqref{eq3}$ and using $y_{n}^{2} = 1$, we obtain the following.
$$ \| \mathbf{w}_{n+1} \|^{2} = \| \mathbf{w}_{n} \| ^{2} + \| \mathbf{x} _{n} \| ^{2} + 2 y_{n}\mathbf{w}^{T}_{n}\mathbf{x}_{n} $$
But since the prediction for $\mathbf{x}_{n}$ failed, $y_{n}\mathbf{w}^{T}_{n} \mathbf{x}_{n} \le 0$, and we obtain the following inequality.
$$ \| \mathbf{w}_{n+1} \|^{2} \le \| \mathbf{w}_{n} \| ^{2} + \| \mathbf{x} _{n} \| ^{2} \le \| \mathbf{w}_{n} \| ^{2} + \beta $$
Here, $\beta := \max \limits_{\mathbf{x} \in X} \| \mathbf{x} \|^{2}$. Since the initial value of the weight was $\mathbf{w}_{1}=\mathbf{0}$, applying this inequality repeatedly gives the following.
$$ \begin{equation} \begin{aligned} \| \mathbf{w}_{n+1} \|^{2} &\le \| \mathbf{w}_{n} \| ^{2} + \beta \\ &\le \| \mathbf{w}_{n-1} \| ^{2} + 2\beta \\ &\le \| \mathbf{w}_{n-2} \| ^{2} + 3\beta \\ &\vdots \\ &\le \| \mathbf{w}_{1} \|^{2} + n\beta \\ &= n\beta \end{aligned} \label{eq5} \end{equation} $$
Combining the two inequalities $\eqref{eq4}, \eqref{eq5}$ gives the following.
$$ n^{2} \alpha^{2} \le \| \mathbf{w}_{n+1} \|^{2} \le n\beta $$
Solving for $n$ gives the following.
$$ \begin{align*} && n^{2} \alpha^{2} &\le n\beta \\ \implies && n &\le \dfrac{\beta}{\alpha^2} \end{align*} $$
Therefore, we obtain the following.
$$ n_{\text{max}} = \dfrac{\beta}{\alpha^{2}} $$
This means that the perceptron can fail in its predictions at most $\dfrac{\beta}{\alpha^{2}}$ times; after that, it predicts every training example correctly.
■
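As an illustrative check of this bound (the toy dataset, the hand-picked $\mathbf{w}_{\ast}$, and the variable names below are assumptions of this sketch; `perceptron_train` is the function from the earlier sketch), one can compute $\alpha$, $\beta$, and $\beta / \alpha^{2}$ and compare them with the number of mistakes actually made:

```python
import numpy as np

# The same toy, strictly separable data as in the alpha example above.
X = np.array([[2.0, 1.0], [-1.0, -2.0]])
y = np.array([1, -1])

# Hand-picked unit-norm separating vector w_* (an assumption of this sketch).
w_star = np.array([1.0, 1.0]) / np.sqrt(2)

alpha = np.min(y * (X @ w_star))       # alpha = min_x y * w_*^T x  (= 3 / sqrt(2))
beta = np.max(np.sum(X**2, axis=1))    # beta  = max_x ||x||^2      (= 5)
bound = beta / alpha**2                # theoretical mistake bound beta / alpha^2

w, mistakes = perceptron_train(X, y)   # training loop from the earlier sketch
print(f"alpha = {alpha:.3f}, beta = {beta:.1f}, bound = {bound:.3f}")
print(f"mistakes made = {mistakes}")
```

For this toy set the bound evaluates to $5 / (9/2) \approx 1.11$, and the loop makes a single mistake before converging, consistent with the theorem.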
Simon Haykin, Neural Networks and Learning Machines (3rd Edition, 2009), pp. 50-53