
Perceptron Convergence Theorem

Let’s say we have a training set that is linearly separable, represented by $X^{+}$ and $X^{-}$. Consider the labels $y$ given as follows.

$$ y_{i} = \pm 1\ (\mathbf{x}_{i} \in X^{\pm}) $$

Suppose the entire training set $X = X^{+} \cup X^{-}$ has $N$ data points. Then, let’s say we insert input values in the following order.

$$ \mathbf{x}(1), \mathbf{x}(2), \cdots, \mathbf{x}(N), \mathbf{x}(1), \mathbf{x}(2), \cdots, \mathbf{x}(N), \mathbf{x}(1), \mathbf{x}(2), \cdots $$

That is, once the last data point has been processed, the pass starts over again from the beginning.

Theorem¹

Denote the input vectors for which the prediction fails, in the order they occur, as follows.

$$ \mathbf{x}_{1}, \mathbf{x}_{2}, \cdots, \mathbf{x}_{n}, \cdots $$

Let’s say the initial value of the weight is $\mathbf{w}_{1} = \mathbf{0}$. For the $n$th prediction failure, the weight update is assumed to be as follows.

$$ \mathbf{w}_{n+1} = \mathbf{w}_{n} + y_{n}\mathbf{x}_{n} $$

Then, there exists an $n_{o}$ that satisfies the following.

$$ \mathbf{w}_{n_{o}} = \mathbf{w}_{n_{o} + 1} = \mathbf{w}_{n_{o} + 2} = \cdots $$

$$ \begin{cases} \mathbf{w}_{n_{o}}^{T}\mathbf{x} > 0 & \text{if } \mathbf{x} \in X^{+} \\ \mathbf{w}_{n_{o}}^{T}\mathbf{x} \le 0 & \text{if } \mathbf{x} \in X^{-} \end{cases} $$

Description

To summarize the theorem in one sentence: ‘the perceptron completes learning within a finite number of iterations.’ This means that the weight is no longer updated after $n_{o}$, implying that $y = \hat{y}$ holds for all $\mathbf{x}$.

The key points are the weight update rule, which simply adds the misclassified data point to the weight, and the fact that learning is shown mathematically to end after a finite number of updates.
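To make the procedure concrete, here is a minimal sketch of the update rule in Python. The toy data set and the stopping criterion (a full pass with no updates) are illustrative choices and not part of the theorem itself.

```python
import numpy as np

# Toy linearly separable training set: first two points are X^+, last two are X^-.
X = np.array([[2.0, 1.0], [1.0, 3.0],      # X^+ (label +1)
              [-1.0, -2.0], [-2.0, 0.5]])  # X^- (label -1)
y = np.array([1, 1, -1, -1])

w = np.zeros(2)   # w_1 = 0
failures = 0

# Cycle through x(1), ..., x(N) repeatedly; stop after a full pass with no update.
while True:
    updated = False
    for x_i, y_i in zip(X, y):
        # Prediction rule: w^T x > 0 -> class +1, otherwise class -1.
        y_hat = 1 if w @ x_i > 0 else -1
        if y_hat != y_i:              # prediction failed
            w = w + y_i * x_i         # w_{n+1} = w_n + y_n x_n
            failures += 1
            updated = True
    if not updated:
        break

print(f"weights: {w}, number of updates: {failures}")
```

Because this toy data is linearly separable through the origin, the theorem guarantees the loop terminates after finitely many updates.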

Proof

Assuming the number of prediction failures on the input data is $n$, it will be shown that there is an upper limit for $n$ when updating the weight as above.


Since the prediction failed for $\mathbf{x}_{n}$, the weight is updated as follows.

\begin{equation} \mathbf{w}_{n+1} = \mathbf{w}_{n} + y_{n}\mathbf{x}_{n} \label{eq3} \end{equation}

And because it was assumed that $X^{+}$ and $X^{-}$ are linearly separable, a solution $\mathbf{w}_{\ast}$ that satisfies the following exists.

$$ \begin{cases} \mathbf{w}_{\ast}^{T}\mathbf{x} > 0 & \text{if } \mathbf{x} \in X^{+} \\ \mathbf{w}_{\ast}^{T}\mathbf{x} \le 0 & \text{if } \mathbf{x} \in X^{-} \end{cases} \quad \text{and} \quad \| \mathbf{w}_{\ast} \| = 1 $$

Then, the following holds true.

$$ \mathbf{w}_{\ast}^{T}\mathbf{w}_{n+1} = \mathbf{w}_{\ast}^{T}\mathbf{w}_{n} + y_{n}\mathbf{w}_{\ast}^{T}\mathbf{x}_{n} $$

And let’s define $\alpha > 0$ as follows.

$$ \alpha := \min \limits_{\mathbf{x} \in X} y\, \mathbf{w}_{\ast}^{T}\mathbf{x} $$

Then, the following inequality holds.

$$ \mathbf{w}_{\ast}^{T}\mathbf{w}_{n+1} \ge \mathbf{w}_{\ast}^{T}\mathbf{w}_{n} + \alpha $$

Applying this inequality repeatedly, and since the initial value of the weight was $\mathbf{w}_{1} = \mathbf{0}$, the following holds.

\begin{equation} \begin{aligned} \mathbf{w}_{\ast}^{T}\mathbf{w}_{n+1} &\ge \mathbf{w}_{\ast}^{T}\mathbf{w}_{n} + \alpha \\ &\ge \mathbf{w}_{\ast}^{T}\mathbf{w}_{n-1} + 2\alpha \\ &\ \ \vdots \\ &\ge \mathbf{w}_{\ast}^{T}\mathbf{w}_{1} + n\alpha \\ &= n\alpha \end{aligned} \label{eq1} \end{equation}

And since $\| \mathbf{w}_{\ast} \| = 1$, the Cauchy-Schwarz inequality gives the following.

\begin{equation} \| \mathbf{w}_{n+1} \|^{2} = \| \mathbf{w}_{\ast} \|^{2} \| \mathbf{w}_{n+1} \|^{2} \ge | \mathbf{w}_{\ast}^{T} \mathbf{w}_{n+1} |^{2} \label{eq2} \end{equation}

By $\eqref{eq1}$ and $\eqref{eq2}$, the following holds.

\begin{equation} \| \mathbf{w}_{n+1} \|^{2} \ge n^{2}\alpha^{2} \label{eq4} \end{equation}

Again, from $\eqref{eq3}$, we obtain the equation below.

$$ \| \mathbf{w}_{n+1} \|^{2} = \| \mathbf{w}_{n} \|^{2} + \| \mathbf{x}_{n} \|^{2} + 2 y_{n}\mathbf{w}^{T}_{n}\mathbf{x}_{n} $$

But since the prediction for $\mathbf{x}_{n}$ failed, $y_{n}\mathbf{w}^{T}_{n}\mathbf{x}_{n} \le 0$, and we obtain the following inequality.

$$ \| \mathbf{w}_{n+1} \|^{2} \le \| \mathbf{w}_{n} \|^{2} + \| \mathbf{x}_{n} \|^{2} \le \| \mathbf{w}_{n} \|^{2} + \beta $$

Here, $\beta := \max \limits_{\mathbf{x} \in X} \| \mathbf{x} \|^{2}$. Since the initial value of the weight was $\mathbf{w}_{1} = \mathbf{0}$, repeatedly applying this inequality gives the following.

\begin{equation} \begin{aligned} \| \mathbf{w}_{n+1} \|^{2} &\le \| \mathbf{w}_{n} \|^{2} + \beta \\ &\le \| \mathbf{w}_{n-1} \|^{2} + 2\beta \\ &\le \| \mathbf{w}_{n-2} \|^{2} + 3\beta \\ &\ \ \vdots \\ &\le \| \mathbf{w}_{1} \|^{2} + n\beta \\ &= n\beta \end{aligned} \label{eq5} \end{equation}

Combining the two inequalities $\eqref{eq4}$ and $\eqref{eq5}$ gives the following.

$$ n^{2} \alpha^{2} \le \| \mathbf{w}_{n+1} \|^{2} \le n\beta $$

Rearranging in terms of $n$ gives the following.

$$ \begin{align*} && n^{2} \alpha^{2} &\le n\beta \\ \implies && n &\le \dfrac{\beta}{\alpha^{2}} \end{align*} $$

Therefore, we obtain the following.

$$ n_{\text{max}} = \dfrac{\beta}{\alpha^{2}} $$

This means that the perceptron can fail in its predictions at most $\dfrac{\beta}{\alpha^{2}}$ times, and thereafter it predicts all of the training data correctly.
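As a quick sanity check (a minimal example chosen here for illustration, not taken from the source), consider $X^{+} = \{(1, 0)\}$, $X^{-} = \{(-1, 0)\}$, and $\mathbf{w}_{\ast} = (1, 0)$.

$$ \alpha = \min_{\mathbf{x} \in X} y\,\mathbf{w}_{\ast}^{T}\mathbf{x} = \min\{ (+1)(1),\ (-1)(-1) \} = 1, \qquad \beta = \max_{\mathbf{x} \in X} \| \mathbf{x} \|^{2} = 1 $$

So the bound is $n_{\text{max}} = \beta / \alpha^{2} = 1$. Indeed, starting from $\mathbf{w}_{1} = \mathbf{0}$, the first input $(1, 0) \in X^{+}$ is misclassified since $\mathbf{w}_{1}^{T}\mathbf{x} = 0 \not> 0$, the single update gives $\mathbf{w}_{2} = (1, 0)$, and every training point is classified correctly afterward, matching the bound exactly.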


  1. Simon Haykin, Neural Networks and Learning Machines (3rd Edition, 2009), pp. 50–53