

Proof of the Representer Theorem

Theorem

Suppose we are given a nonempty input set $X \ne \emptyset$ and a positive definite kernel $k: X \times X \to \mathbb{R}$. Define the Training Dataset as
$$ D := \left\{ \left( x_{i} , y_{i} \right) \right\}_{i=1}^{m} \subset X \times \mathbb{R} $$
and a class in the Reproducing Kernel Hilbert Space $H_{k}$ as
$$ \mathcal{F} := \left\{ f \in \mathbb{R}^{X} : f \left( \cdot \right) = \sum_{i=1}^{\infty} \beta_{i} k \left( \cdot , z_{i} \right) \land \beta_{i} \in \mathbb{R} \land z_{i} \in X \land \left\| f \right\| < \infty \right\} \subset H_{k} $$
Consider an arbitrary objective function $c : \left( D \times \mathbb{R} \right)^{m} \to \overline{\mathbb{R}}$ and a monotonically increasing regularizer $g : \mathbb{R} \to [0,\infty)$, and define the Regularized Objective Functional $L : \mathcal{F} \to \overline{\mathbb{R}}$ as follows:
$$ L (f) := c \left( \left( x_{1}, y_{1}, f \left( x_{1} \right) \right), \cdots , \left( x_{m}, y_{m}, f \left( x_{m} \right) \right) \right) + g \left( \left\| f \right\| \right) $$
Here, the norm $\left\| \cdot \right\|$ in $H_{k}$ is given, by the positive definiteness of $k$, as follows:
$$ \left\| \sum_{i=1}^{\infty} \beta_{i} k \left( \cdot , z_{i} \right) \right\|^{2} := \sum_{i=1}^{\infty} \sum_{j=1}^{\infty} \beta_{i} \beta_{j} k \left( z_{i} , z_{j} \right) \ge 0 $$


  • $\mathbb{R}$ is the set of real numbers, and $\overline{\mathbb{R}}$ is the extended real line, which includes infinity $\infty$.
  • $\mathbb{R}^{X}$ is the function space of functions with domain $X$ and codomain $\mathbb{R}$.
  • A regularizer is a penalty function to prevent overfitting on data.
  • A functional is, loosely speaking, a function that takes another function as input.
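
As a quick numerical illustration of the norm just defined, the squared RKHS norm of a finite expansion $f = \sum_{i} \beta_{i} k \left( \cdot , z_{i} \right)$ reduces to the quadratic form $\beta^{T} K \beta$ over the Gram matrix $K_{ij} = k \left( z_{i} , z_{j} \right)$. The sketch below assumes a Gaussian (RBF) kernel and arbitrary expansion points purely for illustration:

```python
import numpy as np

# Sketch: the squared RKHS norm of f = sum_i beta_i k(., z_i) is the quadratic
# form beta^T K beta, which is nonnegative because k is positive definite.
# The RBF kernel and the random expansion points are illustrative assumptions.

def rbf_kernel(a, b, gamma=1.0):
    """Gaussian kernel k(a, b) = exp(-gamma * ||a - b||^2) on all pairs of rows."""
    d2 = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2 * a @ b.T
    return np.exp(-gamma * d2)

rng = np.random.default_rng(0)
Z = rng.normal(size=(5, 2))    # expansion points z_1, ..., z_5 in X = R^2
beta = rng.normal(size=5)      # arbitrary real coefficients beta_i

K = rbf_kernel(Z, Z)           # Gram matrix K_ij = k(z_i, z_j)
sq_norm = beta @ K @ beta      # ||f||^2 = sum_ij beta_i beta_j k(z_i, z_j)
print(sq_norm >= 0)            # True, by positive definiteness of k
```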

Nonparametric

Any function $f \in \mathcal{F}$ that minimizes $L(f)$ can be expressed, for some $\left\{ \alpha_{i} \right\}_{i=1}^{m} \subset \mathbb{R}$, in the following form:
$$ f (\cdot) = \sum_{i=1}^{m} \alpha_{i} k \left( \cdot , x_{i} \right) $$
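
In practical terms, a function of this form is completely determined by the $m$ training inputs and the $m$ coefficients $\alpha_{i}$, so evaluating it anywhere only requires kernel evaluations against the training inputs. A minimal sketch, assuming an RBF kernel and arbitrary coefficients chosen only for illustration:

```python
import numpy as np

# Sketch: a function of the representer form f(.) = sum_i alpha_i k(., x_i)
# is stored as (alpha, X_train) and evaluated by kernel sums alone.
# The RBF kernel and the coefficients alpha are illustrative assumptions.

def rbf_kernel(a, b, gamma=1.0):
    d2 = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2 * a @ b.T
    return np.exp(-gamma * d2)

rng = np.random.default_rng(1)
X_train = rng.uniform(-1, 1, size=(10, 1))   # training inputs x_1, ..., x_m
alpha = rng.normal(size=10)                  # some coefficients alpha_i

def f(x_query):
    """Evaluate f(x) = sum_i alpha_i k(x, x_i) at each row of x_query."""
    return rbf_kernel(x_query, X_train) @ alpha

print(f(np.array([[0.0], [0.5]])))           # two evaluations of f
```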

Semiparametric

Suppose $\left\{ \psi_{p} : X \to \mathbb{R} \right\}_{p=1}^{M}$ is a set of $M$ real-valued functions on $X$ such that the matrix $\left( \psi_{p} \left( x_{i} \right) \right)_{ip}$ has rank $M$. Then any $\tilde{f} = f + h$, with $f \in \mathcal{F}$ and $h \in \operatorname{span} \left\{ \psi_{p} \right\}$, that minimizes
$$ c \left( \left( x_{1}, y_{1}, \tilde{f} \left( x_{1} \right) \right) , \cdots , \left( x_{m}, y_{m}, \tilde{f} \left( x_{m} \right) \right) \right) + g \left( \left\| f \right\| \right) $$
can be expressed, for some $\left\{ \alpha_{i} \right\}_{i=1}^{m}, \left\{ \beta_{p} \right\}_{p=1}^{M} \subset \mathbb{R}$, in the following form:
$$ \tilde{f} (\cdot) = \sum_{i=1}^{m} \alpha_{i} k \left( \cdot , x_{i} \right) + \sum_{p=1}^{M} \beta_{p} \psi_{p} (\cdot) $$
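
The semiparametric case can be made concrete with a small sketch. Everything below is an illustrative assumption rather than part of the theorem: an RBF kernel, the unpenalized basis $\psi_{1}(x) = 1$, $\psi_{2}(x) = x$, the squared loss as $c$, and $g \left( \left\| f \right\| \right) = \lambda \left\| f \right\|^{2}$. Under these choices, and assuming the Gram matrix $K$ is nonsingular, the stationarity conditions of the quadratic objective reduce to the linear system $(K + m \lambda I) \alpha + \Psi \beta = y$, $\Psi^{T} \alpha = 0$, and the fit has exactly the stated form $\tilde{f} (\cdot) = \sum_{i} \alpha_{i} k \left( \cdot , x_{i} \right) + \sum_{p} \beta_{p} \psi_{p} (\cdot)$:

```python
import numpy as np

# Sketch of the semiparametric fit under illustrative assumptions: RBF kernel,
# unpenalized basis psi_1(x) = 1 and psi_2(x) = x, squared loss, and
# g(||f||) = lam * ||f||^2. Minimizing (1/m)||y - K a - Psi b||^2 + lam * a^T K a
# jointly in (a, b), with K nonsingular, reduces to the linear system
#   (K + m*lam*I) a + Psi b = y,   Psi^T a = 0.

def rbf_kernel(a, b, gamma=5.0):
    d2 = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2 * a @ b.T
    return np.exp(-gamma * d2)

rng = np.random.default_rng(2)
m, lam = 30, 1e-2
X = rng.uniform(-1, 1, size=(m, 1))
y = 1.5 + 2.0 * X[:, 0] + np.sin(4 * X[:, 0]) + 0.1 * rng.normal(size=m)

K = rbf_kernel(X, X)                            # Gram matrix K_ij = k(x_i, x_j)
Psi = np.column_stack([np.ones(m), X[:, 0]])    # (psi_p(x_i))_ip with rank M = 2
M = Psi.shape[1]

A = np.block([[K + m * lam * np.eye(m), Psi],
              [Psi.T, np.zeros((M, M))]])
theta = np.linalg.solve(A, np.concatenate([y, np.zeros(M)]))
alpha, beta = theta[:m], theta[m:]

fitted = K @ alpha + Psi @ beta                 # f~(x_i) at the training inputs
print(np.round(beta, 2))                        # coefficients of the unpenalized part
```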

Explanation


Representers

The Representer Theorem is one of the most important theorems in the context of classical machine learning, especially support vector machines. It states that the function $f$ we wish to fit to given data can be represented, with respect to a suitable kernel $k$, in the form
$$ f (\cdot) = \sum_{i=1}^{m} \alpha_{i} k \left( \cdot , x_{i} \right) $$
Here, the functions $k \left( \cdot , x_{i} \right) = \phi \left( x_{i} \right) (\cdot) \in H_{k}$, obtained by fixing one of the kernel's inputs at $x_{i}$, are called representers. According to this theorem, any function fitted to the training data in the reproducing kernel Hilbert space can be represented as a finite linear combination of representers. This aligns perfectly with the kernel trick used for nonlinear regression with support vector machines.

In the context of data science, the representers $\phi \left( x_{i} \right) (\cdot)$ are also referred to as feature maps: any data in $X$ can be mapped into a Hilbert space where its features are known, and a fitted function can be represented by a finite sum of these features. This justifies why many of the machine learning techniques we have learned so far work. While these techniques are valid in their own right even without a mathematical guarantee, the Representer Theorem is crucial because it provides a theoretical foundation for them.

Objective Function and Regularizer

Although the objective function $c$ and the regularizer $g$ are defined very generally in the statement of the theorem, in many cases $c$ is the mean squared residual
$$ c = {{ 1 } \over { m }} \sum_{i=1}^{m} \left( y_{i} - f \left( x_{i} \right) \right)^{2} $$
measuring the fit between the data and $f$, and $g$ is a squared semi-norm penalty $g \left( \left\| f \right\| \right) = \lambda \left\| f \right\|^{2}$.[1]
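
For this common special case, the whole procedure collapses to kernel ridge regression. A minimal sketch, assuming an RBF kernel and an arbitrary $\lambda$: substituting $f (\cdot) = \sum_{i} \alpha_{i} k \left( \cdot , x_{i} \right)$ gives $L(\alpha) = \frac{1}{m} \left\| y - K \alpha \right\|^{2} + \lambda \alpha^{T} K \alpha$, whose stationarity condition is solved by $\alpha = \left( K + m \lambda I \right)^{-1} y$:

```python
import numpy as np

# Sketch: with c the mean squared residual and g(||f||) = lam * ||f||^2,
# plugging f(.) = sum_i alpha_i k(., x_i) into L gives
#   L(alpha) = (1/m) ||y - K alpha||^2 + lam * alpha^T K alpha,
# which is minimized by alpha = (K + m*lam*I)^{-1} y (kernel ridge regression).
# The RBF kernel, lam, and the toy data are illustrative assumptions.

def rbf_kernel(a, b, gamma=5.0):
    d2 = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2 * a @ b.T
    return np.exp(-gamma * d2)

rng = np.random.default_rng(3)
m, lam = 40, 1e-3
X = rng.uniform(-1, 1, size=(m, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=m)

K = rbf_kernel(X, X)
alpha = np.linalg.solve(K + m * lam * np.eye(m), y)   # representer coefficients

X_new = np.linspace(-1, 1, 5).reshape(-1, 1)
f_new = rbf_kernel(X_new, X) @ alpha                  # f(x) = sum_i alpha_i k(x, x_i)
print(np.round(f_new, 2))
print(np.round(np.sin(3 * X_new[:, 0]), 2))           # noiseless target, for comparison
```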

It is important not to distinguish between the objective function and the regularizer merely based on the appearance of the equations or the meanings of the words. The most emblematic application of the Representer Theorem is the support vector machine, where the minimization problem handled in the Soft Margin SVM is as follows:
$$ \begin{matrix} \text{Minimize} & {{ 1 } \over { 2 }} \left\| \mathbf{w} \right\|_{2}^{2} + \lambda \sum_{k=1}^{n} \xi_{k} \\ \text{subject to} & \left| f \left( \mathbf{x}_{k} \right) \right| \ge 1 - \xi_{k} \\ & \xi_{k} \ge 0 \end{matrix} \qquad k = 1, \cdots , n $$

Here, considering the optimization itself, the objective for a constant $\lambda \ne 0$ is essentially
$$ {{ 1 } \over { 2 \lambda }} \left\| \mathbf{w} \right\|_{2}^{2} + \sum_{k=1}^{n} \xi_{k} $$
where $\sum_{k=1}^{n} \xi_{k}$, which measures the discrepancy with the data, plays the role of $c$, and ${{ 1 } \over { 2 \lambda }} \left\| \mathbf{w} \right\|_{2}^{2}$, which comes from the SVM hyperplane $f (\mathbf{x}) = \mathbf{w}^{T} \mathbf{x} + b$, plays the role of $g$. This can be confusing depending on whether the focus is on mathematics or machine learning, because

  • Those with a stronger mathematical inclination see the SVM as 'linear regression first, then considering exceptions' and tend to minimize $\left\| \mathbf{w} \right\|_{2}^{2}$ first,
  • Those with a stronger data science inclination view the SVM as 'first succeeding in classifying the data well, then finding the hyperplane with the largest margin' and tend to minimize $\sum_{k=1}^{n} \xi_{k}$ first.

Both perspectives are understandable, and since the application of the Representer Theorem isn’t limited to just SVM, one should think and adapt actively to the problem at hand rather than seeking a mnemonic technique.
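
To make the $c + g$ split above tangible, the following toy sketch evaluates both pieces for a fixed candidate hyperplane. The data, the candidate $\mathbf{w}, b$, and the use of the standard hinge slack $\xi_{k} = \max \left( 0, 1 - y_{k} f \left( \mathbf{x}_{k} \right) \right)$ are illustrative choices of this sketch, not prescriptions of the post:

```python
import numpy as np

# Sketch: for a linear SVM f(x) = w^T x + b, compute the two pieces of L,
#   c = sum_k xi_k   with the standard hinge slack xi_k = max(0, 1 - y_k f(x_k)),
#   g = ||w||^2 / (2 * lam).
# The toy data, the candidate (w, b), and lam are illustrative assumptions.

rng = np.random.default_rng(4)
n, lam = 50, 1.0
X = rng.normal(size=(n, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1.0, -1.0)   # labels in {-1, +1}

w = np.array([1.0, 1.0])                          # some candidate hyperplane
b = 0.0
f_vals = X @ w + b                                # f(x_k) = w^T x_k + b

xi = np.maximum(0.0, 1.0 - y * f_vals)            # slack variables xi_k >= 0
c = xi.sum()                                      # the "fit the data" term
g = (w @ w) / (2.0 * lam)                         # the "regularize" term
print(round(c, 3), round(g, 3), round(c + g, 3))  # the two pieces of L and their sum
```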

Proof [2]

For simplicity, only the nonparametric representer theorem is proved as per the reference.


Part 1. $f = \sum_{i=1}^{m} \alpha_{i} \phi \left( x_{i} \right) + v$

Definition of a Reproducing Kernel: A function $k : X \times X \to \mathbb{C}$ is called the reproducing kernel of $H$ if it satisfies the following two conditions:

  • (i) Representer: For all $x \in X$, $k \left( \cdot , x \right) \in H$.
  • (ii) Reproducing Property: For all $x \in X$ and all $f \in H$, $\left< f , k \left( \cdot , x \right) \right> = f(x) = \delta_{x} (f)$. In particular, for all $x_{1} , x_{2} \in X$, $k \left( x_{1} , x_{2} \right) = \left< k \left( \cdot , x_{2} \right), k \left( \cdot , x_{1} \right) \right>$.

Define the (representer) map $\phi : X \to \mathbb{R}^{X}$ for the reproducing kernel $k : X \times X \to \mathbb{R}$ by $x \mapsto k (\cdot , x)$. Since $k$ is a reproducing kernel, the value of the function $\left( \phi (x) \right) (\cdot)$ at any $x, x' \in X$ is
$$ \left( \phi (x) \right) (x') = k \left( x' , x \right) = \left< \phi \left( x' \right) , \phi (x) \right> $$
where $\left< \cdot , \cdot \right>$ is the inner product in $H_{k}$. For the given $\left\{ x_{i} \right\}_{i=1}^{m}$, any function $f \in \mathcal{F}$ can be written as the sum of a part in $\operatorname{span} \left\{ \phi \left( x_{i} \right) \right\}_{i=1}^{m}$ and a part $v$ orthogonal to it, satisfying $\left< v , \phi \left( x_{j} \right) \right> = 0$ for all $j$; that is, for some $\left( \alpha_{1} , \cdots , \alpha_{m} \right) \in \mathbb{R}^{m}$,
$$ f = \sum_{i=1}^{m} \alpha_{i} \phi \left( x_{i} \right) + v $$
We will now argue that $c$ is independent of $v$ and that $f$ minimizes
$$ L (f) := c \left( \left( x_{1}, y_{1}, f \left( x_{1} \right) \right), \cdots , \left( x_{m}, y_{m}, f \left( x_{m} \right) \right) \right) + g \left( \left\| f \right\| \right) = c + g $$
when $v = 0$.
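
This decomposition can be computed explicitly with Gram matrices. The sketch below, which assumes an RBF kernel, random expansion points, and a nonsingular Gram matrix purely for illustration, projects an $f \in \mathcal{F}$ onto $\operatorname{span} \left\{ \phi \left( x_{i} \right) \right\}$ and checks that the training values $f \left( x_{j} \right)$, which are all that $c$ ever sees, come entirely from the span part, while the two functions may still differ away from the training inputs:

```python
import numpy as np

# Sketch: take f = sum_l beta_l k(., z_l) in F, solve K_xx alpha = (f(x_j))_j to
# get its projection s = sum_i alpha_i phi(x_i) onto span{phi(x_i)}, and check
# that s reproduces every training value f(x_j) (so v = f - s is orthogonal to
# each phi(x_j)) while f and s may differ at a new point.
# The RBF kernel, the points, and the nonsingularity of K_xx are assumptions.

def rbf_kernel(a, b, gamma=1.0):
    d2 = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2 * a @ b.T
    return np.exp(-gamma * d2)

rng = np.random.default_rng(5)
X = rng.normal(size=(6, 2))              # training inputs x_1, ..., x_m
Z = rng.normal(size=(8, 2))              # expansion points z_l defining f
beta = rng.normal(size=8)

K_xx = rbf_kernel(X, X)                  # <phi(x_i), phi(x_j)> = k(x_i, x_j)
K_xz = rbf_kernel(X, Z)                  # k(x_j, z_l)

f_at_X = K_xz @ beta                     # f(x_j), via the reproducing property
alpha = np.linalg.solve(K_xx, f_at_X)    # coefficients of the span part s

x_star = rng.normal(size=(1, 2))         # a point away from the training inputs
f_star = rbf_kernel(x_star, Z) @ beta    # f(x*)
s_star = rbf_kernel(x_star, X) @ alpha   # s(x*)

print(np.allclose(K_xx @ alpha, f_at_X))  # True: s matches f at every x_j
print((f_star - s_star).item())           # generally nonzero: v = f - s != 0
```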


Part 2. $c$ is independent of $v$

By the reproducing property, the inner product of the function $f = f(\cdot)$ with the reproducing kernel $k \left( \cdot , x_{j} \right)$ is
$$ \begin{align*} f \left( x_{j} \right) =& \left< f , k \left( \cdot , x_{j} \right) \right> \\ =& \left< \sum_{i=1}^{m} \alpha_{i} \phi \left( x_{i} \right) + v , \phi \left( x_{j} \right) \right> \\ =& \sum_{i=1}^{m} \alpha_{i} \left< \phi \left( x_{i} \right) , \phi \left( x_{j} \right) \right> + 0 \end{align*} $$
Since this does not depend on $v$, the term of $L(f) = c + g$ that depends only on the training data $D$ and the values of $f$,
$$ c = c \left( \left( x_{1}, y_{1}, f \left( x_{1} \right) \right), \cdots , \left( x_{m}, y_{m}, f \left( x_{m} \right) \right) \right) $$
is also independent of $v$.


Part 3. $g$ is minimized when $v = 0$

  • (1): $v$ is orthogonal to $\sum_{i=1}^{m} \alpha_{i} \phi \left( x_{i} \right)$, and
  • (2): $g$ is assumed to be monotonically increasing,

we obtain
$$ \begin{align*} & g \left( \left\| f \right\| \right) \\ =& g \left( \left\| \sum_{i=1}^{m} \alpha_{i} \phi \left( x_{i} \right) + v \right\| \right) \\ =& g \left( \sqrt{\left\| \sum_{i=1}^{m} \alpha_{i} \phi \left( x_{i} \right) \right\|^{2} + \left\| v \right\|^{2}} \right) & \because (1) \\ \ge& g \left( \left\| \sum_{i=1}^{m} \alpha_{i} \phi \left( x_{i} \right) \right\| \right) & \because (2) \end{align*} $$
As seen above, equality holds when $v = 0$, so for $g$ to be minimized we must have $v = 0$. Meanwhile, Part 2 showed that $v$ cannot influence $c$, so setting $v = 0$ costs nothing, and the function $f$ that minimizes $L = c + g$ can be represented in the form
$$ f (\cdot) = \sum_{i=1}^{m} \alpha_{i} k \left( \cdot , x_{i} \right) $$
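
The Pythagorean step in Part 3 can also be checked numerically. Using the same kind of Gram-matrix projection as in Part 1 (again with an RBF kernel and random points as illustrative assumptions), $\left\| f \right\|^{2} = \left\| s \right\|^{2} + \left\| v \right\|^{2}$ for the span part $s = \sum_{i} \alpha_{i} \phi \left( x_{i} \right)$, so any monotonically increasing $g$ satisfies $g \left( \left\| f \right\| \right) \ge g \left( \left\| s \right\| \right)$:

```python
import numpy as np

# Sketch: with f = s + v, s = sum_i alpha_i phi(x_i) the span part and v
# orthogonal to the span, the Gram-matrix norms satisfy
#   ||f||^2 = ||s||^2 + ||v||^2  >=  ||s||^2,
# so any monotonically increasing g gives g(||f||) >= g(||s||).
# The RBF kernel and the random points are illustrative assumptions.

def rbf_kernel(a, b, gamma=1.0):
    d2 = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2 * a @ b.T
    return np.exp(-gamma * d2)

rng = np.random.default_rng(6)
X = rng.normal(size=(6, 2))              # training inputs
Z = rng.normal(size=(8, 2))              # expansion points defining f
beta = rng.normal(size=8)

K_xx, K_xz, K_zz = rbf_kernel(X, X), rbf_kernel(X, Z), rbf_kernel(Z, Z)
alpha = np.linalg.solve(K_xx, K_xz @ beta)   # span part s of f, as in Part 1

norm_f_sq = beta @ K_zz @ beta                                 # ||f||^2
norm_s_sq = alpha @ K_xx @ alpha                               # ||s||^2
norm_v_sq = norm_f_sq - 2 * (alpha @ K_xz @ beta) + norm_s_sq  # ||f - s||^2

print(np.isclose(norm_f_sq, norm_s_sq + norm_v_sq))  # True: Pythagorean identity
print(bool(norm_f_sq >= norm_s_sq))                  # True: hence g(||f||) >= g(||s||)
```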


  1. Wahba. (2019). Representer Theorem. https://pages.stat.wisc.edu/~wahba/ftp1/wahba.wang.2019submit.pdf

  2. Schölkopf. (2001). A Generalized Representer Theorem. https://link.springer.com/chapter/10.1007/3-540-44581-1_27