

Proof of the Representer Theorem

Theorem

Suppose we are given a nonempty input set $X \ne \emptyset$ and a positive definite kernel $k: X \times X \to \mathbb{R}$. Define the Training Dataset as
$$ D := \left\{ \left( x_{i} , y_{i} \right) \right\}_{i=1}^{m} \subset X \times \mathbb{R} $$
and a class in the Reproducing Kernel Hilbert Space $H_{k}$ as
$$ \mathcal{F} := \left\{ f \in \mathbb{R}^{X} : f \left( \cdot \right) = \sum_{i=1}^{\infty} \beta_{i} k \left( \cdot , z_{i} \right) \land \beta_{i} \in \mathbb{R} \land z_{i} \in X \land \left\| f \right\| < \infty \right\} \subset H_{k} $$
Consider an arbitrary objective function $c : \left( D \times \mathbb{R} \right)^{m} \to \overline{\mathbb{R}}$ and a monotonically increasing regularizer $g : \mathbb{R} \to [0,\infty)$, and define the Regularized Objective Functional $L : \mathcal{F} \to \overline{\mathbb{R}}$ as follows:
$$ L (f) := c \left( \left( x_{1}, y_{1}, f \left( x_{1} \right) \right), \cdots , \left( x_{m}, y_{m}, f \left( x_{m} \right) \right) \right) + g \left( \left\| f \right\| \right) $$
Here, the norm $\left\| \cdot \right\|$ in $H_{k}$ is given, by the positive definiteness of $k$, as follows:
$$ \left\| \sum_{i=1}^{\infty} \beta_{i} k \left( \cdot , z_{i} \right) \right\|^{2} := \sum_{i=1}^{\infty} \sum_{j=1}^{\infty} \beta_{i} \beta_{j} k \left( z_{i} , z_{j} \right) \ge 0 $$


  • $\mathbb{R}$ is the set of real numbers, and $\overline{\mathbb{R}}$ is the extended real line, which includes infinity $\infty$.
  • $\mathbb{R}^{X}$ is the function space of functions with domain $X$ and codomain $\mathbb{R}$.
  • A regularizer is a penalty function to prevent overfitting on data.
  • A functional is, loosely speaking, a function that takes another function as input.
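
As a quick numerical illustration of the norm just defined, the squared RKHS norm of a finite expansion $f = \sum_{i} \beta_{i} k \left( \cdot , z_{i} \right)$ reduces to the quadratic form $\beta^{T} K \beta$ over the Gram matrix $K_{ij} = k \left( z_{i} , z_{j} \right)$. The sketch below assumes a Gaussian (RBF) kernel and arbitrary expansion points purely for illustration:

```python
import numpy as np

# Sketch: the squared RKHS norm of f = sum_i beta_i k(., z_i) is the quadratic
# form beta^T K beta, which is nonnegative because k is positive definite.
# The RBF kernel and the random expansion points are illustrative assumptions.

def rbf_kernel(a, b, gamma=1.0):
    """Gaussian kernel k(a, b) = exp(-gamma * ||a - b||^2) on all pairs of rows."""
    d2 = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2 * a @ b.T
    return np.exp(-gamma * d2)

rng = np.random.default_rng(0)
Z = rng.normal(size=(5, 2))    # expansion points z_1, ..., z_5 in X = R^2
beta = rng.normal(size=5)      # arbitrary real coefficients beta_i

K = rbf_kernel(Z, Z)           # Gram matrix K_ij = k(z_i, z_j)
sq_norm = beta @ K @ beta      # ||f||^2 = sum_ij beta_i beta_j k(z_i, z_j)
print(sq_norm >= 0)            # True, by positive definiteness of k
```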

Nonparametric

Any function $f \in \mathcal{F}$ that minimizes $L(f)$ can be expressed, for some $\left\{ \alpha_{i} \right\}_{i=1}^{m} \subset \mathbb{R}$, in the following form:
$$ f (\cdot) = \sum_{i=1}^{m} \alpha_{i} k \left( \cdot , x_{i} \right) $$
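
In practical terms, a function of this form is completely determined by the $m$ training inputs and the $m$ coefficients $\alpha_{i}$, so evaluating it anywhere only requires kernel evaluations against the training inputs. A minimal sketch, assuming an RBF kernel and arbitrary coefficients chosen only for illustration:

```python
import numpy as np

# Sketch: a function of the representer form f(.) = sum_i alpha_i k(., x_i)
# is stored as (alpha, X_train) and evaluated by kernel sums alone.
# The RBF kernel and the coefficients alpha are illustrative assumptions.

def rbf_kernel(a, b, gamma=1.0):
    d2 = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2 * a @ b.T
    return np.exp(-gamma * d2)

rng = np.random.default_rng(1)
X_train = rng.uniform(-1, 1, size=(10, 1))   # training inputs x_1, ..., x_m
alpha = rng.normal(size=10)                  # some coefficients alpha_i

def f(x_query):
    """Evaluate f(x) = sum_i alpha_i k(x, x_i) at each row of x_query."""
    return rbf_kernel(x_query, X_train) @ alpha

print(f(np.array([[0.0], [0.5]])))           # two evaluations of f
```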

Semiparametric

Suppose $\left\{ \psi_{p} : X \to \mathbb{R} \right\}_{p=1}^{M}$ is a set of $M$ real-valued functions on $X$ such that the matrix $\left( \psi_{p} \left( x_{i} \right) \right)_{ip}$ has rank $M$. Then any $\tilde{f} = f + h$, with $f \in \mathcal{F}$ and $h \in \operatorname{span} \left\{ \psi_{p} \right\}$, that minimizes
$$ c \left( \left( x_{1}, y_{1}, \tilde{f} \left( x_{1} \right) \right) , \cdots , \left( x_{m}, y_{m}, \tilde{f} \left( x_{m} \right) \right) \right) + g \left( \left\| f \right\| \right) $$
can be expressed, for some $\left\{ \alpha_{i} \right\}_{i=1}^{m}, \left\{ \beta_{p} \right\}_{p=1}^{M} \subset \mathbb{R}$, in the following form:
$$ \tilde{f} (\cdot) = \sum_{i=1}^{m} \alpha_{i} k \left( \cdot , x_{i} \right) + \sum_{p=1}^{M} \beta_{p} \psi_{p} (\cdot) $$
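
The semiparametric case can be made concrete with a small sketch. Everything below is an illustrative assumption rather than part of the theorem: an RBF kernel, the unpenalized basis $\psi_{1}(x) = 1$, $\psi_{2}(x) = x$, the squared loss as $c$, and $g \left( \left\| f \right\| \right) = \lambda \left\| f \right\|^{2}$. Under these choices, and assuming the Gram matrix $K$ is nonsingular, the stationarity conditions of the quadratic objective reduce to the linear system $(K + m \lambda I) \alpha + \Psi \beta = y$, $\Psi^{T} \alpha = 0$, and the fit has exactly the stated form $\tilde{f} (\cdot) = \sum_{i} \alpha_{i} k \left( \cdot , x_{i} \right) + \sum_{p} \beta_{p} \psi_{p} (\cdot)$:

```python
import numpy as np

# Sketch of the semiparametric fit under illustrative assumptions: RBF kernel,
# unpenalized basis psi_1(x) = 1 and psi_2(x) = x, squared loss, and
# g(||f||) = lam * ||f||^2. Minimizing (1/m)||y - K a - Psi b||^2 + lam * a^T K a
# jointly in (a, b), with K nonsingular, reduces to the linear system
#   (K + m*lam*I) a + Psi b = y,   Psi^T a = 0.

def rbf_kernel(a, b, gamma=5.0):
    d2 = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2 * a @ b.T
    return np.exp(-gamma * d2)

rng = np.random.default_rng(2)
m, lam = 30, 1e-2
X = rng.uniform(-1, 1, size=(m, 1))
y = 1.5 + 2.0 * X[:, 0] + np.sin(4 * X[:, 0]) + 0.1 * rng.normal(size=m)

K = rbf_kernel(X, X)                            # Gram matrix K_ij = k(x_i, x_j)
Psi = np.column_stack([np.ones(m), X[:, 0]])    # (psi_p(x_i))_ip with rank M = 2
M = Psi.shape[1]

A = np.block([[K + m * lam * np.eye(m), Psi],
              [Psi.T, np.zeros((M, M))]])
theta = np.linalg.solve(A, np.concatenate([y, np.zeros(M)]))
alpha, beta = theta[:m], theta[m:]

fitted = K @ alpha + Psi @ beta                 # f~(x_i) at the training inputs
print(np.round(beta, 2))                        # coefficients of the unpenalized part
```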

Explanation


Representers

The Representer Theorem is one of the most important theorems in the context of classical machine learning, especially support vector machines. It states that the function $f$ we wish to fit to given data can be represented, with respect to a suitable kernel $k$, in the form
$$ f (\cdot) = \sum_{i=1}^{m} \alpha_{i} k \left( \cdot , x_{i} \right) $$
Here, the functions $k \left( \cdot , x_{i} \right) = \phi \left( x_{i} \right) (\cdot) \in H_{k}$, obtained by fixing one of the kernel's inputs at $x_{i}$, are called representers. According to this theorem, any function fitted to the training data in the reproducing kernel Hilbert space can be represented as a finite linear combination of representers. This aligns perfectly with the kernel trick used for nonlinear regression with support vector machines.

In the context of data science, the representers $\phi \left( x_{i} \right) (\cdot)$ are also referred to as feature maps: any data in $X$ can be mapped into a Hilbert space where its features are known, and a fitted function can be represented by a finite sum of these features. This justifies why many of the machine learning techniques we have learned so far work. While these techniques are valid in their own right even without a mathematical guarantee, the Representer Theorem is crucial because it provides a theoretical foundation for them.

Objective Function and Regularizer

Although the objective function $c$ and the regularizer $g$ are defined very generally in the statement of the theorem, in many cases $c$ is the mean squared residual
$$ c = {{ 1 } \over { m }} \sum_{i=1}^{m} \left( y_{i} - f \left( x_{i} \right) \right)^{2} $$
measuring the fit between the data and $f$, and $g$ is a squared semi-norm penalty $g \left( \left\| f \right\| \right) = \lambda \left\| f \right\|^{2}$.[1]
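
For this common special case, the whole procedure collapses to kernel ridge regression. A minimal sketch, assuming an RBF kernel and an arbitrary $\lambda$: substituting $f (\cdot) = \sum_{i} \alpha_{i} k \left( \cdot , x_{i} \right)$ gives $L(\alpha) = \frac{1}{m} \left\| y - K \alpha \right\|^{2} + \lambda \alpha^{T} K \alpha$, whose stationarity condition is solved by $\alpha = \left( K + m \lambda I \right)^{-1} y$:

```python
import numpy as np

# Sketch: with c the mean squared residual and g(||f||) = lam * ||f||^2,
# plugging f(.) = sum_i alpha_i k(., x_i) into L gives
#   L(alpha) = (1/m) ||y - K alpha||^2 + lam * alpha^T K alpha,
# which is minimized by alpha = (K + m*lam*I)^{-1} y (kernel ridge regression).
# The RBF kernel, lam, and the toy data are illustrative assumptions.

def rbf_kernel(a, b, gamma=5.0):
    d2 = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2 * a @ b.T
    return np.exp(-gamma * d2)

rng = np.random.default_rng(3)
m, lam = 40, 1e-3
X = rng.uniform(-1, 1, size=(m, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=m)

K = rbf_kernel(X, X)
alpha = np.linalg.solve(K + m * lam * np.eye(m), y)   # representer coefficients

X_new = np.linspace(-1, 1, 5).reshape(-1, 1)
f_new = rbf_kernel(X_new, X) @ alpha                  # f(x) = sum_i alpha_i k(x, x_i)
print(np.round(f_new, 2))
print(np.round(np.sin(3 * X_new[:, 0]), 2))           # noiseless target, for comparison
```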

It is important not to distinguish between the objective function and the regularizer merely based on the appearance of the equations or the meanings of the words. The most emblematic application of the Representer Theorem is the support vector machine, where the minimization problem handled in the Soft Margin SVM is as follows:
$$ \begin{matrix} \text{Minimize} & {{ 1 } \over { 2 }} \left\| \mathbf{w} \right\|_{2}^{2} + \lambda \sum_{k=1}^{n} \xi_{k} \\ \text{subject to} & \left| f \left( \mathbf{x}_{k} \right) \right| \ge 1 - \xi_{k} \\ & \xi_{k} \ge 0 \end{matrix} \qquad k = 1, \cdots , n $$

Here, considering the optimization itself, the objective for a constant $\lambda \ne 0$ is essentially
$$ {{ 1 } \over { 2 \lambda }} \left\| \mathbf{w} \right\|_{2}^{2} + \sum_{k=1}^{n} \xi_{k} $$
where $\sum_{k=1}^{n} \xi_{k}$, which measures the discrepancy with the data, plays the role of $c$, and ${{ 1 } \over { 2 \lambda }} \left\| \mathbf{w} \right\|_{2}^{2}$, which comes from the SVM hyperplane $f (\mathbf{x}) = \mathbf{w}^{T} \mathbf{x} + b$, plays the role of $g$. This can be confusing depending on whether the focus is on mathematics or machine learning, because

  • Those with a stronger mathematical inclination see the SVM as 'linear regression first, then considering exceptions' and tend to minimize $\left\| \mathbf{w} \right\|_{2}^{2}$ first,
  • Those with a stronger data science inclination view the SVM as 'first succeeding in classifying the data well, then finding the hyperplane with the largest margin' and tend to minimize $\sum_{k=1}^{n} \xi_{k}$ first.

Both perspectives are understandable, and since the application of the Representer Theorem isn’t limited to just SVM, one should think and adapt actively to the problem at hand rather than seeking a mnemonic technique.
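
To make the $c + g$ split above tangible, the following toy sketch evaluates both pieces for a fixed candidate hyperplane. The data, the candidate $\mathbf{w}, b$, and the use of the standard hinge slack $\xi_{k} = \max \left( 0, 1 - y_{k} f \left( \mathbf{x}_{k} \right) \right)$ are illustrative choices of this sketch, not prescriptions of the post:

```python
import numpy as np

# Sketch: for a linear SVM f(x) = w^T x + b, compute the two pieces of L,
#   c = sum_k xi_k   with the standard hinge slack xi_k = max(0, 1 - y_k f(x_k)),
#   g = ||w||^2 / (2 * lam).
# The toy data, the candidate (w, b), and lam are illustrative assumptions.

rng = np.random.default_rng(4)
n, lam = 50, 1.0
X = rng.normal(size=(n, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1.0, -1.0)   # labels in {-1, +1}

w = np.array([1.0, 1.0])                          # some candidate hyperplane
b = 0.0
f_vals = X @ w + b                                # f(x_k) = w^T x_k + b

xi = np.maximum(0.0, 1.0 - y * f_vals)            # slack variables xi_k >= 0
c = xi.sum()                                      # the "fit the data" term
g = (w @ w) / (2.0 * lam)                         # the "regularize" term
print(round(c, 3), round(g, 3), round(c + g, 3))  # the two pieces of L and their sum
```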

Proof [2]

For simplicity, only the nonparametric representer theorem is proved as per the reference.


Part 1. $f = \sum_{i=1}^{m} \alpha_{i} \phi \left( x_{i} \right) + v$

Definition of a Reproducing Kernel: A function $k : X \times X \to \mathbb{C}$ is called the reproducing kernel of $H$ if it satisfies the following two conditions:

  • (i) Representer: For all $x \in X$, $k \left( \cdot , x \right) \in H$.
  • (ii) Reproducing Property: For all $x \in X$ and all $f \in H$, $\left< f , k \left( \cdot , x \right) \right> = f(x) = \delta_{x} (f)$. In particular, for all $x_{1} , x_{2} \in X$, $k \left( x_{1} , x_{2} \right) = \left< k \left( \cdot , x_{2} \right), k \left( \cdot , x_{1} \right) \right>$.

Define the (representer) map $\phi : X \to \mathbb{R}^{X}$ for the reproducing kernel $k : X \times X \to \mathbb{R}$ by $x \mapsto k (\cdot , x)$. Since $k$ is a reproducing kernel, the value of the function $\left( \phi (x) \right) (\cdot)$ at any $x, x' \in X$ is
$$ \left( \phi (x) \right) (x') = k \left( x' , x \right) = \left< \phi \left( x' \right) , \phi (x) \right> $$
where $\left< \cdot , \cdot \right>$ is the inner product in $H_{k}$. For the given $\left\{ x_{i} \right\}_{i=1}^{m}$, any function $f \in \mathcal{F}$ can be written as the sum of a part in $\operatorname{span} \left\{ \phi \left( x_{i} \right) \right\}_{i=1}^{m}$ and a part $v$ orthogonal to it, satisfying $\left< v , \phi \left( x_{j} \right) \right> = 0$ for all $j$; that is, for some $\left( \alpha_{1} , \cdots , \alpha_{m} \right) \in \mathbb{R}^{m}$,
$$ f = \sum_{i=1}^{m} \alpha_{i} \phi \left( x_{i} \right) + v $$
We will now argue that $c$ is independent of $v$ and that $f$ minimizes
$$ L (f) := c \left( \left( x_{1}, y_{1}, f \left( x_{1} \right) \right), \cdots , \left( x_{m}, y_{m}, f \left( x_{m} \right) \right) \right) + g \left( \left\| f \right\| \right) = c + g $$
when $v = 0$.
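
This decomposition can be computed explicitly with Gram matrices. The sketch below, which assumes an RBF kernel, random expansion points, and a nonsingular Gram matrix purely for illustration, projects an $f \in \mathcal{F}$ onto $\operatorname{span} \left\{ \phi \left( x_{i} \right) \right\}$ and checks that the training values $f \left( x_{j} \right)$, which are all that $c$ ever sees, come entirely from the span part, while the two functions may still differ away from the training inputs:

```python
import numpy as np

# Sketch: take f = sum_l beta_l k(., z_l) in F, solve K_xx alpha = (f(x_j))_j to
# get its projection s = sum_i alpha_i phi(x_i) onto span{phi(x_i)}, and check
# that s reproduces every training value f(x_j) (so v = f - s is orthogonal to
# each phi(x_j)) while f and s may differ at a new point.
# The RBF kernel, the points, and the nonsingularity of K_xx are assumptions.

def rbf_kernel(a, b, gamma=1.0):
    d2 = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2 * a @ b.T
    return np.exp(-gamma * d2)

rng = np.random.default_rng(5)
X = rng.normal(size=(6, 2))              # training inputs x_1, ..., x_m
Z = rng.normal(size=(8, 2))              # expansion points z_l defining f
beta = rng.normal(size=8)

K_xx = rbf_kernel(X, X)                  # <phi(x_i), phi(x_j)> = k(x_i, x_j)
K_xz = rbf_kernel(X, Z)                  # k(x_j, z_l)

f_at_X = K_xz @ beta                     # f(x_j), via the reproducing property
alpha = np.linalg.solve(K_xx, f_at_X)    # coefficients of the span part s

x_star = rng.normal(size=(1, 2))         # a point away from the training inputs
f_star = rbf_kernel(x_star, Z) @ beta    # f(x*)
s_star = rbf_kernel(x_star, X) @ alpha   # s(x*)

print(np.allclose(K_xx @ alpha, f_at_X))  # True: s matches f at every x_j
print((f_star - s_star).item())           # generally nonzero: v = f - s != 0
```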


Part 2. $c$ is independent of $v$

By the reproducing property, the inner product of the function $f = f(\cdot)$ with the reproducing kernel $k \left( \cdot , x_{j} \right)$ is
$$ \begin{align*} f \left( x_{j} \right) =& \left< f , k \left( \cdot , x_{j} \right) \right> \\ =& \left< \sum_{i=1}^{m} \alpha_{i} \phi \left( x_{i} \right) + v , \phi \left( x_{j} \right) \right> \\ =& \sum_{i=1}^{m} \alpha_{i} \left< \phi \left( x_{i} \right) , \phi \left( x_{j} \right) \right> + 0 \end{align*} $$
Since this does not depend on $v$, the term of $L(f) = c + g$ that depends only on the training data $D$ and the values of $f$,
$$ c = c \left( \left( x_{1}, y_{1}, f \left( x_{1} \right) \right), \cdots , \left( x_{m}, y_{m}, f \left( x_{m} \right) \right) \right) $$
is also independent of $v$.


Part 3. $g$ is minimized when $v = 0$

  • (1): $v$ is orthogonal to $\sum_{i=1}^{m} \alpha_{i} \phi \left( x_{i} \right)$, and
  • (2): $g$ is assumed to be monotonically increasing,

we obtain
$$ \begin{align*} & g \left( \left\| f \right\| \right) \\ =& g \left( \left\| \sum_{i=1}^{m} \alpha_{i} \phi \left( x_{i} \right) + v \right\| \right) \\ =& g \left( \sqrt{\left\| \sum_{i=1}^{m} \alpha_{i} \phi \left( x_{i} \right) \right\|^{2} + \left\| v \right\|^{2}} \right) & \because (1) \\ \ge& g \left( \left\| \sum_{i=1}^{m} \alpha_{i} \phi \left( x_{i} \right) \right\| \right) & \because (2) \end{align*} $$
As seen above, equality holds when $v = 0$, so for $g$ to be minimized we must have $v = 0$. Meanwhile, Part 2 showed that $v$ cannot influence $c$, so setting $v = 0$ costs nothing, and the function $f$ that minimizes $L = c + g$ can be represented in the form
$$ f (\cdot) = \sum_{i=1}^{m} \alpha_{i} k \left( \cdot , x_{i} \right) $$
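
The Pythagorean step in Part 3 can also be checked numerically. Using the same kind of Gram-matrix projection as in Part 1 (again with an RBF kernel and random points as illustrative assumptions), $\left\| f \right\|^{2} = \left\| s \right\|^{2} + \left\| v \right\|^{2}$ for the span part $s = \sum_{i} \alpha_{i} \phi \left( x_{i} \right)$, so any monotonically increasing $g$ satisfies $g \left( \left\| f \right\| \right) \ge g \left( \left\| s \right\| \right)$:

```python
import numpy as np

# Sketch: with f = s + v, s = sum_i alpha_i phi(x_i) the span part and v
# orthogonal to the span, the Gram-matrix norms satisfy
#   ||f||^2 = ||s||^2 + ||v||^2  >=  ||s||^2,
# so any monotonically increasing g gives g(||f||) >= g(||s||).
# The RBF kernel and the random points are illustrative assumptions.

def rbf_kernel(a, b, gamma=1.0):
    d2 = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2 * a @ b.T
    return np.exp(-gamma * d2)

rng = np.random.default_rng(6)
X = rng.normal(size=(6, 2))              # training inputs
Z = rng.normal(size=(8, 2))              # expansion points defining f
beta = rng.normal(size=8)

K_xx, K_xz, K_zz = rbf_kernel(X, X), rbf_kernel(X, Z), rbf_kernel(Z, Z)
alpha = np.linalg.solve(K_xx, K_xz @ beta)   # span part s of f, as in Part 1

norm_f_sq = beta @ K_zz @ beta                                 # ||f||^2
norm_s_sq = alpha @ K_xx @ alpha                               # ||s||^2
norm_v_sq = norm_f_sq - 2 * (alpha @ K_xz @ beta) + norm_s_sq  # ||f - s||^2

print(np.isclose(norm_f_sq, norm_s_sq + norm_v_sq))  # True: Pythagorean identity
print(bool(norm_f_sq >= norm_s_sq))                  # True: hence g(||f||) >= g(||s||)
```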


  1. Wahba. (2019). Representer Theorem. https://pages.stat.wisc.edu/~wahba/ftp1/wahba.wang.2019submit.pdf

  2. Schölkopf. (2001). A Generalized Representer Theorem. https://link.springer.com/chapter/10.1007/3-540-44581-1_27