Let us assume we are given a nonempty input set $X \ne \emptyset$ and a positive definite kernel $k : X \times X \to \mathbb{R}$. Define the Training Dataset as
$$ D := \left\{ (x_i, y_i) \right\}_{i=1}^{m} \subset X \times \mathbb{R} $$
and a class in the Reproducing Kernel Hilbert Space $H_k$ as
$$ F := \left\{ f \in \mathbb{R}^{X} : f(\cdot) = \sum_{i=1}^{\infty} \beta_i k(\cdot, z_i) \;\land\; \beta_i \in \mathbb{R} \;\land\; z_i \in X \;\land\; \| f \| < \infty \right\} \subset H_k $$
Consider an arbitrary objective function $c : (D \times \mathbb{R})^{m} \to \mathbb{R}$ and a monotonically increasing regularizer $g : \mathbb{R} \to [0, \infty)$, and define the Regularized Objective Functional $L : F \to \mathbb{R}$ as follows:
$$ L(f) := c\big( (x_1, y_1, f(x_1)), \cdots, (x_m, y_m, f(x_m)) \big) + g\left( \| f \| \right) $$
Here, the norm $\| \cdot \|$ in $H_k$ is given, according to the positive definiteness of $k$, as follows:
$$ \left\| \sum_{i=1}^{\infty} \beta_i k(\cdot, z_i) \right\|^{2} := \sum_{i=1}^{\infty} \sum_{j=1}^{\infty} \beta_i \beta_j k(z_i, z_j) \ge 0 $$
Here $\mathbb{R}$ is the set of real numbers, understood as an extended real number system that includes infinity $\infty$.
A functional is, loosely speaking, a function that takes another function as input.
Nonparametric
The function $f \in F$ that minimizes $L(f)$ is expressed in the following form, for some $\{ \alpha_i \}_{i=1}^{m} \subset \mathbb{R}$:
$$ f(\cdot) = \sum_{i=1}^{m} \alpha_i k(\cdot, x_i) $$
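As a concrete illustration (not part of the statement itself), consider the special case discussed later in the explanation, where $c$ is the squared residual and $g(\| f \|) = \lambda \| f \|^{2}$, i.e. kernel ridge regression. The minimizer then not only has the finite form above, but its coefficients $\alpha$ even admit a closed form. The Gaussian kernel, the toy data, and all names in this sketch are illustrative assumptions.

```python
import numpy as np

# Minimal sketch: for the sum-of-squared-errors objective plus the penalty
# lambda * ||f||^2, the minimizer is f(.) = sum_i alpha_i k(., x_i) with
# alpha = (K + lambda * I)^{-1} y, where K is the Gram matrix K_ij = k(x_i, x_j).

def rbf_kernel(a, b, gamma=1.0):
    """Gaussian (RBF) kernel k(a, b) = exp(-gamma * |a - b|^2) on scalar inputs."""
    return np.exp(-gamma * (a[:, None] - b[None, :]) ** 2)

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=20)                  # training inputs x_1, ..., x_m
y = np.sin(x) + 0.1 * rng.standard_normal(20)    # noisy targets y_1, ..., y_m

lam = 1e-2                                       # regularization weight lambda
K = rbf_kernel(x, x)                             # Gram matrix of the representers
alpha = np.linalg.solve(K + lam * np.eye(len(x)), y)   # representer coefficients

def f(t):
    """The fitted function: a finite linear combination of representers k(., x_i)."""
    return rbf_kernel(np.atleast_1d(t), x) @ alpha

print(f(0.5))   # prediction at a new point, built only from the m representers
```

Even though $F$ allows infinite expansions over arbitrary points $z_i$, the solution only ever needs the $m$ representers anchored at the training inputs.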
Semiparametric
Assume a set of $M$ real-valued functions $\{ \psi_p : X \to \mathbb{R} \}_{p=1}^{M}$ defined on $X$ such that the matrix $\left( \psi_p (x_i) \right)_{ip}$ has rank $M$. Then, for $f \in F$ and $h \in \operatorname{span} \{ \psi_p \}$,
the function $\tilde{f} = f + h$ that minimizes
$$ c\big( (x_1, y_1, \tilde{f}(x_1)), \cdots, (x_m, y_m, \tilde{f}(x_m)) \big) + g\left( \| f \| \right) $$
is expressed in the following form, for some $\{ \alpha_i \}_{i=1}^{m} \subset \mathbb{R}$ and $\{ \beta_p \}_{p=1}^{M} \subset \mathbb{R}$:
$$ \tilde{f}(\cdot) = \sum_{i=1}^{m} \alpha_i k(\cdot, x_i) + \sum_{p=1}^{M} \beta_p \psi_p (\cdot) $$
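Under the same illustrative assumptions as the previous sketch (squared-error loss, Gaussian kernel), the semiparametric form can also be computed directly; here $\psi_1(t) = 1$ and $\psi_2(t) = t$ are chosen as the unpenalized basis functions, and only the RKHS part carries the penalty, matching $g(\| f \|)$ in the statement. This is a sketch under those assumptions, not a general-purpose implementation.

```python
import numpy as np

# Sketch of the semiparametric form f~(.) = sum_i alpha_i k(., x_i) + sum_p beta_p psi_p(.)
# for squared-error loss with penalty lam * ||f||^2 on the RKHS part only.
# Setting the gradients in alpha and beta to zero (assuming the Gram matrix K is
# invertible) gives the block system
#   (K + lam * I) alpha + Psi beta = y,     Psi^T alpha = 0.

def rbf_kernel(a, b, gamma=1.0):
    return np.exp(-gamma * (a[:, None] - b[None, :]) ** 2)

rng = np.random.default_rng(1)
m, M, lam = 30, 2, 1e-2
x = rng.uniform(-3, 3, size=m)
y = 0.5 * x + np.sin(x) + 0.1 * rng.standard_normal(m)

K = rbf_kernel(x, x)                      # Gram matrix of the representers
Psi = np.column_stack([np.ones(m), x])    # (psi_p(x_i))_{ip}, assumed to have rank M

A = np.block([[K + lam * np.eye(m), Psi],
              [Psi.T, np.zeros((M, M))]])
sol = np.linalg.solve(A, np.concatenate([y, np.zeros(M)]))
alpha, beta = sol[:m], sol[m:]

def f_tilde(t):
    """Finite representer expansion plus the unpenalized parametric part."""
    t = np.atleast_1d(t)
    return rbf_kernel(t, x) @ alpha + np.column_stack([np.ones(len(t)), t]) @ beta

print(f_tilde(1.0))
```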
Explanation
The Representer Theorem is one of the most important theorems in classical machine learning, especially for Support Vector Machines. It states that the function $f$ we wish to fit to given data can be represented in the form
$$ f(\cdot) = \sum_{i=1}^{m} \alpha_i k(\cdot, x_i) $$
with respect to a suitable kernel $k$. Here, the functions
$$ k(\cdot, x_i) = \phi(x_i)(\cdot) \in H_k $$
with one of the kernel’s inputs fixed at $x_i$ are called representers. According to this theorem, any function fitted to the training data in the Reproducing Kernel Hilbert Space can be represented as a finite linear combination of representers. This aligns perfectly with the kernel trick for support vector machines in nonlinear regression.
In the context of data science, the representers $\phi(x_i)(\cdot)$ are also referred to as feature maps, implying that any data in $X$ can be mapped into a Hilbert space where its features are known, and that a fitted function can be represented by a finite sum of these features; this justifies why many machine learning techniques we have learned so far work. While these techniques are valid in their own right even without mathematical guarantees, the Representer Theorem is crucial as it provides a theoretical foundation for them.
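For a small numerical check of this feature-map picture, one can take a kernel whose feature map is explicit and finite-dimensional; the homogeneous degree-2 polynomial kernel on $\mathbb{R}^{2}$ used below is purely an illustrative choice (for the Gaussian kernel, for instance, the feature map is infinite-dimensional).

```python
import numpy as np

# The homogeneous degree-2 polynomial kernel k(x, x') = (x^T x')^2 on R^2 has the
# explicit feature map phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2), and the kernel value
# coincides with the inner product of the mapped features: k(x, x') = <phi(x), phi(x')>.

def k(x, xp):
    return (x @ xp) ** 2

def phi(x):
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x, xp = np.array([1.0, 2.0]), np.array([-0.5, 3.0])
print(k(x, xp), phi(x) @ phi(xp))   # both print 30.25
```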
Objective Function and Regularizer
Although the objective function $c$ and the regularizer $g$ are defined very generally in the theorem’s statement, in many cases $c$ can be taken to be the mean squared residual
$$ c = \frac{1}{m} \sum_{i=1}^{m} \left( y_i - f(x_i) \right)^{2} $$
measuring the fit between the data and $f$, and $g$ can be a squared (semi)norm penalty $g\left( \| f \| \right) = \lambda \| f \|^{2}$.
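The sketch below spells out this particular choice of $c$ and $g$ for a candidate $f(\cdot) = \sum_i \alpha_i k(\cdot, x_i)$, using $\| f \|^{2} = \alpha^{\top} K \alpha$ from the norm definition in the statement; the kernel, data, and names are assumptions made for illustration.

```python
import numpy as np

# Evaluating L(f) = c + g for a candidate f(.) = sum_i alpha_i k(., x_i):
# c is the mean squared residual, and g(||f||) = lam * ||f||^2 = lam * alpha^T K alpha
# via ||f||^2 = sum_ij alpha_i alpha_j k(x_i, x_j).

def rbf_kernel(a, b, gamma=1.0):
    return np.exp(-gamma * (a[:, None] - b[None, :]) ** 2)

rng = np.random.default_rng(2)
x = rng.uniform(-3, 3, size=15)
y = np.sin(x)
K = rbf_kernel(x, x)
lam = 1e-2

def L(alpha):
    residual = y - K @ alpha              # f(x_i) = (K @ alpha)_i
    c = np.mean(residual ** 2)            # objective: mean squared residual
    g = lam * alpha @ K @ alpha           # regularizer: lam * ||f||^2
    return c + g

alpha_star = np.linalg.solve(K + lam * len(x) * np.eye(len(x)), y)  # minimizer for this c, g
print(L(alpha_star), L(alpha_star + 0.1))  # perturbing alpha does not decrease L
```

Since both terms are convex in $\alpha$, the closed-form $\alpha$ above is a global minimizer, which is why the perturbed coefficients cannot do better.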
It’s important not to distinguish between the objective function and regularizer merely based on the appearance of the equations or the meanings of the words. The most emblematic application of the Representer Theorem is the Support Vector Machine, where the minimization problem handled in Soft Margin SVM is as follows:
$$ \begin{array}{rl} \text{Minimize} & \dfrac{1}{2} \| w \|_{2}^{2} + \lambda \displaystyle\sum_{k=1}^{n} \xi_k \\ \text{subject to} & \left| f(x_k) \right| \ge 1 - \xi_k \\ & \xi_k \ge 0, \qquad k = 1, \cdots, n \end{array} $$
Here, considering the optimization itself, the objective function for a constant $\lambda \ne 0$ is essentially
$$ \frac{1}{2 \lambda} \| w \|_{2}^{2} + \sum_{k=1}^{n} \xi_k $$
where $\sum_{k=1}^{n} \xi_k$, representing the discrepancy with the data, becomes $c$, and $\frac{1}{2 \lambda} \| w \|_{2}^{2}$, derived from the hyperplane $f(x) = w^{T} x + b$ of the Support Vector Machine, becomes $g$. This can be confusing depending on whether the focus is on mathematics or machine learning, because
Those with a stronger mathematical inclination see SVM as ‘linear regression first, then considering exceptions’ and tend to minimize $\| w \|_{2}^{2}$ first,
Those with a stronger data science inclination view SVM as ‘first succeed in classifying the data well, then find the hyperplane with the largest margin’ and tend to minimize $\sum_{k=1}^{n} \xi_k$ first.
Both perspectives are understandable, and since the application of the Representer Theorem isn’t limited to just SVM, one should think and adapt actively to the problem at hand rather than seeking a mnemonic technique.
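To see the split in code, the sketch below evaluates the two pieces for an arbitrary hyperplane. It assumes the commonly used labeled form of the constraint, $y_k f(x_k) \ge 1 - \xi_k$ with $y_k \in \{-1, +1\}$, under which the smallest feasible slack is the hinge loss $\xi_k = \max(0, 1 - y_k f(x_k))$. The data, $\lambda$, and the particular $(w, b)$ are illustrative, not an optimum.

```python
import numpy as np

# Reading the soft-margin objective as c + g:
#   c = sum_k xi_k                      (discrepancy with the data)
#   g = (1 / (2 * lam)) * ||w||^2       (coming from the hyperplane f(x) = w^T x + b)
# with the slack taken as the hinge loss for the labeled constraint y_k * f(x_k) >= 1 - xi_k.

rng = np.random.default_rng(3)
X = rng.standard_normal((50, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)   # labels in {-1, +1}

w, b, lam = np.array([1.0, 1.0]), 0.0, 0.5
f = X @ w + b
xi = np.maximum(0.0, 1.0 - y * f)            # smallest feasible slack per point

c = xi.sum()                                  # data-fit term
g = (w @ w) / (2.0 * lam)                     # regularization term
print(c, g, c + g)                            # the objective split as in the text
```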
For simplicity, only the nonparametric representer theorem is proved as per the reference.
Part 1. $f = \sum_{i=1}^{m} \alpha_i \phi(x_i) + v$
Definition of Reproducing Kernel: A function $k : X \times X \to \mathbb{C}$ is called the reproducing kernel of $H$ if it satisfies the following two conditions:
(i) Representer: For all $x \in X$,
$$ k(\cdot, x) \in H $$
(ii) Reproducing Property: For all $x \in X$ and all $f \in H$,
$$ \left\langle f, k(\cdot, x) \right\rangle = f(x) = \delta_{x} (f) $$
In particular, for all $x_1, x_2 \in X$,
$$ k(x_1, x_2) = \left\langle k(\cdot, x_2), k(\cdot, x_1) \right\rangle $$
Define the (representer) map $\phi : X \to \mathbb{R}^{X}$ associated with the reproducing kernel $k : X \times X \to \mathbb{R}$ by $x \mapsto k(\cdot, x)$. Since $k$ is a reproducing kernel, the value of the function $(\phi(x))(\cdot)$ at any $x, x' \in X$ is
$$ (\phi(x))(x') = k(x', x) = \left\langle \phi(x'), \phi(x) \right\rangle $$
where $\left\langle \cdot, \cdot \right\rangle$ is the inner product in $H_k$. For a given $\{ x_i \}_{i=1}^{m}$, any function $f \in F$ can be expressed as the sum of a part in $\operatorname{span} \{ \phi(x_i) \}_{i=1}^{m}$ and a part $v$ orthogonal to it, satisfying
$$ \left\langle v, \phi(x_j) \right\rangle = 0 $$
for all $j$, for some $(\alpha_1, \cdots, \alpha_m) \in \mathbb{R}^{m}$, as follows:
$$ f = \sum_{i=1}^{m} \alpha_i \phi(x_i) + v $$
We will now argue that $c$ is independent of $v$, and that $f$ minimizes $L(f)$ when $v = 0$, where
$$ \begin{aligned} L(f) :=& c\big( (x_1, y_1, f(x_1)), \cdots, (x_m, y_m, f(x_m)) \big) + g\left( \| f \| \right) \\ =& c + g. \end{aligned} $$
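Numerically, this decomposition amounts to a single linear solve against the Gram matrix. The sketch below builds an arbitrary $f$ from representers at points $z_j$ outside $\{ x_i \}$, projects it onto $\operatorname{span} \{ \phi(x_i) \}$, and checks the orthogonality $\left\langle v, \phi(x_j) \right\rangle = 0$; the kernel, points, and coefficients are illustrative.

```python
import numpy as np

# Decomposition f = sum_i alpha_i phi(x_i) + v.
# Requiring <f - sum_i alpha_i phi(x_i), phi(x_j)> = 0 for all j means, by the
# reproducing property, K_xx @ alpha = (f(x_1), ..., f(x_m)), a plain linear system.

def rbf_kernel(a, b, gamma=1.0):
    return np.exp(-gamma * (a[:, None] - b[None, :]) ** 2)

rng = np.random.default_rng(4)
x = rng.uniform(-3, 3, size=6)      # training inputs x_1, ..., x_m
z = rng.uniform(-3, 3, size=4)      # points carrying an arbitrary f in the RKHS
beta = rng.standard_normal(4)       # f = sum_j beta_j k(., z_j)

K_xx = rbf_kernel(x, x)             # Gram matrix on the training points
f_at_x = rbf_kernel(x, z) @ beta    # values f(x_i)

alpha = np.linalg.solve(K_xx, f_at_x)   # coefficients of the span part

# v = f - sum_i alpha_i phi(x_i); by the reproducing property its values at the
# training points equal <v, phi(x_j)>, which should vanish.
v_at_x = f_at_x - K_xx @ alpha
print(np.max(np.abs(v_at_x)))       # ~ 0 up to floating point error
```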
Part 2. $c$ is independent of $v$
The inner product of the function $f = f(\cdot)$ with the reproducing kernel $k(\cdot, x_j)$ is, by the reproducing property,
$$ \begin{aligned} f(x_j) =& \left\langle f, k(\cdot, x_j) \right\rangle \\ =& \left\langle \sum_{i=1}^{m} \alpha_i \phi(x_i) + v, \phi(x_j) \right\rangle \\ =& \sum_{i=1}^{m} \alpha_i \left\langle \phi(x_i), \phi(x_j) \right\rangle + 0 \end{aligned} $$
Since this is independent of $v$, the term of $L(f) = c + g$ that depends only on the training data $D$ and $f$,
$$ c = c\big( (x_1, y_1, f(x_1)), \cdots, (x_m, y_m, f(x_m)) \big) $$
is also independent of $v$.
Part 3. $g$ is minimized when $v = 0$

Since $v$ is orthogonal to $\sum_{i=1}^{m} \alpha_i \phi(x_i)$, so that the Pythagorean theorem applies $(1)$, and since $g$ is monotonically increasing $(2)$, we obtain
$$ \begin{aligned} g\left( \| f \| \right) =& g\left( \left\| \sum_{i=1}^{m} \alpha_i \phi(x_i) + v \right\| \right) \\ =& g\left( \sqrt{ \left\| \sum_{i=1}^{m} \alpha_i \phi(x_i) \right\|^{2} + \| v \|^{2} } \right) && \because (1) \\ \ge& g\left( \left\| \sum_{i=1}^{m} \alpha_i \phi(x_i) \right\| \right) && \because (2) \end{aligned} $$
As seen, equality holds when $v = 0$, so $g$ is minimized by taking $v = 0$. Meanwhile, Part 2 confirmed that $v$ cannot influence $c$, so setting $v = 0$ costs nothing, and the function $f$ that minimizes $L = c + g$ can be represented in the form
$$ f(\cdot) = \sum_{i=1}^{m} \alpha_i k(\cdot, x_i) $$
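As a closing numerical check of Parts 2 and 3, under the same illustrative setup as the earlier sketch, dropping $v$ leaves every $f(x_i)$, and hence $c$, untouched, while the Pythagorean identity makes the norm, and hence the value of a monotone $g$, no larger.

```python
import numpy as np

# With f = s + v, where s = sum_i alpha_i k(., x_i) is the span part:
#   f(x_i) = s(x_i)                 -> c is unchanged when v is dropped (Part 2)
#   ||f||^2 = ||s||^2 + ||v||^2     -> ||s|| <= ||f||, so g(||s||) <= g(||f||) (Part 3)
# Norms are computed via ||sum_i c_i k(., p_i)||^2 = c^T K c.

def rbf_kernel(a, b, gamma=1.0):
    return np.exp(-gamma * (a[:, None] - b[None, :]) ** 2)

rng = np.random.default_rng(5)
x = rng.uniform(-3, 3, size=6)          # training inputs
z = rng.uniform(-3, 3, size=4)          # support of an arbitrary f in the RKHS
beta = rng.standard_normal(4)           # f = sum_j beta_j k(., z_j)

K_xx, K_zz, K_xz = rbf_kernel(x, x), rbf_kernel(z, z), rbf_kernel(x, z)
alpha = np.linalg.solve(K_xx, K_xz @ beta)   # span part s = sum_i alpha_i k(., x_i)

norm2_f = beta @ K_zz @ beta                                # ||f||^2
norm2_s = alpha @ K_xx @ alpha                              # ||s||^2
norm2_v = norm2_f - 2 * beta @ K_xz.T @ alpha + norm2_s     # ||f - s||^2

print(np.allclose(K_xz @ beta, K_xx @ alpha))   # f(x_i) == s(x_i): c unchanged
print(np.isclose(norm2_f, norm2_s + norm2_v))   # Pythagorean identity (1)
print(norm2_s <= norm2_f + 1e-12)               # hence g(||s||) <= g(||f||) by (2)
```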