
Positive Definite Kernel and Reproducing Kernel Hilbert Space in Machine Learning

Definition [1] [2]

Let the input space $X \ne \emptyset$ be the domain and the set of complex numbers $\mathbb{C}$ be the codomain, and let $\left( H , \left< \cdot , \cdot \right> \right) \subset \mathbb{C}^{X}$ denote a Hilbert space of functions $f: X \to \mathbb{C}$.

Reproducing Kernel Hilbert Space

  1. For a fixed datum $x \in X$, the functional $\delta_{x} : H \to \mathbb{C}$, which takes a function $f \in H$ and returns its value at $x$, is called the (Dirac) Evaluation Functional at $x$: $$\delta_{x} (f) := f (x)$$
  2. If the evaluation functional $\delta_{x}$ is continuous for all $x \in X$, then $H$ is called a Reproducing Kernel Hilbert Space (RKHS) and is sometimes denoted as $H_{k}$.
  3. A function $k : X \times X \to \mathbb{C}$ is called the reproducing kernel of $H$ if it satisfies the following two conditions:
    • (i) Representer: For all $x \in X$, $$k \left( \cdot , x \right) \in H$$
    • (ii) Reproducing Property: For all $x \in X$ and all $f \in H$, $$\left< f , k \left( \cdot , x \right) \right> = f(x) = \delta_{x} (f)$$ In particular, for all $x_{1} , x_{2} \in X$, the following holds: $$k \left( x_{1} , x_{2} \right) = \left< k \left( \cdot , x_{2} \right), k \left( \cdot , x_{1} \right) \right>$$

Positive Definite Kernel

  1. A mapping $\phi : X \to H$ from the input space $X \ne \emptyset$ to the Hilbert space $\left( H , \left< \cdot , \cdot \right> \right)$ is called a feature map. In this context, $H$ is also referred to as the feature space.
  2. A function $k : X \times X \to \mathbb{C}$ defined through the inner product $\left< \cdot , \cdot \right> : H \times H \to \mathbb{C}$ of $\left( H , \left< \cdot , \cdot \right> \right)$ is called a kernel: $$k \left( x_{1} , x_{2} \right) := \left< \phi \left( x_{1} \right) , \phi \left( x_{2} \right) \right>$$
  3. For $m$ data points $\left\{ x_{1} , \cdots , x_{m} \right\} \subset X$, the matrix $K \in \mathbb{C}^{m \times m}$ formed as follows is called the Gram Matrix of the kernel $k$: $$K := \left( k \left( x_{i} , x_{j} \right) \right)_{ij}$$
  4. If the Gram Matrix of $k$ is a positive definite matrix, then $k$ is called a positive definite kernel. In other words, a kernel $k$ whose Gram Matrix satisfies the following for all $\left\{ c_{1} , \cdots , c_{m} \right\} \subset \mathbb{C}$ is a positive definite kernel (see the short sketch after this list): $$\sum_{i=1}^{m} \sum_{j=1}^{m} c_{i} \bar{c_{j}} K_{ij} \ge 0$$
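For instance, here is a minimal sketch (in Python with NumPy, on made-up toy data) of how a kernel, a feature map, and a Gram Matrix fit together. It uses the fact that the degree-2 homogeneous polynomial kernel $k(x, y) = \left< x, y \right>^{2}$ on $\mathbb{R}^{2}$ has the explicit feature map $\phi(x) = \left( x_{1}^{2}, \sqrt{2} x_{1} x_{2}, x_{2}^{2} \right)$, so the Gram Matrix can be computed either way.

```python
import numpy as np

# Hypothetical toy data in R^2 (illustration only).
X = np.array([[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]])

def phi(x):
    """Explicit feature map for the degree-2 homogeneous polynomial kernel on R^2:
    phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2)."""
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

def k(x, y):
    """Kernel computed directly in the input space: k(x, y) = (<x, y>)^2."""
    return np.dot(x, y) ** 2

# Gram matrix K_ij = k(x_i, x_j), computed from kernel values alone.
K = np.array([[k(xi, xj) for xj in X] for xi in X])

# The same Gram matrix via the feature map: <phi(x_i), phi(x_j)>.
Phi = np.array([phi(x) for x in X])
K_feature = Phi @ Phi.T

print(np.allclose(K, K_feature))  # True: the kernel is the inner product in feature space
```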

Explanation

Although the content is complex, let’s read it carefully as it’s been simplified as much as possible.

The Meaning of Hilbert Space in Data Science

  • A Hilbert space is a complete space where an inner product is defined. In mathematics, an inner product is simply a two-variable scalar function that satisfies certain conditions, but in machine learning, it can be thought of as a measure of similarity. In fact, the cosine similarity used to compare word frequencies between two documents also uses an inner product. As another naive example, given the three vectors $$A := \left( 3, 0, 1 \right), \quad B := \left( 4, 1, 0 \right), \quad C := \left( 0, 2, 5 \right)$$ it is evident that $A$ and $B$ are similar, and that both are different from $C$. This is still only an intuitive inference, but quantifying it through inner products yields $$A \cdot B = 12 + 0 + 0 = 12, \quad A \cdot C = 0 + 0 + 5 = 5, \quad B \cdot C = 0 + 2 + 0 = 2$$ Simply comparing the magnitudes of these inner products explains the data better than just saying 'it's obvious'. (A short numerical sketch follows this list.)

  • Note that there are no specific assumptions about the input space $X$. In real applications, we cannot guarantee what kind of bad data we will handle. For example, if $X$ represents photos or document data, it does not make sense to take inner products of photos or documents.

    • Q. If $X$ is a set of black and white photos, can't we just consider the photos as matrices and take inner products based on pixel values?
    • A. That would work, and that's exactly what a feature map $\phi : X \to H$ does. In this case, $H$ becomes a space of functions defined on a rectangle $[a,b] \times [c,d]$.
    • Thinking in this way, the existence of a kernel itself is almost like bringing ‘difficult to handle data’ into a space we are familiar with.
  • Even if the meaning of inner products mentioned above doesn't resonate, since an inner product space is a normed space and a metric space, most assumptions we consider logically necessary hold. An inner product $\left< \cdot , \cdot \right>$ induces a norm $$\left\| f \right\| := \sqrt{ \left< f , f \right> }$$ and a norm $\left\| f \right\|$ induces a metric $$d (f,g) = \left\| f - g \right\|$$

    • From a data science perspective, a norm itself quantifies data. For example, if we define the norm of a black and white photo as the sum of all pixel values, this alone can roughly evaluate how bright or dark the photo is.
    • From a data science perspective, a distance tells us how different two data points are. Distinguishing between right and wrong, similar and different, is undoubtedly important.
  • Apart from these reasons, sometimes in mathematical derivations, inner products become necessary. This post won’t cover all related examples as it would become too cluttered. Refer to the ‘Kernel Trick’ section of the ‘Support Vector Machine’ post.
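The following is a minimal sketch (Python with NumPy, using the toy vectors $A$, $B$, $C$ from above) of the inner product as similarity, together with the induced norm and metric mentioned in this list.

```python
import numpy as np

A = np.array([3.0, 0.0, 1.0])
B = np.array([4.0, 1.0, 0.0])
C = np.array([0.0, 2.0, 5.0])

# Inner products as raw similarity scores.
print(np.dot(A, B), np.dot(A, C), np.dot(B, C))   # 12.0 5.0 2.0

# Cosine similarity normalizes away the magnitudes.
def cos_sim(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

print(cos_sim(A, B), cos_sim(A, C), cos_sim(B, C))

# The induced norm and metric from the bullet above.
print(np.linalg.norm(A))          # ||A|| = sqrt(<A, A>)
print(np.linalg.norm(A - C))      # d(A, C) = ||A - C||
```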

Why a Function Space? Does it Have to be This Complicated?

Most applications of mathematics involve finding ’the function we want’.

There are countless examples of this. Returning to machine learning, the reason we consider function spaces is that ultimately, what we are looking for is a function. Even if it's not explicitly stated, we want a function that returns the desired result for a given input: for example, a function that returns the digit shown in a photo, or one that computes the probability of loan repayment from personal information. Such useful functions are unlikely to be simple, and we hope they can be represented as a sum of finitely many $\phi_{k} (x) (\cdot)$, which serve as basis functions. In particular, for $\phi (x) = k (\cdot , x)$, the proposition that such a function $f$ can be found is precisely the Representer Theorem.

Representer Theorem: In a Reproducing Kernel Hilbert Space, any function fitted to the training data can be represented as a finite linear combination of representers.

In summary, in machine learning (especially in the context of Support Vector Machines), what we seek is ultimately a function, so exploring the function space where they reside is inevitable.
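As a concrete illustration of the representer theorem, the following is a minimal sketch (Python with NumPy, on made-up one-dimensional data, with a Gaussian kernel whose bandwidth and regularization parameter are arbitrary) of kernel ridge regression, whose fitted function has exactly the form $f(\cdot) = \sum_{i} \alpha_{i} k(\cdot, x_{i})$.

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.uniform(-3, 3, size=20)                 # toy 1-D inputs
y_train = np.sin(X_train) + 0.1 * rng.normal(size=20)  # noisy targets

def k(x1, x2, sigma2=0.5):
    """Gaussian kernel on R (the bandwidth sigma2 is an arbitrary choice)."""
    return np.exp(-(x1 - x2) ** 2 / (2 * sigma2))

# Kernel ridge regression: alpha = (K + lam * I)^{-1} y,
# so the fitted function is f(.) = sum_i alpha_i k(., x_i).
K = k(X_train[:, None], X_train[None, :])
lam = 1e-3
alpha = np.linalg.solve(K + lam * np.eye(len(X_train)), y_train)

def f(x):
    """The fitted function: a finite linear combination of representers k(., x_i)."""
    return np.sum(alpha * k(x, X_train))

print(f(0.5), np.sin(0.5))  # the fit should be close to the underlying sin curve
```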

Why is Dirac’s Name in Front of the Evaluation Functional?

$$\delta_{x_{0}} (x) = \begin{cases} 1 & , \text{if } x = x_{0} \\ 0 & , \text{if } x \ne x_{0} \end{cases}$$ The Dirac delta function is originally known as a function that has a nonzero value at only one point. Regardless of its precise definition and usage, its variations are typically named after Dirac if they are nonzero at only one point. To aid understanding, imagine two functions $f : \mathbb{R} \to \mathbb{R}$ and $\delta_{x_{0}} : \mathbb{R} \to \mathbb{R}$, and their inner product as $$\left< f, \delta_{x_{0}} \right> = \sum_{x \in \mathbb{R}} f (x) \delta_{x_{0}} (x) = f \left( x_{0} \right)$$ Though we normally use integration rather than summation for the inner product of functions, and summing over all $x \in \mathbb{R}$ is not rigorous, this conveys the idea.

In this sense, $\delta_{x_{0}} (f)$ straightforwardly yields $f \left( x_{0} \right)$, 'picking out a single point' at $x_{0}$, with the above discussion hidden behind the notation.

Why it’s Called Reproducing Property

The definition of RKHS is quite interesting. Usually, when we name 'some space' in mathematics, we define it as the space in which that 'something' exists, but an RKHS is defined as a Hilbert space in which 'the evaluation functional is continuous at every point', which seems to come out of nowhere.

Riesz Representation Theorem: Let $\left( H, \left< \cdot,\cdot \right> \right)$ be a Hilbert space. For every linear functional $f \in H^{ \ast }$ there exists a unique $\mathbf{w} \in H$ such that $f ( \mathbf{x} ) = \left< \mathbf{x} , \mathbf{w} \right>$ for all $\mathbf{x} \in H$ and $\| f \|_{H^{\ast}} = \| \mathbf{w} \|_{H}$.

Moore-Aronszajn Theorem: For every positive definite kernel, there exists a unique RKHS corresponding to it.

According to this definition, the existence of a reproducing kernel in RKHS is not self-evident and requires proof. In fact, the Riesz Representation Theorem guarantees the unique existence of a reproducing kernel in RKHS. Interestingly, conversely, an RKHS corresponding to a reproducing kernel also uniquely exists.

Let’s delve into the formulas in the definition.

Originally, the function $k : X \times X \to \mathbb{C}$ takes two arguments $x_{1}, x_{2} \in X$, but if we fix $x$ as in the definition, $k$ essentially becomes $k(\cdot, x) : y \mapsto k (y,x)$, a function from $X$ to $\mathbb{C}$. With one input fixed, $$\left< f , k \left( \cdot , x \right) \right> = f(x)$$ is simply the inner product of two functions $f (\cdot) : X \to \mathbb{C}$ and $k \left( \cdot , x \right): X \to \mathbb{C}$. There is no need to overcomplicate things by wondering, 'How does $f$ come out, and how is the inner product with $x$ taken...?' Since $f(x) \in \mathbb{C}$ is just the result of an inner product, it is simply some complex number, the codomain being the set of complex numbers.

Here, let's discuss the naming of the reproducing property. The word 'reproduction' inherently carries the meaning of re- (again, 再) plus produce (create, 生); it can be translated as regeneration/playback, as copying/duplication, or as breeding. Reproduction in the sense of breeding doesn't fit, and copying doesn't seem right either, as there is no original to speak of.

However, if we consider that taking the inner product of $f(\cdot)$ and $k (\cdot, x)$ to get $f(x)$ is 'reproducing' the information contained in $f$ through the kernel, then wouldn't it make sense? Imagine a function $y(t)$ dependent on time $t$, representing a YouTube video. We don't see $y$ itself but the reproduction (playback) of $\left\{ y(t) : t \in [0,T] \right\}$. In this analogy, the kernel $k$ 'reproduces' from $f$ not the function itself but its values, justifying the term 'reproducing kernel'.

Feature Map and Uncomfortable Notation

Looking closely at the definitions of kernel and reproducing kernel, we notice that neither requires the other for its definition. A kernel is a kernel, and a reproducing kernel is a reproducing kernel; they coincide when the feature map is also the representer, i.e., $$\phi (x) = k \left( \cdot , x \right)$$ A feature map transforms original data into a form that is easier for us to handle, and saying that a function is represented by such functions means it is explained by certain features derived from the data. One issue is that even if one intuitively understands up to this point, notation like $k \left( \cdot , x \right)$ remains uncomfortable, and it is hard to sympathize with the motive for defining kernels in terms of inner products and feature maps at all.

Since a feature map is $\phi : X \to H$, its value at $x \in X$ is some function $\lambda : X \to \mathbb{C}$, which usually isn't confusing. More precisely, $\phi (x)$ can be written as $$\left( \phi (x) \right) (\cdot) = k \left( \cdot , x \right)$$ Then why use such inconvenient notation with the dot $\cdot$? Most people find it easier to understand through examples where not using such notation causes more trouble. As mentioned earlier, whether it's a kernel or a reproducing kernel, the space we consistently care about is the function space $H$, and the inner product in $H$ is between functions. First, suppose a function $f$ is represented by a linear combination of data representers $\phi \left( x_{i} \right)$: $$f (y) = \sum_{i=1}^{m} \alpha_{i} \phi \left( x_{i} \right) (y) = \sum_{i=1}^{m} \alpha_{i} \left( \phi \left( x_{i} \right) \right) (y)$$ This already looks messy. Considering another function $g$ and different data $\left\{ x'_{j} \right\}_{j=1}^{n}$, we get $$g (y) = \sum_{j=1}^{n} \beta_{j} \left( \phi \left( x'_{j} \right) \right) (y)$$ Moreover, if we're not using inner products, there's no point in considering an inner product space. Writing down $\left< f,g \right>$ gives us $$\left< f,g \right> = \sum_{i=1}^{m} \sum_{j=1}^{n} \bar{\alpha_{i}} \beta_{j} \left< \phi \left( x_{i} \right) , \phi \left( x'_{j} \right) \right>$$ This is unnecessarily complicated. Before taking the inner product, we hardly need to deal with actual $y \in X$ in the function space, and after taking the inner product, there's no need to keep writing $\phi$ and the inner product. Seen in this light, the notation $$f (\cdot) = \sum_{i=1}^{m} \alpha_{i} k \left( \cdot , x_{i} \right) \\ g (\cdot) = \sum_{j=1}^{n} \beta_{j} k \left( \cdot , x'_{j} \right) \\ \left< f,g \right> = \sum_{i=1}^{m} \sum_{j=1}^{n} \bar{\alpha_{i}} \beta_{j} k \left( x_{i} , x'_{j} \right)$$ might not seem so cumbersome.
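The following is a minimal sketch (Python with NumPy, using the linear kernel so that the feature map is just the identity on $\mathbb{R}^{3}$, and real toy coefficients so the conjugate bars can be ignored) checking that $\left< f, g \right>$ computed in the feature space agrees with the double sum of kernel values above.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data and real coefficients (illustration only).
X1 = rng.normal(size=(4, 3))     # data x_1, ..., x_m in R^3
X2 = rng.normal(size=(5, 3))     # data x'_1, ..., x'_n in R^3
alpha = rng.normal(size=4)
beta = rng.normal(size=5)

# Linear kernel: the feature map is the identity, phi(x) = x,
# so f and g correspond to ordinary vectors in H = R^3.
f_vec = alpha @ X1               # f "=" sum_i alpha_i phi(x_i)
g_vec = beta @ X2                # g "=" sum_j beta_j  phi(x'_j)

# Inner product computed in the feature space ...
lhs = np.dot(f_vec, g_vec)

# ... and the same number computed purely from kernel values k(x_i, x'_j).
K = X1 @ X2.T                    # K_ij = <x_i, x'_j>
rhs = alpha @ K @ beta

print(np.isclose(lhs, rhs))      # True: <f, g> = sum_ij alpha_i beta_j k(x_i, x'_j)
```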

Reproducing Kernels are Positive Definite

Given data $\left\{ x_{k} \right\}_{k=1}^{m}$, if $k$ is a kernel, then the following holds: $$\begin{align*} & \sum_{i=1}^{m} \sum_{j=1}^{m} \bar{\alpha_{i}} \alpha_{j} k \left( x_{i} , x_{j} \right) \\ =& \sum_{i=1}^{m} \sum_{j=1}^{m} \left< \alpha_{i} \phi \left( x_{i} \right) , \alpha_{j} \phi \left( x_{j} \right) \right> \\ =& \left< \sum_{i=1}^{m} \alpha_{i} \phi \left( x_{i} \right) , \sum_{j=1}^{m} \alpha_{j} \phi \left( x_{j} \right) \right> \\ =& \left\| \sum_{i=1}^{m} \alpha_{i} \phi \left( x_{i} \right) \right\|^{2} \\ \ge & 0 \end{align*}$$ As mentioned earlier, if we take $\phi : x \mapsto k (\cdot , x)$, the reproducing kernel $k$ is also a kernel and thus positive definite. This positive definiteness of kernels naturally appears in various properties related to kernels.
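Below is a minimal numerical sanity check (Python with NumPy, using a Gaussian kernel on made-up data; the bandwidth is arbitrary) that the Gram matrix behaves as the derivation above says: the quadratic form is nonnegative, equivalently all eigenvalues are nonnegative up to floating-point error.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(6, 2))              # toy data x_1, ..., x_m in R^2

def gaussian_kernel(x1, x2, sigma2=1.0):
    """Gaussian kernel (the bandwidth sigma2 is an arbitrary choice)."""
    return np.exp(-np.sum((x1 - x2) ** 2) / (2 * sigma2))

# Gram matrix K_ij = k(x_i, x_j).
K = np.array([[gaussian_kernel(xi, xj) for xj in X] for xi in X])

# The quadratic form sum_ij conj(c_i) c_j K_ij is real and nonnegative
# for any complex coefficients c.
c = rng.normal(size=6) + 1j * rng.normal(size=6)
quad = np.real(np.conj(c) @ K @ c)
print(quad >= 0)                                     # True

# Equivalently, all eigenvalues of the Hermitian Gram matrix are >= 0
# up to numerical error.
print(np.min(np.linalg.eigvalsh(K)) >= -1e-10)       # True
```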

Kernels Outside of Functional Analysis

If you ask someone who isn’t specialized in functional analysis about kernels, nine out of ten times, they’ll think of meaning (1). If your background is in mathematics, you should at least know about (1), and even if not, you should be familiar with (2).

Named Kernels

In the context of machine learning, the following kernels are known [3]. These might not seem like kernels at first glance, but they can be derived from the fact that sums and products of kernels are again kernels. (A minimal implementation sketch follows the list.)

  1. Linear Kernel: $$k \left( x_{1} , x_{2} \right) = \left< x_{1} , x_{2} \right>$$
  2. Polynomial Kernel: For $c \ge 0$ and $d \in \mathbb{N}$, $$k \left( x_{1} , x_{2} \right) = \left( \left< x_{1} , x_{2} \right> + c \right) ^{d}$$
  3. Gaussian Kernel: For $\sigma^{2} > 0$, $$k \left( x_{1} , x_{2} \right) = \exp \left( - {{ \left\| x_{1} - x_{2} \right\|^{2} } \over { 2 \sigma^{2} }} \right)$$
  4. Sigmoid Kernel: For $w, b \in \mathbb{C}$, $$k \left( x_{1} , x_{2} \right) = \tanh \left( w \left< x_{1} , x_{2} \right> + b \right)$$
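For reference, here is a minimal sketch of these four kernels in Python with NumPy for real-valued vectors; the parameter values ($c$, $d$, $\sigma^{2}$, $w$, $b$) used in the demo call are arbitrary.

```python
import numpy as np

# Minimal implementations of the kernels listed above, for real vectors x1, x2.

def linear_kernel(x1, x2):
    return np.dot(x1, x2)

def polynomial_kernel(x1, x2, c=1.0, d=2):
    return (np.dot(x1, x2) + c) ** d

def gaussian_kernel(x1, x2, sigma2=1.0):
    return np.exp(-np.sum((x1 - x2) ** 2) / (2 * sigma2))

def sigmoid_kernel(x1, x2, w=0.1, b=-1.0):
    return np.tanh(w * np.dot(x1, x2) + b)

x1 = np.array([1.0, 2.0])
x2 = np.array([0.5, -1.0])
for name, kern in [("linear", linear_kernel), ("polynomial", polynomial_kernel),
                   ("gaussian", gaussian_kernel), ("sigmoid", sigmoid_kernel)]:
    print(name, kern(x1, x2))
```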

  1. Sejdinovic, Gretton. (2014). What is an RKHS?: pp. 7-11. http://www.stats.ox.ac.uk/~sejdinov/teaching/atml14/Theory_2014.pdf

  2. Schölkopf. (2001). A Generalized Representer Theorem. https://link.springer.com/chapter/10.1007/3-540-44581-1_27

  3. Jakkula. (2006). Tutorial on Support Vector Machine (SVM). https://course.ccs.neu.edu/cs5100f11/resources/jakkula.pdf