Positive Definite Kernel and Reproducing Kernel Hilbert Space in Machine Learning
Definition[^1] [^2]
Let the input space $X \ne \emptyset$ be a nonempty set, take the codomain to be the set of complex numbers $\mathbb{C}$, and let $\left( H , \left< \cdot , \cdot \right> \right) \subset \mathbb{C}^{X}$ be a Hilbert space of functions $f: X \to \mathbb{C}$.
Reproducing Kernel Hilbert Space
- For a fixed datum $x \in X$, the functional $\delta_{x} : H \to \mathbb{C}$, which takes a function $f \in H$ and returns its value at $x$, is called the (Dirac) Evaluation Functional at $x$. $$ \delta_{x} (f) := f (x) $$
- If the evaluation functional $\delta_{x}$ is continuous for all $x \in X$, then $H$ is called a Reproducing Kernel Hilbert Space (RKHS) and is sometimes denoted as $H_{k}$.
- A function $k : X \times X \to \mathbb{C}$ is called the reproducing kernel of $H$ if it satisfies the following two conditions:
- (i) Representer: For all $x \in X$, $$ k \left( \cdot , x \right) \in H $$
- (ii) Reproducing Property: For all $x \in X$ and all $f \in H$, $$ \left< f , k \left( \cdot , x \right) \right> = f(x) = \delta_{x} (f) $$ In particular, for all $x_{1} , x_{2} \in X$, the following holds: $$ k \left( x_{1} , x_{2} \right) = \left< k \left( \cdot , x_{2} \right), k \left( \cdot , x_{1} \right) \right> $$
Positive Definite Kernel
- Let’s call a mapping $\phi : X \to H$ from the input space $X \ne \emptyset$ to the Hilbert space $\left( H , \left< \cdot , \cdot \right> \right)$ a feature map. In this context, $H$ is also referred to as the feature space.
- A function $k : X \times X \to \mathbb{C}$ defined by the inner product $\left< \cdot , \cdot \right> : H \times H \to \mathbb{C}$ in $\left( H , \left< \cdot , \cdot \right> \right)$ is called a kernel. $$ k \left( x_{1} , x_{2} \right) := \left< \phi \left( x_{1} \right) , \phi \left( x_{2} \right) \right> $$
- For $m$ data points $\left\{ x_{1} , \cdots , x_{m} \right\} \subset X$, the matrix $K \in \mathbb{C}^{m \times m}$ formed as follows is called the Gram Matrix of the kernel $k$. $$ K := \left( k \left( x_{i} , x_{j} \right) \right)_{ij} $$
- If the Gram Matrix of $k$ is positive semi-definite for every choice of data, then $k$ is called a positive definite kernel (by convention, only $\ge 0$ is required of the matrix). In other words, a kernel $k$ whose Gram Matrix satisfies the following for all $\left\{ c_{1} , \cdots , c_{m} \right\} \subset \mathbb{C}$ is a positive definite kernel. $$ \sum_{i=1}^{m} \sum_{j=1}^{m} c_{i} \bar{c_{j}} K_{ij} \ge 0 $$
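To make the definition concrete, here is a minimal sketch, assuming NumPy and the Gaussian kernel from the ‘Named Kernels’ section at the end of this post, that builds the Gram matrix for a few random points and checks that its eigenvalues are non-negative up to floating-point error:

```python
import numpy as np

def gaussian_kernel(x1, x2, sigma=1.0):
    # k(x1, x2) = exp(-||x1 - x2||^2 / (2 sigma^2)), a standard positive definite kernel
    return np.exp(-np.sum((x1 - x2) ** 2) / (2 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))      # m = 5 data points in R^3

# Gram matrix K_ij = k(x_i, x_j)
K = np.array([[gaussian_kernel(xi, xj) for xj in X] for xi in X])

# For a real symmetric Gram matrix, the condition above is equivalent to
# all eigenvalues being >= 0 (up to numerical tolerance).
eigvals = np.linalg.eigvalsh(K)
print(eigvals)
assert np.all(eigvals >= -1e-12)
```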
Explanation
The content is dense, but it has been simplified as much as possible, so let’s read through it carefully.
The Meaning of Hilbert Space in Data Science
A Hilbert space is a complete space in which an inner product is defined. In mathematics, an inner product is just a scalar-valued function of two arguments satisfying certain axioms, but in machine learning it can be thought of as a measure of similarity. In fact, the cosine similarity used to compare word frequencies between two documents is built on an inner product, and as another naive example, suppose we have three vectors $$ A := \left( 3, 0, 1 \right) \\ B := \left( 4, 1, 0 \right) \\ C := \left( 0, 2, 5 \right) $$ It’s evident that $A$ and $B$ are similar, and both are different from $C$. That is still only an intuitive inference, but quantifying it through inner products yields: $$ A \cdot B = 12 + 0 + 0 = 12 \\ A \cdot C = 0 + 0 + 5 = 5 \\ B \cdot C = 0 + 2 + 0 = 2 $$ Simply comparing the magnitudes of these inner products explains the data better than saying ‘it’s obvious’.
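A minimal sketch of the computation above, assuming NumPy; it reproduces the three dot products and, for reference, the cosine similarities mentioned earlier:

```python
import numpy as np

A = np.array([3, 0, 1])
B = np.array([4, 1, 0])
C = np.array([0, 2, 5])

def cosine_similarity(u, v):
    # The inner product normalized by the norms it induces
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

print(np.dot(A, B), np.dot(A, C), np.dot(B, C))   # 12 5 2
print(cosine_similarity(A, B), cosine_similarity(A, C), cosine_similarity(B, C))
```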
Note that there are no specific assumptions about the input space $X$. In real applications, we cannot guarantee how unwieldy the data we handle will be. For example, if $X$ consists of photos or documents, it makes no immediate sense to take inner products of photos or of documents.
- Q. If $X$ is a set of black and white photos, can’t we just consider the photos as matrices and take inner products based on pixel values?
- A. That would work, and that’s exactly what a feature map $\phi : X \to H$ does. In this case, $H$ becomes a space of functions defined on a rectangle $[a,b] \times [c,d]$.
- Thinking in this way, the existence of a kernel itself is almost like bringing ‘difficult to handle data’ into a space we are familiar with.
Even if the meaning of inner products described above doesn’t resonate, an inner product space is also a normed space and a metric space, so most of the structure we usually consider necessary holds automatically. An inner product $\left< \cdot , \cdot \right>$ induces a norm $$ \left\| f \right\| := \sqrt{ \left< f , f \right> } $$ and a norm $\left\| f \right\|$ induces a metric $$ d (f,g) = \left\| f - g \right\| $$ (see the sketch after the two bullet points below).
- From a data science perspective, a norm itself quantifies data. For example, if we define the norm of a black and white photo as the sum of all pixel values, this alone can roughly evaluate how bright or dark the photo is.
- From a data science perspective, a distance tells us how different two data points are. Distinguishing between right and wrong, similar and different, is undoubtedly important.
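Here is that sketch: a minimal example, assuming NumPy, in which plain vectors stand in for the functions $f$ and $g$ and everything is built from the inner product alone:

```python
import numpy as np

def inner(f, g):
    # <f, g>: the only primitive we assume
    return np.dot(f, g)

def norm(f):
    # ||f|| := sqrt(<f, f>), the norm induced by the inner product
    return np.sqrt(inner(f, f))

def dist(f, g):
    # d(f, g) := ||f - g||, the metric induced by the norm
    return norm(f - g)

f = np.array([3.0, 0.0, 1.0])
g = np.array([4.0, 1.0, 0.0])
print(norm(f), dist(f, g))   # sqrt(10) and sqrt(3)
```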
Apart from these reasons, sometimes in mathematical derivations, inner products become necessary. This post won’t cover all related examples as it would become too cluttered. Refer to the ‘Kernel Trick’ section of the ‘Support Vector Machine’ post.
Why a Function Space? Does it Have to be This Complicated?
Most applications of mathematics involve finding ’the function we want’.
- Interpolation finds a polynomial function that fills in between given data points.
- Statistical regression analysis is a technique for finding a line that best explains the data, which is a linear function.
- Deep learning approximates non-linear functions by incorporating activation functions because linear functions alone are insufficient.
- Fourier transform represents a function as a linear combination of trigonometric functions.
There are countless examples like these. Returning to machine learning, the reason we consider function spaces is that ultimately, what we are looking for is a function. Even if it’s not explicitly stated, we want a function that returns the desired result for a given input. For example, a function that returns the number shown in a photo of a digit, or one that calculates the probability of loan repayment from personal information. Such useful functions are unlikely to be simple, and we hope they can be represented as a sum of finitely many functions $\phi \left( x_{k} \right) (\cdot)$, which serve as building blocks. In particular, for $\phi (x) = k (\cdot , x)$, the proposition that such a function $f$ can always be found is precisely the Representer Theorem.
Representer Theorem: In a Reproducing Kernel Hilbert Space, the minimizer of a regularized empirical risk over the training data can be represented as a finite linear combination of the representers $k \left( \cdot , x_{i} \right)$ at the training points.
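This is what makes kernel methods computable: instead of searching an infinite-dimensional $H$, we only need the finitely many coefficients $\alpha_{i}$. Below is a minimal sketch, assuming NumPy and kernel ridge regression with the Gaussian kernel as one standard setting where the theorem applies (the post leans toward SVMs, but ridge regression keeps the algebra short); the names `fit_alpha` and `predict` are illustrative, not a library API:

```python
import numpy as np

def gaussian_kernel_matrix(A, B, sigma=1.0):
    # K_ij = exp(-||a_i - b_j||^2 / (2 sigma^2))
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma**2))

def fit_alpha(X, y, lam=1e-2, sigma=1.0):
    # Kernel ridge regression: minimize sum_i (f(x_i) - y_i)^2 + lam * ||f||_H^2.
    # By the representer theorem the minimizer is f = sum_i alpha_i k(., x_i),
    # and the coefficients solve (K + lam * I) alpha = y.
    K = gaussian_kernel_matrix(X, X, sigma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def predict(X_train, alpha, X_new, sigma=1.0):
    # f(x) = sum_i alpha_i k(x, x_i): a finite linear combination of representers
    return gaussian_kernel_matrix(X_new, X_train, sigma) @ alpha

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(30, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=30)

alpha = fit_alpha(X, y)
print(predict(X, alpha, np.array([[0.5]])))   # should be close to sin(0.5) ~ 0.479
```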
In summary, in machine learning (especially in the context of Support Vector Machines), what we seek is ultimately a function, so exploring the function space where they reside is inevitable.
Why is Dirac’s Name in Front of the Evaluation Functional?
$$ \delta_{x_{0}} (x) = \begin{cases} 1 & , \text{if } x = x_{0} \\ 0 & , \text{if } x \ne x_{0} \end{cases} $$ Originally, the Dirac delta function is informally described as a function that is nonzero at only one point. Regardless of its precise definition and usage, variations of it are typically named after Dirac when they single out one point in this way. To aid understanding, imagine two functions $f : \mathbb{R} \to \mathbb{R}$, $\delta_{x_{0}} : \mathbb{R} \to \mathbb{R}$, and their ‘inner product’ $$ \left< f, \delta_{x_{0}} \right> = \sum_{x \in \mathbb{R}} f (x) \delta_{x_{0}} (x) = f \left( x_{0} \right) $$ We would normally use integration rather than summation for an inner product of functions, and summing over all $x \in \mathbb{R}$ is dubious, but the concept aligns with the idea.
In this sense, $\delta_{x_{0}} (f)$ straightforwardly yields $f \left( x_{0} \right)$, ‘picking out a single point’ at $x_{0}$, with the discussion above hidden in the background.
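For functions on a finite grid, the analogy becomes an honest computation; a tiny sketch assuming NumPy, with a plain sum over grid points playing the role of the inner product:

```python
import numpy as np

grid = np.linspace(-1.0, 1.0, 201)       # a finite stand-in for R
f = grid ** 2                            # f(x) = x^2 sampled on the grid
x0_index = 150                           # the grid point playing the role of x_0

delta_x0 = np.zeros_like(grid)
delta_x0[x0_index] = 1.0                 # 1 at x_0, 0 everywhere else

# <f, delta_x0> = sum_x f(x) delta_x0(x) picks out the single value f(x_0)
print(np.sum(f * delta_x0), f[x0_index])  # the two numbers agree
```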
Why it’s Called Reproducing Property
The definition of RKHS is quite interesting. Usually, when mathematics names a ‘space of something’, it is defined as the space in which that ‘something’ lives, but an RKHS is defined as a Hilbert space in which ‘the evaluation functional is continuous at every point’, which seems to come out of nowhere.
Riesz Representation Theorem: Let $\left( H, \left\langle \cdot,\cdot \right\rangle \right)$ be a Hilbert space. For every continuous linear functional $f \in H^{ \ast }$, there exists a unique $\mathbf{w} \in H$ such that $f ( \mathbf{x} ) = \left\langle \mathbf{x} , \mathbf{w} \right\rangle$ for all $\mathbf{x} \in H$, and $\| f \|_{H^{\ast}} = \| \mathbf{w} \|_{H}$.
Moore-Aronszajn Theorem: For every positive definite kernel, there exists a unique RKHS whose reproducing kernel is that kernel.
According to this definition, the existence of a reproducing kernel in an RKHS is not self-evident and requires proof. In fact, the Riesz Representation Theorem guarantees that a reproducing kernel exists and is unique in every RKHS. Interestingly, the converse also holds: by the Moore-Aronszajn Theorem, a unique RKHS corresponds to every positive definite kernel.
Let’s delve into the formulas in the definition.
Originally, the function $k : X \times X \to \mathbb{C}$ takes two arguments $x_{1}, x_{2} \in X$, but if we fix the second argument $x$ as in the definition, $k$ essentially becomes $k : y \mapsto k (y,x)$, a function $k \left( \cdot , x \right) : X \to \mathbb{C}$. With one slot blocked off, the expression $$ \left< f , k \left( \cdot , x \right) \right> = f(x) $$ is simply the inner product of two functions $f (\cdot) : X \to \mathbb{C}$ and $k \left( \cdot , x \right): X \to \mathbb{C}$. There is no need to overcomplicate things by wondering ‘how does $f$ end up in an inner product with $x$?’. The left-hand side is an inner product of two elements of $H$, and $f(x) \in \mathbb{C}$ on the right is just some complex number, the codomain being the set of complex numbers.
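To see the reproducing property with actual numbers, take a kernel whose feature map can be written down explicitly: for $k \left( x_{1} , x_{2} \right) = \left< x_{1} , x_{2} \right>^{2}$ on $\mathbb{R}^{2}$, one valid feature map is $\phi (x) = \left( x_{1}^{2} , x_{2}^{2} , \sqrt{2} x_{1} x_{2} \right)$. A minimal sketch, assuming NumPy and real coefficients; $f$ is built as a finite kernel expansion and its feature-space representative is $w_{f} = \sum_{i} \alpha_{i} \phi \left( x_{i} \right)$:

```python
import numpy as np

def k(x, y):
    # Polynomial kernel of degree 2 with c = 0: k(x, y) = <x, y>^2
    return np.dot(x, y) ** 2

def phi(x):
    # An explicit feature map satisfying <phi(x), phi(y)> = k(x, y)
    return np.array([x[0]**2, x[1]**2, np.sqrt(2) * x[0] * x[1]])

rng = np.random.default_rng(1)
X = rng.normal(size=(4, 2))       # data x_1, ..., x_4
alpha = rng.normal(size=4)        # real coefficients
x = rng.normal(size=2)            # the evaluation point

# f = sum_i alpha_i k(., x_i), so evaluating directly gives f(x) = sum_i alpha_i k(x, x_i)
f_at_x = sum(a * k(x, xi) for a, xi in zip(alpha, X))

# In feature space, f corresponds to w_f = sum_i alpha_i phi(x_i),
# and <f, k(., x)> becomes the ordinary inner product <w_f, phi(x)>.
w_f = sum(a * phi(xi) for a, xi in zip(alpha, X))
inner_product = np.dot(w_f, phi(x))

print(f_at_x, inner_product)      # the two values coincide: <f, k(., x)> = f(x)
```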
Here, let’s discuss the naming of the reproducing property. The word ‘reproduction’ inherently carries the meaning of re- (again, 再) + produce (create, 生); its dictionary senses include regeneration, duplication, and playback. Reproduction in the sense of breeding clearly doesn’t fit, and copying doesn’t seem right either, since there is no original to copy.
However, if we think of taking the inner product of $f(\cdot)$ and $k (\cdot, x)$ to obtain $f(x)$ as ‘playing back’ the information contained in $f$ through the kernel, the name starts to make sense. Imagine a function $y(t)$ of time $t$ representing a YouTube video. We never watch $y$ itself; what we watch is the playback of its values $\left\{ y(t) : t \in [0,T] \right\}$. In this analogy, the kernel $k$ ‘reproduces’ not the function $f$ itself but its values, which justifies the term ‘reproducing kernel’.
Feature Map and Uncomfortable Notation
Looking closely at the definitions of kernel and reproducing kernel, we notice that neither is needed to define the other. A kernel is a kernel, a reproducing kernel is a reproducing kernel, and they coincide when the feature map is also the representer, i.e., $$ \phi (x) = k \left( \cdot , x \right) $$ A feature map transforms raw data into a form that is easier for us to handle, and saying that a function is represented by such functions means it is explained by features derived from the data. One issue is that, even with this intuition, notation like $k \left( \cdot , x \right)$ remains uncomfortable, and it is hard to see the motive for defining kernels separately from their inner products and feature maps.
Since a feature map is $\phi : X \to H$, its value at $x \in X$ is itself a function $X \to \mathbb{C}$. More precisely, $\phi (x)$ can be written as $$ \left( \phi (x) \right) (\cdot) = k \left( \cdot , x \right) $$ Then why use such inconvenient notation with the dot $\cdot$? The easiest way to see the point is an example where avoiding the notation causes more trouble. As mentioned earlier, whether we work with a kernel or a reproducing kernel, the space we consistently care about is the function space $H$, and the inner product in $H$ is between functions. First, suppose a function $f$ is represented by a linear combination of the data representers $\phi \left( x_{i} \right)$: $$ f (y) = \sum_{i=1}^{m} \alpha_{i} \phi \left( x_{i} \right) (y) = \sum_{i=1}^{m} \alpha_{i} \left( \phi \left( x_{i} \right) \right) (y) $$ This already looks messy. Considering another function $g$ and different data $\left\{ x'_{j} \right\}_{j=1}^{n}$, we get $$ g (y) = \sum_{j=1}^{n} \beta_{j} \left( \phi \left( x'_{j} \right) \right) (y) $$ Moreover, if we never take inner products, there is no point in working in an inner product space. Writing down $\left< f,g \right>$ gives us $$ \left< f,g \right> = \sum_{i=1}^{m} \sum_{j=1}^{n} \bar{\alpha_{i}} \beta_{j} \left< \phi \left( x_{i} \right) , \phi \left( x'_{j} \right) \right> $$ This is unnecessarily complicated. Before taking the inner product, we hardly ever need an actual $y \in X$ in the function space, and after taking the inner product, there is no need to keep writing $\phi$ and the inner product brackets. Seen this way, the notation $$ f (\cdot) = \sum_{i=1}^{m} \alpha_{i} k \left( \cdot , x_{i} \right) \\ g (\cdot) = \sum_{j=1}^{n} \beta_{j} k \left( \cdot , x'_{j} \right) \\ \left< f,g \right> = \sum_{i=1}^{m} \sum_{j=1}^{n} \bar{\alpha_{i}} \beta_{j} k \left( x_{i} , x'_{j} \right) $$ no longer seems so cumbersome.
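The payoff of the last display is that inner products in $H$ reduce to kernel evaluations on the data alone, with no $y \in X$ and no explicit $\phi$ in sight. A minimal sketch, assuming NumPy and a real Gaussian kernel so that the conjugates disappear:

```python
import numpy as np

def k(x, y, sigma=1.0):
    # Gaussian kernel on R^d
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

rng = np.random.default_rng(2)
X  = rng.normal(size=(3, 2))      # data x_1, ..., x_m defining f
Xp = rng.normal(size=(4, 2))      # data x'_1, ..., x'_n defining g
alpha = rng.normal(size=3)
beta  = rng.normal(size=4)

# f(.) = sum_i alpha_i k(., x_i) and g(.) = sum_j beta_j k(., x'_j):
# their inner product needs nothing but the cross Gram matrix k(x_i, x'_j).
K_cross = np.array([[k(xi, xj) for xj in Xp] for xi in X])
inner_fg = alpha @ K_cross @ beta
print(inner_fg)
```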
Reproducing Kernels are Positive Definite
Given data $\left\{ x_{i} \right\}_{i=1}^{m}$ and coefficients $\left\{ \alpha_{i} \right\}_{i=1}^{m} \subset \mathbb{C}$, if $k$ is a kernel, then the following holds: $$ \begin{align*} & \sum_{i=1}^{m} \sum_{j=1}^{m} \bar{\alpha_{i}} \alpha_{j} k \left( x_{i} , x_{j} \right) \\ =& \sum_{i=1}^{m} \sum_{j=1}^{m} \left< \alpha_{i} \phi \left( x_{i} \right) , \alpha_{j} \phi \left( x_{j} \right) \right> \\ =& \left< \sum_{i=1}^{m} \alpha_{i} \phi \left( x_{i} \right) , \sum_{j=1}^{m} \alpha_{j} \phi \left( x_{j} \right) \right> \\ =& \left\| \sum_{i=1}^{m} \alpha_{i} \phi \left( x_{i} \right) \right\|^{2} \\ \ge & 0 \end{align*} $$ As mentioned earlier, if we take $\phi : x \mapsto k (\cdot , x)$, the reproducing kernel $k$ is also a kernel and therefore positive definite. This positive definiteness naturally appears in many properties related to kernels.
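The same computation can be checked numerically with the simplest possible feature map, $\phi (x) = x$ (the linear kernel), and complex coefficients; a minimal sketch assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(5, 2))                              # data x_1, ..., x_m in R^2
alpha = rng.normal(size=5) + 1j * rng.normal(size=5)     # complex coefficients

# Linear kernel: k(x_i, x_j) = <phi(x_i), phi(x_j)> with phi(x) = x
K = X @ X.T

# Left-hand side: sum_ij conj(alpha_i) alpha_j k(x_i, x_j)
lhs = np.conj(alpha) @ K.astype(complex) @ alpha

# Right-hand side: || sum_i alpha_i phi(x_i) ||^2, which is manifestly >= 0
v = alpha @ X.astype(complex)
rhs = np.sum(np.abs(v) ** 2)

print(lhs.real, rhs)   # equal up to floating-point error; lhs.imag is ~0
```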
Kernels Outside of Functional Analysis
- (1) In general mathematics, a kernel often refers to the abstract algebra kernel $\ker$. For a structure $Y$ where $0$ is defined, the kernel $\ker f$ of a function $f : X \to Y$ is defined as $\ker f := f^{-1} \left( \left\{ 0 \right\} \right)$.
- (2) This concept is specialized in linear algebra as the kernel of a linear transformation.
If you ask someone who isn’t specialized in functional analysis about kernels, nine out of ten times, they’ll think of meaning (1). If your background is in mathematics, you should at least know about (1), and even if not, you should be familiar with (2).
Named Kernels
In the context of machine learning, the following kernels are well known[^3]. They might not look like kernels at first glance, but most of them can be derived from the fact that sums and products of kernels are again kernels; all four are implemented in the sketch after this list.
- Linear Kernel: $$ k \left( x_{1} , x_{2} \right) = \left< x_{1} , x_{2} \right> $$
- Polynomial Kernel: For $c \ge 0$ and $d \in \mathbb{N}$, $$ k \left( x_{1} , x_{2} \right) = \left( \left< x_{1} , x_{2} \right> + c \right) ^{d} $$
- Gaussian Kernel: For $\sigma^{2} > 0$, $$ k \left( x_{1} , x_{2} \right) = \exp \left( - {{ \left\| x_{1} - x_{2} \right\|^{2} } \over { 2 \sigma^{2} }} \right) $$
- Sigmoid Kernel: For $w, b \in \mathbb{C}$ (strictly speaking, this one fails to be positive definite for some choices of $w$ and $b$, but it is widely used in practice), $$ k \left( x_{1} , x_{2} \right) = \tanh \left( w \left< x_{1} , x_{2} \right> + b \right) $$
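Here is the promised sketch of the four kernels for real vector inputs, assuming NumPy; the parameter names (`c`, `d`, `sigma`, `w`, `b`) follow the formulas above and the default values are only illustrative:

```python
import numpy as np

def linear_kernel(x1, x2):
    return np.dot(x1, x2)

def polynomial_kernel(x1, x2, c=1.0, d=3):
    # (<x1, x2> + c)^d with c >= 0 and d a positive integer
    return (np.dot(x1, x2) + c) ** d

def gaussian_kernel(x1, x2, sigma=1.0):
    # exp(-||x1 - x2||^2 / (2 sigma^2))
    return np.exp(-np.sum((x1 - x2) ** 2) / (2 * sigma ** 2))

def sigmoid_kernel(x1, x2, w=1.0, b=0.0):
    # tanh(w <x1, x2> + b); not positive definite for every choice of w and b
    return np.tanh(w * np.dot(x1, x2) + b)

x1, x2 = np.array([1.0, 2.0]), np.array([0.5, -1.0])
for kernel in (linear_kernel, polynomial_kernel, gaussian_kernel, sigmoid_kernel):
    print(kernel.__name__, kernel(x1, x2))
```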
[^1]: Sejdinovic, Gretton. (2014). What is an RKHS?: p7~11. http://www.stats.ox.ac.uk/~sejdinov/teaching/atml14/Theory_2014.pdf
[^2]: Schölkopf. (2001). A Generalized Representer Theorem. https://link.springer.com/chapter/10.1007/3-540-44581-1_27
[^3]: Jakkula. (2006). Tutorial on Support Vector Machine (SVM). https://course.ccs.neu.edu/cs5100f11/resources/jakkula.pdf