Support Vector Machine
Model [1]
Simple Definition
A Support Vector Machine is a method for finding the line or plane that best separates binarily classifiable data.
Complex Definition
For an inner product space $X = \mathbb{R}^{p}$ and label set $Y = \left\{ -1, +1 \right\}$, let’s denote the Training Dataset consisting of $n$ data points as $D = \left\{ \left( \mathbf{x}_{k} , y_{k} \right) \right\}_{k=1}^{n} \subset X \times Y$, and $$ \begin{align*} X^{+} :=& \left\{ \mathbf{x}_{k} \in X : y_{k} = +1 \right\} \\ X^{-} :=& \left\{ \mathbf{x}_{k} \in X : y_{k} = -1 \right\} \end{align*} $$
Suppose the hyperplane created by a linear function $f \left( \mathbf{x} \right) = \mathbf{w}^{T} \mathbf{x} + b$, with some weight $\mathbf{w} \in \mathbb{R}^{p}$ and bias $b \in \mathbb{R}$, is $H : \mathbf{w}^{T} \mathbf{x} + b = 0$. The points $\mathbf{x}^{+} \in X^{+}$ and $\mathbf{x}^{-} \in X^{-}$ closest to $H$ are called Support Vectors, and the distance $\delta$ between them is called the Margin. The [machine learning](../../categories/Machine Learning) technique that finds the $\mathbf{w} , b$ maximizing the margin while satisfying $$ \begin{align*} f \left( \mathbf{x}^{+} \right) =& +1 \\ f \left( \mathbf{x}^{-} \right) =& -1 \end{align*} $$
is called Support Vector Machine (SVM).
- $\mathbb{R}$ is the set of real numbers, and $\mathbb{R}^{p}$ is $p$-dimensional Euclidean space.
- $X \times Y$ denotes the Cartesian product of two sets.
- $\mathbf{w}^{T}$ is the transpose matrix of $\mathbf{w}$, and $\mathbf{w}^{T} \mathbf{x}$ is the inner product $\left< \mathbf{w} , \mathbf{x} \right>$ of two vectors $\mathbf{w}, \mathbf{x}$.
Explanation
Simply put, it’s like finding a line or plane that divides orange and sky-blue data as shown in the next picture. The red arrows in the picture correspond to the support vectors.
In the pictures, a line was found in $2$ dimensions and a plane in $3$ dimensions; for general $p$ dimensions a hyperplane must be found, which becomes hard to draw. However, the idea of dividing the space into two remains the same. Once binary classification has been carried out on the training dataset, new data can be classified using $f$ as a linear classifier.
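For instance, once some $\mathbf{w}$ and $b$ are in hand, classifying a new point amounts to checking the sign of $f$. A minimal sketch in Julia, where `w`, `b`, and `x_new` are hypothetical placeholder values rather than the output of an actual training run:

```julia
using LinearAlgebra

# Hypothetical fitted parameters (placeholders, not real training output).
w = [1.0, -2.0]    # weight vector w ∈ ℝᵖ
b = 0.5            # bias

# Linear classifier: label a point by the sign of f(x) = wᵀx + b.
f(x) = dot(w, x) + b
classify(x) = f(x) >= 0 ? +1 : -1

x_new = [0.3, -0.7]
classify(x_new)    # returns +1 or -1
```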
Obviously, even for the same binary classification data, the split on the left is better than the one on the right, since the margin toward the sky-blue data on the right is excessive. Exactly how this is computed is not something you need to know in practice, since packages handle it.
For undergraduate students, understanding the simple definition and grasping the concept through the pictures is enough for future use or for recognizing the terminology. For slightly more advanced material, practical summaries, and Python example code, there are plenty of well-organized posts on Korean-language blogs. [2] [3] [4]
Inner Product Space
As you can see, SVM itself is not conceptually complicated; the mathematical definitions and equations are introduced because of the extensive theoretical discussion that follows.
The Euclidean space $\mathbb{R}^{p}$ is naturally a vector space and an inner product space, and since every inner product space is a metric space, it is a metric space as well. This is worth emphasizing because, for real-world data, assuming an inner product space is by no means a trivial assumption. For images, documents, or molecular structures, for example, it is immediately questionable whether they can be fed directly into an SVM. The definition implicitly relies on phrases like ‘close in distance’ and on a linear function $f$ built from inner products of vectors, and none of this should be taken for granted as theory approaches reality.
Support Vectors
In geometric problems, objects on the boundary are typically called supports. For example, in the Minimum Enclosing Disk problem, the points lying on the circumference that determine the circle are called supports. Likewise, the support vectors $\mathbf{x}^{+}, \mathbf{x}^{-}$ of an SVM lie on the boundaries of the sets $X^{+}, X^{-}$, each positioned $\delta/2$ away from $H$.
There’s no guarantee that the support vectors of $X^{+}$ and $X^{-}$ are unique, but uniqueness does not matter for the discussion ahead, so let’s assume without loss of generality that they are. No data lies strictly inside the margin around $H$, and $$ f \left( \mathbf{x} \right) \begin{cases} \ge +1 & , \text{if } \mathbf{x} \in X^{+} \\ \le -1 & , \text{if } \mathbf{x} \in X^{-} \end{cases} $$ thus $\left| f \left( \mathbf{x}_{k} \right) \right| \ge 1$ must hold for all $\left\{ \mathbf{x}_{k} \right\}_{k=1}^{n}$.
Maximizing the Margin
Since the support vectors are the points closest to $H$, each lies at distance $\delta/2$ from $H$ along the direction of $\mathbf{w}$, which is perpendicular to $H$. This distance is the same for $\mathbf{x}^{+}$ and $\mathbf{x}^{-}$, and both being $\delta/2$ away from the hyperplane $H$ means the displacement between the two support vectors is $$ \mathbf{x}^{+} - \mathbf{x}^{-} = \delta {{ \mathbf{w} } \over { \left\| \mathbf{w} \right\|_{2} }} $$ where operations like $\mathbf{x}^{+} - \mathbf{x}^{-}$ are permitted under the assumption that $X$ is a vector space. Taking the inner product of both sides with $\mathbf{w}$, in other words multiplying by $\mathbf{w}^{T}$ on the left, gives $$ \begin{align*} & \mathbf{x}^{+} - \mathbf{x}^{-} = \delta {{ \mathbf{w} } \over { \left\| \mathbf{w} \right\|_{2} }} \\ \implies & \mathbf{w}^{T} \mathbf{x}^{+} - \mathbf{w}^{T} \mathbf{x}^{-} = \delta {{ \mathbf{w}^{T} \mathbf{w} } \over { \left\| \mathbf{w} \right\|_{2} }} \\ \implies & \left( \mathbf{w}^{T} \mathbf{x}^{+} + b \right) - \left( \mathbf{w}^{T} \mathbf{x}^{-} + b \right) = \delta \left\| \mathbf{w} \right\|_{2} \\ \implies & +1 - (-1) = \delta \left\| \mathbf{w} \right\|_{2} \\ \implies & \delta = {{ 2 } \over { \left\| \mathbf{w} \right\|_{2} }} \end{align*} $$ by the definition of $f$. Therefore, maximizing the margin is equivalent to minimizing $\left\| \mathbf{w} \right\|_{2}$, or equivalently the objective function $\left\| \mathbf{w} \right\|_{2}^{2} / 2$, and in summary, SVM is an optimizer that solves the following optimization problem. $$ \begin{matrix} \text{Minimize} & {{ 1 } \over { 2 }} \left\| \mathbf{w} \right\|_{2}^{2} \\ \text{subject to} & \left| f \left( \mathbf{x}_{k} \right) \right| \ge 1 \end{matrix} \\ k = 1, \cdots , n $$
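Since this is just a quadratic objective with linear constraints, it can be handed to a general-purpose solver. Below is a minimal sketch of the primal problem, assuming the JuMP and Ipopt packages are available and using a small synthetic, linearly separable dataset; the constraint appears in the equivalent form $y_{k} f \left( \mathbf{x}_{k} \right) \ge 1$, derived explicitly in the Kernel Trick section below. Dedicated SVM libraries typically solve the dual instead, so this is only an illustration of the optimization problem itself.

```julia
using JuMP, Ipopt

# Synthetic, (almost surely) linearly separable toy data: p × n matrix X, labels y ∈ {-1, +1}.
X = hcat(0.5 .* randn(2, 20) .- 2, 0.5 .* randn(2, 20) .+ 2)
y = vcat(fill(-1, 20), fill(+1, 20))
p, n = size(X)

model = Model(Ipopt.Optimizer)
@variable(model, w[1:p])
@variable(model, b)

# Minimize ½‖w‖² subject to yₖ(wᵀxₖ + b) ≥ 1 for every k.
@objective(model, Min, 0.5 * sum(w[i]^2 for i in 1:p))
@constraint(model, [k = 1:n], y[k] * (sum(w[i] * X[i, k] for i in 1:p) + b) >= 1)

optimize!(model)
ŵ, b̂ = value.(w), value(b)
```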
Derived Models
As the complex definition makes clear, SVM finds a linear function, whether a line or a hyperplane, and naturally that alone will not always be satisfactory.
Soft Margin SVM
Consider data like the following. SVM cannot perfectly binary classify it due to the mixed data in the middle.
Recall that the condition $\left| f \left( \mathbf{x}_{k} \right) \right| \ge 1$ had to hold, that is, no data was allowed inside the margin between the support vectors. If we allow the right-hand side of this inequality to drop below $1$, we can still obtain a useful result rather than giving up on binary classification altogether, even if it is no longer perfect. Denoting this allowance for each data point by $\xi_{k} \ge 0$, we get the new constraint $\left| f \left( \mathbf{x}_{k} \right) \right| \ge 1 - \xi_{k}$. This weakened margin is called a Soft Margin.
Although the constraint has been relaxed, loosening it completely, say to $\xi_{1} = \cdots = \xi_{n} = 1$, would negate the SVM altogether. To prevent this, a term like $\sum_{k} \xi_{k}$ can be added to the objective function as a penalty for the violations that make the otherwise impossible binary classification possible. Of course, such a bare penalty could be meaningless or overly sensitive depending on the scale of the data, so rather than adding $\sum_{k} \xi_{k}$ as is, it is better to multiply it by a suitable positive number $\lambda > 0$ first. $$ \begin{matrix} \text{Minimize} & {{ 1 } \over { 2 }} \left\| \mathbf{w} \right\|_{2}^{2} + \lambda \sum_{k=1}^{n} \xi_{k} \\ \text{subject to} & \left| f \left( \mathbf{x}_{k} \right) \right| \ge 1 - \xi_{k} \\ & \xi_{k} \ge 0 \end{matrix} \\ k = 1, \cdots , n $$
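At the optimum each slack equals $\xi_{k} = \max \left( 0, 1 - y_{k} f \left( \mathbf{x}_{k} \right) \right)$, so the soft-margin problem can equivalently be written as the unconstrained minimization of ${{ 1 } \over { 2 }} \left\| \mathbf{w} \right\|_{2}^{2} + \lambda \sum_{k} \max \left( 0, 1 - y_{k} f \left( \mathbf{x}_{k} \right) \right)$, which sub-gradient descent can handle. The following rough sketch in plain Julia works from that reformulation; the toy data, learning rate, and iteration count are arbitrary choices, not anything prescribed above.

```julia
using LinearAlgebra

# Soft-margin SVM via sub-gradient descent on ½‖w‖² + λ Σₖ max(0, 1 - yₖ(wᵀxₖ + b)).
function softmargin_svm(X, y; λ = 1.0, η = 0.01, iters = 5_000)
    p, n = size(X)
    w, b = zeros(p), 0.0
    for _ in 1:iters
        gw, gb = copy(w), 0.0                         # gradient of ½‖w‖² is w itself
        for k in 1:n
            if y[k] * (dot(w, X[:, k]) + b) < 1       # point violates the (soft) margin
                gw .-= λ * y[k] .* X[:, k]            # sub-gradient of the hinge term
                gb  -= λ * y[k]
            end
        end
        w .-= η .* gw
        b  -= η * gb
    end
    return w, b
end

# Hypothetical overlapping data, as in the soft-margin scenario above.
X = hcat(randn(2, 50) .- 1, randn(2, 50) .+ 1)
y = vcat(fill(-1, 50), fill(+1, 50))
w, b = softmargin_svm(X, y)
```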
Kernel Trick
Given data like the above, it seems impossible to binary classify it with SVM, whether with a soft margin or not. However, it’s clear that sky-blue points are clustered closer to $0$, and orange points appear on the outside. To utilize this information, let’s create a new $z$-axis as follows. $$ \phi (x,y) := (x,y, x^{2} + y^{2}) $$
The picture above is a capture of the one below; the plot below is an interactive 3D view that can be rotated with the mouse, so take a look around.
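Written out in Julia, the lift is a one-line feature map; the visualization code at the end of this post plots the same third coordinate $x^{2} + y^{2}$.

```julia
# Feature map φ : ℝ² → ℝ³, lifting each point by its squared distance from the origin.
ϕ(x, y) = (x, y, x^2 + y^2)

ϕ(1.0, 2.0)    # → (1.0, 2.0, 5.0)
```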
While it was difficult to find a line separating the data in the original $\mathbb{R}^{2}$, in the lifted $\mathbb{R}^{3}$ we can now use SVM to separate the data with a suitable plane. Naturally, one might ask, ‘So is this handy transformation $\phi$ what’s called a Kernel, and is using it the Kernel Trick?’ The answer is half right and half wrong: a kernel is $\phi$ with one more step added, namely an inner product.
Returning to Maximizing the Margin, let’s revisit the optimization problem given to us. $$ \begin{matrix} \text{Minimize} & {{ 1 } \over { 2 }} \left\| \mathbf{w} \right\|_{2}^{2} \\ \text{subject to} & \left| f \left( \mathbf{x}_{k} \right) \right| \ge 1 \end{matrix} \\ k = 1, \cdots , n $$
Although the constraint $\left| f \left( \mathbf{x}_{k} \right) \right| \ge 1$ looks neat, it is not much help when actually solving the problem. Rewriting it in terms of the original training dataset, for $k = 1 , \cdots , n$, $$ \begin{cases} f \left( \mathbf{x}_{k} \right) \ge 1 & , \text{if } y_{k} = 1 \\ f \left( \mathbf{x}_{k} \right) \le -1 & , \text{if } y_{k} = -1 \end{cases} \\ \implies \begin{cases} y_{k} \left( \mathbf{w}^{T} \mathbf{x}_{k} + b \right) \ge 1 & , \text{if } y_{k} = 1 \\ y_{k} \left( \mathbf{w}^{T} \mathbf{x}_{k} + b \right) \ge 1 & , \text{if } y_{k} = -1 \end{cases} \\ \implies y_{k} \left( \mathbf{w}^{T} \mathbf{x}_{k} + b \right) \ge 1 $$ must hold. The method of folding such constraints directly into the objective function and then treating the problem as if it were unconstrained is the Lagrange Multiplier Method. Subtracting from the objective function the constraint terms $y_{k} \left( \mathbf{w}^{T} \mathbf{x}_{k} + b \right) - 1 \ge 0$, each multiplied by a multiplier $\alpha_{k} \ge 0$, we get the following optimization problem for the Lagrangian $L(\mathbf{w}, b)$: $$ \begin{matrix} \text{Minimize} & {{ 1 } \over { 2 }} \left\| \mathbf{w} \right\|_{2}^{2} - \sum_{k=1}^{n} \alpha_{k} \left[ y_{k} \left( \mathbf{w}^{T} \mathbf{x}_{k} + b \right) - 1 \right] \\ \text{subject to} & \alpha_{k} \ge 0 \end{matrix} \\ k = 1, \cdots , n $$
To reiterate, our goal was to find $\mathbf{w}, b$ that minimizes this objective function. The condition that makes the partial derivative of the objective function with respect to $\mathbf{w}, b$ equal to $0$ is as follows: $$ \begin{align*} {{ \partial L } \over { \partial \mathbf{w} }} = 0 \implies & \mathbf{w} = \sum_{k=1}^{n} \alpha_{k} y_{k} \mathbf{x}_{k} \\ {{ \partial L } \over { \partial b }} = 0 \implies & 0 = \sum_{k=1}^{n} \alpha_{k} y_{k} \end{align*} $$
Substituting this directly into $L$ yields $$ \begin{align*} & L(\mathbf{w},b) \\ =& {{ 1 } \over { 2 }} \left\| \mathbf{w} \right\|_{2}^{2} - \sum_{k=1}^{n} \alpha_{k} \left[ y_{k} \left( \mathbf{w}^{T} \mathbf{x}_{k} + b \right) - 1 \right] \\ =& {{ 1 } \over { 2 }} \mathbf{w}^{T} \mathbf{w} - \sum_{k=1}^{n} \alpha_{k} y_{k} \left( \mathbf{w}^{T} \mathbf{x}_{k} + b \right) + \sum_{k=1}^{n} \alpha_{k} \\ =& {{ 1 } \over { 2 }} \mathbf{w}^{T} \sum_{k=1}^{n} \alpha_{k} y_{k} \mathbf{x}_{k} - \sum_{k=1}^{n} \alpha_{k} y_{k}\mathbf{w}^{T} \mathbf{x}_{k} - b \sum_{k=1}^{n} \alpha_{k} y_{k} + \sum_{k=1}^{n} \alpha_{k} \\ =& - {{ 1 } \over { 2 }} \sum_{k=1}^{n} \alpha_{k} y_{k} \mathbf{w}^{T} \mathbf{x}_{k} - b \cdot 0 + \sum_{k=1}^{n} \alpha_{k} \\ =& \sum_{k=1}^{n} \alpha_{k} - {{ 1 } \over { 2 }} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_{i} y_{i} \alpha_{j} y_{j} \mathbf{x}_{i}^{T} \mathbf{x}_{j} \\ =& \sum_{k=1}^{n} \alpha_{k} - {{ 1 } \over { 2 }} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_{i} \alpha_{j} y_{i} y_{j} \mathbf{x}_{i}^{T} \mathbf{x}_{j} \\ =& L \left( \alpha_{1} , \cdots , \alpha_{n} \right) \end{align*} $$
As expected, to specifically calculate $\mathbf{w}$ and $b$, the training data $\left\{ \left( \mathbf{x}_{k}, y_{k} \right) \right\}_{k=1}^{n}$ is required.
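Concretely, the stationarity conditions give $\mathbf{w}$ as a weighted sum of the training inputs, and $b$ can then be read off from any support vector, i.e. any $\mathbf{x}_{s}$ with $\alpha_{s} > 0$, for which $y_{s} \left( \mathbf{w}^{T} \mathbf{x}_{s} + b \right) = 1$. A small sketch, assuming hypothetical vectors `α`, `y` and a $p \times n$ data matrix `X` are already available:

```julia
using LinearAlgebra

# w = Σₖ αₖ yₖ xₖ  (from ∂L/∂w = 0)
recover_w(α, y, X) = sum(α[k] * y[k] * X[:, k] for k in axes(X, 2))

# b = yₛ - wᵀxₛ for any support vector s, i.e. any index with αₛ > 0.
function recover_b(α, y, X, w; tol = 1e-8)
    s = findfirst(>(tol), α)    # assumes at least one αₖ is strictly positive
    return y[s] - dot(w, X[:, s])
end
```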
The point to note here is that the inner product of $\mathbf{x}_{i}$ and $\mathbf{x}_{j}$ appears in the expression. Ultimately an inner product must be computed, and if $X$ is not an inner product space there is no guarantee we can take this smooth path. Conversely, even if $X$ is not an inner product space, if a transformation $\phi$ can send $X$ into an inner product space, it becomes plausible to consider an SVM with the objective function $$ \sum_{k=1}^{n} \alpha_{k} - {{ 1 } \over { 2 }} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_{i} \alpha_{j} y_{i} y_{j} \phi \left( \mathbf{x}_{i} \right) ^{T} \phi \left( \mathbf{x}_{j} \right) $$ In [Machine Learning](../../categories/Machine Learning), a function built from a transformation followed by the inner product of the two transformed vectors, $$ K \left( \mathbf{x}_{i}, \mathbf{x}_{j} \right) := \left< \phi \left( \mathbf{x}_{i} \right) , \phi \left( \mathbf{x}_{j} \right) \right> $$ is sometimes referred to as a Kernel. [ NOTE: Even within data science there are other kernels that could be confused with this one, and the original mathematical kernel is a function with the same name but an entirely different meaning. ]
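As a sketch of this definition, a kernel can be built in Julia directly from any feature map, for example the lift $\phi$ from the earlier picture; the names below are ad hoc and only meant to show that the dual objective touches the data exclusively through $K$.

```julia
using LinearAlgebra

# A kernel is a feature map followed by an inner product: K(xᵢ, xⱼ) = ⟨φ(xᵢ), φ(xⱼ)⟩.
ϕ(z) = [z[1], z[2], z[1]^2 + z[2]^2]
K(xi, xj) = dot(ϕ(xi), ϕ(xj))

# Dual objective Σₖ αₖ - ½ ΣᵢΣⱼ αᵢαⱼ yᵢyⱼ K(xᵢ, xⱼ): the data enters only through K.
dual_objective(α, y, X) =
    sum(α) - 0.5 * sum(α[i] * α[j] * y[i] * y[j] * K(X[:, i], X[:, j])
                       for i in axes(X, 2), j in axes(X, 2))
```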
If you can follow the content up to this point mathematically, you should see why even introducing a mere transformation $\phi$, not a full kernel, is referred to as the Kernel Trick, and why it matters that the transformed data is guaranteed to land in an inner product space.
Various kernels satisfying certain conditions can be considered; in particular, the original SVM itself can be seen as using a Linear Kernel $$ K \left( \mathbf{x}_{i}, \mathbf{x}_{j} \right) = \left< \mathbf{x}_{i}, \mathbf{x}_{j} \right>^{1} = \mathbf{x}_{i}^{T} \mathbf{x}_{j} $$
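In code the linear kernel is nothing but the plain inner product; the exponent in the notation above hints at the commonly used polynomial generalization, included here only as an illustrative extra rather than something discussed above.

```julia
using LinearAlgebra

K_linear(xi, xj) = dot(xi, xj)                        # ⟨xᵢ, xⱼ⟩¹, recovering the original SVM
K_poly(xi, xj; c = 1.0, d = 2) = (dot(xi, xj) + c)^d  # common polynomial kernel (illustrative assumption)
```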
See Also
The Kernel Trick section dealt with mathematically simple content, but if you’re interested in deeper theories, go beyond SVM and study the following topics:
- Kernels in Machine Learning and Reproducing Kernel Hilbert Spaces
- Proof of the Representation Theorem
Code
The following is Julia code that generates the data for the kernel trick example above and visualizes the feature map $\phi$.
```julia
using Plots

# A point set "topology": the (d-1)-sphere embedded in ℝᵈ.
struct Sphere
    d::Int64
end

import Base.rand

# Draw n points uniformly on the sphere by normalizing Gaussian samples.
function rand(Topology::Sphere, n::Int64)
    direction = randn(Topology.d, n)
    boundary = direction ./ sqrt.(sum(abs2, direction, dims = 1))
    return boundary
end

# Two classes in ℝ²: a small noisy circle (A) inside a larger noisy circle (B).
A = 0.3rand(Sphere(2), 200) + 0.1randn(2, 200)
B = rand(Sphere(2), 200) + 0.1randn(2, 200)

scatter(A[1,:], A[2,:], ratio = :equal, label = "+1")
scatter!(B[1,:], B[2,:], ratio = :equal, label = "-1")
png("raw.png")

# Lift to 3D with the third coordinate ϕ(x, y) = x² + y² and draw an interactive plot.
Plots.plotly()
ϕ(z) = z[1]^2 + z[2]^2
scatter(A[1,:], A[2,:], ϕ.(eachcol(A)), ms = 1, label = "+1")
scatter!(B[1,:], B[2,:], ϕ.(eachcol(B)), ms = 1, label = "-1")
savefig("kernel.html")
```
1. Jakkula. (2006). Tutorial on Support Vector Machine (SVM). https://course.ccs.neu.edu/cs5100f11/resources/jakkula.pdf
2. https://ratsgo.github.io/machine%20learning/2017/05/23/SVM/
3. https://bkshin.tistory.com/entry/Machine-Learning-2-Support-Vector-Machine-SVM