
Support Vector Machine

Model 1

Simple Definition

A Support Vector Machine is a method that finds the line or plane that best separates binarily classifiable data.

Complex Definition

For an inner product space $X = \mathbb{R}^{p}$ and labels $Y = \left\{ -1, +1 \right\}$, denote the training dataset consisting of $n$ data points as $D = \left\{ \left( \mathbf{x}_{k} , y_{k} \right) \right\}_{k=1}^{n} \subset X \times Y$, and define
$$
\begin{align*} X^{+} :=& \left\{ \mathbf{x}_{k} \in X : y_{k} = +1 \right\} \\ X^{-} :=& \left\{ \mathbf{x}_{k} \in X : y_{k} = -1 \right\} \end{align*}
$$

Suppose the hyperplane created by a linear function $f \left( \mathbf{x} \right) = \mathbf{w}^{T} \mathbf{x} + b$, with some weight $\mathbf{w} \in \mathbb{R}^{p}$ and bias $b \in \mathbb{R}$, is $H : \mathbf{w}^{T} \mathbf{x} + b = 0$. The $\mathbf{x}^{+} \in X^{+}$ and $\mathbf{x}^{-} \in X^{-}$ closest to $H$ are called Support Vectors, and the distance $\delta$ between them is called the Margin. The [machine learning](../../categories/Machine Learning) technique that finds the $\mathbf{w} , b$ maximizing the margin while satisfying
$$
\begin{align*} f \left( \mathbf{x}^{+} \right) =& +1 \\ f \left( \mathbf{x}^{-} \right) =& -1 \end{align*}
$$

is called a Support Vector Machine (SVM).


Explanation

Simply put, it’s like finding a line or plane that divides orange and sky-blue data as shown in the next picture. The red arrows in the picture correspond to the support vectors.

20220225_015438.png

In the picture, we found a line in $2$ dimensions and a plane in $3$ dimensions, but for larger $p$ dimensions we need to find a hyperplane, which is harder to depict in a picture. However, the concept of dividing the space in two remains the same. Once the training dataset has been binarily classified, new data can be classified using $f$ as a linear classifier.
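For example, once some $\mathbf{w}$ and $b$ have been found, classifying a new point only requires the sign of $f$. Below is a minimal Julia sketch of this decision rule; the values of `w` and `b` are hypothetical, not the result of an actual fit.

```julia
# Hypothetical weight and bias, e.g. as they might come out of a fitted SVM
w = [2.0, -1.0]
b = 0.5

# f(x) = wᵀx + b, and the predicted label is its sign (+1 or -1)
f(x) = w' * x + b
classify(x) = f(x) >= 0 ? +1 : -1

classify([1.0, 0.3])   # +1
classify([-1.0, 2.0])  # -1
```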

20220225_014703.png

Obviously, even with the same binary classification data, the left side is better than the right, as the margin for the sky-blue data on the right is excessive. You do not need to know exactly how this is computed, since packages handle it.
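As a concrete illustration of leaving the optimization to a package, here is a sketch using LIBSVM.jl, one of several options in Julia; the `svmtrain`/`svmpredict` interface and the column-per-observation data layout are assumptions to verify against the package documentation.

```julia
using LIBSVM   # assumed package: the Julia wrapper around LIBSVM

# Toy training data: each column is one observation, each row one of the p features
X = [1.0 1.2 3.0 3.3;
     1.1 0.9 3.1 2.8]
y = [+1, +1, -1, -1]

# Fit a linear SVM; the package takes care of the underlying optimization
model = svmtrain(X, y; kernel = Kernel.Linear)

# Classify two new points with the fitted hyperplane
ŷ, decision_values = svmpredict(model, [1.1 3.2;
                                        1.0 3.0])
```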

For undergraduate students, understanding the simple definition and grasping the concept from the pictures is enough for future use or for following the terminology. For slightly more advanced content, practical summary points, and Python example code, there are plenty of well-organized documents on the Korean web. 2 3 4

Inner Product Space

As you can see, SVM itself is not conceptually complex; the mathematical definitions and equations are introduced because of the extensive theoretical discussion that follows.

The Euclidean space $\mathbb{R}^{p}$ is naturally a vector space and an inner product space, and since every inner product space is a metric space, it is a metric space as well. This is worth emphasizing because, for real-world data, assuming an inner product space is quite a strong assumption. For example, with images, documents, or molecular structures, the immediate question is whether they can be fed directly into an SVM at all. The definition implicitly relies on notions like being 'close in distance' and on a linear function $f$ built from inner products of vectors, and none of this should be taken for granted as the theory approaches reality.

Support Vectors

In geometric problems, the objects on the edge (boundary) are typically called supports. For example, in the Minimum Enclosing Disk problem, the points on the circumference that determine the circle are called supports. Similarly, the support vectors of an SVM lie on the boundary: they are the points $\mathbf{x}^{+}, \mathbf{x}^{-}$ of the sets $X^{+}, X^{-}$ positioned at distance $\delta/2$ from $H$.

There is no guarantee that the support vectors of $X^{+}, X^{-}$ are unique, but uniqueness does not matter for the upcoming discussion, so let us assume without loss of generality that they are. No data lies inside the margin around $H$, and
$$
f \left( \mathbf{x} \right) \begin{cases} \ge +1 & , \text{if } \mathbf{x} \in X^{+} \\ \le -1 & , \text{if } \mathbf{x} \in X^{-} \end{cases}
$$
thus, for all $\left\{ \mathbf{x}_{k} \right\}_{k=1}^{n}$, $\left| f \left( \mathbf{x}_{k} \right) \right| \ge 1$ must hold.

Maximizing the Margin

Since the support vectors are the points closest to $H$, their distance to $H$ is $\delta/2$, measured along the direction of $\mathbf{w}$, which is perpendicular to $H$. This distance is the same for $\mathbf{x}^{+}$ and $\mathbf{x}^{-}$, and since both lie at distance $\delta/2$ from the hyperplane $H$ on opposite sides, the displacement between the two support vectors along the normal direction is
$$
\mathbf{x}^{+} - \mathbf{x}^{-} = \delta {{ \mathbf{w} } \over { \left\| \mathbf{w} \right\|_{2} }}
$$
where operations like $\mathbf{x}^{+} - \mathbf{x}^{-}$ are permitted under the assumption that $X$ is a vector space. Taking the inner product of both sides with $\mathbf{w}$, in other words multiplying by $\mathbf{w}^{T}$ on the left, gives
$$
\begin{align*} & \mathbf{x}^{+} - \mathbf{x}^{-} = \delta {{ \mathbf{w} } \over { \left\| \mathbf{w} \right\|_{2} }} \\ \implies & \mathbf{w}^{T} \mathbf{x}^{+} - \mathbf{w}^{T} \mathbf{x}^{-} = \delta {{ \mathbf{w}^{T} \mathbf{w} } \over { \left\| \mathbf{w} \right\|_{2} }} \\ \implies & \left( \mathbf{w}^{T} \mathbf{x}^{+} + b \right) - \left( \mathbf{w}^{T} \mathbf{x}^{-} + b \right) = \delta \left\| \mathbf{w} \right\|_{2} \\ \implies & (+1) - (-1) = \delta \left\| \mathbf{w} \right\|_{2} \\ \implies & \delta = {{ 2 } \over { \left\| \mathbf{w} \right\|_{2} }} \end{align*}
$$
by the definition of $f$. Therefore maximizing the margin $\delta = 2 / \left\| \mathbf{w} \right\|_{2}$ is equivalent to minimizing the objective function $\left\| \mathbf{w} \right\|_{2}^{2} / 2$, and in summary, SVM is an optimizer that solves the following optimization problem.
$$
\begin{matrix} \text{Minimize} & {{ 1 } \over { 2 }} \left\| \mathbf{w} \right\|_{2}^{2} \\ \text{subject to} & \left| f \left( \mathbf{x}_{k} \right) \right| \ge 1 \end{matrix} \\ k = 1, \cdots , n
$$
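To make this concrete, here is a small sketch that, given a candidate hyperplane $(\mathbf{w}, b)$, checks the constraint in its label-wise form $y_{k} f \left( \mathbf{x}_{k} \right) \ge 1$ (which is how $\left| f \left( \mathbf{x}_{k} \right) \right| \ge 1$ is unpacked in the Kernel Trick section below) and computes the margin $\delta = 2 / \left\| \mathbf{w} \right\|_{2}$; the data and the hyperplane are made up for illustration.

```julia
using LinearAlgebra  # for norm

# Toy data: the columns of X are the x_k, y holds the labels ±1
X = [0.0 0.0 2.0 2.0;
     0.0 1.0 0.0 1.0]
y = [-1, -1, +1, +1]

# A candidate separating hyperplane H : wᵀx + b = 0
w = [1.0, 0.0]
b = -1.0

f(x) = w' * x + b

# Feasibility: every point must satisfy y_k f(x_k) ≥ 1
feasible = all(y[k] * f(X[:, k]) >= 1 for k in eachindex(y))   # true

# Margin between the two support-vector boundaries
δ = 2 / norm(w)   # here 2.0, since ‖w‖₂ = 1
```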

Derived Models

As the complex definition makes clear, SVM finds a linear function, whether a line or a hyperplane, and naturally that alone is not always satisfactory.

Soft Margin SVM

Consider data like the following. Because of the data mixed together in the middle, SVM cannot classify it perfectly.

20220225_030646.png

Note that we had to satisfy the condition $\left| f \left( \mathbf{x}_{k} \right) \right| \ge 1$, that is, the constraint that no data may lie inside the margin of the support vectors. If we allow this inequality to dip below $1$, we can still get better results than giving up on binary classification altogether, even if the separation is not perfect. Denoting this allowance for each data point by $\xi_{k} \ge 0$, we obtain the new constraint $\left| f \left( \mathbf{x}_{k} \right) \right| \ge 1 - \xi_{k}$. Such a weakened margin is called a Soft Margin.

Although the constraint has been relaxed, loosening it completely to $\xi_{1} = \cdots = \xi_{n} = 1$ would negate SVM altogether. To prevent this, a term like $\sum_{k} \xi_{k}$ can be added to the objective function as a penalty for making an impossible binary classification possible. Of course, such a simple penalty could be meaningless or overly sensitive depending on the scale of the data, so rather than using $\sum_{k} \xi_{k}$ as is, it is better to multiply it by a suitable positive number $\lambda > 0$ before adding it.
$$
\begin{matrix} \text{Minimize} & {{ 1 } \over { 2 }} \left\| \mathbf{w} \right\|_{2}^{2} + \lambda \sum_{k=1}^{n} \xi_{k} \\ \text{subject to} & \left| f \left( \mathbf{x}_{k} \right) \right| \ge 1 - \xi_{k} \\ & \xi_{k} \ge 0 \end{matrix} \\ k = 1, \cdots , n
$$
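Writing the constraint in its label-wise form $y_{k} \left( \mathbf{w}^{T} \mathbf{x}_{k} + b \right) \ge 1 - \xi_{k}$ (as unpacked in the Kernel Trick section below) and noting that at the optimum $\xi_{k} = \max \left( 0, 1 - y_{k} f \left( \mathbf{x}_{k} \right) \right)$, this problem is equivalent to minimizing the unconstrained hinge-loss objective $\frac{1}{2} \left\| \mathbf{w} \right\|_{2}^{2} + \lambda \sum_{k} \max \left( 0, 1 - y_{k} f \left( \mathbf{x}_{k} \right) \right)$. The following is a rough subgradient-descent sketch of that equivalent form, not a reference implementation; the data, step size, and iteration count are all arbitrary choices.

```julia
# Soft-margin SVM via subgradient descent on the equivalent hinge-loss form:
#   minimize  ½‖w‖² + λ Σ_k max(0, 1 - y_k(wᵀx_k + b))
function softmargin_svm(X, y; λ = 1.0, η = 0.01, iters = 5_000)
    p, n = size(X)
    w = zeros(p)
    b = 0.0
    for _ in 1:iters
        gw = copy(w)          # gradient of ½‖w‖² is w
        gb = 0.0
        for k in 1:n
            xk = X[:, k]
            if y[k] * (w' * xk + b) < 1          # point violates the margin
                gw .-= λ * y[k] .* xk            # subgradient of the hinge term
                gb -= λ * y[k]
            end
        end
        w .-= η .* gw
        b -= η * gb
    end
    return w, b
end

# Example with two slightly overlapping clusters
X = hcat(randn(2, 50) .- 1.5, randn(2, 50) .+ 1.5)
y = vcat(fill(-1, 50), fill(+1, 50))
w, b = softmargin_svm(X, y)
```

A larger $\lambda$ penalizes margin violations more heavily, pushing the solution back toward the hard-margin SVM.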

Kernel Trick

raw.png

Given data like the above, it seems impossible to binarily classify it with SVM, whether with a soft margin or not. However, it is clear that the sky-blue points are clustered near $0$ and the orange points lie on the outside. To exploit this information, let us create a new $z$-axis as follows.
$$
\phi (x,y) := \left( x, y, x^{2} + y^{2} \right)
$$

20220225_140243.png

The picture above is a capture of the plot below. The plot below is an interactive 3D view that can be rotated with the mouse, so take a look around.

While it was difficult to find a line separating the data in the original $\mathbb{R}^{2}$, in the expanded $\mathbb{R}^{3}$ we can now use SVM to classify the data with a suitable plane. Naturally, one might ask, 'So is this convenient transformation $\phi$ called a Kernel, and is using a kernel called the Kernel Trick?' The answer is half right and half wrong. A kernel is $\phi$ with one extra step, namely the inner product.

Returning to Maximizing the Margin, let us revisit the optimization problem we were given.
$$
\begin{matrix} \text{Minimize} & {{ 1 } \over { 2 }} \left\| \mathbf{w} \right\|_{2}^{2} \\ \text{subject to} & \left| f \left( \mathbf{x}_{k} \right) \right| \ge 1 \end{matrix} \\ k = 1, \cdots , n
$$

Although the constraint $\left| f \left( \mathbf{x}_{k} \right) \right| \ge 1$ looks neat, it does not help much when actually solving the problem. Reverting to the form written in terms of the original training dataset, for $k = 1 , \cdots , n$,
$$
\begin{cases} f \left( \mathbf{x}_{k} \right) \ge 1 & , \text{if } y_{k} = 1 \\ f \left( \mathbf{x}_{k} \right) \le -1 & , \text{if } y_{k} = -1 \end{cases} \\ \implies \begin{cases} y_{k} \left( \mathbf{w}^{T} \mathbf{x}_{k} + b \right) \ge 1 & , \text{if } y_{k} = 1 \\ y_{k} \left( \mathbf{w}^{T} \mathbf{x}_{k} + b \right) \ge 1 & , \text{if } y_{k} = -1 \end{cases} \\ \implies y_{k} \left( \mathbf{w}^{T} \mathbf{x}_{k} + b \right) \ge 1
$$
must hold. The technique of folding this constraint directly into the objective function, so that the problem can be treated as if it were unconstrained, is the Lagrange Multiplier Method. Subtracting from the objective function the terms $y_{k} \left( \mathbf{w}^{T} \mathbf{x}_{k} + b \right) - 1 \ge 0$ multiplied by multipliers $\alpha_{k} \ge 0$, we obtain the following optimization problem for the Lagrangian $L(\mathbf{w}, b)$:
$$
\begin{matrix} \text{Minimize} & {{ 1 } \over { 2 }} \left\| \mathbf{w} \right\|_{2}^{2} - \sum_{k=1}^{n} \alpha_{k} \left[ y_{k} \left( \mathbf{w}^{T} \mathbf{x}_{k} + b \right) - 1 \right] \\ \text{subject to} & \alpha_{k} \ge 0 \end{matrix} \\ k = 1, \cdots , n
$$

To reiterate, our goal is to find the $\mathbf{w}, b$ that minimize this objective function. Setting the partial derivatives of the objective function with respect to $\mathbf{w}, b$ equal to $0$ gives the following conditions:
$$
\begin{align*} {{ \partial L } \over { \partial \mathbf{w} }} = 0 \implies & \mathbf{w} = \sum_{k=1}^{n} \alpha_{k} y_{k} \mathbf{x}_{k} \\ {{ \partial L } \over { \partial b }} = 0 \implies & 0 = \sum_{k=1}^{n} \alpha_{k} y_{k} \end{align*}
$$

Substituting these directly into $L$ yields
$$
\begin{align*} & L(\mathbf{w},b) \\ =& {{ 1 } \over { 2 }} \left\| \mathbf{w} \right\|_{2}^{2} - \sum_{k=1}^{n} \alpha_{k} \left[ y_{k} \left( \mathbf{w}^{T} \mathbf{x}_{k} + b \right) - 1 \right] \\ =& {{ 1 } \over { 2 }} \mathbf{w}^{T} \mathbf{w} - \sum_{k=1}^{n} \alpha_{k} y_{k} \left( \mathbf{w}^{T} \mathbf{x}_{k} + b \right) + \sum_{k=1}^{n} \alpha_{k} \\ =& {{ 1 } \over { 2 }} \mathbf{w}^{T} \sum_{k=1}^{n} \alpha_{k} y_{k} \mathbf{x}_{k} - \sum_{k=1}^{n} \alpha_{k} y_{k} \mathbf{w}^{T} \mathbf{x}_{k} - b \sum_{k=1}^{n} \alpha_{k} y_{k} + \sum_{k=1}^{n} \alpha_{k} \\ =& - {{ 1 } \over { 2 }} \sum_{k=1}^{n} \alpha_{k} y_{k} \mathbf{w}^{T} \mathbf{x}_{k} - b \cdot 0 + \sum_{k=1}^{n} \alpha_{k} \\ =& \sum_{k=1}^{n} \alpha_{k} - {{ 1 } \over { 2 }} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_{i} y_{i} \alpha_{j} y_{j} \mathbf{x}_{i}^{T} \mathbf{x}_{j} \\ =& \sum_{k=1}^{n} \alpha_{k} - {{ 1 } \over { 2 }} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_{i} \alpha_{j} y_{i} y_{j} \mathbf{x}_{i}^{T} \mathbf{x}_{j} \\ =& L \left( \alpha_{1} , \cdots , \alpha_{n} \right) \end{align*}
$$
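The dual objective $L \left( \alpha_{1} , \cdots , \alpha_{n} \right)$ depends on the data only through the inner products $\mathbf{x}_{i}^{T} \mathbf{x}_{j}$, which can be collected into a Gram matrix. A small sketch of evaluating it follows; the data and the values of $\alpha$ are arbitrary, chosen only for illustration.

```julia
# Evaluate the dual objective L(α) = Σα_k - ½ ΣΣ α_i α_j y_i y_j x_iᵀx_j
function dual_objective(α, y, X)
    G = X' * X                          # Gram matrix: G[i, j] = x_iᵀ x_j
    return sum(α) - 0.5 * sum(α[i] * α[j] * y[i] * y[j] * G[i, j]
                              for i in eachindex(α), j in eachindex(α))
end

X = [0.0 0.0 2.0 2.0;
     0.0 1.0 0.0 1.0]                   # columns are the x_k
y = [-1, -1, +1, +1]
α = [0.5, 0.5, 0.5, 0.5]                # arbitrary multipliers satisfying Σ α_k y_k = 0
dual_objective(α, y, X)
```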

As expected, to actually compute $\mathbf{w}$ and $b$, the training data $\left\{ \left( \mathbf{x}_{k}, y_{k} \right) \right\}_{k=1}^{n}$ is required.

The point to note here is that the inner product of $\mathbf{x}_{i}$ and $\mathbf{x}_{j}$ appears in the equation. Ultimately we must compute an inner product, and if $X$ is not an inner product space, there is no guarantee that we can take this smooth path. Conversely, even if $X$ is not an inner product space, if a transformation $\phi$ can send $X$ into an inner product space, then it becomes plausible to consider an SVM with the objective function
$$
\sum_{k=1}^{n} \alpha_{k} - {{ 1 } \over { 2 }} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_{i} \alpha_{j} y_{i} y_{j} \phi \left( \mathbf{x}_{i} \right) ^{T} \phi \left( \mathbf{x}_{j} \right)
$$
In [Machine Learning](../../categories/Machine Learning), a function formed by a transformation followed by the inner product of two vectors
$$
K \left( \mathbf{x}_{i}, \mathbf{x}_{j} \right) := \left< \phi \left( \mathbf{x}_{i} \right) , \phi \left( \mathbf{x}_{j} \right) \right>
$$
is sometimes referred to as a Kernel. [ NOTE: Even within data science, there are other kernels that could be confused with this one; the original mathematical kernel is a function with the same name but an entirely different role. ]

If you can follow the content up to this point mathematically, you should understand why introducing a transformation $\phi$, which by itself is not even a kernel, is called the Kernel Trick, and why it matters that the transformed space is guaranteed to be an inner product space.

Several kernels satisfying certain conditions can be considered; in particular, the original SVM can be seen as using the Linear Kernel
$$
K \left( \mathbf{x}_{i}, \mathbf{x}_{j} \right) = \left< \mathbf{x}_{i}, \mathbf{x}_{j} \right>^{1} = \mathbf{x}_{i}^{T} \mathbf{x}_{j}
$$
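In code, a kernel is just a function of two data points. The sketch below shows the linear kernel alongside two other commonly used choices, the polynomial and Gaussian (RBF) kernels, and builds the kernel matrix whose entries $K \left( \mathbf{x}_{i}, \mathbf{x}_{j} \right)$ would replace $\mathbf{x}_{i}^{T} \mathbf{x}_{j}$ in the dual objective; the hyperparameter values are arbitrary defaults chosen for illustration.

```julia
using LinearAlgebra  # for dot and norm

# The linear kernel used by the original SVM
k_linear(xi, xj) = dot(xi, xj)

# Two other popular kernels (degree d, offset c, and width γ are arbitrary defaults here)
k_poly(xi, xj; d = 2, c = 1.0) = (dot(xi, xj) + c)^d
k_rbf(xi, xj; γ = 1.0) = exp(-γ * norm(xi - xj)^2)

# Kernel (Gram) matrix: entry (i, j) is K(x_i, x_j)
kernel_matrix(k, X) = [k(X[:, i], X[:, j]) for i in 1:size(X, 2), j in 1:size(X, 2)]

X = randn(2, 5)              # 5 random points in ℝ², one per column
K = kernel_matrix(k_rbf, X)  # 5×5 symmetric positive semidefinite matrix
```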

See Also

The Kernel Trick section dealt with mathematically simple content, but if you’re interested in deeper theories, go beyond SVM and study the following topics:

Code

The following is Julia code that generates the data shown above and visualizes the feature map ϕ used in the kernel trick.

```julia
# A d-dimensional unit sphere, used as a tag type for sampling points on it
struct Sphere
    d::Int64
end

# Sample n points uniformly on the sphere: draw Gaussian directions and normalize them
import Base.rand
function rand(Topology::Sphere, n::Int64)
    direction = randn(Topology.d, n)
    boundary = direction ./ sqrt.(sum(abs2, direction, dims = 1))
    return boundary
end

using Plots

# Noisy inner circle (class +1) and noisy outer circle (class -1), 200 points each
A = 0.3rand(Sphere(2), 200) + 0.1randn(2, 200)
B = rand(Sphere(2), 200) + 0.1randn(2, 200)

# 2D scatter of the raw data: the two classes are not linearly separable
scatter(A[1,:], A[2,:], ratio = :equal, label = "+1")
scatter!(B[1,:], B[2,:], ratio = :equal, label = "-1")
png("raw.png")

# Lift to 3D with the new z-axis z = x² + y² and save an interactive plot
Plots.plotly()
ϕ(z) = z[1]^2 + z[2]^2
scatter(A[1,:], A[2,:], ϕ.(eachcol(A)), ms = 1, label = "+1")
scatter!(B[1,:], B[2,:], ϕ.(eachcol(B)), ms = 1, label = "-1")
savefig("kernel.html")
```
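Continuing the example, one way to check that the lifted data really can be separated by a plane is to run a soft-margin fit on the 3D points, for instance with the `softmargin_svm` subgradient sketch from the Soft Margin section above; this continuation reuses `A` and `B` from the block above and is only an illustration, not part of the original example.

```julia
# Illustrative continuation: lift the 2D points to ℝ³ with the full map
# ϕ(x, y) = (x, y, x² + y²) and separate them with a plane.
# Reuses A, B from the block above and softmargin_svm from the earlier sketch.
lift(z) = [z[1], z[2], z[1]^2 + z[2]^2]

X3 = hcat(reduce(hcat, [lift(c) for c in eachcol(A)]),
          reduce(hcat, [lift(c) for c in eachcol(B)]))
y3 = vcat(fill(+1, size(A, 2)), fill(-1, size(B, 2)))

w, b = softmargin_svm(X3, y3)                     # a plane wᵀx + b = 0 in ℝ³
accuracy = count(sign.(X3' * w .+ b) .== y3) / length(y3)
```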