Support Vector Machine
Model 1
Simple Definition
A Support Vector Machine is a method for finding a line or plane that best separates binary-classifiable data.
Complex Definition
For an inner product space $X = \mathbb{R}^{p}$ and labeling $Y := \left\{ -1 , +1 \right\}$, let us denote the training dataset composed of $n$ pieces of data as
$$ D := \left\{ \left( \mathbf{x}_{k} , y_{k} \right) \right\}_{k=1}^{n} \subset X \times Y $$
and suppose the hyperplane created by a linear function $f \left( \mathbf{x} \right) = \mathbf{w}^{T} \mathbf{x} + b$ with some weight $\mathbf{w} \in \mathbb{R}^{p}$ and bias $b \in \mathbb{R}$ is
$$ H : \mathbf{w}^{T} \mathbf{x} + b = 0 $$
The $\mathbf{x}_{k}$ with $y_{k} = +1$ and the $\mathbf{x}_{k}$ with $y_{k} = -1$ closest to $H$ are called Support Vectors, and the distance between them is called the Margin. The [machine learning](../../categories/Machine Learning) technique that finds the $f$ maximizing the margin while satisfying
$$ y_{k} \left( \mathbf{w}^{T} \mathbf{x}_{k} + b \right) \ge 1 \qquad , k = 1 , \cdots , n $$
is called a Support Vector Machine (SVM).
- $\mathbb{R}$ is the set of real numbers, and $\mathbb{R}^{p}$ is the $p$-dimensional Euclidean space.
- $\times$ denotes the Cartesian product of two sets.
- $\mathbf{w}^{T}$ is the transpose of $\mathbf{w}$, and $\mathbf{w}^{T} \mathbf{x}$ is the inner product of the two vectors $\mathbf{w} , \mathbf{x}$.
Explanation
Simply put, it’s like finding a line or plane that divides orange and sky-blue data as shown in the next picture. The red arrows in the picture correspond to the support vectors.
In the picture, we found a line for $p = 2$ and a plane for $p = 3$, but for larger $p$ we need to find a hyperplane, which becomes hard to represent in a picture. However, the concept of dividing the space into two remains the same. Once binary classification is done with the training dataset, new data $\mathbf{x}$ can be classified according to the sign of $f \left( \mathbf{x} \right) = \mathbf{w}^{T} \mathbf{x} + b$, using $f$ as a linear classifier.
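As a concrete illustration of using $f$ as a linear classifier, here is a minimal Julia sketch; the weight `w`, the bias `b`, and the test points are made-up values chosen only for demonstration, not parameters produced by an actual SVM fit.

```julia
# Minimal sketch of a linear classifier f(x) = wᵀx + b with made-up parameters.
w = [2.0, -1.0]   # hypothetical weight vector
b = 0.5           # hypothetical bias

f(x) = w' * x + b                  # linear function defining the hyperplane H : f(x) = 0
classify(x) = f(x) ≥ 0 ? +1 : -1   # assign +1 on one side of H and -1 on the other

# classify a few made-up points
for x in ([1.0, 0.0], [-1.0, 2.0], [0.0, 0.5])
    println(x, " => ", classify(x))
end
```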
Obviously, even with the same binary-classification data, the split on the left is better than the one on the right, since the margin toward the sky-blue data on the right is excessive. Exactly how this is computed is not something we need to work out by hand, since packages handle it.
For undergraduate students, understanding just the simple definition and grasping the concept through the pictures is sufficient for later use or for following the terminology. For slightly more advanced content, practical summary points, and Python example code, there are plenty of well-organized documents on the Korean web. 2 3 4
Inner Product Space
As is evident, SVM itself is not conceptually complex; the reason for introducing mathematical definitions and equations is the extensive theoretical discussion that follows.
The Euclidean space $\mathbb{R}^{p}$ is naturally a vector space and an inner product space, and since an inner product space is a metric space, it is a metric space as well. Emphasizing this matters because, in the world of real data, assuming an inner product space is already quite a generous assumption. For example, with images, documents, or molecular structures, it is immediately doubtful whether they can be fed into an SVM as they are. The definition implicitly relies on phrases like 'close in distance' and on linear functions built from inner products of vectors, and none of this should be taken for granted as the theory approaches reality.
Support Vectors
In geometric problems, the objects lying on the edge (Boundary) are typically called supports. For example, in the Minimum Enclosing Disk problem, the points on the circumference that determine the circle are called supports. Similarly, the support vectors of an SVM also lie on the boundary, sticking out toward the hyperplane $H$ from the two sets $\left\{ \mathbf{x}_{k} : y_{k} = +1 \right\}$ and $\left\{ \mathbf{x}_{k} : y_{k} = -1 \right\}$.
There is no guarantee that the support vector for each label $y = \pm 1$ is unique, but uniqueness is not important for the upcoming discussion, so let us assume it is unique without loss of generality. No data exists inside the margin, and thus for all $k$,
$$ y_{k} \left( \mathbf{w}^{T} \mathbf{x}_{k} + b \right) \ge 1 $$
must hold.
Maximizing the Margin
Since the support vectors are the points closest to $H$, their distance to $H$ is attained along the direction of $\mathbf{w}$, perpendicular to $H$. This distance is the same whether $y = +1$ or $y = -1$, and both support vectors $\mathbf{x}^{+}$ and $\mathbf{x}^{-}$ lying at the same distance from the hyperplane on opposite sides means the two are related by
$$ \mathbf{x}^{+} - \mathbf{x}^{-} = \lambda \mathbf{w} $$
for some $\lambda > 0$, where operations like $\mathbf{x}^{+} - \mathbf{x}^{-}$ are permitted under the assumption that $X$ is a vector space. Taking the inner product of both sides with $\mathbf{w}$, in other words multiplying $\mathbf{w}^{T}$ on the left, gives
$$ \mathbf{w}^{T} \mathbf{x}^{+} - \mathbf{w}^{T} \mathbf{x}^{-} = \left( 1 - b \right) - \left( -1 - b \right) = 2 = \lambda \left\| \mathbf{w} \right\|^{2} $$
by the definition of the support vectors, so the distance between the two support vectors, that is the margin, is
$$ \left\| \mathbf{x}^{+} - \mathbf{x}^{-} \right\| = \lambda \left\| \mathbf{w} \right\| = \frac{2}{\left\| \mathbf{w} \right\|} $$
Therefore, maximizing the margin is equivalent to minimizing the objective function $\frac{1}{2} \left\| \mathbf{w} \right\|^{2}$, and in summary, SVM is an optimizer that solves the following optimization problem.
$$ \begin{matrix} \text{Minimize} & \frac{1}{2} \left\| \mathbf{w} \right\|^{2} \\ \text{subject to} & y_{k} \left( \mathbf{w}^{T} \mathbf{x}_{k} + b \right) \ge 1 \quad , k = 1 , \cdots , n \end{matrix} $$
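As a small sanity check of the derivation above, the following Julia sketch computes the margin $2 / \left\| \mathbf{w} \right\|$ for a made-up hyperplane $\left( \mathbf{w} , b \right)$ and verifies the constraint $y_{k} \left( \mathbf{w}^{T} \mathbf{x}_{k} + b \right) \ge 1$ on made-up data; none of the numbers come from an actual optimization.

```julia
using LinearAlgebra

# made-up hyperplane parameters and toy data (purely for illustration)
w = [1.0, 1.0]
b = -3.0
X = [1.0 1.0 3.0 4.0;    # each column is one data point xₖ
     1.0 0.5 3.0 4.0]
y = [-1, -1, +1, +1]

margin = 2 / norm(w)               # the margin 2/‖w‖ derived above
constraints = y .* (X' * w .+ b)   # yₖ(wᵀxₖ + b) for every k
feasible = all(constraints .≥ 1)   # true iff (w, b) satisfies every constraint

println("margin = ", margin)
println("constraints = ", constraints)
println("feasible = ", feasible)
```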
Derived Models
As per the complex definition, SVM finds a linear function, whether a line or a hyperplane, and naturally that alone will not always be satisfactory.
Soft Margin SVM
Consider data like the following. SVM cannot perfectly binary classify it due to the mixed data in the middle.
Note that, under the constraint that no data may exist inside the margin, we had to satisfy the condition $y_{k} \left( \mathbf{w}^{T} \mathbf{x}_{k} + b \right) \ge 1$. If we allow the right-hand side of this inequality to fall below $1$, we can still get a better result than giving up on binary classification altogether, even if the classification is not perfect. If we denote this allowance for each data point as $\xi_{k} \ge 0$, we get the new constraint
$$ y_{k} \left( \mathbf{w}^{T} \mathbf{x}_{k} + b \right) \ge 1 - \xi_{k} $$
This weakened margin is called a Soft Margin.
Although the constraint has been relaxed, loosening it completely by letting the $\xi_{k}$ grow arbitrarily large would negate SVM altogether. To prevent this, a term like $\sum_{k} \xi_{k}$ can be added to the objective function as a penalty for making an impossible binary classification possible. Of course, such a simple penalty could be meaningless or overly sensitive depending on the scale of the data, so rather than using $\sum_{k} \xi_{k}$ as is, it is better to multiply it by a suitable positive number $C > 0$ and add that instead.
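To make the soft margin concrete, here is a hedged Julia sketch that minimizes the penalized objective $\frac{1}{2} \left\| \mathbf{w} \right\|^{2} + C \sum_{k} \xi_{k}$ in its equivalent hinge-loss form $\xi_{k} = \max \left( 0 , 1 - y_{k} \left( \mathbf{w}^{T} \mathbf{x}_{k} + b \right) \right)$ by plain subgradient descent; the data, the penalty constant `C`, the learning rate, and the number of epochs are all made-up choices for illustration, not part of the original discussion.

```julia
using LinearAlgebra, Random

# Subgradient descent on the soft-margin objective
#   ½‖w‖² + C Σₖ max(0, 1 − yₖ(wᵀxₖ + b))
# X holds one data point per column and y holds the labels ±1.
function soft_margin_svm(X, y; C = 1.0, η = 0.01, epochs = 500)
    w = zeros(size(X, 1))
    b = 0.0
    for _ in 1:epochs
        gw = copy(w)                              # gradient of the ½‖w‖² term
        gb = 0.0
        for k in axes(X, 2)
            xk, yk = X[:, k], y[k]
            if yk * (dot(w, xk) + b) < 1          # point violates the (soft) margin
                gw -= C * yk * xk                 # subgradient of the hinge term
                gb -= C * yk
            end
        end
        w -= η * gw
        b -= η * gb
    end
    return w, b
end

# made-up, slightly overlapping 2D data just for demonstration
Random.seed!(1)
X = hcat(randn(2, 50) .+ 1.5, randn(2, 50) .- 1.5)
y = vcat(fill(+1, 50), fill(-1, 50))

w, b = soft_margin_svm(X, y)
println("margin ≈ ", 2 / norm(w),
        ", training accuracy = ", count(sign.(X' * w .+ b) .== y) / length(y))
```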
Kernel Trick
Given data like the above, it seems impossible to binary-classify it with SVM, whether with a soft margin or not. However, it is clear that the sky-blue points are clustered close to the origin $\mathbf{0}$, and the orange points appear on the outside. To utilize this information, let us create a new $z$-axis as follows.
$$ z := \phi \left( \mathbf{x} \right) = x_{1}^{2} + x_{2}^{2} $$
The picture above is a capture of the interactive 3D plot below, which can be rotated with the mouse, so take a look around.
While it was difficult to find a line separating the data in the original $\mathbb{R}^{2}$, in the expanded $\mathbb{R}^{3}$ we can now use SVM to classify the data with a suitable plane. Naturally, one might ask, 'So, is this kind of good transformation $\phi$ called a Kernel, and is using it called the Kernel Trick?' The answer is half right and half wrong. It is $\phi$ with one extra step, namely taking the inner product, that constitutes a kernel.
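The following Julia sketch illustrates that last point with one hypothetical choice of map: lifting $\mathbf{x} = \left( x_{1} , x_{2} \right)$ to $\Phi \left( \mathbf{x} \right) = \left( x_{1} , x_{2} , x_{1}^{2} + x_{2}^{2} \right)$ and taking the inner product after the lift defines a kernel $k \left( \mathbf{u} , \mathbf{v} \right) = \left\langle \Phi \left( \mathbf{u} \right) , \Phi \left( \mathbf{v} \right) \right\rangle = \mathbf{u}^{T} \mathbf{v} + \left\| \mathbf{u} \right\|^{2} \left\| \mathbf{v} \right\|^{2}$, which can be evaluated without ever forming $\Phi$ explicitly. The map and the test points are illustrative choices, not taken from the original text.

```julia
using LinearAlgebra

# hypothetical feature map: keep the original coordinates and add z = x₁² + x₂²
Φ(x) = [x[1], x[2], x[1]^2 + x[2]^2]

# corresponding kernel: the inner product taken *after* the map,
# which here has the closed form k(u, v) = uᵀv + ‖u‖²‖v‖²
k(u, v) = dot(u, v) + norm(u)^2 * norm(v)^2

u = [0.3, -0.4]      # made-up points in the original space ℝ²
v = [1.0,  0.7]

println(dot(Φ(u), Φ(v)))   # inner product in the lifted space ℝ³
println(k(u, v))           # the same number, computed directly in ℝ²
```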
Returning to Maximizing the Margin, let us revisit the optimization problem we were given.
$$ \begin{matrix} \text{Minimize} & \frac{1}{2} \left\| \mathbf{w} \right\|^{2} \\ \text{subject to} & y_{k} \left( \mathbf{w}^{T} \mathbf{x}_{k} + b \right) \ge 1 \quad , k = 1 , \cdots , n \end{matrix} $$
Although the constraint looks neat, it does not help much when actually solving the problem. Writing it out in terms of the original training dataset, for every $\left( \mathbf{x}_{k} , y_{k} \right) \in D$,
$$ y_{k} \left( \mathbf{w}^{T} \mathbf{x}_{k} + b \right) - 1 \ge 0 $$
must hold. The method of incorporating this constraint directly into the objective function and then treating the problem as if it were unconstrained is the Lagrange Multiplier Method. Subtracting from the objective function the constraint terms multiplied by multipliers $\lambda_{k} \ge 0$, we obtain the following optimization problem for $\mathbf{w} , b$:
$$ \text{Minimize} \quad L \left( \mathbf{w} , b \right) = \frac{1}{2} \left\| \mathbf{w} \right\|^{2} - \sum_{k=1}^{n} \lambda_{k} \left[ y_{k} \left( \mathbf{w}^{T} \mathbf{x}_{k} + b \right) - 1 \right] $$
To reiterate, our goal was to find $\mathbf{w} , b$ that minimize this objective function. The conditions that make the partial derivatives of the objective function with respect to $\mathbf{w}$ and $b$ equal to $0$ are as follows:
$$ \frac{ \partial L }{ \partial \mathbf{w} } = \mathbf{w} - \sum_{k=1}^{n} \lambda_{k} y_{k} \mathbf{x}_{k} = \mathbf{0} \implies \mathbf{w} = \sum_{k=1}^{n} \lambda_{k} y_{k} \mathbf{x}_{k} $$
$$ \frac{ \partial L }{ \partial b } = - \sum_{k=1}^{n} \lambda_{k} y_{k} = 0 \implies \sum_{k=1}^{n} \lambda_{k} y_{k} = 0 $$
Substituting this directly into $L$ yields
$$ L = \sum_{k=1}^{n} \lambda_{k} - \frac{1}{2} \sum_{j=1}^{n} \sum_{k=1}^{n} \lambda_{j} \lambda_{k} y_{j} y_{k} \mathbf{x}_{j}^{T} \mathbf{x}_{k} $$
As expected, to actually compute $\mathbf{w}$ and $b$, the training data $\left\{ \left( \mathbf{x}_{k} , y_{k} \right) \right\}_{k=1}^{n}$ is required.
The point to note here is that the inner product $\mathbf{x}_{j}^{T} \mathbf{x}_{k}$ of $\mathbf{x}_{j}$ and $\mathbf{x}_{k}$ appears in the equation. Ultimately we must take an inner product, and if $X$ is not an inner product space, there is no guarantee that this smooth path is available to us. Conversely, even if $X$ is not an inner product space, if a transformation $\phi$ can send $X$ into an inner product space, then considering an SVM with this objective function becomes plausible. In [Machine Learning](../../categories/Machine Learning), a function
$$ k \left( \mathbf{x}_{j} , \mathbf{x}_{k} \right) = \left\langle \phi \left( \mathbf{x}_{j} \right) , \phi \left( \mathbf{x}_{k} \right) \right\rangle $$
built from a transformation $\phi$ and the inner product of two vectors is sometimes referred to as a Kernel. [ NOTE: Even within data science, there are other kernels that could be confused with this one. The original mathematical kernel is a function with the same name but an entirely different role. ]
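To see that only inner products matter, here is a hedged Julia sketch that evaluates the dual objective above using nothing but a Gram (kernel) matrix $K_{jk} = k \left( \mathbf{x}_{j} , \mathbf{x}_{k} \right)$; the kernel, the data, and the multipliers $\lambda_{k}$ are made-up values for illustration and are not the solution of the dual problem.

```julia
using LinearAlgebra

# hypothetical kernel: the inner product after the lift Φ(x) = (x₁, x₂, x₁² + x₂²)
k(u, v) = dot(u, v) + norm(u)^2 * norm(v)^2

# made-up data (one point per column), labels, and multipliers
X = [0.1 0.2 1.0 1.2;
     0.0 0.1 0.9 1.1]
y = [-1.0, -1.0, 1.0, 1.0]
λ = [0.5, 0.3, 0.4, 0.4]

n = length(y)
K = [k(X[:, i], X[:, j]) for i in 1:n, j in 1:n]   # Gram matrix Kᵢⱼ = k(xᵢ, xⱼ)

# dual objective Σₖ λₖ − ½ ΣⱼΣₖ λⱼ λₖ yⱼ yₖ k(xⱼ, xₖ), written with K only
L_dual = sum(λ) - 0.5 * (λ .* y)' * K * (λ .* y)
println("dual objective = ", L_dual)
```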
If you can accept the content up to this point mathematically, you should now understand why introducing a transformation $\phi$, which is not even a kernel by itself, is called the Kernel Trick, and why it matters that the space we land in afterwards is guaranteed to be an inner product space.
Several kernels satisfying certain conditions can be considered; in particular, the original SVM can itself be seen as using a Linear Kernel
$$ k \left( \mathbf{x}_{j} , \mathbf{x}_{k} \right) = \mathbf{x}_{j}^{T} \mathbf{x}_{k} $$
See Also
The Kernel Trick section dealt with mathematically simple content, but if you’re interested in deeper theories, go beyond SVM and study the following topics:
- Kernels in Machine Learning and Reproducing Kernel Hilbert Spaces
- Proof of the Representation Theorem
Code
The following Julia code implements the kernel trick illustrated above: it samples two ring-shaped classes in $\mathbb{R}^{2}$, plots the raw data, and then lifts the points along the new axis $z = \phi \left( \mathbf{x} \right) = x_{1}^{2} + x_{2}^{2}$ so that a plane can separate them.
```julia
# Unit sphere type: rand(Sphere(d), n) samples n points uniformly on S^(d-1) ⊂ ℝᵈ.
struct Sphere
    d::Int64
end

import Base.rand
function rand(Topology::Sphere, n::Int64)
    direction = randn(Topology.d, n)                               # random directions
    boundary = direction ./ sqrt.(sum(abs2, direction, dims = 1))  # normalize to unit length
    return boundary
end

using Plots

# Two classes in ℝ²: A is a small noisy ring near the origin, B a larger noisy ring outside.
A = 0.3rand(Sphere(2), 200) + 0.1randn(2, 200)
B = rand(Sphere(2), 200) + 0.1randn(2, 200)

# The raw data: no single line separates the two classes.
scatter(A[1, :], A[2, :], ratio = :equal, label = "+1")
scatter!(B[1, :], B[2, :], ratio = :equal, label = "-1")
png("raw.png")

# Lift each point along the new axis z = ϕ(x) = x₁² + x₂²: a plane now separates the classes.
Plots.plotly()
ϕ(z) = z[1]^2 + z[2]^2
scatter(A[1, :], A[2, :], ϕ.(eachcol(A)), ms = 1, label = "+1")
scatter!(B[1, :], B[2, :], ϕ.(eachcol(B)), ms = 1, label = "-1")
savefig("kernel.html")
```
Jakkula. (2006). Tutorial on Support Vector Machine (SVM). https://course.ccs.neu.edu/cs5100f11/resources/jakkula.pdf
https://ratsgo.github.io/machine%20learning/2017/05/23/SVM/
https://bkshin.tistory.com/entry/Machine-Learning-2-Support-Vector-Machine-SVM