k-Means Clustering

Algorithm

Input

Given data of dimension $p$ , $N$ pieces, and a natural number $k$ .

Step 1. Initialization

Randomly set $k$ points $\mu_{1} , \cdots , \mu_{k}$ . Each $\mu_{j}$ will become the mean of cluster $M_{j}$ .

$20191001\_085422.png$

Step 2. Distance Calculation

Calculate $\| x_{i} - \mu_{j} \|$ for the $i$ th data $x_{i}$ and $j = 1 , \cdots , k$ . Choose the smallest one to make it $x_{i} \in M_{j}$ . Repeat this for each $i = 1 , \cdots , N$ .

$20191001\_085734.png$

Step 3. Update $\mu_{j}$

Repeat the following update for $j = 1 , \cdots , k$ . $\mu_{j} \gets \sum_{x \in M_{j}} {{ x } \over {| M_{j} |}}$ If the $\mu_{j}$ s no longer change after the update, end the algorithm. If properly updated, go back to Step 2.

$20191001\_091309.png$

Output

Return the centroid $\mu_{j}$ of each cluster and which cluster each data belongs to.

Here, $\| \cdot \|$ is called the $l_{2}$ -norm, or Euclidean norm, usual norm, representing the concept of distance.

Description

$k$ -mean clustering is an unsupervised learning method that clusters given data into $k$ classes. The above illustrations particularly represent the division of $N=20$ pieces of data given in $p=2$ dimensions into $k=2$ classes.

The mean minimizes the squared error and is a representative value. $k$ -means clustering classifies data well based on this mathematical fact. Simply put, adding coordinates and taking the average means geometrically it represents the center of mass, so $\mu_{j}$ s are drawn into the center of some data. Of course, to guarantee this, stronger conditions and mathematical explanations are necessary, but since $k$ -means clustering is primarily admired for its simplicity, we will not delve deeply into it.

Advantages and Disadvantages

The biggest problem is that this classification is not very accurate, yet despite the many algorithms that have come out to compensate for this, none can overpower this algorithm with simplicity. From the data scientist’s perspective, the first thing to try upon receiving data is precisely $k$ -means clustering. It is not about ending the analysis with one clustering but using the algorithm to get a feel for the data as the analysis begins.