logo

k-Means Clustering 📂Machine Learning

k-Means Clustering

Algorithm

Input

Given data of dimension pp, NN pieces, and a natural number kk.


Step 1. Initialization

Randomly set kk points μ1,,μk\mu_{1} , \cdots , \mu_{k}. Each μj\mu_{j} will become the mean of cluster MjM_{j}.

20191001\_085422.png


Step 2. Distance Calculation

Calculate xiμj\| x_{i} - \mu_{j} \| for the iith data xix_{i} and j=1,,kj = 1 , \cdots , k. Choose the smallest one to make it xiMjx_{i} \in M_{j}. Repeat this for each i=1,,Ni = 1 , \cdots , N.

20191001\_085734.png


Step 3. Update μj\mu_{j}

Repeat the following update for j=1,,kj = 1 , \cdots , k. μjxMjxMj \mu_{j} \gets \sum_{x \in M_{j}} {{ x } \over {| M_{j} |}} If the μj\mu_{j}s no longer change after the update, end the algorithm. If properly updated, go back to Step 2.

20191001\_091309.png


Output

Return the centroid μj\mu_{j} of each cluster and which cluster each data belongs to.


  • Here, \| \cdot \| is called the l2l_{2}-norm, or Euclidean norm, usual norm, representing the concept of distance.

Description

kk-mean clustering is an unsupervised learning method that clusters given data into kk classes. The above illustrations particularly represent the division of N=20N=20 pieces of data given in p=2p=2 dimensions into k=2k=2 classes.

The mean minimizes the squared error and is a representative value. kk-means clustering classifies data well based on this mathematical fact. Simply put, adding coordinates and taking the average means geometrically it represents the center of mass, so μj\mu_{j}s are drawn into the center of some data. Of course, to guarantee this, stronger conditions and mathematical explanations are necessary, but since kk-means clustering is primarily admired for its simplicity, we will not delve deeply into it.

Advantages and Disadvantages

The biggest problem is that this classification is not very accurate, yet despite the many algorithms that have come out to compensate for this, none can overpower this algorithm with simplicity. From the data scientist’s perspective, the first thing to try upon receiving data is precisely kk-means clustering. It is not about ending the analysis with one clustering but using the algorithm to get a feel for the data as the analysis begins.