How to k-means cluster in Julia

Description

k-means clustering is an algorithm that partitions $n$ given data points into $k$ clusters, assigning each point to the cluster whose center is nearest. In Julia, it can be run easily with the Clustering.jl package.
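
To make the idea concrete, here is a minimal sketch of the basic iteration (often called Lloyd's algorithm) in plain Julia. The function name lloyd_kmeans and its arguments are hypothetical and are not part of Clustering.jl, whose implementation is more refined (for example in how it picks the initial centers). The sketch only alternates between assigning points to the nearest center and moving each center to the mean of its points.

```julia
using Random
using Statistics

# Hypothetical helper, not part of Clustering.jl: a plain Lloyd iteration.
# X is d×n, with one data point per column.
function lloyd_kmeans(X::AbstractMatrix{<:Real}, k::Integer;
                      maxiter::Integer=100, rng=Random.default_rng())
    d, n = size(X)
    # Start from k distinct data points chosen at random
    centers = Float64.(X[:, randperm(rng, n)[1:k]])
    assignments = zeros(Int, n)
    for _ in 1:maxiter
        # Assignment step: each point goes to the center with the smallest squared distance
        new_assignments = [argmin([sum(abs2, X[:, j] .- centers[:, c]) for c in 1:k])
                           for j in 1:n]
        # Stop once no point changes cluster
        new_assignments == assignments && break
        assignments = new_assignments
        # Update step: move each center to the mean of its members
        for c in 1:k
            members = findall(==(c), assignments)
            isempty(members) || (centers[:, c] = vec(mean(X[:, members], dims=2)))
        end
    end
    return centers, assignments
end

# Two well-separated groups of 2-D points, stored as columns
X = [0.0 0.1 0.2 10.0 10.1 10.2;
     0.0 0.2 0.1 10.0 10.2 10.1]
centers, assignments = lloyd_kmeans(X, 2)
```

On this toy input the first three columns should end up in one cluster and the last three in the other, although which cluster is labeled 1 depends on the random start.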

Code

The following code performs clustering with $k=3$ on the Iris dataset. Data loaded from RDatasets.jl are data frames by default, so the four feature columns are converted into an array and transposed so that each column becomes a single data point. In the run shown here, it converged in 4 iterations; because the initial centers are chosen at random, other runs may differ.

using Clustering
using RDatasets

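# Keep the four numeric feature columns (drop Species)
X = dataset("datasets", "iris")[:, 1:4]
# Convert to a matrix and transpose so that each column is one data point
X = Array(X)'

# Cluster into k = 3 groups, printing the progress of each iteration
results = kmeans(X, 3, display=:iter)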
# Iters               objv        objv-change | affected 
# -------------------------------------------------------------
#       0       9.002000e+01
#       1       7.934436e+01      -1.067564e+01 |        2
#       2       7.892131e+01      -4.230544e-01 |        2
#       3       7.885567e+01      -6.564390e-02 |        0
#       4       7.885567e+01       0.000000e+00 |        0
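
Since the starting centers are random, the number of iterations and the numbering of the clusters can change from run to run. The following is a minimal sketch of one way to get repeatable results, assuming kmeans draws from Julia's global random number generator by default. The data here are synthetic stand-ins rather than the Iris dataset, and the seed values are arbitrary.

```julia
using Clustering
using Random

# Synthetic stand-in data (not Iris): 4 features × 150 points
Random.seed!(42)
X = rand(4, 150)

# Seed the global RNG right before each call so both runs start identically
Random.seed!(1)
r1 = kmeans(X, 3; maxiter=200)
Random.seed!(1)
r2 = kmeans(X, 3; maxiter=200)

println(r1.assignments == r2.assignments)  # expected to be true: same seed, same start
println(r1.converged)                      # whether the run stopped before maxiter
```

The maxiter keyword caps the number of iterations, and .converged reports whether the algorithm stopped on its own before reaching that cap.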

The result returned by kmeans has 9 properties. Among them, .centers holds the center of each cluster (one per column), .assignments holds the cluster index assigned to each data point, and .counts holds the number of data points in each cluster.

julia> propertynames(results)
(:centers, :assignments, :costs, :counts, :wcounts, :totalcost, :iterations, :converged, :cweights)

julia> results.centers
4×3 Matrix{Float64}:
 5.006  6.85385  5.88361
 3.428  3.07692  2.74098
 1.462  5.71538  4.38852
 0.246  2.05385  1.43443

julia> results.assignments
150-element Vector{Int64}:
 1
 1
 1
 ⋮
 2
 2
 3

julia> results.counts
3-element Vector{Int64}:
 50
 39
 61
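
One way to judge the clustering is to compare the assignments with the known species labels. The sketch below builds a small cross-tabulation by hand; the variable names species_names and crosstab are only illustrative. Cluster numbers are arbitrary, so what matters is how each row spreads across the columns, not which column it lands in.

```julia
using Clustering
using RDatasets

iris = dataset("datasets", "iris")
X = Array(iris[:, 1:4])'
results = kmeans(X, 3)

# True species labels as plain strings
species = string.(iris.Species)
species_names = unique(species)

# Rows: species, columns: cluster index
crosstab = [count((species .== s) .& (results.assignments .== c))
            for s in species_names, c in 1:3]

for (i, s) in enumerate(species_names)
    println(rpad(s, 12), crosstab[i, :])
end
```

Each row sums to 50 because the dataset contains 50 flowers of each species.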

The clusters can be visualized by plotting sepal length against sepal width.

[Figure: sepal length versus sepal width]
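
A plot of this kind could be produced roughly as follows. This is a sketch that assumes Plots.jl, which is not listed in the environment below, so the original figure may have been drawn differently; the marker and label choices are purely illustrative.

```julia
using Clustering
using RDatasets
using Plots  # assumption: the plotting library is not stated in the original

iris = dataset("datasets", "iris")
X = Array(iris[:, 1:4])'
results = kmeans(X, 3)

# Rows 1 and 2 of X are sepal length and sepal width
p = scatter(X[1, :], X[2, :];
            group=results.assignments,
            xlabel="Sepal length", ylabel="Sepal width",
            legendtitle="Cluster")

# Overlay the cluster centers projected onto the same two features
scatter!(p, results.centers[1, :], results.centers[2, :];
         marker=:star5, markersize=10, color=:black, label="centers")
```

Only two of the four features are shown, so clusters that look overlapping in this view may still be separated along the petal measurements.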

Environment

OS: Windows 11
Version: Julia 1.10.0, Clustering v0.15.7, RDatasets v0.7.7