DBSCAN: Density-Based Clustering

Algorithm [1]

DBSCAN is an algorithm that performs clustering based on the density of points in a metric space $\left( X , d \right)$. Two hyperparameters are given: a radius $\varepsilon > 0$ and a minimum number of points $m$ that serves as the density threshold.

  1. A finite set $D \subset X$ is called the database.
  2. For $p \in D$, $N_{\varepsilon} \left( p \right) = \left\{ q \in D \mid d \left( p , q \right) \le \varepsilon \right\}$ is called the neighborhood of $p$ with respect to $\varepsilon$.
  3. If $p \in D$ satisfies $| N_{\varepsilon} \left( p \right) | \ge m$ then $p$ is called a core point.
  4. If $p \in D$ is not a core point but $N_{\varepsilon} \left( p \right)$ contains a core point, then $p$ is called a border point.
  5. For a core point $q$, if $p \in N_{\varepsilon} \left( q \right)$ then $p$ is said to be directly reachable from $q$. If there exists a sequence $$ p = p_{1} , \cdots , p_{n} = q $$ in which $p_{k}$ is directly reachable from $p_{k+1}$ for every $k$, then $p$ is said to be reachable from $q$.
  6. If there exists a core point $o \in D$ from which both $p \in D$ and $q \in D$ are reachable, then $p$ and $q$ are said to be connected.
  7. A non-empty set $C \subseteq D$ is called a cluster if the following two conditions hold.
    • (i) Maximality: If $p \in C$ and $q \in D$ is reachable from $p$, then $q \in C$.
    • (ii) Connectivity: Every pair of points $p , q \in C$ is connected.
  8. A point that does not belong to any cluster is called noise. (The sketch just below this list classifies a small example database into these three kinds of points.)
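
To make these definitions concrete, here is a minimal sketch, assuming Euclidean distance on a small two-dimensional database stored as a NumPy array; the function name `classify_points` and the sample points are illustrative, not part of the original formulation.

```python
import numpy as np

def classify_points(D, eps, m):
    """Label each point of the database D as 'core', 'border', or 'noise',
    following the definitions above (Euclidean distance assumed)."""
    n = len(D)
    # Pairwise distances; note that N_eps(p) contains p itself, since d(p, p) = 0 <= eps.
    dists = np.linalg.norm(D[:, None, :] - D[None, :, :], axis=-1)
    neighborhoods = [np.where(dists[i] <= eps)[0] for i in range(n)]
    is_core = np.array([len(nb) >= m for nb in neighborhoods])

    labels = []
    for i in range(n):
        if is_core[i]:
            labels.append("core")
        elif is_core[neighborhoods[i]].any():
            labels.append("border")  # not core, but some core point lies within eps
        else:
            labels.append("noise")   # belongs to no cluster
    return labels

# A dense clump (points 0-3), one point hanging off its edge (4), one isolated point (5).
D = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [0.25, 0.0], [5.0, 5.0]])
print(classify_points(D, eps=0.2, m=4))
# -> ['core', 'core', 'core', 'core', 'border', 'noise']
```
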
Algorithm: Density-based clustering

Input: a radius $\varepsilon > 0$, a minimum number of points $m \in \mathbb{N}$, and a database $D$.

1. Select any point $p \in D$ that has not yet been processed.
2. If $\left| N_{\varepsilon}(p) \right| \ge m$, mark $p$ as a core point.
3. If the neighborhood of the core point $p$ contains another core point $q$, merge the clusters containing the two core points.
4. If there is an unprocessed point among the neighbors of $p$, select it and go to step 2; if there is none, go to step 1.
5. When all points have been processed, terminate the algorithm. For each non-core point, check whether it lies in some core point's neighborhood to distinguish border points from noise.

Output: the set of clusters $\left\{ C_{k} \right\}$.
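
The procedure can be sketched in a few dozen lines of Python. The sketch below assumes Euclidean distance on a NumPy array and, rather than literally merging clusters as in step 3, grows each cluster outward from an unprocessed core point, which yields the same partition; the name `dbscan` and the label convention (`-1` for noise) are illustrative choices, not prescribed by the original paper.

```python
import numpy as np
from collections import deque

def dbscan(D, eps, m):
    """Minimal DBSCAN sketch: returns one label per point
    (0, 1, 2, ... for clusters, -1 for noise). Euclidean distance assumed."""
    n = len(D)
    dists = np.linalg.norm(D[:, None, :] - D[None, :, :], axis=-1)
    neighborhoods = [np.where(dists[i] <= eps)[0] for i in range(n)]
    is_core = [len(nb) >= m for nb in neighborhoods]

    labels = [None] * n      # None = not yet assigned to a cluster
    cluster_id = -1
    for i in range(n):
        if labels[i] is not None or not is_core[i]:
            continue         # only an unassigned core point starts a new cluster
        cluster_id += 1
        labels[i] = cluster_id
        queue = deque(neighborhoods[i])
        while queue:         # absorb everything reachable from point i
            j = queue.popleft()
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            if is_core[j]:   # only core points propagate reachability further
                queue.extend(neighborhoods[j])
    # points never absorbed into any cluster are noise
    return [-1 if label is None else label for label in labels]

D = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [0.25, 0.0], [5.0, 5.0]])
print(dbscan(D, eps=0.2, m=4))  # -> [0, 0, 0, 0, 0, -1]
```

A border point that lies in the neighborhoods of core points from two different clusters is simply assigned to whichever cluster reaches it first, which is also how the original algorithm resolves the tie.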

Explanation [2]

The difference between reachability and connectivity lies in whether the relation is restricted to core points. Connectivity is a symmetric relation, but reachability requires the point one starts from to be a core point; therefore, even if $p$ is reachable from $q$, the converse need not hold. For example, a border point is reachable from a core point whose neighborhood contains it, but no point is reachable from a border point.
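
As a small illustration, the sketch below (the helper name `directly_reachable` and the toy points are illustrative) shows that a border point is directly reachable from a core point, while nothing is reachable from the border point itself, because it is not a core point.

```python
import numpy as np

def directly_reachable(D, eps, m, src, dst):
    """dst is directly reachable from src iff src is a core point and dst lies in N_eps(src)."""
    dists = np.linalg.norm(D - D[src], axis=1)
    return (dists <= eps).sum() >= m and dists[dst] <= eps

D = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [0.25, 0.0]])
# With eps = 0.2 and m = 4, point 1 is a core point, while point 4 is only a border point.
print(directly_reachable(D, 0.2, 4, src=1, dst=4))  # True:  the border point is reachable from the core point
print(directly_reachable(D, 0.2, 4, src=4, dst=1))  # False: nothing is reachable from the border point
```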


Popularity

As of 2025, DBSCAN shows such consistently good performance that it is often regarded as the de facto last word in clustering, and it has been in wide use for a long time. Where $k$-means clustering trades on simplicity and hierarchical clustering on interpretability, DBSCAN exhibits high robustness even on relatively complex datasets.
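
As a quick, hedged illustration of that robustness, the snippet below runs scikit-learn's DBSCAN on the classic two-moons toy dataset, a non-convex shape that $k$-means cannot separate cleanly; the values eps=0.2 and min_samples=5 are illustrative and would normally be tuned for the data at hand.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-circles: a non-convex dataset where centroid-based methods struggle.
X, _ = make_moons(n_samples=500, noise=0.05, random_state=0)

labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)  # label -1 marks noise

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("clusters found:", n_clusters)                    # typically 2 with these settings
print("noise points:", int(np.sum(labels == -1)))
```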

Clustering algorithms have continued to evolve since DBSCAN, but, as with $k$-means and hierarchical clustering, most developments have been incremental variants that have not displaced the established methods. These three algorithms are often regarded as the trinity of clustering, while techniques such as the EM algorithm receive more modest coverage in textbooks.

In particular, since the advent of the AI era, supervised learning has become dominant, and innovations in clustering techniques have largely stalled. Put differently, if you do not intend to research clustering itself further, studying DBSCAN may reasonably be the last step.


  1. Ester, M., Kriegel, H.-P., Sander, J., & Xu, X. (1996, August). A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD (Vol. 96, No. 34, pp. 226-231). https://cdn.aaai.org/KDD/1996/KDD96-037.pdf ↩︎

  2. https://www.digitalvidya.com/blog/the-top-5-clustering-algorithms-data-scientists-should-know/ ↩︎