Here we are on Day 10 of my Machine Learning “Advent Calendar”. I would like to thank you for your support.
I have been building these Google Sheets files for years. They have evolved little by little. But when it is time to publish them, I always need hours to reorganize everything, clean the layout, and make them pleasant to read.
Today, we move to DBSCAN.
DBSCAN Does Not Learn a Parametric Model
Just like LOF, DBSCAN is not a parametric model. There is no formula to store, no rules, no centroids, and nothing compact to reuse later.
We must keep the whole dataset because the density structure depends on all points.
Its full name is Density-Based Spatial Clustering of Applications with Noise.
But be careful: this “density” is not a Gaussian density.
It is a count-based notion of density: simply “how many neighbors live close to me?”
Why DBSCAN Is Special
As its name indicates, DBSCAN does two things at the same time:
- it finds clusters
- it marks anomalies (the points that do not belong to any cluster)
This is exactly why I present the algorithms in this order:
- k-means and GMM are clustering models. They output a compact object: centroids for k-means; means, variances, and mixture weights for GMM.
- Isolation Forest and LOF are pure anomaly detection models. Their only goal is to find unusual points.
- DBSCAN sits in between. It does both clustering and anomaly detection, based only on the notion of neighborhood density.
A Tiny Dataset to Keep Things Intuitive
We stay with the same tiny dataset that we used for LOF: 1, 2, 3, 7, 8, 12
If you look at these numbers, you already see two compact groups:
one around 1–2–3, another around 7–8, and 12 living alone.
DBSCAN captures exactly this intuition.
Summary in 3 Steps
DBSCAN asks three simple questions for each point:
- How many neighbors do you have within a small radius (eps)?
- Do you have enough neighbors to become a Core point (minPts)?
- Once we know the Core points, to which connected group do you belong?
Here is the summary of the DBSCAN algorithm in 3 steps:

Let us begin step by step.
DBSCAN in 3 steps
Now that we understand the idea of density and neighborhoods, DBSCAN becomes very easy to describe.
Everything the algorithm does fits into three simple steps.
Step 1 – Count the neighbors
The goal is to check how many neighbors each point has.
We take a small radius called eps.
For each point, we look at all the other points and mark those whose distance is at most eps.
These are its neighbors.
This gives us the first idea of density:
a point with many neighbors is in a dense region,
a point with few neighbors lives in a sparse region.
For a 1-dimensional toy example like ours, a common choice is:
eps = 2
We draw a little interval of radius 2 around each point.
Why is it called eps?
The name eps comes from the Greek letter ε (epsilon), which is traditionally used in mathematics to represent a small quantity or a small radius around a point.
So in DBSCAN, eps is literally “the small neighborhood radius”.
It answers the question:
How far do we look around each point?
So in Excel, the first step is to compute the pairwise distance matrix, then count how many neighbors each point has within eps.
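To make Step 1 concrete outside the spreadsheet, here is a minimal Python sketch (an illustration of the same logic, not the Excel formulas themselves) that builds the pairwise distance matrix and the neighbor counts for our toy data:

```python
points = [1, 2, 3, 7, 8, 12]
eps = 2

# Pairwise distance matrix: with 1-dimensional data, the distance is just the absolute difference.
distances = [[abs(p - q) for q in points] for p in points]

# Neighbor count per point: how many points (the point itself included) lie within eps.
neighbor_counts = [sum(d <= eps for d in row) for row in distances]

for p, count in zip(points, neighbor_counts):
    print(f"point {p}: {count} points within eps = {eps}")
# point 1: 3, point 2: 3, point 3: 3, point 7: 2, point 8: 2, point 12: 1 (only itself)
```

Already at this stage, 12 stands out: its neighborhood contains nothing but itself.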

Step 2 – Core Points and Density Connectivity
Now that we know the neighbors from Step 1, we apply minPts to decide which points are Core.
Here, minPts stands for “minimum number of points”.
It is the smallest number of points that must fall inside the eps radius of a point for it to be considered a Core point.
A point is Core if its eps neighborhood contains at least minPts points (counting the point itself, as in the standard definition).
Otherwise, it may become Border or Noise.
With eps = 2 and minPts = 2, every point except 12 is Core: 12 has no other point within distance 2, so its neighborhood contains only itself.
Once the Core points are known, we simply check which points are density-reachable from them. If a point can be reached by moving from one Core point to another within eps, it belongs to the same group.
In Excel, we can represent this as a simple connectivity table that shows which points are linked through Core neighbors.
This connectivity is what DBSCAN uses to form clusters in Step 3.
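Continuing the same Python sketch (again an illustration, not the spreadsheet formulas), here is how the minPts test and a simple connectivity table could look:

```python
# Continuing the Step 1 sketch: apply minPts to find the Core points,
# then build a small connectivity table (which points are linked through a Core point).
min_pts = 2

# A point is Core if its neighborhood (itself included) contains at least minPts points.
is_core = [count >= min_pts for count in neighbor_counts]
# -> [True, True, True, True, True, False]: only 12 fails the test.

# Connectivity table: i and j are linked if they are within eps of each other
# and at least one of them is a Core point.
n = len(points)
linked = [[i == j or (distances[i][j] <= eps and (is_core[i] or is_core[j]))
           for j in range(n)] for i in range(n)]
```

In the spreadsheet, this table is exactly the grid of TRUE/FALSE cells that shows which points can reach each other through Core neighbors.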

Step 3 – Assign cluster labels
The goal is to turn connectivity into actual clusters.
Once the connectivity matrix is ready, the clusters appear naturally.
DBSCAN simply groups all connected points together.
To give each group a simple and reproducible name, we use a very intuitive rule:
The cluster label is the smallest point in the connected group.
For example:
- Group {1, 2, 3} becomes cluster 1
- Group {7, 8} becomes cluster 7
- A point like 12 with no Core neighbors becomes Noise
This is exactly what we will display in Excel using formulas.
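Here is a small sketch of Step 3 in the same Python illustration: a flood fill over the connectivity table, with each group named after its smallest point. (In this toy example every clustered point is Core, which keeps the walk simple; with Border points, only Core points would be allowed to propagate the walk further.)

```python
# Turn the connectivity table into cluster labels.
# Labeling rule from the text: a group is named after its smallest point,
# and a point linked to nothing (and not Core) is marked as Noise.
labels = {}
for i, p in enumerate(points):
    # Flood fill: collect everything reachable from point i through the linked table.
    group, frontier = {i}, [i]
    while frontier:
        current = frontier.pop()
        for j in range(n):
            if linked[current][j] and j not in group:
                group.add(j)
                frontier.append(j)
    if len(group) == 1 and not is_core[i]:
        labels[p] = "Noise"
    else:
        labels[p] = min(points[j] for j in group)

print(labels)  # {1: 1, 2: 1, 3: 1, 7: 7, 8: 7, 12: 'Noise'}
```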

Final thoughts
DBSCAN is perfect to teach the idea of local density.
There is no probability, no Gaussian formula, no estimation step.
Just distances, neighbors, and a small radius.
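As a quick cross-check outside the spreadsheet (assuming scikit-learn is available), the library version of DBSCAN with the same parameters finds the same partition; it simply uses the labels 0, 1 and -1 instead of the smallest-point rule:

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1], [2], [3], [7], [8], [12]])

# eps=2 and min_samples=2 (min_samples counts the point itself, like minPts above).
db = DBSCAN(eps=2, min_samples=2).fit(X)

print(db.labels_)  # [ 0  0  0  1  1 -1]: two clusters, and -1 marks 12 as noise
```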
But this simplicity also limits it.
Because DBSCAN uses one fixed radius for everyone, it cannot adapt when the dataset contains clusters of different scales.
HDBSCAN keeps the same intuition, but looks at all radii and keeps what remains stable.
It is far more robust, and much closer to how humans naturally see clusters.
With DBSCAN, we have reached a natural moment to step back and summarize the unsupervised models we have explored so far, as well as a few others we have not covered.
It is a good opportunity to draw a small map that links these algorithms together and shows where each of them sits in the broader landscape.
- Distance-based models: K-means, K-medoids, and hierarchical clustering (HAC) work by comparing distances between points or between groups.
- Density-based models: Mean Shift and Gaussian Mixture Models (GMM) estimate a smooth density and extract clusters from its structure.
- Neighborhood-based models: DBSCAN, OPTICS, HDBSCAN, and LOF define clusters and anomalies from local connectivity rather than global distance.
- Graph-based models: Spectral clustering, Louvain, and Leiden rely on structure inside similarity graphs.
Each group reflects a different philosophy of what a “cluster” is.
Your choice of algorithm often depends less on theory and more on the shape of the data, the scale of its densities, and the kinds of structures you expect to find.
Here is how these methods connect to each other:
- K-means generalizes into GMM when you replace hard assignments with probabilistic densities.
- DBSCAN generalizes into OPTICS when you remove the need for a single eps value.
- OPTICS leads naturally to HDBSCAN, which turns density connectivity into a stable hierarchy.
- HAC and Spectral clustering both build clusters from pairwise distances, but Spectral adds a graph-based view.
- LOF uses the same neighborhoods as DBSCAN, but only for anomaly detection.
There are many more models, but this gives a sense of the landscape and where DBSCAN fits inside it.

Tomorrow, we will continue the Advent Calendar with models that are more “classic” and widely used in everyday machine learning.
Thank you for following the journey so far, and see you tomorrow.