Clustering - Finding Structure When Nobody Gave You Labels // Megha Bose

Helpful context:

Most machine learning assumes you have labels. Someone went through thousands of emails and marked each one “spam” or “not spam.” Someone diagnosed each tumor as malignant or benign. Supervised learning uses these labels to learn a decision boundary.

Clustering does something harder: it finds structure in data when nobody told you what the structure should be. You have ten million customer transactions and no idea how many distinct “types” of customer exist, let alone which customer belongs to which type. You have millions of news articles and no predefined categories. Clustering finds the groupings the data itself suggests.

This makes clustering both more open-ended and more subjective than classification. There is no ground truth to check against. A clustering is “good” if it is useful - and useful means different things to different problems.

What a Cluster Actually Is

A cluster is a group of points that are more similar to each other than they are to points in other groups. The tricky part: what “similar” means, and how to measure it, is a modeling choice that shapes the entire result.

Euclidean distance in feature space is the default: two customers are similar if their age, income, purchase frequency, and average basket size are close numerically. But if features have different scales (age in years vs income in rupees), Euclidean distance is dominated by the high-magnitude features. Standardizing features before clustering is not optional.

Other similarity measures appear in specific domains: cosine similarity for text (measures angle, not magnitude - two documents with the same word distribution but different lengths are considered identical), Jaccard similarity for sets (shared items / total items), edit distance for strings.

k-Means: The Workhorse

k-means is the algorithm most people learn first and use most often, and for good reason - it scales to millions of points, runs fast, and produces interpretable results.

The idea: partition $n$ points into $k$ clusters, where each point belongs to the cluster whose centroid (mean) is closest to it. The algorithm alternates between two steps:

Assignment step: for each point, find the nearest centroid and assign the point to that cluster.

Update step: recompute each centroid as the mean of all points currently assigned to it.

Repeat until assignments stop changing. The algorithm is guaranteed to converge (the total within-cluster sum of squared distances decreases or stays constant at every step), but it converges to a local minimum, not necessarily the global one.

The local minimum problem is real. Different random initializations can produce very different clusterings. The standard fix is k-means++ initialization (Arthur and Vassilvitskii, 2007): instead of choosing initial centroids uniformly at random, choose each successive centroid with probability proportional to its squared distance from the nearest already-chosen centroid. This spreads the initial centroids across the data space, dramatically reducing the probability of a bad local minimum. Scikit-learn uses k-means++ by default for this reason.

The k problem: you must specify $k$ upfront. Two approaches to choose it:

The elbow method plots within-cluster sum of squares (inertia) against $k$. Inertia always decreases as $k$ increases (at $k = n$ each point is its own cluster). Look for the “elbow” - the point where adding another cluster gives diminishing returns. The elbow is often ambiguous but gives a reasonable starting range.

The silhouette score for each point measures how similar it is to its own cluster versus the nearest other cluster. Score of 1 means well inside its cluster; score near 0 means on the boundary; score of -1 means likely in the wrong cluster. Average silhouette across all points peaks at the best $k$. Unlike inertia, silhouette can be compared across different values of $k$ without ambiguity.

What k-means assumes and gets wrong:

k-means minimizes squared Euclidean distance to centroids. This implicitly assumes clusters are:

Spherical: every cluster looks like a ball in feature space. Long, thin, or crescent-shaped clusters get split incorrectly.
Similarly sized: the algorithm assigns every point to its nearest centroid, so large clusters tend to be split and small dense clusters tend to be absorbed by neighbors.
Non-overlapping: no notion of soft membership or uncertainty.

k-means is also sensitive to outliers. One point at $x = 10^6$ pulls its cluster’s centroid significantly away from the other members. Clipping outliers before clustering is standard practice.

Despite these limitations, k-means is usually the right first algorithm to try. It is fast ($O(nkt)$ per iteration where $t$ is the number of iterations), scales to millions of points, and its assumptions hold reasonably well for many real-world datasets.

DBSCAN: Clusters of Any Shape

DBSCAN (Density-Based Spatial Clustering of Applications with Noise, Ester et al. 1996) takes a fundamentally different view. Instead of minimizing distance to centroids, it asks: in which regions of space is the data dense?

The two parameters:

eps ($\varepsilon$): the neighborhood radius. Two points are neighbors if their distance is less than eps.
min_samples: the minimum number of points (including itself) in a point’s eps-neighborhood for that point to be a “core point.”

Three types of points emerge:

Core point: has at least min_samples points within eps distance. These form the interior of clusters.
Border point: within eps of a core point but doesn’t have enough neighbors to be core itself. These are on the cluster’s edge.
Noise point: not within eps of any core point. These are outliers.

A cluster is formed by connecting core points that are within eps of each other, then adding their border points. The algorithm never forces every point into a cluster - noise points are labeled $-1$ and left unassigned.

Why this is powerful: DBSCAN finds clusters of arbitrary shape. A crescent, a ring, two interleaved spirals - as long as the interior is dense, DBSCAN finds it. k-means would split each shape into pieces.

Pros:

Does not require specifying $k$ upfront.
Automatically identifies outliers (noise points).
Finds clusters of arbitrary shape.
Not sensitive to initialization.

Cons:

Requires choosing eps and min_samples, which interact in non-obvious ways. Too small an eps and everything is noise; too large and everything merges into one cluster.
Struggles with clusters of varying density. A cluster with 100 tightly-packed points and a cluster with 100 sparsely-spread points require different eps values to be detected simultaneously.
Memory cost is $O(n^2)$ in the worst case for the distance computations (though spatial indexing can reduce this to $O(n \log n)$ for low dimensions).

Choosing eps: plot the distance from each point to its $k$-th nearest neighbor, sorted. The distance at which there is a sharp jump is a good candidate for eps. This is the “k-distance graph” heuristic.

Hierarchical Clustering: The Dendrogram

Hierarchical clustering builds a tree (dendrogram) that encodes all possible clusterings simultaneously, from one big cluster (everything together) down to $n$ singleton clusters (every point alone). You then cut the dendrogram at a height to get the $k$ clusters you want.

Agglomerative clustering builds bottom-up: start with every point as its own cluster, then repeatedly merge the two clusters that are most similar, until one cluster remains.

The merge criterion is the linkage: how to measure distance between two clusters (not two points)?

Single linkage: distance between the nearest pair of points across clusters. Tends to produce long, chained clusters (chaining effect).
Complete linkage: distance between the farthest pair of points across clusters. Produces more compact, similar-sized clusters. Sensitive to outliers.
Average linkage: average pairwise distance across clusters. A compromise.
Ward linkage: merge the pair that minimizes the increase in total within-cluster variance. Produces clusters of similar size, often the best default.

Pros:

No need to choose $k$ upfront. The dendrogram lets you explore all values of $k$ at once.
The dendrogram is interpretable: it shows which clusters merged at what scale, revealing hierarchical structure (sub-groups within groups).
Works with any distance metric.

Cons:

$O(n^2)$ memory and $O(n^2 \log n)$ or $O(n^3)$ time depending on implementation. Does not scale to large datasets. Practical limit is roughly tens of thousands of points.
Greedy: once two clusters merge, they never un-merge, even if a better global solution would have separated them differently.

Evaluating Clusters Without Labels

Supervised evaluation (accuracy, F1) is not available because you have no ground truth. Two internal metrics dominate:

Silhouette score: for each point $i$, let $a_i$ be the mean distance to other points in its cluster and $b_i$ be the mean distance to points in the nearest other cluster. The silhouette is $(b_i - a_i) / \max(a_i, b_i)$. Ranges from $-1$ to $1$; higher is better. Average over all points gives the overall score. Valid for any clustering algorithm.

Inertia (within-cluster sum of squares): sum of squared distances from each point to its centroid. Only defined for k-means-style algorithms. Useful for comparing different $k$ values on the same algorithm, but not comparable across algorithms.

If you have ground truth labels for a subset (e.g., you know that half your customers belong to “loyal” vs “new”), you can use the Adjusted Rand Index (ARI) to compare your clustering to those labels. ARI of 1 means perfect match; 0 means random.

Practical Choices

Algorithm	Good for	Not good for
k-means	Large datasets, spherical clusters, speed	Arbitrary shapes, unknown $k$, outliers
DBSCAN	Arbitrary shapes, outlier detection, unknown $k$	Varying density, high dimensions
Hierarchical (Ward)	Small datasets, exploratory analysis, dendrogram insight	Large datasets ($n > 50{,}000$)

In practice: try k-means first with k-means++ and multiple random seeds. If the clusters look geometrically implausible (non-spherical, very uneven sizes), try DBSCAN or hierarchical. For datasets above a few hundred thousand points, k-means or mini-batch k-means are essentially the only options; the others become computationally impractical.

Concept	Key point
k-means	Minimizes within-cluster sum of squares; local minima; use k-means++
Elbow method	Plot inertia vs $k$; look for diminishing returns
Silhouette score	Measures cohesion vs separation; higher is better; picks best $k$
DBSCAN	Density-based; arbitrary shapes; automatic outlier detection
eps/min_samples	DBSCAN hyperparameters; use k-distance graph to choose eps
Ward linkage	Minimizes variance increase on merge; best default for hierarchical
Dendrogram	Encodes all possible $k$-clusterings; cut at desired height

Read Next:

Anomaly Detection - Finding the Needles Without Knowing the Haystack