K-Means Clustering
K-Means minimises inertia: Σᵢ min_k ‖xᵢ − μₖ‖². The algorithm alternates between assignment (each instance → nearest centroid) and update (centroids → mean of assigned instances) — this is Lloyd's algorithm. Convergence is guaranteed but to a local minimum. K-Means++ initialises centroids with D²-weighted sampling for dramatically better results.
1. Initialise: place K centroids in the feature space, at random in the simplest scheme. K-Means++ instead chooses initial centroids spread far apart, giving faster convergence and better results.
2. Assign: each instance is assigned to the nearest centroid (Euclidean distance). This is the "E-step."
3. Update: each centroid moves to the mean of all instances assigned to it. This is the "M-step."
4. Repeat: alternate assignment and update until the centroids stop moving. Convergence is guaranteed, but possibly to a local minimum, so run multiple times (n_init=10).
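A minimal scikit-learn sketch of this loop; the make_blobs toy data and the choice of four clusters are assumptions for illustration only:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Assumed toy data: four well-separated blobs, purely for illustration
X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

# init="k-means++" gives the D²-weighted initialisation described above;
# n_init=10 reruns the whole algorithm and keeps the lowest-inertia run
kmeans = KMeans(n_clusters=4, init="k-means++", n_init=10, random_state=42)
labels = kmeans.fit_predict(X)    # assignment step: nearest centroid per instance
print(kmeans.cluster_centers_)    # final centroids after the update steps
print(kmeans.inertia_)            # sum of squared distances to nearest centroid
```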
Choosing K
Elbow method: plot inertia vs K; look for the "elbow" where improvement plateaus.
Silhouette score: the mean over all instances of (b−a)/max(a,b), where a is the mean intra-cluster distance and b is the mean distance to the nearest other cluster. Range [−1, 1]; higher is better.
Domain knowledge: often the best guide. How many natural groups does the problem have?
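A sketch of both numeric criteria, assuming X is a pre-loaded feature matrix and the search range 2..9 is arbitrary:

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

scores = {}
for k in range(2, 10):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    # inertia_ always shrinks as k grows, so plot it and look for the elbow
    print(k, km.inertia_)
    # mean silhouette coefficient over all instances; higher is better
    scores[k] = silhouette_score(X, km.labels_)

best_k = max(scores, key=scores.get)
```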
DBSCAN: Density-Based Clustering
DBSCAN defines clusters as maximal sets of density-connected points. Its main advantages: no K required, handles arbitrary shapes, built-in outlier detection (label −1). Key limitation: it struggles when clusters have very different densities. HDBSCAN (Hierarchical DBSCAN) handles varying density by finding stable clusters at multiple ε levels.
Core point: has ≥ min_samples instances within radius ε (including itself). Anchor of a cluster.
Border point: within ε of a core point, but has < min_samples neighbours. Belongs to the cluster, not its core.
Noise point: not a core point and not within ε of any core point. Labelled −1 (outlier). DBSCAN's built-in anomaly detection.
Density-reachable: point B is density-reachable from core point A if B is within ε of A.
Density-connected: A and B are density-connected if there is a chain of density-reachable points linking them; this defines the cluster.
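These definitions map directly onto scikit-learn's DBSCAN. The two-moons data below is an assumed example of a non-convex shape, and eps=0.2 is a value that happens to suit it:

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Assumed toy data: two interlocking crescents that K-Means cannot separate
X, _ = make_moons(n_samples=500, noise=0.05, random_state=42)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)
print(db.labels_[:10])                # cluster index per instance; -1 = noise
print(db.core_sample_indices_[:10])   # indices of the core points
```

Note that DBSCAN has no predict() method; to classify new points, one option is to fit a classifier (e.g. KNeighborsClassifier) on the core points and their cluster labels.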
Gaussian Mixture Models (GMM)
A GMM defines p(x) = Σₖ φₖ·N(x; μₖ, Σₖ), where the φₖ are mixing weights that sum to 1. The EM algorithm alternates two steps: the E-step computes posterior responsibilities rᵢₖ = p(zᵢ=k|xᵢ); the M-step updates φₖ, μₖ, Σₖ to maximise the expected log-likelihood. EM converges, but only to a local maximum, so use several initialisations. The covariance type controls cluster shape. Use BIC to select K: minimise BIC = k·ln(m) − 2·ln(L̂), where k is the number of free parameters and m the number of instances.
full: each component has its own unconstrained covariance matrix. Most flexible; can model any ellipsoid. Parameters: K·n(n+1)/2
tied: all components share the same covariance matrix. Good when clusters have similar shape. Parameters: n(n+1)/2
diag: diagonal covariance, so features are independent within each component. Faster, less expressive. Parameters: K·n
spherical: a single variance per component, giving circular Gaussians. Equivalent to K-Means with soft assignments. Parameters: K
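A sketch of BIC-driven model selection over both K and the covariance type; X is assumed pre-loaded and the search range 1..7 is arbitrary:

```python
from sklearn.mixture import GaussianMixture

best_gm, best_bic = None, float("inf")
for k in range(1, 8):                                   # assumed search range
    for cov in ("full", "tied", "diag", "spherical"):
        gm = GaussianMixture(n_components=k, covariance_type=cov,
                             n_init=10, random_state=42).fit(X)
        bic = gm.bic(X)                                 # k·ln(m) − 2·ln(L̂)
        if bic < best_bic:
            best_gm, best_bic = gm, bic

print(best_gm.n_components, best_gm.covariance_type)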
Choosing a Clustering Algorithm
K-Means (centroid-based)
✓ Fast O(m·n·K·i), scalable, simple
✗ Assumes spherical clusters; needs K upfront; sensitive to outliers
Key params: K (n_clusters), n_init

DBSCAN (density-based)
✓ Finds arbitrary shapes; detects outliers as noise; no K needed
✗ Struggles with varying density; ε and min_samples to tune
Key params: ε (eps), min_samples

Gaussian Mixture Models (probabilistic)
✓ Soft assignments; models elliptical clusters; AIC/BIC for model selection
✗ EM converges only to a local optimum; slow on large m
Key params: n_components, covariance_type

Agglomerative Hierarchical Clustering (hierarchical)
✓ No K needed upfront; dendrogram shows the hierarchy; any linkage criterion
✗ O(m²·log m); cannot update incrementally
Key params: n_clusters, linkage (ward/complete/average)
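Agglomerative clustering appears only in this table, so here is a minimal usage sketch; X is assumed pre-loaded and n_clusters=3 is an arbitrary choice:

```python
from sklearn.cluster import AgglomerativeClustering

# ward linkage merges the pair of clusters that least increases total
# within-cluster variance; it requires Euclidean distances
agg = AgglomerativeClustering(n_clusters=3, linkage="ward")
labels = agg.fit_predict(X)
```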
Mental Models
K-Means, DBSCAN, and GMMs all discover groupings from the data itself. They answer "are there natural groups?" before you know what to label them.
Inertia (sum of squared distances to centroids) always decreases with more clusters. Use the elbow method or silhouette score — not raw inertia — to choose K.
K-Means forces spherical, equal-sized clusters. DBSCAN finds crescent shapes, interlocking rings, blobs of any density. If your data's cluster shapes are non-convex, try DBSCAN first.
GMM gives each instance a soft probability of belonging to each cluster. This is more realistic than hard assignment. A GMM with spherical covariance and hard assignments is numerically equivalent to K-Means.
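With a fitted GaussianMixture (gm, as in the BIC sketch above), the soft and hard views sit side by side:

```python
# Each row of proba sums to 1 across the K components
proba = gm.predict_proba(X)   # responsibilities r_ik: soft assignments
hard = gm.predict(X)          # argmax over components: hard assignments
```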
AIC = 2k − 2ln(L̂); BIC = k·ln(m) − 2ln(L̂). Both penalise complexity. The K that minimises BIC is usually the best number of components. BIC penalises more strongly on large m.
In a GMM, instances with very low log-probability are outliers — they don't fit any component well. DBSCAN labels them as noise directly. Both are principled anomaly detectors.
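A sketch of GMM-based anomaly detection; the 2% contamination rate is an assumption, not a rule:

```python
import numpy as np

# gm is a fitted GaussianMixture (see the BIC sketch above)
log_dens = gm.score_samples(X)           # log-probability density per instance
threshold = np.percentile(log_dens, 2)   # flag the lowest 2% (assumed rate)
anomalies = X[log_dens < threshold]      # instances no component explains well
```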
Label a handful of instances per cluster (most representative = closest to centroid), then propagate labels to the whole cluster. Often achieves 90%+ of fully supervised accuracy with 1% of the labels.
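A minimal sketch of this representative-labelling idea, assuming a NumPy training set X_train with (for demonstration only) known labels y_train; the budget of 50 clusters and the downstream LogisticRegression are arbitrary choices:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

k = 50                                     # assumed labelling budget: 50 clusters
kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
X_dist = kmeans.fit_transform(X_train)     # distance of each instance to each centroid
rep_idx = np.argmin(X_dist, axis=0)        # instance closest to each centroid
y_rep = y_train[rep_idx]                   # in practice: label these k instances by hand

# Propagate each representative's label to every instance in its cluster
y_propagated = y_rep[kmeans.labels_]
clf = LogisticRegression(max_iter=1000).fit(X_train, y_propagated)
```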
Part of the HOML series. These notes distil Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow (3rd ed.) by Aurélien Géron. Chapter 9 closes Part I of the book — the classical ML toolkit. Chapter 10 opens Part II with neural networks and deep learning.