K-Means Clustering
K-Means minimises inertia: Σᵢ min_k ‖xᵢ − μₖ‖². The algorithm alternates between assignment (each instance → nearest centroid) and update (centroids → mean of assigned instances) — this is Lloyd's algorithm. Convergence is guaranteed but to a local minimum. K-Means++ initialises centroids with D²-weighted sampling for dramatically better results.
1. Initialise: place K centroids in the feature space, at random in the simplest scheme. K-Means++ instead chooses initial centroids spread far apart, giving faster convergence and better results.
2. Assign: each instance is assigned to the nearest centroid (Euclidean distance). This is the "E-step."
3. Update: each centroid moves to the mean of all instances assigned to it. This is the "M-step."
4. Repeat: alternate assignment and update until the centroids stop moving. Convergence is guaranteed, but possibly to a local minimum, so run multiple times (n_init=10).
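A minimal scikit-learn sketch of this loop; the make_blobs toy data and the choice of four clusters are assumptions for illustration only:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Assumed toy data: four well-separated blobs, purely for illustration
X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

# init="k-means++" gives the D²-weighted initialisation described above;
# n_init=10 reruns the whole algorithm and keeps the lowest-inertia run
kmeans = KMeans(n_clusters=4, init="k-means++", n_init=10, random_state=42)
labels = kmeans.fit_predict(X)    # assignment step: nearest centroid per instance
print(kmeans.cluster_centers_)    # final centroids after the update steps
print(kmeans.inertia_)            # sum of squared distances to nearest centroid
```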
Choosing K
Elbow method: plot inertia vs K; look for the "elbow" where improvement plateaus.
Silhouette score: the mean over all instances of (b−a)/max(a,b), where a is the mean intra-cluster distance and b is the mean distance to the nearest other cluster. Range [−1, 1]; higher is better.
Domain knowledge: often the best guide. How many natural groups does the problem have?
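A sketch of both numeric criteria, assuming X is a pre-loaded feature matrix and the search range 2..9 is arbitrary:

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

scores = {}
for k in range(2, 10):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    # inertia_ always shrinks as k grows, so plot it and look for the elbow
    print(k, km.inertia_)
    # mean silhouette coefficient over all instances; higher is better
    scores[k] = silhouette_score(X, km.labels_)

best_k = max(scores, key=scores.get)
```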
DBSCAN: Density-Based Clustering
DBSCAN defines clusters as maximal sets of density-connected points. Its main advantages: no K required, handles arbitrary shapes, built-in outlier detection (label −1). Key limitation: it struggles when clusters have very different densities. HDBSCAN (Hierarchical DBSCAN) handles varying density by finding stable clusters at multiple ε levels.
Core point: has ≥ min_samples instances within radius ε (including itself). Anchor of a cluster.
Border point: within ε of a core point, but has < min_samples neighbours. Belongs to the cluster, not its core.
Noise point: not a core point and not within ε of any core point. Labelled −1 (outlier). DBSCAN's built-in anomaly detection.
Density-reachable: point B is density-reachable from core point A if B is within ε of A.
Density-connected: A and B are density-connected if there is a chain of density-reachable points linking them; this defines the cluster.
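These definitions map directly onto scikit-learn's DBSCAN. The two-moons data below is an assumed example of a non-convex shape, and eps=0.2 is a value that happens to suit it:

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Assumed toy data: two interlocking crescents that K-Means cannot separate
X, _ = make_moons(n_samples=500, noise=0.05, random_state=42)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)
print(db.labels_[:10])                # cluster index per instance; -1 = noise
print(db.core_sample_indices_[:10])   # indices of the core points
```

Note that DBSCAN has no predict() method; to classify new points, one option is to fit a classifier (e.g. KNeighborsClassifier) on the core points and their cluster labels.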
Gaussian Mixture Models (GMM)
A GMM defines p(x) = Σₖ φₖ·N(x; μₖ, Σₖ), where the φₖ are mixing weights that sum to 1. The EM algorithm alternates two steps: the E-step computes posterior responsibilities rᵢₖ = p(zᵢ=k|xᵢ); the M-step updates φₖ, μₖ, Σₖ to maximise the expected log-likelihood. EM converges, but only to a local maximum, so use several initialisations. The covariance type controls cluster shape. Use BIC to select K: minimise BIC = k·ln(m) − 2·ln(L̂), where k is the number of free parameters and m the number of instances.
full: each component has its own unconstrained covariance matrix. Most flexible; can model any ellipsoid. Parameters: K·n(n+1)/2
tied: all components share the same covariance matrix. Good when clusters have similar shape. Parameters: n(n+1)/2
diag: diagonal covariance, so features are independent within each component. Faster, less expressive. Parameters: K·n
spherical: a single variance per component, giving circular Gaussians. Equivalent to K-Means with soft assignments. Parameters: K
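A sketch of BIC-driven model selection over both K and the covariance type; X is assumed pre-loaded and the search range 1..7 is arbitrary:

```python
from sklearn.mixture import GaussianMixture

best_gm, best_bic = None, float("inf")
for k in range(1, 8):                                   # assumed search range
    for cov in ("full", "tied", "diag", "spherical"):
        gm = GaussianMixture(n_components=k, covariance_type=cov,
                             n_init=10, random_state=42).fit(X)
        bic = gm.bic(X)                                 # k·ln(m) − 2·ln(L̂)
        if bic < best_bic:
            best_gm, best_bic = gm, bic

print(best_gm.n_components, best_gm.covariance_type)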
Choosing a Clustering Algorithm
K-Means (centroid-based)
✓ Fast O(m·n·K·i), scalable, simple
✗ Assumes spherical clusters; needs K upfront; sensitive to outliers
Key params: K (n_clusters), n_init

DBSCAN (density-based)
✓ Finds arbitrary shapes; detects outliers as noise; no K needed
✗ Struggles with varying density; ε and min_samples to tune
Key params: ε (eps), min_samples

Gaussian Mixture Models (probabilistic)
✓ Soft assignments; models elliptical clusters; AIC/BIC for model selection
✗ EM converges only to a local optimum; slow on large m
Key params: n_components, covariance_type

Agglomerative Hierarchical Clustering (hierarchical)
✓ No K needed upfront; dendrogram shows the hierarchy; any linkage criterion
✗ O(m²·log m); cannot update incrementally
Key params: n_clusters, linkage (ward/complete/average)
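Agglomerative clustering appears only in this table, so here is a minimal usage sketch; X is assumed pre-loaded and n_clusters=3 is an arbitrary choice:

```python
from sklearn.cluster import AgglomerativeClustering

# ward linkage merges the pair of clusters that least increases total
# within-cluster variance; it requires Euclidean distances
agg = AgglomerativeClustering(n_clusters=3, linkage="ward")
labels = agg.fit_predict(X)
```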
Mental Models
K-Means, DBSCAN, and GMMs all discover groupings from the data itself. They answer "are there natural groups?" before you know what to label them.
Inertia (sum of squared distances to centroids) always decreases with more clusters. Use the elbow method or silhouette score — not raw inertia — to choose K.
K-Means forces spherical, equal-sized clusters. DBSCAN finds crescent shapes, interlocking rings, blobs of any density. If your data's cluster shapes are non-convex, try DBSCAN first.
GMM gives each instance a soft probability of belonging to each cluster. This is more realistic than hard assignment. A GMM with spherical covariance and hard assignments is numerically equivalent to K-Means.
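With a fitted GaussianMixture (gm, as in the BIC sketch above), the soft and hard views sit side by side:

```python
# Each row of proba sums to 1 across the K components
proba = gm.predict_proba(X)   # responsibilities r_ik: soft assignments
hard = gm.predict(X)          # argmax over components: hard assignments
```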
AIC = 2k − 2ln(L̂); BIC = k·ln(m) − 2ln(L̂). Both penalise complexity. The K that minimises BIC is usually the best number of components. BIC penalises more strongly on large m.
In a GMM, instances with very low log-probability are outliers — they don't fit any component well. DBSCAN labels them as noise directly. Both are principled anomaly detectors.
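A sketch of GMM-based anomaly detection; the 2% contamination rate is an assumption, not a rule:

```python
import numpy as np

# gm is a fitted GaussianMixture (see the BIC sketch above)
log_dens = gm.score_samples(X)           # log-probability density per instance
threshold = np.percentile(log_dens, 2)   # flag the lowest 2% (assumed rate)
anomalies = X[log_dens < threshold]      # instances no component explains well
```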
Label a handful of instances per cluster (most representative = closest to centroid), then propagate labels to the whole cluster. Often achieves 90%+ of fully supervised accuracy with 1% of the labels.
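A minimal sketch of this representative-labelling idea, assuming a NumPy training set X_train with (for demonstration only) known labels y_train; the budget of 50 clusters and the downstream LogisticRegression are arbitrary choices:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

k = 50                                     # assumed labelling budget: 50 clusters
kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
X_dist = kmeans.fit_transform(X_train)     # distance of each instance to each centroid
rep_idx = np.argmin(X_dist, axis=0)        # instance closest to each centroid
y_rep = y_train[rep_idx]                   # in practice: label these k instances by hand

# Propagate each representative's label to every instance in its cluster
y_propagated = y_rep[kmeans.labels_]
clf = LogisticRegression(max_iter=1000).fit(X_train, y_propagated)
```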
Part of the HOML series. These notes distil Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow (3rd ed.) by Aurélien Géron. Chapter 9 closes Part I of the book — the classical ML toolkit. Chapter 10 opens Part II with neural networks and deep learning.