HOML · Ch. 8 · April 2026 · 13 min read

Dimensionality Reduction: PCA, Manifolds & the Curse of Dimensions

Chapter 8 covers the curse of dimensionality, PCA (via SVD), explained variance ratio and choosing d, Incremental/Randomised PCA, Kernel PCA with RBF/poly/sigmoid, Locally Linear Embedding, Random Projections (Johnson–Lindenstrauss), and UMAP/t-SNE for visualisation. Central theme: the manifold hypothesis — real-world high-d data lies near a lower-dimensional manifold.

The Curse of Dimensionality

In d dimensions, a unit hypercube's volume is 1, but to capture 1% of the volume you need a sub-cube with side length 0.01^(1/d): already ≈ 0.63 at d=10, and ≈ 0.9995 at d=10,000. Most of a high-dimensional space is near the boundary, not the interior. Distances concentrate: the ratio of the max to the min pairwise distance tends to 1 as d → ∞, so nearest-neighbour search becomes meaningless. The manifold hypothesis offers a way out: real data lives near a lower-dimensional manifold embedded in the high-d space.
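A quick numeric sketch of both effects (NumPy/SciPy; point counts and dimensions are arbitrary choices, not from the book):

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(42)

# Side length of a sub-cube that captures 1% of a unit hypercube's volume
for d in (2, 10, 100, 10_000):
    print(f"d={d:>6}: side length = {0.01 ** (1 / d):.4f}")

# Distance concentration: the max/min pairwise-distance ratio shrinks towards 1
for d in (2, 10, 100, 1_000):
    X = rng.random((500, d))   # 500 uniform points in the unit hypercube
    dists = pdist(X)           # all pairwise Euclidean distances
    print(f"d={d:>5}: max/min distance ratio = {dists.max() / dists.min():.2f}")
```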

The Manifold Hypothesis

Most real-world high-dimensional data (images, audio, text) actually lies near a much lower-dimensional manifold. A face image has ~10⁶ pixels, but the "space of realistic faces" is far smaller — perhaps 50–100 dimensions of variation (pose, lighting, expression, identity). Dimensionality reduction finds and exploits this manifold structure.

Principal Component Analysis (PCA)

PCA identifies the d orthogonal directions of maximum variance. It is equivalent to an eigendecomposition of the covariance matrix, but computed efficiently via SVD: X = UΣVᵀ. The principal components are columns of V; explained variance ratios are σᵢ²/Σσⱼ². Reconstruction: X̂ = X_reduced · W_dᵀ. Reconstruction error measures information lost.

1
Centre the data

Subtract the mean from each feature so the data is centred at the origin.

2
Compute covariance matrix

C = (1/m) XᵀX — captures how features vary together.

3
Singular Value Decomposition

Eigendecompose C, or equivalently take the SVD of the centred data directly: X = UΣVᵀ. The columns of V are the principal components (the eigenvectors of C).

4
Project onto top-d components

X_reduced = X · W_d, where W_d = first d columns of V. Variance preserved = Σᵢ₌₁ᵈ λᵢ / Σλᵢ.
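A minimal NumPy sketch of these four steps (toy data and variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))   # toy correlated data

# 1. Centre the data
X_centred = X - X.mean(axis=0)

# 2./3. SVD of the centred data (equivalent to eigendecomposing the covariance)
U, s, Vt = np.linalg.svd(X_centred, full_matrices=False)

# Explained variance ratio per component: sigma_i^2 / sum_j sigma_j^2
explained_variance_ratio = s**2 / (s**2).sum()

# 4. Project onto the top-d principal components
d = 2
W_d = Vt[:d].T                 # first d columns of V
X_reduced = X_centred @ W_d    # shape (200, d)

# Reconstruction (adding the mean back) and reconstruction error
X_reconstructed = X_reduced @ W_d.T + X.mean(axis=0)
mse = np.mean((X - X_reconstructed) ** 2)
```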

Choosing the Right Number of Dimensions

Scree plot elbow

Plot the explained variance ratio per component; keep components up to the "elbow", after which each extra component adds little variance.

95% explained variance

n_components=0.95 in sklearn automatically finds the minimum d that preserves 95% of variance.

Reconstruction error

Compress then decompress; measure MSE between original and reconstruction. Use as an anomaly score.

Task performance

The best d is the one that maximises downstream model accuracy on a validation set — use cross-validation.
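A short scikit-learn sketch of the first two strategies, using the bundled digits dataset purely as a stand-in:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X_train = load_digits().data        # 1797 x 64 digit images as example data

# Strategy 1: inspect the cumulative explained variance (scree / elbow plot)
pca_full = PCA().fit(X_train)
cumsum = np.cumsum(pca_full.explained_variance_ratio_)
d = np.argmax(cumsum >= 0.95) + 1   # smallest d preserving 95% of the variance

# Strategy 2: let sklearn find d directly
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_train)
print(d, pca.n_components_)         # both approaches pick the same d
```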

Technique Comparison

Linear methods (PCA, Random Projections) assume the manifold is a hyperplane. Non-linear methods handle curved manifolds but are more computationally expensive and sensitive to hyperparameters. For downstream supervised learning, PCA is usually the best starting point — use non-linear methods when you have evidence of strong non-linear structure (e.g. visualisation shows clusters that PCA doesn't separate).

PCA

Linear

Best for: General-purpose; preserves global variance

Complexity: O(m·n²) + O(n³) for SVD

Default first choice. Supports incremental PCA for large datasets.

Kernel PCA (kPCA)

Non-linear

Best for: Complex curved manifolds; Swiss Roll, moons

Complexity: O(m²) – O(m³)

Uses kernel trick in feature space. RBF kernel common. Kernel selection via CV.

LLE (Locally Linear Embedding)

Non-linear (manifold)

Best for: Unfolding twisted manifolds; neighbourhood-preserving

Complexity: O(m·log m) neighbour search + O(d·m²) for the embedding

Great at local structure; poor scaling to very large m or d.

Random Projections

Linear (random)

Best for: Very high-dimensional data (text, images) with speed priority

Complexity: O(m·n·d) — very fast

Johnson–Lindenstrauss lemma: distances approximately preserved.

UMAP / t-SNE

Non-linear (visualisation)

Best for: Visualisation at d=2 or 3; cluster exploration

Complexity: O(m·log m) – O(m²)

t-SNE ships with sklearn (sklearn.manifold.TSNE); UMAP needs the separate umap-learn package. Both preserve local rather than global structure.
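For the non-linear and random-projection options, a hedged scikit-learn sketch on the Swiss Roll (hyperparameters are illustrative, not tuned; UMAP is omitted because it needs the separate umap-learn package):

```python
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import KernelPCA
from sklearn.manifold import LocallyLinearEmbedding, TSNE
from sklearn.random_projection import GaussianRandomProjection

X, t = make_swiss_roll(n_samples=1000, noise=0.1, random_state=42)

# Kernel PCA with an RBF kernel (gamma chosen for illustration; tune via CV)
X_kpca = KernelPCA(n_components=2, kernel="rbf", gamma=0.04).fit_transform(X)

# LLE: reconstruct each point from its k nearest neighbours, preserve the weights
X_lle = LocallyLinearEmbedding(n_components=2, n_neighbors=10,
                               random_state=42).fit_transform(X)

# t-SNE for 2-D visualisation (preserves local, not global, structure)
X_tsne = TSNE(n_components=2, random_state=42).fit_transform(X)

# Random projection (Johnson-Lindenstrauss); only worthwhile when n_features is
# large, shown here purely to illustrate the API
X_rp = GaussianRandomProjection(n_components=2, random_state=42).fit_transform(X)
```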

PCA as a Preprocessing Step

Typical pipeline: StandardScaler → PCA(n_components=0.95) → classifier. The number of components is a hyperparameter to tune via cross-validation. Caution: do not use PCA to "reduce overfitting" blindly; it discards low-variance directions, which are not necessarily the ones causing overfitting. Regularisation (Ridge, dropout) is better targeted at overfitting. PCA's main win is speed and memory.
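A sketch of that pipeline with n_components tuned by cross-validation (the dataset, classifier, and grid values are illustrative):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=0.95)),          # keep 95% of the variance by default
    ("clf", LogisticRegression(max_iter=5000)),
])

# Treat the number of components as a hyperparameter
param_grid = {"pca__n_components": [0.90, 0.95, 0.99]}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```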

Mental Models

01
Dimensions are axes of variation

PCA finds the directions in feature space along which the data varies the most. The first principal component captures more variance than any other direction; the second captures the most of what remains, and so on.

02
The curse of dimensionality is real

In high dimensions, all points become roughly equidistant from each other. Distances lose meaning; density estimation fails; nearest-neighbour search slows down. Reducing dimensions before modelling often helps more than adding features.

03
PCA is a rotation, not a distortion

PCA rotates the coordinate system to align axes with directions of maximum variance. Information is preserved — only the reference frame changes. Dropping low-variance components is the only lossy step.

04
Kernel PCA = PCA in kernel space

A non-linearly separable manifold in the original space may become linearly separable in a high-dimensional kernel feature space. kPCA runs standard PCA in that space — without ever computing coordinates there explicitly.

05
LLE unfolds manifolds locally

LLE assumes each point can be reconstructed as a linear combination of its k nearest neighbours. It finds a low-d embedding where those reconstruction weights are best preserved. Works beautifully on Swiss Roll; struggles with globally entangled structures.

06
Dimensionality reduction is lossy — always

You discard information. The question is whether the discarded dimensions carry signal (bad) or noise (good). Inspect the reconstruction error and downstream model performance — don't blindly reduce.

07
Use PCA as preprocessing, not as an end

PCA accelerates training and cuts memory by removing low-variance directions that are often noise; it can help with overfitting, but regularisation is the better-targeted fix. The best use is as a Pipeline step before your real estimator, with the number of components tuned as a hyperparameter.
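A short NumPy/scikit-learn sketch tying mental models 03 and 06 together (toy data; the anomaly threshold is an arbitrary choice): with every component kept, PCA is an exact rotation; with fewer components, the per-sample reconstruction error measures what was lost and doubles as an anomaly score.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10)) @ rng.normal(size=(10, 10))  # correlated toy data

# Mental model 03: keep all components, a pure rotation, reconstruction is exact
pca_full = PCA(n_components=10).fit(X)
X_back = pca_full.inverse_transform(pca_full.transform(X))
print(np.allclose(X, X_back))                               # True

# Mental model 06: keep fewer components; reconstruction error measures the loss
pca = PCA(n_components=3).fit(X)
X_recon = pca.inverse_transform(pca.transform(X))
recon_error = np.mean((X - X_recon) ** 2, axis=1)           # per-sample MSE
anomalies = recon_error > np.quantile(recon_error, 0.99)    # flag the worst 1%
print(anomalies.sum(), "potential anomalies")
```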

Part of the HOML series. These notes distil Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow (3rd ed.) by Aurélien Géron. Dimensionality reduction bridges the supervised learning of earlier chapters and the unsupervised methods of Chapter 9.