The Curse of Dimensionality
In d dimensions, a unit hypercube's volume is 1, but to capture 1% of the volume you need a sub-hypercube with side length 0.01^(1/d) ≈ 0.63 when d = 10. Most of a high-dimensional space is near the boundary, not the interior. Distances concentrate: the ratio of max to min pairwise distances → 1 as d → ∞, so nearest-neighbour search becomes meaningless. The manifold hypothesis offers a way out: real data lives near a lower-dimensional manifold embedded in the high-dimensional space.
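The side-length claim is easy to verify numerically; a minimal sketch (the chosen values of d are illustrative):

```python
# Side length s of a sub-hypercube that captures 1% of a unit
# hypercube's volume: s**d = 0.01, so s = 0.01 ** (1 / d).
for d in (1, 2, 10, 100, 1000):
    s = 0.01 ** (1 / d)
    print(f"d={d:>4}: side length = {s:.4f}")
```

At d = 10 the side length is already ≈ 0.63; by d = 1000 a sub-cube covering just 1% of the volume must span over 99% of each axis.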
The Manifold Hypothesis
Most real-world high-dimensional data (images, audio, text) actually lies near a much lower-dimensional manifold. A face image has ~10⁶ pixels, but the "space of realistic faces" is far smaller — perhaps 50–100 dimensions of variation (pose, lighting, expression, identity). Dimensionality reduction finds and exploits this manifold structure.
Principal Component Analysis (PCA)
PCA identifies the d orthogonal directions of maximum variance. It is equivalent to an eigendecomposition of the covariance matrix, but computed efficiently via SVD: X = UΣVᵀ. The principal components are columns of V; explained variance ratios are σᵢ²/Σσⱼ². Reconstruction: X̂ = X_reduced · W_dᵀ. Reconstruction error measures information lost.
1. Centre the data: subtract the mean from each feature so the data is centred at the origin.
2. Covariance matrix: C = (1/m) XᵀX captures how features vary together.
3. SVD of the centred data: X = UΣVᵀ. The columns of V are the principal components (eigenvectors of C).
4. Project: X_reduced = X · W_d, where W_d = first d columns of V. Variance preserved = Σᵢ₌₁ᵈ λᵢ / Σλᵢ.
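The four steps above can be sketched from scratch with NumPy; a minimal illustration (the synthetic data and d = 2 are assumptions for the demo):

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))  # correlated features

# 1. Centre the data at the origin.
X_centred = X - X.mean(axis=0)

# 2-3. SVD of the centred data: X = U @ diag(s) @ Vt.
#      Rows of Vt (columns of V) are the principal components.
U, s, Vt = np.linalg.svd(X_centred, full_matrices=False)

# Explained variance ratio per component: sigma_i^2 / sum(sigma_j^2).
explained_ratio = s**2 / np.sum(s**2)

# 4. Project onto the first d components.
d = 2
W_d = Vt[:d].T                   # shape (n_features, d)
X_reduced = X_centred @ W_d      # shape (m, d)

# Lossy reconstruction back in the original space.
X_hat = X_reduced @ W_d.T + X.mean(axis=0)
```

The singular values come back in descending order, so the explained variance ratios are automatically sorted from largest to smallest.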
Choosing the Right Number of Dimensions
Elbow method: plot the explained variance ratio per component; keep components up to the "elbow", where the curve flattens out.
Variance threshold: n_components=0.95 in sklearn automatically finds the minimum d that preserves 95% of variance.
Reconstruction error: compress then decompress; measure the MSE between the original and the reconstruction. This also works as an anomaly score.
Downstream performance: the best d is the one that maximises downstream model accuracy on a validation set; use cross-validation.
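The variance-threshold and reconstruction-error approaches can be sketched together; a minimal example, assuming scikit-learn is installed (the digits dataset is just a convenient built-in):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X = load_digits().data          # 1797 samples, 64 pixel features

# Variance threshold: PCA picks the smallest d preserving 95% of variance.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)
print(pca.n_components_, pca.explained_variance_ratio_.sum())

# Reconstruction error doubles as an anomaly score: unusual samples
# reconstruct poorly from the kept components.
X_recovered = pca.inverse_transform(X_reduced)
mse_per_sample = np.mean((X - X_recovered) ** 2, axis=1)
```

After fitting, `pca.n_components_` reports the d that was actually selected, and `np.cumsum(pca.explained_variance_ratio_)` gives the curve for an elbow plot.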
Technique Comparison
Linear methods (PCA, Random Projections) assume the manifold is a hyperplane. Non-linear methods handle curved manifolds but are more computationally expensive and sensitive to hyperparameters. For downstream supervised learning, PCA is usually the best starting point — use non-linear methods when you have evidence of strong non-linear structure (e.g. visualisation shows clusters that PCA doesn't separate).
PCA
Linear. Best for: General-purpose; preserves global variance
Complexity: O(m·n²) + O(n³) for SVD
Default first choice. Supports incremental PCA for large datasets.
Kernel PCA (kPCA)
Non-linear. Best for: Complex curved manifolds; Swiss Roll, moons
Complexity: O(m²) – O(m³)
Uses kernel trick in feature space. RBF kernel common. Kernel selection via CV.
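A minimal kPCA sketch on the moons dataset, assuming scikit-learn is installed (gamma=15 is illustrative; in practice select the kernel and gamma via cross-validation, as noted above):

```python
from sklearn.datasets import make_moons
from sklearn.decomposition import KernelPCA

# Two interleaved half-moons: a curved manifold in 2-D.
X, y = make_moons(n_samples=200, noise=0.05, random_state=42)

# RBF-kernel PCA = standard PCA in the implicit kernel feature space,
# without ever computing coordinates there explicitly.
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=15)
X_kpca = kpca.fit_transform(X)
```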
LLE (Locally Linear Embedding)
Non-linear (manifold). Best for: Unfolding twisted manifolds; neighbourhood-preserving
Complexity: O(m·log m) neighbour search + O(m·d²)
Great at local structure; poor scaling to very large m or d.
Random Projections
Linear (random). Best for: Very high-dimensional data (text, images) with speed priority
Complexity: O(m·n·d) — very fast
Johnson–Lindenstrauss lemma: distances approximately preserved.
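A minimal sketch, assuming scikit-learn is installed (the data shape and target dimension of 100 are illustrative):

```python
import numpy as np
from sklearn.random_projection import (GaussianRandomProjection,
                                       johnson_lindenstrauss_min_dim)

# Minimum target dimension the JL lemma guarantees for preserving all
# pairwise distances within +/-10%, for 1,000 samples. Note it depends
# only on the number of samples, not on the original dimensionality.
d_min = johnson_lindenstrauss_min_dim(n_samples=1_000, eps=0.1)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10_000))           # very high-dimensional data
rp = GaussianRandomProjection(n_components=100, random_state=0)
X_proj = rp.fit_transform(X)                 # projection needs no training
```

The "fit" merely draws a random Gaussian matrix, which is why random projections are so fast.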
UMAP / t-SNE
Non-linear (visualisation). Best for: Visualisation at d=2 or 3; cluster exploration
Complexity: O(m·log m) – O(m²)
t-SNE is in sklearn.manifold; UMAP needs the separate umap-learn package. Both preserve local, not global, structure.
PCA as a Preprocessing Step
Typical pipeline: StandardScaler → PCA(n_components=0.95) → classifier. The number of components is a hyperparameter to tune via cross-validation. Caution: do not use PCA to "reduce overfitting" blindly — it discards variance, not necessarily overfitting components. Regularisation (Ridge, dropout) is better targeted at overfitting. PCA's main win is speed and memory.
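The pipeline and the tuning of n_components can be sketched in a few lines, assuming scikit-learn is installed (the digits dataset, LogisticRegression, and candidate thresholds are illustrative choices):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=0.95)),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Tune the variance threshold (and hence d) as an ordinary hyperparameter.
grid = GridSearchCV(pipe, {"pca__n_components": [0.80, 0.90, 0.95]}, cv=3)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```

Putting PCA inside the Pipeline ensures it is refit on each training fold, avoiding leakage from the validation folds.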
Mental Models
PCA finds the directions in feature space along which the data varies the most. The first principal component captures more variance than any other direction; the second captures the most of what remains, and so on.
In high dimensions, all points become roughly equidistant from each other. Distances lose meaning; density estimation fails; nearest-neighbour search slows down. Reducing dimensions before modelling often helps more than adding features.
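Distance concentration is straightforward to demonstrate; a minimal NumPy sketch (the sample size and dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
ratios = {}
for d in (2, 10, 100, 1000):
    X = rng.uniform(size=(200, d))
    # Squared pairwise distances via the Gram-matrix identity:
    # ||x_i - x_j||^2 = ||x_i||^2 + ||x_j||^2 - 2 x_i . x_j
    sq = (X ** 2).sum(axis=1)
    D2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    iu = np.triu_indices(len(X), k=1)        # upper triangle, no diagonal
    dists = np.sqrt(np.maximum(D2[iu], 0.0))
    ratios[d] = dists.max() / dists.min()
    print(f"d={d:>4}: max/min pairwise distance = {ratios[d]:.2f}")
```

The max/min ratio shrinks towards 1 as d grows: in low dimensions nearest and farthest neighbours differ by orders of magnitude; by d = 1000 every pair of points is nearly the same distance apart.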
PCA rotates the coordinate system to align axes with directions of maximum variance. Information is preserved — only the reference frame changes. Dropping low-variance components is the only lossy step.
A non-linearly separable manifold in the original space may become linearly separable in a high-dimensional kernel feature space. kPCA runs standard PCA in that space — without ever computing coordinates there explicitly.
LLE assumes each point can be reconstructed as a linear combination of its k nearest neighbours. It finds a low-d embedding where those reconstruction weights are best preserved. Works beautifully on Swiss Roll; struggles with globally entangled structures.
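A minimal LLE sketch on the Swiss Roll, assuming scikit-learn is installed (n_neighbors=10 is an illustrative choice):

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

X, t = make_swiss_roll(n_samples=1000, noise=0.05, random_state=42)

# Each point is encoded as reconstruction weights over its 10 nearest
# neighbours; LLE then finds 2-D coordinates preserving those weights.
lle = LocallyLinearEmbedding(n_components=2, n_neighbors=10, random_state=42)
X_unrolled = lle.fit_transform(X)
```

After fitting, `lle.reconstruction_error_` reports how well the embedding preserves the local weights; on the Swiss Roll it is typically tiny.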
You discard information. The question is whether the discarded dimensions carry signal (bad) or noise (good). Inspect the reconstruction error and downstream model performance — don't blindly reduce.
PCA accelerates training and cuts memory use; it may incidentally reduce overfitting when the discarded low-variance components are mostly noise, but regularisation is the better-targeted tool. The best use is as a Pipeline step before your real estimator, tuning the number of components as a hyperparameter.
Part of the HOML series. These notes distil Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow (3rd ed.) by Aurélien Géron. Dimensionality reduction bridges the supervised learning of earlier chapters and the unsupervised methods of Chapter 9.