HOML · Ch. 4 · April 2026 · 17 min read

Training Models: From Normal Equations to Regularised Regression

Chapter 4 distils the closed-form Normal Equation (θ̂ = (XᵀX)⁻¹Xᵀy), three gradient descent variants (Batch / SGD / Mini-batch), Polynomial Regression and the bias–variance trade-off, four regularisation techniques (Ridge, Lasso, Elastic Net, Early Stopping), and the leap from linear to Logistic and Softmax classification. A dense but foundational chapter.

Linear Regression: Two Paths to the Same Answer

Linear Regression seeks θ that minimises MSE(θ) = (1/m) Σ(θᵀxᵢ−yᵢ)². The Normal Equation delivers the global optimum in a single step: θ̂ = (XᵀX)⁻¹Xᵀy. In Scikit-Learn this is computed via the Moore–Penrose pseudo-inverse (SVD-based, numerically stable). Gradient descent approaches the same minimum iteratively using the gradient ∇θMSE(θ) = (2/m) Xᵀ(Xθ−y).
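
A minimal NumPy sketch of both closed-form routes, on made-up toy data (the explicit Normal Equation and the SVD-based pseudo-inverse):

```python
import numpy as np

rng = np.random.default_rng(42)
m = 100
X = 2 * rng.random((m, 1))                      # one feature, toy data
y = 4 + 3 * X[:, 0] + rng.standard_normal(m)    # y = 4 + 3x + Gaussian noise

X_b = np.c_[np.ones(m), X]                      # prepend bias column x0 = 1

# Normal Equation: θ̂ = (XᵀX)⁻¹Xᵀy
theta_normal = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y

# Moore–Penrose pseudo-inverse (SVD-based, numerically stabler)
theta_pinv = np.linalg.pinv(X_b) @ y

print(theta_normal, theta_pinv)                 # both ≈ [4, 3]
```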

Gradient Descent: Three Flavours

All three variants follow θ ← θ − η·∇θMSE(θ), differing only in which data they use to estimate the gradient. The learning rate η drives convergence: a learning schedule (e.g. η(t) = η₀/(1 + decay·t)) helps SGD converge without manually tuning a fixed rate.
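
As a concrete illustration of that update rule, a minimal Batch GD loop, reusing X_b, y, m and rng from the sketch above (η and the epoch count are arbitrary choices):

```python
eta = 0.1                                       # learning rate, illustrative
n_epochs = 1000
theta = rng.standard_normal(2)                  # random initialisation

for epoch in range(n_epochs):
    gradients = (2 / m) * X_b.T @ (X_b @ theta - y)   # ∇θMSE(θ) over the full set
    theta -= eta * gradients                    # θ ← θ − η·∇θMSE(θ)

print(theta)                                    # ≈ [4, 3], same minimum as the Normal Equation
```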

Batch GD

θ ← θ − η · (2/m) Xᵀ(Xθ−y)

Speed: Slow per step (full dataset)

Memory: O(m·n)

Convergence: Smooth; reaches the global minimum on convex cost functions

Best for: Small–medium datasets, precise minimum

Stochastic GD

θ ← θ − η · 2xᵢᵀ(xᵢθ−yᵢ)

Speed: Very fast per step

Memory: O(1)

Convergence: Noisy, escapes local minima

Best for: Large datasets, online learning

Mini-batch GD

θ ← θ − η · (2/b) Xbᵀ(Xbθ−yb)

Speed: Fast, GPU-friendly

Memory: O(b·n)

Convergence: Balanced noise/stability

Best for: Default choice in practice
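
A sketch of the stochastic variant with the decaying schedule from above, again on the toy X_b, y (η₀, decay and the epoch count are illustrative):

```python
eta0, decay = 0.1, 0.01
theta = rng.standard_normal(2)
t = 0                                           # global step counter for the schedule

for epoch in range(50):
    for _ in range(m):
        i = rng.integers(m)                     # one random instance per step
        xi, yi = X_b[i], y[i]
        gradient = 2 * xi * (xi @ theta - yi)   # single-instance gradient estimate
        theta -= eta0 / (1 + decay * t) * gradient
        t += 1

print(theta)                                    # ≈ [4, 3], but along a noisier path
```

In practice Scikit-Learn's SGDRegressor (e.g. SGDRegressor(max_iter=1000, tol=1e-5, penalty=None, eta0=0.01)) packages the same idea with a built-in schedule.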

Polynomial Regression & the Bias–Variance Trade-off

Polynomial Regression adds engineered features (x, x², …, xᵈ) then applies standard Linear Regression. Model complexity is proxied by degree d. Learning curves — plotting training vs. validation MSE against training set size — diagnose bias (curves plateau high) vs. variance (large gap between curves). Regularisation is the primary tool for shifting the trade-off.
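
A minimal pipeline sketch of the feature-engineering step, reusing the toy X, y from earlier (degree 2 is an arbitrary choice):

```python
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Engineer x, x² columns, then fit ordinary Linear Regression on them
poly_reg = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LinearRegression(),
)
poly_reg.fit(X, y)
```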

High Bias (Underfitting)

Training error ≈ Validation error, both high. Model too simple to capture the signal. Fix: increase complexity or add features.

High Variance (Overfitting)

Training error low, Validation error high — large gap. Model memorises training noise. Fix: regularise, get more data, or simplify.
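
One way to obtain the learning curves described above, assuming the poly_reg pipeline and toy X, y from the earlier sketches:

```python
import numpy as np
from sklearn.model_selection import learning_curve

train_sizes, train_scores, valid_scores = learning_curve(
    poly_reg, X, y,
    train_sizes=np.linspace(0.1, 1.0, 10),
    cv=5,
    scoring="neg_mean_squared_error",
)
train_mse = -train_scores.mean(axis=1)   # both curves plateauing high → high bias
valid_mse = -valid_scores.mean(axis=1)   # persistent gap between the curves → high variance
```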

Regularisation: Taming the Model

All four techniques constrain model complexity during training. Ridge and Lasso add a penalty term to the cost function; Elastic Net blends them; Early Stopping uses validation performance as an implicit regulariser. Always scale features before applying regularised models (the penalty is magnitude-dependent).
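
A minimal scale-then-penalise sketch of the three penalty-based techniques described below (the α values and mix ratio are arbitrary, and X, y are the toy data from earlier):

```python
from sklearn.linear_model import ElasticNet, Lasso, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Standardise features first: the penalty depends on weight magnitudes
ridge = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
lasso = make_pipeline(StandardScaler(), Lasso(alpha=0.1))
enet = make_pipeline(StandardScaler(), ElasticNet(alpha=0.1, l1_ratio=0.5))  # l1_ratio ≈ mix ratio r

for model in (ridge, lasso, enet):
    model.fit(X, y)
```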

Ridge (L2)

Penalty: α Σθⱼ²

Shrinks all weights smoothly toward zero; never exactly zero

Key params: α (strength). Default go-to; differentiable everywhere

Lasso (L1)

Penalty: α Σ|θⱼ|

Drives some weights to exactly zero → automatic feature selection

Key params: α (strength). Prefer when many features are irrelevant

Elastic Net

Penalty: r·α Σ|θⱼ| + (1−r)·α Σθⱼ²

Hybrid: some sparsity + groups correlated features

Key params: α, mix ratio r. Safer than pure Lasso with correlated features

Early Stopping

Penalty: — (implicit)

Halt training when val error stops improving

Key params: patience / tolerance. Free regularisation; restore best weights on stop (sketch below)
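
A sketch of early stopping with SGDRegressor and a held-out validation set, reusing the toy X, y (epoch budget and η₀ are illustrative):

```python
from copy import deepcopy
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42)

sgd_reg = SGDRegressor(penalty=None, eta0=0.002, random_state=42)
best_valid_mse = float("inf")
best_model = None

for epoch in range(500):
    sgd_reg.partial_fit(X_train, y_train)       # one epoch of incremental training
    valid_mse = mean_squared_error(y_valid, sgd_reg.predict(X_valid))
    if valid_mse < best_valid_mse:              # validation error improved
        best_valid_mse = valid_mse
        best_model = deepcopy(sgd_reg)          # keep a copy of the best weights so far
```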

Logistic Regression & Softmax

Logistic Regression estimates p̂ = σ(θᵀx) where σ is the sigmoid. Training minimises binary cross-entropy: J(θ) = −(1/m) Σ[yᵢ log(p̂ᵢ) + (1−yᵢ) log(1−p̂ᵢ)]. This is convex — gradient descent finds the global minimum. Softmax generalises to K classes with score sₖ(x) = θₖᵀx and probability p̂ₖ = exp(sₖ)/Σexp(sⱼ). Training again minimises cross-entropy; a multiclass problem can be handled either with a single softmax model (mutually exclusive classes) or by training one-vs-rest binary classifiers.
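
A minimal Scikit-Learn sketch on the Iris dataset (the chosen features, the example petal width and the C value are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris(as_frame=True)

# Binary: is it Iris virginica? The output is a sigmoid probability, not a hard class.
X_petal = iris.data[["petal width (cm)"]].values
y_virginica = (iris.target == 2).astype(int)
log_reg = LogisticRegression(random_state=42).fit(X_petal, y_virginica)
log_reg.predict_proba([[1.7]])                  # probability of virginica at petal width 1.7 cm

# Multiclass: softmax (multinomial) over all three species
X_petals = iris.data[["petal length (cm)", "petal width (cm)"]].values
softmax_reg = LogisticRegression(C=30, random_state=42).fit(X_petals, iris.target)
```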

Sigmoid (Binary)

σ(t) = 1 / (1 + e⁻ᵗ)

Output ∈ (0,1). Threshold at 0.5 by default; tune separately for precision/recall balance.

Softmax (Multi-class)

p̂ₖ = exp(sₖ) / Σⱼ exp(sⱼ)

Probabilities sum to 1. Cross-entropy loss = −Σₖ yₖ log(p̂ₖ). Mutually exclusive classes only.
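
The same two formulas as a plain-NumPy sketch (the score values are made up):

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax: subtract the max score before exponentiating."""
    exp_z = np.exp(scores - scores.max())
    return exp_z / exp_z.sum()

scores = np.array([2.0, 1.0, 0.1])        # sₖ(x) = θₖᵀx for K = 3 classes
p_hat = softmax(scores)                   # probabilities, sum to 1
y_onehot = np.array([1.0, 0.0, 0.0])      # true class is class 0
cross_entropy = -(y_onehot * np.log(p_hat)).sum()
```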

Mental Models

01
Normal Equation = one-shot solve

No iterations, no learning rate, just matrix algebra. Fast while the feature count stays modest (the inversion cost grows roughly cubically with n, so it bogs down well before ~100k features); it scales linearly with the number of instances, but the whole training set must fit in memory.

02
Gradient Descent flavours trade speed vs noise

Batch is precise but slow. SGD is noisy but fast. Mini-batch is the Goldilocks default — use it.

03
Learning rate is everything in SGD

Too high → diverge. Too low → glacial. Use a schedule: high early, decay later.

04
Bias–variance is the fundamental tension

Underfitting = high bias (model too simple). Overfitting = high variance (model too complex). Regularisation buys variance reduction at the cost of a little bias.

05
Ridge for shrinkage, Lasso for sparsity

If you have 200 features and suspect only 20 matter, Lasso will zero out the rest automatically. Otherwise Ridge or Elastic Net.

06
Logistic regression outputs probability, not class

The sigmoid maps any real number to (0,1). Threshold at 0.5 by default, but you can tune the threshold separately from the model.

07
Softmax extends logistic to K classes

Each class gets its own score vector; softmax normalises to a probability distribution. Trained with cross-entropy loss.

Part of the HOML series. These notes distil Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow (3rd ed.) by Aurélien Géron. Chapter 4 lays the mathematical groundwork that powers every supervised model in later chapters.