Linear Regression: Two Paths to the Same Answer
Linear Regression seeks θ that minimises MSE(θ) = (1/m) Σ(θᵀxᵢ−yᵢ)². The Normal Equation delivers the global optimum in a single step: θ̂ = (XᵀX)⁻¹Xᵀy. In Scikit-Learn this is computed via the Moore–Penrose pseudo-inverse (SVD-based, numerically stable). Gradient descent approaches the same minimum iteratively using the gradient ∇θMSE(θ) = (2/m) Xᵀ(Xθ−y).
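A minimal NumPy sketch of both closed-form routes; the toy dataset and all variable names are illustrative, not from the book:

```python
import numpy as np

# Toy data: y = 4 + 3x + Gaussian noise (values arbitrary)
rng = np.random.default_rng(42)
m = 100
X = 2 * rng.random((m, 1))
y = 4 + 3 * X[:, 0] + rng.standard_normal(m)

X_b = np.c_[np.ones((m, 1)), X]  # prepend bias feature x0 = 1

# Normal Equation: theta_hat = (X^T X)^{-1} X^T y
theta_normal = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y

# Pseudo-inverse route (SVD-based), the numerically stable path
# that Scikit-Learn's LinearRegression relies on
theta_pinv = np.linalg.pinv(X_b) @ y
```

Both vectors should agree closely (≈ [4, 3] here); the pseudo-inverse also handles singular XᵀX, which plain inversion cannot.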
Gradient Descent: Three Flavours
All three variants follow θ ← θ − η·∇θMSE(θ), differing only in which data they use to estimate the gradient. The learning rate η drives convergence: a learning schedule (e.g. η(t) = η₀/(1 + decay·t)) helps SGD converge without manually tuning a fixed rate. A sketch of batch GD and schedule-driven SGD follows the three variant summaries below.
Batch GD
θ ← θ − η · (2/m) Xᵀ(Xθ−y)
Speed: Slow per step (full dataset)
Memory: O(m·n)
Convergence: Smooth; guaranteed to reach the global minimum on convex costs like MSE (with a suitable η)
Best for: Small–medium datasets, precise minimum
Stochastic GD
θ ← θ − η · 2xᵢᵀ(xᵢθ−yᵢ)
Speed: Very fast per step
Memory: O(1)
Convergence: Noisy; can escape local minima
Best for: Large datasets, online learning
Mini-batch GD
θ ← θ − η · (2/b) Xbᵀ(Xbθ−yb), where Xb, yb are the current mini-batch of b instances
Speed: Fast, GPU-friendly
Memory: O(b·n)
Convergence: Balanced noise/stability
Best for: Default choice in practice
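The sketch promised above: batch GD and schedule-driven SGD side by side, again on illustrative toy data (all hyperparameter values are arbitrary):

```python
import numpy as np

# Same toy setup as the Normal Equation sketch
rng = np.random.default_rng(42)
m = 100
X = 2 * rng.random((m, 1))
y = 4 + 3 * X[:, 0] + rng.standard_normal(m)
X_b = np.c_[np.ones((m, 1)), X]  # bias column

# --- Batch GD: the full-dataset gradient every step ---
eta, n_epochs = 0.1, 1000
theta = rng.standard_normal(2)
for _ in range(n_epochs):
    gradient = (2 / m) * X_b.T @ (X_b @ theta - y)
    theta -= eta * gradient

# --- SGD with the schedule eta(t) = eta0 / (1 + decay * t) ---
eta0, decay, t = 0.1, 0.01, 0
theta_sgd = rng.standard_normal(2)
for epoch in range(50):
    for _ in range(m):
        i = rng.integers(m)              # one random instance per step
        xi, yi = X_b[i], y[i]
        gradient = 2 * xi * (xi @ theta_sgd - yi)
        theta_sgd -= eta0 / (1 + decay * t) * gradient
        t += 1
```

Mini-batch GD is the same loop with a random slice of b rows in place of a single instance.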
Polynomial Regression & the Bias–Variance Trade-off
Polynomial Regression adds engineered features (x, x², …, xᵈ) then applies standard Linear Regression. Model complexity is proxied by degree d. Learning curves — plotting training vs. validation MSE against training set size — diagnose bias (curves plateau high) vs. variance (large gap between curves). Regularisation is the primary tool for shifting the trade-off.
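A minimal Scikit-Learn sketch of this feature-engineering-then-fit recipe; the degree and the X, y names are placeholders for any regression dataset with a nonlinear signal:

```python
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# degree is the complexity knob d; include_bias=False because
# LinearRegression already fits its own intercept
poly_reg = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LinearRegression(),
)
poly_reg.fit(X, y)  # X: (m, n) features, y: (m,) targets, assumed to exist
```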
High Bias (Underfitting)
Training error ≈ Validation error, both high. Model too simple to capture the signal. Fix: increase complexity or add features.
High Variance (Overfitting)
Training error low, Validation error high — large gap. Model memorises training noise. Fix: regularise, get more data, or simplify.
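A sketch of the learning-curve diagnostic using Scikit-Learn's learning_curve, reusing the poly_reg pipeline from above; the cv and size grid are arbitrary choices:

```python
import numpy as np
from sklearn.model_selection import learning_curve

train_sizes, train_scores, valid_scores = learning_curve(
    poly_reg, X, y,
    train_sizes=np.linspace(0.1, 1.0, 10),
    cv=5,
    scoring="neg_mean_squared_error",  # negated, so flip sign back to MSE
)
train_mse = -train_scores.mean(axis=1)
valid_mse = -valid_scores.mean(axis=1)
# Both curves plateau high  -> high bias (underfit)
# Large persistent gap      -> high variance (overfit)
```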
Regularisation: Taming the Model
All four techniques constrain model complexity during training. Ridge and Lasso add a penalty term to the cost function; Elastic Net blends them; Early Stopping uses validation performance as an implicit regulariser. Always scale features before applying regularised models (the penalty is magnitude-dependent).
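A hedged sketch of the three penalised models inside scaling pipelines; the alpha values and the X_train/y_train names are placeholders:

```python
from sklearn.linear_model import ElasticNet, Lasso, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Scale first: the penalty acts on raw weight magnitudes
models = {
    "ridge": make_pipeline(StandardScaler(), Ridge(alpha=1.0)),
    "lasso": make_pipeline(StandardScaler(), Lasso(alpha=0.1)),
    "elastic_net": make_pipeline(StandardScaler(),
                                 ElasticNet(alpha=0.1, l1_ratio=0.5)),
}
for name, model in models.items():
    model.fit(X_train, y_train)  # placeholder training set
```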
Ridge (L2)
Penalty: α Σθⱼ²
Shrinks all weights smoothly toward zero; never exactly zero
Key params: α (strength) — Default go-to; differentiable everywhere
Lasso (L1)
Penalty: α Σ|θⱼ|
Drives some weights to exactly zero → automatic feature selection
Key params: α (strength) — Prefer when many features are irrelevant
Elastic Net
Penalty: r·α Σ|θⱼ| + (1−r)·α Σθⱼ²
Hybrid: some sparsity + groups correlated features
Key params: α, mix ratio r — Safer than pure Lasso with correlated features
Early Stopping
Penalty: — (implicit)
Halt training when val error stops improving
Key params: patience / tolerance — Free regularisation; restore best weights on stop
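One way to implement early stopping by hand, sketched with SGDRegressor and warm_start; the hyperparameter values, the patience logic, and the X_train/X_valid names are assumptions, not the book's exact code (and penalty=None is the modern spelling; older Scikit-Learn versions use the string "none"):

```python
from copy import deepcopy
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error

# warm_start=True: each fit() call continues from the current weights,
# so one call corresponds to one extra epoch
sgd = SGDRegressor(max_iter=1, warm_start=True, penalty=None,
                   learning_rate="constant", eta0=0.001, tol=None)

best_mse, best_model, patience, bad_epochs = float("inf"), None, 20, 0
for epoch in range(1000):
    sgd.fit(X_train, y_train)                    # one more epoch
    val_mse = mean_squared_error(y_valid, sgd.predict(X_valid))
    if val_mse < best_mse:
        best_mse, best_model = val_mse, deepcopy(sgd)  # keep best weights
        bad_epochs = 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:               # val error stopped improving
            break
```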
Logistic Regression & Softmax
Logistic Regression estimates p̂ = σ(θᵀx) where σ is the sigmoid. Training minimises binary cross-entropy: J(θ) = −(1/m) Σ[yᵢ log(p̂ᵢ) + (1−yᵢ) log(1−p̂ᵢ)]. This cost is convex, so gradient descent finds the global minimum. Softmax generalises to K classes with score sₖ(x) = θₖᵀx and probability p̂ₖ = exp(sₖ)/Σⱼ exp(sⱼ). Training again minimises cross-entropy; multiclass problems can be handled either directly with softmax (multinomial) or by combining binary classifiers one-vs-rest.
Sigmoid (Binary)
σ(t) = 1 / (1 + e⁻ᵗ)
Output ∈ (0,1). Threshold at 0.5 by default; tune separately for precision/recall balance.
Softmax (Multi-class)
p̂ₖ = exp(sₖ) / Σⱼ exp(sⱼ)
Probabilities sum to 1. Cross-entropy loss = −Σₖ yₖ log(p̂ₖ). Mutually exclusive classes only.
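A minimal softmax-regression sketch on Iris; the dataset choice and the C value are illustrative, and recent Scikit-Learn versions fit multinomial (softmax) logistic regression by default when there are more than two classes:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)          # 3 classes -> softmax regression

# C is the inverse regularisation strength (larger C = weaker penalty)
softmax_clf = make_pipeline(StandardScaler(), LogisticRegression(C=10))
softmax_clf.fit(X, y)

proba = softmax_clf.predict_proba(X[:2])   # each row sums to 1
pred = softmax_clf.predict(X[:2])          # argmax over class probabilities
```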
Mental Models
Normal Equation: No iterations, no learning rate, just matrix algebra. Fast if n < ~100k features; memory-bound beyond that.
GD variants: Batch is precise but slow. SGD is noisy but fast. Mini-batch is the Goldilocks default — use it.
Learning rate: Too high → diverge. Too low → glacial. Use a schedule: high early, decay later.
Bias vs. variance: Underfitting = high bias (model too simple). Overfitting = high variance (model too complex). Regularisation buys variance reduction at the cost of a little bias.
Lasso vs. Ridge: If you have 200 features and suspect only 20 matter, Lasso will zero out the rest automatically. Otherwise Ridge or Elastic Net.
Sigmoid: Maps any real number to (0,1). Threshold at 0.5 by default, but you can tune the threshold separately from the model.
Softmax: Each class gets its own parameter vector; softmax normalises the scores to a probability distribution. Trained with cross-entropy loss.
Part of the HOML series. These notes distil Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow (3rd ed.) by Aurélien Géron. Chapter 4 lays the mathematical groundwork that powers every supervised model in later chapters.