Large Margin Classification
The hard-margin SVM seeks the hyperplane wᵀx+b=0 that maximises the geometric margin 2/‖w‖ subject to every training instance satisfying tᵢ(wᵀxᵢ+b)≥1. This is equivalent to minimising ½‖w‖². The instances lying exactly on the margin boundaries (tᵢ(wᵀxᵢ+b)=1) are the support vectors — the decision function depends on these alone.
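A minimal sketch of this setup on a tiny toy dataset (the data and C value are illustrative assumptions): scikit-learn has no explicit hard-margin mode, so a linear SVC with a very large C approximates it, and the fitted w and b can be checked against the constraints above.

```python
import numpy as np
from sklearn.svm import SVC

# Two tiny, clearly separable clusters (illustrative data only)
X = np.array([[0.0, 0.0], [0.5, 0.5], [3.0, 3.0], [3.5, 2.5]])
t = np.array([-1, -1, 1, 1])

svm = SVC(kernel="linear", C=1e6)   # huge C ≈ hard margin
svm.fit(X, t)

w, b = svm.coef_[0], svm.intercept_[0]
print("margin width 2/||w|| =", 2 / np.linalg.norm(w))
print("support vectors:", svm.support_vectors_)      # points with t(wᵀx+b) ≈ 1
print("constraints t(wᵀx+b):", t * (X @ w + b))      # all ≥ 1
```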
Hard Margin
No margin violations allowed. Requires linearly separable data. A single outlier can ruin the margin entirely. Not practical for noisy real-world data.
Soft Margin (C)
Allows violations with slack variables ξᵢ. C penalises violations: low C = wide margin, more margin violations; high C = narrow margin, fewer violations. Tune C with cross-validation.
⚠ Always scale features before training an SVM. SVMs are sensitive to feature scale — a feature with range [0, 10000] will dominate margin calculations over one with range [0, 1]. Use StandardScaler or MinMaxScaler inside a Pipeline.
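A minimal soft-margin sketch, assuming the breast-cancer dataset bundled with scikit-learn and an illustrative C = 1, with scaling done inside a Pipeline as the warning above recommends:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

X, y = load_breast_cancer(return_X_y=True)

clf = make_pipeline(
    StandardScaler(),        # scale first: SVMs are scale-sensitive
    LinearSVC(C=1.0),        # soft margin; tune C by cross-validation
)
clf.fit(X, y)
print("training accuracy:", clf.score(X, y))
```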
The Kernel Trick
The dual form of the SVM depends on training instances only through dot products xᵢᵀxⱼ. Replacing this with a kernel function K(xᵢ, xⱼ) = φ(xᵢ)ᵀφ(xⱼ) implicitly computes dot products in a transformed space φ without ever computing φ explicitly. Mercer's theorem guarantees this works for any positive semi-definite kernel.
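A minimal numerical check of this idea, assuming a hand-written degree-2 feature map φ: the polynomial kernel value computed in the original 2-D space matches the dot product computed in the explicit 6-D transformed space.

```python
import numpy as np

def phi(x):
    # explicit degree-2 feature map matching K(a,b) = (aᵀb + 1)²  (γ=1, r=1, d=2)
    x1, x2 = x
    return np.array([1, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1**2, x2**2, np.sqrt(2) * x1 * x2])

a, b = np.array([1.0, 2.0]), np.array([3.0, -1.0])
kernel = (a @ b + 1) ** 2      # computed in the original 2-D space
explicit = phi(a) @ phi(b)     # computed in the transformed 6-D space
print(kernel, explicit)        # identical values
```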
Linear Kernel
K(a,b) = aᵀb
Use when: Linearly separable data, large datasets
Key params: C
Use LinearSVC — much faster O(m×n)
Polynomial Kernel
K(a,b) = (γaᵀb + r)ᵈ
Use when: Polynomial feature relationships
Key params: degree d, coef0 r, C
coef0 (r) controls the influence of high-degree vs. low-degree terms
RBF / Gaussian Kernel
K(a,b) = exp(−γ‖a−b‖²)
Use when: Default choice for unknown structure
Key params: γ (1/2σ²), C
High γ = narrow bell → overfitting risk
Sigmoid Kernel
K(a,b) = tanh(γaᵀb + r)
Use when: Neural-network-like decision boundaries
Key params: γ, coef0 r, C
Less common; behaves like 2-layer NN
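A sketch of how the four kernel cards above map onto scikit-learn estimators; the parameter values are placeholders rather than recommendations.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC, LinearSVC

models = {
    "linear (LinearSVC)": LinearSVC(C=1.0),
    "polynomial": SVC(kernel="poly", degree=3, coef0=1, gamma="scale", C=1.0),
    "rbf": SVC(kernel="rbf", gamma="scale", C=1.0),
    "sigmoid": SVC(kernel="sigmoid", gamma="scale", coef0=0.0, C=1.0),
}

# Always scale inside a pipeline before the SVM step
pipelines = {name: make_pipeline(StandardScaler(), model)
             for name, model in models.items()}
```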
LinearSVC vs. SVC: Complexity Trade-offs
LinearSVC (liblinear) optimises the linear SVM in its primal or dual form, so training scales as roughly O(m×n). SVC (via libsvm) must solve a quadratic programming problem whose cost grows between O(m²·n) and O(m³·n) depending on the data and the tolerance. For m > ~10k instances, LinearSVC is strongly preferred for linear boundaries; for non-linear boundaries at that scale, consider kernel approximations (RBFSampler + SGD; see the sketch after the comparison below).
LinearSVC
Train: O(m × n) | Predict: O(n) | Kernels: Linear only
Best for large/sparse
SVC (linear)
Train: O(m² × n) – O(m³ × n) | Predict: O(n_sv × n) | Kernels: Linear + kernels
Slow when m > ~10k
SVC (RBF)
Train: O(m² × n) – O(m³ × n) | Predict: O(n_sv × n) | Kernels: Any kernel
n_sv ≈ 20–60% of m
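A sketch of the kernel-approximation route mentioned above, assuming synthetic data and illustrative parameters: RBFSampler produces random Fourier features approximating the RBF map, and SGDClassifier with hinge loss then trains a linear SVM on them in roughly linear time.

```python
from sklearn.datasets import make_classification
from sklearn.kernel_approximation import RBFSampler
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=50_000, n_features=20, random_state=42)

approx_rbf_svm = make_pipeline(
    StandardScaler(),
    RBFSampler(gamma=0.1, n_components=300, random_state=42),  # random Fourier features
    SGDClassifier(loss="hinge", random_state=42),               # hinge loss ≈ linear SVM
)
approx_rbf_svm.fit(X, y)
print("training accuracy:", approx_rbf_svm.score(X, y))
```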
SVM Regression
SVR reverses the margin objective: it tries to fit as many instances as possible inside the ε-insensitive tube (|yᵢ−ŷᵢ| ≤ ε) while minimising ‖w‖². Instances on or outside the tube are the support vectors that anchor the regression line. Adding more training instances inside the tube does not change the predictions, which is why the model is said to be ε-insensitive.
ε-Insensitive Tube
Points inside the tube: no penalty. Points outside by δ: penalty = δ (L1-style). The tube width 2ε is a hyperparameter — wider tube = more tolerance = simpler model. Narrowing ε raises sensitivity to residuals.
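A minimal SVR sketch, assuming noisy synthetic data and illustrative C and ε values: the instances on or outside the tube show up as support vectors.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(200, 1))
y = 0.5 * X.ravel() ** 2 + rng.normal(scale=0.1, size=200)   # noisy quadratic target

svr = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10, epsilon=0.1))
svr.fit(X, y)

n_sv = len(svr[-1].support_)                                  # points on/outside the tube
print(f"{n_sv} support vectors out of {len(X)} instances")
```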
Hyperparameter Guide
C: Low C = wide margin (more violations allowed). High C = hard-margin behaviour (few violations). Controls bias–variance trade-off.
💡 Start at 1.0; tune via cross-val. Use smaller C when data is noisy.
γ (gamma): Low γ = wide Gaussian bell → smoother boundary. High γ = narrow bell → complex boundary, overfits.
💡 If overfit: lower γ. If underfit: raise γ. Always scale features first.
degree (polynomial kernel): Controls curve complexity. Degree 2–5 typical; higher risks overfitting.
💡 Combine with small C for regularisation. Use grid search.
ε (epsilon, SVR): Half-width of the insensitive tube. Points inside the tube incur no penalty.
💡 Set ≈ expected noise level in labels.
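A minimal cross-validated grid-search sketch over C and γ for the RBF kernel, assuming the breast-cancer dataset and an illustrative starting grid:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
param_grid = {
    "svc__C": [0.1, 1, 10, 100],
    "svc__gamma": [0.001, 0.01, 0.1, 1],
}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```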
Mental Models
SVM finds the widest road between two classes. The road edges are the margin; the support vectors are the houses right on the kerb. Wider road = better generalisation.
High C: "I refuse to misclassify anything, even if the margin shrinks." Low C: "Allow some errors to keep the margin wide." Low C regularises more.
The kernel trick computes dot products in a transformed (potentially infinite-dimensional) space without ever computing the transformation explicitly; the only extra cost is evaluating the kernel function on the original inputs.
Each support vector casts a Gaussian "vote." High γ = tight votes (complex boundary). Low γ = broad votes (smooth boundary). Think of it as the reach of each training instance.
SVMs are not scale-invariant. An unscaled feature with range [0, 10000] dominates the margin computation. StandardScaler first, every time.
Platt scaling (probability=True in sklearn) fits a logistic calibration model post hoc, but it is slow because it is cross-validated internally (see the sketch after this list). When you mainly need probabilities, Logistic Regression is often faster.
Instead of maximising the margin between classes, SVR tries to fit as many points as possible inside the ε-tube. Points outside the tube are the "support vectors" that define the regression line.
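A minimal sketch of Platt scaling via probability=True, assuming the breast-cancer dataset: fitting is slower because of the internal cross-validation, but predict_proba then returns calibrated class probabilities.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", probability=True))
clf.fit(X, y)                      # slower: internal CV fits the calibration model
print(clf.predict_proba(X[:3]))    # calibrated class probabilities
```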
Part of the HOML series. These notes distil Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow (3rd ed.) by Aurélien Géron. SVMs are particularly strong on small-to-medium datasets with well-defined class boundaries.