Large Margin Classification
The hard-margin SVM seeks the hyperplane wᵀx+b=0 that maximises the geometric margin 2/‖w‖ subject to every training instance satisfying tᵢ(wᵀxᵢ+b)≥1. This is equivalent to minimising ½‖w‖². The instances lying exactly on the margin boundaries (tᵢ(wᵀxᵢ+b)=1) are the support vectors — the decision function depends on these alone.
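A minimal sketch of this setup on a tiny toy dataset (the data and C value are illustrative assumptions): scikit-learn has no explicit hard-margin mode, so a linear SVC with a very large C approximates it, and the fitted w and b can be checked against the constraints above.

```python
import numpy as np
from sklearn.svm import SVC

# Two tiny, clearly separable clusters (illustrative data only)
X = np.array([[0.0, 0.0], [0.5, 0.5], [3.0, 3.0], [3.5, 2.5]])
t = np.array([-1, -1, 1, 1])

svm = SVC(kernel="linear", C=1e6)   # huge C ≈ hard margin
svm.fit(X, t)

w, b = svm.coef_[0], svm.intercept_[0]
print("margin width 2/||w|| =", 2 / np.linalg.norm(w))
print("support vectors:", svm.support_vectors_)      # points with t(wᵀx+b) ≈ 1
print("constraints t(wᵀx+b):", t * (X @ w + b))      # all ≥ 1
```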
Hard Margin
No margin violations allowed. Requires linearly separable data. A single outlier can ruin the margin entirely. Not practical for noisy real-world data.
Soft Margin (C)
Allows violations with slack variables ξᵢ. C penalises violations: low C = wide margin, more margin violations; high C = narrow margin, fewer violations. Tune C with cross-validation.
⚠ Always scale features before training an SVM. SVMs are sensitive to feature scale — a feature with range [0, 10000] will dominate margin calculations over one with range [0, 1]. Use StandardScaler or MinMaxScaler inside a Pipeline.
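A minimal soft-margin sketch, assuming the breast-cancer dataset bundled with scikit-learn and an illustrative C = 1, with scaling done inside a Pipeline as the warning above recommends:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

X, y = load_breast_cancer(return_X_y=True)

clf = make_pipeline(
    StandardScaler(),        # scale first: SVMs are scale-sensitive
    LinearSVC(C=1.0),        # soft margin; tune C by cross-validation
)
clf.fit(X, y)
print("training accuracy:", clf.score(X, y))
```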
The Kernel Trick
The dual form of the SVM depends on training instances only through dot products xᵢᵀxⱼ. Replacing this with a kernel function K(xᵢ, xⱼ) = φ(xᵢ)ᵀφ(xⱼ) implicitly computes dot products in a transformed space φ without ever computing φ explicitly. Mercer's theorem guarantees this works for any positive semi-definite kernel.
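A minimal numerical check of this idea, assuming a hand-written degree-2 feature map φ: the polynomial kernel value computed in the original 2-D space matches the dot product computed in the explicit 6-D transformed space.

```python
import numpy as np

def phi(x):
    # explicit degree-2 feature map matching K(a,b) = (aᵀb + 1)²  (γ=1, r=1, d=2)
    x1, x2 = x
    return np.array([1, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1**2, x2**2, np.sqrt(2) * x1 * x2])

a, b = np.array([1.0, 2.0]), np.array([3.0, -1.0])
kernel = (a @ b + 1) ** 2      # computed in the original 2-D space
explicit = phi(a) @ phi(b)     # computed in the transformed 6-D space
print(kernel, explicit)        # identical values
```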
Linear Kernel
K(a,b) = aᵀb
Use when: Linearly separable data, large datasets
Key params: C
Use LinearSVC — much faster O(m×n)
Polynomial Kernel
K(a,b) = (γaᵀb + r)ᵈ
Use when: Polynomial feature relationships
Key params: degree d, coef0 r, C
coef0 (r) controls the influence of high-degree vs. low-degree terms
RBF / Gaussian Kernel
K(a,b) = exp(−γ‖a−b‖²)
Use when: Default choice for unknown structure
Key params: γ (1/2σ²), C
High γ = narrow bell → overfitting risk
Sigmoid Kernel
K(a,b) = tanh(γaᵀb + r)
Use when: Neural-network-like decision boundaries
Key params: γ, coef0 r, C
Less common; behaves like 2-layer NN
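A sketch of how the four kernel cards above map onto scikit-learn estimators; the parameter values are placeholders rather than recommendations.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC, LinearSVC

models = {
    "linear (LinearSVC)": LinearSVC(C=1.0),
    "polynomial": SVC(kernel="poly", degree=3, coef0=1, gamma="scale", C=1.0),
    "rbf": SVC(kernel="rbf", gamma="scale", C=1.0),
    "sigmoid": SVC(kernel="sigmoid", gamma="scale", coef0=0.0, C=1.0),
}

# Always scale inside a pipeline before the SVM step
pipelines = {name: make_pipeline(StandardScaler(), model)
             for name, model in models.items()}
```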
LinearSVC vs. SVC: Complexity Trade-offs
LinearSVC (liblinear) optimises the linear SVM in its primal or dual form, so training scales as roughly O(m×n). SVC (via libsvm) must solve a quadratic programming problem whose cost grows between O(m²·n) and O(m³·n) depending on the data and the tolerance. For m > ~10k instances, LinearSVC is strongly preferred for linear boundaries; for non-linear boundaries at that scale, consider kernel approximations (RBFSampler + SGD; see the sketch after the comparison below).
LinearSVC
Train: O(m × n) | Predict: O(n) | Kernels: Linear only
Best for large/sparse
SVC (linear)
Train: O(m² × n) – O(m³ × n) | Predict: O(n_sv × n) | Kernels: Linear + kernels
Slow when m > ~10k
SVC (RBF)
Train: O(m² × n) – O(m³ × n) | Predict: O(n_sv × n) | Kernels: Any kernel
n_sv ≈ 20–60% of m
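A sketch of the kernel-approximation route mentioned above, assuming synthetic data and illustrative parameters: RBFSampler produces random Fourier features approximating the RBF map, and SGDClassifier with hinge loss then trains a linear SVM on them in roughly linear time.

```python
from sklearn.datasets import make_classification
from sklearn.kernel_approximation import RBFSampler
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=50_000, n_features=20, random_state=42)

approx_rbf_svm = make_pipeline(
    StandardScaler(),
    RBFSampler(gamma=0.1, n_components=300, random_state=42),  # random Fourier features
    SGDClassifier(loss="hinge", random_state=42),               # hinge loss ≈ linear SVM
)
approx_rbf_svm.fit(X, y)
print("training accuracy:", approx_rbf_svm.score(X, y))
```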
SVM Regression
SVR reverses the margin objective: it tries to fit as many instances as possible inside the ε-insensitive tube (|yᵢ−ŷᵢ| ≤ ε) while minimising ‖w‖². Instances on or outside the tube are the support vectors that anchor the regression line. Adding more training instances inside the tube does not change the predictions, which is why the model is said to be ε-insensitive.
ε-Insensitive Tube
Points inside the tube: no penalty. Points outside by δ: penalty = δ (L1-style). The tube width 2ε is a hyperparameter — wider tube = more tolerance = simpler model. Narrowing ε raises sensitivity to residuals.
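A minimal SVR sketch, assuming noisy synthetic data and illustrative C and ε values: the instances on or outside the tube show up as support vectors.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(200, 1))
y = 0.5 * X.ravel() ** 2 + rng.normal(scale=0.1, size=200)   # noisy quadratic target

svr = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10, epsilon=0.1))
svr.fit(X, y)

n_sv = len(svr[-1].support_)                                  # points on/outside the tube
print(f"{n_sv} support vectors out of {len(X)} instances")
```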
Hyperparameter Guide
C: Low C = wide margin (more violations allowed). High C = hard-margin behaviour (few violations). Controls bias–variance trade-off.
💡 Start at 1.0; tune via cross-val. Use smaller C when data is noisy.
γ (gamma): Low γ = wide Gaussian bell → smoother boundary. High γ = narrow bell → complex boundary, overfits.
💡 If overfit: lower γ. If underfit: raise γ. Always scale features first.
degree (polynomial kernel): Controls curve complexity. Degree 2–5 typical; higher risks overfitting.
💡 Combine with small C for regularisation. Use grid search.
ε (epsilon, SVR): Half-width of the insensitive tube. Points inside the tube incur no penalty.
💡 Set ≈ expected noise level in labels.
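A minimal cross-validated grid-search sketch over C and γ for the RBF kernel, assuming the breast-cancer dataset and an illustrative starting grid:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
param_grid = {
    "svc__C": [0.1, 1, 10, 100],
    "svc__gamma": [0.001, 0.01, 0.1, 1],
}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```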
Mental Models
SVM finds the widest road between two classes. The road edges are the margin; the support vectors are the houses right on the kerb. Wider road = better generalisation.
High C: "I refuse to misclassify anything, even if the margin shrinks." Low C: "Allow some errors to keep the margin wide." Low C regularises more.
The kernel trick computes dot products in a transformed (potentially infinite-dimensional) space without ever computing the transformation explicitly; the only extra cost is evaluating the kernel function on the original inputs.
Each support vector casts a Gaussian "vote." High γ = tight votes (complex boundary). Low γ = broad votes (smooth boundary). Think of it as the reach of each training instance.
SVMs are not scale-invariant. An unscaled feature with range [0, 10000] dominates the margin computation. StandardScaler first, every time.
Platt scaling (probability=True in sklearn) fits a logistic calibration model post hoc, but it is slow because it is cross-validated internally (see the sketch after this list). When you mainly need probabilities, Logistic Regression is often faster.
Instead of maximising the margin between classes, SVR tries to fit as many points as possible inside the ε-tube. Points outside the tube are the "support vectors" that define the regression line.
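A minimal sketch of Platt scaling via probability=True, assuming the breast-cancer dataset: fitting is slower because of the internal cross-validation, but predict_proba then returns calibrated class probabilities.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", probability=True))
clf.fit(X, y)                      # slower: internal CV fits the calibration model
print(clf.predict_proba(X[:3]))    # calibrated class probabilities
```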
Part of the HOML series. These notes distil Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow (3rd ed.) by Aurélien Géron. SVMs are particularly strong on small-to-medium datasets with well-defined class boundaries.