Machine Learning · HOML Chapter 1 · March 2026 · 14 min read

The Machine Learning Landscape: A Map Before the Journey

A structured overview of the ML landscape drawn from Géron's Hands-On ML (3rd ed.), covering the taxonomy of learning systems (supervised, unsupervised, self-supervised, reinforcement) alongside the batch/online and instance/model-based axes. Includes a curated treatment of the main failure modes (overfitting, underfitting, data quality) and validation strategies.

What Is Machine Learning, Actually?

Tom Mitchell's 1997 definition remains the most precise: a program learns from experience E with respect to task T and performance measure P if its performance on T, as measured by P, improves with E. The key implication is that simply having more data is not machine learning. Downloading all of Wikipedia doesn't make a computer smarter at any particular task; what matters is whether performance on a defined task measurably improves as a result of exposure to that data.

The key insight on "why ML": Traditional rule-based systems require you to manually update rules every time the world changes. A spam filter built on hand-coded rules needs a human engineer every time spammers evolve their language. An ML-based filter adapts automatically: it notices that "For U" suddenly appears in flagged emails and starts catching it without human intervention. ML also unlocks problems where no algorithm is even conceivable upfront, like speech recognition or protein folding.

The Three Axes of Classification

Every ML system can be described along three independent axes. Importantly, these aren't mutually exclusive: a state-of-the-art spam filter might simultaneously be supervised, online, and model-based.

Axis 1 – Training Supervision

There are five main supervision types, each suited to different problem structures:

Supervised

Labelled training data. Two canonical tasks: classification (discrete output, e.g. spam/ham) and regression (continuous output, e.g. a house price). Algorithms include logistic regression, SVMs, decision trees, random forests, and neural networks.
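
A minimal sketch of both tasks, using scikit-learn's shared fit/predict API; the feature values here are invented for illustration:

    from sklearn.linear_model import LogisticRegression, LinearRegression

    # Classification: discrete output (spam = 1, ham = 0).
    X_emails = [[0.9, 3], [0.1, 0], [0.8, 5], [0.2, 1]]  # invented [link_ratio, n_exclamations]
    y_spam = [1, 0, 1, 0]
    clf = LogisticRegression().fit(X_emails, y_spam)
    clf.predict([[0.7, 4]])           # close to the spam examples -> likely 1

    # Regression: continuous output (price in $1000s).
    X_houses = [[1200], [1500], [2000]]   # invented [square_feet]
    y_price = [200, 250, 330]
    reg = LinearRegression().fit(X_houses, y_price)
    reg.predict([[1800]])             # roughly 298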

Unsupervised

No labels. Key tasks: clustering (k-means, DBSCAN, hierarchical); dimensionality reduction (PCA, t-SNE), which reduces the feature count by merging correlated features into composites (a form of feature extraction); anomaly detection (isolation forest, autoencoders); and association rule learning.
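
A sketch of the first two task families; the random matrix below stands in for real, unlabelled measurements:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(42)
    X = rng.normal(size=(200, 5))     # 200 unlabelled instances, 5 features

    # Clustering: group instances without any labels.
    clusters = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

    # Dimensionality reduction: project 5 features onto 2 composite axes.
    X_2d = PCA(n_components=2).fit_transform(X)   # shape (200, 2)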

Semi-supervised

Most instances unlabelled, few labelled. Google Photos face recognition is a canonical example: clustering groups faces (unsupervised), then a single label per cluster propagates to all instances. Most algorithms here are hybrids.
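
A minimal sketch of the propagation idea using scikit-learn's LabelSpreading, with synthetic blobs standing in for face clusters and one labelled instance per cluster (-1 marks unlabelled):

    import numpy as np
    from sklearn.datasets import make_blobs
    from sklearn.semi_supervised import LabelSpreading

    X, y_true = make_blobs(n_samples=100, centers=2, random_state=0)
    y = np.full(100, -1)                        # start fully unlabelled
    for c in (0, 1):                            # label one instance per cluster
        y[np.where(y_true == c)[0][0]] = c

    model = LabelSpreading(kernel="knn", n_neighbors=7).fit(X, y)
    (model.transduction_ == y_true).mean()      # fraction of labels recovered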

Self-supervised

Generates its own labels from unlabelled data, e.g. masking part of an image and training the model to reconstruct it. The resulting model can be fine-tuned for downstream tasks. This is the pretraining paradigm behind most modern large models. Not the same as unsupervised: it still uses labels during training, just synthetic ones.
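
The shape of that setup, sketched in a few lines; random noise stands in for real image patches, so nothing meaningful can actually be reconstructed here, but it shows that the targets are generated from the data itself:

    import numpy as np
    from sklearn.neural_network import MLPRegressor

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 20))    # stand-in for flattened image patches

    X_masked = X.copy()
    X_masked[:, 10:] = 0.0            # mask out half of each "image"

    # The labels are synthesised from the data: reconstruct X from X_masked.
    model = MLPRegressor(hidden_layer_sizes=(32,), max_iter=500).fit(X_masked, X)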

Reinforcement

Agent observes environment, selects actions via a policy, receives rewards or penalties, updates policy. Goal: maximise cumulative reward over time. AlphaGo is the canonical example: it learned by playing millions of games against itself, not by following programmed rules.
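
The loop itself is compact. This skeleton uses hypothetical env, policy, and learn objects, not any real library's API, purely to show the structure:

    def run_episode(env, policy, learn):
        obs = env.reset()                        # observe the environment
        total_reward, done = 0.0, False
        while not done:
            action = policy(obs)                 # select action via the policy
            obs_next, reward, done = env.step(action)
            learn(obs, action, reward, obs_next) # update the policy
            total_reward += reward
            obs = obs_next
        return total_reward                      # the quantity to maximise over time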

Axis 2 – Batch vs. Online Learning

Batch (offline) learning trains on the full dataset at once, then freezes the model for deployment. The main risk is model rot (also called data drift): the world evolves but the model doesn't. Solutions involve periodic full retrains, which are resource-intensive and impractical at extreme data scales or on constrained hardware (mobile devices, Mars rovers).

Online learning trains incrementally on mini-batches as data arrives. The critical parameter is the learning rate: high → adapts fast but forgets old patterns quickly; low → stable but slow to adapt. The major operational risk is that bad input data (sensor faults, adversarial manipulation) immediately degrades live model performance, so monitoring and anomaly detection on the input stream become essential.
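
In scikit-learn, online learning is a loop of partial_fit calls over mini-batches. A sketch on a synthetic stream, where eta0 plays the learning-rate role:

    import numpy as np
    from sklearn.linear_model import SGDRegressor

    model = SGDRegressor(learning_rate="constant", eta0=0.01)
    true_w = np.array([1.0, -2.0, 0.5, 3.0])

    rng = np.random.default_rng(0)
    for _ in range(200):                        # pretend this is a live stream
        X_batch = rng.normal(size=(32, 4))      # mini-batch of 32 instances
        y_batch = X_batch @ true_w + rng.normal(scale=0.1, size=32)
        model.partial_fit(X_batch, y_batch)     # incremental update, no full retrain

    model.coef_    # should approach true_w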

Axis 3 – Instance-Based vs. Model-Based Generalisation

Instance-based systems memorise the training data and generalise by measuring similarity to known instances. k-Nearest Neighbours is the archetype: classification is determined by a majority vote among the k closest points in feature space. Simple, interpretable, but expensive at inference time (must compare to the full training set).

Model-based systems extract parameters from the training data via a training algorithm and discard the raw data at inference time. The workflow is: select model type → define a cost function (or utility function) → run a training algorithm to minimise that cost → apply the fitted model for inference. The Géron example, fitting a linear model of life satisfaction as a function of GDP per capita using sklearn.linear_model.LinearRegression, is the simplest possible instantiation of this workflow.
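
A sketch of that workflow; the numbers below are invented placeholders, not the OECD figures the book actually loads:

    from sklearn.linear_model import LinearRegression

    X = [[23_500.0], [37_650.0], [55_900.0]]   # GDP per capita (USD, invented)
    y = [5.9, 6.8, 7.3]                        # life satisfaction score (invented)

    model = LinearRegression()      # 1. select model type (cost: squared error)
    model.fit(X, y)                 # 2. training algorithm minimises the cost
    model.predict([[31_000.0]])    # 3. inference for a new country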

What Can Go Wrong: The Seven Failure Modes

Everything that can go wrong in ML falls into two buckets: bad data and bad algorithms. In practice, bad data is responsible for the vast majority of real-world ML failures.

📉 Insufficient Training Data

The Banko & Brill (2001) study showed that very different algorithms converged on near-identical performance once given enough data, suggesting that for complex tasks, corpus size matters more than algorithm choice. That said, small/medium datasets remain the norm in most real applications.

βš–οΈNonrepresentative Training Data

Sampling noise (small sample → unrepresentative by chance) vs. sampling bias (flawed method → systematically unrepresentative regardless of size). The 1936 Literary Digest poll is the classic example: 2.4M responses, catastrophically wrong prediction, because the sampling frame overrepresented wealthy, Republican-leaning voters.

πŸ—‘οΈPoor-Quality Data

Handling strategies: discard or manually fix clear outliers; for missing features, either drop the attribute, drop affected instances, impute (e.g., median fill), or train two models, one with and one without the feature. Most production data pipelines spend more time here than anywhere else.
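
The median-fill strategy, for instance, is a one-liner with scikit-learn's SimpleImputer:

    import numpy as np
    from sklearn.impute import SimpleImputer

    X = np.array([[1.0, 2.0],
                  [np.nan, 3.0],    # missing feature value
                  [7.0, 6.0]])

    SimpleImputer(strategy="median").fit_transform(X)   # nan -> 4.0, the column median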

🔎 Irrelevant Features

Feature engineering: feature selection (choose the most predictive subset), feature extraction (combine correlated features, e.g. mileage + age → "wear and tear", via dimensionality reduction), and creating new features from external data. The quality of features often dominates the choice of algorithm.
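
A sketch of the mileage + age example using PCA on synthetic data; the features are standardised first so mileage's larger scale doesn't dominate the composite:

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(1)
    age = rng.uniform(0, 15, size=200)                          # years
    mileage = 12_000 * age + rng.normal(scale=5_000, size=200)  # correlated with age
    X = StandardScaler().fit_transform(np.column_stack([mileage, age]))

    wear = PCA(n_components=1).fit_transform(X)   # one composite "wear and tear" feature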

📈 Overfitting

Model is too complex relative to training data size/noisiness. It fits the noise, not the signal. Solutions: simpler model (fewer parameters), more training data, noise reduction, or regularisation: constraining parameters to reduce degrees of freedom. The regularisation amount is controlled by a hyperparameter set before training.
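
With ridge regression, for example, that hyperparameter is alpha; the data and values below are arbitrary:

    from sklearn.linear_model import Ridge

    X = [[1.0], [2.0], [3.0], [4.0]]
    y = [1.1, 1.9, 3.2, 3.9]

    loose = Ridge(alpha=0.001).fit(X, y)   # nearly unconstrained
    tight = Ridge(alpha=100.0).fit(X, y)   # heavily constrained
    loose.coef_, tight.coef_               # the tight model's slope shrinks toward 0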

📊 Underfitting

Model is too constrained to represent the underlying structure. Solutions: more powerful model (more parameters), better features (feature engineering), or reduce regularisation. Underfitting is generally easier to diagnose than overfitting: training error is already high.

🔀 Data Mismatch

When training data is easy to obtain but not representative of production data (e.g., web images vs. mobile app photos). Solution: reserve a train-dev set from the same distribution as the training data. If the model performs poorly on train-dev, it is overfitting the training set. If it performs well on train-dev but poorly on validation, the gap is data mismatch: the model never saw production-like data during training.
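
A sketch of the diagnostic, with synthetic stand-ins for plentiful "web" data and scarce production-like "app" data (the synthetic shift here is benign, so both scores come out high; the comments show how each score would be read):

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    web_X = rng.normal(size=(1_000, 3)); web_y = web_X @ [1.0, 2.0, 3.0]          # plentiful
    app_X = rng.normal(loc=2.0, size=(100, 3)); app_y = app_X @ [1.0, 2.0, 3.0]   # production-like

    # Hold out a train-dev set from the SAME distribution as the training data.
    train_X, traindev_X, train_y, traindev_y = train_test_split(
        web_X, web_y, test_size=0.1, random_state=42)

    model = LinearRegression().fit(train_X, train_y)
    model.score(traindev_X, traindev_y)   # if poor here          -> overfitting
    model.score(app_X, app_y)             # if poor only here     -> data mismatch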

Testing and Validating: Don't Trust a Model You Haven't Challenged

Training error tells you how well the model fits known data. What you actually care about is generalisation error (out-of-sample error): performance on new instances. The canonical split is 80% training / 20% test. If training error is low but generalisation error is high, the model is overfitting. The test set is sacred: evaluate on it only once, after all decisions are made.
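
The split itself is a single call in scikit-learn (placeholder data):

    import numpy as np
    from sklearn.model_selection import train_test_split

    X, y = np.arange(100).reshape(50, 2), np.arange(50)   # placeholder data
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)             # 80% train / 20% test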

The Validation Set Trap

If you use the test set to choose between models or tune hyperparameters, you've leaked information: you've adapted your model to that particular set, so its error estimate is no longer honest. The solution is holdout validation: carve a validation set (also called dev set) from the training data. Train multiple candidate models on the reduced training set, pick the best on the validation set, retrain that model on the full training set (training + validation), then evaluate once on the test set.
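
The full procedure in miniature, with two arbitrary candidates on synthetic data:

    import numpy as np
    from sklearn.linear_model import LinearRegression, Ridge
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 5))
    y = X @ rng.normal(size=5) + rng.normal(scale=0.1, size=500)

    X_trainval, X_test, y_trainval, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)
    X_train, X_val, y_train, y_val = train_test_split(
        X_trainval, y_trainval, test_size=0.25, random_state=42)

    candidates = [LinearRegression(), Ridge(alpha=1.0)]
    best = max(candidates, key=lambda m: m.fit(X_train, y_train).score(X_val, y_val))

    best.fit(X_trainval, y_trainval)   # retrain the winner on training + validation
    best.score(X_test, y_test)         # evaluate ONCE on the test set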

If your validation set is too small, model selection is noisy. If too large, your reduced training set is too small to properly train candidates (like selecting a marathon runner based on sprint times). The solution is cross-validation: split training data into k folds, train k models each using k-1 folds and validating on the remainder, then average performance across folds. More accurate, but k× the training cost.
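
In scikit-learn this is one call (synthetic data again):

    import numpy as np
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 4)); y = X @ [1.0, -1.0, 2.0, 0.5]

    scores = cross_val_score(Ridge(), X, y, cv=5)   # k = 5: five fits, five held-out folds
    scores.mean(), scores.std()                     # average performance across folds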

The Mental Models Worth Keeping

1. ML isn't magic: it's optimisation.

Every ML system is fundamentally searching for parameter values that minimise a cost function on a training dataset, then hoping those parameters generalise. Everything else is scaffolding around this core loop.

2. Data quality beats algorithm sophistication.

The Banko & Brill finding (that diverse algorithms converge once given enough data) is humbling. In most projects, cleaning and curating the training data returns more than swapping a random forest for a gradient boosted tree.

3. Overfitting is about complexity relative to data, not complexity in absolute terms.

A high-degree polynomial is not inherently bad. It's bad when the training set is too small or too noisy to support that many parameters. More data or stronger regularisation can make the same model safe.

4. The test set is a one-shot instrument.

Every time you look at test set performance and change something, you contaminate it. The test set estimates generalisation error only if the model has never been adapted to it, directly or indirectly.

5. Your categories are not the data's categories.

The data you can get for training is rarely perfectly representative of the production distribution. This gap (data mismatch) is a distinct failure mode from overfitting and requires its own mitigation strategy.

HOML Reading Notes – Chapter 1 of 19

This is the first article in a chapter-by-chapter reading companion for Aurélien Géron's Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow (3rd Edition). Each article distils the key insights, mental models, and practical patterns from each chapter. Chapter 2 will cover the end-to-end ML project workflow, from data sourcing and exploratory analysis through model selection, fine-tuning, and deployment.