Machine Learning · HOML Chapter 3 · April 2026 · 15 min read

Classification: Beyond Accuracy – The Full Evaluator's Toolkit

Chapter 3 distilled: the MNIST dataset as a pedagogical vehicle for binary, multiclass, multilabel, and multioutput classification. The chapter's real contribution is its thorough treatment of evaluation metrics and the insight that accuracy systematically misleads on imbalanced datasets.

The MNIST Playground

MNIST: 70,000 images, 28×28 = 784 features per instance, 10 classes. Standard split: 60,000 training / 10,000 test. The dataset is already shuffled, so cross-validation folds are IID. Chapter 3 creates a binary classification task ("is this a 5?") to explore metrics in depth before generalising to multiclass and multilabel settings. The base classifier is SGDClassifier (linear, stochastic gradient descent, handles large datasets well).
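A minimal setup sketch in scikit-learn, establishing the names (sgd_clf, y_train_5, etc.) reused in the snippets below; details such as random_state=42 are illustrative choices, not prescriptions from the chapter.

from sklearn.datasets import fetch_openml
from sklearn.linear_model import SGDClassifier

# 70,000 images of 28x28 pixels, flattened to 784 features each.
mnist = fetch_openml("mnist_784", as_frame=False)
X, y = mnist.data, mnist.target

# The dataset ships pre-shuffled: first 60,000 train, last 10,000 test.
X_train, X_test = X[:60000], X[60000:]
y_train, y_test = y[:60000], y[60000:]

# Binary task: "is this digit a 5?"
y_train_5 = (y_train == "5")
y_test_5 = (y_test == "5")

sgd_clf = SGDClassifier(random_state=42)
sgd_clf.fit(X_train, y_train_5)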

The Accuracy Trap

Accuracy seems like the obvious metric. It's not. On the MNIST "5 detector", only ~10% of images are actually 5s. A classifier that always predicts "not 5" achieves over 90% accuracy (better than many real models) while being completely worthless.

The imbalance problem in numbers: cross_val_score with accuracy scoring reports ~96% for the SGDClassifier, and ~91% for DummyClassifier(strategy="most_frequent"), which always predicts "not 5". The SGD model is barely 5 points better than never predicting a 5 at all, which is why accuracy is generally not the preferred metric on skewed datasets.
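A sketch of that comparison, reusing sgd_clf, X_train, and y_train_5 from the setup above; exact scores vary slightly from run to run.

from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# SGD "5 detector": roughly 96% cross-validated accuracy.
cross_val_score(sgd_clf, X_train, y_train_5, cv=3, scoring="accuracy")

# Baseline that always predicts the majority class ("not 5"): roughly 91%.
dummy_clf = DummyClassifier(strategy="most_frequent")
cross_val_score(dummy_clf, X_train, y_train_5, cv=3, scoring="accuracy")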

The Confusion Matrix

Generate out-of-sample predictions with cross_val_predict(sgd_clf, X_train, y_train_5, cv=3), then call confusion_matrix(y_train_5, y_train_pred). Each row is the actual class; each column is the predicted class.
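A sketch of those two calls, continuing with sgd_clf and the binary target from the setup above:

from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict

# Out-of-fold predictions: each instance is predicted by a model that never saw it.
y_train_pred = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3)

# Rows = actual class (not-5, 5); columns = predicted class (not-5, 5).
cm = confusion_matrix(y_train_5, y_train_pred)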

True negatives: 53,892 (correctly said "not 5")
False positives: 687 (said "5" when it wasn't)
False negatives: 1,891 (said "not 5" when it was)
True positives: 3,530 (correctly said "5")

Precision, Recall, and the F1 Score

Three metrics derived from the confusion matrix tell the story that accuracy hides.

Precision
TP / (TP + FP) = 0.837

Positive predictive value. High precision means low false alarm rate. Precision=0.837 means 83.7% of predicted 5s are genuinely 5s. Useful when false positives are costly (e.g., flagging legitimate emails as spam).

Recall
TP / (TP + FN) = 0.651

Sensitivity / true positive rate. Recall=0.651 means 34.9% of genuine 5s are missed. Important when missing a positive is costly (e.g., failing to detect cancer).

F1 Score
2 × P × R / (P + R) = 0.733

Harmonic mean of precision and recall. Appropriate when both false positives and false negatives matter roughly equally. The harmonic mean penalises extreme imbalances between P and R.
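All three scores are available directly from scikit-learn, computed here on the same out-of-fold predictions as the confusion matrix above (a sketch; exact values depend on the run):

from sklearn.metrics import precision_score, recall_score, f1_score

precision_score(y_train_5, y_train_pred)  # ~0.837: share of predicted 5s that really are 5s
recall_score(y_train_5, y_train_pred)     # ~0.651: share of actual 5s that were caught
f1_score(y_train_5, y_train_pred)         # ~0.733: harmonic mean of the two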

The Precision / Recall Trade-off

Every classifier has a decision threshold. Raising it increases precision (fewer false alarms) but decreases recall (more misses). The precision-recall curve plots this trade-off across all thresholds. The right threshold depends entirely on the business cost of each error type; there is no universally correct setting.
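One way to explore that trade-off, sketched with scikit-learn's precision_recall_curve; the 90%-precision target below is purely an illustrative policy, not a recommendation from the chapter.

from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import cross_val_predict

# Decision scores instead of hard 0/1 predictions, so any threshold can be applied later.
y_scores = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3,
                             method="decision_function")
precisions, recalls, thresholds = precision_recall_curve(y_train_5, y_scores)

# Illustrative policy: lowest threshold that reaches 90% precision,
# accepting whatever recall that leaves.
idx_90 = (precisions >= 0.90).argmax()
threshold_90 = thresholds[idx_90]
y_pred_90 = (y_scores >= threshold_90)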

ROC Curve and AUC

ROC plots TPR (recall) vs FPR (1 - specificity) across all thresholds. AUC = area under the ROC curve: 1.0 perfect, 0.5 random. Prefer the PR curve when the positive class is rare; prefer ROC when you care about the negative class too. The RandomForest dominates on both curves.
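A sketch of both measurements; the RandomForestClassifier comparison assumes its positive-class probability is used as the score, the usual pattern for classifiers without a decision_function.

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import cross_val_predict

# ROC curve and AUC for the SGD "5 detector", reusing y_scores from above.
fpr, tpr, thresholds = roc_curve(y_train_5, y_scores)
roc_auc_score(y_train_5, y_scores)                       # roughly 0.96

# RandomForest: use the probability of the positive class as the score.
forest_clf = RandomForestClassifier(random_state=42)
y_probas_forest = cross_val_predict(forest_clf, X_train, y_train_5, cv=3,
                                    method="predict_proba")
roc_auc_score(y_train_5, y_probas_forest[:, 1])          # roughly 0.998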

SGDClassifier: F1 = 0.733, ROC AUC = 0.960 (linear, fast, handles large data)
RandomForest: F1 = 0.924, ROC AUC = 0.998 (ensemble, much stronger baseline)

Beyond Binary: Multiclass, Multilabel, Multioutput

Multiclass (OvR / OvO)

OvR: n_classes binary classifiers, pick highest confidence score. OvO: n(n-1)/2 classifiers, majority vote. OvR preferred for most algorithms. OvO preferred when the algorithm scales poorly with dataset size (like SVMs). Error analysis on the normalised confusion matrix reveals which class pairs confuse the model.
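scikit-learn picks a strategy automatically (OvO inside SVC, native multiclass or OvR for most other estimators), but either can be forced explicitly. A sketch; the 2,000-sample slice is only there to keep the SVC fits quick and is an assumption, not part of the text above.

from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

# Plain SVC on the full 10-class targets: one-vs-one under the hood.
svm_clf = SVC(random_state=42)
svm_clf.fit(X_train[:2000], y_train[:2000])

# Forcing each strategy explicitly:
ovr_clf = OneVsRestClassifier(SVC(random_state=42))
ovr_clf.fit(X_train[:2000], y_train[:2000])
len(ovr_clf.estimators_)    # 10 binary classifiers, one per class

ovo_clf = OneVsOneClassifier(SVC(random_state=42))
ovo_clf.fit(X_train[:2000], y_train[:2000])
len(ovo_clf.estimators_)    # 45 = 10*9/2 classifiers, one per class pair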

Multilabel Classification

Output is a binary vector: for "large number" and "odd number" detectors on MNIST, [1, 0] means large but even. KNeighborsClassifier supports multilabel natively. ClassifierChain models label dependencies: each classifier uses the predictions of all preceding ones as additional features.
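A sketch of that two-label setup; the >= 7 cutoff for "large" follows the description above, and the ClassifierChain hyperparameters (and its 2,000-sample training slice) are illustrative assumptions.

import numpy as np
from sklearn.multioutput import ClassifierChain
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

digits = y_train.astype(np.int8)            # string labels -> integers
y_multilabel = np.c_[digits >= 7,           # label 1: "large number"
                     digits % 2 == 1]       # label 2: "odd number"

knn_clf = KNeighborsClassifier()             # handles multilabel targets natively
knn_clf.fit(X_train, y_multilabel)
knn_clf.predict([X_train[0]])                # e.g. [[False, True]] for a 5: not large, odd

# ClassifierChain: each classifier also sees the predictions of the previous ones.
chain_clf = ClassifierChain(SVC(), cv=3, random_state=42)
chain_clf.fit(X_train[:2000], y_multilabel[:2000])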

Multioutput Classification

Generalisation of multilabel: each label is multiclass. The denoising MNIST task uses noisy images as input and clean images as targets. The model outputs 784 values (one per pixel), each in 0-255: a reconstruction task expressed as classification.
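A sketch of that denoising setup; the noise range (0-100) and the choice of KNeighborsClassifier are illustrative assumptions.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(42)
X_train_mod = X_train + rng.integers(0, 100, (len(X_train), 784))   # noisy inputs
X_test_mod = X_test + rng.integers(0, 100, (len(X_test), 784))
y_train_mod = X_train                    # targets: the clean images, 784 labels each in 0-255

knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train_mod, y_train_mod)
clean_digit = knn_clf.predict([X_test_mod[0]])   # one denoised 784-pixel image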

The Mental Models Worth Keeping

1. Never report accuracy alone on imbalanced data.

Always accompany it with precision, recall, and F1. A classifier that ignores the minority class can achieve excellent accuracy. The confusion matrix exposes this immediately.

2. The threshold is a business decision, not a model decision.

The model learns to rank: it outputs scores or probabilities. Where you set the threshold depends on the relative cost of false positives vs false negatives in your specific application.

3. AUC measures ranking quality, not calibration.

A high AUC means the model ranks positives above negatives on average, but says nothing about whether probability estimates are well-calibrated. For production systems acting on probabilities, calibration matters separately.

4. Error analysis drives feature engineering.

Plotting the normalised confusion matrix for multiclass problems shows exactly which class pairs confuse the model. That tells you where to focus preprocessing effort, not hyperparameter tuning. A minimal sketch of the plot follows below.
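A sketch of that error-analysis plot, assuming cross-validated predictions from the SGD model on the full 10-class target (y_train from the setup above):

import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.model_selection import cross_val_predict

# Out-of-fold multiclass predictions, then a row-normalised confusion matrix:
# each cell shows what fraction of a true class lands in each predicted class.
y_train_pred_mc = cross_val_predict(sgd_clf, X_train, y_train, cv=3)
ConfusionMatrixDisplay.from_predictions(y_train, y_train_pred_mc,
                                        normalize="true", values_format=".0%")
plt.show()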

HOML Reading Notes – Chapter 3 of 19

Chapter 4 covers how models are actually trained: the Normal Equation, gradient descent variants (Batch, SGD, Mini-batch), and the regularisation techniques (Ridge, Lasso, Elastic Net) that prevent overfitting.