Voting Classifiers
A voting classifier aggregates predictions from diverse base estimators. The key requirement is diversity — identical classifiers trained on identical data will make identical errors that don't cancel. Soft voting typically outperforms hard voting because it incorporates confidence information; a classifier that is 99% confident on a class outweighs one that is 51% confident.
Hard Voting
Each classifier votes for a class; the majority wins
Best when: Diverse classifiers, roughly equal accuracy
Simple majority rule — no probability needed
Soft Voting
Average predicted class probabilities; argmax wins
Best when: Classifiers that output calibrated probabilities
Usually outperforms hard voting; requires predict_proba()
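A minimal sketch of soft voting in scikit-learn. The dataset (make_moons) and hyperparameters are illustrative assumptions, not from these notes; note that SVC needs probability=True before it can contribute to soft voting.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

voting_clf = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(random_state=42)),
        ("svc", SVC(probability=True, random_state=42)),  # soft voting needs predict_proba()
        ("tree", DecisionTreeClassifier(random_state=42)),
    ],
    voting="soft",  # average class probabilities; voting="hard" = plain majority vote
)
voting_clf.fit(X_train, y_train)
print(voting_clf.score(X_test, y_test))
```

Switching voting="soft" to "hard" keeps the same API but discards the confidence information discussed above.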
Bagging, Pasting & Sampling Variants
Bootstrap aggregating (Bagging) trains each predictor on a bootstrap sample of m instances (drawn with replacement). Each bag contains ~63.2% unique instances. The ~37% not drawn form the out-of-bag (OOB) set — a free validation set. Predictions are aggregated by majority vote (classification) or mean (regression). Bagging reduces variance; bias is largely unchanged.
Bagging
Sampling: Bootstrap (with replacement)
Across: Instances
Some instances repeated; ~63% unique per bag
Pasting
Sampling: Without replacement
Across: Instances
No repeated instances; less diversity between predictors than bagging
Random Patches
Sampling: Bootstrap
Across: Instances + Features
Sample both instances and features
Random Subspaces
Sampling: All instances
Across: Features only
Good for high-dimensional data (images, text)
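The four rows above map onto BaggingClassifier parameters. A sketch under assumed hyperparameters (the dataset and sample/feature fractions are illustrative):

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)

# Bagging: bootstrap the instances (with replacement)
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100,
                            max_samples=100, bootstrap=True, random_state=42)

# Pasting: sample instances without replacement
pasting = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100,
                            max_samples=100, bootstrap=False, random_state=42)

# Random Patches: sample instances AND features
patches = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100,
                            max_samples=0.7, bootstrap=True,
                            max_features=0.5, bootstrap_features=True, random_state=42)

# Random Subspaces: all instances, sample features only
subspaces = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100,
                              max_samples=1.0, bootstrap=False,
                              max_features=0.5, bootstrap_features=True, random_state=42)

for name, clf in [("bagging", bagging), ("pasting", pasting),
                  ("patches", patches), ("subspaces", subspaces)]:
    clf.fit(X, y)
    print(name, clf.score(X, y))
```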
Out-of-Bag (OOB) Evaluation
Each tree sees ~63% of instances; the remaining ~37% are never used for training that tree. They form a natural held-out set. oob_score=True in BaggingClassifier computes this automatically. Example: oob_score_=0.896, test accuracy=0.92 — close enough that OOB is often a good substitute for a separate validation split.
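A minimal sketch of OOB evaluation (the make_moons dataset is an assumption, so the score will differ from the figures quoted above):

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)

bag_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=500,
                            oob_score=True, random_state=42)
bag_clf.fit(X, y)
print(bag_clf.oob_score_)  # accuracy on out-of-bag instances, no validation split needed
```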
Random Forests & ExtraTrees
RandomForestClassifier = BaggingClassifier of DecisionTrees with max_features=√n_features. The feature subsampling at each split decorrelates trees beyond what bootstrap sampling alone achieves. ExtraTrees (Extremely Randomized Trees) goes further — it also uses random thresholds per feature (splitter='random'), trading slightly higher bias for lower variance and much faster training.
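The two classes share the same API; the extra randomisation in ExtraTrees is internal (random split thresholds), so the comparison is a one-line swap. Dataset and hyperparameters below are illustrative assumptions:

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)

# Random Forest: bootstrap samples + sqrt(n_features) candidates per split
rf = RandomForestClassifier(n_estimators=200, max_leaf_nodes=16, random_state=42).fit(X, y)

# ExtraTrees: additionally draws a random threshold per candidate feature
et = ExtraTreesClassifier(n_estimators=200, max_leaf_nodes=16, random_state=42).fit(X, y)

print(rf.score(X, y), et.score(X, y))
```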
Feature Importance — Iris Dataset Example
Importance = normalised weighted impurity reduction. Available via feature_importances_
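A sketch of the iris example (n_estimators is an assumed hyperparameter). The importances are normalised, so they sum to 1; the petal measurements typically dominate:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris(as_frame=True)
rnd_clf = RandomForestClassifier(n_estimators=500, random_state=42)
rnd_clf.fit(iris.data, iris.target)

# One normalised importance score per feature
for name, score in zip(iris.data.columns, rnd_clf.feature_importances_):
    print(round(score, 2), name)
```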
Boosting: AdaBoost & Gradient Boosting
Unlike bagging (parallel, variance reduction), boosting trains predictors sequentially. Each predictor attempts to correct the errors of its predecessor. The key trade-off: boosting reduces both bias and variance but is sensitive to noisy data (overfits outliers more than bagging). A learning rate (shrinkage) moderates each tree's contribution.
AdaBoost
αⱼ = η · log((1 − rⱼ) / rⱼ)
Re-weight misclassified instances; each predictor focuses on its predecessors' errors
sklearn: AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=1)) (base_estimator= was renamed estimator= in scikit-learn 1.2)
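One round of the re-weighting rule above, as a hedged NumPy sketch (illustrative only, not sklearn's exact implementation): compute the weighted error rate rⱼ, the predictor weight αⱼ, then boost the weights of misclassified instances and renormalise.

```python
import numpy as np

eta = 1.0                                     # learning rate η
w = np.full(6, 1 / 6)                         # uniform instance weights to start
misclassified = np.array([0, 0, 1, 0, 1, 0], dtype=bool)  # toy predictor errors

r = w[misclassified].sum() / w.sum()          # weighted error rate r_j = 1/3
alpha = eta * np.log((1 - r) / r)             # predictor weight α_j (formula above)
w[misclassified] *= np.exp(alpha)             # up-weight the misclassified instances
w /= w.sum()                                  # renormalise to a distribution

print(alpha, w)                               # misclassified instances now weigh 0.25 each
```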
Gradient Boosting
hⱼ(x) fits −∇loss of the ensemble so far
Fit each tree to the residual errors (pseudo-residuals) of the previous ensemble
learning_rate = shrinkage; smaller → need more trees but generalises better
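For squared loss the pseudo-residuals are just the residuals, so gradient boosting can be sketched by hand in a few lines (synthetic data and hyperparameters assumed):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X = rng.uniform(-0.5, 0.5, size=(100, 1))
y = 3 * X[:, 0] ** 2 + 0.05 * rng.normal(size=100)

learning_rate = 0.5
trees, residual = [], y.copy()
for _ in range(3):
    tree = DecisionTreeRegressor(max_depth=2, random_state=42)
    tree.fit(X, residual)                          # each tree fits the current residuals
    residual -= learning_rate * tree.predict(X)    # shrink the contribution, update residuals
    trees.append(tree)

def ensemble_predict(X_new):
    # Ensemble prediction = shrunken sum of all trees
    return sum(learning_rate * t.predict(X_new) for t in trees)
```

Each pass shrinks the residuals; the shrinkage factor is exactly the learning_rate trade-off described above.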
Histogram-Based GB (HGB)
O(b × m) vs O(n × m × log m)
Bins continuous features into b bins; splits are evaluated on histogram counts
sklearn HistGradientBoostingClassifier; supports NaN, categoricals natively
Early Stopping in Gradient Boosting
Set n_iter_no_change=10 in GradientBoostingRegressor — training stops when val loss doesn't improve for 10 rounds. Example: n_estimators=500 + early stopping → best at 92 trees. With subsample=0.25 (Stochastic GB) each tree trains on a random 25% of instances — faster training, additional regularisation.
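A minimal sketch of the built-in early stopping (synthetic data; validation_fraction and the other hyperparameters are illustrative assumptions):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(42)
X = rng.uniform(-0.5, 0.5, size=(500, 1))
y = 3 * X[:, 0] ** 2 + 0.05 * rng.normal(size=500)

gbrt = GradientBoostingRegressor(
    n_estimators=500,            # upper bound on trees
    learning_rate=0.05,
    n_iter_no_change=10,         # stop when val loss stalls for 10 rounds
    validation_fraction=0.1,     # internal split used for early stopping
    random_state=42)
gbrt.fit(X, y)
print(gbrt.n_estimators_)        # number of trees actually fitted, usually well below 500
```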
Stacking (Stacked Generalisation)
The training set is split into K folds. For each fold, base predictors are trained on the remaining K−1 folds and used to predict the hold-out fold. These out-of-fold predictions form the blending dataset. The meta-learner is then trained on this blending set, learning an optimal combination of base predictions. sklearn: StackingClassifier.
Layer 1: Base Predictors (Random Forest, SVM, etc.)
Blending Dataset (Out-of-fold predictions)
Layer 2: Meta-Learner (e.g. Logistic Regression)
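The two layers above map directly onto StackingClassifier; cv controls the K-fold split that builds the blending dataset. Base estimators and dataset below are illustrative assumptions:

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)

stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(random_state=42)),   # layer 1
                ("svc", SVC(random_state=42))],
    final_estimator=LogisticRegression(),  # layer 2: meta-learner on out-of-fold predictions
    cv=5)                                  # K folds used to build the blending dataset
stack.fit(X, y)
print(stack.score(X, y))
```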
Mental Models
Aggregating many diverse, independent predictors almost always beats the best single predictor. The key word is diverse — correlated errors don't cancel out.
Bootstrap sampling creates diverse trees; averaging smooths out the individual trees' variance. Bias is roughly unchanged. This is why bagging helps most with high-variance models like deep trees.
~37% of instances are never seen by each tree (out-of-bag). These form a natural validation set. Set oob_score=True and you get a generalisation estimate without needing a separate validation split.
Each tree not only sees a bootstrap sample but also evaluates only √n features per split. This decorrelates the trees further — more diversity, bigger variance reduction.
Misclassified instances get higher weights in the next round. The ensemble ends up as a weighted vote where accurate-on-hard-instances classifiers have more influence. SAMME.R uses probabilities for softer weighting.
Each tree fits the negative gradient of the loss — for MSE, these are simply the residuals. The model improves by always correcting what the current ensemble got wrong.
Multiplying each tree contribution by a small η (e.g. 0.1) requires more trees but strongly regularises the model. Lower η + more trees (via early stopping) almost always beats high η + few trees.
Instead of averaging predictions, a meta-learner (blender) is trained on the base predictions. It can learn to trust some predictors more than others depending on the instance — more flexible than simple voting.
Part of the HOML series. These notes distil Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow (3rd ed.) by Aurélien Géron. Ensemble methods — especially Gradient Boosted Trees — are the dominant approach in structured/tabular ML competitions.