HOML · Ch. 7 · April 2026 · 15 min read

Ensemble Learning: Random Forests, Boosting & Stacking

Chapter 7 covers voting classifiers (hard/soft), bagging vs. pasting, out-of-bag evaluation, Random Patches/Subspaces, RandomForestClassifier (√n features per split), ExtraTrees, feature importance, AdaBoost (SAMME/SAMME.R), Gradient Boosting with shrinkage and early stopping, Histogram-Based GBT (O(b×m)), and Stacking with a blender/meta-learner. The chapter explains why ensembles dominate Kaggle leaderboards.

Voting Classifiers

A voting classifier aggregates predictions from diverse base estimators. The key requirement is diversity — identical classifiers trained on identical data will make identical errors that don't cancel. Soft voting typically outperforms hard voting because it incorporates confidence information; a classifier that is 99% confident on a class outweighs one that is 51% confident.

Hard Voting

Each classifier votes for a class; the majority wins

Best when: Diverse classifiers, roughly equal accuracy

Simple majority rule — no probability needed

Soft Voting

Average predicted class probabilities; argmax wins

Best when: Classifiers that output calibrated probabilities

Usually outperforms hard voting; requires predict_proba()
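A minimal scikit-learn sketch of both modes (the moons dataset and the three base estimators are illustrative choices, not the only ones that work):

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

voting_clf = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(random_state=42)),
        ("rf", RandomForestClassifier(random_state=42)),
        ("svc", SVC(probability=True, random_state=42)),  # probability=True enables predict_proba
    ],
    voting="hard",  # majority vote over predicted classes
)
voting_clf.fit(X_train, y_train)
print("hard voting:", voting_clf.score(X_test, y_test))

voting_clf.voting = "soft"  # average predict_proba across estimators, then take the argmax
voting_clf.fit(X_train, y_train)
print("soft voting:", voting_clf.score(X_test, y_test))
```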

Bagging, Pasting & Sampling Variants

Bootstrap aggregating (Bagging) trains each predictor on a bootstrap sample of m instances (drawn with replacement). Each bag contains ~63.2% unique instances. The ~37% not drawn form the out-of-bag (OOB) set — a free validation set. Predictions are aggregated by majority vote (classification) or mean (regression). Bagging reduces variance; bias is largely unchanged.

Bagging

Sampling: Bootstrap (with replacement)

Across: Instances

Some instances repeated; ~63% unique per bag

Pasting

Sampling: Without replacement

Across: Instances

No repeated instances; slightly less predictor diversity than bagging

Random Patches

Sampling: Bootstrap

Across: Instances + Features

Sample both instances and features

Random Subspaces

Sampling: All instances

Across: Features only

Good for high-dimensional data (images, text)
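In scikit-learn, all four variants can be expressed through BaggingClassifier arguments. A rough mapping (the sample and feature fractions below are illustrative):

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(random_state=42)

# Bagging: bootstrap sampling of instances (with replacement)
bagging = BaggingClassifier(tree, n_estimators=500, max_samples=100,
                            bootstrap=True, random_state=42)

# Pasting: sample instances without replacement
pasting = BaggingClassifier(tree, n_estimators=500, max_samples=100,
                            bootstrap=False, random_state=42)

# Random Patches: sample instances AND features
patches = BaggingClassifier(tree, n_estimators=500,
                            max_samples=0.7, bootstrap=True,
                            max_features=0.7, bootstrap_features=True,
                            random_state=42)

# Random Subspaces: keep all instances, sample features only
subspaces = BaggingClassifier(tree, n_estimators=500,
                              max_samples=1.0, bootstrap=False,
                              max_features=0.5, bootstrap_features=True,
                              random_state=42)
```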

Out-of-Bag (OOB) Evaluation

Each tree sees ~63% of instances; the remaining ~37% are never used for training that tree. They form a natural held-out set. oob_score=True in BaggingClassifier computes this automatically. Example: oob_score_=0.896, test accuracy=0.92 — close enough that OOB is often a good substitute for a separate validation split.
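A sketch of OOB evaluation on a toy dataset (the moons data is an assumption; exact scores will differ from the numbers above):

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500,
    oob_score=True,          # evaluate each tree on its out-of-bag instances
    n_jobs=-1, random_state=42)
bag_clf.fit(X_train, y_train)

print("OOB estimate:", bag_clf.oob_score_)
print("test accuracy:", bag_clf.score(X_test, y_test))
```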

Random Forests & ExtraTrees

RandomForestClassifier = BaggingClassifier of DecisionTrees with max_features=√n_features. The feature subsampling at each split decorrelates trees beyond what bootstrap sampling alone achieves. ExtraTrees (Extremely Randomized Trees) goes further — it also uses random thresholds per feature (splitter='random'), trading slightly higher bias for lower variance and much faster training.
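A sketch of that rough equivalence, plus the Extra-Trees variant (hyperparameters are illustrative):

```python
from sklearn.ensemble import BaggingClassifier, ExtraTreesClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# The usual Random Forest
rf_clf = RandomForestClassifier(n_estimators=500, max_leaf_nodes=16,
                                n_jobs=-1, random_state=42)

# Roughly the same model expressed as a bagging ensemble of trees that
# each consider only a random subset of features at every split
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(max_features="sqrt", max_leaf_nodes=16),
    n_estimators=500, n_jobs=-1, random_state=42)

# Extra-Trees: random feature subsets AND random split thresholds
et_clf = ExtraTreesClassifier(n_estimators=500, max_leaf_nodes=16,
                              n_jobs=-1, random_state=42)
```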

Feature Importance — Iris Dataset Example

Importance = normalised weighted impurity reduction. Available via feature_importances_

Petal length: 44.1%
Petal width: 42.3%
Sepal length: 10.1%
Sepal width: 3.5%
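Roughly how such numbers are obtained (exact values depend on the forest's random seed):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris(as_frame=True)
rf_clf = RandomForestClassifier(n_estimators=500, random_state=42)
rf_clf.fit(iris.data, iris.target)

# feature_importances_ holds the normalised weighted impurity reductions,
# summing to 1.0 across all features
for name, score in zip(iris.data.columns, rf_clf.feature_importances_):
    print(f"{name}: {score:.3f}")
```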

Boosting: AdaBoost & Gradient Boosting

Unlike bagging (parallel, variance reduction), boosting trains predictors sequentially. Each predictor attempts to correct the errors of its predecessor. The key trade-off: boosting reduces both bias and variance but is sensitive to noisy data (overfits outliers more than bagging). A learning rate (shrinkage) moderates each tree's contribution.

AdaBoost

αⱼ = η · log((1 − rⱼ) / rⱼ)

Re-weight misclassified instances; each predictor focuses on predecessors' errors

sklearn: AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=1)) (the parameter was named base_estimator before scikit-learn 1.2)
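A minimal fit, assuming a moons-style toy dataset (n_estimators and learning_rate are illustrative):

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

X_train, y_train = make_moons(n_samples=500, noise=0.30, random_state=42)

# Decision stumps (max_depth=1) as weak learners; each round re-weights
# the instances the previous stumps misclassified
ada_clf = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # base_estimator in scikit-learn < 1.2
    n_estimators=200, learning_rate=0.5, random_state=42)
ada_clf.fit(X_train, y_train)
```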

Gradient Boosting

hⱼ(x) fits the negative gradient (−∇loss) of the ensemble so far

Fit each tree to the residual errors (pseudo-residuals) of the previous ensemble

learning_rate = shrinkage; smaller → need more trees but generalises better
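The residual-fitting loop can be sketched by hand with three plain regression trees; with squared-error loss the pseudo-residuals are just the residuals (the data here is synthetic):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy 1-D regression problem (purely illustrative)
rng = np.random.default_rng(42)
X = rng.uniform(-0.5, 0.5, size=(100, 1))
y = 3 * X[:, 0] ** 2 + 0.05 * rng.normal(size=100)

# Stage 1: fit a small tree to the raw targets
tree1 = DecisionTreeRegressor(max_depth=2, random_state=42)
tree1.fit(X, y)

# Stage 2: fit the next tree to the residuals of stage 1
y2 = y - tree1.predict(X)
tree2 = DecisionTreeRegressor(max_depth=2, random_state=42)
tree2.fit(X, y2)

# Stage 3: fit a third tree to the remaining residuals
y3 = y2 - tree2.predict(X)
tree3 = DecisionTreeRegressor(max_depth=2, random_state=42)
tree3.fit(X, y3)

# The ensemble's prediction is the sum of all trees' predictions
X_new = np.array([[0.2]])
y_pred = sum(tree.predict(X_new) for tree in (tree1, tree2, tree3))
```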

Histogram-Based GB (HGB)

O(b × m) vs O(n × m × log m), where b = bins, m = instances, n = features

Bins continuous features into b bins; splits evaluated on histogram counts

sklearn HistGradientBoostingClassifier; supports NaN, categoricals natively
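A small sketch of the native NaN handling, using iris with values knocked out at random (the 10% missingness is arbitrary):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import HistGradientBoostingClassifier

# Punch random holes in the iris features to show native NaN handling
X, y = load_iris(return_X_y=True)
X = X.copy()
rng = np.random.default_rng(42)
X[rng.random(X.shape) < 0.10] = np.nan   # ~10% missing values

hgb_clf = HistGradientBoostingClassifier(random_state=42)
hgb_clf.fit(X, y)                        # no imputation pipeline required
print(hgb_clf.score(X, y))
```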

Early Stopping in Gradient Boosting

Set n_iter_no_change=10 in GradientBoostingRegressor — training stops when val loss doesn't improve for 10 rounds. Example: n_estimators=500 + early stopping → best at 92 trees. With subsample=0.25 (Stochastic GB) each tree trains on a random 25% of instances — faster training, additional regularisation.
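Putting those pieces together in one illustrative configuration (dataset and hyperparameters are assumptions, not the book's exact values):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=1000, n_features=20, noise=10.0, random_state=42)

gbrt = GradientBoostingRegressor(
    n_estimators=500,            # upper bound on the number of trees
    learning_rate=0.05,
    n_iter_no_change=10,         # stop once val loss has not improved for 10 rounds
    validation_fraction=0.1,     # internal split used for the early-stopping check
    subsample=0.25,              # Stochastic GB: each tree trains on 25% of instances
    random_state=42)
gbrt.fit(X, y)

print(gbrt.n_estimators_)        # number of trees actually kept (often far below 500)
```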

Stacking (Stacked Generalisation)

The training set is split into K folds. For each fold, base predictors are trained on the remaining K−1 folds and used to predict the hold-out fold. These out-of-fold predictions form the blending dataset. The meta-learner is then trained on this blending set, learning an optimal combination of base predictions. sklearn: StackingClassifier.

Layer 1: Base Predictors (Random Forest, SVM, etc.)

Blending Dataset (Out-of-fold predictions)

Layer 2: Meta-Learner (e.g. Logistic Regression)
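A compact StackingClassifier sketch matching the two-layer picture above (the base estimators and cv=5 are illustrative choices):

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

stacking_clf = StackingClassifier(
    estimators=[                              # layer 1: base predictors
        ("rf", RandomForestClassifier(random_state=42)),
        ("svc", SVC(probability=True, random_state=42)),
    ],
    final_estimator=LogisticRegression(),     # layer 2: meta-learner / blender
    cv=5)                                     # out-of-fold predictions via 5-fold CV
stacking_clf.fit(X_train, y_train)
print(stacking_clf.score(X_test, y_test))
```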

Mental Models

01
Wisdom of crowds

Aggregating many diverse, independent predictors almost always beats the best single predictor. The key word is diverse — correlated errors don't cancel out.

02
Bagging reduces variance, not bias

Bootstrap sampling creates diverse trees; averaging smooths out the variance of the individual trees. Bias is roughly unchanged. This is why it helps with high-variance models like deep trees.

03
OOB evaluation is free cross-validation

~37% of instances are never seen by each tree (out-of-bag). These form a natural validation set. Set oob_score=True and you get a nearly unbiased estimate of generalisation performance without a separate val split.

04
Random Forests add feature randomness to bagging

Each tree not only sees a bootstrap sample but also evaluates only √n features per split. This decorrelates the trees further — more diversity, bigger variance reduction.

05
AdaBoost focuses on the hard examples

Misclassified instances get higher weights in the next round. The ensemble ends up as a weighted vote where accurate-on-hard-instances classifiers have more influence. SAMME.R uses probabilities for softer weighting.

06
Gradient Boosting = iterative residual fitting

Each tree fits the negative gradient of the loss — for MSE, these are simply the residuals. The model improves by always correcting what the current ensemble got wrong.

07
Learning rate is shrinkage

Multiplying each tree's contribution by a small η (e.g. 0.1) requires more trees but strongly regularises the model. Lower η + more trees (via early stopping) almost always beats high η + few trees.

08
Stacking replaces averaging with learning

Instead of averaging predictions, a meta-learner (blender) is trained on the base predictions. It can learn to trust some predictors more than others depending on the instance — more flexible than simple voting.

Part of the HOML series. These notes distil Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow (3rd ed.) by Aurélien Géron. Ensemble methods — especially Gradient Boosted Trees — are the dominant approach in structured/tabular ML competitions.