HOML · Ch.6 · April 2026 · 11 min read

Decision Trees: CART, Gini, and Growing Interpretable Models

Chapter 6 covers the CART algorithm (greedy binary splitting on Gini impurity or entropy), the Iris walkthrough, training O(n·m·log m) and prediction O(log₂m), regularisation hyperparameters (max_depth, min_samples_leaf, etc.), regression trees minimising MSE, and the three key limitations: axis-aligned boundaries, high variance (→ Random Forests), and no extrapolation.

The CART Algorithm

At each node, CART evaluates every feature and every threshold, selecting the pair (k, tₖ) that minimises the weighted impurity cost:

J(k, tₖ) = (m_left / m) · Gᵢ_left + (m_right / m) · Gᵢ_right

Training complexity: O(n · m · log₂ m), since each of the roughly log₂ m levels of the tree compares all n features across all m instances. Prediction: O(log₂ m), a single root-to-leaf traversal.
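To make the greedy search concrete, here is a minimal NumPy sketch of the split search over a single feature. The helpers gini and best_split are illustrative names, not sklearn API:

```python
import numpy as np

def gini(labels):
    """Gini impurity of a label array: 1 - sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y, feature_k):
    """Greedy CART search over thresholds of one feature, minimising
    J(k, t_k) = (m_left/m) * G_left + (m_right/m) * G_right."""
    m = len(y)
    values = np.sort(np.unique(X[:, feature_k]))
    # Candidate thresholds: midpoints between consecutive sorted values
    thresholds = (values[:-1] + values[1:]) / 2
    best_cost, best_t = np.inf, None
    for t in thresholds:
        left = y[X[:, feature_k] <= t]
        right = y[X[:, feature_k] > t]
        cost = len(left) / m * gini(left) + len(right) / m * gini(right)
        if cost < best_cost:
            best_cost, best_t = cost, t
    return best_cost, best_t
```

CART simply runs this search over every feature k at every node and keeps the (k, tₖ) pair with the lowest cost.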

Reading a Decision Tree: Iris Walkthrough

Each tree node stores: the split condition, sample count, class distribution (value), and impurity (Gini). A leaf's predicted class is simply the majority class. Nodes with Gini=0 are pure — all samples belong to one class.

Root (Gini = 0.667)
Test: petal length ≤ 2.45 cm
Samples: 150
Left → setosa (pure)

Depth 1, right (Gini = 0.500)
Test: petal width ≤ 1.75 cm
Samples: 100
Left/Right → versicolor/virginica

Depth 2, left (Gini = 0.168)
Test: petal length ≤ 4.95 cm
Samples: 54
Left → likely versicolor

Depth 2, right (Gini = 0.043)
Test: petal length ≤ 4.85 cm
Samples: 46
Right → mostly virginica
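A tree like the one above can be reproduced in a few lines. This is a sketch using the full Iris dataset and only the two petal features; thresholds and Gini values may differ slightly depending on the scikit-learn version:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X, y = iris.data[:, 2:], iris.target  # petal length & petal width only

tree_clf = DecisionTreeClassifier(max_depth=3, random_state=42)
tree_clf.fit(X, y)

# Print the learned splits, sample counts, and class labels per node
print(export_text(tree_clf, feature_names=["petal length (cm)", "petal width (cm)"]))
```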

Gini Impurity vs. Entropy

Both measure node impurity; both equal 0 for a pure node. Gini tends to isolate the most frequent class in one branch (pure leaf faster). Entropy tends to produce slightly more balanced trees. In practice the difference is negligible — use Gini for speed, entropy if you want to try a second option during hyperparameter search.

Gini Impurity

Gᵢ = 1 − Σₖ pᵢ,ₖ²

Range: 0 (pure) → 1 − 1/K (max impure)

Default: Yes (sklearn default)

Faster to compute. Tends to isolate the most frequent class into one branch.

Entropy

Hᵢ = − Σₖ pᵢ,ₖ log₂(pᵢ,ₖ)

Range: 0 (pure) → log₂(K) (max impure)

Default: No (criterion="entropy")

Slightly more balanced splits. More expensive (log calls). Usually produces similar trees.
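A quick numeric check of both formulas on the depth-1 right node from the walkthrough (0 setosa, 50 versicolor, 50 virginica out of 100 samples). A minimal sketch with illustrative helper names:

```python
import numpy as np

def gini(p):
    """Gini impurity for a vector of class proportions p."""
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

def entropy(p):
    """Shannon entropy (bits) for class proportions p, skipping zero entries."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

p = [0, 50 / 100, 50 / 100]  # class proportions at the depth-1 right node
print(gini(p))     # 0.5, matching the walkthrough
print(entropy(p))  # 1.0 bit
```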

Regularisation: Constraining the Tree

sklearn uses pre-pruning (stopping criteria) rather than post-pruning (grow then prune). The key knobs: max_depth, min_samples_split, min_samples_leaf, max_features, max_leaf_nodes. Alternatively, ccp_alpha enables minimal cost-complexity pruning (post-pruning) — prune leaves whose impurity reduction doesn't justify the complexity cost.

max_depth

Maximum tree depth. Restricts splits at deeper levels.

💡 Start at 2–3, increase with cross-val. A key lever for underfitting vs. overfitting.

min_samples_split

Node must have ≥ N samples to be split further.

💡 Default 2. Raise (e.g. 20) to prevent splits on tiny groups.

min_samples_leaf

Each leaf must contain ≥ N training instances.

💡 Ensures leaves represent meaningful data subsets.

max_features

Number of features evaluated per split.

💡 Random subsets add stochasticity. Basis of Random Forests.

max_leaf_nodes

Hard cap on total leaf count.

💡 Alternative to max_depth. Grows best-first, not depth-first.
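A sketch of how these knobs are typically tuned together with cross-validation; the dataset and grid values here are arbitrary choices for illustration:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=1000, noise=0.3, random_state=42)

# Pre-pruning knobs plus ccp_alpha for cost-complexity (post) pruning
param_grid = {
    "max_depth": [2, 3, 5, None],
    "min_samples_leaf": [1, 5, 20],
    "max_leaf_nodes": [None, 10, 50],
    "ccp_alpha": [0.0, 0.001, 0.01],
}
search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```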

Regression Trees

CART for regression replaces Gini with MSE as the splitting criterion. The leaf prediction is the mean of target values in that region. Regularisation works identically to classification trees. Key limitation: regression trees cannot extrapolate — for inputs outside the training range, they return the mean of the nearest training region (horizontal plateau).

Regression CART Split Cost

J(k, tₖ) = (m_left / m) · MSE_left + (m_right / m) · MSE_right

Each leaf predicts ȳ_node = mean of targets in that region. Minimising MSE = minimising prediction variance within regions.
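A minimal regression-tree sketch on a noisy quadratic; the data is synthetic and the depth is an arbitrary illustrative choice. Each prediction is the mean target of the leaf the input falls into, so the fitted function is a staircase:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X = rng.uniform(-0.5, 0.5, size=(200, 1))
y = X[:, 0] ** 2 + 0.05 * rng.normal(size=200)  # noisy quadratic target

tree_reg = DecisionTreeRegressor(max_depth=2, random_state=42)
tree_reg.fit(X, y)

# Nearby inputs in the same leaf get the same (mean) prediction
print(tree_reg.predict([[0.1], [0.3]]))
```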

Limitations

Axis-Aligned Boundaries

CART only splits on one feature at a time — boundaries are always perpendicular to a feature axis. A 45° diagonal boundary requires many staircase splits. Oblique decision trees exist but sklearn does not include them.


High Variance

Small data perturbations produce drastically different trees. Random seed, one flipped sample — different structure. This instability is why Chapter 7 introduces Random Forests: averaging many diverse trees cancels out individual variance.


No Extrapolation

Regression trees predict the mean of a training bin. For inputs outside the training range they output the nearest bin's mean — not an extrapolated line. They are inherently bounded by training data range.
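A small demonstration of the plateau effect, assuming a synthetic linear target: queries far beyond the training range all fall back to the rightmost leaf's mean instead of following the line.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X_train = rng.uniform(0, 1, size=(100, 1))
y_train = 3 * X_train[:, 0]  # simple linear target on [0, 1]

tree = DecisionTreeRegressor(max_depth=3).fit(X_train, y_train)

# 5.0 and 1000.0 lie outside the training range; both get the same
# rightmost-leaf mean, nowhere near the extrapolated values 15 or 3000
print(tree.predict([[0.99], [5.0], [1000.0]]))
```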

Mental Models

01
A tree is a series of yes/no questions

Each internal node is a question about one feature. Each leaf is an answer. Prediction is just traversing the path of answers from root to leaf — O(log₂m).

02
CART is greedy, not globally optimal

At each node, CART picks the single best split; it does not backtrack to find the globally optimal tree. Finding that optimal tree is NP-complete, so the greedy approximation is what makes training practical.

03
Gini and entropy almost always agree

Both measure node purity; both are minimised when all samples belong to one class. Use gini (default) for speed; entropy if you suspect it will find slightly better trees — cross-validate either way.

04
Regularise by restricting the tree, not the data

max_depth, min_samples_leaf, and max_leaf_nodes all add constraints that force the tree to generalise rather than memorise.

05
Regression trees split on MSE reduction

CART for regression minimises MSE instead of Gini/entropy. Each split asks: "What single threshold reduces MSE the most?" Leaves predict the mean target in that region.

06
Feature importance = impurity reduction share

Importance = total weighted impurity reduction caused by splits on that feature, normalised to sum to 1. Available as feature_importances_ in sklearn, a quick way to see which features the tree relies on most (see the short snippet after this list).

07
Variance → Random Forests

A single decision tree's high variance is its main weakness. The fix (Ch.7): train many diverse trees on random data subsets and average their predictions. Individual variance averages out; bias stays low.
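As noted under mental model 06, a quick look at feature_importances_ on Iris; a minimal sketch with an arbitrary depth:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=3, random_state=42).fit(iris.data, iris.target)

# Normalised impurity-reduction share per feature; the values sum to 1
for name, importance in zip(iris.feature_names, clf.feature_importances_):
    print(f"{name}: {importance:.3f}")
```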

Part of the HOML series. These notes distil Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow (3rd ed.) by Aurélien Géron. Decision trees are the building block for ensemble methods — understanding their strengths and weaknesses is essential for Chapter 7 (Random Forests and Gradient Boosting).