From Raw Data to Deployed Model:
The End-to-End ML Playbook
Chapter 2 of Géron's Hands-On ML is the book's centrepiece: a full end-to-end walkthrough of a supervised regression problem using the California Housing dataset. This article distils the workflow into the decisions that actually matter, from problem framing and sampling strategy through preprocessing pipeline design, model selection, and hyperparameter optimisation, foregrounding the "why" behind each step rather than restating the code.
Eight Steps, One Working System
Every ML project, regardless of domain or algorithm, follows the same cycle:
1. Look at the big picture and frame the problem.
2. Get the data.
3. Explore and visualize the data to gain insights.
4. Prepare the data for machine learning algorithms.
5. Select a model and train it.
6. Fine-tune the model.
7. Present the solution.
8. Launch, monitor, and maintain the system.
How the Models Compare
Training RMSE vs. cross-validation RMSE tells the real story. A model can look perfect on training data while failing badly on new examples; this gap is the overfitting signal.
| Model | Train RMSE | CV RMSE (10-fold) | Diagnosis |
|---|---|---|---|
| Linear Regression | ~68,688 | ~69,858 ± 4,182 | Underfitting |
| Decision Tree | 0 (!) | ~66,868 ± 2,061 | Severe overfitting |
| Random Forest | ~17,474 | ~47,019 Β± 1,034 | Best so far (some overfitting) |
| Random Forest (tuned) | n/a | ~44,042 | Final model |
All RMSE values in USD. Lower is better. CV scores averaged over 10 non-overlapping folds.
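The train-vs-CV comparison can be reproduced in a few lines. Below is a minimal sketch using scikit-learn's built-in copy of the dataset as a stand-in; its target is in units of $100,000, so the absolute numbers will not match the table above.

```python
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score

# Built-in copy of the dataset (stand-in for the chapter's raw CSV).
X, y = fetch_california_housing(return_X_y=True)

tree = DecisionTreeRegressor(random_state=42)
tree.fit(X, y)

# Training RMSE: near zero, because the tree can memorise the training set.
train_rmse = np.sqrt(mean_squared_error(y, tree.predict(X)))

# 10-fold cross-validation RMSE: the honest estimate on held-out folds.
cv_rmse = -cross_val_score(tree, X, y,
                           scoring="neg_root_mean_squared_error", cv=10)

print(f"train RMSE: {train_rmse:.3f}")
print(f"CV RMSE:    {cv_rmse.mean():.3f} ± {cv_rmse.std():.3f}")
```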
What This Chapter Actually Teaches
Most ML work isn't modelling
The pipeline work (imputation, encoding, scaling, and feature engineering) dwarfs the model-training step. A rough model with clean data beats a sophisticated model with dirty data.
RMSE vs MAE: pick your outlier tolerance
RMSE (L2 norm) punishes large errors heavily; MAE (L1 norm) treats all errors equally. For bell-shaped distributions, RMSE is preferred. In outlier-heavy data, MAE is more robust.
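A toy illustration of the difference (the numbers are invented for the example): one large error moves RMSE far more than MAE.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([200_000, 250_000, 300_000, 350_000])
y_pred_clean   = np.array([210_000, 240_000, 310_000, 340_000])  # small errors everywhere
y_pred_outlier = np.array([200_000, 250_000, 300_000, 250_000])  # one error of 100k

for name, y_pred in [("clean", y_pred_clean), ("one outlier", y_pred_outlier)]:
    mae  = mean_absolute_error(y_true, y_pred)
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    # clean:       MAE = 10,000  RMSE = 10,000
    # one outlier: MAE = 25,000  RMSE = 50,000  (the squared term dominates)
    print(f"{name:12s}  MAE={mae:>9,.0f}  RMSE={rmse:>9,.0f}")
```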
Stratify on what matters
Random sampling risks a skewed test set. Stratify on the most predictive attribute, binned into categories (median income here), to guarantee proportional representation across strata.
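A minimal sketch of an income-stratified split in the spirit of the chapter, again using the built-in dataset as a stand-in (its income column is named MedInc rather than median_income):

```python
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

# Stand-in for the chapter's raw housing DataFrame.
housing = fetch_california_housing(as_frame=True).frame

# Bin the continuous income into 5 categories so every stratum is large
# enough to be represented proportionally in both splits.
housing["income_cat"] = pd.cut(housing["MedInc"],
                               bins=[0.0, 1.5, 3.0, 4.5, 6.0, np.inf],
                               labels=[1, 2, 3, 4, 5])

train_set, test_set = train_test_split(
    housing, test_size=0.2, stratify=housing["income_cat"], random_state=42)

# Verify the income distribution is (nearly) identical in both splits.
print(test_set["income_cat"].value_counts(normalize=True).sort_index())
```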
Zero training error is a warning sign
The decision tree scored RMSE = 0 on the training data, which looks perfect but means it memorised, not learned. Cross-validation exposes this: the same model scored ~67K on unseen folds.
Engineer before you model
bedrooms_ratio (total_bedrooms / total_rooms) correlated at r = -0.256 with price, far stronger than either raw component. Domain-informed feature combinations often matter more than algorithm choice.
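A sketch of how such a ratio feature can be sanity-checked. The built-in dataset is used as a stand-in here; its columns are per-household averages rather than block totals, so the correlation will not match the -0.256 quoted above.

```python
from sklearn.datasets import fetch_california_housing

# Stand-in for the chapter's raw DataFrame (column names differ from the book's).
housing = fetch_california_housing(as_frame=True).frame

# Domain-informed combination: bedrooms as a fraction of all rooms.
housing["bedrooms_ratio"] = housing["AveBedrms"] / housing["AveRooms"]

# Compare the new feature's correlation with the target against its raw components.
corr = housing.corr()["MedHouseVal"]
print(corr[["AveRooms", "AveBedrms", "bedrooms_ratio"]].sort_values())
```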
Confidence intervals beat point estimates
A final RMSE of 41,424 sounds precise, but the 95% CI [39,574; 43,780] is what actually informs a launch decision. Point estimates hide the uncertainty that matters for risk management.
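The chapter derives such an interval from a t-distribution over the squared test-set errors. The sketch below uses synthetic placeholder arrays standing in for the real y_test and the final model's predictions:

```python
import numpy as np
from scipy import stats

# Synthetic placeholders; in the real workflow these come from
# final_model.predict(X_test) and the held-out y_test.
rng = np.random.default_rng(42)
y_test = rng.uniform(100_000, 500_000, size=4_000)
final_predictions = y_test + rng.normal(0, 45_000, size=4_000)

# 95% t-interval on the squared errors, mapped back to RMSE scale via sqrt.
squared_errors = (final_predictions - y_test) ** 2
ci_low, ci_high = np.sqrt(stats.t.interval(
    0.95, len(squared_errors) - 1,
    loc=squared_errors.mean(), scale=stats.sem(squared_errors)))

print(f"RMSE point estimate: {np.sqrt(squared_errors.mean()):,.0f}")
print(f"95% CI: [{ci_low:,.0f}, {ci_high:,.0f}]")
```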
The Preprocessing Pipeline in Full
Scikit-Learn's Pipeline and ColumnTransformer classes enforce the discipline of fitting transformers only on training data, preventing data leakage from imputed medians, learned cluster centres, and categorical encodings. The chapter's pipeline combines five branches (a condensed code sketch follows the list):
- Ratio features: computes bedrooms_ratio, rooms_per_house, people_per_house via column division + StandardScaler → 3 features
- Log transform: log-transforms the heavy-tailed columns (total_bedrooms, total_rooms, population, households, median_income) → 5 features
- Geographic encoding: KMeans(n=10) on lat/lon, then Gaussian RBF similarity to each cluster centre → 10 features
- Categorical: SimpleImputer(most_frequent) + OneHotEncoder for the ocean_proximity attribute → 5 features
- Remaining numerical: SimpleImputer(median) + StandardScaler for the remaining numerical column (housing_median_age) → 1 feature
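A condensed sketch of how these branches might be wired together with Pipeline and ColumnTransformer. The ClusterSimilarity class and column_ratio helper are simplified stand-ins for the chapter's custom transformers, only one ratio branch is shown, and the column names assume the chapter's raw housing DataFrame:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.cluster import KMeans
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler

class ClusterSimilarity(BaseEstimator, TransformerMixin):
    """Gaussian RBF similarity of each sample to k cluster centres learned in fit()."""
    def __init__(self, n_clusters=10, gamma=0.1, random_state=None):
        self.n_clusters = n_clusters
        self.gamma = gamma
        self.random_state = random_state

    def fit(self, X, y=None):
        self.kmeans_ = KMeans(self.n_clusters, n_init=10,
                              random_state=self.random_state)
        self.kmeans_.fit(X)
        return self

    def transform(self, X):
        return rbf_kernel(X, self.kmeans_.cluster_centers_, gamma=self.gamma)

def column_ratio(X):
    # First column divided by the second (e.g. total_bedrooms / total_rooms).
    return X[:, [0]] / X[:, [1]]

ratio_pipeline = make_pipeline(SimpleImputer(strategy="median"),
                               FunctionTransformer(column_ratio),
                               StandardScaler())
log_pipeline = make_pipeline(SimpleImputer(strategy="median"),
                             FunctionTransformer(np.log),
                             StandardScaler())
cat_pipeline = make_pipeline(SimpleImputer(strategy="most_frequent"),
                             OneHotEncoder(handle_unknown="ignore"))
default_num_pipeline = make_pipeline(SimpleImputer(strategy="median"),
                                     StandardScaler())

preprocessing = ColumnTransformer([
    ("bedrooms_ratio", ratio_pipeline, ["total_bedrooms", "total_rooms"]),
    ("log", log_pipeline, ["total_bedrooms", "total_rooms", "population",
                           "households", "median_income"]),
    ("geo", ClusterSimilarity(n_clusters=10, random_state=42),
            ["latitude", "longitude"]),
    ("cat", cat_pipeline, ["ocean_proximity"]),
], remainder=default_num_pipeline)  # picks up housing_median_age
```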
Three Ideas Worth Keeping
The test set is your conscience
Data snooping bias means that every time you look at the test set, you risk making decisions that overfit to it β even indirectly. The discipline of stratified splitting before exploration, and cross-validation for model selection, exists to preserve the test set's integrity as a single honest evaluation.
Pipelines are reproducibility infrastructure
Writing preprocessing as imperative code creates subtle production bugs: what median was used to impute missing bedrooms? Were scalers fitted on training or the whole dataset? A pipeline makes these questions moot: the fit() call records all statistics, and the same pipeline handles training, validation, and production inference identically.
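A small self-contained illustration of that point: the fitted imputer records the training median, and exactly that value is reused at inference time.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

num_pipeline = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())

X_train = np.array([[1.0], [2.0], [np.nan], [4.0]])
num_pipeline.fit(X_train)  # statistics are learned here, once

# The training median (2.0) is recorded on the fitted imputer ...
print(num_pipeline.named_steps["simpleimputer"].statistics_)  # -> [2.]

# ... and exactly that value fills gaps at inference time, never a statistic
# computed from the incoming production batch.
X_new = np.array([[np.nan], [10.0]])
print(num_pipeline.transform(X_new))
```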
Model rot is the silent production killer
The world that generated your training data is not static. Model rot, the gradual performance decay due to distributional shift between training and inference time, is often undetectable until downstream metrics show problems. The monitoring infrastructure is often harder to build than the model itself.
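One lightweight way to make such drift visible (a sketch, not anything from the chapter): compare each incoming feature's distribution against a frozen training baseline, for example with a two-sample Kolmogorov-Smirnov test.

```python
import numpy as np
from scipy import stats

def drift_report(train_df, live_df, threshold=0.01):
    """Flag numeric features whose live distribution differs from the training
    baseline (two-sample KS test); a crude but cheap model-rot early warning."""
    flagged = {}
    for col in train_df.select_dtypes(include=np.number).columns:
        stat, p_value = stats.ks_2samp(train_df[col].dropna(),
                                       live_df[col].dropna())
        if p_value < threshold:
            flagged[col] = (stat, p_value)
    return flagged

# Usage: drift_report(training_features, last_week_of_inference_features)
```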