From Raw Data to Deployed Model:
The End-to-End ML Playbook
Chapter 2 of Géron's Hands-On ML is the book's centrepiece: a full end-to-end walkthrough of a supervised regression problem using the California Housing dataset. This article distils the workflow into the decisions that actually matter, from problem framing and sampling strategy through preprocessing pipeline design, model selection, and hyperparameter optimisation, foregrounding the "why" behind each step rather than restating the code.
Eight Steps, One Working System
Every ML project, regardless of domain or algorithm, follows the same cycle:
1. Look at the big picture and frame the problem.
2. Get the data.
3. Explore and visualize the data to gain insights.
4. Prepare the data for machine learning algorithms.
5. Select a model and train it.
6. Fine-tune the model.
7. Present the solution.
8. Launch, monitor, and maintain the system.
How the Models Compare
Training RMSE vs. cross-validation RMSE tells the real story. A model can look perfect on training data while failing badly on new examples; this gap is the overfitting signal.
| Model | Train RMSE | CV RMSE (10-fold) | Diagnosis |
|---|---|---|---|
| Linear Regression | ~68,688 | ~69,858 ± 4,182 | Underfitting |
| Decision Tree | 0 (!) | ~66,868 ± 2,061 | Severe overfitting |
| Random Forest | ~17,474 | ~47,019 Β± 1,034 | Best so far (some overfitting) |
| Random Forest (tuned) | n/a | ~44,042 | Final model |
All RMSE values in USD. Lower is better. CV scores averaged over 10 non-overlapping folds.
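The train-vs-CV comparison can be reproduced in a few lines. Below is a minimal sketch using scikit-learn's built-in copy of the dataset as a stand-in; its target is in units of $100,000, so the absolute numbers will not match the table above.

```python
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score

# Built-in copy of the dataset (stand-in for the chapter's raw CSV).
X, y = fetch_california_housing(return_X_y=True)

tree = DecisionTreeRegressor(random_state=42)
tree.fit(X, y)

# Training RMSE: near zero, because the tree can memorise the training set.
train_rmse = np.sqrt(mean_squared_error(y, tree.predict(X)))

# 10-fold cross-validation RMSE: the honest estimate on held-out folds.
cv_rmse = -cross_val_score(tree, X, y,
                           scoring="neg_root_mean_squared_error", cv=10)

print(f"train RMSE: {train_rmse:.3f}")
print(f"CV RMSE:    {cv_rmse.mean():.3f} ± {cv_rmse.std():.3f}")
```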
What This Chapter Actually Teaches
Most ML work isn't modelling
The pipeline work (imputation, encoding, scaling, and feature engineering) dwarfs the model-training step. A rough model with clean data beats a sophisticated model with dirty data.
RMSE vs MAE: pick your outlier tolerance
RMSE (L2 norm) punishes large errors heavily; MAE (L1 norm) treats all errors equally. For bell-shaped distributions, RMSE is preferred. In outlier-heavy data, MAE is more robust.
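A toy illustration of the difference (the numbers are invented for the example): one large error moves RMSE far more than MAE.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([200_000, 250_000, 300_000, 350_000])
y_pred_clean   = np.array([210_000, 240_000, 310_000, 340_000])  # small errors everywhere
y_pred_outlier = np.array([200_000, 250_000, 300_000, 250_000])  # one error of 100k

for name, y_pred in [("clean", y_pred_clean), ("one outlier", y_pred_outlier)]:
    mae  = mean_absolute_error(y_true, y_pred)
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    # clean:       MAE = 10,000  RMSE = 10,000
    # one outlier: MAE = 25,000  RMSE = 50,000  (the squared term dominates)
    print(f"{name:12s}  MAE={mae:>9,.0f}  RMSE={rmse:>9,.0f}")
```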
Stratify on what matters
Random sampling risks a skewed test set. Stratify on the most predictive attribute, binned into categories (median income here), to guarantee proportional representation across strata.
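A minimal sketch of an income-stratified split in the spirit of the chapter, again using the built-in dataset as a stand-in (its income column is named MedInc rather than median_income):

```python
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

# Stand-in for the chapter's raw housing DataFrame.
housing = fetch_california_housing(as_frame=True).frame

# Bin the continuous income into 5 categories so every stratum is large
# enough to be represented proportionally in both splits.
housing["income_cat"] = pd.cut(housing["MedInc"],
                               bins=[0.0, 1.5, 3.0, 4.5, 6.0, np.inf],
                               labels=[1, 2, 3, 4, 5])

train_set, test_set = train_test_split(
    housing, test_size=0.2, stratify=housing["income_cat"], random_state=42)

# Verify the income distribution is (nearly) identical in both splits.
print(test_set["income_cat"].value_counts(normalize=True).sort_index())
```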
Zero training error is a warning sign
The decision tree scored RMSE = 0 on the training data, which looks perfect but means it memorised, not learned. Cross-validation exposes this: the same model scored ~67K on unseen folds.
Engineer before you model
bedrooms_ratio (total_bedrooms / total_rooms) correlated at r = -0.256 with price, far stronger than either raw component. Domain-informed feature combinations often matter more than algorithm choice.
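A sketch of how such a ratio feature can be sanity-checked. The built-in dataset is used as a stand-in here; its columns are per-household averages rather than block totals, so the correlation will not match the -0.256 quoted above.

```python
from sklearn.datasets import fetch_california_housing

# Stand-in for the chapter's raw DataFrame (column names differ from the book's).
housing = fetch_california_housing(as_frame=True).frame

# Domain-informed combination: bedrooms as a fraction of all rooms.
housing["bedrooms_ratio"] = housing["AveBedrms"] / housing["AveRooms"]

# Compare the new feature's correlation with the target against its raw components.
corr = housing.corr()["MedHouseVal"]
print(corr[["AveRooms", "AveBedrms", "bedrooms_ratio"]].sort_values())
```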
Confidence intervals beat point estimates
A final RMSE of 41,424 sounds precise, but the 95% CI [39,574; 43,780] is what actually informs a launch decision. Point estimates hide the uncertainty that matters for risk management.
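The chapter derives such an interval from a t-distribution over the squared test-set errors. The sketch below uses synthetic placeholder arrays standing in for the real y_test and the final model's predictions:

```python
import numpy as np
from scipy import stats

# Synthetic placeholders; in the real workflow these come from
# final_model.predict(X_test) and the held-out y_test.
rng = np.random.default_rng(42)
y_test = rng.uniform(100_000, 500_000, size=4_000)
final_predictions = y_test + rng.normal(0, 45_000, size=4_000)

# 95% t-interval on the squared errors, mapped back to RMSE scale via sqrt.
squared_errors = (final_predictions - y_test) ** 2
ci_low, ci_high = np.sqrt(stats.t.interval(
    0.95, len(squared_errors) - 1,
    loc=squared_errors.mean(), scale=stats.sem(squared_errors)))

print(f"RMSE point estimate: {np.sqrt(squared_errors.mean()):,.0f}")
print(f"95% CI: [{ci_low:,.0f}, {ci_high:,.0f}]")
```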
The Preprocessing Pipeline in Full
Scikit-Learn's Pipeline and ColumnTransformer classes enforce the discipline of fitting transformers only on training data, preventing data leakage from imputed medians, learned cluster centres, and categorical encodings. The chapter's pipeline combines five branches (a condensed code sketch follows the list):
- Ratio features: computes bedrooms_ratio, rooms_per_house, people_per_house via column division + StandardScaler → 3 features
- Log transform: log-transforms the heavy-tailed columns (total_bedrooms, total_rooms, population, households, median_income) → 5 features
- Geographic encoding: KMeans(n=10) on lat/lon, then Gaussian RBF similarity to each cluster centre → 10 features
- Categorical: SimpleImputer(most_frequent) + OneHotEncoder for the ocean_proximity attribute → 5 features
- Remaining numerical: SimpleImputer(median) + StandardScaler for the remaining numerical column (housing_median_age) → 1 feature
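A condensed sketch of how these branches might be wired together with Pipeline and ColumnTransformer. The ClusterSimilarity class and column_ratio helper are simplified stand-ins for the chapter's custom transformers, only one ratio branch is shown, and the column names assume the chapter's raw housing DataFrame:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.cluster import KMeans
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler

class ClusterSimilarity(BaseEstimator, TransformerMixin):
    """Gaussian RBF similarity of each sample to k cluster centres learned in fit()."""
    def __init__(self, n_clusters=10, gamma=0.1, random_state=None):
        self.n_clusters = n_clusters
        self.gamma = gamma
        self.random_state = random_state

    def fit(self, X, y=None):
        self.kmeans_ = KMeans(self.n_clusters, n_init=10,
                              random_state=self.random_state)
        self.kmeans_.fit(X)
        return self

    def transform(self, X):
        return rbf_kernel(X, self.kmeans_.cluster_centers_, gamma=self.gamma)

def column_ratio(X):
    # First column divided by the second (e.g. total_bedrooms / total_rooms).
    return X[:, [0]] / X[:, [1]]

ratio_pipeline = make_pipeline(SimpleImputer(strategy="median"),
                               FunctionTransformer(column_ratio),
                               StandardScaler())
log_pipeline = make_pipeline(SimpleImputer(strategy="median"),
                             FunctionTransformer(np.log),
                             StandardScaler())
cat_pipeline = make_pipeline(SimpleImputer(strategy="most_frequent"),
                             OneHotEncoder(handle_unknown="ignore"))
default_num_pipeline = make_pipeline(SimpleImputer(strategy="median"),
                                     StandardScaler())

preprocessing = ColumnTransformer([
    ("bedrooms_ratio", ratio_pipeline, ["total_bedrooms", "total_rooms"]),
    ("log", log_pipeline, ["total_bedrooms", "total_rooms", "population",
                           "households", "median_income"]),
    ("geo", ClusterSimilarity(n_clusters=10, random_state=42),
            ["latitude", "longitude"]),
    ("cat", cat_pipeline, ["ocean_proximity"]),
], remainder=default_num_pipeline)  # picks up housing_median_age
```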
Three Ideas Worth Keeping
The test set is your conscience
Data snooping bias means that every time you look at the test set, you risk making decisions that overfit to it β even indirectly. The discipline of stratified splitting before exploration, and cross-validation for model selection, exists to preserve the test set's integrity as a single honest evaluation.
Pipelines are reproducibility infrastructure
Writing preprocessing as imperative code creates subtle production bugs: what median was used to impute missing bedrooms? Were scalers fitted on training or the whole dataset? A pipeline makes these questions moot: the fit() call records all statistics, and the same pipeline handles training, validation, and production inference identically.
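A small self-contained illustration of that point: the fitted imputer records the training median, and exactly that value is reused at inference time.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

num_pipeline = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())

X_train = np.array([[1.0], [2.0], [np.nan], [4.0]])
num_pipeline.fit(X_train)  # statistics are learned here, once

# The training median (2.0) is recorded on the fitted imputer ...
print(num_pipeline.named_steps["simpleimputer"].statistics_)  # -> [2.]

# ... and exactly that value fills gaps at inference time, never a statistic
# computed from the incoming production batch.
X_new = np.array([[np.nan], [10.0]])
print(num_pipeline.transform(X_new))
```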
Model rot is the silent production killer
The world that generated your training data is not static. Model rot, the gradual performance decay due to distributional shift between training and inference time, is often undetectable until downstream metrics show problems. The monitoring infrastructure is often harder to build than the model itself.
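One lightweight way to make such drift visible (a sketch, not anything from the chapter): compare each incoming feature's distribution against a frozen training baseline, for example with a two-sample Kolmogorov-Smirnov test.

```python
import numpy as np
from scipy import stats

def drift_report(train_df, live_df, threshold=0.01):
    """Flag numeric features whose live distribution differs from the training
    baseline (two-sample KS test); a crude but cheap model-rot early warning."""
    flagged = {}
    for col in train_df.select_dtypes(include=np.number).columns:
        stat, p_value = stats.ks_2samp(train_df[col].dropna(),
                                       live_df[col].dropna())
        if p_value < threshold:
            flagged[col] = (stat, p_value)
    return flagged

# Usage: drift_report(training_features, last_week_of_inference_features)
```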