HOML · Ch. 10 · April 2026 · 16 min read

Neural Networks with Keras: MLPs, Backpropagation & the Deep Learning Toolkit

Chapter 10 opens Part II of HOML, covering the McCulloch–Pitts neuron → Perceptron → MLP lineage, activation functions (ReLU, sigmoid, softmax, tanh), backpropagation via autodiff and the chain rule, the Keras Sequential and Functional APIs, model compilation (loss, optimizer, metrics), callbacks (ModelCheckpoint, EarlyStopping, TensorBoard), and hyperparameter tuning with Keras Tuner. The regression/classification examples on California Housing and Fashion-MNIST ground every concept.

From Perceptron to Multi-Layer Networks

A Perceptron computes a step function of a weighted sum: ŷ = step(wᵀx + b). The Perceptron learning rule converges when the data is linearly separable, but it fails on XOR. Adding hidden layers with non-linear activations creates an MLP. The Universal Approximation Theorem states that a single hidden layer with enough neurons can approximate any continuous function on a compact domain to arbitrary accuracy. Depth is what makes this practical: deep networks can represent many functions at polynomial (rather than exponential) width.
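To make the XOR point concrete, here is a minimal NumPy sketch (not from the book — the weights are hand-picked, not learned): no single-layer perceptron can separate XOR, but one hidden layer of two ReLU units computes it exactly.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

# The four XOR input points and their labels.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_xor = np.array([0.0, 1.0, 1.0, 0.0])

# Hand-crafted hidden layer: h1 fires when at least one input is on,
# h2 fires only when both are on.
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])
b1 = np.array([0.0, -1.0])
w2 = np.array([1.0, -2.0])   # output = h1 - 2*h2

h = relu(X @ W1 + b1)
y_hat = h @ w2
print(y_hat)  # [0. 1. 1. 0.] — exactly XOR
```

The second hidden unit "subtracts off" the both-inputs-on case, which is precisely the non-linear interaction a single weighted sum cannot express.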

Backpropagation: How Networks Learn

Backprop applies reverse-mode autodiff to compute ∂L/∂θ exactly. Two passes per mini-batch: forward (compute all activations + cache them), backward (compute gradients using the chain rule from output to input). Modern frameworks (TensorFlow, PyTorch) implement this via computational graphs — you never write ∂L/∂w manually.

1
Forward pass

Compute predictions ŷ by passing inputs through all layers sequentially. Store all intermediate activations.

2
Compute loss

Compare ŷ to y using the loss function (MSE for regression; cross-entropy for classification).

3
Backward pass

Apply the chain rule from output to input: ∂L/∂w = (∂L/∂a) · (∂a/∂z) · (∂z/∂w), where z is a layer's pre-activation and a its activation. The product of these local derivatives is the gradient.

4
Update weights

θ ← θ − η · ∇θL. Optimisers (Adam, SGD+momentum) determine how the gradient translates into a weight update.
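The four steps above can be run end to end in plain NumPy for a toy linear model with MSE loss (an illustrative sketch, not the book's code):

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(64, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w

w = np.zeros(3)   # parameters θ
eta = 0.1         # learning rate η

for step in range(200):
    # 1. Forward pass: compute predictions.
    y_hat = X @ w
    # 2. Compute loss: mean squared error.
    loss = np.mean((y_hat - y) ** 2)
    # 3. Backward pass: ∂L/∂w via the chain rule —
    #    ∂L/∂ŷ = 2(ŷ − y)/n, then ∂ŷ/∂w = X.
    grad = 2 * X.T @ (y_hat - y) / len(y)
    # 4. Update weights: θ ← θ − η·∇θL.
    w -= eta * grad

print(np.round(w, 3))  # ≈ [ 2. -1.  0.5] — recovers true_w
```

With a real multi-layer network the only change is that step 3 multiplies one local Jacobian per layer, which is exactly what autodiff automates.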

Activation Functions

Activation functions provide the non-linearity that makes deep networks powerful. ReLU dominates hidden layers: computationally trivial, doesn't saturate for z>0, but can "die" (unit permanently stuck at 0). Variants: LeakyReLU (small negative slope), ELU (smooth, negative saturation), SELU (self-normalising). Output activations depend on task: sigmoid (binary), softmax (multiclass), linear (regression).

ReLU

max(0, z)

Vanishing grad: No (for z > 0) | Default: Yes — the standard choice for hidden layers

Dying ReLU problem (neurons stuck at 0). Use LeakyReLU or ELU if an issue.

Sigmoid

1 / (1 + e⁻ᶻ)

Vanishing grad: Yes — saturates near 0/1 | Default: Output layer (binary classification)

Squashes to (0,1). Outputs are probabilities. Slow training for hidden layers.

Softmax

exp(zᵢ) / Σ exp(zⱼ)

Vanishing grad: Mild | Default: Output layer (multiclass)

Converts logits to probability distribution summing to 1.

tanh

(eᶻ − e⁻ᶻ) / (eᶻ + e⁻ᶻ)

Vanishing grad: Yes — saturates near ±1 | Default: Less common; some recurrent nets

Zero-centred unlike sigmoid. Better for deep nets than sigmoid but slower than ReLU.
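The four activations tabulated above, written out in NumPy as a quick check on the formulas (the max-shift in softmax is the usual numerical-stability trick):

```python
import numpy as np

def relu(z):    return np.maximum(0, z)
def sigmoid(z): return 1 / (1 + np.exp(-z))
def softmax(z):
    e = np.exp(z - z.max())    # shift by max(z) for numerical stability
    return e / e.sum()

z = np.array([-2.0, 0.0, 3.0])
print(relu(z))            # [0. 0. 3.] — negatives clipped to zero
print(sigmoid(z))         # values squashed into (0, 1)
print(softmax(z).sum())   # ≈ 1.0 — a valid probability distribution
print(np.tanh(z))         # values in (-1, 1), zero-centred
```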

Key Layer Types

Dense (Fully Connected)

a = activation(Wx + b)

Use when: Default for tabular data, final layers of most networks

All inputs connect to all outputs. O(n_in · n_out) parameters.

Batch Normalisation

z_norm = (z − μ) / √(σ² + ε); y = γ·z_norm + β

Use when: After Dense/Conv layers in deep networks

Stabilises training; allows higher learning rates; acts as mild regulariser.
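The batch-norm formula above, applied to one mini-batch in NumPy (a sketch of the training-time computation only — the moving averages used at inference are omitted; ε is the usual small constant):

```python
import numpy as np

def batch_norm(z, gamma, beta, eps=1e-5):
    mu = z.mean(axis=0)                     # per-feature batch mean μ
    var = z.var(axis=0)                     # per-feature batch variance σ²
    z_norm = (z - mu) / np.sqrt(var + eps)  # standardise
    return gamma * z_norm + beta            # learned scale γ and shift β

z = np.random.default_rng(0).normal(5.0, 3.0, size=(32, 4))
out = batch_norm(z, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0).round(6))  # ≈ 0 for every feature
print(out.std(axis=0).round(3))   # ≈ 1 for every feature
```

With γ = 1 and β = 0 the layer standardises each feature; the learnable γ and β let the network undo the normalisation where that helps.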

Dropout

p(drop) = dropout_rate per neuron per step

Use when: Regularisation in large networks

Randomly sets neurons to 0 during training; in the standard "inverted dropout" scheme, surviving activations are scaled up by 1/(1 − rate) so expected activations match. At test time all neurons are active. Effective against co-adaptation.
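Inverted dropout in a few lines of NumPy (a sketch of the mechanism, not Keras's internal implementation):

```python
import numpy as np

def dropout(a, rate, training, rng):
    if not training or rate == 0.0:
        return a                         # test time: all neurons active
    mask = rng.random(a.shape) >= rate   # keep each unit with prob 1 - rate
    return a * mask / (1.0 - rate)       # rescale survivors (inverted dropout)

rng = np.random.default_rng(1)
a = np.ones(10_000)
out = dropout(a, rate=0.3, training=True, rng=rng)
print((out == 0).mean())  # ≈ 0.3 of units dropped
print(out.mean())         # ≈ 1.0 — expected activation preserved
```

The rescaling is why no correction is needed at test time: training-time activations already have the right expected magnitude.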

Conv2D

feature_map = kernel ⊛ input

Use when: Images, sequence data, spatial structure

Parameter sharing: same kernel applied everywhere. Translation-invariant features.
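Parameter sharing is easy to see in a naive NumPy sketch of a Conv2D forward pass (valid padding, stride 1 — technically cross-correlation, which is what deep-learning "convolution" layers actually compute): one 3×3 kernel slides over the whole image, so the parameter count stays at 9 regardless of image size.

```python
import numpy as np

def conv2d(image, kernel):
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1          # output height (valid padding)
    ow = image.shape[1] - kw + 1          # output width
    out = np.empty((oh, ow))
    for i in range(oh):                   # the SAME kernel at every position
        for j in range(ow):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

image = np.arange(36, dtype=float).reshape(6, 6)
kernel = np.array([[0., 0., 0.],          # identity-on-centre kernel
                   [0., 1., 0.],
                   [0., 0., 0.]])
feature_map = conv2d(image, kernel)
print(feature_map.shape)  # (4, 4) — valid padding shrinks the map
print(kernel.size)        # 9 parameters, shared across all 16 positions
```

A Dense layer mapping the same 6×6 input to a 4×4 output would need 36 × 16 = 576 weights; the conv layer gets by with 9.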

Building with Keras

Keras offers three APIs: Sequential (linear stack), Functional (DAG, multiple inputs/outputs, shared layers), and Subclassing (full Python control). The Functional API handles most real-world architectures. compile() sets loss, optimizer, and metrics; fit() runs the training loop with callbacks. history.history tracks per-epoch train/val metrics for learning curve analysis.
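The Sequential and Functional APIs side by side, sketched for a California-Housing-style regression (8 input features; the layer sizes are illustrative, and TensorFlow is assumed to be installed):

```python
import tensorflow as tf
from tensorflow.keras import layers

# Sequential: a plain linear stack of layers.
seq_model = tf.keras.Sequential([
    layers.Input(shape=(8,)),              # 8 tabular features
    layers.Dense(30, activation="relu"),
    layers.Dense(1),                       # linear output for regression
])

# Functional: a DAG — here a wide & deep net with a skip connection,
# something Sequential cannot express.
inputs = tf.keras.Input(shape=(8,))
hidden = layers.Dense(30, activation="relu")(inputs)
hidden = layers.Dense(30, activation="relu")(hidden)
concat = layers.concatenate([inputs, hidden])  # wide path skips the stack
outputs = layers.Dense(1)(concat)
func_model = tf.keras.Model(inputs=inputs, outputs=outputs)

# compile() fixes loss, optimizer, and metrics before fit().
seq_model.compile(loss="mse", optimizer="adam", metrics=["mae"])
func_model.compile(loss="mse", optimizer="adam", metrics=["mae"])
```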

Essential Callbacks

ModelCheckpoint — saves the model whenever val_loss improves (save_best_only=True). EarlyStopping — halts training when val_loss hasn't improved for patience epochs; restore_best_weights=True restores the best weights seen. ReduceLROnPlateau — multiplies the learning rate by a configurable factor (0.1 by default) after N stagnant epochs. TensorBoard — logs training metrics, histograms, and the computational graph for browser visualisation.
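All four callbacks wired into a fit() call (a sketch with toy data standing in for California Housing; the filenames, patience values, and epoch count are illustrative):

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, callbacks

model = tf.keras.Sequential([
    layers.Input(shape=(8,)),
    layers.Dense(30, activation="relu"),
    layers.Dense(1),
])
model.compile(loss="mse", optimizer="adam")

cbs = [
    callbacks.ModelCheckpoint("best_model.keras", save_best_only=True),
    callbacks.EarlyStopping(patience=10, restore_best_weights=True),
    callbacks.ReduceLROnPlateau(factor=0.5, patience=5),
    callbacks.TensorBoard(log_dir="logs"),
]

# Toy stand-in data: 200 samples, 8 features.
X = np.random.default_rng(0).normal(size=(200, 8)).astype("float32")
y = 2.0 * X[:, :1]
history = model.fit(X, y, validation_split=0.2, epochs=3,
                    callbacks=cbs, verbose=0)
print(len(history.history["loss"]))  # 3 — one entry per epoch
```

history.history holds the per-epoch train/val metrics mentioned above, ready for learning-curve plots.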

Mental Models

01
A neuron is a weighted sum + threshold

Each artificial neuron computes a dot product of its inputs with weights, adds a bias, then applies a non-linear activation. Stack enough of these and the universal approximation theorem guarantees you can approximate any continuous function.

02
Depth creates compositionality

Shallow networks need exponentially many neurons to represent what deep networks do compactly. Each layer builds on abstractions from the previous: pixels → edges → shapes → objects → classes.

03
Backprop = chain rule at scale

The chain rule of calculus lets you compute how each weight affects the loss by multiplying gradients backwards through the network. Automatic differentiation (autodiff) in TensorFlow/PyTorch applies it mechanically over the computational graph — you never implement it manually.

04
Vanishing gradients kill deep learning

Sigmoid and tanh saturate — their gradients are essentially zero once outputs approach the extremes (0 or 1 for sigmoid, ±1 for tanh). Multiply many near-zero gradients together and the signal disappears in the early layers. ReLU prevents this for positive activations. Batch norm also helps.

05
Adam = SGD + moment + scale

Adam tracks a running mean of gradients (momentum) and a running mean of squared gradients (adaptive scaling per parameter). This makes the effective step size per-parameter and adaptive to gradient magnitude. Usually the safest default.

06
Callbacks = training hooks

ModelCheckpoint saves the best weights. EarlyStopping halts when val_loss plateaus. ReduceLROnPlateau shrinks learning rate when stuck. TensorBoard logs metrics for visualisation. Use all four in every serious training run.

07
Transfer learning = steal most of the model

Pre-trained networks (ResNet, BERT) already learned useful representations from massive datasets. Freeze their weights, add a new head, and fine-tune on your small dataset. This often approaches state-of-the-art results with a small fraction of the data required for training from scratch.

Part of the HOML series. These notes distil Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow (3rd ed.) by Aurélien Géron. Chapter 10 marks the transition from classical ML to deep learning — the foundation for CNNs, RNNs, Transformers, and beyond.