From Perceptron to Multi-Layer Networks
A Perceptron computes a step function of a weighted sum: ŷ = step(wᵀx + b). The Perceptron learning rule converges if the data is linearly separable; it fails on XOR, which is not. Adding hidden layers with non-linear activations creates an MLP. The Universal Approximation Theorem: a single hidden layer with enough neurons can approximate any continuous function on a compact domain. Depth is what makes this practical: deep networks can represent with polynomial width what shallow ones may need exponential width to match.
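A minimal sketch, assuming TensorFlow/Keras, of the XOR point above: one hidden layer with a non-linear activation fits the four XOR points that no single linear boundary can separate. The layer sizes, learning rate, and epoch count are illustrative choices, not values from the chapter.

```python
import numpy as np
import tensorflow as tf

# The four XOR points: not linearly separable, so a bare perceptron cannot fit them.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=np.float32)
y = np.array([0, 1, 1, 0], dtype=np.float32)

# One hidden layer with a non-linear activation is enough.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(2,)),
    tf.keras.layers.Dense(4, activation="tanh"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(loss="binary_crossentropy",
              optimizer=tf.keras.optimizers.Adam(learning_rate=0.1))
model.fit(X, y, epochs=500, verbose=0)

print(model.predict(X).round().ravel())  # expect [0. 1. 1. 0.]
```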
Backpropagation: How Networks Learn
Backprop applies reverse-mode autodiff to compute ∂L/∂θ exactly. Two passes per mini-batch: forward (compute all activations + cache them), backward (compute gradients using the chain rule from output to input). Modern frameworks (TensorFlow, PyTorch) implement this via computational graphs — you never write ∂L/∂w manually.
1. Forward pass: compute predictions ŷ by passing inputs through all layers sequentially. Store all intermediate activations.
2. Loss: compare ŷ to y using the loss function (MSE for regression; cross-entropy for classification).
3. Backward pass: apply the chain rule from output to input: ∂L/∂wᵢ = (∂L/∂aⱼ) · (∂aⱼ/∂wᵢ). This is the gradient.
4. Update: θ ← θ − η · ∇θL. Optimisers (Adam, SGD+momentum) determine how the gradient translates into a weight update.
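A minimal sketch of one training step with tf.GradientTape, annotating the four steps above. The model, loss, and learning rate here are placeholder choices for illustration only.

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(8,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1),
])
loss_fn = tf.keras.losses.MeanSquaredError()
optimizer = tf.keras.optimizers.SGD(learning_rate=1e-2)

def train_step(x_batch, y_batch):
    with tf.GradientTape() as tape:
        y_pred = model(x_batch, training=True)   # 1. forward pass (tape records activations)
        loss = loss_fn(y_batch, y_pred)          # 2. compute the loss
    grads = tape.gradient(loss, model.trainable_variables)           # 3. backward pass (chain rule)
    optimizer.apply_gradients(zip(grads, model.trainable_variables)) # 4. θ ← θ − η·∇L
    return loss

# Usage with random placeholder data:
x = tf.random.normal((32, 8))
y = tf.random.normal((32, 1))
print(float(train_step(x, y)))
```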
Activation Functions
Activation functions provide the non-linearity that makes deep networks powerful. ReLU dominates hidden layers: computationally trivial, doesn't saturate for z>0, but can "die" (unit permanently stuck at 0). Variants: LeakyReLU (small negative slope), ELU (smooth, negative saturation), SELU (self-normalising). Output activations depend on task: sigmoid (binary), softmax (multiclass), linear (regression).
ReLU
max(0, z) | Vanishing grad: No (for z > 0) | Default: Yes — hidden layers default
Dying ReLU problem (neurons stuck at 0). Use LeakyReLU or ELU if an issue.
Sigmoid
1 / (1 + e⁻ᶻ) | Vanishing grad: Yes — saturates near 0/1 | Default: Output layer (binary classification)
Squashes to (0,1). Outputs are probabilities. Slow training for hidden layers.
Softmax
exp(zᵢ) / Σ exp(zⱼ) | Vanishing grad: Mild | Default: Output layer (multiclass)
Converts logits to probability distribution summing to 1.
tanh
(eᶻ − e⁻ᶻ) / (eᶻ + e⁻ᶻ) | Vanishing grad: Yes — saturates near ±1 | Default: Less common; some recurrent nets
Zero-centred unlike sigmoid. Better for deep nets than sigmoid but slower than ReLU.
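A quick sketch, assuming TensorFlow/Keras, applying each of these activations to a few arbitrary example values via tf.keras.activations:

```python
import tensorflow as tf

z = tf.constant([-2.0, -0.5, 0.0, 0.5, 2.0])

print(tf.keras.activations.relu(z).numpy())     # negatives clipped to 0
print(tf.keras.activations.sigmoid(z).numpy())  # squashed into (0, 1)
print(tf.keras.activations.tanh(z).numpy())     # squashed into (-1, 1), zero-centred
print(tf.keras.activations.softmax(tf.reshape(z, (1, -1))).numpy())  # row sums to 1
```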
Key Layer Types
Dense (Fully Connected)
a = activation(Wx + b) | Use when: Default for tabular data, final layers of most networks
All inputs connect to all outputs. O(n_in · n_out) parameters.
Batch Normalisation
z_norm = (z − μ) / σ; y = γ·z_norm + β | Use when: After Dense/Conv layers in deep networks
Stabilises training; allows higher learning rates; acts as mild regulariser.
Dropout
p(drop) = dropout_rate per neuron per step | Use when: Regularisation in large networks
Randomly sets neurons to 0 during training. Test-time: all neurons active. Effective against co-adaptation.
Conv2D
feature_map = kernel ⊛ input | Use when: Images, sequence data, spatial structure
Parameter sharing: same kernel applied everywhere. Translation-invariant features.
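A minimal sketch combining these layer types into a small image classifier. The input shape, filter counts, and unit counts are illustrative assumptions; a Flatten layer is added to bridge the convolutional and dense parts.

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(28, 28, 1)),                              # e.g. greyscale images
    tf.keras.layers.Conv2D(32, kernel_size=3, activation="relu"),   # shared kernel, spatial features
    tf.keras.layers.BatchNormalization(),                           # stabilises activations
    tf.keras.layers.Flatten(),                                      # bridge conv -> dense
    tf.keras.layers.Dense(128, activation="relu"),                  # fully connected
    tf.keras.layers.Dropout(0.5),                                   # regularisation, training only
    tf.keras.layers.Dense(10, activation="softmax"),                # multiclass output
])
model.summary()
```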
Building with Keras
Keras offers three APIs: Sequential (linear stack), Functional (DAG, multiple inputs/outputs, shared layers), and Subclassing (full Python control). The Functional API handles most real-world architectures. compile() sets loss, optimizer, and metrics; fit() runs the training loop with callbacks. history.history tracks per-epoch train/val metrics for learning curve analysis.
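A minimal Functional API sketch showing compile(), fit(), and the history object. The architecture and the random placeholder data are assumptions for illustration only.

```python
import numpy as np
import tensorflow as tf

# Functional API: build the graph by calling layers on tensors.
inputs = tf.keras.Input(shape=(20,))
x = tf.keras.layers.Dense(64, activation="relu")(inputs)
x = tf.keras.layers.Dense(64, activation="relu")(x)
outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)
model = tf.keras.Model(inputs, outputs)

model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

# Random placeholder data, just to show the call signature.
X = np.random.rand(256, 20).astype("float32")
y = np.random.randint(0, 2, size=(256, 1))

history = model.fit(X, y, validation_split=0.2, epochs=5, verbose=0)
print(history.history.keys())  # per-epoch train/val loss and metrics for learning curves
```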
Essential Callbacks
ModelCheckpoint — saves the model whenever val_loss improves (save_best_only=True). EarlyStopping — halts training when val_loss hasn't improved for patience epochs; restore_best_weights=True rolls back to the best weights seen. ReduceLROnPlateau — multiplies the learning rate by a fixed factor (e.g. 0.5) after N stagnant epochs. TensorBoard — logs training metrics, histograms, and the computational graph for browser visualisation.
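A sketch wiring all four callbacks into a fit() call. The file paths, patience values, and factor are illustrative choices, and the model and data are assumed to exist already.

```python
import tensorflow as tf

callbacks = [
    tf.keras.callbacks.ModelCheckpoint("best_model.keras", monitor="val_loss",
                                       save_best_only=True),
    tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=10,
                                     restore_best_weights=True),
    tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=5),
    tf.keras.callbacks.TensorBoard(log_dir="logs/run_1"),
]

# Assuming model, X_train, y_train, X_val, y_val are already defined:
# history = model.fit(X_train, y_train, validation_data=(X_val, y_val),
#                     epochs=100, callbacks=callbacks)
```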
Mental Models
Each artificial neuron computes a dot product of its inputs with weights, adds a bias, then applies a non-linear activation. Stack enough of these and the universal approximation theorem guarantees you can approximate any continuous function.
Shallow networks need exponentially many neurons to represent what deep networks do compactly. Each layer builds on abstractions from the previous: pixels → edges → shapes → objects → classes.
The chain rule of calculus lets you compute how each weight affects the loss by multiplying gradients backwards through the network. Automatic differentiation (autodiff) in TensorFlow/PyTorch applies the chain rule over the recorded computational graph for you — you never implement it manually.
Sigmoid and tanh saturate: once their outputs approach the extremes (0/1 for sigmoid, ±1 for tanh), their gradients are essentially zero. Multiply many near-zero gradients together and the signal disappears in the early layers. ReLU avoids this for positive activations. Batch norm also helps.
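A back-of-the-envelope illustration of why this happens; the 10-layer depth is an arbitrary choice. The sigmoid derivative never exceeds 0.25, so chaining many of them shrinks the gradient towards zero.

```python
import numpy as np

def sigmoid_grad(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

print(sigmoid_grad(0.0))   # 0.25, the maximum possible value
print(sigmoid_grad(5.0))   # ~0.0066 once the unit is saturated

# Chain 10 layers of sigmoid derivatives: even in the best case the signal collapses.
print(0.25 ** 10)          # ~9.5e-07
```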
Adam tracks a running mean of gradients (momentum) and a running mean of squared gradients (adaptive per-parameter scaling). The result is an effective learning rate per parameter, shrinking where gradients are consistently large. Usually the safest default.
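A plain-NumPy sketch of the update this describes; the function name is hypothetical and the hyperparameter defaults shown are the commonly used ones.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # Running mean of gradients (momentum) and of squared gradients (per-parameter scaling).
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)   # bias correction for the zero-initialised averages
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```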
ModelCheckpoint saves the best weights. EarlyStopping halts when val_loss plateaus. ReduceLROnPlateau shrinks learning rate when stuck. TensorBoard logs metrics for visualisation. Use all four in every serious training run.
Pre-trained networks (ResNet, BERT) already learned useful representations from massive datasets. Freeze their weights, add a new head, and fine-tune on your small dataset. This achieves strong results with a small fraction of the data and compute needed to train from scratch.
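A sketch of the freeze-and-fine-tune pattern with a Keras pre-trained base. The choice of ResNet50, the 5-class head, and the dataset names in the comment are assumptions for illustration.

```python
import tensorflow as tf

# Pre-trained convolutional base (ImageNet weights), without its original classification head.
base = tf.keras.applications.ResNet50(weights="imagenet", include_top=False,
                                      input_shape=(224, 224, 3), pooling="avg")
base.trainable = False   # freeze the learned representations

# New head for a hypothetical 5-class task.
model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(5, activation="softmax"),
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam",
              metrics=["accuracy"])
# model.fit(train_ds, validation_data=val_ds, epochs=5)  # fine-tune the head on the small dataset
```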
Part of the HOML series. These notes distil Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow (3rd ed.) by Aurélien Géron. Chapter 10 marks the transition from classical ML to deep learning — the foundation for CNNs, RNNs, Transformers, and beyond.