Knowledge Hub
Technical writing, architectural deep dives, and open research on AI, machine learning, data systems, and beyond.
From Raw Data to Deployed Model: The End-to-End ML Playbook
An end-to-end walkthrough of a supervised regression problem: problem framing, stratified sampling, EDA with correlation analysis, preprocessing pipeline design, model benchmarking via k-fold CV, GridSearchCV fine-tuning, and confidence-interval test-set evaluation.
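A taste in code: a minimal sketch of the pipeline-plus-GridSearchCV pattern the walkthrough builds up to, on a synthetic dataset rather than the article's data (the stratified-sampling and confidence-interval steps are omitted here for brevity):

```python
# Preprocessing pipeline + k-fold grid search, sketched on synthetic data.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=1_000, n_features=8, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

pipe = Pipeline([
    ("scale", StandardScaler()),                       # preprocessing step
    ("model", RandomForestRegressor(random_state=42)),
])
grid = GridSearchCV(                                   # small illustrative grid
    pipe,
    param_grid={"model__n_estimators": [50, 100], "model__max_depth": [None, 10]},
    cv=5,
    scoring="neg_root_mean_squared_error",
)
grid.fit(X_train, y_train)
print(grid.best_params_, -grid.best_score_)            # best config + its CV RMSE
```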
Classification: Beyond Accuracy — The Full Evaluator's Toolkit
Chapter 3 distilled: the MNIST dataset as a pedagogical vehicle for binary, multiclass, multilabel, and multioutput classification. The chapter's real contribution is its thorough treatment of evaluation metrics and the insight that accuracy systematically misleads on imbalanced datasets.
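To make the accuracy trap concrete, a hedged sketch: a binary detector on a synthetic 90/10 dataset standing in for a 5-vs-rest MNIST task, where accuracy flatters while precision and recall tell the real story:

```python
# Accuracy vs precision/recall on an imbalanced binary task (synthetic data).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5_000, weights=[0.9, 0.1], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

clf = LogisticRegression(max_iter=1_000).fit(X_tr, y_tr)
y_pred = clf.predict(X_te)
print("accuracy :", accuracy_score(y_te, y_pred))   # inflated by the 90% majority class
print("precision:", precision_score(y_te, y_pred))
print("recall   :", recall_score(y_te, y_pred))
print("f1       :", f1_score(y_te, y_pred))
print(confusion_matrix(y_te, y_pred))               # rows: true class, cols: predicted
```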
Training Models: From Normal Equations to Regularised Regression
Chapter 4 distilled: closed-form Normal Equation vs iterative Gradient Descent variants (Batch, SGD, Mini-batch); polynomial regression and the bias-variance decomposition; L2/L1/combined regularisation with the Elastic Net compromise; early stopping; and the bridge from linear to logistic to softmax regression.
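The two training routes side by side, sketched in plain NumPy on the book's y = 4 + 3x + noise toy example (learning rate and iteration count are illustrative):

```python
# Closed-form Normal Equation vs Batch Gradient Descent on the same data.
import numpy as np

rng = np.random.default_rng(42)
X = 2 * rng.random((100, 1))
y = 4 + 3 * X[:, 0] + rng.normal(size=100)
X_b = np.c_[np.ones((100, 1)), X]                  # prepend the bias column

# Normal Equation: theta = (X^T X)^{-1} X^T y
theta_closed = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y

# Batch Gradient Descent on the MSE cost
theta = np.zeros(2)
eta = 0.1                                          # learning rate
for _ in range(1_000):
    gradient = (2 / len(X_b)) * X_b.T @ (X_b @ theta - y)
    theta -= eta * gradient

print(theta_closed, theta)                         # both converge near [4, 3]
```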
Support Vector Machines: The Art of Maximum Margin
Chapter 5 distilled: hard vs soft margin (the C trade-off), nonlinear classification via polynomial and RBF kernels, the primal/dual formulation, and why the kernel trick makes high-dimensional feature mapping computationally tractable. Plus SVM regression with the ε-insensitive tube.
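The C trade-off in a few lines, sketched with an RBF-kernel SVC on the moons dataset (the gamma and C values are illustrative, not the chapter's):

```python
# Small C -> wide, violation-tolerant margin (many support vectors);
# large C -> narrow, strict margin (fewer support vectors).
from sklearn.datasets import make_moons
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.15, random_state=42)
for C in (0.1, 100):
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", gamma=5, C=C))
    clf.fit(X, y)
    print(f"C={C}: {len(clf[-1].support_vectors_)} support vectors")
```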
Decision Trees: White-Box Models That Ask Questions
Chapter 6 distilled: CART algorithm (greedy splits on Gini/entropy), the Iris tree walkthrough with probability estimation, regularisation via max_depth/min_samples_leaf, regression trees minimising MSE, axis-aligned boundary limitations, and the high-variance problem that motivates ensembles.
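The white-box quality in code: a depth-limited CART tree on Iris whose rules can be printed and whose leaves yield class-frequency probability estimates (the petal-only feature choice mirrors the chapter's example):

```python
# A regularised decision tree you can read: printed rules + predict_proba.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=42)   # regularised via max_depth
tree.fit(iris.data[:, 2:], iris.target)                       # petal length & width only

print(export_text(tree, feature_names=iris.feature_names[2:]))
print(tree.predict_proba([[5.0, 1.5]]))   # class frequencies at the matched leaf
```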
Dimensionality Reduction: PCA, Manifolds & the Curse of Dimensions
Chapter 8 covers the curse of dimensionality and the manifold hypothesis, PCA via SVD (explained variance ratio, n_components=0.95 auto-selection), Incremental and Randomised PCA, Kernel PCA (RBF/poly), Locally Linear Embedding, Random Projections (Johnson–Lindenstrauss), and when to use each based on data geometry and scale.
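The auto-selection trick in a few lines, sketched on the 64-dimensional digits dataset (standing in for the chapter's MNIST example):

```python
# Keep however many principal components explain 95% of the variance.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X = load_digits().data                     # shape (1797, 64)
pca = PCA(n_components=0.95)               # a float in (0, 1) auto-selects dimensionality
X_reduced = pca.fit_transform(X)
print(X_reduced.shape[1], pca.explained_variance_ratio_.sum())
```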
Unsupervised Learning: K-Means, DBSCAN & Gaussian Mixture Models
Chapter 9 covers K-Means (Lloyd + K-Means++, inertia, silhouette scores, mini-batch variant), DBSCAN (core/border/noise points, density-reachability, outlier detection), GMMs (EM algorithm, covariance types, AIC/BIC for model selection), hierarchical clustering, image segmentation, and semi-supervised learning via cluster label propagation.
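Two of the chapter's model-selection tools in a sketch: silhouette scores for K-Means and BIC for a Gaussian mixture, on synthetic blobs with a known cluster count:

```python
# Choosing k: silhouette for K-Means, BIC for GMMs (synthetic 4-blob data).
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))   # typically peaks at the true k

gms = [GaussianMixture(n_components=k, random_state=42).fit(X) for k in range(2, 7)]
print("BIC picks k =", 2 + min(range(5), key=lambda i: gms[i].bic(X)))
```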
Neural Networks with Keras: MLPs, Backpropagation & the Deep Learning Toolkit
Chapter 10 opens Part II of HOML: Perceptron → MLP → backprop via autodiff; activation functions (ReLU, sigmoid, softmax, tanh); Keras Sequential and Functional APIs; compile/fit/evaluate; key callbacks (ModelCheckpoint, EarlyStopping, TensorBoard); and hyperparameter tuning with Keras Tuner. Grounds every concept in Fashion-MNIST and California Housing examples.
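The compile/fit/evaluate loop with EarlyStopping, sketched on random stand-in data so it runs anywhere TensorFlow is installed (the article itself trains on Fashion-MNIST):

```python
# A minimal Keras Sequential MLP with an EarlyStopping callback.
import numpy as np
import tensorflow as tf

X = np.random.rand(1_000, 28 * 28).astype("float32")   # stand-in for flattened images
y = np.random.randint(0, 10, size=1_000)               # stand-in for 10 class labels

model = tf.keras.Sequential([
    tf.keras.layers.Dense(300, activation="relu"),
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(loss="sparse_categorical_crossentropy",
              optimizer="adam", metrics=["accuracy"])

early = tf.keras.callbacks.EarlyStopping(patience=5, restore_best_weights=True)
model.fit(X, y, epochs=10, validation_split=0.2, callbacks=[early], verbose=0)
print(model.evaluate(X, y, verbose=0))                 # [loss, accuracy]
```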
Ensemble Learning: Random Forests, Boosting & Stacking
Chapter 7 distilled: hard vs soft voting, bagging/pasting with OOB evaluation (oob_score_=0.896), Random Patches/Subspaces, RandomForest with √n features per split, ExtraTrees, feature importance via weighted Gini reduction, AdaBoost SAMME/SAMME.R, Gradient Boosting with shrinkage + early stopping, Histogram-Based GBT O(b×m), and Stacking with a blender.
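Two of the chapter's ideas in a sketch: out-of-bag evaluation as a free validation estimate, and soft voting that averages class probabilities across heterogeneous models (the 0.896 figure above is the article's own run, not reproduced here):

```python
# OOB evaluation + a soft-voting ensemble on the moons dataset.
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = make_moons(n_samples=500, noise=0.3, random_state=42)

rf = RandomForestClassifier(n_estimators=500, oob_score=True, random_state=42)
rf.fit(X, y)
print("OOB accuracy:", rf.oob_score_)       # estimated from never-sampled rows

voting = VotingClassifier(
    estimators=[("lr", LogisticRegression()),
                ("rf", RandomForestClassifier(random_state=42)),
                ("svc", SVC(probability=True, random_state=42))],
    voting="soft",                          # average predicted probabilities
).fit(X, y)
print("ensemble accuracy:", voting.score(X, y))   # training-set score, illustration only
```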
Trade-offs in Data Systems Architecture
Chapter 1 of DDIA 2nd ed. distilled: the data-intensive vs compute-intensive distinction, the five standard building blocks (DB/cache/search/stream/batch), OLTP vs OLAP access patterns and the 7-property comparison, data warehousing and ETL, the evolution to data lakes and the sushi principle, systems of record vs derived data, and the cloud vs self-hosting spectrum including cloud-native disaggregated storage/compute architecture.
Defining Nonfunctional Requirements
Chapter 2 of DDIA 2nd ed. distilled: functional vs nonfunctional requirements, the social network fan-out case study (pull-on-read vs materialised timelines, celebrity hybrid), latency decomposition (service time + queueing + network), percentile metrics (p50/p95/p99/p999), tail latency amplification, SLOs vs SLAs, fault vs failure distinction, SPOF elimination, hardware vs software fault characteristics, and human error as the leading cause of outages.
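Percentile metrics in a sketch: a lognormal sample mimics real traffic's right skew, and the tail percentiles sit far above the mean a naive dashboard would report (distribution parameters are illustrative):

```python
# p50/p95/p99/p999 from a synthetic latency sample; means hide the tail.
import numpy as np

rng = np.random.default_rng(42)
latencies_ms = rng.lognormal(mean=3.0, sigma=0.7, size=100_000)  # right-skewed

for p in (50, 95, 99, 99.9):
    print(f"p{p}: {np.percentile(latencies_ms, p):7.1f} ms")
print(f"mean: {latencies_ms.mean():7.1f} ms")   # well below p99
```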
Data Models and Query Languages: Relational, Document, Graph, and Beyond
Chapter 3 of DDIA 2nd ed. distilled: the object-relational impedance mismatch and ORM failure modes, document model locality vs join limitations, normalisation/denormalisation as a read-write performance trade-off (with the Twitter timeline as a worked case study), star/snowflake/OBT analytics schemas, schema-on-read vs schema-on-write, property graphs and the Cypher query language, triple-stores and SPARQL, GraphQL's deliberate constraints, and event sourcing/CQRS as a write-optimised architecture with derived materialised views.
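The normalisation trade-off in a toy: compute the timeline at read time (a join over the system of record) or materialise it at write time (denormalised fan-out). An illustrative in-memory model, not the chapter's code:

```python
# Fan-out on write (materialised timelines) vs join on read (normalised).
from collections import defaultdict

posts = []                      # system of record: (author, text)
follows = defaultdict(set)      # follower -> followees
timelines = defaultdict(list)   # derived, denormalised per-user timelines

def publish(author, text):
    posts.append((author, text))
    for follower, followees in follows.items():   # write amplification lives here
        if author in followees:
            timelines[follower].append((author, text))

def timeline_on_read(user):
    # normalised alternative: filter/"join" over the base data at read time
    return [(a, t) for a, t in posts if a in follows[user]]

follows["alice"].add("bob")
publish("bob", "hello")
print(timelines["alice"], timeline_on_read("alice"))   # same answer, different cost profile
```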
Storage and Retrieval: How Databases Really Work Under the Hood
Chapter 4 of DDIA 2nd ed. distilled: hash indexes and their RAM constraints, SSTables with sparse indexes, LSM-trees (WAL → memtable → SSTable flush → background compaction, Bloom filters, size-tiered vs leveled strategies), B-tree internals (fixed-size pages, branching factor, WAL, copy-on-write), a structured B-tree vs LSM-tree comparison across eight dimensions, secondary index variants (clustered, covering, multidimensional, inverted, vector), in-memory databases, column-oriented storage with bitmap compression, vectorised query execution, OLAP cubes, full-text inverted indexes, and vector indexes (IVF, HNSW) for semantic search.
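The LSM write path in a toy: a mutable memtable that flushes into immutable sorted runs, with reads checking newest-first. Real engines add a WAL, Bloom filters, sparse indexes, and background compaction; this sketch only shows the shape:

```python
# An LSM-tree in miniature: memtable -> flush -> newest-first reads.
class TinyLSM:
    def __init__(self, memtable_limit=4):
        self.memtable = {}      # stands in for a sorted in-memory structure
        self.sstables = []      # immutable sorted runs, oldest first
        self.limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.limit:
            self.sstables.append(sorted(self.memtable.items()))   # flush to a run
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.sstables):    # newest run wins on duplicates
            for k, v in run:                   # real runs use a sparse index + binary search
                if k == key:
                    return v
        return None

db = TinyLSM()
for i in range(10):
    db.put(f"k{i}", i)
print(db.get("k3"), db.get("k9"))   # one from a flushed run, one from the memtable
```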
Encoding and Evolution: How Data Survives Schema Changes
Chapter 5 of DDIA 2nd ed. distilled: backward vs forward compatibility and why rolling upgrades make both mandatory; encoding format taxonomy from language-specific serialisers to JSON/XML/CSV to binary schema-driven formats; schema evolution rules for Protocol Buffers (field tags) and Avro (writer/reader schema resolution); the four dataflow modes (databases, REST/RPC, workflow engines, message brokers); the five fundamental problems with RPC; durable execution frameworks (Temporal, Restate); event-driven architectures with message brokers and the actor model.
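Writer/reader schema resolution in a toy, inspired by Avro's rules but much simplified (real Avro also resolves types, unions, and aliases): fields added by the reader's schema get their defaults, and fields the reader doesn't know are dropped:

```python
# Old record, new reader: defaults fill missing fields, unknowns are ignored.
reader_schema = {            # v2 schema: field -> default value
    "user_id": None,
    "name": None,
    "email": "n/a",          # added in v2, so v1 records lack it
}

def resolve(record, schema):
    return {field: record.get(field, default) for field, default in schema.items()}

v1_record = {"user_id": 7, "name": "Ada"}    # written by old code
print(resolve(v1_record, reader_schema))     # email filled with its default
```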
Replication: Keeping Copies in Sync Across a Distributed System
Chapter 6 of DDIA 2nd ed. distilled: the three replication algorithms (single-leader, multi-leader, leaderless/Dynamo-style), replication log types (statement-based, WAL shipping, logical/row-based), the five consistency models (eventual, read-your-writes, monotonic reads, consistent prefix reads, linearizability), conflict detection and resolution strategies (conflict avoidance, LWW, manual/sibling merge, CRDTs, Operational Transformation), and the real-world failure modes including split brain, STONITH, and the GitHub auto-increment incident.
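Two conflict-handling extremes in a toy: last-write-wins, simple but silently discarding concurrent writes, and a grow-only counter CRDT, whose element-wise-max merge commutes and loses nothing (replica names and values are illustrative):

```python
# LWW (lossy) vs a G-counter CRDT (convergent) for replica conflicts.
def lww_merge(a, b):
    """a, b: (timestamp, value) pairs; the later timestamp silently wins."""
    return a if a[0] >= b[0] else b

def gcounter_merge(a, b):
    """a, b: per-replica count dicts; element-wise max commutes and converges."""
    return {r: max(a.get(r, 0), b.get(r, 0)) for r in a.keys() | b.keys()}

print(lww_merge((1.0, "blue"), (2.5, "green")))                 # 'blue' is lost
print(gcounter_merge({"r1": 3, "r2": 1}, {"r1": 2, "r2": 4}))   # {'r1': 3, 'r2': 4}
```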
Sharding: Splitting a Dataset Across Many Machines
Chapter 7 of DDIA 2nd ed. distilled: sharding as the horizontal scaling tool of last resort; the four sharding algorithms (key-range, hash/fixed-shards, hash-range, consistent hashing); skewed workloads and hot-key mitigation strategies; automatic vs. manual rebalancing trade-offs; the three request routing approaches (any-node/gossip, routing tier/ZooKeeper, client-side/Raft); sharding for multitenancy (resource/permission isolation, cell-based architecture, GDPR compliance); and the two secondary index strategies (local/document-partitioned vs. global/term-partitioned) with their scatter/gather vs. distributed-write cost trade-offs.
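Consistent hashing in a toy: nodes and keys hash onto a ring, a key belongs to the first node clockwise, and virtual nodes smooth the distribution (the vnode count is illustrative):

```python
# A consistent-hash ring with virtual nodes; adding a node moves few keys.
import bisect
import hashlib

def h(s: str) -> int:
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % (2**32)

class Ring:
    def __init__(self, nodes, vnodes=100):
        self.points = sorted((h(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))
        self.hashes = [p for p, _ in self.points]

    def owner(self, key):
        i = bisect.bisect(self.hashes, h(key)) % len(self.points)   # clockwise walk
        return self.points[i][1]

ring = Ring(["node-a", "node-b", "node-c"])
print({k: ring.owner(k) for k in ("user:1", "user:2", "user:3")})
```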
Transactions: The Safety Net That Makes Concurrent Databases Sane
Chapter 8 of DDIA 2nd ed. distilled: the ACID properties (with the caveat that "C" is an application concern), the full anomaly taxonomy (dirty reads/writes, lost updates, read skew, write skew, phantoms), MVCC and snapshot-isolation visibility rules, lost-update prevention strategies, and the three mechanisms for serializability: actual serial execution, two-phase locking (2PL), and serializable snapshot isolation (SSI).
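Snapshot-isolation visibility in a toy, a simplification of the MVCC rule: a row version is visible iff its creator committed before the snapshot and its deleter has not (the txids and record structure are illustrative):

```python
# MVCC visibility: committed-before-snapshot creations, minus committed deletions.
def visible(version, snapshot):
    def committed(txid):
        return (txid is not None
                and txid < snapshot["next_txid"]          # started before the snapshot
                and txid not in snapshot["in_progress"])  # and already committed
    return committed(version["created_by"]) and not committed(version["deleted_by"])

snapshot = {"next_txid": 10, "in_progress": {7}}          # tx 7 still running
row_v1 = {"created_by": 3, "deleted_by": 7}               # deletion not yet committed
row_v2 = {"created_by": 7, "deleted_by": None}            # written by the in-flight tx
print(visible(row_v1, snapshot), visible(row_v2, snapshot))   # True False
```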
The Trouble with Distributed Systems: Everything That Can Go Wrong
Chapter 9 of DDIA 2nd ed. catalogues the sources of unreliability: network faults (five distinct failure modes, all of which surface as a timeout), unreliable clocks (drift, leap seconds, clocks inside virtual machines, process pauses), the impossibility of distinguishing a network failure from a node crash, the majority-vote principle, fencing tokens for zombie prevention, and system models (synchronous/partially synchronous/asynchronous; crash-stop/crash-recovery/Byzantine).
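Fencing tokens in a toy: the lock service issues monotonically increasing tokens and storage rejects anything stale, so a paused zombie cannot clobber newer writes (the class names here are illustrative):

```python
# A zombie lock-holder is fenced off by its stale token.
class LockService:
    def __init__(self):
        self.token = 0
    def acquire(self):
        self.token += 1          # strictly increasing fencing token
        return self.token

class Storage:
    def __init__(self):
        self.max_seen, self.value = 0, None
    def write(self, token, value):
        if token < self.max_seen:
            raise PermissionError(f"stale fencing token {token}")
        self.max_seen, self.value = token, value

locks, store = LockService(), Storage()
t1 = locks.acquire()             # client 1 takes the lock, then pauses (long GC)
t2 = locks.acquire()             # lease expires; client 2 takes over
store.write(t2, "from client 2")
try:
    store.write(t1, "from the zombie")
except PermissionError as e:
    print(e)                     # rejected: token 1 < 2
```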
Consistency and Consensus: The Strongest Guarantees in Distributed Systems
Chapter 10 of DDIA 2nd ed. distilled: linearizability as a recency guarantee and its use cases (locks, uniqueness constraints, cross-channel ordering), the CAP theorem, the performance cost of linearizability (the Attiya-Welch bound), logical and hybrid logical clocks (Lamport clocks, vector clocks, HLC), linearizable ID generators, and consensus in its various formulations: single-value consensus, total order broadcast, and atomic commitment.
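Lamport clocks in a toy: increment on local events, take max(local, received) + 1 on receipt, and break ties by node id, giving a total order consistent with causality:

```python
# Lamport timestamps: (counter, node_id) pairs ordered as tuples.
class LamportClock:
    def __init__(self, node_id):
        self.node_id, self.t = node_id, 0
    def tick(self):                       # local event or message send
        self.t += 1
        return (self.t, self.node_id)
    def recv(self, stamp):                # message receipt
        self.t = max(self.t, stamp[0]) + 1
        return (self.t, self.node_id)

a, b = LamportClock("A"), LamportClock("B")
send = a.tick()                           # A acts, then sends to B
ack = b.recv(send)                        # B's stamp now exceeds A's
print(send < ack)                         # True: causal order is preserved
```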
Batch Processing: Taming Large Datasets with Bounded Jobs
Chapter 11 of DDIA 2nd ed. distilled: the batch-processing lineage from Unix tools through MapReduce to Spark/Flink; HDFS vs. object stores (S3/GCS); YARN resource management; workflow DAGs (Airflow/Dagster); MapReduce vs. dataflow engines (sorting, intermediate data, pipelining); join strategies (sort-merge, broadcast hash, partitioned hash); and batch output types (search indexes, key-value stores, ML models).
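One join strategy in a toy: the per-partition core of a sort-merge join, two cursors advancing over sorted runs (this sketch assumes unique keys on the left side, as with a users table):

```python
# Sort both sides by key, then merge with two cursors.
def sort_merge_join(left, right):
    left, right = sorted(left), sorted(right)   # (key, value) pairs
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i][0] < right[j][0]:
            i += 1
        elif left[i][0] > right[j][0]:
            j += 1
        else:
            key, lval = left[i]
            while j < len(right) and right[j][0] == key:   # emit the match group
                out.append((key, lval, right[j][1]))
                j += 1
            i += 1
    return out

users = [(1, "Ada"), (2, "Grace")]
clicks = [(1, "/home"), (1, "/docs"), (2, "/blog")]
print(sort_merge_join(users, clicks))
```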
Stream Processing: Taming Infinite Event Sequences in Real Time
Chapter 12 of DDIA 2nd ed. distilled: message brokers (traditional queues vs. log-based), Kafka's architecture (append-only log, partitions, consumer groups, offsets, the disk-backed ring buffer), change data capture and event sourcing (both solving the dual-write problem), and the core stream-processing patterns: windowing (tumbling, sliding, session), joins (stream-table, stream-stream), exactly-once semantics, and output sinks.
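One windowing flavour in a toy: tumbling-window counts keyed by floor(event_time / window). Real engines pair this with watermarks to decide when a window can safely close; the window size here is illustrative:

```python
# Tumbling windows: bucket events into fixed, non-overlapping intervals.
from collections import Counter

WINDOW_S = 60

def tumbling_counts(events):
    """events: iterable of (event_time_unix, key) pairs."""
    counts = Counter()
    for ts, key in events:
        window_start = int(ts // WINDOW_S) * WINDOW_S
        counts[(window_start, key)] += 1
    return counts

events = [(0, "page:/home"), (30, "page:/home"), (65, "page:/docs"), (70, "page:/home")]
for (start, key), n in sorted(tumbling_counts(events).items()):
    print(f"[{start}, {start + WINDOW_S}) {key}: {n}")
```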
A Philosophy of Streaming Systems: Putting the Pieces Together
Chapter 13 of DDIA 2nd ed. distilled: data-integration challenges (no single tool fits every workload; the dual-write problem), derived data as an alternative to distributed transactions, the limits of total ordering, lambda vs. kappa architectures (and why kappa won), unbundling the database (federated reads via Trino, unified writes via CDC), loose coupling through event logs, and designing applications as dataflow graphs over immutable state.
Private AI in the Classroom: WebAssembly Inference on the Edge
Exploring the architecture of privacy-preserving educational AI, using a WASM-compiled inference runtime to run GGUF models on-device for zero-network-payload inference.
Why Serverless Beats Your Cloud ETL Pipeline by 81%
A total-cost-of-ownership comparison between reserved-instance cloud ETL workloads and a deterministic serverless edge model.