Knowledge Hub
Technical writing, architectural deep dives, and open research on AI, machine learning, data systems, and beyond.
From Raw Data to Deployed Model: The End-to-End ML Playbook
An end-to-end walkthrough of a supervised regression problem: problem framing, stratified sampling, EDA with correlation analysis, preprocessing pipeline design, model benchmarking via k-fold CV, GridSearchCV fine-tuning, and confidence-interval test-set evaluation.
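A taste in code: a minimal sketch of the pipeline-plus-GridSearchCV pattern the walkthrough builds up to, on a synthetic dataset rather than the article's data (the stratified-sampling and confidence-interval steps are omitted here for brevity):

```python
# Preprocessing pipeline + k-fold grid search, sketched on synthetic data.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=1_000, n_features=8, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

pipe = Pipeline([
    ("scale", StandardScaler()),                       # preprocessing step
    ("model", RandomForestRegressor(random_state=42)),
])
grid = GridSearchCV(                                   # small illustrative grid
    pipe,
    param_grid={"model__n_estimators": [50, 100], "model__max_depth": [None, 10]},
    cv=5,
    scoring="neg_root_mean_squared_error",
)
grid.fit(X_train, y_train)
print(grid.best_params_, -grid.best_score_)            # best config + its CV RMSE
```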
Classification: Beyond Accuracy — The Full Evaluator's Toolkit
Chapter 3 distilled: the MNIST dataset as a pedagogical vehicle for binary, multiclass, multilabel, and multioutput classification. The chapter's real contribution is its thorough treatment of evaluation metrics and the insight that accuracy systematically misleads on imbalanced datasets.
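To make the accuracy trap concrete, a hedged sketch: a binary detector on a synthetic 90/10 dataset standing in for a 5-vs-rest MNIST task, where accuracy flatters while precision and recall tell the real story:

```python
# Accuracy vs precision/recall on an imbalanced binary task (synthetic data).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5_000, weights=[0.9, 0.1], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

clf = LogisticRegression(max_iter=1_000).fit(X_tr, y_tr)
y_pred = clf.predict(X_te)
print("accuracy :", accuracy_score(y_te, y_pred))   # inflated by the 90% majority class
print("precision:", precision_score(y_te, y_pred))
print("recall   :", recall_score(y_te, y_pred))
print("f1       :", f1_score(y_te, y_pred))
print(confusion_matrix(y_te, y_pred))               # rows: true class, cols: predicted
```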
Training Models: From Normal Equations to Regularised Regression
Chapter 4 distilled: closed-form Normal Equation vs iterative Gradient Descent variants (Batch, SGD, Mini-batch); polynomial regression and the bias-variance decomposition; L2/L1/combined regularisation with the Elastic Net compromise; early stopping; and the bridge from linear to logistic to softmax regression.
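The two training routes side by side, sketched in plain NumPy on the book's y = 4 + 3x + noise toy example (learning rate and iteration count are illustrative):

```python
# Closed-form Normal Equation vs Batch Gradient Descent on the same data.
import numpy as np

rng = np.random.default_rng(42)
X = 2 * rng.random((100, 1))
y = 4 + 3 * X[:, 0] + rng.normal(size=100)
X_b = np.c_[np.ones((100, 1)), X]                  # prepend the bias column

# Normal Equation: theta = (X^T X)^{-1} X^T y
theta_closed = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y

# Batch Gradient Descent on the MSE cost
theta = np.zeros(2)
eta = 0.1                                          # learning rate
for _ in range(1_000):
    gradient = (2 / len(X_b)) * X_b.T @ (X_b @ theta - y)
    theta -= eta * gradient

print(theta_closed, theta)                         # both converge near [4, 3]
```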
Support Vector Machines: The Art of Maximum Margin
Chapter 5 distilled: hard vs soft margin (the C trade-off), nonlinear classification via polynomial and RBF kernels, the primal/dual formulation, and why the kernel trick makes high-dimensional feature mapping computationally tractable. Plus SVM regression with the ε-insensitive tube.
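The C trade-off in a few lines, sketched with an RBF-kernel SVC on the moons dataset (the gamma and C values are illustrative, not the chapter's):

```python
# Small C -> wide, violation-tolerant margin (many support vectors);
# large C -> narrow, strict margin (fewer support vectors).
from sklearn.datasets import make_moons
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.15, random_state=42)
for C in (0.1, 100):
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", gamma=5, C=C))
    clf.fit(X, y)
    print(f"C={C}: {len(clf[-1].support_vectors_)} support vectors")
```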
Decision Trees: White-Box Models That Ask Questions
Chapter 6 distilled: CART algorithm (greedy splits on Gini/entropy), the Iris tree walkthrough with probability estimation, regularisation via max_depth/min_samples_leaf, regression trees minimising MSE, axis-aligned boundary limitations, and the high-variance problem that motivates ensembles.
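The white-box quality in code: a depth-limited CART tree on Iris whose rules can be printed and whose leaves yield class-frequency probability estimates (the petal-only feature choice mirrors the chapter's example):

```python
# A regularised decision tree you can read: printed rules + predict_proba.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=42)   # regularised via max_depth
tree.fit(iris.data[:, 2:], iris.target)                       # petal length & width only

print(export_text(tree, feature_names=iris.feature_names[2:]))
print(tree.predict_proba([[5.0, 1.5]]))   # class frequencies at the matched leaf
```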
Dimensionality Reduction: PCA, Manifolds & the Curse of Dimensions
Chapter 8 covers the curse of dimensionality and the manifold hypothesis, PCA via SVD (explained variance ratio, n_components=0.95 auto-selection), Incremental and Randomised PCA, Kernel PCA (RBF/poly), Locally Linear Embedding, Random Projections (Johnson–Lindenstrauss), and when to use each based on data geometry and scale.
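The auto-selection trick in a few lines, sketched on the 64-dimensional digits dataset (standing in for the chapter's MNIST example):

```python
# Keep however many principal components explain 95% of the variance.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X = load_digits().data                     # shape (1797, 64)
pca = PCA(n_components=0.95)               # a float in (0, 1) auto-selects dimensionality
X_reduced = pca.fit_transform(X)
print(X_reduced.shape[1], pca.explained_variance_ratio_.sum())
```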
Unsupervised Learning: K-Means, DBSCAN & Gaussian Mixture Models
Chapter 9 covers K-Means (Lloyd + K-Means++, inertia, silhouette scores, mini-batch variant), DBSCAN (core/border/noise points, density-reachability, outlier detection), GMMs (EM algorithm, covariance types, AIC/BIC for model selection), hierarchical clustering, image segmentation, and semi-supervised learning via cluster label propagation.
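Two of the chapter's model-selection tools in a sketch: silhouette scores for K-Means and BIC for a Gaussian mixture, on synthetic blobs with a known cluster count:

```python
# Choosing k: silhouette for K-Means, BIC for GMMs (synthetic 4-blob data).
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))   # typically peaks at the true k

gms = [GaussianMixture(n_components=k, random_state=42).fit(X) for k in range(2, 7)]
print("BIC picks k =", 2 + min(range(5), key=lambda i: gms[i].bic(X)))
```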
Neural Networks with Keras: MLPs, Backpropagation & the Deep Learning Toolkit
Chapter 10 opens Part II of HOML: Perceptron → MLP → backprop via autodiff; activation functions (ReLU, sigmoid, softmax, tanh); Keras Sequential and Functional APIs; compile/fit/evaluate; key callbacks (ModelCheckpoint, EarlyStopping, TensorBoard); and hyperparameter tuning with Keras Tuner. Grounds every concept in Fashion-MNIST and California Housing examples.
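The compile/fit/evaluate loop with EarlyStopping, sketched on random stand-in data so it runs anywhere TensorFlow is installed (the article itself trains on Fashion-MNIST):

```python
# A minimal Keras Sequential MLP with an EarlyStopping callback.
import numpy as np
import tensorflow as tf

X = np.random.rand(1_000, 28 * 28).astype("float32")   # stand-in for flattened images
y = np.random.randint(0, 10, size=1_000)               # stand-in for 10 class labels

model = tf.keras.Sequential([
    tf.keras.layers.Dense(300, activation="relu"),
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(loss="sparse_categorical_crossentropy",
              optimizer="adam", metrics=["accuracy"])

early = tf.keras.callbacks.EarlyStopping(patience=5, restore_best_weights=True)
model.fit(X, y, epochs=10, validation_split=0.2, callbacks=[early], verbose=0)
print(model.evaluate(X, y, verbose=0))                 # [loss, accuracy]
```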
Ensemble Learning: Random Forests, Boosting & Stacking
Chapter 7 distilled: hard vs soft voting, bagging/pasting with OOB evaluation (oob_score_=0.896), Random Patches/Subspaces, RandomForest with √n features per split, ExtraTrees, feature importance via weighted Gini reduction, AdaBoost SAMME/SAMME.R, Gradient Boosting with shrinkage + early stopping, Histogram-Based GBT O(b×m), and Stacking with a blender.
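Two of the chapter's ideas in a sketch: out-of-bag evaluation as a free validation estimate, and soft voting that averages class probabilities across heterogeneous models (the 0.896 figure above is the article's own run, not reproduced here):

```python
# OOB evaluation + a soft-voting ensemble on the moons dataset.
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = make_moons(n_samples=500, noise=0.3, random_state=42)

rf = RandomForestClassifier(n_estimators=500, oob_score=True, random_state=42)
rf.fit(X, y)
print("OOB accuracy:", rf.oob_score_)       # estimated from never-sampled rows

voting = VotingClassifier(
    estimators=[("lr", LogisticRegression()),
                ("rf", RandomForestClassifier(random_state=42)),
                ("svc", SVC(probability=True, random_state=42))],
    voting="soft",                          # average predicted probabilities
).fit(X, y)
print("ensemble accuracy:", voting.score(X, y))   # training-set score, illustration only
```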
Trade-offs in Data Systems Architecture
Chapter 1 of DDIA 2nd ed. distilled: the data-intensive vs compute-intensive distinction, the five standard building blocks (DB/cache/search/stream/batch), OLTP vs OLAP access patterns and the 7-property comparison, data warehousing and ETL, the evolution to data lakes and the sushi principle, systems of record vs derived data, and the cloud vs self-hosting spectrum including cloud-native disaggregated storage/compute architecture.
Defining Nonfunctional Requirements
Chapter 2 of DDIA 2nd ed. distilled: functional vs nonfunctional requirements, the social network fan-out case study (pull-on-read vs materialised timelines, celebrity hybrid), latency decomposition (service time + queueing + network), percentile metrics (p50/p95/p99/p999), tail latency amplification, SLOs vs SLAs, fault vs failure distinction, SPOF elimination, hardware vs software fault characteristics, and human error as the leading cause of outages.
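Percentile metrics in a sketch: a lognormal sample mimics real traffic's right skew, and the tail percentiles sit far above the mean a naive dashboard would report (distribution parameters are illustrative):

```python
# p50/p95/p99/p999 from a synthetic latency sample; means hide the tail.
import numpy as np

rng = np.random.default_rng(42)
latencies_ms = rng.lognormal(mean=3.0, sigma=0.7, size=100_000)  # right-skewed

for p in (50, 95, 99, 99.9):
    print(f"p{p}: {np.percentile(latencies_ms, p):7.1f} ms")
print(f"mean: {latencies_ms.mean():7.1f} ms")   # well below p99
```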
Data Models and Query Languages: Relational, Document, Graph, and Beyond
Chapter 3 of DDIA 2nd ed. distilled: the object-relational impedance mismatch and ORM failure modes, document model locality vs join limitations, normalisation/denormalisation as a read-write performance trade-off (with the Twitter timeline as a worked case study), star/snowflake/OBT analytics schemas, schema-on-read vs schema-on-write, property graphs and the Cypher query language, triple-stores and SPARQL, GraphQL's deliberate constraints, and event sourcing/CQRS as a write-optimised architecture with derived materialised views.
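The normalisation trade-off in a toy: compute the timeline at read time (a join over the system of record) or materialise it at write time (denormalised fan-out). An illustrative in-memory model, not the chapter's code:

```python
# Fan-out on write (materialised timelines) vs join on read (normalised).
from collections import defaultdict

posts = []                      # system of record: (author, text)
follows = defaultdict(set)      # follower -> followees
timelines = defaultdict(list)   # derived, denormalised per-user timelines

def publish(author, text):
    posts.append((author, text))
    for follower, followees in follows.items():   # write amplification lives here
        if author in followees:
            timelines[follower].append((author, text))

def timeline_on_read(user):
    # normalised alternative: filter/"join" over the base data at read time
    return [(a, t) for a, t in posts if a in follows[user]]

follows["alice"].add("bob")
publish("bob", "hello")
print(timelines["alice"], timeline_on_read("alice"))   # same answer, different cost profile
```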
Storage and Retrieval: How Databases Really Work Under the Hood
Chapter 4 of DDIA 2nd ed. distilled: hash indexes and their RAM constraints, SSTables with sparse indexes, LSM-trees (WAL → memtable → SSTable flush → background compaction, Bloom filters, size-tiered vs leveled strategies), B-tree internals (fixed-size pages, branching factor, WAL, copy-on-write), a structured B-tree vs LSM-tree comparison across eight dimensions, secondary index variants (clustered, covering, multidimensional, inverted, vector), in-memory databases, column-oriented storage with bitmap compression, vectorised query execution, OLAP cubes, full-text inverted indexes, and vector indexes (IVF, HNSW) for semantic search.
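The LSM write path in a toy: a mutable memtable that flushes into immutable sorted runs, with reads checking newest-first. Real engines add a WAL, Bloom filters, sparse indexes, and background compaction; this sketch only shows the shape:

```python
# An LSM-tree in miniature: memtable -> flush -> newest-first reads.
class TinyLSM:
    def __init__(self, memtable_limit=4):
        self.memtable = {}      # stands in for a sorted in-memory structure
        self.sstables = []      # immutable sorted runs, oldest first
        self.limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.limit:
            self.sstables.append(sorted(self.memtable.items()))   # flush to a run
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.sstables):    # newest run wins on duplicates
            for k, v in run:                   # real runs use a sparse index + binary search
                if k == key:
                    return v
        return None

db = TinyLSM()
for i in range(10):
    db.put(f"k{i}", i)
print(db.get("k3"), db.get("k9"))   # one from a flushed run, one from the memtable
```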
Encoding and Evolution: How Data Survives Schema Changes
Chapter 5 of DDIA 2nd ed. distilled: backward vs forward compatibility and why rolling upgrades make both mandatory; encoding format taxonomy from language-specific serialisers to JSON/XML/CSV to binary schema-driven formats; schema evolution rules for Protocol Buffers (field tags) and Avro (writer/reader schema resolution); the four dataflow modes (databases, REST/RPC, workflow engines, message brokers); the five fundamental problems with RPC; durable execution frameworks (Temporal, Restate); event-driven architectures with message brokers and the actor model.
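Writer/reader schema resolution in a toy, inspired by Avro's rules but much simplified (real Avro also resolves types, unions, and aliases): fields added by the reader's schema get their defaults, and fields the reader doesn't know are dropped:

```python
# Old record, new reader: defaults fill missing fields, unknowns are ignored.
reader_schema = {            # v2 schema: field -> default value
    "user_id": None,
    "name": None,
    "email": "n/a",          # added in v2, so v1 records lack it
}

def resolve(record, schema):
    return {field: record.get(field, default) for field, default in schema.items()}

v1_record = {"user_id": 7, "name": "Ada"}    # written by old code
print(resolve(v1_record, reader_schema))     # email filled with its default
```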
Replication: Keeping Copies in Sync Across a Distributed System
Chapter 6 of DDIA 2nd ed. distilled: the three replication algorithms (single-leader, multi-leader, leaderless/Dynamo-style), replication log types (statement-based, WAL shipping, logical/row-based), the five consistency models (eventual, read-your-writes, monotonic reads, consistent prefix reads, linearizability), conflict detection and resolution strategies (conflict avoidance, LWW, manual/sibling merge, CRDTs, Operational Transformation), and the real-world failure modes including split brain, STONITH, and the GitHub auto-increment incident.
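Two conflict-handling extremes in a toy: last-write-wins, simple but silently discarding concurrent writes, and a grow-only counter CRDT, whose element-wise-max merge commutes and loses nothing (replica names and values are illustrative):

```python
# LWW (lossy) vs a G-counter CRDT (convergent) for replica conflicts.
def lww_merge(a, b):
    """a, b: (timestamp, value) pairs; the later timestamp silently wins."""
    return a if a[0] >= b[0] else b

def gcounter_merge(a, b):
    """a, b: per-replica count dicts; element-wise max commutes and converges."""
    return {r: max(a.get(r, 0), b.get(r, 0)) for r in a.keys() | b.keys()}

print(lww_merge((1.0, "blue"), (2.5, "green")))                 # 'blue' is lost
print(gcounter_merge({"r1": 3, "r2": 1}, {"r1": 2, "r2": 4}))   # {'r1': 3, 'r2': 4}
```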
Sharding: Splitting a Dataset Across Many Machines
Chapter 7 of DDIA 2nd ed. distilled: sharding as the horizontal scaling tool of last resort; the four sharding algorithms (key-range, hash/fixed-shards, hash-range, consistent hashing); skewed workloads and hot-key mitigation strategies; automatic vs. manual rebalancing trade-offs; the three request routing approaches (any-node/gossip, routing tier/ZooKeeper, client-side/Raft); sharding for multitenancy (resource/permission isolation, cell-based architecture, GDPR compliance); and the two secondary index strategies (local/document-partitioned vs. global/term-partitioned) with their scatter/gather vs. distributed-write cost trade-offs.
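Consistent hashing in a toy: nodes and keys hash onto a ring, a key belongs to the first node clockwise, and virtual nodes smooth the distribution (the vnode count is illustrative):

```python
# A consistent-hash ring with virtual nodes; adding a node moves few keys.
import bisect
import hashlib

def h(s: str) -> int:
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % (2**32)

class Ring:
    def __init__(self, nodes, vnodes=100):
        self.points = sorted((h(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))
        self.hashes = [p for p, _ in self.points]

    def owner(self, key):
        i = bisect.bisect(self.hashes, h(key)) % len(self.points)   # clockwise walk
        return self.points[i][1]

ring = Ring(["node-a", "node-b", "node-c"])
print({k: ring.owner(k) for k in ("user:1", "user:2", "user:3")})
```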
Transactions: The Safety Net That Makes Concurrent Databases Sane
Chapter 8 of DDIA 2nd ed. distilled: the ACID properties (with the caveat that "C" is an application concern), the full anomaly taxonomy (dirty reads/writes, lost updates, read skew, write skew, phantoms), MVCC and snapshot-isolation visibility rules, lost-update prevention strategies, and the three mechanisms for serializability: actual serial execution, two-phase locking (2PL), and serializable snapshot isolation (SSI).
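Snapshot-isolation visibility in a toy, a simplification of the MVCC rule: a row version is visible iff its creator committed before the snapshot and its deleter has not (the txids and record structure are illustrative):

```python
# MVCC visibility: committed-before-snapshot creations, minus committed deletions.
def visible(version, snapshot):
    def committed(txid):
        return (txid is not None
                and txid < snapshot["next_txid"]          # started before the snapshot
                and txid not in snapshot["in_progress"])  # and already committed
    return committed(version["created_by"]) and not committed(version["deleted_by"])

snapshot = {"next_txid": 10, "in_progress": {7}}          # tx 7 still running
row_v1 = {"created_by": 3, "deleted_by": 7}               # deletion not yet committed
row_v2 = {"created_by": 7, "deleted_by": None}            # written by the in-flight tx
print(visible(row_v1, snapshot), visible(row_v2, snapshot))   # True False
```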
The Trouble with Distributed Systems: Everything That Can Go Wrong
Chapter 9 of DDIA 2nd ed. catalogues the sources of unreliability: network faults (five distinct failure modes, all of which surface as a timeout), unreliable clocks (drift, leap seconds, clocks inside virtual machines, process pauses), the impossibility of distinguishing a network failure from a node crash, the majority-vote principle, fencing tokens for zombie prevention, and system models (synchronous/partially synchronous/asynchronous; crash-stop/crash-recovery/Byzantine).
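Fencing tokens in a toy: the lock service issues monotonically increasing tokens and storage rejects anything stale, so a paused zombie cannot clobber newer writes (the class names here are illustrative):

```python
# A zombie lock-holder is fenced off by its stale token.
class LockService:
    def __init__(self):
        self.token = 0
    def acquire(self):
        self.token += 1          # strictly increasing fencing token
        return self.token

class Storage:
    def __init__(self):
        self.max_seen, self.value = 0, None
    def write(self, token, value):
        if token < self.max_seen:
            raise PermissionError(f"stale fencing token {token}")
        self.max_seen, self.value = token, value

locks, store = LockService(), Storage()
t1 = locks.acquire()             # client 1 takes the lock, then pauses (long GC)
t2 = locks.acquire()             # lease expires; client 2 takes over
store.write(t2, "from client 2")
try:
    store.write(t1, "from the zombie")
except PermissionError as e:
    print(e)                     # rejected: token 1 < 2
```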
Consistency and Consensus: The Strongest Guarantees in Distributed Systems
Chapter 10 of DDIA 2nd ed. distilled: linearizability as a recency guarantee and its use cases (locks, uniqueness constraints, cross-channel ordering), the CAP theorem, the performance cost of linearizability (the Attiya-Welch bound), logical and hybrid logical clocks (Lamport clocks, vector clocks, HLC), linearizable ID generators, and consensus in its various formulations: single-value consensus, total order broadcast, and atomic commitment.
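Lamport clocks in a toy: increment on local events, take max(local, received) + 1 on receipt, and break ties by node id, giving a total order consistent with causality:

```python
# Lamport timestamps: (counter, node_id) pairs ordered as tuples.
class LamportClock:
    def __init__(self, node_id):
        self.node_id, self.t = node_id, 0
    def tick(self):                       # local event or message send
        self.t += 1
        return (self.t, self.node_id)
    def recv(self, stamp):                # message receipt
        self.t = max(self.t, stamp[0]) + 1
        return (self.t, self.node_id)

a, b = LamportClock("A"), LamportClock("B")
send = a.tick()                           # A acts, then sends to B
ack = b.recv(send)                        # B's stamp now exceeds A's
print(send < ack)                         # True: causal order is preserved
```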
Batch Processing: Taming Large Datasets with Bounded Jobs
Chapter 11 of DDIA 2nd ed. distilled: the batch-processing lineage from Unix tools through MapReduce to Spark/Flink; HDFS vs. object stores (S3/GCS); YARN resource management; workflow DAGs (Airflow/Dagster); MapReduce vs. dataflow engines (sorting, intermediate data, pipelining); join strategies (sort-merge, broadcast hash, partitioned hash); and batch output types (search indexes, key-value stores, ML models).
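One join strategy in a toy: the per-partition core of a sort-merge join, two cursors advancing over sorted runs (this sketch assumes unique keys on the left side, as with a users table):

```python
# Sort both sides by key, then merge with two cursors.
def sort_merge_join(left, right):
    left, right = sorted(left), sorted(right)   # (key, value) pairs
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i][0] < right[j][0]:
            i += 1
        elif left[i][0] > right[j][0]:
            j += 1
        else:
            key, lval = left[i]
            while j < len(right) and right[j][0] == key:   # emit the match group
                out.append((key, lval, right[j][1]))
                j += 1
            i += 1
    return out

users = [(1, "Ada"), (2, "Grace")]
clicks = [(1, "/home"), (1, "/docs"), (2, "/blog")]
print(sort_merge_join(users, clicks))
```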
Stream Processing: Taming Infinite Event Sequences in Real Time
Chapter 12 of DDIA 2nd ed. distilled: message brokers (traditional queues vs. log-based), Kafka's architecture (append-only log, partitions, consumer groups, offsets, the disk-backed ring buffer), change data capture and event sourcing (both solving the dual-write problem), and the core stream-processing patterns: windowing (tumbling, sliding, session), joins (stream-table, stream-stream), exactly-once semantics, and output sinks.
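One windowing flavour in a toy: tumbling-window counts keyed by floor(event_time / window). Real engines pair this with watermarks to decide when a window can safely close; the window size here is illustrative:

```python
# Tumbling windows: bucket events into fixed, non-overlapping intervals.
from collections import Counter

WINDOW_S = 60

def tumbling_counts(events):
    """events: iterable of (event_time_unix, key) pairs."""
    counts = Counter()
    for ts, key in events:
        window_start = int(ts // WINDOW_S) * WINDOW_S
        counts[(window_start, key)] += 1
    return counts

events = [(0, "page:/home"), (30, "page:/home"), (65, "page:/docs"), (70, "page:/home")]
for (start, key), n in sorted(tumbling_counts(events).items()):
    print(f"[{start}, {start + WINDOW_S}) {key}: {n}")
```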
A Philosophy of Streaming Systems: Putting the Pieces Together
Chapter 13 of DDIA 2nd ed. distilled: data-integration challenges (no single tool fits every workload; the dual-write problem), derived data as an alternative to distributed transactions, the limits of total ordering, lambda vs. kappa architectures (and why kappa won), unbundling the database (federated reads via Trino, unified writes via CDC), loose coupling through event logs, and designing applications as dataflow graphs over immutable state.
Private AI in the Classroom: WebAssembly Inference on the Edge
Exploring the architecture of privacy-preserving educational AI, using a WASM-compiled inference runtime to run GGUF models on-device for zero-network-payload inference.
Why Serverless Beats Your Cloud ETL Pipeline by 81%
A total-cost-of-ownership comparison between reserved-instance cloud ETL workloads and a deterministic serverless edge model.