The World's Simplest Database
The chapter opens with a two-function bash database: db_set appends a key,value line to a flat file; db_get greps for the key and takes the last match. Write performance is excellent, because a sequential append is the fastest thing you can ask a disk to do. The problem is read performance: an O(n) full-file scan per lookup, so with 10 GB of data every read scans 10 GB. The core insight is that write efficiency and read efficiency are in fundamental tension, and every index design is a negotiated settlement between them.
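A minimal Python sketch of the same idea (the book's functions are written in bash; the file name here is an arbitrary stand-in):

```python
# The world's simplest database: an append-only file plus a full-file scan.
DB_FILE = "database.txt"  # arbitrary file name for this sketch

def db_set(key, value):
    # Writes are fast: a single sequential append.
    with open(DB_FILE, "a") as f:
        f.write(f"{key},{value}\n")

def db_get(key):
    # Reads are O(n): scan every line and keep the most recent value.
    result = None
    try:
        with open(DB_FILE) as f:
            for line in f:
                k, _, v = line.rstrip("\n").partition(",")
                if k == key:
                    result = v
    except FileNotFoundError:
        pass
    return result

db_set("42", "San Francisco")
db_set("42", "SF")
print(db_get("42"))  # "SF": the last appended value wins
```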
Key principle: An index is an additional data structure derived from the primary data. Maintaining it adds overhead to every write. Choosing what to index means choosing which queries are fast at the cost of slower writes. Databases generally do not index everything by default; the application developer makes this call.
Hash Indexes: Fast Lookups, RAM Constraints
Bitcask (Riak's storage engine) maintains an in-memory hash map from key → byte offset in the log file. Reads become a memory lookup followed by a single disk seek. Writes remain sequential appends. Compaction of log segments reclaims space from overwritten or deleted keys. The core limitations: the entire key set must fit in RAM (there is no graceful overflow to disk), range queries are not supported, and crash recovery requires scanning segments to rebuild the hash map (mitigated by snapshotting the map to disk). Best fit: high write rates over a bounded key set where range queries are not required.
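A sketch of the idea under simplifying assumptions (a single segment, no compaction, string values without commas):

```python
import os

LOG_FILE = "log.db"   # single append-only segment for this sketch
index = {}            # in-memory hash map: key -> byte offset of the latest record

def put(key, value):
    with open(LOG_FILE, "ab") as f:
        f.seek(0, os.SEEK_END)
        offset = f.tell()                      # where this record will start
        f.write(f"{key},{value}\n".encode())
    index[key] = offset                        # the index points at the newest copy

def get(key):
    offset = index.get(key)
    if offset is None:
        return None
    with open(LOG_FILE, "rb") as f:
        f.seek(offset)                         # one seek, one read
        line = f.readline().decode().rstrip("\n")
        return line.split(",", 1)[1]

put("city", "Berlin")
put("city", "Munich")
print(get("city"))   # "Munich", found via the offset without scanning the log
```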
SSTables and LSM-Trees: Write-Optimised Storage
SSTable (Sorted String Table) improves on the unsorted hash-indexed log by maintaining keys in sorted order within each segment file. A sparse in-memory index (one entry per compressed block) replaces the full hash map, giving a far lower RAM requirement. Range queries become sequential scans from a known offset. Merging segments uses an O(n) merge-sort procedure, efficient even for large datasets. LSM-tree (Log-Structured Merge-Tree) is the architecture that ties these together: writes go to a WAL, then to an in-memory sorted structure (the memtable, typically a red-black tree or skip list), which flushes to immutable SSTable segments on disk, which are periodically compacted in the background.
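A sketch of the write and read path under simplifying assumptions (one flushed segment, no WAL, no compaction; the file name and SPARSE_EVERY constant are arbitrary):

```python
import bisect

SPARSE_EVERY = 100                     # index one key out of every 100 written
SEGMENT_FILE = "segment-0.sst"         # single SSTable segment for this sketch
memtable = {}                          # stand-in for a red-black tree / skip list
sparse_keys, sparse_offsets = [], []   # parallel lists, sorted by key

def write(key, value):
    memtable[key] = value              # a real engine also appends to a WAL here

def flush_memtable():
    # Write keys in sorted order and record a sparse index entry every N keys.
    with open(SEGMENT_FILE, "w") as f:
        for i, key in enumerate(sorted(memtable)):
            if i % SPARSE_EVERY == 0:
                sparse_keys.append(key)
                sparse_offsets.append(f.tell())
            f.write(f"{key},{memtable[key]}\n")
    memtable.clear()

def read(key):
    if key in memtable:                          # newest data wins
        return memtable[key]
    i = bisect.bisect_right(sparse_keys, key) - 1
    if i < 0:
        return None                              # key precedes the whole segment
    with open(SEGMENT_FILE) as f:
        f.seek(sparse_offsets[i])                # jump to the nearest indexed key
        for line in f:                           # short scan within one "block"
            k, _, v = line.rstrip("\n").partition(",")
            if k == key:
                return v
            if k > key:
                return None                      # keys are sorted, so stop early
    return None
```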
B-Trees: The Dominant Index Structure for 50 Years
B-trees organise data as a tree of fixed-size pages (4–16 KiB). Each page contains up to n keys and n+1 pointers to child pages (the branching factor, typically ~500). Leaf pages contain actual values or heap pointers; internal pages contain only keys and pointers. Tree depth stays at O(log n): a four-level tree with branching factor 500 can index 500⁴ ≈ 62.5 billion keys. Writes modify pages in place; a WAL (write-ahead log) ensures atomicity for multi-page operations (e.g., page splits). An alternative: copy-on-write B-trees (used by LMDB) write modified pages to new locations rather than overwriting, providing atomic snapshots without a WAL at the cost of more write amplification.
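A lookup sketch (insertion and page splits omitted): the cost of a read is one binary search per level, so the number of page reads equals the tree depth.

```python
import bisect

class Page:
    def __init__(self, keys, children=None, values=None):
        self.keys = keys            # sorted keys (separators in internal pages)
        self.children = children    # internal page: len(keys) + 1 child pages
        self.values = values        # leaf page: one value per key

def btree_get(page, key):
    while page.children is not None:             # descend through internal pages
        i = bisect.bisect_right(page.keys, key)  # choose the child subtree
        page = page.children[i]
    i = bisect.bisect_left(page.keys, key)       # binary search within the leaf
    if i < len(page.keys) and page.keys[i] == key:
        return page.values[i]
    return None

# Two-level example: the root separates the key space at 100.
leaf_a = Page(keys=[10, 50], values=["a", "b"])
leaf_b = Page(keys=[100, 200], values=["c", "d"])
root = Page(keys=[100], children=[leaf_a, leaf_b])
print(btree_get(root, 200))   # "d", reached with two page reads
```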
Crash safety in B-trees: A page split (one page overflowing into two) requires writing at least three pages: the two new pages and the parent page containing the updated pointers. If the process crashes mid-split, the tree is in an inconsistent state. The WAL, an append-only file written before any page is modified, allows the database to replay or undo partial operations after a restart. This is why every major relational database writes to both a WAL and its data files on every write.
B-Tree vs LSM-Tree: An Eight-Dimension Comparison
The choice between B-tree and LSM-tree is a multi-dimensional trade-off. Neither dominates on all axes. B-trees provide more predictable read latency and better lock-based transaction isolation (a key range can be locked on a single tree node). LSM-trees offer higher write throughput, lower write amplification, and lower disk fragmentation; this is particularly advantageous for write-heavy workloads, flash storage, and systems where compaction can be tuned independently.
| Dimension | LSM-Tree | B-Tree |
|---|---|---|
| Write pattern | Append-only sequential writes to WAL + memtable | In-place random overwrites to fixed-size pages |
| Read performance | Must check memtable + multiple SSTable levels | O(log n) page traversal: predictable, single path |
| Range queries | Must scan and merge across memtable + SSTable levels | Natural sequential scan along sorted leaf pages |
| Write amplification | Lower: one sequential write path; compaction amortised | Higher: WAL + in-place page writes (2–3× amplification) |
| Disk fragmentation | Low: compaction rewrites clean, sequentially packed files | Grows over time: deleted/updated pages leave gaps |
| Crash recovery | WAL replay + memtable reconstruction on restart | WAL (redo log) must be replayed to fix partial writes |
| Best use case | Write-heavy workloads: event logs, IoT, time-series ingestion | Read-heavy OLTP: banking, e-commerce, general-purpose SQL |
| Examples | RocksDB, Cassandra, ScyllaDB, HBase, LevelDB | PostgreSQL, MySQL InnoDB, SQLite, Oracle |
Beyond Primary Keys: Secondary Indexes and Index Variants
Secondary indexes are non-unique: many rows may share the same secondary key value. The value stored in the index can be either a reference to the row's location (a heap-file pointer or the primary key), in which case each lookup needs a second read from the heap, or the full row data itself (a clustered index: faster reads, but only one clustered index per table, and a heavier write cost). A covering index stores some additional columns inside the leaf nodes, enough to answer specific queries entirely from the index without touching the heap. This "index-only scan" eliminates an entire I/O round-trip for frequent query patterns.
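A toy illustration of the difference, using a hypothetical users table with a secondary index on country (real databases keep these structures inside B-tree leaves rather than Python dicts):

```python
# Primary storage ("heap"): row id -> full row
heap = {
    1: {"id": 1, "country": "NZ", "email": "a@example.com"},
    2: {"id": 2, "country": "NZ", "email": "b@example.com"},
    3: {"id": 3, "country": "DE", "email": "c@example.com"},
}

# Plain secondary index: country -> row ids (non-unique key, many rows per value)
idx_country = {"NZ": [1, 2], "DE": [3]}

# Covering index: the email column is carried inside the index entries,
# so an "emails by country" query never has to touch the heap at all.
idx_country_email = {"NZ": [(1, "a@example.com"), (2, "b@example.com")],
                     "DE": [(3, "c@example.com")]}

def emails_in(country):
    # Plain index: one extra heap lookup per matching row
    return [heap[row_id]["email"] for row_id in idx_country.get(country, [])]

def emails_in_covering(country):
    # Index-only scan: answered entirely from the index
    return [email for _, email in idx_country_email.get(country, [])]

print(emails_in("NZ"), emails_in_covering("NZ"))
```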
| Index Type | Mechanism | Strengths | Limitations | Used By |
|---|---|---|---|---|
| Hash index | In-memory key → byte offset map | O(1) exact-key lookups | Must fit in RAM; no range queries | Bitcask (Riak) |
| Sparse index (SSTable) | One entry per compressed block of sorted keys | Compressed blocks, supports range scans | Slightly slower than dense for exact keys | RocksDB, Cassandra, HBase |
| B-tree | Fixed-size pages, branching factor ~500, O(log n) depth | Fast reads, range queries, well-understood | Write amplification, fragmentation over time | PostgreSQL, MySQL InnoDB, SQLite |
| Clustered index | Row data stored directly within the index structure | Fastest reads by primary key β no heap lookup | Extra space; only one per table | MySQL InnoDB (primary key), SQL Server |
| Covering / included-cols index | Some column values stored inside the index leaf | Can answer certain queries from index alone | Duplication slows writes; uses more disk | PostgreSQL, MySQL, SQL Server |
| Multidimensional (R-tree) | Hierarchical spatial partitioning that groups nearby points | Two-dimensional range queries (lat/lng, color/size) | Not designed for high-dimensional vectors | PostGIS, Oracle Spatial |
| Inverted index | Term → postings list (doc IDs containing that term) | Full-text keyword search, fuzzy matching | Index rebuild cost; language-specific preprocessing | Lucene / Elasticsearch / Solr, PostgreSQL GIN |
| Vector index (HNSW/IVF) | Hierarchical graph or centroid-partitioned space | Approximate nearest-neighbour semantic search | Approximate results; memory-intensive | pgvector, Pinecone, Weaviate, Qdrant |
In-Memory Databases: Speed Through Encoding Elimination
The performance advantage of in-memory databases is not primarily about avoiding disk reads: the OS page cache already serves hot data from RAM for disk-based databases. The advantage is avoiding the overhead of encoding in-memory data structures into a disk-serializable format. In-memory databases can expose data structures that are difficult to implement efficiently with disk-based indexes: Redis priority queues, sorted sets, HyperLogLog counters. Durability options include battery-backed DRAM, periodic snapshots, append-only change logs (RAMCloud's approach), and async replication. VoltDB, SingleStore, and Oracle TimesTen offer relational models in memory; Redis and Memcached offer key-value stores with varying durability guarantees.
Column-Oriented Storage: Analytics at Petabyte Scale
Column-oriented (columnar) storage stores all values of each column contiguously rather than all columns of each row. A query accessing 4 of 100 columns only reads and decompresses those 4 column files. Columnar storage also compresses dramatically better: a single column contains homogeneous values of one type, and columns in fact tables often have low cardinality (e.g., a product_sk column in a billion-row sales table may have only 100,000 distinct values). Bitmap encoding creates one bitmap per distinct value; run-length encoding compresses runs of repeated bits. Columnar storage is used in Snowflake, BigQuery, Amazon Redshift, DuckDB, Apache Parquet, Apache Arrow, ORC, InfluxDB IOx, TimescaleDB, and others.
| Dimension | Row-Oriented (OLTP) | Column-Oriented (OLAP) |
|---|---|---|
| Storage layout | All columns of a row stored together on disk | All values of each column stored together |
| Query pattern | SELECT *: every column needed, few rows | SELECT 3 cols, scan millions of rows (analytics) |
| I/O efficiency | Loads full 100-column row to read 3 columns | Loads only the 3 required columns |
| Compression | Mixed types per row: poor compression ratio | Homogeneous values per column: excellent (bitmap + RLE) |
| Write performance | Efficient in-place row updates | Bulk ETL writes; individual row writes are expensive |
| Best for | OLTP (banking, e-commerce, general-purpose SQL) | OLAP (data warehouse, BI dashboards, analytics) |
| Examples | PostgreSQL, MySQL, MongoDB | Snowflake, BigQuery, Redshift, DuckDB, Parquet, Arrow |
Bitmap compression example: A product_sk column with 100,000 distinct values in a billion-row table can be encoded as 100,000 bitmaps, each with one bit per row. The bitmap for product_sk = 69 has a 1 at every position where that product was sold, and 0 elsewhere. These bitmaps are typically sparse, so they are run-length encoded: runs of zeros and ones alternate, starting with zeros, so "0, 4, 12, 2" means no leading zeros, then 4 ones, 12 zeros, and 2 ones. A WHERE product_sk IN (31, 68, 69) query loads three bitmaps and computes their bitwise OR, which a CPU can do a machine word (64 rows) at a time. Queries with AND predicates compute a bitwise AND. Both operations work directly on the compressed data without decoding it.
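A sketch of both ideas, using Python integers as bitmaps and a six-row toy column (real engines pack bits into words and use vectorised instructions):

```python
from collections import defaultdict

column = [69, 69, 31, 30, 69, 31]          # product_sk values, one per row

# Bitmap encoding: one bitmap per distinct value; bit i is set if row i has it.
bitmaps = defaultdict(int)
for row, value in enumerate(column):
    bitmaps[value] |= 1 << row

def rows_where_in(values):
    # WHERE product_sk IN (...) is the bitwise OR of the selected bitmaps.
    combined = 0
    for v in values:
        combined |= bitmaps.get(v, 0)
    return [row for row in range(len(column)) if combined >> row & 1]

print(rows_where_in([31, 68, 69]))         # [0, 1, 2, 4, 5]; row 3 (sk=30) excluded

def run_length_encode(bitmap, n_rows):
    # Alternating run lengths of zeros and ones, starting with zeros.
    runs, current, count = [], 0, 0
    for row in range(n_rows):
        bit = bitmap >> row & 1
        if bit == current:
            count += 1
        else:
            runs.append(count)
            current, count = bit, 1
    runs.append(count)
    return runs

print(run_length_encode(bitmaps[69], len(column)))
# [0, 2, 2, 1, 1]: no zeros, 2 ones, 2 zeros, 1 one, 1 zero
```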
Query Execution: Compilation vs Vectorisation
Two approaches optimise query execution over columnar data. Query compilation (used by HyPer and other LLVM-based systems): the query engine translates the SQL into native machine code via LLVM or similar, compiles it, and executes it directly on column-encoded data in memory. Overhead: compilation latency at query start. Upside: maximum CPU efficiency for long-running queries. Vectorised processing (used by DuckDB, Apache Arrow): the engine interprets the query but processes a batch of values at once rather than one row at a time. A fixed set of operators is built into the engine; the query passes arguments (e.g., a column and a constant) and receives a result batch. Both approaches exploit sequential memory access (CPU cache efficiency), tight inner loops (minimal branch mispredictions), SIMD instructions, and the ability to operate directly on compressed data.
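A rough sketch of the vectorised half of this contrast, using NumPy arrays as stand-ins for column batches (the column names and data are made up):

```python
import numpy as np

# Two column batches from a hypothetical fact table
product_sk = np.array([31, 68, 69, 69, 31, 30] * 1000)
net_price = np.random.default_rng(0).integers(1, 100, size=product_sk.size)

# Row-at-a-time interpretation: one loop iteration (and branch) per row
total_slow = 0
for sk, price in zip(product_sk, net_price):
    if sk == 69:
        total_slow += price

# Vectorised execution: the comparison and the sum each run as one tight
# loop over the whole batch (here inside NumPy's compiled kernels)
mask = product_sk == 69               # boolean vector, one element per row
total_fast = net_price[mask].sum()

assert total_slow == total_fast
```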
Materialized Views and OLAP Cubes
A materialized view is an actual on-disk copy of a view's query result, unlike a virtual view, which re-executes its defining query on every read. When the underlying data changes, the materialized view must be updated (some databases do this automatically; others use external systems like Materialize). A data cube (OLAP cube) is a grid of pre-aggregated values grouped by combinations of dimensions. For a two-dimensional cube over date × product, each cell holds the SUM of net_price for that date-product combination. Totals along each axis allow O(1) lookups for aggregated metrics. Limitation: a cube cannot answer queries along dimensions it did not pre-aggregate (e.g., filtering by price range when price is not a cube dimension). Most warehouses keep raw columnar data and use cubes only as a performance optimisation for the most common aggregations.
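A sketch of a two-dimensional cube built from a handful of hypothetical fact rows:

```python
from collections import defaultdict

facts = [                              # (date, product_sk, net_price)
    ("2024-01-01", 31, 5.0),
    ("2024-01-01", 69, 2.5),
    ("2024-01-02", 31, 7.0),
    ("2024-01-02", 69, 1.0),
]

cube = defaultdict(float)              # (date, product_sk) -> SUM(net_price)
by_date = defaultdict(float)           # totals along the date axis
by_product = defaultdict(float)        # totals along the product axis

for date, product, price in facts:
    cube[(date, product)] += price
    by_date[date] += price
    by_product[product] += price

print(cube[("2024-01-02", 31)])        # 7.0: one pre-aggregated cell
print(by_date["2024-01-01"])           # 7.5: O(1) lookup of an axis total
# A query this cube cannot answer: SUM(net_price) WHERE net_price > 5,
# because price is not one of the pre-aggregated dimensions.
```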
Vector Indexes: From Keywords to Semantics
Vector embeddings translate documents (or images, audio) into high-dimensional floating-point vectors via neural network embedding models (Word2Vec, BERT, GPT, multimodal models). Vectors near each other in the embedding space are semantically similar. A vector index stores these embeddings and supports nearest-neighbour queries: given a query vector, find the k stored vectors with the highest cosine similarity (or smallest Euclidean distance). Three main index types: flat indexes do a brute-force comparison, exact but O(n) per query; IVF (Inverted File) indexes cluster vectors into centroid partitions and compare the query only against vectors in nearby clusters (approximate, with an accuracy-vs-speed trade-off controlled by the probe count); HNSW (Hierarchical Navigable Small World) builds a layered graph where each layer is progressively denser, and queries start at the sparse top layer and descend, following edges to nearer vectors. HNSW is approximate but typically provides the best accuracy/speed trade-off. Used by pgvector, Pinecone, Weaviate, Qdrant, and Meta's FAISS library.
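A sketch of flat search and an IVF-style shortcut over random vectors, using dot-product similarity (the cluster and probe counts are arbitrary; a real IVF index trains its centroids with k-means rather than picking them at random):

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.normal(size=(5_000, 64)).astype(np.float32)   # stored embeddings
query = rng.normal(size=64).astype(np.float32)

def flat_top_k(q, k=5):
    # Exact but O(n): score every stored vector against the query
    sims = vectors @ q
    return np.argsort(-sims)[:k]

# "Training" for the IVF sketch: pick 32 random vectors as centroids and
# assign every stored vector to its nearest centroid (its partition).
centroids = vectors[rng.choice(len(vectors), size=32, replace=False)]
assignments = np.argmin(
    np.linalg.norm(vectors[:, None, :] - centroids[None, :, :], axis=2), axis=1)

def ivf_top_k(q, k=5, n_probe=4):
    # Approximate: only scan vectors in the n_probe partitions nearest the query
    probe = np.argsort(np.linalg.norm(centroids - q, axis=1))[:n_probe]
    candidates = np.where(np.isin(assignments, probe))[0]
    sims = vectors[candidates] @ q
    return candidates[np.argsort(-sims)[:k]]

print(flat_top_k(query))      # exact top-5 neighbours
print(ivf_top_k(query))       # approximate top-5, scanning far fewer vectors
```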
Full-text search and vector search are structurally related: a full-text inverted index treats each term as a dimension, and a document containing term x has value 1 in dimension x. Finding documents containing both "red" and "apples" is a query for a 1 in both the red dimension and the apples dimension, computed as an intersection of the two postings lists (a bitwise AND when those lists are stored as bitmaps). Lucene (used by Elasticsearch and Solr) stores terms in SSTable-like sorted files, merged in the background using the same log-structured approach as RocksDB. Vector search generalises this: each dimension is a continuous floating-point value, not a binary term membership. The index structure changes from a sorted term → postings-list map to a nearest-neighbour graph, but the conceptual goal, efficiently finding similar documents, is the same.
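A sketch of the inverted-index half: term → postings, with an AND query computed as a postings intersection (the documents and tokenisation are deliberately trivial):

```python
from collections import defaultdict

docs = {
    0: "red apples and green pears",
    1: "green apples",
    2: "red wine",
}

postings = defaultdict(set)            # term -> ids of documents containing it
for doc_id, text in docs.items():
    for term in text.split():
        postings[term].add(doc_id)

def search_all(*terms):
    # Documents containing every term: intersect the postings lists
    # (the bitwise AND of the corresponding bitmaps, if stored that way)
    result = None
    for term in terms:
        ids = postings.get(term, set())
        result = ids if result is None else result & ids
    return sorted(result or [])

print(search_all("red", "apples"))     # [0]
```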
Six Mental Models Worth Keeping
Indexes make reads faster at the cost of making every write slower. An index must be updated on every INSERT, UPDATE, and DELETE. A table with 10 indexes pays 10 index-update costs per write. The right number of indexes is the minimum set that makes your query patterns acceptably fast.
Random disk I/O is 10–100× slower than sequential. The LSM-tree's fundamental insight: batch and sort writes in memory, then flush sequentially to disk. This is why Cassandra and RocksDB can saturate SSD write throughput in a way B-trees cannot.
A data warehouse query touching 4 of 100 columns with row storage must load 96 unnecessary columns for every row scanned. Column storage loads only what the query needs. For petabyte-scale tables, this is the difference between a 3-second query and a 3-hour one.
The OS already caches hot disk blocks in RAM, so a disk-based database rarely touches disk for frequently-read data. In-memory databases win because they avoid the overhead of encoding data structures into a form that can be written to disk; they operate directly on native in-memory representations like linked lists, priority queues, and hash maps.
OLTP: high volume of small reads/writes, typically accessing a single record by primary key or secondary index. OLAP: low volume of complex queries, each scanning millions of rows and aggregating across many columns. These patterns are so different that modern systems specialise: PostgreSQL for OLTP, Snowflake/DuckDB for OLAP.
Full-text search treats each word as a dimension. Vector indexes generalise this to 1,000+ floating-point dimensions, where nearby vectors represent semantically similar documents. HNSW indexes navigate a hierarchical graph of the vector space, playing a role similar to R-trees for geospatial data, but in embedding space.
Part of the DDIA Series
This article covers Chapter 4 of Designing Data-Intensive Applications (2nd ed.) by Martin Kleppmann. The series distils each chapter into audience-aware reading notes across three levels: conceptual overview (Layman), architectural context (Reviewer), and engineering detail (Engineer).