The World's Simplest Database
The chapter opens with a two-function bash database: db_set appends a key,value line to a flat file; db_get greps for the key and takes the last match. Write performance is excellent, because a sequential append is the fastest thing you can ask a disk to do. The problem is read performance: an O(n) full-file scan per lookup, so with 10 GB of data every read scans 10 GB. The core insight is that write efficiency and read efficiency are in fundamental tension, and every index design is a negotiated settlement between them.
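A minimal Python sketch of the same idea (the book's functions are written in bash; the file name here is an arbitrary stand-in):

```python
# The world's simplest database: an append-only file plus a full-file scan.
DB_FILE = "database.txt"  # arbitrary file name for this sketch

def db_set(key, value):
    # Writes are fast: a single sequential append.
    with open(DB_FILE, "a") as f:
        f.write(f"{key},{value}\n")

def db_get(key):
    # Reads are O(n): scan every line and keep the most recent value.
    result = None
    try:
        with open(DB_FILE) as f:
            for line in f:
                k, _, v = line.rstrip("\n").partition(",")
                if k == key:
                    result = v
    except FileNotFoundError:
        pass
    return result

db_set("42", "San Francisco")
db_set("42", "SF")
print(db_get("42"))  # "SF": the last appended value wins
```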
Key principle: An index is an additional data structure derived from the primary data. Maintaining it adds overhead to every write. Choosing what to index means choosing which queries are fast at the cost of slower writes. Databases generally do not index everything by default; the application developer makes this call.
Hash Indexes: Fast Lookups, RAM Constraints
Bitcask (Riak's storage engine) maintains an in-memory hash map from key → byte offset in the log file. Reads become a memory lookup followed by a single disk seek. Writes remain sequential appends. Compaction of log segments reclaims space from overwritten or deleted keys. The core limitations: the entire key set must fit in RAM (there is no graceful overflow to disk), range queries are not supported, and crash recovery requires scanning segments to rebuild the hash map (mitigated by snapshotting the map to disk). Best fit: high write rates over a bounded key set where range queries are not required.
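A sketch of the idea under simplifying assumptions (a single segment, no compaction, string values without commas):

```python
import os

LOG_FILE = "log.db"   # single append-only segment for this sketch
index = {}            # in-memory hash map: key -> byte offset of the latest record

def put(key, value):
    with open(LOG_FILE, "ab") as f:
        f.seek(0, os.SEEK_END)
        offset = f.tell()                      # where this record will start
        f.write(f"{key},{value}\n".encode())
    index[key] = offset                        # the index points at the newest copy

def get(key):
    offset = index.get(key)
    if offset is None:
        return None
    with open(LOG_FILE, "rb") as f:
        f.seek(offset)                         # one seek, one read
        line = f.readline().decode().rstrip("\n")
        return line.split(",", 1)[1]

put("city", "Berlin")
put("city", "Munich")
print(get("city"))   # "Munich", found via the offset without scanning the log
```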
SSTables and LSM-Trees: Write-Optimised Storage
SSTable (Sorted String Table) improves on the unsorted hash-indexed log by maintaining keys in sorted order within each segment file. A sparse in-memory index (one entry per compressed block) replaces the full hash map, giving a far lower RAM requirement. Range queries become sequential scans from a known offset. Merging segments uses an O(n) merge-sort procedure, efficient even for large datasets. LSM-tree (Log-Structured Merge-Tree) is the architecture that ties these together: writes go to a WAL, then to an in-memory sorted structure (the memtable, typically a red-black tree or skip list), which flushes to immutable SSTable segments on disk, which are periodically compacted in the background.
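A sketch of the write and read path under simplifying assumptions (one flushed segment, no WAL, no compaction; the file name and SPARSE_EVERY constant are arbitrary):

```python
import bisect

SPARSE_EVERY = 100                     # index one key out of every 100 written
SEGMENT_FILE = "segment-0.sst"         # single SSTable segment for this sketch
memtable = {}                          # stand-in for a red-black tree / skip list
sparse_keys, sparse_offsets = [], []   # parallel lists, sorted by key

def write(key, value):
    memtable[key] = value              # a real engine also appends to a WAL here

def flush_memtable():
    # Write keys in sorted order and record a sparse index entry every N keys.
    with open(SEGMENT_FILE, "w") as f:
        for i, key in enumerate(sorted(memtable)):
            if i % SPARSE_EVERY == 0:
                sparse_keys.append(key)
                sparse_offsets.append(f.tell())
            f.write(f"{key},{memtable[key]}\n")
    memtable.clear()

def read(key):
    if key in memtable:                          # newest data wins
        return memtable[key]
    i = bisect.bisect_right(sparse_keys, key) - 1
    if i < 0:
        return None                              # key precedes the whole segment
    with open(SEGMENT_FILE) as f:
        f.seek(sparse_offsets[i])                # jump to the nearest indexed key
        for line in f:                           # short scan within one "block"
            k, _, v = line.rstrip("\n").partition(",")
            if k == key:
                return v
            if k > key:
                return None                      # keys are sorted, so stop early
    return None
```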
B-Trees: The Dominant Index Structure for 50 Years
B-trees organise data as a tree of fixed-size pages (4–16 KiB). Each page contains up to n keys and n+1 pointers to child pages (the branching factor, typically ~500). Leaf pages contain actual values or heap pointers; internal pages contain only keys and pointers. Tree depth stays at O(log n): a four-level tree with branching factor 500 can index 500⁴ ≈ 62.5 billion keys. Writes modify pages in place; a WAL (write-ahead log) ensures atomicity for multi-page operations (e.g., page splits). An alternative: copy-on-write B-trees (used by LMDB) write modified pages to new locations rather than overwriting, providing atomic snapshots without a WAL at the cost of more write amplification.
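A lookup sketch (insertion and page splits omitted): the cost of a read is one binary search per level, so the number of page reads equals the tree depth.

```python
import bisect

class Page:
    def __init__(self, keys, children=None, values=None):
        self.keys = keys            # sorted keys (separators in internal pages)
        self.children = children    # internal page: len(keys) + 1 child pages
        self.values = values        # leaf page: one value per key

def btree_get(page, key):
    while page.children is not None:             # descend through internal pages
        i = bisect.bisect_right(page.keys, key)  # choose the child subtree
        page = page.children[i]
    i = bisect.bisect_left(page.keys, key)       # binary search within the leaf
    if i < len(page.keys) and page.keys[i] == key:
        return page.values[i]
    return None

# Two-level example: the root separates the key space at 100.
leaf_a = Page(keys=[10, 50], values=["a", "b"])
leaf_b = Page(keys=[100, 200], values=["c", "d"])
root = Page(keys=[100], children=[leaf_a, leaf_b])
print(btree_get(root, 200))   # "d", reached with two page reads
```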
Crash safety in B-trees: A page split (one page overflowing into two) requires writing at least three pages: the two new pages and the parent page containing the updated pointers. If the process crashes mid-split, the tree is in an inconsistent state. The WAL, an append-only file written before any page is modified, allows the database to replay or undo partial operations after a restart. This is why every major relational database writes to both a WAL and its data files on every write.
B-Tree vs LSM-Tree: An Eight-Dimension Comparison
The choice between B-tree and LSM-tree is a multi-dimensional trade-off. Neither dominates on all axes. B-trees provide more predictable read latency and better lock-based transaction isolation (a key range can be locked on a single tree node). LSM-trees offer higher write throughput, lower write amplification, and lower disk fragmentation; this is particularly advantageous for write-heavy workloads, flash storage, and systems where compaction can be tuned independently.
| Dimension | LSM-Tree | B-Tree |
|---|---|---|
| Write pattern | Append-only sequential writes to WAL + memtable | In-place random overwrites to fixed-size pages |
| Read performance | Must check memtable + multiple SSTable levels | O(log n) page traversal: predictable, single path |
| Range queries | Must scan and merge across memtable + SSTable levels | Natural sequential scan along sorted leaf pages |
| Write amplification | Lower: one sequential write path; compaction amortised | Higher: WAL + in-place page writes (2–3× amplification) |
| Disk fragmentation | Low: compaction rewrites clean, sequentially packed files | Grows over time: deleted/updated pages leave gaps |
| Crash recovery | WAL replay + memtable reconstruction on restart | WAL (redo log) must be replayed to fix partial writes |
| Best use case | Write-heavy workloads: event logs, IoT, time-series ingestion | Read-heavy OLTP: banking, e-commerce, general-purpose SQL |
| Examples | RocksDB, Cassandra, ScyllaDB, HBase, LevelDB | PostgreSQL, MySQL InnoDB, SQLite, Oracle |
Beyond Primary Keys: Secondary Indexes and Index Variants
Secondary indexes are non-unique: many rows may share the same secondary key value. The value stored in the index can be either a reference to the row's location (a heap-file pointer or the primary key), in which case each lookup needs a second read from the heap, or the full row data itself (a clustered index: faster reads, but only one clustered index per table, and a heavier write cost). A covering index stores some additional columns inside the leaf nodes, enough to answer specific queries entirely from the index without touching the heap. This "index-only scan" eliminates an entire I/O round-trip for frequent query patterns.
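A toy illustration of the difference, using a hypothetical users table with a secondary index on country (real databases keep these structures inside B-tree leaves rather than Python dicts):

```python
# Primary storage ("heap"): row id -> full row
heap = {
    1: {"id": 1, "country": "NZ", "email": "a@example.com"},
    2: {"id": 2, "country": "NZ", "email": "b@example.com"},
    3: {"id": 3, "country": "DE", "email": "c@example.com"},
}

# Plain secondary index: country -> row ids (non-unique key, many rows per value)
idx_country = {"NZ": [1, 2], "DE": [3]}

# Covering index: the email column is carried inside the index entries,
# so an "emails by country" query never has to touch the heap at all.
idx_country_email = {"NZ": [(1, "a@example.com"), (2, "b@example.com")],
                     "DE": [(3, "c@example.com")]}

def emails_in(country):
    # Plain index: one extra heap lookup per matching row
    return [heap[row_id]["email"] for row_id in idx_country.get(country, [])]

def emails_in_covering(country):
    # Index-only scan: answered entirely from the index
    return [email for _, email in idx_country_email.get(country, [])]

print(emails_in("NZ"), emails_in_covering("NZ"))
```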
| Index Type | Mechanism | Strengths | Limitations | Used By |
|---|---|---|---|---|
| Hash index | In-memory key → byte offset map | O(1) exact-key lookups | Must fit in RAM; no range queries | Bitcask (Riak) |
| Sparse index (SSTable) | One entry per compressed block of sorted keys | Compressed blocks, supports range scans | Slightly slower than dense for exact keys | RocksDB, Cassandra, HBase |
| B-tree | Fixed-size pages, branching factor ~500, O(log n) depth | Fast reads, range queries, well-understood | Write amplification, fragmentation over time | PostgreSQL, MySQL InnoDB, SQLite |
| Clustered index | Row data stored directly within the index structure | Fastest reads by primary key β no heap lookup | Extra space; only one per table | MySQL InnoDB (primary key), SQL Server |
| Covering / included-cols index | Some column values stored inside the index leaf | Can answer certain queries from index alone | Duplication slows writes; uses more disk | PostgreSQL, MySQL, SQL Server |
| Multidimensional (R-tree) | Hierarchical spatial partitioning that groups nearby points | Two-dimensional range queries (lat/lng, color/size) | Not designed for high-dimensional vectors | PostGIS, Oracle Spatial |
| Inverted index | Term → postings list (doc IDs containing that term) | Full-text keyword search, fuzzy matching | Index rebuild cost; language-specific preprocessing | Lucene / Elasticsearch / Solr, PostgreSQL GIN |
| Vector index (HNSW/IVF) | Hierarchical graph or centroid-partitioned space | Approximate nearest-neighbour semantic search | Approximate results; memory-intensive | pgvector, Pinecone, Weaviate, Qdrant |
In-Memory Databases: Speed Through Encoding Elimination
The performance advantage of in-memory databases is not primarily about avoiding disk reads: the OS page cache already serves hot data from RAM for disk-based databases. The advantage is avoiding the overhead of encoding in-memory data structures into a disk-serializable format. In-memory databases can expose data structures that are difficult to implement efficiently with disk-based indexes: Redis priority queues, sorted sets, HyperLogLog counters. Durability options include battery-backed DRAM, periodic snapshots, append-only change logs (RAMCloud's approach), and async replication. VoltDB, SingleStore, and Oracle TimesTen offer relational models in memory; Redis and Memcached offer key-value stores with varying durability guarantees.
Column-Oriented Storage: Analytics at Petabyte Scale
Column-oriented (columnar) storage stores all values of each column contiguously rather than all columns of each row. A query accessing 4 of 100 columns only reads and decompresses those 4 column files. Columnar storage also compresses dramatically better: a single column contains homogeneous values of one type, and columns in fact tables often have low cardinality (e.g., a product_sk column in a billion-row sales table may have only 100,000 distinct values). Bitmap encoding creates one bitmap per distinct value; run-length encoding compresses runs of repeated bits. Columnar storage is used in Snowflake, BigQuery, Amazon Redshift, DuckDB, Apache Parquet, Apache Arrow, ORC, InfluxDB IOx, TimescaleDB, and others.
| Dimension | Row-Oriented (OLTP) | Column-Oriented (OLAP) |
|---|---|---|
| Storage layout | All columns of a row stored together on disk | All values of each column stored together |
| Query pattern | SELECT *: every column needed, few rows | SELECT 3 cols, scan millions of rows (analytics) |
| I/O efficiency | Loads full 100-column row to read 3 columns | Loads only the 3 required columns |
| Compression | Mixed types per row: poor compression ratio | Homogeneous values per column: excellent (bitmap + RLE) |
| Write performance | Efficient in-place row updates | Bulk ETL writes; individual row writes are expensive |
| Best for | OLTP (banking, e-commerce, general-purpose SQL) | OLAP (data warehouse, BI dashboards, analytics) |
| Examples | PostgreSQL, MySQL, MongoDB | Snowflake, BigQuery, Redshift, DuckDB, Parquet, Arrow |
Bitmap compression example: A product_sk column with 100,000 distinct values in a billion-row table can be encoded as 100,000 bitmaps, each with one bit per row. The bitmap for product_sk = 69 has a 1 at every position where that product was sold, and 0 elsewhere. These bitmaps are typically sparse, so they are run-length encoded: runs of zeros and ones alternate, starting with zeros, so "0, 4, 12, 2" means no leading zeros, then 4 ones, 12 zeros, and 2 ones. A WHERE product_sk IN (31, 68, 69) query loads three bitmaps and computes their bitwise OR, which a CPU can do a machine word (64 rows) at a time. Queries with AND predicates compute a bitwise AND. Both operations work directly on the compressed data without decoding it.
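A sketch of both ideas, using Python integers as bitmaps and a six-row toy column (real engines pack bits into words and use vectorised instructions):

```python
from collections import defaultdict

column = [69, 69, 31, 30, 69, 31]          # product_sk values, one per row

# Bitmap encoding: one bitmap per distinct value; bit i is set if row i has it.
bitmaps = defaultdict(int)
for row, value in enumerate(column):
    bitmaps[value] |= 1 << row

def rows_where_in(values):
    # WHERE product_sk IN (...) is the bitwise OR of the selected bitmaps.
    combined = 0
    for v in values:
        combined |= bitmaps.get(v, 0)
    return [row for row in range(len(column)) if combined >> row & 1]

print(rows_where_in([31, 68, 69]))         # [0, 1, 2, 4, 5]; row 3 (sk=30) excluded

def run_length_encode(bitmap, n_rows):
    # Alternating run lengths of zeros and ones, starting with zeros.
    runs, current, count = [], 0, 0
    for row in range(n_rows):
        bit = bitmap >> row & 1
        if bit == current:
            count += 1
        else:
            runs.append(count)
            current, count = bit, 1
    runs.append(count)
    return runs

print(run_length_encode(bitmaps[69], len(column)))
# [0, 2, 2, 1, 1]: no zeros, 2 ones, 2 zeros, 1 one, 1 zero
```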
Query Execution: Compilation vs Vectorisation
Two approaches optimise query execution over columnar data. Query compilation (used by HyPer and other LLVM-based systems): the query engine translates the SQL into native machine code via LLVM or similar, compiles it, and executes it directly on column-encoded data in memory. Overhead: compilation latency at query start. Upside: maximum CPU efficiency for long-running queries. Vectorised processing (used by DuckDB, Apache Arrow): the engine interprets the query but processes a batch of values at once rather than one row at a time. A fixed set of operators is built into the engine; the query passes arguments (e.g., a column and a constant) and receives a result batch. Both approaches exploit sequential memory access (CPU cache efficiency), tight inner loops (minimal branch mispredictions), SIMD instructions, and the ability to operate directly on compressed data.
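A rough sketch of the vectorised half of this contrast, using NumPy arrays as stand-ins for column batches (the column names and data are made up):

```python
import numpy as np

# Two column batches from a hypothetical fact table
product_sk = np.array([31, 68, 69, 69, 31, 30] * 1000)
net_price = np.random.default_rng(0).integers(1, 100, size=product_sk.size)

# Row-at-a-time interpretation: one loop iteration (and branch) per row
total_slow = 0
for sk, price in zip(product_sk, net_price):
    if sk == 69:
        total_slow += price

# Vectorised execution: the comparison and the sum each run as one tight
# loop over the whole batch (here inside NumPy's compiled kernels)
mask = product_sk == 69               # boolean vector, one element per row
total_fast = net_price[mask].sum()

assert total_slow == total_fast
```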
Materialized Views and OLAP Cubes
A materialized view is an actual on-disk copy of a view's query result, unlike a virtual view, which re-executes its defining query on every read. When the underlying data changes, the materialized view must be updated (some databases do this automatically; others use external systems like Materialize). A data cube (OLAP cube) is a grid of pre-aggregated values grouped by combinations of dimensions. For a two-dimensional cube over date × product, each cell holds the SUM of net_price for that date-product combination. Totals along each axis allow O(1) lookups for aggregated metrics. Limitation: a cube cannot answer queries along dimensions it did not pre-aggregate (e.g., filtering by price range when price is not a cube dimension). Most warehouses keep raw columnar data and use cubes only as a performance optimisation for the most common aggregations.
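A sketch of a two-dimensional cube built from a handful of hypothetical fact rows:

```python
from collections import defaultdict

facts = [                              # (date, product_sk, net_price)
    ("2024-01-01", 31, 5.0),
    ("2024-01-01", 69, 2.5),
    ("2024-01-02", 31, 7.0),
    ("2024-01-02", 69, 1.0),
]

cube = defaultdict(float)              # (date, product_sk) -> SUM(net_price)
by_date = defaultdict(float)           # totals along the date axis
by_product = defaultdict(float)        # totals along the product axis

for date, product, price in facts:
    cube[(date, product)] += price
    by_date[date] += price
    by_product[product] += price

print(cube[("2024-01-02", 31)])        # 7.0: one pre-aggregated cell
print(by_date["2024-01-01"])           # 7.5: O(1) lookup of an axis total
# A query this cube cannot answer: SUM(net_price) WHERE net_price > 5,
# because price is not one of the pre-aggregated dimensions.
```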
Vector Indexes: From Keywords to Semantics
Vector embeddings translate documents (or images, audio) into high-dimensional floating-point vectors via neural network embedding models (Word2Vec, BERT, GPT, multimodal models). Vectors near each other in the embedding space are semantically similar. A vector index stores these embeddings and supports nearest-neighbour queries: given a query vector, find the k stored vectors with the highest cosine similarity (or smallest Euclidean distance). Three main index types: flat indexes do a brute-force comparison, exact but O(n) per query; IVF (Inverted File) indexes cluster vectors into centroid partitions and compare the query only against vectors in nearby clusters (approximate, with an accuracy-vs-speed trade-off controlled by the probe count); HNSW (Hierarchical Navigable Small World) builds a layered graph where each layer is progressively denser, and queries start at the sparse top layer and descend, following edges to nearer vectors. HNSW is approximate but typically provides the best accuracy/speed trade-off. Used by pgvector, Pinecone, Weaviate, Qdrant, and Meta's FAISS library.
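A sketch of flat search and an IVF-style shortcut over random vectors, using dot-product similarity (the cluster and probe counts are arbitrary; a real IVF index trains its centroids with k-means rather than picking them at random):

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.normal(size=(5_000, 64)).astype(np.float32)   # stored embeddings
query = rng.normal(size=64).astype(np.float32)

def flat_top_k(q, k=5):
    # Exact but O(n): score every stored vector against the query
    sims = vectors @ q
    return np.argsort(-sims)[:k]

# "Training" for the IVF sketch: pick 32 random vectors as centroids and
# assign every stored vector to its nearest centroid (its partition).
centroids = vectors[rng.choice(len(vectors), size=32, replace=False)]
assignments = np.argmin(
    np.linalg.norm(vectors[:, None, :] - centroids[None, :, :], axis=2), axis=1)

def ivf_top_k(q, k=5, n_probe=4):
    # Approximate: only scan vectors in the n_probe partitions nearest the query
    probe = np.argsort(np.linalg.norm(centroids - q, axis=1))[:n_probe]
    candidates = np.where(np.isin(assignments, probe))[0]
    sims = vectors[candidates] @ q
    return candidates[np.argsort(-sims)[:k]]

print(flat_top_k(query))      # exact top-5 neighbours
print(ivf_top_k(query))       # approximate top-5, scanning far fewer vectors
```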
Full-text search and vector search are structurally related: a full-text inverted index treats each term as a dimension, and a document containing term x has value 1 in dimension x. Finding documents containing both "red" and "apples" is a query for a 1 in both the red dimension and the apples dimension, computed as an intersection of the two postings lists (a bitwise AND when those lists are stored as bitmaps). Lucene (used by Elasticsearch and Solr) stores terms in SSTable-like sorted files, merged in the background using the same log-structured approach as RocksDB. Vector search generalises this: each dimension is a continuous floating-point value, not a binary term membership. The index structure changes from a sorted term → postings-list map to a nearest-neighbour graph, but the conceptual goal, efficiently finding similar documents, is the same.
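A sketch of the inverted-index half: term → postings, with an AND query computed as a postings intersection (the documents and tokenisation are deliberately trivial):

```python
from collections import defaultdict

docs = {
    0: "red apples and green pears",
    1: "green apples",
    2: "red wine",
}

postings = defaultdict(set)            # term -> ids of documents containing it
for doc_id, text in docs.items():
    for term in text.split():
        postings[term].add(doc_id)

def search_all(*terms):
    # Documents containing every term: intersect the postings lists
    # (the bitwise AND of the corresponding bitmaps, if stored that way)
    result = None
    for term in terms:
        ids = postings.get(term, set())
        result = ids if result is None else result & ids
    return sorted(result or [])

print(search_all("red", "apples"))     # [0]
```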
Six Mental Models Worth Keeping
Indexes make reads faster at the cost of making every write slower. An index must be updated on every INSERT, UPDATE, and DELETE. A table with 10 indexes pays 10 index-update costs per write. The right number of indexes is the minimum set that makes your query patterns acceptably fast.
Random disk I/O is 10–100× slower than sequential. The LSM-tree's fundamental insight: batch and sort writes in memory, then flush sequentially to disk. This is why Cassandra and RocksDB can saturate SSD write throughput in a way B-trees cannot.
A data warehouse query touching 4 of 100 columns with row storage must load 96 unnecessary columns for every row scanned. Column storage loads only what the query needs. For petabyte-scale tables, this is the difference between a 3-second query and a 3-hour one.
The OS already caches hot disk blocks in RAM, so a disk-based database rarely touches disk for frequently-read data. In-memory databases win because they avoid the overhead of encoding data structures into a form that can be written to disk; they operate directly on native in-memory representations like linked lists, priority queues, and hash maps.
OLTP: high volume of small reads/writes, typically accessing a single record by primary key or secondary index. OLAP: low volume of complex queries, each scanning millions of rows and aggregating across many columns. These patterns are so different that modern systems specialise: PostgreSQL for OLTP, Snowflake/DuckDB for OLAP.
Full-text search treats each word as a dimension. Vector indexes generalise this to 1,000+ floating-point dimensions, where nearby vectors represent semantically similar documents. HNSW indexes navigate a hierarchical graph of the vector space, playing a role similar to R-trees for geospatial data, but in embedding space.
Part of the DDIA Series
This article covers Chapter 4 of Designing Data-Intensive Applications (2nd ed.) by Martin Kleppmann. The series distils each chapter into audience-aware reading notes across three levels: conceptual overview (Layman), architectural context (Reviewer), and engineering detail (Engineer).