Data Models as Cognitive Frameworks
Most applications are built by layering data models: the application layer models the world as objects; those map to JSON documents, relational tables, or graph vertices; the storage engine represents those as B-tree pages or LSM segments; hardware represents those as electrical signals. Each layer hides the complexity below it behind a clean abstraction. The key question at the application-to-storage boundary, and the subject of this chapter, is which model to use. The critical insight: no model is universally superior. Each expresses certain query patterns naturally and makes others awkward.
The Relational Model: Dominant for Sixty Years
The relational model's durability comes from SQL being declarative: you specify the result pattern, and the query optimiser determines the execution strategy (which indexes to use, whether to parallelise, in which order to execute joins). This separation enables performance improvements without changing query syntax. The NoSQL movement introduced schema flexibility and new data models, but most NoSQL systems have since added joins, secondary indexes, and declarative query languages. Relational databases added JSON column types. The implementations are converging even as the conceptual models remain distinct.
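As a small illustration (hypothetical tables), the query text states only the desired result; EXPLAIN shows whatever strategy the planner chose on its own:

```sql
EXPLAIN ANALYZE
SELECT users.name, count(*) AS order_count
  FROM users
  JOIN orders ON orders.user_id = users.user_id
 WHERE orders.created_at >= date '2024-01-01'
 GROUP BY users.name;
-- Whether this uses an index on orders.created_at, a hash or merge join,
-- or parallel workers is the planner's decision; the query text never changes.
```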
The Object-Relational Impedance Mismatch
ORMs reduce translation boilerplate but don't eliminate the mismatch. Key failure modes: N+1 queries (fetch N rows, then issue N separate queries for related data instead of one JOIN); schema drift (ORM-generated schemas tuned for the ORM, not for direct SQL users); the analytics blind spot (ORMs target OLTP, while data engineers still work with the raw schema); join impedance at scale (ORMs make lazy loading easy, masking expensive query patterns).
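A sketch of the N+1 pattern in terms of the SQL a lazy-loading ORM typically emits (table and column names are hypothetical):

```sql
-- N+1: one query for the parent rows...
SELECT * FROM orders WHERE customer_id = 42;
-- ...then one query per returned row for the related data.
SELECT * FROM order_items WHERE order_id = 1001;
SELECT * FROM order_items WHERE order_id = 1002;
-- ...repeated N times...

-- What the same access pattern should compile to: a single round trip.
SELECT orders.order_id, order_items.*
  FROM orders
  JOIN order_items ON order_items.order_id = orders.order_id
 WHERE orders.customer_id = 42;
```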
Document Model: Locality and Schema Flexibility
The document model fits naturally when data has a tree structure of one-to-many relationships typically loaded as a unit. A JSON document has better data locality: all related data is co-located and retrieved in one read, whereas the relational technique of shredding a document into multiple tables introduces join overhead for queries that need the whole tree. Limitations surface at many-to-many relationships: document databases lack native join support (or offer limited forms such as MongoDB's $lookup), and the absence of foreign key constraints means consistency must be managed in application code.
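A minimal sketch of the locality difference, using hypothetical table names and a jsonb column as a stand-in for a document store:

```sql
-- Document style: the whole tree is one value, fetched in a single read.
-- (Hypothetical table: resumes(user_id, doc jsonb).)
SELECT doc
  FROM resumes
 WHERE user_id = 251;

-- Shredded relational style: the same tree split across hypothetical
-- users / positions tables, reassembled with joins on every read.
SELECT users.name, positions.job_title, positions.organisation
  FROM users
  LEFT JOIN positions ON positions.user_id = users.user_id
 WHERE users.user_id = 251;
-- ...plus further joins or queries for education, contact info, and so on.
```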
Schema-on-Write (Relational)
Every row must conform to the column definitions at insert time. Schema is explicit documentation. Errors surface immediately. Migration is required to change structure, but ALTER TABLE ADD COLUMN with a NULL default is usually fast.
Schema-on-Read (Document)
Arbitrary JSON is accepted at write time; the application interprets structure at read time. "Schemaless" is a misnomer: the schema is implicit in the reading code. Useful when objects are genuinely heterogeneous or the schema is controlled by an external system.
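A sketch of the contrast under assumed table names, again using jsonb as a stand-in for a document store:

```sql
-- Schema-on-write: structure is enforced when data arrives.
ALTER TABLE users ADD COLUMN middle_name text;          -- usually fast with a NULL default
INSERT INTO users (user_id, name) VALUES (1, 'Ada');    -- rejected if it violates the schema

-- Schema-on-read: any JSON is accepted on write; the reader decides how to
-- interpret it and handles missing fields itself.
INSERT INTO profiles (user_id, doc)
VALUES (1, '{"name": "Ada"}');
SELECT coalesce(doc->>'middle_name', '(none)') AS middle_name
  FROM profiles
 WHERE user_id = 1;
```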
Normalisation vs Denormalisation
The normalisation/denormalisation axis is the central read-write performance trade-off in data modelling. Normalised data (each fact stored once, referenced by ID) is faster to write and consistent by construction, but requires joins on every read. Denormalised data co-locates copies for fast reads but creates consistency obligations: when the original changes, all copies must be updated atomically or the system becomes inconsistent. The Twitter timeline case study is instructive: the precomputed fan-out (denormalised) avoids expensive join-on-read at 2M reads/sec, but post content and like counts are deliberately kept normalised because they mutate too frequently for duplication to be practical.
| Dimension | Normalised | Denormalised |
|---|---|---|
| Write speed | Faster: one copy to update | Slower: every copy must be updated |
| Read speed | Slower: requires joins | Faster: data is co-located |
| Consistency risk | None: one source of truth | High: copies can diverge |
| Storage cost | Lower: no duplication | Higher: data repeated per record |
| Best fit | OLTP systems, frequently updated data | Analytics, rarely-changing dimensions, caches |
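A minimal SQL sketch of the trade-off above, with hypothetical users/posts tables:

```sql
-- Normalised: the author's name lives in one place and is joined in on read.
SELECT posts.post_id, posts.body, users.name AS author_name
  FROM posts
  JOIN users ON users.user_id = posts.author_id;

-- Denormalised: the name is copied onto every post so reads skip the join,
-- but a rename must now touch every copy, atomically, to stay consistent.
BEGIN;
UPDATE users       SET name = 'New Name'        WHERE user_id = 42;
UPDATE posts_denorm SET author_name = 'New Name' WHERE author_id = 42;
COMMIT;
```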
Analytics Schemas: Star, Snowflake, One Big Table
Data warehouses use denormalisation aggressively. The star schema centres on a fact table (one row per event) surrounded by dimension tables (who/what/where/when/why). The snowflake schema normalises dimensions into sub-dimensions, which is more rigorous but harder for analysts to work with. One Big Table (OBT) folds dimension data directly into the fact table, eliminating all joins. OBT is viable for analytics because historical facts don't change, so the consistency risk of denormalisation disappears.
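A minimal star-schema sketch, with hypothetical fact and dimension tables:

```sql
CREATE TABLE dim_product (product_sk bigint PRIMARY KEY, sku text, category text);
CREATE TABLE dim_date    (date_sk integer PRIMARY KEY, calendar_date date, weekday text);

CREATE TABLE fact_sales (
    product_sk bigint  REFERENCES dim_product (product_sk),
    date_sk    integer REFERENCES dim_date (date_sk),
    quantity   integer,
    net_price  numeric
);

-- A typical analytics query joins the fact table to a few dimensions:
SELECT dim_product.category, sum(fact_sales.net_price) AS revenue
  FROM fact_sales
  JOIN dim_product ON dim_product.product_sk = fact_sales.product_sk
  JOIN dim_date    ON dim_date.date_sk = fact_sales.date_sk
 WHERE dim_date.weekday = 'Saturday'
 GROUP BY dim_product.category;

-- The One Big Table variant folds category, weekday, etc. directly into
-- fact_sales, so the same question needs no joins at all.
```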
Graph Models: When Many-to-Many Is the Whole Point
When many-to-many relationships are pervasive and connections carry semantic meaning, the property graph model is the natural fit. In the labeled property graph model, each vertex has a unique ID, a label, and a map of properties; each edge has a tail vertex, a head vertex, a relationship label, and its own properties. In a relational representation, two indexes on tail_vertex and head_vertex enable efficient bidirectional traversal. The edges table is the many-to-many join table from relational theory, generalised to support heterogeneous relationship types in a single structure.
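A sketch of that relational representation, with assumed column and index names:

```sql
CREATE TABLE vertices (
    vertex_id  bigint PRIMARY KEY,
    label      text,
    properties jsonb
);

CREATE TABLE edges (
    edge_id     bigint PRIMARY KEY,
    tail_vertex bigint REFERENCES vertices (vertex_id),  -- where the edge starts
    head_vertex bigint REFERENCES vertices (vertex_id),  -- where the edge points
    label       text,                                    -- relationship type
    properties  jsonb
);

-- The two indexes that make traversal cheap in both directions:
CREATE INDEX edges_tails ON edges (tail_vertex);   -- outgoing edges of a given vertex
CREATE INDEX edges_heads ON edges (head_vertex);   -- incoming edges of a given vertex
```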
Cypher: Find people born in the US who now live in Europe.

```cypher
MATCH
  (person) -[:BORN_IN]->  () -[:WITHIN*0..]-> (:Location {name:'United States'}),
  (person) -[:LIVES_IN]-> () -[:WITHIN*0..]-> (:Location {name:'Europe'})
RETURN person.name
```

WITHIN*0.. means "follow zero or more WITHIN edges", traversing any depth of the geographic hierarchy. The equivalent SQL WITH RECURSIVE CTE is 31 lines.
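A condensed sketch of what that recursive CTE looks like, written against the vertices/edges tables sketched above (vertex and edge labels are assumed):

```sql
WITH RECURSIVE
  -- All location vertices inside the United States, to any depth.
  in_usa(vertex_id) AS (
      SELECT vertex_id FROM vertices
       WHERE label = 'location' AND properties->>'name' = 'United States'
    UNION
      SELECT edges.tail_vertex FROM edges
        JOIN in_usa ON edges.head_vertex = in_usa.vertex_id
       WHERE edges.label = 'within'
  ),
  -- All location vertices inside Europe, to any depth.
  in_europe(vertex_id) AS (
      SELECT vertex_id FROM vertices
       WHERE label = 'location' AND properties->>'name' = 'Europe'
    UNION
      SELECT edges.tail_vertex FROM edges
        JOIN in_europe ON edges.head_vertex = in_europe.vertex_id
       WHERE edges.label = 'within'
  ),
  born_in_usa(vertex_id) AS (
      SELECT edges.tail_vertex FROM edges
        JOIN in_usa ON edges.head_vertex = in_usa.vertex_id
       WHERE edges.label = 'born_in'
  ),
  lives_in_europe(vertex_id) AS (
      SELECT edges.tail_vertex FROM edges
        JOIN in_europe ON edges.head_vertex = in_europe.vertex_id
       WHERE edges.label = 'lives_in'
  )
SELECT vertices.properties->>'name' AS person_name
  FROM vertices
  JOIN born_in_usa     ON vertices.vertex_id = born_in_usa.vertex_id
  JOIN lives_in_europe ON vertices.vertex_id = lives_in_europe.vertex_id;
```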
Triple-Stores and SPARQL
The triple-store model stores data as subject-predicate-object triples. When the object is another vertex, as in (lucy, marriedTo, alain), the triple is an edge; when the object is a literal, as in (lucy, birthYear, 1989), it is a vertex property. SPARQL predates Cypher; Cypher borrowed its pattern-matching syntax. Triple-stores originated with the Semantic Web project; their legacy lives on in knowledge graphs (Google, Wikidata), JSON-LD, and biomedical ontologies.
Query Languages: Declarative vs Imperative
A declarative query specifies only the desired result; the optimiser is free to exploit indexes, choose join algorithms, and parallelise execution without any change to the query text. Imperative traversal code fixes the execution strategy at the time it is written. The performance gap is significant at scale.
| Language | Paradigm | Scope | Note |
|---|---|---|---|
| SQL | Declarative | Relational + graph (via WITH RECURSIVE) | Industry standard; optimiser can parallelise freely |
| Cypher / GQL | Declarative | Property graphs | Pattern-matching syntax; 4 lines for what SQL needs 31 |
| SPARQL | Declarative | Triple-stores (RDF) | Predates Cypher; Cypher borrowed its pattern syntax |
| Datalog | Declarative (rule-based) | Relational + recursive graphs | Subset of Prolog; used by Datomic, LinkedIn LIquid |
| GraphQL | Declarative (client-defined shape) | API layer over any backend | No recursive queries; intentional DoS prevention |
| MongoDB Aggregation | Imperative pipeline | Document collections | JSON-based; $match/$group/$lookup pipeline stages |
GraphQL: API Query Language, Not a Graph Database Language
Despite the name, GraphQL is not a graph database query language. It lets clients specify exactly which fields they need and get back a JSON document matching that shape β no over-fetching. Its deliberate restrictions (no recursive queries, no arbitrary search) prevent clients from issuing denial-of-service-inducing traversals. GraphQL can be implemented over any backend β relational, document, or graph.
Event Sourcing and CQRS: Writing Facts, Deriving State
Event sourcing stores every state change as an immutable, append-only event. Current state is derived by processing the event log sequentially. CQRS (Command Query Responsibility Segregation) separates the write path (event log) from the read path (materialised views), allowing each to be optimised independently. A command is validated first; only if valid does it become a fact appended to the log, so the log contains only valid events and consumers cannot reject them. Events are named in the past tense (booking_payment_confirmed, not confirm_booking) because they record completed facts.
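A minimal relational sketch of the write/read split, with hypothetical table and view names and jsonb payloads:

```sql
-- Write path: an append-only log of validated facts.
CREATE TABLE events (
    event_id    bigserial   PRIMARY KEY,       -- append order
    event_type  text        NOT NULL,          -- e.g. 'booking_payment_confirmed'
    payload     jsonb       NOT NULL,
    recorded_at timestamptz NOT NULL DEFAULT now()
);

-- Read path: one derived view among many, rebuildable from the log at any time.
CREATE MATERIALIZED VIEW confirmed_bookings AS
SELECT payload->>'booking_id' AS booking_id,
       max(recorded_at)       AS confirmed_at
  FROM events
 WHERE event_type = 'booking_payment_confirmed'
 GROUP BY payload->>'booking_id';

-- Commands are validated, then appended; readers catch up asynchronously.
INSERT INTO events (event_type, payload)
VALUES ('booking_payment_confirmed', '{"booking_id": "b-42"}');
REFRESH MATERIALIZED VIEW confirmed_bookings;
```

Kept in this shape, the event log yields several benefits: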
- The event log is a complete, ordered record of every change: invaluable in regulated industries (finance, healthcare) where you must answer "what happened and when?"
- The same event log can derive many materialised views (one for a customer confirmation page, one for an organiser dashboard, one for a badge printer), each tailored to its query pattern.
- Materialised views are recomputed deterministically from the event log. Bug in a view? Delete it and rebuild from scratch, which is impossible with mutable state.
- Appending to a log is sequential, fast, and simple. Downstream consumers catch up asynchronously without blocking the write path.
- An erroneous event is corrected by a subsequent compensating event; in traditional databases, a committed UPDATE or DELETE is very hard to reverse.
- Adding a new feature means building a new materialised view from the existing log, with no changes to the write path; old events remain untouched.
Choosing Your Data Model
Model selection should be driven by the dominant relationship structure in your data and the query patterns that matter most. These models are not mutually exclusive β PostgreSQL's jsonb, recursive CTEs, and event log tables can coexist in the same database. Knowing what each model optimises for is what lets you use that convergence intentionally.
| Model | Best for | Weak at | Query language | Examples |
|---|---|---|---|---|
| Relational | Many-to-one and many-to-many joins, structured records | Deeply nested trees, complex graph traversal | SQL | PostgreSQL, MySQL, SQLite |
| Document | Self-contained tree-shaped objects, schema flexibility | Many-to-many relationships, cross-document joins | MQL, JSONPath, $lookup | MongoDB, Couchbase, Firestore |
| Graph | Dense many-to-many relationships, variable-depth traversal | Large tabular aggregations, simple key-value access | Cypher, SPARQL, Gremlin | Neo4j, Amazon Neptune, KùzuDB |
| Event Log | Append-heavy writes, full auditability, multiple read views | Low-latency direct reads (needs materialised views) | Stream processing (Flink, Kafka Streams) | Kafka, Datomic, EventStoreDB |
Mental Models
- Kleppmann opens with Wittgenstein: "The limits of my language mean the limits of my world." Choosing the relational model makes you think in tables and joins; choosing the graph model makes you think in vertices and traversals. The model is a cognitive framework, not just a storage format.
- SQL, Cypher, SPARQL, and Datalog let you describe the pattern you want. The query engine decides how, and can exploit indexes, parallelism, and join-order optimisations you'd never implement by hand.
- Neither normalisation nor denormalisation is universally better. Denormalise things that change infrequently; normalise things that change often. Twitter stores only post IDs in its materialised timeline (normalised) but pre-computes the fan-out (denormalised graph): both in the same system.
- Finding people born in the US who now live in Europe takes 4 lines of Cypher and 31 lines of SQL WITH RECURSIVE. The right data model for the problem eliminates complexity, not just code volume.
- All other models answer "how should I store current state?" Event sourcing asks "what if I only stored immutable facts about what happened?" Current state becomes a derived view, one of many possible views of the same event history.
- PostgreSQL has jsonb and full JSON operators; MongoDB has $lookup joins and secondary indexes. The implementations are blurring, but the conceptual distinction (locality and schema-on-read vs joins and schema-on-write) remains useful for making design decisions.
Part of the DDIA series. These notes distil Designing Data-Intensive Applications (2nd ed.) by Martin Kleppmann β the definitive guide to the trade-offs behind databases, messaging systems, and distributed architectures. Chapter 3 covers the data model and query language landscape; later chapters go deeper on storage engines, encoding, replication, and distributed transactions.