
Data Models and Query Languages: Relational, Document, Graph, and Beyond

Chapter 3 covers the full data model landscape: the object-relational impedance mismatch and ORM limitations, document model locality and schema-on-read, normalisation vs denormalisation trade-offs with the Twitter timeline as a worked case study, star/snowflake/OBT analytics schemas, property graphs with Cypher, triple-stores with SPARQL, Datalog's recursive relational queries, GraphQL's intentional constraints, and event sourcing/CQRS as a write-optimised primitive with derived materialised views.

Data Models as Cognitive Frameworks

Most applications are built by layering data models: the application layer models the world as objects; those map to JSON documents, relational tables, or graph vertices; the storage engine represents those as B-tree pages or LSM segments; hardware represents those as electrical signals. Each layer hides complexity below via a clean abstraction. The key question at the application-to-storage boundary, which model to use, is the subject of this chapter. The critical insight: no model is universally superior. Each expresses certain query patterns naturally and makes others awkward.

The Relational Model: Dominant for Sixty Years

The relational model's durability comes from SQL being declarative: you specify the result pattern, and the query optimiser determines the execution strategy (which indexes to use, whether to parallelise, in which order to execute joins). This separation enables performance improvements without changing query syntax. The NoSQL movement introduced schema flexibility and new data models, but most NoSQL systems have since added joins, secondary indexes, and declarative query languages. Relational databases added JSON column types. The implementations are converging even as the conceptual models remain distinct.
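
As a minimal illustration, using a hypothetical people table: the query states only what to return; EXPLAIN reveals the strategy the optimiser chose, and adding an index changes the plan without changing the query.

-- Hypothetical table: the query says nothing about how to find the rows
SELECT name FROM people WHERE birth_city = 'Boston';

-- Inspect the optimiser's chosen plan (PostgreSQL):
EXPLAIN SELECT name FROM people WHERE birth_city = 'Boston';
-- Typically a sequential scan without an index...
CREATE INDEX people_birth_city ON people (birth_city);
-- ...and an index scan afterwards, with the query text unchanged.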

The Object-Relational Impedance Mismatch

ORMs reduce translation boilerplate but don't eliminate the mismatch. Key failure modes: N+1 queries (fetch N rows, then N separate queries for related data instead of one JOIN); schema drift (ORM-generated schemas tuned for the ORM, not for direct SQL users); analytics blindspot (ORMs target OLTP, so data engineers still work with the raw schema); join impedance at scale (ORMs make lazy loading easy, masking expensive query patterns).
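
A sketch of the N+1 failure mode, with hypothetical authors and posts tables: lazy loading emits one query per parent row where a single JOIN would do.

-- What a lazy-loading ORM typically emits: 1 + N round trips
SELECT id, name FROM authors;              -- N rows back
SELECT * FROM posts WHERE author_id = 1;   -- then one query per author
SELECT * FROM posts WHERE author_id = 2;
-- ... N queries in total

-- The single-statement equivalent the ORM obscures:
SELECT a.name, p.title
FROM authors a
JOIN posts p ON p.author_id = a.id;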

Document Model: Locality and Schema Flexibility

The document model fits naturally when data has a tree structure of one-to-many relationships typically loaded as a unit. A JSON document has better data locality: all related data is co-located and retrieved in one read. The relational technique of shredding a document into multiple tables introduces join overhead for queries that need the whole tree. Limitations surface at many-to-many relationships: document databases lack native join support (or have limited syntax like MongoDB's $lookup), and the absence of foreign key constraints means consistency must be managed in application code.
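
A sketch of both storage shapes, with hypothetical table names: one jsonb row holds the whole tree, while the shredded form needs joins to reassemble it.

-- Document style: the whole tree in one row, one read (PostgreSQL jsonb)
CREATE TABLE resumes (user_id bigint PRIMARY KEY, doc jsonb);
SELECT doc FROM resumes WHERE user_id = 251;

-- Shredded relational style: reassembly requires joins
SELECT u.name, p.job_title, e.school
FROM users u
LEFT JOIN positions p ON p.user_id = u.id
LEFT JOIN education e ON e.user_id = u.id
WHERE u.id = 251;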

Schema-on-Write (Relational)

Every row must conform to the column definitions at insert time. Schema is explicit documentation. Errors surface immediately. Migration required to change structure, but ALTER TABLE ADD COLUMN with a NULL default is usually fast.
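
For instance, adding a hypothetical column: in most modern engines this is a metadata-only change, with existing rows reading the new column as NULL.

ALTER TABLE users ADD COLUMN middle_name text;  -- existing rows: NULL, no rewrite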

Schema-on-Read (Document)

Arbitrary JSON accepted at write time; the application interprets structure at read time. "Schemaless" is a misnomer: the schema is implicit in the reading code. Useful when objects are genuinely heterogeneous or the schema is controlled by an external system.
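
A sketch of schema-on-read against the jsonb table above: the reading query, not the table definition, decides how to interpret documents, including ones written before a field existed (field names hypothetical).

-- Older documents may lack the 'middle_name' key; the reading query copes
SELECT doc->>'first_name' AS first_name,
       COALESCE(doc->>'middle_name', '') AS middle_name
FROM resumes;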

Normalisation vs Denormalisation

The normalisation/denormalisation axis is the central read-write performance trade-off in data modelling. Normalised data (each fact stored once, referenced by ID) is faster to write and consistent by construction, but requires joins on every read. Denormalised data co-locates copies for fast reads but creates consistency obligations: when the original changes, all copies must be updated atomically or the system becomes inconsistent. The Twitter timeline case study is instructive: the precomputed fan-out (denormalised) avoids expensive join-on-read at 2M reads/sec, but post content and like counts are deliberately kept normalised because they mutate too frequently for duplication to be practical. A minimal SQL sketch of both shapes follows the comparison table below.

Dimension | Normalised | Denormalised
Write speed | Faster: one copy to update | Slower: every copy must be updated
Read speed | Slower: requires joins | Faster: data is co-located
Consistency risk | None: one source of truth | High: copies can diverge
Storage cost | Lower: no duplication | Higher: data repeated per record
Best fit | OLTP systems, frequently updated data | Analytics, rarely-changing dimensions, caches
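
A minimal SQL sketch of the Twitter-style split, with hypothetical table names: the fan-out rows are the denormalised copy (one per follower), but they carry only post IDs, so the mutable content stays normalised and is joined in at read time.

-- Normalised: each post's content stored exactly once
CREATE TABLE posts (
  post_id   bigint PRIMARY KEY,
  author_id bigint NOT NULL,
  content   text   NOT NULL
);

-- Denormalised fan-out: one timeline row per (follower, post) pair
CREATE TABLE timelines (
  follower_id bigint NOT NULL,
  post_id     bigint NOT NULL   -- only the ID is copied, never the content
);

-- Reading a timeline: cheap scan of the precomputed copy,
-- joined to the single source of truth for the mutable fields
SELECT p.content
FROM timelines t
JOIN posts p ON p.post_id = t.post_id
WHERE t.follower_id = 42
ORDER BY t.post_id DESC
LIMIT 50;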

Analytics Schemas: Star, Snowflake, One Big Table

Data warehouses use denormalisation aggressively. The star schema centres on a fact table (one row per event) surrounded by dimension tables (who/what/where/when/why). The snowflake schema normalises dimensions into sub-dimensions: more correct, but harder for analysts. One Big Table (OBT) folds dimension data directly into the fact table, eliminating all joins. OBT is viable for analytics because historical facts don't change, so the consistency risk of denormalisation disappears.
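
A condensed star-schema sketch with hypothetical retail-flavoured names: the fact table holds one row per sale, foreign keys point at the dimensions, and the OBT variant folds the dimension columns straight into the fact table.

-- Fact table: one row per event, mostly foreign keys plus measures
CREATE TABLE fact_sales (
  date_key    int,
  product_key int,
  store_key   int,
  quantity    int,
  net_price   numeric
);

-- Dimension tables: the who/what/where/when/why
CREATE TABLE dim_product (product_key int PRIMARY KEY, sku text, brand text);
CREATE TABLE dim_date (date_key int PRIMARY KEY, calendar_date date, is_holiday boolean);

-- OBT variant: copy brand, calendar_date, is_holiday into fact_sales,
-- eliminating joins at the cost of storage; safe because facts never change.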

Graph Models: When Many-to-Many Is the Whole Point

When many-to-many relationships are pervasive and connections carry semantic meaning, the property graph model is the natural fit. In the labelled property graph model, each vertex has a unique ID, a label, and a jsonb property map; each edge has a tail vertex, a head vertex, a relationship label, and its own properties. Two indexes, on tail_vertex and head_vertex, enable efficient bidirectional traversal. The edges table is the many-to-many join table from relational theory, generalised to support heterogeneous relationship types in a single structure.
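
A relational sketch of the structure just described, along the lines of the book's example (exact names assumed):

CREATE TABLE vertices (
  vertex_id  bigint PRIMARY KEY,
  label      text,
  properties jsonb
);

CREATE TABLE edges (
  edge_id     bigint PRIMARY KEY,
  tail_vertex bigint REFERENCES vertices (vertex_id),  -- edge start
  head_vertex bigint REFERENCES vertices (vertex_id),  -- edge end
  label       text,
  properties  jsonb
);

-- The two indexes that make traversal efficient in either direction
CREATE INDEX edges_tails ON edges (tail_vertex);
CREATE INDEX edges_heads ON edges (head_vertex);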

Cypher: Find people born in the US who now live in Europe

MATCH
  (person) -[:BORN_IN]->  () -[:WITHIN*0..]-> (:Location {name:'United States'}),
  (person) -[:LIVES_IN]-> () -[:WITHIN*0..]-> (:Location {name:'Europe'})
RETURN person.name

*0.. means "follow zero or more WITHIN edges", traversing any depth of geographic hierarchy. The equivalent SQL WITH RECURSIVE CTE is 31 lines.
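
A condensed sketch of that recursive query (not the book's full 31-line version), against the vertices/edges tables sketched earlier: each CTE collects every location inside a region by walking WITHIN edges, then people are joined against both sets.

WITH RECURSIVE in_usa(vertex_id) AS (
    -- base case: the 'United States' vertex itself (the *0.. case)
    SELECT vertex_id FROM vertices
    WHERE label = 'Location' AND properties->>'name' = 'United States'
  UNION
    -- recursive case: anything WITHIN a vertex already in the set
    SELECT e.tail_vertex FROM edges e
    JOIN in_usa ON e.head_vertex = in_usa.vertex_id
    WHERE e.label = 'WITHIN'
),
in_europe(vertex_id) AS (
    SELECT vertex_id FROM vertices
    WHERE label = 'Location' AND properties->>'name' = 'Europe'
  UNION
    SELECT e.tail_vertex FROM edges e
    JOIN in_europe ON e.head_vertex = in_europe.vertex_id
    WHERE e.label = 'WITHIN'
)
SELECT v.properties->>'name'
FROM vertices v
JOIN edges born  ON born.tail_vertex  = v.vertex_id AND born.label  = 'BORN_IN'
JOIN in_usa      ON born.head_vertex  = in_usa.vertex_id
JOIN edges lives ON lives.tail_vertex = v.vertex_id AND lives.label = 'LIVES_IN'
JOIN in_europe   ON lives.head_vertex = in_europe.vertex_id;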

Triple-Stores and SPARQL

The triple-store model stores data as subject–predicate–object triples. When the object is another vertex, as in (lucy, marriedTo, alain), the triple is an edge; when the object is a literal, as in (lucy, birthYear, 1989), it is a vertex property. SPARQL predates Cypher; Cypher borrowed its pattern-matching syntax. Triple-stores originated with the Semantic Web project; their legacy lives on in knowledge graphs (Google, Wikidata), JSON-LD, and biomedical ontologies.
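
One way to picture the model, as a hypothetical three-column SQL table:

CREATE TABLE triples (subject text, predicate text, object text);
INSERT INTO triples VALUES
  ('lucy', 'marriedTo', 'alain'),  -- object is a vertex: this triple is an edge
  ('lucy', 'birthYear', '1989');   -- object is a literal: a vertex property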

Query Languages: Declarative vs Imperative

Declarative languages specify only the desired result, leaving the query optimiser free to exploit indexes, choose join algorithms, and parallelise execution without any change to the query. Imperative traversal code fixes the execution strategy at the time it is written. The performance gap is significant at scale.

Language | Paradigm | Scope | Note
SQL | Declarative | Relational + graph (via WITH RECURSIVE) | Industry standard; optimiser can parallelise freely
Cypher / GQL | Declarative | Property graphs | Pattern-matching syntax; 4 lines for what SQL needs 31
SPARQL | Declarative | Triple-stores (RDF) | Predates Cypher; Cypher borrowed its pattern syntax
Datalog | Declarative (rule-based) | Relational + recursive graphs | Subset of Prolog; used by Datomic, LinkedIn LIquid
GraphQL | Declarative (client-defined shape) | API layer over any backend | No recursive queries; intentional DoS prevention
MongoDB Aggregation | Imperative pipeline | Document collections | JSON-based; $match/$group/$lookup pipeline stages

GraphQL: API Query Language, Not a Graph Database Language

Despite the name, GraphQL is not a graph database query language. It lets clients specify exactly which fields they need and get back a JSON document matching that shape, with no over-fetching. Its deliberate restrictions (no recursive queries, no arbitrary search) prevent clients from issuing denial-of-service-inducing traversals. GraphQL can be implemented over any backend: relational, document, or graph.

Event Sourcing and CQRS: Writing Facts, Deriving State

Event sourcing stores every state change as an immutable, append-only event. Current state is derived by processing the event log sequentially. CQRS (Command Query Responsibility Segregation) separates the write path (event log) from the read path (materialised views), allowing each to be optimised independently. A command is validated first; only if valid does it become a fact appended to the log, so the log contains only valid events and consumers cannot reject them. Events are named in the past tense (booking_payment_confirmed, not confirm_booking) because they record completed facts.
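
A minimal relational sketch of the two paths, with hypothetical names: an append-only events table as the log, and a materialised view derived entirely from it.

-- Write path: validated commands become immutable facts, append-only
CREATE TABLE events (
  event_id   bigserial   PRIMARY KEY,           -- append order
  event_type text        NOT NULL,              -- past tense, e.g. 'booking_payment_confirmed'
  payload    jsonb       NOT NULL,
  created_at timestamptz NOT NULL DEFAULT now()
);

-- Read path: derived state, rebuildable from scratch at any time
CREATE MATERIALIZED VIEW confirmed_bookings AS
SELECT payload->>'booking_id' AS booking_id,
       max(created_at)        AS confirmed_at
FROM events
WHERE event_type = 'booking_payment_confirmed'
GROUP BY payload->>'booking_id';

-- After new events arrive: REFRESH MATERIALIZED VIEW confirmed_bookings;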

01
Auditability by default

The event log is a complete, ordered record of every change. Invaluable in regulated industries (finance, healthcare) where you must answer "what happened and when?"

02
Multiple read-optimised views

The same event log can derive many materialised views (one for a customer confirmation page, one for an organiser dashboard, one for a badge printer), each tailored to its query pattern.

03
Reproducibility

Materialised views are recomputed deterministically from the event log. Bug in a view? Delete it and rebuild from scratch. Impossible with mutable state.

04
High write throughput

Appending to a log is sequential: fast and simple. Downstream consumers catch up asynchronously without blocking the write path.

05
Reversibility

An erroneous event is corrected by a subsequent compensating event. In traditional databases, a committed UPDATE or DELETE is very hard to reverse.

06
Evolvability

Add a new feature by building a new materialised view from the existing log, with no changes to the write path. Old events remain untouched.

Choosing Your Data Model

Model selection should be driven by the dominant relationship structure in your data and the query patterns that matter most. These models are not mutually exclusive: PostgreSQL's jsonb, recursive CTEs, and event log tables can coexist in the same database. Knowing what each model optimises for is what lets you use that convergence intentionally.

Model | Best for | Weak at | Query language | Examples
Relational | Many-to-one and many-to-many joins, structured records | Deeply nested trees, complex graph traversal | SQL | PostgreSQL, MySQL, SQLite
Document | Self-contained tree-shaped objects, schema flexibility | Many-to-many relationships, cross-document joins | MQL, JSONPath, $lookup | MongoDB, Couchbase, Firestore
Graph | Dense many-to-many relationships, variable-depth traversal | Large tabular aggregations, simple key-value access | Cypher, SPARQL, Gremlin | Neo4j, Amazon Neptune, KùzuDB
Event log | Append-heavy writes, full auditability, multiple read views | Low-latency direct reads (needs materialised views) | Stream processing (Flink, Kafka Streams) | Kafka, Datomic, EventStoreDB

Mental Models

01
The data model shapes how you think, not just what you store

Kleppmann opens with Wittgenstein: "The limits of my language mean the limits of my world." Choosing the relational model makes you think in tables and joins. Choosing the graph model makes you think in vertices and traversals. The model is a cognitive framework, not just a storage format.

02
Declarative languages almost always beat imperative traversal

SQL, Cypher, SPARQL, and Datalog let you describe the pattern you want. The query engine decides how, and can exploit indexes, parallelism, and join order optimisations you'd never implement by hand.

03
Normalise/denormalise is a read-write trade-off, not a moral judgment

Neither is universally better. Denormalise things that change infrequently; normalise things that change often. Twitter stores only post IDs in its materialised timeline (normalised) but pre-computes the fan-out (denormalised graph). Both in the same system.

04
Graph query expressiveness: 4 lines vs 31 lines

Finding people born in the US who now live in Europe takes 4 lines of Cypher and 31 lines of SQL WITH RECURSIVE. The right data model for the problem eliminates complexity, not just code volume.

05
Event sourcing flips the data model question

All other models answer "how should I store current state?" Event sourcing asks "what if I only stored immutable facts about what happened?" Current state becomes a derived view, one of many possible views of the same event history.

06
Document and relational models are converging in practice

PostgreSQL has jsonb and full JSON operators. MongoDB has $lookup joins and secondary indexes. The line between implementations is blurring. The conceptual distinction (locality and schema-on-read vs joins and schema-on-write) remains useful for making design decisions.

Part of the DDIA series. These notes distil Designing Data-Intensive Applications (2nd ed.) by Martin Kleppmann, the definitive guide to the trade-offs behind databases, messaging systems, and distributed architectures. Chapter 3 covers the data model and query language landscape; later chapters go deeper on storage engines, encoding, replication, and distributed transactions.