DDIA · Ch.10 · 28 min read

Consistency and Consensus: The Strongest Guarantees in Distributed Systems

Ch.10 of DDIA 2nd ed. covers linearizability (recency guarantee for single objects), its use cases (locks, uniqueness constraints, cross-channel ordering), CAP theorem, the performance cost of linearizability (Attiya-Welch bound), logical and hybrid logical clocks (Lamport, HLC, vector clocks), linearizable ID generators, and consensus in its forms: single-value consensus, total order broadcast, and atomic commitment. Consensus algorithms (Raft, Paxos, Zab) and the FLP impossibility result are discussed.

Linearizability: The Recency Guarantee

Linearizability (also called atomic consistency, strong consistency, or external consistency) is the strongest single-object consistency model. Its guarantee is simple: after a write completes, all subsequent reads must return the written value, from any node, from any client, anywhere in the system.

More formally: a system is linearizable if operations appear to have been executed atomically on a single copy of the data at some point between their invocation and response. The system behaves as if there is a single, shared register, and every read returns the value of the most recently committed write.

Crucially, linearizability is a recency guarantee, not a transaction isolation guarantee. It applies to individual operations on individual objects. It says nothing about multi-object consistency; that is the domain of serializability.
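A helpful way to internalize the single-register model: in a single process it is just a variable behind a mutex, where every operation takes effect atomically at one instant. Here is a minimal Python sketch of that model; the entire difficulty of linearizability is making a replicated system behave like this one class:

```python
import threading

# Single-copy register sketch: a single-process stand-in for the
# "one shared register" that linearizability asks a whole
# distributed system to emulate.
class AtomicRegister:
    def __init__(self, value=None):
        self._value = value
        self._lock = threading.Lock()

    def read(self):
        with self._lock:   # the read takes effect at one point in time
            return self._value

    def write(self, value):
        with self._lock:   # the write takes effect at one point in time
            self._value = value
```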

Linearizability vs. Serializability

Linearizability

Single-object recency guarantee. Reads return the most recently written value. Time-based: there is a real-time ordering of operations. Does not concern transactions or multi-object consistency.

Serializability

Multi-object isolation guarantee. Transactions appear to execute one at a time in some serial order. That order does not have to match real time: serializable snapshot isolation, for example, satisfies serializability but not linearizability, because reads from a consistent snapshot may be stale.

Strict serializability = linearizability + serializability. Both recency and transaction isolation guaranteed. This is what PostgreSQL provides with serializable isolation on a single node.

When Linearizability Is Required

Distributed Locks and Leader Election

Only one node must hold the lock or be the leader at a time. Requires linearizable compare-and-set (CAS): "set this value only if it is currently X." Linearizability ensures the CAS is atomic across all replicas.

Systems: ZooKeeper, etcd (Raft-based), Apache Curator
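A minimal sketch of lock acquisition on top of such a primitive, assuming a hypothetical `kv` client whose `compare_and_set(key, expected, new)` is linearizable and returns True on success (the shape of the recipes these systems provide, not their actual APIs):

```python
import uuid

def try_acquire_lock(kv, lock_key: str) -> str | None:
    """Return an owner token if the lock was acquired, else None."""
    token = str(uuid.uuid4())
    # Succeeds only if no one currently holds the lock. Because the CAS
    # is linearizable, at most one of any set of concurrent callers wins.
    if kv.compare_and_set(lock_key, expected=None, new=token):
        return token
    return None

def release_lock(kv, lock_key: str, token: str) -> bool:
    # Release only if we still hold it, guarded by our own token.
    return kv.compare_and_set(lock_key, expected=token, new=None)
```

A real implementation also needs leases or fencing tokens so a crashed holder does not block everyone forever.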

Uniqueness Constraints

Username or email uniqueness across concurrent registrations. Two concurrent requests to register the same username must result in exactly one succeeding. Requires linearizable writes.

Systems: PostgreSQL serializable isolation, single-leader database with uniqueness constraint
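With the same hypothetical linearizable `kv` client as in the lock sketch above, a uniqueness constraint reduces to an insert-if-absent:

```python
def register_username(kv, username: str, user_id: str) -> bool:
    # Of two concurrent registrations for the same name, only one CAS
    # can observe the key as absent, so exactly one succeeds.
    return kv.compare_and_set(f"username:{username}", expected=None, new=user_id)
```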

Cross-Channel Timing Dependencies

User sets their account private on device A, then uploads a photo on device B. The photo service must see the "private" setting before allowing the photo. Without a linearizable read of the setting (for example, if only causal ordering via Lamport clocks is guaranteed), the photo service may see a stale value and publish the photo as public.

Systems: Single database with linearizable reads, or Spanner TrueTime

Which Replication Schemes Are Linearizable?

Single-leader replication (reads from leader): ✓ potentially linearizable. Reads and writes go through a single leader with synchronous replication. Not linearizable by default: reads from followers see stale data, the leader must not have a concurrent rival after an election, and concurrency bugs or failover can break the guarantee.

Multi-leader replication: ✗ not linearizable. Multiple leaders can accept concurrent writes and replicate asynchronously, so there is no single source of truth. Conflicts require resolution, breaking the recency guarantee.

Leaderless replication (Dynamo-style): ✗ probably not linearizable. Even with w + r > n quorums, linearizability is not guaranteed due to timing: a node may return a stale value if it hasn't received the latest write yet. Sloppy quorums and hinted handoff explicitly break linearizability.

Consensus algorithms (Raft, Paxos, Zab): ✓ linearizable. Designed specifically to provide linearizable operations with fault tolerance; used in ZooKeeper, etcd, CockroachDB, YugabyteDB. The performance cost is real: every write requires majority acknowledgment.

The CAP Theorem

The CAP theorem states: a distributed system cannot simultaneously provide more than two of the following three guarantees: Consistency (linearizability), Availability (every request receives a response), and Partition tolerance (the system continues to operate when network partitions occur).

In practice, network partitions are not optional: any distributed system deployed across multiple machines or datacenters must handle them. The real choice is not "which two of three" but rather: when a partition occurs, do you sacrifice consistency or availability?

CAP is often overstated. During normal operation (no partition), you don't have to choose: you can have both consistency and availability. The choice only matters during the relatively rare event of a partition. "PACELC" extends CAP: even in the absence of a partition (Else), there is a tradeoff between Latency and Consistency.

Consistent (CP)

When a partition occurs, the system refuses to serve requests rather than risk returning stale data. Sacrifices availability for consistency.

Examples: ZooKeeper, etcd, HBase, MongoDB (single-leader with read concern: majority)

Tradeoff: Returns errors or timeouts to clients during partition. High availability SLAs are hard to meet.

Available (AP)

When a partition occurs, the system continues serving requests from all nodes, potentially returning stale or conflicting data. Sacrifices consistency for availability.

Examples: Cassandra, DynamoDB, CouchDB, Riak (with eventual consistency)

Tradeoff: Returns potentially stale reads during partition. Concurrent writes may conflict and require resolution.

The Cost of Linearizability

Attiya and Welch (1994) proved that in a network with variable message delays, the response time of any linearizable read or write is at least proportional to the uncertainty in the message delay. In high-latency or variable-latency networks (like multi-datacenter deployments), linearizability has a real and unavoidable latency cost. This is not an engineering failure; it is a mathematical lower bound. Systems like Spanner accept this cost and design around it (commit-wait); others (Cassandra, DynamoDB) sacrifice linearizability for lower latency.

Logical Clocks and Distributed ID Generators

Physical clocks are unreliable for ordering events in distributed systems. Logical clocks solve the ordering problem without requiring accurate physical time: they track the relative order of events (causality) without measuring elapsed time.

Lamport Clock

MECHANISM

Single counter. Each node increments on every operation. When a message is received, counter = max(local, received) + 1. Pair: (counter, node_id) for uniqueness.

ORDERING

Total order consistent with causality: if A happened-before B, then timestamp(A) < timestamp(B)

LIMITATION

Cannot detect concurrency: if two events have different counters, you can't tell if they are causally related or truly concurrent. Counter-intuitive: higher counter doesn't mean later in real time.

Used in: Transaction IDs in MVCC databases, event ordering in distributed logs
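A minimal sketch of the mechanism (illustrative, not taken from any particular system):

```python
# Lamport clock: one counter per node, bumped on every event and
# fast-forwarded past any counter observed in incoming messages.
class LamportClock:
    def __init__(self, node_id: str):
        self.node_id = node_id
        self.counter = 0

    def tick(self) -> tuple[int, str]:
        """Local event or message send: increment and stamp."""
        self.counter += 1
        return (self.counter, self.node_id)

    def receive(self, remote_counter: int) -> tuple[int, str]:
        """Message receipt: jump past the sender's counter."""
        self.counter = max(self.counter, remote_counter) + 1
        return (self.counter, self.node_id)

# (counter, node_id) pairs compare lexicographically: a total order
# consistent with causality, but unable to flag concurrency.
a, b = LamportClock("A"), LamportClock("B")
t1 = a.tick()              # A performs a write
t2 = b.receive(t1[0])      # B hears about it: t2 > t1 is guaranteed
assert t1 < t2
```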

Hybrid Logical Clock (HLC)

MECHANISM

Combines physical time with a Lamport-style logical component. The timestamp tracks the maximum physical time seen so far; when the local physical clock has not advanced past that maximum, the logical component is incremented instead. Result: timestamps that approximately match physical time while preserving causal ordering.

ORDERING

Total order consistent with causality + approximately reflects physical time

LIMITATION

Slightly more complex. Ordering still not linearizable (can't detect if two events were truly concurrent).

Used in: CockroachDB, YugabyteDB, MongoDB MVCC
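A sketch of the usual update rules (following the Kulkarni et al. formulation), assuming a millisecond wall clock:

```python
import time

def now_ms() -> int:
    return int(time.time() * 1000)

class HLC:
    def __init__(self):
        self.l = 0   # physical component (ms)
        self.c = 0   # logical component

    def tick(self) -> tuple[int, int]:
        """Local event or message send."""
        pt = now_ms()
        if pt > self.l:
            self.l, self.c = pt, 0   # physical clock moved forward
        else:
            self.c += 1              # clock stalled: bump logical part
        return (self.l, self.c)

    def receive(self, rl: int, rc: int) -> tuple[int, int]:
        """Merge a remote timestamp (rl, rc) on message receipt."""
        pt = now_ms()
        m = max(self.l, rl, pt)
        if m == self.l and m == rl:
            self.c = max(self.c, rc) + 1
        elif m == self.l:
            self.c += 1
        elif m == rl:
            self.c = rc + 1
        else:
            self.c = 0               # fresh physical time wins
        self.l = m
        return (self.l, self.c)
```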

Vector Clock

MECHANISM

One counter per node, stored with every write. If write A has higher counter than B on one node and lower on another, A and B are concurrent. If A strictly dominates B on all nodes, A happened after B.

ORDERING

Partial order โ€” correctly identifies causality AND concurrency

LIMITATION

Size proportional to number of nodes. Expensive to propagate and compare.

Used in: Dynamo-style conflict detection (version vectors); Git's commit graph encodes a comparable causal partial order implicitly
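A sketch of the comparison rule, representing each clock as a node-to-counter map:

```python
def compare(a: dict[str, int], b: dict[str, int]) -> str:
    """Classify two vector clocks: before, after, equal, or concurrent."""
    nodes = a.keys() | b.keys()
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"       # a happened-before b
    if b_le_a:
        return "after"        # b happened-before a
    return "concurrent"       # neither dominates: a genuine conflict

assert compare({"A": 2, "B": 1}, {"A": 2, "B": 3}) == "before"
assert compare({"A": 3, "B": 1}, {"A": 2, "B": 3}) == "concurrent"
```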

Linearizable ID Generators

Lamport clocks guarantee causal ordering but not linearizability, because a node's counter only reflects events it has seen. A true linearizable ID generator requires a single point of coordination: a node that atomically increments a counter. TiDB/TiKV calls this a "timestamp oracle" (inspired by Google Percolator). The oracle is replicated for fault tolerance using single-leader replication. To avoid a disk write on every request, it persists batches of IDs ahead of time. Alternative: Spanner's TrueTime commit-wait achieves linearizable IDs without a central node, by waiting for clock uncertainty intervals to pass before assigning timestamps, ensuring no future transaction will receive a smaller timestamp.
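A sketch of the batching idea, with `persist_high_water_mark` as a hypothetical stand-in for a durable write (to disk or a replicated log):

```python
import threading

class TimestampOracle:
    BATCH = 1000   # IDs reserved per durable write

    def __init__(self, persist_high_water_mark):
        self._persist = persist_high_water_mark
        self._next = 0     # next ID to hand out
        self._limit = 0    # highest ID already made durable
        self._lock = threading.Lock()

    def next_id(self) -> int:
        with self._lock:
            if self._next >= self._limit:
                # Reserve a whole batch so most requests need no I/O.
                self._limit += self.BATCH
                self._persist(self._limit)
            self._next += 1
            return self._next

# After a crash, restart from the persisted high-water mark: unused IDs
# in the last batch are skipped, but IDs never repeat or go backwards.
```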

Consensus: The Fundamental Problem

Consensus is the problem of getting multiple nodes in a distributed system to agree on a single value. It is one of the most important and difficult problems in distributed computing. Many distributed systems problems (leader election, distributed locking, uniqueness constraints, atomic commit) are instances of consensus in disguise.

The three forms of consensus are mathematically equivalent: if you can solve any one, you can solve the others. (A sketch of one such reduction follows the three definitions below.)

Single-value consensus

Get multiple nodes to agree on a single value. Used for leader election, lock acquisition, atomic compare-and-set operations. Once decided, the value cannot change.

Properties: Uniform agreement (no two nodes decide differently), Integrity (once decided, cannot change), Validity (decided value was proposed), Termination (every non-crashed node eventually decides)

Total order broadcast (shared log)

A shared append-only log where every node reads the same sequence of entries in the same order. Multiple nodes can propose entries; all nodes see the same ordering.

Properties: Eventual append (proposed values are eventually delivered), Reliable delivery (no entries lost), Append-only (immutable once read), Agreement (all nodes see same sequence)

Atomic commitment (2PC)

All participants in a distributed transaction must agree to commit or abort. A single abort vote forces all to abort. Used in distributed and cross-shard transactions.

Properties: Uniform agreement (no node commits while another aborts), Integrity (once committed, cannot abort), Validity (commit requires all votes to commit), Termination (eventually all commit or abort)
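As one direction of the equivalence, single-value consensus falls out of total order broadcast: every node proposes by appending to the log and decides on the first entry delivered. A sketch, assuming a hypothetical `tob` object whose `deliver()` yields the same messages in the same order on every node:

```python
class ConsensusFromTOB:
    def __init__(self, tob):
        self.tob = tob
        self.decided = None

    def propose(self, value):
        self.tob.broadcast(value)

    def decide(self):
        # All nodes deliver the same sequence, so all pick the same first
        # entry (agreement), and it was proposed by someone (validity).
        if self.decided is None:
            self.decided = self.tob.deliver()
        return self.decided
```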

FLP Impossibility

Fischer, Lynch, and Paterson (1985) proved that in a purely asynchronous system (no timeouts or clocks), no deterministic consensus algorithm can always terminate if even one node might crash. This is often misquoted as "consensus is impossible", but the operative word is "always." FLP doesn't say consensus can't usually be reached; it says you can't guarantee it always terminates in a timing-free model. In practice, systems use timeouts (to suspect crashed nodes) and randomization to work around FLP, making consensus practical if not theoretically guaranteed to terminate in all adversarial scenarios.

Consensus Algorithms in Practice

The best-known consensus algorithms for crash-recovery (non-Byzantine) systems are Viewstamped Replication, Paxos, Raft, and Zab. These algorithms are similar in their guarantees but differ in their protocols. Most modern production systems use Raft (etcd, CockroachDB, TiKV, Consul) because its design was specifically optimized for understandability.

A consensus algorithm requires a majority (quorum) of nodes to make a decision. With 3 nodes, 1 can be faulty; with 5 nodes, 2 can be faulty. Safety properties (agreement, integrity, validity) are always maintained, even when a minority of nodes fail. The liveness property (termination) may fail whenever no majority is reachable, but the system will not make incorrect decisions: it simply stops until a majority is reachable again.
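The arithmetic: a cluster of n nodes tolerates f faults as long as a majority survives, i.e. n >= 2f + 1.

```python
def max_faults(n: int) -> int:
    # A majority quorum needs more than n / 2 nodes reachable.
    return (n - 1) // 2

for n in (3, 5, 7):
    print(f"{n} nodes tolerate {max_faults(n)} faulty node(s)")
# 3 -> 1, 5 -> 2, 7 -> 3
```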

Mental Models

Linearizability ≠ Serializability

Linearizability is a recency guarantee for individual objects (reads return the most recently written value). Serializability is an isolation guarantee for multi-object transactions (equivalent to serial execution). A system can be serializable but not linearizable (serializable snapshot isolation), or linearizable but not serializable (single-object atomic writes). Strict serializability = both.

CAP is About Partitions, Not Design Tradeoffs

CAP's insight is narrow: when a network partition occurs, you must choose between consistency (refuse requests) and availability (serve potentially stale data). During normal operation, you can have both. "PACELC" captures the broader tradeoff: Partition → C or A; Else → Latency or Consistency.

Consensus = Total Order Broadcast = Linearizable CAS

These three abstractions are equivalent: you can implement any one from any other. Single-leader replication (without consensus for failover) is NOT equivalent; it stops being total-order-broadcast-like when the leader changes. Consensus algorithms solve this by making leader election itself a consensus operation.

Lamport Clocks Are Not Enough for Linearizability

A Lamport clock gives you total ordering consistent with causality, but not linearizability. For linearizability you need to know whether all other nodes have seen a particular write. Lamport clocks can only tell you what events a single node has observed. You need a linearizable single-node ID generator (or Spanner-style TrueTime) for true linearizable IDs.

Linearizability Has a Real Performance Cost

Linearizable reads and writes require at least one round trip to a majority of nodes. Attiya & Welch proved that in a network with variable delays, the response time of linearizable operations must be at least proportional to the delay uncertainty. You cannot have both linearizability and low response times when network delays are large or highly variable.

FLP ≠ "Consensus Is Impossible"

The FLP impossibility result says a deterministic consensus algorithm cannot be guaranteed to always terminate in an asynchronous system (one that cannot use timeouts). In practice, systems use timeouts and randomization to work around FLP. The result matters theoretically but rarely constrains practical systems: it just means no deterministic protocol can be guaranteed to terminate under every possible timing of messages.

DESIGNING DATA-INTENSIVE APPLICATIONS · 2ND EDITION

This article is part of an ongoing series distilling DDIA 2nd ed. into structured, audience-aware reading notes. The series covers storage engines, replication, sharding, transactions, distributed systems, and stream processing.

โ† Ch.9: The Trouble with Distributed SystemsCh.11: Batch Processing โ†’