Why Keep Multiple Copies?
Replication means keeping a copy of the same data on multiple machines connected via a network. Three reasons motivate this:
- **Latency:** by placing replicas geographically close to users, read requests can be served from a nearby node rather than across the world.
- **Availability:** if one machine fails, others can continue serving requests without interruption.
- **Scalability:** read throughput can be increased by distributing reads across multiple replicas (writes still bottleneck unless you use multi-leader or leaderless replication).
The hard problem: every write to the database must also be replicated to every replica. If replication were instantaneous and never failed, it would be trivial. The difficulties arise from handling change over time (schema evolution, fluctuating replication lag, and node failures) while data is being written.
The Three Replication Algorithms
Almost all replication systems use one of three fundamental approaches. Each makes different trade-offs between write availability, consistency, and operational complexity.
| Dimension | Single-Leader | Multi-Leader | Leaderless |
|---|---|---|---|
| Write availability | ❌ Single point of failure: leader must be reachable | ✅ Any region's leader accepts writes independently | ✅ A write succeeds once any w of n replicas accept it |
| Read availability | ✅ Any follower (may be stale) or leader (fresh) | ✅ Any follower or leader | ✅ Query r replicas, take the newest |
| Consistency strength | ✅ Strongest: total write order enforced by the leader | ⚠️ Weaker: conflicts possible; no global total order | ⚠️ Eventual: quorum overlap gives probabilistic freshness |
| Failover complexity | ⚠️ Leader election, split-brain risk, write-loss risk | ✅ Regions operate independently; no failover needed | ✅ No failover concept: the system degrades gracefully |
| Conflict handling | ✅ None: total order prevents conflicts | ⚠️ Must detect and resolve (LWW, CRDTs, manual) | ⚠️ Must detect and resolve (LWW, CRDTs) |
| Latency (multi-region) | ❌ All writes cross the inter-region link to the leader | ✅ Writes processed locally; inter-region replication is async | ✅ Writes go to local replicas; cross-region replication is async |
| Best suited for | OLTP, financial systems requiring strong consistency | Geo-distributed apps, offline-first / local-first software | High-availability key-value stores (Cassandra, Riak, ScyllaDB) |
Single-Leader Replication
In single-leader replication (also called active/passive or master/slave), one replica is designated the leader. All writes go to the leader, which applies them to its local storage and also sends them to followers via a replication log (or change stream). Followers apply the writes from the log in the same order, so that their local copy of the data eventually becomes identical to the leader's.
This is the most widely used approach: PostgreSQL, MySQL, MongoDB, Kafka, and most relational databases use some form of it.
Synchronous vs. Asynchronous Replication
The replication can be synchronous (the leader waits for the follower to acknowledge before confirming the write to the client) or asynchronous (the leader confirms immediately and replicates in the background). In practice, most setups are semi-synchronous: one follower is kept synchronous as a durability guarantee (if the leader fails, the synchronous follower has an up-to-date copy), while the rest are asynchronous. Fully synchronous replication is rare because a single unavailable follower would block all writes.
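As a rough illustration (not any real database's replication protocol), here is a minimal Python sketch of the semi-synchronous pattern: the leader blocks on a single designated synchronous follower and replicates to the others in the background. The `Follower` and `SemiSyncLeader` classes are hypothetical.

```python
import threading

class Follower:
    """Toy follower: applies writes to an in-memory log."""
    def __init__(self, name):
        self.name = name
        self.log = []

    def apply(self, entry):
        self.log.append(entry)  # in a real system: persist to disk, then acknowledge

class SemiSyncLeader:
    """Leader that waits only for its one synchronous follower."""
    def __init__(self, sync_follower, async_followers):
        self.sync_follower = sync_follower
        self.async_followers = async_followers
        self.log = []

    def write(self, entry):
        self.log.append(entry)            # 1. apply locally
        self.sync_follower.apply(entry)   # 2. block until the synchronous follower acks
        for f in self.async_followers:    # 3. replicate to the rest in the background
            threading.Thread(target=f.apply, args=(entry,)).start()
        return "committed"                # 4. confirm to the client

leader = SemiSyncLeader(Follower("sync-1"), [Follower("async-1"), Follower("async-2")])
print(leader.write({"key": "user:42", "value": "Alice"}))
```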
Setting Up New Followers
You can't simply copy files from the leader: clients are constantly writing to the database, so different parts of the copy would reflect the data at different points in time. The standard procedure is: take a consistent snapshot of the leader's data (without locking), copy it to the new follower, then connect the follower to the leader and replay all the replication log entries that occurred since the snapshot, identified by the log's position (log sequence number in PostgreSQL, binlog coordinates in MySQL, or a GTID).
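A minimal sketch of that bootstrap procedure, assuming the replication log is just an in-memory list whose length serves as the log position; the `Leader` class and `bootstrap_follower` function are illustrative, not any particular database's API.

```python
class Leader:
    def __init__(self):
        self.data = {}   # current key-value state
        self.log = []    # replication log: list of (key, value) entries

    def write(self, key, value):
        self.data[key] = value
        self.log.append((key, value))

    def snapshot(self):
        # Consistent snapshot plus the log position (LSN) it corresponds to.
        return dict(self.data), len(self.log)

def bootstrap_follower(leader):
    # 1. Copy a consistent snapshot and remember its log position.
    state, lsn = leader.snapshot()
    # (writes keep arriving at the leader while the copy is in progress)
    leader.write("k2", "written during copy")
    # 2. Replay every log entry recorded after the snapshot position.
    for key, value in leader.log[lsn:]:
        state[key] = value
    return state

leader = Leader()
leader.write("k1", "before snapshot")
print(bootstrap_follower(leader))   # follower has caught up, including k2
```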
A modern variant, pioneered by systems like WarpStream and Confluent Freight, uses object storage (S3) as the shared data layer, a zero-disk architecture (ZDA). There is no local disk replication between nodes; instead, all data is written to object storage, and "followers" are stateless compute nodes that pull from S3. This eliminates replication lag for durability at the cost of higher read latency (object storage is slower than local disk).
Failover: The Three Steps and Their Failure Modes
When the leader fails, a new one must be elected. The process has three stages: detect the failure (typically via a timeout, but how long? Too short causes unnecessary failovers; too long means prolonged unavailability), elect a new leader (usually the most up-to-date follower, determined by consensus), and reconfigure the system (all clients and followers must be pointed to the new leader).
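A sketch of the three stages under simplifying assumptions (heartbeat timestamps for failure detection, a replication log position per follower); the node dictionaries and the 30-second timeout are placeholders, not a real election protocol.

```python
import time

TIMEOUT_S = 30  # too short: spurious failovers; too long: prolonged unavailability

def detect_failure(leader, now):
    return now - leader["last_heartbeat"] > TIMEOUT_S

def elect_new_leader(followers):
    # Choose the most up-to-date follower (highest replication log position).
    return max(followers, key=lambda f: f["log_position"])

def reconfigure(clients, followers, new_leader):
    for node in clients + followers:
        node["leader"] = new_leader["name"]

leader = {"name": "node-a", "last_heartbeat": time.time() - 60}
followers = [
    {"name": "node-b", "log_position": 1041, "leader": "node-a"},
    {"name": "node-c", "log_position": 1037, "leader": "node-a"},
]
clients = [{"name": "app-1", "leader": "node-a"}]

if detect_failure(leader, time.time()):
    new_leader = elect_new_leader(followers)   # node-b: loses the least data
    reconfigure(clients, followers, new_leader)
print(clients[0]["leader"])  # -> node-b
```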
Failover is full of dangerous edge cases. A new leader may not have received all writes from the old leader; if those writes are discarded when the old leader rejoins, client-visible data is lost. In GitHub's 2012 incident, a promoted MySQL follower had fallen behind; when it became leader, its auto-increment counters were lower than those the old leader had already used. New rows were assigned IDs that had already been used by the old leader and were still referenced in a Redis cache serving the old data, causing users to see each other's private information.
The split-brain problem occurs when two nodes both believe they are the leader. If both accept writes, the data diverges with no mechanism to reconcile. The standard mitigation is STONITH (Shoot The Other Node In The Head): the new leader sends a signal to forcibly shut down the old leader before accepting writes. But this itself can fail, and configuring it safely requires careful operational discipline.
Replication Log Types
How does the leader communicate its writes to followers? Four main approaches have been used:
| Type | How it works | Problems | Verdict |
|---|---|---|---|
| Statement-based | Replicate the SQL INSERT/UPDATE/DELETE statement text itself | Non-deterministic functions (NOW(), RAND()); triggers; auto-increment with ordering dependencies | ❌ Mostly abandoned: too many edge cases |
| WAL shipping | Ship the storage engine's write-ahead log bytes to followers | Tightly coupled to storage-engine internals, preventing independent version upgrades on followers | ⚠️ Used by PostgreSQL/Oracle streaming replication; zero-downtime upgrades require logical replication |
| Logical (row-based) replication | Log each row mutation (insert: full row; update: old key plus new values; delete: primary key). Storage-engine independent | Slightly more verbose than WAL; must handle large transactions | ✅ Best general choice: decoupled from the storage engine, supports change data capture (CDC) |
| Trigger-based | Application-level triggers write change records to a separate table; an external process reads and replicates them | High overhead; bug-prone; captures only a subset of changes | ⚠️ Use only for selective replication or cross-database migrations |
Logical replication is the current best practice. Because it is decoupled from the storage engine's internal format, followers can run a different version of the database software, enabling zero-downtime rolling upgrades. Logical replication logs are also the foundation for change data capture (CDC): streaming database changes to derived systems like Elasticsearch, data warehouses, or caches as they happen.
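To make the row-based idea concrete, here is a sketch of what logical log records might look like and how a consumer (a follower or a CDC pipeline) applies them; the record shapes are illustrative rather than any specific database's wire format.

```python
# Illustrative logical-log records: storage-engine-independent row mutations.
log = [
    {"op": "insert", "table": "users", "row": {"id": 1, "name": "Ada"}},
    {"op": "update", "table": "users", "key": {"id": 1}, "changes": {"name": "Ada L."}},
    {"op": "delete", "table": "users", "key": {"id": 1}},
]

def apply(state, record):
    table = state.setdefault(record["table"], {})
    if record["op"] == "insert":
        table[record["row"]["id"]] = dict(record["row"])
    elif record["op"] == "update":
        table[record["key"]["id"]].update(record["changes"])
    elif record["op"] == "delete":
        del table[record["key"]["id"]]

# The same stream can feed a follower, a search index, or a cache (CDC).
follower_state = {}
for record in log:
    apply(follower_state, record)
print(follower_state)   # -> {'users': {}} after insert, update, delete
```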
Replication Lag and Consistency Anomalies
When a leader processes a write and confirms it to the client, but the followers haven't yet applied that write, the system is in a state of replication lag. Under normal operation, lag is a few milliseconds. Under heavy load or network issues, it can grow to minutes. The three classic anomalies that emerge from lag:
**Read-your-writes.** A user submits a form; the write goes to the leader. They immediately reload the page; the read goes to a follower that hasn't yet received the write. The user sees their submission disappear. Solution: always read the user's own data from the leader, or track the user's last write timestamp and refuse to serve their reads from replicas that are behind that timestamp.
**Monotonic reads.** A user refreshes a page twice; the first request goes to a replica with low lag (shows a comment), the second to a replica with high lag (the comment disappears). Time appears to run backward. Solution: ensure each user always reads from the same replica, for example by routing on a hash of the user ID rather than randomly.
**Consistent prefix reads.** In a sharded database, Mr. Poons asks "How far into the future can you see?" and Mrs. Cake answers "About ten seconds." An observer reading from different shards may see Mrs. Cake's answer before Mr. Poons's question, as if she answered before being asked. Solution: write causally related operations to the same shard, or use causal-dependency tracking algorithms.
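A sketch of the first two fixes above, assuming a hypothetical `Replica` object that tracks how far it has caught up: read-your-writes by comparing against the user's last write timestamp, and monotonic reads by hashing the user ID to a fixed replica.

```python
import hashlib
import time

class Replica:
    def __init__(self, name, applied_up_to=0.0):
        self.name = name
        self.applied_up_to = applied_up_to   # timestamp of the newest write applied

last_write_ts = {}   # user_id -> timestamp of that user's last write

def record_write(user_id):
    last_write_ts[user_id] = time.time()

def pick_replica_read_your_writes(user_id, leader, replicas):
    # Serve from a replica only if it has caught up past the user's last write.
    cutoff = last_write_ts.get(user_id, 0.0)
    candidates = [r for r in replicas if r.applied_up_to >= cutoff]
    return candidates[0] if candidates else leader

def pick_replica_monotonic(user_id, replicas):
    # Always route the same user to the same replica so time never runs backward.
    idx = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % len(replicas)
    return replicas[idx]

leader = Replica("leader", applied_up_to=time.time())
replicas = [Replica("r1"), Replica("r2")]
record_write("user-7")
print(pick_replica_read_your_writes("user-7", leader, replicas).name)  # leader: no replica caught up yet
replicas[0].applied_up_to = time.time()                                # r1 catches up
print(pick_replica_read_your_writes("user-7", leader, replicas).name)  # r1
print(pick_replica_monotonic("user-7", replicas).name)                 # always the same replica for user-7
```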
Consistency Models: A Spectrum
The three anomalies above are each addressed by a named consistency model. Together with eventual consistency and linearizability, they form a spectrum from weakest to strongest:
| Model | Guarantee | Strength | When it's enough |
|---|---|---|---|
| Eventual consistency | If no new writes are made, all replicas will eventually converge to the same value | Weakest | Analytics, caches, social media likes/view counts |
| Read-your-writes (read-after-write) | A user always sees their own writes immediately after submitting them | Moderate | User profile pages, form submissions, comment threads |
| Monotonic reads | A user will never see time go backward: once they've seen a value, they won't see an older one | Moderate | Activity feeds, comment sections, paginated data |
| Consistent prefix reads | Reads see writes in causal order: no seeing an answer before the question | Moderate | Sharded databases with causally related writes across shards |
| Linearizability (strong consistency) | The system behaves as if there is only one replica: all reads return the most recently written value | Strongest | Bank balances, unique username registration, leader election |
Multi-Leader Replication
Multi-leader replication (also called active/active or bidirectional replication) allows more than one node to accept writes. Each leader simultaneously acts as a follower to the other leaders. The primary motivation: in a single-leader setup, all writes must cross the network to reach the leader; in a geo-distributed system, this adds significant latency and creates a write-availability dependency on a single region.
In a multi-leader geo-distributed setup, each data centre has a leader. Within each data centre, standard leader/follower replication is used. Between data centres, each region's leader asynchronously replicates changes to the leaders of the other regions, hiding inter-region latency from the write path entirely.
When Multi-Leader Makes Sense
Multi-leader replication is appropriate in three main scenarios:
- **Geo-distributed server systems:** MySQL, Oracle, SQL Server, and YugabyteDB support multi-leader replication as an external add-on or built-in feature.
- **Offline-first / local-first software:** your phone's calendar app is a multi-leader system; the device is a "region", offline writes are accepted, and sync happens when connectivity is restored. Tools like PouchDB/CouchDB, Automerge, and Yjs implement this pattern.
- **Real-time collaborative editing:** Google Docs, Figma, and Linear treat each open browser tab as a leader; changes are broadcast immediately without waiting for a server round-trip, then reconciled asynchronously. The library implementing this is called a sync engine.
Multi-Leader Topologies
With multiple leaders, writes must propagate across all of them. The communication paths form a replication topology. A circular topology passes writes from node to node around a ring; a star topology routes all writes through a central root node; an all-to-all topology sends each write directly to every other leader. Circular and star topologies are more fragile: a single node failure can interrupt the replication chain. All-to-all is more resilient but can cause write-ordering issues when faster network links let update messages "overtake" the inserts they depend on. Addressing this requires version vectors to track causal ordering.
Dealing with Write Conflicts
The fundamental challenge of multi-leader replication: if two leaders accept concurrent writes to the same record, a conflict arises. This problem does not exist in single-leader replication because the total order of writes is enforced by the leader. Here are the main resolution strategies:
| Strategy | How it works | Trade-off | Used in |
|---|---|---|---|
| Conflict avoidance | Route all writes for a given record through the same leader, so multi-leader behaves like single-leader for that record | Breaks when you need to change the designated leader (region failure, user migration) | Geo-distributed databases with a "home region" per user |
| Last Write Wins (LWW) | Attach a timestamp to each write; always keep the one with the highest timestamp, resolving concurrent writes by discarding the "losers" (sketched after this table) | ⚠️ Data loss: successfully committed writes are silently discarded. Sensitive to clock skew; use logical clocks to mitigate | Cassandra (default), ScyllaDB; acceptable only for immutable key-value data |
| Manual / sibling merge | Keep all concurrent values as siblings; return them all on read; let application code or the user resolve them | API complexity; risky if merged carelessly (Amazon shopping-cart ghost-item bug: deleted items reappeared) | CouchDB; any system where data loss is unacceptable |
| Automatic merge (CRDTs) | Conflict-free Replicated Data Types: mathematical data structures (counters, sets, maps, text) that merge deterministically regardless of operation order | Limited to CRDT-compatible data types; some operations (delete) require tombstones | Riak (sets/maps), Redis Enterprise, Automerge/Yjs (collaborative text editing) |
| Operational Transformation (OT) | Record operations at an index; when replicas exchange operations, transform each operation's index to account for concurrent operations already applied | Complex to implement correctly; significant algorithmic subtlety | Google Docs (real-time collaborative text editing), ShareDB |
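A minimal sketch contrasting two strategies from the table above: a last-write-wins register, which silently discards the losing concurrent write, and a grow-only counter CRDT, whose element-wise-max merge loses nothing. Both classes are illustrative, not any library's API.

```python
class LWWRegister:
    """Last write wins: keep only the value with the highest timestamp."""
    def __init__(self, value=None, ts=0):
        self.value, self.ts = value, ts

    def merge(self, other):
        # The concurrent write with the lower timestamp is silently discarded.
        if other.ts > self.ts:
            self.value, self.ts = other.value, other.ts

class GCounter:
    """Grow-only counter CRDT: per-replica counts, merge = element-wise max."""
    def __init__(self):
        self.counts = {}

    def increment(self, replica_id, amount=1):
        self.counts[replica_id] = self.counts.get(replica_id, 0) + amount

    def merge(self, other):
        for rid, count in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), count)

    def value(self):
        return sum(self.counts.values())

# Two concurrent LWW writes: the lower-timestamp value is lost.
x, y = LWWRegister("from-leader-1", ts=10), LWWRegister("from-leader-2", ts=11)
x.merge(y)
print(x.value)     # 'from-leader-2': the ts=10 write was dropped

# Two replicas increment concurrently; merging in either order gives the same total.
a, b = GCounter(), GCounter()
a.increment("replica-a", 3)
b.increment("replica-b", 5)
a.merge(b)
print(a.value())   # 8: nothing lost, regardless of merge order
```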
Leaderless Replication (Dynamo-Style)
Leaderless replication abandons the concept of a leader entirely. Any replica can accept writes directly from clients. Amazon's internal Dynamo system pioneered this approach in 2007; Riak, Cassandra, and ScyllaDB are the major open-source implementations (collectively called "Dynamo-style" databases). Note: AWS's DynamoDB is a different, confusingly named product that uses single-leader replication.
Quorum Reads and Writes
In a leaderless system, writes are sent to all n replicas in parallel. The write is considered successful when w replicas acknowledge it. Reads are sent to all n replicas; the client waits for r responses and uses the one with the highest version number (timestamp). As long as w + r > n, the sets of nodes written to and read from must overlap by at least one node, so reads are guaranteed to see the most recent write.
| n, w, r | Can tolerate | Notes |
|---|---|---|
| n=3, w=2, r=2 | 1 unavailable node | Typical setup: majority quorums, tolerates 1 failure |
| n=5, w=3, r=3 | 2 unavailable nodes | Higher durability; slower writes (must wait for 3) |
| n=3, w=3, r=1 | 0 for writes, 2 for reads | Maximum-durability writes, fast reads; fragile write path |
| n=3, w=1, r=3 | 2 for writes, 0 for reads | Fast writes, always-consistent reads; leader-like read behaviour |
| n=5, w=5, r=1 | 0 for writes | Synchronous writes to all replicas: maximum consistency, zero fault tolerance on writes |
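A sketch of the quorum mechanics behind the table above, with replicas modelled as plain dictionaries (purely illustrative): a write succeeds after w acknowledgements, and a read takes the highest version among r responses.

```python
def quorum_write(replicas, key, value, version, w):
    acks = 0
    for rep in replicas:                            # sent to all n replicas (in parallel in practice)
        if rep.get("up", True):
            rep.setdefault("data", {})[key] = (value, version)
            acks += 1
    return acks >= w                                # success once w replicas acknowledge

def quorum_read(replicas, key, r):
    responses = []
    for rep in replicas:
        if rep.get("up", True) and key in rep.get("data", {}):
            responses.append(rep["data"][key])
        if len(responses) >= r:
            break
    # Use the value with the highest version number.
    return max(responses, key=lambda vv: vv[1]) if responses else None

n = 3
replicas = [{"up": True} for _ in range(n)]
assert quorum_write(replicas, "k", "v1", version=1, w=2)
replicas[0]["up"] = False                           # one node goes down: still within w + r > n
print(quorum_read(replicas, "k", r=2))              # ('v1', 1)
```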
You may also deliberately set w + r ≤ n to allow lower latency at the cost of stale reads, which is useful in high-throughput, availability-first workloads. When a large-scale network interruption prevents quorum formation, some databases offer a sloppy quorum option: any reachable replica can accept writes, even if it is not one of the usual n replicas for that key. This improves write availability at the cost of weakening the quorum freshness guarantee. When the unavailable nodes return, a hinted handoff transfers the stored writes to the correct replicas.
Catching Up on Missed Writes
When a node comes back online after a failure, it needs to catch up on writes it missed. Dynamo-style databases use three mechanisms:
- **Read repair:** when a client reads from multiple replicas and detects that one returned a stale value, it writes the newer value back to that replica (sketched below). Works well for frequently read keys.
- **Hinted handoff:** replicas that temporarily stored writes for an unavailable node forward those writes when the node recovers, then delete the hints. Handles infrequently read keys.
- **Anti-entropy:** a background process continuously compares replicas and copies any missing data between them; it does not copy writes in any particular order and may lag significantly.
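A sketch of the read-repair path, again with replicas as plain dictionaries: the client keeps the newest (value, version) pair it sees and writes it back to any replica that returned a stale or missing copy.

```python
def read_with_repair(replicas, key):
    # Collect responses from every reachable replica.
    responses = [(rep, rep["data"].get(key)) for rep in replicas if rep.get("up", True)]
    versions = [vv for _, vv in responses if vv is not None]
    if not versions:
        return None
    newest_value, newest_version = max(versions, key=lambda vv: vv[1])
    # Write the newest value back to any replica with a stale (or missing) copy.
    for rep, vv in responses:
        if vv is None or vv[1] < newest_version:
            rep["data"][key] = (newest_value, newest_version)
    return newest_value

replicas = [
    {"up": True, "data": {"k": ("old", 1)}},
    {"up": True, "data": {"k": ("new", 2)}},
    {"up": True, "data": {}},
]
print(read_with_repair(replicas, "k"))   # 'new'
print(replicas[0]["data"]["k"])          # ('new', 2): the stale replica was repaired
```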
Limitations of Quorum Consistency
Even with w + r > n, quorum consistency has important edge cases that prevent it from being a strong guarantee: a node carrying a new value may fail and be restored from an old replica (reducing the count of up-to-date nodes below w); a rebalancing operation in progress can cause read and write quorums to no longer overlap; a write that succeeded on fewer than w replicas before an error is not rolled back on the replicas where it did succeed, causing inconsistency. Dynamo-style databases are fundamentally optimised for availability and eventual consistency; the n/w/r parameters tune the probability of reading stale data but do not eliminate it.
Detecting Concurrent Writes: Version Vectors
To resolve conflicts correctly, the system must distinguish between "A happened before B" (B can safely overwrite A) and "A and B are concurrent" (they must be merged). Two operations are concurrent if neither knows about the other; this is independent of whether they literally happened at the same time (clocks can't be trusted; see Ch.9).
The algorithm: each server maintains a version number per key, incremented on every write. Clients must read before writing โ a write must include the version number from the prior read. The server uses this to determine: values with version numbers at or below the submitted version can be overwritten (they've been "seen" by this write); values with higher version numbers must be kept as siblings (they're concurrent). Clients receive all siblings when reading and must merge them before writing.
With multiple replicas, a single version number is insufficient. Each replica maintains its own version number per key. The collection of all per-replica version numbers is a version vector (the dotted version vector used in Riak 2.0 is a variant, distinct from but related to the older vector clock). Version vectors are sent from replicas to clients on reads and must be sent back to the database on writes, allowing the system to identify which values can be safely overwritten and which are concurrent and must be kept as siblings.
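A sketch of version vector comparison, representing each vector as a dict from replica ID to counter (a common textbook representation; real systems such as Riak use more compact encodings).

```python
def dominates(a, b):
    """True if version vector a has seen everything b has seen."""
    return all(a.get(replica, 0) >= counter for replica, counter in b.items())

def compare(a, b):
    if dominates(a, b) and not dominates(b, a):
        return "a supersedes b"
    if dominates(b, a) and not dominates(a, b):
        return "b supersedes a"
    if dominates(a, b) and dominates(b, a):
        return "equal"
    return "concurrent: keep both as siblings and merge"

# b was built on top of a (a happened before b): b can safely overwrite a.
print(compare({"r1": 2, "r2": 1}, {"r1": 3, "r2": 1}))   # b supersedes a
# Neither has seen the other's latest write: concurrent.
print(compare({"r1": 3, "r2": 1}, {"r1": 2, "r2": 2}))   # concurrent
```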
Single-Leader vs. Leaderless: Performance Trade-offs
Single-leader replication can provide strong consistency and ACID transactions, but has three performance liabilities: read throughput is bounded by leader capacity; leader failure causes downtime during failover; and the system is sensitive to leader performance degradation (a slow leader immediately affects all users). Leaderless systems are more resilient because there is no failover and requests go to multiple replicas in parallel, so one slow replica has minimal impact since the client uses the fastest responses (request hedging). This significantly reduces tail latency. However, leaderless systems have their own performance costs: larger quorums increase the chance of hitting a slow replica; the hinted handoff process creates additional load on recovering nodes; and large-scale network interruptions may prevent quorum formation entirely.
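A sketch of request hedging with a simulated `query_replica` of variable latency (both names are illustrative): the client queries every replica concurrently and returns as soon as the fastest r responses arrive, so the slowest replica does not set the latency.

```python
import concurrent.futures
import random
import time

def query_replica(name):
    # Simulated replica with variable latency; any one replica may be slow.
    time.sleep(random.uniform(0.01, 0.2))
    return name, ("value", 1)

def hedged_read(replica_names, r=2):
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=len(replica_names))
    futures = [pool.submit(query_replica, name) for name in replica_names]
    results = []
    # Take the first r responses; slower replicas are ignored.
    for fut in concurrent.futures.as_completed(futures):
        results.append(fut.result())
        if len(results) >= r:
            break
    pool.shutdown(wait=False)   # don't block on the slowest replica
    return results

print(hedged_read(["r1", "r2", "r3"]))
```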
Six Mental Models for Replication
1. Followers replicate asynchronously: a read on a lagging follower can return a value from seconds or minutes ago. The three lag anomalies (read-your-writes, monotonic reads, consistent prefix reads) are all variants of "time appears to go backward." Choose a database that provides the consistency level you need, or route reads carefully.
2. Leader election has three steps (detect → elect → reconfigure), each with failure modes. The GitHub incident (MySQL auto-increment key collision after failover) shows that discarding a failed leader's committed writes can corrupt application-level invariants. Split brain plus STONITH is the distributed-systems equivalent of a fire drill that burns down the building.
3. The moment two leaders can independently accept writes to the same record, conflicts become possible. There is no way around this; it is a fundamental limitation of distributed systems. The CAP theorem formalises this. Multi-leader gives you write availability across network partitions at the cost of needing conflict resolution.
4. Even with w + r > n, edge cases exist: a node carrying a new value can fail and be restored from an old replica; a rebalance in progress can make quorums not overlap; a concurrent read/write can produce non-monotonic results. Dynamo-style databases are optimised for availability; don't treat them as strongly consistent.
5. Conflict-free Replicated Data Types guarantee that all replicas converge to the same state regardless of operation arrival order, without requiring coordination. The key insight: design data types whose merge function is commutative, associative, and idempotent. Counters, sets, maps, and even collaborative text can be modelled as CRDTs.
6. To distinguish "A happened before B" from "A and B are concurrent," each replica maintains a per-key version number. The collection across all replicas is a version vector (related to but subtly different from a vector clock). When one version vector dominates the other, the later operation supersedes the earlier one; when neither dominates, the writes are concurrent and must be merged.
Part of the Designing Data-Intensive Applications series