Why Keep Multiple Copies?
Replication means keeping a copy of the same data on multiple machines connected via a network. Three reasons motivate this:
- **Latency:** by placing replicas geographically close to users, read requests can be served from a nearby node rather than across the world.
- **Availability:** if one machine fails, others can continue serving requests without interruption.
- **Scalability:** read throughput can be increased by distributing reads across multiple replicas (writes still bottleneck unless you use multi-leader or leaderless replication).
The hard problem: every write to the database must also be replicated to every replica. If replication were instantaneous and never failed, it would be trivial. The difficulties arise from handling change over time (schema evolution, fluctuating replication lag, and node failures) while data is being written.
The Three Replication Algorithms
Almost all replication systems use one of three fundamental approaches. Each makes different trade-offs between write availability, consistency, and operational complexity.
| Dimension | Single-Leader | Multi-Leader | Leaderless |
|---|---|---|---|
| Write availability | ❌ Single point of failure: leader must be reachable | ✅ Any region's leader accepts writes independently | ✅ A write succeeds once any w of n replicas accept it |
| Read availability | ✅ Any follower (may be stale) or leader (fresh) | ✅ Any follower or leader | ✅ Query r replicas, take the newest |
| Consistency strength | ✅ Strongest: total write order enforced by the leader | ⚠️ Weaker: conflicts possible; no global total order | ⚠️ Eventual: quorum overlap gives probabilistic freshness |
| Failover complexity | ⚠️ Leader election, split-brain risk, write-loss risk | ✅ Regions operate independently; no failover needed | ✅ No failover concept: the system degrades gracefully |
| Conflict handling | ✅ None: total order prevents conflicts | ⚠️ Must detect and resolve (LWW, CRDTs, manual) | ⚠️ Must detect and resolve (LWW, CRDTs) |
| Latency (multi-region) | ❌ All writes cross the inter-region link to the leader | ✅ Writes processed locally; inter-region replication is async | ✅ Writes go to local replicas; cross-region replication is async |
| Best suited for | OLTP, financial systems requiring strong consistency | Geo-distributed apps, offline-first / local-first software | High-availability key-value stores (Cassandra, Riak, ScyllaDB) |
Single-Leader Replication
In single-leader replication (also called active/passive or master/slave), one replica is designated the leader. All writes go to the leader, which applies them to its local storage and also sends them to followers via a replication log (or change stream). Followers apply the writes from the log in the same order, so that their local copy of the data eventually becomes identical to the leader's.
This is the most widely used approach: PostgreSQL, MySQL, MongoDB, Kafka, and most relational databases use some form of it.
Synchronous vs. Asynchronous Replication
The replication can be synchronous (the leader waits for the follower to acknowledge before confirming the write to the client) or asynchronous (the leader confirms immediately and replicates in the background). In practice, most setups are semi-synchronous: one follower is kept synchronous as a durability guarantee (if the leader fails, the synchronous follower has an up-to-date copy), while the rest are asynchronous. Fully synchronous replication is rare because a single unavailable follower would block all writes.
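As a rough illustration (not any real database's replication protocol), here is a minimal Python sketch of the semi-synchronous pattern: the leader blocks on a single designated synchronous follower and replicates to the others in the background. The `Follower` and `SemiSyncLeader` classes are hypothetical.

```python
import threading

class Follower:
    """Toy follower: applies writes to an in-memory log."""
    def __init__(self, name):
        self.name = name
        self.log = []

    def apply(self, entry):
        self.log.append(entry)  # in a real system: persist to disk, then acknowledge

class SemiSyncLeader:
    """Leader that waits only for its one synchronous follower."""
    def __init__(self, sync_follower, async_followers):
        self.sync_follower = sync_follower
        self.async_followers = async_followers
        self.log = []

    def write(self, entry):
        self.log.append(entry)            # 1. apply locally
        self.sync_follower.apply(entry)   # 2. block until the synchronous follower acks
        for f in self.async_followers:    # 3. replicate to the rest in the background
            threading.Thread(target=f.apply, args=(entry,)).start()
        return "committed"                # 4. confirm to the client

leader = SemiSyncLeader(Follower("sync-1"), [Follower("async-1"), Follower("async-2")])
print(leader.write({"key": "user:42", "value": "Alice"}))
```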
Setting Up New Followers
You can't simply copy files from the leader: clients are constantly writing to the database, so different parts of the copy would reflect the data at different points in time. The standard procedure is: take a consistent snapshot of the leader's data (without locking), copy it to the new follower, then connect the follower to the leader and replay all the replication log entries that occurred since the snapshot, identified by the log's position (log sequence number in PostgreSQL, binlog coordinates in MySQL, or a GTID).
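A minimal sketch of that bootstrap procedure, assuming the replication log is just an in-memory list whose length serves as the log position; the `Leader` class and `bootstrap_follower` function are illustrative, not any particular database's API.

```python
class Leader:
    def __init__(self):
        self.data = {}   # current key-value state
        self.log = []    # replication log: list of (key, value) entries

    def write(self, key, value):
        self.data[key] = value
        self.log.append((key, value))

    def snapshot(self):
        # Consistent snapshot plus the log position (LSN) it corresponds to.
        return dict(self.data), len(self.log)

def bootstrap_follower(leader):
    # 1. Copy a consistent snapshot and remember its log position.
    state, lsn = leader.snapshot()
    # (writes keep arriving at the leader while the copy is in progress)
    leader.write("k2", "written during copy")
    # 2. Replay every log entry recorded after the snapshot position.
    for key, value in leader.log[lsn:]:
        state[key] = value
    return state

leader = Leader()
leader.write("k1", "before snapshot")
print(bootstrap_follower(leader))   # follower has caught up, including k2
```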
A modern variant, pioneered by systems like WarpStream and Confluent Freight, uses object storage (S3) as the shared data layer, a zero-disk architecture (ZDA). There is no local disk replication between nodes; instead, all data is written to object storage, and "followers" are stateless compute nodes that pull from S3. This eliminates replication lag for durability at the cost of higher read latency (object storage is slower than local disk).
Failover: The Three Steps and Their Failure Modes
When the leader fails, a new one must be elected. The process has three stages: detect the failure (typically via a timeout, but how long? Too short causes unnecessary failovers; too long means prolonged unavailability), elect a new leader (usually the most up-to-date follower, determined by consensus), and reconfigure the system (all clients and followers must be pointed to the new leader).
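A sketch of the three stages under simplifying assumptions (heartbeat timestamps for failure detection, a replication log position per follower); the node dictionaries and the 30-second timeout are placeholders, not a real election protocol.

```python
import time

TIMEOUT_S = 30  # too short: spurious failovers; too long: prolonged unavailability

def detect_failure(leader, now):
    return now - leader["last_heartbeat"] > TIMEOUT_S

def elect_new_leader(followers):
    # Choose the most up-to-date follower (highest replication log position).
    return max(followers, key=lambda f: f["log_position"])

def reconfigure(clients, followers, new_leader):
    for node in clients + followers:
        node["leader"] = new_leader["name"]

leader = {"name": "node-a", "last_heartbeat": time.time() - 60}
followers = [
    {"name": "node-b", "log_position": 1041, "leader": "node-a"},
    {"name": "node-c", "log_position": 1037, "leader": "node-a"},
]
clients = [{"name": "app-1", "leader": "node-a"}]

if detect_failure(leader, time.time()):
    new_leader = elect_new_leader(followers)   # node-b: loses the least data
    reconfigure(clients, followers, new_leader)
print(clients[0]["leader"])  # -> node-b
```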
Failover is full of dangerous edge cases. A new leader may not have received all writes from the old leader; if those writes are discarded when the old leader rejoins, client-visible data is lost. In GitHub's 2012 incident, a promoted MySQL follower had fallen behind; when it became leader, its auto-increment counters were lower than those the old leader had already used. New rows were assigned IDs that had already been used by the old leader and were still referenced in a Redis cache serving the old data, causing users to see each other's private information.
The split-brain problem occurs when two nodes both believe they are the leader. If both accept writes, the data diverges with no mechanism to reconcile. The standard mitigation is STONITH (Shoot The Other Node In The Head): the new leader sends a signal to forcibly shut down the old leader before accepting writes. But this itself can fail, and configuring it safely requires careful operational discipline.
Replication Log Types
How does the leader communicate its writes to followers? Four main approaches have been used:
| Type | How it works | Problems | Verdict |
|---|---|---|---|
| Statement-based | Replicate the SQL INSERT/UPDATE/DELETE statement text itself | Non-deterministic functions (NOW(), RAND()); triggers; auto-increment with ordering dependencies | ❌ Mostly abandoned: too many edge cases |
| WAL shipping | Ship the storage engine's write-ahead log bytes to followers | Tightly coupled to storage-engine internals, preventing independent version upgrades on followers | ⚠️ Used by PostgreSQL/Oracle streaming replication; zero-downtime upgrades require logical replication |
| Logical (row-based) replication | Log each row mutation (insert: full row; update: old key plus new values; delete: primary key). Storage-engine independent | Slightly more verbose than WAL; must handle large transactions | ✅ Best general choice: decoupled from the storage engine, supports change data capture (CDC) |
| Trigger-based | Application-level triggers write change records to a separate table; an external process reads and replicates them | High overhead; bug-prone; captures only a subset of changes | ⚠️ Use only for selective replication or cross-database migrations |
Logical replication is the current best practice. Because it is decoupled from the storage engine's internal format, followers can run a different version of the database software, enabling zero-downtime rolling upgrades. Logical replication logs are also the foundation for change data capture (CDC): streaming database changes to derived systems like Elasticsearch, data warehouses, or caches as they happen.
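To make the row-based idea concrete, here is a sketch of what logical log records might look like and how a consumer (a follower or a CDC pipeline) applies them; the record shapes are illustrative rather than any specific database's wire format.

```python
# Illustrative logical-log records: storage-engine-independent row mutations.
log = [
    {"op": "insert", "table": "users", "row": {"id": 1, "name": "Ada"}},
    {"op": "update", "table": "users", "key": {"id": 1}, "changes": {"name": "Ada L."}},
    {"op": "delete", "table": "users", "key": {"id": 1}},
]

def apply(state, record):
    table = state.setdefault(record["table"], {})
    if record["op"] == "insert":
        table[record["row"]["id"]] = dict(record["row"])
    elif record["op"] == "update":
        table[record["key"]["id"]].update(record["changes"])
    elif record["op"] == "delete":
        del table[record["key"]["id"]]

# The same stream can feed a follower, a search index, or a cache (CDC).
follower_state = {}
for record in log:
    apply(follower_state, record)
print(follower_state)   # -> {'users': {}} after insert, update, delete
```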
Replication Lag and Consistency Anomalies
When a leader processes a write and confirms it to the client, but the followers haven't yet applied that write, the system is in a state of replication lag. Under normal operation, lag is a few milliseconds. Under heavy load or network issues, it can grow to minutes. The three classic anomalies that emerge from lag:
**Read-your-writes.** A user submits a form; the write goes to the leader. They immediately reload the page; the read goes to a follower that hasn't yet received the write. The user sees their submission disappear. Solution: always read the user's own data from the leader, or track the user's last write timestamp and refuse to serve their reads from replicas that are behind that timestamp.
**Monotonic reads.** A user refreshes a page twice; the first request goes to a replica with low lag (shows a comment), the second to a replica with high lag (the comment disappears). Time appears to run backward. Solution: ensure each user always reads from the same replica, for example by routing on a hash of the user ID rather than randomly.
**Consistent prefix reads.** In a sharded database, Mr. Poons asks "How far into the future can you see?" and Mrs. Cake answers "About ten seconds." An observer reading from different shards may see Mrs. Cake's answer before Mr. Poons's question, as if she answered before being asked. Solution: write causally related operations to the same shard, or use causal-dependency tracking algorithms.
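A sketch of the first two fixes above, assuming a hypothetical `Replica` object that tracks how far it has caught up: read-your-writes by comparing against the user's last write timestamp, and monotonic reads by hashing the user ID to a fixed replica.

```python
import hashlib
import time

class Replica:
    def __init__(self, name, applied_up_to=0.0):
        self.name = name
        self.applied_up_to = applied_up_to   # timestamp of the newest write applied

last_write_ts = {}   # user_id -> timestamp of that user's last write

def record_write(user_id):
    last_write_ts[user_id] = time.time()

def pick_replica_read_your_writes(user_id, leader, replicas):
    # Serve from a replica only if it has caught up past the user's last write.
    cutoff = last_write_ts.get(user_id, 0.0)
    candidates = [r for r in replicas if r.applied_up_to >= cutoff]
    return candidates[0] if candidates else leader

def pick_replica_monotonic(user_id, replicas):
    # Always route the same user to the same replica so time never runs backward.
    idx = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % len(replicas)
    return replicas[idx]

leader = Replica("leader", applied_up_to=time.time())
replicas = [Replica("r1"), Replica("r2")]
record_write("user-7")
print(pick_replica_read_your_writes("user-7", leader, replicas).name)  # leader: no replica caught up yet
replicas[0].applied_up_to = time.time()                                # r1 catches up
print(pick_replica_read_your_writes("user-7", leader, replicas).name)  # r1
print(pick_replica_monotonic("user-7", replicas).name)                 # always the same replica for user-7
```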
Consistency Models: A Spectrum
The three anomalies above are each addressed by a named consistency model. Together with eventual consistency and linearizability, they form a spectrum from weakest to strongest:
| Model | Guarantee | Strength | When it's enough |
|---|---|---|---|
| Eventual consistency | If no new writes are made, all replicas will eventually converge to the same value | Weakest | Analytics, caches, social media likes/view counts |
| Read-your-writes (read-after-write) | A user always sees their own writes immediately after submitting them | Moderate | User profile pages, form submissions, comment threads |
| Monotonic reads | A user will never see time go backward: once they've seen a value, they won't see an older one | Moderate | Activity feeds, comment sections, paginated data |
| Consistent prefix reads | Reads see writes in causal order: no seeing an answer before the question | Moderate | Sharded databases with causally related writes across shards |
| Linearizability (strong consistency) | The system behaves as if there is only one replica: all reads return the most recently written value | Strongest | Bank balances, unique username registration, leader election |
Multi-Leader Replication
Multi-leader replication (also called active/active or bidirectional replication) allows more than one node to accept writes. Each leader simultaneously acts as a follower to the other leaders. The primary motivation: in a single-leader setup, all writes must cross the network to reach the leader; in a geo-distributed system, this adds significant latency and creates a write-availability dependency on a single region.
In a multi-leader geo-distributed setup, each data centre has a leader. Within each data centre, standard leader/follower replication is used. Between data centres, each region's leader asynchronously replicates changes to the leaders of the other regions, hiding inter-region latency from the write path entirely.
When Multi-Leader Makes Sense
Multi-leader replication is appropriate in three main scenarios:
- **Geo-distributed server systems:** MySQL, Oracle, SQL Server, and YugabyteDB support multi-leader replication as an external add-on or built-in feature.
- **Offline-first / local-first software:** your phone's calendar app is a multi-leader system; the device is a "region", offline writes are accepted, and sync happens when connectivity is restored. Tools like PouchDB/CouchDB, Automerge, and Yjs implement this pattern.
- **Real-time collaborative editing:** Google Docs, Figma, and Linear treat each open browser tab as a leader; changes are broadcast immediately without waiting for a server round-trip, then reconciled asynchronously. The library implementing this is called a sync engine.
Multi-Leader Topologies
With multiple leaders, writes must propagate across all of them. The communication paths form a replication topology. A circular topology passes writes from node to node around a ring; a star topology routes all writes through a central root node; an all-to-all topology sends each write directly to every other leader. Circular and star topologies are more fragile: a single node failure can interrupt the replication chain. All-to-all is more resilient but can cause write-ordering issues when faster network links let update messages "overtake" the inserts they depend on. Addressing this requires version vectors to track causal ordering.
Dealing with Write Conflicts
The fundamental challenge of multi-leader replication: if two leaders accept concurrent writes to the same record, a conflict arises. This problem does not exist in single-leader replication because the total order of writes is enforced by the leader. Here are the main resolution strategies:
| Strategy | How it works | Trade-off | Used in |
|---|---|---|---|
| Conflict avoidance | Route all writes for a given record through the same leader, so multi-leader behaves like single-leader for that record | Breaks when you need to change the designated leader (region failure, user migration) | Geo-distributed databases with a "home region" per user |
| Last Write Wins (LWW) | Attach a timestamp to each write; always keep the one with the highest timestamp, resolving concurrent writes by discarding the "losers" (sketched after this table) | ⚠️ Data loss: successfully committed writes are silently discarded. Sensitive to clock skew; use logical clocks to mitigate | Cassandra (default), ScyllaDB; acceptable only for immutable key-value data |
| Manual / sibling merge | Keep all concurrent values as siblings; return them all on read; let application code or the user resolve them | API complexity; risky if merged carelessly (Amazon shopping-cart ghost-item bug: deleted items reappeared) | CouchDB; any system where data loss is unacceptable |
| Automatic merge (CRDTs) | Conflict-free Replicated Data Types: mathematical data structures (counters, sets, maps, text) that merge deterministically regardless of operation order | Limited to CRDT-compatible data types; some operations (delete) require tombstones | Riak (sets/maps), Redis Enterprise, Automerge/Yjs (collaborative text editing) |
| Operational Transformation (OT) | Record operations at an index; when replicas exchange operations, transform each operation's index to account for concurrent operations already applied | Complex to implement correctly; significant algorithmic subtlety | Google Docs (real-time collaborative text editing), ShareDB |
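A minimal sketch contrasting two strategies from the table above: a last-write-wins register, which silently discards the losing concurrent write, and a grow-only counter CRDT, whose element-wise-max merge loses nothing. Both classes are illustrative, not any library's API.

```python
class LWWRegister:
    """Last write wins: keep only the value with the highest timestamp."""
    def __init__(self, value=None, ts=0):
        self.value, self.ts = value, ts

    def merge(self, other):
        # The concurrent write with the lower timestamp is silently discarded.
        if other.ts > self.ts:
            self.value, self.ts = other.value, other.ts

class GCounter:
    """Grow-only counter CRDT: per-replica counts, merge = element-wise max."""
    def __init__(self):
        self.counts = {}

    def increment(self, replica_id, amount=1):
        self.counts[replica_id] = self.counts.get(replica_id, 0) + amount

    def merge(self, other):
        for rid, count in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), count)

    def value(self):
        return sum(self.counts.values())

# Two concurrent LWW writes: the lower-timestamp value is lost.
x, y = LWWRegister("from-leader-1", ts=10), LWWRegister("from-leader-2", ts=11)
x.merge(y)
print(x.value)     # 'from-leader-2': the ts=10 write was dropped

# Two replicas increment concurrently; merging in either order gives the same total.
a, b = GCounter(), GCounter()
a.increment("replica-a", 3)
b.increment("replica-b", 5)
a.merge(b)
print(a.value())   # 8: nothing lost, regardless of merge order
```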
Leaderless Replication (Dynamo-Style)
Leaderless replication abandons the concept of a leader entirely. Any replica can accept writes directly from clients. Amazon's internal Dynamo system pioneered this approach in 2007; Riak, Cassandra, and ScyllaDB are the major open-source implementations (collectively called "Dynamo-style" databases). Note: AWS's DynamoDB is a different, confusingly named product that uses single-leader replication.
Quorum Reads and Writes
In a leaderless system, writes are sent to all n replicas in parallel. The write is considered successful when w replicas acknowledge it. Reads are sent to all n replicas; the client waits for r responses and uses the one with the highest version number (timestamp). As long as w + r > n, the sets of nodes written to and read from must overlap by at least one node, so reads are guaranteed to see the most recent write.
| n, w, r | Can tolerate | Notes |
|---|---|---|
| n=3, w=2, r=2 | 1 unavailable node | Typical setup: majority quorums, tolerates 1 failure |
| n=5, w=3, r=3 | 2 unavailable nodes | Higher durability; slower writes (must wait for 3) |
| n=3, w=3, r=1 | 0 for writes, 2 for reads | Maximum-durability writes, fast reads; fragile write path |
| n=3, w=1, r=3 | 2 for writes, 0 for reads | Fast writes, always-consistent reads; leader-like read behaviour |
| n=5, w=5, r=1 | 0 for writes | Synchronous writes to all replicas: maximum consistency, zero fault tolerance on writes |
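A sketch of the quorum mechanics behind the table above, with replicas modelled as plain dictionaries (purely illustrative): a write succeeds after w acknowledgements, and a read takes the highest version among r responses.

```python
def quorum_write(replicas, key, value, version, w):
    acks = 0
    for rep in replicas:                            # sent to all n replicas (in parallel in practice)
        if rep.get("up", True):
            rep.setdefault("data", {})[key] = (value, version)
            acks += 1
    return acks >= w                                # success once w replicas acknowledge

def quorum_read(replicas, key, r):
    responses = []
    for rep in replicas:
        if rep.get("up", True) and key in rep.get("data", {}):
            responses.append(rep["data"][key])
        if len(responses) >= r:
            break
    # Use the value with the highest version number.
    return max(responses, key=lambda vv: vv[1]) if responses else None

n = 3
replicas = [{"up": True} for _ in range(n)]
assert quorum_write(replicas, "k", "v1", version=1, w=2)
replicas[0]["up"] = False                           # one node goes down: still within w + r > n
print(quorum_read(replicas, "k", r=2))              # ('v1', 1)
```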
You may also deliberately set w + r ≤ n to allow lower latency at the cost of stale reads, which is useful in high-throughput, availability-first workloads. When a large-scale network interruption prevents quorum formation, some databases offer a sloppy quorum option: any reachable replica can accept writes, even if it is not one of the usual n replicas for that key. This improves write availability at the cost of weakening the quorum freshness guarantee. When the unavailable nodes return, a hinted handoff transfers the stored writes to the correct replicas.
Catching Up on Missed Writes
When a node comes back online after a failure, it needs to catch up on writes it missed. Dynamo-style databases use three mechanisms:
- **Read repair:** when a client reads from multiple replicas and detects that one returned a stale value, it writes the newer value back to that replica (sketched below). Works well for frequently read keys.
- **Hinted handoff:** replicas that temporarily stored writes for an unavailable node forward those writes when the node recovers, then delete the hints. Handles infrequently read keys.
- **Anti-entropy:** a background process continuously compares replicas and copies any missing data between them; it does not copy writes in any particular order and may lag significantly.
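A sketch of the read-repair path, again with replicas as plain dictionaries: the client keeps the newest (value, version) pair it sees and writes it back to any replica that returned a stale or missing copy.

```python
def read_with_repair(replicas, key):
    # Collect responses from every reachable replica.
    responses = [(rep, rep["data"].get(key)) for rep in replicas if rep.get("up", True)]
    versions = [vv for _, vv in responses if vv is not None]
    if not versions:
        return None
    newest_value, newest_version = max(versions, key=lambda vv: vv[1])
    # Write the newest value back to any replica with a stale (or missing) copy.
    for rep, vv in responses:
        if vv is None or vv[1] < newest_version:
            rep["data"][key] = (newest_value, newest_version)
    return newest_value

replicas = [
    {"up": True, "data": {"k": ("old", 1)}},
    {"up": True, "data": {"k": ("new", 2)}},
    {"up": True, "data": {}},
]
print(read_with_repair(replicas, "k"))   # 'new'
print(replicas[0]["data"]["k"])          # ('new', 2): the stale replica was repaired
```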
Limitations of Quorum Consistency
Even with w + r > n, quorum consistency has important edge cases that prevent it from being a strong guarantee: a node carrying a new value may fail and be restored from an old replica (reducing the count of up-to-date nodes below w); a rebalancing operation in progress can cause read and write quorums to no longer overlap; a write that succeeded on fewer than w replicas before an error is not rolled back on the replicas where it did succeed, causing inconsistency. Dynamo-style databases are fundamentally optimised for availability and eventual consistency; the n/w/r parameters tune the probability of reading stale data but do not eliminate it.
Detecting Concurrent Writes: Version Vectors
To resolve conflicts correctly, the system must distinguish between "A happened before B" (B can safely overwrite A) and "A and B are concurrent" (they must be merged). Two operations are concurrent if neither knows about the other; this is independent of whether they literally happened at the same time (clocks can't be trusted; see Ch.9).
The algorithm: each server maintains a version number per key, incremented on every write. Clients must read before writing โ a write must include the version number from the prior read. The server uses this to determine: values with version numbers at or below the submitted version can be overwritten (they've been "seen" by this write); values with higher version numbers must be kept as siblings (they're concurrent). Clients receive all siblings when reading and must merge them before writing.
With multiple replicas, a single version number is insufficient. Each replica maintains its own version number per key. The collection of all per-replica version numbers is a version vector (the dotted version vector used in Riak 2.0 is a variant, distinct from but related to the older vector clock). Version vectors are sent from replicas to clients on reads and must be sent back to the database on writes, allowing the system to identify which values can be safely overwritten and which are concurrent and must be kept as siblings.
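A sketch of version vector comparison, representing each vector as a dict from replica ID to counter (a common textbook representation; real systems such as Riak use more compact encodings).

```python
def dominates(a, b):
    """True if version vector a has seen everything b has seen."""
    return all(a.get(replica, 0) >= counter for replica, counter in b.items())

def compare(a, b):
    if dominates(a, b) and not dominates(b, a):
        return "a supersedes b"
    if dominates(b, a) and not dominates(a, b):
        return "b supersedes a"
    if dominates(a, b) and dominates(b, a):
        return "equal"
    return "concurrent: keep both as siblings and merge"

# b was built on top of a (a happened before b): b can safely overwrite a.
print(compare({"r1": 2, "r2": 1}, {"r1": 3, "r2": 1}))   # b supersedes a
# Neither has seen the other's latest write: concurrent.
print(compare({"r1": 3, "r2": 1}, {"r1": 2, "r2": 2}))   # concurrent
```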
Single-Leader vs. Leaderless: Performance Trade-offs
Single-leader replication can provide strong consistency and ACID transactions, but has three performance liabilities: read throughput is bounded by leader capacity; leader failure causes downtime during failover; and the system is sensitive to leader performance degradation (a slow leader immediately affects all users). Leaderless systems are more resilient because there is no failover and requests go to multiple replicas in parallel, so one slow replica has minimal impact since the client uses the fastest responses (request hedging). This significantly reduces tail latency. However, leaderless systems have their own performance costs: larger quorums increase the chance of hitting a slow replica; the hinted handoff process creates additional load on recovering nodes; and large-scale network interruptions may prevent quorum formation entirely.
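A sketch of request hedging with a simulated `query_replica` of variable latency (both names are illustrative): the client queries every replica concurrently and returns as soon as the fastest r responses arrive, so the slowest replica does not set the latency.

```python
import concurrent.futures
import random
import time

def query_replica(name):
    # Simulated replica with variable latency; any one replica may be slow.
    time.sleep(random.uniform(0.01, 0.2))
    return name, ("value", 1)

def hedged_read(replica_names, r=2):
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=len(replica_names))
    futures = [pool.submit(query_replica, name) for name in replica_names]
    results = []
    # Take the first r responses; slower replicas are ignored.
    for fut in concurrent.futures.as_completed(futures):
        results.append(fut.result())
        if len(results) >= r:
            break
    pool.shutdown(wait=False)   # don't block on the slowest replica
    return results

print(hedged_read(["r1", "r2", "r3"]))
```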
Six Mental Models for Replication
1. Followers replicate asynchronously: a read on a lagging follower can return a value from seconds or minutes ago. The three lag anomalies (read-your-writes, monotonic reads, consistent prefix reads) are all variants of "time appears to go backward." Choose a database that provides the consistency level you need, or route reads carefully.
2. Leader election has three steps (detect → elect → reconfigure), each with failure modes. The GitHub incident (MySQL auto-increment key collision after failover) shows that discarding a failed leader's committed writes can corrupt application-level invariants. Split brain plus STONITH is the distributed-systems equivalent of a fire drill that burns down the building.
3. The moment two leaders can independently accept writes to the same record, conflicts become possible. There is no way around this; it is a fundamental limitation of distributed systems. The CAP theorem formalises this. Multi-leader gives you write availability across network partitions at the cost of needing conflict resolution.
4. Even with w + r > n, edge cases exist: a node carrying a new value can fail and be restored from an old replica; a rebalance in progress can make quorums not overlap; a concurrent read/write can produce non-monotonic results. Dynamo-style databases are optimised for availability; don't treat them as strongly consistent.
5. Conflict-free Replicated Data Types guarantee that all replicas converge to the same state regardless of operation arrival order, without requiring coordination. The key insight: design data types whose merge function is commutative, associative, and idempotent. Counters, sets, maps, and even collaborative text can be modelled as CRDTs.
6. To distinguish "A happened before B" from "A and B are concurrent," each replica maintains a per-key version number. The collection across all replicas is a version vector (related to but subtly different from a vector clock). When one version vector dominates the other, the later operation supersedes the earlier one; when neither dominates, the writes are concurrent and must be merged.
Part of the Designing Data-Intensive Applications series