The Trouble with Distributed Systems: Everything That Can Go Wrong
The distributed systems used in data infrastructure are almost always built on shared, asynchronous networks. The fundamental property of such networks is this: when you send a message and wait for a response, you cannot distinguish between the following failure scenarios. From the sender's perspective, they all look identical.
Ch. 9 of DDIA 2nd ed. catalogs the sources of unreliability in distributed systems: network faults (packets delayed, lost, or reordered), unreliable clocks (drift, leap seconds, clock virtualization in VMs, process pauses), the impossibility of distinguishing network failures from node crashes, the majority-vote principle, fencing tokens for zombie prevention, and system models (synchronous / partially synchronous / asynchronous; crash-stop / crash-recovery / Byzantine).
Unreliable Networks
1. Your request never reached the destination. The packet was dropped by a switch, router, or NIC. The receiver never knew you asked.
2. The request arrived at the network queue or remote node but hasn't been processed yet; the receiver is busy or backlogged.
3. The remote node went down (hardware failure, OOM kill, power cut, OS crash). It never received or processed the request.
4. The remote node processed the request successfully and sent a response, but the response packet was dropped on the way back.
5. The response was sent but is queued somewhere in the network. It may still arrive, possibly after your timeout fired and you've already retried.
The Fundamental Problem
From the sender's perspective, all five scenarios look identical: a timeout. When a timeout occurs, you cannot know whether your request was received, processed, or applied. Retrying may cause duplicate execution; not retrying may leave work undone. Idempotent operations (operations that have the same effect whether applied once or several times) are one design response to this uncertainty, as the sketch below illustrates.
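One common way to make retries safe is a client-generated idempotency key. The sketch below is a minimal illustration in Python using the `requests` library; the `Idempotency-Key` header and the server-side deduplication contract are assumptions for this example, not features of any specific system discussed here.

```python
import uuid
import requests  # any HTTP/RPC client has the same ambiguity on timeout

def post_with_retries(url, payload, attempts=3, timeout=2.0):
    """Retry a request safely by attaching a client-generated idempotency key.

    Assumed server contract: if it has already processed a request with this
    key, it returns the stored result instead of executing the work again.
    """
    idempotency_key = str(uuid.uuid4())  # the SAME key is reused on every retry
    for _ in range(attempts):
        try:
            resp = requests.post(
                url,
                json=payload,
                headers={"Idempotency-Key": idempotency_key},
                timeout=timeout,
            )
            return resp.json()
        except requests.exceptions.Timeout:
            # A timeout tells us nothing: the server may or may not have run
            # the request. Reusing the key makes the retry safe either way.
            continue
    raise TimeoutError(f"no response after {attempts} attempts")
```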
Timeouts and Adaptive Timeout Selection
How long should a timeout be? Too short: declare healthy nodes dead unnecessarily, triggering expensive failovers and split-brain risks. Too long: the system is slow to detect genuine failures and respond.
TCP's flow control and congestion control deliberately slow down senders when the network is overloaded, adding latency variance. A good timeout strategy therefore adapts to the observed latency distribution rather than using a fixed constant. Systems like Cassandra and Akka use the Phi Accrual failure detector, which tracks response times and their variability (e.g., with an exponentially weighted moving average) and raises suspicion only when the current silence is improbable under that distribution.
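As a concrete illustration, here is a minimal EWMA-based timeout estimator in the spirit of TCP's retransmission timer (smoothed RTT plus a multiple of the smoothed deviation). The class name is made up; the constants are the classic Jacobson/Karels values.

```python
class AdaptiveTimeout:
    """EWMA timeout estimator in the spirit of TCP's retransmission timer."""

    def __init__(self, initial_rtt=0.5, alpha=0.125, beta=0.25, k=4.0):
        self.srtt = initial_rtt        # smoothed round-trip time (seconds)
        self.rttvar = initial_rtt / 2  # smoothed mean deviation
        self.alpha, self.beta, self.k = alpha, beta, k

    def observe(self, rtt):
        """Fold one measured round-trip time into the running estimates."""
        deviation = abs(rtt - self.srtt)
        self.rttvar = (1 - self.beta) * self.rttvar + self.beta * deviation
        self.srtt = (1 - self.alpha) * self.srtt + self.alpha * rtt

    def timeout(self):
        """Current timeout: the smoothed mean plus k smoothed deviations."""
        return self.srtt + self.k * self.rttvar
```

A fixed timeout is either too aggressive on a congested network or too sluggish on a quiet one; an estimator like this tightens and loosens automatically as observed latency shifts.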
Network delays in shared cloud environments are highly variable. A packet queued behind others at an overloaded switch may wait tens of milliseconds before being forwarded. This unbounded queueing makes it impossible to guarantee a maximum network latency without dedicated hardware (financial trading systems use kernel-bypass networking to achieve microsecond-level predictability).
Unreliable Clocks
Clocks seem simple and fundamental, but in distributed systems they are deeply treacherous. Each machine has a local quartz oscillator clock that drifts over time. NTP attempts to correct this drift by syncing to reference servers, but NTP accuracy is limited by network round-trip time, typically 10-100 ms on the public internet.
Two types of clocks serve different purposes. Time-of-day clocks (wall clocks) measure absolute calendar time; they can jump backward due to NTP corrections and are not monotonic. Monotonic clocks measure elapsed time since an arbitrary epoch; they always move forward and are safe for measuring durations, but meaningless for comparing timestamps across machines.
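In Python the two clock families are exposed separately; this small sketch shows which one is safe for durations (the sleep is a stand-in workload):

```python
import time

# Wall-clock time: roughly comparable across machines, but it can jump
# backward when NTP steps the clock.
started_wall = time.time()

# Monotonic time: meaningless as an absolute value, but guaranteed never
# to go backward within this process, so it is safe for durations.
started_mono = time.monotonic()

time.sleep(0.1)  # stand-in for real work

elapsed = time.monotonic() - started_mono  # correct duration measurement
naive = time.time() - started_wall         # can be negative after a clock step!
print(f"elapsed={elapsed:.3f}s naive={naive:.3f}s")
```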
Time-of-Day Clock Drift
Quartz oscillators drift; Google assumes a drift of up to 200 ppm, roughly 17 seconds per day if the clock is never corrected. NTP sync can then jump the clock forward or backward, violating monotonicity.
CONSEQUENCE
Timestamps from two machines cannot be compared reliably. LWW (Last Write Wins) conflict resolution based on wall-clock timestamps will silently drop writes.
EXAMPLE
Client B's write (causally later) gets an earlier timestamp than Client A's write due to 3 ms clock skew, so A's write incorrectly wins under LWW and B's update is silently dropped.
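A toy reproduction of this failure mode, with the clock skew injected explicitly (the 3 ms figure and the dictionary layout are made up for illustration):

```python
# Toy last-write-wins merge keyed on wall-clock timestamps.
def lww_merge(*versions):
    """Keep the version with the largest timestamp."""
    return max(versions, key=lambda v: v["ts"])

# Client A writes at true time t = 100.000 on a well-synced node.
write_a = {"value": "set by A", "ts": 100.000}

# Client B writes AFTER reading A's value (true time t = 100.002), but its
# clock runs 3 ms slow, so it stamps the write with 99.999.
write_b = {"value": "set by B", "ts": 100.002 - 0.003}

print(lww_merge(write_a, write_b))  # keeps A's value: B's update is silently lost
```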
Leap Seconds
Occasionally a minute has 59 or 61 seconds to keep atomic time aligned with the Earth's rotation. NTP servers may deliberately "lie" around a leap second, smearing the extra second over many hours instead of inserting it all at once.
CONSEQUENCE
Systems that don't handle leap seconds correctly have crashed at scale (Linux, Java, AWS, and Cloudflare have all had incidents). Leap seconds are scheduled to be abolished from 2035 onwards.
EXAMPLE
Systems that assumed a day always has 86,400 seconds have been bitten: the 2012 Linux kernel leap-second incident caused CPU spikes due to a timer bug.
Virtual Machine Clock Virtualization
VMs can be paused and resumed at any time (live migration, hypervisor CPU scheduling). While a VM is paused, the physical CPU and clock keep serving other VMs; the paused guest executes nothing, but real time keeps passing.
CONSEQUENCE
From the VM's perspective, the clock suddenly jumps forward on resume. Any timeout, heartbeat, or lease check that straddles the pause is based on stale clock readings.
EXAMPLE
A VM paused for 15 seconds resumes and acts on a lease check that passed 15 seconds ago; it may proceed thinking the lease is still valid when it has long since expired.
Clock Confidence Intervals
clock_gettime() returns a single value, but the actual time is only known within a confidence interval (NTP uncertainty + quartz drift).
CONSEQUENCE
Systems that compare timestamps without accounting for uncertainty may order events incorrectly. Google's TrueTime API returns [earliest, latest] to expose the interval.
EXAMPLE
Google Spanner: if two transaction confidence intervals don't overlap, the order is certain. If they overlap, Spanner waits for the interval to pass before committing.
Process Pauses Mimicking Clock Issues
Garbage collection, OS context switches, disk I/O, memory paging, hypervisor preemption: any of these can pause a thread for seconds or even minutes, unpredictably.
CONSEQUENCE
A node that was healthy 15 seconds ago may have been silently paused during that time, making any time-based reasoning (lease validity, heartbeats) unreliable.
EXAMPLE
"Stop-the-world" GC pauses used to last several minutes. Modern GC is much better, but ms-scale pauses still occur even with concurrent collectors.
Google Spanner TrueTime: Treating Clock Uncertainty as a First-Class Concern
Spanner uses GPS receivers and atomic clocks in each datacenter to keep global clock uncertainty to roughly 7 ms. Its TrueTime API returns [earliest, latest] instead of a single timestamp. If two transactions have non-overlapping confidence intervals, their order is certain. If they overlap, Spanner deliberately waits for the confidence interval to pass before committing, ensuring transaction timestamps reflect causality. This "commit wait" is why Spanner's write latency is bounded by GPS/atomic clock uncertainty, not just network latency.
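A schematic of the commit-wait rule, using a made-up TrueTime-style interval type (this illustrates the idea; it is not Spanner's actual API):

```python
import time
from dataclasses import dataclass

@dataclass
class TTInterval:
    """TrueTime-style bound: true time is guaranteed to lie in [earliest, latest]."""
    earliest: float
    latest: float

def now_interval(uncertainty=0.007):
    """Pretend TrueTime: local wall clock +/- an assumed 7 ms uncertainty."""
    t = time.time()
    return TTInterval(t - uncertainty, t + uncertainty)

def commit(txn):
    """Commit wait: assign a timestamp, then wait out the uncertainty.

    After the wait, every node's clock is guaranteed to read later than the
    chosen timestamp, so any subsequent transaction gets a strictly larger one.
    """
    ts = now_interval().latest           # assign the commit timestamp
    while now_interval().earliest < ts:  # wait until ts is definitely in the past
        time.sleep(0.001)
    txn["commit_ts"] = ts
    return txn
```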
Process Pauses and the Zombie Problem
A node in a distributed system can be paused at any point in its execution (garbage collection, hypervisor scheduling, OS context switches, disk I/O, memory paging) and may not even know it was paused. While it sleeps, the rest of the world moves on: leases expire, new leaders are elected, its "valid" lock token is revoked.
When the paused node resumes, it believes no time has passed. If it was holding a distributed lock or lease, it may proceed to use data that it no longer has exclusive access to. This is called the zombie problem: a formerly-valid leaseholder acting as if its lease is still valid.
The Zombie Scenario
Client 1 holds a lease. It pauses for 15 seconds (GC). The lease expires. Client 2 acquires the lease and starts writing. Client 1 resumes, checks its (still locally valid) lease, and also writes to the same resource. Both writes succeed, and the file is now corrupted. HBase had exactly this bug.
The Fencing Token Solution
Each time a lock is granted, the lock service issues a monotonically increasing fencing token (ZooKeeper: zxid or cversion; Kafka: epoch number). Clients must include the token in write requests. The storage service rejects any write carrying a lower token than the highest it has seen, even if the client believes its lease is still valid. Stale writes from zombies are thus rejected.
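A minimal sketch of the storage-side check (class and method names are illustrative; in practice ZooKeeper's zxid or Kafka's epoch number would play the role of `token`):

```python
class FencedStore:
    """Toy storage service that rejects writes carrying a stale fencing token."""

    def __init__(self):
        self.max_token_seen = -1
        self.data = {}

    def write(self, key, value, token):
        # Reject anything older than the newest token ever seen, even if the
        # client sincerely believes its lease is still valid.
        if token < self.max_token_seen:
            raise PermissionError(f"fencing token {token} is stale")
        self.max_token_seen = token
        self.data[key] = value

store = FencedStore()
store.write("file", "from client 1", token=33)  # client 1 writes, then pauses
store.write("file", "from client 2", token=34)  # client 2 holds the new lease
try:
    store.write("file", "zombie write", token=33)  # client 1 resumes, stale
except PermissionError as e:
    print(e)  # fencing token 33 is stale
```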
STONITH: Shoot The Other Node In The Head
Some systems attempt to fence zombies by forcibly shutting down the potentially-stale node (via network disconnection or power cutoff). This is called STONITH. Unfortunately it has problems: it doesn't protect against large network delays (the delayed write may arrive after the STONITH action), it can cause mutual STONITH (both nodes shut each other down), and by the time the zombie is killed the data may already be corrupted. Fencing tokens are a more robust and widely applicable alternative.
System Models and Fault Assumptions
To reason about and prove properties of distributed algorithms, we need to formalize what we assume about the system. System models define the bounds on network timing; fault models define how nodes can fail. Real systems are not designed for worst-case models but for the most realistic one.
Timing Models
| System Model | Network Delay | Process Pauses | Clock Error | Realistic? | Algorithm Design |
|---|---|---|---|---|---|
| Synchronous System | Bounded: maximum delay known | Bounded: maximum pause duration known | Bounded: maximum drift known | No: not realistic for shared networks (internet, Ethernet, cloud) | Simplest to design algorithms for; timeouts can be used with certainty |
| Partially Synchronous System | Usually bounded, occasionally unbounded | Usually bounded, occasionally unbounded | Usually bounded, occasionally unbounded | Yes: the realistic model for most distributed systems | Most practical algorithms target this; timeouts work most of the time |
| Asynchronous System | Unbounded: no timing assumptions | Unbounded | Unbounded | No: overly conservative; few systems are truly asynchronous | Hardest to design for; the FLP result proves consensus is impossible in this model without randomization |
Fault Models
Crash-stop (fail-stop)
A node either works correctly or crashes and never responds again. Dead nodes stay dead. Detection is possible (node goes silent).
Used in: Simplest model for algorithm analysis
Crash-recovery
Nodes can crash and restart. They may forget in-memory state (must be recovered from disk). Node is offline while crashed, then resumes.
Used in: Most consensus algorithms (Raft, Paxos) target this model
Byzantine (arbitrary)
Nodes can behave arbitrarily, including sending deliberately incorrect, contradictory, or malicious messages. The "Byzantine Generals Problem."
Used in: Blockchain consensus, aircraft flight computers, Byzantine Fault Tolerant (BFT) algorithms
Byzantine Faults in Practice
In internal datacenter systems, Byzantine faults are rare (hardware bugs do occasionally corrupt data; checksums catch most of these). In public systems (blockchains, peer-to-peer networks), adversarial nodes are a real concern. Byzantine Fault Tolerant (BFT) consensus algorithms like PBFT can tolerate up to ⌊(n-1)/3⌋ Byzantine nodes, but they are significantly more complex and expensive than crash-recovery algorithms like Raft. Most distributed databases and consensus systems (Raft, Paxos, ZooKeeper, etcd) assume non-Byzantine (crash-recovery) faults only.
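The arithmetic behind that bound, as a quick sanity check (tolerating f Byzantine nodes requires n = 3f + 1 nodes in PBFT-style protocols):

```python
def max_byzantine_tolerated(n):
    """PBFT-style bound: at most floor((n - 1) / 3) Byzantine nodes."""
    return (n - 1) // 3

for n in (4, 7, 10):
    f = max_byzantine_tolerated(n)
    print(f"n={n}: tolerates f={f} Byzantine nodes (quorum size {2 * f + 1})")
# n=4 tolerates f=1, n=7 tolerates f=2, n=10 tolerates f=3
```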
Knowledge, Truth, and the Majority Rules
A node in a distributed system cannot know anything for certain about other nodes โ it can only make inferences from the messages it receives (or doesn't receive). A remote node not responding does not mean it is dead. A node that believes it is the leader may have been voted out.
This epistemic limitation is not an engineering failure; it is a fundamental property of asynchronous systems. The solution is quorums: require a minimum number of nodes to agree before acting. A majority quorum (more than half the nodes) ensures there can only ever be one quorum making a decision at a time, because any two majorities must share at least one node and therefore cannot decide independently.
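The intersection argument can be checked exhaustively for a small cluster; here is a toy enumeration over five nodes:

```python
from itertools import combinations

nodes = {"A", "B", "C", "D", "E"}
majority = len(nodes) // 2 + 1  # 3 out of 5

# Every pair of majority quorums shares at least one node, so two disjoint
# groups of nodes can never both reach a majority decision.
for q1 in combinations(nodes, majority):
    for q2 in combinations(nodes, majority):
        assert set(q1) & set(q2), "two disjoint majorities would break safety"
print("every pair of majority quorums intersects")
```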
This has a philosophical implication: a node must abide by the quorum decision even if it disagrees. If a quorum of nodes declares node X dead, X must step down, even though X is running perfectly fine. Truth in distributed systems is a social construct determined by majority vote.
Mental Models
A Timeout is Not a Death Certificate
When a remote node doesn't respond within your timeout, you cannot distinguish: node crashed, request lost, response lost, response delayed, network partitioned. The node may be alive and may have already executed your request. Timeouts are detection heuristics, not ground truth.
The Network Has No Shared Clock
Two messages sent from different machines with timestamps cannot be reliably ordered. Even with NTP, clock skew of tens of milliseconds is normal. Last-write-wins conflict resolution based on wall-clock timestamps silently loses writes.
Process Pauses Are Silent
A thread that was running a moment ago may have been paused for 15 seconds due to GC, hypervisor migration, or OS scheduling, and it has no idea it was paused. Every assumption it made before the pause (lease validity, lock ownership, connection liveness) may now be invalid.
The Majority Rules, Even Against the Individual Node
A node cannot trust its own judgment about whether it is the leader, whether it is alive, or whether its lease is valid. Distributed algorithms rely on quorums: if a majority votes you dead, you must step down, even if you disagree. "The semi-disconnected node is dragged to the graveyard, kicking and screaming."
Fencing Tokens Over Leases
Distributed locks and leases are prone to the zombie problem: a formerly-valid leaseholder resumes after a pause, believing its lease is still valid. Fencing tokens (monotonically increasing numbers granted with each lock) allow downstream services to reject stale operations, even from legitimate but paused nodes.
Truth is a Social Construct in Distributed Systems
What is "true" in a distributed system is determined by majority vote. A node cannot independently determine facts about the world โ it can only observe messages. Reliable behavior requires assuming the underlying infrastructure can fail in surprising ways and designing algorithms that remain correct despite those failures.
DESIGNING DATA-INTENSIVE APPLICATIONS · 2ND EDITION
This article is part of an ongoing series distilling DDIA 2nd ed. into structured, audience-aware reading notes. The series covers storage engines, replication, sharding, transactions, distributed systems, and stream processing.