
Defining Nonfunctional Requirements

A structured look at performance metrics, reliability engineering, and scalability trade-offs, grounded in the Twitter home timeline case study from Designing Data-Intensive Applications Ch. 2.

Functional vs Nonfunctional Requirements

Functional requirements define the inputs, outputs, and transformations a system must perform. Nonfunctional requirements constrain how it performs them: the observable quality properties that determine whether a system is production-worthy. DDIA structures these around four axes: performance (speed and throughput), reliability (correctness under adversity), scalability (graceful behaviour under increased load), and maintainability (the cost of evolving and operating the system over time).

⚡

Performance

How fast is it? How much work can it handle at once?

🛡️

Reliability

Does it keep working when things go wrong: hardware failures, software bugs, human mistakes?

📈

Scalability

Can we handle 10× more users without rebuilding the whole system?

🔧

Maintainability

Can the team fix bugs, add features, and hand off knowledge without heroics?

Case Study: Social Network Home Timelines

The Twitter-style home timeline problem illustrates the classic read/write trade-off in social systems. The relational schema has three tables: users, posts, and follows. Rendering one home timeline naively requires a SQL JOIN across all three. At 2 million home timeline reads per second, each requiring lookups across hundreds of thousands of followed-user rows, the database cannot keep up.
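As a rough illustration of that read-time JOIN, here is a minimal pull-on-read sketch. It assumes a hypothetical SQLite-style schema with posts(sender_id, content, sent_at) and follows(follower_id, followee_id); the table and column names are illustrative, not any real production schema.

```python
import sqlite3

def home_timeline_pull_on_read(db: sqlite3.Connection, user_id: int, limit: int = 50):
    """Naive pull-on-read: join follows -> posts at request time.

    Every home-timeline load re-reads the recent posts of everyone the
    user follows, which is why this collapses at millions of reads/sec.
    """
    return db.execute(
        """
        SELECT posts.id, posts.sender_id, posts.content, posts.sent_at
        FROM posts
        JOIN follows ON follows.followee_id = posts.sender_id
        WHERE follows.follower_id = ?
        ORDER BY posts.sent_at DESC
        LIMIT ?
        """,
        (user_id, limit),
    ).fetchall()
```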

The fan-out-on-write approach materialises each user's timeline in a cache (like Redis). When a user posts, their tweet is fanned out to every follower's cached timeline immediately. Reads become trivial: a single cache lookup. The problem is celebrities: a user with 30 million followers would trigger 30 million cache writes per tweet, generating 1 million writes per second site-wide. The hybrid approach uses pull-at-read for celebrity accounts merged with pre-computed results for regular accounts.

Pull on Read vs Fan-out on Write

Aspect | Pull on Read (Naive) | Fan-out on Write
Approach | Run SQL JOIN across users, posts, and follows at read time | Pre-write each new post into every follower's materialized timeline cache
Read cost | Very high: join posts from every followed account on each home page load | Very low: single cache read per home page load
Write cost | Low: just write the post once to the posts table | High: fan out to potentially millions of followers per post
Staleness | Always fresh (reads current DB state) | Near-real-time (small propagation delay)
Celebrity problem | No special case needed | Celebrities with 30M+ followers make write fan-out impractical
Hybrid solution | Used for celebrities: pull and merge at read time | Used for regular users: pre-computed, merged with celebrity posts at read time
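The sketch below illustrates the fan-out write path and the hybrid read path, assuming a Redis timeline cache via the redis-py client (lpush, ltrim, and lrange are real client calls). The key naming, the 100,000-follower celebrity cut-off, the timeline length, and the caller supplying follower IDs and celebrity posts are illustrative assumptions, not how Twitter actually does it.

```python
import json
import redis  # assumes the redis-py client and a reachable Redis server

r = redis.Redis()
CELEBRITY_THRESHOLD = 100_000  # illustrative cut-off, not a real production number
TIMELINE_LEN = 800             # keep only the most recent entries per cached timeline

def publish_post(post: dict, follower_ids: list[int]) -> None:
    """Fan-out on write: push the new post onto every follower's cached timeline."""
    if len(follower_ids) >= CELEBRITY_THRESHOLD:
        return  # celebrity posts are not fanned out; they are pulled at read time
    payload = json.dumps(post)
    for fid in follower_ids:
        key = f"timeline:{fid}"
        r.lpush(key, payload)
        r.ltrim(key, 0, TIMELINE_LEN - 1)  # bound the cached timeline length

def read_home_timeline(user_id: int, celebrity_posts: list[dict]) -> list[dict]:
    """Hybrid read: one cheap cache read, merged with posts pulled from the
    database for any celebrities this user follows (supplied by the caller)."""
    cached = [json.loads(x) for x in r.lrange(f"timeline:{user_id}", 0, TIMELINE_LEN - 1)]
    return sorted(cached + celebrity_posts, key=lambda p: p["sent_at"], reverse=True)
```

The celebrity cut-off is what keeps the worst-case write amplification bounded: write cost is paid only where the follower count is modest, and read cost is paid only for the handful of high-follower accounts.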

Performance: Latency & Throughput

Response time and throughput are complementary performance axes. For batch processing, throughput (records/sec) is primary. For online services, response time percentiles matter: p50 captures the typical user; p95/p99 reveal queueing effects and GC pressure; p999 exposes tail latency amplification. Never average percentiles across time windows or machines; always aggregate raw histograms (HdrHistogram, t-digest) before computing percentiles.
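As a small illustration of why raw samples (or merged histograms) must be pooled before taking percentiles, the synthetic example below compares averaging two servers' p99 values with computing p99 over the combined data; the lognormal latencies are made up for demonstration.

```python
import numpy as np

# Synthetic raw per-request latencies (ms) from two servers over the same window.
rng = np.random.default_rng(42)
server_a = rng.lognormal(mean=3.0, sigma=0.5, size=10_000)
server_b = rng.lognormal(mean=3.2, sigma=0.7, size=10_000)

# Wrong: averaging each server's p99 discards the shape of the combined tail.
wrong_p99 = (np.percentile(server_a, 99) + np.percentile(server_b, 99)) / 2

# Right: pool the raw samples (or merge histograms), then take percentiles once.
pooled = np.concatenate([server_a, server_b])
p50, p95, p99, p999 = np.percentile(pooled, [50, 95, 99, 99.9])
print(f"p50={p50:.0f}ms  p95={p95:.0f}ms  p99={p99:.0f}ms  p999={p999:.0f}ms")
print(f"averaged p99={wrong_p99:.0f}ms vs pooled p99={p99:.0f}ms")
```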

Response Time = Service Time + Queueing Delays + Network Latency

Service time

Actual CPU/IO work done by the server for the request

Query execution, serialisation, business logic

Queueing delays

Time waiting in a queue because the server is busy

Head-of-line blocking: one slow request stalls many fast ones behind it

Network latency

Time for bits to travel between client and server

Round-trip time; exacerbated by retransmissions or long geographic distance

Understanding Percentiles

p50 (median)

Half of requests complete faster than this. A good baseline but hides tail problems.

p95

1 in 20 requests is slower. Often the first signal of queueing or GC pauses.

p99

1 in 100 requests is slower. A common target percentile for service-level objectives.

p999

1 in 1000 requests is slower. Relevant when a single slow backend call blocks many frontend requests.

Tail Latency Amplification

When a user request fans out to N parallel backend calls, end-to-end latency is the maximum of all N responses. If each backend has p99 latency of 1 second (1% of calls slow), a request making 100 parallel calls will experience at least one slow backend call 63% of the time (1 − 0.99^100 ≈ 0.63). This is why Amazon tracks p999 for internal services: not because 0.1% of calls seem rare, but because they frequently become the critical path.
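A quick sketch of the arithmetic behind tail latency amplification; the 1% slow-call probability mirrors the p99 example above, and the fan-out values are illustrative.

```python
def p_any_slow(p_backend_slow: float, fan_out: int) -> float:
    """Probability that at least one of `fan_out` parallel backend calls is slow."""
    return 1 - (1 - p_backend_slow) ** fan_out

for n in (1, 10, 100):
    print(f"fan-out {n:>3}: {p_any_slow(0.01, n):.0%} of user requests hit a slow backend call")
# fan-out 1 -> 1%, fan-out 10 -> 10%, fan-out 100 -> 63%
```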

SLO โ€” Service Level Objective

An internal target: "p99 response time < 200ms, measured over a 1-minute rolling window." SLOs drive engineering priorities and on-call alerts. They are expressed in percentiles, not averages.

SLA โ€” Service Level Agreement

An external contract: "if p99 < 200ms is not met for more than 0.01% of requests in a month, the customer receives a credit." SLAs formalise SLOs with financial consequences.
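A toy rolling-window check of the example SLO above. The in-memory deque, the nearest-rank p99, and the record() helper are illustrative assumptions, not how a production metrics pipeline would implement this.

```python
from collections import deque

WINDOW_S = 60          # 1-minute rolling window from the example SLO
TARGET_P99_MS = 200.0  # example target

samples = deque()  # (timestamp_s, latency_ms) pairs, oldest first

def record(ts: float, latency_ms: float) -> bool:
    """Record one request; return True if the rolling-window p99 SLO currently holds."""
    samples.append((ts, latency_ms))
    while samples and samples[0][0] < ts - WINDOW_S:  # evict samples outside the window
        samples.popleft()
    latencies = sorted(l for _, l in samples)
    p99 = latencies[int(0.99 * (len(latencies) - 1))]  # nearest-rank style p99
    return p99 < TARGET_P99_MS
```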

Resilience Patterns

Four complementary patterns that prevent individual faults from cascading into system-wide failures; a minimal sketch of the first two follows the list:

Exponential back-off

Double the wait time between retries. Prevents clients from hammering an already-overloaded server after a transient failure.

Circuit breaker

After N consecutive failures, stop sending requests to the downstream service for a cool-off period before retrying.

Load shedding

Deliberately reject low-priority work when the system is overloaded. Return 503 instead of slowing down for every caller.

Back-pressure

Signal upstream producers to slow down when a buffer is full. Keeps queues bounded and prevents memory exhaustion.
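A minimal sketch of the first two patterns, exponential back-off (with jitter) and a circuit breaker. The thresholds, cool-off period, and class shape are illustrative choices under these assumptions, not a reference implementation.

```python
import random
import time

def retry_with_backoff(call, max_attempts: int = 5, base_delay: float = 0.1):
    """Exponential back-off with full jitter: wait roughly twice as long after each failure."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the error to the caller
            time.sleep(random.uniform(0, base_delay * 2 ** attempt))

class CircuitBreaker:
    """Fail fast after `threshold` consecutive failures; allow a probe call after `cooloff` seconds."""

    def __init__(self, threshold: int = 5, cooloff: float = 30.0):
        self.threshold, self.cooloff = threshold, cooloff
        self.failures, self.opened_at = 0, None

    def call(self, fn):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooloff:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: let one request through as a probe
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0
        return result
```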

Reliability: Faults, Failures & Fault Tolerance

A fault is one component deviating from its specification. A failure is the whole system failing to meet its SLO as observed by users. Fault-tolerant systems build redundancy so faults remain localised. A single point of failure (SPOF) is any component whose failure causes a system failure; eliminating SPOFs is the first step in reliability engineering. Chaos engineering (Netflix's Chaos Monkey) deliberately injects faults in production to verify that fault-tolerance mechanisms actually work before a real incident does.

Hardware faults

  • Hard disks fail at ~2–5% per year (MTTF ~10–50 years)
  • SSDs fail at ~0.5–1% per year: more reliable but not immune
  • In a cluster of 10,000 disks, expect roughly one failure per day (see the quick calculation after this list)
  • RAID for disk redundancy; UPS + generator for power; hot-swap components
  • Hardware failures are largely independent: low correlation between nodes
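A quick back-of-envelope check of the cluster figure above, using the quoted 2–5% annual failure rate:

```python
disks = 10_000
for annual_rate in (0.02, 0.05):
    per_day = disks * annual_rate / 365
    print(f"{annual_rate:.0%} annual failure rate -> {per_day:.1f} expected failures/day")
# 2% -> ~0.5/day, 5% -> ~1.4/day: roughly one disk failure per day in a 10,000-disk cluster
```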

Software faults

  • Correlated: the same bug can strike every node simultaneously
  • Runaway processes consuming CPU, memory, or disk until the node dies
  • A dependency that slows down or returns wrong responses cascades upstream
  • Cascading failures: one overloaded service triggers failures in callers
  • Often dormant for months until triggered by unusual input or edge case

Human errors

  • Configuration changes are the #1 cause of outages (hardware failures: only 10–25%)
  • Blameless post-mortems: focus on systemic causes, not individual blame
  • Property-based testing: generate random inputs to find edge cases automatically
  • Rollback mechanisms: make it easy and fast to revert a bad deploy or config change
  • Well-designed interfaces: make the right thing easy, wrong thing hard

Mental Models

Eight durable intuitions from this chapter

01

Nonfunctional requirements are often the deciding factor

Features determine what a system does; nonfunctional requirements determine whether it works in production. A system that returns correct results in 60 seconds or that crashes under load is still a failed system. Performance, reliability, scalability, and maintainability constraints must be defined up front, not retrofitted.

02

Define your load parameter before discussing scalability

Scalability is meaningless without a specific load metric. "Scale" for Twitter means fan-out per tweet. For a database it might be reads/writes per second or working-set size. For a video platform it might be concurrent stream count. Naming the bottleneck number makes architectural decisions concrete and testable.

03

Percentiles, not averages, reveal tail latency

Average response time hides the suffering of unlucky users. p99 and p999 reveal slow outliers caused by GC pauses, disk seeks, or queueing. For services that make dozens of parallel backend calls, tail latency amplification means even rare p99 slowdowns regularly affect end-user experience.

04

Fan-out is the core tension in social feed systems

Every write-heavy social feature (home timelines, news feeds, activity streams) faces the fan-out dilemma: pre-compute and store per-user views (fast reads, expensive writes) or compute at read time (cheap writes, expensive reads). The right answer depends on read/write ratio, follower distribution, and freshness requirements, and is often a hybrid.

05

Queueing delays dominate latency under high load

At low traffic, response time ≈ service time. As the server approaches saturation, queueing time spikes faster than service time. Head-of-line blocking means one slow request in a queue stalls all faster requests behind it. This is why p99 latency degrades sharply near capacity limits while p50 looks fine.

06

Faults are inevitable; failures are not

A fault is one component deviating from spec. A failure is the whole system failing to meet its SLO. Good system design tolerates faults through redundancy and isolation so they never become failures. Chaos engineering deliberately introduces faults in production to verify that fault-tolerance mechanisms actually work.

07

Hardware faults are independent; software faults are correlated

Two hard drives rarely fail for the same reason at the same moment, which is why RAID and node redundancy work. But a software bug or a bad config change can strike every replica simultaneously. This is why software reliability requires different techniques: staged rollouts, canaries, property-based testing, and blameless post-mortems.

08

Maintainability is an organisational property, not just a code property

Code that is simple and well-documented still becomes a maintenance burden if there are no runbooks, no dashboards, and no safe way to deploy changes. Operability means thinking about who will run the system at 2am; simplicity means resisting accidental complexity that grows from clever shortcuts; evolvability means making future changes cheaper than building from scratch.