A Philosophy of Streaming Systems: Putting the Pieces Together
Ch.13 of DDIA 2nd ed. synthesizes the book's themes into architectural principles: data integration challenges (no single tool, dual-write problems), reasoning about dataflows, derived data vs. distributed transactions, ordering events for causality, lambda vs. kappa architectures, "unbundling" databases (federated reads via Trino/FDW, unified writes via CDC/event logs), loose coupling benefits, and designing applications around dataflow with immutable state and derived views.
The central premise of this chapter is that no single data storage or processing tool is right for all access patterns and workloads. Most applications end up using several specialized systems: a relational database for OLTP, a search index for full-text search, a cache for low-latency reads, a data warehouse for analytics, and a message queue for async processing. The problem is keeping them all in sync.
The Data Integration Problem
No single tool fits all access patterns
OLTP databases are fast for point queries but slow for analytics. Search indexes support full-text search but not aggregations. Caches serve low-latency reads but lose data on restart. Data warehouses handle analytics but are stale by hours.
Solution: Accept that you need multiple specialized systems. The problem becomes: how do you keep them in sync?
Dual-write is a race condition
Writing the same data to two systems in parallel (e.g., database + search index) risks inconsistency: one write succeeds, the other fails, and there is no atomic way to roll back. The two systems are now out of sync, and the divergence may go undetected.
Solution: Designate one system as the leader (source of truth). Derive all other systems from it via CDC or event log replay.
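To make the contrast concrete, here is a minimal sketch in plain Python (no real Kafka or Debezium; all names are illustrative, not a real API) of deriving two downstream views from one authoritative change log instead of dual-writing:

```python
# A minimal sketch of deriving two systems from one authoritative
# change log instead of dual-writing. Illustrative names only.

change_log = []  # append-only: the source of truth's change stream

def write(key, value):
    # The only write path: append a change event to the log.
    change_log.append({"key": key, "value": value})

def build_search_index(log):
    # Derived system 1: rebuilt deterministically from the log.
    return {e["key"]: e["value"].lower().split() for e in log}

def build_cache(log):
    # Derived system 2: same log, different materialization.
    return {e["key"]: e["value"] for e in log}

write("doc1", "Hello Stream Processing")
write("doc1", "Hello Kappa Architecture")

# Both views are derived, never dual-written, so they cannot silently
# diverge: losing one just means replaying the log to rebuild it.
print(build_search_index(change_log))  # {'doc1': ['hello', 'kappa', 'architecture']}
print(build_cache(change_log))         # {'doc1': 'Hello Kappa Architecture'}
```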
Total ordering across partitions is hard
In a sharded system, there is no single global event log. Events on different shards have no total ordering. Microservices with independent databases have even less coordination. A user action may touch multiple services with no shared clock.
Solution: Accept causal ordering (Lamport/hybrid clocks) for most use cases. Use consensus for the cases that truly require total ordering.
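A Lamport clock is small enough to sketch in full. This toy version (illustrative, not a production library) shows how causally related events get ordered timestamps without any global coordination:

```python
# Minimal Lamport clock: orders causally related events across
# shards without a shared clock or consensus.

class LamportClock:
    def __init__(self):
        self.time = 0

    def local_event(self):
        self.time += 1
        return self.time

    def send(self):
        # Timestamp attached to an outgoing message.
        self.time += 1
        return self.time

    def receive(self, msg_time):
        # On receipt, jump past the sender's timestamp.
        self.time = max(self.time, msg_time) + 1
        return self.time

a, b = LamportClock(), LamportClock()
t_send = a.send()           # a's clock: 1
t_recv = b.receive(t_send)  # b's clock: 2
assert t_send < t_recv      # causally related events get ordered timestamps
```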
Schema evolution breaks pipelines
When the source schema changes (new column, renamed field, type change), all downstream consumers must update. In tightly coupled architectures, this cascade is painful and slow.
Solution: Use a schema registry (Confluent Schema Registry with Avro/Protobuf). Consumers opt into schemas they support. Forward and backward compatibility rules enforce safe evolution.
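As a concrete illustration, the sketch below uses the fastavro library (an assumption; the chapter does not prescribe a client) to show why a new field must carry a default: the v2 reader schema can then still decode records written under v1, which is what backward compatibility means in practice.

```python
# Sketch of backward-compatible Avro schema evolution with fastavro
# (library choice is an assumption, not from the chapter).
import io
from fastavro import parse_schema, schemaless_writer, schemaless_reader

v1 = parse_schema({
    "type": "record", "name": "User",
    "fields": [{"name": "id", "type": "string"}],
})
v2 = parse_schema({
    "type": "record", "name": "User",
    "fields": [
        {"name": "id", "type": "string"},
        # The new field MUST have a default, or old records break readers.
        {"name": "email", "type": "string", "default": ""},
    ],
})

buf = io.BytesIO()
schemaless_writer(buf, v1, {"id": "u42"})  # producer still on v1
buf.seek(0)
record = schemaless_reader(buf, v1, v2)    # consumer already on v2
print(record)                              # {'id': 'u42', 'email': ''}
```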
Derived Data vs. Distributed Transactions
There are two main approaches to keeping multiple systems in sync. They represent a fundamental tradeoff between consistency guarantees and operational complexity.
Derived Data via CDC/Event Log
One system is the authoritative source of truth. Changes propagate asynchronously to derived systems via CDC or an event log. Search index, cache, and analytics DB are all followers.
CONSISTENCY
Eventually consistent across derived systems. Primary is always consistent.
FAILURE BEHAVIOR
Derived systems can lag or be unavailable; primary continues to work. On recovery, derived systems replay from the log.
BEST FOR
Most data synchronization use cases. Acceptable when derived systems can be slightly stale.
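The recovery path described above fits in a small sketch. The log and derived view here are plain Python stand-ins, not a real client API:

```python
# A derived system checkpoints its log offset; on recovery it replays
# only what it missed. Illustrative stand-ins, not a real consumer.

log = [("set", "a", 1), ("set", "b", 2), ("set", "a", 3)]

class DerivedView:
    def __init__(self):
        self.state = {}
        self.offset = 0  # durable checkpoint: last applied log position

    def apply(self, entry):
        op, key, value = entry
        if op == "set":
            self.state[key] = value

    def catch_up(self, log):
        # Replay everything after the checkpoint; safe to rerun.
        for entry in log[self.offset:]:
            self.apply(entry)
            self.offset += 1

view = DerivedView()
view.catch_up(log[:2])  # processes two entries, then "crashes"
view.catch_up(log)      # on recovery, resumes from offset 2
print(view.state)       # {'a': 3, 'b': 2}
```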
Distributed Transactions (2PC/XA)
A coordinator orchestrates an atomic commit across multiple systems. All participants either commit or abort together. Provides atomicity across heterogeneous systems.
CONSISTENCY
Immediately consistent across all systems after commit.
FAILURE BEHAVIOR
Coordinator is a SPOF (single point of failure). If the coordinator crashes after participants have prepared but before the commit decision is sent, participants are blocked indefinitely, holding locks on in-doubt transactions. Expensive in both latency and lock contention.
BEST FOR
When strong cross-system atomicity is required. Rarely used in modern architectures due to operational complexity.
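The in-doubt window is easiest to see in a toy sketch. The classes below are illustrative stand-ins, not a real XA implementation:

```python
# Toy two-phase commit showing the in-doubt window: if the coordinator
# vanishes between prepare and commit, participants hold locks and can
# neither commit nor abort on their own.

class Participant:
    def __init__(self, name):
        self.name, self.state = name, "idle"

    def prepare(self):
        self.state = "prepared"   # locks are held from this point on
        return True               # vote yes

    def commit(self):
        self.state = "committed"  # locks released

participants = [Participant("db"), Participant("search")]

# Phase 1: all participants vote yes.
assert all(p.prepare() for p in participants)

# ... coordinator crashes here, before sending the commit decision ...

# Phase 2 never arrives: every participant is stuck "prepared" (in doubt),
# holding locks until the coordinator recovers and resolves the outcome.
print([(p.name, p.state) for p in participants])
# [('db', 'prepared'), ('search', 'prepared')]
```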
Lambda vs. Kappa Architecture
Two competing architectural patterns address the need for both low-latency results and the ability to reprocess historical data with corrected logic.
Lambda Architecture
Largely abandoned in favor of Kappa. Run both a batch layer (processes full history, high accuracy) and a speed layer (processes recent data, low latency). A serving layer merges results from both.
PROS
Handles both historical reprocessing and low-latency queries. Batch layer can correct errors in the speed layer.
CONS
Two codebases to maintain (often in different frameworks). Complexity of merging results. Speed layer results are replaced by batch results eventually, but "eventually" may be hours. Code divergence between the two layers leads to inconsistencies.
Kappa Architecture
Preferred modern approach. Process everything as a stream. For reprocessing, replay the event log through a new version of the stream processor. No separate batch layer.
PROS
Single codebase. Simpler to operate. Reprocessing by replaying the log with a new consumer group: run new and old in parallel, swap when ready.
CONS
Stream processor must be able to handle both high-throughput reprocessing (batch-like) and low-latency production traffic. Log must retain full history or a sufficient lookback.
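A sketch of the replay mechanics, assuming the kafka-python client, a broker on localhost:9092, and a topic named "events" (all assumptions, not from the chapter). A fresh consumer group with no committed offsets starts from the beginning of the retained log:

```python
# Kappa-style reprocessing sketch: a NEW consumer group replays the full
# retained log through v2 logic while the v1 group keeps serving traffic.
# Requires a running Kafka broker; names are assumptions.
from kafka import KafkaConsumer

def handle_v2(value: bytes) -> None:
    # Placeholder for the new derivation logic, not a real API.
    print(value)

consumer = KafkaConsumer(
    "events",
    bootstrap_servers="localhost:9092",
    group_id="view-builder-v2",    # fresh group => no committed offsets
    auto_offset_reset="earliest",  # so it starts from the log's beginning
)

for message in consumer:
    # v2 logic rebuilds the view from history; once it has caught up,
    # cut reads over to the v2 view and retire the v1 group.
    handle_v2(message.value)
```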
Unbundling the Database
A traditional monolithic database bundles together several distinct functions. The "unbundled database" vision is to implement each function as a best-of-breed component, connected via standard interfaces. This provides flexibility at the cost of operational complexity.
Two complementary directions: federated reads (query across multiple stores as if they were one: Trino, PostgreSQL FDW) and unified writes (propagate writes to all stores via CDC/event log: a single authoritative write path, all others derived).
| Component | Traditional (Bundled) | Unbundled |
|---|---|---|
| Storage | Database manages its own storage format (B-tree, LSM-tree) internally | Object store (S3) + columnar format (Parquet, ORC) + table catalog (Iceberg, Delta Lake, Hudi) |
| Query execution | Database's query planner and executor, tightly coupled to storage | Query engine (Trino/Presto, Spark, DuckDB) reads from any compatible storage format |
| Change propagation | Database replication log, accessible only within the database ecosystem | CDC tool (Debezium) + Kafka log. Any consumer can subscribe to any change stream. |
| Indexing | Database maintains indexes on write path, tightly coupled to storage engine | Search index (Elasticsearch/OpenSearch) maintained as a CDC consumer. Updated asynchronously. |
| Caching | Database buffer pool; application-level cache (Memcached) managed separately | Redis/Memcached maintained as a CDC consumer. Cache invalidation driven by change events. |
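Unbundled query execution can be startlingly small. The sketch below assumes DuckDB's Python API and a hypothetical local directory of Parquet files named events/ (both assumptions):

```python
# The query engine (DuckDB) reads a storage format (Parquet) it does
# not own: execution and storage are separate, swappable components.
import duckdb

result = duckdb.sql("""
    SELECT user_id, count(*) AS n
    FROM 'events/*.parquet'   -- hypothetical Parquet files on local disk
    GROUP BY user_id
    ORDER BY n DESC
""")
print(result)
```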
The Application as a Dataflow Graph
The synthesizing insight of DDIA is that a data-intensive application can be thought of as a dataflow graph: immutable events flow in from sources, through a series of transformations (stream processors, batch jobs, CDC pipelines), and into derived views (search indexes, caches, analytics databases, serving APIs). Application code is a derivation function: it transforms one representation of data into another. State and code are separated: state lives in the event log and its derived materializations; code lives in the processors. This separation makes evolution, debugging, and reprocessing far more tractable than traditional architectures where state is buried in mutable database tables.
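The derivation-function idea fits in a few lines: the view is a fold of a pure function over the event log. A minimal sketch with illustrative names:

```python
# view = fold(apply, events, initial_state): application code as a
# pure derivation function over an immutable event log.
from functools import reduce

events = [("credit", 100), ("debit", 30), ("credit", 5)]

def apply(balance, event):
    # Pure function: same log in, same view out; no hidden mutable state.
    kind, amount = event
    return balance + amount if kind == "credit" else balance - amount

view = reduce(apply, events, 0)
print(view)  # 75
```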
Mental Models
Derive, Don't Synchronize
Instead of trying to keep multiple systems in sync via dual-writes or distributed transactions, make one system the source of truth and derive all others from it. This shifts the problem from coordination (hard, failure-prone) to transformation (composable, retryable, replayable).
The Application as a Dataflow Graph
Modern data-intensive applications are not monoliths reading from one database. They are graphs of data transformations: events flow from sources through CDC pipelines, stream processors, batch jobs, and into serving systems. Designing the application is designing the dataflow.
Reprocessing is a Superpower
When the business logic changes, you don't need to go back in time and change history. You keep the immutable event log and re-derive all outputs by replaying the log through new logic. This is how data teams evolve analytics, retrain ML models, and fix bugs in derived data, all without touching source data.
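A minimal sketch of that workflow: the log stays immutable, only the derivation function changes, and the corrected view comes from replay (the tax rule below is a made-up example):

```python
# Reprocessing: same history, new derivation function, new view.
log = [{"item": "book", "price": 20}, {"item": "pen", "price": 2}]

def total_v1(events):
    return sum(e["price"] for e in events)

def total_v2(events):
    # Bug fix / new business rule: prices now include 10% tax.
    return sum(round(e["price"] * 1.10, 2) for e in events)

print(total_v1(log))  # 22   -- old derived view
print(total_v2(log))  # 24.2 -- corrected view from the SAME history
```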
Unbundling the Database
A database is a bundle of: storage, indexing, query execution, change propagation, caching. Each of these can be implemented as a separate specialized system. "Unbundling" means choosing the best component for each function and connecting them via standard interfaces (Kafka, Parquet, Iceberg). More flexible, more operationally complex.
Loose Coupling via the Event Log
When systems communicate via an event log (Kafka) rather than direct API calls, they are loosely coupled: producers don't know who their consumers are; consumers can be added or removed without touching producers; consumers can be at different points in the log. This is the architectural benefit of CDC over direct database connections.
Avoiding Coordination is a Design Goal
Every time two services or systems must coordinate (via locks, distributed transactions, or consensus), there is a performance cost, an availability risk (what if one side is down?), and a complexity tax. Designing systems that derive state from immutable logs rather than coordinate on mutable state avoids most of this coordination overhead.
END OF SERIES · DESIGNING DATA-INTENSIVE APPLICATIONS · 2ND EDITION
This is the final chapter in the DDIA reading notes series. The series covered Ch.1–13, tracing the full arc from single-machine storage engines and data models through distributed replication, sharding, transactions, the trouble with distributed systems, consistency, consensus, and finally batch and stream processing architectures. The throughline: the fundamental tradeoffs (consistency vs. availability, latency vs. throughput, simplicity vs. generality) recur throughout every layer of the stack.