DDIA · Ch.12 · 26 min read

Stream Processing: Taming Infinite Event Sequences in Real Time

Ch.12 of DDIA 2nd ed. covers message brokers (traditional queues vs. log-based), Kafka architecture (append-only log, partitions, consumer groups, offsets, disk-based ring buffer), CDC and event sourcing, keeping derived data in sync, and stream processing patterns: windowing (tumbling, sliding, session), joins (stream-table, stream-stream), fault tolerance via checkpointing, and output sinks.

Messaging Systems: Traditional Brokers vs. Log-Based Brokers

Messaging systems decouple producers (who write events) from consumers (who process them). The key architectural question is whether the system behaves like a message queue (consume and discard) or like a database (persist and replay).

Traditional Message Broker vs. Log-Based Broker (Kafka/Kinesis), dimension by dimension:

Message persistence. Traditional: typically deleted on acknowledgment (RabbitMQ, ActiveMQ); small working set. Log-based: retained on disk for a configurable period; consumers track their own offsets; S3 tiering for unlimited retention.

Consumer model. Traditional: load balancing (each message delivered to exactly one consumer in a group) or fan-out (each subscriber gets every message). Log-based: load balancing via partition assignment to consumers; fan-out via separate consumer groups, each reading at its own offset independently.

Replay. Traditional: once acknowledged, a message is gone; no replay. Log-based: seek to any offset in the log and re-read; full history available within the retention window.

Ordering. Traditional: ordering within a queue, but load balancing breaks ordering across consumers. Log-based: total ordering within a partition; consumers assigned to a partition see events in order.

Backpressure. Traditional: when the buffer fills, drop messages or block the producer; some brokers support overflow to disk. Log-based: consumer lag accumulates while the producer never slows; consumers can fall behind and catch up later.

Throughput. Traditional: moderate; better for complex routing and per-message priority. Log-based: very high, thanks to sequential disk writes and batching; Kafka handles millions of messages/sec per node.

Kafka Deep Dive

Log as Storage

Each topic partition is an append-only log file. Producers append records; consumers read from any offset. Sequential disk I/O enables very high throughput, and the OS page cache keeps reads fast without additional in-memory buffering.

Partitions for Parallelism

A topic is split into N partitions, each assigned to a broker. Producers choose a partition by key hash or round-robin. Each consumer group assigns partitions to consumers, so N partitions allow at most N parallel consumers per group.
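
As a rough illustration (not Kafka's actual partitioner, which hashes the key with murmur2), partition selection might look like the sketch below; the partition count is an assumption:

```python
import hashlib
from itertools import count

NUM_PARTITIONS = 6          # assumed partition count for the topic
_round_robin = count()

def choose_partition(key: bytes | None) -> int:
    """Pick a partition: hash of the key if present, else round-robin.

    Hashing the key means all events with the same key land on the same
    partition, which preserves per-key ordering.
    """
    if key is not None:
        digest = hashlib.sha1(key).digest()
        return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS
    return next(_round_robin) % NUM_PARTITIONS

print(choose_partition(b"user-42"))   # always the same partition for this key
print(choose_partition(None))         # rotates across partitions
```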

Consumer Offsets

Each consumer group tracks its read position (offset) per partition. Offset is committed back to Kafka (or stored externally). This enables replaying from any offset, independent consumer groups, and seamless consumer restarts.
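
A minimal sketch of manual offset management with the confluent-kafka Python client; the broker address, topic name "events", and group id "analytics" are placeholders:

```python
from confluent_kafka import Consumer, TopicPartition

def process(payload: bytes) -> None:
    print(payload)   # stand-in for real processing

# Auto-commit is disabled so the offset is committed only after the
# message has actually been processed (at-least-once semantics).
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",   # placeholder broker
    "group.id": "analytics",                 # placeholder consumer group
    "enable.auto.commit": False,
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["events"])

msg = consumer.poll(5.0)
if msg is not None and msg.error() is None:
    process(msg.value())
    consumer.commit(message=msg, asynchronous=False)   # record progress per partition

# Replay: point the consumer at partition 0, offset 0, and re-read history.
consumer.assign([TopicPartition("events", 0, 0)])
```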

Object Storage Tiering

Segments older than the hot retention period are tiered to S3/GCS. Consumers can still read from tiered storage, giving effectively unlimited retention at low cost. Kafka's Tiered Storage (since 3.x) makes this production-ready.

Change Data Capture and Event Sourcing

A common problem in data architectures is keeping multiple systems in sync: a relational database, a search index, a cache, and an analytics database all need to reflect the same data. The naive approach of writing to each system in parallel (dual-write) invites a race condition: what happens if the database write succeeds but the search-index write fails?

CDC and event sourcing both solve this by making one system the undisputed source of truth and treating all others as derived views. They differ in what the source of truth is.

Change Data Capture (CDC) vs. Event Sourcing, aspect by aspect:

What is captured. CDC: row-level changes to a database (INSERTs, UPDATEs, DELETEs) extracted from the database's replication log or write-ahead log; the database is the system of record. Event sourcing: application-level business events such as "user registered", "item added to cart", "payment processed"; the events are the system of record, and the database state is derived from them.

Schema. CDC: events mirror the database schema, so a schema change in the DB immediately changes the CDC events. Event sourcing: events have application-defined schemas; schema evolution is explicit and versioned.

Source of truth. CDC: the database is the leader; CDC consumers (search index, cache, analytics) are followers. Event sourcing: the event log is the source of truth; the current "state" is a materialized view derived from the log.

Dual-write problem. CDC: solves it by writing only to the database; the replication log propagates every change to all other systems in a single ordered stream, so there is no dual-write race condition. Event sourcing: solved by design; the event log is the only write target, and all projections are derived from it.

Time travel. CDC: limited to the log retention window. Event sourcing: full history; replay the entire event log to reconstruct state at any point in time.
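
To make "state is a projection of the log" concrete, here is a minimal event-sourcing sketch; the event names and cart shape are hypothetical:

```python
# The event log is the source of truth; the "current state" of a shopping
# cart is a projection derived by folding the events in order.
events = [
    {"type": "item_added",   "sku": "A1", "qty": 2},
    {"type": "item_added",   "sku": "B7", "qty": 1},
    {"type": "item_removed", "sku": "A1", "qty": 1},
]

def apply(state: dict, event: dict) -> dict:
    qty = state.get(event["sku"], 0)
    if event["type"] == "item_added":
        state[event["sku"]] = qty + event["qty"]
    elif event["type"] == "item_removed":
        state[event["sku"]] = max(qty - event["qty"], 0)
    return state

cart = {}
for e in events:            # replaying the full log rebuilds state at any point
    cart = apply(cart, e)
print(cart)                 # {'A1': 1, 'B7': 1}
```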

CDC Tools in Practice

CDC reads from the database's replication log (PostgreSQL WAL, MySQL binlog, MongoDB oplog) rather than polling via queries, which avoids adding query load to the database. Tools: Debezium (open-source, reads from multiple databases, writes to Kafka), Maxwell (MySQL), Airbyte (broader connectivity), Fivetran/Stitch (managed). A full history snapshot is taken once to initialize consumers; subsequent changes stream from the log, and consumers can replay from any point in the retained history.
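
A sketch of what a CDC consumer might do with each change event; the envelope (op/before/after) loosely follows the shape Debezium emits, but the field names here are illustrative rather than an exact contract:

```python
# Keep a derived key-value view (stand-in for a cache or search index)
# in sync by applying row-level change events in log order.
derived_view = {}

def apply_change(event: dict) -> None:
    op = event["op"]               # "c" = create, "u" = update, "d" = delete
    if op in ("c", "u"):
        row = event["after"]
        derived_view[row["id"]] = row
    elif op == "d":
        derived_view.pop(event["before"]["id"], None)

apply_change({"op": "c", "before": None, "after": {"id": 1, "name": "Ada"}})
apply_change({"op": "u", "before": {"id": 1}, "after": {"id": 1, "name": "Ada L."}})
apply_change({"op": "d", "before": {"id": 1}, "after": None})
print(derived_view)   # {} — the row was created, updated, then deleted
```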

Stream Processing Patterns

Real-world stream processing is more complex than filtering or mapping events one at a time. The hard problems involve aggregating events across time (windows), correlating events across streams (joins), and doing so correctly even when events arrive late or out of order.

Event-time vs. Processing-time

Events have two timestamps: event time (when the event occurred, embedded in the event payload) and processing time (when the stream processor saw the event). Network delays, retries, and consumer lag cause events to arrive out of order relative to event time.

Approach: Use event time for windowing; track watermarks (an estimate of how far event time has progressed) to decide when to close a window and emit its results. Late-arriving events can be handled via an allowed-lateness grace period or routed to a dedicated late-data sink.

Tumbling Windows

Fixed-size, non-overlapping time windows. Each event belongs to exactly one window. E.g., "aggregate all events in each 1-minute window."

Approach: Group events by floor(event_time / window_size). Emit result when watermark passes the window boundary.
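
A minimal sketch of event-time tumbling windows closed by a watermark; the 1-minute window and 5-second lateness allowance are assumptions:

```python
from collections import defaultdict

WINDOW = 60          # 1-minute tumbling windows (seconds), assumed
LATENESS = 5         # watermark trails the max event time seen by 5 s

counts = defaultdict(int)     # window start -> event count
max_event_time = 0
emitted = set()

def on_event(event_time: int) -> None:
    """Assign an event to its tumbling window and advance the watermark."""
    global max_event_time
    window_start = (event_time // WINDOW) * WINDOW   # floor(t / size) * size
    counts[window_start] += 1
    max_event_time = max(max_event_time, event_time)
    watermark = max_event_time - LATENESS
    # Emit every window whose end the watermark has passed.
    for start in sorted(counts):
        if start + WINDOW <= watermark and start not in emitted:
            print(f"window [{start}, {start + WINDOW}) -> {counts[start]} events")
            emitted.add(start)

for t in [3, 45, 59, 62, 58, 130]:   # 58 arrives out of order but within lateness
    on_event(t)
```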

Sliding Windows

Fixed-size windows that overlap. An event may belong to multiple windows. E.g., "count events in the past 5 minutes, updated every 1 minute."

Approach: More memory-intensive. Each event is assigned to multiple buckets. Flink's sliding window operator handles this natively.
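
A sketch of the bucket assignment: with a 5-minute window sliding every minute (both assumed values), each event lands in up to size/slide = 5 overlapping windows:

```python
def sliding_windows(event_time: int, size: int = 300, slide: int = 60) -> list[int]:
    """Return the start times of every sliding window containing this event."""
    last_start = (event_time // slide) * slide          # latest window containing it
    first_start = last_start - size + slide             # earliest one
    return [s for s in range(first_start, last_start + 1, slide) if s >= 0]

print(sliding_windows(427))   # [180, 240, 300, 360, 420]
```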

Session Windows

Variable-length windows defined by periods of inactivity. A session ends after a gap in events exceeding a configured timeout. Used for user session analytics.

Approach: No fixed duration. Window state must be merged when late events arrive within the timeout gap of a closed window.
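
A simplified sessionizer over a single user's already-sorted event times; the 30-minute gap is an assumption, and a real engine must also merge window state when late events bridge two sessions:

```python
GAP = 30 * 60   # a session ends after 30 minutes of inactivity (assumed)

def sessionize(event_times: list[int]) -> list[tuple[int, int]]:
    """Split one user's sorted event times into (start, end) sessions."""
    sessions = []
    start = end = event_times[0]
    for t in event_times[1:]:
        if t - end > GAP:              # gap exceeded: close the current session
            sessions.append((start, end))
            start = t
        end = t
    sessions.append((start, end))
    return sessions

# Three events close together, then one much later -> two sessions.
print(sessionize([0, 600, 1500, 9000]))   # [(0, 1500), (9000, 9000)]
```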

Stream-Table Join

Enrich a stream of events with data from a slowly-changing reference table (e.g., add product details to a stream of purchase events).

Approach: Load the table into a local hash map in each stream processor instance. Table updates are applied as they arrive (the table is consumed as a changelog stream).
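
A sketch of the join: the reference table lives in a local dict kept current by its changelog, and each purchase event is enriched by a lookup (record shapes are hypothetical):

```python
products = {}    # local copy of the reference table, keyed by product_id

def on_table_change(change: dict) -> None:
    """Apply one changelog record to the local table copy."""
    products[change["product_id"]] = change

def on_purchase(event: dict) -> dict:
    """Enrich a purchase with product details from the local table."""
    product = products.get(event["product_id"], {})
    return {**event, "name": product.get("name"), "price": product.get("price")}

on_table_change({"product_id": "p1", "name": "Keyboard", "price": 49})
print(on_purchase({"user": "u7", "product_id": "p1", "qty": 2}))
# {'user': 'u7', 'product_id': 'p1', 'qty': 2, 'name': 'Keyboard', 'price': 49}
```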

Stream-Stream Join

Join two streams where events from both streams must be correlated within a time window. E.g., match ad impressions with subsequent ad clicks.

Approach: Buffer events from both streams in a per-key state store for the duration of the join window. When a matching event arrives, emit the joined record. Evict buffered events once they expire.
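
A sketch of an impression/click join keyed by ad id; the 10-minute join window is an assumption, and a real processor would hold this buffered state in a fault-tolerant store:

```python
JOIN_WINDOW = 600   # match a click to an impression within 10 minutes (assumed)

impressions = {}    # ad_id -> impression timestamp (per-key buffered state)

def on_impression(ad_id: str, ts: int) -> None:
    impressions[ad_id] = ts

def on_click(ad_id: str, ts: int) -> tuple[str, int, int] | None:
    """Emit (ad_id, impression_ts, click_ts) if a recent impression exists."""
    shown = impressions.get(ad_id)
    if shown is not None and ts - shown <= JOIN_WINDOW:
        return (ad_id, shown, ts)
    return None     # click with no matching impression inside the window

def evict(now: int) -> None:
    """Drop buffered impressions older than the join window."""
    for ad_id, ts in list(impressions.items()):
        if now - ts > JOIN_WINDOW:
            del impressions[ad_id]

on_impression("ad-1", 100)
print(on_click("ad-1", 400))    # ('ad-1', 100, 400)
evict(now=2000)
print(on_click("ad-1", 2100))   # None: the impression has expired
```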

Fault Tolerance: At-Least-Once vs. Exactly-Once

Stream processors can fail mid-processing. The guarantees available depend on the system and output sink:

At-least-once

Messages may be processed multiple times (if consumer crashes before committing offset). Acceptable if processing is idempotent (applying it twice has the same effect as once). Used by most simple stream pipelines.

At-most-once

Messages may be skipped but never processed twice. Offset is committed before processing. Acceptable for lossy analytics (a dropped metric doesn't matter), never for financial transactions.

Exactly-once

Each message is processed exactly once, even if failures occur. Achieved via: (1) idempotent writes + offset-as-part-of-output, or (2) distributed transactions across consumer offset store and output sink. Flink uses periodic checkpointing + transactional sinks.
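
A sketch of option (1): the sink remembers, per partition, the highest offset it has applied, stored in the same (notionally atomic) write as the result, so a replayed message after a crash is detected and skipped. The in-memory dict stands in for a transactional store:

```python
sink = {"totals": {}, "last_offset": {}}

def process(partition: int, offset: int, key: str, amount: int) -> None:
    if sink["last_offset"].get(partition, -1) >= offset:
        return                              # already applied: a replayed message
    # In a real system the next two lines would be one atomic transaction.
    sink["totals"][key] = sink["totals"].get(key, 0) + amount
    sink["last_offset"][partition] = offset

process(0, 41, "user-7", 10)
process(0, 42, "user-7", 5)
process(0, 42, "user-7", 5)     # redelivery after a crash: ignored
print(sink["totals"])            # {'user-7': 15}
```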

Mental Models

A Stream is an Unbounded Batch

The conceptual difference between batch and stream is purely the boundedness of the input. A batch job processes a finite dataset; a stream processor processes an infinite sequence of events. Many stream processing frameworks (Flink, Spark) unify both under the same API.

The Log is the Database

Kafka's log-based message broker is architecturally equivalent to a database's replication log. The current state of a database is the accumulated result of applying all past writes in order. The log IS the source of truth; the table is just a projection. This is the key insight behind event sourcing and CDC.

Dual-Write is a Race Condition

Writing the same data to two systems (e.g., database AND search index) introduces a race condition: the database write succeeds but the search write fails, or vice versa. CDC solves this: write only to the database; treat the replication log as the event stream that propagates changes to all other systems.

Consumer Lag is Not Always Bad

With log-based brokers, a slow consumer simply falls behind; it doesn't slow down producers or other consumers. Lag can be monitored and alerted on, and consumers can catch up during low-traffic periods. This property lets batch and stream processing share the same message bus.

Watermarks Are Estimates, Not Guarantees

A watermark of "event time T" means the stream processor believes all events with timestamps ≤ T have been received. This is an estimate: events can still arrive after the watermark, forcing a decision to either include them in the window (and recompute) or route them to a late-data sink. There is no perfect watermark in a real distributed system.

State is the Hard Part of Stream Processing

Stateless stream processing (filter, map, project) is simple. The hard problems are stateful: windowed aggregations, joins across streams, maintaining running totals. State must be persistent (survives processor restarts), correctly keyed (each key's state must be co-located), and queryable. Flink's RocksDB state backend is designed specifically for this.

DESIGNING DATA-INTENSIVE APPLICATIONS · 2ND EDITION

This article is part of an ongoing series distilling DDIA 2nd ed. into structured, audience-aware reading notes.

โ† Ch.11: Batch ProcessingCh.13: A Philosophy of Streaming Systems โ†’