DDIA · Ch.11 · 24 min read

Batch Processing: Taming Large Datasets with Bounded Jobs

Ch.11 of DDIA 2nd ed. covers the batch processing lineage: Unix tools as the philosophical ancestor, distributed filesystems (HDFS, GlusterFS) and object stores (S3, GCS) as storage layers, YARN/Kubernetes as resource managers, MapReduce as the foundational model, and dataflow engines (Spark, Flink) as its evolution. Key tradeoffs: sort-merge vs. hash joins, HDFS vs. object store, co-located vs. separated compute.

Three Modes of Data Processing

Before diving into batch processing, it helps to situate it among the main data processing paradigms: online request/response systems, batch (offline) systems, and stream (near-real-time) systems. The key dimension is latency, and with it the design constraints that latency imposes. The table below further splits the online category into OLTP and OLAP/analytics.

| System Type | Latency | Input | Output | Priority | Examples |
| --- | --- | --- | --- | --- | --- |
| OLTP (Online Transaction Processing) | Milliseconds | User-initiated queries or writes, a small number of records at a time | Immediate response to the user | Low latency, high availability, ACID transactions | PostgreSQL, MySQL, MongoDB (serving web applications) |
| OLAP / Analytics | Seconds to minutes | Analytical queries scanning large portions of the dataset | Aggregated summaries, reports | Query throughput, read-heavy optimization (columnar storage) | Snowflake, BigQuery, Redshift, ClickHouse |
| Batch Processing | Minutes to hours | Large bounded dataset, often the entire dataset or full history | Derived dataset: search index, ML model, aggregate statistics | Throughput, fault tolerance via retry, human-fault tolerance via immutability | Hadoop MapReduce, Spark, Hive, Flink batch mode |
| Stream Processing | Milliseconds to seconds | Unbounded stream of events as they arrive | Continuously updated derived outputs | Low latency, windowing, state management | Apache Kafka Streams, Flink, Spark Streaming, Dataflow |

Batch Processing's Superpower: Human Fault Tolerance

Because batch jobs take immutable inputs and write to new output locations, any mistakes can be corrected by re-running the job. There is no in-place mutation, no user-facing state to corrupt. This "time travel" property (the ability to go back and reprocess historical data with corrected logic) is fundamental to how data teams iterate on analytics and ML pipelines. It's the reason data pipelines use immutable storage (append-only logs, versioned snapshots) rather than mutable databases.
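To make this concrete, here is a minimal Python sketch of a re-runnable daily batch job. The directory layout (`logs/<date>/` for raw inputs, `output/user_counts/<date>/` for derived output) and the field names are hypothetical; the point is the pattern of reading immutable inputs and overwriting only derived outputs.

```python
import json
from pathlib import Path

def run_daily_job(date: str) -> None:
    """Recompute per-user event counts for one day of immutable log files."""
    counts: dict[str, int] = {}
    for log_file in sorted(Path(f"logs/{date}").glob("*.jsonl")):
        with log_file.open() as f:
            for line in f:
                event = json.loads(line)
                counts[event["user_id"]] = counts.get(event["user_id"], 0) + 1

    # Derived output goes to its own dated location; the raw logs are never
    # touched, so a buggy run can simply be repeated after the logic is fixed.
    out_dir = Path(f"output/user_counts/{date}")
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / "part-00000.json").write_text(json.dumps(counts))

# Found a bug in the counting logic? Fix it, then reprocess history:
# for d in ("2024-01-01", "2024-01-02"):
#     run_daily_job(d)
```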

Distributed Storage: HDFS vs. Object Stores

HDFS (Hadoop Distributed File System) was designed around a key assumption: moving computation to the data is cheaper than moving data to computation. Files are split into 128 MB blocks stored on DataNodes; a central NameNode tracks file-to-block mapping. YARN schedules compute tasks preferentially on nodes that hold the relevant data blocks.

Object stores (S3, Google Cloud Storage, Azure Blob Storage) are fundamentally different: they separate storage from compute entirely. Objects are immutable blobs stored in a flat namespace; there are no real directories. The compute-data separation means any compute cluster can read any data, enabling truly elastic scaling, but at the cost of network I/O between compute and storage.
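As a small illustration of the flat namespace, here is how a prefix "listing" looks with boto3; the bucket and prefix names are made up. There is no directory object, just keys that happen to share a prefix.

```python
import boto3

s3 = boto3.client("s3")

# "events/2024-01-01/" is not a directory, just a shared key prefix.
# Listing it means scanning for matching keys, page by page.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket="my-data-lake", Prefix="events/2024-01-01/"):
    for obj in page.get("Contents", []):
        print(obj["Key"], obj["Size"])
```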

| Property | HDFS | Object Store (S3/GCS/Azure) |
| --- | --- | --- |
| Structure | 128 MB–1 GB blocks distributed across DataNodes; directory tree via NameNode | Flat namespace; "directories" are just key prefixes; objects are immutable blobs |
| Compute locality | ✓ Excellent: YARN/Kubernetes can schedule tasks on the nodes that hold the data blocks | ✗ None: compute and storage are fully separated (reads go over the network) |
| Mutation | Files can be appended; HBase provides random writes on top of HDFS | Objects are immutable; to "update" you write a new object and delete the old one |
| Directory listing | Efficient directory operations via NameNode metadata | No real directories; listing objects by prefix requires scanning and may be slow at scale |
| Cost model | Fixed cluster cost; must provision for peak storage + compute | Pay per GB stored + per operation; effectively unlimited capacity; compute billed separately (e.g., EMR, Dataproc) |
| Durability | Default 3x replication across racks; NameNode is a SPOF without an HA setup | 11 nines of durability (AWS S3); highly available by design; no NameNode SPOF |

MapReduce vs. Dataflow Engines

MapReduce popularized distributed batch processing, but its design has fundamental performance inefficiencies. Within a job, all map output is sorted and spilled to disk before the reduce phase can begin, and any complex pipeline must be expressed as a chain of separate jobs, each writing its full output to HDFS with 3x replication. A chain of 10 jobs means 10 full HDFS writes.

Dataflow engines (Spark, Flink) address this by treating an entire pipeline as a single job. Intermediate results stay in memory or local disk; the system tracks the data lineage graph (which operations produced which data) for fault recovery. Unnecessary sorts are eliminated; operators can be pipelined without barriers.

| Dimension | MapReduce | Dataflow Engines (Spark/Flink) |
| --- | --- | --- |
| Job boundary | Each MapReduce job is independent; intermediate results are written to HDFS between jobs | The entire workflow is one job; intermediate state stays in memory or flows as streams between operators |
| Sorting | Sort is mandatory: all map output is sorted before reduce, even when sort semantics aren't needed | Sort only when needed; often replaced by hash aggregation, which is faster |
| Intermediate data | Written to HDFS between jobs (fault tolerant, but slow: 3x replication, fsync to disk) | Kept in memory or on local disk (spilled only when needed); much faster |
| Pipelining | No: reduce cannot start until all map tasks are complete (a full barrier) | Yes: operators can be pipelined; the output of one feeds into the next without waiting |
| Fault recovery | Re-run individual map/reduce tasks from materialized intermediate results | Re-run from the last checkpoint or from the source (more recomputation, less I/O) |
| Examples | Hadoop MapReduce, Hive (compiles to MR), Pig | Apache Spark, Apache Flink, Google Dataflow (Apache Beam), Tez |
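To see the difference in practice, here is a minimal PySpark sketch (the paths and column names are invented). Several logical stages are declared, but Spark plans them as one job with in-memory intermediates; nothing runs until the final write.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("clicks-by-country").getOrCreate()

events = spark.read.parquet("s3://my-data-lake/events/2024-01-01/")
users = spark.read.parquet("s3://my-data-lake/users/")

# Filter, join, and aggregate are separate logical stages, but there is no
# HDFS write between them; intermediate data stays in memory or on local disk.
clicks_by_country = (
    events.filter(F.col("event_type") == "click")
          .join(users, "user_id")
          .groupBy("country")
          .agg(F.count("*").alias("clicks"))
)

# The write action triggers the whole lineage graph as a single job.
clicks_by_country.write.mode("overwrite").parquet(
    "s3://my-data-lake/derived/clicks_by_country/2024-01-01/"
)
```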

Join Strategies in Batch Processing

Joins are the most expensive operation in batch processing. Unlike single-node databases where join algorithms have been optimized for decades, distributed joins must move data across the network. The choice of join strategy has a massive impact on job performance.

Sort-merge join

Both sides of the join are sorted by the join key, then merged. In MapReduce, this sorting happens during the shuffle, and the merge runs in the reduce phase.

Complexity: O(n log n) sort + O(n) merge

Best when: Large datasets where neither side fits in memory. Default in MapReduce.
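The following pure-Python sketch shows the shape of a sort-merge join on (key, value) pairs; it is a single-machine illustration of what the shuffle and reduce do across a cluster.

```python
from itertools import groupby
from operator import itemgetter

def sort_merge_join(left, right):
    """Equi-join two lists of (key, value) pairs via sort-merge:
    sort both sides by key, group, then merge in one linear pass."""
    left_groups = groupby(sorted(left, key=itemgetter(0)), key=itemgetter(0))
    right_groups = groupby(sorted(right, key=itemgetter(0)), key=itemgetter(0))

    l, r = next(left_groups, None), next(right_groups, None)
    while l is not None and r is not None:
        lkey, lvals = l
        rkey, rvals = r
        if lkey < rkey:
            l = next(left_groups, None)
        elif lkey > rkey:
            r = next(right_groups, None)
        else:
            # Matching key on both sides: emit the cross product of its values.
            lvals, rvals = list(lvals), list(rvals)
            for _, lv in lvals:
                for _, rv in rvals:
                    yield (lkey, lv, rv)
            l, r = next(left_groups, None), next(right_groups, None)

users = [("u1", "Alice"), ("u2", "Bob")]
orders = [("u1", "book"), ("u1", "pen"), ("u2", "mug")]
print(list(sort_merge_join(users, orders)))
# [('u1', 'Alice', 'book'), ('u1', 'Alice', 'pen'), ('u2', 'Bob', 'mug')]
```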

Broadcast hash join (map-side join)

Load the small side of the join entirely into a hash table in each mapper. Stream the large side through without any shuffle.

Complexity: O(n) โ€” no shuffle, no sort

Best when: One side of join is small enough to fit in memory of each mapper. Very fast.
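In Spark this is a one-line hint: `broadcast()` is the standard pyspark.sql function, while the tables and the join column here are invented.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("broadcast-join").getOrCreate()

clicks = spark.read.parquet("s3://my-data-lake/clicks/")        # large fact table
countries = spark.read.parquet("s3://my-data-lake/countries/")  # small dimension table

# The small table is shipped to every executor and kept as an in-memory hash
# table; the large table streams through local lookups with no shuffle.
joined = clicks.join(broadcast(countries), on="country_code", how="left")
joined.write.mode("overwrite").parquet("s3://my-data-lake/derived/clicks_with_country/")
```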

Partitioned hash join

Both inputs are partitioned by the join key. Each partition of the small input fits in memory. Join each pair of corresponding partitions.

Complexity: O(n) with partitioned parallelism

Best when: Both sides are too large to broadcast, but after partitioning, each partition of the smaller side fits in memory.
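A single-machine Python sketch of the idea (hash-partition both sides, then build a per-partition hash table on the smaller side); in a real engine each partition pair would be processed by a different task.

```python
from collections import defaultdict

def partitioned_hash_join(large, small, num_partitions=8):
    """Equi-join two lists of (key, value) pairs: hash-partition both sides
    on the key, then hash-join each pair of corresponding partitions."""
    def part(key):
        return hash(key) % num_partitions

    small_parts, large_parts = defaultdict(list), defaultdict(list)
    for key, value in small:
        small_parts[part(key)].append((key, value))
    for key, value in large:
        large_parts[part(key)].append((key, value))

    for p in range(num_partitions):
        # Only this partition of the smaller input needs to fit in memory.
        table = defaultdict(list)
        for key, value in small_parts[p]:
            table[key].append(value)
        for key, value in large_parts[p]:
            for match in table[key]:
                yield (key, value, match)

users = [("u1", "Alice"), ("u2", "Bob"), ("u3", "Cara")]   # smaller side
orders = [("u1", "book"), ("u3", "mug"), ("u3", "pen")]    # larger side
print(sorted(partitioned_hash_join(orders, users)))
# [('u1', 'book', 'Alice'), ('u3', 'mug', 'Cara'), ('u3', 'pen', 'Cara')]
```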

Batch Job Outputs

Batch jobs typically produce one of three output types. In all cases, the output is written to a new location atomically: the job either fully succeeds (and the new output becomes visible) or fully fails (and the old output remains). A sketch of this atomic swap pattern follows the three examples below.

🔍 Search Indexes

Lucene/Solr segments built from a full corpus scan. The resulting index files are written to a new location; the search service atomically switches to the new index. Enables full-text search over billions of documents.

📦 Key-Value Stores

Materialized views or denormalized lookup tables. The batch job computes the final key-value pairs and writes them directly into a read-optimized store (e.g., HFile files in HBase, SSTable files in Cassandra) with bulk loading for high throughput.

🤖 ML Models

Feature computation, model training, model evaluation โ€” all as batch steps. Output is a model artifact (weights, embeddings, decision tree) that a serving system loads. Model retraining pipelines run on a schedule or triggered by data drift.
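All three share the same publish step: build the artifact in a fresh location, then switch a pointer in one atomic operation. Below is a minimal sketch of that pattern using a symlink swap; the directory names are invented, and production systems often use a catalog or manifest file instead.

```python
import os
from pathlib import Path

def publish(new_version_dir: str, current_link: str = "serving/search_index/current") -> None:
    """Atomically repoint readers at a freshly built output directory."""
    tmp_link = current_link + ".tmp"
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    os.symlink(Path(new_version_dir).resolve(), tmp_link)
    # rename/replace is atomic on POSIX filesystems: readers following the
    # symlink see the old version or the new one, never a half-written mix.
    os.replace(tmp_link, current_link)

# publish("serving/search_index/2024-01-02T03-00-00")
```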

Mental Models

Immutable Inputs, Derived Outputs

Batch jobs take immutable input (e.g., yesterday's log files) and produce derived outputs (aggregates, indexes, models). If a job fails or produces bad output, you re-run it. You never corrupt the source. This human fault tolerance is a fundamental property of batch systems.

The Shuffle is the Bottleneck

In MapReduce and similar systems, the shuffle (sorting and transferring map output to reducers) is the most expensive operation: network-intensive, disk-intensive, and requiring full barrier synchronization. Dataflow engines reduce shuffle cost by avoiding unnecessary sorts and keeping intermediate data in memory.

MapReduce is a Distributed Sort Machine

MapReduce is essentially a distributed sort utility. The "word count" example obscures this: the map-then-sort-then-reduce pattern can express any per-key aggregation, as long as the key space can be sorted. This is why Hive and Pig originally compiled everything to MapReduce jobs.
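The classic word count, written as explicit map, sort, and reduce steps in plain Python, makes that sort-centric structure visible:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(documents):
    """Mapper: emit (word, 1) for every word in every document."""
    for doc in documents:
        for word in doc.lower().split():
            yield (word, 1)

def reduce_phase(sorted_pairs):
    """Reducer: sum the values for each run of identical keys."""
    for word, pairs in groupby(sorted_pairs, key=itemgetter(0)):
        yield (word, sum(count for _, count in pairs))

docs = ["the cat sat", "the cat ran"]
# The sort between map and reduce is exactly MapReduce's shuffle phase:
# it brings all occurrences of a key next to each other.
shuffled = sorted(map_phase(docs), key=itemgetter(0))
print(dict(reduce_phase(shuffled)))  # {'cat': 2, 'ran': 1, 'sat': 1, 'the': 2}
```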

Object Store + Compute Separation is the New Normal

The original Hadoop model (HDFS + co-located MapReduce) assumed compute locality was critical. In practice, modern networks (10–100 Gbps within a datacenter) make the cost of reading from S3/GCS acceptable. The industry has largely shifted to separated compute (Spark on EKS, Dataproc) plus object-store storage.

Batch Outputs Are Immutable Snapshots

Batch job outputs (search indexes, derived datasets, ML models) are typically written atomically: the new output replaces the old one as a single operation. Readers see either the old output or the new output, never a partial state. This contrasts sharply with streaming systems, where state is continuously mutated.

Workflow Schedulers Are DAG Engines

Airflow, Dagster, and Prefect represent batch workflows as DAGs (directed acyclic graphs) of tasks. Each task has dependencies; the scheduler runs tasks as soon as their dependencies are complete. This is essentially the same model as Spark's lineage graph, just externalized to a workflow layer.
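A minimal Airflow sketch of that model (Airflow 2.x style; the task bodies and DAG name are placeholders):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("read raw events from object storage")

def transform():
    print("compute daily aggregates")

def load():
    print("publish the derived dataset")

with DAG(
    dag_id="daily_clickstream",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)

    # >> declares DAG edges; the scheduler starts a task once all of its
    # upstream dependencies have succeeded.
    t_extract >> t_transform >> t_load
```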

DESIGNING DATA-INTENSIVE APPLICATIONS · 2ND EDITION

This article is part of an ongoing series distilling DDIA 2nd ed. into structured, audience-aware reading notes.

โ† Ch.10: Consistency and ConsensusCh.12: Stream Processing โ†’