DDIA · Ch.11 · 24 min read

Batch Processing: Taming Large Datasets with Bounded Jobs

Ch.11 of DDIA 2nd ed. covers the batch processing lineage: Unix tools as the philosophical ancestor, distributed filesystems (HDFS, GlusterFS) and object stores (S3, GCS) as storage layers, YARN/Kubernetes as resource managers, MapReduce as the foundational model, and dataflow engines (Spark, Flink) as its evolution. Key tradeoffs: sort-merge vs. hash joins, HDFS vs. object store, co-located vs. separated compute.

Three Modes of Data Processing

Before diving into batch processing, it helps to situate it among the main data processing paradigms: online request/response systems, batch (offline) systems, and stream (near-real-time) systems. The key dimension is latency, and with it the design constraints that latency imposes. The table below further splits the online category into OLTP and OLAP/analytics.

| System Type | Latency | Input | Output | Priority | Examples |
| --- | --- | --- | --- | --- | --- |
| OLTP (Online Transaction Processing) | Milliseconds | User-initiated queries or writes, a small number of records at a time | Immediate response to the user | Low latency, high availability, ACID transactions | PostgreSQL, MySQL, MongoDB (serving web applications) |
| OLAP / Analytics | Seconds to minutes | Analytical queries scanning large portions of the dataset | Aggregated summaries, reports | Query throughput, read-heavy optimization (columnar storage) | Snowflake, BigQuery, Redshift, ClickHouse |
| Batch Processing | Minutes to hours | Large bounded dataset, often the entire dataset or full history | Derived dataset: search index, ML model, aggregate statistics | Throughput, fault tolerance via retry, human-fault tolerance via immutability | Hadoop MapReduce, Spark, Hive, Flink batch mode |
| Stream Processing | Milliseconds to seconds | Unbounded stream of events as they arrive | Continuously updated derived outputs | Low latency, windowing, state management | Apache Kafka Streams, Flink, Spark Streaming, Dataflow |

Batch Processing's Superpower: Human Fault Tolerance

Because batch jobs take immutable inputs and write to new output locations, any mistakes can be corrected by re-running the job. There is no in-place mutation, no user-facing state to corrupt. This "time travel" property (the ability to go back and reprocess historical data with corrected logic) is fundamental to how data teams iterate on analytics and ML pipelines. It's the reason data pipelines use immutable storage (append-only logs, versioned snapshots) rather than mutable databases.
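To make this concrete, here is a minimal Python sketch of a re-runnable daily batch job. The directory layout (`logs/<date>/` for raw inputs, `output/user_counts/<date>/` for derived output) and the field names are hypothetical; the point is the pattern of reading immutable inputs and overwriting only derived outputs.

```python
import json
from pathlib import Path

def run_daily_job(date: str) -> None:
    """Recompute per-user event counts for one day of immutable log files."""
    counts: dict[str, int] = {}
    for log_file in sorted(Path(f"logs/{date}").glob("*.jsonl")):
        with log_file.open() as f:
            for line in f:
                event = json.loads(line)
                counts[event["user_id"]] = counts.get(event["user_id"], 0) + 1

    # Derived output goes to its own dated location; the raw logs are never
    # touched, so a buggy run can simply be repeated after the logic is fixed.
    out_dir = Path(f"output/user_counts/{date}")
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / "part-00000.json").write_text(json.dumps(counts))

# Found a bug in the counting logic? Fix it, then reprocess history:
# for d in ("2024-01-01", "2024-01-02"):
#     run_daily_job(d)
```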

Distributed Storage: HDFS vs. Object Stores

HDFS (Hadoop Distributed File System) was designed around a key assumption: moving computation to the data is cheaper than moving data to computation. Files are split into 128 MB blocks stored on DataNodes; a central NameNode tracks file-to-block mapping. YARN schedules compute tasks preferentially on nodes that hold the relevant data blocks.

Object stores (S3, Google Cloud Storage, Azure Blob Storage) are fundamentally different: they separate storage from compute entirely. Objects are immutable blobs stored in a flat namespace; there are no real directories. The compute-data separation means any compute cluster can read any data, enabling truly elastic scaling, but at the cost of network I/O between compute and storage.
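As a small illustration of the flat namespace, here is how a prefix "listing" looks with boto3; the bucket and prefix names are made up. There is no directory object, just keys that happen to share a prefix.

```python
import boto3

s3 = boto3.client("s3")

# "events/2024-01-01/" is not a directory, just a shared key prefix.
# Listing it means scanning for matching keys, page by page.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket="my-data-lake", Prefix="events/2024-01-01/"):
    for obj in page.get("Contents", []):
        print(obj["Key"], obj["Size"])
```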

| Property | HDFS | Object Store (S3/GCS/Azure) |
| --- | --- | --- |
| Structure | 128 MB–1 GB blocks distributed across DataNodes; directory tree via NameNode | Flat namespace; "directories" are just key prefixes; objects are immutable blobs |
| Compute locality | ✓ Excellent: YARN/Kubernetes can schedule tasks on the nodes that hold the data blocks | ✗ None: compute and storage are fully separated (reads go over the network) |
| Mutation | Files can be appended; HBase provides random writes on top of HDFS | Objects are immutable; to "update" you write a new object and delete the old one |
| Directory listing | Efficient directory operations via NameNode metadata | No real directories; listing objects by prefix requires scanning and may be slow at scale |
| Cost model | Fixed cluster cost; must provision for peak storage + compute | Pay per GB stored + per operation; effectively unlimited capacity; compute billed separately (e.g., EMR, Dataproc) |
| Durability | Default 3x replication across racks; NameNode is a SPOF without an HA setup | 11 nines of durability (AWS S3); highly available by design; no NameNode SPOF |

MapReduce vs. Dataflow Engines

MapReduce popularized distributed batch processing, but its design has fundamental performance inefficiencies. Within a job, all map output is sorted and spilled to disk before the reduce phase can begin, and any complex pipeline must be expressed as a chain of separate jobs, each writing its full output to HDFS with 3x replication. A chain of 10 jobs means 10 full HDFS writes.

Dataflow engines (Spark, Flink) address this by treating an entire pipeline as a single job. Intermediate results stay in memory or local disk; the system tracks the data lineage graph (which operations produced which data) for fault recovery. Unnecessary sorts are eliminated; operators can be pipelined without barriers.

| Dimension | MapReduce | Dataflow Engines (Spark/Flink) |
| --- | --- | --- |
| Job boundary | Each MapReduce job is independent; intermediate results are written to HDFS between jobs | The entire workflow is one job; intermediate state stays in memory or flows as streams between operators |
| Sorting | Sort is mandatory: all map output is sorted before reduce, even when sort semantics aren't needed | Sort only when needed; often replaced by hash aggregation, which is faster |
| Intermediate data | Written to HDFS between jobs (fault tolerant, but slow: 3x replication, fsync to disk) | Kept in memory or on local disk (spilled only when needed); much faster |
| Pipelining | No: reduce cannot start until all map tasks are complete (a full barrier) | Yes: operators can be pipelined; the output of one feeds into the next without waiting |
| Fault recovery | Re-run individual map/reduce tasks from materialized intermediate results | Re-run from the last checkpoint or from the source (more recomputation, less I/O) |
| Examples | Hadoop MapReduce, Hive (compiles to MR), Pig | Apache Spark, Apache Flink, Google Dataflow (Apache Beam), Tez |
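To see the difference in practice, here is a minimal PySpark sketch (the paths and column names are invented). Several logical stages are declared, but Spark plans them as one job with in-memory intermediates; nothing runs until the final write.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("clicks-by-country").getOrCreate()

events = spark.read.parquet("s3://my-data-lake/events/2024-01-01/")
users = spark.read.parquet("s3://my-data-lake/users/")

# Filter, join, and aggregate are separate logical stages, but there is no
# HDFS write between them; intermediate data stays in memory or on local disk.
clicks_by_country = (
    events.filter(F.col("event_type") == "click")
          .join(users, "user_id")
          .groupBy("country")
          .agg(F.count("*").alias("clicks"))
)

# The write action triggers the whole lineage graph as a single job.
clicks_by_country.write.mode("overwrite").parquet(
    "s3://my-data-lake/derived/clicks_by_country/2024-01-01/"
)
```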

Join Strategies in Batch Processing

Joins are the most expensive operation in batch processing. Unlike single-node databases where join algorithms have been optimized for decades, distributed joins must move data across the network. The choice of join strategy has a massive impact on job performance.

Sort-merge join

Both sides of the join are sorted by the join key, then merged. In MapReduce, this sorting happens during the shuffle, and the merge runs in the reduce phase.

Complexity: O(n log n) sort + O(n) merge

Best when: Large datasets where neither side fits in memory. Default in MapReduce.
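The following pure-Python sketch shows the shape of a sort-merge join on (key, value) pairs; it is a single-machine illustration of what the shuffle and reduce do across a cluster.

```python
from itertools import groupby
from operator import itemgetter

def sort_merge_join(left, right):
    """Equi-join two lists of (key, value) pairs via sort-merge:
    sort both sides by key, group, then merge in one linear pass."""
    left_groups = groupby(sorted(left, key=itemgetter(0)), key=itemgetter(0))
    right_groups = groupby(sorted(right, key=itemgetter(0)), key=itemgetter(0))

    l, r = next(left_groups, None), next(right_groups, None)
    while l is not None and r is not None:
        lkey, lvals = l
        rkey, rvals = r
        if lkey < rkey:
            l = next(left_groups, None)
        elif lkey > rkey:
            r = next(right_groups, None)
        else:
            # Matching key on both sides: emit the cross product of its values.
            lvals, rvals = list(lvals), list(rvals)
            for _, lv in lvals:
                for _, rv in rvals:
                    yield (lkey, lv, rv)
            l, r = next(left_groups, None), next(right_groups, None)

users = [("u1", "Alice"), ("u2", "Bob")]
orders = [("u1", "book"), ("u1", "pen"), ("u2", "mug")]
print(list(sort_merge_join(users, orders)))
# [('u1', 'Alice', 'book'), ('u1', 'Alice', 'pen'), ('u2', 'Bob', 'mug')]
```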

Broadcast hash join (map-side join)

Load the small side of the join entirely into a hash table in each mapper. Stream the large side through without any shuffle.

Complexity: O(n) โ€” no shuffle, no sort

Best when: One side of join is small enough to fit in memory of each mapper. Very fast.
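In Spark this is a one-line hint: `broadcast()` is the standard pyspark.sql function, while the tables and the join column here are invented.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("broadcast-join").getOrCreate()

clicks = spark.read.parquet("s3://my-data-lake/clicks/")        # large fact table
countries = spark.read.parquet("s3://my-data-lake/countries/")  # small dimension table

# The small table is shipped to every executor and kept as an in-memory hash
# table; the large table streams through local lookups with no shuffle.
joined = clicks.join(broadcast(countries), on="country_code", how="left")
joined.write.mode("overwrite").parquet("s3://my-data-lake/derived/clicks_with_country/")
```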

Partitioned hash join

Both inputs are partitioned by the join key. Each partition of the small input fits in memory. Join each pair of corresponding partitions.

Complexity: O(n) with partitioned parallelism

Best when: Both sides are too large to broadcast, but after partitioning, each partition of the smaller side fits in memory.
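A single-machine Python sketch of the idea (hash-partition both sides, then build a per-partition hash table on the smaller side); in a real engine each partition pair would be processed by a different task.

```python
from collections import defaultdict

def partitioned_hash_join(large, small, num_partitions=8):
    """Equi-join two lists of (key, value) pairs: hash-partition both sides
    on the key, then hash-join each pair of corresponding partitions."""
    def part(key):
        return hash(key) % num_partitions

    small_parts, large_parts = defaultdict(list), defaultdict(list)
    for key, value in small:
        small_parts[part(key)].append((key, value))
    for key, value in large:
        large_parts[part(key)].append((key, value))

    for p in range(num_partitions):
        # Only this partition of the smaller input needs to fit in memory.
        table = defaultdict(list)
        for key, value in small_parts[p]:
            table[key].append(value)
        for key, value in large_parts[p]:
            for match in table[key]:
                yield (key, value, match)

users = [("u1", "Alice"), ("u2", "Bob"), ("u3", "Cara")]   # smaller side
orders = [("u1", "book"), ("u3", "mug"), ("u3", "pen")]    # larger side
print(sorted(partitioned_hash_join(orders, users)))
# [('u1', 'book', 'Alice'), ('u3', 'mug', 'Cara'), ('u3', 'pen', 'Cara')]
```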

Batch Job Outputs

Batch jobs typically produce one of three output types. In all cases, the output is written to a new location atomically: the job either fully succeeds (and the new output becomes visible) or fully fails (and the old output remains). A sketch of this atomic swap pattern follows the three examples below.

🔍 Search Indexes

Lucene/Solr segments built from a full corpus scan. The resulting index files are written to a new location; the search service atomically switches to the new index. Enables full-text search over billions of documents.

📦 Key-Value Stores

Materialized views or denormalized lookup tables. The batch job computes the final key-value pairs and writes them directly into a read-optimized store (e.g., HFile files in HBase, SSTable files in Cassandra) with bulk loading for high throughput.

🤖 ML Models

Feature computation, model training, model evaluation โ€” all as batch steps. Output is a model artifact (weights, embeddings, decision tree) that a serving system loads. Model retraining pipelines run on a schedule or triggered by data drift.
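All three share the same publish step: build the artifact in a fresh location, then switch a pointer in one atomic operation. Below is a minimal sketch of that pattern using a symlink swap; the directory names are invented, and production systems often use a catalog or manifest file instead.

```python
import os
from pathlib import Path

def publish(new_version_dir: str, current_link: str = "serving/search_index/current") -> None:
    """Atomically repoint readers at a freshly built output directory."""
    tmp_link = current_link + ".tmp"
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    os.symlink(Path(new_version_dir).resolve(), tmp_link)
    # rename/replace is atomic on POSIX filesystems: readers following the
    # symlink see the old version or the new one, never a half-written mix.
    os.replace(tmp_link, current_link)

# publish("serving/search_index/2024-01-02T03-00-00")
```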

Mental Models

Immutable Inputs, Derived Outputs

Batch jobs take immutable input (e.g., yesterday's log files) and produce derived outputs (aggregates, indexes, models). If a job fails or produces bad output, you re-run it. You never corrupt the source. This human fault tolerance is a fundamental property of batch systems.

The Shuffle is the Bottleneck

In MapReduce and similar systems, the shuffle (sorting and transferring map output to reducers) is the most expensive operation: network-intensive, disk-intensive, and requiring full barrier synchronization. Dataflow engines reduce shuffle cost by avoiding unnecessary sorts and keeping intermediate data in memory.

MapReduce is a Distributed Sort Machine

MapReduce is essentially a distributed sort utility. The "word count" example obscures this: the map-then-sort-then-reduce pattern can express any per-key aggregation, as long as the key space can be sorted. This is why Hive and Pig originally compiled everything to MapReduce jobs.
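The classic word count, written as explicit map, sort, and reduce steps in plain Python, makes that sort-centric structure visible:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(documents):
    """Mapper: emit (word, 1) for every word in every document."""
    for doc in documents:
        for word in doc.lower().split():
            yield (word, 1)

def reduce_phase(sorted_pairs):
    """Reducer: sum the values for each run of identical keys."""
    for word, pairs in groupby(sorted_pairs, key=itemgetter(0)):
        yield (word, sum(count for _, count in pairs))

docs = ["the cat sat", "the cat ran"]
# The sort between map and reduce is exactly MapReduce's shuffle phase:
# it brings all occurrences of a key next to each other.
shuffled = sorted(map_phase(docs), key=itemgetter(0))
print(dict(reduce_phase(shuffled)))  # {'cat': 2, 'ran': 1, 'sat': 1, 'the': 2}
```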

Object Store + Compute Separation is the New Normal

The original Hadoop model (HDFS + co-located MapReduce) assumed compute locality was critical. In practice, modern networks (10–100 Gbps within a datacenter) make the cost of reading from S3/GCS acceptable. The industry has largely shifted to separated compute (Spark on EKS, Dataproc) plus object-store storage.

Batch Outputs Are Immutable Snapshots

Batch job outputs (search indexes, derived datasets, ML models) are typically written atomically: the new output replaces the old one as a single operation. Readers see either the old output or the new output, never a partial state. This contrasts sharply with streaming systems, where state is continuously mutated.

Workflow Schedulers Are DAG Engines

Airflow, Dagster, and Prefect represent batch workflows as DAGs (directed acyclic graphs) of tasks. Each task has dependencies; the scheduler runs tasks as soon as their dependencies are complete. This is essentially the same model as Spark's lineage graph, just externalized to a workflow layer.
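A minimal Airflow sketch of that model (Airflow 2.x style; the task bodies and DAG name are placeholders):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("read raw events from object storage")

def transform():
    print("compute daily aggregates")

def load():
    print("publish the derived dataset")

with DAG(
    dag_id="daily_clickstream",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)

    # >> declares DAG edges; the scheduler starts a task once all of its
    # upstream dependencies have succeeded.
    t_extract >> t_transform >> t_load
```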

DESIGNING DATA-INTENSIVE APPLICATIONS · 2ND EDITION

This article is part of an ongoing series distilling DDIA 2nd ed. into structured, audience-aware reading notes.

โ† Ch.10: Consistency and ConsensusCh.12: Stream Processing โ†’