Batch & Stream Processing

🟠 P1 — two paradigms for processing large volumes of data

Problem

Data processing at scale falls into two patterns: process everything at once (batch) or process events as they arrive (stream). The choice affects latency, complexity, and failure recovery.

Comparison

Dimension	Batch Processing	Stream Processing
Latency	Minutes to hours	Milliseconds to seconds
Input	Bounded dataset	Unbounded, continuous events
Typical tools	MapReduce, Spark	Kafka Streams, Flink, Storm
Fault tolerance	Re-run the job	Checkpointing, exactly-once semantics
Use cases	ETL, analytics, ML training	Real-time metrics, fraud detection
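
To make the bounded versus unbounded distinction concrete, here is a minimal sketch in plain Python rather than any of the frameworks named in the table. The event shape `(user_id, amount)` and the function names are assumptions made for illustration. The same aggregation is written twice: once over a complete dataset, and once as a generator that emits an updated result after every event.

```python
from typing import Iterable, Iterator

# Hypothetical event shape for this sketch: (user_id, amount).
Event = tuple[str, int]


def batch_totals(events: list[Event]) -> dict[str, int]:
    """Batch: the input is bounded, so return one final answer after reading all of it."""
    totals: dict[str, int] = {}
    for user, amount in events:
        totals[user] = totals.get(user, 0) + amount
    return totals


def stream_totals(events: Iterable[Event]) -> Iterator[dict[str, int]]:
    """Stream: the input may never end, so yield a snapshot after each event."""
    totals: dict[str, int] = {}
    for user, amount in events:
        totals[user] = totals.get(user, 0) + amount
        yield dict(totals)  # copy, so earlier snapshots are not mutated later


events = [("alice", 5), ("bob", 3), ("alice", 2)]
batch_result = batch_totals(events)
stream_results = list(stream_totals(iter(events)))
```

Once the stream has consumed the same events, its last snapshot matches the batch result. The difference is that intermediate answers were available after every event instead of only at the end.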

Architectures

Lambda Architecture

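Run batch AND stream in parallel over the same input. The batch layer recomputes accurate results from the full dataset; the stream (speed) layer gives low-latency results for recent data. Merge the two at query time. Problem: maintaining two codepaths that must produce the same results.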
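
The merge step can be sketched as follows. This is a toy in-memory model, not a production design: the event shape `(timestamp, key, count)`, the `CUTOFF` boundary, and the names `build_batch_view`, `SpeedView`, and `query` are all assumptions made for illustration. The batch view covers everything before the cutoff, the speed view covers everything at or after it, and a query adds the two.

```python
# Hypothetical event shape for this sketch: (timestamp, key, count).
Event = tuple[int, str, int]


def build_batch_view(log: list[Event], cutoff: int) -> dict[str, int]:
    """Batch layer: recompute totals from the full log for events before the cutoff."""
    view: dict[str, int] = {}
    for ts, key, n in log:
        if ts < cutoff:
            view[key] = view.get(key, 0) + n
    return view


class SpeedView:
    """Speed layer: incrementally updated with events at or after the cutoff."""

    def __init__(self, cutoff: int) -> None:
        self.cutoff = cutoff
        self.counts: dict[str, int] = {}

    def on_event(self, event: Event) -> None:
        ts, key, n = event
        if ts >= self.cutoff:
            self.counts[key] = self.counts.get(key, 0) + n


def query(key: str, batch_view: dict[str, int], speed_view: SpeedView) -> int:
    """Serving step: merge the batch result with the recent stream result."""
    return batch_view.get(key, 0) + speed_view.counts.get(key, 0)


log = [(1, "clicks", 10), (2, "clicks", 5), (3, "views", 7), (4, "clicks", 1)]
CUTOFF = 3  # the batch job has processed everything before this timestamp

batch_view = build_batch_view(log, CUTOFF)
speed_view = SpeedView(CUTOFF)
for event in log:
    speed_view.on_event(event)

total_clicks = query("clicks", batch_view, speed_view)
```

Even in this toy version the counting logic appears twice, once in `build_batch_view` and once in `SpeedView.on_event`. Keeping those two implementations in agreement is the maintenance cost described above.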

Kappa Architecture

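Stream only: one codepath, with a durable event log as the source of truth. Reprocess by replaying the event log from the start through the new version of the job. Problem: replaying large logs is slow and needs efficient log storage (e.g. Kafka with long retention).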
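
Replay-based reprocessing can be sketched as follows. This is a simplified in-memory model: `EventLog` is a hypothetical stand-in for a retained, append-only log and is not Kafka's API; `run_job`, `count_v1`, and `count_v2` are likewise names invented for this example. Deploying new logic means running it from offset 0 into fresh state, then continuing from the offset it reached.

```python
from typing import Callable, Iterator, Optional

Event = dict  # hypothetical shape: {"page": str, "bot": bool (optional)}
State = dict[str, int]


class EventLog:
    """In-memory stand-in for a retained, append-only log (not a real Kafka client)."""

    def __init__(self) -> None:
        self._events: list[Event] = []

    def append(self, event: Event) -> int:
        self._events.append(event)
        return len(self._events) - 1  # offset of the appended event

    def read_from(self, offset: int) -> Iterator[tuple[int, Event]]:
        for i in range(offset, len(self._events)):
            yield i, self._events[i]


def run_job(log: EventLog, logic: Callable[[State, Event], None],
            start_offset: int = 0, state: Optional[State] = None) -> tuple[State, int]:
    """Apply `logic` to every event from start_offset; return state and the next offset."""
    state = {} if state is None else state
    next_offset = start_offset
    for offset, event in log.read_from(start_offset):
        logic(state, event)
        next_offset = offset + 1
    return state, next_offset


def count_v1(state: State, event: Event) -> None:
    """Original logic: count every page view."""
    state[event["page"]] = state.get(event["page"], 0) + 1


def count_v2(state: State, event: Event) -> None:
    """New logic: same count, but ignore events flagged as bot traffic."""
    if not event.get("bot", False):
        state[event["page"]] = state.get(event["page"], 0) + 1


log = EventLog()
for e in [{"page": "/home"}, {"page": "/home", "bot": True}, {"page": "/docs"}]:
    log.append(e)

v1_state, v1_offset = run_job(log, count_v1)
# Reprocess: replay the whole log through the new logic into fresh state.
v2_state, v2_offset = run_job(log, count_v2)
# After the replay catches up, the same job keeps consuming new events.
log.append({"page": "/docs"})
v2_state, v2_offset = run_job(log, count_v2, start_offset=v2_offset, state=v2_state)
```

The same `run_job` function serves both the backfill and live processing, so there is only one codepath. The cost is that the backfill has to read the entire retained log, which is why replay speed and log storage matter.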

Instinct

Kappa is the modern default. Kafka’s log retention plus stream processing frameworks make batch-like reprocessing possible within the stream paradigm. Lambda is legacy: maintain two systems only if you have a strong reason. For interviews, know the trade-off and express a preference.

References

Questioning the Lambda Architecture — Jay Kreps
Apache Flink Architecture

DDIA 2e Reference

Chapter 10: Batch Processing
Chapter 11: Stream Processing