Batch & Stream Processing

🟠 P1 — two paradigms for processing large volumes of data

Problem #

Data processing at scale falls into two patterns: process everything at once (batch) or process events as they arrive (stream). The choice affects latency, complexity, and failure recovery.
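As a rough illustration of the two patterns, the sketch below computes the same per-event-type count both ways. The function names `batch_count` and `stream_count` and the sample events are made up for this example; real systems would read from files or a message broker rather than an in-memory list.

```python
from collections import Counter
from typing import Iterable, Iterator


def batch_count(events: Iterable[str]) -> Counter:
    """Batch: the whole bounded dataset exists before processing starts."""
    return Counter(events)


def stream_count(events: Iterable[str]) -> Iterator[Counter]:
    """Stream: update state per event and emit a fresh result each time."""
    counts: Counter = Counter()
    for event in events:
        counts[event] += 1
        yield counts.copy()  # a snapshot is available after every event


# Hypothetical sample input.
events = ["login", "click", "click", "logout"]

batch_result = batch_count(events)                 # one answer, at the end
stream_results = list(stream_count(iter(events)))  # one answer per event
```

The batch version returns nothing until the full input has been read, while the stream version has an up-to-date answer after every event. Once the input is exhausted, both agree.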

Comparison #

| Dimension | Batch Processing | Stream Processing |
| --- | --- | --- |
| Latency | Minutes to hours | Seconds to milliseconds |
| Input | Bounded dataset | Unbounded, continuous |
| Processing | MapReduce, Spark | Kafka Streams, Flink, Storm |
| Fault tolerance | Re-run the job | Checkpointing, exactly-once semantics |
| Use case | ETL, analytics, ML training | Real-time metrics, fraud detection |
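The fault-tolerance row can be made concrete with a toy checkpointing loop. This is a minimal sketch, not how any particular framework does it: the `store` dict stands in for durable storage, `SimulatedCrash` and `run` are hypothetical names, and the key assumption is that the offset and the running state are saved together so a restart never double-counts.

```python
from typing import Dict, List, Optional, Tuple


class SimulatedCrash(Exception):
    """Stands in for a process failure in this toy example."""


def run(log: List[int], store: Dict[str, Tuple[int, int]],
        every: int = 2, crash_at: Optional[int] = None) -> int:
    """Sum the log, checkpointing (offset, total) every `every` events."""
    # Resume from the last checkpoint, or start from scratch.
    offset, total = store.get("checkpoint", (0, 0))
    while offset < len(log):
        if crash_at is not None and offset == crash_at:
            raise SimulatedCrash(f"crashed before event {offset}")
        total += log[offset]
        offset += 1
        if offset % every == 0:
            # Offset and state are written as one unit (assumed atomic).
            store["checkpoint"] = (offset, total)
    store["checkpoint"] = (offset, total)
    return total


log = [5, 1, 4, 2, 3]
store: Dict[str, Tuple[int, int]] = {}

try:
    run(log, store, every=2, crash_at=3)  # in-memory progress past offset 2 is lost
except SimulatedCrash:
    pass

checkpoint_after_crash = store["checkpoint"]
recovered_total = run(log, store, every=2)  # resumes from the checkpoint
```

A batch job would recover by re-running over the whole input. The stream job only reprocesses the events since the last checkpoint, at the cost of having to persist state and position consistently.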

Architectures #

Lambda Architecture #

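Run batch and stream pipelines in parallel over the same input. The batch layer periodically recomputes accurate views; the stream (speed) layer covers recent events with low latency. Queries merge the two results. Problem: maintaining two codepaths that must produce the same results.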
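The merge step can be sketched as follows. Everything here is illustrative: the event shape (`ts`, `page`), the cutoff timestamp, and the function names are assumptions for the example, and a real speed layer would update its view incrementally rather than rescan events.

```python
from collections import Counter
from typing import Dict, List

Event = Dict[str, object]


def batch_view(master_log: List[Event], cutoff: int) -> Counter:
    """Batch layer: recompute page counts from all events up to the cutoff."""
    return Counter(e["page"] for e in master_log if e["ts"] <= cutoff)


def speed_view(master_log: List[Event], cutoff: int) -> Counter:
    """Speed layer: cover only events the last batch run has not seen."""
    return Counter(e["page"] for e in master_log if e["ts"] > cutoff)


def query(batch: Counter, speed: Counter) -> Counter:
    """Serving layer: merge the accurate-but-stale and fresh views."""
    return batch + speed


# Hypothetical events; the last batch run covered timestamps <= 2.
master_log = [
    {"ts": 1, "page": "/home"},
    {"ts": 2, "page": "/docs"},
    {"ts": 3, "page": "/home"},
    {"ts": 4, "page": "/home"},
]
cutoff = 2

merged = query(batch_view(master_log, cutoff), speed_view(master_log, cutoff))
```

Note that the counting logic appears twice, once per layer. In a real deployment those would be two separate implementations on two frameworks, which is where the duplicated-codepath problem comes from.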

Kappa Architecture #

Stream only. Reprocess by replaying the event log. Problem: replaying large logs is slow; need efficient log storage (Kafka with long retention).
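Reprocessing by replay can be sketched like this: the log is kept, and a changed handler is run over it from the start to build a new view. The `replay` helper, the two handlers, and the event fields (`user`, `bot`) are hypothetical; a plain list stands in for a retained Kafka topic.

```python
from typing import Callable, Dict, List

Event = Dict[str, object]
State = Dict[str, int]


def replay(log: List[Event], handler: Callable[[State, Event], None],
           from_offset: int = 0) -> State:
    """Rebuild a view by feeding every retained event through the handler."""
    state: State = {}
    for event in log[from_offset:]:
        handler(state, event)
    return state


def count_v1(state: State, event: Event) -> None:
    """Original logic: count every event per user."""
    user = str(event["user"])
    state[user] = state.get(user, 0) + 1


def count_v2(state: State, event: Event) -> None:
    """Revised logic: same count, but ignore events flagged as bots."""
    if not event.get("bot"):
        count_v1(state, event)


# Hypothetical retained event log.
log = [
    {"user": "ana", "bot": False},
    {"user": "crawler", "bot": True},
    {"user": "ana", "bot": False},
    {"user": "raj", "bot": False},
]

view_v1 = replay(log, count_v1)
view_v2 = replay(log, count_v2)  # "reprocessing" is just another replay
```

There is only one codepath per version of the logic. The cost is that every replay reads the full retained log, which is why long logs make reprocessing slow.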

Instinct #

Kappa is the modern default. Kafka's log retention combined with stream processing frameworks makes batch-like reprocessing possible within the stream paradigm. Lambda is legacy: maintain two systems only if you have a strong reason. For interviews, know the trade-off and express a preference.

References #

DDIA 2e Reference #

  • Chapter 10: Batch Processing
  • Chapter 11: Stream Processing