Batch & Stream Processing
🟠P1 — two paradigms for processing large volumes of data
Problem #
Data processing at scale falls into two patterns: process everything at once (batch) or process events as they arrive (stream). The choice affects latency, complexity, and failure recovery.
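As a rough illustration of the difference, the sketch below computes the same per-event-type count both ways. The function names and the sample events are made up for this note; real systems would read from files or a message broker rather than an in-memory list.

```python
from collections import Counter
from typing import Iterable, Iterator


def batch_count(events: list[str]) -> Counter:
    """Batch: the whole bounded dataset is available, so compute one final result."""
    return Counter(events)


def stream_count(events: Iterable[str]) -> Iterator[Counter]:
    """Stream: update running state per event and emit a snapshot after each one."""
    counts: Counter = Counter()
    for event in events:
        counts[event] += 1
        yield counts.copy()


events = ["login", "click", "click", "logout"]  # hypothetical sample data

final = batch_count(events)                 # one answer, available only at the end
running = list(stream_count(iter(events)))  # one answer per event, available immediately
```

The last streaming snapshot equals the batch result; the streaming version simply makes intermediate answers available earlier, at the cost of keeping state between events.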
Comparison #
| Dimension | Batch Processing | Stream Processing |
|---|---|---|
| Latency | Minutes to hours | Milliseconds to seconds |
| Input | Bounded dataset | Unbounded, continuous |
| Typical frameworks | Hadoop MapReduce, Spark | Kafka Streams, Flink, Storm |
| Fault tolerance | Re-run the job | Checkpointing, exactly-once semantics |
| Use case | ETL, analytics, ML training | Real-time metrics, fraud detection |
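The fault-tolerance row can be made concrete with a toy checkpoint loop. This is a minimal sketch, not how any particular framework implements it: `Checkpoint`, `run`, and the `crash_at` parameter are hypothetical names, and the key assumption is that the input offset and the operator state are saved together so a restart neither skips nor double-counts events.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Checkpoint:
    offset: int = 0  # index of the next event to process
    total: int = 0   # operator state, saved together with the offset


def run(log: list[int], ckpt: Checkpoint, every: int = 2,
        crash_at: Optional[int] = None) -> Checkpoint:
    """Sum the log starting from a checkpoint; checkpoint every `every` events."""
    offset, total = ckpt.offset, ckpt.total
    saved = ckpt
    while offset < len(log):
        if crash_at is not None and offset == crash_at:
            return saved  # simulated crash: progress since the last checkpoint is lost
        total += log[offset]
        offset += 1
        if offset % every == 0:
            saved = Checkpoint(offset, total)  # offset and state saved as one unit
    return Checkpoint(offset, total)


log = [1, 2, 3, 4, 5]
crashed = run(log, Checkpoint(), crash_at=3)  # falls back to the checkpoint at offset 2
resumed = run(log, crashed)                   # replays from offset 2, no double counting
```

A batch job would instead discard partial output and re-run from the start, which is simpler but wastes the work already done.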
Architectures #
Lambda Architecture #
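Run batch AND stream pipelines in parallel over the same input. The batch layer periodically recomputes accurate results from the full dataset; the stream (speed) layer covers recent data with low latency. A serving layer merges the two at query time. Problem: maintaining two codepaths that must produce the same results.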
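A minimal sketch of the merge step, assuming a page-view count and an integer timestamp cutoff up to which the batch layer has processed. All names here (`batch_layer`, `SpeedLayer`, `serve`) are illustrative, not taken from any framework.

```python
from collections import Counter

Event = tuple[int, str]  # (timestamp, page); hypothetical event shape


def batch_layer(master_log: list[Event], cutoff: int) -> Counter:
    """Batch codepath: recompute counts from the full log up to the cutoff."""
    return Counter(page for ts, page in master_log if ts <= cutoff)


class SpeedLayer:
    """Stream codepath: incrementally count only events newer than the cutoff."""

    def __init__(self, cutoff: int) -> None:
        self.cutoff = cutoff
        self.counts: Counter = Counter()

    def on_event(self, event: Event) -> None:
        ts, page = event
        if ts > self.cutoff:
            self.counts[page] += 1


def serve(page: str, batch_view: Counter, speed_view: Counter) -> int:
    """Serving layer: merge the two views at query time."""
    return batch_view[page] + speed_view[page]


events = [(1, "home"), (2, "home"), (3, "about"), (4, "home")]
cutoff = 2

batch_view = batch_layer(events, cutoff)
speed = SpeedLayer(cutoff)
for e in events:
    speed.on_event(e)

home_views = serve("home", batch_view, speed.counts)
```

Note that the counting logic exists twice, once in `batch_layer` and once in `SpeedLayer.on_event`. Any change to the definition of a "view" has to be made in both places, which is the maintenance cost described above.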
Kappa Architecture #
Stream only. Reprocess by replaying the event log through the (updated) stream job. Problem: replaying large logs is slow, and it needs log storage that retains the full history efficiently (e.g. Kafka with long retention).
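Reprocessing by replay can be sketched with an in-memory list standing in for the retained log. The event shape and the two processor versions below are assumptions made for illustration; the point is that a logic change is rolled out by running the new code over the same log from the beginning, not by maintaining a separate batch job.

```python
from typing import Callable

Event = dict
State = dict


def replay(log: list[Event], process: Callable[[State, Event], None]) -> State:
    """Rebuild a derived view by feeding every retained event through `process`."""
    state: State = {}
    for event in log:
        process(state, event)
    return state


def totals_v1(state: State, event: Event) -> None:
    """Original logic: sum all amounts per user."""
    user = event["user"]
    state[user] = state.get(user, 0) + event["amount"]


def totals_v2(state: State, event: Event) -> None:
    """Updated logic (hypothetical requirement change): ignore non-positive amounts."""
    if event["amount"] > 0:
        totals_v1(state, event)


# Stand-in for a retained, append-only event log.
log = [
    {"user": "a", "amount": 10},
    {"user": "b", "amount": 5},
    {"user": "a", "amount": -3},
]

live_view = replay(log, totals_v1)     # view produced by the current job
rebuilt_view = replay(log, totals_v2)  # new view built by replaying with new logic
```

In practice the rebuilt view would be written to a new output and swapped in once the replay catches up with the head of the log; the cost of that replay grows with the log, which is the slowness noted above.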
Instinct #
Kappa is the modern default. Kafka’s log retention plus stream processing frameworks make batch-like reprocessing possible within the stream paradigm. Lambda is legacy: maintain two systems only if you have a strong reason. For interviews, know the trade-off and express a preference.
References #
DDIA 2e Reference #
- Chapter 10: Batch Processing
- Chapter 11: Stream Processing