Circuit Breaker
🔴 P0 — prevents cascading failures by stopping calls to a failing service
Problem
When a downstream service is failing, continuing to send requests wastes resources, increases latency (waiting for timeouts), and can cause cascading failures upstream.
Mechanism
State Machine:
- CLOSED ──(failures exceed threshold)──→ OPEN
- OPEN ──(timer expires)──→ HALF-OPEN
- HALF-OPEN ──(probe succeeds)──→ CLOSED
- HALF-OPEN ──(probe fails)──→ OPEN

States:
- Closed: Normal operation. Track failure rate.
- Open: All requests fail immediately (no downstream call). Return fallback or error.
- Half-Open: Allow one probe request through. If it succeeds → Closed. If it fails → Open.
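
The three states above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production implementation: it counts consecutive failures instead of tracking a failure rate, it is single-threaded (so "one probe" is not enforced under concurrency), and the names `CircuitBreaker`, `failure_threshold`, `reset_timeout`, and `clock` are hypothetical rather than taken from any library.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half-open"


class CircuitOpenError(Exception):
    """Raised when a call is rejected without touching the downstream service."""


class CircuitBreaker:
    # Assumption: a consecutive-failure count stands in for a failure rate.
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock  # injectable so tests can control time
        self.state = State.CLOSED
        self.failures = 0
        self.opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.state is State.OPEN:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit is open")  # fail fast
            self.state = State.HALF_OPEN  # timer expired: let a probe through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        # A success in CLOSED or HALF_OPEN resets the breaker.
        self.failures = 0
        self.state = State.CLOSED

    def _on_failure(self):
        self.failures += 1
        # A failed probe reopens immediately; otherwise wait for the threshold.
        if self.state is State.HALF_OPEN or self.failures >= self.failure_threshold:
            self.state = State.OPEN
            self.opened_at = self.clock()
```

Callers would wrap each downstream call as `breaker.call(fetch_profile, user_id)` (where `fetch_profile` is a placeholder) and catch `CircuitOpenError` to return a fallback.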
Key Trade-offs
- Threshold tuning: Too sensitive → false positives (circuit opens on transient errors). Too lax → slow detection.
- Fallback quality: When circuit is open, what do you return? Cached data? Degraded response? Error? The fallback design is as important as the circuit breaker itself.
- Granularity: Per-service? Per-endpoint? Per-host? Finer granularity avoids over-broad circuit opening but requires more state.
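
The granularity trade-off can be made concrete with a small sketch in which the breaker key decides the scope. Everything here is hypothetical (`BreakerRegistry`, the key functions, the service and endpoint names), and the breaker is reduced to a failure counter to keep the focus on keying.

```python
from dataclasses import dataclass


@dataclass
class BreakerState:
    failures: int = 0
    open: bool = False


class BreakerRegistry:
    """One breaker state per key; the key function sets the granularity."""

    def __init__(self, key_fn, failure_threshold=3):
        self.key_fn = key_fn
        self.failure_threshold = failure_threshold
        self.states = {}

    def record_failure(self, service, endpoint, host):
        key = self.key_fn(service, endpoint, host)
        state = self.states.setdefault(key, BreakerState())
        state.failures += 1
        if state.failures >= self.failure_threshold:
            state.open = True

    def is_open(self, service, endpoint, host):
        state = self.states.get(self.key_fn(service, endpoint, host))
        return state is not None and state.open


def per_service(service, endpoint, host):
    return service


def per_endpoint(service, endpoint, host):
    return (service, endpoint)


def per_host(service, endpoint, host):
    return (service, host)
```

With `per_service`, three failures on one endpoint also block every other endpoint of that service. With `per_endpoint` they do not, at the cost of one state entry per endpoint.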
Instinct
Circuit breaker + timeout + retry form a resilience trinity. Timeout detects slow calls, retry handles transient failures, circuit breaker prevents sustained hammering of a dead service.
In interviews, describe all three together:
Each call has a 500ms timeout. On timeout, retry once with jitter. If failure rate exceeds 50% over 10 seconds, circuit opens for 30 seconds, returning cached data as fallback.
- INTERVIEW: Circuit breakers come up when the interviewer deep-dives on reliability and failure modes, external API calls to third parties, DB connections that might time out, inter-service communication in microservices, or any resource-intensive operation that could become a bottleneck.
- EXP: Cascading failures from a missing circuit breaker are one of the most common production incidents I’ve seen. The pattern pays for itself the first time a downstream service goes down under load.
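
The policy quoted above (500ms timeout, one retry with jitter, open at more than 50% failures over 10 seconds, stay open for 30 seconds, cached fallback) could be wired together roughly as follows. This is a sketch under several assumptions: `fetch` is a hypothetical callable that honors its `timeout` argument and raises `TimeoutError` when slow, `MIN_CALLS` is an added guard that the quoted policy does not mention, the jitter range is arbitrary, and the code is single-threaded.

```python
import random
import time
from collections import deque

TIMEOUT_S = 0.5      # each call has a 500 ms timeout
WINDOW_S = 10.0      # failure rate is measured over 10 seconds
FAILURE_RATE = 0.5   # circuit opens above 50% failures
OPEN_S = 30.0        # circuit stays open for 30 seconds
MIN_CALLS = 4        # assumption: avoids opening on a single early failure


class ResilientClient:
    def __init__(self, fetch, clock=time.monotonic, sleep=time.sleep):
        self.fetch = fetch        # hypothetical: fetch(timeout=...) raises TimeoutError
        self.clock = clock
        self.sleep = sleep
        self.outcomes = deque()   # (timestamp, succeeded) pairs inside the window
        self.opened_at = None     # None means the circuit is not open
        self.probing = False      # True while half-open
        self.cached = None        # last-known-good value, used as the fallback

    def _record(self, ok):
        now = self.clock()
        self.outcomes.append((now, ok))
        while now - self.outcomes[0][0] > WINDOW_S:
            self.outcomes.popleft()
        failures = sum(1 for _, succeeded in self.outcomes if not succeeded)
        rate_exceeded = (len(self.outcomes) >= MIN_CALLS
                         and failures / len(self.outcomes) > FAILURE_RATE)
        if not ok and (self.probing or rate_exceeded):
            self.opened_at = now  # open (or reopen after a failed probe)
        self.probing = False

    def get(self):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < OPEN_S:
                return self.cached            # open: skip the downstream call
            self.opened_at = None             # timer expired: half-open
            self.probing = True
            self.outcomes.clear()
        attempts = 1 if self.probing else 2   # a probe is not retried
        for attempt in range(attempts):
            try:
                value = self.fetch(timeout=TIMEOUT_S)
            except TimeoutError:
                self._record(False)
                if self.opened_at is not None or attempt == attempts - 1:
                    break
                self.sleep(random.uniform(0.0, 0.1))  # jitter before the single retry
            else:
                self._record(True)
                self.cached = value
                return value
        return self.cached
```

The division of labour matches the trinity: the timeout bounds each attempt, the retry absorbs a one-off failure, and the sliding window stops further calls once failures are sustained.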
Fallback Mechanisms
| Fallback Strategy | Mechanism | When appropriate |
|---|---|---|
| Cached response | Return last-known-good value | Data staleness acceptable (product catalog) |
| Default response | Return static/hardcoded result | Feature works with defaults (empty recs) |
| Degraded service | Call simpler backup service | Critical path with a simpler alternative |
| Fail fast with error | Return 503 immediately | Caller handles errors (batch jobs) |
| Queue for later | Buffer request, process when up | Writes that tolerate delay |
Instinct: “What’s your fallback?” is the most important follow-up question when someone proposes a circuit breaker. A circuit breaker that returns 503 is barely better than a timeout — the value comes from intelligent degradation.
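
As an illustration of three rows from the table (cached response, default response, queue for later), here is a minimal sketch. `CircuitOpen` is a hypothetical exception standing in for whatever signal the breaker raises when it rejects a call, and the helper names are made up for this example.

```python
from collections import deque


class CircuitOpen(Exception):
    """Hypothetical signal that the breaker rejected the call."""


def read_with_fallback(call, cache, key, default):
    """Cached response first, then a default response, when the call is rejected."""
    try:
        value = call(key)
    except CircuitOpen:
        return cache.get(key, default)
    cache[key] = value  # remember the last-known-good value
    return value


def write_or_queue(call, pending, payload):
    """Queue for later: buffer the write while the circuit is open."""
    try:
        call(payload)
    except CircuitOpen:
        pending.append(payload)
        return "queued"
    return "sent"


def drain(call, pending):
    """Replay buffered writes in order once the downstream service is back."""
    while pending:
        call(pending[0])   # if this raises, the item stays queued
        pending.popleft()
```

Reads degrade to stale or default data, while writes are delayed rather than dropped, which only suits writes that tolerate delay.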
References
- Release It! (2nd Edition) — Michael Nygard; the book that introduced Circuit Breaker and Bulkhead
- Resilience4j: Circuit Breaker — modern JVM implementation
DDIA 2e Reference
- Chapter 8: Detecting and handling failures