Circuit Breaker

🔴 P0 — prevents cascading failures by stopping calls to a failing service

Problem

When a downstream service is failing, continuing to send requests wastes resources, increases latency (waiting for timeouts), and can cause cascading failures upstream.

Mechanism

State Machine:

  CLOSED ──(failures exceed threshold)──→ OPEN
    ↑                                       │
    │                              (timer expires)
    │                                       ↓
    └──(probe succeeds)──────────── HALF-OPEN
                                     │
                              (probe fails)
                                     │
                                     ↓
                                   OPEN

Closed: Normal operation. Track failure rate.
Open: All requests fail immediately (no downstream call). Return fallback or error.
Half-Open: Allow one probe request through. If it succeeds → Closed. If it fails → Open.
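
The three states above can be sketched in a few lines of Python. This is a minimal, single-threaded illustration, not a production implementation: the class name, the default values, and the injectable `clock` parameter are all assumptions made for the example, and it counts consecutive failures rather than a true failure rate to keep the state machine easy to follow.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitOpenError(Exception):
    """Raised when a call is rejected without touching the downstream service."""


class CircuitBreaker:
    # failure_threshold and open_seconds are illustrative defaults, not recommendations.
    def __init__(self, failure_threshold=3, open_seconds=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.open_seconds = open_seconds
        self.clock = clock
        self.state = State.CLOSED
        self.failures = 0       # consecutive failures seen while closed
        self.opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.state is State.OPEN:
            if self.clock() - self.opened_at < self.open_seconds:
                raise CircuitOpenError("circuit is open")  # fail fast, no downstream call
            self.state = State.HALF_OPEN  # timer expired: let one probe through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        # A success in CLOSED resets the count; a successful probe closes the circuit.
        self.failures = 0
        self.state = State.CLOSED

    def _on_failure(self):
        self.failures += 1
        # A failed probe reopens immediately; otherwise open once the threshold is hit.
        if self.state is State.HALF_OPEN or self.failures >= self.failure_threshold:
            self.state = State.OPEN
            self.opened_at = self.clock()
```

Callers wrap the downstream call as `breaker.call(fetch_profile, user_id)` (where `fetch_profile` is a hypothetical function) and treat `CircuitOpenError` as the signal to use a fallback. A concurrent version would also need a lock and a guard so that only one probe is in flight during Half-Open.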

Key Trade-offs

Threshold tuning: Too sensitive → false positives (circuit opens on transient errors). Too lax → slow detection.
Fallback quality: When circuit is open, what do you return? Cached data? Degraded response? Error? The fallback design is as important as the circuit breaker itself.
Granularity: Per-service? Per-endpoint? Per-host? Finer granularity avoids over-broad circuit opening but requires more state.
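
To make the threshold and granularity trade-offs concrete, here is a rough sketch of a sliding-window failure tracker. The `min_calls` guard is an assumption that is not part of the notes above; it is one common way to stop a single transient error from tripping the circuit on low traffic. The class name, the defaults, and the `("payments", "/charge")` key are hypothetical.

```python
from collections import defaultdict, deque


class FailureWindow:
    def __init__(self, window_seconds=10.0, rate_threshold=0.5, min_calls=20):
        self.window_seconds = window_seconds
        self.rate_threshold = rate_threshold
        self.min_calls = min_calls   # too low: false positives; too high: slow detection
        self.events = deque()        # (timestamp, succeeded) pairs, oldest first

    def record(self, now, succeeded):
        self.events.append((now, succeeded))
        self._evict(now)

    def _evict(self, now):
        # Drop outcomes that have aged out of the window.
        while self.events and now - self.events[0][0] > self.window_seconds:
            self.events.popleft()

    def should_open(self, now):
        self._evict(now)
        total = len(self.events)
        if total < self.min_calls:
            return False             # not enough evidence yet
        failures = sum(1 for _, ok in self.events if not ok)
        return failures / total > self.rate_threshold


# Granularity is the choice of key: per-service would key on "payments" alone,
# per-endpoint on ("payments", "/charge"), per-host on the host name.
windows = defaultdict(FailureWindow)
windows[("payments", "/charge")].record(now=0.0, succeeded=False)
```

Each extra dimension in the key multiplies the number of windows held in memory, which is the "more state" cost mentioned above.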

Instinct

Circuit breaker + timeout + retry form a resilience trinity. Timeout detects slow calls, retry handles transient failures, circuit breaker prevents sustained hammering of a dead service.

In interviews, describe all three together:

Each call has a 500ms timeout. On timeout, retry once with jitter. If failure rate exceeds 50% over 10 seconds, circuit opens for 30 seconds, returning cached data as fallback.
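
The numbers in that answer can be wired together roughly as follows. This is a sketch under several assumptions: `fetch(timeout)` is a hypothetical callable that raises `TimeoutError` when the call is slow, the 100 ms jitter bound and `MIN_CALLS` are made up for the example, and the breaker is single-threaded.

```python
import random
import time
from collections import deque

TIMEOUT_S = 0.5        # per-call timeout
WINDOW_S = 10.0        # failure-rate window
RATE_THRESHOLD = 0.5   # open above 50% failures
OPEN_S = 30.0          # how long the circuit stays open
MIN_CALLS = 4          # assumption, not in the notes: ignore tiny samples


class Breaker:
    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.events = deque()   # (timestamp, succeeded)
        self.opened_at = None   # None means closed

    def allow(self):
        if self.opened_at is None:
            return True
        return self.clock() - self.opened_at >= OPEN_S  # timer expired: allow a probe

    def record(self, succeeded):
        now = self.clock()
        if self.opened_at is not None:        # this call was the half-open probe
            self.opened_at = None if succeeded else now
            self.events.clear()
            return
        self.events.append((now, succeeded))
        while now - self.events[0][0] > WINDOW_S:
            self.events.popleft()
        failures = sum(1 for _, ok in self.events if not ok)
        if len(self.events) >= MIN_CALLS and failures / len(self.events) > RATE_THRESHOLD:
            self.opened_at = now


def call_with_resilience(fetch, breaker, cache, key, sleep=time.sleep):
    for attempt in range(2):                  # first try plus one retry
        if not breaker.allow():
            break                             # circuit open: skip the downstream call
        try:
            value = fetch(TIMEOUT_S)
        except TimeoutError:
            breaker.record(False)
            if attempt == 0:
                sleep(random.uniform(0.0, 0.1))  # jitter before the single retry
            continue
        breaker.record(True)
        cache[key] = value                    # refresh the last-known-good value
        return value
    return cache.get(key)                     # fallback: cached data, or None if never cached
```

Note the ordering: the breaker is consulted before every attempt, so a retry never hammers a service the breaker has just given up on. Errors other than `TimeoutError` propagate unchanged in this sketch.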

INTERVIEW: Circuit breaker discussion comes up when the interviewer deep-dives on: reliability and failure modes, external API calls to third parties, DB connections that might time out, inter-service communication in microservices, or any resource-intensive operation that could become a bottleneck.
EXP: Cascading failures from a missing circuit breaker are one of the most common production incidents I’ve seen. The pattern pays for itself the first time a downstream service goes down under load.

Fallback Mechanisms

Fallback Strategy	Mechanism	When appropriate
Cached response	Return last-known-good value	Data staleness acceptable (product catalog)
Default response	Return static/hardcoded result	Feature works with defaults (empty recs)
Degraded service	Call simpler backup service	Critical path with a simpler alternative
Fail fast with error	Return 503 immediately	Caller handles errors (batch jobs)
Queue for later	Buffer request, process when up	Writes that tolerate delay

Instinct: “What’s your fallback?” is the most important follow-up question when someone proposes a circuit breaker. A circuit breaker that only returns 503 protects the caller's resources, but from the user's point of view it is barely better than a timeout; most of the value comes from intelligent degradation.
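
One way to picture the strategies in the table is as an ordered chain for reads plus a buffer for writes. The sketch below is illustrative only: the recommendations service, `live_recs`, `record_click`, and the `ServiceUnavailable` exception (standing in for a 503) are all hypothetical.

```python
from collections import deque


class ServiceUnavailable(Exception):
    """Stands in for returning HTTP 503 to the caller."""


def first_available(*strategies):
    """Try each zero-argument strategy in order; return the first result that works."""
    last_error = None
    for strategy in strategies:
        try:
            return strategy()
        except Exception as exc:
            last_error = exc
    raise ServiceUnavailable("no fallback succeeded") from last_error


cache = {"user-1": ["book-a", "book-b"]}   # last-known-good values


def live_recs(user_id):
    raise ConnectionError("circuit open")  # simulate the open circuit


def cached_recs(user_id):
    return cache[user_id]                  # KeyError if nothing was ever cached


def default_recs(user_id):
    return []                              # the feature still works with empty recs


def recommendations(user_id):
    # Read path: live call, then cached response, then default response.
    return first_available(
        lambda: live_recs(user_id),
        lambda: cached_recs(user_id),
        lambda: default_recs(user_id),
    )


pending_writes = deque()


def record_click(user_id, item, send):
    # Write path: queue for later when the downstream call fails.
    try:
        send(user_id, item)
    except Exception:
        pending_writes.append((user_id, item))  # replay once the circuit closes
```

Dropping `default_recs` from the chain turns the last resort into fail fast: `first_available` raises `ServiceUnavailable` and the caller has to handle it.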

References

Release It! (2nd Edition) — Michael Nygard; the book that introduced Circuit Breaker and Bulkhead
Resilience4j: Circuit Breaker — modern JVM implementation

DDIA 2e Reference

Chapter 8: Detecting and handling failures