Circuit Breaker

🔴 P0 — prevents cascading failures by stopping calls to a failing service

Problem #

When a downstream service is failing, continuing to send requests wastes resources, increases latency (waiting for timeouts), and can cause cascading failures upstream.

Mechanism #

State Machine:

  CLOSED ──(failures exceed threshold)──→ OPEN
    ↑                                       │
    │                              (timer expires)
    │                                       ↓
    └──(probe succeeds)──────────── HALF-OPEN
                                     │
                              (probe fails)
                                     │
                                     ↓
                                   OPEN
  • Closed: Normal operation. Track failure rate.
  • Open: All requests fail immediately (no downstream call). Return fallback or error.
  • Half-Open: Allow one probe request through. If it succeeds → Closed. If it fails → Open.
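A minimal sketch of the state machine above, assuming a single-threaded caller and a consecutive-failure count standing in for a true failure rate. The names `CircuitBreaker`, `CircuitOpenError`, `failure_threshold`, and `reset_timeout` are illustrative and not taken from any particular library.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half-open"


class CircuitOpenError(Exception):
    """Raised when a call is rejected without contacting the downstream service."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to stay open before probing
        self.clock = clock                          # injectable so tests can fake time
        self.state = State.CLOSED
        self.failures = 0
        self.opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.state is State.OPEN:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; failing fast")
            self.state = State.HALF_OPEN  # timer expired: let one probe through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        # Any success (normal call or probe) resets the breaker to CLOSED.
        self.failures = 0
        self.state = State.CLOSED

    def _on_failure(self):
        self.failures += 1
        # A failed probe re-opens immediately; otherwise open at the threshold.
        if self.state is State.HALF_OPEN or self.failures >= self.failure_threshold:
            self.state = State.OPEN
            self.opened_at = self.clock()
```

A production breaker would also need locking so that only one concurrent caller becomes the half-open probe; this sketch leaves that out.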

Key Trade-offs #

  • Threshold tuning: Too sensitive → false positives (circuit opens on transient errors). Too lax → slow detection.
  • Fallback quality: When circuit is open, what do you return? Cached data? Degraded response? Error? The fallback design is as important as the circuit breaker itself.
  • Granularity: Per-service? Per-endpoint? Per-host? Finer granularity avoids over-broad circuit opening but requires more state.
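One way to make the threshold trade-off concrete is a rolling window with a minimum call volume, so a handful of transient errors cannot trip the circuit on their own. This is a sketch under assumptions: `FailureRateWindow`, `min_calls`, and the default values are hypothetical, chosen for illustration.

```python
import time
from collections import deque


class FailureRateWindow:
    """Rolling window of call outcomes used to decide when to trip."""

    def __init__(self, window_seconds=10.0, rate_threshold=0.5,
                 min_calls=20, clock=time.monotonic):
        self.window_seconds = window_seconds  # how far back outcomes are counted
        self.rate_threshold = rate_threshold  # trip when failure rate exceeds this
        self.min_calls = min_calls            # ignore the rate below this volume
        self.clock = clock                    # injectable so tests can fake time
        self.events = deque()                 # (timestamp, succeeded), oldest first

    def record(self, succeeded):
        self.events.append((self.clock(), succeeded))

    def _evict_old(self):
        cutoff = self.clock() - self.window_seconds
        while self.events and self.events[0][0] < cutoff:
            self.events.popleft()

    def should_trip(self):
        self._evict_old()
        total = len(self.events)
        if total < self.min_calls:
            return False  # too few samples to trust the rate
        failures = sum(1 for _, ok in self.events if not ok)
        return failures / total > self.rate_threshold
```

Lowering `min_calls` or `rate_threshold` makes the breaker more sensitive; raising them slows detection. For finer granularity, one window could be kept per service, endpoint, or host key, at the cost of the extra state noted above.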

Instinct #

Circuit breaker, timeout, and retry form a resilience trinity. The timeout caps how long a slow call can block, the retry handles transient failures, and the circuit breaker prevents sustained hammering of a dead service.

In interviews, describe all three together:

Each call has a 500ms timeout. On timeout, retry once with jitter. If failure rate exceeds 50% over 10 seconds, circuit opens for 30 seconds, returning cached data as fallback.
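The policy quoted above could be wired together roughly as follows. This is a sketch, not a reference implementation: `fetch` is a hypothetical client function that accepts a `timeout` argument and raises `TimeoutError`, `MIN_CALLS` is an added assumption so a single failure cannot open the circuit, and the 0 to 100 ms jitter range is arbitrary.

```python
import random
import time

TIMEOUT_S = 0.5      # per-call timeout
WINDOW_S = 10.0      # failure-rate window
FAILURE_RATE = 0.5   # open when the rate exceeds this
OPEN_S = 30.0        # how long the circuit stays open
MIN_CALLS = 4        # assumption: minimum samples before the rate counts


class Breaker:
    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.outcomes = []     # (timestamp, succeeded) within the last WINDOW_S
        self.opened_at = None  # None means the circuit is closed

    def allow(self):
        if self.opened_at is None:
            return True
        # After OPEN_S, let a call through again as a probe.
        return self.clock() - self.opened_at >= OPEN_S

    def record(self, succeeded):
        now = self.clock()
        if self.opened_at is not None:
            # This was a probe: success closes the circuit, failure re-opens it.
            self.opened_at = None if succeeded else now
            self.outcomes = []
            return
        self.outcomes = [(t, ok) for t, ok in self.outcomes if now - t <= WINDOW_S]
        self.outcomes.append((now, succeeded))
        failures = sum(1 for _, ok in self.outcomes if not ok)
        if len(self.outcomes) >= MIN_CALLS and failures / len(self.outcomes) > FAILURE_RATE:
            self.opened_at = now


def call_with_resilience(fetch, breaker, cache, key, sleep=time.sleep):
    for attempt in range(2):  # one initial try plus one retry
        if not breaker.allow():
            break  # circuit open: skip the downstream call entirely
        try:
            value = fetch(key, timeout=TIMEOUT_S)
        except TimeoutError:
            breaker.record(False)
            if attempt == 0:
                sleep(random.uniform(0.0, 0.1))  # jittered pause before the retry
            continue
        breaker.record(True)
        cache[key] = value  # refresh the last-known-good value
        return value
    if key in cache:
        return cache[key]  # fallback: cached data
    raise RuntimeError("downstream unavailable and no cached value")
```

Note that this sketch also serves cached data when both attempts time out while the circuit is still closed, which is a design choice beyond what the quoted policy states.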

  • INTERVIEW: The circuit breaker discussion comes up when the interviewer deep-dives on: reliability and failure modes, external API calls to third parties, DB connections that might time out, inter-service communication in microservices, or any resource-intensive operation that could become a bottleneck.
  • EXP: Cascading failures from a missing circuit breaker are one of the most common production incidents I’ve seen. The pattern pays for itself the first time a downstream service goes down under load.

Fallback Mechanisms #

| Fallback strategy    | Mechanism                       | When appropriate                            |
|----------------------|---------------------------------|---------------------------------------------|
| Cached response      | Return last-known-good value    | Data staleness acceptable (product catalog) |
| Default response     | Return static/hardcoded result  | Feature works with defaults (empty recs)    |
| Degraded service     | Call simpler backup service     | Critical path with a simpler alternative    |
| Fail fast with error | Return 503 immediately          | Caller handles errors (batch jobs)          |
| Queue for later      | Buffer request, process when up | Writes that tolerate delay                  |

Instinct: “What’s your fallback?” is the most important follow-up question when someone proposes a circuit breaker. A circuit breaker that returns 503 is barely better than a timeout — the value comes from intelligent degradation.
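The fallback strategies can be chained so that the caller degrades step by step instead of jumping straight to an error. A rough sketch, assuming each option is a zero-argument callable; `call_with_fallbacks`, `write_or_queue`, and `ServiceUnavailable` are hypothetical names used only for illustration.

```python
class ServiceUnavailable(Exception):
    """Stands in for returning an HTTP 503 to the caller."""


def call_with_fallbacks(primary, fallbacks):
    """Try primary, then each fallback in order; raise if every option fails."""
    errors = []
    for option in [primary, *fallbacks]:
        try:
            return option()
        except Exception as exc:
            errors.append(exc)  # remember why this option failed, then move on
    raise ServiceUnavailable(f"all {len(errors)} options failed")


def write_or_queue(send, payload, pending):
    """Queue-for-later: buffer a write when the downstream call fails."""
    try:
        send(payload)
        return "sent"
    except Exception:
        pending.append(payload)  # to be replayed once the service is back up
        return "queued"
```

For reads, a caller might pass a cached lookup first and a static default second, e.g. `call_with_fallbacks(fetch_recs, [read_cache, lambda: []])`, where `fetch_recs` and `read_cache` are placeholders for the real calls.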

References #

DDIA 2e Reference #

  • Chapter 8: Detecting and handling failures