Retry with Backoff & Jitter

🔴 P0 — the default error recovery mechanism; Stripe definitely asks about this

Problem #

Transient failures (network blip, brief overload) are common. Retrying often succeeds. But naive retrying (immediate, unlimited) creates retry storms that make outages worse.

Mechanism #

Exponential Backoff #

Attempt 1: wait 100ms
Attempt 2: wait 200ms
Attempt 3: wait 400ms
Attempt 4: wait 800ms
(cap at max_delay, e.g. 30s)
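The schedule above can be sketched as a small function. This is a minimal illustration, not a definitive implementation: the name `backoff_delay` and the defaults (`base` of 100ms, `cap` of 30s) are assumptions taken from the example numbers, and `attempt` is counted from 1 to match the list.

```python
def backoff_delay(attempt: int, base: float = 0.1, cap: float = 30.0) -> float:
    """Seconds to wait before retry number `attempt` (1-based), without jitter."""
    if attempt < 1:
        raise ValueError("attempt must be >= 1")
    # Double the delay on each attempt, but never exceed the cap.
    return min(cap, base * 2 ** (attempt - 1))


schedule = [backoff_delay(n) for n in range(1, 5)]
print(schedule)           # [0.1, 0.2, 0.4, 0.8]
print(backoff_delay(20))  # 30.0, because the cap takes over
```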

Adding Jitter #

Without jitter, clients that failed at the same moment also retry at the same moments (correlated retries → thundering herd). Jitter randomises the delay:

Full jitter:    delay = random(0, min(cap, base × 2^attempt))
Equal jitter:   half = min(cap, base × 2^attempt) / 2;  delay = half + random(0, half)
Decorrelated:   delay = min(cap, random(base, previous_delay × 3))
(attempt is counted from 0 here, so the first retry's ceiling is base)
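The three strategies translate almost directly into code. The sketch below is illustrative only: the function names and the defaults (`base=0.1`, `cap=30.0`, in seconds) are assumptions, `attempt` is counted from 0, and the decorrelated variant is clamped to `cap` so it cannot grow without bound.

```python
import random


def full_jitter(attempt: int, base: float = 0.1, cap: float = 30.0) -> float:
    # Anywhere between zero and the capped exponential ceiling.
    return random.uniform(0, min(cap, base * 2 ** attempt))


def equal_jitter(attempt: int, base: float = 0.1, cap: float = 30.0) -> float:
    # Half of the ceiling is fixed, the other half is random.
    half = min(cap, base * 2 ** attempt) / 2
    return half + random.uniform(0, half)


def decorrelated_jitter(previous_delay: float, base: float = 0.1, cap: float = 30.0) -> float:
    # Depends on the previous delay rather than on the attempt number.
    return min(cap, random.uniform(base, previous_delay * 3))


# Decorrelated jitter is seeded with `base` as the first "previous delay".
delay = 0.1
for _ in range(3):
    delay = decorrelated_jitter(delay)
```

Full jitter spreads retries most widely. Equal jitter guarantees a minimum wait of half the ceiling, which may matter when the downstream service needs a guaranteed pause.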

Key Trade-offs #

| Dimension | No retry | Immediate retry | Backoff | Backoff + Jitter |
|---|---|---|---|---|
| Recovery from transient | ❌ No | ✅ Fast | ✅ Yes | ✅ Yes |
| Retry storm risk | None | 🔴 High | 🟡 Medium | ✅ Low |
| Total latency | Lowest | Low | Higher | Higher |
| Implementation | Trivial | Trivial | Simple | Simple |

Instinct #

Always use exponential backoff with full jitter. Cap retries (3-5 max). Once retries are exhausted, stop and hand off to a fallback, and let the failure count toward the circuit breaker. For idempotent operations (GET, PUT, or POST with an idempotency key), retry aggressively. For non-idempotent operations, retry only if you can confirm the previous attempt didn’t succeed. See also: Idempotency.
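A minimal sketch of such a policy follows, assuming the wrapped operation is idempotent. `TransientError` and `retry_call` are hypothetical names introduced for illustration, and the injectable `sleep` parameter exists only so the loop can be exercised without real waiting.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, brief overload)."""


def retry_call(operation, max_attempts=4, base=0.1, cap=30.0, sleep=time.sleep):
    """Call `operation` up to `max_attempts` times, sleeping with full jitter in between.

    Assumes `operation` is idempotent; other exceptions propagate immediately.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # retries exhausted: let the caller fall back
            # Full jitter: random delay up to the capped exponential ceiling.
            sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
```

With `max_attempts=4` this makes one initial call plus at most three retries, which sits at the low end of the 3-5 range above.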

  • TRICK: In interviews, state the full retry policy (Resilience Trinity) concisely in a few sentences:

Each call has a 500ms timeout. On timeout, retry up to 3 times with exponential backoff and full jitter. If the failure rate exceeds 50% over 10 seconds, the circuit breaker opens for 30 seconds and cached data is returned as a fallback.

This demonstrates the “resilience trinity” (timeout + retry + circuit breaker) in a single statement.
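The circuit-breaker part of that statement could look roughly like the sketch below. It is a simplified illustration under stated assumptions, not a production design: the class and parameter names are hypothetical, `min_calls` is an added guard so a single failure does not trip the breaker, there is no half-open probing state, and `operation` is assumed to already include the timeout and retry logic.

```python
import time
from collections import deque


class CircuitBreaker:
    def __init__(self, failure_threshold=0.5, window=10.0, open_for=30.0,
                 min_calls=5, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.window = window          # seconds of history considered
        self.open_for = open_for      # seconds the breaker stays open
        self.min_calls = min_calls    # assumption: minimum sample size
        self.clock = clock
        self.results = deque()        # (timestamp, succeeded) pairs
        self.opened_at = None

    def _failure_rate(self, now):
        # Drop results that fell out of the sliding window.
        while self.results and now - self.results[0][0] > self.window:
            self.results.popleft()
        if len(self.results) < self.min_calls:
            return 0.0
        failures = sum(1 for _, ok in self.results if not ok)
        return failures / len(self.results)

    def call(self, operation, fallback):
        now = self.clock()
        if self.opened_at is not None:
            if now - self.opened_at < self.open_for:
                return fallback()     # open: short-circuit to the fallback
            self.opened_at = None     # cool-off elapsed: close and start fresh
            self.results.clear()
        try:
            result = operation()
        except Exception:
            self.results.append((now, False))
            if self._failure_rate(now) > self.failure_threshold:
                self.opened_at = now
            return fallback()
        self.results.append((now, True))
        return result
```

A caller would pass the timeout-bounded, retried call as `operation` and a cache lookup as `fallback`, so all three mechanisms compose in one place.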

Retry Budgets #

Individual retries compound across hops: with 5 services in a chain and 3 attempts at each hop (1 initial call + 2 retries), a single request can fan out to 3⁵ = 243 attempts at the leaf.
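The amplification is plain exponentiation, which a few lines make concrete. The helper name `leaf_attempts` is hypothetical, and the calculation assumes the worst case where every hop exhausts its attempts. Note that 243 corresponds to three attempts per hop; counting three retries on top of the initial call gives four attempts per hop and a larger number.

```python
def leaf_attempts(hops: int, attempts_per_hop: int) -> int:
    """Worst-case calls reaching the last service when every hop uses all its attempts."""
    return attempts_per_hop ** hops


print(leaf_attempts(5, 3))  # 243  (1 initial call + 2 retries per hop)
print(leaf_attempts(5, 4))  # 1024 (1 initial call + 3 retries per hop)
```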

Retry budget: limit retries to under 10% of total request volume. If the budget is exceeded, stop retrying: the system is overloaded and more retries make it worse.
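One way to enforce such a budget is a pair of counters consulted before every retry. The sketch below is a simplification under stated assumptions: `RetryBudget` and its method names are hypothetical, it allows retries up to `ratio` of recorded requests, and it uses lifetime counters where a real implementation would more likely use a sliding window or token bucket.

```python
class RetryBudget:
    def __init__(self, ratio: float = 0.1):
        self.ratio = ratio    # retries allowed per first-attempt request
        self.requests = 0
        self.retries = 0

    def record_request(self) -> None:
        """Count a first-attempt request; these are what earn retry budget."""
        self.requests += 1

    def try_acquire_retry(self) -> bool:
        """Return True and spend budget if one more retry fits, else False."""
        if self.retries + 1 > self.ratio * self.requests:
            return False      # budget exhausted: fail fast instead of retrying
        self.retries += 1
        return True


budget = RetryBudget(ratio=0.1)
for _ in range(105):
    budget.record_request()
granted = sum(budget.try_acquire_retry() for _ in range(50))
print(granted)  # 10
```

Placing the check at the edge, rather than in every service, keeps the limit tied to real user traffic instead of to already-amplified internal calls.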

Instinct: “Retry policies must account for the full call chain. Individual service retries compound multiplicatively. A retry budget at the edge prevents the amplification cascade.”
