Retry with Backoff & Jitter

🔴 P0 — the default error recovery mechanism; Stripe definitely asks about this

Problem

Transient failures (network blip, brief overload) are common. Retrying often succeeds. But naive retrying (immediate, unlimited) creates retry storms that make outages worse.

Mechanism

Exponential Backoff

Retry 1: wait 100ms (base delay)
Retry 2: wait 200ms
Retry 3: wait 400ms
Retry 4: wait 800ms
(each delay doubles; cap at max_delay, e.g. 30s)
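The schedule above can be reproduced with a short sketch. The function name `backoff_delay_ms` and its parameter names are illustrative, and the retry index is assumed to start at 0 so that the first retry waits exactly the base delay:

```python
def backoff_delay_ms(attempt: int, base_ms: int = 100, cap_ms: int = 30_000) -> int:
    """Delay in milliseconds before retry number `attempt` (0-based), no jitter."""
    # Double the base delay for each retry, but never exceed the cap.
    return min(cap_ms, base_ms * 2 ** attempt)


# Delays for the first four retries, matching the schedule above.
schedule = [backoff_delay_ms(n) for n in range(4)]
print(schedule)
```

With a 30s cap, the delay stops growing once the doubling passes 30,000ms (from the tenth retry onward with these defaults).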

Adding Jitter

Without jitter, clients that failed at the same moment all retry on the same schedule (correlated retries → thundering herd). Jitter randomises each client's delay so the retries spread out over time:

Full jitter:    delay = random(0, min(cap, base × 2^attempt))
Equal jitter:   half = min(cap, base × 2^attempt) / 2;  delay = half + random(0, half)
Decorrelated:   delay = min(cap, random(base, previous_delay × 3))
(attempt counts retries from 0; previous_delay starts at base)
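A minimal sketch of the three strategies, with delays in seconds. The function names are illustrative, `attempt` is assumed to be 0-based, and the decorrelated variant applies the cap to the random draw (an assumption here, following the formulation in the AWS article listed under Reference):

```python
import random


def capped_backoff(attempt: int, base: float = 0.1, cap: float = 30.0) -> float:
    """Un-jittered exponential delay, limited to `cap` seconds."""
    return min(cap, base * 2 ** attempt)


def full_jitter(attempt: int, base: float = 0.1, cap: float = 30.0) -> float:
    # Anywhere between zero and the full exponential delay.
    return random.uniform(0, capped_backoff(attempt, base, cap))


def equal_jitter(attempt: int, base: float = 0.1, cap: float = 30.0) -> float:
    # Half the delay is fixed, the other half is random.
    half = capped_backoff(attempt, base, cap) / 2
    return half + random.uniform(0, half)


def decorrelated_jitter(previous_delay: float, base: float = 0.1, cap: float = 30.0) -> float:
    # Depends on the previous delay rather than on the attempt number.
    return min(cap, random.uniform(base, previous_delay * 3))
```

Full jitter gives the widest spread but can return a delay close to zero. Equal jitter guarantees at least half the backoff. Decorrelated jitter needs the caller to carry the previous delay between retries, starting from `base`.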

Key Trade-offs

Dimension	No retry	Immediate retry	Backoff	Backoff + Jitter
Recovery from transient failures	❌ No	✅ Fast	✅ Yes	✅ Yes
Retry storm risk	None	🔴 High	🟡 Medium	✅ Low
Total latency	Lowest	Low	Higher	Higher
Implementation	Trivial	Trivial	Simple	Simple

Instinct

Default to exponential backoff with full jitter. Cap retries (3-5 max). Once retries are exhausted, fall back to a degraded response or let the circuit breaker take over. For idempotent operations (GET, PUT, or POST with an idempotency key), retry aggressively. For non-idempotent operations, retry only if you can confirm the previous attempt didn’t succeed. See also: Idempotency.
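A retry loop that follows this policy might look like the sketch below. `TransientError` and `retry_call` are hypothetical names introduced for illustration; the sketch assumes the caller only passes in operations that are safe to repeat, and it takes `sleep` as a parameter so the waiting can be stubbed out in tests:

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def retry_call(operation, max_retries=3, base=0.1, cap=30.0, sleep=time.sleep):
    """Run `operation`, retrying on TransientError with capped full-jitter backoff."""
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_retries:
                # Retries exhausted: surface the error so the caller can fall back.
                raise
            # Full jitter: random delay up to the capped exponential backoff.
            sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
```

Any exception other than `TransientError` propagates immediately, which keeps the loop from retrying errors that will never succeed (for example validation failures).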

TRICK: In interviews, state the full resilience policy (the Resilience Trinity) concisely, in a few short sentences:

Each call has a 500ms timeout. On timeout, retry up to 3 times with exponential backoff and full jitter. If the failure rate exceeds 50% over 10 seconds, the circuit breaker opens for 30 seconds, returning cached data as fallback.

This demonstrates the “resilience trinity” (timeout + retry + circuit breaker) in a single statement.
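The circuit breaker part of that statement can be sketched as follows. This is a simplified illustration, not a production design: the class and function names are made up, `min_calls` is an added assumption so that a single early failure cannot trip the breaker, there is no separate half-open probing state, and the timeout and retries are assumed to live inside `operation`:

```python
import time
from collections import deque


class CircuitBreaker:
    """Opens when the failure rate over a sliding window exceeds a threshold."""

    def __init__(self, threshold=0.5, window=10.0, open_for=30.0,
                 min_calls=5, clock=time.monotonic):
        self.threshold = threshold    # failure rate that trips the breaker
        self.window = window          # seconds of history to consider
        self.open_for = open_for      # seconds to stay open once tripped
        self.min_calls = min_calls    # assumed minimum sample size
        self.clock = clock
        self.results = deque()        # (timestamp, succeeded) pairs
        self.opened_at = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        if self.clock() - self.opened_at >= self.open_for:
            # Cool-down elapsed: close the breaker and start with fresh history.
            self.opened_at = None
            self.results.clear()
            return True
        return False

    def record(self, succeeded: bool) -> None:
        now = self.clock()
        self.results.append((now, succeeded))
        # Drop results that have aged out of the window.
        while self.results and now - self.results[0][0] > self.window:
            self.results.popleft()
        failures = sum(1 for _, ok in self.results if not ok)
        if (len(self.results) >= self.min_calls
                and failures / len(self.results) > self.threshold):
            self.opened_at = now


def guarded_call(breaker, operation, fallback):
    """Call `operation` through the breaker, using `fallback` when open or failing."""
    if not breaker.allow():
        return fallback()
    try:
        result = operation()
    except Exception:
        breaker.record(False)
        return fallback()
    breaker.record(True)
    return result
```

While the breaker is open, `guarded_call` skips the downstream call entirely and serves the fallback (for example cached data), which is what takes load off the struggling dependency.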

Retry Budgets

Individual retries compound across hops: in a call chain 5 hops deep, where each hop makes up to 3 attempts (1 call + 2 retries), the leaf can see 3⁵ = 243 attempts for a single user request.
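The amplification is plain exponentiation, shown here as a small sketch. The function name is illustrative, and the per-hop figure is treated as the total number of attempts each hop makes, in the worst case where every attempt at every hop fails:

```python
def leaf_attempts(hops: int, attempts_per_hop: int) -> int:
    """Worst-case number of calls reaching the leaf for one top-level request."""
    # Every attempt at one hop triggers a full set of attempts at the next hop.
    return attempts_per_hop ** hops


print(leaf_attempts(hops=5, attempts_per_hop=3))
```

Counting 3 retries on top of the first call (4 attempts per hop) makes the worst case `leaf_attempts(5, 4)`, which is 1024.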

Retry budget: keep retries under 10% of total request volume, tracked per client or at the edge. Once the budget is exhausted, stop retrying and fail fast; the system is overloaded and more retries make it worse.
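One way to enforce such a budget is a pair of counters, sketched below. The class and method names are hypothetical, and the counters are assumed to cover the whole lifetime of the object for brevity; a real implementation would typically track a sliding time window instead:

```python
class RetryBudget:
    """Grants a retry only while retries stay under `percent` of requests seen."""

    def __init__(self, percent: int = 10):
        self.percent = percent
        self.requests = 0
        self.retries = 0

    def record_request(self) -> None:
        # Call once per original (non-retry) request.
        self.requests += 1

    def try_acquire_retry(self) -> bool:
        # Integer arithmetic avoids floating-point edge cases at the boundary.
        if (self.retries + 1) * 100 >= self.percent * self.requests:
            return False  # Budget exhausted: fail fast instead of retrying.
        self.retries += 1
        return True
```

A caller would check `try_acquire_retry()` before each retry and give up immediately when it returns `False`, so that retry traffic stays bounded even when most requests are failing.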

Instinct: “Retry policies must account for the full call chain. Individual service retries compound multiplicatively. A retry budget at the edge prevents the amplification cascade.”

Reference

AWS Builders’ Library: Timeouts, Retries, and Backoff with Jitter
Google SRE: Handling Overload — retry budgets in practice