Retry with Backoff & Jitter
🔴 P0 — the default error recovery mechanism; Stripe definitely asks about this
Problem #
Transient failures (network blip, brief overload) are common. Retrying often succeeds. But naive retrying (immediate, unlimited) creates retry storms that make outages worse.
Mechanism #
Exponential Backoff #
Attempt 1: wait 100ms
Attempt 2: wait 200ms
Attempt 3: wait 400ms
Attempt 4: wait 800ms
(cap at max_delay, e.g. 30s)
Adding Jitter #
Without jitter, clients that failed at the same moment also retry at the same moments (correlated retries → thundering herd). Jitter randomises the delay so the retries spread out over time:
Full jitter: delay = random(0, min(cap, base × 2^attempt))
Equal jitter: delay = half + random(0, half), where half = min(cap, base × 2^attempt) / 2
Decorrelated: delay = min(cap, random(base, previous_delay × 3))
Key Trade-offs #
| Dimension | No retry | Immediate retry | Backoff | Backoff + Jitter |
|---|---|---|---|---|
| Recovery from transient failures | ❌ No | ✅ Fast | ✅ Yes | ✅ Yes |
| Retry storm risk | None | 🔴 High | 🟡 Medium | ✅ Low |
| Total latency | Lowest | Low | Higher | Higher |
| Implementation | Trivial | Trivial | Simple | Simple |
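The "Retry storm risk" row can be illustrated with a small Python sketch of the delay formulas from the Mechanism section. The names `backoff`, `full_jitter`, `equal_jitter`, and `decorrelated_jitter` are hypothetical, and `BASE = 0.05` is an assumption chosen so that attempt 1 waits 100 ms, as in the schedule above.

```python
import random

BASE = 0.05  # seconds; assumed so that attempt 1 waits 100 ms
CAP = 30.0   # seconds; the example max_delay


def backoff(attempt, base=BASE, cap=CAP):
    """Capped exponential backoff with no jitter: min(cap, base * 2^attempt)."""
    return min(cap, base * 2 ** attempt)


def full_jitter(attempt, base=BASE, cap=CAP, rng=random):
    """Uniformly random delay between 0 and the capped exponential delay."""
    return rng.uniform(0, backoff(attempt, base, cap))


def equal_jitter(attempt, base=BASE, cap=CAP, rng=random):
    """Half of the capped delay is fixed, the other half is random."""
    half = backoff(attempt, base, cap) / 2
    return half + rng.uniform(0, half)


def decorrelated_jitter(previous_delay, base=BASE, cap=CAP, rng=random):
    """Next delay depends on the previous delay rather than the attempt number."""
    return min(cap, rng.uniform(base, previous_delay * 3))


# 1000 clients that all failed together and are about to make attempt 3.
rng = random.Random(42)
plain = {backoff(3) for _ in range(1000)}
jittered = {round(full_jitter(3, rng=rng), 3) for _ in range(1000)}
print(len(plain))     # 1: every client picks the same delay
print(len(jittered))  # many distinct delays, spread over 0 to 400 ms
```

Without jitter the whole cohort wakes up in the same millisecond; with full jitter the same cohort is spread across the window, which is why the table rates the storm risk lower.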
Instinct #
Always use exponential backoff with full jitter. Cap retries (3-5 max). Once retries are exhausted, stop and return a fallback (cached or default response), and let the failure count toward the circuit breaker. For idempotent operations (GET, PUT, or POST with an idempotency key), retry aggressively. For non-idempotent operations, retry only if you can confirm the previous attempt didn’t succeed. See also: Idempotency.
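A minimal Python sketch of that policy, assuming the wrapped operation is idempotent. `TransientError` and `retry_with_backoff` are hypothetical names, and the defaults (4 attempts, 100 ms base, 30 s cap) are illustrative rather than recommended values.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def retry_with_backoff(operation, max_attempts=4, base=0.1, cap=30.0,
                       sleep=time.sleep, rng=random):
    """Call operation(), retrying TransientError with capped backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # retries exhausted: the caller decides on a fallback
            # Full jitter: random(0, min(cap, base * 2^attempt))
            sleep(rng.uniform(0, min(cap, base * 2 ** attempt)))


# Usage: an operation that fails twice, then succeeds.
calls = {"count": 0}


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise TransientError("simulated network blip")
    return "ok"


delays = []  # record the delays instead of actually sleeping
result = retry_with_backoff(flaky, sleep=delays.append)
print(result, len(delays))  # ok 2
```

Only `TransientError` is retried here; any other exception propagates immediately, which keeps non-retryable failures (bad input, auth errors) from being repeated.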
- TRICK: In interviews, state the full retry policy (Resilience Trinity) concisely, in one short statement:
Each call has a 500ms timeout. On timeout, retry up to 3 times with exponential backoff and full jitter. If failure rate exceeds 50% over 10 seconds, the circuit breaker opens for 30 seconds, returning cached data as fallback.
This demonstrates the “resilience trinity” (timeout + retry + circuit breaker) in a single statement.
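The circuit-breaker part of that statement can be sketched in Python as follows. This is a simplified illustration, not a production breaker: `CircuitBreaker` and `call_with_breaker` are hypothetical names, `min_calls` is an added assumption so a single failure cannot trip the breaker, there is no half-open probing state, and `operation` is assumed to already apply its own timeout and retries.

```python
import time
from collections import deque


class CircuitBreaker:
    """Opens when the failure rate over a rolling window exceeds a threshold."""

    def __init__(self, failure_rate=0.5, window_s=10.0, open_s=30.0,
                 min_calls=5, clock=time.monotonic):
        self.failure_rate = failure_rate
        self.window_s = window_s
        self.open_s = open_s
        self.min_calls = min_calls  # assumption: need a few samples before tripping
        self.clock = clock
        self.events = deque()       # (timestamp, succeeded) pairs
        self.opened_at = None

    def allow(self):
        """Return True if calls may go through right now."""
        if self.opened_at is None:
            return True
        if self.clock() - self.opened_at >= self.open_s:
            self.opened_at = None   # cool-down over: close and start fresh
            self.events.clear()
            return True
        return False

    def record(self, succeeded):
        """Record one outcome and open the breaker if the failure rate is too high."""
        now = self.clock()
        self.events.append((now, succeeded))
        while self.events and now - self.events[0][0] > self.window_s:
            self.events.popleft()
        failures = sum(1 for _, ok in self.events if not ok)
        if (len(self.events) >= self.min_calls
                and failures / len(self.events) > self.failure_rate):
            self.opened_at = now


def call_with_breaker(breaker, operation, fallback):
    """Run operation() unless the breaker is open; use fallback() on failure."""
    if not breaker.allow():
        return fallback()
    try:
        result = operation()
    except Exception:
        breaker.record(False)
        return fallback()
    breaker.record(True)
    return result


# Usage with a fake clock so the example runs instantly.
now = [0.0]
breaker = CircuitBreaker(clock=lambda: now[0])


def failing():
    raise TimeoutError("simulated 500 ms timeout")


for _ in range(5):
    call_with_breaker(breaker, failing, lambda: "cached")
print(breaker.allow())  # False: the breaker opened after 5 straight failures
now[0] = 31.0
print(breaker.allow())  # True: the 30 s open period has elapsed
```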
Retry Budgets #
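Per-service retries multiply across hops: in a chain 5 services deep where every hop makes up to 3 attempts, the worst case is 3⁵ = 243 attempts at the leaf. (If "3 retries" means 3 on top of the first try, that is 4 attempts per hop and 4⁵ = 1,024.)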
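The arithmetic can be checked with a few lines of Python. `leaf_attempts` is a hypothetical helper that assumes the worst case, where every attempt at every hop fails and triggers the full set of attempts at the hop below.

```python
def leaf_attempts(hops, attempts_per_hop):
    """Worst-case calls reaching the last service in a chain of `hops` services."""
    return attempts_per_hop ** hops


print(leaf_attempts(5, 3))  # 243
print(leaf_attempts(5, 4))  # 1024 (first try plus 3 retries at each hop)
```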
Retry budget: cap total retries at under 10% of request traffic. Once the budget is exhausted, stop retrying: the system is overloaded and more retries make it worse.
Instinct: “Retry policies must account for the full call chain. Individual service retries compound multiplicatively. A retry budget at the edge prevents the amplification cascade.”
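One way to sketch a retry budget in Python is a pair of counters checked before each retry. `RetryBudget` is a hypothetical class; it uses cumulative counters for brevity, whereas a real implementation would more likely track a sliding time window or a token bucket.

```python
class RetryBudget:
    """Grants a retry only while retries stay under `percent` of requests seen."""

    def __init__(self, percent=10):
        self.percent = percent
        self.requests = 0
        self.retries = 0

    def record_request(self):
        """Count one first-attempt request."""
        self.requests += 1

    def try_acquire_retry(self):
        """Return True and consume budget if one more retry keeps us under the limit."""
        # Integer arithmetic avoids floating-point surprises at the boundary.
        if (self.retries + 1) * 100 < self.percent * self.requests:
            self.retries += 1
            return True
        return False


# Usage: 100 requests arrive, then 50 of them ask to retry.
budget = RetryBudget()
for _ in range(100):
    budget.record_request()
granted = sum(budget.try_acquire_retry() for _ in range(50))
print(granted)  # 9: retries stay under 10% of the 100 requests
```

A caller would check `try_acquire_retry()` before sleeping and retrying, and fail fast when it returns `False`.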
Reference #
- AWS Builders’ Library: Timeouts, Retries, and Backoff with Jitter
- Google SRE: Handling Overload — retry budgets in practice