Timeout Strategies
🔴 P0 — the most fundamental reliability mechanism; every network call needs one
Problem #
Without timeouts, a slow or dead downstream can cause the caller to hang indefinitely, consuming resources and propagating failures upstream.
Types #
| Timeout Type | What it bounds | Typical Value |
|---|---|---|
| Connection timeout | Time to establish TCP connection | 1-5 seconds |
| Request timeout | Time for a complete request-response | 5-30 seconds |
| Idle timeout | Time a connection can sit unused | 60-300 seconds |
| Deadline/budget | Total time for a multi-hop request chain | Propagated from upstream |
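As a rough illustration of the first two rows, the sketch below sets a connection timeout and a per-operation read/write timeout separately using Python's standard `socket` module. The function name `call_with_timeouts` and the default values are assumptions chosen from the ranges in the table, not a recommendation. Note that `settimeout` bounds each blocking call, so it only approximates a true request timeout.

```python
import socket


def call_with_timeouts(host: str, port: int, payload: bytes,
                       connect_timeout: float = 3.0,
                       request_timeout: float = 10.0) -> bytes:
    """Send payload and read one reply, bounding connect and I/O separately.

    Hypothetical helper for illustration only.
    """
    # Connection timeout: bounds only the TCP handshake.
    with socket.create_connection((host, port), timeout=connect_timeout) as sock:
        # From here on, every blocking send/recv is bounded by request_timeout.
        # This is a per-operation bound, not a bound on the whole exchange.
        sock.settimeout(request_timeout)
        sock.sendall(payload)
        # Raises socket.timeout if the peer sends nothing in time.
        return sock.recv(4096)
```

A caller would catch `socket.timeout` (an alias of `TimeoutError` on recent Python versions) and then decide whether to retry or fall back.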
Deadline Propagation #
Client sets deadline: 5 seconds
→ Gateway: remaining = 5s, processing = 200ms
→ Service A: remaining = 4.8s, processing = 100ms
→ Service B: remaining = 4.7s, processing = 200ms
→ Database: remaining = 4.5s
If any hop exceeds the remaining budget → abort immediately

Without deadline propagation, each service uses its own timeout: Service B might start a 30s query even though the client has already given up.
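The budget arithmetic above can be sketched with a small deadline object that is passed from hop to hop. Everything here (`Deadline`, `DeadlineExceeded`, `service_b`, `query_database`) is a hypothetical name for illustration; real frameworks such as gRPC carry the deadline in request metadata instead. The clock is injectable so the behaviour can be checked without sleeping.

```python
import time


class DeadlineExceeded(Exception):
    """Raised when a hop starts after the shared budget has run out."""


class Deadline:
    def __init__(self, budget_seconds: float, clock=time.monotonic):
        self._clock = clock
        # Absolute expiry time, fixed once by the original caller.
        self._expires_at = clock() + budget_seconds

    def remaining(self) -> float:
        return max(0.0, self._expires_at - self._clock())

    def check(self, hop: str) -> None:
        # Abort immediately instead of starting work nobody is waiting for.
        if self.remaining() <= 0:
            raise DeadlineExceeded(f"budget exhausted before {hop}")


def query_database(deadline: Deadline) -> dict:
    deadline.check("database")
    # Use the remaining budget as the query timeout, not a fixed 30 s.
    return {"query_timeout": deadline.remaining()}


def service_b(deadline: Deadline) -> dict:
    deadline.check("service B")
    return query_database(deadline)
```

With a 5 s budget and 0.5 s already spent upstream, `query_database` sees a 4.5 s timeout, matching the trace above.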
Health Checks & Heartbeats #
| Mechanism | How it works | Use case |
|---|---|---|
| Liveness check | “Are you alive?” (basic ping/200) | Restart crashed processes |
| Readiness check | “Can you serve traffic?” (DB connected, warm) | Remove from load balancer pool |
| Heartbeat | Periodic signal between peers | Cluster membership, leader election |
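The heartbeat row is, in effect, a timeout applied to silence: a peer is presumed dead once no signal has arrived within some window. A minimal sketch, assuming a hypothetical `HeartbeatMonitor` class and an injectable clock (neither comes from a particular library):

```python
import time


class HeartbeatMonitor:
    """Tracks the last heartbeat per peer and reports which are still alive."""

    def __init__(self, timeout_seconds: float, clock=time.monotonic):
        self._timeout = timeout_seconds
        self._clock = clock
        self._last_seen = {}  # peer name -> time of last heartbeat

    def beat(self, peer: str) -> None:
        # Called whenever a heartbeat from `peer` is received.
        self._last_seen[peer] = self._clock()

    def alive_peers(self) -> list:
        now = self._clock()
        # A peer counts as alive if its last heartbeat is within the window.
        return sorted(p for p, seen in self._last_seen.items()
                      if now - seen <= self._timeout)
```

Choosing the window is the same trade-off as any timeout: too short and slow peers are declared dead, too long and real failures are detected late.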
Graceful Degradation #
When a dependency is slow or down, degrade rather than fail:
- Cached fallback: Serve stale data when fresh source is unavailable
- Default response: Return a sensible default (e.g. an empty recommendations list rather than an error)
- Feature shedding: Disable non-critical features under load
Instinct: “The system should get worse gradually, not cliff-edge.” Design each dependency with a fallback answer to: “If this is down, what do we show instead?”
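One way to sketch the cached-fallback and default-response options together, assuming a hypothetical `recommendations` function whose `fetch` argument stands in for the real downstream call:

```python
from typing import Callable, Dict, List


def recommendations(user_id: str,
                    fetch: Callable[[str], List[str]],
                    cache: Dict[str, List[str]]) -> List[str]:
    """Return fresh data if possible, else stale data, else a safe default."""
    try:
        fresh = fetch(user_id)
        cache[user_id] = fresh  # remember the last good answer
        return fresh
    except (TimeoutError, ConnectionError):
        if user_id in cache:
            # Cached fallback: stale but better than an error.
            return cache[user_id]
        # Default response: an empty list the UI can render.
        return []
```

The `except` clause is deliberately narrow: only timeouts and connection failures trigger degradation, so programming errors in `fetch` still surface.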
Instinct #
Every external call must have a timeout. No exceptions. Use deadline propagation for multi-hop chains (gRPC has this built in).
Set timeouts based on p99 latency of the downstream (e.g. if p99 is 200ms, timeout at 500ms-1s).
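The p99 rule of thumb can be worked through in a few lines. This sketch uses the nearest-rank percentile; the names `p99` and `suggest_timeout_ms`, the default multiplier of 3, and the 50 ms floor are assumptions for illustration (3× sits inside the 2.5× to 5× range implied by the example above).

```python
import math


def p99(samples_ms: list) -> float:
    """Nearest-rank 99th percentile of observed latencies."""
    if not samples_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples_ms)
    rank = math.ceil(0.99 * len(ordered))  # 1-based rank
    return ordered[rank - 1]


def suggest_timeout_ms(samples_ms: list,
                       multiplier: float = 3.0,
                       floor_ms: float = 50.0) -> float:
    # Headroom above p99 so normal tail latency does not trip the timeout.
    return max(floor_ms, p99(samples_ms) * multiplier)
```

For 100 samples where 98 are 100 ms, one is 200 ms, and one is a 2000 ms outlier, the p99 is 200 ms and the suggested timeout is 600 ms; the single outlier does not inflate it.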
When a timeout fires, the question is: retry, circuit-break, or fallback?
See also: Retry with Backoff, Circuit Breaker.
References #
DDIA 2e Reference #
- Chapter 8: Timeouts and Unbounded Delays