
Timeout Strategies

🔴 P0 — the most fundamental reliability mechanism; every network call needs one

Problem #

Without timeouts, a slow or dead downstream can cause the caller to hang indefinitely, consuming resources and propagating failures upstream.

Types #

| Timeout type | What it bounds | Typical value |
|---|---|---|
| Connection timeout | Time to establish a TCP connection | 1-5 seconds |
| Request timeout | Time for a complete request-response | 5-30 seconds |
| Idle timeout | Time a connection can sit unused | 60-300 seconds |
| Deadline/budget | Total time for a multi-hop request chain | Propagated from upstream |
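
To make the difference between the first two rows concrete, here is a minimal sketch using Python's standard `socket` module. The function name `fetch_with_timeouts` and its default values are assumptions for illustration only. Note that a socket timeout applies to each blocking call separately, so it only approximates a true whole-request timeout.

```python
import socket


def fetch_with_timeouts(host, port, payload, connect_timeout=2.0, read_timeout=5.0):
    """Send payload and read one reply, with separate connect and read timeouts.

    Hypothetical helper: the name and defaults are illustrative assumptions.
    """
    # Connection timeout: bounds only the TCP handshake.
    sock = socket.create_connection((host, port), timeout=connect_timeout)
    try:
        # Read timeout: bounds each blocking send/recv on the open connection.
        sock.settimeout(read_timeout)
        sock.sendall(payload)
        # Raises socket.timeout if no data arrives within read_timeout.
        return sock.recv(4096)
    finally:
        sock.close()
```

A caller would typically catch `socket.timeout` (an alias of `TimeoutError` on recent Python versions) and then decide whether to retry or fall back.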

Deadline Propagation #

Client sets deadline: 5 seconds
  → Gateway: remaining = 5s, gateway processing = 200ms
    → Service A: remaining = 4.8s, processing = 100ms
      → Service B: remaining = 4.7s, processing = 200ms
        → Database: remaining = 4.5s

If any hop exceeds remaining budget → abort immediately

Without deadline propagation, each service applies only its own local timeout, so Service B might start a 30s query even though the client gave up after 5s.
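
The budget arithmetic above can be sketched as a small in-process helper. This is a minimal illustration, not a standard API: the `Deadline` class, the `DeadlineExceeded` exception, and the `timeout_for` method are hypothetical names. Across process boundaries the remaining budget would also have to travel with the request, for example in a header.

```python
import time


class DeadlineExceeded(Exception):
    """Raised when the propagated budget has run out (hypothetical name)."""


class Deadline:
    """Tracks an absolute expiry so each hop can ask how much budget is left."""

    def __init__(self, budget_seconds, clock=time.monotonic):
        # A monotonic clock avoids surprises from wall-clock adjustments.
        self._clock = clock
        self._expires_at = clock() + budget_seconds

    def remaining(self):
        # Seconds left in the budget, never negative.
        return max(0.0, self._expires_at - self._clock())

    def timeout_for(self, local_cap_seconds):
        # Abort immediately if the budget is spent; otherwise use the
        # smaller of the local timeout and the remaining budget.
        left = self.remaining()
        if left <= 0.0:
            raise DeadlineExceeded("no budget left; not calling downstream")
        return min(local_cap_seconds, left)
```

With this in place, a service whose own query timeout is 30s would call `deadline.timeout_for(30.0)` and receive at most the time the client is still willing to wait.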

Health Checks & Heartbeats #

| Mechanism | How it works | Use case |
|---|---|---|
| Liveness check | “Are you alive?” (basic ping/200) | Restart crashed processes |
| Readiness check | “Can you serve traffic?” (DB connected, warm) | Remove from load balancer pool |
| Heartbeat | Periodic signal between peers | Cluster membership, leader election |
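
The three mechanisms can be sketched in a few lines. Everything here is illustrative: `liveness`, `readiness`, and `HeartbeatMonitor` are hypothetical names, and the status codes simply mirror the "basic ping/200" convention in the table.

```python
import time


def liveness():
    # Liveness: the process can answer at all. Dependencies are deliberately
    # not checked, since restarting this process would not fix them.
    return 200


def readiness(dependency_checks):
    # Readiness: every dependency check must pass before taking traffic.
    # dependency_checks maps a name to a zero-argument callable returning bool.
    failed = sorted(name for name, check in dependency_checks.items() if not check())
    return (503, failed) if failed else (200, [])


class HeartbeatMonitor:
    """Marks a peer as suspected when no heartbeat arrives within the timeout."""

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self._timeout = timeout_seconds
        self._clock = clock
        self._last_seen = {}

    def beat(self, peer):
        # Record the time of the latest heartbeat from this peer.
        self._last_seen[peer] = self._clock()

    def suspected(self):
        # Peers whose last heartbeat is older than the timeout.
        now = self._clock()
        return sorted(p for p, seen in self._last_seen.items()
                      if now - seen > self._timeout)
```

A heartbeat timeout is itself a timeout decision: too short and healthy but slow peers are declared dead, too long and real failures go unnoticed.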

Graceful Degradation #

When a dependency is slow or down, degrade rather than fail:

  • Cached fallback: Serve stale data when fresh source is unavailable
  • Default response: Return sensible default (empty recs list rather than error)
  • Feature shedding: Disable non-critical features under load

Instinct: “The system should get worse gradually, not fall off a cliff.” For each dependency, decide in advance the answer to: “If this is down, what do we show instead?”
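
The first two fallbacks in the list can be chained: try the fresh source, then stale cache, then a default. A minimal sketch, assuming a hypothetical `fetch` callable that raises `TimeoutError` or `ConnectionError` when the dependency is slow or down, and a plain dict as the cache:

```python
def get_with_fallback(fetch, cache, key, default=()):
    """Return (value, source), where source is 'fresh', 'stale' or 'default'.

    Hypothetical helper: fetch, cache and the source labels are
    illustrative assumptions, not a specific library's API.
    """
    try:
        fresh = fetch(key)
        cache[key] = fresh  # remember the last good answer
        return fresh, "fresh"
    except (TimeoutError, ConnectionError):
        if key in cache:
            # Cached fallback: stale data is better than an error page.
            return cache[key], "stale"
        # Default response: e.g. an empty recommendations list.
        return list(default), "default"
```

Returning the source alongside the value lets the caller log or surface how often it is serving degraded answers.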

Instinct #

Every external call must have a timeout. No exceptions. Use deadline propagation for multi-hop chains (gRPC has this built-in).

Set timeouts based on the downstream's p99 latency (e.g. if p99 is 200ms, set the timeout to 500ms-1s, roughly 2.5-5x the p99).
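
The rule of thumb can be turned into a small calculation. This sketch uses a nearest-rank percentile; the function names, the 2.5x default multiplier, and the floor and ceiling values are assumptions picked to match the example above, not recommended constants.

```python
import math


def percentile(samples, pct):
    # Nearest-rank percentile: the smallest sample such that at least
    # pct percent of all samples are less than or equal to it.
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def suggest_timeout_ms(latencies_ms, multiplier=2.5, floor_ms=50.0, ceiling_ms=30000.0):
    # Scale the observed p99 by a safety multiplier, then clamp the
    # result so that outliers cannot produce an absurd timeout.
    p99 = percentile(latencies_ms, 99)
    return min(ceiling_ms, max(floor_ms, p99 * multiplier))


# 100 samples: 98 fast calls, one at 200ms, one 900ms outlier.
samples = [50.0] * 98 + [200.0, 900.0]
print(suggest_timeout_ms(samples))  # 500.0 (p99 is 200ms, times 2.5)
```

In practice the latency samples would come from production metrics over a representative window, and the result would be revisited as the downstream's behaviour changes.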

When a timeout fires, the question is: retry, circuit-break, or fallback?

See also: Retry with Backoff, Circuit Breaker.

References #

DDIA 2e Reference #

  • Chapter 8: Timeouts and Unbounded Delays