Failure & Partial Failures

🔴 P0 — the defining characteristic of distributed systems

Problem #

In a single-process system, things either work or they don’t. In distributed systems, some parts can fail while others continue working. This partial failure is non-deterministic: you often can’t tell whether a remote call succeeded, failed, or is still in progress.

Mechanism #

The three outcomes of a remote call:

Request → Network → Service → Network → Response

Outcome 1: ✅ Success      → got response, operation succeeded
Outcome 2: ❌ Failure       → got error, operation failed
Outcome 3: ❓ Unknown       → timeout. Did it succeed? Fail? Still running?

Outcome 3 is the hard one. It’s why we need idempotency, why we need timeouts, why we need circuit breakers. The inability to distinguish “slow” from “dead” from “succeeded but response lost” is the fundamental challenge.
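To make the three outcomes concrete, here is a minimal sketch that classifies a call by what the caller can actually observe. The names `Outcome`, `call_with_deadline`, `slow_write` and `applied` are hypothetical, and a worker thread stands in for the network and the remote service.

```python
import concurrent.futures as cf
import time
from enum import Enum


class Outcome(Enum):
    SUCCESS = "success"  # got a response, operation succeeded
    FAILURE = "failure"  # got an error, operation failed
    UNKNOWN = "unknown"  # no response before the deadline


def call_with_deadline(fn, timeout_s):
    """Run fn in a worker thread and report only what the caller can observe."""
    pool = cf.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn)
    try:
        return Outcome.SUCCESS, future.result(timeout=timeout_s)
    except cf.TimeoutError:
        # The work may still finish later; the caller has no way to tell.
        return Outcome.UNKNOWN, None
    except Exception as exc:
        return Outcome.FAILURE, exc
    finally:
        pool.shutdown(wait=False)  # do not block on the abandoned call


applied = []  # stands in for state held by the remote service


def slow_write():
    time.sleep(0.2)          # slower than the caller's deadline
    applied.append("debit")  # the side effect still happens
    return "ok"


outcome, _ = call_with_deadline(slow_write, timeout_s=0.05)
time.sleep(0.4)  # give the abandoned call time to finish
print(outcome, applied)
```

The caller sees `Outcome.UNKNOWN`, yet the write was applied anyway. A blind retry here would debit twice, which is why Outcome 3 pushes you toward idempotent operations.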

Failure taxonomy #

| Failure Type | Example | Detection | Recovery |
|---|---|---|---|
| Crash failure | Process dies | Heartbeat timeout | Restart, failover |
| Omission failure | Message dropped silently | Timeout | Retry |
| Timing failure | Response too slow | Deadline exceeded | Timeout + retry/fallback |
| Byzantine failure | Node sends corrupted/malicious data | Checksums, voting | Quorum, BFT (rare) |
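As an illustration of the "heartbeat timeout" detection listed for crash failures, here is a minimal sketch. `HeartbeatDetector` and its methods are hypothetical names, and the clock is passed in explicitly so the behaviour is deterministic.

```python
class HeartbeatDetector:
    """Suspects a node once no heartbeat has arrived for timeout_s seconds."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_seen = {}  # node name -> time of last heartbeat

    def heartbeat(self, node, now):
        self.last_seen[node] = now

    def suspected(self, now):
        # "Suspected", not "dead": a slow or partitioned node looks the same.
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )


detector = HeartbeatDetector(timeout_s=3.0)
detector.heartbeat("a", now=0.0)
detector.heartbeat("b", now=0.0)
detector.heartbeat("a", now=2.5)    # "a" keeps reporting, "b" goes quiet
print(detector.suspected(now=4.0))  # ['b']
```

The detector can only report suspicion. Whether `b` crashed, is slow, or is cut off by the network is exactly the ambiguity described under Mechanism.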

Key Trade-offs #

Detection speed vs false positives #

  • Short timeouts: detect failures fast, but false positives on slow responses → unnecessary failovers
  • Long timeouts: fewer false positives, but slow recovery → cascading delays
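The trade-off can be put in numbers with a small sketch. The latency samples below are made up for illustration and `false_positive_rate` is a hypothetical helper. It assumes every sample comes from a healthy service, so any response slower than the timeout counts as a false positive.

```python
def false_positive_rate(latencies_ms, timeout_ms):
    """Fraction of healthy responses that the timeout would flag as failures."""
    late = sum(1 for latency in latencies_ms if latency > timeout_ms)
    return late / len(latencies_ms)


# Illustrative samples only: a healthy service with a long tail.
samples_ms = [20, 25, 30, 40, 45, 60, 80, 120, 300, 900]

for timeout_ms in (50, 200, 1000):
    rate = false_positive_rate(samples_ms, timeout_ms)
    # A real crash is noticed only after the timeout expires.
    print(f"timeout={timeout_ms}ms  false positives={rate:.0%}  "
          f"crash detected after {timeout_ms}ms")
```

With these samples, a 50 ms timeout misclassifies half of the healthy responses, while a 1000 ms timeout misclassifies none but takes twenty times longer to notice a real crash.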

Availability vs consistency under failure #

  • Accept partial results (available but possibly stale) vs reject until full recovery (consistent but unavailable)
  • This is the trade-off the CAP theorem formalizes for network partitions. See also: CAP & PACELC
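A sketch of the two policies for a scatter-gather read, assuming a hypothetical `gather` helper and a dict in which `None` marks a shard that did not answer in time:

```python
def gather(shard_results, require_all):
    """Combine per-shard answers; None means the shard did not respond."""
    answered = {k: v for k, v in shard_results.items() if v is not None}
    missing = sorted(set(shard_results) - set(answered))
    if missing and require_all:
        # Consistency-leaning policy: refuse to answer with gaps.
        raise RuntimeError(f"shards unavailable: {missing}")
    # Availability-leaning policy: answer, but say what is missing.
    return {"data": answered, "partial": bool(missing), "missing": missing}


results = {"s1": 10, "s2": None, "s3": 7}  # s2 timed out
print(gather(results, require_all=False))
```

Marking the response as partial lets the caller decide whether an incomplete answer is acceptable. With `require_all=True` the same input raises instead.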

Instinct #

Design for Outcome 3. Every remote call in your system should have an answer to: “What happens if I never get a response?” The answer is usually a combination of: timeout → retry (with idempotency) → circuit breaker → fallback → alert.
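The chain above can be sketched in a few lines. Everything here is hypothetical (`CircuitBreaker`, `resilient_call`, `FakeServer`), and the sketch deliberately leaves out backoff, half-open recovery, and alerting. It assumes the server deduplicates on an idempotency key and that a missing response surfaces as `TimeoutError`.

```python
import uuid


class CircuitBreaker:
    """Opens after `threshold` consecutive failures (no reset logic here)."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.failures = 0

    @property
    def open(self):
        return self.failures >= self.threshold

    def record(self, ok):
        self.failures = 0 if ok else self.failures + 1


def resilient_call(send, breaker, fallback, attempts=3):
    """send(key) makes the remote call and raises TimeoutError on no response."""
    if breaker.open:
        return fallback()    # fail fast instead of piling onto a sick service
    key = str(uuid.uuid4())  # same key on every retry so the server can dedupe
    for _ in range(attempts):
        try:
            result = send(key)
        except TimeoutError:
            breaker.record(ok=False)
            if breaker.open:
                break
            continue         # Outcome 3: retry is safe only with the same key
        breaker.record(ok=True)
        return result
    return fallback()


class FakeServer:
    """Applies each key at most once, then loses the first N responses."""

    def __init__(self, lose_first_n):
        self.lose_first_n = lose_first_n
        self.applied = {}
        self.calls = 0

    def handle(self, key):
        self.calls += 1
        self.applied.setdefault(key, "charged")  # idempotent apply
        if self.calls <= self.lose_first_n:
            raise TimeoutError("response lost")
        return self.applied[key]


server = FakeServer(lose_first_n=1)
breaker = CircuitBreaker(threshold=5)
print(resilient_call(server.handle, breaker, fallback=lambda: "queued"))
print(len(server.applied), server.calls)
```

The first attempt is applied on the server but its response is lost. The retry reuses the key, so the operation is recorded once even though the server was called twice.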

References #

DDIA 2e Reference #

  • Chapter 8: The Trouble with Distributed Systems (this IS Chapter 8)
  • Chapter 9: Consistency and Consensus (what we build on top of failure handling)