Failure & Partial Failures

🔴 P0 — the defining characteristic of distributed systems

Problem #

In a single-process system, failure is usually total: the process either runs or it crashes. In a distributed system, some parts can fail while others keep working. Such partial failures are non-deterministic: you often cannot tell whether a remote call succeeded, failed, or is still in progress.

Mechanism #

The three outcomes of a remote call:

Request → Network → Service → Network → Response

Outcome 1: ✅ Success      → got response, operation succeeded
Outcome 2: ❌ Failure       → got error, operation failed
Outcome 3: ❓ Unknown       → timeout. Did it succeed? Fail? Still running?

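Outcome 3 is the hard one. It is the reason we need idempotency (so a retry is safe), timeouts (so the caller stops waiting), and circuit breakers (so callers stop hammering a dependency that is not answering). From the caller’s side, “slow”, “dead”, and “succeeded but response lost” look identical, and that is the fundamental challenge.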
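A minimal sketch of the three outcomes from the caller's point of view. `FakeService`, `classify`, and the `drop_response` flag are hypothetical names made up for this illustration; the point is only that a lost response leaves the caller with "unknown" even though the side effect already happened.

```python
import enum


class Outcome(enum.Enum):
    SUCCESS = "success"  # got a response: the operation succeeded
    FAILURE = "failure"  # got an error response: the operation failed
    UNKNOWN = "unknown"  # no response: it may or may not have run


class FakeService:
    """Hypothetical in-memory stand-in for a remote service."""

    def __init__(self):
        self.balance = 0

    def deposit(self, amount, drop_response=False):
        if amount <= 0:
            return False              # explicit error response
        self.balance += amount        # the side effect is applied first...
        if drop_response:
            # ...then the response is lost on the way back.
            raise TimeoutError("no response before the deadline")
        return True


def classify(call):
    """Map what the caller observed to one of the three outcomes."""
    try:
        ok = call()
    except TimeoutError:
        return Outcome.UNKNOWN
    return Outcome.SUCCESS if ok else Outcome.FAILURE


svc = FakeService()
print(classify(lambda: svc.deposit(10)))                      # Outcome.SUCCESS
print(classify(lambda: svc.deposit(-5)))                      # Outcome.FAILURE
print(classify(lambda: svc.deposit(10, drop_response=True)))  # Outcome.UNKNOWN
print(svc.balance)  # 20: the "unknown" deposit was in fact applied
```

The last line is the uncomfortable part: the caller saw a timeout, yet the server state changed. A blind retry here would deposit a third time.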

Failure taxonomy #

Failure Type	Example	Detection	Recovery
Crash failure	Process dies	Heartbeat timeout	Restart, failover
Omission failure	Message dropped silently	Timeout	Retry
Timing failure	Response too slow	Deadline exceeded	Timeout + retry/fallback
Byzantine failure	Node sends corrupted/malicious data	Checksums, voting	Quorum, BFT (rare)
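The "heartbeat timeout" detection in the first row can be sketched as follows. This is an illustrative toy, not a production failure detector: `HeartbeatDetector` and its methods are hypothetical names, and timestamps are passed in explicitly so the example is deterministic.

```python
class HeartbeatDetector:
    """Suspects a node once no heartbeat has arrived for `timeout_s` seconds."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_seen = {}  # node name -> time of the last heartbeat

    def heartbeat(self, node, now):
        self.last_seen[node] = now

    def suspected(self, now):
        # "Suspected", not "dead": a crashed node and a node whose
        # heartbeats are delayed or dropped look the same from here.
        return sorted(
            node
            for node, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )


detector = HeartbeatDetector(timeout_s=3.0)
detector.heartbeat("a", now=0.0)
detector.heartbeat("b", now=0.0)
detector.heartbeat("a", now=4.0)    # "a" keeps reporting in, "b" goes quiet
print(detector.suspected(now=5.0))  # ['b']
```

Note that the detector cannot tell a crash failure from an omission or timing failure; all three show up as the same missing heartbeat.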

Key Trade-offs #

Detection speed vs false positives #

Short timeouts: detect failures fast, but slow responses from healthy nodes are misread as failures (false positives) → unnecessary failovers
Long timeouts: fewer false positives, but real failures are detected late, so recovery starts late → cascading delays
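A small worked example of this trade-off. The latency samples below are invented for illustration (they are not measurements), and `false_positive_rate` is a hypothetical helper: it counts how many responses from a healthy service would be misclassified as failures at a given timeout.

```python
def false_positive_rate(latencies_ms, timeout_ms):
    """Fraction of healthy responses that would be cut off by the timeout."""
    late = sum(1 for latency in latencies_ms if latency > timeout_ms)
    return late / len(latencies_ms)


# Hypothetical latencies of a healthy but occasionally slow service.
healthy_ms = [20, 25, 30, 35, 40, 60, 80, 120, 400, 900]

for timeout_ms in (50, 200, 1000):
    rate = false_positive_rate(healthy_ms, timeout_ms)
    # A real crash is only noticed after roughly `timeout_ms`.
    print(f"timeout={timeout_ms}ms  false positives={rate:.0%}")
# timeout=50ms  false positives=50%
# timeout=200ms  false positives=20%
# timeout=1000ms  false positives=0%
```

With these numbers, a 50 ms timeout detects a crash quickly but treats half of the healthy responses as failures, while a 1000 ms timeout never misfires but waits a full second before reacting to a real crash.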

Availability vs consistency under failure #

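Accept partial results (available, but possibly incomplete or stale) vs reject requests until full recovery (consistent, but unavailable)
This is the CAP theorem trade-off in practice: during a network partition, a system has to give up either availability or consistency. See also: CAP & PACELC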
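The two choices can be sketched as a single scatter-gather query with a switch. Everything here is hypothetical (`query_all`, the `require_all` flag, the shard names); the "harvest" figure borrows the term from the Fox & Brewer paper listed under References, meaning the fraction of the data that the answer reflects.

```python
def query_all(shards, require_all):
    """Query every shard; return (rows, harvest) or raise if strict."""
    rows, failed = [], []
    for name, fetch in shards.items():
        try:
            rows.extend(fetch())
        except (TimeoutError, ConnectionError):
            failed.append(name)

    if failed and require_all:
        # Consistent but unavailable: refuse to answer with missing data.
        raise RuntimeError(f"shards unavailable: {failed}")

    # Available but incomplete: answer with what we have and say so.
    harvest = (len(shards) - len(failed)) / len(shards)
    return rows, harvest


def shard_down():
    raise TimeoutError("no response")


shards = {"s1": lambda: [1, 2], "s2": shard_down}

print(query_all(shards, require_all=False))  # ([1, 2], 0.5)

try:
    query_all(shards, require_all=True)
except RuntimeError as err:
    print(err)  # shards unavailable: ['s2']
```

Returning the harvest alongside the rows lets the caller decide whether a half-complete answer is good enough for its use case.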

Instinct #

Design for Outcome 3. Every remote call in your system should have an answer to: “What happens if I never get a response?” The answer is usually a combination of: timeout → retry (with idempotency) → circuit breaker → fallback → alert.
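One possible shape for that chain, as a rough sketch rather than a reference implementation. All names (`CircuitBreaker`, `call_with_policy`, `PaymentService`, the `"order-42"` key) are invented for this example, the timeout is assumed to surface as a `TimeoutError` from the call itself, and the final "alert" step is left out.

```python
import time


class CircuitBreaker:
    """Opens after `max_failures` consecutive failures, for `reset_s` seconds."""

    def __init__(self, max_failures=3, reset_s=30.0, clock=time.monotonic):
        self.max_failures, self.reset_s, self.clock = max_failures, reset_s, clock
        self.failures = 0
        self.opened_at = None

    def allow(self):
        if self.opened_at is None:
            return True
        # After the cool-down, let a trial call through again.
        return self.clock() - self.opened_at >= self.reset_s

    def record(self, ok):
        if ok:
            self.failures, self.opened_at = 0, None
            return
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = self.clock()


def call_with_policy(op, key, breaker, fallback,
                     attempts=3, base_delay_s=0.1, sleep=time.sleep):
    """timeout -> retry with the same idempotency key -> breaker -> fallback."""
    for attempt in range(attempts):
        if not breaker.allow():
            break                                   # dependency looks down
        try:
            result = op(key)                        # same key on every retry
        except (TimeoutError, ConnectionError):
            breaker.record(False)
            if attempt < attempts - 1:
                sleep(base_delay_s * 2 ** attempt)  # exponential backoff
            continue
        breaker.record(True)
        return result
    return fallback()


class PaymentService:
    """Hypothetical server that deduplicates by idempotency key."""

    def __init__(self):
        self.receipts = {}
        self.charges = 0
        self.drop_next_response = True

    def charge(self, key):
        if key not in self.receipts:       # only the first request charges
            self.charges += 1
            self.receipts[key] = f"receipt-{self.charges}"
        if self.drop_next_response:        # simulate Outcome 3 once
            self.drop_next_response = False
            raise TimeoutError("response lost")
        return self.receipts[key]


svc = PaymentService()
result = call_with_policy(svc.charge, "order-42", CircuitBreaker(),
                          fallback=lambda: "queued-for-later",
                          sleep=lambda s: None)    # skip real sleeping in the demo
print(result, svc.charges)  # receipt-1 1
```

The first attempt charges the customer but loses the response; the retry carries the same key, so the server returns the stored receipt instead of charging twice. Without the key, the retry that makes the client robust would make the system incorrect.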

References #

Gray Failure: The Achilles Heel of Cloud-Scale Systems — Microsoft Research (2017)
Harvest, Yield, and Scalable Tolerant Systems — Fox & Brewer (1999)
Notes on Distributed Systems for Young Bloods — Jeff Hodges

DDIA 2e Reference #

Chapter 8: The Trouble with Distributed Systems (this IS Chapter 8)
Chapter 9: Consistency and Consensus (what we build on top of failure handling)