Failure & Partial Failures
🔴 P0 — the defining characteristic of distributed systems
Problem
In a single-process system, things either work or they don’t. In distributed systems, some parts can fail while others continue working. This partial failure is non-deterministic: you often can’t tell whether a remote call succeeded, failed, or is still in progress.
Mechanism
The three outcomes of a remote call:
Request → Network → Service → Network → Response
Outcome 1: ✅ Success → got response, operation succeeded
Outcome 2: ❌ Failure → got error, operation failed
Outcome 3: ❓ Unknown → timeout. Did it succeed? Fail? Still running?
Outcome 3 is the hard one. It is why we need idempotency, timeouts, and circuit breakers. The inability to distinguish “slow” from “dead” from “succeeded but response lost” is the fundamental challenge.
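A minimal sketch of how the three outcomes look from the caller's side. Nothing here is a real client library: `call_with_deadline`, `slow_charge`, `refused`, and `ledger` are hypothetical names, and a worker thread stands in for the network. The point is that on a timeout the caller gives up while the remote work may still complete.

```python
import concurrent.futures
import enum
import time


class Outcome(enum.Enum):
    SUCCESS = "success"  # got a response, operation succeeded
    FAILURE = "failure"  # got an error, operation failed
    UNKNOWN = "unknown"  # timed out: succeeded, failed, or still running?


def call_with_deadline(remote_call, timeout_s):
    """Run remote_call in a worker thread and classify what the caller observes."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(remote_call)
    try:
        return Outcome.SUCCESS, future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        # The caller stops waiting, but the work is not cancelled.
        return Outcome.UNKNOWN, None
    except Exception as exc:
        return Outcome.FAILURE, exc
    finally:
        pool.shutdown(wait=False)  # do not block on the still-running call


ledger = []  # hypothetical stand-in for state held by the remote service


def slow_charge():
    time.sleep(0.3)           # slower than the caller's deadline
    ledger.append("charged")  # the side effect still happens
    return "ok"


def refused():
    raise ConnectionRefusedError("connection refused")
```

Calling `call_with_deadline(slow_charge, 0.05)` reports `Outcome.UNKNOWN`, yet a moment later `ledger` contains the charge. That is the "succeeded but response lost" case: blindly retrying would charge twice unless the operation is idempotent.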
Failure taxonomy
| Failure Type | Example | Detection | Recovery |
|---|---|---|---|
| Crash failure | Process dies | Heartbeat timeout | Restart, failover |
| Omission failure | Message dropped silently | Timeout | Retry |
| Timing failure | Response too slow | Deadline exceeded | Timeout + retry/fallback |
| Byzantine failure | Node sends corrupted/malicious data | Checksums, voting | Quorum, BFT (rare) |
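As an illustration of the "Heartbeat timeout" detection entry, here is a small sketch of a heartbeat-based detector. The class and method names are hypothetical, and timestamps are passed in explicitly so the behaviour is deterministic. Note that it can only report nodes as suspected: a crashed node and a node whose heartbeats are merely delayed look identical from here.

```python
class HeartbeatDetector:
    """Suspects a node when no heartbeat has arrived within timeout_s."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_seen = {}  # node name -> time of last heartbeat

    def record_heartbeat(self, node, now):
        self.last_seen[node] = now

    def suspected(self, now):
        # "Suspected", not "dead": silence may mean a crash, a dropped
        # message, or just a slow network.
        return sorted(
            node
            for node, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )


detector = HeartbeatDetector(timeout_s=3.0)
detector.record_heartbeat("a", now=0.0)
detector.record_heartbeat("b", now=0.0)
detector.record_heartbeat("a", now=4.0)  # "a" keeps reporting, "b" goes quiet
print(detector.suspected(now=5.0))
```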
Key Trade-offs
Detection speed vs false positives
- Short timeouts: detect failures fast, but false positives on slow responses → unnecessary failovers
- Long timeouts: fewer false positives, but slow recovery → cascading delays
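The trade-off can be made concrete with a toy calculation. The latency samples below are invented for illustration and are assumed to come from a healthy but occasionally slow service. For each candidate timeout, the sketch reports how many healthy responses would be misread as failures. The timeout itself is roughly how long a truly dead node goes unnoticed.

```python
def false_positive_rate(latencies_ms, timeout_ms):
    """Fraction of healthy responses that a given timeout would treat as failed."""
    too_slow = sum(1 for latency in latencies_ms if latency > timeout_ms)
    return too_slow / len(latencies_ms)


# Hypothetical latencies (ms) from a healthy service with a long tail.
healthy_latencies_ms = [20, 25, 30, 35, 40, 60, 80, 120, 400, 900]

for timeout_ms in (50, 100, 500, 1000):
    rate = false_positive_rate(healthy_latencies_ms, timeout_ms)
    print(f"timeout={timeout_ms}ms  false positives={rate:.0%}")
```

With these made-up numbers, a 50 ms timeout misclassifies half of the healthy responses, while a 1000 ms timeout misclassifies none but takes twenty times longer to notice a real failure.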
Availability vs consistency under failure
- Accept partial results (available but possibly incomplete or stale) vs reject until full recovery (consistent but unavailable)
- This is the essence of the CAP theorem trade-off — see also: CAP & PACELC
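A sketch of the two choices for a fan-out read, assuming a hypothetical service that queries several shards. `query_shards` and the shard callables are illustrative names. The `harvest` value is the fraction of shards that answered, used here as a rough proxy for how complete the response is.

```python
def query_shards(shards, allow_partial):
    """Fan out to every shard; return (results, harvest) or refuse on any failure."""
    results = []
    failed = 0
    for fetch in shards:
        try:
            results.extend(fetch())
        except ConnectionError:
            failed += 1
    harvest = (len(shards) - failed) / len(shards)
    if failed and not allow_partial:
        # Consistent but unavailable: refuse rather than return an incomplete answer.
        raise RuntimeError(f"{failed} of {len(shards)} shards unavailable")
    # Available but possibly incomplete: the caller sees how much is missing.
    return results, harvest


def shard_down():
    raise ConnectionError("shard unreachable")


shards = [lambda: ["a"], shard_down, lambda: ["c"]]
results, harvest = query_shards(shards, allow_partial=True)
print(results, round(harvest, 2))
```

Returning the harvest alongside the data lets the caller decide whether an incomplete answer is good enough, instead of hiding the missing shard.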
Instinct
Design for Outcome 3. Every remote call in your system should have an answer to: “What happens if I never get a response?” The answer is usually a combination of: timeout → retry (with idempotency) → circuit breaker → fallback → alert.
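The chain above can be sketched in a few dozen lines. This is a simplified illustration, not a production pattern: `CircuitBreaker`, `call_resiliently`, and `FakeService` are hypothetical names, the breaker has no half-open or reset logic, there is no backoff between retries, and `send` is assumed to raise `TimeoutError` when no response arrives. The idempotency key is generated once and reused on every retry so that the service can deduplicate.

```python
import uuid


class CircuitBreaker:
    """Opens after failure_threshold consecutive failures (no half-open state)."""

    def __init__(self, failure_threshold):
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0

    @property
    def is_open(self):
        return self.consecutive_failures >= self.failure_threshold

    def record(self, success):
        self.consecutive_failures = 0 if success else self.consecutive_failures + 1


def call_resiliently(send, breaker, fallback, alert, max_attempts=3, timeout_s=1.0):
    if breaker.is_open:
        return fallback()  # fail fast instead of piling onto a sick dependency
    idempotency_key = str(uuid.uuid4())  # same key on every retry
    for _ in range(max_attempts):
        try:
            result = send(idempotency_key, timeout_s)
        except TimeoutError:
            breaker.record(success=False)
            if breaker.is_open:
                break
            continue
        breaker.record(success=True)
        return result
    alert("remote call exhausted retries; serving fallback")
    return fallback()


class FakeService:
    """Applies each key at most once, but loses the first few responses."""

    def __init__(self, lose_first_n):
        self.lose_first_n = lose_first_n
        self.applied = {}  # idempotency key -> stored result
        self.calls = 0

    def send(self, key, timeout_s):
        self.calls += 1
        self.applied.setdefault(key, "charged")  # deduplicate on the key
        if self.calls <= self.lose_first_n:
            raise TimeoutError("response lost")  # the work happened anyway
        return self.applied[key]


alerts = []
service = FakeService(lose_first_n=1)
result = call_resiliently(service.send, CircuitBreaker(3), lambda: "fallback", alerts.append)
print(result, service.calls, len(service.applied))
```

In this run the first response is lost after the service has already applied the charge. The retry carries the same key, so the service returns the stored result instead of charging a second time.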
References
- Gray Failure: The Achilles Heel of Cloud-Scale Systems — Microsoft Research (2017)
- Harvest, Yield, and Scalable Tolerant Systems — Fox & Brewer (1999)
- Notes on Distributed Systems for Young Bloods — Jeff Hodges
DDIA 2e Reference
- Chapter 8: The Trouble with Distributed Systems (this IS Chapter 8)
- Chapter 9: Consistency and Consensus (what we build on top of failure handling)