Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems
Chapter 8. The Trouble with Distributed Systems
this chapter covers many more edge cases of how things fail in distributed systems
the chapter is a thoroughly pessimistic and depressing overview of things that may go wrong in a distributed system
however, we ignore Byzantine faults for now
the chapter walks through the most common problems to build an understanding of how these failures and faults arise
areas of problems:
- network problems
- clock and timing issues
there’s a need to have a consistent way to reason about our systems here and address these problems
Faults and Partial Failures #
the argument is that on a single node, HW problems mostly lead to total failures
so flaky software should be seen as a bug; there shouldn’t be a notion of “partial faults” in single-node software
this is because a single node is close to an idealized system model – things rarely go wrong, and when they do, the whole thing fails
networked, distributed systems do not operate in an idealized system model – that’s why a host of problems may arise
in this case, partial failures may exist:
- parts of the system that are broken in unpredictable ways while other parts are working fine
- these are non-deterministic failures
Cloud Computing and Supercomputing #
- cloud vs supercomputing: two ends of the spectrum of scaling philosophies
- cloud: often associated with multi-tenant data centers
- examples of factors on which they have different philosophies:
    - fault-handling
        - supercomputers:
            - state of computation is checkpointed into durable storage so that a failed job can resume from the last checkpoint
            - pattern: partial failure is allowed to escalate into total failure – if any part of the system fails, crash everything and resume (e.g. do a cluster repair, which is alright because high availability isn’t too big of a concern)
            - similar to a kernel panic on a single-node machine; in this sense a supercomputer is more like one big single-node computer than a distributed system
        - cloud systems:
            - online, low-latency services, so availability needs to be high
            - can’t just take the cluster down for repair
    - use of shared memory
        - supercomputers:
            - nodes communicate through shared memory and RDMA (remote direct memory access)
            - RDMA is very fascinating, look it up!
    - network topologies
        - supercomputers: special topologies e.g. multi-dimensional meshes and toruses – better performance for HPC workloads with known communication patterns
        - cloud datacenters: IP- and Ethernet-based, arranged in Clos topologies so that the bisection bandwidth is high
    - tolerance for failed nodes
        - cloud environments: can do rolling upgrades, can kill and spin up specific VMs, and so on – can tolerate some failed nodes; the overall system won’t die
    - geographical separation of nodes
        - cloud: typically geographically distributed; comms happen over the internet, which can be slow and unreliable
        - supercomputers: typically local networks; nodes expected to be close together
- so for distributed systems, the name of the game is to build a reliable system from unreliable components
- a system can be made more reliable than its underlying parts, but only within limits
- fault-handling should be part of the software design and need to know what behaviour to expect from the software in case of a fault
Unreliable Networks #
- reasons why shared-nothing approach to distributed nodes works well:
- cheaper because no special HW
- can use commoditized cloud computing services
- can have high reliability by having geographical redundancies
- internet and ethernet-based networks are async packet networks
- any two nodes can communicate with each other
- but there are no guarantees on whether a packet arrives, or when it arrives
- many failure reasons, hard to distinguish between:
- lost request
- remote node is down
- lost response
- hard to tell if delivery even happened
- typically, issue of unreliable networks handled via timeouts
- still don’t know if the remote node got the initial packet or not
Network Faults in Practice #
human error is a major cause of outages
so just adding network redundancy is not as useful as initially expected
a bidirectional network link that works in one direction can’t necessarily be expected to work in the other direction
e.g. a network interface that for some reason drops all inbound packets but sends all outbound packets successfully
network faults (this book uses this generic term, includes network partitions and so on)
we can’t put software in unanticipated situations because then it may do arbitrary, unexpected things
some examples from network faults:
deadlocked cluster – permanently unable to serve requests, even on recovery of the network
deletion of data
meaning of handling network faults:
doesn’t necessarily mean tolerating them
more of, being able to know how the software reacts to the network problems and ensuring that the system can recover from the problems
can be tested using chaos injection (e.g. Chaos Monkey)
Detecting Faults #
ability to detect faults is important
e.g. for taking a node out of rotation by the load balancer, or by detecting leader-failure and then doing a leader-promotion
some cases where it’s easy to get feedback on faults
if the process on a node has crashed but its OS is still running, the OS closes open TCP connections by replying with a RST or FIN packet, indicating that the connection is closed
crash-notification scripts: a script can notify other nodes about a crash so that another node can take over quickly without waiting for a timeout to expire
can query network switch interfaces in the data center if we have access within a datacenter
routers may send ICMP destination unreachable packets but the router also doesn’t have magical failure detection capabilities
typical rule of thumb:
to be sure a request succeeded, you need a positive response from the application itself; you can’t count on receiving any response at all when something goes wrong
to conclude dead nodes:
You can retry a few times (TCP retries transparently, but you may also retry at the application level), wait for a timeout to elapse, and eventually declare the node dead if you don’t hear back within the timeout.
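The rule of thumb above can be sketched as a liveness probe. This is a minimal sketch, assuming a TCP connection attempt as the probe; `host`/`port` and the retry/timeout parameters are illustrative, not a prescribed API:

```python
import socket

def node_is_alive(host, port, timeout_s=1.0, retries=3):
    """Probe a node, declaring it dead only after several timed-out attempts.

    NOTE: a timeout only tells us we got no response in time; the node may
    still be alive (and may even have processed a request) – we merely
    choose to *treat* it as dead once the timeout budget is exhausted.
    """
    for _ in range(retries):
        try:
            # Attempt a TCP connection as the liveness probe.
            with socket.create_connection((host, port), timeout=timeout_s):
                return True
        except OSError:  # timed out, refused, unreachable, ...
            continue
    return False  # no response within `retries` attempts: presume dead
```

The real subtlety lives in choosing `timeout_s`, which the next section discusses.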
Timeouts and Unbounded Delays #
non trivial to answer how long a timeout should be
- too long and it takes too much time to correct
- too short and it corrects prematurely
prematurely declaring nodes dead can cause problems:
duplicate-action problems
e.g. the original process is not actually dead; declaring it dead and retrying the work elsewhere means the action gets performed twice
cascading failures from overloading the node further because of handover bookkeeping
if it was slow to respond because it’s overloaded and then it needs to xfer load to peers, then its workload increases further
might end up in a situation where all nodes declare each other dead and nothing works
if the network guaranteed a maximum delivery delay \(d\) and nodes guaranteed a maximum request-handling time \(r\), a timeout of \(2d + r\) would be reasonable – but async networks have unbounded delays, so no such guarantee exists; therefore:
- delivery is always attempting to do things ASAP – so we can identify some lowerbound (fastest time)
- but there’s no upper limit on the time
not good enough for the system to just be FAST for most of the time, else just a transient spike in RTTs will throw the system off-balance
network congestion and queueing #
analogous to vehicular traffic, variability of packet delays is typically due to queueing
[queuing @ network link] network congestion:
- busy network link, a packet might have to wait until it’s slotted in
- if too busy and the queue is full, the packet might be dropped
[queueing can be at the destination machine] :
- e.g. CPU cores are busy, so OS just queues the network request
[queueing @ the sender-side]: flow control aka congestion avoidance / backpressure:
self-limits on its own rate of sending to avoid overloading the network link / receiving node
so more queuing @ the sender before the data even enters the network
queueing delays have a wide range of variability when a system is close to its maximum capacity
for public clouds, most of the infra is shared between customers
so batch workloads and such can easily saturate network links, network delays can come from noisy neighbours
in such situations we have to empirically determine the best timeout value
can also be a dynamic system that constantly adapts to jitter
\(\implies\) a Phi Accrual failure detector would work well for such systems
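A minimal sketch of the Phi Accrual idea (Hayashibara et al.): instead of a binary alive/dead verdict, output a suspicion level φ derived from the observed distribution of heartbeat inter-arrival times. The normal-distribution model, window size, and threshold below are illustrative assumptions:

```python
import math
import statistics

class PhiAccrualDetector:
    """Sketch of a Phi Accrual failure detector: suspicion grows
    continuously as silence gets less likely under the observed
    heartbeat inter-arrival distribution."""

    def __init__(self, threshold=8.0):
        self.threshold = threshold
        self.intervals = []        # sliding window of inter-arrival times
        self.last_heartbeat = None

    def heartbeat(self, now):
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
            self.intervals = self.intervals[-100:]  # keep recent history only
        self.last_heartbeat = now

    def phi(self, now):
        if len(self.intervals) < 2:
            return 0.0
        mean = statistics.mean(self.intervals)
        std = statistics.stdev(self.intervals) or 1e-6
        elapsed = now - self.last_heartbeat
        # P(next heartbeat arrives later than `elapsed`) under a normal model
        z = (elapsed - mean) / std
        p_later = 0.5 * math.erfc(z / math.sqrt(2))
        return -math.log10(max(p_later, 1e-12))

    def suspect(self, now):
        return self.phi(now) > self.threshold
```

With steady 1-second heartbeats, φ stays near 0 shortly after a beat and climbs past the threshold after a long silence – effectively a timeout that adapts itself to observed jitter.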
Synchronous Versus Asynchronous Networks #
summary:
async networks allow us to dynamically manage bandwidth requirements
in return they suffer from queueing and therefore unbounded delays \(\implies\) variable delays are a consequence of dynamic resource partitioning
this should be seen as a cost-benefit tradeoff
however, queueing ensures max resource utilisation (of the link layer)
sync networks are guaranteed their bandwidth allocation so the delay is bounded and there’s no need for queuing (can just be pipelined)
latency guarantees typically come at the expense of resource utilisation because resources end up being statically partitioned
alternatively, multi-tenancy with dynamic resource partitioning provides better utilization, so it is cheaper, but it has the downside of variable delays.
mechanisms to consider so that packet-systems can emulate circuit-like behaviour
considering QoS: prioritization, scheduled packets
considering admission control: (e.g. rate-limiting the senders)
comparing datacenter networks (asynchronous) to traditional fixed-line telephone networks (synchronous)
lifecycle of a call: a circuit is established (fixed guaranteed bandwidth allocated for the call, along the entire route b/w the two communicating nodes)
since it’s a synchronous network, the bandwidth allocation is guaranteed \(\implies\) e2e latency is fixed \(\implies\) bounded delay
async networks (Ethernet, IP) are packet-switched protocols, so they will always suffer from queueing and hence unbounded delays in the network
they never form circuits
benefits of packet switching:
optimised for bursty traffic
variable bandwidth requirements that the network can adapt to dynamically
BGP (Border Gateway Protocol) peering makes the packet-based internet behave somewhat more like circuits
consider the statement:
Peering agreements between internet service providers and the establishment of routes through the Border Gateway Protocol (BGP), bear closer resemblance to circuit switching than IP itself. At this level, it is possible to buy dedicated bandwidth.
this is a very interesting characterisation:
Here’s a focused primer on BGP that connects to the quoted statement.

Direct answer to the core idea:
- The Border Gateway Protocol (BGP) is the system the Internet uses to exchange reachability information between large networks (autonomous systems, ASes). It is not the IP protocol itself, but the routing policy and path selection mechanism that tells networks how to forward traffic toward destination networks across many other networks. In that sense, BGP shapes which networks are used as transit paths and which routes are preferred, rather than simply delivering packets between hosts. [en.wikipedia](https://en.wikipedia.org/wiki/BGP)

Key concepts:
- What BGP is and what it isn’t
    - BGP is an inter-domain routing protocol used between ASes to advertise IP prefixes and to convey routing policies. It relies on TCP for its transport, which makes it reliable and session-oriented, unlike many interior routing protocols that run directly over UDP or IP. [techtarget](https://www.techtarget.com/searchnetworking/definition/BGP-Border-Gateway-Protocol)
    - It is not a method for IP packet switching or a replacement for the IP addressing system. IP routes are the actual forwarding entries; BGP is the control plane that distributes those routes and the policies that govern which routes are chosen. [en.wikipedia](https://en.wikipedia.org/wiki/BGP)
- Peering, transit, and control of routes
    - Internet Service Providers (ISPs) peer with each other and with large networks to exchange routes. Through BGP, each ISP (or IXP) can advertise which IP prefixes it can reach and which prefixes it prefers to carry, enabling large-scale path selection across multiple networks. [fortinet](https://www.fortinet.com/resources/cyberglossary/bgp-border-gateway-protocol)
    - The statement about “buying dedicated bandwidth” corresponds to how some exchanges and providers offer private peering or dedicated transit arrangements, where an operator pays for a more controlled or higher-quality path. BGP still governs reachability and policy in these arrangements, but the actual bandwidth and SLAs come from the interconnection terms rather than BGP alone. [fortinet](https://www.fortinet.com/resources/cyberglossary/bgp-border-gateway-protocol)
- Path selection and policies
    - BGP uses attributes like AS path, next hop, local preference, MED (multi-exit discriminator), and route origin to determine the “best” path to a destination prefix. Network operators tune these attributes to influence routing decisions, apply traffic engineering, and implement business policies (for example, preferring one upstream provider over another for outbound traffic). [techtarget](https://www.techtarget.com/searchnetworking/definition/BGP-Border-Gateway-Protocol)
    - This policy-driven selection is what makes BGP resemble a circuit-switched mindset in practice for large-scale interconnection: operators often shape which physical or logical paths are used, not just which IP addresses are reachable. However, unlike classic circuit switching, BGP does not reserve fixed bandwidth in a circuit by default; bandwidth control happens via interconnection arrangements, QoS on paths, and traffic engineering outside pure BGP route announcements. [cloudflare](https://www.cloudflare.com/learning/security/glossary/what-is-bgp/)
- BGP basics in practice
    - External BGP (EBGP) runs between different ASes; internal BGP (IBGP) runs within the same AS to distribute routing information. IBGP generally carries the same prefixes but relies on careful design to avoid routing loops and ensure consistent views across an AS. [networklessons](https://networklessons.com/bgp/introduction-to-bgp)
    - BGP sessions are established over TCP, with a defined finite-state machine guiding session setup, maintenance, and teardown. Misconfigurations (e.g., mismatched ASN, version, or route filtering) can cause sessions to fail or routes to be dropped, underscoring the importance of careful implementation and monitoring. [en.wikipedia](https://en.wikipedia.org/wiki/BGP)
- Security and risk considerations
    - BGP has well-known security challenges: misconfigurations can result in traffic rerouting (traffic hijacks), and practical authentication and route validation mechanisms (e.g., using RPKI to validate route origins) are increasingly used to mitigate these risks. Security is largely about policy and verification at the edge of interconnections rather than about the core protocol itself. [cloudflare](https://www.cloudflare.com/learning/security/glossary/what-is-bgp/)

Mapping the quoted statement to practical understanding:
- “Peering agreements between ISPs and the establishment of routes through BGP” aligns with how ISPs establish mutual reachability and policy-based routing decisions via BGP advertisements and acceptance rules. The agreements define what traffic can flow where and under what conditions, while BGP disseminates the actual route information and policies between peering networks. [fortinet](https://www.fortinet.com/resources/cyberglossary/bgp-border-gateway-protocol)
- “Bear closer resemblance to circuit switching than IP itself” can be interpreted as: at the inter-domain scale, operators actively select specific paths and providers to meet performance, reliability, and business requirements. BGP enables this policy-driven path selection across multiple networks, which in practice can feel like engineering a dedicated path, even though the underlying IP forwarding remains packet-based and dynamic. [cloudflare](https://www.cloudflare.com/learning/security/glossary/what-is-bgp/)
- “At this level, it is possible to buy dedicated bandwidth” reflects real-world interconnection options (private peering, dedicated transit, service-level agreements) that complement BGP by guaranteeing performance and SLA-backed paths. BGP will still carry the route advertisements for those paths, but the guaranteed bandwidth and latency come from the interconnection terms, not from BGP alone. [techtarget](https://www.techtarget.com/searchnetworking/definition/BGP-Border-Gateway-Protocol)

Practical takeaways:
- Focus on how BGP advertisements translate into reachability and how policies shape path selection across multiple networks.
- Understand the distinction between reachability (BGP’s job) and bandwidth/SLA guarantees (interconnection terms, QoS, and traffic engineering outside BGP).
- Learn the difference between EBGP and IBGP, and why route reflectors or full-mesh IBGP configurations matter in large networks.
- Pay attention to security practices like RPKI and route filtering to reduce risks of misrouted traffic.

Citations:
- BGP overview and its use for inter-domain routing and policy: [en.wikipedia](https://en.wikipedia.org/wiki/BGP)
- BGP session mechanics and TCP-based transport: [techtarget](https://www.techtarget.com/searchnetworking/definition/BGP-Border-Gateway-Protocol)
- EBGP/IBGP concepts and policy influence: [networklessons](https://networklessons.com/bgp/introduction-to-bgp)
- ISP peering, transit, and the idea of dedicated interconnect bandwidth: [fortinet](https://www.fortinet.com/resources/cyberglossary/bgp-border-gateway-protocol)

Further elaboration:
Great quote – there’s a lot of quiet wisdom packed into it, and it only really clicks once you stop thinking of “the Internet” as a uniform IP cloud and instead see the **economic + control plane underneath**. A BGP primer oriented toward appreciating that sentence:

## 1. The mental model most people have (and why it’s misleading)

Most people internalize this story:

* IP is packet-switched
* Packets are routed independently
* The Internet is “best effort”
* There is no notion of circuits or reservations

This is *true* **at the IP forwarding layer** – routers forward packets hop-by-hop with no per-flow guarantees. But the quote is pointing at something *above* and *below* that layer.

## 2. What BGP actually is (and is not)

* BGP is **not a routing algorithm in the classic sense**
* It is a **policy distribution protocol**
* It tells routers *which prefixes they are allowed or willing to carry traffic for*

Key properties:

* Runs **between Autonomous Systems (ASes)**, not within them
* Exchanges *reachability + policy attributes*, not shortest paths
* Converges slowly and deliberately (minutes, not milliseconds)
* Optimizes for **stability and policy**, not performance

In short:

> BGP answers: *“Who is willing to carry traffic for whom, and under what conditions?”*

## 3. Autonomous Systems = political / economic entities

An **AS** is not just a technical unit:

* It represents an organization (ISP, cloud provider, large enterprise)
* It has commercial interests
* It has contracts

Examples: Tier-1 ISPs, cloud providers (AWS, Google), IXPs, large CDNs.

BGP is how these entities **negotiate reachability** – the first hint of why the quote mentions *peering agreements*.

## 4. Peering agreements: the real substrate

Two dominant relationship types:

1. **Transit**: AS A pays AS B to carry traffic to the rest of the Internet – like buying a highway pass
2. **Peering**: AS A and AS B agree to exchange traffic directly – often settlement-free, sometimes paid, usually at specific physical locations (IXPs)

These agreements specify:

* Which prefixes are advertised
* Traffic ratios
* Capacity limits
* Where interconnection happens
* Sometimes explicit bandwidth guarantees

None of this is visible in IP. All of it is *expressed and enforced via BGP policy*.

## 5. Why BGP routing is *not* “packet switched” in spirit

At the IP layer: each packet is independent, there are no guarantees, and forwarding is stateless.

At the **inter-AS layer** (BGP + contracts): paths are long-lived, chosen deliberately, often manually engineered, and often capacity-planned months in advance. Once a BGP path is selected, *all traffic* to that prefix follows it until policy changes or failure occurs. This starts to look like:

> “We’ve agreed that traffic between these networks will flow along *this corridor*.”

That’s already drifting away from pure packet switching.

## 6. Where the “circuit switching” analogy comes from

| Property | Circuit Switching | BGP Inter-AS Routing |
| ----------- | --------------------- | -------------------- |
| Path setup | Pre-established | Pre-negotiated |
| Duration | Long-lived | Long-lived |
| Capacity | Planned | Planned |
| Control | Central / contractual | Contractual |
| Failover | Slow | Slow |
| Granularity | Circuit | Prefix |

**Key insight:** BGP doesn’t reserve per-flow bandwidth, but it *does* establish **stable, capacity-planned paths** between economic entities. Hence: *resembles circuit switching* – not technically, but **operationally and economically**.

## 7. “At this level, it is possible to buy dedicated bandwidth”

This line is crucial. You cannot buy “dedicated bandwidth” at the IP packet level. But you *can*:

* Pay for higher-capacity interconnects
* Pay for private peering
* Pay for MPLS-backed transit
* Pay for guaranteed ports at IXPs
* Pay for direct cloud interconnects (e.g. AWS Direct Connect)

BGP is how traffic is **steered onto those paid paths**. Examples: two ISPs with a 100 Gbps private link; a CDN with private peering into access networks; a bank buying deterministic paths between data centers. Once that link exists, traffic is not competing with the general Internet, congestion behavior is predictable, and routing changes are rare. That’s “dedicated bandwidth” in practice.

## 8. The layered illusion of “best effort Internet”

The wisdom of the quote is that:

> The Internet looks packet-switched and best-effort **only if you ignore the control plane and economics**.

The public Internet is an overlay on top of contracts, physical cross-connects, planned capacity, and stable BGP paths. You get packet switching *within* those corridors, but the corridors themselves are **quasi-circuits**.

## 9. Why this matters conceptually

1. **Why performance is so uneven**: some traffic rides premium private peering; other traffic rides congested transit paths. Same IP, radically different reality.
2. **Why “the Internet” is not neutral by default**: routing is policy-driven; money shapes paths.
3. **Why cloud networking feels “less Internet-y”**: cloud providers recreate controlled interconnects, capacity planning, and predictable paths – leaning *into* the circuit-like nature.

## 10. The distilled wisdom in one paragraph

**Although IP forwarding is packet-switched and stateless, the global Internet is held together by long-lived, policy-driven, capacity-planned paths negotiated via BGP between economic actors. At that layer, routing behaves less like opportunistic packet switching and more like virtual circuits whose existence, capacity, and stability are governed by contracts rather than algorithms.**

That’s what the quote is quietly pointing at.
Unreliable Clocks #
some uses of clocks
- for measuring durations (e.g. timeouts, p99 response time of service…)
- for measuring points in time (e.g. cache expiry time, timestamping…)
time-related problems
difficult to determine the order of events when multiple machines are involved because of variable network delays
internal clock-drifts because quartz is not perfectly accurate
synchronising internal clocks across multiple devices
the following are all examples of dangerous clock use in a distributed system
Monotonic Versus Time-of-Day Clocks #
computers typically have both of these types of clocks:
Time-of-day clocks:
gives wall-clock time, measures with respect to some date on some calendar (e.g. epoch time) and ignores the existence of leap-seconds
typically synced via NTP
not suitable for measuring elapsed time, because:
- ignores leap seconds
- if a local clock is too deviated, resetting done by NTP server might make it seem to jump back in time
monotonic clocks:
used for measuring time-intervals e.g. timeouts and such because they hold the invariant that they always move forward
typically in the range of microseconds or less
the absolute value of the clock is meaningless
- shouldn’t be used to compare the absolute value of monotonic clocks across different computers
monotonicity guarantee with caveats:
each CPU socket may have separate timer, so may have drifts within CPUs
typically CPU will try to manage the discrepancy but this may be a source of error
NTP (slewing): NTP may adjust the rate at which the monotonic clock moves forward if its frequency deviates beyond a threshold from the NTP server
NOTE: NTP cannot cause the monotonic clock to jump – it only slightly speeds it up or slows it down
don’t need synchronisation
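The distinction above maps directly onto Python's standard library: `time.time()` reads the time-of-day (wall) clock, which can jump backwards if NTP resets the local clock, while `time.monotonic()` only ever moves forward. A small sketch of why durations belong on the monotonic clock:

```python
import time

def timed_call(fn):
    """Measure how long `fn` takes using the monotonic clock."""
    start = time.monotonic()
    result = fn()
    # Never negative, even if NTP steps the wall clock mid-measurement;
    # the same arithmetic with time.time() could yield a negative value.
    elapsed = time.monotonic() - start
    return result, elapsed

result, elapsed = timed_call(lambda: sum(range(1000)))
```

Also note the caveat above: `elapsed` is meaningful, but the raw value of `time.monotonic()` is not comparable across machines (or even across processes).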
Clock Synchronization and Accuracy #
time-of-day clocks need to be synced to an NTP server
it’s not too reliable or accurate of a system, some sources of fickleness:
clock drift
varies based on the machine temperature
drift limits the best possible accuracy that we can achieve, even if everything is working correctly
too much difference from NTP server
case A: refuse to sync
case B: local clock forcibly reset
if node firewalled from NTP servers, the misconfig can take some time to detect
NTP sync suffers from network delay as well so accuracy is limited to level of congestion of the network
NTP clients are robust (i.e. sample from multiple servers) but NTP servers may be misconfigured and so on
Leap-seconds mess things up
a minute may end up being 59 or 61 seconds long
example of a solution: smearing is the adjustment of leap seconds gradually over the course of the day
VMs with virtualised HW clocks have their own complications
when CPU cores are shared, a VM can be paused while another VM runs; to an application this shows up as the virtualised HW clock suddenly jumping forward
on other people’s devices, assuming their time is accurate is not a safe assumption, because of user habits like setting a custom time
it’s possible to have better sync guarantees and accuracy
- e.g. use GPS receivers, use the PTP (precision time protocol) and so on
Relying on Synchronized Clocks #
like the network themselves, though clocks work most of the time, the software must be robust enough to deal with incorrect clocks
incorrect clocks are hard to notice \(\implies\) typically silent failures
that’s why clocks (and clock offsets between all the machines) should be carefully monitored
Timestamping for ordering events #
it’s not a good idea to do global ordering using local timestamping (time-of-day clocks), especially for a db with multi-leader replication
LWW conflict resolution (Last Write Wins) has some fundamental problems:
can end up having silent data loss because of clock skew between nodes
a node with a lagging clock will be unable to overwrite values previously written by a node with a fast clock until the clock skew b/w the nodes has elapsed
LWW can’t distinguish b/w the following:
writes that occurred sequentially in quick succession i.e. one after the other in short period
writes that were truly concurrent (neither writer aware of the other writer)
solution: using additional causality tracking mechanisms e.g. version vectors to prevent violations of causality
possible for 2 nodes to independently gen writes with same timestamp \(\implies\) might need a tie-breaker value
might also violate other causality rules
we need to use logical clocks e.g. incrementing counters instead of oscillating quartz crystals to do event-ordering
logical because it’s relative to the meaning of the ordering; different from physical clocks (time-of-day clocks, monotonic clocks)
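A minimal sketch of one common logical-clock scheme, the Lamport clock (version vectors, mentioned above, carry richer causality information, but the incrementing-counter spirit is the same):

```python
class LamportClock:
    """Lamport logical clock: an incrementing counter whose ordering
    respects causality – if event A happens-before event B, A's
    timestamp is strictly smaller than B's."""

    def __init__(self):
        self.time = 0

    def tick(self):
        """Advance for a local event (including sending a message)."""
        self.time += 1
        return self.time

    def receive(self, remote_time):
        """On receiving a message, jump past the sender's timestamp so the
        receive event is ordered after the send."""
        self.time = max(self.time, remote_time) + 1
        return self.time
```

E.g. if node A sends at timestamp 1 to a node B whose clock reads 5, B's receive event gets timestamp 6 – the send/receive causality survives even though the nodes' wall clocks may disagree wildly.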
Clock readings have a confidence interval #
even with high resolution (e.g. ns resolution), fine-grained measurements hinge on accuracy
typically, clock readings synced via NTP are off by several ms because of quartz drift
better mental model: a clock reading is not a point in time – it is more like a range of times, within a confidence interval: for example, a system may be 95% confident that the time now is between 10.3 and 10.5 seconds past the minute
uncertainty bound is based on time source
more accurate HW like GPS receiver or atomic (cs) clock will have their expected error range indicated also
for clients, the uncertainty is the expected drift since the last sync with the time server, plus the NTP server’s own uncertainty, plus the network RTT
but most systems don’t expose such uncertainties
some do (e.g. Google TrueTime)
Synchronized Clocks for Global Snapshots #
throwback to snapshot isolation for dbs that need to support small, fast, RW transactions and large long-running RO txns
typically, this needs a monotonically increasing txn ID
single-node dbs: good enough to just have a counter
distributed: the monotonically increasing txn id needs coordination \(\implies\) this might actually be a source of bottlenecks if many, small, rapid txns
txn id must reflect causality for the snapshotting to be consistent
NOTE: things like snowflake IDs somewhat help but still can’t guarantee an ordering that is consistent with causality
there’s some possible value in clock sync for distributed txn semantics
if wall-clock sync is handled well enough, time-of-day clocks can be used for this kind of txn causality
e.g. Google’s Spanner uses the lack of overlap between reported time confidence intervals to determine causality
also, Google’s datacenters are all GPS/atomic-clock synced
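The interval-overlap idea can be sketched as follows. The names here are hypothetical illustrations; Spanner's actual TrueTime API differs, but the rule is the same – order is only certain when the confidence intervals don't overlap:

```python
from dataclasses import dataclass

@dataclass
class ClockInterval:
    """A TrueTime-style clock reading: the true time is somewhere
    in [earliest, latest]."""
    earliest: float
    latest: float

def definitely_before(a: ClockInterval, b: ClockInterval) -> bool:
    """The event read as `a` definitely happened before the event read
    as `b` only if the intervals don't overlap; otherwise the order is
    ambiguous and extra coordination (or waiting) is needed."""
    return a.latest < b.earliest
```

This is why Spanner does a "commit wait": it deliberately waits out the interval width before releasing a commit timestamp, so any later transaction's interval cannot overlap it.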
changes since the book was published:
people use HLCs (hybrid logical clocks) now which depends on physical time and if causality repair is needed, uses logical counters
this is an 80/20 solution
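A minimal HLC sketch following the scheme of Kulkarni et al.: timestamps are (physical time, logical counter) pairs, and the counter repairs causality whenever physical clocks disagree. The tuple layout and the injectable `now_fn` are illustrative choices for this sketch:

```python
import time

class HybridLogicalClock:
    """Hybrid logical clock: tracks the max physical time seen (l) plus a
    logical counter (c) that breaks ties when physical clocks are skewed."""

    def __init__(self, now_fn=time.time):
        self.now_fn = now_fn  # injectable wall-clock source, for testing
        self.l = 0.0          # max physical time observed so far
        self.c = 0            # logical counter

    def send(self):
        """Timestamp a local or send event."""
        pt = self.now_fn()
        if pt > self.l:
            self.l, self.c = pt, 0   # physical time moved forward: reset counter
        else:
            self.c += 1              # clock stalled/skewed: causality via counter
        return (self.l, self.c)

    def receive(self, remote):
        """Merge a remote timestamp on message receipt."""
        rl, rc = remote
        pt = self.now_fn()
        new_l = max(self.l, rl, pt)
        if new_l == self.l == rl:
            self.c = max(self.c, rc) + 1
        elif new_l == self.l:
            self.c += 1
        elif new_l == rl:
            self.c = rc + 1          # ordered after the sender's event
        else:
            self.c = 0
        self.l = new_l
        return (self.l, self.c)
```

Comparing timestamps as plain tuples then gives an order consistent with causality, even when the receiver's wall clock lags the sender's.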
modern consensus looks more like part of the tooling
Synchronized clocks are neither necessary nor sufficient for distributed transaction semantics—but when combined with logical ordering, they are extremely powerful.
summary:
Global snapshots require a notion of causality-aware time. You can approximate it with counters, infer it with logical clocks, or anchor it to physical time—but any scalable system must carefully trade coordination cost against semantic strength. Clock synchronization is no longer magical, but correctness still lives in how uncertainty is modeled, not how accurate the clock claims to be.
This DDIA section aged surprisingly well, but got quietly reframed by practice. Walking through it as: **what DDIA claimed → what actually happened → what the current consensus looks like**, then extracting the wisdom that survived.

## 0. Quick orienting frame

DDIA (≈2017) was written when:

* Spanner was the poster child
* Clock sync felt exotic and "Google-scale"
* MVCC + snapshots were dominant
* CRDTs were still niche in databases
* TrueTime felt like a moonshot

So the book asked a real question:

> Can synchronized clocks be a practical foundation for global snapshot isolation?

## 1. Monotonic transaction IDs: still true, still painful

### DDIA claim

You need a **monotonically increasing notion of time / version** that:

* Reflects causality
* Allows consistent snapshots
* Works under concurrency

### How this fared

✅ Still fundamentally true, but practice sharpened the tradeoff:

| Approach | Status |
| ------------------------ | ----------------------------------------- |
| Global counter | ❌ Dead at scale |
| Central timestamp oracle | ⚠️ Bottleneck unless carefully engineered |
| Snowflake-style IDs | ⚠️ Good for uniqueness, *not* causality |
| Logical clocks | ✅ Great for causality, weak for real-time |
| Hybrid clocks | ✅ The mainstream answer |

The industry converged on:

> You don't want *global total order*, you want *enough order*.

## 2. Snowflake IDs: exactly as limited as DDIA predicted

The assessment that Snowflake somewhat helps but still can't guarantee ordering consistent with causality has held perfectly.

Snowflake IDs:

* Encode wall-clock time + randomness
* Are cheap, scalable, unique
* Do **not** capture happens-before relationships

They are great for:

* Log ordering
* Sharding
* Rough recency queries

They are **not** sufficient for:

* Serializable snapshots
* External consistency
* True causality tracking

Industry consensus: 👉 Snowflake is an **ID scheme**, not a **consistency mechanism**.

## 3. Clock synchronization: from "exotic" to "normal infrastructure"

This is where things shifted most since DDIA.

### Then (DDIA era)

* GPS / atomic clocks were "Google-only"
* Clock uncertainty was scary
* Time sync was considered unreliable

### Now

Clock sync is still hard, but **operationally routine** at hyperscale. What changed:

* Better PTP (IEEE 1588)
* Hardware timestamping
* NIC-assisted time sync
* Time-aware networking
* Cloud providers offering time guarantees

Examples:

* Google still uses TrueTime
* AWS uses tightly synchronized time inside regions
* Azure and Meta do similar things internally

**Key shift:** clock sync is now treated as **infrastructure**, not theory.

## 4. TrueTime-style interval reasoning: niche, but validated

### DDIA's framing

Spanner's trick:

* Don't assume exact time
* Track *time intervals*
* Only trust causality when intervals don't overlap

This was presented as elegant and powerful, but expensive and rare.

### How it fared

✅ Conceptually validated, ⚠️ practically niche. Why?

* Requires tight operational discipline
* Requires hardware support
* Adds latency (commit wait)
* Overkill for many workloads

But the *idea* survived and spread in softer forms.

## 5. The big evolution: Hybrid Logical Clocks (HLC)

This is the most important post-DDIA development to understand.

### What happened

The industry largely converged on **HLCs**:

* Physical time (wall clock)
* Plus a logical counter for causality repair

Used by:

* CockroachDB
* YugabyteDB
* TiDB
* FoundationDB (variants)
* Many internal systems

### Why this won

HLCs:

* Preserve causality
* Approximate real time
* Avoid global coordination
* Degrade gracefully under skew
* Don't require atomic clocks

This directly answers DDIA's concern that clock sync might help but coordination is expensive:

**HLCs let you get 80–90% of the benefit for 10–20% of the cost.**

## 6. Snapshot isolation vs "good enough" consistency

### Then

Snapshot isolation + external consistency felt like the gold standard, worth paying for.

### Now

More systems ask: "Do we really need global snapshots?" Result:

* Many systems relax guarantees
* Use causal consistency
* Use bounded staleness
* Use per-partition snapshots

Spanner-like semantics remain extremely impressive and extremely rare. They're chosen only when:

* Cross-region correctness is existential
* Latency budget allows commit waits
* Cost is justified

## 7. The reframe: clocks as *one tool*, not *the tool*

The modern consensus looks like this:

> Synchronized clocks are neither necessary nor sufficient for distributed transaction semantics—but when combined with logical ordering, they are extremely powerful.

In other words:

* Pure logical clocks → great causality, poor real-time semantics
* Pure physical clocks → fast, but unsafe under skew
* Hybrid clocks → sweet spot

DDIA gestured at this; practice confirmed it.

## 8. What aged exceptionally well in DDIA

* Monotonic IDs are unavoidable
* Global coordination is a bottleneck
* Snowflake ≠ causality
* Snapshot semantics demand causality awareness
* Physical time *can* help if uncertainty is modeled

Those are all still accepted truths.

## 9. What aged less well (or softened)

* The sense that clock sync is "rare and exotic"
* The implied centrality of TrueTime-style designs
* The assumption that many systems *should* aim for Spanner-level semantics

Reality said: clock sync got cheaper, but complexity tolerance stayed low, and most systems prefer partial guarantees.

## 10. The distilled wisdom, updated to 2026

> **Global snapshots require a notion of causality-aware time. You can approximate it with counters, infer it with logical clocks, or anchor it to physical time—but any scalable system must carefully trade coordination cost against semantic strength. Clock synchronization is no longer magical, but correctness still lives in how uncertainty is modeled, not how accurate the clock claims to be.**
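The HLC update rules sketched above (physical time plus a logical counter that repairs causality under skew) can be written down concretely. This is a minimal sketch in the style of Kulkarni et al.'s HLC paper, not the implementation of any particular database; the class and method names are mine:

```python
import time

class HybridLogicalClock:
    """Minimal hybrid logical clock sketch.

    A timestamp is a (physical, logical) pair: `physical` tracks the wall
    clock in milliseconds, `logical` breaks ties and preserves ordering
    when the wall clock stalls or is behind a remote node's clock.
    """

    def __init__(self, wall_clock=time.time):
        self.wall_clock = wall_clock  # injectable for testing
        self.physical = 0
        self.logical = 0

    def now(self):
        """Timestamp a local event or an outgoing message."""
        wall = int(self.wall_clock() * 1000)
        if wall > self.physical:
            self.physical, self.logical = wall, 0
        else:
            # Wall clock hasn't advanced past our last timestamp:
            # keep moving forward with the logical counter instead.
            self.logical += 1
        return (self.physical, self.logical)

    def update(self, remote):
        """Merge the timestamp carried by an incoming message."""
        wall = int(self.wall_clock() * 1000)
        r_phys, r_log = remote
        if wall > self.physical and wall > r_phys:
            self.physical, self.logical = wall, 0
        elif r_phys > self.physical:
            # Remote clock is ahead: adopt it so happens-before is kept.
            self.physical, self.logical = r_phys, r_log + 1
        elif self.physical > r_phys:
            self.logical += 1
        else:  # equal physical components
            self.logical = max(self.logical, r_log) + 1
        return (self.physical, self.logical)
```

Because timestamps are tuples, ordinary lexicographic comparison gives an order consistent with causality even when one node's wall clock lags another's, which is exactly the property Snowflake IDs alone cannot provide.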
Process Pauses #
- a single-leader system needs the leader to know that it’s still the leader; there are several mechanisms to handle this
lease ownership
similar to a timeout because leases have lifecycles and hence can expire
so, there’s provisioning for the extension of a lease by the leader (before the expiry)
if leader fails then the lease renewing won’t happen and someone else takes over the lease
- some existing sources of error:
if the lease request handling loop compares the local time against an expiry time set by another clock (e.g. the lock server’s), there may be drift between the two clocks
if the program execution is paused by scheduler, then this might be a source of error
- runtime pauses of a program are very possible, examples:
runtimes like the JVM having “stop-the-world” GC pauses that can sometimes last for minutes
VMs being suspended (pausing execution of all procs) while their memory is saved to disk, e.g. for live migration
user-triggered pauses
OS context-switching to another thread / hypervisor switching to different VMs
for VMs, this is called steal time
sync IO blocking within the program
OS swapping to disk (paging) as a memory management primitive
worst-case, even leading to thrashing (PROTIP: paging is often disabled on server machines).
SIGSTOP signals being sent to the process for any reason
e.g. accidentally being sent by some administrator
this problem is similar to the problem of making the program thread-safe on a single machine
thread-safe tools (mutexes, semaphores, atomic counters, lock-free data structures) work well because single-machine multi-threaded code uses shared memory
distributed systems that typically are shared-nothing system don’t benefit from such tools
a node in a distributed system must assume that its execution may be paused for a significant length of time at any point (even in the middle of a function) while the rest of the world keeps moving
when that node resumes, it may not even notice that it was asleep until it checks its clock sometime later
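The lease-renewal hazard above can be made concrete. This is a hypothetical sketch (the `Lease` class and `handle_request` helper are mine, loosely following the request loop DDIA discusses), with the two failure modes from the text marked in comments:

```python
import time

LEASE_DURATION = 10.0  # seconds; arbitrary value for illustration

class Lease:
    """Toy lease as granted by a (simulated) lock service."""
    def __init__(self, expiry):
        self.expiry = expiry

    def is_valid(self):
        return time.monotonic() < self.expiry

    def renew(self):
        return Lease(time.monotonic() + LEASE_DURATION)

def handle_request(lease, process):
    # HAZARD 1 (clock drift): if `expiry` originally came from the lock
    # server's clock but we compare it against our local clock, drift
    # between the two silently shifts the safety margin.
    if lease.expiry - time.monotonic() < LEASE_DURATION / 2:
        lease = lease.renew()
    # HAZARD 2 (process pause): the process may be paused RIGHT HERE --
    # GC, scheduler preemption, swapping, SIGSTOP -- after the validity
    # check but before the work. By the time it resumes, the lease may
    # have expired and another node may hold it, yet we proceed anyway.
    if lease.is_valid():
        process()
    return lease
```

The point is that no amount of checking *before* the work closes the window; the check and the work are not atomic from the rest of the world’s perspective, which is what motivates fencing tokens later in the chapter.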
response time guarantees #
some of the causes of unbounded-time pauses can be eliminated
hard real-time systems: in systems like aircraft control computers, there are specified deadlines by which the software must respond
if deadline is not met, it may cause the entire system to fail
realtime guarantees need support from all levels of the software stack – that’s what real-time operating systems are all about
they allow guaranteed allocation of CPU time in specified intervals if necessary
functions must document worst-case execution time
dynamic memory allocation may be restricted or disallowed entirely
there are real-time GCs
needs to be well-tested and measured
meaning of real-time
case 1: real-time means a system is designed & tested to meet specified timing guarantees in all circumstances
case 2: web real-time: servers pushing data to clients and stream processing without hard response-time constraints
in the case of embedded systems it’s more like case 1 and in web-contexts it’s more like case 2
limiting the impact of garbage collection
idea is to treat GC pauses like brief planned outages of a node, and to let other nodes handle requests from clients while one node is collecting its garbage.
so when it’s near to the scheduled GC pause, the application can make that node inactive then become active again
so it’s hiding the GC pauses from clients and reduces the high percentiles of response time \(\implies\) used in latency-sensitive financial trading systems
another variant to the idea: use GC only for short-lived objects (fast to collect), and just do process restarts periodically for the long-lived objects (before they OOM)
something like a rolling upgrade
Knowledge, Truth, and Lies #
- there’s a need to define the system model, with the assumptions that we have about our distributed system. Algorithms can then be proven correct with respect to that model’s guarantees \(\implies\) that’s how to get reliable behaviour even if the underlying system doesn’t offer many guarantees
- so far, we looked at the ways that distributed systems differed from single-computer run programs:
- no shared memory
- only message passing over unreliable networks with variable delays
- systems may suffer from partial failures, unreliable clocks and processing pauses
- these beg philosophical questions
- how to know what’s true/false in the system, especially if the system is unreliable?
- should software systems obey the physical world laws like cause and effect?
The Truth Is Defined by the Majority #
a node cannot necessarily trust its own judgment of a situation. A distributed system cannot exclusively rely on a single node, because a node may fail at any time, potentially leaving the system stuck and unable to recover.
\(\implies\) this is why distributed systems rely on quorums
3 example scenarios of asymmetric faults within networks that quorums address:
particular node receives but can’t send messages and the rest declare it dead even though it’s not
particular node semi-disconnected, so it can tell that its messages aren’t being ack-ed (so there’s some kind of network fault) but it can’t do anything about it
particular node with a long stop-the-world GC pause (it doesn’t know exactly how much time has elapsed)
the node’s threads are pre-empted by the GC and paused for a minute, during which the others think the node is dead; then, all of a sudden from their POV, this “dead” node comes back alive
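The quorum rule that resolves all three scenarios is simple majority voting; the node’s own opinion carries no special weight. A minimal sketch (function name is mine):

```python
def is_declared_dead(n_nodes, votes_dead):
    """A node is treated as dead only if a strict majority of the
    cluster says so. Even if the node is alive and responsive from its
    own point of view, a quorum of peers that cannot reach it wins:
    there can be at most one majority, so the decision is unambiguous.
    """
    return votes_dead > n_nodes // 2

# 5-node cluster: 3 peers that can't reach a node outvote it
assert is_declared_dead(5, 3)
assert not is_declared_dead(5, 2)
assert not is_declared_dead(4, 2)  # exactly half is NOT a majority
```

The strict inequality is the whole trick: two disjoint groups can never both exceed half the cluster, so the system can never declare two conflicting "truths" at once.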
Leader and the lock #
typically we need singleton things in some cases:
e.g.:
- single leader (to avoid split-brain)
- only one txn / client shall hold the lock to a particular resource, to have concurrency guarantees
- only one user allowed to register a particular username, because it’s typically a unique user id
these singletons need to be implemented well (e.g. a distributed lock)
e.g. of problem is one of a process pause that makes a process think it still has a write-lease on a lock and therefore writes and corrupts a file-storage
so singleton-related behaviours (like a chosen one) needs to abide by quorum rules
else it may act in ways that the system can’t handle and lead to corruption
Fencing tokens #
guarantee needed: a node that falsely believes it’s the “chosen one” shouldn’t be able to disrupt the rest of the system
Fencing tokens can detect and block a node that is inadvertently acting in error.
an example mechanism is fencing: making access to a resource safe using increasing fencing tokens, which typically is a server-side check
example of this fencing token mechanism
every grant by the lock server for a lock or lease will have an associated fencing token
this token is a number that increases every time a lock is granted
we require that every time a client sends a write request to the storage service, it needs to include the current fencing token it has
the storage server remembers the highest fencing token it has seen, which is how it can reject stale writes even if there’s some process-pause-based fault
fencing token mechanism requires the resource to take an active role in checking tokens by rejecting any writes with an older token than one that has already been processed
\(\implies\) insufficient to just rely on clients to check their lock-status
server-side checks in distributed systems are a good idea because clients may have their own agendas
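The mechanism described above can be sketched as a toy storage service (class and method names are mine; a real lock service that issues the increasing tokens is assumed but not shown):

```python
class FencedStorage:
    """Toy storage service that enforces fencing tokens server-side.

    The lock service hands out a strictly increasing token with each
    lock grant; this service rejects any write carrying a token lower
    than the highest one it has already seen.
    """
    def __init__(self):
        self.max_token_seen = -1
        self.data = {}

    def write(self, key, value, token):
        if token < self.max_token_seen:
            # Stale token: this writer lost the lock (e.g. during a GC
            # pause) and a client with a newer grant has written since.
            raise PermissionError(f"token {token} is stale")
        self.max_token_seen = max(self.max_token_seen, token)
        self.data[key] = value

storage = FencedStorage()
storage.write("file", "A", token=33)  # client 1 writes under its grant
storage.write("file", "B", token=34)  # client 2 got a newer grant
try:
    # client 1 wakes from a long pause, still believing it holds the lock
    storage.write("file", "A again", token=33)
except PermissionError:
    pass  # rejected server-side; client 1's checks alone couldn't prevent this
```

Note that the client-side code is unchanged and can still be arbitrarily delayed; safety comes entirely from the resource checking tokens itself.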
Byzantine Faults #
Byzantine faults \(\implies\) lying: deliberate attempt at subverting the system’s guarantees
Byzantine fault-tolerance:
can operate correctly even if some of the nodes are malfunctioning and not obeying the protocol guarantees / malicious entities are interfering with the network
typically relies on a supermajority (beyond majority)
example environments:
aerospace where random bit flips from radiation \(\implies\) unpredictable system deviations. So the flight systems must tolerate Byzantine faults
multiple participating orgs might involve the attempt to cheat or defraud others
e.g. Bitcoin and blockchains that aren’t centrally regulated
typically HW-level dependent for such fault tolerance
P2P networks need some aspects of this fault tolerance
protocols don’t save us from vulnerabilities, security compromises, malicious attacks
weak forms of lying #
- not full-blown lying, but things like HW issues might essentially turn a message into a lie
- examples:
- network packet corruptions \(\implies\) typically caught by checksums
- internal service that doesn’t have as strict user-input checks as a publicly accessible service
System Model and Reality #
Algo design shouldn’t be heavily dependent on the details of the HW and SW configuration on which they run
instead, they should formalize the kinds of faults that a system expects to happen
\(\implies\) that’s what the system model abstraction is all about
For modelling real systems, the partially synchronous model with crash-recovery faults is generally the most useful for us
timing assumptions related system models that are common:
synchronous model
assumes bounded network delay, bounded process pauses, and bounded clock error.
still will have network delay, they are just bounded delays
not realistic in most practical systems because packet-based networks have unbounded delays and systems have pauses
partially synchronous model
behaves like a synchronous system most of the time, but it sometimes exceeds the bounds for network delay, process pauses, and clock drift
realistic, most of the time networks and processes are well-behaved
asynchronous model
algorithm is not allowed to make any timing assumptions. typically, it doesn’t even have a clock (so it cannot use timeouts).
very restrictive model
node failure related system models for nodes:
crash-stop faults
an algorithm may assume that a node can fail in only one way, namely by crashing.
never comes back
crash-recovery faults
nodes may crash at any moment, and perhaps start responding again after some unknown time.
assumed to have stable storage (i.e. nonvolatile disk storage) so that it may be revived (but in-memory state assumed to be lost)
Byzantine (arbitrary) faults
- nodes may do anything they want including being malicious
how distributed algos cope with this:
correctness of an algo #
any algo’s correctness is described via its properties (invariants), including distributed algos
e.g. fencing tokens, properties:
uniqueness
monotonic sequence
availability: node requesting a fencing token and that doesn’t crash will eventually get a response
safety and liveness properties #
using e.g. of fencing tokens,
safety properties: uniqueness, monotonic sequences \(\implies\) nothing bad happens
- can pinpoint the time when property violations happened
- typically, violations can’t be undone
liveness property: availability – liveness properties often have the word “eventually” in their definition (cf. eventual consistency) \(\implies\) something good eventually happens
- in contrast to safety violations, a liveness violation can’t be pinpointed to a moment in time; the property may still come true later
why care about safety vs liveness properties?
- safety properties need to always hold, in all possible situations of the system model
- liveness properties can have caveats, because they’re about eventuality (so during repair processes, some invariants may be violated until the system is eventually correct)
mapping system models to real world #
system model is still a simplification of reality
there may be situations that violate the guarantees we have
e.g. suppose the stable storage requirement (necessary for remembering data on outages for quorum invariants in distributed systems) fails, then the model becomes hard to reason about.
interesting take: the difference between computer science and software engineering comes from the parts where a real implementation has to account for more than the abstract / theoretical behaviours that algorithms imply
they’d have to account for real situations of abuse and so on (but still not Byzantine solutions) where a human operator might have to clean up the mess
a correct algorithm doesn’t necessarily mean that the implementation on a real system will behave correctly, but it’s a good first step
theoretical analysis can uncover problems in an algorithm that might remain hidden for a long time in a real system, and that only come to bite you when your assumptions (e.g., about timing) are defeated due to unusual circumstances.
Summary #
defining characteristics of distributed systems is that partial failures can occur
that’s why we try to build tolerance of partial failures into software, so that the system as a whole may continue functioning even when some of its constituent parts are broken.
how to tolerate faults:
detect them, which can be hard
if we can’t detect a fault, rely on timeouts to make assumptions about availability
- can’t distinguish node failures from network failures
even if we can detect a fault, it’s hard to tolerate because there’s no global shared memory
the recurring sources of trouble: unreliable networks, failing nodes, and process pauses