
Chapter 8. The Trouble with Distributed Systems

  • a lot more edge cases on how things fail in distributed systems

  • chapter is a thoroughly pessimistic and depressing overview of things that may go wrong in a distributed system.

    however, we ignore Byzantine Faults for now

  • the chapter walks through the main problem areas to understand how these failures and faults arise

    areas of problems:

    • network problems
    • clock and timing issues
  • there’s a need to have a consistent way to reason about our systems here and address these problems

Faults and Partial Failures #

  • the argument is that on a single node, HW problems mostly lead to total failures (the whole machine stops)

  • so flaky software should be seen as a bug: there shouldn’t be a notion of “partial faults” in single-node software

    this is because it’s also an idealized system model – things rarely go wrong

  • networked, distributed systems don’t operate in an idealized system model – that’s why a host of problems may arise

    in this case, partial failures may exist:

    • parts of the system that are broken in unpredictable ways while other parts are working fine
    • these are non-deterministic failures

Cloud Computing and Supercomputing #

  • cloud vs supercomputing: two ends of the spectrum of scaling philosophies
    • cloud: often associated with multi-tenant data centers
  • examples of factors on which they have different philosophies
    1. on fault-handling
      • supercomputers
        • state of computation is checkpointed into durable storage so that failure can resume from last checkpoint

          pattern: partial failure allowed to escalate into total failure – any part of the system fails, then we crash and resume (e.g. do a cluster repair, which is alright because high-availability isn’t too big of a problem)

          similar to a kernel panic on a single-node machine – with this escalation strategy, the supercomputer behaves more like a single-node computer than a distributed system

      • cloud systems
        • online, low latency so availability needs to be high

          can’t just do a cluster repair

    2. use of shared memory
      • supercomputers
        • nodes communicate through shared memory and RDMA (remote direct memory access)

          RDMA is very fascinating, look it up!

    3. network topologies
      • supercomputers:

        special topologies e.g. multi-dimensional meshes and toruses

        better performance for HPC workloads with known communication patterns

      • cloud datacenters:

        commodity Ethernet/IP, typically in Clos topologies for high bisection bandwidth
    4. tolerance for failed nodes
      • cloud environments:

        can do rolling upgrades, can kill and spinup specific VMs and so on – can tolerate some failed nodes, the overall system won’t die

    5. geographical separation of nodes
      • cloud:

        typically geographically distributed, comms happens over internet, can be slow and unreliable

      • supercomputers:

        local networks typically, nodes expected to be close together.

  • so for distributed systems, the name of the game is to build a reliable system from unreliable components
    • a system can be made more reliable than its underlying parts, but only up to a limit
  • fault-handling should be part of the software design and need to know what behaviour to expect from the software in case of a fault

Unreliable Networks #

  • reasons why shared-nothing approach to distributed nodes works well:
    1. cheaper because no special HW
    2. can use commoditized cloud computing services
    3. can have high reliability by having geographical redundancies
  • internet and ethernet-based networks are async packet networks
    • any two nodes can comms with each other
    • no guarantees on whether a packet arrives, or when
    • many failure reasons, hard to distinguish between:
      1. lost request
      2. remote node is down
      3. lost response
    • hard to tell if delivery even happened
  • typically, issue of unreliable networks handled via timeouts
    • still don’t know if the remote node got the initial packet or not

Network Faults in Practice #

  • human error is a major cause of outages

    so just adding network redundancy is not as useful as initially expected

  • a network link that works in one direction cannot necessarily be assumed to work in the other direction

    e.g. a network interface that for some reason drops all inbound packets while sending all outbound packets successfully

  • network faults (this book uses this generic term, includes network partitions and so on)

  • we can’t put software in unanticipated situations because then it may do arbitrary, unexpected things

    some examples from network faults:

    • deadlocked cluster – permanently unable to serve requests, even on recovery of the network

    • deletion of data

  • meaning of handling network faults:

    • doesn’t necessarily mean tolerating them

    • more of, being able to know how the software reacts to the network problems and ensuring that the system can recover from the problems

    • can be tested using chaos injection (e.g. Chaos Monkey)

Detecting Faults #

  • ability to detect faults is important

    e.g. for taking a node out of rotation by the load balancer, or by detecting leader-failure and then doing a leader-promotion

  • some cases where it’s easy to get feedback on faults

    • if a process crashed but the machine’s OS is still running, the OS closes the process’s TCP connections with a RST or FIN packet, signalling that the connection is gone

    • a crashing process can run a script that notifies other nodes about the crash, so another node can take over that node’s work without waiting for a timeout to expire

    • can query network switch interfaces in the data center if we have access within a datacenter

    • routers may send ICMP destination unreachable packets but the router also doesn’t have magical failure detection capabilities

  • typical rule of thumb:

    a success requires a positive response; when something goes wrong you may get an error at some level of the stack, but in general you must assume you’ll get no response at all

  • to conclude dead nodes:

    You can retry a few times (TCP retries transparently, but you may also retry at the application level), wait for a timeout to elapse, and eventually declare the node dead if you don’t hear back within the timeout.
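A minimal sketch of this retry-then-declare-dead logic (names like `send_probe` are hypothetical, not from the book):

```python
def probe_with_retries(send_probe, retries=3, timeout=1.0):
    """Declare a node dead only after several timed-out probes.

    `send_probe(timeout)` is a hypothetical callable returning True if the
    node replied within `timeout` seconds. Note a timeout never says *why*
    there was no reply: the request may have been lost, the node may be
    down, or the response may have been lost on the way back.
    """
    for _ in range(retries):
        if send_probe(timeout):
            return "alive"
    return "presumed dead"  # a guess, not a certainty
```

Even after the retries are exhausted, "presumed dead" is a judgement call, which is exactly why premature declarations (next section) cause trouble.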

Timeouts and Unbounded Delays #

  • non-trivial to decide how long a timeout should be

    • too long and it takes too much time to correct
    • too short and it corrects prematurely
  • prematurely declaring dead nodes can give problems

    • duplicate action problems

      e.g. the original node is not actually dead, so when another node takes over and retries its work, the action ends up performed twice

    • cascading failures from overloading the node further because of handover bookkeeping

      if it was slow to respond because it’s overloaded and then it needs to xfer load to peers, then its workload increases further

      might end up in a situation where all nodes declare each other dead and nothing works

  • async networks have unbounded delays, so we can’t guarantee a bounded response time (e.g. 2d + r, where d is the maximum one-way delivery time and r is the request-handling time)

    unbounded, therefore:

    • delivery is always attempting to do things ASAP – so we can identify some lowerbound (fastest time)
    • but there’s no upper limit on the time

    not good enough for the system to just be FAST for most of the time, else just a transient spike in RTTs will throw the system off-balance

Network congestion and queueing #

  • analogous to vehicular traffic, variability of packet delays is typically due to queueing

    • [queuing @ network link] network congestion:

      • busy network link, a packet might have to wait until it’s slotted in
      • if too busy and the queue is full, the packet might be dropped
    • [queueing can be at the destination machine] :

      • e.g. CPU cores are busy, so OS just queues the network request
    • [queueing @ the sender-side]: flow control aka congestion avoidance / backpressure:

      • self-limits on its own rate of sending to avoid overloading the network link / receiving node

      • so more queuing @ the sender before the data even enters the network

  • queueing delays have a wide range of variability when a system is close to its maximum capacity

    • for public clouds, most of the infra is shared between customers

      so batch workloads and such can easily saturate network links, network delays can come from noisy neighbours

    • in such situations we have to empirically determine the best timeout value

    • can also be a dynamic system that constantly adapts to jitter

      \(\implies\) a Phi Accrual failure detector would work well for such systems
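A simplified sketch of the phi-accrual idea (the real detector fits a normal distribution to observed inter-arrival times; this sketch assumes an exponential model to stay short):

```python
import math

class PhiAccrualDetector:
    """Instead of a fixed timeout, suspicion (phi) grows continuously with
    the time since the last heartbeat, scaled by the observed inter-arrival
    times -- so the detector adapts to current network conditions."""

    def __init__(self):
        self.intervals = []
        self.last_heartbeat = None

    def heartbeat(self, now):
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
        self.last_heartbeat = now

    def phi(self, now):
        if not self.intervals:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)
        # Probability of seeing no heartbeat for this long under the model:
        p_later = math.exp(-(now - self.last_heartbeat) / mean)
        return -math.log10(p_later)

d = PhiAccrualDetector()
for t in (0.0, 1.0, 2.0):
    d.heartbeat(t)
# phi is low just after a heartbeat and grows without bound afterwards;
# the caller picks a suspicion threshold instead of a timeout.
```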

Synchronous Versus Asynchronous Networks #

  • summary:

    1. async networks allow us to dynamically manage bandwidth requirements

      in return they suffer from queueing and therefore unbounded delays \(\implies\) variable delays are a consequence of dynamic resource partitioning

      this should be seen as a cost-benefit tradeoff

      however, queueing ensures max resource utilisation (of the link layer)

    2. sync networks are guaranteed their bandwidth allocation so the delay is bounded and there’s no need for queuing (can just be pipelined)

    3. latency guarantees typically come at the expense of resource utilisation because resources end up being statically partitioned

      alternatively, multi-tenancy with dynamic resource partitioning provides better utilization, so it is cheaper, but it has the downside of variable delays.

    4. mechanisms to consider so that packet-systems can emulate circuit-like behaviour

      1. considering QoS: prioritization, scheduled packets

      2. considering admission control: (e.g. rate-limiting the senders)
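The sender-side admission control above is often implemented as a token bucket; a minimal sketch (parameters are illustrative):

```python
import time

class TokenBucket:
    """Minimal token-bucket rate limiter: a common way to rate-limit
    senders so they don't overload the network link or receiving node."""

    def __init__(self, rate, capacity):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self, cost=1.0):
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A request that cannot get a token waits (queues at the sender) or is rejected, which is exactly the "queueing before the data even enters the network" described above.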

  • comparing datacenter networks (asynchronous) to traditional fixed-line telephone networks (synchronous)

    • lifecycle of a call: a circuit is established (fixed guaranteed bandwidth allocated for the call, along the entire route b/w the two communicating nodes)

    • since it’s a synchronous network, the bandwidth allocation is guaranteed \(\implies\) e2e latency is fixed \(\implies\) bounded delay

  • async networks (Ethernet, IP) are packet-switched protocols, so they suffer from queueing and unbounded delays in the network

    no circuit is ever established

  • benefits of packet switching:

    1. optimised for bursty traffic

      variable bandwidth requirements that the network can adapt to dynamically

  • BGP (Border Gateway Protocol) and peering agreements make the inter-ISP level of the internet resemble circuit switching more than IP itself does

    consider the statement:

    Peering agreements between internet service providers and the establishment of routes through the Border Gateway Protocol (BGP), bear closer resemblance to circuit switching than IP itself. At this level, it is possible to buy dedicated bandwidth.

    this is a very interesting characterisation:

      Here’s a focused primer on BGP that connects to your quoted statement and helps you “appreciate the wisdom” in it.
    
  Direct answer to the core idea

  - The Border Gateway Protocol (BGP) is the system the Internet uses to exchange reachability information between large networks (autonomous systems, ASes). It is not the IP protocol itself, but the routing policy and path selection mechanism that tells networks how to forward traffic toward destination networks across many other networks. In that sense, BGP shapes which networks are used as transit paths and which routes are preferred, rather than simply delivering packets between hosts. [en.wikipedia](https://en.wikipedia.org/wiki/BGP)

  Key concepts you’ll want to understand

  - What BGP is and what it isn’t
    - BGP is an inter-domain routing protocol used between ASes to advertise IP prefixes and to convey routing policies. It relies on TCP for its transport, which makes it reliable and session-oriented, unlike many interior routing protocols that run directly over UDP or IP. [techtarget](https://www.techtarget.com/searchnetworking/definition/BGP-Border-Gateway-Protocol)
    - It is not a method for IP packet switching or a replacement for the IP addressing system. IP routes are the actual forwarding entries; BGP is the control plane that distributes those routes and the policies that govern which routes are chosen. [en.wikipedia](https://en.wikipedia.org/wiki/BGP)

  - Peering, transit, and control of routes
    - Internet Service Providers (ISPs) peer with each other and with large networks to exchange routes. Through BGP, each ISP (or IXP) can advertise which IP prefixes it can reach and which prefixes it prefers to carry, enabling large-scale path selection across multiple networks. [fortinet](https://www.fortinet.com/resources/cyberglossary/bgp-border-gateway-protocol)
    - The statement you found about “buying dedicated bandwidth” corresponds to how some exchanges and providers offer private peering or dedicated transit arrangements, where an operator pays for a more controlled or higher-quality path. BGP still governs reachability and policy in these arrangements, but the actual bandwidth and SLAs come from the interconnection terms rather than BGP alone. [fortinet](https://www.fortinet.com/resources/cyberglossary/bgp-border-gateway-protocol)

  - Path selection and policies
    - BGP uses attributes like AS path, next hop, local preference, MED (multi-exit discriminator), and route origin to determine the “best” path to a destination prefix. Network operators tune these attributes to influence routing decisions, apply traffic engineering, and implement business policies (for example, preferring one upstream provider over another for outbound traffic). [techtarget](https://www.techtarget.com/searchnetworking/definition/BGP-Border-Gateway-Protocol)
    - This policy-driven selection is what makes BGP resemble a circuit-switched mindset in practice for large-scale interconnection: operators often shape which physical or logical paths are used, not just which IP addresses are reachable. However, unlike classic circuit switching, BGP does not reserve fixed bandwidth in a circuit by default; bandwidth control happens via interconnection arrangements, QoS on paths, and traffic engineering outside pure BGP route announcements. [cloudflare](https://www.cloudflare.com/learning/security/glossary/what-is-bgp/)

  - BGP basics you’ll likely encounter in practice
    - External BGP (EBGP) runs between different ASes; internal BGP (IBGP) runs within the same AS to distribute routing information. IBGP generally carries the same prefixes but relies on careful design to avoid routing loops and ensure consistent views across an AS. [networklessons](https://networklessons.com/bgp/introduction-to-bgp)
    - BGP sessions are established over TCP, with a defined finite-state machine guiding session setup, maintenance, and teardown. Misconfigurations (e.g., mismatched ASN, version, or route filtering) can cause sessions to fail or routes to be dropped, underscoring the importance of careful implementation and monitoring. [en.wikipedia](https://en.wikipedia.org/wiki/BGP)

  - Security and risk considerations
    - BGP has well-known security challenges: misconfigurations can result in traffic rerouting (traffic hijacks), and practical authentication and route validation mechanisms (e.g., using RPKI to validate route origins) are increasingly used to mitigate these risks. Security is largely about policy and verification at the edge of interconnections rather than about the core protocol itself. [cloudflare](https://www.cloudflare.com/learning/security/glossary/what-is-bgp/)

  How to map the quoted statement to practical understanding

  - “Peering agreements between ISPs and the establishment of routes through BGP” aligns with how ISPs establish mutual reachability and policy-based routing decisions via BGP advertisements and acceptance rules. The agreements define what traffic can flow where and under what conditions, while BGP disseminates the actual route information and policies between peering networks. [fortinet](https://www.fortinet.com/resources/cyberglossary/bgp-border-gateway-protocol)
  - “Bear closer resemblance to circuit switching than IP itself” can be interpreted as: at the inter-domain scale, operators actively select specific paths and providers to meet performance, reliability, and business requirements. BGP enables this policy-driven path selection across multiple networks, which in practice can feel like engineering a dedicated path, even though the underlying IP forwarding remains packet-based and dynamic. [cloudflare](https://www.cloudflare.com/learning/security/glossary/what-is-bgp/)
  - “At this level, it is possible to buy dedicated bandwidth” reflects real-world interconnection options (private peering, dedicated transit, service-level agreements) that complement BGP by guaranteeing performance and SLA-backed paths. BGP will still carry the route advertisements for those paths, but the guaranteed bandwidth and latency come from the interconnection terms, not from BGP alone. [techtarget](https://www.techtarget.com/searchnetworking/definition/BGP-Border-Gateway-Protocol)

  Practical takeaways for someone studying BGP

  - Focus on how BGP advertisements translate into reachability and how policies shape path selection across multiple networks.
  - Understand the distinction between reachability (BGP’s job) and bandwidth/SLA guarantees (interconnection terms, QoS, and traffic engineering outside BGP).
  - Learn the difference between EBGP and IBGP, and why route reflectors or full mesh IBGP configurations matter in large networks.
  - Pay attention to security practices like RPKI and route filtering to reduce risks of misrouted traffic.
    
    
    

    Further elaboration:

    
      Great quote. There’s a *lot* of quiet wisdom packed into it, and it’s one of those statements that only really clicks once you stop thinking of “the Internet” as a uniform IP cloud and instead see the **economic + control plane underneath**.
    
      I’ll give you a **BGP primer oriented toward appreciating that sentence**, not a generic networking 101.
    
      ---
    
      ## 1. The mental model most people have (and why it’s misleading)
    
      Most people internalize this story:
    
   * IP is packet-switched
   * Packets are routed independently
   * The Internet is “best effort”
   * There is no notion of circuits or reservations
    
      This is *true* **at the IP forwarding layer** — routers forward packets hop-by-hop with no per-flow guarantees.
    
      But the quote is pointing at something *above* and *below* that layer.
    
      ---
    
      ## 2. What BGP actually is (and is not)
    
      ### What BGP is
    
   * **BGP is not a routing algorithm in the classic sense**
   * It is a **policy distribution protocol**
   * It tells routers *which prefixes they are allowed or willing to carry traffic for*

  Key properties:

   * Runs **between Autonomous Systems (ASes)**, not within them
   * Exchanges *reachability + policy attributes*, not shortest paths
   * Converges slowly and deliberately (minutes, not milliseconds)
   * Optimizes for **stability and policy**, not performance
    
      In short:
    
      > BGP answers: *“Who is willing to carry traffic for whom, and under what conditions?”*
    
      ---
    
      ## 3. Autonomous Systems = political / economic entities
    
      An **AS** is not just a technical unit:
    
   * It represents an organization (ISP, cloud provider, large enterprise)
   * It has commercial interests
   * It has contracts

  Examples:

   * Tier-1 ISPs
   * Cloud providers (AWS, Google)
   * IXPs
   * Large CDNs
    
      BGP is how these entities **negotiate reachability**.
    
      This is the first hint of why the quote mentions *peering agreements*.
    
      ---
    
      ## 4. Peering agreements: the real substrate
    
      There are two dominant relationship types:
    
      ### 1. Transit
    
   * AS A pays AS B to carry traffic to the rest of the Internet
   * Like buying a highway pass

  ### 2. Peering

   * AS A and AS B agree to exchange traffic directly
   * Often settlement-free, sometimes paid
   * Usually at specific physical locations (IXPs)

  These agreements specify:

   * Which prefixes are advertised
   * Traffic ratios
   * Capacity limits
   * Where interconnection happens
   * Sometimes explicit bandwidth guarantees
    
      None of this is visible in IP.
      All of it is *expressed and enforced via BGP policy*.
    
      ---
    
      ## 5. Why BGP routing is *not* “packet switched” in spirit
    
      At the IP layer:
    
   * Each packet is independent
   * No guarantees
   * Stateless forwarding

  At the **inter-AS layer** (BGP + contracts):

   * Paths are long-lived
   * Chosen deliberately
   * Often manually engineered
   * Often capacity-planned months in advance

  Once a BGP path is selected:

   * *All traffic* to that prefix follows it
   * Until policy changes or failure occurs
    
      This starts to look like:
    
      > “We’ve agreed that traffic between these networks will flow along *this corridor*.”
    
      That’s already drifting away from pure packet switching.
    
      ---
    
      ## 6. Where the “circuit switching” analogy comes from
    
      Classic circuit switching (e.g. telephone networks):
    
   * Paths are established ahead of time
   * Capacity is reserved
   * Endpoints are stable
   * Failover is slow and disruptive
    
      Now compare to BGP:
    
      | Property    | Circuit Switching     | BGP Inter-AS Routing |
      | ----------- | --------------------- | -------------------- |
      | Path setup  | Pre-established       | Pre-negotiated       |
      | Duration    | Long-lived            | Long-lived           |
      | Capacity    | Planned               | Planned              |
      | Control     | Central / contractual | Contractual          |
      | Failover    | Slow                  | Slow                 |
      | Granularity | Circuit               | Prefix               |
    
      **Key insight:**
      BGP doesn’t reserve per-flow bandwidth, but it *does* establish **stable, capacity-planned paths** between economic entities.
    
      Hence: *resembles circuit switching* — not technically, but **operationally and economically**.
    
      ---
    
      ## 7. “At this level, it is possible to buy dedicated bandwidth”
    
      This line is crucial.
    
      You cannot buy “dedicated bandwidth” at the IP packet level.
    
      But you *can*:
    
   * Pay for higher-capacity interconnects
   * Pay for private peering
   * Pay for MPLS-backed transit
   * Pay for guaranteed ports at IXPs
   * Pay for direct cloud interconnects (e.g. AWS Direct Connect)
    
      BGP is how traffic is **steered onto those paid paths**.
    
      Examples:
    
   * Two ISPs with a 100 Gbps private link
   * A CDN with private peering into access networks
   * A bank buying deterministic paths between data centers

  Once that link exists:

   * Traffic is not competing with the general Internet
   * Congestion behavior is predictable
   * Routing changes are rare
    
      That’s “dedicated bandwidth” in practice.
    
      ---
    
      ## 8. The layered illusion of “best effort Internet”
    
      The wisdom of the quote is that:
    
      > The Internet looks packet-switched and best-effort **only if you ignore the control plane and economics**.
    
      Reality:
    
   * The public Internet is an overlay on top of:

     * Contracts
     * Physical cross-connects
     * Planned capacity
     * Stable BGP paths

  So:

   * You get packet switching *within* those corridors
   * But the corridors themselves are **quasi-circuits**
    
      ---
    
      ## 9. Why this matters conceptually
    
      This reframing helps explain:
    
      ### 1. Why performance is so uneven
    
      Some traffic rides:
    
       * Premium private peering
    
      Other traffic rides:
    
       * Congested transit paths
    
      Same IP, radically different reality.
    
      ---
    
      ### 2. Why “the Internet” is not neutral by default
    
      Routing is policy-driven.
      Money shapes paths.
    
      ---
    
      ### 3. Why cloud networking feels “less Internet-y”
    
      Cloud providers recreate:
    
   * Controlled interconnects
   * Capacity planning
   * Predictable paths
    
      They’re leaning *into* the circuit-like nature.
    
      ---
    
      ## 10. The distilled wisdom in one paragraph
    
      **Although IP forwarding is packet-switched and stateless, the global Internet is held together by long-lived, policy-driven, capacity-planned paths negotiated via BGP between economic actors. At that layer, routing behaves less like opportunistic packet switching and more like virtual circuits whose existence, capacity, and stability are governed by contracts rather than algorithms.**
    
      That’s what the quote is quietly pointing at.
    
    

Unreliable Clocks #

  • some uses of clocks

    1. for measuring durations (e.g. timeouts, p99 response time of service…)
    2. for measuring points in time (e.g. cache expiry time, timestamping…)
  • time-related problems

    1. difficult to determine the order of events when multiple machines are involved because of variable network delays

    2. internal clock-drifts because quartz is not perfectly accurate

      synchronising internal clocks across multiple devices

the following are all examples of dangerous clock use in a distributed system

Monotonic Versus Time-of-Day Clocks #

  • computers at least will have both types of clocks

  • Time-of-day clocks:

    1. gives wall-clock time, measures with respect to some date on some calendar (e.g. epoch time) and ignores the existence of leap-seconds

    2. typically synced via NTP

    3. not suitable for measuring elapsed time, because:

      • ignores leap seconds
      • if the local clock has drifted too far, an NTP sync may forcibly reset it, making it appear to jump back (or forward) in time
  • monotonic clocks:

    1. used for measuring time-intervals e.g. timeouts and such because they hold the invariant that they always move forward

      resolution is typically microseconds or better

    2. the absolute value of the clock is meaningless

      • shouldn’t be used to compare the absolute value of monotonic clocks across different computers
    3. monotonicity guarantee with caveats:

      • each CPU socket may have a separate timer, not necessarily synchronized with other sockets, so there can be drift between CPUs

      • typically CPU will try to manage the discrepancy but this may be a source of error

      • NTP (slewing): may adjust the frequency of monotonic clocks if it deviates beyond a threshold from the NTP server

        NOTE: slewing adjusts the clock’s rate, not its value – NTP will not cause the monotonic clock to jump

    4. don’t need synchronisation
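A small illustration of the two clock types in Python (`time.time()` vs `time.monotonic()`):

```python
import time

# Wall-clock time can jump (NTP step, manual reset), so never measure
# elapsed time with it; use the monotonic clock, whose absolute value is
# meaningless but which is guaranteed never to go backwards.
start = time.monotonic()
time.sleep(0.01)
elapsed = time.monotonic() - start  # always >= 0, immune to clock steps

timestamp = time.time()  # wall-clock: fine for points in time, not intervals
```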

Clock Synchronization and Accuracy #

  • time-of-day clocks need to be synced to an NTP server

    it’s not too reliable or accurate of a system, some sources of fickleness:

    1. clock drift

      • varies based on the machine temperature

      • drift limits the best possible accuracy that we can achieve, even if everything is working correctly

    2. too much difference from NTP server

      case A: refuse to sync

      case B: local clock forcibly reset

    3. if node firewalled from NTP servers, the misconfig can take some time to detect

    4. NTP sync suffers from network delay as well so accuracy is limited to level of congestion of the network

    5. NTP clients are robust (i.e. sample from multiple servers) but NTP servers may be misconfigured and so on

    6. Leap-seconds mess things up

      a minute may end up being 59 or 61 seconds long

      example of a solution: smearing is the adjustment of leap seconds gradually over the course of the day

    7. VMs with virtualised HW clocks have their own complications

      CPU sharing means a VM can be paused while another VM runs; from the application’s point of view, the virtualised HW clock appears to suddenly jump forward

    8. on other people’s devices, assuming the clock is accurate is not safe, because users may deliberately set an incorrect time (e.g. to get around time limits)
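The leap-second smearing mentioned in point 6 can be sketched as a linear offset function (a simplified illustration; real smears vary in window and shape, e.g. the 24-hour linear smear Google has described):

```python
# Instead of inserting a 61st second, each second in the smear window is
# stretched slightly, so clocks stay monotonic and no minute has 61 seconds.
SMEAR_WINDOW = 86_400  # spread one extra second over 24 hours

def smeared_offset(seconds_into_window):
    """Fraction of the leap second already applied (0.0 to 1.0)."""
    return min(seconds_into_window / SMEAR_WINDOW, 1.0)

# Halfway through the window, smeared clocks trail "true" UTC by 0.5 s:
assert smeared_offset(43_200) == 0.5
```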

  • it’s possible to have better sync guarantees and accuracy

    • e.g. use GPS receivers, use the PTP (precision time protocol) and so on

Relying on Synchronized Clocks #

  • like networks themselves, clocks work most of the time, but the software must be robust enough to deal with incorrect clocks

  • incorrect clocks are hard to notice \(\implies\) typically silent failures

    that’s why clocks (and clock offsets between all the machines) should be carefully monitored

Timestamping for ordering events #

  • it’s not a good idea to do global ordering using local timestamping (time-of-day clocks), especially for a db with multi-leader replication

  • LWW conflict resolution (Last Write Wins) has some fundamental problems:

    1. can end up having data losses because of time-based write-skews

      a node with a lagging clock will be unable to overwrite values previously written by a node with a fast clock until the clock skew b/w the nodes has elapsed

    2. LWW can’t distinguish b/w the following:

      1. writes that occurred sequentially in quick succession i.e. one after the other in short period

      2. writes that were truly concurrent (neither writer aware of the other writer)

      solution: using additional causality tracking mechanisms e.g. version vectors to prevent violations of causality

    3. possible for 2 nodes to independently gen writes with same timestamp \(\implies\) might need a tie-breaker value

      might also violate other causality rules

  • we need to use logical clocks e.g. incrementing counters instead of oscillating quartz crystals to do event-ordering

    logical because it’s relative to the meaning of the ordering; different from physical clocks (time-of-day clocks, monotonic clocks)
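  To make the logical-clock idea concrete, here is a minimal Lamport-style counter sketch (my own illustration, not from the book): ordering comes from a counter that only moves forward, independent of wall clocks.

```python
class LamportClock:
    """Minimal logical clock: an incrementing counter, not wall time."""
    def __init__(self):
        self.time = 0

    def tick(self):
        # local event: advance the counter
        self.time += 1
        return self.time

    def receive(self, remote_time):
        # on message receipt, jump past the sender's timestamp
        self.time = max(self.time, remote_time) + 1
        return self.time

a, b = LamportClock(), LamportClock()
t1 = a.tick()       # a does a write at logical time 1
t2 = b.receive(t1)  # b hears about it -> logical time 2
assert t2 > t1      # causal order preserved regardless of wall-clock skew
```

  Note that this gives causal ordering only: timestamps say nothing about real elapsed time.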

Clock readings have a confidence interval #

  • even with high resolution (e.g. ns resolution), fine-grained measurements hinge on accuracy

    typically, NTP-synced machines will see drifts of several ms because of the quartz clock

  • a better mental model is to think of a clock reading not as a point in time but as a range of times within a confidence interval: for example, a system may be 95% confident that the time now is between 10.3 and 10.5 seconds past the minute

  • uncertainty bound is based on time source

    • more accurate HW like GPS receiver or atomic (cs) clock will have their expected error range indicated also

    • for clients, it’s the uncertainty of drift since the last sync with the time server, plus the NTP server’s own uncertainty, plus the network RTT

      but most systems don’t expose such uncertainties

      some do (e.g. Google TrueTime)
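  As a back-of-envelope sketch (my own, with made-up numbers), the client-side uncertainty budget described above can be computed like this:

```python
def clock_uncertainty(drift_ppm, secs_since_sync, server_err_s, rtt_s):
    """Rough client-side clock uncertainty in seconds (illustrative only)."""
    drift = drift_ppm * 1e-6 * secs_since_sync  # quartz drift since last NTP sync
    return drift + server_err_s + rtt_s / 2     # plus server error and half the RTT

# e.g. 200 ppm quartz drift, 60 s since last sync, 1 ms server error, 30 ms RTT
u = clock_uncertainty(200, 60, 0.001, 0.030)
print(round(u * 1000, 1))  # → 28.0 (milliseconds)
```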

Synchronized Clocks for Global Snapshots #

  • throwback to snapshot isolation for dbs that need to support small, fast, RW transactions and large long-running RO txns

  • typically, this needs a monotonically increasing txn ID

    • single-node dbs: good enough to just have a counter

    • distributed: the monotonically increasing txn id needs coordination \(\implies\) this might actually be a source of bottlenecks if many, small, rapid txns

      txn id must reflect causality for the snapshotting to be consistent

      NOTE: things like snowflake IDs somewhat help but still can’t guarantee an ordering that is consistent with causality

  • there’s some possible value to doing clock sync for distributed txn semantics

    possible to handle the wall-clock syncs and use the time-of-day clocks for this kind of txn causality syncs

    e.g. Google’s Spanner uses lack of range overlaps in reported time confidence intervals to determine causality

    also, google’s datacenters are all GPS/atom-clock synced
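  The Spanner trick can be sketched as follows (hypothetical `Interval` type, my own illustration): two events can be causally ordered only when their confidence intervals don't overlap.

```python
from collections import namedtuple

# a TrueTime-style reading: the true time lies somewhere in [earliest, latest]
Interval = namedtuple("Interval", ["earliest", "latest"])

def definitely_before(a, b):
    # a happened before b only if a's interval ends before b's begins
    return a.latest < b.earliest

x = Interval(10.30, 10.35)
y = Interval(10.40, 10.45)
z = Interval(10.33, 10.38)
assert definitely_before(x, y)      # no overlap -> causal order is certain
assert not definitely_before(x, z)  # overlap -> order is ambiguous
```

  Spanner's commit-wait deliberately delays commits until intervals stop overlapping, which is why tighter clock sync directly lowers latency.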

  • changes since the book was published:

    • people use HLCs (hybrid logical clocks) now, which depend on physical time and fall back to logical counters when causality repair is needed

      this is an 80/20 solution

    • the modern consensus treats synchronized clocks as one tool among several:

      Synchronized clocks are neither necessary nor sufficient for distributed transaction semantics—but when combined with logical ordering, they are extremely powerful.

    • summary:

      Global snapshots require a notion of causality-aware time. You can approximate it with counters, infer it with logical clocks, or anchor it to physical time—but any scalable system must carefully trade coordination cost against semantic strength. Clock synchronization is no longer magical, but correctness still lives in how uncertainty is modeled, not how accurate the clock claims to be.

    • other post-publication shifts worth noting:

      • a global counter is dead at scale, and a central timestamp oracle bottlenecks unless carefully engineered \(\implies\) the industry converged on wanting “enough order”, not global total order

      • snowflake-style IDs encode wall-clock time and are cheap, scalable, and unique, but they don’t capture happens-before relationships – an ID scheme, not a consistency mechanism

      • clock sync went from “Google-only exotic” to routine infrastructure: better PTP (IEEE 1588), HW/NIC-assisted timestamping, and cloud providers offering time guarantees within regions

      • TrueTime-style interval reasoning was conceptually validated but remains niche: it needs HW support and tight operational discipline, and commit-wait adds latency

      • HLCs (physical time + a logical counter for causality repair) became the mainstream answer – used by CockroachDB, YugabyteDB, and TiDB, among others – because they preserve causality, approximate real time, avoid global coordination, and degrade gracefully under skew

      • many systems relax guarantees rather than chase global snapshots: causal consistency, bounded staleness, per-partition snapshots; Spanner-like external consistency is chosen only when cross-region correctness justifies the latency and cost

Process Pauses #

  • a single-leader system needs the leader to know that it’s still the leader; there are several mechanisms to handle this
    • lease ownership

      similar to a timeout because leases have lifecycles and hence can expire

      so, there’s provisioning for the extension of a lease by the leader (before the expiry)

      if leader fails then the lease renewing won’t happen and someone else takes over the lease

      • some existing sources of error:
        • if the lease request handling loop compares local time against another clock’s time, there may be drift between them

        • if the program execution is paused by scheduler, then this might be a source of error

          • the runtime program pauses are very possible, example:
            1. runtimes like JVM having “stop-the-world” GC pauses that last minutes

            2. VMs getting suspended and pausing execution of all procs for doing memory dumping to the actual HW

            3. user-triggered pauses

            4. OS context-switching to another thread / hypervisor switching to different VMs

              for VMs, this is called steal time

            5. sync IO blocking within the program

            6. OS swapping to disk (paging) as a memory management primitive

              worst-case, even leading to page-thrashing (PROTIP: paging should be disabled on server machines).

            7. SIGSTOP signals being sent to the process for any reason

              e.g. accidentally being sent by some administrator

        • this problem is similar to the problem of making the program thread-safe on a single machine

          thread-safety tools (mutexes, semaphores, atomic counters, lock-free data structures) work well because single-machine multi-threaded code uses shared memory

          distributed systems, which are typically shared-nothing, don’t benefit from such tools

        • node in distributed system must assume that its execution may be paused for a significant length of time at any point (even in the middle of a function) while the rest of the world continues moving

          when that node resumes, it may not even notice that it was asleep while the rest of the world moved on (e.g. that the other nodes have declared it dead)
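  The classic failure mode here can be made runnable (my own sketch; a `time.sleep` stands in for a stop-the-world pause, and lease durations are shrunk to fractions of a second):

```python
import time

class Lease:
    """Toy lease: valid until a deadline on the local monotonic clock."""
    def __init__(self, duration_s):
        self.duration_s = duration_s
        self.expiry = time.monotonic() + duration_s

    def renew(self):
        self.expiry = time.monotonic() + self.duration_s

    def is_valid(self):
        return time.monotonic() < self.expiry

def handle_request(lease, pause_s):
    if lease.expiry - time.monotonic() < 5.0:  # renew when close to expiry
        lease.renew()
    if lease.is_valid():
        time.sleep(pause_s)      # a GC pause / VM suspend strikes right here
        return lease.is_valid()  # re-check: False means we acted on a stale check
    return False

# a 0.2 s lease with a 0.5 s pause: the validity check passes, then goes stale
print(handle_request(Lease(0.2), pause_s=0.5))  # → False
```

  The point is that the `is_valid()` check and the action it guards are not atomic: an arbitrary pause can fall between them, which is exactly what fencing tokens (below in the book) are designed to catch server-side.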

response time guarantees #

  • the unbounded-time pauses have reasons that we might be able to eliminate

  • hard real-time systems: in systems like aircraft control computers, there are specified deadlines by which the software must respond

    if deadline is not met, it may cause the entire system to fail

  • realtime guarantees need support from all levels of the software stack – that’s what real-time operating systems are all about

    • they allow guaranteed allocation of CPU time in specified intervals if necessary

    • functions must document worst-case execution time

    • dynamic memory allocation may be restricted or disallowed entirely

      there are real-time GCs

    • needs to be well-tested and measured

  • meaning of real-time

    case 1: real-time means a system is designed & tested to meet specified timing guarantees in all circumstances

    case 2: web real-time: servers pushing data to clients and stream processing without hard response-time constraints

    in the case of embedded systems it’s more like case 1 and in web-contexts it’s more like case 2

  • limiting the impact of garbage collection

    • idea is to treat GC pauses like brief planned outages of a node, and to let other nodes handle requests from clients while one node is collecting its garbage.

      • so when a node is close to needing a GC pause, the application can stop routing requests to it, let it collect its garbage, then bring it back into rotation

        so it’s hiding the GC pauses from clients and reduces the high percentiles of response time \(\implies\) used in latency-sensitive financial trading systems

    • another variant to the idea: use GC only for short-lived objects (fast to collect), and just do process restarts periodically for the long-lived objects (before they OOM)

      something like a rolling upgrade

Knowledge, Truth, and Lies #

  • there’s a need to define a system model with the assumptions we make about our distributed system; algos can then be proven correct with respect to that model’s guarantees \(\implies\) that’s how to get reliable behaviour even if the underlying system doesn’t offer many guarantees
    • so far, we looked at the ways that distributed systems differed from single-computer run programs:
      1. no shared memory
      2. only message passing over unreliable networks with variable delays
      3. systems may suffer from partial failures, unreliable clocks and processing pauses
    • these raise philosophical questions:
    • how to know true/false in system? esp if system is unreliable?
    • should software systems obey the physical world laws like cause and effect?

The Truth Is Defined by the Majority #

  • a node cannot necessarily trust its own judgment of a situation. A distributed system cannot exclusively rely on a single node, because a node may fail at any time, potentially leaving the system stuck and unable to recover.

    \(\implies\) this is why distributed systems rely on quorums

    3 scenarios (examples) of asymmetric faults within networks that quorums address:

    1. particular node receives but can’t send messages and the rest declare it dead even though it’s not

    2. particular node semi-disconnected: it can tell that its messages aren’t being ack-ed, so it knows there’s some kind of network fault, but it can’t do anything about it

    3. particular node with a long stop-the-world GC pause (doesn’t exactly know how long of a time has elapsed)

      the node’s threads are pre-empted by the GC and paused for a minute, during which the others declare it dead; then, all of a sudden from their POV, the dead node comes back alive
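  The quorum rule itself is tiny (a sketch, not from the book): a judgment such as "node X is dead" stands only when a strict majority shares it, so no single confused node can decide alone.

```python
def majority_agrees(votes_for, cluster_size):
    # a decision (e.g. declaring a node dead) needs a strict majority
    return votes_for > cluster_size // 2

# in a 5-node cluster, 3 votes are decisive; 2 are not
assert majority_agrees(3, 5)
assert not majority_agrees(2, 5)
```

  A strict majority also guarantees that any two quorums intersect, which is why two conflicting decisions can't both pass.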

Leader and the lock #

  • typically we need singleton things in some cases:

    e.g.:

    • single leader (to avoid split-brain)
    • only one txn / client shall hold the lock to a particular resource to have concurrency guarantees
    • one user allowed to register a particular uname, because it’s typically a unique user id
  • these singletons need to be implemented well (e.g. a distributed lock)

    e.g. of a problem: a process pause makes a process think it still holds a write lease on a lock, so it writes to and corrupts a file in storage

  • so singleton-related behaviours (like a chosen one) needs to abide by quorum rules

    else it may act in ways that the system can’t handle and lead to corruption

Fencing tokens #

  • guarantee needed: a node that falsely believes it’s the “chosen one” shouldn’t be able to disrupt the rest of the system

    Fencing tokens can detect and block a node that is inadvertently acting in error.

  • one such mechanism is fencing: making access to a resource safe using increasing fencing tokens, typically enforced by a server-side check

    example of this fencing token mechanism

    1. every grant by the lock server for a lock or lease will have an associated fencing token

      this token is a number that increases every time a lock is granted

    2. we require that every time a client sends a write request to the storage service, it needs to include the current fencing token it has

    3. the storage server remembers the highest fencing token it has processed, which is how it can reject stale writes even after a process-pause-based fault

  • fencing token mechanism requires the resource to take an active role in checking tokens by rejecting any writes with an older token than one that has already been processed

    \(\implies\) insufficient to just rely on clients to check their lock-status

  • server-side checks in distributed systems is a good idea because clients may be having their own agendas
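  The server-side check can be sketched like this (hypothetical `FencedStorage` class, my own illustration of the mechanism described above):

```python
class FencedStorage:
    """Storage service that rejects writes carrying a stale fencing token."""
    def __init__(self):
        self.max_token = -1  # highest token seen so far
        self.data = {}

    def write(self, key, value, token):
        if token < self.max_token:
            # a newer lock holder has already written -> this client is stale
            raise PermissionError(f"stale fencing token {token}")
        self.max_token = token
        self.data[key] = value

store = FencedStorage()
store.write("file", "from client 1", token=33)  # client 1 holds the lease
store.write("file", "from client 2", token=34)  # lease expired; client 2 took over
try:
    # client 1 resumes after a long GC pause, still believing it holds the lease
    store.write("file", "late write", token=33)
except PermissionError as e:
    print(e)  # → stale fencing token 33
```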

Byzantine Faults #

  • Byzantine faults \(\implies\) lying: deliberate attempt at subverting the system’s guarantees

  • Byzantine fault-tolerance:

    • can operate correctly even if some of the nodes are malfunctioning and not obeying the protocol guarantees / malicious entities are interfering with the network

    • typically relies on a supermajority (beyond majority)

    • example environments:

      1. aerospace where random bit flips from radiation \(\implies\) unpredictable system deviations. So the flight systems must tolerate Byzantine faults

      2. multiple participating orgs might involve the attempt to cheat or defraud others

        e.g. Bitcoin and blockchains that aren’t centrally regulated

    • such fault tolerance is expensive and typically depends on HW-level support

    • P2P networks need some aspects of this fault tolerance

  • protocols don’t save us from vulnerabilities, security compromises, malicious attacks

weak forms of lying #

  • not full blown lying but things like HW issues might essentially make it a lie
  • examples:
    1. network packet corruptions \(\implies\) typically caught by checksums
    2. an internal service that doesn’t have as strict user-input checks as a publicly accessible service
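  For instance, a CRC32 checksum (Python stdlib `zlib`) catches the kind of corruption described in point 1:

```python
import zlib

payload = b"hello, distributed world"
checksum = zlib.crc32(payload)           # sender attaches this to the packet

corrupted = b"hellO, distributed world"  # one character flipped in transit
assert zlib.crc32(corrupted) != checksum  # receiver detects the mismatch
assert zlib.crc32(payload) == checksum    # intact payload verifies fine
```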

System Model and Reality #

  • Algo design shouldn’t be heavily dependent on the details of the HW and SW configuration on which they run

    instead, they should formalize the kinds of faults that we expect to happen in a system

    \(\implies\) that’s what the system model abstraction is all about

  • For modelling real systems, the partially synchronous model with crash-recovery faults is generally the most useful for us

    • timing assumptions related system models that are common:

      1. synchronous model

        • assumes bounded network delay, bounded process pauses, and bounded clock error.

        • still will have network delay, they are just bounded delays

        • not realistic in most practical systems because packet-based networks have unbounded delays and systems have pauses

      2. partially synchronous model

        • behaves like a synchronous system most of the time, but it sometimes exceeds the bounds for network delay, process pauses, and clock drift

        • realistic, most of the time networks and processes are well-behaved

      3. asynchronous model

        • algorithm is not allowed to make any timing assumptions. typically, it doesn’t even have a clock (so it cannot use timeouts).

        • very restrictive model

    • node failure related system models for nodes:

      1. crash-stop faults

        • an algorithm may assume that a node can fail in only one way, namely by crashing.

        • never comes back

      2. crash-recovery faults

        • nodes may crash at any moment, and perhaps start responding again after some unknown time.

        • assumed to have stable storage (i.e. nonvolatile disk storage) so that it may be revived (but in-memory state assumed to be lost)

      3. Byzantine (arbitrary) faults

        • nodes may do anything they want including being malicious
  • given a system model, the next question is how distributed algos cope with it:

correctness of an algo #

  • any algo’s correctness is described via its properties (invariants), including distributed algos

    e.g. fencing tokens, properties:

    1. uniqueness

    2. monotonic sequence

    3. availability: a node that requests a fencing token and doesn’t crash will eventually receive a response

safety and liveness properties #

  • using e.g. of fencing tokens,

    safety properties: uniqueness, monotonic sequences \(\implies\) nothing bad happens

    • can pinpoint the time when property violations happened
    • typically, violations can’t be undone

    liveness property: availability – such properties often have the word “eventually” in their definition \(\implies\) something good eventually happens

    • the opposite framing of safety: a violation can’t be pinned to a moment, because the property may still be satisfied later
  • why care about safety vs liveness properties?

    • safety properties need to always hold, in all possible situations of the system model
    • liveness properties can have caveats, because they’re about eventuality (so during recovery, some requests may go unanswered until the system is eventually healthy again)

mapping system models to real world #

  • system model is still a simplification of reality

    there may be situations that violate the guarantees we have

  • e.g. suppose the stable storage requirement (necessary for remembering data on outages for quorum invariants in distributed systems) fails, then the model becomes hard to reason about.

  • interesting take: the difference between computer science and software engineering comes from the parts where a real implementation has to account for more than the abstract / theoretical behaviours that algorithms imply

    they’d have to account for real situations of abuse and so on (though short of full Byzantine fault tolerance), where a human operator might have to clean up the mess

  • a correct algo doesn’t necessarily mean that its implementation on a real system will behave correctly, but it’s a good first step

    theoretical analysis can uncover problems in an algorithm that might remain hidden for a long time in a real system, and that only come to bite you when your assumptions (e.g., about timing) are defeated due to unusual circumstances.

Summary #

  • defining characteristics of distributed systems is that partial failures can occur

    that’s why we try to build tolerance of partial failures into software, so that the system as a whole may continue functioning even when some of its constituent parts are broken.

  • how to tolerate faults:

    1. detect them, which can be hard

    2. if can’t detect, rely on timeouts to make assumptions on availability

      • can’t distinguish node from network failures
    3. even if faults can be detected, they’re hard to tolerate because there’s no global shared memory between nodes

      only message passing over unreliable networks, with failing nodes and pauses