
Chapter 9. Consistency and Consensus

  • just as transactions are a general-purpose abstraction for solving a whole class of problems, distributed computing problems can also be solved by general-purpose abstractions:

    1. abstraction of consensus
  • first we explore the range of guarantees and abstractions that we can have in distributed systems; to get the scope of what can be done and what can’t be done

  • this book just focuses on general intuition without going into correctness-proofs

Consistency Guarantees #

  • eventual consistency is a weak guarantee; it just says replicated dbs will have temporary inconsistency

    also a better naming could be convergence

    when working with db that provides only weak guarantees, we have to be sensitive about its limitations and not assume too much

    edge cases of eventual consistency only become apparent when there is a fault in the system e.g. high concurrency / network interruption

  • no free lunch for stronger consistency models

    stronger consistency models exist but they may come at the expense of performance / less fault-tolerance

  • the hierarchy of distributed consistency models and that of transaction isolation levels overlap in some ideas

    difference:

    • transaction isolation is about avoiding race conditions that come from concurrently executing transactions

    • distributed consistency is about coordinating the state of replicas in the face of delays and faults

Linearizability: a recency guarantee #

  • it’s a recency guarantee

  • idea: since replicas exist and there’s a chance that if we ask 2 different replicas the same question, we may get different answers,

    we can strive for a situation where there’s an illusion of only a single db; every client would then get the same answer without worrying about replication lag

    basic idea is to make a system appear as if there were only one copy of the data, and all operations on it are atomic. With this guarantee, even though there may be multiple replicas in reality, the application does not need to worry about them.

    • as soon as one client completes a write successfully, all clients reading from the db must be able to see the value just written

      so this value can’t come from a stale cache / replica

What Makes a System Linearizable? #

  • terminology: the “variable” is better called a register, which may be a kv-pair or a row in a table …

    • consider the operations w.r.t register:
      • read
      • write
      • cas(x, \(v_{old}\), \(v_{new}\)) (compare-and-set)
        • atomic operation
        • if \(x \neq v_{old}\), the operation leaves the register unchanged and returns an error to the client
        • if \(x = v_{old}\), atomically set the register to \(v_{new}\)
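A minimal single-process sketch of such a register in Python (hypothetical, not from the book; a lock stands in for whatever mechanism makes each operation atomic):

```python
import threading

class Register:
    """One register (e.g. a key in a KV store, or a row in a table)."""

    def __init__(self, value=None):
        self._value = value
        self._lock = threading.Lock()  # makes each operation atomic

    def read(self):
        with self._lock:
            return self._value

    def write(self, value):
        with self._lock:
            self._value = value

    def cas(self, v_old, v_new):
        """Atomically set to v_new only if the current value equals v_old."""
        with self._lock:
            if self._value != v_old:
                return False  # register unchanged; client sees an error
            self._value = v_new
            return True
```

With a fresh register, `cas(None, "alice")` succeeds and a second `cas(None, "bob")` fails, which is exactly the primitive a uniqueness constraint needs.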
  • regular register: a register in which reads may return either the old or new value if the read is concurrent with a write

  • The requirement of linearizability is that the lines joining up the operation markers always move forward in time (from left to right), never backward.

    This requirement ensures the recency guarantee we discussed earlier: once a new value has been written or read, all subsequent reads see the value that was written, until it is overwritten again.

  • It is possible (though computationally expensive) to test whether a system’s behavior is linearizable by recording the timings of all requests and responses, and checking whether they can be arranged into a valid sequential order

Linearizability vs Serializability #

  • similar sounding guarantees because both have something to do with “arranging in a sequential order”, but they are different guarantees

  • Serializability:

    • isolation property of transactions
    • every txn may R/W multiple objects
    • guarantee: txn behave the same as if they had executed in some serial order (w.r.t transactions). It’s alright for that order to be different from the order in which txns were actually run
  • Linearizability

    • recency guarantee on R/W of a register (individual object)
    • no grouping of operations into txns, doesn’t solve write-skew and all unless other steps (e.g. materializing conflicts) are taken
  • strict serializability aka strong one-copy serializability (strong-1SR): when a db provides both serializability and linearizability

    • implementation e.g. 2PL or actual serial execution that make the db typically linearizable
  • serializable snapshot isolation is not linearizable

    • it reads from a consistent snapshot so as to avoid lock contention b/w readers and writers
    • it doesn’t include writes that are more recent than the snapshot so it reads from snapshots that are not linearizable

Relying on Linearizability #

cases where linearizability matters for system correctness:

  1. locking and leader election

    leader locks have to be linearizable

    typically every node that spins up tries to acquire the lock and the one that succeeds is the leader

    a linearizable storage service is the foundation for these coordination tasks

    distributed locking can be done at the page-level on the storage system

  2. constraints and uniqueness guarantees

    • if we want to enforce constraints (e.g. uniqueness constraints) as the data is written, then we need linearizability

    • similar situation to a lock; can see it as “locking” the value in a column (e.g. the username)

    • the operation is similar to a compare-and-set also

    • also relevant when constraints such as “min value” need to be enforced (e.g. min val in bank, last item sale,…)

      • for some of these constraints, if it’s alright for the constraint to be slightly looser, then we don’t need linearizability
    • fk / attribute constraints can be implemented without requiring linearizability though

  3. cross-channel timing dependencies

    • in some cases, the detection of the linearizability violation happens because there’s an additional communication channel in the system (cross-channel)

    • given example: a webserver that accepts image uploads and works together with an image resizer

      in this system the webserver and the image resizer communicate both through file-storage and via a message queue, which now opens up the possibility of a race condition

      if the file storage service is linearizable, then the system works alright

      else, there’s the risk of race conditions:

      • e.g. the message queue is faster than the internal replication inside the storage service, so when the resizer fetches the image, it might see an old version of the image – the full-size and resized images in the file-storage become permanently inconsistent
    • in example, 2 comms channels between the webserver and the resizer: file storage and the message queue

      race condition happens if there’s no recency guarantee of linearizability

    • this problem can be solved in other ways, but linearizability is a simple enough way to solve it

Implementing Linearizable Systems #

  • most common approach is to use replication so that systems are fault-tolerant, so the replication methods can be looked at to see if they can be made linearizable:
    1. single leader replication: potentially linearizable

      • leader has primary copy, followers have backup copies on other nodes

      • reads from the leader / synchronously updated followers can potentially be linearizable

      • using the leader for reads \(\implies\) we know who the leader is

      • situations when linearizability is violated:

        1. a delusional leader (even if it eventually realises that it’s not the leader) will violate linearizability while it keeps serving requests

        2. when it’s async replication, then failover may lose committed writes, which will violate both durability and linearizability

    2. consensus algos: linearizable

      • some of these algos feel like single-leader replication but they have mechanisms that prevent split brain and stale replicas

        so, consensus algos can implement linearizable storage safely

    3. multi-leader replication: not linearizable

      concurrent processing of writes on multiple nodes + async replication to other nodes = conflicting writes that require resolution

      conflicts – artifact of the lack of a single copy of the data

    4. leaderless replication: generally not linearizable

      • the claim is that we can get “strong consistency” if we require quorum reads and writes (w + r > n)

        depends on:

        • config of the quorum
        • how strong consistency is defined
      • LWW (last write wins) conflict resolution usually cannot preserve linearizability because clock timestamps can’t be guaranteed to be consistent with the actual ordering of events, because of clock skew

      • sloppy quorums don’t allow linearizability

        there’s situations when even strict quorums can still violate linearizability as well

Linearizability and quorums #

  • it’s possible to get a violation

  • it’s also possible to add in linearizability (feasible only for read and write operations, not for compare-and-set operations) by doing the following:

    1. readers: add in read repair mechanisms for readers which is synchronous (so performance takes a hit) before returning read results to the application

    2. writer: must read the latest state of a quorum of nodes before sending it writes
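A toy sketch of the synchronous read-repair idea, with replicas modeled as plain in-memory dicts mapping key → (version, value) (the data model and names are illustrative assumptions, not the book’s code):

```python
def quorum_read(replicas, key, r):
    """Read key from r replicas; synchronously repair stale replicas
    before returning the result to the application (that synchrony is
    the performance cost mentioned above)."""
    contacted = replicas[:r]
    # each replica answers with its (version, value) for the key
    responses = [(node, node.get(key, (0, None))) for node in contacted]
    latest_version, latest_value = max(vv for _, vv in responses)
    # synchronous read repair: write the newest value back to stale replicas
    for node, (version, _) in responses:
        if version < latest_version:
            node[key] = (latest_version, latest_value)
    return latest_value
```

After a read, every contacted replica holds the latest (version, value), so a later read of the same quorum cannot go backwards in time for that key.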

The Cost of Linearizability #

  • need to understand pros and cons to get the tradeoffs of this guarantee

  • network interruptions force a choice between availability and linearizability

    on interruption of network between data-centers:

    1. if multi-leader db:

      within data-center, can function normally

      the writes are queued up and exchanged when network connectivity (b/w data-centers) is restored

    2. if single-leader replication: depends on whether we connect to the data center containing the leader or not

      leader must be in one of the data-centers

      so any writes and any linearizable reads must be sent to the leader – for clients connected to a follower data center, those requests must be sent synchronously over the network to the leader data center

      if interrupted, clients connected to the follower data centers can’t contact the leader so can’t make any writes nor linearizable reads

      can still make reads from follower but the data may be stale (non-linearizable)

      no issue if clients are connected to the datacenter that has the leader within it, BAU there

The CAP Theorem #

  • linearizability issue: this tradeoff arises regardless of single/multi-leader replication, in single- or multi-datacenter deployments, and even within one datacenter if its network is unreliable

    tradeoffs:

    1. either linearizable (consistent) or available under network partitions
      1. if app requires linearizability

        if network problem e.g. disconnected network partition \(\implies\) some replicas can’t process requests while disconnected \(\implies\) they must wait until the network problem is fixed or they must return an error \(\implies\) service is unavailable

      2. no need linearizability

        can be written such that each replica can independently process, even if disconnected (e.g. multi-leader)

        so, application can be available in face of network problem but behaviour is not linearizable

        more tolerant of network problems

  • CAP was supposed to be a rule of thumb, more of a discussion of tradeoffs

  • the unhelpful framing of CAP:

    typically presented as:

    CAP is sometimes presented as Consistency, Availability, Partition tolerance: pick 2 out of 3. Unfortunately, putting it this way is misleading [32] because network partitions are a kind of fault, so they aren’t something about which you have a choice: they will happen whether you like it or not

    so a better framing of this set of tradeoffs should be:

    a better way of phrasing CAP would be either Consistent or Available when Partitioned

  • in its formal definition, the CAP theorem has a very narrow scope where it’s relevant:

    only when the consistency model is linearizability and when the type of fault is network partitions

  • there are more important things to think about when designing distributed systems instead of mulling that much over CAP theorem

  • there’s a newer acronym to help guide us PACELC (ref wikipedia for the design)

    If there’s a Partition (P), we have to choose between Availability (A) and Consistency (C)

    Else (E),

    When the system is healthy, then we have to choose between Latency(L) and Consistency(C)

    There are four configurations or trade-offs in the PACELC space:

    • PA/EL - prioritize availability and low latency over consistency
    • PA/EC - when there is a partition, choose availability; else, choose consistency
    • PC/EL - when there is a partition, choose consistency; else, choose lower latency
    • PC/EC - choose consistency at all times
    
      ---
    
      ## 1. CAP, restated precisely (no folklore)
    
      **Formal CAP says**:
    
      > In the presence of a *network partition*, a system that provides *linearizability* cannot also provide *availability*.
    
      Key constraints:

      * **Consistency = linearizability**, not “strong consistency”, not serializability, not read-your-writes.
      * **Availability** means *every non-failing node eventually returns a non-error response*.
      * **Partition tolerance** is not optional: partitions are faults.
    
      So CAP is not:

      * a taxonomy of databases
      * a performance model
      * a latency argument
      * a general statement about “distributed systems are hard”

      It’s a **liveness vs safety** statement under a *very specific* failure mode.

      That narrowness is exactly why people felt dissatisfied and invented PACELC.
    
      ---
    
      ## 2. PACELC: what it adds (and why it was attractive)
    
      PACELC reframes the discussion as:
    
      > **If there is a Partition (P)**, choose between **Availability (A)** and **Consistency (C)**
      > **Else (E)**, when the system is healthy, choose between **Latency (L)** and **Consistency (C)**
    
      So:

      * **PA/PC** → what you do *during partitions*
      * **EL/EC** → what you optimize *during normal operation*

      This adds something CAP never talked about: **the steady state**, which is where systems live 99.99% of the time.
    
      ### Why this felt like progress

      Because in practice:

      * Engineers care *far more* about:
        * tail latency
        * quorum round trips
        * geo-replication delays
      * CAP only talks about a rare failure mode.

      PACELC made it acceptable to say:

      * “We choose to sacrifice linearizability **even when there is no partition** because we want lower latency.”

      That’s a real, everyday tradeoff.
    
      ---
    
      ## 3. Where PACELC still falls short
    
      This is the important part: these frameworks are not the whole story.
    
      ### 3.1 It still treats “consistency” as a single axis
    
      PACELC inherits CAP’s biggest sin:
    
      > **Consistency is not one thing**
    
      It collapses:

      * linearizability
      * causal consistency
      * session guarantees
      * snapshot isolation
      * serializability
      * application-level invariants

      …into a single C/¬C switch.

      But in reality:

      * Many systems give up **linearizability** while preserving **causality**
      * Many applications don’t need linearizability, but *do* need:
        * monotonic reads
        * read-your-writes
        * bounded staleness

      PACELC can’t express this.
    
      ---
    
      ### 3.2 It ignores *what* you are coordinating
    
      Both CAP and PACELC assume:
    
      > “The system is a single blob with one consistency model.”
    
      But DDIA’s core insight is that **different operations need different guarantees**. Examples:

      * Leader election → needs linearizability
      * Analytics queries → don’t
      * Counters → maybe CRDT
      * Uniqueness constraints → CAS-style linearizability
      * Caches → eventual consistency is fine

      Real systems are **heterogeneous**, not point-labeled on a 2D grid.
    
      ---
    
      ### 3.3 It ignores correctness vs performance separation
    
      PACELC still frames everything as a **performance tradeoff**.
    
      But many consistency choices are about:

      * **correctness**
      * **invariant preservation**
      * **bug avoidance**

      Examples (from the notes above):

      * Cross-channel timing bugs
      * Distributed locks
      * Uniqueness constraints

      These are not “performance preferences”; they are **semantic requirements**.
    
      PACELC has no vocabulary for:
    
      > “This part of the system *must not* be wrong.”
    
      ---
    
      ### 3.4 It ignores failure modes beyond partitions
    
      PACELC, like CAP, is obsessed with partitions.
    
      But real systems fail due to:

      * clock skew
      * GC pauses
      * slow replicas
      * partial disk failure
      * Byzantine clients (sometimes)
      * misconfiguration
      * overload collapse

      Linearizability can fail *without* a partition.
      Availability can fail *without* a partition.
      PACELC doesn’t model this.
    
      ---
    
      ## 4. What replaces CAP/PACELC in DDIA thinking?
    
      Not another acronym — but a **shift in mindset**.
    
      ### 4.1 Think in *abstractions*, not labels
    
      DDIA’s move is:

      * Stop classifying databases
      * Start reasoning about **abstractions**:
        * linearizable register
        * atomic CAS
        * transaction
        * snapshot
        * log
        * consensus
    
      Then ask:
    
      > Which abstraction does my problem *actually* require?
    
      ---
    
      ### 4.2 Think in *failure semantics*
    
      Instead of:
    
      > “Is this AP or CP?”
    
      Ask:
    
      * What happens during:
        * message delay?
        * leader crash?
        * failover?
        * clock skew?
      * What invariants break?
      * What errors surface to users?
    
      This is far more predictive than CAP labels.
    
      ---
    
      ### 4.3 Think in *application invariants*
    
      The strongest DDIA lesson:
    
      > Consistency models exist to protect **invariants**, not to win theoretical arguments.
    
      Examples:

      * “Only one leader exists”
      * “Stock never goes below zero”
      * “Email is sent once”
      * “Usernames are unique”
    
      Once you state the invariant, the needed consistency often becomes obvious — CAP or PACELC be damned.
    
      ---
    
      ## 5. Bottom line

      The notes above implicitly conclude:

      > CAP is correct, but not a useful primary design tool.

      That conclusion:

      * ✔️ Matches DDIA
      * ✔️ Matches modern distributed systems practice
      * ✔️ Explains why people keep rediscovering the same bugs despite “knowing CAP”

      PACELC is a **better story**, but still a story, not a design method.
    
    

Linearizability and network delays #

  • funny thing is that even RAM on multi-core CPUs is not linearizable

    unless a memory barrier or memory fence is used, it’s not linearizable

    why?

    • every CPU core has its own memory cache and store buffer

    • mem access reaches a cache first and any changes async written out to main memory

    • since data access in cache is much faster than going to memory

    • now there’s several copies of the data (main memory, several caches) and the copies are async updated so linearizability is lost

    • reason for dropping linearizability is to prioritize performance

  • Linearizability is slow all the time, not just during a network fault

  • proof that if you want linearizability, the response time of R/W requests is at least proportional to the uncertainty of delays in the network

    • high variable delays (most networks), response time of linearizable reads and writes will be high
  • KIV chapter 12 for approaches for avoiding linearizability without sacrificing correctness

Ordering Guarantees #

  • ordering is a fundamental concept that has popped up in a bunch of contexts:

    1. leader in the single leaders replications determines the order of writes in the replication log

      • the order in which to apply the writes
    2. serializability: ensuring txns behave as though they were executed in some sequential order

      2 ways to achieve:

      • literally executing txns in serial order

      • allowing concurrent executing while preventing serialization conflicts (by locking / aborting)

    3. in the context of timestamps and clocks in distributed systems

      attempts to determine which one of two writes happened later

  • there are deep connections between ordering, linearizability, and consensus.

Ordering and Causality #

  • Ordering helps preserve causality.

    ordering on events: cause comes before effect; a message is sent before that message is received; the question comes before the answer.

  • Some patterns we’ve observed in the book so far about causality:

    1. causal dependency between questions and answers

    2. updates to a record must happen after they exist (creation of that record)

    3. the happened-before relationship is an expression of causality

      given 2 operations A and B

      1. A before B

      2. B before A

      3. A and B concurrent \(\implies\) no causal link b/w the two operations

    4. the consistency in consistent snapshots (transactions, snapshot isolation) is consistency with causality

      so no read skew (i.e. non-repeatable reads, reading data in a state that violates causality)

      observing the entire db at a single point in time makes it consistent with causality:

      • effects of all ops happening before that time are visible

      • no effects of operations happening after that time are visible

    5. write skew b/w txns show causal dependencies

      where the write action is causally dependent on the observation of who’s currently on call

      SSI (serializable snapshot isolation) detects this write skew by tracking the causal dependencies b/w txns

    6. causal violation that was detected because of cross-channel timing dependencies

      e.g. the sports example where the exclamation of the result (comms channel) was the reason why the violation could be detected

  • causally consistent: system obeying the ordering imposed by causality

Causal order is not a total order #

  • Total ordering: any 2 elements can be compared

    mathematical sets, by contrast, are only partially ordered

    i.e. in some cases one set is greater than another (if one set is a superset of the other) and in all other cases (e.g. {a, b} vs {b, c}) they are incomparable
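Python’s built-in sets happen to implement exactly this partial order (the comparison operators test the subset relation), which makes a quick illustration possible:

```python
a, b, c = {1, 2}, {1, 2, 3}, {2, 3}

# comparable: a is a strict subset of b, so a < b
assert a < b and not (b < a)

# incomparable: neither {1, 2} nor {2, 3} is a subset of the other,
# so both comparisons are False -- a partial order, not a total one
assert not (a < c) and not (c < a)
```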

  • diff b/w total and partial order is shown in the diff db consistency models

    1. linearizability \(\implies\) total ordering of operations

      if system behaves like it has a single copy and every operation is atomic then all elements (operations) are comparable \(\implies\) that’s why it can be illustrated as a timeline

    2. causality \(\implies\) partial ordering

      2 events ordered if they are causally related (one before another), incomparable if they’re concurrent \(\implies\) partial ordering

  • Implication: no concurrent operations in a linearizable datastore – must be a single timeline along which all operations are totally ordered

    still can have multiple requests being queued and all, just the datastore does everything atomically, on a single copy of data, along a single timeline without any concurrency

  • concurrency – it’s like a branching and merging of timelines

    very much like git

    • operations on different branches are incomparable

Linearizability is stronger than causal consistency #

  • linearizability implies causality

    that’s great, easy to understand and all

    but performance hits and problems with availability is why distributed data systems don’t prioritise linearizability

  • possible to be causally consistent without incurring the performance hit of making things linearizable

  • causal consistency is the strongest possible consistency model that does not slow down due to network delays, and remains available in the face of network failures

    most systems that appear to need linearizability just need causal consistency actually

Capturing causal dependencies #

Just some key ideas

  • to maintain causality we just need to maintain the partial order, so that we can determine the happened-before relationship

  • ways to determine the happened-before relationship b/w operations are similar to the detection of concurrent writes (ref)

    going beyond just the detection of concurrent writes to the same key, and generalizing version vectors, is how we can do this.

    examples of using version vectors in other situations:

    • e.g. version number of prior operations is passed back to the db on a write

    • e.g. in the conflict detection of SSI, a txn that wants to commit will pass the version of the data that it has read and checks if it’s still up to date

Sequence Number Ordering #

  • actually keeping track of ALL causal deps can be impractical because of large overhead, so we pick something simpler (sequence numbers / timestamps)

    • compact values
    • preserve total order (every operation with a unique sequence number)
    • 2 sequence numbers are comparable
  • use a logical clock for this

  • total order that is consistent with causality: the happened-before relationship is preserved, concurrent operations may be ordered arbitrarily.

    • this captures all the causality info
  • the replication log in a single-leader replicated db defines a total ordering of write operations that is consistent with causality

    typically a monotonic counter is used by the leader

    followers may lag but the causal consistency is still guaranteed

Noncausal sequence number generators #

  • harder to generate sequence numbers for operations when the db is NOT single-leader replication (e.g. multi-leader / leaderless replication)

  • some ways:

    1. each node generates its independent set of sequence numbers

      e.g. if 2 nodes, then even/odd number sequences

      problem: can’t compare if even-odd number

    2. attach timestamp from a time-of-day (physical) clock to each operation and make sure the resolution is high enough – might be sufficient enough to totally order operations even

      problem: might have clock skew when using physical clocks

      used in the conflict resolution strategy behind last write wins (ref)

    3. preallocate blocks of sequence numbers; each node assigns numbers from its own block

  • problem: sequence numbers generated by all these methods are not consistent with causality

    this is because they preserve ordering within a node but not across nodes
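Approach 1 can be sketched as follows (hypothetical helper; with two nodes it yields the even/odd sequences mentioned above):

```python
def make_sequence_generator(node_id, num_nodes):
    """Each node independently emits node_id, node_id + n, node_id + 2n, ...
    Numbers are unique across nodes, but the interleaving across nodes
    says nothing about causality."""
    def gen():
        counter = node_id
        while True:
            yield counter
            counter += num_nodes
    return gen()

evens = make_sequence_generator(0, 2)  # node 0 emits 0, 2, 4, ...
odds = make_sequence_generator(1, 2)   # node 1 emits 1, 3, 5, ...
```

This shows the problem concretely: a write stamped 3 on node 1 may causally depend on a write stamped 4 on node 0, so the numeric order is not the causal order.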

Lamport Timestamps #

  • lamport timestamp is pretty simple: (counter, nodeID)

  • it’s a stamp of a logical clock that provides total ordering

    if time-value within the timestamp is the same, then we just pick the one with the greater node ID

    this part is similar to the even-odd numbers approach

  • improvement: every node and every client keeps track of the maximum counter value it has seen so far, and includes that maximum on every request. When a node receives a request or response with a maximum counter value greater than its own counter value, it immediately increases its own counter to that maximum.

    • as long as the max counter value seen is carried along with every operation, the causal consistency is present

      because every causal dependency results in an increased timestamp

  • lamport timestamps are not version vectors, their purpose is different

    1. version vectors: to distinguish whether two operations are concurrent or if one is causally dependent on the other

    2. Lamport timestamps: to enforce a total ordering

      \(\implies\) this total ordering is why it’s not possible to tell if 2 operations are concurrent or causally dependent

Timestamp ordering is not sufficient #

  • in some cases, the distributed nature of things means that total ordering (via lamport timestamps) is not sufficient

  • problem here is that the total order of operations only emerges after you have collected all of the operations. If another node has generated some operations, but you don’t yet know what they are, you cannot construct the final ordering of operations: the unknown operations from the other node may need to be inserted at various positions in the total order.

  • in order to implement something like a uniqueness constraint for usernames, it’s not sufficient to have a total ordering of operations – you also need to know when that order is finalized.

  • that’s what total order broadcast helps us do

Total Order Broadcast aka atomic broadcast #

  • to have the whole system (all nodes) agree on the same total ordering of operations

    AND

    to scale the system if the throughput is greater than what a single leader can handle

    AND

    handle leader failovers

    \(\implies\) total order broadcast / atomic broadcast problem

Using total order broadcast #

  • KIV for consensus section but there’s a strong connection between total order broadcast and consensus

  • alternative lens: total order broadcast helps create a log (replication log, txn log, WAL) and delivering a message is like appending to that log

  • can be used to do state machine replication: is what is needed for db replication

    • for replicas to be consistent with each other the ordering must be consistent with each other so that the writes are processed in the same order
  • can be used to implement serializable transactions

  • can be used to implement a lock service that gives fencing tokens (ref)

    • sequence number can serve as the fencing token

    • the log for fencing token requests is yet another log

  • it is stronger than timestamp ordering because the ordering is fixed at the time the messages are delivered – no node can retroactively insert a message into an earlier position

Implementing linearizable storage using total order broadcast #

  • recap:

    • total order broadcast is async: messages get delivered reliably in a fixed order but there’s no guarantee when it gets delivered so there may be replication lags
    • linearizability is more of a recency guarantee
  • if we have total order broadcast, we can build linearizable storage ontop of it:

    e.g. in the case of usernames, we could try to do a compare-and-set with original value = null so only one of the compare-and-sets will succeed. This requires a linearizable register on which we can do atomic compare-and-set operations

    • implementing timeline/sequential consistency (a slightly weaker guarantee than linearizability) using total order broadcast as an append-only log:

      1. append message to log, indicating that uname you want to claim

      2. read the log, wait for the message you appended to deliver back to you

      3. check for any messages claiming the desired uname: if the first such message is ours, we succeed; otherwise abort the operation

      this helps guarantee linearizable writes but NOT linearizable reads (reading from a store that is updated async risks stale reads)

    • how to linearize reads:

      1. sequence the reads using the log, perform the read when the actual message is delivered back to you

        so message’s position in time is when the read happens

      2. if log allows fetching of the position of the latest log message in a linearizable way, then we can query that position and then wait for all the entries up to that position to be delivered then perform the read – that’s how ZooKeeper’s sync works

      3. read from a replica that is synchronously updated on writes and up to date – used in chain replication
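
The username-claim steps above can be sketched in a few lines (a toy, single-process model; names are hypothetical): every claim is appended to a shared ordered log, and a claim succeeds only if it is the first message in the log for that username.

```python
class OrderedLog:
    """Toy stand-in for total order broadcast: an append-only log
    that every node sees in the same order."""
    def __init__(self):
        self.messages = []

    def append(self, msg):
        self.messages.append(msg)
        return len(self.messages) - 1  # position of our message

def claim_username(log, node_id, username):
    """1. append a message claiming the username
       2. 'wait' for our message to be delivered back (trivial here)
       3. succeed only if the FIRST claim for this username is ours"""
    pos = log.append((node_id, username))
    for i, (claimant, name) in enumerate(log.messages[:pos + 1]):
        if name == username:
            return i == pos  # first claim wins
    return False  # unreachable: our own claim is always found

log = OrderedLog()
print(claim_username(log, "node-1", "alice"))  # True: first claim wins
print(claim_username(log, "node-2", "alice"))  # False: someone beat us
print(claim_username(log, "node-2", "bob"))    # True
```

Because all nodes scan the same log in the same order, they all agree on which claim came first, with no extra coordination; this gives linearizable writes, while reads still need one of the three techniques listed above.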

Implementing total order broadcast using linearizable storage #

  • the converse: given linearizable storage, how to build total order broadcast from it

  • ways:

    1. assume we have a linearizable register that stores an integer and has an atomic increment-and-get operation

    2. use an atomic compare-and-set operation

  • algo: for every message you want to send through total order broadcast, you increment-and-get the linearizable integer, and then attach the value you got from the register as a sequence number to the message. You can then send the message to all nodes (resending any lost messages), and the recipients will deliver the messages consecutively by sequence number.

    • the number we get from incrementing a linearizable register will form a sequence with no gaps
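
The algorithm above can be sketched as follows (a single-process model; class names are hypothetical): a lock-protected counter stands in for the linearizable register, and each recipient buffers out-of-order messages until the gapless sequence lets it deliver them consecutively.

```python
import threading

class LinearizableCounter:
    """Linearizable register with atomic increment-and-get
    (a lock makes it trivially linearizable within one process)."""
    def __init__(self):
        self._lock = threading.Lock()
        self._value = 0

    def increment_and_get(self):
        with self._lock:
            self._value += 1
            return self._value

class Recipient:
    """Delivers messages strictly in sequence-number order,
    buffering any that arrive early."""
    def __init__(self):
        self.next_seq = 1
        self.pending = {}
        self.delivered = []

    def receive(self, seq, msg):
        self.pending[seq] = msg
        # deliver consecutively: the gapless numbering guarantees
        # we never wait forever for a missing sequence number
        while self.next_seq in self.pending:
            self.delivered.append(self.pending.pop(self.next_seq))
            self.next_seq += 1

counter = LinearizableCounter()
node = Recipient()

# Senders tag each message with a gapless sequence number.
seq_a = counter.increment_and_get()  # 1
seq_b = counter.increment_and_get()  # 2

node.receive(seq_b, "msg B")  # arrives early, buffered
node.receive(seq_a, "msg A")  # fills the gap, both deliver in order
print(node.delivered)         # ['msg A', 'msg B']
```

The gapless property is what distinguishes this from Lamport timestamps: a recipient that sees sequence number 4 knows it must wait for 3, so no message can be retroactively slotted in.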
  • it’s difficult to make a linearizable integer with an atomic increment-and-get operation because the difficulty lies in handling failure cases, when nodes fail:

    • so need to handle connection interruptions
    • handle restoring the value when that node fails

    it can be proved that a linearizable compare-and-set (or increment-and-get) register and total order broadcast are both equivalent to consensus

Distributed Transactions and Consensus #

  • the problem to solve – getting nodes to agree on something – hides many subtleties

    • understanding builds on: replication, txns, system models, linearizability and total order broadcast
  • situations when consensus is important:

    1. leader-election

      leadership position might be contested if network faults make it such that some nodes can’t communicate with other nodes \(\implies\) need the nodes to come to consensus else bad failover \(\implies\) split-brain situation \(\implies\) data divergence, inconsistency and data loss

    2. atomic commits (solving the atomic commit problem)

      • consider db with txns that span multiple nodes / partitions and there’s the problem of txns failing on some nodes but succeeded in others

      • to get txn atomicity (ref) we need all the nodes to have a consensus on how to deal with the txn (abort or commit)

               Atomic commit is formalized slightly differently from consensus: an atomic transaction can commit only if all participants vote to commit, and must abort if any participant needs to abort. Consensus is allowed to decide on any value that is proposed by one of the participants. However, atomic commit and consensus are reducible to each other [70, 71]. Nonblocking atomic commit is harder than consensus—see
        
  • we explore the atomic commit problem progressively, going from using 2PC then to Zab (used in Zookeeper) then Raft (etcd)

Impossibility of Consensus #

  • there’s a FLP result, a proof that no algo exists that can reach consensus if there’s a risk that nodes may crash

  • FLP is proven in the context of the asynchronous system model, which is restrictive:

    • algo has to be deterministic
    • algo can’t use clocks or timeouts
  • if the algo can find a way to identify suspected crashed nodes (e.g. using timeouts) or can use randomness, then it's possible to achieve consensus

Atomic Commit and Two-Phase Commit (2PC) #

  • recap: benefits of atomic transactions
    • outcome is either a commit or an abort & rollback
    • prevents failed txns from littering the db with half-finished, half-updated state
    • ensures that the secondary index stays consistent with the primary data, remains useful

from single-node to distributed atomic commit #

  • for single-db-node, the txn writes other than being durable, will also append a commit record to the log on disk – allows crash-recovery

    • transaction commitment crucially depends on the order in which data is durably written to disk: first the data, then the commit record

    • it’s a single device (the controller for the disk drive attached to the single node) that makes the commitment atomic

  • NOTE: the book is outdated in its description that nosql data stores don't offer distributed txns

      Short answer: **the statement was largely correct when DDIA was written, but it’s no longer broadly true today without qualification.** The landscape has shifted quite a bit.
    
      Your quoted passage reflects the **2010–2017 reality** well — but since then, many NoSQL systems have added **distributed transaction support**, at least in some form.
    
      Let’s place this on a timeline and clarify what is true, what changed, and what nuance matters now.
    
      ---
    
      # ✅ What the statement meant (and when it was true)
    
      The claim:
    
      > Most NoSQL distributed datastores do not support distributed transactions, but various clustered relational systems do.
    
      This was broadly accurate in the early NoSQL wave.
    
      ## Early NoSQL design philosophy
    
      Original NoSQL systems (Dynamo-style, Bigtable-style, Cassandra, etc.) were designed around:
    
       * partition tolerance
       * high availability
       * horizontal scale
       * low latency
       * simple data model
       * **single-row / single-partition atomicity only**
    
      They intentionally **avoided distributed transactions** because:
    
       * 2PC hurts availability
       * cross-partition coordination hurts latency
       * global locking hurts throughput
       * consensus across shards is expensive
    
      So they offered:
    
      ```
      atomicity scope = one key / one partition
      ```
    
      Not:
    
      ```
      atomicity scope = arbitrary multi-key transaction
      ```
    
      This matched:
    
       * CAP tradeoff choices
       * operational simplicity goals
       * scale-first philosophy
    
      So historically:
      ✔️ The statement is correct.
    
      ---
    
      # 🔄 What changed over time
    
      Major shift:
      **Several NoSQL databases added distributed transactions later.**
    
      Why?
    
      Because:
    
       * real applications needed invariants
       * developers demanded correctness
       * cloud infrastructure improved
       * consensus got cheaper/faster
       * Google Spanner proved it was feasible
    
      ---
    
      # ✅ Modern NoSQL systems that DO support distributed transactions
    
      ## MongoDB
    
       * Originally: single-document atomicity only
       * Now: **multi-document distributed transactions**
       * Uses: snapshot isolation + 2PC-style coordination
       * Cost: slower, but available
    
      ---
    
      ## CockroachDB
    
       * SQL layer but NoSQL-style architecture
       * Fully distributed transactions
       * Serializable isolation
       * Consensus per range
       * Spanner-inspired
    
      ---
    
      ## YugabyteDB
    
       * Distributed SQL + NoSQL storage engine
       * Supports distributed ACID transactions
       * Raft + transaction manager
    
      ---
    
      ## FoundationDB
    
       * Key-value NoSQL store
       * Strong distributed ACID transactions
       * Designed around transactions from the start
    
      ---
    
      ## Spanner / Spanner-like systems
    
       * Not labeled NoSQL — but nontraditional architecture
       * Globally distributed transactions
       * TrueTime-based ordering
    
      ---
    
      # ⚠️ Systems that still do NOT support general distributed transactions
    
      Some NoSQL systems still avoid them:
    
      ## Cassandra
    
       * No general distributed transactions
       * Lightweight transactions = per-partition CAS only
       * Paxos per key — not multi-key ACID
    
      ---
    
      ## Dynamo-style systems
    
       * Still avoid cross-partition transactions
       * Prefer:

         * conditional writes
         * idempotency
         * application-level compensation
    
      ---
    
      ## Redis Cluster
    
       * No general cross-shard transactions
       * Only same-slot transactions
    
      ---
    
      # 🎯 Important nuance: “supports transactions” ≠ same semantics
    
      Even when supported, distributed transactions vary in:
    
       * isolation level
       * performance cost
       * failure behavior
       * latency penalty
       * operational limits
    
      Support often means:
    
      ```
      available but discouraged for hot paths
      ```
    
      Not:
    
      ```
      free and default
      ```
    
      ---
    
      # 🧠 The deeper architectural shift
    
      The field learned:
    
      > Avoiding transactions pushes complexity into application code.
    
      So newer systems moved toward:
    
      ```
      coordination when needed
      local autonomy when possible
      ```
    
      Modern pattern:
    
       * default = local/partition operations
       * optional = distributed transactions for invariants
    
      This is a **more mature tradeoff** than early NoSQL ideology.
    
      ---
    
      # 📌 How to restate the claim correctly today
    
      Instead of:
    
      > Most NoSQL databases do not support distributed transactions
    
      A modern accurate version would be:
    
      > Early NoSQL systems generally avoided distributed transactions, but many modern distributed databases — including several NoSQL and NewSQL systems — now support them, usually with performance tradeoffs.
    
      ---
    
      # ✅ Your notes are still conceptually correct if read historically
    
      Since you’re reading DDIA-style material:
    
       * Treat the statement as **historically accurate**
       * Understand the **design motivation**
       * Update your mental model with:

         * modern system capabilities
         * optional distributed transactions
         * consensus-backed storage engines
    
    
  • requirements for a distributed atomic commit:

    • a node must only commit once it is certain that all other nodes in the txn are also going to commit

    • a txn commit must be irrevocable

      this is different from the ability to run compensating transactions, because a compensating transaction is, from the DB's pov, a separate txn

      more on compensating txns:

          Great topic. Compensating transactions are one of those ideas that *never went away*, but also never quite became the silver bullet people once hoped for. They sit in an interesting middle ground between “no coordination” and “full distributed transactions”.
      
          I’ll walk through this in a layered way:
      
          1. What compensating transactions actually are
          2. Why they became popular (historically)
          3. How they work in practice (with examples)
          4. Where they are still used today
          5. Where they break down (important)
          6. The modern consensus: when to use them vs not
      
          ---
      
          ## 1️⃣ What is a compensating transaction?
      
          A **compensating transaction** is:
      
          > An explicit operation that semantically *undoes* the effect of a previously completed operation.
      
          Key properties:
      
           * The original operation **already committed**
           * You **cannot roll it back** atomically
           * You later apply a new operation to restore consistency
      
          So instead of:
      
          ```
          Do A
          If failure → rollback A
          ```
      
          You do:
      
          ```
          Do A (commit)
          If failure later → do Compensate(A)
          ```
      
          This is *logical undo*, not physical rollback.
      
          ---
      
          ## 2️⃣ Why compensating transactions existed in the first place
      
          They arose because:
      
           * Distributed transactions were expensive or unavailable
           * Systems needed to span:

             * multiple services
             * multiple databases
             * external APIs
           * You couldn't rely on 2PC across administrative boundaries
      
          Classic domains:
      
           * payments
           * booking systems
           * order fulfillment
           * long-running workflows
      
          This led to:
      
           * **Saga pattern** (1980s, revived in microservices era)
      
          ---
      
          ## 3️⃣ How compensating transactions work (concrete example)
      
          ### Example: Order placement
      
          Steps:
      
          1. Reserve inventory
          2. Charge credit card
          3. Create shipment
      
          If step 3 fails:
      
           * You cannot “uncharge” by rolling back DB state
           * You issue a **refund** (compensation)
           * You release inventory
      
          Sequence:
      
          ```
          Reserve inventory  → commit
          Charge card        → commit
          Ship order         → fails
      
          Compensate:
          Refund card
          Release inventory
          ```
      
          Each step:
      
           * commits independently
           * has a defined compensating action
      
          ---
      
          ## 4️⃣ Where compensating transactions are still used today
      
          ### ✅ Very much still used — in specific domains
      
          #### A. Microservices / workflow systems
      
           * Order processing
           * Billing pipelines
           * Supply chains
           * Travel booking
           * E-commerce

          Frameworks:

           * Saga orchestrators
           * Workflow engines (e.g., durable workflows)
           * Message-driven state machines
      
          ---
      
          #### B. Cross-boundary systems
      
          You *cannot* use distributed transactions when:
      
           * services are owned by different teams
           * vendors are external
           * APIs are asynchronous
           * failures are unavoidable
      
          Compensation is the only viable model.
      
          ---
      
          #### C. Human-in-the-loop systems
      
          If humans are involved:
      
           * refunds
           * reversals
           * approvals
      
          Rollback is impossible — compensation is natural.
      
          ---
      
          ## 5️⃣ Where compensating transactions break down
      
          This is crucial.
      
          ### ❌ Compensation is not rollback
      
          Rollback:
      
           * automatic
           * invisible
           * restores exact previous state

          Compensation:

           * explicit
           * visible
           * may not fully restore state
      
          Examples:
      
           * Refund ≠ original charge (fees, FX rates)
           * Releasing inventory doesn’t undo customer notification
           * Emails sent cannot be “unsent”
      
          ---
      
          ### ❌ Compensation is not always possible
      
          Some operations are:
      
           * irreversible
           * lossy
           * externally visible

          Examples:

           * sending an email
           * triggering a webhook
           * shipping a package
           * publishing a message
      
          You can only *mitigate*, not undo.
      
          ---
      
          ### ❌ Compensation introduces temporal inconsistency
      
          Between failure and compensation:
      
           * system is inconsistent
           * invariants are temporarily violated
      
          This may be acceptable — or not.
      
          ---
      
          ### ❌ Complexity moves to application logic
      
          You must:
      
           * design compensations
           * handle partial failures
           * handle retries
           * handle duplicate compensations
           * handle compensation failure (!)
      
          This is **harder than transactions** in many cases.
      
          ---
      
          ## 6️⃣ Modern consensus: how compensating txns are used today
      
          ### 🧠 The field’s learned position
      
          > Compensating transactions are **not a substitute for ACID transactions**
          > They are a **workflow-level recovery mechanism**
      
          They are best for:
      
           * long-running processes
           * business workflows
           * cross-system coordination

          They are **not good** for:

           * tight invariants
           * correctness-critical counters
           * financial ledger integrity
           * uniqueness constraints
      
          ---
      
          ### 🧩 Modern hybrid approach
      
          Most real systems now use:
      
           * **ACID transactions** *inside* a service or database
           * **Compensation** *between* services
      
          Diagrammatically:
      
          ```
          Service A (ACID)
          Service B (ACID)
          Service C (ACID)
      
          Failures handled by compensation at service boundaries
          ```
      
          This minimizes complexity while preserving correctness.
      
          ---
      
          ## 7️⃣ Relationship to earlier discussion (CAP, skew, consistency)
      
          This ties directly back to your earlier insights:
      
           * Compensation accepts **temporary inconsistency**
           * It assumes **eventual convergence**
           * It avoids **global coordination**
           * It trades correctness guarantees for availability

          In CAP terms:

           * compensation-heavy systems lean AP
           * ACID-heavy systems lean CP
      
          But again: CAP is not the right lens — **invariants are**.
      
          ---
      
          ## 8️⃣ A simple decision rule that actually works
      
          Use compensating transactions when:
      
           * operations are long-running
           * rollback is impossible
           * consistency can be eventual
           * invariants are soft
           * external systems are involved

          Avoid compensating transactions when:

           * invariants must never be violated
           * state must be immediately correct
           * failure handling must be automatic
           * retries are complex or dangerous
      
          ---
      
          ## Final takeaway (the one to remember)
      
          > **Transactions prevent inconsistency.
          > Compensation repairs inconsistency.**
      
          Both are still essential — just at different layers.
      
      

Introduction to 2PC (two-phase commit) #

  • algo for atomic txn commit across multiple nodes

  • the commit/abort process in 2PC split into 2 phases (instead of a single commit request)

  • uses a coordinator aka transaction manager to manage the logic:

    1. app reads and writes data on multiple db nodes as usual

      db nodes are called participants in the txn

    2. app ready to commit? coordinator begins phase 1: sends prepare request to each of the nodes

      • case 1: all participants say yes – coordinator sends phase 2: commit request

      • case 2: not all participants say yes – coordinator sends phase 2: abort request

  • analogy is like a western marriage ceremony

  • if one of the participants or the network fails during 2PC: if any of the prepare requests fail or time out, the coordinator aborts the transaction; if any of the commit or abort requests fail, the coordinator retries them indefinitely.

  • 2PC (two-phase commit) \(\neq\) 2PL (two-phase locking)

    • 2PC is for atomic commits in a distributed db
    • 2PL is about providing serializable isolation
  • system of promises

    • it might seem that 2PC could be broken by packet loss and so on – why does it work?

      there's a bunch of promise-like steps, especially 2 points of no return (which a single-node atomic txn lumps into a single action: writing the commit record to the transaction log)

      • when participant votes “yes” that it can commit later
      • when the coordinator decides, that decision is irrevocable
  • coordinator failure

    • failure before sending the prepare requests: participant can safely abort the txn

    • failure after participant receives a prepare request and participant voted “yes” – then it can no longer abort unilaterally, participants need to wait for the coordinator

      participant’s txn state is in doubt, uncertain

      timeout doesn’t help here

    • technically participants could go and communicate with each other but that’s not part of the typical 2PC protocol

3PC (Three-phase commit) #

  • 2PC is blocking atomic commit protocol because the nodes can be stuck waiting for the coordinator to recover

    it could be tweaked to make the protocol non-blocking, but it’s not straightforward

  • 3PC assumes that the network has bounded delay and nodes will have bounded response times – not a likely assumption

  • nonblocking atomic commits need a perfect failure detector - a way to know whether a node has crashed or not

  • 3PC is hard to implement so 2PC still popular

Distributed Transactions in Practice #

  • distributed transactions incur heavy performance penalties

    • most of the cost comes from the additional disk forcing (fsync) needed for crash recovery, plus the additional network round-trips
  • there’s a need to refine the meaning of “distributed transactions”

    1. database-internal distributed transactions

      • all nodes participating in the txn run the same db software; some distributed dbs support internal transactions amongst their own nodes

      • this can use any protocol, so can optimise in their own way and work well

    2. Heterogeneous distributed transactions

      • when there’s a mix of different db technologies

      • systems may be entirely different under the hood but a distributed txn across these systems must ensure atomic commit

      • this is a lot more challenging

Exactly-once message passing #

  • heterogeneous distributed txns allows diverse system integrations

    e.g. message passing queues and db systems that are different

  • an exactly-once message passing pattern can be done in systems where the transactions are able to use the same atomic commit protocol

XA (X/Open XA aka eXtended Architecture) transactions #

  • standard for implementing 2PC across heterogeneous technologies, supported by many traditional relational dbs and message brokers

  • it’s a C-API for interacting with a transaction coordinator, it’s not a network protocol of its own

    so it’s more of an API-binding

  • assumptions:

    • the application uses a network driver / client library to communicate with the participant dbs / messaging services
  • note: it’s the txn coordinator that implements the XA API

  • can use a disk-based log so that the coordinator may be revived

  • note: there’s been changes to this since the book was published:

      **The points on 3PC remain largely accurate and hold up well against developments since DDIA (2017), with no major breakthroughs overturning the core critiques.** Research has reaffirmed 3PC's impracticality due to its synchronous assumptions, while production systems continue favoring 2PC or alternatives like Paxos/Raft. [bravenewgeek](https://bravenewgeek.com/tag/three-phase-commit/)
    
      ## 2PC Blocking Confirmed
      2PC is indeed blocking: participants can deadlock waiting for a crashed coordinator, and recovery isn't straightforward without logs or external help. No widespread "tweaks" have made it non-blocking in practice; instead, systems use timeouts, presumed aborts, or XA heuristics. [graphapp](https://www.graphapp.ai/blog/understanding-two-phase-commit-a-comprehensive-guide)
    
      ## 3PC Assumptions
      3PC requires bounded delays and response times for non-blocking behavior, which fails in asynchronous networks—still the realistic model today. Partitions cause timeouts, preventing progress, as noted in recent analyses (up to 2025). [cs.unibo](https://www.cs.unibo.it/babaoglu/papers/pdf/commit.pdf)
    
      ## Perfect Failure Detectors
      Non-blocking atomic commit provably requires a perfect failure detector (P), distinguishing crashes from delays—which is impossible in async systems without oracles. Encodings exist but remain theoretical; no practical implementations since DDIA. [drops.dagstuhl](https://drops.dagstuhl.de/storage/00lipics/lipics-vol246-disc2022/LIPIcs.DISC.2022.35/LIPIcs.DISC.2022.35.pdf)
    
      ## Implementation and Popularity
      3PC is complex (extra pre-commit phase/state), rarely implemented fully—e.g., not standard in databases like PostgreSQL or MySQL XA. 2PC dominates for XA transactions; alternatives like 2PC-aside (Google Spanner) or Sagas/Paxos sidestep it entirely. [developer.jboss](https://developer.jboss.org/docs/DOC-56051)
    
      ## Recent Trends (2017-2026)
      Papers explore non-blocking variants (e.g., NetLR-inspired, but not 3PC), but focus shifts to leader-based replication (Raft) or multi-leader with CRDTs/CRQs. No production adoption of 3PC; critiques in 2022+ echo DDIA. [vldb](https://www.vldb.org/pvldb/vol15/p1337-lee.pdf)
    

holding locks when in doubt #

  • the main problem about the transaction status being in doubt is that it’s a blocking hold of the lock

    \(\implies\) causes availability issues

  • typically row-level locks

recovering from coordinator failure #

  • there are chances of in-doubt transactions being orphaned: when the coordinator can’t decide the outcome for various reasons (e.g. txn log corruption)
  • reboots won’t fix the problem
  • human (admin) must manually fix this
  • XA implementations have heuristic decisions as escape-hatches:
    • heuristic is a euphemism for “probably breaking atomicity”

Limitations of distributed transactions #

  • the coordinator itself is something like a db, so using a coordinator and having XA has its own operational problems

  • for db-internal distributed txns, a distributed version of SSI (serialisable snapshot isolation) is possible

    since a distributed txn needs all of its participants, if any part of the system is faulty, distributed transactions tend to amplify failures

Fault-Tolerant Consensus #

  • formalisation of the consensus problem: one or more nodes may propose values, and the consensus algorithm decides on one of those values.

    satisfies the properties:

    1. Uniform agreement: no two nodes decide differently

    2. integrity: no node decides twice

    3. validity: if a node decides that the value is v then v must have been proposed by some node

    4. termination: every node that doesn’t crash eventually decides some value

      • this property formalises the idea of fault tolerance because it says that the consensus algo must make progress in spite of node failures

      • system model of consensus assumes that when a node “crashes” it disappears and never comes back, so any algo that has to wait for a node to recover is not going to be able to satisfy the termination property

        so 2PC won’t meet these requirements

      • this is subject to the assumption that fewer than half the nodes are crashed/unreachable

  • there’s a limit to the number of failures that an algo can tolerate – typically need at least a quorum of nodes to be functioning to make forward progress

  • Popular algos:

    1. VSR (viewstamped replication)
    2. Paxos
    3. Raft
    4. Zab
  • most of the popular algos decide on a sequence of values (which makes them total order broadcast algorithms)

    they don’t directly use the formal consensus model that we use (i.e. the proposing and deciding, while satisfying the agreement, integrity, validity and termination properties)

    total order broadcast is equivalent to performing several rounds of consensus (like a breakdown, where each consensus decision corresponds to one message delivery)

  • each consensus-decision corresponds to one message delivery:

    • agreement property: all nodes decide to deliver the same messages

    • integrity property: messages not duplicated

    • validity property: no message corruption no fabrication

    • termination property: messages are not lost

  • in the case of Paxos, this is actually multi-paxos

Single-leader replication and consensus #

  • single-leader replication is in effect total order broadcast, performed by the leader

  • the important part is how the leader is chosen

  • by making leader election and failover automatic, a follower can take over when the old leader fails, bringing the system closer to fault-tolerant total order broadcast – and closer to solving consensus

  • but to implement leader election, it may seem like we need a leader to elect a leader – a chicken-and-egg problem, so something different is needed

Epoch numbering and quorums #

  • internally the consensus algos use a leader but they don’t guarantee that the leader is unique

    \(\implies\) a weaker guarantee: protocols define an epoch number (aka ballot number [paxos], view number [Viewstamped Replication], term number [Raft]).

    Within each epoch, the leader is unique

  • leader election happens when current leader thought to be dead \(\implies\) the election happens via an incremented epoch number

    \(\implies\) epoch numbers are totally ordered and monotonically increasing

    if there's a conflict between two leaders from different epochs, the one with the higher epoch number wins

  • split-brain preventative measures: leaders need to check that there isn’t any other leader with a higher epoch number

    • this check is mediated by a quorum of nodes

    • a leader must collect votes from a quorum of nodes for every decision it wants to make; a node votes in favor only if it is not aware of another leader with a higher epoch

    • so it ends up having 2 rounds of voting:

      round 1: choose a leader

      round 2: vote on the leaders proposal

      quorum from both these votes must overlap
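
The two rounds and the overlap argument can be sketched as follows (a toy model, all names hypothetical): each node remembers the highest epoch it has promised, so a leader from an older epoch can no longer assemble a quorum.

```python
class Acceptor:
    """Each node remembers the highest epoch it has seen and
    refuses to vote for a leader operating under a stale epoch."""
    def __init__(self):
        self.promised_epoch = 0

    def vote_leader(self, epoch):
        # Round 1 vote: only a strictly newer epoch wins this node.
        if epoch > self.promised_epoch:
            self.promised_epoch = epoch
            return True
        return False

    def accept_proposal(self, epoch, value):
        # Round 2 vote: reject proposals from a deposed (stale) leader.
        return epoch >= self.promised_epoch

def elect_and_propose(nodes, epoch, value):
    quorum = len(nodes) // 2 + 1
    # Round 1: gather leader-election votes for this epoch.
    votes = sum(n.vote_leader(epoch) for n in nodes)
    if votes < quorum:
        return False
    # Round 2: a quorum must also accept the proposal. The two quorums
    # overlap in at least one node, which would have seen any higher
    # epoch and voted the proposal down.
    accepts = sum(n.accept_proposal(epoch, value) for n in nodes)
    return accepts >= quorum

nodes = [Acceptor() for _ in range(5)]
print(elect_and_propose(nodes, epoch=1, value="x"))  # True
# A leader reusing an old epoch can no longer win votes:
print(elect_and_propose(nodes, epoch=1, value="y"))  # False
```

This is only the split-brain-prevention skeleton, not a full consensus algorithm (no log, no recovery of previously accepted values), but it shows why overlapping quorums make a stale leader discover its own deposition.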

Limitations of consensus #

  • benefits of consensus algos:

    • bring concrete safety properties (4 properties above)

    • are fault-tolerant as long as majority of nodes are working

    • since they provide total order broadcast, they can implement linearizable atomic operations in a fault-tolerant way

  • Costs:

    1. the voting process on proposals is in effect synchronous replication

      (dbs are often configured to use asynchronous replication instead, accepting that some committed data can be lost on failover – a risk people typically take for better performance)

    2. blocked on network failure: a node cut off from the majority by a network partition can’t make progress

    3. typical assumption is that for voting, it’s a fixed set of nodes that participate in voting, so can’t dynamically update the cluster

      consensus algos can have extensions to support dynamic membership – less understood

    4. typically reliant on timeouts to detect failed nodes – so if the network delays are too varied, then it’s hard to choose a correct timeout duration

      this variability can result in many false positives in detecting node failures

      doesn’t affect correctness but can add extra overheads because need to work around these things

    5. particular sensitivities to network problems:

      1. Raft: if the entire network works correctly except for one consistently unreliable link, leadership may bounce between two nodes / the current leader is continually forced to resign \(\implies\) system can’t actually make effective progress

Membership and Coordination Services #

  • “distributed key-value stores” / “coordination and configuration services” – what things like ZooKeeper/etcd are called

    • their APIs feel like DB apis
    • they implement consensus algorithms
  • thing is, the reliance on such tools is likely to be indirect (as a side effect of using another tool like Hadoop or Kafka that depends on them)

  • these tools are designed to work best with in-memory size data (even though they might persist on disk for durability)

    this data is what is replicated across all the nodes using a fault-tolerant total order broadcast system
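The replication described above amounts to a replicated state machine: if every replica applies the same deterministic writes in the same total order, all replicas converge to identical state. A minimal illustrative sketch (names invented):

```python
# Sketch of the replicated-state-machine idea behind total order
# broadcast: every replica applies the same deterministic log of writes
# in the same order, so all replicas end up with identical state.

def apply_log(log):
    state = {}
    for key, value in log:   # log order is the agreed-upon total order
        state[key] = value   # deterministic write
    return state

log = [("config", "v1"), ("leader", "node-a"), ("config", "v2")]

replica1 = apply_log(log)
replica2 = apply_log(log)
assert replica1 == replica2 == {"config": "v2", "leader": "node-a"}
```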

  • some other features that ZooKeeper gives:

    1. basic stuff: total order broadcast (therefore consensus)

      fault-tolerant data replication, same writes in the same order to make replicas consistent with each other

    2. linearizable atomic operations (really, this is the only feature that requires consensus)

      • using an atomic compare-and-set operation, we can implement a lock

      • the consensus protocol guarantees that the operation is atomic and linearizable

      • a distributed lock is usually implemented as a lease, with an expiry time so that it’s eventually released if the client fails (see failure cases)
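The lock-as-lease idea can be sketched against a hypothetical linearizable compare-and-set register (the `CasRegister` class here is an invented in-process stand-in; in a real system the compare-and-set would be backed by the consensus service):

```python
import time

# Invented in-process stand-in for a linearizable compare-and-set
# register; in a real system this operation would be backed by consensus.
class CasRegister:
    def __init__(self):
        self.value = None

    def compare_and_set(self, expected, new):
        if self.value == expected:
            self.value = new
            return True
        return False

def try_acquire_lease(register, owner, ttl_seconds, now=None):
    # A lease is a lock with an expiry time, so a crashed client
    # cannot hold it forever.
    now = time.monotonic() if now is None else now
    current = register.value
    if current is not None and current[1] > now:
        return False  # lease still held and unexpired
    # Either the lock is free or the previous lease has expired:
    # atomically swap in our own (owner, expiry) lease.
    return register.compare_and_set(current, (owner, now + ttl_seconds))

lock = CasRegister()
assert try_acquire_lease(lock, "client-a", ttl_seconds=10, now=0.0)
assert not try_acquire_lease(lock, "client-b", ttl_seconds=10, now=5.0)
assert try_acquire_lease(lock, "client-b", ttl_seconds=10, now=11.0)  # expired
```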

    3. total ordering of operations

      • ref “the leader and the lock”: we need a fencing token to prevent clients from conflicting with each other when there are process pauses; ZooKeeper provides this by totally ordering all operations and stamping each with a monotonically increasing transaction ID (zxid)
    4. failure detection

      • each client maintains a long-lived session on the ZooKeeper servers with a periodic heartbeat exchange; if heartbeats stop for longer than the session timeout, the session is declared dead

      • ZooKeeper also offers ephemeral nodes: keys that are automatically deleted when the session that created them dies

    5. change notifications

      • clients can subscribe to (watch for) changes made by other clients – e.g. a new node joining – which saves the overhead of polling
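The fencing idea from point 3 can be sketched as follows: the lock service hands out a strictly increasing token with every grant, and the protected resource rejects writes carrying a stale token. Class names here are invented for illustration:

```python
# Sketch of fencing tokens: the lock service hands out a strictly
# increasing token with every grant, and the storage service rejects
# any write carrying a token older than one it has already seen.
# (Names invented for illustration.)

class LockService:
    def __init__(self):
        self.token = 0

    def grant(self):
        self.token += 1   # e.g. ZooKeeper's zxid can play this role
        return self.token

class Storage:
    def __init__(self):
        self.max_token_seen = 0
        self.data = None

    def write(self, token, value):
        if token < self.max_token_seen:
            return False  # stale client (e.g. one that was paused by GC)
        self.max_token_seen = token
        self.data = value
        return True

lock, storage = LockService(), Storage()
t_old = lock.grant()   # client 1 acquires the lock, then pauses
t_new = lock.grant()   # its lease expires; client 2 acquires
assert storage.write(t_new, "from client 2")
assert not storage.write(t_old, "from client 1")  # stale token fenced off
```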

Allocating work to nodes #

  • a feature useful for job schedulers / stateful systems is for follower nodes to automatically get promoted to leaders

  • rebalancing load in a partitioned resource: when new nodes join the cluster, some partitions need to be reassigned to them; when nodes are removed or fail, other nodes need to take over their partitions

  • should use applications like Apache Curator that offer higher-level APIs on top of ZooKeeper client APIs instead of trying to implement the consensus algos from scratch

  • an application may run on thousands of nodes, but only a fixed small number of ZooKeeper nodes (3–5) perform the voting, while supporting that large number of clients

    so the coordination work is “outsourced” to this small set of nodes

Service Discovery #

  • these systems used for service discovery as well
  • at time of writing, it’s not clear if service discovery needs consensus, DNS being the OG way of looking up IPs for services
  • in the case of DNS,
    • DNS reads are not linearizable
    • sometimes tolerated if DNS reads are stale
    • more important that DNS is reliably available, robust to network interruptions
  • Leader-election needs consensus though, so can use the consensus systems to tell other services who the leader is \(\implies\) it makes sense to have RO caching replicas of consensus systems to read this information

Membership Services #

  • membership services is the category that tools like ZooKeeper and friends would fall under
  • membership service determines which nodes are currently active and live members of a cluster.
    • typically unbounded network delays make it hard to detect whether a node has failed BUT…
    • with failure detection and consensus, nodes can come to an agreement about who should be considered alive or not
    • false positives on dead-node detection are still possible, but that’s tolerable
    • to be even able to do leader election, the first step is to have consensus on who the current members are

Summary #

  1. terms :

    • linearizability: a popular consistency model; its goal is to make replicated data appear as though there were only a single copy, and to make all operations act on it atomically

    • causality: the happened-before relationship; a weaker consistency model than linearizability

      causal ordering is not always sufficient (e.g. unique username selection) \(\implies\) need consensus

    • consensus

  2. range of problems that are reducible to consensus (there’s some equivalence):

    1. Linearizable compare-and-set registers
      • register must atomically decide whether to set its value, based on whether its current value equals a given expected value
    2. Atomic transaction commit
      • for distributed transactions, db must decide whether to commit or abort the transaction
    3. Total order broadcast
      • messaging system must decide on the order in which to deliver messages
    4. Locks and leases
      • several clients trying to grab a lock / lease, then lock decides which one acquires it successfully
    5. Membership / Coordination service
      • System must decide, given a failure detection mechanism like timeouts, which nodes are members of the distributed system
    6. Uniqueness constraint
      • multiple transactions trying to create conflicting records for the same key, constraint is to decide which one to allow and which to fail with a constraint violation
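As a sketch of reduction 6: once claim requests are appended to a totally ordered log (via total order broadcast), every node can independently process the log and agree on which claim won. Function names are invented; illustrative only:

```python
# Sketch of enforcing a uniqueness constraint with a total order
# broadcast log: every node processes the same claims in the same
# order, so all nodes agree on which claim succeeded.

def decide_claims(ordered_log):
    owners, results = {}, []
    for client, key in ordered_log:   # the agreed-upon total order
        if key not in owners:
            owners[key] = client      # first claim for this key wins
            results.append((client, key, "ok"))
        else:
            results.append((client, key, "constraint violation"))
    return results

log = [("alice", "@cool_name"), ("bob", "@cool_name"), ("bob", "@other")]
assert decide_claims(log) == [
    ("alice", "@cool_name", "ok"),
    ("bob", "@cool_name", "constraint violation"),
    ("bob", "@other", "ok"),
]
```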
  3. single leader systems are simpler, if leader fails we have the following options:

    1. wait for the leader to recover

      but this doesn’t satisfy the termination property for consensus; can possibly be blocked forever

      e.g. XA/JTA txn coordinators do this

    2. manual human failover

      typically done by relational dbs; consensus by “act of God”; speed of failover is limited by the speed at which the human can act, which is slower than computers

    3. use an algo for leader election

      needs a consensus algo, use a proven algo

  4. who doesn’t need consensus?

    • leaderless and multi-leader replication systems

      typically don’t need global consensus; they encounter conflicts, but that can be tolerable depending on the application

      maybe we simply need to cope without linearizability and learn to work better with data that has branching and merging version histories.