
Chapter 9. Consistency and Consensus

  • just as transactions are a general-purpose abstraction for solving a whole class of problems, distributed computing problems can also be solved by general-purpose abstractions:

    1. abstraction of consensus
  • first we explore the range of guarantees and abstractions that we can have in distributed systems; to get the scope of what can be done and what can’t be done

  • this book just focuses on general intuition without going into correctness-proofs

Consistency Guarantees #

  • eventual consistency is a weak guarantee; it just says replicated dbs will have temporary inconsistency

    also a better naming could be convergence

    when working with db that provides only weak guarantees, we have to be sensitive about its limitations and not assume too much

    edge cases of eventual consistency only become apparent when there is a fault in the system e.g. high concurrency / network interruption

  • no free lunch for stronger consistency models

    stronger consistency models exist but they may come at the expense of performance / less fault-tolerance

  • the hierarchy of distributed consistency models and that of transaction isolation levels overlap in some ideas

    difference:

    • transaction isolation is about avoiding race conditions that come from concurrently executing transactions

    • distributed consistency is about coordinating the state of replicas in the face of delays and faults

Linearizability: a recency guarantee #

  • it’s a recency guarantee

  • idea: since replicas exist and there’s a chance that if we ask 2 different replicas the same question, we may get different answers,

    we can strive for a situation where there’s an illusion of only a single db; every client would then get the same answer without worrying about replication lag

    basic idea is to make a system appear as if there were only one copy of the data, and all operations on it are atomic. With this guarantee, even though there may be multiple replicas in reality, the application does not need to worry about them.

    • as soon as one client completes a write successfully, all clients reading from the db must be able to see the value just written

      so this value can’t come from a stale cache / replica

What Makes a System Linearizable? #

  • terminology: the “variable” is better called a register, which may be a kv-pair or a row in a table …

    • consider the operations w.r.t register:
      • read
      • write
      • cas(x, \(v_{old}\), \(v_{new}\)) (compare-and-set)
        • atomic operation
        • if \(x \neq v_{old}\), the operation leaves the register unchanged and returns an error to the client
        • if \(x = v_{old}\), atomically set the register to \(v_{new}\)
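A minimal single-process sketch of such a register in Python (hypothetical, not from the book; a lock stands in for whatever mechanism makes each operation atomic):

```python
import threading

class Register:
    """One register (e.g. a key in a KV store, or a row in a table)."""

    def __init__(self, value=None):
        self._value = value
        self._lock = threading.Lock()  # makes each operation atomic

    def read(self):
        with self._lock:
            return self._value

    def write(self, value):
        with self._lock:
            self._value = value

    def cas(self, v_old, v_new):
        """Atomically set to v_new only if the current value equals v_old."""
        with self._lock:
            if self._value != v_old:
                return False  # register unchanged; client sees an error
            self._value = v_new
            return True
```

With a fresh register, `cas(None, "alice")` succeeds and a second `cas(None, "bob")` fails, which is exactly the primitive a uniqueness constraint needs.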
  • regular register: a register in which reads may return either the old or new value if the read is concurrent with a write

  • The requirement of linearizability is that the lines joining up the operation markers always move forward in time (from left to right), never backward.

    This requirement ensures the recency guarantee we discussed earlier: once a new value has been written or read, all subsequent reads see the value that was written, until it is overwritten again.

  • It is possible (though computationally expensive) to test whether a system’s behavior is linearizable by recording the timings of all requests and responses, and checking whether they can be arranged into a valid sequential order

Linearizability vs Serializability #

  • similar sounding guarantees because both have something to do with “arranging in a sequential order”, but they are different guarantees

  • Serializability:

    • isolation property of transactions
    • every txn may R/W multiple objects
    • guarantee: txn behave the same as if they had executed in some serial order (w.r.t transactions). It’s alright for that order to be different from the order in which txns were actually run
  • Linearizability

    • recency guarantee on R/W of a register (individual object)
    • no grouping of operations into txns, doesn’t solve write-skew and all unless other steps (e.g. materializing conflicts) are taken
  • strict serializability aka strong one-copy serializability (strong-1SR): when a db provides both serializability and linearizability

    • implementation e.g. 2PL or actual serial execution that make the db typically linearizable
  • serializable snapshot isolation is not linearizable

    • it reads from a consistent snapshot so as to avoid lock contention b/w readers and writers
    • it doesn’t include writes that are more recent than the snapshot so it reads from snapshots that are not linearizable

Relying on Linearizability #

cases where linearizability matters for system correctness:

  1. locking and leader election

    leader locks have to be linearizable

    typically every node that spins up tries to acquire the lock and the one that succeeds is the leader

    a linearizable storage service is the foundation for these coordination tasks

    distributed locking can be done at the page-level on the storage system

  2. constraints and uniqueness guarantees

    • if we want to enforce constraints (e.g. uniqueness constraints) as the data is written, then we need linearizability

    • similar situation to a lock; can see it as “locking” the value in a column (e.g. the username)

    • the operation is similar to a compare-and-set also

    • also relevant when constraints such as “min value” need to be enforced (e.g. min val in bank, last item sale,…)

      • for some of these constraints, if it’s alright for the constraint to be slightly looser, then we don’t need linearizability
    • fk / attribute constraints can be implemented without requiring linearizability though

  3. cross-channel timing dependencies

    • in some cases, the detection of the linearizability violation happens because there’s an additional communication channel in the system (cross-channel)

    • given example: a webserver that accepts image uploads and works together with an image resizer

      in this system the webserver and the image resizer communicate both through file-storage and via a message queue, which now opens up the possibility of a race condition

      if the file storage service is linearizable, then the system works alright

      else, there’s the risk of race conditions:

      • e.g. the message queue is faster than the internal replication inside the storage service, so when the resizer fetches the image, it might see an old version of the image – the full-size and resized images in the file-storage become permanently inconsistent
    • in example, 2 comms channels between the webserver and the resizer: file storage and the message queue

      race condition happens if there’s no recency guarantee of linearizability

    • this problem can be solved in other ways, but linearizability is a simple enough way to solve it

Implementing Linearizable Systems #

  • most common approach is to use replication so that systems are fault-tolerant, so the replication methods can be looked at to see if they can be made linearizable:
    1. single leader replication: potentially linearizable

      • leader has primary copy, followers have backup copies on other nodes

      • reads from the leader / synchronously updated followers can potentially be linearizable

      • using the leader for reads \(\implies\) we know who the leader is

      • situations when linearizability is violated:

        1. a delusional leader (even if it eventually realises that it’s not the leader) will violate linearizability while it keeps serving requests

        2. when it’s async replication, then failover may lose committed writes, which will violate both durability and linearizability

    2. consensus algos: linearizable

      • some of these algos feel like single-leader replication but they have mechanisms that prevent split brain and stale replicas

        so, consensus algos can implement linearizable storage safely

    3. multi-leader replication: not linearizable

      concurrent processing of writes on multiple nodes + async replication to other nodes = conflicting writes that require resolution

      conflicts – artifact of the lack of a single copy of the data

    4. leaderless replication: generally not linearizable

      • the claim is that we can get “strong consistency” if we require quorum reads and writes (w + r > n)

        depends on:

        • config of the quorum
        • how strong consistency is defined
      • LWW (last write wins) conflict resolution usually cannot preserve linearizability because clock timestamps can’t be guaranteed to be consistent with the actual ordering of events, because of clock skew

      • sloppy quorums don’t allow linearizability

        there’s situations when even strict quorums can still violate linearizability as well

Linearizability and quorums #

  • it’s possible to get a violation

  • it’s also possible to add in linearizability (feasible only for read and write operations, not for compare-and-set operations) by doing the following:

    1. readers: add in read repair mechanisms for readers which is synchronous (so performance takes a hit) before returning read results to the application

    2. writer: must read the latest state of a quorum of nodes before sending it writes
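A toy sketch of the synchronous read-repair idea, with replicas modeled as plain in-memory dicts mapping key → (version, value) (the data model and names are illustrative assumptions, not the book’s code):

```python
def quorum_read(replicas, key, r):
    """Read key from r replicas; synchronously repair stale replicas
    before returning the result to the application (that synchrony is
    the performance cost mentioned above)."""
    contacted = replicas[:r]
    # each replica answers with its (version, value) for the key
    responses = [(node, node.get(key, (0, None))) for node in contacted]
    latest_version, latest_value = max(vv for _, vv in responses)
    # synchronous read repair: write the newest value back to stale replicas
    for node, (version, _) in responses:
        if version < latest_version:
            node[key] = (latest_version, latest_value)
    return latest_value
```

After a read, every contacted replica holds the latest (version, value), so a later read of the same quorum cannot go backwards in time for that key.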

The Cost of Linearizability #

  • need to understand pros and cons to get the tradeoffs of this guarantee

  • network interruptions force a choice between availability and linearizability

    on interruption of network between data-centers:

    1. if multi-leader db:

      within data-center, can function normally

      the writes are queued up and exchanged when network connectivity (b/w data-centers) is restored

    2. if single-leader replication: depends on whether we connect to the data center containing the leader or not

      leader must be in one of the data-centers

      so any writes and any linearizable reads must be sent to the leader – for clients connected to a follower data center, those requests must be sent synchronously over the network to the leader data center

      if interrupted, clients connected to the follower data centers can’t contact the leader so can’t make any writes nor linearizable reads

      can still make reads from follower but the data may be stale (non-linearizable)

      no issue if clients are connected to the datacenter that has the leader within it, BAU there

The CAP Theorem #

  • linearizability issue: this tradeoff arises regardless of single/multi-leader replication, in single- or multi-datacenter deployments, and even within one datacenter if its network is unreliable

    tradeoffs:

    1. either linearizable (consistent) or available under network partitions
      1. if app requires linearizability

        if network problem e.g. disconnected network partition \(\implies\) some replicas can’t process requests while disconnected \(\implies\) they must wait until the network problem is fixed or they must return an error \(\implies\) service is unavailable

      2. no need linearizability

        can be written such that each replica can independently process, even if disconnected (e.g. multi-leader)

        so, application can be available in face of network problem but behaviour is not linearizable

        more tolerant of network problems

  • CAP was supposed to be a rule of thumb, more of a discussion of tradeoffs

  • the unhelpful framing of CAP:

    typically presented as:

    CAP is sometimes presented as Consistency, Availability, Partition tolerance: pick 2 out of 3. Unfortunately, putting it this way is misleading [32] because network partitions are a kind of fault, so they aren’t something about which you have a choice: they will happen whether you like it or not

    so a better framing of this set of tradeoffs should be:

    a better way of phrasing CAP would be either Consistent or Available when Partitioned

  • in its formal definition, the CAP theorem has a very narrow scope where it’s relevant:

    only when the consistency model is linearizability and when the type of fault is network partitions

  • there are more important things to think about when designing distributed systems instead of mulling that much over CAP theorem

  • there’s a newer acronym to help guide us PACELC (ref wikipedia for the design)

    If there’s a Partition (P), we have to choose between Availability (A) and Consistency (C)

    Else (E),

    When the system is healthy, then we have to choose between Latency(L) and Consistency(C)

    There are four configurations or trade-offs in the PACELC space:

    • PA/EL - prioritize availability and low latency over consistency
    • PA/EC - when there is a partition, choose availability; else, choose consistency
    • PC/EL - when there is a partition, choose consistency; else, choose lower latency
    • PC/EC - choose consistency at all times
    
      ---
    
      ## 1. CAP, restated precisely (no folklore)
    
      **Formal CAP says**:
    
      > In the presence of a *network partition*, a system that provides *linearizability* cannot also provide *availability*.
    
      Key constraints:

      * **Consistency = linearizability**, not “strong consistency”, not serializability, not read-your-writes.
      * **Availability** means *every non-failing node eventually returns a non-error response*.
      * **Partition tolerance** is not optional: partitions are faults.
    
      So CAP is not:

      * a taxonomy of databases
      * a performance model
      * a latency argument
      * a general statement about “distributed systems are hard”

      It’s a **liveness vs safety** statement under a *very specific* failure mode.

      That narrowness is exactly why people felt dissatisfied and invented PACELC.
    
      ---
    
      ## 2. PACELC: what it adds (and why it was attractive)
    
      PACELC reframes the discussion as:
    
      > **If there is a Partition (P)**, choose between **Availability (A)** and **Consistency (C)**
      > **Else (E)**, when the system is healthy, choose between **Latency (L)** and **Consistency (C)**
    
      So:

      * **PA/PC** → what you do *during partitions*
      * **EL/EC** → what you optimize *during normal operation*

      This adds something CAP never talked about: **the steady state**, which is where systems live 99.99% of the time.
    
      ### Why this felt like progress

      Because in practice:

      * Engineers care *far more* about:
        * tail latency
        * quorum round trips
        * geo-replication delays
      * CAP only talks about a rare failure mode.

      PACELC made it acceptable to say:

      * “We choose to sacrifice linearizability **even when there is no partition** because we want lower latency.”

      That’s a real, everyday tradeoff.
    
      ---
    
      ## 3. Where PACELC still falls short
    
      This is the important part: these frameworks are not the whole story.
    
      ### 3.1 It still treats “consistency” as a single axis
    
      PACELC inherits CAP’s biggest sin:
    
      > **Consistency is not one thing**
    
      It collapses:

      * linearizability
      * causal consistency
      * session guarantees
      * snapshot isolation
      * serializability
      * application-level invariants

      …into a single C/¬C switch.

      But in reality:

      * Many systems give up **linearizability** while preserving **causality**
      * Many applications don’t need linearizability, but *do* need:
        * monotonic reads
        * read-your-writes
        * bounded staleness

      PACELC can’t express this.
    
      ---
    
      ### 3.2 It ignores *what* you are coordinating
    
      Both CAP and PACELC assume:
    
      > “The system is a single blob with one consistency model.”
    
      But DDIA’s core insight is that **different operations need different guarantees**. Examples:

      * Leader election → needs linearizability
      * Analytics queries → don’t
      * Counters → maybe CRDT
      * Uniqueness constraints → CAS-style linearizability
      * Caches → eventual consistency is fine

      Real systems are **heterogeneous**, not point-labeled on a 2D grid.
    
      ---
    
      ### 3.3 It ignores correctness vs performance separation
    
      PACELC still frames everything as a **performance tradeoff**.
    
      But many consistency choices are about:

      * **correctness**
      * **invariant preservation**
      * **bug avoidance**

      Examples (from the notes above):

      * Cross-channel timing bugs
      * Distributed locks
      * Uniqueness constraints

      These are not “performance preferences”; they are **semantic requirements**.
    
      PACELC has no vocabulary for:
    
      > “This part of the system *must not* be wrong.”
    
      ---
    
      ### 3.4 It ignores failure modes beyond partitions
    
      PACELC, like CAP, is obsessed with partitions.
    
      But real systems fail due to:

      * clock skew
      * GC pauses
      * slow replicas
      * partial disk failure
      * Byzantine clients (sometimes)
      * misconfiguration
      * overload collapse

      Linearizability can fail *without* a partition.
      Availability can fail *without* a partition.
      PACELC doesn’t model this.
    
      ---
    
      ## 4. What replaces CAP/PACELC in DDIA thinking?
    
      Not another acronym — but a **shift in mindset**.
    
      ### 4.1 Think in *abstractions*, not labels
    
      DDIA’s move is:

      * Stop classifying databases
      * Start reasoning about **abstractions**:
        * linearizable register
        * atomic CAS
        * transaction
        * snapshot
        * log
        * consensus
    
      Then ask:
    
      > Which abstraction does my problem *actually* require?
    
      ---
    
      ### 4.2 Think in *failure semantics*
    
      Instead of:
    
      > “Is this AP or CP?”
    
      Ask:
    
      * What happens during:
        * message delay?
        * leader crash?
        * failover?
        * clock skew?
      * What invariants break?
      * What errors surface to users?
    
      This is far more predictive than CAP labels.
    
      ---
    
      ### 4.3 Think in *application invariants*
    
      The strongest DDIA lesson:
    
      > Consistency models exist to protect **invariants**, not to win theoretical arguments.
    
      Examples:

      * “Only one leader exists”
      * “Stock never goes below zero”
      * “Email is sent once”
      * “Usernames are unique”
    
      Once you state the invariant, the needed consistency often becomes obvious — CAP or PACELC be damned.
    
      ---
    
      ## 5. Bottom line

      The notes above implicitly conclude:

      > CAP is correct, but not a useful primary design tool.

      That conclusion:

      * ✔️ Matches DDIA
      * ✔️ Matches modern distributed systems practice
      * ✔️ Explains why people keep rediscovering the same bugs despite “knowing CAP”

      PACELC is a **better story**, but still a story, not a design method.
    
    

Linearizability and network delays #

  • funny thing is that even RAM on multi-core CPUs is not linearizable

    unless a memory barrier or memory fence is used, it’s not linearizable

    why?

    • every CPU core has its own memory cache and store buffer

    • mem access reaches a cache first and any changes async written out to main memory

    • since data access in cache is much faster than going to memory

    • now there’s several copies of the data (main memory, several caches) and the copies are async updated so linearizability is lost

    • reason for dropping linearizability is to prioritize performance

  • Linearizability is slow all the time, not just during a network fault

  • proof that if you want linearizability, the response time of R/W requests is at least proportional to the uncertainty of delays in the network

    • high variable delays (most networks), response time of linearizable reads and writes will be high
  • KIV chapter 12 for approaches for avoiding linearizability without sacrificing correctness

Ordering Guarantees #

  • ordering is a fundamental concept that has popped up in a bunch of contexts:

    1. leader in the single leaders replications determines the order of writes in the replication log

      • the order in which to apply the writes
    2. serializability: ensuring txns behave as though they were executed in some sequential order

      2 ways to achieve:

      • literally executing txns in serial order

      • allowing concurrent executing while preventing serialization conflicts (by locking / aborting)

    3. in the context of timestamps and clocks in distributed systems

      attempts to determine which one of two writes happened later

  • there are deep connections between ordering, linearizability, and consensus.

Ordering and Causality #

  • Ordering helps preserve causality.

    ordering on events: cause comes before effect; a message is sent before that message is received; the question comes before the answer.

  • Some patterns we’ve observed in the book so far about causality:

    1. causal dependency between questions and answers

    2. updates to a record must happen after they exist (creation of that record)

    3. the happened-before relationship is an expression of causality

      given 2 operations A and B

      1. A before B

      2. B before A

      3. A and B concurrent \(\implies\) no causal link b/w the two operations

    4. the consistency in consistent snapshots (transactions, snapshot isolation) is consistency with causality

      so no read skew (i.e. non-repeatable reads, reading data in a state that violates causality)

      observing the entire db at a single point in time makes it consistent with causality:

      • effects of all ops happening before that time are visible

      • no effects of operations happening after that time are visible

    5. write skew b/w txns show causal dependencies

      where the write action is causally dependent on the observation of who’s currently on call

      SSI (serializable snapshot isolation) detects this write skew by tracking the causal dependencies b/w txns

    6. causal violation that was detected because of cross-channel timing dependencies

      e.g. the sports example where the exclamation of the result (comms channel) was the reason why the violation could be detected

  • causally consistent: system obeying the ordering imposed by causality

Causal order is not a total order #

  • Total ordering: any 2 elements can be compared

    mathematical sets, by contrast, are only partially ordered

    i.e. in some cases one set is greater than another (if one set is a superset of the other) and in all other cases (e.g. {a, b} vs {b, c}) they are incomparable
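Python’s built-in sets happen to implement exactly this partial order (the comparison operators test the subset relation), which makes a quick illustration possible:

```python
a, b, c = {1, 2}, {1, 2, 3}, {2, 3}

# comparable: a is a strict subset of b, so a < b
assert a < b and not (b < a)

# incomparable: neither {1, 2} nor {2, 3} is a subset of the other,
# so both comparisons are False -- a partial order, not a total one
assert not (a < c) and not (c < a)
```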

  • diff b/w total and partial order is shown in the diff db consistency models

    1. linearizability \(\implies\) total ordering of operations

      if system behaves like it has a single copy and every operation is atomic then all elements (operations) are comparable \(\implies\) that’s why it can be illustrated as a timeline

    2. causality \(\implies\) partial ordering

      2 events ordered if they are causally related (one before another), incomparable if they’re concurrent \(\implies\) partial ordering

  • Implication: no concurrent operations in a linearizable datastore – must be a single timeline along which all operations are totally ordered

    still can have multiple requests being queued and all, just the datastore does everything atomically, on a single copy of data, along a single timeline without any concurrency

  • concurrency – it’s like a branching and merging of timelines

    very much like git

    • operations on different branches are incomparable

Linearizability is stronger than causal consistency #

  • linearizability implies causality

    that’s great, easy to understand and all

    but performance hits and problems with availability is why distributed data systems don’t prioritise linearizability

  • possible to be causally consistent without incurring the performance hit of making things linearizable

  • causal consistency is the strongest possible consistency model that does not slow down due to network delays, and remains available in the face of network failures

    most systems that appear to need linearizability just need causal consistency actually

Capturing causal dependencies #

Just some key ideas

  • to maintain causality we just need to maintain the partial order, so that we can determine the happened-before relationship

  • ways to determine the happened-before relationship b/w operations are similar to the detection of concurrent writes (ref)

    going beyond just the detection of concurrent writes to the same key, and generalizing version vectors, is how we can do this.

    examples of using version vectors in other situations:

    • e.g. version number of prior operations is passed back to the db on a write

    • e.g. in the conflict detection of SSI, a txn that wants to commit will pass the version of the data that it has read and checks if it’s still up to date

Sequence Number Ordering #

  • actually keeping track of ALL causal deps can be impractical because of large overhead, so we pick something simpler (sequence numbers / timestamps)

    • compact values
    • preserve total order (every operation with a unique sequence number)
    • 2 sequence numbers are comparable
  • use a logical clock for this

  • total order that is consistent with causality: the happened-before relationship is preserved, concurrent operations may be ordered arbitrarily.

    • this captures all the causality info
  • the replication log in a single-leader replicated db defines a total ordering of write operations that is consistent with causality

    typically a monotonic counter is used by the leader

    followers may lag but the causal consistency is still guaranteed

Noncausal sequence number generators #

  • harder to generate sequence numbers for operations when the db is NOT single-leader replication (e.g. multi-leader / leaderless replication)

  • some ways:

    1. each node generates its independent set of sequence numbers

      e.g. if 2 nodes, then even/odd number sequences

      problem: can’t compare if even-odd number

    2. attach timestamp from a time-of-day (physical) clock to each operation and make sure the resolution is high enough – might be sufficient enough to totally order operations even

      problem: might have clock skew when using physical clocks

      used in the conflict resolution strategy behind last write wins (ref)

    3. preallocate blocks of sequence numbers; each node assigns numbers from its own block

  • problem: sequence numbers generated by all these methods are not consistent with causality

    this is because they preserve ordering within a node but not across nodes
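Approach 1 can be sketched as follows (hypothetical helper; with two nodes it yields the even/odd sequences mentioned above):

```python
def make_sequence_generator(node_id, num_nodes):
    """Each node independently emits node_id, node_id + n, node_id + 2n, ...
    Numbers are unique across nodes, but the interleaving across nodes
    says nothing about causality."""
    def gen():
        counter = node_id
        while True:
            yield counter
            counter += num_nodes
    return gen()

evens = make_sequence_generator(0, 2)  # node 0 emits 0, 2, 4, ...
odds = make_sequence_generator(1, 2)   # node 1 emits 1, 3, 5, ...
```

This shows the problem concretely: a write stamped 3 on node 1 may causally depend on a write stamped 4 on node 0, so the numeric order is not the causal order.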

Lamport Timestamps #

  • lamport timestamp is pretty simple: (counter, nodeID)

  • it’s a stamp of a logical clock that provides total ordering

    if time-value within the timestamp is the same, then we just pick the one with the greater node ID

    this part is similar to the even-odd numbers approach

  • improvement: every node and every client keeps track of the maximum counter value it has seen so far, and includes that maximum on every request. When a node receives a request or response with a maximum counter value greater than its own counter value, it immediately increases its own counter to that maximum.

    • as long as the max counter value seen is carried along with every operation, the causal consistency is present

      because every causal dependency results in an increased timestamp

  • lamport timestamps are not version vectors, their purpose is different

    1. version vectors: to distinguish whether two operations are concurrent or if one is causally dependent on the other

    2. Lamport timestamps: to enforce a total ordering

      \(\implies\) this total ordering is why it’s not possible to tell if 2 operations are concurrent or causally dependent

Timestamp ordering is not sufficient #

  • in some cases, the distributed nature of things means that total ordering (via lamport timestamps) is not sufficient

  • problem here is that the total order of operations only emerges after you have collected all of the operations. If another node has generated some operations, but you don’t yet know what they are, you cannot construct the final ordering of operations: the unknown operations from the other node may need to be inserted at various positions in the total order.

  • in order to implement something like a uniqueness constraint for usernames, it’s not sufficient to have a total ordering of operations – you also need to know when that order is finalized.

  • that’s what total order broadcast helps us do

Total Order Broadcast aka atomic broadcast #

  • to have the whole system (all nodes) agree on the same total ordering of operations

    AND

    to scale the system if the throughput is greater than what a single leader can handle

    AND

    handle leader failovers

    \(\implies\) total order broadcast / atomic broadcast problem

Using total order broadcast #

  • KIV for consensus section but there’s a strong connection between total order broadcast and consensus

  • alternative lens: total order broadcast helps create a log (replication log, txn log, WAL) and delivering a message is like appending to that log

  • can be used to do state machine replication: is what is needed for db replication

    • for replicas to be consistent with each other the ordering must be consistent with each other so that the writes are processed in the same order
  • can be used to implement serializable transactions

  • can be used to implement a lock service that gives fencing tokens (ref)

    • sequence number can serve as the fencing token

    • the log for fencing token requests is yet another log

  • it is stronger than timestamp ordering because the ordering is fixed at the time the messages are delivered – no node can retroactively insert a message into an earlier position

Implementing linearizable storage using total order broadcast #

  • recap:

    • total order broadcast is async: messages get delivered reliably in a fixed order but there’s no guarantee when it gets delivered so there may be replication lags
    • linearizability is more of a recency guarantee
  • if we have total order broadcast, we can build linearizable storage ontop of it:

    e.g. in the case of usernames, we could try to do a compare-and-set with original value = null so only one of the compare-and-sets will succeed. This requires a linearizable register on which we can do atomic compare-and-set operations

    • implementing timeline/sequential consistency (a slightly weaker guarantee than linearizability) using total order broadcast as an append-only log:

      1. append message to log, indicating that uname you want to claim

      2. read the log, wait for the message you appended to deliver back to you

      3. check for any messages claiming the desired uname: if the first such message is ours, we succeed; otherwise abort the operation

      this helps guarantee linearizable writes but NOT linearizable reads (reading from a store that is updated async risks stale reads)

    • how to linearize reads:

      1. sequence the reads using the log, perform the read when the actual message is delivered back to you

        so message’s position in time is when the read happens

      2. if log allows fetching of the position of the latest log message in a linearizable way, then we can query that position and then wait for all the entries up to that position to be delivered then perform the read – that’s how ZooKeeper’s sync works

      3. read from a replica that is synchronously updated on writes and up to date – used in chain replication
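
The username-claim steps above can be sketched in a few lines (a toy, single-process model; names are hypothetical): every claim is appended to a shared ordered log, and a claim succeeds only if it is the first message in the log for that username.

```python
class OrderedLog:
    """Toy stand-in for total order broadcast: an append-only log
    that every node sees in the same order."""
    def __init__(self):
        self.messages = []

    def append(self, msg):
        self.messages.append(msg)
        return len(self.messages) - 1  # position of our message

def claim_username(log, node_id, username):
    """1. append a message claiming the username
       2. 'wait' for our message to be delivered back (trivial here)
       3. succeed only if the FIRST claim for this username is ours"""
    pos = log.append((node_id, username))
    for i, (claimant, name) in enumerate(log.messages[:pos + 1]):
        if name == username:
            return i == pos  # first claim wins
    return False  # unreachable: our own claim is always found

log = OrderedLog()
print(claim_username(log, "node-1", "alice"))  # True: first claim wins
print(claim_username(log, "node-2", "alice"))  # False: someone beat us
print(claim_username(log, "node-2", "bob"))    # True
```

Because all nodes scan the same log in the same order, they all agree on which claim came first, with no extra coordination; this gives linearizable writes, while reads still need one of the three techniques listed above.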

Implementing total order broadcast using linearizable storage #

  • the converse: given linearizable storage, how to build total order broadcast from it

  • ways:

    1. assume we have a linearizable register that stores an integer and has an atomic increment-and-get operation

    2. use an atomic compare-and-set operation

  • algo: for every message you want to send through total order broadcast, you increment-and-get the linearizable integer, and then attach the value you got from the register as a sequence number to the message. You can then send the message to all nodes (resending any lost messages), and the recipients will deliver the messages consecutively by sequence number.

    • the number we get from incrementing a linearizable register will form a sequence with no gaps
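
The algorithm above can be sketched as follows (a single-process model; class names are hypothetical): a lock-protected counter stands in for the linearizable register, and each recipient buffers out-of-order messages until the gapless sequence lets it deliver them consecutively.

```python
import threading

class LinearizableCounter:
    """Linearizable register with atomic increment-and-get
    (a lock makes it trivially linearizable within one process)."""
    def __init__(self):
        self._lock = threading.Lock()
        self._value = 0

    def increment_and_get(self):
        with self._lock:
            self._value += 1
            return self._value

class Recipient:
    """Delivers messages strictly in sequence-number order,
    buffering any that arrive early."""
    def __init__(self):
        self.next_seq = 1
        self.pending = {}
        self.delivered = []

    def receive(self, seq, msg):
        self.pending[seq] = msg
        # deliver consecutively: the gapless numbering guarantees
        # we never wait forever for a missing sequence number
        while self.next_seq in self.pending:
            self.delivered.append(self.pending.pop(self.next_seq))
            self.next_seq += 1

counter = LinearizableCounter()
node = Recipient()

# Senders tag each message with a gapless sequence number.
seq_a = counter.increment_and_get()  # 1
seq_b = counter.increment_and_get()  # 2

node.receive(seq_b, "msg B")  # arrives early, buffered
node.receive(seq_a, "msg A")  # fills the gap, both deliver in order
print(node.delivered)         # ['msg A', 'msg B']
```

The gapless property is what distinguishes this from Lamport timestamps: a recipient that sees sequence number 4 knows it must wait for 3, so no message can be retroactively slotted in.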
  • it’s difficult to make a linearizable integer with an atomic increment-and-get operation because the difficulty lies in handling failure cases, when nodes fail:

    • so need to handle connection interruptions
    • handle restoring the value when that node fails

    it can be proved that a linearizable compare-and-set (or increment-and-get) register and total order broadcast are both equivalent to consensus

Distributed Transactions and Consensus #

  • the problem to solve – getting nodes to agree on something – hides many subtleties

    • understanding builds on: replication, txns, system models, linearizability and total order broadcast
  • situations when consensus is important:

    1. leader-election

      leadership position might be contested if network faults make it such that some nodes can’t communicate with other nodes \(\implies\) need the nodes to come to consensus else bad failover \(\implies\) split-brain situation \(\implies\) data divergence, inconsistency and data loss

    2. atomic commits (solving the atomic commit problem)

      • consider db with txns that span multiple nodes / partitions and there’s the problem of txns failing on some nodes but succeeded in others

      • to get txn atomicity (ref) we need all the nodes to have a consensus on how to deal with the txn (abort or commit)

               Atomic commit is formalized slightly differently from consensus: an atomic transaction can commit only if all participants vote to commit, and must abort if any participant needs to abort. Consensus is allowed to decide on any value that is proposed by one of the participants. However, atomic commit and consensus are reducible to each other [70, 71]. Nonblocking atomic commit is harder than consensus—see
        
  • we explore the atomic commit problem progressively, going from using 2PC then to Zab (used in Zookeeper) then Raft (etcd)

Impossibility of Consensus #

  • there’s a FLP result, a proof that no algo exists that can reach consensus if there’s a risk that nodes may crash

  • FLP is proven in the context of the asynchronous system model, which is restrictive:

    • algo has to be deterministic
    • algo can’t use clocks or timeouts
  • if the algo can find a way to identify suspected crashed nodes (e.g. using timeouts) or can use randomness, then it's possible to achieve consensus

Atomic Commit and Two-Phase Commit (2PC) #

  • recap: benefits of atomic transactions
    • outcome is either a commit or an abort & rollback
    • prevents failed txns from littering the db with half-finished, half-updated state
    • ensures that the secondary index stays consistent with the primary data, remains useful

from single-node to distributed atomic commit #

  • for single-db-node, the txn writes other than being durable, will also append a commit record to the log on disk – allows crash-recovery

    • transaction commitment crucially depends on the order in which data is durably written to disk: first the data, then the commit record

    • it’s a single device (the controller for the disk drive attached to the single node) that makes the commitment atomic

  • NOTE: the book is outdated in its description that nosql data stores don't offer distributed txns

      Short answer: **the statement was largely correct when DDIA was written, but it’s no longer broadly true today without qualification.** The landscape has shifted quite a bit.
    
      Your quoted passage reflects the **2010–2017 reality** well — but since then, many NoSQL systems have added **distributed transaction support**, at least in some form.
    
      Let’s place this on a timeline and clarify what is true, what changed, and what nuance matters now.
    
      ---
    
      # ✅ What the statement meant (and when it was true)
    
      The claim:
    
      > Most NoSQL distributed datastores do not support distributed transactions, but various clustered relational systems do.
    
      This was broadly accurate in the early NoSQL wave.
    
      ## Early NoSQL design philosophy
    
      Original NoSQL systems (Dynamo-style, Bigtable-style, Cassandra, etc.) were designed around:
    
       * partition tolerance
       * high availability
       * horizontal scale
       * low latency
       * simple data model
       * **single-row / single-partition atomicity only**
    
      They intentionally **avoided distributed transactions** because:
    
       * 2PC hurts availability
       * cross-partition coordination hurts latency
       * global locking hurts throughput
       * consensus across shards is expensive
    
      So they offered:
    
      ```
      atomicity scope = one key / one partition
      ```
    
      Not:
    
      ```
      atomicity scope = arbitrary multi-key transaction
      ```
    
      This matched:
    
       * CAP tradeoff choices
       * operational simplicity goals
       * scale-first philosophy
    
      So historically:
      ✔️ The statement is correct.
    
      ---
    
      # 🔄 What changed over time
    
      Major shift:
      **Several NoSQL databases added distributed transactions later.**
    
      Why?
    
      Because:
    
       * real applications needed invariants
       * developers demanded correctness
       * cloud infrastructure improved
       * consensus got cheaper/faster
       * Google Spanner proved it was feasible
    
      ---
    
      # ✅ Modern NoSQL systems that DO support distributed transactions
    
      ## MongoDB
    
       * Originally: single-document atomicity only
       * Now: **multi-document distributed transactions**
       * Uses: snapshot isolation + 2PC-style coordination
       * Cost: slower, but available
    
      ---
    
      ## CockroachDB
    
       * SQL layer but NoSQL-style architecture
       * Fully distributed transactions
       * Serializable isolation
       * Consensus per range
       * Spanner-inspired
    
      ---
    
      ## YugabyteDB
    
       * Distributed SQL + NoSQL storage engine
       * Supports distributed ACID transactions
       * Raft + transaction manager
    
      ---
    
      ## FoundationDB
    
       * Key-value NoSQL store
       * Strong distributed ACID transactions
       * Designed around transactions from the start
    
      ---
    
      ## Spanner / Spanner-like systems
    
       * Not labeled NoSQL — but nontraditional architecture
       * Globally distributed transactions
       * TrueTime-based ordering
    
      ---
    
      # ⚠️ Systems that still do NOT support general distributed transactions
    
      Some NoSQL systems still avoid them:
    
      ## Cassandra
    
       * No general distributed transactions
       * Lightweight transactions = per-partition CAS only
       * Paxos per key — not multi-key ACID
    
      ---
    
      ## Dynamo-style systems
    
       * Still avoid cross-partition transactions
       * Prefer:

         * conditional writes
         * idempotency
         * application-level compensation
    
      ---
    
      ## Redis Cluster
    
       * No general cross-shard transactions
       * Only same-slot transactions
    
      ---
    
      # 🎯 Important nuance: “supports transactions” ≠ same semantics
    
      Even when supported, distributed transactions vary in:
    
       * isolation level
       * performance cost
       * failure behavior
       * latency penalty
       * operational limits
    
      Support often means:
    
      ```
      available but discouraged for hot paths
      ```
    
      Not:
    
      ```
      free and default
      ```
    
      ---
    
      # 🧠 The deeper architectural shift
    
      The field learned:
    
      > Avoiding transactions pushes complexity into application code.
    
      So newer systems moved toward:
    
      ```
      coordination when needed
      local autonomy when possible
      ```
    
      Modern pattern:
    
       * default = local/partition operations
       * optional = distributed transactions for invariants
    
      This is a **more mature tradeoff** than early NoSQL ideology.
    
      ---
    
      # 📌 How to restate the claim correctly today
    
      Instead of:
    
      > Most NoSQL databases do not support distributed transactions
    
      A modern accurate version would be:
    
      > Early NoSQL systems generally avoided distributed transactions, but many modern distributed databases — including several NoSQL and NewSQL systems — now support them, usually with performance tradeoffs.
    
      ---
    
      # ✅ Your notes are still conceptually correct if read historically
    
      Since you’re reading DDIA-style material:
    
       * Treat the statement as **historically accurate**
       * Understand the **design motivation**
       * Update your mental model with:

         * modern system capabilities
         * optional distributed transactions
         * consensus-backed storage engines
    
    
  • requirements for a distributed atomic commit:

    • a node must only commit once it is certain that all other nodes in the txn are also going to commit

    • a txn commit must be irrevocable

      this is different from the ability to run compensating transactions, because a compensating transaction is, from the DB's pov, a separate txn

      more on compensating txns:

          Great topic. Compensating transactions are one of those ideas that *never went away*, but also never quite became the silver bullet people once hoped for. They sit in an interesting middle ground between “no coordination” and “full distributed transactions”.
      
          I’ll walk through this in a layered way:
      
          1. What compensating transactions actually are
          2. Why they became popular (historically)
          3. How they work in practice (with examples)
          4. Where they are still used today
          5. Where they break down (important)
          6. The modern consensus: when to use them vs not
      
          ---
      
          ## 1️⃣ What is a compensating transaction?
      
          A **compensating transaction** is:
      
          > An explicit operation that semantically *undoes* the effect of a previously completed operation.
      
          Key properties:
      
           * The original operation **already committed**
           * You **cannot roll it back** atomically
           * You later apply a new operation to restore consistency
      
          So instead of:
      
          ```
          Do A
          If failure → rollback A
          ```
      
          You do:
      
          ```
          Do A (commit)
          If failure later → do Compensate(A)
          ```
      
          This is *logical undo*, not physical rollback.
      
          ---
      
          ## 2️⃣ Why compensating transactions existed in the first place
      
          They arose because:
      
           * Distributed transactions were expensive or unavailable
           * Systems needed to span:

             * multiple services
             * multiple databases
             * external APIs
           * You couldn't rely on 2PC across administrative boundaries
      
          Classic domains:
      
           * payments
           * booking systems
           * order fulfillment
           * long-running workflows
      
          This led to:
      
           * **Saga pattern** (1980s, revived in microservices era)
      
          ---
      
          ## 3️⃣ How compensating transactions work (concrete example)
      
          ### Example: Order placement
      
          Steps:
      
          1. Reserve inventory
          2. Charge credit card
          3. Create shipment
      
          If step 3 fails:
      
           * You cannot “uncharge” by rolling back DB state
           * You issue a **refund** (compensation)
           * You release inventory
      
          Sequence:
      
          ```
          Reserve inventory  → commit
          Charge card        → commit
          Ship order         → fails
      
          Compensate:
          Refund card
          Release inventory
          ```
      
          Each step:
      
           * commits independently
           * has a defined compensating action
      
          ---
      
          ## 4️⃣ Where compensating transactions are still used today
      
          ### ✅ Very much still used — in specific domains
      
          #### A. Microservices / workflow systems
      
           * Order processing
           * Billing pipelines
           * Supply chains
           * Travel booking
           * E-commerce

          Frameworks:

           * Saga orchestrators
           * Workflow engines (e.g., durable workflows)
           * Message-driven state machines
      
          ---
      
          #### B. Cross-boundary systems
      
          You *cannot* use distributed transactions when:
      
           * services are owned by different teams
           * vendors are external
           * APIs are asynchronous
           * failures are unavoidable
      
          Compensation is the only viable model.
      
          ---
      
          #### C. Human-in-the-loop systems
      
          If humans are involved:
      
           * refunds
           * reversals
           * approvals
      
          Rollback is impossible — compensation is natural.
      
          ---
      
          ## 5️⃣ Where compensating transactions break down
      
          This is crucial.
      
          ### ❌ Compensation is not rollback
      
          Rollback:
      
           * automatic
           * invisible
           * restores exact previous state

          Compensation:

           * explicit
           * visible
           * may not fully restore state
      
          Examples:
      
           * Refund ≠ original charge (fees, FX rates)
           * Releasing inventory doesn’t undo customer notification
           * Emails sent cannot be “unsent”
      
          ---
      
          ### ❌ Compensation is not always possible
      
          Some operations are:
      
           * irreversible
           * lossy
           * externally visible

          Examples:

           * sending an email
           * triggering a webhook
           * shipping a package
           * publishing a message
      
          You can only *mitigate*, not undo.
      
          ---
      
          ### ❌ Compensation introduces temporal inconsistency
      
          Between failure and compensation:
      
           * system is inconsistent
           * invariants are temporarily violated
      
          This may be acceptable — or not.
      
          ---
      
          ### ❌ Complexity moves to application logic
      
          You must:
      
           * design compensations
           * handle partial failures
           * handle retries
           * handle duplicate compensations
           * handle compensation failure (!)
      
          This is **harder than transactions** in many cases.
      
          ---
      
          ## 6️⃣ Modern consensus: how compensating txns are used today
      
          ### 🧠 The field’s learned position
      
          > Compensating transactions are **not a substitute for ACID transactions**
          > They are a **workflow-level recovery mechanism**
      
          They are best for:
      
           * long-running processes
           * business workflows
           * cross-system coordination

          They are **not good** for:

           * tight invariants
           * correctness-critical counters
           * financial ledger integrity
           * uniqueness constraints
      
          ---
      
          ### 🧩 Modern hybrid approach
      
          Most real systems now use:
      
           * **ACID transactions** *inside* a service or database
           * **Compensation** *between* services
      
          Diagrammatically:
      
          ```
          Service A (ACID)
          Service B (ACID)
          Service C (ACID)
      
          Failures handled by compensation at service boundaries
          ```
      
          This minimizes complexity while preserving correctness.
      
          ---
      
          ## 7️⃣ Relationship to earlier discussion (CAP, skew, consistency)
      
          This ties directly back to your earlier insights:
      
           * Compensation accepts **temporary inconsistency**
           * It assumes **eventual convergence**
           * It avoids **global coordination**
           * It trades correctness guarantees for availability

          In CAP terms:

           * compensation-heavy systems lean AP
           * ACID-heavy systems lean CP
      
          But again: CAP is not the right lens — **invariants are**.
      
          ---
      
          ## 8️⃣ A simple decision rule that actually works
      
          Use compensating transactions when:
      
           * operations are long-running
           * rollback is impossible
           * consistency can be eventual
           * invariants are soft
           * external systems are involved

          Avoid compensating transactions when:

           * invariants must never be violated
           * state must be immediately correct
           * failure handling must be automatic
           * retries are complex or dangerous
      
          ---
      
          ## Final takeaway (the one to remember)
      
          > **Transactions prevent inconsistency.
          > Compensation repairs inconsistency.**
      
          Both are still essential — just at different layers.
      
      

Introduction to 2PC (two-phase commit) #

  • algo for atomic txn commit across multiple nodes

  • the commit/abort process in 2PC split into 2 phases (instead of a single commit request)

  • uses a coordinator aka transaction manager to manage the logic:

    1. app reads and writes data on multiple db nodes as usual

      db nodes are called participants in the txn

    2. app ready to commit? coordinator begins phase 1: sends prepare request to each of the nodes

      • case 1: all participants say yes – coordinator sends phase 2: commit request

      • case 2: not all participants say yes – coordinator sends phase 2: abort request

  • analogy is like a western marriage ceremony

  • if one of the participants or the network fails during 2PC: if any of the prepare requests fail or time out, the coordinator aborts the transaction; if any of the commit or abort requests fail, the coordinator retries them indefinitely.

  • 2PC (two-phase commit) \(\neq\) 2PL (two-phase locking)

    • 2PC is for atomic commits in a distributed db
    • 2PL is about providing serializable isolation
  • system of promises

    • it might seem that 2PC could be broken by packet loss and so on – why does it work?

      there's a bunch of promise-like steps, especially 2 points of no return (which a single-node atomic txn lumps into a single action: writing the commit record to the transaction log)

      • when participant votes “yes” that it can commit later
      • when the coordinator decides, that decision is irrevocable
  • coordinator failure

    • failure before sending the prepare requests: participant can safely abort the txn

    • failure after participant receives a prepare request and participant voted “yes” – then it can no longer abort unilaterally, participants need to wait for the coordinator

      participant’s txn state is in doubt, uncertain

      timeout doesn’t help here

    • technically participants could go and communicate with each other but that’s not part of the typical 2PC protocol

3PC (Three-phase commit) #

  • 2PC is blocking atomic commit protocol because the nodes can be stuck waiting for the coordinator to recover

    it could be tweaked to make the protocol non-blocking, but it’s not straightforward

  • 3PC assumes that the network has bounded delay and nodes will have bounded response times – not a likely assumption

  • nonblocking atomic commits need a perfect failure detector - a way to know whether a node has crashed or not

  • 3PC is hard to implement so 2PC still popular

Distributed Transactions in Practice #

  • distributed transactions incur heavy performance penalties

    • most of the cost comes from the additional disk forcing (fsync) needed for crash recovery, plus the additional network round-trips
  • there’s a need to refine the meaning of “distributed transactions”

    1. database-internal distributed transactions

      • all nodes participating in the txn run the same db software; some distributed dbs support internal transactions amongst their own nodes

      • this can use any protocol, so can optimise in their own way and work well

    2. Heterogeneous distributed transactions

      • when there’s a mix of different db technologies

      • systems may be entirely different under the hood but a distributed txn across these systems must ensure atomic commit

      • this is a lot more challenging

Exactly-once message passing #

  • heterogeneous distributed txns allows diverse system integrations

    e.g. message passing queues and db systems that are different

  • an exactly-once message passing pattern can be done in systems where the transactions are able to use the same atomic commit protocol

XA (X/Open XA aka eXtended Architecture) transactions #

  • standard for implementing 2PC across heterogeneous technologies, supported by many traditional relational dbs and message brokers

  • it’s a C-API for interacting with a transaction coordinator, it’s not a network protocol of its own

    so it’s more of an API-binding

  • assumptions:

    • the application uses a network driver / client library to communicate with the participant dbs / messaging services
  • note: it’s the txn coordinator that implements the XA API

  • can use a disk-based log so that the coordinator may be revived

  • note: there’s been changes to this since the book was published:

      **The points on 3PC remain largely accurate and hold up well against developments since DDIA (2017), with no major breakthroughs overturning the core critiques.** Research has reaffirmed 3PC's impracticality due to its synchronous assumptions, while production systems continue favoring 2PC or alternatives like Paxos/Raft. [bravenewgeek](https://bravenewgeek.com/tag/three-phase-commit/)
    
      ## 2PC Blocking Confirmed
      2PC is indeed blocking: participants can deadlock waiting for a crashed coordinator, and recovery isn't straightforward without logs or external help. No widespread "tweaks" have made it non-blocking in practice; instead, systems use timeouts, presumed aborts, or XA heuristics. [graphapp](https://www.graphapp.ai/blog/understanding-two-phase-commit-a-comprehensive-guide)
    
      ## 3PC Assumptions
      3PC requires bounded delays and response times for non-blocking behavior, which fails in asynchronous networks—still the realistic model today. Partitions cause timeouts, preventing progress, as noted in recent analyses (up to 2025). [cs.unibo](https://www.cs.unibo.it/babaoglu/papers/pdf/commit.pdf)
    
      ## Perfect Failure Detectors
      Non-blocking atomic commit provably requires a perfect failure detector (P), distinguishing crashes from delays—which is impossible in async systems without oracles. Encodings exist but remain theoretical; no practical implementations since DDIA. [drops.dagstuhl](https://drops.dagstuhl.de/storage/00lipics/lipics-vol246-disc2022/LIPIcs.DISC.2022.35/LIPIcs.DISC.2022.35.pdf)
    
      ## Implementation and Popularity
      3PC is complex (extra pre-commit phase/state), rarely implemented fully—e.g., not standard in databases like PostgreSQL or MySQL XA. 2PC dominates for XA transactions; alternatives like 2PC-aside (Google Spanner) or Sagas/Paxos sidestep it entirely. [developer.jboss](https://developer.jboss.org/docs/DOC-56051)
    
      ## Recent Trends (2017-2026)
      Papers explore non-blocking variants (e.g., NetLR-inspired, but not 3PC), but focus shifts to leader-based replication (Raft) or multi-leader with CRDTs/CRQs. No production adoption of 3PC; critiques in 2022+ echo DDIA. [vldb](https://www.vldb.org/pvldb/vol15/p1337-lee.pdf)
    

holding locks when in doubt #

  • the main problem about the transaction status being in doubt is that it’s a blocking hold of the lock

    \(\implies\) causes availability issues

  • typically row-level locks

recovering from coordinator failure #

  • there are chances of in-doubt transactions being orphaned: when the coordinator can’t decide the outcome for various reasons (e.g. txn log corruption)
  • reboots won’t fix the problem
  • human (admin) must manually fix this
  • XA implementations have heuristic decisions as escape-hatches:
    • heuristic is a euphemism for “probably breaking atomicity”

Limitations of distributed transactions #

  • the coordinator itself is something like a db, so using a coordinator and having XA has its own operational problems

  • for db-internal distributed txns, a distributed version of SSI (serialisable snapshot isolation) is possible

    since a distributed txn needs all of its participants, if any part of the system is faulty, distributed transactions tend to amplify failures

Fault-Tolerant Consensus #

  • formalisation of the consensus problem: one or more nodes may propose values, and the consensus algorithm decides on one of those values.

    satisfies the properties:

    1. Uniform agreement: no two nodes decide differently

    2. integrity: no node decides twice

    3. validity: if a node decides that the value is v then v must have been proposed by some node

    4. termination: every node that doesn’t crash eventually decides some value

      • this property formalises the idea of fault tolerance because it says that the consensus algo must make progress in spite of node failures

      • system model of consensus assumes that when a node “crashes” it disappears and never comes back, so any algo that has to wait for a node to recover is not going to be able to satisfy the termination property

        so 2PC won’t meet these requirements

      • this is subject to the assumption that fewer than half the nodes are crashed/unreachable

  • there’s a limit to the number of failures that an algo can tolerate – typically need at least a quorum of nodes to be functioning to make forward progress

  • Popular algos:

    1. VSR (viewstamped replication)
    2. Paxos
    3. Raft
    4. Zab
  • most of the popular algos decide on a sequence of values (which makes them total order broadcast algorithms)

    they don’t directly use the formal consensus model that we use (i.e. the proposing and deciding, while satisfying the agreement, integrity, validity and termination properties)

    total order broadcast is equivalent to performing several rounds of consensus (like a breakdown, where each consensus decision corresponds to one message delivery)

  • each consensus-decision corresponds to one message delivery:

    • agreement property: all nodes decide to deliver the same messages

    • integrity property: messages not duplicated

    • validity property: no message corruption no fabrication

    • termination property: messages are not lost

  • in the case of Paxos, this is actually multi-paxos

Single-leader replication and consensus #

  • single-leader replication is in effect total order broadcast, performed by the leader

  • the important part is how the leader is chosen

  • by making leader election and failover automatic, a follower can take over when the old leader fails, bringing the system closer to fault-tolerant total order broadcast – and closer to solving consensus

  • but to implement leader election, it may seem like we need a leader to elect a leader – a chicken-and-egg problem, so something different is needed

Epoch numbering and quorums #

  • internally the consensus algos use a leader but they don’t guarantee that the leader is unique

    \(\implies\) a weaker guarantee: protocols define an epoch number (aka ballot number [paxos], view number [Viewstamped Replication], term number [Raft]).

    Within each epoch, the leader is unique

  • leader election happens when current leader thought to be dead \(\implies\) the election happens via an incremented epoch number

    \(\implies\) epoch numbers are totally ordered and monotonically increasing

    if there's a conflict between two leaders from different epochs, the one with the higher epoch number wins

  • split-brain preventative measures: leaders need to check that there isn’t any other leader with a higher epoch number

    • this check is mediated by a quorum of nodes

    • a leader must collect votes from a quorum of nodes for every decision it wants to make; a node votes in favor only if it is not aware of another leader with a higher epoch

    • so it ends up having 2 rounds of voting:

      round 1: choose a leader

      round 2: vote on the leaders proposal

      quorum from both these votes must overlap
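
The two rounds and the overlap argument can be sketched as follows (a toy model, all names hypothetical): each node remembers the highest epoch it has promised, so a leader from an older epoch can no longer assemble a quorum.

```python
class Acceptor:
    """Each node remembers the highest epoch it has seen and
    refuses to vote for a leader operating under a stale epoch."""
    def __init__(self):
        self.promised_epoch = 0

    def vote_leader(self, epoch):
        # Round 1 vote: only a strictly newer epoch wins this node.
        if epoch > self.promised_epoch:
            self.promised_epoch = epoch
            return True
        return False

    def accept_proposal(self, epoch, value):
        # Round 2 vote: reject proposals from a deposed (stale) leader.
        return epoch >= self.promised_epoch

def elect_and_propose(nodes, epoch, value):
    quorum = len(nodes) // 2 + 1
    # Round 1: gather leader-election votes for this epoch.
    votes = sum(n.vote_leader(epoch) for n in nodes)
    if votes < quorum:
        return False
    # Round 2: a quorum must also accept the proposal. The two quorums
    # overlap in at least one node, which would have seen any higher
    # epoch and voted the proposal down.
    accepts = sum(n.accept_proposal(epoch, value) for n in nodes)
    return accepts >= quorum

nodes = [Acceptor() for _ in range(5)]
print(elect_and_propose(nodes, epoch=1, value="x"))  # True
# A leader reusing an old epoch can no longer win votes:
print(elect_and_propose(nodes, epoch=1, value="y"))  # False
```

This is only the split-brain-prevention skeleton, not a full consensus algorithm (no log, no recovery of previously accepted values), but it shows why overlapping quorums make a stale leader discover its own deposition.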

Limitations of consensus #

  • benefits of consensus algos:

    • bring concrete safety properties (4 properties above)

    • are fault-tolerant as long as majority of nodes are working

    • since they provide total order broadcast, they can implement linearizable atomic operations in a fault-tolerant way

  • Costs:

    1. the voting process on proposals is in effect synchronous replication

      (dbs are often configured to use asynchronous replication instead, accepting that some committed data can be lost on failover – a risk people typically take for better performance)

    2. blocked on network failure: a node cut off from the majority by a network partition can’t make progress

    3. typical assumption is that for voting, it’s a fixed set of nodes that participate in voting, so can’t dynamically update the cluster

      consensus algos can have extensions to support dynamic membership – less understood

    4. typically reliant on timeouts to detect failed nodes – so if the network delays are too varied, then it’s hard to choose a correct timeout duration

      this variability can result in many false positives in detecting node failures

      doesn’t affect correctness but can add extra overheads because need to work around these things

    5. particular sensitivities to network problems:

      1. Raft: if the entire network works correctly except for one consistently unreliable link, leadership may bounce between two nodes / the current leader is continually forced to resign \(\implies\) system can’t actually make effective progress

Membership and Coordination Services #

  • “distributed key-value stores” / “coordination and configuration services” – what things like ZooKeeper/etcd are called

    • their APIs feel like DB apis
    • they implement consensus algorithms
  • thing is, the reliance on such tools is likely to be indirect (as a side effect of using another tool like Hadoop or Kafka that depends on them)

  • these tools are designed to work best with in-memory size data (even though they might persist on disk for durability)

    this data is what is replicated across all the nodes using a fault-tolerant total order broadcast system
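The replication described above amounts to a replicated state machine: if every replica applies the same deterministic writes in the same total order, all replicas converge to identical state. A minimal illustrative sketch (names invented):

```python
# Sketch of the replicated-state-machine idea behind total order
# broadcast: every replica applies the same deterministic log of writes
# in the same order, so all replicas end up with identical state.

def apply_log(log):
    state = {}
    for key, value in log:   # log order is the agreed-upon total order
        state[key] = value   # deterministic write
    return state

log = [("config", "v1"), ("leader", "node-a"), ("config", "v2")]

replica1 = apply_log(log)
replica2 = apply_log(log)
assert replica1 == replica2 == {"config": "v2", "leader": "node-a"}
```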

  • some other features that ZooKeeper gives:

    1. basic stuff: total order broadcast (therefore consensus)

      fault-tolerant data replication, same writes in the same order to make replicas consistent with each other

    2. linearizable atomic operations (really, this is the only feature that requires consensus)

      • using an atomic compare-and-set operation, we can implement a lock

      • the consensus protocol guarantees that the operation is atomic and linearizable

      • a distributed lock is usually implemented as a lease, with an expiry time so that it’s eventually released if the client fails (see failure cases)
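The lock-as-lease idea can be sketched against a hypothetical linearizable compare-and-set register (the `CasRegister` class here is an invented in-process stand-in; in a real system the compare-and-set would be backed by the consensus service):

```python
import time

# Invented in-process stand-in for a linearizable compare-and-set
# register; in a real system this operation would be backed by consensus.
class CasRegister:
    def __init__(self):
        self.value = None

    def compare_and_set(self, expected, new):
        if self.value == expected:
            self.value = new
            return True
        return False

def try_acquire_lease(register, owner, ttl_seconds, now=None):
    # A lease is a lock with an expiry time, so a crashed client
    # cannot hold it forever.
    now = time.monotonic() if now is None else now
    current = register.value
    if current is not None and current[1] > now:
        return False  # lease still held and unexpired
    # Either the lock is free or the previous lease has expired:
    # atomically swap in our own (owner, expiry) lease.
    return register.compare_and_set(current, (owner, now + ttl_seconds))

lock = CasRegister()
assert try_acquire_lease(lock, "client-a", ttl_seconds=10, now=0.0)
assert not try_acquire_lease(lock, "client-b", ttl_seconds=10, now=5.0)
assert try_acquire_lease(lock, "client-b", ttl_seconds=10, now=11.0)  # expired
```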

    3. total ordering of operations

      • ref “the leader and the lock”: we need a fencing token to prevent clients from conflicting with each other when there are process pauses; ZooKeeper provides this by totally ordering all operations and stamping each with a monotonically increasing transaction ID (zxid)
    4. failure detection

      • each client maintains a long-lived session on the ZooKeeper servers with a periodic heartbeat exchange; if heartbeats stop for longer than the session timeout, the session is declared dead

      • ZooKeeper also offers ephemeral nodes: keys that are automatically deleted when the session that created them dies

    5. change notifications

      • clients can subscribe to (watch for) changes made by other clients – e.g. a new node joining – which saves the overhead of polling
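The fencing idea from point 3 can be sketched as follows: the lock service hands out a strictly increasing token with every grant, and the protected resource rejects writes carrying a stale token. Class names here are invented for illustration:

```python
# Sketch of fencing tokens: the lock service hands out a strictly
# increasing token with every grant, and the storage service rejects
# any write carrying a token older than one it has already seen.
# (Names invented for illustration.)

class LockService:
    def __init__(self):
        self.token = 0

    def grant(self):
        self.token += 1   # e.g. ZooKeeper's zxid can play this role
        return self.token

class Storage:
    def __init__(self):
        self.max_token_seen = 0
        self.data = None

    def write(self, token, value):
        if token < self.max_token_seen:
            return False  # stale client (e.g. one that was paused by GC)
        self.max_token_seen = token
        self.data = value
        return True

lock, storage = LockService(), Storage()
t_old = lock.grant()   # client 1 acquires the lock, then pauses
t_new = lock.grant()   # its lease expires; client 2 acquires
assert storage.write(t_new, "from client 2")
assert not storage.write(t_old, "from client 1")  # stale token fenced off
```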

Allocating work to nodes #

  • a feature useful for job schedulers / stateful systems is for follower nodes to automatically get promoted to leaders

  • rebalancing load in a partitioned resource: when new nodes join the cluster, some partitions need to be reassigned to them; when nodes are removed or fail, other nodes need to take over their partitions

  • should use applications like Apache Curator that offer higher-level APIs on top of ZooKeeper client APIs instead of trying to implement the consensus algos from scratch

  • an application may run on thousands of nodes, but only a fixed small number of ZooKeeper nodes (3–5) perform the voting, while supporting that large number of clients

    so the coordination work is “outsourced” to this small set of nodes

Service Discovery #

  • these systems used for service discovery as well
  • at time of writing, it’s not clear if service discovery needs consensus, DNS being the OG way of looking up IPs for services
  • in the case of DNS,
    • DNS reads are not linearizable
    • sometimes tolerated if DNS reads are stale
    • more important that DNS is reliably available, robust to network interruptions
  • Leader-election needs consensus though, so can use the consensus systems to tell other services who the leader is \(\implies\) it makes sense to have RO caching replicas of consensus systems to read this information

Membership Services #

  • membership services is the category that tools like ZooKeeper and friends would fall under
  • membership service determines which nodes are currently active and live members of a cluster.
    • typically unbounded network delays make it hard to detect whether a node has failed BUT…
    • with failure detection and consensus, nodes can come to an agreement about who should be considered alive or not
    • false positives on dead-node detection are still possible, but that’s tolerable
    • to be even able to do leader election, the first step is to have consensus on who the current members are

Summary #

  1. terms :

    • linearizability: a popular consistency model; its goal is to make replicated data appear as though there were only a single copy, and to make all operations act on it atomically

    • causality: the happened-before relationship; a weaker consistency model than linearizability

      causal ordering is not always sufficient (e.g. unique username selection) \(\implies\) need consensus

    • consensus

  2. range of problems that are reducible to consensus (there’s some equivalence):

    1. Linearizable compare-and-set registers
      • register must atomically decide whether to set its value, based on whether its current value equals a given expected value
    2. Atomic transaction commit
      • for distributed transactions, db must decide whether to commit or abort the transaction
    3. Total order broadcast
      • messaging system must decide on the order in which to deliver messages
    4. Locks and leases
      • several clients trying to grab a lock / lease, then lock decides which one acquires it successfully
    5. Membership / Coordination service
      • System must decide, given a failure detection mechanism like timeouts, which nodes are members of the distributed system
    6. Uniqueness constraint
      • multiple transactions trying to create conflicting records for the same key, constraint is to decide which one to allow and which to fail with a constraint violation
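As a sketch of reduction 6: once claim requests are appended to a totally ordered log (via total order broadcast), every node can independently process the log and agree on which claim won. Function names are invented; illustrative only:

```python
# Sketch of enforcing a uniqueness constraint with a total order
# broadcast log: every node processes the same claims in the same
# order, so all nodes agree on which claim succeeded.

def decide_claims(ordered_log):
    owners, results = {}, []
    for client, key in ordered_log:   # the agreed-upon total order
        if key not in owners:
            owners[key] = client      # first claim for this key wins
            results.append((client, key, "ok"))
        else:
            results.append((client, key, "constraint violation"))
    return results

log = [("alice", "@cool_name"), ("bob", "@cool_name"), ("bob", "@other")]
assert decide_claims(log) == [
    ("alice", "@cool_name", "ok"),
    ("bob", "@cool_name", "constraint violation"),
    ("bob", "@other", "ok"),
]
```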
  3. single leader systems are simpler, if leader fails we have the following options:

    1. wait for the leader to recover

      but this doesn’t satisfy the termination property for consensus; can possibly be blocked forever

      e.g. XA/JTA txn coordinators do this

    2. manual human failover

      typically done by relational dbs; consensus by “act of God”; speed of failover is limited by the speed at which the human can act, which is slower than computers

    3. use an algo for leader election

      needs a consensus algo, use a proven algo

  4. who doesn’t need consensus?

    • leaderless and multi-leader replication systems

      typically don’t need global consensus; they encounter conflicts, but that can be tolerable depending on the application

      maybe we simply need to cope without linearizability and learn to work better with data that has branching and merging version histories.