Chapter 9. Consistency and Consensus
(notes on Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems)
just like how we have transactions as a general purpose abstraction to solve a class of problems, distributed computing problems shall be solved by general purpose abstractions as well:
- abstraction of consensus
first we explore the range of guarantees and abstractions that we can have in distributed systems; to get the scope of what can be done and what can’t be done
this book just focuses on general intuition without going into correctness-proofs
Consistency Guarantees #
eventual consistency is a weak guarantee: it just says replicated dbs will converge eventually, allowing temporary inconsistency
a better name would be convergence
when working with db that provides only weak guarantees, we have to be sensitive about its limitations and not assume too much
edge cases of eventual consistency only become apparent when there is a fault in the system e.g. high concurrency / network interruption
no free lunch for stronger consistency models
stronger consistency models exist but they may come at the expense of performance / less fault-tolerance
distributed consistency models and transaction isolation levels’ hierarchy can be overlapping in ideas
difference:
transaction isolation is about avoiding race conditions that come from concurrently executing transactions
distributed consistency is about coordinating the state of replicas in the face of delays and faults
Linearizability: a recency guarantee #
it’s a recency guarantee
idea: since replicas exist and there’s a chance that if we ask 2 different replicas the same question, we may get different answers,
we can strive for a situation where there's an illusion of only a single db; every client would then get the same answer without worrying about replication lag
basic idea is to make a system appear as if there were only one copy of the data, and all operations on it are atomic. With this guarantee, even though there may be multiple replicas in reality, the application does not need to worry about them.
as soon as one client completes a write successfully, all clients reading from the db must be able to see the value just written
so this value can't come from a stale cache / replica
What Makes a System Linearizable? #
terminology: the “variable” is better called a register, which may be a kv-pair or a row in a table …
- consider the operations w.r.t register:
- read
- write
- cas(x, \(v_{old}\), \(v_{new}\)) (compare-and-set)
- atomic operation
- if \(x != v_{old}\) then operation leaves register without changing, returns error on client-side
- if \(x == v_{old}\) then atomically set to \(v_{new}\)
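A minimal sketch of such a register in Python (hypothetical: a lock stands in for the atomicity that a single-copy linearizable register provides):

```python
import threading

class Register:
    """Sketch of a single register with atomic read / write / compare-and-set.

    The lock makes each operation atomic on one copy of the data, which is
    the behavior a linearizable register is supposed to present.
    """
    def __init__(self, value=None):
        self._value = value
        self._lock = threading.Lock()

    def read(self):
        with self._lock:
            return self._value

    def write(self, value):
        with self._lock:
            self._value = value

    def cas(self, v_old, v_new):
        # Atomically: set to v_new only if the current value equals v_old;
        # otherwise leave the register unchanged and signal an error.
        with self._lock:
            if self._value != v_old:
                return False
            self._value = v_new
            return True
```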
regular register: a register in which reads may return either the old or new value if the read is concurrent with a write
The requirement of linearizability is that the lines joining up the operation markers always move forward in time (from left to right), never backward.
This operation ensures the recency guarantee we discussed earlier: once a new value has been written or read, all subsequent reads see the value that was written, until it is overwritten again.
It is possible (though computationally expensive) to test whether a system’s behavior is linearizable by recording the timings of all requests and responses, and checking whether they can be arranged into a valid sequential order
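A brute-force sketch of such a check for a tiny single-register history (hypothetical op format `(start, end, kind, value)` with recorded request/response times; tries every sequential order that respects real time):

```python
from itertools import permutations

def is_linearizable(history, initial=None):
    """Check a tiny register history by brute force (exponential cost).

    history: list of ops (start_time, end_time, kind, value), where kind is
    "write" or "read"; a read op's value is what the client observed.
    """
    for order in permutations(history):
        # real-time constraint: if op b finished before op a started,
        # b must not be placed after a in the sequential order
        ok_time = all(not (b[1] < a[0])
                      for i, a in enumerate(order)
                      for b in order[i + 1:])
        if not ok_time:
            continue
        # register semantics: every read must return the latest prior write
        value, ok_sem = initial, True
        for _, _, kind, v in order:
            if kind == "write":
                value = v
            elif v != value:
                ok_sem = False
                break
        if ok_sem:
            return True
    return False
```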
Linearizability vs Serializability #
similar sounding guarantees because both have something to do with “arranging in a sequential order”, but they are different guarantees
Serializability:
- isolation property of transactions
- every txn may R/W multiple objects
- guarantee: txn behave the same as if they had executed in some serial order (w.r.t transactions). It’s alright for that order to be different from the order in which txns were actually run
Linearizability
- recency guarantee on R/W of a register (individual object)
- no grouping of operations into txns, doesn’t solve write-skew and all unless other steps (e.g. materializing conflicts) are taken
strict serializability aka strong one-copy serializability (strong ISR): when a db provides both serializability and linearizability
- implementation e.g. 2PL or actual serial execution that make the db typically linearizable
serializable snapshot isolation is not linearizable
- it reads from a consistent snapshot so as to avoid lock contention b/w readers and writers
- it doesn't include writes that are more recent than the snapshot, so reads from the snapshot are not linearizable
Relying on Linearizability #
cases where linearizability matters for system correctness:
locking and leader election
leader locks have to be linearizable
typically every node that spins up tries to acquire the lock and the one that succeeds is the leader
a linearizable storage service is the foundation for these coordination tasks
distributed locking can be done at the page-level on the storage system
contraints and uniqueness guarantees
if we want to enforce constraints (e.g. uniqueness constraints) as the data is written, then we need linearizability
similar situation to a lock; can see it as “locking” being done on a “column”
the operation is similar to a compare-and-set also
also relevant when constraints such as “min value” need to be enforced (e.g. min val in bank, last item sale,…)
- some of these constraints, if it’s alright to be slightly looser in the constraint, then we don’t need linearizability
fk / attribute constraints can be implemented without requiring linearizability though
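A sketch of how a uniqueness constraint reduces to compare-and-set on a per-key register (hypothetical `UsernameRegistry`; the locked dict stands in for a linearizable backing store):

```python
import threading

class UsernameRegistry:
    """Hypothetical sketch: claiming a username is a compare-and-set from
    'absent' to the claiming user, so of several concurrent claimants for
    the same name, exactly one succeeds."""
    def __init__(self):
        self._owners = {}
        self._lock = threading.Lock()

    def claim(self, username, owner):
        with self._lock:  # stands in for a linearizable CAS
            if username in self._owners:
                return False  # uniqueness constraint: name already taken
            self._owners[username] = owner
            return True
```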
cross-channel timing dependencies
in some cases, the detection of the linearizability violation happens because there's an additional communication channel in the system (cross-channel)
given example: a webserver that accepts image uploads and works together with an image resizer
in this system the webserver and the image resizer communicate both through file-storage and via a message queue, which now opens up the possibility of a race condition
if the file storage service is linearizable, then the system works alright
else, there’s the risk of race conditions:
- e.g. if the message queue is faster than the internal replication inside the storage service, then when the resizer fetches the image it might see an old version of the image – the full-size and resized images in the file-storage become permanently inconsistent
in example, 2 comms channels between the webserver and the resizer: file storage and the message queue
race condition happens if there’s no recency guarantee of linearizability
this problem can be solved in other ways, but linearizability is a simple enough way to solve it
Implementing Linearizable Systems #
- most common approach is to use replication so that systems are fault-tolerant, so the replication methods can be looked at to see if they can be made linearizable:
single leader replication: potentially linearizable
leader has primary copy, followers have backup copies on other nodes
reads from the leader / synchronously updated followers can potentially be linearizable
using the leader for reads \(\implies\) we know who the leader is
situations when linearizability is violated:
a delusional leader (even if it eventually realises that it's not the leader) will violate linearizability if it keeps serving requests
when it's async replication, then failover may lose committed writes, which violates both durability and linearizability
consensus algos: linearizable
some of these algos feel like single-leader replication but they have mechanisms that prevent split brain and stale replicas
so, consensus algos can implement linearizable storage safely
multi-leader replication: not linearizable
concurrent processing of writes on multiple nodes + async replication to other nodes = conflicting writes that require resolution
conflicts – artifact of the lack of a single copy of the data
leaderless replication: generally not linearizable
the claim is that we can get “strong consistency” if we require quorum reads and writes (w + r > n); whether this holds depends on:
- config of the quorum
- how strong consistency is defined
LWW (last write wins) conflict resolution usually fails here because clock timestamps can't be guaranteed to be consistent with the actual ordering of events, due to clock skew
sloppy quorums don’t allow linearizability
there’s situations when even strict quorums can still violate linearizability as well
Linearizability and quorums #
it’s possible to get a violation
it’s also possible to add in linearizablity (only for read and write operations it may be feasible, not for compare-and-set operations) by doing the following:
readers: add in synchronous read repair (so performance takes a hit) before returning read results to the application
writer: must read the latest state of a quorum of nodes before sending it writes
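The reader side of this can be sketched as follows (hypothetical `Replica` with versioned get/put; synchronous read repair before returning, as described above):

```python
class Replica:
    """Hypothetical versioned replica storing (version, value)."""
    def __init__(self, version=0, value=None):
        self.version, self.value = version, value

    def get(self):
        return self.version, self.value

    def put(self, version, value):
        if version > self.version:  # never overwrite newer data
            self.version, self.value = version, value

def quorum_read(replicas, r):
    # contact r replicas and pick the newest version among the responses
    responses = [(rep, *rep.get()) for rep in replicas[:r]]
    _, latest_version, latest_value = max(responses, key=lambda t: t[1])
    # synchronous read repair: bring stale responders up to date
    # BEFORE returning the result to the application
    for rep, version, _ in responses:
        if version < latest_version:
            rep.put(latest_version, latest_value)
    return latest_value
```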
The Cost of Linearizability #
need to understand pros and cons to get the tradeoffs of this guarantee
network interruptions force a choice between availability and linearizability
on interruption of network between data-centers:
if multi-leader db:
within data-center, can function normally
the writes are queued up and exchanged when network connectivity (b/w data-centers) is restored
if single-leader replication: depends on whether we connect to the data center containing the leader or not
leader must be in one of the data-centers
so any writes and any linearizable reads must be sent to the leader – so any clients connected to a follower data center must be sent synchronously over the network to the leader data center
if interrupted, clients connected to the follower data centers can’t contact the leader so can’t make any writes nor linearizable reads
can still make reads from follower but the data may be stale (non-linearizable)
no issue if clients are connected to the datacenter that has the leader within it, BAU there
The CAP Theorem #
linearizability issues arise on any unreliable network, even within a single datacenter; this is regardless of single-/multi-leader replication and single-/multi-datacenter deployment
tradeoffs:
- either linearizable (consistent) or available under network partitions
if app requires linearizability
if network problem e.g. disconnected network partition \(\implies\) some replicas can't process requests while disconnected \(\implies\) they must wait until the network problem is fixed or return an error \(\implies\) service is unavailable
no need linearizability
can be written such that each replica can independently process, even if disconnected (e.g. multi-leader)
so, application can be available in face of network problem but behaviour is not linearizable
more tolerant of network problems
CAP was supposed to be a rule of thumb, more of a discussion of tradeoffs
the unhelpful framing of CAP:
typically presented as:
CAP is sometimes presented as Consistency, Availability, Partition tolerance: pick 2 out of 3. Unfortunately, putting it this way is misleading [32] because network partitions are a kind of fault, so they aren’t something about which you have a choice: they will happen whether you like it or not
so a better framing of this set of tradeoffs should be:
a better way of phrasing CAP would be either Consistent or Available when Partitioned
in its formal definition, the CAP theorem has a very narrow scope where it’s relevant:
only when the consistency model is linearizability and when the type of fault is network partitions
there are more important things to think about when designing distributed systems instead of mulling that much over CAP theorem
there’s a newer acronym to help guide us
PACELC (ref wikipedia for the design):
if there's a Partition (P), we have to choose between Availability (A) and Consistency (C)
Else (E), when the system is healthy, we have to choose between Latency (L) and Consistency (C)
There are four configurations or trade-offs in the PACELC space:
- PA/EL - prioritize availability and low latency over consistency
- PA/EC - when there is a partition, choose availability; else, choose consistency
- PC/EL - when there is a partition, choose consistency; else, choose lower latency
- PC/EC - choose consistency at all times
(aside: CAP → PACELC → why PACELC is still not enough)
CAP restated precisely:
- formal CAP: in the presence of a network partition, a system that provides linearizability cannot also provide availability
- consistency = linearizability specifically; not “strong consistency”, not serializability, not read-your-writes
- availability = every non-failing node eventually returns a non-error response
- partition tolerance is not optional; partitions are faults, they happen whether you like it or not
- so CAP is not a taxonomy of databases, a performance model, a latency argument, or a general statement that “distributed systems are hard”; it's a safety-vs-liveness statement under one very specific failure mode
what PACELC adds:
- PA/PC covers what you do during partitions; EL/EC covers what you optimize during normal operation
- this adds the steady state, which CAP never talked about and where systems live 99.99% of the time
- in practice engineers care far more about tail latency, quorum round trips and geo-replication delays than about a rare failure mode
- PACELC made it acceptable to say “we sacrifice linearizability even when there is no partition, because we want lower latency”, which is a real, everyday tradeoff
where PACELC still falls short:
- it still treats “consistency” as a single axis: it collapses linearizability, causal consistency, session guarantees, snapshot isolation, serializability and application-level invariants into one C/¬C switch; many systems give up linearizability while preserving causality, and many applications don't need linearizability but do need monotonic reads, read-your-writes or bounded staleness
- it ignores what is being coordinated: both CAP and PACELC treat the system as a single blob with one consistency model, but different operations need different guarantees (leader election → linearizability; analytics queries → no; counters → maybe CRDTs; uniqueness constraints → CAS-style linearizability; caches → eventual consistency is fine); real systems are heterogeneous, not point-labeled on a 2D grid
- it frames everything as a performance tradeoff, but many consistency choices (cross-channel timing bugs, distributed locks, uniqueness constraints) are about correctness and invariant preservation, not performance preferences
- it ignores failure modes beyond partitions: real systems fail due to clock skew, GC pauses, slow replicas, partial disk failure, misconfiguration and overload; linearizability and availability can both fail without any partition
what replaces CAP/PACELC in DDIA-style thinking is not another acronym but a shift in mindset:
- think in abstractions, not labels: linearizable register, atomic CAS, transaction, snapshot, log, consensus; then ask which abstraction the problem actually requires
- think in failure semantics: instead of “is this AP or CP?”, ask what happens during message delay, leader crash, failover or clock skew, which invariants break, and what errors surface to users
- think in application invariants (“only one leader exists”, “stock never goes below zero”, “email is sent once”, “usernames are unique”); once the invariant is stated, the needed consistency model often becomes obvious
bottom line: CAP is correct but not a useful primary design tool; PACELC is a better story, but still a story, not a design method
Linearizability and network delays #
funny thing is that even RAM on multi-core CPUs is not linearizable
unless a memory barrier or memory fence is used, it's not linearizable
why?
every CPU core has its own memory cache and store buffer
mem access reaches a cache first and any changes async written out to main memory
since data access in cache is much faster than going to memory
now there’s several copies of the data (main memory, several caches) and the copies are async updated so linearizability is lost
reason for dropping linearizability is to prioritize performance
Linearizability is slow all the time, not just during a network fault
there's a proof that if you want linearizability, the response time of read and write requests is at least proportional to the uncertainty of delays in the network
- with highly variable delays (most networks), the response time of linearizable reads and writes will be high
KIV chapter 12 for approaches for avoiding linearizability without sacrificing correctness
Ordering Guarantees #
ordering is a fundamental concept that has popped up in a bunch of contexts:
the leader in single-leader replication determines the order of writes in the replication log
- the order in which to apply the writes
serializability: ensuring txns behave as though they were executed in some sequential order
2 ways to achieve:
literally executing txns in serial order
allowing concurrent executing while preventing serialization conflicts (by locking / aborting)
in the context of timestamps and clocks in distributed systems
attempts to determine which one of two writes happened later
there are deep connections between ordering, linearizability, and consensus.
Ordering and Causality #
Ordering helps preserve causality.
ordering on events: cause comes before effect; a message is sent before that message is received; the question comes before the answer.
Some patterns we’ve observed in the book so far about causality:
causal dependency between questions and answers
updates to a record must happen after they exist (creation of that record)
happened-before relationships are an expression of causality
given 2 operations A and B
A before B
B before A
A and B concurrent \(\implies\) no causal link b/w the two operations
transaction consistency is consistency with causality (transactions, snapshot isolation)
so no read skew (i.e. non-repeatable reads; reading data in a state that violates causality)
observing the entire db at a single point in time makes it consistent with causality:
effects of all ops happening before that time are visible
no effects of operations happening after that time are visible
write skew b/w txns show causal dependencies
where the write action is causally dependent on the observation of who’s currently on call
SSI (serializable snapshot isolation) detects this write skew by tracking the causal dependencies b/w txns
causal violation that was detected because of cross-channel timing dependencies
e.g. the sports example where the exclamation of the result (comms channel) was the reason why the violation could be detected
causally consistent: system obeying the ordering imposed by causality
Causal order is not a total order #
Total ordering: any 2 elements can be compared
mathematical sets are only partially ordered
i.e. in some cases one set is greater than another (if one set is a superset of the other) and in all other cases they are incomparable
diff b/w total and partial order is shown in the diff db consistency models
linearizability \(\implies\) total ordering of operations
if system behaves like it has a single copy and every operation is atomic then all elements (operations) are comparable \(\implies\) that’s why it can be illustrated as a timeline
causality \(\implies\) partial ordering
2 events ordered if they are causally related (one before another), incomparable if they’re concurrent \(\implies\) partial ordering
Implication: no concurrent operations in a linearizable datastore – must be a single timeline along which all operations are totally ordered
still can have multiple requests being queued and all, just the datastore does everything atomically, on a single copy of data, along a single timeline without any concurrency
concurrency – it’s like a branching and merging of timelines
very much like git
- operations on different branches are incomparable
Linearizability is stronger than causal consistency #
linearizability implies causality
that’s great, easy to understand and all
but performance hits and problems with availability is why distributed data systems don’t prioritise linearizability
possible to be causally consistent without incurring the performance hit of making things linearizable
causal consistency is the strongest possible consistency model that does not slow down due to network delays, and remains available in the face of network failures
most systems that appear to need linearizability just need causal consistency actually
Capturing causal dependencies #
Just some key ideas
to maintain causality just need to maintain partial order, so that we can do the happened-before r/s
ways to determine the happened-before r/s b/w operations similar to the detection of concurrent writes (ref)
going beyond just the detection of concurrent writes to the same key, by generalizing version vectors, is how we can do this
examples of using version vectors in other situations:
e.g. version number of prior operations is passed back to the db on a write
e.g. in the conflict detection of SSI, a txn that wants to commit will pass the version of the data that it has read and checks if it’s still up to date
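The underlying comparison on version vectors can be sketched like this (hypothetical dict-of-counters representation: node → counter):

```python
def compare(vv_a, vv_b):
    """Compare two version vectors (dicts mapping node -> counter).

    Returns 'equal', a happened-before direction, or 'concurrent'
    (neither dominates the other), which is exactly the partial order
    that causality gives us.
    """
    keys = set(vv_a) | set(vv_b)
    a_le_b = all(vv_a.get(k, 0) <= vv_b.get(k, 0) for k in keys)
    b_le_a = all(vv_b.get(k, 0) <= vv_a.get(k, 0) for k in keys)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "a happened-before b"
    if b_le_a:
        return "b happened-before a"
    return "concurrent"
```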
Sequence Number Ordering #
actually keeping track of ALL causal deps can be impractical because of the large overhead, so we pick something simpler (sequence numbers / timestamps)
- compact values
- preserve total order (every operation with a unique sequence number)
- 2 sequence numbers are comparable
use a logical clock for this
total order that is consistent with causality: the happened-before relationship is preserved, concurrent operations may be ordered arbitrarily.
- this captures all the causality info
the replication log within a single-leader replicated db has a total ordering of write operations that is consistent with causality
typically a monotonic counter is used by the leader
followers may lag but the causal consistency is still guaranteed
Noncausal sequence number generators #
harder to generate sequence numbers for operations when the db is NOT single-leader replication (e.g. multi-leader / leaderless replication)
some ways:
each node generates its independent set of sequence numbers
e.g. if 2 nodes, then even/odd number sequences
problem: can’t compare if even-odd number
attach a timestamp from a time-of-day (physical) clock to each operation and make sure the resolution is high enough; this might even be sufficient to totally order operations
problem: might have clock skew when using physical clocks
used in the conflict resolution strategy behind last write wins (ref)
preallocated blocks of sequence numbers, for each block
problem: sequence number generation like these methods are not consistent with causality
this is because it preserves ordering within a node but not across nodes
Lamport Timestamps #
lamport timestamp is pretty simple:
a pair (counter, nodeID): a stamp from a logical clock that provides total ordering
if the counter value within the timestamp is the same, then the one with the greater node ID is ordered later
this part is similar to the even-odd numbers approach
improvement: every node and every client keeps track of the maximum counter value it has seen so far, and includes that maximum on every request; when a node receives a request or response with a maximum counter value greater than its own counter value, it immediately increases its own counter to that maximum
as long as the max counter value seen is carried along with every operation, the causal consistency is present
because every causal dependency results in an increased timestamp
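A minimal Lamport clock sketch (hypothetical; timestamps compare as `(counter, node_id)` tuples, which is what breaks ties by node ID):

```python
class LamportClock:
    """Sketch of a Lamport clock issuing (counter, node_id) timestamps.

    The node bumps its counter on every local operation and fast-forwards
    to any larger counter it sees, so every causal dependency results in
    a strictly larger timestamp.
    """
    def __init__(self, node_id):
        self.node_id = node_id
        self.counter = 0

    def tick(self):
        # local event or outgoing request
        self.counter += 1
        return (self.counter, self.node_id)

    def observe(self, stamp):
        # incoming request/response carrying the sender's max counter:
        # fast-forward to it, then stamp the event
        counter, _ = stamp
        self.counter = max(self.counter, counter)
        return self.tick()
```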
lamport timestamps are not version vectors, their purpose is different
version vectors: to distinguish whether two operations are concurrent or if one is causally dependent on the other
Lamport timestamps: to enforce a total ordering
\(\implies\) this total ordering is why it’s not possible to tell if 2 operations are concurrent or causally dependent
Timestamp ordering is not sufficient #
in some cases, the distributed nature of things means that total ordering (via lamport timestamps) is not sufficient
problem here is that the total order of operations only emerges after you have collected all of the operations. If another node has generated some operations, but you don’t yet know what they are, you cannot construct the final ordering of operations: the unknown operations from the other node may need to be inserted at various positions in the total order.
in order to implement something like a uniqueness constraint for usernames, it’s not sufficient to have a total ordering of operations – you also need to know when that order is finalized.
that's what total order broadcast helps us do
Total Order Broadcast aka atomic broadcast #
to have the whole system (all nodes) agree on the same total ordering of operations
AND
to scale the system if the throughput is greater than what a single leader can handle
AND
handle leader failovers
\(\implies\) total order broadcast / atomic broadcast problem
Using total order broadcast #
KIV for consensus section but there’s a strong connection between total order broadcast and consensus
alternative lens: total order broadcast helps create a log (replication log, txn log, WAL) and delivering a message is like appending to that log
can be used to do state machine replication: is what is needed for db replication
- for replicas to be consistent with each other the ordering must be consistent with each other so that the writes are processed in the same order
can be used to implement serializable transactions
can be used to implement a lock service that gives fencing tokens (ref)
sequence number can serve as the fencing token
the log for fencing token requests is yet another log
it is stronger than timestamp ordering because the ordering is fixed at the time the messages are delivered – no node can retroactively insert a message into an earlier position
Implementing linearizable storage using total order broadcast #
recap:
- total order broadcast is async: messages get delivered reliably in a fixed order but there’s no guarantee when it gets delivered so there may be replication lags
- linearizability is more of a recency guarantee
if we have total order broadcast, we can build linearizable storage on top of it:
e.g. in the case of usernames, we could do a compare-and-set with original value = null, so only one of the concurrent compare-and-sets will succeed; this requires a linearizable register on which we can do atomic compare-and-set operations
implementing this (strictly, timeline/sequential consistency, a slightly weaker guarantee than linearizability) using total order broadcast as an append-only log:
append message to log, indicating that uname you want to claim
read the log, wait for the message you appended to deliver back to you
check for any msgs claiming uname desired. if first message is our message then we succeed, else abort the operation
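The steps above can be sketched over a shared log (hypothetical: a plain list stands in for a totally ordered, reliably delivered broadcast log that every node sees in the same order):

```python
def claim_username(log, username, my_msg_id):
    """Sketch: claim a username via an append-only, totally ordered log."""
    # 1. append a message to the log claiming the username
    log.append((my_msg_id, username))
    # 2. read the log back in delivery order, waiting for our own message
    for msg_id, name in log:
        if name == username:
            # 3. the first claim for this username in the log wins;
            #    if it isn't ours, abort the operation
            return msg_id == my_msg_id
    return False
```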
this helps guarantee linearizable writes but NOT linearizable reads (reading from a store that is updated async risks stale reads)
how to linearize reads:
sequence the reads using the log, perform the read when the actual message is delivered back to you
so message’s position in time is when the read happens
if the log allows fetching the position of the latest log message in a linearizable way, then we can query that position, wait for all the entries up to that position to be delivered, and then perform the read – that's how ZooKeeper's sync works
read from a replica that is synchronously updated on writes and hence up to date – used in chain replication
Implementing total order broadcast using linearizable storage #
the converse: given linearizable storage, how to build total order broadcast from it
ways:
assume we have a linearizable register that stores an integer and has an atomic increment-and-get operation
use an atomic compare-and-set operation
algo: for every message you want to send through total order broadcast, you increment-and-get the linearizable integer, and then attach the value you got from the register as a sequence number to the message. You can then send the message to all nodes (resending any lost messages), and the recipients will deliver the messages consecutively by sequence number.
- the number we get from incrementing a linearizable register will form a sequence with no gaps
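The sender side of this algorithm can be sketched as follows (hypothetical: a lock stands in for the linearizable register, and `send` is whatever retried delivery mechanism the system has; the hard part in reality is making the register fault-tolerant):

```python
import threading

class SeqRegister:
    """Stand-in for a linearizable integer register with an atomic
    increment-and-get operation."""
    def __init__(self):
        self._n = 0
        self._lock = threading.Lock()

    def increment_and_get(self):
        with self._lock:
            self._n += 1
            return self._n

def total_order_send(register, send, message):
    # attach a gap-free sequence number to the message; recipients deliver
    # strictly in sequence-number order, buffering out-of-order arrivals
    seq = register.increment_and_get()
    send(seq, message)
    return seq
```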
it's difficult to make a linearizable integer register with an atomic increment-and-get operation; the difficulty lies in handling failure cases when nodes fail:
- so need to handle connection interruptions
- handle restoring the value when that node fails
it can be proved that a linearizable compare-and-set (or increment-and-get) register and total order broadcast are both equivalent to consensus
Distributed Transactions and Consensus #
there are many subtleties behind this problem (getting nodes to agree on something) that need to be appreciated
- understanding builds on: replication, txns, system models, linearizability and total order broadcast
situations when consensus is important:
leader-election
leadership position might be contested if network faults make it such that some nodes can’t communicate with other nodes \(\implies\) need the nodes to come to consensus else bad failover \(\implies\) split-brain situation \(\implies\) data divergence, inconsistency and data loss
atomic commits (solving the atomic commit problem)
consider db with txns that span multiple nodes / partitions and there’s the problem of txns failing on some nodes but succeeded in others
to get txn atomicity (ref) we need all the nodes to have a consensus on how to deal with the txn (abort or commit)
Atomic commit is formalized slightly differently from consensus: an atomic transaction can commit only if all participants vote to commit, and must abort if any participant needs to abort. Consensus is allowed to decide on any value that is proposed by one of the participants. However, atomic commit and consensus are reducible to each other [70, 71]. Nonblocking atomic commit is harder than consensus – see the discussion of 3PC below
we explore the atomic commit problem progressively, going from using 2PC then to Zab (used in Zookeeper) then Raft (etcd)
Impossibility of Consensus #
the FLP result (Fischer, Lynch, and Paterson): a proof that no algorithm is always able to reach consensus if there’s a risk that a node may crash
FLP is proven in the context of the asynchronous system model, which is restrictive:
- algo has to be deterministic
- algo can’t use clocks or timeouts
if the algo can find a way to identify suspected crashed nodes (e.g. using timeouts), or can use random numbers, then it’s possible to achieve consensus
Atomic Commit and Two-Phase Commit (2PC) #
- recap: benefits of atomic transactions
- outcome is either a commit or an abort & rollback
- prevents failed txns from littering the db with half-finished, half-updated state
- ensures that the secondary index stays consistent with the primary data, remains useful
from single-node to distributed atomic commit #
for a single db node, the txn, besides making its writes durable, also appends a commit record to the log on disk – this enables crash recovery
transaction commitment crucially depends on the order in which data is durably written to disk: first the data, then the commit record
it’s a single device (the controller for the disk drive attached to the single node) that makes the commitment atomic
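A minimal sketch of that ordering, assuming a toy text log format (my own, not any real db’s): data records are forced to disk before the commit record, and recovery counts a txn as committed only if its commit record made it to disk.

```python
import os
import tempfile

def commit_txn(log_path, txn_id, writes):
    """Append data records, force them to disk, then append the commit record."""
    with open(log_path, "ab") as log:
        for key, value in writes:
            log.write(f"DATA {txn_id} {key}={value}\n".encode())
        log.flush()
        os.fsync(log.fileno())   # 1) data durable first
        log.write(f"COMMIT {txn_id}\n".encode())
        log.flush()
        os.fsync(log.fileno())   # 2) then the commit record: the point of no return

def committed_txns(log_path):
    """Crash recovery: only txns with a commit record on disk count as committed."""
    with open(log_path, "rb") as log:
        return {line.split()[1].decode() for line in log
                if line.startswith(b"COMMIT")}

path = tempfile.NamedTemporaryFile(delete=False).name
commit_txn(path, "t1", [("x", "1"), ("y", "2")])
assert committed_txns(path) == {"t1"}
```

If a crash happens between the two fsyncs, recovery finds data records but no commit record, so the txn is treated as aborted – the single ordered write of the commit record is what makes the commit atomic.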
NOTE: the book is outdated in its description that nosql data stores don’t offer distributed txns
the claim (“most NoSQL distributed datastores do not support distributed transactions, but various clustered relational systems do”) was broadly accurate when DDIA was written (the 2010–2017 NoSQL wave) but no longer holds without qualification:
- early NoSQL systems (Dynamo-style, Bigtable-style, Cassandra) deliberately avoided distributed transactions: 2PC hurts availability, cross-partition coordination hurts latency, global locking hurts throughput – so the atomicity scope was one key / one partition
- since then, many systems have added distributed transaction support, partly because Google Spanner proved it was feasible:
- MongoDB: multi-document distributed transactions (snapshot isolation + 2PC-style coordination)
- CockroachDB / YugabyteDB: Spanner-inspired distributed ACID over Raft
- FoundationDB: key-value store designed around distributed ACID transactions from the start
- Spanner itself: globally distributed transactions with TrueTime-based ordering
- some systems still avoid general distributed transactions: Cassandra (lightweight transactions are per-partition CAS only), Dynamo-style stores (conditional writes + application-level compensation), Redis Cluster (same-slot transactions only)
- nuance: “supports transactions” ≠ identical semantics – isolation levels, latency penalties, and operational limits vary, and support often means “available but discouraged on hot paths”
- a modern restatement: early NoSQL systems generally avoided distributed transactions, but many modern NoSQL and NewSQL systems now support them, usually with performance tradeoffs

requirements for a distributed atomic commit:
a node must only commit once it is certain that all other nodes in the txn are also going to commit
a txn commit must be irrevocable
this is different from the ability to run compensating transactions, because a compensating transaction is, from the db’s pov, a separate txn
more on compensating txns:
a compensating transaction is an explicit operation that semantically undoes the effect of a previously committed operation – logical undo, not physical rollback: the original operation has already committed, so you cannot roll it back atomically; you later apply a new operation to restore consistency
- they arose because distributed transactions were expensive or unavailable across service and administrative boundaries (payments, bookings, order fulfilment) – this led to the Saga pattern (1980s, revived in the microservices era)
- example: order placement – reserve inventory (commit), charge card (commit), create shipment (fails) \(\implies\) compensate by refunding the card and releasing the inventory; each step commits independently and has a defined compensating action
- still widely used today: microservice workflows and orchestrators, cross-boundary systems (external vendors, async APIs, services owned by different teams), and human-in-the-loop flows (refunds, reversals, approvals)
- where they break down:
- compensation is not rollback: it is explicit and visible, and may not fully restore state (a refund ≠ the original charge after fees/FX; sent emails cannot be unsent)
- some operations are irreversible (webhooks fired, packages shipped) – you can only mitigate, not undo
- between the failure and the compensation the system is temporarily inconsistent; invariants are violated
- complexity moves into application code: designing compensations, handling retries, duplicate compensations, and compensation failures
- modern consensus: compensating txns are a workflow-level recovery mechanism, not a substitute for ACID – good for long-running business processes and cross-system coordination, bad for tight invariants (ledgers, uniqueness constraints, correctness-critical counters)
- typical hybrid: ACID transactions inside each service or database, compensation between services
- takeaway: transactions prevent inconsistency; compensation repairs it
Introduction to 2PC (two-phase commit) #
algo for atomic txn commit across multiple nodes
the commit/abort process in 2PC is split into 2 phases (instead of a single commit request)
uses a coordinator aka transaction manager to manage the logic:
app reads and writes data on multiple db nodes as usual
db nodes are called participants in the txn
app ready to commit? coordinator begins phase 1: sends prepare request to each of the nodes
case 1: all participants say yes – coordinator sends phase 2: commit request
case 2: not all participants say yes – coordinator sends phase 2: abort request
analogy is like a western marriage ceremony
if one of the participants or the network fails during 2PC: if any of the prepare requests fail or time out, the coordinator aborts the transaction; if any of the commit or abort requests fail, the coordinator retries them indefinitely.
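The two phases can be sketched as a toy coordinator loop. The `Participant` interface (`prepare`/`commit`/`abort`) is my own simplification; a real implementation also logs decisions durably and retries commit/abort requests indefinitely, as described above.

```python
# Toy 2PC coordinator sketch. A prepare failure simply aborts everyone;
# durable logging, timeouts, and retries are omitted for clarity.
class Participant:
    def __init__(self, can_commit=True):
        self.can_commit = can_commit
        self.state = "init"

    def prepare(self):   # phase 1: vote, promising to be able to commit later
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):    # phase 2: irrevocable
        self.state = "committed"

    def abort(self):
        self.state = "aborted"

def two_phase_commit(participants):
    if all(p.prepare() for p in participants):   # phase 1: all vote yes?
        for p in participants:                   # phase 2: commit everywhere
            p.commit()
        return "committed"
    for p in participants:                       # phase 2: abort everywhere
        p.abort()
    return "aborted"

ok = [Participant(), Participant()]
assert two_phase_commit(ok) == "committed"
bad = [Participant(), Participant(can_commit=False)]
assert two_phase_commit(bad) == "aborted"
```

The key property the sketch preserves: no participant commits unless every participant has first voted yes.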
2PC (two-phase commit) \(\neq\) 2PL (two-phase locking)
- 2PC is for atomic commits in a distributed db
- 2PL is about providing serializable isolation
system of promises
it might seem that 2PC would break under packet loss and so on – why does it work?
there’s a bunch of promise-like steps, especially 2 points of no return (which a single-node atomic txn lumps into a single action: writing the commit record to the transaction log):
- when participant votes “yes” that it can commit later
- when the coordinator decides, that decision is irrevocable
coordinator failure
failure before sending the prepare requests: participant can safely abort the txn
failure after participant receives a prepare request and participant voted “yes” – then it can no longer abort unilaterally, participants need to wait for the coordinator
participant’s txn state is in doubt, uncertain
timeout doesn’t help here
technically participants could go and communicate with each other but that’s not part of the typical 2PC protocol
3PC (Three-phase commit) #
2PC is blocking atomic commit protocol because the nodes can be stuck waiting for the coordinator to recover
it could be tweaked to make the protocol non-blocking, but it’s not straightforward
3PC assumes that the network has bounded delay and that nodes have bounded response times – not a realistic assumption in most practical systems
nonblocking atomic commits need a perfect failure detector - a way to know whether a node has crashed or not
3PC is hard to implement so 2PC still popular
Distributed Transactions in Practice #
distributed transactions incur heavy performance penalties
- most of the cost comes from the additional disk forcing (fsync) required for crash recovery, plus the additional network round-trips
there’s a need to refine the meaning of “distributed transactions”
database-internal distributed transactions
all nodes participating in the txn run the same db software; some distributed dbs support internal transactions among their own nodes
this can use any protocol, so can optimise in their own way and work well
Heterogeneous distributed transactions
when there’s a mix of different db technologies
systems may be entirely different under the hood but a distributed txn across these systems must ensure atomic commit
this is a lot more challenging
Exactly-once message passing #
heterogeneous distributed txns allow diverse systems to be integrated
e.g. a message queue and a db that are entirely different technologies
an exactly-once message processing pattern is possible when all the systems involved can participate in the same atomic commit protocol
XA (X/Open XA aka eXtended Architecture) transactions #
standard for implementing 2PC across heterogeneous technologies, supported by many traditional relational dbs and message brokers
it’s a C-API for interacting with a transaction coordinator, it’s not a network protocol of its own
so it’s more of an API-binding
assumptions:
- the application uses a network driver / client library to communicate with the participant dbs / messaging services
note: it’s the txn coordinator that implements the XA API
the coordinator can keep a disk-based log of its decisions so that it can be revived after a crash
note: how these points have held up since the book was published:
the points on 3PC remain largely accurate; no major breakthroughs since DDIA (2017) have overturned the core critiques, and production systems continue to favor 2PC or consensus protocols like Paxos/Raft [bravenewgeek](https://bravenewgeek.com/tag/three-phase-commit/)
- 2PC is still blocking: participants can be stuck waiting for a crashed coordinator, and no widespread tweak has made it non-blocking in practice – systems rely on timeouts, presumed aborts, or XA heuristics [graphapp](https://www.graphapp.ai/blog/understanding-two-phase-commit-a-comprehensive-guide)
- 3PC’s requirement of bounded delays and response times still fails in asynchronous networks, which remain the realistic model; partitions cause timeouts that prevent progress [cs.unibo](https://www.cs.unibo.it/babaoglu/papers/pdf/commit.pdf)
- nonblocking atomic commit provably requires a perfect failure detector (distinguishing crashes from delays), which is impossible in asynchronous systems; encodings exist but remain theoretical [drops.dagstuhl](https://drops.dagstuhl.de/storage/00lipics/lipics-vol246-disc2022/LIPIcs.DISC.2022.35/LIPIcs.DISC.2022.35.pdf)
- 3PC is rarely implemented in full (e.g. not standard in PostgreSQL or MySQL XA); 2PC dominates XA transactions, and Spanner-style commits or Sagas/Paxos sidestep it entirely [developer.jboss](https://developer.jboss.org/docs/DOC-56051)
- recent research explores non-blocking commit variants, but the field has shifted toward leader-based replication (Raft) and CRDTs; no production adoption of 3PC [vldb](https://www.vldb.org/pvldb/vol15/p1337-lee.pdf)
holding locks when in doubt #
the main problem about the transaction status being in doubt is that it’s a blocking hold of the lock
\(\implies\) causes availability issues
typically row-level locks
recovering from coordinator failure #
- there are chances of in-doubt transactions being orphaned: when the coordinator can’t decide the outcome for various reasons (e.g. txn log corruption)
- reboots won’t fix the problem
- human (admin) must manually fix this
- XA implementations have heuristic decisions as escape-hatches:
- heuristic here is a euphemism for “probably breaking atomicity”
Limitations of distributed transactions #
the coordinator is itself a kind of database (it stores transaction outcomes), so running a coordinator for XA brings its own operational problems: it must be replicated and backed up with the same care as a db
for db-internal distributed txns, distributed version of SSI (serialisable snapshot isolation) is possible.
if any participating node is faulty, the whole distributed transaction fails – distributed transactions tend to amplify failures rather than contain them
Fault-Tolerant Consensus #
formalisation of the consensus problem: one or more nodes may propose values, and the consensus algorithm decides on one of those values.
satisfies the properties:
uniform agreement: no two nodes decide differently
integrity: no node decides twice
validity: if a node decides value \(v\), then \(v\) must have been proposed by some node
termination: every node that doesn’t crash eventually decides some value
the termination property formalises the idea of fault tolerance: the consensus algo must make progress in spite of node failures
system model of consensus assumes that when a node “crashes” it disappears and never comes back, so any algo that has to wait for a node to recover is not going to be able to satisfy the termination property
so 2PC won’t meet these requirements
the termination property is subject to the assumption that fewer than half the nodes are crashed or unreachable
there’s a limit to the number of failures that an algo can tolerate – typically need at least a quorum of nodes to be functioning to make forward progress
consensus algos and total order broadcast #
Popular algos:
- VSR (viewstamped replication)
- Paxos
- Raft
- Zab
most of the popular algos decide on a sequence of values, not a single value (which makes them total order broadcast algorithms)
they don’t directly use the formal consensus model that we use (i.e. the proposing and deciding, while satisfying the agreement, integrity, validity and termination properties)
total order broadcast is equivalent to performing repeated rounds of consensus, one round per message slot
each consensus decision corresponds to one message delivery:
agreement property: all nodes decide to deliver the same messages
integrity property: messages not duplicated
validity property: no message corruption no fabrication
termination property: messages are not lost
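The round-per-slot correspondence can be shown with a toy single-process simulation; `decide` here is a trivial stand-in (my assumption) for a real consensus algorithm that would make all nodes agree on the same choice:

```python
# Toy illustration of the equivalence: each round of "consensus" fills
# exactly one slot of the delivery log.
def decide(proposals):
    # stand-in for one consensus instance: deterministically picks one
    # proposal, so every node running this would agree on the same value
    return min(proposals)

def total_order_broadcast(pending):
    log = []
    pending = set(pending)
    while pending:
        chosen = decide(pending)   # consensus round i decides log slot i
        log.append(chosen)         # ...which is the delivery of one message
        pending.remove(chosen)
    return log

assert total_order_broadcast({"b", "a", "c"}) == ["a", "b", "c"]
```

The consensus properties map directly onto the log: agreement means every node builds the same log, integrity means no slot repeats a message, and termination means every pending message eventually lands in a slot.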
in the case of Paxos, this optimization is known as Multi-Paxos
Single-leader replication and consensus #
in the case of single-leader replication, the leader is effectively performing total order broadcast: it decides the order of writes in the replication log
the important part is how the leader is chosen
by making the leader election and failover automatic, followers can be the new leaders if the old leader fails and it brings the system closer to fault-tolerant total-order broadcast – closer to solving consensus
but implementing leader election seems to require a leader in the first place (a chicken-and-egg problem) – so we need another approach
Epoch numbering and quorums #
internally the consensus algos use a leader but they don’t guarantee that the leader is unique
\(\implies\) a weaker guarantee: protocols define an epoch number (aka ballot number [paxos], view number [Viewstamped Replication], term number [Raft]).
Within each epoch, the leader is unique
leader election happens when current leader thought to be dead \(\implies\) the election happens via an incremented epoch number
\(\implies\) epoch numbers are totally ordered and monotonically increasing
the higher epoch number is used to tie-break if there’s a contest for who is the leader
split-brain preventative measures: leaders need to check that there isn’t any other leader with a higher epoch number
this check is mediated by a quorum of nodes
so before a leader may decide anything, it must collect votes in favor of its proposal from a quorum of nodes; a node votes yes only if it is not aware of another leader with a higher epoch
so it ends up having 2 rounds of voting:
round 1: choose a leader
round 2: vote on the leader’s proposal
the quorums for these two votes must overlap: at least one of the nodes voting on the proposal must also have taken part in the most recent leader election
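The epoch-and-quorum mechanics above can be sketched as follows (the structure is my own simplification, not any specific protocol): a node votes for a proposal only if its epoch is at least the highest the node has seen, and a value is decided only with a majority of votes.

```python
# Sketch of epoch-numbered voting with a majority quorum.
class Node:
    def __init__(self):
        self.highest_epoch = 0

    def vote(self, epoch):
        if epoch < self.highest_epoch:   # stale leader from an old epoch: reject
            return False
        self.highest_epoch = epoch       # the higher epoch wins any contest
        return True

def propose(nodes, epoch):
    votes = sum(node.vote(epoch) for node in nodes)
    return votes > len(nodes) // 2       # quorum (majority) required to decide

nodes = [Node() for _ in range(5)]
assert propose(nodes, epoch=2) is True    # leader elected in epoch 2 succeeds
assert propose(nodes, epoch=1) is False   # demoted leader from epoch 1 is rejected
```

Because a majority voted in epoch 2, any later quorum must include at least one node that remembers epoch 2, which is what blocks the stale epoch-1 leader and prevents split brain.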
Limitations of consensus #
benefits of consensus algos:
bring concrete safety properties (4 properties above)
are fault-tolerant as long as majority of nodes are working
since they provide total order broadcast, they can implement linearizable atomic operations in a fault-tolerant way
Costs:
the voting process on proposals is similar in effect to synchronous replication
dbs in practice are often configured with asynchronous replication instead, where some committed data can be lost on failover – a risk typically accepted for better performance
blocked on network failure
consensus protocols typically assume a fixed set of nodes participates in voting, so you can’t dynamically add or remove nodes from the cluster
consensus algos can have extensions to support dynamic membership – less understood
typically reliant on timeouts to detect failed nodes – so if the network delays are too varied, then it’s hard to choose a correct timeout duration
this variability can result in many false positives in detecting node failures
false positives don’t affect correctness (safety), but frequent unnecessary leader elections add overhead and hurt performance
particular sensitivities to network problems:
- Raft: if one particular network link is consistently unreliable while the rest of the network works correctly, leadership can bounce between two nodes, or the current leader is continually forced to resign \(\implies\) the system can’t actually make effective progress
Membership and Coordination Services #
“distributed key-value stores” / “coordination and configuration services” – what things like ZooKeeper/etcd are called
- their APIs feel like DB apis
- they implement consensus algorithms
the reliance on such tools is usually indirect, as a side effect of using another tool that depends on them (like Hadoop or Kafka)
these tools are designed to hold data that fits entirely in memory (even though they persist it to disk for durability)
this data is what is replicated across all the nodes using a fault-tolerant total order broadcast system
some other features that ZooKeeper gives:
basic stuff: total order broadcast (therefore consensus)
fault-tolerant data replication, same writes in the same order to make replicas consistent with each other
linearizable atomic operations (this is the only feature that really requires consensus)
using atomic set-and-compare, we can get a lock
consensus protocol guarantees that the operation will be atomic and linearizable
the distributed lock is usually implemented as a lease, with an expiry time so that it’s eventually released if the client fails (see failure cases)
total ordering of operations
- ref leader and the lock, where we need a fencing token to prevent clients from conflicting with each other when there are process pauses
failure detection
each client maintains a long-lived session with the ZooKeeper servers and exchanges heartbeats over it; if heartbeats stop for longer than the session timeout, the session is declared dead
ZooKeeper also offers ephemeral nodes that are automatically deleted when the session times out
change notifications
- clients can subscribe to notifications of changes (e.g. another client joining the cluster or failing), which saves the overhead of polling
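The lock-as-lease and fencing-token ideas above can be combined in one small sketch. This is an in-memory stand-in for a linearizable store such as ZooKeeper (the class and method names are my own, not the ZooKeeper API): `acquire` is the atomic compare-and-set step, and each successful acquisition hands out a monotonically increasing fencing token.

```python
import time

# Sketch of a lease-based lock with fencing tokens.
class LockService:
    def __init__(self, lease_seconds=10.0):
        self.owner = None
        self.expires_at = 0.0
        self.fencing_token = 0
        self.lease_seconds = lease_seconds

    def acquire(self, client, now=None):
        now = time.monotonic() if now is None else now
        # compare-and-set: succeed only if free or the previous lease expired
        if self.owner is None or now >= self.expires_at:
            self.owner = client
            self.expires_at = now + self.lease_seconds  # lease, not a forever lock
            self.fencing_token += 1                     # totally ordered tokens
            return self.fencing_token
        return None   # lock held by someone else with a live lease

svc = LockService(lease_seconds=10.0)
assert svc.acquire("a", now=0.0) == 1
assert svc.acquire("b", now=5.0) is None   # lease still live: rejected
assert svc.acquire("b", now=15.0) == 2     # lease expired: new holder, higher token
```

A downstream resource then rejects any request carrying a fencing token lower than the highest it has already seen, so a paused client holding an expired lease (token 1) cannot clobber the new holder’s writes (token 2).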
Allocating work to nodes #
a feature useful for job schedulers / stateful systems is for follower nodes to automatically get promoted to leaders
rebalancing load in a partitioned resource: when new nodes join the cluster, some partition assignments should move to them; when nodes are removed or fail, other nodes need to take over their work
should use applications like Apache Curator that offer higher-level APIs on top of ZooKeeper client APIs instead of trying to implement the consensus algos from scratch
ZooKeeper itself runs on a fixed, small number of nodes (usually 3 or 5) that perform the voting, while supporting a potentially large number of clients
so a cluster of potentially thousands of nodes “outsources” its coordination work to the small ZooKeeper ensemble
Service Discovery #
- these systems used for service discovery as well
- at time of writing, it’s not clear if service discovery needs consensus, DNS being the OG way of looking up IPs for services
- in the case of DNS,
- DNS reads are not linearizable
- sometimes tolerated if DNS reads are stale
- more important that DNS is reliably available, robust to network interruptions
- Leader-election needs consensus though, so can use the consensus systems to tell other services who the leader is \(\implies\) it makes sense to have RO caching replicas of consensus systems to read this information
Membership Services #
- membership services is the category that tools like ZooKeeper and friends would fall under
- membership service determines which nodes are currently active and live members of a cluster.
- typically unbounded network delays make it hard to detect if a node has failed BUT…
- with failure detection and consensus, nodes can come to an agreement about who should be considered alive or not
- false positives on the dead-node detection is still possible, but it’s tolerable
- to be even able to do leader election, the first step is to have consensus on who the current members are
Summary #
terms :
linearizability: linearizability, a popular consistency model: its goal is to make replicated data appear as though there were only a single copy, and to make all operations act on it atomically.
causality: happened before relationship, weaker consistency model than linearizability
not sufficient to have causal ordering (e.g. username selection) \(\implies\) need consensus
consensus
range of problems that are reducible to consensus (there’s some equivalence):
- Linearizable compare-and-set registers
- the register atomically decides whether to set its value, based on whether its current value equals a parameter given in the operation
- Atomic transaction commit
- for distributed transactions, the db must decide whether to commit or abort the transaction
- Total order broadcast
- messaging system must decide on the order in which to deliver messages
- Locks and leases
- several clients trying to grab a lock / lease, then lock decides which one acquires it successfully
- Membership / Coordination service
- System must decide, given a failure detection mechanism like timeouts, which nodes are members of the distributed system
- Uniqueness constraint
- multiple transactions trying to create conflicting records for the same key, constraint is to decide which one to allow and which to fail with a constraint violation
single leader systems are simpler, if leader fails we have the following options:
wait for the leader to recover
but this doesn’t satisfy the termination property for consensus; can possibly be blocked forever
e.g. XA/JTA txn coordinators do this
manual human failover
typically done by relational dbs; consensus by “act of God”; speed of failover is limited by the speed at which the human can act, which is slower than computers
use an algo for leader election
needs a consensus algo, use a proven algo
who doesn’t need consensus?
leaderless and multi-leader replication systems
typically don’t need global consensus; they encounter conflicts, but that may be tolerable for the application
maybe we simply need to cope without linearizability and learn to work better with data that has branching and merging version histories.