- rtshkmr's digital garden/
- Readings/
- Books/
- Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems/
- Chapter 5. Replication/
Chapter 5. Replication
Table of Contents
Complexity in replication is from handling changes to replicated data.
A typical area of misunderstanding is the meaning of eventual consistency, which this chapter goes into in detail.
Dimensions to consider:
- replication algorithm: single / multi / no leader
- sync/ async replication
- how to handle failures @ replicas
Leaders and Followers #
Leader-based replication / active/passive / master-slave:
it's only possible to write to the leader (aka master / primary)
leader sends changes to replication log / change stream
follower takes log from leader and updates local copy, preserving write order and so on
clients wanting to do reads can read from any replica
leader based replication also works for distributed message brokers.
Synchronous Versus Asynchronous Replication #
applies to any replicated system
for the replicated log, it’s a matter of whether the leader waits for the ack from the followers or not
semi-synchronous setup:
we don't want the entire system blocked by synchronous replication, so even in the "sync" config, the ack is expected from roughly one follower node, the rest being async. This guarantees that at least 2 nodes have an up-to-date copy.
experimental replication algos: chain-replication
KIV for chapter 9: relationship b/w consistency of replication and consensus
fully async:
[con] writes are not guaranteed to be durable (if the leader fails before replication has happened, the write is lost)
[pro] leader can be unblocked and so can continue processing writes even if there’s a lag
Setting Up New Followers #
Key idea is to make a snapshot first and keep track of the log sequence when that snapshot happened. We should try to do this without any db locking so that functionality remains highly-available.
Then fill the new replica with the snapshot
Then refer to the replication log and copy over everything from when the snapshot happened – that's how the newly created follower catches up
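The snapshot-then-catch-up procedure can be sketched in a few lines (hypothetical in-memory structures, not any particular db's API):

```python
# Sketch: bootstrapping a new follower from a snapshot plus the
# replication log, without locking the leader. All names here are
# illustrative, not a real database interface.

def take_snapshot(leader_data, log):
    """Copy current state and remember the log position it corresponds to."""
    return dict(leader_data), len(log)  # (snapshot, log sequence number)

def bootstrap_follower(snapshot, snapshot_pos, log):
    """Initialise a follower from the snapshot, then replay later log entries."""
    follower = dict(snapshot)
    for key, value in log[snapshot_pos:]:  # only entries after the snapshot
        follower[key] = value
    return follower

# Usage: the leader keeps accepting writes while the snapshot is restored.
log = [("x", 1), ("y", 2)]
leader = {"x": 1, "y": 2}
snap, pos = take_snapshot(leader, log)
log.append(("x", 3))   # write arrives during bootstrap
leader["x"] = 3
follower = bootstrap_follower(snap, pos, log)
assert follower == leader  # follower has caught up
```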
Handling Node Outages #
follower failure: catch-up recovery mechanisms
a follower keeps an internal log of updates. So if it goes down, it just needs to request from the leader the missing updates from when it went down (referenced locally) to now (referenced indirectly via the leader)
leader failure: failover mechanism
can be auto or manually handled
auto process:
- detect leader failure
- typically via a heartbeat check with timeouts
- leader election
- best candidate is the replica with the most updated info
- can be elected by mechanisms such as using a controller node
- reconfigure the system to use the new leader
- need to avoid situations like the old leader being live again and still thinking that they’re the leader \(\implies\) system needs to make sure that the old leader becomes a follower replica
sources of error, complexity:
durability problems e.g. if it’s async replication and the new leader hasn’t received all the updated writes from the old leader before the old leader had failed
might end up discarding writes
[problems with discarded writes] dangerous to discard writes if there’s data dependencies (other storage systems outside the db coordinated with db contents)
e.g. say a running counter exists for some other info and the counter info gets lost because of discarded writes, might end up reusing counts or something.
fault scenario: split brain
when 2 nodes believe they’re the leader, it would lead to data corruption and all
correct timeout duration hard to choose
this complexity is a good reason why people often prefer to perform failovers manually
the problems are common to most distributed systems: node failures; unreliable networks; and trade-offs around replica consistency, durability, availability, and latency
Implementation of Replication Logs #
Approaches:
statement-based replication
not the best approach
mechanism:
write-requests are statements that get forwarded to the followers
followers execute statements as though they got it from the client
problems:
non-deterministic functions (e.g. NOW()) evaluate to different values on each replica
auto-increment / sequences must be executed in the same order as on the leader for the outcomes to be the same
statements with side-effects are hard to manage
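A toy illustration of the NOW() problem (simulated clock, not real SQL): statement-based replication makes each replica evaluate the function itself, whereas shipping the already-evaluated value keeps replicas identical:

```python
# Sketch: replaying the statement "SET t = NOW()" on two replicas.
# Each replica evaluates NOW() at its own replay time, so the copies
# diverge; shipping the *evaluated value* instead keeps them identical.
import itertools

clock = itertools.count(100)  # fake wall clock, ticks on every call

def now():
    return next(clock)

# Statement-based: each replica evaluates NOW() itself.
leader_stmt = now()     # leader executes at t=100
follower_stmt = now()   # follower replays later, at t=101
assert leader_stmt != follower_stmt  # replicas diverge

# Logical (row-based): the leader evaluates once and ships the value.
value = now()
leader_row, follower_row = value, value
assert leader_row == follower_row    # replicas agree
```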
WAL Shipping
writes appended to a log
log structured storage engine – SSTables, LSM-trees
with log segment compaction and garbage collection happening in the background
B-trees: overwriting of disk blocks
just reuse the WAL for replication to the followers; the log can be the basis of comms over a network
disadvantage = coupling of the log to the storage engine
it’s because the log is @ a low level (byte changes on the disk)
so this coupling means the leader and followers can't run different versions of the same db software – the versions must match.
Postgres faces this problem!
if replication worked across versions, zero-downtime upgrades of the db software would be easier
Logical (row-based) log replication
logical logs are decoupled from the physical data representation
so different log formats for replication and for the storage engine
there’s row-level granularity of the information that is kept
decoupling allows easier backwards compatibility \(\implies\) leader and followers can run different versions of the db software, or even different db storage engines
allows the parsing to be done by other observers / other data sinks e.g. data warehouse for analysis
- KIV chapter 11 for change data capture
Trigger based replication
using features within the relational dbs: triggers and stored procedures
trigger: exec custom app code
typically an external process
has greater overheads and is more prone to bugs than other replication methods
gives a lot of flexibility to us though.
Problems with Replication Lag #
leader based replication is helpful for read-heavy systems – read-scaling architecture is when you scale the followers based on the pressure on reads
this works best if it’s async replication
eventual consistency – term is intentionally vague, relates to how the followers do catchup
- dependent on the replication lag, which is more apparent when the system is operating @ capacity
Reading Your Own Writes #
possible loss of read-after-write consistency: if you write to the leader but immediately read from a stale replica while there's replication lag, the write appears missing \(\implies\) may appear to the user as lost data
read-after-write consistency is what we want, together with leader-based replication, following techniques:
read anything the user may have modified from the leader – works if the system is mostly read-heavy (so writes / updates are infrequent)
time based monitoring of the replication lag \(\implies\) becomes a dynamic, best-effort approach
system can ensure that the replica serving any reads for that user reflects updates at least until that timestamp, else the read is handled by another replica, or waits until the replica has caught up
this needs the client to remember (logical/wall) timestamp of its most recent write
cross-device read-after-write consistency e.g. writing from one and viewing it from another device
- remembering the timestamp part needs to be centralised in some way so that the other device also knows this
- some forced routing hack
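The timestamp-tracking technique can be sketched as follows (Replica, read_own_writes, and the logical timestamps are illustrative names, not a real API):

```python
# Sketch: the client remembers the (logical) timestamp of its last write
# and only accepts reads from a replica that has caught up to it.
# Real systems track log sequence numbers rather than toy integers.

class Replica:
    def __init__(self):
        self.applied_ts = 0  # highest write timestamp applied so far
        self.data = {}

    def apply(self, key, value, ts):
        self.data[key] = value
        self.applied_ts = ts

def read_own_writes(replicas, key, last_write_ts):
    """Return the value from any replica that reflects the client's last write."""
    for r in replicas:
        if r.applied_ts >= last_write_ts:
            return r.data.get(key)
    return None  # all replicas stale: wait, or fall back to the leader

fresh, stale = Replica(), Replica()
fresh.apply("profile", "new bio", ts=7)
stale.apply("profile", "old bio", ts=3)
# Even if a stale replica is tried first, the stale value is rejected.
assert read_own_writes([stale, fresh], "profile", last_write_ts=7) == "new bio"
```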
Monotonic Reads #
monotonic reads == if a user makes several reads in sequence, they will not see time go backward – i.e. they won't read older data after having previously read newer data
how this anomaly arises:
A user first reads from a fresh replica, then from a stale replica; time appears to go backward. Monotonic reads is the guarantee against this anomaly.
mechanisms:
- read from the same replica, e.g. by making the choice of replica based on the hash of their user id
- some deterministic routing strategy
- guarantees:
- weaker guarantee than strong consistency
- stronger guarantee than eventual consistency
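A minimal sketch of the hash-based routing idea (assuming a fixed replica count; real systems also need rerouting when a replica fails):

```python
# Sketch: deterministic replica choice so that a given user always reads
# from the same replica, giving monotonic reads for that user.
import hashlib

def replica_for(user_id: str, n_replicas: int) -> int:
    """Stable hash of the user id -> replica index."""
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    return int(digest, 16) % n_replicas

# The same user is always routed to the same replica...
assert replica_for("alice", 3) == replica_for("alice", 3)
# ...so a sequence of reads can't hop to a more-stale replica mid-session.
```

If the chosen replica fails, reads must be rerouted to another replica, and the guarantee temporarily weakens – one reason this is weaker than strong consistency.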
Consistent Prefix Reads #
if a sequence of writes happens in a certain order, then anyone reading those writes will see them appear in that same order.
better understood from a 3rd person reader pov
typically a problem in sharded dbs, and we have causal dependencies in the writing of the data.
anomaly avoided if db always applies the writes in the same order (without getting affected by replication lag)
hard to do because partitions typically operate independently \(\implies\) no global ordering to the writes
there are algos to help us deal with these causal dependencies
Solutions for Replication Lag #
guiding question: “what happens if we make the replication lag go to several mins / hours”
- case A: nothing
- case B: impacts UX \(\implies\) need to figure out what guarantees we want
- we can’t pretend that it’s synchronous when it’s actually async
an application can provide a stronger guarantee than the underlying database, it’s just complex and error-prone
easier to rely on transaction primitives instead: they are a way for a database to provide stronger guarantees so that the application can be simpler.
the difficulty is when we progress beyond single-node transactions and consider distributed systems
Multi-Leader Replication #
to move away from just having a single leader for writes (single point)
- aka multi-leader, master-master, active/active replication
each leader simultaneously acts as a follower to the other leaders.
Use Cases for Multi-Leader Replication #
multi-datacenter operation
just need to make sure that there’s at least one leader in each datacenter
typically considered dangerous and good to avoid because:
- multi-leader replication is typically retrofitted into dbs so config pitfalls may be many
- has nuances like autoincrementing keys and all that, which make things complex
benefits over single-leader config:
writes no longer all funnel through a single datacenter \(\implies\) better latency for writes
perceived performance improves also
outage-tolerance – no more single failure point
on outage, there’s no need for failover mechanisms, the other data-centers’ write leader will still work until restore
network-tolerance
- within a datacenter, replication between the local leader and its followers uses the fast local network; between datacenters, async replication tolerates network problems better – temporary interruptions won't prevent writes from being processed
Clients with offline operation
for offline use-cases also
architecturally similar to the multi-leader replication case with multi-datacenters
local device has a local leader, there’s an async replication process with other devices – the replication lag may be long
CouchDB supports this multi-leader config
Collaborative Editing
- realtime collaborative editing, typically reliant more on the conflict resolution algos
- this problem is typically not seen as a db replication problem, so the similarity is as such:
local changes are committed to the local replica (document in browser)
then async replication onto the server, all editing the same document
e.g. to avoid editing conflicts, need to lock the document / area of the document before edits \(\implies\) similar to single-leader replication with transactions on the leader (atomic, blocking)
improving collaboration:
keep the unit of change small (e.g. keystroke), avoid locking
makes the multi-leader replication and conflict resolution problems more pronounced
Handling Write Conflicts #
- main problem is write conflicts \(\implies\) need conflict resolution; this problem doesn't happen in single-leader replication (which would instead exhibit blocking behaviour)
Sync vs Async conflict detection #
- there are 2 options for conflict detection:
sync detection: wait for the write to be replicated to the replicas, then tell the user the outcome (succeeded or conflicted)
problem: loses the independent-writes benefit of multi-leader replication \(\implies\) might as well use single-leader replication
async detection (the usual case): both writes succeed; the conflict is only detected later, so it must be resolved after the fact
Conflict avoidance #
- typically recommended because the multi-leader replication is hard so this part is likely going to give trouble
- avoid by making the routing deterministic (e.g. all writes for a given record go to a particular datacenter); the problem is that this breaks down if that datacenter needs to be taken offline for whatever reason
Converging towards a consistent state #
- in multi-leader replication there's no defined ordering of writes, so replicas can't tie-break on their own – it's not clear what the final value should be
- there’s a need for convergent conflict handling – all replicas arrive at the same final value when all changes have been replicated
- approaches to arriving at a convergent, consistent state:
LWW (last-write wins) - popular but dangerous because of possible data loss
kiv end of this chapter
give each replica a unique ID and let writes from the higher-numbered replica take precedence in the tie-break
also prone to data-loss
other merging strategies e.g. alphabetical ordering / merging of data
custom resolution logic: record the conflict and present it to the user for user-assisted conflict resolution
like git and all
Custom resolution logic (via conflict handler) #
conflict resolution is typically application dependent, so tools typically allow use of custom conflict resolution logic
can be executed @ reads or @ writes
- reads: defer the handling until the time when it’s read, the conflict handler is invoked then. (couchdb works like this)
conflict resolution usually applies at the level of an individual row / document – it’s not applied to an entire transaction
so it has to break down the transaction to writes within it
automatic resolution #
since DDIA was written, this area has improved a lot. There’s good tooling around this.
some ways:
- CRDTs (conflict-free replicated datatypes)
- family of datastructures, with sets, maps, ordered lists, counters…
- uses 2-way merges
- mergeable persistent data structures
- history tracked explicitly, like in Git
- merging is done in 3-way merge
- operational transformation
- conflict resolution algo specifically for collaborative editing (Etherpad, Google Docs)
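As a concrete taste of the CRDT approach, here is a grow-only counter (G-Counter), one of the simplest CRDTs; the merge is commutative, associative, and idempotent, which is what guarantees convergence:

```python
# Sketch: a grow-only counter (G-Counter). Each replica increments only
# its own slot; merge takes the per-slot max. Because max is commutative,
# associative, and idempotent, replicas converge regardless of the order
# in which they exchange state.

def g_increment(counter: dict, replica_id: str) -> dict:
    c = dict(counter)
    c[replica_id] = c.get(replica_id, 0) + 1
    return c

def g_merge(a: dict, b: dict) -> dict:
    return {r: max(a.get(r, 0), b.get(r, 0)) for r in set(a) | set(b)}

def g_value(counter: dict) -> int:
    return sum(counter.values())

# Two replicas increment concurrently, then sync in either order.
a = g_increment(g_increment({}, "A"), "A")   # replica A counted twice
b = g_increment({}, "B")                     # replica B counted once
assert g_merge(a, b) == g_merge(b, a)        # merge order doesn't matter
assert g_value(g_merge(a, b)) == 3           # no increments lost
```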
Improvements in this field since this edition of the textbook was published:
Logs became substrates (Kafka)
Actors became substrates (Erlang/BEAM)
CRDTs are now often:
- Embedded in storage engines
- just consumed, not manually written out
- Exposed as APIs
- Hidden behind collaboration frameworks
---
## Automatic Conflict Resolution in Multi-Leader Systems (Landscape)
**Core framing (DDIA):**
Real-time collaboration and offline-first systems are best understood as **multi-leader replicated systems**. Once multiple leaders accept concurrent writes, **conflicts are inevitable**, so systems must define *deterministic resolution semantics*. Ad-hoc resolution logic is brittle and can lead to surprising bugs (e.g. Amazon shopping cart deletions reappearing).
---
## Main Lines of Work
### 1. CRDTs (Conflict-Free Replicated Data Types)
**Idea:** Design data structures whose merge operations are *commutative, associative, and idempotent*, guaranteeing convergence under multi-leader replication.
* Shift from “resolve conflicts at runtime” → **encode resolution in the data model**
* Modern CRDTs support:
* Counters, sets, maps
* Ordered sequences (text, arrays)
* Nested, JSON-like structures
* Strong fit for:
* Real-time collaboration
* Offline-first systems
* Low-latency, coordination-free writes
* State of the art:
* Widely used in production (collaborative editors, sync engines)
* Typically consumed via libraries/frameworks rather than implemented by hand
* Limitation:
* Guarantee convergence, *not* business invariants or semantic correctness
> CRDTs are best seen as a **replication substrate** for multi-leader systems, not a general replacement for transactions.
---
### 2. Operational Transformation (OT)
**Idea:** Operations are rewritten dynamically when concurrency is detected so user intent is preserved.
* Designed specifically for ordered lists (e.g. text)
* Used historically by Google Docs, Etherpad
* Strengths:
* Excellent intent preservation for text
* Efficient metadata
* Weaknesses:
* Complex correctness proofs
* Often requires central coordination
* Hard to generalize beyond text
* Status today:
* Mature but niche
* Largely stable, not expanding in scope
> OT resolves conflicts *procedurally*; CRDTs resolve them *structurally*.
---
### 3. Mergeable Persistent Data Structures (Git-like)
**Idea:** Track full history and perform **three-way merges**.
* Explicit conflict surfacing
* Human intervention expected
* Strong fit for:
* Code
* Configurations
* Developer workflows
* Poor fit for:
* Real-time collaboration
* Continuous user interaction
> This is multi-leader replication with **manual conflict resolution**, not automatic convergence.
---
## Tooling & Practice Today
* CRDTs have moved from research into **production frameworks** and **sync layers**
* Databases increasingly **embed CRDT ideas**, but cautiously
* Most real systems are **hybrid**:
* CRDTs for collaborative / coordination-free state
* Transactions and validation for invariant-heavy logic
---
## Key Lessons
* Automatic conflict resolution works best when:
* Semantics are simple
* Invariants are weak
* User intent is local
* CRDTs prevent divergence, but **cannot encode arbitrary business rules**
* Multi-leader systems must choose where conflicts are resolved:
* In the data model (CRDTs)
* In algorithms (OT)
* Or by humans (Git-style merges)
**Bottom line:**
Automatic conflict resolution has matured significantly since DDIA—especially via CRDTs—but it remains a *domain-specific tool*, not a universal solution.
Multi-Leader Replication Topologies #
- replication topology: the communication paths along which writes are propagated
- in circular and star topologies:
- nodes need to forward writes received from other nodes, so there's a need to avoid infinite replication loops – can be done by tagging each write with the identifiers of the nodes it has already passed through
- don’t confuse with star schema (which is about the structure of the data model)
- circular and star topologies might end up breaking if one of the nodes fails – the paths don't have redundancies \(\implies\) needs a more densely connected topology
- problem with density:
writes may arrive at some replicas in the wrong order (consistent prefix reads may be violated); can't just rely on clock sync to correct the ordering of events
\(\implies\) might need to rely on version vectors
Leaderless Replication #
dynamo style: Amazon's Dynamo system was an early leaderless-replication design; others are described as being in dynamo style: Riak, Cassandra, Voldemort
this is not the same as dynamodb (which uses single-leader replication), the dynamo system is separate
approaches:
- client directly sends to several replicas
- use a coordinator node, which does this on behalf of the client
- coordinator doesn’t enforce the particular ordering of writes though
Writing to the Database When a Node Is Down #
- in this case, if we send a write to multiple replicas and one is down, that replica misses the write – its data becomes stale
- so when doing reads, we have the following strategies:
read repair and anti-entropy
read repair
reading from several nodes in parallel, can detect stale data
anti-entropy process
a background process that constantly looks for differences in the data between replicas and copies any missing data from one replica to another.
- without an anti-entropy process, values that are rarely read may be missing from some replicas, reducing durability
quorums for reading and writing
- typical quorum requirement (strict quorum): w + r > n, where the variables, as set by us, are:
  - n: number of replicas
  - w: every write must be confirmed by w nodes to be considered successful
  - r: must query at least r nodes for each read
- we can tolerate unavailable nodes by setting the variables in specific ways
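The condition is easy to check mechanically; a minimal sketch:

```python
# Sketch: checking the strict-quorum condition w + r > n and seeing
# which configurations guarantee read/write overlap.

def is_strict_quorum(n: int, w: int, r: int) -> bool:
    """Read and write node sets must overlap in at least one node."""
    return w + r > n

# Common Dynamo-style config: n=3, w=2, r=2 tolerates one unavailable node.
assert is_strict_quorum(3, 2, 2)
# w=1, r=1 over n=3 is fast but gives no overlap guarantee.
assert not is_strict_quorum(3, 1, 1)
```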
Limitations of Quorum Consistency #
quorums are not necessarily majorities; it only matters that the sets of nodes used by the read and write operations overlap in at least one node.
cases where even w + r > n may have edge cases:
- when using sloppy quorums, the overlap guarantee may not be there
- 2 writes happening concurrently: not clear which was done first, so the only safe way is to merge the concurrent writes
- concurrent R and W: undetermined whether the read returns the old or new value
- partial success on W, i.e. quorum not reached for that W \(\implies\) W reported as failed – subsequent reads may / may not return that write
- if an updated replica fails AND restore is done using a stale replica, then the number of replicas storing the new value may fall below w and the quorum condition may be broken
- timing / linearizability problems
quorums do not provide the stronger guarantees we may need:
reading your writes
monotonic reads
consistent prefix reads
The parameters w and r allow you to adjust the probability of stale values being read, but it's wise not to take them as absolute guarantees; to get the stronger guarantees, have to rely on transactions / consensus
Monitoring staleness #
- we can only do this for leader-based replication
- by subtracting a node's current position from the leader's current position, we get the replication lag
Sloppy Quorums and Hinted Handoff #
- characteristics that make dbs with leaderless replication appealing for high availability, low latency use cases:
- if correctly configured quorums, can tolerate the failure of individual nodes
- can tolerate individual nodes going slow, because requests don't have to wait for all n nodes to respond
- there’s cases of network partitioning / interruption, where technically a client should be able to reach some db nodes, so tradeoff:
return errors to all requests for which we can't get a quorum of w or r?
or accept writes anyway, writing to some nodes that we can reach but not among the n on which the value usually lives? this is called a sloppy quorum
not actually a quorum, it's an assurance of durability
idea:
By analogy, if you lock yourself out of your house, you may knock on the neighbor’s door and ask whether you may stay on their couch temporarily. Once the network interruption is fixed, any writes that one node temporarily accepted on behalf of another node are sent to the appropriate “home” nodes.
Once you find the keys to your house again, your neighbor politely asks you to get off their couch and go home.
the second part is hinted handoff
multi-datacenter operation #
n counts nodes across all datacenters
- each write from a client is sent to ALL replicas, regardless of datacenter:
- client usually only waits for ack from a quorum of nodes within local datacenter
- the higher latency writes to other datacenters are usually async (to be unaffected by delays and interruptions on the cross-datacenter link)
Detecting Concurrent Writes #
dynamo-style dbs allow several clients to write concurrently to the same key – so even with strict quorums, we get conflicts
- can happen @ read-repair
- can happen @ hinted-handoff
defining concurrency
defining concurrency, exact time doesn’t matter: we simply call two operations concurrent if they are both unaware of each other, regardless of the physical time at which they occurred.
KIV chapter 8
problem: there’s no well-defined ordering
solution: the application logic needs to be cognizant of the conflict handling, here’s a few strategies to achieve eventual convergence:
Last Write Wins (discarding concurrent writes) #
- each replica just stores most “recent” value, there should be a way to determine globally which write is more “recent”, then the replication can lead us to eventual convergence
- it’s at the cost of durability, because there’s going to be dropped writes based on recency within the concurrent group
- good for caching and such but bad for conflict resolution (if losing data is not acceptable)
- ordering:
- can be arbitrarily defined (e.g. time-stamp based)
- cassandra only supports this kind of resolution, optional in Riak
- can be safely used only if each key is written once and is thereafter immutable (e.g. by giving every write a unique key such as a UUID), so there are no concurrent updates to the same key
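A minimal LWW sketch showing the durability cost (timestamps are assumed to be globally comparable, which is itself a hard problem):

```python
# Sketch: last-write-wins resolution. Each write carries a timestamp;
# the replica keeps only the highest one and silently discards the rest,
# which is exactly the data-loss risk noted above.

def lww_merge(writes):
    """writes: list of (timestamp, value); returns the surviving value."""
    return max(writes)[1]  # highest timestamp wins; the others are dropped

concurrent = [(17, "cart=[book]"), (16, "cart=[book, pen]")]
assert lww_merge(concurrent) == "cart=[book]"  # the t=16 write is lost
```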
“happens-before” relationships and concurrency #
there may be causal dependencies between writes; "happens-before" captures them
A “happens-before” B:
B knows about A
B depends on A
B builds upon A in some way
so for any 2 ops A,B there’s 3 cases:
A before B
B before A
A and B concurrent
- we just need an algo to help identify this case
capturing “happens-before” relationship
the algo for detecting concurrency, reliant on version numbers
server maintains a version number for every key, increments on writes
so the server stores <version number, value> for every key
on read, the server returns all values that haven't been overwritten, plus the latest version number
rule: client must always read a key before writing
when client writes a key, must include version number from the prior read, client must merge together all the values it rcvd
server receiving the write with the version number can overwrite all values with that version number or below
must keep all values with a higher version number (since those values are concurrent with the incoming write)
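The version-number algorithm above can be sketched for a single replica (illustrative code, loosely following DDIA's shopping-cart example):

```python
# Sketch: server-side version numbering for one key on one replica.
# A write based on version v overwrites all siblings with version <= v
# (the client has seen and merged them) and keeps concurrent siblings
# (version > v) alongside the new value.

class VersionedStore:
    def __init__(self):
        self.version = 0
        self.siblings = {}  # version number -> sibling value

    def read(self):
        """Return the current version plus all surviving sibling values."""
        return self.version, list(self.siblings.values())

    def write(self, value, based_on_version):
        """based_on_version: version from the client's prior read (0 if none)."""
        self.version += 1
        # Drop every sibling the client had already seen and merged...
        self.siblings = {v: x for v, x in self.siblings.items()
                         if v > based_on_version}
        # ...and keep concurrent siblings alongside the new value.
        self.siblings[self.version] = value
        return self.version

store = VersionedStore()
store.write({"milk"}, based_on_version=0)        # client 1's write
v2 = store.write({"eggs"}, based_on_version=0)   # client 2, concurrent
_, siblings = store.read()
assert siblings == [{"milk"}, {"eggs"}]          # both siblings survive
store.write({"milk", "eggs", "ham"}, based_on_version=v2)  # merged write
_, siblings = store.read()
assert siblings == [{"milk", "eggs", "ham"}]     # siblings collapsed
```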
merging concurrently written writes
guarantees that there’s no data dropped, but needs the clients to do the merging work (to merge concurrently written values (aka sibling-values))
merging sibling values is similar to conflict resolution in multi-leader replication
e.g. by using timestamps to figure out ordering
it’s not a simple union operation (else delete operations across versions won’t be preserved)
that’s why deletes need to be tombstones
typically CRDTs allow us to NOT have to implement these from scratch
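A minimal sketch of sibling merging with tombstones (the item-to-alive-flag representation is illustrative; real systems store tombstones with timestamps so they can later be garbage-collected):

```python
# Sketch: merging sibling shopping carts with tombstones so deletions
# survive the merge. A plain set union would resurrect removed items;
# here a tombstone (alive == False) in either sibling keeps the item dead.

def merge_siblings(a: dict, b: dict) -> dict:
    """Union of items; a tombstone in either sibling wins."""
    return {item: a.get(item, True) and b.get(item, True)
            for item in set(a) | set(b)}

def cart(merged: dict) -> set:
    """The visible cart: items that are still alive."""
    return {item for item, alive in merged.items() if alive}

s1 = {"milk": True, "eggs": False}                 # this sibling deleted eggs
s2 = {"milk": True, "eggs": True, "ham": True}
assert cart(merge_siblings(s1, s2)) == {"milk", "ham"}  # eggs stays deleted
```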
version vectors
For cases where it’s a leaderless replication, there’s a need to figure out versions
just keep a version number per replica as well as per key, i.e. <version number, replica, key> – have to bind the replica info into the version
the collection of version numbers from all the replicas == version vector
version vectors can help distinguish overwrites from concurrent writes
misnomer: sometimes people say vector clock but it’s not the same as version vectors
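Comparing version vectors to distinguish overwrites from concurrent writes can be sketched as (illustrative dict representation):

```python
# Sketch: a version vector maps replica id -> version number. Vector a
# dominates b if a has seen everything b has; if neither dominates, the
# two writes are concurrent and both must be kept as siblings.

def dominates(a: dict, b: dict) -> bool:
    """True if a >= b component-wise (a has seen everything b has)."""
    return all(a.get(r, 0) >= v for r, v in b.items())

def concurrent(a: dict, b: dict) -> bool:
    return not dominates(a, b) and not dominates(b, a)

older = {"r1": 2, "r2": 1}
newer = {"r1": 3, "r2": 1}    # r1 wrote again after seeing `older`
sibling = {"r1": 2, "r2": 2}  # r2 wrote without seeing `newer`
assert dominates(newer, older)    # safe overwrite
assert concurrent(newer, sibling) # must keep both as siblings
```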
Summary #
the failure modes considered for these replicated data included simpler cases like network interruptions, unavailable nodes
there’s more complicated cases like software-bug-introduced data corruptions (that are typically more silent)
covered 3 approaches to replication:
single-leader
multi-leader
- hard, complex but gives robustness in typical error modes
leaderless
replication may be sync or async, with its own behaviours
covered effects of replication lag
covered consistency models (hard guarantees)
read-after-write consistency
- should be able to see data that they just submitted
monotonic reads
- after seeing data @ a particular point in time, they shouldn’t see data from an earlier point in time
consistent prefix-reads
- users should see data in a state that makes causal sense: e.g. question then its reply, in the correct order
conflicts @ multi-leader & leaderless replication, needing us to find merge strategies for the concurrent updates
Updates since DDIA publication #
DDIA correctly identified the fundamental tradeoffs of distributed systems. Since its publication, industry practice has shifted toward avoiding multi-leader writes in general-purpose databases, embracing immutability, logs, and single-writer designs, while confining automatic conflict resolution (e.g. CRDTs) to domains where semantics are simple and conflicts are expected.
Addendum: How DDIA’s Ideas Evolved in Practice #
Since the publication of Designing Data-Intensive Applications, the industry has broadly internalized its core lessons and adjusted system designs accordingly.
- Decline of general-purpose leaderless databases
Dynamo-style systems (e.g. Cassandra, Riak) demonstrated that leaderless, write-anywhere replication is possible, but production experience showed that:
Application-level conflict resolution is error-prone
Operational complexity is high
Most workloads implicitly rely on single-writer semantics
As a result, leaderless replication is now used selectively, not as a default.
- Migration of multi-leader complexity upward
Multi-leader replication did not disappear; instead, it moved to:
Collaboration systems
Offline-first sync engines
CRDT-based application layers
In these domains:
Conflicts are expected
Semantics are simpler
Automatic resolution is feasible
- Rise of log- and append-centric architectures
Systems like Kafka, ClickHouse, Iceberg, and Delta Lake embody a shared principle:
Total ordering + immutability dramatically simplifies distributed correctness.
Key patterns:
Single-writer logs
Immutable data files
Background compaction/merging
Transactional metadata, not transactional data mutation
- Stronger defaults, fewer footguns
Modern systems increasingly:
Avoid exposing low-level consistency knobs
Prefer clear, opinionated semantics
Push complexity into infrastructure rather than applications
This reflects a shift from:
“Flexibility first” → “Predictability first”
- CRDTs as a targeted solution, not a universal one
CRDTs have matured and are widely deployed, but with a clearer scope:
Best for coordination-free, user-driven state
Poor fit for invariant-heavy business logic
Typically embedded in sync layers, not databases