- rtshkmr's digital garden/
- Readings/
- Books/
- Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems/
- Chapter 7. Transactions/
Chapter 7. Transactions
Table of Contents
simplifies the programming model for applications accessing a db
so that the app can get some security / concurrency guarantees and doesn’t have to explicitly deal with them
groups actions into single atomic operation that has 2 outcomes
- commit
- abort, rollback
benefits
- easier error handling because no need to worry about partial failures
possible to not use transactions for performance / availability benefits as well
this chapter’s scope
failure modes
algos that guard against the failure modes \(\implies\) safety guarantees
concurrency control and race conditions
isolation levels: read committed, snapshot isolation, serializability
The Slippery Concept of a Transaction #
turns out that NoSQL implementations deviated a little from using transactions
Transactions were the main casualty of this movement: many of this new generation of databases abandoned transactions entirely, or redefined the word to describe a much weaker set of guarantees than had previously been understood
two opposing hyperbolic points:
“transactions are the antithesis of scalability and any large scale system would have to abandon transactions”
“transactional guarantees are there for serious applications with valuable data”
we need to understand the tradeoffs between them so that we can choose situations that suit our requirements
The Meaning of ACID #
characteristics conferred by safety guarantees via fault tolerance mechanisms
ACID & BASE, both are ambiguous terms
ACID: Atomicity, consistency, isolation, durability
BASE: Basically Available, Soft State, Eventual Consistency
lmao what does this even mean
implementation of ACID isn’t standardised, in some ways “ACID-compliant” is somewhat a marketing term too
some examples
- meaning of isolation is ambiguous
Atomicity #
- can’t be decomposed, context dependent meaning:
- multi-threading: no partial states, complete or none execution
- in the context of ACID:
isolation aligns more with the multi-threading atomicity point
atomicity refers to the partial faults handling, the grouping together of operations into a unit
abortability could have been a better word because the main feature that comes out of this is the ability to abort (and retry) on a failure
we can abort a transaction on error and have all writes from that transaction be discarded – that’s the main feature of ACID atomicity
Consistency #
once again, the meaning of consistency is context dependent
context of async replicated systems:
replica consistency – how that relates to eventual consistency
context of rebalancing:
partitioning approach
CAP theorem:
consistency means linearizability
ACID:
consistency means app-specific notion of the db being in a “good state”
ACID consistency is when we uphold data-related invariants \(\implies\) property of the application
depends on the application’s notion of invariants – that’s why it’s the app’s responsibility to define its transactions correctly to preserve consistency
Atomicity, isolation, and durability are properties of the database, whereas consistency (in the ACID sense) is a property of the application.
Isolation #
relates to concurrency problems (race conditions)
ACID isolation means that concurrent transactions are isolated from each other
it’s formalised as serializability because the outcome of concurrent execution should be the same as if they had run serially
serializable isolation is too strong a guarantee - has performance penalty
so typically weaker guarantees, e.g. snapshot isolation may be used
Durability #
Durability is the promise that once a transaction has committed successfully, any data it has written will not be forgotten, even if there is a hardware fault or the database crashes.
specifics differ from single-node vs distributed data structures
single node: just need non-volatile storage, sometimes with a WAL for recovery from data-corruption
replicated nodes:
data successfully copied to a number of nodes within the cluster
in order to provide a durability guarantee, a database must wait until these writes or replications are complete before reporting a transaction as successfully committed.
it’s interesting that the meaning from just storage to replication is a more recent change in the meaning of durability
total durability doesn’t exist, there’s way too many failure modes
there’s only risk-reduction techniques for this and they should be used together
take theoretical “guarantees” with a healthy grain of salt
Some failure modes for durability
“The truth is, nothing is perfect:”
Single-Object and Multi-Object Operations #
the atomicity and isolation guarantees imply that our intent is to modify several objects (rows, documents…) at once
multi-object transactions are often needed if several pieces of data need to be kept in sync.
needs a way to determine which read and write operations belong to the same transaction
e.g. TCP connection to the db server (though it’s not ideal), via a transaction manager or something
may have “dirty reads” if multi-object isolation is not done properly
when isolation is violated – when one transaction reads another transaction’s uncommitted writes
NoSQL dbs don’t have a good way of grouping operations together; even things like multi-put operations don’t necessarily have transaction semantics and can be left in a dirty state
correction: since the book was published, this has changed a bit
What’s changed since DDIA
- Some NoSQL systems added real transactions
Examples:
MongoDB: multi-document ACID transactions (with real cost)
FoundationDB: first-class multi-key transactions
Spanner / CockroachDB / Yugabyte: distributed SQL with strong transactions
So the blanket statement:
“NoSQL DBs don’t have a good way…”
is no longer universally correct.
Atomicity and isolation are required when multiple objects must be updated together to preserve immediate invariants. While many modern distributed databases now offer multi-object transactions, these remain expensive and are often intentionally avoided in favor of idempotent operations, eventual consistency, and reconciliation. As a result, many NoSQL systems still provide only single-key atomicity by default, and batch or multi-put operations should not be assumed to have transactional semantics unless explicitly documented.
single-object writes #
- storage engines typically provide atomicity and isolation on the level of a single object (e.g. key-val pair) on one node
- possible ways to implement:
- atomicity: use a log for crash recovery
- isolation: use a lock on objects, so that it’s atomic (concurrency context) access
- single object operations:
- read-modify-write cycle
- compare-and-set operation
- allows writes if no other concurrent changes done by others
- other fancy but non-ubiquitous operations:
- atomic increments
- single object operations are not exactly transactions, though they may be marketed as such
- typically transaction is an operation-grouping mechanism that applies on multiple objects
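A minimal Python sketch of what a single-object compare-and-set might look like (the store class and method names are invented for illustration, not any real database’s API):

```python
import threading

class KVStore:
    """Toy key-value store offering single-object compare-and-set.
    Hypothetical names; real stores expose this via their own APIs."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()  # one mutation at a time per store

    def get(self, key):
        return self._data.get(key)

    def compare_and_set(self, key, expected, new):
        # The write succeeds only if nobody changed the value since we read it.
        with self._lock:
            if self._data.get(key) != expected:
                return False  # concurrent change detected: re-read and retry
            self._data[key] = new
            return True

store = KVStore()
store.compare_and_set("counter", None, 0)
ok = store.compare_and_set("counter", 0, 1)     # succeeds: value is still 0
stale = store.compare_and_set("counter", 0, 2)  # fails: value is now 1
```

The failed call signals the caller to re-read and retry, which is what distinguishes this from a blind overwrite.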
need for multi-object transactions #
main problem is it’s difficult to implement multi-object transactions across partitions and may impede high-availability or performance goals
in some contexts, just having key-value data-model with single-object operations are sufficient
many other cases typically need writes to several different objects that need to be coordinated. examples:
relational dbs, can ensure that the foreign key references remain valid
keeps referential integrity
similar point for graph dbs also
document data-models with denormalized data
typically multiple fields that need updating are in the same document (same object) so it might appear that this is sufficient
but remember that document data-models encourage de-normalisation of data + they may lack join functionality \(\implies\) updating of denormalized data is the updating of several documents at once.
so, transactions are helpful to prevent denormalised data from going out of sync
dbs with secondary indexes
the indexes also need to be updated
from the POV of a transaction, different indexes are different database objects
error handling becomes managable if transactions are used for multi-object writes
handling errors and aborts #
transactions help with aborting and safe retries
design principle: “better fail than violate constraints”
leaderless replication doesn’t follow that design principle; it does more of a “best-effort” approach
becomes the application’s responsibility to recover from errors
Software anti-pattern: thinking only using happy paths
frameworks encourage happy-path thinking, so ORMs in some places won’t automatically attempt safe retry logic but will bubble the error up the stack
retries are not perfect
if the txn succeeded but the network failed, deduplication is needed – typically the onus ends up on application logic (idempotency issues)
overload errors don’t get fixed with retries
retrying is pointless if it’s a permanent error (e.g. constraint violation)
- what’s worth retrying?
- after transient errors (deadlocks, isolation violation)
side-effects coupled to the transaction that are outside the db would still take effect
- e.g. email sending side-effects
if client process fails while retrying there’s dataloss regardless
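The retry rules above can be sketched as a small helper. The exception class names are hypothetical (real drivers surface deadlocks and serialization failures via their own error types), and note that it does nothing about side-effects outside the db:

```python
import time

class TransientError(Exception): pass   # e.g. deadlock, isolation violation
class PermanentError(Exception): pass   # e.g. constraint violation

def run_with_retries(txn_fn, max_attempts=3, backoff_s=0.0):
    """Retry only on transient errors; permanent errors and exhausted
    attempts bubble up. Does nothing about side-effects outside the db
    (e.g. an email sent inside txn_fn is re-sent on every retry)."""
    for attempt in range(1, max_attempts + 1):
        try:
            return txn_fn()
        except TransientError:
            if attempt == max_attempts:
                raise  # likely overload: retrying forever only makes it worse
            time.sleep(backoff_s * attempt)
        except PermanentError:
            raise      # e.g. constraint violation: retrying cannot help

attempts = []
def flaky_txn():
    attempts.append(1)
    if len(attempts) < 2:
        raise TransientError("deadlock, try again")
    return "committed"

result = run_with_retries(flaky_txn)  # succeeds on the second attempt
```

The bounded attempt count matters: retrying an overloaded database indefinitely is one of the failure modes listed above.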
Weak Isolation Levels #
it’s safe to run txns concurrently if they don’t touch the same data
it’s difficult to wrap heads around concurrency bugs, that’s why db’s help by giving developers transaction isolation
- helps the developer pretend that it’s essentially serialized \(\implies\) serializable isolation but that’s not feasible, so we need weaker guarantees
- that’s why they protect against some concurrency issues but not all
most relational db systems that consider themselves ACID are actually using weak isolation
that’s why it’s necessary to understand the different weak isolation levels to know what kind of race conditions will and won’t happen
Read Committed #
the guarantee is that no dirty reads and no dirty writes
this supports: it allows aborts (required for atomicity), it prevents reading the incomplete results of transactions, and it prevents concurrent writes from getting intermingled.
there are some dbs that provide even weaker isolation levels than this
- read uncommitted: prevents dirty writes but doesn’t prevent dirty reads
there are race conditions that this still doesn’t solve (allowed):
non-repeatable read / read skew: when an inconsistent state is read
e.g. if you read and re-read during a transaction, even under the read-committed isolation level, you can still end up seeing inconsistent states
this is accepted under read-committed isolation
NOTE: the term ‘skew’ here is made in the context of a timing anomaly (not the same as skew in the context of unbalanced hotspots)
No dirty reads #
dirty read:
a write transaction that hasn’t completed yet (i.e. not committed or aborted), if it can be read \(\implies\) dirty read
so with this guarantee, only a committed write can be read
benefits:
readers don’t read in a partially updated state – this is typically confusing for users
to prevent seeing intermediate info that might actually be rolled back – so users find it confusing too
No dirty writes #
dirty write:
- when 2 transactions concurrently try to update the same object and
- the earlier write is part of a transaction that hasn’t been committed yet – the later transaction ends up overwriting an uncommitted value
@ read-committed isolation level, the later write needs to be delayed until the first write’s transaction has completed
benefits:
prevents bad, impossible states:
multiple objects’ update, it prevents bad outcomes e.g. same item of unit quantity being allocated to 2 different people because of dirty writes (like an auction)
edge cases:
- race conditions @ counter increments still can happen
if 2 counter increments happen, the second write happens after the first transaction has committed (so it’s not a dirty write) BUT the value still ends up being an overwrite (+1 instead of +2)
so it’s not safe for counter increments
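The +1-instead-of-+2 outcome can be reproduced by hand-interleaving two read-modify-write cycles (a plain dict stands in for the database here):

```python
# Two transactions each intend to add 1 to a counter. Each performs a
# read-modify-write on a value read *before* the other commits, so the
# second commit clobbers the first: the counter ends at 1, not 2.
db = {"counter": 0}

read_a = db["counter"]      # txn A reads 0
read_b = db["counter"]      # txn B reads 0 (A has not committed yet)
db["counter"] = read_a + 1  # A commits 1 (B's read is now stale)
db["counter"] = read_b + 1  # B commits 1, overwriting A's increment

assert db["counter"] == 1   # lost update: should have been 2
```

No dirty write occurred (B wrote after A committed), which is why read-committed isolation alone does not prevent this.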
Implementing Read Committed #
- on preventing dirty writes:
- works via a row-level lock that only one transaction may hold
- the locking is managed via the db if it’s in a read-committed mode
- on preventing dirty reads:
not feasible to use the same write-lock because a long-running write may block many read-only transactions
instead
for every object that is written, db remembers the old committed value and the new value set by the transaction that currently holds the write lock
during the ongoing transaction, any other transactions that read the object are given the old value
when the transaction is complete and new value committed, then any new transactions will be able to read the new value
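A toy sketch of that old-value/new-value bookkeeping (class and method names invented for illustration; the implied write lock is reduced to one pending value per key):

```python
class ReadCommittedStore:
    """Toy sketch: while a writer holds the write lock, readers are served
    the old committed value; the new value becomes visible only on commit."""

    def __init__(self):
        self._committed = {}
        self._pending = {}   # uncommitted value per key, held by the writer

    def write(self, key, value):
        self._pending[key] = value       # held privately until commit

    def read(self, key):
        return self._committed.get(key)  # readers never see pending values

    def commit(self, key):
        if key in self._pending:
            self._committed[key] = self._pending.pop(key)

s = ReadCommittedStore()
s.write("x", "new")
before = s.read("x")  # still the old committed value: no dirty read
s.commit("x")
after = s.read("x")   # now the new value is visible
```

This avoids blocking read-only transactions behind a long-running write, which is why it is preferred over sharing the write lock.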
Snapshot Isolation and Repeatable Read #
Read-committed still suffers from some edge cases
non-repeatable read / read skew
leads to reading of inconsistent states even with the isolation guarantee
e.g. A has 1k value in bank, split equally into a/c 1 and a/c 2 (500 each). During the transaction processing, if she sees one account balance at a time before the incoming payment arrives (balance 500) and the other account after the outgoing transfer has been made (new balance == 400) then it appears as though the total value in bank is 900 somehow – so 100 has just vanished.
- though it’s temporary inconsistency, in some cases, this can’t be tolerated:
backups
if backup is done with the inconsistency, then the inconsistency gets persisted
e.g. the disappearing money becomes permanent if we restore from backup
analytic queries and integrity checks
- false reads on periodic data-integrity checks…
- though it’s temporary inconsistency, in some cases, this can’t be tolerated:
snapshot isolation:
each txn reads from a consistent snapshot of the db – all the data that was committed in the db at the start of the transaction
snapshot does a time-freeze effectively – allows us to do repeated reads without worrying about underlying data-state changes
something like multi-version dbs
implementing snapshot isolation #
mechanism
- key invariant:
“Readers never block writers, and writers never block readers”
writes:
so write locks are used to prevent dirty writes
reads:
reads don’t need any locks
MVCC (multi-version concurrency control)
mechanism is a generalisation of the temp history that the read-committed isolation level uses
needs to keep more than just one version because there may be multiple in-progress transactions coinciding
typically uses an auto-incrementing txn-id that tags each row version, e.g. as the value of a created_by field
deletions are temporarily tombstoned (e.g. via a deleted_by field) and actually deleted only when the system is sure that no transaction can any longer access the deleted data – gc takes care of the tombstone-cleaning
updates are just broken into delete and create
this allows handling of long-running read queries on a consistent snapshot at the same time as processing writes normally \(\implies\) there’s no lock contention between the two
By never updating values in place but instead creating a new version every time a value is changed, the database can provide a consistent snapshot while incurring only a small overhead.
visibility rules for observing a consistent snapshot #
when txn reads from db, txn ids used to decide which objects it can see and which are invisible – visibility rules
Object is visible if the following 2 conditions are true:
@ time of reader’s transaction start – the transaction that created the object had already committed
object is not marked for deletion, or if it is, the transaction that requested deletion is not yet committed at the time the reader’s transaction started
Described as a list of rules:
@ start of txn, db lists other in-progress txns. Writes from those txns are ignored
Any writes from aborted txns are ignored
writes from txns with txn id that comes later is ignored, even if committed
all other writes visible to the application’s queries
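The four rules above can be sketched as a predicate over one row version. Field names follow the created_by / deleted_by convention from the MVCC notes; the sets passed in are a simplification of what a real engine tracks:

```python
def visible(version, reader_txid, in_progress_at_start, aborted):
    """Return whether one row version is visible to a reader transaction,
    applying the snapshot-isolation visibility rules from the notes."""
    c = version["created_by"]
    # rule 1: writes by txns still in progress when the reader started are ignored
    # rule 2: writes by aborted txns are ignored
    # rule 3: writes by txns with a later id are ignored, even if committed
    if c in in_progress_at_start or c in aborted or c > reader_txid:
        return False
    d = version["deleted_by"]  # None while the row is live
    if d is not None and d not in in_progress_at_start \
            and d not in aborted and d <= reader_txid:
        return False  # the deletion is visible, so the row itself is not
    return True  # rule 4: all other writes are visible

row = {"created_by": 3, "deleted_by": None}
assert visible(row, reader_txid=5, in_progress_at_start=set(), aborted=set())
assert not visible(row, reader_txid=2, in_progress_at_start=set(), aborted=set())
```

The deletion branch mirrors the creation branch: a tombstone hides the row only if the deleting transaction would itself be visible.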
Indexes and snapshot isolation #
- some approaches:
pedestrian solution for this:
let the index point to all versions of an object and require an index query to filter out any object versions that the current transaction doesn’t have visibility on
implementation details relevant for the actual performance
using b-trees and using append-only / copy-on-write variants
no overwriting of pages of the tree when they are updated – it’s always new page creation on modification
parent pages, to the root of the tree are copied and updated to point to the new version of child pages
unaffected pages by the write don’t need to be copied and are immutable
so every txn containing a write creates a new B-tree root, so roots end up forming consistent snapshots of the db at the point in time when it was created
compaction and garbage collection still needed, some background bookkeeping still needed
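The copy-on-write idea can be illustrated with a tiny path-copying binary tree (a stand-in for the B-tree, not how a real storage engine is laid out): an insert returns a new root while old roots keep serving their snapshot, and untouched subtrees are shared rather than copied:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Node:
    key: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def insert(root, key):
    """Path-copying insert: only nodes on the path to `key` are copied;
    every other subtree is shared between old and new roots, so each old
    root remains a consistent, immutable snapshot."""
    if root is None:
        return Node(key)
    if key < root.key:
        return Node(root.key, insert(root.left, key), root.right)
    if key > root.key:
        return Node(root.key, root.left, insert(root.right, key))
    return root  # key already present: reuse the existing node

v1 = insert(insert(None, 2), 1)  # snapshot 1 contains {1, 2}
v2 = insert(v1, 3)               # snapshot 2 contains {1, 2, 3}
assert v1.right is None          # old snapshot is unchanged
assert v2.left is v1.left        # untouched subtree is shared, not copied
```

Old roots become garbage once no transaction can reach them, which is the compaction/bookkeeping cost mentioned above.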
Repeatable read and naming confusion #
the implementation name for this kind of snapshot isolation is not standardised
sometimes called serializable, repeatable read
SQL standard’s definition of isolation-levels is flawed because it’s more ambiguous than an implementation-independent standard should be
Preventing Lost Updates #
lost update problem:
more interesting conflicts that can happen between concurrently writing transactions \(\implies\) best known = lost update problem
how it happens:
consider a read-modify-write cycle, if two transactions do that concurrently, one of the modifications can be lost because the second write does not include the first write
aka the second write clobbers the earlier write \(\implies\) this is common enough
- example scenarios when it happens:
incrementing counter / updating an account balance
read value, calculate new value and write updated value
making local change to a complex value
e.g. json document with a nested list is added to, so document is parsed, changed and the modified document can be written back
2 users editing a wiki page at the same time
solution: atomic write operations #
many dbs give an atomic update operation – this avoids the need for a read-modify-write cycle in the app code
for document stores, there’s usually atomic operations for modifying the data-structures as well
some operations can’t be atomic though, e.g. wiki page with arbitrary text editing
typical implementation:
cursor stability:
exclusive lock on the object when it is read, so that no other transaction can read it until the update has completed
force all operations to be executed on a single thread
ORMs make it easier to do unsafe read-modify-write cycles instead of using atomic operations that the db provides
so be careful when using ORMs and think of the query cycles
solution: explicit locking #
if the db doesn’t give atomic built-ins, then we can make the application explicitly lock objects that are to be updated
the app can then do the read-modify-write cycle, and if any other transaction tries to concurrently read the same object, it is forced to wait until the first read-modify-write cycle has completed
this is typically what I do when using ORMs that don’t outright float good patterns
```sql
BEGIN TRANSACTION;

SELECT * FROM figures
  WHERE name = 'robot' AND game_id = 222
  FOR UPDATE; -- this indicates that the db should take a lock on all rows returned by this query

-- Check whether the move is valid, then update the position
-- of the piece that was returned by the previous SELECT.
UPDATE figures SET position = 'c4' WHERE id = 1234;

COMMIT;
```
solution: automatically detecting lost updates #
previous methods were preventative, by forcing the read-modify-write cycles to be sequential
we can choose to do reactive approaches as well
in some dbs, there’s automatic and efficient checking of lost updates in conjunction with snapshot isolation – which helps auto-detect when a lost update has occurred and we can abort the offending transaction
benefits:
- doesn’t need application code to use any special db features
- in the preventative approaches, if we forget to use a lock or an atomic operation, then we introduce a bug, but lost update detection happens automatically and is less error-prone
- doesn’t need application code to use any special db features
solution: compare-and-set #
for dbs that don’t offer transactions, there’s typically an atomic compare-and-set
avoids lost updates by allowing an update to happen ONLY IF the value hasn’t changed since we last read it
there’s a need to check db functionality before relying on it
e.g. typically last write wins (LWW) is the default for many DBs but that conflict resolution method is prone to lost updates
e.g. for this code:
```sql
-- This may or may not be safe, depending on the database implementation
UPDATE wiki_pages SET content = 'new content'
  WHERE id = 1234 AND content = 'old content';
```
If the content has changed and no longer matches 'old content', this update will have no effect, so you need to check whether the update took effect and retry if necessary. However, if the database allows the WHERE clause to read from an old snapshot, this statement may not prevent lost updates, because the condition may be true even though another concurrent write is occurring. Check whether your database’s compare-and-set operation is safe before relying on it.
conflict resolution and replication #
replicated databases add a complexity dimension because they have copies of the data on multiple nodes and the data can potentially be modified concurrently on different nodes – so need to take some extra steps to prevent lost updates
locks and compare-and-set operations won’t work for (async) replicated dbs
- they assume that there’s a single up-to-date copy of the data
approaches:
generally, it’s similar to the way we did the detection of concurrent writes
allow concurrent writes to create conflicting versions (siblings) and then use application code or special data structures to resolve and merge the sibling versions after the fact
atomic operations that are commutative can still be used without suffering from lost writes
NOTE: last write wins (LWW) conflict resolution is prone to lost updates, it won’t work but it’s also the default in many replicated dbs
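A sketch of the sibling-merge idea, using set union as the commutative merge function (shopping-cart style; note that a plain union cannot express removals without tombstones):

```python
def merge_siblings(siblings):
    """Resolve concurrently written versions (siblings) of a set-valued
    object by taking the union — a commutative merge, so no update is lost
    regardless of the order replicas apply writes. LWW, by contrast, would
    keep one sibling and silently drop the rest."""
    merged = set()
    for version in siblings:
        merged |= version
    return merged

# Two replicas accepted writes concurrently and now hold conflicting versions:
replica_a = {"milk", "eggs"}
replica_b = {"milk", "flour"}
assert merge_siblings([replica_a, replica_b]) == {"milk", "eggs", "flour"}
assert merge_siblings([replica_b, replica_a]) == {"milk", "eggs", "flour"}
```

The order-independence is the point: because merge order doesn’t matter, replicas converge without a single up-to-date copy.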
Write Skew and Phantoms #
when it comes to race conditions between concurrent writes, it’s more than just dirty writes and lost updates
there’s a bunch of subtle ones too
ref the outdated-premise parts below, which describe:
- i.e. txn is taking an action based on a premise
- when the txn wants to commit, the original data may have changed and the premise may no longer be true
Characterizing Write Skew #
- because the transactions are on different objects, it’s not a dirty write, nor a lost update
- anomalous behaviour only exists when transactions were run concurrently, wouldn’t have happened if it were to be sequential
- can be seen as a generalisation of the lost update problem:
- when 2 txns read the same object
- then update some of those objects (diff txns may update different objects)
case A: different transactions update the same object
\(\implies\) dirty write / lost update anomaly (depending on timing)
else, it’s a write skew
- solutions are even more restricted than preventing lost updates:
unhelpful: atomic single-object operations
unhelpful because multiple objects are involved
unhelpful: auto-detection of lost updates
because it’s not automatically detectable
automatic prevention of write skews requires true serializable isolation
plausible: using logical constraints on triggers / materialised views
it has to be a constraint that relates to multiple objects – not all dbs support multi-object constraints
therefore, implementing it with triggers or materialised views may work
helpful: explicitly locking the rows that the txn depends on
e.g.
```sql
BEGIN TRANSACTION;

SELECT * FROM doctors
  WHERE on_call = true AND shift_id = 1234
  FOR UPDATE; -- this is what locks all rows that are returned by this query

UPDATE doctors SET on_call = false
  WHERE name = 'Alice' AND shift_id = 1234;

COMMIT;
```
the FOR UPDATE is what does the row-level lock
More examples of write-skew #
meeting room booking system
suppose we stick to the invariant that same meeting room same time can’t have 2 bookings via:
```sql
BEGIN TRANSACTION;

-- Check for any existing bookings that overlap with the period of noon-1pm
SELECT COUNT(*) FROM bookings
  WHERE room_id = 123
    AND end_time > '2015-01-01 12:00'
    AND start_time < '2015-01-01 13:00';

-- If the previous query returned zero:
INSERT INTO bookings (room_id, start_time, end_time, user_id)
  VALUES (123, '2015-01-01 12:00', '2015-01-01 13:00', 666);

COMMIT;
```
problem: snapshot isolation doesn’t prevent another user from concurrently inserting a conflicting meeting
need serializable isolation again (see solution b below)
solutions:
lock a parent row (explicit logical mutex)
```sql
BEGIN;

-- Lock the room as a logical mutex
SELECT 1 FROM rooms WHERE id = 123 FOR UPDATE;

-- Now no other transaction can book this room concurrently
SELECT COUNT(*) FROM bookings
  WHERE room_id = 123
    AND end_time > '2015-01-01 12:00'
    AND start_time < '2015-01-01 13:00';

INSERT INTO bookings (room_id, start_time, end_time, user_id)
  VALUES (123, '2015-01-01 12:00', '2015-01-01 13:00', 666);

COMMIT;
```
mechanism: all bookings for a room serialize on the same row – you’ve created a manual critical section; this works under READ COMMITTED or REPEATABLE READ
tradeoffs: reduced concurrency per room, explicit application-level locking; very common in practice
serializable isolation
this is what the textbook is alluding to
```sql
BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;

SELECT COUNT(*) FROM bookings
  WHERE room_id = 123
    AND end_time > '2015-01-01 12:00'
    AND start_time < '2015-01-01 13:00';

INSERT INTO bookings (...) VALUES (...);

COMMIT;
```
exclusion constraint
```sql
CREATE EXTENSION IF NOT EXISTS btree_gist;

ALTER TABLE bookings ADD CONSTRAINT no_overlapping_bookings
  EXCLUDE USING gist (
    room_id WITH =,
    tstzrange(start_time, end_time) WITH &&
  );

-- so the transaction just becomes:
INSERT INTO bookings (room_id, start_time, end_time, user_id)
  VALUES (123, '2015-01-01 12:00', '2015-01-01 13:00', 666);
```
- Why this is the gold standard
Invariant enforced by the database
Correct under concurrency
No race conditions
No explicit locking
No serializable retries
Postgres internally takes the necessary locks for you.
This is the solution experienced Postgres engineers expect.
multiplayer game
locks prevent lost updates, but they don’t help us enforce certain invariants / constraints, which leaves us vulnerable to write skew
example: claiming a username
2 users creating an account with same uname at the same time
using a txn under snapshot isolation is not safe
simple solution: use a uniqueness constraint – the duplicate uname will violate it and the txn will be aborted
example: preventing double-spending
e.g. we want to make sure that users can’t spend negative money / points on items
implementation may include inserting a temporary negative spending item into the balance, listing all the items in the account and checking whether the sum is positive
but with write skew, it could happen that the sum is positive but two spending items are inserted concurrently which together cause the balance to be negative but the transactions don’t notice each other
Phantoms causing write skew #
phantoms create write skews
when one transaction changes the result of a search query in another transaction
snapshot isolation avoids phantoms in R/O queries but in R/W transactions like the examples above, it can lead to tricky cases of write skew
there’s a common pattern in the examples that lead to write skew: Phantoms
a SELECT query checks that some requirement is satisfied by scanning rows that match a condition \(\implies\) some sort of logical invariant
e.g. at least 2 doctors on call
e.g. no existing bookings for that room at that time
application code uses the query result to determine next action
- proceed to next action in transaction
- abort
if proceed, then it does a write (INSERT, UPDATE, DELETE) to the db and commits the txn
this write changes the precondition of the decision of step 2
if we were to repeat the SELECT query from step 1 after committing the write in step 3, we would get a different result
this is because the write would have changed the set of rows matching the search condition:
- e.g. one fewer doctor on call
- e.g. meeting room is now booked for that time
- e.g. uname is taken now
the steps may happen in a different order, as long as there’s a phantom, it creates the write skew
Materializing conflicts #
problem is that there isn’t a distinct object that we can lock, but we can artificially materialize locks (and their conflicts) to help us
NOTE: this should be treated as last resort because it can end up being complex and may bleed into the data modelling side of things as well.
we can create a proxy / collection (e.g. a table of timeslots and rooms, so each row gives a particular room for a particular time and we generate the cross product to get all possible combinations)
so a transaction can lock rows in that options table
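A toy illustration of materializing the conflict: pre-create one lockable object per (room, slot) combination — the cross product described above — so a booking has something concrete to lock. All names and the in-memory locking are made up for illustration; in a real db the locks would be rows taken with FOR UPDATE:

```python
from itertools import product
from threading import Lock

# Pre-materialize one lockable entry per (room, time slot) combination,
# so a conflict object exists even before any booking row does.
rooms = [123, 124]
slots = ["2015-01-01 12:00", "2015-01-01 13:00"]
slot_locks = {(room, slot): Lock() for room, slot in product(rooms, slots)}

def book(bookings, room, slot, user):
    with slot_locks[(room, slot)]:  # conflicting bookings serialize here
        if (room, slot) in bookings:
            return False            # someone else got there first
        bookings[(room, slot)] = user
        return True

bookings = {}
assert book(bookings, 123, "2015-01-01 12:00", 666)
assert not book(bookings, 123, "2015-01-01 12:00", 667)  # phantom prevented
```

Because the check and the insert happen under the same materialized lock, no second transaction can slip a conflicting booking in between them.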
Serializability #
isolation levels (read-committed , snapshot…) help deal with some race conditions but not always
there’s a bunch of other complexities
they are hard to understand and inconsistently implemented in different databases (e.g. “repeatable read” meaning varies)
just looking at application code, it’s hard to tell whether it’s safe to run at a particular isolation level
especially for large, concurrent codebases
no good tools to help us detect race conditions, testing for concurrency issues is hard because they are non-deterministic – typically it’s unlucky timing that surfaces them in the first place
serializable isolation is regarded as the strongest isolation level
db guarantees that if the transactions behave correctly when run individually, they will continue to be correct when run in parallel
i.e. db prevents ALL race conditions
implementation has 3 possible flavours:
literally executing transactions in serial order
2-phase locking: for decades the standard option
optimistic concurrency control techniques like serializable snapshot isolation
We focus on single-node dbs first.
KIV chapter 9 to examine this in the context of multiple, distributed nodes
Actual Serial Execution #
deal with concurrency issues by removing the concurrency and making it by definition serializable
so only one transaction at a time, in serial order, on a single thread
typically we’d think that single-threaded execution wouldn’t be considered because multi-threaded concurrency was considered essential for good performance.
we can now because:
RAM is cheap enough – we can keep the active dataset in memory
disk I/O doesn’t limit performance if transaction can be executed by whatever is on RAM
realisation that OLTP transactions are short and only make a small number of reads and writes – so it’s not that slow
differs from analytical queries, which are read-only – these can be run on a consistent snapshot (snapshot isolation) outside of the serial execution loop
sometimes, single-threaded systems can be better than multi-threaded because they avoid the overhead from coordinating multi-threading (e.g. locking)
limited throughput though (to a single CPU core)
so single-thread, transactions need to be structured differently from their traditional form
encapsulating transactions in stored procedures #
old purpose of a transaction was to encompass an entire user flow
problem: user input I/O can be slow (people taking their time) and so user-input based I/O can’t be blocking
OLTP transactions need to be kept short by avoiding interactive waiting for user input within a transaction
this is why systems with single-threaded serial transaction processing don’t allow interactive multi-statement transactions
\(\implies\) app must send the txn code ahead of time via a stored procedure
downsides of stored procedures (they have a somewhat bad reputation):
db vendors have their own language for stored procedures
languages are not as nice as modern programming languages, lack the ecosystem of libraries one would expect
improvements: some db providers allow writing stored procs using general purpose languages (e.g. Redis using Lua)
code within the db is hard to manage compared to application-server code, which is easier to debug, keep in version control, deploy, test…
db is performance sensitive, so a badly written procedure does more harm than good compared to badly written logic in the application server
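a minimal sketch of the “send the txn code ahead of time” idea (the `register`/`call` names are invented for illustration, not a real vendor interface): procedures are registered up front, then invoked by name in one shot, so the executor never waits on interactive input mid-transaction.

```python
# Sketch: the application registers transaction code up front as a "stored
# procedure"; later it invokes it by name with all arguments at once, so the
# single-threaded executor never blocks on user input inside a txn.
class ProcExecutor:
    def __init__(self):
        self.store = {}
        self.procs = {}

    def register(self, name, fn):
        self.procs[name] = fn

    def call(self, name, *args):
        snapshot = dict(self.store)      # rollback point
        try:
            return self.procs[name](self.store, *args)
        except Exception:
            self.store = snapshot        # abort restores prior state
            raise

def transfer(store, src, dst, amount):
    if store.get(src, 0) < amount:
        raise ValueError("insufficient funds")  # triggers abort
    store[src] -= amount
    store[dst] = store.get(dst, 0) + amount

db = ProcExecutor()
db.register("transfer", transfer)
db.store["alice"] = 100
db.call("transfer", "alice", "bob", 25)
```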
partitioning #
the cost of serial transaction being that the throughput of the db is limited to the speed of a single CPU core on a single machine – can be a serious bottleneck for applications with high write throughput
partitioning will help to scale to multiple CPU cores and multiple nodes
so if we can partition the dataset such that each txn only needs to R/W data within a single partition, then each partition can have its own txn-processing-thread that runs independently from others
each CPU core gets its own partition, which allows transaction throughput to scale linearly with the number of CPU cores
cross-partition transactions end up being very slow because the stored procedure needs to run in lock-step across all partitions to keep serializability across the whole system
so whether single-partition transactions are feasible depends on the structure of the data used by the application
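a sketch of single-partition routing (the hash function and key format are made up): each partition owns a disjoint slice of the data and would run its own serial loop on its own core, so per-partition throughput adds up with no cross-partition coordination.

```python
# Sketch: route each single-partition txn to the partition that owns its key;
# each partition would run its own serial execution loop on its own CPU core.
N_PARTITIONS = 4
stores = [{} for _ in range(N_PARTITIONS)]

def partition_of(key):
    return sum(key.encode()) % N_PARTITIONS   # deterministic toy hash

def run_single_partition_txn(key, txn):
    # touches exactly one partition's store: no cross-partition coordination
    return txn(stores[partition_of(key)])

run_single_partition_txn("room:123", lambda s: s.update({"room:123": "booked"}))
```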
summary for serial execution #
need to make each txn small and fast, else can be bottlenecked easily
limited to situations when the active dataset can fit in memory, avoiding the IO overheads
write throughput should be low enough to handle on a single CPU core, else txns need to be partitioned without having cross-partition coordination
cross-partition transactions are possible, but dramatically slower
Two-Phase Locking (2PL) #
algo for serializability in dbs, there’s variants of this algo
2PL is aka Strong, Strict 2PL (SS2PL) sometimes
stronger lock requirements than simple write-locks that prevent dirty writes:
several transactions are allowed to concurrently read the same object as long as nobody is writing to it
when anyone wishes to write (modify / delete) an object, they need exclusive access
readers block writers:
if an object was just read by txn A and txn B wants to write it, A has to complete first before B can continue
writers block readers:
if A has written an object and B wants to read it, B must wait until A completes before it can continue
mantra: writers block readers and readers block writers
NOTE: different from snapshot isolation where readers never block writers and writers never block readers.
this protects against all the race conditions covered so far, including write skew and lost updates
Implementation
- lock is on each object in the db, can be in shared mode or exclusive mode
[read -> shared mode]
if txn wants to read an obj, then it needs to acquire the lock in shared mode
multiple txns can hold the lock in shared mode
they must wait if any other txn has an exclusive lock on the object
[write -> exclusive mode]
if txn wants to write, then it needs to acquire the lock in exclusive mode
no other txn can hold the lock at the same time (in either mode), so this txn must wait if there’s an existing lock on the object
[read then write -> shared lock upgraded to exclusive lock]
if a txn first reads then writes, it may upgrade its shared lock to an exclusive one
the upgrade works the same as acquiring an exclusive lock directly
[after acquisition, hold lock until end of txn]
it’s 2 phase because there’s 2 phases to the mechanism:
first phase (when txn is executing), that’s when locks are acquired
second phase (end of txn) when the locks are released
It’s possible to get deadlocks from this because there are numerous locks at play
db does the supervision for this, will detect deadlocks (between txns) and aborts one of them to continue progress
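the shared/exclusive rules above can be sketched as a toy lock table (a sketch, not a production lock manager: `acquire` returns False where a real implementation would make the txn wait, and deadlock detection is omitted):

```python
# Toy 2PL lock table: each object's lock is held SHARED ("S") by many txns or
# EXCLUSIVE ("X") by one. acquire() returns False where a real db would block.
class LockTable:
    def __init__(self):
        self.locks = {}   # obj -> {"mode": "S" or "X", "holders": set of txns}

    def acquire(self, txn, obj, mode):
        entry = self.locks.get(obj)
        if entry is None:
            self.locks[obj] = {"mode": mode, "holders": {txn}}
            return True
        if mode == "S" and entry["mode"] == "S":
            entry["holders"].add(txn)           # readers share the lock
            return True
        if entry["holders"] == {txn}:           # sole holder may upgrade
            if mode == "X":
                entry["mode"] = "X"
            return True
        return False                            # conflict: caller would block

    def release_all(self, txn):
        # phase 2: at the end of the txn, release every lock it holds
        for obj in list(self.locks):
            entry = self.locks[obj]
            entry["holders"].discard(txn)
            if not entry["holders"]:
                del self.locks[obj]

lt = LockTable()
ok_a = lt.acquire("A", "x", "S")
ok_b = lt.acquire("B", "x", "S")    # shared mode: both readers succeed
ok_c = lt.acquire("C", "x", "X")    # writer blocked by the readers
```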
Performance of 2PL #
worse txn throughput and longer query-response times under 2PL than under weak isolation
why?
[partial reason]: overhead from acquiring and releasing locks
[main reason]: reduced concurrency
can expect it to be slow at high percentiles and have unstable latencies
instability is a problem when robust operation is desired by the system
deadlocks are more common than under the lock-based read committed isolation level
the abort mechanism for deadlock breaking ends up wasting effort (because need to retry again)
Predicate Locks #
a re-emphasis that a database with serializable isolation must prevent phantoms (one transaction changing the results of another transaction’s search query)
conceptually, similar to shared/exclusive lock
rather than belonging to a particular object (e.g., one row in a table), it belongs to all objects that match some search condition,
e.g. for this query:
```
SELECT * FROM bookings
WHERE room_id = 123
  AND end_time   > '2018-01-01 12:00'
  AND start_time < '2018-01-01 13:00';
```
if txn A wishes to read objects matching some condition (e.g. the select query above) and another txn B holds an exclusive lock on any object matching that condition, A must wait until B releases its lock before it’s allowed to make its query
if txn A wants to insert, update or delete any object, it must first check whether either the old or the new value matches any existing predicate lock
if so (e.g. B is holding a matching predicate lock) then A must wait until that has completed before it can continue
predicate locks do not perform well:
the checking of matching locks is the main overhead
if there are many locks by active transactions, checking for matching locks becomes time-consuming.
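a conceptual sketch of a predicate lock (here a predicate is just a Python function; names and row shapes are invented): the lock belongs to every object matching a search condition, so checking a write means scanning all active predicates – exactly the overhead described above.

```python
# Conceptual sketch of predicate locks: a lock belongs to all objects matching
# a search condition, so every write must be checked against every predicate.
active_predicate_locks = []   # (txn_id, predicate) pairs

def read_with_predicate(txn_id, predicate):
    active_predicate_locks.append((txn_id, predicate))

def write_conflicts(txn_id, old_row, new_row):
    # a write conflicts if the old OR new value matches another txn's predicate
    return [t for t, p in active_predicate_locks
            if t != txn_id and (p(old_row) or p(new_row))]

# txn A reads "bookings for room 123 overlapping noon-1pm"
read_with_predicate("A", lambda r: r["room_id"] == 123
                    and r["end_time"] > 12 and r["start_time"] < 13)
insert_row = {"room_id": 123, "start_time": 12, "end_time": 13}
conflicts = write_conflicts("B", insert_row, insert_row)
```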
index-range locks (aka next-key locking) #
it’s a simplified approx of predicate locking – safe to simplify a predicate by making it match a greater set of objects
it’s just a covering range we lock here – a coarser lock never gives wrong answers for conflicts, only extra false conflicts
it’s just less precise of a locking mechanism than predicate locks, but it’s a good compromise because of the lower overheads
for these ranges, we’ll likely have an index on the attribute we’re trying to lock (e.g. room_id; a time-based index also works)
as long as a lock can be placed on a dimension our queries run over and the range covers the predicate, this kind of lock is fine to use
this works against phantoms and write skew
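the over-approximation can be sketched like this (toy structures, invented names): lock a covering range on an indexed attribute instead of the exact predicate.

```python
# Sketch of index-range (next-key) locking: instead of the exact predicate,
# lock a covering range on an indexed attribute; matching more objects than
# necessary is safe, just less precise than a true predicate lock.
range_locks = []   # (txn_id, attr, lo, hi) covering ranges

def lock_range(txn_id, attr, lo, hi):
    range_locks.append((txn_id, attr, lo, hi))

def conflicting_readers(txn_id, row):
    # a write conflicts with any other txn whose locked range covers the row
    return [t for t, attr, lo, hi in range_locks
            if t != txn_id and attr in row and lo <= row[attr] <= hi]

# "room 123, noon to 1pm" over-approximated as "all bookings for room 123"
lock_range("A", "room_id", 123, 123)
hits = conflicting_readers("B", {"room_id": 123, "start_time": 9})
```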
Serializable Snapshot Isolation (SSI) #
- pretext so far is that it seems that serializable isolation and good performance are fundamentally at odds with each other
seems like serializability implementations don’t perform well
- 2PL doesn’t perform well, serial execution doesn’t scale well
- weak isolation levels (with good performance) prone to race conditions (lost updates, write skew, phantoms … )
- SSI (Serializable Snapshot Isolation)
- gives full serializability
- only small performance penalty compared to snapshot isolation
pessimistic (e.g. 2PL) vs optimistic (SSI) concurrency control #
2-phase locking == pessimistic concurrency control mechanism:
based on the principle that if there’s a chance for things to go wrong (i.e. lock held by another txn), then it’s better to wait until the situation is safe again.
like mutual exclusion (mutexes)
- serial execution is pessimistic to the extreme
SSI is optimistic concurrency control
instead of blocking if something possibly dangerous happens, let it continue and then check if there’s a violation (e.g. isolation violation)
if violation exists, then abort and retry
\(\implies\) only txns that executed serializably are allowed to commit
relies on snapshot isolation – all reads within a txn made from consistent snapshot of the db
this is what differentiates SSI from earlier optimistic concurrency control techniques
ADDITIONALLY, SSI adds an algo for detecting serialization conflicts among writes and determining which txns to abort
drawbacks
when high contention (i.e. many txns trying to access the same objects) \(\implies\) high proportions of txns needing to abort
so if system close to max throughput, then additional txn load from retries will make performance worse
upsides
- if spare capacity & contention b/w txns is manageable, then optimistic concurrency control does better than pessimistic
- contention can be reduced with commutative atomic operations – e.g. concurrent increments being applied in any order
decisions based on outdated premise #
previously, when seeing write skew (under snapshot isolation), we saw the problem: a txn reads data from the db, examines the query result, and takes action based on what it observed – but meanwhile the data changes
i.e. txn is taking an action based on a premise
when the txn wants to commit, the original data may have changed and the premise may no longer be true
it’s a matter of causal dependencies between queries and the writes in the txn
to be able to provide serializable isolation, db must detect situations in which a transaction may have acted on an outdated premise and therefore abort the txn in that case
2 cases to consider when detecting this:
detecting reads of a stale MVCC object (an uncommitted write occurred before the read)
detecting writes that affect prior reads (the write occurs after the read)
SSI Consideration Case 1: Detecting stale MVCC reads
remember: snapshot isolation implemented by multi-version concurrency control
so db has to track when a txn ignores another txn’s writes due to MVCC visibility rules
this is the optimistic part of the SSI algo
when the txn wants to commit, the db checks whether any of the ignored writes have since been committed – if so, the txn must abort
why wait until commit instead of aborting as soon as the stale read is detected?
if the txn turns out to be read-only, there’s no need to abort because there’s no risk of write skew
at the time of the read, the db doesn’t yet know whether the txn will later perform a write
also, the write-txn may abort rather than commit, so the read might never become stale – no need for the reading txn to abort prematurely just on detecting the other txn
so it’s the SSI algo that helps preserve snapshot isolation’s support for long-running reads from a consistent snapshot
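case 1 can be sketched with simplified bookkeeping (not PostgreSQL’s actual structures; names are invented): a txn records whose in-flight writes it ignored under MVCC visibility, and aborts at commit only if any of those writes actually committed and the txn itself wrote something.

```python
# Sketch of SSI stale-MVCC-read detection: record ignored in-flight writers;
# abort at commit time only if one of them committed AND we ourselves wrote.
class Txn:
    def __init__(self):
        self.ignored_writers = set()   # txn ids whose writes we didn't see

    def note_ignored_write(self, writer_txn_id):
        self.ignored_writers.add(writer_txn_id)

    def try_commit(self, committed_txns, did_write):
        # read-only txns can't cause write skew, so they always commit;
        # otherwise abort iff an ignored write has since committed
        if did_write and self.ignored_writers & committed_txns:
            return "abort"
        return "commit"

t = Txn()
t.note_ignored_write("T2")   # our read ignored T2's uncommitted write
```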
SSI Consideration Case 2: Detecting writes that affect prior reads
- another txn mods data after that data has been read
- solution can be similar to our use of index-range locks (like in 2PL), just that SSI locks don’t block other transactions (because they are optimistic)
- mechanism:
for the dimension we query on (e.g. shift_id), let there be an index (else fall back to table locks); we track which txns have read each index entry for shift_id
this info can be temporary: it’s only needed until the txn and all concurrent txns have finished, then it can be forgotten
when txn writes to the db, it must look in the indexes for any txn that have recently read the affected data
similar to acquiring a write lock on the affected key range
instead of blocking until the readers have committed (which would be pessimistic), we let the lock act as a tripwire:
i.e. notify the txn that the data they read may no longer be up to date
when one of the txns commits, it notifies the others that the data they read may be outdated
whichever commits first wins; the other must abort the txn and retry
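the tripwire can be sketched with toy structures (the shift_id keys and function names are illustrative): writers flag readers instead of blocking them, and at commit time the first committer wins while flagged rivals abort and retry.

```python
# Sketch of the SSI "tripwire": per-index-entry reader tracking; a writer
# flags readers rather than blocking them, and flagged txns abort at commit.
readers_of = {}   # index entry (e.g. a shift_id) -> set of reader txn ids
flagged = set()   # txns whose premise may now be outdated

def record_read(txn_id, entry):
    readers_of.setdefault(entry, set()).add(txn_id)

def record_write(txn_id, entry):
    for r in readers_of.get(entry, set()):
        if r != txn_id:
            flagged.add(r)            # tripwire: notify, don't block

def try_commit(txn_id):
    if txn_id in flagged:
        return "abort"                # a rival committed first
    for entry, readers in readers_of.items():
        if txn_id in readers:         # our commit outdates rivals' reads
            record_write(txn_id, entry)
    return "commit"

# both on-call doctors read the same shift, then both try to update it
record_read("T1", "shift:1234")
record_read("T2", "shift:1234")
record_write("T1", "shift:1234")
first = try_commit("T1")
second = try_commit("T2")
```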
performance of SSI #
tradeoff: granularity of tracking transactions’ R/W (precision) vs performance (speed)
if db keeps track at a low level of granularity, it can be precise about which txns need to abort \(\implies\) but bookkeeping overhead
Less detailed tracking \(\implies\) faster, but more transactions being aborted than necessary
sometimes, it’s tolerable to read info that got overwritten by another txn – PG uses this to reduce the number of unnecessary aborts
query latency is more predictable and less variable [vs 2PL]
compared to 2PL, there’s nothing blocking – because writers don’t block readers and readers don’t block writers
- RO queries can run on a consistent snapshot without requiring any locks
not limited to throughput of a single CPU core [compared to serial execution]
even with partitioned across multiple machines, txns can read and write data in multiple partitions while ensuring serializable isolation
rate of aborts affects the overall performance of SSI
txns that R/W data over a long period of time will likely conflict and abort \(\implies\) SSI requires that R-W txns be fairly short (long-running RO txns may be okay)
SSI likely less sensitive to slow txns than 2PL or serial executions though
Summary #
This chapter is a little involved, the summary is good to re-read as a quick refresher.
Unifying our vocabulary: the meaning of the word skew #
Skew as the projection mismatch across dimensions.
think: “These values should line up — but they’re systematically offset relative to a reference.”
So contextualising it:
Read Skew:
Read skew = a read observes parts of state from different logical times
reference axis: a consistent snapshot time
Write Skew:
Write skew = two transactions make decisions based on overlapping but not identical views, leading to inconsistent combined outcome
reference axis: invariant that must hold across writes
Clock skew:
Clock skew = different clocks disagree about current time
reference axis: true time (or a reference clock)
by the way clock skew differs from clock drift in that:
Skew = how far apart clocks are (an offset at a moment)
Drift = how differently clocks run (a rate difference over time)
Drift causes skew to grow
Sync fixes skew; calibration fixes drift
Distribution skew (statistics):
Distribution skew = data mass pulled to one side of the center
reference axis: symmetric distribution around the mean
see bot-elaboration:
The word **skew** shows up in many fields, but it actually carries a very consistent core meaning. Once you internalize that core, all the variants — statistical skew, clock skew, write skew, read skew — line up naturally.
---
# ✅ Core Meaning of “Skew”
**Skew = a systematic imbalance caused by misalignment along some reference axis.**
More simply:
> **Skew means things that should line up — don’t.**
There is always:
* a reference frame (time, distribution center, snapshot, ordering rule)
* multiple measurements or operations
* a mismatch in alignment relative to that reference
That’s it. That’s the unifying idea.
---
# 🧭 Visual Mental Model: “Parallel lines that drift”
Imagine multiple lines that are supposed to stay parallel and aligned — but one drifts.
```
Expected:
| | | | |
Skewed:
| | | |
```
They’re not wildly random — they’re **systematically offset**.
That offset = skew.
---
# 📊 Statistics: Distribution Skew
## What is skew in statistics?
Distribution skew = **data mass pulled to one side of the center**
Reference axis:
> symmetric distribution around the mean
When skew happens:
* one tail is longer or heavier
Examples:
Right-skewed:
```
|■■■■■■■■■■■■■■■□□□
```
Left-skewed:
```
□□□■■■■■■■■■■■■■■■|
```
Interpretation:
> Values are systematically shifted to one side relative to the center.
Same core idea:
**misalignment relative to the expected center.**
---
# ⏱️ Clock Skew (Distributed Systems)
## What is clock skew?
Clock skew = **different clocks disagree about current time**
Reference axis:
> true time (or a reference clock)
Example:
```
Node A clock: 12:00:05
Node B clock: 11:59:58
Node C clock: 12:00:11
```
They should align — but they are offset.
That offset = skew.
Not randomness — **systematic time offset**.
Important:
* skew = offset
* drift = rate difference over time
(They’re related but distinct.)
---
# 🗄️ Database: Read Skew
## What is read skew?
Read skew = **a read observes parts of state from different logical times**
Reference axis:
> a consistent snapshot time
Example:
Transaction reads:
```
Account A balance → after update
Account B balance → before update
```
Should align to one snapshot — but don’t.
That mismatch = skew.
You’re seeing **misaligned state in time**.
---
# ✍️ Database: Write Skew
This one feels trickier, but it fits perfectly.
## What is write skew?
Write skew = **two transactions make decisions based on overlapping but not identical views, leading to inconsistent combined outcome**
Reference axis:
> invariant that must hold across writes
Example:
```
Rule: at least 1 doctor must be on call
T1 reads: doctor B on call → OK → removes A
T2 reads: doctor A on call → OK → removes B
```
Each transaction saw a “valid” snapshot —
but those snapshots were **not aligned with each other**.
Decision inputs were skewed.
Result:
```
no doctors on call
```
Skew here = **decision-making based on misaligned reads**.
---
# 🧠 The Unifying Template
Whenever you see **skew**, ask:
> **Skew relative to what reference alignment?**
| Domain | Reference alignment | What is skew? |
| ---------- | -------------------- | ----------------------------------- |
| Statistics | symmetric center | distribution mass shifted |
| Clocks | true time | time offset |
| Read skew | snapshot time | mixed-time reads |
| Write skew | invariant view | decisions from misaligned snapshots |
| Load skew | uniform distribution | uneven request spread |
| Data skew | partition balance | uneven data placement |
Same pattern every time.
---
# 🧩 Why “Skew” and not just “Error”?
Because skew implies:
* not random noise
* not symmetric deviation
* **directional bias**
* **systematic offset**
Skew = *tilted*, not *noisy*.
---
# 🔬 Even deeper intuition (optional but powerful)
Think of skew as:
> **projection mismatch across dimensions**
Examples:
Statistics:
* distribution projected onto number line → off-center mass
Clocks:
* time projected across nodes → offset values
Databases:
* system state projected across reads → inconsistent slice
It’s always:
**multiple observers → one axis → misalignment**
---
# ✅ One-line universal mental model
When you see the word **skew**, translate it to:
> **“These values should line up — but they’re systematically offset relative to a reference.”**
If you apply that sentence:
* statistical skew → makes sense
* clock skew → makes sense
* read skew → makes sense
* write skew → makes sense
Same word. Same structure. Different axes.