Chapter 6. Partitioning
Terminology for a “partition” may vary based on the tool used
primary reason for partitioning is scalability
partitions are typically defined such that each piece of data (record, row, document) belongs to exactly one partition
with a shared-nothing cluster, a large dataset can be effectively distributed across many disks, and the query load across many processors
each partition can execute queries on its own data independently, so query throughput scales by adding nodes
- hard to do this for complex queries though
chapter scope:
different approaches for partitioning large datasets
how indexing of data interacts with partitioning
rebalancing: adding / removing nodes in cluster
Partitioning and Replication #
typically combined with replication (for fault-tolerance benefits)
interesting: each node may be the leader for some partitions and a follower for other partitions
this is how replication and partitioning are combined
Partitioning of Key-Value Data #
Goal: spread the data and the query load evenly across nodes
unfair partitioning == skewed data / queries, eats away @ efficiency
hot spot: partition with disproportionately high load
the ideal distribution might seem to be assigning records to nodes randomly. Problem: when reading, we'd have no way of knowing which node holds a given item, so every read would have to query all nodes. So we explore other approaches
Partitioning by Key Range #
- analogy: encyclopedia volumes
just need to know which volume (node) contains the content
so we can pick the correct book off the shelf
- characteristics of key ranges:
- continuous range of keys (min-max)
- can determine which partition to use given the range-boundaries
- partition boundaries manually / automatically chosen
- range not necessarily evenly spaced, follows data-distribution
- within partition:
- keys kept in sorted order (SSTables, LSM-trees) \(\implies\) easy range-scans
- also allows fetching several related records in one query
- downsides:
hotspot-problem / skewed access pattern
can be easy to get hotspots because of the access patterns
e.g. say it’s a time-based write of sensor data, then we can end up getting write amplification due to temporal locality
- in this case, the key should be something that will allow it to be distributed (e.g. we can do sensor-prefixed timestamps) as the key, then the spread will be better
hot partitions creating noisy neighbours
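A minimal sketch of key-range partition lookup; the boundaries and the sensor-prefixed key scheme below are hypothetical, chosen just to illustrate the idea:

```python
import bisect

# Hypothetical partition boundaries: partition 0 holds keys < "f",
# partition 1 holds ["f", "m"), partition 2 ["m", "s"), partition 3 >= "s".
boundaries = ["f", "m", "s"]

def partition_for(key: str) -> int:
    # bisect_right returns how many boundaries the key has passed,
    # which is exactly the index of the partition that owns it
    return bisect.bisect_right(boundaries, key)

# A sensor-prefixed key (sensor_id first, timestamp second) spreads
# concurrent writes across partitions; a plain timestamp key would
# send all current writes to the same partition (a hotspot).
def sensor_key(sensor_id: str, timestamp: str) -> str:
    return f"{sensor_id}:{timestamp}"

print(partition_for("apple"))   # 0
print(partition_for("grape"))   # 1
print(partition_for(sensor_key("s42", "2024-01-01T00:00:00")))
```

Because the boundaries are sorted, lookup is O(log P) in the number of partitions, and range scans just walk consecutive partitions.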
Partitioning by Hash of Key #
a good hash function takes skewed data and makes it uniformly distributed
good hash functions give a good spread, so better distribution that may avoid skew/hotspots
hash functions for partitioning need not be cryptographically strong, just well-distributed
but language-native hash functions may not be good enough, because they can be non-deterministic across processes (e.g. Java's `Object.hashCode`), depending on factors beyond the data itself
[consistent hashing]
the "consistency" in consistent hashing is different from replica consistency and different from ACID consistency
the term is confusing, and consistent hashing as originally defined (for CDNs) works poorly for databases – better to just use the term "hash partitioning"
each partition gets a range of hashes instead of a range of keys – a key's partition can still be computed directly
this just becomes something like a jump table
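A sketch contrasting a stable hash (MD5 here, which DDIA notes Cassandra and MongoDB have used for this purpose) with Python's built-in `hash()`, which is randomized per process and therefore illustrates why language-native hashes are unsuitable. The key names and partition count are made up:

```python
import hashlib

NUM_PARTITIONS = 8

def stable_partition(key: str) -> int:
    # deterministic across processes and machines – safe for partitioning
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_PARTITIONS

def unstable_partition(key: str) -> int:
    # Python randomizes str hashes per process (PYTHONHASHSEED),
    # so two nodes would disagree on key placement – do NOT use
    return hash(key) % NUM_PARTITIONS

keys = [f"user{i}" for i in range(1000)]
counts = [0] * NUM_PARTITIONS
for k in keys:
    counts[stable_partition(k)] += 1
print(counts)  # roughly 125 keys per partition: an even spread
```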
downsides:
can’t do range querying anymore
because keys that were adjacent are now scattered across partitions by the hash
possible improvement: concatenated index approach
compound primary keys spanning several columns with partial hashing and concatenated index
only first part of the key is hashed for finding the partition, other columns used as concatenated index
- the concatenated index can be sorted (e.g. Cassandra’s SSTables)
A query therefore cannot search for a range of values within the first column of a compound key, but if it specifies a fixed value for the first column, it can perform an efficient range scan over the other columns of the key.
useful for one-to-many relationships
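A Cassandra-style sketch of this compound-key idea, using a hypothetical `(user_id, timestamp)` schema: the first column alone picks the partition, and the remaining columns form a sorted concatenated index within it:

```python
import hashlib
from collections import defaultdict

NUM_PARTITIONS = 4

def partition_of(partition_key: str) -> int:
    digest = hashlib.md5(partition_key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_PARTITIONS

# partition id -> list of (user_id, timestamp, value) rows, kept sorted;
# only the first column (user_id) determines the partition, the rest
# act as a concatenated index within it
store = defaultdict(list)

def insert(user_id: str, timestamp: str, value: str) -> None:
    rows = store[partition_of(user_id)]
    rows.append((user_id, timestamp, value))
    rows.sort()  # a real engine keeps this sorted incrementally (SSTables)

def range_scan(user_id: str, ts_from: str, ts_to: str):
    # fixed first column -> one partition; then a range over timestamps
    rows = store[partition_of(user_id)]
    return [r for r in rows if r[0] == user_id and ts_from <= r[1] < ts_to]

insert("alice", "2024-01-01", "a")
insert("alice", "2024-01-02", "b")
insert("bob",   "2024-01-01", "c")
print(range_scan("alice", "2024-01-01", "2024-01-03"))
```

Note what is NOT possible here: a range scan over `user_id` itself, since user ids are scattered by the hash.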
Skewed Workloads and Relieving Hot Spots #
- even for hashed approaches, there are still edge cases of access patterns that lead to skewed workloads
- the onus is on the app to compensate for highly skewed workloads
- some tricks:
key-salting
adding a 2-digit decimal random number @ beginning / end of a key – 100-place spread
problem: this means now there’s a need to manually book-keep across 100-keys and combine
- bot-dump
what the trick is called:
- **key salting** is the most common term (e.g. `user123` → `user123#07`; the "salt" is the random component); it appears in DDIA, the HBase/Bigtable/Cassandra docs, and MapReduce literature
- related / overlapping terms: **hot-key sharding** / **hotspot sharding** (emphasizes the reason rather than the mechanism), **artificial key fan-out** (one logical key maps to many physical keys), **write sharding** (write-path focused; reads then become scatter/gather)
- more formally: manual application-level splitting of a logical key into multiple physical keys to reduce load skew – the system's hash function is not the problem (the hash of two identical ids is still the same), the key distribution is
do systems automatically detect and compensate for skew? some do, partially, with caveats:
- Bigtable / Spanner: detect hot tablets and split ranges dynamically, but cannot split a single row key – salting is still needed for single-row hotspots
- HBase: region splitting as in Bigtable; the classic "hot row" problem remains; the docs recommend key salting
- Cassandra: monitors partition load and can rebalance tokens, but cannot split a hot partition key; documents the hot partition problem and recommends write sharding
- Kafka: hot partitions are visible and partitions can be reassigned, but ordering constraints prevent transparent splitting; applications must still choose keys carefully
- streaming systems (Flink, Beam, Spark Structured Streaming): can detect skewed operators and apply key splitting with automatic merging of partial aggregates (e.g. Beam's hot key fanout) – this works because the operations are constrained to be associative/commutative (counts, sums), not arbitrary reads
why automatic compensation is hard: to split a hot key safely, the system must know whether writes commute, whether reads can be merged, and whether ordering matters – only the application knows the semantics of the data, which is why DDIA says it's the responsibility of the application
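A sketch of the key-salting trick with the 2-digit suffix from the note above, using a counter workload (counts merge trivially, which is what makes the gather step safe); the key names are hypothetical:

```python
import random

SALT_BUCKETS = 100  # two-digit decimal suffix => 100-way spread

def salted_key(hot_key: str) -> str:
    # writes pick a random salt, so they spread over 100 physical keys
    return f"{hot_key}#{random.randrange(SALT_BUCKETS):02d}"

def all_salted_keys(hot_key: str):
    # reads must gather all 100 sub-keys and combine them
    return [f"{hot_key}#{i:02d}" for i in range(SALT_BUCKETS)]

store = {}

def incr(hot_key: str) -> None:
    k = salted_key(hot_key)
    store[k] = store.get(k, 0) + 1

def total(hot_key: str) -> int:
    return sum(store.get(k, 0) for k in all_salted_keys(hot_key))

for _ in range(10_000):
    incr("celebrity_user")  # a single hot logical key
print(total("celebrity_user"))  # 10000 – correct, but the read touched 100 keys
```

This makes the bookkeeping cost concrete: the write path is one key, the read path is a 100-way scatter/gather.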
Partitioning and Secondary Indexes #
The points so far have been about primary keys and how they can be mapped to partitions based on the partitioning scheme (a key-value data model)
The problem with secondary indices is that they don’t map neatly to partitions
2 approaches to partitioning a db with secondary indices:
document-based partitioning
term-based partitioning
Partitioning Secondary Indexes by Document #
- it’s a local index, not a global one. Each partition is completely separate: each maintains its own secondary indexes that cover only the documents in that partition
- typically, there’s a need to combines across partitions to get the complete response for the query
- the querying approach is aka scatter/gather
- makes reads on secondary indexes expensive
- Most database vendors recommend that you structure your partitioning scheme so that secondary index queries can be served from a single partition, but that is not always possible.
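A toy sketch of a scatter/gather query over document-partitioned (local) indexes: each partition indexes only its own documents, so a query by color has to ask every partition and combine the results. The documents and colors are made up:

```python
# Each partition holds its own documents plus a local secondary index.
partitions = [
    {"docs": {1: {"color": "red"},  2: {"color": "blue"}}},
    {"docs": {3: {"color": "red"},  4: {"color": "green"}}},
]

# Build each partition's local index: color -> doc ids (this partition only)
for p in partitions:
    idx = {}
    for doc_id, doc in p["docs"].items():
        idx.setdefault(doc["color"], []).append(doc_id)
    p["index"] = idx

def query_by_color(color):
    # scatter/gather: ask every partition, then combine
    results = []
    for p in partitions:
        results.extend(p["index"].get(color, []))
    return sorted(results)

print(query_by_color("red"))  # [1, 3] – the query touched both partitions
```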
Partitioning Secondary Indexes by Term #
- each partition doesn’t have its own secondary index (i.e. its local index) – it has a global index that covers data in ALL partitions
- the global index must also be partitioned (else it may undo the point of the primary partitioning scheme in the first place)
- this is term-partitioning (where a term is a key-value pair, e.g. color:red; in full-text search, the terms are all the words that occur in a document)
- the secondary index partitioning may be done on the term itself (allows range querying over terms) or on a hash of the term (allows even distribution)
- upsides of global (term-partitioned) index
reads are now more efficient (compared to scatter/gather)
just request to the partition containing the term we want
- downsides:
- slower writes, more complicated. Writes now affect more partitions (since every term in document might be on a different partition, on a different node)
- ideally a write to a document and all its index entries would be atomic, requiring a distributed transaction from the db engine; in practice, updates to global secondary indexes are often asynchronous instead
Rebalancing Partitions #
Rebalancing: process of moving load from one node in the cluster to another
necessary for basic maintenance, it’s expected
expected outcomes from rebalancing:
- load (data storage, read / write requests) should be shared fairly between the nodes in the cluster
- during rebalancing, the db should continue accepting queries (read & write)
- shouldn’t overcorrect – no more data than necessary should be moved between nodes
Strategies for Rebalancing #
Few different ways to assign partitions to nodes.
Negative Example: hash mod N #
seems like a sensible solution at first glance, but when N (the number of nodes) changes, most keys suddenly map to a different node
so the problem: rebalancing is excessively expensive with this strategy, because changing N forces most of the data to move between nodes
unnecessary movement of data around, purely because of the rebalancing strategy chosen
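A quick demonstration of why `hash mod N` rebalances badly: growing a hypothetical cluster from 10 to 11 nodes changes the assignment of roughly 10/11 of all keys:

```python
import hashlib

def node_for(key: str, n_nodes: int) -> int:
    h = int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")
    return h % n_nodes

keys = [f"key{i}" for i in range(10_000)]
moved = sum(1 for k in keys if node_for(k, 10) != node_for(k, 11))
print(f"{moved / len(keys):.0%} of keys move when N goes 10 -> 11")
# roughly 10/11 ≈ 91% of keys change nodes; with a fixed-partitions
# scheme, only about 1/11 of the data would need to move
```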
Legit: Fixed number of partitions; partition migration #
make many more partitions than nodes, then assign several partitions to each node
only the mapping of partitions to nodes is changed
- no change in the number of partitions
- no change in the assignment of keys to partitions
e.g. of usage in Riak, Elasticsearch, Couchbase…
entire partitions are moved (migrated) between nodes:
when new node is added, it steals a few partitions from every existing node until partitions are fairly distributed again
this is a time-consuming process, during which reads and writes still work because the old assignment of partitions can still be used
can't choose a ridiculously high number of partitions either, since each partition incurs management overhead
tradeoffs in partition size
too large == rebalancing and recovery from node failures becomes expensive
too small == too much overhead
size of partition grows proportionally to the total amount of data in the cluster
so the goldilocks zone is hard to hit when the number of partitions is FIXED but the dataset size is VARIABLE
extra benefits:
- can assign more partitions to nodes that are more powerful – so some designing to account for access patterns can be done by us too
typically fixed-partition dbs don’t implement partition-splitting, keeps things operationally simpler that way
just need to choose the initial configuration correctly, accounting for future growth
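A sketch of fixed-partitions rebalancing: the key→partition mapping never changes, only the partition→node mapping does. The numbers (256 partitions, round-robin assignment, stealing every 5th partition) are hypothetical:

```python
NUM_PARTITIONS = 256  # fixed at setup time, many more than nodes

def assign(num_nodes: int):
    # round-robin mapping of partitions to nodes
    return {p: p % num_nodes for p in range(NUM_PARTITIONS)}

before = assign(4)            # 4 nodes, 64 partitions each
after = dict(before)
for p in range(0, NUM_PARTITIONS, 5):
    after[p] = 4              # node 4 joins and steals every 5th partition

moved = sum(1 for p in before if before[p] != after[p])
print(f"{moved} of {NUM_PARTITIONS} partitions moved")
# only the stolen partitions migrate; since the key->partition mapping
# is untouched, no keys need to be rehashed
```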
Legit: Dynamic partitioning #
when the partitioning scheme is key-range, a fixed number of partitions with fixed boundaries is too troublesome, which is why dynamic partitioning is done
- if wrong boundaries, then there can be hotspots
- re-configuring the partition boundaries manually would be tedious
dynamic partitioning: a partition that grows beyond a configured size is split in two (and merged with a neighbour if it shrinks), similar to how a B-tree's top level expands and contracts
upsides
number of partitions adapts to the total data volume, the max size of an individual partition can be configured by us to a max ceiling
when starting off, can do some pre-splitting and set an initial set of partitions. In the case of key-range partitioning, pre-splitting requires that you already know what the key distribution is going to look like
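A toy sketch of dynamic key-range partitioning: when a partition exceeds a max size, split it at its median key, like a B-tree node split. The threshold and keys are made up and tiny just for illustration:

```python
MAX_PARTITION_SIZE = 4  # tiny threshold, just for illustration

# each partition is (lo, hi, sorted keys), covering the key range [lo, hi);
# "{" is the character right after "z", so one partition spans a-z initially
partitions = [("a", "{", [])]

def insert(key: str) -> None:
    for i, (lo, hi, keys) in enumerate(partitions):
        if lo <= key < hi:
            keys.append(key)
            keys.sort()
            if len(keys) > MAX_PARTITION_SIZE:
                # split at the median key, like a B-tree node split
                mid = keys[len(keys) // 2]
                partitions[i] = (lo, mid, [k for k in keys if k < mid])
                partitions.insert(i + 1,
                                  (mid, hi, [k for k in keys if k >= mid]))
            return

for k in ["apple", "banana", "cherry", "durian", "elder", "fig", "grape"]:
    insert(k)
print([(lo, hi, len(keys)) for lo, hi, keys in partitions])
```

The partition count adapts to the data volume: boundaries follow wherever the keys actually cluster, which is exactly what fixed boundaries cannot do.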
Legit: Partitioning Proportionally to Nodes #
in previous cases, the number of partitions was independent of the number of nodes;
we can consider making the number of partitions proportional to the number of nodes \(\implies\) fixed number of partitions per node
the size of each partition grows proportionally to the dataset size while the number of nodes remains unchanged
we can still scale horizontally by increasing the number of nodes, which makes the partitions smaller again (keeping a fixed number of partitions per node)
When a new node joins the cluster, it randomly chooses a fixed number of existing partitions to split, and then takes ownership of one half of each of those split partitions while leaving the other half of each partition in place.
- Picking partition boundaries randomly requires that hash-based partitioning is used (so the boundaries can be picked from the range of numbers produced by the hash function).
Operations: Automatic or Manual Rebalancing #
it’s a gradient of options from fully automatic rebalancing to fully manual
e.g. the system may automatically suggest a rebalancing, which a db admin has to commit before it takes effect
rebalancing can be expensive, so it's something we don't want a great deal of uncertainty around – typically we wouldn't want full automation
expensive because:
- need to reroute requests
- move a large amount of data from one node to another
- so may overload the network or the nodes and harm the performance of other requests while the rebalancing is happening
just have a human in the loop! slower, but a predictable operation
automation may actually end up causing cascading failures, e.g.:
a node is overloaded and slow to respond to requests
other nodes conclude that it's dead
they automatically rebalance to move its load away
this puts additional load on the already overloaded node, the other nodes, and the network – the failure cascades
Request Routing #
general case of service discovery, in this case of the db nodes
high-level approaches to service discovery:
1. allow clients to contact any node (e.g. via a round-robin load balancer); if that node doesn't own the partition, it forwards the request to the right node and relays the reply
2. send all client requests to a routing tier first, which determines the responsible node (a partition-aware load balancer)
3. make the clients partition-aware: clients know the partitioning and the assignment of partitions to nodes, so they can connect directly to the appropriate node
the hard problem is not routing per se, but how the routing component learns about changes in partition assignment as rebalancing happens over the service lifecycle – a consensus problem
some ways of managing this :
coordination services that become the routing authority e.g. ZooKeeper that tracks cluster metadata
other actors like the routing tier / partition-aware client can subscribe to ZooKeeper for information
they get notified when partition changes ownership
examples:
LinkedIn's Espresso uses Helix, which relies on ZooKeeper
SolrCloud and Kafka also use ZooKeeper
MongoDB uses its own config servers for cluster metadata, with mongos daemons as the routing tier
use a gossip protocol among the nodes to disseminate any changes in the cluster state
so any node can help to forward requests based on the gossip
Parallel Query Execution #
KIV chapter 10 for fast parallel execution of data-warehouse queries
Summary #
Partitioning is necessary when you have so much data that storing and processing it on a single machine is no longer feasible.
spreads data and query load evenly across multiple machines
avoids hotspots
2 main partitioning schemes:
there can be hybrid approaches
key-range partitioning
hash partitioning
how secondary indexes are affected by partitioning
local indexes (document-partitioned indexes)
global indexes (term-partitioned indexes)
relevance since DDIA was printed:
Since DDIA was published, massively parallel query execution has moved from specialized data warehouse systems into mainstream infrastructure. Modern analytical engines decouple storage from compute, rely on columnar formats and vectorized execution, and can elastically scale parallel workers over shared object storage. While the execution principles described in DDIA remain valid, the operational burden of parallelism has shifted from careful data placement and rebalancing toward over-partitioning, elastic compute, and scan-optimized execution.