- rtshkmr's digital garden/
- References/
- Architecture Design Basics/
- Pattern Taxonomy/
- Scaling & Performance/
- Sharding Strategies/
Sharding Strategies
Table of Contents
π΄ P0 — how to split data across multiple databases when one isn’t enough
Problem #
A single database can hold ~1-10TB comfortably. Beyond that, or when write throughput exceeds a single node’s capacity, you need to split data across multiple nodes (sharding/partitioning).
Approaches #
| Strategy | Mechanism | Pros | Cons |
|---|---|---|---|
| Range-based | Shard by key range (A-M, N-Z) | Range queries efficient | Hot spots if keys aren’t uniform |
| Hash-based | Shard by hash(key) % N | Even distribution | Range queries require scatter |
| Consistent hashing | Hash ring with virtual nodes | Minimal redistribution on resize | More complex |
| Directory-based | Lookup table maps key β shard | Flexible | Lookup table is SPOF |
The Hard Problems #
Partition Key Selection #
The shard key determines what goes together. user_id as shard key means all of a user’s data is co-located (good for user-scoped queries, bad if one user generates 100Γ more data).
Hot Partitions #
- Celebrity accounts on
user_idsharding β one shard gets all the load - Mitigation: add jitter to keys, split hot keys, application-level routing
Data Hotspots \(\neq\) Imbalanced Shards #
| Phenomenon | What it is | Root cause | Affects | Signal |
|---|---|---|---|---|
| Imbalanced shards | Unequal data volume distribution | Hash function / key selection strategy | Storage utilisation, disk/memory per shard | Data distribution problem |
| Data hotspots | Unequal access pattern distribution | Traffic patterns, user behaviour, business logic | Throughput, latency, CPU/network per shard | Operational bottleneck problem |
- can have one without the other:
- balanced shard + hotspot
- data perfectly distributed by volume across the shards (storage is fine), but the access pattern shows hotspots (throughput is affected)
- imbalanced shards + no hotspots: e.g. if hash function is bad and volume is unequally distributed, but traffic is evenly distributed across shards – though this is a more tolerable (and rare) scenario
- balanced shard + hotspot
- they have different solutions:
imbalanced shards need to fix the volume distribution strategy
this means better hash function, consistent hashing with virtual nodes, range-based sharding with better key-selection, a full re-shard maybe…
data hotspots need to work around the imbalance
this means replica specialisation (hot shard to be replicated more heavily or on better HW), request routing improvements (do it via a shard-specific cache), micro-sharding (shard it further), queue-based writes
Cross-Shard Queries #
- Joins across shards are expensive (scatter-gather)
- Transactions across shards require 2PC (see also: Two-Phase Commit)
- Denormalisation often necessary to keep queries shard-local
Instinct #
Delay sharding as long as possible. A well-indexed, properly-tuned single PostgreSQL instance handles more than most people think (~50K queries/s, several TB). Try to solve the problem using caching/replication first before resorting to sharding.
When you must shard, choose the partition key based on your most common query pattern, accept that cross-shard queries will be slow, and plan for hot-partition mitigation from day one.
Framing #
Our hotspots are structurally determined by merchant size, not a hash function problem. We’re accepting the imbalance as a feature, and our solution is replica asymmetry and per-shard caching. Here’s the trade-off: we lose some operational simplicity but avoid constant resharding.
- MISCONCEPTION: Sharding is a tuning action, not a scaling starting point. Rely on read replicas and caching first. Only consider sharding when capacity math shows you’re hitting the ceiling of what a single machine can handle.
- INTERVIEW: When to bring up sharding: when suggesting that a single DB won’t work AND replication and caching won’t solve the problem either.
- CONVENTION: Technically “partitioning” refers to splitting data within a single DB instance, and “sharding” refers to splitting across multiple machines. In practice, people use them interchangeably.
- PITFALL: Cross-shard queries explode latency. If a query requires data from multiple shards, you need scatter-gather, which defeats the purpose. Design the shard key so your most common queries hit a single shard.
References #
- Vitess: Sharding β YouTube’s sharding layer
- Sharding & IDs at Instagram
- Principles of Sharding for Relational Databases β Citus
DDIA 2e Reference #
- Chapter 6: Partitioning β this IS Chapter 6