Sharding Strategies

Table of Contents

🔴 P0 — how to split data across multiple databases when one isn’t enough

Problem #

A single database can hold ~1-10TB comfortably. Beyond that, or when write throughput exceeds a single node’s capacity, you need to split data across multiple nodes (sharding/partitioning).

Approaches #

Strategy	Mechanism	Pros	Cons
Range-based	Shard by key range (A-M, N-Z)	Range queries efficient	Hot spots if keys aren’t uniform
Hash-based	Shard by hash(key) % N	Even distribution	Range queries require scatter
Consistent hashing	Hash ring with virtual nodes	Minimal redistribution on resize	More complex
Directory-based	Lookup table maps key → shard	Flexible	Lookup table is SPOF

The Hard Problems #

Partition Key Selection #

The shard key determines what goes together. user_id as shard key means all of a user’s data is co-located (good for user-scoped queries, bad if one user generates 100× more data).

Hot Partitions #

Celebrity accounts on user_id sharding → one shard gets all the load
Mitigation: add jitter to keys, split hot keys, application-level routing

Data Hotspots \(\neq\) Imbalanced Shards #

Phenomenon	What it is	Root cause	Affects	Signal
Imbalanced shards	Unequal data volume distribution	Hash function / key selection strategy	Storage utilisation, disk/memory per shard	Data distribution problem
Data hotspots	Unequal access pattern distribution	Traffic patterns, user behaviour, business logic	Throughput, latency, CPU/network per shard	Operational bottleneck problem

can have one without the other:
1. balanced shard + hotspot
  - data perfectly distributed by volume across the shards (storage is fine), but the access pattern shows hotspots (throughput is affected)
  - imbalanced shards + no hotspots: e.g. if hash function is bad and volume is unequally distributed, but traffic is evenly distributed across shards – though this is a more tolerable (and rare) scenario
they have different solutions:
1. imbalanced shards need to fix the volume distribution strategy
  this means better hash function, consistent hashing with virtual nodes, range-based sharding with better key-selection, a full re-shard maybe…
2. data hotspots need to work around the imbalance
  this means replica specialisation (hot shard to be replicated more heavily or on better HW), request routing improvements (do it via a shard-specific cache), micro-sharding (shard it further), queue-based writes

Cross-Shard Queries #

Joins across shards are expensive (scatter-gather)
Transactions across shards require 2PC (see also: Two-Phase Commit)
Denormalisation often necessary to keep queries shard-local

Instinct #

Delay sharding as long as possible. A well-indexed, properly-tuned single PostgreSQL instance handles more than most people think (~50K queries/s, several TB). Try to solve the problem using caching/replication first before resorting to sharding.

When you must shard, choose the partition key based on your most common query pattern, accept that cross-shard queries will be slow, and plan for hot-partition mitigation from day one.

Framing #

Our hotspots are structurally determined by merchant size, not a hash function problem. We’re accepting the imbalance as a feature, and our solution is replica asymmetry and per-shard caching. Here’s the trade-off: we lose some operational simplicity but avoid constant resharding.

MISCONCEPTION: Sharding is a tuning action, not a scaling starting point. Rely on read replicas and caching first. Only consider sharding when capacity math shows you’re hitting the ceiling of what a single machine can handle.
INTERVIEW: When to bring up sharding: when suggesting that a single DB won’t work AND replication and caching won’t solve the problem either.
CONVENTION: Technically “partitioning” refers to splitting data within a single DB instance, and “sharding” refers to splitting across multiple machines. In practice, people use them interchangeably.
PITFALL: Cross-shard queries explode latency. If a query requires data from multiple shards, you need scatter-gather, which defeats the purpose. Design the shard key so your most common queries hit a single shard.

References #

Vitess: Sharding — YouTube’s sharding layer
Sharding & IDs at Instagram
Principles of Sharding for Relational Databases — Citus

DDIA 2e Reference #

Chapter 6: Partitioning — this IS Chapter 6