
Sharding Strategies


🔴 P0: how to split data across multiple databases when one isn’t enough

Problem #

A single database can hold ~1-10TB comfortably. Beyond that, or when write throughput exceeds a single node’s capacity, you need to split data across multiple nodes (sharding/partitioning).

Approaches #

| Strategy | Mechanism | Pros | Cons |
| --- | --- | --- | --- |
| Range-based | Shard by key range (A-M, N-Z) | Range queries efficient | Hot spots if keys aren’t uniform |
| Hash-based | Shard by hash(key) % N | Even distribution | Range queries require scatter |
| Consistent hashing | Hash ring with virtual nodes | Minimal redistribution on resize | More complex |
| Directory-based | Lookup table maps key → shard | Flexible | Lookup table is a single point of failure (SPOF) |
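
The "minimal redistribution on resize" difference between hash(key) % N and a hash ring can be checked with a small experiment. The sketch below is illustrative only: `stable_hash`, `HashRing`, the shard names, and the choice of 100 virtual nodes per shard are assumptions made for this example, not part of any particular database.

```python
import bisect
import hashlib


def stable_hash(value: str) -> int:
    """Deterministic 64-bit hash (Python's built-in hash() is salted per process)."""
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")


def modulo_shard(key: str, num_shards: int) -> int:
    """Hash-based sharding: hash(key) % N."""
    return stable_hash(key) % num_shards


class HashRing:
    """Consistent hashing: each shard owns many points (virtual nodes) on a ring."""

    def __init__(self, shards, vnodes=100):
        self._points = sorted(
            (stable_hash(f"{shard}#{i}"), shard)
            for shard in shards
            for i in range(vnodes)
        )
        self._hashes = [h for h, _ in self._points]

    def shard_for(self, key: str) -> str:
        # First ring point clockwise from the key's hash, wrapping at the end.
        idx = bisect.bisect_right(self._hashes, stable_hash(key)) % len(self._points)
        return self._points[idx][1]


def fraction_moved(before, after, keys):
    """Share of keys whose shard differs between two routing functions."""
    return sum(before(k) != after(k) for k in keys) / len(keys)


keys = [f"user-{i}" for i in range(10_000)]

# Growing from 4 to 5 shards with hash % N.
moved_modulo = fraction_moved(
    lambda k: modulo_shard(k, 4), lambda k: modulo_shard(k, 5), keys
)

# Growing from 4 to 5 shards on a ring.
ring4 = HashRing(["s0", "s1", "s2", "s3"])
ring5 = HashRing(["s0", "s1", "s2", "s3", "s4"])
moved_ring = fraction_moved(ring4.shard_for, ring5.shard_for, keys)

print(f"hash % N moved  {moved_modulo:.0%} of keys")
print(f"hash ring moved {moved_ring:.0%} of keys")
```

With a uniform hash, going from 4 to 5 shards under hash % N should relocate roughly four in five keys, while the ring should relocate roughly one in five, and only onto the new shard. The exact ring figure varies with the number of virtual nodes.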

The Hard Problems #

Partition Key Selection #

The shard key determines what goes together. Using user_id as the shard key means all of a user’s data is co-located (good for user-scoped queries, bad if one user generates 100× more data than the others).
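
To make the co-location point concrete, here is a minimal sketch that counts how many shards a "all orders for one user" query would touch under two candidate shard keys. The `Order` record, `NUM_SHARDS`, and the key names are hypothetical, chosen only for illustration.

```python
import hashlib
from collections import namedtuple

Order = namedtuple("Order", "order_id user_id")
NUM_SHARDS = 8  # hypothetical cluster size


def shard_of(key: str) -> int:
    """Map a shard-key value to a shard with a stable hash."""
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS


# Hypothetical workload: 50 orders placed by a single user.
orders = [Order(order_id=f"order-{i}", user_id="user-42") for i in range(50)]

# Shards that hold this user's orders under each choice of shard key.
shards_by_user_id = {shard_of(o.user_id) for o in orders}
shards_by_order_id = {shard_of(o.order_id) for o in orders}

print("shards touched when sharding by user_id: ", len(shards_by_user_id))
print("shards touched when sharding by order_id:", len(shards_by_order_id))
```

Sharding by user_id keeps the query on one shard; sharding by order_id spreads the same rows around, so the same query has to visit several shards.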

Hot Partitions #

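  • Celebrity accounts under user_id sharding → the shard holding the celebrity takes a disproportionate share of the load
  • Mitigation: salt hot keys (append a random suffix so one logical key spreads over several shards), split hot keys onto their own shards, or route known-hot keys at the application level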
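
Spreading a hot key by appending a random suffix (often called key salting) can be sketched as below. This is one possible reading of the mitigation, not a prescribed design: `HOT_KEYS`, `SALT_BUCKETS`, and the helper names are assumptions for the example.

```python
import hashlib
import random
from collections import Counter

NUM_SHARDS = 8                 # hypothetical cluster size
SALT_BUCKETS = 16              # hypothetical: sub-keys per hot key
HOT_KEYS = {"user-celebrity"}  # hypothetical: keys known to be hot


def shard_of(key: str) -> int:
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS


def write_key(key: str, rng: random.Random) -> str:
    """Append a random suffix to hot keys so their writes spread out."""
    if key in HOT_KEYS:
        return f"{key}#{rng.randrange(SALT_BUCKETS)}"
    return key


def read_keys(key: str) -> list:
    """Reads of a salted key must fan out over every sub-key."""
    if key in HOT_KEYS:
        return [f"{key}#{i}" for i in range(SALT_BUCKETS)]
    return [key]


rng = random.Random(0)  # seeded so the sketch is repeatable

# Shard hit counts for 1,000 writes to the hot key, without and with salting.
unsalted = Counter(shard_of("user-celebrity") for _ in range(1_000))
salted = Counter(shard_of(write_key("user-celebrity", rng)) for _ in range(1_000))

print("shards hit without salting:", len(unsalted))
print("shards hit with salting:   ", len(salted))
```

The cost is on the read side: a read of the hot key now has to query every sub-key and merge the results, so salting trades write concentration for read fan-out.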

Data Hotspots ≠ Imbalanced Shards #

| Phenomenon | What it is | Root cause | Affects | Signal |
| --- | --- | --- | --- | --- |
| Imbalanced shards | Unequal data volume distribution | Hash function / key selection strategy | Storage utilisation, disk/memory per shard | Data distribution problem |
| Data hotspots | Unequal access pattern distribution | Traffic patterns, user behaviour, business logic | Throughput, latency, CPU/network per shard | Operational bottleneck problem |

  • You can have one without the other:
    1. balanced shards + hotspot: data is evenly distributed by volume across the shards (storage is fine), but access concentrates on a few shards (throughput is affected)
    2. imbalanced shards + no hotspot: e.g. a poor hash function distributes volume unequally while traffic stays even across shards; this is the more tolerable (and rarer) scenario
  • They have different solutions:
    1. imbalanced shards: fix the volume distribution strategy

      this means a better hash function, consistent hashing with virtual nodes, range-based sharding with better key selection, or, if necessary, a full re-shard

    2. data hotspots: work around the uneven access pattern

      this means replica specialisation (replicate the hot shard more heavily or put it on better hardware), request routing improvements (e.g. a shard-specific cache in front of the hot shard), micro-sharding (split the hot shard further), queue-based writes
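
The two phenomena can be told apart by measuring skew separately on stored volume and on request load. The sketch below assumes a simple max-to-mean ratio and an arbitrary threshold of 1.5; both the metric and the per-shard numbers are hypothetical and would need tuning against real monitoring data.

```python
def skew(per_shard):
    """Max-to-mean ratio: 1.0 is perfectly even, higher means more skewed."""
    mean = sum(per_shard) / len(per_shard)
    return max(per_shard) / mean


def diagnose(rows_per_shard, requests_per_shard, threshold=1.5):
    """Classify a cluster from per-shard volume and per-shard traffic."""
    imbalanced = skew(rows_per_shard) > threshold    # storage-side problem
    hot = skew(requests_per_shard) > threshold       # throughput-side problem
    if imbalanced and hot:
        return "imbalanced shards and hotspot"
    if imbalanced:
        return "imbalanced shards only"
    if hot:
        return "hotspot only"
    return "healthy"


# Hypothetical case 1: even volume, one shard takes most of the traffic.
print(diagnose(rows_per_shard=[100, 100, 100, 100],
               requests_per_shard=[10, 10, 10, 170]))

# Hypothetical case 2: uneven volume, even traffic.
print(diagnose(rows_per_shard=[400, 100, 100, 100],
               requests_per_shard=[50, 50, 50, 50]))
```

The first case calls for the hotspot remedies (replicas, caching, micro-sharding); the second calls for fixing the distribution strategy.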

Cross-Shard Queries #

  • Joins across shards are expensive (scatter-gather)
  • Transactions across shards need a distributed commit protocol, typically 2PC (see also: Two-Phase Commit)
  • Denormalisation is often necessary to keep queries shard-local
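
A scatter-gather query can be sketched with in-memory lists standing in for shards. Everything here is hypothetical: `SHARDS`, the row layout, and `query_shard` (a stand-in for a per-shard "ORDER BY total DESC LIMIT n" query) exist only for the example.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory "shards": each holds (order_total, order_id) rows.
SHARDS = [
    [(120, "o1"), (15, "o2")],
    [(300, "o3"), (42, "o4")],
    [(77, "o5")],
]


def query_shard(rows, limit):
    """Stand-in for a per-shard top-N query."""
    return heapq.nlargest(limit, rows)


def top_orders(limit):
    # Scatter: every shard computes its local top-N in parallel.
    with ThreadPoolExecutor(max_workers=len(SHARDS)) as pool:
        partials = list(pool.map(lambda rows: query_shard(rows, limit), SHARDS))
    # Gather: merge the partial results and re-apply the limit.
    return heapq.nlargest(limit, (row for part in partials for row in part))


print(top_orders(2))  # [(300, 'o3'), (120, 'o1')]
```

Even with the shards queried in parallel, the caller waits for the slowest shard and pays for the merge, which is why the cost grows with the number of shards a query touches.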

Instinct #

Delay sharding as long as possible. A well-indexed, properly-tuned single PostgreSQL instance handles more than most people think (~50K queries/s, several TB). Try to solve the problem using caching/replication first before resorting to sharding.
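
A back-of-the-envelope check along these lines might look like the sketch below. It focuses on storage and writes because reads can be offloaded to replicas and caches. The node limits, the 70% headroom factor, and the workload figures are placeholder assumptions, not measurements; substitute numbers from your own benchmarks.

```python
def needs_sharding(data_tb, peak_writes_per_s,
                   node_capacity_tb=10, node_writes_per_s=20_000, headroom=0.7):
    """Return True if storage or write load exceeds the usable share of one node.

    node_capacity_tb, node_writes_per_s and headroom are hypothetical defaults.
    """
    over_storage = data_tb > node_capacity_tb * headroom
    over_writes = peak_writes_per_s > node_writes_per_s * headroom
    return over_storage or over_writes


# Hypothetical workloads.
print(needs_sharding(data_tb=3, peak_writes_per_s=5_000))    # fits on one node
print(needs_sharding(data_tb=3, peak_writes_per_s=30_000))   # write-bound
print(needs_sharding(data_tb=12, peak_writes_per_s=5_000))   # storage-bound
```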

When you must shard, choose the partition key based on your most common query pattern, accept that cross-shard queries will be slow, and plan for hot-partition mitigation from day one.

Framing #

Our hotspots are structurally determined by merchant size, not a hash function problem. We’re accepting the imbalance as a feature, and our solution is replica asymmetry and per-shard caching. Here’s the trade-off: we lose some operational simplicity but avoid constant resharding.

  • MISCONCEPTION: Treating sharding as the starting point for scaling. It is a late-stage measure: rely on read replicas and caching first, and only consider sharding when capacity math shows you’re hitting the ceiling of what a single machine can handle.
  • INTERVIEW: When to bring up sharding: when suggesting that a single DB won’t work AND replication and caching won’t solve the problem either.
  • CONVENTION: Technically “partitioning” refers to splitting data within a single DB instance, and “sharding” refers to splitting across multiple machines. In practice, people use them interchangeably.
  • PITFALL: Cross-shard queries explode latency. If a query requires data from multiple shards, you need scatter-gather, which defeats the purpose. Design the shard key so your most common queries hit a single shard.

References #

DDIA 2e Reference #

  • Chapter 6: Partitioning (this IS Chapter 6)