Data Modelling

Table of Contents

🔴 P0 — the first design decision that cascades into everything else: scaling, consistency, query performance

AI usage disclosure Claude Opus 4.6 · content-consolidation

Structured and consolidated from personal study notes using Claude for semantic comparison, merging of overlapping content, accuracy review, and editorial polish. All technical content originates from personal notes, DDIA 2e, and HelloInterview references.

Data modelling determines what queries are efficient, how the system scales, and where consistency boundaries lie. Getting it wrong means backtracking. Getting it right means every subsequent decision inherits good defaults.

For which database to choose, see Data Model Selection. This entry covers how to design schemas once the database type is chosen.

INTERVIEW: Data modelling appears in two phases of the delivery framework:
1. During requirements gathering: identify core entities (1:1 map with tables/collections)
2. During high-level design: schema alongside the DB component — key fields, relationships, indexes, query patterns
INTERVIEW: Expectations are lower than in a data-engineering interview. You don’t need complete schema diagrams or normal-form analysis. Demonstrate that your schema supports the access patterns from your API design.

The Design Process #

A layered approach, each step informed by the previous:

Step 1: Determine Database Type #

Based on workload characteristics and access patterns. See Data Model Selection.

PITFALL: Resist temptation to flex esoteric databases. It’s typically a relational DB (PostgreSQL) unless requirements clearly point elsewhere. Start boring; justify complexity.

Step 2: Define Core Entities #

Map functional requirements to tables/collections with clear identifiers and relationships.

users:    id (PK), username, email
posts:    id (PK), user_id (FK → users.id), content, created_at
comments: id (PK), post_id (FK → posts.id), user_id (FK → users.id), content
likes:    user_id (FK), post_id (FK) — composite PK

PITFALL: Use generated IDs (UUID or sequence), not meaningful fields like email. Decouple identity from business rules — emails change, IDs don’t.

Step 3: Define Relationships #

Type	Example	Enforcement
1:N	User → Posts	FK on the “many” side
N:M	Users ↔ Groups	Junction table with composite FK
1:1	User → Profile	FK with unique constraint

Referential integrity via FKs prevents orphaned records but adds verification overhead.
RULE OF THUMB: Consider tombstoning (soft delete) vs hard deletion. Manage cascades explicitly — implicit cascading deletes are a production footgun.

Step 4: Add Constraints #

NOT NULL for required fields. UNIQUE for natural keys (email, username). CHECK for domain invariants.

Step 5: Index for Access Patterns #

Connect indexes directly to API endpoints and query patterns.

Social media app:
  Index on posts.user_id              → GET /users/{id}/posts
  Index on posts.created_at           → GET /posts?sort=recent
  Composite index (user_id, created_at) → GET /users/{id}/posts?sort=recent

INTERVIEW: In interviews, connect your indexes to your endpoints explicitly: “The GET /users/{id}/posts endpoint needs an index on posts.user_id.” This shows you’re thinking about real query performance.
See Index Trade-offs (B-Tree vs LSM) for storage engine internals.

Step 6: Normalise, Then Selectively Denormalise #

Start normalised (no redundancy). Denormalise only when read performance demands it — and know the cost.

When to denormalise	Why
Analytics and reporting	Aggregation of infrequently-changing data
Event logs and audit trails	Snapshot capture at point-in-time
Read-heavy, consistency-tolerant systems	Speed over freshness (search engines, feeds)

INTERVIEW: Start clean and normalised. Denormalise when needed, with justification:

Since we need to load feeds quickly and likes can be eventually consistent, I’ll denormalise like counts into the posts table.

TRICK: Consider putting a cache in front with the “denormalised representation” before making schema changes. A Redis cache with pre-computed joins achieves the same read performance without the consistency burden of actual denormalisation.

Step 7: Consider Sharding #

Only when capacity math demands it. See Sharding Strategies.

RULE OF THUMB: Shard by primary access pattern. This choice is effectively permanent and affects every query — be deliberate.
PITFALL: Time-range sharding is usually a bad idea for growing tables: all new writes go to the most recent shard, creating a hotspot. Time-range partitioning is appropriate for archival/analytics workloads where recent data is read-heavy but writes are spread out.
MISCONCEPTION: “We need to shard because we have a lot of data.” A well-indexed, properly-tuned single PostgreSQL instance handles ~50K queries/s and several TB. Exhaust caching and read replicas before sharding.

DB-Specific Notes #

Relational Databases #

See Data Model Selection for PostgreSQL vs MySQL comparison.

Key capabilities for system design:

Arbitrary indexes (multi-column, partial, covering) for different query optimisations
ACID transactions for correctness-critical workloads (payments, inventory)
Complex queries handled well, but avoid reporting-style multi-join queries at scale — opt for denormalised forms, caching, or pre-computed results
INTERVIEW: When strong consistency is a non-functional requirement (payments, inventory that can’t oversell), reach for relational. Modern relational DBs scale better than people assume: read replicas, sharding, connection pooling, caching.

Document Databases #

For deeply nested, self-contained entities with flexible schemas. See Data Model Selection for detailed decision-axes.

INTERVIEW: System design interviews scope requirements clearly enough that the “evolving schema” argument for document DBs is weak. Consider primarily when data is deeply nested (many joins in SQL) or records have vastly different structures.

Auxiliary Persistence #

When the primary DB can’t efficiently support a specialised access pattern:

Need	Tool	Mechanism
Full-text search	Elasticsearch	Inverted indexes, tokenisation
Geospatial queries	PostGIS (Postgres ext.)	Spatial indexes
Geospatial (Redis)	Redis geospatial type	Geo-hashing

INSIGHT: Always check whether your existing Postgres instance can support the need (e.g. GIN indexes for full-text search) before adding a new system.
CHALLENGE: Auxiliary indexes typically sync via CDC (change data capture), introducing index lag from the primary DB.

References #

DDIA 2e Reference #

Chapter 2: Data Models and Query Languages
Chapter 3: Storage and Retrieval (indexes, B-Trees, LSM)
Chapter 6: Partitioning (sharding strategies and trade-offs)