Data Model Selection

Table of Contents

🔴 P0 — the first data decision in every design; determines query flexibility, scaling, and consistency

Problem #

Your data model determines what queries are efficient, how data scales, and what consistency guarantees are natural. The wrong model means fighting the database instead of using it.

Mechanism #

Model	Data shape	Query strength	Scaling pattern	Tool Example
Relational	Normalised, joins	Flexible (SQL)	Vertical, then shard	PostgreSQL, MySQL
Document	Nested, denormalised	By document/key	Horizontal (built-in)	MongoDB, DynamoDB
Wide-column	Sparse, column families	By row/column key	Horizontal (built-in)	Cassandra, HBase
Key-Value	Opaque blobs by key	By key only	Horizontal (trivial)	Redis, DynamoDB (simple)
Time-series	Timestamped metrics	Time-range queries	By time partitions	InfluxDB, TimescaleDB
Search	Inverted indexes	Full-text, faceted	Sharded by index	Elasticsearch, Solr

Relational #

Typically the common choice is mysql vs postgres, so here’s a small comparison of the two:

PostgreSQL vs MySQL #

Aspect	PostgreSQL	MySQL (InnoDB)	What it means
Concurrency model	MVCC (Multi-Version CC)	Row-level locks + undo logs	Postgres readers don’t block writers; MySQL readers may wait for write locks
Query optimizer	Sophisticated cost-based	Simpler heuristic-based	Postgres can handle complex queries better; MySQL can surprise you with suboptimal plans
Write path	MVCC creates versions, lazy cleanup (vacuum)	In-place updates with lock	Postgres trades storage bloat for read concurrency; MySQL trades complexity for tighter storage
Lock contention	Rare—MVCC means no readers blocking writers	Higher under write-heavy + read-heavy mixed load	For payment systems or high-concurrency scenarios, Postgres scales reads better

Postgres: MVCC trades storage bloat for better read concurrency; sophisticated optimizer for complex queries. Choose when: high read/write concurrency, complex queries, or semi-structured data.

gotchas
- VACUUM bloat — Unused versions accumulate if autovacuum isn’t tuned. Can cause unexpected performance cliffs.
- Transaction wraparound — Long-running transactions can freeze the database if not monitored
- WAL management — Misconfigured archiving can fill disks fast
- Connection pooling required — Postgres is heavier per-connection; you must use pgBouncer or similar
- Replication lag visibility — Easier to miss replication lag in Postgres; requires explicit monitoring

MySQL: Simpler locking, smaller operational surface. Choose when: simple OLTP, resource-constrained, or operational simplicity > feature richness.

gotchas
- Implicit type coercion — “5” = 5 is true. Can cause subtle bugs and security issues.
- Replication inconsistency — Async replication means replicas can lag indefinitely without warning
- Query optimizer surprises — Sometimes chooses terrible plans with no obvious reason
- InnoDB lock waits — Can be hard to debug which query locked which row
- Behavioural differences — SQL_MODE changes behaviour; defaults vary across versions

Document #

A pedestrian observation would be that the benefit of document-dbs is for the schema-flexibility and may wrongly attribute lack of mature requirements for the data-model to be the reason to pick document-dbs but that would be a contrived point. It’s contrived because it’s conflating schema-maturity with other pertinent deciding-factors.

The decision for what to use should follow some decision-axes:

Table 1: Example of when document-dbs make sense

Factor	Reasoning
Schema flexibility	“Flexibility is a tool; the question is whether polymorphic/sparse data is inherent to the domain”
Access patterns	“Our primary queries are document-scoped (user + all their settings in one fetch) → NoSQL is a better fit operationally”
Write characteristics	“High-cardinality writes, append-only, or time-series → NoSQL’s partition-transparent scaling beats relational’s normalized write paths”
Consistency model	“We can tolerate eventual consistency for user metadata but not for financial ledgers → different tools for different data”
Operational cost	“Relational’s operational complexity at scale (connection pooling, query optimization, sharding is painful) vs. NoSQL’s predictability”

We started with DynamoDB because our data was sparse and access patterns were document-oriented. As the system matured, we actually stayed with DynamoDB for the hot path (user sessions, settings) because the bottleneck was write concurrency, not complex queries. We added a relational database alongside it for the reporting and analytics path, where we needed joins and ad-hoc queries.

When to Use What #

Use case	Model	Why
Transactional OLTP	Relational	ACID, joins, complex queries
User profiles, product catalog	Document	Self-contained entities, flexible schema
IoT sensor data	Time-series	Optimised for time-range inserts/queries
Session/cache data	Key-Value	Simple lookups, high throughput
Full-text search, logs	Search DB	Inverted index, relevance scoring
Social graph, recommendations	Graph	Traversal queries (friends-of-friends)

Instinct #

Start with PostgreSQL. It handles relational workloads, JSON documents (jsonb), and even basic time-series. Only reach for specialised databases when Postgres demonstrably can’t handle the access pattern. This is the “boring technology” principle: complexity must be justified.

INTERVIEW: The layered decision-making process for data modelling is:
1. Workload classification (OLTP vs OLAP): determines throughput requirements, consistency model, scalability approach
2. Access pattern analysis: what queries are common? What relationships matter? Exact lookups vs range scans?
3. DB model choice: relational, document, K-V, time-series — based on which model efficiently serves the identified access patterns
4. Schema design: normalisation vs denormalisation based on read/write patterns, indexing, partitioning
PITFALL: Don’t conflate “schema flexibility” with “immature requirements.” The case for document DBs should be grounded in access patterns and data shape, not schema uncertainty.

Framing #

comparing pg vs mysql #

For the payment ledger (simple inserts + lookups by ID), we chose MySQL—the straightforward locking model is predictable, and we don’t need Postgres’s query optimizer. For the analytics dashboard (complex aggregations, evolving schemas), we replicated to Postgres. The operational cost is higher, but the query performance difference justifies it. We’re accepting the complexity of running two databases because each is optimized for its access pattern.

References #

Choose Boring Technology — Dan McKinley
PostgreSQL JSON Types — Postgres as a document store

DDIA 2e Reference #

Chapter 2: Data Models and Query Languages
Chapter 3: Storage and Retrieval (how models are implemented)