- rtshkmr's digital garden/
- References/
- Architecture Design Basics/
- Pattern Taxonomy/
- Data Storage & Retrieval/
- Data Model Selection/
Data Model Selection
Table of Contents
π΄ P0 — the first data decision in every design; determines query flexibility, scaling, and consistency
Problem #
Your data model determines what queries are efficient, how data scales, and what consistency guarantees are natural. The wrong model means fighting the database instead of using it.
Mechanism #
| Model | Data shape | Query strength | Scaling pattern | Tool Example |
|---|---|---|---|---|
| Relational | Normalised, joins | Flexible (SQL) | Vertical, then shard | PostgreSQL, MySQL |
| Document | Nested, denormalised | By document/key | Horizontal (built-in) | MongoDB, DynamoDB |
| Wide-column | Sparse, column families | By row/column key | Horizontal (built-in) | Cassandra, HBase |
| Key-Value | Opaque blobs by key | By key only | Horizontal (trivial) | Redis, DynamoDB (simple) |
| Time-series | Timestamped metrics | Time-range queries | By time partitions | InfluxDB, TimescaleDB |
| Search | Inverted indexes | Full-text, faceted | Sharded by index | Elasticsearch, Solr |
Relational #
Typically the common choice is mysql vs postgres, so here’s a small comparison of the two:
PostgreSQL vs MySQL #
| Aspect | PostgreSQL | MySQL (InnoDB) | What it means |
|---|---|---|---|
| Concurrency model | MVCC (Multi-Version CC) | Row-level locks + undo logs | Postgres readers don’t block writers; MySQL readers may wait for write locks |
| Query optimizer | Sophisticated cost-based | Simpler heuristic-based | Postgres can handle complex queries better; MySQL can surprise you with suboptimal plans |
| Write path | MVCC creates versions, lazy cleanup (vacuum) | In-place updates with lock | Postgres trades storage bloat for read concurrency; MySQL trades complexity for tighter storage |
| Lock contention | RareβMVCC means no readers blocking writers | Higher under write-heavy + read-heavy mixed load | For payment systems or high-concurrency scenarios, Postgres scales reads better |
Postgres: MVCC trades storage bloat for better read concurrency; sophisticated optimizer for complex queries. Choose when: high read/write concurrency, complex queries, or semi-structured data.
- gotchas
- VACUUM bloat β Unused versions accumulate if autovacuum isn’t tuned. Can cause unexpected performance cliffs.
- Transaction wraparound β Long-running transactions can freeze the database if not monitored
- WAL management β Misconfigured archiving can fill disks fast
- Connection pooling required β Postgres is heavier per-connection; you must use pgBouncer or similar
- Replication lag visibility β Easier to miss replication lag in Postgres; requires explicit monitoring
MySQL: Simpler locking, smaller operational surface. Choose when: simple OLTP, resource-constrained, or operational simplicity > feature richness.
- gotchas
- Implicit type coercion β “5” = 5 is true. Can cause subtle bugs and security issues.
- Replication inconsistency β Async replication means replicas can lag indefinitely without warning
- Query optimizer surprises β Sometimes chooses terrible plans with no obvious reason
- InnoDB lock waits β Can be hard to debug which query locked which row
- Behavioural differences β SQL_MODE changes behaviour; defaults vary across versions
Document #
A pedestrian observation would be that the benefit of document-dbs is for the schema-flexibility and may wrongly attribute lack of mature requirements for the data-model to be the reason to pick document-dbs but that would be a contrived point. It’s contrived because it’s conflating schema-maturity with other pertinent deciding-factors.
The decision for what to use should follow some decision-axes:
| Factor | Reasoning |
|---|---|
| Schema flexibility | “Flexibility is a tool; the question is whether polymorphic/sparse data is inherent to the domain” |
| Access patterns | “Our primary queries are document-scoped (user + all their settings in one fetch) β NoSQL is a better fit operationally” |
| Write characteristics | “High-cardinality writes, append-only, or time-series β NoSQL’s partition-transparent scaling beats relational’s normalized write paths” |
| Consistency model | “We can tolerate eventual consistency for user metadata but not for financial ledgers β different tools for different data” |
| Operational cost | “Relational’s operational complexity at scale (connection pooling, query optimization, sharding is painful) vs. NoSQL’s predictability” |
We started with DynamoDB because our data was sparse and access patterns were document-oriented. As the system matured, we actually stayed with DynamoDB for the hot path (user sessions, settings) because the bottleneck was write concurrency, not complex queries. We added a relational database alongside it for the reporting and analytics path, where we needed joins and ad-hoc queries.
When to Use What #
| Use case | Model | Why |
|---|---|---|
| Transactional OLTP | Relational | ACID, joins, complex queries |
| User profiles, product catalog | Document | Self-contained entities, flexible schema |
| IoT sensor data | Time-series | Optimised for time-range inserts/queries |
| Session/cache data | Key-Value | Simple lookups, high throughput |
| Full-text search, logs | Search DB | Inverted index, relevance scoring |
| Social graph, recommendations | Graph | Traversal queries (friends-of-friends) |
Instinct #
Start with PostgreSQL. It handles relational workloads, JSON documents (jsonb), and even basic time-series. Only reach for specialised databases when Postgres demonstrably can’t handle the access pattern. This is the “boring technology” principle: complexity must be justified.
- INTERVIEW: The layered decision-making process for data modelling is:
- Workload classification (OLTP vs OLAP): determines throughput requirements, consistency model, scalability approach
- Access pattern analysis: what queries are common? What relationships matter? Exact lookups vs range scans?
- DB model choice: relational, document, K-V, time-series — based on which model efficiently serves the identified access patterns
- Schema design: normalisation vs denormalisation based on read/write patterns, indexing, partitioning
- PITFALL: Don’t conflate “schema flexibility” with “immature requirements.” The case for document DBs should be grounded in access patterns and data shape, not schema uncertainty.
Framing #
comparing pg vs mysql #
For the payment ledger (simple inserts + lookups by ID), we chose MySQLβthe straightforward locking model is predictable, and we don’t need Postgres’s query optimizer. For the analytics dashboard (complex aggregations, evolving schemas), we replicated to Postgres. The operational cost is higher, but the query performance difference justifies it. We’re accepting the complexity of running two databases because each is optimized for its access pattern.
References #
- Choose Boring Technology β Dan McKinley
- PostgreSQL JSON Types β Postgres as a document store
DDIA 2e Reference #
- Chapter 2: Data Models and Query Languages
- Chapter 3: Storage and Retrieval (how models are implemented)