Chapter 3. Storage and Retrieval
DB’s role is fundamentally just storage and retrieval
This chapter cares about how the db actually does the storage and retrieval internally to help us pick the right tools (storage engine) for the right job.
storage engines for transaction processing vs analytics differ significantly.
Data Structures That Power Your Database #
Some categories that are more common:
- log-structured storage engines
  - log is append-only
  - "log" here is used in a more general sense: an append-only sequence of records
  - doesn't necessarily have to be human-readable, could just be binary blobs for machines to read
- page-oriented storage engines
indices
- are derived from primary data
- typically incur write overheads because the index structure needs to be updated on every write \(\implies\) classic read/write tradeoff
- we need indexes to improve querying; this chapter is about different indexing structures and how they fare against each other
Hash Indexes #
Similar to dictionaries: a hash table maps each key to a byte offset in the data file \(\implies\) in-mem hash table
In logs, we can do compaction to throw away older, overwritten values, keeping only the latest value for each key
We can split the log into multiple segments, each with its own in-memory hash table mapping keys to file offsets. This gives a segment hierarchy: check the freshest segment's hash table first, then fall back to older segments if the key isn't found.
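The idea above can be sketched as a toy single-segment store; `HashIndexLog` and its CSV-like record format are my own illustration, not Bitcask's actual layout:

```python
import os
import tempfile

class HashIndexLog:
    """Append-only log file with an in-memory hash index (key -> byte offset)."""

    def __init__(self, path):
        self.path = path
        self.index = {}                 # key -> offset of that key's latest record
        self.offset = 0                 # tracked manually; next append position
        open(path, "wb").close()        # start a fresh segment file

    def set(self, key, value):
        record = f"{key},{value}\n".encode("utf-8")
        with open(self.path, "ab") as f:
            f.write(record)             # writes only ever append
        self.index[key] = self.offset   # newer writes shadow older offsets
        self.offset += len(record)

    def get(self, key):
        if key not in self.index:
            return None
        with open(self.path, "rb") as f:
            f.seek(self.index[key])     # one seek, one read: no scanning
            _k, _, v = f.readline().decode("utf-8").rstrip("\n").partition(",")
            return v

# usage: overwriting "a" leaves the old record in the file but re-points the index
path = os.path.join(tempfile.mkdtemp(), "segment.log")
log = HashIndexLog(path)
log.set("a", "1")
log.set("b", "2")
log.set("a", "3")
```

Compaction would copy only the offsets still referenced by `index` into a new segment file.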
Some specifics:
file formats: faster to use binary formats than csv
deletion: typically easier to tombstone things
crash recovery: in-mem info is lost on crashes, so needs to be restored
- possible to restore from snapshots
partial writes & corruption
- can include checksums, so partially written records can be detected and discarded
concurrency control
- log is append only, sequential order can be preserved using a single writer thread.
Value of append only logs
appending & sequence merging are sequential writes \(\implies\) faster than random writes
sequential writes are preferred on SSDs also
append only helps make concurrency and crash recovery simpler also
merging old segments avoids data fragmentation problems over time
Problems of hash table index
- memory limits
- range queries not efficient
SSTables and LSM-Trees #
RocksDB and LevelDB are some examples of storage engines built on this approach
to the append-only constraint we apply new ones:
the sequence of key-val pairs is sorted by key
each key appears once within each merged segment file (compaction will ensure this)
hence Sorted String Table \(\implies\) SSTable
advantages:
merging segments is simpler, even if files > memory
sorted behaviour is preserved automatically, it’s stable
searching for a key within the file doesn't need a full key index; since keys are sorted, we can jump to a nearby known key and scan from there
it becomes more like a skip-list
still need some sparse in-mem index though for the key offsets for some of them
ranges can be block-compressed for disk-storage
the sparse in-memory index then becomes pointers to the compressed blocks
compression helps reduce I/O bandwidth use also.
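Merging key-sorted segments is essentially the merge step of mergesort; a sketch, assuming each segment is a key-sorted list of pairs and earlier arguments are newer segments:

```python
import heapq

def _tag(segment, rank):
    # attach the segment's rank so heapq.merge breaks key ties toward newer data
    for key, value in segment:
        yield (key, rank), value

def merge_segments(*segments):
    """Merge key-sorted (key, value) segments into one compacted segment.
    Earlier arguments are newer, so their values win for duplicate keys."""
    merged = []
    for (key, _rank), value in heapq.merge(*(_tag(s, r) for r, s in enumerate(segments))):
        if not merged or merged[-1][0] != key:
            merged.append((key, value))  # first occurrence comes from the newest segment
    return merged

# usage: "apple" appears in both segments; the newer segment's value survives
newer = [("apple", "v2"), ("cherry", "v9")]
older = [("apple", "v1"), ("banana", "v5")]
merged = merge_segments(newer, older)
```

Because `heapq.merge` streams its inputs, this works even when the segment files are bigger than memory.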
datastructures
basically tree structures; the in-mem trees are called memtables
- AVL trees
- red-black trees
can maintain write-logs so that crashes don’t cause data loss on the recent memtable writes
this can be solved by doing snapshots of the memtable
LSMs (Log-Structured Merge-Tree)
basically a cascade of SSTables
there’s entire filesystems and storage engines for this
term dictionaries and full text search
the indexing is not the same but has similar ideas
Lucene is an indexing engine that works this way.
optimisations:
- missing keys will be costly to search for \(\implies\) better to pair the LSM-tree with Bloom filters that do an existence check first, before going to disk.
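A sketch of the Bloom-filter idea (bit-array size and hash choices here are arbitrary illustrations): a "no" answer is definitive, a "yes" is only probable, so the filter is consulted before any disk read.

```python
import hashlib

class BloomFilter:
    """Probabilistic set membership: 'definitely absent' answers skip the disk read."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0                       # a big int used as a bit array

    def _positions(self, key):
        # derive num_hashes independent positions from one cryptographic hash
        for i in range(self.num_hashes):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # False means the key is certainly not in any segment
        return all(self.bits & (1 << pos) for pos in self._positions(key))

# usage: the filter is populated as keys are written to the LSM-tree
bf = BloomFilter()
for key in ("handbag", "handful"):
    bf.add(key)
```

False positives just cost a wasted disk read; false negatives never happen.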
B-Trees #
most widely used indexing structure
characteristics
k-v pairs are sorted by key \(\implies\) efficient lookups and range queries
tree is balanced, depth is always \(O(\log n)\) for n keys
it’s page-structured (fixed-sized blocks),
it is NOT segment structured like the SSTables
we need to care about B-tree branching factor
this is the number of refs to child pages in one page of the B-tree
the branching factor depends on the amount of space required to store the page references and the range boundaries, but typically it is several hundred.
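The large branching factor is what keeps B-trees shallow; a quick integer-arithmetic sanity check (a rough model that ignores page fill factors):

```python
def btree_depth(num_keys, branching_factor):
    """Smallest depth d such that branching_factor ** d >= num_keys,
    i.e. how many page levels a lookup has to traverse."""
    depth = 1
    while branching_factor ** depth < num_keys:
        depth += 1
    return depth

# with a branching factor of 500, four levels already cover 500**4
# = 62.5 billion keys, so lookups touch only a handful of pages
levels = btree_depth(500 ** 4, 500)
```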
reliability of b-trees
writes overwrite pages in place; it is NOT append-only
concern: if too many pages to overwrite then we need to guard against crashes during partial write
i.e. there’s a need to prevent orphan pages
that’s why there’s a WAL (write-ahead-log) typically. Can be used as a redo-log also. It’s append-only store that can be used to restore crashed / corrupted states.
concern: concurrency management
typically done using latches (lightweight locks)
optimisations
copy-on-write schemes instead of pairing it up with a WAL
useful for concurrency control also
e.g. LMDB
Comparing B-Trees and LSM-Trees #
following are some relevant dimensions to compare indices by
maturity
- b-trees are more mature
which is generally faster?
- LSM-trees faster for writes
- b-trees faster for reads
this is a rule of thumb: test systems with your particular workload in order to make a valid comparison; empirical testing is always needed to determine which storage engine is better for your use case
write amplification: there are reasons for repeated writes, this is where the amplification comes from
for SSTables, multiple writes may be from the compaction and merging of the SSTables
for B-trees, there’s at least 2 writes (one for the WAL and the actual page-write)
why we should care
matters if you’re using SSDs
write amplification affects the lifetimes of SSDs because SSDs have bounded overwrites \(\implies\) matters more for SSDs
in write-heavy workloads, the performance bottleneck may be write amplification itself
LSM-trees typically can sustain higher throughput than B-trees.
predictability of resource allocation & need for resource monitoring
compaction may block writes if it’s expensive – resource constraints
For LSM-trees, there's a need to monitor for resource allocation issues. If compaction is misconfigured, there may be too many unmerged segments and we may run out of disk space. Typically there's no throttling either, so resources need to be monitored explicitly.
B-trees are more predictable. Each key is in exactly one place in the index \(\implies\) B-trees give strong transactional semantics. We can lock ranges of keys for transaction isolation and the locks can be directly attached to the tree.
Other Indexing Structures #
Both B-trees and log-structured indices are good for secondary indexes.
How indexes store values:
can store the value directly \(\implies\) clustered index
this avoids the extra hop from index to the heapfile, possibly giving faster reads
need additional storage, can add write-overheads, more effort to maintain transactional guarantees
can store a reference to the row (nonclustered index). Typically references a heapfile; useful when there are multiple secondary indexes.
- helps keeps updates efficient
hybrid approach: covering index, where some of the table's columns are stored within the index directly
“covering” because it can cover some queries by answering them using the index alone.
Multi-column Index #
Intent: When we wish to query multiple columns of a table (or multiple fields in a document) at the same time.
Types:
concatenated index fields are literally concatenated, sort order maintained
multi-dimensional index
more general way of querying several columns at once e.g. for geospatial data
useful for multi-dim range queries e.g.
SELECT * FROM restaurants WHERE latitude > 51.4946 AND latitude < 51.5079
    AND longitude > -0.1162 AND longitude < -0.1004;
How to do this?
translate 2D into a single number using a space-filling curve, then use a regular B-tree index
use R-trees
e.g. PostGIS using R-trees
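A minimal Morton (Z-order) space-filling-curve sketch; a real geo index would first quantize latitude/longitude to non-negative integers, which is omitted here:

```python
def z_order(x, y, bits=16):
    """Interleave the bits of x and y into one Morton (Z-order) code,
    so nearby 2-D points tend to map to nearby 1-D keys for a B-tree."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)       # x's bits land in even positions
        code |= ((y >> i) & 1) << (2 * i + 1)   # y's bits land in odd positions
    return code

# usage: the four corners of the unit square trace the "Z" shape 0,1,2,3
codes = [z_order(x, y) for x, y in [(0, 0), (1, 0), (0, 1), (1, 1)]]
```

A 2-D range query then becomes a small set of 1-D key ranges over the Morton codes.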
Full-text search, Fuzzy Indexes #
Fuzzy techniques are useful when we are doing similarity-based searches
Edit-distance-based approaches work here. Lucene uses it.
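A standard Levenshtein-distance dynamic program, as a sketch of the edit-distance idea (not Lucene's automaton-based implementation):

```python
def edit_distance(a, b):
    """Minimum number of single-character insertions, deletions, and
    substitutions turning string a into string b (Levenshtein distance)."""
    prev = list(range(len(b) + 1))          # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]
```

A fuzzy search then accepts any indexed term whose distance to the query term is below a small threshold.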
In-Memory DBs #
can be for caching only, where data loss on restart is acceptable
can also have special hardware for durability purposes
the actual reason they are fast is because there’s no need to encode in-memory datastructures in a form that can be written to the disk
it’s counterintuitive because the performance benefits are NOT because there’s no need to read from disks.
Transaction Processing or Analytics? #
background on transactions
transactions have their origins in commercial transactions
the basic access patterns for other domains were similar to processing business transactions
group of reads and writes as a logical unit == transactions
not necessarily ACID transactions
analytics access patterns are different
a separate analytics db is the data warehouse itself
comparison:
| Property | Transaction processing systems (OLTP) | Analytic systems (OLAP) |
| --- | --- | --- |
| Main read pattern | Small number of records per query, fetched by key | Aggregate over large number of records |
| Main write pattern | Random-access, low-latency writes from user input | Bulk import (ETL) or continuous event stream |
| Primarily used by | End user/customer, via web or mobile application | Internal analyst, for reporting and decision support |
| What data represents | Latest state of data (current point in time) | History of events that happened over time |
| Typical dataset size | Gigabytes to terabytes | Terabytes to petabytes |
Data Warehousing #
- less diversity of data models for warehousing, it’s formulaic
- e.g. star schema / dimensional modelling
- Benefits of keeping a separate warehouse
- allows R/O copy of different OLTP systems’ data
- allows optimisations for analytical queries and actions
- exploratory actions: drill-downs, slicing, dicing
- ETL processes guide the entry of data into the warehouse
- data can be inserted periodically or as a stream
- examples of open source engines (SQL-on-hadoop projects): Apache Hive, Spark SQL, Cloudera Impala, Facebook Presto, Apache Tajo, and Apache Drill
Stars and Snowflakes: Schemas for Analytics #
Star Schema:
- terms:
- fact tables Captures events as rows
- dimension tables:
referenced from the fact table; gives info about a particular dimension of the event
- it can be a good idea to have a dim table for datetime, to add more info about it
- attributes : just primitive information
- consider the case where individual events are rows in a fact table, that correlate to dimension tables (that give more information about a particular dimension).
Snowflake Schema:
- variation of the classic star schema. Dims may have sub-dims, so it extends out like a snowflake
- more normalised than star schemas, but star schemas are often chosen because of their simplicity for analysts to work with
tables are typically very wide, fact tables with many columns, dims can also end up being very wide.
Fact tables can be expected to be in the order of petabytes and dim tables can be expected to be in the order of millions of rows (smaller footprint)
Column-Oriented Storage #
Compared to OLTP DBs, where storage is row-oriented (values from one row are co-located): analytical queries touch only a subset of columns, so row storage has large overheads, since whole wide rows must be loaded into memory and then parsed and filtered to satisfy the query.
A better approach for analytical queries would be to co-locate values from the same column together.
Applies for both relational and document data as well.
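The layout difference can be shown with plain Python lists; the table and its columns are made up for illustration:

```python
# Row-oriented: each row's values are stored together.
rows = [
    {"date": "2024-01-01", "product": "apple", "quantity": 3},
    {"date": "2024-01-01", "product": "pear",  "quantity": 1},
    {"date": "2024-01-02", "product": "apple", "quantity": 5},
]

# Column-oriented: each column's values are stored together,
# with the implicit position associating values of the same row.
columns = {
    name: [row[name] for row in rows]
    for name in ("date", "product", "quantity")
}

# an analytical query (sum of quantities) now touches exactly one column;
# the other columns never need to be loaded or parsed
total = sum(columns["quantity"])
```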
Column Compression #
- data within columns can show characteristics that makes it easy to apply some common compression formats:
- when there’s repetitive data in the columns:
- can consider bitmap encoding, esp when there’s few distinct values \(\implies\) bitmap indexing
- can consider run-length encoding, when the bits within the bitmaps are relatively sparse (usually when the variety of distinct values is high)
- there’s a bunch of other compression schemes as well
- NOTE: Column families, e.g. in Cassandra, borrow from the Bigtable model, which is still row-oriented. Shouldn't be thought of as column-oriented.
- when there’s repetitive data in the columns:
memory bandwidth and vectorized processing #
Some bottlenecks when scanning millions of rows:
- bandwidth from getting data from disk into memory
- bandwidth from main memory into the CPU cache; also using the CPU pipeline well, e.g. avoiding branch mispredictions
Ways to be efficient:
- reduce the volume of data needed to be loaded from disk
- use CPU Cycles efficiently
- e.g. can design operators to operate on chunks of compressed column data directly \(\implies\) vectorized processing
Sort Order in Column Storage #
order doesn’t matter, can just do insertion order
we can establish order and use it as indexing mechanism
sorting is w.r.t. whole rows, so we can't sort each column independently: the implicit row offset is what associates values across columns
sorting is done by rows, has to make sense based on common queries e.g. if date-range queries, then first sort key should be along the date column
sorting can help with:
- faster querying and optimisations
- for grouping of related info on storage (colocation)
- compression of columns
- suppose there’s sparse deviation of values, then even billion rows’ worth of sort index can be compressed down to a few KB’s.
- compression effect is strongest on the first sort key
Several different sort orders #
- can store the same data in several different ways for different queries
- having multiple sort orders in a column-oriented store is a bit similar to having multiple secondary indexes in a row-oriented store
Writing to Column-Oriented Storage #
- the typical workload is large read-only queries, so we optimise for those; this makes writes more complex
- typically they’d do something like LSM trees, rely on some in-memory representation as well for the writes and then on filled buffers actually write over
this means that queries need to examine both the column data on disk and the recent writes in memory, then combine the two.
this is handled by the query optimiser without letting the user worry about it
Aggregation: Data Cubes and Materialized Views #
many of the queries rely on materialized aggregates, so it makes sense to cache some of these
materialized views in column stores are NOT virtual, they actually copy the query results to the disk,
it’s a denormalised copy of the data \(\implies\) needs updates if the underlying data changes
comparison: in relational data models, this is typically virtual, it’s more of a shortcut to writing queries.
in read-heavy cases like OLAP, materialized views make sense: reads get cheaper, at the cost of extra work on writes to keep the views up to date
data cube is an example of a materialized view
it’s actually more of a hypercube (n-dimensional) that holds aggregates along the different dimensions that exist.
the core principle: each cell contains the sales for a particular date-product-store-promotion-customer combination. These values can then repeatedly be summarized along each of the dimensions.
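The principle can be sketched as a toy data cube: pre-aggregating one measure over every combination of dimensions, so later queries become dictionary lookups (dimension and measure names here are made up):

```python
from collections import defaultdict
from itertools import combinations

def build_cube(facts, dimensions, measure):
    """Pre-aggregate the measure over every subset of the dimensions.
    Key () holds the grand total; longer keys hold finer-grained slices."""
    cube = defaultdict(float)
    for fact in facts:
        for r in range(len(dimensions) + 1):
            for dims in combinations(dimensions, r):
                key = tuple((d, fact[d]) for d in dims)
                cube[key] += fact[measure]
    return dict(cube)

# usage: three fact rows, two dimensions, one measure
facts = [
    {"date": "2024-01-01", "product": "apple", "net_price": 2.0},
    {"date": "2024-01-01", "product": "pear",  "net_price": 3.0},
    {"date": "2024-01-02", "product": "apple", "net_price": 4.0},
]
cube = build_cube(facts, ("date", "product"), "net_price")
```

A query like "total sales of apples" no longer scans the fact table at all; it reads one pre-computed cell.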
performance benefits:
- this is essentially pre-computation as a form of speeding up analytical queries.
problems:
- less flexibility than being able to query the raw data
Most data warehouses therefore try to keep as much raw data as possible, and use aggregates such as data cubes only as a performance boost for certain queries.