Chapter 3. Storage and Retrieval
DB’s role is fundamentally just storage and retrieval
This chapter cares about how the db actually does the storage and retrieval internally to help us pick the right tools (storage engine) for the right job.
storage engines for transaction processing vs analytics differ significantly.
Data Structures That Power Your Database #
Some categories that are more common:
- log-structured storage engines
  - log is append-only
  - "log" here is used in a more general sense: an append-only sequence of records
  - doesn't necessarily have to be human-readable, could just be binary blobs for machines to read
- page-oriented storage engines
indices
- are derived from primary data
- typically incur write overheads because the index structure needs to be updated on every write \(\implies\) classic read/write tradeoff
- we need indexes to improve querying; this chapter is about different indexing structures and how they fare against each other
Hash Indexes #
Similar to dictionaries: a hash table maps each key to a byte offset in the data file \(\implies\) in-mem hash table
In logs, we can do compaction to throw away older, overwritten values, keeping only the latest value for each key
We can split the log into multiple segments, each with its own in-memory hash table mapping keys to file offsets. This gives a segment hierarchy: check the freshest segment's hash table first, then fall back to older segments if the key isn't found.
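The idea above can be sketched as a toy single-segment store; `HashIndexLog` and its CSV-like record format are my own illustration, not Bitcask's actual layout:

```python
import os
import tempfile

class HashIndexLog:
    """Append-only log file with an in-memory hash index (key -> byte offset)."""

    def __init__(self, path):
        self.path = path
        self.index = {}                 # key -> offset of that key's latest record
        self.offset = 0                 # tracked manually; next append position
        open(path, "wb").close()        # start a fresh segment file

    def set(self, key, value):
        record = f"{key},{value}\n".encode("utf-8")
        with open(self.path, "ab") as f:
            f.write(record)             # writes only ever append
        self.index[key] = self.offset   # newer writes shadow older offsets
        self.offset += len(record)

    def get(self, key):
        if key not in self.index:
            return None
        with open(self.path, "rb") as f:
            f.seek(self.index[key])     # one seek, one read: no scanning
            _k, _, v = f.readline().decode("utf-8").rstrip("\n").partition(",")
            return v

# usage: overwriting "a" leaves the old record in the file but re-points the index
path = os.path.join(tempfile.mkdtemp(), "segment.log")
log = HashIndexLog(path)
log.set("a", "1")
log.set("b", "2")
log.set("a", "3")
```

Compaction would copy only the offsets still referenced by `index` into a new segment file.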
Some specifics:
file formats: faster to use binary formats than csv
deletion: typically easier to tombstone things
crash recovery: in-mem info is lost on crashes, so needs to be restored
- possible to restore from snapshots
partial writes & corruption
- can include checksums, so partially written records can be detected and discarded
concurrency control
- log is append only, sequential order can be preserved using a single writer thread.
Value of append only logs
appending & sequence merging are sequential writes \(\implies\) faster than random writes
sequential writes are preferred on SSDs also
append only helps make concurrency and crash recovery simpler also
merging old segments avoids data fragmentation problems over time
Problems of hash table index
- memory limits
- range queries not efficient
SSTables and LSM-Trees #
RocksDB and LevelDB are some examples of storage engines built on this approach
to the append-only constraint we apply new ones:
the sequence of key-val pairs is sorted by key
each key appears once within each merged segment file (compaction will ensure this)
hence Sorted String Table \(\implies\) SSTable
advantages:
merging segments is simpler, even if files > memory
sorted behaviour is preserved automatically, it’s stable
searching for a key within the file doesn't need a full key index; since keys are sorted, we can jump to a nearby known key and scan from there
it becomes more like a skip-list
still need some sparse in-mem index though for the key offsets for some of them
ranges can be block-compressed for disk-storage
the sparse in-memory index then becomes pointers to the compressed blocks
compression helps reduce I/O bandwidth use also.
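Merging key-sorted segments is essentially the merge step of mergesort; a sketch, assuming each segment is a key-sorted list of pairs and earlier arguments are newer segments:

```python
import heapq

def _tag(segment, rank):
    # attach the segment's rank so heapq.merge breaks key ties toward newer data
    for key, value in segment:
        yield (key, rank), value

def merge_segments(*segments):
    """Merge key-sorted (key, value) segments into one compacted segment.
    Earlier arguments are newer, so their values win for duplicate keys."""
    merged = []
    for (key, _rank), value in heapq.merge(*(_tag(s, r) for r, s in enumerate(segments))):
        if not merged or merged[-1][0] != key:
            merged.append((key, value))  # first occurrence comes from the newest segment
    return merged

# usage: "apple" appears in both segments; the newer segment's value survives
newer = [("apple", "v2"), ("cherry", "v9")]
older = [("apple", "v1"), ("banana", "v5")]
merged = merge_segments(newer, older)
```

Because `heapq.merge` streams its inputs, this works even when the segment files are bigger than memory.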
datastructures
basically tree structures; the in-mem trees are called memtables
- AVL trees
- red-black trees
can maintain write-logs so that crashes don’t cause data loss on the recent memtable writes
this can be solved by doing snapshots of the memtable
LSMs (Log-Structured Merge-Tree)
basically a cascade of SSTables
there’s entire filesystems and storage engines for this
term dictionaries and full text search
the indexing is not the same but has similar ideas
Lucene is an indexing engine that works this way.
optimisations:
- missing keys will be costly to search for \(\implies\) better to pair the LSM-tree with Bloom filters that do an existence check first, before going to disk.
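A sketch of the Bloom-filter idea (bit-array size and hash choices here are arbitrary illustrations): a "no" answer is definitive, a "yes" is only probable, so the filter is consulted before any disk read.

```python
import hashlib

class BloomFilter:
    """Probabilistic set membership: 'definitely absent' answers skip the disk read."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0                       # a big int used as a bit array

    def _positions(self, key):
        # derive num_hashes independent positions from one cryptographic hash
        for i in range(self.num_hashes):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # False means the key is certainly not in any segment
        return all(self.bits & (1 << pos) for pos in self._positions(key))

# usage: the filter is populated as keys are written to the LSM-tree
bf = BloomFilter()
for key in ("handbag", "handful"):
    bf.add(key)
```

False positives just cost a wasted disk read; false negatives never happen.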
B-Trees #
most widely used indexing structure
characteristics
k-v pairs are sorted by key \(\implies\) efficient lookups and range queries
tree is balanced, depth is always \(O(\log n)\) for n keys
it’s page-structured (fixed-sized blocks),
it is NOT segment structured like the SSTables
we need to care about B-tree branching factor
this is the number of refs to child pages in one page of the B-tree
the branching factor depends on the amount of space required to store the page references and the range boundaries, but typically it is several hundred.
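The large branching factor is what keeps B-trees shallow; a quick integer-arithmetic sanity check (a rough model that ignores page fill factors):

```python
def btree_depth(num_keys, branching_factor):
    """Smallest depth d such that branching_factor ** d >= num_keys,
    i.e. how many page levels a lookup has to traverse."""
    depth = 1
    while branching_factor ** depth < num_keys:
        depth += 1
    return depth

# with a branching factor of 500, four levels already cover 500**4
# = 62.5 billion keys, so lookups touch only a handful of pages
levels = btree_depth(500 ** 4, 500)
```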
reliability of b-trees
writes overwrite pages in place; it is NOT append-only
concern: if too many pages to overwrite then we need to guard against crashes during partial write
i.e. there’s a need to prevent orphan pages
that’s why there’s a WAL (write-ahead-log) typically. Can be used as a redo-log also. It’s append-only store that can be used to restore crashed / corrupted states.
concern: concurrency management
typically done using latches (lightweight locks)
optimisations
copy-on-write schemes instead of pairing it up with a WAL
useful for concurrency control also
e.g. LMDB
Comparing B-Trees and LSM-Trees #
following are some relevant dimensions to compare indices by
maturity
- b-trees are more mature
which is generally faster?
- LSM-trees faster for writes
- b-trees faster for reads
this is a rule of thumb: test systems with your particular workload in order to make a valid comparison; empirical testing is always needed to determine which storage engine is better for your use case
write amplification: there are reasons for repeated writes, this is where the amplification comes from
for SSTables, multiple writes may be from the compaction and merging of the SSTables
for B-trees, there’s at least 2 writes (one for the WAL and the actual page-write)
why we should care
matters if you’re using SSDs
write amplification affects the lifetimes of SSDs because SSDs have bounded overwrites \(\implies\) matters more for SSDs
in write-heavy workloads, the performance bottleneck may be write amplification itself
LSM-trees typically can sustain higher throughput than B-trees.
predictability of resource allocation & need for resource monitoring
compaction may block writes if it’s expensive – resource constraints
For LSM-trees, there's a need to monitor for resource allocation issues. If compaction is misconfigured, there may be too many unmerged segments and we may run out of disk space. Typically there's no throttling either, so resources need to be monitored explicitly.
B-trees are more predictable. Each key is in exactly one place in the index \(\implies\) B-trees give strong transactional semantics. We can lock ranges of keys for transaction isolation and the locks can be directly attached to the tree.
Other Indexing Structures #
Both B-trees and log-structured indices are good for secondary indexes.
How indexes store values:
can store the value directly \(\implies\) clustered index
this avoids the extra hop from index to the heapfile, possibly giving faster reads
need additional storage, can add write-overheads, more effort to maintain transactional guarantees
can store a reference to the row (nonclustered index). Typically references a heapfile; useful when there are multiple secondary indexes.
- helps keeps updates efficient
hybrid approach: covering index, where some of the table's columns are stored within the index directly
“covering” because it can cover some queries by answering them using the index alone.
Multi-column Index #
Intent: When we wish to query multiple columns of a table (or multiple fields in a document) at the same time.
Types:
concatenated index fields are literally concatenated, sort order maintained
multi-dimensional index
more general way of querying several columns at once e.g. for geospatial data
useful for multi-dim range queries e.g.
SELECT * FROM restaurants WHERE latitude > 51.4946 AND latitude < 51.5079
    AND longitude > -0.1162 AND longitude < -0.1004;
How to do this?
translate 2D into a single number using a space-filling curve, then use a regular B-tree index
use R-trees
e.g. PostGIS using R-trees
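A minimal Morton (Z-order) space-filling-curve sketch; a real geo index would first quantize latitude/longitude to non-negative integers, which is omitted here:

```python
def z_order(x, y, bits=16):
    """Interleave the bits of x and y into one Morton (Z-order) code,
    so nearby 2-D points tend to map to nearby 1-D keys for a B-tree."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)       # x's bits land in even positions
        code |= ((y >> i) & 1) << (2 * i + 1)   # y's bits land in odd positions
    return code

# usage: the four corners of the unit square trace the "Z" shape 0,1,2,3
codes = [z_order(x, y) for x, y in [(0, 0), (1, 0), (0, 1), (1, 1)]]
```

A 2-D range query then becomes a small set of 1-D key ranges over the Morton codes.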
Full-text search, Fuzzy Indexes #
Fuzzy techniques are useful when we are doing similarity-based searches
Edit-distance-based approaches work here. Lucene uses it.
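A standard Levenshtein-distance dynamic program, as a sketch of the edit-distance idea (not Lucene's automaton-based implementation):

```python
def edit_distance(a, b):
    """Minimum number of single-character insertions, deletions, and
    substitutions turning string a into string b (Levenshtein distance)."""
    prev = list(range(len(b) + 1))          # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]
```

A fuzzy search then accepts any indexed term whose distance to the query term is below a small threshold.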
In-Memory DBs #
can be for caching only, where data loss on restart is acceptable
can also have special hardware for durability purposes
the actual reason they are fast is because there’s no need to encode in-memory datastructures in a form that can be written to the disk
it’s counterintuitive because the performance benefits are NOT because there’s no need to read from disks.
Transaction Processing or Analytics? #
background on transactions
transactions have their origins in commercial transactions
the basic access patterns for other domains were similar to processing business transactions
group of reads and writes as a logical unit == transactions
not necessarily ACID transactions
analytics access patterns are different
a separate analytics db is the data warehouse itself
comparison:
| Property | Transaction processing systems (OLTP) | Analytic systems (OLAP) |
| --- | --- | --- |
| Main read pattern | Small number of records per query, fetched by key | Aggregate over large number of records |
| Main write pattern | Random-access, low-latency writes from user input | Bulk import (ETL) or continuous event stream |
| Primarily used by | End user/customer, via web or mobile application | Internal analyst, for reporting and decision support |
| What data represents | Latest state of data (current point in time) | History of events that happened over time |
| Typical dataset size | Gigabytes to terabytes | Terabytes to petabytes |
Data Warehousing #
- less diversity of data models for warehousing, it’s formulaic
- e.g. star schema / dimensional modelling
- Benefits of keeping a separate warehouse
- allows R/O copy of different OLTP systems’ data
- allows optimisations for analytical queries and actions
- exploratory actions: drill-downs, slicing, dicing
- ETL processes guide the entry of data into the warehouse
- data can be inserted periodically or as a stream
- examples of open source engines (SQL-on-hadoop projects): Apache Hive, Spark SQL, Cloudera Impala, Facebook Presto, Apache Tajo, and Apache Drill
Stars and Snowflakes: Schemas for Analytics #
Star Schema:
- terms:
- fact tables Captures events as rows
- dimension tables:
referenced from the fact table; gives info about a particular dimension of the event
- it can be a good idea to have a dim table for datetime, to add more info about it
- attributes : just primitive information
- consider the case where individual events are rows in a fact table, that correlate to dimension tables (that give more information about a particular dimension).
Snowflake Schema:
- variation of the classic star schema. Dims may have sub-dims, so it extends out like a snowflake
- more normalised than star schemas, but star schemas are often chosen because of their simplicity for analysts to work with
tables are typically very wide, fact tables with many columns, dims can also end up being very wide.
Fact tables can be expected to be in the order of petabytes and dim tables can be expected to be in the order of millions of rows (smaller footprint)
Column-Oriented Storage #
Compared to OLTP DBs, where storage is row-oriented (values from one row are co-located): analytical queries touch only a subset of columns, so row storage has large overheads, since whole wide rows must be loaded into memory and then parsed and filtered to satisfy the query.
A better approach for analytical queries would be to co-locate values from the same column together.
Applies for both relational and document data as well.
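The layout difference can be shown with plain Python lists; the table and its columns are made up for illustration:

```python
# Row-oriented: each row's values are stored together.
rows = [
    {"date": "2024-01-01", "product": "apple", "quantity": 3},
    {"date": "2024-01-01", "product": "pear",  "quantity": 1},
    {"date": "2024-01-02", "product": "apple", "quantity": 5},
]

# Column-oriented: each column's values are stored together,
# with the implicit position associating values of the same row.
columns = {
    name: [row[name] for row in rows]
    for name in ("date", "product", "quantity")
}

# an analytical query (sum of quantities) now touches exactly one column;
# the other columns never need to be loaded or parsed
total = sum(columns["quantity"])
```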
Column Compression #
- data within columns can show characteristics that makes it easy to apply some common compression formats:
- when there’s repetitive data in the columns:
- can consider bitmap encoding, esp when there’s few distinct values \(\implies\) bitmap indexing
- can consider run-length encoding, when the bits within the bitmaps are relatively sparse (usually when the variety of distinct values is high)
- there’s a bunch of other compression schemes as well
- NOTE: Column families, e.g. in Cassandra, borrow from the Bigtable model, which is still row-oriented. Shouldn't be thought of as column-oriented.
- when there’s repetitive data in the columns:
memory bandwidth and vectorized processing #
Some bottlenecks when scanning millions of rows:
- bandwidth from getting data from disk into memory
- bandwidth from main memory into the CPU cache; also using the CPU pipeline well, e.g. avoiding branch mispredictions
Ways to be efficient:
- reduce the volume of data needed to be loaded from disk
- use CPU Cycles efficiently
- e.g. can design operators to operate on chunks of compressed column data directly \(\implies\) vectorized processing
Sort Order in Column Storage #
order doesn’t matter, can just do insertion order
we can establish order and use it as indexing mechanism
sorting is w.r.t. whole rows, so we can't sort each column independently: the implicit row offset is what associates values across columns
sorting is done by rows, has to make sense based on common queries e.g. if date-range queries, then first sort key should be along the date column
sorting can help with:
- faster querying and optimisations
- for grouping of related info on storage (colocation)
- compression of columns
- suppose there’s sparse deviation of values, then even billion rows’ worth of sort index can be compressed down to a few KB’s.
- compression effect is strongest on the first sort key
Several different sort orders #
- can store the same data in several different ways for different queries
- having multiple sort orders in a column-oriented store is a bit similar to having multiple secondary indexes in a row-oriented store
Writing to Column-Oriented Storage #
- the typical workload is large read-only queries, so we optimise for those; this makes writes more complex
- typically they’d do something like LSM trees, rely on some in-memory representation as well for the writes and then on filled buffers actually write over
this means that queries need to examine both the column data on disk and the recent writes in memory, then combine the two.
this is handled by the query optimiser without letting the user worry about it
Aggregation: Data Cubes and Materialized Views #
many of the queries rely on materialized aggregates, so it makes sense to cache some of these
materialized views in column stores are NOT virtual, they actually copy the query results to the disk,
it’s a denormalised copy of the data \(\implies\) needs updates if the underlying data changes
comparison: in relational data models, this is typically virtual, it’s more of a shortcut to writing queries.
in read-heavy cases like OLAP, materialized views make sense: reads get cheaper, at the cost of extra work on writes to keep the views up to date
data cube is an example of a materialized view
it’s actually more of a hypercube (n-dimensional) that holds aggregates along the different dimensions that exist.
the core principle: each cell contains the sales for a particular date-product-store-promotion-customer combination. These values can then repeatedly be summarized along each of the dimensions.
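The principle can be sketched as a toy data cube: pre-aggregating one measure over every combination of dimensions, so later queries become dictionary lookups (dimension and measure names here are made up):

```python
from collections import defaultdict
from itertools import combinations

def build_cube(facts, dimensions, measure):
    """Pre-aggregate the measure over every subset of the dimensions.
    Key () holds the grand total; longer keys hold finer-grained slices."""
    cube = defaultdict(float)
    for fact in facts:
        for r in range(len(dimensions) + 1):
            for dims in combinations(dimensions, r):
                key = tuple((d, fact[d]) for d in dims)
                cube[key] += fact[measure]
    return dict(cube)

# usage: three fact rows, two dimensions, one measure
facts = [
    {"date": "2024-01-01", "product": "apple", "net_price": 2.0},
    {"date": "2024-01-01", "product": "pear",  "net_price": 3.0},
    {"date": "2024-01-02", "product": "apple", "net_price": 4.0},
]
cube = build_cube(facts, ("date", "product"), "net_price")
```

A query like "total sales of apples" no longer scans the fact table at all; it reads one pre-computed cell.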
performance benefits:
- this is essentially pre-computation as a form of speeding up analytical queries.
problems:
- less flexibility than being able to query the raw data
Most data warehouses therefore try to keep as much raw data as possible, and use aggregates such as data cubes only as a performance boost for certain queries.