Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems
Chapter 2. Data Models and Query Languages
Table of Contents
Relational Model Versus Document Model #
The Birth of NoSQL #
The Object-Relational Mismatch #
Many-to-One and Many-to-Many Relationships #
Are Document Databases Repeating History? #
Relational Versus Document Databases Today #
Roots of the Relational Data Model Are in Transaction and Batch Processing #
So the historical mental models to adopt here are those of transaction processing and batch processing.
The roots of relational databases lie in business data processing, which was performed on mainframe computers in the 1960s and ’70s. The use cases appear mundane from today’s perspective: typically transaction processing (entering sales or banking transactions, airline reservations, stock-keeping in warehouses) and batch processing (customer invoicing, payroll, reporting).
Query Languages for Data #
Declarative Queries on the Web #
- focuses on the what, not the how, of querying, allowing the optimizer to figure out the “how” parts
- i.e. no focus on the physical implementation of the data
- SQL’s limited expressiveness gives the query optimizer freedom to choose how a query is executed
- declarative queries also lend themselves to parallel execution
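The what-vs-how distinction can be sketched in Python over SQLite (table and data are made up for illustration): the imperative version fixes a scan-and-filter order, while the declarative WHERE clause leaves the access path to the optimizer.

```python
import sqlite3

# Hypothetical table of animals, invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE animals (name text, family text)")
conn.executemany("INSERT INTO animals VALUES (?, ?)", [
    ("Great white", "Sharks"), ("Blue whale", "Whales"), ("Hammerhead", "Sharks"),
])

# Imperative: we spell out HOW to scan and filter, fixing an execution order.
sharks = [row[0] for row in conn.execute("SELECT name, family FROM animals")
          if row[1] == "Sharks"]

# Declarative: we state WHAT we want; the engine chooses the access path
# (full scan, index, etc.) and is free to change it without changing results.
sharks_sql = [row[0] for row in
              conn.execute("SELECT name FROM animals WHERE family = 'Sharks'")]

assert sharks == sharks_sql
```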
MapReduce Querying #
Should be seen as a fairly low-level programming model for distributed execution on machine clusters.
A hybrid of declarative and imperative querying; execution can be distributed across many machines’ storage.
It’s a map-then-reduce pair of operations applied to the documents that match the query.
Restrictions:
- must be pure functions
- no subqueries
Usability problems:
- the two carefully coordinated functions are often harder to write than a single query
- a plain declarative QL is easier to optimise \(\implies\) mongodb added aggregation pipeline support
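The map-then-reduce shape can be sketched in plain Python (the collection, field names, and filter are invented for illustration; MongoDB’s real mapReduce takes JavaScript functions):

```python
from collections import defaultdict

# Hypothetical document collection: observations of animals.
observations = [
    {"family": "Sharks", "date": "1995-12-25", "num_animals": 3},
    {"family": "Sharks", "date": "1995-12-12", "num_animals": 4},
    {"family": "Whales", "date": "1995-12-01", "num_animals": 2},
]

def map_fn(doc):
    """Pure function: emit (key, value) pairs for one document."""
    yield (doc["date"][:7], doc["num_animals"])  # key = year-month

def reduce_fn(key, values):
    """Pure function: combine all values emitted under one key."""
    return sum(values)

def map_reduce(docs, query):
    groups = defaultdict(list)
    for doc in filter(query, docs):        # 1. select documents matching the query
        for key, value in map_fn(doc):     # 2. map each document to (key, value)
            groups[key].append(value)
    return {k: reduce_fn(k, vs) for k, vs in groups.items()}  # 3. reduce per key

print(map_reduce(observations, lambda d: d["family"] == "Sharks"))
# {'1995-12': 7}
```

Because both functions are pure (no db reads, no side effects), the framework can run them anywhere, in any order, and rerun them on failure.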
Graph-Like Data Models #
Graphs are not JUST for homogeneous data; can be used to consistently store completely different types of objects in a single datastore.
Multiple ways of structuring and querying data in graphs:
Property Graphs #
- analogous to having two relational tables: vertices and edges
CREATE TABLE vertices (
    vertex_id   integer PRIMARY KEY,
    properties  json
);

CREATE TABLE edges (
    edge_id     integer PRIMARY KEY,
    tail_vertex integer REFERENCES vertices (vertex_id),
    head_vertex integer REFERENCES vertices (vertex_id),
    label       text,
    properties  json
);
CREATE INDEX edges_tails ON edges (tail_vertex);
CREATE INDEX edges_heads ON edges (head_vertex);
- can efficiently traverse the graph, because we can find both incoming and outgoing edges for any given vertex
- can join any two vertices together
- labels can be used for categorising things
Graphs are good for evolvability: as you add features to your application, a graph can easily be extended to accommodate changes in your application’s data structures.
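The two-table layout above runs as-is in SQLite; a minimal sketch (vertex ids and property values are invented) showing how the edges_tails index serves the “outgoing edges of a vertex” lookup:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE vertices (
    vertex_id   integer PRIMARY KEY,
    properties  json
);
CREATE TABLE edges (
    edge_id     integer PRIMARY KEY,
    tail_vertex integer REFERENCES vertices (vertex_id),
    head_vertex integer REFERENCES vertices (vertex_id),
    label       text,
    properties  json
);
CREATE INDEX edges_tails ON edges (tail_vertex);
CREATE INDEX edges_heads ON edges (head_vertex);
""")

# Tiny graph: Idaho --within--> United States.
conn.executemany("INSERT INTO vertices VALUES (?, ?)", [
    (1, '{"name": "Idaho", "type": "state"}'),
    (2, '{"name": "United States", "type": "country"}'),
])
conn.execute("INSERT INTO edges VALUES (1, 1, 2, 'within', '{}')")

# Outgoing edges of vertex 1 (a lookup the edges_tails index can serve).
rows = conn.execute(
    "SELECT label, head_vertex FROM edges WHERE tail_vertex = ?", (1,)
).fetchall()
print(rows)  # [('within', 2)]
```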
The Cypher Query Language #
- gives an easy, expressive way to match patterns in the graph and query it in a declarative fashion
Graph Queries in SQL #
- vertex finding usually involves traversing a variable number of edges, so it’s not easy to know the path in advance. SQL join statements naturally assume we do know this information in advance.
- we have to rely on recursive CTEs if we use SQL syntax for this instead
- the clumsiness of recursive CTE syntax is indicative of the fact that relational dbs’ data models don’t play well with graphs
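A minimal sketch of such a recursive CTE in SQLite (schema simplified to names only; vertex ids and place names are invented), finding everything transitively “within” the United States:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE vertices (vertex_id integer PRIMARY KEY, name text);
CREATE TABLE edges (tail_vertex integer, head_vertex integer, label text);
INSERT INTO vertices VALUES (1, 'Idaho'), (2, 'United States'),
                            (3, 'North America'), (4, 'Boise');
-- 'within' chain: Boise -> Idaho -> United States -> North America
INSERT INTO edges VALUES (4, 1, 'within'), (1, 2, 'within'), (2, 3, 'within');
""")

# Follow incoming 'within' edges transitively, starting from 'United States'.
rows = conn.execute("""
WITH RECURSIVE in_usa(vertex_id) AS (
    SELECT vertex_id FROM vertices WHERE name = 'United States'
  UNION
    SELECT edges.tail_vertex
    FROM edges JOIN in_usa ON edges.head_vertex = in_usa.vertex_id
    WHERE edges.label = 'within'
)
SELECT name FROM vertices JOIN in_usa USING (vertex_id)
WHERE name <> 'United States'
ORDER BY name
""").fetchall()
print(rows)  # [('Boise',), ('Idaho',)]
```

Note how much ceremony a variable-length traversal costs in SQL compared to a one-line graph-pattern match.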
Triple-Stores and SPARQL #
triple-store, all information is stored in the form of very simple three-part statements: (subject, predicate, object).
e.g. (Jim, likes, bananas)
The Turtle syntax can look more keyword-argument-like:
@prefix : <urn:example:>.
_:lucy     a :Person;   :name "Lucy";          :bornIn _:idaho.
_:idaho    a :Location; :name "Idaho";         :type "state";   :within _:usa.
_:usa      a :Location; :name "United States"; :type "country"; :within _:namerica.
_:namerica a :Location; :name "North America"; :type "continent".
- predicates can represent:
- a property of the subject, when the object is a primitive value (e.g. :name "Lucy")
- an edge to another vertex, when the object is another vertex (e.g. :within _:usa)
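A triple-store can be sketched as a flat list of Python tuples with a tiny pattern matcher (subjects and predicates invented to mirror the Turtle example above):

```python
# (subject, predicate, object) triples; objects may be primitive values
# (properties) or other vertices (edges).
triples = [
    ("lucy",  "name",    "Lucy"),    # property: object is a string
    ("lucy",  "born_in", "idaho"),   # edge: object is another vertex
    ("idaho", "name",    "Idaho"),
    ("idaho", "within",  "usa"),
    ("usa",   "name",    "United States"),
]

def match(s=None, p=None, o=None):
    """Return all triples matching the given pattern; None is a wildcard."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

print(match(s="lucy"))  # [('lucy', 'name', 'Lucy'), ('lucy', 'born_in', 'idaho')]
```

Every query, whether for properties or edges, reduces to the same (s, p, o) pattern shape, which is what SPARQL builds on.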
- interestingly, triples are useful for the semantic web, initially marketed as a “web for machines”. This book is somewhat old, but the cool thing is that we actually do see the emergence of products that are supposed to be an “internet for bots”.
RDF data model (Resource Description Framework) (ref wiki)
- Turtle (above) is one serialisation; there is also an XML-ised format (RDF/XML) for representing such resources within a graph db
- designed for the internet itself, i.e. combining data published by different websites
The Foundation: Datalog #
kinda old, but some understanding will help wrap our heads around modern langs like Cascalog (for querying Hadoop). Datalog is a Prolog subset!
Facts are written as
predicate(subject, object)
Derived predicates are not stored within the db; they are defined as rules and then used in the query. Rules can refer to other rules, so we can compose complex logic into rules.
name(namerica, 'North America').
type(namerica, continent).
name(usa, 'United States').
type(usa, country).
within(usa, namerica).
name(idaho, 'Idaho').
type(idaho, state).
within(idaho, usa).
name(lucy, 'Lucy').
born_in(lucy, idaho).

within_recursive(Location, Name) :- name(Location, Name).     /* Rule 1 */

within_recursive(Location, Name) :- within(Location, Via),    /* Rule 2 */
                                    within_recursive(Via, Name).

migrated(Name, BornIn, LivingIn) :- name(Person, Name),       /* Rule 3 */
                                    born_in(Person, BornLoc),
                                    within_recursive(BornLoc, BornIn),
                                    lives_in(Person, LivingLoc),
                                    within_recursive(LivingLoc, LivingIn).

?- migrated(Who, 'United States', 'Europe').
/* Who = 'Lucy'. */

The composability is a big advantage of using datalog.
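A rough Python sketch of how a Datalog engine could evaluate the recursive within_recursive rules bottom-up, reapplying Rule 2 until no new facts appear (a fixpoint); facts are taken from the example above:

```python
# Facts: within(child, parent) and name(location, name).
within = {("usa", "namerica"), ("idaho", "usa")}
name = {"namerica": "North America", "usa": "United States", "idaho": "Idaho"}

# Naive bottom-up evaluation of:
#   within_recursive(L, N) :- name(L, N).                           /* Rule 1 */
#   within_recursive(L, N) :- within(L, V), within_recursive(V, N). /* Rule 2 */
within_recursive = {(loc, n) for loc, n in name.items()}  # Rule 1 seeds the set
changed = True
while changed:  # apply Rule 2 until no new facts are derived (fixpoint)
    new = {(loc, n) for loc, via in within
           for via2, n in within_recursive if via == via2}
    changed = not new <= within_recursive
    within_recursive |= new

print(("idaho", "United States") in within_recursive)  # True
```

Real engines are far smarter about evaluation order, but the rules themselves stay declarative: new rules compose with existing ones without rewriting anything.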
Summary #
NoSQL gives two main directions of progress for data models:
- document models
- graph models
Other special data requirements not mentioned here:
- sequence-similarity searching (for genomic sequences) – e.g. GenBank (a genome database)