Non-Relational Data Storage - ut · • Non-relational data models aim to store aggregated data together • Aggregate is a collection of data that is treated as a unit –E.g. a

Non-Relational Databases

Pelle Jakovits

25 October 2017

Outline

• Background– Relational model– Database scaling– The NoSQL Movement– CAP Theorem

• Non-relational data models– Key-value– Document-oriented– Column family– Graph

• Example databases

2/36

The Relational Model

• Data is stored in tables

• Strict relationships between tables

– Foreign key references between columns

• The expected format of the data is specified with a restrictive schema

• Data is typically accessed using Structured Query Language (SQL)

3/36

Database Scaling

• Vertical scaling – on one machine

• Horizontal scaling – across a cluster of machines

• Relational model does not scale well horizontally

– Because there are too many dependencies in relational model

– Database sharding is one approach

4/36

Sharding

5/36

Relational database

6/38

https://dev.mysql.com/doc/sakila/en/

The NoSQL Movement

• Emergence of persistence solutions using non-relational data models

• Driven by the rise of „Big Data“ and Cloud Computing

• Non-relational data models are based on Key - Value structure

• Simpler schema-less key-value based data models scale better than the relational model

• NoSQL is a broad term with no clear boundaries– The term NoSQL itself is very misleading

– No SQL?

– Not only SQL?

7/36

CAP Theorem

• It is impossible for a distributed computer system to simultaneously provide all three of the following:– Consistency - every read receives the most recent write or

an error– Availability - every request receives a response– Partition/Fault tolerance - the system continues to operate

despite arbitrary partitioning (network failures, dropped

• Have to choose between consistency or availability• NoSQL solutions which are more focused on

availability try to achieve eventual consistency• When trying to aim for both consistency or availability

– Will have high latency

8/36

CAP Theorem

9/36

Benefits of the Key-value Model

• Horizontal scalability

– Data with the same Key stored close to each other

– Suitable for cloud computing

• Flexible schema-less models suitable for unstructured data

• Fetching data by key can be very fast

10/36

Indexing and Partitioning

• In Key-value model, Key acts as an index

• Secondary indexes can be created in some solutions

– Implementations vary from partitioning to local distributed indexes

• Data is partitioned between different machines in the cluster

• Usually data is partitioned by rows or/and column families

• Users can often specify partitioning parameters

– Gives control over how data is distributed

– Very important for optimizing query speed

11/36

Aggregate-oriented DB

• Non-relational data models aim to store aggregated data together

• Aggregate is a collection of data that is treated as a unit– E.g. a customer and all of his orders

• In normalized relational databases aggregates are computed using GroupBy operations

• Keyed aggregates make for a natural unit of data sharding

12/36

Non-relational Data Models

The Key-value Model

• Data stored as key-value pairs

• The value is an opaque blob to the database

• Examples: Dynamo, Riak

14

The Document-oriented Model

• Data is also stored as key-value pairs

• Value is a „document“ and has further structure

• No strict schema

• Examples: CouchDB, MongoDB

15/36

Example

• Aggregates described in JSON using map and array data structures

16

The Column Family Model

• Data stored in large sparse tabular structures

• Columns are grouped into column families

– Column family is a meaningful group of columns

– Similar concept as a table in relational database

• A record can be thought of as a two-level map

• New columns can be added at any time

• Examples: BigTable, Cassandra, HBase

17/36

Column Family Example

18/36

a001

Names

username jsmith

firstname John

lastname Smith

Contactsphone 5 550 001

email [email protected]

Messages

item1 Message1

item2 Message2

… …

itemN MessageN

b014

Names

username pauljones

Contacts

Messages

item1 new Message

mailto:[email protected]

https://neo4j.com/developer/guide-build-a-recommendation-engine/

Graph Databases

• Data stored as nodes and edges of a graph

• The nodes and edges can have fields and values

• Aimed at graph traversal queries in connected data

• Examples: Neo4J, FlockDB

19/36

Non-relational Data Models

• In non-relational data stores data is denormalized

• It is common to also store JSON documents in key-value stores

• The key-value, document-oriented and column family models are aggregate oriented models

• The other models are based on the key-value model

• In reality the classification of databases into different models is not as straight forward

• Multiparadigm databases also exist (e.g ArangoDB)

20/36

Non Relational Database Examples

Riak

• Based on Amazon's Dynamo specification

• Key-value model

• Distributed decentralized persistent hash table

– Keyspace is partitioned between nodes

• Consistent hashing

– Avoid hash-to-node remapping when a node is added or removed

• Eventual Consistency using Multiversion Concurrency Control (MVCC)

22/36

Riak

• RESTful query interface

– Basic PUT, GET, POST, and DELETE functions

• Links and link walking

– One way relationships between data objects

– Turns Key-Value store into a simple Graph database

• Higher level querying on-top of Key-Value structure:

– Riak search is based on Apache Solr search engine

– MapReduce in Erlang and JavaScript

23/36

MongoDB

• Document oriented (BSON)

• Query language based on JavaScript

• GridFS for BLOB file storage

• Master-slave architecture

• Linking between documents

• Supports MapReduce in JavaScript

24/36

Query example

db.inventory.find({

status: "A",

$or: [ {qty: { $lt: 30 }}, { item: /^p/ }]

})

• Matches the following SQL query:

SELECT * FROM inventory WHERE status = "A" AND ( qty < 30 OR item LIKE "p%")

25/38

CouchDB

• Document oriented (JSON)

• RESTful query interface

• Built in web server

• Web based query front-end Futon

26/36

CouchDB

• MapReduce is the primary query method (JavaScript and Erlang)

• Materialized views as results of incremental MapReduce jobs

• CouchApps – javascript-heavy web applications built entirely on top of CouchDBwithout a separate web server for the logic layer

27/36

Query example

Documents:

{ „id“: 2, "position" : "programmer", "first_name": "James", "salary" : 100 }

{ „id“: 7, "position" : "support", "first_name": "John", "salary" : 23 }

Lets extract average salary for each unique position

Map:

function(doc, meta) {

if (doc.position && doc.salary) { emit(doc.position,doc.salary); }

}

Reduce:

function(key, values, rereduce) {

return sum(values)/values.length;

}

Some Reduce functions have been pre-defined: _sum(), _count(), _stats()

28/38

Cassandra

• Column family model

• Data model from BigTable, distribution model from Dynamo (decentralized)

• Uses Cassandra Query Language (CQL)

– SQL-like querying language

29/36

Cassandra

• Provides Availability & Partition-Tolerance from the CAP theorem

• Static and dynamic column families

• Dynamic column families as materialized views on data

• The concept of super-columns

– Family of column families

30/36

Neo4J

• Open source NoSQL Graph Database

• Uses the Cypher Query Language for querringgraph data

• Graph consists of Nodes, Edges and Attributes

31/36

Neo4J Querry example

MATCH (you {name:"You"})

MATCH (expert)-[:WORKED_WITH]->

(db:Database {name:"Neo4j"})

MATCH path = shortestPath(

(you)-[:FRIEND*..5]-(expert)

)

RETURN db,expert,path

32/38https://neo4j.com/developer/cypher-query-language/

Why use Non-Relational DB?

• Volume of Data

• Elasticity

• Flexible Schemas

• Cost

33/38

http://martinfowler.com/bliki/PolyglotPersistence.html

Polyglot Persistence

34/36

Conclusions

• In recent years there has been a rise in non-relational (NoSQL) data stores

• This is related to the rise of cloud computing –key-value models offer better scalability

• The NoSQL landscape is extremely varied• The four basic non-relational data models are:

– The key-value model– The document-oriented model– The column family model– Graph databases

35/36

That is All

• Next week's practice session

– Using NoSQL databases

• Exam times

– Wednesday - 01 November 2017

– Thursday - 02 November 2017

• NB! A set of example exam questions are available on the course website

36/36