NOSQL Databases: Topics

• Introduction

• Rationale

• Key-value stores

• MapReduce

• Implementations

Introduction

• NOSQL := Not Only SQL

• Acronym introduced in 2009

  - as the name of a meetup about open-source, distributed, non-relational databases

• Message misunderstood, giving birth to “NoSQL”

Rationale (1)

• Performance

• Scalability

• Flexibility

• Kind of Data

Rationale (2)

• Brewer’s CAP Theorem

• Cannot guarantee more than two of:
  - Consistency
  - Availability
  - Partition tolerance

Implementations

NOSQL

• KV Volatile: Memcached, Redis
• Document Store: eXist, CouchDB, MongoDB
• Column Store: MonetDB, Infobright
• KV Durable: Dynamo, Voldemort, Riak
• Graph: Neo4j, HyperGraphDB

Key-Value Stores

• Global collection of Key/Value pairs

• Multiple types

  - In memory (Redis, Memcached)
  - On disk (BerkeleyDB)
  - Eventually consistent (Cassandra, Dynamo, Voldemort)

Document Databases

• Similar to Key/Value database, with whole documents as values.

• Flexible schema

• Documents are serialized

• Examples: CouchDB, MongoDB

Column Family Database

• Similar to Key/Value database, with multiple attributes (columns) as values.

• Not to be confused with column-oriented DBMS

Graph Databases

• Inspired by Graph Theory

• Gaining popularity as RDF stores

• Examples: Neo4j, InfiniteGraph

Other

• Many others exist:
  - Any database outside the relational model

• Object databases

• File systems

Key-Value Stores

• Basic Idea

• Mapping Tables to KV pairs

• Consistent Hashing

Basic Idea

• Very simple data model

• {key,value} pairs with unique keys

  - {student_id: student_name}
  - {part_id: part_manufacturer}
  - {child_id: parent_id}

• Values have no type constraint

API

• put(key, value)

• get(key)

  - value = get(key)

• value is usually composite
  - Opaque blob (e.g. TokyoCabinet)
  - Directly supported (e.g. MongoDB)
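A minimal sketch of the put/get interface above, assuming an in-memory dictionary as storage and JSON as the serialization of composite values. The class name `KVStore` and the key format are hypothetical and do not correspond to the API of any particular store.

```python
import json


class KVStore:
    """Toy key-value store: unique keys, values kept as opaque blobs."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # Serialize the (possibly composite) value into an opaque byte blob.
        self._data[key] = json.dumps(value).encode("utf-8")

    def get(self, key):
        # Return the deserialized value, or None when the key is unknown.
        blob = self._data.get(key)
        return None if blob is None else json.loads(blob.decode("utf-8"))


store = KVStore()
store.put("student:42", {"name": "John Smith", "year": 2})
student = store.get("student:42")
```

The store itself never looks inside the blob, which is the "opaque" case; a store that supports composite values directly would skip the serialization step.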

Implementation

• Usually B-trees or extensible hash tables

• Well-known structures in the RDBMS world

Mapping Tables to KV pairs

    CREATE TABLE user (
        id INTEGER PRIMARY KEY,
        username VARCHAR(64),
        password VARCHAR(64)
    );

    CREATE TABLE follows (
        follower INTEGER REFERENCES user(id),
        followed INTEGER REFERENCES user(id)
    );

    CREATE TABLE tweets (
        id INTEGER,
        user INTEGER REFERENCES user(id),
        message VARCHAR(140),
        timestamp TIMESTAMP
    );

Mapping Tables to KV pairs — Redis

• Creating a user

    INCR global:nextUserId => 1000
    SET uid:1000:username john_smith
    SET uid:1000:password sunnyEvening

• Enabling logging-in

    SET username:john_smith:uid 1000

• Following:

    uid:1000:followers => Set of uids
    uid:1000:following => Set of uids

Mapping Tables to KV pairs — Redis

• Messages by user:

    uid:1000:posts => a List of post ids

• Adding a new message:

    SET post:10343 "$owner_id|$time|I'm having fun"
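The key scheme above can be tried out without a Redis server. The following sketch uses plain Python containers as a stand-in for Redis strings, sets and lists; the function names and the `global:nextPostId` counter are assumptions made for illustration and do not appear in the slides.

```python
# Plain Python containers stand in for a Redis server in this sketch.
strings = {}  # keys written with SET / INCR
sets = {}     # set-valued keys
lists = {}    # list-valued keys


def create_user(username, password):
    # INCR global:nextUserId (seeded so that the first id is 1000, as in the slide)
    uid = strings.get("global:nextUserId", 999) + 1
    strings["global:nextUserId"] = uid
    strings[f"uid:{uid}:username"] = username
    strings[f"uid:{uid}:password"] = password
    # Reverse mapping, needed to find the uid at log-in time.
    strings[f"username:{username}:uid"] = uid
    return uid


def follow(follower, followed):
    sets.setdefault(f"uid:{followed}:followers", set()).add(follower)
    sets.setdefault(f"uid:{follower}:following", set()).add(followed)


def post(uid, time, message):
    # Hypothetical counter for post ids, mirroring global:nextUserId.
    post_id = strings.get("global:nextPostId", 0) + 1
    strings["global:nextPostId"] = post_id
    strings[f"post:{post_id}"] = f"{uid}|{time}|{message}"
    lists.setdefault(f"uid:{uid}:posts", []).insert(0, post_id)  # newest first
    return post_id


john = create_user("john_smith", "sunnyEvening")
jane = create_user("jane_doe", "rainyMorning")
follow(jane, john)
first_post = post(john, 1300000000, "I'm having fun")
```

Every relational lookup (by username, by follower, by author) becomes an explicit key that the application has to maintain itself.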

Consistent Hashing

• Huge amounts of data

  - Naive approach:

        server_id = hash(key) % number_of_servers

  - Hash function: anything → int

• Distribution?
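One way to explore the distribution question is to count how many keys change server when a server is added under the naive scheme. The sketch below assumes MD5 as a deterministic hash and made-up key names; with modulo placement, roughly four keys out of five are expected to move when going from 4 to 5 servers.

```python
import hashlib


def stable_hash(key):
    # MD5 is used here only as a deterministic, well-spread hash
    # (an assumption of this sketch, not a recommendation).
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def server_id(key, number_of_servers):
    # The naive approach from the slide.
    return stable_hash(key) % number_of_servers


keys = [f"key{i}" for i in range(10_000)]

# Count the keys whose server changes when a fifth server is added.
moved = sum(server_id(k, 4) != server_id(k, 5) for k in keys)
fraction_moved = moved / len(keys)
```

A key stays in place only when its hash has the same remainder modulo 4 and modulo 5, which is why almost all of the data has to be relocated.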

Consistent Hashing — Circle

• Assume int to be an 8-bit unsigned integer

• We have hash(key) ∈ ⟦0, 255⟧

• We can represent these values on a circle and:
  - Assign a position to each server
  - Compute the position of each key
  - Assume a key k belongs to the next server on the circle (clockwise)

Consistent Hashing — Circle

• Each node (server) is assigned a random value

• The hash of this value gives the position of the server on the circle

• A server is responsible for the arc before its position
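The circle can be sketched with a sorted list of server positions and a binary search. This is a minimal illustration under stated assumptions: positions come from an MD5 hash of a hypothetical server name, the ring has 2^32 positions rather than the 8-bit circle of the slides, and two servers are assumed never to hash to the same position.

```python
import bisect
import hashlib

RING_SIZE = 2 ** 32  # the slides use 2 ** 8 for illustration


def position(value):
    digest = hashlib.md5(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % RING_SIZE


class Ring:
    def __init__(self, servers=()):
        self._positions = []  # sorted server positions on the circle
        self._owner = {}      # position -> server name
        for server in servers:
            self.add(server)

    def add(self, server):
        pos = position(server)
        bisect.insort(self._positions, pos)
        self._owner[pos] = server

    def lookup(self, key):
        # First server at or after the key's position, wrapping around.
        i = bisect.bisect_left(self._positions, position(key))
        return self._owner[self._positions[i % len(self._positions)]]


ring = Ring(["server-a", "server-b", "server-c"])
keys = [f"key{i}" for i in range(1000)]
before = {k: ring.lookup(k) for k in keys}

ring.add("server-d")
after = {k: ring.lookup(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
```

Only the keys on the arc taken over by the new server change owner; every key in `moved` now belongs to `server-d`, and all other keys stay where they were.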

• Adding a node

Virtual nodes

Moving nodes

Replication

• Coordinator as defined previously

• In charge of replication to other nodes (e.g. N next ones)

• Parameters :

  - Number of replicas (N)
  - Minimal number of successful writes (W)
  - Minimal number of coherent reads (R)
  - Must respect R + W > N (Why?)

• Repair-on-read
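A small brute-force check can help answer the "Why?" above: when R + W > N, every set of R replicas read must share at least one replica with every set of W replicas written, so a read always sees the latest successful write. The function name below is hypothetical.

```python
from itertools import combinations


def quorums_always_overlap(n, w, r):
    """True if every write set of size w intersects every read set of size r."""
    replicas = range(n)
    return all(
        set(write_set) & set(read_set)
        for write_set in combinations(replicas, w)
        for read_set in combinations(replicas, r)
    )


overlap_n3_w2_r2 = quorums_always_overlap(3, 2, 2)  # R + W = 4 > 3
overlap_n3_w1_r2 = quorums_always_overlap(3, 1, 2)  # R + W = 3, not > 3
```

With N = 3, W = 1 and R = 2, a write may land only on replica 0 while a read consults replicas 1 and 2, so the read can miss the write entirely.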

MapReduce

• Parallel processing model

• Introduced to tackle computations over very large datasets

• Based on the well-known divide and conquer approach
  - Large problem divided into many small problems
  - Each tackled by one “processor” (Map)
  - Results are then combined (Reduce)

• References: MapReduce (textbook), Lin and Dyer, 2010

Parallelism?

• Not a new problem
  - E.g. threads, MPI, sockets, remote shell, ...
  - Generally tackles computation distribution, not data distribution.
  - The developer is in charge of the implementation details.

• MapReduce offers an abstraction of many mechanisms by imposing a structure on the program.

MapReduce Concepts

[Figure: Data → Mappers → Reducers → Output]

Origins of Map

• Map originally comes from the functional programming world

• Basic idea:

    for (int i = 0; i < arr.length(); i++) {
        result[i] = function(arr[i]);
    }

• where function is a function in the mathematical sense

Origins of Map

• Idea: isolate the loop, so we can write:

    result = map(function, arr);

• What if you could pass functions around as values?

• map could be a function that takes as arguments
  - a sequence
  - a function

  and that returns a new sequence where every element is the result of applying the function on the corresponding element in the original sequence

• map can abstract many for loops
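As an illustration in a language where functions are values, here is the same computation written once as an explicit loop and once with Python's built-in `map`. The function `square` and the sample data are made up for the example.

```python
def square(x):
    return x * x


arr = [1, 2, 3, 4]

# Explicit loop, as in the first version.
result_loop = [None] * len(arr)
for i in range(len(arr)):
    result_loop[i] = square(arr[i])

# The loop is isolated inside map; only the function and the sequence remain.
result_map = list(map(square, arr))
```

Because `square` has no side effects, each element can be processed independently, which is what makes map easy to distribute.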

Origins of Reduce

• map does not cover all for loops

• For example, when you gradually aggregate the results:

    int total = 0;
    for (int i = 0; i < arr.length(); i++) {
        total = total + arr[i];
    }

• More generally:

    for (int i = 0; i < arr.length(); i++) {
        total = function(total, arr[i]);
    }

• reduce covers these ones:

    total = reduce(function, arr);
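The same pattern exists in Python as `functools.reduce`. The sketch below compares it with the explicit loop; the sample data and the initial value 0 are assumptions of the example.

```python
from functools import reduce
import operator

arr = [3, 1, 4, 1, 5]

# Explicit aggregation loop.
total_loop = 0
for x in arr:
    total_loop = operator.add(total_loop, x)

# The same aggregation expressed with reduce (0 is the initial total).
total_reduce = reduce(operator.add, arr, 0)
```

Unlike map, each step here depends on the total produced by the previous step.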

map and reduce in MapReduce

• In the context of MapReduce, the mapped function must return key-value pairs:

    map(function, [data, ...]) → [(key, value), ...]

• Before the reduction, the data has to be aggregated by key:

    [(key1, value1), (key1, value2), ...] → (key1, value1, value2, ...)

• The reduce step acts on the values for each key:

    reduce(key1, value1, value2, ...) → (key1, value)

Example

• Counting the words in a text

• map: word → (word, 1)

    Pair make_pair(String word) {
        return new Pair(word, 1);
    }

• Aggregation: (word, 1, 1, 1, ...)

• reduce:

    Pair compute_sum(String word, List<Integer> values) {
        int sum = 0;
        for (int i : values) {
            sum += i;
        }
        return new Pair(word, sum);
    }
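The three steps (map, aggregation by key, reduce) can be put together in a few lines. This is a sequential sketch with no distribution at all; the driver function `map_reduce` is hypothetical and only mirrors the structure described above.

```python
from collections import defaultdict


def map_reduce(mapper, reducer, data):
    # Map: every input item yields a list of (key, value) pairs.
    pairs = [pair for item in data for pair in mapper(item)]
    # Aggregation: group the values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # Reduce: one call per key, on all the values of that key.
    return [reducer(key, values) for key, values in groups.items()]


def make_pair(word):
    return [(word, 1)]


def compute_sum(word, values):
    return (word, sum(values))


text = "to be or not to be"
counts = dict(map_reduce(make_pair, compute_sum, text.split()))
```

Only `make_pair` and `compute_sum` are specific to word counting; the driver stays the same for other jobs.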

Parallelization

• map:
  - Trivial: absolutely no side effect
  - (or not: what about transfer times?)

• reduce:
  - Not fully parallelizable (each step needs the result of the previous step)

Parallelizing reduce

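• Reduce needs to be idempotent
  - Mathematically: f(f(x)) = f(x)
  - The tree-shaped regrouping below also relies on the function being associative: f(f(a, b), c) = f(a, f(b, c))

• Computation can be tree-shaped:

            4
          /   \
         2     2
        / \   / \
       1   1 1   1

• log N steps instead of N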
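The tree-shaped computation can be sketched as a pairwise reduction that proceeds level by level. The sketch assumes an associative function and a non-empty input, and it runs sequentially; in a real system every pair within a level could be handled by a different processor. The function name `tree_reduce` is hypothetical.

```python
import operator


def tree_reduce(function, values):
    """Combine values pairwise, level by level.

    Returns (result, number_of_levels). Requires a non-empty sequence.
    """
    level = list(values)
    levels = 0
    while len(level) > 1:
        # All pairs of one level are independent of each other.
        next_level = [
            function(level[i], level[i + 1])
            for i in range(0, len(level) - 1, 2)
        ]
        if len(level) % 2:
            next_level.append(level[-1])  # odd element moves up unchanged
        level = next_level
        levels += 1
    return level[0], levels


total, depth = tree_reduce(operator.add, [1, 1, 1, 1])
```

For the four leaves of the figure, the result is 4 after 2 levels, where a sequential reduce needs 3 dependent additions.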

We lied!

• There is still one step to discuss: how do we aggregate values by key?

• Naive idea: put a barrier between map and reduce
  - Wait for all maps to complete
  - Get all results in one place, sort them
  - Redistribute them for reduce

[Figure: Data → Mappers → Barrier → Reducers → Output]

Parallelizing aggregation

• The naive approach:
  - is simple
  - does not require an idempotent reduce
  - is not as parallel as it could be

• Other idea: consistent hashing and idempotence
  - Can compute results incrementally (idempotence)
  - No barrier: better parallelism (hashing)
  - Can display current results (idempotence)

• Note: usually, the implementation sorts the intermediate key-value pairs generated by map and the final results by key. This can be exploited by choosing a meaningful key.

Example: Sorting people by name

• map: person → (person.name, person)

• reduce: (person.name, person1, person2, ...) → (person.name, person1, person2, ...)

• The result is sorted by virtue of the MapReduce machinery itself.

Example: Finding all (author,book) pairs

• There can be multiple authors per book!

• map
  - We need a polymorphic map function, say f, such that:
    - f(author) → (author.name, author)
    - f(book) → [(book.author.name, book), ...]

• Aggregation: (author.name; book∗, author, book∗)

• In the following code, Value is a superclass of Author, Book and List.

Example: Finding all (author,book) pairs

• reduce

    void reduce(String authorName, List<Value> values) {
        Author a = null;        // the author, once it has been seen
        Book prevbook = null;   // a book seen before its author
        List<Pair> list = new List<Pair>();

        for (Value value : values) {
            if (value instanceof Author) {
                a = (Author) value;
                if (prevbook != null) {
                    list.append(new Pair(a, prevbook));
                    prevbook = null;
                }
            } else if (value instanceof Book && a == null) {
                // Author not known yet: re-emit the pending book, keep this one.
                if (prevbook != null) emit(prevbook);
                prevbook = (Book) value;
            } else if (value instanceof Book && a != null) {
                // Pair the author with the current book.
                list.append(new Pair(a, (Book) value));
            } else if (value instanceof List) {
                // Partial result of an earlier reduce step.
                list.append_all(value);
                a = list.first().author;
            }
        }

        if (prevbook != null) emit(prevbook);
        if (!list.empty()) emit(list);
    }

Implementations

• LightCloud

• MongoDB

• Cassandra

LightCloud

• LightCloud is a distributed key-value store
  - Implements distributed storage.
  - “On-site” storage is provided by Tokyo Tyrant/Redis

• Tokyo Tyrant is a local key-value store
  - Implements database management functions
    - Network interface and concurrency control
    - Database replication
  - Actual storage is provided by Tokyo Cabinet

• Tokyo Cabinet
  - Implements storage of key/value pairs
  - Over a single file, for a single client.

LightCloud

[Figure: LightCloud on top of three Tokyo Tyrant nodes, each backed by Tokyo Cabinet]

Tokyo Cabinet/Tyrant

• Tokyo Cabinet/Tyrant provide a very raw interface for storing key/value pairs in a given single file

• The desired on-disk layout must be chosen
  - Extensible Hash Map, B-Tree, Fixed-size records, ...
  - Parameters of these structures can be tweaked for better performance
  - Very demanding on the user

• The API consists of get and put and a few variants
  - The data are opaque, unstructured blobs!

LightCloud

• Adds (horizontal) scalability to Tokyo Tyrant nodes by means of consistent hashing
  - Mitigates the distribution problem
  - However, no replication is performed; consistency is preferred over availability.

• The API is still get and put, over strings.

MongoDB

• MongoDB is a document-oriented database

• JSON documents

    {
        "name": "John Smith",
        "address": {
            "city": "Owatonna",
            "street": "Lily Road",
            "number": 32,
            "zip": 55060
        },
        "hobbies": [ "yodeling", "ice skating" ]
    }

Database Organisation

• Databases contain collections

• Collections contain documents and indexes

Physical layout

• Documents are stored as binary blobs (BSON)
  - Documents are opaque for the database
  - As a result of a query they are retrieved in their entirety

• Indexes are B-Trees referencing these documents.
  - They make it possible to find documents based on the values they contain without explicitly opening the whole document.

Advanced querying

• Simple queries can be performed efficiently when an index is available
  - E.g. db.employee.find({"address.city": "Owatonna"}) with an index on "address.city"

• Larger jobs can be done by means of map-reduce
  - map maps a document to the needed key-value pair.
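To illustrate what an index on "address.city" buys, the sketch below builds a lookup table from city to document positions, so that the query does not have to open every document. A Python dict stands in for the B-Tree, and the sample employees and the helper `get_path` are made up for the example; this is not MongoDB's implementation.

```python
from collections import defaultdict


def get_path(document, path):
    # Follow a dotted path such as "address.city" into nested documents.
    value = document
    for part in path.split("."):
        value = value[part]
    return value


employees = [
    {"name": "John Smith", "address": {"city": "Owatonna", "zip": 55060}},
    {"name": "Jane Doe", "address": {"city": "Duluth", "zip": 55802}},
    {"name": "Ann Lee", "address": {"city": "Owatonna", "zip": 55060}},
]

# Build the index once: indexed value -> positions of matching documents.
city_index = defaultdict(list)
for doc_id, doc in enumerate(employees):
    city_index[get_path(doc, "address.city")].append(doc_id)

# Equivalent of find({"address.city": "Owatonna"}) using the index.
matches = [employees[doc_id] for doc_id in city_index["Owatonna"]]
names = [doc["name"] for doc in matches]
```

Without the index, answering the same query means decoding and inspecting all documents of the collection.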

Advanced querying

• However, there is no facility for:
  - Joining documents
  - Quantifying over other documents (i.e. EXISTS in SQL)

• Such operations are left to the user of the database!
  - Processing outside the database is costly!
  - It is therefore important to design the data model in such a way that it returns the appropriate data directly.

Sharding

• MongoDB can shard documents over multiple servers
  - Data are split into chunks
  - A chunk has a starting and ending value.
  - A server is responsible for multiple chunks.

• Individual collections and not whole databases are sharded

• Example: Sharding Persons over the Age field on 3 servers

    Server 1    Server 2    Server 3
    1–10        11–20       22–29
    21–22       30–41       42–50
    51–72       73+

• To be efficient, each server must keep roughly the same amount of data.

• MongoDB provides automated balancing (auto-sharding) as much as possible

• Shards are created explicitly by the database administrator
  - shard = (collection, key)
  - Well chosen, can improve query performance
  - Otherwise, the load of each server can be very unbalanced
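Finding the server responsible for a given shard key value amounts to a search over the sorted chunk boundaries. The sketch below uses the chunk start values from the Age example and assumes that each chunk runs from its start value up to the start of the next chunk; the function name and this reading of the boundaries are assumptions of the sketch.

```python
import bisect

# (start_age, server) for each chunk, sorted by start value.
# Assumption: a chunk ends where the next one begins.
CHUNKS = [
    (1, 1), (11, 2), (21, 1), (22, 3),
    (30, 2), (42, 3), (51, 1), (73, 2),
]
STARTS = [start for start, _ in CHUNKS]


def server_for(age):
    # Last chunk whose start value is <= age.
    i = bisect.bisect_right(STARTS, age) - 1
    if i < 0:
        raise ValueError("age below the first chunk")
    return CHUNKS[i][1]


server_of_35 = server_for(35)
server_of_80 = server_for(80)
```

A query on the shard key can be routed to a single server this way, whereas a query on any other field has to be sent to all of them.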

Cassandra

• Introduction and history

• Data model and layout

• Distribution
  - Replication
  - Adding nodes
  - Handling problems
  - Timestamping

Cassandra — Introduction

• Created by Facebook
  - Based on Dynamo
  - Lead Dynamo engineer hired by Facebook

• Released as Apache project
  - Source code released in July 2008
  - Adopted by Apache in March 2009
  - Became high priority in February 2010

Cassandra — Data model

• Databases are conceptually two-dimensional

• Disks are one-dimensional

• Table:

    1 2
    3 4

  can be stored as either row-oriented (1, 2, 3, 4) or column-oriented (1, 3, 2, 4); Cassandra is column-oriented

• No cost for NULL entries

• Easy column creation

• Structure:
  - Column family ∼ table
  - Super column ∼ columns
  - Column ∼ column

• May be seen as a hash table with 4 or 5 dimensions:

    get(keyspace, key, column_family[, super_column], column)
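The "hash table with 4 or 5 dimensions" view can be mimicked with nested dictionaries. The sketch below follows the argument order of the get signature above; the sample keyspace, keys and column names are invented, and nothing here reflects how Cassandra lays data out on disk.

```python
def get(store, keyspace, key, column_family, *path):
    """path is (column,) or (super_column, column)."""
    if len(path) not in (1, 2):
        raise TypeError("expected column or super_column, column")
    node = store[keyspace][key][column_family]
    for name in path:
        node = node[name]
    return node


store = {
    "app": {
        "user:1000": {
            # Plain column family: column -> value
            "profile": {"name": "John Smith"},
            # Column family with super columns: super column -> column -> value
            "posts": {"2010-03": {"title": "Hello"}},
        }
    }
}

name = get(store, "app", "user:1000", "profile", "name")
title = get(store, "app", "user:1000", "posts", "2010-03", "title")
```

A missing column is simply an absent dictionary entry, which matches the remark that NULL entries cost nothing.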

Cassandra — Distribution

• CAP Theorem:
  - (Consistency)
  - Availability
  - Partition tolerance

• Design goals
  - Scalability
  - Simplicity
  - Speed
  - Uniformity between nodes

• Consistent Hashing on a ring
  - No virtual nodes
  - Random placement

Cassandra — Replication and consistency

• Availability ⇒ more than one node needs a copy of each pair
  - Responsible node chooses N other nodes to hold copies
  - Way in which those are chosen can be changed
    - Next ones on the ring, different geographic location, etc.

• Attribution table copied to each node

• Possibility of choosing R and W values

Cassandra — Timestamping

• Every piece of data has an associated timestamp

• Every key actually has an associated vector of (timestamp, value) pairs (truncated)

• Used to reach consistency with repair-on-read

• Query sequence:
  - Identify the nodes that own the data for the key
  - Route the request to the nodes and wait for the responses
  - If the reply does not arrive within the configured timeout, fail
  - Figure out the latest response based on timestamps
  - Schedule a repair if needed

• Repair algorithm can be customized

Cassandra — Adding a node

• Gossip
  - Each node must know the position of every other node (and all replicas)
  - Whenever a node moves or changes its replicas, it tells a number of other nodes, sending its whole replication table
  - Routing information thus propagates
  - Some nodes are preferred (seeds)

• When a new node is inserted, we must give it a keyspace and the address of a seed
  - It chooses its position at random
  - It contacts the seed to get a view of the current state
  - It begins to move its data

Cassandra — Problem solving

• Overloaded node
  - Causes
    - The keys are not uniformly distributed
    - Some keys are accessed more than others
    - The node runs on inferior hardware
  - Solution
    - Overloaded nodes may move on the ring

• Unresponsive node
  - Causes
    - The machine has crashed
    - There is too much latency on the network
  - Solution
    - Each node attributes a score to its neighbour
    - Inverse logarithmic scale: 1 means 10% chance to wake up, 2 means 1%, etc.
    - Define a threshold after which the node is removed

• Can be mostly automated
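Reading the scale above as "probability of waking up = 10 to the power of minus the score", the conversion in both directions is a one-liner. The function names and the threshold value used below are assumptions of this sketch, not Cassandra settings.

```python
import math


def wake_up_probability(score):
    # Score 1 -> 10%, score 2 -> 1%, and so on.
    return 10.0 ** -score


def score_for(probability):
    return -math.log10(probability)


def should_remove(score, threshold):
    # The node is removed once its score exceeds the chosen threshold.
    return score > threshold


p1 = wake_up_probability(1)
p2 = wake_up_probability(2)
removed = should_remove(score_for(0.0001), threshold=3)
```

A threshold of 3 in this sketch means giving up on a neighbour once its estimated chance of answering drops below 0.1%.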