Page 1: CS 525 Advanced Distributed Systems  Spring 2014


CS 525 Advanced Distributed Systems

Spring 2014

Indranil Gupta (Indy)

Feb 11, 2014

Key-value/NoSQL Stores

Lecture 7

© 2014, I. Gupta

Based mostly on:
• Cassandra NoSQL presentation
• Cassandra 1.0 documentation at datastax.com
• Cassandra Apache project wiki
• HBase

Page 2: CS 525 Advanced Distributed Systems  Spring 2014


Cassandra

• Originally designed at Facebook

• Open-sourced

• Some of its myriad users:

• With this many users, one would think its design is very complex

– Let's find out!

Page 3: CS 525 Advanced Distributed Systems  Spring 2014


Why Key-value Store?

• (Business) Key -> Value

• (twitter.com) tweet id -> information about tweet

• (amazon.com) item number -> information about it

• (kayak.com) Flight number -> information about flight, e.g., availability

• (yourbank.com) Account number -> information about it

• Search is usually built on top of a key-value store

Page 4: CS 525 Advanced Distributed Systems  Spring 2014


Isn't that just a database?

• Yes, sort of

• Relational Databases (RDBMSs) have been around for ages

• MySQL is the most popular among them

• Data stored in tables

• Schema-based, i.e., structured tables

• Queried using SQL (Structured Query Language)

SQL queries: SELECT user_id FROM users WHERE username = 'jbellis'


Page 5: CS 525 Advanced Distributed Systems  Spring 2014


Mismatch with today’s workloads

• Data: Large and unstructured

• Lots of random reads and writes

• Foreign keys rarely needed

• Need:
– Speed

– Avoid Single point of Failure (SPoF)

– Low TCO (Total Cost of Ownership) and fewer sysadmins

– Incremental Scalability

– Scale out, not up: use more commodity off-the-shelf (COTS) machines, not more powerful machines

Page 6: CS 525 Advanced Distributed Systems  Spring 2014


CAP Theorem

• Proposed by Eric Brewer (Berkeley)

• Subsequently proved by Gilbert and Lynch

• In a distributed system you can satisfy at most 2 out of the 3 guarantees:

1. Consistency: all nodes see same data at any time, or reads return latest written value

2. Availability: the system allows operations all the time

3. Partition-tolerance: the system continues to work in spite of network partitions

• Cassandra
– Eventual (weak) consistency, Availability, Partition-tolerance

• Traditional RDBMSs
– Strong consistency over availability under a partition

Page 7: CS 525 Advanced Distributed Systems  Spring 2014


CAP Tradeoff

• Starting point for NoSQL Revolution

• Conjectured by Eric Brewer in 2000, proved by Gilbert and Lynch in 2002

• A distributed storage system can achieve at most two of C, A, and P.

• When partition-tolerance is important, you have to choose between consistency and availability

(CAP triangle figure: the corners are Consistency, Availability, and Partition-tolerance. RDBMSs sit on the Consistency + Availability side; Cassandra, RIAK, Dynamo, and Voldemort on the Availability + Partition-tolerance side; HBase, HyperTable, BigTable, and Spanner on the Consistency + Partition-tolerance side.)

Page 8: CS 525 Advanced Distributed Systems  Spring 2014


Eventual Consistency

• If all writers stop (to a key), then all its values (replicas) will converge eventually.

• If writes continue, then system always tries to keep converging.

– Moving “wave” of updated values lagging behind the latest values sent by clients, but always trying to catch up.

• May return stale values to clients (e.g., if many back-to-back writes).

• But works well when there are periods of low writes; the system converges quickly during them.

Page 9: CS 525 Advanced Distributed Systems  Spring 2014


Cassandra – Data Model

• Column Families:

– Like SQL tables

– but may be unstructured (client-specified)

– Can have index tables

• “Column-oriented databases”/ “NoSQL”

– Columns stored together, rather than rows

– No schemas

» Some columns missing from some entries

– NoSQL = “Not Only SQL”

– Supports get(key) and put(key, value) operations

– Often write-heavy workloads
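To make the data model concrete, here is a minimal sketch in Python (purely illustrative; not Cassandra's actual API or storage layout) of a schema-less column family supporting put and get, where different rows may carry different columns:

# Minimal sketch of a schema-less column family (illustrative, not Cassandra's API):
# each row key maps to an arbitrary set of columns.
class ColumnFamily:
    def __init__(self):
        self.rows = {}                           # row key -> {column name: value}

    def put(self, key, column, value):
        self.rows.setdefault(key, {})[column] = value

    def get(self, key):
        return self.rows.get(key, {})            # missing columns are simply absent

cf = ColumnFamily()
cf.put("tweet:42", "text", "hello")
cf.put("tweet:42", "user", "indy")
cf.put("tweet:43", "text", "bye")                # no "user" column here: allowed
print(cf.get("tweet:42"))                        # {'text': 'hello', 'user': 'indy'}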

Page 10: CS 525 Advanced Distributed Systems  Spring 2014


Let's go Inside Cassandra: Key -> Server Mapping

• How do you decide which server(s) a key-value pair resides on?

Page 11: CS 525 Advanced Distributed Systems  Spring 2014


Cassandra uses a Ring-based DHT, but without routing. The key -> server mapping is the "Partitioning Function". (Remember this?)

(Ring figure, with m = 7: servers N16, N32, N45, N80, N96, and N112 sit on a ring of 2^7 points starting at 0. A coordinator (typically one per DC) receives a read/write of key K13 and forwards it to the primary replica for key K13 and to its backup replicas on the ring.)
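A hedged sketch of this partitioning function, using the server tokens from the figure (the hash function, MD5 mod 2^7, is an assumption for illustration, not Cassandra's real partitioner):

import hashlib
from bisect import bisect_right

# Illustrative ring-based partitioning: hash the key onto an m-bit ring and
# walk clockwise to the first server token; further replicas follow on the ring.
M = 7                                            # ring of 2^7 points, as in the figure
TOKENS = sorted([16, 32, 45, 80, 96, 112])       # server positions N16 .. N112

def ring_hash(key, m=M):
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % (2 ** m)

def replicas_for(key, n_replicas=3):
    h = ring_hash(key)
    i = bisect_right(TOKENS, h) % len(TOKENS)    # first token clockwise of the hash
    return [TOKENS[(i + j) % len(TOKENS)] for j in range(n_replicas)]

print(replicas_for("K13"))                       # primary token plus two backups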

Page 12: CS 525 Advanced Distributed Systems  Spring 2014


Writes

• Need to be lock-free and fast (no reads or disk seeks)

• Client sends write to one front-end node in Cassandra cluster

– Front-end = Coordinator, assigned per key

• The coordinator (via the Partitioning function) sends it to all replica nodes responsible for the key

– Always writable: Hinted Handoff mechanism

» If any replica is down, the coordinator writes to all other replicas, and keeps the write locally until down replica comes back up.

» When all replicas are down, the Coordinator (front end) buffers writes (for up to a few hours).

– Provides Atomicity for a given key (i.e., within a ColumnFamily)

• One ring per datacenter
– Per-DC coordinator elected to coordinate with other DCs

– Election done via Zookeeper, which runs a Paxos variant
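A rough sketch of the coordinator's write path with hinted handoff, under simplifying assumptions (no timeouts, acknowledgements, or replay triggers; class and method names are made up for illustration):

# Sketch only: the coordinator writes to every live replica and keeps a "hint"
# locally for each replica that is down, replaying it when that replica recovers.
class ReplicaNode:
    def __init__(self, alive=True):
        self.alive = alive
        self.data = {}

    def write(self, key, value):
        self.data[key] = value

class Coordinator:
    def __init__(self, replicas):
        self.replicas = replicas
        self.hints = []                          # (down replica, key, value)

    def write(self, key, value):
        for r in self.replicas:
            if r.alive:
                r.write(key, value)
            else:
                self.hints.append((r, key, value))

    def replay_hints(self):
        for r, key, value in list(self.hints):
            if r.alive:
                r.write(key, value)
                self.hints.remove((r, key, value))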

Page 13: CS 525 Advanced Distributed Systems  Spring 2014


Writes at a replica node

On receiving a write

1. Log it in the disk commit log

2. Make changes to the appropriate memtables
– In-memory representation of multiple key-value pairs

• Later, when a memtable is full or old, flush it to disk
– Data File: an SSTable (Sorted String Table), a list of key-value pairs sorted by key

– Index file: An SSTable of (key, position in data sstable) pairs

– And a Bloom filter (for efficient search) – next slide

• Compaction: data updates accumulate over time, so SSTables and logs need to be compacted

– Merge SSTables, e.g., by merging updates for a key

– Run periodically and locally at each server

• Reads need to touch the log and multiple SSTables
– A row may be split over multiple SSTables

– Reads may be slower than writes
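A simplified sketch of this per-replica write path (commit log, memtable, flush to a key-sorted SSTable); the flush threshold and in-memory structures are illustrative assumptions, not Cassandra's on-disk format:

# Sketch of a replica's write path: append to the commit log, update the
# memtable, and flush the memtable to a sorted SSTable when it grows too large.
class ReplicaStore:
    def __init__(self, memtable_limit=4):
        self.commit_log = []
        self.memtable = {}
        self.sstables = []                       # each SSTable: list of (key, value), sorted by key
        self.limit = memtable_limit

    def write(self, key, value):
        self.commit_log.append((key, value))     # 1. log it on disk (durability)
        self.memtable[key] = value               # 2. update the in-memory memtable
        if len(self.memtable) >= self.limit:
            self.flush()

    def flush(self):
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}

store = ReplicaStore()
for i in range(5):
    store.write(f"k{i}", i)
print(store.sstables)                            # first flushed SSTable, sorted by key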

Page 14: CS 525 Advanced Distributed Systems  Spring 2014


Bloom Filter

• Compact way of representing a set of items

• Checking for existence in set is cheap

• Some probability of false positives: an item not in set may check true as being in set

• Never false negatives

(Figure: a large bit map with positions indexed 0 to 127; k hash functions Hash1, Hash2, …, Hashk map Key-K to bit positions, e.g. 69, 111, and 127.)

• On insert, set all hashed bits.

• On check-if-present, return true if all hashed bits are set.

• False positives: the false positive rate is low. With k = 4 hash functions, 100 items, and 3200 bits, the FP rate is about 0.02%.
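A minimal Bloom filter sketch with the slide's parameters (3200 bits, k = 4 hash functions); deriving the k hash values from salted MD5 is an illustrative assumption:

import hashlib

# Bloom filter sketch: k hashed bit positions per item; false positives are
# possible, false negatives are not.
class BloomFilter:
    def __init__(self, m_bits=3200, k=4):
        self.m, self.k = m_bits, k
        self.bits = [0] * m_bits

    def _positions(self, item):
        for i in range(self.k):
            h = hashlib.md5(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def insert(self, item):
        for p in self._positions(item):
            self.bits[p] = 1                     # set all k hashed bits

    def might_contain(self, item):
        return all(self.bits[p] for p in self._positions(item))

bf = BloomFilter()
bf.insert("K13")
print(bf.might_contain("K13"))                   # True (never a false negative)
print(bf.might_contain("K99"))                   # almost certainly False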

Page 15: CS 525 Advanced Distributed Systems  Spring 2014


Deletes and Reads

• Delete: don't delete item right away
– Add a tombstone to the log

– Compaction will eventually remove tombstone and delete item

• Read: Similar to writes, except– Coordinator can contacts a number of replicas (e.g., in same rack) specified by

consistency level

» Forwards read to replicas that have responded quickest in past

» Returns latest timestamp value

– Coordinator also fetches value from multiple replicas

» check consistency in the background, initiating a read-repair if any two values are different

» Brings all replicas up to date

– A row may be split across multiple SSTables => reads need to touch multiple SSTables => reads slower than writes (but still fast)
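A small sketch of the read resolution and read-repair step just described (ReplicaStub and write_back are made-up names for illustration):

# Pick the value with the latest timestamp among replica responses and push it
# back to any replica that returned an older value (read repair).
class ReplicaStub:
    def __init__(self):
        self.repaired = None

    def write_back(self, ts, value):
        self.repaired = (ts, value)

def resolve_and_repair(responses, replicas):
    # responses: {replica id: (timestamp, value)}
    winner_ts, winner_val = max(responses.values())
    for rid, (ts, _) in responses.items():
        if ts < winner_ts:
            replicas[rid].write_back(winner_ts, winner_val)
    return winner_val

replicas = {"r1": ReplicaStub(), "r2": ReplicaStub()}
print(resolve_and_repair({"r1": (105, "new"), "r2": (100, "old")}, replicas))  # "new"; r2 is repaired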

Page 16: CS 525 Advanced Distributed Systems  Spring 2014


Cassandra uses Quorums

• Quorum = a way of selecting sets so that any pair of sets intersects
– E.g., any arbitrary set with at least Q = N/2 + 1 nodes
– Where N = total number of replicas for this key

• Reads
– Wait for R replicas (R specified by clients)

– In the background, check for consistency of remaining N-R replicas, and initiate read repair if needed

• Writes come in two default flavors
– Block until quorum is reached

– Async: Write to any node

• R = read replica count, W = write replica count

• If W+R > N and W > N/2, you have consistency, i.e., each read returns the latest written value

• Reasonable: (W=1, R=N) or (W=N, R=1) or (W=Q, R=Q)
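A quick check of the two conditions above, applied to the "reasonable" settings for N = 3:

# W + R > N makes every read quorum overlap every write quorum (reads see the
# latest write); W > N/2 additionally makes any two write quorums overlap.
def read_sees_latest_write(n, w, r):
    return w + r > n

def write_quorums_overlap(n, w):
    return w > n / 2

N = 3
for w, r in [(1, 3), (3, 1), (2, 2)]:            # (W=1,R=N), (W=N,R=1), (W=Q,R=Q)
    print((w, r), read_sees_latest_write(N, w, r), write_quorums_overlap(N, w))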

Page 17: CS 525 Advanced Distributed Systems  Spring 2014


Cassandra Consistency Levels

• In reality, a client can choose one of these levels for a read/write operation:

– ANY: any node (may not be replica)

– ONE: at least one replica

– Similarly, TWO, THREE

– QUORUM: quorum across all replicas in all datacenters

– LOCAL_QUORUM: in coordinator’s DC

– EACH_QUORUM: quorum in every DC

– ALL: all replicas all DCs

• For writes, you also have:
– SERIAL: claims to implement linearizability for transactions

Page 18: CS 525 Advanced Distributed Systems  Spring 2014


Cluster Membership – Gossip-Style

Protocol:

• Nodes periodically gossip their membership list

• On receipt, the local membership list is updated, as shown in the figure

• If any heartbeat is older than Tfail, the node is marked as failed

(Figure: each membership list entry is (address, heartbeat counter, local time). At current time 70 at node 2 (clocks are asynchronous), node 2 holds (1, 10118, 64), (2, 10110, 64), (3, 10090, 58), (4, 10111, 65) and receives a gossiped list (1, 10120, 66), (2, 10103, 62), (3, 10098, 63), (4, 10111, 65); it keeps the higher heartbeat for each address and stamps updated entries with its local time, yielding (1, 10120, 70), (2, 10110, 64), (3, 10098, 70), (4, 10111, 65). Fig and animation by: Dongyun Jin and Thuy Ngyuen.)

Cassandra uses gossip-based cluster membership
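A sketch of the merge step the figure illustrates, assuming each membership list maps an address to (heartbeat counter, local time); the numbers below are the ones from the figure:

# On receiving a gossiped list, keep the higher heartbeat for each address and
# stamp updated entries with the local time.
def merge(local, received, now):
    for addr, (hb, _) in received.items():
        if addr not in local or hb > local[addr][0]:
            local[addr] = (hb, now)
    return local

node2  = {1: (10118, 64), 2: (10110, 64), 3: (10090, 58), 4: (10111, 65)}
gossip = {1: (10120, 66), 2: (10103, 62), 3: (10098, 63), 4: (10111, 65)}
print(merge(node2, gossip, now=70))
# -> {1: (10120, 70), 2: (10110, 64), 3: (10098, 70), 4: (10111, 65)}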

Page 19: CS 525 Advanced Distributed Systems  Spring 2014


Gossip-Style Failure Detection

• If the heartbeat has not increased for more than Tfail seconds (according to local time), the member is considered failed

• But don’t delete it right away

• Wait an additional Tfail seconds, then delete the member from the list

– Why?

Page 20: CS 525 Advanced Distributed Systems  Spring 2014


Gossip-Style Failure Detection

• What if an entry pointing to a failed process is deleted right after Tfail (= 24) seconds?

• Fix: remember for another Tfail

• Ignore gossips for failed members
– Don't include failed members in gossip messages

(Figure: current time is 75 at process 2, which has already deleted the entry for failed process 3. Another process still gossips the stale entry (3, 10098); if process 2 had not remembered the deletion, it would re-insert process 3 with heartbeat 10098 and a fresh local time of 75, so the failed entry would never die out.)

Page 21: CS 525 Advanced Distributed Systems  Spring 2014


Cluster Membership, contd.

• Suspicion mechanisms to adaptively set the timeout

• Accrual detector: Failure Detector outputs a value (PHI) representing suspicion

• Apps set an appropriate threshold

• PHI = 5 => 10-15 sec detection time

• PHI calculation for a member
– Based on inter-arrival times of gossip messages
– PHI(t) = -log10( P(t_now - t_last) ), where P comes from the CDF of observed gossip inter-arrival times
– PHI basically determines the detection timeout, but takes into account historical inter-arrival time variations for gossiped heartbeats
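A hedged sketch of the PHI computation, assuming (purely for illustration) that heartbeat inter-arrival times are exponentially distributed with the observed mean; the real detector fits the observed inter-arrival distribution and may use a different model:

import math

# PHI = -log10 of the probability that a heartbeat would take at least this
# long, so PHI grows as the silence since the last heartbeat gets longer.
def phi(t_now, t_last, inter_arrival_history):
    mean = sum(inter_arrival_history) / len(inter_arrival_history)
    elapsed = t_now - t_last
    p_at_least_this_long = math.exp(-elapsed / mean)   # exponential-model assumption
    return -math.log10(p_at_least_this_long)

history = [1.0, 1.2, 0.9, 1.1]                   # recent gossip inter-arrival times (s)
print(phi(t_now=10.0, t_last=5.0, inter_arrival_history=history))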

Page 22: CS 525 Advanced Distributed Systems  Spring 2014


Data Placement Strategies

• Replication Strategy: two options:
1. SimpleStrategy
2. NetworkTopologyStrategy

• SimpleStrategy: uses the Partitioner, of which there are two kinds:
– RandomPartitioner: Chord-like hash partitioning
– ByteOrderedPartitioner: assigns ranges of keys to servers
» Easier for range queries (e.g., get me all twitter users starting with [a-b])

• NetworkTopologyStrategy: for multi-DC deployments

– Two replicas per DC: allows a consistency level of ONE

– Three replicas per DC: allows a consistency level of LOCAL_QUORUM

– Per DC

» First replica placed according to Partitioner

» Then go clockwise around the ring until you hit a different rack (see the sketch below)
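An illustrative sketch of this per-DC placement rule (the ring, tokens, and rack assignments below are made-up values): place the first replica where the partitioner says, then walk clockwise preferring nodes on racks not yet used.

# Walk the ring clockwise from the partitioner-chosen node, picking nodes on
# new racks first; fall back to already-used racks if there are too few racks.
def place_replicas(ring, first_index, n_replicas, rack_of):
    chosen, racks_used, skipped = [], set(), []
    for step in range(len(ring)):
        node = ring[(first_index + step) % len(ring)]
        if rack_of[node] not in racks_used:
            chosen.append(node)
            racks_used.add(rack_of[node])
        else:
            skipped.append(node)
        if len(chosen) == n_replicas:
            return chosen
    return chosen + skipped[: n_replicas - len(chosen)]

ring  = ["N16", "N32", "N45", "N80", "N96", "N112"]
racks = {"N16": "r1", "N32": "r1", "N45": "r2", "N80": "r2", "N96": "r3", "N112": "r3"}
print(place_replicas(ring, first_index=1, n_replicas=3, rack_of=racks))
# -> ['N32', 'N45', 'N96']: first by the partitioner, the rest on new racks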

Page 23: CS 525 Advanced Distributed Systems  Spring 2014


Snitches

• Maps IPs to racks and DCs. Configured in the cassandra.yaml config file

• Some options:
– SimpleSnitch: Unaware of Topology (Rack-unaware)

– RackInferring: Assumes topology of network by octet of server’s IP address

» 101.201.301.401 = x.<DC octet>.<rack octet>.<node octet>

– PropertyFileSnitch: uses a config file

– EC2Snitch: uses EC2.

» EC2 Region = DC

» Availability zone = rack

• Other snitch options available
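A toy sketch of a RackInferring-style mapping, assuming (as in the slide's example) that the second octet names the DC and the third names the rack; the IP and naming scheme are illustrative only:

# Derive a (DC, rack) location from an IP address of the form
# x.<DC octet>.<rack octet>.<node octet>.
def infer_location(ip):
    _, dc, rack, _ = ip.split(".")
    return {"dc": f"DC{dc}", "rack": f"rack{rack}"}

print(infer_location("10.201.30.40"))            # {'dc': 'DC201', 'rack': 'rack30'}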

Page 24: CS 525 Advanced Distributed Systems  Spring 2014


Vs. SQL

• MySQL is one of the most popular (and has been for a while)

• On > 50 GB data

• MySQL:
– Writes 300 ms avg
– Reads 350 ms avg

• Cassandra:
– Writes 0.12 ms avg
– Reads 15 ms avg

Page 25: CS 525 Advanced Distributed Systems  Spring 2014


Cassandra Summary

• While RDBMSs provide ACID (Atomicity, Consistency, Isolation, Durability)

• Cassandra provides BASE
– Basically Available, Soft-state, Eventual consistency

– Prefers Availability over consistency

• Other NoSQL products
– MongoDB, Riak (look them up!)

• Next: HBase
– Prefers (strong) Consistency over Availability

Page 26: CS 525 Advanced Distributed Systems  Spring 2014


HBase

• Google's BigTable was the first "blob-based" storage system

• An open-source clone of it was developed (with major contributions from Yahoo!) -> HBase

• Major Apache project today

• Facebook uses HBase internally

• API:
– Get/Put(row)
– Scan(row range, filter) – range queries
– MultiPut
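An abstract sketch of this API (not the real HBase client library): get and put by row key, plus a scan over a row-key range with an optional filter.

# Toy row store keyed by row key; scan iterates rows in sorted key order.
class TinyTable:
    def __init__(self):
        self.rows = {}                           # row key -> {column: value}

    def put(self, row, columns):
        self.rows.setdefault(row, {}).update(columns)

    def get(self, row):
        return self.rows.get(row, {})

    def scan(self, start, stop, flt=lambda row, cols: True):
        for row in sorted(self.rows):
            if start <= row < stop and flt(row, self.rows[row]):
                yield row, self.rows[row]

t = TinyTable()
t.put("user#alice", {"age": "30"})
t.put("user#bob", {"age": "25"})
print(list(t.scan("user#a", "user#c", flt=lambda r, c: int(c["age"]) > 28)))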

Page 27: CS 525 Advanced Distributed Systems  Spring 2014


HBase Architecture

Source: http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html

ZooKeeper: a small group of servers running Zab, a consensus protocol (Paxos-like)

HDFS

Page 28: CS 525 Advanced Distributed Systems  Spring 2014


HBase Storage hierarchy

• HBase Table
– Split into multiple regions: replicated across servers
» One Store per combination of ColumnFamily (subset of columns with similar query patterns) + region

• Memstore for each Store: in-memory updates to the Store; flushed to disk when full
– StoreFiles for each Store for each region: where the data lives
» Blocks

• HFile
– SSTable from Google's BigTable

Page 29: CS 525 Advanced Distributed Systems  Spring 2014


HFile

Source: http://blog.cloudera.com/blog/2012/06/hbase-io-hfile-input-output/

(HFile example annotation for a census table: row key SSN:000-01-2345, column family Demographic, column Ethnicity.)

Page 30: CS 525 Advanced Distributed Systems  Spring 2014


Strong Consistency: HBase Write-Ahead Log

Write to HLog before writing to MemStore. Thus can recover from failure.

Source: http://www.larsgeorge.com/2010/01/hbase-architecture-101-write-ahead-log.html

Page 31: CS 525 Advanced Distributed Systems  Spring 2014


Log Replay

• After recovery from failure, or upon bootup (HRegionServer/HMaster)

– Replay any stale logs (use timestamps to find out where the database is w.r.t. the logs)

– Replay: add edits to the MemStore

• Keeps one HLog per HRegionServer rather than per region

– Avoids many concurrent writes, which on the local file system may involve many disk seeks
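A simplified sketch of write-ahead logging and log replay as described on the last two slides (not HBase code): every edit is appended to the shared log before it touches the MemStore, so the MemStore can be rebuilt after a crash by replaying edits newer than the last flush.

# One shared HLog per server; replay() rebuilds the MemStore from logged edits
# whose timestamps are newer than what has already been flushed.
class RegionServer:
    def __init__(self):
        self.hlog = []                           # (timestamp, row, value)
        self.memstore = {}
        self.flushed_up_to = 0

    def write(self, ts, row, value):
        self.hlog.append((ts, row, value))       # 1. log first, for durability
        self.memstore[row] = (ts, value)         # 2. then apply in memory

    def replay(self):
        self.memstore = {}
        for ts, row, value in self.hlog:
            if ts > self.flushed_up_to:
                self.memstore[row] = (ts, value)

rs = RegionServer()
rs.write(1, "row1", "a")
rs.write(2, "row2", "b")
rs.memstore = {}                                 # pretend a crash lost the in-memory state
rs.replay()
print(rs.memstore)                               # both edits recovered from the HLog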

Page 32: CS 525 Advanced Distributed Systems  Spring 2014


Cross-data center replication

(Figure: HLog edits are shipped to peer clusters for replication.)

Zookeeper is actually a file system for control information:
1. /hbase/replication/state
2. /hbase/replication/peers/<peer cluster number>
3. /hbase/replication/rs/<hlog>

Page 33: CS 525 Advanced Distributed Systems  Spring 2014


Wrap Up

• Key-value stores and NoSQL trade off between Availability and Consistency, while providing partition-tolerance

• Presentations and reviews start this Thursday
– Presenters: please email me slides at least 24 hours before your presentation time

– Everyone: please read instructions on website

– If you have not signed up for a presentation slot, you’ll need to do all reviews.

– Project meetings starting soon.