
Consistency, Availability, Partition: Make Your Choice


Andrea Giuliano (@bit_shark)

DISTRIBUTED SYSTEMS

WHAT A DISTRIBUTED SYSTEM IS

“A distributed system is a software system in which components located on networked computers communicate and coordinate their actions by passing messages”

DISTRIBUTED SYSTEMS: EXAMPLES

DISTRIBUTED SYSTEMS: REPLICATION

REPLICATED SERVICE PROPERTIES

CONSISTENCY

AVAILABILITY

CONSISTENCY

The result of operations is predictable.

CONSISTENCY

Strong consistency: all replicas return the same value for the same object.

Weak consistency: different replicas can return different values for the same object.

STRONG VS WEAK CONSISTENCY

Strong consistency: ACID (Atomic, Consistent, Isolated, Durable) databases.

Weak consistency: BASE (Basically Available, Soft-state, Eventually consistent) databases.

CONSISTENCY: EXAMPLE

put(price, 10)

get(price) → price = 10
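The put/get exchange above can be sketched with a toy replicated store. This is a minimal illustration, not a real database API: the class and method names are made up, and "strong consistency" is modeled by synchronously applying every write to all replicas before returning.

```python
class Replica:
    """One copy of the data."""
    def __init__(self):
        self.data = {}

class ReplicatedStore:
    """Strongly consistent toy store: put() updates every replica before
    returning, so a later get() sees the same value on all replicas."""
    def __init__(self, n_replicas=3):
        self.replicas = [Replica() for _ in range(n_replicas)]

    def put(self, key, value):
        for r in self.replicas:          # synchronous write to all replicas
            r.data[key] = value

    def get(self, key):
        values = {r.data.get(key) for r in self.replicas}
        assert len(values) == 1          # strong consistency: one value everywhere
        return values.pop()

store = ReplicatedStore()
store.put("price", 10)
print(store.get("price"))  # 10
```

A weakly consistent store would let put() return after updating only some replicas, so the assertion in get() could fail until the replicas converge.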

AVAILABILITY

AVAILABILITY: EXAMPLE

COMMUNICATION

PARTITION TOLERANCE

The system continues to operate even in the presence of network partitions.

Network failure: nodes are split into groups on each side of a faulty network element (switch, backbone).

Process failure: the system splits into two groups, correct nodes and crashed nodes.

CAP THEOREM

“Of three properties of shared-data systems (data consistency, system availability and tolerance to network partitions) only two can be achieved at any given moment in time.”

CAP THEOREM: THE PROOF

Diagram: at t1 a client issues put(price, 10) in partition 1; at t2 a client issues get(price) in partition 2. The update cannot cross the partition, so the replicas in partition 2 still hold price = 0: they either answer with the stale value (not consistent) or give no response (not available).

CAP THEOREM

CONSISTENCY + PARTITION TOLERANCE (CP)

➡ distributed databases ➡ distributed locking ➡ majority protocols ➡ active/passive replication ➡ quorum-based systems

Example: BigTable

CAP THEOREM: IN PRACTICE

CAP THEOREM

AVAILABILITY + PARTITION TOLERANCE (AP)

➡ web caches ➡ stateless systems ➡ DNS

Example: DynamoDB

CAP THEOREM

CONSISTENCY + AVAILABILITY (CA)

➡ single-site databases ➡ cluster databases ➡ LDAP

DYNAMO

DYNAMO: REQUIREMENTS

“customers should be able to view and add items to their shopping cart even if disks are failing, network routes are flapping, or data centers are being destroyed by tornados.”

➡ reliable ➡ highly scalable ➡ always available

DYNAMO: SIMPLE INTERFACE

get(key): locates the object replicas associated with the key and returns a single object, or a list of objects with conflicting versions, along with a context.

put(key, context, object): determines where the replicas of the object should be placed based on the associated key. The context includes information such as the version of the object.
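The two-operation interface can be sketched over an in-memory dict. The class name, the version counter, and the shape of the context are all stand-ins for illustration; the real Dynamo tracks versions with vector clocks and distributes data across nodes.

```python
import itertools

class TinyDynamo:
    """In-memory stand-in for Dynamo's get/put interface (illustrative only)."""
    def __init__(self):
        self.versions = {}                # key -> {version_id: object}
        self._counter = itertools.count(1)

    def get(self, key):
        # Returns one object, or several conflicting versions, plus a
        # context naming the versions the caller has now seen.
        vs = self.versions.get(key, {})
        return list(vs.values()), {"seen": list(vs)}

    def put(self, key, context, obj):
        # The context tells the store which versions this write supersedes.
        vs = self.versions.setdefault(key, {})
        for vid in context.get("seen", []):
            vs.pop(vid, None)             # drop superseded versions
        vs[next(self._counter)] = obj

db = TinyDynamo()
db.put("cart", {"seen": []}, ["book"])
objs, ctx = db.get("cart")
db.put("cart", ctx, objs[0] + ["pen"])    # read-modify-write using the context
print(db.get("cart")[0])                  # [['book', 'pen']]
```

Note how the context threads through the read-modify-write cycle: a put() whose context omits a stored version leaves that version in place as a conflict for a later get() to report.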

DYNAMO: REPLICATION, THE CHOICE

Synchronous replica coordination ‣ strong consistency ‣ trades off availability

Optimistic replication ‣ high availability ‣ higher probability of conflicts

DYNAMO: CONFLICTS, WHEN?

At write time ‣ risk of rejecting writes

At read time ‣ enables an “always writable” datastore

DYNAMO: CONFLICTS, WHO?

The data store ‣ e.g. a “last write wins” policy

The application ‣ resolution as an implementation detail

DYNAMO: A RING TO RULE THEM ALL

DYNAMO: PARTITIONING, THE RING

Diagram: nodes A, B, C, D, E, F, G placed on a ring; a data item is hashed to a position on the ring and stored on the first node encountered walking clockwise.

DYNAMO: REPLICATION

Diagram: the same ring of nodes A–G; each key is replicated on N nodes. With N = 3, node D stores the keys in the ranges (A, B], (B, C], and (C, D].
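The ring can be sketched with consistent hashing. Here node positions come from hashing node names (so the ranges differ from the slide's idealized ring), and `preference_list` is a hypothetical helper name; it returns the coordinator plus the next N − 1 distinct nodes clockwise, as described above.

```python
import hashlib
from bisect import bisect_right

NODES = list("ABCDEFG")

def h(s):
    """Hash a string to a position on the ring."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

ring = sorted((h(n), n) for n in NODES)   # node positions, in ring order

def preference_list(key, n=3):
    """Coordinator = first node clockwise from hash(key);
    the key is replicated on the next n-1 nodes after it."""
    pos = bisect_right([p for p, _ in ring], h(key)) % len(ring)
    return [ring[(pos + i) % len(ring)][1] for i in range(n)]

print(preference_list("price"))  # three consecutive nodes on the ring
```

Because each node appears once on the ring, the three entries are always distinct; real Dynamo places multiple virtual nodes per physical node to smooth out load.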

DYNAMO: DATA VERSIONING

put() may return before the update has been propagated to all replicas.

A subsequent get() may return an object that does not have the latest update.

DYNAMO: RECONCILIATION

Syntactic reconciliation ‣ the new version subsumes the previous one

Semantic reconciliation ‣ conflicting versions of the same object

DYNAMO: VECTOR CLOCKS

Definition ‣ a list of (node, counter) pairs

‣ write handled by Sx → D1: [Sx,1]
‣ write handled by Sx → D2: [Sx,2]
‣ write handled by Sy → D3: [Sx,2], [Sy,1]
‣ write handled by Sz (concurrent with D3) → D4: [Sx,2], [Sz,1]
‣ reconciled and written by Sx → D5: [Sx,3], [Sy,1], [Sz,1]
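The D1…D5 sequence above can be replayed with a minimal vector clock, represented here as a dict from node name to counter (the function names are illustrative):

```python
def increment(clock, node):
    """New clock after `node` handles a write."""
    c = dict(clock)
    c[node] = c.get(node, 0) + 1
    return c

def descends(a, b):
    """True if the version with clock `a` subsumes the one with clock `b`."""
    return all(a.get(n, 0) >= k for n, k in b.items())

def concurrent(a, b):
    """Neither version subsumes the other: semantic reconciliation needed."""
    return not descends(a, b) and not descends(b, a)

d1 = increment({}, "Sx")            # D1: {'Sx': 1}
d2 = increment(d1, "Sx")            # D2: {'Sx': 2}
d3 = increment(d2, "Sy")            # D3: {'Sx': 2, 'Sy': 1}
d4 = increment(d2, "Sz")            # D4: {'Sx': 2, 'Sz': 1}
print(concurrent(d3, d4))           # True: D3 and D4 conflict
d5 = increment({**d3, **d4}, "Sx")  # D5: {'Sx': 3, 'Sy': 1, 'Sz': 1}
print(descends(d5, d3) and descends(d5, d4))  # True: D5 subsumes both
```

D3 and D4 both descend from D2 but not from each other, which is exactly the case the slide labels as needing reconciliation; D5 merges their clocks and bumps the coordinator's counter.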

DYNAMO: PUT() AND GET()

R ‣ the minimum number of nodes that must participate in a successful read operation

W ‣ the minimum number of nodes that must participate in a successful write operation

DYNAMO: PUT() AND GET()

put() ‣ the coordinator generates the vector clock for the new version and writes the new version locally ‣ the new version is sent to N nodes ‣ the write is successful if at least W - 1 nodes respond

get() ‣ the coordinator requests all existing versions of the data ‣ the coordinator waits for R responses before returning the result ‣ the coordinator returns all causally unrelated versions ‣ divergent versions are reconciled and written back
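The point of choosing R and W is that R + W > N forces every read quorum to overlap every write quorum in at least one node, so a read always sees at least one up-to-date replica. A brute-force check of that claim for the common N = 3, R = 2, W = 2 configuration:

```python
# Quorum arithmetic: with N replicas, R + W > N guarantees every read
# quorum shares at least one node with every write quorum.
from itertools import combinations

N, R, W = 3, 2, 2
replicas = set(range(N))

overlap_always = all(
    set(w) & set(r)                       # intersection must be non-empty
    for w in combinations(replicas, W)    # every possible write quorum
    for r in combinations(replicas, R)    # every possible read quorum
)
print(overlap_always)  # True: R + W = 4 > N = 3
```

With R + W ≤ N (say R = 1, W = 1, N = 3) disjoint quorums exist, and a read can miss the latest write entirely.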

DYNAMO: SLOPPY QUORUM

Diagram: the ring of nodes A–G with N = 3; when a node in a key's preference list is unreachable, operations fall back to the first N healthy nodes encountered walking the ring.
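The fallback can be sketched by extending the ring walk to skip unreachable nodes. As before, node positions come from hashing node names rather than the slide's idealized ring, and `healthy_preference_list` is a hypothetical helper name:

```python
import hashlib
from bisect import bisect_right

NODES = list("ABCDEFG")
ring = sorted((int(hashlib.md5(n.encode()).hexdigest(), 16), n) for n in NODES)

def healthy_preference_list(key, down, n=3):
    """Walk clockwise from hash(key), skipping nodes in `down`,
    until n healthy nodes have been collected (sloppy quorum)."""
    pos = bisect_right([p for p, _ in ring],
                       int(hashlib.md5(key.encode()).hexdigest(), 16))
    picked = []
    for i in range(len(ring)):
        node = ring[(pos + i) % len(ring)][1]
        if node not in down:
            picked.append(node)
        if len(picked) == n:
            break
    return picked

strict = healthy_preference_list("price", down=set())
sloppy = healthy_preference_list("price", down={strict[0]})
print(strict, sloppy)  # the downed node is replaced by the next one on the ring
```

This is why a write still succeeds when a preference-list node is down: another node holds the data until the intended node recovers, which is the behavior the next slide cites as evidence that Dynamo is AP.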

DYNAMO: WHY IS IT AP?

‣ requests are served even if some replicas are not available

‣ if a node is down, the write is stored on another node

‣ consistency conflicts are resolved at read time or in the background

‣ eventually, all the replicas converge

‣ concurrent read/write operations can make distinct clients see distinct versions of the same key

BIGTABLE

GOOGLE BIGTABLE: REQUIREMENTS

‣ scale to petabytes of data ‣ thousands of machines ‣ high availability ‣ high performance

GOOGLE BIGTABLE: DATA MODEL

‣ a sparse, distributed, persistent, multi-dimensional sorted map

(row: string, column: string, time: int64) → string
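The map signature above can be modeled as a nested dict: each (row, column) cell holds multiple timestamped versions of a string value. This is a toy in-memory stand-in, and `put_cell`/`get_cell` are made-up names, not Bigtable's API.

```python
from collections import defaultdict

# (row, column) -> {timestamp: value}: a sparse map, since only
# populated cells occupy space.
table = defaultdict(dict)

def put_cell(row, column, ts, value):
    table[(row, column)][ts] = value

def get_cell(row, column, ts=None):
    """Latest version by default, or the value at a given timestamp."""
    versions = table[(row, column)]
    return versions[max(versions) if ts is None else ts]

put_cell("com.example", "contents:", 1, "<html>v1")
put_cell("com.example", "contents:", 2, "<html>v2")
print(get_cell("com.example", "contents:"))     # <html>v2
print(get_cell("com.example", "contents:", 1))  # <html>v1
```

The timestamp dimension is what the later TIMESTAMPS slide illustrates: one cell, several versions, newest returned by default.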

GOOGLE BIGTABLE: ROWS

‣ arbitrary strings ‣ read/write operations on a row are atomic ‣ data is maintained in lexicographic order by row key ‣ each row range is called a tablet

Example: maps.google.com is stored under the reversed row key com.google.maps.

GOOGLE BIGTABLE: COLUMNS

‣ column keys are grouped into sets called column families ‣ a column family must be created before data can be stored under any column key in that family ‣ a column key is named family:qualifier ‣ access control and both disk and memory accounting are performed at the column-family level

GOOGLE BIGTABLE: TIMESTAMPS

Diagram: the contents: column of row com.example holds multiple timestamped versions of the page (“<html>…” at t1, “<html>…” at t2).

GOOGLE BIGTABLE: DATA MODEL, EXAMPLE

Diagram: a table with sorted row keys (com.example, com.cnn.www, com.cnn.www/foo) and column families language:, contents:, anchor:cnnsi.com, anchor:mylook.ca. Each row stores “en” under language: and “<!DOCTYPE html PUBLIC …” under contents:; row com.cnn.www also stores “cnn” under anchor:cnnsi.com and “cnn.com” under anchor:mylook.ca.

GOOGLE BIGTABLE: DIFFERENCES WITH RDBMS

RDBMS              BIGTABLE
query language     specific API
joins              no referential integrity
explicit sorting   sorting defined a priori in the column family

GOOGLE BIGTABLE: ARCHITECTURE

Google File System (GFS) ‣ stores data files and logs

Google SSTable ‣ stores BigTable data

Chubby ‣ a highly available distributed lock service

GOOGLE BIGTABLE: COMPONENTS

library ‣ linked into every client

one master server ‣ assigns tablets to tablet servers ‣ detects the addition and expiration of tablet servers ‣ balances tablet-server load ‣ garbage-collects files in GFS ‣ handles schema changes

many tablet servers ‣ each manages 10 to 100 tablets ‣ handles read and write requests to its tablets ‣ splits tablets that have grown too large

GOOGLE BIGTABLE: COMPONENTS

Diagram: clients read and write by talking to the tablet servers directly; the master manages metadata and tablet assignment.

GOOGLE BIGTABLE: STARTUP AND GROWTH

Diagram: a Chubby file points to the root tablet; the root tablet points to the first level of metadata tablets; the other metadata tablets point to the user tables (UserTable1 … UserTableN).

GOOGLE BIGTABLE: TABLET ASSIGNMENT

tablet server ‣ when started, creates and acquires an exclusive lock in Chubby

master ‣ grabs a unique master lock in Chubby ‣ scans Chubby to find live tablet servers ‣ asks each tablet server to discover its tablets ‣ scans the Metadata table to learn the full set of tablets ‣ builds the set of unassigned tablets, for future tablet assignment

GOOGLE BIGTABLE: WHY IS IT CP?

‣ if the master dies, the service stops functioning

‣ if a tablet server dies, its tablets become unavailable

‣ if Chubby dies, BigTable can neither execute synchronization operations nor serve client requests

‣ the Google File System is itself a CP system

$ whoami

Andrea Giuliano, @bit_shark, www.andreagiuliano.it

Please rate the talk: joind.in/13224

REFERENCES

G. DeCandia et al., “Dynamo: Amazon's Highly Available Key-value Store”
F. Chang et al., “Bigtable: A Distributed Storage System for Structured Data”

