MAKE YOUR CHOICE: CONSISTENCY, AVAILABILITY, PARTITION TOLERANCE
Andrea Giuliano @bit_shark
WHAT A DISTRIBUTED SYSTEM IS
“A distributed system is a software system in which components located on networked computers communicate
and coordinate their actions by passing messages”
CONSISTENCY
Strong consistency: all replicas return the same value for the same object.
Weak consistency: different replicas can return different values for the same object.
STRONG VS WEAK CONSISTENCY
Strong consistency: ACID (Atomic, Consistent, Isolated, Durable) databases.
Weak consistency: BASE (Basically Available, Soft-state, Eventually consistent) databases.
PARTITION TOLERANCE
Network failure: the system splits into groups on each side of a faulty network entity (switch, backbone).
Process failure: the system splits into two groups, correct nodes and crashed nodes.
CAP THEOREM
“Of three properties of shared-data systems (data consistency, system availability and
tolerance to network partitions) only two can be achieved at any given moment in time.”
CAP THEOREM: THE PROOF
[Diagram] A network partition separates two groups of replicas, each initially holding price = 0. At time t1, a client sends put(price, 10) to partition 1. At time t2, a client sends get(price) to partition 2, which can either answer price = 0 (not consistent) or give no response (not available).
CONSISTENCY + PARTITION TOLERANCE
➡ distributed databases
➡ distributed locking
➡ majority protocols
➡ active/passive replication
➡ quorum-based systems
Example: BigTable
CAP THEOREM: IN PRACTICE
CAP THEOREM
AVAILABILITY + PARTITION TOLERANCE
➡ web caches
➡ stateless systems
➡ DNS
Example: DynamoDB
CAP THEOREM
CONSISTENCY + AVAILABILITY
➡ single-site databases
➡ cluster databases
➡ LDAP
DYNAMO: REQUIREMENTS
“customers should be able to view and add items to their shopping cart even if disks are failing,
network routes are flapping, or data centers are being destroyed by tornados.”
➡ reliable
➡ highly scalable
➡ always available
DYNAMO: A SIMPLE INTERFACE
get(key)
‣ returns the object associated with the key: either a single object, or a list of objects with conflicting versions, together with a context
put(key, context, object)
‣ determines where the replicas of the object should be placed based on the associated key; the context includes information such as the version of the object
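The two-operation interface can be sketched in a few lines. This is a hypothetical in-memory toy, not Amazon's API: the class name `SimpleStore` and the dict-based context are invented for illustration.

```python
class SimpleStore:
    """Toy illustration of Dynamo's get/put interface (not the real API)."""

    def __init__(self):
        # key -> list of (context, object); multiple entries mean
        # conflicting versions awaiting reconciliation
        self._data = {}

    def get(self, key):
        """Return every (context, object) version stored for the key."""
        return self._data.get(key, [])

    def put(self, key, context, obj):
        """Store a new version; the context carries version information
        (in real Dynamo, a vector clock) used later for reconciliation."""
        self._data.setdefault(key, []).append((context, obj))


store = SimpleStore()
store.put("cart:42", {"version": 1}, ["book"])
versions = store.get("cart:42")  # → [({'version': 1}, ['book'])]
```

Note that `get` hands back *all* conflicting versions rather than picking one, which is exactly what pushes conflict resolution to read time.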
DYNAMO: REPLICATION, THE CHOICE
Synchronous replica coordination
‣ strong consistency
‣ availability tradeoff
Optimistic replication technique
‣ high availability
‣ higher probability of conflicts
DYNAMO: CONFLICTS, WHEN?
At write time
‣ risk of rejecting writes
At read time
‣ enables an "always writable" datastore
DYNAMO: CONFLICTS, WHO RESOLVES?
The data store
‣ e.g. a "last write wins" policy
The application
‣ resolution as an implementation detail
DYNAMO: REPLICATION
[Diagram] Nodes A, B, C, D, E, F, G sit on a consistent-hashing ring; data is positioned on the ring by hashing its key. With N = 3, node D stores the keys in the ranges (A, B], (B, C], and (C, D].
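The placement rule above can be sketched with a minimal hash ring. This is an illustrative toy, assuming MD5 for ring positions and single (non-virtual) nodes; real Dynamo uses virtual nodes and skips hints, which are omitted here.

```python
import hashlib
from bisect import bisect_right


def ring_pos(name: str) -> int:
    """Map a node name or data key to a position on the ring."""
    return int(hashlib.md5(name.encode()).hexdigest(), 16)


class Ring:
    """Toy consistent-hashing ring: a key is replicated on the N nodes
    that follow its hash position clockwise."""

    def __init__(self, nodes, n_replicas=3):
        self.n = n_replicas
        self.ring = sorted((ring_pos(name), name) for name in nodes)
        self.positions = [p for p, _ in self.ring]

    def preference_list(self, key):
        # first node strictly past the key's position, wrapping around
        idx = bisect_right(self.positions, ring_pos(key)) % len(self.ring)
        return [self.ring[(idx + i) % len(self.ring)][1]
                for i in range(self.n)]


ring = Ring(["A", "B", "C", "D", "E", "F", "G"])
replicas = ring.preference_list("price")  # 3 distinct successor nodes
```

The first node in the preference list acts as the coordinator; the next N-1 successors hold the replicas, matching the "(A, B], (B, C], (C, D]" ranges on the slide.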
DYNAMO: DATA VERSIONING
put() may return before the update has been propagated to all replicas.
A subsequent get() may return an object that does not have the latest update.
DYNAMO: RECONCILIATION
Syntactic reconciliation
‣ the new version subsumes the previous one
Semantic reconciliation
‣ conflicting versions of the same object
DYNAMO: VECTOR CLOCKS
Definition
‣ a list of (node, counter) pairs
[Diagram] Evolution of an object's versions:
D1 [Sx,1] (write handled by Sx)
D2 [Sx,2] (write handled by Sx)
D3 [Sx,2],[Sy,1] (write handled by Sy)
D4 [Sx,2],[Sz,1] (write handled by Sz, concurrent with D3)
D5 [Sx,3],[Sy,1],[Sz,1] (reconciled and written by Sx)
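The D1–D5 evolution above can be reproduced with a minimal vector-clock sketch. Clocks are plain dicts mapping node name to counter; the function names are illustrative, not Dynamo's.

```python
def increment(clock, node):
    """Return a new clock with `node`'s counter bumped by one."""
    new = dict(clock)
    new[node] = new.get(node, 0) + 1
    return new


def descends(a, b):
    """True if the version with clock `a` subsumes the one with clock `b`
    (every counter in b is <= the matching counter in a)."""
    return all(a.get(node, 0) >= count for node, count in b.items())


def concurrent(a, b):
    """Neither version subsumes the other: semantic reconciliation needed."""
    return not descends(a, b) and not descends(b, a)


def merge(a, b):
    """Reconciled clock: element-wise maximum of the two clocks."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in set(a) | set(b)}


d1 = increment({}, "Sx")             # {'Sx': 1}
d2 = increment(d1, "Sx")             # {'Sx': 2}
d3 = increment(d2, "Sy")             # {'Sx': 2, 'Sy': 1}
d4 = increment(d2, "Sz")             # {'Sx': 2, 'Sz': 1}
d5 = increment(merge(d3, d4), "Sx")  # {'Sx': 3, 'Sy': 1, 'Sz': 1}
```

D2 subsumes D1 (syntactic reconciliation), while D3 and D4 are concurrent, so a client must reconcile them semantically, producing D5.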
DYNAMO: PUT() AND GET()
R
‣ minimum number of nodes that must participate in a successful read operation
W
‣ minimum number of nodes that must participate in a successful write operation
DYNAMO: PUT() AND GET()
put()
‣ the coordinator generates the vector clock for the new version and writes the new version locally
‣ the new version is sent to N nodes
‣ the write is successful if at least W-1 nodes respond (the coordinator's local write counts as the Wth)
get()
‣ the coordinator requests all existing versions of the data
‣ the coordinator waits for R responses before returning the result
‣ the coordinator returns all the causally unrelated versions
‣ divergent versions are reconciled and written back
DYNAMO: WHY IS IT AP?
‣ requests are served even if some replicas are not available
‣ if a node is down, the write is stored on another node
‣ consistency conflicts are resolved at read time or in the background
‣ eventually, all the replicas converge
‣ concurrent read/write operations can make distinct clients see distinct versions of the same key
GOOGLE BIGTABLE: REQUIREMENTS
‣ scale to petabytes of data
‣ thousands of machines
‣ high availability
‣ high performance
GOOGLE BIGTABLE: DATA MODEL
‣ a sparse, distributed, persistent, multi-dimensional sorted map
(row: string, column: string, time: int64) → string
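The map signature above can be sketched as nested dicts keyed by row, column, and timestamp. This is a toy model with invented helper names; real Bigtable stores the map in SSTables on GFS.

```python
# (row: string, column: string, time: int64) -> string, as nested dicts
table = {}


def put_cell(row, column, ts, value):
    """Store one versioned cell value."""
    table.setdefault(row, {}).setdefault(column, {})[ts] = value


def get_cell(row, column, ts=None):
    """Return the value at `ts`, or the most recent version if ts is None."""
    versions = table[row][column]
    if ts is None:
        ts = max(versions)  # latest timestamp wins
    return versions[ts]


def scan_rows():
    """Rows are served in lexicographic order by row key."""
    return sorted(table)


put_cell("com.example", "contents:", 1, "<html>v1")
put_cell("com.example", "contents:", 2, "<html>v2")
latest = get_cell("com.example", "contents:")  # → "<html>v2"
```

Sparseness falls out naturally: a cell that was never written simply has no entry, and the per-cell timestamp dict is exactly the multi-version behavior shown on the timestamps slide.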
GOOGLE BIGTABLE: ROWS
‣ arbitrary strings
‣ read/write operations on a row are atomic
‣ data is maintained in lexicographic order by row key
‣ each row range is called a tablet
e.g. maps.google.com is stored under the reversed key com.google.maps
GOOGLE BIGTABLE: COLUMNS
‣ column keys are grouped into sets called column families
‣ a column family must be created before data can be stored under any column key in that family
‣ a column key is named family:qualifier
‣ access control and both disk and memory accounting are performed at the column-family level
GOOGLE BIGTABLE: TIMESTAMPS
[Diagram] The row com.example stores two versions of the contents: column, "<html>…" at timestamp t1 and "<html>…" at timestamp t2.
GOOGLE BIGTABLE: DATA MODEL EXAMPLE
[Diagram] Sorted rows with column families language:, contents:, anchor:cnnsi.com, anchor:mylook.ca:

row key          language:  contents:                  anchor:cnnsi.com  anchor:mylook.ca
com.example      en         <!DOCTYPE html PUBLIC …
com.cnn.www      en         <!DOCTYPE html PUBLIC …    "cnn"             "cnn.com"
com.cnn.www/foo  en         <!DOCTYPE html PUBLIC …
GOOGLE BIGTABLE: DIFFERENCES WITH RDBMS

RDBMS             BIGTABLE
query language    specific API
joins             no referential integrity
explicit sorting  sorting defined a priori in the column family
GOOGLE BIGTABLE: ARCHITECTURE
Google File System (GFS)
‣ stores data files and logs
Google SSTable
‣ stores BigTable data
Chubby
‣ a highly available distributed lock service
GOOGLE BIGTABLE: COMPONENTS
library
‣ linked into every client
one master server
‣ assigns tablets to tablet servers
‣ detects the addition and expiration of tablet servers
‣ balances tablet-server load
‣ garbage-collects files in GFS
‣ handles schema changes
many tablet servers
‣ each manages a set of tablets (typically ten to a thousand)
‣ handles read and write requests to its tablets
‣ splits tablets that have grown too large
GOOGLE BIGTABLE: COMPONENTS
[Diagram] Clients perform reads and writes directly against the tablet servers; the master server assigns tablets and handles metadata, so it is not on the data path.
GOOGLE BIGTABLE: STARTUP AND GROWTH
[Diagram] A three-level location hierarchy: a Chubby file points to the root tablet (the first Metadata tablet); the root tablet points to the other Metadata tablets; each Metadata tablet points to the tablets of the user tables (UserTable1 … UserTableN).
GOOGLE BIGTABLE: TABLET ASSIGNMENT
tablet server
‣ when started, creates and acquires a lock on a file in Chubby
master
‣ grabs a unique master lock in Chubby
‣ scans Chubby to find live tablet servers
‣ asks each tablet server to discover its tablets
‣ scans the Metadata table to learn the full set of tablets
‣ builds a set of unassigned tablets for future assignment
GOOGLE BIGTABLE: WHY IS IT CP?
‣ the master's death leaves parts of the service non-functional
‣ a tablet server's death makes its tablets unavailable
‣ Chubby's death leaves BigTable unable to execute synchronization operations and to serve client requests
‣ the Google File System is itself a CP system
REFERENCES
G. DeCandia et al., "Dynamo: Amazon's Highly Available Key-value Store"
F. Chang et al., "Bigtable: A Distributed Storage System for Structured Data"

Assets:
https://farm1.staticflickr.com/41/86744006_0026864df8_b_d.jpg
https://farm9.staticflickr.com/8305/7883634326_4e51a1a320_b_d.jpg
https://farm5.staticflickr.com/4145/4958650244_65b2eddffc_b_d.jpg
https://farm4.staticflickr.com/3677/10023456065_e54212c52e_b_d.jpg
https://farm4.staticflickr.com/3076/2871264822_261dafa44c_o_d.jpg
https://farm1.staticflickr.com/7/6111406_30005bdae5_b_d.jpg
https://farm4.staticflickr.com/3928/15416585502_92d5e608c7_b_d.jpg
https://farm8.staticflickr.com/7046/6873109431_d3b5199f7d_b_d.jpg
https://farm4.staticflickr.com/3007/2835755867_c530b0e0c6_o_d.jpg
https://farm3.staticflickr.com/2788/4202444169_2079db9580_o_d.jpg
https://farm1.staticflickr.com/55/129619657_907b480c7c_b_d.jpg
https://farm5.staticflickr.com/4046/4368269562_b3e05e3f06_b_d.jpg
https://farm8.staticflickr.com/7344/12137775834_d0cecc5004_k_d.jpg
https://farm5.staticflickr.com/4073/4895191036_1cb9b58d75_b_d.jpg
https://farm4.staticflickr.com/3144/3025249284_b77dec2d29_o_d.jpg
https://www.flickr.com/photos/avardwoolaver/7137096221