Upload
others
View
4
Download
0
Embed Size (px)
Citation preview
Versioning,Consistency,andAgreement
COS461:ComputerNetworksSpring2010(MW3:00‐4:20inCS105)
MikeFreedmanhIp://www.cs.princeton.edu/courses/archive/spring10/cos461/
1
Jenkins,ifIwantanotheryes‐man,I’llbuildone!
Lee Lorenz, Brent Sheppard
Timeanddistributedsystems
• WithmulPpleevents,whathappensfirst?
A shoots B
B dies 2
Timeanddistributedsystems
• WithmulPpleevents,whathappensfirst?
A dies
B shoots A
3
Timeanddistributedsystems
• WithmulPpleevents,whathappensfirst?
A shoots B
B dies A dies
B shoots A
4
JustusePmestamps?
• Needsynchronizedclocks
• ClocksynchviaaPmeserver
p Time server S
5
• Usesa!meservertosynchronizeclocks
• TimeserverkeepsthereferencePme
• ClientsaskserverforPmeandadjusttheirlocal
clock,basedontheresponse
– Butdifferentnetworklatency→clockskew?
• Correctforthis?Forlinkswithsymmetricallatency:
CrisPan’sAlgorithm
adjusted-local-time = server-timestamp t + (RTT / 2)
local-clock-error = adjusted-local-time – local-time
RTT = response-received-time – request-sent-time
6
Isthissufficient?
• Serverlatencyduetoload?– Ifcanmeasure:
• adjusted‐local‐Pme=server‐Pmet+(RTT+lag)/2
• Butwhataboutasymmetriclatency?– RTT/2notsufficient!
• WhatdoweneedtomeasureRTT?– Requiresnoclockdrib!
• Whatabout“almost”concurrentevents?– Clockshavemicro/milli‐secondprecision
7
EventsandHistories
• Processesexecutesequencesofevents
• Eventscanbeof3types:– local,send,andreceive
• Thelocalhistoryhpofprocesspisthesequenceofeventsexecutedbyprocess
8
Orderingevents
• ObservaPon1:– Eventsinalocalhistoryaretotallyordered
time!pi
9
Orderingevents
• ObservaPon1:– Eventsinalocalhistoryaretotallyordered
• ObservaPon2:– Foreverymessagem,send(m)precedesreceive(m)
time!
time!
time!
pi
pi
pj
m
10
Happens‐Before(Lamport[1978])
• RelaPvePme?DefineHappens‐Before(→):– Onthesameprocess:a→b,if!me(a)<!me(b)
– Ifp1sendsmtop2:send(m)→receive(m)
– Ifa→bandb→cthena→c
• LamportAlgorithmusesforparPalordering:– Allprocessesuseacounter(clock)withiniPalvalueof0– Counterincrementedbyandassignedtoeachevent,as
itsPmestamp
– Asend(msg)eventcarriesitsPmestamp
– Forreceive(msg)event,counterisupdatedbyMax(receiver‐counter,message‐Pmestamp)+1
11
EventsOccurringatThreeProcesses
12
LamportTimestamps
1
1
2
3 4
5
13
LamportLogicalTime
Host 1
Host 2
Host 3
Host 4
1
2
2
3
3
5
4
5
3
6
4
6
7
8
0
0
0
0
1
2
4
3 3
4
7
Physical Time
14
LamportLogicalTime
Host 1
Host 2
Host 3
Host 4
1
2
2
3
3
5
4
5
3
6
4
6
7
8
0
0
0
0
1
2
4
3 3
4
7
Physical Time
Logically concurrent events!
15
VectorLogicalClocks• WithLamportLogicalTime
– eprecedesf⇒Pmestamp(e)<Pmestamp(f),but
– Pmestamp(e)<Pmestamp(f)⇒eprecedesf
16
VectorLogicalClocks• WithLamportLogicalTime
– eprecedesf⇒Pmestamp(e)<Pmestamp(f),but
– Pmestamp(e)<Pmestamp(f)⇒eprecedesf
• VectorLogicalPmeguaranteesthis:– Allhostsuseavectorofcounters(logicalclocks),ithelementistheclockvalueforhosti,iniPally0
– Eachhosti,incrementstheithelementofitsvectoruponanevent,assignsthevectortotheevent.
– Asend(msg)eventcarriesvectorPmestamp
– Forreceive(msg)event,
Max(Vreceiver[j],Vmsg[j]), ifjisnotself
Vreceiver[j]+1 otherwiseVreceiver[j] =
17
VectorTimestamps
18
VectorLogicalTime
Host 1
Host 2
Host 3
Host 4
1,0,0,0
Physical Time
1,1,0,0
1,0,0,0
1,2,0,0
2,2,3,0
1,2,0,0
2,0,0,0
2,0,1,0 2,0,2,0
Vreceiver[j] =
2,0,2,1
Max(Vreceiver[j],Vmsg[j]), ifjisnotself
Vreceiver[j]+1 otherwise
19
ComparingVectorTimestamps
• a=b iftheyagreeateveryelement
• a<b ifa[i]<=b[i]foreveryi,but!(a=b)• a>b ifa[i]>=b[i]foreveryi,but!(a=b)• a||b ifa[i]<b[i],a[j]>b[j],forsomei,j(conflict!)
• Ifonehistoryisprefixofother,thenonevectorPmestamp<other
• Ifonehistoryisnotaprefixoftheother,then(atleastbyexample)VTswillnotbecomparable.
20
GivenanoPonofPme…
…What’sanoPonofconsistency?
21
StrictConsistency
• Strongestconsistencymodelwe’llconsider– AnyreadonadataitemXreturnsvaluecorrespondingtoresultofthemostrecentwriteonX
• NeedanabsoluteglobalPme– “Mostrecent”needstobeunambiguous
x
WritextoaReadxreturnsa
22
Whatelsecanwedo?
• Strictconsistencyistheidealmodel– Butimpossibletoimplement!
• SequenPalconsistency– Slightlyweakerthanstrictconsistency– DefinedforsharedmemoryformulP‐processors
23
SequenPalConsistency
• DefiniPon:ResultofanyexecuPonisthesameasifall(readandwrite)operaPonsondatastorewereexecutedinsomesequenPalorder,andtheoperaPonsofeachindividualprocessappearinthissequenceintheorderspecifiedbyitsprogram
• DefiniPon:Whenprocessesarerunningconcurrently:– InterleavingofreadandwriteoperaPonsisacceptable,butallprocessesseethesameinterleavingofoperaPons
• Differencefromstrictconsistency– NoreferencetothemostrecentPme
– AbsoluteglobalPmedoesnotplayarole24
ValidSequenPalConsistency?
x 25
Linearizability
• Linearizability– Weakerthanstrictconsistency– StrongerthansequenPalconsistency
• AlloperaPons(OP=read,write)receiveaglobalPme‐stampusingasynchronizedclock
• Linearizability:– RequirementsforsequenPalconsistency,plus
– Iftsop1(x)<tsop2(y),thenOP1(x)shouldprecedeOP2(y)inthesequence
26
CausalConsistency
• NecessarycondiPon:– Writesthatarepoten&allycausallyrelatedmustbeseenbyallprocessesinthesameorder.
– Concurrentwritesmaybeseeninadifferentorderondifferentmachines.
• WeakerthansequenPalconsistency
• Concurrent:Opsthatarenotcausallyrelated
27
CausalConsistency
• Allowedwithcausalconsistency,butnotwithsequenPalorstrictconsistency
• W(x)bandW(x)careconcurrent– Soallprocessesdon’tseetheminthesameorder
• P3andP4readthevalues‘a’and‘b’inorderaspotenPallycausallyrelated.No‘causality’for‘c’.
28
CausalConsistency
x
29
CausalConsistency
• Requireskeepingtrackofwhichprocesseshaveseenwhichwrites
– Needsadependencygraphofwhichopisdependentonwhichotherops
– …orusevectorPmestamps!
30
Eventualconsistency• Ifnonewupdatesaremadetoanobject,abersomeinconsistencywindowcloses,allaccesseswillreturnthelastupdatedvalue
• Prefixproperty:– IfPihaswritewacceptedfromsomeclientbyPj– ThenPihasallwritesacceptedbyPjpriortow
• Usefulwhereconcurrencyappearsonlyinarestrictedform
• AssumpPon:writeconflictswillbeeasytoresolve– Eveneasierifwhole‐”object”updatesonly
31
Systemsusingeventualconsistency
• DB:updatedbyafewproc’s,readbymany– Howfastmustupdatesbepropagated?
• Webpages:typicallyupdatedbysingleuser– So,nowrite‐writeconflicts– Howevercachescanbecomeinconsistent
32
Systemsusingeventualconsistency
• DNS:eachdomainassignedtoanamingauthority– Onlymasterauthoritycanupdatethenamespace– OtherNSserversactas“slave”servers,downloadingDNSzonefilefrommasterauthority
– So,write‐writeconflictswon’thappen
– Isthisalwaystruetoday?
[email protected].( 18 ;serial 1200 ;refresh 600 ;retry 172800 ;expire 21600);minimum
33
TypicalimplementaPonofeventualconsistency
• Distributed,inconsistentstate– Writesonlygotosomesubsetofstoragenodes
• Bydesign(forhigherthroughput)• Duetotransmissionfailures
• “AnP‐entropy”(gossiping)fixesinconsistencies– Usevectorclocktoseewhichisolder– Prefixpropertyhelpsnodesknowconsistencystatus– IfautomaPc,requiressomewaytohandlewriteconflicts
• ApplicaPon‐specificmerge()funcPon• Amazon’sDynamo:UsersmayseemulPpleconcurrent“branches”beforeapp‐specificreconciliaPonkicksin
34
Examples…• Causalconsistency.Non‐causallyrelatedsubjecttonormaleventualconsistencyrules
• Read‐your‐writesconsistency.
• Sessionconsistency.Read‐your‐writesholdsiffclientsessionexists.Ifsessionterminates,noguaranteesbetweensessions.
• Monotonicreadconsistency.Oncereadreturnsaversion,subsequentreadsneverreturnolderversions.
• Monotonicwriteconsistency.Writesbysameprocessareproperlyserialized.Reallyhardtoprogramsystemswithoutthisprocess.
35
Evenread‐your‐writesmaybedifficulttoachieve
36
Whataboutstrongeragreement?
• Two‐phasecommitprotocol
37
• Marriageceremony
• Theater
• Contractlaw
Doyou?Ido.Inowpronounceyou…
Readyontheset?Ready!AcPon!
OfferSignatureDeal/lawsuit
Whataboutstrongeragreement?
• Two‐phasecommitprotocol
38
LeaderAcceptorsAcceptorsAcceptors
PREPARE
READY
COMMIT
ACK
ClientWRITE
ACK
Allprepared?
Allack’d?
Whataboutfailures?
• Ifanacceptorfails:– CansPllensurelinearizabilityif|R|+|W|≥N
– “read”and“write”quorumsoverlapinatleast1node
• Iftheleaderfails?– Loseavailability:systemnotlonger“live”
• Pickanewleader?– Needtomakesureeverybodyagreesonleader!– Needtomakesurethat“group”isknown
39
ConsensusandPaxosAlgorithm• “Consensus”problem
– Nprocesseswanttoagreeonavalue– IffewerthanFfaultsinawindow,consensusachieved
• “Crash”faultsneed2F+1processes• “Malicious”faults(calledByzanPne)need3F+1processes
• CollecPonofprocessesproposingvalues– Onlyproposedvaluemaybechosen
– Onlysinglevaluechosen
• Commonusage:– Viewchange:defineleaderandgroupviaPaxos– Leaderusestwo‐phasecommitforwrites
– Acceptorsmonitorleaderforliveness.Ifdetectfailure,re‐execute“viewchange” 40
Paxos:Algorithm
ViewChangefromcurrentview
Viewi:V={Leader:N2,Group:{N1,N2,N3}}
Phase1(Prepare)• Proposer:Sendpreparewithversion#jtomembersofViewi
• Acceptor:ifj>vers#kofanyotherprepareitseen,respondwithpromisenottoacceptlower‐numberedproposals.Otherwise,
respondwithkandvaluev’accepted.
Phase2(Accept)• Ifmajoritypromise,proposersendsacceptwith(versj,valuev)
• Acceptoracceptsunlessithasrespondedtopreparewithhighervers#thanj.Sendsacknowledgementtoallviewmembers.
41
Summary
• GlobalPmedoesn’texistindistributedsystem
• LogicalPmecanbeestablishedviaversion#’s
• LogicalPmeusefulinvariousconsistencymodels– Strict>Linearizability>SequenPal>Causal>Eventual
• Agreementindistributedsystem– Eventualconsistency:Quorums+anP‐entropy
– Linearizability:Two‐phasecommit,Paxos
42