Distributed systems
Lecture 5: Consistent cuts, process groups, and mutual exclusion
Dr Robert N.M. Watson
Last time

• Saw physical time can't be kept exactly in sync; instead use logical clocks to track ordering between events:
– Defined a → b to mean 'a happens-before b'
– Easy inside a single process, & use causal ordering (send → receive) to extend the relation across processes
– if send_i(m1) → send_j(m2) then deliver_k(m1) → deliver_k(m2)
• Lamport clocks, L(e): an integer
– Increment to (max of (sender, receiver)) + 1 on receipt
– But given L(a) < L(b), know nothing about the order of a and b
• Vector clocks: a list of Lamport clocks, one per process
– Element Vi[j] captures # events at Pj observed by Pi
– Crucially: if Vi(a) < Vj(b), can infer that a → b, and if Vi(a) ~ Vj(b), can infer that a ~ b
Vector clocks: example

• When P2 receives m1, it merges the entries from P1's clock
– choose the maximum value in each position
• Similarly, when P3 receives m2, it merges in P2's clock
– this incorporates the changes from P1 that P2 already saw
• Vector clocks explicitly track the transitive causal order: f's timestamp captures the history of a, b, c & d
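The merge-then-increment rule on receipt, and the comparison that recovers happens-before, can be sketched in a few lines of Python (a minimal sketch, assuming a fixed number of processes; function names are illustrative):

```python
def vc_increment(clock, i):
    """Local or send event at process i: bump our own entry."""
    clock = list(clock)
    clock[i] += 1
    return clock

def vc_merge(local, received, i):
    """On receipt at process i: element-wise max, then count the receive."""
    merged = [max(a, b) for a, b in zip(local, received)]
    return vc_increment(merged, i)

def vc_before(a, b):
    """a -> b iff a's clock is <= b's in every position, and they differ."""
    return all(x <= y for x, y in zip(a, b)) and a != b

# From the diagram: P2 at (0,0,0) receives m1 stamped (2,0,0) -> (2,1,0);
# P3 at (0,0,1) receives m2 stamped (2,2,0) -> (2,2,2).
assert vc_merge([0, 0, 0], [2, 0, 0], 1) == [2, 1, 0]
assert vc_merge([0, 0, 1], [2, 2, 0], 2) == [2, 2, 2]
assert vc_before([1, 0, 0], [2, 2, 2])          # a -> f
assert not vc_before([2, 0, 0], [0, 0, 1])      # b ~ e: concurrent
```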
[Space-time diagram: P1 has events a (1,0,0) and b (2,0,0); P2 has c (2,1,0) and d (2,2,0); P3 has e (0,0,1) and f (2,2,2). P1 sends m1 to P2, P2 sends m2 to P3; send and receive events are marked.]
Consistent global state

• We have the notion of "a happens-before b" (a → b) or "a is concurrent with b" (a ~ b)
• What about 'instantaneous' system-wide state?
– distributed debugging, GC, deadlock detection, ...
• Chandy/Lamport introduced consistent cuts:
– draw a (possibly wiggly) line across all processes
– this is a consistent cut if the set of events (on the lhs) is closed under the happens-before relationship
– i.e. if the cut includes event x, then it also includes all events e which happened before x
• In practical terms, this means every delivered message included in the cut was also sent within the cut
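That practical condition can be checked mechanically: a cut is consistent iff it contains no receive event whose matching send lies outside it. A minimal sketch (event ids and the messages list are hypothetical):

```python
def is_consistent(cut, messages):
    """cut: set of event ids on the lhs of the line.
    messages: list of (send_event, recv_event) pairs.
    Consistent iff every delivered message in the cut was also sent in it."""
    return all(send in cut for send, recv in messages if recv in cut)

# Message m1 sent at event 'a', received at event 'e':
msgs = [("a", "e")]
assert is_consistent({"a", "e"}, msgs)   # send and receive both included
assert is_consistent({"a"}, msgs)        # message still in flight: fine
assert not is_consistent({"e"}, msgs)    # receive without its send: inconsistent
```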
Consistent cuts: example

• Vertical cuts are always consistent (due to the way we draw these diagrams), but some curves are ok too:
– providing we don't include any receive events without their corresponding send events
• Intuition is that a consistent cut could have occurred during execution (depending on scheduling etc.)
[Space-time diagram: processes P1–P3 with events a–l and messages between them, illustrating consistent and inconsistent cuts.]
Observing consistent cuts

• Chandy/Lamport Snapshot Algorithm (1985)
• Distributed algorithm to generate a snapshot of relevant system-wide state (e.g. all memory, locks held, ...)
• Flood a special marker message M to all processes; causal order of flood defines the cut
• If Pi receives M from Pj and it has yet to snapshot:
– It pauses all communication, takes a local snapshot & sets Cij to {}
– Then sends M to all other processes Pk and starts recording Cik = {set of all post-local-snapshot messages received from Pk}
• If Pi receives M from some Pk after taking its snapshot:
– Stops recording Cik, and saves it alongside the local snapshot
• Global snapshot comprises all local snapshots & Cij
• Assumes reliable, in-order messages, & no failures

Fear not! This is not examinable.
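For the curious, one process's marker handling can be sketched as below (a sketch only, assuming reliable FIFO channels; the class and the `send_marker` callback are hypothetical names, not part of the lecture):

```python
class SnapshotProcess:
    """Per-process state for a Chandy-Lamport-style snapshot (sketch)."""
    def __init__(self, pid, peers):
        self.pid = pid
        self.peers = peers           # ids of the other processes
        self.snapshot = None         # local state, once recorded
        self.channel = {}            # peer -> messages recorded post-snapshot
        self.recording = set()       # peers whose channel is still being recorded

    def on_marker(self, from_peer, send_marker, local_state):
        if self.snapshot is None:
            self.snapshot = local_state              # first marker: snapshot now
            self.channel = {p: [] for p in self.peers}
            self.recording = set(self.peers) - {from_peer}  # C for sender is {}
            for p in self.peers:
                send_marker(p)                       # flood the marker onward
        else:
            self.recording.discard(from_peer)        # stop recording C_ik

    def on_message(self, from_peer, msg):
        # Messages arriving after our snapshot but before that peer's marker
        # were 'in flight' across the cut, so they belong to the channel state.
        if self.snapshot is not None and from_peer in self.recording:
            self.channel[from_peer].append(msg)
```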
Process groups

• It is useful to build distributed systems with process groups
– Set of processes on some number of machines
– Possible to multicast messages to all members
– Allows fault-tolerant systems even if some processes fail
• Membership can be fixed or dynamic
– if dynamic, have explicit join() and leave() primitives
• Groups can be open or closed:
– Closed groups only allow messages from members
• Internally can be structured (e.g. coordinator and set of slaves), or symmetric (peer-to-peer)
– Coordinator makes e.g. concurrent join/leave easier...
– ...but may require extra work to elect the coordinator

When we use multicast in distributed systems, we mean something stronger than conventional network multicasting using datagrams – do not confuse them.
Group communication: assumptions

• Assume we have the ability to send a message to multiple (or all) members of a group
– Don't care if 'true' multicast (single packet sent, received by multiple recipients) or "netcast" (send a set of messages, one to each recipient)
• Assume also that message delivery is reliable, and that messages arrive in bounded time
– But may take different amounts of time to reach different recipients
• Assume (for now) that processes don't crash
• What delivery orderings can we enforce?
FIFO ordering

• With FIFO ordering, messages from a particular process Pi must be received at all other processes Pj in the order they were sent
– e.g. in the above, everyone must see m1 before m3
– (ordering of m2 and m4 is not constrained)
• Seems easy, but not trivial in the case of delays/retransmissions
– e.g. what if message m1 to P2 takes a loooong time?
• Hence receivers may need to buffer messages to ensure order
[Space-time diagram: processes P1–P4; multicasts m1–m4; a '?' marks the delayed delivery of m1 at P2.]
Receiving versus delivering

• Group communication middleware provides extra features above 'basic' communication
– e.g. providing reliability and/or ordering guarantees on top of IP multicast or netcast
• Assume that the OS provides a receive() primitive:
– returns with a packet when one arrives on the wire
• Received messages are either delivered or held back:
– Delivered means inserted into the delivery queue
– Held back means inserted into the hold-back queue
– held-back messages are delivered later as the result of the receipt of another message...
Implementing FIFO ordering

• Each process Pi maintains a message sequence number (SeqNo) Si
• Every message sent by Pi includes Si, incremented after each send
– not including retransmissions!
• Pj maintains Sji: the SeqNo of the last delivered message from Pi
– If it receives a message from Pi with SeqNo ≠ (Sji + 1), hold it back
– When it receives the message with SeqNo = (Sji + 1), deliver it... and also deliver any consecutive messages in the hold-back queue... and update Sji
[Diagram: receive() feeds either the delivery queue (add M to delivery Q; messages consumed by the application) or the hold-back queue, from which held-back messages are delivered later.]

receive(M from Pi) {
    s = SeqNo(M);
    if (s == (Sji + 1)) {
        deliver(M);          /* add M to delivery Q */
        s = flush(hbq);      /* deliver consecutive held-back messages */
        Sji = s;
    } else
        holdback(M);
}
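The pseudocode above can be fleshed out as a runnable sketch for a single sender, with the hold-back queue as a dict keyed by SeqNo (class and field names are illustrative):

```python
class FifoReceiver:
    """Per-sender FIFO delivery: deliver in sequence, hold back gaps."""
    def __init__(self):
        self.expected = 1       # S_ji + 1: next SeqNo we can deliver
        self.holdback = {}      # SeqNo -> held-back message
        self.delivered = []     # delivery queue, consumed by the application

    def receive(self, seqno, msg):
        if seqno == self.expected:
            self.delivered.append(msg)
            self.expected += 1
            # flush any consecutive held-back messages
            while self.expected in self.holdback:
                self.delivered.append(self.holdback.pop(self.expected))
                self.expected += 1
        elif seqno > self.expected:
            self.holdback[seqno] = msg   # hold back until the gap is filled
        # seqno < expected: duplicate/retransmission, already delivered

r = FifoReceiver()
r.receive(2, "m2")                  # out of order: held back
assert r.delivered == []
r.receive(1, "m1")                  # fills the gap: both now delivered
assert r.delivered == ["m1", "m2"]
```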
Stronger orderings

• Can also implement FIFO ordering by just using a reliable FIFO transport like TCP/IP
• But the general 'receive versus deliver' model also allows us to provide stronger orderings:
– Causal ordering: if event multicast(g, m1) → multicast(g, m2), then all processes will see m1 before m2
– Total ordering: if any process delivers a message m1 before m2, then all processes will deliver m1 before m2
• Causal ordering implies FIFO ordering, since any two multicasts by the same process are related by →
• Total ordering (as defined) does not imply FIFO (or causal) ordering; it just says that all processes must agree
– Often want FIFO-total ordering (combines the two)
Causal ordering

• Same example as previously, but now causal ordering means that (a) everyone must see m1 before m3 (as with FIFO), and (b) everyone must see m1 before m2 (due to happens-before)
• Is this ok?
– No! m1 → m2, but P2 sees m2 before m1
– To be correct, must hold back (delay) delivery of m2 at P2
– But how do we know this?
[Space-time diagram: processes P1–P4 with multicasts m1–m4 as before. Annotations: "Have (0,0,0) != (1,0,2), so must hold back m2 until missing events seen"; "Once m1 received, can deliver m1 and then m2".]
Implementing causal ordering

• Turns out this is pretty easy!
– Start with the receive algorithm for FIFO multicast...
– and replace sequence numbers with vector clocks
• Some care needed with dynamic groups
[Space-time diagram: P1 multicasts m1 and P3 multicasts m2 to processes P1–P3; vector timestamps →(1,0,0), →(1,1,0), →(1,0,1), →(1,0,2), →(2,0,2) annotate the events.]
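The hold-back test itself can be sketched as follows. This uses the standard causal-multicast rule in which each vector entry counts messages *sent* by that process (slightly different from the event-counting clocks earlier in the lecture, so the numbers below are illustrative rather than the slide's):

```python
def deliverable(V, j, local):
    """Message from sender j stamped V is deliverable at a process with
    vector clock 'local' iff it is the next message from j and every message
    it causally depends on has already been delivered here."""
    return V[j] == local[j] + 1 and all(
        V[k] <= local[k] for k in range(len(V)) if k != j)

def deliver(V, j, local):
    """On delivery, record one more message seen from sender j."""
    local = list(local)
    local[j] += 1
    return local

# P1 (index 0) multicasts m1 stamped (1,0,0); P3 (index 2), having seen m1,
# multicasts m2 stamped (1,0,1). P2 starts at (0,0,0):
assert not deliverable([1, 0, 1], 2, [0, 0, 0])   # m2 arrives first: hold back
assert deliverable([1, 0, 0], 0, [0, 0, 0])       # m1: deliverable
local = deliver([1, 0, 0], 0, [0, 0, 0])
assert deliverable([1, 0, 1], 2, local)           # now m2 may be delivered
```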
Total ordering

• Sometimes we want all processes to see exactly the same, FIFO, sequence of messages
– particularly for state machine replication (see later)
• One way is to have a 'can send' token:
– Token passed round-robin between processes
– Only the process with the token can send (if it wants)
• Or use a dedicated sequencer process:
– Other processes ask for a global sequence number (GSN), and then send with this in the packet
– Use the FIFO ordering algorithm, but on GSNs
• Can also build non-FIFO total-order multicast by having processes generate GSNs themselves and resolving ties
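The sequencer approach can be sketched in a few lines: a sequencer hands out GSNs, and every receiver runs the FIFO hold-back algorithm keyed on GSN, so all deliver in the same order regardless of network arrival order (class names are illustrative):

```python
import itertools

class Sequencer:
    """Dedicated process handing out global sequence numbers (GSNs)."""
    def __init__(self):
        self._next = itertools.count(1)
    def next_gsn(self):
        return next(self._next)

class Receiver:
    """FIFO hold-back algorithm, keyed on GSN rather than per-sender SeqNo."""
    def __init__(self):
        self.expected, self.holdback, self.delivered = 1, {}, []
    def receive(self, gsn, msg):
        self.holdback[gsn] = msg
        while self.expected in self.holdback:
            self.delivered.append(self.holdback.pop(self.expected))
            self.expected += 1

seq = Sequencer()
ga, gb = seq.next_gsn(), seq.next_gsn()    # senders stamp messages with GSNs
r1, r2 = Receiver(), Receiver()
r1.receive(ga, "a"); r1.receive(gb, "b")   # arrives in order at r1
r2.receive(gb, "b"); r2.receive(ga, "a")   # arrives out of order at r2
assert r1.delivered == r2.delivered == ["a", "b"]   # same total order
```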
Ordering and asynchrony

• FIFO ordering allows quite a lot of asynchrony
– E.g. any process can delay sending a message until it has a batch (to improve performance)
– Or can just tolerate variable and/or long delays
• Causal ordering also allows some asynchrony
– But must be careful queues don't grow too large!
• Traditional total-order multicast is not so good:
– Since every message delivery transitively depends on every other one, delays hold up the entire system
– Instead tend to an (almost) synchronous model, but this performs poorly, particularly over the wide area ;-)
– Some clever work on virtual synchrony (for the interested)
Distributed mutual exclusion

• In the first part of the course, saw the need to coordinate concurrent processes/threads
– In particular, considered how to ensure mutual exclusion: allow only 1 thread in a critical section
• A variety of schemes possible:
– test-and-set locks; semaphores; monitors; active objects
• But most of these ultimately rely on hardware support (atomic operations, or disabling interrupts...)
– not available across an entire distributed system
• Assuming we have some shared distributed resources, how can we provide mutual exclusion in this case?
Solution #1: central lock server

• Nominate one process C as coordinator
– If Pi wants to enter the critical section, it simply sends a lock message to C, and waits for a reply
– If the resource is free, C replies to Pi with a grant message; otherwise C adds Pi to a wait queue
– When finished, Pi sends an unlock message to C
– C sends a grant message to the first process in the wait queue
[Message diagram: P1 and P2 each send a lock message to C; the holder executes its critical section, then sends unlock, and C grants the lock to the next waiter.]
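The coordinator's logic is small enough to sketch directly (a minimal sketch; replies are modelled as a `granted` list rather than real messages):

```python
from collections import deque

class LockServer:
    """Central coordinator C: grant if free, else queue; unlock grants next."""
    def __init__(self):
        self.holder = None
        self.waiting = deque()     # FIFO wait queue => fair ordering
        self.granted = []          # record of grant messages sent

    def lock(self, pid):
        if self.holder is None:
            self.holder = pid
            self.granted.append(pid)    # reply immediately with 'grant'
        else:
            self.waiting.append(pid)    # add Pi to the wait queue

    def unlock(self, pid):
        assert pid == self.holder
        self.holder = self.waiting.popleft() if self.waiting else None
        if self.holder is not None:
            self.granted.append(self.holder)   # grant to head of queue

c = LockServer()
c.lock(1); c.lock(2)          # P1 gets the lock; P2 queues
assert c.granted == [1]
c.unlock(1)                   # release passes the lock to P2
assert c.holder == 2
```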
Central lock server: pros and cons

• Central lock server has some good properties:
– Simple to understand and verify
– Live (providing delays are bounded, and no failure)
– Fair (if the queue is fair, e.g. FIFO), and easily supports priorities if we want them
– Decent performance: lock acquire takes one round-trip, and release is 'free' with asynchronous messages
• But C can become a performance bottleneck...
• ...and we can't distinguish a crash of C from a long wait
– can add additional messages, at some cost
Solution #2: token passing

• Avoid the central bottleneck
• Arrange processes in a logical ring
– Each process knows its predecessor & successor
– A single token passes continuously around the ring
– Can only enter the critical section when in possession of the token; pass the token on when finished (or if there's no need to enter the CS)
[Ring diagram: P0–P5 arranged in a logical ring. The initial token is generated by P0 and passes clockwise around the 'ring'; if e.g. P4 wants to enter the CS, it holds on to the token for the duration.]
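A toy round-based simulation makes the behaviour concrete (a sketch; in a real system each hop is a network message to the successor):

```python
def run_ring(n, wants, rounds=2):
    """Simulate a token ring of n processes; 'wants' is the set of process
    ids wishing to enter the critical section. Returns CS entry order."""
    entered = []
    token = 0                       # initial token generated by P0
    for _ in range(rounds * n):
        if token in wants:
            entered.append(token)   # holds the token for the CS duration
            wants = wants - {token}
        token = (token + 1) % n     # pass the token to the successor

    return entered

# P4 and P1 both want the CS: each enters exactly once, in ring order.
assert run_ring(6, {4, 1}) == [1, 4]
```

Mutual exclusion holds by construction: only one process ever has the token, so at most one entry to `entered` is appended per hop.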
Token passing: pros and cons

• Several advantages:
– Simple to understand: only 1 process ever has the token => mutual exclusion guaranteed by construction
– No central server bottleneck
– Liveness guaranteed (in the absence of failure)
– So-so performance (between 0 and N messages until a waiting process enters, 1 message to leave)
• But:
– Doesn't guarantee fairness (FIFO order)
– If a process crashes, must repair the ring (route around it)
– And worse: may need to regenerate the token – tricky!
• And constant network traffic: an advantage???
Solution #3: totally ordered multicast

• Scheme due to Ricart & Agrawala (1981)
• Consider N processes, where each process maintains a local variable state which is one of {FREE, WANT, HELD}
• To obtain the lock, a process Pi sets state := WANT, and then multicasts a lock request to all other processes
• When a process Pj receives a request from Pi:
– If Pj's local state is FREE, then Pj replies immediately with OK
– If Pj's local state is HELD, Pj queues the request to reply later
• A requesting process Pi waits for OK from N-1 processes
– Once received, sets state := HELD, and enters the critical section
– Once done, sets state := FREE, & replies to any queued requests
• What about concurrent requests?

By concurrent we mean: Pj is already in the WANT state when it receives a request from Pi.
Handling concurrent requests

• Need to decide upon a total order:
– Each process maintains a Lamport timestamp, Ti
– Processes put the current Ti into the request message
– Insufficient on its own (recall that Lamport timestamps can be identical) => use process id (or similar) to break ties
• Hence if a process Pj receives a request from Pi and Pj has an outstanding request (i.e. Pj's local state is WANT):
– If (Tj, Pj) < (Ti, Pi) then queue the request from Pi
– Otherwise, reply with OK, and continue waiting
• Note that using the total order ensures correctness, but not fairness (i.e. no FIFO ordering)
– Q: can we fix this by using vector clocks?
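The total order on requests is just lexicographic comparison of (timestamp, pid) pairs, which Python tuple comparison implements directly (a minimal sketch; the function name is illustrative):

```python
def defer_reply(my_req, incoming_req):
    """At Pj in state WANT: queue Pi's request iff our own request is
    earlier in the total order; otherwise reply OK immediately.
    Requests are (Lamport timestamp, process id) tuples."""
    return my_req < incoming_req       # (Tj, Pj) < (Ti, Pi)

# Timestamps from the example on the next slide: 17 (P1) vs 9 (P2).
assert defer_reply((9, 2), (17, 1))        # P2's request earlier: queue P1's
assert not defer_reply((17, 1), (9, 2))    # P1 yields: replies OK to P2
assert defer_reply((9, 1), (9, 2))         # equal timestamps: pid breaks tie
```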
Totally ordered multicast: example

• Imagine P1 and P2 simultaneously try to acquire the lock...
– Both set state to WANT, and both send a multicast message
– Assume the timestamps are 17 (for P1) and 9 (for P2)
• P3 has no interest (state is FREE), so replies OK to both
• Since 9 < 17, P1 replies OK; P2 stays quiet & queues P1's request
• P2 enters the critical section and executes...
• ...and when done, replies to P1 (who can now enter the critical section)
[Three-panel diagram: P1 and P2 multicast requests stamped 17 and 9 respectively to each other and to P3; P3 replies OK to both; P1 replies OK to P2, which enters the CS and, when done, replies OK to P1.]
Additional details

• Completely unstructured, decentralized solution... but:
– Lots of messages (1 multicast + N-1 unicasts)
– Ok for the most recent holder to re-enter the CS without any messages
• Variant scheme (Lamport) – multicast for total ordering:
– To enter, process Pi multicasts request(Pi, Ti) [same as before]
– On receipt of a message, Pj replies with an ack(Pj, Tj)
– Processes keep all requests and acks in an ordered queue
– If process Pi sees his request is earliest, can enter the CS... and when done, multicasts a release(Pi, Ti) message
– When Pj receives the release, it removes Pi's request from the queue
– If Pj's request is now earliest in the queue, Pj can enter the CS...
• Both Ricart & Agrawala's and Lamport's schemes have N points of failure: doomed if any process dies :-(
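Lamport's variant can be sketched as a per-process state machine (a sketch only; the class name and message-handling methods are illustrative, and message transport is left out):

```python
import heapq

class LamportMutex:
    """One process's view in Lamport's mutual-exclusion variant: keep all
    (timestamp, pid) requests in an ordered queue; enter the CS when our own
    request is earliest and every peer has acked it."""
    def __init__(self, pid, peers):
        self.pid, self.peers = pid, peers
        self.queue = []                  # heap of (T, pid) requests
        self.acked = set()               # peers that have acked our request

    def request(self, T):
        heapq.heappush(self.queue, (T, self.pid))
        self.acked = set()               # must hear from all peers again

    def on_request(self, T, pid):
        heapq.heappush(self.queue, (T, pid))   # (and we would reply with ack)

    def on_ack(self, pid):
        self.acked.add(pid)

    def on_release(self, pid):
        self.queue = [r for r in self.queue if r[1] != pid]
        heapq.heapify(self.queue)

    def may_enter(self):
        return (self.acked == set(self.peers)
                and bool(self.queue) and self.queue[0][1] == self.pid)

m = LamportMutex(1, [2, 3])
m.request(5)                  # our request: (5, 1)
m.on_request(3, 2)            # P2's earlier request: (3, 2)
m.on_ack(2); m.on_ack(3)
assert not m.may_enter()      # P2 is ahead of us in the queue
m.on_release(2)
assert m.may_enter()          # now earliest: enter the CS
```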
Summary + next time

• (More) vector clocks
• Consistent global state + consistent cuts
• Process groups and reliable multicast
• Implementing order
• Distributed mutual exclusion

• Leader elections and distributed consensus
• Distributed transactions and commit protocols
• Replication and consistency