Distributed Systems 15-440/15-640 – Fall 2019 · 12 – Distributed Replication

Distributed Systems - Synergy Labs · 2019. 10. 6.


  • Distributed Systems

    15-440/15-640 – Fall 2019

    12 – Distributed Replication

  • Fault Tolerance Techniques So Far?

    • Redundancy: information / time / physical redundancy
      • E.g., used in airplanes

    • Recovery: checkpointing and logging (ARIES)
      • E.g., used in commercial databases

    • Previous (concurrency) protocols rely on recovery techniques
      • E.g., Two-Phase Commit is not fault tolerant by itself

    • Why not always use these techniques? → Long wait in case of failure

  • Our Goal Today: Stay Up During Failures

    • Provide a service
    • Replicate the machines that serve clients
    • Survive the failure of up to f replicas
    • Provide identical service to a non-replicated version
      • (except more reliable, and perhaps different performance)

  • Outline for Today

    Consistency when content is replicated

    Primary-backup replication model

    Consensus replication model

  • Simple Examples of Replication

    • Replicated web sites
      • e.g., Yahoo! or Amazon:
      • DNS-based load balancing (DNS returns multiple IP addresses for each name)
      • Hardware load balancers put multiple machines behind each IP address

    • When is replication easy? When hard?
      • Workload assumptions

  • Read-only content

    • Easy to replicate - just make multiple copies of it.
    • Performance boost: get to use multiple servers to handle the load
    • Performance boost 2: locality. We'll see this later when we discuss CDNs; can often direct a client to a replica near it
    • Availability boost: can fail over (done at both the DNS level -- slower, because clients cache DNS answers -- and at the front-end hardware level)

  • But Read-write Data...

    • Requires write replication, and some degree of consistency

    • Strict Consistency
      • Read always returns value from latest write

    • Sequential Consistency
      • All nodes see operations in some sequential order
      • Operations of each process appear in-order in this sequence

  • Sequential Consistency (1)

    • Behavior of two processes operating on the same data item. The horizontal axis is time.
    • P1: writes value `a` to variable x, i.e., W(x)a
    • P2: reads `NIL` from x first and then `a`

    Adapted from: Tanenbaum & Van Steen, Distributed Systems: Principles and Paradigms, 2e, (c) 2007 Prentice-Hall, Inc. All rights reserved. 0-13-239227-5

  • Sequential Consistency (2)

    (a) A sequentially consistent data store.

    (b) A data store that is not sequentially consistent.
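To make the definition concrete, here is a small brute-force checker (an illustrative sketch, not from the slides): it searches for an interleaving that respects each process's program order and in which every read of x returns the value of the latest preceding write.

```python
def interleavings(seqs):
    """Yield every merge of the per-process histories that preserves
    each process's own program order."""
    if all(len(s) == 0 for s in seqs):
        yield []
        return
    for i, s in enumerate(seqs):
        if s:
            rest = seqs[:i] + [s[1:]] + seqs[i + 1:]
            for tail in interleavings(rest):
                yield [s[0]] + tail

def legal(history):
    """A legal sequential history: each read returns the value of the
    latest preceding write (None if nothing was written yet)."""
    current = None
    for op, value in history:
        if op == "W":
            current = value
        elif value != current:   # op == "R"
            return False
    return True

def sequentially_consistent(processes):
    """True iff some order-preserving interleaving is legal."""
    return any(legal(h) for h in interleavings([list(p) for p in processes]))

# Figure (a): P1 does W(x)a; P2 reads NIL, then a -> allowed,
# because the write can be ordered between P2's two reads.
ok = sequentially_consistent([[("W", "a")],
                              [("R", None), ("R", "a")]])

# Figure (b)-style history: two readers see the writes W(x)a and W(x)b
# in opposite orders -> no single sequential order exists.
bad = sequentially_consistent([[("W", "a")], [("W", "b")],
                               [("R", "b"), ("R", "a")],
                               [("R", "a"), ("R", "b")]])
```

The exponential search is fine for slide-sized examples; deciding sequential consistency in general is NP-hard, which is part of why real systems enforce it by construction rather than check it after the fact.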

  • But Read-write Data...

    • Requires write replication, and some degree of consistency

    • Strict Consistency
      • Read always returns value from latest write

    • Sequential Consistency
      • All nodes see operations in some sequential order
      • Operations of each process appear in-order in this sequence

    • Causal Consistency
      • All nodes see potentially causally related writes in same order
      • But concurrent writes may be seen in different order on different machines

  • Causal Consistency (1)

    This sequence is allowed with a causally-consistent store, but not with a sequentially consistent store.

  • Causal Consistency (2)

    A violation of a causally-consistent store.

    (W(x)a is causally related to W(x)b, because the writer of b first performed R(x)a.)
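Causal relationships between writes are typically tracked with vector clocks; the slides don't prescribe a mechanism, so this is one standard sketch. A write tagged with clock a causally precedes one tagged b iff a is componentwise at most b and they differ; otherwise the writes are concurrent and replicas may order them differently.

```python
def happened_before(a, b):
    """Vector clock comparison: a causally precedes b iff a <= b
    componentwise and a != b."""
    return all(x <= y for x, y in zip(a, b)) and a != b

def concurrent(a, b):
    """Neither write causally precedes the other; under causal
    consistency, replicas may apply them in different orders."""
    return not happened_before(a, b) and not happened_before(b, a)

# W(x)a tagged (1, 0); W(x)b issued after reading a, tagged (1, 1):
# causally related, so every node must see a before b.
related = happened_before((1, 0), (1, 1))

# A write tagged (0, 1) whose writer never saw (1, 0) is concurrent
# with it, so different nodes may see them in different orders.
free = concurrent((1, 0), (0, 1))
```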

  • But Read-write Data...

    • Requires write replication, and some degree of consistency

    • Strict Consistency
      • Read always returns value from latest write

    • Sequential Consistency
      • All nodes see operations in some sequential order
      • Operations of each process appear in-order in this sequence

    • Causal Consistency
      • All nodes see causally related writes in same order
      • But concurrent writes may be seen in different order on different machines

    • Eventual Consistency
      • All nodes will eventually learn about all writes; in the absence of further updates, replicas converge

  • Example of Consistency Guarantees

    • In practice we often have a choice

    • Google Mail
      • Sending mail is replicated to ~2 physically separated datacenters (users hate it when they think they sent mail and it got lost); mail will pause while doing this replication.
      • Q: How long would this take with 2-phase commit? In the wide area?

    • Marking mail read is only replicated in the background - you can mark it read, the replication can fail, and you'll have no clue (re-reading an already-read mail once in a while is no big deal)

    • Weaker consistency is cheaper if you can get away with it.

  • Replication Strategies

    What to replicate: state versus operations
    • Propagate only a notification of an update
      • Sort of an "invalidation" protocol
    • Transfer data from one copy to another
      • If the read-to-write ratio is high, can propagate logs (saves bandwidth)
    • Propagate the update operation to other copies
      • Don't transfer data modifications, only operations – "active replication"

    When to replicate: push vs pull
    • Pull based
      • Replicas/clients poll for updates (caches)
    • Push based
      • Server pushes updates (stateful)
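The state-versus-operations distinction can be sketched as follows (a hypothetical toy model, not from the slides): a replica either installs a copy of the state, or re-executes the operation itself, which is "active replication". The latter only stays consistent if operations are deterministic and applied in the same order everywhere.

```python
class Replica:
    def __init__(self):
        self.state = {}

    # State transfer: copy data from another replica.
    def install_state(self, state):
        self.state = dict(state)

    # Active replication: re-execute the update operation locally.
    def apply_op(self, op):
        name, args = op
        getattr(self, name)(*args)

    def put(self, key, value):
        self.state[key] = value

    def incr(self, key, delta):
        self.state[key] = self.state.get(key, 0) + delta

# Both replicas apply the same deterministic ops in the same order,
# so they converge to the same state.
ops = [("put", ("x", 1)), ("incr", ("x", 4))]
r1, r2 = Replica(), Replica()
for op in ops:
    r1.apply_op(op)
    r2.apply_op(op)

# State transfer hands a third replica the finished state in one step.
r3 = Replica()
r3.install_state(r1.state)
```

Shipping ops saves bandwidth when operations are small relative to the data they touch; shipping state is simpler and tolerates non-deterministic operations.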

  • Outline for Today

    Consistency when content is replicated

    Primary-backup replication model

    Consensus replication model

  • Assumptions Today

    • Group membership manager
      • Allows replica nodes to join/leave

    • Fail-stop (not Byzantine) failure model
      • Servers might crash, might come up again

    • Delayed/lost messages

    • Failure detector
      • E.g., process-pair monitoring, etc.

  • Primary-Backup: Remote-Write Protocol

    • Writes always go to the primary; read from any backup

    • Implementation
      • Stream the log

    • Common in practice
      • Simple

    • Are updates blocking?
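A minimal sketch of the remote-write idea (hypothetical classes; a real system streams the log asynchronously over a network): the primary appends each write to its log, pushes the entry to every backup, and only then acknowledges the client, which is exactly why updates block.

```python
class Backup:
    def __init__(self):
        self.log = []
        self.state = {}

    def apply(self, entry):
        # Apply a streamed log entry in order.
        self.log.append(entry)
        _, key, value = entry
        self.state[key] = value

    def read(self, key):
        # Reads may be served by any backup.
        return self.state.get(key)

class Primary:
    def __init__(self, backups):
        self.backups = backups
        self.log = []
        self.state = {}

    def write(self, key, value):
        # Writes always go to the primary.
        entry = (len(self.log), key, value)
        self.log.append(entry)
        self.state[key] = value
        for b in self.backups:   # blocking: push to every backup...
            b.apply(entry)
        return "ack"             # ...and ack only after all have applied

backups = [Backup(), Backup()]
primary = Primary(backups)
result = primary.write("x", 42)
```

Replying before the loop over backups finishes would turn this into the asynchronous replication mentioned later, trading consistency for latency.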

  • Local-Write P-B Protocol

    Primary migrates to the process wanting to process the update.

    For performance, use non-blocking operations. What does this scheme remind you of?

  • Primary-Backup Properties

    • This looks cool. How many failures can we deal with? What are some problems?
      • What do we do if a replica has failed?
      • We wait... how long? Until it's marked dead.

    • Advantage: with N servers, can tolerate the loss of N-1 copies

    • Not a great solution if you want very tight response time even when something has failed: must wait for the failure detector

    • Note: if you don't care about strong consistency (e.g., the "mail read" flag), you can reply to the client before reaching agreement with backups (sometimes called "asynchronous replication").

  • Outline for Today

    Consistency when content is replicated

    Primary-backup replication model

    Consensus replication model

  • Quorum-Based Consensus

    • Designed to have fast response time even under failures
    • Operates as long as a majority of machines is still alive

    • No master, per se
      • To handle f failures, must have 2f+1 replicas
      • Also, replicated-write ⇒ write to all replicas, not just one

    • Usually boils down to Paxos [Lamport]
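The 2f+1 arithmetic can be stated directly (a small illustrative sketch): with n = 2f+1 replicas, a majority quorum of f+1 is still reachable after f failures, and any two majorities overlap in at least one replica, which is what rules out two conflicting decisions.

```python
def replicas_needed(f):
    """Replicas required to tolerate f crash failures with majority quorums."""
    return 2 * f + 1

def majority(n):
    """Smallest quorum size such that any two quorums must intersect."""
    return n // 2 + 1

f = 2
n = replicas_needed(f)   # 5 replicas to survive 2 failures
q = majority(n)          # quorum of 3

# After f failures, enough live replicas remain to form a quorum:
live_enough = (n - f) >= q
# Any two quorums overlap (pigeonhole: 2q > n), so decisions can't conflict:
overlap = (2 * q) > n
```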

  • The Paxos Approach

    Decompose the problem:

    • Basic Paxos ("single decree"):
      • One or more servers propose values
      • System must agree on a single value as chosen
      • Only one value is ever chosen

    • Multi-Paxos:
      • Combines several instances of Basic Paxos to agree on a series of values forming the log

    Some slides adapted from: John Ousterhout & Diego Ongaro, Stanford University. Implementing Replicated Logs with Paxos. 2013.

  • Requirements for Basic Paxos

    • Correctness (safety):
      • Only a single value may be chosen
      • A machine never learns that a value has been chosen unless it really has been
      • The agreed value X has been proposed by some node

    • Liveness (termination):
      • Some proposed value is eventually chosen
      • If a value is chosen, servers eventually learn about it

    • Fault tolerance:
      • If fewer than N/2 nodes fail, the rest should eventually reach agreement w.h.p.
      • Liveness is not guaranteed

  • Fischer-Lynch-Paterson [FLP'85] Impossibility Result

    • Synchronous DS: bounded amount of time a node can take to process and respond to a request
    • Asynchronous DS: timeouts are not perfect

    It is impossible for a set of processors in an asynchronous system to agree on a binary value, even if only a single processor is subject to an unannounced failure.

  • Paxos Components

    • Proposers:
      • Active: put forth particular values to be chosen
      • Handle client requests

    • Acceptors:
      • Passive: respond to messages from proposers
      • Responses represent votes that form consensus
      • Store the chosen value and the state of the decision process

    • For this presentation:
      • Each Paxos server contains both components
      • Ignore the third role, aka Learner

    • "Round": (proposal, messages/voting, decision)
      • We may need several rounds

  • Strawman: Basic Two-Phase

    • Coordinator tells replicas: "Value V"
    • Replicas ACK
    • Coordinator broadcasts "Commit!"

    • This isn't enough
      • What if there's more than one coordinator at the same time?
      • What if a new coordinator chooses a different value?
      • What if some of the nodes or the coordinator fails during the communication?
      • What if there is a network partition?

  • Let's Discuss Some Problems & Solutions

    • Problem: can't trust a single node
      • Solution: everyone can potentially propose

    • Problem: several concurrent proposers
      • Solution: quorum (require a majority of acceptors)

    • Problem: split votes, no proposer reaches a majority
      • Solution: acceptors need to allow updating of their value

    • Problem: conflicting choices (due to updating)
      • Solution a): prioritize the proposal with the highest unique timestamp (Lamport clocks)
      • Solution b): once a majority has agreed on a value, future proposals are forced to propose/choose the same value

  • Single Decree Paxos: Informal Description

    • Phase 1: Prepare message
      • Find out about any chosen values
      • Block older proposals that have not yet completed

    • Phase 2: Accept message
      • Ask acceptors to accept a specific value

    • (Phase 3): Proposer decides
      • If majority again: chosen value, commit.
      • If no majority: delay and restart Paxos

    [Figure: message flow. The proposer sends Prepare to the acceptors, which check and return; the proposer waits for a majority, then sends Accept; the acceptors check again and return; the proposer waits for a majority, then reaches a Decision.]

  • Single Decree Paxos: Protocol

    Proposers:
    1) Choose new proposal number n, value v
    2) Broadcast Prepare(n) to all servers

    Acceptors:
    3) Respond to Prepare(n):
       If n > minProposal then minProposal = n; reply Prepare-OK(acceptedProposal, acceptedValue)
       else reply Prepare-REJECT()

    Proposers:
    4) When responses received from a majority:
       If any acceptedValues were returned, v = acceptedValue of the highest acceptedProposal
    5) Broadcast Accept(n, v) to all servers

    Acceptors:
    6) Respond to Accept(n, value):
       If n ≥ minProposal then acceptedProposal = minProposal = n; acceptedValue = value; reply Accept-OK()
       else reply Accept-REJECT()

    Proposers:
    7) When Accept-OK from a majority: value is chosen (commit)
       Else restart: go to 1, with a larger number n

    Acceptors must record minProposal, acceptedProposal, and acceptedValue on stable storage (disk).
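The two phases above can be sketched in a few lines. This is an in-memory toy with direct method calls; it ignores message loss, proposer restarts, and the stable-storage requirement, but it does implement the key safety rule in step 4.

```python
class Acceptor:
    def __init__(self):
        self.min_proposal = 0
        self.accepted_proposal = 0
        self.accepted_value = None

    def prepare(self, n):
        # Promise to ignore proposals older than n; report any accepted value.
        if n > self.min_proposal:
            self.min_proposal = n
            return ("ok", self.accepted_proposal, self.accepted_value)
        return ("reject", None, None)

    def accept(self, n, value):
        # Accept unless a newer Prepare has been promised in the meantime.
        if n >= self.min_proposal:
            self.accepted_proposal = self.min_proposal = n
            self.accepted_value = value
            return "ok"
        return "reject"

def propose(acceptors, n, value):
    quorum = len(acceptors) // 2 + 1

    # Phase 1: Prepare(n)
    oks = [r for r in (a.prepare(n) for a in acceptors) if r[0] == "ok"]
    if len(oks) < quorum:
        return None
    # Safety rule: if any acceptor already accepted a value, adopt the
    # value of the highest accepted proposal instead of our own.
    accepted = [(p, v) for _, p, v in oks if v is not None]
    if accepted:
        value = max(accepted)[1]

    # Phase 2: Accept(n, value)
    acks = [a.accept(n, value) for a in acceptors]
    return value if acks.count("ok") >= quorum else None

servers = [Acceptor(), Acceptor(), Acceptor()]
first = propose(servers, 1, "X")    # first round chooses X
second = propose(servers, 2, "Y")   # a later proposer is forced onto X
```

Once a majority has accepted "X", any later proposer's Prepare phase discovers it and must propose "X" too, so only a single value is ever chosen.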

  • Paxos Examples

    a) Successful round with a single proposer

    b) Dueling proposers

  • Some Remarks

    • Only the proposer knows the chosen value (majority accepted)
    • Only a single value is chosen → Multi-Paxos

    • No guarantee that the proposer's original value v is chosen by itself

    • Number n is basically a Lamport clock → always unique n
    • Key invariant:
      • If a proposal with value `v` is chosen, all higher proposals must have value `v`

    • Dueling proposers
      • Resolved using number n in Prepare

    • There are challenging corner cases
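One common way to get the globally unique, Lamport-clock-like n (an implementation assumption, not something the slide specifies): combine a local round counter with the server id, so numbers from different servers can never tie and a server can always jump above any number it has observed.

```python
NUM_SERVERS = 5   # hypothetical cluster size

def proposal_number(counter, server_id):
    """Totally ordered and unique across servers: higher rounds dominate,
    and ties between rounds are broken by server id."""
    return counter * NUM_SERVERS + server_id

def next_round(seen_n):
    """A round counter guaranteed to beat an observed proposal number."""
    return seen_n // NUM_SERVERS + 1

n1 = proposal_number(1, 2)                  # server 2, round 1
n2 = proposal_number(next_round(n1), 0)     # server 0 outbids it
```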


  • Paxos is widespread!

    • Industry and academia
      • Google: Chubby (distributed lock service)
      • Yahoo: ZooKeeper (distributed lock service)
      • MSR: Frangipani (distributed lock service)
    • Open-source implementations
      • Libpaxos (Paxos-based atomic broadcast)
      • ZooKeeper is open source, integrated w/ Hadoop

    Paxos slides adapted from Jinyang Li, NYU

  • Paxos History

    It took 25 years to come up with a safe protocol:

    • 2PC proposed in 1979 (Gray)
    • In 1981, Stonebraker proposed a basic, unsafe 3PC
    • In 1988, Brian Oki and Barbara Liskov created Viewstamped Replication, which has the core protocol.
    • In 1998, Lamport rediscovered it and explained the protocol formally, naming it Paxos
    • 2001: "Paxos Made Simple"
    • In 2014, Raft appears, presenting the Viewstamped Replication approach to Paxos as a cleanly isolated protocol.

  • More Remarks

    • Paxos is painful to get right, particularly the corner cases. Start from a good implementation if you can. See Yahoo's "ZooKeeper" as a starting point.

    • There are lots of optimizations to make the common (no or few failures) case go faster; if you find yourself implementing, research these.

    • Paxos is expensive. It is usually used for critical, smaller bits of data and to coordinate cheaper replication techniques such as primary-backup for big bulk data.

  • Beyond Paxos

    • Many follow-ups and variants
    • RAFT consensus algorithm
      • https://raft.github.io/
    • Great visualization of how it works
      • http://thesecretlivesofdata.com/raft/

  • Summary

    • Primary-backup
      • Writes handled by the primary, which streams the log to backup(s)
      • Replicas are "passive", follow the primary
      • Good: simple protocol. With N machines, can handle N-1 failures
      • Bad: slow response times in case of failures.

    • Quorum consensus
      • Designed to have fast response time even under failures
      • Replicas are "active" - participate in the protocol; there is no master, per se.
      • Good: clients don't even see the failures
      • Bad: more complex (corner cases). To handle f failures, must have 2f+1 replicas.
    • Quorumconsensus• Designedtohavefastresponsetimeevenunderfailures• Replicasare“active”-participateinprotocol;thereisnomaster,perse.• Good:Clientsdon’tevenseethefailures• Bad:Morecomplex(cornercases).Tohandleffailures,musthave2f+1replicas. 38