Append/Hflush/Read Design
Hairong Kuang, Konstantin Shvachko, Nicholas Sze, Sanjay Radia, Robert Chansler
Yahoo! HDFS team
08/06/2009
1. Design challenges

With hflush, HDFS needs to make the last block of an unclosed file visible to readers. This presents two challenges:

1. Read consistency. At a given time, different replicas of the last block may have different numbers of bytes. What read consistency should HDFS provide, and how can it guarantee that consistency even in case of failures?
2. Data durability. When an error occurs, the recovery cannot simply throw the last block away. Instead, the recovery needs to preserve at least the hflushed bytes while maintaining read consistency.
2. Replica/Block States

This document calls a block at a DataNode a replica to differentiate it from a block at the NameNode.

2.1. Need for new states

Pre-append/hflush, a replica at a DataNode is either finalized or temporary. When a replica is first created, it is in the temporary state on the DataNode. A temporary replica becomes finalized upon the close request of a client, when no more bytes will be written to this replica. On a DataNode restart, temporary replicas are removed. This is acceptable pre-append/hflush because HDFS provides best-effort durability for under-construction blocks. It is not acceptable once append/hflush are supported: HDFS needs to support strong durability for under-construction blocks that contain pre-append data and best-effort durability for hflushed data. So some temporary replicas need to be preserved across DataNode restarts.

2.2. Replica states (DataNode)

At a DataNode, this design introduces a replica being written (rbw) state and other states for handling errors. In a DataNode's memory, a replica can be in any of the following states:

Finalized: A finalized replica has finalized its bytes. No new bytes will be written to this replica unless it is reopened for append. Its data and metadata match. The other replicas of the same block id have the same bytes as this replica. But the generation stamp (GS) of a finalized replica does not remain constant; it may be bumped up as a result of error recovery.

Rbw (Replica Being Written to): Once a replica is created or appended, it is in the rbw state. Bytes are being written to this replica. It is always a replica of the last block of an unclosed file. Its length is not finalized yet. Its on-disk data and metadata may not match. Other replicas of the same block id may have more or fewer bytes than this one. Bytes in an rbw replica (possibly not all of them) are visible to readers. In case of any failure, HDFS tries to preserve the bytes in an rbw replica.

Rwr (Replica Waiting to be Recovered): If a DataNode dies and restarts, all its rbw replicas change to be in the rwr state. Rwr replicas will not be in any pipeline and therefore will not receive any new bytes. They will either become outdated or participate in a lease recovery if the client also dies.

Rur (Replica Under Recovery): A replica changes to be in the rur state when a replica recovery starts as a result of lease expiration. More details are discussed in the lease recovery section.

Temporary: A temporary replica is also a replica under construction but is created for the purpose of block replication or cluster balancing. It shares many of the properties of an rbw replica, but its data are invisible to any reader. If the replica construction fails or its DataNode restarts, a temporary replica will be deleted.

On a DataNode's disk, each data directory will have three subdirectories: current contains finalized replicas, tmp contains temporary replicas, and rbw contains rbw, rwr, and rur replicas. When a replica is first created by a request from a DFS client, it is put in the rbw directory. When a replica is first created for the purpose of replication or cluster balancing, it is put in the tmp directory. Once a replica is finalized, it is moved to the current directory. When a DataNode restarts, the replicas in the tmp directory are removed, the replicas in the rbw directory are loaded as rwr replicas, and the replicas in the current directory are loaded as finalized replicas. During DataNode upgrade, all replicas in directories current and rbw need to be kept in a snapshot.

2.3. Block states (NameNode)

NameNode also introduces several new states for a block. A block can be in any of the following states:

UnderConstruction: Once a block is created or appended, it is in the UnderConstruction state. Bytes are being written to this block. It is the last block of an unclosed file. Its length and GS are not finalized yet. Data in a block under construction (possibly not all of them) are visible to readers. A block under construction keeps track of its write pipeline (i.e., locations of valid rbw replicas) and the locations of its rwr replicas if the client dies.

UnderRecovery: When a file's lease expires, if the last block is UnderConstruction, it is changed to the UnderRecovery state once block recovery starts.

Committed: A committed block has finalized its bytes and generation stamp (GS), but NameNode has not yet seen at least one GS/length matched finalized replica from DataNodes. No new bytes will be written to this block and its GS will not be bumped unless it is reopened for append. In order to serve read requests, a committed block still needs to keep the locations of rbw replicas. It also needs to track the GS and length of its finalized replicas. An under construction block of an unclosed file is committed when NN is asked by the client to add a new block to the file or close the file. If the last block is in the committed state, the file cannot be closed and the client has to retry. AddBlock and close will be extended to include the last block's GS and length.

Complete: A complete block is a block whose length and GS are finalized and for which NameNode has seen a GS/len matched finalized replica. A complete block keeps only finalized replicas' locations. A file can be closed only when all of its blocks become complete.

Unlike a replica's state, a block's state does not persist on any disk. When NameNode restarts, the last block of an unclosed file is loaded as UnderConstruction. All the rest of the blocks are loaded as Complete.

More details about the above replica/block states are discussed in the rest of the document. A replica state transition diagram and a block state transition diagram are summarized in the last section.
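
The directory rules of section 2.2 can be sketched as a small lookup. This is only an illustration of the text above, not DataNode code: the enum, the `DIRECTORY_OF` table, and `state_after_restart` are hypothetical names introduced here.

```python
from enum import Enum


class ReplicaState(Enum):
    FINALIZED = "finalized"
    RBW = "rbw"
    RWR = "rwr"
    RUR = "rur"
    TEMPORARY = "temporary"


# On-disk subdirectory that holds a replica in a given in-memory state
# (hypothetical table mirroring the description in section 2.2).
DIRECTORY_OF = {
    ReplicaState.FINALIZED: "current",
    ReplicaState.TEMPORARY: "tmp",
    ReplicaState.RBW: "rbw",
    ReplicaState.RWR: "rbw",
    ReplicaState.RUR: "rbw",
}


def state_after_restart(directory):
    """Return the state a replica is loaded in after a DataNode restart,
    or None if the replica is removed."""
    if directory == "current":
        return ReplicaState.FINALIZED
    if directory == "rbw":
        return ReplicaState.RWR      # rbw, rwr and rur all come back as rwr
    if directory == "tmp":
        return None                  # temporary replicas are deleted
    raise ValueError(f"unknown data subdirectory: {directory}")
```

For example, a replica that was being written when the DataNode went down is found under rbw and comes back as rwr, while a replica created for balancing is simply dropped.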
3. Write/hflush

3.1. Block Construction Pipeline

An HDFS file consists of multiple blocks. Each block is constructed through a write pipeline. Bytes are pushed to the pipeline packet by packet. If no error occurs, a block construction goes through three stages, as shown in the following picture for a pipeline of 3 DataNodes (DN) and a block of 5 packets. In the picture, bold lines represent data packets, dotted lines represent ack messages, and regular lines represent control messages (setup/close). From t0 to t1 is the pipeline setup stage. From t1 to t2 is the data streaming stage, where t1 is the time when the first data packet gets sent and t2 is the time when the ack to the last packet gets received. Note that packet 2 is an hflushed packet. From t2 to t3 is the pipeline close stage.

[Figure: block construction pipeline with a client and three DataNodes (DN0, DN1, DN2), showing the pipeline setup stage (t0 to t1), the data streaming stage with packets 0 to 4 (t1 to t2), and the close stage (t2 to t3).]

Stage 1. Set up a pipeline

A Write_Block request is sent by a client downstream along the pipeline. After the last DataNode receives the request, an ack is sent by that DataNode upstream along the pipeline back to the client. As a result, network connections along the pipeline are set up and each DataNode has created or opened a replica for writing.

Stage 2. Data streaming

User data are first buffered at the client side. After a packet is filled up, the data are pushed to the pipeline. The next packet can be pushed to the pipeline before the ack for the previous packet is received. The number of outstanding packets is limited by the outstanding packets window size at the client side. If the user application explicitly calls hflush, a packet is pushed to the pipeline before it is filled up. Hflush is a synchronous operation: no data can be written before an acknowledgement for the flushed packet comes back.

Stage 3. Close (finalize a block and shut down the pipeline)

The client sends a close request only after all packets have been acknowledged at the client side. This ensures that if data streaming fails, the recovery does not need to handle the case in which some replicas have been finalized and some do not have all the data.

3.2. Packet handling at a DataNode

For each packet, a DataNode in the pipeline has to do 3 things.

1. Stream data
   a. Receive data from the upstream DataNode or the client
   b. Push the data to the downstream DataNode if there is any
2. Write the data/crc to its block file/meta file.
3. Stream ack
   a. Receive an ack from the downstream DataNode if there is any
   b. Send an ack to the upstream DataNode or the client

Note that the numbers above do not indicate the order in which the three things must be executed. Streaming ack (3) is done after streaming data (1) by the definition of a pipeline. But in theory writing data to disk (2) could be done any time after 1.a. This algorithm chooses to do it right after 1.b and before receiving the next packet.

Each DataNode has two threads per pipeline. The data thread is responsible for data streaming and disk writing. For each packet, it does 1.a, 1.b, and 2 in sequence. Once a packet is flushed to the disk, it can be removed from the in-memory buffer. The ack thread is responsible for ack streaming. For each packet, it does 3.a and 3.b in sequence. Since the data thread and the ack thread run concurrently, there is no guarantee on the order of (2) and (3). The ack of a packet might be sent before the packet is flushed to the disk.

This algorithm provides a tradeoff among write performance, data durability, and algorithm simplicity. It

1. Improves data durability against failures by writing data to disk right away rather than later when the ack is received;
2. Parallelizes data/ack streaming in the downstream pipeline and on-disk writing;
3. Simplifies buffer management since there is at most one packet in memory per pipeline.

3.3. Consistency support

• When a client reads bytes from an rbw replica, the DataNode that it reads from may not make all the bytes that it received visible to the client.
• Each rbw replica maintains two counters:
  1. BA: number of bytes that have been acknowledged by the downstream DataNodes. Those are the bytes that the DataNode makes visible to any reader. In the rest of the document, we may interchangeably call it the replica's visible length.
  2. BR: number of bytes that have been received for this block, including the bytes written to its block file and in-DataNode-buffer bytes.
• Assume initially all DataNodes in the pipeline have (BA, BR) = (a, a). Then a client pushes a packet of b bytes to the pipeline and no other packets are pushed to the pipeline before the client receives an ack for the packet.
  1. A DataNode changes its (BA, BR) to be (a, a+b) right after step 1.a.
  2. A DataNode changes its (BA, BR) to be (a+b, a+b) right after step 3.a.
  3. When a success ack is sent back to the client, all DataNodes in the pipeline have (BA, BR) = (a+b, a+b).
• A pipeline of N DataNodes DN0, …, DNi, …, DNN-1, where DN0 is the first in the pipeline, i.e., the closest to the writer, has the following property: at any given time t,

  $BA_0^t \le BA_1^t \le \dots \le BA_{N-1}^t \le BR_{N-1}^t \le \dots \le BR_1^t \le BR_0^t$

  where $BA_i^t$ is BA of the block at DNi at time t and $BR_i^t$ is BR of the block at DNi at time t. Note that this property guarantees that once a byte becomes visible, all DataNodes in the pipeline have the byte.
• Assume $BS_c^t$ is the number of bytes that a client has sent to the pipeline at time t and $BA_c^t$ is the number of bytes that the client has received ack for. We have

  $BA_c^t \le BA_0^t \le \dots \le BA_{N-1}^t \le BR_{N-1}^t \le \dots \le BR_0^t \le BS_c^t$

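
The (BA, BR) bookkeeping of section 3.3 can be simulated in a few lines. The sketch below assumes a single outstanding packet, as in the text, and checks the ordering that the definitions imply: BA is non-decreasing from DN0 towards the last DataNode, BR is non-increasing, and the last DataNode's BA never exceeds its BR. `Replica`, `push_packet`, and `invariant_holds` are hypothetical names used only for this illustration.

```python
class Replica:
    """Per-DataNode counters of an rbw replica."""

    def __init__(self, a=0):
        self.ba = a   # bytes acknowledged (visible length)
        self.br = a   # bytes received


def invariant_holds(pipeline):
    """Check BA_0 <= ... <= BA_{N-1} <= BR_{N-1} <= ... <= BR_0."""
    bas = [r.ba for r in pipeline]
    brs = [r.br for r in pipeline]
    chain = bas + brs[::-1]
    return all(x <= y for x, y in zip(chain, chain[1:]))


def push_packet(pipeline, b):
    """Push one packet of b bytes; yield after every counter update."""
    for replica in pipeline:             # data flows downstream (step 1.a)
        replica.br += b
        yield
    for replica in reversed(pipeline):   # acks flow upstream (step 3.a)
        replica.ba += b
        yield


pipeline = [Replica(10) for _ in range(3)]
for _ in push_packet(pipeline, 5):
    assert invariant_holds(pipeline)
```

After the loop every DataNode holds (BA, BR) = (15, 15), which corresponds to the moment the success ack reaches the client.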
4. Read

When an unclosed file is opened for read, the challenge is how to provide the consistency guarantee if the last block is under construction. The algorithm needs to make sure that a byte read at DataNode DNi can also be read at another DataNode DNj, even if BAi > BAj.

Algorithm 1:
• When a reader reads an under construction block, it first asks one of the replicas for its BA by sending a request to the DataNode.
• If an application tries to read a byte beyond BA of the block, the dfs client throws an EOFException.
• Only a read request that reads from a position less than the visible length of the last block will be forwarded to a DataNode. When a DataNode gets a read request for a range of bytes that lies below its BR, it returns the bytes.
• Assume that a read request is a triple (blck, off, len), where blck contains a block id and its generation stamp, off is the starting offset in the block from which to read, and len is the number of bytes to read.
• A DataNode can serve the request if the DataNode has a replica with an equal or newer GS.
• The sum of off and len must be less than or equal to BAj, assuming DNj is the DataNode from which the dfs client fetched the block length.
• Assume that the read request is sent to DataNode DNi and the replica's state is (BAi, BRi).
  1. If off+len <= BAi, DNi can safely send len bytes back to the dfs client starting at off.
  2. If off+len > BAi, then because off+len <= BAj, we have BAj > BAi. DNi must be upstream of DNj in the pipeline, i.e., closer to the writer than DNj is. So BRi >= BRj >= BAj, and therefore BRi >= off+len. This means DNi must have the bytes that the dfs client wants to read. DNi delivers the bytes to the client.
  3. off+len should never be greater than BRi. If this ever happens, the DataNode logs the error and rejects the request.
• If DNi goes down while serving the request, the dfs client can switch to read from any other DataNode containing a replica of the block.
• This algorithm is simple, but it requires a reopen of a file to get new data, because the length of the last block is fetched before the read and a dfs client cannot read beyond that length.
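
The DataNode side of Algorithm 1 reduces to a few comparisons. The following is a minimal sketch under the assumptions stated in the bullets above; `serve_read` and its string results are hypothetical and only illustrate the three cases.

```python
def serve_read(replica_gs, ba_i, br_i, req_gs, off, length):
    """Decide how DNi answers a read request (blck, off, len).

    replica_gs, ba_i, br_i describe the local replica; req_gs is the
    generation stamp carried in the request.
    """
    if replica_gs < req_gs:
        return "reject: replica GS is older than the requested GS"
    end = off + length
    if end <= ba_i:
        return "serve"    # case 1: all requested bytes are visible here
    if end <= br_i:
        return "serve"    # case 2: visible at DNj, already received here
    # case 3: should never happen; log the error and reject
    return "reject: request goes beyond bytes received"
```

Cases 1 and 2 return the same answer; they are kept apart only to mirror the argument in the text, where case 2 relies on BRi >= BAj >= off+len.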

Algorithm 2:
• This algorithm lets a dfs client, i.e., a reader, perform the consistency control, and DataNodes deliver bytes.
• A read request is a triple (blck, off, len), where blck contains a block id and its generation stamp, off is the starting offset in a block from which to read, and len is the number of bytes to read.
• A DataNode can serve the request if the DataNode has a replica with an equal or newer GS.
• Assume that the block has a state (BAi, BRi). DNi sends bytes [off, MIN(off+len, BRi)) to the client along with its BAi.
• The client receives and buffers the data. It also keeps track of the maximum BA that it has seen and only delivers bytes to the application up to the maximum BA.
• If the read from DNi fails, the dfs client can switch to read from any other DataNode containing a replica of the block.
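• How is read consistency guaranteed? Assume we have a pipeline of N DataNodes DN0, …, DNi, …, DNN-1, where DN0 is the first in the pipeline. The number of bytes that a client has delivered to an application at time t never exceeds the maximum BA that the client has seen, which in turn is at most $BA_{N-1}^t$. By the pipeline property of section 3.3, $BA_{N-1}^t \le BR_{N-1}^t \le \dots \le BR_0^t$. So no matter which DataNode the client reads from, that DataNode has the bytes the client read before.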
• This algorithm requires a change of the read protocol, and the dfs client is more complicated since it needs to control read consistency. But the algorithm does not require a reopen in order to read new data.

For HADOOP 0.21, we are still discussing whether to implement algorithm 1 or algorithm 2.
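
The client-side bookkeeping that Algorithm 2 asks for can be sketched as follows. This is an illustration only: the class and method names are hypothetical, and the sketch assumes the client buffers one block contiguously from offset 0.

```python
class Algorithm2Reader:
    """Buffers bytes from DataNodes and releases them up to the max BA seen."""

    def __init__(self):
        self.max_ba = 0             # largest BA reported by any DataNode
        self.buffer = bytearray()   # bytes of the block, starting at offset 0
        self.delivered = 0          # bytes already handed to the application

    def on_reply(self, off, data, ba):
        """Handle a DataNode reply: bytes starting at off, plus that node's BA."""
        self.max_ba = max(self.max_ba, ba)
        end = off + len(data)
        # Append only the part that extends the contiguous buffer.
        if off <= len(self.buffer) < end:
            self.buffer += data[len(self.buffer) - off:]

    def deliver(self):
        """Return the bytes that may now be given to the application."""
        limit = min(self.max_ba, len(self.buffer))
        out = bytes(self.buffer[self.delivered:limit])
        self.delivered = limit
        return out
```

Because `max_ba` never decreases, switching to a DataNode with a smaller BA does not take back bytes that were already delivered; the extra bytes it sends simply wait in the buffer until some DataNode reports a larger BA.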

5. Append

5.1. Append API support

1. Client sends an append request to NN.
2. NN checks the file and makes sure that it is closed. Then NN checks the file's last block. If it is not full and has no replica, the append fails. Otherwise, NN changes the file to be under construction. If the last block is full, NN allocates a new last block. If the last block is not full, NN changes this block to be an under construction block, with its finalized replicas as its initial pipeline. It returns the block id, generation stamp, length, and its locations. If the last block is not full, it also needs to return a new generation stamp.
3. Set up a pipeline for append if the last block is not full. Otherwise set up a pipeline for create. See the pipeline setup section for more details.
4. If the last block does not end at a checksum chunk boundary, read the last partial crc chunk. This is for the purpose of calculating checksums.
5. The rest is the same as a regular write.
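
Step 2 of the append flow is essentially a decision on the state of the last block. A minimal sketch, with a hypothetical function name and a plain dictionary standing in for the NameNode's reply:

```python
def nn_handle_append(file_is_closed, last_block_full, last_block_replicas):
    """Illustrate the NameNode-side checks for an append request."""
    if not file_is_closed:
        raise ValueError("append requires a closed file")
    if not last_block_full and not last_block_replicas:
        raise ValueError("last block is partial and has no replica")
    if last_block_full:
        # A fresh last block is allocated; the client sets up a create pipeline.
        return {"pipeline": "create", "new_block": True, "new_gs": False}
    # The partial last block is reopened with its finalized replicas as the
    # initial pipeline, and a new generation stamp is handed out.
    return {
        "pipeline": "append",
        "new_block": False,
        "new_gs": True,
        "initial_pipeline": list(last_block_replicas),
    }
```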

5.2. Durability support

• NN makes sure that the number of replicas for the Complete blocks that contain pre-append data meets the file's replication factor.
• The durability of an UnderConstruction block that contains pre-append data is omitted in this design for now.

6. Error Handling

6.1. Pipeline Recovery

When a block is under construction, an error may occur at Stage 1 when a pipeline is being set up, at Stage 2 when data are streaming to the pipeline, or at Stage 3 when the pipeline is being closed. The pipeline recovery handles the case in which one of the DataNodes in the pipeline has an error.

6.1.1. Recover from pipeline setup failure

If a DataNode detects a failure when a pipeline is being set up, the DataNode closes the block file and closes all its tcp/ip connections after a failure acknowledgement is sent to the upstream DataNode. Once the client detects the failure, it handles the failure differently depending on the purpose of setting up the pipeline.

• If the pipeline was built for creating a new block, the client simply abandons the block and asks NameNode for a new block. It then starts to build a pipeline for the new block.
• If the pipeline was built for appending to a block, it rebuilds a pipeline with the remaining DataNodes and bumps the block's generation stamp. See section 7 (pipeline setup) for more details.

One special case of pipeline setup failure is an access token error: one of the DataNodes complains that the access token is invalid when an access token is used to set up a pipeline. If a pipeline setup failure is caused by an expired access token, the dfs client should rebuild the pipeline with all the DataNodes in the previous pipeline. Current trunk (0.21) avoids this special handling by always fetching a new access token from NameNode right before setting up a pipeline. This document will keep the same design.

6.1.2. Recover from data streaming failure

• At a DataNode an error may occur at either 1.a, 1.b, 2, 3.a, or 3.b as explained in Section 3.2. Whenever an error occurs, a DataNode takes itself out of the write pipeline: it closes all the tcp/ip connections, writes all buffered bytes onto disk if the error does not occur at 3, and closes the on-disk files.
• When the dfs client detects a failure, it stops sending data to the pipeline.
• The dfs client reconstructs a write pipeline using the remaining DataNodes. See section 7 (pipeline setup) for more details. As a result, all replicas of the block are bumped to a new generation stamp.
• The dfs client resumes sending data with the new generation stamp starting from BAc. Note that an optimization could be that the client resumes sending bytes starting at MIN(BRi, for all DataNodes DNi in the new pipeline).
• When a DataNode receives a packet that it already has, the data stream simply pushes the data downstream without writing it to the disk.

This recovery algorithm has a nice property: any bytes that were visible to any client, even from a down DataNode with the largest BA of the old pipeline, continue to be visible to any reader during and after a pipeline recovery. This is because the pipeline recovery does not decrease any DataNode's BA or BR. Furthermore, at any time during the pipeline recovery the new pipeline maintains the property described in section 3.3 (consistency support).

6.1.3. Recover from a close failure

Once the client detects the failure, it rebuilds a pipeline with the remaining DataNodes. Each DataNode bumps the block's generation stamp and finalizes the replica if it is not finalized yet. The network connection is torn down after an ack is sent. See section 7 (pipeline setup) for more details.

6.2. DataNode Restart

• When a DataNode restarts, it reads each replica under directory rbw and loads the replica in memory as WaitingToBeRecovered. Its length is set to be the maximum number of bytes that match its crc.
• A WaitingToBeRecovered replica does not serve any read and does not participate in a pipeline recovery.
• A WaitingToBeRecovered replica will either become outdated and be deleted by NN if the client is still alive, or be changed to finalized as a result of lease recovery if the client dies.

6.3. NameNode Restart

• None of the block states are persisted on disk. So when NameNode restarts, it needs to restore each block's state. The last block of an unclosed file becomes UnderConstruction no matter what its state was before the restart. Other blocks become Complete.
• Ask each DataNode to register and send its block report including finalized, rbw, rwr, and rur replicas.
• NameNode does not exit safe mode unless the number of complete and under construction blocks that have received at least one replica reaches the pre-defined threshold.

6.4. Lease Recovery

When a file's lease expires, NN needs to close the file on behalf of the client. There are two issues: (1) Concurrency control: what if a lease recovery is performed while the client is still alive, in the process of setting up a pipeline, writing, closing, or recovering? What if there are multiple concurrent lease recoveries? (2) Consistency guarantee: if the last block is under construction, all its replicas need to roll back to a consistent state: all replicas have the same on-disk length and the same new generation stamp.

1. NN renews the lease, changes the file's lease holder to be dfs, and persists the change to its edit log. So if the client is still alive, any of its write-related requests, like asking for a new generation stamp, getting a new block, or closing the file, will be rejected because the client is not the lease holder any more. This prevents the client from concurrently changing an unclosed file if it ever contacts the NameNode.
2. NN checks the state of the last two blocks of the file. Other blocks should be in the complete state. The following table shows all the possible combinations and the action to take for each combination.

| Penultimate block | Last block | Actions |
| Complete | Complete | Close the file |
| Complete | Committed | Retry closing the file when the lease expires next time; force to close the file after a certain number of retries |
| Committed | Complete | Same as above |
| Committed | Committed | Same as above |
| Complete | UnderConstruction | Start block recovery for the last block |
| Committed | UnderConstruction | Same as above |
| Complete | UnderRecovery | Start a new block recovery for the last block; stop recovery after a certain number of retries |
| Committed | UnderRecovery | Same as above |
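
The lease recovery table maps the states of the last two blocks to an action, which can be written as a small function. The function name and the action strings are hypothetical; retry counting is left out.

```python
def lease_recovery_action(penultimate, last):
    """Return the action NN takes for the states of the last two blocks."""
    if penultimate not in ("Complete", "Committed"):
        raise ValueError(f"unexpected penultimate block state: {penultimate}")
    if last == "UnderConstruction":
        return "start block recovery"
    if last == "UnderRecovery":
        return "start new block recovery"
    if last in ("Complete", "Committed"):
        if penultimate == "Complete" and last == "Complete":
            return "close file"
        # At least one of the two blocks is only Committed.
        return "retry close"
    raise ValueError(f"unexpected last block state: {last}")
```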

6.5. Block Recovery

1. NN chooses a primary DataNode (PD) to work as the proxy of NameNode to perform block recovery. PD could be any DataNode where one of the block's replicas resides. If none of its replicas are known, block recovery aborts.
2. NN gets a new GS, which marks the generation that the block is going to be bumped to when the recovery successfully finishes. It then changes the last block, if it is UnderConstruction, to be UnderRecovery. The UnderRecovery block is stamped with a unique recovery id, which is the new GS that the block is going to be bumped to. Any communication from a PD to NN needs to match this recovery id. This is how concurrent block recoveries are handled. The basic rule is that the latest recovery always preempts previous recoveries.
3. NN then asks PD to recover the block. NN sends PD the new GS, the block id and its generation stamp, and all its replica locations including finalized replicas, replicas being written to, and replicas waiting to be recovered.
4. PD performs block recovery:
   a. PD asks each DataNode where a replica is located to perform replica recovery.
      i. PD sends each DataNode the recovery id, block id and generation stamp;
      ii. Each DataNode checks its replica state:
         1. Check existence: If the DataNode does not have the replica, or the replica is older than the block's GS in the request or newer than the recovery id (this is not supposed to happen), throw a ReplicaNotExistsException.
         2. Stop writer: If it is a replica being written to and there is an ongoing writer thread, interrupt the writer and wait for the writer to exit. When a writer thread is interrupted in the middle of receiving a packet, it stops and throws away the partial packet. Before the thread exits, it makes sure that the on-disk bytes are the same as BR and then closes the block and crc files. This handles concurrent client writes and block recovery at DataNodes. Block recovery preempts client writes, resulting in a pipeline failure. The subsequent pipeline recovery will fail because the dfs client cannot get a new generation stamp from NN for a block under recovery.
         3. Stop previous block recovery: If the replica is already in the rur state, throw a RecoveryInProgressException if its recovery id is greater than or equal to the new recovery id. If the new GS is greater, stamp the rur replica's recovery id to be the new one.
         4. State change: Otherwise, change the replica to be rur. Set its recovery id to be the new recovery id and keep a reference to its old state. Any communication from a PD to this DataNode needs to match this recovery id. Note that 3 and 4 handle concurrent block recoveries at DataNodes. The latest recovery always preempts previous recoveries and no two recoveries can be interleaved.
         5. Crc check: Then perform a CRC check for the block file. If there is a mismatch, throw a CorruptedReplicaException if the replica is rbw or finalized. If the replica is rwr, truncate the block file to the last matched byte.
      iii. If no exception is thrown, each DataNode returns PD its replica status <replica id, replica GS, replica on-disk len, pre-recovery state>.
   b. After receiving a reply from each DataNode, PD decides the block length that all replicas should agree on.
      i. If one DataNode throws a RecoveryInProgressException, PD aborts block recovery.
      ii. If all DataNodes throw an exception, PD aborts block recovery.
      iii. If max(Leni for all reported DNi) == 0, PD asks NN to remove this block.
      iv. Otherwise, check the returned state of the replicas with non-zero length. The following table shows all the possible combinations of states in an example of two replicas and the length to agree on for each combination.

| Case | Replica 1 state | Replica 2 state | Length to be agreed on |
| 1 | Finalized | Finalized | The two replicas should have the same length; if not, there is an error: log it and abort block recovery |
| 2 | Finalized | rbw | The two replicas should have the same length because the client must have died when the pipeline was being set up or torn down. If they are not the same, exclude the rbw replica. |
| 3 | Finalized | rwr | Set the new length to be the length of the finalized replica and exclude the rwr replica. |
| 4 | rbw | rbw | Set the new length to be MIN(len1, len2), where leni is the length of replica i. |
| 5 | rbw | rwr | Exclude the rwr replica. This becomes the same as case 4. |
| 6 | rwr | rwr | Set the new length to be MIN(len1, len2). In this case, leni may not equal BRi, so there is no guarantee on visible bytes since both DataNodes died. |
   c. Recover the replicas that participated in the length agreement in step b.iv.
      i. PD asks each DataNode to recover the replica. PD sends the block id, new GS, and new length.
      ii. If the DataNode does not have the replica in the rur state or its recovery id does not match the new GS, fail the replica recovery at the DataNode.
      iii. Otherwise, the DataNode changes the replica's GS to be the new GS both on disk and in memory. It then updates the in-memory replica length to be the new length, truncates the block file to the new length, and changes the crc file accordingly (this may cause truncation and/or modification of the last 4 crc bytes). It finalizes the replica if the replica is not finalized yet. The replica recovery succeeds.
   d. PD checks the result of c. If no DataNode succeeds, block recovery fails. If some succeed and some fail, PD gets a new generation stamp from NN and repeats block recovery with the successful DataNodes. If all DataNodes succeed, PD notifies NN of the new GS and length. NN finalizes the block and closes the file if all blocks of the file change to the Complete state. NN forces the file to close after a limited number of close retries.

This lease recovery algorithm also guarantees that bytes that were visible to a client do not get removed as a result of the recovery, as long as at least one DataNode in the pipeline is still alive and its data are not corrupted. This is because

1. In cases 1, 2, and 3, there is a finalized replica. The client must have died during block construction stage 1 or 3. The algorithm does not remove any byte.
2. In cases 4 and 5, all replicas to be recovered are in the rbw state. The client must have died during block construction stage 2. Assume the pre-recovery pipeline has N DataNodes: DN0, DN1, …, DNN-1. The length returned by DNi in step 4.a.ii must be equal to BRi. Assume that a subset S of the DataNodes in the pipeline participates in the length agreement. The new length is MIN(BRi, for all DataNodes in S) >= BRN-1 >= BAN-1 >= … >= BA0. This guarantees that the lease recovery does not remove data that have been delivered to any reader.
3. In case 6, the algorithm does not provide any guarantee since all DataNodes in the pre-recovery pipeline had been restarted.
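
The length agreement of step 4.b.iv is tabulated for two replicas only. The sketch below extends it to any number of replicas in what seems the natural way; that generalization is an assumption of this example, not something the design states. Replicas are hypothetical `(state, length)` pairs with non-zero length.

```python
def agree_on_length(replicas):
    """Return (new_length, participating_replicas) for a block recovery."""
    if not replicas:
        raise ValueError("no replica reported a non-zero length")
    finalized = [r for r in replicas if r[0] == "finalized"]
    rbw = [r for r in replicas if r[0] == "rbw"]
    rwr = [r for r in replicas if r[0] == "rwr"]
    if finalized:
        lengths = {length for _, length in finalized}
        if len(lengths) != 1:
            # Case 1 with a mismatch: log the error and abort block recovery.
            raise ValueError("finalized replicas disagree on length")
        new_len = lengths.pop()
        # Cases 2 and 3: drop rwr replicas and rbw replicas of another length.
        keep = finalized + [r for r in rbw if r[1] == new_len]
        return new_len, keep
    if rbw:
        # Cases 4 and 5: rwr replicas are excluded.
        return min(length for _, length in rbw), rbw
    # Case 6: only rwr replicas are left.
    return min(length for _, length in rwr), rwr
```

For two rbw replicas of 80 and 120 bytes this yields a new length of 80, matching case 4; adding an rwr replica of 60 bytes does not change the result, matching case 5.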

7. Pipeline Set Up

7.1. Causes of pipeline setup

There are five cases in which a pipeline needs to be set up:

1. Create: When a new block is created, a pipeline needs to be constructed before any bytes are streamed to any DataNode.
2. Append: When a file is to be appended and the last block of the file is not full, a pipeline of all DataNodes that have a replica of the last block needs to be set up before any new bytes are streamed to any DataNode.
3. Append recovery: When case 2 fails, a pipeline containing the remaining DataNodes needs to be set up.
4. Data streaming recovery: If data streaming fails, a pipeline of the remaining DataNodes needs to be set up before the data streaming resumes.
5. Close recovery: If pipeline close fails, a pipeline of the remaining DataNodes needs to be set up in order to finalize the block.

7.2. Pipeline setup steps

1. Cases 2, 3, 4, and 5 build a pipeline on an existing block, so the block's generation stamp needs to be bumped along with pipeline construction. The dfs client asks NN for a new generation stamp.
2. The dfs client sends a write block request to the DataNodes in the new pipeline with the following parameters (the list is not exhaustive): block id with the old generation stamp, block length (the number of bytes a replica must have, or must at least have), max replica length, flags, and/or a new generation stamp.

| Cases | Block id/generation stamp | Block len | Flags | New generation stamp | Max block len |
| 1 (Create) | yes | 0 | No flag is set | no | no |
| 2 (Append) | yes | Pre-append block len | Append flag is set | yes | no |
| 3 (Append Recovery) | yes | The same as 2 | Append and recovery flags are set | yes | no |
| 4 (Data streaming recovery) | yes | BAc | Recovery flag is set | yes | BSc |
| 5 (Close recovery) | yes | BAc = BSc | Close flag is set | yes | no |

3. The following table shows the behavior when a DataNode receives a pipeline setup request. Note that rwr replicas do not participate in the pipeline recovery. We could relax this restriction with some special handling of the rwr replicas. But since this is a very rare case, we choose not to do it in this round of design.

| Cases | Sanity Check | Replica state change | GS change |
| 1 (Create) | A replica with the same block id should not exist | Create an rbw replica with (BA, BR) = (0, 0) | no |
| 2 (Append) | Finalized replica; its on-disk len should match the pre-append len | Open the replica for write; set the write stream offset at the end of file; the replica becomes rbw: (BA, BR) = (preAppendLen, preAppendLen) | Set to new GS |
| 3 (Append Recovery) | Finalized or rbw; replica length (on-disk or BR) should match the pre-append len; rbw replica GS could be the same or newer | If finalized, do the same as above; if rbw, wait for the writer to exit; open the block for write; set the write stream offset at the end of file | Set to new GS |
| 4 (Data Streaming Recovery) | rbw replica; the same or newer GS; BAc <= BAi <= BRi <= BSc | Wait for the writer to exit; open the block for write; set the write stream offset at the end of file | Set to new GS |
| 5 (Close Recovery) | rbw or finalized; the same or newer GS; replica length should be the same as BAc | If rbw, wait for the writer to exit and then finalize the replica; close the pipeline when the ack is sent back | Set to new GS |

4. In cases 2, 3, and 4, on a successful pipeline setup, the dfs client notifies NN of the new GS, min length, and the DataNodes in the new pipeline. NN then updates the under construction block's generation stamp, length, and locations.
5. If the pipeline setup fails and at least one DataNode remains, go to step 1 with a recovery flag set. Otherwise, since the pipeline has no more DataNodes, mark this pipeline as failed. If a user application is blocked in hflush/write, it will be unblocked and get an EmptyPipelineException. Otherwise the next write/hflush/close will get an EmptyPipelineException immediately.
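
The sanity checks a DataNode applies in step 3 can be condensed into one predicate. This is a sketch only: the `Replica` record, the case names, and the parameter names are hypothetical, and a finalized replica is modelled with BA = BR = on-disk length.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    state: str   # "finalized" or "rbw"
    gs: int      # generation stamp
    ba: int      # bytes acknowledged
    br: int      # bytes received (on-disk length for a finalized replica)


def sanity_check(case, replica, gs, block_len, max_len=None):
    """Return True if a pipeline setup request passes the DataNode's check.

    gs and block_len come from the request; max_len is only sent for
    data streaming recovery (BSc).
    """
    if case == "create":
        return replica is None
    if replica is None:
        return False
    if case == "append":
        return replica.state == "finalized" and replica.br == block_len
    if case == "append_recovery":
        if replica.state == "finalized":
            return replica.br == block_len
        return (replica.state == "rbw" and replica.br == block_len
                and replica.gs >= gs)
    if case == "streaming_recovery":
        return (max_len is not None and replica.state == "rbw"
                and replica.gs >= gs
                and block_len <= replica.ba <= replica.br <= max_len)
    if case == "close_recovery":
        return (replica.state in ("rbw", "finalized") and replica.gs >= gs
                and replica.br == block_len)
    raise ValueError(f"unknown pipeline setup case: {case}")
```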

8. Report Replica/Block State/Meta Information Change to NN

8.1. Client Reports

A client informs NN of an under construction block's meta information change or state change. As discussed in the pipeline setup section, in cases 2 (append), 3 (append recovery), and 4 (data streaming recovery), after a new pipeline is set up, a client reports to NN the block's new GS and the DataNodes in the pipeline. NN then updates the under construction block's GS, length, and locations.

Note that in this design, after a pipeline for creating a block is set up, a client does not report the newly created block and its locations to NN. Instead, when the client issues addBlock/append to ask for a new block, NN puts the new block and its locations into the blocksMap before NN returns the block and locations to the client. This design has a minor flaw. If a new reader happens to read the last block between the time the block is added to the blocksMap at NameNode and the time a replica of the block is created on a DataNode, the reader may get a "block does not exist" error. But since the chance of having a reader during this short time frame is very slim, we consciously make this design decision to trade for performance: a client does not need to send a notification to NN after a pipeline is set up for every block create.

When a client issues addBlock or close (a file), NN will finalize the last block's GS and length. The last block moves to the Complete state if it already has a GS/len matched finalized replica; otherwise it moves to the Committed state. In addition, if the number of replicas of the last block is less than its replication factor, NN explicitly replicates the block to reach its replication factor.

8.2. DataNode Reports

A DataNode reports a replica's meta information or state change by periodically sending NN a block report, or by sending NN a blockReceived message when an rbw replica is finalized.
16
8.3. BlockReports• Eachblockreportcontainstwolists:oneforfinalizedreplicas,onefor
rbw.Thefinalizedreplicaslistincludefinalizedreplicasandunderrecoveryreplicaswhoseoldstatearefinalized.Therbwreplicalistincludesrbwreplicas,rwrreplicas,andrurreplicaswhoseoldstatearenotfinalized.Anrbwreplica’slengthisitsbytesreceived(BR).Thelengthofanrwrisanegativenumber.
• Noweachreportedreplica’sstateisaquadruple<DataNode,blck_id,blck_GS,blck_len,isRbw>.
• AfterNNreceivesablockreport,compareitwithwhat’sinthememoryandgenerate3lists:(doweneedaupdateStateList?)
o deleteListifblck_idisnotvalid,i.e.noentryinblocksMaporbelongstonofile.
o addStoredBlockListifNNdoesnothavethereplica<DataNode,blck_id>buttheblockreporthasit.
o rmStoredBlockListifNNhasthereplica<DataNode,blck_id>buttheblockreportdoesnothaveit.
o updateStateListifthereplica’sstateinNNisrbwbuttheblockreportsaysitisfinalized.
ToavoidraceconditionbetweenclientreportsandDataNodereports,rbwreplicasareaddedtoonlydeleteListoraddStoredBlockList.
• Add a new replica
  1. Block in NN is Complete
     • If the reported replica is finalized,
       o If its GS and length are different from the NN recorded value, add it to the blocksMap but mark it as corrupt.
       o Otherwise, add the replica.
     • If the reported replica is rbw,
       o If the file is closed, and the replica's GS/len is different from the NN-recorded value or the block has reached its replication factor, instruct its DataNode to delete it.
       o Otherwise, do nothing.
  2. Block in NN is Committed
     The handling is very similar to the above case, except that if the reported replica is finalized and matches the block's GS and length, NN changes the block to the Complete state.
  3. Block in NN is UnderConstruction or UnderRecovery
     • If the reported replica is finalized and the replica's GS is equal to or newer than the NN recorded GS, add the replica. Also mark the replica as finalized and keep track of its length and GS.
     • If the reported replica is rbw and the replica is valid (GS not older and length not shorter), add the replica. If the reported replica is rbw, mark it as rbw; otherwise, mark it as rwr.
     • Otherwise ignore it.
• Update a replica's state
  When a block report shows that a replica has changed from rbw to finalized, if the block is under construction, NN marks the NN stored replica as finalized and keeps track of the finalized replica's GS and length. If the block is Committed and the finalized replica matches the block's GS and length, NN changes the block to the Complete state; otherwise NN removes this replica.

8.4. blockReceived

DataNodes send blockReceived to NN to notify it that a replica is finalized. When NN receives a blockReceived notification, it either adds a new replica if (DataNode, block_id) does not exist in NN, updates the replica's state if it is recorded as rbw, or asks the DataNode to delete the replica if the block is invalid.
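
The comparison NN performs on a block report in section 8.3 can be sketched as a diff between two maps. The function, its argument shapes, and the returned dictionary are hypothetical. The sketch covers the four list definitions; of the race-condition rule it only guarantees that a reported rbw replica never reaches updateStateList.

```python
def diff_block_report(nn_replicas, report, valid_block_ids):
    """Split a DataNode's block report into the lists described in 8.3.

    nn_replicas: {block_id: "rbw" | "finalized"} as recorded by NN for this
                 DataNode.
    report:      {block_id: is_rbw} as sent by the DataNode.
    """
    delete, add, remove, update = [], [], [], []
    for blk, is_rbw in report.items():
        if blk not in valid_block_ids:
            delete.append(blk)     # no entry in blocksMap or belongs to no file
        elif blk not in nn_replicas:
            add.append(blk)        # NN did not know about this replica
        elif nn_replicas[blk] == "rbw" and not is_rbw:
            update.append(blk)     # rbw at NN, reported as finalized
    for blk in nn_replicas:
        if blk not in report:
            remove.append(blk)     # NN has it, the report does not
    return {"deleteList": delete, "addStoredBlockList": add,
            "rmStoredBlockList": remove, "updateStateList": update}
```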

9. Replica/Block State Transition

9.1. Replica State Transition

The following diagram summarizes all possible replica state transitions at a DataNode.

• A new replica is created
  o Either by a client. The new replica starts in the replica being written (rbw) state.
  o Or upon an instruction from NN to replicate or copy a replica for the purpose of balancing. The new replica is in the temporary state.
• An rbw replica changes to be a replica waiting to be recovered (rwr) when its DataNode restarts.
• A replica changes to be an under recovery replica (rur) when a replica recovery starts in response to lease expiration.
• A replica is finalized when a client issues a close, replica recovery succeeds, or replication/copy succeeds.
• Error recovery always causes a replica's GS to be bumped.

9.2. Block State Transition

The following diagram summarizes all possible block state transitions at the NameNode.

• A block is created
  o Either when a client issues addBlock to add a new block to a file.
  o Or when a client issues append and the last block of the file is full.
  The newly created block is a block under construction.

[Figure: block state transition diagram at the NameNode, with the states Init, Block Under Construction, Block Under Recovery, Committed Block, Complete Block, and Removed. Transitions are labeled with their triggers: addBlock, append, close, lease expiration, pipeline recovery (GS++), block recovery start, success, or failure, and NN restart.]

• Append may also cause a Complete block to be changed to a block UnderConstruction if the last block is partial.
• When addBlock or close is issued,
  o the last block becomes Complete if the block already has a GS/len matched finalized replica, or Committed otherwise.
  o addBlock waits until the penultimate block becomes Complete.
  o A file won't be closed until the last two blocks of the file are Complete.
• When the lease expires, a lease recovery changes a block under construction to be a block under recovery.
  o A block recovery may change a block under recovery to be
    - Removed if all its replicas are of length 0;
    - Committed if the recovery succeeds and the block has no GS/len matched finalized replica;
    - Complete if the recovery succeeds and the block has a GS/len matched finalized replica.
  o A lease recovery may force a Committed block to be Complete.
• Block states do not persist on disk. When a NameNode restarts, the last block of an unclosed file becomes under construction and the rest become Complete.
  o Note that a Complete or Committed block may change to be an UnderConstruction block after NN restarts if it is the last block of a file. If the client is still alive, the client will finalize it again. Otherwise, when the lease expires, a block recovery will finalize it again.
• Note that once a block becomes Committed or Complete, all its replicas should have the same GS and be finalized. When a block is UnderConstruction, multiple generations of the block may coexist in the cluster.
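
A few of the block state transitions listed in section 9.2 can be written down as pure functions. The names are hypothetical and the sketch covers only addBlock/close, a successful block recovery, and NameNode restart.

```python
def after_add_block_or_close(has_matched_finalized_replica):
    """State of the last block once addBlock or close finalizes its GS/len."""
    return "Complete" if has_matched_finalized_replica else "Committed"


def after_block_recovery(replica_lengths, has_matched_finalized_replica):
    """State of a block under recovery once a block recovery succeeds."""
    if all(length == 0 for length in replica_lengths):
        return "Removed"
    return "Complete" if has_matched_finalized_replica else "Committed"


def after_nn_restart(is_last_block, file_is_closed):
    """Block states are not persisted, so they are rebuilt on restart."""
    if is_last_block and not file_is_closed:
        return "UnderConstruction"
    return "Complete"
```

Note that `after_nn_restart` ignores the state a block had before the restart, which is why a Complete or Committed last block of an unclosed file can fall back to UnderConstruction.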