Dynamo: Amazon's Highly Available Key-value Store
Authors: Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall and Werner Vogels
Presentation: Shijie Xu, Ying Wang
Why Dynamo?
● Fully Managed
● Fast, Consistent Performance
● Highly Scalable
● Flexible
System Assumptions and Requirements
● Query Model: simple read and write operations to a small data item that is uniquely identified by a key
● ACID Properties: Atomicity, (Weaker) Consistency, (No) Isolation, Durability
● Efficiency: latency requirements are in general measured at the 99.9th percentile of the distribution
● Other Assumption: only deals with benign failures
Service Level Agreements
● An application can deliver its functionality in a bounded time
Fig-1 Service-oriented architecture of Amazon's platform
Design Considerations
● Sacrifice strong consistency for availability
● Always writeable
● Conflict resolution is executed during reads instead of writes
● Other principles:
○ Incremental scalability
○ Symmetry
○ Decentralization
○ Heterogeneity
System Architecture
Core techniques used:
● Partitioning
● Replication
● Versioning
● Membership
● Failure handling
API
● get(key)
○ Returns a single object or a list of objects, and a context
● put(key, context, object)
○ Uses key to determine the write replicas
○ Writes the replicas to disk
● Context
○ System metadata about the object
Partitioning
● Consistent Hashing
○ The output range of the hash values is treated as a "ring"
○ Pros: incrementally scalable; adding a single node does not affect the system significantly
○ Cons: leads to unevenly distributed load, and is oblivious to the heterogeneity in the performance of nodes
Partitioning
● "Virtual Node"
○ Each node can be responsible for more than one virtual node
○ Work distribution is proportional to the capabilities of the individual node
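The two ideas above can be sketched together: hash keys onto a ring, give each physical node several virtual-node tokens, and let a more capable machine hold more tokens. A minimal illustration (node names and weights are made up; the paper does not prescribe MD5 token labels like `A#0`):

```python
import hashlib
from bisect import bisect_right

def h(key: str) -> int:
    """Map a string onto the ring: MD5 output treated as an integer."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes_with_weights):
        # Each physical node gets a number of virtual nodes ("tokens")
        # proportional to its capacity, so heterogeneous machines take
        # proportionally more of the load.
        self.tokens = sorted(
            (h(f"{node}#{i}"), node)
            for node, weight in nodes_with_weights
            for i in range(weight)
        )

    def owner(self, key: str) -> str:
        """The first virtual node clockwise from the key's position owns it."""
        positions = [t for t, _ in self.tokens]
        i = bisect_right(positions, h(key)) % len(self.tokens)
        return self.tokens[i][1]

ring = Ring([("A", 8), ("B", 8), ("C", 16)])  # C is twice as capable
print(ring.owner("shopping-cart-42"))         # one of A, B, C
```

Adding a node only inserts new tokens into the sorted list, so only the keys immediately counter-clockwise of those tokens move, which is the incremental-scalability property claimed above.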
Replication
● Each data item is replicated at N hosts
● Preference list: the list of nodes that is responsible for storing a particular key
○ May contain more than N nodes due to failures
○ Contains only distinct physical nodes
Replication
●Example:N=3○NodeBreplicatesthekeykatnodeCandDinadditiontostoringitlocally
○ NodeDwillstorethekeysintherange(A,B],(B,C],and(C,D]
Fig-4PartitioningareplicationofkeysinDynamoring.
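Building a preference list is just a clockwise walk from the key's ring position, skipping extra virtual nodes of a machine already collected so that only distinct physical nodes appear. A sketch under those assumptions (the four-node ring and token count are illustrative):

```python
import hashlib
from bisect import bisect_right

def h(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

# Hypothetical ring of (token, physical node) pairs; each node holds 4 tokens.
tokens = sorted((h(f"{node}#{i}"), node) for node in "ABCD" for i in range(4))

def preference_list(key: str, n: int = 3):
    """Walk clockwise from the key's position and collect the first N
    *distinct physical* nodes -- successive virtual nodes belonging to a
    machine already in the list are skipped."""
    start = bisect_right([t for t, _ in tokens], h(key))
    seen, result = set(), []
    for j in range(len(tokens)):
        node = tokens[(start + j) % len(tokens)][1]
        if node not in seen:
            seen.add(node)
            result.append(node)
            if len(result) == n:
                break
    return result

print(preference_list("cart:123"))  # three distinct physical nodes
```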
Data Versioning
● The system is eventually consistent, thus a get() call may return many versions of the same object
● Challenge: an object can have distinct version sub-histories, which the system needs to reconcile in the future
● Solution: syntactic reconciliation and semantic reconciliation
Vector Clock
● A vector clock is a list of [node, counter] pairs
● Versioned object -> vector clock
● Updating an object: put(key, context, object)
● "context" is obtained from an earlier read operation, which contains the vector clock information
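A minimal sketch of the vector-clock mechanics, using the paper's server names Sx, Sy, Sz. `descends(a, b)` decides whether version a causally follows b, which is exactly the test that separates syntactic from semantic reconciliation:

```python
def increment(clock: dict, node: str) -> dict:
    """The coordinator bumps its own counter when handling a write."""
    out = dict(clock)
    out[node] = out.get(node, 0) + 1
    return out

def descends(a: dict, b: dict) -> bool:
    """True if version a is a causal descendant of version b
    (every counter in b is <= the matching counter in a)."""
    return all(a.get(node, 0) >= cnt for node, cnt in b.items())

v1 = increment({}, "Sx")          # {'Sx': 1}
v2 = increment(v1, "Sx")          # {'Sx': 2}
v3 = increment(v2, "Sy")          # branch coordinated by Sy
v4 = increment(v2, "Sz")          # concurrent branch coordinated by Sz
print(descends(v3, v2))           # True: v2 can be discarded syntactically
print(descends(v3, v4), descends(v4, v3))  # False False: concurrent versions
```

When neither clock descends from the other, the versions are concurrent and both are returned to the client for semantic reconciliation.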
Syntactic reconciliation
What if?
Source: Rick and Morty S02E01
Semantic reconciliation
● Failures + concurrent updates => version branching
● Collapse
● Version branching is resolved by the data store or the application
○ Data store: latest write wins
○ Application: merge
Vector Clock Issue
● A vector clock may grow when many servers coordinate the writes to one object
● Truncation scheme
○ Delete the oldest [node, counter] pair when the number of pairs reaches a threshold
● Further issue: inefficiencies in reconciliation because of missing information
○ Has not surfaced in production
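The truncation scheme can be sketched by storing, alongside each counter, the time the node last updated the entry, and evicting the least recently updated pair once the size cap is hit. The threshold value and the `(counter, timestamp)` layout here are illustrative:

```python
import time

THRESHOLD = 10  # hypothetical cap on the number of [node, counter] pairs

def update(clock: dict, node: str, ts: float = None) -> dict:
    """Each entry is (counter, timestamp of the node's last update)."""
    out = dict(clock)
    counter, _ = out.get(node, (0, 0.0))
    out[node] = (counter + 1, time.time() if ts is None else ts)
    return out

def truncate(clock: dict) -> dict:
    """Once the clock exceeds the threshold, evict the least recently
    updated pair. This can discard causality information, which is why
    reconciliation may become less efficient afterwards."""
    out = dict(clock)
    while len(out) > THRESHOLD:
        oldest = min(out, key=lambda node: out[node][1])
        del out[oldest]
    return out

clock = {}
for i in range(12):                     # 12 coordinators touch one object
    clock = update(clock, f"S{i}", ts=float(i))
clock = truncate(clock)
print(len(clock), "S0" in clock)        # 10 False: oldest pairs are gone
```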
Client request choices
● Generic load balancer
○ No code specific to Dynamo
○ Extra request forwarding step
● Partition-aware client library
○ Better performance
Executing get and put: Quorum in Dynamo
● The first reachable node in the preference list is the coordinator
● R: minimum number of nodes that must participate in a successful read operation
● W: minimum number of nodes that must participate in a successful write operation
● Setting R + W > N yields a quorum-like system
● The latency of a get() (or put()) operation is dictated by the slowest of the R (or W) replicas
● R and W are usually configured to be less than N, to provide better latency
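Why R + W > N is quorum-like: any set of R read replicas must then intersect any set of W write replicas, so a read always touches at least one node holding the latest write. This can be checked exhaustively for small N:

```python
from itertools import combinations

def always_overlap(n: int, r: int, w: int) -> bool:
    """Exhaustively verify that every R-node read set shares at least
    one replica with every W-node write set."""
    nodes = range(n)
    return all(set(rs) & set(ws)
               for rs in combinations(nodes, r)
               for ws in combinations(nodes, w))

print(always_overlap(3, 2, 2))  # True  -> R+W>N: reads see the latest write
print(always_overlap(3, 1, 2))  # False -> R+W<=N: a read can miss the write
```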
Execution of get() operation
● The coordinator requests reads from N nodes and waits for R responses
● If the responses agree, returns the object with its context
● If they disagree:
○ If they are causally related, returns the most recent value
○ If they are causally unrelated, returns all versions
Execution of put() operation
● The coordinator generates the new version's vector clock and writes the new version locally
● Forwards the metadata to the highest-ranked reachable nodes in the preference list
● Waits for W-1 or more writes to succeed
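The put() steps above can be sketched in-process, with a dict per replica standing in for the RPC and on-disk store (the data layout and node names are illustrative, not the paper's wire format):

```python
# Hypothetical in-process "replicas": node name -> local key/value store.
replicas = {node: {} for node in ("A", "B", "C")}

def put(key, value, clock, preference_list, w=2):
    """The coordinator (first node in the list) writes locally, then
    forwards to the remaining nodes and succeeds once W acks arrive,
    i.e. after W-1 remote writes succeed."""
    coordinator, others = preference_list[0], preference_list[1:]
    replicas[coordinator][key] = (value, clock)
    acks = 1                                  # the local write counts
    for node in others:
        replicas[node][key] = (value, clock)  # stand-in for the RPC
        acks += 1
        if acks >= w:
            return True
    return acks >= w

ok = put("cart:1", ["book"], {"A": 1}, ["A", "B", "C"], w=2)
print(ok, replicas["B"].get("cart:1"))
```

With W = 2 the call returns as soon as one remote replica acks; the remaining replicas receive the write asynchronously in the real system.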
Handling Failures: Hinted Handoff
● "Always writeable"
● Avoids read and write operation failures due to temporary node or network failures
Fig-6 Partitioning and replication of keys in the Dynamo ring.
Handling Permanent Failures: Replica Synchronization
● Merkle tree:
○ Parent nodes are hashes of their (immediate) children
○ Comparison of parents at the same level tells the difference in children
○ Does not require transferring entire (key, value) pairs
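A minimal Merkle-root sketch: leaves are hashes of (key, value) data, each parent hashes its children, and equal roots prove two replicas hold identical key ranges without shipping the pairs themselves (the leaf encoding here is made up):

```python
import hashlib

def h(b: bytes) -> bytes:
    return hashlib.sha256(b).digest()

def merkle_root(leaves):
    """Build parent hashes level by level; the root summarizes the whole
    key range, so equal roots mean identical data underneath."""
    level = [h(leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:               # duplicate the last node on odd levels
            level.append(level[-1])
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

a = merkle_root([b"k0=v0", b"k1=v1"])
b_ = merkle_root([b"k0=v0", b"k1=v1"])
c = merkle_root([b"k0=v0", b"k1=OTHER"])
print(a == b_, a == c)  # True False: only differing subtrees need comparing
```

When roots differ, the nodes recurse down only the mismatching branches, which is why synchronization traffic stays proportional to the divergence, not the data size.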
The power of gossip
● Ring Membership
○ All nodes exchange their membership histories
○ Each node randomly contacts a peer every second
○ Eventually consistent
○ Each node forwards a key's read/write operations to the right set of nodes directly
The power of gossip
● External Discovery
○ Nodes may not know each other - logical partitions
○ Seed nodes to avoid logical partitions
The power of gossip
● Failure Detection
○ Detecting failure locally is sufficient
○ Periodically retry failed node(s)
○ No need for a decentralized failure detector
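One gossip exchange can be sketched as two peers merging their membership views, keeping the entry with the higher version for every node; repeating this between random pairs makes all views eventually consistent. The `(status, version)` layout is an illustrative stand-in for the membership histories above:

```python
# Each node's view: node name -> (status, version); higher version wins.
views = {
    "A": {"A": ("up", 3), "B": ("up", 1)},
    "B": {"B": ("up", 2), "C": ("down", 1)},
}

def gossip(a: str, b: str):
    """One gossip round: the two peers merge histories, each keeping the
    freshest entry it has seen for every node."""
    merged = dict(views[a])
    for node, (status, ver) in views[b].items():
        if node not in merged or merged[node][1] < ver:
            merged[node] = (status, ver)
    views[a] = views[b] = merged

gossip("A", "B")
print(views["A"] == views["B"])  # True: both now hold the merged view
```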
Implementation
● Java
● Local persistence component allows for different storage engines to be plugged in:
○ Berkeley Database (BDB) Transactional Data Store: objects of tens of kilobytes
○ MySQL: objects larger than tens of kilobytes
Main modes of Dynamo
● Business logic specific reconciliation
○ Merge
○ Application-specific reconciliation
● Timestamp based reconciliation
○ Last write wins
● High performance read engine
○ R = 1, W = N; Dynamo provides the ability to partition and replicate data across multiple nodes, thereby offering incremental scalability
Experiences
● N: durability
● W and R: availability, durability and consistency
○ Increasing W can increase durability but reduce availability
● (N, R, W) = (3, 2, 2) provides satisfying performance, durability, consistency, and availability SLAs
Performance
● Guarantee Service Level Agreements (SLA)
○ Latencies: diurnal pattern (incoming request rate)
○ Write latencies >> read latencies
● Latencies around 200 ms
Fig-9 Average and 99.9 percentiles of latencies for read and write requests during the peak season of Dec. 2006
Better Performance
● Trade durability for better performance
● Each storage node maintains an object buffer in its main memory
● Objects in the buffer are written to disk periodically by a writer thread
● Reads are served from the buffer in memory
Fig-10 Comparison of performance of 99.9th percentile latencies for buffered vs. non-buffered writes over a period of 24 hours
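The buffering trade-off can be sketched with two dicts standing in for memory and disk; in Dynamo a writer thread invokes the flush periodically, and a crash before a flush loses the buffered writes, which is exactly the durability given up for latency:

```python
buffer, disk = {}, {}   # in-memory object buffer and "disk" stand-in

def put(key, value):
    buffer[key] = value                     # returns once the write is in memory

def get(key):
    return buffer.get(key, disk.get(key))   # serve the freshest copy

def flush():
    """In Dynamo a writer thread runs this periodically; writes buffered
    but not yet flushed are lost on a crash."""
    disk.update(buffer)
    buffer.clear()

put("k", "v1")
print(get("k"))   # 'v1' served from memory before any disk write
flush()
print(get("k"))   # 'v1' now served from disk
```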
Balance
● Out of balance
○ If the node's request load deviates from the average load by more than a certain threshold (here, 15%)
● Imbalance ratio decreases with increasing load
● Under high loads, a large number of popular keys are accessed and the load is evenly distributed
Fig-11 Fraction of nodes that are out of balance, and their corresponding request load. The interval between ticks on the x-axis corresponds to a time period of 30 mins.
Partitioning and placement of keys
Strategy 1: T random tokens per node and partition by token value
● Slow bootstrapping process
● Recalculation of the Merkle trees
● Data partitioning and data placement are intertwined
Strategy 2 and Strategy 3
● Equal-size partitioning strategies to distribute load uniformly
Server-driven and Client-driven Coordination
● Use a state machine to handle incoming requests
● Move the state machine to the client nodes
Balancing background & foreground
● Each node performs both background and foreground operations
● Background tasks trigger resource contention
● Admission controller: changes the runtime slices of the resources given to background tasks
Conclusion
● Dynamo is a highly available and scalable data store for Amazon's e-commerce platform.
● Techniques:
○ Gossiping for membership and failure detection
○ Consistent hashing for node and key distribution
○ Object versioning for eventually-consistent data objects
○ Quorums for partition/failure tolerance
○ Merkle trees for resynchronization after failures/partitions
Questions?
Handling Permanent Failures: Replica Synchronization
● Comparing two nodes that are synchronized
○ Two (key, value) pairs: (k0, v0) & (k1, v1)
Fig-7 Replica Synchronization
Handling Permanent Failures: Replica Synchronization
● Comparing two nodes that are not synchronized
○ Two (key, value) pairs: (k0, v0) & (k1, v1)
Fig-8 Replicas Not Synchronized
Partitioning and placement of keys
Strategy 1:
● T random tokens per node and partition by token value
Problems:
● Slow bootstrapping process
● Recalculation of the Merkle trees
● Complicated archival process
Fig-12 Partitioning and placement of keys, strategy 1
Partitioning and placement of keys
Fig-13 Partitioning and placement of keys, strategy 2
Strategy 2:
● T random tokens per node and equal-sized partitions
● Divides the hash space into Q equally sized partitions
Partitioning and placement of keys
Fig-14 Partitioning and placement of keys, strategy 3
Strategy 3: Q/S tokens per node, equal-sized partitions
● Divides the hash space into Q equally sized partitions
● Each node is assigned Q/S tokens
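Strategy 3's arithmetic can be sketched directly: split the hash space into Q fixed partitions and hand each of the S nodes Q/S of them. The round-robin assignment below is an illustrative stand-in; Dynamo actually reshuffles tokens as nodes join and leave so the Q/S-per-node property is preserved:

```python
Q, S = 12, 4              # Q equal-sized partitions, S storage nodes
RING = 2 ** 128           # size of the MD5 hash space used by Dynamo

# The hash space is divided into Q equally sized, fixed partitions...
partitions = [(i * RING // Q, (i + 1) * RING // Q) for i in range(Q)]

# ...and each node is assigned exactly Q/S partition tokens.
nodes = [f"node{j}" for j in range(S)]
assignment = {n: [] for n in nodes}
for i, part in enumerate(partitions):
    assignment[nodes[i % S]].append(part)

print({n: len(p) for n, p in assignment.items()})  # each node owns Q/S = 3
```

Because partition boundaries are fixed, Merkle trees can be kept per partition and need no recalculation when membership changes, addressing the Strategy 1 problems listed above.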
Thank you