
Dynamo: Amazon's Highly Available Key-value Store

Authors: Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall and Werner Vogels

Presentation: Shijie Xu, Ying Wang

Why Dynamo?

● Fully managed
● Fast, consistent performance
● Highly scalable
● Flexible

System Assumptions and Requirements

● Query Model: simple read and write operations to a small data item that is uniquely identified by a key
● ACID Properties: Atomicity, (Weaker) Consistency, (No) Isolation, Durability
● Efficiency: latency requirements are in general measured at the 99.9th percentile of the distribution
● Other Assumptions: only deal with benign failures

Service Level Agreements

● Application can deliver its functionality in a bounded time

Fig-1 Service-oriented architecture of Amazon's platform

Design Considerations

● Sacrifice strong consistency for availability
● Always writeable
● Conflict resolution is executed during read instead of write
● Other principles:
  ○ Incremental scalability
  ○ Symmetry
  ○ Decentralization
  ○ Heterogeneity

System Architecture

Core techniques used:
● Partitioning
● Replication
● Versioning
● Membership
● Failure handling

API

● get(key)
  Returns a single object or a list of objects, and a context
● put(key, context, object)
  Uses key to determine the write replicas
  Writes the replicas to disk
● Context
  System metadata about the object
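To make the get/put/context contract concrete, here is a minimal in-memory sketch. This is a hypothetical toy, not Amazon's actual API; the class and the integer stand-in for the context are illustrative assumptions.

```python
# Hypothetical in-memory sketch of the get/put interface (not Dynamo's real
# implementation): get(key) returns objects plus an opaque context, and put()
# takes that context back so the store can track causality.
class TinyDynamo:
    def __init__(self):
        self._store = {}  # key -> (object, context)

    def get(self, key):
        """Return (list_of_objects, context); empty list if key is absent."""
        if key not in self._store:
            return [], None
        obj, ctx = self._store[key]
        return [obj], ctx

    def put(self, key, context, obj):
        """Write obj; context is system metadata from an earlier get()."""
        version = (context or 0) + 1  # stand-in for real vector-clock metadata
        self._store[key] = (obj, version)

store = TinyDynamo()
store.put("cart:42", None, ["book"])
objs, ctx = store.get("cart:42")
store.put("cart:42", ctx, objs[0] + ["pen"])
print(store.get("cart:42")[0])  # [['book', 'pen']]
```

The key point the sketch shows: the caller never interprets the context, it just threads it from a read into the next write.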

Partitioning

● Consistent Hashing
  ○ The output range of the hash function is treated as a "ring"
  ○ Pros: incrementally scalable; adding a single node does not affect the system significantly
  ○ Cons: leads to unevenly distributed load, and is oblivious to the heterogeneity in the performance of nodes

Partitioning

● "Virtual Node"
  ○ Each node can be responsible for more than one virtual node
  ○ Work distribution proportional to the capabilities of the individual node
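A small sketch of consistent hashing with virtual nodes, under stated assumptions: the `Ring` class and token-naming scheme are hypothetical, and MD5 over a small ring stands in for Dynamo's 128-bit hash space.

```python
# Sketch of consistent hashing with virtual nodes: each physical node owns
# several positions (tokens) on the ring; a key is served by the first token
# clockwise from its hash. Illustrative only.
import bisect
import hashlib

def _hash(s):
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes, vnodes=8):
        # More tokens can be given to a more capable node, which is how
        # virtual nodes address heterogeneity.
        self._tokens = sorted(
            (_hash(f"{n}#{i}"), n) for n in nodes for i in range(vnodes)
        )

    def owner(self, key):
        h = _hash(key)
        i = bisect.bisect(self._tokens, (h,))  # first token clockwise from h
        return self._tokens[i % len(self._tokens)][1]

ring = Ring(["A", "B", "C"])
print(ring.owner("cart:42"))
```

Adding a node only inserts its tokens into the sorted list, so only the keys between the new tokens and their predecessors move: the incremental-scalability "pro" above.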

Replication

● Each data item is replicated at N hosts
● Preference list: the list of nodes that is responsible for storing a particular key
  ○ May contain more than N nodes due to failures
  ○ Contains only distinct physical nodes

Replication

● Example: N = 3
  ○ Node B replicates the key k at nodes C and D in addition to storing it locally
  ○ Node D will store the keys in the ranges (A, B], (B, C], and (C, D]

Fig-4 Partitioning and replication of keys in Dynamo ring.
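The preference list above can be sketched as a clockwise walk of the ring that skips virtual nodes whose physical host was already collected. Function and variable names here are illustrative assumptions.

```python
# Sketch of building a preference list: walk clockwise from the key's hash
# and take the next n DISTINCT physical nodes, so replicas never land on the
# same host twice even with virtual nodes. Illustrative only.
import bisect
import hashlib

def _hash(s):
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

def preference_list(tokens, key, n=3):
    """tokens: sorted list of (ring_position, physical_node) pairs."""
    start = bisect.bisect(tokens, (_hash(key),))
    result = []
    for i in range(len(tokens)):
        node = tokens[(start + i) % len(tokens)][1]
        if node not in result:  # only distinct physical nodes
            result.append(node)
        if len(result) == n:
            break
    return result

tokens = sorted((_hash(f"{n}#{i}"), n) for n in "ABCD" for i in range(4))
print(preference_list(tokens, "cart:42", n=3))
```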

Data Versioning

● System is eventually consistent, thus a get() call may return many versions of the same object
● Challenge: an object can have distinct version sub-histories, which the system needs to reconcile in the future
● Solution: syntactic reconciliation and semantic reconciliation

Vector Clock

● A vector clock is a list of [node, counter] pairs
● Versioned object -> vector clock
● Update an object: put(key, context, object)
● "context" is obtained from an earlier read operation, which contains the vector clock information
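A minimal vector-clock sketch, loosely following the paper's Sx/Sy/Sz example; the function names `increment` and `descends` are assumptions for illustration.

```python
# Minimal vector-clock sketch: a dict of node -> counter. `descends` tells
# whether one version subsumes another; mutually incomparable clocks mean
# concurrent versions that need reconciliation.
def increment(clock, node):
    new = dict(clock)
    new[node] = new.get(node, 0) + 1
    return new

def descends(a, b):
    """True if clock a has seen everything clock b has."""
    return all(a.get(node, 0) >= c for node, c in b.items())

v1 = increment({}, "Sx")   # first write, handled by node Sx
v2 = increment(v1, "Sx")   # second write via Sx: supersedes v1
v3 = increment(v2, "Sy")   # update via Sy
v4 = increment(v2, "Sz")   # concurrent update via Sz

assert descends(v2, v1)                               # syntactic: v2 wins
assert not descends(v3, v4) and not descends(v4, v3)  # v3, v4 conflict
```

When `descends` holds in one direction, the system can reconcile syntactically; when it holds in neither, the branch must be collapsed semantically (next slides).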

Syntactic Reconciliation

● Causally related versions can be reconciled by the system itself: the most recent version subsumes the earlier ones

Semantic Reconciliation

● Failures + concurrent updating => version branching
● Collapse
● Version branching is resolved by the data store or the application
  ○ Data store: latest write wins
  ○ Application: merge

Vector Clock Issues

● Vector clock may grow when many servers coordinate the writes to one object
● Truncation scheme
  ○ Delete the oldest [node, counter] pair when the number of pairs reaches a threshold
● More issues: inefficiencies in reconciliation because of missing information
  ○ Has not surfaced in production

Client Request Choices

● Generic load balancer
  ○ No code specific to Dynamo
  ○ Extra request-forwarding step
● Partition-aware client library
  ○ Better performance

Execute get and put: Quorum in Dynamo

● The first reachable node in the preference list is the coordinator
● R: minimum number of nodes that must participate in a successful read operation
● W: minimum number of nodes that must participate in a successful write operation
● Setting R + W > N yields a quorum-like system
● The latency of a get() (or put()) operation is dictated by the slowest of the R (or W) replicas
● R and W are usually configured to be less than N, to provide better latency
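Why R + W > N matters can be checked by brute force: under that condition every possible read set intersects every possible write set in at least one replica. The helper below is a hypothetical checker, not Dynamo code.

```python
# Brute-force check of the quorum condition: with N replicas, R + W > N
# guarantees that any R-node read set overlaps any W-node write set, so a
# read always sees at least one up-to-date replica.
from itertools import combinations

def quorums_overlap(n, r, w):
    nodes = set(range(n))
    return all(
        set(read) & set(write)
        for read in combinations(nodes, r)
        for write in combinations(nodes, w)
    )

assert quorums_overlap(3, 2, 2)      # R + W = 4 > N = 3: always overlap
assert not quorums_overlap(3, 1, 2)  # R + W = 3 = N: a read can miss the write
```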

Execution of get() operation

● Coordinator requests reading from N nodes, waits for R responses
● If the responses agree, returns the object with context
● If they disagree:
  ○ If they are causally related, returns the most recent value
  ○ If they are causally unrelated, returns all versions

Execution of put() operation

● Coordinator generates the new version's vector clock and writes the new version locally
● Forwards the metadata to the highest-ranked reachable nodes in the preference list
● Waits for W-1 or more writes to succeed
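The put() steps above can be sketched as a success test: the coordinator's own write counts as one acknowledgement, so only W-1 remote acks are needed. The function and the `replies` shape are illustrative assumptions, not the real protocol.

```python
# Sketch of a put() coordinator's success condition (illustrative only):
# write locally, forward to the next N-1 preference-list nodes, and succeed
# once W-1 of them acknowledge.
def coordinate_put(preference_list, replies, n=3, w=2):
    """replies: dict node -> bool, whether that replica acked the write."""
    coordinator, others = preference_list[0], preference_list[1:n]
    acks = sum(1 for node in others if replies.get(node, False))
    return acks >= w - 1  # coordinator's local write is the first ack

assert coordinate_put(["A", "B", "C"], {"B": True, "C": False}, n=3, w=2)
assert not coordinate_put(["A", "B", "C"], {"B": False, "C": False}, n=3, w=2)
```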

Handling Failures: Hinted Handoff

● "Always writeable"
● Avoid read and write operation failures due to temporary node or network failures

Fig-6 Partitioning and replication of keys in Dynamo ring.

Handling Permanent Failures: Replica Synchronization

● Merkle tree:
  ○ Parent nodes are hashes of their (immediate) children
  ○ Comparison of parents at the same level tells the difference in children
  ○ Does not require transferring entire (key, value) pairs
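A tiny sketch of the Merkle idea: hash leaves of a key range, then hash pairs upward, so two replicas can compare just the root and descend only on mismatch. The `merkle_root` helper is an illustrative assumption, not Dynamo's implementation.

```python
# Tiny Merkle-tree sketch over a key range: each parent hashes its children,
# so replicas compare root hashes first and only transfer the differing
# subtrees, never the entire (key, value) set.
import hashlib

def h(data):
    return hashlib.sha256(data.encode()).hexdigest()

def merkle_root(pairs):
    level = [h(f"{k}={v}") for k, v in sorted(pairs.items())]
    while len(level) > 1:
        level = [h(level[i] + (level[i + 1] if i + 1 < len(level) else ""))
                 for i in range(0, len(level), 2)]
    return level[0] if level else h("")

a = {"k0": "v0", "k1": "v1"}
b = {"k0": "v0", "k1": "v1"}
assert merkle_root(a) == merkle_root(b)  # synchronized replicas agree
b["k1"] = "v1-changed"
assert merkle_root(a) != merkle_root(b)  # one changed value shows at the root
```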

The Power of Gossip

● Ring Membership
  ○ All nodes exchange their membership histories
  ○ Each node randomly contacts a peer every second
  ○ Eventually consistent
  ○ Each node forwards a key's read/write operations to the right set of nodes directly

● External Discovery
  ○ Nodes may not know each other - logical partitions
  ○ Seed nodes to avoid logical partitions

● Failure Detection
  ○ Detecting failure locally is sufficient
  ○ Periodically retry failed node(s)
  ○ No need for a decentralized failure detector
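The membership exchange above can be sketched as a merge of versioned views: each round two nodes reconcile their views, keeping the newer entry per node. The view format and fixed schedule are illustrative assumptions, not Dynamo's wire protocol.

```python
# Sketch of gossip-style membership: each view maps node -> version; when
# two nodes gossip they merge views, keeping the higher version per entry.
# Views converge without any central coordinator.
def merge(view_a, view_b):
    merged = dict(view_a)
    for node, ver in view_b.items():
        if ver > merged.get(node, -1):
            merged[node] = ver
    return merged

views = {"A": {"A": 1}, "B": {"B": 1}, "C": {"C": 1}}
for x, y in [("A", "B"), ("B", "C"), ("C", "A")]:  # fixed gossip schedule
    views[x] = views[y] = merge(views[x], views[y])

print(views["A"] == views["B"] == views["C"])  # True
```

In the real system the peer choice is random rather than scheduled, which is why convergence is only eventual.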

Implementation

● Java
● Local persistence component allows for different storage engines to be plugged in:
  ○ Berkeley Database (BDB) Transactional Data Store: objects of tens of kilobytes
  ○ MySQL: objects of > tens of kilobytes

Main Modes of Dynamo

● Business logic specific reconciliation
  ○ Merge
  ○ Application-specific reconciliation
● Timestamp based reconciliation
  ○ Last write wins
● High performance read engine
  ○ R = 1, W = N; Dynamo provides the ability to partition and replicate data across multiple nodes, thereby offering incremental scalability

Experiences

● N: durability
● W and R: availability, durability and consistency
  ○ Increasing W can increase durability but reduce availability
● (N, R, W) = (3, 2, 2) provides satisfying performance, durability, consistency, and availability SLAs

Performance

● Guarantee Service Level Agreements (SLA)
  ○ Latencies: diurnal pattern (follows incoming request rate)
  ○ Write latencies >> read latencies
● Latencies around 200 ms

Fig-9 Average and 99.9 percentiles of latencies for read and write requests during peak season of Dec. 2006

Better Performance

● Trade durability for better performance
● Each storage node maintains an object buffer in its main memory
● Write objects from the buffer to disk periodically, using a writer thread
● Read from the buffer in memory

Fig-10 Comparison of performance of 99.9th percentile latencies for buffered vs. non-buffered writes over a period of 24 hours

Balance

● Out of balance
  ○ If the node's request load deviates from the average load by more than a certain threshold (here, 15%)
● Imbalance ratio decreases with increasing load
● Under high loads, a large number of popular keys are accessed and the load is evenly distributed

Fig-11 Fraction of nodes that are out of balance, and their corresponding request load. The interval between ticks on the x-axis corresponds to a time period of 30 minutes.

Partitioning and Placement of Keys

Strategy 1: T random tokens per node and partition by token value
● Slow bootstrapping process
● Recalculation of the Merkle trees
● Data partitioning and data placement are intertwined

Strategy 2 and Strategy 3
● Equal-size partitioning strategies to distribute load uniformly

Server-driven and Client-driven Coordination

● Use a state machine to handle incoming requests
● Move the state machine to the client nodes

Balancing Background & Foreground Tasks

● Each node performs both background and foreground operations
● Background tasks trigger resource contention
● Admission controller: changes the runtime slices of the resources given to background tasks

Conclusion

● Dynamo is a highly available and scalable data store for Amazon's e-commerce platform.
● Techniques:
  ○ Gossiping for membership and failure detection
  ○ Consistent hashing for node and key distribution
  ○ Object versioning for eventually-consistent data objects
  ○ Quorums for partition/failure tolerance
  ○ Merkle trees for resynchronization after failures/partitions

Questions?

Handling Permanent Failures: Replica Synchronization

● Comparing two nodes that are synchronized
  ○ Two (key, value) pairs: (k0, v0) & (k1, v1)

Fig-7 Replica Synchronization

Handling Permanent Failures: Replica Synchronization

● Comparing two nodes that are not synchronized
  ○ Two (key, value) pairs: (k0, v0) & (k1, v1)

Fig-8 Replicas Not Synchronized

Partitioning and Placement of Keys

Strategy 1: T random tokens per node and partition by token value
Problems:
● Slow bootstrapping process
● Recalculation of the Merkle trees
● Complicated archival process

Fig-12 Partitioning and placement of keys, strategy 1

Partitioning and Placement of Keys

Fig-13 Partitioning and placement of keys, strategy 2

Strategy 2: T random tokens per node and equal-sized partitions
● Divides the hash space into Q equally sized partitions

Partitioning and Placement of Keys

Fig-14 Partitioning and placement of keys, strategy 3

Strategy 3: Q/S tokens per node, equal-sized partitions
● Divides the hash space into Q equally sized partitions
● Each node is assigned Q/S tokens
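Strategy 3's fixed arithmetic can be sketched directly: Q partitions, S nodes, Q/S partitions each. The round-robin placement below is a simplifying assumption for illustration; the real scheme assigns tokens while preserving randomness.

```python
# Sketch of Strategy 3: divide the hash space into Q equal partitions and
# hand each of the S nodes exactly Q/S of them (round-robin here, purely
# for illustration).
def assign_partitions(q, nodes):
    assert q % len(nodes) == 0, "Q must be divisible by S"
    return {p: nodes[p % len(nodes)] for p in range(q)}

assignment = assign_partitions(12, ["A", "B", "C"])
per_node = {n: sum(1 for owner in assignment.values() if owner == n)
            for n in ["A", "B", "C"]}
assert per_node == {"A": 4, "B": 4, "C": 4}  # each node owns Q/S = 4 partitions
```

Because partition boundaries are fixed, bootstrapping and archival simply move whole partitions, which avoids Strategy 1's Merkle-tree recalculation problem.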

Thank you