Deepview: Virtual Disk Failure Diagnosis and Pattern Detection for Azure
Qiao Zhang¹, Guo Yu², Chuanxiong Guo³, Yingnong Dang⁴, Nick Swanson⁴, Xinsheng Yang⁴, Randolph Yao⁴, Murali Chintalapati⁴, Arvind Krishnamurthy¹, Tom Anderson¹
¹University of Washington, ²Cornell University, ³Toutiao (Bytedance), ⁴Microsoft
VM Availability
• IaaS is one of the largest cloud services today
• High VM availability is a key performance metric
• Yet, achieving 99.999% VM uptime remains a challenge
1. What is the availability bottleneck?
2. How to eliminate it?
Azure IaaS Architecture
• Compute and storage clusters with a Clos-like network
• Compute-storage separation
  • VMs and Virtual Hard Disks (VHDs) from different clusters
  • Hypervisor transparently redirects disk access
  • Data survive compute rack failure
[Figure: Subsystems inside a Datacenter. Labels: Compute Cluster (VM, Host, Hypervisor), Clos Network, Storage Cluster]
A New Type of Failure: VHD Failures
• Infra failures can disrupt VHD access
• Hypervisor can retry, but not indefinitely
• Hypervisor will eventually crash the VM
• Customers then take actions to keep their app-level SLAs
[Figure: Subsystems inside a Datacenter. Labels: Compute Cluster (VM, Host, Hypervisor), Clos Network, Storage Cluster]
How much do VHD failures impact VM availability?
VHD failures:
• 52% of unplanned VM downtime
• Tens of minutes to hours to localize
Breakdown of Unplanned VM Downtime in a Year:
• VHD Failure: 52%
• SW Failure: 41%
• HW Failure: 6%
• Unknown: 1%
VHD failure localization is the bottleneck
Failure Triage was Slow and Inaccurate
• Each team checks their subsystem for anomalies to match the incident
  • e.g., host heart-beats, storage perf-counters, link discards
• Incidents get ping-ponged due to false positives
  • Inaccurate and slow diagnosis
• Gray failures in network and storage are hard to catch
  • Troubled but not totally down
  • Only fail a subset of VHD requests
  • Can take hours to localize
Deepview Approach: Global View
[Figure: Bipartite Model (compute clusters C1 to C4, storage clusters S1 to S3) and the corresponding Grid View (C1 to C4 against S1 to S3)]
• Isolate failures by examining interactions between subsystems
  • Instead of alerting every team
• Bipartite model
  • Compute clusters (left) : storage clusters (right)
  • Edge if VMs from a compute cluster mount VHDs from a storage cluster
  • Edge weight = VHD failure rate
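The bipartite model can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the input record format `(compute_cluster, storage_cluster, had_vhd_failure)` and the function names `edge_failure_rates` and `grid_view` are hypothetical and do not come from the slides.

```python
from collections import defaultdict

def edge_failure_rates(vm_records):
    """Return {(compute_cluster, storage_cluster): VHD failure rate}.

    vm_records: iterable of (compute_cluster, storage_cluster, had_vhd_failure)
    tuples, one per VM in the observation window (hypothetical format).
    """
    totals = defaultdict(int)
    failed = defaultdict(int)
    for compute, storage, had_failure in vm_records:
        edge = (compute, storage)
        totals[edge] += 1
        if had_failure:
            failed[edge] += 1
    # An edge exists only if at least one VM uses that cluster pair.
    return {edge: failed[edge] / totals[edge] for edge in totals}

def grid_view(rates):
    """Arrange edge weights as rows (compute clusters) by columns (storage clusters)."""
    computes = sorted({c for c, _ in rates})
    storages = sorted({s for _, s in rates})
    grid = [[rates.get((c, s)) for s in storages] for c in computes]
    return computes, storages, grid

# Made-up records: every failing VM sits in compute cluster C2.
records = [
    ("C1", "S1", False), ("C1", "S1", False), ("C1", "S2", False),
    ("C2", "S1", True), ("C2", "S1", True),
    ("C2", "S2", True), ("C2", "S2", False),
]
rates = edge_failure_rates(records)
computes, storages, grid = grid_view(rates)
```

In this made-up data the elevated weights line up along the C2 row of the grid, which is the kind of pattern a compute cluster failure would produce.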
Deepview Approach: Global View
Azure measurements revealed many common failure patterns
[Figure: Example Compute Cluster Failure. Compute cluster C2 failed; bipartite graph and "C2 Failure Grid View" over compute clusters C1 to C4 and storage clusters S1 to S3]
[Figure: Example Storage Cluster Failure. Storage cluster S1 failed; bipartite graph and "S1 Gray Failure Grid View" over compute clusters C1 to C4 and storage clusters S1 to S3]
Challenges
Remaining challenges:
1. Need to locate network failures: Generalized model
2. Need to handle gray failures: Lasso + Hypothesis testing
3. Need to be near-real-time: Streaming data pipeline
Summary of our goal:
A system to localize VHD failures to underlying failures in compute, storage or network subsystems within a time budget of 15 minutes
Time budget set by production team to meet availability goals
Outline
• Global View Approach
• Model & Algorithm
• System
• Evaluation
• Architectural Lessons
• Related Work
Deepview Model: Include the Network
[Figure: Compute Cluster and Storage Cluster connected through the Clos Network]
• Need to handle multipath & ECMP
• Simplify Clos network to a tree by aggregating network devices
• Can model at the granularity of clusters or racks
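Deepview Model: Estimate Component Health
Assume independent failures:
\[ \Pr(\text{path } i \text{ is healthy}) = \prod_{j \in \text{path}(i)} \Pr(\text{component } j \text{ is healthy}) \]
\[ 1 - \frac{e_i}{n_i} = \prod_{j \in \text{path}(i)} p_j \]
\[ \log\left(1 - \frac{e_i}{n_i}\right) = \sum_{j \in \text{path}(i)} \log p_j \]
System of Linear Equations:
\[ y_i = \sum_{j=1}^{N} \beta_j x_{ij} + \varepsilon_i \]
where
• \( e_i \) = num of VMs crashed, \( n_i \) = num of VMs (on path \( i \))
• \( y_i = \log\left(1 - \frac{e_i}{n_i}\right) \) (observable; blue on the slide)
• \( \beta_j = \log p_j \) (unknown; red on the slide)
• \( x_{ij} = 1 \) if component \( j \) is on path \( i \), and 0 otherwise (topology; purple on the slide)
• \( \varepsilon_i \) = measurement noise
Component \( j \) is healthy with probability \( p_j = \exp(\beta_j) \):
• \( \beta_j = 0 \): clear component \( j \)
• \( \beta_j \ll 0 \): may blame it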
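As a minimal sketch of how the per-path counts turn into the linear system, the snippet below builds the topology matrix and the observation vector. The function name `build_linear_system`, the input layout, and the clamp `eps` (used to avoid `log(0)` when every VM on a path crashed) are assumptions for illustration, not details from the slides.

```python
import math

def build_linear_system(paths, crashed, total, eps=1e-6):
    """Build (components, X, y) with y_i = log(1 - e_i / n_i).

    paths:   list of tuples naming the components on each path
    crashed: e_i, number of crashed VMs per path
    total:   n_i, number of VMs per path
    eps:     hypothetical floor on the healthy fraction so the log is defined
    """
    components = sorted({c for path in paths for c in path})
    index = {c: j for j, c in enumerate(components)}
    X, y = [], []
    for path, e_i, n_i in zip(paths, crashed, total):
        row = [0] * len(components)
        for c in path:
            row[index[c]] = 1  # x_ij = 1 when component j lies on path i
        X.append(row)
        healthy_fraction = max(1.0 - e_i / n_i, eps)
        y.append(math.log(healthy_fraction))
    return components, X, y

# Made-up counts for four compute-to-storage paths through one network node.
paths = [("C1", "Net", "S1"), ("C1", "Net", "S2"),
         ("C2", "Net", "S1"), ("C2", "Net", "S2")]
components, X, y = build_linear_system(
    paths, crashed=[0, 0, 10, 20], total=[100, 100, 100, 100])
```

A path with no crashed VMs gives `y_i = 0`, and paths with more crashes give increasingly negative `y_i`, matching the sign convention of `β_j = log p_j ≤ 0`.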
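Deepview Algorithm: Prefer Simpler Explanation via Lasso
• Potentially, #unknowns > #equations
  • Traditional least-squares regression would fail
Example (network Net, compute clusters C1 and C2, storage clusters S1 and S2), from \( y_i = \sum_{j=1}^{N} \beta_j x_{ij} + \varepsilon_i \):
\[ \begin{aligned}
y_1 &= \beta_{c1} + \beta_{net} + \beta_{s1} + \varepsilon_1 \\
y_2 &= \beta_{c1} + \beta_{net} + \beta_{s2} + \varepsilon_2 \\
y_3 &= \beta_{c2} + \beta_{net} + \beta_{s1} + \varepsilon_3 \\
y_4 &= \beta_{c2} + \beta_{net} + \beta_{s2} + \varepsilon_4
\end{aligned} \]
• But multiple simultaneous failures are rare
  • Encode this domain knowledge mathematically?
• Equivalent to preferring most \( \beta_j \) to be zero (sparsity)
  • Lasso regression can get sparse solutions efficiently
Lasso Objective Function:
\[ \hat{\beta} = \operatorname*{arg\,min}_{\beta \in \mathbb{R}^N,\ \beta \le 0} \; \lVert y - X\beta \rVert^2 + \lambda \lVert \beta \rVert_1 \]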
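To make the objective concrete, here is a small coordinate-descent sketch for the non-positive Lasso on the four-equation example. It is an illustrative solver, not the one used in Deepview; the function name `nonpositive_lasso`, the parameters `lam` and `n_iters`, and the observation values in `y` are all assumptions.

```python
def nonpositive_lasso(X, y, lam, n_iters=200):
    """Minimize ||y - X b||^2 + lam * ||b||_1 subject to b <= 0 (coordinate descent)."""
    n_rows, n_cols = len(X), len(X[0])
    beta = [0.0] * n_cols
    residual = list(y)  # equals y - X @ beta while beta is all zeros
    col_sq = [sum(X[i][j] ** 2 for i in range(n_rows)) for j in range(n_cols)]
    for _ in range(n_iters):
        for j in range(n_cols):
            if col_sq[j] == 0:
                continue
            # rho = x_j . (partial residual that excludes component j)
            rho = sum(X[i][j] * residual[i] for i in range(n_rows)) + col_sq[j] * beta[j]
            # With b <= 0 the penalty is -lam * b, so the unconstrained 1-D
            # minimizer is (rho + lam / 2) / ||x_j||^2; then clip to b <= 0.
            new_beta = min(0.0, (rho + lam / 2.0) / col_sq[j])
            delta = new_beta - beta[j]
            if delta != 0.0:
                for i in range(n_rows):
                    residual[i] -= X[i][j] * delta
                beta[j] = new_beta
    return beta

# Columns: c1, c2, net, s1, s2; rows: the four equations of the example.
X = [[1, 0, 1, 1, 0],
     [1, 0, 1, 0, 1],
     [0, 1, 1, 1, 0],
     [0, 1, 1, 0, 1]]
y = [0.0, 0.0, -0.5, -0.5]  # made-up observations: only paths through C2 degrade
beta_hat = nonpositive_lasso(X, y, lam=0.1)
```

Although five unknowns are fitted from four equations, the penalty steers the solution toward a single clearly negative coefficient (the one for C2) and leaves the others at or numerically near zero, which is the "simpler explanation" the slide describes.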
Deepview Algorithm: Principled Blame Decision via Hypothesis Testing
• Need a binary decision (flag/clear) for each component
  • Ad-hoc thresholds do not work reliably
  • Can we make a principled decision?
• If estimated failure probability worse than average, then likely a real failure
• Hypothesis test:
\[ H_0(j): \beta_j = \bar{\beta} \quad \text{vs.} \quad H_A(j): \beta_j < \bar{\beta} \]
  • If reject \( H_0(j) \), blame component \( j \); otherwise, clear it
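One way to picture the blame decision is a one-sided test of each estimate against the average. The sketch below uses a simple z-test and is only an assumption-laden illustration: the slides do not specify the test statistic, and the per-component standard errors `std_err`, the significance level `alpha`, and the function name `blame_decisions` are hypothetical. It also treats the average as a fixed constant, which a careful test would not do.

```python
from statistics import NormalDist, mean

def blame_decisions(beta_hat, std_err, alpha=0.01):
    """Return {component: True if blamed} using a one-sided z-test.

    beta_hat: {component: estimated beta_j}
    std_err:  {component: standard error of the estimate}, assumed positive
              (hypothetical input, e.g. obtained by resampling)
    """
    beta_bar = mean(beta_hat.values())
    decisions = {}
    for name, b in beta_hat.items():
        z = (b - beta_bar) / std_err[name]
        # One-sided p-value for H_A: beta_j < beta_bar (small when b is far below).
        p_value = NormalDist().cdf(z)
        decisions[name] = p_value < alpha
    return decisions

# Made-up estimates and standard errors.
estimates = {"C1": 0.0, "C2": -0.475, "Net": 0.0, "S1": 0.0, "S2": 0.0}
errors = {name: 0.05 for name in estimates}
decisions = blame_decisions(estimates, errors)
```

The point of the design is that the flag/clear decision depends on how far a component falls below its peers relative to its uncertainty, rather than on a hand-tuned absolute threshold.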
Deepview System Architecture: NRT Data Pipeline
[Figure: near-real-time (NRT) data pipeline. Labels as extracted:
• Input (Real-time / Non-RT): VHD Failure, VM Info, Storage Acct, Net Topo, VMs Per Path
• Ingestion Pipeline
• Kusto Engine: Raw Data, Sliding Window of Input
• Near-realtime Scheduler: Run Algo
• Algo
• Output (Actions): Alerts, Vis]
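The "sliding window of input" stage can be pictured with a small in-memory buffer. This is a generic sketch and not the production pipeline: the class name `SlidingWindow`, the 300-second window, and the assumption that events arrive in timestamp order are all illustrative, and no Kusto API is used.

```python
from collections import deque

class SlidingWindow:
    """Keep only the events that fall inside the most recent time window."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.events = deque()  # (timestamp, payload), assumed in timestamp order

    def add(self, timestamp, payload):
        self.events.append((timestamp, payload))

    def snapshot(self, now):
        """Drop events older than the window and return the remaining payloads."""
        cutoff = now - self.window_seconds
        while self.events and self.events[0][0] < cutoff:
            self.events.popleft()
        return [payload for _, payload in self.events]

# Made-up VHD failure events, timestamps in seconds.
window = SlidingWindow(window_seconds=300)
window.add(0, "vm-a")
window.add(200, "vm-b")
window.add(400, "vm-c")
current_input = window.snapshot(now=450)  # what a scheduler tick would hand to the algorithm
```

A scheduler would call `snapshot` periodically and pass the result to the localization algorithm, so each run sees only recent failures.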
Evaluation
Deepview has been deployed in production at Azure
1. How well can it localize VHD failures in production?
2. How accurate is the algorithm compared to alternatives?
3. How fast is the system?
Some Statistics
• Analyzed Deepview results for one month
  • Daily VHD failures: hundreds to tens of thousands
• Detected 100 failure instances
  • 70 matched with existing tickets, 30 were previously undetected
• Reduced unclassified VHD failures to a daily maximum of fewer than 500
  • Host failures or customer mistakes (e.g., expired storage accounts)
Case Study 1: Unplanned ToR Reboot
• Unplanned ToR reboot can cause VM crashes
  • Know this can happen, but not where and when
• Deepview can flag those ToRs
• Associate VM downtime with ToR failures
  • Quantify the impact of ToR as a single-point-of-failure on VM availability
[Figure: grid view of ToRs (ToR_11 to ToR_15) against storage clusters (STR_01 to STR_07). Blamed the right ToR among 288 components]
Case Study 2: Storage Cluster Gray Failure
• A storage cluster was brought online with a bug that puts some VHDs in negative cache
• Deepview flagged the faulty storage cluster almost immediately while manual triage took 20+ hours
[Figure: Number of VMs with VHD Failures per Hour during a Storage Cluster Gray Failure. x-axis: Hour (0 to 60); y-axis: Number of VMs with VHD Failures per Hour]
Case Study 3: Network Failure
• Network outages are rare, but do happen
• In an incident, many top tier links were mistakenly turned off, causing large capacity loss
• When storage replication traffic hit, it caused huge packet losses and many VMs to crash
• Deepview pinpointed the misbehaving aggregate switches
[Figure: A Network Failure due to Top Tier Link Capacity Loss. Axes: Compute Clusters, Storage Clusters]
Algorithm Accuracy Comparison
• Two other tomography algorithms: Boolean-Tomo and SCORE
  • Greedy heuristics to find minimum set of failures
• Use production trace from 42 incidents
  • 16 Compute, 14 Storage, 10 ToR, 2 Net
[Figure: bar chart of Precision and Recall for Boolean-Tomo, SCORE, and Deepview; y-axis from 0 to 1; bar labels as extracted: 0.6, 0.3, 0.9, 0.67, 0.88, 1]
Deepview Time to Detection
• Time to detection (TTD)
  • Time from incident start to failure localized
  • Estimate start time from VHD failure event timestamp
• Deepview's TTD is under 10 min
  • Data ingestion takes ~3.5 min
  • ~5 min sliding window delay
  • Worst-case 18 sec algorithm running time
• Meets the target TTD of 15 min
  • Can be made faster but mitigation time is on human time scale
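The TTD figure can be checked with simple arithmetic, assuming the three stages add up serially (an assumption; the slide does not say how they overlap). The function name `time_to_detection_minutes` is hypothetical.

```python
def time_to_detection_minutes(ingestion_min, window_min, algo_sec):
    """Sum the pipeline stages, assuming they run one after another."""
    return ingestion_min + window_min + algo_sec / 60.0

# Values from the slide: ~3.5 min ingestion, ~5 min window delay, 18 s worst-case run.
ttd = time_to_detection_minutes(ingestion_min=3.5, window_min=5.0, algo_sec=18.0)
within_claim = ttd < 10.0    # "under 10 min"
within_budget = ttd < 15.0   # target TTD of 15 min
```

Under that serial assumption the stages sum to about 8.8 minutes, consistent with both the "under 10 min" claim and the 15-minute budget.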
ToR as a Single Point of Failure
• Reduced network cost vs. availability cost for using a single ToR per rack
• Soft failures (recoverable by reboot) vs. hard failures
\[ \text{ToR availability} = 1 - \frac{(\%\text{soft} \times \text{soft dur.} + \%\text{hard} \times \text{hard dur.}) \times \text{frac. rebooted ToRs per month}}{\text{total time in a month}} \]
\[ = 1 - \frac{(90\% \times 20\,\text{min} + 10\% \times 120\,\text{min}) \times 0.1\%}{30 \times 24 \times 60\,\text{min}} = 99.99993\% \]
• Dependent services (ToRs) need to provide one extra nine to target service (VMs)
ToRs not on critical path for VMs to achieve five-nines availability
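The availability estimate can be reproduced directly from the numbers on the slide. The function and parameter names below are hypothetical; only the input values come from the slide.

```python
def tor_availability(frac_soft, soft_min, frac_hard, hard_min,
                     frac_rebooted_per_month, month_min=30 * 24 * 60):
    """Availability = 1 - expected ToR downtime per month / minutes in a month."""
    downtime_per_reboot = frac_soft * soft_min + frac_hard * hard_min
    expected_downtime = downtime_per_reboot * frac_rebooted_per_month
    return 1.0 - expected_downtime / month_min

# Slide values: 90% soft failures of 20 min, 10% hard failures of 120 min,
# and 0.1% of ToRs rebooted per month.
availability = tor_availability(0.90, 20, 0.10, 120, 0.001)
availability_percent = availability * 100  # about 99.99993
```

The weighted downtime per reboot is 30 minutes, and scaling by the 0.1% reboot fraction leaves roughly 0.03 minutes of expected downtime per ToR per month, which is where the six-nines figure comes from.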
VMs and their Storage Co-location
• For load balancing, VMs can mount VHDs from any storage cluster in the same region
• Some VMs have storage that is further away
  • Can longer network paths impact VM availability? And by how much?
• At Azure, 52% two-hop, 41% four-hop
• Compute daily VHD failure rates: \( r_{\text{2-hop}} \) (two-hop), \( r_{\text{4-hop}} \) (four-hop)
• Average \( r_{\text{2-hop}} \) and \( r_{\text{4-hop}} \) over 3 months
• \( (r_{\text{4-hop}} - r_{\text{2-hop}}) / r_{\text{2-hop}} = 11.4\% \) increase
Longer network paths do lead to a higher (11.4%) VHD failure rate
Related Work
• NetPoirot [SIGCOMM '16]
  • A single-node solution to failure localization using TCP statistics
  • Complementary if TCP statistics from customer VMs are available
• Binary Tomography
  • Deepview achieves higher precision/recall than those greedy heuristics
• (Approximate) Bayesian Network
  • Too slow for our problem
  • Future work to compare accuracy experimentally
Conclusion
• Identified VHD failures as the availability bottleneck at Azure
• Deepview reduced unclassified daily VHD failures from 10,000s to 100s
• Revealed new failures, e.g., unplanned ToR reboots, storage gray failures
• Quantified the impact of several architectural decisions on VM availability
Thank you! Questions?