Deepview: Virtual Disk Failure Diagnosis and Pattern ... · Deepview: Virtual Disk Failure Diagnosis and Pattern Detection for Azure Qiao Zhang1, Guo Yu2, Chuanxiong Guo3, Yingnong

Deepview:VirtualDiskFailureDiagnosis

andPatternDetectionforAzureQiaoZhang1,Guo Yu2,ChuanxiongGuo3,

Yingnong Dang4,NickSwanson4,Xinsheng Yang4,RandolphYao4,Murali Chintalapati4,ArvindKrishnamurthy1,TomAnderson1

1UniversityofWashington,2CornellUniversity,3Toutiao(Bytedance),4Microsoft

VMAvailability

• IaaSisoneofthelargestcloudservicestoday

•HighVMavailabilityisakeyperformancemetric

• Yet,achieving99.999%VMuptimeremainsachallenge

1. Whatistheavailabilitybottleneck?2. Howtoeliminateit?

Clos Network

AzureIaaSArchitecture• ComputeandstorageclusterswithaClos-likenetwork

• Compute-storageSeparation• VMsandVirtualHardDisks(VHDs)fromdifferentclusters

• Hypervisortransparentlyredirectsdiskaccess

• DatasurvivecomputerackfailureStorage Cluster

VM

Hypervisor

HostVM

Compute Cluster

SubsystemsinsideaDatacenter

ANewTypeofFailure:VHDFailures

• InfrafailurescandisruptVHDaccess

•Hypervisorcanretry,butnotindefinitely

•HypervisorwilleventuallycrashtheVM

• Customersthentakeactionstokeeptheirapp-levelSLAs

Clos Network

Storage Cluster

VM

Hypervisor

HostVM

Compute Cluster

SubsystemsinsideaDatacenter

HowmuchdoVHDfailuresimpactVMavailability?

VHDfailures:• 52% ofunplannedVMdowntime• TensofminutestohourstolocalizeVHD

Failure52%

SWFailure41%

HWFailure6%

Unknown1%

BreakdownofUnplannedVMDowntimeinaYear

VHDfailurelocalizationisthebottleneck

FailureTriagewasSlowandInaccurate

• Eachteamcheckstheirsubsystemforanomaliestomatchtheincident• e.g.,hostheart-beats,storageperf-counters,linkdiscards

• Incidentsgetping-pongedduetofalsepositives• Inaccurateandslowdiagnosis

• Grayfailuresinnetworkandstoragearehardtocatch• Troubledbutnottotallydown• OnlyfailasubsetofVHDrequests• Cantakehourstolocalize

Deepview Approach:GlobalView

C1C2C3C4

S1S2S3

BipartiteModel

C1C2

C3C4

S1 S2 S3GridView

• Isolatefailuresbyexamininginteractionsbetweensubsystems• Insteadofalertingeveryteam

• Bipartitemodel• ComputeClusters(left):StorageClusters(right)• EdgeifVMsfromcomputeclustermountVHDsfromastoragecluster• Edgeweight=VHDfailurerate

Deepview Approach:GlobalView

Azuremeasurementsrevealedmanycommonfailurespatterns

C1C2C3C4

S1

S2

S3

ComputeClusterC2failed

C2FailureGridView

C1C2C3C4

S1 S2 S3

ExampleComputeClusterFailure

C1C2C3C4

S1

S2

S3

StorageClusterS1Failed

ExampleStorageClusterFailure

S1GrayFailureGridView

C1C2C3C4

S1 S2 S3

ChallengesRemainingchallenges:1. Needtolocatenetworkfailures2. Needtohandlegrayfailures3. Needtobenear-real-time

GeneralizedmodelLasso+Hypothesistesting

Streamingdatapipeline

AsystemtolocalizeVHDfailurestounderlyingfailuresincompute,storageornetworksubsystemswithinatimebudgetof15minutes

Summaryofourgoal:

Timebudgetsetbyproductionteamtomeetavailabilitygoals

Outline

•GlobalViewApproach•Model&Algorithm•System•Evaluation•ArchitecturalLessons•RelatedWork

Deepview Model:IncludetheNetwork

Clos Network

Storage ClusterCompute Cluster

•Needtohandlemultipath&ECMP

• SimplifyClosnetworktoatreebyaggregatingnetworkdevices

• Canmodelatthegranularityofclustersorracks

Deepview Model:EstimateComponentHealth

𝐏𝐫𝐨𝐛 𝐩𝐚𝐭𝐡𝐢𝐢𝐬𝐡𝐞𝐚𝐥𝐭𝐡𝐲 = 0 𝐏𝐫𝐨𝐛 𝐜𝐨𝐦𝐩𝐨𝐧𝐞𝐧𝐭𝐣𝐢𝐬𝐡𝐞𝐚𝐥𝐭𝐡𝐲�

𝐣∈𝐩𝐚𝐭𝐡(𝐢)

𝟏 −𝐞𝐢𝐧𝐢= 0 𝐩𝐣

�


𝐥𝐨𝐠 𝟏 −𝐞𝐢𝐧𝐢

= < 𝐥𝐨𝐠𝐩𝐣

�


𝐲𝐢 =<𝛃𝐣 𝐱𝐢𝐣+ 𝛆𝐢

𝐍

𝐣B𝟏

𝐲𝐢=𝐥𝐨𝐠 𝟏 − 𝐞𝐢𝐧𝐢

𝛃𝐣=𝐥𝐨𝐠𝐩𝐣𝛆𝐢=measurementnoise

SystemofLinearEquations

Blue:observableRed:unknownPurple:topology

Componentjishealthywith𝐩𝐣 = 𝐞𝐱𝐩(𝛃𝐣)• βD = 0,clearcomponentj• βD ≪ 0,mayblameit

Assumeindependentfailures

𝐞𝐢=num ofVMscrashed𝒏𝐢=num ofVMs

Deepview Algorithm:PreferSimplerExplanationviaLasso

• Potentially,#unknowns>#equations• Traditionalleast-squareregressionwouldfail

Sparsity

𝛃H = 𝐚𝐫𝐠𝐦𝐢𝐧𝛃∈ℝ𝐍,𝛃K𝟎

𝐲 − 𝐗𝛃 𝟐 +𝛌 𝛃 𝟏

LassoObjectiveFunction:

𝐲𝟏 = 𝛃𝐜𝟏 + 𝛃𝐧𝐞𝐭 + 𝛃𝐬𝟏 + 𝛆𝟏𝐲𝟐 = 𝛃𝐜𝟏 + 𝛃𝐧𝐞𝐭 + 𝛃𝐬𝟐 + 𝛆𝟐𝐲𝟑 = 𝛃𝐜𝟐 + 𝛃𝐧𝐞𝐭 + 𝛃𝐬𝟏 + 𝛆𝟑𝐲𝟒 = 𝛃𝐜𝟐 + 𝛃𝐧𝐞𝐭 + 𝛃𝐬𝟐 + 𝛆𝟒

Net

C1 C2 S1 S2

𝐲𝐢 =<𝛃𝐣 𝐱𝐢𝐣+ 𝛆𝐢

𝐍

𝐣B𝟏

Example:

• Butmultiplesimultaneousfailuresarerare• Encodethisdomainknowledgemathematically?

• EquivalenttoprefermostβD tobezero• Lassoregression cangetsparsesolutionsefficiently

Deepview Algorithm:PrincipledBlameDecisionviaHypothesisTesting

• Needabinarydecision(flag/clear)foreachcomponent• Ad-hocthresholdsdonotworkreliably• Canwemakeaprincipleddecision?

• Ifestimatedfailureprobabilityworsethanaverage,thenlikelyarealfailure

• Hypothesistest:• IfrejectHS j ,blamecomponentj;otherwise,clearit

𝐇𝟎 𝐣 : 𝛃𝐣 = 𝛃W𝐯𝐬. 𝐇𝐀 𝐣 : 𝛃𝐣 < 𝛃W

Kusto Engine

Deepview SystemArchitecture:NRTDataPipeline

VHD Failure

VM Info

StorageAcct

Net Topo

VMsPerPath Input

Real-time

Non-RT

IngestionPipeline

RAW DATA SLIDING WINDOW OF INPUT

Output

ACTIONS

Alerts

Vis

Near-realtimeScheduler

RUN ALGO

Algo

Outline


Evaluation

Deepview hasbeendeployedinproductionatAzure

1. HowwellcanitlocalizeVHDfailuresinproduction?

2. Howaccurateisthealgorithmcomparedtoalternatives?

3. Howfastisthesystem?

SomeStatistics

• AnalyzedDeepview resultsforonemonth• DailyVHDfailures:hundredstotensofthousands

• Detected100failuresinstances• 70matchedwithexistingtickets,30werepreviouslyundetected

• ReducedunclassifiedVHDfailurestolessthanamaxof500perday• Hostfailuresorcustomermistakes(e.g.,expiredstorageaccounts)

CaseStudy1:UnplannedToR Reboot

• UnplannedToR rebootcancauseVMcrashes• Knowthiscanhappen,butnotwhereandwhen

• Deepview canflagthoseToRs

• AssociateVMdowntimewithToR failures• QuantifytheimpactofToR asasingle-point-of-failureonVMavailability

ToR_11

ToR_12

ToR_13

ToR_14

ToR_15

STR

_01

STR

_02

STR

_03

STR

_04

STR

_05

STR

_06

STR

_07

BlamedtherightToR among288components

CaseStudy2:StorageClusterGrayFailure

• AstorageclusterwasbroughtonlinewithabugthatputssomeVHDsinnegativecache

•Deepview flaggedthefaultystorageclusteralmostimmediatelywhilemanualtriagetook20+hours

10

20

0 20 40 60

Hour

Nu

mb

er

of

VM

s w

ith

VH

D F

ailu

res

pe

r H

ou

r

NumberofVMswithVHDFailuresperHourduringaStorageClusterGrayFailure

CaseStudy3:NetworkFailure

• Networkoutagesarerare,butdohappen

• Inanincident,manytoptierlinksweremistakenlyturnedoff,causinglargecapacityloss

• Whenstoragereplicationtraffichit,itcausedhugepacketlossesandmanyVMstocrash

• Deepview pinpointedthemisbehavingaggregateswitches

ANetworkFailureduetoTopTierLink

CapacityLoss

Com

pute

Clu

ster

s

Storage Clusters

0.6

0.3

0.90.67

0.881

00.250.5

0.751

BooleanTomo SCORE Deepview

Precision Recall

AlgorithmAccuracyComparison

• Twoothertomographyalgorithms:Boolean-Tomo andSCORE• Greedyheuristicstofindminimumsetoffailures

• Useproductiontracefrom42incidents• 16Compute,14Storage,10ToR,2Net

Deepview TimetoDetection• Timetodetection(TTD)

• Timefromincidentstarttofailurelocalized• EstimatestarttimefromVHDfailureeventtimestamp

• Deepview’s TTDisunder10min• Dataingestiontakes~3.5min• ~5minutesslidingwindowdelay• Worst-case18secalgorithmrunningtime

• MeetsthetargetTTDof15min• Canbemadefasterbutmitigationtimeisonhumantimescale

Outline


ToR asaSinglePointofFailure• ReducedNetworkCostvs.AvailabilitycostforusingasingleToR perrack• Softfailures(recoverablebyreboot)vs.hardfailures

ToR Availability

= 𝟏 −𝟗𝟎% ∗ 𝟐𝟎𝐦𝐢𝐧 + 𝟏𝟎% ∗ 𝟏𝟐𝟎𝐦𝐢𝐧 ∗ 𝟎. 𝟏%

𝟑𝟎 ∗ 𝟐𝟒 ∗ 𝟔𝟎𝐦𝐢𝐧

= 𝟏 −%𝐬𝐨𝐟𝐭 ∗ 𝐬𝐨𝐟𝐭𝐝𝐮𝐫.+%𝐡𝐚𝐫𝐝 ∗ 𝐡𝐚𝐫𝐝𝐝𝐮𝐫. ∗ 𝐟𝐫𝐚𝐜. 𝐫𝐞𝐛𝐨𝐨𝐭𝐞𝐝𝐓𝐨𝐑𝐬𝐩𝐞𝐫𝐦𝐨𝐧𝐭𝐡

𝐭𝐨𝐭𝐚𝐥𝐭𝐢𝐦𝐞𝐢𝐧𝐚𝐦𝐨𝐧𝐭𝐡

= 𝟗𝟗. 𝟗𝟗𝟗𝟗𝟑%• Dependentservices(ToRs)needtoprovideoneextraninetotargetservice(VMs)

ToRs notoncriticalpathforVMstoachievefive-ninesavailability

VMsandtheirStorageCo-location• Forloadbalancing,VMscanmountVHDsfromanystorageclusterinthesameregion

• SomeVMshavestoragethatarefurtheraway• CanlongernetworkpathsimpactVMavailability?Andbyhowmuch?

Longernetworkpathdoleadtohigher(11.4%)VHDfailurerate

• AtAzure,52%two-hop,41%four-hop• ComputedailyVHDfailurerates:rS (two-hop),rf (four-hop)• Averageover3-months, rS andrf• rf − rS rS⁄ = 11.4%increase

RelatedWork• NetPoirot [SIGCOMM'16]

• Asingle-nodesolutiontofailurelocalizationusingTCPstatistics• ComplementaryifTCPstatisticsfromcustomerVMsareavailable

• BinaryTomography• Deepview achieveshigherprecision/recallthanthosegreedyheuristics

• (Approximate)BayesianNetwork• Tooslowforourproblem• Futureworktocompareaccuracyexperimentally

Conclusion

• IdentifiedVHDfailuresastheavailabilitybottleneckatAzure

• Deepview reducedunclassifieddailyVHDfailuresfrom10,000sto100s

• Revealednewfailures,e.g.,unplannedToR reboots,storagegrayfailures

• QuantifiedtheimpactofseveralarchitecturaldecisionsonVMavailability

Thankyou!Questions?

Documents

Deepview: Virtual Disk Failure Diagnosis and Pattern ... · Deepview: Virtual Disk Failure Diagnosis and Pattern Detection for Azure Qiao Zhang1, Guo Yu2, Chuanxiong Guo3, Yingnong