Assessing the Impact of Imperfect Diagnosis on Service Reliability: A Parsimonious Model Approach
Networking and Security Group Aalborg University, Denmark ljg@es.aau.dk
European Dependable Computing Conference 2010 – Valencia, Spain April 28, 2010
(Presenter) Jesper Grønbæk, Hans-Peter Schwefel, Jens Kristian Kjærgård, Thomas S. Toftegaard
Tieto IP Solutions, Denmark
Aarhus School of Engineering, University of Aarhus, Denmark
Forschungszentrum Telekommunikation Wien, Austria
April 28, 2010 EDCC 2010 – Valencia, Spain
Imperfect Diagnosis
• Network fault diagnosis
  • Dependable end-user service provisioning in Next Generation Network architectures
  • Dominated by wireless networks, mobility and varying traffic conditions
  • Challenged by unreliable observations and hidden network states
  • Imperfect Diagnosis
• Modelling imperfect diagnosis
  • Goals of modelling
    A. Determine best remediation actions
    B. Determine best trade-off of imperfections
  • Assess properties of a given diagnosis component (function level modelling [1], system level simulation [2])
  • Light-weight models desirable for frequent model re-evaluations
Background and Motivation
Imperfect Diagnosis
• ODDR decentralized fault management framework [3][4] (Observation, Diagnosis, Decision and Remediation)
  • End-node Driven Fault Management
  • Joint view on imperfect diagnosis and decisions (remediation, observation collection)
  • Operation in dynamic environment -> frequent model re-evaluations
• Subsequent focus on trade-off of imperfections (best diagnosis settings)
Example: Decentralized Fault Management Framework
• Diagnosis atomic view
  • Single observation
  • Two network states (Normal/Fault)
  • Discrete diagnosis steps (period T)
• Generic diagnosis (state estimation) definitions
Background on Diagnosis Approaches: Definitions of Diagnosis Outcomes
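The four outcome labels used in the following slides (TN, FP, FN, TP) result from comparing the true network state with the diagnosis estimate in each diagnosis step. A minimal sketch, assuming the usual confusion-matrix convention; the function and constant names are illustrative and not taken from the slides:

```python
NORMAL, FAULT = "Normal", "Fault"

def diagnosis_outcome(network_state, estimate):
    """Classify one diagnosis step as TN, FP, FN or TP.

    network_state is the (hidden) true state, estimate is the state
    reported by the diagnosis component for the same step.
    """
    if network_state == NORMAL:
        return "TN" if estimate == NORMAL else "FP"
    return "TP" if estimate == FAULT else "FN"
```

Under this convention an alarm raised while the network is in the Normal state is a false positive (FP), and a missed fault is a false negative (FN).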
Background on Diagnosis Approaches: Diagnosis Classes
• Two levels of complexity of diagnosis behaviour
  • One-shot¹: diagnosis estimate based on a single set of observations in time
    • No correlation of diagnosis estimates from diagnosis component
    • Simple model representation proposed in [3]
  • Over-time¹: diagnosis estimate based on new and old observations
    • Means to improve diagnosis estimates
    • Strong correlation added by diagnosis component
• Comparison
  • One-shot: threshold on round-trip time (RTT)
  • Over-time: α-count heuristic (Bondavalli et al. [1]) on one-shot estimates
  • Transient effects from network neglected
[Figure: comparison of one-shot and over-time diagnosis, 2000 repetitions]
• Over-time has highly transient phase; yet significant improvement
• Identify best trade-off: Reaction Time & False Alarms
• Simple parameterization from steady-state behaviour is difficult
¹ Terminology adapted from [5]
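The two diagnosis classes compared above can be sketched in a few lines. The snippet below assumes the commonly cited form of the α-count heuristic [1]: the score grows by 1 on every fault indication and is multiplied by a forgetting factor k otherwise. The `>=` comparison against the threshold, the function names and the RTT samples are assumptions for illustration only.

```python
def one_shot_estimates(rtt_samples, rtt_threshold):
    """One-shot diagnosis: flag a fault whenever a single RTT sample exceeds the threshold."""
    return [rtt > rtt_threshold for rtt in rtt_samples]

def alpha_count(one_shot, k, alpha_threshold):
    """Over-time diagnosis: alpha-count filtering of one-shot estimates.

    alpha is incremented by 1 on each fault indication and multiplied by
    the forgetting factor k (0 <= k <= 1) otherwise; a fault is reported
    while alpha >= alpha_threshold (comparison operator is an assumption).
    """
    alpha = 0.0
    diagnosed = []
    for fault_indicated in one_shot:
        alpha = alpha + 1.0 if fault_indicated else alpha * k
        diagnosed.append(alpha >= alpha_threshold)
    return diagnosed

# Hypothetical RTT samples in seconds; k and the alpha threshold are example values.
rtts = [0.05, 0.30, 0.06, 0.31, 0.32, 0.33, 0.05]
flags = one_shot_estimates(rtts, rtt_threshold=0.2)
alarms = alpha_count(flags, k=0.92, alpha_threshold=2.5)
```

In this example the isolated RTT spike in the second sample is filtered out, while the sustained run of high RTTs raises an alarm two steps after it begins. This is the reaction time versus false alarm trade-off named on the slide.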
• Four-state Markov model presented in [3]
  • Controlled by geometric ON-OFF network state process (fault/repair occurrence) {pf, pr}
  • 2 free parameters {P(TN | Ns = Normal) = TNR = (1 - FPR), P(TP | Ns = Fault) = TPR = (1 - FNR)}
• Explore model capabilities
  • Remediation assumption: fail-over on network fault state diagnosis
  • 6 free parameters
  • Fixed {pf, pr} -> 4 free parameters
Parsimonious Diagnosis Model: Definition and Parameters
System Equations
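As an illustration of the two-parameter one-shot case, the transition matrix over the states (TN, FP, FN, TP) can be assembled from {pf, pr} and {TPR, TNR}. This is a sketch, not necessarily the exact formulation of [3]. It assumes that the network state changes first and that the diagnosis outcome in each step depends only on the network state in that step. The state ordering and function name are hypothetical.

```python
STATES = ("TN", "FP", "FN", "TP")  # first two: network Normal; last two: network Fault

def one_shot_transition_matrix(pf, pr, tpr, tnr):
    """Row-stochastic 4x4 matrix over STATES for uncorrelated one-shot diagnosis."""
    to_normal = [tnr, 1.0 - tnr, 0.0, 0.0]  # outcome split if the next network state is Normal
    to_fault = [0.0, 0.0, 1.0 - tpr, tpr]   # outcome split if the next network state is Fault
    # Network Normal: fault occurs with probability pf
    from_normal = [(1.0 - pf) * n + pf * f for n, f in zip(to_normal, to_fault)]
    # Network Fault: repair occurs with probability pr
    from_fault = [pr * n + (1.0 - pr) * f for n, f in zip(to_normal, to_fault)]
    return [from_normal, list(from_normal), from_fault, list(from_fault)]
```

Because the rows for TN and FP (and for FN and TP) are identical, this matrix carries no memory of the previous diagnosis outcome. The four free parameters of the extended model remove exactly that restriction.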
Parsimonious Diagnosis Model
• Diagnosis Metrics
  • Proposed metrics (steady state)
    • Probability on Remediation on False Alarm (pRFA)
    • Mean Remediation Reaction Time (µRRT)
  • Note: two metrics, but four free parameters
• Diagnosis Trace
  • Start diagnosis in normal network state for a given set {pf, pr}
  • Observe until alarm is diagnosed
  • Perform M repetitions and derive O = #FA (number of false alarms)
  • pRFA = O/M
  • µRRT: mean time to remediation over all M repetitions
Diagnosis Metrics Definitions
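The trace-based definitions above translate directly into a Monte Carlo estimate. The sketch below assumes the diagnosis behaviour is available as a row-stochastic 4x4 matrix over the states (TN, FP, FN, TP) and that every trace starts in TN. The matrix, the function name and the seed handling are illustrative assumptions.

```python
import random

STATES = ("TN", "FP", "FN", "TP")
ALARM_STATES = {"FP", "TP"}

def estimate_metrics(P, step_T, M, seed=0):
    """Estimate (pRFA, muRRT) from M simulated diagnosis traces.

    P is a row-stochastic 4x4 transition matrix over STATES (hypothetical
    input); each trace starts in TN and ends at the first alarm. The loop
    does not terminate if no alarm state is reachable from TN.
    """
    rng = random.Random(seed)
    false_alarms = 0  # O = #FA in the slide's notation
    total_steps = 0
    for _ in range(M):
        state, steps = 0, 0
        while STATES[state] not in ALARM_STATES:
            state = rng.choices(range(4), weights=P[state])[0]
            steps += 1
        false_alarms += STATES[state] == "FP"
        total_steps += steps
    return false_alarms / M, step_T * total_steps / M
```

With M = 2000 repetitions, as used for the traces in this talk, the estimate of pRFA is simply the fraction of traces that end in FP.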
Parsimonious Diagnosis Model
• Closed-form equations derived by linear algebraic approaches [6]
  • Probability on Remediation on False Alarm (pRFA) -> probability of absorption
  • Mean Remediation Reaction Time (µRRT) -> mean time to absorption
• Solving yields two linear equations
Diagnosis Metrics Equations
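The equations themselves are not reproduced in this text. The standard absorbing-chain computation they correspond to can be sketched as follows, assuming TN and FN are the transient states, FP and TP are made absorbing, and the chain starts in TN. The state ordering and the function name are assumptions.

```python
def absorption_metrics(P, step_T):
    """Return (pRFA, muRRT) for a chain started in TN.

    State order is (TN, FP, FN, TP); TN and FN are transient, FP and TP
    are treated as absorbing, so rows 1 and 3 of P are ignored.
    """
    # (I - Q) restricted to the transient states TN (index 0) and FN (index 2)
    a, b = 1.0 - P[0][0], -P[0][2]
    c, d = -P[2][0], 1.0 - P[2][2]
    det = a * d - b * c
    if det == 0.0:
        raise ValueError("singular system: absorption is not guaranteed")

    def solve_for_tn(rhs_tn, rhs_fn):
        # Cramer's rule for the TN component of (I - Q) x = rhs
        return (rhs_tn * d - b * rhs_fn) / det

    p_rfa = solve_for_tn(P[0][1], P[2][1])  # probability of absorption in FP
    mean_steps = solve_for_tn(1.0, 1.0)     # mean number of steps to absorption
    return p_rfa, step_T * mean_steps
```

Both metrics reduce to solving the same 2x2 linear system with different right-hand sides. This makes the model cheap enough for frequent re-evaluation.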
[Figure: Markov model with the initial state and the absorbing states marked]
• Underdetermined problem solved by heuristics
  • (MI) Minimize pFPTN and pTPFN. Minimize direct transitions TN->FP and FN->TP
• Behaviour in transient analysis:
  • Initial study parameters: T = 0.4 s, mean normal period = 12.42 s, mean fault period = 15 s
  • Captures an initial higher probability of pRTA over all alarms (pRTA + pRFA)
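The study parameters can be related to the per-step probabilities {pf, pr} of the ON-OFF process. The sketch below assumes geometrically distributed sojourn times with mean T/p, so that p = T / mean period. This mapping is an assumption about the parameterization, and the helper name is hypothetical.

```python
def step_probability(step_T, mean_period):
    """Per-step leave probability of a geometric sojourn with the given mean duration."""
    if not 0.0 < step_T <= mean_period:
        raise ValueError("step_T must be positive and no longer than the mean period")
    return step_T / mean_period

T = 0.4                          # diagnosis period in seconds
pf = step_probability(T, 12.42)  # fault occurrence probability per step, about 0.0322
pr = step_probability(T, 15.0)   # repair probability per step, about 0.0267
```

Under this assumption the network stays in the Normal state for roughly 31 diagnosis steps on average and in the Fault state for 37.5 steps.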
Parameterization by Diagnosis Metrics
[Figure: transient behaviour of pRFA, pRTA and pRTA / (pRFA + pRTA) under the minimization heuristics]
Case: Time Constrained Data Transfer
• QoS requirement: complete SCTP based file transfer within t_deadline seconds with the probability Ω
• Fault: congestion in operator infrastructure (occurrence and repair, ON-OFF model)
• Remediation: single fail-over from network A to network B
• Diagnosis: simple threshold based on RTT and α-count
• Decision: fail-over on network fault state diagnosis
Background
Case: Time Constrained Data Transfer
• Policy Evaluation Discrete Time Markov Model (PEDTMC) [3]
• State space: S_PE = {Active network, Time progress, File progress, Network state, Diagnosis state}
• Ω_model = Σ S_PE^ss(r, n)
Policy Evaluation Model
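To make the role of Ω concrete, the following is a deliberately simplified Monte Carlo sketch of the case. It is not the PEDTMC of [3]. It assumes one-shot diagnosis, a fault process on network A only, constant transfer rates per network state, and a single irreversible fail-over on the first alarm. All rates, parameter names and the function name are hypothetical.

```python
import random

def estimate_omega(file_mb, deadline_s, step_s, rate_a_normal, rate_a_fault,
                   rate_b, pf, pr, tpr, tnr, runs=2000, seed=0):
    """Fraction of simulated transfers that complete within the deadline."""
    rng = random.Random(seed)
    steps = int(round(deadline_s / step_s))
    completed = 0
    for _ in range(runs):
        fault = False  # network A starts in the Normal state
        on_b = False   # the transfer starts on network A
        sent = 0.0
        for _step in range(steps):
            # ON-OFF fault process of network A (geometric sojourn times)
            if fault:
                fault = rng.random() >= pr
            else:
                fault = rng.random() < pf
            if not on_b:
                # one-shot diagnosis of network A: alarm with TPR or with 1 - TNR
                alarm = rng.random() < (tpr if fault else 1.0 - tnr)
                if alarm:
                    on_b = True  # single fail-over, never reverted
            if on_b:
                rate = rate_b
            else:
                rate = rate_a_fault if fault else rate_a_normal
            sent += rate * step_s
            if sent >= file_mb:
                completed += 1
                break
    return completed / runs
```

Even in this reduced form the sketch shows why a wrong fail-over is costly: once the transfer has moved to network B on a false alarm, the decision cannot be undone.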
[Figure: File Transfer Completion Time CDF; residual labels: r = 1, m]
Model Sensitivity Analysis
• Model based sensitivity analysis on Ω
  • Vary µRRT and pRFA; t_deadline = 30 s and file size = 10 MByte
  • Compare to perfect diagnosis and no fail-over policy
• Both metrics have a clear impact on Ω: µRRT -> promptness and pRFA -> correctness
• Most sensitive to high pRFA -> wrong fail-over cannot be remediated
• Can deliver significantly worse performance than no fail-over
[Figure legend: Perfect Diagnosis, No fail-over]
Reliability Evaluation Results
• Study properties of α-count diagnosis component
  • α-count controlled by two parameters: k (forgetting factor), αT (threshold)
  • PEDTMC model based analysis
  • Simulation based analysis
    • System level simulation based on ns-2
    • Provide evaluation of Ω and traces of diagnosis performance
• Consider two settings of one-shot diagnosis:
  • γ0 = (TPR, TNR) = (0.983, 0.097)
  • γ1 = (TPR, TNR) = (0.953, 0.225)
• Trade-off options of α-count (obtained from single trace set, 2000 runs)
Background & Trade-off Results
Reliability Evaluation Results
• PEDTMC model based analysis
  • Simple threshold
    • γ0 performs better than γ1 (as shown in [3])
  • α-count
    • Overall leads to improvement
    • Filtering out false alarms
    • Optimal settings exist
    • γ1: k = 0.92, αT = 2.5 leads to best results
    • Obtainable reduction of pRFA without similar increase in µRRT
• Simulation based analysis
  • Consistent conclusions to model
  • Qualitative differences
    • Stochastic time model
    • Simplified data-transfer model
Background & Trade-off Results
[Figure: Ω_simulation and Ω_model versus threshold αT, compared with the simple threshold]
Conclusion & Outlook
• Conclusions
  • Proposed parsimonious imperfect diagnosis model for light-weight assessment of best diagnosis component settings; also considering complex class of over-time diagnosis components
  • Defined representative imperfect diagnosis performance metrics and derived their closed-form equations in the model
  • Presented service reliability case and performed model based sensitivity analysis of reliability on imperfect diagnosis performance metrics
  • Used model to assess diagnosis performance properties of over-time diagnosis heuristic from literature and define best setting
  • Shown by system level simulation analysis that diagnosis model can capture essential imperfect diagnosis performance characteristics
• Outlook
  • Introduce more complex decision policies
    • Application state information -> minimize remediation
    • Multiple fault diagnosis
    • Decisions to collect more information
  • Need to study diagnosis model behaviour after positive diagnosis and potentially extend
Questions & Discussion
References
[1] Threshold-based mechanisms to discriminate transient from intermittent faults. A. Bondavalli, S. Chiaradonna, F. Di Giandomenico, and F. Grandoni, IEEE Transactions on Computers, vol. 49, no. 3, pp. 230–245, 2000.
[2] Probabilistic Fault-Diagnosis in Mobile Networks Using Cross-Layer Observations. A. Nickelsen, J. Grønbæk, T. Renier, and H.-P. Schwefel, in Proceedings of AINA 09, pp. 225–232, 2009.
[3] Model based evaluation of policies for end-node driven fault recovery. J. Grønbæk, H.-P. Schwefel, and T. Toftegaard, Proc. DRCN 09, 2009.
[4] Towards self-adaptive reliable network services in highly-uncertain environments. A. Ceccarelli, J. Grønbæk, L. Montecchi, A. Bondavalli, and H. P. Schwefel, To appear in proceedings of WORNUS 10, May, 2010.
[5] Hidden Markov Models as a Support for Diagnosis: Formalization of the Problem and Synthesis of the Solution. A. Daidone, F. Di Giandomenico, S. Chiaradonna, and A. Bondavalli, in 25th IEEE Symposium on Reliable Distributed Systems (SRDS'06), pp. 245–256, 2006.
[6] Queueing Theory – A Linear Algebraic Approach. L. Lipsky, 2nd ed. Springer, 2009.