Proactive Fault Management
Felix Salfner
7.7.2010
www.rok.informatik.hu-berlin.de/Members/salfner
Contents
• Introduction
• Variable Selection
• Online Failure Prediction Overview
• Four Online Failure Prediction Techniques
• Assessing Failure Predictors
• Taking Action
• Summary
PFM – 7.7.2010 – 2
Our Credo
“Ordinary mortals know what’s happening now, the gods know what the future holds because they alone are totally enlightened. Wise men are aware of future things just about to happen.”
C. P. Cavafy (Greek poet, 1863-1933), “But the Wise Perceive Things about to Happen,” a poem based on lines by Philostratos
Motivation
• Ever-increasing systems complexity
• Ever-growing number of attacks and threats, novice users and third-party or open-source software, COTS
• Growing connectivity and interoperability
• Dynamicity (frequent configurations, reconfigurations, updates, upgrades and patches, ad hoc extensions), and
• Natural and man-made disasters
A Different Mindset
• Faults, errors and failures are common events, so let us treat them as part of the system behavior and learn how to cope with them
• Attractive panacea:
(Self-)Proactive Fault Management (PFM)
Proactive Fault Management (PFM)
PFM is an umbrella term for techniques such as monitoring, diagnosis, prediction, recovery and preventive maintenance concerned with proactive handling of errors and failures: if the system knows about a critical situation in advance, it can try to apply countermeasures in order to prevent the occurrence of a failure, or it can prepare repair mechanisms for the upcoming failure in order to reduce time-to-repair.
[Figure: PFM control loop — Adaptive Monitoring → Variable Selection → Failure Prediction / Diagnosis → Scheduling / Execution → Evaluation / Adaptation]
How to Get There
[Figure: pipeline from system observation (time series, log files) through variable selection (forward selection, probabilistic wrapper), model estimation (Hidden Markov Models, Universal Basis Functions) and model application (sensitivity analysis, prediction) to reaction/actuation and offline system adaptation]
[Hoffmann, Trivedi, Malek 2007]
Comparison to Classical Reliability Theory
• Classical reliability theory is typically useful for long-term or average behavior predictions and comparative analysis
• Classical reliability theory may help but is not very good for short-term prediction due to dynamics, mobility, systems/networks complexity, changing execution environments, upgrades, online repair, etc.
Contents
• Introduction
• Variable Selection
• Online Failure Prediction Overview
• Four Online Failure Prediction Techniques
• Assessing Failure Predictors
• Taking Action
• Summary
Variable Selection
• What are the right variables to use for modeling?
• There are up to 4200 variables (v) and up to hundreds of fault classes (f) per node
• For n nodes: m = v × f × n variables, the number of combinations c equals:

c = Σ_{r=1}^{m} (m choose r) = 2^m − 1

• Combinatorial explosion!
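The growth of c can be checked directly (an illustrative snippet, not part of the original slides):

```python
from math import comb

# Number of non-empty variable subsets for m candidate variables:
# c = sum_{r=1}^{m} C(m, r) = 2^m - 1
def num_subsets(m: int) -> int:
    return sum(comb(m, r) for r in range(1, m + 1))

print(num_subsets(10))  # 1023
# Even a modest m = 50 already yields about 1.1e15 subsets,
# which is why exhaustive search over variable combinations is hopeless.
print(num_subsets(50) == 2**50 - 1)  # True
```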
Variable Selection Methods
• Selection by experts
• Filter (e.g., mutual information criterion)
• Wrapper (making use of modeling procedure specifics)
  - feed-forward selection, finding independent variables
  - backward elimination
  - probabilistic (only variables showing correlation and a certain distribution)
• Forward addition: a method of selecting random variables for inclusion in the regression model by starting with no variables and then gradually adding those that contribute most to prediction
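A minimal sketch of wrapper-style forward selection (the synthetic data and the plain least-squares model are assumptions for illustration):

```python
import numpy as np

# Greedy forward selection: start with no variables and repeatedly add the
# candidate that most reduces validation error of a least-squares model.
def forward_selection(X_train, y_train, X_val, y_val, max_vars=2):
    selected, remaining = [], list(range(X_train.shape[1]))
    while remaining and len(selected) < max_vars:
        best_var, best_err = None, None
        for v in remaining:
            cols = selected + [v]
            # fit a linear model on the candidate variable subset
            w, *_ = np.linalg.lstsq(X_train[:, cols], y_train, rcond=None)
            err = np.mean((X_val[:, cols] @ w - y_val) ** 2)
            if best_err is None or err < best_err:
                best_var, best_err = v, err
        selected.append(best_var)
        remaining.remove(best_var)
    return selected

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3 * X[:, 0] - 2 * X[:, 2]       # only variables 0 and 2 matter
sel = forward_selection(X[:150], y[:150], X[150:], y[150:])
print(sorted(sel))  # [0, 2]
```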
Variable Selection
• Benchmarked four techniques
  - Forward selection
  - Backward elimination
  - Expert selected
  - PWA (Prob. Wrapper)
• Variables
  - alloc
  - sema/s
• PWA performs best on time series and class label data
[Figure: out-of-sample AUC (0.50–0.75) for each variable selection technique, on time-series and class-label data]
[Hoffmann, Malek 2006; Hoffmann 2005]
Contents
• Introduction
• Variable Selection
• Online Failure Prediction Taxonomy
• Online Failure Prediction Techniques
• Assessing Failure Predictors
• Taking Action
• Summary
Faults, Errors, Failures again …
[Figure: a fault is found by testing, auditing or tracking; upon activation it becomes an error. A detected error leads to reporting; an undetected error can turn into a failure, which affects the external state. Side-effects of faults appear as symptoms, captured by monitoring.]
Four Ways of Detecting Faults
(1) The system can be audited in order to actively search for faults, e.g., by testing on checksums of data structures, etc.
(2) System parameters such as memory usage, number of processes, workload, etc., can be monitored in order to identify side-effects of the faults. These side-effects are called symptoms. For example, the side-effect of a memory leak is that the amount of free memory decreases over time.
(3) If a fault is activated and detected (observed), it turns into an error.
(4) If the fault is not detected by fault detection mechanisms, it might directly turn into a failure which can be observed from outside the system or component.
Taxonomy
Online failure prediction approaches are grouped by their input data:
• Failure Tracking: Co-occurrence, Probability Distribution Estimation
• Symptom Monitoring: Function Approximation, Classifiers, System Models, Time Series Analysis
• Detected Error Reporting: Rule-based approaches, Co-occurrence, Pattern recognition, Statistical Tests
• Undetected Error Auditing
[Salfner, Lenk, Malek, CSUR]
Online Failure Prediction – Definition
• The goal of online failure prediction is to identify failure-prone situations, i.e., situations that will probably evolve into a failure. The evaluation is based on runtime monitoring data.
• The output of online failure prediction can either be
  - a decision that a failure is imminent or not, or
  - some continuous measure evaluating how failure-prone the current situation is
Two Types of Input Data
• There are two types of system measurements
  - periodic, numerical
  - event-based, categorical
• Examples for periodic data:
  - system / CPU load
  - memory usage
• Examples for event-based data:
  - interrupts
  - threshold violations
  - error events
[Figure: a periodic measurement F(t) over time, and an event sequence C A B C within a data window up to the present time, from which a failure prediction is made]
Contents
• Introduction
• Variable Selection
• Online Failure Prediction Overview
• Four Online Failure Prediction Techniques
• Assessing Failure Predictors
• Taking Action
• Summary
Prediction Techniques: Examples
1. Universal Basis Functions (UBF)
2. Hidden Semi-Markov Model (HSMM)
3. Dispersion-Frame Technique (DFT)
4. Eventset method
Universal Basis Functions (UBF)
• Tailored to periodic measurements
• Function approximation approach: express the target value as a function of input variables
• Examples for target values:
  - Availability
  - Memory consumption
[Figure: the UBF model approximates the real, unknown failure-indicator function of the running system from measurements; the predicted time of failure occurrence is where the approximation crosses a threshold, relative to the present time (time of prediction)]
UBF Background
• Starting from radial basis functions: a linear combination of kernel functions G_i

f(x) = Σ_{i=1}^{n} c_i · G_i(x)

• Replace the fixed Gaussian by a flexible, domain-specific kernel:

G_i = ω_i · Φ_1 + (1 − ω_i) · Φ_2

• Determine the parameters by minimizing, e.g., the mean square error:

min H(f) = Σ_{i=1}^{N} (f(x_i) − y_i)²
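A minimal sketch of fitting such a kernel expansion by least squares (fixed Gaussian kernels only; the mixed, data-specific kernels and the learned ω of actual UBF are not shown, and all data here are illustrative assumptions):

```python
import numpy as np

# Fit f(x) = sum_i c_i * G_i(x) with Gaussian kernels G_i at fixed centers,
# by solving the least-squares problem min_c ||G c - y||^2.
def fit_rbf(x, y, centers, width=1.0):
    G = np.exp(-((x[:, None] - centers[None, :]) ** 2) / (2 * width**2))
    c, *_ = np.linalg.lstsq(G, y, rcond=None)
    return c

def predict_rbf(x, centers, c, width=1.0):
    G = np.exp(-((x[:, None] - centers[None, :]) ** 2) / (2 * width**2))
    return G @ c

x = np.linspace(0, 10, 200)
y = np.sin(x)                      # stand-in for a measured target value
centers = np.linspace(0, 10, 15)
c = fit_rbf(x, y, centers)
print(np.max(np.abs(predict_rbf(x, centers, c) - y)))  # small residual
```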
Effect of ω (UBF Slider)
• Linear combination of nonlinear kernel functions
• Examples: Gaussian, sigmoid functions, …
• RBF is a special case
• Improve efficacy by introducing data-specific flexible kernels
• Universal approximator
• A large number of kernels is needed to cover heavy-tailed distributions
[Figure: 3D surface of kernel activation (0.0–1.0) over x and the UBF slider ω]
Dispersion Frame Technique
• Classic technique for error-log analysis
• Evaluates the time of error occurrence
• Applies a set of heuristic rules evaluating the number of errors within successive dispersion frames
EDI: Error Dispersion Index; DF: Dispersion Frame
[Figure: a timeline of error occurrences i−4 … i with dispersion frames DF 1 and DF 2 and EDI values of 2 and 3]
[Lin, Siewiorek 1990]
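A toy version of the dispersion-frame idea (the frame definition and the warning rule below are simplified assumptions for illustration, not the exact rule set of Lin and Siewiorek):

```python
# Toy dispersion-frame check: use the mean inter-error time observed so
# far as the frame, count how many errors fall into one frame length
# before the current error (a crude Error Dispersion Index), and warn
# when errors cluster, i.e., the count reaches min_edi.
def dft_warning(times, min_edi=3):
    warnings = []
    for i in range(2, len(times)):
        frame = (times[i - 1] - times[0]) / (i - 1)  # mean inter-error time
        edi = sum(1 for t in times if times[i] - frame <= t <= times[i])
        if edi >= min_edi:
            warnings.append(times[i])
    return warnings

# Errors that accelerate toward the end trigger a warning;
# evenly spaced errors do not.
print(dft_warning([0, 100, 200, 210, 215]))  # [215]
print(dft_warning([0, 100, 200, 300, 400]))  # []
```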
Eventset Method
• Approach inspired by data mining
• Focus on the type of events
• Based on sets of events
  - Each set contains decisive events that occur prior to a target event
  - Events correspond to errors in our taxonomy
  - Target events correspond to failures
  - Eventsets do not keep timing information
• Result: a rule-based failure prediction system containing a database of indicative eventsets
[Vilalta, Ma 2002]
Eventset Method
[Figure: events observed in failure windows 1 and 2 (e.g., A and B before each failure F) and in a non-failure window yield an initial eventset database {A, B, C, AB, AC, BC, ABC}; keeping only frequent, accurate, validated, and specific eventsets leaves the final eventset database, e.g., {AB}]
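The mining step can be sketched as follows (the thresholds are illustrative assumptions, and the "specific" pruning step of the original method is omitted):

```python
from itertools import combinations

# Sketch of eventset mining: enumerate event subsets seen in windows
# preceding failures, keep those frequent among failure windows and
# rare in non-failure windows.
def mine_eventsets(failure_windows, nonfailure_windows,
                   min_support=0.5, max_fp=0.0, max_size=3):
    candidates = set()
    for w in failure_windows:
        for k in range(1, min(len(w), max_size) + 1):
            candidates.update(map(frozenset, combinations(sorted(w), k)))
    keep = set()
    for s in candidates:
        support = sum(s <= set(w) for w in failure_windows) / len(failure_windows)
        fp_rate = sum(s <= set(w) for w in nonfailure_windows) / len(nonfailure_windows)
        if support >= min_support and fp_rate <= max_fp:
            keep.add(s)
    return keep

failures = [{"A", "B"}, {"A", "B", "C"}]   # events before each failure
ok = [{"C"}, {"A"}, {"B", "C"}]            # events in non-failure windows
kept = mine_eventsets(failures, ok)
print(sorted("".join(sorted(s)) for s in kept))  # ['AB', 'ABC', 'AC']
```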
Hidden Semi-Markov Model Prediction
• Components of complex systems depend on each other
• Dependencies lead to error patterns
[Figure: errors C, B, A, D, E occurring on components C1–C4 over time]
• Fault-tolerant systems:
  - Failures occur only under certain conditions
  - Failure-prone conditions can be identified by specific error patterns
⇒ Use pattern recognition to identify symptomatic situations
[Salfner, Malek 2007; Salfner 2008]
Approach
• Standard tool for pattern recognition: Hidden Markov Models
• Identify symptomatic patterns
  - Algorithmically
  - From recorded training data ⇒ machine learning
• Additional assumption:
  - The time between events is decisive (temporal sequence analysis)
  - Standard Hidden Markov Models need to be extended
⇒ Development of a Hidden Semi-Markov Model (HSMM)
• The approach incorporates both type and time of occurrence of error events within a data window up to the present time
Hidden Semi-Markov Models
[Figure: states 1 … N with emission probabilities b_i(A), b_i(B), b_i(C) and transition distributions g_ij(t)]
• Discrete-Time Markov Chains (DTMC) consist of states (1…N) and transition probabilities p_ij between states
• In Hidden Markov Models (HMM), each state can generate a symbol A, B, C according to a probability distribution b_i(o_k)
• Hidden semi-Markov models (HSMMs) replace the transition probabilities p_ij by time-continuous cumulative probability distributions g_ij(t)
Machine Learning: Two Steps
1. Training: fit model parameters to training data — one HSMM for failure sequences and one HSMM for non-failure sequences
2. Prediction: process runtime measurements — compute the sequence likelihood under both models; classification of the more likely model yields the failure prediction
[Figure: failure and non-failure training sequences of error symbols A, B, C (and failures F) over time; at runtime, a sequence's likelihood under the failure HSMM and the non-failure HSMM is compared for classification]
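The two-model classification idea can be illustrated with plain discrete Markov chains (a deliberate simplification: a real HSMM additionally models hidden states and continuous inter-event times; all sequences here are synthetic assumptions):

```python
import math
from collections import defaultdict

# Train one discrete Markov chain on failure-preceding error sequences and
# one on non-failure sequences; classify a new sequence by comparing its
# log-likelihood under both models (with Laplace smoothing).
def train_chain(sequences, alphabet, alpha=1.0):
    counts = defaultdict(lambda: defaultdict(float))
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1
    probs = {}
    for a in alphabet:
        total = sum(counts[a].values()) + alpha * len(alphabet)
        probs[a] = {b: (counts[a][b] + alpha) / total for b in alphabet}
    return probs

def log_likelihood(chain, seq):
    return sum(math.log(chain[a][b]) for a, b in zip(seq, seq[1:]))

alphabet = "ABC"
fail_chain = train_chain(["ABAB", "ABAB", "ABA"], alphabet)  # failure-prone
ok_chain = train_chain(["ACCC", "CCAC", "CCC"], alphabet)    # non-failure

def predict_failure(seq):
    return log_likelihood(fail_chain, seq) > log_likelihood(ok_chain, seq)

print(predict_failure("ABAB"), predict_failure("CCCC"))  # True False
```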
Contents
• Introduction
• Variable Selection
• Online Failure Prediction Overview
• Four Online Failure Prediction Techniques
• Assessing Failure Predictors
• Taking Action
• Summary
Precision, Recall and Other Metrics
Contingency table:

                  True failure          True success           Sum
Failure alarm     Correct alarm (TP)    False alarm (FP)       # Alarms
No warning        Missing alarm (FN)    Correct no-alarm (TN)  # No-alarms
Sum               # Failures            # Successes            # Total

• Precision: fraction of correct alarms: precision = TP / (TP + FP)
• Recall: fraction of predicted failures: recall = TP / (TP + FN)
• False positive rate: fpr = FP / (FP + TN)
• True positive rate is equal to recall
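These metrics follow directly from the contingency table (the counts below are illustrative):

```python
# Prediction-quality metrics from contingency-table counts: precision,
# recall (= true positive rate), false positive rate, and the F-measure
# as the harmonic mean of precision and recall.
def metrics(tp, fp, fn, tn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)              # true positive rate
    fpr = fp / (fp + tn)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, fpr, f_measure

p, r, fpr, f = metrics(tp=40, fp=10, fn=20, tn=990)
print(p, r, fpr, f)  # 0.8, ~0.667, 0.01, ~0.727
```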
Receiver Operating Characteristics (ROC)
• Plot the true positive rate (recall) over the false positive rate for various thresholds
• Threshold ∞: tpr and fpr equal to zero
• Threshold −∞: tpr and fpr equal to one
[Figure: ROC curve for the HSMM predictor; a perfect predictor reaches the top-left corner (tpr 1, fpr 0), a random predictor follows the diagonal, an average predictor lies in between]
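A ROC curve can be traced by sweeping the alarm threshold over the predictor's scores (scores and labels below are synthetic assumptions):

```python
import numpy as np

# For each threshold, an alarm is raised when the score reaches it;
# tpr and fpr are computed against the true labels. AUC is the area
# under the resulting curve (trapezoidal rule).
def roc_curve(scores, labels):
    thresholds = np.sort(np.unique(scores))[::-1]
    tprs, fprs = [0.0], [0.0]          # threshold = infinity
    pos, neg = np.sum(labels == 1), np.sum(labels == 0)
    for th in thresholds:
        alarm = scores >= th
        tprs.append(np.sum(alarm & (labels == 1)) / pos)
        fprs.append(np.sum(alarm & (labels == 0)) / neg)
    return np.array(fprs), np.array(tprs)

def auc(fprs, tprs):
    return float(np.sum(np.diff(fprs) * (tprs[1:] + tprs[:-1]) / 2))

scores = np.array([0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1])
labels = np.array([1, 1, 1, 0, 1, 0, 0, 0])
fprs, tprs = roc_curve(scores, labels)
print(auc(fprs, tprs))  # 0.9375
```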
Precision-Recall Plots
• Plot precision over recall for various thresholds
• Threshold ∞: precision equal to one, recall equal to zero
• Threshold −∞: precision equal to the fraction of positive examples among all examples, recall equal to one
[Figure: precision-recall plot for the HSMM predictor, with curves A, B and C]
Scalar Metrics
• ROC, precision/recall diagrams, etc. are graphs
• Good for visual inspection, bad for algorithmic decisions
• Goal: obtain one real number to evaluate the „quality“ of a predictor
• Examples:
  - Precision-recall break-even: value at which precision and recall cross
  - Area under curve (AUC): area under the ROC curve
  - F-measure: harmonic mean of precision and recall: F = 2 · precision · recall / (precision + recall)
Contents
• Introduction
• Variable Selection
• Online Failure Prediction Overview
• Four Online Failure Prediction Techniques
• Assessing Failure Predictors
• Taking Action
• Summary
Taking Action
• Failure prediction is only the first step in managing faults proactively
• After a potential failure has been predicted, an action must be taken in order to
  - Avoid the failure (prevent it from occurring)
  - Prepare the system such that TTR can be reduced
  - (See lecture seven)
• In general, the following steps have to be performed:
Failure Prediction → Diagnosis → Scheduling of an Action → Executing the Action
Taxonomy of Reaction Methods
Actions triggered by failure prediction:
• Downtime avoidance: state clean-up, preventive failover, load lowering
• Downtime minimization: prepared repair, preventive restart
Effects of Proactive Methods
• Downtime avoidance improves MTTF
• Downtime minimization reduces MTTR:
[Figure: downtime timelines without and with failure prediction, showing a shorter time-to-repair (TTR) when repair is prepared before the failure occurs]
• However, in case of frequent false positive and false negative predictions, proactive fault management can also reduce availability!
Contents
• Introduction
• Variable Selection
• Online Failure Prediction Overview
• Four Online Failure Prediction Techniques
• Assessing Failure Predictors
• Taking Action
• Summary
Summary
• Dynamics and complexity of today‘s systems require adaptive and proactive mechanisms to handle faults
• Proactive Fault Management (PFM) builds on
  - Continuously observing the system
  - Predicting whether a failure is coming up
  - In case of an upcoming failure:
    o Analyze the fault that causes the upcoming failure
    o Decide what to do: either try to avoid the failure or prepare repair mechanisms for the upcoming failure
• Since online failure predictors build on monitoring data, the best set of variables needs to be identified: variable selection
• Analyses of PFM suggest that PFM has the potential to enhance system availability by up to an order of magnitude
Thanks!
Questions?
Prediction in Computer Science
• Scheduling
• Branch prediction
• Memory management
• Performance evaluation
• Reliability prediction
• …
• Failure prediction
A Long Way to Go
• Daimler predicted in 1900 that Europe’s production of cars would not exceed 1,000 a year because it would not be possible to train enough chauffeurs
• In 1900, the General Post Office of the United Kingdom predicted that there would be a demand for about ONE phone for each city with a population of more than one hundred thousand
• "There is not the slightest indication that nuclear energy will ever be obtainable." (Albert Einstein, 1932)
• "There is no reason for any individual to have a computer in their home." (Ken Olsen, 1977)
Predicting the Future
• Predicting the future has fascinated people from the beginning of times
• Several million people work on prediction daily
• Astrologists, meteorologists, politicians, pollsters, stock analysts, doctors, …, and many computer scientists/engineers
© BBC 1999
Contents
• Runtime Monitoring
• Error Logs
• Variable Selection
• Online Failure Prediction Taxonomy
• Online Failure Prediction Techniques
• A Case Study
• Taking Action
• Summary
Runtime (Online) Monitoring
• Runtime monitoring is a continuous observation of system variables for a given purpose such as diagnosis or failure prediction
• Fundamental questions:
  - Which variables to monitor?
  - How frequently (sampling rate)?
  - At what level should we monitor?
  - How will performance be affected?
  - What storage will be needed?
  - When and how to process the monitoring data?
• Remember: monitoring is not free and never complete
Monitoring – At What Level?
[Figure: monitoring rings, from hardware through software and services up to the business process/enterprise level]
Types of Data Sources
• Error logs (logfiles)
  - Lack of uniformity
  - Standards are just emerging
  - Redundancy (in some cases a problem is reported 60,000 times)
• System Activity Reporter (SAR) data
  - Up to 4200 parameters can be monitored in real systems
  - Up to five can usually be measured and processed in real time for 1-5 minute predictions
Contents
• Runtime Monitoring
• Error Logs
• Variable Selection
• Online Failure Prediction Taxonomy
• Online Failure Prediction Techniques
• A Case Study
• Taking Action
• Summary
What Is the Use of Logfiles Today?
• Development
  o Programming errors
• Operation and maintenance
  o Root-cause diagnosis of system failures
  o Configuration errors
• Research
  o Source data for analysis
Requirements Evolution
• Target
  - Today: human readable
  - Tomorrow: both human and machine readable
• Domain knowledge
  - Today: domain-specific, implicit domain knowledge, e.g., selected thresholds
  - Tomorrow: comprehensive description
• Standardization
  - Today: proprietary formats, homogeneous environment
  - Tomorrow: standards for heterogeneous environments, universal tools, analysis server
Typical Problems with Error Logs
• Timestamp: no standard format and interpretation
• Unknown number representation (decimal or hexadecimal?)
• No token identifier / separator
• No machine-readable format specification for the entire log
• Repetitive patterns (multiple reporting of problems)
Error Log Example
2004/02/09-19:26:13.634089-29836-00010-LIB_ABC ANOPTK#0243546463464346|0555553456|00000000000000-4456547457434-2.3.1|356546346|0001
2004/02/09-19:26:13.634089-29836-00010-LIB_ABC src=APPLICATION sev=SEVERITY_MINOR
2004/02/09-19:26:13.634089-29836-00010-LIB_ABC unknown value specified in Context 000256
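A parser for lines of this shape could look as follows (the field meanings, e.g. that the second number is a process id and the third a message code, are assumptions for illustration):

```python
import re

# Split a log line into timestamp, assumed process id and code, component
# name, and the free-form remainder of the message.
LINE = re.compile(
    r"(?P<ts>\d{4}/\d{2}/\d{2}-\d{2}:\d{2}:\d{2}\.\d+)-"
    r"(?P<pid>\d+)-(?P<code>\d+)-(?P<component>\w+)\s+(?P<msg>.*)"
)

m = LINE.match(
    "2004/02/09-19:26:13.634089-29836-00010-LIB_ABC "
    "src=APPLICATION sev=SEVERITY_MINOR"
)
print(m.group("component"), "|", m.group("msg"))
```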
WSDM Event Format (WEF)
• Log standard by Web Services Distributed Management (WSDM), approved by OASIS
[Figure: WEF event structure, © IBM]
Impact of Monitoring
[Figure: increase in response time (%) over monitoring traffic (100 bytes/s)]
Contents
• Runtime Monitoring
• Error Logs
• Variable Selection
• Online Failure Prediction Taxonomy
• Online Failure Prediction Techniques
• A Case Study
• Taking Action
• Summary
Universal Basis Functions (UBF)
• Tailored to periodic measurements, e.g.,
  - Semaphore operations per second or minute (sema/s)
  - Allocated OS-kernel memory (alloc)
• Function approximation approach: express the target value as a function of input variables
• Examples for target values:
  - Availability
  - Memory consumption
[Figure: time series F(t), sema/s and alloc over time t]
[Hoffmann, Malek 2006; Hoffmann 2005]
Case Study
• Commercial telecommunication platform
• Platform implements service control functions
  - Examples: billing, SMS, pre-paid services
• 400-10,000 service requests per minute
• Distributed and component-based system
  - 1.5+ million lines of code
  - 2000+ classes and 200+ components
  - Two nodes (up to eight)
Definition of Failures
[Figure: measured availability over time (minutes), varying between about 0.9990 and 0.9998]
Availability is measured per interval as

A = (# calls with response time ≤ 250 ms) / (total # of calls)
Experimental Setup
[Figure: a stress generator issues calls to the telco system and records responses; monitoring yields SAR data, log file data, and a failure log]
• SAR (system activity reporter) data: periodic; 96 variables; approx. 13 million observations
• Log file data: event-driven; 390 variables; approx. 4 million observations
Results for UBF
• Plotting recall over false positive rate yields the Receiver Operating Characteristic (ROC) curve
• We use the Area Under the ROC Curve (AUC) for comparison
• A perfect predictor results in AUC = 1.0
• Results: mean AUC values for 0, 5, 10, 15 minute predictions into the future
• Comparison of UBF with Maximum Likelihood (ML), Radial Basis Functions (RBF) and non-linearities (NL)
[Figure: mean AUC (0.5–1.0) over lead time (0, 5, 10, 15 min) for UBF, RBF, ML, UBF-NL and RBF-NL]
Comparison of Techniques
[Figure: precision, recall, F-measure and false positive rate (scale 0.0–0.7) for the DFT, Eventset and HSMM predictors]
Contents
• Runtime Monitoring
• Error Logs
• Variable Selection
• Online Failure Prediction Taxonomy
• Online Failure Prediction Techniques
• A Case Study
• Taking Action
• Summary
Accumulated Runtime Cost
• Assign cost to:
  - Correct no-alarms: correctly predicted that no failure is imminent, no action performed; cost of 1 cost unit
  - Correct alarms: a failure was predicted and a proactive action has been carried out; cost of 10 cost units
  - False alarms: preventive actions are performed in vain; cost of 20 cost units
  - Missing alarms: a true failure is not predicted (worst case); cost of 100 cost units
• Based on log data, we plot accumulated runtime cost
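The accumulation itself is straightforward (using the cost units from the slide; the outcome sequence is an illustrative assumption):

```python
# Accumulate runtime cost over a sequence of prediction outcomes,
# using the per-outcome cost units from the slide.
COST = {"TN": 1, "TP": 10, "FP": 20, "FN": 100}

def accumulated_cost(outcomes):
    total, curve = 0, []
    for o in outcomes:
        total += COST[o]
        curve.append(total)   # running total, as plotted over time
    return curve

# A single missed failure (FN) dominates the accumulated cost.
print(accumulated_cost(["TN", "TN", "TP", "FP", "TN", "FN"]))
# [1, 2, 12, 32, 33, 133]
```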
Resulting Cumulative Cost
[Figure: accumulated cost over time for the HSMM predictor, between a hypothetical predictor making only FP and FN predictions (upper curve) and one making only TP and TN predictions (lower curve)]
Research Issues
• Runtime monitoring
  - Overhead vs. completeness / usefulness
  - Raw data vs. extracting information
• Root cause analysis / diagnosis
  - What methods from traditional diagnosis can be used proactively?
• Prediction-driven actions
  - What recovery methods use failure prediction most effectively?
• Decision strategies
  - Models and methods to decide upon or schedule actions
• Accuracy
  - Accuracy of predictive diagnosis and prediction techniques
  - Success probabilities, performance impact of recovery and preventive maintenance actions
• Analysis
  - How sensitive is PFM to system changes?
  - What methods from control theory can be applied?
  - Models for Proactive Fault Management
  - PFM economics
Key Choices for Effective PFM
1) Monitoring: what variables, when, how and at what level
2) Failure prediction: model, method and effectiveness measures
3) Failure avoidance or recovery: when, how and at what level
4) Closing the loop: learn, refine, tune and apply again