Proactive Fault Management
Felix Salfner
7.7.2010
www.rok.informatik.hu-berlin.de/Members/salfner
Contents
• Introduction
• Variable Selection
• Online Failure Prediction Overview
• Four Online Failure Prediction Techniques
• Assessing Failure Predictors
• Taking Action
• Summary
PFM – 7.7.2010 – 2
Our Credo
“Ordinary mortals know what’s happening now, the gods know what the future holds because they alone are totally enlightened. Wise men are aware of future things just about to happen.”
C. P. Cavafy (Greek poet, 1863-1933), “But the Wise Perceive Things about to Happen,” a poem based on lines by Philostratos
Motivation
• Ever-increasing systems complexity
• Ever-growing number of attacks and threats, novice users and third-party or open-source software, COTS
• Growing connectivity and interoperability
• Dynamicity (frequent configurations, reconfigurations, updates, upgrades and patches, ad hoc extensions), and
• Natural and man-made disasters
A Different Mindset
• Faults, errors and failures are common events, so let us treat them as part of the system behavior and learn how to cope with them
• Attractive panacea:
(Self-)Proactive Fault Management (PFM)
Proactive Fault Management (PFM)
PFM is an umbrella term for techniques such as monitoring, diagnosis, prediction, recovery and preventive maintenance concerned with proactive handling of errors and failures: if the system knows about a critical situation in advance, it can try to apply countermeasures in order to prevent the occurrence of a failure, or it can prepare repair mechanisms for the upcoming failure in order to reduce time-to-repair.
[Figure: PFM control loop — Adaptive Monitoring → Variable Selection → Failure Prediction / Diagnosis → Scheduling / Execution → Evaluation / Adaptation]
How to Get There
[Figure: pipeline from system observation (time series, log files) through variable selection (forward selection, probabilistic wrapper), model estimation (Hidden Markov Models, Universal Basis Functions) and model application (sensitivity analysis, prediction) to reaction/actuation and offline system adaptation]
[Hoffmann, Trivedi, Malek 2007]
Comparison to Classical Reliability Theory
• Classical reliability theory is typically useful for long-term or average behavior predictions and comparative analysis
• Classical reliability theory may help but is not very good for short-term prediction due to dynamics, mobility, systems/networks complexity, changing execution environments, upgrades, online repair, etc.
Contents
• Introduction
• Variable Selection
• Online Failure Prediction Overview
• Four Online Failure Prediction Techniques
• Assessing Failure Predictors
• Taking Action
• Summary
Variable Selection
• What are the right variables to use for modeling?
• There are up to 4200 variables (v) and up to hundreds of fault classes (f) per node
• For n nodes: m = v × f × n variables, the number of combinations c equals:

c = Σ_{r=1}^{m} (m choose r) = 2^m − 1

• Combinatorial explosion!
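The growth of c can be checked directly (an illustrative snippet, not part of the original slides):

```python
from math import comb

# Number of non-empty variable subsets for m candidate variables:
# c = sum_{r=1}^{m} C(m, r) = 2^m - 1
def num_subsets(m: int) -> int:
    return sum(comb(m, r) for r in range(1, m + 1))

print(num_subsets(10))  # 1023
# Even a modest m = 50 already yields about 1.1e15 subsets,
# which is why exhaustive search over variable combinations is hopeless.
print(num_subsets(50) == 2**50 - 1)  # True
```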
Variable Selection Methods
• Selection by experts
• Filter (e.g., mutual information criterion)
• Wrapper (making use of modeling procedure specifics)
  - feed-forward selection, finding independent variables
  - backward elimination
  - probabilistic (only variables showing correlation and a certain distribution)
• Forward addition: a method of selecting random variables for inclusion in the regression model by starting with no variables and then gradually adding those that contribute most to prediction
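A minimal sketch of wrapper-style forward selection (the synthetic data and the plain least-squares model are assumptions for illustration):

```python
import numpy as np

# Greedy forward selection: start with no variables and repeatedly add the
# candidate that most reduces validation error of a least-squares model.
def forward_selection(X_train, y_train, X_val, y_val, max_vars=2):
    selected, remaining = [], list(range(X_train.shape[1]))
    while remaining and len(selected) < max_vars:
        best_var, best_err = None, None
        for v in remaining:
            cols = selected + [v]
            # fit a linear model on the candidate variable subset
            w, *_ = np.linalg.lstsq(X_train[:, cols], y_train, rcond=None)
            err = np.mean((X_val[:, cols] @ w - y_val) ** 2)
            if best_err is None or err < best_err:
                best_var, best_err = v, err
        selected.append(best_var)
        remaining.remove(best_var)
    return selected

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3 * X[:, 0] - 2 * X[:, 2]       # only variables 0 and 2 matter
sel = forward_selection(X[:150], y[:150], X[150:], y[150:])
print(sorted(sel))  # [0, 2]
```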
Variable Selection
• Benchmarked four techniques
  - Forward selection
  - Backward elimination
  - Expert selected
  - PWA (Prob. Wrapper)
• Variables
  - alloc
  - sema/s
• PWA performs best on time series and class label data
[Figure: out-of-sample AUC (0.50–0.75) for each variable selection technique, on time-series and class-label data]
[Hoffmann, Malek 2006; Hoffmann 2005]
Contents
• Introduction
• Variable Selection
• Online Failure Prediction Taxonomy
• Online Failure Prediction Techniques
• Assessing Failure Predictors
• Taking Action
• Summary
Faults, Errors, Failures again …
[Figure: a fault is found by testing, auditing or tracking; upon activation it becomes an error. A detected error leads to reporting; an undetected error can turn into a failure, which affects the external state. Side-effects of faults appear as symptoms, captured by monitoring.]
Four Ways of Detecting Faults
(1) The system can be audited in order to actively search for faults, e.g., by testing on checksums of data structures, etc.
(2) System parameters such as memory usage, number of processes, workload, etc., can be monitored in order to identify side-effects of the faults. These side-effects are called symptoms. For example, the side-effect of a memory leak is that the amount of free memory decreases over time.
(3) If a fault is activated and detected (observed), it turns into an error.
(4) If the fault is not detected by fault detection mechanisms, it might directly turn into a failure which can be observed from outside the system or component.
Taxonomy
Online failure prediction approaches are grouped by their input data:
• Failure Tracking: Co-occurrence, Probability Distribution Estimation
• Symptom Monitoring: Function Approximation, Classifiers, System Models, Time Series Analysis
• Detected Error Reporting: Rule-based approaches, Co-occurrence, Pattern recognition, Statistical Tests
• Undetected Error Auditing
[Salfner, Lenk, Malek, CSUR]
Online Failure Prediction – Definition
• The goal of online failure prediction is to identify failure-prone situations, i.e., situations that will probably evolve into a failure. The evaluation is based on runtime monitoring data.
• The output of online failure prediction can either be
  - a decision that a failure is imminent or not, or
  - some continuous measure evaluating how failure-prone the current situation is
Two Types of Input Data
• There are two types of system measurements
  - periodic, numerical
  - event-based, categorical
• Examples for periodic data:
  - system / CPU load
  - memory usage
• Examples for event-based data:
  - interrupts
  - threshold violations
  - error events
[Figure: a periodic measurement F(t) over time, and an event sequence C A B C within a data window up to the present time, from which a failure prediction is made]
Contents
• Introduction
• Variable Selection
• Online Failure Prediction Overview
• Four Online Failure Prediction Techniques
• Assessing Failure Predictors
• Taking Action
• Summary
Prediction Techniques: Examples
1. Universal Basis Functions (UBF)
2. Hidden Semi-Markov Model (HSMM)
3. Dispersion-Frame Technique (DFT)
4. Eventset method
Universal Basis Functions (UBF)
• Tailored to periodic measurements
• Function approximation approach: express the target value as a function of input variables
• Examples for target values:
  - Availability
  - Memory consumption
[Figure: the UBF model approximates the real, unknown failure-indicator function of the running system from measurements; the predicted time of failure occurrence is where the approximation crosses a threshold, relative to the present time (time of prediction)]
UBF Background
• Starting from radial basis functions: a linear combination of kernel functions G_i

f(x) = Σ_{i=1}^{n} c_i · G_i(x)

• Replace the fixed Gaussian by a flexible, domain-specific kernel:

G_i = ω_i · Φ_1 + (1 − ω_i) · Φ_2

• Determine the parameters by minimizing, e.g., the mean square error:

min H(f) = Σ_{i=1}^{N} (f(x_i) − y_i)²
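A minimal sketch of fitting such a kernel expansion by least squares (fixed Gaussian kernels only; the mixed, data-specific kernels and the learned ω of actual UBF are not shown, and all data here are illustrative assumptions):

```python
import numpy as np

# Fit f(x) = sum_i c_i * G_i(x) with Gaussian kernels G_i at fixed centers,
# by solving the least-squares problem min_c ||G c - y||^2.
def fit_rbf(x, y, centers, width=1.0):
    G = np.exp(-((x[:, None] - centers[None, :]) ** 2) / (2 * width**2))
    c, *_ = np.linalg.lstsq(G, y, rcond=None)
    return c

def predict_rbf(x, centers, c, width=1.0):
    G = np.exp(-((x[:, None] - centers[None, :]) ** 2) / (2 * width**2))
    return G @ c

x = np.linspace(0, 10, 200)
y = np.sin(x)                      # stand-in for a measured target value
centers = np.linspace(0, 10, 15)
c = fit_rbf(x, y, centers)
print(np.max(np.abs(predict_rbf(x, centers, c) - y)))  # small residual
```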
Effect of ω (UBF Slider)
• Linear combination of nonlinear kernel functions
• Examples: Gaussian, sigmoid functions, …
• RBF is a special case
• Improve efficacy by introducing data-specific flexible kernels
• Universal approximator
• A large number of kernels is needed to cover heavy-tailed distributions
[Figure: 3D surface of kernel activation (0.0–1.0) over x and the UBF slider ω]
Dispersion Frame Technique
• Classic technique for error-log analysis
• Evaluates the time of error occurrence
• Applies a set of heuristic rules evaluating the number of errors within successive dispersion frames
EDI: Error Dispersion Index; DF: Dispersion Frame
[Figure: a timeline of error occurrences i−4 … i with dispersion frames DF 1 and DF 2 and EDI values of 2 and 3]
[Lin, Siewiorek 1990]
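A toy version of the dispersion-frame idea (the frame definition and the warning rule below are simplified assumptions for illustration, not the exact rule set of Lin and Siewiorek):

```python
# Toy dispersion-frame check: use the mean inter-error time observed so
# far as the frame, count how many errors fall into one frame length
# before the current error (a crude Error Dispersion Index), and warn
# when errors cluster, i.e., the count reaches min_edi.
def dft_warning(times, min_edi=3):
    warnings = []
    for i in range(2, len(times)):
        frame = (times[i - 1] - times[0]) / (i - 1)  # mean inter-error time
        edi = sum(1 for t in times if times[i] - frame <= t <= times[i])
        if edi >= min_edi:
            warnings.append(times[i])
    return warnings

# Errors that accelerate toward the end trigger a warning;
# evenly spaced errors do not.
print(dft_warning([0, 100, 200, 210, 215]))  # [215]
print(dft_warning([0, 100, 200, 300, 400]))  # []
```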
Eventset Method
• Approach inspired by data mining
• Focus on the type of events
• Based on sets of events
  - Each set contains decisive events that occur prior to a target event
  - Events correspond to errors in our taxonomy
  - Target events correspond to failures
  - Eventsets do not keep timing information
• Result: a rule-based failure prediction system containing a database of indicative eventsets
[Vilalta, Ma 2002]
Eventset Method
[Figure: events observed in failure windows 1 and 2 (e.g., A and B before each failure F) and in a non-failure window yield an initial eventset database {A, B, C, AB, AC, BC, ABC}; keeping only frequent, accurate, validated, and specific eventsets leaves the final eventset database, e.g., {AB}]
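The mining step can be sketched as follows (the thresholds are illustrative assumptions, and the "specific" pruning step of the original method is omitted):

```python
from itertools import combinations

# Sketch of eventset mining: enumerate event subsets seen in windows
# preceding failures, keep those frequent among failure windows and
# rare in non-failure windows.
def mine_eventsets(failure_windows, nonfailure_windows,
                   min_support=0.5, max_fp=0.0, max_size=3):
    candidates = set()
    for w in failure_windows:
        for k in range(1, min(len(w), max_size) + 1):
            candidates.update(map(frozenset, combinations(sorted(w), k)))
    keep = set()
    for s in candidates:
        support = sum(s <= set(w) for w in failure_windows) / len(failure_windows)
        fp_rate = sum(s <= set(w) for w in nonfailure_windows) / len(nonfailure_windows)
        if support >= min_support and fp_rate <= max_fp:
            keep.add(s)
    return keep

failures = [{"A", "B"}, {"A", "B", "C"}]   # events before each failure
ok = [{"C"}, {"A"}, {"B", "C"}]            # events in non-failure windows
kept = mine_eventsets(failures, ok)
print(sorted("".join(sorted(s)) for s in kept))  # ['AB', 'ABC', 'AC']
```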
Hidden Semi-Markov Model Prediction
• Components of complex systems depend on each other
• Dependencies lead to error patterns
[Figure: errors C, B, A, D, E occurring on components C1–C4 over time]
• Fault-tolerant systems:
  - Failures occur only under certain conditions
  - Failure-prone conditions can be identified by specific error patterns
⇒ Use pattern recognition to identify symptomatic situations
[Salfner, Malek 2007; Salfner 2008]
Approach
• Standard tool for pattern recognition: Hidden Markov Models
• Identify symptomatic patterns
  - Algorithmically
  - From recorded training data ⇒ machine learning
• Additional assumption:
  - The time between events is decisive (temporal sequence analysis)
  - Standard Hidden Markov Models need to be extended
⇒ Development of a Hidden Semi-Markov Model (HSMM)
• The approach incorporates both type and time of occurrence of error events within a data window up to the present time
Hidden Semi-Markov Models
[Figure: states 1 … N with emission probabilities b_i(A), b_i(B), b_i(C) and transition distributions g_ij(t)]
• Discrete-Time Markov Chains (DTMC) consist of states (1…N) and transition probabilities p_ij between states
• In Hidden Markov Models (HMM), each state can generate a symbol A, B, C according to a probability distribution b_i(o_k)
• Hidden semi-Markov models (HSMMs) replace the transition probabilities p_ij by time-continuous cumulative probability distributions g_ij(t)
Machine Learning: Two Steps
1. Training: fit model parameters to training data — one HSMM for failure sequences and one HSMM for non-failure sequences
2. Prediction: process runtime measurements — compute the sequence likelihood under both models; classification of the more likely model yields the failure prediction
[Figure: failure and non-failure training sequences of error symbols A, B, C (and failures F) over time; at runtime, a sequence's likelihood under the failure HSMM and the non-failure HSMM is compared for classification]
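The two-model classification idea can be illustrated with plain discrete Markov chains (a deliberate simplification: a real HSMM additionally models hidden states and continuous inter-event times; all sequences here are synthetic assumptions):

```python
import math
from collections import defaultdict

# Train one discrete Markov chain on failure-preceding error sequences and
# one on non-failure sequences; classify a new sequence by comparing its
# log-likelihood under both models (with Laplace smoothing).
def train_chain(sequences, alphabet, alpha=1.0):
    counts = defaultdict(lambda: defaultdict(float))
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1
    probs = {}
    for a in alphabet:
        total = sum(counts[a].values()) + alpha * len(alphabet)
        probs[a] = {b: (counts[a][b] + alpha) / total for b in alphabet}
    return probs

def log_likelihood(chain, seq):
    return sum(math.log(chain[a][b]) for a, b in zip(seq, seq[1:]))

alphabet = "ABC"
fail_chain = train_chain(["ABAB", "ABAB", "ABA"], alphabet)  # failure-prone
ok_chain = train_chain(["ACCC", "CCAC", "CCC"], alphabet)    # non-failure

def predict_failure(seq):
    return log_likelihood(fail_chain, seq) > log_likelihood(ok_chain, seq)

print(predict_failure("ABAB"), predict_failure("CCCC"))  # True False
```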
Contents
• Introduction
• Variable Selection
• Online Failure Prediction Overview
• Four Online Failure Prediction Techniques
• Assessing Failure Predictors
• Taking Action
• Summary
Precision, Recall and Other Metrics
Contingency table:

                  True failure          True success           Sum
Failure alarm     Correct alarm (TP)    False alarm (FP)       # Alarms
No warning        Missing alarm (FN)    Correct no-alarm (TN)  # No-alarms
Sum               # Failures            # Successes            # Total

• Precision: fraction of correct alarms: precision = TP / (TP + FP)
• Recall: fraction of predicted failures: recall = TP / (TP + FN)
• False positive rate: fpr = FP / (FP + TN)
• True positive rate is equal to recall
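These metrics follow directly from the contingency table (the counts below are illustrative):

```python
# Prediction-quality metrics from contingency-table counts: precision,
# recall (= true positive rate), false positive rate, and the F-measure
# as the harmonic mean of precision and recall.
def metrics(tp, fp, fn, tn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)              # true positive rate
    fpr = fp / (fp + tn)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, fpr, f_measure

p, r, fpr, f = metrics(tp=40, fp=10, fn=20, tn=990)
print(p, r, fpr, f)  # 0.8, ~0.667, 0.01, ~0.727
```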
Receiver Operating Characteristics (ROC)
• Plot the true positive rate (recall) over the false positive rate for various thresholds
• Threshold ∞: tpr and fpr equal to zero
• Threshold −∞: tpr and fpr equal to one
[Figure: ROC curve for the HSMM predictor; a perfect predictor reaches the top-left corner (tpr 1, fpr 0), a random predictor follows the diagonal, an average predictor lies in between]
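A ROC curve can be traced by sweeping the alarm threshold over the predictor's scores (scores and labels below are synthetic assumptions):

```python
import numpy as np

# For each threshold, an alarm is raised when the score reaches it;
# tpr and fpr are computed against the true labels. AUC is the area
# under the resulting curve (trapezoidal rule).
def roc_curve(scores, labels):
    thresholds = np.sort(np.unique(scores))[::-1]
    tprs, fprs = [0.0], [0.0]          # threshold = infinity
    pos, neg = np.sum(labels == 1), np.sum(labels == 0)
    for th in thresholds:
        alarm = scores >= th
        tprs.append(np.sum(alarm & (labels == 1)) / pos)
        fprs.append(np.sum(alarm & (labels == 0)) / neg)
    return np.array(fprs), np.array(tprs)

def auc(fprs, tprs):
    return float(np.sum(np.diff(fprs) * (tprs[1:] + tprs[:-1]) / 2))

scores = np.array([0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1])
labels = np.array([1, 1, 1, 0, 1, 0, 0, 0])
fprs, tprs = roc_curve(scores, labels)
print(auc(fprs, tprs))  # 0.9375
```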
Precision-Recall Plots
• Plot precision over recall for various thresholds
• Threshold ∞: precision equal to one, recall equal to zero
• Threshold −∞: precision equal to the fraction of positive examples among all examples, recall equal to one
[Figure: precision-recall plot for the HSMM predictor, with curves A, B and C]
Scalar Metrics
• ROC, precision/recall diagrams, etc. are graphs
• Good for visual inspection, bad for algorithmic decisions
• Goal: obtain one real number to evaluate the „quality“ of a predictor
• Examples:
  - Precision-recall break-even: value at which precision and recall cross
  - Area under curve (AUC): area under the ROC curve
  - F-measure: harmonic mean of precision and recall: F = 2 · precision · recall / (precision + recall)
Contents
• Introduction
• Variable Selection
• Online Failure Prediction Overview
• Four Online Failure Prediction Techniques
• Assessing Failure Predictors
• Taking Action
• Summary
Taking Action
• Failure prediction is only the first step in managing faults proactively
• After a potential failure has been predicted, an action must be taken in order to
  - Avoid the failure (prevent it from occurring)
  - Prepare the system such that TTR can be reduced
  - (See lecture seven)
• In general, the following steps have to be performed:
Failure Prediction → Diagnosis → Scheduling of an Action → Executing the Action
Taxonomy of Reaction Methods
Actions triggered by failure prediction:
• Downtime avoidance: state clean-up, preventive failover, load lowering
• Downtime minimization: prepared repair, preventive restart
Effects of Proactive Methods
• Downtime avoidance improves MTTF
• Downtime minimization reduces MTTR:
[Figure: downtime timelines without and with failure prediction, showing a shorter time-to-repair (TTR) when repair is prepared before the failure occurs]
• However, in case of frequent false positive and false negative predictions, proactive fault management can also reduce availability!
Contents
• Introduction
• Variable Selection
• Online Failure Prediction Overview
• Four Online Failure Prediction Techniques
• Assessing Failure Predictors
• Taking Action
• Summary
Summary
• Dynamics and complexity of today‘s systems require adaptive and proactive mechanisms to handle faults
• Proactive Fault Management (PFM) builds on
  - Continuously observing the system
  - Predicting whether a failure is coming up
  - In case of an upcoming failure:
    o Analyze the fault that causes the upcoming failure
    o Decide what to do: either try to avoid the failure or prepare repair mechanisms for the upcoming failure
• Since online failure predictors build on monitoring data, the best set of variables needs to be identified: variable selection
• Analyses of PFM suggest that PFM has the potential to enhance system availability by up to an order of magnitude
Thanks!
Questions?
Prediction in Computer Science
• Scheduling
• Branch prediction
• Memory management
• Performance evaluation
• Reliability prediction
• …
• Failure prediction
A Long Way to Go
• Daimler predicted in 1900 that Europe’s production of cars would not exceed 1,000 a year because it would not be possible to train enough chauffeurs
• In 1900, the General Post Office of the United Kingdom predicted that there would be a demand for about ONE phone for each city with a population of more than one hundred thousand
• "There is not the slightest indication that nuclear energy will ever be obtainable." (Albert Einstein, 1932)
• "There is no reason for any individual to have a computer in their home." (Ken Olsen, 1977)
Predicting the Future
• Predicting the future has fascinated people from the beginning of times
• Several million people work on prediction daily
• Astrologists, meteorologists, politicians, pollsters, stock analysts, doctors, …, and many computer scientists/engineers
© BBC 1999
Contents
• Runtime Monitoring
• Error Logs
• Variable Selection
• Online Failure Prediction Taxonomy
• Online Failure Prediction Techniques
• A Case Study
• Taking Action
• Summary
Runtime (Online) Monitoring
• Runtime monitoring is a continuous observation of system variables for a given purpose such as diagnosis or failure prediction
• Fundamental questions:
  - Which variables to monitor?
  - How frequently (sampling rate)?
  - At what level should we monitor?
  - How will performance be affected?
  - What storage will be needed?
  - When and how to process the monitoring data?
• Remember: monitoring is not free and never complete
Monitoring – At What Level?
[Figure: monitoring rings, from hardware through software and services up to the business process/enterprise level]
Types of Data Sources
• Error logs (logfiles)
  - Lack of uniformity
  - Standards are just emerging
  - Redundancy (in some cases a problem is reported 60,000 times)
• System Activity Reporter (SAR) data
  - Up to 4200 parameters can be monitored in real systems
  - Up to five can usually be measured and processed in real time for 1-5 minute predictions
Contents
• Runtime Monitoring
• Error Logs
• Variable Selection
• Online Failure Prediction Taxonomy
• Online Failure Prediction Techniques
• A Case Study
• Taking Action
• Summary
What Is the Use of Logfiles Today?
• Development
  o Programming errors
• Operation and maintenance
  o Root-cause diagnosis of system failures
  o Configuration errors
• Research
  o Source data for analysis
Requirements Evolution
• Target
  - Today: human readable
  - Tomorrow: both human and machine readable
• Domain knowledge
  - Today: domain-specific, implicit domain knowledge, e.g., selected thresholds
  - Tomorrow: comprehensive description
• Standardization
  - Today: proprietary formats, homogeneous environment
  - Tomorrow: standards for heterogeneous environments, universal tools, analysis server
Typical Problems with Error Logs
• Timestamp: no standard format and interpretation
• Unknown number representation (decimal or hexadecimal?)
• No token identifier / separator
• No machine-readable format specification for the entire log
• Repetitive patterns (multiple reporting of problems)
Error Log Example
2004/02/09-19:26:13.634089-29836-00010-LIB_ABC ANOPTK#0243546463464346|0555553456|00000000000000-4456547457434-2.3.1|356546346|0001
2004/02/09-19:26:13.634089-29836-00010-LIB_ABC src=APPLICATION sev=SEVERITY_MINOR
2004/02/09-19:26:13.634089-29836-00010-LIB_ABC unknown value specified in Context 000256
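A parser for lines of this shape could look as follows (the field meanings, e.g. that the second number is a process id and the third a message code, are assumptions for illustration):

```python
import re

# Split a log line into timestamp, assumed process id and code, component
# name, and the free-form remainder of the message.
LINE = re.compile(
    r"(?P<ts>\d{4}/\d{2}/\d{2}-\d{2}:\d{2}:\d{2}\.\d+)-"
    r"(?P<pid>\d+)-(?P<code>\d+)-(?P<component>\w+)\s+(?P<msg>.*)"
)

m = LINE.match(
    "2004/02/09-19:26:13.634089-29836-00010-LIB_ABC "
    "src=APPLICATION sev=SEVERITY_MINOR"
)
print(m.group("component"), "|", m.group("msg"))
```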
WSDM Event Format (WEF)
• Log standard by Web Services Distributed Management (WSDM), approved by OASIS
[Figure: WEF event structure, © IBM]
Impact of Monitoring
[Figure: increase in response time (%) over monitoring traffic (100 bytes/s)]
Contents
• Runtime Monitoring
• Error Logs
• Variable Selection
• Online Failure Prediction Taxonomy
• Online Failure Prediction Techniques
• A Case Study
• Taking Action
• Summary
Universal Basis Functions (UBF)
• Tailored to periodic measurements, e.g.,
  - Semaphore operations per second or minute (sema/s)
  - Allocated OS-kernel memory (alloc)
• Function approximation approach: express the target value as a function of input variables
• Examples for target values:
  - Availability
  - Memory consumption
[Figure: time series F(t), sema/s and alloc over time t]
[Hoffmann, Malek 2006; Hoffmann 2005]
Case Study
• Commercial telecommunication platform
• Platform implements service control functions
  - Examples: billing, SMS, pre-paid services
• 400-10,000 service requests per minute
• Distributed and component-based system
  - 1.5+ million lines of code
  - 2000+ classes and 200+ components
  - Two nodes (up to eight)
Definition of Failures
[Figure: measured availability over time (minutes), varying between about 0.9990 and 0.9998]
Availability is measured per interval as

A = (# calls with response time ≤ 250 ms) / (total # of calls)
Experimental Setup
[Figure: a stress generator issues calls to the telco system and records responses; monitoring yields SAR data, log file data, and a failure log]
• SAR (system activity reporter) data: periodic; 96 variables; approx. 13 million observations
• Log file data: event-driven; 390 variables; approx. 4 million observations
Results for UBF
• Plotting recall over false positive rate yields the Receiver Operating Characteristic (ROC) curve
• We use the Area Under the ROC Curve (AUC) for comparison
• A perfect predictor results in AUC = 1.0
• Results: mean AUC values for 0, 5, 10, 15 minute predictions into the future
• Comparison of UBF with Maximum Likelihood (ML), Radial Basis Functions (RBF) and non-linearities (NL)
[Figure: mean AUC (0.5–1.0) over lead time (0, 5, 10, 15 min) for UBF, RBF, ML, UBF-NL and RBF-NL]
Comparison of Techniques
[Figure: precision, recall, F-measure and false positive rate (scale 0.0–0.7) for the DFT, Eventset and HSMM predictors]
Contents
• Runtime Monitoring
• Error Logs
• Variable Selection
• Online Failure Prediction Taxonomy
• Online Failure Prediction Techniques
• A Case Study
• Taking Action
• Summary
Accumulated Runtime Cost
• Assign cost to:
  - Correct no-alarms: correctly predicted that no failure is imminent, no action performed; cost of 1 cost unit
  - Correct alarms: a failure was predicted and a proactive action has been carried out; cost of 10 cost units
  - False alarms: preventive actions are performed in vain; cost of 20 cost units
  - Missing alarms: a true failure is not predicted (worst case); cost of 100 cost units
• Based on log data, we plot accumulated runtime cost
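The accumulation itself is straightforward (using the cost units from the slide; the outcome sequence is an illustrative assumption):

```python
# Accumulate runtime cost over a sequence of prediction outcomes,
# using the per-outcome cost units from the slide.
COST = {"TN": 1, "TP": 10, "FP": 20, "FN": 100}

def accumulated_cost(outcomes):
    total, curve = 0, []
    for o in outcomes:
        total += COST[o]
        curve.append(total)   # running total, as plotted over time
    return curve

# A single missed failure (FN) dominates the accumulated cost.
print(accumulated_cost(["TN", "TN", "TP", "FP", "TN", "FN"]))
# [1, 2, 12, 32, 33, 133]
```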
Resulting Cumulative Cost
[Figure: accumulated cost over time for the HSMM predictor, between a hypothetical predictor making only FP and FN predictions (upper curve) and one making only TP and TN predictions (lower curve)]
Research Issues
• Runtime monitoring
  - Overhead vs. completeness / usefulness
  - Raw data vs. extracting information
• Root cause analysis / diagnosis
  - What methods from traditional diagnosis can be used proactively?
• Prediction-driven actions
  - What recovery methods use failure prediction most effectively?
• Decision strategies
  - Models and methods to decide upon or schedule actions
• Accuracy
  - Accuracy of predictive diagnosis and prediction techniques
  - Success probabilities, performance impact of recovery and preventive maintenance actions
• Analysis
  - How sensitive is PFM to system changes?
  - What methods from control theory can be applied?
  - Models for Proactive Fault Management
  - PFM economics
Key Choices for Effective PFM
1) Monitoring: what variables, when, how and at what level
2) Failure prediction: model, method and effectiveness measures
3) Failure avoidance or recovery: when, how and at what level
4) Closing the loop: learn, refine, tune and apply again