47
INTEGRATED DIAGNOSTICS OF ROOT CAUSE FAULTS IN COMPLEX NETWORK AND IT SYSTEMS. Dr. Yuri Rabover Co-founder, Director Product Strategy VMTurbo [email protected] Member of the SMARTS founding team We are looking for summer interns! Please contact Yuri

INTEGRATED DIAGNOSTICS OF ROOT CAUSE FAULTS IN …web.stanford.edu/class/ee392m/Lecture2Rabover.pdf · 2009. 4. 8. · diagnose? |Q. Which symptoms are manifested by those problems

  • Upload
    others

  • View
    2

  • Download
    0

Embed Size (px)

Citation preview

Page 1: INTEGRATED DIAGNOSTICS OF ROOT CAUSE FAULTS IN …web.stanford.edu/class/ee392m/Lecture2Rabover.pdf · 2009. 4. 8. · diagnose? |Q. Which symptoms are manifested by those problems

INTEGRATED DIAGNOSTICS OF ROOT CAUSE FAULTS IN COMPLEX NETWORK AND IT SYSTEMS.Dr. Yuri RaboverCo-founder, Director Product [email protected] of the SMARTS founding team

We are looking forsummer interns!Please contact Yuri

Page 2: INTEGRATED DIAGNOSTICS OF ROOT CAUSE FAULTS IN …web.stanford.edu/class/ee392m/Lecture2Rabover.pdf · 2009. 4. 8. · diagnose? |Q. Which symptoms are manifested by those problems

WHAT IS IT ALL ABOUT?

IN SEARCH OF ROOT CAUSE….

Page 3: INTEGRATED DIAGNOSTICS OF ROOT CAUSE FAULTS IN …web.stanford.edu/class/ee392m/Lecture2Rabover.pdf · 2009. 4. 8. · diagnose? |Q. Which symptoms are manifested by those problems

BACKGROUNDSmarts: System Management ARTS

Founded in 1993 as a network management R&D labPatented the leading correlation technology

Built a software platform and solutions to leverage it

Thousands of customers throughout the worldNetworksSystems Applications

Bought by EMC2 in 2005 for $275 mln.Yuri joined SMARTS in 1993.

Page 4: INTEGRATED DIAGNOSTICS OF ROOT CAUSE FAULTS IN …web.stanford.edu/class/ee392m/Lecture2Rabover.pdf · 2009. 4. 8. · diagnose? |Q. Which symptoms are manifested by those problems

- 4

Port 0

Port 1

Port 0

Port 1

Port 0

Port 1

Port 0

Port 1

Port 1

Port 0

Port 0

Port 1

Port 0

Port 1

Port 0

Port 1

L 0

L1

L 0

L1

L 0

L1

L 0

L1

L 0

L1

L 0

L1

L 0

L1

L 0

L1

L 0

L1

L 0

L1

L 0

L1

L 0

L1

L 0

L1

L 0

L1

L 0

L1

L 0

L1

PROBLEM SCENARIO

Card 0

Card 1

Card 0

Card 1

Card 0

Card 1

Card 0

Card 1

Switch 0

Switch 2

Switch 1

Switch 3

PROBLEM

Page 5: INTEGRATED DIAGNOSTICS OF ROOT CAUSE FAULTS IN …web.stanford.edu/class/ee392m/Lecture2Rabover.pdf · 2009. 4. 8. · diagnose? |Q. Which symptoms are manifested by those problems

- 5

Port 0

Port 1

Port 0

Port 1

Port 0

Port 1

Port 0

Port 1

Port 1

Port 0

Port 0

Port 1

Port 0

Port 1

Port 0

Port 1

L 0

L1

L 0

L1

L 0

L1

L 0

L1

L 0

L1

L 0

L1

L 0

L1

L 0

L1

L 0

L1

L 0

L1

L 0

L1

L 0

L1

L 0

L1

L 0

L1

L 0

L1

L 0

L1

Card 0

Card 1

Card 0

Card 1

Card 0

Card 1

Card 0

Card 1

Switch 0

Switch 2

Switch 1

Switch 3

PROBLEMSYMPTOMS

PROBLEM SCENARIO

Page 6: INTEGRATED DIAGNOSTICS OF ROOT CAUSE FAULTS IN …web.stanford.edu/class/ee392m/Lecture2Rabover.pdf · 2009. 4. 8. · diagnose? |Q. Which symptoms are manifested by those problems

- 6

Port 0

Port 1

Port 0

Port 1

Port 0

Port 1

Port 0

Port 1

Port 1

Port 0

Port 0

Port 1

Port 0

Port 1

Port 0

Port 1

L 0

L1

L 0

L1

L 0

L1

L 0

L1

L 0

L1

L 0

L1

L 0

L1

L 0

L1

L 0

L1

L 0

L1

L 0

L1

L 0

L1

L 0

L1

L 0

L1

L 0

L1

L 0

L1

Card 0

Card 1

Card 0

Card 1

Card 0

Card 1

Card 0

Card 1

Switch 0

Switch 2

Switch 1

Switch 3

PROBLEMSYMPTOMS

PROBLEM SCENARIO

Page 7: INTEGRATED DIAGNOSTICS OF ROOT CAUSE FAULTS IN …web.stanford.edu/class/ee392m/Lecture2Rabover.pdf · 2009. 4. 8. · diagnose? |Q. Which symptoms are manifested by those problems

- 7

Port 0

Port 1

Port 0

Port 1

Port 0

Port 1

Port 0

Port 1

Port 1

Port 0

Port 0

Port 1

Port 0

Port 1

Port 0

Port 1

L 0

L1

L 0

L1

L 0

L1

L 0

L1

L 0

L1

L 0

L1

L 0

L1

L 0

L1

L 0

L1

L 0

L1

L 0

L1

L 0

L1

L 0

L1

L 0

L1

L 0

L1

L 0

L1

Card 0

Card 1

Card 0

Card 1

Card 0

Card 1

Card 0

Card 1

Switch 0

Switch 2

Switch 1

Switch 3UNIQUE SIGNATURE

PROBLEMSYMPTOMS

PROBLEM SCENARIO

Page 8: INTEGRATED DIAGNOSTICS OF ROOT CAUSE FAULTS IN …web.stanford.edu/class/ee392m/Lecture2Rabover.pdf · 2009. 4. 8. · diagnose? |Q. Which symptoms are manifested by those problems

RCA CHALLENGESProblems can start in any logical or physical object in the network, attached systems, or applications.A single problem often causes many symptoms in many related objects.The absence of particular symptoms is as meaningful as the presence of particular symptoms.Different problems can cause many overlapping symptoms. For example, the operationally down status of logical port 0 over physical port 1 in Card 0 of Switch 1 could be caused by a failure in any one of the following components:

Switch 1Switch 1 Card 0Switch 1 Card 0 Port 1Switch 1 Card 0 Port 1 Logical Port 1

Switch 2Switch 2 Card 0Switch 2 Card 0 Port 1Switch 2 Card 0 Port 1 Logical Port 1

The trunk connecting Port 1 in Card 0 in Switch 1 with Port 1 in Card 0 in Switch 2.

Page 9: INTEGRATED DIAGNOSTICS OF ROOT CAUSE FAULTS IN …web.stanford.edu/class/ee392m/Lecture2Rabover.pdf · 2009. 4. 8. · diagnose? |Q. Which symptoms are manifested by those problems

RCA CHALLENGES – CONT.Symptoms can propagate across related components. In our example, a symptom propagated from a card to its ports and from a port to a peer port. This makes it necessary to examine all the symptoms across related elements in order to identify the root cause.Problems are not always observable in the object where they originate. For example, if the switch does not generate card failure traps, the card failure would have no direct (local) symptoms.Problems can occur in objects that do not generate any observable events. For example, there is no event and/or trap associated with a trunk.

Page 10: INTEGRATED DIAGNOSTICS OF ROOT CAUSE FAULTS IN …web.stanford.edu/class/ee392m/Lecture2Rabover.pdf · 2009. 4. 8. · diagnose? |Q. Which symptoms are manifested by those problems

CONVENTIONAL APPROACH WRITE RULES

Page 11: INTEGRATED DIAGNOSTICS OF ROOT CAUSE FAULTS IN …web.stanford.edu/class/ee392m/Lecture2Rabover.pdf · 2009. 4. 8. · diagnose? |Q. Which symptoms are manifested by those problems

CONVENTIONAL APPROACHDOWNSTREAM SUPPRESSION

PSTN

Page 12: INTEGRATED DIAGNOSTICS OF ROOT CAUSE FAULTS IN …web.stanford.edu/class/ee392m/Lecture2Rabover.pdf · 2009. 4. 8. · diagnose? |Q. Which symptoms are manifested by those problems

difficult to separate symptoms from root-cause problemsmanual problem diagnosisrules-based “models”change outpaces rules writingsignificant expertise and time

silo management—no correlation across technology silosmultiple groups chasing the same problemneed the whole view to isolate faults

“Sea of red”

Rules-Based Management puts the burden of analysison IT Operations BUDGETS

CONVENTIONAL APPROACHES

Page 13: INTEGRATED DIAGNOSTICS OF ROOT CAUSE FAULTS IN …web.stanford.edu/class/ee392m/Lecture2Rabover.pdf · 2009. 4. 8. · diagnose? |Q. Which symptoms are manifested by those problems

WHAT’S MISSING?

Data & Event Collection

Routers Storage

Applications Servers IDSFirewallsDatabasesSwitches Telephony Video Optical

Display

No built-in analysis — ongoing rules-writing required

Modeling Analysis

Page 14: INTEGRATED DIAGNOSTICS OF ROOT CAUSE FAULTS IN …web.stanford.edu/class/ee392m/Lecture2Rabover.pdf · 2009. 4. 8. · diagnose? |Q. Which symptoms are manifested by those problems

14

MODELING VS ROOT CAUSE ANALYSIS

Modeling uses a top-down approach:Starting from the problems to be diagnosed

Problem: the battery is deadMoving to the symptoms caused by the problem

Symptom: the car does not start

Root cause analysis uses abductionStarting from the symptoms observed

Symptom: the car does not startInfer the root cause

Problem: the battery is dead

«problem»Battery.Dead

«symptom»Car.DoesNotStart

«causes»

Modeling

Analysis

Page 15: INTEGRATED DIAGNOSTICS OF ROOT CAUSE FAULTS IN …web.stanford.edu/class/ee392m/Lecture2Rabover.pdf · 2009. 4. 8. · diagnose? |Q. Which symptoms are manifested by those problems

15

ROOT CAUSE ANALYSIS – PROBLEM DEFINITIONWikipedia:

Root cause analysis (RCA) is a class of problem solving methods aimed at identifying the root causes of problems or events.

Smarts: Given a set of symptoms, find the problem, or the set of problems, that best explains those symptoms.

Smarts:

Given a set of symptoms, find the set of problems that best explains those

symptoms.

Page 16: INTEGRATED DIAGNOSTICS OF ROOT CAUSE FAULTS IN …web.stanford.edu/class/ee392m/Lecture2Rabover.pdf · 2009. 4. 8. · diagnose? |Q. Which symptoms are manifested by those problems

Causal Network Model of a Technology Object

Causation C – relation from P to S associating individual problems and symptoms.

P

S

d1 d2 d3 d4 d5

m1 m2 m3 m4 m5

Page 17: INTEGRATED DIAGNOSTICS OF ROOT CAUSE FAULTS IN …web.stanford.edu/class/ee392m/Lecture2Rabover.pdf · 2009. 4. 8. · diagnose? |Q. Which symptoms are manifested by those problems

17

BOTTOM-UP APPROACH TO ROOT-CAUSE ANALYSIS

Q. What are the symptoms available to me?Q. What can I make out of these symptoms?

Symptom A1

Symptom X6

Symptom

N11

Sympto

m W75Symptom S52 Sym

ptom Z3

Symptom K4

Symptom

M33

Sympto

m A4Symptom V6

Symptom K19Symptom C42

Symptom F12Symptom Q31

Symptom H17

Symptom T9

Sym

ptom

D1

Symptom A7

?

Page 18: INTEGRATED DIAGNOSTICS OF ROOT CAUSE FAULTS IN …web.stanford.edu/class/ee392m/Lecture2Rabover.pdf · 2009. 4. 8. · diagnose? |Q. Which symptoms are manifested by those problems

18

TOP-DOWN APPROACH TO ROOT-CAUSE ANALYSIS

Q. What problems do I want to be able to diagnose?Q. Which symptoms are manifested by those problems

And what is the minimum set of symptoms needed to uniquely diagnose each problem?

Problem D1

Problem D2Problem D3

Symptom A1

Symptom X6

Symptom

N11Sym

ptom W

75Symptom S52 Symptom

Z3

Symptom K4

Symptom

M33

Sympto

m A4Symptom V6

Symptom K19Symptom C42

Symptom F12Symptom Q31

Symptom H17

Symptom T9

Sym

ptom

D1

Symptom A7

Page 19: INTEGRATED DIAGNOSTICS OF ROOT CAUSE FAULTS IN …web.stanford.edu/class/ee392m/Lecture2Rabover.pdf · 2009. 4. 8. · diagnose? |Q. Which symptoms are manifested by those problems

19

TOP-DOWN APPROACH TO ROOT-CAUSE ANALYSISQ. What problems do I want to be able to diagnose?Q. Which symptoms are manifested by those problems

And what is the minimum set of symptoms needed to uniquely diagnose each problem?

Problem D1

Problem D2Problem D3

Symptom S52 Symptom

Z3

Sympto

m A4

Symptom Q31

Symptom H17

Sym

ptom

D1

Symptom A7

Page 20: INTEGRATED DIAGNOSTICS OF ROOT CAUSE FAULTS IN …web.stanford.edu/class/ee392m/Lecture2Rabover.pdf · 2009. 4. 8. · diagnose? |Q. Which symptoms are manifested by those problems

20

REMINDEREntia non sunt multiplicanda praeter necessitatem.The simplest explanation is usually the best.The simplest explanation Considers the minimum set of symptoms to support each possible explanation.The law of parsimony plays a major role in resource management.Don’t collect all available symptoms, only those relevant to the problems diagnosed.

Page 21: INTEGRATED DIAGNOSTICS OF ROOT CAUSE FAULTS IN …web.stanford.edu/class/ee392m/Lecture2Rabover.pdf · 2009. 4. 8. · diagnose? |Q. Which symptoms are manifested by those problems

- 21

EMC SMARTS TECHNOLOGIES: BEHAVIOR MODELS

Service subscriber

Application

Host

Router

Service Offering

Library of Generic Objects

Switch

s0c0p010s0c0p010s0c0p010s0c0p010s0c0p010s0c0p010s0c0p010s0c0p010s0c0p010s0c0p010s0c0p010s0c0p010s0c0p010s0c0p010s0c0p010s0c0p010s0c0p010s0c0p010s0c0p010s0c0p010s0c0p010s0c0p010s0c0p010s0c0p010s0c0p010s0c0p010

s0 s1 s2 s3 s0c0

s0c0

s0c0

s0c0

s0c0

s0c0

s0c0

s0c0

s0c0

s0c0

s0c0

s0c0

s0c0

s0c0

s0c0

s0c0

s0c0

s0c0

s0c0

s0c0

s0c0

s0c0

s0c0

s0c0

s0c0p010s0c0p010s0c0p010s0c0p010s0c0p010s0c0p010s0c0p010s0c0p010s0c0p010s0c0p010s0c0p010s0c0p010s0c0p010s0c0p010s0c0p010s0c0p010s0c0p010s0c0p010s0c0p010s0c0p010s0c0p010s0c0p010s0c0p010s0c0p010s0c0p010s0c0p010

s0 s1 s2 s3 s0c0

s0c0

s0c0

s0c0

s0c0

s0c0

s0c0

s0c0

s0c0

s0c0

s0c0

s0c0

s0c0

s0c0

s0c0

s0c0

s0c0

s0c0

s0c0

s0c0

s0c0

s0c0

s0c0

s0c0

Codebook

1111

11 1

111

11

11111

1111

1111

1111

111

1111

1111

1

1

1

11

1111

1111

1

1

1

1

11

1 1 1 1

11

11

1

s0c0p010s0c0p010s0c0p010s0c0p010s0c0p010s0c0p010s0c0p010s0c0p010s0c0p010s0c0p010s0c0p010s0c0p010s0c0p010s0c0p010s0c0p010s0c0p010s0c0p010s0c0p010s0c0p010s0c0p010s0c0p010s0c0p010s0c0p010s0c0p010s0c0p010s0c0p010

s0 s1 s2 s3 s0c0

s0c0

s0c0

s0c0

s0c0

s0c0

s0c0

s0c0

s0c0

s0c0

s0c0

s0c0

s0c0

s0c0

s0c0

s0c0

s0c0

s0c0

s0c0

s0c0

s0c0

s0c0

s0c0

s0c0

s0c0p010s0c0p010s0c0p010s0c0p010s0c0p010s0c0p010s0c0p010s0c0p010s0c0p010s0c0p010s0c0p010s0c0p010s0c0p010s0c0p010s0c0p010s0c0p010s0c0p010s0c0p010s0c0p010s0c0p010s0c0p010s0c0p010s0c0p010s0c0p010s0c0p010s0c0p010

s0 s1 s2 s3 s0c0

s0c0

s0c0

s0c0

s0c0

s0c0

s0c0

s0c0

s0c0

s0c0

s0c0

s0c0

s0c0

s0c0

s0c0

s0c0

s0c0

s0c0

s0c0

s0c0

s0c0

s0c0

s0c0

s0c0

Codebook

1111

11 1

111

11

11111

1111

1111

1111

111

1111

1111

1

1

1

11

1111

1111

1

1

1

1

11

1 1 1 1

11

11

1

SYM

PTO

MS

PROBLEMS

Problem Signatures

UNIQUE SIGNATURE

Page 22: INTEGRATED DIAGNOSTICS OF ROOT CAUSE FAULTS IN …web.stanford.edu/class/ee392m/Lecture2Rabover.pdf · 2009. 4. 8. · diagnose? |Q. Which symptoms are manifested by those problems

- 22

EMC SMARTS TECHNOLOGIES: BEHAVIOR MODELSLogical Port

Problem Down Causes OperationallyDown,ConnectedPortDownLocal symptom OperationallyDownPropagated symptom ConnectedPortDown to Connected Logical Ports Down

Physical Port– Problem Down Causes PortDown

Propagated symptom PortDown To Logical Ports Layered over the Port are DownPropagated symptom ConnectedPortDown to Layered over Logical Ports

Card– Problem Down Causes PortDown

Propagated symptom PortDown To Physical Ports in the Card are DownPropagated symptom ConnectedPortDown to Physical Ports in the Card

Switch– Problem Down Causes ConnectedPortDown

Propagated symptom ConnectedPortDown to Cards in the Switch

Page 23: INTEGRATED DIAGNOSTICS OF ROOT CAUSE FAULTS IN …web.stanford.edu/class/ee392m/Lecture2Rabover.pdf · 2009. 4. 8. · diagnose? |Q. Which symptoms are manifested by those problems

- 23

EMC SMARTS TECHNOLOGIES: SWITCH 0 CARD 0 SYMPTOMS

Switch 0

Card 0

Card 1

PROBLEM

problemssymptoms

S 0

C 1

s0c0p0l0 s0c0p0l1s0c0p1l0 s0c0p1l1 s0c1p0l0s0c1p0l1s0c1p1l0s0c1p1l1s1c0p0l0s1c0p1l0s1c1p0l0s1c1p1l0s3c0p0l1 s3c0p1l1 s3c1p0l1 s3c1p1l1

Port 0

Port 1

Port 0

Port 1

Port 0

Port 1

Port 0

Port 1

Port 1

Port 0

Port 0

Port 1

Port 0

Port 1

Port 0

Port 1

L 0

L1

L 0

L1

L 0

L1

L 0

L1

L 0

L1

L 0

L1

L 0

L1

L 0

L1

L 0

L1

L 0

L1

L 0

L1

L 0

L1

L 0

L1

L 0

L1

L 0

L1

L 0

L1

Card 0

Card 1

Card 0

Card 1

Switch 1

Switch 3

S 0

0C

1111

1111

11

11

11

11

SIGNATURE

Page 24: INTEGRATED DIAGNOSTICS OF ROOT CAUSE FAULTS IN …web.stanford.edu/class/ee392m/Lecture2Rabover.pdf · 2009. 4. 8. · diagnose? |Q. Which symptoms are manifested by those problems

Module 1 -Introduction to

Sm

arts Technology

- 24

EMC SMARTS TECHNOLOGIES: SWITCH 0 CARD 0 SYMPTOMS

Switch 0

Card 0

Card 1

PROBLEM

problemssymptoms

S 0

C 1

s0c0p0l0 s0c0p0l1s0c0p1l0 s0c0p1l1 s0c1p0l0s0c1p0l1s0c1p1l0s0c1p1l1s2c0p0l0s2c0p1l0s2c1p0l0s2c1p1l0s3c0p0l1 s3c0p1l1 s3c1p0l1 s3c1p1l1

Port 0

Port 1

Port 0

Port 1

Port 0

Port 1

Port 0

Port 1

Port 1

Port 0

Port 0

Port 1

Port 0

Port 1

Port 0

Port 1

L 0

L1

L 0

L1

L 0

L1

L 0

L1

L 0

L1

L 0

L1

L 0

L1

L 0

L1

L 0

L1

L 0

L1

L 0

L1

L 0

L1

L 0

L1

L 0

L1

L 0

L1

L 0

L1

Card 0

Card 1

Card 0

Card 1

Switch 1

Switch 3

S 0

0C

1111

1111

11

11

11

11

S0 S0 C0 C1

Page 25: INTEGRATED DIAGNOSTICS OF ROOT CAUSE FAULTS IN …web.stanford.edu/class/ee392m/Lecture2Rabover.pdf · 2009. 4. 8. · diagnose? |Q. Which symptoms are manifested by those problems

- 25

Host

EMC SMARTS TECHNOLOGIES: CODEBOOK & PROBLEM SIGNATURES

Current Real Time Topology

Host::WebServer 1 Host::WebServer 2

HostedByHostedBy HostedBy

Hosted App::URL1 Hosted App::URL3

ServedBy

Service Offering::DNS

Service Offering::Secure Web

Subscribes

Service Subscriber:: John

Symptoms Observed in EnvironmentPROBLEMS

SYM

PTO

MS

s0 s1 s2 s3 s0c0

s0c1

s1c0

s1c1

s2c0

s2c1

s3c0

s3c1

s0c0p0

s0c0p1

s0c1p0

s0c1p1

s1c0p0

s1c0p1

s1c1p0

s1c1p1

s2c0p0

s2c0p1

s2c1p0

s2c1p1

s3c0p0

s3c0p1

s3c1p0

s3c1p1

s0c0p0l0 1 1 1 1 1s0c0p0l1 1 1 1 1 1s0c0p1l0 1 1 1 1 1s0c0p1l1 1 1 1 1 1s0c1p0l0 1 1 1 1 1s0c1p0l1 1 1 1 1 1s0c1p1l0 1 1 1 1 1s0c1p1l1 1 1 1 1 1s1c0p0l0 1 1 1 1 1s1c0p0l1 1 1 1 1 1s1c0p1l0 1 1 1 1 1s1c0p1l1 1 1 1 1 1s1c1p0l0 1 1 1 1 1s1c1p0l1 1 1 1 1 1s1c1p1l0 1 1 1 1 1s1c1p1l1 1 1 1 1 1s2c0p0l0 1 1 1 1 1s2c0p0l1 1 1 1 1 1s2c0p1l0 1 1 1 1 1s2c0p1l1 1 1 1 1 1s2c1p0l0 1 1 1 1 1s2c1p0l1 1 1 1 1 1s2c1p1l0 1 1 1 1 1s2c1p1l1 1 1 1 1 1s3c0p0l0 1 1 1 1 1s3c0p0l1 1 1 1 1 1s3c0p1l0 1 1 1 1 1s3c0p1l1 1 1 1 1 1s3c1p0l0 1 1 1 1 1s3c1p0l1 1 1 1 1 1s3c1p1l0 1 1 1 1 1s3c1p1l1 1 1 1 1 1

1

1

1

11

1

1

1

1

Page 26: INTEGRATED DIAGNOSTICS OF ROOT CAUSE FAULTS IN …web.stanford.edu/class/ee392m/Lecture2Rabover.pdf · 2009. 4. 8. · diagnose? |Q. Which symptoms are manifested by those problems

Estimation AlgorithmDiagnosis problem:

Given binary vector of symptoms YFind root cause (codebook column) k

Solution:

Bk is column k of the codebook norm denotes Hamming distance

kBYk −= minarg

Page 27: INTEGRATED DIAGNOSTICS OF ROOT CAUSE FAULTS IN …web.stanford.edu/class/ee392m/Lecture2Rabover.pdf · 2009. 4. 8. · diagnose? |Q. Which symptoms are manifested by those problems

UNFORTUNATELY, NOT THAT SIMPLE

In a very simple case we can get away with binary symptomsIn reality symptoms may manifest themselves with some certaintyWhat’s worse, your data collection is not perfect

Symptoms could be lostSymptoms could be misdiagnosed

The actual analysis algorithm is using a combination of the codebook analysis and abductive inference

Page 28: INTEGRATED DIAGNOSTICS OF ROOT CAUSE FAULTS IN …web.stanford.edu/class/ee392m/Lecture2Rabover.pdf · 2009. 4. 8. · diagnose? |Q. Which symptoms are manifested by those problems

28

ANALYSIS METHODS

Given:Preconditions αPost conditions β and Rule R : α→ β (α therefore β).

Induction: Inference of RAfter numerous examples (evidences) of α and therefore βInduce R

Deduction: Inference of βUsing R, and αMake a conclusion about β (α ∧ R ⊢ β)

Abduction: Inference of αUsing the post condition β and the rule RInfer that the precondition α could explain β (β ∧ R ⊢ α)

Page 29: INTEGRATED DIAGNOSTICS OF ROOT CAUSE FAULTS IN …web.stanford.edu/class/ee392m/Lecture2Rabover.pdf · 2009. 4. 8. · diagnose? |Q. Which symptoms are manifested by those problems

29

ABDUCTION VS. INDUCTION

Induction AbductionHypothesize general rule Hypothesize specific ruleDrawn from many observations Drawn from a single

observation

Page 30: INTEGRATED DIAGNOSTICS OF ROOT CAUSE FAULTS IN …web.stanford.edu/class/ee392m/Lecture2Rabover.pdf · 2009. 4. 8. · diagnose? |Q. Which symptoms are manifested by those problems

30

ABDUCTION VS. DEDUCTION

Deduction AbductionThe result is produced for a specific case

The result is produced for a specific case

If both rule and case are true then the result is true

Even if both rule and result are true, the inferred specific case is only a possibility

Page 31: INTEGRATED DIAGNOSTICS OF ROOT CAUSE FAULTS IN …web.stanford.edu/class/ee392m/Lecture2Rabover.pdf · 2009. 4. 8. · diagnose? |Q. Which symptoms are manifested by those problems

31

ROOT CAUSE ANALYSIS IS ACTUALLY ABDUCTION

R : α→ β (α therefore β)β are the symptomsα are the problemsR is the causality model: which problems cause the manifestation of which symptoms

Page 32: INTEGRATED DIAGNOSTICS OF ROOT CAUSE FAULTS IN …web.stanford.edu/class/ee392m/Lecture2Rabover.pdf · 2009. 4. 8. · diagnose? |Q. Which symptoms are manifested by those problems

32

SIMPLE EXAMPLE: EVERY PROBLEM CAUSES ONE UNIQUE SYMPTOMWhen symptom βi is observed, it is clear that problem αi has occurredThere is a one-one mapping between symptoms and problems

α1

α2

α3

β1

β2

β3

Modeling

α1

α2

α3

β1

β2

β3

Abduction

Page 33: INTEGRATED DIAGNOSTICS OF ROOT CAUSE FAULTS IN …web.stanford.edu/class/ee392m/Lecture2Rabover.pdf · 2009. 4. 8. · diagnose? |Q. Which symptoms are manifested by those problems

33

ANOTHER SIMPLE EXAMPLEEvery problem causes a unique set of symptoms

Still, given a set of symptoms, it is clear which problem has occurred

β1a

β1b

β2a

β2b

β3a

α1

α2

α3

β2a

β2b

α2

β1a

β1b

β3a

α1

α3

Page 34: INTEGRATED DIAGNOSTICS OF ROOT CAUSE FAULTS IN …web.stanford.edu/class/ee392m/Lecture2Rabover.pdf · 2009. 4. 8. · diagnose? |Q. Which symptoms are manifested by those problems

34

LET’S COMPLICATE THINGSSome symptoms caused by more than one problem

α1

α2

α3

βX

βY

α1

α2

α3

βX

βY

α1

α2

α3

βX

βY

α1

α2

α3

βX

βY

?

?

?

?

?

Page 35: INTEGRATED DIAGNOSTICS OF ROOT CAUSE FAULTS IN …web.stanford.edu/class/ee392m/Lecture2Rabover.pdf · 2009. 4. 8. · diagnose? |Q. Which symptoms are manifested by those problems

35

A SPECIFIC EXAMPLE –DIAGNOSING CHICKENPOX

Chickenpox causes positive result of some blood test and also causes feverA Cold causes feverMild chickenpox causes positive blood test but no fever

MCP

CPBT

FC

Modeling

Page 36: INTEGRATED DIAGNOSTICS OF ROOT CAUSE FAULTS IN …web.stanford.edu/class/ee392m/Lecture2Rabover.pdf · 2009. 4. 8. · diagnose? |Q. Which symptoms are manifested by those problems

36

DIAGNOSING CHICKENPOX IN A FLAWLESS ENVIRONMENT

Case 1: Only fever⇒ Cold

Case 2: Fever and blood test is positive⇒ ChickenpoxCase 3: Only positive blood test⇒ Mild ChickpoxCase 4: No symptoms⇒ Healthy

MCP

CPBT

FC

Page 37: INTEGRATED DIAGNOSTICS OF ROOT CAUSE FAULTS IN …web.stanford.edu/class/ee392m/Lecture2Rabover.pdf · 2009. 4. 8. · diagnose? |Q. Which symptoms are manifested by those problems

37

MCP

CPBT

FC

ANOTHER COMPLICATION (OF THE MODEL)

The blood test is not 100% reliableWhen you have chickenpox or mild chickenpox, there is a 50% chance that your blood test would turn out positiveWhen you don’t have chickenpox or mild chickenpox, there is a 2% chance that your blood test would turn out positive H

Problem causes symptomwith probability 2%

Problem causes symptomwith probability 50%

Problem causes symptomwith probability 100%

RemarkBT and F are conditionallyindependent given CP

Page 38: INTEGRATED DIAGNOSTICS OF ROOT CAUSE FAULTS IN …web.stanford.edu/class/ee392m/Lecture2Rabover.pdf · 2009. 4. 8. · diagnose? |Q. Which symptoms are manifested by those problems

38

MORE INFORMATIONEvery disease has an a-priori probabilityAt any given time, you have a chance of:

0.5% of having chickenpox1.5% of having mild chickenpox7% of having a cold91% of being healthy

Now apply the principle of parsimony in the abductive reasoning processUse Bayes law to calculate the likelihood of each explanation (probability of a disease given symptoms)

Page 39: INTEGRATED DIAGNOSTICS OF ROOT CAUSE FAULTS IN …web.stanford.edu/class/ee392m/Lecture2Rabover.pdf · 2009. 4. 8. · diagnose? |Q. Which symptoms are manifested by those problems

39

BAYES LAW APPLIED TO ABDUCTIVE REASONING

Given a set of observed symptoms S, what is the probability that a given problem di has occurred ?Bayes law:

P(di|S) = P(S|di) x P(di) / P(S)

The probability that a set of symptoms occurred knowing that the problem di occurred

A priori probability of problem di

Probability of set of symptoms S.

Page 40: INTEGRATED DIAGNOSTICS OF ROOT CAUSE FAULTS IN …web.stanford.edu/class/ee392m/Lecture2Rabover.pdf · 2009. 4. 8. · diagnose? |Q. Which symptoms are manifested by those problems

40

CHICKENPOX EXAMPLECase 2: Fever and blood test is positiveP(CP | F∧BT) = P(F∧BT | CP) x P(CP) / P(F∧BT) =

= P(F | CP) x P(BT | CP) x P(CP) / P(F∧BT) == 100% x 50% x 0.5% / P(F∧BT)= 0.25% / P(F∧BT)

SimilarlyP(MCP | F∧BT) = 0% / P(F∧BT)P(C | F∧BT) = 0.14% / P(F∧BT)P(H | F∧BT) = 0% / P F∧BT)

P(di|S) = P(S|di) x P(di) / P(S)P(di|S) = P(S|di) x P(di) / P(S)

MCP

CPBT

FC

H

Page 41: INTEGRATED DIAGNOSTICS OF ROOT CAUSE FAULTS IN …web.stanford.edu/class/ee392m/Lecture2Rabover.pdf · 2009. 4. 8. · diagnose? |Q. Which symptoms are manifested by those problems

41

CHICKENPOX EXAMPLE (CONT.)Sum of probabilities is 1 :

P(CP | F∧BT)+P(MCP | F∧BT)+P(C | F∧BT)+P(H | F∧BT)=1

⇒ 0.25% / P(F∧BT) + 0.14% / P(F∧BT) = 1⇒ P(F∧BT) = 0.39%

P(C | F∧BT) = 0.14/0.39 = 36%P(CP | F∧BT) = 0.25/0.39 = 64%

Page 42: INTEGRATED DIAGNOSTICS OF ROOT CAUSE FAULTS IN …web.stanford.edu/class/ee392m/Lecture2Rabover.pdf · 2009. 4. 8. · diagnose? |Q. Which symptoms are manifested by those problems

42

DIAGNOSING CHICKENPOX AND COLD

Case 1: fever, blood test negativeWith probability 96.5% you have a coldWith probability 3.5% you have chickenpox

Case 2: fever, blood test positiveWith probability 64% you have chickenpoxWith probability 36% you have a cold

Case 3: no fever, blood test positiveWith probability 71% you are healthyWith probability 29% you have mild chickenpox

Case 4: no fever, blood test negativeWith probability 99% you are healthyWith probability 1% you have mild chickenpox

MCP

CPBT

FC

H

Page 43: INTEGRATED DIAGNOSTICS OF ROOT CAUSE FAULTS IN …web.stanford.edu/class/ee392m/Lecture2Rabover.pdf · 2009. 4. 8. · diagnose? |Q. Which symptoms are manifested by those problems

Context

Collection

Analysis

Automated Incident Management—Start to Finish

ICIM Library1

4 Codebook Correlation

3 ICIM Repository

Polling/Pinging5

6 Root Cause

Business Impact

Discovery2

EMC SMARTS APPROACH

Page 44: INTEGRATED DIAGNOSTICS OF ROOT CAUSE FAULTS IN …web.stanford.edu/class/ee392m/Lecture2Rabover.pdf · 2009. 4. 8. · diagnose? |Q. Which symptoms are manifested by those problems

ICIM: THE INCHARGE COMMON INFORMATION MODEL

Based on DMTF CIM, extended with rich semantics for integrating and automating management applicationsComprehensive

Models network, systems, applications, services, business entities100+ classes, 50+ relationshipsInfinitely extensible via inheritance

Models the complex web of relationships in the real world: Within entities, across entities

Models cross-domain relationships Key to service management, end-to-end view Pieces together information from heterogeneous sources

Abstract To scale to networks of unlimited size and complexity through multiple levels of abstraction To decouple management application logic from the specifics of an ever-increasing stream of vendor products

EfficientOpen

ServiceSubscriber

Host

HostedApplication

Switch

Router

ServiceOffering

Subscribes

ComposedOf

HostedBy

NeighboringSystems

NeighboringSystems

Page 45: INTEGRATED DIAGNOSTICS OF ROOT CAUSE FAULTS IN …web.stanford.edu/class/ee392m/Lecture2Rabover.pdf · 2009. 4. 8. · diagnose? |Q. Which symptoms are manifested by those problems

ICIM: THE INCHARGE COMMON INFORMATION MODEL

AttributesStored or instrumented – this is transparent to applications

Relationships1-1, 1-many, many-1, many-many

BehaviorsSpecific extensions for each type of FCAPS application, e.g., problems for fault, constraints for configuration

ConstraintsAssure consistency of assigned values, e.g., speed match at both end of a circuit

MODEL: Managed Object DEfinition LanguageBased on CORBA IDL syntax

IP Technology Objects• Switches• Host• MAC Addresses• IP Networks/Addresses • Cards• Chassis• Interfaces• Ports• Servers• Cables• Trunks• Fans• Power Supplies

ATM Technology Objects• ATM Switches• Edge Routers• Cards• Chassis• Interfaces• Physical Ports• Logical Ports• Logical Trunks• Cables• Physical Trunks• Logical Trunks• PVCs

Application/Business/Service Objects

• Hosts• Applications• Application Clusters• Sessions• Transactions• Customers• Business Units• Services

Many Others…

Protocol Technology ObjectsMPLS• P Routers• PE Routers• CE Routers• LSP• LSP Hops• VPNs• VRFsRouting• BGP Services/Sessions• Route Reflectors (RR)• Route Reflector Clusters• OSPF Services/Sessions• OSPF Areas• OSPF members

Relationship Types• ConnectedVia/ConnectedTo• ParentOf/ChildOf• LayeredOver/Underlying• SwappedFrom/SwappedTo• NextHop/PreviousHop• ImportedBy/Import

Page 46: INTEGRATED DIAGNOSTICS OF ROOT CAUSE FAULTS IN …web.stanford.edu/class/ee392m/Lecture2Rabover.pdf · 2009. 4. 8. · diagnose? |Q. Which symptoms are manifested by those problems

- OSPF/BGP Areas/Adjacencies

Multi-Layered Relationships– Intra-Domain Relationships– Inter-Domain Relationships– Business Relationships

BusinessRelationships

TCP Connections,Sessions,

Transactions

L2 - Logical Connectivity (e.g. VLANS)

L1/L2 - Physical Connectivity

- MPLS LSP/LSP Hop/VFR/VPN

L3 - Logical IP-Subnet Connectivity

Subnet YSubnet Y

Subnet XSubnet X

Subnet ZSubnet Z

Subnet ASubnet A

INCHARGE COMMON INFORMATION MODEL

Page 47: INTEGRATED DIAGNOSTICS OF ROOT CAUSE FAULTS IN …web.stanford.edu/class/ee392m/Lecture2Rabover.pdf · 2009. 4. 8. · diagnose? |Q. Which symptoms are manifested by those problems

THANK YOU!Questions, Comments, Suggestions?

And we always in search of good [email protected]