Toward Optimal Network Fault Correction via End-to-End Inference Patrick P. C. Lee, Vishal Misra, Dan Rubenstein Distributed Network Analysis (DNA) Lab

Toward Optimal Network Fault Correction via End-to-End Inference

Patrick P. C. Lee, Vishal Misra, Dan Rubenstein

Distributed Network Analysis (DNA) Lab, Columbia University

May 9, 2007

Outline

Motivation

Framework for end-to-end inference

Inference algorithm

Performance evaluation

Conclusions

Motivation

Goal: Correct (diagnose and repair) data-path failures in a system where only end-to-end information is available and link-level probing is unreliable.

Example: overlays across externally managed nodes

[Figure: a data stream server sends data to overlay receivers; one receiver reports "OK!" and two report "No data?"]

Problem

What should an administrator do if some paths fail to deliver data?

What the administrator knows: some nodes on the faulty paths must have failed

What the administrator doesn't know: which nodes on the paths failed, how many nodes on the paths failed, and why the nodes failed

Solution: Check, via a series of sanity tests, the nodes that may have failed, and repair those that did.

Constraints

Checking and repairing a node incurs a cost, e.g., wages and man-hours of support staff, or the cost of test equipment.

Such costs can vary widely, e.g., service providers may charge different amounts for checking nodes.

Objective

Assume each node i has an a priori known

failure probability p_i: the likelihood that node i has failed

checking cost c_i: the cost needed to perform sanity tests on node i

Objective: minimize the expected total checking cost of correcting (i.e., diagnosing and repairing) all faulty nodes:

$$\min \sum_i c_i \Pr(\text{node } i \text{ is actually checked})$$

over all sequences of nodes to be checked

End-to-End Inference

End-to-end inference approach for correcting data-path failures:

Input: network topology

1. Monitor paths.

2. Bad paths exist? If no, done.

3. If yes, select the nodes to check.

4. Check the selected nodes.

5. Repair the identified bad nodes, then monitor the paths again.

How to select nodes to check?
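The monitor / select / check / repair loop can be sketched in Python as follows. This is only an illustrative simulation: the function names, the hidden `faulty` set, the nine-node topology, and the naive `check_all_suspects` policy are assumptions made for the sketch and are not taken from the slides.

```python
def correct_faults(paths, faulty, cost, select):
    """Run the monitor / select / check / repair loop; return the total checking cost.

    `faulty` is the hidden ground truth. The loop only learns about it by
    monitoring paths and by checking nodes.
    """
    faulty = set(faulty)
    total = 0.0
    while True:
        # Monitor paths: a path is bad if it contains at least one faulty node.
        bad_paths = [p for p in paths if faulty & set(p)]
        if not bad_paths:
            return total  # no bad paths left: done
        # A node that lies on a good path is known to be good.
        known_good = {n for p in paths if p not in bad_paths for n in p}
        for node in select(bad_paths, known_good):
            total += cost[node]   # a check is paid for whether or not the node is faulty
            faulty.discard(node)  # repair the node if the check finds it faulty


def check_all_suspects(bad_paths, known_good):
    """Naive placeholder policy: check every node on a bad path that is not known good."""
    return sorted({n for p in bad_paths for n in p} - known_good)


# Hypothetical topology: four root-to-leaf paths in a tree rooted at node 1.
paths = [(1, 2, 3, 5, 8), (1, 2, 3, 5, 9), (1, 2, 3, 6), (1, 2, 4, 7)]
unit_cost = {n: 1.0 for n in range(1, 10)}
total_cost = correct_faults(paths, faulty={5, 6}, cost=unit_cost, select=check_all_suspects)
```

The `select` argument is where a smarter policy would plug in; the naive policy here checks all five suspect nodes even though only two are faulty.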

How to Select Nodes to Check?

Suppose that we check one node at a time.

Most-Likely Fault (MLF) approach: First check the most likely faulty node, i.e., the node with the highest conditional failure probability given that some paths failed to deliver data.

Does the MLF approach necessarily minimize the expected total checking cost?

Example: Why the MLF Scheme Is Not Optimal

No, the MLF scheme is not optimal in general.

Two data paths are given. Both failed to deliver data.

Nodes have different failure probabilities but the same checking cost.

[Figure: two data paths over nodes 1, 2, 3, 4, with failure probabilities 0.45, 0.3, 0.6, 0.5, respectively]

The conditional failure probabilities can be determined accordingly:

Node    Conditional failure prob.
1       0.616
2       0.411
3       0.663
4       0.579
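As a rough check of the table, the conditional probabilities can be recomputed by enumerating all failure states. The sketch below assumes independent node failures and assumes that the two paths are 1-2-3 and 1-2-4 (nodes 1 and 2 shared); the slides do not state the topology in text, but this reading reproduces the listed values.

```python
from itertools import product

# Assumed topology: two paths that share nodes 1 and 2.
paths = [(1, 2, 3), (1, 2, 4)]
p = {1: 0.45, 2: 0.3, 3: 0.6, 4: 0.5}  # a priori failure probabilities

nodes = sorted(p)
evidence = 0.0                    # Pr(both paths are bad)
joint = {n: 0.0 for n in nodes}   # Pr(node n is bad AND both paths are bad)

for state in product([False, True], repeat=len(nodes)):
    bad = {n for n, is_bad in zip(nodes, state) if is_bad}
    weight = 1.0
    for n in nodes:
        weight *= p[n] if n in bad else 1.0 - p[n]
    if all(bad & set(path) for path in paths):  # every path has a faulty node
        evidence += weight
        for n in bad:
            joint[n] += weight

conditional = {n: joint[n] / evidence for n in nodes}
for n in nodes:
    print(n, round(conditional[n], 3))
```

Node 3 comes out with the highest conditional failure probability, which is why an MLF policy would check it first.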

Example: Why the MLF Scheme Is Not Optimal

Findings: Node 3 has the highest conditional failure probability. However, a brute-force search shows that checking node 1 first is optimal (even though all nodes have the same checking cost).

Intuition: Node 3 affects only one path, but node 1 affects both paths. We may repair both paths by checking only node 1.
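A brute-force search of this kind might look like the sketch below. It rests on several assumptions that are not spelled out on the slide: the two paths are 1-2-3 and 1-2-4, failures are independent, one node is checked at a time, a faulty node is repaired immediately, and the paths are re-monitored for free after every check. The function names are made up for the sketch.

```python
from itertools import combinations

paths = [(1, 2, 3), (1, 2, 4)]          # assumed topology
p = {1: 0.45, 2: 0.3, 3: 0.6, 4: 0.5}   # failure probabilities from the example
c = {n: 1.0 for n in p}                 # same checking cost for every node


def observe(world):
    """Indices of the paths that contain at least one faulty node."""
    return frozenset(k for k, path in enumerate(paths) if world & set(path))


def cost_if_checked_next(belief, node):
    """Weighted cost-to-go if `node` is checked (and repaired if faulty) next."""
    groups = {}
    for world, weight in belief.items():
        repaired = world - {node}
        # The administrator learns the check result, then re-monitors the paths.
        key = (node in world, observe(repaired))
        groups.setdefault(key, {})[repaired] = weight
    return c[node] * sum(belief.values()) + sum(min_cost(g) for g in groups.values())


def min_cost(belief):
    """Minimum weighted cost-to-go over all adaptive checking orders."""
    bad_paths = observe(next(iter(belief)))  # identical for every world in a belief
    if not bad_paths:
        return 0.0
    # Only nodes on a bad path that may still be faulty are worth checking.
    suspects = {n for k in bad_paths for n in paths[k] if any(n in w for w in belief)}
    return min(cost_if_checked_next(belief, n) for n in suspects)


# Prior over failure states (frozensets of faulty nodes), restricted to "both paths bad".
belief = {}
for r in range(len(p) + 1):
    for bad in combinations(p, r):
        world = frozenset(bad)
        if observe(world) == frozenset({0, 1}):
            weight = 1.0
            for n in p:
                weight *= p[n] if n in world else 1.0 - p[n]
            belief[world] = weight

norm = sum(belief.values())
first_costs = {n: cost_if_checked_next(belief, n) / norm for n in p}
best_first = min(first_costs, key=first_costs.get)
```

Under these assumptions the search selects node 1 as the first node to check, consistent with the finding above, although the gap to checking node 3 first is small.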


Our Contributions

Propose an end-to-end inference approach for correcting all data-path failures.

Identify a set of candidate nodes, and prove that one of them must be checked first in order to minimize the expected total checking cost.

Show via simulation that our inference approach has a smaller expected cost than prior MLF-based approaches [Katzela and Schwartz, 1995; Kandula et al., 2005; Steinder and Sethi, 2004].

Topologies

Topologies that we consider: a single tree, and multiple trees.

We prove optimality results for a tree, and propose heuristics for multiple trees.

Finding Good/Bad Paths

For each data path:

Good: the data path has no faulty node and can deliver data.

Bad: the data path has at least one faulty node and cannot deliver data.

Assumption: Each node has the same data-forwarding behavior across all paths upon which it lies. This implies that if a node lies on at least one good path, it is a non-faulty (good) node.

Forming a Bad Tree

Monitor data streams from the root node 1 to each of the leaf nodes 6, 7, 8, 9.

[Figure: a tree of nodes 1 to 9 rooted at node 1; the paths to leaves 6, 8, and 9 are bad and the path to leaf 7 is good. The resulting bad tree contains nodes 3, 5, 6, 8, 9.]

Bad tree: a tree in which every path is a bad path.

Keep only bad paths, and remove any nodes that are known to be good.
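The pruning step can be sketched as below. The edge structure of the nine-node tree is an assumption reconstructed from the figure (1 -> 2 -> {3, 4}, 3 -> {5, 6}, 4 -> 7, 5 -> {8, 9}), and `form_bad_tree` is a hypothetical helper name.

```python
def form_bad_tree(paths, is_bad):
    """Return the bad paths with every known-good node removed."""
    # A node on at least one good path is good, because each node is assumed
    # to forward data the same way on all paths it lies on.
    known_good = {n for path, bad in zip(paths, is_bad) if not bad for n in path}
    return [
        tuple(n for n in path if n not in known_good)
        for path, bad in zip(paths, is_bad)
        if bad
    ]


# Root-to-leaf paths of the assumed tree, and the monitoring result for each.
paths = [(1, 2, 3, 5, 8), (1, 2, 3, 5, 9), (1, 2, 3, 6), (1, 2, 4, 7)]
is_bad = [True, True, True, False]

bad_tree_paths = form_bad_tree(paths, is_bad)
print(bad_tree_paths)
```

Nodes 1, 2, 4, and 7 lie on the good path and drop out, leaving the bad tree rooted at node 3.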

Inference Algorithm

Our inference algorithm selects which nodes to check.

Each node i is associated with a potential function:

$$\Phi(i) = \frac{\Pr(T \mid X_i, A_i)\, p_i}{c_i (1 - p_i)}$$

p_i = failure probability of node i

c_i = checking cost of node i

Pr(T | X_i, A_i) = conditional probability of having a bad tree

T = the event that the tree is a bad tree

X_i = the event that node i is bad

A_i = the event that the ancestors of node i are good

Intuitively, we should first check the node with a high p_i and a small c_i, i.e., the node with the highest potential.
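One way to evaluate the potential on a tree is sketched below, assuming independent node failures. Given that node i is bad and its ancestors are good, every path through i is bad, so the tree is bad exactly when each branch hanging off an ancestor of i is bad on its own. The five-node tree, the probabilities, the costs, and the function names are all hypothetical.

```python
def subtree_bad_prob(node, children, p):
    """Pr(every leaf path below `node` is bad), counting only nodes in this subtree."""
    if not children.get(node):
        return p[node]  # a leaf: its path is bad only if the leaf itself is bad
    all_children_bad = 1.0
    for child in children[node]:
        all_children_bad *= subtree_bad_prob(child, children, p)
    return p[node] + (1.0 - p[node]) * all_children_bad


def potential(i, root, children, parent, p, c):
    """Phi(i) = Pr(T | X_i, A_i) * p_i / (c_i * (1 - p_i))."""
    pr_bad_tree = 1.0
    node = i
    while node != root:
        above = parent[node]
        # Each sibling branch of a good ancestor must be bad by itself.
        for sibling in children[above]:
            if sibling != node:
                pr_bad_tree *= subtree_bad_prob(sibling, children, p)
        node = above
    return pr_bad_tree * p[i] / (c[i] * (1.0 - p[i]))


# Hypothetical bad tree: 3 -> {5, 6}, 5 -> {8, 9}.
children = {3: [5, 6], 5: [8, 9]}
parent = {5: 3, 6: 3, 8: 5, 9: 5}
p = {n: 0.2 for n in (3, 5, 6, 8, 9)}          # hypothetical failure probabilities
c = {3: 1.0, 5: 0.1, 6: 1.0, 8: 1.0, 9: 1.0}   # hypothetical checking costs

phi = {n: potential(n, 3, children, parent, p, c) for n in p}
```

With these made-up numbers node 5 has the highest potential (0.5), ahead of the root (0.25), because it is cheap to check.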

Inference Algorithm

Candidate node: On each bad path, one node has the highest potential. We call this node a candidate node.

Example of identifying candidate nodes:

[Figure: bad tree rooted at node 3 with children 5 and 6; node 5 has children 8 and 9]

Bad path    Candidate node
3-5-8       5
3-5-9       5
3-6         3

Main theorem: To minimize the expected total checking cost of correcting all faulty nodes for a given bad tree, we must check a candidate node first.
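Picking the candidate node on each bad path amounts to a per-path argmax over the potentials. In the sketch below the potential values are hypothetical, chosen only so that the result matches the example; the slides do not give the underlying probabilities or costs.

```python
def candidate_nodes(bad_paths, potential):
    """Map each bad path to its node with the highest potential."""
    return {path: max(path, key=lambda n: potential[n]) for path in bad_paths}


bad_paths = [(3, 5, 8), (3, 5, 9), (3, 6)]
# Hypothetical potentials, not taken from the slides.
potential = {3: 0.25, 5: 0.5, 6: 0.058, 8: 0.01, 9: 0.01}

candidates = candidate_nodes(bad_paths, potential)
for path, node in candidates.items():
    print("-".join(map(str, path)), node)
```

The set of distinct candidates, here {3, 5}, is what the main theorem narrows the first check down to.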


For some special cases, we know which candidate node should be checked first to minimize the expected cost.

Examples of the special cases:

A path: check the node with the highest p_i / (c_i (1 - p_i)) first.

A tree in which nodes have a fixed failure probability and a fixed checking cost: check the root node first.
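The single-path rule can be checked numerically on a small instance. The sketch below assumes independent failures and a stopping rule in which checking ends as soon as the path delivers data again, so a node is checked only if it or some node later in the order is faulty. The three nodes, their probabilities, and their costs are hypothetical.

```python
from itertools import permutations


def expected_cost(order, p, c):
    """Expected checking cost on a bad path, up to the constant factor 1 / Pr(path is bad)."""
    total = 0.0
    for k, node in enumerate(order):
        # Probability that this node and all nodes after it in the order are good.
        all_remaining_good = 1.0
        for later in order[k:]:
            all_remaining_good *= 1.0 - p[later]
        total += c[node] * (1.0 - all_remaining_good)
    return total


p = {1: 0.1, 2: 0.3, 3: 0.2}   # hypothetical failure probabilities
c = {1: 1.0, 2: 2.0, 3: 0.5}   # hypothetical checking costs

# Exhaustive search over all checking orders.
brute_force = min(permutations(p), key=lambda order: expected_cost(order, p, c))

# Greedy rule: sort by p_i / (c_i (1 - p_i)), highest first.
by_ratio = tuple(sorted(p, key=lambda n: p[n] / (c[n] * (1.0 - p[n])), reverse=True))
```

On this instance both approaches give the order 3, 2, 1. An adjacent-swap argument shows why: putting node a directly before node b is cheaper exactly when p_a / (c_a (1 - p_a)) > p_b / (c_b (1 - p_b)).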


For general cases, we don’t know which candidate node should be checked first to minimize the expected cost.

e.g., not necessarily the candidate node with the highest potential

Heuristics:

Sequential strategy: checks the candidate node with the highest potential.

Parallel strategy: simultaneously checks multiple candidate nodes that cover all bad paths.

Highlights of Experiments

Setup:

Use BRITE to create 200 random experimental networks, each of which has 200 routers.

Assign each node a failure probability and a checking cost.

Focus on multi-tree topologies in which each tree is a shortest-path tree rooted at a randomly selected router.

Metric: expected total checking cost to diagnose and repair all faulty nodes.

Heuristics to be compared:

Candidate-based heuristics: check the candidate nodes first.

MLF-based heuristics: check the most likely faulty nodes first.

Highlights of Experiments

Random failure prob., fixed checking cost

p_i ~ U(0, 0.2), c_i = 1

Result: Both heuristics have almost the same expected total checking cost.


Random failure prob., random checking cost

p_i ~ U(0, 0.2), c_i ~ U(0, 1)

Result: Checking the candidate nodes first decreases the expected total checking cost by ~10%.


Fixed failure prob., random checking cost

p_i = 0.1, c_i ~ U(0, 1)

Result: Checking the candidate nodes first decreases the expected total checking cost by ~20%.

Conclusions

Presented optimality results for diagnosing and repairing all data-path failures, with an objective to minimize the expected total checking cost.

Constructed a potential function to identify candidate nodes, one of which must be checked first to minimize the expected total checking cost.

Showed via evaluation that checking candidate nodes first can reduce the checking cost by up to 20% compared to checking the most likely faulty nodes first.
