26
Towards Highly Reliable Enterprise Network Services via Inference of Multi-level Dependencies Paramvir Bahl, Ranveer Chandra, Albert Greenberg, Srikanth Kandula, David A. Maltz, Ming Zhang Microsoft Research Presented by Zhenyu Pan

Towards Highly Reliable Enterprise Network Services via Inference of Multi-level Dependencies

Embed Size (px)

DESCRIPTION

Towards Highly Reliable Enterprise Network Services via Inference of Multi-level Dependencies. Paramvir Bahl, Ranveer Chandra, Albert Greenberg, Srikanth Kandula, David A. Maltz, Ming Zhang Microsoft Research Presented by Zhenyu Pan. Introduction Enterprise Network Service. - PowerPoint PPT Presentation

Citation preview

Page 1: Towards Highly Reliable Enterprise Network Services via Inference of  Multi-level Dependencies

Towards Highly Reliable Enterprise Network Services

via Inference of Multi-level Dependencies

Paramvir Bahl, Ranveer Chandra, Albert Greenberg, Srikanth Kandula, David A. Maltz, Ming Zhang

Microsoft Research

Presented by Zhenyu Pan

Page 2: Towards Highly Reliable Enterprise Network Services via Inference of  Multi-level Dependencies

Introduction Enterprise Network Service

• Network service is an (IPaddr, port) pair.

• Enterprise network– network of a single

enterprise.– traffic does not

cross the open Internet.

– user-perceptible service degradations are rampant.

Page 3: Towards Highly Reliable Enterprise Network Services via Inference of  Multi-level Dependencies

IntroductionSherlock System

• Conventional approach– treat each service as up or down.– box-centric, blind to the complex set of dependencies– meaningless alerts (15,000 alerts a day, almost universally

ignored).

• The new approach– models service availability as a 3-state value:

• Up: response time is normal; • Down: requests result in either an error status or no response at

all; • Troubled: response times fall significantly outside of normal

response times.

– user-centric, does not report problems that do not directly affect users.

Page 4: Towards Highly Reliable Enterprise Network Services via Inference of  Multi-level Dependencies

IntroductionSherlock System

• System components– Detects faults and performance problems by monitoring the

response time of services. Software agents: run on each host, analyze the packets, determine the set of services the host depends.

– Determines the set of components that responsible, a service, a router, or a link, etc. Sherlock server: assembles an multi-level, 3-state inference graph that captures the dependencies between all components.

– Localizes the problem to the most likely component. Ferret Algorithm: localizes faults using the inference graph.

• Main contributions:– Inference Graph– Ferret Algorithm

Page 5: Towards Highly Reliable Enterprise Network Services via Inference of  Multi-level Dependencies

IntroductionInference Graph

• An example

Page 6: Towards Highly Reliable Enterprise Network Services via Inference of  Multi-level Dependencies

Inference GraphNode Types

• Root-cause node: physical components whose failure can cause an end-user to experience failures. – computer, service, router, IP link, etc; – two special root-cause nodes: always troubled (AT) and always

down (AD) to model external factors

• Observation node: accesses to network services whose performances can be measured by Sherlock.

• Meta-node: model the dependencies between root causes and observations. Three types of meta-nodes: – noisy-max – selector (load-balancers) – failover (failover redundancy)

Page 7: Towards Highly Reliable Enterprise Network Services via Inference of  Multi-level Dependencies

Inference GraphNode States

• The state of each node– three-tuple: (Pup, Ptrouble, Pdown)– P stands for probability– Pup + Ptrouble + Pdown = 1

• The state of the root-cause node is independent of any other node.

• The state of observation nodes can be uniquely determined from the state of its ancestors.

Page 8: Towards Highly Reliable Enterprise Network Services via Inference of  Multi-level Dependencies

Inference GraphEdges

• Edge from node A to B: – the dependency that node A has to be in the up state for node

B to be up. – For example,

• a client cannot retrieve a file from a file server if the path to that file server is down.

• A client might still be able to retrieve the file even when the DNS server is down, if the file server’s name to IP address mapping is found in the client’s local DNS cache.

– dependency probability indicates how strong the dependency is.

Page 9: Towards Highly Reliable Enterprise Network Services via Inference of  Multi-level Dependencies

Inference GraphPropagation of State

• Noisy-Max Meta-Nodes – Max: The node gets the worst condition of its parents.

– Noisy: if the weight of a parent’s edge is d, then with

probability (1-d) the child is not affected

Page 10: Towards Highly Reliable Enterprise Network Services via Inference of  Multi-level Dependencies

Inference GraphPropagation of State

• Selector Meta-Nodes– Used to model load balancing

• NLB: Network Load Balancer.• ECMP: routers send packets to a destination along several paths.

Page 11: Towards Highly Reliable Enterprise Network Services via Inference of  Multi-level Dependencies

Inference GraphPropagation of State

• Failover Meta-Nodes– Failover: Clients access primary servers and failover to backup

servers when the primary server is inaccessible.

Page 12: Towards Highly Reliable Enterprise Network Services via Inference of  Multi-level Dependencies

Inference GraphTime to Propagate

• the majority of the nodes with more than one parent are noisy-max meta-nodes. – For these nodes, computation time is O(n)

• for selector and failover meta-nodes:– Still needs O(3n) time.– HOWEVER, those two types of meta-nodes have no more than

6 parents.

Page 13: Towards Highly Reliable Enterprise Network Services via Inference of  Multi-level Dependencies

Inference GraphFault Localization

• Assignment-vector – An assignment of state to every root-cause node which has

probability of 1 of being either up, troubled or down.

• Our target– Find the assignment-vector that best explain the observation

• Ferret – sets the root causes to the states specified in the assignment-

vector and then propagate state probabilities downwards until they reach the observation nodes.

– for each observation node, computes a score based on how well the probabilities in the state of the observation node agree with the statistical evidence.

Page 14: Towards Highly Reliable Enterprise Network Services via Inference of  Multi-level Dependencies

Inference GraphFault Localization

• Impossible to traverse all possible assignment vectors to determine the vector with the highest score– OBSERVATION 1. It is very likely that at any point in time only

a few root-cause nodes are troubled or down.– OBSERVATION 2. Since a root-cause is assigned to be up in

most assignment vectors, the evaluation of an assignment vector only requires re-evaluation of states at the descendants of root cause nodes that are not up.

Page 15: Towards Highly Reliable Enterprise Network Services via Inference of  Multi-level Dependencies

Inference Graph ferret algorithm

Page 16: Towards Highly Reliable Enterprise Network Services via Inference of  Multi-level Dependencies

Sherlock SystemThree-Step Process

• Step1: service-level dependency graph– We define the dependency probability of a host on service A

when accessing service B as the probability the host needs to communicate with service A before it can successfully communicate with service B.

• Step2: Inference Graph• Step3: Fault Localization using Ferret• Score for a given assignment vector:

– Track the history of response time and fits two Gaussian distribution to the data, namely Gaussianup and Gaussiantroubled.

– If the observation time is t and the predicted observation node is (pup, ptroubled, pdown), then the score of this vector is calculated as: pup*Prob(t|Gaussianup) + ptroubled*Prob(t|Gaussiantroubled)

Page 17: Towards Highly Reliable Enterprise Network Services via Inference of  Multi-level Dependencies

Sherlock SystemDiscovering Service-Level Dependencies

• OBSERVATION: If accessing service B depends on service A, then packets exchanged with A and B are likely to co-occur.

• Dependency probability: conditional probability of accessing service A within the dependency interval, prior to accessing service B.– Dependency interval: 10ms

• Chance of co-occurrence: first calculate the average interval, I, between accesses to the same service and estimate the likelihood of “chance co-occurrence” as (10ms)/I. They then retain only the dependencies where the dependency probability is much greater than the likelihood of chance co-occurrence.

Page 18: Towards Highly Reliable Enterprise Network Services via Inference of  Multi-level Dependencies

Sherlock SystemConstructing the Inference Graph

• creates a noisy max meta-node to represent the service. • creates an observation node and makes the service meta-

node a parent of the observation node. • examines the service dependency information recursively. • creates a root-cause node to represent the host on which

the service runs and makes this root-cause a parent of the meta-node.

• adds network topology information by using trace route results. For each path between hosts, it adds a noisy-max meta node to represent the path and root-cause nodes to represent every router and link on the path.

• adds each of these root-causes as parents of the path meta-node.

• put AT and AD. Give each the edges connecting AT/AD to the observation point a weight 0.001. And give the edges between a router and a path meta-node a probability 0.999.

Page 19: Towards Highly Reliable Enterprise Network Services via Inference of  Multi-level Dependencies

Implementation Agent

Page 20: Towards Highly Reliable Enterprise Network Services via Inference of  Multi-level Dependencies

Implementation Service Dependency Graphs

Page 21: Towards Highly Reliable Enterprise Network Services via Inference of  Multi-level Dependencies

Implementation Dependency Probabilities

Page 22: Towards Highly Reliable Enterprise Network Services via Inference of  Multi-level Dependencies

Implementation Test Bed

Page 23: Towards Highly Reliable Enterprise Network Services via Inference of  Multi-level Dependencies

EvaluationRoot Cause

Page 24: Towards Highly Reliable Enterprise Network Services via Inference of  Multi-level Dependencies

EvaluationPerformance Comparison

Page 25: Towards Highly Reliable Enterprise Network Services via Inference of  Multi-level Dependencies

EvaluationError Influence

Page 26: Towards Highly Reliable Enterprise Network Services via Inference of  Multi-level Dependencies

Summary

• Main contribution– Sherlock system

• Assist IT Admin for troubleshooting.• A Multi-level probabilistic inference model• Automatic Construction of the Inference Graph• An algorithm to localize root cause.