Problem Diagnosis: Distributed Problem Diagnosis, Sherlock, X-Trace


Problem Diagnosis

• Distributed Problem Diagnosis

• Sherlock

• X-Trace

Troubleshooting Networked Systems

• Hard to develop, debug, deploy, troubleshoot
• No standard way to integrate debugging, monitoring, diagnostics

Status quo: device centric

[Slide figure: the same request scattered across per-device logs. A firewall emits syslog entries, the load balancer logs dispatches (including a "[crit] Server s3 down" entry), Web 1 and Web 2 each keep their own access logs, and the database logs raw SQL statements ("LOG: statement: SELECT ..."). Nothing ties the entries together.]

Firewall | Load Balancer | Web 1 | Web 2 | Database

Status quo: device centric

• Determining paths:
  – Join logs on time and ad-hoc identifiers
• Relies on:
  – well-synchronized clocks
  – extensive application knowledge
• Requires all operations to be logged to guarantee complete paths
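The fragility of the time-based join can be seen in a small sketch. The log entries, the window size, and the join rule below are all hypothetical; real device logs each have their own format, clock, and identifiers:

```python
from datetime import datetime, timedelta

# Hypothetical log entries: each device keeps its own (timestamp, message)
# records with no shared request identifier.
firewall_log = [(datetime(2006, 8, 20, 9, 13, 32), "allow 65.54.188.26")]
web_log = [(datetime(2006, 8, 20, 9, 13, 33), "GET /gallery 200")]

def join_on_time(log_a, log_b, window=timedelta(seconds=2)):
    """Correlate entries whose timestamps fall within `window` of each other.

    This only recovers the request path if clocks are well synchronized and
    traffic is sparse enough for the window to be unambiguous, which is
    exactly the weakness of the device-centric approach.
    """
    pairs = []
    for t_a, msg_a in log_a:
        for t_b, msg_b in log_b:
            if abs(t_a - t_b) <= window:
                pairs.append((msg_a, msg_b))
    return pairs

print(join_on_time(firewall_log, web_log))
# -> [('allow 65.54.188.26', 'GET /gallery 200')]
```

With busy logs, many unrelated entries fall inside the same window, so the join turns ambiguous; that is why complete paths require every operation to be logged.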

Examples

[Slide figures, built up across several slides: a request path flowing from the User through a DNS Server and a Proxy to a Web Server.]

Approaches to Diagnosis

• Passively learn the relationships
  – Infer problems as deviations from the norm
• Actively instrument the stack to learn relationships
  – Infer problems as deviations from the norm

Sherlock – Diagnosing Problems in the Enterprise

Srikanth Kandula

Well-Managed Enterprises Still Unreliable

[Figure: CDF of Web-server response time. X-axis: response time (ms), log scale from 10 to 10000; y-axis: fraction of requests, 0 to 0.1. Annotations: 85% Normal, 10% Troubled, 0.7% Down.]

10% of responses take up to 10x longer than normal

How do we manage evolving enterprise networks?

Instead of looking at the nitty-gritty of individual components, use an end-to-end approach that focuses on user problems

Sherlock

Challenges for the End-to-End Approach

• Don’t know what the user’s performance depends on
  – Dependencies are distributed
  – Dependencies are non-deterministic
• Don’t know which dependency is causing the problem
  – Server CPU at 70%, link dropped 10 packets, but which affected the user?

[Slide figure: an example Web connection involving a Client, DNS, an Auth. Server, a Web Server, and a SQL Backend.]

Sherlock’s Contributions

• Passively infers dependencies from logs
• Builds a unified dependency graph incorporating network, server, and application dependencies
• Diagnoses user problems in the enterprise
• Deployed in a part of the Microsoft enterprise

Sherlock’s Architecture

[Slide figure: observations from clients and servers feed the inference engine.]

User observations (Web1: 1000 ms, Web2: 30 ms, File1: timeout)
+ network dependency graph
→ inference engine
= list of troubled components

Sherlock works for various client-server applications (e.g., a video server backed by a data store and DNS).

How do you automatically learn such distributed dependencies?

Strawman: Instrument all applications and libraries. Not practical.

Sherlock instead exploits timing information:

• If the client talks to C within a short gap t whenever it talks to B, the connections may be dependent.
• But if the client talks to B constantly, B precedes every access to C anyway: a false dependence.
• So: dependent iff t << B’s inter-access time, as long as the co-occurrence happens with probability higher than chance.
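A toy version of this timing rule, with invented access times and thresholds (Sherlock's actual statistical test is more careful than this sketch):

```python
def infer_dependence(times_b, times_c, gap=0.05, chance=0.5):
    """Toy dependency test: "talking to C depends on talking to B" when an
    access to C follows an access to B within `gap` seconds, this happens
    for more than a `chance` fraction of C's accesses, and the gap is much
    smaller than B's inter-access time (so it is unlikely by coincidence).
    """
    if len(times_b) < 2 or not times_c:
        return False
    inter_access = (times_b[-1] - times_b[0]) / (len(times_b) - 1)
    follows = sum(
        1 for t_c in times_c
        if any(0 <= t_c - t_b <= gap for t_b in times_b)
    )
    return gap < inter_access and follows / len(times_c) > chance

# DNS lookups (B) closely followed by video-store connections (C):
b = [0.0, 10.0, 20.0]
c = [0.02, 10.03, 20.01]
print(infer_dependence(b, c))      # -> True
print(infer_dependence(b, [5.0]))  # -> False: no access to B just before
```

Note how the `gap < inter_access` guard implements the "t << inter-access time" condition: if B were accessed every 0.05 s, B would precede every access to C by chance alone.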

Sherlock’s Algorithm to Infer Dependencies

1. Infer dependent connections from timing
2. Infer topology from traceroutes & configurations

[Slide figure: the resulting dependency graph for "Bill watches video": Bill’s client depends on DNS and on the video server, which in turn depends on the video store.]

• Works with legacy applications
• Adapts to changing conditions

But hard dependencies are not enough…

If Bill caches the server’s IP, DNS can be down and Bill still gets the video. We need probabilities: Sherlock uses the frequency with which a dependence occurs in the logs as its edge probability (e.g., p1 = 10% for the DNS edge, p2 = 100% for the video-server edge).

How do we use the dependency graph to diagnose user problems?

[Slide figure: the "Bill watches video" graph again, with several components along the path that could explain a slow request.]

Which components caused the problem? Need to disambiguate!

Diagnosing User Problems

[Slide figure: overlapping dependency graphs for "Bill watches Video", "Bill sees Sales", and "Paul watches Video2", which share components such as DNS and the store.]

Which components caused the problem? Use correlation to disambiguate!

• Disambiguate by correlating
  – Across logs from the same client
  – Across clients
• Prefer simpler explanations
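One way to picture the correlation step is as scoring candidate fault sets against the user observations. The components, requests, and scoring rule below are illustrative inventions, not Sherlock's actual (probabilistic) inference:

```python
def score(hypothesis, observations, dependencies):
    """Reward a fault hypothesis (a set of failed components) for every
    observation it explains and penalize it for every one it contradicts:
    a troubled request should touch a failed component, a healthy one
    should not."""
    total = 0
    for request, troubled in observations.items():
        touches_fault = bool(dependencies[request] & hypothesis)
        total += 1 if touches_fault == troubled else -1
    return total

# Invented example in the spirit of the slides:
deps = {
    "Bill->Video":  {"DNS", "VideoServer", "Store"},
    "Bill->Sales":  {"DNS", "SalesServer"},
    "Paul->Video2": {"DNS", "Video2Server", "Store"},
}
obs = {"Bill->Video": True, "Bill->Sales": False, "Paul->Video2": True}

# Among single-fault hypotheses, only the shared Store explains both
# troubled requests without contradicting the healthy one; DNS would
# falsely implicate Bill's healthy Sales request.
candidates = [{"DNS"}, {"Store"}, {"VideoServer"}]
best = max(candidates, key=lambda h: score(h, obs, deps))
print(best)  # -> {'Store'}
```

"Prefer simpler explanations" shows up here as searching small hypotheses (single faults) before larger ones.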

Will Correlation Scale?

Microsoft internal network (spanning building networks, campus core, corporate core, and data center):
• O(100,000) client desktops
• O(10,000) servers
• O(10,000) apps/services
• O(10,000) network devices

The dependency graph is huge. Can we evaluate all combinations of component failures? The number of fault combinations is exponential: impossible to compute!

Scalable Algorithm to Correlate

• Only a few faults happen concurrently
  – But how many is few? Evaluate enough to cover 99.9% of faults.
  – For the MS network, assuming at most 2 concurrent faults is 99.9% accurate.
• Only a few nodes change state
  – Re-evaluate a node only if an ancestor changes state.
  – Reduces the cost of evaluating a case by 30x-70x.

Exponential → Polynomial
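Restricting hypotheses to at most two concurrent faults is what turns the exponential search polynomial. A sketch of the enumeration (component names are placeholders):

```python
from itertools import combinations

def candidate_fault_sets(components, max_concurrent=2):
    """Yield every fault hypothesis with at most `max_concurrent`
    simultaneous failures: O(n^2) hypotheses for max_concurrent=2,
    instead of the 2^n subsets of all components."""
    for k in range(1, max_concurrent + 1):
        for combo in combinations(components, k):
            yield set(combo)

components = [f"C{i}" for i in range(358)]  # 358 failable components
n = sum(1 for _ in candidate_fault_sets(components))
print(n)  # -> 64261 (358 singletons + 63903 pairs), vs. 2**358 subsets
```

About 64,000 hypotheses is easily tractable, and the ancestor-change optimization further avoids re-scoring most of the graph for each one.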

Results

Experimental Setup

• Evaluated on the Microsoft enterprise network

• Monitored 23 clients and 40 production servers for 3 weeks
  – Clients are at MSR Redmond
  – An extra host on each server’s Ethernet logs packets

• Busy, operational network
  – Main intranet Web site and software-distribution file server
  – Load-balancing front-ends
  – Many paths to the data center

What Do Web Dependencies in the MS Enterprise Look Like?

[Slide figures, built up across several slides: the dependency graphs for "Client accesses Portal" and "Client accesses Sales", including shared components such as the Auth. Server.]

Sherlock discovers complex dependencies of real apps.

What Do File-Server Dependencies Look Like?

[Slide figure: "Client accesses software-distribution server": the client depends on the proxy/file server (100%) and, with lower probability, on the Auth. Server, WINS, DNS, and four backend servers, with edge probabilities ranging from 10% down to 0.3%.]

Sherlock works for many client-server applications

Dependency graph: 2565 nodes; 358 components that can fail

Sherlock Identifies Causes of Poor Performance

Inference graph: 2565 nodes; 358 components that can fail

[Figure: component index vs. time (days), marking the components Sherlock blames over the 3-week trace.]

87% of problems localized to 16 components

Sherlock Goes Beyond Traditional Tools

• Corroborated the three significant faults
• SNMP-reported utilization on a link flagged by Sherlock
• Problems coincide with utilization spikes

Sherlock identifies the troubled link, but SNMP cannot!

X-Trace

• X-Trace records events in a distributed execution and their causal relationships
• Events are grouped into tasks
  – A well-defined starting event and everything causally related to it
• Each event generates a report, binding it to one or more preceding events
• Captures the full happens-before relation

X-Trace Output

• Task graph capturing the task execution
  – Nodes: events across layers and devices
  – Edges: causal relations between events

[Figure: task graph for an HTTP request: HTTP Client → HTTP Proxy → HTTP Server, carried over two TCP connections (TCP 1 start/end, TCP 2 start/end), each in turn traversing IP hops through IP routers.]

Basic Mechanism

• Each event is uniquely identified within a task: [TaskId, EventId]
• [TaskId, EventId] is propagated along the execution path
• For each event, create and log an X-Trace report
  – Enough info to reconstruct the task graph

[Figure: the same HTTP task graph with events labeled a through n; e.g., event g at the proxy is bound to its predecessors a and f.]

X-Trace Report
  TaskID: T
  EventID: g
  Edge: from a, f

X-Trace Library API

• Handles propagation within the app
• Supports threads and event-based code (e.g., libasync)
• Akin to a logging API:
  – Main call is logEvent(message)
• Library takes care of event-id creation, binding, reporting, etc.
• Implementations in C++, Java, Ruby, Javascript
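A toy sketch of what such a logEvent-style API might look like. The class, method names, and report format below are invented for illustration and are not the real X-Trace library interface:

```python
import secrets

class XTraceLogger:
    """Toy logEvent-style API: the library mints event ids, binds each
    event to its predecessors, and emits a report per event."""

    def __init__(self, task_id):
        self.task_id = task_id
        self.last_event = None   # stands in for propagated [TaskId, EventId]
        self.reports = []        # stands in for the reporting channel

    def log_event(self, message, extra_parents=()):
        event_id = secrets.token_hex(4)  # library creates the event id
        parents = [self.last_event] if self.last_event else []
        parents += list(extra_parents)   # e.g. a joining edge from another flow
        self.reports.append({
            "TaskID": self.task_id,
            "EventID": event_id,
            "Edges": parents,            # happens-before predecessors
            "Message": message,
        })
        self.last_event = event_id
        return event_id

log = XTraceLogger(task_id="T")
a = log.log_event("HTTP client sends request")
g = log.log_event("proxy receives request")  # bound to event `a`
print(log.reports[1]["Edges"] == [a])  # -> True
```

The point of the API shape is that application code only calls `log_event(message)`; id creation, edge binding, and reporting stay inside the library.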

Task Tree

• X-Trace tags all network operations resulting from a particular task with the same task identifier
• The task tree is the set of network operations connected with an initial task
• The task tree can be reconstructed after collecting the trace data from reports
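Reconstruction is straightforward once reports are collected: keep the reports for one TaskID and follow each report's recorded predecessors. A sketch, using a made-up report format:

```python
def build_task_graph(reports, task_id):
    """Rebuild one task's graph offline: nodes are event ids, edges come
    from each report's recorded predecessor list."""
    nodes, edges = set(), []
    for report in reports:
        if report["TaskID"] != task_id:
            continue  # reports from other tasks are ignored
        nodes.add(report["EventID"])
        for parent in report["Edges"]:
            edges.append((parent, report["EventID"]))
    return nodes, edges

# Invented reports for a task T with a join at event g:
reports = [
    {"TaskID": "T", "EventID": "a", "Edges": []},
    {"TaskID": "T", "EventID": "f", "Edges": ["a"]},
    {"TaskID": "T", "EventID": "g", "Edges": ["a", "f"]},  # joining event
]
nodes, edges = build_task_graph(reports, "T")
print(sorted(nodes), edges)
# -> ['a', 'f', 'g'] [('a', 'f'), ('a', 'g'), ('f', 'g')]
```

Because each report carries enough context on its own, reports can arrive late or out of order and the graph still assembles correctly.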

An example of the task tree

• A simple HTTP request through a proxy


X-Trace Components

• Data: X-Trace metadata
• Network path: task tree
• Report: reconstruct the task tree

Propagation of X-Trace Metadata

• The propagation of X-Trace metadata through the task tree

The X-Trace Metadata

Field        Usage
-----------  -----------------------------------------------------------
Flags        Bits that specify which of the three optional components
             are present
TaskID       A unique integer ID
TreeInfo     ParentID, OpID, EdgeType
Destination  The address that X-Trace reports should be sent to
Options      Mechanism to accommodate future extensions
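The same fields as a plain record, purely for illustration (the real metadata is a compact binary encoding, and the example values below are hypothetical):

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class XTraceMetadata:
    """The fields from the table above; types are simplified."""
    flags: int                         # which optional components are present
    task_id: int                       # unique task identifier
    tree_info: Optional[Tuple] = None  # (ParentID, OpID, EdgeType)
    destination: Optional[str] = None  # where reports should be sent
    options: bytes = b""               # room for future extensions

# Hypothetical values for a single event:
md = XTraceMetadata(flags=0b011, task_id=42,
                    tree_info=("a", "g", "NEXT"),
                    destination="reports.example.net:7831")
print(md.task_id, md.tree_info)
```

Making TreeInfo, Destination, and Options optional mirrors the role of the Flags field: a node only pays for the components it actually carries.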

X-Trace Report Architecture

[Figure: the report collection architecture, shown across three build slides.]

X-Trace-like in Google/Bing/Yahoo

• Why?
  – Own a large portion of the ecosystem
  – Use RPC for communication
  – Need to understand:
    • Time for a user request
    • Resource utilization by request

Sherlock vs. X-Trace

• Overhead vs. accuracy

• Deployment issues
  – Invasiveness
  – Code modification

Conclusions

• Sherlock passively infers network-wide dependencies from logs and traceroutes

• It diagnoses faults by correlating user observations

• X-Trace actively discovers network-wide dependencies
