Problem Diagnosis • Distributed Problem Diagnosis • Sherlock • X-trace


Page 1: Problem Diagnosis Distributed Problem Diagnosis Sherlock X-trace

Problem Diagnosis

• Distributed Problem Diagnosis

• Sherlock

• X-trace

Page 2: Problem Diagnosis Distributed Problem Diagnosis Sherlock X-trace

Troubleshooting Networked Systems

• Hard to develop, debug, deploy, troubleshoot

• No standard way to integrate debugging, monitoring, diagnostics

Page 3: Problem Diagnosis Distributed Problem Diagnosis Sherlock X-trace

Status quo: device centric

[Figure: each device keeps its own log]

Firewall: timestamped entries (...28 03:55:38 PM fire...)

Load Balancer: dispatch notices ([04:03:23 2006] [notice] Dispatch s1 ... [04:04:47 2006] [crit] Server s3 down...)

Web 1, Web 2: HTTP access logs (72.30.107.159 - - [20/Aug/2006:09:12:58 -0700] "GET /ga...)

Database: statement logs (LOG: statement: select oid...)

Page 4: Problem Diagnosis Distributed Problem Diagnosis Sherlock X-trace

Status quo: device centric

• Determining paths:
– Join logs on time and ad-hoc identifiers

• Relies on
– well synchronized clocks
– extensive application knowledge

• Requires all operations logged to guarantee complete paths

Page 5: Problem Diagnosis Distributed Problem Diagnosis Sherlock X-trace

Examples

[Figure: User, DNS Server, Proxy, Web Server]


Page 9: Problem Diagnosis Distributed Problem Diagnosis Sherlock X-trace

Approaches to Diagnosis

• Passively learn the relationships
– Infer problems as deviations from the norm

• Actively instrument the stack to learn relationships
– Infer problems as deviations from the norm

Page 10: Problem Diagnosis Distributed Problem Diagnosis Sherlock X-trace

Sherlock – Diagnosing Problems in the Enterprise

Srikanth Kandula

Page 11: Problem Diagnosis Distributed Problem Diagnosis Sherlock X-trace

Well-Managed Enterprises Still Unreliable

[Figure: plot of Web server response times. x-axis: response time of a Web server (ms), log scale from 10 to 10000; y-axis: fraction of requests, 0 to 0.1. Regions: 85% normal, 10% troubled, 0.7% down.]

10% of responses take up to 10x longer than normal

How do we manage evolving enterprise networks?

Page 12: Problem Diagnosis Distributed Problem Diagnosis Sherlock X-trace

Instead of looking at the nitty-gritty of individual components, use an end-to-end approach that focuses on user problems

Sherlock

Page 13: Problem Diagnosis Distributed Problem Diagnosis Sherlock X-trace

Challenges for the End-to-End Approach

• Don’t know what user’s performance depends on

Page 14: Problem Diagnosis Distributed Problem Diagnosis Sherlock X-trace

Challenges for the End-to-End Approach

• Don’t know what the user’s performance depends on
– Dependencies are distributed
– Dependencies are non-deterministic

• Don’t know which dependency is causing the problem
– Server CPU is at 70% and a link dropped 10 packets, but which one affected the user?

[Figure: e.g., a Web connection: Client, DNS, Auth. Server, Web Server, SQL Backend]

Page 15: Problem Diagnosis Distributed Problem Diagnosis Sherlock X-trace

Sherlock’s Contributions

• Passively infers dependencies from logs
• Builds a unified dependency graph incorporating network, server and application dependencies
• Diagnoses user problems in the enterprise
• Deployed in a part of the Microsoft Enterprise

Page 16: Problem Diagnosis Distributed Problem Diagnosis Sherlock X-trace

Sherlock’s Architecture

Page 17: Problem Diagnosis Distributed Problem Diagnosis Sherlock X-trace

Sherlock’s Architecture

[Figure: clients and servers. User observations (Web1: 1000 ms, Web2: 30 ms, File1: timeout) + network dependency graph, fed to the inference engine = list of troubled components]

Sherlock works for various client-server applications

Page 18: Problem Diagnosis Distributed Problem Diagnosis Sherlock X-trace

[Figure: Video Server, Data Store, DNS]

How do you automatically learn such distributed dependencies?

Page 19: Problem Diagnosis Distributed Problem Diagnosis Sherlock X-trace

Strawman: instrument all applications and libraries. Not practical.

Sherlock exploits timing info

[Figure: timeline on which my client talks to B and, within time t, talks to C]

If my client talks to B whenever it talks to C, the connections are dependent

Page 20: Problem Diagnosis Distributed Problem Diagnosis Sherlock X-trace

Strawman: instrument all applications and libraries. Not practical.

Sherlock exploits timing info

If my client talks to B whenever it talks to C, the connections are dependent

[Figure: timeline with many accesses to B, so that some access to B falls within time t of C]

False dependence

Page 21: Problem Diagnosis Distributed Problem Diagnosis Sherlock X-trace

Strawman: instrument all applications and libraries. Not practical.

Sherlock exploits timing info

If my client talks to B whenever it talks to C, the connections are dependent

[Figure: timeline with accesses B, B, C; t separates B from C, and the inter-access time separates successive accesses]

Dependent iff t << inter-access time

As long as this occurs with probability higher than chance
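
The timing rule above can be illustrated with a short sketch. It is not Sherlock's implementation: the function names, the `margin` factor, and the way the chance level is estimated (share of the trace covered by windows of length t after each access to B) are all assumptions made for illustration. The idea it shows is only the one on the slide: treat C as dependent on B when accesses to B precede accesses to C within t more often than chance would explain.

```python
from bisect import bisect_left

def cooccurrence_fraction(times_c, times_b, window):
    """Fraction of accesses to C preceded by an access to B within `window` seconds."""
    times_b = sorted(times_b)
    hits = 0
    for tc in times_c:
        i = bisect_left(times_b, tc - window)  # first access to B at or after tc - window
        if i < len(times_b) and times_b[i] <= tc:
            hits += 1
    return hits / len(times_c) if times_c else 0.0

def chance_fraction(times_b, window, duration):
    """Assumed chance level: share of the trace covered by windows after accesses to B."""
    return min(1.0, len(times_b) * window / duration)

def is_dependent(times_c, times_b, window, duration, margin=2.0):
    """C depends on B if co-occurrence clearly exceeds the (assumed) chance level."""
    observed = cooccurrence_fraction(times_c, times_b, window)
    return observed > margin * chance_fraction(times_b, window, duration)

# Hypothetical 400-second trace in which C is accessed four times.
times_c = [0.05, 100.04, 200.06, 300.05]
sparse_b = [0.0, 100.0, 200.0, 300.0]        # B precedes each access to C
chatty_b = [i * 0.05 for i in range(8000)]   # B every 50 ms: t is not << inter-access time

sparse_result = is_dependent(times_c, sparse_b, window=0.1, duration=400.0)
chatty_result = is_dependent(times_c, chatty_b, window=0.1, duration=400.0)
```

With the sparse trace the co-occurrence is far above chance, so the connection is flagged as dependent. With the chatty trace every access to C also follows some access to B, but that is exactly what chance predicts, so the sketch rejects it as a false dependence.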

Page 22: Problem Diagnosis Distributed Problem Diagnosis Sherlock X-trace

Sherlock’s Algorithm to Infer Dependencies

• Infer dependent connections from timing

[Figure: dependency graph among Video, DNS, and Store]

Page 23: Problem Diagnosis Distributed Problem Diagnosis Sherlock X-trace

Sherlock’s Algorithm to Infer Dependencies

• Infer dependent connections from timing
• Infer topology from traceroutes and configurations

[Figure: dependency graph among Video, DNS, and Store, expanded into a graph with nodes Bill’s Client, DNS, Video, Store, the connections Bill→DNS, Bill→Video, Video→Store, and the observation "Bill Watches Video"]

• Works with legacy applications
• Adapts to changing conditions

Page 24: Problem Diagnosis Distributed Problem Diagnosis Sherlock X-trace

But hard dependencies are not enough…

Page 25: Problem Diagnosis Distributed Problem Diagnosis Sherlock X-trace

But hard dependencies are not enough…

[Figure: graph with nodes Bill’s Client, DNS, Video, Store, the connections Bill→DNS, Bill→Video, Video→Store, and the observation "Bill watches Video"; edges are labeled p1, p2, p3]

Need probabilities

If Bill caches the server’s IP, DNS can be down but Bill still gets the video

Sherlock uses the frequency with which a dependence occurs in logs as its edge probability

p1 = 10%, p2 = 100%

Page 26: Problem Diagnosis Distributed Problem Diagnosis Sherlock X-trace

How do we use the dependency graph to diagnose user problems?

Page 27: Problem Diagnosis Distributed Problem Diagnosis Sherlock X-trace

Diagnosing User Problems

[Figure: graph with nodes Bill’s Client, DNS, Video, Store, the connections Bill→DNS, Bill→Video, Video→Store, and the observation "Bill Watches Video"]

Which components caused the problem?

Need to disambiguate!!

Page 28: Problem Diagnosis Distributed Problem Diagnosis Sherlock X-trace

Diagnosing User Problems

Which components caused the problem?

[Figure: graph with observations "Bill Watches Video", "Bill Sees Sales", and "Paul Watches Video2"; connections Bill→DNS, Bill→Video, Bill→Sales, Video→Store, Paul→Video2, Video2→Store; and components Bill’s Client, DNS, Video, Video2, Sales, Store]

Use correlation to disambiguate!!

• Disambiguate by correlating
– Across logs from same client
– Across clients

• Prefer simpler explanations

Page 29: Problem Diagnosis Distributed Problem Diagnosis Sherlock X-trace

Will Correlation Scale?

Page 30: Problem Diagnosis Distributed Problem Diagnosis Sherlock X-trace

Will Correlation Scale?

Microsoft Internal Network
• O(100,000) client desktops
• O(10,000) servers
• O(10,000) apps/services
• O(10,000) network devices

[Figure: network layers: Building Network, Campus Core, Corporate Core, Data Center]

Dependency Graph is Huge

Page 31: Problem Diagnosis Distributed Problem Diagnosis Sherlock X-trace

Will Correlation Scale?

Can we evaluate all combinations of component failures?

The number of fault combinations is exponential!

Impossible to compute!

Page 32: Problem Diagnosis Distributed Problem Diagnosis Sherlock X-trace

Scalable Algorithm to Correlate

Only a few faults happen concurrently

But how many is few?

Evaluate enough to cover 99.9% of faults

For the MS network, at most 2 concurrent faults: 99.9% accurate

Exponential → Polynomial

Page 33: Problem Diagnosis Distributed Problem Diagnosis Sherlock X-trace

Scalable Algorithm to Correlate

Only a few faults happen concurrently

But how many is few?

Evaluate enough to cover 99.9% of faults

For the MS network, at most 2 concurrent faults: 99.9% accurate

Only few nodes change state

Exponential → Polynomial

Page 34: Problem Diagnosis Distributed Problem Diagnosis Sherlock X-trace

Scalable Algorithm to Correlate

Only a few faults happen concurrently

But how many is few?

Evaluate enough to cover 99.9% of faults

For the MS network, at most 2 concurrent faults: 99.9% accurate

Only few nodes change state

Re-evaluate only if an ancestor changes state

Reduces the cost of evaluating a case by 30x-70x

Exponential → Polynomial
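
A minimal sketch of the "at most k concurrent faults" idea follows. It is a simplification made for illustration and not Sherlock's inference engine: components are assumed to be either up or down, edge probabilities are combined with a noisy-OR rule, `eps` is an assumed noise floor, and the graph and observations are hypothetical values modeled on the Bill/Paul example. It does not implement the "re-evaluate only if an ancestor changes state" optimization. What it does show is that bounding the number of concurrent faults turns the search from exponential to polynomial in the number of components, and that ties are broken in favor of simpler explanations.

```python
from itertools import combinations

def prob_troubled(deps, failed):
    """Noisy-OR (assumed model): P(observation is troubled | failed components)."""
    p_ok = 1.0
    for comp, p in deps.items():
        if comp in failed:
            p_ok *= 1.0 - p
    return 1.0 - p_ok

def score(hypothesis, graph, observations, eps=0.01):
    """Likelihood of all observations under one hypothesis (a set of failed components)."""
    likelihood = 1.0
    for name, troubled in observations.items():
        p = prob_troubled(graph[name], hypothesis)
        p = min(max(p, eps), 1.0 - eps)  # leave room for unexplained noise
        likelihood *= p if troubled else 1.0 - p
    return likelihood

def diagnose(graph, observations, max_faults=2):
    """Score every hypothesis with at most `max_faults` failed components."""
    components = sorted({c for deps in graph.values() for c in deps})
    candidates = []
    for k in range(max_faults + 1):
        for combo in combinations(components, k):
            candidates.append((score(set(combo), graph, observations), combo))
    # Best likelihood first; ties go to the hypothesis with fewer faults.
    candidates.sort(key=lambda sc: (-sc[0], len(sc[1])))
    return candidates

# Hypothetical edge probabilities: observation -> {component: probability}.
graph = {
    "Bill watches Video":  {"DNS": 0.1, "Video": 1.0, "Store": 1.0},
    "Bill sees Sales":     {"DNS": 0.1, "Sales": 1.0},
    "Paul watches Video2": {"DNS": 0.1, "Video2": 1.0, "Store": 1.0},
}
observations = {  # True means the user saw a problem
    "Bill watches Video": True,
    "Bill sees Sales": False,
    "Paul watches Video2": True,
}

ranked = diagnose(graph, observations, max_faults=2)
best_score, best_hypothesis = ranked[0]
```

In this example both video observations are troubled while Sales is fine. A failure of Store alone explains the observations as well as a simultaneous failure of Video and Video2, and the tie-break prefers the single-fault explanation.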

Page 35: Problem Diagnosis Distributed Problem Diagnosis Sherlock X-trace

Results

Page 36: Problem Diagnosis Distributed Problem Diagnosis Sherlock X-trace

Experimental Setup

• Evaluated on the Microsoft enterprise network

• Monitored 23 clients, 40 production servers for 3 weeks
– Clients are at MSR Redmond
– Extra host on server’s Ethernet logs packets

• Busy, operational network
– Main Intranet Web site and software distribution file server
– Load-balancing front-ends
– Many paths to the data-center

Page 37: Problem Diagnosis Distributed Problem Diagnosis Sherlock X-trace

What Do Web Dependencies in the MS Enterprise Look Like?

Page 38: Problem Diagnosis Distributed Problem Diagnosis Sherlock X-trace

What Do Web Dependencies in the MS Enterprise Look Like?

[Figure: dependency graph for "Client Accesses Portal", including the Auth. Server]


Page 40: Problem Diagnosis Distributed Problem Diagnosis Sherlock X-trace

What Do Web Dependencies in the MS Enterprise Look Like?

[Figure: dependency graphs for "Client Accesses Portal" and "Client Accesses Sales", including the Auth. Server]

Sherlock discovers complex dependencies of real apps.

Page 41: Problem Diagnosis Distributed Problem Diagnosis Sherlock X-trace

What Do File-Server Dependencies Look Like?

[Figure: dependency graph for "Client Accesses Software Distribution Server", with nodes Auth. Server, WINS, DNS, Proxy, File Server, and Backend Servers 1 to 4; edge labels: 100%, 10%, 8%, 6%, 5%, 5%, 2%, 1%, 0.3%]

Sherlock works for many client-server applications

Page 42: Problem Diagnosis Distributed Problem Diagnosis Sherlock X-trace

Sherlock Identifies Causes of Poor Performance

Dependency Graph: 2565 nodes; 358 components that can fail

[Figure: component index (y-axis) vs. time in days (x-axis)]

87% of problems localized to 16 components

Page 43: Problem Diagnosis Distributed Problem Diagnosis Sherlock X-trace

Sherlock Identifies Causes of Poor Performance

Inference Graph: 2565 nodes; 358 components that can fail

[Figure: component index (y-axis) vs. time in days (x-axis)]

Corroborated the three significant faults

Page 44: Problem Diagnosis Distributed Problem Diagnosis Sherlock X-trace

Sherlock Goes Beyond Traditional Tools

• SNMP-reported utilization on a link flagged by Sherlock
• Problems coincide with spikes

Sherlock identifies the troubled link but SNMP cannot!

Page 46: Problem Diagnosis Distributed Problem Diagnosis Sherlock X-trace

X-Trace

• X-Trace records events in a distributed execution and their causal relationship

• Events are grouped into tasks
– Well defined starting event and all that is causally related

• Each event generates a report, binding it to one or more preceding events

• Captures full happens-before relation

Page 47: Problem Diagnosis Distributed Problem Diagnosis Sherlock X-trace

X-Trace Output

• Task graph capturing task execution
– Nodes: events across layers, devices
– Edges: causal relations between events

[Figure: task graph for an HTTP request: HTTP Client, HTTP Proxy, and HTTP Server at the top layer; TCP 1 (Start, End) and TCP 2 (Start, End) below; IP and IP Router hops at the bottom]

Page 48: Problem Diagnosis Distributed Problem Diagnosis Sherlock X-trace

Basic Mechanism

• Each event is uniquely identified within a task: [TaskId, EventId]
• [TaskId, EventId] is propagated along the execution path
• For each event, create and log an X-Trace report
– Enough info to reconstruct the task graph

[Figure: the same task graph with events labeled a through n; the metadata [T, a] and [T, g] travels along the edges]

X-Trace Report
TaskID: T
EventID: g
Edge: from a, f
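
The mechanism can be sketched in a few lines. This is a toy stand-in and not the X-Trace library: the `Tracer` class, the `log_event` signature, and the `e0`, `e1`, ... event ids are hypothetical. It only shows the two ideas from the slide: every event produces a report naming the preceding events it is bound to, and the collected reports are enough to rebuild the task graph.

```python
import itertools
from dataclasses import dataclass

@dataclass
class Report:
    task_id: str
    event_id: str
    edges_from: list   # ids of the preceding events this event is bound to
    message: str = ""

class Tracer:
    """Toy stand-in for an X-Trace-style library (hypothetical API)."""

    def __init__(self, task_id):
        self.task_id = task_id
        self.reports = []
        self._ids = (f"e{i}" for i in itertools.count())

    def log_event(self, message, *preceding):
        """Create an event id, record a report, and return the id for propagation."""
        event_id = next(self._ids)
        self.reports.append(Report(self.task_id, event_id, list(preceding), message))
        return event_id

def task_graph(reports):
    """Rebuild the causal edges (cause, effect) from the collected reports."""
    return sorted((src, r.event_id) for r in reports for src in r.edges_from)

tracer = Tracer("T")
a = tracer.log_event("HTTP client sends request")
f = tracer.log_event("TCP 1 delivers data", a)
g = tracer.log_event("HTTP proxy receives request", a, f)  # edge: from a, f

edges = task_graph(tracer.reports)
```

The third report mirrors the example on the slide: event g carries edges from both a and f, so the reconstructed graph contains the edges a→f, a→g, and f→g.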

Page 49: Problem Diagnosis Distributed Problem Diagnosis Sherlock X-trace

X-Trace Library API

• Handles propagation within app
• Threads / event-based (e.g., libasync)
• Akin to a logging API:
– Main call is logEvent(message)
• Library takes care of event id creation, binding, reporting, etc.
• Implementations in C++, Java, Ruby, JavaScript

Page 50: Problem Diagnosis Distributed Problem Diagnosis Sherlock X-trace

Task Tree

• X-Trace tags all network operations resulting from a particular task with the same task identifier

• Task tree is the set of network operations connected with an initial task

• The task tree can be reconstructed after collecting the trace data from the reports

Page 51: Problem Diagnosis Distributed Problem Diagnosis Sherlock X-trace

An example of the task tree

• A simple HTTP request through a proxy


Page 52: Problem Diagnosis Distributed Problem Diagnosis Sherlock X-trace

X-Trace Components

• Data
– X-Trace metadata

• Network path
– Task tree

• Report
– Reconstruct task tree

Page 53: Problem Diagnosis Distributed Problem Diagnosis Sherlock X-trace

Propagation of X-Trace Metadata

• The propagation of X-Trace metadata through the task tree


Page 55: Problem Diagnosis Distributed Problem Diagnosis Sherlock X-trace

The X-Trace metadata

Field        Usage
Flags        Bits that specify which of the three optional components are present
TaskID       A unique integer ID
TreeInfo     ParentID, OpID, EdgeType
Destination  Specifies the address that X-Trace reports should be sent to
Options      Mechanism to accommodate future extensions
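
As an illustration of how the TaskID and TreeInfo fields allow the task tree to be rebuilt, here is a hedged sketch. The report layout (plain dictionaries), the operation names, and the edge-type labels `root`, `next`, and `down` are assumptions chosen for the example; they are not taken from the X-Trace wire format. Reports are grouped by TaskID, and each OpID is attached under its ParentID.

```python
from collections import defaultdict

def build_task_trees(reports):
    """Map TaskID -> {ParentID: [(OpID, EdgeType), ...]} from collected reports."""
    trees = defaultdict(lambda: defaultdict(list))
    for r in reports:
        trees[r["TaskID"]][r["ParentID"]].append((r["OpID"], r["EdgeType"]))
    return trees

def render(children, parent=None, depth=0):
    """Return the tree as indented lines, starting from operations with no parent."""
    lines = []
    for op_id, edge_type in children.get(parent, []):
        lines.append("  " * depth + f"{op_id} ({edge_type})")
        lines.extend(render(children, op_id, depth + 1))
    return lines

# Hypothetical reports for an HTTP request through a proxy (task 7),
# interleaved with a report from an unrelated task (task 8).
reports = [
    {"TaskID": 7, "ParentID": "http-client", "OpID": "tcp-1", "EdgeType": "down"},
    {"TaskID": 7, "ParentID": None, "OpID": "http-client", "EdgeType": "root"},
    {"TaskID": 8, "ParentID": None, "OpID": "other-client", "EdgeType": "root"},
    {"TaskID": 7, "ParentID": "http-client", "OpID": "http-proxy", "EdgeType": "next"},
    {"TaskID": 7, "ParentID": "http-proxy", "OpID": "tcp-2", "EdgeType": "down"},
    {"TaskID": 7, "ParentID": "http-proxy", "OpID": "http-server", "EdgeType": "next"},
]

tree_lines = render(build_task_trees(reports)[7])
```

Because every report carries its TaskID, reports from different tasks can arrive interleaved and out of order and still be sorted into separate trees.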


Page 56: Problem Diagnosis Distributed Problem Diagnosis Sherlock X-trace

X-Trace Report Architecture


Page 59: Problem Diagnosis Distributed Problem Diagnosis Sherlock X-trace

X-Trace-like in Google/Bing/Yahoo

• Why?
– Own large portion of the ecosystem
– Use RPC for communication
– Need to understand
  • Time for user request
  • Resource utilization by request

Page 60: Problem Diagnosis Distributed Problem Diagnosis Sherlock X-trace

Sherlock vs. X-Trace

• Overhead vs. accuracy

• Deployment issues
– Invasiveness
– Code modification

Page 61: Problem Diagnosis Distributed Problem Diagnosis Sherlock X-trace

Conclusions

• Sherlock passively infers network-wide dependencies from logs and traceroutes

• It diagnoses faults by correlating user observations

• X-Trace actively discovers network-wide dependencies