Problem Diagnosis: Distributed Problem Diagnosis, Sherlock, X-Trace


Problem Diagnosis

• Distributed Problem Diagnosis

• Sherlock

• X-Trace

Troubleshooting Networked Systems

• Hard to develop, debug, deploy, troubleshoot
• No standard way to integrate debugging, monitoring, diagnostics

Status quo: device centric

[Slide figure: the same request scattered across per-device logs. A firewall emits syslog entries, the load balancer logs dispatches (including a "[crit] Server s3 down" entry), Web 1 and Web 2 each keep their own access logs, and the database logs raw SQL statements ("LOG: statement: SELECT ..."). Nothing ties the entries together.]

Firewall | Load Balancer | Web 1 | Web 2 | Database

Status quo: device centric

• Determining paths:
  – Join logs on time and ad-hoc identifiers
• Relies on:
  – well-synchronized clocks
  – extensive application knowledge
• Requires all operations to be logged to guarantee complete paths
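The fragility of the time-based join can be seen in a small sketch. The log entries, the window size, and the join rule below are all hypothetical; real device logs each have their own format, clock, and identifiers:

```python
from datetime import datetime, timedelta

# Hypothetical log entries: each device keeps its own (timestamp, message)
# records with no shared request identifier.
firewall_log = [(datetime(2006, 8, 20, 9, 13, 32), "allow 65.54.188.26")]
web_log = [(datetime(2006, 8, 20, 9, 13, 33), "GET /gallery 200")]

def join_on_time(log_a, log_b, window=timedelta(seconds=2)):
    """Correlate entries whose timestamps fall within `window` of each other.

    This only recovers the request path if clocks are well synchronized and
    traffic is sparse enough for the window to be unambiguous, which is
    exactly the weakness of the device-centric approach.
    """
    pairs = []
    for t_a, msg_a in log_a:
        for t_b, msg_b in log_b:
            if abs(t_a - t_b) <= window:
                pairs.append((msg_a, msg_b))
    return pairs

print(join_on_time(firewall_log, web_log))
# -> [('allow 65.54.188.26', 'GET /gallery 200')]
```

With busy logs, many unrelated entries fall inside the same window, so the join turns ambiguous; that is why complete paths require every operation to be logged.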

Examples

[Slide figures, built up across several slides: a request path flowing from the User through a DNS Server and a Proxy to a Web Server.]

Approaches to Diagnosis

• Passively learn the relationships
  – Infer problems as deviations from the norm
• Actively instrument the stack to learn relationships
  – Infer problems as deviations from the norm

Sherlock – Diagnosing Problems in the Enterprise

Srikanth Kandula

Well-Managed Enterprises Still Unreliable

[Figure: CDF of Web-server response time. X-axis: response time (ms), log scale from 10 to 10000; y-axis: fraction of requests, 0 to 0.1. Annotations: 85% Normal, 10% Troubled, 0.7% Down.]

10% of responses take up to 10x longer than normal

How do we manage evolving enterprise networks?

Instead of looking at the nitty-gritty of individual components, use an end-to-end approach that focuses on user problems

Sherlock

Challenges for the End-to-End Approach

• Don’t know what the user’s performance depends on
  – Dependencies are distributed
  – Dependencies are non-deterministic
• Don’t know which dependency is causing the problem
  – Server CPU at 70%, link dropped 10 packets, but which affected the user?

[Slide figure: an example Web connection involving a Client, DNS, an Auth. Server, a Web Server, and a SQL Backend.]

Sherlock’s Contributions

• Passively infers dependencies from logs
• Builds a unified dependency graph incorporating network, server, and application dependencies
• Diagnoses user problems in the enterprise
• Deployed in a part of the Microsoft enterprise

Sherlock’s Architecture

[Slide figure: observations from clients and servers feed the inference engine.]

User observations (Web1: 1000 ms, Web2: 30 ms, File1: timeout)
+ network dependency graph
→ inference engine
= list of troubled components

Sherlock works for various client-server applications (e.g., a video server backed by a data store and DNS).

How do you automatically learn such distributed dependencies?

Strawman: Instrument all applications and libraries. Not practical.

Sherlock instead exploits timing information:

• If the client talks to C within a short gap t whenever it talks to B, the connections may be dependent.
• But if the client talks to B constantly, B precedes every access to C anyway: a false dependence.
• So: dependent iff t << B’s inter-access time, as long as the co-occurrence happens with probability higher than chance.
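A toy version of this timing rule, with invented access times and thresholds (Sherlock's actual statistical test is more careful than this sketch):

```python
def infer_dependence(times_b, times_c, gap=0.05, chance=0.5):
    """Toy dependency test: "talking to C depends on talking to B" when an
    access to C follows an access to B within `gap` seconds, this happens
    for more than a `chance` fraction of C's accesses, and the gap is much
    smaller than B's inter-access time (so it is unlikely by coincidence).
    """
    if len(times_b) < 2 or not times_c:
        return False
    inter_access = (times_b[-1] - times_b[0]) / (len(times_b) - 1)
    follows = sum(
        1 for t_c in times_c
        if any(0 <= t_c - t_b <= gap for t_b in times_b)
    )
    return gap < inter_access and follows / len(times_c) > chance

# DNS lookups (B) closely followed by video-store connections (C):
b = [0.0, 10.0, 20.0]
c = [0.02, 10.03, 20.01]
print(infer_dependence(b, c))      # -> True
print(infer_dependence(b, [5.0]))  # -> False: no access to B just before
```

Note how the `gap < inter_access` guard implements the "t << inter-access time" condition: if B were accessed every 0.05 s, B would precede every access to C by chance alone.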

Sherlock’s Algorithm to Infer Dependencies

1. Infer dependent connections from timing
2. Infer topology from traceroutes & configurations

[Slide figure: the resulting dependency graph for "Bill watches video": Bill’s client depends on DNS and on the video server, which in turn depends on the video store.]

• Works with legacy applications
• Adapts to changing conditions

But hard dependencies are not enough…

If Bill caches the server’s IP, DNS can be down and Bill still gets the video. We need probabilities: Sherlock uses the frequency with which a dependence occurs in the logs as its edge probability (e.g., p1 = 10% for the DNS edge, p2 = 100% for the video-server edge).

How do we use the dependency graph to diagnose user problems?

[Slide figure: the "Bill watches video" graph again, with several components along the path that could explain a slow request.]

Which components caused the problem? Need to disambiguate!

Diagnosing User Problems

[Slide figure: overlapping dependency graphs for "Bill watches Video", "Bill sees Sales", and "Paul watches Video2", which share components such as DNS and the store.]

Which components caused the problem? Use correlation to disambiguate!

• Disambiguate by correlating
  – Across logs from the same client
  – Across clients
• Prefer simpler explanations
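One way to picture the correlation step is as scoring candidate fault sets against the user observations. The components, requests, and scoring rule below are illustrative inventions, not Sherlock's actual (probabilistic) inference:

```python
def score(hypothesis, observations, dependencies):
    """Reward a fault hypothesis (a set of failed components) for every
    observation it explains and penalize it for every one it contradicts:
    a troubled request should touch a failed component, a healthy one
    should not."""
    total = 0
    for request, troubled in observations.items():
        touches_fault = bool(dependencies[request] & hypothesis)
        total += 1 if touches_fault == troubled else -1
    return total

# Invented example in the spirit of the slides:
deps = {
    "Bill->Video":  {"DNS", "VideoServer", "Store"},
    "Bill->Sales":  {"DNS", "SalesServer"},
    "Paul->Video2": {"DNS", "Video2Server", "Store"},
}
obs = {"Bill->Video": True, "Bill->Sales": False, "Paul->Video2": True}

# Among single-fault hypotheses, only the shared Store explains both
# troubled requests without contradicting the healthy one; DNS would
# falsely implicate Bill's healthy Sales request.
candidates = [{"DNS"}, {"Store"}, {"VideoServer"}]
best = max(candidates, key=lambda h: score(h, obs, deps))
print(best)  # -> {'Store'}
```

"Prefer simpler explanations" shows up here as searching small hypotheses (single faults) before larger ones.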

Will Correlation Scale?

Microsoft internal network (spanning building networks, campus core, corporate core, and data center):
• O(100,000) client desktops
• O(10,000) servers
• O(10,000) apps/services
• O(10,000) network devices

The dependency graph is huge. Can we evaluate all combinations of component failures? The number of fault combinations is exponential: impossible to compute!

Scalable Algorithm to Correlate

• Only a few faults happen concurrently
  – But how many is few? Evaluate enough to cover 99.9% of faults.
  – For the MS network, assuming at most 2 concurrent faults is 99.9% accurate.
• Only a few nodes change state
  – Re-evaluate a node only if an ancestor changes state.
  – Reduces the cost of evaluating a case by 30x-70x.

Exponential → Polynomial
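Restricting hypotheses to at most two concurrent faults is what turns the exponential search polynomial. A sketch of the enumeration (component names are placeholders):

```python
from itertools import combinations

def candidate_fault_sets(components, max_concurrent=2):
    """Yield every fault hypothesis with at most `max_concurrent`
    simultaneous failures: O(n^2) hypotheses for max_concurrent=2,
    instead of the 2^n subsets of all components."""
    for k in range(1, max_concurrent + 1):
        for combo in combinations(components, k):
            yield set(combo)

components = [f"C{i}" for i in range(358)]  # 358 failable components
n = sum(1 for _ in candidate_fault_sets(components))
print(n)  # -> 64261 (358 singletons + 63903 pairs), vs. 2**358 subsets
```

About 64,000 hypotheses is easily tractable, and the ancestor-change optimization further avoids re-scoring most of the graph for each one.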

Results

Experimental Setup

• Evaluated on the Microsoft enterprise network

• Monitored 23 clients and 40 production servers for 3 weeks
  – Clients are at MSR Redmond
  – An extra host on each server’s Ethernet logs packets

• Busy, operational network
  – Main intranet Web site and software-distribution file server
  – Load-balancing front-ends
  – Many paths to the data center

What Do Web Dependencies in the MS Enterprise Look Like?

[Slide figures, built up across several slides: the dependency graphs for "Client accesses Portal" and "Client accesses Sales", including shared components such as the Auth. Server.]

Sherlock discovers complex dependencies of real apps.

What Do File-Server Dependencies Look Like?

[Slide figure: "Client accesses software-distribution server": the client depends on the proxy/file server (100%) and, with lower probability, on the Auth. Server, WINS, DNS, and four backend servers, with edge probabilities ranging from 10% down to 0.3%.]

Sherlock works for many client-server applications

Dependency graph: 2565 nodes; 358 components that can fail

Sherlock Identifies Causes of Poor Performance

Inference graph: 2565 nodes; 358 components that can fail

[Figure: component index vs. time (days), marking the components Sherlock blames over the 3-week trace.]

87% of problems localized to 16 components

Sherlock Goes Beyond Traditional Tools

• Corroborated the three significant faults
• SNMP-reported utilization on a link flagged by Sherlock
• Problems coincide with utilization spikes

Sherlock identifies the troubled link, but SNMP cannot!

X-Trace

• X-Trace records events in a distributed execution and their causal relationships
• Events are grouped into tasks
  – A well-defined starting event and everything causally related to it
• Each event generates a report, binding it to one or more preceding events
• Captures the full happens-before relation

X-Trace Output

• Task graph capturing the task execution
  – Nodes: events across layers and devices
  – Edges: causal relations between events

[Figure: task graph for an HTTP request: HTTP Client → HTTP Proxy → HTTP Server, carried over two TCP connections (TCP 1 start/end, TCP 2 start/end), each in turn traversing IP hops through IP routers.]

Basic Mechanism

• Each event is uniquely identified within a task: [TaskId, EventId]
• [TaskId, EventId] is propagated along the execution path
• For each event, create and log an X-Trace report
  – Enough info to reconstruct the task graph

[Figure: the same HTTP task graph with events labeled a through n; e.g., event g at the proxy is bound to its predecessors a and f.]

X-Trace Report
  TaskID: T
  EventID: g
  Edge: from a, f

X-Trace Library API

• Handles propagation within the app
• Supports threads and event-based code (e.g., libasync)
• Akin to a logging API:
  – Main call is logEvent(message)
• Library takes care of event-id creation, binding, reporting, etc.
• Implementations in C++, Java, Ruby, Javascript
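A toy sketch of what such a logEvent-style API might look like. The class, method names, and report format below are invented for illustration and are not the real X-Trace library interface:

```python
import secrets

class XTraceLogger:
    """Toy logEvent-style API: the library mints event ids, binds each
    event to its predecessors, and emits a report per event."""

    def __init__(self, task_id):
        self.task_id = task_id
        self.last_event = None   # stands in for propagated [TaskId, EventId]
        self.reports = []        # stands in for the reporting channel

    def log_event(self, message, extra_parents=()):
        event_id = secrets.token_hex(4)  # library creates the event id
        parents = [self.last_event] if self.last_event else []
        parents += list(extra_parents)   # e.g. a joining edge from another flow
        self.reports.append({
            "TaskID": self.task_id,
            "EventID": event_id,
            "Edges": parents,            # happens-before predecessors
            "Message": message,
        })
        self.last_event = event_id
        return event_id

log = XTraceLogger(task_id="T")
a = log.log_event("HTTP client sends request")
g = log.log_event("proxy receives request")  # bound to event `a`
print(log.reports[1]["Edges"] == [a])  # -> True
```

The point of the API shape is that application code only calls `log_event(message)`; id creation, edge binding, and reporting stay inside the library.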

Task Tree

• X-Trace tags all network operations resulting from a particular task with the same task identifier
• The task tree is the set of network operations connected with an initial task
• The task tree can be reconstructed after collecting the trace data from reports
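Reconstruction is straightforward once reports are collected: keep the reports for one TaskID and follow each report's recorded predecessors. A sketch, using a made-up report format:

```python
def build_task_graph(reports, task_id):
    """Rebuild one task's graph offline: nodes are event ids, edges come
    from each report's recorded predecessor list."""
    nodes, edges = set(), []
    for report in reports:
        if report["TaskID"] != task_id:
            continue  # reports from other tasks are ignored
        nodes.add(report["EventID"])
        for parent in report["Edges"]:
            edges.append((parent, report["EventID"]))
    return nodes, edges

# Invented reports for a task T with a join at event g:
reports = [
    {"TaskID": "T", "EventID": "a", "Edges": []},
    {"TaskID": "T", "EventID": "f", "Edges": ["a"]},
    {"TaskID": "T", "EventID": "g", "Edges": ["a", "f"]},  # joining event
]
nodes, edges = build_task_graph(reports, "T")
print(sorted(nodes), edges)
# -> ['a', 'f', 'g'] [('a', 'f'), ('a', 'g'), ('f', 'g')]
```

Because each report carries enough context on its own, reports can arrive late or out of order and the graph still assembles correctly.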

An example of the task tree

• A simple HTTP request through a proxy


X-Trace Components

• Data: X-Trace metadata
• Network path: task tree
• Report: reconstruct the task tree

Propagation of X-Trace Metadata

• The propagation of X-Trace metadata through the task tree

The X-Trace Metadata

Field        Usage
-----------  -----------------------------------------------------------
Flags        Bits that specify which of the three optional components
             are present
TaskID       A unique integer ID
TreeInfo     ParentID, OpID, EdgeType
Destination  The address that X-Trace reports should be sent to
Options      Mechanism to accommodate future extensions
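The same fields as a plain record, purely for illustration (the real metadata is a compact binary encoding, and the example values below are hypothetical):

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class XTraceMetadata:
    """The fields from the table above; types are simplified."""
    flags: int                         # which optional components are present
    task_id: int                       # unique task identifier
    tree_info: Optional[Tuple] = None  # (ParentID, OpID, EdgeType)
    destination: Optional[str] = None  # where reports should be sent
    options: bytes = b""               # room for future extensions

# Hypothetical values for a single event:
md = XTraceMetadata(flags=0b011, task_id=42,
                    tree_info=("a", "g", "NEXT"),
                    destination="reports.example.net:7831")
print(md.task_id, md.tree_info)
```

Making TreeInfo, Destination, and Options optional mirrors the role of the Flags field: a node only pays for the components it actually carries.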

X-Trace Report Architecture

[Figure: the report collection architecture, shown across three build slides.]

X-Trace-like in Google/Bing/Yahoo

• Why?
  – Own a large portion of the ecosystem
  – Use RPC for communication
  – Need to understand:
    • Time for a user request
    • Resource utilization by request

Sherlock vs. X-Trace

• Overhead vs. accuracy

• Deployment issues
  – Invasiveness
  – Code modification

Conclusions

• Sherlock passively infers network-wide dependencies from logs and traceroutes

• It diagnoses faults by correlating user observations

• X-Trace actively discovers network-wide dependencies
