Enterprise Network Troubleshooting Nick Feamster Georgia Tech (joint with Russ Clark, Yiyi Huang, Anukool Lakhina, Manas Khadilkar, Aditi Thanekar)


Enterprise Network Troubleshooting

Nick Feamster
Georgia Tech

(joint with Russ Clark, Yiyi Huang, Anukool Lakhina, Manas Khadilkar, Aditi Thanekar)


Three Disjoint Views of the Network

• Policy: The operator’s “wish list”
• Static: What the configurations say
• Dynamic: The behavior that users witness

Policy Static Dynamic

Generation

Error Checking and Deployment

- rancid/rcc
- FIREMAN/Lumeta

- ping
- traceroute
- …

Independent analyses!


A Closer Look

• Proactive analysis
– Fault avoidance
– Policy conformance

• Reactive diagnosis
– Correcting network faults
  • Detection
  • Localization
– Active and passive measurements
– Need user’s perspective

Idea: These analyses should inform each other

Two studies

1. Routing

2. Firewalls


Catastrophic Configuration Faults

“…a glitch at a small ISP… triggered a major outage in Internet access across the country. The problem started when MAI Network Services...passed bad router information from one of its customers onto Sprint.”

-- news.com, April 25, 1997

“Microsoft's websites were offline for up to 23 hours...because of a [router] misconfiguration…it took nearly a day to determine what was wrong and undo the changes.” -- wired.com, January 25, 2001

“WorldCom Inc…suffered a widespread outage on its Internet backbone that affected roughly 20 percent of its U.S. customer base. The network problems…affected millions of computer users worldwide. A spokeswoman attributed the outage to ‘a route table issue.’” -- cnn.com, October 3, 2002

“A number of Covad customers went out from 5pm today due to, supposedly, a DDOS (distributed denial of service attack) on a key Level3 data center, which later was described as a route leak (misconfiguration).”

-- dslreports.com, February 23, 2004


Case 1: Network-Wide Routing Analysis

• Proactive routing configuration analysis

• Idea: Analyze configuration before deployment

[Figure: Configure → Detect Faults (rcc) → Deploy]

Many faults can be detected with static analysis.
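
As one illustration of a fault that static analysis can catch before deployment, the sketch below checks whether a set of routers forms a full iBGP mesh. It is not rcc's implementation. The input format (a dictionary of already-parsed neighbor statements), the router names, and the assumption that the network uses a full mesh with no route reflectors are all hypothetical.

```python
from itertools import combinations


def missing_ibgp_sessions(ibgp_neighbors):
    """Return router pairs that lack an iBGP session in at least one direction.

    ibgp_neighbors maps each router name to the set of routers it is
    configured to peer with over iBGP (hypothetical, pre-parsed input).
    """
    missing = []
    for a, b in combinations(sorted(ibgp_neighbors), 2):
        # A session needs a matching neighbor statement on both routers.
        if b not in ibgp_neighbors[a] or a not in ibgp_neighbors[b]:
            missing.append((a, b))
    return missing


# Hypothetical three-router network; r3 lacks a neighbor statement for r2.
configs = {
    "r1": {"r2", "r3"},
    "r2": {"r1", "r3"},
    "r3": {"r1"},
}
faults = missing_ibgp_sessions(configs)
print(faults)
```

The check needs only the configurations, so it can run in the "Detect Faults" step before anything is deployed.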


Operators Find Static Analysis Useful

“That’s wicked!” -- Nicolas Strina, ip-man.net

“Thanks again for a great tool.” -- Paul Piecuch, IT Manager

“...good to finally see more coverage of routing as distributed programming. From my experience, the principles of software engineering eliminate a vast majority of errors.”

-- Joe Provo, rcn.com

“I find your approach useful, it is really not fun (but critical for the health of the network) to keep track of the inconsistencies among different routers…a configuration verifier like yours can give the operator a degree of confidence that the sky won't fall on his head real soon now.”

-- Arnaud Le Tallanter, clara.net


Yes, but Surprises Happen!

• Link failures
• Node failures
• Traffic volumes shift
• Network devices “wedged”
• …

• Two problems
– Detection
– Localization


Detection: Analyze Routing Dynamics

• Idea: Routers exhibit correlated behavior

Blips across signals may be more operationally interesting than any spike in one.
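
A minimal sketch of why a small blip shared across routers can matter more than one large spike. It is not the detector used in this work: it substitutes a simple per-router z-score and a sum of squared z-scores across routers. The update counts, the router names, and both thresholds are made-up assumptions.

```python
from statistics import mean, pstdev


def zscores(series, train):
    """Standardize a series against the mean/stdev of its first `train` bins."""
    mu = mean(series[:train])
    sigma = pstdev(series[:train]) or 1.0  # avoid dividing by zero on a flat baseline
    return [(x - mu) / sigma for x in series]


def detect_bursts(counts, train, single_thresh=3.0, joint_thresh=20.0):
    """Return (single, joint): bins flagged by a per-router threshold and
    bins flagged by the joint score summed over all routers."""
    z = {router: zscores(series, train) for router, series in counts.items()}
    bins = range(train, len(next(iter(counts.values()))))
    single = [t for t in bins
              if any(abs(zr[t]) > single_thresh for zr in z.values())]
    joint = [t for t in bins
             if sum(zr[t] ** 2 for zr in z.values()) > joint_thresh]
    return single, joint


# Hypothetical routing-update counts per time bin for four routers.
baseline = [10, 12] * 4              # eight quiet bins used as the baseline
counts = {
    "r1": baseline + [13.5, 15],     # bin 9: a large spike on r1 only
    "r2": baseline + [13.5, 11],     # bin 8: a modest rise on every router
    "r3": baseline + [13.5, 11],
    "r4": baseline + [13.5, 11],
}
single, joint = detect_bursts(counts, train=8)
print("per-router threshold flags bins:", single)
print("joint score flags bins:", joint)
```

In this toy data the per-router threshold flags only bin 9 (the isolated spike), while the joint score flags only bin 8, where every router moves together but none crosses its own threshold.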


Detection: Three Types of Events

• Single-router bursts
• Correlated bursts
• Multi-router bursts

• Common
• Commonly missed using thresholds


Localization: Joint Dynamic/Static

• Which routers are “border routers” for that burst
• Topological properties of routers in the burst

Static Dynamic

Proactive Analysis

Deployment

Reactive Detection

Diagnosis/Correction
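
One way to picture the joint dynamic/static step: take the set of routers that took part in a burst (dynamic) and look them up in the topology derived from the configurations (static). The sketch below treats a "border router" as a burst member with at least one neighbor outside the burst. That reading of the term, the adjacency map, and the router names are assumptions made for illustration.

```python
def border_routers(adjacency, burst):
    """Routers in the burst that have at least one neighbor outside it."""
    burst = set(burst)
    return sorted(
        r for r in burst
        if any(n not in burst for n in adjacency.get(r, ()))
    )


def burst_degrees(adjacency, burst):
    """A simple topological property: the degree of each router in the burst."""
    return {r: len(adjacency.get(r, ())) for r in sorted(burst)}


# Hypothetical static topology (a chain a-b-c-d-e) from the configurations.
adjacency = {
    "a": {"b"},
    "b": {"a", "c"},
    "c": {"b", "d"},
    "d": {"c", "e"},
    "e": {"d"},
}
# Hypothetical output of the dynamic detection step.
burst = {"b", "c", "d"}

print(border_routers(adjacency, burst))
print(burst_degrees(adjacency, burst))
```

Here b and d sit on the edge of the affected region, which narrows where an operator would look first.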


Case 2: Firewalls

• Georgia Tech Campus Network
– Research and Administrative Network
– 180 buildings
– 130+ firewalls
– 1700+ switches
– 55000+ ports

• Problem: Availability/Reachability
– Flux in firewall, router, switch configurations
– No common authority over changes made


Specific Focus: Firewall Configuration

• Difficult to understand and audit configs

• Subject to continual modifications
– Roughly 1-2 touches per day

• Federated policy, distributed dependencies
– Each department has independent policies
– Local changes may affect global behavior
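
A minimal sketch of how a local change can affect global behavior, assuming first-match rule evaluation with a default deny and a path that crosses two firewalls. The rule format, addresses, and rule sets are hypothetical and do not come from the Georgia Tech configurations.

```python
from ipaddress import ip_address, ip_network


def allowed(rules, src, dst, port):
    """First-match evaluation; default deny when no rule matches.

    Each rule is (action, src_net, dst_net, port); port None means any port.
    """
    for action, src_net, dst_net, rule_port in rules:
        if (ip_address(src) in ip_network(src_net)
                and ip_address(dst) in ip_network(dst_net)
                and rule_port in (None, port)):
            return action == "allow"
    return False


def reachable(path, src, dst, port):
    """A flow gets through only if every firewall on the path allows it."""
    return all(allowed(rules, src, dst, port) for rules in path)


# Hypothetical campus-wide and departmental rule sets.
campus = [("allow", "0.0.0.0/0", "10.2.0.0/16", 443)]
dept_before = [("allow", "10.1.0.0/16", "10.2.0.5/32", 443)]
# A local edit adds a broad deny ahead of the existing allow rule.
dept_after = [("deny", "10.1.0.0/16", "10.2.0.0/24", None)] + dept_before

flow = ("10.1.3.7", "10.2.0.5", 443)
print(reachable([campus, dept_before], *flow))
print(reachable([campus, dept_after], *flow))
```

The campus rules never change, yet the flow stops working after the departmental edit. A check like this has to consider every firewall on the path rather than auditing each one in isolation.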


(Immediate) Open Issues

• Reachability and reliability of controller

• Service-level probes
– Diagnostic tools != Service-level “Happiness”

• Policy conformance