23
1 Enterprise Networks under Stress

1 Enterprise Networks under Stress. 2 = 60% growth/year Vern Paxson, ICIR, “Measuring Adversaries”

Embed Size (px)

Citation preview

1

Enterprise Networksunder Stress

2

= 60% growth/year

Vern Paxson, ICIR, “Measuring Adversaries”

3

= 596% growth/year

Vern Paxson, ICIR, “Measuring Adversaries”

“Background”Radiation

--Dominates

traffic in manyof today’snetworks

4

Some Observations

• Internet reasonably robust to point problems like link and router failures (“fail stop”)

• Successfully operates under a wide range of loading conditions and over diverse technologies

• During 9/11/01, Internet worked well, under heavy traffic conditions and with some major facilities failures in Lower Manhattan

5

The Problem

• Networks awash in illegitimate traffic: port scans, propagating worms, p2p file swapping

– Legitimate traffic starved for bandwidth– Essential network services (e.g., DNS, NFS)

compromised

• Needed: better network management of services/applications to achieve good performance and resilience even in the face of network stress

– Self-aware network environment– Observing and responding to traffic changes– While sustaining the ability to control the network

6

From the Frontlines

• Berkeley Campus Network– Unanticipated traffic surges render the network

unmanageable (and may cause routers to fail)– Denial of service attacks, latest worm, or the newest file

sharing protocol largely indistinguishable– In-band control channel is starved, making it difficult to

manage and recover the network

• Berkeley EECS Department Network (12/04)– Suspected denial-of-service attack against DNS– Poorly implemented/configured spam appliance adds to

DNS overload– Traffic surges render it impossible to access Web or

mount file systems

• Network problems contribute to brittleness of distributed systems

7

Why and HowNetworks Fail

• Complex phenomenology of failure• Traffic surges break enterprise networks• “Unexpected” traffic as deadly as high net

utilization– Cisco Express Forwarding: random IP addresses --> flood route

cache --> force traffic thru slow path --> high CPU utilization --> dropped router table updates

– Route Summarization: powerful misconfigured peer overwhelms weaker peer with too many router table entries

– SNMP DoS attack: overwhelm SNMP ports on routers– DNS attack: response-response loops in DNS queries generate

traffic overload

8

TechnologyTrends

• Integration of servers, storage, switching, and routing – Blade Servers, Stateful Routers,

Inspection-and-Action Boxes (iBoxes)

• Packet flow manipulations at L4-L7– Inspection/segregation/accounting of traffic– Packet marking/annotating

• Building blocks for network protection– Pervasive observation and statistics collection– Analysis, model extraction, statistical correlation and causality

testing– Actions for load balancing and traffic shaping

Load Balancing Traffic Shaping

9

R

R

DistributionTier

E

E

E

S

S

II

R IA

E

InternetEdge

AccessEdge

ServerEdge

SpamAppliance

Primary &Secondary

DNSServers

ISS

MailServer

S

Scenario: Traffic Surge Inhibiting Network Services

• DNS Server swamped by excessive request traffic– Observe: DNS time outs, Web access traffic slowed, but also

higher than normal mail delivery latency implying busy server edge (correlation between Mail Server and DNS Server utilization?)

– Root Cause: High DNS request rates generated by Spam Appliance triggered by mail surge

10

Scenario Continued

• How Diagnosed?– I-S detects high link utilization but abnormally high DNS traffic– Stats from I-I: high mail traffic, low outgoing web traffic, in

traffic high but link utilization not high– Stats from I-A: lower web traffic, no unusual mail origination– Problem localized to Server edge, but visibility limited

R

R

DistributionTier

E

E

E

S

S

II

R IA

E

InternetEdge

AccessEdge

ServerEdge

SpamAppliance

Primary &Secondary

DNSServers

ISS

MailServer

S

11

Scenario Continued

• Possible Action Responses– Experiment: Redirect local DNS requests to Secondary DNS

server: if these complete, can infer the server is the problem, not the network

– Throttle: Due to MS-DNS correlation, block/slow email traffic at Server Edge: should expect reduced DNS server utilization

R

R

DistributionTier

E

E

E

S

S

II

R IA

E

InternetEdge

AccessEdge

ServerEdge

SpamAppliance

Primary &Secondary

DNSServers

ISS

MailServer

S

12

InternetEdge

PC

AccessEdge

MS

FSSpamFilter

DNS

Server Edge

Scenario

DistributionTier

13

ObservedOperational Problems

• User visible services:– NFS mount operations time out– Web access also fails intermittently due to time outs

• Failure causes:– Independent or correlated failures?– Problem in access, server, or Internet edge?– File server failure?– Internet denial of service attack?

14

Network Dashboard

b/wconsumed

time

Gentle risein ingressb/w

FS CPUutilization

time

No unusualpattern

MS CPUutilization

time

Mail trafficgrowing

DNS CPUutilization

time

Unusualstep jump/DNS xactrates

AccessEdge b/wconsumed

time

Declinein accessedge b/w

InWeb

OutWeb

Email

15

Network Dashboard

b/wconsumed

time

FS CPUutilization

time

MS CPUutilization

time

DNS CPUutilization

time

AccessEdge b/wconsumed

time

Gentle risein ingressb/w

No unusualpattern

Mail trafficgrowing

Unusualstep jump/DNS xactrates

Declinein accessedge b/w

InWeb

OutWeb

Email

CERT Advisory!DNS Attack!

16

Observed Correlations

• Mail traffic up• MS CPU utilization up

– Service time up, service load up, service queue longer, latency longer

• DNS CPU utilization up– Service time up, request rate up,

latency up

• Access edge b/w down

Causality no surprise!

How doesmail trafficcause DNSload?

17

Run ExperimentShape Mail Traffic

MS CPUutilization

time

Mail trafficlimited

DNS CPUutilization

time

DNS down

AccessEdge b/wconsumed

time

Accessedge b/wreturns

Root cause:

Spam appliance --> DNS lookups to verify sender domains;

Spam attack hammers internal DNS, degrading other services: NFS, Web

InWeb

OutWeb

Email

18

Policies and ActionsRestore the Network

• Shape mail traffic– Mail delay acceptable to users?– Can’t do this forever unless mail is filtered at the

Internet edge

• Load balance DNS services– Increase resources faster than incoming mail rate– Actually done: dedicated DNS server for Spam

appliance

• Other actions? Traffic priority, QoS knobs

19

Analysis

• Root causes difficult to diagnose– Transitive and hidden causes

• Key is pervasive observation– iBoxes provide the needed infrastructure– Observations to identify correlations– Perform active experiments to “suggest” causality

20

Many Challenges

• Policy specification: how to express? Service Level Objectives?

• Experimental plan– Distributed vs. centralized development– Controlling the experiments … when the network is stressed– Sequencing matters, to reveal “hidden” causes

• Active experiments– Making things worse before they get better– Stability, convergence issues

• Actions– Beyond shaping of classified flows, load balancing, server

scaling?

21

Implications for Network Operations and Management

• Processing-in-the-Network is real• Enables pervasive monitoring and actions• Statistical models to discover correlations

and to detect anomalies• Automated experiments to reveal causality• Policies drive actions to reduce network

stress

2222

Datacenter Networks

23

Networks Under Stress