32
Hot Potatoes Heat Up BGP Routing Jennifer Rexford AT&T Labs—Research http://www.research.att.com /~jrex Joint work with Renata Teixeira, Aman Shaikh, and Timothy Griffin

Hot Potatoes Heat Up BGP Routing Jennifer Rexford AT&T Labs—Research jrex Joint work with Renata Teixeira, Aman Shaikh, and

  • View
    215

  • Download
    0

Embed Size (px)

Citation preview

Hot Potatoes Heat Up BGP Routing

Jennifer RexfordAT&T Labs—Research

http://www.research.att.com/~jrex

Joint work with Renata Teixeira, Aman Shaikh, and Timothy Griffin

2

Outline

Internet routing Interdomain and intradomain routing Coupling due to hot-potato routing

Measuring hot-potato routing Measuring the two routing protocols Correlating the two data streams

Performance evaluation Characterization on AT&T’s network Implications on network practices

Conclusions and future directions

3

Autonomous Systems

1

2

3

4

5

67

ClientWeb server

AS path: 6, 5, 4, 3, 2, 1

4

Interdomain Routing (BGP)

Border Gateway Protocol (BGP) IP prefix: block of destination IP addresses AS path: sequence of ASes along the path

Policy configuration by the operator Path selection: which of the paths to use? Path export: which neighbors to tell?

1 2 3

12.34.158.5

“I can reach 12.34.158.0/24”

“I can reach 12.34.158.0/24 via AS 1”

5

Intradomain Routing (IGP) Interior Gateway Protocol (OSPF and IS-IS)

Shortest path routing based on link weights Routers flood link-state information to each other Routers compute “next hop” to reach other routers

Weights configuration by the operator Simple heuristics: link capacity or physical distance Traffic engineering: tuning link weights to the traffic

32

2

1

13

1

4

5

3

6

Two-Level Internet Routing

Hierarchical routing Intra-domain

Metric based Inter-domain

Reachability and policy

Design principles Scalability Isolation Simplicity of reasoning

AS 1

AS 2

AS 3

intra-domainrouting (IGP)

inter-domainrouting (BGP)

Autonomous system (AS) = network with unified administrative routing policy (ex. AT&T, Sprint, UCSD)

7

packet to dst

Motivation

dst

XISP network

Y

10

911

Z

Hot-potato routing = ISPs policy of routing to closest exit point when there is more than one route to destination

Consequences:Routers CPU overloadTransient forwarding instabilityTraffic shiftInter-domain routing changes

ISP networkfailureplanned maintenancetraffic engineering

Routes to thousandsof destinations switch exit point!!!

8

BGP Decision Process

Ignore if exit point unreachable Highest local preference Lowest AS path length Lowest origin type Lowest MED (with same next hop AS) Lowest IGP cost to next hop Lowest router ID of BGP speaker

Hot potato

9

Outline

Internet routing Interdomain and intradomain routing Coupling due to hot-potato routing

Measuring hot-potato routing Measuring the two routing protocols Correlating the two data streams

Performance evaluation Characterization on AT&T’s network Implications on network practices

Conclusions and future directions

10

Why is This So Darn Hard?

Noisy signals Single event can cause multiple IGP messages Large amount of background BGP updates Multiple messages for single BGP routing change

Protocol implementation Routing protocols provide limited information High complexity of BGP due to configurable policies Many vendor-specific details, such as timers

Monitoring limitations Cannot collect data from every vantage point Delays in delivering data to the collection machine Time synchronization across multiple collectors

11

Our Approach

Collect measurement of both protocols BGP monitor and OSPF monitor

Correlate the two streams of data Match BGP updates with OSPF events

Analyze the protocol interaction

X

Y

Z

M

AT&T backboneOSPF messagesBGP updates

12

Challenges

Lack of information on routing messages Routing protocols are designed to determine a path

between two hosts, but not to give reason

X

Y

Z

M

dst

Example 1: BGP update caused by OSPF

BGP: announcement: dst, XOSPF: CHG: X, 8

9

10

dst, Ydst, X

8

13

M

Challenges

Lack of information on routing messages Routing protocols are designed to determine a path

between two hosts, but not to give reason

X

Y

Z

dst

Example 2: BGP update NOT caused by OSPF

9

10

dst, Ydst, X

BGP: announcement: dst, X

14

M

Challenges

Lack of information on routing messages Routing protocols are designed to determine a path

between two hosts, but not to give reason

X

Y

Z

dst

Example 2: BGP update NOT caused by OSPF

9

10

dst, Ydst, X

8

BGP: announcement: dst, XOSPF: CHG: X, 8

15

Heuristic for Matching

Classify BGP updates by possible OSPF causes

Transform stream of OSPFmessages into routing changes

link failurerefresh weight change

chg cost

del

chg cost

Match BGP updateswith OSPF events thathappen close in time

Stream of OSPF messages

Stream of BGP updates

time

16

Pre-processing OSPF LSAs

Transform OSPF messages into routing changes from a router’s perspective

MX

Y

Z

1

1

12 1

22 10LSA weight change, 10

10LSA weight change, 10

X 5Y 4

CHG Y, 7X 5Y 7

LSA deleteLSA add, 1

DEL X

Y 7

ADD X, 5

X 5 Y 7

OSPF routing changes:

2

1

17

Classifying BGP Updates

BGP update from Z

Announcement of dst, X Withdrawal of dst, Y

Replacement of routeto dst

different route through Y

ADD X? DEL Y?

DEL Y?

ADD X?

CHG X or CHG Y?

Y X

X

Y

Z

dst

M

18

Classifying BGP Updates

route via X is betterroute via X is worse

routes are equally goodADD X? DEL Y?

DEL Y?

ADD X?CHG X, CHG Y?

Y X

X

Y

Z

dst

M

19

Outline

Internet routing Interdomain and intradomain routing Coupling due to hot-potato routing

Measuring hot-potato routing Measuring the two routing protocols Correlating the two data streams

Performance evaluation Characterization on AT&T’s network Implications on network practices

Conclusions and future directions

20

Time Lag

OSPF-triggered BGP updatesfor June 25th, 2003

Cum

ulat

ive

%B

GP

upd

ates

time BGP – time OSPF (seconds)

~15% of OSPF-triggeredBGP updates in a day

most OSPF-triggered BGP updates lagfrom 10 to 60 seconds

21

Results for June 2003

High variability according to location and day Impact on external BGP measurements and

customers

One LSA can have a big impact

location min max days > 10%

close to peers 0% 3.76% 0

between peers 0% 25.87% 5

location no impact prefixes impacted

close to peers 97.53% less than 1%

between peers 97.17% 55%

22

BGP Updates Over PrefixesC

umul

ativ

e %

BG

P u

pdat

es

% prefixes

Non-OSPF triggeredAll

OSPF-triggered

OSPF-triggered BGP updatesaffects ~50% of prefixesuniformly

prefixes with onlyone exit point

Contrast with non-OSPFtriggered BGP updates

23

Operational Implications

Forwarding plane convergence Accuracy of active measurements

Router proximity to exit points Likelihood of hot-potato routing changes

Cost in/out of links during maintenance Avoid triggering BGP routing changes

More complexity with route reflectors Longer delays and more BGP messages

24

Forwarding Convergence

R1R2

dst

10

100 10111

R2 starts using R1 to reach dstScan process

runs in R2

R1’s scan processcan take up to

60 seconds to run

Packets to dst may be caught in a loop

for 60 seconds!

25

Measurement Accuracy

Measurements of customer experience Probe machines have just one exit point!

R1R2

dst

10

100 111

loop to reach dst

W1 W2

26

What to do?

Increase estimate for forwarding convergence For destinations/customers with multiple exit points

Extensions to measurement infrastructure Multiple access links for a probe machine Multiple probe machines with same address

Better BGP implementation on the router Decrease scan timer (maybe too much overhead?) Event-driven IGP/BGP implementation

27

Avoid Equal-distance Exits

Z

10

10X

Y

Z

1000

1X

Y

dst dst

Small changes will make Z switch exit points to dst

More robust to intra-domainrouting changes

28

Careful Cost in/out Links

Z

X

Y

55

1010

10

dst

4

100

Traffic is more predictableFaster convergenceLess impact on neighbors

29

iBGP Route Reflectors

20

10

8

9

X

Y

Z

Wdst

dst Y, 18dst W, 20

11

dst Y, 21dst W, 20

Announcement X dst X,19dst W, 20

Scalability trade-off: Less BGP state

vs. Number of BGP updates from Z and longer convergence delay

30

Ongoing Work

Reduction of false matches Compare with conservative analysis (lower bound) Cluster BGP updates and IGP LSAs in time

Black-box testing of the routers Scan timer and its effects (forwarding loops) Vendor interactions (with Cisco)

Impact of the IGP-triggered BGP updates Changes in the flow of traffic Externally visible BGP routing changes

Modeling the protocol interaction Understand impact of router location

31

Future Directions Improving isolation (cooling those potatoes!)

Operational practices: preventing interactions Protocol design: stronger decoupling Network design: internal topology/architecture

Extending our monitoring architecture Data from multiple vantage points Real time correlation of data streams Automatic generation of alarms/reports

Better route monitoring Router support for special monitoring sessions Protocol extensions to help in troubleshooting Diagnose problems, and read the router’s mind!

32

Exporting Routing Instability

Z

X

Y

Z

X

Y

dst

dst

Announcement

No change => no announcement