Upload
nirmala-last
View
404
Download
3
Tags:
Embed Size (px)
Citation preview
1
BGP Anomaly Detection in an ISP
Jian Wu (U. Michigan)Z. Morley Mao (U. Michigan) Jennifer Rexford (Princeton)
Jia Wang (AT&T Labs)
http://www.cs.princeton.edu/~jrex/papers/nsdi05-jian.pdf
2
Goal
Identify important anomalies Lost reachability Persistent flapping Large traffic shifts
Contributions:
•Build a tool to identify a small number of important routing disruptions from a large volume of raw BGP updates in real time.
•Use the tool to characterize routing disruptions in an operational network
3
Capturing Routing Changes
CBRCBR
CPEBGP Monit
or
CBRCBR
CBRCBR
CBRCBR
CBRCBR
CBRCBR
iBGP
iBGP
iBG
P
iBG
P
iBGP
iBGP
eBGP
eBG
P
eBGP
eBG
P
eBGP
eBGP
UpdatesUpdates
Best routesBest ro
utes
Large operational network(8/16/2004 – 10/10-2004)
4
Challenges
Large volume of BGP updates Millions daily, very bursty Too much for an operator to manage
Different than root-cause analysis Identify changes and their effects Focus on actionable events Diagnose causes only in/near the AS
5
System Architecture
Event Classification
Event Classification
“Typed”Events
EEBR
EEBR
EEBR
BGP Updates
(106)
BGP Update Grouping
BGP Update Grouping
Events
Persistent Flapping Prefixes
(101)
(105)
EventCorrelation
EventCorrelation
Clusters
Frequent Flapping Prefixes
(103)
(101)
Traffic ImpactPrediction
Traffic ImpactPrediction
EEBREEBR EEBR
LargeDisruptions
Netflow Data
(101)
6
Grouping BGP Update into Events
Challenge: A single routing change leads to multiple update messages affects routing decisions at multiple routers
Solution: •Group all updates for a prefix with inter-arrival < 70 seconds•Flag prefixes with changes lasting > 10 minutes.
BGP Update Grouping
BGP Update Grouping
EEBR
EEBR
EEBR
BGP Updates
Events
Persistent Flapping Prefixes
7
Grouping Thresholds
Based on data analysis and our understanding of BGP
Event timeout: 70 seconds 2 * MRAI timer + 10 seconds 98% inter-arrival time < 70 seconds
Convergence timeout: 10 minutes BGP usually converges within minutes 99.9% events < 10 minutes
8
Persistent Flapping Prefixes
Causes of persistent flapping Conservative damping parameters (78.6%) Protocol oscillations due to MED (18.3%) Unstable interface or BGP session (3.0%)
Surprising finding: 15.2% of updates were caused by persistent flapping prefixes, even though flap damping was enabled!
9
Example: Unstable eBGP Session
ISP Peer
CustomerEC
EB
EA ED
p
Flap damping parameters are session-based Damping not implemented for iBGP sessions
10
Event Classification
Challenge: Major concerns in network management Changes in reachability Heavy load of routing messages on the routers Change of flow of traffic through the network
Event Classification
Event Classification
Events “Typed”Events
Solution: classify events by severity of their impacts
11
Event Category – “No Disruption”
ISP
EA
p
EB
EC
EE
AS2
ED
AS1
No Traffic Shift
“No Disruption”: each of the border routers has no traffic shift. (50.3%)
12
Event Category – “Internal Disruption”
ISP
EA
p
EB
EC
EE
AS2
ED
AS1
Internal Traffic Shift
“Internal Disruption”: all of the traffic shifts are internal traffic shift. (15.6%)
13
Event Category – “Single External Disruption”
ISP
EA
p
EB
EC
EE
AS2
ED
AS1
external Traffic Shift
“Single External Disruption”: only one of the traffic shifts is external traffic shift. (20.7%)
14
Statistics on Event Classification
Events Updates
No Disruption 50.3% 48.6%
Internal Disruption 15.6% 3.4%
Single External Disruption 20.7% 7.9%
Multiple External Disruption 7.4% 18.2%
Loss/Gain of Reachability 6.0% 21.9%
First 3 categories have significant variations from day to day
Updates per event depends on the type of events and the number of affected routers
15
Event Correlation
Challenge: A single routing change affects multiple destination prefixes
EventCorrelation
EventCorrelation“Typed”
EventsClusters
Solution: group events of same type that occur close in time
16
EBGP Session Reset Caused most “single external disruption” events Check if the number of prefixes using that
session as the best route changes dramatically
Validation with Syslog router report (95%)
time
Number of prefixes
session failure
session recovery
17
Hot-Potato Changes Hot-Potato Changes
Caused “internal disruption” events Validation with OSPF measurement (95%)
[Teixeira et al – SIGMETRICS’ 04]
ISP
P
EA EB
EC
10119
“Hot-potato routing” = route to closest egress point
18
Traffic Impact Prediction
Challenge: Routing changes have different impacts on the network which depends on the popularity of the destinations
Traffic ImpactPrediction
Traffic ImpactPrediction
EEBR
Clusters LargeDisruptions
Netflow Data
EEBR EEBR
Solution: weigh each cluster by traffic volume
19
Traffic Impact Prediction
Traffic weight Per-prefix measurement from Netflow 10% prefixes accounts for 90% of traffic
Traffic weight of a cluster Sum of “traffic weight” of the prefixes A few clusters have large traffic weight Mostly session resets & hot-potato changes
20
Performance Evaluation Memory
Static memory: “current routes”, 600 MB Dynamic memory: “clusters”, 300 MB
Speed 99% of intervals of 1 second of updates can
be process within 1 second Occasional execution lag Every interval of 70 seconds of updates can
be processed within 70 seconds
Measurements were based on 900MHz CPU
21
Conclusion
BGP anomaly detection Fast, online fashion Operator concerns (reachability, flapping,
traffic) Significant information reduction
Uncovered important network behaviors Persistent flapping prefixes Hot-potato changes Session resets and interface failures
22
Detecting Peering Violations Consistent export requirement
Peer should advertise prefixes at all peering points, with the same AS path length
Allows the AS to do hot-potato routing Detecting violations
Using iBGP feeds from the border routers Some inference tricks to identify inconsistencies
Results of the study http://www.nanog.org/mtg-0410/feamster.html http://www.cs.princeton.edu/~jrex/papers/imc04.pdf