Upload
katsushi-kobayashi
View
126
Download
3
Embed Size (px)
Citation preview
National Institute of Advanced Industrial Science and Technology
Availability of US R&E network
- viewpoint from IGP -Katsushi Kobayashi
National Institute of Advanced Industrial Science and Technology
Toward End-to-end Non-disruptive Operation and Service
• Emerging new activities to improve Internet reliability.
• IETF IPFRR
• Overlay (Detour, RON, ....)
• TENOS (It’s our goal :))
• Data set required for quantitative estimation.
• Nation-wide production quality network.
National Institute of Advanced Industrial Science and Technology
IS-IS collector in Abilene
• Major US Research and Education (R&E) network.
• IS-IS collector is part of Abilene observatory activity. • Contributed by Shu Zhang [ZK06] • Similar to Cengiz Alaettinoglu’s work on Qwest [AC02]
• Deployed all Abilene nodes for multi observation points. • Synchronized with CDMA timer (GPS based) • From Aug. ’04 to Apr. ’07 data set is available. • This presentation based on preliminary analysis of this IS-
IS data, possibly including flaws.
[ZK06] S. Zhang and K. Kobayashi, “Rtanaly: A System to Detect and Measure IGP Routing Changes”
National Institute of Advanced Industrial Science and Technology
Abilene Network Map
Seattle
DenverSunnyvale
Los Angels
Kansas City
Chicago
Indianapolis
Atlanta
Washington
New York City
Houston
Seattle
Denver
Sunnyvale
Los Angels
Kansas City
Chicago
Indianapolis
Atlanta
Washington
New York City
Houston
11 nodes with T640 routers, and 14 OC192 circuits.
National Institute of Advanced Industrial Science and Technology
Abilene IS-IS operation
• 9 sec. Hello interval, lost ISIS adjacency after missing 3 hellos • 22.5 sec. failure detection delay is supposed. • More faster failure detection is possible, e.g., shorter hello
interval, carrier loss with circuit failure.
• IGP maintains infrastructure information only.
• Minimize IGP database
• Not import any BGP route into IS-IS.
National Institute of Advanced Industrial Science and Technology
• Network availability in hereafter:
• All network works without any failure. • From Network operator’s viewpoint.
• Don’t care specific source destination path availability. • Not from customer’s viewpoint.
• Timeframe:
• May include more than one event at same time.
ATLA
IPLS
Network
Timeframe of failure Timeframe of double failure
..........
National Institute of Advanced Industrial Science and Technology
Abilene IS-IS overview ’05-’06 play back
• Node failure: timeout node LSP, or seq. number reset.
• Only 1 times on ’05 (53 sec. downtime), 2 on ’06 (1,298 sec. )
• Circuit failure: adjacency away from list in LSP
• Usually found, 635 timeframe on ’05, 513 on ’06.
• Ext. route failure: Route away from LSP
• Represent edge troubles ?
• Difficult to identify whether serious or trivial.To focus this failure.
National Institute of Advanced Industrial Science and Technology
Lost adjacency event
Note that above histograms are drawn with IS-IS captured data at Atlanta. Few details are different with other IS-IS observatory point.
2005/Jan.-Dec. 2006/Jan.-Dec.single−failure
Monitor duration: 365 (days)Total disrupt(count): 635, Availability: 0.95443
log_10(Disrupt time (sec.))
Freq
uenc
y
0 1 2 3 4 5 6
010
020
030
040
0
single−failureMonitor duration: 365 (days)
Total disrupt(count): 513, Availability: 0.98424
log_10(Disrupt time (sec.))
Freq
uenc
y
0 1 2 3 4 5 6
010
020
030
040
0
60 sec. 1 hour 1 day 60 sec. 1 hour 1 day
National Institute of Advanced Industrial Science and Technology
Breakdown in ‘05ATLA−IPLS
Monitor duration: 365 (days)Disrupt(count) 288, Avail: 0.99137
log_10(Disrupt time (sec.))
Freq
uenc
y
0 1 2 3 4 5 6
050
100
150
200
250
CHIN−IPLSMonitor duration: 365 (days)
Disrupt(count) 34, Avail: 0.99981
log_10(Disrupt time (sec.))
Freq
uenc
y
0 1 2 3 4 5 60
24
68
CHIN−NYCMMonitor duration: 365 (days)
Disrupt(count) 64, Avail: 0.99947
log_10(Disrupt time (sec.))
Freq
uenc
y
0 1 2 3 4 5 6
05
1015
DNVR−KSCYMonitor duration: 365 (days)
Disrupt(count) 4, Avail: 0.99997
log_10(Disrupt time (sec.))
Freq
uenc
y
0 1 2 3 4 5 6
0.0
0.5
1.0
1.5
2.0
DNVR−SNVAMonitor duration: 365 (days)
Disrupt(count) 12, Avail: 0.99302
log_10(Disrupt time (sec.))
Freq
uenc
y
0 1 2 3 4 5 6
01
23
45
DNVR−STTLMonitor duration: 365 (days)
Disrupt(count) 8, Avail: 0.99997
log_10(Disrupt time (sec.))
Freq
uenc
y
0 1 2 3 4 5 60.
00.
51.
01.
52.
02.
53.
0
National Institute of Advanced Industrial Science and Technology
Availability Map (05/01-12)
Seattle
Denver
Sunnyvale
Los Angels
Kansas City
Chicago
Indianapolis
Atlanta
Washington
New York City
Houston
0.9999/11/800
0.9991/122/12,4940.9730/24/819,803
0.9930/12/183,352
0.9913/288/170,303
0.9998/34/1,364
0.9994/64/5,090
0.9998/10/3,940
Availability / Disrupt count / Longest down time (sec.)
0.9997/54/1,194
0.9999/4/398
0.9992/16/7,071
0.9997/12/2,349
0.9999/8/501
0.9993/18/14,192
Hurricane KatrinaAug. ‘05
National Institute of Advanced Industrial Science and Technology
Yearly summary ’05 - ’062005/Jan.- Dec. 2006/Jan.- Dec.
Avail. Disrupt cnt. Avail. Disrupt cnt.
ATLA-HSTN 0.9738 24 0.9990 39
ATLA-IPLS 0.9914 288 0.9975 48
ATLA-WASH 0.9998 12 0.9994 25
CHIN-IPLS 0.9998 34 0.9998 14
CHIN-NYCM 0.9995 64 0.9999 30
DNVR-KSCY 1.0000 4 0.9999 18
DNVR-SNVA 0.9930 12 0.9922 51
DNVR-STTL 1.0000 8 0.9999 5
HSTN-KSCY 0.9993 18 0.9990 19
HSTN-LOSA 0.9991 121 0.9996 40
IPLS-KSCY 0.9998 10 0.9998 17
LOSA-SNVA 0.9997 54 0.9993 128
NYCM-WASH 0.9993 17 0.9989 113
SNVA-STTL 1.0000 11 1.0000 129
Total(*) 0.9544 677 0.9842 676
National Institute of Advanced Industrial Science and Technology
Critical event
• 2 or more lost adjacency at same timeframe • Some combination makes serious impact. But, not all event
lead split graph condition.
• 32 timeframes (47 disrupt) in ’05, 58 (61) in ’06
• 26/47 timeframes in ’05, 49/61 in ’06, are attributed as missing a specific node,
National Institute of Advanced Industrial Science and Technology
2 or more links failure (1) - Not critical -
Seattle
Denver
Sunnyvale
Los Angels
Kansas City
Chicago
Indianapolis
Atlanta
Washington
New York City
Houston
Not serious 05/12/19 11:35 - 11:41
National Institute of Advanced Industrial Science and Technology
2 or more links failure (2) - Missing Node -
Seattle
Denver
Sunnyvale
Los Angels
Kansas City
Chicago
Indianapolis
Atlanta
Washington
New York City
Houston
Missing IPLS router at ...........
06/02/19 05:31-05:56 06/02/19 06:30-06:35 06/02/19 15:47-15:51
............
National Institute of Advanced Industrial Science and Technology
Two or more failure in ‘05
single−failureMonitor duration: 365 (days)
Total disrupt(count): 637, Availability: 0.95435
log_10(Disrupt time (sec.))
Freq
uenc
y
0 1 2 3 4 5 6
010
020
030
040
0
2005/Jan.-Dec.double−failure
Monitor duration: 365 (days)Total disrupt(count): 47, Availability: 0.99976
log_10(Disrupt time (sec.))
Freq
uenc
y
0 1 2 3 4 5 6
05
1015
2025
All lost adjacency events Two or more missing
3 nines!!60 sec. 1 hour 1 day 60 sec. 1 hour 1 day
National Institute of Advanced Industrial Science and Technology
Two or more failure in ‘062006/Jan.-Dec.
double−failureMonitor duration: 365 (days)
Total disrupt(count): 61, Availability: 0.99959
log_10(Disrupt time (sec.))
Freq
uenc
y
0 1 2 3 4 5 6
010
2030
40
single−failureMonitor duration: 365 (days)
Total disrupt(count): 514, Availability: 0.98419
log_10(Disrupt time (sec.))
Freq
uenc
y
0 1 2 3 4 5 6
010
020
030
040
0
All lost adjacency events Two or more missing
3 nines!!60 sec. 1 hour 1 day 60 sec. 1 hour 1 day
National Institute of Advanced Industrial Science and Technology
Single link failure is trivial ? (1)
• Lost two or more adjacency events are rare, more than 99.95% availability, < 5 hours/year downtime.
• More than 500 lost single adjacency are founded. • 637 times in ’05, and 514 in ’06
• 3-4 hours/year downtime are estimated : • Suppose 22 sec. downtime for each lost adjacency. • Other delays, e.g. LSP propagation, SPF computation,
degrade more.
National Institute of Advanced Industrial Science and Technology
Single link failure is trivial ? (2)
• 22 sec. downtime for each lost adjacency is overestimated ? • Router can detect circuit failure more faster triggered with
lower layer information, e.g., loss of optical, framer error. • Sub-second IGP convergence already presented [AC02].
• Sub-second is derived from propagation delay limit, impossible to reduce it.
• However, even if downtime is reduced into 1/10, downtime is still in order, i.e., from 3 hours/year to 20 min./year.
National Institute of Advanced Industrial Science and Technology
“O” approach can improve availability on IP network ?
• Abilene IP network availability achieves more than three nines (99.9%) in previous analysis. Q: Is it possible to achieve six nines (99.9999 %) with building redundant path approach, i.e., Overlay or variant ? A1: No, not ensured disjointness, 90% multihomed paths are overlapped [HWJ06] A2: No, Each routing sub-system must be independent. Overlay infrastructure shares same routing infrastructure.
[HWJ06] J.Han et al., “An Experimental Study of Internet Path Diversity”
National Institute of Advanced Industrial Science and Technology
Conclusion
• Introduce Abilene IS-IS trace data, and our preliminary availability evaluation with ’05 and ’06.
• Network availability is more than 99.95 % with IGP restoration.
National Institute of Advanced Industrial Science and Technology
Acknowledge
• Zhang Shu
• Randy Bush