
Delayed Internet Routing Convergence*

Craig Labovitz
Microsoft Research
One Microsoft Way
Redmond, WA 98052-6399

[email protected]

Abha Ahuja, Abhijit Bose, Farnam Jahanian
University of Michigan
Department of Electrical Engineering and Computer Science
1301 Beal Avenue
Ann Arbor, MI 48109-2122

{ahuja,abose,farnam}[email protected]

ABSTRACT

This paper examines the latency in Internet path failure, failover and repair due to the convergence properties of inter-domain routing. Unlike switches in the public telephony network which exhibit failover on the order of milliseconds, our experimental measurements show that inter-domain routers in the packet switched Internet may take tens of minutes to reach a consistent view of the network topology after a fault. These delays stem from temporary routing table oscillations formed during the operation of the BGP path selection process on Internet backbone routers. During these periods of delayed convergence, we show that end-to-end Internet paths will experience intermittent loss of connectivity, as well as increased packet loss and latency. We present a two-year study of Internet routing convergence through the experimental instrumentation of key portions of the Internet infrastructure, including both passive data collection and fault-injection machines at major Internet exchange points. Based on data from the injection and measurement of several hundred thousand inter-domain routing faults, we describe several unexpected properties of convergence and show that the measured upper bound on Internet inter-domain routing convergence delay is an order of magnitude slower than previously thought. Our analysis also shows that the upper theoretic computational bound on the number of router states and control messages exchanged during the process of BGP convergence is factorial with respect to the number of autonomous systems in the Internet. Finally, we demonstrate that much of the observed convergence delay stems from specific router vendor implementation decisions and ambiguity in the BGP specification.

*Partially supported by National Science Foundation Grants NCR-971017 and NCR-961276.

1. INTRODUCTION

In a brief number of years, the Internet has evolved from an experimental research and academic network to a commodity, mission-critical component of the public telecommunication infrastructure. During this period, we have witnessed an explosive growth in the size and topological complexity of the Internet and an increasing strain on its underlying infrastructure. As the national and economic infrastructure has become increasingly dependent on the global Internet, the end-to-end availability and reliability of data networks promises to have significant ramifications for an ever-expanding range of applications. For example, transient disruptions in backbone networks that previously impacted a handful of scientists may now cause enormous financial loss and disrupt hundreds of thousands of end users.

Since its commercial inception in 1995, the Internet has lagged behind the public switched telephone network (PSTN) in availability, reliability and quality of service (QoS). Factors contributing to these differences between the commercial Internet infrastructure and the PSTN have been discussed in various literature [22, 16]. Although recent advances in the IETF's Differentiated Services working group promise to improve the performance of application-level services within some networks, across the wide-area Internet these QoS algorithms are usually predicated on the existence of a stable underlying forwarding infrastructure.

The Internet backbone infrastructure is widely believed to support rapid restoration and rerouting in the event of individual link or router failures. At least one report places the latency of inter-domain Internet path failover on the order of 30 seconds or less based on qualitative end user experience [13]. These brief delays in inter-domain failover are further believed to stem mainly from queuing and router CPU processing latencies [3, (message digests 11/98, 1/99)]. In this paper, we show that most of this conventional wisdom about Internet failover is incorrect. Specifically, we demonstrate that the Internet does not support effective inter-domain failover and that most of the delay in path restoral stems solely from the unexpected interaction of configurable routing protocol timers and specific router vendor protocol implementation decisions during the process of delayed BGP convergence.


Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
SIGCOMM '00, Stockholm, Sweden.
Copyright 2000 ACM 1-58113-224-7/00/0008...$5.00.


The slow convergence of distance vector (DV) routing algorithms is not a new problem [20]. DV routing requires that each node maintain the distance from itself to each possible destination and the vector, or neighbor, to use to reach that destination. Whenever this connectivity information changes, the router transmits its new distance vector to each of its neighbors, allowing each to recalculate its routing table.

DV routing can take a long time to converge after a topological change because routers do not have sufficient information to determine if their choice of next hop will cause routing loops to form. The count-to-infinity problem [20] is the canonical example used to illustrate the slow convergence in DV routing. Numerous solutions have been proposed to address this issue. For example, including the entire path to the destination, known as the path vector approach, is used in the Border Gateway Protocol (BGP), the inter-domain routing protocol in the Internet. Other attempts to solve the count-to-infinity problem or accelerate convergence in many common cases include techniques such as split horizon (with poison reverse), triggered updates, and the diffusing update algorithm [7].
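As a toy illustration (ours, not from the paper), the following Python sketch shows two RIP-style nodes counting to infinity after their shared destination D fails; the two-node topology, metrics, and the conventional infinity cap of 16 are illustrative assumptions:

    INFINITY = 16  # RIP's conventional "infinity" metric

    def count_to_infinity():
        # dist[x] = (metric, next_hop) toward destination D.
        # Initially A reaches D directly; B reaches D through A.
        dist = {"A": (1, "D"), "B": (2, "A")}
        dist["A"] = (INFINITY, None)  # the A-D link fails
        rounds = 0
        while dist["A"][0] < INFINITY or dist["B"][0] < INFINITY:
            # Each round, A re-learns the (stale) route through B,
            # and B re-learns through A, each adding one hop.
            dist["A"] = (min(dist["B"][0] + 1, INFINITY), "B")
            dist["B"] = (min(dist["A"][0] + 1, INFINITY), "A")
            rounds += 1
        return rounds

    # A path vector stops this at once: B's advertised path (B, A, D)
    # contains A, so A rejects it instead of counting upward.
    print(count_to_infinity())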

Although the theoretical aspects of the delayed convergence problems associated with DV protocols are well known, this paper is the first to our knowledge to investigate and quantitatively measure the convergence behavior of BGP4 deployed in today's Internet. In [4], the authors showed that in the worst case, the original Bellman-Ford distance vector algorithm requires O(n^3) iterations to find the shortest path lengths for a network with n nodes. However, we are not aware of any published result of a similar bound for path vector algorithms. The adoption of the path vector is widely and incorrectly believed to provide BGP with significantly improved convergence properties over traditional DV protocols, including RIP [12].

A number of recent studies, including Varadhan et al. [23] and Griffin and Wilfong [8], have explored BGP routing divergence. As we describe in the next Section, BGP allows the administrator of an autonomous system to specify arbitrarily complex policies. In BGP divergence, Griffin and Wilfong show that it is possible for autonomous systems to implement "unsafe," or mutually unsatisfiable, policies, which will result in persistent route oscillations. Griffin et al. in [9] and Rexford et al. in [6] also describe modifications to BGP policies which guarantee that the protocol will not diverge. The authors of all these papers note that BGP divergence remains a theoretical finding and has not been observed in practice. Our work explores a complementary facet of BGP routing: the convergence behavior of safe, or satisfiable, routing policies. As we describe in the next Section, deployed Internet routers default to a constrained shortest path first route selection policy. We show that even with this constrained policy, the theoretical upper bound on complexity for BGP convergence is factorial with respect to the number of autonomous systems.

Bhargavan et al. in [4] provide a stricter upper bound on the convergence of RIP. The authors account for implementation details of RIP including poison reverse, triggered updates, and split-horizon, which provide for improved convergence behavior over previous analyses of Bellman-Ford algorithms. In [24], the authors simulated the convergence behaviors of several algorithms including a distributed Bellman-Ford, and present metrics for comparing the convergence properties of these different protocols. In this work, we similarly focus on both measuring the convergence latencies of BGP and developing theoretical upper and lower bounds.

In [17], Labovitz et al. describe significant levels of measured Internet routing instability. The authors show that most Internet routing instability in 1997 was pathological and stemmed from software bugs and artifacts of router vendor implementation decisions. In a later paper, Labovitz and his co-authors show in [18] that once ISPs deployed updated router software suggested by [17], the level of Internet routing instability dropped by several orders of magnitude. Finally, in [16], Labovitz et al. measured the rate of network failure, repair and availability. In this work, we present a complementary study of both the impact and the rate at which inter-domain repair and failure information propagates through the Internet. We also measure the impact of Internet path changes on end-to-end network performance. Specifically, our major results include:

• Although the adoption of the path vector by BGP eliminates the DV count-to-infinity problem, the path vector exponentially exacerbates the number of possible routing table oscillations.

• The delay in Internet inter-domain path failovers averaged three minutes during the two years of our study, and some percentage of failovers triggered routing table oscillations lasting up to fifteen minutes.

• The theoretical upper bound on the number of computational states explored during BGP convergence is O(n!), where n is the number of autonomous systems in the Internet. We note that this is a theoretical upper bound on BGP convergence and is unlikely to occur in practice.

• If we assume bounded delay on BGP message propagation, then the lower bound on BGP convergence is ((n - 3) * 30) seconds, where n is the number of autonomous systems in the Internet.

• The delay of inter-domain route convergence is due almost entirely to the unforeseen interaction of protocol timers with specific router vendor implementation decisions.

• Internet path failover has significant deleterious impact on end-to-end performance: measured packet loss grows by a factor of 30 and latency by a factor of four during path restoral.

• Minor changes to current vendor BGP implementations would, if deployed, reduce the lower bound on inter-domain convergence time complexity from ((n - 3) * 30) to 30 seconds, where n is the number of autonomous systems in the Internet.

The remainder of this paper is organized as follows: Section 2 provides additional background on BGP. Section 3 provides a description of our experimental measurement infrastructure. In Section 4, we present the results of our two year study of Internet routing convergence. We describe the measured convergence latencies of both individual ISPs and the Internet as a whole after several categories of injected routing faults. In Section 5, we present a simplified model of delayed BGP convergence and discuss the theoretical upper and lower bounds on the process. In Section 6, we provide analysis of our experimental data based on our model of BGP convergence. Finally, we conclude with a discussion of specific modifications to vendor BGP implementations which, if deployed, would significantly improve Internet convergence latencies.

2. BACKGROUND

Autonomous systems (ASes) in the Internet today exchange inter-domain routing information through BGP. We assume that the reader is familiar with Internet architecture and the BGP routing concepts discussed in [21, 10]. We provide a brief review of the more salient attributes of BGP related to the discussion in this paper.

Unlike interior gateway protocols, which periodically flood an intra-domain network with all known topological information, BGP is an incremental protocol that sends update information only upon changes in network topology or routing policy. Routing information shared among BGP speaking peers has two forms: announcements and withdrawals. A route announcement indicates that a router has either learned of a new network attachment or has made a policy decision to prefer another route to a network destination. Route withdrawals are sent when a router makes a new local decision that a network is no longer reachable via any path. Explicit withdrawals are those associated with a withdrawal message. Implicit withdrawals occur when an existing route is replaced by an announcement of a new, more preferred route without an intervening withdrawal message. We define route failover as the implicit withdrawal and replacement of a route with one having a different ASPath. For purposes of our discussion, we define a steady-state network as one where no BGP monitored peer sends updates for a given prefix for 30 minutes or more. We choose the 30 minute time period as an upper bound on short-term routing table oscillations based on results described in [16].
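To make these definitions concrete, the following minimal sketch (ours; the update field names are assumptions) classifies the updates seen for a single prefix and tests for steady state:

    STEADY_STATE_SECS = 30 * 60  # the paper's 30 minute quiet period

    def classify(prev_route, update):
        """prev_route: the currently installed route dict, or None."""
        if update["type"] == "withdraw":
            return "explicit withdrawal"
        if prev_route is None:
            return "new announcement"
        if update["aspath"] != prev_route["aspath"]:
            return "failover"  # implicit withdrawal, different ASPath
        return "implicit withdrawal"  # replaced, same ASPath

    def is_steady(last_update_time, now):
        # Steady state: no updates for the prefix for 30 minutes or more.
        return now - last_update_time >= STEADY_STATE_SECS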

BGP limits the distribution of a router's reachability information to its peer, or neighbor, routers. As a path vector protocol, BGP updates include an ASPath, or a sequence of intermediate autonomous systems between source and destination routers that form the directed path for the route. The default BGP behavior uses the ASPath for both loop detection and policy decisions. Upon receipt of a BGP update, each router evaluates the path vector and invalidates any route which includes the router's own AS number in the path.

Although not specified in the BGP standard [21], most vendor implementations ultimately default to best path selection based on ASPath length. The number of ASes in the path is used in a manner similar to the metric count attribute in the RIP protocol. While BGP allows for path selection based on policy attributes, including local preference and multi-exit discriminator values, a review of BGP logs, discussions with Internet network operators, and a survey of policies registered in the Internet Routing Registry (IRR) indicate that the majority of ISP policies default to the selection of the route with the shortest path. In the remainder of this paper, we base our analysis on the default behavior of BGP, or constrained shortest path first policies.
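A minimal sketch of this default decision (ours, not a vendor implementation): discard candidates whose ASPath contains the local AS, then prefer the shortest remaining ASPath:

    def best_route(own_as, candidates):
        """candidates: list of dicts with an "aspath" list of AS numbers."""
        # Loop detection: discard any path containing our own AS number.
        loop_free = [r for r in candidates if own_as not in r["aspath"]]
        if not loop_free:
            return None  # no valid path: the destination is unreachable
        # Constrained shortest path first: fewest ASes wins.
        return min(loop_free, key=lambda r: len(r["aspath"]))

    routes = [{"peer": "A", "aspath": [1, 7, 9]},
              {"peer": "B", "aspath": [2, 9]},
              {"peer": "C", "aspath": [3, 5, 9]}]
    assert best_route(5, routes)["peer"] == "B"  # AS 5 rejects the looped path via C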

Internet providers commonly constrain path selection and subsequent advertisements to peers through the use of ingress and egress filtering. Most commercial routers include configurable filter-lists which support the rejection or acceptance of route advertisements based on prefix or ASPath pattern matching. Common provider filtering practices include the rejection of customer route advertisements outside the address space owned by that customer, and the filtering of non-customer and non-transit route advertisements to peers.

The BGP standard also includes a minimum route advertisement interval timer, abbreviated in this paper as MinRouteAdver, which specifies a minimum amount of time that must elapse between advertisements of routes to a particular destination from a given BGP peer. This timer provides both a rate limiter on BGP updates as well as a window in which BGP updates with common attributes may be bundled into a single update for greater protocol efficiency. In order to achieve a minimum of MinRouteAdver between announcements, the specification calls for this rate-limiter to be applied as a jittered interval on a (prefix destination, peer) tuple basis.
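A sketch of such a jittered per-(prefix, peer) rate limiter follows (ours; the 30-second base interval and the 0.75-1.0 jitter range are conventional values, assumed rather than quoted from the standard):

    import random
    import time

    MIN_ROUTE_ADVER = 30.0  # seconds, the conventional default

    class RateLimiter:
        def __init__(self):
            self.next_ok = {}  # (prefix, peer) -> earliest allowed send time

        def may_send(self, prefix, peer, now=None):
            now = time.time() if now is None else now
            if now < self.next_ok.get((prefix, peer), 0.0):
                return False  # hold the update; it may be bundled later
            # Jitter the interval so peers do not synchronize.
            jittered = MIN_ROUTE_ADVER * random.uniform(0.75, 1.0)
            self.next_ok[(prefix, peer)] = now + jittered
            return True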

The standard further specifies that MinRouteAdver only applies to BGP announcements and not explicit withdrawals. This distinction stems from the goal of avoiding the long-lived "black holing" of traffic to unreachable destinations. Due to the delay introduced by MinRouteAdver on announcements throughout the Internet, BGP withdrawals are commonly (and incorrectly) believed to propagate and converge more quickly.

3. METHODOLOGY

We base our analysis on data collected from the experimental instrumentation of key portions of the Internet infrastructure. Over the course of two years, we injected over 250,000 routing faults into geographically and topologically diverse peering sessions with five major commercial Internet service providers. We then measured the impact of these faults through both end-to-end measurements and logging ISP backbone routing table changes.

Figure 1 shows a simplified diagram of our RouteViews measurement and fault injection infrastructure. We measured the impact of injected faults via both active and passive probe machines deployed at major US exchange points, as well as on the University of Michigan campus. Our passive instrumentation included several "RouteViews" probe machines, which maintained default-free peering with over 25 Internet providers. These RouteViews machines time-stamped and logged all BGP updates received from peers to disk.

[Figure 1: Diagram of the fault injection and measurement infrastructure.]

We injected faults consisting of BGP update messages including route transitions (i.e., announcements and withdrawals) for both /19 and /24 prefix-length addresses. Although we injected faults from a number of diverse probe locations, we simplify the discussion in this paper by presenting data only from faults injected at the Mae-West exchange point and from the University of Michigan campus. We note that data from other probe locations exhibited similar behaviors. As we only injected routing information for addresses assigned to our research effort, these faults did not impact routing for commodity ISP traffic, with the exception of the addition of some minimal level of extra routing control traffic. We generated faults over a two year period to provide statistical guarantees that our analysis was based on deliberately injected faults rather than normally occurring exogenous Internet failures, which the authors in [16] found occur on the average of once a month.

Software from the MRT and IPMA projects [1, 2] running on both FreeBSD PCs and Sun Microsystems workstations was used to generate BGP routing update messages at random intervals of roughly a two-hour periodicity. The faults simulated route failures, repairs and multi-homed failover. In the case of failover, we announced both a primary route for a given prefix with a short ASPath to one upstream BGP neighbor, and a longer ASPath route for the same prefix to a second provider. The announcement of two routes of different ASPath length represents a common method of customer multihoming to two Internet providers. In an effort to ensure that the downstream peers would always prefer the primary route if it existed, we prepended the long ASPath route announcement with three times the average number of AS numbers observed in steady-state path lengths. We then periodically failed the shorter ASPath route while maintaining the longer backup path.
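Our reconstruction of the two failover announcements, as a sketch (the origin AS number and the assumed average path length are illustrative, not the study's actual values):

    ORIGIN_AS = 65001   # hypothetical origin AS for the injected prefix
    AVG_PATH_LEN = 4    # assumed average steady-state ASPath length

    def primary_announcement(prefix):
        # Short ASPath, sent to the first upstream provider.
        return {"prefix": prefix, "aspath": [ORIGIN_AS]}

    def backup_announcement(prefix):
        # Same prefix via the second provider, prepended to three times
        # the average steady-state path length so downstream peers never
        # prefer it while the primary exists.
        return {"prefix": prefix, "aspath": [ORIGIN_AS] * (3 * AVG_PATH_LEN)}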

While the RouteViews probes monitored the impact of BGP faults on core Internet routers, our active measurements monitored the impact on end-to-end performance. We configured these probe machines with a virtual interface addressed within the prefix blocks included in the injected BGP faults. These probe machines sent 512 byte ICMP echo messages to 100 randomly selected web sites once a second. We randomly selected the web site IP addresses from a major Internet cache log of several hundred thousand entries.

We then correlated the data between our NTP synchronized fault injection probe machines and both our RouteViews and end-to-end measurement logs. These correlations provided data on the number of update messages generated for a particular route announcement and withdrawal, as well as the convergence delay for a particular ISP, and all ISPs, to reach steady state after a fault.

We also simulated routing convergence using software from the MRT project [2]. The MRTd daemon supports the configuration of multiple BGP autonomous systems and associated routing tables within a single workstation process. As a complete routing protocol implementation, the software supports the generation of BGP update packets and the application of arbitrary BGP policies similar to those available on commercial routers. In simulation mode, the daemon exchanges packets internally and does not forward updates to the network. By programmatically introducing delay in message propagation and processing, we were able to simulate both the average and upper bound on BGP convergence for networks of varying degree and topology.

4. EXPERIMENTAL RESULTS

In this section we present data collected with the experimental measurement infrastructure described in the previous section. We first provide a taxonomy for describing the four categories of routing events injected into the Internet during our study:

Tup A previously unavailable route is announced as available. This represents a route repair.

Tdown A previously available route is withdrawn. This represents a route failure.

Tshort An active route with a long ASPath is implicitly replaced with a new route possessing a shorter ASPath. This represents both a route repair and failover.

Tlong An active route with a short ASPath is implicitly replaced with a new route possessing a longer ASPath. This represents both a route failure and failover.

We define the latency of each injected event as the time between the injection of the fault and the routing tables of a given ISP, or all ISPs we monitored, reaching steady state for the injected prefix. In the following two subsections, we present data from both our passive routing and active end-to-end measurements.
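Stated as code, a sketch of this latency computation (ours; the log format is an assumption): latency is the time from fault injection to the last update before a 30-minute quiet window.

    STEADY = 30 * 60  # seconds of silence that define steady state

    def convergence_latency(inject_time, update_times):
        """update_times: sorted timestamps of updates seen for the prefix."""
        last = inject_time
        for t in update_times:
            if t - last >= STEADY:
                break  # steady state was reached before this update
            last = t
        return last - inject_time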

4.1 Routing Measurements

We first explore the differences in latency among the four categories of routing events. Figure 2(a) shows the convergence latency for a cumulative percentage of Tdown, Tup, Tshort and Tlong events over all monitored ISPs. The horizontal axis represents the number of seconds from injection of the fault until all ISPs' BGP routing tables reach steady state for that prefix; the vertical axis shows the cumulative percentage of all such events. For clarity we limit the horizontal axis to 180 seconds. All four events exhibited a long-tailed distribution of convergence latencies extending up to fifteen minutes for a small, but tangible percentage of events. Significantly, Figure 2(a) shows more than twenty percent of Tlong and forty percent of Tdown events oscillated for more than three minutes. We note that these observed latencies are an order of magnitude longer than those reported in [3, 13].

[Figure 2: (a) Convergence latency of cumulative percentage of Tup, Tshort, Tlong and Tdown events, and (b) average number of BGP updates from 5 ISPs triggered by Tdown, Tlong, Tup, Tshort events, for all monitored ISPs over the course of our two year study. Both data sets are for faults injected at the Mae-West exchange point.]

We also observe in Figure 2 that (Tlong, Tdown) and (Tshort, Tup) form equivalence classes based on their similar distribution of convergence latencies. Both Tdown and Tlong converged more slowly than Tup or Tshort: Tup and Tshort events converged within 90 seconds while only five percent of Tdown and Tlong events converged within 90 seconds, and twenty percent of Tdown/Tlong required longer than two minutes to converge. We note that the cumulative percentage curves for Tup and Tshort match closely, while Tlong and Tdown share similar curves separated by an average of 20 seconds. We posit a likely explanation for both the equivalence classes and the differences between the Tlong and Tdown curves in Section 6.

We next examine the volume, or number, of BGP routing updates triggered by each injection of a routing event. We observe that the injection of a single routing event may trigger the generation of multiple route announcements and withdrawals from each ISP. In Figure 2(b), we show the average number of update messages generated by five ISPs for each category of routing event over the two year course of our study. Although we monitored the BGP routing tables of 25 ISPs, we graph only five ISPs in Figure 2(b) for clarity. We note that data from the other monitored providers exhibited similar behaviors.

The most salient observation we make from Figure 2(b) is that both Tdown and Tlong events on average triggered more than two times the number of update messages of Tup and Tshort events. As we observed in Figure 2(a), (Tlong, Tdown) and (Tup, Tshort) appear to form equivalence classes with respect to both convergence latency and the number of update messages they trigger. We note significant variation in the average number of updates generated by individual ISPs within each equivalence class. For example, we see that for ISP3, Tdown triggered twice the number of messages as Tlong. In contrast, Tlong events triggered more messages in ISP2 than Tdown. In all categories, ISP1 generated an average of only one BGP update. Finally, we note strong correlations between the relative number of update messages generated per equivalence class in Figure 2(b) and the convergence latencies of each category in Figure 2(a). We provide probable explanations for these behaviors later in Section 6.

We now look at the latency for two categories of injected events on a per ISP basis. Figure 3 shows the convergence latency of a cumulative percentage of both Tdown (a) and Tup (b) events for five ISPs. The horizontal axis represents the delay, in one second bins, between the time of event injection and the time the BGP routing tables in each ISP reach steady state for that prefix. The vertical axis shows the cumulative percentage of all such events. As before, we present data from only five ISPs and limit the horizontal axis to 180 seconds for clarity of presentation.

We observe significant variation in the convergence latencies of the five ISPs in both graphs of Figure 3. The variations appear most pronounced in Figure 3(a), where a three-minute gap separates 80% of ISP1 converged events from ISP5. In our analysis, we looked for correlations between the convergence latencies of an ISP and both the geographic and network distance of that ISP. We define network distance as the steady-state number of traceroute hops or BGP ASPath entries from the point of fault injection to the peer border router interface of an ISP. In Figure 2 and Figure 3, ISP1 represents a special case: the only ISP into which we both injected events and monitored the convergence latencies. As one of the ISPs into which we also injected faults, the routing table of ISP1 did not exhibit EBGP route oscillations. As we explain in Section 6, at all times ISP1 either had the shortest ASPath route, or ignored updates from neighbor ISPs after detection of an ASPath loop.

With the exception of ISP1, our data shows no correlation between convergence latency and geographic or network distance. For example ISP3, which is a national backbone in Japan, converged more quickly for both Tdown and Tup than a Canadian provider, ISP5. We show in Section 6 that convergence latencies are likely primarily dependent on topological factors, including the number of adjacent BGP peers and upstream provider transit policies.

[Figure 3: Convergence latency of a cumulative percentage of Tdown (a) and Tup (b) events injected at the Mae-West exchange point for five major ISPs.]

We also looked for temporal correlations between convergence delay and the time of day or week. In [17], Labovitz et al. describe a direct relationship between the hourly rate of routing instability and the diurnal bell curve exhibited by Internet bandwidth consumption and the corresponding load on backbone routers. Our analysis, however, found no such temporal relationship with failover latency. This result suggests that the factors contributing to Internet failover delay are largely independent of network load and congestion.

4.2 End-to-End Measurements

We now turn our attention from the convergence latencies of backbone routing tables to the impact of delayed convergence on end-to-end network paths. We show that even moderate levels of routing table oscillation will lead to increased packet loss, latency and out of order packets. These performance problems arise as routers drop packets for which they do not have a valid next hop, or queue packets while awaiting the completion of forwarding table cache updates. We expect end-to-end active measurements to provide a better measure of the application-level impact of routing convergence, as not all routing table changes affect the forwarding path, and external BGP routing table measurements do not capture the delays introduced by the convergence of smaller, stub ISPs or interior routing protocol communication.

In Figure 4(a), we show packet loss averaged over one minute intervals between our fault injection machine and 100 randomly selected web sites. The horizontal axis shows one-minute bins for the ten minutes both preceding and immediately following the injection of both a Tlong and a Tshort failover event. Time 0 is the point of fault injection. The vertical axis represents the percentage loss for each one-minute bin, averaged both over all web sites and over each corresponding bin in every ten-minute fault injection period. We see in Figure 4(a) less than one percent average packet loss throughout the ten-minute period before each fault. Immediately following the fault, the graphs for Tlong and Tshort events show a sharp rise to 17 and 32 percent loss, respectively, followed by sharply declining loss over the next three minutes. The wider curve of Tlong with respect to Tshort corresponds to the relative speeds of routing table convergence for both events shown in Figure 2. Specifically, Tlong exhibits a two minute period where loss exceeds twenty percent and Tshort a one minute period of greater than fifteen percent loss. These loss trends support the data in Figure 2, where eighty percent of Tlong and Tshort events converged within the same respective periods.
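Our reconstruction of the binning behind Figure 4(a), as a sketch (the input format is an assumption):

    def loss_by_minute(echoes):
        """echoes: iterable of (seconds_from_fault, was_lost) pairs."""
        bins = {}
        for offset, lost in echoes:
            b = int(offset // 60)  # one-minute bin; negative = before fault
            if -10 <= b < 10:
                sent, dropped = bins.get(b, (0, 0))
                bins[b] = (sent + 1, dropped + int(lost))
        # Percentage loss per bin, averaged over all faults and sites.
        return {b: 100.0 * d / s for b, (s, d) in sorted(bins.items())}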

We also examine the impact of convergence on end-to-end path latency. Figure 4(b) shows the average normalized round-trip latency of ICMP echos in ten-minute bins before and after a Tlong and Tshort event. Time 0 represents the instant of fault injection. We normalize the latency of echos on a per destination basis by dividing the latency of each echo by the average delay to that destination. As with the analysis of packet loss, we see that route failover has significant impact on end-to-end latencies. For both Tlong and Tshort, latencies more than tripled in the three minutes immediately following both categories of failover. Although Tshort exhibited an initially higher increase in latency, the curve for Tlong appears broader, extending for five minutes after the event. We note that the variation in end-to-end latency between Tup and Tdown corresponds with the routing table convergence data presented in Figure 3.

Finally, we analyze the end-to-end speed of repair, or Tup, by measuring the rate at which ICMP echos first began consistently returning from each web site after a repair. Although we omit the graph of Tup end-to-end behavior for brevity, we note that the majority (over 80%) of web sites began returning ICMP echos within 30 seconds, and all web sites returned echos within one minute. These results correspond with the routing convergence latencies reported in Figure 3 for Tup events.

We also note that our end-to-end and routing table measurements correspond to observations by other researchers. Delayed convergence provides a likely explanation for both the temporary routing table oscillations observed by Paxson in [19] as well as some of the instabilities observed by Labovitz et al. in [18].

[Figure 4: Average percentage end-to-end loss (a) and normalized latency (b) of 512 byte ICMP echos sent to 100 web sites every second during the ten minutes immediately preceding and following the injection of Tshort and Tlong events at the Mae-West exchange point.]

5. BGP CONVERGENCE MODEL

In this Section, we present a simplified model of the delayed BGP convergence process. We provide examples and analysis of both the theoretic upper and lower computational bounds on BGP convergence. We will use this model later in Section 6 as the basis for our analysis of the BGP convergence behaviors we observed. We base our model on the BGP specification [21], simulation results, and the previously described experimental measurements.

We simplify our analysis by modeling each AS as a single node. In practice, most ASes encompass dozens or even hundreds of border and internal routers. These routers may exchange routing information through a myriad of protocols, including intra-domain BGP communication (IBGP), route reflectors, confederations and interior routing protocols [10]. We exclude the delay and additional states generated by these ancillary protocols in our model, as our experimental results show these do not add significant latency with respect to the overall BGP convergence delays.

We further simplify our analysis by choosing a full mesh, or complete graph, of autonomous systems as our model of the Internet (i.e., each node has n - 1 adjacencies). In addition, we exclude the impact of ingress and egress filters on BGP route propagation. In practice, the Internet retains some level of hierarchy and most providers implement some degree of customer route filtering. We note, however, that the choice of a full mesh reflects current trends in the evolution of the Internet towards less hierarchy and a more meshed topology [14, 13]. We show in Section 5.1 that a complete graph in the absence of ingress/egress filters provides the worst-case complexity of BGP convergence and, as such, significantly overestimates the average case. Current research, including our ongoing work and [6], has begun to explore the effect of incomplete topologies and more restrictive policies on BGP convergence.

Since BGP does not place bounds on the delay of update propagation or processing, discussions of time complexity are only constructive if we assume bounded delays. We initially exclude the impact of MinRouteAdver and associated timers on convergence. We will discuss time complexity and the impact of these timers in SubSection 5.2. Given the lack of bounds on message propagation, we initially assume messages may arrive in non-deterministic order, subject only to the constraint that FIFO ordering is preserved between any pair of autonomous system peers. This unbounded delay model will provide the basis of our calculation for the upper bound on BGP convergence later in this Section. In practice, the link latency and router processing delay for most BGP messages is significantly less than the MinRouteAdver interval.

Finally, we model BGP processing as a single linear, global queue. All messages (both announcements and withdrawals) are placed in a global queue after transmission, and only one set of messages from a single node to each of its peers is processed at a time. We refer to the processing of a single set of messages from a node, and the resultant possible state changes and message generation, as a stage. Such 'serialization' of the BGP algorithm may arise in practice if there are long link delays in a network. In Section 5.2, we extend our taxonomy of BGP convergence to include a set of stages which form a round. We define a round as the set of all contiguous stages which process BGP paths at a given length within a single MinRouteAdver timer interval.

In Figure 5, we provide an example of BGP convergence involving a complete graph of a three node system where all nodes are initially directly connected to route R. The "Routing Tables" column shows the routing table of each autonomous system at each computational stage. For each AS, we provide the matrix of current paths through each of its neighbors. We denote the active route with an asterisk and a withdrawn, or invalid, path with a dash and/or ∞ symbol. So, for example, we see at step 0 from 1(0R, *R, 2R) that AS1 has one primary route (directly connected) and two backup paths (via AS0 and AS2) to R.

The "Message Processing" column in Figure 5 provides the messages processed at each stage. The last "Messages Queued" column shows the global queue of outstanding messages in the system. We process messages in serial fashion from this global queue subject only to the constraint that the FIFO ordering of messages is preserved between BGP peers. We use the following notation to represent messages: an announcement of a new path by node i sent to its neighboring node j is given as i -> j [path], where path is the set of nodes starting with node i. Similarly, a withdrawal message originated at node i is represented by i -> j [∞]. We also represent a withdrawal message, or the absence of a valid path, with a W. So, for example, "0 -> 1 01R" at stage 1 denotes that AS0 has sent a route announcement to AS1 with the path 01R. Similarly, R -> 0 W at stage 1 indicates that R has sent a withdrawal to AS0.

[Figure 5: Example of BGP bouncing problem. The figure traces the routing tables, messages processed, and messages queued at each of the first six stages and at the final stage (48).]

As the full example includes over forty stages, we present only the first six stages and the last stage in Figure 5 for clarity. The main goal of the example is to illustrate the exploration of ever increasing ASPath lengths and the generation of large numbers of update messages during convergence. At stage 0, route R is withdrawn following a fault. All three ASes in stage 1 then invalidate their directly connected paths of length 1 and choose secondary paths: AS0 selects 1R, AS1 selects 0R and AS2 selects 0R. The three ASes also announce these new active routes to each of their neighbors. In the next stages (2 through 4), AS0 detects a looped path from AS1 and AS2, and invalidates both of these routes. Lacking a valid route to R, AS0 then sends out withdrawal messages to both neighbors. Upon receipt of this withdrawal, AS1 and AS2 again fail over to secondary routes (AS1 via 20R, and AS2 via 10R). In the final stages of the example, AS1 and AS2 detect the mutual route dependency through each other via the exchange of looped BGP ASPaths. Finally, at stage 48 the system converges with all routes withdrawn.
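The following runnable sketch (our simplification, not the authors' simulator) reproduces this bouncing behavior under the model above: a full mesh of n ASes, one destination R, a single global FIFO queue, no MinRouteAdver, and receiver-side ASPath loop detection. It counts the messages processed after R withdraws:

    from collections import deque

    def simulate(n):
        # table[i][j]: the path (tuple ending in "R") node i holds via
        # neighbor j; the extra key "direct" is i's own link to R.
        table = {i: dict({j: (j, "R") for j in range(n) if j != i},
                         direct=("R",))
                 for i in range(n)}
        queue = deque(("R", i, None) for i in range(n))  # R withdraws
        msgs = 0

        def best(i):
            valid = [p for p in table[i].values() if p is not None]
            return min(valid, key=len) if valid else None

        while queue:
            src, dst, path = queue.popleft()
            msgs += 1
            old = best(dst)
            key = "direct" if src == "R" else src
            # Receiver-side loop detection: a path containing dst itself
            # invalidates whatever dst previously held via that neighbor.
            table[dst][key] = None if (path and dst in path) else path
            new = best(dst)
            if new != old:  # best route changed: tell every neighbor
                for j in range(n):
                    if j != dst:
                        queue.append((dst, j, ((dst,) + new) if new else None))
        return msgs

    for n in (3, 4, 5):
        print(n, simulate(n))  # message count grows rapidly with n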

The intuition behind the large number of messages generated in this example is that the adoption of the path vector in BGP exponentially exacerbates the bouncing problem [5]. We note that the loop detection mechanism in BGP resolves the RIP routing table looping problem, where a given node reuses information in a new path that the node itself originally initiated. The ASPath mechanism, however, does not prevent an AS from learning of a new, invalid path from a neighbor. For example, in stage 3 of Figure 5, AS2 processes the queued 1 -> 2 10R message from AS1 and selects this invalid route as a new active path. AS2 then appends its own AS number and propagates the new invalid 210R path to each of its neighbors.

Intuitively, the most significant difference between the convergence behavior of traditional DV algorithms and BGP is that DVs are strictly increasing, whereas BGP is monotonically increasing. Traditional DVs will explore one, and only one, route associated with each distance metric value. In contrast, BGP has n! possible paths in a network of n nodes. We show in the next SubSection that in the worst case, long link/queuing or processing delays can result in an ordering of messages such that BGP will explore all possible paths of all possible lengths. We note that such an ordering represents the upper bound on BGP convergence and is unlikely to occur in practice.

5.1 Upper Bound on Convergence

In this Section, we provide an upper bound on the convergence time for a network of n BGP autonomous systems. As discussed earlier, we initially assume unbounded delay on message propagation. We begin with several observations:

Observation 1: For a complete graph of n nodes, there exist O((n - 1)!) distinct paths to reach a particular destination.

To show this, we note that there exists a total of (n - 1) paths of length 1 to reach a particular destination in a complete graph. Any other path of length greater than 1 must use one of these (n - 1) paths as the last hop in order to reach that destination. For example, there are exactly (n - 1) * (n - 2) paths of length 2 in a complete graph. Therefore, the sum of all paths can be written as a series sum:

P[n] = (n - 1) + (n - 1)(n - 2) + ... + (n - 1)!

The above expression can be rewritten as:

P[n] = (n - 1)! [1 + 1/2! + 1/3! + ... + 1/(n - 2)!]

which is closely approximated by P[n] = O((n - 1)!). This is an upper bound on the number of all possible paths to any destination in a complete graph of size n.
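A quick numeric check of this series (our sketch): P[n] counts the k-hop paths, (n-1)(n-2)...(n-k), for each k, and tracks (n - 1)! closely.

    from math import factorial

    def paths(n):
        # Sum over k = 1 .. n-1 of the number of k-hop paths,
        # (n-1)(n-2)...(n-k) = (n-1)!/(n-1-k)!.
        return sum(factorial(n - 1) // factorial(n - 1 - k)
                   for k in range(1, n))

    for n in (4, 6, 8):
        print(n, paths(n), factorial(n - 1))  # P[n] grows like (n-1)!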

Observation 2: When a particular route is withdrawn, a path vector algorithm attempts to find an alternate path by iterating on the available paths of equal or increasing length. We refer to this as a k-level iteration of the algorithm. At the k-th iteration, the algorithm looks at paths spanning at most k edges of the graph.

Observation 3: The conditions necessary for the worst case convergence are:

(i) A complete graph, i.e., all nodes have a degree of (n - 1).

(ii) All messages (both announcements and withdrawals) are processed in sequence, i.e., only one message is allowed to be processed at a time. Such 'serialization' of the BGP algorithm may arise in practice if there are long link delays in a network.

(iii) The messages generated in each k-level iteration are reordered at the beginning of each iteration. Those messages that invalidate the currently installed path at each node are favored and processed ahead of the others.

With these definitions, it is straightforward to construct a sequence of messages between any two nodes i and j for each k-level iteration. Consider the routing table at node i of a network at time t: (*013, 103, ∞, ∞). In this case, node i has two possible paths to the destination via its two neighboring nodes 0 and 1, respectively. Let us assume that node i receives a new announcement from its neighbor, node 1: 1 -> i [1i3]. Since this newly announced path creates a routing loop, node i rejects it and also deletes path 103 from its routing table. The only effect of the announcement is the deletion of an alternative path from the routing table. No new update is generated at node i for its neighbors. We consider such announcements a necessity for rapid convergence of a network following the withdrawal of a route, since the removal of path 103 prevents it from being propagated during the next k-level (k = 4) iteration as a new path i103.

On the other hand, suppose that node i receives an announcement from a different neighbor, node 0 (instead of node 1): 0 -> i [0i3]. This time, however, path 013 is withdrawn and a new path i103 is announced by node i. This leads to more iterations of the shortest path algorithm until every possible path containing i103 has been explored.

The above discussion points out an important characteristic of BGP. In the absence of a fixed timer such as MinRouteAdver, the order in which announcements are processed at a node influences the rate of convergence for a path-vector algorithm.

Observation 4: If the conditions in Observation 3 are applied to all new announcement messages generated at any k-level, the algorithm will continue until all possible paths have been explored. Once the set of all possible paths is exhausted, the algorithm will stop after processing the final withdrawal messages. This is the basis of our conjecture that the complexity for the worst case is O((n - 1)!).

Observation 5: The communication complexity, or the number of announcements and withdrawals, is much larger than the bound on the number of states, O((n - 1)!). Each announcement of a new path is forwarded to all (n - 1) neighbors of an AS, thereby generating (n - 1) * O((n - 1)!) messages until convergence. The number of initial withdrawals is (n - 1) and, in the worst case, the final iteration (i.e., k = n - 1) generates (n - 1)! messages, each of which ends in a withdrawal. Depending on the implementation details of BGP, this may result in O((n - 1)!) withdrawals for the worst case. Therefore, for the worst-case BGP model, the number of messages (both withdrawals and announcements) grows faster than exponentially with n.

We present an algorithm that provides an ordering of messages as per condition (iii) in Observation 3, while preserving the essential features of BGP, in the Appendix of [15]. The algorithm forces the path-vector algorithm to explore all k = 1, 2, ..., (n - 1)-length paths until convergence and results in the worst-case behavior of BGP. As pointed out in a later Section, the best case convergence for BGP can be achieved in O(n) stages. Since the Internet is not a complete graph and the link delays vary widely, the convergence behavior in practice will be in between these two bounds. We describe an artificially severe worst-case algorithm in this Section and [15] to provide a loose upper bound on BGP convergence and demonstrate the vulnerability of the BGP protocol to long or unbounded message delays. We believe our study fills an important gap in the analysis of path-vector algorithms.

5.2 Lower Bound on Convergence

We now examine BGP convergence under the assumption of bounded message delay. Although BGP does not place bounds on message propagation time, operator experience has shown that the vast majority of BGP messages propagate between two peers within several seconds. As noted earlier, the assumption of bounded delay limits the re-ordering of messages that may occur (as demonstrated in Figure 5) and provides a more realistic model of BGP convergence.

Figure 6 provides an example of BGP convergence for a four node full mesh topology. As in the previous example, all nodes are initially directly connected to a route R. At stage 0, route R is withdrawn and all four nodes fail over to secondary paths (AS0 to 1R, AS1 to 0R, AS2 to 0R, and AS3 to 0R). Unlike Figure 5, however, this example converges within 13 stages due to the synchronization added by the MinRouteAdver timers. We provide insight into the behavior of MinRouteAdver and its effect on the overall convergence of BGP in the next several observations.

We now show that with the adoption of the MinRouteAdver timer, the lower bound on convergence for BGP requires at least (n - 3) rounds of the MinRouteAdver timer in a complete graph, where n is the number of autonomous systems. We again refer to the graph of five nodes shown in Figure 6.


Observation 1: The best case algorithm with MinRouteAdver, when applied to a complete graph of size n, results in the complete withdrawal of at most one node at the end of the first round.

The following example illustrates the above observation in the event of a withdrawal of a route R which is initially directly connected to every node in the graph. The initial routing table at each node is represented in stage 0 of Figure 6.

In the event of a withdrawal message from node R, every node in the system except node 0 will choose the path 0R as the active route; node 0 will announce path 1R. Under the MinRouteAdver timer, node 0 will receive (n - 2) announcements from its neighbors and will try to replace its alternate paths (i.e., paths 1R, 2R, 3R etc.) with the newly received information. However, each of these new updates results in a loop and therefore node 0 removes all these paths. Node 0 then sends a withdrawal message to all its neighbors, as it no longer has a valid path to R.

Since the direct path of length one from any node, if available, is the best route to reach R, the above sequence of route withdrawal at a single node applies to any complete graph of size n, i.e., one of the nodes will always be withdrawn irrespective of the size of the graph.

Observation 2: The primary effect of a MinRouteAdver timer is to impose a monotonically increasing path metric for successive k-level iterations.

This is the most important contribution of the MinRouteAdver timer and also helps to intuitively explain the rapid convergence of general graphs in the event of a route failure. By 'monotonically increasing' paths, we mean that at the end of a MinRouteAdver round, only the next higher level paths (i.e., longer paths) will be announced. Consequently, under MinRouteAdver, there should be no pending path announcements of length k for a network when a (k+1)-length path has already been announced by any node. Under a MinRouteAdver timer, a node must process all (n - 1) announcements from its neighbors before it can send out a new update. The order in which it processes each announcement does not matter, since it receives only one message from each of its neighbors and must wait for the MinRouteAdver timer to expire before announcing a new path. A newly received path from a neighbor may either result in a loop or replace the existing path to that neighbor. If it replaces an existing path, we need to show that the path being replaced is a shorter path than the path replacing it. If this is true for all nodes, each of the nodes will send out a longer path in the next MinRouteAdver round. This will then ensure that only longer and longer ASPaths will be announced under MinRouteAdver. To see this, let us consider the 4-node example again.

Upon receiving the withdrawals from node R, twelve messages are generated, as shown in stage 1 of Figure 6. Let us consider the messages waiting to be processed at node 1. Its routing table currently consists of paths of length two: 1(*0R, ∞, 2R, 3R). However, each of the arriving messages at node 1 replaces the corresponding 2-length path with a 3-length path. As a result, once all (n - 1) messages have been processed at node 1 under the MinRouteAdver timer, its routing table has the following entries: 1(∞, ∞, *20R, 30R). A new, longer (k = 4) path 120R is therefore announced to its neighbors in the next iteration at the end of stage 5. Let us contrast this situation with the case when no MinRouteAdver timer is allowed. In this case, node 1 will process only one message before it announces a new path. If the particular message 0 -> 1 [01R] was processed (without the MinRouteAdver timer), the routing table at node 1 would become 1(∞, ∞, *2R, 3R), resulting in the same-length path 12R being announced to its neighbors.

The overall convergence of BGP under MinRouteAdver is as follows. As shown above, the very first round of the timer results in announcements of paths of length 2, which cause one of the nodes to delete all paths in its routing table. In the next round, paths of length 3 are announced. These messages will result in a different node being completely withdrawn. The process continues until the longest path (of length (n - 1)) is announced from each of the remaining nodes, resulting in all nodes being withdrawn. The important observation here is that for a complete graph of size n, an announcement for a path of length k will cause a routing loop at (k - 1) nodes in the graph. The role of MinRouteAdver in a complete graph is to ensure that all newly announced paths of length k are processed and loops at (k - 1) nodes are detected, so that in the next round, only longer paths are announced.

By following the routing tables at other nodes in the example graph, one can confirm the same observation as above, i.e., only increasingly longer paths will be announced under the MinRouteAdver timer. Therefore, the effect of the MinRouteAdver timer is to impose a global state synchronization which results in the deletion of all k-length paths before a new, longer (k + 1)-length path is announced by any node.

Observation 3: Since k_max = n - 1 and each MinRouteAdver timer deletes paths of length k at the k-th iteration, there will be at least (n - 1) MinRouteAdver rounds for the best-case algorithm when applied to a complete graph of size n. (This follows readily from Observation 2.)

Observation 4: The above estimate for the number of MinRouteAdver rounds can be further reduced to (n - 3) for a complete graph of size n greater than 3. This result follows from the observation that for complete graphs of size n ≤ 3, BGP converges within a single MinRouteAdver period in the event of a route withdrawal.
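In wall-clock terms, with the conventional 30-second MinRouteAdver value this yields the ((n - 3) * 30) second lower bound quoted in the introduction; as a trivial sketch:

    def lower_bound_secs(n, min_route_adver=30):
        # (n - 3) MinRouteAdver rounds for a full mesh of n > 3 ASes;
        # n <= 3 converges within a single MinRouteAdver period.
        rounds = n - 3 if n > 3 else 1
        return rounds * min_route_adver

    assert lower_bound_secs(10) == 210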

We re-emphasize that the above observations are valid when the best-case algorithm with the MinRouteAdver timer is applied to a complete graph. The degree to which MinRouteAdver preserves the monotonicity of each k-level iteration in incomplete graphs is a topic of our current research.

6. ANALYSIS OF RESULTS

Armed with a model of BGP convergence, we now return to the results presented in Section 4. Why do Tup/Tshort converge more quickly than Tdown/Tlong? The explanation lies in the observation that, like the comparison between DV algorithms and BGP, Tup/Tshort are strictly increasing while Tdown/Tlong are only monotonically increasing: during Tdown a node may announce several distinct paths of the same length (as with 10R and 12R in the example above), whereas each Tup announcement carries a strictly longer path than the one before it.

Figure 6: Example of BGP bouncing problem with MinRouteAdver. (The figure's per-stage message trace and routing tables are omitted from this text version.)

Intuitively, once a node receives an update during Tup and selects an active path, the node will never choose a route with a longer path. In contrast, since the Tdown implicit metric of infinity is longer than all possible ASPaths, each node will fail over to secondary paths until all paths have been eliminated. If we assume bounded delays, then Tup has a computational complexity of O(1) and Tdown of O(n) for a network of n autonomous systems.
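Stated compactly (our restatement of the argument above, not a new result): because announced path lengths move in only one direction, the number of distinct best paths a node can select bounds its update count,

\[ C_{\mathrm{Tup}} = O(1), \qquad C_{\mathrm{Tdown}} = O(n), \]

since during Tdown a node may step through failover paths of length 2, 3, ..., (n - 1) before withdrawing, while during Tup the first selected path is never replaced by a longer one.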

Unlike Tup/Tshort, Figure 2(a) shows a slight variation between the relative latencies of Tlong and Tdown. Due to the effects of MinRouteAdver, we might expect Tlong to converge at the same rate as, or more slowly than, Tdown. Analysis of the data, however, shows that if the prepended ASPath associated with a Tlong is not sufficiently long, the route might still be preferred over shorter paths at some point during convergence. In effect, these Tlongs would resemble both Tshort and Tdown and represent the average of the two. In our experiments, we observed a small number of paths with lengths four times the steady-state average following Tdown and Tlong events. As described in Section 3, we associated a path of only three times the steady-state average with the injected Tlongs.

Although we did not associate a sufficiently long ASPath with Tlong to render Tshort completely indistinguishable from Tup, or Tdown indistinguishable from Tlong, Tshort/Tup enjoy the property that routing information associated with the shortest ASPath will usually propagate faster than routing information associated with longer paths. This speed advantage arises because, in the absence of prepending policies which create artificially long paths, longer ASPaths by definition are formed by routing information traveling through more BGP autonomous systems, each of which adds some additional latency. Although convergence following Tshort theoretically may have introduced added oscillations over Tup as the system explored ASPaths longer than the injected Tlong path, such oscillations are unlikely in practice.

In Figure 3, we described significant variations between the convergence latencies of five ISPs. We noted that these differences were independent of both geographic and network distance. As we showed in Section 5.2, if the Internet were truly a complete mesh, we would expect all ASes to exhibit the same convergence behavior. Instead, analysis of the data shows that these variations relate directly to a number of topological factors, including the length and number of possible paths between an AS and a given destination. The number of available paths is a function of peering relationships, transit policies/agreements, and the implementation of filters by both the AS and downstream ASes.


(a) Unbounded

Nodes   Time (sec)   States   Messages
4       N/A          12       41
5       N/A          60       306
6       N/A          320      2571
7       N/A          1955     23823

(b) MinRouteAdver

Nodes   Time (sec)   States   Messages
4       30           11       26
5       60           26       54
6       90           50       92
7       120          85       140

(c) Modified

Nodes   Time (sec)   States   Messages
4       30           11       26
5       30           23       54
6       30           39       92
7       30           59       140

Figure 7: Simulation results for convergence with unbounded delay, MinRouteAdver, and modified MinRouteAdver.

Analysis of Figure 3(a) also shows that the Tdown convergence times of between 0 and 180 seconds relate directly to the number of MinRouteAdver rounds. Our data shows a strong correlation between the average ASPath length during Tdown events and convergence latency. Specifically, ISP1, as the point of injection, always announced routes of length one; ISP3 averaged ASPaths of length 2.6; and ISP5 averaged ASPaths of length 6. These results correspond with our 30(n - 3) lower bound on MinRouteAdver convergence times.
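As a back-of-the-envelope check (ours; the linear constant is an assumption, not a measured fit), suppose each additional AS hop in an ISP's steady-state path costs roughly one 30-second MinRouteAdver round during Tdown:

\[ T_{\mathrm{down}}(\ell) \;\approx\; 30\,\ell\ \text{seconds}. \]

For the measured average path lengths this gives roughly 30 seconds for ISP1 (ell = 1), 78 seconds for ISP3 (ell = 2.6), and 180 seconds for ISP5 (ell = 6), spanning the same 0 to 180 second range observed in Figure 3(a).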

Finally, we examine the 0 to 30 second convergence latencies exhibited in Figure 3(b). As described earlier, Tup events are strictly increasing and do not typically generate multiple announcements. Figure 2 shows that most ISPs average one update message following a Tup event. Since MinRouteAdver does not impact the first announcement of a route, we might expect Tup latencies to be significantly less than 30 seconds, as they would reflect only the network latency and router processing delays along a single path. Discussion with a major router vendor, however, indicates that at least one widely deployed router implements MinRouteAdver on a per-peer basis instead of per (destination prefix, peer) tuple. We emphasize that this implementation choice is in accordance with the BGP specification [21] and may improve router memory utilization. A per-peer timer, however, introduces some portion of the MinRouteAdver delay into Tup/Tshort updates. If a router has sent any update to a given peer within the last 30 seconds, then a new Tup announcement destined for the same peer will also be delayed until the expiration of the per-peer MinRouteAdver timer.
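The difference between the two timer granularities is easy to see in a few lines of code. The sketch below is our own illustration (RateLimiter, earliest_send, and the peer/prefix names are all hypothetical, not vendor code): with a per-(prefix, peer) timer, a new announcement for one prefix is never held back by recent chatter about another; with a per-peer timer it is.

  # Illustrative sketch (ours) of the two MinRouteAdver granularities.
  MIN_ROUTE_ADVER = 30.0  # seconds

  class RateLimiter:
      def __init__(self, per_peer):
          self.per_peer = per_peer  # True: key = peer; False: key = (prefix, peer)
          self.last_sent = {}       # key -> time the last update was released

      def earliest_send(self, now, peer, prefix):
          """Return the time at which an update for (peer, prefix) may go out."""
          key = peer if self.per_peer else (prefix, peer)
          ready = self.last_sent.get(key, float("-inf")) + MIN_ROUTE_ADVER
          send_at = max(now, ready)
          self.last_sent[key] = send_at
          return send_at

  for per_peer in (False, True):
      rl = RateLimiter(per_peer)
      rl.earliest_send(0.0, "peer1", "prefix_A")      # unrelated update at t = 0
      t = rl.earliest_send(5.0, "peer1", "prefix_B")  # new Tup announcement at t = 5
      print(f"per_peer={per_peer}: prefix_B released at t={t:.0f}s")
  # Prints t=5s with a per-(prefix, peer) timer, but t=30s with a per-peer
  # timer: the unrelated prefix_A update delays the prefix_B announcement.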

In general, while MinRouteAdver significantly reduces the computational and communication complexity of BGP convergence, the timer also artificially creates multiple thirty-second rounds which delay end-to-end failover in most cases. As we showed in Section 5.2, these rounds form due to the delay in the exchange of path vectors containing mutually dependent routes. Although the BGP specification [21] describes ASPath loop detection, it does not specify where the detection should occur. Analysis of our data and discussions with vendors indicate that most commercial routers perform loop detection only upon receipt of a route update. We distinguish this receiver-side loop detection from the route inspection and invalidation performed by a sender before the origination of a looped update.

Figure 6 illustrates the delay introduced by receiver-side-only loop detection. At stage 4, AS0 and AS3 share mutually dependent routes: AS0 has an active route via 3R, and AS3 has an active route via 01R. At the end of stage 4, AS3 delays sending the new 01R path to all three of its neighbors due to the operation of its MinRouteAdver timer. Only after its MinRouteAdver timer expires will AS3 send the "AS3 -> AS0: 301R" BGP update message. Upon receipt of this looped path in stage 5, AS0 will invalidate the path via AS3 and send BGP withdrawals to each of its neighbors. The example encounters a similar mutual dependency between AS2 and AS3 at the end of stage 8.

We note that if loop detection is performed on both the sender and receiver side, in the best case all mutual dependencies will be discovered and eliminated within a single round. Returning again to Figure 6, we observe that AS3 at the end of stage 4 could invalidate the "AS3 -> AS0: 301R" message and send an explicit withdrawal to AS0. Since withdrawals are not impacted by MinRouteAdver according to the standard [21], AS3 and AS0 would learn of their mutual dependency within a single MinRouteAdver round.
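A minimal sketch of this sender-side check (ours; updates_for_peers and its signature are illustrative, not a vendor implementation): before queueing an announcement for a peer, the sender tests whether that peer's AS already appears in the ASPath, and if so emits an explicit, undelayed withdrawal instead of a doomed announcement.

  # Sender-side ASPath loop detection: a looped announcement could only be
  # rejected by the receiver anyway, so send an explicit withdrawal instead.
  # Per [21], withdrawals are not held back by the MinRouteAdver timer.
  def updates_for_peers(my_as, best_path, peers):
      """best_path is the ASPath (list of hops) or None if no route remains."""
      out = {}
      for peer in peers:
          if best_path is None or peer in best_path:
              out[peer] = ("WITHDRAW", None)                 # sent immediately
          else:
              out[peer] = ("ANNOUNCE", [my_as] + best_path)  # MinRouteAdver-delayed
      return out

  # Stage 4 of Figure 6: AS3's new best path to R is via AS0 (path 01R).
  print(updates_for_peers(3, [0, 1, "R"], peers=[0, 1, 2]))
  # AS0 and AS1 receive immediate withdrawals (301R would loop at both), so
  # the AS0/AS3 mutual dependency resolves within a single round; only AS2
  # receives the delayed announcement 301R.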

Figure 7(c) provides simulation results for MinRouteAdver modified to perform sender-side loop detection. We note that for all node counts, modified MinRouteAdver converges within a single thirty-second round. We also observe that although the communication complexity remains the same, modified MinRouteAdver exhibits improved state complexity over unmodified MinRouteAdver.

We discussed this proposed modification to MinRouteAdver with a number of router vendors, and at least one indicated that all future versions of a widely deployed router will include both sender- and receiver-side ASPath loop detection. The elimination of rounds, however, requires that the router not apply MinRouteAdver to withdrawals, as specified in [21]. At least one major router vendor has made an implementation decision to apply MinRouteAdver to both announcements and withdrawals. A discussion of the motivation and engineering trade-offs of applying MinRouteAdver to withdrawals is outside the scope of this paper and remains an active area of our current research.

7. CONCLUSION

As the national and economic infrastructure becomes increasingly dependent on the global Internet, the availability and scalability of IP-based networks will emerge as among the most significant problems facing the continued evolution of the Internet. This paper has argued that the lack of rapid inter-domain failover due to delayed BGP routing convergence will potentially become one of the key factors contributing to the "gap" between the needs and expectations of today's


data networks. In this paper, we demonstrated that multi-homed failover now averages three minutes and may trigger oscillations lasting as long as fifteen minutes. Further, we showed that these delays will grow linearly with the addition of new autonomous systems to the Internet in the best case, and exponentially in the worst. These results suggest a strong need to reevaluate applications and protocols, including emerging QoS and VoIP standards [11], which assume a stable underlying inter-domain forwarding infrastructure and fast IP path restoral.

This paper also suggested specific changes to vendor BGP implementations which, if deployed, would significantly improve Internet convergence latencies. But even with our suggested changes to ASPath loop detection, BGP path changes will still trigger temporary oscillations and require many seconds longer than current PSTN restoral times. We could certainly improve BGP convergence further through the addition of synchronization, diffusing updates [7], and additional state information [5], but all of these changes to BGP come at the expense of a more complex protocol and increased router overhead. The extraordinary growth and success of the Internet is arguably due to the scalability and simplicity of the underlying protocols. The implications of this trade-off between the scalability of wide-area routing protocols and the growing need for fault tolerance in the Internet are an active area of our current research.

Acknowledgments

We wish to thank Anish Arora, Randy Bush, Tony Li, Pedro Marques, Susan R. Harris, Bert Rossi, John Scudder and Bob Stovall for their comments and helpful insights. We also thank the members of the Internet service provider community for their comments, support, and willingness to permit our instrumentation of their networks. Finally, we thank the SIGCOMM 2000 anonymous referees for their feedback and constructive criticism.

8. REFERENCES

[1] Internet Performance Measurement and Analysis Project (IPMA). http://www.merit.edu/ipma.

[2] Multithreaded Routing Toolkit (MRT) Project. http://www.mrtd.net.

[3] North American Network Operators Group(NANOG). http://www.nanog.org.

[4] K. Bhargavan, D. Obradovic, and C. Gunter. Formal Verification of Distance Vector Routing Protocols. International Conference on Theorem Proving and Higher Order Logics, Aug. 2000.

[5] C. Cheng, R. Riley, S. Kumar, and J. J. Garcia-Luna-Aceves. A Loop-Free Extended Bellman-Ford Routing Protocol Without Bouncing Effect. Proc. of the ACM SIGCOMM, pages 224-236, Aug. 1989.

[6] L. Gao and J. Rexford. Stable Internet RoutingWithout Global Coordination. Proc. of ACMSIGMETRICS, June 2000.

[7] J. Garcia-Luna-Aceves. Loop-free Routing Using Diffusing Computations. IEEE/ACM Transactions on Networking, Feb. 1993.

[8] T. Griffin and G. Wilfong. An Analysis of BGP Convergence Properties. Proc. of the ACM SIGCOMM, Aug. 1999.

[9] T. Griffin and G. Wilfong. A Safe Path Vector Protocol. Proc. of IEEE INFOCOM, Mar. 2000.

[10] B. Halabi. Internet Routing Architecture. Cisco Press,1997.

[11] D. Hampton, H. Salama, and D. Shah. The IPTelephony Border Gateway Protocol (TBGP), June1999. draft-ietf-iptel-glp-tbgp-01.txt.

[12] J. Hawkinson. Cisco Routing FAQ.http://www.faqs.org/faqs/cisco-networking-faq.

[13] M. Kaat. Overview of 1999 IAB Network LayerWorkshop, Nov. 1999. http://www.ietf.org/internet-drafts/draft-ietf-iab-ntwlyrws-over-01.txt.

[14] C. Labovitz. Scalability of the Internet BackboneRouting Infrastructure. PhD thesis, University ofMichigan, Aug. 1999.

[15] C. Labovitz, A. Ahuja, A. Bose, and F. Jahanian. AStudy of Delayed BGP Convergence. Technical ReportMSR-TR-2000-08, Microsoft Research, Feb. 2000.

[16] C. Labovitz, A. Ahuja, and F. Jahanian.Experimental Study of Internet Stability andWide-area Network Failures. Proc. InternationalSymposium on Fault-Tolerant Computing, June 1999.

[17] C. Labovitz, G. Malan, and F. Jahanian. InternetRouting Instability. IEEE/ACM Transactions onNetworking, Aug. 1997.

[18] C. Labovitz, G. Malan, and F. Jahanian. Origins ofPathological Internet Routing Instability. Proc. of theIEEE INFOCOM, Mar. 1999.

[19] V. Paxson. End-to-End Internet Packet Dynamics.Proc. of the ACM SIGCOMM, 1997.

[20] R. Perlman. Interconnections, Second Edition. Addison-Wesley, Reading, Massachusetts, 1999.

[21] Y. Rekhter and T. Li. A Border Gateway Protocol 4(BGP4), Sept. 1999. draft-ietf-idr-bgp4-09.txt.

[22] F. Schneider, S. Bellovin, and A. Inouye. Building Trustworthy Systems: Lessons from the PTN and Internet. Internet Computing, pages 64-72, Nov. 1999.

[23] K. Varadhan, R. Govindan, and D. Estrin. PersistentRoute Oscillations in Inter-Domain Routing. TechnicalReport USC CS TR 96-631, Department of ComputerScience, University of Southern California, Feb. 1996.

[24] W. Zaumen and J. J. Garcia-Luna-Aceves. Dynamics of Distributed Shortest-Path Routing Algorithms. Proc. of the ACM SIGCOMM, Aug. 1991.
