Improving the Reliability of Internet Paths with One-hop Source Routing Krishna Gummadi, Harsha Madhyastha Steve Gribble, Hank Levy, David Wetherall Department

Improving the Reliability of Internet Paths with One-hop Source Routing

Krishna Gummadi, Harsha MadhyasthaSteve Gribble, Hank Levy, David Wetherall

Department of Computer Science and EngineeringUniversity of Washington

Seattle, WA

Reliability of Internet paths

• Enormous interest in understanding Internet path reliability

• Proposals to improve reliability using indirection routing– RON, Detour

• Current implementations maintain complex overlays that do not scale

This talk

• What are the failure characteristics of Internet paths?

• What do they imply about reliability benefits of indirection routing?

• Can a simple, stateless, scalable scheme realize these benefits?

• What benefits would end-users see in practice?– for a real-application, such as Web browsing

Outline

• Introduction

• Measurement study of Internet path failures

• One-hop source routing

• An implementation study of SOSR

• Conclusions

Measurement study of path failures

• We conducted a week long measurement study– probed 3,153 destinations from 67 Planetlab sites

– each destination is probed from exactly one node

• Our goal is to answer the following:– How often do paths fail?

– Where do failures occur?

– How long do failures last?

Choosing destinations

• We want to understand how the network paths to servers and broadband hosts differ– it has implications for different workloads/apps

• Web transfers between servers and broadband hosts• VOIP apps between broadband hosts

• We chose 3153 destinations:– 378 popular web servers

– 1,139 broadband hosts

– 1,636 randomly selected IPs

Detecting path failures

• Each probe (response) is a TCP ACK (RST) packet– default probe frequency: one every 15 seconds

• Upon a single probe response loss, we:– increase probe frequency: one every 5 seconds

• till 10 consecutive probe responses are received

– perform traceroute to detect failure location

• A path fails when 3 consecutive probes and traceroute fail

How often do paths fail?

• Failures do happen, but not frequently– on average each path sees 6 failures/week

– server paths see 4 failures/week

– broadband paths see 7 failures/week

• Most paths see at least one failure in a week– 85% of all paths

– 78% of server paths

– 88% of broadband paths

Categories of failure locations

• Categories help distinguish between core and edge failures

Source Destination

Local ISP Local ISP

Tier1 ISP Tier1 ISP

source_side

core

dst_side

last_hop

Where do paths fail?

• Server path failures occur throughout the network– very few (16%) last_hop failures

– suggests network is the dominating cause for server unavailability

0

0.5

1

1.5

2

2.5

3

3.5

servers

src_ side core dst_ side last_ hop

av

g.

# o

f fa

ilu

res

p

er

pa

th p

er

we

ek

Where do paths fail?

• Most of the broadband failures happen on last_hop

• Excluding last_hop, server and broadband paths see similar number of failures

0

0.5

1

1.5

2

2.5

3

3.5

servers

broadband

src_ side core dst_ side last_ hop

av

g.

# o

f fa

ilu

res

p

er

pa

th p

er

we

ek

How long do failures last?

• Failure durations are highly skewed

• Majority of failures are short– median failure duration: 1-2 min for all paths

– median path availability: 99.9% for all paths

• A non-negligible fraction of paths see long failures– tend to occur on last_hop

– mean path availabilty: 99.6% for servers and 94.4% for broadband

Implications for indirection routing

• Failures happen often enough that they are worth fixing

• But, they are rare enough that recovery schemes should be inexpensive under normal conditions

• Failures near the end-nodes limit the performance of indirection routing

– good news: servers see very few failures near end hosts

– bad news: broadband hosts see many last_hop failures

Outline

• Introduction




• Conclusions

One-hop source routing

• Use default path under normal conditions• When default path fails, source attempts to recover by

routing through an intermediary

src dstX

intermediate

Our goals

• Understand the potential reliability benefits of one-hop source routing

• Design a simple stateless, scalable scheme to realize this potential

Evaluating one-hop source routing

• For each path failure during the week-long trace– we sent probes via intermediaries at 39 Planetlab sites

• Compared the success of probes along default and intermediate paths– estimate the maximum potential of any one-hop scheme

– estimate success rate of specific one-hop scheme

• A failure is recoverable if any of the 39 intermediaries help

• Server failures more recoverable than broadband

• Almost all Internet core failures can be avoided through one-hop routing

Potential of any one-hop routing scheme

percent of failures that are recoverable

servers broadband

src_side 54% 55%

core 92% 90%

dst_side 79% 66%

last_hop 41% 12%

all 66% 39%

What fraction of intermediaries help in recovery?

0

13

26

39

0 0.2 0.4 0.6 0.8 1CDF of failures recovered

# o

f u

sefu

l in

term

ed

iari

es

• For most failures, > half of the intermediaries avoid the failure• All we need to do is find one of them!• Suggests that a randomly selected intermediary might work

22

75%

How effective is a random policy?• Random-k: Pick K intermediaries at random• Random-4 delivers near-optimal success rate

– requires no a priori probing or state

0

0.2

0.4

0.6

0.8

0 10 20 30 40k (number of attempted intermediaries)

fra

cti

on

of

fail

ure

s r

ec

ov

ere

d

servers

broadband

Recovery latency with random-4• Random-4 either helps early or not at all

– nearly 60% failures recovered in 5-10 seconds

• After that, we have to wait for paths to self-repair• So, initiate and abandon recovery early

0

0.2

0.4

0.6

0.8

1

10 15 20 25 30 35 40 45 50 55

time after fault (seconds)

fra

cti

on

of

fail

ure

s

rec

ov

ere

d

random-4 succeeds default path self-repairs

Server failures

Outline

• Introduction




• Conclusions

SOSR implementation

• Validate random-4 policy in practice using a real application, Web browsing

• SOSR: Scalable One-hop Source Routing• Implemented in linux

– transparent to destinations (NAT on intermediate nodes)

– transparent to applications on source node (netfilter)

– extensible (can plug in policies)

Evaluating SOSR implementation

• Ran two clients one with and another without SOSR– repeatedly fetched Web pages from 982 popular servers

– both machines located at UW

– one request per second over 3 days

• Client 1: default wget command-line web browser

• Client 2: default wget + SOSR with random-4 policy– deployed intermediaries on 39 Planetlab nodes

User perceived benefits of SOSR

• wget succeeds 99.8% of time– Web seems pretty reliable

• A SOSR user sees only 20% fewer failures!– not clear whether SOSR matters for Web

requests failures

wget 273,978 481

wget

SOSR273,978 383

User perceived benefits of SOSR

• SOSR recovers from 56% of network failures• But, can’t recover from application failures• 62% of wget + SOSR failures are application related

network level

failures

application level failuresHTTP error codes

TCP refused

HTTP refused

HTTP timeout

wget 328 40 78 35 44

wget

SOSR145 41 101 96 37

Conclusions

• What are the failure characteristics of Internet paths?– failures do happen, but they are short and infrequent

– many occur on last_hop for broadband paths

• What do they imply about reliability benefits of indirection routing?– recovery must be cheap in the common case

– one-hop source routing recovers from 66% of server and 39% of broadband path failures

Conclusions

• Can a simple, stateless, scalable scheme realize these benefits?– random-4 realizes the potential of any one-hop scheme

– no cost in common case

– no a priori probing or state needed

• What benefits would end-users see in practice for real applications?– Web users see only 20% fewer failures

– many application-level failures

Conclusions

• Is indirection routing useful or not?– pessimistic view: not for the Web

– optimistic view: perhaps for other applications, like VoIP

For more information

Visit our research group website:

http://www.cs.washington.edu/research/networking/websys

Documents

Improving the Reliability of Internet Paths with One-hop Source Routing Krishna Gummadi, Harsha Madhyastha Steve Gribble, Hank Levy, David Wetherall Department