Upload
john-crawford
View
215
Download
0
Tags:
Embed Size (px)
Citation preview
Improving the Reliability of Internet Paths with One-hop Source Routing
Krishna Gummadi, Harsha MadhyasthaSteve Gribble, Hank Levy, David Wetherall
Department of Computer Science and EngineeringUniversity of Washington
Seattle, WA
Reliability of Internet paths
• Enormous interest in understanding Internet path reliability
• Proposals to improve reliability using indirection routing– RON, Detour
• Current implementations maintain complex overlays that do not scale
This talk
• What are the failure characteristics of Internet paths?
• What do they imply about reliability benefits of indirection routing?
• Can a simple, stateless, scalable scheme realize these benefits?
• What benefits would end-users see in practice?– for a real-application, such as Web browsing
Outline
• Introduction
• Measurement study of Internet path failures
• One-hop source routing
• An implementation study of SOSR
• Conclusions
Measurement study of path failures
• We conducted a week long measurement study– probed 3,153 destinations from 67 Planetlab sites
– each destination is probed from exactly one node
• Our goal is to answer the following:– How often do paths fail?
– Where do failures occur?
– How long do failures last?
Choosing destinations
• We want to understand how the network paths to servers and broadband hosts differ– it has implications for different workloads/apps
• Web transfers between servers and broadband hosts• VOIP apps between broadband hosts
• We chose 3153 destinations:– 378 popular web servers
– 1,139 broadband hosts
– 1,636 randomly selected IPs
Detecting path failures
• Each probe (response) is a TCP ACK (RST) packet– default probe frequency: one every 15 seconds
• Upon a single probe response loss, we:– increase probe frequency: one every 5 seconds
• till 10 consecutive probe responses are received
– perform traceroute to detect failure location
• A path fails when 3 consecutive probes and traceroute fail
How often do paths fail?
• Failures do happen, but not frequently– on average each path sees 6 failures/week
– server paths see 4 failures/week
– broadband paths see 7 failures/week
• Most paths see at least one failure in a week– 85% of all paths
– 78% of server paths
– 88% of broadband paths
Categories of failure locations
• Categories help distinguish between core and edge failures
Source Destination
Local ISP Local ISP
Tier1 ISP Tier1 ISP
source_side
core
dst_side
last_hop
Where do paths fail?
• Server path failures occur throughout the network– very few (16%) last_hop failures
– suggests network is the dominating cause for server unavailability
0
0.5
1
1.5
2
2.5
3
3.5
servers
src_ side core dst_ side last_ hop
av
g.
# o
f fa
ilu
res
p
er
pa
th p
er
we
ek
Where do paths fail?
• Most of the broadband failures happen on last_hop
• Excluding last_hop, server and broadband paths see similar number of failures
0
0.5
1
1.5
2
2.5
3
3.5
servers
broadband
src_ side core dst_ side last_ hop
av
g.
# o
f fa
ilu
res
p
er
pa
th p
er
we
ek
How long do failures last?
• Failure durations are highly skewed
• Majority of failures are short– median failure duration: 1-2 min for all paths
– median path availability: 99.9% for all paths
• A non-negligible fraction of paths see long failures– tend to occur on last_hop
– mean path availabilty: 99.6% for servers and 94.4% for broadband
Implications for indirection routing
• Failures happen often enough that they are worth fixing
• But, they are rare enough that recovery schemes should be inexpensive under normal conditions
• Failures near the end-nodes limit the performance of indirection routing
– good news: servers see very few failures near end hosts
– bad news: broadband hosts see many last_hop failures
Outline
• Introduction
• Measurement study of Internet path failures
• One-hop source routing
• An implementation study of SOSR
• Conclusions
One-hop source routing
• Use default path under normal conditions• When default path fails, source attempts to recover by
routing through an intermediary
src dstX
intermediate
Our goals
• Understand the potential reliability benefits of one-hop source routing
• Design a simple stateless, scalable scheme to realize this potential
Evaluating one-hop source routing
• For each path failure during the week-long trace– we sent probes via intermediaries at 39 Planetlab sites
• Compared the success of probes along default and intermediate paths– estimate the maximum potential of any one-hop scheme
– estimate success rate of specific one-hop scheme
• A failure is recoverable if any of the 39 intermediaries help
• Server failures more recoverable than broadband
• Almost all Internet core failures can be avoided through one-hop routing
Potential of any one-hop routing scheme
percent of failures that are recoverable
servers broadband
src_side 54% 55%
core 92% 90%
dst_side 79% 66%
last_hop 41% 12%
all 66% 39%
What fraction of intermediaries help in recovery?
0
13
26
39
0 0.2 0.4 0.6 0.8 1CDF of failures recovered
# o
f u
sefu
l in
term
ed
iari
es
• For most failures, > half of the intermediaries avoid the failure• All we need to do is find one of them!• Suggests that a randomly selected intermediary might work
22
75%
How effective is a random policy?• Random-k: Pick K intermediaries at random• Random-4 delivers near-optimal success rate
– requires no a priori probing or state
0
0.2
0.4
0.6
0.8
0 10 20 30 40k (number of attempted intermediaries)
fra
cti
on
of
fail
ure
s r
ec
ov
ere
d
servers
broadband
Recovery latency with random-4• Random-4 either helps early or not at all
– nearly 60% failures recovered in 5-10 seconds
• After that, we have to wait for paths to self-repair• So, initiate and abandon recovery early
0
0.2
0.4
0.6
0.8
1
10 15 20 25 30 35 40 45 50 55
time after fault (seconds)
fra
cti
on
of
fail
ure
s
rec
ov
ere
d
random-4 succeeds default path self-repairs
Server failures
Outline
• Introduction
• Measurement study of Internet path failures
• One-hop source routing
• An implementation study of SOSR
• Conclusions
SOSR implementation
• Validate random-4 policy in practice using a real application, Web browsing
• SOSR: Scalable One-hop Source Routing• Implemented in linux
– transparent to destinations (NAT on intermediate nodes)
– transparent to applications on source node (netfilter)
– extensible (can plug in policies)
Evaluating SOSR implementation
• Ran two clients one with and another without SOSR– repeatedly fetched Web pages from 982 popular servers
– both machines located at UW
– one request per second over 3 days
• Client 1: default wget command-line web browser
• Client 2: default wget + SOSR with random-4 policy– deployed intermediaries on 39 Planetlab nodes
User perceived benefits of SOSR
• wget succeeds 99.8% of time– Web seems pretty reliable
• A SOSR user sees only 20% fewer failures!– not clear whether SOSR matters for Web
requests failures
wget 273,978 481
wget
SOSR273,978 383
User perceived benefits of SOSR
• SOSR recovers from 56% of network failures• But, can’t recover from application failures• 62% of wget + SOSR failures are application related
network level
failures
application level failuresHTTP error codes
TCP refused
HTTP refused
HTTP timeout
wget 328 40 78 35 44
wget
SOSR145 41 101 96 37
Conclusions
• What are the failure characteristics of Internet paths?– failures do happen, but they are short and infrequent
– many occur on last_hop for broadband paths
• What do they imply about reliability benefits of indirection routing?– recovery must be cheap in the common case
– one-hop source routing recovers from 66% of server and 39% of broadband path failures
Conclusions
• Can a simple, stateless, scalable scheme realize these benefits?– random-4 realizes the potential of any one-hop scheme
– no cost in common case
– no a priori probing or state needed
• What benefits would end-users see in practice for real applications?– Web users see only 20% fewer failures
– many application-level failures
Conclusions
• Is indirection routing useful or not?– pessimistic view: not for the Web
– optimistic view: perhaps for other applications, like VoIP
For more information
Visit our research group website:
http://www.cs.washington.edu/research/networking/websys