NORDUnet 2003, Reykjavik, Iceland, 26 August 2003
High-Performance Transport Protocols for Data-Intensive World-Wide Grids
S. Ravot, Caltech, USA
T. Kelly, University of Cambridge, UK
J.P. Martin-Flatin, CERN, Switzerland
Outline
- Overview of DataTAG project
- Problems with TCP in data-intensive Grids
  - Problem statement
  - Analysis and characterization
- Solutions: Scalable TCP, GridDT
- Future work
Overview of DataTAG Project
Member Organizations
http://www.datatag.org/
Project Objectives
- Build a testbed to experiment with massive file transfers (TBytes) across the Atlantic
- Provide high-performance protocols for gigabit networks underlying data-intensive Grids
- Guarantee interoperability between major HEP Grid projects in Europe and the USA
DataTAG Testbed

[Testbed diagram by Edoardo Martelli: Cisco 7606/7609, Juniper M10 and Alcatel 7770 routers, Extreme switches (S1i in Geneva, S5i in Chicago) and server farms at both ends, interconnected by STM-16 (DTag; Colt backup+projects; FranceTelecom to VTHD/INRIA) and STM-64 (GC) SDH/SONET circuits via ONS 15454 and Alcatel 1670 multiplexers, with connections to SURFnet, CESNET, GEANT, CNAF and cernh7; host links are 1000BaseSX, 1000BaseT and 10GBaseLX, plus a CCC tunnel]
Records Beaten Using DataTAG Testbed
- Internet2 IPv4 land speed record (February 27, 2003): 10,037 km at 2.38 Gbit/s for 3,700 s, MTU: 9,000 Bytes
- Internet2 IPv6 land speed record (May 6, 2003): 7,067 km at 983 Mbit/s for 3,600 s, MTU: 9,000 Bytes
http://lsr.internet2.edu/
Network Research Activities
- Enhance performance of network protocols for massive file transfers (rest of this talk):
  - Data-transport layer: TCP, UDP, SCTP
- QoS: LBE (Scavenger), Equivalent DiffServ (EDS)
- Bandwidth reservation: AAA-based bandwidth on demand; lightpaths managed as Grid resources
- Monitoring
Problems with TCP in Data-Intensive Grids
Problem Statement
- End-user's perspective: using TCP as the data-transport protocol for Grids leads to poor bandwidth utilization in fast WANs
- Network protocol designer's perspective: TCP is inefficient in high bandwidth*delay networks because:
  - few TCP implementations have been tuned for gigabit WANs
  - TCP was not designed with gigabit WANs in mind
Design Problems (1/2)
- TCP's congestion control algorithm (AIMD) is not suited to gigabit networks
- Due to TCP's limited feedback mechanisms, line errors are interpreted as congestion: bandwidth utilization is reduced when it shouldn't be
- RFC 2581 (which gives the formula for increasing cwnd) "forgot" delayed ACKs: loss recovery time is twice as long as it should be
Design Problems (2/2)
- TCP requires that ACKs be sent at most every second segment:
  - Causes ACK bursts
  - Bursts are difficult for the kernel and NIC to handle
AIMD (1/2)
- Van Jacobson, SIGCOMM 1988
- Congestion avoidance algorithm:
  - For each ACK in an RTT without loss, increase: cwnd_{i+1} = cwnd_i + 1/cwnd_i
  - For each window experiencing loss, decrease: cwnd_{i+1} = cwnd_i / 2
- Slow-start algorithm: increase by one MSS per ACK until ssthresh
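To make these update rules concrete, here is a minimal Python sketch (mine, not part of the slides) that applies the AIMD rules once per RTT, with cwnd measured in MSS units; the 1 Gbit/s, 120 ms and 1,460-Byte figures are illustrative assumptions.

```python
def aimd_rtt(cwnd, loss):
    """One RTT of AIMD congestion avoidance, cwnd in MSS units.
    Each ACK adds 1/cwnd, which sums to roughly +1 MSS per RTT without loss."""
    return cwnd / 2.0 if loss else cwnd + 1.0

# Illustration: on a ~1 Gbit/s path with RTT = 120 ms and MSS = 1,460 Bytes,
# a full window is cwnd ~= C * RTT / MSS ~= 10,000 MSS (assumed figures).
RTT = 0.120
full_cwnd = 1e9 * RTT / (1460 * 8)
cwnd = aimd_rtt(full_cwnd, loss=True)      # a single loss halves the window
rtts = 0
while cwnd < full_cwnd:                    # additive increase back to capacity
    cwnd = aimd_rtt(cwnd, loss=False)
    rtts += 1
print(f"~{rtts} RTTs, i.e. ~{rtts * RTT / 60:.0f} min to recover")
```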
AIMD (2/2)
- Additive Increase:
  - A TCP connection slowly increases its bandwidth utilization in the absence of loss
  - forever, unless we run out of send/receive buffers or detect a packet loss
  - TCP is greedy: no attempt to reach a stationary state
- Multiplicative Decrease:
  - A TCP connection reduces its bandwidth utilization drastically whenever a packet loss is detected
  - assumption: line errors are negligible, hence packet loss means congestion
Congestion Window (cwnd)

[Plot of cwnd vs. time, showing the slow-start and congestion-avoidance phases]
Disastrous Effect of Packet Loss on TCP in Fast WANs (1/2)
AIMD, C = 1 Gbit/s, MSS = 1,460 Bytes
Disastrous Effect of Packet Loss on TCP in Fast WANs (2/2)
- Long time to recover from a single loss:
  - TCP should react to congestion rather than packet loss: line errors and transient faults in equipment are no longer negligible in fast WANs
  - TCP should recover more quickly from a loss
- TCP is particularly sensitive to packet loss in fast WANs (i.e., when both cwnd and RTT are large)
Characterization of the Problem (1/2)
- The responsiveness measures how quickly we go back to using the network link at full capacity after experiencing a loss (i.e., loss recovery time if loss occurs when bandwidth utilization = network link capacity)
- TCP responsiveness = C * RTT^2 / (2 * inc)

[Plot of responsiveness: time (s) vs. RTT (ms), for C = 622 Mbit/s, 2.5 Gbit/s and 10 Gbit/s]
Characterization of the Problem (2/2)

Capacity                         RTT          # inc      Responsiveness
9.6 kbit/s (typ. WAN in 1988)    max: 40 ms   1          0.6 ms
10 Mbit/s (typ. LAN in 1988)     max: 20 ms   8          ~150 ms
100 Mbit/s (typ. LAN in 2003)    max: 5 ms    20         ~100 ms
622 Mbit/s                       120 ms       ~2,900     ~6 min
2.5 Gbit/s                       120 ms       ~11,600    ~23 min
10 Gbit/s                        120 ms       ~46,200    ~1h 30min

inc size = MSS = 1,460 Bytes
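The table can be reproduced to within roughly 10-15% from the responsiveness formula on the previous slide (the residual difference presumably comes from how header overhead is counted); the short Python sketch below is mine, not part of the slides.

```python
MSS_BITS = 1460 * 8

def responsiveness(capacity_bps, rtt_s):
    """AIMD recovery time when the loss hits at full link utilization:
    the window must regain C*RTT/2 bits, growing by one MSS per RTT."""
    n_inc = capacity_bps * rtt_s / (2 * MSS_BITS)
    return n_inc, n_inc * rtt_s

for capacity in (622e6, 2.5e9, 10e9):
    n_inc, seconds = responsiveness(capacity, 0.120)
    print(f"{capacity / 1e9:5.2f} Gbit/s: ~{n_inc:,.0f} increments, ~{seconds / 60:.0f} min")
# ~3,200 inc / ~6 min, ~12,800 inc / ~26 min, ~51,400 inc / ~103 min
```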
Congestion vs. Line Errors

Throughput    Required Bit Loss Rate    Required Packet Loss Rate
10 Mbit/s     2 × 10^-8                 2 × 10^-4
100 Mbit/s    2 × 10^-10                2 × 10^-6
2.5 Gbit/s    3 × 10^-13                3 × 10^-9
10 Gbit/s     2 × 10^-14                2 × 10^-10

(RTT = 120 ms, MTU = 1,500 Bytes, AIMD)

At gigabit speed, the loss rate required for packet loss to be ascribed only to congestion is unrealistic with AIMD
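These loss-rate requirements can be roughly re-derived by asking at what loss rate the Mathis et al. (SIGCOMM'97) throughput model, cited on the next slide, still sustains the full link capacity. The sketch below is my illustration under that assumption (MSS taken as MTU minus 40 Bytes of headers), not necessarily the exact model behind the table.

```python
from math import sqrt

def required_packet_loss(capacity_bps, rtt_s, mtu_bytes=1500):
    """Loss rate p at which BW = (MSS/RTT) * sqrt(3/2) / sqrt(p) equals capacity."""
    mss_bits = (mtu_bytes - 40) * 8
    p_packet = (mss_bits * sqrt(1.5) / (rtt_s * capacity_bps)) ** 2
    return p_packet, p_packet / (mtu_bytes * 8)    # packet and bit loss rates

for capacity in (10e6, 100e6, 2.5e9, 10e9):
    p_pkt, p_bit = required_packet_loss(capacity, 0.120)
    print(f"{capacity / 1e6:8.0f} Mbit/s: packet ~{p_pkt:.0e}, bit ~{p_bit:.0e}")
# ~1e-04/1e-08, ~1e-06/1e-10, ~2e-09/2e-13, ~1e-10/1e-14: same order as the table
```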
Single TCP Stream Performance under Periodic Losses

[Two plots, "Effect of packet loss", MSS = 1,460 Bytes: bandwidth utilization (%) and throughput (Mbit/s) vs. packet loss frequency (%), comparing the Mathis SIGCOMM'97 model, a WAN (RTT = 120 ms) and a LAN (RTT = 0.04 ms)]

At a loss rate of 0.01%: LAN bandwidth utilization = 99%, WAN bandwidth utilization = 1.2%
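The annotated 0.01% point can be checked against the same Mathis model; the sketch below is mine and assumes a 1 Gbit/s link and MSS = 1,460 Bytes for illustration.

```python
from math import sqrt

LINK_BPS = 1e9                      # assumed 1 Gbit/s link
MSS_BITS = 1460 * 8

def mathis_bps(loss_rate, rtt_s):
    """Mathis et al. steady-state TCP throughput: (MSS/RTT) * sqrt(3/2) / sqrt(p)."""
    return (MSS_BITS / rtt_s) * sqrt(1.5) / sqrt(loss_rate)

for name, rtt in (("WAN", 0.120), ("LAN", 0.00004)):
    bw = min(mathis_bps(1e-4, rtt), LINK_BPS)     # loss rate = 0.01%
    print(f"{name}: ~{100 * bw / LINK_BPS:.0f}% of the link")
# WAN: ~1% (slide: 1.2%), LAN: 100% (slide: 99%)
```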
Solutions
What Can We Do?
To achieve higher throughputs over high bandwidth*delay networks, we can:
- Fix AIMD
- Change the congestion avoidance algorithm:
  - Kelly: Scalable TCP
  - Ravot: GridDT
- Use larger MTUs
- Change the initial setting of ssthresh
- Avoid losses in end hosts
Delayed ACKs with AIMD
- RFC 2581 (spec. defining TCP congestion control AIMD algorithm) erred:
  cwnd_{i+1} = cwnd_i + SMSS * SMSS / cwnd_i
  - Implicit assumption: one ACK per packet
  - In reality: one ACK every second packet with delayed ACKs
- Responsiveness multiplied by two:
  - Makes a bad situation worse in fast WANs
- Problem fixed by ABC in RFC 3465 (Feb 2003)
  - Not implemented in Linux 2.4.21
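A small sketch (mine) of why delayed ACKs halve the growth rate under RFC 2581's per-ACK increase, and how byte counting in the spirit of ABC (RFC 3465) restores it; the ABC function is a simplified approximation, not the full RFC logic.

```python
SMSS = 1460.0   # sender maximum segment size, Bytes

def grow_per_ack(cwnd, acks):
    """RFC 2581 congestion avoidance: cwnd += SMSS*SMSS/cwnd for every ACK."""
    for _ in range(acks):
        cwnd += SMSS * SMSS / cwnd
    return cwnd

def grow_abc(cwnd, bytes_acked):
    """Simplified ABC: credit acknowledged bytes rather than ACKs, capped at
    one SMSS of growth per cwnd of acknowledged data."""
    return cwnd + SMSS * min(bytes_acked / cwnd, 1.0)

cwnd = 100 * SMSS                         # an arbitrary example window (Bytes)
segments = 100                            # segments sent and ACKed in one RTT
print(grow_per_ack(cwnd, segments) - cwnd)         # ~1460: one ACK per segment
print(grow_per_ack(cwnd, segments // 2) - cwnd)    # ~730: delayed ACKs halve growth
print(grow_abc(cwnd, segments * SMSS) - cwnd)      # 1460.0: growth restored
```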
Delayed ACKs with AIMD and ABC
Scalable TCP: Algorithm
- For cwnd > lwnd, replace AIMD with a new algorithm:
  - for each ACK in an RTT without loss: cwnd_{i+1} = cwnd_i + a
  - for each window experiencing loss: cwnd_{i+1} = cwnd_i - (b × cwnd_i)
- Kelly's proposal during internship at CERN: (lwnd, a, b) = (16, 0.01, 0.125)
- Trade-off between fairness, stability, variance and convergence
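A minimal sketch (mine) of these update rules with Kelly's parameters, falling back to AIMD below lwnd; cwnd is in MSS units.

```python
LWND, A, B = 16, 0.01, 0.125        # Kelly's (lwnd, a, b)

def on_ack(cwnd):
    """Per ACK: Scalable TCP increase above lwnd, legacy AIMD increase below."""
    return cwnd + (A if cwnd > LWND else 1.0 / cwnd)

def on_loss(cwnd):
    """Per loss window: 12.5% back-off above lwnd, AIMD halving below."""
    return cwnd * (1 - B) if cwnd > LWND else cwnd / 2.0

# Above lwnd the per-RTT growth is multiplicative, ~(1 + a), so recovery time
# after a loss no longer depends on the window size (or the link capacity).
cwnd = 10000.0
print(on_loss(cwnd))                # 8750.0
for _ in range(int(cwnd)):          # roughly one RTT's worth of ACKs
    cwnd = on_ack(cwnd)
print(cwnd)                         # ~10100: about +1% per RTT at any window size
```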
Scalable TCP: lwnd
Scalable TCP: Responsiveness Independent of Capacity
Scalable TCP: Improved Responsiveness
- Responsiveness for RTT = 200 ms and MSS = 1,460 Bytes:
  - Scalable TCP: ~3 s
  - AIMD:
    - ~3 min at 100 Mbit/s
    - ~1h 10min at 2.5 Gbit/s
    - ~4h 45min at 10 Gbit/s
- Patch against Linux kernel 2.4.19: http://www-lce.eng.cam.ac.uk/~ctk21/scalable/
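These figures can be reproduced with a back-of-the-envelope calculation under the slide's stated assumptions (RTT = 200 ms, MSS = 1,460 Bytes): AIMD must regain C*RTT/2 at one MSS per RTT, while Scalable TCP regrows the lost 12.5% at about 1% per RTT regardless of capacity. A sketch of mine:

```python
from math import log

MSS_BITS = 1460 * 8
RTT = 0.200
A, B = 0.01, 0.125

def aimd_recovery_s(capacity_bps):
    """AIMD: regain C*RTT/2 bits of window at one MSS per RTT."""
    return capacity_bps * RTT / (2 * MSS_BITS) * RTT

def scalable_recovery_s():
    """Scalable TCP: cwnd drops to (1-b)*cwnd, then regrows by (1+a) per RTT."""
    return log(1 / (1 - B)) / log(1 + A) * RTT

print(f"Scalable TCP: ~{scalable_recovery_s():.1f} s")          # ~2.7 s
for capacity in (100e6, 2.5e9, 10e9):
    print(f"AIMD at {capacity / 1e9:g} Gbit/s: ~{aimd_recovery_s(capacity) / 60:.0f} min")
# ~3 min, ~71 min (~1h 10min), ~285 min (~4h 45min)
```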
Scalable TCP vs. AIMD: Benchmarking

Number of flows   2.4.19 TCP   2.4.19 TCP + new dev driver   Scalable TCP
1                 7            16                            44
2                 14           39                            93
4                 27           60                            135
8                 47           86                            140
16                66           106                           142

Bulk throughput tests with C = 2.5 Gbit/s. Flows transfer 2 GBytes and start again for 20 min.
GridDT: Algorithm
- Congestion avoidance algorithm:
  - For each ACK in an RTT without loss, increase: cwnd_{i+1} = cwnd_i + A
- By modifying A dynamically according to RTT, GridDT guarantees fairness among TCP connections:
  A1 / A2 = (RTT1 / RTT2)^2
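A sketch (mine) of how the per-connection increment could be chosen so that two connections with different RTTs grow their throughput at the same rate; the 181 ms and 117 ms RTTs and the A2 = 3 reference value are taken from the measurement slides that follow.

```python
def fair_increment(rtt_s, ref_rtt_s, ref_a):
    """Pick A proportional to RTT^2 so that throughput (cwnd/RTT) grows at the
    same rate per unit of wall-clock time on both connections."""
    return ref_a * (rtt_s / ref_rtt_s) ** 2

# Calibrate against the CERN-StarLight path (A2 = 3 at RTT = 117 ms) and derive
# the increment for the CERN-Sunnyvale path (RTT = 181 ms).
a_sunnyvale = fair_increment(0.181, ref_rtt_s=0.117, ref_a=3)
print(round(a_sunnyvale, 1))     # ~7.2, consistent with the A1 = 7 used later
```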
AIMD: RTT Bias

Two TCP streams share a 1 Gbit/s bottleneck:
- CERN-Sunnyvale: RTT = 181 ms. Avg. throughput over a period of 7,000 s = 202 Mbit/s
- CERN-StarLight: RTT = 117 ms. Avg. throughput over a period of 7,000 s = 514 Mbit/s
- MTU = 9,000 Bytes. Link utilization = 72%

[Testbed diagram: Host #1 and Host #2 at CERN connected over 1 GE through a GE switch and routers to StarLight (POS 2.5 Gbit/s) and on to Sunnyvale (POS 10 Gbit/s, 10GE); the 1 GE links are the bottleneck]

[Plot: throughput (Mbit/s) vs. time (s) of the two streams, RTT = 181 ms and RTT = 117 ms, each shown with its average over the life of the connection]
GridDT Fairer than AIMD

[Plot: throughput (Mbit/s) vs. time (s) of two streams with different RTTs sharing a 1 Gbit/s bottleneck, A1 = 7 at RTT = 181 ms and A2 = 3 at RTT = 117 ms, each shown with its average over the life of the connection]

- CERN-Sunnyvale: RTT = 181 ms. Additive inc. A1 = 7. Avg. throughput = 330 Mbit/s
- CERN-StarLight: RTT = 117 ms. Additive inc. A2 = 3. Avg. throughput = 388 Mbit/s
- MTU = 9,000 Bytes. Link utilization = 72%

(RTT1 / RTT2)^2 = (181 / 117)^2 = 2.39
A1 / A2 = 7 / 3 = 2.33

[Same testbed diagram as on the previous slide, annotated with A2 = 3 on the RTT = 117 ms path and A1 = 7 on the RTT = 181 ms path]
Larger MTUs (1/2)
- Advocated by Mathis
- Experimental environment:
  - Linux 2.4.21
  - SysKonnect device driver 6.12
  - Traffic generated by iperf: average throughput over the last 5 seconds
  - Single TCP stream, RTT = 119 ms
  - Duration of each test: 2 hours
  - Transfers from Chicago to Geneva
  - MTUs: POS MTU = 9,180 Bytes; MTU on the NIC = 9,000 Bytes
Larger MTUs (2/2)
TCP max: 990 Mbit/s (MTU = 9,000 Bytes); TCP max: 940 Mbit/s (MTU = 1,500 Bytes)
Related Work
- Floyd: High-Speed TCP
- Low: Fast TCP
- Katabi: XCP
- Web100 and Net100 projects
- PFLDnet 2003 workshop: http://www.datatag.org/pfldnet2003/
Research Directions
- Compare performance of TCP variants
- Investigate proposal by Shorten, Leith, Foy and Kilduff
- More stringent definition of congestion: lose more than 1 packet per RTT
- ACK more than two packets in one go: decrease ACK bursts
- SCTP vs. TCP