
FAST TCP

Cheng Jin, David Wei

Steven Low

netlab.CALTECH.edu

Acknowledgments

Caltech: Bunn, Choe, Doyle, Hegde, Jayaraman, Newman, Ravot, Singh, X. Su, J. Wang, Xia

UCLA: Paganini, Z. Wang

CERN: Martin

SLAC: Cottrell

Internet2: Almes, Shalunov

MIT Haystack Observatory: Lapsley, Whitney

TeraGrid: Linda Winkler

Cisco: Aiken, Doraiswami, McGugan, Yip

Level(3): Fernes

LANL: Wu

Outline

Motivation & approach
FAST architecture
Window control algorithm
Experimental evaluation

skip: theoretical foundation

Congestion control

[Diagram: sources adjust rates x_i(t); links feed back congestion measures p_l(t)]

Example congestion measure p_l(t): loss probability (Reno), queueing delay (Vegas)

TCP/AQM

Congestion control is a distributed asynchronous algorithm to share bandwidth.

It has two components:
- TCP: adapts sending rate (window) to congestion
- AQM: adjusts & feeds back congestion information

They form a distributed feedback control system:
- Equilibrium & stability depend on both TCP and AQM
- And on delay, capacity, routing, #connections

[Diagram: TCP algorithms (Reno, Vegas) adapt x_i(t); AQM algorithms (DropTail, RED, REM/PI, AVQ) feed back p_l(t)]
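The feedback loop above can be made concrete with a toy discrete-time model. The following sketch is illustrative only; the constants, the source gain, and the queue-to-price mapping are assumptions, not from the slides. Each source adapts its rate to a one-RTT-delayed price, and the link computes its price from the backlog.

# Toy TCP/AQM feedback loop: sources see the link price one RTT late.
# All names and constants are illustrative assumptions.

C = 100.0          # link capacity (pkts per tick)
rtt_ticks = 10     # feedback delay in ticks
kappa, n = 0.05, 5 # source gain, number of flows

x = [10.0] * n             # source rates x_i(t)
p_hist = [0.0] * rtt_ticks # delayed congestion signal p_l(t)
queue = 0.0

for t in range(1000):
    total = sum(x)
    # AQM side: queue integrates excess arrival rate; price = scaled backlog
    queue = max(0.0, queue + total - C)
    p_hist.append(min(1.0, queue / 500.0))
    p = p_hist.pop(0)  # sources react to the price one RTT late
    # TCP side: each source adapts its rate to the delayed price
    x = [max(0.1, xi + kappa * (1.0 - p * xi)) for xi in x]

print(f"rates ~ {x[0]:.1f} each, queue = {queue:.0f} pkts")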

Difficulties at large window

Equilibrium problem
- Packet level: AI too slow, MD too drastic
- Flow level: required loss probability too small

Dynamic problem
- Packet level: must oscillate on binary signal
- Flow level: unstable at large window


Packet & flow level

Packet level (Reno):
ACK: W ← W + 1/W
Loss: W ← W − 0.5W

Flow level (Reno):
Equilibrium: W = 1.225/√p pkts (Mathis formula)
Dynamics

Reno TCP

Packet level: designed and implemented first
Flow level: understood afterwards

Flow-level dynamics determines equilibrium (performance, fairness) and stability.

Design flow-level equilibrium & stability; implement flow-level goals at the packet level.


Packet-level designs of FAST, HSTCP, and STCP are guided by flow-level properties.

Packet level

Reno AIMD(1, 0.5):
ACK: W ← W + 1/W
Loss: W ← W − 0.5W

HSTCP AIMD(a(w), b(w)):
ACK: W ← W + a(w)/W
Loss: W ← W − b(w)·W

STCP MIMD(a, b):
ACK: W ← W + 0.01
Loss: W ← W − 0.125·W

FAST:
per RTT: W ← (baseRTT/RTT)·W + α
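For reference, the update rules above translate directly into code. This is a minimal sketch; a_hs and b_hs are hypothetical stand-ins for HSTCP's window-dependent AIMD functions, whose lookup tables are not reproduced here.

# Sketch of the per-ACK / per-loss window rules on this slide (illustrative).

def reno_ack(w):  return w + 1.0 / w   # AI: +1 pkt per RTT
def reno_loss(w): return w - 0.5 * w   # MD: halve the window

def hstcp_ack(w, a_hs):  return w + a_hs(w) / w   # a_hs: HSTCP's a(w)
def hstcp_loss(w, b_hs): return w - b_hs(w) * w   # b_hs: HSTCP's b(w)

def stcp_ack(w):  return w + 0.01      # MI: ~1% window growth per RTT
def stcp_loss(w): return w - 0.125 * w

def fast_rtt(w, base_rtt, rtt, alpha):
    # FAST updates once per RTT from delay, not per ACK from loss
    return (base_rtt / rtt) * w + alpha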

Flow level: Reno, HSTCP, STCP, FAST

Similar flow-level equilibrium: x = α/(RTT · p^k) pkts/sec (Mathis formula for Reno)

α = 1.225 (Reno), 0.120 (HSTCP), 0.075 (STCP)
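As a worked example of this common equilibrium form, using the α values above and, as an assumption, the exponents commonly cited for these protocols (k = 0.5 for Reno, 0.835 for HSTCP, 1 for STCP):

# Worked example of x = alpha / (RTT * p**k) in pkts/s.
# alpha values are from the slide; the exponents k are assumed.

params = {"Reno": (1.225, 0.5), "HSTCP": (0.120, 0.835), "STCP": (0.075, 1.0)}
rtt, p = 0.1, 1e-6   # 100 ms RTT, loss probability 10^-6

for name, (alpha, k) in params.items():
    x = alpha / (rtt * p ** k)
    print(f"{name}: {x:,.0f} pkts/s")
# Reno needs a very small loss probability to sustain a large window,
# which is exactly the equilibrium problem listed earlier.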

Flow level: Reno, HSTCP, STCP, FAST

Different gain and utility function U_i; they determine equilibrium and stability

Different congestion measure p_i:
- Loss probability (Reno, HSTCP, STCP)
- Queueing delay (Vegas, FAST)

Common flow-level dynamics!

window adjustment = control gain × flow-level goal

Implementation strategy

Common flow-level dynamics:

window adjustment = control gain × flow-level goal

One strategy: small adjustment when close to the target, large when far away
- Needs to estimate how far the current state is from the target
- Scalable

Alternative: window adjustment independent of p_i, depending only on the current window
- Difficult to scale

Outline

Motivation & approach
FAST architecture
Window control algorithm
Experimental evaluation

skip: theoretical foundation

Architecture

[Diagram: components operate at different timescales; window control at the RTT timescale, loss recovery below the RTT timescale]

Architecture

Each component is designed independently and can be upgraded asynchronously.

Window Control

Uses delay as the congestion measure:
- Delay provides finer congestion information
- Delay scales correctly with network capacity
- Can operate with low queueing delay
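A minimal sketch of how a delay-based source might extract this signal from RTT samples; the class shape and the EWMA weight are illustrative assumptions. baseRTT is tracked as the minimum observed RTT (an estimate of propagation delay), and queueing delay is the smoothed RTT minus baseRTT.

# Illustrative delay-based congestion signal: queueing delay = RTT - baseRTT.

class DelayEstimator:
    def __init__(self, weight=0.125):
        self.base_rtt = float("inf")  # min RTT seen so far (propagation)
        self.avg_rtt = None           # smoothed RTT
        self.weight = weight          # EWMA weight for RTT smoothing

    def sample(self, rtt):
        self.base_rtt = min(self.base_rtt, rtt)
        self.avg_rtt = rtt if self.avg_rtt is None else \
            (1 - self.weight) * self.avg_rtt + self.weight * rtt
        return self.avg_rtt - self.base_rtt  # queueing-delay estimate

Unlike a binary loss signal, this estimate is multi-bit and grows with the backlog, which is what lets the window adjustment scale with distance from the target.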

FAST TCP basic idea

[Figure: window vs. congestion signal; a loss-based TCP pushes the window until packet loss at capacity C, while FAST reacts to queueing delay, which rises before loss occurs]

Window control algorithm

- Full utilization regardless of bandwidth-delay product
- Globally stable: exponential convergence
- Fairness: weighted proportional fairness, with weights set by the parameter α
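The per-RTT rule from the accompanying Infocom paper, w ← min{2w, (1−γ)w + γ((baseRTT/RTT)·w + α)}, can be sketched as below; the γ default and the numbers in the example are illustrative. In equilibrium each flow keeps α packets queued in the network, which is what yields weighted proportional fairness.

# Sketch of the FAST per-RTT window update (gamma and numbers illustrative).

def fast_update(w, base_rtt, rtt, alpha, gamma=0.5):
    target = (base_rtt / rtt) * w + alpha   # fixed point: alpha pkts queued
    return min(2 * w, (1 - gamma) * w + gamma * target)

# Example: with a fixed RTT the window converges without oscillating.
w, base_rtt, rtt, alpha = 100.0, 0.1, 0.12, 200
for _ in range(100):
    w = fast_update(w, base_rtt, rtt, alpha)
print(round(w))  # -> 1200, i.e. w*(1 - base_rtt/rtt) = alpha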

Outline

Motivation & approach
FAST architecture
Window control algorithm
Experimental evaluation
- Abilene-HENP network
- Haystack Observatory
- DummyNet

Abilene Test

[Diagram: Abilene test path with OC48 and OC192 links]

(Yang Xia, Harvey Newman, Caltech)

Periodic losses every 10 minutes


FAST backs off to make room for Reno

Haystack Experiments

Lapsley, MIT Haystack

Haystack: 1 flow (Atlanta -> Japan)

- Iperf used to generate traffic
- Sender is a Xeon 2.6 GHz
- Window was constant; burstiness in rate due to host processing and ACK spacing

Lapsley, MIT Haystack

Haystack: 2 flows from 1 machine (Atlanta -> Japan)

Lapsley, MIT Haystack

Timeout

All outstanding packets are marked as lost.
1. SACKs reduce the number of lost packets.
2. Lost packets are retransmitted slowly because cwnd is capped at 1 (bug).

Linux Loss Recovery

DummyNet Experiments

Experiments using an emulated network: 800 Mbps bottleneck emulated in DummyNet.

Sender PC: Dual Xeon 2.6 GHz, 2 GB, Intel GbE, Linux 2.4.22
DummyNet PC: Dual Xeon 3.06 GHz, 2 GB, FreeBSD 5.1, 800 Mbps bottleneck
Receiver PC: Dual Xeon 2.6 GHz, 2 GB, Intel GbE, Linux 2.4.22

Dynamic sharing: 3 flows

[Figure panels: FAST, Linux TCP, HSTCP, BIC]

Dynamic sharing on DummyNet: capacity = 800 Mbps, delay = 120 ms, 3 flows, iperf throughput, Linux 2.4.x (HSTCP: UCL)

Steady throughput

[Figure panels: throughput, loss, and queue traces for FAST, Linux TCP, HSTCP, STCP]

Dynamic sharing on DummyNet: capacity = 800 Mbps, delay = 120 ms, 14 flows, iperf throughput, Linux 2.4.x (HSTCP: UCL)

[Figure panels: throughput, loss, and queue traces over 30 min for FAST, Linux TCP, HSTCP, BIC]

Room for mice!

Average Queue vs Buffer Size

DummyNet: capacity = 800 Mbps, delay = 200 ms, 1 flow, buffer size 50, …, 8000 pkts

(S. Hedge, B. Wydrowski, et al., Caltech)

Is a large queue necessary for high throughput?

FAST TCP: motivation, architecture, algorithms, performance. IEEE Infocom March 2004

Release: April 2004. Source freely available for any non-profit use.

netlab.caltech.edu/FAST

Aggregate throughput

[Plot: aggregate throughput vs. ideal performance]

DummyNet: cap = 800 Mbps; delay = 50–200 ms; #flows = 1–14; 29 expts

Aggregate throughput

[Plot: small-window (800 pkts) and large-window (8000 pkts) regimes]

DummyNet: cap = 800 Mbps; delay = 50–200 ms; #flows = 1–14; 29 expts

Fairness

Jain's index

[Plot: Jain's fairness index across experiments; HSTCP ~ Reno]

DummyNet: cap = 800 Mbps; delay = 50–200 ms; #flows = 1–14; 29 expts
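Jain's index for throughputs x_1..x_n is J = (Σx_i)² / (n·Σx_i²); it equals 1 when all flows get the same rate and approaches 1/n when one flow dominates. A small sketch (the example rates are made up):

# Jain's fairness index for a set of flow throughputs.

def jain_index(rates):
    n = len(rates)
    return sum(rates) ** 2 / (n * sum(r * r for r in rates))

print(jain_index([200, 200, 200]))  # 1.0: perfectly fair split
print(jain_index([550, 150, 100]))  # ~0.64: one flow dominates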

Stability

DummyNet: cap = 800 Mbps; delay = 50–200 ms; #flows = 1–14; 29 expts

Stable in diverse scenarios

FAST TCP: motivation, architecture, algorithms, performance. IEEE Infocom March 2004

Release: April 2004. Source freely available for any non-profit use.

netlab.caltech.edu/FAST

BACKUP Slides

IP Rights

- Caltech owns IP rights; applicable more broadly than TCP; leave all options open
- IP freely available if FAST TCP becomes an IETF standard
- Code available on the FAST website for any non-commercial use

WAN in Lab

Caltech: John Doyle, Raj Jayaraman, George Lee, Steven Low (PI), Harvey Newman, Demetri Psaltis, Xun Su, Yang Xia

Cisco: Bob Aiken, Vijay Doraiswami, Chris McGugan, Steven Yip

netlab.caltech.edu

NSF

Key Personnel

Caltech: Steven Low (CS/EE), Harvey Newman (Physics), John Doyle (EE/CDS), Demetri Psaltis (EE), Raj Jayaraman (CS), Xun Su (Physics), Yang Xia (Physics), George Lee (CS); 2 grad students; 3 summer students

Cisco: Bob Aiken, Vijay Doraiswami, Chris McGugan, Steven Yip; Cisco engineers

Spectrum of tools

[Chart: log(cost) vs. log(abstraction), spanning math, simulation, emulation, live networks, and WAN in Lab]
- Math: Mathis formula, optimization, control theory, nonlinear models, stochastic models
- Simulation: NS, SSFNet, QualNet, JavaSim
- Emulation: DummyNet, EmuLab, ModelNet, WAIL
- Live networks: PlanetLab, Abilene, NLR, DataTAG, CENIC, WAIL, etc.

…we use them all

Spectrum of tools

(live network / emulation / simulation)
Distance:     High / High   / High
Speed:        High / High   / Low
Realism:      High / High   / Low
Traffic:      High / Low    / Low
Configurable: Low  / Medium / High
Monitoring:   Low  / Medium / High
Cost:         High / Medium / Low

Critical in development, e.g. Web100

Goal

State-of-the-art hybrid WAN: high speed, large distance
- 2.5G to 10G, 50–200 ms
- Wireless devices connected by an optical core
- Controlled & repeatable experiments
- Reconfigurable & evolvable
- Built-in monitoring capability

WAN in Lab

5-year plan:
- 6 Cisco ONS 15454
- 4 routers
- 10s of servers
- Wireless devices
- 800 km fiber, ~100 ms RTT

[Diagram: 5-year topology; four OSPF areas (10, 20, 30, 40) over an optical core of six ONS 15454 sites (A-F) linked by 100 km 10GE spans; Cisco 7613 bottleneck routers with ML-Series network modules; Linux server farms, Itanium 10GE servers, and wireless components on 10.0.x/24 and 192.168.x/24 subnets]

V. Doraiswami (Cisco), R. Jayaraman (Caltech)

WAN in Lab

Year-1 plan:
- 3 Cisco ONS 15454
- 2 routers
- 10s of servers
- Wireless devices

[Diagram: year-1 topology; two OSPF areas (10, 20), ONS 15454 sites A, B, D (sized to support additional ML-Series cards), Cisco 7613 bottleneck routers, 100 km 10GE spans, server farms, Itanium 10GE servers, and wireless components]

V. Doraiswami (Cisco), R. Jayaraman (Caltech)

Hybrid network scenarios: ad hoc networks, cellular networks, sensor networks

How can the optical core support wireless edges?

X. Su (Caltech)

Experiments

- Transport & network layer: TCP, AQM, TCP/IP interaction
- Wireless hybrid networking: wireless media delivery, fixed wireless access, sensor networks
- Optical control plane
- Grid computing

UltraLight

Unique capabilities

- WAN in Lab: capacity 2.5–10 Gbps; delay 0–100 ms round trip, extensible to 0–400 ms
- Configurable & evolvable: topology, rate, delays, routing; always at the cutting edge
- Flexible, active debugging: passive monitoring, AQM
- Integral part of R&A networks: transition from theory to implementation, demonstration, and deployment; transition from lab to marketplace
- Global resource: part of the global infrastructure UltraLight, led by Newman

[Map: WAN in Lab (Caltech) connected via CalREN-2/Abilene to StarLight (Chicago), SURFnet (Amsterdam), and CERN (Geneva); multi-Gbps research & production networks, 50–200 ms delay]

Network debugging

Performance problems in a real network:
- Simulation will miss them
- Emulation might miss them
- A live network is hard to debug

WAN in Lab: passive monitoring inside the network; active debugging possible

Passive monitoring

[Diagram: fiber splitter feeding a DAG capture card; timestamped headers stored to RAID; monitor synchronized by GPS]

- No overhead on the system; can capture full information at OC48
- U. of Waikato's DAG card captures at OC48 speed
- Can filter if necessary
- Disk speed needed = 2.5 Gbps × 40/1500 ≈ 66 Mbps
- Monitors synchronized by GPS or cheaper alternatives
- Data stored for offline analysis

D. Wei (Caltech)
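The disk-speed figure above is just the header fraction of the line rate: capturing only the 40-byte headers of 1500-byte packets at OC-48 speed. The arithmetic, for reference:

# Header-only capture bandwidth at OC-48 line rate.
line_rate_bps = 2.5e9
header_bytes, mtu_bytes = 40, 1500
print(f"{line_rate_bps * header_bytes / mtu_bytes / 1e6:.1f} Mbps")  # ~66.7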

Passive monitoring

[Diagram: fiber-splitter/DAG/RAID monitors attached at servers and routers throughout the testbed, all GPS-synchronized]

D. Wei (Caltech)

Web100, MonALISA

UltraLight testbed

UltraLight team (Newman)

Status

Hardware:
- Optical transport design: finalized
- IP infrastructure design: finalized (almost)
- Wireless infrastructure design: finalized
- Price negotiation/ordering/delivery: summer 04

Software:
- Passive monitoring: summer student
- Management software: 2005-

Physical lab:
- Renovation: to be completed by summer 04

Status

[Timeline 2003–2007: NSF funds 10/03; ARO funds 5/04; fundraising; hardware design; physical building; monitoring; traffic generation; connected to UltraLight; usable testbed 12/04; useful testbed 12/05; expansion, support, management]


[Floor plan: WAN in Lab located next to NetLab in the CS Dept, Jorgensen Lab]

G. Lee, R. Jayaraman, E. Nixon (Caltech)

Summary

Testbed driven by research agenda:
- Rich and strong networking effort
- Integrated approach: theory + implementation + experiments
- "A network that can break"

Integral part of real testbeds:
- Part of the global infrastructure UltraLight, led by Harvey Newman (Caltech)

Integrated monitoring & measurement facility:
- Fiber splitters, passive monitors
- MonALISA
