Flowlet Switching Srikanth Kandula Shan Sinha & Dina Katabi

High Performance Switching and Routing Telecom Center Workshop: Sept 4, 1997.

ISPs Want to Split Traffic Across Multiple Paths

[Figure: traffic split across two paths, 30% and 70%]

• Load balancing to remove hot spots

• Rebalance traffic when unpredictable events occur (Outages, DoS, BGP reroutes, Flash Crowds, …)




[Figure: when unpredictable traffic arrives, traffic is rebalanced from a 30% / 70% split to a 70% / 30% split]


• Much research on balancing and rebalancing load,

• But implementation is hard, particularly with dynamic ratios: either sacrifice accuracy or reorder TCP packets

Problem

1. Given the desired split ratios (possibly dynamic)

2. Split traffic accurately, at the edge router, without reordering TCP's packets

Existing Scheme 1: Packet-Based Splitting

• Assign packets to paths proportional to the desired ratios

Reorders TCP packets, causing bad throughput

Existing Scheme 2: Flow-Based Splitting

• Assign TCP flows to each path proportional to the desired ratio

1. Flows are not all equal: Elephants & Mice

2. So, estimate the rate of each TCP flow

3. But rates change with time

4. Too complex

5. Very inaccurate if desired ratios change
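The two existing schemes can be contrasted with a minimal Python sketch. It is an illustration under assumptions, not the routers' actual mechanism: the function names, the use of CRC32 as the header hash, and the purely hash-based flow assignment (with no per-flow rate estimation) are all hypothetical choices made for brevity.

```python
import random
import zlib


def pick_path(ratios, u):
    """Return the index of the path whose cumulative share covers u in [0, 1)."""
    total = 0.0
    for path, share in enumerate(ratios):
        total += share
        if u < total:
            return path
    return len(ratios) - 1  # guard against floating-point round-off


def packet_based(ratios, rng):
    """Scheme 1: every packet draws a path independently.

    Tracks the desired ratios closely, but consecutive packets of one
    TCP flow can land on different paths and arrive out of order.
    """
    return pick_path(ratios, rng.random())


def flow_based(ratios, flow_key):
    """Scheme 2 (simplified, assumed variant): pin a flow to a path by hashing.

    flow_key is (src_ip, dst_ip, src_port, dst_port). All packets of a flow
    map to the same path, so there is no reordering, but the byte split
    depends on which flows happen to be elephants.
    """
    key = "|".join(map(str, flow_key)).encode()
    u = zlib.crc32(key) / 2**32  # hash mapped into [0, 1)
    return pick_path(ratios, u)
```

With `ratios = [0.3, 0.7]`, `packet_based` sends roughly 30% of packets to path 0 over many calls, while `flow_based` always returns the same path for a given flow key.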

How to Split Traffic?

Packet-Based

• Accurate

• Reorders TCP packets

• Easily tracks dynamic ratios

Flow-Based

• Inaccurate

• No packet reordering

• Hard to track if ratios change

Can we combine the best of the two approaches?


This Talk

• Show how to send a single TCP flow down multiple paths without reordering

• Accurately split traffic even when desired ratios are dynamic

• Easy to implement

Flowlet Switching

• If the previous packet from the flow has already left the merging point, the flow can be reassigned to a different path

[Figure: packets 1 and 2 of a TCP flow]

[Figure: two paths between the split point and the merging point, with Delay = D1 and Delay = D2; the flow is idle for a gap ≥ δ between bursts]

Given δ > |D2-D1|

Flowlets are bursts from the same flow separated by at least δ; they can be switched independently!
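The flowlet definition above can be sketched in a few lines of Python. This is only an illustration: the function name and the `delta` parameter (standing in for the idle-gap threshold) are hypothetical, and timestamps are assumed to be sorted arrival times, in seconds, of packets from a single flow.

```python
def split_into_flowlets(timestamps, delta):
    """Group the packet arrival times of ONE flow into flowlets.

    A new flowlet starts whenever the idle gap since the previous
    packet is at least `delta` seconds; otherwise the packet joins
    the current flowlet.
    """
    flowlets = []
    for t in timestamps:
        if flowlets and t - flowlets[-1][-1] < delta:
            flowlets[-1].append(t)   # same burst
        else:
            flowlets.append([t])     # idle gap >= delta: new flowlet
    return flowlets
```

For example, arrival times `[0.000, 0.001, 0.002, 0.100, 0.101, 0.300]` with `delta = 0.05` yield three flowlets, each of which could be sent on a different path.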

Implementing Flowlet Switching is Simple

• Router at the split point hashes packet header

• If (Now - Last_Seen) > δ, the flow can change path

• Reassign path proportionally to the desired split ratios

Flowlet table, indexed by hash(SRCip, DSTip, SRCPort, DSTPort):

Last_Seen (s)    Path
9920.2659        3
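The three steps above can be sketched as a small Python class. This is a sketch under assumptions rather than the authors' implementation: the class and method names, the CRC32 hash, the default table size, and the weighted random draw used to "reassign proportionally" are all hypothetical choices.

```python
import random
import zlib


class FlowletSwitch:
    """Fixed-size table of (last_seen, path) entries indexed by a header hash."""

    def __init__(self, ratios, delta, table_size=1024, seed=None):
        self.ratios = list(ratios)        # desired split ratios, may be updated
        self.delta = delta                # flowlet timeout in seconds
        self.table = [None] * table_size  # each slot: (last_seen, path) or None
        self.rng = random.Random(seed)

    def _choose_path(self):
        # Assumed policy: weighted random draw according to the desired ratios.
        paths = range(len(self.ratios))
        return self.rng.choices(paths, weights=self.ratios)[0]

    def route(self, flow_key, now):
        """Return the path for a packet of flow_key arriving at time `now`."""
        key = "|".join(map(str, flow_key)).encode()
        slot = zlib.crc32(key) % len(self.table)
        entry = self.table[slot]
        if entry is None or now - entry[0] > self.delta:
            path = self._choose_path()    # idle long enough: may change path
        else:
            path = entry[1]               # same flowlet: keep the path
        self.table[slot] = (now, path)
        return path
```

Flows that collide in a slot simply share its entry, which is why the table can be much smaller than the number of flows; the cost is that a colliding flow may keep a path it would otherwise have been free to change.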

Does it Really Work?

• Traces collected on a peering link, an edge link and two core links

• Split vectors (3 paths):
  Static: (.3, .3, .4)
  Dynamic: sinusoidal with amplitude 60%, period 20 min [Akella04, Chuah02]

Error = (1/N) × Σ over paths of |Desired − Obtained| / Desired

Is Flowlet Switching Accurate?

[Figure: bar chart of splitting error (%), y-axis from 0 to 45, for static and dynamic split vectors]

Scheme               Static    Dynamic
Packet-based         0.06%     0.07%
Flowlet-switching    2.31%     3.96%
Flow-based           12.01%    40.83%

Flowlet switching is much more accurate than flow-based switching

Can do Flowlet Switching without Per-Flow State

#Active flows ~ 50,000, but the router maintains a hash table of < 1000 entries (5KB).

[Figure: error vs. number of hash table entries (4, 16, 64, 256, 1024, 2048, 4096, 8192)]

Errors stabilize for small table

Fig. shows Avg. and Max. of many traces

Understanding Flowlets

But Where do Flowlets come from?

• Can’t be just timeouts or short flows; most of the bytes are in the elephants

• Why can a large flow be broken into many small flowlets?

• It is well known that TCP usually sends a window in one or a few bursts and waits for acks [Zhang91, Zhang03, Jiang04]

• Some reasons:
  Slow-start
  Ack compression
  Window is much smaller than the delay-BW product

Flowlets exist because TCP is bursty at RTT and sub-RTT scales

Flowlets exist because TCP is Bursty

Most flowlets have inter-arrivals less than an RTT, so most flowlets are sub-windows

Why is Flowlet Switching Accurate?

• 80% of bytes are in flowlets smaller than 10KB

• Assigning a flowlet to a path isn’t a long commitment

Why Can Flowlets Track Dynamics?

An order of magnitude more opportunities to rebalance!

Arrival rate of both flows and flowlets (/sec):

Trace      Flows      Flowlets
Edge       143.16     1454.98
Peering    611.95     8661.43
Core1      3784.10    35287.04
Core2      111.33     2848.76
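The "order of magnitude" claim can be checked with a short calculation. The sketch below assumes that the arrival rates in the figure are listed in the trace order Edge, Peering, Core1, Core2 for both series; that pairing is an assumption about the chart layout, and the variable names are hypothetical.

```python
# Arrival rates per second, paired with traces under the assumed ordering.
flows_per_sec = {"Edge": 143.16, "Peering": 611.95, "Core1": 3784.10, "Core2": 111.33}
flowlets_per_sec = {"Edge": 1454.98, "Peering": 8661.43, "Core1": 35287.04, "Core2": 2848.76}

# Each flowlet arrival is a chance to reassign traffic to another path,
# so the ratio below is the gain in rebalancing opportunities.
rebalance_gain = {
    trace: flowlets_per_sec[trace] / flows_per_sec[trace]
    for trace in flows_per_sec
}

for trace, gain in rebalance_gain.items():
    print(f"{trace}: {gain:.1f}x more flowlet arrivals than flow arrivals")
```

Under that pairing, every trace shows roughly ten times or more flowlet arrivals than flow arrivals.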

Why Doesn't Flowlet Switching Need Per-Flow State?

[Figure: packet timelines of Flow 1, Flow 2, and Flow 3, with the number of active flowlets (0 to 3) plotted over time]


#Active flowlets is 2 orders of magnitude smaller than #active flows, so a very small hash table suffices

Trace      #Active Flows    #Active Flowlets
Edge       1450.42          18.41
Peering    8477.33          28.08
Core1      47883.33         240.12
Core2      1559.33          50.66
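One way to see why so few flowlets are active at once is to count them on a toy trace. The sketch below rests on two definitions that are assumptions for illustration, not taken from the slides: a flow is "active" between its first and last packet, and a flowlet is "active" if the flow's most recent packet arrived less than `delta` seconds ago. The function name and input layout are hypothetical.

```python
from bisect import bisect_right


def active_counts(packets_by_flow, t, delta):
    """Return (active_flows, active_flowlets) at time t.

    packets_by_flow maps a flow id to the sorted, non-empty list of that
    flow's packet arrival times in seconds.
    """
    flows = flowlets = 0
    for times in packets_by_flow.values():
        # Assumed definition: a flow is active from first to last packet.
        if times[0] <= t <= times[-1]:
            flows += 1
        # Assumed definition: a flowlet is active if the latest packet
        # seen at or before t is less than delta old.
        seen = bisect_right(times, t)
        if seen > 0 and t - times[seen - 1] < delta:
            flowlets += 1
    return flows, flowlets
```

A long-lived flow that is idle between bursts counts as an active flow the whole time but as an active flowlet only during a burst, which is why the second count stays small.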


Why is Flowlet Switching Possible?

• Why can a large flow be broken into many small flowlets?
  TCP burstiness at small time scales

• Why is flowlet switching accurate?
  Small commitment; many more chances to rebalance

• Why does flowlet switching not need per-flow state?
  Few simultaneously active flowlets

Configuring Flowlet Switching

Flowlet separation > delay difference

But, how to find delay difference?

For our traces, which are a diverse collection of traffic within the continental US, ~50ms is a good and safe choice! Our procedure is a constructive way to find the flowlet timeout
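The condition "flowlet separation > delay difference" can be made concrete with a small sketch. The helper names are hypothetical, and path delays are assumed to be known one-way delays in seconds, which in practice is exactly the quantity that is hard to obtain.

```python
from itertools import combinations


def max_delay_difference(path_delays):
    """Largest pairwise delay difference among at least two candidate paths."""
    return max(abs(a - b) for a, b in combinations(path_delays, 2))


def arrives_in_order(send_gap, old_delay, new_delay):
    """Check that a packet does not overtake its predecessor.

    The predecessor left on a path with `old_delay`; this packet leaves
    `send_gap` seconds later on a path with `new_delay`. It arrives after
    the predecessor when send_gap + new_delay > old_delay.
    """
    return send_gap + new_delay > old_delay
```

If the flowlet separation exceeds `max_delay_difference(path_delays)`, then `arrives_in_order` holds for every pair of paths, so switching a flowlet to any path cannot reorder it relative to the previous flowlet.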

Flowlet Separation of 50ms is Good

Any flowlet timeout in [50, 100] ms yields highly accurate splits

~50ms results in accurate splitting

Flowlet Separation of 50ms is Safe

Even if delay difference >> 50ms, the probability of reordering is negligible compared to the drop rate in the Internet (about 1%)

[Figure: y-axis from 0% to 1% in steps of 0.2%]

Conclusion

• Harness TCP burstiness to split traffic at a finer resolution than a flow without reordering

• Flowlet Switching:
  Splitting errors are a few percent
  Reordering probability is negligible compared to drop probability in the Internet
  Easy to implement

• Enable ISPs to do dynamic load balancing
