CS-534 - Copyright University of Crete 1
6.3 Buffer Space versus Number of Flows
• What to do when the number of flows is so large that it becomes
impractical to allocate a separate flow-control “window” for each
one of them:
– Per-destination flow merging: Sapountzis and Katevenis, IEEE
Communications Magazine, Jan. 2005, pp. 88-94.
– Dynamically sharing the buffer space among the flows: the ATLAS I
and QFC flow-control protocols, and their evaluation
– Regional Explicit Congestion Notification (RECN)
– Request-Grant protocols: N. Chrysos, 2005
– End-to-end congestion control via request-grant: N. Chrysos, 2006
– Ethernet Quantized Congestion Notification (QCN)
– Job-size-aware scheduling and buffer management
– Other buffer-sharing protocols of the mid-90’s
– Buffer memory cost versus transmission throughput cost in 1995
Buffered Switching Fabrics with Internal Backpressure
• Performance of OQ at the cost of IQ
• Requires per-flow backpressure
[Figure: large off-chip buffers (DRAM) at the inputs; small on-chip
buffers inside the fabric; internal backpressure between stages]
Cell Distribution Methods
• Per-flow traffic distribution:
– Per-flow round-robin (PerFlowRR)
– Per-flow imbalance up to 1 cell (PerFlowIC)
⇒ accurate load balancing, on a shorter-term basis
• Aggregate traffic distribution:
– Randomized routing (no backpressure)
– Adaptive routing (indiscriminate backpressure)
⇒ load balancing on the long term only
[Figure: cells from the same input to different outputs are
distributed across the middle stage, then re-sequenced at the output]
Per-output Flow Merging
• Too many flows: N² per chip in the middle stage
• Per-output flow merging retains the benefits of per-flow backpressure:
– N flows per link, everywhere
– freedom from deadlock
• Re-sequencing needs to consider flows as they were before merging
The “ATLAS I” Credit Flow-Control Protocol
• M. Katevenis, D. Serpanos, E. Spyridakis: “Credit-Flow-Controlled ATM
for MP Interconnection: the ATLAS I Single-Chip ATM Switch”, Proc.
IEEE HPCA-4 (High Perf. Comp. Arch.), Las Vegas, USA, Feb. 1998,
pp. 47-56; http://archvlsi.ics.forth.gr/atlasI/atlasI_hpca98.ps.gz
• Features:
– identically destined traffic is confined to a single “lane”
– each “lane” can be shared by cells belonging to multiple packets
• As opposed to Wormhole Routing, where:
– a “virtual circuit” (VC) is allocated to a packet and dedicated to it for
its entire duration, i.e. until all “flits” (cells) of that packet go through
– identically destined packets are allowed to occupy distinct VC’s
QFC-like Credit Protocol
• Quantum Flow Control (QFC) Alliance: proposed standard for
credit-based flow control over WAN ATM links
• ATLAS I: similar protocol, adapted to short links, hardware implem.
• Number of Lanes L = B/b
• Both kinds of credit are needed for a cell to depart
(in ATLAS I: b = 1)
[Figure: upstream and downstream switches; the downstream buffer of
size B is shared by D destinations; the upstream switch keeps a
per-destination credit count (dstCr) of b cells for each destination
plus a pool credit count (poolCr) of B cells; cell i consumes one
dstCr and one poolCr to depart, and credit i returns upstream when
the cell leaves the downstream buffer]
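The two-credit departure rule above can be sketched as a toy Python model; the class and method names are ours (real implementations, like ATLAS I, are in hardware):

```python
class CreditSender:
    """Toy model of the two-credit (QFC / ATLAS I style) departure rule."""

    def __init__(self, num_lanes, b, pool):
        self.dst_cr = [b] * num_lanes   # per-destination (lane) credits, b each
        self.pool_cr = pool             # shared downstream-buffer (pool) credits

    def can_send(self, lane):
        # a cell departs only when BOTH kinds of credit are available
        return self.dst_cr[lane] > 0 and self.pool_cr > 0

    def send(self, lane):
        assert self.can_send(lane)
        self.dst_cr[lane] -= 1          # consume the per-lane credit...
        self.pool_cr -= 1               # ...and the pool credit

    def credit_return(self, lane):
        # the downstream switch forwarded the cell: both credits come back
        self.dst_cr[lane] += 1
        self.pool_cr += 1


s = CreditSender(num_lanes=4, b=1, pool=4)   # in ATLAS I: b = 1
s.send(0)
assert not s.can_send(0)    # lane 0 blocked until its credit returns
assert s.can_send(1)        # other lanes still have per-lane credit
s.credit_return(0)
assert s.can_send(0)
```

The per-lane credit confines identically destined traffic to its lane, while the pool credit lets the lanes dynamically share the buffer of B cells.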
[Plot: Saturation Throughput vs Buffer Space (= Lanes) per Link, in
cells or flits (1, 2, 4, 8, 16); 64x64 fabric: 6-stage banyan using
2x2 elements; 20-cell or 20-flit bursts, uniformly destined; curves
for ATLAS (B = L, with b = 1) and Wormhole; y-axis range 0.3–1.0]
[Plot: Non-Hot-Spot Delay, in the Presence of Hot-Spot Destinations;
delay (cell times, log scale 10–2000) vs Number of Lanes (L) = 1, 2,
4, 8, 16, with buffer space B = 16 cells or flits per link; curves
for ATLAS and Wormhole with no, 1, or 2 hot-spots; non-hot-spot load
= 0.2; 20-cell/flit bursts; 64x64 fabric: 6-stage banyan with 2x2
elements]
ATLAS I
• Single-chip ATM Switch with Multilane Backpressure
• 10 Gbit/s = 16×16 @ 622 Mb/s/port
• Shared Buffer
• 0.35 µm CMOS
• 1996-98, FORTH-ICS, Crete, GR
[Chip floorplan: 16 ports (0–15) of transceivers around the
periphery; core blocks include the Cell Buffer & Switching (pipelined
memory), Queue Management, Scheduling, Credit Tables and credit
scheduling, VP/VC Translation Table, Routing Table, header pattern
match, Load Monitor, and chip control & management]
[Table: ATLAS I cost breakdown (Area, Power, Gates, FF, SRAM, Design
Effort; percentages per block). Periphery: Pads & Drivers, GigaBaud
Transceivers. Core: wiring; Elastic buf., I/O Link Intf.; Scheduling,
Pop. Counts; Ctrl/Mgt, Load Mon, Misc.; Queue Pointer Manag'mnt; Cell
Buffer & Switching; Header Pr., Rt'ng, Transl.; Credit-based Flow
Control. Totals approximately 25 mm², 9 W, 15 person-years of design
effort]
Backpressure Cost Evaluation, versus Alternatives
• Measure the cost of credit flow control in ATLAS, and compare to
alternatives without internal backpressure in the fabric:
– large buffers (off-chip DRAM) in all switches throughout the fabric, or
– internal speedup in the fabric and output buffers
• Kornaros, Pnevmatikatos, Vatsolaki, Kalokerinos, Xanthaki, D. Mavroidis,
Serpanos, Katevenis: “ATLAS I: Implementing a Single-Chip ATM Switch
with Backpressure”, IEEE Micro Magazine, Jan/Feb. 1999,
http://archvlsi.ics.forth.gr/atlasI/hoti98/
Making Large ATM Switches:
Switching Fabrics with Internal Backpressure
[Figure: per-flow input queues in large, off-chip buffers (DRAM) at
the N input interfaces; the ATLAS switching stages use small, on-chip
buffers, with backpressure propagating stage by stage back to the
input queues; total (off-chip buffer) throughput = 2N]
Switching Fabrics without Backpressure 1: Large Buffers
[Figure: every switch in the fabric has large, off-chip buffers
(DRAM); total (off-chip buffer) throughput = 2N log N]
Switching Fabrics without Backpressure 2: Internal Speedup
[Figure: small, on-chip buffers inside the fabric, running at speedup
s > 1 (under bursty or non-uniform traffic: s >> 1); large, off-chip
output queues (DRAM) at the output interfaces; total throughput =
(s+1)N]
Backpressure Cost/Benefit 1:
No Backpressure, Large Off-Chip Buffers
[Table: the ATLAS I area/power breakdown rescaled for this
alternative: the credit-based flow-control blocks are eliminated
(×0), while the cell buffer & switching and the off-chip
communication cost grow substantially (×2 or more); header
processing, routing, and translation stay roughly unchanged]
Backpressure Cost/Benefit 2:
No Backpressure, Internal Speedup, Output Queues
[Table: the same breakdown with the credit-based flow-control blocks
eliminated (×0), but the cell buffer & switching, queue pointers,
scheduling, elastic buffers, and off-chip communication all scaled by
the internal speedup factor S (the buffer throughput by S²)]
Regional Explicit Congestion Notification (RECN)
• Generalization & evolution of the ATLAS/QFC protocol
• Source-routing header describes path through fabric
• Intermediate or final link congestion sends back-notification
• All packets to congested link confined to a single lane
– intermediate links identified via path component in header
⇒ entire trees of destinations in a single lane (improvement over QFC)
– equivalent of a lane here called “Set-Aside Queue” (SAQ)
• VOQ’s replaced by a Single (!) Input Queue + SAQ’s
– dynamically create/delete SAQ’s
– a CAM is assumed to match the incoming packet header against the current SAQ’s
• Duato, Johnson, Flich, Naven, Garcia, Nachiondo: “A New Scalable and
Cost-Effective Congestion Management Strategy for Lossless Multistage
Interconnection Networks”, HPCA-11, San Francisco, USA, Feb. 2005.
Request-Grant Protocols
• Consider a buffer feeding an output link, and receiving traffic
from multiple sources:
• If credits are pre-allocated to each source, the buffer needs
to be as large as one RTT-window per source;
• If credits are “held” at buffer and only allocated to requesting
source(s) when these have something to transmit, then a
single RTT-window suffices for all sources!
⇒ economize on buffer space at the cost of longer latency
• N. Chrysos, M. Katevenis: “Scheduling in Switches with Small Internal
Buffers”, IEEE Globecom 2005, St. Louis, USA, Nov. 2005;
N. Chrysos, M. Katevenis: “Scheduling in Non-Blocking Buffered Three-
Stage Switching Fabrics”, IEEE Infocom 2006, Barcelona, Spain, Apr.
2006; http://archvlsi.ics.forth.gr/bpbenes/
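A toy sizing calculation makes the contrast concrete (RTT, link rate, and source count are invented for illustration):

```python
# Toy sizing comparison for the request-grant argument above.

def rtt_window_bytes(rtt_s, link_rate_bps):
    """One round-trip-time worth of data at the given rate."""
    return rtt_s * link_rate_bps / 8

RTT = 1e-6          # 1 us round trip inside the fabric
RATE = 10e9         # 10 Gb/s link feeding the buffer
N_SOURCES = 64

window = rtt_window_bytes(RTT, RATE)     # 1250 bytes

# Credits pre-allocated to every source: one RTT window per source.
preallocated = N_SOURCES * window        # 80 KB of buffer

# Credits held at the buffer, granted only on request: one window total.
request_grant = window                   # 1.25 KB of buffer

print(preallocated / request_grant)      # 64.0: buffer reduced N-fold
```

The price of the N-fold buffer reduction is the extra request-grant round trip before data can be injected.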
Congestion elimination using proactive admissions
• Ensure all injected packets can fit in fabric-output buffers
– fabric output buffers never exert backpressure
• Output buffer credits requested by inputs – scheduled by output arbiter
– injection rates are fair end2end – no “parking lot” problem
• With good traffic distribution (inverse-multiplexing)
– intermediate buffers never fill up as well ⇒ backpressure eliminated!
• Minimal in-fabric packet delay (queues usually empty) & no packet drops
⇒ low flow completion times
[Figure: Buffered Fat-tree / Clos fabric: simplified view]
Hotspot Traffic: the load for two outputs is 1.1 x C
• Indiscriminate backpressure: cells to non-hotspot dests have very large delays
• Request-grant backpressure performs as per-flow queues, but uses shared queues
– O(N) fewer queues
• Higher latency at low loads due to request-grant msgs
• Need to handle contention among requests using per-flow request counters
Ethernet Quantized Congestion Notification (QCN)
• Congestion point @ sw. queues:
every ~ 100 new frames, compute
congestion feedback value (fd)
– fd = Qoffset + w • dQ
– if fd > 0, issue congestion
notification msg (CNM) to src
NIC, identifying MAC-pair flow
• Non-congested flows share a single NIC queue
• Upon receiving a CNM for a dest MAC:
– alloc. flow queue & rate limiter (RL)
– target = current RL
– new RL = f ( RL, fd )
• Recovery upon absence of CNMs:
– fast recovery: RL = (RL+ target) / 2
– active increase: increment target (+x Mbps)
Additive Increase /
Multiplicative Decrease (AIMD)
QCN congestion points at outputs versus at inputs, under a fan-in scenario
• QCN at outputs is unfair: who’s to get the next CNM?
– the next CNM goes to the “unlucky” flow that sent the next 100-th frame
• QCN at inputs ⇒ fair rates
Arrival-sampling QCN at inputs does not protect non-congested flows
• Departures from the switch input buffer (VOQs) are not FIFO
(order enforced by arbiter)
• Arrival sampling: CNM the flow of the next “100th” frame
– flows are CNM’ed proportionally to their arrival rates
– unfair with non-FIFO service out of the buffer
• Occupancy sampling: CNM the flow with the largest “backlog”
(~ arrival rate minus departure rate)
[Figure: arrival sampling versus occupancy sampling]
Chrysos et al., “Unbiased QCN for scalable server-rack fabrics”, IBM Technical Report:
http://domino.research.ibm.com/library/cyberdig.nsf/papers/EF857B498A21B3EF85257D6A002E2235
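The contrast between the two sampling rules can be illustrated with a toy example (flow names and counts are invented):

```python
# Toy contrast of arrival sampling vs occupancy sampling at an input buffer.

from collections import Counter

arrivals = Counter()     # frames received per flow
departures = Counter()   # frames forwarded per flow (non-FIFO, per arbiter)

def arrival_sample():
    # "CNM the flow of the next 100th frame": over time this hits flows
    # in proportion to their arrival rates; here we take the heaviest
    # arrival source as the expected victim
    return max(arrivals, key=arrivals.get)

def occupancy_sample():
    # CNM the flow with the largest backlog (arrivals minus departures)
    backlog = {f: arrivals[f] - departures[f] for f in arrivals}
    return max(backlog, key=backlog.get)

# Fast but NON-congested flow A: everything it sends departs promptly.
arrivals["A"] += 90
departures["A"] += 90
# Slower flow B toward a congested output: its frames pile up.
arrivals["B"] += 10

print(arrival_sample())     # "A": blames the innocent, high-rate flow
print(occupancy_sample())   # "B": finds the actually backlogged flow
```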
Small & flat FCTs for scale-out latency-sensitive apps
• Traditionally, congestion control cares about throughput, fairness of
congested-link b/w & latency of non-congested pkts – congested pkts
get high latency
• Main perf. metric of interest when a group of servers collectively work
on latency-sensitive (datacenter) applications:
– flow completion time (FCT) – small for as many flows as possible
• Measures/actions
– TCP variants: keep in-fabric queues empty (Alizadeh, SIGCOMM 2010)
– datacenter networks rethink lossy (current) vs. lossless flow control
• avoid delayed TCP retransmissions: many milliseconds to recover pkts
– shift also from deterministic routing to flow-level (and more recently
also to packet-level) multipathing (M. Al-Fares, SIGCOMM’08; Zats,
SIGCOMM’12)
• what about flows crossing temporarily congested paths?
– proactive (scheduled) injections also drawing renewed interest
• don’t blindly inject packets – scheduled injections reduce in-fabric
backlogs
– host (network-stack) latencies: many tens of μs
• ok for previous networks (latency 100s of μs) – but new nets have only
a few μs
• avoid excessive packet copies (RDMA?), TCP offloading
Shortest-job-first scheduling: let small go first
Alizadeh et al., “pFabric: Minimal Near-Optimal Datacenter Transport”
• Switches store packets in per-input priority queues sorted based on the
remaining size of the corresponding TCP flows
– but how to find the remaining size of a TCP flow?
• heuristic: use the already transmitted size as indication
– schedule first packets from the flow with the smallest remaining size
– when buffer exceeds threshold, drop packets from the flow with the
largest remaining size
• Cost of priority queues
– commodity (cheap) switches have small on-chip buffers (a few
hundred 1500B frames per input)
– h/w comparators for a few hundred values are feasible
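A toy model of the per-input priority buffer just described; the smallest-remaining-size service order and the drop-from-largest rule follow the slide, while the list-based implementation and the sizes are ours:

```python
# Toy pFabric-style priority buffer (illustrative, not the paper's code).

class PriorityBuffer:
    def __init__(self, capacity):
        self.capacity = capacity
        self.pkts = []                 # (remaining_flow_size, packet_id)

    def enqueue(self, remaining, pkt_id):
        # 'remaining' would in practice be the heuristic estimate:
        # bytes already transmitted by the flow
        self.pkts.append((remaining, pkt_id))
        if len(self.pkts) > self.capacity:
            # buffer exceeds threshold: drop from the LARGEST flow
            self.pkts.remove(max(self.pkts))

    def dequeue(self):
        # schedule first a packet of the flow with the SMALLEST
        # remaining size (shortest-job-first)
        pkt = min(self.pkts)
        self.pkts.remove(pkt)
        return pkt[1]


buf = PriorityBuffer(capacity=3)
buf.enqueue(9000, "big-1")
buf.enqueue(1500, "small-1")
buf.enqueue(4500, "mid-1")
buf.enqueue(9000, "big-2")     # overflow: a packet of a 'big' flow is dropped
assert buf.dequeue() == "small-1"
assert buf.dequeue() == "mid-1"
```

With only a few hundred packets per input, the linear min/max scans above correspond to the "few-hundred-value hardware comparators" the slide calls feasible.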
Virtual switches in cloud (computing) datacenters
• Another layer of switching inside hosts (likely in s/w ⇒ versatile/flexible)
– VM → vNIC → vSwitch → NIC → fabric → NIC → vSwitch → vNIC → VM
• Ping between two remote physical servers: 21 μs; ping between two
collocated VMs: 221 μs
• vSwitches have “soft”-capacity links: transfer pkts when “on” CPU
– excessive packet drops inside dest host (Crisan et al., SIGCOMM 2013)
– offloaded vSwitch functions inside hardware network adapters (SR-IOV)
– good but not fully SDN-compatible
• Lossless vSwitches (Crisan et al., SIGCOMM 2013)
– vSwitch backpressure can propagate inside the physical network
⇒ excessive blocking
– reserve endpoint vSwitch credits before injecting pkts from host
• Crisan, Birke, Chrysos, Minkenberg, Gusat, “zFabric: How to Virtualize
Lossless Ethernet?”, IEEE Cluster 2014
Buffer Space for Bounded Peak-to-Average Rate Ratio
• Assume Rpeak(i) / Raverage(i) ≤ PAR for all flows i on a link
– R(i) is the rate (throughput) of flow i
– PAR is a constant: peak-to-average ratio bound
– interpretation: rate fluctuation is bounded by PAR
• Each flow i needs a credit window of RTT · Rpeak(i)
• Buffer space for all flows is ∑i ( RTT · Rpeak(i) ) =
= RTT · ∑i Rpeak(i) ≤ RTT · ∑i ( PAR · Raverage(i) ) =
= PAR · RTT · ∑i Raverage(i) ≤ PAR · ( RTT · Rlink )
⇒ Allocate buffer space = PAR number of “windows”
⇒ When individual flow rates change, rearrange the allocation of
buffer space between flows – but must wait for the buffer of one
flow to drain before reallocating it (not obvious how to)
• H.T. Kung, T. Blackwell, A. Chapman: “Credit-Based Flow Control for
ATM Networks: Credit Update Protocol, Adaptive Credit Allocation, and
Statistical Multiplexing”, SIGCOMM '94, pp. 101-114.
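A numeric check of the bound derived above (the flow rates and RTT are invented; each flow's peak rate is at most PAR times its average, and the averages fit on the link):

```python
# Numeric check of the PAR buffer bound.

RTT = 10e-6        # 10 us round-trip time
R_LINK = 10e9      # 10 Gb/s link rate
PAR = 4            # bound on Rpeak(i) / Raverage(i)

avg_rates = [1e9, 2e9, 0.5e9, 1.5e9]        # sum(avg) <= R_LINK
peak_rates = [PAR * r for r in avg_rates]   # worst case PAR allows

# Per-flow credit windows of RTT * Rpeak(i), summed over all flows...
buffer_needed = sum(RTT * rp for rp in peak_rates)

# ...never exceed PAR link "windows" of RTT * Rlink.
bound = PAR * (RTT * R_LINK)

assert buffer_needed <= bound
print(buffer_needed / (RTT * R_LINK))   # ≈ 2.0 link windows in this example
```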
Dynamically Sharing the Buffer Space among Flows
• In order to depart, a packet must acquire both:
– a per-flow credit (to guard against “buffer hogging”), and
– a per-link credit (to ensure that the shared buffer does not overflow)
• Properly manage (increase or decrease) the per-flow
window allocation based on traffic circumstances:
– ATLAS and QFC protocols never change the per-flow window
– H.T. Kung’s protocol moves allocations between flows (unclear how)
– other idea: use two window sizes – a “full” one and a “small” one; use
full-size windows when total buffer occupancy is below a threshold,
use small-size windows (for all flows) above that point (flows that had
already filled more than a small window will lose their allocation on
packet departure) – C. Ozveren, R. Simcoe, G. Varghese: “Reliable
and Efficient Hop-by-Hop Flow Control”, IEEE JSAC, May 1995.
• Draining Rate Theorem: M. Katevenis: “Buffer Requirements of Credit-
Based Flow Control when a Minimum Draining Rate is Guaranteed”, HPCS'97;
ftp://ftp.ics.forth.gr/tech-reports/1997/1997.HPCS97.drain_cr_buf.ps.gz
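The Ozveren/Simcoe/Varghese two-window idea can be sketched as follows; the window sizes, threshold, and buffer size are illustrative, and the two admission checks mirror the per-flow and per-link credits above:

```python
# Toy sketch of the two-window buffer-sharing scheme.

FULL, SMALL, THRESHOLD, BUFFER = 16, 2, 48, 64

class SharedBuffer:
    def __init__(self):
        self.occupancy = 0
        self.per_flow = {}              # cells buffered per flow

    def window(self):
        # full windows below the occupancy threshold, small ones above
        return FULL if self.occupancy < THRESHOLD else SMALL

    def admit(self, flow):
        used = self.per_flow.get(flow, 0)
        # per-flow check guards against buffer hogging; the occupancy
        # check keeps the shared buffer from overflowing
        if used < self.window() and self.occupancy < BUFFER:
            self.per_flow[flow] = used + 1
            self.occupancy += 1
            return True
        return False


buf = SharedBuffer()
for _ in range(16):
    assert buf.admit("f0")      # below threshold: full window applies
assert not buf.admit("f0")      # f0 has filled its full window
for _ in range(32):
    buf.admit("f1"); buf.admit("f2")
assert buf.window() == SMALL    # occupancy >= threshold: small windows now
assert buf.admit("f3")          # a new flow still gets a small window
assert buf.admit("f3")
assert not buf.admit("f3")      # ...but only SMALL = 2 cells of it
```

Flows like f0 that had already filled more than a small window keep their cells but lose the excess allocation as those cells depart.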
Communication Cost versus Buffer/Logic Cost
(data from the Hot Interconnects ’95 keynote speech by A. Fraser, VP, AT&T Bell Labs)
• On-Chip: millions of transistors – hundreds of pins
• Off-Chip:
  speed of transmission line        45 Mb/s
  cost of long-distance xmission    45 $/mile/month
  speed of signal propagation       7 microsec/mile
  round-trip window size            79 bytes/mile
  cost of 16 MByte DRAM             1000 $
  cost of window-size memory        0.5 cents/mile
  investment write-down period      36 months
  cost of queue mem. per month      0.014 cents/mile/month
  ratio: transmission/memory cost   330,000 to 1
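The keynote's numbers are internally consistent, which a few lines of arithmetic confirm (Python used purely as a calculator; the 16-MByte-DRAM price sets the per-byte memory cost):

```python
# Re-deriving the Fraser keynote figures from first principles.

link_rate = 45e6                    # transmission line, b/s
prop_delay = 7e-6                   # signal propagation, s per mile (one way)

rtt_window = 2 * prop_delay * link_rate / 8
print(rtt_window)                   # 78.75 bytes/mile, i.e. the "79"

dram_cost_per_byte = 1000 / (16 * 2**20)           # $ per byte of DRAM
window_cost = rtt_window * dram_cost_per_byte * 100
print(round(window_cost, 2))        # ≈ 0.47 cents/mile, the "0.5"

monthly = window_cost / 36          # 36-month write-down
print(round(monthly, 3))            # ≈ 0.013 cents/mile/month, the "0.014"

ratio = (45 * 100) / monthly        # 45 $/mile/month vs memory cost
print(int(ratio))                   # ≈ 345,000: the "330,000 to 1" scale
```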
Per-Connection Queueing & FC:
How Many “Windows” of Buffer Space?
• windowSize(VCi) = RTT × peakThroughput(VCi)
• windowSize(VCi) < or << windowL := RTT × throughput(Link)
• cost(Link) ≈ 330,000 × cost(windowL)
• lossy flow control usually operates the network with goodput reaching
up to 70-80% of link throughput
• lossless flow control operates up to 98-100% link utilization
• the 20-30% extra utilization with lossless FC is worth approx. 10 to
100 thousand windowL’s worth of extra buffer memory
⇒ if lossless flow control can yield its link-utilization advantage with
less than a few tens of thousands of windowL’s of extra buffer memory,
then lossless flow control is a clear win
• indeed, lossless FC can do that, even with far less buffer space...