CS-534 - Copyright University of Crete 1
6.3 Buffer Space versus Number of Flows
• What to do when the number of flows is so large that it becomes
impractical to allocate a separate flow-control “window” for each
one of them:
– Per-destination flow merging: Sapountzis and Katevenis, IEEE
Communications Magazine, Jan. 2005, pp. 88-94.
– Dynamically sharing the buffer space among the flows: the ATLAS I
and QFC flow-control protocols, and their evaluation
– Regional Explicit Congestion Notification (RECN)
– Request-Grant protocols: N. Chrysos, 2005
– End-to-end congestion control via request-grant: N. Chrysos, 2006
– Ethernet Quantized Congestion Notification (QCN)
– Job-size-aware scheduling and buffer management
– Other buffer-sharing protocols of the mid-90’s
– Buffer memory cost versus transmission throughput cost in 1995
Buffered Switching Fabrics with Internal Backpressure
• Performance of OQ at the cost of IQ
• Requires per-flow backpressure
[Figure: large off-chip buffers (DRAM) at the inputs; small on-chip
buffers inside the fabric; internal backpressure between stages]
Cell Distribution Methods
• Per-flow traffic distribution:
– Per-flow round-robin (PerFlowRR)
– Per-flow imbalance up to 1 cell (PerFlowIC)
⇒ accurate load balancing, on a shorter-term basis
• Aggregate traffic distribution:
– Randomized routing (no backpressure)
– Adaptive routing (indiscriminate backpressure)
⇒ load balancing on the long term only
[Figure: cells from the same input to different outputs are
distributed across the middle stage, then re-sequenced at the output]
Per-output Flow Merging
• Too many flows: N² per chip in the middle stage
• Per-output flow merging retains the benefits of per-flow backpressure:
– N flows per link, everywhere
– freedom from deadlock
• Re-sequencing needs to consider flows as they were before merging
The “ATLAS I” Credit Flow-Control Protocol
• M. Katevenis, D. Serpanos, E. Spyridakis: “Credit-Flow-Controlled ATM
for MP Interconnection: the ATLAS I Single-Chip ATM Switch”, Proc.
IEEE HPCA-4 (High Perf. Comp. Arch.), Las Vegas, USA, Feb. 1998,
pp. 47-56; http://archvlsi.ics.forth.gr/atlasI/atlasI_hpca98.ps.gz
• Features:
– identically destined traffic is confined to a single “lane”
– each “lane” can be shared by cells belonging to multiple packets
• As opposed to Wormhole Routing, where:
– a “virtual circuit” (VC) is allocated to a packet and dedicated to it for
its entire duration, i.e. until all “flits” (cells) of that packet go through
– identically destined packets are allowed to occupy distinct VC’s
QFC-like Credit Protocol
• Quantum Flow Control (QFC) Alliance: proposed standard for
credit-based flow control over WAN ATM links
• ATLAS I: similar protocol, adapted to short links, hardware implem.
• Number of Lanes L = B/b
• Both kinds of credit are needed for a cell to depart
(in ATLAS I: b = 1)
[Figure: upstream and downstream switches; the downstream buffer of
size B is shared by D destinations; the upstream switch keeps a
per-destination credit count (dstCr) of b cells for each destination
plus a pool credit count (poolCr) of B cells; cell i consumes one
dstCr and one poolCr to depart, and credit i returns upstream when
the cell leaves the downstream buffer]
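The two-credit departure rule above can be sketched as a toy Python model; the class and method names are ours (real implementations, like ATLAS I, are in hardware):

```python
class CreditSender:
    """Toy model of the two-credit (QFC / ATLAS I style) departure rule."""

    def __init__(self, num_lanes, b, pool):
        self.dst_cr = [b] * num_lanes   # per-destination (lane) credits, b each
        self.pool_cr = pool             # shared downstream-buffer (pool) credits

    def can_send(self, lane):
        # a cell departs only when BOTH kinds of credit are available
        return self.dst_cr[lane] > 0 and self.pool_cr > 0

    def send(self, lane):
        assert self.can_send(lane)
        self.dst_cr[lane] -= 1          # consume the per-lane credit...
        self.pool_cr -= 1               # ...and the pool credit

    def credit_return(self, lane):
        # the downstream switch forwarded the cell: both credits come back
        self.dst_cr[lane] += 1
        self.pool_cr += 1


s = CreditSender(num_lanes=4, b=1, pool=4)   # in ATLAS I: b = 1
s.send(0)
assert not s.can_send(0)    # lane 0 blocked until its credit returns
assert s.can_send(1)        # other lanes still have per-lane credit
s.credit_return(0)
assert s.can_send(0)
```

The per-lane credit confines identically destined traffic to its lane, while the pool credit lets the lanes dynamically share the buffer of B cells.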
[Plot: Saturation Throughput vs Buffer Space (= Lanes) per Link, in
cells or flits (1, 2, 4, 8, 16); 64x64 fabric: 6-stage banyan using
2x2 elements; 20-cell or 20-flit bursts, uniformly destined; curves
for ATLAS (B = L, with b = 1) and Wormhole; y-axis range 0.3–1.0]
[Plot: Non-Hot-Spot Delay, in the Presence of Hot-Spot Destinations;
delay (cell times, log scale 10–2000) vs Number of Lanes (L) = 1, 2,
4, 8, 16, with buffer space B = 16 cells or flits per link; curves
for ATLAS and Wormhole with no, 1, or 2 hot-spots; non-hot-spot load
= 0.2; 20-cell/flit bursts; 64x64 fabric: 6-stage banyan with 2x2
elements]
ATLAS I
• Single-chip ATM Switch with Multilane Backpressure
• 10 Gbit/s = 16×16 @ 622 Mb/s/port
• Shared Buffer
• 0.35 µm CMOS
• 1996-98, FORTH-ICS, Crete, GR
[Chip floorplan: 16 ports (0–15) of transceivers around the
periphery; core blocks include the Cell Buffer & Switching (pipelined
memory), Queue Management, Scheduling, Credit Tables and credit
scheduling, VP/VC Translation Table, Routing Table, header pattern
match, Load Monitor, and chip control & management]
[Table: ATLAS I cost breakdown (Area, Power, Gates, FF, SRAM, Design
Effort; percentages per block). Periphery: Pads & Drivers, GigaBaud
Transceivers. Core: wiring; Elastic buf., I/O Link Intf.; Scheduling,
Pop. Counts; Ctrl/Mgt, Load Mon, Misc.; Queue Pointer Manag'mnt; Cell
Buffer & Switching; Header Pr., Rt'ng, Transl.; Credit-based Flow
Control. Totals approximately 25 mm², 9 W, 15 person-years of design
effort]
Backpressure Cost Evaluation, versus Alternatives
• Measure the cost of credit flow control in ATLAS, and compare to
alternatives without internal backpressure in the fabric:
– large buffers (off-chip DRAM) in all switches throughout the fabric, or
– internal speedup in the fabric and output buffers
• Kornaros, Pnevmatikatos, Vatsolaki, Kalokerinos, Xanthaki, D. Mavroidis,
Serpanos, Katevenis: “ATLAS I: Implementing a Single-Chip ATM Switch
with Backpressure”, IEEE Micro Magazine, Jan/Feb. 1999,
http://archvlsi.ics.forth.gr/atlasI/hoti98/
Making Large ATM Switches:
Switching Fabrics with Internal Backpressure
[Figure: per-flow input queues in large, off-chip buffers (DRAM) at
the N input interfaces; the ATLAS switching stages use small, on-chip
buffers, with backpressure propagating stage by stage back to the
input queues; total (off-chip buffer) throughput = 2N]
Switching Fabrics without Backpressure 1: Large Buffers
[Figure: every switch in the fabric has large, off-chip buffers
(DRAM); total (off-chip buffer) throughput = 2N log N]
Switching Fabrics without Backpressure 2: Internal Speedup
[Figure: small, on-chip buffers inside the fabric, running at speedup
s > 1 (under bursty or non-uniform traffic: s >> 1); large, off-chip
output queues (DRAM) at the output interfaces; total throughput =
(s+1)N]
Backpressure Cost/Benefit 1:
No Backpressure, Large Off-Chip Buffers
[Table: the ATLAS I area/power breakdown rescaled for this
alternative: the credit-based flow-control blocks are eliminated
(×0), while the cell buffer & switching and the off-chip
communication cost grow substantially (×2 or more); header
processing, routing, and translation stay roughly unchanged]
Backpressure Cost/Benefit 2:
No Backpressure, Internal Speedup, Output Queues
[Table: the same breakdown with the credit-based flow-control blocks
eliminated (×0), but the cell buffer & switching, queue pointers,
scheduling, elastic buffers, and off-chip communication all scaled by
the internal speedup factor S (the buffer throughput by S²)]
Regional Explicit Congestion Notification (RECN)
• Generalization & evolution of the ATLAS/QFC protocol
• Source-routing header describes path through fabric
• Intermediate or final link congestion sends back-notification
• All packets to congested link confined to a single lane
– intermediate links identified via path component in header
⇒ entire trees of destinations in a single lane (improvement over QFC)
– equivalent of a lane here called “Set-Aside Queue” (SAQ)
• VOQ’s replaced by a Single (!) Input Queue + SAQ’s
– dynamically create/delete SAQ’s
– a CAM is assumed to match the incoming packet header against the current SAQ’s
• Duato, Johnson, Flich, Naven, Garcia, Nachiondo: “A New Scalable and
Cost-Effective Congestion Management Strategy for Lossless Multistage
Interconnection Networks”, HPCA-11, San Francisco, USA, Feb. 2005.
Request-Grant Protocols
• Consider a buffer feeding an output link, and receiving traffic
from multiple sources:
• If credits are pre-allocated to each source, the buffer needs
to be as large as one RTT-window per source;
• If credits are “held” at buffer and only allocated to requesting
source(s) when these have something to transmit, then a
single RTT-window suffices for all sources!
⇒ economize on buffer space at the cost of longer latency
• N. Chrysos, M. Katevenis: “Scheduling in Switches with Small Internal
Buffers”, IEEE Globecom 2005, St. Louis, USA, Nov. 2005;
N. Chrysos, M. Katevenis: “Scheduling in Non-Blocking Buffered Three-
Stage Switching Fabrics”, IEEE Infocom 2006, Barcelona, Spain, Apr.
2006; http://archvlsi.ics.forth.gr/bpbenes/
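A toy sizing calculation makes the contrast concrete (RTT, link rate, and source count are invented for illustration):

```python
# Toy sizing comparison for the request-grant argument above.

def rtt_window_bytes(rtt_s, link_rate_bps):
    """One round-trip-time worth of data at the given rate."""
    return rtt_s * link_rate_bps / 8

RTT = 1e-6          # 1 us round trip inside the fabric
RATE = 10e9         # 10 Gb/s link feeding the buffer
N_SOURCES = 64

window = rtt_window_bytes(RTT, RATE)     # 1250 bytes

# Credits pre-allocated to every source: one RTT window per source.
preallocated = N_SOURCES * window        # 80 KB of buffer

# Credits held at the buffer, granted only on request: one window total.
request_grant = window                   # 1.25 KB of buffer

print(preallocated / request_grant)      # 64.0: buffer reduced N-fold
```

The price of the N-fold buffer reduction is the extra request-grant round trip before data can be injected.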
Congestion elimination using proactive admissions
• Ensure all injected packets can fit in fabric-output buffers
– fabric output buffers never exert backpressure
• Output buffer credits requested by inputs – scheduled by output arbiter
– injection rates are fair end2end – no “parking lot” problem
• With good traffic distribution (inverse-multiplexing)
– intermediate buffers never fill up as well ⇒ backpressure eliminated!
• Minimal in-fabric packet delay (queues usually empty) & no packet drops
⇒ low flow completion times
[Figure: Buffered Fat-tree / Clos fabric: simplified view]
Hotspot Traffic: the load for two outputs is 1.1 x C
• Indiscriminate backpressure: cells to non-hotspot dests have very large delays
• Request-grant backpressure performs as per-flow queues, but uses shared queues
– O(N) fewer queues
• Higher latency at low loads due to request-grant msgs
• Need to handle contention among requests using per-flow request counters
Ethernet Quantized Congestion Notification (QCN)
• Congestion point @ sw. queues:
every ~ 100 new frames, compute
congestion feedback value (fd)
– fd = Qoffset + w • dQ
– if fd > 0, issue congestion
notification msg (CNM) to src
NIC, identifying MAC-pair flow
• Non-congested flows share a single NIC queue
• Upon receiving a CNM for a dest MAC:
– alloc. flow queue & rate limiter (RL)
– target = current RL
– new RL = f ( RL, fd )
• Recovery upon absence of CNMs:
– fast recovery: RL = (RL+ target) / 2
– active increase: increment target (+x Mbps)
Additive Increase /
Multiplicative Decrease (AIMD)
QCN congestion points at outputs versus at inputs, under a fan-in scenario
• QCN at outputs is unfair: who’s to get the next CNM?
– the next CNM goes to the “unlucky” flow that sent the next 100-th frame
• QCN at inputs ⇒ fair rates
Arrival-sampling QCN at inputs does not protect non-congested flows
• Departures from the switch input buffer (VOQs) are not FIFO
(order enforced by arbiter)
• Arrival sampling: CNM the flow of the next “100th” frame
– flows are CNM’ed proportionally to their arrival rates
– unfair with non-FIFO service out of the buffer
• Occupancy sampling: CNM the flow with the largest “backlog”
(~ arrival rate minus departure rate)
[Figure: arrival sampling versus occupancy sampling]
Chrysos et al., “Unbiased QCN for scalable server-rack fabrics”, IBM Technical Report:
http://domino.research.ibm.com/library/cyberdig.nsf/papers/EF857B498A21B3EF85257D6A002E2235
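The contrast between the two sampling rules can be illustrated with a toy example (flow names and counts are invented):

```python
# Toy contrast of arrival sampling vs occupancy sampling at an input buffer.

from collections import Counter

arrivals = Counter()     # frames received per flow
departures = Counter()   # frames forwarded per flow (non-FIFO, per arbiter)

def arrival_sample():
    # "CNM the flow of the next 100th frame": over time this hits flows
    # in proportion to their arrival rates; here we take the heaviest
    # arrival source as the expected victim
    return max(arrivals, key=arrivals.get)

def occupancy_sample():
    # CNM the flow with the largest backlog (arrivals minus departures)
    backlog = {f: arrivals[f] - departures[f] for f in arrivals}
    return max(backlog, key=backlog.get)

# Fast but NON-congested flow A: everything it sends departs promptly.
arrivals["A"] += 90
departures["A"] += 90
# Slower flow B toward a congested output: its frames pile up.
arrivals["B"] += 10

print(arrival_sample())     # "A": blames the innocent, high-rate flow
print(occupancy_sample())   # "B": finds the actually backlogged flow
```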
Small & flat FCTs for scale-out latency-sensitive apps
• Traditionally, congestion control cares about throughput, fairness of
congested-link b/w & latency of non-congested pkts – congested pkts
get high latency
• Main perf. metric of interest when a group of servers collectively work
on latency-sensitive (datacenter) applications:
– flow completion time (FCT) – small for as many flows as possible
• Measures/actions
– TCP variants: keep in-fabric queues empty (Alizadeh, SIGCOMM 2010)
– datacenter networks rethink lossy (current) vs. lossless flow control
• avoid delayed TCP retransmissions: many milliseconds to recover pkts
– shift also from deterministic routing to flow-level (and more recently
also to packet-level) multipathing (M. Al-Fares, SIGCOMM’08; Zats,
SIGCOMM’12)
• what about flows crossing temporarily congested paths?
– proactive (scheduled) injections also drawing renewed interest
• don’t blindly inject packets – scheduled injections reduce in-fabric
backlogs
– host (network-stack) latencies: many tens of μs
• ok for previous networks (latency 100s of μs) – but new nets have only
a few μs
• avoid excessive packet copies (RDMA?), TCP offloading
Shortest-job-first scheduling: let small go first
Alizadeh et al., “pFabric: Minimal Near-Optimal Datacenter Transport”
• Switches store packets in per-input priority queues sorted based on the
remaining size of the corresponding TCP flows
– but how to find the remaining size of a TCP flow?
• heuristic: use the already transmitted size as indication
– schedule first packets from the flow with the smallest remaining size
– when buffer exceeds threshold, drop packets from the flow with the
largest remaining size
• Cost of priority queues
– commodity (cheap) switches have small on-chip buffers (a few
hundred 1500B frames per input)
– h/w comparators for a few hundred values are feasible
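A toy model of the per-input priority buffer just described; the smallest-remaining-size service order and the drop-from-largest rule follow the slide, while the list-based implementation and the sizes are ours:

```python
# Toy pFabric-style priority buffer (illustrative, not the paper's code).

class PriorityBuffer:
    def __init__(self, capacity):
        self.capacity = capacity
        self.pkts = []                 # (remaining_flow_size, packet_id)

    def enqueue(self, remaining, pkt_id):
        # 'remaining' would in practice be the heuristic estimate:
        # bytes already transmitted by the flow
        self.pkts.append((remaining, pkt_id))
        if len(self.pkts) > self.capacity:
            # buffer exceeds threshold: drop from the LARGEST flow
            self.pkts.remove(max(self.pkts))

    def dequeue(self):
        # schedule first a packet of the flow with the SMALLEST
        # remaining size (shortest-job-first)
        pkt = min(self.pkts)
        self.pkts.remove(pkt)
        return pkt[1]


buf = PriorityBuffer(capacity=3)
buf.enqueue(9000, "big-1")
buf.enqueue(1500, "small-1")
buf.enqueue(4500, "mid-1")
buf.enqueue(9000, "big-2")     # overflow: a packet of a 'big' flow is dropped
assert buf.dequeue() == "small-1"
assert buf.dequeue() == "mid-1"
```

With only a few hundred packets per input, the linear min/max scans above correspond to the "few-hundred-value hardware comparators" the slide calls feasible.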
Virtual switches in cloud (computing) datacenters
• Another layer of switching inside hosts (likely in s/w ⇒ versatile/flexible)
– VM → vNIC → vSwitch → NIC → fabric → NIC → vSwitch → vNIC → VM
• Ping between two remote physical servers: 21 μs; ping between two
collocated VMs: 221 μs
• vSwitches have “soft”-capacity links: transfer pkts when “on” CPU
– excessive packet drops inside dest host (Crisan et al., SIGCOMM 2013)
– offloaded vSwitch functions inside hardware network adapters (SR-IOV)
– good but not fully SDN-compatible
• Lossless vSwitches (Crisan et al., SIGCOMM 2013)
– vSwitch backpressure can propagate inside the physical network
⇒ excessive blocking
– reserve endpoint vSwitch credits before injecting pkts from host
• Crisan, Birke, Chrysos, Minkenberg, Gusat, “zFabric: How to Virtualize
Lossless Ethernet?”, IEEE Cluster 2014
Buffer Space for Bounded Peak-to-Average Rate Ratio
• Assume Rpeak(i) / Raverage(i) ≤ PAR for all flows i on a link
– R(i) is the rate (throughput) of flow i
– PAR is a constant: peak-to-average ratio bound
– interpretation: rate fluctuation is bounded by PAR
• Each flow i needs a credit window of RTT · Rpeak(i)
• Buffer space for all flows is ∑i ( RTT · Rpeak(i) ) =
= RTT · ∑i Rpeak(i) ≤ RTT · ∑i ( PAR · Raverage(i) ) =
= PAR · RTT · ∑i Raverage(i) ≤ PAR · ( RTT · Rlink )
⇒ Allocate buffer space = PAR number of “windows”
⇒ When individual flow rates change, rearrange the allocation of
buffer space between flows – but must wait for the buffer of one
flow to drain before reallocating it (not obvious how to)
• H.T. Kung, T. Blackwell, A. Chapman: “Credit-Based Flow Control for
ATM Networks: Credit Update Protocol, Adaptive Credit Allocation, and
Statistical Multiplexing”, SIGCOMM '94, pp. 101-114.
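A numeric check of the bound derived above (the flow rates and RTT are invented; each flow's peak rate is at most PAR times its average, and the averages fit on the link):

```python
# Numeric check of the PAR buffer bound.

RTT = 10e-6        # 10 us round-trip time
R_LINK = 10e9      # 10 Gb/s link rate
PAR = 4            # bound on Rpeak(i) / Raverage(i)

avg_rates = [1e9, 2e9, 0.5e9, 1.5e9]        # sum(avg) <= R_LINK
peak_rates = [PAR * r for r in avg_rates]   # worst case PAR allows

# Per-flow credit windows of RTT * Rpeak(i), summed over all flows...
buffer_needed = sum(RTT * rp for rp in peak_rates)

# ...never exceed PAR link "windows" of RTT * Rlink.
bound = PAR * (RTT * R_LINK)

assert buffer_needed <= bound
print(buffer_needed / (RTT * R_LINK))   # ≈ 2.0 link windows in this example
```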
Dynamically Sharing the Buffer Space among Flows
• In order to depart, a packet must acquire both:
– a per-flow credit (to guard against “buffer hogging”), and
– a per-link credit (to ensure that the shared buffer does not overflow)
• Properly manage (increase or decrease) the per-flow
window allocation based on traffic circumstances:
– ATLAS and QFC protocols never change the per-flow window
– H.T. Kung’s protocol moves allocations between flows (unclear how)
– other idea: use two window sizes – a “full” one and a “small” one; use
full-size windows when total buffer occupancy is below a threshold,
use small-size windows (for all flows) above that point (flows that had
already filled more than a small window will lose their allocation on
packet departure) – C. Ozveren, R. Simcoe, G. Varghese: “Reliable
and Efficient Hop-by-Hop Flow Control”, IEEE JSAC, May 1995.
• Draining Rate Theorem: M. Katevenis: “Buffer Requirements of Credit-
Based Flow Control when a Minimum Draining Rate is Guaranteed”, HPCS'97;
ftp://ftp.ics.forth.gr/tech-reports/1997/1997.HPCS97.drain_cr_buf.ps.gz
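The Ozveren/Simcoe/Varghese two-window idea can be sketched as follows; the window sizes, threshold, and buffer size are illustrative, and the two admission checks mirror the per-flow and per-link credits above:

```python
# Toy sketch of the two-window buffer-sharing scheme.

FULL, SMALL, THRESHOLD, BUFFER = 16, 2, 48, 64

class SharedBuffer:
    def __init__(self):
        self.occupancy = 0
        self.per_flow = {}              # cells buffered per flow

    def window(self):
        # full windows below the occupancy threshold, small ones above
        return FULL if self.occupancy < THRESHOLD else SMALL

    def admit(self, flow):
        used = self.per_flow.get(flow, 0)
        # per-flow check guards against buffer hogging; the occupancy
        # check keeps the shared buffer from overflowing
        if used < self.window() and self.occupancy < BUFFER:
            self.per_flow[flow] = used + 1
            self.occupancy += 1
            return True
        return False


buf = SharedBuffer()
for _ in range(16):
    assert buf.admit("f0")      # below threshold: full window applies
assert not buf.admit("f0")      # f0 has filled its full window
for _ in range(32):
    buf.admit("f1"); buf.admit("f2")
assert buf.window() == SMALL    # occupancy >= threshold: small windows now
assert buf.admit("f3")          # a new flow still gets a small window
assert buf.admit("f3")
assert not buf.admit("f3")      # ...but only SMALL = 2 cells of it
```

Flows like f0 that had already filled more than a small window keep their cells but lose the excess allocation as those cells depart.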
Communication Cost versus Buffer/Logic Cost
(data from the Hot Interconnects ’95 keynote speech by A. Fraser, VP, AT&T Bell Labs)
• On-Chip: millions of transistors – hundreds of pins
• Off-Chip:
  speed of transmission line        45 Mb/s
  cost of long-distance xmission    45 $/mile/month
  speed of signal propagation       7 microsec/mile
  round-trip window size            79 bytes/mile
  cost of 16 MByte DRAM             1000 $
  cost of window-size memory        0.5 cents/mile
  investment write-down period      36 months
  cost of queue mem. per month      0.014 cents/mile/month
  ratio: transmission/memory cost   330,000 to 1
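The keynote's numbers are internally consistent, which a few lines of arithmetic confirm (Python used purely as a calculator; the 16-MByte-DRAM price sets the per-byte memory cost):

```python
# Re-deriving the Fraser keynote figures from first principles.

link_rate = 45e6                    # transmission line, b/s
prop_delay = 7e-6                   # signal propagation, s per mile (one way)

rtt_window = 2 * prop_delay * link_rate / 8
print(rtt_window)                   # 78.75 bytes/mile, i.e. the "79"

dram_cost_per_byte = 1000 / (16 * 2**20)           # $ per byte of DRAM
window_cost = rtt_window * dram_cost_per_byte * 100
print(round(window_cost, 2))        # ≈ 0.47 cents/mile, the "0.5"

monthly = window_cost / 36          # 36-month write-down
print(round(monthly, 3))            # ≈ 0.013 cents/mile/month, the "0.014"

ratio = (45 * 100) / monthly        # 45 $/mile/month vs memory cost
print(int(ratio))                   # ≈ 345,000: the "330,000 to 1" scale
```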
Per-Connection Queueing & FC:
How Many “Windows” of Buffer Space?
• windowSize(VCi) = RTT × peakThroughput(VCi)
• windowSize(VCi) < or << windowL := RTT × throughput(Link)
• cost(Link) ≈ 330,000 × cost(windowL)
• lossy flow control usually operates the network with goodput reaching
up to 70-80% of link throughput
• lossless flow control operates up to 98-100% link utilization
• the 20-30% extra utilization with lossless FC is worth approx. 10 to
100 thousand windowL’s worth of extra buffer memory
⇒ if lossless flow control can yield its link-utilization advantage with
less than a few tens of thousands of windowL’s of extra buffer memory,
then lossless flow control is a clear win
• indeed, lossless FC can do that, even with far less buffer space...