A. Cassinelli, A. Goulet, M. Ishikawa
University of Tokyo, Department of Information Physics and Computing
M. Naruse, F. Kubota
National Institute of Information and Communications Technology
Load-balanced optical packet switch using two-stage time-slot interchangers
IEICE 2004
Plan of the presentation
I. Introduction:
- The Ideal Packet Switch and Our Goals
- Some Assumptions
- The BVN Switch and its Scheduling Complexity Bottleneck
II. The Load-Balanced BVN Switch
- the LB stage removes traffic non-uniformities, simplifying the scheduling of a BVN switch
- makes the switch performance independent from traffic
III. Optical implementation of the Load-Balancing Switch
The load-balancing architecture allows for a simple, deterministic buffer schedule, ideal for optical implementation using fiber delay-line based TSIs...
III.1 Single-Stage TSI and resulting LBS performance
III.2 Double-Stage TSI and resulting LBS performance
IV. Conclusion and Further Research
V. References
I. Introduction
• develop an “ideal” optical packet switch for TDM, possibly for asynchronous optical networks (WDM remains an additional dimension).
• do that without using non-mature RAM optical memories - only delay lines.
• Provide high throughput for any kind of traffic
• Be stable – queues in buffer should remain bounded
• Have low delays
• Manage priority traffic– provide throughput guarantees for some ports
– provide reduced delays for such traffic
The ideal packet switch should:
Our goal here:
BVN scheduler
Some preliminary assumptions
• Time is “slotted”: packets have the same size and are “aligned”
• At most one packet arrives per time slot at each input line (no WDM)
• The output lines are not overloaded (traffic is “admissible”)
Given these assumptions, a good switch candidate is the so-called “Birkhoff-von Neumann switch”, first proposed by Chang [1999], based on the works of Birkhoff [1946] and von Neumann [1953].
Essentially, it is a Crossbar Switch that:
• has Virtual Output Queues (VOQ) to alleviate HOL blocking,
• relies on an efficient but rather time-consuming O(N^4.5) scheduling algorithm to find the appropriate sequence of crossbar states that services the VOQs, avoids their saturation and reduces packet delay.
The BVN switch
... but today there is an additional constraint: given the speed of today's networks, schedulers are running short of time for computation!
[Figure: clock cycles allowed to schedule a single packet, falling steeply from about 700 in 1996 to near zero in 2001 (from McKeown, Stanford University)]
So, the “ideal switch” must also rely on a scheduling algorithm with very low computational complexity.
(40 Gb/s => 11 ns per ATM packet, or 10 cycles in a 1 GHz computer...)
• It is relatively easy to prove that if the traffic is uniform, the BVN decomposition consists of a set of N permutations providing full access. These can be cycled blindly in order to serve the VOQs.
• The only condition on this set of N permutations is that it provides full access (i.e., for any input-output pair, at least one permutation in the set connects that input to that output).
... this would mean an O(1) scheduling complexity
Ex: one cycle for N=4
full-access
There is hope...
So...
Is there a way to pre-process an irregular traffic load such that the inputs of the switch “see” a uniform load?
Answer: Yes! It is called “Load Balancing”.
There are several ways to do that... The simplest (deterministic) one consists of adding an additional input switch stage, which runs through a periodic sequence of connection patterns that realize full access...
(1) Input load is equally distributed at the outputs (2) Bursty traffic is also distributed
[Figure: load-balancing example — the per-port destination sequences of a bursty input traffic pattern (e.g. 0 0 2 1 1 0 0 0 / 1 1 1 2 2 2 3 3 / 3 3 3 3 3 1 1 1 / 0 0 0 0 2 2 2 2) are transformed into a uniformly distributed traffic pattern]
[Figure: N-port load-balancing stage turning a “wild” input traffic pattern into “subdued” traffic]
(1) input load balancing... (2) destination (output) balancing...
Deterministic Load-Balancing is achieved by running an input switch through a sequence of periodic connection patterns that realize full access...
II. Deterministic Load Balancing
• The Load-Balancing stage runs through a periodic sequence of connection patterns that realize full access... just like the Crossbar Stage, because the traffic it sees is uniform.
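To make this concrete, here is a minimal sketch (illustrative code, not from the slides) of one such deterministic full-access cycle: at slot t, input i of the balancing stage is connected to intermediate port (i + t) mod N, so over any N consecutive slots every input reaches every port exactly once.

```python
def lb_output(i: int, t: int, N: int) -> int:
    """Deterministic load balancing: at time slot t, input i is
    connected to intermediate port (i + t) mod N."""
    return (i + t) % N

# A burst of N packets entering on a single input is spread
# over all N intermediate ports, one per slot:
ports = [lb_output(0, t, 4) for t in range(4)]
print(ports)  # -> [0, 1, 2, 3]
```

Any other fixed full-access cycle (e.g. the SC-BN permutations discussed later) works just as well; the only requirement is that every input visits every intermediate port once per cycle.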
[Figure: Load-Balancing Stage → Buffer (VOQ) stage → Crossbar (TDM) Stage, each with N ports (0, 1, ..., N−1)]
• Moreover, it is possible to prove that this two-stage architecture provides 100% throughput on a very general class of traffic [Chang&Valiant]
The Load-balanced BVN Switch
A buffer maintains N VOQ FIFO queues.
III. Implementation of an optical Load-balanced switch
(1) Given the particularly simple interconnection requirements (TDM permutation schedule) of the load-balancing and switching stages, both stages can be efficiently implemented using a guided-wave-based Stage-Controlled Banyan Network (SC-BN);
(2) Because of the deterministic, cyclic schedule, it is possible to emulate the VOQ FIFO queue stage using delay lines, instead of real RAM memory...
Why is the deterministic LBS suited for optical implementation?
main topic of this presentation!
(1) Emulation of the load-balancing and TDM switches by stage-controlled Banyan network (SC-BN)
• An N x N Banyan network is composed of log2 N stages.
• Each stage is made of N/2 2 x 2 switches.
• In a SC-BN, all switches within a stage are set either in the bar state or the cross state.
• The N possible permutations of a SC-BN provide full access.
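The full-access property can be checked by brute force. The sketch below (an illustrative model, not the hardware description) simulates a stage-controlled Omega network: each stage performs a perfect shuffle (left bit-rotation) followed by a row of 2 x 2 switches that are all bar or all cross (a cross flips the low-order bit).

```python
def sc_banyan(inp: int, controls, n_bits: int) -> int:
    """Route input `inp` through a stage-controlled Omega network.
    controls: one bit per stage (0 = all bar, 1 = all cross)."""
    mask = (1 << n_bits) - 1
    x = inp
    for c in controls:
        x = ((x << 1) | (x >> (n_bits - 1))) & mask  # perfect shuffle
        x ^= c                                       # cross flips the low bit
    return x

N, n_bits = 8, 3
settings = [[(s >> i) & 1 for i in range(n_bits)] for s in range(N)]
perms = [[sc_banyan(i, ctl, n_bits) for i in range(N)] for ctl in settings]
# Each global setting yields a permutation...
assert all(sorted(p) == list(range(N)) for p in perms)
# ...and together the N settings provide full access:
assert all(any(p[i] == o for p in perms) for i in range(N) for o in range(N))
```

In this model each setting realizes an XOR permutation of the port index, so the N settings are pairwise distinct and jointly cover every input-output pair.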
[Figure: 8 x 8 Omega network (stage 0, stage 1, stage 2) with per-stage control bits — example of a SC-BN with EA gates]
(2) Emulation of VOQ buffers using delay-lines
(a) ... A packet arrives at time t at port 1, with destination port N-1:
(b) If LBS were not operating, the packet would be stored in queue N-1 of buffer N-1:
(c) ... but at time t, LBS permutation was “scrambling” data, so packet is stored in queue N-1 of a different buffer:
(d) Last, this packet has to wait a deterministic amount of time, for the correct permutation to be available at the second TDM stage:
…
(...plus a multiple of the whole cycle, if some packet was previously scheduled for the same output)
…
Concretely:
• A packet arriving at port r at time t with destination d has to be delayed by τ = Δ + kN time slots, where Δ = (d − r − t) mod N.
• While Δ is fixed by the packet, the parameter k can be freely tuned by the scheduling algorithm;
• Such “freedom” will be used to avoid collisions with packets previously scheduled for the same output, thus effectively simulating a FIFO queue. The way k is chosen depends on the actual TSI architecture.
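A one-function sketch of this delay computation (illustrative code; the name is not from the paper):

```python
def required_delay(r: int, d: int, t: int, N: int, k: int = 0) -> int:
    """Total delay tau = Delta + k*N for a packet arriving at port r
    at slot t with destination d. Delta = (d - r - t) mod N is fixed
    by the packet; k is the scheduler's free parameter."""
    delta = (d - r - t) % N
    return delta + k * N

# N = 4: a packet at port r = 1, slot t = 0, destination d = 3
# needs delta = (3 - 1 - 0) % 4 = 2; with k = 0 it waits 2 slots.
```

Python's `%` already returns a non-negative result even when d − r − t is negative, which matches the modulo used in the slides.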
The nice thing is that because the total delay can be computed in advance, there is no need of real memory buffers: a Time-Slot Interchanger architecture (relying on delay lines) will effectively simulate the VOQs!
[Figure: Load-Balancing Stage → TSI “buffer” (delay lines replacing the VOQs) → Crossbar (TDM) Stage]
III.1 : Single-stage TSI architecture
• number of delay lines: N·b
• delay increment: 1 time slot
• maximum delay: bN − 1
• total fiber length: N·b(N·b − 1)/2
• equivalent VOQ FIFO size (equal to the maximum delay + 1, divided by N): b
...performance of this architecture is strictly equivalent to that of a VOQ-based buffer when using a deterministic schedule!
[Figure: single-stage TSI — a 1 x Nb optical switch feeding N·b fiber delay lines of lengths 0, 1, ..., N·b − 1 time slots]
So, a packet arriving at time t with destination d at the input of the optical buffer has to be delayed τ = Δ + kN time slots, where Δ = (d − r − t) mod N.
Contention Resolution
Constraint: the packet may collide with another one when exiting the buffer at point A (risk of packet collision).
k has to be chosen so as to avoid contention at the output of the TSI buffer.
How? The maximum delay that a packet can be given is Nb − 1:
Need to keep track of the schedule of the Nb − 1 previous time slots by using an electronic memory of size Nb − 1 (or, more simply, a single counter - but then the strategy does not generalize to multi-stage buffers).
Check for a free schedule, i.e., choose a cycle-delay k indicating a free space. A maximum of b checks are needed. In our simulations, k is chosen as the smallest index that indicates a free space, so as to minimize packet delay, but a more complex selection can be done to account for packet priorities.
Rem: if a packet cannot be scheduled, it will be discarded (so in fact the switch is a 1x(Nb+1) switch, whose last line is the discard line).
The resulting scheduling algorithm is O(b) (and can be made constant using a single counter).
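The O(b) contention check can be sketched as follows (a hypothetical helper, with the schedule memory modeled as a plain list of Nb booleans):

```python
def choose_k(busy: list, delta: int, N: int, b: int):
    """Pick the smallest k such that exit slot delta + k*N is free;
    mark it busy and return k, or None if the packet must be dropped."""
    for k in range(b):           # at most b checks -> O(b)
        slot = delta + k * N
        if not busy[slot]:
            busy[slot] = True
            return k
    return None                  # discard line

busy = [False] * (4 * 3)              # N = 4, b = 3
assert choose_k(busy, 2, 4, 3) == 0   # takes slot 2
assert choose_k(busy, 2, 4, 3) == 1   # slot 2 busy, takes slot 6
assert choose_k(busy, 2, 4, 3) == 2   # takes slot 10
assert choose_k(busy, 2, 4, 3) is None  # all b candidates occupied: drop
```

Since delta ≤ N − 1 and k ≤ b − 1, the largest slot index examined is Nb − 1, matching the maximum delay stated above.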
Example: N=4, b=3
A packet arrives at time t, when permutation P3 is on the TDM switch. However, the packet destination requires P1.
Then, we have τ = 2 + k·N (Δ = 2).
Packet Schedule Memory (total memory cells: Nb−1 = 11)

time:                 t'=t  +1  +2  +3  +4  +5  +6  +7  +8  +9  +10 +11
TDM permutation
schedule:              P3   P0  P1  P2  P3  P0  P1  P2  P3  P0  P1  P2 ...

The candidate exit slots, spaced k·N apart, are t+2, t+6 and t+10; in this example the first free one is t+10, so k = 2. The remaining memory cells are irrelevant for scheduling this packet.
Interesting remark: because contention is resolved by the scheduling algorithm, the following hardware performs equally well:
...the advantage being a large reduction in the number of fiber delay lines employed: in the first case we need bN(bN−1)/2, while in the second implementation only Nb.
This is important when considering scaling the number of inputs-outputs (N) or the amount of buffering (b).
LBS performance using a single-stage TSI (simulation)
[Figure: packet loss probability (10^-6 to 10^-1) vs. load (0.35-1.0), N = 16 input/outputs, 10^8 packets per load point, curves for b = 5, 10, 15, 20, 25, 30]
(Rem: traffic is assumed to be i.i.d. Bernoulli at the exit of the LB stage)
Rem: b = 30 corresponds to a FIFO buffer holding a maximum of 30 packets: this is very little compared with the thousands of some shared-memory buffers on the market...
LBS performance using a single-stage TSI (simulation)
[Figure: average delay (0-200 time slots) vs. load (0.3-1.0), N = 16 input/outputs, 10^8 packets per load point, curves for b = 5, 10, 15, 20, 25]
(Rem: traffic is assumed to be i.i.d. Bernoulli at the exit of the LB stage)
Feasibility problems (single stage)
[Figure: optical power budget of the single-stage implementation (assumption: input signal level 0 dBm) — EDFA (preamplifier) +20 dB; EDFA (booster amplifier) +20 dB, saturated output 20 dBm (13 dBm or below required!); Broadcast & Select module (EA module): 1-to-32 broadcast −30 dB!!!, EA valid range, EA and interfacing loss −15 dB; fiber delay line module: fiber and interfacing loss −2.5 dB; merging module: waveguide and interfacing loss −2.5 dB, −10 dB; EDFA minimum input constraint]
III.2 : Double-Stage TSI buffer
Why? Because of architectural considerations: for a constant total amount of delay, a multistage architecture uses much less fiber delay line => small switches!!
[Figure: first stage: b0 FDLs, increment 1 time slot, maximum delay b0 − 1; second stage: b1 FDLs, increment b0 time slots; total capacity B = b0·b1]
• number of delay lines: b0 + b1 ... vs. b0·b1 in the case of a single stage.
• delay increment (depends on the stage): 1 for the first stage, b0 for the second stage.
• maximum delay: b1·b0 − 1
• total fiber length: [b0(b0 − 1) + b0·b1(b1 − 1)]/2
• equivalent VOQ FIFO size: b0·b1/N
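The fiber-length saving can be checked numerically; this small sketch simply evaluates the total-length formulas for the two architectures:

```python
def fiber_single_stage(N: int, b: int) -> int:
    """Total fiber length (in time slots) of a single-stage TSI:
    one delay line per delay 0, 1, ..., N*b - 1."""
    return N * b * (N * b - 1) // 2

def fiber_two_stage(b0: int, b1: int) -> int:
    """First stage: lengths 0..b0-1; second stage: 0, b0, ..., (b1-1)*b0."""
    return b0 * (b0 - 1) // 2 + b0 * b1 * (b1 - 1) // 2

# Same total delay range N*b = 160 slots (N = 16, b = 10):
print(fiber_single_stage(16, 10))   # 12720
print(fiber_two_stage(16, 10))      # 840
```

For the same maximum delay, the two-stage buffer here needs roughly 15 times less fiber, which is the architectural motivation given above.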
• By making the minimum delay increment of the second stage equal to the maximum delay of the first stage plus one, we ensure a unique decomposition of the required total delay τ, which further simplifies scheduling complexity...
[Figure: two-stage TSI — a cycle delay stage (k): 1 x b1 switch feeding b1 delay lines of lengths 0, b0, ..., (b1−1)·b0, and a sub-cycle delay stage (Δ): 1 x b0 switch feeding b0 delay lines of lengths 0, 1, ..., b0 − 1]
...in the following, we will consider that b0 = N (the number of inputs-outputs) and b1 = b will be variable, corresponding to the equivalent size of a VOQ FIFO buffer:
• number of delay lines: b + N ... vs. b·N for a single stage.
• delay increment (depends on the stage): 1 for the first stage, N for the second stage.
• maximum delay: b·N − 1
• total fiber length: [N(N − 1) + N·b(b − 1)]/2 ... vs. N·b(N·b − 1)/2 for a single stage.
• equivalent VOQ FIFO size: b = b1
[Figure: two-stage TSI with b0 = N — a 1 x b switch with delay lines of lengths 0, N, ..., (b−1)·N, cascaded with a 1 x N switch with delay lines of lengths 0, 1, ..., N − 1]
Contention
Now there are 2 locations where contention can happen:
- at the exit A of the first stage (S1)
- at the exit B of the second (and final) stage (S2)
Exit of stage S1:
The maximum delay that a packet can be given in stage S1 is (b−1)N time slots => need to keep track of the (b−1)N previous time slots. Need an electronic memory MEM_S1 of size (b−1)N that will indicate which time slots at the exit of S1 are “busy” or “free”.
Exit of stage S2:
The maximum delay that a packet can be given by the whole optical buffer is (b−1)N + N − 1 = bN − 1 => need to keep track of the bN − 1 previous time slots. Need an electronic memory MEM_S2 of size bN − 1 that will indicate which time slots at the exit of S2 are “busy” or “free”.
[Figure: S1 → A → S2 → B]
Rem: if a packet cannot be scheduled, it will be discarded in the first stage (so in fact the first-stage switch is a 1x(b+1) switch, whose last line is the discard line). Discarding a packet in a stage other than the first would be necessary if one uses another scheduling strategy - for instance, a non-unique delay decomposition.
Remark: again, the contention-avoidance schedule enables the following fiber-length-reducing architecture to work equally well (in the example, b1 = b and b0 = N):
[Figure: fiber-length-reducing variant — a 1 x b switch with cascaded delay elements of N time slots each and a 1 x N switch with cascaded delay elements of 1 time slot each, with exits A and B; equivalent to the 1 x b / 1 x N architecture with delay lines 0, N, ..., N·(b−1) and 0, 1, ..., N−1]
Temporal diagram of the permutation schedule and of the first and second “crosspoint” schedules (MEM_S1, MEM_S2).
The permutation schedule represents the available permutation at the exit of the TSI buffers at time t' = t + k (there are N possible permutations). The permutation schedule is not computed as a function of the traffic - as in a BVN switch - it is deterministic (TDM); therefore we do not need to store any scheduling memory array.
Example: N=4, b=3 (b1 = b, and b0 = N)
MEM_S1 size: (b1−1)·b0 = (b−1)N = 8
MEM_S2 size: b1·b0 − 1 = bN − 1 = 11
(rem: later schedule positions do not need to be stored in memory, since they are always free at the start of a scheduling cycle)

time:                  t'=t  +1  +2  +3  +4  +5  +6  +7  +8  +9  +10 +11
Permutation schedule:   P3   P0  P1  P2  P3  P0  P1  P2  P3  P0  P1  P2 ...

... a packet arrives at time t, such that the requested permutation is P1. We then have Δ = 2.
The (MEM_S1, MEM_S2) pairs for k = 0, 1, 2 are examined; with k = 2, the packet is scheduled to go through S1 at time t' = t + 2N = t + 8, and exits the network through S2 at time t' = t + 2N + 2 = t + 10. Both cells of the considered pair are marked “busy”, and then the arrays are shifted to the left by one.
b1 pairs to check => an O(b1) schedule!!
In the previous example, b1 = 3 pairs had to be taken into consideration...
• In general, a maximum of b1 memory locations have to be checked.
So, the overall complexity of the scheduling algorithm is O(b1).
(a strategy using counters is not easy to implement, and may lead to sub-optimal schedules)
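Mirroring the worked example above, the O(b1) pair check can be sketched as follows (MEM_S1 and MEM_S2 are modeled as boolean lists, both rounded up to size b·N for simplicity - an assumption of this sketch):

```python
def schedule_two_stage(mem_s1, mem_s2, delta, N, b):
    """Choose the smallest k such that the S1 exit slot k*N and the
    S2 exit slot k*N + delta are both free; mark both busy."""
    for k in range(b):                    # at most b = b1 pairs checked
        s1, s2 = k * N, k * N + delta
        if not mem_s1[s1] and not mem_s2[s2]:
            mem_s1[s1] = mem_s2[s2] = True
            return k
    return None                           # discard on the first stage

N, b, delta = 4, 3, 2
m1, m2 = [False] * (b * N), [False] * (b * N)
m2[2] = m2[6] = True                      # exit slots t+2 and t+6 taken
assert schedule_two_stage(m1, m2, delta, N, b) == 2   # as in the example
```

Thanks to the unique decomposition τ = Δ + kN, each candidate k corresponds to exactly one (MEM_S1, MEM_S2) cell pair, so no backtracking is ever needed.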
[Figure: example TSI with b1 = 3 lines in the first stage (1 x b1 switch) and b0 = 4 lines in the second stage (1 x b0 switch), exits E1 and E2]
One vs. two buffer stages (for the same total fiber length)
[Figure: packet loss probability (10^-6 to 10^-1) vs. load (0.6-1.0), N = 16, Nb packets = 10^7, curves for b = 10, 20, 30 with a single buffer stage and b = 10, 20, 30 with two buffer stages]
This indicates that the collision avoidance at the intermediate stage slightly degrades performance => there is a trade-off between architectural considerations and performance.
Conclusion
The proposed two-stage load-balanced photonic switch:
• Because it is an LBS, it can achieve high throughput under bursty traffic.
• Because deterministic balancing is used:
– guide-wave-integrable stage-controlled Banyan networks can be used both for the switching stage and the balancing stage;
– there is no need to employ optical memories for buffering, only fiber delay lines functioning as a TSI.
• Has a scheduling complexity in O(b), where b is the equivalent size of an electronic FIFO buffer.
• Can (potentially) handle traffic priorities by making k priority-dependent.
• Performance only slightly degrades when compared to a single-stage TSI (*), while:
– making possible a very large reduction of the number of delay lines,
– thus using “buffer space” more efficiently.
• It would be possible to modify the architecture so as to handle asynchronous traffic and different-length packets using only TSIs, as in [Harai].

(*) performance of a single-stage-based photonic switch using Nb−1 FDLs is strictly equivalent to that of an LBS using RAM buffers composed of N FIFO queues, each of size b.
Example configurations: 16-64, 16-8-8, 8-8-8-8.
One that provides a unique decomposition of the scheduled delay, however, is such that b_i = l_{i−1}·b_{i−1} = l_0·l_1·l_2·…·l_{i−1}. For the first stage S_0, b_0 corresponds to a delay of one time slot. Hence, the maximum delay that can be given to a packet by the whole TSI is equal to B = l_0·l_1·l_2·…·l_{n−1} (this is also the maximum number of packets that the TSI can hold). For a switch with N ports, it is comparable to N VOQ queues of length B_e = B/N.
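This unique decomposition is simply a mixed-radix expansion of the total delay; a small sketch (illustrative code, with the stage sizes l_0, ..., l_{n−1} passed as a list):

```python
def decompose(tau: int, ls: list) -> list:
    """Mixed-radix digits d_i of tau: tau = sum_i d_i * b_i,
    where b_i = l_0 * ... * l_{i-1} (b_0 = 1) and 0 <= d_i < l_i."""
    digits = []
    for l in ls:
        tau, d = divmod(tau, l)   # peel off the digit for stage i
        digits.append(d)
    return digits

# An 8-8-8-8 buffer spans B = 8**4 = 4096 delay values; e.g. delay 1000:
digits = decompose(1000, [8, 8, 8, 8])
assert digits == [0, 5, 7, 1]     # 0*1 + 5*8 + 7*64 + 1*512 = 1000
```

Because each digit d_i directly selects one delay line in stage i, scheduling a total delay never requires a search over alternative decompositions.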
...Further Research: generic multi-stage delay-line buffers
There are thousands of ways of implementing a generic multistage buffer.
Packet loss probability
[Figure: packet loss probability (10^-5 to 10^0) vs. load (0.65-1.0), N = 64, curves for the configurations 4096, 64-64, 32-32-4, 16-16-16, 8-8-8-8, 4-4-4-4-4-4]
Average packet delay
[Figure: average delay (0-4000 time slots) vs. load (0.5-1.0), N = 64, curves for the configurations 4096, 64-64, 32-32-4, 16-16-16, 8-8-8-8, 4-4-4-4-4-4]