A. Cassinelli, A. Goulet, M. Ishikawa
University of Tokyo, Department of Information Physics and Computing
M. Naruse, F. Kubota
National Institute of Information and Communications Technology
Load-balanced optical packet switch using two-stage time-slot interchangers
IEICE 2004
Plan of the presentation
I. Introduction:
- The Ideal Packet Switch and Our Goals
- Some Assumptions
- The BVN Switch and its Scheduling Complexity Bottleneck
II. The Load-Balanced BVN Switch
- the LB stage removes traffic non-uniformities, simplifying the scheduling of a BVN switch
- makes the switch performance independent from traffic
III. Optical implementation of the Load-Balancing Switch
The load-balancing architecture allows for a simple, deterministic buffer schedule, ideal for optical implementation using fiber delay-line based TSIs...
III.1 Single-Stage TSI and resulting LBS performance
III.2 Double-Stage TSI and resulting LBS performance
IV. Conclusion and Further Research
V. References
I. Introduction
• develop an “ideal” optical packet switch for TDM, possibly for asynchronous optical networks (WDM remains an additional dimension).
• do that without using non-mature RAM optical memories - only delay lines.
• Provide high throughput for any kind of traffic
• Be stable – queues in buffer should remain bounded
• Have low delays
• Manage priority traffic– provide throughput guarantees for some ports
– provide reduced delays for such traffic
The ideal packet switch should:
Our goal here:
BVN scheduler
Some preliminary assumptions
• Time is “slotted”: packets have the same size and are “aligned”
• At most one packet arrives per time slot at each input line (no WDM)
• The output lines are not overloaded (traffic is “admissible”)
Given these assumptions, a good switch candidate is the so-called “Birkhoff-von Neumann switch”, first proposed by Chang [1999], based on the works of Birkhoff [1946] and von Neumann [1953].
Essentially, it is a Crossbar Switch that:
• has Virtual Output Queues (VOQ) to alleviate HOL blocking,
• relies on an efficient but rather time-consuming O(N^4.5) scheduling algorithm to find the appropriate sequence of crossbar states that services the VOQs, avoids their saturation and reduces packet delay.
The BVN switch
... but today there is an additional constraint: given the speed of today's networks, schedulers are running short of time for computation!
[Figure: clock cycles allowed to schedule a single packet, falling steeply from about 700 in 1996 to near zero in 2001 (from McKeown, Stanford University)]
So, the “ideal switch” must also rely on a scheduling algorithm with very low computational complexity.
(40 Gb/s => 11 ns per ATM packet, or 10 cycles in a 1 GHz computer...)
• It is relatively easy to prove that if the traffic is uniform, the BVN decomposition consists of a set of N permutations providing full access. These can be cycled blindly in order to serve the VOQs.
• The only condition on this set of N permutations is that it provides full access (i.e., for any input-output pair, at least one permutation in the set connects that input to that output).
... this would mean an O(1) scheduling complexity
Ex: one cycle for N=4
full-access
There is hope...
So...
Is there a way to pre-process an irregular traffic load such that the inputs of the switch “see” a uniform load?
Answer: Yes! It is called “Load Balancing”.
There are several ways to do that... The simplest (deterministic) one consists of adding an additional input switch stage, which runs through a periodic sequence of connection patterns that realize full access...
(1) Input load is equally distributed at the outputs (2) Bursty traffic is also distributed
[Figure: load-balancing example — the per-port destination sequences of a bursty input traffic pattern (e.g. 0 0 2 1 1 0 0 0 / 1 1 1 2 2 2 3 3 / 3 3 3 3 3 1 1 1 / 0 0 0 0 2 2 2 2) are transformed into a uniformly distributed traffic pattern]
[Figure: N-port load-balancing stage turning a “wild” input traffic pattern into “subdued” traffic]
(1) input load balancing... (2) destination (output) balancing...
Deterministic Load-Balancing is achieved by running an input switch through a sequence of periodic connection patterns that realize full access...
II. Deterministic Load Balancing
• The Load-Balancing stage runs through a periodic sequence of connection patterns that realize full access... just like the Crossbar Stage, because the traffic it sees is uniform.
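To make this concrete, here is a minimal sketch (illustrative code, not from the slides) of one such deterministic full-access cycle: at slot t, input i of the balancing stage is connected to intermediate port (i + t) mod N, so over any N consecutive slots every input reaches every port exactly once.

```python
def lb_output(i: int, t: int, N: int) -> int:
    """Deterministic load balancing: at time slot t, input i is
    connected to intermediate port (i + t) mod N."""
    return (i + t) % N

# A burst of N packets entering on a single input is spread
# over all N intermediate ports, one per slot:
ports = [lb_output(0, t, 4) for t in range(4)]
print(ports)  # -> [0, 1, 2, 3]
```

Any other fixed full-access cycle (e.g. the SC-BN permutations discussed later) works just as well; the only requirement is that every input visits every intermediate port once per cycle.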
[Figure: Load-Balancing Stage → Buffer (VOQ) stage → Crossbar (TDM) Stage, each with N ports (0, 1, ..., N−1)]
• Moreover, it is possible to prove that this two-stage architecture provides 100% throughput on a very general class of traffic [Chang&Valiant]
The Load-balanced BVN Switch
A buffer maintains N VOQ FIFO queues.
III. Implementation of an optical Load-balanced switch
(1) Given the particularly simple interconnection requirements (TDM permutation schedule) of the load-balancing and switching stages, both stages can be efficiently implemented using a guided-wave-based Stage-Controlled Banyan Network (SC-BN);
(2) Because of the deterministic, cyclic schedule, it is possible to emulate the VOQ FIFO queue stage using delay lines, instead of real RAM memory...
Why is the deterministic LBS suited for optical implementation?
main topic of this presentation!
(1) Emulation of the load-balancing and TDM switches by stage-controlled Banyan network (SC-BN)
• An N x N Banyan network is composed of log2 N stages.
• Each stage is made of N/2 2 x 2 switches.
• In a SC-BN, all switches within a stage are set either in the bar state or the cross state.
• The N possible permutations of a SC-BN provide full access.
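The full-access property can be checked by brute force. The sketch below (an illustrative model, not the hardware description) simulates a stage-controlled Omega network: each stage performs a perfect shuffle (left bit-rotation) followed by a row of 2 x 2 switches that are all bar or all cross (a cross flips the low-order bit).

```python
def sc_banyan(inp: int, controls, n_bits: int) -> int:
    """Route input `inp` through a stage-controlled Omega network.
    controls: one bit per stage (0 = all bar, 1 = all cross)."""
    mask = (1 << n_bits) - 1
    x = inp
    for c in controls:
        x = ((x << 1) | (x >> (n_bits - 1))) & mask  # perfect shuffle
        x ^= c                                       # cross flips the low bit
    return x

N, n_bits = 8, 3
settings = [[(s >> i) & 1 for i in range(n_bits)] for s in range(N)]
perms = [[sc_banyan(i, ctl, n_bits) for i in range(N)] for ctl in settings]
# Each global setting yields a permutation...
assert all(sorted(p) == list(range(N)) for p in perms)
# ...and together the N settings provide full access:
assert all(any(p[i] == o for p in perms) for i in range(N) for o in range(N))
```

In this model each setting realizes an XOR permutation of the port index, so the N settings are pairwise distinct and jointly cover every input-output pair.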
[Figure: 8 x 8 Omega network (stage 0, stage 1, stage 2) with per-stage control bits — example of a SC-BN with EA gates]
(2) Emulation of VOQ buffers using delay-lines
(a) ... A packet arrives at time t at port 1, with destination port N-1:
(b) If LBS were not operating, the packet would be stored in queue N-1 of buffer N-1:
(c) ... but at time t, LBS permutation was “scrambling” data, so packet is stored in queue N-1 of a different buffer:
(d) Last, this packet has to wait a deterministic amount of time, for the correct permutation to be available at the second TDM stage:
…
(...plus a multiple of the whole cycle, if some packet was previously scheduled for the same output)
…
Concretely:
• A packet arriving at port r at time t with destination d has to be delayed by τ = Δ + kN time slots, where Δ = (d − r − t) mod N.
• While Δ is fixed by the packet, the parameter k can be freely tuned by the scheduling algorithm;
• Such “freedom” will be used to avoid collisions with packets previously scheduled for the same output, thus effectively simulating a FIFO queue. The way k is chosen depends on the actual TSI architecture.
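A one-function sketch of this delay computation (illustrative code; the name is not from the paper):

```python
def required_delay(r: int, d: int, t: int, N: int, k: int = 0) -> int:
    """Total delay tau = Delta + k*N for a packet arriving at port r
    at slot t with destination d. Delta = (d - r - t) mod N is fixed
    by the packet; k is the scheduler's free parameter."""
    delta = (d - r - t) % N
    return delta + k * N

# N = 4: a packet at port r = 1, slot t = 0, destination d = 3
# needs delta = (3 - 1 - 0) % 4 = 2; with k = 0 it waits 2 slots.
```

Python's `%` already returns a non-negative result even when d − r − t is negative, which matches the modulo used in the slides.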
The nice thing is that because the total delay can be computed in advance, there is no need of real memory buffers: a Time-Slot Interchanger architecture (relying on delay lines) will effectively simulate the VOQs!
[Figure: Load-Balancing Stage → TSI “buffer” (delay lines replacing the VOQs) → Crossbar (TDM) Stage]
III.1 : Single-stage TSI architecture
• number of delay lines: N·b
• delay increment: 1 time slot
• maximum delay: bN − 1
• total fiber length: N·b(N·b − 1)/2
• equivalent VOQ FIFO size (equal to the maximum delay + 1, divided by N): b
...performance of this architecture is strictly equivalent to that of a VOQ-based buffer when using a deterministic schedule!
[Figure: single-stage TSI — a 1 x Nb optical switch feeding N·b fiber delay lines of lengths 0, 1, ..., N·b − 1 time slots]
So, a packet arriving at time t with destination d at the input of the optical buffer has to be delayed τ = Δ + kN time slots, where Δ = (d − r − t) mod N.
Contention Resolution
Constraint: the packet may collide with another one when exiting the buffer at point A (risk of packet collision).
k has to be chosen so as to avoid contention at the output of the TSI buffer.
How? The maximum delay that a packet can be given is Nb − 1:
Need to keep track of the schedule of the Nb − 1 previous time slots by using an electronic memory of size Nb − 1 (or, more simply, a single counter - but then the strategy does not generalize to multi-stage buffers).
Check for a free schedule, i.e., choose a cycle-delay k indicating a free space. A maximum of b checks are needed. In our simulations, k is chosen as the smallest index that indicates a free space, so as to minimize packet delay, but a more complex selection can be done to account for packet priorities.
Rem: if a packet cannot be scheduled, it will be discarded (so in fact the switch is a 1x(Nb+1) switch, whose last line is the discard line).
The resulting scheduling algorithm is O(b) (and can be made constant using a single counter).
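The O(b) contention check can be sketched as follows (a hypothetical helper, with the schedule memory modeled as a plain list of Nb booleans):

```python
def choose_k(busy: list, delta: int, N: int, b: int):
    """Pick the smallest k such that exit slot delta + k*N is free;
    mark it busy and return k, or None if the packet must be dropped."""
    for k in range(b):           # at most b checks -> O(b)
        slot = delta + k * N
        if not busy[slot]:
            busy[slot] = True
            return k
    return None                  # discard line

busy = [False] * (4 * 3)              # N = 4, b = 3
assert choose_k(busy, 2, 4, 3) == 0   # takes slot 2
assert choose_k(busy, 2, 4, 3) == 1   # slot 2 busy, takes slot 6
assert choose_k(busy, 2, 4, 3) == 2   # takes slot 10
assert choose_k(busy, 2, 4, 3) is None  # all b candidates occupied: drop
```

Since delta ≤ N − 1 and k ≤ b − 1, the largest slot index examined is Nb − 1, matching the maximum delay stated above.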
Example: N=4, b=3
A packet arrives at time t, when permutation P3 is on the TDM switch. However, the packet destination requires P1.
Then, we have τ = 2 + k·N (Δ = 2).
Packet Schedule Memory (total memory cells: Nb−1 = 11)

time:                 t'=t  +1  +2  +3  +4  +5  +6  +7  +8  +9  +10 +11
TDM permutation
schedule:              P3   P0  P1  P2  P3  P0  P1  P2  P3  P0  P1  P2 ...

The candidate exit slots, spaced k·N apart, are t+2, t+6 and t+10; in this example the first free one is t+10, so k = 2. The remaining memory cells are irrelevant for scheduling this packet.
Interesting remark: because contention is resolved by the scheduling algorithm, the following hardware performs equally well:
...the advantage being a large reduction in the number of fiber delay lines employed: in the first case we need bN(bN−1)/2, while in the second implementation only Nb.
This is important when considering scaling the number of inputs-outputs (N) or the amount of buffering (b).
LBS performance using a single-stage TSI (simulation)
[Figure: packet loss probability (10^-6 to 10^-1) vs. load (0.35-1.0), N = 16 input/outputs, 10^8 packets per load point, curves for b = 5, 10, 15, 20, 25, 30]
(Rem: traffic is assumed to be i.i.d. Bernoulli at the exit of the LB stage)
Rem: b = 30 corresponds to a FIFO buffer holding a maximum of 30 packets: this is very little compared with the thousands of some shared-memory buffers on the market...
LBS performance using a single-stage TSI (simulation)
[Figure: average delay (0-200 time slots) vs. load (0.3-1.0), N = 16 input/outputs, 10^8 packets per load point, curves for b = 5, 10, 15, 20, 25]
(Rem: traffic is assumed to be i.i.d. Bernoulli at the exit of the LB stage)
Feasibility problems (single stage)
[Figure: optical power budget of the single-stage implementation (assumption: input signal level 0 dBm) — EDFA (preamplifier) +20 dB; EDFA (booster amplifier) +20 dB, saturated output 20 dBm (13 dBm or below required!); Broadcast & Select module (EA module): 1-to-32 broadcast −30 dB!!!, EA valid range, EA and interfacing loss −15 dB; fiber delay line module: fiber and interfacing loss −2.5 dB; merging module: waveguide and interfacing loss −2.5 dB, −10 dB; EDFA minimum input constraint]
III.2 : Double-Stage TSI buffer
Why? Because of architectural considerations: for a constant total amount of delay, a multistage architecture uses much less fiber delay line => small switches!!
[Figure: first stage: b0 FDLs, increment 1 time slot, maximum delay b0 − 1; second stage: b1 FDLs, increment b0 time slots; total capacity B = b0·b1]
• number of delay lines: b0 + b1 ... vs. b0·b1 in the case of a single stage.
• delay increment (depends on the stage): 1 for the first stage, b0 for the second stage.
• maximum delay: b1·b0 − 1
• total fiber length: [b0(b0 − 1) + b0·b1(b1 − 1)]/2
• equivalent VOQ FIFO size: b0·b1/N
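The fiber-length saving can be checked numerically; this small sketch simply evaluates the total-length formulas for the two architectures:

```python
def fiber_single_stage(N: int, b: int) -> int:
    """Total fiber length (in time slots) of a single-stage TSI:
    one delay line per delay 0, 1, ..., N*b - 1."""
    return N * b * (N * b - 1) // 2

def fiber_two_stage(b0: int, b1: int) -> int:
    """First stage: lengths 0..b0-1; second stage: 0, b0, ..., (b1-1)*b0."""
    return b0 * (b0 - 1) // 2 + b0 * b1 * (b1 - 1) // 2

# Same total delay range N*b = 160 slots (N = 16, b = 10):
print(fiber_single_stage(16, 10))   # 12720
print(fiber_two_stage(16, 10))      # 840
```

For the same maximum delay, the two-stage buffer here needs roughly 15 times less fiber, which is the architectural motivation given above.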
• By making the minimum delay increment of the second stage equal to the maximum delay of the first stage plus one, we ensure a unique decomposition of the required total delay τ, which further simplifies scheduling complexity...
[Figure: two-stage TSI — a cycle delay stage (k): 1 x b1 switch feeding b1 delay lines of lengths 0, b0, ..., (b1−1)·b0, and a sub-cycle delay stage (Δ): 1 x b0 switch feeding b0 delay lines of lengths 0, 1, ..., b0 − 1]
...in the following, we will consider that b0 = N (the number of inputs-outputs) and b1 = b will be variable, corresponding to the equivalent size of a VOQ FIFO buffer:
• number of delay lines: b + N ... vs. b·N for a single stage.
• delay increment (depends on the stage): 1 for the first stage, N for the second stage.
• maximum delay: b·N − 1
• total fiber length: [N(N − 1) + N·b(b − 1)]/2 ... vs. N·b(N·b − 1)/2 for a single stage.
• equivalent VOQ FIFO size: b = b1
[Figure: two-stage TSI with b0 = N — a 1 x b switch with delay lines of lengths 0, N, ..., (b−1)·N, cascaded with a 1 x N switch with delay lines of lengths 0, 1, ..., N − 1]
Contention
Now there are 2 locations where contention can happen:
- at the exit A of the first stage (S1)
- at the exit B of the second (and final) stage (S2)
Exit of stage S1:
The maximum delay that a packet can be given in stage S1 is (b−1)N time slots => need to keep track of the (b−1)N previous time slots. Need an electronic memory MEM_S1 of size (b−1)N that will indicate which time slots at the exit of S1 are “busy” or “free”.
Exit of stage S2:
The maximum delay that a packet can be given by the whole optical buffer is (b−1)N + N − 1 = bN − 1 => need to keep track of the bN − 1 previous time slots. Need an electronic memory MEM_S2 of size bN − 1 that will indicate which time slots at the exit of S2 are “busy” or “free”.
[Figure: S1 → A → S2 → B]
Rem: if a packet cannot be scheduled, it will be discarded in the first stage (so in fact the first-stage switch is a 1x(b+1) switch, whose last line is the discard line). Discarding a packet in a stage other than the first would be necessary if one uses another scheduling strategy - for instance, a non-unique delay decomposition.
Remark: again, the contention-avoidance schedule enables the following fiber-length-reducing architecture to work equally well (in the example, b1 = b and b0 = N):
[Figure: fiber-length-reducing variant — a 1 x b switch with cascaded delay elements of N time slots each and a 1 x N switch with cascaded delay elements of 1 time slot each, with exits A and B; equivalent to the 1 x b / 1 x N architecture with delay lines 0, N, ..., N·(b−1) and 0, 1, ..., N−1]
Temporal diagram of the permutation schedule and of the first and second “crosspoint” schedules (MEM_S1, MEM_S2).
The permutation schedule represents the available permutation at the exit of the TSI buffers at time t' = t + k (there are N possible permutations). The permutation schedule is not computed as a function of the traffic - as in a BVN switch - it is deterministic (TDM); therefore we do not need to store any scheduling memory array.
Example: N=4, b=3 (b1 = b, and b0 = N)
MEM_S1 size: (b1−1)·b0 = (b−1)N = 8
MEM_S2 size: b1·b0 − 1 = bN − 1 = 11
(rem: later schedule positions do not need to be stored in memory, since they are always free at the start of a scheduling cycle)

time:                  t'=t  +1  +2  +3  +4  +5  +6  +7  +8  +9  +10 +11
Permutation schedule:   P3   P0  P1  P2  P3  P0  P1  P2  P3  P0  P1  P2 ...

... a packet arrives at time t, such that the requested permutation is P1. We then have Δ = 2.
The (MEM_S1, MEM_S2) pairs for k = 0, 1, 2 are examined; with k = 2, the packet is scheduled to go through S1 at time t' = t + 2N = t + 8, and exits the network through S2 at time t' = t + 2N + 2 = t + 10. Both cells of the considered pair are marked “busy”, and then the arrays are shifted to the left by one.
b1 pairs to check => an O(b1) schedule!!
In the previous example, b1 = 3 pairs had to be taken into consideration...
• In general, a maximum of b1 memory locations have to be checked.
So, the overall complexity of the scheduling algorithm is O(b1).
(a strategy using counters is not easy to implement, and may lead to sub-optimal schedules)
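Mirroring the worked example above, the O(b1) pair check can be sketched as follows (MEM_S1 and MEM_S2 are modeled as boolean lists, both rounded up to size b·N for simplicity - an assumption of this sketch):

```python
def schedule_two_stage(mem_s1, mem_s2, delta, N, b):
    """Choose the smallest k such that the S1 exit slot k*N and the
    S2 exit slot k*N + delta are both free; mark both busy."""
    for k in range(b):                    # at most b = b1 pairs checked
        s1, s2 = k * N, k * N + delta
        if not mem_s1[s1] and not mem_s2[s2]:
            mem_s1[s1] = mem_s2[s2] = True
            return k
    return None                           # discard on the first stage

N, b, delta = 4, 3, 2
m1, m2 = [False] * (b * N), [False] * (b * N)
m2[2] = m2[6] = True                      # exit slots t+2 and t+6 taken
assert schedule_two_stage(m1, m2, delta, N, b) == 2   # as in the example
```

Thanks to the unique decomposition τ = Δ + kN, each candidate k corresponds to exactly one (MEM_S1, MEM_S2) cell pair, so no backtracking is ever needed.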
[Figure: example TSI with b1 = 3 lines in the first stage (1 x b1 switch) and b0 = 4 lines in the second stage (1 x b0 switch), exits E1 and E2]
One vs. two buffer stages (for the same total fiber length)
[Figure: packet loss probability (10^-6 to 10^-1) vs. load (0.6-1.0), N = 16, Nb packets = 10^7, curves for b = 10, 20, 30 with a single buffer stage and b = 10, 20, 30 with two buffer stages]
This indicates that the collision avoidance at the intermediate stage slightly degrades performance => there is a trade-off between architectural considerations and performance.
Conclusion
The proposed two-stage load-balanced photonic switch:
• Because it is an LBS, it can achieve high throughput under bursty traffic.
• Because deterministic balancing is used:
– guide-wave-integrable stage-controlled Banyan networks can be used both for the switching stage and the balancing stage;
– there is no need to employ optical memories for buffering, only fiber delay lines functioning as a TSI.
• Has a scheduling complexity in O(b), where b is the equivalent size of an electronic FIFO buffer.
• Can (potentially) handle traffic priorities by making k priority-dependent.
• Performance only slightly degrades when compared to a single-stage TSI (*), while:
– making possible a very large reduction of the number of delay lines,
– thus using “buffer space” more efficiently.
• It would be possible to modify the architecture so as to handle asynchronous traffic and different-length packets using only TSIs, as in [Harai].

(*) performance of a single-stage-based photonic switch using Nb−1 FDLs is strictly equivalent to that of an LBS using RAM buffers composed of N FIFO queues, each of size b.
Example configurations: 16-64, 16-8-8, 8-8-8-8.
One that provides a unique decomposition of the scheduled delay, however, is such that b_i = l_{i−1}·b_{i−1} = l_0·l_1·l_2·…·l_{i−1}. For the first stage S_0, b_0 corresponds to a delay of one time slot. Hence, the maximum delay that can be given to a packet by the whole TSI is equal to B = l_0·l_1·l_2·…·l_{n−1} (this is also the maximum number of packets that the TSI can hold). For a switch with N ports, it is comparable to N VOQ queues of length B_e = B/N.
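This unique decomposition is simply a mixed-radix expansion of the total delay; a small sketch (illustrative code, with the stage sizes l_0, ..., l_{n−1} passed as a list):

```python
def decompose(tau: int, ls: list) -> list:
    """Mixed-radix digits d_i of tau: tau = sum_i d_i * b_i,
    where b_i = l_0 * ... * l_{i-1} (b_0 = 1) and 0 <= d_i < l_i."""
    digits = []
    for l in ls:
        tau, d = divmod(tau, l)   # peel off the digit for stage i
        digits.append(d)
    return digits

# An 8-8-8-8 buffer spans B = 8**4 = 4096 delay values; e.g. delay 1000:
digits = decompose(1000, [8, 8, 8, 8])
assert digits == [0, 5, 7, 1]     # 0*1 + 5*8 + 7*64 + 1*512 = 1000
```

Because each digit d_i directly selects one delay line in stage i, scheduling a total delay never requires a search over alternative decompositions.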
...Further Research: generic multi-stage delay-line buffers
There are thousands of ways of implementing a generic multistage buffer.
Packet loss probability
[Figure: packet loss probability (10^-5 to 10^0) vs. load (0.65-1.0), N = 64, curves for the configurations 4096, 64-64, 32-32-4, 16-16-16, 8-8-8-8, 4-4-4-4-4-4]
Average packet delay
[Figure: average delay (0-4000 time slots) vs. load (0.5-1.0), N = 64, curves for the configurations 4096, 64-64, 32-32-4, 16-16-16, 8-8-8-8, 4-4-4-4-4-4]