048866: Packet Switch Architectures
Dr. Isaac Keslassy, Electrical Engineering, Technion
http://comnet.technion.ac.il/~isaac/
Scaling
Spring 2006 048866 – Packet Switch Architectures 2
Achieving 100% throughput
1. Switch model
2. Uniform traffic. Technique: Uniform schedule (easy)
3. Non-uniform traffic, but known traffic matrix. Technique: Non-uniform schedule (Birkhoff-von Neumann)
4. Unknown traffic matrix. Technique: Lyapunov functions (MWM)
5. Faster scheduling algorithms. Techniques: Speedup (maximal matchings); Memory and randomization (Tassiulas); Twist architecture (buffered crossbar)
6. Accelerated scheduling algorithms. Techniques: Pipelining; Envelopes; Slicing
7. No scheduling algorithm. Technique: Load-balanced router
Outline
Up until now, we have focused on high performance packet switches with:
1. A crossbar switching fabric,
2. Input queues (and possibly output queues as well),
3. Virtual output queues, and
4. A centralized arbitration/scheduling algorithm.
Today we’ll talk about the implementation of the crossbar switch fabric itself. How is it built, how does it scale, and what limits its capacity?
Crossbar switch: Limiting factors

1. N^2 crosspoints per chip, or N multiplexors of size N-to-1.
2. It’s not obvious how to build a crossbar from multiple chips.
3. “I/O” capacity per chip. State of the art: about 300 pins, each operating at 3.125 Gb/s, i.e. ~1 Tb/s per chip. Only about 1/3 to 1/2 of this capacity is available in practice because of overhead and speedup. Crossbar chips today are limited by “I/O” capacity.
Scaling
1. Scaling Line Rate: Bit-slicing; Time-slicing
2. Scaling Time (Scheduling Speed): Time-slicing; Envelopes; Frames
3. Scaling Number of Ports: Naïve approach; Clos networks; Benes networks
Bit-sliced parallelism
Cell is “striped” across k identical planes.
The scheduler makes the same decision for all slices.
However, this doesn’t decrease the scheduling speed.
Other problem(s)?

[Figure: each input linecard splits every cell into k slices, one per plane 1…k, under a single scheduler]
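The striping idea can be sketched as follows (hypothetical helper functions, illustration only; real slicing happens in linecard hardware):

```python
def stripe(cell: bytes, k: int) -> list[bytes]:
    """Stripe one cell across k identical switching planes.

    Each plane carries 1/k of the cell, so each plane's crossbar and
    serial links only need to run at 1/k of the line rate."""
    if len(cell) % k:
        cell += b"\x00" * (k - len(cell) % k)  # pad to a multiple of k
    w = len(cell) // k
    return [cell[i * w:(i + 1) * w] for i in range(k)]

def reassemble(slices: list[bytes]) -> bytes:
    """At the output linecard, concatenate the k slices back into a cell."""
    return b"".join(slices)
```

This is why bit-slicing scales the line rate but not the scheduling speed: the scheduler still makes one decision per cell time, applied to all k planes.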
Time-sliced parallelism
A cell is carried whole by one plane and takes k cell times.
The centralized scheduler is unchanged: it works for each slice in turn.
Problem: same scheduling speed.

[Figure: each input linecard sends successive cells to planes 1…k, under a single scheduler]
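A minimal sketch of the bookkeeping, assuming a round-robin dispatch rule (the rule is our assumption; the slides only say each cell is carried by one plane):

```python
def dispatch(num_cells: int, k: int) -> list[int]:
    """Time-sliced parallelism: cell j goes whole to plane (j mod k) and
    occupies it for k cell times. By the time plane (j mod k) is needed
    again, it has just freed up, so no cell ever waits for a plane.
    Returns the time at which each plane frees up."""
    free_at = [0] * k                 # cell time when each plane frees up
    for j in range(num_cells):
        p = j % k
        assert free_at[p] <= j        # the plane is free when needed
        free_at[p] = j + k            # busy for the next k cell times
    return free_at
```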
Scaling
1. Scaling Line Rate: Bit-slicing; Time-slicing
2. Scaling Time (Scheduling Speed): Time-slicing; Envelopes; Frames
3. Scaling Number of Ports: Naïve approach; Clos networks; Benes networks
Time-sliced parallelism with parallel scheduling

Now scheduling is distributed to each slice.
Each scheduler has k cell times to make its decision.
Problem(s)?

[Figure: planes 1…k, each with its own slow scheduler]
Envelopes
Envelopes of k cells [Kar et al., 2000]
Problem: “Should I stay or should I go now?”
- Waiting → starvation (“Waiting for Godot”)
- Timeouts → loss of throughput

[Figure: at each VOQ, the linecard aggregates k cells into an envelope for the slow scheduler]
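The stay-or-go tension can be sketched as a VOQ that closes an envelope either when it is full or when a timeout expires (class and rule are illustrative, not from Kar et al.):

```python
class EnvelopeVOQ:
    """Sketch of envelope aggregation at one VOQ.

    The VOQ hands the scheduler an envelope of k cells, so the scheduler
    decides only once per k cell times. A timeout bounds the waiting of a
    partially filled envelope (avoiding starvation), at the cost of
    shipping partially empty envelopes (losing throughput)."""

    def __init__(self, k: int, timeout: int):
        self.k, self.timeout = k, timeout
        self.cells: list = []
        self.age = 0   # cell times since cells started waiting

    def enqueue(self, cell) -> None:
        self.cells.append(cell)

    def tick(self):
        """Called once per cell time; returns an envelope when ready."""
        if not self.cells:
            self.age = 0
            return None
        self.age += 1
        if len(self.cells) >= self.k or self.age >= self.timeout:
            env = self.cells[:self.k]
            del self.cells[:self.k]
            self.age = 0
            return env
        return None
```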
Frames for scheduling
The slow scheduler simply takes its decision every k cell times and holds it for k cell times.
Often associated with pipelining.
Note: a pipelined MWM is still stable (intuitively: the weight doesn’t change much).
Possible problem(s)?

[Figure: at each VOQ, cells are grouped into frames of k cells for the slow scheduler]
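A sketch of the frame discipline, with `scheduler` standing in for any matching algorithm (e.g. MWM); the generator is hypothetical scaffolding:

```python
def framed_schedule(scheduler, voq_state, num_slots: int, k: int):
    """Frame scheduling: run the (slow) scheduler only once every k cell
    times and hold its matching for the whole frame, so the scheduler
    gets k cell times per decision."""
    match = None
    for t in range(num_slots):
        if t % k == 0:               # new frame: one decision per k slots
            match = scheduler(voq_state)
        yield match                  # the same matching is reused k times
```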
Scaling a crossbar
Conclusion: Scaling the line rate is relatively straightforward (although the chip count and power may become a problem).
Scaling the scheduling decision is more difficult, and often comes at the expense of packet delay.
What if we want to increase the number of ports?
Can we build a crossbar-equivalent from multiple stages of smaller crossbars?
If so, what properties should it have?
Scaling
1. Scaling Line Rate: Bit-slicing; Time-slicing
2. Scaling Time (Scheduling Speed): Time-slicing; Envelopes; Frames
3. Scaling Number of Ports: Naïve approach; Clos networks; Benes networks
Scaling number of outputs: Naïve approach

Building block: 16x16 crossbar switch. Eight inputs and eight outputs required!

[Figure: 4 inputs, 4 outputs]
3-stage Clos Network

[Figure: m first-stage switches of size n × k, k middle-stage switches of size m × m, and m third-stage switches of size k × n, with N = n × m inputs and outputs, and k ≥ n]
With k = n, is a Clos network non-blocking like a crossbar?
Consider the example: the scheduler chooses to match (1,1), (2,4), (3,3), (4,2).
With k = n, is a Clos network non-blocking like a crossbar?

Consider the example: the scheduler chooses to match (1,1), (2,2), (4,4), (5,3), …
By rearranging existing matches, the new connections could be added.
Q: Is this Clos network “rearrangeably non-blocking”?
With k = n, a Clos network is rearrangeably non-blocking

Route matching is equivalent to edge-coloring in a bipartite multigraph; colors correspond to middle-stage switches.
Each vertex corresponds to an n × k or k × n switch, and no two edges at a vertex may be colored the same.
Example: (1,1), (2,4), (3,3), (4,2)
König’s theorem: a bipartite multigraph with maximum degree D can be edge-colored with D colors. (Remember the Birkhoff-von Neumann decomposition theorem.)
Therefore, if k = n, a Clos network is rearrangeably non-blocking (and can therefore perform any permutation).
How complex is the rearrangement?
Method 1: Find a maximum size bipartite matching for each of the D colors in turn, O(DN^2.5). Why does it work?
Method 2: Partition the graph into Euler sets, O(N log D) [Cole et al. ’00]
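Method 1 can be sketched as below. For simplicity this uses a plain augmenting-path matching, O(V·E), rather than the Hopcroft-Karp algorithm the O(DN^2.5) bound assumes. It works because peeling a perfect matching off a d-regular bipartite multigraph leaves a (d−1)-regular multigraph, so repeating d times colors every edge:

```python
def max_matching(adj, n_left, n_right):
    """Maximum bipartite matching by augmenting paths.
    adj[u] = set of right-side neighbors of left vertex u."""
    match_r = [-1] * n_right          # match_r[v] = matched left vertex
    def augment(u, seen):
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                if match_r[v] == -1 or augment(match_r[v], seen):
                    match_r[v] = u
                    return True
        return False
    for u in range(n_left):
        augment(u, set())
    return match_r

def color_edges(edges, n_left, n_right, d):
    """Color the edges of a d-regular bipartite multigraph with d colors
    by peeling off one perfect matching per color; each color class is
    the traffic routed through one middle-stage switch."""
    remaining = list(enumerate(edges))  # (edge index, (u, v))
    colors = [None] * len(edges)
    for c in range(d):
        adj = [set() for _ in range(n_left)]
        ids = {}                        # (u, v) -> indices of parallel edges
        for i, (u, v) in remaining:
            adj[u].add(v)
            ids.setdefault((u, v), []).append(i)
        match_r = max_matching(adj, n_left, n_right)
        matched = set()
        for v, u in enumerate(match_r):
            if u != -1:
                i = ids[(u, v)].pop()   # color one of the parallel edges
                colors[i] = c
                matched.add(i)
        remaining = [e for e in remaining if e[0] not in matched]
    return colors
```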
Euler partition of a graph
Euler partition of graph G:
1. Each odd-degree vertex is at the end of one open path.
2. Each even-degree vertex is at the end of no open path.
Euler split of a graph
Euler split of G into G1 and G2:
1. Scan each path in an Euler partition.
2. Place alternate edges into G1 and G2.

[Figure: G split into G1 and G2]
Edge-Coloring using Euler sets
Assume for simplicity that the graph is regular (all vertices have the same degree D) and that D = 2^i.
Perform i rounds of “Euler splits” and 1-color each resulting graph. This is log D rounds, each of total cost O(E).
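A sketch of the Euler-split coloring for the regular, D = 2^i case described above (edge indices and helper names are ours):

```python
def euler_split(edges, n_left, n_right):
    """Walk the trails of an Euler partition of a bipartite multigraph,
    placing alternate edges into two halves. When every vertex has even
    degree D, every trail is a circuit and each half is D/2-regular.
    Returns two lists of edge indices."""
    adj_l = [[] for _ in range(n_left)]   # stacks of (neighbor, edge index)
    adj_r = [[] for _ in range(n_right)]
    for i, (u, v) in enumerate(edges):
        adj_l[u].append((v, i))
        adj_r[v].append((u, i))
    used = [False] * len(edges)
    g1, g2 = [], []
    for start in range(n_left):
        while adj_l[start]:                   # start a new trail at `start`
            side, u, out = adj_l, start, g1
            while True:
                while side[u] and used[side[u][-1][1]]:
                    side[u].pop()             # discard already-used edges
                if not side[u]:
                    break                     # trail is stuck: circuit done
                v, i = side[u].pop()
                used[i] = True
                out.append(i)
                out = g2 if out is g1 else g1   # alternate the halves
                side = adj_r if side is adj_l else adj_l
                u = v
    return g1, g2

def edge_color(edges, n_left, n_right, D):
    """Edge-color a D-regular bipartite multigraph, D a power of two, by
    recursive Euler splits: log2(D) rounds, each of cost O(E).
    Returns D color classes (lists of edge indices)."""
    def rec(ids, d):
        if d == 1:
            return [ids]
        g1, g2 = euler_split([edges[i] for i in ids], n_left, n_right)
        return (rec([ids[j] for j in g1], d // 2) +
                rec([ids[j] for j in g2], d // 2))
    return rec(list(range(len(edges))), D)
```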
Implementation
SchedulerScheduler Route connections
Route connections
Requestgraph
Permutation Paths
Implementation
Pros:
- A rearrangeably non-blocking switch can perform any permutation.
- A cell switch is time-slotted, so all connections are rearranged every time slot anyway.

Cons:
- Rearrangement algorithms are complex (in addition to the scheduler).

Can we eliminate the need to rearrange?
Strictly non-blocking Clos Network
Clos’ Theorem: If k ≥ 2n – 1, then a new connection can always be added without rearrangement.
Clos Theorem

[Figure: first-stage switches I1…Im of size n × k, middle-stage switches M1…Mk of size m × m, third-stage switches O1…Om of size k × n; N = n × m, k ≥ 2n – 1]
Clos Theorem

1. Consider adding the n-th connection between first-stage switch Ia and third-stage switch Ob.
2. We need to ensure that there is always some middle-stage switch M available.
3. The n – 1 connections already in use at the input can each occupy one middle-stage switch, and likewise for the n – 1 already in use at the output. So if k > (n – 1) + (n – 1), there is always an M available, i.e. we need k ≥ 2n – 1.

[Figure: Ia and Ob, each with n – 1 of their links to the middle stage already in use]
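The counting argument reduces to a single inequality; a trivial check:

```python
def strictly_nonblocking(n: int, k: int) -> bool:
    """Clos' counting argument: when adding the n-th connection at input
    switch Ia and output switch Ob, at most n-1 middle switches are busy
    on Ia's side and, in the worst case, n-1 distinct ones on Ob's side.
    A free middle switch is guaranteed iff k > 2(n-1), i.e. k >= 2n-1."""
    worst_case_busy = (n - 1) + (n - 1)
    return k > worst_case_busy
```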
Benes networks: Recursive construction

[Figure: an N-port Benes network built recursively, with two (N/2)-port Benes networks between outer stages of 2 × 2 switches]
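One payoff of the recursive construction is the crosspoint count. A small sketch, assuming the standard construction from 2 × 2 building blocks (the closed form is 4N log2(N) - 2N):

```python
def benes_crosspoints(N: int) -> int:
    """Crosspoints in an N-port Benes network built recursively from
    2x2 crossbars (4 crosspoints each): an N-port network is two outer
    columns of N/2 2x2 switches around two parallel (N/2)-port Benes
    networks. N must be a power of two."""
    if N == 2:
        return 4                      # base case: a single 2x2 switch
    return 2 * (N // 2) * 4 + 2 * benes_crosspoints(N // 2)
```

For N = 64 this gives 1408 crosspoints, versus 4096 for a single 64 × 64 crossbar: O(N log N) instead of O(N^2).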
Scaling Crossbars: Summary
Scaling the bit-rate through parallelism is easy.
Scaling the scheduler is hard.
Scaling the number of ports is harder.
Clos network:
- Rearrangeably non-blocking with k = n, but routing is complicated.
- Strictly non-blocking with k ≥ 2n – 1, so routing is simple, but requires more bisection bandwidth.
Benes network: scaling with small components.