Low-Latency Pipelined Crossbar Arbitration

Cyriel Minkenberg, Ilias Iliadis, François Abel
IBM Research, Zurich Research Laboratory

GLOBECOM 2004 © 2004 IBM Corporation



Page 2: Low-Latency Pipelined Crossbar Arbitration


Outline

Context OSMOSIS project

Problem Low-latency, high-throughput crossbar arbitration in FPGAs

Approach A new way to pipeline parallel iterative matching algorithms

Simulation results Latency-throughput as a function of pipeline depth

Conclusions

Page 3: Low-Latency Pipelined Crossbar Arbitration


OSMOSIS

Optical Shared MemOry Supercomputer Interconnect System
Sponsored by DoE & NNSA as part of ASCI; joint 2½-year project
• Corning: Optics and packaging
• IBM: Electronics (arbiter, input and output adapters) and system integration

High-Performance Computing (HPC)
• Massively parallel computers (e.g. Earth Simulator, Blue Gene)
• Low-latency, high-bandwidth, scalable interconnection networks

Main sponsor objective: solving the technical challenges and accelerating the cost reduction of all-optical packet switches for HPCS interconnects by
• building a full-function all-optical packet switch demonstrator system
• showing the scalability, performance and cost paths for a potential commercial system

Key requirements:
  Latency          < 1 μs end-to-end
  Line rate        40 Gb/s/port
  Number of ports  64
  Efficiency       75% user data
  Error rate       BER < 10^-21
  Implementation   FPGA-only
  Scalability      2048 nodes via 3-stage Fat Tree topology

Page 4: Low-Latency Pipelined Crossbar Arbitration


OSMOSIS System Architecture

Broadcast-and-select architecture (crossbar)
Combination of wavelength- and space-division multiplexing
Fast switching based on SOAs
Electronic input and output adapters
Electronic arbitration

[Figure: OSMOSIS system architecture. 64 ingress adapters (VOQs, Tx, control) connect through control links to the central arbiter (bipartite graph matching algorithm). The all-optical switch comprises 8 broadcast units (WDM mux, optical amplifier, star coupler; 8x1, 1x128) and 128 select units (fast SOA 1x8 fiber- and wavelength-selector gates, 8x1 combiner) feeding 64 egress adapters (2 Rx, EQ, control).]

Page 5: Low-Latency Pipelined Crossbar Arbitration


OSMOSIS Arbitration

Crossbar arbitration
• Heuristic parallel iterative matching algorithms: RRM, PIM, i-SLIP, FIRM, DRRM, etc.
• These require I = log2 N iterations to achieve good performance
• Mean latency decreases as the number of iterations increases
• OSMOSIS: N = 64 → I = 6 iterations
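As a concrete illustration of the iterative matching family named above, the sketch below runs one grant/accept iteration per pass. It is deliberately simplified: PIM grants among requesters at random and i-SLIP uses round-robin pointers, whereas this sketch always picks the lowest-indexed port so the result is deterministic.

```python
def pim_iteration(n, requests, matched_in, matched_out):
    """One grant/accept iteration of a PIM-style matcher (sketch).

    requests[i] is the set of outputs requested by input i.
    Deterministic lowest-index choice stands in for PIM's random
    grant / i-SLIP's round-robin pointer.
    """
    # Grant phase: each unmatched output grants one requesting,
    # still-unmatched input.
    grants = {}  # input -> outputs that granted it this iteration
    for o in range(n):
        if o in matched_out:
            continue
        for i in range(n):
            if i not in matched_in and o in requests[i]:
                grants.setdefault(i, []).append(o)
                break
    # Accept phase: each input accepts at most one grant.
    return [(i, outs[0]) for i, outs in grants.items()]

def iterative_match(n, requests, iterations):
    """Run I iterations; later iterations only add edges between
    ports left unmatched by earlier ones."""
    matched_in, matched_out, edges = set(), set(), []
    for _ in range(iterations):
        new = pim_iteration(n, requests, matched_in, matched_out)
        if not new:
            break
        for i, o in new:
            matched_in.add(i)
            matched_out.add(o)
        edges.extend(new)
    return edges
```

Each extra iteration can only grow the matching, which is why performance improves with the iteration count up to about log2 N.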

Problem

• One iteration takes time Ti, too long to complete I iterations within one time slot Tc

• VHDL experiments indicate that Ti ≤ Tc ≤ 2Ti, i.e. only one or two iterations fit in a slot

• Poor performance…

Solution

• Pipelining

• however, in general this incurs a latency penalty

Page 6: Low-Latency Pipelined Crossbar Arbitration


Parallel Matching: PMM

K parallel matching units (allocators)
Every allocator now has K time slots to compute a matching: K = ⌈I·Ti/Tc⌉

Requests/grants issued in round-robin TDM fashion
In every time slot, one allocator receives a set of requests, and one allocator issues a set of grants (and is then reset)

Drawbacks
• Minimum arbitration latency equals K time slots
• Allocators cannot take the most recent arrivals into account in subsequent iterations

[Figure: PMM timing. Allocators A0…AK-1 are fed requests and drained of grants in round-robin TDM; Mk[t] denotes allocator k's matching for time slot t. Arbitration spans K slots of duration Tc, each iteration taking Ti.]
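The allocator count follows directly from the timing budget; reading the slide's formula as K = ⌈I·Ti/Tc⌉ (each allocator needs K·Tc ≥ I·Ti to finish its I iterations), a small sketch gives the range relevant to OSMOSIS:

```python
import math

def pmm_allocators(I, Ti, Tc):
    """Allocators needed so each one, served in round-robin TDM,
    gets K * Tc >= I * Ti time to complete its I iterations."""
    return math.ceil(I * Ti / Tc)

# OSMOSIS figures from the talk: I = 6 and Ti <= Tc <= 2*Ti.
print(pmm_allocators(6, 1.0, 2.0))  # Tc = 2*Ti -> K = 3
print(pmm_allocators(6, 1.0, 1.0))  # Tc = Ti   -> K = 6
```

Whatever K comes out, PMM's minimum arbitration latency is that same K time slots, which is the drawback FLPPR attacks next.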

Page 7: Low-Latency Pipelined Crossbar Arbitration


FLPPR: Fast Low-latency Parallel Pipelined aRbitration


REQUEST: Requests are issued, depending on the VOQ state, to all or a subset of the allocators

MATCH: Every allocator Ai adds new edges based on the current requests and its existing matching

UPDATE: New edges are accounted for by updating the VOQ state

SHIFT: At the end of every time slot, each Ai with i > 0 forwards its resulting matching to Ai-1
• AK-1 starts with an empty matching
• Each Ai with i < K-1 starts with the previous matching of Ai+1
• A0 issues the final matching

[Figure: FLPPR pipeline. Allocators A0…A3 all receive requests derived from the VOQ state; partial matchings Mk[t] shift from Ak to Ak-1 every time slot until A0 issues the final matching.]
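The MATCH and SHIFT steps above can be sketched as one time slot of the pipeline. This is a simplified illustration, not the paper's implementation: it omits the UPDATE of VOQ counters (which in the real scheme removes already-granted requests), and `greedy_match` is a stand-in for a real allocator iteration.

```python
def flppr_slot(pipeline, requests, match_fn):
    """One FLPPR time slot over K pipelined allocators (sketch).

    pipeline[k] is allocator A_k's partial matching (input -> output).
    MATCH: every allocator extends its matching with the current,
    freshest requests. SHIFT: A_0 issues the final matching, matchings
    shift from A_k to A_{k-1}, and A_{K-1} restarts empty.
    """
    for k in range(len(pipeline)):
        pipeline[k] = match_fn(pipeline[k], requests)
    final = pipeline.pop(0)   # A_0 issues the final matching
    pipeline.append({})       # A_{K-1} starts with an empty matching
    return final

def greedy_match(matching, requests):
    """Stand-in allocator iteration: greedily add compatible edges."""
    m = dict(matching)
    used = set(m.values())
    for i in sorted(requests):
        if i in m:
            continue
        for o in sorted(requests[i]):
            if o not in used:
                m[i] = o
                used.add(o)
                break
    return m
```

Because every allocator, including A0, sees the newest requests, a request arriving at a lightly loaded switch can be granted by A0 in the very same slot, which is how FLPPR avoids PMM's fixed K-slot pipeline latency.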

Page 8: Low-Latency Pipelined Crossbar Arbitration


Request and Grant Filtering

PMM = parallel allocators, TDM requests; FLPPR = pipelined allocators, parallel requests
FLPPR allows requests to be issued to any allocator in any time slot
A request filtering function determines the subset of allocators for every VOQ

Opportunity for performance optimization by selectively submitting requests to and accepting grants from specific allocators

Request and grant filtering

General class of algorithms
• Request filter Rk determines the mapping between allocators and requests; requests can be issued selectively depending on Lij, Mk, and k
• Grant filter Fk can remove excess grants

[Figure: FLPPR with filtering. Line-card requests and the VOQ state feed request filters R0…RK-1, one per allocator A0…AK-1; grant filters F0…FK-1 post-process the resulting matchings.]

Page 9: Low-Latency Pipelined Crossbar Arbitration


Example, N = 4, K = 2 without filtering

VOQ counters Lij (before):
0 3 0 1
1 2 0 0
0 6 2 4
2 0 0 1

Requests (broadcast, so R0 = R1 = sign(Lij)):
0 1 0 1
1 1 0 0
0 1 1 1
1 0 0 1

Matching M0:
0 1 0 0
1 0 0 0
0 0 0 1
0 0 0 0

Matching M1:
0 1 0 0
1 0 0 0
0 0 1 0
0 0 0 1

Grants Gij = M0 + M1:
0 2 0 0
2 0 0 0
0 0 1 1
0 0 0 1

VOQ counters Lij (after):
0 1 0 1
0 2 0 0
0 6 1 3
2 0 0 0

Input 1 holds a single cell for output 0 but receives grants from both allocators; the excess grant is wasted.

Page 10: Low-Latency Pipelined Crossbar Arbitration


Example, N = 4, K = 2, with request filtering

VOQ counters Lij (before):
0 3 0 1
1 2 0 0
0 6 2 4
2 0 0 1

Requests R0:
0 1 0 1
1 1 0 0
0 1 1 1
1 0 0 1

Requests R1 (filtered — VOQs holding a single cell request only one allocator):
0 1 0 0
0 1 0 0
0 1 1 1
1 0 0 0

Matching M0:
0 1 0 0
1 0 0 0
0 0 0 1
0 0 0 0

Matching M1:
0 1 0 0
0 0 0 0
0 0 1 0
1 0 0 0

Grants Gij = M0 + M1:
0 2 0 0
1 0 0 0
0 0 1 1
1 0 0 0

VOQ counters Lij (after):
0 1 0 1
0 2 0 0
0 6 1 3
1 0 0 1

With request filtering, input 1 no longer receives an excess grant for output 0; allocator A1 matches input 3 to output 0 instead, so no grants are wasted.

Page 11: Low-Latency Pipelined Crossbar Arbitration


FLPPR Methods

We define three FLPPR variants

Method 1: Broadcast requests, selective post-filtering

• Requests sent to all allocators; excess grants are cancelled

Method 2: Broadcast requests, no post-filtering

• Requests sent to all allocators; no check for excess grants

• May lead to “wasted” grants

Method 3: Selective requests, no post-filtering

• Requests sent selectively (no more requests than current VOQ occupancy); no check for excess grants
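The three variants can be encoded as choices of request and grant filter. The sketch below is a hypothetical per-VOQ encoding for illustration; the function names and integer method codes are not from the paper, and real filters may also depend on the current matchings Mk.

```python
def request_filter(method, voq_len, K):
    """How many of the K allocators this VOQ sends a request to.

    Methods 1 and 2 broadcast to all K allocators whenever the VOQ
    is non-empty; method 3 is selective and never has more
    outstanding requests than the VOQ holds cells.
    """
    if method in (1, 2):
        return K if voq_len > 0 else 0
    return min(voq_len, K)  # method 3

def grant_filter(method, voq_len, grants):
    """Only method 1 post-filters: excess grants are cancelled.

    Methods 2 and 3 accept all grants; under method 2 a grant in
    excess of the VOQ occupancy is simply wasted.
    """
    return grants[:voq_len] if method == 1 else grants
```

Under method 3, no allocator can issue a grant the VOQ cannot use, which is why it needs no post-filtering.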

Page 12: Low-Latency Pipelined Crossbar Arbitration


FLPPR performance – Uniform Bernoulli traffic

[Plots: mean latency (time slots, log scale 0.01 to 1000) versus normalized throughput (0 to 1). Three panels show Methods 1, 2 and 3, each with curves for K = 1 to 5 and 5-SLIP (K = 1); a fourth panel, "Comparison of FLPPR and PMM", compares PMM (K = 5), FLPPR methods 1 to 3 (K = 5) and 5-SLIP (K = 1).]

Page 13: Low-Latency Pipelined Crossbar Arbitration


FLPPR performance – Nonuniform Bernoulli traffic

[Plots: normalized throughput (0.5 to 1) versus nonuniformity w (0 to 1). Three panels show Methods 1, 2 and 3, each with curves for K = 1 to 5 and 5-SLIP (K = 1).]

Page 14: Low-Latency Pipelined Crossbar Arbitration


Arbiter Implementation

[Figure: arbiter implementation block diagram. 64 CC interfaces CC[00]…CC[63] (Rx/Tx, VOQ state) feed request filters Rj and allocators A[0]…A[K-1]; grant filters Fj drive SCI Tx links to SCC[00]…SCC[15]; common SYSCLK & CTRL.]

Page 15: Low-Latency Pipelined Crossbar Arbitration


Conclusions

Problem: Short packet duration makes it hard to complete enough iterations

Pipelining achieves high rates of matching with a highly distributed implementation

FLPPR (pipelining with parallel requests) has performance advantages
• Eliminates pipelining latency at low load
• Achieves 100% throughput with uniform traffic
• Reduces latency with respect to PMM also at high load
• Can improve throughput with nonuniform traffic

Request pre- and post-filtering allows performance optimization
• Different traffic types may require different filtering rules

Future work: Find filtering functions that optimize uniform and non-uniform performance

Highly amenable to distributed implementation in FPGAs

Can be applied to any existing iterative matching algorithm