2012.TR427 VLSI Micro-Architectures High-Radix Crossbars



    VLSI Micro-Architectures for High-Radix Crossbars

    Giorgos Passas

    Computer Architecture & VLSI Systems (CARV) Laboratory,

    Institute of Computer Science (ICS)

    Foundation of Research and Technology Hellas (FORTH)

Science and Technology Park of Crete, P.O. Box 1385, Heraklion, Crete, GR-711-10, Greece

    Technical Report FORTH-ICS/TR-427 April 2012

    Copyright 2012 by FORTH

Work Performed as a Ph.D. Thesis

at the Department of Computer Science, University of Crete,

under the supervision of Prof. Manolis Katevenis, with the financial support of FORTH-ICS


FORTH-ICS/TR-427 APRIL 2012

    VLSI Micro-Architectures for High-Radix Crossbars

    Giorgos Passas

    The crossbar is the most popular switch for digital systems such as Internet routers, clusters, and

    multiprocessors (on-chip, as well as multichip). However, because the cost of the crossbar grows with

the square of its radix, and because of past implementations in various technologies, it is

    widely believed that the crossbar is not scalable to radices beyond 32 or 64, and that for higher radices

    more complicated networks are needed, where the crossbar is the basic building block. In this thesis,

    we scale the crossbar to radices well beyond 100 by crafting novel VLSI micro-architectures and their

    detailed CMOS layouts.

As a case study, we laid out a 128×128×24 Gb/s crossbar, interconnecting 128 1 mm² user tiles in a single hop, using just 16 mm² of silicon in 90 nm CMOS. The crossbar is 32 bits wide, runs at 750 MHz, and consumes 7 Watts.

    In router systems, the user tiles will contain memory implementing combined queueing at the inputs

    and outputs of the crossbar, plus a small part of logic for port control. We show that this architecture

    is the best among a range of known router memory architectures (e.g. totally shared memory, solely

input queueing, or crosspoint queueing), for two reasons: (i) it gives top performance using only a modest speedup on either the crossbar or the memories, independent of radix; and (ii) it partitions the memory space only linearly with the radix, thus yielding (a) high SRAM density, by using few, large, and area-efficient blocks, and (b) high memory space utilization, through flexible sharing among

    flows. In chip multiprocessors, the user tiles will contain cache or local memory, plus a small part of

logic for the processor. When traffic is global and heavy, such a system is competitive with the popular mesh-centric systems, owing to the simplified routing and load balancing of the crossbar.

We made high radix crossbars feasible by developing novel VLSI micro-architectures for both their datapath and their control path. We implement the datapath using trees of multiplexor gates, as tristate buses are slowed down by intrinsically large parasitic capacitances, and we show that highly concentrated trees are more area efficient by further reducing the parasitic capacitance of their internal wires. Moreover, we contribute an experimental analysis showing that: (i) the area of the crossbar is gate limited for all practical values of its radix N and its width W, thus growing as O(N²W), not as O(N²W²), which would have been the case had area been wire limited, as is commonly believed in the literature; and (ii) the delay of the crossbar is dominated by the parasitics of wires, and because wire length grows with the perimeter of the crossbar, delay grows as O(N√W), not as O(log N), which would have been the case had delay been gate limited, as is commonly believed in the literature. Next, we propose novel pipelines to cope with the delay of the interconnect. Finally, we demonstrate that modern EDA tools can be guided to exploit the abundance of wiring resources through custom, but algorithmic, placement of gates.

For the control path, we study the architecture of iSLIP, which is the most popular parallel matching crossbar scheduler. In particular, we study a traditional iSLIP architecture that implements the matching decision of each input and each output of the crossbar in a separate arbiter block, and communicates the matching decisions between the input and the output arbiters through global arbiter-to-arbiter links. First, we show that this architecture is expensive because the arbiter-to-arbiter links take up O(N⁴) area. Thus, a radix-128 iSLIP scheduler occupies 14 mm², where the arbiter-to-arbiter links account for more than 50%. Next, by observing that the wiring of an arbiter fits in O(N log N) area, we propose a novel architecture that inverts the locality of wires by orthogonally interleaving the input with the output arbiters, thus lowering the wiring area of the scheduler down to O(N² log² N). Using this architecture, the radix-128 iSLIP scheduler becomes gate limited, fitting in 7 mm², which is a 50% reduction compared to the traditional one. For a higher radix of 256, area is reduced by almost an order of magnitude. Finally, the running time of the proposed scheduler is less than 10 ns, thus allowing operation with a minimum packet as small as 30 Bytes at a 24 Gb/s line rate.


    Publications

    Publications related to the topic:

G. Passas, M. Katevenis, D. Pnevmatikatos: Crossbar NoCs are scalable beyond 100 nodes, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD), Vol. 31, No. 4, Apr. 2012, pp. 573-585;

G. Passas, M. Katevenis, D. Pnevmatikatos: VLSI Micro-Architectures for High-Radix Crossbar Schedulers, proc. 5th ACM/IEEE International Symposium on Networks-on-Chip (NOCS 2011), Pittsburgh, PA, USA, May 1-4, 2011, 8 pages, ISBN 978-1-4503-0720-8;

G. Passas, M. Katevenis, D. Pnevmatikatos: A 128 x 128 x 24Gb/s Crossbar, Interconnecting 128 Tiles in a Single Hop, and Occupying 6% of their Area, proc. 4th ACM/IEEE International Symposium on Networks-on-Chip (NOCS 2010), Grenoble, France, May 3-6, 2010, pp. 87-95, IEEE Computer Society, ISBN 978-0-7695-4049-8.

Other publications by the author:

G. Passas, H. Eberle, N. Gura, W. Olesinski: Fast and Fair Arbitration on a Data Link, U.S. Patent, USPTO number 7965705, June 21, 2011;

G. Passas, M. Katevenis: Asynchronous Operation of Bufferless Crossbars, proc. IEEE International Conference on High Performance Switching and Routing (HPSR 2007), Brooklyn, NY, USA, May 30 - June 1, 2007, ISBN 1-4244-1206-4, paper ID 1569017531.pdf;

G. Passas, M. Katevenis: Packet Mode Scheduling in Buffered Crossbar (CICQ) Switches, proc. IEEE Workshop on High Performance Switching and Routing (HPSR 2006), Poznan, Poland, June 7-9, 2006, pp. 105-112, ISBN 0-7803-9570-0;

M. Katevenis, G. Passas: Variable-Size Multipacket Segments in Buffered Crossbar (CICQ) Architectures, proc. IEEE International Conference on Communications (ICC 2005), Seoul, Korea, May 16-20, 2005, 6 pages, paper ID 09GC08-4;

M. Katevenis, G. Passas, D. Simos, I. Papaefstathiou, N. Chrysos: Variable Packet Size Buffered Crossbar (CICQ) Switches, proc. IEEE International Conference on Communications (ICC 2004), Paris, France, June 20-24, 2004, vol. 2, pp. 1090-1096.


    Acknowledgments

    I worked on my PhD thesis at the Computer Architecture and VLSI Systems (CARV) laboratory of the

    Institute of Computer Science (ICS) of the Foundation for Research and Technology Hellas (FORTH).

    FORTH provided my graduate scholarship, including funding by the European Commission through

    the SARC project (FP6 IP #27648) and the HiPEAC Network of Excellence (NoE #004408 and #217068).

I am grateful to my advisor, prof. Manolis Katevenis, for suggesting the evaluation of the cost of crossbar speedup on chip using real VLSI layouts, and for supervising my work through weekly meetings from Spring 2008 to Fall 2011. The years 2008 and 2009 in particular were very hard, and he was the only one I remember giving me a hand. Nevertheless, there was still plenty of time and space to act on my own, which I really appreciated.

I am also grateful to my co-advisor, prof. Dionisios Pnevmatikatos, who joined our meetings in Fall 2009 and offered fresh insights. However, I mostly thank him for all the nice things I learned from him about technical writing and drawing; for example, the bold lines in Fig. 5.10 are due to him.

I also thank the other members of my thesis committee, prof. Davide Bertozzi, prof. Angelos Bilas, Dr. Cyriel Minkenberg, prof. Yannis Papaefstathiou, and prof. Apostolos Traganitis, for their questions and comments at the defense of my thesis; they proved very constructive. Special thanks to prof. Davide Bertozzi and Dr. Cyriel Minkenberg for going into more details.

I thank prof. Christos Sotiriou, whom I consulted on several of the issues I faced on EDA flows and algorithms; the custom placement techniques were motivated by these discussions.

    I thank Spyros Lyberis and Michael Ligerakis for setting up the EDA toolset for me. Spyros Lyberis also

    helped me improve the oral presentation of my defense by commenting on my rehearsal.

I thank Dr. Hans Eberle, prof. Jose Duato, and prof. Jose Flich, whom I had the opportunity to cooperate with during my internship at Sun Labs, Menlo Park, CA, in Fall 2007; seeing how other researchers think about a closely related research topic was very helpful for my thesis.

    Last but not least, I thank my family, especially my mother, and my uncle Nikos Arapakis, for their

    love and encouragement, and my friends, especially Makis Stamos (Psilos), Kostis Anastasakis, Enrico

    Schiattarella, Orestis Karamagiolas, and Giorgos Panagiotakis, for helping me further tolerate reality.

Finally, I thank the cleaning and security personnel at FORTH for being kind to me.


    Contents

1 Introduction
   1.1 Motivation
   1.2 Contributions
   1.3 Outline

2 Basic Concepts
   2.1 Preliminaries
   2.2 Time Switches
   2.3 Space Switches
   2.4 Scheduling
   2.5 Routing
   2.6 Virtual Channels
   2.7 High Radix
   2.8 Summary

3 A Comparison of Architectures for High-Radix Switches
   3.1 Basic Switch Architectures
   3.2 Memory-Sharing Merits
   3.3 On-Chip SRAM Cost
   3.4 Combined Input-Output Queued (CIOQ) Crossbars
   3.5 Hierarchically-Queued Crossbars
   3.6 Related Work
   3.7 Conclusion

4 Datapath Micro-Architectures for High-Radix Crossbars
   4.1 Basic Architecture
   4.2 Cost & Performance Analysis
   4.3 Customized Layout using Link Pipelining
   4.4 Models for Area
   4.5 Related Work
   4.6 Conclusion

5 Scheduler Micro-Architectures for High-Radix Crossbars
   5.1 iSLIP Circuit
   5.2 Block Micro-Architecture
   5.3 Cross Micro-Architecture
   5.4 Cross-iSLIP versus Wavefront-Scheduler Comparison
   5.5 FIFO & Virtual-Channel Schedulers
   5.6 Related Work
   5.7 Conclusion

6 High-Radix CIOQ Crossbar Switches & Crossbar NoCs
   6.1 Tiled Architectures
   6.2 Wire-Over-SRAM Architectures
   6.3 Radix-128 Tiled System using Centralized Crossbar
   6.4 Crossbar versus Mesh Comparison
   6.5 Projections in Newer Technology Nodes
   6.6 Related Work
   6.7 Conclusion

7 Summary & Future Work

Bibliography


    Chapter 1

    Introduction

Today, on-chip memory systems with a few hundred memory nodes, such as chip multiprocessors

    (CMPs) and switch fabrics, are pivotal digital systems. A key component in such systems is the switch

    interconnecting the memories. While the crossbar is the most popular switch, it is widely considered

non-scalable to radices beyond a few tens of nodes due to its quadratic cost [1][2][3]. Thus, designers are

    increasingly adopting cheaper-but-intricate topologies, such as meshes and tori [1][4][5], where the

    crossbar is the basic building block. However, a study on the scaling limits of the crossbar is missing

    from the literature: If the crossbar is proven to be feasible, designers are likely to replace the currently

    deployed topologies with a crossbar to benefit from its simplicity.

    We consider memory systems where each memory node is located on a user tile, and user tiles

    are arranged in a 2D matrix. In a switch fabric, a user tile will implement mostly queues associated

    with a switch port, plus a small circuit for the control of that port. In a CMP, a user tile will implement

    a processor and the cache or local memory next to the processor. We evaluate the area, speed, and

power consumption overhead that the crossbar adds to the user tiles by studying their VLSI organization in

    modern CMOS technology, using Electronic Design Automation (EDA) tools.

    1.1 Motivation

    Switch chips interconnect their ports using a crossbar. An efficient approach to handle contention for

the output ports is using queues. In particular, traffic at different inputs contending for the same output

    is queued at the inputs [6]. To reduce Head of Line blocking in such Input Queued (IQ) crossbars,

queue memories can be divided into lanes, e.g. Virtual Output Queues (VOQ) [6]. However, scheduling which lane is connected to which output is a hard problem to solve quickly, fairly, and efficiently [7]. To

    compensate for inefficiencies of scheduling, internal speedup can be used. Then, the throughput of

    both the memories and the crossbar is overprovisioned, and memories are placed also at the outputs.

    This organization is known as Combined Input-Output Queueing (CIOQ) [8][9][10][11].


    CIOQ has been studied for systems where the crossbar and the memories are implemented on

separate chips [7][12]. In such systems, it is known to be expensive because it wastes scarce chip-I/O resources [12]. Thus, crosspoint queueing (XQ) has been proposed as a more scalable alternative.

XQ can be considered a scalable variant of output queueing (OQ), trading queueing internal to the crossbar for operation without internal speedup [13][14].

    However, the tradeoff between speedup and memory is very different off-chip from on-chip.

    Chip-to-chip link bandwidth is expensive, in terms of both pin count and power consumption. Each

    pin is run at the highest possible frequency, so link speedup can only be provided by increasing the

number of pins. But more pins cost in package size, board area, and wiring. Moreover, high speed

    serial off-chip links carry their own clock information, embedded in the data encoding. Even if such a

    link has no valid data to carry, it still consumes power because it has to carry synchronization signals

    for the receivers clock recovery circuit to stay in-sync (powering down is beneficial only for quite long

    inactivity periods). Thus, an off-chip link with a speedup ofs> 1.0 consumes power proportional to

    its peak throughput, s, although its average utilization never exceeds 1.0.

    On-chip, things are very different though. The wires of an idle link need not change state, hence

CMOS circuits will only consume energy when transferring valid data: on-chip link power consumption is proportional to average throughput, not peak capacity. Moreover, on-chip, links can be routed

    over the memories: Modern CMOS processes offer many layers of interconnect (e.g. eight), while

    SRAM blocks obstruct few of them (e.g. four). Thus, on chip, CIOQ appears technologically correct.

    Furthermore, CIOQ is advantageous to XQ because it reduces the partitioning of memory space.

    Particularly, in CIOQ the total number of memories grows linearly with the radix, while in XQ it grows

    quadratically. Moreover, higher memory partitioning translates to higher implementation cost. On

the other hand, CIOQ compared to XQ increases the cost of the crossbar, while also requiring a monolithic crossbar scheduler. Hence, a comparison between CIOQ and XQ starts from the evaluation of

    the cost of crossbar speedup and the feasibility of the scheduler.

This evaluation should be done for high radix switches [15]. Switch chip designers strive to benefit from the advances in signaling technology by increasing the radix of the switch chips, as switch

    chips with higher radix enable lower diameter network topologies, with lower component count, and

    lower cost. However, most studies in the literature concern relatively low radix crossbars, up to 32 or

    64 ports [2][16], and a study of scaling to hundreds of ports is missing.

    Finally, the cost of crossbar speedup and the feasibility of the scheduler should be evaluated

    on real VLSI layouts. Switch chips are typically Application Specific Integrated Circuits (ASICs), and

    ASICs are designed using Electronic Design Automation (EDA) tools. Thus, VLSI layouts should be

    developed using such EDA tools.


    1.2 Contributions

    The key contributions of this thesis are as follows:

1. High Radix Crossbar Network-on-Chip. We lay out a 128×128×24 Gb/s crossbar in a 90 nm CMOS process with 9 layers of interconnect. The crossbar is 32 bits wide, runs at 750 MHz using a 3-stage pipeline, fits in 16 mm² of silicon by filling it at the 90% level, and consumes 6 Watts. Moreover, we surround the crossbar with 128 1 mm² user tiles, and we connect the crossbar to the user tiles through global links. The global links are 32 bits wide, run at 750 MHz using a two-stage pipeline, run on top of the user tiles, and consume 1.2 Watts.

2. Crossbar Datapath Micro-Architecture. We implement the datapath of the crossbar using trees of multiplexor gates, as tristate buses are slowed down by intrinsically large parasitic capacitances, and we show that highly concentrated trees are more area efficient by further reducing the parasitic capacitance of their internal wires. Next, we contribute an experimental scaling analysis, showing that: (i) the area of the crossbar is gate limited for all practical values of its radix N and its width W, thus growing as O(N²W), not as O(N²W²), which would have been the case had area been wire limited, as is commonly believed in the literature [2][17]; and (ii) the delay of the crossbar is dominated by the parasitics of wires, and because wire length grows with the perimeter of the crossbar, delay grows as O(N√W), not as O(log N), which would have been the case had delay been gate limited, as is commonly believed in the literature [15]. Next, we propose novel pipelines to cope with the delay of the interconnect. Finally, we demonstrate that EDA tools can be guided to compact routing solutions through custom gate placement.

3. Crossbar Scheduler Micro-Architecture. We study a traditional iSLIP architecture that implements the matching decision of each input and each output of the crossbar in a separate arbiter block, and communicates the matching decisions between the input and the output arbiters through global arbiter-to-arbiter links. First, we show that this architecture is expensive because the arbiter-to-arbiter links take up O(N⁴) area. Thus, a radix-128 iSLIP scheduler occupies 14 mm², where the arbiter-to-arbiter links account for more than 50%. Next, by observing that the wiring of an arbiter fits in O(N log N) area, we propose a novel cross architecture that inverts the locality of wires by orthogonally interleaving the input with the output arbiters, thus lowering the wiring area of the scheduler down to O(N² log² N). Using this cross architecture, the radix-128 iSLIP scheduler becomes gate limited, fitting in 7 mm², which is a 50% reduction compared to the traditional one. For a higher radix of 256, the reduction nears an order of magnitude.

4. Combined Input-Output Queueing is better than Crosspoint Queueing. Based on the above findings, we conclude that crossbars are small and speedup is inexpensive. Because CIOQ compared to XQ reduces memory partitioning, CIOQ is better than XQ.


    1.3 Outline

    Chapter 2 presents basic concepts in switch design. First, we abstract the role of the switch to scalable

    distributed multiparty communication. Next, we classify switches to time and space switches, and

    we explain why space switches are more scalable. Though scalable, space switches need scheduling

    and routing. Thus, we also overview some popular scheduling and routing algorithms. Moreover, we

discuss the popular Virtual Channels and the argument for high radix switches. Finally, we summarize.

    Chapter 3 presents a comparison of known switch architectures for high radix switches. First, we

    overview basic switch architectures. Next, we study the merits of memory sharing and the cost of

    memory implementation, and we show that the input queued crossbar is the only scalable switch.

    Because the input queued crossbar is difficult to schedule at high radices, we also discuss scalable

variants thereof, namely combined input-output queued crossbars and hierarchically queued crossbars. Finally, we discuss related work and the conclusion.

    Chapter 4 presents VLSI micro-architectures for high radix crossbar datapaths. First, we describe a

    basic datapath architecture. Next, we show that in this architecture area is practically always gate

    limited, while delay becomes wire limited at high radices. To increase throughput, we also describe a

    customized layout using a novel wire pipeline. Moreover, we develop simple models for area. Finally,

    we discuss related work and the conclusion.

Chapter 5 presents VLSI micro-architectures for high radix crossbar schedulers. First, we describe the iSLIP circuit. Next, we study a traditional block architecture, and we show that at high radices scheduler area is wire limited. To remove the wiring limitations, we propose and study a novel cross architecture. We find that this cross architecture has similarities with the architecture of the Wavefront Scheduler. Moreover, we adapt the cross scheduler architecture to FIFO and Virtual Channel

    crossbars. Finally, we discuss related work and the conclusion.

    Chapter 6 presents VLSI micro-architectures for high radix Combined Input-Output Queued (CIOQ)

    switches and Networks-on-Chip (NoCs). First, we show that a tiled architecture can be used for both

CIOQ switches and NoCs. Next, we study alternative locations of the crossbar within its surrounding tiles, and we show that a centralized crossbar is more practical. Moreover, we plot overall system performance, we compare our crossbar to a popular mesh NoC, and we make projections for newer technology nodes. Finally, we describe related work and the conclusion.

    Chapter 7 presents a summary of the thesis, as well as directions for future work.


    Chapter 2

    Basic Concepts

    In this chapter, we describe basic concepts in switch design. In particular, we describe the role of

a switch (section 2.1), a taxonomy of switches into time and space switches (sections 2.2 and 2.3), the problem of scheduling (section 2.4), the problem of routing (section 2.5), the concept of Virtual Channels (section 2.6), the argument for high radix switches (section 2.7), and a summary (section 2.8).

    The description is based mainly on the transparencies of the Packet Switch Architecture class at the

    University of Crete [18].

2.1 Preliminaries

Interconnection switches are intermediaries implementing the communication between the parties

    of a digital system. For example, between line cards in an Internet router [19], between processors and

    memories in a multiprocessor system [20], and between I/O devices, processors, and/or memories in

    a storage system [21].

As the scale of systems varies from a few tens of parties (e.g. in an Internet router) to many thousands of parties (e.g. in a multiprocessor), the scale of the switch has to vary accordingly. While a small

    switch may switch data by simply passing it through a single point in space at different moments in

time (e.g. the memory switch; section 2.2), larger switches use topologies of parallel paths in space (e.g. the crossbar or the Benes switch; section 2.3). Thus, space and time are two basic dimensions in

switch design. A third dimension is the coordination of the parties. Examples are scheduling or routing algorithms resolving situations where parties are contending for a single resource of the switch, such as an output link or an internal path (sections 2.4 and 2.5). In any case, the parties comprise a distributed system. When the scale is large, distribution is obvious, a direct consequence of the physical distances between the parties. At smaller scales, distribution emerges from large ratios of path delays to processing times.


[Figure 2.2: A 4×4 memory switch. Inputs assemble words, the central memory forms a single path in space (the bottleneck), and outputs disassemble words.]

    2.2 Time Switches

    For small systems, a switch may simply pass data through a single point in space at different moments

    in time. A representative such switch is the memory switch. A memory switch (Fig. 2.2) switches data

from a set of input links to a set of output links through writes and reads to and from a central memory.

    The access rate of the central memory is equal to the aggregate rate of the input and the output links,

    in order to absorb any contention among the input links for the output links, and to fully utilize the

    output links. Thus, the inputs assemble words, which are then multiplexed in time on the memory

    (write) bus. In the opposite direction, words are demultiplexed on the memory (read) bus, and are

    disassembled at the outputs. Inside the memory, words are organized in queues. For details on the

    memory switch, refer e.g. to [22][23][24].
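To make this scaling limit concrete, the following minimal Python sketch (illustrative names and numbers, not from the thesis) computes the aggregate access bandwidth the central memory must sustain: all N inputs write and all N outputs read at full link rate.

```python
# A minimal sketch of the memory switch bottleneck: the central memory
# must absorb the writes of all N inputs plus the reads of all N outputs,
# so its access bandwidth grows linearly with port count and port rate.
def required_memory_bandwidth(n_ports, link_rate_gbps):
    """Aggregate access rate of the central memory, in Gb/s."""
    return 2 * n_ports * link_rate_gbps   # N writers + N readers, all at full rate

for n in (4, 8, 32, 128):
    print(f"{n:3d} ports at 10 Gb/s -> {required_memory_bandwidth(n, 10.0):.0f} Gb/s memory")
```

Doubling either the port count or the link rate doubles the required memory access rate, which is why, as chapter 3 shows, a state-of-the-art memory switch is limited to only a handful of ports.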

    Queues are typically simple first-in-first-out (FIFO) structures, as only such simple structures

    can be implemented fast in hardware [25]. Thus, at least one queue is needed per output, to maximize

    the throughput of the memory: If words for different outputs are intermingled in the same queue,

    words for a heavily loaded output block other, lightly loaded outputs. This is a fundamental problem

in switch design, known as Head-of-Line (HoL) blocking [6][26].

Coordination in a memory switch concerns the sharing of memory space among the inputs and the outputs. In particular, when some inputs are contending for the same output, the queue corresponding to that output starts building up in time. Given that memory space is finite, a protocol is needed

    to prevent the memory from overflowing. Usually, a credit based protocol is employed [27]. Using a

    credit based protocol, inputs are allocated a number of credits for each queue, credits are spent on

    writes, and are returned on reads, so that operation is lossless. Notice that in this way the memory is

minimally partitioned into N² credit equivalents. Furthermore, while dropping excessive words could

    be an alternative, this usually degrades performance, as new words may not be able to arrive in time

to replace the dropped ones [6][26]. Finally, according to Little's law [28], rates allocated to inputs for a particular output are proportional to their share of that output's queue space. That is, in a memory switch, rates are allocated through memory space.

[Figure 2.3: (a) A 4×4 crossbar switch and (b) example connections.]
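As a concrete illustration of the credit protocol just described, here is a minimal Python sketch (class and method names are assumed, not from the thesis): a credit is spent on every write and returned on every read, so the queue can never overflow and operation stays lossless.

```python
# A minimal sketch of per-queue credit flow control.
class CreditedQueue:
    """One output queue in the shared memory, seen from an upstream input."""
    def __init__(self, credits):
        self.credits = credits          # free space, in word-sized credits
        self.words = []

    def write(self, word):
        """Spend a credit; the sender must hold the word if none is left."""
        if self.credits == 0:
            return False                # lossless back-pressure, no drop
        self.credits -= 1
        self.words.append(word)
        return True

    def read(self):
        """Return a credit to the sender when a word leaves the memory."""
        self.credits += 1
        return self.words.pop(0)
```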

    The memory switch is the most efficient of the known switch architectures because it optimizes

    the utilization of both the links and the memory space. All links are guaranteed to run at 100% of their

capacity, while memory space can be flexibly shared among queues; see [29] for a range of possible sharing schemes. Unfortunately, the memory switch is non-scalable because as the number and/or the rate of links increases, the access rate of the central memory increases proportionally. As we shall see in chapter 3, current memory and link technology constrain the size of a state-of-the-art memory switch to only eight ports.

    2.3 Space Switches

Space switches are more scalable because they use parallel paths in space. Popular examples are the crossbar,

    the mesh, and the Benes switch.

A crossbar switch (Fig. 2.3) is a set of N input lines, N output lines, and N² programmable crosspoints between them. Let us denote by x_{i,j} the crosspoint between input i and output j. For input i to connect to output j, crosspoint x_{i,j} closes, while every other crosspoint x_{k,j}, k ≠ i, opens to avoid shorts. Finally, the input and the output lines connect to same-rate input and output links. Thus, the crossbar is internally non-blocking.
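The crosspoint discipline above can be illustrated with a minimal Python sketch (names assumed): a unicast configuration closes at most one crosspoint per output column, so two inputs are never shorted onto the same output line.

```python
# A minimal sketch of crossbar configuration from a unicast match.
def configure(match, n):
    """match: {output j: input i}. Returns the n x n crosspoint matrix."""
    x = [[False] * n for _ in range(n)]
    for j, i in match.items():
        x[i][j] = True                  # close x[i][j]
    # Sanity check: every output column has at most one closed crosspoint.
    assert all(sum(row[j] for row in x) <= 1 for j in range(n))
    return x

# Example on a 4x4 crossbar: input 2 -> output 0, input 1 -> output 3.
for row in configure({0: 2, 3: 1}, 4):
    print(row)
```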

A mesh switch (Fig. 2.4) uses one 5×5 crossbar for each pair of input-output links, and connects the crossbars in a √N × √N grid. In this way, it reduces the complexity of the crossbar from O(N²) down to O(N). However, this comes at the cost of internal blocking. In particular, the bisection of the mesh is O(√N) wide, that is, narrower than the O(N) connections. Furthermore, unlike the crossbar, where there is a dedicated path for each input-output pair, in the mesh each input may connect to any of the outputs through multiple alternative paths, intersecting with other paths connecting other pairs. For details on the mesh refer e.g. to [30].

[Figure 2.4: (a) A 9×9 mesh switch and (b) alternative routes for an example connection.]

A 4×4 Benes switch (Fig. 2.5) comprises two back-to-back connected 4×4 Banyan networks of 2×2 switching elements. Larger Benes switches are constructed recursively. An N×N Benes comprises two N/2 × N/2 Benes sub-switches, sandwiched by N additional elements. Thus, the cost of the Benes is O(N log N). Furthermore, the Benes is non-blocking, like the crossbar. The intuition behind its non-blocking property is that the Benes has more states than all possible permutations of external connections. In particular, it uses 2 log N − 1 stages of N/2 2×2 switching elements, thus providing 2^{(2 log N − 1)N/2} states, which are more than the N! permutations of external connections [31]. Finally, like the mesh, the Benes is a multipath switch. For details on the Benes refer e.g. to [32].
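The counting argument can be checked numerically; the following minimal Python sketch (assuming N a power of two and log base 2) compares the number of switch states against the N! permutations.

```python
# A quick numeric check of the counting argument: the Benes has more
# internal states than there are permutations of its external connections.
from math import factorial, log2

for n in (4, 8, 16, 32):
    stages = 2 * int(log2(n)) - 1          # 2 log N - 1 stages
    states = 2 ** (stages * n // 2)        # N/2 two-state elements per stage
    assert states > factorial(n)
    print(f"N={n}: 2^{stages * n // 2} states > N! = {factorial(n)}")
```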

    There are a number of other known space switches. For example, the mesh can be extended to

three or more dimensions. Moreover, the Benes can be folded into a fat tree. Finally, the Clos switch

    is a Benes switch using higher radix switching elements. For details on these switches, refer e.g. to

    [33][34][35].

    Like time switches, space switches hold contending traffic in queues. However, space switches

place these queues at the inputs. Thus, memory throughput is independent of N. Owing to this feature, space switches can scale to hundreds or even thousands of ports, as we show in chapter 3. In

    the simplest case, each input memory contains one FIFO queue. The problem then is HoL blocking.

While at the head of the queue, traffic destined to a congested output blocks other, irrelevant

    traffic behind it. When traffic is uniformly destined, this is known to reduce the throughput of the

    switch below 60% [6]. Under more stressed conditions, performance degrades even further [6]. The

    solution to HoL blocking is to change the organization of memories. In particular, by separating traffic

    per switch output, HoL blocking is eliminated [6]. This approach is known as Virtual Output Queueing

    (VOQ) because although queues are per output, they are physically located at the inputs. Other less

expensive approaches, such as Virtual Channels (section 2.6), have also been proposed. Finally, the allocation of the space of each memory to upstream nodes can be controlled by a credit protocol, in analogy to time switches.

[Figure 2.5: (a) A 4×4 and (b) 8×8 Benes switch, and (c) alternative routes for an example connection.]
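The sub-60% throughput figure for FIFO input queueing can be reproduced with a minimal Monte-Carlo sketch in Python (all names and parameters are assumed): saturated input FIFOs, uniformly random head-of-line destinations, and one packet served per output per slot. For large N, the measured per-port throughput approaches the classic 2 − √2 ≈ 0.586 bound.

```python
# A minimal Monte-Carlo sketch of the HoL-blocking saturation model.
import random

def hol_saturation_throughput(n=32, slots=20_000):
    heads = [random.randrange(n) for _ in range(n)]  # HoL destinations
    served = 0
    for _ in range(slots):
        granted = set()
        for i in random.sample(range(n), n):         # random service order
            if heads[i] not in granted:
                granted.add(heads[i])                # output heads[i] serves i
                heads[i] = random.randrange(n)       # next packet reaches the head
        served += len(granted)
    return served / (slots * n)

print(f"saturation throughput ~ {hol_saturation_throughput():.3f}")   # about 0.59
```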

Fig. 2.6 compares the operation of space and time switches using space-time diagrams. We assume a 3×3 switch, where traffic at inputs 0 and 1 is destined to output 0, and traffic at input 2 is destined to output 1. In the time switch (Fig. 2.6(a)), inputs assemble 3-packet frames, which are multiplexed in time on the write bus, are written in memory per output, and are finally demultiplexed on the read bus towards the outputs. Observe in Fig. 2.6(a) that there is a minimum latency of three packet times for the assembly of frames, and the backlog for output 0 accumulates inside the memory, neither at the inputs nor at the outputs. On the other hand, in the space switch (Fig. 2.6(b)), input memories are three times narrower and packets are multiplexed towards the outputs in space. Observe in Fig. 2.6(b) that there is a minimum latency of one packet time, corresponding to scheduling, and backlog accumulates solely at the inputs.

Unfortunately, the problem with space switches is coordination. Coordination concerns (i) scheduling of which input is connected to which output, and (ii) routing connections between the matched inputs and outputs. We overview these problems in the next two sections, 2.4 and 2.5. Notice that


routing in crossbars is trivial because each input-output pair has a private path. This is a basic reason crossbars are so popular.

[Figure 2.6: Space-time diagram of packet forwardings in a 3×3 (a) time switch and (b) space switch. Traffic at input 0 and input 1 is destined to output 0, and traffic at input 2 is destined to output 1. In (a), inputs assemble 3-packet frames, which are multiplexed in time on the write bus, are written in memory per output, and are finally demultiplexed on the read bus towards the outputs. In (b), input memories are three times narrower and packets are multiplexed towards the outputs in space.]


[Figure 2.7: Unfairness of maximum matchings. To set up the connection from input 1 to output 0, a non-maximum match is needed.]

    2.4 Scheduling

Switch scheduling is a special application of bipartite graph matching [36]. The vertices of the graph are the inputs and the outputs of the switch, and the edges are the desired connections. Although maximum size matching algorithms maximize the throughput of the switch, they are impractical for two reasons. First, they are too slow to implement in fast hardware [37]. Second, they are inherently unfair: in the example of Fig. 2.7, a maximum size matching algorithm would starve the connection from input 1 to output 0 [37]. Thus, heuristics are being used in practice.

Let us first consider the simplest case, that is, the scheduler for a switch with a single FIFO per

    input. Such a scheduler runs in the following two steps:

    Step 1: Request. Each input sends a request to the destination output of its head word.

    Step 2: Grant. If an output receives any requests, it chooses the one to grant (e.g. randomly), and

    notifies each input whether its request was granted.

    After the above two steps have been executed, a bipartite match has been found. The runs of the

    scheduler are usually pipelined with the configuration of the switch and the forwardings of the words.

    Thus, the running time of the scheduler quantizes the external traffic into fixed size internal units,

    often called packets or cells. Notice that to sustain the line rate, the internal packets have to be at least

    as small as the minimum external packets.
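The two steps above map directly to a few lines of Python; in this minimal sketch (names assumed, grant choice random as one of the options mentioned), the match is returned as an input-to-output dictionary.

```python
# A minimal sketch of the two-step FIFO scheduler.
import random

def fifo_schedule(head_dest):
    """head_dest[i]: destination of input i's head word (None if empty).
    Returns a bipartite match as {input: output}."""
    # Step 1: Request - each input requests its head word's output.
    requests = {}
    for i, j in enumerate(head_dest):
        if j is not None:
            requests.setdefault(j, []).append(i)
    # Step 2: Grant - each requested output grants one input, e.g. randomly.
    return {random.choice(inputs): j for j, inputs in requests.items()}

# Inputs 0 and 2 contend for output 2; input 1 requests output 0; input 3 is idle.
print(fifo_schedule([2, 0, 2, None]))
```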

Schedulers for VOQ switches are more complicated because in such switches each input may wish to connect to more than one output. Thus, input contention is added to output contention. Below we review four popular and representative scheduling algorithms: PIM [38], iSLIP [37], DRRM [39], and the Wavefront Scheduler [40].


[Figure 2.8: Example run of the PIM scheduling algorithm.]

PIM Parallel Iterative Matching [38] runs in the following three steps:

    Step 1: Request. Each input sends a request to every output for which it has at least one packet.

    Step 2: Grant. If an output receives any requests, it chooses randomly the one to grant, and

    notifies each input whether its request was granted.

    Step 3: Accept. If an input receives any grants, it chooses randomly the one to accept.

    After the above three steps have been executed, a bipartite match has been found (Fig. 2.8). Moreover,

    the above steps may be iterated between the unmatched inputs and outputs to increase the size of

the match. As proved in the original PIM paper [38], log N iterations converge to a maximal match. However, the problem with this algorithm is that (i) it needs random-number generators, which are tricky to implement in fast hardware, and (ii) it is unfair under asymmetric traffic [41], as illustrated

    in Fig. 2.9. In Fig. 2.9(a), each flow would ideally receive 1/2 of link bandwidth, but in reality, the

    algorithm tends to discriminate against inputs that have contention [41]. Fig. 2.9(b) shows a second,

    analogous scenario.
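A minimal Python sketch of PIM (names assumed) follows; it iterates the request/grant/accept steps over the unmatched inputs and outputs, using random choices exactly where the algorithm does.

```python
# A minimal sketch of PIM with a configurable number of iterations.
import random

def pim(R, iterations):
    """R[i][j]: input i has at least one packet for output j."""
    n = len(R)
    match, free_in, free_out = {}, set(range(n)), set(range(n))
    for _ in range(iterations):
        # Steps 1-2: Request / Grant - each free output grants one
        # requesting free input, chosen at random.
        grants = {}
        for j in free_out:
            reqs = [i for i in free_in if R[i][j]]
            if reqs:
                grants.setdefault(random.choice(reqs), []).append(j)
        # Step 3: Accept - each input accepts one of its grants at random.
        for i, outs in grants.items():
            j = random.choice(outs)
            match[i] = j
            free_in.discard(i)
            free_out.discard(j)
    return match

random.seed(0)
R = [[random.random() < 0.5 for _ in range(4)] for _ in range(4)]
print(pim(R, iterations=2))
```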

    iSLIP Iterative SLIP [37] overcomes the problems of PIM by resolving contention round-robin. It runs

    in the following four steps:

    Step 1: Request. Each input sends a request to every output for which it has at least one packet.

    Step 2: Grant. If an output receives any requests, it decides round-robin which one to grant, and

    communicates back to each input whether its request was granted.

    Step 3: Accept. If an input receives any grants, it decides round-robin which one to accept, and

    communicates back to each output whether its grant was accepted.


Step 4: Slip. If an input accepts any output, it increments (modulo N) its round-robin pointer to one location beyond that output. If an output is accepted by the input it granted, it increments (modulo N) its round-robin pointer to one location beyond that input.

[Figure 2.9: PIM is unfair when input load is asymmetric; (a) and (b) are two examples. Inputs request the full rate from a set of outputs; fractions denote the rates allocated by PIM.]

    After the first three steps have been executed, a bipartite match has been found. The fourth step

    ensures that subsequent runs of the algorithm will give fair and often maximal matches. Because each

output keeps granting the same input until accepted, and because inputs arbitrate round-robin, any output eventually gets accepted in at most N runs of the algorithm. As a consequence, it is guaranteed that any request results in a match in at most N² runs [37]. Finally, by insisting on their grants, the

    outputs tend to slip (desynchronize), speeding up convergence to maximal matches.
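The following minimal Python sketch of a single-iteration iSLIP run (names assumed) replaces PIM's random choices with the round-robin grant and accept pointers, and applies the Step 4 pointer updates only to accepted grants.

```python
# A minimal sketch of one single-iteration iSLIP run.
def islip(R, g_ptr, a_ptr):
    """R[i][j]: input i requests output j. g_ptr[j] / a_ptr[i] are the
    per-output grant and per-input accept pointers (mutated in place)."""
    n = len(R)
    rr = lambda ptr, cands: min(cands, key=lambda x: (x - ptr) % n)
    # Step 2: Grant - each requested output grants the requesting input
    # closest (round-robin) to its grant pointer.
    grants = {}
    for j in range(n):
        reqs = [i for i in range(n) if R[i][j]]
        if reqs:
            grants.setdefault(rr(g_ptr[j], reqs), []).append(j)
    # Step 3: Accept - each granted input accepts one output round-robin.
    # Step 4: Slip - pointers move only past accepted grants.
    match = {}
    for i, outs in grants.items():
        j = rr(a_ptr[i], outs)
        match[i] = j
        a_ptr[i] = (j + 1) % n
        g_ptr[j] = (i + 1) % n
    return match

n = 4
R = [[False] * n for _ in range(n)]
R[0][2] = R[1][2] = R[1][0] = True     # inputs 0 and 1 contend for output 2
g_ptr, a_ptr = [0] * n, [0] * n
print(islip(R, g_ptr, a_ptr))          # {0: 2, 1: 0}
```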

DRRM Dual Round Robin Matching [39] runs in the following three steps:

    Step 1: Request. Each input selects round-robin which output to send a request to.

    Step 2: Grant. If an output receives any requests, it decides round-robin which one to grant, and

    communicates back to each input whether or not its request was granted.

Step 3: Slip. If an input is granted by the output it requested, it increments (modulo N) its round-

    robin pointer to one location beyond that output. If an output grants any input, it increments

    (modulo N) its round-robin pointer to one location beyond that input.


[Figure 2.10: Example run of the Wavefront Scheduler.]

    Compared to iSLIP, DRRM saves one step. Thus, a bipartite match is found in the first two steps. The

    third step ensures that subsequent runs of the algorithm will give fair and often maximal matches, in

analogy to the fourth step of iSLIP. Unfortunately, DRRM introduces HoL blocking [42]: if the output that an input insists on is congested, flows from this input to non-congested outputs are blocked [42].
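A minimal Python sketch of one DRRM run (names assumed): each input first picks a single output round-robin among its backlogged VOQs, each output grants round-robin among the requests it received, and pointers slip only on a successful grant.

```python
# A minimal sketch of one DRRM run.
def drrm(R, r_ptr, g_ptr):
    """R[i][j]: input i has traffic for output j. r_ptr[i] / g_ptr[j] are
    the per-input request and per-output grant pointers (mutated in place)."""
    n = len(R)
    rr = lambda ptr, cands: min(cands, key=lambda x: (x - ptr) % n)
    # Step 1: Request - each input sends a single request, chosen
    # round-robin among its backlogged VOQs.
    requests = {}
    for i in range(n):
        voqs = [j for j in range(n) if R[i][j]]
        if voqs:
            requests.setdefault(rr(r_ptr[i], voqs), []).append(i)
    # Step 2: Grant, and Step 3: Slip - pointers move only on a grant.
    match = {}
    for j, ins in requests.items():
        i = rr(g_ptr[j], ins)
        match[i] = j
        r_ptr[i] = (j + 1) % n
        g_ptr[j] = (i + 1) % n
    return match
```

Note how saving the accept step also removes the choice that lets iSLIP steer around a congested output, which is the source of the HoL blocking described above.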

Wavefront Scheduler Instead of making selections locally at N inputs and N outputs, the Wavefront Scheduler [40] operates globally on a square matrix of N² flows, where flows are prioritized by a diagonal wavefront. We show an example run of this scheduler in Fig. 2.10. In the upper part, we show the request matrix and the match eventually computed by the scheduler. In particular, the Wavefront Scheduler uses a request matrix, where flow (i,j) from input i to output j corresponds to entry (i,j), and entry (i,j) is set if and only if flow (i,j) is backlogged. The scheduler runs on the request matrix as we show in the bottom part of Fig. 2.10. Initially, each flow in the first row (row 0) is given a vertical token, and each flow in the first column (column 0) is given a horizontal token.

[Figure 2.11: Priorities of flows during a period of runs of the Wavefront Scheduler in an example 3×3 switch. Top priority is shifted to a different flow from run to run: flow (0,0) is given 6 times higher priority over flow (1,0), while flow (1,0) is given only 3 times higher priority over flow (0,0).]

A flow's request is granted if and

    only if that flow grabs both the vertical and the horizontal token corresponding to its column and its

row, respectively. Thus, in the first step, flow (0,0) has the top priority, flow (0,0)'s request is granted, and flow (0,0) stops the propagation of both its vertical and horizontal token, to prevent its same-input and same-output flows from being granted in subsequent steps. In the second step, flows (0,1) and (1,0) are likely to have the top priority. However, flow (0,1) misses the horizontal token, hence it is not granted. Thus, it propagates its tokens unchanged to its neighbors in both directions. In parallel, flow (1,0) also propagates its tokens unchanged because it is idle. Scheduling proceeds similarly, and a match is computed in 2N − 1 steps. Notice that in step i, i flows are likely to have top priority, but these flows are always non-conflicting. In the next scheduling decision, the initial top priority flow

    changes, as in Fig. 2.11, to provide a degree of fairness. Observe in Fig. 2.11 that flow (0,0) is given six

    times higher priority over flow (1,0), while flow (1,0) is given only three times higher priority over flow

    (0,0). Thus, the Wavefront Scheduler is inherently unfair.
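The token mechanism can be captured in a minimal Python sketch (names assumed; for brevity it sweeps N wrapped diagonals rather than the 2N − 1 plain ones, which preserves the token semantics): a flow is granted when it holds both its row and column token, and a granted flow stops both tokens.

```python
# A minimal sketch of the wrapped-diagonal token sweep.
def wavefront(R, top):
    """R[i][j]: flow (i,j) is backlogged. top: the top-priority diagonal,
    rotated between runs to provide a degree of fairness."""
    n = len(R)
    row_tok = [True] * n                 # horizontal tokens, one per row
    col_tok = [True] * n                 # vertical tokens, one per column
    match = {}
    for d in range(n):                   # sweep the n wrapped diagonals
        for i in range(n):
            j = (top + d - i) % n        # cell (i,j) on the current diagonal
            if R[i][j] and row_tok[i] and col_tok[j]:
                match[i] = j
                row_tok[i] = col_tok[j] = False   # stop both tokens
    return match

R = [[True, True, False],
     [True, False, False],
     [False, False, True]]
print(wavefront(R, top=0))   # {0: 0, 2: 2}: flow (0,0) wins the top diagonal
```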

    Traditionally, scheduling algorithms are evaluated for crossbar switches, using simulation. The

    simulated performance of the above algorithms is plotted in Fig. 2.12. We assume traffic of fixed size

    packets, uniformly destined over the outputs. Moreover, packet arrivals follow a Bernoulli process,

with probability corresponding to the traffic load. In Fig. 2.12(a), we plot average packet delay as a function of input load. We observe that PIM with one or two iterations is unable to sustain a load greater than 0.6 or 0.9, respectively. For larger loads, the VOQs keep growing in time, and the switch becomes unstable. On the other hand, iSLIP is stable for all loads, even using a single iteration. Furthermore, the Wavefront Scheduler performs significantly better than both iSLIP and PIM. However, performance converges as the number of iterations of PIM and iSLIP increases. In Fig. 2.12(b), we plot the standard deviation of delay, which serves as a metric for fairness. There, the Wavefront Scheduler performs

worse than iSLIP, while PIM falls between iSLIP and the Wavefront Scheduler. (In Fig. 2.12, the standard deviation of PIM delay is omitted for clarity.)

[Figure 2.12: Simulated (a) average delay and (b) standard deviation of delay of PIM, iSLIP, and the Wavefront Scheduler (WFS), as a function of input load; N = 128.]

    Finally, several other approaches to switch scheduling have been proposed. Marsan et al. [43]

    proposed extensions of the above algorithms for operation on variable length packets. Kar et al. [44]

    proposed a scheme that merges multiple external packets into large internal envelopes to relax the

running time of the scheduler. Kam and Siu [45] proposed weighted matching to provide

    Quality of Service (QoS) guarantees. Ahuja et al. [46] studied multicast scheduling.

    2.5 Routing

    In multipath space switches, an input-output pair can be connected through multiple alternative

    paths, intersecting with other paths connecting other pairs. Thus, once scheduling has resolved input

    and output contention, routing is needed to resolve contention for internal paths. Like scheduling,

    routing algorithms are heuristics, and they can be classified into the following three categories [47]:

Deterministic Routing algorithms make deterministic path selections. For example, in the Benes switch (Fig. 2.5), each input selects round-robin one of the available paths for each output, and in case of conflict with other inputs, a selection is made, e.g. again round-robin. Deterministic routing algorithms exploit path diversity ineffectively, thus suffering performance issues [47].

    Randomized Routingalgorithms randomly select an intermediate node, and then make a deter-

    ministic selection between the available paths. Randomized routing algorithms perform well

    under non-uniform traffic, but their performance degrades under uniform traffic [47].

    Adaptive Routingalgorithms aim at combining the merits of deterministic and randomized al-

    gorithms by using network state to select among paths. However, practical adaptive routing

    algorithms access only local state, thus often making globally sub-optimal decisions [47].
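To make the three categories concrete, the following minimal Python sketch (ours; it abstracts routing to choosing one of the middle-stage switches of a three-stage network) contrasts the three selection rules:

    import random

    class DeterministicRouter:
        """Round-robin over the middle-stage switches; state-oblivious."""
        def __init__(self, num_middle):
            self.num_middle = num_middle
            self.next_choice = 0

        def choose(self, src, dst):
            m = self.next_choice
            self.next_choice = (self.next_choice + 1) % self.num_middle
            return m

    class RandomizedRouter:
        """Pick a random intermediate switch, independent of traffic."""
        def __init__(self, num_middle, seed=0):
            self.num_middle = num_middle
            self.rng = random.Random(seed)

        def choose(self, src, dst):
            return self.rng.randrange(self.num_middle)

    class AdaptiveRouter:
        """Pick the least-loaded middle switch, using local occupancy
        counters only; hence possibly globally sub-optimal."""
        def __init__(self, num_middle):
            self.occupancy = [0] * num_middle

        def choose(self, src, dst):
            m = min(range(self.num_middle), key=self.occupancy.__getitem__)
            self.occupancy[m] += 1
            return m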

    2.6 Virtual Channels

Scheduling algorithms are widely used to coordinate crossbars. However, multipath switches are harder to coordinate because they need routing in addition to scheduling. Thus, multipath switches are only implicitly coordinated, using memories inside the switching elements. An example queued mesh is shown in Fig. 2.13, where we consider a scenario in which inputs B and I contend for output G; we assume dimension-ordered routing, and we show only the queues related to our scenario. Thus, packets from B follow the path B-C-G, and packets from I the path I-J-K-G. As memory space is finite, congestion starts from G and spreads backwards. Under credit flow control, this is realized by the fact that credits are consumed at the upstream nodes at a rate greater than the rate at which they are released by their downstream neighbors. In this way, rates are implicitly regulated.
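The credit mechanism itself fits in a few lines; below is a minimal, single-hop sketch (ours; the class and method names are hypothetical):

    from collections import deque

    class CreditLink:
        """One hop of credit-based flow control: the upstream node holds
        one credit per free downstream buffer slot and may send only
        while credits remain."""
        def __init__(self, buffer_slots):
            self.credits = buffer_slots
            self.queue = deque()

        def try_send(self, packet):
            if self.credits == 0:
                return False           # no credit: upstream stalls
            self.credits -= 1          # consume a credit on send
            self.queue.append(packet)  # packet occupies a downstream slot
            return True

        def downstream_dequeue(self):
            if not self.queue:
                return None
            self.credits += 1          # slot freed: release a credit
            return self.queue.popleft()

When the downstream node drains slower than the upstream node sends, try_send starts returning False and the stall propagates one hop further back; this is precisely the backward spreading of congestion from G described above.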


[Figure: a 3×4 array of nodes, A-D over E-H over I-L; output G is marked hot, output D idle.]

Figure 2.13: Example contention situation in a 12×12 queued mesh. Flows from B and I are contending for G, blocking a flow from I destined to an idle output D. Dimension-ordered routing is assumed.

Queues also resolve contention for internal paths. For example, a route from A to D, sharing the path segment B-C with the route from B to G, is throttled similarly.

A hard problem in queued networks is HoL blocking. In particular, in Fig. 2.13, a third connection, from I to D, is blocked behind the connection from I to G even when its destination D is idle. Unfortunately, it is infeasible to resolve this problem using VOQ: while it is affordable to implement O(N) queues at each input of the switch, per-flow queueing inside the switching elements requires O(N²) queues per switching element (for N = 128, that is 16384 queues in each element). Thus, in practice, heuristics are employed.

An example of such a heuristic is illustrated in Fig. 2.14(a), where the connection from I to D escapes blocking using a second network. Another solution is the popular Virtual Channels [30]. The main idea is that instead of duplicating the whole network, only the queues are duplicated, as in Fig. 2.14(b), while the links are shared; hence the name Virtual Channels. The comparison between network duplication and Virtual Channels is technology dependent, determined by the relative cost of links and memories.

    2.7 High Radix

A great limiting factor in today's chips is power consumption. A high-speed serial off-chip transceiver implemented as a differential pair at 3.125 Gbaud consumes on the order of 150 mW [48]. Thus, for a state-of-the-art port rate of 10 Gb/s, a few hundred ports cost on the order of a few tens of Watts. The only way to scale up to larger fabrics is by connecting switch chips in multistage topologies, like the Benes.
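The arithmetic behind this estimate is worth spelling out; the sketch below is our back-of-the-envelope calculation, and the four-lane port is our assumption (⌈10/3.125⌉ = 4 lanes per 10 Gb/s port):

    import math

    LANE_RATE_GBPS = 3.125   # one differential pair [48]
    LANE_POWER_W = 0.150     # per pair, order of magnitude from [48]
    PORT_RATE_GBPS = 10.0

    lanes_per_port = math.ceil(PORT_RATE_GBPS / LANE_RATE_GBPS)  # 4 lanes
    port_power_w = lanes_per_port * LANE_POWER_W                 # 0.6 W/port

    for ports in (64, 128, 256):
        print(f"{ports:4d} ports: {ports * port_power_w:6.1f} W of I/O power")
    # -> 38.4 W, 76.8 W, 153.6 W: transceiver power alone reaches tens of
    #    Watts well before a few hundred ports.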




Figure 2.14: Reducing blocking using (a) a duplicated network and (b) two Virtual Channels.

Kim et al. [15] showed that it is more effective to implement switching elements with a large number of slow ports rather than a small number of fast ports, because higher-radix switch chips enable lower-diameter fabrics. Their argument is illustrated in Fig. 2.15. Consider a radix-4 Benes switch, and suppose that advances in signaling technology double the IO rate that is feasible on a single chip. Then, there are two options to benefit. First, one can keep the radix of the switching elements and double the rate of their ports; this doubles the rate of the end ports of the fabric. Second, one can keep the port rate of the switching elements and double their radix; this merges multiple switching elements on a single chip, converting the Benes topology to a Clos topology. Notice that the end port rate doubles by arranging switching elements in parallel. (We assume that the chips at the end ports implementing demultiplexing and multiplexing introduce negligible overhead.) Thus, comparing the two options, the second one is better because it reduces hop count and chip count. Lower hop count translates to lower latency and power consumption, and lower chip count to lower cost [15].
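The effect of radix on fabric diameter can be quantified with a common approximation (ours, not a formula from [15]): a folded-Clos fabric built from radix-k chips, with half the ports facing the end nodes, spans N end ports in about 2⌈log_{k/2} N⌉ − 1 switch hops. A small sketch:

    def clos_stages(num_ports, radix):
        """Approximate hop count through a folded Clos built from
        radix-`radix` chips (radix/2 ports down, radix/2 up)."""
        down = radix // 2
        levels, reach = 1, down
        while reach < num_ports:   # grow the tree until it spans all ports
            reach *= down
            levels += 1
        return 2 * levels - 1

    N = 4096
    for radix in (4, 8, 16, 32, 64, 128):
        print(f"radix {radix:3d}: {clos_stages(N, radix):2d} hops for {N} ports")
    # Each doubling of the radix shrinks the hop count, and with it the
    # latency, power, and chip count of the fabric.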



    Figure 2.15: High radix switching elements exploit increases in chip IO throughput more effectively

    by reducing the diameter of the switch fabric [15].

    2.8 Summary

This chapter described basic concepts in switch design. An interconnection switch provides scalable and distributed communication between the parties of a digital system. Time switches, like the memory switch, implement communication by switching data through a single point in space at different moments in time. While time switches optimize resource utilization, they are not scalable. Space switches, like the crossbar and the Benes switch, are more scalable, as they provide parallel paths in space. However, space switches need scheduling and routing to resolve contention for these paths. Compared to other space switches, the crossbar has the advantage of simplifying routing by providing a private path for each input-output pair. However, switch scheduling is a hard problem on its own. Of the scheduling algorithms known in the literature, iSLIP is the most efficient, providing both high throughput and good fairness properties. Scheduling and routing can be simplified by using queues internal to the switching elements. Finally, high-radix switching elements are the right tradeoff in modern technology, because they reduce the diameter of the switch fabric.


    Chapter 3

    A Comparison of Architectures

    for High-Radix Switches

    In this chapter, we first overview basic switch architectures, like shared memory, block crosspoint

    queueing, output queueing, crosspoint queueing, and input queueing (section 3.1). We then com-

    pare the performance of the above architectures (section 3.2) and their feasibility using on-chip SRAM

    technology (section 3.3). We show that only the input queued crossbar scales to high radices, by min-

    imizing both the individual and the aggregate throughput of memories. Because the traditional in-

    put queued crossbar is difficult to schedule at high radices, we consider the combined input-output

    queued crossbar, which compensates for scheduling inefficiencies by moderately overprovisioning

the crossbar and the memories (section 3.4). Moreover, we compare the combined input-output queued crossbar to the hierarchically queued crossbar, recently proposed for high-radix switches, and we show that the combined input-output queued crossbar is advantageous because it gives better performance using only a moderate speedup on the crossbar and the memories, independent of radix (section 3.5). Finally, we discuss related work (section 3.6) and conclude (section 3.7).

    3.1 Basic Switch Architectures

We first overview four basic time switch architectures: (i) Crosspoint Queueing (XQ), (ii) Output Queueing (OQ), (iii) Block Crosspoint Queueing (BXQ), and (iv) Shared Memory (SM). Next, we compare these architectures to the Input Queued (IQ) space switch.

Crosspoint Queueing (XQ) The XQ architecture (Fig. 3.1(a)) switches packets from N inputs to N outputs using N² memories. Each input selects which memory to write its head packet to according to the destination output of that packet, and each output selects which memory to read a packet from according to a predetermined policy, e.g. weighted round-robin.



Figure 3.1: Time switch architectures. (a) Crosspoint Queueing (XQ). (b) Output Queueing (OQ). (c) Block Crosspoint Queueing (BXQ-2). (d) Shared Memory (SM).

Thus, XQ additionally uses one bus per input to route packets to the memories of that input, and N crosspoints coupled with one N:1 arbiter per output to forward packets to that output. Because all memories operate at the rate of the inputs and outputs, XQ can also be considered a degeneration of a time switch to a space switch. Summarizing, XQ uses N² memories, each with a throughput of 2R, for an aggregate of 2N²R. XQ designs were proposed by Abel et al. [13] and by Katevenis et al. [14].

Output Queueing (OQ) Compared to XQ, the OQ architecture (Fig. 3.1(b)) allows better sharing of memory space by merging the memories of each output into a single memory, dedicated to that output. The write access rate of that memory is NR, to resolve any contention between the inputs for the output. Thus, the crosspoints and arbiters of XQ are removed. Moreover, a single FIFO queue per memory suffices to provide both high output utilization and a degree of fairness. In summary, OQ uses N memories, each with a throughput of (N+1)R, for an aggregate of (N+1)NR. OQ designs were proposed by Yeh et al. [49].

Block Crosspoint Queueing (BXQ) The BXQ architecture (Fig. 3.1(c)) results from XQ by merging the k² memories between k distinct inputs and k distinct outputs into a single memory block. The write


access rate of that memory is kR, to resolve any contention between the k inputs for the k outputs, and the read access rate is also kR, to fully utilize the k outputs. Each memory block must also implement at least one FIFO per local output, to remove HoL blocking. Thus, the inputs of the same memory block share the space of that block, in analogy to OQ. The space of the block can also be shared between its outputs. However, this type of sharing increases complexity by requiring queues to be implemented as linked lists. In contrast, to implement FIFO queues, simple circular arrays suffice. Finally, BXQ uses N/k crosspoints coupled with one N/k:1 arbiter per output to multiplex the memory blocks of that output. Thus, BXQ is a combination of a time and a space switch. Summarizing, BXQ-k uses (N/k)² memories, each with a throughput of 2kR, for an aggregate of 2N²R/k.

Shared Memory (SM) By varying the parameter k of BXQ from 1 to N, we get intermediate solutions from complete partitioning (XQ) to complete sharing; the latter solution is widely known as shared memory (SM, Fig. 3.1(d)). Thus, SM uses a single memory with a throughput of 2NR. SM designs were proposed by Devault et al. [22], by Katevenis et al. [23], and by Kozaki et al. [24]. (We described SM in more detail in chapter 2.)

Now let us compare the above time switches to the Input Queued (IQ) space switch (Chapter 2). An IQ switch uses N memories, like OQ, but it places these memories at the inputs instead of the outputs. Thus, the physical partitioning of memories in IQ is analogous to OQ. However, while in OQ sharing is between the inputs across the outputs, in IQ sharing is between the outputs across the inputs. As a consequence, like BXQ and SM, IQ also needs linked lists to implement sharing. Moreover, the aggregate memory throughput of IQ is 2NR, equal to that of SM. Finally, while in time switches rates are allocated through memory space, IQ strongly depends on the efficiency of scheduling. While IQ can use any space switch, in the rest of this thesis we will consider that IQ uses a crossbar, to simplify routing.

Fig. 3.2 shows a conceptual derivation of the above architectures through alternative groupings of N² memory blocks. Observe that the throughput of each memory is proportional to the periphery of the rectangle enclosing the blocks, while space is proportional to the area of that rectangle. Fig. 3.2, as well as the observation on throughput geometry, is copied from the transparencies of the Packet Switch Architecture class at the University of Crete [18]. Also copied from there is the metric of aggregate memory throughput. However, in section 3.3, we contribute a practical application of that metric. In particular, we show that the minimum total memory area to implement a switch architecture is proportional to the aggregate memory throughput of that architecture. Thus, architectures like XQ are costly.


[Figure: the N×N grid of crosspoint memory blocks, grouped per crosspoint (XQ), per block (BXQ-4), per output (OQ), per input (IQ), and all together (SM).]

    Figure 3.2: Derivation of switch architectures.

    3.2 Memory-Sharing Merits

Switches allowing better sharing of memory space can improve performance under a broad range of traffic conditions by allocating memory space on demand, thus virtually increasing memory space. Equivalently, the more the sharing, the less memory space is needed to achieve a fixed level of performance. We first examine the effect of memory sharing on performance by comparing through simulation the time switch architectures described in the previous section 3.1. Next, we study a second type of memory sharing, which we call queue sharing.

In order to quantify the effect of memory space sharing on performance, we evaluate the rate of packet losses as a function of memory space under Bernoulli traffic of fixed-size, uniformly destined packets. In this approach, the better the sharing of memory space, the less memory space is needed to achieve a fixed packet loss rate [26]. While real traffic patterns may be considerably more stressful, including bursts and hot spots, the results described in this section are fundamental, and likely to be found within more complicated scenarios as well. Packet loss rates are plotted in Fig. 3.3 for a range of link loads. First, we observe that packet loss rates increase with load in all architectures, as contention for switch outputs increases correspondingly, and more packets have to be queued. Second, we observe that architectures allowing better sharing of memory space require a smaller memory space to achieve a given packet loss rate. Thus, XQ has the worst performance and SM the best, while OQ falls in between, and BXQ is better or worse than OQ depending on block size. Finally, at high loads XQ requires about 5× larger memory space to achieve the performance of SM.
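A toy version of this experiment is easy to reproduce. The sketch below is ours; it models only the two extremes, per-output partitioning (as in OQ) versus one shared pool (as in SM), at a much smaller scale than the thesis simulator, yet it exhibits the same qualitative gap:

    import random

    def loss_rate(n, slots, load, shared, space_per_output, seed=0):
        """Drop rate of an n x n switch: each input receives a packet with
        probability `load` per slot, destined uniformly over the outputs;
        each output drains one packet per slot.  With shared=True all
        outputs draw on a common pool of n*space_per_output packets;
        otherwise each output owns space_per_output packets."""
        rng = random.Random(seed)
        occ = [0] * n
        dropped = offered = 0
        for _ in range(slots):
            for _inp in range(n):
                if rng.random() < load:
                    offered += 1
                    out = rng.randrange(n)
                    if shared:
                        room = sum(occ) < n * space_per_output
                    else:
                        room = occ[out] < space_per_output
                    if room:
                        occ[out] += 1
                    else:
                        dropped += 1
            for out in range(n):          # one departure per output
                if occ[out]:
                    occ[out] -= 1
        return dropped / max(offered, 1)

    for shared in (False, True):
        r = loss_rate(n=8, slots=200000, load=0.9, shared=shared,
                      space_per_output=12)
        print("shared pool" if shared else "partitioned", f"loss rate = {r:.1e}")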


[Plots omitted: packet loss rate (log10) versus memory space (packets), at loads 0.70, 0.80, 0.90, and 0.99; curves for XQ, BXQ-2, OQ, BXQ-4, and SM.]

Figure 3.3: Packet loss rates as a function of memory space. N = 8 and memory space is the total divided by N. BXQ and XQ use round-robin arbiters. Simulated time is 10⁶ packet times.


Figure 3.4: Queueing delay as a function of input load. N = 8 and memory space is infinite. IQ uses VOQ and 3-iteration iSLIP (3SLIP).

In Fig. 3.4, we plot queueing delay when memory space is infinite. Then, XQ, OQ, BXQ, and SM all degenerate to an M/D/1 queueing system [6]. We also plot the performance of IQ using VOQ and 3SLIP. At low loads, performance is comparable for all switches, as there is low contention, and only few packets are queued.



Figure 3.5: Queue sharing in a 3×3 switch with a total memory space of 30 queues per input. (a) In IQ, queues are flexibly allocated on demand. (b) In XQ, queues are statically partitioned per crosspoint. IQ increases performance by reducing blocking.

At high loads, delay in IQ is significantly higher, as packets are contending for both the inputs and the outputs, while in any time switch contention is for the outputs only.
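For reference, the common curve onto which the time switches collapse in Fig. 3.4 has a closed form. With load ρ and delay measured in packet times, the standard M/D/1 result (stated here in our notation) is

    T(\rho) \;=\; 1 + \frac{\rho}{2\,(1-\rho)},

i.e. delay stays near one packet time at low load and diverges as ρ → 1, matching the lower curve of Fig. 3.4.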

Finally, memory sharing affects the performance of queue sharing mechanisms in larger fabrics [30][50]. In particular, consider that the 3×3 switches of Fig. 3.5 are elements of a larger Clos fabric. Also consider that technology allows a maximum memory space of 30 queues per switch input. In IQ, queues are concentrated at the inputs; thus, there are 30 queues at each input. In XQ, queues are distributed over the crosspoints; thus, there are 10 queues at each crosspoint. Hence, when same-input flows are unevenly distributed over the outputs, queues are better utilized in IQ, and blocking is reduced. Notice that this queue sharing is a variation of the byte sharing we described above.

    3.3 On-Chip SRAM Cost

In this section, we evaluate implementation cost in on-chip SRAM technology. We consider a 90 nm CMOS process, where SRAM is available in blocks, and we decide the feasibility of a switch architecture by evaluating two metrics: (i) the total silicon area to implement all memories; and (ii) the individual memory width. State-of-the-art technology constrains the core of a chip to less than 400 mm², and smaller chips are typically less expensive [51]. Moreover, the budget for memories is a major cost. For example, Katevenis et al. [14] described a 180nm-CMOS implementation of XQ, where memory area accounted for as much as 70% of the switch core. Thus, we bound the feasible memory area to less than 200 mm². On the other hand, memory throughput expands by increasing its word width. When this becomes larger than the maximum width of the available SRAM blocks, one can arrange multiple SRAM blocks in parallel. In any case, memory width is bounded by the minimum external packet size. We will consider a maximum width of 64 Bytes, corresponding to minimum Ethernet packets [52]. Summarizing, a switch architecture is feasible when (i) total memory area is smaller than 200 mm², and (ii) individual memory width is smaller than 64 Bytes.



Figure 3.6: On-chip memory performance in 90nm CMOS. (a) Single-port SRAM block area (2-port blocks are 20% to 60% larger, and are omitted here), (b) speed as a function of block height, (c) total memory capacity fitting in 200 mm² as a function of block size, and (d) memory throughput per pin as a function of block size.

We first plot 90nm CMOS SRAM block performance in Fig. 3.6. Word width varies from 16 bits to 128 bits, and the number of words from 256 to 16K. We observe in Fig. 3.6(a) that as block size increases, block area increases as well, to accommodate more SRAM bit cells. However, area increases sub-linearly. The reason is that larger blocks are more area efficient, because their peripheral overhead (e.g. address decoders, column multiplexors, sense amplifiers) is amortized over a larger core. Furthermore, in Fig. 3.6(b), we observe that as block size increases, SRAM blocks become slower, as internal bit-line and word-line capacitances increase accordingly (the smaller, the faster [51]). In Fig. 3.6(c), we show the memory capacity we can fit in 200 mm². We observe that, except for corner cases, capacity depends mainly on the size of SRAM blocks, rather than their configuration; corner cases are wide and small blocks, where extra sense amplifiers result in disproportionate overheads. We also observe that using the largest blocks, it is feasible to implement as many as 100 Mbits of memory space.



Figure 3.7: (a) Memory throughput and space expansion, and (b) memory space expansion. A and X denote the area and the throughput of the memory, and a and x the area and throughput of an individual block, respectively.

Finally, in Fig. 3.6(d), we show the throughput of each memory block in Mb/s per pin. Again, throughput depends mainly on block size, rather than block configuration. Moreover, the maximum throughput we can get from a single SRAM block is less than 60 Gb/s, using the smallest blocks.

To build a memory of throughput X, we need ⌈X/x⌉ parallel SRAM blocks, where x is the throughput of one SRAM block (see Fig. 3.7(a)). If a is the area of one SRAM block, memory area is ⌈X/x⌉ × a. Then, if M is the number of memories, total memory area is M × ⌈X/x⌉ × a; this must be smaller than 200 mm². Notice that total memory area is proportional to aggregate memory throughput. As a consequence, aggregate memory throughput suggests a realistic cost metric. Finally, if w is the block width, the width of each memory is ⌈X/x⌉ × w. Summarizing,

    Total Memory Area = M × ⌈X/x⌉ × a,
    Memory Width = ⌈X/x⌉ × w,

where x, a, and w are the throughput, area, and width of one SRAM block, respectively.

Table 3.1 summarizes the values of X and M for each switch architecture. By substituting the SRAM cost numbers of Fig. 3.6, we get the SRAM cost of each switch architecture.


Table 3.1: Number of memories (M) and throughput per memory (X). R = 10 Gb/s.

    switch class:   XQ      OQ        BXQ-k     SM      IQ
    M               N²      N         (N/k)²    1       N
    X               2R      (N+1)R    2kR       2NR     2R
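Table 3.1 and the formulas above combine into a small feasibility calculator. The sketch below is ours, and the SRAM block figures in it are hypothetical stand-ins for points read off Fig. 3.6:

    import math

    R = 10e9  # port rate: 10 Gb/s

    def memory_cost(M, X, block_tput, block_area_mm2, block_width_bits):
        """Apply the formulas above: ceil(X/x) parallel blocks per memory."""
        blocks = math.ceil(X / block_tput)
        return M * blocks * block_area_mm2, blocks * block_width_bits

    def architectures(N, k=8):
        """(M, X) per architecture, per Table 3.1 (BXQ shown for k = 8)."""
        return {
            "XQ":  (N * N,         2 * R),
            "OQ":  (N,             (N + 1) * R),
            "BXQ": ((N // k) ** 2, 2 * k * R),
            "SM":  (1,             2 * N * R),
            "IQ":  (N,             2 * R),
        }

    # Hypothetical block: ~50 Gb/s of throughput, 0.05 mm^2, 32-bit words.
    x, a, w = 50e9, 0.05, 32

    for N in (32, 128):
        for name, (M, X) in architectures(N).items():
            area, width = memory_cost(M, X, x, a, w)
            feasible = area < 200 and width <= 64 * 8  # 64-Byte width limit
            print(f"N={N:3d} {name:4s} area={area:7.1f} mm^2 "
                  f"width={width:5d} b  {'feasible' if feasible else 'infeasible'}")

With these made-up block figures, the calculator reproduces the trends of Fig. 3.8 and 3.9: XQ fails on area, OQ and SM fail on width, while IQ stays feasible at N = 128.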



    Figure 3.8: Minimum total area to implement the memories of each switch architecture as a function

of N. Area is proportional to aggregate memory throughput. Only IQ and SM scale above 100 ports.


    Figure 3.9: Minimum width of an individual memory of each of the switch architectures. Memory

    width is proportional to memory throughput. Only IQ and XQ scale above 100 ports.

Total memory area and individual memory width are plotted in Fig. 3.8 and Fig. 3.9, respectively. We observe in Fig. 3.8 that area grows as O(N²) for XQ and OQ, and as O(N) for IQ and SM, following the aggregate throughput of memories. Thus, area is the same for both IQ and SM, well below 200 mm². On the other hand, XQ and OQ do not scale above 32 and 64 ports, respectively, and to scale to these radices they must use small SRAM blocks. Furthermore, we observe in Fig. 3.9 that memory width grows as O(N) for SM and OQ, while it remains constant, independent of N, for IQ and XQ, following the individual throughput of memories.



Figure 3.10: Minimum total area to implement the memories of BXQ-8 as a function of N. BXQ does scale above 100 ports, but only using the smallest SRAM blocks.


Figure 3.11: Total memory capacity in 200 mm² for each of the switch architectures.

Thus, for IQ and XQ, memory width is smaller than the external packets, for all N, while for SM and OQ it grows quickly, limiting these architectures to radices below 8 or 16, respectively. Combining the plots of Fig. 3.8 and 3.9, we conclude that, of IQ, XQ, OQ, and SM, only IQ scales to radices above 100.

Another scalable architecture is BXQ. We plot the total memory area for BXQ-8 in Fig. 3.10. In this plot, BXQ uses memory blocks corresponding to the largest feasible SM. We observe that BXQ does scale above 100 ports, but it can do so only using the smallest SRAM blocks. However, this has a negative impact on memory density, as plotted in Fig. 3.11. In this plot, each architecture has the density of the


maximum SRAM block it can use (When memory area is smaller than 200 mm², spare space is utilized

    t