2012.TR427 VLSI Micro-Architectures High-Radix Crossbars



    VLSI Micro-Architectures for High-Radix Crossbars

    Giorgos Passas

    Computer Architecture & VLSI Systems (CARV) Laboratory,

    Institute of Computer Science (ICS)

    Foundation of Research and Technology Hellas (FORTH)

Science and Technology Park of Crete, P.O. Box 1385, Heraklion, Crete, GR-711-10, Greece

    Technical Report FORTH-ICS/TR-427 April 2012

    Copyright 2012 by FORTH

Work Performed as a Ph.D. Thesis

at the Department of Computer Science, University of Crete,

under the supervision of Prof. Manolis Katevenis, with the financial support of FORTH-ICS


FORTH-ICS/TR-427 APRIL 2012

    VLSI Micro-Architectures for High-Radix Crossbars

    Giorgos Passas

    The crossbar is the most popular switch for digital systems such as Internet routers, clusters, and

    multiprocessors (on-chip, as well as multichip). However, because the cost of the crossbar grows with

the square of its radix, and because of past implementations in various technologies, it is

    widely believed that the crossbar is not scalable to radices beyond 32 or 64, and that for higher radices

    more complicated networks are needed, where the crossbar is the basic building block. In this thesis,

    we scale the crossbar to radices well beyond 100 by crafting novel VLSI micro-architectures and their

    detailed CMOS layouts.

As a case study, we laid out a 128×128×24 Gb/s crossbar, interconnecting 128 1 mm² user tiles in a single hop, using just 16 mm² of silicon in 90 nm CMOS. The crossbar is 32 bits wide, runs at 750 MHz, and consumes 7 Watts.

    In router systems, the user tiles will contain memory implementing combined queueing at the inputs

    and outputs of the crossbar, plus a small part of logic for port control. We show that this architecture

    is the best among a range of known router memory architectures (e.g. totally shared memory, solely

input queueing, or crosspoint queueing), for two reasons: (i) it gives top performance using only a modest speedup on either the crossbar or the memories, independent of radix; and (ii) it partitions the memory space only linearly with the radix, thus yielding (a) high SRAM density, by using few, large, and area-efficient blocks, and (b) high memory space utilization, through flexible sharing among

    flows. In chip multiprocessors, the user tiles will contain cache or local memory, plus a small part of

logic for the processor. When traffic is global and heavy, such a system is competitive with the popular mesh-centric systems, owing to the simplified routing and load balancing of the crossbar.

We made high radix crossbars feasible by developing novel VLSI micro-architectures for both their datapath and their control path. We implement the datapath using trees of multiplexor gates, as tristate buses are slowed down by intrinsically large parasitic capacitances, and we show that highly concentrated trees are more area efficient by further reducing the parasitic capacitance of their internal wires. Moreover, we contribute an experimental analysis showing that: (i) the area of the crossbar is gate limited for all practical values of its radix N and its width W, thus growing as O(N²W), not as O(N²W²), which would have been the case had area been wire limited, as is commonly believed in the literature; and (ii) the delay of the crossbar is dominated by the parasitics of wires, and because wire length grows with the perimeter of the crossbar, delay grows as O(N√W), not as O(log N), which would have been the case had delay been gate limited, as is commonly believed in the literature. Next, we propose novel pipelines to cope with the delay of the interconnect. Finally, we demonstrate that modern EDA tools can be guided to exploit the abundance of wiring resources through custom, but algorithmic, placement of gates.

For the control path, we study the architecture of iSLIP, which is the most popular parallel matching crossbar scheduler. In particular, we study a traditional iSLIP architecture that implements the matching decision of each input and each output of the crossbar in a separate arbiter block, and communicates the matching decisions between the input and the output arbiters through global arbiter-to-arbiter links. First, we show that this architecture is expensive because the arbiter-to-arbiter links take up O(N⁴) area. Thus, a radix-128 iSLIP scheduler occupies 14 mm², where the arbiter-to-arbiter links account for more than 50%. Next, by observing that the wiring of an arbiter fits in O(N log N) area, we propose a novel architecture that inverts the locality of wires by orthogonally interleaving the input with the output arbiters, thus lowering the wiring area of the scheduler down to O(N² log² N). Using this architecture, the radix-128 iSLIP scheduler becomes gate limited, fitting in 7 mm², which is a 50% reduction compared to the traditional one. For a higher radix of 256, area is reduced by almost an order of magnitude. Finally, the running time of the proposed scheduler is less than 10 ns, thus allowing operation with a minimum packet as small as 30 Bytes at a 24 Gb/s line rate.


    Publications

    Publications related to the topic:

G. Passas, M. Katevenis, D. Pnevmatikatos: Crossbar NoCs are scalable beyond 100 nodes, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD), Vol. 31, No. 4, Apr. 2012, pp. 573-585;

G. Passas, M. Katevenis, D. Pnevmatikatos: VLSI Micro-Architectures for High-Radix Crossbar Schedulers, proc. 5th ACM/IEEE International Symposium on Networks-on-Chip (NOCS 2011), Pittsburgh, PA, USA, May 1-4, 2011, 8 pages, ISBN 978-1-4503-0720-8;

G. Passas, M. Katevenis, D. Pnevmatikatos: A 128 x 128 x 24Gb/s Crossbar, Interconnecting 128 Tiles in a Single Hop, and Occupying 6% of their Area, proc. 4th ACM/IEEE International Symposium on Networks-on-Chip (NOCS 2010), Grenoble, France, May 3-6, 2010, pp. 87-95, IEEE Computer Society, ISBN 978-0-7695-4049-8.

Other publications by the author:

G. Passas, H. Eberle, N. Gura, W. Olesinski: Fast and Fair Arbitration on a Data Link, U.S. Patent, USPTO number 7965705, June 21, 2011;

G. Passas, M. Katevenis: Asynchronous Operation of Bufferless Crossbars, proc. IEEE International Conference on High Performance Switching and Routing (HPSR 2007), Brooklyn, NY, USA, May 30 - June 1, 2007, ISBN 1-4244-1206-4, paper ID 1569017531.pdf;

G. Passas, M. Katevenis: Packet Mode Scheduling in Buffered Crossbar (CICQ) Switches, proc. IEEE Workshop on High Performance Switching and Routing (HPSR 2006), Poznan, Poland, June 7-9, 2006, pp. 105-112, ISBN 0-7803-9570-0;

M. Katevenis, G. Passas: Variable-Size Multipacket Segments in Buffered Crossbar (CICQ) Architectures, proc. IEEE International Conference on Communications (ICC 2005), Seoul, Korea, May 16-20, 2005, 6 pages, paper ID 09GC08-4;

M. Katevenis, G. Passas, D. Simos, I. Papaefstathiou, N. Chrysos: Variable Packet Size Buffered Crossbar (CICQ) Switches, proc. IEEE International Conference on Communications (ICC 2004), Paris, France, June 20-24, 2004, vol. 2, pp. 1090-1096.


    Acknowledgments

    I worked on my PhD thesis at the Computer Architecture and VLSI Systems (CARV) laboratory of the

    Institute of Computer Science (ICS) of the Foundation for Research and Technology Hellas (FORTH).

    FORTH provided my graduate scholarship, including funding by the European Commission through

    the SARC project (FP6 IP #27648) and the HiPEAC Network of Excellence (NoE #004408 and #217068).

I am grateful to my advisor, prof. Manolis Katevenis, for suggesting the evaluation of the cost of crossbar speedup on chip using real VLSI layouts, and for supervising my work through weekly meetings from Spring 2008 to Fall 2011. The years 2008 and 2009 in particular were very hard, and he was the only one I remember giving me a hand. Nevertheless, there was still plenty of time and space to act on my own, which I really appreciated.

I am also grateful to my co-advisor, prof. Dionisios Pnevmatikatos, who joined our meetings in Fall 2009 and offered fresh insights. However, I mostly thank him for all the nice things I learned from him about technical writing and drawing; for example, the bold lines in Fig. 5.10 are due to him.

I also thank the other members of my thesis committee, prof. Davide Bertozzi, prof. Angelos Bilas, Dr. Cyriel Minkenberg, prof. Yannis Papaefstathiou, and prof. Apostolos Traganitis, for their questions and comments at the defense of my thesis; they proved very constructive. Special thanks to prof. Davide Bertozzi and Dr. Cyriel Minkenberg for going into more details.

I thank prof. Christos Sotiriou, whom I consulted on several of the issues I faced on EDA flows and algorithms; the custom placement techniques were motivated by these discussions.

    I thank Spyros Lyberis and Michael Ligerakis for setting up the EDA toolset for me. Spyros Lyberis also

    helped me improve the oral presentation of my defense by commenting on my rehearsal.

I thank Dr. Hans Eberle, prof. Jose Duato, and prof. Jose Flich, whom I had the opportunity to cooperate with during my internship at Sun Labs, Menlo Park, CA, in Fall 2007; seeing how other researchers think about a closely related research topic was very helpful for my thesis.

    Last but not least, I thank my family, especially my mother, and my uncle Nikos Arapakis, for their

    love and encouragement, and my friends, especially Makis Stamos (Psilos), Kostis Anastasakis, Enrico

    Schiattarella, Orestis Karamagiolas, and Giorgos Panagiotakis, for helping me further tolerate reality.

Finally, I thank the cleaning and security personnel at FORTH for being kind to me.


    Contents

1 Introduction
   1.1 Motivation
   1.2 Contributions
   1.3 Outline

2 Basic Concepts
   2.1 Preliminaries
   2.2 Time Switches
   2.3 Space Switches
   2.4 Scheduling
   2.5 Routing
   2.6 Virtual Channels
   2.7 High Radix
   2.8 Summary

3 A Comparison of Architectures for High-Radix Switches
   3.1 Basic Switch Architectures
   3.2 Memory-Sharing Merits
   3.3 On-Chip SRAM Cost
   3.4 Combined Input-Output Queued (CIOQ) Crossbars
   3.5 Hierarchically-Queued Crossbars
   3.6 Related Work
   3.7 Conclusion

4 Datapath Micro-Architectures for High-Radix Crossbars
   4.1 Basic Architecture
   4.2 Cost & Performance Analysis
   4.3 Customized Layout using Link Pipelining
   4.4 Models for Area
   4.5 Related Work
   4.6 Conclusion

5 Scheduler Micro-Architectures for High-Radix Crossbars
   5.1 iSLIP Circuit
   5.2 Block Micro-Architecture
   5.3 Cross Micro-Architecture
   5.4 Cross-iSLIP versus Wavefront-Scheduler Comparison
   5.5 FIFO & Virtual-Channel Schedulers
   5.6 Related Work
   5.7 Conclusion

6 High-Radix CIOQ Crossbar Switches & Crossbar NoCs
   6.1 Tiled Architectures
   6.2 Wire-Over-SRAM Architectures
   6.3 Radix-128 Tiled System using Centralized Crossbar
   6.4 Crossbar versus Mesh Comparison
   6.5 Projections in Newer Technology Nodes
   6.6 Related Work
   6.7 Conclusion

7 Summary & Future Work

Bibliography


    Chapter 1

    Introduction

Today, on-chip memory systems with a few hundred memory nodes, such as chip multiprocessors

    (CMPs) and switch fabrics, are pivotal digital systems. A key component in such systems is the switch

    interconnecting the memories. While the crossbar is the most popular switch, it is widely considered

non-scalable to radices beyond a few tens of nodes due to its quadratic cost [1][2][3]. Thus, designers are

    increasingly adopting cheaper-but-intricate topologies, such as meshes and tori [1][4][5], where the

    crossbar is the basic building block. However, a study on the scaling limits of the crossbar is missing

    from the literature: If the crossbar is proven to be feasible, designers are likely to replace the currently

    deployed topologies with a crossbar to benefit from its simplicity.

    We consider memory systems where each memory node is located on a user tile, and user tiles

    are arranged in a 2D matrix. In a switch fabric, a user tile will implement mostly queues associated

    with a switch port, plus a small circuit for the control of that port. In a CMP, a user tile will implement

    a processor and the cache or local memory next to the processor. We evaluate the area, speed, and

power consumption overhead that the crossbar adds to the user tiles by studying their VLSI organization in

    modern CMOS technology, using Electronic Design Automation (EDA) tools.

    1.1 Motivation

    Switch chips interconnect their ports using a crossbar. An efficient approach to handle contention for

the output ports is using queues. In particular, traffic at different inputs contending for the same output

    is queued at the inputs [6]. To reduce Head of Line blocking in such Input Queued (IQ) crossbars,

queue memories can be divided into lanes, e.g. Virtual Output Queues (VOQ) [6]. However, scheduling which lane is connected to which output is a hard problem to solve quickly, fairly, and efficiently [7]. To

    compensate for inefficiencies of scheduling, internal speedup can be used. Then, the throughput of

    both the memories and the crossbar is overprovisioned, and memories are placed also at the outputs.

    This organization is known as Combined Input-Output Queueing (CIOQ) [8][9][10][11].


    CIOQ has been studied for systems where the crossbar and the memories are implemented on

separate chips [7][12]. In such systems, it is known to be expensive because it wastes scarce chip-I/O resources [12]. Thus, crosspoint queueing (XQ) has been proposed as a more scalable alternative.

XQ can be considered a scalable variant of output queueing (OQ), trading queueing internal to the crossbar for operation without internal speedup [13][14].

    However, the tradeoff between speedup and memory is very different off-chip from on-chip.

    Chip-to-chip link bandwidth is expensive, in terms of both pin count and power consumption. Each

    pin is run at the highest possible frequency, so link speedup can only be provided by increasing the

number of pins. But more pins cost in package size, board area, and wiring. Moreover, high speed

    serial off-chip links carry their own clock information, embedded in the data encoding. Even if such a

    link has no valid data to carry, it still consumes power because it has to carry synchronization signals

    for the receivers clock recovery circuit to stay in-sync (powering down is beneficial only for quite long

    inactivity periods). Thus, an off-chip link with a speedup ofs> 1.0 consumes power proportional to

    its peak throughput, s, although its average utilization never exceeds 1.0.

    On-chip, things are very different though. The wires of an idle link need not change state, hence

CMOS circuits will only consume energy when transferring valid data: on-chip link power consumption is proportional to average throughput, not peak capacity. Moreover, on-chip, links can be routed

    over the memories: Modern CMOS processes offer many layers of interconnect (e.g. eight), while

    SRAM blocks obstruct few of them (e.g. four). Thus, on chip, CIOQ appears technologically correct.

    Furthermore, CIOQ is advantageous to XQ because it reduces the partitioning of memory space.

    Particularly, in CIOQ the total number of memories grows linearly with the radix, while in XQ it grows

    quadratically. Moreover, higher memory partitioning translates to higher implementation cost. On

the other hand, CIOQ compared to XQ increases the cost of the crossbar, while also requiring a monolithic crossbar scheduler. Hence, a comparison between CIOQ and XQ starts from the evaluation of

    the cost of crossbar speedup and the feasibility of the scheduler.

This evaluation should be done for high radix switches [15]. Switch chip designers strive to benefit from the advances in signaling technology by increasing the radix of the switch chips, as switch

    chips with higher radix enable lower diameter network topologies, with lower component count, and

    lower cost. However, most studies in the literature concern relatively low radix crossbars, up to 32 or

    64 ports [2][16], and a study of scaling to hundreds of ports is missing.

    Finally, the cost of crossbar speedup and the feasibility of the scheduler should be evaluated

    on real VLSI layouts. Switch chips are typically Application Specific Integrated Circuits (ASICs), and

    ASICs are designed using Electronic Design Automation (EDA) tools. Thus, VLSI layouts should be

    developed using such EDA tools.


    1.2 Contributions

    The key contributions of this thesis are as follows:

1. High Radix Crossbar Network-on-Chip. We lay out a 128×128×24 Gb/s crossbar in a 90 nm CMOS process with 9 layers of interconnect. The crossbar is 32 bits wide, runs at 750 MHz using a 3-stage pipeline, fits in 16 mm² of silicon by filling it at the 90% level, and consumes 6 Watts. Moreover, we surround the crossbar with 128 1 mm² user tiles, and we connect the crossbar to the user tiles through global links. The global links are 32 bits wide, run at 750 MHz using a two-stage pipeline, run on top of the user tiles, and consume 1.2 Watts.

2. Crossbar Datapath Micro-Architecture. We implement the datapath of the crossbar using trees of multiplexor gates, as tristate buses are slowed down by intrinsically large parasitic capacitances, and we show that highly concentrated trees are more area efficient by further reducing the parasitic capacitance of their internal wires. Next, we contribute an experimental scaling analysis, showing that: (i) the area of the crossbar is gate limited for all practical values of its radix N and its width W, thus growing as O(N²W), not as O(N²W²), which would have been the case had area been wire limited, as is commonly believed in the literature [2][17]; and (ii) the delay of the crossbar is dominated by the parasitics of wires, and because wire length grows with the perimeter of the crossbar, delay grows as O(N√W), not as O(log N), which would have been the case had delay been gate limited, as is commonly believed in the literature [15]. Next, we propose novel pipelines to cope with the delay of the interconnect. Finally, we demonstrate that EDA tools can be guided to compact routing solutions through custom gate placement.

3. Crossbar Scheduler Micro-Architecture. We study a traditional iSLIP architecture that implements the matching decision of each input and each output of the crossbar in a separate arbiter block, and communicates the matching decisions between the input and the output arbiters through global arbiter-to-arbiter links. First, we show that this architecture is expensive because the arbiter-to-arbiter links take up O(N⁴) area. Thus, a radix-128 iSLIP scheduler occupies 14 mm², where the arbiter-to-arbiter links account for more than 50%. Next, by observing that the wiring of an arbiter fits in O(N log N) area, we propose a novel cross architecture that inverts the locality of wires by orthogonally interleaving the input with the output arbiters, thus lowering the wiring area of the scheduler down to O(N² log² N). Using this cross architecture, the radix-128 iSLIP scheduler becomes gate limited, fitting in 7 mm², which is a 50% reduction compared to the traditional one. For a higher radix of 256, the reduction nears an order of magnitude.

4. Combined Input-Output Queueing is better than Crosspoint Queueing. Based on the above findings, we conclude that crossbars are small and speedup is inexpensive. Because CIOQ compared to XQ reduces memory partitioning, CIOQ is better than XQ.


    1.3 Outline

    Chapter 2 presents basic concepts in switch design. First, we abstract the role of the switch to scalable

    distributed multiparty communication. Next, we classify switches to time and space switches, and

    we explain why space switches are more scalable. Though scalable, space switches need scheduling

    and routing. Thus, we also overview some popular scheduling and routing algorithms. Moreover, we

discuss the popular Virtual Channels and the argument for high radix switches. Finally, we summarize.

    Chapter 3 presents a comparison of known switch architectures for high radix switches. First, we

    overview basic switch architectures. Next, we study the merits of memory sharing and the cost of

    memory implementation, and we show that the input queued crossbar is the only scalable switch.

    Because the input queued crossbar is difficult to schedule at high radices, we also discuss scalable

variants thereof, namely combined input-output queued crossbars and hierarchically queued crossbars. Finally, we discuss related work and the conclusion.

    Chapter 4 presents VLSI micro-architectures for high radix crossbar datapaths. First, we describe a

    basic datapath architecture. Next, we show that in this architecture area is practically always gate

    limited, while delay becomes wire limited at high radices. To increase throughput, we also describe a

    customized layout using a novel wire pipeline. Moreover, we develop simple models for area. Finally,

    we discuss related work and the conclusion.

Chapter 5 presents VLSI micro-architectures for high radix crossbar schedulers. First, we describe the iSLIP circuit. Next, we study a traditional block architecture, and we show that at high radices scheduler area is wire limited. To remove the wiring limitations, we propose and study a novel cross architecture. We find that this cross architecture has similarities with the architecture of the Wavefront Scheduler. Moreover, we adapt the cross scheduler architecture to FIFO and Virtual Channel

    crossbars. Finally, we discuss related work and the conclusion.

    Chapter 6 presents VLSI micro-architectures for high radix Combined Input-Output Queued (CIOQ)

    switches and Networks-on-Chip (NoCs). First, we show that a tiled architecture can be used for both

CIOQ switches and NoCs. Next, we study alternative locations of the crossbar within its surrounding tiles, and we show that a centralized crossbar is more practical. Moreover, we plot overall system performance, we compare our crossbar to a popular mesh NoC, and we make projections for newer technology nodes. Finally, we describe related work and the conclusion.

    Chapter 7 presents a summary of the thesis, as well as directions for future work.


    Chapter 2

    Basic Concepts

    In this chapter, we describe basic concepts in switch design. In particular, we describe the role of

a switch (section 2.1), a taxonomy of switches into time and space switches (sections 2.2 and 2.3), the problem of scheduling (section 2.4), the problem of routing (section 2.5), the concept of Virtual Channels (section 2.6), the argument for high radix switches (section 2.7), and a summary (section 2.8).

    The description is based mainly on the transparencies of the Packet Switch Architecture class at the

    University of Crete [18].

2.1 Preliminaries

Interconnection switches are intermediaries implementing the communication between the parties

    of a digital system. For example, between line cards in an Internet router [19], between processors and

    memories in a multiprocessor system [20], and between I/O devices, processors, and/or memories in

    a storage system [21].

As the scale of systems varies from a few tens of parties (e.g. in an Internet router) to many thousands of parties (e.g. in a multiprocessor), the scale of the switch has to vary accordingly. While a small

    switch may switch data by simply passing it through a single point in space at different moments in

time (e.g. the memory switch; section 2.2), larger switches use topologies of parallel paths in space (e.g. the crossbar or the Benes switch; section 2.3). Thus, space and time are two basic dimensions in

switch design. A third dimension is the coordination of the parties. Examples are scheduling or routing algorithms resolving situations where parties are contending for a single resource of the switch, such as an output link or an internal path (sections 2.4 and 2.5). In any case, the parties comprise a distributed system. When the scale is large, distribution is obvious, a direct consequence of the physical distances between the parties. At smaller scales, distribution emerges from large ratios of path delays to processing times.


[Figure 2.2: A 4×4 memory switch. Inputs assemble words, the central memory forms a single path in space (the bottleneck), and outputs disassemble words.]

    2.2 Time Switches

    For small systems, a switch may simply pass data through a single point in space at different moments

    in time. A representative such switch is the memory switch. A memory switch (Fig. 2.2) switches data

from a set of input links to a set of output links through writes and reads to and from a central memory.

    The access rate of the central memory is equal to the aggregate rate of the input and the output links,

    in order to absorb any contention among the input links for the output links, and to fully utilize the

    output links. Thus, the inputs assemble words, which are then multiplexed in time on the memory

    (write) bus. In the opposite direction, words are demultiplexed on the memory (read) bus, and are

    disassembled at the outputs. Inside the memory, words are organized in queues. For details on the

    memory switch, refer e.g. to [22][23][24].
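To make this scaling limit concrete, the following minimal Python sketch (illustrative names and numbers, not from the thesis) computes the aggregate access bandwidth the central memory must sustain: all N inputs write and all N outputs read at full link rate.

```python
# A minimal sketch of the memory switch bottleneck: the central memory
# must absorb the writes of all N inputs plus the reads of all N outputs,
# so its access bandwidth grows linearly with port count and port rate.
def required_memory_bandwidth(n_ports, link_rate_gbps):
    """Aggregate access rate of the central memory, in Gb/s."""
    return 2 * n_ports * link_rate_gbps   # N writers + N readers, all at full rate

for n in (4, 8, 32, 128):
    print(f"{n:3d} ports at 10 Gb/s -> {required_memory_bandwidth(n, 10.0):.0f} Gb/s memory")
```

Doubling either the port count or the link rate doubles the required memory access rate, which is why, as chapter 3 shows, a state-of-the-art memory switch is limited to only a handful of ports.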

    Queues are typically simple first-in-first-out (FIFO) structures, as only such simple structures

    can be implemented fast in hardware [25]. Thus, at least one queue is needed per output, to maximize

    the throughput of the memory: If words for different outputs are intermingled in the same queue,

    words for a heavily loaded output block other, lightly loaded outputs. This is a fundamental problem

in switch design, known as Head-of-Line (HoL) blocking [6][26].

Coordination in a memory switch concerns the sharing of memory space among the inputs and the outputs. In particular, when some inputs are contending for the same output, the queue corresponding to that output starts building up in time. Given that memory space is finite, a protocol is needed

    to prevent the memory from overflowing. Usually, a credit based protocol is employed [27]. Using a

    credit based protocol, inputs are allocated a number of credits for each queue, credits are spent on

    writes, and are returned on reads, so that operation is lossless. Notice that in this way the memory is

minimally partitioned into N² credit equivalents. Furthermore, while dropping excessive words could

    be an alternative, this usually degrades performance, as new words may not be able to arrive in time

to replace the dropped ones [6][26]. Finally, according to Little's law [28], rates allocated to inputs for a particular output are proportional to their share of that output's queue space. That is, in a memory switch, rates are allocated through memory space.

[Figure 2.3: (a) A 4×4 crossbar switch and (b) example connections.]
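As a concrete illustration of the credit protocol just described, here is a minimal Python sketch (class and method names are assumed, not from the thesis): a credit is spent on every write and returned on every read, so the queue can never overflow and operation stays lossless.

```python
# A minimal sketch of per-queue credit flow control.
class CreditedQueue:
    """One output queue in the shared memory, seen from an upstream input."""
    def __init__(self, credits):
        self.credits = credits          # free space, in word-sized credits
        self.words = []

    def write(self, word):
        """Spend a credit; the sender must hold the word if none is left."""
        if self.credits == 0:
            return False                # lossless back-pressure, no drop
        self.credits -= 1
        self.words.append(word)
        return True

    def read(self):
        """Return a credit to the sender when a word leaves the memory."""
        self.credits += 1
        return self.words.pop(0)
```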

    The memory switch is the most efficient of the known switch architectures because it optimizes

    the utilization of both the links and the memory space. All links are guaranteed to run at 100% of their

capacity, while memory space can be flexibly shared among queues; see [29] for a range of possible sharing schemes. Unfortunately, the memory switch is non-scalable because as the number and/or the rate of links increases, the access rate of the central memory increases proportionally. As we shall see in chapter 3, current memory and link technology constrain the size of a state-of-the-art memory switch to only eight ports.

    2.3 Space Switches

Space switches are more scalable because they use parallel paths in space. Popular examples are the crossbar,

    the mesh, and the Benes switch.

A crossbar switch (Fig. 2.3) is a set of N input lines, N output lines, and N² programmable crosspoints between them. Let us denote by x_{i,j} the crosspoint between input i and output j. For input i to connect to output j, crosspoint x_{i,j} closes, while every other crosspoint x_{k,j}, k ≠ i, opens to avoid shorts. Finally, the input and the output lines connect to same-rate input and output links. Thus, the crossbar is internally non-blocking.
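The crosspoint discipline above can be illustrated with a minimal Python sketch (names assumed): a unicast configuration closes at most one crosspoint per output column, so two inputs are never shorted onto the same output line.

```python
# A minimal sketch of crossbar configuration from a unicast match.
def configure(match, n):
    """match: {output j: input i}. Returns the n x n crosspoint matrix."""
    x = [[False] * n for _ in range(n)]
    for j, i in match.items():
        x[i][j] = True                  # close x[i][j]
    # Sanity check: every output column has at most one closed crosspoint.
    assert all(sum(row[j] for row in x) <= 1 for j in range(n))
    return x

# Example on a 4x4 crossbar: input 2 -> output 0, input 1 -> output 3.
for row in configure({0: 2, 3: 1}, 4):
    print(row)
```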

A mesh switch (Fig. 2.4) uses one 5×5 crossbar for each pair of input-output links, and connects the crossbars in a √N × √N grid. In this way, it reduces the complexity of the crossbar from O(N²) down to O(N). However, this comes at the cost of internal blocking. In particular, the bisection of the mesh is O(√N) wide, that is, narrower than the O(N) connections. Furthermore, unlike the crossbar, where there is a dedicated path for each input-output pair, in the mesh each input may connect to any of the outputs through multiple alternative paths, intersecting with other paths connecting other pairs. For details on the mesh refer e.g. to [30].

[Figure 2.4: (a) A 9×9 mesh switch and (b) alternative routes for an example connection.]

A 4×4 Benes switch (Fig. 2.5) comprises two back-to-back connected 4×4 Banyan networks of 2×2 switching elements. Larger Benes switches are constructed recursively. An N×N Benes comprises two N/2 × N/2 Benes sub-switches, sandwiched by N additional elements. Thus, the cost of the Benes is O(N log N). Furthermore, the Benes is non-blocking, like the crossbar. The intuition behind its non-blocking property is that the Benes has more states than all possible permutations of external connections. In particular, it uses 2 log N − 1 stages of N/2 2×2 switching elements, thus providing 2^{(2 log N − 1)N/2} states, which are more than the N! permutations of external connections [31]. Finally, like the mesh, the Benes is a multipath switch. For details on the Benes refer e.g. to [32].
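The counting argument can be checked numerically; the following minimal Python sketch (assuming N a power of two and log base 2) compares the number of switch states against the N! permutations.

```python
# A quick numeric check of the counting argument: the Benes has more
# internal states than there are permutations of its external connections.
from math import factorial, log2

for n in (4, 8, 16, 32):
    stages = 2 * int(log2(n)) - 1          # 2 log N - 1 stages
    states = 2 ** (stages * n // 2)        # N/2 two-state elements per stage
    assert states > factorial(n)
    print(f"N={n}: 2^{stages * n // 2} states > N! = {factorial(n)}")
```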

    There are a number of other known space switches. For example, the mesh can be extended to

three or more dimensions. Moreover, the Benes can be folded into a fat tree. Finally, the Clos switch

    is a Benes switch using higher radix switching elements. For details on these switches, refer e.g. to

    [33][34][35].

    Like time switches, space switches hold contending traffic in queues. However, space switches

place these queues at the inputs. Thus, memory throughput is independent of N. Owing to this feature, space switches can scale to hundreds or even thousands of ports, as we show in chapter 3. In

    the simplest case, each input memory contains one FIFO queue. The problem then is HoL blocking.

While at the head of the queue, traffic destined to a congested output blocks other, irrelevant

    traffic behind it. When traffic is uniformly destined, this is known to reduce the throughput of the

    switch below 60% [6]. Under more stressed conditions, performance degrades even further [6]. The

    solution to HoL blocking is to change the organization of memories. In particular, by separating traffic

    per switch output, HoL blocking is eliminated [6]. This approach is known as Virtual Output Queueing

    (VOQ) because although queues are per output, they are physically located at the inputs. Other less

expensive approaches, such as Virtual Channels (section 2.6), have also been proposed. Finally, the allocation of the space of each memory to upstream nodes can be controlled by a credit protocol, in analogy to time switches.

[Figure 2.5: (a) A 4×4 and (b) 8×8 Benes switch, and (c) alternative routes for an example connection.]
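The sub-60% throughput figure for FIFO input queueing can be reproduced with a minimal Monte-Carlo sketch in Python (all names and parameters are assumed): saturated input FIFOs, uniformly random head-of-line destinations, and one packet served per output per slot. For large N, the measured per-port throughput approaches the classic 2 − √2 ≈ 0.586 bound.

```python
# A minimal Monte-Carlo sketch of the HoL-blocking saturation model.
import random

def hol_saturation_throughput(n=32, slots=20_000):
    heads = [random.randrange(n) for _ in range(n)]  # HoL destinations
    served = 0
    for _ in range(slots):
        granted = set()
        for i in random.sample(range(n), n):         # random service order
            if heads[i] not in granted:
                granted.add(heads[i])                # output heads[i] serves i
                heads[i] = random.randrange(n)       # next packet reaches the head
        served += len(granted)
    return served / (slots * n)

print(f"saturation throughput ~ {hol_saturation_throughput():.3f}")   # about 0.59
```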

Fig. 2.6 compares the operation of space and time switches using space-time diagrams. We assume a 3×3 switch, where traffic at inputs 0 and 1 is destined to output 0, and traffic at input 2 is destined to output 1. In the time switch (Fig. 2.6(a)), inputs assemble 3-packet frames, which are multiplexed in time on the write bus, are written in memory per output, and are finally demultiplexed on the read bus towards the outputs. Observe in Fig. 2.6(a) that there is a minimum latency of three packet times for the assembly of frames, and the backlog for output 0 accumulates inside the memory, neither at the inputs nor at the outputs. On the other hand, in the space switch (Fig. 2.6(b)), input memories are three times narrower and packets are multiplexed towards the outputs in space. Observe in Fig. 2.6(b) that there is a minimum latency of one packet time, corresponding to scheduling, and backlog accumulates solely at the inputs.

Unfortunately, the problem with space switches is coordination. Coordination concerns (i) scheduling of which input is connected to which output, and (ii) routing connections between the matched inputs and outputs. We overview these problems in the next two sections, 2.4 and 2.5. Notice that


routing in crossbars is trivial because each input-output pair has a private path. This is a basic reason crossbars are so popular.

[Figure 2.6: Space-time diagram of packet forwardings in a 3×3 (a) time switch and (b) space switch. Traffic at input 0 and input 1 is destined to output 0, and traffic at input 2 is destined to output 1. In (a), inputs assemble 3-packet frames, which are multiplexed in time on the write bus, are written in memory per output, and are finally demultiplexed on the read bus towards the outputs. In (b), input memories are three times narrower and packets are multiplexed towards the outputs in space.]


[Figure 2.7: Unfairness of maximum matchings. To set up the connection from input 1 to output 0, a non-maximum match is needed.]

    2.4 Scheduling

Switch scheduling is a special application of bipartite graph matching [36]. The vertices of the graph are the inputs and the outputs of the switch, and the edges are the desired connections. Although maximum size matching algorithms maximize the throughput of the switch, they are impractical for two reasons. First, they are too slow to implement in fast hardware [37]. Second, they are inherently unfair: in the example of Fig. 2.7, a maximum size matching algorithm would starve the connection from input 1 to output 0 [37]. Thus, heuristics are being used in practice.

Let us first consider the simplest case, that is, the scheduler for a switch with a single FIFO per

    input. Such a scheduler runs in the following two steps:

    Step 1: Request. Each input sends a request to the destination output of its head word.

    Step 2: Grant. If an output receives any requests, it chooses the one to grant (e.g. randomly), and

    notifies each input whether its request was granted.

    After the above two steps have been executed, a bipartite match has been found. The runs of the

    scheduler are usually pipelined with the configuration of the switch and the forwardings of the words.

    Thus, the running time of the scheduler quantizes the external traffic into fixed size internal units,

    often called packets or cells. Notice that to sustain the line rate, the internal packets have to be at least

    as small as the minimum external packets.
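The two steps above map directly to a few lines of Python; in this minimal sketch (names assumed, grant choice random as one of the options mentioned), the match is returned as an input-to-output dictionary.

```python
# A minimal sketch of the two-step FIFO scheduler.
import random

def fifo_schedule(head_dest):
    """head_dest[i]: destination of input i's head word (None if empty).
    Returns a bipartite match as {input: output}."""
    # Step 1: Request - each input requests its head word's output.
    requests = {}
    for i, j in enumerate(head_dest):
        if j is not None:
            requests.setdefault(j, []).append(i)
    # Step 2: Grant - each requested output grants one input, e.g. randomly.
    return {random.choice(inputs): j for j, inputs in requests.items()}

# Inputs 0 and 2 contend for output 2; input 1 requests output 0; input 3 is idle.
print(fifo_schedule([2, 0, 2, None]))
```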

Schedulers for VOQ switches are more complicated because in such switches each input may wish to connect to more than one output. Thus, input contention is added to output contention. Below we review four popular and representative scheduling algorithms: PIM [38], iSLIP [37], DRRM [39], and the Wavefront Scheduler [40].


[Figure 2.8: Example run of the PIM scheduling algorithm.]

PIM Parallel Iterative Matching [38] runs in the following three steps:

    Step 1: Request. Each input sends a request to every output for which it has at least one packet.

    Step 2: Grant. If an output receives any requests, it chooses randomly the one to grant, and

    notifies each input whether its request was granted.

    Step 3: Accept. If an input receives any grants, it chooses randomly the one to accept.

    After the above three steps have been executed, a bipartite match has been found (Fig. 2.8). Moreover,

    the above steps may be iterated between the unmatched inputs and outputs to increase the size of

the match. As proved in the original PIM paper [38], log N iterations converge to a maximal match. However, the problem with this algorithm is that (i) it needs random-number generators, which are tricky to implement in fast hardware, and (ii) it is unfair under asymmetric traffic [41], as illustrated

    in Fig. 2.9. In Fig. 2.9(a), each flow would ideally receive 1/2 of link bandwidth, but in reality, the

    algorithm tends to discriminate against inputs that have contention [41]. Fig. 2.9(b) shows a second,

    analogous scenario.
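A minimal Python sketch of PIM (names assumed) follows; it iterates the request/grant/accept steps over the unmatched inputs and outputs, using random choices exactly where the algorithm does.

```python
# A minimal sketch of PIM with a configurable number of iterations.
import random

def pim(R, iterations):
    """R[i][j]: input i has at least one packet for output j."""
    n = len(R)
    match, free_in, free_out = {}, set(range(n)), set(range(n))
    for _ in range(iterations):
        # Steps 1-2: Request / Grant - each free output grants one
        # requesting free input, chosen at random.
        grants = {}
        for j in free_out:
            reqs = [i for i in free_in if R[i][j]]
            if reqs:
                grants.setdefault(random.choice(reqs), []).append(j)
        # Step 3: Accept - each input accepts one of its grants at random.
        for i, outs in grants.items():
            j = random.choice(outs)
            match[i] = j
            free_in.discard(i)
            free_out.discard(j)
    return match

random.seed(0)
R = [[random.random() < 0.5 for _ in range(4)] for _ in range(4)]
print(pim(R, iterations=2))
```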

    iSLIP Iterative SLIP [37] overcomes the problems of PIM by resolving contention round-robin. It runs

    in the following four steps:

    Step 1: Request. Each input sends a request to every output for which it has at least one packet.

    Step 2: Grant. If an output receives any requests, it decides round-robin which one to grant, and

    communicates back to each input whether its request was granted.

    Step 3: Accept. If an input receives any grants, it decides round-robin which one to accept, and

    communicates back to each output whether its grant was accepted.


Step 4: Slip. If an input accepts any output, it increments (modulo N) its round-robin pointer to one location beyond that output. If an output is accepted by the input it granted, it increments (modulo N) its round-robin pointer to one location beyond that input.

[Figure 2.9: PIM is unfair when input load is asymmetric; (a) and (b) are two examples. Inputs request the full rate from a set of outputs; fractions denote the rates allocated by PIM.]

    After the first three steps have been executed, a bipartite match has been found. The fourth step

    ensures that subsequent runs of the algorithm will give fair and often maximal matches. Because each

output keeps granting the same input until accepted, and because inputs arbitrate round-robin, any output eventually gets accepted in at most N runs of the algorithm. As a consequence, it is guaranteed that any request results in a match in at most N² runs [37]. Finally, by insisting on their grants, the

    outputs tend to slip (desynchronize), speeding up convergence to maximal matches.
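The following minimal Python sketch of a single-iteration iSLIP run (names assumed) replaces PIM's random choices with the round-robin grant and accept pointers, and applies the Step 4 pointer updates only to accepted grants.

```python
# A minimal sketch of one single-iteration iSLIP run.
def islip(R, g_ptr, a_ptr):
    """R[i][j]: input i requests output j. g_ptr[j] / a_ptr[i] are the
    per-output grant and per-input accept pointers (mutated in place)."""
    n = len(R)
    rr = lambda ptr, cands: min(cands, key=lambda x: (x - ptr) % n)
    # Step 2: Grant - each requested output grants the requesting input
    # closest (round-robin) to its grant pointer.
    grants = {}
    for j in range(n):
        reqs = [i for i in range(n) if R[i][j]]
        if reqs:
            grants.setdefault(rr(g_ptr[j], reqs), []).append(j)
    # Step 3: Accept - each granted input accepts one output round-robin.
    # Step 4: Slip - pointers move only past accepted grants.
    match = {}
    for i, outs in grants.items():
        j = rr(a_ptr[i], outs)
        match[i] = j
        a_ptr[i] = (j + 1) % n
        g_ptr[j] = (i + 1) % n
    return match

n = 4
R = [[False] * n for _ in range(n)]
R[0][2] = R[1][2] = R[1][0] = True     # inputs 0 and 1 contend for output 2
g_ptr, a_ptr = [0] * n, [0] * n
print(islip(R, g_ptr, a_ptr))          # {0: 2, 1: 0}
```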

DRRM Dual Round Robin Matching [39] runs in the following three steps:

    Step 1: Request. Each input selects round-robin which output to send a request to.

    Step 2: Grant. If an output receives any requests, it decides round-robin which one to grant, and

    communicates back to each input whether or not its request was granted.

Step 3: Slip. If an input is granted by the output it requested, it increments (modulo N) its round-

    robin pointer to one location beyond that output. If an output grants any input, it increments

    (modulo N) its round-robin pointer to one location beyond that input.


[Figure 2.10: Example run of the Wavefront Scheduler.]

    Compared to iSLIP, DRRM saves one step. Thus, a bipartite match is found in the first two steps. The

    third step ensures that subsequent runs of the algorithm will give fair and often maximal matches, in

analogy to the fourth step of iSLIP. Unfortunately, DRRM introduces HoL blocking [42]: if the output that an input insists on is congested, flows from this input to non-congested outputs are blocked [42].
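A minimal Python sketch of one DRRM run (names assumed): each input first picks a single output round-robin among its backlogged VOQs, each output grants round-robin among the requests it received, and pointers slip only on a successful grant.

```python
# A minimal sketch of one DRRM run.
def drrm(R, r_ptr, g_ptr):
    """R[i][j]: input i has traffic for output j. r_ptr[i] / g_ptr[j] are
    the per-input request and per-output grant pointers (mutated in place)."""
    n = len(R)
    rr = lambda ptr, cands: min(cands, key=lambda x: (x - ptr) % n)
    # Step 1: Request - each input sends a single request, chosen
    # round-robin among its backlogged VOQs.
    requests = {}
    for i in range(n):
        voqs = [j for j in range(n) if R[i][j]]
        if voqs:
            requests.setdefault(rr(r_ptr[i], voqs), []).append(i)
    # Step 2: Grant, and Step 3: Slip - pointers move only on a grant.
    match = {}
    for j, ins in requests.items():
        i = rr(g_ptr[j], ins)
        match[i] = j
        r_ptr[i] = (j + 1) % n
        g_ptr[j] = (i + 1) % n
    return match
```

Note how saving the accept step also removes the choice that lets iSLIP steer around a congested output, which is the source of the HoL blocking described above.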

Wavefront Scheduler Instead of making selections locally at N inputs and N outputs, the Wavefront Scheduler [40] operates globally on a square matrix of N² flows, where flows are prioritized by a diagonal wavefront. We show an example run of this scheduler in Fig. 2.10. In the upper part, we show the request matrix and the match eventually computed by the scheduler. In particular, the Wavefront Scheduler uses a request matrix, where flow (i,j) from input i to output j corresponds to entry (i,j), and entry (i,j) is set if and only if flow (i,j) is backlogged. The scheduler runs on the request matrix as we show in the bottom part of Fig. 2.10. Initially, each flow in the first row (row 0) is given a vertical token, and each flow in the first column (column 0) is given a horizontal token.

[Figure 2.11: Priorities of flows during a period of runs of the Wavefront Scheduler in an example 3×3 switch. Top priority is shifted to a different flow from run to run: flow (0,0) is given 6 times higher priority over flow (1,0), while flow (1,0) is given only 3 times higher priority over flow (0,0).]

A flow's request is granted if and

    only if that flow grabs both the vertical and the horizontal token corresponding to its column and its

row, respectively. Thus, in the first step, flow (0,0) has the top priority, flow (0,0)'s request is granted, and flow (0,0) stops the propagation of both its vertical and horizontal token, to prevent its same-input and same-output flows from being granted in subsequent steps. In the second step, flows (0,1) and (1,0) are likely to have the top priority. However, flow (0,1) misses the horizontal token, hence it is not granted. Thus, it propagates its tokens unchanged to its neighbors in both directions. In parallel, flow (1,0) also propagates its tokens unchanged because it is idle. Scheduling proceeds similarly, and a match is computed in 2N − 1 steps. Notice that in step i, i flows are likely to have top priority, but these flows are always non-conflicting. In the next scheduling decision, the initial top priority flow

    changes, as in Fig. 2.11, to provide a degree of fairness. Observe in Fig. 2.11 that flow (0,0) is given six

    times higher priority over flow (1,0), while flow (1,0) is given only three times higher priority over flow

    (0,0). Thus, the Wavefront Scheduler is inherently unfair.
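The token mechanism can be captured in a minimal Python sketch (names assumed; for brevity it sweeps N wrapped diagonals rather than the 2N − 1 plain ones, which preserves the token semantics): a flow is granted when it holds both its row and column token, and a granted flow stops both tokens.

```python
# A minimal sketch of the wrapped-diagonal token sweep.
def wavefront(R, top):
    """R[i][j]: flow (i,j) is backlogged. top: the top-priority diagonal,
    rotated between runs to provide a degree of fairness."""
    n = len(R)
    row_tok = [True] * n                 # horizontal tokens, one per row
    col_tok = [True] * n                 # vertical tokens, one per column
    match = {}
    for d in range(n):                   # sweep the n wrapped diagonals
        for i in range(n):
            j = (top + d - i) % n        # cell (i,j) on the current diagonal
            if R[i][j] and row_tok[i] and col_tok[j]:
                match[i] = j
                row_tok[i] = col_tok[j] = False   # stop both tokens
    return match

R = [[True, True, False],
     [True, False, False],
     [False, False, True]]
print(wavefront(R, top=0))   # {0: 0, 2: 2}: flow (0,0) wins the top diagonal
```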

    Traditionally, scheduling algorithms are evaluated for crossbar switches, using simulation. The

    simulated performance of the above algorithms is plotted in Fig. 2.12. We assume traffic of fixed size

    packets, uniformly destined over the outputs. Moreover, packet arrivals follow a Bernoulli process,

with probability corresponding to the traffic load. In Fig. 2.12(a), we plot average packet delay as a function of input load. We observe that PIM with one or two iterations is unable to sustain a load greater than 0.6 or 0.9, respectively. For larger loads, the VOQs keep growing in time, and the switch becomes unstable. On the other hand, iSLIP is stable for all loads, even using a single iteration. Furthermore, the Wavefront Scheduler performs significantly better than both iSLIP and PIM. However, performance converges as the number of iterations of PIM and iSLIP increases. In Fig. 2.12(b), we plot the standard deviation of delay, which serves as a metric for fairness. There, the Wavefront Scheduler performs

worse than iSLIP, while PIM falls between iSLIP and the Wavefront Scheduler. (In Fig. 2.12, the standard deviation of PIM delay is omitted for clarity.)

[Figure 2.12: Simulated (a) average delay and (b) standard deviation of delay of PIM, iSLIP, and the Wavefront Scheduler (WFS), as a function of input load; N = 128.]

    Finally, several other approaches to switch scheduling have been proposed. Marsan et al. [43]

    proposed extensions of the above algorithms for operation on variable length packets. Kar et al. [44]

    proposed a scheme that merges multiple external packets into large internal envelopes to relax the

running time of the scheduler. Kam and Siu [45] proposed weighted matching to provide

    Quality of Service (QoS) guarantees. Ahuja et al. [46] studied multicast scheduling.

    2.5 Routing

    In multipath space switches, an input-output pair can be connected through multiple alternative

    paths, intersecting with other paths connecting other pairs. Thus, once scheduling has resolved input

    and output contention, routing is needed to resolve contention for internal paths. Like scheduling,

    routing algorithms are heuristics, and they can be classified into the following three categories [47]:

Deterministic Routing algorithms make deterministic path selections. For example, in the Benes switch (Fig. 2.5), each input selects round-robin one of the available paths for each output, and in case of conflict with other inputs, a selection is made, e.g. again round-robin. Deterministic routing algorithms exploit path diversity ineffectively, thus suffering performance issues [47].

    Randomized Routingalgorithms randomly select an intermediate node, and then make a deter-

    ministic selection between the available paths. Randomized routing algorithms perform well

    under non-uniform traffic, but their performance degrades under uniform traffic [47].

    Adaptive Routingalgorithms aim at combining the merits of deterministic and randomized al-

    gorithms by using network state to select among paths. However, practical adaptive routing

    algorithms access only local state, thus often making globally sub-optimal decisions [47].
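To make the three categories concrete, the following minimal Python sketch (ours; it abstracts routing to choosing one of the middle-stage switches of a three-stage network) contrasts the three selection rules:

    import random

    class DeterministicRouter:
        """Round-robin over the middle-stage switches; state-oblivious."""
        def __init__(self, num_middle):
            self.num_middle = num_middle
            self.next_choice = 0

        def choose(self, src, dst):
            m = self.next_choice
            self.next_choice = (self.next_choice + 1) % self.num_middle
            return m

    class RandomizedRouter:
        """Pick a random intermediate switch, independent of traffic."""
        def __init__(self, num_middle, seed=0):
            self.num_middle = num_middle
            self.rng = random.Random(seed)

        def choose(self, src, dst):
            return self.rng.randrange(self.num_middle)

    class AdaptiveRouter:
        """Pick the least-loaded middle switch, using local occupancy
        counters only; hence possibly globally sub-optimal."""
        def __init__(self, num_middle):
            self.occupancy = [0] * num_middle

        def choose(self, src, dst):
            m = min(range(self.num_middle), key=self.occupancy.__getitem__)
            self.occupancy[m] += 1
            return m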

    2.6 Virtual Channels

Scheduling algorithms are widely used to coordinate crossbars. However, multipath switches are harder to coordinate because they need routing in addition to scheduling. Thus, multipath switches are only implicitly coordinated, using memories inside the switching elements. An example queued mesh is shown in Fig. 2.13, where we consider a scenario in which inputs B and I contend for output G; we assume dimension-ordered routing, and we show only the queues related to our scenario. Thus, packets from B follow the path B-C-G, and packets from I the path I-J-K-G. As memory space is finite, congestion starts from G and spreads backwards. Under credit flow control, this is realized by the fact that credits are consumed at the upstream nodes at a rate greater than the rate at which they are released by their downstream neighbors. In this way, rates are implicitly regulated.
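The credit mechanism itself fits in a few lines; below is a minimal, single-hop sketch (ours; the class and method names are hypothetical):

    from collections import deque

    class CreditLink:
        """One hop of credit-based flow control: the upstream node holds
        one credit per free downstream buffer slot and may send only
        while credits remain."""
        def __init__(self, buffer_slots):
            self.credits = buffer_slots
            self.queue = deque()

        def try_send(self, packet):
            if self.credits == 0:
                return False           # no credit: upstream stalls
            self.credits -= 1          # consume a credit on send
            self.queue.append(packet)  # packet occupies a downstream slot
            return True

        def downstream_dequeue(self):
            if not self.queue:
                return None
            self.credits += 1          # slot freed: release a credit
            return self.queue.popleft()

When the downstream node drains slower than the upstream node sends, try_send starts returning False and the stall propagates one hop further back; this is precisely the backward spreading of congestion from G described above.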


[Figure: a 3×4 array of nodes, A-D over E-H over I-L; output G is marked hot, output D idle.]

Figure 2.13: Example contention situation in a 12×12 queued mesh. Flows from B and I are contending for G, blocking a flow from I destined to an idle output D. Dimension-ordered routing is assumed.

Queues also resolve contention for internal paths. For example, a route from A to D, sharing the path segment B-C with the route from B to G, is throttled similarly.

A hard problem in queued networks is HoL blocking. In particular, in Fig. 2.13, a third connection, from I to D, is blocked behind the connection from I to G even when its destination D is idle. Unfortunately, it is infeasible to resolve this problem using VOQ: while it is affordable to implement O(N) queues at each input of the switch, per-flow queueing inside the switching elements requires O(N²) queues per switching element (for N = 128, that is 16384 queues in each element). Thus, in practice, heuristics are employed.

An example of such a heuristic is illustrated in Fig. 2.14(a), where the connection from I to D escapes blocking using a second network. Another solution is the popular Virtual Channels [30]. The main idea is that instead of duplicating the whole network, only the queues are duplicated, as in Fig. 2.14(b), while the links are shared; hence the name Virtual Channels. The comparison between network duplication and Virtual Channels is technology dependent, determined by the relative cost of links and memories.

    2.7 High Radix

A great limiting factor in today's chips is power consumption. A high-speed serial off-chip transceiver implemented as a differential pair at 3.125 Gbaud consumes on the order of 150 mW [48]. Thus, for a state-of-the-art port rate of 10 Gb/s, a few hundred ports cost on the order of a few tens of Watts. The only way to scale up to larger fabrics is by connecting switch chips in multistage topologies, like the Benes.
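The arithmetic behind this estimate is worth spelling out; the sketch below is our back-of-the-envelope calculation, and the four-lane port is our assumption (⌈10/3.125⌉ = 4 lanes per 10 Gb/s port):

    import math

    LANE_RATE_GBPS = 3.125   # one differential pair [48]
    LANE_POWER_W = 0.150     # per pair, order of magnitude from [48]
    PORT_RATE_GBPS = 10.0

    lanes_per_port = math.ceil(PORT_RATE_GBPS / LANE_RATE_GBPS)  # 4 lanes
    port_power_w = lanes_per_port * LANE_POWER_W                 # 0.6 W/port

    for ports in (64, 128, 256):
        print(f"{ports:4d} ports: {ports * port_power_w:6.1f} W of I/O power")
    # -> 38.4 W, 76.8 W, 153.6 W: transceiver power alone reaches tens of
    #    Watts well before a few hundred ports.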




Figure 2.14: Reducing blocking using (a) a duplicated network and (b) two Virtual Channels.

Kim et al. [15] showed that it is more effective to implement switching elements with a large number of slow ports rather than a small number of fast ports, because higher-radix switch chips enable lower-diameter fabrics. Their argument is illustrated in Fig. 2.15. Consider a radix-4 Benes switch, and suppose that advances in signaling technology double the IO rate that is feasible on a single chip. Then, there are two options to benefit. First, one can keep the radix of the switching elements and double the rate of their ports; this doubles the rate of the end ports of the fabric. Second, one can keep the port rate of the switching elements and double their radix; this merges multiple switching elements on a single chip, converting the Benes topology to a Clos topology. Notice that the end port rate doubles by arranging switching elements in parallel. (We assume that the chips at the end ports implementing demultiplexing and multiplexing introduce negligible overhead.) Thus, comparing the two options, the second one is better because it reduces hop count and chip count. Lower hop count translates to lower latency and power consumption, and lower chip count to lower cost [15].
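The effect of radix on fabric diameter can be quantified with a common approximation (ours, not a formula from [15]): a folded-Clos fabric built from radix-k chips, with half the ports facing the end nodes, spans N end ports in about 2⌈log_{k/2} N⌉ − 1 switch hops. A small sketch:

    def clos_stages(num_ports, radix):
        """Approximate hop count through a folded Clos built from
        radix-`radix` chips (radix/2 ports down, radix/2 up)."""
        down = radix // 2
        levels, reach = 1, down
        while reach < num_ports:   # grow the tree until it spans all ports
            reach *= down
            levels += 1
        return 2 * levels - 1

    N = 4096
    for radix in (4, 8, 16, 32, 64, 128):
        print(f"radix {radix:3d}: {clos_stages(N, radix):2d} hops for {N} ports")
    # Each doubling of the radix shrinks the hop count, and with it the
    # latency, power, and chip count of the fabric.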



    Figure 2.15: High radix switching elements exploit increases in chip IO throughput more effectively

    by reducing the diameter of the switch fabric [15].

    2.8 Summary

This chapter described basic concepts in switch design. An interconnection switch provides scalable and distributed communication between the parties of a digital system. Time switches, like the memory switch, implement communication by switching data through a single point in space at different moments in time. While time switches optimize resource utilization, they are not scalable. Space switches, like the crossbar and the Benes switch, are more scalable, as they provide parallel paths in space. However, space switches need scheduling and routing to resolve contention for these paths. Compared to other space switches, the crossbar has the advantage of simplifying routing by providing a private path for each input-output pair. However, switch scheduling is a hard problem on its own. Of the scheduling algorithms known in the literature, iSLIP is the most efficient, providing both high throughput and good fairness properties. Scheduling and routing can be simplified by using queues internal to the switching elements. Finally, high-radix switching elements are the right tradeoff in modern technology, because they reduce the diameter of the switch fabric.


    Chapter 3

    A Comparison of Architectures

    for High-Radix Switches

    In this chapter, we first overview basic switch architectures, like shared memory, block crosspoint

    queueing, output queueing, crosspoint queueing, and input queueing (section 3.1). We then com-

    pare the performance of the above architectures (section 3.2) and their feasibility using on-chip SRAM

    technology (section 3.3). We show that only the input queued crossbar scales to high radices, by min-

    imizing both the individual and the aggregate throughput of memories. Because the traditional in-

    put queued crossbar is difficult to schedule at high radices, we consider the combined input-output

    queued crossbar, which compensates for scheduling inefficiencies by moderately overprovisioning

the crossbar and the memories (section 3.4). Moreover, we compare the combined input-output queued crossbar to the hierarchically queued crossbar, recently proposed for high-radix switches, and we show that the combined input-output queued crossbar is advantageous because it gives better performance using only a moderate speedup on the crossbar and the memories, independent of radix (section 3.5). Finally, we discuss related work (section 3.6) and conclude (section 3.7).

    3.1 Basic Switch Architectures

We first overview four basic time switch architectures: (i) Crosspoint Queueing (XQ), (ii) Output Queueing (OQ), (iii) Block Crosspoint Queueing (BXQ), and (iv) Shared Memory (SM). Next, we compare these architectures to the Input Queued (IQ) space switch.

Crosspoint Queueing (XQ) The XQ architecture (Fig. 3.1(a)) switches packets from N inputs to N outputs using N² memories. Each input selects which memory to write its head packet to according to the destination output of that packet, and each output selects which memory to read a packet from according to a predetermined policy, e.g. weighted round-robin.



Figure 3.1: Time switch architectures. (a) Crosspoint Queueing (XQ). (b) Output Queueing (OQ). (c) Block Crosspoint Queueing (BXQ-2). (d) Shared Memory (SM).

Thus, XQ additionally uses one bus per input to route packets to the memories of that input, and N crosspoints coupled with one N:1 arbiter per output to forward packets to that output. Because all memories operate at the rate of the inputs and outputs, XQ can also be considered a degeneration of a time switch to a space switch. Summarizing, XQ uses N² memories, each with a throughput of 2R, for an aggregate of 2N²R. XQ designs were proposed by Abel et al. [13] and by Katevenis et al. [14].

Output Queueing (OQ) Compared to XQ, the OQ architecture (Fig. 3.1(b)) allows better sharing of memory space by merging the memories of each output into a single memory, dedicated to that output. The write access rate of that memory is NR, to resolve any contention between the inputs for the output. Thus, the crosspoints and arbiters of XQ are removed. Moreover, a single FIFO queue per memory suffices to provide both high output utilization and a degree of fairness. In summary, OQ uses N memories, each with a throughput of (N+1)R, for an aggregate of (N+1)NR. OQ designs were proposed by Yeh et al. [49].

Block Crosspoint Queueing (BXQ) The BXQ architecture (Fig. 3.1(c)) results from XQ by merging the k² memories between k distinct inputs and k distinct outputs into a single memory block. The write


access rate of that memory is kR, to resolve any contention between the k inputs for the k outputs, and the read access rate is also kR, to fully utilize the k outputs. Each memory block must also implement at least one FIFO per local output, to remove HoL blocking. Thus, the inputs of the same memory block share the space of that block, in analogy to OQ. The space of the block can also be shared between its outputs. However, this type of sharing increases complexity by requiring queues to be implemented as linked lists. In contrast, to implement FIFO queues, simple circular arrays suffice. Finally, BXQ uses N/k crosspoints coupled with one N/k:1 arbiter per output to multiplex the memory blocks of that output. Thus, BXQ is a combination of a time and a space switch. Summarizing, BXQ-k uses (N/k)² memories, each with a throughput of 2kR, for an aggregate of 2N²R/k.

Shared Memory (SM) By varying the parameter k of BXQ from 1 to N, we get intermediate solutions from complete partitioning (XQ) to complete sharing; the latter solution is widely known as shared memory (SM, Fig. 3.1(d)). Thus, SM uses a single memory with a throughput of 2NR. SM designs were proposed by Devault et al. [22], by Katevenis et al. [23], and by Kozaki et al. [24]. (We described SM in more detail in chapter 2.)

Now let us compare the above time switches to the Input Queued (IQ) space switch (Chapter 2). An IQ switch uses N memories, like OQ, but it places these memories at the inputs instead of the outputs. Thus, the physical partitioning of memories in IQ is analogous to OQ. However, while in OQ sharing is between the inputs across the outputs, in IQ sharing is between the outputs across the inputs. As a consequence, like BXQ and SM, IQ also needs linked lists to implement sharing. Moreover, the aggregate memory throughput of IQ is 2NR, equal to that of SM. Finally, while in time switches rates are allocated through memory space, IQ strongly depends on the efficiency of scheduling. While IQ can use any space switch, in the rest of this thesis we will consider that IQ uses a crossbar, to simplify routing.

Fig. 3.2 shows a conceptual derivation of the above architectures through alternative groupings of N² memory blocks. Observe that the throughput of each memory is proportional to the periphery of the rectangle enclosing the blocks, while space is proportional to the area of that rectangle. Fig. 3.2, as well as the observation on throughput geometry, is copied from the transparencies of the Packet Switch Architecture class at the University of Crete [18]. Also copied from there is the metric of aggregate memory throughput. However, in section 3.3, we contribute a practical application of that metric. In particular, we show that the minimum total memory area to implement a switch architecture is proportional to the aggregate memory throughput of that architecture. Thus, architectures like XQ are costly.


[Figure: the N×N grid of crosspoint memory blocks, grouped per crosspoint (XQ), per block (BXQ-4), per output (OQ), per input (IQ), and all together (SM).]

    Figure 3.2: Derivation of switch architectures.

    3.2 Memory-Sharing Merits

Switches allowing better sharing of memory space can improve performance under a broad range of traffic conditions by allocating memory space on demand, thus virtually increasing memory space. Equivalently, the more the sharing, the less memory space is needed to achieve a fixed level of performance. We first examine the effect of memory sharing on performance by comparing through simulation the time switch architectures described in the previous section 3.1. Next, we study a second type of memory sharing, which we call queue sharing.

In order to quantify the effect of memory space sharing on performance, we evaluate the rate of packet losses as a function of memory space under Bernoulli traffic of fixed-size, uniformly destined packets. In this approach, the better the sharing of memory space, the less memory space is needed to achieve a fixed packet loss rate [26]. While real traffic patterns may be considerably more stressful, including bursts and hot spots, the results described in this section are fundamental, and likely to be found within more complicated scenarios as well. Packet loss rates are plotted in Fig. 3.3 for a range of link loads. First, we observe that packet loss rates increase with load in all architectures, as contention for switch outputs increases correspondingly, and more packets have to be queued. Second, we observe that architectures allowing better sharing of memory space require a smaller memory space to achieve a given packet loss rate. Thus, XQ has the worst performance and SM the best, while OQ falls in between, and BXQ is better or worse than OQ depending on block size. Finally, at high loads XQ requires about 5× larger memory space to achieve the performance of SM.
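A toy version of this experiment is easy to reproduce. The sketch below is ours; it models only the two extremes, per-output partitioning (as in OQ) versus one shared pool (as in SM), at a much smaller scale than the thesis simulator, yet it exhibits the same qualitative gap:

    import random

    def loss_rate(n, slots, load, shared, space_per_output, seed=0):
        """Drop rate of an n x n switch: each input receives a packet with
        probability `load` per slot, destined uniformly over the outputs;
        each output drains one packet per slot.  With shared=True all
        outputs draw on a common pool of n*space_per_output packets;
        otherwise each output owns space_per_output packets."""
        rng = random.Random(seed)
        occ = [0] * n
        dropped = offered = 0
        for _ in range(slots):
            for _inp in range(n):
                if rng.random() < load:
                    offered += 1
                    out = rng.randrange(n)
                    if shared:
                        room = sum(occ) < n * space_per_output
                    else:
                        room = occ[out] < space_per_output
                    if room:
                        occ[out] += 1
                    else:
                        dropped += 1
            for out in range(n):          # one departure per output
                if occ[out]:
                    occ[out] -= 1
        return dropped / max(offered, 1)

    for shared in (False, True):
        r = loss_rate(n=8, slots=200000, load=0.9, shared=shared,
                      space_per_output=12)
        print("shared pool" if shared else "partitioned", f"loss rate = {r:.1e}")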


[Plots omitted: packet loss rate (log10) versus memory space (packets), at loads 0.70, 0.80, 0.90, and 0.99; curves for XQ, BXQ-2, OQ, BXQ-4, and SM.]

Figure 3.3: Packet loss rates as a function of memory space. N = 8 and memory space is the total divided by N. BXQ and XQ use round-robin arbiters. Simulated time is 10⁶ packet times.


Figure 3.4: Queueing delay as a function of input load. N = 8 and memory space is infinite. IQ uses VOQ and 3-iteration iSLIP (3SLIP).

In Fig. 3.4, we plot queueing delay when memory space is infinite. Then, XQ, OQ, BXQ, and SM all degenerate to an M/D/1 queueing system [6]. We also plot the performance of IQ using VOQ and 3SLIP. At low loads, performance is comparable for all switches, as there is low contention, and only few packets are queued.



Figure 3.5: Queue sharing in a 3×3 switch with a total memory space of 30 queues per input. (a) In IQ, queues are flexibly allocated on demand. (b) In XQ, queues are statically partitioned per crosspoint. IQ increases performance by reducing blocking.

At high loads, delay in IQ is significantly higher, as packets are contending for both the inputs and the outputs, while in any time switch contention is for the outputs only.
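For reference, the common curve onto which the time switches collapse in Fig. 3.4 has a closed form. With load ρ and delay measured in packet times, the standard M/D/1 result (stated here in our notation) is

    T(\rho) \;=\; 1 + \frac{\rho}{2\,(1-\rho)},

i.e. delay stays near one packet time at low load and diverges as ρ → 1, matching the lower curve of Fig. 3.4.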

Finally, memory sharing affects the performance of queue sharing mechanisms in larger fabrics [30][50]. In particular, consider that the 3×3 switches of Fig. 3.5 are elements of a larger Clos fabric. Also consider that technology allows a maximum memory space of 30 queues per switch input. In IQ, queues are concentrated at the inputs; thus, there are 30 queues at each input. In XQ, queues are distributed over the crosspoints; thus, there are 10 queues at each crosspoint. Hence, when same-input flows are unevenly distributed over the outputs, queues are better utilized in IQ, and blocking is reduced. Notice that this queue sharing is a variation of the byte sharing we described above.

    3.3 On-Chip SRAM Cost

In this section, we evaluate implementation cost in on-chip SRAM technology. We consider a 90 nm CMOS process, where SRAM is available in blocks, and we decide the feasibility of a switch architecture by evaluating two metrics: (i) the total silicon area to implement all memories; and (ii) the individual memory width. State-of-the-art technology constrains the core of a chip to less than 400 mm², and smaller chips are typically less expensive [51]. Moreover, the budget for memories is a major cost. For example, Katevenis et al. [14] described a 180nm-CMOS implementation of XQ, where memory area accounted for as much as 70% of the switch core. Thus, we bound the feasible memory area to less than 200 mm². On the other hand, memory throughput expands by increasing its word width. When this becomes larger than the maximum width of the available SRAM blocks, one can arrange multiple SRAM blocks in parallel. In any case, memory width is bounded by the minimum external packet size. We will consider a maximum width of 64 Bytes, corresponding to minimum Ethernet packets [52]. Summarizing, a switch architecture is feasible when (i) total memory area is smaller than 200 mm², and (ii) individual memory width is smaller than 64 Bytes.



Figure 3.6: On-chip memory performance in 90nm CMOS. (a) Single-port SRAM block area (2-port blocks are 20% to 60% larger, and are omitted here), (b) speed as a function of block height, (c) total memory capacity fitting in 200 mm² as a function of block size, and (d) memory throughput per pin as a function of block size.

We first plot 90nm CMOS SRAM block performance in Fig. 3.6. Word width varies from 16 bits to 128 bits, and the number of words from 256 to 16K. We observe in Fig. 3.6(a) that as block size increases, block area increases as well, to accommodate more SRAM bit cells. However, area increases sub-linearly. The reason is that larger blocks are more area efficient, because their peripheral overhead (e.g. address decoders, column multiplexors, sense amplifiers) is amortized over a larger core. Furthermore, in Fig. 3.6(b), we observe that as block size increases, SRAM blocks become slower, as internal bit-line and word-line capacitances increase accordingly (the smaller, the faster [51]). In Fig. 3.6(c), we show the memory capacity we can fit in 200 mm². We observe that, except for corner cases, capacity depends mainly on the size of SRAM blocks, rather than their configuration; corner cases are wide and small blocks, where extra sense amplifiers result in disproportionate overheads. We also observe that using the largest blocks, it is feasible to implement as many as 100 Mbits of memory space.



Figure 3.7: (a) Memory throughput and space expansion, and (b) memory space expansion. A and X denote the area and the throughput of the memory, and a and x the area and throughput of an individual block, respectively.

Finally, in Fig. 3.6(d), we show the throughput of each memory block in Mb/s per pin. Again, throughput depends mainly on block size, rather than block configuration. Moreover, the maximum throughput we can get from a single SRAM block is less than 60 Gb/s, using the smallest blocks.

To build a memory of throughput X, we need ⌈X/x⌉ parallel SRAM blocks, where x is the throughput of one SRAM block (see Fig. 3.7(a)). If a is the area of one SRAM block, memory area is ⌈X/x⌉ × a. Then, if M is the number of memories, total memory area is M × ⌈X/x⌉ × a; this must be smaller than 200 mm². Notice that total memory area is proportional to aggregate memory throughput. As a consequence, aggregate memory throughput suggests a realistic cost metric. Finally, if w is the block width, the width of each memory is ⌈X/x⌉ × w. Summarizing,

    Total Memory Area = M × ⌈X/x⌉ × a,
    Memory Width = ⌈X/x⌉ × w,

where x, a, and w are the throughput, area, and width of one SRAM block, respectively.

Table 3.1 summarizes the values of X and M for each switch architecture. By substituting the SRAM cost numbers of Fig. 3.6, we get the SRAM cost of each switch architecture.


Table 3.1: Number of memories (M) and throughput per memory (X). R = 10 Gb/s.

    switch class:   XQ      OQ        BXQ-k     SM      IQ
    M               N²      N         (N/k)²    1       N
    X               2R      (N+1)R    2kR       2NR     2R
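Table 3.1 and the formulas above combine into a small feasibility calculator. The sketch below is ours, and the SRAM block figures in it are hypothetical stand-ins for points read off Fig. 3.6:

    import math

    R = 10e9  # port rate: 10 Gb/s

    def memory_cost(M, X, block_tput, block_area_mm2, block_width_bits):
        """Apply the formulas above: ceil(X/x) parallel blocks per memory."""
        blocks = math.ceil(X / block_tput)
        return M * blocks * block_area_mm2, blocks * block_width_bits

    def architectures(N, k=8):
        """(M, X) per architecture, per Table 3.1 (BXQ shown for k = 8)."""
        return {
            "XQ":  (N * N,         2 * R),
            "OQ":  (N,             (N + 1) * R),
            "BXQ": ((N // k) ** 2, 2 * k * R),
            "SM":  (1,             2 * N * R),
            "IQ":  (N,             2 * R),
        }

    # Hypothetical block: ~50 Gb/s of throughput, 0.05 mm^2, 32-bit words.
    x, a, w = 50e9, 0.05, 32

    for N in (32, 128):
        for name, (M, X) in architectures(N).items():
            area, width = memory_cost(M, X, x, a, w)
            feasible = area < 200 and width <= 64 * 8  # 64-Byte width limit
            print(f"N={N:3d} {name:4s} area={area:7.1f} mm^2 "
                  f"width={width:5d} b  {'feasible' if feasible else 'infeasible'}")

With these made-up block figures, the calculator reproduces the trends of Fig. 3.8 and 3.9: XQ fails on area, OQ and SM fail on width, while IQ stays feasible at N = 128.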



    Figure 3.8: Minimum total area to implement the memories of each switch architecture as a function

of N. Area is proportional to aggregate memory throughput. Only IQ and SM scale above 100 ports.


    Figure 3.9: Minimum width of an individual memory of each of the switch architectures. Memory

    width is proportional to memory throughput. Only IQ and XQ scale above 100 ports.

Total memory area and individual memory width are plotted in Fig. 3.8 and Fig. 3.9, respectively. We observe in Fig. 3.8 that area grows as O(N²) for XQ and OQ, and as O(N) for IQ and SM, following the aggregate throughput of memories. Thus, area is the same for both IQ and SM, well below 200 mm². On the other hand, XQ and OQ do not scale above 32 and 64 ports, respectively, and to scale to these radices they must use small SRAM blocks. Furthermore, we observe in Fig. 3.9 that memory width grows as O(N) for SM and OQ, while it remains constant, independent of N, for IQ and XQ, following the individual throughput of memories.



Figure 3.10: Minimum total area to implement the memories of BXQ-8 as a function of N. BXQ does scale above 100 ports, but only using the smallest SRAM blocks.


Figure 3.11: Total memory capacity in 200 mm² for each of the switch architectures.

Thus, for IQ and XQ, memory width is smaller than the external packets, for all N, while for SM and OQ it grows quickly, limiting these architectures to radices below 8 or 16, respectively. Combining the plots of Fig. 3.8 and 3.9, we conclude that, of IQ, XQ, OQ, and SM, only IQ scales to radices above 100.

Another scalable architecture is BXQ. We plot the total memory area for BXQ-8 in Fig. 3.10. In this plot, BXQ uses memory blocks corresponding to the largest feasible SM. We observe that BXQ does scale above 100 ports, but it can do so only using the smallest SRAM blocks. However, this has a negative impact on memory density, as plotted in Fig. 3.11. In this plot, each architecture has the density of the


maximum SRAM block it can use (When memory area is smaller than 200 mm², spare space is utilized

    t