
CROSSBAR ARCHITECTURES FOR VLSI SYSTEMS:

A COMPARATIVE STUDY

A Thesis

Presented in Partial Fulfillment of the Requirements for the

Degree of Master of Science

with a

Major in Electrical Engineering

in the

College of Graduate Studies

University of Idaho

by

Enrique Coen-Alfaro

April 2004

Major Professor: Gregory W. Donohoe


AUTHORIZATION TO SUBMIT

THESIS

This thesis of Enrique Coen-Alfaro, submitted for the degree of Master of Science with a

major in Electrical Engineering and titled “Crossbar Architectures for VLSI Systems: A

Comparative Study,” has been reviewed in final form. Permission, as indicated by the

signatures and dates given below, is now granted to submit final copies to the College of

Graduate Studies for approval.

Major Professor _______________________________________Date___________

Gregory W. Donohoe

Committee

Members ________________________________________Date___________

James F. Frenzel

________________________________________Date___________

Robert Rinker

Department

Chair _________________________________________Date___________

Joseph J. Feeley

Engineering

College Dean _________________________________________Date___________

David E. Thompson

Final Approval and Acceptance by the College of Graduate Studies

_________________________________________Date___________

Katherine G. Aiken


ABSTRACT

Programmable interconnect is responsible for much of the versatility in reconfigurable

hardware. Crossbars are often used to implement programmable interconnect for a variety

of applications. This study focuses on crossbar design for integrated circuits. It compares

three alternative crossbar designs: (1) a full crossbar or simple mesh, (2) a synthesized,

multiplexer-based crossbar, and (3) a Benes network. Crossbar circuits are compared in

terms of transistor count and area, throughput, power consumption, reliability, and

programming requirements. Electrical comparisons are based on transistor level simulations

of the circuits. A variety of application scenarios are discussed. For these application

scenarios, advantages and disadvantages of each crossbar architecture are presented.

Increasing crossbar size affects all the metrics for comparison. These effects of scaling are

also considered within this report. A programming algorithm is presented for the Benes

network.


TABLE OF CONTENTS

Authorization to Submit Thesis
Abstract
Table of Contents
List of Figures
List of Tables
Introduction

Chapter 1: Reconfigurable Interconnect for Application Specific Integrated Circuits
1.1 Context and motivation
1.2 Previous work on crossbar architectures
1.3 Crossbar general concepts
1.4 Overview of the experiment

Chapter 2: Physical Design Considerations for Crossbars in VLSI Systems
2.1 Crossbar design goals and applications
2.2 Transistor sizing and delay
2.3 Power consumption
2.4 Noise immunity, leakage current, and other second order effects
2.5 Reliability
2.6 Programming, Routability and Pin count
2.7 Ground rules for crossbar study

Chapter 3: The Simple Mesh Crossbar
3.1 General circuit description
3.2 Transistor count and area
3.3 Delay estimations
3.4 Power consumption
3.5 Other considerations

Chapter 4: Multiplexer Based Crossbar
4.1 General circuit description
4.2 Transistor count and area
4.3 Delay estimations
4.4 Power consumption
4.5 Other considerations

Chapter 5: The Benes Network
5.1 General circuit description
5.2 Transistor count and area
5.3 Delay estimations
5.4 Power consumption
5.5 Other considerations

Chapter 6: Architecture Comparison
6.1 Area and scalability
6.2 Throughput
6.3 Power consumption
6.4 Architectural efficiency and programming complexity
6.5 Application scenarios
6.6 Conclusions and future work

References


LIST OF FIGURES

Figure 1.1: Relative sizes of metal wires and transistors
Figure 1.2: Blocking and non-blocking networks
Figure 1.3: Basic crossbar
Figure 1.4: Multicasting network
Figure 1.5: Fully connected vs. partially connected network
Figure 1.6: Full crossbar, mesh approach
Figure 1.7: Automatically synthesized crossbar
Figure 1.8 (a): 4x4 Benes network. Select lines removed for clarity
Figure 1.8 (b): Schematic diagram of a Butterfly switch
Figure 2.1: Electromigration in aluminum wires
Figure 2.2: Grain structures on VLSI metal wires
Figure 2.3: Obtaining Spice simulations from Cadence layout
Figure 2.4: Circuit for measuring crossbar delay
Figure 2.5: Circuit for measuring delay across two inverters
Figure 3.1: 4x4 simple mesh crossbar
Figure 3.2: Layout for simple mesh crossbar
Figure 3.3: Transistor circuit for one I/O connection, using a full transmission gate
Figure 3.4: Scaling trend for simple mesh crossbar
Figure 3.5: Rising and falling transitions using an NMOS switch
Figure 3.6: Average delay for pass transistor version of the simple mesh crossbar
Figure 3.7: Average delay for the full pass gate version of the simple mesh crossbar
Figure 3.8: Itot for mesh with NMOS switches
Figure 3.9: Load seen by one crossbar input
Figure 3.10: Output voltage degradation for the AMI crossbar
Figure 3.11: Output voltage degradation for the ULP crossbar
Figure 3.12: Total current for mesh with full pass gates as switches
Figure 4.1: VHDL code for 4x4 bidirectional crossbar
Figure 4.2: Synthesized 4x4 bidirectional crossbar
Figure 4.3: Four to one MUX implementation
Figure 4.4: Layout for the synthesized 4x4 crossbar
Figure 4.5: Transistor circuit for a two-to-one multiplexer
Figure 4.6: Scaling trend for synthesized crossbar
Figure 4.7: Average delay for the synthesized crossbar
Figure 4.8: Power consumption vs. frequency for the 4x4 synthesized crossbar
Figure 4.9: Power consumption vs. frequency for the 8x8 synthesized crossbar
Figure 4.10: Power consumption vs. frequency for the 16x16 synthesized crossbar
Figure 5.1: Behavior of the butterfly switch
Figure 5.2: 4x4 Benes network
Figure 5.3: 8x8 Benes network
Figure 5.4: 16x16 Benes network
Figure 5.5: Layout for the 4x4 Benes network
Figure 5.6: Butterfly switch using full pass gates
Figure 5.7: Scaling trend for Benes network
Figure 5.8: Average delay for the pass transistor version of the Benes crossbar
Figure 5.9: Average delay for the full pass gate version of the Benes crossbar
Figure 5.10: Power consumption vs. frequency for the 4x4 Benes crossbar using NMOS
Figure 5.11: Algorithm for configuring Benes networks
Figure 5.12: Benes crossbar and configuration matrix
Figure 6.1: Transistor count vs. Crossbar size
Figure 6.2: Layout area vs. Crossbar size
Figure 6.3: Signal delay using AMI transistor models
Figure 6.4: Signal delay for crossbars using ULP transistor models
Figure 6.5: Worst case wire length
Figure 6.6: Static power consumption for crossbars using ULP transistor models
Figure 6.7: Architectural efficiency for different crossbars


LIST OF TABLES

Table 2.1: Delay across two inverters driving a 0.1 pF capacitance
Table 3.1: Transistor Count and Area
Table 3.2: Delay summary for simple mesh crossbar using NMOS transistors as switches
Table 3.3: Delay summary for simple mesh crossbar using full pass gates as switches
Table 3.4: Average parasitic capacitance for simple mesh crossbar
Table 3.5: Worst case wire lengths
Table 3.6: Power consumption summary for simple mesh using NMOS switches
Table 3.7: Power consumption summary for simple mesh using full pass gate switches
Table 3.8: Architectural efficiency for the simple mesh
Table 4.1: Transistor count and layout area
Table 4.2: Delay summary for multiplexer-based crossbar
Table 4.3: Worst case wire lengths
Table 4.4: Power consumption summary for synthesized crossbar
Table 4.5: Architectural efficiency for synthesized crossbar
Table 5.1: Transistor Count and Area
Table 5.2: Delay summary for Benes network using NMOS transistors as switches
Table 5.3: Delay summary for Benes network using full pass gates as switches
Table 5.4: Average parasitic capacitances for Benes crossbar
Table 5.5: Worst case wire lengths
Table 5.6: Power consumption summary for Benes networks using NMOS transistors
Table 5.7: Power consumption summary for Benes crossbar using full pass gates
Table 5.8: Minimum frequency for ULP power savings
Table 5.9: Architectural efficiency for the Benes crossbar


INTRODUCTION

Reconfigurable hardware provides a low cost solution for Application Specific Integrated

Circuits (ASICs). Such hardware owes much of its versatility to programmable

interconnect. Crossbars are often used to implement this programmable interconnect. This

study focuses on crossbar design for very large scale integrated (VLSI) circuits. It compares

three crossbar designs: (1) a full crossbar or simple mesh, (2) a synthesized, multiplexer-

based crossbar, and (3) a Benes network. The comparisons are based on transistor count,

area, throughput, power consumption, reliability, architectural efficiency, and programming

requirements.

Varying circuit size has a significant impact on overall crossbar performance, as defined

by the metrics for comparison listed above. All crossbar designs were laid out for three

different sizes: a 4x4 crossbar, an 8x8 crossbar, and a 16x16 crossbar. Scaling trends for

various crossbar parameters were analyzed. Advantages and disadvantages of each crossbar

architecture are presented, for a variety of application scenarios.

Electrical comparisons are based on transistor level simulations of the circuits.

Transistor level netlists include parasitic capacitances extracted from silicon layout.

Synopsys® and Cadence® design tools were used to generate the layout. Transistor level

simulations were done using Smartspice®.

Chapter 1 frames the problem of reconfigurable interconnect in VLSI systems, and

defines the terminology associated with crossbars. Chapter 2 summarizes the physics

necessary to understand the different application scenarios. It also describes the simulation

procedure, and the ground rules for valid comparisons across architectures. Chapter 3

describes the simple mesh crossbar. Chapter 4 presents the multiplexer-based crossbar.

Chapter 5 is devoted to Benes networks. Chapter 6 compares the three architectures,

providing conclusions and recommendations for future work.


CHAPTER 1: RECONFIGURABLE INTERCONNECT FOR APPLICATION

SPECIFIC INTEGRATED CIRCUITS

1.1 Context and motivation

The relentless proliferation of integrated circuit applications has created a vast market for

application specific integrated circuits (ASICs). Low volume fabrication runs

remain the most expensive step in ASIC design. Therefore, reconfigurable computing has

emerged as the most cost-effective solution for many applications. Moreover, the possibility

of reconfiguring a device to perform a variety of tasks reduces the amount of costly and time

consuming circuit design. Likewise, in-system reconfiguration enables the same hardware

to be used for a variety of different problems, achieving some of the flexibility of software

with the performance of dedicated hardware.

The onset of this reconfigurable computing trend dates back to the mid 1980s, when field

programmable gate arrays (FPGAs) were introduced. At the time, their main application

was rapid prototyping of “glue logic”, and, eventually, rapid prototyping of full digital

systems. Growing densities and improved CAD tools eventually made FPGAs very

attractive for applications like hardware acceleration, logic emulation and custom

(reconfigurable) computing, among others [2]. Beyond FPGAs, the next generation in

reconfigurable computing includes field programmable processor arrays (FPPAs). The

difference between FPGAs and FPPAs is their level of granularity. While FPGAs allow

reconfiguration of the system at the gate level, FPPAs provide higher level functional

blocks, such as adders and multipliers, which can be interconnected to perform a variety of

tasks.


Whether it be FPGAs, complex programmable logic devices (CPLDs) or field

programmable processor arrays (FPPAs), the versatility of reconfigurable computing is

highly dependent on programmable interconnect. In fact, interconnect takes up to 90%

of the area of an FPGA [8]. In order to improve performance and reduce power

consumption, programmable devices are built on a hierarchical architecture, with multiple

partitions sharing central crossbars [16,21,22].

Although the literature on FPGAs usually describes the high-level architectures in detail,

there is very little information on the circuit-level or the physical design of the devices,

because much of the information is proprietary [6]. Considering that reconfigurable devices

may be used in such a wide range of applications, it is an interesting problem to determine

whether certain circuit implementations are better suited to achieve specific design

objectives, such as high speed, low power consumption, or radiation tolerance.

Several controversies surround the design of reconfigurable interconnect for very large

scale integration (VLSI) systems. There is an ongoing debate regarding the degree to which

interconnect related issues will dominate integrated circuit (IC) design in the future.

Considering that metal conductor geometries have not shrunk at the same rate as transistor

geometries, many would agree that scaling into deep submicron technologies has shifted the

paradigm from device-dominated to interconnect-dominated design methodology. The

impact of metal wires on current VLSI systems is most obvious in terms of circuit size, as

figure 1.1 illustrates.


Figure 1.1: Relative sizes of metal wires and transistors.

Transistors used to make up about 30% of total circuit volume, with interconnect and

field oxide accounting for the remaining 70%. In 2004, transistors take up less than 2% of

the total circuit volume.

As technology continues to scale towards higher densities, the area of the logic will

decrease, but the relative effect of the delays due to wiring will increase [5]. According to

Cheng et al. [4],

“Increasing circuit sizes on the one hand, and shrinking design rules on the other, have

shifted ICs and MCMs from being device dominated to being interconnect dominated. In

deep submicron technology, wire delay can no longer be ignored, and must be factored

prominently into any overall performance metric.”


In contrast to such dire warnings about the relevance of interconnect in IC performance,

current CAD tools and wire models do not provide accurate assessments of the effects of

wiring on circuit speed and power consumption. The available models seem particularly

simplistic when attempting to describe heavily wired circuits, such as crossbars and

multistage interconnection networks, which are at the core of any reconfigurable computing

solution. Apparently, current CAD tools have not yet caught up with the demands of deep

submicron design. Yet, most experts agree that there is a trend. As feature sizes scale more

aggressively in the silicon than in the metal, the role of interconnect in critical design criteria

becomes more significant. Moreover, clock rates have increased far beyond 100 MHz,

where simple resistance-capacitance (RC) wire models do not capture the transmission line

behavior of metal lines. Although there is controversy about the magnitude of this impact,

there is consensus that technology is evolving towards a scenario where interconnect

dynamics will play a larger role in IC performance. Even if we have not reached the point

where interconnect-related effects severely limit circuit throughput or power consumption

for most applications, it is important to note that reconfigurable computing relies on

interconnect intensive circuitry. Thus, it has become relevant to look more closely at the

issue of programmable interconnect performance, and its effect on reconfigurable VLSI

systems.

The second controversy surrounding reconfigurable interconnect deals with the optimal

routing architecture for a given application. Here, routing architecture refers to the

organization of wire segments and switches in a reconfigurable network.

Routing architecture has a strong effect on the speed, cost and routability of any system. In

fact, for most reconfigurable computing applications, routing architecture is the key

determining factor of system speed and logic density, because programmable switches have

significant resistance and capacitance. These switches also require additional area not used

in fixed interconnection systems [5,6,15].


The increasing effect of delays due to wiring mentioned above will create further

demands on the routing architecture. In particular, the amount of logic available on a single

reconfigurable chip will also increase, resulting in larger systems being built [5,6].

Architecturally, there will always be a need for innovation in logic and routing structures,

and thus it becomes relevant to ask which architecture of reconfigurable interconnect

provides the best solution for a particular application.

Many routing architecture alternatives have been studied in a variety of contexts. The

problem of reconfigurable interconnect is akin to the problem of switching networks for

telephone systems. In both cases, the goal is to provide a pathway that connects point A to

point B. Furthermore, as in a telephone system, it is desirable for a VLSI interconnect block

to guarantee that the establishment of a particular connection does not prevent other

connections between previously unconnected points. As a result, several routing

architectures for VLSI applications are based on telephone switching network schemes, such

as those proposed by Clos and Benes [1,7].

Although research has been done to assess the performance of different routing

architectures for telecommunications, studies dealing specifically with hardware crossbar

networks for VLSI applications are scarce. Permutation networks using a variety of

structures have been studied in theory quite extensively. Although delay boundaries in

terms of number of switches and path length have been established, there seems to be no

study based on transistor level simulation of the devices. Such a simulation might yield

important information about the behavior of the actual physical circuits. The absence of

circuit level simulations may be due to the lack of appropriate wire models mentioned

above. Notice that assuming wire delays to be zero, which is a very common default value

in most CAD tools, completely neglects any effect that physical routing might have on

circuit performance.


1.2 Previous work on crossbar architectures

Gaudet et al. [11] perform a comparison of crossbar architectures similar to the ones

presented in this study. However, their comparisons focus on the application of crossbars to

the design of iterative decoders. The present document explores the viability of several

crossbar architectures in the more general realm of reconfigurable computing. Furthermore,

[11] provides coarse capacitance estimations based on

transistor count, but there are no dynamic timing simulations to validate the theoretical

calculations. In contrast, the present report includes delay estimations based on actual

layout of the crossbars, for two different fabrication processes. Circuit level simulations are

provided to illustrate the dynamic timing results. The simulation data should help expand

the observations put forth in more theoretical approaches. By bringing the theoretical and

empirical aspects together, the VLSI engineer will gain insight into the most appropriate

interconnection scheme for a particular application.

The published literature regarding reconfigurable interconnect provides a wealth of

theory, but it is lacking in practical comparisons. Empirical studies that evaluate the

implementation of real circuits on different architectures would provide a clearer picture of

the advantages of each architecture relative to others [11]. Thorough empirical studies are

limited by cost, and observability of the phenomena being studied. In the context of

reconfigurable interconnect, simulations are limited by wire models. In particular, situations

like the high leakage phenomena associated with ultra low power CMOS technologies may

be inadequately modeled in commercially available CAD tools. As the relevance of

interconnect analysis grows, CAD developers will need to provide better wire models.

In order to offer a statistically meaningful performance assessment for any reconfigurable

computing architecture, it is particularly important that a good set of benchmark circuits be

available. With the trend toward larger systems-on-a-chip, true comparative studies can


only have merit when they use circuits that reflect a wide span of scenarios. Although a few

studies provide speed comparisons based on benchmarks at the board level, no set of

benchmarks has been established to gauge the performance of reconfigurable interconnect

within a single chip [5,15]. Furthermore, the development of reliable benchmarks for

reconfigurable interconnect architectures falls outside the scope of this project.

Consequently, the simulations performed here serve only to illustrate circuit behavior under

specific conditions. The simulation results presented in this document suggest a behavioral

pattern, which, coupled with theoretical considerations, leads to the architectural

recommendations issued in this study.

However, it must be noted that the data gathered are insufficient for a proper statistical

analysis. In spite of the numerical data and calculations presented, the results of this study

should be regarded as qualitative, rather than quantitative.

As mentioned above, the goal is to identify which interconnect structures are best suited

for specific environments and applications. In particular, three alternatives for building

crossbars are compared: a standard crosspoint matrix approach, a synthesizable

combinational logic approach, and a butterfly switch approach, based on Benes networks

[1]. This report points out advantages and disadvantages of each architecture, and ultimately

makes recommendations as to which architectures should be considered for design in

specific situations.

Among the metrics for comparison, aspects such as area, speed, power consumption,

flexibility, fabrication complexity and programming complexity have been evaluated. The

comparison also takes into account the effects of varying the number of nodes that the

crossbar must connect, as well as performance variations across different fabrication

processes.


Our observations emphasize circuit level simulation results, incorporating as much real

layout information as possible. The purpose of such emphasis is to gain insight into

electrical aspects of reconfigurable interconnect design that have not been compared in

previous similar studies. Despite the relative inaccuracy of wire models in traditional CAD

tools, studying dynamic timing behavior illuminates circuit characteristics that may be

crucial in certain environments. Moreover, the comparisons have shed light on certain

specific shortcomings of current wire models, which are also discussed.

The case of Benes networks is particularly interesting. DeHon first explored the

theoretical advantages of programmable interconnect using non-blocking Benes networks

[8]. Additionally, [2] and [11] present routing architectures based on Benes networks for IC

applications. However, none of these references provides a VLSI implementation of their

networks. This thesis takes the next step in asserting the

feasibility of Benes networks in VLSI systems. First, it provides a transistor level circuit

design for non-blocking Benes networks, complete with silicon layout. It then incorporates

the layout information into dynamic simulation results through capacitance extraction.

Beyond the realm of physical circuit realization, any new design must provide the

appropriate implementation tools to configure the circuit [5,6]. For the proposed Benes

network, the activation and deactivation of the butterfly switches to achieve specific

interconnection patterns is not a trivial problem. Thus, an algorithm for programming the

butterfly switch network is included.

1.3 Crossbar general concepts

The purpose of this section is to introduce the basic terminology used to describe the

different interconnect structures being compared. The concepts presented here are intended


only as an introduction. Additional terms will be defined in the sections where they become

relevant.

In a blocking interconnection network, establishing certain connections between inputs

and outputs prevents some inputs from being connected to other outputs. In non-blocking

structures, once any specific connection has been established, there is still at least one path

that leads from any unconnected input to any unconnected output. In figure 1.2, each black

bubble represents a valid connection port. The solid lines represent connections that have

been made. The dashed lines represent possible alternate connections. If we restrict each

connection port to provide one and only one input/output (I/O) connection, the top network

is non-blocking. Meanwhile, in the array at the bottom, establishing a connection between

In0 and Out1 prevents the possibility of connecting In1 to Out0. The bottom diagram

illustrates a blocking network.
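The distinction can be illustrated with a small behavioral sketch. It is not taken from the thesis: the network descriptions, the link names (`"a"`, `"b"`, `"mid"`), and the helper `can_route` are hypothetical, and the model simply assumes that two connections conflict whenever their paths share a link.

```python
from itertools import product

def can_route(paths, requests):
    """Return True if every requested (input, output) pair can be assigned
    a path such that no two chosen paths share a link.

    paths    -- dict mapping (input, output) to a list of candidate paths,
                each path being a set of link names (hypothetical model)
    requests -- list of (input, output) pairs to establish simultaneously
    """
    candidate_lists = [paths.get(req, []) for req in requests]
    for choice in product(*candidate_lists):
        used = [link for path in choice for link in path]
        if len(used) == len(set(used)):  # no link is used twice
            return True
    return False

# Hypothetical blocking network: both cross connections need the link "mid".
blocking = {
    ("In0", "Out0"): [{"a"}],
    ("In1", "Out1"): [{"b"}],
    ("In0", "Out1"): [{"mid"}],
    ("In1", "Out0"): [{"mid"}],
}

# Hypothetical non-blocking network: each input/output pair has its own link.
non_blocking = {
    (i, o): [{f"{i}-{o}"}] for i in ("In0", "In1") for o in ("Out0", "Out1")
}

cross = [("In0", "Out1"), ("In1", "Out0")]
print(can_route(blocking, cross))      # the shared link makes this fail
print(can_route(non_blocking, cross))  # private links allow both connections
```

In the blocking description, connecting In0 to Out1 consumes the only link that In1 would need to reach Out0, which mirrors the situation described for the bottom array.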

Figure 1.2: Blocking and non-blocking networks.


A crossbar is a network topology in which any input port can be connected to any free

output port without blocking [12]. A basic crossbar, or crosspoint, switch has a set of

inputs, a corresponding set of outputs, and a set of addresses mapping inputs to outputs, as

shown in figure 1.3. The size of the crossbar refers to the number of points being connected.

It is typically specified as a number of inputs and a number of outputs. Thus, a 5x3 crossbar

is a crossbar in which five inputs can be connected to three outputs.
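As a rough illustration of these terms, the following sketch models a crossbar as a set of inputs, a set of outputs, and an address map between them. The class name `Crossbar`, its methods, and the one-to-one restriction are assumptions made for this example; they do not describe any of the circuits studied here.

```python
class Crossbar:
    """Behavioral model of an n_in x n_out crossbar (illustrative only).

    Assumption: connections are one-to-one, so an input or output that is
    already in use cannot be connected again.
    """

    def __init__(self, n_in, n_out):
        self.n_in = n_in
        self.n_out = n_out
        self.source_of = {}  # address map: output index -> input index

    def connect(self, inp, out):
        if not (0 <= inp < self.n_in and 0 <= out < self.n_out):
            raise IndexError("port out of range")
        if out in self.source_of:
            raise ValueError(f"output {out} is already in use")
        if inp in self.source_of.values():
            raise ValueError(f"input {inp} is already in use")
        self.source_of[out] = inp

    def route(self, data):
        """Return the value seen at each output; unconnected outputs give None."""
        return [data[self.source_of[o]] if o in self.source_of else None
                for o in range(self.n_out)]

# A 5x3 crossbar: five inputs, three outputs.
xbar = Crossbar(5, 3)
xbar.connect(4, 0)  # input 4 drives output 0
xbar.connect(0, 2)  # input 0 drives output 2
outputs = xbar.route(["a", "b", "c", "d", "e"])
print(outputs)  # ['e', None, 'a']
```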

Figure 1.3: Basic crossbar

According to Han et al. [12], crossbar networks can be categorized into three major

topological classes: full-crossbar networks, multistage interconnection networks (MINs),

and networks consisting of multiple levels of full crossbar connections, called hierarchical

crossbar interconnection networks (HCINs). The implementations presented in this study

belong to the first two categories. A full-crossbar network is comprised of a single

switching element. In contrast, multistage interconnection networks (MINs) consist of

multiple interconnected layers of simpler switches.

A multistage switching network is said to be rearrangeable if, for every matching of

input to output pins, there exists a switch setting sequence such that the matching can be

realized. A set of input to output matchings that must be achieved simultaneously

constitutes a permutation of the MIN. Utilization of the network is defined as the ratio of

the number of programmable switches used to implement a specific permutation over the

total number of switches in the network [2].
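The utilization ratio can be sketched for the simplest case. The example assumes an n x n mesh with one programmable switch per crosspoint, and it assumes that a switch counts as "used" when it is closed to realize the permutation; both assumptions, and the function names, are introduced here for illustration only.

```python
def is_permutation(matching, n):
    """True if `matching` (input -> output) pairs all n inputs with distinct outputs."""
    return (sorted(matching.keys()) == list(range(n))
            and sorted(matching.values()) == list(range(n)))

def mesh_utilization(matching, n):
    """Utilization of an n x n mesh for one permutation.

    Assumes one switch per crosspoint (n * n in total) and one closed
    switch per input/output pair in the permutation.
    """
    if not is_permutation(matching, n):
        raise ValueError("matching is not a permutation of the n ports")
    switches_used = len(matching)
    total_switches = n * n
    return switches_used / total_switches

# Example permutation on a 4x4 mesh: 4 closed switches out of 16.
perm = {0: 2, 1: 0, 2: 3, 3: 1}
print(mesh_utilization(perm, 4))  # 0.25
```

Under these assumptions the utilization of a mesh falls as 1/n, since a permutation closes n of the n² crosspoints.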

In terms of flexibility, crossbars may be classified according to the following criteria:

• Bi-directional or unidirectional: The purpose of any crossbar is to connect two points,

A and B, for instance. In a bi-directional crossbar, data can flow freely from A to B, or

from B to A. In unidirectional crossbars, the set of inputs and the set of outputs are

clearly differentiated. Data may flow from an input to an output, but not in the opposite

direction.

• One-to-one or multicasting: Multicasting crossbars allow for one input to be broadcast

simultaneously through more than one output. One-to-one crossbars require that each

input is connected to one and only one output at any given time.

Figure 1.4: Multicasting network.

• Fully connected or partially connected: Fully connected crossbars allow any input to

be connected to any output. In contrast, a partially connected crossbar is such that at

least one input cannot reach one or more outputs. In other words, the idea of full

connectivity implies that all one-to-one mappings of the input pins to the output pins are

possible [2]. In figure 1.5 the dashed lines represent all valid connections. Notice that,

in the bottom diagram, In1 is isolated from Out0, making this a partially connected

crossbar.

Figure 1.5: Fully connected vs. partially connected network.

• Static or dynamic configuration: The notion of static and dynamic configuration of

crossbars relates more to the application, than to the nature of the crossbars themselves.

Static configuration refers to the case in which the crossbar connections are set at

configuration time and then stay fixed while the system is running. Dynamic crossbars

allow the possibility of reprogramming connections during run time. That is, a given set

of connections may be reconfigured even as data run through another portion of the

same crossbar.
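Two of the criteria above lend themselves to a short sketch. Here a crossbar is described by the set of (input, output) pairs it is able to connect, and a configuration by the list of pairs currently active; this representation and the function names are assumptions chosen for the example.

```python
def is_fully_connected(allowed, n_in, n_out):
    """True if every input can reach every output."""
    return all((i, o) in allowed for i in range(n_in) for o in range(n_out))

def is_multicast(active):
    """True if some input drives more than one output in this configuration."""
    inputs = [i for i, _ in active]
    return len(inputs) != len(set(inputs))

# A 2x2 crossbar in which every connection is possible.
full = {(i, o) for i in range(2) for o in range(2)}

# The same crossbar with In1 isolated from Out0, as in the bottom diagram
# of figure 1.5.
partial = full - {(1, 0)}

print(is_fully_connected(full, 2, 2))     # True
print(is_fully_connected(partial, 2, 2))  # False
print(is_multicast([(0, 0), (0, 1)]))     # True: In0 is broadcast to both outputs
print(is_multicast([(0, 1), (1, 0)]))     # False: a one-to-one configuration
```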


In the available literature, the architectures of interconnection networks have been

classified into two generations, also known as direct and indirect routing [10, 17]. In the

first generation architecture, or direct routing, processing elements are directly connected to

each other in a mesh pattern or some other fixed pattern, and it is the processing element

itself, e.g., an FPGA, that may be reconfigured. In other words, the interconnect within an

FPGA may be reconfigured, but when building a network of FPGAs, the connection

between different FPGAs is fixed. The central principle of this mesh architecture is the

locality of circuit designs; that is, a circuit node is more likely to be connected to its

neighboring nodes. The second generation of routing architectures, also known as indirect

routing, is the partial crossbar interconnection network. With indirect routing, it is possible

to reconfigure the connections between processing elements, in addition to configuring the

processing elements themselves. Commercial examples of second generation routing are

field programmable interconnect devices (FPIDs), which are used to connect several FPGAs

at the board level. The evolution of systems from first generation routing to second

generation routing provides evidence of the increasing role of programmable interconnect in

the ASIC market.

We will use the term architectural efficiency, defined in [2], to encompass the number of

programmable switches, number of I/O pins, average utilization of programmable resources,

and delay encountered in a routing path. As a figure of merit useful in comparing routing

architectures, architectural efficiency will be computed using

equation 1.1.

$$\text{Architectural efficiency} = \frac{\text{Number of inputs} \times \text{Utilization}}{\text{Number of switches} \times \text{Delay elements in path}} \tag{1.1}$$
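
As a minimal sketch, equation 1.1 can be evaluated as shown below, reading it as the number of inputs times the utilization, divided by the number of switches times the number of delay elements in a routing path. The function name and the switch and stage counts in the example are hypothetical values chosen for illustration; they are not measurements from this study.

```python
def architectural_efficiency(num_inputs, utilization, num_switches,
                             delay_elements_in_path):
    """Figure of merit from equation 1.1.

    (inputs * utilization) / (switches * delay elements in path)
    """
    if num_switches <= 0 or delay_elements_in_path <= 0:
        raise ValueError("switch and delay element counts must be positive")
    return (num_inputs * utilization) / (num_switches * delay_elements_in_path)


# Hypothetical 4x4 networks at full utilization (illustrative numbers only).
single_stage = architectural_efficiency(4, 1.0, 16, 1)  # one switch per path
multistage = architectural_efficiency(4, 1.0, 6, 3)     # three switches per path
```

Under these assumed numbers, the single stage network scores higher because its shorter path outweighs its larger switch count. The ranking may change with network size and utilization.
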

1.4 Overview of the experiment

The study presented here compares three different crossbar architectures. One of the

architectures is a full crossbar, while the other two represent multistage interconnection

networks. The full crossbar is essentially a mesh of metal lines where each input/output

connection is provided by a pass gate, as shown in figure 1.6. To take advantage of the

layout regularity, this mesh was custom designed and laid out, without the aid of any

hardware description languages or synthesis tools.

Figure 1.6: Full crossbar, mesh approach.

The second architecture uses multiplexers, tristate buffers, and logic gates to select the

input/output connection. This architecture is the result of CAD synthesis from a VHDL

behavioral description of the crossbar. Figure 1.7 illustrates a 4x4 crossbar synthesized

using this method.

Figure 1.7: Automatically synthesized crossbar.

The final architecture is based on the standard Benes network described in [1] and [2].

This approach allows for certain flexibility, as it provides several choices concerning the

length of the routing paths or the number of switches. This is also the only one of the three

designs in which scalability and programming are not straightforward. As an example, the

4x4 network is shown in figure 1.8(a). The blocks labeled as B are butterfly switches,

whose schematic diagram is shown in figure 1.8(b). Each butterfly switch consists of four

transmission gates. These transmission gates work in pairs. When control signal S is a zero,

the top and bottom transmission gates are on, while the transmission gates on the left and

right are off. Hence, when S is zero, In0 is connected to Out0, and In1 is connected to Out1.

When S switches to one, the top and bottom gates turn off, and the left and right gates turn

on. At this point, In0 connects to Out1, and In1 connects to Out0. The layout for all Benes

networks in this study was drawn manually, based on transistor level schematic diagrams.
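
The behavior of a single butterfly switch, as described above, can be captured in a short behavioral sketch. It models only the logical routing selected by control signal S, not the electrical properties of the transmission gates; the function name is an assumption introduced for illustration.

```python
def butterfly_switch(in0, in1, s):
    """Return (out0, out1) for a 2x2 butterfly switch.

    s == 0: straight connection, In0 -> Out0 and In1 -> Out1.
    s == 1: crossed connection, In0 -> Out1 and In1 -> Out0.
    """
    if s not in (0, 1):
        raise ValueError("control signal S must be 0 or 1")
    return (in0, in1) if s == 0 else (in1, in0)


straight = butterfly_switch("a", "b", 0)
crossed = butterfly_switch("a", "b", 1)
```
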

Figure 1.8(a): 4x4 Benes network. Select lines removed for clarity.

Figure 1.8(b): Schematic diagram of a Butterfly switch.

Each architecture was used to build crossbars of three different sizes: 4x4, 8x8, and

16x16. Among the aspects to be considered in comparing these circuits were throughput,

scalability, design complexity and time to design, programming requirements, power

consumption, transistor count, and area penalty. All circuit design and simulations were

completed through the Cadence design flow, provided by their IC front to back tools,

available at the University of Idaho's Electrical and Computer Engineering department.

Circuit level simulations were performed using Smartspice®.

All of the circuits presented above fit the definition of a crossbar, as they are

non-blocking interconnect structures. Furthermore, all circuits are designed to be fully

connected and bidirectional, when used for one-to-one connections in a static configuration

mode. Note that, in this case, multicasting implies a sense of direction, as it does not make

sense to drive one output through more than one permanently connected input. In the case

of dynamic configuration, it is possible to retain bidirectionality even when multicasting

signals. It should be remarked that although the Benes network is non-blocking for a

one-to-one network, multicasting may in fact block other inputs from reaching their outputs. These

properties will be addressed in the following chapters. Unless otherwise noted, the networks

will be assumed to be operating in a one-to-one, statically configured mode, which makes

them fully connected, bidirectional crossbars.

A total of nine circuits were designed, and each of them has been simulated for two

fabrication processes with very different characteristics. One is a standard, speed driven

fabrication technology (AMI 0.6 microns). The other process uses the same layers in

drawing the layout, but parasitic model parameters have been skewed to emulate the

characteristics of an ultra low power, high leakage environment, operating on a supply

voltage of just 0.5 volts. High leakage is a major concern for processes aiming for low

power consumption through the use of low voltage supplies. Thus, one of the goals of

this project is to determine which crossbar network would perform best in a high leakage

environment.

Capacitance parameters for each of these circuits were extracted from the layout. These

capacitance parameters were then included in the circuit level simulations of the circuits.

Layout data were also used to estimate circuit area. Many SPICE simulations were

performed on each of the circuits. The simulations presented in this report were chosen to

portray specific situations relevant to throughput or power consumption.

Along with these electrical considerations, the issue of programming interconnect is

worthy of special attention. A configuration algorithm is presented for one of the

programmable interconnect structures, the Benes based approach. Key concerns in

interconnect controller design are addressed, although no transistor level design of these

controllers has been performed. The assessment of this configuration time control is based

on the published literature and the simulation of the interconnect structures themselves.

CHAPTER 2: PHYSICAL DESIGN CONSIDERATIONS FOR CROSSBARS IN VLSI SYSTEMS

2.1 Crossbar design goals and applications

This document attempts to determine if certain crossbar architectures are better suited for

specific applications than others. To understand the potential advantages of a particular

crossbar in a given environment, it is necessary to be acquainted with the physical principles

that affect crossbar performance in that environment. This chapter summarizes some key

physical considerations that pertain to interconnect design, and, in particular, to crossbar

design. Once this background has been established, section 2.7 outlines the experimental

procedure used to compare crossbar architectures.

Historically, the goal of field programmable gate arrays (FPGAs) has been to produce a

high speed architecture and circuit with a reasonable logic density [5]. Power consumption

and tight reliability constraints have not been the driving force in the FPGA market. The

demand for application specific integrated circuits (ASICs) keeps growing, and

reconfigurable devices provide a less expensive alternative to custom fabrication runs.

As a result, it has become relevant to provide reconfigurable computing for applications

whose aim is different from that of traditional FPGAs. Of course, everyone wants

applications to run as fast as possible, but for portable systems it is also desirable that

batteries last longer, and that the system is smaller. Even more critically, in life safety

applications, like fire alarms, and certain medical equipment such as cardiovascular

pacemakers, battery life and reliability are more pressing design concerns than a faster

clock frequency. Many of the services available today, such as weather reports, GPS

systems, and even the Internet, rely heavily on satellites. Electronics for outer space must

deliver the best possible performance, on a limited power budget, in a radiation intensive

environment. Accordingly, the effectiveness of programmable interconnect structures

should be examined under a wider range of environmental conditions, with a broader

spectrum of applications in mind.

The evaluation of crossbar performance for a variety of applications requires an

understanding of the underlying physical principles that govern circuit design in a particular

context. Standard issues that must be considered include throughput, power consumption,

circuit size, noise immunity and reliability. What constitutes acceptable performance in

each of these categories is strongly dependent on the application for which the circuit is

intended. This chapter provides an overview of possible application scenarios, and the

physical challenges for reconfigurable interconnect in each of these environments. First of

all, it must be noted that this study focuses on datapath interconnect, under the assumption

that data skew can be tolerated. As a result, this document does not include an in-depth

analysis of signal skew for the routing alternatives presented. Reconfigurable interconnect

structures such as the ones presented in this study are not suitable for clock routing, as skew

between alternative paths may be significant.

2.2 Transistor sizing and delay

Transistor sizing is a major concern for achieving high performance. Small transistors

limit the current that a given circuit can handle, while large transistors have a higher

capacitive load. Tools such as logical effort theory [20] have been developed to assist in

transistor sizing and buffer insertion for optimal performance of CMOS circuits. However,

logical effort theory focuses on speed driven designs, rather than our context based

definition of high performance.

As a general trend, fabrication process technology seeks to shrink transistor sizes in order

to increase logic density. Having more transistors provides increased computational power

in every integrated circuit. Concurrently, many process engineers seek to develop faster

transistors, which may function in systems with higher clock frequencies, providing higher

throughput. The quest for smaller, faster devices has led to what is known as deep

submicron technologies, which are currently the state of the art in terms of logic density and

speed.

Notice that the goal of increasing logic density implies fitting more transistors into a

given die. Although transistor sizes have shrunk significantly, die sizes for most

applications do not scale as aggressively. Because signals must travel across the die, smaller

transistors do not necessarily translate into shorter wires. On the other hand, connecting to

smaller transistors often requires thinner wires. As a result, wire resistance tends to increase

as fabrication processes shrink. Capacitance remains fairly constant as feature sizes scale.

Thus, an increase in resistance causes an increase in interconnect delay. In contrast, gate

delay decreases as processes shrink. Overall, the balance in delay contribution is shifting

from gates to interconnect.
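
The scaling argument above can be made concrete with a rough numerical sketch. It assumes a rectangular wire with resistance R = ρL/(W·T), a bulk aluminum resistivity of roughly 0.027 Ω·µm, and the commonly used approximation of 0.38·R·C for the 50% delay of a distributed RC line. The wire dimensions and the 1 pF capacitance are hypothetical values, not data from this study.

```python
ALUMINUM_RESISTIVITY_OHM_UM = 0.027  # approximate bulk value (assumption)


def wire_resistance(length_um, width_um, thickness_um,
                    resistivity_ohm_um=ALUMINUM_RESISTIVITY_OHM_UM):
    """Resistance of a rectangular wire, R = rho * L / (W * T), in ohms."""
    return resistivity_ohm_um * length_um / (width_um * thickness_um)


def distributed_rc_delay(resistance_ohm, capacitance_f):
    """Approximate 50% delay of a distributed RC line, 0.38 * R * C."""
    return 0.38 * resistance_ohm * capacitance_f


# Hypothetical wire: same length, thickness, and capacitance; width is halved.
WIRE_CAPACITANCE_F = 1.0e-12
r_wide = wire_resistance(length_um=5000.0, width_um=1.0, thickness_um=0.5)
r_narrow = wire_resistance(length_um=5000.0, width_um=0.5, thickness_um=0.5)
delay_wide = distributed_rc_delay(r_wide, WIRE_CAPACITANCE_F)
delay_narrow = distributed_rc_delay(r_narrow, WIRE_CAPACITANCE_F)
```

With capacitance held constant, halving the wire width doubles both the resistance and the estimated delay, which is the trend described in the text.
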

Certain fabrication process adjustments help mitigate this increase in wire delay. The

most common of these process variations include reducing both oxide permittivity and metal

resistance. In fact, the impact of metal resistance in this context has emerged as a source of

controversy. Several experts believe that aluminum has run its course, and fabrication

processes must switch to lower resistance metals such as copper. Others defend the position

that the situation is not so dramatic yet, and a premature switch to copper would simply be

too expensive. In any case, when considering deep submicron fabrication processes,

interconnect-intensive circuits, such as crossbars, owe most of their signal delay to the metal

layers, rather than the transistors. Typically, smaller and faster transistors allow significant

increases in clock frequency, which translate into higher throughput. However, for the latest

fabrication processes, we might be reaching a point where interconnect delay, rather than

transistor switching, is the limiting factor for maximum speed of operation.

2.3 Power consumption

Although increasing clock frequency has the desired effect of providing higher

throughput, characteristics like power consumption and noise immunity are actually degraded by a

higher switching rate. Recall the general expression for dynamic power consumption in

CMOS circuits,

$$P = C \cdot f \cdot V_{dd}^{2} \cdot u \tag{2.1}$$

where P is the dynamic power consumption of the circuit, C is the load capacitance, f is the

operating frequency, Vdd is the supply voltage, and u is a utilization factor for the circuit

output. From equation (2.1), it follows that when CMOS circuits operate at higher

frequency they consume more dynamic power.
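
A brief numerical sketch of equation (2.1) follows. The load capacitance, frequency, and utilization values are hypothetical and serve only to show how strongly the supply voltage affects dynamic power.

```python
def dynamic_power(load_capacitance_f, frequency_hz, vdd_v, utilization):
    """Dynamic power from equation (2.1): P = C * f * Vdd^2 * u, in watts."""
    return load_capacitance_f * frequency_hz * vdd_v ** 2 * utilization


# Hypothetical operating point: 0.1 pF load, 100 MHz, 50% utilization.
p_at_5v = dynamic_power(0.1e-12, 100e6, 5.0, 0.5)
p_at_half_volt = dynamic_power(0.1e-12, 100e6, 0.5, 0.5)
supply_scaling_ratio = p_at_5v / p_at_half_volt
```

Because power depends on the square of Vdd, reducing the supply by a factor of ten reduces dynamic power by a factor of one hundred, while doubling the frequency only doubles it.
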

Reducing power consumption becomes important for portable system designs. In a

world full of mobile telephones, personal digital assistants, laptop computers, and other

battery powered applications, power budget is a crucial design specification. In particular,

space-borne applications pose the tightest constraints in terms of power budget. Even in

more conventional, non-portable systems, power dissipation remains an important concern

in circuit design. For instance, personal desktop computers that are plugged into an alternating

current (AC) outlet must limit their power consumption to prevent overheating of the

integrated circuits. Traditionally, gates have been the main sources of on-chip power

dissipation. However, just as it has happened with delay, the situation has changed. As

interconnect scales down, the spacing between wires becomes smaller. Thus, coupling

capacitances replace parallel-plate and fringing capacitances as the dominant source for

overall wire capacitance. Today, interconnect capacitance is the cause of most of the

on-chip power dissipation, especially for crossbars, which are inherently wiring intensive

circuits [4].

2.4 Noise immunity, leakage current, and other second order effects

One common way of reducing power dissipation has been to reduce the supply voltage,

from 5 V a few years ago, to 3.3 V, then 2.5 V and less. Recent work in ultra-low-power

CMOS has produced devices operating with a supply voltage of 0.5V [3]. As the supply

voltage is reduced, the MOS transistor thresholds are also reduced. This causes an increase

in sub-threshold leakage current, which increases static power consumption and degrades

switching performance, producing slower transitions. The

sharpness of a signal's rising and falling edges strongly affects the short circuit current.

Additionally, coupling noise may change the shape of the signal transition waveform in a

way that causes a higher short circuit current, further complicating the problem [4]. The

effects of leakage are most pronounced in circuits with large numbers of parallel transistors,

such as memories and crossbars [14]. In designing crossbars for low power applications, it

becomes relevant to consider the effects of leakage on throughput, power consumption, and

noise immunity.
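
The sensitivity of leakage to threshold voltage can be illustrated with the usual first-order model, in which the off-state subthreshold current is proportional to exp(-Vth / (n·V_T)), where V_T = kT/q is the thermal voltage. This sketch is not taken from the transistor models used in this study; the slope factor n = 1.5 and the two threshold voltages are hypothetical values chosen for illustration.

```python
import math

BOLTZMANN_J_PER_K = 1.380649e-23
ELECTRON_CHARGE_C = 1.602176634e-19


def thermal_voltage(temperature_k=300.0):
    """Thermal voltage V_T = k * T / q, in volts."""
    return BOLTZMANN_J_PER_K * temperature_k / ELECTRON_CHARGE_C


def leakage_ratio(vth_high_v, vth_low_v, slope_factor=1.5, temperature_k=300.0):
    """Ratio of off-state leakage currents when Vth drops from high to low.

    Assumes I_off is proportional to exp(-Vth / (n * V_T)) and that all
    other device parameters are unchanged (a first-order assumption).
    """
    n_vt = slope_factor * thermal_voltage(temperature_k)
    return math.exp((vth_high_v - vth_low_v) / n_vt)


# Hypothetical thresholds: 0.7 V (standard) versus 0.2 V (low voltage).
ratio = leakage_ratio(0.7, 0.2)
```

Under these assumptions, lowering the threshold by half a volt raises the off-state current by several orders of magnitude, which is why leakage cannot be neglected in highly parallel structures.
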

At the other end of the spectrum, for extremely fast systems with decreasing rise times,

inductance becomes an important issue in signal integrity and delay. Inductance causes

ringing in the signal waveform, which can affect signal integrity if it exceeds allowed

thresholds. Also, with increasing wire lengths, the effective current loop is increased,

making mutual inductances a nonnegligible effect, which may further increase the short

circuit current.

Beyond an increase in power consumption, degraded signal waveforms cause other, more

dramatic, problems. Specifically, when signal integrity becomes compromised, noise

immunity is harder to achieve. For instance, in deep submicron technologies, interconnect

capacitance becomes comparable to, or even larger than, gate capacitance. Thus, for dynamic

logic, the charge of interconnect may overwrite the content of the gate capacitance [3]. As

distance between wires scales down, crosstalk and charge sharing effects become more

significant. Furthermore, as supply voltages decrease, so does the threshold voltage. A

lower threshold voltage causes an increase in leakage current, which is most hazardous for

highly parallel structures, such as memories and crossbars [14]. These second order effects

call for more sophisticated circuit models, and capacitance extraction techniques.

Otherwise, Computer Aided Design (CAD) verification tools become a less reliable

predictor of actual circuit operation.

Whether one is pushing the limits of signal throughput, or circuit power consumption, it

seems that wire models currently available neglect important circuit dynamics. As an

example, for very fast circuits, the lumped-circuit representation is no longer adequate, since

it results in substantial underestimation of both crosstalk and delay. Hence, it would be

appropriate to model wires as transmission lines, rather than simple RC networks [4].

However, developing software that takes into account these considerations is costly, and

simulation times would probably increase beyond acceptable ranges. Because CAD

simulation results rely on such limited models, the designer must pay careful attention to

environmental conditions that may exacerbate these second order effects.

2.5 Reliability

Computer simulations cannot account for aging of the components, or statistical

variations within a fabrication process. Reliability analysis studies trends in integrated

circuit fabrication, and provides guidelines to ensure systems will work properly, for the

longest time possible, in spite of the aging of components. Reliability is also concerned with

the fact that no two integrated circuits are exactly alike, and yet, it must be possible to

replace a defective integrated circuit with an equivalent spare, and expect the system to work

correctly. In this context, reliability assessments depend upon the environment where the

integrated circuit must operate. For example, a high density of power dissipation creates hot

spots that cause temperature increases. If the temperature rises beyond the operating range,

the chip may fail to function [4].

There are two major sources of failure that affect interconnect: (1) electromigration, and

(2) formation of stress-induced voids, also called stress voiding. Electromigration occurs

when current is forced through metal lines whose width is not constant. Electrons tend to

accelerate when they reach the narrower sections of the wire, as shown in figure 2.1. As

electrons accelerate, they gain momentum. Eventually, the moving electrons collide with

the metal nuclei. The collision causes momentum to be transferred from the electrons to the

nuclei. Although the metallic nuclei are much heavier than the electrons, continued

collisions cause displacement of the metallic nuclei. Over time, these displaced nuclei can

lead to opens in a metal line. In fact, they may even lead to shorts, if the metal is displaced

in such a way that it provides a conduction path between lines that should be isolated from

each other.

Electromigration is primarily affected by operating temperature and current density in the

metal line. Higher current densities mean that more electrons are moving through the wire.

No fabrication process can guarantee that the width of its metal lines will be constant

throughout a chip. Thus, high current density is a major cause of electromigration. As

interconnect shrinks, the current density increases, making electromigration a growing

concern.

Figure 2.1: Electromigration in aluminum wires.

It must be noted that although electromigration is a problem for aluminum wires, copper

wires are far less susceptible to this phenomenon. Copper has a lower resistivity than aluminum,

and copper nuclei are much heavier than aluminum nuclei. The replacement of aluminum by

copper would greatly reduce the electromigration problem, but the switch would represent a huge

investment for foundries. Instead, designers have chosen to prevent electromigration by

limiting the maximum current density that may run through an aluminum wire. As a rule of

thumb, current density is restricted to less than 1 mA/µm². To ensure the designed

crossbars do not suffer electromigration, current density has been calculated and monitored

throughout this project.
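
The rule of thumb above lends itself to a simple check. The sketch below computes current density as the current divided by the wire cross section (width times thickness) and compares it with the 1 mA/µm² limit. The function names and the example wire dimensions are hypothetical, and the calculations actually performed for this project may have differed in detail.

```python
MAX_CURRENT_DENSITY_MA_PER_UM2 = 1.0  # rule of thumb quoted in the text


def current_density(current_ma, width_um, thickness_um):
    """Current density J = I / (W * T), in mA per square micron."""
    return current_ma / (width_um * thickness_um)


def is_within_limit(current_ma, width_um, thickness_um,
                    limit=MAX_CURRENT_DENSITY_MA_PER_UM2):
    """True when the current density stays below the limit."""
    return current_density(current_ma, width_um, thickness_um) < limit


def min_wire_width(current_ma, thickness_um,
                   limit=MAX_CURRENT_DENSITY_MA_PER_UM2):
    """Width at which the current density equals the limit, in microns."""
    return current_ma / (limit * thickness_um)


# Hypothetical wire: 0.3 mA through a 0.9 um wide, 0.5 um thick line.
j_example = current_density(0.3, 0.9, 0.5)
safe_example = is_within_limit(0.3, 0.9, 0.5)
width_for_1ma = min_wire_width(1.0, 0.5)
```
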

At microscopic levels, a single aluminum wire is made up of several grains, which may

be arranged in a variety of structures, as shown in figure 2.2.

Figure 2.2: Grain structures on VLSI metal wires.

Grain structure strongly affects the likelihood of metal wire failure due to stress voiding.

A triple point is a location along a metal wire where three grain boundaries intersect. Figure

2.2 shows that a bamboo grain structure has no triple points along the

wire. Stress voiding usually occurs at triple points. Therefore, an aluminum wire with a

bamboo grain structure is less prone to stress voiding failure than a polygranular wire.

Essentially, stress voiding results from a stress gradient acting on a triple point. After

annealing, different parts of a chip cool down at different rates. As materials cool down,

they contract. Different cool down rates result in different contraction rates. Consequently,

one side of a wire will feel more stress from the contracting materials than the other side. In

the presence of such a stress gradient, a wire will tend to break where the atomic bonds are

weakest, leading to open wire failures. Metal wires are weakest at triple points.

Electromigration aids the stress voiding process by making atomic bonds even weaker.

Unlike electromigration, techniques for dealing with stress voiding are mainly a fabrication

process issue, rather than a circuit design issue. One might argue that longer wires that span

a die from side to side are more likely candidates for stress voiding, because materials will

be more heterogeneous at both ends of the wire. Furthermore, a smaller layout area would

allow more room for redundant circuitry which

would increase the overall reliability of the system. As a result, at the layout level, attempts

have been made to keep metal wires as short as possible, and total layout area to a minimum.

It must be pointed out that stress voiding depends on several factors, and equating the

likelihood of stress voiding to wire length would be a gross oversimplification. A statistical

analysis of stress voiding for the crossbar architectures and fabrication processes falls

outside the scope of this project.

One of the most hostile environments for integrated circuits (ICs) is outer space. For

space borne applications, ICs must be able to deliver optimal performance in a radiation

intensive environment. Two approaches have been used to get around this issue. Some

vendors provide radiation hardened packaging for commercial off-the-shelf integrated

circuits. On the other hand, researchers around the world develop rad-hard-by-design

systems. Ultimately, these radiation tolerant designs are also encapsulated in rad-hard

packaging [19]. In addition to radiation tolerance, space grade ICs must comply with a

limited power budget. Furthermore, it is very expensive to replace a defective system in

space, so circuits must be designed to be both highly reliable and fault tolerant. The various

space programs around the world are major consumers of ASICs, as the applications they

develop are frequently cutting edge technology. Reconfigurable systems provide the

versatility that such a dynamic field requires, at a fraction of the cost of custom fabrication

runs. On account of this close match between the needs of the space application market and

the advantages of reconfigurable computing, it becomes worthwhile to evaluate how

different crossbar topologies perform in a radiation intensive environment. In

rad-hard-by-design circuits, the need for guardbands and redundancy results in area overhead, which may

be a valid reason for choosing the smallest crossbar implementation for this kind of

application. Also, rad-hard ICs often include built in test structures, used to measure

performance of the chip when exposed to controlled amounts of radiation, which further

increase the area penalty. In summary, smaller designs allow more room for guardbands,

radiation hardening circuitry, and redundancy. If reliability is not the driving concern in the

design, a smaller area translates either into lower cost, or increased computational power, as

there is more room for additional transistors.

2.6 Programming, routability, and pin count

Regardless of the type of programmable switch used, the capacitance, resistance, and size

of programmable connections make them much slower and larger than a simple metal wire

[5]. Besides all the electrical constraints, there are mechanical limitations that a crossbar

designer must consider. Note that crossbars may be designed to interconnect entire busses,

rather than individual bit lines. In such a case, the size of a routed crossbar increases

roughly by a factor of W, where W is the number of bits in the bus. Size concerns

notwithstanding, the major challenge for a system with wide busses is routability. The space

available for routing is limited by two factors: layout design rules and the number of metal

layers available in the fabrication process. Using a higher number of metal layers may

reduce the footprint of the layout, but it offers no help regarding the routing path length or

the cost of the design. Moreover, routing wires through vias may even result in longer

delays for some signals [2].

The problem of running thousands of parallel wires through the patterns demanded by the

different crossbar architectures is not trivial. CAD tools provide routing algorithms, but

these often take hours or even days of computation time, for highly complex systems. More

importantly, it is not uncommon for automatic routing to yield less than optimal results. In

consequence, for large crossbar designs, it may be favorable to use the crossbar architecture

which provides the simplest, and most regular, routing pattern. Paraphrasing Chow et al. in

[6], layout styles should favor regularity, with the irregular structures restricted to the final

control logic.

While routing is a concern for VLSI wiring in general, reconfigurable interconnect must

also budget for the programming control logic. Programming logic not only influences size

and routability, but it also affects how much time it takes to configure the crossbar. Ideally,

one would prefer to load all the crossbar addresses in parallel, so that the entire crossbar is

configured in a single load operation. However, a real integrated circuit has a limited

number of pins available for interface to the outside world, so it is far more common to load

configurations in several steps. In other words, there is a pin count versus speed of

programming compromise. If a particular routing architecture uses fewer pins to configure

its connections, it may require more input/output (I/O) transfers to complete its

programming, thus increasing configuration time. Now, given a fixed I/O interface, a

crossbar architecture that needs fewer bits to program its connections will take less time to

configure.
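
The pin count versus programming speed compromise can be expressed with a small sketch. It assumes, for illustration only, that each I/O transfer loads one configuration bit per configuration pin, so the number of transfers is the number of configuration bits divided by the number of pins, rounded up. The function name and the example values are hypothetical.

```python
def config_transfers(config_bits, config_pins):
    """Number of I/O transfers needed to load a configuration.

    Assumes one bit per pin per transfer (ceiling division).
    """
    if config_bits < 0 or config_pins <= 0:
        raise ValueError("bits must be non-negative and pins positive")
    return (config_bits + config_pins - 1) // config_pins


# Hypothetical cases: the same 64-bit configuration over 8 or 4 pins,
# and a smaller 20-bit configuration over the fixed 8-pin interface.
transfers_64_bits_8_pins = config_transfers(64, 8)
transfers_64_bits_4_pins = config_transfers(64, 4)
transfers_20_bits_8_pins = config_transfers(20, 8)
```

Halving the pin count doubles the number of transfers for a given configuration, while a network that needs fewer bits completes its programming in fewer transfers over the same interface.
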

Typically, multistage interconnection networks (MINs) can be configured using fewer

bits than standard mesh crossbars. The price to be paid for such convenience lies in the

programming complexity of the crossbar architecture. In the mesh architecture, it takes only

one switch to build a path from a crossbar input to a crossbar output. Despite using fewer

address bits, programming a MIN is not as simple. In MINs, it takes several stages to link

inputs and outputs. Furthermore, using a particular path disables certain possible routes

between some inputs and outputs. As a result, it becomes necessary to perform additional

computations at configuration time. These computations must yield a valid switch

combination that realizes all the intended connections. Variations on the routing algorithms

used in computer networks often provide efficient ways to complete this task. And yet, for

certain systems, the additional “crossbar programming” step may be unacceptable,

especially for larger crossbars with too many alternative paths. In addition to all the

electrical and mechanical considerations, a practical design must consider the compromise

between routing flexibility and programming effort. On the one hand, it is relevant to

establish how the number of address bits scales with growing crossbars. On the other hand,

one must evaluate how much the programming complexity grows for more convoluted

crossbar architectures.
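
To give a sense of how these quantities scale, the sketch below compares switch counts and path lengths for an N x N mesh and a textbook Benes network. It assumes the standard Benes construction, with 2·log2(N) - 1 stages of N/2 two-by-two switches, and one crosspoint per input/output pair in the mesh. The variants built in this study may differ, so the figures are indicative only.

```python
def is_power_of_two(n):
    """True for n = 2, 4, 8, ..."""
    return n >= 2 and (n & (n - 1)) == 0


def mesh_switch_count(n):
    """Crosspoints in an n x n mesh; every path uses exactly one of them."""
    return n * n


def benes_stage_count(n):
    """Stages in a textbook n x n Benes network: 2 * log2(n) - 1."""
    if not is_power_of_two(n):
        raise ValueError("n must be a power of two, at least 2")
    return 2 * (n.bit_length() - 1) - 1


def benes_switch_count(n):
    """Two-by-two switches in a textbook Benes network: stages * n / 2."""
    return benes_stage_count(n) * (n // 2)


# Sizes considered in this study: 4x4, 8x8, and 16x16.
comparison = {
    n: (mesh_switch_count(n), benes_switch_count(n), benes_stage_count(n))
    for n in (4, 8, 16)
}
```

Under these assumptions, the Benes network needs far fewer switches as N grows, but every path crosses several stages, and the switch settings must be computed jointly rather than one connection at a time.
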

2.7 Ground rules for crossbar study

In order to provide a meaningful comparison of the crossbar alternatives, it is necessary

to establish a common context in which to work. The ideal would be to provide identical

conditions for the three architectures, so that any trends observed can be attributed to

specific crossbar circuits and their inherent properties. Wherever possible, care has been

taken to ensure an identical environment, but, due to fundamental differences in the

architectures themselves, certain compromises have been made. This

section describes the simulation environment, and the general procedure used in obtaining

data for all the circuits being compared.

Fifteen different layouts were generated using the AMI 0.6 micron technology definitions

available for Cadence. Each of the three approaches was laid out in three different sizes:

4x4, 8x8, and 16x16 crossbars. Additionally, the full crossbar approach and the Benes

interconnection network were laid out in two versions for each size. One version used full

pass gates as switches, while the other one used NMOS transistors as switches.

Capacitances were extracted for each of these nine layouts. A basic SPICE netlist

describing the transistors and parasitic capacitances was obtained for each layout.

Unfortunately, the technology files assumed metal layer resistances to be zero, and thus

automated resistance extraction was impossible. Alternatively, wire lengths and widths

were measured, and resistances were estimated for each input/output path. Although these

values are analyzed theoretically in discussions about crossbar delay for each architecture,

the resistance values were not incorporated into circuit level simulations. For larger feature

sizes, where λ is greater than 0.3 microns, the observations presented here are valid, because

wire resistance is indeed negligible. However, as processes scale into deep submicron

technologies, better circuit models would be necessary to provide meaningful comparisons.

When first laid out, all transistors in the netlist were associated with AMI 0.6 micron

transistor models, as provided by MOSIS. Spice transistor models for the ultra low power

(ULP) fabrication process were obtained through the Center for Advanced Microelectronics

and Biomolecular Research (CAMBR) at the University of Idaho. The ULP transistor

model was inserted into the fifteen original netlists, providing fifteen additional circuits for

comparison. In all, thirty circuits were simulated for various input/output (I/O) patterns.

The simulation results were used to compute average signal delays, total leakage current,

and power consumption. Figure 2.3 summarizes the procedure for obtaining SPICE

simulation results based on circuit layouts.

Figure 2.3: Obtaining Spice simulations from Cadence layout.

It must be stressed that this is not a true ULP layout, but an AMI netlist using ULP

transistor models. Consequently, simulation results provided here may differ from those

obtained for a circuit specifically laid out for ULP. In particular, parasitic capacitances

associated with ULP may be very different than the values extracted for AMI. The point of

using ULP models in this study is to examine crossbar performance in systems where

leakage currents are not negligible. Transistor models with very low threshold voltages,

such as those obtained for ULP, mimic the leakage conditions being targeted.

Closely related to the issue of measuring delay is the question of how to drive and load

the simulation test circuits. In [19], Sutherland provides several recommendations and

examples aimed at generating realistic edges for transistor level simulations. Following

these recommendations, all simulation inputs are driven using two cascaded inverters, whose

input is connected to the direct current (DC) or pulse sources available in Smartspice®.

Similarly, the outputs are loaded with three cascaded inverters, followed by a 0.1 picofarad

capacitor. The output is then measured after the second output inverter. The test circuitry

for one input/output pair is depicted in figure 2.4. Delay is measured from point A to point

B in the figure. For this project, delay is defined at 50% of the full voltage swing. Since the

AMI process uses a 3.3 volt supply, delay was measured between the time the input reached

1.65 volts, and the time the output reached that same voltage. For the ULP process the

switch point was 0.25 volts, which is half of the 0.5 volt supply.

Figure 2.4: Circuit for measuring crossbar delay
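The 50% delay definition can be illustrated with a short Python sketch. This is not the measurement script used in the study: the function names `crossing_time` and `delay_50` and the sampled waveforms are assumptions made for illustration, and crossings are located by linear interpolation between adjacent samples.

```python
def crossing_time(times, volts, level, start=0.0):
    """Return the first time at or after `start` where the waveform crosses `level`.

    Linear interpolation is used between adjacent samples.
    """
    for i in range(1, len(times)):
        t0, t1 = times[i - 1], times[i]
        v0, v1 = volts[i - 1], volts[i]
        if t1 < start or v0 == v1:
            continue
        # A crossing occurs when `level` lies between the two samples.
        if (v0 - level) * (v1 - level) <= 0:
            t = t0 + (level - v0) * (t1 - t0) / (v1 - v0)
            if t >= start:
                return t
    raise ValueError("waveform never crosses the requested level")


def delay_50(times, v_in, v_out, vdd):
    """Delay between the 50% crossings of the input and output waveforms."""
    half = vdd / 2.0
    t_in = crossing_time(times, v_in, half)
    t_out = crossing_time(times, v_out, half, start=t_in)
    return t_out - t_in


# Hypothetical 3.3 V waveforms sampled every 0.1 ns (illustrative values only).
t = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
vin = [0.0, 3.3, 3.3, 3.3, 3.3, 3.3, 3.3]
vout = [0.0, 0.0, 0.0, 0.0, 3.3, 3.3, 3.3]
print(round(delay_50(t, vin, vout, vdd=3.3), 3))
```

For the ULP runs the same function would be called with `vdd=0.5`, which places the switch point at 0.25 volts.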

Note that such measured delay is the sum of the crossbar delay plus the delay across two

inverters. To obtain the net crossbar delay, one must subtract the delay across two inverters.

The test circuit illustrated in figure 2.5 was simulated to determine the delay across two

cascaded inverters driving a third inverter of the same size. The load seen at point D in

figure 2.5 is the same as the load seen at point B in figure 2.4. Therefore, the delay between

points C and D is the value that must be subtracted from the original data. Table 2.1 shows

the inverter delay values for the two fabrication processes used.

Figure 2.5: Circuit for measuring delay across two inverters.

Table 2.1: Delay across two inverters driving a 0.1 pF capacitance.

Process Rising transition Falling transition Average across two inverters

AMI 0.6 0.285 ns 0.285 ns 0.285 ns

ULP 0.896 ns 0.786 ns 0.841 ns

From Table 2.1 it is evident that AMI transistors are faster than their ULP counterparts.

This behavior is consistent with the fact that AMI processes operate on a 3.3 volt power

supply, while ULP operates on 0.5 volts. The difference in rising and falling transition

delays for the ULP process is a consequence of the netlist generation procedure described

above. The layouts were built using AMI 0.6 standard cells and AMI 0.6 layout design

rules. Transistors in the layout were sized for equal rise and fall times on the AMI 0.6

fabrication process, as evidenced in the first row of Table 2.1. When the netlists were

modified to utilize ULP transistor models, transistor sizes were not modified. Consequently,

rise and fall times are slightly different on ULP simulation runs. For the remainder of this

document, the word “delay” refers to the average of rising transition delay and falling

transition delay, unless otherwise specified.
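The correction described above can be sketched in a few lines of Python. The inverter delays are taken from Table 2.1; the function name `net_crossbar_delay`, the raw measured values, and the choice to subtract the rising inverter delay from rising measurements (and likewise for falling) are assumptions made for illustration.

```python
# Two-inverter delays from Table 2.1, in nanoseconds.
INVERTER_DELAY_NS = {
    "AMI": {"rising": 0.285, "falling": 0.285},
    "ULP": {"rising": 0.896, "falling": 0.786},
}


def net_crossbar_delay(process, measured_rising_ns, measured_falling_ns):
    """Subtract the two-inverter delay and average the two transitions."""
    inv = INVERTER_DELAY_NS[process]
    rising = measured_rising_ns - inv["rising"]
    falling = measured_falling_ns - inv["falling"]
    return {
        "rising": rising,
        "falling": falling,
        "average": (rising + falling) / 2.0,  # "delay" as used in this document
    }


# Hypothetical raw measurements for one I/O pair on the ULP process.
result = net_crossbar_delay("ULP", measured_rising_ns=2.0, measured_falling_ns=2.2)
print({k: round(v, 3) for k, v in result.items()})
```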

All Spice simulations in this project ran for one microsecond, with a 0.1 nanosecond time

step. All simulation runs used to measure delays were such that they included at least three

falling transitions and three rising transitions for each I/O pair. In every case, the delay for

each transition was measured, and then two averages were computed, one for rising

transitions and one for falling transitions. Additionally, all delays were averaged to get a

delay estimate for a given I/O pair in a specific simulation run. Thus, for each I/O pair

being connected, a particular simulation yields an average delay, an average rising transition

delay, and an average falling transition delay. All of the

averages computed in this manner are then compared to get a worst case delay for the

crossbar. Furthermore, the delays are averaged again across all I/O pairs, to get an average

delay for the crossbar implementation.
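The averaging procedure can be summarized with the following Python sketch. The data layout, the function names, and the sample delays are hypothetical; in particular, taking the worst case as the largest of all the per-pair averages is one reading of the procedure described above.

```python
from statistics import mean


def summarize_pair(rising_ns, falling_ns):
    """Average rising, falling, and overall delay for one I/O pair in one run."""
    return {
        "rising": mean(rising_ns),
        "falling": mean(falling_ns),
        "average": mean(list(rising_ns) + list(falling_ns)),
    }


def summarize_crossbar(pairs):
    """Return per-pair summaries, the worst-case delay, and the average delay."""
    summaries = {name: summarize_pair(r, f) for name, (r, f) in pairs.items()}
    # Worst case: largest of all the averages computed for any pair.
    worst = max(max(s.values()) for s in summaries.values())
    # Crossbar average: mean of the per-pair average delays.
    average = mean(s["average"] for s in summaries.values())
    return summaries, worst, average


# Hypothetical delays (ns): three rising and three falling transitions per pair.
pairs = {
    "In0->Out0": ([1.0, 1.2, 1.1], [1.4, 1.5, 1.6]),
    "In1->Out1": ([2.0, 2.0, 2.0], [3.0, 3.0, 3.0]),
}
summaries, worst, average = summarize_crossbar(pairs)
print(round(worst, 3), round(average, 3))
```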

Dynamic power consumption generally depends on how often signals toggle between

voltage rails. Such signal activity is specific to the operations being performed and the data

being processed. Because no thorough set of benchmarks has been defined to assess

crossbar power consumption, the estimates presented in this report should not be regarded as

absolutes. Within the context of this project, care has been taken to ensure circuits operated

at the same frequency and with the same I/O activity. In other words, all circuits being

compared have been subjected to the same testbenches.

Power consumption was gauged by monitoring source current, at each input, on every

simulation run. Two values were monitored for each current sample: Ileak and Ipeak. The

direct current (DC) component of the current waveforms was isolated. In most cases, this

value represents the leakage current for a particular input. Leakage across all inputs was

added for every simulation run, and is referred to as Ileak for the remainder of this document.

Ileak was averaged across simulation runs to generate an Ileak value for each circuit. Average

static power consumption was computed by multiplying Ileak by the supply voltage for each

circuit. The absolute value of the largest current peak for each circuit was labeled as Ipeak for

that circuit. The value of Ipeak is directly proportional to dynamic power consumption in the

circuit. Thus, comparing Ipeak for different circuits provides insight into dynamic power

consumption of the circuit.
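The bookkeeping for Ileak, Ipeak, and static power can be sketched as follows. The data structure, the function name `power_summary`, and the current values are hypothetical; the sketch also assumes that Ipeak is the largest absolute peak seen on any input in any run.

```python
def power_summary(runs, vdd):
    """Compute Ileak, Ipeak, and static power for one circuit.

    `runs` is a list of simulation runs; each run maps an input name to a
    (dc_current, peak_current) tuple in amperes.
    """
    # Leakage is added across all inputs for every run...
    per_run_leak = [sum(dc for dc, _ in run.values()) for run in runs]
    # ...and then averaged across runs.
    i_leak = sum(per_run_leak) / len(per_run_leak)
    # Ipeak is the absolute value of the largest current peak.
    i_peak = max(abs(peak) for run in runs for _, peak in run.values())
    return {"i_leak": i_leak, "i_peak": i_peak, "static_power": i_leak * vdd}


# Hypothetical currents for a two-input circuit over two runs, 0.5 V supply.
runs = [
    {"in0": (4.0e-6, -60e-6), "in1": (5.0e-6, 40e-6)},
    {"in0": (4.5e-6, 65e-6), "in1": (4.5e-6, -30e-6)},
]
summary = power_summary(runs, vdd=0.5)
print(summary)
```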

Although all outputs have identical loads, the capacitance seen from a particular input to

the crossbar depends on whether the switch is on or off. If the switch is off, the contribution

to the capacitance is due to the switch transistor itself, and in the case of ultra low power

processes, this load may be significant [14]. If the switch is on, then all of the circuitry on

the other side of the switch will contribute to the load seen by the driving signal. This

means that, when multicasting inputs, the loading will be a function of the number of

crossbar switches that are turned on.

A related problem may arise after chip power-up, but before the

crossbar programming has been downloaded. At this point, the configuration bits have not

been programmed, and it is possible that two or more inputs may be trying to drive the same

output. Reference [6] presents two alternatives to deal with this problem. The first possibility is to

disable the output drivers, using a tristate buffer, during power up and programming. This

design impacts the speed of the crossbar, because the disabling transistors of the tristate

buffer are in the critical path. The second alternative involves adding a global control

signal, which is activated during power up and disables all crossbar connections [6].

Next to interconnect, the most important element in a crossbar is the programmable

switch. The design of the switches will have a significant impact on the area and speed of

the array. FPGAs and other reconfigurable computing applications have used several

switching methodologies, which include Static Random Access Memory (SRAM) controlled

pass transistors and pass gates, antifuses, and even floating-gate-based switching [6]. For

this study, due to the lack of proper electrical models for simulation, only two alternatives

were considered. The decision was whether to use full transmission gates, or single n-

channel pass transistors as switches.

The most obvious advantage of using a single n-channel pass transistor is a decrease in

area. The area benefit comes not only from halving the number of switch transistors, but also

from savings in well area. Well areas might be significant, depending on whether the switches

are scattered enough that individual wells must be provided for each transmission gate. In

addition, internal nodes of n-channel pass transistor circuits may not require contacts, which

would save area and reduce junction capacitance. There is also a savings in routing area, as

the complementary address signal is not required to program the crossbar [6].

A major drawback of the single transistor approach is signal integrity. The pass

transistor causes a voltage drop equal to the threshold voltage (Vt) when passing highs, and

has a slower rise time towards the end of a low-to-high transition. These effects can be

partially compensated by lowering the switching threshold of the succeeding gates.

Unfortunately, the reduced high also means that the pull-up transistor in succeeding gates

will not be fully turned off, resulting in static power dissipation. This becomes a greater

problem when the doping is not adjusted to reduce Vt in the face of reduced supply voltages.

In such a case, the noise margins are severely degraded. Chapters 3 and 4 illustrate the

tradeoffs in using a single pass transistor to build the simple mesh and the Benes crossbars.
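A first-order Python sketch of the degraded high level follows. The threshold voltages are hypothetical (they are not taken from the AMI or ULP models), the function names are illustrative, and body effect, which raises the effective n-channel threshold further, is neglected.

```python
def degraded_high(vdd, vtn):
    """Highest level an NMOS pass transistor with its gate at vdd can pass."""
    return vdd - vtn


def pullup_fully_off(vdd, v_in_high, vtp_abs):
    """True if the PMOS pull-up of the succeeding gate is below threshold.

    The PMOS source sits at vdd, so its source-gate voltage is vdd - v_in_high.
    """
    return (vdd - v_in_high) < vtp_abs


# Hypothetical values: 3.3 V supply, Vtn = 0.9 V, |Vtp| = 0.8 V.
vdd, vtn, vtp_abs = 3.3, 0.9, 0.8
v_high = degraded_high(vdd, vtn)
print("passed high level:", round(v_high, 2))
print("pull-up off with degraded high:", pullup_fully_off(vdd, v_high, vtp_abs))
print("pull-up off with full-rail high:", pullup_fully_off(vdd, vdd, vtp_abs))
```

With these assumed thresholds the succeeding pull-up stays weakly on for a degraded high but is fully off for a full-rail high, which is the static dissipation mechanism described above.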

As stated in section 2.4, CAD tools present an important limitation for this project,

because the wire models available neglect interconnect properties. Although most

synthesizers and simulators model interconnect as a resistance-capacitance (RC) tree, the

resistance parameter is often assumed to be zero, which produces simulation results in which

wire delay has been neglected. At higher clock frequencies, the RC tree models are not even

appropriate, as interconnect starts to behave as a transmission line. Parasitic capacitances

were extracted from the layout. In an effort to recognize the increasing importance of

interconnect delay over gate delay, wire lengths were compared across the different crossbar

architectures. By making sure that wire widths are the same for all lines, one may assume

that wire resistance will be directly proportional to wire length. Switching rates were kept

below 20 MHz, to ensure that the RC tree models are still valid.

CHAPTER 3: THE SIMPLE MESH CROSSBAR

3.1 General circuit description

The simple mesh crossbar is essentially a grid of metal lines. The horizontal lines are

crossbar inputs. The vertical lines are crossbar outputs. At any point of intersection

between an input and an output, a switch selects whether the horizontal line and the vertical

line should be connected together. The control line which turns the switch on or off is

known as SELXX, where XX is a number that identifies each pass switch. These switches

can be either full pass gates, or just NMOS transistors. Figure 3.1 depicts this arrangement

for the 4x4 crossbar, using full pass gates as switches.

Figure 3.1: 4x4 simple mesh crossbar.

The vertical lines are crossbar outputs, and they are laid out in the layer of metal closest

to the transistors, which is known as metal 1. The horizontal lines are crossbar inputs, and

they are routed in metal 2, the layer above metal 1. Metal 3 is reserved for power and ground

connections. Figure 3.2 shows the layout for the 4x4 simple mesh.

Figure 3.2: Layout for simple mesh crossbar.

3.2 Transistor Count and Area

The simple mesh crossbar requires an independent connection for each input/output (I/O)

pair. The I/O connecting switch can be either a full pass gate or an NMOS pass transistor.

In the case of a full pass gate design, each one of these connections requires two PMOS and

two NMOS transistors, for a total of four devices. One PMOS/NMOS pair makes up the

pass gate. The PMOS and NMOS transistors in the pass gate must receive complementary

signals at their transistor gates, for the switch to function properly. Thus, the other two

transistors provide an inverter for the SELXX signal. Figure 3.3 illustrates the transistor

circuit for one I/O connection.

Figure 3.3: Transistor circuit for one I/O connection, using a full

transmission gate as a switch.

Table 3.1 summarizes transistor count and layout area for the three crossbar sizes

constructed for this study. Notice that transistor count is four times higher for the full pass

gate implementation than for the single pass transistor approach.

Table 3.1: Transistor Count and Area

                 Single NMOS switch               Full pass gate switch
Crossbar size    Transistor count   Area (μm²)    Transistor count   Area (μm²)
4x4              16                 4,044         64                 11,729
8x8              64                 15,282        256                46,915
16x16            256                62,346        1024               187,661

The actual layout for these circuits was done “by hand”. That is, no automatic tool

generated the layout. The transistors were actually drawn and placed by the circuit designer,

based on standard cells for simple logic gates. As a result, the computed areas may vary for

other layout constructions. The area for the pass transistor mesh is

approximately one third of the area for the full pass gate version. Based on transistor count,

the pass transistor circuit could probably be compressed to less than 25% of the area for the

full pass gate crossbar. More important than the specific area of any one circuit, the key

issue concerning area is scalability. In other words, how does total circuit area scale with

increasing I/O connections? From Table 3.1 it follows that total area is proportional to N²,

where N is the number of crossbar inputs in an NxN crossbar. Figure 3.4 shows the scaling

trend for such a crossbar.

[Chart: layout area in square microns versus crossbar size (4x4, 8x8, 16x16), for the full pass gate switch and the NMOS switch.]

Figure 3.4: Scaling trend for simple mesh crossbar.
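The quadratic trend can be checked directly from Table 3.1 with a short Python sketch. The areas are copied from the table; the function names are illustrative. If area is proportional to N², then area divided by N² should be roughly constant for each switch type.

```python
# Layout areas from Table 3.1, in square microns, indexed by crossbar size N.
AREA_UM2 = {
    "nmos": {4: 4044, 8: 15282, 16: 62346},
    "pass_gate": {4: 11729, 8: 46915, 16: 187661},
}

# One NMOS device per crosspoint, or a pass gate plus inverter (four devices).
TRANSISTORS_PER_SWITCH = {"nmos": 1, "pass_gate": 4}


def transistor_count(n, switch):
    """Transistors in an NxN simple mesh crossbar."""
    return n * n * TRANSISTORS_PER_SWITCH[switch]


def area_per_crosspoint(switch):
    """Layout area divided by N^2 for each crossbar size."""
    return {n: area / (n * n) for n, area in AREA_UM2[switch].items()}


for switch in AREA_UM2:
    per_point = area_per_crosspoint(switch)
    print(switch, {n: round(a, 1) for n, a in per_point.items()})
```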

3.3 Delay estimations

In the case of the simple mesh crossbar, all inputs must go through exactly one switch,

which is responsible for most of the I/O delay. Additional delay is caused by parasitic

capacitances, which were modeled by the layout extraction procedure. I/O delay was

measured for each input/output pair under several configurations. For this approach it is

possible to turn on just one I/O connection at a time. However, both the Benes network and

the synthesized circuit lack this ability. As a result, to keep comparisons fair, this study only

considers configurations where every output is being toggled by some input.

Tables 3.2 and 3.3 summarize average delays for the three different sizes of the simple

mesh crossbar, across the two fabrication processes being studied. Notice the difference in

rising-transition delay and falling-transition delay for the crossbars in Table 3.2. Remember

that these crossbars use a single NMOS transistor as a programmable switch.

Table 3.2: Delay summary for simple mesh crossbar using NMOS transistors as

switches (delays in nanoseconds).

         AMI                          ULP
Size     Rising   Falling   Avg.      Rising   Falling   Avg.
4x4      1.15     1.21      1.18      1.54     2.19      1.87
8x8      2.74     3.93      3.34      5.09     6.3       5.70
16x16    4.37     5.54      4.96      8.03     11.2      9.62

Table 3.3: Delay summary for simple mesh crossbar using full pass gates as

switches (delays in nanoseconds).

         AMI                          ULP
Size     Rising   Falling   Avg.      Rising   Falling   Avg.
4x4      0.54     0.54      0.54      0.89     0.83      0.86
8x8      1.54     1.58      1.56      3.19     3.51      3.35
16x16    2.39     2.73      2.56      5.91     6.05      5.98

Figure 3.5 shows an NMOS transistor acting as a switch for a rising transition as well as

a falling transition. On a rising transition, the NMOS is initially in the triode region. As the

load capacitance charges towards Vdd, the NMOS gradually turns off. In contrast, before a

falling transition, the NMOS switch is cut off, and it turns on when the input falls below

(Vdd-Vt). While the transistor is off, the voltage swing at the input is not propagated to the

output. In addition to the time it takes the input to drop one threshold voltage, it takes the

transistor a certain amount of time to come out of the cut-off region. Beyond that, it takes a

certain amount of time for the output to reach the voltage at which delay is being measured,

in this case, Vdd/2. Because a rising transition does not need to wait for the switch to turn on

before the signal is propagated, delay measured on rising transitions is less than delay

measured on falling transitions. Note that full pass gates do not exhibit this behavior,

because at least one of the transistors in the pass gate is always turned on.

Figure 3.5: Rising and falling transitions using an NMOS switch.

Figures 3.6 and 3.7 illustrate the behavior of average delay as crossbars become larger.

The AMI process is generally faster than the version using ULP transistors. Additionally,

the circuit using pass gates as programmable switches is roughly twice as fast as the version

that only uses NMOS transistors. Although three points do not provide enough information

to extrapolate a best fit curve, average delay increases by roughly 180% to 290% from the

4x4 crossbar to the 8x8 crossbar, but by only about 50% to 80% from the 8x8 crossbar to

the 16x16 crossbar.

[Chart: delay in nanoseconds versus crossbar size (4x4, 8x8, 16x16) for crossbars based on NMOS switches; series: AMI and ULP.]

Figure 3.6: Average delay for the pass transistor version

of the simple mesh crossbar.

[Chart: delay in nanoseconds versus crossbar size (4x4, 8x8, 16x16) for crossbars based on full pass gates; series: AMI and ULP.]

Figure 3.7: Average delay for the full pass gate version

of the simple mesh crossbar.
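The growth in average delay between crossbar sizes can be computed from the average columns of Tables 3.2 and 3.3. The following Python sketch does so; the dictionary layout and function name are illustrative.

```python
# Average delays in nanoseconds, from Tables 3.2 and 3.3.
AVG_DELAY_NS = {
    ("nmos", "AMI"): {4: 1.18, 8: 3.34, 16: 4.96},
    ("nmos", "ULP"): {4: 1.87, 8: 5.70, 16: 9.62},
    ("pass_gate", "AMI"): {4: 0.54, 8: 1.56, 16: 2.56},
    ("pass_gate", "ULP"): {4: 0.86, 8: 3.35, 16: 5.98},
}


def percent_increase(old, new):
    """Relative growth from `old` to `new`, in percent."""
    return 100.0 * (new - old) / old


for (switch, process), d in AVG_DELAY_NS.items():
    step1 = percent_increase(d[4], d[8])    # 4x4 to 8x8
    step2 = percent_increase(d[8], d[16])   # 8x8 to 16x16
    print(f"{switch:9s} {process}: 4->8 {step1:.0f}%, 8->16 {step2:.0f}%")
```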

To understand why delay would grow less dramatically for larger crossbars, one must

understand how that delay is being modeled. Each I/O connection can be modeled as a

resistance-capacitance (RC) circuit. Recall that wire resistance is assumed to be zero for

these analyses. Thus, the resistive component of the circuit is provided by the dynamic

resistance of the pass transistors. Note that this resistive component is independent of

crossbar size, as every I/O path contains exactly one pass transistor. Based on this premise,

variations in delay would depend entirely on capacitances.
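Under this first-order model, the time for a single RC node to reach 50% of a step input is R·C·ln 2, so with a fixed switch resistance the delay scales linearly with capacitance. The Python sketch below illustrates the point; the resistance and capacitance values are hypothetical and are not taken from the extracted netlists.

```python
import math


def rc_delay_50(r_ohms, c_farads):
    """Time for a first-order RC node to reach 50% of a step input."""
    return r_ohms * c_farads * math.log(2.0)


R_SWITCH = 10e3                       # hypothetical switch resistance, ohms
c_small, c_large = 50e-15, 100e-15    # hypothetical path capacitances, farads

ratio = rc_delay_50(R_SWITCH, c_large) / rc_delay_50(R_SWITCH, c_small)
print("delay ratio for doubled capacitance:", ratio)
```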

Diffusion capacitances for the programmable switches contribute the same amount of

capacitance to each I/O path, regardless of crossbar size. Hence, the only capacitance

contribution that is affected by crossbar size is parasitic capacitance. For this project,

parasitic capacitance was extracted from the layout using Virtuoso®, from Cadence ®.

Capacitance extraction generates lumped capacitors at different nodes in the circuit.

Average parasitic capacitance was computed by adding all extracted capacitances and

dividing by the total number of extracted capacitors. Table 3.4 presents average parasitic

capacitances obtained from the layout for the three crossbar sizes. Notice that average

parasitics seem to double as the number of crossbar inputs doubles. The contribution of

capacitance between metal lines depends on the distance between these metal lines. Wires

that are closer together have more influence on delay than those spaced far apart. The metal

lines added to the 4x4 crossbar to make it an 8x8 crossbar have a significant impact on

delay, because the distance between the preexisting lines and the new lines is relatively

small. In other words, the 8x8 crossbar is small enough that most metal lines represent a

significant contribution to parasitic capacitance at any point in the circuit. The 16x16

version is larger, so that only the lines within a certain radius affect parasitics on any given

line.

Table 3.4: Average parasitic capacitances for simple mesh crossbar

Crossbar size Average extracted capacitance

4x4 10.7 fF

8x8 21.2 fF

16x16 41.7 fF

For the fabrication processes being compared, wire resistance is negligible compared to

the dynamic resistance of transistors. In fact, fabrication process models developed by the

foundries themselves neglect the effects of wire resistance. In an effort to recognize the

increasing role of interconnect delay in overall circuit performance, Table 3.5 lists the

longest paths and their respective wire lengths. Note the effect of scaling in wire length.

Doubling crossbar size essentially doubles worst case wire length across the crossbar.

Table 3.5: Worst case wire lengths

Crossbar size Longest path Length of longest path

4x4 In0 to Out3 187.65 μm

8x8 In0 to Out7 404.6 μm

16x16 In0 to Out15 777.6 μm
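Because all wires share the same width, relative wire resistance on the longest path can be estimated from the lengths in Table 3.5 alone. The Python sketch below assumes resistance is proportional to length (equal width and sheet resistance); the function name is illustrative.

```python
# Longest-path wire lengths from Table 3.5, in microns.
LONGEST_PATH_UM = {4: 187.65, 8: 404.6, 16: 777.6}


def relative_resistance(n, reference=4):
    """Longest-path wire resistance relative to the reference crossbar size."""
    return LONGEST_PATH_UM[n] / LONGEST_PATH_UM[reference]


for n in (4, 8, 16):
    print(f"{n}x{n}: {relative_resistance(n):.2f}x the 4x4 worst-case resistance")
```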

3.4 Power Consumption

Theoretically, CMOS circuits are often said to consume no static power, because, unless

there is an input transition, there is no direct path for current to flow between the power

rails. In real devices, however, a small leakage current flows through the transistors even

when they are cut off.

Let us define Itot as the total current drawn by a crossbar on a given simulation run.

Figure 3.8 shows typical Itot curves for the 4x4 simple mesh crossbar using only NMOS

transistors. Curve A corresponds to a simulation run using AMI transistor models. Curve B

depicts the results for the same simulation run using ULP transistor models. The crossbar is

operating without multicasting. For the first 30 ns, all data inputs to the crossbars are low.

At 30 ns, all inputs to the crossbar switch to high. All inputs stay high for at least 100 ns

more. Beyond the first 130 ns, each input toggles at a different rate.

Figure 3.8: Itot for mesh with NMOS switches.

Curve A for AMI, and Curve B for ULP.

Figure 3.9 presents the load seen by any data input line in the circuit. Transistors T1

through T4 provide connections from the input to any one of the outputs. Observe that in

one-to-one operation, only one of the switches will be turned on at any given time.

Consequently, the input source must provide enough current to drive the load on one output,

plus the leakage current drawn by the three switches that are turned off.

Figure 3.9: Load seen by one crossbar input.

Going back to our analysis of figure 3.8, since all data voltages are zero for the first 30 ns

of the simulation, all transistors are cut off. When all switches are cut off, the current drawn

by the circuit approximates the true leakage current of the system. Note that, under these

conditions, the ULP transistors produce more leakage than their AMI counterparts, just as

expected. At 30 ns there is a current spike, consistent with the low to high transition of the

inputs. At this point, capacitive loads on the outputs of active NMOS switches must be

charged, drawing current from the input sources. These current spikes are responsible for

dynamic power consumption of the circuit. Once the load capacitances have been charged,

current consumption falls back to leakage current levels. At 130 ns, another current spike

indicates that at least one of the inputs has transitioned from high to low.

So far, power consumption has behaved exactly as predicted in the literature. Hass et al.

have pointed out that leakage may be significant for the ULP process, because of the low

threshold voltage associated with ULP [14]. Furthermore, they mention that highly parallel

structures, such as crossbars or memories are more likely to be affected by high leakage

currents. The curves in figure 3.8 show that for the first 130 ns of the simulation, ULP does

indeed consume more static power than AMI. However, after 130 ns, static power

consumption for AMI is far larger than that of ULP.

Recall, however, that using a single pass transistor leads to narrower noise margins and

higher static power consumption. Figure 3.10 shows an unbuffered crossbar output for the

simulation above, using the AMI transistors. The corresponding input is shown as a

reference. Figure 3.11 is the result of repeating the simulation using ULP transistors. The

static power dissipation evident in the Itot curves is a consequence of the degraded voltage

levels seen at the crossbar outputs. The fact that the output voltage stays between 0.3 volts

and 3 volts is evidence that transistors are not fully turned off, thereby increasing static

power consumption. Figures 3.8, 3.10, and 3.11 illustrate the inherent shortcomings of

using a pass transistor as a programmable switch.

Figure 3.10: Output voltage degradation for the AMI crossbar.

Figure 3.11: Output voltage degradation for the ULP crossbar.

Figure 3.12 is equivalent to figure 3.8, except that the circuits simulated use full pass

gates as programmable switches. Note that, in this case, the AMI circuit shows virtually no

static power consumption. However, the AMI circuit also shows larger current peaks on

signal transitions. On the other hand, the ULP process shows significant leakage, with much

smaller current spikes on signal transitions. As expected, ULP consumes far less dynamic

power, at the expense of higher leakage.

Figure 3.12: Total current for mesh with full pass gates as switches.

Curve A for AMI, and Curve B for ULP.

One of the main goals of this study is to determine which crossbar architecture works

best in spite of leaky transistors. It is interesting to note that the choice of programmable

switch crucially alters this assessment. For an implementation using pass transistors, the

static power consumption due to degraded voltage levels overshadows the effects of leakage.

In that case, static power consumption is more closely related to the process threshold

voltages. Higher threshold voltages translate into more voltage degradation, so that the

voltage levels are often insufficient to fully cut off transistors. In contrast, for a full pass

gate design, static power consumption is directly proportional to leakage current. Tables 3.6

and 3.7 summarize leakage current, average static power consumption, and Ipeak, for all the

circuits being compared.

Table 3.6: Power consumption summary for simple mesh crossbar using NMOS

transistors as switches. Currents in μA, power in milliwatts.

         AMI                               ULP
Size     Ileak   Ipeak   Static power      Ileak   Ipeak   Static power
4x4      0       835     0.58              17.9    65      0.009
8x8      0       2,254   1.47              49.5    221.9   0.025
16x16    0.3     5,313   1.88              93      528.4   0.047

Table 3.7: Power consumption summary for simple mesh crossbar using full pass

gates as switches. Currents in μA, power in milliwatts.

         AMI                               ULP
Size     Ileak   Ipeak   Static power      Ileak   Ipeak   Static power
4x4      0       875     0                 8.2     105.3   0.004
8x8      0       2,293   0                 26      221.8   0.013
16x16    0       5,290   0                 92.9    508.3   0.046

When NMOS transistors are used as switches, the AMI crossbars consume more static

power than their ULP counterparts. However, the full pass gate implementation yields no

static power consumption at all for the AMI crossbar. Static power consumption due to leakage

is on the order of tens of microwatts at most for the crossbar sizes in this study. Even if each crossbar input

were a 16-bit bus, worst case static power dissipation for a 16x16 crossbar would be on the

order of one milliwatt. Table 3.6 and figure 3.8 show that static power dissipation due to

degraded output voltages is far more significant than static power dissipation due to leakage

current, even for a high leakage process such as ULP.

Comparing the values for Ipeak across fabrication processes, signal transitions on the AMI

crossbar draw around ten times more current than the corresponding ULP transitions. In

other words, AMI consumes much more dynamic power than ULP, for a given operating

frequency and capacitive load.
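These comparisons can be reproduced from Tables 3.6 and 3.7 with a short Python sketch. The currents are copied from the tables and the 0.5 volt ULP supply is the value stated earlier; the dictionary layout and function names are illustrative.

```python
VDD_ULP = 0.5  # volts, ULP supply voltage

# Ipeak in microamps, from Tables 3.6 (NMOS) and 3.7 (full pass gate).
I_PEAK_UA = {
    "nmos": {"AMI": {4: 835, 8: 2254, 16: 5313},
             "ULP": {4: 65, 8: 221.9, 16: 528.4}},
    "pass_gate": {"AMI": {4: 875, 8: 2293, 16: 5290},
                  "ULP": {4: 105.3, 8: 221.8, 16: 508.3}},
}

# ULP Ileak in microamps, from the same tables.
ULP_LEAK_UA = {
    "nmos": {4: 17.9, 8: 49.5, 16: 93.0},
    "pass_gate": {4: 8.2, 8: 26.0, 16: 92.9},
}


def peak_ratio(switch, n):
    """Ratio of AMI to ULP peak current for one crossbar."""
    return I_PEAK_UA[switch]["AMI"][n] / I_PEAK_UA[switch]["ULP"][n]


def ulp_static_power_mw(switch, n):
    """Leakage-related static power: microamps times volts gives microwatts."""
    return ULP_LEAK_UA[switch][n] * VDD_ULP / 1000.0


for switch in I_PEAK_UA:
    for n in (4, 8, 16):
        print(switch, n, round(peak_ratio(switch, n), 1),
              round(ulp_static_power_mw(switch, n), 4))
```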

3.5 Other considerations

In general, VLSI circuits are compared according to speed, layout area, and power

consumption, as the previous sections have described. According to their specific function,

crossbars should also be compared in terms of their architectural efficiency. As a figure of

merit, architectural efficiency seeks to summarize the compromises between programming

flexibility, number of I/O pins, signal delay, and design complexity. It must be pointed out

that architectural efficiency does not take into account any electrical parameters of the actual

crossbar circuit. The computed efficiency for a given architecture is independent of the

fabrication process, or the actual layout information of the circuit. In a sense, architectural

efficiency serves as a first order approximation for comparing crossbar architectures,

regardless of the actual implementation details. Table 3.8 presents architectural efficiency

for the simple mesh crossbar.

Table 3.8: Architectural efficiency for the simple mesh

Size     I/O connections   Transistor count   Delay elements per connection   Utilization   Arch. Eff.
4x4      16                64                 1                               0.25          0.0625
8x8      64                256                1                               0.125         0.03125
16x16    256               1024               1                               0.0625        0.01563

Table 3.8 shows that a 4x4 simple mesh crossbar is a more efficient circuit than the 8x8,

which in turn is more efficient than the 16x16. This means that although a larger crossbar is

more flexible, the price being paid in terms of transistor count, utilization, and delay grows

faster than the number of connection alternatives achieved.
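The values in Table 3.8 can be reproduced with the Python sketch below. The formula used here is an assumption: architectural efficiency is defined outside this section, and the expression was chosen only because it matches the tabulated values (utilization times I/O connections, divided by the product of transistor count and delay elements per connection).

```python
def utilization(n):
    """Fraction of switches in use when each output is driven by one input."""
    return n / (n * n)


def architectural_efficiency(n, transistors_per_switch=4, delay_elements=1):
    """Assumed figure of merit that reproduces Table 3.8 for the simple mesh."""
    connections = n * n
    transistors = connections * transistors_per_switch
    return utilization(n) * connections / (transistors * delay_elements)


for n in (4, 8, 16):
    print(f"{n}x{n}: utilization={utilization(n)}, "
          f"efficiency={architectural_efficiency(n)}")
```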

Among the qualitative advantages of the simple mesh architecture, the most obvious is its

low design complexity. This circuit yields an extremely regular layout that can be tiled to

produce crossbars of different sizes. Yet, the greatest drawback for this circuit is its poor

scalability. The number of switches increases quadratically with crossbar size. Besides the

obvious impact such growth has on transistor count and area, for the simple mesh it also

means a quadratic increase in the number of configuration signals required. Depending on

the implementation of the programming interface, a sharp increase in configuration signals

can lead to either longer configuration times, or greater pin counts.
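The growth in configuration signals can be made concrete with a small Python sketch. The mesh count follows from one SELXX signal per switch. The multiplexer count is shown only for comparison and is an assumption: it supposes one select field of log2(N) bits per output, as in the port list of the 4x4 VHDL description in Figure 4.1, and ignores the direction signal.

```python
def mesh_config_signals(n):
    """One select signal per crosspoint switch in an NxN simple mesh."""
    return n * n


def mux_select_bits(n):
    """Assumed count for a multiplexer-based crossbar: N fields of ceil(log2 N) bits."""
    bits_per_output = (n - 1).bit_length()
    return n * bits_per_output


for n in (4, 8, 16):
    print(f"{n}x{n}: mesh={mesh_config_signals(n)}, mux={mux_select_bits(n)}")
```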

Concerns about circuit area become even more pressing for applications where reliability

is an issue. Fault tolerant systems rely on redundant circuitry, which can easily double or

triple circuit area and transistor count. Some fault detection and correction schemes even

call for five copies of a given circuit. Other reliability issues, such as radiation tolerance,

impose restrictions on the separation between circuit components. Radiation tolerance also

uses guardbands, and other layout techniques, to improve circuit reliability in a harsh

environment.

The simple mesh approach is best suited for small crossbars, like the ones presented in

this study. The ideal application for a simple mesh crossbar is one where pin count is not a

limiting factor. Alternatively, when pins are limited, a simple mesh crossbar would still be a

good choice, as long as slower configuration times can be tolerated.

CHAPTER 4: MULTIPLEXER-BASED CROSSBAR


It is possible to synthesize a crossbar from a hardware description language (HDL)

representation. First, the synthesizer generates a circuit that produces the behavior specified

by the HDL description. The synthesized circuit is then mapped onto a standard cell library,

which is associated with a specific fabrication process. A standard cell library can be

thought of as the set of building blocks that may be used to implement the circuit in a

particular technology. Different standard cell libraries may contain different building

blocks. Thus, the final synthesized circuit depends on the standard cell library used.

The final implementation also depends on the synthesizer itself, and its ability to translate

HDL instructions into logic circuits. In fact, it is possible to write HDL code that can be

simulated but not synthesized. Figure 4.1 shows VHDL code for the 4x4 bidirectional

crossbar. Figure 4.2 shows the synthesized circuit for this code. Note that the “with-select”

statements have synthesized to multiplexers, while the “if” statement has synthesized to

tristate buffers. In general, “case” and “select” clauses synthesize to multiplexers, but “if”

statements are mapped to individual tristate buffers. Notice that any given algorithm may be

described in several different ways in HDL. At the behavioral level, the choice between an

“if” and a “case” is based on convenience. However, when coding for synthesis, the

designer should write the HDL description keeping the target circuit in mind. Although

multiplexers and tristate buffers may accomplish the same logical behavior, their physical

characteristics are often very different.

library ieee;

use ieee.std_logic_1164.all;

use ieee.std_logic_arith.all;

entity xbar is

port( dir: in std_logic;

d0,d1,d2,d3: inout std_logic;

sa0,sa1,sa2,sa3: in std_logic_vector(1 downto 0);

xb0,xb1,xb2,xb3: inout std_logic);

end xbar;

architecture BEHAVE of xbar is

begin

if dir='0' then

with sa0 select

xb0 <= d0 when "00", d1 when "01", d2 when "10", d3 when "11", d0 when others;

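with sa1 select

xb1 <= d0 when "00", d1 when "01", d2 when "10", d3 when "11", d0 when others;

with sa2 select

xb2 <= d0 when "00", d1 when "01", d2 when "10", d3 when "11", d0 when others;

with sa3 select

xb3 <= d0 when "00", d1 when "01", d2 when "10", d3 when "11", d0 when others;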

elsif dir='1' then

with sa0 select

d0 <= xb0 when "00", xb1 when "01", xb2 when "10", xb3 when "11", xb0 when others;

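with sa1 select

d1 <= xb0 when "00", xb1 when "01", xb2 when "10", xb3 when "11", xb0 when others;

with sa2 select

d2 <= xb0 when "00", xb1 when "01", xb2 when "10", xb3 when "11", xb0 when others;

with sa3 select

d3 <= xb0 when "00", xb1 when "01", xb2 when "10", xb3 when "11", xb0 when others;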

end if;

end BEHAVE;

Figure 4.1: VHDL code for 4x4 bidirectional crossbar.


Figure 4.2: Synthesized 4x4 bidirectional crossbar

The standard cell library for the AMI process was provided by the Electrical Engineering

Department at North Carolina State University (NCSU), through their NCSU Cadence

package. The NCSU standard cell library does not include a four-to-one multiplexer. Thus,

each of the four-to-one multiplexers shown in figure 4.2 was constructed using two-input

multiplexers, as shown in figure 4.3.


Figure 4.3: Four to one MUX implementation.

In fact, the only multiplexer included in the AMI standard cell library available with the

NCSU package is a two-to-one multiplexer. Any larger multiplexer can be built using this

basic multiplexer. Thus, all crossbars synthesized from the AMI standard cell library consist

of two-to-one multiplexers, tristate buffers, and some combinational logic.
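The construction of a four-to-one multiplexer from two-to-one multiplexers can be sketched behaviorally. The sketch below assumes the common tree arrangement of three two-to-one multiplexers; figure 4.3 is not reproduced here, so the exact wiring in the synthesized netlist may differ. The function names mux2 and mux4 are illustrative only.

```python
def mux2(a, b, sel):
    """Two-to-one multiplexer: returns a when sel is 0, b when sel is 1."""
    return b if sel else a


def mux4(d0, d1, d2, d3, sel):
    """Four-to-one multiplexer built as a tree of three two-to-one multiplexers.

    sel is a 2-bit integer (0..3). Its low bit drives the first level,
    and its high bit drives the second level (assumed arrangement).
    """
    low_bit = sel & 1
    high_bit = (sel >> 1) & 1
    first_pair = mux2(d0, d1, low_bit)    # selects between d0 and d1
    second_pair = mux2(d2, d3, low_bit)   # selects between d2 and d3
    return mux2(first_pair, second_pair, high_bit)
```

In this arrangement every data input passes through two multiplexer levels, which is the behavioral counterpart of the extra delay element that appears each time the number of inputs doubles.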

Layout for this circuit was obtained using Silicon Ensemble ® from Cadence ®. Silicon

Ensemble is an automated place and route tool. The routed circuit is then exported to the

Virtuoso ® Layout Editor using GDSII format. After performing a Design Rule Check

(DRC) on the layout, parasitic capacitances are extracted using the same procedure followed

for custom layout crossbars. SPICE netlists are then generated and simulated as explained

in chapter 2. Figure 4.4 presents layout for the 4x4 synthesized crossbar. Other than power

and ground rails, metal lines do not run straight from inputs to outputs, as they do on custom

layout for crossbars. Instead, inputs and outputs follow convoluted patterns, often spanning

two or even three metal layers. The use of vias for datapath has a significant impact on

delay that CAD tools used for this project do not model.


Figure 4.4: Layout for the synthesized 4x4 crossbar.


The basic building block for the synthesized crossbar is the two-to-one multiplexer, as

implemented in NCSU’s AMI standard cell library. Figure 4.5 depicts the corresponding

transistor circuit for the two-to-one multiplexer. Other standard cells used in the crossbar

netlist include two-input NAND gates, two-input NOR gates, four-input AOI gates, and

inverters.


Figure 4.5: Transistor circuit for a two-to-one multiplexer.

Table 4.1 summarizes transistor count and layout area for the three crossbar sizes

constructed for this study. Transistor count was obtained from Design Compiler ®, the

synthesis tool that generated the circuits based on the VHDL description.

Table 4.1: Transistor count and layout area

Crossbar size   Transistor count   Area (μm²)

4x4 248 13,870

8x8 1168 59,302

16x16 5032 257,568

It must be pointed out that, because the layout was generated using an automatic tool,

logic density for the synthesized crossbar is greater than for the other two crossbar

implementations. Place and route tools allow the user to define a target area for their layout.

Additionally, the circuit designer is asked for a utilization factor, and a mapping effort. The

layout data summarized in table 4.1 was obtained using 75% utilization with mapping effort

set to low. This means that circuit area can probably be reduced by at least 25% from the

values presented above. As with the simple mesh, because area is somewhat dependent on


judgement calls made by the circuit designer, actual area may vary for other place and route

runs of the same netlist.

The key advantage of automatic place and route is that design time is reduced drastically.

Not only are tedious and repetitive tasks done automatically, but the likelihood of an error

on the layout becomes minimal. In general, place and route tools will also achieve denser

layouts than custom designs. Observe, however, that layout area for the synthesized

crossbar is actually larger than that for a simple mesh crossbar. The synthesizer is

constrained by the specific instructions in the HDL description, and the standard cells

available in the library. A human designer or a specialized crossbar generator can spot the

underlying regularity in crossbar behavior, and come up with alternative circuits that

optimize area, speed, or power consumption, depending on the intended crossbar

application.

Figure 4.6 illustrates how area scales, according to the data in table 4.1. Transistor count

increases by a factor of about 4.7 between the 4x4 and the 8x8 crossbar. The scaling between an

8x8 crossbar and a 16x16 crossbar corresponds to a factor of about 4.3. Target area for each

layout was chosen according to this trend.
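As a quick check on this scaling trend, the growth factors can be computed directly from table 4.1. The sketch below only restates the table values; the names TABLE_4_1 and growth_factors are illustrative and not part of the original work.

```python
# Values copied from table 4.1: size -> (transistor count, layout area in square microns).
TABLE_4_1 = {4: (248, 13870), 8: (1168, 59302), 16: (5032, 257568)}


def growth_factors(table):
    """Return {(small, large): (transistor_ratio, area_ratio)} for each doubling in size."""
    sizes = sorted(table)
    factors = {}
    for small, large in zip(sizes, sizes[1:]):
        t_small, a_small = table[small]
        t_large, a_large = table[large]
        factors[(small, large)] = (t_large / t_small, a_large / a_small)
    return factors


if __name__ == "__main__":
    for (small, large), (t_ratio, a_ratio) in growth_factors(TABLE_4_1).items():
        print(f"{small}x{small} -> {large}x{large}: "
              f"transistors x{t_ratio:.2f}, area x{a_ratio:.2f}")
```

Both ratios stay somewhat above four for each doubling, which is the quadratic trend discussed in the text.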

[Chart: Scaling trend for synthesized crossbar. Layout area in square microns versus crossbar size (4x4, 8x8, 16x16).]

Figure 4.6: Scaling trend for synthesized crossbar.


In a multiplexer-based crossbar, the number of delay elements that a signal must go

through varies with crossbar size. Because datapaths are constructed using two-to-one

multiplexers, doubling the number of inputs results in an additional multiplexer on each line.

Additional delay is caused by parasitic capacitances, which were modeled by the layout

extraction procedure. I/O delay was measured for each input/output pair under several

configurations.

Table 4.2 summarizes average delays for the three different sizes of the synthesized

crossbar, across the two fabrication processes being studied. Notice there is little difference

in rising-transition delay and falling-transition delay for the simulations using AMI

transistor models. Transistors in the standard cells have been sized for equal rise and fall

times. Because carrier mobility is different in the ULP models, the difference in delay for

rising and falling transitions is more noticeable.


Table 4.2: Delay summary for multiplexer-based crossbar (delays in ns)

AMI ULP


4x4 0.786 0.896 0.841 1.986 2.694 2.34

8x8 2.61 2.517 2.564 6.602 7.87 7.236

16x16 3.912 3.022 3.467 10.17 8.456 9.313

Figure 4.7 illustrates the behavior of average delay as crossbars become larger. As

expected, the AMI process is faster than the version using ULP transistors. Again, three

points do not provide enough information to extrapolate a best fit curve. Still, average

delay increases by more than 200% from the 4x4 crossbar to the 8x8 crossbar, yet by only

about 30% to 35% from the 8x8 crossbar to the 16x16 crossbar.

Interestingly, this behavior of delay with crossbar scaling is very similar to that for the

simple mesh crossbar. From these results, it seems that the addition of one multiplexer on

every data line does not significantly impact average delay. As explained in chapter 3, the

influence of parasitic capacitances seems to be the driving factor in delay across different

crossbar sizes.
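The relative growth in delay can be recomputed from the averages in table 4.2. The sketch below assumes that the third column under each process is the average of the other two, which agrees with the tabulated numbers; the names AVG_DELAY_NS and percent_increase are illustrative.

```python
# Average delays in nanoseconds for the 4x4, 8x8, and 16x16 crossbars (table 4.2).
AVG_DELAY_NS = {
    "AMI": [0.841, 2.564, 3.467],
    "ULP": [2.34, 7.236, 9.313],
}


def percent_increase(values):
    """Percentage increase between each consecutive pair of values."""
    return [100.0 * (b - a) / a for a, b in zip(values, values[1:])]


if __name__ == "__main__":
    for process, delays in AVG_DELAY_NS.items():
        steps = ", ".join(f"{p:.0f}%" for p in percent_increase(delays))
        print(f"{process}: {steps}")
```

The first doubling roughly triples the average delay, while the second doubling adds a much smaller fraction.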

[Chart: Average delay for synthesized crossbar. Delay in nanoseconds versus crossbar size (4x4, 8x8, 16x16); series: AMI, ULP.]

Figure 4.7: Average delay for the synthesized crossbar.


For the fabrication processes being compared, wire resistance is negligible compared to

the dynamic resistance of transistors. In fact, fabrication process models developed by the


increasing role of interconnect delay in overall circuit performance, table 4.3 lists the longest

paths and their respective wire lengths. Note the effect of scaling in wire length. Doubling

crossbar size increases worst case wire length by roughly 100 μm.

Table 4.3: Worst case wire lengths

Crossbar size Longest path Length of longest path

4x4 d0 to xb2 135 μm

8x8 d0 to xb4 238 μm

16x16 xb0 to d11 340 μm

In the synthesized crossbar, wire length is determined by the place and route tool.

However, it is possible to issue design constraints telling the tool to keep a certain I/O path

under a maximum length. No such design constraints were set for this experiment. Because

all I/O signals must traverse the same kinds of logic gates, and the designer has not singled

out any specific wires through design constraints, it is reasonable to expect I/O wire lengths

to behave according to a Gaussian distribution. This is indeed the case. In the simple mesh,

all I/O connections had different lengths, depending on the position of the crossbar switch.

Meanwhile, most I/O connections in the synthesized crossbar are roughly the same length,

with several wires slightly above or below this average.

4.4 Power Consumption:

Table 4.4 presents power consumption data for the multiplexer-based crossbar. As

expected, the AMI circuit shows virtually no static power consumption. On the other hand,


the ULP process shows significant leakage, with static power consumption on the order of

microwatts. Yet, comparing the values for Ipeak across fabrication processes, signal

transitions on the AMI crossbar draw around ten times more current than the corresponding

ULP transitions. In other words, AMI consumes much more dynamic power than ULP, for

a given operating frequency and capacitive load.

Table 4.4: Power consumption summary for synthesized crossbar. Currents in

μA, power in milliwatts.

AMI ULP

Size Ileak Ipeak Static power Ileak Ipeak Static power

4x4 0 3,683 0 26.22 372.7 0.01311

8x8 0 14,040 0 70.7 1,519 0.03535

16x16 0 31,290 0 364.93 4,012 0.1824

Recalling that capacitive load is 0.1 pF per input for our simulations, it is possible to

compute dynamic power consumption as a function of frequency for both fabrication

processes. Furthermore, adding static power consumption to this rate yields total power

consumption as a function of frequency. Figures 4.8, 4.9, and 4.10 show total power

consumption as a function of frequency for the three crossbar sizes. On all of these graphs,

the point of intersection between the two lines corresponds to the minimum frequency at

which ULP consumes less power than AMI.
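The intersection point can also be written in closed form. If both versions follow the dynamic power model P = C*f*Vdd^2*u and only the ULP version has appreciable static power, the two totals are equal at f = P_static / (C*u*(Vdd_AMI^2 - Vdd_ULP^2)). The sketch below is a minimal illustration of that relation. The ULP supply voltage is inferred from table 4.4 as static power divided by leakage current. The AMI supply voltage and the total switched capacitance passed in the usage example are placeholder assumptions, not values taken from this study, so the result need not reproduce the figures.

```python
def implied_supply_voltage(static_power_mw, leak_current_ua):
    """Supply voltage implied by static power = leakage current x supply voltage."""
    return (static_power_mw * 1e-3) / (leak_current_ua * 1e-6)


def crossover_frequency(static_w, cap_f, vdd_ami, vdd_ulp, activity=1.0):
    """Frequency (Hz) at which static_w + C*f*vdd_ulp^2*u equals C*f*vdd_ami^2*u.

    static_w is the ULP static power in watts, and cap_f is the switched
    capacitance in farads, assumed identical for both processes.
    """
    return static_w / (cap_f * activity * (vdd_ami ** 2 - vdd_ulp ** 2))


if __name__ == "__main__":
    # ULP row of table 4.4 for the 4x4 crossbar: 26.22 uA leakage, 0.01311 mW static power.
    vdd_ulp = implied_supply_voltage(0.01311, 26.22)
    # Placeholder assumptions: 3.3 V AMI supply and 0.4 pF total switched capacitance.
    f = crossover_frequency(13.11e-6, 0.4e-12, 3.3, vdd_ulp)
    print(f"implied ULP supply: {vdd_ulp:.2f} V, crossover near {f / 1e6:.2f} MHz")
```

Because the static term is constant and both dynamic terms are linear in frequency, the crossover moves to lower frequencies as switched capacitance grows faster than leakage.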

[Chart: Total power consumption vs. frequency (4x4 crossbar). Power consumption in microwatts versus frequency in MHz; series: AMI, ULP.]

Figure 4.8: Power consumption vs. frequency for the 4x4 synthesized crossbar

[Chart: Total power consumption vs. frequency (8x8 crossbar). Power consumption in microwatts versus frequency in MHz; series: AMI, ULP.]

Figure 4.9: Power consumption vs. frequency for the 8x8 synthesized crossbar

[Chart: Total power consumption vs. frequency (16x16 crossbar). Power consumption in microwatts versus frequency in MHz; series: AMI, ULP.]

Figure 4.10: Power consumption vs. frequency for the 16x16 synthesized crossbar

From figure 4.8, the minimum frequency at which the ULP version of the 4x4

synthesized crossbar consumes less power than its AMI counterpart is approximately 3

MHz. This means that for ULP to actually save power over AMI, all inputs to the crossbar

must toggle at an average rate of 3 MHz. In other words, if every input toggles roughly

once every ten clock cycles, the minimum clock frequency for which ULP saves power is 30

MHz.


For the 8x8 multiplexer-based crossbar, the minimum frequency that yields power

savings for ULP is approximately 0.95 MHz. For the 16x16 crossbar this minimum

frequency drops below 0.3 MHz. As a general tendency, circuits operating at lower

frequencies are more affected by leakage than those that toggle at faster rates. It is also true

that leakage current in ULP is less harmful for larger crossbars than it is for smaller ones.


4.5 Other considerations:

Table 4.5 presents architectural efficiency for the multiplexer-based crossbar.

Table 4.5: Architectural efficiency for the synthesized crossbar

Size   I/O connections   Transistor count   Delay elements per connection

4x4 16 248 3 0.333 0.00926

8x8 64 1168 4 0.21 0.00383

16x16 256 5032 5 0.133 0.00089

Although architectural efficiency will be used in chapter 6 as a first order comparison of

crossbar architectures, this parameter does not account for the main advantage of the

multiplexer-based crossbar. The key reason for choosing this approach is the fact that it is

easily synthesizable. Even if the architecture itself is less efficient, the significant reduction

in design time can be used to achieve more thorough optimizations on critical design

criteria. In some ways, this approach exhibits even lower design complexity than the simple

mesh architecture. While the final circuit is certainly more complex than a simple mesh,

most of that complexity is being handled by CAD tools, rather than human designers.

The major drawback of the multiplexer-based architecture is precisely its circuit

complexity. This approach has the highest transistor count for all crossbar sizes, with four

times as many transistors as the simple mesh approach. To keep a reasonable area, logic

density must be much higher for the synthesized crossbar than for other approaches. When

transistors are packed more tightly, a particle hitting an integrated circuit will affect more

transistors. Thus the multiplexer-based crossbar is inherently less radiation tolerant than

other crossbars with similar areas. Moreover, once the circuit is fabricated, it becomes

harder to locate a particular source of failure, because the circuit layout does not reflect the

datapath nature of the crossbar.

Multiplexers and logic gates are not bidirectional. Consequently, the synthesized version

of the crossbars requires an additional input pin to select direction of dataflow. The 4x4

crossbar requires eight address bits to program all its connections. The 8x8 crossbar needs


24 address bits, and the 16x16 crossbar needs 64 address bits. The number of programming

bits does not grow as rapidly with crossbar size as it does for the simple mesh.
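The address bit counts quoted above follow from one selector of log2(N) bits per output line. The sketch below compares that count with a simple mesh, under the assumption (not stated explicitly here) that the mesh needs one configuration signal per switch, that is, N*N signals. Function names are illustrative.

```python
def mux_address_bits(n):
    """Address bits for an n x n multiplexer-based crossbar (n a power of two).

    Each of the n output lines has its own selector of log2(n) bits.
    """
    return n * (n.bit_length() - 1)


def mesh_config_signals(n):
    """Configuration signals for an n x n simple mesh, assuming one per switch."""
    return n * n


if __name__ == "__main__":
    for n in (4, 8, 16):
        print(f"{n}x{n}: multiplexer {mux_address_bits(n)} bits, "
              f"mesh {mesh_config_signals(n)} signals")
```

Under this assumption the two counts differ by a factor of N / log2(N), so the gap widens as the crossbar grows.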

It must be noted that the synthesized crossbar is less flexible than the simple mesh

approach in terms of direction of dataflow. The simple mesh is truly bidirectional, because

every line can be used as an input or an output independently of other lines. With the

current multiplexer-based design, direction of dataflow is set for all connections in the

crossbar with one signal. Consequently, if d0 is being used as an input, all d lines are being

used as inputs, and all xb lines are being used as outputs.


CHAPTER 5: THE BENES NETWORK


The standard Benes network is described in [1] and [2]. The architecture is based on

butterfly switches. A butterfly switch has two data inputs, two data outputs, and one control

signal. Figure 5.1 illustrates the behavior of a butterfly switch. When S is zero, In0 is

connected to Out0, and In1 is connected to Out1. When S is one, In0 is connected to Out1,

and In1 is connected to Out0.
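The behavior just described can be captured in a few lines. This is a purely behavioral sketch of the butterfly switch; the function name and tuple return value are illustrative choices.

```python
def butterfly(in0, in1, s):
    """Return (out0, out1) for a butterfly switch with control signal s.

    s == 0: In0 -> Out0 and In1 -> Out1 (straight through).
    s == 1: In0 -> Out1 and In1 -> Out0 (crossed).
    """
    return (in1, in0) if s else (in0, in1)
```

For example, butterfly("a", "b", 1) yields ("b", "a").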

Figure 5.1: Behavior of the butterfly switch.

Using butterfly switches, it is possible to construct non-blocking networks of virtually

any size. Figure 5.2 illustrates a 4x4 crossbar Benes network. Each block labeled with a

“B” represents a butterfly switch. Each butterfly switch is controlled by an independent

control signal. These control signals have been removed from figure 5.2 for clarity.

Figure 5.2: 4x4 Benes network.


Unlike the simple mesh, or the synthesized crossbar, building a 4x4 Benes network

does not provide an intuitive notion of how to build larger crossbars. In general, given a

crossbar size, there are multiple non-blocking Benes networks. The specific architectures

considered in this project use figure 5.2 as their basic building block. Figure 5.3 shows the

8x8 Benes architecture, while figure 5.4 shows the 16x16 version. These are known as

perfect shuffle networks [9].




Layout for the Benes networks was drawn starting with a custom butterfly switch. These

switches were then tiled and connected as suggested by figures 5.2, 5.3, and 5.4. It is

possible to make a more compact layout by sharing wells and some transistors that become

redundant on larger crossbars. As pointed out in previous chapters, layout area depends


partly on subjective decisions made by the circuit designer. Throughout this project, a

conscious effort has been made to provide conservative area estimations. In other words,

area estimates are meant to be worst case values for any 0.6 micron fabrication process.

Figure 5.5 depicts layout for the 4x4 Benes network.

Figure 5.5: Layout for the 4x4 Benes network.


A full pass gate implementation of a butterfly switch is shown in figure 5.6. Note that

such a circuit requires both the true and complement forms of the control signal.

Alternatively, NMOS switches may be used in place of the full pass gates.


Figure 5.6: Butterfly switch using full pass gates.

As Benes networks grow larger, any given signal must go through more butterfly

switches before reaching an output. In particular, every time crossbar size doubles, two

butterfly switches are added to every possible I/O path. Too many pass gates or pass

transistors in a signal path may lead to degraded signal integrity, slower rise and fall times,

and decreased throughput, making the circuit unsuitable for many applications and

environments. In an effort to alleviate these concerns, a buffer has been added to the outputs

of each butterfly switch. Table 5.1 summarizes transistor count and layout area for the three

crossbar sizes built using Benes networks. Notice that transistor count for the NMOS switch

version is 60% of the transistor count for the equivalent full pass gate circuit. A full-pass-

gate butterfly switch consists of ten transistors. The NMOS butterfly switch uses four pass

transistors and an inverter, for a total of six transistors.

Table 5.1: Transistor Count and Area

Single NMOS switch Full pass gate switch

Crossbar size Transistor count Area (μm2) Transistor count Area (μm2)

4x4 36 11,483 60 23,011

8x8 120 42,472 200 91,882

16x16 336 103,518 560 207,926
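The transistor counts in table 5.1 are consistent with the standard Benes switch count of (N/2) * (2*log2(N) - 1) butterfly switches, multiplied by the per-switch transistor counts given in the text (six for the NMOS version, ten for the full pass gate version). The sketch below reproduces that arithmetic; the names are illustrative, and the switch-count formula is the textbook result rather than something derived in this study.

```python
TRANSISTORS_PER_SWITCH = {"nmos": 6, "full_pass_gate": 10}


def benes_switch_count(n):
    """Butterfly switches in an n x n Benes network (n a power of two)."""
    columns = 2 * (n.bit_length() - 1) - 1   # 2*log2(n) - 1 columns
    rows = n // 2                            # n/2 switches per column
    return rows * columns


def benes_transistor_count(n, kind):
    """Transistor count using the per-switch figures quoted in the text."""
    return benes_switch_count(n) * TRANSISTORS_PER_SWITCH[kind]


if __name__ == "__main__":
    for n in (4, 8, 16):
        print(n, benes_switch_count(n),
              benes_transistor_count(n, "nmos"),
              benes_transistor_count(n, "full_pass_gate"))
```

Since the switch count grows as N*log2(N) rather than N*N, transistor count rises less steeply than in the other two architectures.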


In terms of measured layout area, the full-pass-gate Benes network is roughly twice as

large as its NMOS counterpart. As with the previous architectures, the most useful aspect of

measuring layout area is looking at scaling trends. Figure 5.7 summarizes area information

from table 5.1.

It is interesting to observe that layout area for the 8x8 Benes network is about four times

larger than that of the 4x4 circuit. However, the 16x16 Benes crossbar is less than three

times larger than the 8x8 network. Remembering figures 5.3 and 5.4, note that doubling

crossbar size effectively doubles the number of butterfly switch rows. In contrast, only two

columns are added every time Benes network size doubles.

[Chart: Scaling trend for Benes network. Layout area in square microns versus crossbar size (4x4, 8x8, 16x16); series: full pass gate, NMOS switch.]

Figure 5.7: Scaling trend for Benes network.



Tables 5.2 and 5.3 summarize average delays for the Benes networks for both AMI and

ULP transistors. As explained in chapter 3, using only NMOS switches causes a noticeable

discrepancy between rising-transition delay and falling-transition delay.

Table 5.2: Delay summary for Benes network crossbar using NMOS transistors

as switches (delays in nanoseconds).

AMI ULP


4x4 2.40 3.24 2.82 3.80 4.47 4.14

8x8 4.45 6.71 5.58 8.46 11.37 9.92

16x16 6.23 8.77 7.50 12.72 17.38 15.05

Table 5.3: Delay summary for Benes network crossbar using full pass gates as

switches (delays in nanoseconds).

AMI ULP


4x4 2.07 2.13 2.1 3.1 3.62 3.36

8x8 4.54 4.71 4.63 6.96 7.23 7.1

16x16 6.29 6.75 6.52 9.04 11.17 10.11
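To see how these averages scale, it helps to look at the absolute change in delay for each doubling of crossbar size. The sketch below takes the third column under each process in tables 5.2 and 5.3, assumed to be the average of the other two (which agrees with the tabulated values), and differences consecutive sizes. The names are illustrative.

```python
# Average delays in nanoseconds for the 4x4, 8x8, and 16x16 Benes crossbars.
AVG_DELAY_NS = {
    ("NMOS switch", "AMI"): [2.82, 5.58, 7.50],
    ("NMOS switch", "ULP"): [4.14, 9.92, 15.05],
    ("full pass gate", "AMI"): [2.10, 4.63, 6.52],
    ("full pass gate", "ULP"): [3.36, 7.10, 10.11],
}


def delay_steps(values):
    """Absolute delay increase (ns) for each doubling in crossbar size."""
    return [round(b - a, 2) for a, b in zip(values, values[1:])]


if __name__ == "__main__":
    for (switch, process), delays in AVG_DELAY_NS.items():
        print(f"{switch}, {process}: {delay_steps(delays)} ns per doubling")
```

The two steps in each row are of similar magnitude, which is what a fixed number of added switches per doubling would suggest.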

Figures 5.8 and 5.9 illustrate the behavior of average delay as crossbars become larger.

As observed for the other crossbar architectures, the AMI process is generally faster than the


version using ULP transistors. Again, the circuit using full pass gates is faster than its

NMOS-switch counterpart, but this time the difference is not as dramatic as it was for the

simple mesh. These data seem to indicate that the additional buffers included at the switch

outputs have a greater influence in delay than the switches themselves. It may be

worthwhile to consider a version of the Benes crossbars with fewer output buffers, but such

a design falls outside the scope of this project.

[Chart: Delay for crossbar based on NMOS switches. Delay in nanoseconds versus crossbar size (4x4, 8x8, 16x16); series: AMI, ULP.]

Figure 5.8: Average delay for the pass transistor version

of the Benes crossbar

[Chart: Delay for crossbar based on full pass gates. Delay in nanoseconds versus crossbar size (4x4, 8x8, 16x16); series: AMI, ULP.]

Figure 5.9: Average delay for the full pass gate version

of the Benes crossbar

Although three points do not provide enough information to extrapolate a best fit curve, it

seems that delay roughly doubles for the 8x8 crossbar with respect to the 4x4 crossbar (an

increase of about 100% to 140%), but it only increases by approximately 35% to 50% between

the 8x8 crossbar and the 16x16 crossbar. Such behavior is remarkably similar to that of the

simple mesh crossbar, which seems counterintuitive. I/O paths in the simple mesh always

contained only one switch, regardless of crossbar size. On the other hand, in a Benes

network, the number of switches on any given I/O path increases with crossbar size.

Let us consider absolute delay increase, rather than percentage increase. For instance,

Table 5.2 shows that doubling crossbar size, in the ULP simulations for the NMOS switch

version of the Benes crossbar, results in an average delay increase of roughly five

nanoseconds. This observation holds true whether the shift is from a 4x4 network to an 8x8,

or from an 8x8 to a 16x16 crossbar. For the AMI simulations of the same circuit, the

absolute increase in average delay, for doubling crossbar size, is about two nanoseconds.

The same tendency is evident in the full pass gate circuits, where delay increases by two

nanoseconds for the AMI process, and by three nanoseconds for the ULP simulations. The

fact that absolute delay increases linearly, while crossbar size is doubled, is consistent with

the observation that exactly two butterfly switches are being added to each I/O path.

When analyzing the simple mesh crossbar, it was noted that the fact that delay increased

with increasing crossbar size was primarily due to increased parasitic capacitance. In a

Benes network that is not the case. Since crossbar size affects the number of switches that a

signal must go through, the effective resistance of the path is a function of crossbar size.

Furthermore, the number of diffusion capacitances that must be charged varies for different

crossbar sizes. As a result, variations in RC delay are not determined solely by parasitic

capacitances, as was the case with the simple mesh. In fact, average parasitic capacitances

extracted from the layout are smaller for the Benes crossbar than they were for the simple

mesh. Routing for the simple mesh leads naturally towards long stretches of wires running

parallel to each other. In a Benes network, metal lines must be interleaved to make the

network a fully-connected, non-blocking system. In general, sections of wires running

parallel to each other are shorter in a Benes network than in the simple mesh. Also, to

accommodate the more convoluted patterns, wires are spaced farther apart in the Benes

network than in the simple mesh, further decreasing capacitance. Table 5.4 presents average

capacitance for the Benes crossbar.

Table 5.4: Average parasitic capacitances for Benes crossbar

Crossbar size Average extracted capacitance

4x4 10.83 fF

8x8 15.3 fF

16x16 19.67 fF

For the fabrication processes presented here, wire resistance is negligible compared to the

dynamic resistance of transistors. In fact, fabrication process models developed by the


increasing role of interconnect delay in overall circuit performance, Table 5.5 lists the

longest paths and their respective wire lengths. It is noteworthy that routing for

Benes networks requires signals to traverse several vias, whose resistance might be far

larger than that of wires. Resistance of vias and contacts is neglected entirely by the CAD

tools used for this project. Table 5.5 includes the number of vias on each of the longest

paths. For all Benes crossbars, the longest I/O paths, that go through the most vias, lead

from In0 to Out0. Recall that there are several ways to reach Out0 from In0 in a Benes

crossbar. The worst case path is the one where the signal travels from the top row of

butterfly switches to the bottom row of switches, and then back again to the top.

Table 5.5: Worst case wire lengths

Crossbar size Longest path Length of longest path Number of vias

4x4 In0-Out0 (alt.) 482.1 μm 4

8x8 In0-Out0 (alt.) 912.1 μm 12

16x16 In0-Out0 (alt.) 1612.8 μm 28

5.4 Power consumption:

Tables 5.6 and 5.7 summarize leakage current, average static power consumption, and

Ipeak, for all the circuits being compared.

Table 5.6: Power consumption summary for Benes crossbar using NMOS

transistors as switches. Currents in μA, power in milliwatts.

AMI ULP

Size Ileak Ipeak Static power Ileak Ipeak Static power

4x4 0 971 0 9.9 106.6 0.00495

8x8 0 1,705 0.00191 27.9 221.2 0.01395

16x16 0 14,990 0.00356 71.8 476 0.03590


Table 5.7: Power consumption summary for Benes crossbar using full pass gates

as switches. Currents in μA, power in milliwatts.

AMI ULP

Size Ileak Ipeak Static power Ileak Ipeak Static power

4x4 0 993 0 9.96 106.4 0.00498

8x8 0 1,810 0 14.1 221.3 0.00705

16x16 0 15,370 0 39.7 476.2 0.01985

NMOS implementations have higher leakage than their full pass gate counterparts, and

AMI transistors show virtually no leakage current for either switch implementation. As

explained in chapter 3, static power consumption in NMOS-switch circuits using AMI

transistors is primarily due to degraded voltage levels, which prevent transistors from fully

turning on or off.

Ipeak, on the other hand, remains essentially constant, regardless of the type of switch used

to build the Benes crossbar. This suggests that dynamic power consumption is also

relatively independent of the selected switch. Recalling equation 2.1,

$P = C \cdot f \cdot V_{dd}^{2} \cdot u$ (2.1)

it follows that the only term affected by the choice of pass switch is capacitance. The load

seen by all crossbar outputs was set to 0.1 picofarads, or 100 femtofarads, as described in

section 2.7. Table 5.4 shows that parasitic capacitances for the Benes crossbar are on the

order of twenty femtofarads. These two parameters remain unaffected by the selection of a

switch. Additionally, all switches drive output buffers. All of these output buffers have the

same gate capacitance, independent of whether the butterfly switch uses a pass gate or a pass

transistor. The only other significant capacitance on the datapath is the diffusion

capacitance of the switch transistors. The data for Ipeak in tables 5.6 and 5.7 suggest that

this capacitance is small compared to the other contributions mentioned above. For the

Benes crossbar layout, diffusion capacitances reported by Smartspice ® are always less than

one femtofarad, which is consistent with the Ipeak data reported in tables 5.6 and 5.7.

In agreement with the data from chapter 3, and the nature of ULP, AMI simulations show

much higher values for Ipeak than ULP simulations. Figure 5.10 shows how power

consumption varies with frequency for the 4x4 Benes crossbar built using NMOS transistors

as switches. From the intersection of the AMI and ULP curves, it follows that ULP will

consume less power than AMI as long as inputs toggle at a rate that is faster than 1.2 MHz.

Similar analyses prove that ULP is the more power efficient alternative for all Benes

crossbars operating at more than 2 MHz. Table 5.8 summarizes the minimum frequency at

which ULP becomes viable for the different circuits studied.

[Chart: Total power consumption vs. frequency (4x4 crossbar). Power consumption in milliwatts versus frequency in MHz; series: AMI, ULP.]

Figure 5.10: Power consumption vs. frequency for the 4x4 Benes crossbar

using NMOS pass transistors.

Table 5.8: Minimum frequency for ULP power savings (MHz)

Size NMOS switch Full pass gate

4x4 1.91 1.18

8x8 1.42 0.85

16x16 1.2 0.12

5.5 Other considerations:

Table 5.9 presents architectural efficiency for the Benes crossbar.

Table 5.9: Architectural efficiency for the Benes crossbar

Size   I/O connections   Transistor count   Delay elements per connection


4x4 16 60 3 1 0.222

8x8 64 200 5 1 0.08

16x16 256 560 7 1 0.041

For smaller circuits, it has been shown that Benes crossbars are larger, and more

complex, than the alternatives offered in this study. Yet, among the options studied, Benes

networks have the advantage of scaling less dramatically than the simple mesh or the

multiplexer-based crossbar. This means that, for larger crossbars, Benes networks will take

up less area than other crossbar implementations.

When crossbar size is increased, the number of configuration signals necessary to

program the crossbar also increases. As with area, the number of configuration bits scales at

a lower rate for the Benes network than for other crossbar circuits. This advantage might be

exploited in two ways. One possibility is to say that larger Benes crossbars require fewer

programming pins than other crossbar circuits of the same size. Alternatively, given a

specific programming interface, with a fixed set of I/O pins, larger Benes networks require

fewer data transfers on any given programming pass.
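The scaling argument can be made concrete with the standard size formulas for a Benes network: an NxN network has 2·log2(N) − 1 columns of butterfly switches, with N/2 switches per column. The sketch below compares that switch count with the N² crosspoints of a simple mesh. The function names are illustrative, and the assumption that a simple mesh needs one control bit per crosspoint is mine; the one-bit-per-butterfly-switch figure follows from the configuration matrix described later in this section. For the three sizes in Table 5.9 it gives 3, 5, and 7 stages (matching the delay-element column) and 6, 20, and 56 switches; the transistor counts in the table are ten times those switch counts.

```python
def benes_stages(n):
    """Switch columns in an n x n Benes network; n must be a power of two >= 2."""
    if n < 2 or n & (n - 1):
        raise ValueError("n must be a power of two")
    return 2 * (n.bit_length() - 1) - 1


def benes_switches(n):
    """Number of 2x2 butterfly switches, each with one control bit."""
    return (n // 2) * benes_stages(n)


def mesh_crosspoints(n):
    """Crosspoint switches in an n x n simple mesh (assumed one control bit each)."""
    return n * n


for n in (4, 8, 16, 32, 64):
    print(f"{n:>2}x{n:<2}  stages={benes_stages(n):>2}  "
          f"Benes switches={benes_switches(n):>3}  "
          f"mesh crosspoints={mesh_crosspoints(n):>4}")
```

The Benes count grows as N·log2(N) while the mesh grows as N², which is why the gap widens quickly beyond the sizes simulated here.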

A reduced number of programming signals does come at a price. Benes networks are not

intuitively programmable, like the simple mesh or the multiplexer crossbar. Every time an

I/O connection is established in a Benes crossbar, it affects other possible routing paths

within the crossbar. As a result, finding the appropriate configuration that realizes all

intended I/O connections without blocking is not a trivial problem. Thus, of the three

alternatives being compared, the Benes network is the only one that requires software

support in its programming. This becomes an issue for systems that require real time

reconfiguration. For systems where configuration and execution are done in separate steps,

it is viable to compute the valid configurations ahead of time, store them in some kind of

memory, and then retrieve the configurations as needed. To aid in configuring Benes

crossbars, figure 5.11 presents an algorithm that takes a set of required I/O connections, and

finds a valid configuration that achieves all connections simultaneously.

Figure 5.11: Algorithm for configuring Benes networks.

Configurations for Benes crossbars can be easily visualized by organizing them into a

binary matrix format. Rows and columns in the matrix correspond to rows and columns in

the Benes crossbar. Each entry in the matrix contains the value of the control signal for a

particular butterfly switch, which is identified by its row and column number. Figure 5.12

presents a 4x4 Benes crossbar and its corresponding configuration matrix.

Figure 5.12: Benes crossbar and configuration matrix

Using this matrix notation, it is possible to define any set of connections for a particular

crossbar. For example, the two alternative paths that connect X0 to Y0 may be represented

as:

$$\begin{bmatrix} 0 & 0 & 0 \\ x & x & x \end{bmatrix} \quad \text{and} \quad \begin{bmatrix} 1 & x & 1 \\ x & 0 & x \end{bmatrix}$$

Note that the control signals marked with the symbol “x” do not affect the connection

between X0 and Y0. Given a set of required I/O connections, it is possible to build an

exhaustive list of configuration matrices that achieve each individual connection. In the

case of a 4x4 crossbar, there are two alternative paths for each I/O pair, and thus two

matrices are formed for each required connection. In general, for an NxN Benes network,

there are exactly N/2 alternative ways of establishing each individual connection.
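As a rough illustration of how the alternative paths can be enumerated, the sketch below builds both configuration matrices for any input/output pair of a 4x4 Benes crossbar. Since the wiring of figure 5.12 is not reproduced in the text, the following conventions are assumptions: inputs X0 and X1 feed the switch in row 0 of the first column, X2 and X3 feed row 1; the upper output of every switch leads to row 0 of the next column and the lower output to row 1; a control value of 0 means "straight" and 1 means "crossed". The name `paths_4x4` is hypothetical.

```python
X = None  # don't-care entry, written "x" in the text


def paths_4x4(src, dst):
    """Return the two 2x3 configuration matrices that connect input `src`
    to output `dst`, one for each middle-column switch."""
    in_row, out_row = src // 2, dst // 2
    matrices = []
    for mid in (0, 1):
        m = [[X, X, X], [X, X, X]]
        # First column: steer the input to the output port that leads to `mid`.
        m[in_row][0] = 0 if src % 2 == mid else 1
        # Middle column: arrive on port `in_row`, leave on port `out_row`.
        m[mid][1] = 0 if in_row == out_row else 1
        # Last column: arrive on port `mid`, leave on port `dst % 2`.
        m[out_row][2] = 0 if mid == dst % 2 else 1
        matrices.append(m)
    return matrices


def show(m):
    return "\n".join(" ".join("x" if v is X else str(v) for v in row) for row in m)


for matrix in paths_4x4(0, 0):
    print(show(matrix), end="\n\n")
```

Each path sets exactly one switch per column and leaves the other three entries as don't-cares, which is what makes it possible to merge several paths into one configuration.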

The goal of programming a Benes network is achieving all the specified connections.

Thus, the algorithm in figure 5.11 assumes that a set of I/O pairs has been defined. On a

fully programmed NxN Benes crossbar, exactly N I/O connections are specified.

It is possible to build a lookup table containing all the alternative ways of establishing

each individual connection. Because each of these N I/O connections can be completed in

N/2 alternative ways, the lookup table resembles an NxN/2 matrix, except that each element

in the lookup table is itself a configuration matrix, rather than just a binary value. Each row

in the lookup table identifies a required connection. The columns in the table represent

alternative paths that establish the connection. In other words, each entry in the lookup table

represents a path that, if taken, would provide one of the required connections.

The key to programming a Benes network is finding one configuration matrix that

achieves all required connections simultaneously. This can only be accomplished by

choosing a set of non-conflicting paths for all connections. Two paths are said to be

conflicting if they require any particular butterfly switch to be in both states at the same

time. The algorithm from figure 5.11 essentially traverses the lookup table looking for non-

conflicting paths, until it finds a valid configuration matrix. The algorithm uses two row

pointers and two column pointers. ROW1 and COL1 point to the first path that has been

incorporated into the configuration matrix. ROW2 and COL2 define the latest path being

tested for conflicts with the configuration matrix. The configuration matrix is labeled as

COMPARE in the algorithm.

A match occurs when a new path is found that allows an intended connection without

conflicting with the previously chosen paths. When there is a match, a new configuration

matrix, which incorporates the matching path, must be generated. The new configuration

matrix is the result of intersecting the previous configuration matrix and the new path. Paths

are specified by configuration matrices, whose dimensions are determined by the size of the

Benes crossbar. Intersecting two paths consists of comparing all the corresponding entries

in both configuration matrices. If a particular entry is the same in both matrices, the

resulting matrix keeps the same value for that entry. If one of the two values for a particular

entry is an “x”, but the other one is a one or a zero, the resulting intersection takes the one or

the zero as the value for that entry. If two paths have complementary values on a particular

entry, there is a conflict, and therefore no match has occurred. Equation 5.1 gives an

example of a configuration matrix intersection.

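$$\begin{bmatrix} 0 & 0 & 0 \\ x & x & x \end{bmatrix} \cap \begin{bmatrix} x & x & 0 \\ 1 & 1 & x \end{bmatrix} = \begin{bmatrix} 0 & 0 & 0 \\ 1 & 1 & x \end{bmatrix} \tag{5.1}$$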
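The three intersection rules translate directly into an entry-by-entry merge. The following is a minimal sketch, assuming configuration matrices are stored as nested lists with `None` standing for "x"; the function name `intersect` and the operand values are chosen for illustration.

```python
X = None  # don't-care entry


def intersect(a, b):
    """Entry-wise merge of two configuration matrices.

    Returns the merged matrix, or None if any entry holds complementary values.
    """
    result = []
    for row_a, row_b in zip(a, b):
        row = []
        for p, q in zip(row_a, row_b):
            if p is X:
                row.append(q)        # take whatever the other matrix specifies
            elif q is X or p == q:
                row.append(p)        # same value, or only `a` specifies it
            else:
                return None          # one matrix needs 0, the other needs 1
        result.append(row)
    return result


a = [[0, 0, 0], [X, X, X]]
b = [[X, X, 0], [1, 1, X]]
c = [[1, X, 1], [X, 0, X]]

print(intersect(a, b))  # merged matrix: [[0, 0, 0], [1, 1, None]]
print(intersect(a, c))  # None, because the first entry is 0 in a and 1 in c
```

Returning `None` on a conflict, rather than raising, keeps the search loop simple: a failed match is an expected outcome, not an error.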

Every time a match is found, the current COMPARE matrix, and all the pointers are

pushed onto a stack, before computing the new COMPARE matrix. If COL2 ever reaches

the end of a row (COL2=N/2), it means that a required connection has not been

accomplished. Since all the paths for the latest connection have been tested unsuccessfully,

the only way to find a valid configuration is to backtrack. The algorithm pops old values of

COMPARE and the pointers, increments COL2, and starts looking for new non-conflicting

paths. MCOUNT keeps track of how many intersections have been performed to reach the

current COMPARE matrix. If MCOUNT falls back to zero, it means that the initial path

chosen is creating the conflicts. In that case, COL1 is incremented, so that an alternative

way of achieving the first required connection can be explored. Once ROW2 reaches the

end of the lookup table (ROW2=N), all required connections have been successfully

performed. The resulting configuration matrix is stored in COMPARE.
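The search described above can be restated compactly as a recursive depth-first search, in which the call stack plays the role of the explicit stack of COMPARE matrices and pointers. This is a simplified sketch of the same idea, not a transcription of the flowchart in figure 5.11. Configuration matrices are flattened row by row into strings for brevity, and the lookup table is written out by hand for one 4x4 example under an assumed wiring (upper switch outputs lead to row 0 of the next column, lower outputs to row 1; 0 means straight, 1 means crossed). The names `intersect`, `configure`, and `lookup` are illustrative.

```python
def intersect(a, b):
    """Merge two flattened configuration matrices; return None on a conflict."""
    merged = []
    for p, q in zip(a, b):
        if p == "x":
            merged.append(q)
        elif q == "x" or p == q:
            merged.append(p)
        else:
            return None  # complementary values: the paths conflict
    return "".join(merged)


def search(table, row, compare):
    """Pick one non-conflicting path per row, backtracking when a row fails."""
    if row == len(table):
        return compare  # every required connection has been merged
    for path in table[row]:
        merged = intersect(compare, path)
        if merged is not None:
            result = search(table, row + 1, merged)
            if result is not None:
                return result
    return None  # all alternatives for this row conflict: backtrack


def configure(table):
    """Return a flattened configuration matrix, or None if none exists."""
    return search(table, 0, "x" * len(table[0][0]))


# One row per required connection, one column per alternative path.
lookup = [
    ["01xxx0", "1xxx11"],  # X0 -> Y2
    ["100xxx", "0x1x0x"],  # X1 -> Y0
    ["x0x0x1", "xxx100"],  # X2 -> Y3
    ["x111xx", "xx001x"],  # X3 -> Y1
]

print(configure(lookup))  # 011100: top row 0 1 1, bottom row 1 0 0
```

In this example the first alternative for X1 to Y0 conflicts with the path already chosen for X0 to Y2, so the search falls through to the second alternative before moving on to the next row.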

Although this programming procedure is far more involved than configuring a simple

mesh or a multiplexer-based crossbar, it is actually a rather simple software problem. The

real drawback is not the algorithm’s complexity, but its execution time. Larger lookup

tables take up more memory, and it takes longer to match all the different entries. The

viability of Benes networks as VLSI crossbars rests largely on whether the proposed

application can accommodate the various consequences of this additional programming step,

which is not necessary with other crossbar architectures.
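To give a feel for how memory and search effort grow, the sketch below counts the entries stored in the lookup table and the number of path combinations an exhaustive search would have to consider in the worst case. It assumes the table layout described above (N rows, N/2 alternative paths per row, one matrix entry per butterfly switch) and the standard Benes switch count; the function name `lookup_table_cost` is hypothetical, and the combination count is only an upper bound, since a backtracking search abandons most combinations early.

```python
def lookup_table_cost(n):
    """Return (stored_entries, worst_case_combinations) for an n x n Benes network."""
    stages = 2 * (n.bit_length() - 1) - 1   # switch columns, n a power of two
    switches = (n // 2) * stages            # entries per configuration matrix
    paths = n // 2                          # alternative paths per connection
    stored_entries = n * paths * switches   # n rows x n/2 columns of matrices
    worst_case = paths ** n                 # one path choice per connection
    return stored_entries, worst_case


for n in (4, 8, 16):
    entries, combos = lookup_table_cost(n)
    print(f"{n}x{n}: {entries} stored entries, up to {combos} combinations")
```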

CHAPTER 6: ARCHITECTURE COMPARISON

6.1 Area and scalability

Chapter 3 has pointed out several shortcomings of pass transistor implementations, in

terms of their robustness and reliability. Thus, comparisons across different architectures

will focus on circuits that use full pass gates as switches.

Figure 6.1 plots transistor count for the three crossbar architectures being analyzed.

Figure 6.2 provides a similar comparison of crossbar layout area across architectures. For

all the crossbars studied, the simple mesh architecture resulted in the smallest layout area.

Yet, two observations suggest that for larger crossbars the Benes network might be the more

efficient alternative in terms of real estate. First of all, when full pass gates are used as

switches, transistor count for the Benes network is lower than that for the simple mesh.

Moreover, layout area grows more rapidly with crossbar size for the simple mesh than for

the other two approaches. In fact, the rate of increase in layout area with crossbar size is

smallest for the Benes network.

[Plot: x-axis: crossbar size (4x4, 8x8, 16x16); y-axis: transistor count, 0 to 6000; series: simple mesh, MUX based, Benes.]

Figure 6.1: Transistor count vs. Crossbar size

[Plot: x-axis: crossbar size (4x4, 8x8, 16x16); y-axis: layout area in square microns, 0 to 300000; series: simple mesh, MUX based, Benes.]

Figure 6.2: Layout area vs. Crossbar size

It is also interesting to note that although the difference in transistor count suggests that

the multiplexer-based crossbar should be much larger than the alternative architectures,

actual layout area shows that is not the case. Layout for the multiplexer-based crossbar is

being generated by a place and route tool. Such CAD tools are able to achieve much higher

logic densities than many human layout designers. Crossbar generators can produce layout

that is very area efficient for the simple mesh architecture, but such tools are expensive, and

currently unavailable at the University of Idaho.

6.2 Throughput

Figures 6.3 and 6.4 summarize delay measurements observed in this study. As expected,

ULP transistors are slower than their AMI counterparts. More importantly, results show that

the simple mesh crossbars are faster than the alternative architectures. This is also

consistent with theoretical expectations, considering that the only delay element in the

simple mesh datapath is a single pass gate. The comparison between the multiplexer-based

crossbar and the Benes network is less predictable, as it depends on the relative delay

between the multiplexers and logic gates that make up the synthesized circuit, in relation to

the delay of the pass gates and buffers that make up the Benes crossbar.

From the delay data obtained using AMI transistor models, Benes crossbars seem about

two times slower than the corresponding synthesized crossbars. However, for the same

simulations and netlists, the data gathered using ULP transistor models indicate that the

difference in delay is much smaller. In the case of the 8x8 crossbar, average delay is smaller

for the Benes network than for the multiplexer-based crossbar, when only ULP data are

considered.

Recall that transistors were sized for equal rise and fall times considering AMI models.

When the AMI models are substituted with ULP models without resizing the transistors, rise

and fall times may differ significantly. Buffers for the Benes crossbars were built using two

cascaded inverters. Just like the rest of the circuit, these inverters were sized for equal rise

and fall times according to AMI data.

Delay is measured at the point where voltages are halfway through their full swing. Note

that, because of the buffers, the Benes network effectively consists of a long chain of

inverters from input to output. Due to the mismatches in rise and fall times for all of these

inverters, measuring delay at the halfway point sometimes yielded negative delays for rising

transitions. The net result of including these negative values into the average is a smaller

figure for average delay.
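A small numerical example shows how a half-swing measurement can report a negative delay. The sketch below finds the 50% crossing of an input and an output waveform by linear interpolation and subtracts the two times. The waveforms are contrived for illustration (a slow input ramp and an output that switches sharply while the input is still below half swing); they are not taken from the simulations, and the function names are hypothetical.

```python
def crossing_time(times, volts, level):
    """Time of the first crossing of `level`, using linear interpolation between samples."""
    for i in range(len(times) - 1):
        v0, v1 = volts[i], volts[i + 1]
        if v0 != v1 and min(v0, v1) <= level <= max(v0, v1):
            return times[i] + (level - v0) * (times[i + 1] - times[i]) / (v1 - v0)
    return None


def delay_at_half_swing(times, v_in, v_out, vdd):
    """Output 50% crossing time minus input 50% crossing time."""
    return crossing_time(times, v_out, vdd / 2) - crossing_time(times, v_in, vdd / 2)


# Hypothetical waveforms (time in ns, voltage normalized to vdd = 1.0).
times = list(range(11))
v_in = [t / 10 for t in times]                   # slow ramp, reaches 50% at 5 ns
v_out = [0.0 if t <= 2 else 1.0 for t in times]  # sharp edge between 2 ns and 3 ns

delay = delay_at_half_swing(times, v_in, v_out, vdd=1.0)
print(f"measured delay: {delay} ns")
```

The output crosses half swing at 2.5 ns and the input at 5 ns, so the measured delay is negative even though the output transition is caused by the input.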

Note that this behavior does not mean that Benes networks are “faster” for ULP

processes. Simply, delay is measured at an arbitrary point in the signal, and the specific

characteristics of a certain waveform may produce misleading results. In summary, the

average delays reported for the Benes network using ULP transistors should be considered

inaccurate. They are only presented here for completeness. Section 6.8 discusses

alternatives for obtaining more useful data for comparison. Overall, Benes networks are

slower than the other two crossbar architectures.

[Plot: signal delay vs. crossbar size (AMI transistors). x-axis: crossbar size (4x4, 8x8, 16x16); y-axis: delay in nanoseconds, 0 to 7.]

Figure 6.3: Signal delay for crossbars using AMI transistor models.

[Plot: signal delay vs. crossbar size (ULP transistors). x-axis: crossbar size (4x4, 8x8, 16x16); y-axis: delay in nanoseconds, 0 to 12.]

Figure 6.4: Signal delay for crossbars using ULP transistor models.

Dynamic simulation results indicate that Benes crossbars are much slower than other

architectures, as long as gate delay dominates over interconnect delay. To understand what

happens in deep submicron technologies, where interconnect delay should not be neglected,

figure 6.5 summarizes worst case wire length for the different crossbars.

[Plot: worst case wire length. x-axis: crossbar size (4x4, 8x8, 16x16); y-axis: wire length in microns, 0 to 1800.]

Figure 6.5: Worst case wire length.

Again, the Benes network proves to be the slowest architecture. Worst case wire lengths

are far greater for Benes networks than for other architectures. Furthermore, the convoluted

routing of Benes crossbars forces signals to travel through more vias than the simple mesh,

or the multiplexer-based crossbar. As a result, Benes networks will remain the slowest

crossbar alternative, even in deep submicron technologies. Notice however what happens

for the multiplexer-based crossbar. Because this is, essentially, a combinational circuit that

adds one level of logic every time crossbar size doubles, the physical distance between

inputs and outputs only increases as much as is necessary to accommodate the additional

level of logic. As the paradigm shifts from gate dominated delay to interconnect dominated

delay, the combinational approach to crossbars might yield throughputs comparable to those

of the simple mesh.

6.3 Power consumption

One of the main issues driving the interest in Benes crossbars for VLSI is their limited

static power consumption. There is concern that highly parallel structures, such as simple

mesh crossbars, might yield unacceptable static power consumption, due to high

leakage currents. Using full pass gates, rather than NMOS switches, essentially eliminates

all static power consumption from AMI crossbars. Figure 6.6 illustrates static power

consumption for ULP crossbars using full pass gates.

[Plot: x-axis: crossbar size (4x4, 8x8, 16x16); y-axis: power consumption in milliwatts, 0 to 0.2; series: simple mesh, MUX based, Benes.]

Figure 6.6: Static power consumption for crossbars using ULP transistor models.

For the crossbar circuits simulated, as long as all inputs toggled at an average rate of at

least 3 MHz, ULP simulations yielded power savings, when compared with AMI

simulations. Thus, static power consumption and leakage currents should only be cause for

concern where the average toggling rate of the inputs is less than 3 MHz.

6.4 Architectural efficiency and programming complexity

Architectural efficiency attempts to quantify the relationship between the flexibility

offered by a particular crossbar architecture, and the amount of hardware needed to achieve

that flexibility. A higher architectural efficiency means that a given architecture provides

more alternative connections, using fewer switches (or multiplexers), than another

architecture with a lower efficiency. Architectural efficiency is a very useful figure of merit

in comparing routing schemes for computer networks. In the case of

VLSI crossbars, it is important to remember that architectural efficiency does not account

for all the electrical considerations discussed in the previous sections of this chapter. Figure

6.7 graphs architectural efficiency for the circuit alternatives considered in this study.

[Plot: architectural efficiency of crossbar alternatives. x-axis: crossbar size (4x4, 8x8, 16x16); y-axis: architectural efficiency, 0 to 0.25.]

Figure 6.7: Architectural efficiency for different crossbars.

Notice that architectural efficiency decreases with increasing crossbar size. This trend is

consistent with the observation that, although small crossbars are attractive circuits for their

simplicity and versatility, adding crossbar inputs often results in very large circuits, which

require programming too many switches. The Benes network has the highest architectural

efficiency precisely because it uses the fewest total number of switches to construct a fully-

connected, non-blocking crossbar. Furthermore, given a certain crossbar configuration, the

Benes network uses all switches to establish the required connections. In contrast, when

considering a simple mesh crossbar, for every established connection in an NxN crossbar,

there are N-1 switches that must remain turned off to avoid driving conflicts. Because

multiplexers are not bidirectional, on the synthesized crossbar half the circuit is turned off at

any given time. Moreover, the multiplexer-based crossbar presented here has a global

direction signal, so it is not truly bidirectional, further decreasing its architectural efficiency.

Even though the Benes network is the more efficient architecture, its advantages are better

suited for computer networks than for hardware crossbars. For example, architectural

efficiency does not take into account the fact that programming a Benes network for a

specific set of connections is not a trivial task. The additional complexity of programming a

Benes network might not be a serious burden for a computer network. In such systems, there

is a versatile processor that can handle the routing algorithm, either on a server or on the

workstations themselves. In the case of embedded systems or ASICs, where computing

power is optimized for a particular application, accommodating a routing algorithm for the

crossbars might be unacceptable, or even infeasible. Moreover, in a computer network, the

hardware cost of idle switches, such as those found in the simple mesh, might make the

Benes network a more attractive alternative. Larger systems may be inclined to use Benes

networks because these scale less rapidly than the other crossbars, but the price to be paid is

lower throughput and increased programming complexity.

6.5 Application scenarios

Performance of a particular circuit is closely linked to its application. Likewise, the

choice of a certain circuit design is often dependent on the environment the circuit must

withstand, e.g. a radiation intensive environment, extreme temperatures, etc. Rather than

reaching a general conclusion about which crossbar architecture is better, this study looks

for the strengths and weaknesses of each architecture, and then outlines a context where

each crossbar design might be useful.

Overall, the simple mesh offers a straightforward, versatile, and fast architecture to build

a crossbar. For ULP transistors, simulation data have shown that static power consumption

is not a major concern for crossbars of size 16x16, or smaller, as long as inputs toggle at an

average rate of more than 3 MHz. Thus, the only major disadvantage of the simple mesh is

its scaling trend. For most crossbars larger than the ones examined in this project, the

simple mesh is likely to be the architecture that takes up the largest silicon area.

Additionally, the number of independent programming signals required for the simple mesh

grows at the same quadratic rate as layout area. In summary, the simple mesh is the ideal

architecture for small crossbars. For systems looking for power-consumption efficiency,

such as battery operated systems, whose most extreme subset includes space-borne

applications, it is important to estimate the activity rate of the crossbar signals. If signals

show low activity rates, the advantages of a ULP fabrication process might be

overshadowed by static power consumption. If that happens, it may be better to switch to a

less expensive fabrication process, or choose a different crossbar architecture. The simple

mesh is also the most robust and reliable of crossbar architectures. Because it has fewer

transistors, it is less likely that a defect in the silicon will affect circuit operation. By the

same token, it is less probable that a radiation particle will hit a critical transistor in an

architecture with such a low transistor density.

The Benes network can be the most energy efficient crossbar architecture, in applications

where signals have a low activity rate. Yet, if a crossbar has a low activity rate, it is likely

that its contribution to system power consumption is small, compared to the rest of the

circuit. Thus, Benes networks would only be recommended for very large crossbars, where

they would yield the smallest layout area. However, the price to be paid for the smaller

circuit is a significant increase in signal delay, which adversely affects throughput. In any

case, Benes networks are only viable in systems that can efficiently deal with the issue of

programming the Benes crossbar. In general, Benes architectures are best suited for

computer networks. In VLSI systems, Benes crossbars may find a niche in applications

requiring extremely large interconnection networks, but even then, such crossbars might

prove to be too slow for the intended application. Furthermore, the need to use more vias in

the routing of Benes networks is a strong point against the architecture’s reliability.

Electromigration, stress voiding, and even mechanical breaking of a contact are more likely

to occur in vias than in metal layers. The skin effect on current flow effectively increases

current density across contacts and vias. Adding more contacts alleviates this problem, but

it also leads to larger circuits, diminishing the advantage of using Benes networks for larger

crossbars. Finally, the multiplexer-based crossbar is, electrically, the least efficient of all

crossbar architectures. Circuits are large in area, and also have much higher transistor

counts.

Because these circuits are logically denser than other architectures, they will tend to be less

reliable than the other architectures. They are also the least energy efficient. For

current fabrication technologies, the multiplexer-based crossbar is slower than the simple

mesh. However, it should also be pointed out that the synthesized crossbar also yielded the

shortest wire lengths for the datapath. As technology shifts to deep submicron, with a

dominance of interconnect delay over gate delay, a combinational crossbar could become

faster than a simple mesh. Aside from this speed conjecture, the major advantage of a

synthesized crossbar is that it has the lowest design time. Although a simple mesh is

extremely simple, it is still necessary to lay it out, or else use expensive crossbar generators.

Standard synthesizers and place and route tools generate multiplexer-based crossbars at a

fraction of the cost, using a very simple HDL description as their starting point. For this

reason, synthesized crossbars are the preferred alternative for proof of concept in a larger

system.

6.6 Conclusions and future work

• Overall, the simple mesh offers the most straightforward, versatile, and fastest

architecture to build a crossbar for VLSI systems. Its only drawback is the fact that the

circuit grows proportionally to N², for NxN crossbars.

• In general, Benes architectures are best suited for computer networks, where the issue of

network reconfiguration can be dealt with more efficiently in software.

• Combinational circuits that serve as crossbars might become the fastest alternative

for deep submicron technologies. For now, these synthesized crossbars belong in proof-

of-concept prototypes, where the focus is on functionally correct circuits with very short

design times.

• Static power consumption, for crossbars built with the ULP process, is only an issue for

circuits operating at frequencies below 10 MHz.

• Wire resistance models posed a severe limitation for this study. A more accurate study

can be developed if simulations include the effects of wire resistance. The effect of vias

should also be quantified as part of such a follow-up study.

• The effect of varying transistor sizes was not explored in this project. Although the

general conclusions provided here are probably still valid across different transistor

sizes, a more thorough comparison would include such variations.

• For a more formal comparison of crossbar architectures it would be important to

establish a set of benchmark tests, to be performed in simulation or on actual circuitry.

Although all crossbars in this study went through the exact same tests, there is no

guarantee that the chosen input stimuli are representative of real life crossbar

applications.

• Two variables that might affect crossbar behavior but were not explored thoroughly in

this project are temperature and operating frequency. To provide accurate quantitative

results, crossbars should be tested under a variety of conditions for these two variables.

• It would be interesting to reevaluate an alternative design of the Benes crossbar, using

fewer buffers between butterfly switches.

• A true reliability study can only be performed based on actual circuits, through

burnout experiments. Reliability observations in this report are merely hypotheses that

stem from theoretical notions and previous experience with circuits other than crossbars.

• A quantitative study requires a more thorough statistical analysis of all data. However,

the length and scope of such a study only makes sense for actual, physical circuits.

Because the simulations examined in this project are crude approximations of reality, a

complete statistical analysis would not yield quantitatively accurate

information. Thus, this study focuses on qualitative assessments, supported by

theoretical background and simulation results.

• For a complete study on crossbar architectures and their applications, it would be

necessary to fabricate these circuits, and perform a statistical analysis over many copies

of every particular chip. Only then can quantitative data be considered reliable.
