Download pdf - THE EFFECT OF LOGIC BLOCK GRANULARITY DEEP-SUBMICRON … · The Effect of Logic Block Granularity on Deep-Submicron FPGA Performance and Density Elias Ahmed Master of Applied Science

THE EFFECT OF LOGIC BLOCK GRANULARITY ON DEEP-SUBMICRON FPGA PERFORMANCE AND DENSITY

Elias Ahmed

A thesis submitted in confonnity with the requirements for the degree of Master of Applied Science

Graduate Department of Electrical and Cornputer Engineering University of Toronto

Copyright @ 2001 by Elias Ahmed

National Library 1*1 of Canada Bibliothèque nationaie du Canada

Acquisitions and Acquisitions et Bibliographie Services services bibliographiques

395 Wellington Street 395. rue Wellington Ottawa ON K1 A O N 4 Ottawa ON K I A O N 4 Canada Canada

The author has granted a non- exclusive licence allowing the National Lïbrary of Canada to reproduce, loan, distribute or sell copies of this thesis in microform, paper or eleclronic formats.

The author retains ownership of the copyright in this thesis. Neither the thesis nor substantial extracts fiom it may be printed or otherwise reproduced without the author's permission.

L'auteur a accordé une licence non exclusive permettant à la Bibliothèque nationale du Canada de reproduire, prêter, distribuer ou vendre des copies de cette thése sous la forme de microfiche/nlm, de reproduction sur papier ou sur format électronique.

L'auteur conserve la propriété du droit d'auteur qui protège cette thèse. Ni la thèse ni des extraits substantiels de celle-ci ne doivent être imprimés ou autrement reproduits sans son autorisation.

Abstract

The Effect of Logic Block Granularity on Deep-Submicron FPGA Performance and Density

Elias Ahmed

Master of Applied Science

Graduate Department of Electricd and Computer Engineering

University of Toronto

200 1

The architecture of an FPGA has a si,dficant effect on area and delay. In deep-submicron

designs, the interconnect resistance and capacitance accounts for the majority of the circuit

delay. In the first part of this thesis, wc perforrn a detailed study of the FPGA logic block

architecture to detennine the impact of logic block functiondity on performance and density.

In particular, in the context of lookup table (LUT), cluster-based island style FPGAs w e look

at the effect of LUT size and cluster size (number of LUTs per cluster) on the speed and logic

density of an FPGA. The second part of this thesis explores the area and delay properties of a

hardwired logic block architecture. This involves a new packing aigorithm.

iii

Acknowledgements

1 would like to thank my supervisor Professor Jonathan Rose for his technical guidance and

moral support. Our weekly discussions were always interesting and very educational.

1 would also like to thank Vaughn Betz and Alexander Marquardt for ail their help with

VPR, the CAD flow and SPLCE modeling. Thanks to Steve WiIton for his advice conceming

the 0.18 pn SPICE FPGA routing models.

I'd also like to thank Guy Lemieux and al1 the students in Jonathan's research group, Rob,

Andy, Ajay, Paul and William. Thanks to Vincent, Warren, Ted, Scott, Marcus, Brent, Jorge,

Humberto, Kostas, Derek and dl the rest of the students in the Computer and Electronics Group

for making this a wonderful research environment.

Last but not least, I'm grateful to my family for their support and encouragement throughout

the years and especially to my late father, Noor, for always believing in me.

Contents

1 Introduction 1

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.1 Motivation 1

. . . . . . . . . . . . . . . . . . . . . . . . . 1.2 FPGA Logic Block Architecture 2

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.3 Hardwired Logic Blocks 4

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.4 Thesis Organization 5

2 Background 7

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.1 CADFlow 7

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.1.1 AreaModel 10

. . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.2 FPGA Packing Algorithms I l

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.2.1 RASP 11

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.2.2 WACK 12

. . . . . . . . . . . . . . . . . . . 2.2.3 Timing-Driven Packing (T-WACK) 14

. . . . . . . . . . . . . . . . . . . . . . . . . 2.3 FPGA Logic Block Architecture 18

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.3.1 LUTSize 18

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.3.2 CIuster Size 20

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.4 Surnmary 20

3 FPGA Logic Block Architecture 21

. . . . . . . . . . . . . . . . . . . . . . . . . . . 3.1 FPGA Architecture Modeling 22

. . . . . . . . . . . . . . . . . 3.1.1 Logic Circuit Design and Delay Mode1 22

. . . . . . . . . . . . . . . . . * . . . . . . . 3.1.2 Routing Architecture ,. 24

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.2 Expenmental Results 25

. . . . . . . . . . . 3.2.1 Cluster Inputs Required vs . LUT and Cluster Size 25

. . . . . . . . . . . . . . . . . . . . . . 3.2.2 AreaasaFunctionofNandK 27

. . . . . . . . . . . . . . . . . . 3.2.3 PerformanceasaFunctionofNandK 33

. . . . . . . . . . . . . . . . . . . 3.2.4 Area-Delay Product ... . . . . . 42

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.2.5 Summary 42

4 Hardwired Logic Blocks 45

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.1 Hardwired Architecture 46

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.1.1 Cluster Inputs (1) 46

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.1.2 Tapping Buffers 47

4.1.3 Logical Equivalence of Cluster Outputs . . . . . . . . . . . . . . . . . 49

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.2 HLB Packing 50

4.2.1 HLB Packing Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 52

4.2.2 HLB Packing with Tapping Buffers . . . . . . . . . . . . . . . . . . . 54

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.3 Experimental Results 55

4.3.1 Area Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.3.2 Delay Results 59

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.4 Area-Delay Results 63

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.5 Surnmary 64

5 Conclusions and Future Work 65

5.1 Summary and Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . 65

5.2 Future Work . . - . . . . . . . . . . . . . - . - . . - . . . - . . . . - . . - - . 66

A Total Area 67

B Intra-Cluster (Logic) Area 73

C Inter-Cluster (Routing) Area

D FPGA Channel Width

E Total Critical Path Delay

F Intra-Cluster (Logic) Delay

G Inter-Cluster (Routing) Delay

F3 Number of BLE Levels on Critical Path

1 Number of Cluster Levels on Critical Path

Bibliography 115

vii

List of Tables

. . . . . 3.1 Logic Cluster Delays for 4-input LUT Using 0.18 pm CMOS process 23

. . . . . . . . . . . . . . . . . . . . 3.2 LUT Delays Using O 18 p CMOS process 24

. . . . . . . . . . . . . . . . . . . . . 3.3 MCNC Benchmark Circuit Descriptions 26

3.4 Channel Width vs . LUT and Cluster Size (1 to 5) . . . . . . . . . . . . . . . . 35

. . . . . . . . . . . . . . . . . 3.5 Channel Width vs LUT and Cluster Size (6 to 10) 36

3.6 Critical Path Delay Cornparison for K=4 . . . . . . . . . . . . . . . . . . . . . 38

. . . . . . . . . . . . . 3.7 Summary of Best Area, Delay. and Area-Delay Results 44

4.1 Percentage of Logic Block Area that is Occupied by Output Routing Crossbar . 50 C

. . . . . . . . . . . 4.2 HLB Cluster Utilization (with and without tapping buffers) 55

. . . . . . . . . . . . . . 4.3 Number of Clusters with and without Tapping Buffers 55

4.4 Comparison of number of 4-LUT to 7-LUT blocks after technology mapping . 60

4.5 Area-Delay Product Comparison Between Cascaded 4-LUTs and Non-hardwired

Architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63

. . . . . . . . . . A . 1 Total Area (x lo6) in Min Width Tram Area (Cluster Size = 1) 67

A.2 Total Area (x 1o6) in Min . Width Trans . Area (Cluster Size = 2) . . . . . . . . 68

. . . . . . . . . . A.3 Total Area (x la6) in Min Width Tram Area (Cluster Size = 3) 68

. . . . . . . . . . A.4 Total Area (x 1o6) in Min Width Tram Area (Cluster Size = 4) 69

. . . . . . . . . . A S Total Area ( x 1o6) in Min Width Trans Area (Cluster Size = 5) 69

. . . . . . . . . . A.6 Total Area (x lo6) in Min Width Trans Area (Cluster Size = 6) 70

. . . . . . . . . * A.7 Total Area ( x 106) in Min Width Tram Area (Cluster Size = 7) 70

. . . . . . . * . . A.8 Total Area (x lo6) in Min Width Trans Area (Cluster Size = 8) 71

. . . . . . . . . . A.9 Total Area (x lo6) in Min Width Trans Area (Cluster Size = 9) 71

. . . . . . . . . . A . 10 Total Area (x 106) in Min Width Trans Area (Cluster Size = 10) 72

. . . . B . 1 Intra-Cluster Area ( x 106) in Min . Width Tram . Area (Cluster Size = 1) 73

B.2 Intra-Cluster Area (x 1o6) in Min . Width Trans . Area (Cluster Size = 2) . . . . 74

B.3 Intra-Cluster Area ( x 1o6) in Min . Width Trans . Area (Cluster Size = 3) . . . . 74

. . . . B.4 Intra-Cluster Area ( x 106) in Min . Width Trans . Area (Cluster Size = 4) 75

B.5 Intra-Cluster Area ( x 106) in Min . Width Trans . Area (Cluster Size = 5) . 75

. . . . B.6 Intra-Cluster Area ( x 106) in Min . Width Trans . Area (Cluster Size = 6) 76

B.7 Intra-Cluster Area ( x 106) in Min . Width Trans . Area (Cluster Size = 7) . . . . 76

B.8 Intra-Cluster Area ( x 106) in Min . Width Tram . Area (Cluster Size = 8) . . . 77

B.9 Intra-Cluster Area ( x 106) in Min . Width Trans . Area (Cluster Size = 9) . 77

B . 10 Intra-Cluster Area ( x 106) in Min . Width Trans . Area (Cluster Size = 10) . . . 78

. . . . . . Inter-Cluster Area ( x 106) in Min Width Tram Area (Cluster Size = 1) 79

. . . . . . Inter-Cluster Area ( x 106) in Min Width Trans Area (Cluster Size = 2) 80

. . . . Inter-Cluster Area ( x 106) in Min . Width Trans . Area (Cluster Size = 3) 80

. . . . Inter-Cluster Area ( x 106) in Min . Width Trans . Area (Cluster Size = 4) 81

Inter-Cluster Area ( x 106) in Min . Width Trans . Area (Cluster Size = 5) . . . . 81

Inter-Cluster Area ( x lo6) in Min . Width Tram . Area (Cluster Size = 6) . . . . 82

Inter-Cluster Area ( x 1o6) in Min . Width Trans . Area (Cluster Size = 7) . . . . 82

Inter-Cluster Area (x 106) in Min . Width Trans . Area (Cluster Size = 8) . . . . 83

. . . . C.9 Inter-Cluster Area (x lo6) in Min . Width Trans . Area (Cluster Size = 9) 83

. . . . C . 10 Inter-Cluster Area (x 1o6) in Min . Width Trans . Area (Cluster Size = 10) 84

. . . . . . . . . . . . . . . . . . . . . . . . . D. 1 Channel Width (Cluster Size = 1) 85

. . . . . . . . . . . . . . . . . . . . . . . . . D.2 Channel Width (Cluster Size = 2) 86



. . . . . . . . . . . . . . . . . . . . . . . . . D S Channel Width (Cluster Size = 5) 87





. . . . . . . . . . . . . . . . . . . . . . . . D . 10 Channel Width (Cluster Size = 10) 90

. . . . . . . . . . . . . . . . . E . 1 Total Delay in nano-seconds (Cluster Size = 1) 91

. . . . . . . . . . . . . . . . . E.2 Total Delay in nano-seconds (Cluster Size = 2) 92

. . . . . . . . . . . . . . . . . E-3 Total Delay in nano-seconds (CIuster Size = 3) 92

. . . . . . . . . . . . . . . . . EA Total Delay in nano-seconds (Cluster Size = 4) 93

. . . . . . . . . . . . . . . . . E S Total Delay in nano-seconds (Cluster Size = 5) 93

. . . . . . . . . . . . . . . . . E.6 Total Delay in nano-seconds (Cluster Size = 6 ) 94




. . . . . . . . . . . . . . . . . E . 10 Total Delay in nano-seconds (Cluster Size = 10) 96

. . . . . . . . . . . . . . F 1 Intra-Cluster Delay in nano-seconds (Cluster Size = I ) 97

. . . . . . . . . . . . . F.2 Intra-Cluster Delay in nano-seconds (Cluster Size = 2) 98



. . . . . . . . . . . . . ES Intra-Cluster Delay in nano-seconds (Cluster Size = 5) 99

F.6 Intra-Cluster Delay in nano-seconds (Cluster Size = 6) . . . . . . . . . . . . . 100

. . . . . . . . . . . . . E7 Intra-Cluster Delay in nano-seconds (Cluster Size = 7) LOO

. . . . . . . . . . . . . E8 Intra-Cluster Delay in nano-seconds (Cluster Size = 8) 101


. . . . . . . . . . . . . F . 10 Intra-Cluster Delay in nano-seconds (Cluster Size = 10) 102

G . 1 Inter-Cluster Delay in nano-seconds (Cluster Size = 1) . . . . . . . . . . . . . 103

. . . . . . . . . . . . . G.2 Inter-Cluster Delay in nano-seconds (Cluster Size = 2) 104


G.4 Inter-Cluster Delay in nano-seconds (Cluster Size = 4) . . . . . . . . . . . . . 105

G.5 Inter-Cluster Delay in nano-seconds (Cluster Size = 5 ) . . . . . . . . . . . . . 105

G-6 Inter-Cluster Delay in nano-seconds (Cluster Size = 6) . . . . . . . . . . . . . 106



G.9 Inter-Cluster Delay in nano-seconds (Cluster Size = 9) . . . . . . . . . .. . . 107

G . 1 O Inter-Cluster Delay in nano-seconds (Cluster Size = 10) . . . . . . . . . . . . . 108

. . . . . . . . . . . . . . . H . 1 Number of BLEs on Critical Path (Cluster Size = 1) 109

. . . . . . . . . . . . . . . H.2 Number of BLEs on Critical Path (Cluster Size = 2) 110

H.3 Number of BLEs on Critical Path (Cluster Size = 3) . . . . . . . . . . . . . . . 110



H.6 Number of BLEs on Criticai Path (Cluster Size = 6) . . . . . . . . . . . . . . . 112


H.8 Number of BLEs on Critical Patn (Cluster Size = 8) . . . . . . . . . . . . . . . 113


H.10 Number of BLEs on Critical Path (Cluster Size = 10) . . . . . . . . . . . . . . 114

Number of Clusters on Critical Path (Cluster Size = 1) . . . . . . . . . . . . . 1 15

Number of CIusters on Critical Path (Cluster Size = 2) . . . . . . . . . . . . . 116

Number of Clusters on Critical Path (Cluster Size = 3) . . . . . . . . . . . . . 116

. . . . . . . . . . . . . Number of Clusters on Critical Path (Cluster Size = 4) 117

Number of Clusters on Critical Path (Cluster Size = 5 ) . . . . . . . . . . . . . 117


Number of CIusters on Critical Path (Cluster Size = 7) . . . . . . . . . . . . . 118




xiii

xiv

List of Figures

1.1 Island-Style FPGA @31RM99] . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

. . . . . . . . . . . . . . . . . . . . 1.2 FPGA Cluster-style Logic Block Contents 3

. . . . . . . . . . . . . . . . . . . . . . 1.3 Examples of Hardwired Logic Blocks 5

. . . . . . . . . . . . . . . . . . . . . . . . 2.1 Grouping of LUTs and Flip-Flops 8

. . . . . . . . . . . . . . . . . . . . . . . . . . . 2.2 Architecture Evaluation Flow 9

. . . . . . . . . . . 2.3 Definition of a Minimum-Width Transistor Area PRM99j 11

. . . . . . . . . . . . . . . . . . . . . . . . 2-4 Packing of BLEs to Form Clusters 13

. . . . . . . . . . . . . . . . . . . . . . . . . . . 2.5 Original VPACK Algorithm 15

2.6 T-VPACKAlgorithmmR99] . . . . . . . . . . . . . . . . . . . . . . . . . . 17

2.7 Stmcmre of (a) Basic Logic Element (BLE) and (b) Logic Cluster [BRM99] . . 19

. . . . . . . . . . . . . 3.1 Structure and Speed Paths of a Logic Cluster @RA4991 23

. . . . . . . . . . 3.2 Nurnber of Inputs Required for 989% Logic Block Utilization 28

3.3 Total Area for Clusters of Size 1 to 5 . . . . . . . . . . . . . . . . . . . . . . . 29

3.4 Total Area for Clusters of Size 6 to 10 . . . . . . . . . . . . . . . . . . . . . . 30

3.5 Total Logic Block Cluster Area . . . . . . . . . . . . . . . . . . . . . . . . . . 31

. . . . . . . . . . . . 3.6 Number of Clusters and Cluster Area Versus K (for N=l) 3 1

. . . . . . . . . . . . . . . . . . . 3.7 Intra-cluster Multiplexer Area and LUT Size 32

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.8 Routing Area 33

3.9 Number of Clusters and Routing Area Per Cluster Versus K (for N=l) . . . . . 34

. . . . . . . . . . . . . . . . . . . . . . 3.10 Total Delay for Clusters of Size 1 to 10 37

. . . . . . . . . . . . . . 3.1 1 Total Lntra-Cluster Delay for Clusters of Sizes 1 to 10 37

. . . . . . . . 3.12 Number of BLEs on Critical Path and BLE delay vs K (for N=l) 39

. . . . . . . . . . . . . . 3.13 Total Inter-Cluster Delay for Clusters of Size I to 10 39

. . . . . . . . . . . . . . . . . . . . 3.14 Nurnber of Cluster levels on Critical Path 40

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.15 AverageBLEFanout 41

. . . . . . . . . . . . . . . . . 3.16 Area-Delay Product for Clusters of Size 1 to 10 43

3.17 Close-up View of Area-Delay Product for Clusters of Size 1 to 10 . . . . . . . 43

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.1 Cascaded 4-LUTs

. . . . . . . . . . . . . . . . . . . . . . . . . 4.2 Cluster Description (with HLBs)

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.3 HLB Tapping Buffers

. . . . . . . . . . . . . 4.4 Cornparison of HLBs with and without tapping buffers

. . . . . 4.5 HLB Cluster Contents with Full Routing Crossbar for Output S ipa ls

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.6 HLB Packing Flow

. . . . . . . . . . . . . . . . . . 4.7 Pseudo-code for HLB Timing Driven Packing

. . . . . . . . . . . . . . . . 4.8 HLB with Cascaded 4-LUTs and Tapping Buffer

4.9 Total Area Cornparisons for Hardwired Arch . vs . Non-Hardwired . . . . . . .

4-10 Inter-Cluster Area Cornparisons for Hardwired Arch . vs . Non-Hardwired . . .

4.1 1 Intra-Cluster Area Cornparisons for Kardwired Arch . vs . Non-Hardwired . . .

4.12 Total Critical Path Delay Cornparisons for Hardwired Arch . vs . Non-Hardwired

4.13 Number of BLEs on the Critical Path Comparisons for Hardwired Arch . vs .

Non-Hardwired . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

4.14 Number of Clusters on the CnticaL Path Cornparisons for Hardwired Arch . vs .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Non-Hardwired 62

. . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.1 Various HLB architectures 66

xvii

CHAPTER 1

Introduction

1.1 Motivation

Field-Programmable Gate Arrays (FPGAs) have experienced tremendous growth in recent

years and have becorne a multi-billion dollar industry. Shrinking device geometries resuIting

in larger gate capacity have provided for greater functionality. The instant prograrnmability

gives systems buiIt with these devices a significant time-to-market advantage. However, this

prograrnmability cornes at a price, since FPGAs are at least three times slower and demand

more than ten times the silicon area when irnplementing the same function on a chip when

compared to Standard Cells or Masked-Programmable Gate Arrays @3FRV92]. This happens

because Standard Cells use simple wires to make interconnections between logic gates but in

FPGAs, gates are connected with programmable swi tches. These switches have much larger

resistance and capacitance and hence are slower than the wires in full-fabrication chips. Ide-

ally, to improve the performance of an FPGA we would like to use as few switches as possible

for any given circuit. In general, the three main factors affecting overall FPGA performance

2 Chapter 1. Introduction

are the architecture of the FPGA, the quality of the CAD tools, and the electncal transistor

level design of the FPGA. While this thesis explores al1 three issues, we focus primzily on the

logic block FPGA architecture.

1.2 FPGA Logic Block Architecture

This thesis examines several aspects of FPGA logic block architecture and its impact on area

and performance. A generk FPGA consists of numerous programmabIe logic blocks which

have the capability to implement some digital logic functions. In between these logic blocks

are programmable routing switches which connect the input and output pins of each logic

block. This basic FPGA architecture is illustrated in Figure 1.1 and is known as an "island-

style'' structure in which a symmetric array of logic blocks is surrounded by routing channels

(or tracks). The Il0 pads are evenly distributed around the perimeter of the FPGA.

VO pad

a . . . a . . .m . . .= Figure 1.1 : Island-Style FPGA DRM993

Figure 1.2 shows a typical Io@ cluster, which is a Iogic block that consists of one or more

basic logic elements (BLEs) grouped together. BLEs are often composed of look-up tables

(LUTs). The BLEs in the cluster are fully-interconnected meaning that a crossbar allows any

BLE output to reach any BLE input and that al1 inputs to the cluster can reach any of the BLE

inputs. The advantage of having a fully-connected internd routing crossbar is that physical

routing becomes much easier since the router (the CAD tool which determines the paths of

1.2 FPGA Logic BIock Architecture 3

the wires in an FPGA) simply has to connect to any one of the cluster input pins. This added

flexibility in the router results in a fewer number of tracks being used. However, there is a cost

in terrns of multiplexer area and delay to build the full crossbar. For large clusters, this area

and delay can be quite significant.

-4

+ ; outputs

FPGA --__

.............................................

Figure 1.2: FPGA Cluster-style Logic Block Contents

The focus of the first part of this thesis is to determine the effect of the number of inputs to

the LUT (K) in a homogeneous architecture that employs al1 the sarne size of LUTs, and the

number of such LUTs in a cIuster (N) on the performance and density of an F'PGA. Increasing

either LUT size (K) or cIuster size (N) increases the functionality of the logic bIock, which

has two positive effects: it decreases the total number of logic blocks needed to implement

a given Function, and it decreases the number of such blocks on the critical path, typically

irnproving perfomance. Working against these positive effects is that the size of the logic

block increases with both K and N. The size of the LUT is exponential in K W C 9 0 1 and the

size of the cluster is quadratic in N @3R97]. Furthemore, the area devoted to routing outside

the block will change as a function of K and N, and this effect (since routing area typically is a

large percentage of total area) has a strong effect on the results. The choice of the logic block

ganularity which produces the best area-delay product lies in between these two extremes. In

explonng these trade-offs we seek to answer the following questions:

- For a cluster-based logic block with N LUTs of size K and 1 inputs to the cluster, what


should the value of 1 be so that 98 % of the LUTs in the cluster can be fully utilized?

Certainly setting I=KxN will do this, but a value less than this, which is cheaper, may

also suffice.

0 What is the effect cf K and N on FPGA area?

What is the effect of K and N on FPGA delay?

Which values of K and N give the best area-delay product?

Most importantiy, we seek to clearly explain the results and thus perhaps Ieading to bet-

ter architectures. Even though some of these questions were addressed some time ago in

W L 8 9 3 W C 9 0 1 KG9 11 BG92b] m 9 11 and [SRCL92], several reasons compelled

us to revisit the issue. First, prior work on the appropriate size of the LUT focused on non-

clustered logic blocks, which are known to have a significant impact on the area and delay

[MBR99]. Second, most pnor studies tended to look at area or delay, but not both as we will

here. Third, prior results were based on IC process generations that are several factors Iarger

than current process generations, and so do not take deep-submicron electrical effects into ac-

count. In the present work, we perform detailed spice-level simulations of circuits and perform

appropriate buffer and transistor sizing for al1 the Iogic and routing elements, in the manner of

[ERM99]. Fourth, the CAD tools available today for experimentation are significantly better

than those available 10 years ago, when this question was first raised. This turned out to be

significant because our new results show that the superior tools give rise to different trends in

the explmation of the results.

1.3 Hardwired Logic Blocks

The second part of this thesis will explore the use of hardwired logic blocks (HLBs) within

the context of logic clusters [Chu94]. HLBs consist of two or more BLEs connected together

by wires. These wires do not have switches and lack any form of prograrnmability. The low

1.4 Thesis Organization 5

resistance and capacitance of the metal wires makes the connections between BLEs fast and

cheap in t e m s of area. In a non-HLB architecture (shown in Figure 1.2) connections between

BLEs rnust propagate through the local routing crossbar which connects ail BLE outputs and

inputs. The area requirements of the full-routing crossbar can be quite significant sometimes

even larger than the BLE area. Also, there is a significant delay required for signals to reach

from a BLE output to another BLE input. The use of HLBs alleviates the area and delay

demands of local BLE connections and may improve FPGA density and performance. We

explore the area and delay of one particular HLB based FPGA architecture, again in the context

of clustered architectures. -

- LUT- % - a) Cascaded Hardwired LUTs

;?)FF b) Tree Topology Hardwired LUTs

Figure 1.3: Examples of Hardwired Logic Blocks

1.4 Thesis Organization

This thesis is organized as follows: Chapter 2 describes the background required to understand

this thesis and previous work relating to FPGA logic block architecture. Chapter 3 presents

the results of an extensive study of LUT and cluster size on FPGA area and delay. Chapter 4

presents the area and delay results for a cascaded 4-LUT hardwired logic block-based FPGA

and compares it to the non-hardwired results. Finally, we provide the conclusion in Chapter 5

along with possibifities for future work.


CHAPTER 2

Background

This chapter discusses severd related works from the past in the area of FPGA architecture and

CAD tools. There has been a significant amount of work in FPGA architecture over the last

decade and we will attempt to summarize sorne of the key related results- Also, we will review

the CAD flow and algonthms used in logical synthesis, placement and routing used ta produce

the results in the present work.

2.1 CAD Flow

The best-known and most believable method of determining the answers to the questions posed

in Section 1-2 is to experimentally synthesize real circuits using a CAD flow into the differ-

ent FPGA architectures of interest, and then measure the resulting area and delay DERV921

@3RM99] KG9 11. Figure 2.2 illustrates the CAD flow that was used in @RM99] and [MBR99]

to explore architectures and this is the one that we employ in this research. First, each circuit

passes through technology-independent logic optirnization using the SIS program [ea90]. It is

8 Chapter 2. Background

worth noting that, from this point on, the entire CAD fiow is hlly timing-driven. Technology

mapping (which converts the logic expressions into a netlist of K-input LUTs), was perforrned

using the FiowMap and FlowPack tools [CD94]. At this stage, there exists a netlist of logic

blocks (LUTs and registers). T-VPACK m R 9 9 ] takes this netlist of logic blocks and first

groups them into basic logic elements (BLEs) which consist of a single K-LUT and flip-flop.

Any LUT with a fanout of one and feeding into a flip-flop (as shown in Figure 2.1) c m be

collapsed into a single BLE. Hence, BLEs c m be composed of either a LUT, a LUT and a

LATCH or simply a LATCH.

Clock A

Figure 2.1 : Grouping of LUTs and Flip-Flops

These BLEs are then packed into clusters optimizing for both logic density and speed. VPR

PRM991 is then used for timing-driven placement and routing. Phcernent is the process of

determining the location of every cluster within the FPGA tile and routing determines which

programmable switches to turn on in order to connect the nets. Note that the VPACK and VPR

toolset was designed to explore FPGA architectures and each tool takes a number of parameters

as input that describe an FPGA architecture. For example, the packer (VPACK) takes the LUT

size (K), cluster size (N) and number of cluster inputs (I) as input parameters. The router takes

a VPR ".arch" file @Rh4991 which is a textual description of the routing architecture. Once

the physical layout is complete the area and delay of the circuit in the given architecture are

extracted. VPR uses a transistor based area mode1 to calculate the total FPGA area and the

timing analyzer determines the criticd path delay by extracting the Elmore delay of each net

and performing a path based timing analysis.

In our approach to modeling the area of an FPGA required by any given circuit, we deter-

2.1 CAD FIow 9

Circuit

Logic opamimuon (SIS) + TrchnoIogy Map ro K-LUT$

m (FiowMap + FiowPack)

1 Logic ~ ~ o c i c wj . . - Pack BLEs inio Logic Clusters CI-VPACK) 1

l Architecture I \ I - I

I + Placement (VPR. timing-dnven placement)

FPGA ' Rouung p. Description Routing (VPR. timing-drïven muter) 4

Architecture \ J

NO Adjust channel capacities (W)

YES - Wmin detemined

1 1 Rouung with W = 13 Wmin (VPR. Üming-driven router) I

1 Determine critical path dtlay and uansistor m a to buiId FPGA routing (VPR)

Figure 2.2: Architecture Evaiuation Flow

mine the minimum number of tracks needed to successfully route each circuit, W,,. Clearly

this isn't possible in real FPGAs, but we believe this is meanina&ul as part of a logic density

metric for an architecture. The area mode1 which makes use of this minimum track count is

described more fully in Section 2-1.1. In order to determine the minimum nurnber of tracks

per channel to route each circuit we continuously route each circuit, rernoving tracks from the

architecture until it fails to route. We cal1 the situation where the FPGA has the minimum

number of tracks needed to route a given circuit a "high stress" routing since the circuit is

barely routable. We believe that measuring the performance of a circuit under these high-stress

conditions is unreasonable and atypical, because F'PGA designers don't like working just on

the edge of routability. They will typically change something to avoid it, such as using a larger


device, or removing part of the circuit.

For this reason, we add 30% more tracks to the minimum track count and then perforrn

final "low stress" routing and use that to measure the criticd path delay.

From the output of the router, and using the area models described in the next section dong

with the circuit delay parameters, we can compare different architectures,

2.11 Area Mode1

Betz' area modeling procedure DRM991 was to create the detailed, transistor-level circuit

design of d l of the Iogic and routing circuitry in the FPGA. This includes circuits for the

LUTs, Aip-flops, intra-cluster muxes, inter-cluster routing muxes and switches and al1 of the

associated programming bits. His basic assumption was that the totaI area of the FPGA was

active-area limited, which tends to be true when there are many layers of metal (according to

[BRM99]). Two commercial PLD vendors have confinned this assumption.

This design process includes proper sizing of al1 of the gates and buffers, including the

pass-transistors in the routing, Betz uses the number of "minimum-width transistor areas" as

his area rnetric. The definition of a minimum-width transistor area is the smalIest possible

Iayout area of a transistor that cari be processed for a specific technoIogy plus the minimum

spacing surrounding the transistor as shown in Figure 2.3. The spacing is dictated by the

design niles for that particular technology. Any transistors in the circuit design that are sized

Iarger than minimum are counted as a greater number of minimum-width transistors, taking

into account the fact that a double size transistor takes Iess than twice the layout area, One

advantage of this metric is that it is a somewhat process-independent estirnate of the FPGA

area.

2.2 FPGA Packing Algorithniis 11

Minimum horizontal spacing

\ Polysilicon (gate)

1 I

'- - - - - - - ------------------ I 1 Perimeter 01 minimum

Figure 2.3: Definition of a Minimum-Width Transistor Area FRM991

2.2 FPGA Packing Algorithms

width transistor m a

Contact

Difhrsion

In order to effectively study the feasibility of HLB architectures we need to have a packing al-

gorithm capable of targeting such structures. However, we should first provide a bnef overview

of some of the general packing algorithms relevant to Our work. This section will discuss three

such packing algorithms: RASP, VPACK and T-WACK. The input to al1 these packing dgo-

rithms is a nelist of LUTs and Flip-flops and the output is a clustered set of BLEs-

-

2.2.1 RASP

RASP [CPD96] is a generd synthesis system for SRAM-based FPGAs. It has the capability

to map circuits into various types of logic blocks. M S P is composed of a core which includes

synthesis and optirnization algonthms targeting technology-independent Iogic synthesis and

mapping for LUT-based FPGAs. The packing algorithm uses a "closeness" metnc to deterrnine

which LUTs to group together in the same cluster. It first creates a compatibility graph where

the vertices represent the LUTs that require grouping and edges are formed between vertices


if they can be grouped together. There is no edge in the compatibility graph if two BLEs

cannot be grouped together due to some hard constra.int violation (for example, exceeding

the maximum number of cluster inputs allowable). The next step is to assign weights to d l

the edges in the network, Weights are assigned depending on the design objective. if circuit

performance is the main objective then a large weight is assigned to edges which produce a

grouping that reduces the length of the critical path. Conversely, if FPGA area is the pnmary

objective then large weights are given to those edges resulting in groupings that do not create

cornplex interconnection patterns in the final mapping. The algorithmic complexity of the

rnapper is O(nm) where n is the number of LUTs and m is the number of edges. With the

current benchmarks used in our experiments, the number of edges m in the compatibility graph

is 0(n2). This results in an overall algorithmic cornplexity of 0(n3), which is excessive and

somewhat impracticd for modem day circuits which require in excess of 10,000 blocks.

2.2.2 VPACK

VPACK [BRM99] takes a netlist of LUTs and registers as input and outputs a netlist of logic

cIusters as illustrated in Figure 2.4. It groups BLEs together in order to maxirnize input sharing.

The number of BLEs per cluster (N), inputs per cluster (1), LUT size (K) and clocks per cluster

(Melk) are al1 input parameters to the VPACK algorithm. VPACK accepts any combination

of these parameters and creates optimized logic clusters. The complete pseudo-code for the

VPACK algorithm is given in Figure 2.5.

VPACK attempts to pack as many BLEs into a given cluster without violating the following

constraints:

a There can be no more than N BLEs in any given cluster.

a Each cluster must use 1 inputs or less.

Every BLE contained in the cluster rnust have K-inputs per LUT (strictly homogeneous

architecture).

2.2 FPGA Packiner Algofithms 13

NetList of BLEs Netlist of Clusters

F i p r e 2.4: Packing of BLEs to Fom CIusters

a Melk is the maximum number of clocks per cluster.

The basic algorithm co~iisists of two stages. The first stage examines the input netlist and

groups LUTs and registers ttogether to form BLEs. Only LUTs with a fanout of one and whose

output directly feeds a register are grouped togeti-ter. Al1 other structures such as LUTs with

fanout greater than one and LUT to LUT connections require multiple BLEs.

The second stage of the V A C K algorithm groups BLEs together to forrn clusters. Clusters

are created in a sequential manner one after the other. First, the algorithm is in a greedy mode

and continues to add BLEs rto the current cluster until the maximum cluster capacity has been

reached. If the maximum murnber of BLEs, N, have been packed then the current cluster is

"closed" and a new empty acluster is created and the packing process is repeated. The basic

VPACK clustering starts by- selecting a seed BLE to pack into the current open cluster. The

unclustered BLE with the rrnost number of used inputs is always selected as the next available

seed. Once a seed BLE h a s been selected, an attraction function is applied to determine the

next BLE to group within the current cluster. The attraction between a BLE, B, and the current

cluster, C, are the number oEcommon nets that are shared. A net is defined as any electrically

equivdent wire connecting one or more logic blocks together.


If the greedy algorithm of the packer does not completely fil1 the cluster to its maximum

capacity then there is the option to invoke the hill-climbing phase. Hill-climbing will continue

to pack BLEs into the cluster even though the cluster uses more than the maximum I inputs that

are allowabIe. Hill-climbing ailows BLEs to be added even though the result is an infeasible

cluster (too many inputs). It must be remembered that adding a BLE to a cluster in which

al1 the inputs are already present in the cluster and having its output used by another BLE

causes a reduction in the number of cluster inputs. This is the key driving force behind hill-

climbing. That is, even though infeasiblities may occur early on during packing, it may become

feasible at later stages and the result is an improvernent in Iogic block density. The dgorithmic

complexity of VPACK is O(k,, - K - n) where the maximum number of terminais on a net is

represented by km,, K is the number of inputs to each LUT and n is the nurnber of LUTs + registers in the circuit.

2.2.3 Timing-Driven Packing (T-VPACK)

T-VPACK m R 9 9 ] was an extension to VPACK that attempts to maximize Iogic cluster ca-

pacity while simultaneously working to reduce the critical path delay. The total delay is im-

proved by reducing the number of inter-cluster connections on the critical path. Since the

external cluster routing deIay is much larger than the focai routing inside the cluster this should

have a positive impact on delay. The pseudo-code for the T-VPACK algorithm is shown in

Figure 2.6.

Timing analysis is perforrned on the circuit before any packing begins. T-VPACK models

three types of delays within an FPGA: the delay through a logic block, Logic_BlockDelay,

the intracluster delay between logic blocks, IntraLogicDelay and the extemal delay be-

tween cluster input/output pins, Inter.BlockDelay. The external routing delays cannot be

determined before placement and routing and so these must be approximated. It was exper-

imentally demonstrated in mR99] @31RM99] that weighting the LogicJ3lockDelay and ln-

2.2 FPGA Packing Algorithms 15

Let:UnclusteredBLEs be the set of BLEs not contained in any cluster C be the set of BLEs conrained in the current cluster LogicClusters be the set of ctusters (where each cluster is a set of BLEs)

UncIusteredBLEs = PatternMatchToBLEs (LUTs, Registers); LogicClusters = NüLL;

while (UnclusteredBLEs != NULL) ( /* More BLEs to cluster */ C = Ge tBLEwithMostUsedInputs (UnclusteredBLEs) ; while (ICI c N) { /* CIuster is not full */ BestBLE = MaxAttnctionLegalBLE (C, UnclusteredBLEs); if (BestBLE = NULL) /* No BLE can be added to cluster */

break; UnclusteredBLEs = UnclusteredBLEs - BestBLE; C = C u BestBLE;

1 LogicClusters = LogicClusters u C;

1

Figure 2.5: Original VPACK Algorithm

traLogicDelay to 0.1 and InterBlockDelay to 1 produced the most accurate post-routing

results.

All the net timing and slack information is stored after initial timing analysis has been

completed. Frorn this, the criticality of every connection can be calculated. Slack @SC831 is

defined as the amount of delay that may be added to a connection before it becornes critical.

Slack(i) is the slack of a connection i and M d a c k is the maximum possible slack for d l

connections in the circuit. The criticality of any connection, i, is expressed as:

Once the connection criticalities have been detennined T-VPACK detemines which BLE

will be chosen as the seed for a new cluster by selecting the BLE attached to the net with the


highest criticality. Once a seed BLE has been selected, an attraction function is applied to

calculate the next BLE that would be the best candidate for inclusion in the current cluster.

The attraction function for a BLE, B, towards a cluster, C is given by:

1 Ners(B)Mers(C) 1 Artraction(B) = h - Criticality ( B ) + ( 1 - h) MdetS

The BLE with the highest attraction will be the next candidate and A is a parameter which

determines whether T-WACK should be h l ly timing driven or maximizes input sharing. If h

is 1 then T-VPACK attempts to minirnize delay without regard to maximizing input pin sharing

and if h is O then T-VPACK will attempt to minirnize the number of used inputs. mR99]

demonstrated that setting h to a range between 0.4 and 0.8 was best.

2.2 FPGA Packing Algorithms 17

Let:UnclusteredBLEs be the set of BLEs not contained in any cluster C be the set of BLEs contained in the current cluster LogicClusters be the set of clusters (where each cluster is a set of BLEs)

UnclusteredBLEs = PattemMatchToBLEs (LUTs, Registers); LogicClusters = NLTLL;

while (UnclusteredBLEs != NULL) { /* More BLEs to cluster */

C = GetMos tCrïticalBLE (UnclusteredBLEs) ; BLEsSinceLastCriticality Recompte ++;

while (ICI < N) { /* Cluster is not full */

if (BLEsSinceLastCriticalityRecompute >= Recomputehterval) { ComputeCriticalities(); BLEsSinceLastCriticalityRecompute = 0;

}

BestBLE = MaxAttractionLegalBLE (C, UnclusteredBLEs) ; if (BestBLE = NULL) /* No BLE c m be added to cluster */

break; UnclusteredBLEs = UnclusteredBLEs - BestBLE; C = C u BestBLE; BLEsSinceLastCnticalityRecompute ++;

LogicClusters = LogicClusters u C; 1

Figure 2.6: T-VPACK Algorithm [.BR991


2.3 FPGA Logic Block Architecture

Now that the FPGA experimental rnethodology and design flows have been described, we dis-

cuss eariïer work which exarnined logic block architecture and its impact on FPGA density and

performance. This research and much of the relevant pnor work was based on a cluster-based

island style FPGA. The structure of the cluster-based logic bIock is illustrated in Figure 2.7.

Each cluster contains N basic Iogic elements (BLEs) fed by 1 cluster inputs. The %LE, illus-

trated in Figure 2.7(a) typically consists of a K-input lookup table (LUT) and register, which

feed a two-input multiplexer that determines whether the registered or unregistered LUT out-

put drives the BLE output. For clusters containing more than one BLE, ''full connectivity" is

assurned. This means that d l 1 cluster inputs and N outputs c m be progammably connected to

each of the K inputs on every LUT. These are irnplernented using the multiplexers shown in the

figure, which are un-necessary for a cluster of size 1, The Altera Flex 6K, 8K, 10K [Tnc98a]

and Xilinx 5200 pnc] and Virtex Dc98bl are commercial examples of such clusters (dthough

the XiIinx logic clusters are not fully connected).

2.3.1 LUT Size

There have been several studies in the past which examined the effect of logic block function-

ality on the area and performance of FPGAs. The use of large LUTs generally produces good

performance results since there are a fewer number of BLEs on any given criticai path. How-

ever, because the Iogic block area grows exponentially with the number of LUT inputs, these

larger blocks are very expensive. With this in mind, the key to any FPGA Iogic block study

is to balance these two opposing factors and determine an appropriate LUT size which gives

both good performance and logic density. For exampie, the work in W C 9 0 1 and [KG92a]

showed that a LUT size of 4 is the most area efficient in a non-clustered context. One of the key

observations from this study was that the FPGA area was mainly dominated by the inter-cluster

routing area and even though increasing Iogic functionality resulted in fewer logic blocks this

2.3 FPGA Logic BIock Architecture 19

Clock

(a) Basic l o g i c eIement ( B E )

K-input Inputs - - LUT -

(b) Logic cluster

I D FF Out

Clock -b>

Figure 2.7: Structure of (a) Basic Logic EIernent (BLE) and (b) Logic Cluster @3RM99]

could easily be offset by the increase in routing area due to the Iarger number of extemal pin to

pin connections between blocks. Hence, it was concluded in [PFLCgO] that logic blocks with

high functionality per connected pin produced the best area results. Conversely, evaiuation of

FPGA performance was performed in [Sin911 [SRCL921 and KG911. It was observed that

using a LUT size of 5 to 6 gave the best performance. The FPGA performance studies tended

to favour Iarger LUTs since they resulted in fewer levels of logic on the critical path. Since

the routing delay was larger than the logic delay, this generally lead to postive delay results. A

recent publication m K C 9 9 ) has suggested that using a heterogeneous mixture of LUT sizes

of 2 and 3 was equivdent in area efficiency to a LUT size of 4. In addition [ACSGf 991 States

that a logic structure using two 3-input LUTs was most beneficid in terms of area. However, it

must be noted that both these last two papers did not perform a full area or delay study where

a range of LUT sizes was exarnined.

20 Chamter 2. Backaound

2.3.2 Cluster Size

In general, clusters can be classified using four parameters [BRM99]: the nnrmber of inputs that

each BLE has (K), the number of BLEs within each cluster (N), the numbaer of distinct cluster

input pins feeding each cluster (1) and the number of distinct clocks per clluster (MCIk). It W ~ S

experimentally shown in [BRM99] that given a cluster with a single dock amd assumi~g a LUT

size of 4, then clusters of size 4 to 10 produced the best area-delay resultts using non-timing

driven packing (VPACK). Later on, a separate study Mar991 I34l3R991 based on a timing-

driven packing algorithm (T-VPACK) found that cIusters of sizes 7 to 10 were best in terms of

area-delay product.

2.4 Summary

This chapter outlined some of the key results from the past in FPGA logic block architecture.

AIso, background information concerning logic synthesis, technology mâlpping and packing

were provided. The rest of the thesis will elaborate on the area and delay- resuIts for various

Iogic block architectures.

CHAPTER 3

FPGA Logic Block Architecture

This chapter explores the effect of Iogic block functionality on FPGA performance and density.

In particular, in the context of lookup table, cluster-based island-style FPGAs PRM991 we

look at the effect of lookup table (LUT) size and cluster size (number of LUTs per cluster) on

the speed and logic density of an FPGA.

We use a fülly timing-driven experimental fiow as describecl in Section 2.1 @3RM99]

Mar991 in which a set of benchmark circuits are synthesized into different cluster-based

PR971 PR981 war99] logic block architectures, which contain groups of LUTs and flip-

flops. We look across al1 architectures with LUT sizes in the range of 2 inputs to 7 inputs, and

cluster size from 1 to 10 LUTs. In order to judge the quality of the architecture we do both

detaiIed circuit level design and measure the demand of routing resources for every circuit in

each architecture.

Chapter 3. FPGA Logic Block Architecture

3.1 FPGA Architecture Modeling

In this section we give a brief description of the FPGA architecture and deiay modeling used in

Our work. The level of detail present in these models goes far beyond any modeling previously

used in this kind of experimental analysis. Al1 device parameters and circuits are modeled

using SPICE simulations of a 0.18 pm CMOS process.

We make the foIlowing assumptions about the basic island-style architecture:

a The number of routing tracks in each channel between logic blocks is uniforrn throughout

the FPGA.

a AI1 metal routing wires are placed on metal layer 3 with minimum width and spacing.

Each circuit is mapped into the smailest square (MxM) grid possible given the number

of logic clusters it requires-

However, it is important to note that the area metric we count is not the total area required

by the square M x M block on the FPGA. Rather, we use the exact number of clusters

required to implement the circuit. For exarnple, a circuit which requires 800 logic blocks

will be routed in 29 x 29 WGA grid which results in 841 blocks. We use the area of the

logic and routing surrounding 800 clusters as opposed to 841.

3.1.1 Logic Circuit Design and Delay Mode1

The circuit design process described above is also necessary to determine accurate delay mea-

surements of the final placed and routed circuit. In deep-submicron IC design processes, the

effect of wire resistance and capacitance becomes more prevalent. We account for these effects

in this delay modeling. Figure 3.1 shows the detailed logic block circuit. Al1 circuit design

was done in TSMC's 0.18 pm, 1.8 V CMOS process. The paths have been simulated with their

actual loads in place and the input driven by what would actualiy be driving it in a real FPGA.

3.1 FPGA Architecture Modeling 23

As the cluster size increases, the buffers shcswn in Figure 3.1 must be sized larger because

of iarger loading from the interna1 muxes, which results in an increase in the basic BLE delay.

This is shown in Table 3.1 which gives the logic delays as the cluster size increases for the

paths indicated in Figure 3.1 for a BLE based o n a 4-input LUT.

Input connection

block buffers & muxes

1 Local 1 buffen

I I

+ Local routing muxes

L - , - A I l BLE 1

œ

N 1 : BLEs 1

Logic Cluster

Figure 3.1 : Structure and Speed Paths of a Logic Cluster @3R'M99]

Table 3.1 : Logic Cluster Delays for Cinput LUT Using 0.18 pm CMOS process

Cluster Size (NI

1 (No local muxesi

2

4

6

8

10

Similarly, the design of the larger LUTs mus t be done carefully, with proper buffer sizing

and, in some cases, insertion of buffers within t h e tree of pass-transistors. Table 3.2 gives the

LUT delay as a function of the LUT size.

A tb B (PB)

377

377

377

377

377

377

B to C and D t0 C (RS)

180

221

301

3 3 2

331

337

to (ps)

376

385

401

3 97

396

3 87

24 Chapter 3. FPGA Logic Block Architecture

Table 3.2: LUT Delays Using 0- 18 prn CMOS process

LUT Size (K)

3.1.2 Routing Architecture

The target routing architecture of the CAD flow used in these experiments is one that Betz et,

al m g 9 1 indicate is a good choice. This architecture has the following parameters:

rn Routing segments have a Iogical length of four (the logical length of a segment is defined

as the number of logic block clusters that it spans)

50% of these se-wents use tri-state buffers as the programmable switch and 50% use

pass transistors

The experiments conducted in DRM991 were based on a LUT size of four and a cluster

size of four. We will assume these results are valid for a11 the LUT sizes and cluster sizes that

we are comparing.

However, the LUT and cluster size does afTect the sizing of the buffers used to drive the

programmable routing, both from the block itself and the tri-state buffers interna1 to the pro-

grammable routing. As the Iogic block cluster increases in size, the size of each logic tile is

Iarger, and therefore the length of the wires being driven by each buffer increases. Since this

increases the capacitive loading of each wire, the buffers must be sized appropriately. Betz

@3RM99] indicates that for a cluster size of four and a LUT size of four, the best routing pass

3.2 Experimental Results 25

transistor width was ten times the minimum width, while the best tri-state buffer size was only

five times the minimum, We size our buffers in direct proportion to the length of this tile. That

is, if the tile length has doubled, then we double the size of the routing buffers.

3.2 Experimental Results

In this section we present the expenmental results of synthesizing benchmark circuits through

the CAD flow described in Section 2.1 with the area delay modeled as described in Section 3.1.

The benchmark circuits used in these experiments were the 20 largest from MCNC Wang11

d o n g with 8 new benchmarks '. Table 3.3 gives a description of the circuits, including the

n m e , number of 4 input-LUTs and number of nets.

Each circuit was rnapped, placed and routed with LUT size varying from 2 to 7 and cluster

sizes from 1 to 10. With 6 different LUT sizes and 10 different cluster sizes this gives a total

of 60 distinct architectures.

3.2.1 Cluster Inputs Required vs. LUT and Cluster Size

Before answering the principal questions raised in the introduction, we need to determine an

appropriate value for 1, the number of logic block cIuster inputs. The value of 1 should be a

function of K (the LUT size) and N (the number of LUTs in a cluster). This is of concern

since the larger the number of inputs the Iarger and slower the multiplexers feeding the LUT

inputs will be, and more programmable switches wiII be needed to connect externally to the

logic block. hdeed, one of the principal advantages of fully-connected clusters is that they

require fewer than the full number of inputs (K x N) to achieve high Iogic utilization. There

are several reasons for this:

' ~ h e 8 new benchmarks are from the University of Toronto from two computer vision applications. The 8 benchmarks are {display-chip, img-calc, imginterp, inputchip, peak-chip, scaleI25,chip, scale2ship, warping)


Table 3.3: MCNC Benchmark Circuit Descriptions

Circuit

a h 4

apex2

apex4

bigkey

c l m a

des

di£ feq

ds i p

e l l i p t i c

ex1010

ex5p

f r i s c

m i sex3

~ d c

s298

s38417

~ 3 8 5 8 4 . 1

seq

s p l a

tseng

d i s p l a y - c h i p

hg-calc

i m g - i n t e r p

i n p u t - c h i p

peak-chip

sca le125 ,ch ip

s c a l e 2 - c h i p

warp ing

# of ~ - I A P u ~ BLEs

1522

1878

1262

1707

8383

1 5 9 1

1497

1 3 7 0

3 604

4598

1064

3556

1397

4575

193 1

6406

6447

1750

3690

1047

1794

20141

2727

807

809

2632

1189

1353

Number of Nets

1536

1916

1 2 7 1

193 6

8445

1847

1 5 6 1

1599

3735

4608

1072

3576

1 4 1 1

4591

1935

643 5

6485

1 7 9 1

3706

1099

2419

10180

2769

8 4 1

840

2654

1202

1394

3.2 Experimental Resuits 27

a Some of the inputs are feedbacks from the outputs of LUTs within: the sarne clusters,

saving inputs.

a Some inputs are shared by multiple LUTs in the cluster

a Some of the LUTs do not require al1 of their K-inputs to be used- Indeed this is often the

case, as pointed out in WKC991.

Betz and Rose BR971 PR981 showed that when K=4 and 1 is set to t h e vaIue 2N+2, then

98% of al1 of the 4-LUTs in a cluster would typically be used. We would l i k e to find a sirnilar

relation, but one that includes the variable K.

To determine this relation, we ran several experiments, using only the first three steps il-

lustrated in Figure 2.2: logic synthesis, technology mapping and packing. For each possible

value of N and K, we ran experiments varying the value of 1 (the rnaximurn number of inputs

to the cluster allowed by the packer) from 1 to K x N. Following BR971 une chose the lowest

value of I that provided 98% utilization of al1 of the BLEs present in the cixrcuit. Figure 3.2 is

a plot of the relationship between the number of inputs (1) required to achiezve 98% utilization

and the cluster size (N) and the LUT size (K). Typically, the value of 1 must Ebe between 50 and

60% of the total possible BLE inputs, I = K xN.

By inspection we have generalized the relationship as:

This equation provides a close fit to the results in Figure 3.2. The averagre percentage error

across al1 possible data points is only IO. 1 % with a standard deviation of 7 . 6 9%.

3.2.2 Area as a Function of N and K

In this section we present and discuss the experimental results that show t h e area of an FPGA

as a function of N and K. Note that 1 was set to the value determined in the= previous section.


O 1 1 I I I t t 1 1 1 2 3 4 5 6 7 8 9 10

Cluster Size (NI

Figure 3.2: Number of Inputs Required for 98% Logic Block Utilization

These results are for the 28 benchmark circuits. Area, as discussed above, is measured in terms

of the total number of minimum-width transistors required to implement al1 of the logic and

routing.

Total Area

Figures 3.3 and 3.4 give a plot of the geometric average (across al1 28 circuits) of the total area

required as a hnction of cluster size and LUT size. Several observations can be made from

this data:

0 LUT sizes of 4 and 5 are the most area-efficient for d l cluster sizes.

O There is a reduction in total area when the cluster size is increased frorn 1 to 3 for ail

LUT sizes. However, as clusters are made larger (N > 4) there is very little impact on

total FPGA area. Figure 3.4 demonstrates this behaviour very well. It is generally ex-

pected that increasing logic block iünctionality should result in more BLEs being added

to a cluster and connections that normally would have been routed extemally are now ab-

sorbed intemal to the cluster. This should reduce the inter-cluster area which is usually

much higher than the intra-cluster area and thus having a positive impact on total area-

However, the reason the total FPGA area doesn't decrease is because increasing logic

capacity (more input & output pins) results in an increase in track count. It rnust also

be rernembered that the logic block area is ais0 increasing due to the LUT area and the

multiplexer area, So the area savings is not as significant as it appears and Figures 3.4

and 3.3 confirm this fact.

6e+06 l 1 1 L

'Cluster Size = 1' - 'Cluster Size = 2' -x- "Cluster Site = 3' ---*--- 'Cluster Site = 4' -El- 'Cluster Size = 5" --*- -

3.5e+06 -

3e46 I 1 r I

2 3 4 5 6 7 LLJT Size

Figure 3.3: Total Area for Clusters of Size 1 to 5

fi is instructive to break out the components of the data in Figures 3.3 and 3.4 in order to

achieve both insight and inspiration on how to make more area-efficient FPGAs. The total area

can be broken into two parts, the logic block area (including the muxes inside the clusters)

and the routing are% which is the programmable routing extemal to the clusters. Throughout

the rest of this paper, these will be referred to as the intra-cluster area and inter-cluster area

respective1 y.


Figure 3.4: Total Area for Clusters of Size 6 to 10

We wiIl first explore the intra-cluster area. Figure 3.5 shows the total intra-cluster area

component of the total area (again, geometrically averaged over the 28 circuits) as a function

of the LUT size. The data shows that the intra-cluster area increases as K increases. This area

is the product of the total number of clusrers times the area per clustec A plot of these two

components for a cIuster size of 1 is given in Figure 3.6.

The logic block area grows exponentially with LUT size as there are 2K bits in a K-input

LUT. in addition, larger LUT sizes require larger intra-cluster multiplexers because the size

of each rnultiplexer is (I + N) = (Ki2 (N+l) + N). As K increases, though, the nurnber of

clusters decreases (because each LUT can implement more of the logic function) as shown by

the downward curve in Figure 3.6.

However, the rate of decrease in the number of Iogic blocks is far outweighed by the in-

crease in the size of the block as K increases, and hence the upward trend in Figure 3.5. Fig-

ure 3.7 decomposes the logic block area into two parts: a) intra-cluster rnultiplexer area and b)

LUT area. The results illustrate that the local intra-ciuster routlng area cannot be ignored and

'Cluster Siie=lg + 'Cluster Siie=ZD -x- 'Cluster Sie*- ---r--- 'Cluster SizeS --a- 'Cluster Ske=5' -,-- 'Cluster Size4' --4.- 'Cluster Size=7' --+- 'Cluster Ske=eu ----x--- 'Cluster Size=gn --it-

'Cluster Size=lO* -e-

4 5 iüT Size

Figure 3.5: Total Logic Block Cluster Area

4 5 LUT Size

Figure 3.6: Number of CIusters and CIuster Area Versus K (for N=1)

cari be quite significant for larger clusters.


'LUT Area' - 'Mux Area (Cluster Size = 2)' -K-

"Mux Area (Cl~ster Size = 9 c.- mMux Area (Ciuster Sire = 101 -.p-

1-4e+06 -

1.2e+06 -

l e 4 6 -

aooooo -

K----- 2: --

2 3 4 5 6 7 LUT Size

Figure 3.7: Intra-cluster Multiplexer Area and LUT Size

Observing the absolute values in Figures 3.3 to 3.5, we see that the intra-cluster area

typically takes up about only 25% to 35% of the total area, except when the LUT size reaches

6 and 7, at which point intra-cluster area becomes a dominant factor-

The key effect, as always in FPGAs, is with the routing area. Figure 3.8 is a plot of the total

inter-cluster routing area as a function of the LUT size and cluster size. The Figure shows that

the routing area decreases in a Iinear fashion with increasing LUT size. This particular result

is interesting since previous work frorn [RFLCgO] bas shown that the routing area achieved a

minimum between K=3 and K 4 , and increased for values of K beyond this.

To explain this obsewed behavior, observe Figure 3.9 which decomposes the total routing

area into two separate components: the number of clusters and the routing area per cluster.

These curves are given for a cluster size of 1, but are representative for ail ciuster sizes, The

product of these two curves gives the total inter-cluster routing area. The reason why the

routing area decreases linearly with LUT size is that as we increase the LUT size, the number

of clusters decreases much faster than the rate at which the routing area per cluster increases.

The routing area per cluster grows slightiy with increasing LUT size since due to the fact that

there is very little change in channe1 width as the LUT size is varied. This is shown in Tables 3.4

and 3.5 where it can also be seen that increasing cluster size leads to larger average channel

widths. The channel width is the major deterrnining factor in inter-cluster area. The difference

in results from IPT;LC90] and Our current results can be attributed to the fact that we are now

using better CAD tools with more sophisticated dgorithms; in particular the quality of the

placement tool and the routing tool is significantly better, and uses significantly less wiring.

In addition, for clustered logic blocks, more of the routing is being implemented within the

cluster itself.

2 3 4 5 6 7 LUT Size

Figure 3.8: Routing Area

3.2.3 Performance as a Function of N and K

The second key metric for FPGAs is their speed measured by the critical path delay. The

total critical path delay is defined as the total delay due to the logic cluster combined with the

routing delay. Figure 3.10 shows the geometnc average of the total critical path delay across

34 Chapter 3. FPGA Logic Blocla Architecture

Figure 3.9: Number of Clusters and Routing Area Per Cluster Versus K ( f s r N=l)

al1 28 circuits as a function of the cluster size and LUT size. Observing the Figure, it is clear

that increasing N or K decreases the critical path delay. These decreases are significant: an

architecture with N=l and K=2 has an average delay of 45 ns while K=7 and- N=lO has an

average critical path delay of just 14 ns. There are two trends that explain this behavior. As the

LUT and cluster size increases:

c the delay of the LUT and the delay through a cluster increases

0 the number of LUTs and clusters in series on the critical path decreases

We will discuss these effects in more detail below.

It is instructive to break the total delay into two components: intra-cluster delay (which

includes the delay of the muxes and LUTs), and inter-cluster delay.

Figure 3.11 shows the portion of the critical path delay that cornes from the intra-cluster

delay as a function of K and N. There are two key points to observe here. First, t he intra-cluster

delay decreases as the LUT size increases. This is due to the fact that there is a reduction in


Table 3.4: Channel Width vs. LUT and Cluster Size (1 to 5)

Cluster Size (N) 1

LUT Size (K) 2

Channel Width (geometric avg.)

1 5 - 9

the number of BLE leveIs on the critical path and hence there will be fewer Iogic levels to

implement. This will translate into a reduction in intra-cluster delay. Figure 3. L2 illustrates

this concept more clearly at the BLE level: it is a plot of BLE delay and nurnber of BLEs on

the criticai path versus LUT size for a cluster size of 1. The number of BLE leveis decreases

quicker than the increase in BLE delay and hence the decrease in logic delay. The second

behaviour that should be noticed is that the intra-cluster delay increases for any given LUT size

as the cluster size is increased. This is because the intra-cluster muxes get larger and therefore


Table 3.5: Channel Width vs. LUT and Cluster Size (6 to 10)

slower. However, the deIay through these muxes is still much faster than the inter-cluster delay,

as shown in figure 3.1 1.

Cluster Size (N) 6

Figure 3.13 shows the portion of the critical path delay that cornes from the inter-cluster

routing deIay as a function of K and N.

As K increases there are fewer LUTs on the critical path, and this translates into fewer

inter-cluster routing links, thus decreasing the inter-cluster routing deIay. Similarly, as N is

increased, more connections are captured within a cluster, and again, the inter-cluster routing

LUT Size (K) 2

Channel Width (geometric avg.) 4 0 - 9

"Cluster Size = 2. -x- 'Cluster Size = 3' ---*--- 'Cluster Size = 4. -Ef- - 'Cluster Size = 5' -*- 'Cluster Size = 6' --*-- 'Cluster Size = 7' - 'Cluster Size = 8' ----x--- - 'Cluster Size = 9' --

'Cluster Sire = 10' --+-

-

- m n - 0- +%

20 -

15 -

10 1 I I r t I 2 3 4 5 6 7

LUT Size

Figure 3.10: Total Delay for Clusters of Size 1 to 10

'Cluster Size = 2. -x-- 'Cluster Size = 3' ---*--- 'Cluster Size = 4" -a- 'Cluster Size = 5' -=-- 'Cluster Size = 6' ---*-- 'Cluster Size = 7' -- 'Cluster Site = 8' ----x--- "Cluster Size = 9' -+

'Cluster Size = 10' -+--

6 -

5 1

4 5 LUT Size

Figure 3.1 1 : Total Intra-Cluster Delay for Clusters of Sizes I to 10

deiay decreases.


In discussing these trade-offs, it's useful to follow an explicit example: Table 3.6 shows

how the delay through one BLE and multiplexer stage (delay from B to D on Figure 3.1) rises

from 0.556 ns to 0.702 ns when going from K=4 and N=l to K=4 and N-4. Although the

number of BLE levels on the critical path remains fairly constant since we have not modified

K, the total logic delay increases from 5.58 ns to 6.65 ns due to the increase in the local cluster

routing multiplexers. However, since there are now 4 BLEs in every cluster as opposed to a

single BLE, more logic is irnplemented internally within the clusters. Nets that norrnally would

have been routed externally are now interna1 to the clusters. This transiates in a reduction in

the inter-cluster routing delay from 19-48 ns when using K=4, N= 1 to 1 1-72 ns for K=4 and

N=4. The total critical path delay decreases from 25.91 ns to 19.35 ns as originally shown in

Figure 3.10-

Table 3.6: Cntical Path Delay Cornparison for K=4

BLE t Mux Delay 1 0 . 5 5 6 ns

Avg # of BLEs on Critical Path 1 9 . 5 3

To ta1 Intra-Clus ter Delay l 5 - 5 8 ns To ta1 Inter-Cluster Delay 119.48 ns

In general, inter-cluster routing deiay is much larger than the intra-cluster delay, and hence

the value of increasing the cluster or LUT size. However, it is interesting that increasing cluster

size has little impact after a certain point (for N > 3). Figure 3.10 shows this clearly where

for any fixed LUT size, the majority of the improvement in critical path delay occurs as the

cluster size is increased from 1 to 3. Any further increases in cluster size results in a very

minimum delay improvement. This behaviour suggests that clustering has little effect after a

certain point. This is counter-intuitive to what we expect. That is, employing larger clusters

T o t a l D e l a y 2 5 . 9 1 ns


'#of BLEs on Critiml Path' 'BLE Dela i

Figure 3.12: Number of BLEs on Critical Path and BLE delay vs K (for N= 1)

40 1 I I 'Cluster 1 Size = 1' - 'Cluster Size = 2' -x- 'Cluster Size = 3' ---*--- 'Cluster Size = 4' -a- 'Cluster Size = 5' --*-- 'Cluster Size = 6' -4.- 'Cluster Size = 7' -+- gCluster Size = 8' ----x--- 'Cluster Size = 9' -r-

'Cluster Size = 10' --+-

4 5 LUT Sire

Figure 3.13: Total Inter-Cluster Delay for Clusters of Size I to 10

should always reduce the critical path. Aithough, the total delay results from Figure 3.10 do not

contradict this, what was surprishg was how little of an improvement in total delay that was

40 Chapter 3. FPGA Loaic BIock Architecture

achieved with larger clusters. To better understand this situation, it is sufficient to examine the

number of logic block levels (ie: cluster levels). Figure 3.14 shows the number of cluster levels

as a fùnction of cluster and LUT size. The results clearly show the number of levels decreasing

with increasing cluster and LUT sizes. But, for any given LUT size it c m be seen that most

of the reduction in the number of levels occurs as the cluster size is increased from 1 to 3.

Also, recall that the rnajority of the criticaï path delay was reduced in this range (as shown

in Figure 3-10). The direct relationship between the number of cluster levels on the critical

path and the final total delay is no coincidence. Fewer logic blocks on the critical path leads

to improved performance. The main reason that the total delay did not improve significantly

as we varied the cluster size frorn 3 to 10 was that there was no significant reduction in the

number of Iogic block levels. Without a reduction of the number of inter-cluster levels on the

critical path we cannot possibly expect improvements in FPGA performance.

30 - I I I I I I I

~L'UTS~Z~ = 2' - 'LUT Size = 3" -K- 'LUT Size = 4' ---r--- 'LUT Size = 5' -a- 'LUT Çize = 6' -,-- 'LUT Çize = 7' ---e--

1 2 3 4 5 6 7 8 9 10 Cluster Size

Figure 3.14: Number of Cluster levels on Critical Path

Another interesting trend to observe from figure 3.14 is that increasing the cluster size has

less of an effect for architectures cornposed of larger LUTs. For exarnple, increasing the cluster

size from 1 to 10 for a 2-input LUT architecture results in a 60% reduction in the number of

cluster Ievels on the cntical path. Conversely, employing BLEs with a 7-input LUT and varying

the cluster size from 1 to 10 results in only a 22% reduction in logic levels. Hence, clustering

proves to be more effective for smaller LUTs. To understand this more clearly, we should

examine the average BLE fanout for every LUT size. Figure 3.15 shows this and as we c m see

larger LUTs correlate to larger average fanout. The reason smaller LUTs had a better response

to larger cluster sizes was due to the fact that each LUT had a relatively small fanout and hence

adding an extra BLE to a cluster usually guaranteed some reduction in the number of logic

levels. The same cannot be said about larger LUTs since they have a much larger average

block fanout and it becomes much more diffult to ensure that any subsequent BLE addition

wiII result in fewer cluster levels on the critical path.

Figure 3.15: Average BLE Fanout

3.2

3

2.8

2.6

2 4

2.2

L ' 1.8

2

I 1 L I 'Avg. Block Fanout' ---a---

_--- .a----------*--.-------.*f, - _-.- -

_--- __----- -a----

- - _--- .m-

- -

- -

-a. - -

- /---- -

I f I 1

3 4 5 6 7 LUT Sire


3.2.4 Area-Delay Product

So fw, we have examined the effect of K and N on area and performance of FPGAs. As area

cari often be traded for delay, it is instructive to look at the area-delay product. Figures 3-16

and 3.17 illustrate the area-delay product versus K and N. This plot shows that using a LUT

sizes of 4 to 6 and clusters of 3 to 10 appear to give the best area-delay results.

Notice that area-delay decreases significantly as the LUT size is increased from 2 to 4 for

al1 cluster sizes. This is because their delay is poor due to the large arnount of BLE levels on

the cntical path combined with the fact that the total area requirements are slightly larger (by

about 20%), and so they are a bad choice.

The area-delay product jumps for K=7 on average by 10-15% principally because the huge

area cost for 7-input LUT outweighs the modest performance gains it achieves.

This latter observation suggests that, if there was a way to achieve the depth properties of a

7-input LUT without paying the heavy area price, then such a 7-input input function may well

be a good choice.

We have also observed that, for large clusters, a large portion of the deIay is taken up by

the intra-cluster muxes. If this delay could be reduced somehow, then significant speed wins

could be achieved.

3.2.5 Summary

We have studied the effect that different logic block architectures have on FPGA area and

performance. The main results are summarized in table 3-7. In addition, we experimentally

derived a relationship between the number of cluster logic block inputs required to achieve 98%

utilization as a function of the LUT size, K and the cluster size, N. This is I = 5 x (N + l),

where 1 is the number of distinct cluster inputs.

Secondly, we have shown that small LUT sizes (2-input and 3-input LUTs) are not as area

efficient as the 4 and 5-input LUTs and their performance characteristics are very poor. If area-

0.3 1 I I I 'Cluster Size = 1' - 'Cluster Size = 2. -K- 'Cluster Size = 3' ---r--- 'Cluster Size = 4' -a- 'Cluster Slze = 5 -,-

0.25 'Cluster Size = 6' ---O---

'Cluster Size = 7' - 'Cluster Size = 8' ----x--- 'Cluster Size = 9' ---+--

'Cluster Site = 1 O' -e--

2 3 4 5 6 LüT Size

Figure 3.16: Area-Delay Product for Clusters of Size 1 to 10

'~luste;~ize = 1' - 'Cluster Size = 2' -x- 'Cluster Site = 3' ---*--- 'Cluster Size = 4' -+i- 'Cluster Sire = 5' -m.- 'Cluster Size = 6' --*-- 'Cluster Size = 7' -+-- 'Cluster Site = 8' ----x--- 'Cluster Size = 9" --it

"Cluster Sire = 10' +-

LUT Size

Figure 3.17: Close-up Viecv of Area-Delay Product for Clusters of Size 1 to 10

delay is the main criteria, then the use of clusters of between 3 and 10 and LUT sizes of 4 to 6

will produce the best overall results.

44 Chapter 3. FPGA Logic BIock Architecture

FinalIy, Our work suggests two fbture directions: finding ways to reduce the number of

levels of Iogic without the expense of large LUTs, and reducing the delay of intra-cluster

rnultiplexers.

Table 3-7: Summary of Best Area, Delay, and Area-Delay Results

C r i t e r i a

Area

Delay

Area-Delay

LUT Size (K)

4 to 5

7

Cluster Size (N)

4 to 9

3 to 10

4 to 6 3 to 10

CHAPTER 4

Hardwired Logic Blocks

This chapter explores the hardwired logic block (HLB) architecture introduced in Chapter 1

with the expectation that it will lead to an improved area-delay product compared to the non-

hardwired 2 to 7-input LUT clustered architectures. R e d 1 that employing logic clusters con-

sisting of 3-10 BLEs and LUT sizes of 4-6 produced the best area-delay results. However, we

have shown that there was a 10-15% increase in the area-delay product when moving from

a 6-input LUT to a 7-input LUT. The increase in area-delay was due to the fact that the LUT

area grows exponentially with LUT size. Our main motivation for explorhg HLB architectures

was to find a coarse grain logic block which had aimost the equivalent logic functionality and

depth properties as a 7-input LUT but without the large area cost. The rest of this chapter will

examine one such architecture, a cascade of two 4-input LUTs, as illustrated in Figure 4.1, and

compare its area and performance to the non-hardwired architectures.

The HLB architecture in Figure 4.1 has several advantages over a 7-input LUT. First, there

is a considerable logic area savings since the 7-LUT requires 2' = 128 SRAM configuration

bits while the cascaded 4-LUTs demands only 2 x 24 = 32 SRAM bits. Secondly, the LUT

46 Chapter 4. Hardwired Logic Blocks

Figure 4.1 : Cascaded 4-LUTs

delay is slightly less since the output of the first LUT is fed directly into the fastest LUT input of

the second stage LUT. This is possible because LUTs are implemented as a tree of transistors,

and therefore the inputs have different deiays. Our goal in this chapter is to determine the area

and delay of the HLB structure shown in Figure 4- 1 and compare it to the regular 4 to 7-input

LUT-based Iogic clusters.

4.1 Hardwired Architecture

In general, HLB architectures consist of one or more fully-interconnected HLBs as illustrated

in Figure 4.2. Each HLB is hardwired in a tree topology such that every LUT within the HLB

c m oniy fanout to at most one other LUT within the sarne HLB. The root BLE in the HLB

is the onIy LUT that c m implement sequential logic since its output can be registered by a D

ff ip-flop. The finer points of an HLB architecture are presented below.

4.1.1 Cluster Inputs (I)

The number of distinct cluster inputs is an important architectural issue that needs to be ad-

dressed for HLBs. As discussed in Chapter 3 for non-hardwired architectures, just enough

inputs were used to produce 98% utilization of the LUTs. However, such high logic block

density is difficult to achieve in an HLB due to the inflexibility imposed by the hardwired

connections. For that reason it is assumed that the cascaded 4-LUT structure, which has 7

inputs, requires the same number of inputs as a 7-LUT. This way, the same relationship that

4.1 Hardwired Architecture 47

Inputs

HLB Contents

1

Clock '-, 1 8 ,

Figure 4.2: Cluster Description (with HLBs)

was determined earlier in Chapter 3 for 98% logic utilization cm be used:

Here, N represents the number of HLBs per cluster and K is the number of input pins

accessible externally from the HLB. For the cascaded 4-LUT, K=7 and therefore,

I = ~ x ( N + ~ ) .

4.1.2 Tapping Buffers

The HLB structure in Figure 4.2 has 7 inputs and one output which can be registered or un-

registered. Notice that only LUT B can have its output accessed externally frorn the cluster

output pins. Since the output of LUT A is hardwired into LUT B, this restricts how and where

subsequent LUTs are placed within the HLB. In Figure 4.2, LUTs A and B may be adjacent to

each other as long as LUT A has a fanout of one and LUT B uses the output of LUT A as one

of its inputs. If this pattern doesn't appear frequently enough in the netlist, many LUTs within

HLBs will be unused. The result is a reduction in logic block density for the HLB-based FPGA.

The use of tapping bufers [Chu941 will help alleviate this problem and improve the density of

HLB-based FPGAs. Tapping buffers allow the LUT outputs within an HLB to be externally


accessed. For exarnple, in Figure 4.3 the output of A can be accessed without propagating it

through LUTs B and C.

Tapping Buffer /

Tapping / Buffer

Figure 4.3: HLB Tapping Buffers

The tapping buffer saves two LUT delays and aIso allows other logic functions to be im-

plemented in LUTs B and C rather than simply prograrnming them to propagate the output of

A and essentially wasting area. The possible benefits of tapping buffers are increased logic

density and improvernents in speed. For these reasons, tapping buffers are included in the HLB

architecture studied in this chapter. Section 4.2.2 provides experirnental justification for this

decision. We should keep in rnind that tapping buffer-enabled HLB architectures lead to more

cluster output pins compared to HLBs without tapping buffers as ilEustrated in Figure 4.4.

Tapping 6u;fFer \

Clock

y-,-- LUT ,m- . ... . . Output I

a) HLB with no tapping buffen

Clock : 1

Figure 4.4: Cornparison of HLBs with and without tapping buffers

4.1 Hardwired Architecture 49

4.1.3 Logical Equivalence of Cluster Outputs

Logical equivalence on the cluster output pins means the router can connect to any one of the

output pins and still reach any of the BLE or HLB outputs in the cluster. This added flexibility

can reduce the track count by up to 50% compared to an architecture where the output pins are

not logically equivalent. For clusters based on the non-tapping structure shown in Figure 4.4(a),

the outputs are logically equivalent because they are identical and the full connectivity of the

inputs allows them to be easily swapped. Figure 4.4(b) shows a tapping-buffer based HLB

architecture and these outputs are not logically equivalent and the outputs of LUTs A and

B cannot be swapped. Output pin logical equivalence for this structure cm be achieved by

creating a full-routing crossbar which connects ail LUT and cluster outputs together as shown

in Figure 4.5. The crossbar contains N N: 1 multiplexers, where N is the number of LUTs in

the cluster. The area cost of this output crossbar multiplexer is relatively small in cornparison

to the total Iogic block area.

Cluster

1 1 1

- LUT f 1 1 )

1 1

CLB

CLB

outputs

outputs

Figure 4.5: HLB Cluster Contents with Full Routing Crossbar for Output Signals


Table 4.1 shows the percentage of the logic block area that this full crossbar routing rnul-

tiplexer occupies for different cluster sizes. Even for large cluster sizes, the logic block area

overhead is less than 8%.

Table 4.1: Percentage of Logic Block Area that is Occupied by Output Routing Crossbar

C l u s t e r Size

1

Multiglexer Area

4.2 HLB Packing

In order to explore tILBs, we need to synthesize circuits into HLB structures. This section

outlines a packing algorithm aimed at hardwired Iogic blocks. However, it should be noted that

there are several different ways of synthesizing circuits into HLBs.

0 The entire HLB can be considered as a coarse grain target for technology mapping dur-

ing logic synthesis. The basic procedure here is to translate the set of boolean equations

that represent the circuit into Iogic gates and then perform technology mapping directly

into the HLB structure. This was the approach taken in [CD941 since it was believed that

technology mapping to HLBs focused more attention on optimizing hardwired connec-

tions which could possibly lead to better results than the alternatives listed below.

4.2 HLB Packhg 51

HLB packing could be perfomed during layout synthesis as a separate step before place-

ment and routing as is done in VFACK [BRM99]. [Chu941 also took this approach to

HLB packing where the circuit would first be mapped to a netlist of basic blocks com-

posed of LUTs and flip-flops. Aftenvards, these LUTs and flip-flops are packed into

HLBs.

0 Finally, packing could be done simultaneously with placement. The placer would have

the ability to rnove BLEs from one cluster to another in order to improve FPGA area

andor delay .

This thesis does HLB packing prior to placement but after logic synthesis, as illustrated

in the flow given in Figure 4.6. Although the direct logic synthesis of gates into HLBs holds

the promise of the best results, this can be very cornplex and we chose the layout synthesis

approach due to time constraints.

Circuit

ech-Independent Logic Synthesis

Technology rnap 1 ta K - w n

JI Measure Area & Delay

Figure 4.6: HLB Packing Flow


4.2.1 HLB Packing Algorithm

The goal of the HLB packer is to minimize some combination of area and delay. The main

priority is to pack highly critical LUTs into the same HLB to take advantage of the "zero-

deiay" hardwired links. If this is not possible then critically related LUTs should be packed

into the sarne cluster. The HLB packing pseudo-code is illustrated in Figure 4.6. The first

several steps in the HLB packing algorithm are identical to the T-VPACK [MBR99] algorithm

described in Chapter 2- First, timing analysis is performed on the unclustered netlist of BLEs.

After timing analysis, the BLE with the highest crïticality ' is chosen as the seed of a new

cluster. This BLE is placed at the root of any of the HLBs, however, it may be shifted within

the HLB later on. With the seed BLE now packed, the next best BLE to pack into the HLB and

cluster is deterrnined. To do this, the attraction function described in Chapter 2 and originally

defined in m R 9 9 ] is used to find the next candidate BLE:

INers(l3) nNers(C) 1 Attraction ( B ) = h - Crit ical i fy(B) + ( 1 - A) - Maflers

The attraction function determines which unclustered BLE B should be added to the current

cluster C. The parameter which controls the tradeoff between area and delay is h. Setting h to

1 results in a packing algorithm focuses only on criticality. Conversely, setting h to O results

in a packing algorithm which tries to maximize input sharing and hence FPGA logic density.

Our work uses a h value of 0.75 as suggested in VEU9]. Once the next BLE (BestBLE) to

be clustered has been determined we attempt to place it into one of the HLBs in the currently

open cluster, At this stage, there are three choices which are evaIuated in order:

1. Place BestBLE in one of the non-empty HLBs within the current cluster, or

2. Place BestBLE into the next empty HLB in the current cluster, or

3. Place BestBLE into the next empty cluster.

' ~ e f e r to Chapter 2 for a definition o f criticality

4.2 HLB Packing 53

Let: UnclusteredBLEs be the set of BLEs not contained in any cluster C be the set of HLBs contained in the current cluster HLBs[i] be the set of HLBs in the cluster num-HLBs be the number of HLBs in the current cluster which are NOT empty N be the maximum number of HLBs in a cluster LogicClusters be the set of cIusters (where each cluster is a set of HLBs)

Unclus teredBLEs = PatternMatchToBLEs (LUTs, Registers); LogicClusters = NULL;

whiIe (UnclusteredBLEs != NULL) { /* More BLEs to cluster */

C = GeMostCriticalBLE (UnclusteredBLEs);

while (ICI < maximum-LUT-capacity) { /* Cluster is not full */ if (BestBLE = NLJLL) /* No BLE can be added to cluster */

break;

BLE-SUCCESSFULLY-ADDED = FALSE; /* initilization */ for ( i d ; i < num-HLBs; i++) {

if (i3estBLE-canbe-inserted-in to-HLB( BestBLE, HLB [il )) { HLBs[i] = HLBs[i] u BestBLE; BLE-SUCCESSFIJLLY-ADDED = TRUE; break;

1 1

if (BLE-SUCCESSFULLY-ADDED = FALSE) ( /* Try pIacing BestBLE into an empty HLB if one exists */ num-HLBs tt; if (Inum-HLBsl> N) {

break; /* No more empty HLB slots */ } else {

HLBs[num_HLB - 11 = insert-into-HLB( BestBLE ); HLBs[num-HLB -11 = HLBs[numHLBs -11 v BestBLE;

1 1

UncIusteredBLEs = UncIus teredBLEs - BestBLE; C = C uHLBs;

1

LogicClusters = LogicClusters u C; 1

Figure 4.7: Pseudo-code for HLB Timing Driven Packing

54 Chapter 4. Hardwired Logic BIocks

This aIgonthm tries packing BestBLE into one of the non-ernpty HLBs to determine if it

has a common net with any of the previously packed BLEs. The BestBLE c m be added to one

of the non-ernpty HLBs if a common net exists, the HLB is NOT full and only one of the LUTs

makes use of the register. BestBLE can still be added to an HLB in which no common net

exists as long as the logic block architecture contains tapping buffers. For example, Figure 4.8

shows a cascaded 4-LUT HLB where LUT F has been placed at the root. Due to the tapping

buffer and the fact that F only used 3 of its 4 LUT inputs, it is now possible to add another LUT

G which has no cornmon connection with LUT F.

Tappùig Buffer

Inputs

Figure 4.8: HLB with Cascaded 4-LUTs and Tapping Buffer

4.2.2 HLB Packing with Tapping Buffers

The example above illustrates how tapping buffers can improve FPGA density. We perfomed

two separate studies which examined the effectiveness of tapping buffer-enabled HLBs for

the cascaded 4-LUT stmcture shown in Figure 4.8. The first HLB architecture contained no

tapping buffers and the second had tapping buffers between al1 the LUTs in the HLBs. The 28

benchmark circuits used in Chapter 3 were synthesized and packed into HLBs. The results in

Table 4.2 show that that tapping buffered architectures had an average logic block utilization

4.3 Ex~erimental Results 55

rate of 88% compared to only 66% for non-tapping buffered architectures. This increased

logïc density lead to 26% fewer clusters on average as shown in Table 4.3.

Table 4-2: HLB Cluster Utilization (with and without tapping buffers)

Table 4.3: Number of Clusters with and without Tapping Buffers

Cluster Size

1

2

3

4

5

6

7

8

9

10

C l u s t e r Sizs

% Utilization (no tapping)

64

63

64

64

65

65

66

66

67

67

4.3 Experimental Results

% Utilization (with tapping)

89

87

88

87

88

87

87

87

88

88

# of Clustrrs (no tapping)

1729

869

573

428

337

280

23 9

207

183

165

This section presents the area and delay results for the cascaded 4-LUTs after HLB packing,

placement and routing through the full timing-driven CAD Aow given in Figure 2.2. This time,

# of Clustars (with tapping)

1233

63 1

4 18

3 15

251

2 10

179

157

13 9

12 5

2~tilization refers to the percentage of LUTs within the cluster that are actuaHy used

56 Chapter 4. Eardwired Logic BIocks

however, the new HLB packing algorithm is used. Although there are many different HLB

structures, due to time constraints only the cascaded 4-LUT configuration was explored. The

20 largest MCNC -911 circuits were used dong with 8 new benchmarks as described in

Chapter 3 and Table 3.3. Throughout this section, the results of the hardwired architecture

in Figure 4-1 will be compared to those of the conventional non-hardwired architecture, A11

results are based on a O. 18 pm CMOS design, including detailed circuit-level design for the

HLB-based logic clusters was perforrned.

4.3.1 Area Results

This section presents the total FPGA area results (logic cluster + extemal routing) of the cas-

caded 4-LUT architecture in relation to the non-hardwired 4 to 7-LUTs shown in Figure 4.gS3

From this, two key observations about the data can be made:

6e+06

5.5ei-06

S e m

4.5e+06

Figure 4.9: Total Area Cornparisons for Hardwired Arch. vs. Non-Hardwired

I a I I I 1 I i

'Hardwired 4-LUT$ - '4-input LUTs' -K- '5-input LUT$ ---.r---

- '6-input LüTs' -a- '7-input LüTs' --=-- '

- ..--- /.--A ---.---- --.----

it--------- -.-.-e---- ---m----'-l.'.---

3e+06

3The results for the small grain 2 and 3-input LUTs are not included due to their poor area-delay product. It is more important to compare the hardwired results with the best non-hardwired architectures.

- -

- =-ytJ

--*.-..-------a 3.Se+06 - - .- __-- _ _ - P m

-%-.=_Zr_

- *-.---=-----------*---------- -*-----------i-

-'.-r::----x ------>C-----X--K-----~~---- -----ic

2.5@+06 - r I I I I 1 t I

1 2 3 4 5 6 7 8 9 10 Cluster Size

t

4.3 Experimentai ResuIts 57

Increasing the cluster size from 4 to 10 in the cascaded CLUT architecture had a greater

negative effect on total area compared to a sirnilar increase for the non-hardwired 4 to

7-LUTs.

The cascaded CLUT requires less total area than the 7-input LUT, as anticipated. Al-

though it is not as efficient as the 4-LUT.

To further understand these observations, we have broken the totd area results into two

components: i) inter-cluster and ii) intra-cluster area. These components are shown in Fig-

ures 4.10 and 4.11. The inter-cluster area results show that the positive effects of clustenng

on reducing routing area are diminished after a certain point. More specifically, increasing the

cluster size beyond 4 has very little impact on reducing inter-cluster area The reason being

that nets were no longer being absorbed within the cluster after this point. If nets are not being

absorbed, then the BLEs attached to their terminais are in separate clusters and hence valuable

routing resources must be used to make the connections. As a result, there was no reduction in

inter-cluster area for any cluster size increase beyond 4.

The other component of the totd area is the intra-cluster area shown in Figure 4.1 1. Here,

the cascaded 4-LUT intra-cluster area demands more area than the 4 and 5-LUT non-hardwired

architectures. The hardwired logic block architecture requires a larger intra-clus ter area for

several reasons:

58 Chapter 4. Hardwired Logîc BIocks

1 2 3 4 5 6 7 8 9 10 Cluster Size

Figure 4.10: Inter-Cluster Area Cornparisons for Hardwired Arch. vs. Non-Hardwired

1. There are now twice as many LUTs per cluster for any given cluster size compared to

the non-hardwired architectures.

2. The intra-cluster rnukiplexer area is larger since there are now more LUTs which trans-

lates into more inputs. AIso, these extra LUTs result in more outputs which require a

full routing crossbar on the outputs to maintain logical equivalence on the cluster output

pins.

3. There is only 88% logic block utilization for the hardwired architecture compared to

98% for the non-hardwired architectures.

Al1 these factors contributed to the increase in total FPGA area of the cascaded 4-LUT ar-

chitecture as the cluster size was increased from 4 to 10. Note, that even though the hardwired

architecture demanded slightly more total area than the 4 to 6-LUTs, there was stitl up to 15%

in area savings compared to the 7-LUTs. Recall that the initial motivation for exploring these

hardwired structures was to reduce the large area requirements of the 7-LUT while still main-


Figure 4.1 1: Intra-Cluster Area Cornparisons for Hardwired Arch. vs. Non-Hardwired

4.5M6

4 W 6

A m

taining its desirable performance characteristics. The results indicate that Our initial predictions

were correct as far as area is concemed. The 15% area reduction could have been better had

it not been for the 5 circuits in the benchmark suite which required significantly more 4-LUTs

than 7-LUTs after technology mapping." A poor mapping would be any circuit that requires

more than two 4-input LUTs for every 7-LUT after technology mapping. Table 4.4 presents

the ratio of 7-LUTs to 4-LUTs after technology mapping for al1 the circuits with the inefficient

ones marked in bold. The determination of why these circuits map inefficiently is left for future

work.

I I I b I ' '~ardwiréd 4-~lJTs' - '4-input LüTs' -K- '5-input ---u--- - %input Luis' -a- - '7-input LUTs' -=-

4.3.2 Delay Results

ài 3.5e+û6 -

- x--------

-

- 1 2 3 4 5 6 7 8 9 10

Cluster Size

The next step is to examine the hardwired logic block performance. Figure 4.12 compares the

critical path delays of the cascaded 4-LUTs to the regular non-hardwired architectures. The

results show that the cascaded 4-LUTs slightly out-perform the pure 4-LUTs but are worse

" e 5 circuits were bigkey, des, ex 10 10, s3S4 17, img-calc


Table 4.4: Cornparison of number of CLUT to 7-LUT blocks after technology mapping

C i r c u i t

bigkey

des

diffeq

ds ip

e l l i p t i c

ex1010

seq

sp la

t s eng

display-chip

hg-calc

hg- in te rp

input-chip

peak-chip

scale l25xhip

scale2-chip

w a r p i n g

than the 5 to 7-LUTs.

Figure

14 I l I I I 1 1 l I 1 2 3 4 5 6 7 8 9 1 O

Cluster Sîze

4-12: Total Critical Path Delay Comparisons for Hardwired Arch. vs. Non-Hardwired

To understand this situation more clearly, we should examine the number of BLEs on the

criticai path and the number of cluster levels on the critical path. The nurnber of BLEs simply

represents the arnount of BLEs in series traversed dong the critical path. Similarly, the number

of clusters represents how many inter-cluster levels that exist on the critical path. There is a

direct relationship between the number of cluster levels and the cntical path delay. Generally,

the more levels there are the larger the delay. This is generally true since in modem FPGAs the

rnajority of the delay is still due to the external cluster routing. Figures 4.13 and 4.14 illustrate

the number of BLEs and cluster levels respectively as a function of LUT and cluster size. The

trend shows that more coarse g a i n logic structures result in fewer BLE and cluster levels on

the critical path. These larger BLEs tend to absorb more connections internally and produce

a fewer number of logic blocks. It can be seen that the cascaded 4-LUTs have more cluster

levels than the 5 to 7-LUTs and this the major reason for the increase in total delay.


1 2 3 4 5 6 7 8 9 10 Cluster Size

Figure 4.13: Number of BLEs on the Criticd Path Comparisons for Hardwired Arch. vs.

Non-Hardwired

Figure 4.14:

10 - I I I I 1 I I C I

'Hardwired 4-LUTsg - '4-input LüTsg -x-- '5-input Lms' ---r--- '&input LüTs' ---a- 7-input Lms' --,--

5 6 Cluster Size

Number of Clusters on the Cnticai Path Comparisons for Hardwired Arch. vs.

Non-Hardwired

4.4 Area-DeIay Results 63

4.4 Area-Delay Results

In FPGA studies, another metric for evduating the quality of an architecture is by measuring its

area-delay product. Since there is always the tradeoff between area and delay, a good architec-

ture is one with a Iow value for area-delay product. This minimum represents the point where

we are sacrificing the Ieast amount of area for the most performance. Table 4.5 shows the area-

delay product for the cascaded 4-LUTs and the non-hardwired architectures (4 to 7-LUTs).

The 7-LUT architecture has the worst area-delay product but interestingly, the hardwired ar-

chitecture does not perforrn any better despite the fact that there was a reduction in the total

FPGA area, Unfortunately, the area reduction was not accompanied by a corresponding reduc-

tion in critical path delay. The improvement in total FPGA area was not enough to compensate

for the slight increase in delay.

Table 4-5: Area-Delay Product Cornparison Between Cascaded 4-LUTs and Non-hardwired

Architectures

Cluster Size 4-LUT 5-LUT

1 0 .098 0 .088

2 0 .084 0,077

3 0 . 0 6 7 0 .064

4 0.063 0 .058

7 -LUT Hardwired 4-LUTs t

64 Chapter 4. Hardwired Lopiic Blocks

4.5 Summary

This chapter exarnined the feasibility of employing hardwired logic blocks and how they com-

pared to the non-hardwired architectures- We have found that the hardwired architecture re-

quired 10-15% less area than the 7-input LUT. However, there was a slight increase in critical

path delay and this offset any positive gains we realized in area. The result was no improvement

in the area-delay product.

CHAPTER 5

Conclusions and Future Work

5.1 Summary and Contributions

The goal of this thesis was to study the impact of different logic block architectures on FPGA

density and performance. In chapter 3 we studied the effect of LUT and cluster size on FPGA

area and speed. It was found that in terrns of area, using LUT sizes of 4-5 and clusters of 4-9

were best- Also, a LUT size of 7 and chsters of 3-10 produced the lowest average criticd path

delay. However, for the area-defay product, LUT sizes of 4-6 and clusters of 3-20 were best.

Another important contribution from Chapter 3 was an expression for the number of distinct

cluster inputs (I) as a function of both LUT size (K) and cluster size (N). It had been previously

shown in PR971 [BR981 that when K=4 and 1 is set to 2N + 2 then 98% of al1 the 4-LUTs

would be utilized. We found a more generaI expression for the 98% logic block utilization:

66 Chapter 5, Conclusions and Future Work

Chapter 4 introduced a new HLB architecture and packing algorithm. We were specifically

interested in the cascaded 4-LUT hardwired logic blocks. Our motivation for exploring such a

hardwired architecture was that it may lead to an improvement in area-delay when compared

to the non-hardwired logic bIocks. However, we found that the use of the cascaded 4-LUT

architecture did not improve our area-delay product. In fact, the hardwired area-delay results

were quite similar to the 7-LUT logic block architectures.

5.2 Future Work

In the future it would be interesting to study the effect of larger LUT sizes (> 7) and larger

cluster sizes (> 10) on FPGA area and performance. Also, our HLB study focused strictly on

the hardwired 4-LUTs. However, the behaviour of different hardwired architectures such as

those shown in Figure 5.1 would be an interesting topic.

r - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - .

' " p u q - E q

I - I

, I

Figure 5.1 : Various HLB architectures

Finally, our packing algorithm was based on a layout synthesis approach where we first

technology mapped to K-LUTs and then packed these logic elements to form HLBs. Another

method that could possibly lead to better area and delay results would be to perform the HLB

mapping at the logic synthesis level. That is, treat the entire HLB as a coarse grain target for

technology mapping.

Total Area

Table A. 1: Total Area (x 106) in Min. Width Trans. Area (Cluster Size = 1)

68 Appendix A. Total Area

Table A.2: Total Area (x 1o6) in Min. Width Trans. Area (Cluster Size = 2)

Table A.3: Total Area ( x 1o6) in Min. Width Trans. Area (Cluster Size = 3)

"' du4

~pr-i

bigkcy clma d a diffq

dsip elliptic ~ 1 0 1 1 1

&P f r k misa3

pdc fi8

.dx4 17 s385Ki

scll sph

mn6 rlisp1;iyrhip imgnlc imginurrp inpuuhip

pcalifhip scaicl2Srhip scalc3chip

Ccom Avg.

3.156 3.7m 2754 3 8 7 18.706 267 1 12-U L I I Y I

5.775 9 5 6 8

220 1 6.933 1950 13.022 3.475 14056 12169 3.710 to.4n 1 -450

4.1110 36.277 5593 1 .6W 1.024 5.736 2 i25 2.326 4.475

Circuit

duJ

a m

a m

"iSkey clma Jcs difrq h i p clliptic crlOI0

&P frisc mircr3

pllE $98 s3R117 s385M

sy S P ~ J

w n g dirphyrhip img-nlc imginurp inpurship pmkxhip d e l 3 ~ h i p d L c h i p

w i n s Ccom Avg.

LUT Size

Table A.4: Total Area (x lo6) in Min. Width Trans. Area (Cluster Size = 4)

Circuit

Table AS: Total Area ( x lo6) in Min. Width Trans. Area (Cluster Size = 5)

Appendix A. Total Area

Table A.6: Total Area (x 106) in Min. Width Trans. Area (Cluster Size = 6 )

Table A.7: Total Area (x 1 0 ~ ) in Min. Width Trans. Area (Cluster Size = 7)

Circuit

a l d

V=Q J F ~

higky c l m Jcr di f fq rlrip clliptic cxl0lO

-3-J frisc miscd

pdc s2Yn r3îX17 s38SS4

seq sph

k'%

dirpbyrhip img-nlc imfinicrp inpuuhip

-"il' sdc l Z r h i p d & h i p

w i n a Gnim. Avg.

Table A.8: Total Area (x lo6) in Min. Width Trans. Area (Cluster Size = 8)

Circuit

du4

.w apura b i g w clma des difky &il: clliptic a l O l 0

cxsp TriSc m i s a 3

pdc s38 s3XJ 17 385W

seq spla

mnl: disphyrhip i m g a l c imginrcrp inpuuhip W h i p . d c l l S r h i p s d e h i p

v i n g Zcom Avg

Table A.9: Total Area (x 106) in Min. Width Trans. Area (Cluster Size = 9)

Appendur A. Total Area

du4

a w

biEl;cy

dm dcr Jiffkq

&ip

rlliptic cxlOL0

C ~ P rrix mis& * s2Y 8

d M 1 7 s385M

Sal spla

wns Jkphyrhip

i m g d c

imgïntcrp inpuuhip e h i p scaicl25rhip sraielEhip w ~ p i n g Cmm Avg.

LUT Si

Table A. 10: Total Area ( x 106) in Min. Width Trans. Area (Cluster Size = 10)

Intra-Cluster (Logic) Area

7 - o-.m 033:

023 1

0313 1.4w

0305 O268

0764

0575

O.RI2 0.187

0.632

0268

0.883

0.449

1.434

1312

03OY

0.78 1

0. 195 0302

LY73

0.691

0.234

0 . 3 5

0-721 02x4

0.35 OAM

Table B. 1 : Intra-Cluster Area (x 1o6) in Min. Width Trans. Area (Cluster Size = 1)

74 Appendix B. Intra-Cluster (Logic) Area

Table B.2: Intra-Cluster Area (x lo6) in Min. Width Trans. Area (Cluster Size = 2)

1

Table B.3: Intra-Cluster Area (x lo6) in Min. Width Trans. Area (Cluster Size = 3)

- 7

LOU > x w

1.813 0983

1 1.813 1.059

1.695

2 8 8 1

J.788

3983

I l l 2

538 1 2.237 63X1

'432

6 567

7398

2576

5.703 1.8W '+iO 7-FI l 3.031

LLhY

I.051

3.830 1 . w 1.822

2785

Table B.4: Intra-Cluster Area ( x lo6) in Min. Width Tram Area (Cluster Size = 4)

Circuit 1 L m Sue 7

du4 0.438

a m 0508

a-4 0355 higkcy 0.474 clma 2265 des 0.463

JiiTcq 0.408 dtip 0.SO 1

clliptic 0.873

a1010 1 3 8

d p 0.2911 CriSc 0.959

misu3 0.5 I 1

PJc 1.35 1 s 3 x o. m s3Ki 17 2.171

~nsiu 1.981

serl 0.474 sph I.IY.4

'-=Y! 0-r) Jisplîyrhip 0.760

i r n g d c 4508 imginicrp I.M8 inpuuhip 0.355

pc3kchip O385 s d c l 2 5 r h i p 1.086

sealcZchip 0.42X

Table B.5: Intra-Cluster Area ( x 1o6) in Min. Width Trans. Area (Cluster Size = 5)

76 Appendix B. Intra-Clus ter (Logic) Area

du4

~prZ

Jpn4 bi- clma dcr

d i i l q ùsip

ellipiic a l 0 1 0

CX~P frire

misu3

PJc 13s ~38417

dR58J

rq

-l: Jïsplayrhip

imgnlc imginlcrl, inpuhhip

e h i p

d c l 3 x h i p d r h i p

x ~ p i w Z e o m A v s

LUT s i 2 l 3 l 4 l 5 1 6 1 7

Table B.6: Intra-Cluster Area (x 106) in Min. Width Trans. Area (Cluster Size = 6)

1

0 5 5 1

IF-( 0385 higkcy 0.512

clma 2.453 Jcs 0.50 1

d i i lq 0.512 dsip 0.435 clliptic u s 2 u l O l 0 1.406

d1, 03 15 lrix 1.037 r n w 0.515

pdc 1 .-am 0.7U

D W l 7 1-35? LI47

=l 05 14 spls 1.90 w n g 0320

displayrhip 0.825 i m g a l c 4.88 1 img~nrcrp t.133

inpuuhip 0383

p a l d i i p 0.418 w i c l S r h i p 1.179 lolc'rhip 0.465


Circuit

JTuJ

Jpcrj bigkcy clms dcs d i f f q dsip elliptic a l l ) l U

&P rriv m i x x )

pdc s 3 8

OR1 17 s38SRi

=Y sptr

-E disphyrhip i m g d c imginlcrp inpu~chip

W h i P s d c l 3 r h i p d c l c h i p uvping

Geom Avg.

Table B.8: Intra-Cluster Area ( x 106) in Min. Width Trans. Area (Cluster Size = 8)

Circuit I LUT Si

d a diffcq Lvip eüiptic a i 0 10

=% rrirc m i s c d

pdc 5298 d m 1 7 s385XJ

scq $pl3 '=w diplnyrhip i m g d c imginicrp inputrhip p z k c h i p scdcl2Lchip s c d f i h i p


78 Appendu B. Intra-Cluster (Logic) Area

Table B. 10: Intra-Cluster Area (x 106) in Min. Width Trans. Area (Cluster Size = 10)

Inter-Cluster (Routing) Area

Circuit

Table C. 1 : Inter-Cluster Area ( x l 06) in Min. Width Tram Area (Cluster Size = 1)

80 Appendix C . Inter-Cluster (Routing) Area

4 du4 3.107 LX43

J@ 3142 3369

3 ~ ~ x 4 iMO 2355

bigky 2.492 2087 clma 17513 162llS dcs 1394 '089 dilfq LW1 1.734)

6ip LM7 1.MiR clliptic 51<K 5.UY3 crlUl0 8.052 10.658

-a' 20x7 'S fnSe 6.837 6.061

misa3 LW 25w pdc 12.133 10323 s29R 3.488 3.1 IO s3R117 17727 11.780 d85M I l ln9 10727

Scq 3362 3.1N spIa 9.609 9.40N)

-€ 13M 1.058 dispbyrhip 3.863 3.159 i rngdc 3 5 x 2 23.932 i n 5.459 4.1CK> 5pauhip 1556 i.1-U pcskchip IYOI 1522 dc l25 rh ip 5.442 1036

.dclchip 7IoX 1.459

w~ping 2-188 1.812

Cmm Avg. 5.231 3.678

Table C.2: Inter-Cluster Area ( x 106) in Min. Width Trans. Area (Cluster Size = 2)

Circuit

du4 3 p C e

bigkcy clma

des diffq &ip

ellipuc ulOlO

~XSP f N c

rniscx3

pdc s298 s3M 17 s385R)

x<l sph txng diploy~hip i m g d c irnginurp inpuuhip

pclkship dc l25 rh ip scdchhip warping

Gmm Avg

LU 4

1 .Y62 2.577 I.x(n I52Y IL727 1367 1365 1396 4.479 6.623 1.619 5332 1923 9.070 IYN 7251

7278 L3Y2 6.779 1.00Y 1542 13335 2393 0.6 16 O589 2242 0911 l.lJ0

lSZ9

Table C.3: Inter-Cluster Area (x 1o6) in Min. Wtdth Trans. Area (Cluster Size = 3)

Yr9-1

LZXD 6W.O H6.l Y w u Y L i u OLVl

8 L f S oo1-1

Z l t ' l Wïo'F C l K l

S E 5 t OX6f Y s r I 6 K S o z 1 8W'S WY'O L K E Y tL'E Z S f 1 tZU'1 lWP'0

t l Z S I 1 or-O s o r - I YLOY

619'1

two YYY'O YS61

O1 f'0 hSr-O

EXL'l YEYL L W 1 ZSL'O XSO'S h W 1 LE6't f Y 5 E WC I Ifil'Y 81 r l E 0 . S L Z 6 0 ;YI-S E S K 61 iT0 Citfi'U E Et.0

YIY'OI o ü 5 0 fo r - l LS6- 1

s06-1

WY'O L W 0 son-1

Wf 'O WO

Y-JI h f GL I O 5 1 MY' O

LZl'S OZOL

LtY'S 1 XY'S 5 9 5 1 u1.L-Y LZS- 1 L i S f (X)f-1 hSL'S 51Y'C SLY'O L X 1 151-1

fLf iOl Y t S O rOL-1 1127:

5960 L t C O 6 E 6 1 1 f 50 %O

m l ESL'OI X X i l I r m i rrr3 OYIY SEY'Y 861 -9 1 5 5 1 S88'L OSY'1 SES'; l E f l W C 5 S f t ' t t 1 0 - I I X f l rFz -1

61 1.1 1 691-1 E H ' I

K t i

U.97 SLTC *AV II-3

K I EIY'I Z u ! h

5660 M T 1 J!v~;.P LW? LZKC ~!VSZI~F= x u s 1 I r 1 d!q@ 5.58'0 K i 1 d!qr>ndu!

OXWSE Kt hu!-3ur!

Y W 8 1 Yf5Y; a p 3 y LZG1 OLYi J!ilrbhp HOB'O 6ti)'t Suxi

Y r n t x qds 8 8 E 7 Z t Y 9 hx

X68'L EPL'6 t i l S 8 f ~ SfL-X XYS'II L I M I F S h C I 1 f l L XKS OYbL ZhX'h >Pd S t i l MfT (HL-t Ytl-S J N J

ZhY'1 ZLY'I d~ M t 8 EXL'L 0101rJ M E SSX-t J ~ ~ ! I P i n ] - I LW-1 d!sl> I W l LhKI 'W!P h f t ' l IXL-1 '7'

m ' E l Ytl 'Sl r m p

S r 5 1 CM-1 hq;7!q IE8'1 E l 1 7 t d c I n n CiIKi7:

I8Y-1

Y W O EIY'O 6 N I 1 Zr-0

xwa '3511'1 C W L S u S I 1 -do f a - 5 OZ6-1 VI f's YlY'E K t ' l 9EI'Y osf l 610-5 YYao

L t T S H O - 5 S I 5 0 t86-O +a--O

r s 1 1 S 8 t U h m - 1 t W 7 EYI'I

Y

6WC

EIL'I 8 1 5 1

O W t 6 L r 1 SLTI 8 L f ' t

BLO'OE W E W l 8 L f 8 8 n a T

StS'OI LIU71 1%

M l ü l M t 7 689's 6tL-1 6t 1 -8

g - x t ZYf l MY'I 8l-L-1

ILO'SI E0X.l 0 8 1 7 LSll't M t Z

L

82 Appendix C. Inter-Cluster (Routing) Area


LUT Site


Table C.8: Inter-Cluster Area (x 106) in Min. Wixdth Trans. Area (Cluster Size = 8)

1

2545 2167 1.M1 1.680 1558 1.165

1 3 5 11.ü7.l 1551 1.196 1.47R 1. lM ln81 0.8s

4.180 3551 6.633 7.7M 1.581 1576 4.7G 2.1113 1.651 8 . m 6.852 1.824 1.436 YSYY 7.032 8.771 7.051 2307 22ûY 6923 5385 0.YM 0.753 -3M 1.716

21.870 15219 3.242 2.751

0.931 0.7OY 1.035 O.!Jlh

3.236 2357 Y O.RIO

Table C.9: Inter-Cluster Area (x 106) in Min. Width Trans. Area (Cluster Size = 9)

84 Appendk C. Inter-Cluster (Routing) Area

Table C . 10: Inter-Cluster Area ( x 1o6) in Min. W~dth Trans. Area (Cluster Size = IO)

APPENDIX D

FPGA Channel Width

dlLJ

3 6

aPCX1 bigkcy clma dcs Jiffcq k i p elliptic a 1 0 1 0

d r frix m k d

Fdc s 3 x OM17 JXSRI

sEq

spL

'=="E diplayship i m g d c imginrcrp inputchip w h i p d c l 5 r h i p d c L c h i p warping C m n Avg

Table D. 1 : Channel Width (Cluster Size = 1)

86 Appendix D. FPGA Channel Width

Circuit

du4

a m bi- c h a

des diffcq

h i p clliptic ~ l n l n &P fri5c misa3

pdc s3X J I U 1 7 s385KI

=l spls

'=ng dlrphyrhip

i m m c img-in-

inpuuhip

&hir sdc l -S rh ip rcJl&hip

" p i w C m m Avg.

Table D.2: Channel Width (Cluster Size = 2)

Circuit


Ciraiit

du4

~ p r Z apa4 biSr;y cfma des diffcq &ip clliptic a l 0 1 0

&P r r k miscd

PJc a n d i U 1 7 s385X-I

Yq splr

LS;C"E

d i sphy~h ip i m g n l c imginrnp inpuuhip

D"lcrh i~ d c I 2 5 r . i i p s d c l c h i p

wlrpiw Ccom AVG

Table D.4: Channel Width (Cluster Size = 4) Circuit LUT Size

Table D.5: ChanneI Width (Cluster Size = 5)

Appendix D. FPGA Channel Width

Circuit

Table D.6: Channel Width (Cluster Size = 6 )

I LUT! l a 1

d d

V-' bigkey clmn dcr

diffcq

h i p clliprie alOLO

I& mircl3

pdc s 3 x s3Ki17 J 8 5 P '

rq

s p h WnE d i s p h y ~ h i p i m g s d c imginrrrp inpuuhip

-hi@ s n l c l 2 5 s h i p d c 2 r h i p

\ Y Y D ~ ~ Ceom Avg.

Circuit

d d

a p a l

npct? bigkey clma des d i m k i p cIlipuc orI0lO

&P frix miscx.3

pdc s 3 8 s3W 17 s385W

scq s p h LWnE duphyrhip i m g d c irnginlcrp inpurchip

scdc 1 S r h i p d - h i p

w;rrping G n m Avg.

I 1

16 51 52 3 62 32 33 32 57 58 5J 54 49 69 3 51 JY P,

65 28 30 63 34 Z5 26 32 16 33

4 0 3

Table D.7: Channel Width (CIuster Size = 7)

Circuit

al&

aP=J big- dma dcs

dicq dsip

dlipric

a1010

d p CrLIc miscd

PJc mu s3IU 17 s385x.t

rai spb

'="s d i h y x h i p

i m u c irngïnlcrp

inpuuhip

W h i f d c l 2 5 ~ h i p

d c 7 h i p

v i n s k l 8 l L AVG

55 52 58 cir 63 67 32 36

71 K1 39 3R

38 42

32 32 60 63

67 <X)

58 72 59 73 Ml 59

n a8

3 0 33

59 5s

54 56

59 M

75 89 29 37

33 36

65 63

39 39

29 30 29 33

36 36

30 30

34 33

463 499

Table D.8: Channel Width (Cluster Size = 8) Circuit 1 LUI s i

2 3

al114 56 11

62 O5

;il=4 62 OX bigkcy 39 32 clma 75 RI

des J I 38

diffcq 43 Ji

dsip 37 30 cllipiic h l 64

cx1010 69 95 cCI 76

frise iCZ 76

miscd 62 611

pLic Rf l Yu s38 32 3J s3IU17 fia 58

885iU 56 62

sec( 62 71

sp t 78 8R

-6 Y 39 Jisplayrhip 37 37

imgralc 69 67

imfinurp 39 311 inpurchip 29 13

p k h i p 30 33 .suIcl25rhip 37 37

sde lch ip 32 3 0


90 Appendix D. FPGA Channel Width

Circuit

duj

3w apura big- clma dEs difkq b i p ctliptic a m n &P rr*c m i s c d

pdc sBR s3R117 dMRi

scq s p h Lrng

dirplayrhip irngnlc imginicrp inpuuhip

W h i r d c l25rhip xslc?chip

w r ~ i n g Gmm Avg.

- 7 - -

64)

67 67 43 80

4 2 41

42

M

76 M 67 f i l

86

311 63

56

65 80 33 Y 68

J1 30 31 37 32 37 -

50.6 -

Table D. IO: Channel Width (Cluster Size = 10)

Total Critical Path Delay

Table E. 1 : Total Delay in nano-seconds (Cluster Size = 1)

Circuit 1 LUT S i

e ~pcr4 bigkcy elma dcs di f fq Jsip clliptic ex1010

C ~ P fri5c m iscx3

pcic 5 3 8 LIU17 s385U

sal

PL -6 displzyrhip i m g d c

imginbxp inpuuhip W h i p r P l c I 3 r h i p d c h h i p

Table E.2: Total Delay in nano-seconds (Cluster Size = 2)

Circuit

-

- C


Cimit

du4

a F a 2 ni-" hi*

c l m d a diffq &ip cliipüc cx1(110

&P rrisc

m i s a 3 pdc -8

s3R< 17 s385RI

uy S P ~

N n F display~hip i m m c im ginicrp inpuuhip

W h i ~ scalc l?Srhip Xa lLch ip

" ~ p i n g Cmm Avk


Circuit

du4

J@

higky cfma dcr d i w dsip clliptic a l o l n

d r rrisc mi sa3

pdc s38

s3R117

s385Ki

secl sph

W"€ displayrhip i m g d c irnginrcrp inpuuhip

pcl lcrhi~ scdc l2Srhip d c l c h i p

v i n a Cmm Avg.

Table ES: Total Delay in nano-seconds (Cluster Size = 5)

94 Appendix E. Total Critical Path Delay

SW 24.17 3 ~ 4 2 x 5 1 bis* 1 1.63 clma 31.16 dec 16.72 diffq 2752

dsip 101)0 cllipuç 36.72 otI0lO 3 m

&P 19.87 frirs 5L.Q m a 19.00

pde 39 .w s 3 8 43 5 5 5.W 17 2s. 18 s3R5nJ 1931

rq z1.m spli 9 - 5 6

WnE 30.W disphyrhip 35.71 imgnlc 8538 imginrcrp 4-31 inpuuhip 33.71

pal;chip 3x25 .scdcl25rhip 49.86 safcLchip 3119

Circuit 2

al& 19.46

-

TabIe E.6: Total Delay in nano-seconds (Cluster Size = 6)


Circuit

du4

bidrcy

clmn des diffcq

I i p

cllipiic a l010

d p frise

m i s d i

pdc s29 8

s3R117 s385U-t

='l sph

'="' disphyrhip i m w c imginurp inpuuhip

s d c l 2 S s h i p solchhip

\rupin€ Ceom Av&


LUT sur


96 Appendix E. Total Critical Path Delay

LUT Sue

Table E. 10: Total Delay in nano-seconds (Cluster Size = 10)

Intra-Cluster (Logic) Delay

Table F. 1: Intra-Cluster Delay in nano-seconds (Cluster Size = 1)

Table F.2: Intra-Cluster Delay in nano-seconds (Cluster Size = 2)



Circuit

du4

JW apex4 biglrry clma Jcs difïcq dsip ellipuc crl010

frire m k x 3

pdc s 3 8

J W 1 7 38511-1

snI spln

-6 duphyrhip i r n g d c imginicrp inpuuhip p k c h i p d c l 2 S r h i p s d c L c h i p

warping Ceom Avg.


8'9 - LZ'S Lsh

XX'fl IL'X

W.01 1 CS

U U L L56 1 f-x Yt-s r c f 9t-S MY W'U 1 t f - Y

DY-t WU1

tL-f ZEY K Y

697 1 LX IR'+ L56 6Y L W' t PL-f Y t Y

69'9

LY't -. LL 11

hG»I WL WL -. LC x

t r s 1 W L W'L 1 t'L ES'Y IVL

W'L U I I C Y tirs K I 1 tirs ELY

65-01 o n W'L YY'C 6601

u n ws r i 9 E .9

Ln9 - XYL

51-1 1 18'tl XY'h 12'8 11'11 I r61 S68 IIY'b a's S6E XB'Y 1 Z'X 896 51.9 SfiE

STfI 8Y'V Cr3 l Zn S E 568 X Y ' t 89'6 S E S6f t f S Zr-s - - t

E r 9

tirs 80-8 Ur6 XUY

80'8 XISX

YI'S1 WS 80'8 LfY YrS LrY nss 11-11 Lf-Y 9 c s

lTE1

YTS v r s WL CITE YO'L r r f WX

tEO'E YCS YfS LrY

LJY

- 6- - OS? stx KU1 OX'Y OX'Y

SY'S w r l OB'Y S6.L UTL wr UCL SYS a v i 1 SITY W t Uf l l 06't s u 9 sr0 1 OtZ OB'Y SL-f

5ro1 O L Z

ED'Y SO'Y

6tP

Kt S6L mx 56L SGL snL 16f1 S6L S6L LTS LTS LiTY Sh'Y f6'01 9z-L Lz-s -aï7 1

L r s LrY S6L 867

Y6-S STE SGY 8m &-S Lc.;

- E6-n -

KOI m i 6Y61 SSY 1 LY'tl Z6Sl U'ff W t l hLL1 K S 8YY YTY SV'h

m'ri Ys-Y X Y ' t

n x u

1 CS Y59 f x t l Xff Y171 1 CS

Y171 1 0't

LWY M S f f f - -

E

- 1011 - Yh'EI 6C17 8Y'YZ t f - f i K 6 1 zL'ti: SL-sr L E K f Ï I t1-L 19's 91.9 ED'I 1 Ers1 z1-x YI'Y

6SOE LU'S fY'L Et: LY'P

0691 SY'Y 6rLI YI3 91-9 rl-L SY'Y

C L - P5s

w'01 6 S t l Y 1%

f i 0 1 91.6 CtïOZ Y1 -6

91'6 £9'9 CB't EY'Y S f L

Lh'Ul EY'Y ZB't W C 1 art art Lh-01 cx7 SC L ZX't W'01 EX7 ZCS ZCS ZS't - s ==!S

669

I B ' t XfX

C601 Yf'L YCL xf'n N'Li 9TL YirL It'Y IYY W L Yz-L

ES01 W'L 1FY

S l i l 61'5 I f Y

E6Ol L r T Yi-.?. LGf

Eh'Ol L C i 61's W ' f IYY

- 166 -

C f 1 1 OB'tl Z t Z YirYI 81-Y1 SY'61 50'6E 88'9 1 OX'tl E L S I T CS9

W O I 01't l ES4 Y f f EO'IL til's f KY

fit's1 1 Cf If'CI m-s IYCI I CC S I 3 E L SI -s - -

C -

SL-Y

OCS XfX K6

Xt'X n t % Xf'X

0 6 5 1 ors nt's YY'Y ws YY'Y Zt.L

YY'II OY-s OYY

CL71 W S DY'S xf'x XI'E ZYL XYE XP'X 81'C OY'S W ' S W'S

LIY

ZO'LS 3l"3ui!

IKYZ &vAcldnr L I ' G Sum 86.8 q d r ZX'L l u s

1 L'DI tXSXEs 50'1 i L I W P - rc XI P(jLs Y 5 6 VJ ZB'L P=W'

ETLE -J ZKL d p O f 8 o l o l n

OX'hZ ~nJ!lla 1 6f J!V

01-1; h;?U!r fZ'L np -. cr Xl c u i p SL'E ~ Z ! Y 811'9 t d c

8t ïB 7Y3dc CB'L VI= -

W 6 - 2 6 0 1 Z 6 t I 6 S t Z h 5 L 1 65S1 -aï31 (WLC YTYI Z K 1 [KY EYT 1 x 9 64-6 - ï . 1 OTY EY'S

Y r x l EYT Y69 K r 1 Kt

6 5 E l EU'S

-lx71 XSC Y 6 b 0CY EY'S - -

C -

mi - W Y 1

E C S i t X L f Xz-(iZ S S E Y t'Of K X S OI 'K U75 L1'6 1 S'Y LI-h

hX'El ZY'XI 5 6 0 1 ZY'S s o c 08'9 M L

9t'UC 1 0 ' s

X l ï X 0t'L

W S 1 ZB'E 1 B'Y x r x W L - - - -

TOT

102 Appendix F. Intra-Cluster (Logic) Delay

Circuit LLmSnc 7 m

Table F. 10: Intra-Cluster Delay in nano-seconds (Cluster Size = 10)

Inter-Cluster (Routing) Delay

Table G. 1 : Inter-Cluster Delay in nano-seconds (Cluster Size = 1)

Table G.4: Inter-Cluster Delay in nano-seconds (Cluster Size = 4)


Appendix G. Inter-Cluster (Routing) DeIay

araut LU 2 3 1 4

sld i r a i 14.12 1 1206

-


d u 4 apex2

=l=4 hiçkcy

c l m a da di- c k i p c l l ipt ic a 1 O l O

dl' ENc misa3

pic d Y 8 d 1 U t7 J R 5 X - I

scq sFh ="E dirpiayxbip i m g d c hgintcrp

imputchip W h i p . t n l c l Z S s h i p r c l l c ' h i p

-Tins Gean Av&


Circuit

a l d

a F a Z JFJ

hi- clmn Jcr d i w ùsip clliptic a l010

=% rriv rniux3

piic fi8

.Ci%% 17 s385nI

SCq

S~IJ

&"E dkplayrhip im@c irngïnrnp inpuuhip

W h i p sdcl75rhip d & h i p

w i n f Ccom Avg.



108 Appendix G. Inter-Cluster @muting) Delay

Cirmit I LUT S i i 2

duJ 2026

a w I272 18-79

bigkey 6.12 clmz 3 . 7 2 dn I 1 2 3 diffq 1031 Qip 6.11 elliptic 7.25 u m t o 24.02

= % 14.12 frir 15.15 m i s u 3 I451

pdc 0 . 8 5 s2Y 8 1135 s3W17 23.18 J85X-I 7.07

=I 11.41 sph 21.74

W"E 5 9 5 dtphyrhip 8.75 i m g d c 29.34 imginwrp 13.67 inpurrhip 11-42

pcz);chip 1 1 3 1 de l2S -ch ip I333 s e a l a h i p 9 3 2

Table G. 10: Inter-CIuster Delay in nano-seconds (Cluster Size = 103

Number of BLE Levels on Critical Path

I c i ~ i t I LUT S L ~ 1

Table H. 1 : Number of BLEs on Critical Path (Cluster Size = 1)

110 Appendix H. Nurnber of BLE Levels on Criticai Path

Table H.2: Number of BLEs on Critical Path (Cluster Size = 2)

Circuit

du4

Wcfi

big- C I ~ J

da

di-

dsip ellipuc cxlOl l l

,=SrJ f r k m w

@ sBX JIW 17

3RSM

Y''.= '="' disphy-chip

i r n g a l c irngintcrp

inpuurhip

pkrm d c l T 5 r h i p

. d c L c h i p

wyping CcomAv\ve

1 Circuit

7 - 12 13 12

ID

n 12

37 Y

51

15 12

62

Il

15 32

24 IR 13

15 J2 16

103 59

43

52

62

43

2.1.61

Table H.3: Number of BLEs on Cntical Path (Cluster Size = 3)

2

15

13 10

Y

3s 10

33 K

U

15

13 55 1 1

14

32 20

13

Il 15 37

43 $4

50

47

58

41 26

~ 9 3

3

7 9

7

5 12

Y

20 6

22 10 8

3 7 10

21

15 14 7

1 0 M 26

53 23

3 25 33

9

14.06

3

6

10

7

5

IX 9

17

6

19

10 X

30 7 Y 20 14 4

6

10 19

25 48

18

3 3 32 -3 14

1 3 3

LVfSizt

14

14

15

7

M

Y

15 7

2

8

II 15

Z7

1 3 9 11

15

I n 15

2 X l 3 Y 7 5 ,

9.16

LUT

R

I L

14

15 7

19

15

10

8

13 IJ

27

12

15

20 14 9

9-43

13

Y

12

6

7 6 5 4 16

5 5 4 5 7

13

d

7 6 5 5 7 10

11

20

12 10

16

11

723

S i

7

12

6 5 3 3 R

3 3 3 1 11 7

7 5 5 4 15

6 6 5 5 7 7 6 5

II

9 8 7 6 5

5 5 5 5 5 10

Il

25

Y

t l l ? 8 6 11 16

II 6

7.217

4 5 6 7

7 6 5 5 8 7 6 5

6 5 5 4

3 3 2 2 10

6 3 3 3

R

3 3 2 1 10

6

14

6

11

6 7 6 6

5 R

8

15

R 7 8 8

Io X

637

Y

6

9

6

11

6

In

6

4

7

6

11

6

6

9

7

<.CI

4 5 6 7

7 6 6 5

6 6 5 5

3 3 3 2 8

6

10 6

1'

IO

7

6 X

8

16 8

8

IO 8

4 4 (5 .~8

6 6

8

fi

R

5

10

y 6

5 7

6

13 7

6 Y

7

5.-

imginiap inpuuhip paLchip d c l 2 S x h i p d & h i p

" i n f 1 PM Ccom A v c

Table H.4: Number of BLEs on Criticd Path (Cluster Size = 4)

a l 4

am apcxJ hi-

clma des d i fkq drip cllipric a 1 0 1 1 ~

=% fnsc miscd

pdc sZYX

sJIU17 ~385%

scq SpkI

'="' duplayrhip i m g a l c imginvrp inpuuhip

WP scalc 1 X r h i p scalckhip

~ y p i n ~ Cmm AVG

LUT Sue

Table H.5: Number of BLEs on Critical Path (Cluster Size = 5 )

112 Appendix H. Number of BLE Levels on Critical Path

Table H.6:

Table H.7: Number of BLEs on Critical Path (Cluster Size = 7)

Table H.8: Nurnber of BLEs on Critical Path (Cluster Size = 8 )

Table H.9: Number oFBLEs on Critical Path (Cluster Size = 9)

114 Appendix H. Number of BLE LeveIs on Critical Path

LUT S i

Table H. 10: Number of BLEs on Critical Path (Cluster Size = 10)

Numlber of Cluster Levels on Critical Path

img-inurp 51

inpuuhip 18

I'&hi~ 5 1 s d c 12.5-chip 59 d e l c h i p 56

Table 1.1 : Number of Clusters on Critical Path (Cluster Size = 1)

116 Appendix 1. Number of Cluster Levels on Critical Path

Circuit Li ï ï Si

n

Table L2: Nurnber of Clusters on Cntical Path (Cluster Size = 2)

1

Table 1.3: Number of Clusters on Critical Path (Cluster Size = 3)

Table 1.4: Number of Clusters on Criticai Path (Cluster Size = 4)

Table 1.5: Number of Clusters on Critical Path (CIuster Size = 5)

118 Appendix 1. Number of Cluster Levels on Critical Path

Circuit I LU1 I

a p u l 7

al-4 7

b i 6 b 3 clma IR Jts 8 d i f fq 17 ûsip J

cllipuc 17

~ 1 0 1 0 R

&P 7 (ME 3

misa3 7

pdc 7 z l l ) R 16

s3M 17 LJ

JXSRI Y

rq Y spla 1 0

-S 16 dirphyrhip 20

i m g n i c 48 imginicrp 28 inpu~chip 23

pc;rkrhir 22 sa Ic l25 rh ip 35 d c l c h i p 16

Table 1.6: Number of CIusters on Critical Path (Cluster Size = 6) Circuit 1 LUTSUC

f 3

s l d 10 ri

JW 7 5 apurJ 7 6 higky 4 2

clma 15 11

del Ci 7 diiïcq 14 14 rlsip 5 3

ciliptic 11 Y

ex1010 R 5 t 5

(ME E 1 1

misa3 7 6

p<lc R 6

s 3 X 15 10 ~3x417 13 13 ~385x4 10 7

Sal X 6 spla 7 7

'.5="' 16 11 displayship 11 13

i m g d c J 5 41 imginlcrp 24 13 inpukhip 23 13

W h i p 20 13 sc J l c l2S~h ip 20 IR suic2-ship 15 12

Table 1.7: Number of Clusters on Critical Path (Cluster Size = 7)

Table 1.8: Number of Clusters on Critical Path (Cluster

LUT S i

S ize

I"ls

I 4 ' 3 12 6

5

3 8 5 6 16 J 4

IJ X

7 4

5 Y

8

1 1 10

4 10

12 IO

5 655

Table 1.9: Number of Clusters on Cntical Path (Cluster Size = 9)

4 5 7 4

4

3 I l 5 5 2 8 5 4

12 J

4

10 Y

3 4

5 in 7 13 Y 4

Y

7

n 6

5-78

120 A ~ ~ e n d i x 1. Number of Cluster Levels on Critical Path

Table 1-10: Number of Clusters on Critical Path (Cluster Size = 10)

-

Bibliography

[ACSG+99] O. Agrawal, H. Chang, B. Sharpe-Geisler, N. Schrnitz, B. Nguyen, J. Wong,

G. Tran, F. Fontana, and B. Harding. "An Innovative, Segmented High Pefor-

mance FPGA Family with Variable-Grain-Architecture and Wide-gating Func-

tions". En ACM Symp. on FPGAs, Monterey, CA, USA, 1999.

E. Ahmed and J. Rose. "The Effect of LUT and Cluster Size on Deep-Subrnicron

FPGA Peqomance and den si^". In ACM Symp. on FPGAs, pages 3-12,2000.

S. Brown, R. Francis, J. Rose, and Z. Vranesic. "Field-Programmable Gate Ar-

rays ". Kluwer Academic Publishers, 1992.

S. Brown and J. Rose. "FPGA and CPLD Architectures: A Eitot-ïaZ". in IEEE

Design and Test of Cornputers, pages 42-57, Surnmer 1996.

V. Betz and J. Rose. "Cluster-Based Logic Bloch for FPGAs: Area-Efleciency vs

Input Sharïng und Ske ". In IEEE Custom Integrated Circuits Conference, pages

55 1-554, Santa Clara, CA, 1997.

V. Betz and J. Rose. "How Much Logic Should Go in an FPGA Logic Block?".

In IEEE Design and Test Magazine, pages 10- 15, 19%.

V. Betz, J. Rose, and A. Marquardt. "Architecture and CAD for Deep-Submicron

FPGAs ". Kiuwer Academic Publishers, New York, 1999.

J. Cong and Y. Ding. "FlowMap: An Optimal Technology Mapping Algorithm

for Delay Optirnization in Loo kup-Table Based FPGA Designs ". In IEEE Trans.

on CAD, pages 1-12, Jan 1994.

J. Cong and Y- Hwang. "Boolenn Matching for Cornplex PLBs in LUT-based

FPGAs with Application tu Architecture Evaluation ". In ACM Symp. on FPGAs,

Monterey, CA, 1998.

K. Chung. "Architecture and Synkhesis of Field-ProgrammabIe Gate Arrays with

Hardwired Connections". PhD thesis, University of Toronto, 1994.

J. Cong, J. Peck, and Y. Ding. "RASP: A General Logic Synrhesis System fur

S W - b a s e d FPGAs". In ACM Symp, on FPGAs, pages 137-143, 1996.

E.M. Sentovich et al. "SIS: A System for Seqcrential Circuit Analysis". Technical

report, University of California, Berkeley, 1990.

R. Hitchcock, G. Smith, and D. Cheng. "Timing Analysis of Computer-

Hardware". Technical report, IBM Journal of Research and Development, Jan.

1983.

D. Hill and N-S Woo. "The BeneJits of FZexibility in Look-up Table FPGAs". In

International Worhhop on FPGAs, 199 1.

Xilinx Inc. "XC5200 Series of FPGAs ", YEAR = 1997.

Altera Inc. "Data Book ". 1998.

Xilinx Inc. "Virtex 2.5 V Field Programmable Gate Arrays ". 1998.

S. Kaptanoglu, G. Bakker, A. Kundu, and 1. Corneillet. "A new high densi0 and

very low cos? reprograrnrnable FPGA architecture". In ACM Symp, on FPGAs,

Monterey, CA, 1999.

J- Kouloheris and A.EI Gamal. "FPGA Petformance vs. Cell Granulari@"' In

Proc. of Ctrstom Integrated Circuits Conference, pages 6.2.1 - 6.2.4, May 199 1.

J. Kouloheris and A-El Gamal. "FPGA Area vs. Cell Granularity - P U Cells".

In Proc. of lustom Integrated Circuits Conference, 1 992-

J. Kouloheris and A.El Gamal. "FPGA Area vs. Ce11 Granulariq - Lmkup tables

and PLA Cells". In First ACM Workhop on FPGAs, Berkeley, CA, 1992.

A. Marquardt. "Cluster-Based Architecture, ïïming-Driven Packing, and Timing-

Driven Placement for FPGAs ". Master's thesis, University of Toronto, 1999.

A. Marquardt, V. Betz, and J. Rose. "Using Cluster-Based Logic Bloch and

Timing-Driven Packing to Improve FPGA Speed and Density ". In ACWSIGDA

FPGA, 1999.

1. Rose, R.J. Francis, P. Chow, and D. Lewis. "The Effect of Logic Block Corn-

plaitu on Area of Programmable Arrays ". In Proc. of Custorn Integrated Circuits

Conference, pages 5.3.1 - 5.3.5, 1989.

J. Rose, R.J. Francis, D. Lewis, and P. Chow, "Architecture of Field-

Programmable Gate Arrays: The Efect of Logic Functionaliy on Area Efl-

ciency". In IEEE Joccrnal of Solid-State Circuits, 1990-

S. Singh. "The Eflect of Logic Block Architecture on FPGA Petfiormance". Mas-

ter's thesis, University of Toronto, f 99 1.

S. Singh, J. Rose, P. Chow, and D. Lewis. "The Effect of Logic BZockArchitecture

on FPGA Pe$o mance". In IEEE Journal of Solid-State Circuits, 1 992.

A. Sedra and K. Smith. "Microelectronic Circuits: Third Edition". Oxford

University Press, 199 1.

N. West and K- Eshraghian. "Principles of CMOS VLSI Design; A System

Perspective; Second Edition ". Addison Wesley, 1993.

S. Yang. "Logic Synthesis and Optirnization Benchmarks, Version 3.0". Techni-

caI report, Microelectronics Centre of North Carolina, 199 1.