THE EFFECT OF LOGIC BLOCK GRANULARITY ON DEEP-SUBMICRON FPGA PERFORMANCE AND DENSITY
Elias Ahmed
A thesis submitted in conformity with the requirements for the degree of Master of Applied Science
Graduate Department of Electrical and Computer Engineering University of Toronto
Copyright © 2001 by Elias Ahmed
The author has granted a non-exclusive licence allowing the National Library of Canada to reproduce, loan, distribute or sell copies of this thesis in microform, paper or electronic formats.
The author retains ownership of the copyright in this thesis. Neither the thesis nor substantial extracts from it may be printed or otherwise reproduced without the author's permission.
Abstract
The Effect of Logic Block Granularity on Deep-Submicron FPGA Performance and Density
Elias Ahmed
Master of Applied Science
Graduate Department of Electrical and Computer Engineering
University of Toronto
2001
The architecture of an FPGA has a significant effect on area and delay. In deep-submicron
designs, the interconnect resistance and capacitance account for the majority of the circuit
delay. In the first part of this thesis, we perform a detailed study of the FPGA logic block
architecture to determine the impact of logic block functionality on performance and density.
In particular, in the context of lookup table (LUT), cluster-based island-style FPGAs, we look
at the effect of LUT size and cluster size (number of LUTs per cluster) on the speed and logic
density of an FPGA. The second part of this thesis explores the area and delay properties of a
hardwired logic block architecture. This involves a new packing algorithm.
Acknowledgements
I would like to thank my supervisor Professor Jonathan Rose for his technical guidance and
moral support. Our weekly discussions were always interesting and very educational.
I would also like to thank Vaughn Betz and Alexander Marquardt for all their help with
VPR, the CAD flow and SPICE modeling. Thanks to Steve Wilton for his advice concerning
the 0.18 µm SPICE FPGA routing models.
I'd also like to thank Guy Lemieux and all the students in Jonathan's research group, Rob,
Andy, Ajay, Paul and William. Thanks to Vincent, Warren, Ted, Scott, Marcus, Brent, Jorge,
Humberto, Kostas, Derek and all the rest of the students in the Computer and Electronics Group
for making this a wonderful research environment.
Last but not least, I'm grateful to my family for their support and encouragement throughout
the years and especially to my late father, Noor, for always believing in me.
Contents
1 Introduction
1.1 Motivation
1.2 FPGA Logic Block Architecture
1.3 Hardwired Logic Blocks
1.4 Thesis Organization
2 Background
2.1 CAD Flow
2.1.1 Area Model
2.2 FPGA Packing Algorithms
2.2.1 RASP
2.2.2 VPACK
2.2.3 Timing-Driven Packing (T-VPACK)
2.3 FPGA Logic Block Architecture
2.3.1 LUT Size
2.3.2 Cluster Size
2.4 Summary
3 FPGA Logic Block Architecture
3.1 FPGA Architecture Modeling
3.1.1 Logic Circuit Design and Delay Model
3.1.2 Routing Architecture
3.2 Experimental Results
3.2.1 Cluster Inputs Required vs. LUT and Cluster Size
3.2.2 Area as a Function of N and K
3.2.3 Performance as a Function of N and K
3.2.4 Area-Delay Product
3.2.5 Summary
4 Hardwired Logic Blocks
4.1 Hardwired Architecture
4.1.1 Cluster Inputs (I)
4.1.2 Tapping Buffers
4.1.3 Logical Equivalence of Cluster Outputs
4.2 HLB Packing
4.2.1 HLB Packing Algorithm
4.2.2 HLB Packing with Tapping Buffers
4.3 Experimental Results
4.3.1 Area Results
4.3.2 Delay Results
4.4 Area-Delay Results
4.5 Summary
5 Conclusions and Future Work
5.1 Summary and Contributions
5.2 Future Work
A Total Area
B Intra-Cluster (Logic) Area
C Inter-Cluster (Routing) Area
D FPGA Channel Width
E Total Critical Path Delay
F Intra-Cluster (Logic) Delay
G Inter-Cluster (Routing) Delay
H Number of BLE Levels on Critical Path
I Number of Cluster Levels on Critical Path
Bibliography
List of Tables
3.1 Logic Cluster Delays for 4-input LUT Using 0.18 µm CMOS process
3.2 LUT Delays Using 0.18 µm CMOS process
3.3 MCNC Benchmark Circuit Descriptions
3.4 Channel Width vs. LUT and Cluster Size (1 to 5)
3.5 Channel Width vs. LUT and Cluster Size (6 to 10)
3.6 Critical Path Delay Comparison for K=4
3.7 Summary of Best Area, Delay, and Area-Delay Results
4.1 Percentage of Logic Block Area that is Occupied by Output Routing Crossbar
4.2 HLB Cluster Utilization (with and without tapping buffers)
4.3 Number of Clusters with and without Tapping Buffers
4.4 Comparison of number of 4-LUT to 7-LUT blocks after technology mapping
4.5 Area-Delay Product Comparison Between Cascaded 4-LUTs and Non-hardwired Architectures
A.1 Total Area (×10^6) in Min. Width Trans. Area (Cluster Size = 1)
A.2 Total Area (×10^6) in Min. Width Trans. Area (Cluster Size = 2)
A.3 Total Area (×10^6) in Min. Width Trans. Area (Cluster Size = 3)
A.4 Total Area (×10^6) in Min. Width Trans. Area (Cluster Size = 4)
A.5 Total Area (×10^6) in Min. Width Trans. Area (Cluster Size = 5)
A.6 Total Area (×10^6) in Min. Width Trans. Area (Cluster Size = 6)
A.7 Total Area (×10^6) in Min. Width Trans. Area (Cluster Size = 7)
A.8 Total Area (×10^6) in Min. Width Trans. Area (Cluster Size = 8)
A.9 Total Area (×10^6) in Min. Width Trans. Area (Cluster Size = 9)
A.10 Total Area (×10^6) in Min. Width Trans. Area (Cluster Size = 10)
B.1 Intra-Cluster Area (×10^6) in Min. Width Trans. Area (Cluster Size = 1)
B.2 Intra-Cluster Area (×10^6) in Min. Width Trans. Area (Cluster Size = 2)
B.3 Intra-Cluster Area (×10^6) in Min. Width Trans. Area (Cluster Size = 3)
B.4 Intra-Cluster Area (×10^6) in Min. Width Trans. Area (Cluster Size = 4)
B.5 Intra-Cluster Area (×10^6) in Min. Width Trans. Area (Cluster Size = 5)
B.6 Intra-Cluster Area (×10^6) in Min. Width Trans. Area (Cluster Size = 6)
B.7 Intra-Cluster Area (×10^6) in Min. Width Trans. Area (Cluster Size = 7)
B.8 Intra-Cluster Area (×10^6) in Min. Width Trans. Area (Cluster Size = 8)
B.9 Intra-Cluster Area (×10^6) in Min. Width Trans. Area (Cluster Size = 9)
B.10 Intra-Cluster Area (×10^6) in Min. Width Trans. Area (Cluster Size = 10)
C.1 Inter-Cluster Area (×10^6) in Min. Width Trans. Area (Cluster Size = 1)
C.2 Inter-Cluster Area (×10^6) in Min. Width Trans. Area (Cluster Size = 2)
C.3 Inter-Cluster Area (×10^6) in Min. Width Trans. Area (Cluster Size = 3)
C.4 Inter-Cluster Area (×10^6) in Min. Width Trans. Area (Cluster Size = 4)
C.5 Inter-Cluster Area (×10^6) in Min. Width Trans. Area (Cluster Size = 5)
C.6 Inter-Cluster Area (×10^6) in Min. Width Trans. Area (Cluster Size = 6)
C.7 Inter-Cluster Area (×10^6) in Min. Width Trans. Area (Cluster Size = 7)
C.8 Inter-Cluster Area (×10^6) in Min. Width Trans. Area (Cluster Size = 8)
C.9 Inter-Cluster Area (×10^6) in Min. Width Trans. Area (Cluster Size = 9)
C.10 Inter-Cluster Area (×10^6) in Min. Width Trans. Area (Cluster Size = 10)
D.1 Channel Width (Cluster Size = 1)
D.2 Channel Width (Cluster Size = 2)
D.3 Channel Width (Cluster Size = 3)
D.4 Channel Width (Cluster Size = 4)
D.5 Channel Width (Cluster Size = 5)
D.6 Channel Width (Cluster Size = 6)
D.7 Channel Width (Cluster Size = 7)
D.8 Channel Width (Cluster Size = 8)
D.9 Channel Width (Cluster Size = 9)
D.10 Channel Width (Cluster Size = 10)
E.1 Total Delay in nano-seconds (Cluster Size = 1)
E.2 Total Delay in nano-seconds (Cluster Size = 2)
E.3 Total Delay in nano-seconds (Cluster Size = 3)
E.4 Total Delay in nano-seconds (Cluster Size = 4)
E.5 Total Delay in nano-seconds (Cluster Size = 5)
E.6 Total Delay in nano-seconds (Cluster Size = 6)
E.7 Total Delay in nano-seconds (Cluster Size = 7)
E.8 Total Delay in nano-seconds (Cluster Size = 8)
E.9 Total Delay in nano-seconds (Cluster Size = 9)
E.10 Total Delay in nano-seconds (Cluster Size = 10)
F.1 Intra-Cluster Delay in nano-seconds (Cluster Size = 1)
F.2 Intra-Cluster Delay in nano-seconds (Cluster Size = 2)
F.3 Intra-Cluster Delay in nano-seconds (Cluster Size = 3)
F.4 Intra-Cluster Delay in nano-seconds (Cluster Size = 4)
F.5 Intra-Cluster Delay in nano-seconds (Cluster Size = 5)
F.6 Intra-Cluster Delay in nano-seconds (Cluster Size = 6)
F.7 Intra-Cluster Delay in nano-seconds (Cluster Size = 7)
F.8 Intra-Cluster Delay in nano-seconds (Cluster Size = 8)
F.9 Intra-Cluster Delay in nano-seconds (Cluster Size = 9)
F.10 Intra-Cluster Delay in nano-seconds (Cluster Size = 10)
G.1 Inter-Cluster Delay in nano-seconds (Cluster Size = 1)
G.2 Inter-Cluster Delay in nano-seconds (Cluster Size = 2)
G.3 Inter-Cluster Delay in nano-seconds (Cluster Size = 3)
G.4 Inter-Cluster Delay in nano-seconds (Cluster Size = 4)
G.5 Inter-Cluster Delay in nano-seconds (Cluster Size = 5)
G.6 Inter-Cluster Delay in nano-seconds (Cluster Size = 6)
G.7 Inter-Cluster Delay in nano-seconds (Cluster Size = 7)
G.8 Inter-Cluster Delay in nano-seconds (Cluster Size = 8)
G.9 Inter-Cluster Delay in nano-seconds (Cluster Size = 9)
G.10 Inter-Cluster Delay in nano-seconds (Cluster Size = 10)
H.1 Number of BLEs on Critical Path (Cluster Size = 1)
H.2 Number of BLEs on Critical Path (Cluster Size = 2)
H.3 Number of BLEs on Critical Path (Cluster Size = 3)
H.4 Number of BLEs on Critical Path (Cluster Size = 4)
H.5 Number of BLEs on Critical Path (Cluster Size = 5)
H.6 Number of BLEs on Critical Path (Cluster Size = 6)
H.7 Number of BLEs on Critical Path (Cluster Size = 7)
H.8 Number of BLEs on Critical Path (Cluster Size = 8)
H.9 Number of BLEs on Critical Path (Cluster Size = 9)
H.10 Number of BLEs on Critical Path (Cluster Size = 10)
I.1 Number of Clusters on Critical Path (Cluster Size = 1)
I.2 Number of Clusters on Critical Path (Cluster Size = 2)
I.3 Number of Clusters on Critical Path (Cluster Size = 3)
I.4 Number of Clusters on Critical Path (Cluster Size = 4)
I.5 Number of Clusters on Critical Path (Cluster Size = 5)
I.6 Number of Clusters on Critical Path (Cluster Size = 6)
I.7 Number of Clusters on Critical Path (Cluster Size = 7)
I.8 Number of Clusters on Critical Path (Cluster Size = 8)
I.9 Number of Clusters on Critical Path (Cluster Size = 9)
I.10 Number of Clusters on Critical Path (Cluster Size = 10)
List of Figures
1.1 Island-Style FPGA [BRM99]
1.2 FPGA Cluster-style Logic Block Contents
1.3 Examples of Hardwired Logic Blocks
2.1 Grouping of LUTs and Flip-Flops
2.2 Architecture Evaluation Flow
2.3 Definition of a Minimum-Width Transistor Area [BRM99]
2.4 Packing of BLEs to Form Clusters
2.5 Original VPACK Algorithm
2.6 T-VPACK Algorithm [MBR99]
2.7 Structure of (a) Basic Logic Element (BLE) and (b) Logic Cluster [BRM99]
3.1 Structure and Speed Paths of a Logic Cluster [BRM99]
3.2 Number of Inputs Required for 98% Logic Block Utilization
3.3 Total Area for Clusters of Size 1 to 5
3.4 Total Area for Clusters of Size 6 to 10
3.5 Total Logic Block Cluster Area
3.6 Number of Clusters and Cluster Area Versus K (for N=1)
3.7 Intra-cluster Multiplexer Area and LUT Size
3.8 Routing Area
3.9 Number of Clusters and Routing Area Per Cluster Versus K (for N=1)
3.10 Total Delay for Clusters of Size 1 to 10
3.11 Total Intra-Cluster Delay for Clusters of Sizes 1 to 10
3.12 Number of BLEs on Critical Path and BLE delay vs. K (for N=1)
3.13 Total Inter-Cluster Delay for Clusters of Size 1 to 10
3.14 Number of Cluster levels on Critical Path
3.15 Average BLE Fanout
3.16 Area-Delay Product for Clusters of Size 1 to 10
3.17 Close-up View of Area-Delay Product for Clusters of Size 1 to 10
4.1 Cascaded 4-LUTs
4.2 Cluster Description (with HLBs)
4.3 HLB Tapping Buffers
4.4 Comparison of HLBs with and without tapping buffers
4.5 HLB Cluster Contents with Full Routing Crossbar for Output Signals
4.6 HLB Packing Flow
4.7 Pseudo-code for HLB Timing Driven Packing
4.8 HLB with Cascaded 4-LUTs and Tapping Buffer
4.9 Total Area Comparisons for Hardwired Arch. vs. Non-Hardwired
4.10 Inter-Cluster Area Comparisons for Hardwired Arch. vs. Non-Hardwired
4.11 Intra-Cluster Area Comparisons for Hardwired Arch. vs. Non-Hardwired
4.12 Total Critical Path Delay Comparisons for Hardwired Arch. vs. Non-Hardwired
4.13 Number of BLEs on the Critical Path Comparisons for Hardwired Arch. vs. Non-Hardwired
4.14 Number of Clusters on the Critical Path Comparisons for Hardwired Arch. vs. Non-Hardwired
5.1 Various HLB architectures
CHAPTER 1
Introduction
1.1 Motivation
Field-Programmable Gate Arrays (FPGAs) have experienced tremendous growth in recent
years and have become a multi-billion dollar industry. Shrinking device geometries resulting
in larger gate capacity have provided for greater functionality. The instant programmability
gives systems built with these devices a significant time-to-market advantage. However, this
programmability comes at a price, since FPGAs are at least three times slower and demand
more than ten times the silicon area when implementing the same function on a chip,
compared to Standard Cells or Masked-Programmable Gate Arrays [BFRV92]. This happens
because Standard Cells use simple wires to make interconnections between logic gates, but in
FPGAs, gates are connected with programmable switches. These switches have much larger
resistance and capacitance and hence are slower than the wires in full-fabrication chips.
Ideally, to improve the performance of an FPGA we would like to use as few switches as possible
for any given circuit. In general, the three main factors affecting overall FPGA performance
are the architecture of the FPGA, the quality of the CAD tools, and the electrical
transistor-level design of the FPGA. While this thesis explores all three issues, we focus primarily on the
FPGA logic block architecture.
1.2 FPGA Logic Block Architecture
This thesis examines several aspects of FPGA logic block architecture and its impact on area
and performance. A generic FPGA consists of numerous programmable logic blocks which
have the capability to implement some digital logic functions. In between these logic blocks
are programmable routing switches which connect the input and output pins of each logic
block. This basic FPGA architecture is illustrated in Figure 1.1 and is known as an
"island-style" structure in which a symmetric array of logic blocks is surrounded by routing channels
(or tracks). The I/O pads are evenly distributed around the perimeter of the FPGA.
Figure 1.1: Island-Style FPGA [BRM99]
Figure 1.2 shows a typical logic cluster, which is a logic block that consists of one or more
basic logic elements (BLEs) grouped together. BLEs are often composed of look-up tables
(LUTs). The BLEs in the cluster are fully interconnected, meaning that a crossbar allows any
BLE output to reach any BLE input and that all inputs to the cluster can reach any of the BLE
inputs. The advantage of having a fully-connected internal routing crossbar is that physical
routing becomes much easier since the router (the CAD tool which determines the paths of
the wires in an FPGA) simply has to connect to any one of the cluster input pins. This added
flexibility for the router results in fewer tracks being used. However, there is a cost
in terms of multiplexer area and delay to build the full crossbar. For large clusters, this area
and delay can be quite significant.
Figure 1.2: FPGA Cluster-style Logic Block Contents
The focus of the first part of this thesis is to determine the effect of the number of inputs to
the LUT (K), in a homogeneous architecture in which all LUTs are the same size, and the
number of such LUTs in a cluster (N) on the performance and density of an FPGA. Increasing
either LUT size (K) or cluster size (N) increases the functionality of the logic block, which
has two positive effects: it decreases the total number of logic blocks needed to implement
a given function, and it decreases the number of such blocks on the critical path, typically
improving performance. Working against these positive effects is that the size of the logic
block increases with both K and N. The size of the LUT is exponential in K [RFLC90] and the
size of the cluster is quadratic in N [BR97]. Furthermore, the area devoted to routing outside
the block will change as a function of K and N, and this effect (since routing area typically is a
large percentage of total area) has a strong effect on the results. The choice of the logic block
granularity which produces the best area-delay product lies in between these two extremes. In
exploring these trade-offs we seek to answer the following questions:
- For a cluster-based logic block with N LUTs of size K and I inputs to the cluster, what
  should the value of I be so that 98% of the LUTs in the cluster can be fully utilized?
  Certainly setting I = K × N will do this, but a smaller value, which is cheaper, may
  also suffice.
- What is the effect of K and N on FPGA area?
- What is the effect of K and N on FPGA delay?
- Which values of K and N give the best area-delay product?
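The scaling behind these questions (LUT size exponential in K, cluster size quadratic in N, and the upper bound of K × N cluster inputs) can be illustrated with a minimal sketch. It is not the area model used in this thesis: it assumes only that a K-LUT stores one configuration bit per truth-table row and that every BLE input pin is driven by a multiplexer selecting among the I cluster inputs and the N BLE outputs. The function names are hypothetical.

```python
def lut_config_bits(k: int) -> int:
    """Configuration bits in a K-input LUT: one bit per truth-table row."""
    return 2 ** k


def max_cluster_inputs(k: int, n: int) -> int:
    """Upper bound on cluster inputs: every LUT pin gets its own input."""
    return k * n


def crossbar_mux_inputs(k: int, n: int, i: int) -> int:
    """Total multiplexer inputs in a fully connected cluster crossbar.

    Assumption: each of the N*K BLE input pins has a mux that can select
    any of the I cluster inputs or any of the N BLE outputs.
    """
    return n * k * (i + n)


if __name__ == "__main__":
    for k in (2, 4, 7):
        print(f"K={k}: {lut_config_bits(k)} LUT configuration bits")
    for n in (1, 2, 4, 8):
        i = max_cluster_inputs(4, n)
        print(f"K=4, N={n}, I={i}: {crossbar_mux_inputs(4, n, i)} mux inputs")
```

With I proportional to N, doubling N quadruples the crossbar count in this model, which is why a value of I smaller than K × N is attractive.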
Most importantly, we seek to clearly explain the results, which may in turn lead to better
architectures. Even though some of these questions were addressed some time ago in
[RFCL89] [RFLC90] [KG91] [KG92b] [Sin91] and [SRCL92], several reasons compelled
us to revisit the issue. First, prior work on the appropriate size of the LUT focused on
non-clustered logic blocks, yet clustering is known to have a significant impact on area and delay
[MBR99]. Second, most prior studies tended to look at area or delay, but not both as we will
here. Third, prior results were based on IC process generations that are several factors larger
than current process generations, and so do not take deep-submicron electrical effects into
account. In the present work, we perform detailed SPICE-level simulations of circuits and perform
appropriate buffer and transistor sizing for all the logic and routing elements, in the manner of
[BRM99]. Fourth, the CAD tools available today for experimentation are significantly better
than those available 10 years ago, when this question was first raised. This turned out to be
significant because our new results show that the superior tools give rise to different trends in
the explanation of the results.
1.3 Hardwired Logic Blocks
The second part of this thesis will explore the use of hardwired logic blocks (HLBs) within
the context of logic clusters [Chu94]. HLBs consist of two or more BLEs connected together
by wires. These wires do not have switches and lack any form of programmability. The low
resistance and capacitance of the metal wires makes the connections between BLEs fast and
cheap in terms of area. In a non-HLB architecture (shown in Figure 1.2) connections between
BLEs must propagate through the local routing crossbar which connects all BLE outputs and
inputs. The area requirements of the full routing crossbar can be quite significant, sometimes
even larger than the BLE area. Also, there is a significant delay required for signals to travel
from a BLE output to another BLE input. The use of HLBs alleviates the area and delay
demands of local BLE connections and may improve FPGA density and performance. We
explore the area and delay of one particular HLB-based FPGA architecture, again in the context
of clustered architectures.
Figure 1.3: Examples of Hardwired Logic Blocks: (a) Cascaded Hardwired LUTs, (b) Tree Topology Hardwired LUTs
1.4 Thesis Organization
This thesis is organized as follows: Chapter 2 describes the background required to understand
this thesis and previous work relating to FPGA logic block architecture. Chapter 3 presents
the results of an extensive study of LUT and cluster size on FPGA area and delay. Chapter 4
presents the area and delay results for a cascaded 4-LUT hardwired logic block-based FPGA
and compares it to the non-hardwired results. Finally, we provide the conclusion in Chapter 5
along with possibilities for future work.
CHAPTER 2
Background
This chapter discusses several related works in the area of FPGA architecture and
CAD tools. There has been a significant amount of work in FPGA architecture over the last
decade and we will attempt to summarize some of the key related results. Also, we will review
the CAD flow and algorithms for logic synthesis, placement and routing used to produce
the results in the present work.
2.1 CAD Flow
The best-known and most believable method of determining the answers to the questions posed
in Section 1.2 is to experimentally synthesize real circuits using a CAD flow into the
different FPGA architectures of interest, and then measure the resulting area and delay [BFRV92]
[BRM99] [KG91]. Figure 2.2 illustrates the CAD flow that was used in [BRM99] and [MBR99]
to explore architectures and this is the one that we employ in this research. First, each circuit
passes through technology-independent logic optimization using the SIS program [ea90]. It is
worth noting that, from this point on, the entire CAD flow is fully timing-driven. Technology
mapping (which converts the logic expressions into a netlist of K-input LUTs) was performed
using the FlowMap and FlowPack tools [CD94]. At this stage, there exists a netlist of logic
blocks (LUTs and registers). T-VPACK [MBR99] takes this netlist of logic blocks and first
groups them into basic logic elements (BLEs) which consist of a single K-LUT and flip-flop.
Any LUT with a fanout of one and feeding into a flip-flop (as shown in Figure 2.1) can be
collapsed into a single BLE. Hence, BLEs can be composed of either a LUT, a LUT and a
LATCH or simply a LATCH.
Figure 2.1: Grouping of LUTs and Flip-Flops
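The grouping rule described above (a LUT with a fanout of one that feeds a flip-flop is collapsed into a single BLE) can be sketched as follows. This is only an illustration, not the T-VPACK implementation: the netlist representation (dictionaries keyed by output signal name) and the names `BLE` and `form_bles` are assumptions made for the example.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional


@dataclass(frozen=True)
class BLE:
    lut: Optional[str]  # name of the LUT output signal, if the BLE has a LUT
    ff: Optional[str]   # name of the flip-flop output signal, if any


def form_bles(luts: Dict[str, List[str]],
              ffs: Dict[str, str],
              primary_outputs: Iterable[str] = ()) -> List[BLE]:
    """Group LUTs and flip-flops into BLEs.

    luts maps each LUT output to its input signals; ffs maps each
    flip-flop output to the signal on its D input (hypothetical format).
    """
    # Count how many sinks read each signal.
    fanout: Counter = Counter()
    for inputs in luts.values():
        fanout.update(inputs)
    fanout.update(ffs.values())
    fanout.update(primary_outputs)

    bles: List[BLE] = []
    absorbed = set()
    for ff, d in ffs.items():
        if d in luts and fanout[d] == 1:
            # LUT drives only this flip-flop: merge both into one BLE.
            bles.append(BLE(lut=d, ff=ff))
            absorbed.add(d)
        else:
            bles.append(BLE(lut=None, ff=ff))
    for lut in luts:
        if lut not in absorbed:
            bles.append(BLE(lut=lut, ff=None))
    return bles


if __name__ == "__main__":
    luts = {"n1": ["x", "y"], "n2": ["x", "q1"], "n3": ["n2", "y"]}
    ffs = {"q1": "n1", "q2": "n2"}
    for ble in form_bles(luts, ffs):
        print(ble)
```

In the example, `n1` drives only flip-flop `q1` and is merged with it, while `n2` also feeds `n3`, so `n2` and `q2` end up in separate BLEs.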
These BLEs are then packed into clusters optimizing for both logic density and speed. VPR
[BRM99] is then used for timing-driven placement and routing. Placement is the process of
determining the location of every cluster within the FPGA tile and routing determines which
programmable switches to turn on in order to connect the nets. Note that the VPACK and VPR
toolset was designed to explore FPGA architectures and each tool takes a number of parameters
as input that describe an FPGA architecture. For example, the packer (VPACK) takes the LUT
size (K), cluster size (N) and number of cluster inputs (I) as input parameters. The router takes
a VPR ".arch" file [BRM99] which is a textual description of the routing architecture. Once
the physical layout is complete, the area and delay of the circuit in the given architecture are
extracted. VPR uses a transistor-based area model to calculate the total FPGA area and the
timing analyzer determines the critical path delay by extracting the Elmore delay of each net
and performing a path-based timing analysis.
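As a reminder of what an Elmore delay computation involves, the following minimal sketch handles the simplest case, a single RC chain from a driver to one sink. It is an illustration under stated assumptions rather than VPR's timing analyzer: real nets are generally RC trees, and the function name and argument layout are hypothetical.

```python
from typing import Sequence


def elmore_delay_chain(resistances: Sequence[float],
                       capacitances: Sequence[float]) -> float:
    """Elmore delay from the driver to the last node of an RC chain.

    resistances[i] is the series resistance of segment i and
    capacitances[i] is the capacitance at the node after segment i.
    Each resistance is charged with all capacitance downstream of it:
    delay = sum_i R_i * sum_{j >= i} C_j.
    """
    if len(resistances) != len(capacitances):
        raise ValueError("need one capacitance per resistance")
    delay = 0.0
    downstream = sum(capacitances)
    for r, c in zip(resistances, capacitances):
        delay += r * downstream
        downstream -= c  # this node's capacitance is upstream of the next segment
    return delay


if __name__ == "__main__":
    # Two segments: R = 1, 2 and C = 3, 4 (arbitrary consistent units).
    print(elmore_delay_chain([1, 2], [3, 4]))
```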
In our approach to modeling the area of an FPGA required by any given circuit, we determine
the minimum number of tracks needed to successfully route each circuit, Wmin. Clearly
this isn't possible in real FPGAs, but we believe this is meaningful as part of a logic density
metric for an architecture. The area model which makes use of this minimum track count is
described more fully in Section 2.1.1. In order to determine the minimum number of tracks
per channel to route each circuit we repeatedly route each circuit, removing tracks from the
architecture until it fails to route. We call the situation where the FPGA has the minimum
number of tracks needed to route a given circuit a "high stress" routing since the circuit is
barely routable. We believe that measuring the performance of a circuit under these high-stress
conditions is unreasonable and atypical, because FPGA designers don't like working just on
the edge of routability. They will typically change something to avoid it, such as using a larger
device, or removing part of the circuit.

Figure 2.2: Architecture Evaluation Flow (circuit; logic optimization (SIS); technology mapping to K-LUTs (FlowMap + FlowPack); packing of BLEs into logic clusters (T-VPACK); placement (VPR, timing-driven placement); routing (VPR, timing-driven router), adjusting the channel capacities (W) until Wmin is determined; routing with W = 1.3 Wmin (VPR, timing-driven router); determination of critical path delay and transistor area to build the FPGA routing (VPR))
For this reason, we add 30% more tracks to the minimum track count and then perform a
final "low stress" routing and use that to measure the critical path delay.
From the output of the router, and using the area models described in the next section along
with the circuit delay parameters, we can compare different architectures.
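The channel-width procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: routes_ok is a hypothetical callback standing in for a complete router run, the starting width is assumed to be routable, and rounding the 30% margin up to a whole track is our assumption rather than something stated in the text.

```python
def find_min_channel_width(routes_ok, start_width):
    """Remove one track at a time until the circuit no longer routes.

    routes_ok(width) is a hypothetical callback standing in for a full
    router run; it returns True when the circuit routes successfully
    with `width` tracks per channel.
    """
    if not routes_ok(start_width):
        raise ValueError("start_width must be routable")
    width = start_width
    while width > 1 and routes_ok(width - 1):
        width -= 1
    return width  # W_min: one fewer track would fail to route


def low_stress_width(w_min, extra_percent=30):
    """Return W_min plus extra_percent more tracks, rounded up (assumed)."""
    return (w_min * (100 + extra_percent) + 99) // 100


if __name__ == "__main__":
    def toy_router(width):
        # Toy stand-in: this "circuit" routes with 17 or more tracks.
        return width >= 17

    w_min = find_min_channel_width(toy_router, start_width=40)
    print(w_min, low_stress_width(w_min))
```

The "high stress" routing corresponds to the width returned by find_min_channel_width, and the "low stress" routing used for delay measurement corresponds to low_stress_width.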
2.1.1 Area Model
Betz's area modeling procedure [BRM99] was to create the detailed, transistor-level circuit
design of all of the logic and routing circuitry in the FPGA. This includes circuits for the
LUTs, flip-flops, intra-cluster muxes, inter-cluster routing muxes and switches and all of the
associated programming bits. His basic assumption was that the total area of the FPGA was
active-area limited, which tends to be true when there are many layers of metal (according to
[BRM99]). Two commercial PLD vendors have confirmed this assumption.
This design process includes proper sizing of all of the gates and buffers, including the
pass-transistors in the routing. Betz uses the number of "minimum-width transistor areas" as
his area metric. The definition of a minimum-width transistor area is the smallest possible
layout area of a transistor that can be processed for a specific technology plus the minimum
spacing surrounding the transistor as shown in Figure 2.3. The spacing is dictated by the
design rules for that particular technology. Any transistors in the circuit design that are sized
larger than minimum are counted as a greater number of minimum-width transistors, taking
into account the fact that a double size transistor takes less than twice the layout area. One
advantage of this metric is that it is a somewhat process-independent estimate of the FPGA
area.
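As a rough illustration of how a larger transistor can be counted as less than a proportional number of minimum-width transistor areas, the sketch below uses a linear form, 0.5 + (drive strength) / 2, with drive strength expressed as a multiple of the minimum-width transistor's drive. To our understanding this matches the form used in the [BRM99] area model, but the text above does not state the formula, so both the expression and the function name should be treated as assumptions.

```python
def min_width_transistor_areas(drive_strength):
    """Area of one transistor, in minimum-width transistor areas.

    drive_strength is the transistor's drive as a multiple of the
    minimum-width transistor's drive (1.0 = minimum width).
    The linear form below is an assumed model, not taken from the text.
    """
    if drive_strength < 1.0:
        raise ValueError("drive_strength must be at least 1.0")
    return 0.5 + drive_strength / 2.0


if __name__ == "__main__":
    for strength in (1, 2, 4, 10):
        print(strength, min_width_transistor_areas(strength))
```

Under this assumed model a double-size transistor counts as 1.5 areas rather than 2, which is consistent with the statement that it takes less than twice the layout area.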
[Figure 2.3: Definition of a Minimum-Width Transistor Area [BRM99]. The figure labels the polysilicon (gate), the diffusion, the contact, the minimum horizontal spacing, and the perimeter of the minimum-width transistor area.]
2.2 FPGA Packing Algorithms
In order to effectively study the feasibility of HLB architectures we need to have a packing
algorithm capable of targeting such structures. However, we should first provide a brief overview
of some of the general packing algorithms relevant to our work. This section will discuss three
such packing algorithms: RASP, VPACK and T-VPACK. The input to all these packing
algorithms is a netlist of LUTs and flip-flops and the output is a clustered set of BLEs.
2.2.1 RASP
RASP [CPD96] is a general synthesis system for SRAM-based FPGAs. It has the capability
to map circuits into various types of logic blocks. RASP is composed of a core which includes
synthesis and optimization algorithms targeting technology-independent logic synthesis and
mapping for LUT-based FPGAs. The packing algorithm uses a "closeness" metric to determine
which LUTs to group together in the same cluster. It first creates a compatibility graph where
the vertices represent the LUTs that require grouping and edges are formed between vertices
if they can be grouped together. There is no edge in the compatibility graph if two BLEs
cannot be grouped together due to some hard constraint violation (for example, exceeding
the maximum number of cluster inputs allowable). The next step is to assign weights to all
the edges in the network. Weights are assigned depending on the design objective. If circuit
performance is the main objective then a large weight is assigned to edges which produce a
grouping that reduces the length of the critical path. Conversely, if FPGA area is the primary
objective then large weights are given to those edges resulting in groupings that do not create
complex interconnection patterns in the final mapping. The algorithmic complexity of the
mapper is O(nm) where n is the number of LUTs and m is the number of edges. With the
current benchmarks used in our experiments, the number of edges m in the compatibility graph
is O(n^2). This results in an overall algorithmic complexity of O(n^3), which is excessive and
somewhat impractical for modern-day circuits which require in excess of 10,000 blocks.
2.2.2 VPACK
VPACK [BRM99] takes a netlist of LUTs and registers as input and outputs a netlist of logic
clusters as illustrated in Figure 2.4. It groups BLEs together in order to maximize input sharing.
The number of BLEs per cluster (N), inputs per cluster (I), LUT size (K) and clocks per cluster
(M_clk) are all input parameters to the VPACK algorithm. VPACK accepts any combination
of these parameters and creates optimized logic clusters. The complete pseudo-code for the
VPACK algorithm is given in Figure 2.5.
VPACK attempts to pack as many BLEs as possible into a given cluster without violating the
following constraints:
• There can be no more than N BLEs in any given cluster.
• Each cluster must use I inputs or fewer.
• Every BLE contained in the cluster must have K inputs per LUT (strictly homogeneous
architecture).
• M_clk is the maximum number of clocks per cluster.
[Figure 2.4: Packing of BLEs to Form Clusters. A netlist of BLEs is packed into a netlist of clusters.]
The basic algorithm consists of two stages. The first stage examines the input netlist and
groups LUTs and registers together to form BLEs. Only LUTs with a fanout of one and whose
output directly feeds a register are grouped together. All other structures such as LUTs with
fanout greater than one and LUT to LUT connections require multiple BLEs.
The second stage of the VPACK algorithm groups BLEs together to form clusters. Clusters
are created in a sequential manner one after the other. First, the algorithm is in a greedy mode
and continues to add BLEs to the current cluster until the maximum cluster capacity has been
reached. If the maximum number of BLEs, N, have been packed then the current cluster is
"closed" and a new empty cluster is created and the packing process is repeated. The basic
VPACK clustering starts by selecting a seed BLE to pack into the current open cluster. The
unclustered BLE with the most used inputs is always selected as the next available
seed. Once a seed BLE has been selected, an attraction function is applied to determine the
next BLE to group within the current cluster. The attraction between a BLE, B, and the current
cluster, C, is the number of common nets that are shared. A net is defined as any electrically
equivalent wire connecting one or more logic blocks together.
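The seed selection and attraction function just described can be sketched as follows. The dictionary layout used for a BLE (an "inputs" list and an "output" net name) and the function names are hypothetical, chosen only for illustration; they are not VPACK's actual data structures.

```python
def nets_of(ble):
    """All nets a BLE touches: its used inputs plus its output."""
    return set(ble["inputs"]) | {ble["output"]}


def pick_seed(unclustered):
    """Seed choice: the unclustered BLE with the most used inputs."""
    return max(unclustered, key=lambda ble: len(ble["inputs"]))


def attraction(ble, cluster):
    """Number of nets shared between a BLE and the current cluster."""
    cluster_nets = set()
    for member in cluster:
        cluster_nets |= nets_of(member)
    return len(nets_of(ble) & cluster_nets)


if __name__ == "__main__":
    b1 = {"inputs": ["a", "b", "c", "d"], "output": "n1"}
    b2 = {"inputs": ["n1", "a"], "output": "n2"}
    b3 = {"inputs": ["e", "f"], "output": "n3"}
    seed = pick_seed([b1, b2, b3])
    print(seed["output"], attraction(b2, [seed]), attraction(b3, [seed]))
```

In this toy netlist b2 shares two nets with the seed (n1 and a) while b3 shares none, so a greedy packer using this attraction would add b2 first.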
If the greedy algorithm of the packer does not completely fill the cluster to its maximum
capacity then there is the option to invoke the hill-climbing phase. Hill-climbing will continue
to pack BLEs into the cluster even though the cluster uses more than the maximum I inputs that
are allowable, so BLEs can be added even when the result is temporarily an infeasible
cluster (too many inputs). It must be remembered that adding a BLE whose inputs are all
already present in the cluster, and whose output is used by another BLE in the cluster,
reduces the number of cluster inputs. This is the key driving force behind hill-climbing.
That is, even though infeasibilities may occur early on during packing, the cluster may become
feasible at later stages and the result is an improvement in logic block density. The algorithmic
complexity of VPACK is O(k_max · K · n) where k_max is the maximum number of terminals on a net,
K is the number of inputs to each LUT and n is the number of LUTs + registers in the circuit.
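The reduction in cluster inputs that motivates hill-climbing can be seen in a small example. The sketch below counts the distinct nets that must enter a cluster from outside; the BLE representation and the function name are hypothetical and used only for illustration.

```python
def cluster_input_count(cluster):
    """Distinct nets consumed by the cluster but not produced inside it."""
    produced = {ble["output"] for ble in cluster}
    consumed = set()
    for ble in cluster:
        consumed |= set(ble["inputs"])
    return len(consumed - produced)


if __name__ == "__main__":
    first = {"inputs": ["a", "b", "c", "x"], "output": "y"}
    # All inputs of `second` are already cluster inputs, and its output x
    # feeds `first`, so x no longer has to enter from outside the cluster.
    second = {"inputs": ["a", "b", "c"], "output": "x"}
    print(cluster_input_count([first]), cluster_input_count([first, second]))
```

Here the cluster needs four inputs with one BLE and only three after the second BLE is added, which is the effect described above.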
2.2.3 Timing-Driven Packing (T-VPACK)
T-VPACK [MBR99] is an extension to VPACK that attempts to maximize logic cluster
capacity while simultaneously working to reduce the critical path delay. The total delay is
improved by reducing the number of inter-cluster connections on the critical path. Since the
external cluster routing delay is much larger than the local routing delay inside the cluster, this
should have a positive impact on delay. The pseudo-code for the T-VPACK algorithm is shown in
Figure 2.6.
Timing analysis is performed on the circuit before any packing begins. T-VPACK models
three types of delays within an FPGA: the delay through a logic block, LogicBlockDelay,
the intra-cluster delay between logic blocks, IntraLogicDelay, and the external delay
between cluster input/output pins, InterBlockDelay. The external routing delays cannot be
determined before placement and routing and so these must be approximated. It was
experimentally demonstrated in [MBR99] [BRM99] that setting LogicBlockDelay and
IntraLogicDelay to 0.1 and InterBlockDelay to 1 produced the most accurate post-routing
results.

Let: UnclusteredBLEs be the set of BLEs not contained in any cluster
     C be the set of BLEs contained in the current cluster
     LogicClusters be the set of clusters (where each cluster is a set of BLEs)

UnclusteredBLEs = PatternMatchToBLEs (LUTs, Registers);
LogicClusters = NULL;
while (UnclusteredBLEs != NULL) {    /* More BLEs to cluster */
    C = GetBLEwithMostUsedInputs (UnclusteredBLEs);
    while (|C| < N) {    /* Cluster is not full */
        BestBLE = MaxAttractionLegalBLE (C, UnclusteredBLEs);
        if (BestBLE == NULL)    /* No BLE can be added to cluster */
            break;
        UnclusteredBLEs = UnclusteredBLEs - BestBLE;
        C = C ∪ BestBLE;
    }
    LogicClusters = LogicClusters ∪ C;
}
Figure 2.5: Original VPACK Algorithm
All the net timing and slack information is stored after initial timing analysis has been
completed. From this, the criticality of every connection can be calculated. Slack [HSC83] is
defined as the amount of delay that may be added to a connection before it becomes critical.
Slack(i) is the slack of a connection i and MaxSlack is the maximum possible slack for all
connections in the circuit. The criticality of any connection, i, is expressed as:
\[ \mathrm{Criticality}(i) = 1 - \frac{\mathrm{Slack}(i)}{\mathrm{MaxSlack}} \]
Once the connection criticalities have been determined, T-VPACK determines which BLE
will be chosen as the seed for a new cluster by selecting the BLE attached to the net with the
highest criticality. Once a seed BLE has been selected, an attraction function is applied to
calculate the next BLE that would be the best candidate for inclusion in the current cluster.
The attraction function for a BLE, B, towards a cluster, C, is given by:
\[ \mathrm{Attraction}(B) = \lambda \cdot \mathrm{Criticality}(B) + (1 - \lambda) \cdot \frac{|\mathrm{Nets}(B) \cap \mathrm{Nets}(C)|}{\mathrm{MaxNets}} \]
The BLE with the highest attraction will be the next candidate, and λ is a parameter which
determines whether T-VPACK should be fully timing driven or should maximize input sharing. If λ
is 1 then T-VPACK attempts to minimize delay without regard to maximizing input pin sharing,
and if λ is 0 then T-VPACK will attempt to minimize the number of used inputs. [MBR99]
demonstrated that setting λ to a value between 0.4 and 0.8 was best.
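The two quantities that drive T-VPACK can be sketched numerically. The sketch assumes the slack-based criticality 1 - Slack(i)/MaxSlack and reads the attraction as a weighted sum of criticality and the shared-net count normalized by MaxNets; the function names and the default weight of 0.75 (chosen only because it lies inside the 0.4 to 0.8 range quoted above) are our assumptions.

```python
def criticality(slack, max_slack):
    """Assumed criticality: 1 for zero slack, 0 for the largest slack."""
    return 1.0 - slack / max_slack


def tvpack_attraction(ble_criticality, shared_nets, max_nets, lam=0.75):
    """Weighted sum of timing criticality and normalized net sharing.

    lam = 1 makes the choice purely timing driven; lam = 0 makes it
    depend only on how many nets the BLE shares with the cluster.
    """
    return lam * ble_criticality + (1.0 - lam) * shared_nets / max_nets


if __name__ == "__main__":
    crit = criticality(slack=2.0, max_slack=10.0)
    for lam in (0.0, 0.5, 1.0):
        print(lam, tvpack_attraction(crit, shared_nets=1, max_nets=4, lam=lam))
```

Sweeping lam in the usage example shows how the same candidate BLE is scored more by input sharing at one extreme and more by timing at the other.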
Let: UnclusteredBLEs be the set of BLEs not contained in any cluster
     C be the set of BLEs contained in the current cluster
     LogicClusters be the set of clusters (where each cluster is a set of BLEs)

UnclusteredBLEs = PatternMatchToBLEs (LUTs, Registers);
LogicClusters = NULL;
while (UnclusteredBLEs != NULL) {    /* More BLEs to cluster */
    C = GetMostCriticalBLE (UnclusteredBLEs);
    BLEsSinceLastCriticalityRecompute++;
    while (|C| < N) {    /* Cluster is not full */
        if (BLEsSinceLastCriticalityRecompute >= RecomputeInterval) {
            ComputeCriticalities();
            BLEsSinceLastCriticalityRecompute = 0;
        }
        BestBLE = MaxAttractionLegalBLE (C, UnclusteredBLEs);
        if (BestBLE == NULL)    /* No BLE can be added to cluster */
            break;
        UnclusteredBLEs = UnclusteredBLEs - BestBLE;
        C = C ∪ BestBLE;
        BLEsSinceLastCriticalityRecompute++;
    }
    LogicClusters = LogicClusters ∪ C;
}
Figure 2.6: T-VPACK Algorithm [MBR99]
2.3 FPGA Logic Block Architecture
Now that the FPGA experimental methodology and design flows have been described, we
discuss earlier work which examined logic block architecture and its impact on FPGA density and
performance. This research and much of the relevant prior work was based on a cluster-based
island-style FPGA. The structure of the cluster-based logic block is illustrated in Figure 2.7.
Each cluster contains N basic logic elements (BLEs) fed by I cluster inputs. The BLE,
illustrated in Figure 2.7(a), typically consists of a K-input lookup table (LUT) and register, which
feed a two-input multiplexer that determines whether the registered or unregistered LUT
output drives the BLE output. For clusters containing more than one BLE, "full connectivity" is
assumed. This means that all I cluster inputs and N outputs can be programmably connected to
each of the K inputs on every LUT. These connections are implemented using the multiplexers
shown in the figure, which are unnecessary for a cluster of size 1. The Altera Flex 6K, 8K, 10K
[Inc98a] and Xilinx 5200 [Inc] and Virtex [Inc98b] are commercial examples of such clusters
(although the Xilinx logic clusters are not fully connected).
2.3.1 LUT Size
There have been several studies in the past which examined the effect of logic block
functionality on the area and performance of FPGAs. The use of large LUTs generally produces good
performance results since there are fewer BLEs on any given critical path. However,
because the logic block area grows exponentially with the number of LUT inputs, these
larger blocks are very expensive. With this in mind, the key to any FPGA logic block study
is to balance these two opposing factors and determine an appropriate LUT size which gives
both good performance and logic density. For example, the work in [RFLC90] and [KG92a]
showed that a LUT size of 4 is the most area efficient in a non-clustered context. One of the key
observations from this study was that the FPGA area was mainly dominated by the inter-cluster
routing area, and even though increasing logic functionality resulted in fewer logic blocks this
could easily be offset by the increase in routing area due to the larger number of external pin to
pin connections between blocks. Hence, it was concluded in [RFLC90] that logic blocks with
high functionality per connected pin produced the best area results. Conversely, evaluation of
FPGA performance was performed in [Sin91], [SRCL92] and [KG91]. It was observed that
using a LUT size of 5 to 6 gave the best performance. The FPGA performance studies tended
to favour larger LUTs since they resulted in fewer levels of logic on the critical path. Since
the routing delay was larger than the logic delay, this generally led to positive delay results. A
recent publication m K C 9 9 ) has suggested that using a heterogeneous mixture of LUT sizes
of 2 and 3 was equivalent in area efficiency to a LUT size of 4. In addition, [ACSG+99] states
that a logic structure using two 3-input LUTs was most beneficial in terms of area. However, it
must be noted that neither of these last two papers performed a full area or delay study in which
a range of LUT sizes was examined.
[Figure 2.7: Structure of (a) Basic Logic Element (BLE) and (b) Logic Cluster [BRM99]. Panel (a) shows the K-input LUT and the D flip-flop, driven by Clock, that feed the BLE output; panel (b) shows the logic cluster.]
2.3.2 Cluster Size
In general, clusters can be classified using four parameters [BRM99]: the number of inputs that
each BLE has (K), the number of BLEs within each cluster (N), the number of distinct cluster
input pins feeding each cluster (I) and the number of distinct clocks per cluster (M_clk). It was
experimentally shown in [BRM99] that, given a cluster with a single clock and assuming a LUT
size of 4, clusters of size 4 to 10 produced the best area-delay results using non-timing-driven
packing (VPACK). Later on, a separate study [Mar99] [MBR99] based on a timing-driven
packing algorithm (T-VPACK) found that clusters of sizes 7 to 10 were best in terms of
area-delay product.
2.4 Summary
This chapter outlined some of the key results from the past in FPGA logic block architecture.
Also, background information concerning logic synthesis, technology mapping and packing
was provided. The rest of the thesis will elaborate on the area and delay results for various
logic block architectures.
CHAPTER 3
FPGA Logic Block Architecture
This chapter explores the effect of logic block functionality on FPGA performance and density.
In particular, in the context of lookup table, cluster-based island-style FPGAs [BRM99] we
look at the effect of lookup table (LUT) size and cluster size (number of LUTs per cluster) on
the speed and logic density of an FPGA.
We use a fully timing-driven experimental flow as described in Section 2.1 [BRM99]
[Mar99] in which a set of benchmark circuits are synthesized into different cluster-based
[BR97] [BR98] [Mar99] logic block architectures, which contain groups of LUTs and
flip-flops. We look across all architectures with LUT sizes in the range of 2 inputs to 7 inputs, and
cluster sizes from 1 to 10 LUTs. In order to judge the quality of each architecture we both
perform detailed circuit-level design and measure the demand for routing resources for every
circuit in each architecture.
3.1 FPGA Architecture Modeling
In this section we give a brief description of the FPGA architecture and delay modeling used in
our work. The level of detail present in these models goes far beyond any modeling previously
used in this kind of experimental analysis. All device parameters and circuits are modeled
using SPICE simulations of a 0.18 µm CMOS process.
We make the following assumptions about the basic island-style architecture:
• The number of routing tracks in each channel between logic blocks is uniform throughout
the FPGA.
• All metal routing wires are placed on metal layer 3 with minimum width and spacing.
• Each circuit is mapped into the smallest square (M x M) grid possible given the number
of logic clusters it requires.
However, it is important to note that the area metric we count is not the total area required
by the square M x M block on the FPGA. Rather, we use the exact number of clusters
required to implement the circuit. For example, a circuit which requires 800 logic blocks
will be routed in a 29 x 29 FPGA grid, which contains 841 blocks. We use the area of the
logic and routing surrounding 800 clusters as opposed to 841.
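The grid-sizing rule and the 800-cluster example above can be checked with a short calculation. The function name is ours and the sketch covers only the arithmetic, not how the area of the used clusters is computed.

```python
import math


def grid_side(num_clusters):
    """Side M of the smallest square M x M grid holding num_clusters."""
    if num_clusters <= 0:
        return 0
    # Integer form of ceil(sqrt(num_clusters)), avoiding float rounding.
    return math.isqrt(num_clusters - 1) + 1


if __name__ == "__main__":
    side = grid_side(800)
    print(side, side * side, side * side - 800)
```

For 800 clusters this gives a 29 x 29 grid with 841 positions, of which 41 are unused and therefore excluded from the area metric.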
3.1.1 Logic Circuit Design and Delay Model
The circuit design process described above is also necessary to determine accurate delay
measurements of the final placed and routed circuit. In deep-submicron IC design processes, the
effect of wire resistance and capacitance becomes more prevalent. We account for these effects
in this delay modeling. Figure 3.1 shows the detailed logic block circuit. All circuit design
was done in TSMC's 0.18 µm, 1.8 V CMOS process. The paths have been simulated with their
actual loads in place and the input driven by what would actually be driving it in a real FPGA.
As the cluster size increases, the buffers shown in Figure 3.1 must be sized larger because
of larger loading from the internal muxes, which results in an increase in the basic BLE delay.
This is shown in Table 3.1 which gives the logic delays as the cluster size increases for the
paths indicated in Figure 3.1 for a BLE based on a 4-input LUT.
[Figure 3.1: Structure and Speed Paths of a Logic Cluster [BRM99]. The figure shows the input connection block buffers and muxes, the local buffers, the local routing muxes, and the N BLEs of the logic cluster.]

Table 3.1: Logic Cluster Delays for 4-input LUT Using 0.18 µm CMOS process
Cluster Size (N)      A to B (ps)    B to C and D to C (ps)    to (ps)
1 (No local muxes)    377            180                       376
2                     377            221                       385
4                     377            301                       401
6                     377            332                       397
8                     377            331                       396
10                    377            337                       387

Similarly, the design of the larger LUTs must be done carefully, with proper buffer sizing
and, in some cases, insertion of buffers within the tree of pass-transistors. Table 3.2 gives the
LUT delay as a function of the LUT size.

Table 3.2: LUT Delays Using 0.18 µm CMOS process
LUT Size (K)
3.1.2 Routing Architecture
The target routing architecture of the CAD flow used in these experiments is one that Betz et
al. [BRM99] indicate is a good choice. This architecture has the following parameters:
• Routing segments have a logical length of four (the logical length of a segment is defined
as the number of logic block clusters that it spans).
• 50% of these segments use tri-state buffers as the programmable switch and 50% use
pass transistors.
The experiments conducted in [BRM99] were based on a LUT size of four and a cluster
size of four. We will assume these results are valid for all the LUT sizes and cluster sizes that
we are comparing.
However, the LUT and cluster size does affect the sizing of the buffers used to drive the
programmable routing, both from the block itself and the tri-state buffers internal to the
programmable routing. As the logic block cluster increases in size, the size of each logic tile is
larger, and therefore the length of the wires being driven by each buffer increases. Since this
increases the capacitive loading of each wire, the buffers must be sized appropriately. Betz
[BRM99] indicates that for a cluster size of four and a LUT size of four, the best routing pass
transistor width was ten times the minimum width, while the best tri-state buffer size was only
five times the minimum. We size our buffers in direct proportion to the length of this tile. That
is, if the tile length has doubled, then we double the size of the routing buffers.
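The proportional sizing rule can be written down directly. In the sketch below the base buffer size of five times minimum is the [BRM99] value quoted above for N = 4 and K = 4; the function name, the choice to pass tile lengths in as plain numbers, and the restriction to the tri-state buffer are our assumptions, since the text does not say how tile length is obtained or whether the pass transistors are scaled in the same way.

```python
# Tri-state buffer size (multiples of minimum) reported for N = 4, K = 4.
BASE_TRISTATE_BUFFER_SIZE = 5.0


def scaled_buffer_size(tile_length, base_tile_length):
    """Scale the routing buffer in direct proportion to tile length.

    tile_length and base_tile_length must use the same (arbitrary) unit;
    base_tile_length is the tile length of the N = 4, K = 4 architecture.
    """
    if tile_length <= 0 or base_tile_length <= 0:
        raise ValueError("tile lengths must be positive")
    return BASE_TRISTATE_BUFFER_SIZE * tile_length / base_tile_length


if __name__ == "__main__":
    print(scaled_buffer_size(1.0, 1.0), scaled_buffer_size(2.0, 1.0))
```

Doubling the tile length doubles the buffer size, from five to ten times minimum in this example.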
3.2 Experimental Results
In this section we present the experimental results of synthesizing benchmark circuits through
the CAD flow described in Section 2.1 with the area and delay modeled as described in Section 3.1.
The benchmark circuits used in these experiments were the 20 largest from MCNC [Yan91]
along with 8 new benchmarks ¹. Table 3.3 gives a description of the circuits, including the
name, number of 4-input LUTs and number of nets.
Each circuit was mapped, placed and routed with LUT size varying from 2 to 7 and cluster
sizes from 1 to 10. With 6 different LUT sizes and 10 different cluster sizes this gives a total
of 60 distinct architectures.
3.2.1 Cluster Inputs Required vs. LUT and Cluster Size
Before answering the principal questions raised in the introduction, we need to determine an
appropriate value for I, the number of logic block cluster inputs. The value of I should be a
function of K (the LUT size) and N (the number of LUTs in a cluster). This is of concern
since the larger the number of inputs, the larger and slower the multiplexers feeding the LUT
inputs will be, and the more programmable switches will be needed to connect externally to the
logic block. Indeed, one of the principal advantages of fully-connected clusters is that they
require fewer than the full number of inputs (K x N) to achieve high logic utilization. There
are several reasons for this:
¹ The 8 new benchmarks are from the University of Toronto, from two computer vision applications. The 8 benchmarks are {display_chip, img_calc, img_interp, input_chip, peak_chip, scale125_chip, scale2_chip, warping}.
Table 3.3: MCNC Benchmark Circuit Descriptions
Circuit          # of 4-Input BLEs    Number of Nets
alu4             1522                 1536
apex2            1878                 1916
apex4            1262                 1271
bigkey           1707                 1936
clma             8383                 8445
des              1591                 1847
diffeq           1497                 1561
dsip             1370                 1599
elliptic         3604                 3735
ex1010           4598                 4608
ex5p             1064                 1072
frisc            3556                 3576
misex3           1397                 1411
pdc              4575                 4591
s298             1931                 1935
s38417           6406                 6435
s38584.1         6447                 6485
seq              1750                 1791
spla             3690                 3706
tseng            1047                 1099
display_chip     1794                 2419
img_calc         20141                10180
img_interp       2727                 2769
input_chip       807                  841
peak_chip        809                  840
scale125_chip    2632                 2654
scale2_chip      1189                 1202
warping          1353                 1394
• Some of the inputs are feedbacks from the outputs of LUTs within the same cluster,
saving inputs.
• Some inputs are shared by multiple LUTs in the cluster.
• Some of the LUTs do not require all of their K inputs to be used. Indeed this is often the
case, as pointed out in WKC991.
Betz and Rose [BR97] [BR98] showed that when K=4 and I is set to the value 2N+2, then
98% of all of the 4-LUTs in a cluster would typically be used. We would like to find a similar
relation, but one that includes the variable K.
To determine this relation, we ran several experiments, using only the first three steps
illustrated in Figure 2.2: logic synthesis, technology mapping and packing. For each possible
value of N and K, we ran experiments varying the value of I (the maximum number of inputs
to the cluster allowed by the packer) from 1 to K x N. Following [BR97] we chose the lowest
value of I that provided 98% utilization of all of the BLEs present in the circuit. Figure 3.2 is
a plot of the relationship between the number of inputs (I) required to achieve 98% utilization
and the cluster size (N) and the LUT size (K). Typically, the value of I must be between 50 and
60% of the total possible BLE inputs, K x N.
By inspection we have generalized the relationship as:
\[ I = \frac{K}{2}\,(N + 1) \]
This equation provides a close fit to the results in Figure 3.2. The average percentage error
across all possible data points is only 10.1% with a standard deviation of 7.69%.
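As a quick consistency check, the sketch below assumes the generalized relationship has the form I = (K/2)(N+1), which is the form that appears in the multiplexer-size expression later in this chapter, and confirms that it reduces to the Betz and Rose value of 2N+2 when K = 4. The function names are ours, and since the text does not say how non-integer values are rounded the result is left unrounded.

```python
def cluster_inputs(k, n):
    """Assumed relationship I = (K / 2) * (N + 1), left unrounded."""
    return k * (n + 1) / 2


def input_fraction(k, n):
    """I as a fraction of the K * N total BLE inputs in the cluster."""
    return cluster_inputs(k, n) / (k * n)


if __name__ == "__main__":
    for n in (4, 5, 8, 10):
        print(n, cluster_inputs(4, n), 2 * n + 2, round(input_fraction(4, n), 3))
```

Under this form the fraction I / (K x N) equals (N+1) / (2N), independent of K, so it is 60% at N = 5 and 55% at N = 10.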
3.2.2 Area as a Function of N and K
In this section we present and discuss the experimental results that show the area of an FPGA
as a function of N and K. Note that I was set to the value determined in the previous section.
These results are for the 28 benchmark circuits. Area, as discussed above, is measured in terms
of the total number of minimum-width transistors required to implement all of the logic and
routing.
[Figure 3.2: Number of Inputs Required for 98% Logic Block Utilization. Number of inputs plotted against Cluster Size (N), 1 to 10.]
Total Area
Figures 3.3 and 3.4 give a plot of the geometric average (across all 28 circuits) of the total area
required as a function of cluster size and LUT size. Several observations can be made from
this data:
• LUT sizes of 4 and 5 are the most area-efficient for all cluster sizes.
• There is a reduction in total area when the cluster size is increased from 1 to 3 for all
LUT sizes. However, as clusters are made larger (N > 4) there is very little impact on
total FPGA area. Figure 3.4 demonstrates this behaviour very well. It is generally
expected that increasing logic block functionality should result in more BLEs being added
to a cluster, so that connections that normally would have been routed externally are now
absorbed internal to the cluster. This should reduce the inter-cluster area, which is usually
much higher than the intra-cluster area, and thus have a positive impact on total area.
However, the reason the total FPGA area doesn't decrease is that increasing logic
capacity (more input and output pins) results in an increase in track count. It must also
be remembered that the logic block area is also increasing due to the LUT area and the
multiplexer area. So the area savings are not as significant as might be expected, and
Figures 3.3 and 3.4 confirm this fact.
[Figure 3.3: Total Area for Clusters of Size 1 to 5. Total area plotted against LUT Size (2 to 7), with one curve per cluster size from 1 to 5.]
It is instructive to break out the components of the data in Figures 3.3 and 3.4 in order to
achieve both insight and inspiration on how to make more area-efficient FPGAs. The total area
can be broken into two parts, the logic block area (including the muxes inside the clusters)
and the routing area, which is the programmable routing external to the clusters. Throughout
the rest of this thesis, these will be referred to as the intra-cluster area and inter-cluster area
respectively.
Figure 3.4: Total Area for Clusters of Size 6 to 10
We will first explore the intra-cluster area. Figure 3.5 shows the total intra-cluster area
component of the total area (again, geometrically averaged over the 28 circuits) as a function
of the LUT size. The data shows that the intra-cluster area increases as K increases. This area
is the product of the total number of clusters and the area per cluster. A plot of these two
components for a cluster size of 1 is given in Figure 3.6.
The logic block area grows exponentially with LUT size as there are 2^K bits in a K-input
LUT. In addition, larger LUT sizes require larger intra-cluster multiplexers because the number
of inputs to each multiplexer is (I + N) = (K/2 (N+1) + N). As K increases, though, the number of
clusters decreases (because each LUT can implement more of the logic function) as shown by
the downward curve in Figure 3.6.
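The two growth terms named above can be tabulated for the LUT sizes studied. This sketch only evaluates the stated expressions, 2^K configuration bits per LUT and I + N inputs per intra-cluster multiplexer with I taken as (K/2)(N+1); it does not model transistor area, and the function names are ours.

```python
def lut_config_bits(k):
    """Configuration bits in one K-input LUT."""
    return 2 ** k


def local_mux_inputs(k, n):
    """Inputs per intra-cluster multiplexer, I + N, with I = (K/2)(N+1)."""
    return k * (n + 1) / 2 + n


if __name__ == "__main__":
    n = 4  # example cluster size
    for k in range(2, 8):
        print(k, lut_config_bits(k), local_mux_inputs(k, n))
```

For N = 4 the LUT bits grow from 4 to 128 as K goes from 2 to 7, while the multiplexer width grows only linearly, from 9 to 21.5 inputs.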
However, the rate of decrease in the number of logic blocks is far outweighed by the
increase in the size of the block as K increases, and hence the upward trend in Figure 3.5.
Figure 3.7 decomposes the logic block area into two parts: a) intra-cluster multiplexer area and b)
LUT area. The results illustrate that the local intra-cluster routing area cannot be ignored and
can be quite significant for larger clusters.
[Figure 3.5: Total Logic Block Cluster Area. Intra-cluster area plotted against LUT Size, with one curve per cluster size from 1 to 10.]
[Figure 3.6: Number of Clusters and Cluster Area Versus K (for N=1). Both quantities plotted against LUT Size.]
[Figure 3.7: Intra-cluster Multiplexer Area and LUT Size. LUT area and multiplexer area for several cluster sizes, plotted against LUT Size (2 to 7).]
Observing the absolute values in Figures 3.3 to 3.5, we see that the intra-cluster area
typically takes up only about 25% to 35% of the total area, except when the LUT size reaches
6 and 7, at which point intra-cluster area becomes a dominant factor.
The key effect, as always in FPGAs, is with the routing area. Figure 3.8 is a plot of the total
inter-cluster routing area as a function of the LUT size and cluster size. The figure shows that
the routing area decreases in a linear fashion with increasing LUT size. This particular result
is interesting since previous work from [RFLC90] has shown that the routing area achieved a
minimum between K=3 and K=4, and increased for values of K beyond this.
To explain this observed behavior, consider Figure 3.9, which decomposes the total routing
area into two separate components: the number of clusters and the routing area per cluster.
These curves are given for a cluster size of 1, but are representative for all cluster sizes. The
product of these two curves gives the total inter-cluster routing area. The reason why the
routing area decreases linearly with LUT size is that as we increase the LUT size, the number
of clusters decreases much faster than the rate at which the routing area per cluster increases.
The routing area per cluster grows only slightly with increasing LUT size because
there is very little change in channel width as the LUT size is varied. This is shown in Tables 3.4
and 3.5, where it can also be seen that increasing cluster size leads to larger average channel
widths. The channel width is the major determining factor in inter-cluster area. The difference
between the results from [RFLC90] and our current results can be attributed to the fact that we
are now using better CAD tools with more sophisticated algorithms; in particular the quality of the
placement tool and the routing tool is significantly better, and they use significantly less wiring.
In addition, for clustered logic blocks, more of the routing is being implemented within the
cluster itself.
Figure 3.8: Routing Area (x-axis: LUT Size)
3.2.3 Performance as a Function of N and K
The second key metric for FPGAs is their speed, measured by the critical path delay. The
total critical path delay is defined as the total delay due to the logic cluster combined with the
routing delay. Figure 3.10 shows the geometric average of the total critical path delay across
all 28 circuits as a function of the cluster size and LUT size. Observing the figure, it is clear
that increasing N or K decreases the critical path delay. These decreases are significant: an
architecture with N=1 and K=2 has an average delay of 45 ns while K=7 and N=10 has an
average critical path delay of just 14 ns. There are two trends that explain this behavior. As the
LUT and cluster size increase:
• the delay of the LUT and the delay through a cluster increase
• the number of LUTs and clusters in series on the critical path decreases
We will discuss these effects in more detail below.
Figure 3.9: Number of Clusters and Routing Area Per Cluster Versus K (for N=1)
It is instructive to break the total delay into two components: intra-cluster delay (which
includes the delay of the muxes and LUTs), and inter-cluster delay.
Figure 3.11 shows the portion of the critical path delay that comes from the intra-cluster
delay as a function of K and N. There are two key points to observe here. First, the intra-cluster
delay decreases as the LUT size increases. This is due to the fact that there is a reduction in
the number of BLE levels on the critical path and hence there will be fewer logic levels to
implement. This will translate into a reduction in intra-cluster delay. Figure 3.12 illustrates
this concept more clearly at the BLE level: it is a plot of BLE delay and number of BLEs on
the critical path versus LUT size for a cluster size of 1. The number of BLE levels decreases
more quickly than the BLE delay increases, and hence the logic delay decreases. The second
behaviour that should be noticed is that the intra-cluster delay increases for any given LUT size
as the cluster size is increased. This is because the intra-cluster muxes get larger and therefore
slower. However, the delay through these muxes is still much smaller than the inter-cluster delay,
as shown in Figure 3.11.
Figure 3.13 shows the portion of the critical path delay that comes from the inter-cluster
routing delay as a function of K and N.
As K increases there are fewer LUTs on the critical path, and this translates into fewer
inter-cluster routing links, thus decreasing the inter-cluster routing delay. Similarly, as N is
increased, more connections are captured within a cluster, and again, the inter-cluster routing
delay decreases.
Table 3.4: Channel Width vs. LUT and Cluster Size (1 to 5)
Cluster Size (N)    LUT Size (K)    Channel Width (geometric avg.)
1                   2               15.9
Table 3.5: Channel Width vs. LUT and Cluster Size (6 to 10)
Cluster Size (N)    LUT Size (K)    Channel Width (geometric avg.)
6                   2               40.9
Figure 3.10: Total Delay for Clusters of Size 1 to 10 (x-axis: LUT Size; one curve per cluster size)
Figure 3.11: Total Intra-Cluster Delay for Clusters of Sizes 1 to 10 (x-axis: LUT Size; one curve per cluster size)
In discussing these trade-offs, it is useful to follow an explicit example: Table 3.6 shows
how the delay through one BLE and multiplexer stage (delay from B to D on Figure 3.1) rises
from 0.556 ns to 0.702 ns when going from K=4 and N=1 to K=4 and N=4. Although the
number of BLE levels on the critical path remains fairly constant since we have not modified
K, the total logic delay increases from 5.58 ns to 6.65 ns due to the increase in the local cluster
routing multiplexers. However, since there are now 4 BLEs in every cluster as opposed to a
single BLE, more logic is implemented internally within the clusters. Nets that normally would
have been routed externally are now internal to the clusters. This translates into a reduction in
the inter-cluster routing delay from 19.48 ns when using K=4, N=1 to 11.72 ns for K=4 and
N=4. The total critical path delay decreases from 25.91 ns to 19.35 ns as originally shown in
Figure 3.10.
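The trade-off in this example is easier to see as relative changes. The following is a minimal sketch that only re-expresses the delays quoted in the paragraph above as percentages; the helper name `percent_change` and the dictionary layout are illustrative choices, not part of the original work.

```python
def percent_change(before: float, after: float) -> float:
    """Signed percentage change from `before` to `after`."""
    return 100.0 * (after - before) / before


# Delays in ns for K=4, as quoted in the text: (N=1 value, N=4 value).
delays_ns = {
    "BLE + mux stage": (0.556, 0.702),
    "intra-cluster": (5.58, 6.65),
    "inter-cluster": (19.48, 11.72),
    "total critical path": (25.91, 19.35),
}

changes = {name: percent_change(b, a) for name, (b, a) in delays_ns.items()}

for name, pct in changes.items():
    print(f"{name:>20}: {pct:+.1f}%")
```

Under these figures the intra-cluster delay grows by roughly 19% while the inter-cluster delay shrinks by roughly 40%, which is why the total still improves by about 25%.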
Table 3.6: Critical Path Delay Comparison for K=4
                                   N=1
BLE + Mux Delay                    0.556 ns
Avg # of BLEs on Critical Path     9.53
Total Intra-Cluster Delay          5.58 ns
Total Inter-Cluster Delay          19.48 ns
Total Delay                        25.91 ns
In general, inter-cluster routing delay is much larger than the intra-cluster delay, and hence
the value of increasing the cluster or LUT size. However, it is interesting that increasing cluster
size has little impact after a certain point (for N > 3). Figure 3.10 shows this clearly: for
any fixed LUT size, the majority of the improvement in critical path delay occurs as the
cluster size is increased from 1 to 3. Any further increase in cluster size results in only a
minimal delay improvement. This behaviour suggests that clustering has little effect after a
certain point. This is counter-intuitive: we would expect that employing larger clusters
should always reduce the critical path. Although the total delay results from Figure 3.10 do not
contradict this, what was surprising was how little improvement in total delay was
achieved with larger clusters. To better understand this situation, it is sufficient to examine the
number of logic block levels (i.e., cluster levels). Figure 3.14 shows the number of cluster levels
as a function of cluster and LUT size. The results clearly show the number of levels decreasing
with increasing cluster and LUT sizes. But, for any given LUT size it can be seen that most
of the reduction in the number of levels occurs as the cluster size is increased from 1 to 3.
Also, recall that the majority of the critical path delay was reduced in this range (as shown
in Figure 3.10). The direct relationship between the number of cluster levels on the critical
path and the final total delay is no coincidence. Fewer logic blocks on the critical path lead
to improved performance. The main reason that the total delay did not improve significantly
as we varied the cluster size from 3 to 10 was that there was no significant reduction in the
number of logic block levels. Without a reduction in the number of inter-cluster levels on the
critical path we cannot expect improvements in FPGA performance.
Figure 3.12: Number of BLEs on Critical Path and BLE Delay vs. K (for N=1)
Figure 3.13: Total Inter-Cluster Delay for Clusters of Size 1 to 10 (x-axis: LUT Size; one curve per cluster size)
Figure 3.14: Number of Cluster Levels on Critical Path (x-axis: Cluster Size; one curve per LUT size from 2 to 7)
Another interesting trend to observe from Figure 3.14 is that increasing the cluster size has
less of an effect for architectures composed of larger LUTs. For example, increasing the cluster
size from 1 to 10 for a 2-input LUT architecture results in a 60% reduction in the number of
cluster levels on the critical path. Conversely, employing BLEs with a 7-input LUT and varying
the cluster size from 1 to 10 results in only a 22% reduction in logic levels. Hence, clustering
proves to be more effective for smaller LUTs. To understand this more clearly, we should
examine the average BLE fanout for every LUT size. Figure 3.15 shows this and, as we can see,
larger LUTs correlate with larger average fanout. The reason smaller LUTs had a better response
to larger cluster sizes was that each LUT had a relatively small fanout, and hence
adding an extra BLE to a cluster usually guaranteed some reduction in the number of logic
levels. The same cannot be said about larger LUTs since they have a much larger average
block fanout and it becomes much more difficult to ensure that any subsequent BLE addition
will result in fewer cluster levels on the critical path.
Figure 3.15: Average BLE Fanout (average block fanout versus LUT Size)
3.2.4 Area-Delay Product
So far, we have examined the effect of K and N on area and performance of FPGAs. As area
can often be traded for delay, it is instructive to look at the area-delay product. Figures 3.16
and 3.17 illustrate the area-delay product versus K and N. These plots show that LUT
sizes of 4 to 6 and clusters of 3 to 10 appear to give the best area-delay results.
Notice that area-delay decreases significantly as the LUT size is increased from 2 to 4 for
all cluster sizes. This is because the 2 and 3-input LUT architectures have poor delay, due to
the large number of BLE levels on the critical path, combined with total area requirements
that are slightly larger (by about 20%), and so they are a bad choice.
The area-delay product jumps for K=7 on average by 10-15%, principally because the huge
area cost of the 7-input LUT outweighs the modest performance gains it achieves.
This latter observation suggests that, if there was a way to achieve the depth properties of a
7-input LUT without paying the heavy area price, then such a 7-input function may well
be a good choice.
We have also observed that, for large clusters, a large portion of the delay is taken up by
the intra-cluster muxes. If this delay could be reduced somehow, then significant speed wins
could be achieved.
3.2.5 Summary
We have studied the effect that different logic block architectures have on FPGA area and
performance. The main results are summarized in Table 3.7. In addition, we experimentally
derived a relationship giving the number of cluster logic block inputs required to achieve 98%
utilization as a function of the LUT size, K, and the cluster size, N. This is
$I = \frac{K}{2} \times (N + 1)$, where I is the number of distinct cluster inputs.
Secondly, we have shown that small LUT sizes (2-input and 3-input LUTs) are not as area
efficient as the 4 and 5-input LUTs and their performance characteristics are very poor. If
area-delay is the main criterion, then the use of clusters of between 3 and 10 and LUT sizes
of 4 to 6 will produce the best overall results.
Finally, our work suggests two future directions: finding ways to reduce the number of
levels of logic without the expense of large LUTs, and reducing the delay of intra-cluster
multiplexers.
Figure 3.16: Area-Delay Product for Clusters of Size 1 to 10 (x-axis: LUT Size; one curve per cluster size)
Figure 3.17: Close-up View of Area-Delay Product for Clusters of Size 1 to 10 (x-axis: LUT Size)
Table 3.7: Summary of Best Area, Delay, and Area-Delay Results
Criteria      LUT Size (K)    Cluster Size (N)
Area          4 to 5          4 to 9
Delay         7               3 to 10
Area-Delay    4 to 6          3 to 10
CHAPTER 4
Hardwired Logic Blocks
This chapter explores the hardwired logic block (HLB) architecture introduced in Chapter 1
with the expectation that it will lead to an improved area-delay product compared to the
non-hardwired 2 to 7-input LUT clustered architectures. Recall that employing logic clusters
consisting of 3-10 BLEs and LUT sizes of 4-6 produced the best area-delay results. However, we
have shown that there was a 10-15% increase in the area-delay product when moving from
a 6-input LUT to a 7-input LUT. The increase in area-delay was due to the fact that the LUT
area grows exponentially with LUT size. Our main motivation for exploring HLB architectures
was to find a coarse grain logic block which had almost the equivalent logic functionality and
depth properties as a 7-input LUT but without the large area cost. The rest of this chapter will
examine one such architecture, a cascade of two 4-input LUTs, as illustrated in Figure 4.1, and
compare its area and performance to the non-hardwired architectures.
The HLB architecture in Figure 4.1 has several advantages over a 7-input LUT. First, there
is a considerable logic area savings since the 7-LUT requires $2^7 = 128$ SRAM configuration
bits while the cascaded 4-LUTs demand only $2 \times 2^4 = 32$ SRAM bits. Secondly, the LUT
delay is slightly less since the output of the first LUT is fed directly into the fastest LUT input of
the second stage LUT. This is possible because LUTs are implemented as a tree of transistors,
and therefore the inputs have different delays. Our goal in this chapter is to determine the area
and delay of the HLB structure shown in Figure 4.1 and compare it to the regular 4 to 7-input
LUT-based logic clusters.
Figure 4.1: Cascaded 4-LUTs
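The SRAM bit counts quoted above follow from the fact that a K-input LUT stores one configuration bit per truth-table row. A minimal sketch of that arithmetic (the function and variable names are illustrative, not from the thesis):

```python
def lut_config_bits(k: int) -> int:
    """SRAM configuration bits for one K-input LUT: one bit per truth-table row."""
    if k < 0:
        raise ValueError("LUT input count must be non-negative")
    return 2 ** k


single_7lut_bits = lut_config_bits(7)        # one 7-input LUT
cascaded_4lut_bits = 2 * lut_config_bits(4)  # two 4-input LUTs in cascade

print(single_7lut_bits, cascaded_4lut_bits)
print("ratio:", single_7lut_bits // cascaded_4lut_bits)
```

The cascade needs a quarter of the configuration bits, which is the source of the logic area saving; this count covers LUT contents only and ignores multiplexers, buffers and the flip-flop.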
4.1 Hardwired Architecture
In general, HLB architectures consist of one or more fully-interconnected HLBs as illustrated
in Figure 4.2. Each HLB is hardwired in a tree topology such that every LUT within the HLB
can only fan out to at most one other LUT within the same HLB. The root BLE in the HLB
is the only LUT that can implement sequential logic since its output can be registered by a D
flip-flop. The finer points of an HLB architecture are presented below.
4.1.1 Cluster Inputs (I)
The number of distinct cluster inputs is an important architectural issue that needs to be
addressed for HLBs. As discussed in Chapter 3 for non-hardwired architectures, just enough
inputs were used to produce 98% utilization of the LUTs. However, such high logic block
density is difficult to achieve in an HLB due to the inflexibility imposed by the hardwired
connections. For that reason it is assumed that the cascaded 4-LUT structure, which has 7
inputs, requires the same number of inputs as a 7-LUT. This way, the same relationship that
was determined earlier in Chapter 3 for 98% logic utilization can be used:
$$I = \frac{K}{2} \times (N + 1)$$
Here, N represents the number of HLBs per cluster and K is the number of input pins
accessible externally from the HLB. For the cascaded 4-LUT, K=7 and therefore,
$$I = \frac{7}{2} \times (N + 1)$$
Figure 4.2: Cluster Description (with HLBs)
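As a quick illustration of the input relationship $I = \frac{K}{2} \times (N + 1)$, the sketch below evaluates it exactly using rational arithmetic. The function name is hypothetical, and because this section does not say how a fractional result is rounded to a whole number of pins, the sketch deliberately leaves the value unrounded.

```python
from fractions import Fraction


def cluster_inputs(k: int, n: int) -> Fraction:
    """Distinct cluster inputs I = (K/2) * (N + 1), kept exact as a fraction."""
    return Fraction(k, 2) * (n + 1)


# Non-hardwired cluster of N = 4 BLEs built from 4-input LUTs.
print(cluster_inputs(4, 4))

# Cascaded 4-LUT HLB (K = 7 external inputs) for a few cluster sizes.
for n in (1, 2, 3):
    print(n, cluster_inputs(7, n))
```

With K=7 the result is a whole number only when N is odd, so any implementation would need a rounding rule that is an assumption beyond what is stated here.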
4.1.2 Tapping Buffers
The HLB structure in Figure 4.2 has 7 inputs and one output which can be registered or
unregistered. Notice that only LUT B can have its output accessed externally from the cluster
output pins. Since the output of LUT A is hardwired into LUT B, this restricts how and where
subsequent LUTs are placed within the HLB. In Figure 4.2, LUTs A and B may be adjacent to
each other as long as LUT A has a fanout of one and LUT B uses the output of LUT A as one
of its inputs. If this pattern does not appear frequently enough in the netlist, many LUTs within
HLBs will be unused. The result is a reduction in logic block density for the HLB-based FPGA.
The use of tapping buffers [Chu94] will help alleviate this problem and improve the density of
HLB-based FPGAs. Tapping buffers allow the LUT outputs within an HLB to be externally
accessed. For example, in Figure 4.3 the output of A can be accessed without propagating it
through LUTs B and C.
Figure 4.3: HLB Tapping Buffers
The tapping buffer saves two LUT delays and also allows other logic functions to be
implemented in LUTs B and C rather than simply programming them to propagate the output of
A and essentially wasting area. The possible benefits of tapping buffers are increased logic
density and improvements in speed. For these reasons, tapping buffers are included in the HLB
architecture studied in this chapter. Section 4.2.2 provides experimental justification for this
decision. We should keep in mind that tapping buffer-enabled HLB architectures lead to more
cluster output pins compared to HLBs without tapping buffers, as illustrated in Figure 4.4.
Figure 4.4: Comparison of HLBs with and without tapping buffers (a: HLB with no tapping buffers; b: HLB with tapping buffers)
4.1.3 Logical Equivalence of Cluster Outputs
Logical equivalence on the cluster output pins means the router can connect to any one of the
output pins and still reach any of the BLE or HLB outputs in the cluster. This added flexibility
can reduce the track count by up to 50% compared to an architecture where the output pins are
not logically equivalent. For clusters based on the non-tapping structure shown in Figure 4.4(a),
the outputs are logically equivalent because they are identical and the full connectivity of the
inputs allows them to be easily swapped. Figure 4.4(b) shows a tapping-buffer based HLB
architecture. Its outputs are not logically equivalent, since the outputs of LUTs A and
B cannot be swapped. Output pin logical equivalence for this structure can be achieved by
creating a full routing crossbar which connects all LUT and cluster outputs together as shown
in Figure 4.5. The crossbar contains N N:1 multiplexers, where N is the number of LUTs in
the cluster. The area cost of this output crossbar multiplexer is relatively small in comparison
to the total logic block area.
Figure 4.5: HLB Cluster Contents with Full Routing Crossbar for Output Signals
Table 4.1 shows the percentage of the logic block area that this full crossbar routing
multiplexer occupies for different cluster sizes. Even for large cluster sizes, the logic block area
overhead is less than 8%.
Table 4.1: Percentage of Logic Block Area that is Occupied by Output Routing Crossbar (columns: Cluster Size, Multiplexer Area)
4.2 HLB Packing
In order to explore HLBs, we need to synthesize circuits into HLB structures. This section
outlines a packing algorithm aimed at hardwired logic blocks. However, it should be noted that
there are several different ways of synthesizing circuits into HLBs.
• The entire HLB can be considered as a coarse grain target for technology mapping
during logic synthesis. The basic procedure here is to translate the set of boolean equations
that represent the circuit into logic gates and then perform technology mapping directly
into the HLB structure. This was the approach taken in [CD94] since it was believed that
technology mapping to HLBs focused more attention on optimizing hardwired
connections which could possibly lead to better results than the alternatives listed below.
• HLB packing could be performed during layout synthesis as a separate step before
placement and routing as is done in VPACK [BRM99]. [Chu94] also took this approach to
HLB packing where the circuit would first be mapped to a netlist of basic blocks
composed of LUTs and flip-flops. Afterwards, these LUTs and flip-flops are packed into
HLBs.
• Finally, packing could be done simultaneously with placement. The placer would have
the ability to move BLEs from one cluster to another in order to improve FPGA area
and/or delay.
This thesis does HLB packing prior to placement but after logic synthesis, as illustrated
in the flow given in Figure 4.6. Although the direct logic synthesis of gates into HLBs holds
the promise of the best results, this can be very complex and we chose the layout synthesis
approach due to time constraints.
Figure 4.6: HLB Packing Flow (recoverable stages: Circuit, Technology-Independent Logic Synthesis, Technology Mapping, Measure Area & Delay)
4.2.1 HLB Packing Algorithm
The goal of the HLB packer is to minimize some combination of area and delay. The main
priority is to pack highly critical LUTs into the same HLB to take advantage of the
"zero-delay" hardwired links. If this is not possible then critically related LUTs should be packed
into the same cluster. The HLB packing pseudo-code is illustrated in Figure 4.7. The first
several steps in the HLB packing algorithm are identical to the T-VPACK [MBR99] algorithm
described in Chapter 2. First, timing analysis is performed on the unclustered netlist of BLEs.
After timing analysis, the BLE with the highest criticality¹ is chosen as the seed of a new
cluster. This BLE is placed at the root of any of the HLBs; however, it may be shifted within
the HLB later on. With the seed BLE now packed, the next best BLE to pack into the HLB and
cluster is determined. To do this, the attraction function described in Chapter 2 and originally
defined in [MBR99] is used to find the next candidate BLE:
$$\mathrm{Attraction}(B) = \lambda \cdot \mathrm{Criticality}(B) + (1 - \lambda) \cdot \frac{|\mathrm{Nets}(B) \cap \mathrm{Nets}(C)|}{\mathrm{MaxNets}}$$
The attraction function determines which unclustered BLE B should be added to the current
cluster C. The parameter which controls the tradeoff between area and delay is λ. Setting λ to
1 results in a packing algorithm that focuses only on criticality. Conversely, setting λ to 0 results
in a packing algorithm which tries to maximize input sharing and hence FPGA logic density.
Our work uses a λ value of 0.75 as suggested in [MBR99]. Once the next BLE (BestBLE) to
be clustered has been determined we attempt to place it into one of the HLBs in the currently
open cluster. At this stage, there are three choices which are evaluated in order:
1. Place BestBLE in one of the non-empty HLBs within the current cluster, or
2. Place BestBLE into the next empty HLB in the current cluster, or
3. Place BestBLE into the next empty cluster.
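The attraction function used to pick BestBLE can be sketched in a few lines. This is an illustrative sketch, not the T-VPACK or thesis implementation: the candidate BLEs, net names, criticality values and the `MAX_NETS` normalisation constant are all hypothetical, and only the weight of 0.75 comes from the text.

```python
def attraction(criticality: float, ble_nets: set, cluster_nets: set,
               max_nets: int, lam: float = 0.75) -> float:
    """Attraction of an unclustered BLE B to the current cluster C.

    lam weights timing criticality against net sharing; 0.75 is the value
    quoted in the text. max_nets normalises the shared-net count.
    """
    if max_nets <= 0:
        raise ValueError("max_nets must be positive")
    shared = len(ble_nets & cluster_nets)
    return lam * criticality + (1.0 - lam) * shared / max_nets


# Hypothetical candidates: name -> (criticality, nets touched by the BLE).
candidates = {
    "b1": (0.90, {"n1", "n7"}),
    "b2": (0.60, {"n1", "n2", "n3"}),
    "b3": (0.20, {"n8"}),
}
cluster_nets = {"n1", "n2", "n3", "n4"}  # nets already touching cluster C
MAX_NETS = 5                             # assumed normalisation constant

scores = {name: attraction(crit, nets, cluster_nets, MAX_NETS)
          for name, (crit, nets) in candidates.items()}
best_ble = max(scores, key=scores.get)
print(best_ble, scores)
```

With the default weight the most critical candidate wins even though it shares fewer nets; calling `attraction(..., lam=0.0)` instead ranks candidates purely by shared nets.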
¹Refer to Chapter 2 for a definition of criticality.
Let: UnclusteredBLEs be the set of BLEs not contained in any cluster
     C be the set of HLBs contained in the current cluster
     HLBs[i] be the set of HLBs in the cluster
     num_HLBs be the number of HLBs in the current cluster which are NOT empty
     N be the maximum number of HLBs in a cluster
     LogicClusters be the set of clusters (where each cluster is a set of HLBs)

UnclusteredBLEs = PatternMatchToBLEs (LUTs, Registers);
LogicClusters = NULL;

while (UnclusteredBLEs != NULL) {   /* More BLEs to cluster */
    C = GetMostCriticalBLE (UnclusteredBLEs);

    while (|C| < maximum_LUT_capacity) {   /* Cluster is not full */
        /* BestBLE is the unclustered BLE with the highest attraction to C */
        if (BestBLE == NULL)   /* No BLE can be added to cluster */
            break;

        BLE_SUCCESSFULLY_ADDED = FALSE;   /* initialization */
        for (i = 0; i < num_HLBs; i++) {
            if (BestBLE_can_be_inserted_into_HLB (BestBLE, HLBs[i])) {
                HLBs[i] = HLBs[i] ∪ BestBLE;
                BLE_SUCCESSFULLY_ADDED = TRUE;
                break;
            }
        }

        if (BLE_SUCCESSFULLY_ADDED == FALSE) {
            /* Try placing BestBLE into an empty HLB if one exists */
            num_HLBs++;
            if (num_HLBs > N) {
                break;   /* No more empty HLB slots */
            } else {
                HLBs[num_HLBs - 1] = insert_into_HLB (BestBLE);
                HLBs[num_HLBs - 1] = HLBs[num_HLBs - 1] ∪ BestBLE;
            }
        }

        UnclusteredBLEs = UnclusteredBLEs - BestBLE;
        C = C ∪ HLBs;
    }

    LogicClusters = LogicClusters ∪ C;
}
Figure 4.7: Pseudo-code for HLB Timing Driven Packing
This algorithm tries packing BestBLE into one of the non-empty HLBs to determine if it
has a common net with any of the previously packed BLEs. BestBLE can be added to one
of the non-empty HLBs if a common net exists, the HLB is NOT full and only one of the LUTs
makes use of the register. BestBLE can still be added to an HLB in which no common net
exists as long as the logic block architecture contains tapping buffers. For example, Figure 4.8
shows a cascaded 4-LUT HLB where LUT F has been placed at the root. Due to the tapping
buffer and the fact that F only used 3 of its 4 LUT inputs, it is now possible to add another LUT
G which has no common connection with LUT F.
Figure 4.8: HLB with Cascaded 4-LUTs and Tapping Buffer
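The insertion rules and the order of placement choices described above can be sketched as follows. This is a deliberately simplified model and not the thesis packer: the `BLE` and `HLB` classes, the set-of-nets representation and the function names are assumptions made for illustration, and the sketch ignores details such as which LUT position inside the HLB a BLE occupies or the fanout of the hardwired net.

```python
from dataclasses import dataclass, field


@dataclass
class BLE:
    name: str
    nets: frozenset             # nets this BLE touches (inputs and output)
    uses_register: bool = False


@dataclass
class HLB:
    capacity: int = 2           # a cascaded 4-LUT HLB holds two LUTs
    bles: list = field(default_factory=list)


def can_insert(ble: BLE, hlb: HLB, tapping_buffers: bool) -> bool:
    """Legality check for adding `ble` to a non-empty `hlb`."""
    if len(hlb.bles) >= hlb.capacity:
        return False                               # HLB is full
    registers = sum(b.uses_register for b in hlb.bles) + ble.uses_register
    if registers > 1:
        return False                               # only one LUT may use the register
    shares_net = any(ble.nets & b.nets for b in hlb.bles)
    return shares_net or tapping_buffers           # no common net needs tapping buffers


def place(ble: BLE, cluster: list, max_hlbs: int, tapping_buffers: bool) -> bool:
    """Try non-empty HLBs first, then the next empty HLB.

    Returns False when neither works, meaning a new cluster is needed.
    """
    for hlb in cluster:
        if hlb.bles and can_insert(ble, hlb, tapping_buffers):
            hlb.bles.append(ble)
            return True
    if len(cluster) < max_hlbs:
        cluster.append(HLB(bles=[ble]))
        return True
    return False


f = BLE("F", frozenset({"a", "b", "c", "f_out"}))
g = BLE("G", frozenset({"x", "y", "g_out"}))       # shares no net with F

with_taps = [HLB(bles=[f])]
without_taps = [HLB(bles=[f])]
placed_with = place(g, with_taps, max_hlbs=1, tapping_buffers=True)
placed_without = place(g, without_taps, max_hlbs=1, tapping_buffers=False)
print(placed_with, placed_without)
```

In this toy setup the unrelated LUT G joins F's HLB only when tapping buffers are available, mirroring the Figure 4.8 example; without them G would have to start a new cluster.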
4.2.2 HLB Packing with Tapping Buffers
The example above illustrates how tapping buffers can improve FPGA density. We performed
two separate studies which examined the effectiveness of tapping buffer-enabled HLBs for
the cascaded 4-LUT structure shown in Figure 4.8. The first HLB architecture contained no
tapping buffers and the second had tapping buffers between all the LUTs in the HLBs. The 28
benchmark circuits used in Chapter 3 were synthesized and packed into HLBs. The results in
Table 4.2 show that tapping buffered architectures had an average logic block utilization²
rate of 88% compared to only 66% for non-tapping buffered architectures. This increased
logic density led to 26% fewer clusters on average as shown in Table 4.3.
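The 26% figure can be checked directly from the cluster counts reported in Table 4.3. The sketch below averages the per-cluster-size reduction; treating "on average" as an arithmetic mean over the ten cluster sizes is an assumption, and the variable names are illustrative.

```python
# Number of clusters for cluster sizes 1..10, as reported in Table 4.3.
clusters_no_tapping = [1729, 869, 573, 428, 337, 280, 239, 207, 183, 165]
clusters_with_tapping = [1233, 631, 418, 315, 251, 210, 179, 157, 139, 125]

# Fractional reduction in cluster count at each cluster size.
reductions = [1 - w / n
              for n, w in zip(clusters_no_tapping, clusters_with_tapping)]
mean_reduction_pct = 100 * sum(reductions) / len(reductions)

for size, r in enumerate(reductions, start=1):
    print(f"cluster size {size:2d}: {100 * r:.1f}% fewer clusters")
print(f"mean reduction: {mean_reduction_pct:.1f}%")
```

The per-size reductions all fall between 24% and 29%, and their mean rounds to the 26% quoted in the text.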
Table 4.2: HLB Cluster Utilization (with and without tapping buffers)
Cluster Size    % Utilization (no tapping)    % Utilization (with tapping)
1               64                            89
2               63                            87
3               64                            88
4               64                            87
5               65                            88
6               65                            87
7               66                            87
8               66                            87
9               67                            88
10              67                            88
Table 4.3: Number of Clusters with and without Tapping Buffers
Cluster Size    # of Clusters (no tapping)    # of Clusters (with tapping)
1               1729                          1233
2               869                           631
3               573                           418
4               428                           315
5               337                           251
6               280                           210
7               239                           179
8               207                           157
9               183                           139
10              165                           125
²Utilization refers to the percentage of LUTs within the cluster that are actually used.
4.3 Experimental Results
This section presents the area and delay results for the cascaded 4-LUTs after HLB packing,
placement and routing through the full timing-driven CAD flow given in Figure 2.2. This time,
however, the new HLB packing algorithm is used. Although there are many different HLB
structures, due to time constraints only the cascaded 4-LUT configuration was explored. The
20 largest MCNC -911 circuits were used along with 8 new benchmarks as described in
Chapter 3 and Table 3.3. Throughout this section, the results of the hardwired architecture
in Figure 4.1 will be compared to those of the conventional non-hardwired architecture. All
results are based on a 0.18 µm CMOS design, for which detailed circuit-level design of the
HLB-based logic clusters was performed.
4.3.1 Area Results
This section presents the total FPGA area results (logic cluster + external routing) of the
cascaded 4-LUT architecture in relation to the non-hardwired 4 to 7-LUTs shown in Figure 4.9.³
From this, two key observations about the data can be made:
• Increasing the cluster size from 4 to 10 in the cascaded 4-LUT architecture had a greater
negative effect on total area compared to a similar increase for the non-hardwired 4 to
7-LUTs.
• The cascaded 4-LUT requires less total area than the 7-input LUT, as anticipated,
although it is not as efficient as the 4-LUT.
Figure 4.9: Total Area Comparisons for Hardwired Arch. vs. Non-Hardwired (x-axis: Cluster Size; curves for hardwired 4-LUTs and 4 to 7-input LUTs)
³The results for the small grain 2 and 3-input LUTs are not included due to their poor area-delay product. It is more important to compare the hardwired results with the best non-hardwired architectures.
To further understand these observations, we have broken the total area results into two
components: i) inter-cluster and ii) intra-cluster area. These components are shown in
Figures 4.10 and 4.11. The inter-cluster area results show that the positive effects of clustering
on reducing routing area are diminished after a certain point. More specifically, increasing the
cluster size beyond 4 has very little impact on reducing inter-cluster area. The reason is
that nets were no longer being absorbed within the cluster after this point. If nets are not being
absorbed, then the BLEs attached to their terminals are in separate clusters and hence valuable
routing resources must be used to make the connections. As a result, there was no reduction in
inter-cluster area for any cluster size increase beyond 4.
The other component of the total area is the intra-cluster area shown in Figure 4.11. Here,
the cascaded 4-LUT demands more intra-cluster area than the 4 and 5-LUT non-hardwired
architectures. The hardwired logic block architecture requires a larger intra-cluster area for
several reasons:
1. There are now twice as many LUTs per cluster for any given cluster size compared to
the non-hardwired architectures.
2. The intra-cluster multiplexer area is larger since there are now more LUTs, which
translates into more inputs. Also, these extra LUTs result in more outputs which require a
full routing crossbar on the outputs to maintain logical equivalence on the cluster output
pins.
3. There is only 88% logic block utilization for the hardwired architecture compared to
98% for the non-hardwired architectures.
All these factors contributed to the increase in total FPGA area of the cascaded 4-LUT
architecture as the cluster size was increased from 4 to 10. Note that even though the hardwired
architecture demanded slightly more total area than the 4 to 6-LUTs, there was still up to 15%
in area savings compared to the 7-LUTs. Recall that the initial motivation for exploring these
hardwired structures was to reduce the large area requirements of the 7-LUT while still
maintaining its desirable performance characteristics. The results indicate that our initial predictions
were correct as far as area is concerned. The 15% area reduction could have been better had
it not been for the 5 circuits in the benchmark suite which required significantly more 4-LUTs
than 7-LUTs after technology mapping.⁴ A poor mapping would be any circuit that requires
more than two 4-input LUTs for every 7-LUT after technology mapping. Table 4.4 presents
the ratio of 7-LUTs to 4-LUTs after technology mapping for all the circuits with the inefficient
ones marked in bold. The determination of why these circuits map inefficiently is left for future
work.
Figure 4.10: Inter-Cluster Area Comparisons for Hardwired Arch. vs. Non-Hardwired (x-axis: Cluster Size)
Figure 4.11: Intra-Cluster Area Comparisons for Hardwired Arch. vs. Non-Hardwired (x-axis: Cluster Size)
Table 4.4: Comparison of number of 4-LUT to 7-LUT blocks after technology mapping
Circuit: bigkey, des, diffeq, dsip, elliptic, ex1010, seq, spla, tseng, display_chip, img_calc, img_interp, input_chip, peak_chip, scale125_chip, scale2_chip, warping
⁴The 5 circuits were bigkey, des, ex1010, s38417, img_calc.
4.3.2 Delay Results
The next step is to examine the hardwired logic block performance. Figure 4.12 compares the
critical path delays of the cascaded 4-LUTs to the regular non-hardwired architectures. The
results show that the cascaded 4-LUTs slightly out-perform the pure 4-LUTs but are worse
than the 5 to 7-LUTs.
Figure 4.12: Total Critical Path Delay Comparisons for Hardwired Arch. vs. Non-Hardwired (x-axis: Cluster Size)
To understand this situation more clearly, we should examine the number of BLEs on the
critical path and the number of cluster levels on the critical path. The number of BLEs simply
represents the number of BLEs in series traversed along the critical path. Similarly, the number
of clusters represents how many inter-cluster levels exist on the critical path. There is a
direct relationship between the number of cluster levels and the critical path delay: generally,
the more levels there are, the larger the delay, since in modern FPGAs the majority of the delay
is still due to the external cluster routing. Figures 4.13 and 4.14 illustrate the number of
BLEs and cluster levels respectively as a function of LUT and cluster size. The trend shows
that more coarse-grain logic structures result in fewer BLE and cluster levels on the critical
path. These larger BLEs tend to absorb more connections internally and produce fewer logic
blocks. It can be seen that the cascaded 4-LUTs have more cluster levels than the 5 to 7-LUTs,
and this is the major reason for the increase in total delay.
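The two path metrics described above can be illustrated with a small sketch. It assumes a critical path is available as an ordered list of (BLE name, cluster id) pairs, which is a hypothetical representation chosen for this example, and it assumes a new cluster level begins every time the path moves to a different cluster. The counting in the actual CAD flow may differ in detail.

```python
from itertools import groupby
from typing import List, Tuple

# (BLE name, id of the cluster that contains it); hypothetical format.
Hop = Tuple[str, int]


def ble_levels(path: List[Hop]) -> int:
    """Number of BLEs traversed in series along the path."""
    return len(path)


def cluster_levels(path: List[Hop]) -> int:
    """Number of maximal runs of consecutive BLEs that share a cluster.

    Each run corresponds to one cluster level; moving between runs
    requires a trip through the inter-cluster routing.
    """
    return sum(1 for _cluster_id, _run in groupby(path, key=lambda hop: hop[1]))


# Illustrative path: six BLEs spread over three cluster levels.
critical_path: List[Hop] = [
    ("b0", 0), ("b1", 0),
    ("b2", 3), ("b3", 3), ("b4", 3),
    ("b5", 1),
]

print(ble_levels(critical_path), cluster_levels(critical_path))
```

Under this definition, larger clusters and larger LUTs reduce both counts because more of the path is absorbed inside a single block.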
Figure 4.13: Number of BLEs on the Critical Path Comparisons for Hardwired Arch. vs.
Non-Hardwired

Figure 4.14: Number of Clusters on the Critical Path Comparisons for Hardwired Arch. vs.
Non-Hardwired
4.4 Area-Delay Results
In FPGA studies, another metric for evaluating the quality of an architecture is its
area-delay product. Since there is always a tradeoff between area and delay, a good architecture
is one with a low area-delay product. The minimum represents the point where we are sacrificing
the least amount of area for the most performance. Table 4.5 shows the area-delay product for
the cascaded 4-LUTs and the non-hardwired architectures (4 to 7-LUTs). The 7-LUT architecture
has the worst area-delay product but, interestingly, the hardwired architecture does not perform
any better despite the reduction in total FPGA area. Unfortunately, the area reduction was not
accompanied by a corresponding reduction in critical path delay. The improvement in total FPGA
area was not enough to compensate for the slight increase in delay.
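As a small illustration of the metric, the sketch below multiplies area by delay for a few candidate architectures and picks the one with the lowest product. The architecture names and all numbers are hypothetical, chosen only to show that a smaller area does not guarantee a better product when delay rises. Area is assumed to be in minimum-width transistor areas and delay in seconds.

```python
from typing import Dict, Tuple


def area_delay_product(area: float, delay: float) -> float:
    """Area-delay product; lower is better."""
    return area * delay


def best_architecture(results: Dict[str, Tuple[float, float]]) -> str:
    """Return the name of the architecture with the lowest area-delay product."""
    return min(results, key=lambda name: area_delay_product(*results[name]))


# Hypothetical (area, delay) pairs, for illustration only.
results: Dict[str, Tuple[float, float]] = {
    "arch_balanced": (4.0e6, 20.0e-9),      # product about 0.0800
    "arch_large_fast": (5.0e6, 18.0e-9),    # product about 0.0900
    "arch_small_slow": (3.5e6, 25.0e-9),    # product about 0.0875
}

for name, (area, delay) in results.items():
    print(f"{name}: {area_delay_product(area, delay):.4f}")
print("best:", best_architecture(results))
```

In this made-up data the smallest architecture loses to the balanced one, which mirrors the situation described in the text: an area saving that comes with a delay increase may leave the product unchanged or worse.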
Table 4.5: Area-Delay Product Comparison Between Cascaded 4-LUTs and Non-hardwired
Architectures
Cluster Size   4-LUT   5-LUT   7-LUT   Hardwired 4-LUTs
1              0.098   0.088
2              0.084   0.077
3              0.067   0.064
4              0.063   0.058
4.5 Summary
This chapter examined the feasibility of employing hardwired logic blocks and how they
compared to the non-hardwired architectures. We have found that the hardwired architecture
required 10-15% less area than the 7-input LUT. However, there was a slight increase in critical
path delay and this offset any positive gains we realized in area. The result was no improvement
in the area-delay product.
CHAPTER 5
Conclusions and Future Work
5.1 Summary and Contributions
The goal of this thesis was to study the impact of different logic block architectures on FPGA
density and performance. In Chapter 3 we studied the effect of LUT and cluster size on FPGA
area and speed. It was found that in terms of area, using LUT sizes of 4-5 and clusters of 4-9
were best. Also, a LUT size of 7 and clusters of 3-10 produced the lowest average critical path
delay. However, for the area-delay product, LUT sizes of 4-6 and clusters of 3-10 were best.
Another important contribution from Chapter 3 was an expression for the number of distinct
cluster inputs (I) as a function of both LUT size (K) and cluster size (N). It had been previously
shown in [BR97] [BR98] that when K=4 and I is set to 2N + 2 then 98% of all the 4-LUTs
would be utilized. We found a more general expression for the 98% logic block utilization:
$$I = \frac{K}{2}(N + 1)$$
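The relationship between cluster inputs, LUT size, and cluster size can be tabulated with a few lines of code. The sketch below assumes the general expression has the form I = (K/2)(N + 1), which reduces to the previously known 2N + 2 when K = 4. Rounding up to a whole number of pins for odd K is an additional assumption made here for illustration, not something stated in the text.

```python
import math
from fractions import Fraction


def cluster_inputs(k: int, n: int) -> Fraction:
    """Distinct cluster inputs I for LUT size k and cluster size n.

    Assumed form: I = (K / 2) * (N + 1). Returned as an exact fraction
    because the value is not always an integer when k is odd.
    """
    if k < 1 or n < 1:
        raise ValueError("k and n must be positive")
    return Fraction(k, 2) * (n + 1)


def cluster_input_pins(k: int, n: int) -> int:
    """Whole number of input pins, rounding up (an assumption for illustration)."""
    return math.ceil(cluster_inputs(k, n))


# Sanity check against the K = 4 special case, I = 2N + 2.
for n in range(1, 11):
    assert cluster_inputs(4, n) == 2 * n + 2

print(cluster_inputs(4, 10), cluster_inputs(7, 10), cluster_input_pins(7, 10))
```

Keeping the exact fraction separate from the rounded pin count makes it clear which part comes from the expression and which part is a rounding choice.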
Chapter 4 introduced a new HLB architecture and packing algorithm. We were specifically
interested in the cascaded 4-LUT hardwired logic blocks. Our motivation for exploring such a
hardwired architecture was that it might lead to an improvement in area-delay product when
compared to the non-hardwired logic blocks. However, we found that the use of the cascaded 4-LUT
architecture did not improve the area-delay product. In fact, the hardwired area-delay results
were quite similar to those of the 7-LUT logic block architectures.
5.2 Future Work
In the future it would be interesting to study the effect of larger LUT sizes (> 7) and larger
cluster sizes (> 10) on FPGA area and performance. Also, our HLB study focused strictly on
the hardwired 4-LUTs. However, the behaviour of different hardwired architectures such as
those shown in Figure 5.1 would be an interesting topic.
Figure 5.1: Various HLB architectures
Finally, our packing algorithm was based on a layout synthesis approach where we first
technology mapped to K-LUTs and then packed these logic elements to form HLBs. Another
method that could possibly lead to better area and delay results would be to perform the HLB
mapping at the logic synthesis level. That is, treat the entire HLB as a coarse grain target for
technology mapping.
APPENDIX A
Total Area
Table A.1: Total Area (x 10^6) in Min. Width Trans. Area (Cluster Size = 1)
Table A.2: Total Area (x 10^6) in Min. Width Trans. Area (Cluster Size = 2)
Table A.3: Total Area (x 10^6) in Min. Width Trans. Area (Cluster Size = 3)
Table A.4: Total Area (x 10^6) in Min. Width Trans. Area (Cluster Size = 4)
Table A.5: Total Area (x 10^6) in Min. Width Trans. Area (Cluster Size = 5)
Table A.6: Total Area (x 10^6) in Min. Width Trans. Area (Cluster Size = 6)
Table A.7: Total Area (x 10^6) in Min. Width Trans. Area (Cluster Size = 7)
Table A.8: Total Area (x 10^6) in Min. Width Trans. Area (Cluster Size = 8)
Table A.9: Total Area (x 10^6) in Min. Width Trans. Area (Cluster Size = 9)
Table A.10: Total Area (x 10^6) in Min. Width Trans. Area (Cluster Size = 10)
APPENDIX B
Intra-Cluster (Logic) Area
Table B.1: Intra-Cluster Area (x 10^6) in Min. Width Trans. Area (Cluster Size = 1)
Table B.2: Intra-Cluster Area (x 10^6) in Min. Width Trans. Area (Cluster Size = 2)
Table B.3: Intra-Cluster Area (x 10^6) in Min. Width Trans. Area (Cluster Size = 3)
Table B.4: Intra-Cluster Area (x 10^6) in Min. Width Trans. Area (Cluster Size = 4)
Table B.5: Intra-Cluster Area (x 10^6) in Min. Width Trans. Area (Cluster Size = 5)
Table B.6: Intra-Cluster Area (x 10^6) in Min. Width Trans. Area (Cluster Size = 6)
Table B.7: Intra-Cluster Area (x 10^6) in Min. Width Trans. Area (Cluster Size = 7)
Table B.8: Intra-Cluster Area (x 10^6) in Min. Width Trans. Area (Cluster Size = 8)
Table B.9: Intra-Cluster Area (x 10^6) in Min. Width Trans. Area (Cluster Size = 9)
Table B.10: Intra-Cluster Area (x 10^6) in Min. Width Trans. Area (Cluster Size = 10)
APPENDIX C
Inter-Cluster (Routing) Area
Table C.1: Inter-Cluster Area (x 10^6) in Min. Width Trans. Area (Cluster Size = 1)
Table C.2: Inter-Cluster Area (x 10^6) in Min. Width Trans. Area (Cluster Size = 2)
Table C.3: Inter-Cluster Area (x 10^6) in Min. Width Trans. Area (Cluster Size = 3)
Table C.6: Inter-Cluster Area (x 10^6) in Min. Width Trans. Area (Cluster Size = 6)
Table C.7: Inter-Cluster Area (x 10^6) in Min. Width Trans. Area (Cluster Size = 7)
Table C.8: Inter-Cluster Area (x 10^6) in Min. Width Trans. Area (Cluster Size = 8)
Table C.9: Inter-Cluster Area (x 10^6) in Min. Width Trans. Area (Cluster Size = 9)
Table C.10: Inter-Cluster Area (x 10^6) in Min. Width Trans. Area (Cluster Size = 10)
APPENDIX D
FPGA Channel Width
Table D.1: Channel Width (Cluster Size = 1)
Table D.2: Channel Width (Cluster Size = 2)
Table D.3: Channel Width (Cluster Size = 3)
Table D.4: Channel Width (Cluster Size = 4)
Table D.5: Channel Width (Cluster Size = 5)
Table D.6: Channel Width (Cluster Size = 6)
Table D.7: Channel Width (Cluster Size = 7)
Table D.8: Channel Width (Cluster Size = 8)
Table D.9: Channel Width (Cluster Size = 9)
Table D.10: Channel Width (Cluster Size = 10)
APPENDIX E
Total Critical Path Delay
Table E.1: Total Delay in nano-seconds (Cluster Size = 1)
Table E.2: Total Delay in nano-seconds (Cluster Size = 2)
Table E.3: Total Delay in nano-seconds (Cluster Size = 3)
Table E.4: Total Delay in nano-seconds (Cluster Size = 4)
Table E.5: Total Delay in nano-seconds (Cluster Size = 5)
Table E.6: Total Delay in nano-seconds (Cluster Size = 6)
Table E.7: Total Delay in nano-seconds (Cluster Size = 7)
Table E.8: Total Delay in nano-seconds (Cluster Size = 8)
Table E.9: Total Delay in nano-seconds (Cluster Size = 9)
Table E.10: Total Delay in nano-seconds (Cluster Size = 10)
APPENDIX F
Intra-Cluster (Logic) Delay
Table F.1: Intra-Cluster Delay in nano-seconds (Cluster Size = 1)
Table F.2: Intra-Cluster Delay in nano-seconds (Cluster Size = 2)
Table F.3: Intra-Cluster Delay in nano-seconds (Cluster Size = 3)
Table F.4: Intra-Cluster Delay in nano-seconds (Cluster Size = 4)
Table F.5: Intra-Cluster Delay in nano-seconds (Cluster Size = 5)
Table F.10: Intra-Cluster Delay in nano-seconds (Cluster Size = 10)
APPENDIX G
Inter-Cluster (Routing) Delay
Table G.1: Inter-Cluster Delay in nano-seconds (Cluster Size = 1)
Table G.4: Inter-Cluster Delay in nano-seconds (Cluster Size = 4)
Table G.5: Inter-Cluster Delay in nano-seconds (Cluster Size = 5)
Table G.6: Inter-Cluster Delay in nano-seconds (Cluster Size = 6)
Table G.7: Inter-Cluster Delay in nano-seconds (Cluster Size = 7)
Table G.8: Inter-Cluster Delay in nano-seconds (Cluster Size = 8)
Table G.9: Inter-Cluster Delay in nano-seconds (Cluster Size = 9)
Table G.10: Inter-Cluster Delay in nano-seconds (Cluster Size = 10)
APPENDIX H
Number of BLE Levels on Critical Path
Table H.1: Number of BLEs on Critical Path (Cluster Size = 1)
Table H.2: Number of BLEs on Critical Path (Cluster Size = 2)
Table H.3: Number of BLEs on Critical Path (Cluster Size = 3)
Table H.4: Number of BLEs on Critical Path (Cluster Size = 4)
Table H.5: Number of BLEs on Critical Path (Cluster Size = 5)
Table H.6: Number of BLEs on Critical Path (Cluster Size = 6)
Table H.7: Number of BLEs on Critical Path (Cluster Size = 7)
Table H.8: Number of BLEs on Critical Path (Cluster Size = 8)
Table H.9: Number of BLEs on Critical Path (Cluster Size = 9)
Table H.10: Number of BLEs on Critical Path (Cluster Size = 10)
APPENDIX I
Number of Cluster Levels on Critical Path
Table I.1: Number of Clusters on Critical Path (Cluster Size = 1)
Table I.2: Number of Clusters on Critical Path (Cluster Size = 2)
Table I.3: Number of Clusters on Critical Path (Cluster Size = 3)
Table I.4: Number of Clusters on Critical Path (Cluster Size = 4)
Table I.5: Number of Clusters on Critical Path (Cluster Size = 5)
Table I.6: Number of Clusters on Critical Path (Cluster Size = 6)
Table I.7: Number of Clusters on Critical Path (Cluster Size = 7)
Table I.8: Number of Clusters on Critical Path (Cluster Size = 8)
Table I.9: Number of Clusters on Critical Path (Cluster Size = 9)
Table I.10: Number of Clusters on Critical Path (Cluster Size = 10)
Bibliography
[ACSG+99] O. Agrawal, H. Chang, B. Sharpe-Geisler, N. Schmitz, B. Nguyen, J. Wong,
G. Tran, F. Fontana, and B. Harding. "An Innovative, Segmented High Performance
FPGA Family with Variable-Grain-Architecture and Wide-gating Functions".
In ACM Symp. on FPGAs, Monterey, CA, USA, 1999.
E. Ahmed and J. Rose. "The Effect of LUT and Cluster Size on Deep-Submicron
FPGA Performance and Density". In ACM Symp. on FPGAs, pages 3-12, 2000.
S. Brown, R. Francis, J. Rose, and Z. Vranesic. "Field-Programmable Gate Arrays".
Kluwer Academic Publishers, 1992.
S. Brown and J. Rose. "FPGA and CPLD Architectures: A Tutorial". In IEEE
Design and Test of Computers, pages 42-57, Summer 1996.
V. Betz and J. Rose. "Cluster-Based Logic Blocks for FPGAs: Area-Efficiency vs.
Input Sharing and Size". In IEEE Custom Integrated Circuits Conference, pages
551-554, Santa Clara, CA, 1997.
V. Betz and J. Rose. "How Much Logic Should Go in an FPGA Logic Block?".
In IEEE Design and Test Magazine, pages 10-15, 1998.
V. Betz, J. Rose, and A. Marquardt. "Architecture and CAD for Deep-Submicron
FPGAs". Kluwer Academic Publishers, New York, 1999.
J. Cong and Y. Ding. "FlowMap: An Optimal Technology Mapping Algorithm
for Delay Optimization in Lookup-Table Based FPGA Designs". In IEEE Trans.
on CAD, pages 1-12, Jan 1994.
J. Cong and Y. Hwang. "Boolean Matching for Complex PLBs in LUT-based
FPGAs with Application to Architecture Evaluation". In ACM Symp. on FPGAs,
Monterey, CA, 1998.
K. Chung. "Architecture and Synthesis of Field-Programmable Gate Arrays with
Hardwired Connections". PhD thesis, University of Toronto, 1994.
J. Cong, J. Peck, and Y. Ding. "RASP: A General Logic Synthesis System for
SRAM-based FPGAs". In ACM Symp. on FPGAs, pages 137-143, 1996.
E.M. Sentovich et al. "SIS: A System for Sequential Circuit Analysis". Technical
report, University of California, Berkeley, 1990.
R. Hitchcock, G. Smith, and D. Cheng. "Timing Analysis of Computer-Hardware".
Technical report, IBM Journal of Research and Development, Jan. 1983.
D. Hill and N-S Woo. "The Benefits of Flexibility in Look-up Table FPGAs". In
International Workshop on FPGAs, 1991.
Xilinx Inc. "XC5200 Series of FPGAs". 1997.
Altera Inc. "Data Book". 1998.
Xilinx Inc. "Virtex 2.5 V Field Programmable Gate Arrays". 1998.
S. Kaptanoglu, G. Bakker, A. Kundu, and I. Corneillet. "A new high density and
very low cost reprogrammable FPGA architecture". In ACM Symp. on FPGAs,
Monterey, CA, 1999.
J. Kouloheris and A. El Gamal. "FPGA Performance vs. Cell Granularity". In
Proc. of Custom Integrated Circuits Conference, pages 6.2.1 - 6.2.4, May 1991.
J. Kouloheris and A. El Gamal. "FPGA Area vs. Cell Granularity - PLA Cells".
In Proc. of Custom Integrated Circuits Conference, 1992.
J. Kouloheris and A. El Gamal. "FPGA Area vs. Cell Granularity - Lookup tables
and PLA Cells". In First ACM Workshop on FPGAs, Berkeley, CA, 1992.
A. Marquardt. "Cluster-Based Architecture, Timing-Driven Packing, and Timing-Driven
Placement for FPGAs". Master's thesis, University of Toronto, 1999.
A. Marquardt, V. Betz, and J. Rose. "Using Cluster-Based Logic Blocks and
Timing-Driven Packing to Improve FPGA Speed and Density". In ACM/SIGDA
FPGA, 1999.
J. Rose, R.J. Francis, P. Chow, and D. Lewis. "The Effect of Logic Block Complexity
on Area of Programmable Arrays". In Proc. of Custom Integrated Circuits
Conference, pages 5.3.1 - 5.3.5, 1989.
J. Rose, R.J. Francis, D. Lewis, and P. Chow. "Architecture of Field-Programmable
Gate Arrays: The Effect of Logic Functionality on Area Efficiency".
In IEEE Journal of Solid-State Circuits, 1990.
S. Singh. "The Effect of Logic Block Architecture on FPGA Performance". Master's
thesis, University of Toronto, 1991.
S. Singh, J. Rose, P. Chow, and D. Lewis. "The Effect of Logic Block Architecture
on FPGA Performance". In IEEE Journal of Solid-State Circuits, 1992.
A. Sedra and K. Smith. "Microelectronic Circuits: Third Edition". Oxford
University Press, 1991.
N. Weste and K. Eshraghian. "Principles of CMOS VLSI Design: A Systems
Perspective, Second Edition". Addison Wesley, 1993.
S. Yang. "Logic Synthesis and Optimization Benchmarks, Version 3.0". Technical
report, Microelectronics Centre of North Carolina, 1991.