1
Architecture of Datapath-oriented Coarse-grain Logic
and Routing for FPGAs
Andy Ye, Jonathan Rose, David Lewis
Department of Electrical and Computer Engineering, University of Toronto
{yeandy, jayar, lewis}@eecg.utoronto.ca
2
Outline
• Motivation
– Datapath regularity
• A datapath-oriented FPGA
– Architecture
– CAD flow
• Experimental results
– Area efficiency
• Conclusion
3
Modern FPGAs
• Very large logic capacities
– Over 10 million equivalent logic gates
• Increasingly used to implement large and complex applications
– Central processing units
– Graphics accelerators
– Digital signal processors
– Packet switching networks
4
Datapath Circuits
• Large applications
– Contain a greater amount of datapath circuits
• Datapath circuits
– Consist of multiple identical logic structures called bit-slices
• Regularity
• Predictability
5
An Example
[Figure: four identical full-adder bit-slices chained from Carry In to Carry Out, with inputs A0 to A3 and B0 to B3 and outputs C0 to C3.]
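The regularity in this example can be sketched in a few lines of Python: one bit-slice function is written once and instantiated four times. This is only an illustrative model, not part of the original slides. The function names are hypothetical, and the sketch calls the per-slice outputs sum bits, whereas the figure labels them C0 to C3.

```python
def full_adder(a, b, carry_in):
    """One bit-slice: return (sum_bit, carry_out) for single-bit inputs."""
    sum_bit = a ^ b ^ carry_in
    carry_out = (a & b) | (carry_in & (a ^ b))
    return sum_bit, carry_out


def ripple_carry_add(a_bits, b_bits, carry_in=0):
    """Chain identical bit-slices, least significant bit first."""
    sum_bits = []
    carry = carry_in
    for a, b in zip(a_bits, b_bits):
        s, carry = full_adder(a, b, carry)
        sum_bits.append(s)
    return sum_bits, carry


def to_bits(value, width):
    """Helper (hypothetical): integer to list of bits, LSB first."""
    return [(value >> i) & 1 for i in range(width)]


def from_bits(bits):
    """Helper (hypothetical): list of bits, LSB first, to integer."""
    return sum(bit << i for i, bit in enumerate(bits))
```

Every iteration of the loop executes the same logic, which is the property the datapath-oriented architecture aims to exploit.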
6
An Example
7
Research Goal
• Design a new FPGA architecture
– Utilize datapath regularity
• Reduce the implementation area of datapath circuits on FPGAs
• Implement a full set of CAD tools for the new architecture
– Synthesis
– Packing
– Placement
– Routing
8
Key Architectural Features
• A bus-oriented logic block architecture
• A mixture of coarse-grain and fine-grain routing tracks
9
Datapath FPGA Overview
[Figure: array of logic blocks (L) and switch blocks (S) connected by routing channels that contain both coarse-grain and fine-grain routing tracks.]
10
Logic Block: Super-cluster
[Figure: a super-cluster made of four clusters (Cluster 1 to Cluster 4). Each cluster contains four BLEs and a local routing network. A basic logic element (BLE) contains a LUT, a DFF, and a MUX.]
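The hierarchy on this slide (BLE, cluster, super-cluster) can be modeled as nested data structures. The sketch below is a simplified behavioral model under stated assumptions: the LUT has four inputs, the output MUX selects between the LUT output and the DFF, and the local routing network is not modeled. All class and field names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class BLE:
    truth_table: int            # 16-bit mask programming a 4-input LUT (assumed size)
    use_register: bool = False  # output MUX setting: True selects the DFF
    state: int = 0              # current DFF contents

    def lut(self, inputs):
        """Look up the LUT output for four single-bit inputs."""
        index = sum(bit << i for i, bit in enumerate(inputs))
        return (self.truth_table >> index) & 1

    def evaluate(self, inputs):
        """BLE output: registered value or combinational LUT output."""
        return self.state if self.use_register else self.lut(inputs)

    def clock(self, inputs):
        """Capture the LUT output in the DFF on a clock edge."""
        self.state = self.lut(inputs)


@dataclass
class Cluster:
    bles: List[BLE] = field(default_factory=list)


@dataclass
class SuperCluster:
    clusters: List[Cluster] = field(default_factory=list)


def build_super_cluster(num_clusters=4, bles_per_cluster=4):
    """Build an unprogrammed super-cluster with the sizes shown on the slide."""
    return SuperCluster(
        [Cluster([BLE(truth_table=0) for _ in range(bles_per_cluster)])
         for _ in range(num_clusters)]
    )
```

For example, `BLE(truth_table=0x6996)` programs the LUT as a four-input XOR.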
11
Datapath FPGA Overview
[Figure: the same array with each logic block (L) implemented as a super-cluster. Switch blocks (S) and routing channels carry coarse-grain and fine-grain routing tracks.]
12
Coarse-grain Routing Tracks
[Figure: a super-cluster of four clusters connected, through blocks labeled M and a switch block, to the coarse-grain routing and the fine-grain routing.]
13
CAD Flow
• CAD flow for the datapath-oriented FPGA consists of
– Synthesis
– Packing
– Placement
– Routing
• Conventional CAD flow
– Minimize area and delay metrics
– Destroy datapath regularity
14
Datapath-oriented CAD Flow
• Preserve datapath regularity (bit-sliced structures)
• Map the preserved regularity onto the datapath-oriented FPGA architecture
• Maximize the utilization of coarse-grain routing tracks
– Minimize the implementation area of datapath structures
15
Datapath Representation
• Datapath circuits are represented by netlists of datapath components (VHDL or Verilog)
• Datapath component library
– Multiplexers
– Adders/subtracters
– Shifters
– Comparators
– Registers
• Each component consists of identical bit-slices
16
Synthesis
• Enhanced module compaction algorithm
• Based on the Synopsys FPGA compiler
• Augmented with several datapath-oriented features
– Preserve datapath regularity by preserving bit-slice boundaries
– Achieve area results as good as those of conventional synthesis tools
17
An Example Datapath Circuit
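[Figure: four-bit example datapath circuit. Each bit-slice i (0 to 3) contains a mux and an adder (+) with signals ai, bi, ci, di, and si. The select signal sel is shared, and a carry chain runs from cin to cout.]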
18
Synthesis
[Figure: bit-slice 0 (mux and adder, with signals a0, b0, c0, d0, sel, cin, and s0) mapped onto three 4-LUTs. One 4-LUT takes a0, b0, c0, and sel.]
19
Synthesis
[Figure: all four bit-slices after synthesis. Each bit-slice i uses three 4-LUTs with signals ai, bi, ci, sel, di, and si. The carry chain runs from cin to cout.]
20
Packing
• Based on the T-VPACK packing algorithm
• Pack adjacent bit-slices into super-clusters
• Utilize carry connections in super-clusters to minimize the delay of carry chains
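The second bullet can be illustrated with a minimal sketch that groups adjacent bit-slices into super-clusters. It assumes one bit-slice per cluster and ignores cluster input limits and timing, which a packer such as T-VPACK must handle. The function name and parameters are hypothetical.

```python
def pack_bit_slices(num_slices, clusters_per_super_cluster=4):
    """Group adjacent bit-slice indices into super-clusters.

    Assumption: each bit-slice occupies exactly one cluster, so a
    super-cluster holds `clusters_per_super_cluster` neighboring slices.
    """
    return [
        list(range(start, min(start + clusters_per_super_cluster, num_slices)))
        for start in range(0, num_slices, clusters_per_super_cluster)
    ]
```

Keeping neighboring slices in the same super-cluster is what lets a carry chain use the dedicated carry connections instead of general routing.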
21
An Example
• Four clusters per super-cluster
• Two BLEs per cluster
• Six inputs per cluster
[Figure: a super-cluster of four clusters, each containing two BLEs.]
22
Packing Into Clusters
[Figure: the three 4-LUTs of bit-slice 0 (signals a0, b0, c0, sel, d0, cin, and s0) packed into BLEs within two-BLE clusters.]
23
Packing Into Super-clusters
[Figure: the four bit-slices packed into super-clusters of two-BLE clusters, with inputs ai, bi, ci, sel, and di, outputs si (i = 0 to 3), and a carry chain from cin to cout.]
24
Placement
• Based on the VPR placer
• Uses the simulated annealing algorithm
• For super-clusters containing datapath circuits
– Move super-clusters only
• For super-clusters containing non-datapath circuits
– Move individual clusters
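The move granularity described above can be sketched as follows. This is not the VPR placer: it only shows how the set of movable units might be built, together with the standard simulated annealing acceptance test. The dictionary layout and function names are assumptions made for illustration.

```python
import math
import random


def movable_units(super_clusters):
    """List the units an annealing move may relocate.

    Datapath super-clusters move as a whole, which keeps their bit-slices
    aligned. Non-datapath super-clusters contribute individual clusters.
    """
    units = []
    for sc in super_clusters:
        if sc["is_datapath"]:
            units.append(tuple(sc["clusters"]))
        else:
            units.extend((cluster,) for cluster in sc["clusters"])
    return units


def accept_move(delta_cost, temperature, rng):
    """Standard annealing rule: always accept improvements, and accept a
    worse placement with probability exp(-delta_cost / temperature)."""
    if delta_cost <= 0:
        return True
    if temperature <= 0:
        return False
    return rng.random() < math.exp(-delta_cost / temperature)
```

A caller would pick a random unit from `movable_units(...)`, propose a new location, compute the change in cost, and pass it to `accept_move` with a `random.Random` instance.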
25
Routing
• Based on the VPR router
• Uses the PathFinder algorithm
• As much as possible
– Route buses through coarse-grain routing tracks
– Route individual signals through fine-grain routing tracks
• When necessary
– Use coarse-grain routing tracks for individual signals
– Use fine-grain routing tracks for buses
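The preference order on this slide can be written as a small selection rule. This sketch is not the router itself, which resolves congestion through costs over many iterations. It only encodes "preferred track type first, the other type when necessary". The names and the `free_tracks` dictionary are hypothetical.

```python
def preferred_track_types(is_bus):
    """Buses prefer coarse-grain tracks, individual signals prefer fine-grain."""
    return ["coarse", "fine"] if is_bus else ["fine", "coarse"]


def choose_track_type(is_bus, free_tracks):
    """Return the first track type with free capacity, or None if both are full.

    `free_tracks` maps a track type ("coarse" or "fine") to the number of
    unused tracks of that type in the channel being considered.
    """
    for kind in preferred_track_types(is_bus):
        if free_tracks.get(kind, 0) > 0:
            return kind
    return None
```

For example, a bus falls back to fine-grain tracks only when no coarse-grain track is free in the channel.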
26
Area Efficiency
• Benchmarks
– 15 datapath circuits from the Pico-java processor
• Architectural assumptions
– Four BLEs per cluster
– Four clusters per super-cluster
– Four coarse-grain tracks sharing configuration memory
– Logic track length of two
– Disjoint switch block topology
• Architectural variables
– Number of coarse-grain tracks
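To see why sharing configuration memory among coarse-grain tracks can save area, the following back-of-the-envelope sketch counts routing configuration bits. It rests on simplifying assumptions that are not taken from the slides: one configuration bit per independently controlled routing switch, and switches split between track types in proportion to the track fraction. The function and parameter names are hypothetical.

```python
import math


def routing_config_bits(total_switches, coarse_fraction, tracks_per_group=4):
    """Estimate routing configuration bits when coarse-grain tracks share memory.

    Assumptions (illustrative only): each fine-grain switch needs its own
    bit, while each group of `tracks_per_group` coarse-grain switches is
    controlled by a single shared bit.
    """
    coarse_switches = round(total_switches * coarse_fraction)
    fine_switches = total_switches - coarse_switches
    return fine_switches + math.ceil(coarse_switches / tracks_per_group)
```

Under these assumptions, `routing_config_bits(1000, 0.5)` needs fewer bits than `routing_config_bits(1000, 0.0)`. The model says nothing about the routability cost of making more tracks coarse-grain, which is the trade-off the experiment measures.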
27
Area Efficiency
[Figure: circuit area in minimum transistor area (x10^6, axis from 1.40 to 1.60) and normalized circuit area (axis from 90.0% to 100.0%) versus % of coarse-grain tracks (from 0% to 60%-70% in 10% bins).]
28
Logic Track Length Vs. Area
• Architectural assumptions
– Four clusters per super-cluster
– Four coarse-grain tracks share configuration memory
– 50% of tracks are coarse-grain tracks
– Disjoint switch block topology
• Architectural variables
– Number of BLEs per cluster
– Logic track length
29
Logic Track Length Vs. Area
[Figure: circuit area in minimum transistor area (x10^6, axis from 1.60 to 2.20) versus track length (1, 2, 4, 8, 16), with one curve each for N = 2, N = 4, N = 8, and N = 10.]
30
Conclusion
• Proposed a datapath-oriented FPGA architecture and its CAD tools
• Best area is achieved with
– 40% to 50% of tracks being coarse-grain routing tracks
– Four BLEs per cluster
– A logic track length of two
• Best area is 9.6% smaller than that of conventional FPGAs