A Physical Resource Management Approach to Minimizing FPGA Partial Reconfiguration Overhead
Heng Tan and Ronald F. DeMara
University of Central Florida
Agenda
• Introduction
• Previous work on reducing Partial Reconfiguration overhead
• Develop a multi-step physical area management strategy
• Experimental case studies:
  4 simple circuits: serial, parallel, block multiplier resource, high fanout carry chain LUT multiplier
  2 large circuits: SECDED, MD5 step function (new)
• Conclusion and future work
Motivation
For partial reconfiguration operations, especially those running on an SoC configuration,
• storage space and
• reconfiguration speed
are crucial.
Find a way to optimize them . . .
Previous Work
• Architecture level [Ganesan2000]
  Pipelining: overlap execution of one temporal partition with reconfiguration of another
• Logical level [Raghuraman2005]
  Minimization of frames: relating the number of frames that need to be downloaded into FPGAs to the number of minterms of a specially constructed logic function
• Hardware level [Compton2002][Hauck1999]
  Requires dedicated HW component
  Compression or defragmentation performed externally at runtime
• Physical-level resource management?
  Not specifically addressed for partial reconfiguration modules in literature
  But can be done … and … also cascaded with above techniques for multiplicative benefit
Module Based Flow
[Figure: module-based flow block diagram. Labels: External Data, Intermodule Signals, Fixed Module (two), PR Module (two), Bus Macro (two).]
• Including Fixed modules and PR modules
• Using Bus Macro
• Suitable for potential full automation
• Primary PR technique to study
Basic concept and capability for PR proposed by Xilinx
Column-Level Configuration Format
• For Virtex II/Pro series
• Including IOB, IOI, CLB, GCLK, BlockRAM, and BlockRAM Interconnect
• Labeling with 32-bit addresses composed of a Block Address (BA), a Major Address (MJA), a Minor Address (MNA), and a byte number
• Only the contents of CLB columns are studied in this research
[Figure: column addressing by Block Address (BA) and Major Address (MJA)]
BA 0, MJA 0: GCLK
BA 0, MJA 1: IOB
BA 0, MJA 2: IOI
BA 0, MJA 3 . . . M+2: CLB
BA 0, MJA M+3: IOI
BA 0, MJA M+4: IOB
BA 1, MJA 0 . . . N: BRAM
BA 2, MJA 0 . . . N: BMINT
M = Total Number of CLB Columns
N = Total Number of Block RAM Columns
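The Block Address 0 ordering in the figure (GCLK, IOB, IOI, then M CLB columns, then IOI, IOB) can be sketched in a few lines of Python. This is only an illustration of the layout as drawn on the slide; the function names and the example value M = 10 are assumptions, not taken from any Xilinx tool.

```python
def ba0_column_layout(num_clb_columns):
    """Return the column type found at each Major Address (MJA) of Block Address 0.

    Follows the ordering shown on the slide: GCLK, IOB, IOI, M CLB columns, IOI, IOB.
    """
    layout = ["GCLK", "IOB", "IOI"]        # MJA 0, 1, 2
    layout += ["CLB"] * num_clb_columns    # MJA 3 .. M+2
    layout += ["IOI", "IOB"]               # MJA M+3, M+4
    return layout


def clb_major_address(clb_index, num_clb_columns):
    """Map a 0-based CLB column index to its Major Address within Block Address 0."""
    if not 0 <= clb_index < num_clb_columns:
        raise ValueError("clb_index out of range")
    return 3 + clb_index


# Hypothetical device with M = 10 CLB columns (illustrative value only).
for mja, kind in enumerate(ba0_column_layout(10)):
    print(mja, kind)
```

Since only CLB columns are studied here, `clb_major_address` is the part that matters: it gives the range of Major Addresses a PR module's frames fall into.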
Bitstream Compression
• CLB columns program the configurable logic blocks, routing, and most interconnect resources.
• Each CLB column contains 2 columns of slices.
• 22 frames are utilized within the bitstream for each column of slices, describing the logic and routing information respectively.
• Each frame occupies 424 bytes.
• In the bitstream of PR modules, a compression technique is already used by Xilinx to represent unused CLB frames.
• For unused CLB frames, only 10 bytes are used instead of 424 to describe empty contents.
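The figures above (22 frames per column of slices, 424 bytes per used frame, 10 bytes per unused frame) are enough for a rough size estimate. The sketch below assumes that every frame of an occupied column of slices is stored in full and every frame of an empty column is compressed; it ignores bitstream headers and all non-CLB columns, so it gives a trend rather than an exact file size.

```python
FRAMES_PER_SLICE_COLUMN = 22   # frames per column of slices (from the slide)
USED_FRAME_BYTES = 424         # bytes for a frame that carries content
UNUSED_FRAME_BYTES = 10        # bytes for a compressed, unused frame


def estimate_clb_bytes(used_slice_columns, total_slice_columns):
    """Rough CLB payload size for a PR region, in bytes (simplified model)."""
    if not 0 <= used_slice_columns <= total_slice_columns:
        raise ValueError("used columns must be between 0 and the total")
    unused = total_slice_columns - used_slice_columns
    per_column_used = FRAMES_PER_SLICE_COLUMN * USED_FRAME_BYTES
    per_column_unused = FRAMES_PER_SLICE_COLUMN * UNUSED_FRAME_BYTES
    return used_slice_columns * per_column_used + unused * per_column_unused


# Hypothetical 8-column region: compare a design spread over all 8 columns
# of slices with the same design packed into 4.
print(estimate_clb_bytes(8, 8))
print(estimate_clb_bytes(4, 8))
```

Under this model each column of slices that is emptied saves 22 × (424 − 10) = 9108 bytes, which is why the strategy targets the number of occupied columns rather than the number of occupied slices.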
Optimization Strategy
• Primary goal: minimize the number of columns of slices utilized, including routing resources
• Secondary goal: do not increase propagation delay after area optimization
• Physical area management procedure:
1. Region Allocation: define size and boundary
2. Pin Assignment: top/bottom edge preferred
3. Column Alignment: fill one column of slices at a time
4. Choke-Point Elimination: resolve high fanout cases
5. Repeat
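Step 3 of the procedure can be illustrated with a minimal sketch that hands out slice positions one column at a time. It covers column alignment only; region allocation, pin assignment, choke-point handling, and routing are not modeled. The parameter `slices_per_column` is a hypothetical input, since the height of a column of slices depends on the device and the allocated region.

```python
import math


def column_aligned_slots(num_slices, slices_per_column):
    """Assign (column, row) positions so each column fills before the next starts."""
    if num_slices < 0 or slices_per_column <= 0:
        raise ValueError("invalid slice count or column height")
    return [(i // slices_per_column, i % slices_per_column)
            for i in range(num_slices)]


def columns_used(num_slices, slices_per_column):
    """Smallest number of columns of slices that can hold the design."""
    return math.ceil(num_slices / slices_per_column)


# Hypothetical example: 12 slices placed in a region that is 8 slices tall.
slots = column_aligned_slots(12, 8)
print(slots)
print(columns_used(12, 8))
```

Filling a column completely before opening the next keeps the count of occupied columns at its lower bound, which is the quantity the primary goal tries to minimize.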
Case Study
• The hardware platform is a Xilinx Virtex II Pro VP7 device.
• The module-based partial reconfiguration flow is adopted to generate the partial reconfiguration bitstream.
• Xilinx ISE 6.3 is used to support the module-based flow.
• The area constraints are entered directly into the User Constraints File (.ucf) before mapping and routing.
• Four representative small cases and two larger case studies are used to test the strategy.
• Similar or identical external pin arrangement.
• Hypothesis: larger circuits can achieve more savings
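Area constraints of the kind entered into the .ucf file can be generated from a small script. The sketch below emits the usual UCF forms for an area group with a slice range and for a single slice location. The instance and group names and the coordinates are hypothetical, and the exact syntax should be checked against the constraints documentation for the ISE version in use.

```python
def area_group_constraints(instance, group, x0, y0, x1, y1):
    """Return UCF lines that confine an instance to a rectangular range of slices."""
    return [
        f'INST "{instance}" AREA_GROUP = "{group}";',
        f'AREA_GROUP "{group}" RANGE = SLICE_X{x0}Y{y0}:SLICE_X{x1}Y{y1};',
    ]


def slice_loc_constraint(instance, x, y):
    """Return a UCF line that pins one instance to one slice."""
    return f'INST "{instance}" LOC = "SLICE_X{x}Y{y}";'


# Hypothetical PR module confined to two columns of slices near the pins.
for line in area_group_constraints("pr_module", "AG_pr_module", 0, 0, 1, 15):
    print(line)
print(slice_loc_constraint("pr_module/lut0", 0, 3))
```

Generating the lines from a script makes it easier to repeat the procedure with a different region size or column assignment on each iteration.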
4-LUT Design Optimization (parallel)
• Simple 4-LUT elements
• Parallel logic path with direct input from external signals
• LUTs feed outputs straight through to flip-flops
• Best strategy is to locate them in a single column close to the external pins
Shifter Optimization (serial)
• All logic elements are cascaded in a contiguous string of CLBs
• Because the circuit is serial, the best arrangement places the elements serially in a single column, from input to output
Block Multiplier Optimization
• Block multiplier resource involved
• Requires balancing the routing between two paths.
• One path is between the block multiplier and the LUTs. The other path is from the LUTs to the external pins.
• Decreased savings compared to use of LUT-only circuits
LUT Multiplier Optimization
• High fan-out occurs because of the carry chains
• Single column style is not optimal
• Arrange related LUTs around each other using adjacent columns
Larger Case Studies
• SECDED was developed as a full PR module, and the MD5 step functions were developed as a PR module vs. the SHA family
• During the optimization process, not every slice has been specifically placed because of the large number of resources involved
• Only the slices on the critical path are constrained.
• For these comparatively larger modules, increased bitstream savings of 33% and 30% are achieved
Benchmark Results
Module Name | # of LUTs | # of FFs | # of Block Multipliers | # of Slices | Original File Size (bytes) | Original Max. Delay (ns) | Optimized File Size (bytes) | Optimized Max. Delay (ns) | Area Saving
4 LUTs | 4 | 16 | 0 | 12 | 64K | 1.371 | 55K | 1.347 | 14%
Shifter | 1 | 24 | 0 | 13 | 87K | 1.377 | 63K | 1.367 | 28%
Block Multiplier | 8 | 25 | 1 | 17 | 88K | 1.346 | 66K | 1.346 | 25%
LUT Multiplier | 22 | 22 | 0 | 22 | 96K | 1.367 | 68K | 1.346 | 29%
SECDED | 93 | 41 | 0 | 74 | 89K | 1.355 | 60K | 1.355 | 33%
MD5 | 292 | 128 | 0 | 168 | 120K | 1.380 | 84K | 1.322 | 30%
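The Area Saving column appears to be the relative reduction in file size, rounded to the nearest percent. The short check below recomputes it from the original and optimized file sizes in the table. That reading of the column is an assumption, although it reproduces all six reported values.

```python
# (original size, optimized size) in K bytes, copied from the benchmark table.
BENCHMARKS = {
    "4 LUTs": (64, 55),
    "Shifter": (87, 63),
    "Block Multiplier": (88, 66),
    "LUT Multiplier": (96, 68),
    "SECDED": (89, 60),
    "MD5": (120, 84),
}


def saving_percent(original, optimized):
    """Relative file-size reduction, rounded to the nearest whole percent."""
    return round(100 * (original - optimized) / original)


for name, (original, optimized) in BENCHMARKS.items():
    print(f"{name}: {saving_percent(original, optimized)}%")
```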
Area Optimization Results
• A physical resource area management strategy is proposed to minimize the reconfiguration overhead.
• Experiments show that a size reduction of up to one-third can be achieved for partial reconfiguration modules.
• The maximum propagation delay has also been decreased slightly in most cases (not increased).
• On the other hand, the larger the module is, the more complicated and time consuming the process becomes.
• Future work: provide autonomous area optimization capabilities and integrate them into our Multi-Layer Runtime Reconfiguration Architecture FPGA framework
Multi-Layer Runtime Reconfiguration Arch. (MRRA)
Resource Name | Number Available | Number Used | Utilization
IOBs | 396 | 85 | 21%
Slices | 4928 | 1805 | 36%
BRAM | 44 | 24 | 54%
TBUFs | 2464 | 352 | 14%
PPC405 | 1 | 1 | 100%
BUFGMUXs | 4 | 1 | 25%
[Figure: MRRA block diagram. Labels: Host PC, PCI, FPGA, PowerPC, PLB, OPB, PLB/OPB Bridge, Block RAM, UART, SRAM Controller, External SRAM, ICAP Controller, ICAP, JTAG, SelectMAP, Reconfigurable Module (two). Legend: On Chip Data Flow, Reconfiguration Data Flow, External Data Flow.]
JTAG / SelectMAP / ICAP Reconfiguration Interfaces
Current Work: Direct Bitstream Management
• Change a one-bit full adder to a one-bit full subtracter
• Both have three one-bit inputs and two one-bit outputs.
• Both use 2 LUTs with identical logic interconnections between LUTs and I/O signals.
• The only difference between them is one truth table stored inside one LUT, which changes from 0xE8 to 0x8E
Full adder truth table:
X Y Cin | Cout S
0 0 0 | 0 0
0 0 1 | 0 1
0 1 0 | 0 1
0 1 1 | 1 0
1 0 0 | 0 1
1 0 1 | 1 0
1 1 0 | 1 0
1 1 1 | 1 1

Full subtracter truth table:
X Y Bin | Bout D
0 0 0 | 0 0
0 0 1 | 1 1
0 1 0 | 1 1
0 1 1 | 1 0
1 0 0 | 0 1
1 0 1 | 0 0
1 1 0 | 0 0
1 1 1 | 1 1
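[Figure: LUT-level view of both circuits. (a) 1 Bit Full Adder: inputs X, Y, Cin; LUT contents 0x96 (output S) and 0xE8 (output Cout). (b) 1 Bit Full Subtracter: inputs X, Y, Bin; LUT contents 0x96 (output D) and 0x8E (output Bout). A combined Adder / Subtracter block has inputs X, Y, Cin / Bin, outputs S / D and Cout / Bout, and a logic switch.]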
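The LUT contents quoted above can be rederived from the truth tables. The sketch below packs each output column into an 8-bit value, taking the row with X, Y, Cin = 1, 1, 1 as the most significant bit. That bit ordering is an assumption chosen because it reproduces 0x96, 0xE8 and 0x8E; the physical LUT pin ordering in a real design depends on how the tools map the inputs.

```python
def lut_init(fn):
    """Pack a 3-input truth table into an 8-bit value (row X=Y=C=1 is the MSB)."""
    init = 0
    for row in range(8):
        x, y, c = (row >> 2) & 1, (row >> 1) & 1, row & 1
        if fn(x, y, c):
            init |= 1 << row
    return init


def carry_out(x, y, cin):
    """Cout of a full adder."""
    return (x + y + cin) >> 1


def borrow_out(x, y, bin_):
    """Bout of a full subtracter computing X - Y - Bin."""
    return 1 if x - y - bin_ < 0 else 0


def sum_or_difference(x, y, c):
    """S and D share one table: both are the parity of the three inputs."""
    return (x + y + c) & 1


print(hex(lut_init(sum_or_difference)))
print(hex(lut_init(carry_out)))
print(hex(lut_init(borrow_out)))
print(bin(lut_init(carry_out) ^ lut_init(borrow_out)).count("1"), "bits differ")
```

The S/D table is identical for both circuits, so switching between adder and subtracter only requires rewriting the bits of the second LUT that differ between 0xE8 and 0x8E.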
Combined MD5 / SHA-1 Step Function – Area Utilization
Algorithm | Original Algorithm | Module Based | Frame Based
SHA-1 | 192 slices | 65 slices | 32 slices
MD5 | 881 slices | 168 slices | 32 slices
Combined | 1068 slices | 324 slices | 32 slices