A Physical Resource Management Approach to Minimizing FPGA Partial Reconfiguration Overhead


Heng Tan and Ronald F. DeMara, University of Central Florida


Agenda

• Introduction

• Previous work on reducing Partial Reconfiguration overhead

• Develop a multi-step physical area management strategy

• Experimental case studies:

  4 simple circuits: serial, parallel, block multiplier resource, high fanout carry chain LUT multiplier

  2 large circuits: SECDED, MD5 step function (new)

• Conclusion and future work

Motivation

For partial reconfiguration operations, especially those running in an SoC configuration, storage space and reconfiguration speed are crucial.

Find a way to optimize them . . .

Previous Work

• Architecture level [Ganesan2000]

  Pipelining: overlap execution of one temporal partition with reconfiguration of another

• Logical level [Raghuraman2005]

  Minimization of frames: relating the number of frames that need to be downloaded into FPGAs to the number of minterms of a specially constructed logic function

• Hardware level [Compton2002][Hauck1999]

  Requires dedicated HW component

  Compression or defragmentation performed externally at runtime

• Physical-level resource management?

  Not specifically addressed for partial reconfiguration modules in literature

  But can be done … and … also cascaded with above techniques for multiplicative benefit

Module Based Flow

[Figure: module-based flow. Labeled elements: PR Modules, Fixed Modules, Bus Macros, Intermodule Signals, External Data]

• Including Fixed modules and PR modules

• Using Bus Macro

• Suitable for potential full automation

• Primary PR technique to study

Basic concept and capability for PR proposed by Xilinx

Column-Level Configuration Format

• For Virtex II/Pro series

• Including IOB, IOI, CLB, GCLK, BlockRAM, and BlockRAM Interconnect

• Labeling with 32-bit addresses composed of a Block Address (BA), a Major Address (MJA), a Minor Address (MNA), and a byte number

• Only contents of CLB column are studied in this research
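The address composition described above can be sketched as a bit-field pack and unpack. The field widths and positions used here (2-bit BA, 8-bit MJA, 8-bit MNA, 9-bit byte number, packed from the most significant field down) are assumptions for illustration and are not stated on the slides; the `FrameAddress` type and both function names are hypothetical.

```python
from typing import NamedTuple

# ASSUMPTION: illustrative field widths, not taken from the slides.
BA_BITS, MJA_BITS, MNA_BITS, BYTE_BITS = 2, 8, 8, 9


class FrameAddress(NamedTuple):
    ba: int    # Block Address (column class, e.g. CLB vs. BlockRAM)
    mja: int   # Major Address (column index within the block)
    mna: int   # Minor Address (frame index within the column)
    byte: int  # byte number within the frame


def pack(addr: FrameAddress) -> int:
    """Pack the four fields into one address word, BA in the highest bits."""
    word = 0
    for value, width in ((addr.ba, BA_BITS), (addr.mja, MJA_BITS),
                         (addr.mna, MNA_BITS), (addr.byte, BYTE_BITS)):
        if not 0 <= value < (1 << width):
            raise ValueError(f"field value {value} does not fit in {width} bits")
        word = (word << width) | value
    return word


def unpack(word: int) -> FrameAddress:
    """Inverse of pack(): split an address word back into its fields."""
    byte = word & ((1 << BYTE_BITS) - 1)
    word >>= BYTE_BITS
    mna = word & ((1 << MNA_BITS) - 1)
    word >>= MNA_BITS
    mja = word & ((1 << MJA_BITS) - 1)
    word >>= MJA_BITS
    ba = word & ((1 << BA_BITS) - 1)
    return FrameAddress(ba, mja, mna, byte)
```

Under these assumed widths, all frames of one CLB column share the same BA and MJA and differ only in MNA, which is why the strategy in this deck reasons about whole columns.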

[Figure: column-level configuration layout, listing the Block Address (BA) and Major Address (MJA) of each column type]

BA   MJA          Column type
0    0            GCLK
0    1            IOB
0    2            IOI
0    3 ... M+2    CLB
0    M+3          IOI
0    M+4          IOB
1    0 ... N      BRAM
2    0 ... N      BMINT

M = total number of CLB columns; N = total number of Block RAM columns

Bitstream Compression

• CLB columns program the configurable logic blocks, routing, and most interconnect resources.

• Each CLB column contains 2 columns of slices.

• 22 frames are used within the bitstream for each column of slices, describing its logic and routing information.

• Each frame occupies 424 bytes.

• In the bitstream of PR modules, Xilinx already uses a compression technique to represent unused CLB frames.

• For unused CLB frames, only 10 bytes are used instead of 424 to describe empty contents.
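The figures above (22 frames per column of slices, 424 bytes per used frame, 10 bytes per unused frame) suggest a rough size estimate for the CLB part of a partial bitstream. The sketch below assumes that a column of slices is either fully used (all 22 frames at full size) or fully unused, and it ignores headers and non-CLB columns; the function name is hypothetical.

```python
FRAMES_PER_SLICE_COLUMN = 22   # from the slide
FULL_FRAME_BYTES = 424         # from the slide
EMPTY_FRAME_BYTES = 10         # from the slide (compressed unused frame)


def clb_payload_bytes(used_slice_columns: int, total_slice_columns: int) -> int:
    """Estimate the CLB frame payload for a PR region.

    ASSUMPTION: any use of a slice column costs all 22 full frames,
    and every other column in the region is stored in compressed form.
    """
    if not 0 <= used_slice_columns <= total_slice_columns:
        raise ValueError("used columns must be between 0 and total columns")
    unused = total_slice_columns - used_slice_columns
    return FRAMES_PER_SLICE_COLUMN * (
        used_slice_columns * FULL_FRAME_BYTES + unused * EMPTY_FRAME_BYTES
    )


# Bytes saved by vacating one slice column in an 8-column region.
saved_per_column = clb_payload_bytes(4, 8) - clb_payload_bytes(3, 8)
```

Under this model each vacated column of slices saves 22 × (424 − 10) = 9108 bytes, which is why the primary goal of the strategy is stated in terms of columns rather than individual slices.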

Optimization Strategy

• Primary goal: minimize the number of columns of slices utilized, including routing resources

• Secondary goal: do not increase propagation delay after area optimization

• Physical area management procedure:

1. Region Allocation: define size and boundary

2. Pin Assignment: top/bottom edge preferred

3. Column Alignment: fill column of slices at a time

4. Choke-Point Elimination: resolve high fanout cases

5. Repeat
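Step 3 (Column Alignment) can be illustrated with a minimal sketch that fills one column of slices completely before moving to the next. It is not the authors' tool flow: the function names and the `slices_per_column` value used in the usage note are hypothetical, and real placement must also respect routing and the pin locations chosen in step 2.

```python
import math
from typing import Dict, List, Tuple


def columns_needed(num_slices: int, slices_per_column: int) -> int:
    """Lower bound on the number of slice columns a module occupies."""
    return math.ceil(num_slices / slices_per_column)


def align_to_columns(slice_names: List[str], slices_per_column: int,
                     first_column: int = 0) -> Dict[str, Tuple[int, int]]:
    """Assign each slice a (column, row) position, one column at a time."""
    placement = {}
    for i, name in enumerate(slice_names):
        col, row = divmod(i, slices_per_column)
        placement[name] = (first_column + col, row)
    return placement
```

For example, with a hypothetical region height of 80 slices per column, `columns_needed(168, 80)` gives 3, so a 168-slice module could not be packed into fewer than three columns of slices.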

Case Study

• The hardware platform is a Xilinx Virtex-II Pro VP7 device.

• The module-based partial reconfiguration flow is adopted to generate the partial reconfiguration bitstream.

• Xilinx ISE 6.3 is used to support the module-based flow.

• The area constraints are entered directly into the User Constraints File (.ucf) before mapping and routing.

• Four representative small cases and two larger case studies are used to test the strategy.

• Similar or identical external pin arrangement.

• Hypothesis: larger circuits can achieve more savings
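Because the area constraints are entered in the .ucf file, a small generator can make the region and per-slice constraints repeatable. The sketch below emits text in the commonly documented UCF form (`AREA_GROUP ... RANGE` and `LOC`); the exact syntax should be checked against the constraints guide for the ISE version in use, and the instance names, group name, and coordinates in the usage note are hypothetical.

```python
from typing import List


def area_group_ucf(instance: str, group: str,
                   x0: int, y0: int, x1: int, y1: int) -> List[str]:
    """Return UCF lines that bind an instance to a rectangular slice range."""
    return [
        f'INST "{instance}" AREA_GROUP = "{group}";',
        f'AREA_GROUP "{group}" RANGE = SLICE_X{x0}Y{y0}:SLICE_X{x1}Y{y1};',
    ]


def slice_loc_ucf(instance: str, x: int, y: int) -> str:
    """Return a UCF line that pins one instance to a specific slice."""
    return f'INST "{instance}" LOC = SLICE_X{x}Y{y};'


# Hypothetical usage: confine a PR module to two slice columns.
lines = area_group_ucf("pr_module", "AG_pr_module", 0, 0, 1, 79)
lines.append(slice_loc_ucf("pr_module/lut0", 0, 0))
ucf_text = "\n".join(lines)
```

Generating the constraints from a placement table keeps the manual steps of the procedure reproducible when the region boundary is adjusted between iterations.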

4-LUT Design Optimization (parallel)

• Simple 4-LUT elements

• Parallel logic path with direct input from external signals

• LUTs feed outputs straight through to flip-flops

• Best strategy is to locate them in a single column close to the external pins

Shifter Optimization (serial)

• All logic elements are cascaded in a contiguous string of CLBs

• Because the circuit is serial, the best arrangement places the elements in order from input to output within a single column

Block Multiplier Optimization

• Block multiplier resource involved

• Requires balancing the routing between two paths

• One path is between the block multiplier and the LUTs; the other is from the LUTs to the external pins

• Decreased savings compared to use of LUT-only circuits

LUT Multiplier Optimization

• High fan-out occurs because of the carry chains

• Single column style is not optimal

• Arrange related LUTs around each other using adjacent columns

Larger Case Studies

• SECDED was developed as a full PR module; the MD5 step function was developed as a PR module (vs. the SHA family)

• During the optimization process, not every slice was specifically placed, because of the large number of resources involved

• Only the slices on the critical path are constrained

• For these comparatively larger modules, increased bitstream savings of 33% and 30% are achieved

Benchmark Results

Module Name       # of LUTs  # of FFs  # of Block Multipliers  # of Slices  Original File Size (bytes)  Original Max. Delay (ns)  Optimized File Size (bytes)  Optimized Max. Delay (ns)  Area Saving
4 LUTs            4          16        0                       12           64K                         1.371                     55K                          1.347                      14%
Shifter           1          24        0                       13           87K                         1.377                     63K                          1.367                      28%
Block Multiplier  8          25        1                       17           88K                         1.346                     66K                          1.346                      25%
LUT Multiplier    22         22        0                       22           96K                         1.367                     68K                          1.346                      29%
SECDED            93         41        0                       74           89K                         1.355                     60K                          1.355                      33%
MD5               292        128       0                       168          120K                        1.380                     84K                          1.322                      30%
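The Area Saving column appears to be the relative reduction in bitstream file size. The short check below recomputes it from the original and optimized sizes in the table, taking the "K" values at face value and rounding to the nearest percent; the dictionary and function names are hypothetical.

```python
# (original size in K bytes, optimized size in K bytes, reported saving in %)
BENCHMARKS = {
    "4 LUTs": (64, 55, 14),
    "Shifter": (87, 63, 28),
    "Block Multiplier": (88, 66, 25),
    "LUT Multiplier": (96, 68, 29),
    "SECDED": (89, 60, 33),
    "MD5": (120, 84, 30),
}


def saving_percent(original: int, optimized: int) -> int:
    """Relative file-size reduction, rounded to the nearest percent."""
    return round(100 * (original - optimized) / original)


recomputed = {name: saving_percent(orig, opt)
              for name, (orig, opt, _) in BENCHMARKS.items()}
mismatches = [name for name, (_, _, reported) in BENCHMARKS.items()
              if recomputed[name] != reported]
```

With the sizes listed in the table, every recomputed value matches the reported percentage, so `mismatches` is empty.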

Area Optimization Results

• A physical resource area management strategy is proposed to minimize the reconfiguration overhead.

• Experiments show that a size reduction of up to one-third can be achieved for partial reconfiguration modules.

• The maximum propagation delay has also been decreased slightly in most cases (not increased).

• On the other hand, the larger the module is, the more complicated and time consuming the process becomes.

• Future work: provide autonomous area optimization capabilities and integrate them into our Multi-Layer Runtime Reconfiguration Architecture (MRRA) FPGA framework

Multi-Layer Runtime Reconfiguration Arch. (MRRA)

Resource name   Number Available   Number Used   Utilization
IOBs            396                85            21%
Slices          4928               1805          36%
BRAM            44                 24            54%
TBUFs           2464               352           14%
PPC405          1                  1             100%
BUFGMUXs        4                  1             25%

[Figure: MRRA block diagram. Labeled components: Host PC, PCI, FPGA, PowerPC, PLB, OPB, PLB/OPB Bridge, Block RAM, UART, SRAM Controller, External SRAM, ICAP Controller, ICAP, JTAG, SelectMAP, and two Reconfigurable Modules. Legend: On Chip Data Flow, Reconfiguration Data Flow, External Data Flow]

JTAG / SelectMAP / ICAP Reconfiguration Interfaces

Current Work: Direct Bitstream Management

• Change a one-bit full adder to a one-bit full subtracter

• Both have three one-bit inputs and two one-bit outputs

• Both use 2 LUTs with identical logic interconnections between LUTs and I/O signals

• The only difference between them is one truth table stored inside one LUT, which changes from 0xE8 to 0x8E

X  Y  Cin | Cout  S
0  0  0   | 0     0
0  0  1   | 0     1
0  1  0   | 0     1
0  1  1   | 1     0
1  0  0   | 0     1
1  0  1   | 1     0
1  1  0   | 1     0
1  1  1   | 1     1

X  Y  Bin | Bout  D
0  0  0   | 0     0
0  0  1   | 1     1
0  1  0   | 1     1
0  1  1   | 1     0
1  0  0   | 0     1
1  0  1   | 0     0
1  1  0   | 0     0
1  1  1   | 1     1

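[Figure: (a) 1 Bit Full Adder with inputs X, Y, Cin and LUT contents 96 (S) and E8 (Cout); (b) 1 Bit Full Subtracter with inputs X, Y, Bin and LUT contents 96 (D) and 8E (Bout); a Logic Switch selects Adder / Subtracter on shared signals X, Y, Cin / Bin, Cout / Bout, S / D]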
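The LUT contents quoted for the adder and subtracter can be reproduced from their Boolean functions. The sketch below assumes that bit i of a LUT's contents holds the output for input index i, with X as the most significant index bit; this ordering is an assumption chosen because it yields the values shown on the slide, and the function and dictionary names are hypothetical.

```python
from typing import Callable


def lut_init(func: Callable[..., int], n_inputs: int = 3) -> int:
    """Build LUT contents: bit i holds func() for input index i (X is the MSB)."""
    init = 0
    for index in range(1 << n_inputs):
        bits = [(index >> (n_inputs - 1 - k)) & 1 for k in range(n_inputs)]
        if func(*bits):
            init |= 1 << index
    return init


adder = {
    "sum": lut_init(lambda x, y, c: x ^ y ^ c),
    "carry": lut_init(lambda x, y, c: (x & y) | (x & c) | (y & c)),
}
subtracter = {
    "sum": lut_init(lambda x, y, b: x ^ y ^ b),
    "carry": lut_init(lambda x, y, b: ((1 - x) & y) | ((1 - x) & b) | (y & b)),
}

# LUTs whose contents must change to turn the adder into the subtracter.
changed = [name for name in adder if adder[name] != subtracter[name]]
```

Under this ordering the sum and difference LUTs both evaluate to 0x96, while the carry LUT (0xE8) and borrow LUT (0x8E) differ, so only one LUT needs to be rewritten in the bitstream.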

Combined MD5 / SHA-1 Step Function – Area Utilization

           Original Algorithm   Module Based   Frame Based
SHA-1      192 (slices)         65 (slices)    32 (slices)
MD5        881 (slices)         168 (slices)   32 (slices)
Combined   1068 (slices)        324 (slices)   32 (slices)
