A Power Efficient Image Convolution Engine for Field Programmable Gate Arrays

French MAPLD 2004

Matthew French [email protected]

University of Southern California, Information Sciences Institute

3811 North Fairfax Dr, Suite 200, Arlington, VA 22203

Xilinx FPGA Power Trend

[Figure: Xilinx FPGA power trend. Log-scale figure of merit (1 to 100,000) by Xilinx family (XC4000E, XC4000XL, XC4000XV, Virtex, Virtex-E, Virtex-II); series: Logic Blocks, Frequency (MHz), Power (mW), Voltage]

• Number of Logic Blocks & Maximum Operating Frequency both loosely track Moore’s Law

• Voltage Reduction is Slower

• Resulting Power Increase is Exponential!


Power Sensitive Applications

• Need to consider power as a first-class design constraint

• SRAM-based FPGA Quiescent power based on total circuit size

• Dynamic Power depends on

– Toggle Rates (Data Dependent)

– Components Used

– Routing

• Actual Quiescent and Dynamic Power not known until Circuit is Placed and Routed

– For high accuracy, further simulation necessary on timing model

• Tools do timing driven placement and routing

• So how does one design for low power?


Virtex-II Component Power Profile

[Figure: Virtex-II component power profile. Power per component (mW, log scale from 0.001 to 10) versus frequency (0 to 100 MHz, 20% toggle); series: Flip-Flop, Shift Reg, LUT, Block Select RAM, Multiplier]

• Derive micro-architecture feature capacitances from

– Xilinx Power Estimation Spreadsheets

– XPower Designs

– Power Monitoring Testbed

– Shang, Kaviani, Bathala, “Dynamic Power Consumption in Virtex-II FPGA Family” FPGA ’02

• Only trying to establish relative capacitances

– Models too imprecise to be exact

• Derive Low-Power Design Strategy

– Minimize Multipliers

– Use Shortest Interconnect

Resource                Capacitance (pF)
Embedded Multiplier     1,196
Block Select RAM        880
CLB                     26
Long-line Route         23
Hex-line Route          18
Double-line Route       13
Direct-Connect Route    5
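
For a fixed voltage, clock frequency, and toggle rate, switching power scales with the capacitance being switched, so the table can be used to compare candidate designs in relative terms. The following Python sketch is only an illustration of that bookkeeping: the helper name `switched_capacitance_pf` and the figure of 40 CLBs for extra adder logic are hypothetical, not values from the slides.

```python
# Relative capacitances (pF) taken from the table above.
CAPACITANCE_PF = {
    "Embedded Multiplier": 1196,
    "Block Select RAM": 880,
    "CLB": 26,
    "Long-line Route": 23,
    "Hex-line Route": 18,
    "Double-line Route": 13,
    "Direct-Connect Route": 5,
}


def switched_capacitance_pf(resource_counts):
    """Sum capacitance over a resource mix (hypothetical helper, relative use only)."""
    return sum(CAPACITANCE_PF[name] * count for name, count in resource_counts.items())


# One multiplier weighs as much as this many CLBs in the relative model.
multiplier_in_clbs = CAPACITANCE_PF["Embedded Multiplier"] / CAPACITANCE_PF["CLB"]

# Nine multipliers versus one multiplier plus an assumed 40 CLBs of adders.
nine_multipliers = switched_capacitance_pf({"Embedded Multiplier": 9})
one_multiplier_plus_adders = switched_capacitance_pf({"Embedded Multiplier": 1, "CLB": 40})

print(multiplier_in_clbs, nine_multipliers, one_multiplier_plus_adders)
```

Under this model, trading multipliers for configurable logic is attractive as long as the added logic and routing stay small relative to the multipliers removed.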


Traditional Image Convolution

• Slide Tap Mask Over Image

• Multiply each pixel

• Sum all Partial Products

• Resulting in new Filtered Pixel

• Operations

– 9 Multiplies & 9 Additions / Output Pixel

[Figure: 3x3 Tap Mask (Tap 1 to Tap 9) × 3x3 Input Data (Data 1 to Data 9) = Partial Products (PP 1 to PP 9), summed into Output (Out 1)]
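
The steps on this slide can be sketched in Python as a direct, software-only reference. This is a minimal illustration, not the hardware design: the function name `convolve3x3` is hypothetical, the mask is applied without flipping (as the slide describes), and only the valid interior region is computed.

```python
def convolve3x3(image, taps):
    """Slide a 3x3 tap mask over the image; valid region only, no padding."""
    rows, cols = len(image), len(image[0])
    output = []
    for r in range(rows - 2):
        out_row = []
        for c in range(cols - 2):
            acc = 0
            for i in range(3):
                for j in range(3):
                    # One multiply per tap: 9 multiplies per output pixel.
                    acc += taps[i][j] * image[r + i][c + j]
            out_row.append(acc)
        output.append(out_row)
    return output


image = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
all_ones = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
result = convolve3x3(image, all_ones)
print(result)
```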


Straightforward Implementation

[Figure: Input Rows 1 to 3 pass through register delay chains to multipliers for Tap 1 to Tap 9; the 9 products feed a Binary Adder Tree that produces the Output]

• 3x3 Kernel = 9 parallel multipliers

• Multipliers are resource limited in FPGAs

– Virtex E

• Instance in configurable logic

• XCV3200E: ~81 Multipliers Max

• 9 Pixels in Parallel

– Virtex-II

• Embedded Multiplier Blocks

• XC2V8000: 168 Multipliers

• 18 Pixels in Parallel

• Adder Trees Relatively Cheap

– 100’s of slices

– XCV3200E: 32,000 slices

– XC2V8000: 46,000 slices

• This also reflects Power Prioritization
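
The "pixels in parallel" figures above follow from dividing the available multipliers by the 9 taps of a 3x3 kernel. A small Python sketch of that arithmetic, where the function name is hypothetical and the multiplier counts are the ones quoted on the slide:

```python
TAPS_PER_PIXEL = 9  # 3x3 kernel, one multiplier per tap


def pixels_in_parallel(available_multipliers, taps_per_pixel=TAPS_PER_PIXEL):
    """Whole output pixels that can be computed per clock in the direct design."""
    return available_multipliers // taps_per_pixel


virtex_e = pixels_in_parallel(81)    # XCV3200E, multipliers built from logic
virtex_ii = pixels_in_parallel(168)  # XC2V8000, embedded multiplier blocks
print(virtex_e, virtex_ii)
```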


Convolution Kernel Types: Closer Look

• Spatial Filtering

– Blurring, Smoothing (Lowpass)

– Sharpening (Highpass)

– Noise Reduction

– Edge Detection

• Derivative Filters

– Roberts

– Prewitt

– Sobel

Smoothing Filter (1 Unique Tap Value)

1/9 1/9 1/9
1/9 1/9 1/9
1/9 1/9 1/9

Sharpening Filter (2 Unique Tap Values)

-1 -1 -1
-1  8 -1
-1 -1 -1

Edge Detection Filter (2 Unique Tap Values)

Three +1 taps and six -1 taps

Prewitt Basis (3 Unique Tap Values)

-1 -1 -1      -1  0 +1
 0  0  0      -1  0 +1
+1 +1 +1      -1  0 +1

Sobel Basis (5 Unique Tap Values)

-1 -2 -1      -1  0 +1
 0  0  0      -2  0 +2
+1 +2 +1      -1  0 +1

• Filter Tap Values Reused Often

• Can We Exploit This?
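
The unique-tap counts quoted for these masks can be checked with a few lines of Python. This is an illustrative sketch only: the dictionary and function names are hypothetical, and the sharpening mask assumes the usual layout with the 8 in the center (the count does not depend on the layout).

```python
from fractions import Fraction

MASKS = {
    "smoothing": [[Fraction(1, 9)] * 3 for _ in range(3)],
    "sharpening": [[-1, -1, -1], [-1, 8, -1], [-1, -1, -1]],
    "prewitt": [[-1, -1, -1], [0, 0, 0], [1, 1, 1]],
    "sobel": [[-1, -2, -1], [0, 0, 0], [1, 2, 1]],
}


def unique_taps(mask):
    """Return the sorted set of distinct tap values in a mask."""
    return sorted({tap for row in mask for tap in row})


counts = {name: len(unique_taps(mask)) for name, mask in MASKS.items()}
print(counts)
```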


1-D Symmetric FIR Filter Lessons

[Figure: symmetric FIR filter with input x(k), a register delay chain, coefficients C0 to C3, and output y(k)]

Coefficient symmetry: C(k) = C(K-(k+1))

• Telecommunication and Radar Communities

– Exploit symmetric Filters

– Reorder Additions Before Multiplication

– 1/2 Multipliers Necessary

• Can We Exploit 2-D Symmetry?

– Tap Values Reprogrammable

– Tap Symmetry Reprogrammable

• Minimize Multipliers

• Leverage Large Amount of Configurable Logic Blocks

• Benefits of Increased Parallelism

– Higher Throughput

– More Efficient Power Utilization Over Time
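
The 1-D trick of adding before multiplying can be sketched in Python as follows. This is a software illustration under stated assumptions, not the slide's circuit: the function names are hypothetical, the filter length K is assumed even, and the coefficients are assumed to satisfy C(k) = C(K-(k+1)).

```python
def fir_direct(x, coeffs):
    """y(n) = sum over k of C(k) * x(n - k): K multiplies per output."""
    K = len(coeffs)
    return [
        sum(coeffs[k] * x[n - k] for k in range(K))
        for n in range(K - 1, len(x))
    ]


def fir_symmetric(x, coeffs):
    """Same output, but mirrored samples are added first: K/2 multiplies per output."""
    K = len(coeffs)
    assert K % 2 == 0, "sketch assumes an even number of taps"
    assert all(coeffs[k] == coeffs[K - 1 - k] for k in range(K)), "taps must be symmetric"
    return [
        sum(coeffs[k] * (x[n - k] + x[n - (K - 1 - k)]) for k in range(K // 2))
        for n in range(K - 1, len(x))
    ]


coeffs = [1, 2, 3, 4, 4, 3, 2, 1]   # 8 taps, 4 distinct coefficients
samples = list(range(1, 13))
direct = fir_direct(samples, coeffs)
folded = fir_symmetric(samples, coeffs)
print(direct == folded)
```

The folded form uses half the multipliers at the cost of one extra adder per coefficient pair, which is the trade the rest of the talk extends to 2-D masks.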


Key Ideas

• Number of Active Multipliers Varies with Tap Mask

– Turn off unused Multipliers – lower power

– Or, use unused Multipliers to process next pixel

• Requires parallel memory accesses

• Higher throughput

• Finish sooner – sleep device

• Lower Clock Rate

• Adder Tree layers before and after multiply vary with number of Multipliers per pixel

• Input Data must be able to be routed to each multiplier

• Will multiplier savings outweigh extra routing, multiplexing, larger circuit quiescent power?
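
The central idea, grouping the data values that share a tap value and multiplying once per unique tap, can be sketched for a single output pixel. The Python below is a software analogy only: the function names are hypothetical, and the dictionary stands in for the data mux and pre-multiply adder trees.

```python
from collections import defaultdict


def pixel_direct(window, taps):
    """Traditional form: one multiply per tap position (9 multiplies)."""
    return sum(taps[i][j] * window[i][j] for i in range(3) for j in range(3))


def pixel_grouped(window, taps):
    """Add the data sharing each tap value, then multiply once per unique tap."""
    groups = defaultdict(int)
    for i in range(3):
        for j in range(3):
            groups[taps[i][j]] += window[i][j]   # adder tree before the multiplier
    result = sum(tap * total for tap, total in groups.items())
    return result, len(groups)                   # len(groups) = multiplies used


sharpen = [[-1, -1, -1], [-1, 8, -1], [-1, -1, -1]]
window = [[1, 2, 3], [4, 9, 6], [7, 8, 5]]
reference = pixel_direct(window, sharpen)
grouped, multiplies = pixel_grouped(window, sharpen)
print(reference, grouped, multiplies)
```

For this mask the grouped form needs 2 multiplies instead of 9, leaving the remaining multipliers free to be disabled or assigned to neighboring pixels.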


Adaptive Convolution Kernel Sizing

• Implementing Multiple Pixel Version

• How Many Multipliers to Use?

– Multiple of 9

– Size that is easy to place and allow for TMR growth

18 Multipliers Per Kernel

Number of Unique Taps    Number of Masks    Speedup Over
in 3x3 Conv. Mask        per Kernel         Traditional Convolution
1                        18                 9x
2                        9                  4.5x
3                        6                  3x
4                        4                  2x
5                        3                  1.5x
6                        3                  1.5x
7-9                      2                  1x
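
The table values are consistent with a simple integer division of the 18 multipliers by the number of unique taps, compared against a traditional design that fits 2 masks in the same 18 multipliers. The Python sketch below reproduces that arithmetic; the function names are hypothetical and the formula is inferred from the table rather than stated on the slide.

```python
MULTIPLIERS_PER_KERNEL = 18
TAPS_IN_MASK = 9  # 3x3 mask


def masks_per_kernel(unique_taps, multipliers=MULTIPLIERS_PER_KERNEL):
    """Masks that fit when each mask needs one multiplier per unique tap."""
    return multipliers // unique_taps


def speedup(unique_taps, multipliers=MULTIPLIERS_PER_KERNEL):
    """Throughput relative to one multiplier per tap position."""
    traditional_masks = multipliers // TAPS_IN_MASK
    return masks_per_kernel(unique_taps, multipliers) / traditional_masks


table = [(u, masks_per_kernel(u), speedup(u)) for u in range(1, 10)]
for unique, masks, factor in table:
    print(f"{unique} unique taps: {masks} masks, {factor}x")
```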


Kernel Block Diagram

[Figure: kernel block diagram. Input Rows 0 to 19 feed a Register Delay Bank; a State Machine, configured with the Number of Unique Taps, Tap Mask, and Tap Value, drives a Data Mux and Common Tap Mux; the selected data pass through adder trees, multipliers (M), and an Output Adder Tree to produce Outputs 0 to 17]

Group Data Values with Common Taps

Dynamically Adjust Multiplier Position within Adder Tree


Implementation Comparison

Resource                Baseline    Power Efficient
Flip-Flops              6,435       7,231
LUTs                    8,141       9,181
Block RAMs              6           20
Multipliers             18          18
Operating Frequency     100 MHz     100 MHz
Quiescent Power (mW)    711.6       965.0

Quiescent Power 35% Higher


Total Energy Comparison

[Figure: Energy (mJ) per 512 x 512 Image (axis 0 to 8) and Energy Improvement Factor, by Kernel: Baseline, Power Efficient (9 taps), Power Efficient (8 taps), and further tap counts]

For Higher Tap Commonality, Shorter Dynamic Power Consumption Window Overcomes Higher Quiescent Power
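
The quiescent part of this trade-off can be estimated from figures given elsewhere in the talk. The Python sketch below is a rough illustration under loud assumptions: it covers quiescent energy only (the slides give no numeric dynamic power), it assumes the baseline produces 2 pixels per clock at 100 MHz, it assumes ideal throughput scaling, and it assumes the device sleeps at zero power once the image is done. The function name is hypothetical.

```python
IMAGE_PIXELS = 512 * 512
CLOCK_HZ = 100e6
BASELINE_PIXELS_PER_CLOCK = 2   # assumption: 18 multipliers / 9 taps per pixel
QUIESCENT_MW = {"baseline": 711.6, "power_efficient": 965.0}


def quiescent_energy_mj(design, speedup=1.0):
    """Quiescent energy for one image; dynamic energy is deliberately left out."""
    seconds = IMAGE_PIXELS / (BASELINE_PIXELS_PER_CLOCK * speedup * CLOCK_HZ)
    return QUIESCENT_MW[design] * seconds   # mW * s = mJ


baseline_mj = quiescent_energy_mj("baseline")
for factor in (1.0, 1.5, 2.0, 3.0, 4.5, 9.0):
    efficient_mj = quiescent_energy_mj("power_efficient", factor)
    print(f"{factor}x speedup: {efficient_mj:.3f} mJ vs baseline {baseline_mj:.3f} mJ")
```

Even in this simplified view, the larger circuit costs more quiescent energy at 1x speedup and less once the active window shrinks by 1.5x or more.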


What is hard?

• Poor Tool Support for Power Design

– Analyzing Power Trade-offs can be complex & time-consuming

– Have to have fully routed and simulated designs to compare approaches

– Router is optimized for throughput, not power

• Finding all Chip Enables to Disable

– For each of several different multiplexer settings

• Secondary Power Effects

– Can also use Relative Placement Macros to “help” Router

• Finding where can be time consuming


Analysis

• For Higher Tap Commonality, Shorter Dynamic Power Consumption Window Overcomes Higher Quiescent Power

– Crossover point at 7 taps is an implementation limitation of using 18 multipliers in kernel

• Quiescent Power

– Not much larger considering extra circuitry

• 18 Adder Trees, 16 Block RAMs

• Dynamic Power Consumption

– Observed to vary by +50% within one circuit from one place and route to another, even using same settings

– Average of 3 routes used for each circuit

• For Systems Where Parallelizing Input Data Stream Is Difficult

– Disabling extra Multipliers is best approach

– Power savings expected to be less


Conclusions

• Substantial Power Savings can be Achieved by Making Power a First-Class Design Constraint

• Knowledge of Underlying Resource Capacitance a Key Foundation

– Re-use Power-Critical Components

• Routing Can Be Influenced to Yield Lower Power

• Over-constrain timing on power sensitive nets

• Use Relative Placement Macros (RPMs)
