French MAPLD 2004
A Power Efficient Image Convolution Engine for Field Programmable Gate Arrays
Matthew French [email protected]
University of Southern California, Information Sciences Institute
3811 North Fairfax Dr, Suite 200
Arlington, VA 22203
Xilinx FPGA Power Trend
[Figure: Figure of Merit (log scale, 1 to 100,000) by Xilinx Family (XC4000E, XC4000XL, XC4000XV, Virtex, Virtex-E, Virtex-II); series: Logic Blocks, Frequency (MHz), Power (mW), Voltage]
• Number of Logic Blocks & Maximum Operating Frequency both loosely track Moore’s Law
• Voltage Reduction is Slower
• Resulting Power Increase is Exponential!
Power Sensitive Applications
• Need to consider power as a first-class design constraint
• SRAM-based FPGA Quiescent Power is based on total circuit size
• Dynamic Power depends on
  • Toggle Rates (Data Dependent)
  • Components Used
  • Routing
• Actual Quiescent and Dynamic Power not known until Circuit is Placed and Routed
  – For high accuracy, further simulation necessary on timing model
• Tools perform timing-driven placement and routing
• So how does one design for low power?
Virtex-II Component Power Profile
[Figure: mW per component (log scale, 0.001 to 10) vs. Frequency (MHz, 20% toggle; 0 to 100); series: Flip-Flop, Shift Reg, LUT, Block Select RAM, Multiplier]
• Derive micro-architecture feature capacitances from
  – Xilinx Power Estimation Spreadsheets
  – XPower Designs
  – Power Monitoring Testbed
  – Shang, Kaviani, Bathala, “Dynamic Power Consumption in Virtex-II FPGA Family,” FPGA ’02
• Only trying to establish relative capacitances
  – Models too imprecise to be exact
• Derive Low-Power Design Strategy
  – Minimize Multipliers
  – Use Shortest Interconnect
Resource                Capacitance (pF)
Embedded Multiplier     1,196
Block Select RAM        880
CLB                     26
Long-line Route         23
Hex-line Route          18
Double-line Route       13
Direct-Connect Route    5
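The capacitance table can be read as a relative cost ranking through the standard CMOS dynamic power relation P = toggle × C × V² × f. The Python sketch below is illustrative only: the 1.5 V supply, 100 MHz clock and 20% toggle rate are assumed defaults, and the names `CAPACITANCE_PF`, `dynamic_power_mw` and `relative_cost` are hypothetical. Since the slide stresses that only relative capacitances are meaningful, the ratios are the useful output, not the absolute mW values.

```python
# Relative capacitances copied from the slide's table (pF).
CAPACITANCE_PF = {
    "Embedded Multiplier": 1196,
    "Block Select RAM": 880,
    "CLB": 26,
    "Long-line Route": 23,
    "Hex-line Route": 18,
    "Double-line Route": 13,
    "Direct-Connect Route": 5,
}


def dynamic_power_mw(cap_pf, volts=1.5, freq_mhz=100.0, toggle=0.2):
    """P = toggle * C * V^2 * f, in mW. Defaults are assumptions, not measurements."""
    watts = toggle * (cap_pf * 1e-12) * volts ** 2 * (freq_mhz * 1e6)
    return watts * 1e3


def relative_cost(resource, reference="CLB"):
    """Switched capacitance of one resource relative to a reference resource."""
    return CAPACITANCE_PF[resource] / CAPACITANCE_PF[reference]


if __name__ == "__main__":
    for name in CAPACITANCE_PF:
        print(f"{name:22s} {relative_cost(name):6.1f}x CLB")
```

On these numbers one embedded multiplier switches as much capacitance as 46 CLBs, which is the basis for the "Minimize Multipliers" strategy.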
Traditional Image Convolution
• Slide Tap Mask Over Image
• Multiply each pixel under the mask by its tap
• Sum all Partial Products
• Resulting in new Filtered Pixel
• Operations
  – 9 Multiplies & 9 Additions / Output Pixel
[Figure: 3x3 Tap Mask (Tap 1 to Tap 9) × 3x3 Input Data (Data 1 to Data 9) = 3x3 Partial Products (PP 1 to PP 9), which combine into Output (Out 1)]
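The sliding-mask procedure can be sketched in a few lines of Python. This is a minimal illustration, not the hardware implementation: the function name `convolve3x3` is hypothetical, the image is a plain list of lists, only the interior where the mask fits is computed, and the mask is applied without flipping, as the slide describes. The accumulator starts at zero, which is what gives 9 additions per output pixel.

```python
def convolve3x3(image, taps):
    """Slide a 3x3 tap mask over a 2-D list image.

    Returns (output, multiplies, additions) so the per-pixel cost is visible.
    """
    rows, cols = len(image), len(image[0])
    out, mults, adds = [], 0, 0
    for r in range(rows - 2):
        out_row = []
        for c in range(cols - 2):
            acc = 0
            for i in range(3):
                for j in range(3):
                    # One partial product and one accumulation per tap.
                    acc += image[r + i][c + j] * taps[i][j]
                    mults += 1
                    adds += 1
            out_row.append(acc)
        out.append(out_row)
    return out, mults, adds


if __name__ == "__main__":
    image = [[1] * 4 for _ in range(4)]
    ones = [[1] * 3 for _ in range(3)]
    result, mults, adds = convolve3x3(image, ones)
    print(result, mults // 4, adds // 4)  # 4 output pixels in a 4x4 image
```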
Straight Forward Implementation
[Figure: Input Rows 1 to 3 feed register chains and the Tap 1 to Tap 9 multipliers; the 9 products enter a Binary Adder Tree that produces the Output]
• 3x3 Kernel = 9 parallel multipliers
• Multipliers are resource limited in FPGAs
  – Virtex-E
    • Instance in configurable logic
    • XCV3200E: ~81 Multipliers Max
    • 9 Pixels in Parallel
  – Virtex-II
    • Embedded Multiplier Blocks
    • XC2V8000: 168 Multipliers
    • 18 Pixels in Parallel
• Adder Trees Relatively Cheap
  – 100’s of slices
  – XCV3200E: 32,000 slices
  – XC2V8000: 46,000 slices
• This also reflects Power Prioritization
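The pixels-in-parallel figures follow from dividing the available multipliers by the 9 needed per output pixel and discarding the remainder. A small Python check, with the hypothetical helper name `pixels_in_parallel`:

```python
TAPS_PER_PIXEL = 9  # a 3x3 kernel needs 9 multipliers per output pixel


def pixels_in_parallel(multipliers, taps_per_pixel=TAPS_PER_PIXEL):
    """Whole output pixels that can be computed at once with dedicated multipliers."""
    return multipliers // taps_per_pixel


if __name__ == "__main__":
    print(pixels_in_parallel(81))   # XCV3200E, ~81 multipliers built from logic
    print(pixels_in_parallel(168))  # XC2V8000, 168 embedded multipliers
```

For the XC2V8000, 6 of the 168 multipliers are left over because 168 is not a multiple of 9.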
Convolution Kernel Types: Closer Look
• Spatial Filtering
  – Blurring, Smoothing (Lowpass)
  – Sharpening (Highpass)
  – Noise Reduction
  – Edge Detection
• Derivative Filters
  – Roberts
  – Prewitt
  – Sobel
Smoothing Filter (1 Unique Tap Value)
1/9 1/9 1/9
1/9 1/9 1/9
1/9 1/9 1/9
Sharpening Filter (2 Unique Tap Values)
-1 -1 -1
-1  8 -1
-1 -1 -1
Edge Detection Filter (2 Unique Tap Values)
[mask layout not recoverable; it contains three +1 taps and six -1 taps]
Prewitt Basis (3 Unique Tap Values)
-1 -1 -1      -1  0 +1
 0  0  0      -1  0 +1
+1 +1 +1      -1  0 +1
Sobel Basis (5 Unique Tap Values)
-1 -2 -1      -1  0 +1
 0  0  0      -2  0 +2
+1 +2 +1      -1  0 +1
• Filter Tap Values Reused Often
• Can We Exploit This?
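The unique-tap counts quoted for the masks can be reproduced by collecting the distinct values in each mask, with sign and zero counted as separate values. The Python sketch below is an illustration under assumptions: the names `MASKS` and `unique_taps` are hypothetical, the sharpening mask is taken to have its 8 at the centre, and the edge detection mask is omitted because its layout is not legible in the slide.

```python
from fractions import Fraction

# Masks as read from the slide; one basis of each derivative pair is enough
# because both bases use the same set of tap values.
MASKS = {
    "smoothing": [[Fraction(1, 9)] * 3 for _ in range(3)],
    "sharpening": [[-1, -1, -1], [-1, 8, -1], [-1, -1, -1]],
    "prewitt": [[-1, -1, -1], [0, 0, 0], [1, 1, 1]],
    "sobel": [[-1, -2, -1], [0, 0, 0], [1, 2, 1]],
}


def unique_taps(mask):
    """Sorted distinct tap values in a mask (sign and zero count as distinct)."""
    return sorted({tap for row in mask for tap in row})


if __name__ == "__main__":
    for name, mask in MASKS.items():
        print(name, len(unique_taps(mask)))
```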
1-D Symmetric FIR Filter Lessons
[Figure: 1-D symmetric FIR filter with input x(k), a register delay chain, coefficients C0 to C3, and output y(k)]
Coefficient symmetry: C(k) = C(K-(k+1))
• Telecommunication and Radar Communities
  – Exploit symmetric Filters
  – Reorder Additions Before Multiplication
  – Only 1/2 the Multipliers Necessary
• Can We Exploit 2-D Symmetry?
  – Tap Values Reprogrammable
  – Tap Symmetry Reprogrammable
• Minimize Multipliers
• Leverage Large Amount of Configurable Logic Blocks
• Benefits of Increased Parallelism
  – Higher Throughput
  – More Efficient Power Utilization Over Time
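The symmetric FIR trick of adding before multiplying can be checked numerically. The Python sketch below compares a direct form with a folded form for an even number of taps K with C(k) = C(K-(k+1)). The function names `fir_direct` and `fir_folded` and the sample data are hypothetical; each function returns its result together with the number of multiplies used.

```python
def fir_direct(window, coeffs):
    """Direct form: one multiply per tap."""
    return sum(c * x for c, x in zip(coeffs, window)), len(coeffs)


def fir_folded(window, half_coeffs):
    """Symmetric form: add mirrored samples first, then one multiply per pair.

    Assumes an even tap count K with C(k) = C(K-(k+1)), so only the first
    K/2 coefficients are needed.
    """
    k = len(window)
    if k != 2 * len(half_coeffs):
        raise ValueError("window must hold twice as many samples as half_coeffs")
    acc = 0
    for i, c in enumerate(half_coeffs):
        acc += c * (window[i] + window[k - 1 - i])  # pre-add, then multiply once
    return acc, len(half_coeffs)


if __name__ == "__main__":
    half = [1, 2, 3, 4]               # C0..C3
    full = half + half[::-1]          # 8 symmetric taps
    samples = [5, 1, 4, 2, 8, 3, 7, 6]
    print(fir_direct(samples, full))
    print(fir_folded(samples, half))
```

Both forms give the same output, with the folded form using 4 multiplies instead of 8.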
Key Ideas
• Number of Active Multipliers Varies with Tap Mask
  – Turn off unused Multipliers (lower power)
  – Or, use unused Multipliers to process next pixel
    • Requires parallel memory accesses
    • Higher throughput
    • Finish sooner, then sleep device
    • Lower Clock Rate
• Adder Tree layers before and after multiply vary with number of Multipliers per pixel
• Input Data must be able to be routed to each multiplier
• Will multiplier savings outweigh extra routing, multiplexing, larger circuit quiescent power?
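The 2-D analogue of the symmetric FIR trick is to sum all pixels that share a tap value and multiply each sum once. The Python sketch below illustrates the idea for a single output pixel and is not a description of the actual circuit. The name `grouped_pixel` is hypothetical, and a zero tap is still counted as one multiplier, matching how the slides count unique taps.

```python
from collections import defaultdict


def grouped_pixel(window, taps):
    """One output pixel: sum data sharing a tap, then multiply once per tap.

    window and taps are 3x3 lists. Returns (value, multiplies_used).
    """
    sums = defaultdict(int)
    for i in range(3):
        for j in range(3):
            sums[taps[i][j]] += window[i][j]  # additions before the multiplier
    # One multiply per unique tap value, then a final sum of the products.
    value = sum(tap * total for tap, total in sums.items())
    return value, len(sums)


if __name__ == "__main__":
    window = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    sobel = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]
    print(grouped_pixel(window, sobel))
```

With a Sobel mask this needs 5 multiplies instead of 9; with a sharpening mask, 2.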
Adaptive Convolution Kernel Sizing
• Implementing Multiple Pixel Version
• How Many Multipliers to Use?
  – Multiple of 9
  – Size that is easy to place and allows for TMR growth
18 Multipliers Per Kernel
Number of Unique Taps     Number of Masks     Speedup Over
in 3x3 Conv. Mask         per Kernel          Traditional Convolution
1                         18                  9x
2                         9                   4.5x
3                         6                   3x
4                         4                   2x
5                         3                   1.5x
6                         3                   1.5x
7-9                       2                   1x
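The table values are consistent with dividing the 18 multipliers by the number of unique taps and rounding down, with speedup measured against the 2 pixels a traditional 9-multiplier-per-pixel design gets from the same 18 multipliers. That rule is inferred from the numbers rather than stated in the slides. A Python sketch with the hypothetical names `masks_per_kernel` and `speedup`:

```python
MULTIPLIERS_PER_KERNEL = 18
TRADITIONAL_TAPS = 9  # multipliers per pixel in the traditional design


def masks_per_kernel(unique_taps, multipliers=MULTIPLIERS_PER_KERNEL):
    """Whole masks (output pixels) that fit when each needs unique_taps multipliers."""
    return multipliers // unique_taps


def speedup(unique_taps, multipliers=MULTIPLIERS_PER_KERNEL):
    """Throughput relative to traditional convolution on the same multipliers."""
    baseline = multipliers // TRADITIONAL_TAPS  # 2 pixels at a time
    return masks_per_kernel(unique_taps, multipliers) / baseline


if __name__ == "__main__":
    for taps in range(1, 10):
        print(taps, masks_per_kernel(taps), f"{speedup(taps)}x")
```

The rounding explains why 5 and 6 unique taps share the same speedup, and why 7 to 9 unique taps give none.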
Kernel Block Diagram
[Figure: kernel block diagram with blocks Register Delay Bank, State Machine, Data Mux, Common Tap Mux, Adder Tree, multipliers (M), and Output Adder Tree; inputs Input Row 0 to Input Row 19, Number of Unique Taps, Tap Mask, and Tap Value; outputs Output 0 to Output 17]
• Group Data Values with Common Taps
• Dynamically Adjust Multiplier Position within Adder Tree
Implementation Comparison
Resource                  Baseline     Power Efficient
Flip-Flops                6,435        7,231
LUTs                      8,141        9,181
Block RAMs                6            20
Multipliers               18           18
Operating Frequency       100 MHz      100 MHz
Quiescent Power (mW)      711.6        965.0
• Quiescent Power 35% Higher
Total Energy Comparison
[Figure: bar chart by Kernel (Baseline; Power Efficient with 9, 8, 7, 6, 5, 4, 3, 2, and 1 taps); plotted quantities: Energy (mJ) per 512 x 512 Image and Energy Improvement Factor; scale 0 to 8]
• For Higher Tap Commonality, Shorter Dynamic Power Consumption Window Overcomes Higher Quiescent Power
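The trade-off can be sketched with a simple energy model: energy per image = (quiescent power + dynamic power) × processing time, where processing time shrinks as more pixels are produced per clock. The Python below is a rough illustration only. The quiescent powers, 100 MHz clock and 512 x 512 image size come from the slides, while the 2000 mW dynamic power is a placeholder assumption (the slides do not list dynamic power values) and the function names are hypothetical. It also assumes the device draws no power once the image is finished.

```python
PIXELS = 512 * 512
CLOCK_HZ = 100e6
BASELINE_QUIESCENT_MW = 711.6   # from the implementation comparison
EFFICIENT_QUIESCENT_MW = 965.0  # from the implementation comparison
ASSUMED_DYNAMIC_MW = 2000.0     # placeholder assumption, not a measured value


def frame_time_s(pixels_per_clock):
    """Seconds to process one image at the given throughput."""
    return PIXELS / (pixels_per_clock * CLOCK_HZ)


def energy_mj(quiescent_mw, dynamic_mw, pixels_per_clock):
    """Energy per image in mJ (mW x s): total power times the active window."""
    return (quiescent_mw + dynamic_mw) * frame_time_s(pixels_per_clock)


if __name__ == "__main__":
    base = energy_mj(BASELINE_QUIESCENT_MW, ASSUMED_DYNAMIC_MW, 2)
    print(f"baseline: {base:.2f} mJ")
    for unique_taps in range(9, 0, -1):
        rate = 18 // unique_taps  # pixels per clock with 18 multipliers
        e = energy_mj(EFFICIENT_QUIESCENT_MW, ASSUMED_DYNAMIC_MW, rate)
        print(f"{unique_taps} taps: {e:.2f} mJ, improvement {base / e:.2f}x")
```

Under these assumptions the power-efficient kernel costs more energy than the baseline when it runs at the same 2 pixels per clock, and less once tap commonality lets it process 3 or more pixels per clock.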
What is hard?
• Poor Tool Support for Power Design
  – Analyzing Power Trade-offs can be complex & time consuming
  – Have to have fully routed and simulated designs to compare approaches
  – Router is optimized for throughput, not power
• Finding all Chip Enables to Disable
  – For each of several different multiplexer settings
• Secondary Power Effects
  – Can also use Relative Placement Macros to “help” Router
    • Finding where can be time consuming
Analysis
• For Higher Tap Commonality, Shorter Dynamic Power Consumption Window Overcomes Higher Quiescent Power
  – Crossover point at 7 taps is an implementation limitation of using 18 multipliers in kernel
• Quiescent Power
  – Not much larger considering extra circuitry
    • 18 Adder Trees, 16 Block RAMs
• Dynamic Power Consumption
  – Observed to vary by +50% within one circuit from one place and route to another, even using same settings
  – Average of 3 routes used for each circuit
• For Systems Where Parallelizing Input Data Stream Is Difficult
  – Disabling extra Multipliers is best approach
  – Power savings expected to be less
Conclusions
• Substantial Power Savings can be Achieved by Making Power a First-Class Design Constraint
• Knowledge of Underlying Resource Capacitance a Key Foundation
  – Re-use Power-Critical Components
• Routing Can Be Influenced to Yield Lower Power
  – Over-constrain timing on power sensitive nets
  – Use Relative Placement Macros (RPMs)