A Performance and Energy Comparison of FPGAs, GPUs, and Multicores for Sliding-Window Applications
From J. Fowers, G. Brown, P. Cooke, and G. Stitt, University of Florida
Presentation by John Potts, University of Guelph
2
Outline
● Introduction
– What is a sliding-window application?
– Justification
● Background
– Applications
● Methodologies
● Results
● Analysis
● Conclusions
3
Introduction: Sliding-Window Applications
● What is a Sliding-Window Application?
● 2-Dimensional Signal Analysis
● x by y image, n by m kernel or window
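As an illustration of the definition above, the following is a minimal sketch of window extraction in plain Python. The function name `sliding_windows` and the tiny test image are hypothetical, and the sketch assumes n counts kernel rows and m counts kernel columns. It visits only positions where the window lies fully inside the image.

```python
def sliding_windows(image, n, m):
    """Yield (row, col, window) for every n-by-m window fully inside the image."""
    rows, cols = len(image), len(image[0])
    for r in range(rows - n + 1):
        for c in range(cols - m + 1):
            # Copy the n-by-m region whose top-left corner is (r, c).
            window = [row[c:c + m] for row in image[r:r + n]]
            yield r, c, window


# Hypothetical 3-by-4 image and a 2-by-2 window.
image = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
]
windows = list(sliding_windows(image, 2, 2))
print(len(windows))  # (3 - 2 + 1) * (4 - 2 + 1) window positions
```

Each window is independent of the others, which is what makes this class of application a natural fit for parallel devices.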
4
Introduction: Sliding-Window Applications
5
Introduction - Justification
● Computing Architectures tending towards parallelism and heterogeneity
● GPUs are common
● Multitude of accelerator options available
● Metrics for different devices vary widely with applications
● Often many Pareto-optimal solutions
● Study focuses on a particular application type and two particular design criteria
6
Introduction: Devices
Devices tested:
● Altera Stratix III E260 FPGA
● NVIDIA GeForce GTX 295 GPU using CUDA framework
● Quad-core Xeon W3520 using OpenCL multicore framework
● Single chip systems also examined
7
Background: Previous Work
● Application performance for FPGAs and GPUs
– Sinha et al.: feature tracker on GPU
– Porter et al.: stereo matching algorithms on FPGA
● FPGA, GPU and CPU comparisons
– Baker et al.: matched filter algorithm, Cell Processor for performance and energy, GPU for performance per dollar
– Pauwels et al.: Vision-based algorithms, FPGAs best for single stage algorithms only
● Different use cases:
– Cope et al.: 2D convolution and colour correction, performance dependent on kernel size
– Asano et al.: CPU, GPU and FPGA for applications of 2D filter, SAD stereo vision disparity, k-means clustering
8
Background: Improvements offered by this study
● Study provides a more in-depth analysis of sliding-window applications
● Wider range of image and kernel sizes
● Presents a generalized circuit architecture
● Optimizations deliver real-time sliding-window processing of HD video on single GPU or FPGA
● Evaluates a new application based on Information Theoretic Learning
9
Background: Applications
● Applications where the kernel is fully immersed
● SAD – Sum of Absolute Differences
● 2D Convolution
● Correntropy
● 2D FFT – GPU and Multicore only
10
Applications: Sum of Absolute Differences
● Detect a degree of similarity between images
● E.g.: security system
● Operation
● Output: structure of size (x-n+1)x(y-m+1)
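A minimal sketch of the SAD operation in plain Python may help make the output size concrete. The function name `sad_map` and the sample data are hypothetical. The sketch assumes the kernel is compared against every fully immersed window, with a lower sum meaning a closer match.

```python
def sad_map(image, kernel):
    """Return the sum of absolute differences for every fully immersed window."""
    n, m = len(kernel), len(kernel[0])
    rows, cols = len(image), len(image[0])
    out = []
    for r in range(rows - n + 1):
        out_row = []
        for c in range(cols - m + 1):
            total = 0
            for i in range(n):
                for j in range(m):
                    total += abs(image[r + i][c + j] - kernel[i][j])
            out_row.append(total)
        out.append(out_row)
    return out


# Hypothetical data: the kernel is cut from the bottom-right corner of the image.
image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
kernel = [[5, 6], [8, 9]]
result = sad_map(image, kernel)
print(result)  # a 2-by-2 structure; the exact match scores 0
```

For a 3 by 3 image and a 2 by 2 kernel the output has (3-2+1) by (3-2+1) entries, matching the size formula on the slide.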
11
Applications: 2D Convolution
● Used in digital signal processing, scientific computing, small to high-performance embedded systems
● Operation
● Equation:
● Common Optimization
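The slide's equation did not survive extraction, so the following is a sketch of the textbook definition of 2D convolution restricted to fully immersed windows, not necessarily the exact form the authors presented. The function name `convolve2d_valid` and the sample data are hypothetical. The kernel is flipped in both dimensions, which is what distinguishes convolution from correlation.

```python
def convolve2d_valid(image, kernel):
    """Textbook 2D convolution over fully immersed windows (kernel is flipped)."""
    n, m = len(kernel), len(kernel[0])
    rows, cols = len(image), len(image[0])
    out = []
    for r in range(rows - n + 1):
        out_row = []
        for c in range(cols - m + 1):
            acc = 0
            for i in range(n):
                for j in range(m):
                    # Index the kernel from its far corner to apply the flip.
                    acc += image[r + i][c + j] * kernel[n - 1 - i][m - 1 - j]
            out_row.append(acc)
        out.append(out_row)
    return out


# Hypothetical data: a small diagonal-difference kernel.
image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
kernel = [[1, 0], [0, -1]]
result = convolve2d_valid(image, kernel)
print(result)
```

Every output value costs n*m multiply-accumulate operations, which is why the fixed-point FPGA design in the resource table leans on DSP blocks.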
12
Applications: Correntropy
● Measure of similarity based on Information Theoretic Learning
● Many possible applications, study focuses on one similar to SAD
● Equation:
● Operation
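The slide's equation is missing from the extracted text. As a hedged sketch, the standard sample estimator of correntropy averages a Gaussian kernel of the pointwise differences. The version below assumes an unnormalized Gaussian and a hypothetical width `sigma=1.0`; the function names are also hypothetical. Unlike SAD, a higher score means a closer match, so the best location is the maximum.

```python
import math


def correntropy_map(image, kernel, sigma=1.0):
    """Average Gaussian of pixel differences for every fully immersed window."""
    n, m = len(kernel), len(kernel[0])
    rows, cols = len(image), len(image[0])
    out = []
    for r in range(rows - n + 1):
        out_row = []
        for c in range(cols - m + 1):
            total = 0.0
            for i in range(n):
                for j in range(m):
                    d = image[r + i][c + j] - kernel[i][j]
                    total += math.exp(-(d * d) / (2.0 * sigma * sigma))
            out_row.append(total / (n * m))
        out.append(out_row)
    return out


def best_match(scores):
    """Return (row, col) of the highest similarity score."""
    best = max((v, r, c) for r, row in enumerate(scores) for c, v in enumerate(row))
    return best[1], best[2]


# Hypothetical data: the kernel is cut from the bottom-right corner of the image.
image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
kernel = [[5, 6], [8, 9]]
scores = correntropy_map(image, kernel)
print(best_match(scores))
```

The Gaussian saturates for large differences, so a few badly mismatched pixels reduce the score less than they would inflate a SAD total.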
13
Methodology: FPGA
Circuit Architecture:
14
Methodology: FPGA
● Uses a window generator to reduce bandwidth requirements
● Controller and host software transfer the image, initialize, poll, and read the output
● SAD implementation
● 2D Convolution
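The slides state only that a window generator reduces bandwidth requirements. One common way to achieve this, assumed here and not confirmed by the source, is to buffer the most recent n image rows so that each pixel is fetched from memory once and then reused by every window that overlaps it. The following is a software analogy of that idea, not the authors' circuit; `stream_windows` is a hypothetical name.

```python
from collections import deque


def stream_windows(pixels, width, n, m):
    """Consume a raster-order pixel stream once and yield each n-by-m window."""
    rows = deque(maxlen=n)  # stands in for on-chip row buffers (assumption)
    current = []
    for p in pixels:
        current.append(p)
        if len(current) == width:
            rows.append(current)  # the oldest row drops out automatically
            current = []
            if len(rows) == n:
                for c in range(width - m + 1):
                    yield [row[c:c + m] for row in rows]


# Hypothetical 3-row, 4-column image delivered as a flat stream.
stream = range(1, 13)
windows = list(stream_windows(stream, width=4, n=2, m=2))
print(len(windows))
```

The input is read exactly once, while a naive implementation would re-read each pixel up to n*m times.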
15
Methodology: FPGA
● Correntropy
16
Methodology: FPGA Resources
Application                      LUTs      Registers   Block Memory Bits   DSP Blocks
SAD                              137,260   156,377     2,256,464           0
2D Convolution: Fixed Point      33,547    57,122      1,601,104           738
2D Convolution: Floating Point   129,024   126,821     1,633,872           676
Correntropy                      141,633   143,137     2,256,464           0
17
Methodology: GPU
● Uses specialized memory organisation
● a x b output pixels, 64x32 selected
● Macroblock size balances between threads per block and memory bank conflicts; 2x2 chosen
● SAD: calculated between kernel and four windows in the corresponding macroblock
18
Methodology: GPU
● 2D Convolution: Similar to SAD
– Frequency domain also implemented (2D FFT)
● Correntropy: SAD with extra step
– Challenge: locating maximum similarity values
19
Methodology: Multicore
● Utilized OpenCL parallel programming standard
● Optimizations focused on minimizing communication between threads
● Implementation consists of straightforward specification of the window function
20
Results
● Results examined include FPS, speedup analysis, energy efficiency
● Single chip systems such as APUs and standalone FPGA examined
– Upper bound estimates found by removing PCIe transfer times
● Sequential C++ implementations used as baseline
● Implementations evaluated for 480p, 720p, 1080p video
● Kernel sizes of 4x4, 9x9, 16x16, 25x25 evaluated for all applications, also 36x36 and 45x45 for SAD and correntropy
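To give a sense of how the workload scales across this test matrix, the sketch below counts windows and per-pixel operations per frame. The frame dimensions are assumptions (the slides name only 480p, 720p, and 1080p), and the helper names are hypothetical.

```python
# Assumed frame sizes as (width, height); the slides give only the format names.
RESOLUTIONS = {"480p": (640, 480), "720p": (1280, 720), "1080p": (1920, 1080)}
KERNEL_SIZES = (4, 9, 16, 25, 36, 45)


def windows_per_frame(width, height, k):
    """Number of fully immersed k-by-k windows in one frame."""
    return (width - k + 1) * (height - k + 1)


def pixel_ops_per_frame(width, height, k):
    """One operation per kernel element per window (e.g. one absolute difference)."""
    return windows_per_frame(width, height, k) * k * k


for name, (w, h) in RESOLUTIONS.items():
    for k in KERNEL_SIZES:
        print(name, f"{k}x{k}", windows_per_frame(w, h, k), pixel_ops_per_frame(w, h, k))
```

Under these assumptions a 1080p frame with a 45x45 kernel needs roughly 3.9 billion per-pixel operations, compared with about 33 million for a 4x4 kernel, which is why kernel size dominates the comparison.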
21
Results: Sum of Absolute Difference
22
Results: 2D Convolution
23
Results: Correntropy
24
Results: Speedup
25
Results: Application Comparison
26
Results: Analysis
● GPU is best for smaller kernels (4x4 and 9x9), equivalent at 16x16
● FPGA speedup reached 240x, 45x, 298x over sequential baseline for SAD, 2D Convolution, Correntropy
● 2D Convolution: GPU-FFT was faster than FPGA
● FPGA implementations were near constant time due to pipelining; extra steps appear as added latency rather than reduced throughput
27
Results: Single Chip Systems
● PCIe transfer times were as much as 65% of GPU execution time, 64% of FPGA execution time
● Single-chip FPGA is consistently ~2x faster than the PCIe-attached FPGA
● At time of writing, GPU times minus PCIe transfer time are not a realistic representation, as standalone or single chip GPU systems do not have nearly the capability of the device tested
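The upper-bound estimate follows from simple arithmetic: if transfers take a fraction f of total execution time, removing them leaves 1 - f of the time, for a speedup of 1 / (1 - f). The sketch below encodes that relation; the function name is hypothetical.

```python
def speedup_without_transfer(transfer_fraction):
    """Upper-bound speedup if a given fraction of run time (transfers) is removed."""
    if not 0.0 <= transfer_fraction < 1.0:
        raise ValueError("transfer_fraction must be in [0, 1)")
    return 1.0 / (1.0 - transfer_fraction)


print(speedup_without_transfer(0.50))  # transfers take half the time
print(speedup_without_transfer(0.64))  # the largest FPGA fraction quoted above
```

A consistent speedup of about 2x corresponds to transfers taking about half of the execution time, while the quoted maximum of 64% would allow up to roughly 2.8x.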
28
Results: Energy Comparison
Energy consumed for one frame
29
Results: Energy Comparison
Theoretical Wattage for 30 fps:
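The two energy slides are related by a simple conversion: energy per frame is power multiplied by the time spent on that frame, and sustaining a frame rate requires an average power of energy per frame multiplied by frames per second. The sketch below shows both directions; the function names and the sample values are hypothetical, not measurements from the study.

```python
def energy_per_frame_joules(power_watts, seconds_per_frame):
    """Energy used to process one frame at a given power draw."""
    return power_watts * seconds_per_frame


def average_power_watts(energy_per_frame, fps=30.0):
    """Average power needed to sustain a frame rate at a given energy per frame."""
    return energy_per_frame * fps


# Hypothetical device: draws 120 W and finishes a frame in 10 ms.
energy = energy_per_frame_joules(120.0, 0.010)
print(energy, average_power_watts(energy, fps=30.0))
```

A device that finishes each frame faster than the 30 fps deadline can idle for the remainder, so its theoretical wattage at 30 fps can be well below its peak draw.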
30
Results: Energy Comparison
● Example application: Embedded system using correntropy for target tracking
31
Conclusions
● Compared performance and energy requirements of sliding-window applications across a variety of devices and use cases
● FPGAs were faster except for small inputs
● FPGAs had lower power requirements
● Consistency of results suggests applicability to other sliding-window applications
32
Questions?