A Performance and Energy Comparison of FPGAs, GPUs, and Multicores for Sliding-Window Applications

From J. Fowers, G. Brown, P. Cooke, and G. Stitt, University of Florida

Presentation by John Potts, University of Guelph

Outline

● Introduction

– What is a sliding-window application?

– Justification

● Background

– Applications

● Methodologies

● Results

● Analysis

● Conclusions

Introduction: Sliding-Window Applications

● What is a Sliding-Window Application?

● 2-Dimensional Signal Analysis

● x by y image, n by m kernel or window
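A minimal sketch of how the windows are enumerated, assuming the image is stored as x rows of y values and the n by m window must lie fully inside the image. The function name `sliding_windows` and the sample image are hypothetical, chosen only for illustration.

```python
def sliding_windows(image, n, m):
    """Yield (i, j, window) for every n-by-m window fully inside the image.

    Assumption: `image` is a list of x rows, each holding y values.
    """
    x, y = len(image), len(image[0])
    for i in range(x - n + 1):
        for j in range(y - m + 1):
            # Copy out the n rows and m columns starting at (i, j).
            window = [row[j:j + m] for row in image[i:i + n]]
            yield i, j, window


# Hypothetical 4 by 5 image holding the values 0..19 in raster order.
example = [[r * 5 + c for c in range(5)] for r in range(4)]
windows = list(sliding_windows(example, 2, 3))
```

With a 4 by 5 image and a 2 by 3 window this yields (4-2+1) x (5-3+1) = 9 windows, one per output position.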

Introduction - Justification

● Computing architectures are trending towards parallelism and heterogeneity

● GPUs are common

● Multitude of accelerator options available

● Metrics for different devices vary widely with applications

● Often many Pareto-optimal solutions

● Study focuses on a particular application type and two particular design criteria

Introduction: Devices

Devices tested:

● Altera Stratix III E260 FPGA

● NVIDIA GeForce GTX 295 GPU using CUDA framework

● Quad-core Xeon W3520 using OpenCL multicore framework

● Single-chip systems also examined

Background: Previous Work

● Application performance for FPGAs and GPUs

– Sinha et al.: feature tracker on GPU

– Porter et al.: stereo matching algorithms on FPGA

● FPGA, GPU and CPU comparisons

– Baker et al.: matched filter algorithm; Cell processor best for performance and energy, GPU best for performance per dollar

– Pauwels et al.: Vision-based algorithms, FPGAs best for single stage algorithms only

● Different use cases:

– Cope et al.: 2D convolution and colour correction, performance dependent on kernel size

– Asano et al.: CPU, GPU and FPGA for applications of 2D filter, SAD stereo vision disparity, k-means clustering

Background: Improvements offered by this study

● Study provides a more in-depth analysis of sliding-window applications

● Wider range of image and kernel sizes

● Presents a generalized circuit architecture

● Optimizations deliver real-time sliding-window processing of HD video on a single GPU or FPGA

● Evaluates a new application based on Information Theoretic Learning

Background: Applications

● Applications where the kernel is fully immersed

● SAD – Sum of Absolute Differences

● 2D Convolution

● Correntropy

● 2D FFT – GPU and Multicore only

Applications: Sum of Absolute Differences

● Detect a degree of similarity between images

● E.g.: security system

● Operation

● Output: structure of size (x-n+1) x (y-m+1)
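The SAD operation can be sketched as below, assuming an x by y image and an n by m kernel stored as nested lists of integers. The function name `sad_map` and the sample data are hypothetical; this is a plain sequential illustration, not the study's baseline code.

```python
def sad_map(image, kernel):
    """Return the (x-n+1) by (y-m+1) map of sums of absolute differences."""
    x, y = len(image), len(image[0])
    n, m = len(kernel), len(kernel[0])
    out = []
    for i in range(x - n + 1):
        row = []
        for j in range(y - m + 1):
            total = 0
            # Compare the kernel against the window anchored at (i, j).
            for a in range(n):
                for b in range(m):
                    total += abs(image[i + a][j + b] - kernel[a][b])
            row.append(total)
        out.append(row)
    return out


# Hypothetical data: the kernel equals the top-left 2x2 corner of the image.
sad_image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
sad_kernel = [[1, 2], [4, 5]]
sad_result = sad_map(sad_image, sad_kernel)
```

A value of 0 marks an exact match; larger values mean the window is less similar to the kernel.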

Applications: 2D Convolution

● Used in digital signal processing, scientific computing, small to high-performance embedded systems

● Operation

● Equation:

● Common Optimization
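For reference, the textbook spatial-domain form of 2D convolution over fully immersed windows can be sketched as follows. This is the standard definition (kernel flipped in both dimensions), not a reconstruction of the slide's equation or of the optimization it mentions; `convolve2d_valid` and the sample data are hypothetical names.

```python
def convolve2d_valid(image, kernel):
    """Direct 2D convolution restricted to fully immersed windows.

    out[i][j] = sum over a, b of image[i+a][j+b] * kernel[n-1-a][m-1-b]
    """
    x, y = len(image), len(image[0])
    n, m = len(kernel), len(kernel[0])
    out = []
    for i in range(x - n + 1):
        row = []
        for j in range(y - m + 1):
            acc = 0
            for a in range(n):
                for b in range(m):
                    # The kernel is indexed in reverse, which is what
                    # distinguishes convolution from correlation.
                    acc += image[i + a][j + b] * kernel[n - 1 - a][m - 1 - b]
            row.append(acc)
        out.append(row)
    return out


conv_image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
conv_kernel = [[1, 0], [0, 0]]  # a single tap; flipping moves it to the far corner
conv_result = convolve2d_valid(conv_image, conv_kernel)
```

Each output costs n x m multiply-accumulates, which is why performance depends strongly on kernel size.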

Applications: Correntropy

● Measure of similarity based on Information Theoretic Learning

● Many possible applications, study focuses on one similar to SAD

● Equation:

● Operation
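A rough sketch of a correntropy-style similarity map, assuming a Gaussian kernel of width `sigma` is applied to each pixel difference and the results are averaged over the window. The exact formula and normalization used in the study are not shown in these slides, so the function names, `sigma`, and the averaging step are all assumptions.

```python
import math


def gaussian(d, sigma):
    """Gaussian kernel evaluated at difference d (assumed form)."""
    return math.exp(-(d * d) / (2.0 * sigma * sigma)) / (math.sqrt(2.0 * math.pi) * sigma)


def correntropy_map(image, kernel, sigma=1.0):
    """Mean Gaussian similarity between the kernel and every immersed window."""
    x, y = len(image), len(image[0])
    n, m = len(kernel), len(kernel[0])
    out = []
    for i in range(x - n + 1):
        row = []
        for j in range(y - m + 1):
            total = sum(gaussian(image[i + a][j + b] - kernel[a][b], sigma)
                        for a in range(n) for b in range(m))
            row.append(total / (n * m))
        out.append(row)
    return out


def best_match(similarity):
    """Return the (i, j) position with the highest similarity."""
    positions = ((i, j) for i in range(len(similarity))
                 for j in range(len(similarity[0])))
    return max(positions, key=lambda p: similarity[p[0]][p[1]])


corr_image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
corr_template = [[5, 6], [8, 9]]  # hypothetical template cut from the image
corr_result = correntropy_map(corr_image, corr_template)
corr_best = best_match(corr_result)
```

Unlike SAD, where the best match is the minimum, the best match here is the maximum of the similarity map.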

Methodology: FPGA

Circuit Architecture:

Methodology: FPGA

● Uses a window generator to reduce bandwidth requirements

● Controller and host software transfer the image, initialize, poll, and read output

● SAD implementation

● 2D Convolution
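One common way to build a window generator is with line buffers: keep the previous n-1 rows on chip so each pixel is read from memory only once. The software model below sketches that idea under this assumption; it is not necessarily the circuit used in the study, and `stream_windows` is a hypothetical name.

```python
from collections import deque


def stream_windows(pixels, width, n, m):
    """Yield n-by-m windows from a raster-order pixel stream.

    Each input pixel is consumed exactly once; only n-1 complete rows
    plus the row in progress are held at any time.
    """
    buffered = deque(maxlen=n - 1)  # the n-1 most recent complete rows
    current = []                    # the row currently arriving
    for p in pixels:
        current.append(p)
        col = len(current) - 1
        # A window is complete once n rows and m columns are available.
        if len(buffered) == n - 1 and col >= m - 1:
            rows = list(buffered) + [current]
            yield [r[col - m + 1:col + 1] for r in rows]
        if len(current) == width:
            buffered.append(current)
            current = []


# Hypothetical 3x3 image streamed as 1..9, with a 2x2 window.
streamed = list(stream_windows(range(1, 10), width=3, n=2, m=2))
```

Without such buffering, every window would refetch all n x m pixels, multiplying the memory bandwidth needed per output.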

Methodology: FPGA

● Correntropy

Methodology: FPGA Resources

Application                      LUTs     Registers  Block Memory Bits  DSP Blocks
SAD                              137,260  156,377    2,256,464          0
2D Convolution: Fixed Point      33,547   57,122     1,601,104          738
2D Convolution: Floating Point   129,024  126,821    1,633,872          676
Correntropy                      141,633  143,137    2,256,464          0

Methodology: GPU

● Uses specialized memory organisation

● a x b output pixels, 64x32 selected

● Macroblock size balances between threads per block and memory bank conflicts. 2x2 chosen

● SAD: calculated between kernel and four windows in the corresponding macroblock
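To see why grouping outputs into an a x b tile helps, the arithmetic below counts how many input pixels a tile needs, assuming the tile's inputs are loaded once into on-chip memory and shared by all of its windows. That staging assumption, the function name `tile_footprint`, and the 9x9 kernel in the example are illustrative and not taken from the slides.

```python
def tile_footprint(a, b, n, m):
    """Return (input pixels needed, reuse factor) for an a-by-b output tile.

    Neighbouring windows overlap, so an a-by-b block of outputs with an
    n-by-m kernel touches only (a+n-1) * (b+m-1) distinct input pixels.
    """
    shared_loads = (a + n - 1) * (b + m - 1)
    naive_loads = a * b * n * m  # every window fetching its own pixels
    return shared_loads, naive_loads / shared_loads


# 64x32 output tile (the size selected in the slides) with a 9x9 kernel.
tile_pixels, tile_reuse = tile_footprint(64, 32, 9, 9)
```

For this example the tile needs 2,880 input pixels instead of 165,888 separate fetches, a reuse factor of 57.6.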

Methodology: GPU

● 2D Convolution: Similar to SAD

– Frequency domain also implemented (2D FFT)

● Correntropy: SAD with extra step

– Challenge: locating maximum similarity values

Methodology: Multicore

● Utilized OpenCL parallel programming standard

● Optimizations focused on minimizing communication between threads

● Implementation consists of straightforward specification of the window function

Results

● Results examined include FPS, speedup analysis, energy efficiency

● Single-chip systems such as APUs and standalone FPGA examined

– Upper bound estimates found by removing PCIe transfer times

● Sequential C++ implementations used as baseline

● Implementations evaluated for 480p, 720p, 1080p video

● Kernel sizes of 4x4, 9x9, 16x16, 25x25 evaluated for all applications, also 36x36 and 45x45 for SAD and correntropy

Results: Sum of Absolute Differences

Results: 2D Convolution

Results: Correntropy

Results: Speedup

Results: Application Comparison

Results: Analysis

● GPU is best for smaller kernels (4x4 and 9x9), equivalent at 16x16

● FPGA speedup reached 240x, 45x, and 298x over the sequential baseline for SAD, 2D Convolution, and Correntropy respectively

● 2D Convolution: GPU-FFT was faster than FPGA

● FPGA implementations were near constant time due to pipelining; extra steps add latency rather than reducing throughput
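The near-constant FPGA time follows from a simple pipeline model, sketched here under the assumption that the circuit accepts one pixel per clock cycle. The clock rate and pipeline depths below are hypothetical values chosen for illustration, not measurements from the study.

```python
def pipelined_frame_time(pixels, pipeline_depth, clock_hz):
    """Seconds per frame for a fully pipelined circuit.

    Assumption: one pixel enters per cycle, so depth only adds a one-off
    fill latency on top of the pixel count.
    """
    return (pixels + pipeline_depth) / clock_hz


frame_pixels = 1280 * 720  # one 720p frame
t_shallow = pipelined_frame_time(frame_pixels, pipeline_depth=100, clock_hz=100e6)
t_deep = pipelined_frame_time(frame_pixels, pipeline_depth=1000, clock_hz=100e6)
```

Making the pipeline ten times deeper changes the frame time by about 0.1%, which is why a larger kernel or an extra processing step barely moves the frame rate in this model.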

Results: Single Chip Systems

● PCIe transfer times were as much as 65% of GPU execution time, 64% of FPGA execution time

● Single-chip FPGA is consistently ~2x faster than the PCIe-attached FPGA

● At time of writing, GPU times minus PCIe transfer time are not a realistic representation, as standalone or single-chip GPU systems do not have nearly the capability of the device tested
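The single-chip upper bound can be sketched as simple arithmetic: subtract the PCIe share from the measured frame time and invert. The function name and the 20 ms / 50% figures are hypothetical, picked only to show how a roughly 2x gain arises.

```python
def single_chip_fps(seconds_per_frame, pcie_fraction):
    """Upper-bound frame rate if PCIe transfer time were removed entirely."""
    compute_seconds = seconds_per_frame * (1.0 - pcie_fraction)
    return 1.0 / compute_seconds


measured_fps = 1.0 / 0.02                     # hypothetical: 20 ms per frame
upper_bound_fps = single_chip_fps(0.02, 0.5)  # hypothetical: half the time is PCIe
```

When transfers take half of the frame time, removing them doubles the frame rate; at a 65% share the bound rises to about 2.9x.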

Results: Energy Comparison

Energy consumed for one frame


Theoretical Wattage for 30 fps:
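The conversion from energy per frame to wattage at a fixed frame rate is a single multiplication, since one watt is one joule per second. The function name and the 0.5 J figure are hypothetical and are not results from the study.

```python
def power_at_rate(joules_per_frame, fps=30):
    """Average power in watts needed to sustain `fps` frames per second."""
    return joules_per_frame * fps


example_watts = power_at_rate(0.5)  # hypothetical device using 0.5 J per frame
```

A device that spends 0.5 J on each frame would draw 15 W on average at 30 fps.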


● Example application: Embedded system using correntropy for target tracking

Conclusions

● Compared performance and energy requirements of sliding-window applications across a variety of devices and use cases

● FPGAs were faster except for small inputs

● FPGAs had lower power requirements

● Consistency of results suggests applicability to other sliding-window applications

Questions?
