Erik D’Hollander University of Ghent Belgium

Erik D’Hollander - Unical · •GPU: CUDA, OpenCL –C PTX (Parallel Thread Execution) ... HercuLeS (C/assembly-to-VHDL) ... pipeline, inline




Outline

1. Super desktop GPU/FPGA architecture

2. Programming tool chain

3. FPGA vs. GPU strengths

4. Roofline performance model for FPGA

5. Tuning performance

6. Optimizing compute resources

7. Conclusion

Supercomputing 1969-2018

• 1969: MFlops

• 1985: GFlops

• 1997: TFlops

• 2008: PFlops

• 2018: EFlops?

[Figure: peak performance of leading supercomputers on a log scale from 1.E+03 to 1.E+18: CDC 7600, CDC STAR, Cray X-MP, Cray-2, Fujitsu NWT, Hitachi SR2201, Intel ASCI, NEC Earth Simulator, IBM Blue Gene, IBM Roadrunner, Tianhe-I, K. Years on the x-axis: 1969, 1974, 1982, 1985, 1990, 1996, 1997, 2004, 2005, 2008, 2010, 2011.]

Trendline: MFLOPS(y) = 1.72^(y-1969)
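The trendline can be evaluated directly. A minimal sketch, assuming the flattened expression on the slide reads 1.72 raised to the power (y-1969); the function name `mflops_trend` is made up for this example.

```c
/* Trendline from the slide, read as MFLOPS(y) = 1.72^(y - 1969).
   Valid for year >= 1969; earlier years simply return the base value.
   Repeated multiplication avoids a dependency on the math library. */
double mflops_trend(int year)
{
    double mflops = 1.0;                 /* 1 MFLOPS in the base year 1969 */
    for (int y = 1969; y < year; y++)
        mflops *= 1.72;                  /* assumed growth factor per year */
    return mflops;
}
```

For 2018 this gives roughly 3.5e11 MFLOPS, i.e. a few tenths of an EFlops, which is why the last bullet carries a question mark.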

Trendlines

• Supercomputing FLOPS > Moore’s law

• Memory speed increase << Moore’s law

[Figure: Flops (log10, 0 to 18) versus year (1960 to 2020). Series: Moore's law, MFlops trendline (R² = 0.97), memory speed increase (relative).]

[Figure: the same trendline chart, with the position of the "PC today" marked.]

Super desktop with GP-GPU and FPGA

• Host = Supermicro PC

• Accelerators =

– GPGPU Tesla C2050: highly regular parallel apps

– FPGA board Pico EX500 with 2x M501 Virtex 6: configurable, massively parallel apps, low power

“GUDI” Tetra project supported by IWT Flanders Belgium, EhB, VUB and UGent


Combining GPU and FPGA strengths

• Image processing + Bio-informatics

• Face recognition + Security

• Audio processing + HMM speech recognition

• Traffic analysis + Neural network control

Super desktop with GP-GPU and FPGA

• Internal architecture and interconnections

• Hybrid system: CPU, 2 FPGAs, GP-GPU

• Internal bandwidths:

– CPU to memory: 19.2 GB/s

– CPU to accelerators: 25.6 GB/s (QPI)

– CPU to FPGAs: 8 GB/s

– CPU to GP-GPU: 8 GB/s

– GPU, SMP to global memory: 115.0 GB/s

– GPU, SMP to shared memory: 73.5 GB/s

– FPGA, DSP/logic to Block RAM: 386 GB/s

– FPGA, DSP/logic to PCIe switch: 4 GB/s

– FPGA, DSP/logic to DDR3 RAM: 3.2 GB/s

• Heterogeneous architecture:

– 3 computing architectures

– non-uniform memories

Programming tool chain

• Algorithm decomposed into GPU, host and FPGA parts

• FPGA architecture generated with High Level Synthesis tools (C-to-VHDL compilers)

• Bitmap files = hardware procedure calls

• Code executed on the combined platform

• Communication via PCIe

Heterogeneous computing

Data transfer

• GPU: AllDataToDev → calculate → AllResultToHost (*)

• FPGA: StreamToDev → calculate → StreamToHost

(*) unless explicit double buffering

[Figure: CPU and GPU, each with local memory, connected over PCIe; CPU and FPGA connected by a PCIe stream.]
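The difference between the two transfer styles can be illustrated with an idealized timing model. This is a sketch under stated assumptions, not a measurement from the slides: the function names are hypothetical, the three phase times are taken as given, and pipeline fill and drain time are ignored for the streamed case.

```c
/* Bulk style (GPU without double buffering): the three phases run
   one after the other, so their times add up. */
double bulk_time(double t_in, double t_calc, double t_out)
{
    return t_in + t_calc + t_out;
}

/* Streamed style (FPGA): input, computation and output overlap,
   so in this simplified model the slowest phase dominates. */
double stream_time(double t_in, double t_calc, double t_out)
{
    double t = t_in;
    if (t_calc > t) t = t_calc;
    if (t_out > t)  t = t_out;
    return t;
}
```

With phase times of 2, 3 and 1 time units the bulk model gives 6 units and the streamed model 3 units.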

Comparison axes

• Speed: computational power

• Communication: bandwidth/latency

• Programmability: IDE efficiency

[Figure: the three comparison axes: speed, programmability, communication.]

Programming environment

Programming language: C

• GPU: CUDA, OpenCL

– C → PTX (Parallel Thread Execution)

• FPGA: HLS (High Level Synthesis)

– C → VHDL

– History:

• AutoESL (Xilinx) → Vivado HLS

• Catapult C tool from Mentor Graphics

• C-to-HDL tool from Politecnico di Milano (Italy)

• C-to-Verilog tool from www.c-to-verilog.com

• DIME-C from Nallatech

• Handel-C from Celoxica (defunct)

• HercuLeS (C/assembly-to-VHDL) tool

• Impulse C from Impulse Accelerated Technologies

• Nios II C-to-Hardware Acceleration Compiler from Altera

• ROCCC 2.0 (free and open source C to HDL tool) from Jacquard Computing Inc.

• SPARK (a C-to-VHDL tool) from University of California, San Diego

• SystemC from Celoxica (defunct)

FPGA high level synthesis compilers

• ROCCC: Riverside Optimizing Compiler for Configurable Computing

– target:

• platform dependent modules (IP cores) into library

• platform independent systems use modules as functions: replicate, parallelize and pipeline

– optimizations

• low level: arithmetic balancing

• high level: loop unrolling, fusion, wavefront, mul/div elimination, subexpression elimination

• data optimizations: stream with smart buffer

– output

• VHDL design + testbench

• PCore (Xilinx)

FPGA high level synthesis compilers

• AutoESL:

– target:

• Xilinx FPGAs

– optimizations

• code: loop unroll, fusion, pipeline, inline

• data: remap, partition, arrays, reshape, resource, stream

• interface selection: handshake, fifo, bus, register, none, …

– output

• VHDL design

• performance reports: timing, design and loops latency, utilization, area, power, interface

• design viewer with timeline, regs and interfaces, with links back to source code

AutoESL programming example: Tuning design for performance

• Simple example: sum of array (N=1.e8)

    for(i=0; i<N; i++) sum += A[i];

• No optimizations: AutoESL reports 2 * N = 2.e8 cycles

• AutoESL Designer view: 2 cycles/add

[Figure: Designer view schedule, cycles on the horizontal axis.]
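Written out as a complete function, the starting point looks as follows. A minimal sketch: the 32-bit integer element type and the names `sum_array` and `baseline_cycles` are assumptions for illustration; the cycle estimate only restates the 2 cycles per add reported on the slide.

```c
#include <stddef.h>
#include <stdint.h>

/* Baseline loop from the slide: one load and one add per iteration. */
int64_t sum_array(const int32_t *A, size_t n)
{
    int64_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += A[i];
    return sum;
}

/* Cycle count reported for the unoptimized design: 2 cycles per add. */
uint64_t baseline_cycles(uint64_t n)
{
    return 2 * n;
}
```

For N = 1.e8 the estimate is 2.e8 cycles, the figure the later optimizations are compared against.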

Unroll for parallelism

• Unroll 8 times, with arithmetic balancing (4 parallel adds)

• AutoESL directive: [shown as a screenshot on the slide]

• Designer view: only 2 parallel adds?

[Figure: Designer view schedule, cycles on the horizontal axis.]
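What an 8-fold unroll with arithmetic balancing amounts to can be written by hand in C. This is a hand-unrolled equivalent for illustration, not the AutoESL directive from the slide; the 32-bit element type and the function name are assumptions.

```c
#include <stddef.h>
#include <stdint.h>

/* Sum with the loop body unrolled 8 times and the adds arranged
   as a balanced tree instead of a serial chain. */
int64_t sum_array_unroll8(const int32_t *A, size_t n)
{
    int64_t sum = 0;
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        /* Level 1: four independent adds that hardware could issue in parallel */
        int64_t s01 = (int64_t)A[i]     + A[i + 1];
        int64_t s23 = (int64_t)A[i + 2] + A[i + 3];
        int64_t s45 = (int64_t)A[i + 4] + A[i + 5];
        int64_t s67 = (int64_t)A[i + 6] + A[i + 7];
        /* Levels 2 and 3: combine the partial sums pairwise */
        sum += (s01 + s23) + (s45 + s67);
    }
    for (; i < n; i++)   /* remainder when n is not a multiple of 8 */
        sum += A[i];
    return sum;
}
```

The four level-1 adds need eight operands at once, which is where the number of memory ports becomes the limiting factor.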

Increase # memory ports

• Dual-port memory: only 2 loads at a time!

• I/O bottleneck: increase # memory ports

Partition data for parallel access

• Partition A over 4 memories (= 8 ports, 256 bits)

• 8 loads, 4 parallel adds
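The effect of partitioning can be modelled in plain C by splitting A over four separate arrays. A sketch only: the cyclic layout (element i in bank i mod 4), the 32-bit element type and all names are assumptions; in AutoESL the partitioning is requested with a data directive rather than coded by hand.

```c
#include <stddef.h>
#include <stdint.h>

#define BANKS 4   /* A is split over 4 dual-port memories */

/* Each bank models one dual-port memory: two loads per bank per step,
   so 8 loads and 4 independent adds per loop iteration. */
int64_t sum_partitioned(const int32_t *bank[BANKS], size_t words_per_bank)
{
    int64_t sum = 0;
    size_t j = 0;
    for (; j + 2 <= words_per_bank; j += 2) {
        int64_t s0 = (int64_t)bank[0][j] + bank[0][j + 1];
        int64_t s1 = (int64_t)bank[1][j] + bank[1][j + 1];
        int64_t s2 = (int64_t)bank[2][j] + bank[2][j + 1];
        int64_t s3 = (int64_t)bank[3][j] + bank[3][j + 1];
        sum += (s0 + s1) + (s2 + s3);
    }
    for (; j < words_per_bank; j++)        /* odd leftover word per bank */
        for (int b = 0; b < BANKS; b++)
            sum += bank[b][j];
    return sum;
}
```

Because addition is associative for integers, the order in which the banks are read does not change the result.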

Balance unroll and partitioning

• Impact of unrolling and partitioning (N=1.e8)

• Best result: 64 unroll, 32 memory ports, speedup = 16

[Figure: "Unrolling loops and increasing memory ports". Number of cycles (0 to 3.E+08) versus unroll factor (1, 8, 64, 512; log axis from 1 to 1000). Series: 2 ports only; partition = 2, 4 parallel streams (DP); partition = 4, 8 parallel streams (DP); partition = 8, 16 parallel streams (DP); partition = 16, 32 parallel streams (DP). Regions marked "I/O bound" and "Resource bound".]

Programming Productivity

• Compare lines C vs. lines VHDL

• Order of magnitude speed up

• VHDL design is correct

Lines of code:

Code          C    AutoESL bare   AutoESL opt   Ratio AutoESL/C
Sum Array     16   266            6,346         17 - 397
Erosion 3x3   31   195            1,067         6 - 34
Gaxpy         13   374            3,904         29 - 300

Performance evaluation: Roofline Performance Model

• What is it?

• Why is it required?

• How can it compare both architectures?

Roofline model

• Peak Performance (PP) is limited by

– Compute power, CP (GFlops/s)

– I/O bandwidth, BW (GBytes/s)

– Arithmetic intensity, AI (Flops/Byte)

• Hardware limited: PP = CP

• I/O limited: PP = BW*AI

• PP = Min(CP, BW*AI)

[Figure: roofline plot of PP (GOps/s) versus AI (Ops/Byte). The sloped part follows PP = BW*AI and passes through PP = BW at AI = 1; the flat part is the ceiling CP (GOps/s).]
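The roofline formula translates into a one-line function. A minimal sketch; the name `roofline` and the parameter units (Gops/s, GB/s, ops/byte) are choices made for this example.

```c
/* Attainable performance under the roofline model:
   PP = min(CP, BW * AI).
   cp: compute ceiling in Gops/s, bw: bandwidth in GB/s,
   ai: arithmetic intensity in ops/byte. */
double roofline(double cp, double bw, double ai)
{
    double io_limit = bw * ai;            /* sloped part of the roof */
    return io_limit < cp ? io_limit : cp; /* flat part caps it at CP */
}
```

For example, with CP = 100 Gops/s and BW = 10 GB/s, a kernel with AI = 0.5 op/byte is I/O limited at 5 Gops/s, while any kernel with AI above 10 ops/byte reaches the compute ceiling.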

• Roofline model for FPGA. I/O limit?

– BRAM: 386 GB/s → 386 Gops/s @ AI = 1 op/byte

– 32 streams @ 4 GB/s → 128 Gops/s @ AI = 1 op/byte (Pico Computing firmware allows 32 streams)

– 1 stream @ 4 GB/s → 4 Gops/s @ AI = 1 op/byte

• Roofline model for FPGA. Computation limit?

• 32 bit addition on Virtex 6: resource consumption

Resource   AVAILABLE   ADD_DSP   ADD_Logic
LUT        98125       0         32
FF         201715      0         32
DSP        768         1         0
TOTAL adders available:   768       3066

Total: 3834 adders @ 250 MHz → 958.5 Gops/s
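The adder count and the resulting compute ceiling follow from the table by integer division. A sketch using the table's figures; the identifiers are made up for this example, and it assumes, as the slide's total does, that DSP-based and logic-based adders can be instantiated side by side.

```c
/* Resource figures from the Virtex 6 table on the slide. */
enum {
    LUT_AVAILABLE = 98125, FF_AVAILABLE = 201715, DSP_AVAILABLE = 768,
    LUT_PER_LOGIC_ADD = 32, FF_PER_LOGIC_ADD = 32, DSP_PER_DSP_ADD = 1
};

/* Adders built from DSP slices: one slice each. */
int dsp_adders(void)
{
    return DSP_AVAILABLE / DSP_PER_DSP_ADD;
}

/* Adders built from logic: limited by the scarcer of LUTs and FFs. */
int logic_adders(void)
{
    int by_lut = LUT_AVAILABLE / LUT_PER_LOGIC_ADD;
    int by_ff  = FF_AVAILABLE / FF_PER_LOGIC_ADD;
    return by_lut < by_ff ? by_lut : by_ff;
}

/* One add per adder per clock cycle; MHz / 1000 gives Gops/s. */
double compute_ceiling_gops(double clock_mhz)
{
    return (dsp_adders() + logic_adders()) * clock_mhz / 1000.0;
}
```

The LUTs are the limiting logic resource here, which gives the 3066 logic adders in the table.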

• Roofline model for FPGA

• Roofline model for GPU

• Roofline model for GPU and FPGA combined

Experimental Results: FPN

FPN (Fixed Pattern Noise Correction) algorithm

Output pixel = f(input pixel, gain, offset, origin)

Requires 4 input bytes to generate 1 output byte

Computational intensity = 1/4 (output overlaps)

Pico stream = 16 bytes @ 250 MHz = 4 GB/s

One full-duplex stream fits 4 FPNs

Max number of FPNs? Logic resources

FPGA logic resources allow 96 full-duplex streams

Peak performance = 96 * 4 Ops / 4 ns = 96 Gops/s

Max number of streams? Available bandwidth

AI = 1/4 (pipelined output overlaps with input)

I/O limited performance = BW*AI

32 Pico streams = 32 * 4 GB/s / 4 = 32 Gops/s

1 PCIe stream = 4 GB/s / 4 = 1 Gops/s
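The FPN numbers can be reproduced step by step. A sketch that only restates the slide arithmetic; all identifiers are hypothetical.

```c
enum { STREAM_BYTES_PER_CYCLE = 16, CLOCK_MHZ = 250, FPN_INPUT_BYTES = 4 };

/* One Pico stream: 16 bytes x 250 MHz = 4 GB/s. */
double stream_bw_gbs(void)
{
    return STREAM_BYTES_PER_CYCLE * CLOCK_MHZ / 1000.0;
}

/* FPN units fed by one stream: 16 bytes / 4 input bytes = 4. */
int fpn_per_stream(void)
{
    return STREAM_BYTES_PER_CYCLE / FPN_INPUT_BYTES;
}

/* Resource-limited peak: every FPN unit delivers one result per cycle. */
double fpn_peak_gops(int streams)
{
    return streams * fpn_per_stream() * CLOCK_MHZ / 1000.0;
}

/* I/O-limited performance: BW * AI with AI = 1/4 op/byte. */
double fpn_io_limit_gops(int streams)
{
    return streams * stream_bw_gbs() / FPN_INPUT_BYTES;
}

/* Roofline: the lower of the two limits. */
double fpn_attainable_gops(int logic_streams, int io_streams)
{
    double peak = fpn_peak_gops(logic_streams);
    double io   = fpn_io_limit_gops(io_streams);
    return io < peak ? io : peak;
}
```

With logic for 96 streams but only 32 Pico streams available, the attainable performance is 32 Gops/s: the design is I/O bound.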

[Figure: max performance on the Pico board (32 PicoStreams).]

[Figure: max performance on the combined platform.]

Image Erosion 3x3

• Example: 3x3 erosion, pixel(i,j) = Min(neighbor pixels) = 1 “operation”

• Handwritten VHDL: 9 cycles for 1 computational block (CB)

• Peak? Virtex 6 FPGA accommodates 1536 CBs @ 250 MHz clock rate → PP = 42.6 Gops/s
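A software reference for the operation helps when checking the hardware versions. A minimal sketch for 8-bit grayscale images in row-major order; copying the border pixels unchanged is an assumption, since the slides do not say how borders are handled.

```c
#include <stddef.h>
#include <stdint.h>

/* 3x3 erosion: each interior output pixel is the minimum of the
   9 pixels in its 3x3 neighbourhood (centre included).
   Border pixels are copied unchanged (assumption). */
void erosion3x3(const uint8_t *in, uint8_t *out, size_t h, size_t w)
{
    for (size_t i = 0; i < h; i++) {
        for (size_t j = 0; j < w; j++) {
            if (i == 0 || j == 0 || i + 1 == h || j + 1 == w) {
                out[i * w + j] = in[i * w + j];
                continue;
            }
            uint8_t m = 255;
            for (size_t di = 0; di < 3; di++) {
                for (size_t dj = 0; dj < 3; dj++) {
                    uint8_t p = in[(i - 1 + di) * w + (j - 1 + dj)];
                    if (p < m)
                        m = p;
                }
            }
            out[i * w + j] = m;   /* one "operation" in the slide's terms */
        }
    }
}
```

Each interior output reads 9 input bytes, which is the count used for the arithmetic intensity of this kernel.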

Erosion3x3 on FPGA

Erosion3x3 operation requires 9 input bytes to generate 1 output byte

Computational intensity = 1/10 (counting 9 input bytes plus 1 output byte)

Handwritten VHDL code:

– 1 input byte per clock cycle

– 1 output byte every 9 clock cycles

Performance = 250 MHz / 9 = 27.77 MPixelOperations/s
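The throughput figures follow from the clock rate and the cycles per result. A sketch with hypothetical function names, using the 250 MHz clock, the 9 cycles per result and the 1536 computational blocks quoted on the slides.

```c
/* Millions of results per second for one computational block that
   needs cycles_per_result clock cycles per output. */
double cb_mops(double clock_mhz, int cycles_per_result)
{
    return clock_mhz / cycles_per_result;
}

/* Aggregate performance in Gops/s for n_cb blocks working in parallel. */
double peak_gops(int n_cb, double clock_mhz, int cycles_per_result)
{
    return n_cb * cb_mops(clock_mhz, cycles_per_result) / 1000.0;
}
```

One block at 250 MHz and 9 cycles per result gives about 27.8 million pixel operations per second; 1536 such blocks give about 42.7 Gops/s, which the slides quote truncated as 27.77 and 42.6.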

Handwritten VHDL code: one full-duplex stream fits 16 parallel erosion operations = 1 erosion block

Experimental Results: Erosion3x3

Max number of erosion blocks? Logic resources

FPGA logic resources allow 96 full-duplex streams

Peak performance = 96 * 16 Ops / 36 ns = 42.66 Gops/s

RESOURCE ESTIMATIONS

Logic utilization                   128x[16x Erosion[128b]]             96x[16x Erosion[128b]]
                                    Used     Available   Utilization    Used     Available   Utilization
Number of Slice Registers           214874   301440      71%            174220   301440      58%
Number of Slice LUTs                109095   150720      72%            76423    150720      51%
Number of fully used LUT-FF pairs   49994    213650      23%            33806    248902      14%
Number of bonded IOBs               81       600         14%            81       600         14%
Number of Block RAM/FIFO            542      416         130%           414      416         100%
Number of BUFG/BUFGCTRLs            7        32          22%            7        32          22%
Number of DSP48E1s                  0        768         0%             0        768         0%

I/O limited performance? Available bandwidth

AI = 1 result per 9 bytes = 1/9

BRAM BW = 386 GB/s → limit = 42.88 Gops/s

Pico streams BW = 32 GB/s → limit = 3.55 Gops/s

PCIe stream BW = 4 GB/s → limit = 0.44 Gops/s

Hardware peak = 42.66 Gops/s

I/O streams limit performance
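Putting the hardware peak next to the three bandwidth limits shows which one binds. A sketch that restates the slide's figures; the struct and function names are made up for this example.

```c
/* One level of the memory hierarchy with its bandwidth in GB/s. */
typedef struct {
    const char *name;
    double bw_gbs;
} io_level;

/* Bandwidth figures as given on the slide. */
static const io_level LEVELS[] = {
    { "BRAM",         386.0 },
    { "Pico streams",  32.0 },
    { "PCIe stream",    4.0 },
};

/* Hardware peak: 96 blocks x 16 operations every 36 ns. */
double erosion_hw_peak(void)
{
    return 96 * 16 / 36.0;
}

/* I/O limit = BW * AI with AI = 1/9 op/byte. */
double erosion_io_limit(const io_level *lvl)
{
    return lvl->bw_gbs / 9.0;
}

/* Roofline: the lower of hardware peak and I/O limit. */
double erosion_attainable(const io_level *lvl)
{
    double io = erosion_io_limit(lvl);
    double hw = erosion_hw_peak();
    return io < hw ? io : hw;
}
```

Only the BRAM bandwidth is high enough to reach the hardware peak; with data arriving over the Pico streams or a single PCIe stream the kernel is I/O bound by an order of magnitude or more.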


HandWritten VHDL code: Measurements

ROCCC

Smart buffers reuse data: only 1 fetch and store

Impact of the smart buffers on the computational intensity:

[Equation shown as an image on the slide]

Improvement of about a factor of (k+1) for larger images

H = height of the image

W = width of the image

k² = dimension of the kernel or mask

Manual partial loop unrolling increases data reuse with smart buffers

Loop unrolling increases computational intensity

[Figure: computational intensity (0.00 to 0.45) versus image size (32x32, 64x64, 128x128, 256x256, 512x512, 1024x1024) for 1x, 2x and 4x pixels in parallel. Original CI = 0.11; labelled gains: CI x2.25, CI x2.97, CI x3.60.]
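One simple model comes close to the gains labelled in the chart. It is an inference from the numbers, not the formula from the slides: it assumes a sliding 3-row window in which computing p pixels in parallel fetches p + 2 new bytes and stores p results per step, and it ignores image borders. The function names are hypothetical.

```c
/* Assumed model: per step, p + 2 bytes are fetched and p bytes are
   stored for p operations, so CI = p / (2p + 2) ops/byte. */
double ci_smart_buffer(int p)
{
    return (double)p / (2 * p + 2);
}

/* Gain relative to the original CI of 1/9 (0.11 on the slide). */
double ci_gain(int p)
{
    return ci_smart_buffer(p) * 9.0;
}
```

Under these assumptions the gain is 2.25 for one pixel, about 3.0 for two pixels and about 3.6 for four pixels in parallel, against 2.25, 2.97 and 3.60 in the chart.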


ROCCC: Measurements

AutoESL

First implementation:

– Very similar to the handwritten VHDL code

– Same computational intensity

Partial Loop Unrolling x4:

[Figures: build sequence adding Erosion 1, Erosion 2, Erosion 3 and Erosion 4.]

Unrolled loops are pipelined and data is reused → CI increases (fewer bytes fetched per operation)

AutoESL: Measurements

Handwritten VHDL code vs ROCCC vs AutoESL

Internal Performance (32 PicoStreams)

[Figure: "Performance based on the maximum streams". Implementations, grouped as Handwritten, ROCCC and AutoESL: handwritten VHDL (stream version, BRAM version); ROCCC 4x parallel (stream and BRAM versions, default + inline module); AutoESL stream (pipeline; pipeline with PLU x2, x4 and x16). Left axis: GPixelOperations/s (0 to 120); right axis: full-duplex streams (0 to 120). Series: maximum number of streams, max resource limit performance. Annotations: "Highest performance", "96 handwritten CBs".]

[Figure: "Performance based on the 32 available streams". Same implementations and axes. Series: maximum number of streams, max resource limit performance, max bandwidth limit performance.]

[Figure: the same chart with the "Highest performance" implementation marked.]

Conclusion

ROCCC presents the best performance per stream but is resource hungry.

AutoESL offers the best trade-off between performance and resource consumption.

I/O stress

– # I/O streams limited to 1

– DDR3 memory too slow

– PCIe limited to 8 lanes

FPGA needs more HPC tweaking

HLS tools (AutoESL) are productive