Evaluation of FPGAs resurgence for hardware acceleration applied to computed tomography

3D Tomography back-projection parallelization on FPGAs using OpenCL

Presented by: Maxime MARTELLI, 1st year PhD Student

L2S, SATIE, TSA

2017 GPU Winter School, Grenoble, FR



CONTEXT

The end of Moore’s law has been announced for 2021

Architecture-Algorithm Adequacy
- Granular hardware specialization
- Processors will offload specific processing to a suited architecture

Software-oriented FPGA design tools are multiplying


HYPOTHESIS

The idea

Does the progress of HLS tools mean a resurgence of FPGAs for computed tomography?

With the rise of Accelerator-as-a-Service (AaaS), what is the future landscape for FPGAs?

Summary

I. What is OpenCL?
II. Why use HLS on FPGAs?
III. Use case highlight
IV. OpenCL Memory model
V. Custom implementations
VI. Conclusion and perspectives

I. WHAT IS OPENCL?


OpenCL basics

• Open, royalty-free standard for parallel, compute-intensive application development
• Initiated by Apple, specification maintained by the Khronos Group
• Supports multiple device classes: CPUs, GPUs, DSPs, Cell, etc.
• First released in December 2008
• Specification currently at version 2.0
• SDKs and tools are provided by compliant device vendors

CUDA basics

• Proprietary technology for GPGPU programming from Nvidia
• Not just an API and tools, but the name of the whole architecture
• Targets Nvidia hardware and GPUs only
• First SDK released in February 2007
• SDK and tools available for 32- and 64-bit Windows, Linux and Mac OS
• Tools and SDK are available for free from Nvidia

Basics compared

| | CUDA | OpenCL |
|---|---|---|
| What it is | HW architecture, programming language, API, SDK and tools | Open API and language specification |
| Proprietary or open technology | Proprietary | Open and royalty-free |
| When introduced | Q4 2006 | Q4 2008 |
| SDK vendor | Nvidia | Implementation vendors |
| Free SDK | Yes | Depends on vendor |
| Heterogeneous device support | No, just NVIDIA GPUs | Yes (Apple, Nvidia, AMD, IBM, Intel, …) |

OpenCL Memory Architecture


CUDA Memory Architecture


OpenCL Execution model


II. WHY USE HLS ON FPGAS?

Field Programmable Gate Array (FPGA)

Programmable Switch Fabric

Source: Intel

CPU instruction mapping

Source: Intel

CPU execution path (1)

Source: Intel

CPU execution path (2)

Source: Intel

CPU vs FPGA execution

Source: Intel

Advantages of FPGA HLS

• Custom data-path that matches your algorithms
• Uses exactly what you need (operation, data width, memory configuration, …)
• Timing closure and reduced power consumption
• Much easier programming than VHDL

III. USE CASE HIGHLIGHT

Brief history

In 2004, FPGAs were widely used in tomography

For the past 10 years, GPUs have dominated the field

With the evolution of HLS tools, a renewed interest in FPGAs is emerging

3D Computed Tomography Projection


Back-projection algorithm

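Memory-bound algorithm: ratio to memory access = 3.5 (the numerator of this ratio is not legible in the source)

Density calculation: $d(c) = \int \mathrm{sino}\big(u(\varphi, c),\, v(\varphi, c),\, \varphi\big)\, w(\varphi, c)\, d\varphi$ (integral over the projection angle φ; the integration bounds and any exponent on w are not legible in the source)

Input: α[dimϕ], β[dimϕ], sinogram[dimU*dimV*dimϕ]
Output: volume[dimX, dimY, dimZ]

For z = 0 to dimZ - 1
    For y = 0 to dimY - 1
        For x = 0 to dimX - 1
            voxelsum = 0
            For ϕ = 0 to dimϕ - 1
                Calculate (U, V) from α[ϕ] and β[ϕ]
                voxelsum += sinogram[U, V, ϕ]
            volume[x, y, z] = voxelsum

Massively parallel: 256³ voxels

256 angle variations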
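The loop nest of the back-projection algorithm can be sketched in plain Python. This is only an illustration under stated assumptions: the slide does not give the formula that derives (U, V) from α[ϕ] and β[ϕ], so the sketch assumes a simple parallel-beam geometry in which α and β hold the cosine and sine of each angle, V equals z, and the flat sinogram is stored angle-major. The name `back_project` and its signature are hypothetical.

```python
import math


def back_project(sinogram, alpha, beta, dim_x, dim_y, dim_z, dim_u, dim_v):
    """Naive voxel-driven back-projection with nearest-neighbour sampling."""
    dim_phi = len(alpha)
    # Centre of the volume slice and of the detector row.
    cx, cy = (dim_x - 1) / 2.0, (dim_y - 1) / 2.0
    cu = (dim_u - 1) / 2.0
    volume = [[[0.0] * dim_z for _ in range(dim_y)] for _ in range(dim_x)]
    for z in range(dim_z):
        for y in range(dim_y):
            for x in range(dim_x):
                voxel_sum = 0.0
                for phi in range(dim_phi):
                    # ASSUMPTION: parallel-beam geometry, alpha = cos, beta = sin.
                    u = int(round((x - cx) * alpha[phi] + (y - cy) * beta[phi] + cu))
                    v = z
                    if 0 <= u < dim_u and 0 <= v < dim_v:
                        # ASSUMPTION: flat layout sinogram[phi][v][u].
                        voxel_sum += sinogram[(phi * dim_v + v) * dim_u + u]
                volume[x][y][z] = voxel_sum
    return volume


# Tiny usage example: a constant sinogram and four axis-aligned angles.
angles = [0.0, math.pi / 2, math.pi, 3 * math.pi / 2]
alpha = [math.cos(a) for a in angles]
beta = [math.sin(a) for a in angles]
sinogram = [1.0] * (len(angles) * 2 * 8)
volume = back_project(sinogram, alpha, beta, 4, 4, 2, 8, 2)
```

Every voxel is independent of the others, which is why the slide calls the problem massively parallel; each inner iteration also performs one sinogram read, which is why memory access dominates.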


Back-projection results on FPGA


Main contributions

01 Benchmark the different memory structures
02 Implement algorithm-focused optimizations
03 Assess OpenCL code optimization for FPGA

IV. OPENCL MEMORY MODEL

Memory structure latency on an Altera Cyclone V

Mean latency (cycles), custom benchmark (random reads):

| Memory | Global | Constant | Local | Private |
|---|---|---|---|---|
| Mean latency (cycles) | 240 | 10 | 15 | 3 |

Tricky situations for calculation (LSU embedded cache)
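The slide does not describe how the random-read benchmark was built. One common way to measure read latency rather than bandwidth is pointer chasing, where each read address depends on the previous read so that accesses cannot overlap. The sketch below only illustrates that access pattern in Python; it is an assumption, not the author's benchmark, and the names `single_cycle_permutation` and `chase` are hypothetical.

```python
import random


def single_cycle_permutation(n, rng):
    """Sattolo's algorithm: a random permutation made of one cycle of length n."""
    perm = list(range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i)  # 0 <= j < i, which guarantees a single cycle
        perm[i], perm[j] = perm[j], perm[i]
    return perm


def chase(perm, steps, start=0):
    """Follow the chain: every read address comes from the previous read."""
    idx = start
    visited = []
    for _ in range(steps):
        idx = perm[idx]
        visited.append(idx)
    return visited


rng = random.Random(0)
table = single_cycle_permutation(16, rng)
trace = chase(table, len(table))
```

On real hardware the chain would live in the memory under test, and the elapsed cycles divided by the number of steps would give the mean latency. A cache inside the load-store unit can shorten some reads, which is one reason such measurements are tricky.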


V. CUSTOM IMPLEMENTATIONS

OpenCL work-group enqueueing mechanism

Data parallelism : ND Range

Task parallelism : Single Work Item (SWI)
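The difference between the two enqueueing styles can be emulated in Python. This is a sketch only: `launch_ndrange`, `scale_nd` and `scale_swi` are hypothetical names, and a real OpenCL runtime would schedule work-items in hardware rather than in a Python loop.

```python
def scale_nd(gid, src, dst, factor):
    """NDRange style: the kernel body handles one element, chosen by its global id."""
    dst[gid] = src[gid] * factor


def launch_ndrange(kernel, global_size, *args):
    """Stand-in for the runtime: one kernel invocation per work-item."""
    for gid in range(global_size):
        kernel(gid, *args)


def scale_swi(src, dst, factor):
    """Single Work Item style: one invocation that contains the whole loop."""
    for i in range(len(src)):
        dst[i] = src[i] * factor


src = [1, 2, 3]
out_nd = [0] * len(src)
out_swi = [0] * len(src)
launch_ndrange(scale_nd, len(src), src, out_nd, 2)
scale_swi(src, out_swi, 2)
```

Both forms compute the same result. In the NDRange form the parallelism comes from many work-items, while in the SWI form the compiler is expected to pipeline the loop iterations.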


Experiment setup: DE1-SoC

- Dual-core ARM Cortex-A9 processor and FPGA fabric within an Altera Cyclone V
- 1 GB of DDR3 memory
- Max FPGA frequency: 205 MHz
- Intel FPGA SDK for OpenCL 16.0

Implementation 1 : Shift-Register Pattern (TP)

Input: α[dimϕ], β[dimϕ], sinogram[dimU*dimV*dimϕ]
Output: volume[dimX, dimY, dimZ]

For ϕ = 0 to dimϕ - 1
    SRP[ϕ] = (α[ϕ], β[ϕ])                    ◁ SRP initialization
For z = 0 to dimZ - 1
    For y = 0 to dimY - 1
        For x = 0 to dimX - 1
            voxelsum = 0
            #pragma unroll                    ◁ Task parallelism
            For ϕ = 0 to dimϕ - 1
                head = SRP[0]                 ◁ SRP implementation: save the head,
                For i = 0 to dimϕ - 2            shift every entry by one,
                    SRP[i] = SRP[i+1]            then re-insert the head at the tail
                SRP[dimϕ - 1] = head
                Calculate (U, V) from head = (α[ϕ], β[ϕ])
                voxelsum += sinogram[U, V, ϕ]
            volume[x, y, z] = voxelsum
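The rotation at the heart of the shift-register pattern can be checked with a few lines of Python. This is a minimal sketch with made-up α and β values; `rotate_left` is a hypothetical helper. It saves the head before shifting, because overwriting the tail first would lose one entry when the statements run in sequence.

```python
def rotate_left(srp):
    """One shift-register step: each slot takes the value of its right neighbour."""
    head = srp[0]                    # save the head before it is overwritten
    for i in range(len(srp) - 1):    # this loop would be fully unrolled in hardware
        srp[i] = srp[i + 1]
    srp[-1] = head                   # feed the old head back in at the tail
    return head


# Illustrative values only.
alpha = [0.0, 0.1, 0.2, 0.3]
beta = [1.0, 0.9, 0.8, 0.7]
initial = list(zip(alpha, beta))
srp = list(initial)

# One full pass over the angles: step number phi yields (alpha[phi], beta[phi]).
heads = [rotate_left(srp) for _ in range(len(alpha))]
```

After a full pass the register is back in its initial state, so the next voxel starts from (α[0], β[0]) again without reloading the angle tables.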


Implementation 2: Memory pre-fetching (DP)

Input: const α[dimϕ], const β[dimϕ], sinogram[dimU*dimV*dimϕ]
Output: volume[dimX, dimY, dimZ]
Local int local_sinogram[Xoff * Yoff]

/* Recovery of work-item characteristics (x, y, z) */
voxelsum = 0
For ϕ = 0 to dimϕ - 1
    /* Calculate Un, Vn coordinates */
    /* Dispatch min, max coordinates computation between local work-items */
    barrier(CLK_LOCAL_MEM_FENCE)
    /* Global sinogram fetching by local work-items */
    barrier(CLK_LOCAL_MEM_FENCE)
    voxelsum += local_sinogram[localU, localV]
volume[x, y, z] = voxelsum
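The pre-fetching idea can be emulated for one work-group and one angle in Python. This is a simplified one-dimensional sketch with hypothetical names (`gather_direct`, `gather_prefetched`) and made-up data: the group first agrees on the window of detector coordinates it needs, copies that window once from the global sinogram into a local buffer, and then every work-item reads from the local copy.

```python
def gather_direct(sino_row, u_coords):
    """Baseline: every work-item reads its sample straight from global memory."""
    return [sino_row[u] for u in u_coords]


def gather_prefetched(sino_row, u_coords):
    """Pre-fetching: copy the shared window once, then read locally."""
    # Step 1: min/max coordinates over the work-group (first barrier follows).
    u_min, u_max = min(u_coords), max(u_coords)
    # Step 2: cooperative copy global -> local (second barrier follows).
    local_sino = sino_row[u_min:u_max + 1]
    # Step 3: each work-item reads its sample at a window-relative index.
    return [local_sino[u - u_min] for u in u_coords], len(local_sino)


# Illustrative data: 16 detector samples, a group of 8 neighbouring voxels.
sino_row = [float(10 * k) for k in range(16)]
u_coords = [5, 5, 6, 6, 7, 7, 8, 8]
direct = gather_direct(sino_row, u_coords)
prefetched, window_size = gather_prefetched(sino_row, u_coords)
```

Neighbouring voxels project onto nearby detector cells, so the window is small compared with the number of reads: here 4 global reads serve 8 work-items.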


Kernel implementations on Cyclone V SoC

| Kernel | Raw execution time (s) | Logic utilization (%) |
|---|---|---|
| SWI+Naive | 222.9 | 49 |
| SWI+SRP | 67.5 | 36 |
| ND+Naive | 32.26 | 55 |
| ND+2CU | 16.9 | 96 |
| ND+MF | 31.3 | 40 |
| ND+Backbone | 30.8 | 21 |

Key Points

- ND+2CU: linear extrapolation model verification
- ND+Backbone: irreducible logic utilization
- ND+MF uses less logic than naive NDRange
- SWI+SRP uses less logic and is faster than naive SWI

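Kernel implementations on Cyclone V SoC

| Kernel | Kernel frequency (MHz) | Normalized execution time (s) |
|---|---|---|
| SWI+Naive | 68 | 109.2 |
| SWI+SRP | 112 | 24.3 |
| ND+Naive | 140 | 17.7 |
| ND+2CU | 140 | 16.2 |
| ND+MF | 140 | 12.5 |
| ND+Backbone | 140 | 6.47 |

Speedup from SWI+Naive to ND+MF: 8.74

ND+MF matches VHDL FPGA implementations, with a computation rate of 137 M“voxel”/s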
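The slides do not state how the execution time is normalized. The published figures are, however, consistent with multiplying the raw execution time by the logic utilization, and the sketch below checks that inference together with the speedup and the computation rate. The rate assumes that one “voxel” computation means one voxel update for one angle, i.e. 256³ × 256 updates in total; both assumptions are mine, not the author's.

```python
# Raw execution times (s) and logic utilization (%) from the previous slide.
raw_time_s = {
    "SWI+Naive": 222.9, "SWI+SRP": 67.5, "ND+Naive": 32.26,
    "ND+2CU": 16.9, "ND+MF": 31.3, "ND+Backbone": 30.8,
}
logic_util_pct = {
    "SWI+Naive": 49, "SWI+SRP": 36, "ND+Naive": 55,
    "ND+2CU": 96, "ND+MF": 40, "ND+Backbone": 21,
}

# ASSUMPTION (inferred from the numbers): normalized = raw time x logic fraction.
normalized_s = {k: raw_time_s[k] * logic_util_pct[k] / 100.0 for k in raw_time_s}

# Speedup between the slowest and the pre-fetching kernel, on normalized times.
speedup = normalized_s["SWI+Naive"] / normalized_s["ND+MF"]

# ASSUMPTION: one "voxel" computation = one voxel update for one angle.
voxel_updates = 256 ** 3 * 256
rate_mvoxel_per_s = voxel_updates / raw_time_s["ND+MF"] / 1e6
```

With unrounded intermediate values the speedup comes out slightly below the 8.74 on the slide, which is obtained from the rounded times (109.2 / 12.5). The rate of about 137 M updates per second is close to one update per cycle at 140 MHz.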


GPU vs FPGA with OpenCL

| Device | Power (W) | Execution time (ms) | Energy (mWh) |
|---|---|---|---|
| Titan X Pascal (GPU) | 250 | 12 | 0.83 |
| Jetson TK2 (GPU) | 15 | 94 | 0.39 |
| Arria 10 FPGA | 2.27 | 991 | 0.63 |

- Low FPGA consumption
- Algorithm inadequacy implies longer FPGA execution time
- An embedded GPU is more energy efficient
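The energy figures follow from power multiplied by execution time. The short check below recomputes them from the power and time values on the slide; `energy_mwh` is a hypothetical helper, and the only conversion used is 1 mWh = 3.6 J.

```python
# (power in W, execution time in ms) as read from the charts.
devices = {
    "Titan X Pascal (GPU)": (250.0, 12.0),
    "Jetson TK2 (GPU)": (15.0, 94.0),
    "Arria 10 FPGA": (2.27, 991.0),
}


def energy_mwh(power_w, time_ms):
    """Energy = power x time, converted from joules to milliwatt-hours."""
    joules = power_w * time_ms / 1000.0
    return joules / 3.6  # 1 mWh = 3.6 J


energy = {name: energy_mwh(p, t) for name, (p, t) in devices.items()}
most_efficient = min(energy, key=energy.get)
```

The FPGA draws by far the least power, but its run is roughly ten times longer than the embedded GPU's, so the embedded GPU still ends up with the lowest energy per reconstruction.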


VI. CONCLUSION AND PERSPECTIVES

CONCLUSION

FPGA (2009) = FPGA OpenCL
Intel SDK guarantees one “voxel” computation per clock

Efficient tool for software developers
Achieved speedup of 8.74 with little hardware knowledge

FPGA < Embedded GPU
FPGAs still fall short compared to embedded GPUs (performance and power) for this family of CT algorithms

Room for improvement
By reducing kernel footprint or increasing kernel frequency

PERSPECTIVES

Adapted Use-Case
Many algorithms, like radar clutter computation, are well adapted to the strengths of FPGAs

Computed Tomography with FPGA?
Old algorithms not fit for GPUs can re-emerge

Next?
- Bigger card
- Xilinx SDx evaluation
- New adapted algorithm

THANK YOU

Any questions or comments are welcome!

FPGA key numbers

Average annual growth: 6.9 %

2015 Global Market: 6.36 billion

Market shares (chart): Intel, Xilinx, Others

In 2016, FPGAs outgrew the overall semiconductor market (6.9 % vs 1.5 %, respectively)

The market is expected to reach $10 billion by 2024

Xilinx remains the leading FPGA manufacturer