iMinds The Conference: Jan Lemeire


GPU acceleration of image processing
Jan Lemeire

GPU vs CPU Peak Performance Trends

GPU peak performance has grown aggressively; the hardware has kept up with Moore's law.

1995: 5,000 triangles/second, 800,000-transistor GPU
2010: 350 million triangles/second, 3-billion-transistor GPU
Source: NVIDIA

To the rescue: Graphical Processing Units (GPUs)

94 fps (AMD Tahiti Pro)

GPU: 1-3 TeraFlop/second instead of 10-20 GigaFlop/second for CPU

[Figure 1.1: the enlarging performance gap between multi-core CPUs and many-core GPUs. Courtesy: John Owens]

GPUs are an alternative to CPUs in offering processing power

Example image processing pipeline: pixel rescaling, lens correction, pattern detection
The CPU gives only 4 fps; next-generation machines need 50 fps

Result: CPU 4 fps, GPU 70 fps

Methodology


Application

Identification of compute-intensive parts

Feasibility study of GPU acceleration

GPU implementation

GPU optimization

Hardware

Obstacle 1: Hard(er) to implement

GPU Programming Concepts (OpenCL terminology)

[Diagram: semi-abstract hardware model]
Host/CPU with its own processor and RAM, connected to the device over a 4-8 GB/s bus
Device/GPU (± 1 TFLOPS) with Global Memory (1 GB, ~100 GB/s, ~200 cycles latency), Constant Memory (64 KB) and Texture Memory (located in global memory)
Several multiprocessors, each with Local Memory (16/48 KB, ~40 GB/s, a few cycles latency) and scalar processors (± 1 GHz), each with private memory (16K/8)

[Diagram: execution model]
A kernel is executed over a grid (1D, 2D or 3D) of work groups; each work group consists of work items
The work group is identified by (get_group_id(0), get_group_id(1)), the work item within it by (get_local_id(0), get_local_id(1)); the work group size Sx x Sy is returned by get_local_size(0) and get_local_size(1)
Max #work items per work group: 1024; work items are executed in warps/wavefronts of 32/64 work items; max work groups simultaneously on a multiprocessor: 8; max active warps on a multiprocessor: 24/48
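To make the terminology concrete, here is a minimal OpenCL C kernel sketch (illustrative only; the kernel name, the gain operation and the image layout are assumptions, not code from the talk). Each work item computes one pixel, and its position follows from the group and local indices described above.

```c
// Illustrative OpenCL C kernel: each work item scales one pixel of a
// grayscale image by a constant gain.
__kernel void scale_pixels(__global const uchar *in,
                           __global uchar *out,
                           const int width,
                           const int height,
                           const float gain)
{
    // Global index = get_group_id(d) * get_local_size(d) + get_local_id(d)
    int x = get_global_id(0);
    int y = get_global_id(1);

    // Guard against work items that fall outside the image when the
    // global size is rounded up to a multiple of the work-group size.
    if (x >= width || y >= height)
        return;

    float v = gain * in[y * width + x];
    out[y * width + x] = (uchar)clamp(v, 0.0f, 255.0f);
}
```

Such a kernel is launched over a 2D grid; a work-group size of, say, 16x16 = 256 work items stays below the 1024 limit and maps onto whole warps/wavefronts of 32/64.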

Semi-abstract, scalable hardware model

You need to know this model to write effective and efficient code: on a CPU the processor itself ensures efficient execution, whereas for a GPU you need to know more hardware details than for a CPU.
The code nevertheless remains compatible and efficient across devices.
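Because these limits differ per device, they are typically queried at runtime through the standard OpenCL host API. A minimal sketch, assuming the first GPU of the first platform and omitting error handling:

```c
#include <stdio.h>
#include <CL/cl.h>

int main(void)
{
    // Take the first platform and its first GPU device (illustrative;
    // real code would iterate over platforms/devices and check errors).
    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

    cl_uint compute_units;          // number of multiprocessors
    size_t max_wg_size;             // max work items per work group
    cl_ulong local_mem, global_mem; // local and global memory sizes

    clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
                    sizeof(compute_units), &compute_units, NULL);
    clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                    sizeof(max_wg_size), &max_wg_size, NULL);
    clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE,
                    sizeof(local_mem), &local_mem, NULL);
    clGetDeviceInfo(device, CL_DEVICE_GLOBAL_MEM_SIZE,
                    sizeof(global_mem), &global_mem, NULL);

    printf("compute units: %u, max work-group size: %zu\n",
           compute_units, max_wg_size);
    printf("local memory: %llu KB, global memory: %llu MB\n",
           (unsigned long long)(local_mem / 1024),
           (unsigned long long)(global_mem / (1024 * 1024)));
    return 0;
}
```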

Increased code complexity

1. Complex index calculations: data elements must be mapped onto processing elements (at least 2 levels: work groups and work items); sometimes it is better to group several elements per work item (see the sketch below)
2. Optimizations: their impact on performance needs to be tested
3. A lot of parameters:
   a. Algorithm, implementation
   b. Configuration of the mapping
   c. Hardware parameters (limits)
   d. Optimized versions
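A sketch of point 1, grouping several data elements per work item (the pixel-inversion operation and the factor of 4 rows per work item are illustrative assumptions, not code from the talk): the index calculation now spans two levels plus a per-item loop.

```c
// Illustrative OpenCL C kernel: each work item handles PIXELS_PER_ITEM
// consecutive rows of one column, so fewer (but heavier) work items are launched.
#define PIXELS_PER_ITEM 4

__kernel void invert_grouped(__global const uchar *in,
                             __global uchar *out,
                             const int width,
                             const int height)
{
    int x  = get_global_id(0);                     // column
    int y0 = get_global_id(1) * PIXELS_PER_ITEM;   // first row of this work item

    if (x >= width)
        return;

    for (int i = 0; i < PIXELS_PER_ITEM; i++) {
        int y = y0 + i;
        if (y < height)                            // guard the image border
            out[y * width + x] = 255 - in[y * width + x];
    }
}
```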

Methodology

Application
Identification of compute-intensive parts
Feasibility study of GPU acceleration
GPU implementation: skeleton-based, OpenCL, pragma-based, or parallelization by compiler
GPU optimization
Hardware

Obstacle 2: Hard(er) to get efficiency

We expect peak performance: a speedup of 100x should be possible.
At least we expect some speedup, but what is a 5x speedup worth? If the GPU's peak is roughly 100x the CPU's, a 5x speedup exploits only about 5% of that potential.
What are the reasons for low efficiency?


Roofline model
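The roofline model bounds the attainable performance of a kernel by the minimum of the device's peak compute rate and the product of its arithmetic intensity (flop per byte moved to or from global memory) and the peak memory bandwidth. A small sketch using the illustrative device figures from the hardware model above (~1 TFLOPS, ~100 GB/s):

```c
#include <stdio.h>

// Roofline bound: attainable GFLOP/s =
//   min(peak compute rate, arithmetic intensity * peak memory bandwidth)
static double roofline_gflops(double peak_gflops, double peak_gbps,
                              double flops_per_byte)
{
    double memory_bound = flops_per_byte * peak_gbps;
    return memory_bound < peak_gflops ? memory_bound : peak_gflops;
}

int main(void)
{
    double peak_gflops = 1000.0;  // ~1 TFLOPS device (figure from the slides)
    double peak_gbps   = 100.0;   // ~100 GB/s global-memory bandwidth

    // Example: 2 flop per pixel while reading one byte and writing one byte
    // gives an arithmetic intensity of 1 flop/byte (memory-bound, 100 GFLOP/s).
    printf("bound at  1 flop/byte: %4.0f GFLOP/s\n",
           roofline_gflops(peak_gflops, peak_gbps, 1.0));
    // At 20 flop/byte the kernel hits the compute roof instead.
    printf("bound at 20 flop/byte: %4.0f GFLOP/s\n",
           roofline_gflops(peak_gflops, peak_gbps, 20.0));
    return 0;
}
```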


Methodology: our contribution

Application
Identification of compute-intensive parts
Feasibility study of GPU acceleration, supported by performance estimation
GPU implementation (skeleton-based, OpenCL, pragma-based, or parallelization by compiler), supported by performance analysis
GPU optimization
Hardware

Underpinned by: algorithm characterization, hardware characterization, bottlenecks & trade-offs, anti-parallel patterns, the roofline model & benchmarks, and an analytical model

Conclusions

Changed into…

Competence Center for Personal Supercomputing

Offer training (overcome obstacle 1)
Acquire expertise
Take an independent, critical position
Offer feasibility and performance studies (overcome obstacle 2)

Symposium: Brussels, December 13th 2012

http://parallel.vub.ac.be