
GPU computing and CUDA

Marko Mišić (marko.misic@etf.rs), Milo Tomašević (mvt@etf.rs)

YUINFO 2012, Kopaonik, 29.02.2012.


Introduction to GPU computing (1)

Graphics Processing Units (GPUs) have been used for non-graphics computation for several years. This trend is called General-Purpose computation on GPUs (GPGPU). GPGPU applications can be found in: computational physics/chemistry/biology, signal processing, computational geometry, database management, computational finance, and computer vision.


Introduction to GPU computing (2)

The GPU is a highly parallel processor, good at data-parallel processing with many calculations per memory access: the same computation is executed on many data elements in parallel, with high arithmetic intensity. Same computation means a lower requirement for sophisticated flow control. High arithmetic intensity and many data elements mean that memory access latency can be hidden with calculations instead of big data caches.


CPU vs. GPU trends (1)

The CPU is optimized to execute tasks quickly: big caches hide memory latencies, and sophisticated flow control handles branches. The GPU is specialized for compute-intensive, highly parallel computation, so more transistors can be devoted to data processing rather than data caching and flow control.

[Figure: CPU vs. GPU: the CPU devotes much of its die to control logic and a large cache with a few ALUs, while the GPU devotes most of it to many ALUs with small control and cache; both are backed by DRAM.]


CPU vs. GPU trends (2)

The GPU has evolved into a very flexible and powerful processor: it is programmable using high-level languages, offers computational power of ~1 TFLOPS vs. ~100 GFLOPS and roughly 10x higher memory bandwidth than the CPU, and is found in almost every workstation.


CPU vs. GPU trends (3)

CUDA advantage: reported speedups of 10x, 20x, 47x, and 197x across application domains:
Rigid body physics solver
Matrix numerics (BLAS1: 60+ GB/s, BLAS3: 100+ GFLOPS)
Wave equation (FDTD: 1.2 Gcells/s; FFT: 52 GFLOPS, as defined by benchFFT)
Biological sequence match (SSEARCH: 5.2 Gcells/s)
Finance (Black-Scholes: 4.7 GOptions/s)


History of GPU programming

The fast-growing video game industry exerts strong pressure that forces constant innovation, and GPUs evolved from fixed-function pipeline processors to more programmable, general-purpose processors. Programmable shaders (2000) were programmed through the OpenGL and DirectX APIs and had lots of limitations: memory access, ISA, floating-point support, etc. They were followed by NVIDIA CUDA (2007), AMD/ATI (Brook+, FireStream, Close-To-Metal), Microsoft DirectCompute (DirectX 10/DirectX 11), and the Open Computing Language, OpenCL (2009).


CUDA overview (1)

Compute Unified Device Architecture (CUDA) is a hardware and software architecture for issuing and managing computations on the GPU. It started with the NVIDIA 8000 (G80) series of GPUs. CUDA offers a general-purpose programming model (SIMD/SPMD): the user launches batches of threads on the GPU, so the GPU can be seen as a dedicated super-threaded, massively data-parallel coprocessor with explicit and unrestricted memory management.


CUDA overview (2)

The GPU is viewed as a compute device that is a coprocessor to the CPU (host): it executes the compute-intensive part of the application, runs many threads in parallel, and has its own DRAM (device memory). Data-parallel portions of an application are expressed as device kernels which run on many threads. GPU threads are extremely lightweight, with very little creation overhead, but the GPU needs thousands of threads for full efficiency, whereas a multicore CPU needs only a few.


CUDA overview (3)

CUDA provides a dedicated software stack: a runtime and a driver, a C-language extension for easier programming, and a targeted API for advanced users. There is a complete tool chain (compiler, debugger, profiler) as well as libraries and 3rd-party support: the GPU Computing SDK, cuFFT, cuBLAS..., FORTRAN, C++, Python, MATLAB, Thrust, GMAC…

[Figure: CUDA software stack: the application uses the CUDA libraries (FFT, BLAS), the CUDA runtime, and the CUDA driver, which run on the CPU and control the GPU.]


Programming model (1)

[Figure: Execution alternates between serial code on the host and parallel kernels on the device: serial code (host) ... KernelA<<< nBlk, nTid >>>(args); ... serial code (host) ... KernelB<<< nBlk, nTid >>>(args).]

A CUDA application consists of two parts: sequential parts are executed on the CPU (host), while compute-intensive parts are executed on the GPU (device). The CPU is responsible for data management, memory transfers, and the GPU execution configuration.
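To make this structure concrete, here is a minimal sketch (the kernel, array size, and launch configuration are illustrative, not part of the original slides):

// Hypothetical kernel: each thread increments one element of the array.
__global__ void incrementKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] += 1.0f;
}

int main()
{
    const int n = 1024;
    float *d_data;                                    // device pointer into global memory
    cudaMalloc((void**)&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));         // serial host code: prepare input
    incrementKernel<<< n / 256, 256 >>>(d_data, n);   // parallel kernel on the device
    cudaDeviceSynchronize();                          // wait for the kernel, then continue serially
    // ... serial host code: copy results back with cudaMemcpy, post-process ...
    cudaFree(d_data);
    return 0;
}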


Programming model (2)

A kernel is executed as a grid of thread blocks. A thread block is a batch of threads that can cooperate with each other by efficiently sharing data through shared memory and by synchronizing their execution. Two threads from two different blocks cannot cooperate.

[Figure: The host launches Kernel 1 on Grid 1 and Kernel 2 on Grid 2 on the device; Grid 1 consists of blocks (0,0) through (2,1), and Block (1,1) is expanded into its threads (0,0) through (4,2).]


Programming model (3)

Threads and blocks have IDs, so each thread can decide what data to work on. The block ID is 1D or 2D; the thread ID is 1D, 2D, or 3D. This simplifies memory addressing when processing multidimensional data, e.g. image processing or solving PDEs on volumes.

[Figure: A grid of blocks with 2D block IDs; Block (1,1) is expanded into threads with 2D thread IDs (0,0) through (4,2).]
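A minimal indexing sketch, assuming a row-major width x height image and a hypothetical kernel name:

// Hypothetical kernel: each thread brightens one pixel of a width x height image.
__global__ void brighten(float *image, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // column, from 2D block and thread IDs
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // row
    if (x < width && y < height)
        image[y * width + x] += 0.1f;
}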


Memory model (1) Each thread can:

Read/write per-thread registers

Read/write per-thread local memory

Read/write per-block shared memory

Read/write per-grid global memory

Read only per-grid constant memory

Read only per-grid texture memory

[Figure: CUDA memory model: each thread has its own registers and local memory, each block has shared memory, and the whole grid shares global, constant, and texture memory, which the host can also access.]
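A short sketch of how these memory spaces typically appear in CUDA C source (the kernel, names, and sizes are illustrative; it assumes blocks of at most 256 threads):

__constant__ float coeff[16];                     // per-grid constant memory, read-only in kernels

// Hypothetical kernel showing where variables live.
__global__ void scaleByCoeff(float *globalData)   // globalData points into per-grid global memory
{
    __shared__ float tile[256];                   // per-block shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float tmp;                                    // plain locals normally live in per-thread registers;
                                                  // spilled values go to per-thread local memory
    tile[threadIdx.x] = globalData[i];            // stage a value in shared memory
    __syncthreads();
    tmp = tile[threadIdx.x] * coeff[0];           // read from constant memory
    globalData[i] = tmp;                          // write back to global memory
}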


Memory model (2)

The host can read/write global, constant, and texture memory, all stored in device DRAM. Global memory accesses are slow, around ~200 cycles. The memory architecture is optimized for high bandwidth through memory banks and transactions.

[Figure: The host reads and writes global memory (DRAM) on the device; each block has on-chip shared memory and each thread its own registers.]
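A host-side sketch of moving data through device global memory (names are illustrative; error checking omitted):

// Host code, inside a host function.
const int n = 1 << 20;
size_t bytes = n * sizeof(float);
float *h_data = (float*)malloc(bytes);                        // host buffer in CPU memory
// ... fill h_data with input ...
float *d_data;                                                // device buffer in global memory
cudaMalloc((void**)&d_data, bytes);
cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);    // host -> device
// ... launch kernels that read and write d_data ...
cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);    // device -> host
cudaFree(d_data);
free(h_data);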


Memory model (3)

Shared memory is a fast on-chip memory that allows threads in a block to share intermediate data. Access time is ~3-4 cycles. It can be seen as a user-managed cache (scratchpad): threads are responsible for bringing data into and moving it out of shared memory. It is small in size (up to 48 KB).

[Figure: Data elements d0-d7 are staged from DRAM into the shared memory of each multiprocessor, next to its ALUs, control logic, and cache.]
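A sketch of the usual staging pattern, using a hypothetical per-block array reversal; it assumes blocks of exactly BLOCK threads and an array length that is a multiple of BLOCK:

#define BLOCK 256

// Hypothetical kernel: reverse the elements handled by each block.
__global__ void reverseInBlock(float *data)
{
    __shared__ float buf[BLOCK];                 // fast on-chip scratchpad
    int i = blockIdx.x * BLOCK + threadIdx.x;
    buf[threadIdx.x] = data[i];                  // stage from global to shared memory
    __syncthreads();                             // wait until the whole block has loaded
    data[i] = buf[BLOCK - 1 - threadIdx.x];      // read back a different element
}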


A common programming strategy

Local and global memory reside in device memory (DRAM), with much slower access than shared memory. A common way of performing computation on the device is to block it up (tile it) to take advantage of fast shared memory: partition the data set into subsets that fit into shared memory, and handle each data subset with one thread block by loading the subset from global memory to shared memory, performing the computation on the subset from shared memory (each thread can efficiently multi-pass over any data element), and copying the results from shared memory back to global memory.


Matrix Multiplication Example (1)

P = M * N of size WIDTH x WIDTH. Without blocking: one thread handles one element of P, and M and N are loaded WIDTH times from global memory.

[Figure: M (WIDTH x WIDTH) multiplied by N (WIDTH x WIDTH) gives P (WIDTH x WIDTH).]
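A sketch of the unblocked kernel (row-major storage assumed; one thread computes one element of P):

__global__ void matMulNaive(const float *M, const float *N, float *P, int width)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < width && col < width) {
        float sum = 0.0f;
        for (int k = 0; k < width; ++k)                         // M and N are read from global
            sum += M[row * width + k] * N[k * width + col];     // memory WIDTH times per thread
        P[row * width + col] = sum;
    }
}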


Matrix Multiplication Example (2)

P = M * N of size WIDTH x WIDTH. With blocking: one thread block handles one BLOCK_SIZE x BLOCK_SIZE sub-matrix Psub of P, and M and N are only loaded WIDTH / BLOCK_SIZE times from global memory. Great saving of memory bandwidth!

[Figure: Each thread block computes one BLOCK_SIZE x BLOCK_SIZE sub-matrix Psub of P from the corresponding BLOCK_SIZE-wide strips of M and N.]
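A sketch of the tiled kernel; it assumes row-major storage, a BLOCK_SIZE x BLOCK_SIZE thread block, and WIDTH being a multiple of BLOCK_SIZE:

#define BLOCK_SIZE 16

__global__ void matMulTiled(const float *M, const float *N, float *P, int width)
{
    __shared__ float Ms[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ float Ns[BLOCK_SIZE][BLOCK_SIZE];
    int row = blockIdx.y * BLOCK_SIZE + threadIdx.y;
    int col = blockIdx.x * BLOCK_SIZE + threadIdx.x;
    float sum = 0.0f;
    for (int t = 0; t < width / BLOCK_SIZE; ++t) {
        // Each thread loads one element of the current M tile and one of the N tile.
        Ms[threadIdx.y][threadIdx.x] = M[row * width + t * BLOCK_SIZE + threadIdx.x];
        Ns[threadIdx.y][threadIdx.x] = N[(t * BLOCK_SIZE + threadIdx.y) * width + col];
        __syncthreads();                          // tiles fully loaded into shared memory
        for (int k = 0; k < BLOCK_SIZE; ++k)
            sum += Ms[threadIdx.y][k] * Ns[k][threadIdx.x];
        __syncthreads();                          // done with the tiles before reloading
    }
    P[row * width + col] = sum;
}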


CUDA API (1)

The CUDA API is an extension to the C programming language consisting of language extensions, to target portions of the code for execution on the device, and a runtime library split into: a common component providing built-in vector types and a subset of the C runtime library in both host and device code; a host component to control and access one or more devices from the host; and a device component providing device-specific functions.


CUDA API (2)

Function declaration qualifiers: __global__, __host__, __device__. Variable qualifiers: __device__, __constant__, __shared__, etc. Built-in variables: gridDim, blockDim, blockIdx, threadIdx. Mathematical functions. Kernel calling convention (execution configuration): myKernel<<< DimGrid, DimBlock >>>(arg1, … ); the programmer explicitly specifies the block and grid organization (1D, 2D or 3D).
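For example, a 2D execution configuration for a hypothetical image kernel (such as the brighten sketch above; d_image, width, and height are assumed to exist) might look like this:

dim3 dimBlock(16, 16);                                   // 16 x 16 = 256 threads per block
dim3 dimGrid((width + dimBlock.x - 1) / dimBlock.x,      // enough blocks to cover the image
             (height + dimBlock.y - 1) / dimBlock.y);
brighten<<< dimGrid, dimBlock >>>(d_image, width, height);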


Hardware implementation (1)

The device is a set of multiprocessors, and each multiprocessor is a set of 32-bit processors with a SIMD architecture. At each clock cycle, a multiprocessor executes the same instruction on a group of threads called a warp, including branches. This allows scalable execution of kernels: adding more multiprocessors improves performance.

[Figure: The device contains multiprocessors 1..N; each multiprocessor contains processors 1..M driven by a single instruction unit.]
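A sketch of what "including branches" implies: when threads of the same warp take different paths, the warp executes the paths one after the other (warp size of 32 assumed; the kernel is illustrative):

// Hypothetical kernel illustrating warp divergence.
__global__ void divergenceDemo(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Divergent: threads of the same warp take different paths,
    // so the warp executes both paths one after the other.
    if (threadIdx.x % 2 == 0)
        data[i] += 1.0f;
    else
        data[i] -= 1.0f;

    // Non-divergent: the branch granularity is a whole warp,
    // so each warp follows exactly one path.
    if ((threadIdx.x / 32) % 2 == 0)
        data[i] *= 2.0f;
    else
        data[i] *= 0.5f;
}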


Hardware implementation (2)

[Figure: GPU hardware organization: the host feeds an input assembler and a thread execution manager; an array of multiprocessors, each with a parallel data cache and texture units, accesses global memory through load/store units.]


Hardware implementation (3)

Each thread block of a grid is split into warps (a warp consists of threads with consecutive thread IDs) that get executed by one multiprocessor. Each thread block is executed by only one multiprocessor, its shared memory space resides in the on-chip shared memory, and registers are allocated among its threads; a kernel that requires too many registers will fail to launch. A multiprocessor can execute several blocks concurrently, with shared memory and registers allocated among the threads of all concurrent blocks, so decreasing shared memory usage (per block) and register usage (per thread) increases the number of blocks that can run concurrently.


Memory architecture (1)

In a parallel machine, many threads access memory, so memory is divided into banks, which is essential to achieve high bandwidth. Each bank can service one address per cycle, so a memory can service as many simultaneous accesses as it has banks. Multiple simultaneous accesses to a bank result in a bank conflict, and conflicting accesses are serialized. Shared memory is organized in a similar fashion.

[Figure: Memory divided into banks 0 through 15.]
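A sketch of conflict-free vs. conflicting shared memory accesses; it assumes 16 banks and half-warp granularity, as on the hardware described here, and blocks of at most 256 threads:

// Hypothetical kernel illustrating shared memory bank conflicts.
__global__ void bankDemo(float *out)
{
    __shared__ float s[512];
    s[threadIdx.x] = (float)threadIdx.x;          // fill part of the scratchpad
    s[threadIdx.x + 256] = (float)threadIdx.x;
    __syncthreads();

    float a = s[threadIdx.x];                     // stride 1: each thread of a half-warp
                                                  // hits a different bank, no conflicts
    float b = s[threadIdx.x * 2];                 // stride 2: threads 0 and 8 (1 and 9, ...)
                                                  // hit the same bank, accesses are serialized
    out[blockIdx.x * blockDim.x + threadIdx.x] = a + b;
}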


Memory architecture (2)

When accessing global memory, accesses are combined into transactions. Peak bandwidth is achieved when all threads in a half-warp access contiguous memory locations ("memory coalescing"); in that case, there are no bank conflicts. The programmer is responsible for optimizing algorithms to access data in an appropriate fashion.
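A sketch contrasting a coalesced and a strided global memory access pattern (the kernel and stride parameter are illustrative):

// Hypothetical kernel contrasting coalesced and strided reads (stride > 1).
__global__ void accessDemo(const float *in, float *out, int stride, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    // Coalesced: consecutive threads read consecutive addresses, so the loads
    // of a half-warp combine into a few wide transactions.
    float a = (tid < n) ? in[tid] : 0.0f;

    // Strided: consecutive threads are 'stride' elements apart, so the same
    // half-warp touches many memory segments and needs many transactions.
    float b = (tid * stride < n) ? in[tid * stride] : 0.0f;

    if (tid < n)
        out[tid] = a + b;
}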


Performance considerations

CUDA has a low learning curve: it is easy to write a correct program. However, performance can vary greatly depending on the resource constraints of the particular device architecture, so performance-conscious programmers still need to be aware of them to make good use of contemporary hardware. It is essential to understand the hardware and memory architecture: thread scheduling and execution, suitable memory access patterns, shared memory utilization, and resource limitations.


Conclusion

The highly multithreaded architecture of modern GPUs is very suitable for solving data-parallel problems and vastly improves performance in certain domains. It is expected that GPU architectures will evolve to further broaden the range of application domains; we are at the dawn of heterogeneous computing. Software support is developing rapidly: a mature tool chain, libraries, available applications, and OpenCL.


References

David Kirk, Wen-mei Hwu, Programming Massively Parallel Processors: A Hands-on Approach, Morgan Kaufmann, 2010.
Course ECE498AL, University of Illinois at Urbana-Champaign, http://courses.engr.illinois.edu/ece498/al/
Dan Connors, OpenCL and CUDA Programming for Multicore and GPU Architectures, ACACES 2011, Fiuggi, Italy, 2011.
David Kirk, Wen-mei Hwu, Programming and Tuning Massively Parallel Systems, PUMPS 2011, Barcelona, Spain, 2011.
NVIDIA CUDA C Programming Guide 4.0, 2011.
Mišić, Đurđević, Tomašević, "Evolution and Trends in GPU Computing", MIPRO 2012, Abbazia, Croatia, 2012. (to be published)
NVIDIA Developer Zone, http://developer.nvidia.com/category/zone/cuda-zone
http://en.wikipedia.org/wiki/GPGPU
http://en.wikipedia.org/wiki/CUDA
GPU training wiki, https://hpcforge.org/plugins/mediawiki/wiki/gpu-training/index.php/Main_Page


GPU computing and CUDA

Questions?

Marko Mišić (marko.misic@etf.rs), Milo Tomašević (mvt@etf.rs)

YUINFO 2012, Kopaonik, 29.02.2012.