An Introduction to CUDA Programming - GTC On …on-demand.gputechconf.com/gtc/2016/webinar/catch-up-on-cuda.pdf · An Introduction to CUDA Programming ... CUDA C/C++, CUDA Fortran

An Introduction to CUDA Programming

Chris Mason Director of Product Management, Acceleware

GTC Express Webinar

Date: June 8, 2016

Catch Up on CUDA

© 2016 Acceleware Ltd. Reproduction or distribution strictly prohibited.

About Acceleware Programmer Training

CUDA and other HPC training classes

Over 100 courses taught worldwide

http://acceleware.com/training

Consulting Services Projects for Oil & Gas, Medical, Finance, Security and

Defence, CAD, Media & Entertainment

Mentoring, code review and complete project

implementation

http://acceleware.com/services

GPU Accelerated Software Seismic imaging & modelling

Electromagnetics

2


Programmer Training CUDA and other HPC training classes

Teachers with real world experience

Hands-on lab exercises

Progressive lectures

Small class sizes to maximize learning

90 days post training support

“The level of detail is fantastic. The course did not focus on

syntax but rather on how to expertly program for the GPU. I

loved the course and I hope that we can get more of our team

to take it.”

Jason Gauci, Software Engineer

Lockheed Martin

3


Consulting Services

4


Seismic Imaging & Modelling AxWAVE™

Seismic forward modelling

2D, 3D, constant, and variable density models

High fidelity finite-difference modelling

AxRTM™

High performance Reverse Time Migration

application

Isotropic, VTI, and TTI media

AxFWI™

Inversion of the full seismic data to provide an

accurate subsurface velocity model

Customizable for specific workflows

HPC Implementation

Optimized for NVIDIA Tesla GPUs

Efficient multi-GPU scaling

5


Electromagnetics

AxFDTD™

Finite-Difference Time-Domain

Electromagnetic Solver

Optimized for NVIDIA GPUs

Sub-gridding and large feature

coverage

Multi-GPU, GPU clusters, GPU

targeting

Available from:

6


Outline

CUDA overview

Data-parallelism

GPU programming model

GPU kernels

Host vs. device responsibilities

CUDA syntax

Thread hierarchy

7


Why use GPUs? Performance!

GPU advances are outpacing CPU advances

Continuing Moore’s Law?

8


Why use GPUs? Performance!

9

Intel Xeon E5-2699v3 (Haswell-EP)

NVIDIA Tesla M60 (Maxwell)

NVIDIA Tesla K80 (Kepler)

NVIDIA Tesla P100 (Pascal)

Processing Cores 22 4096 4992 3584

Clock Frequency 2.2-3.6* 0.900-1.180 GHz 0.562-0.875 GHz 1.328-1.48 GHz

Memory Bandwidth 76.8 GB/s / socket 320 GB/s 480 GB/s 720 GB/s

Peak Tflops (single) 1.83 @ 2.6GHz 9.68 @ 1.180 GHz 8.74 @ 0.875GHz 10.6 @ 1.48GHz

Peak Tflops (double) 0.915 @ 2.6GHz 0.30 @ 1.180 GHz 2.91 @ 0.875GHz 5.3 @ 1.48GHz

Gflops/Watt (single) 9.1 32.2 29.1 35.3

Total Memory >>24GB 16GB 24GB 16 GB


GPU Potential Advantages

Tesla K80 vs. Xeon E5-2699 v3

5.8x more single-precision floating-point

throughput

5.8x more double-precision floating-point

throughput

9.4x higher memory bandwidth

10


GPU Disadvantages

Architecture not as flexible as CPU

Must rewrite algorithms and maintain

software in GPU languages

Discrete GPUs attached to CPU via

relatively slow PCIe

16GB/s bi-directional for PCIe 3.0 16x

Limited memory (though 8-24GB is

reasonable for many applications)

11


Software Approaches for Acceleration

Maximum Flexibility CUDA C/C++, CUDA Fortran

MATLAB, Mathematica, LabVIEW

Python: NumaPro, PyCUDA

Simple programming for heterogeneous systems Simple compiler hints/pragmas

Compiler parallelizes code

Target a variety of platforms

“Drop-in” Acceleration In-depth GPU knowledge not required

Highly optimized by GPU experts

Provides functions used in a broad range of applications

(eg. FFT, BLAS, RNG, sparse iterative solvers, deep

neural networks)

Programming

Languages

OpenACC

Directives

Libraries

Eff

ort

12


Compute Capability

13

Hardware architecture version number defined by a major and minor version number Major version specifies core architecture type

Minor version refers to incremental improvements and new features

Architecture Name

Compute Capability

GPUs Example Features

Tesla

1.0 GeForce 8800, Quadro FX 5600, Tesla C870 Base Functionality

1.1 GeForce 9800, Quadro FX 4700 x2 Asynchronous Memory Transfers

1.3 GeForce GTX 295, Quadro FX 5800, Tesla C1060 Double Precision

Fermi 2.0 GeForce GTX 480, Tesla C2050 R/W Memory Cache

Kepler

3.0 GeForce GTX 680, Tesla K10 Warp Shuffle Functions PCI-e 3.0

3.5 Tesla K20, K20X, and K40 Dynamic Parallelism

3.7 Tesla K80 More registers and shared memory

Maxwell 5.0 GeForce GTX 750 Ti New architecture

5.2 GeForce GTX 970/980 More shared memory

Pascal 6.0 Tesla P100 Page Migration Engine


CUDA Overview

Parallel computing architecture

developed by NVIDIA

CUDA programming interface consists of:

C language extensions to target portions of source

code on the compute device (GPU)

A library of C functions that execute on the host

(CPU) to interact with the device

14


Data-Parallel Computing

1. Performs operations on a data set organized into a

common structure (eg. an array)

2. A set of tasks work collectively and simultaneously on the

same structure with each task operating on its own portion

of the structure

3. Tasks perform identical operations on their portions of the

structure. Operations on each portion must be data

independent!

15


Data Dependence

Data dependence occurs when a program statement refers to

the data of a preceding statement

Data dependence limits parallelism

16

a = 2 * x; b = 2 * y; c = 3 * x;

a = 2 * x; b = 2 * a * a; c = b * 9;

These 3 statements are

independent! b depends on a, c depends

on b and a!


Data-Parallel Computing Example

Data set consisting of

arrays A,B, and C

Same operations performed

on each element

Cx = Ax + Bx

Two tasks operating on a

subset of the arrays. Tasks

0 and 1 are independent.

Could have more tasks.

17

A0 A1 A2 A3 A4 A5 A6 A7

B0 B1 B2 B3 B4 B5 B6 B7

C0 C1 C2 C3 C4 C5 C6 C7

Cx = Ax + Bx

Task 0 Task 1

Operation


Data-Parallel Computing on GPUs

Data-parallel computing maps well to GPUs: Identical operations executed on many data elements in

parallel Simplified flow control allows increased ratio of compute logic (ALUs) to

control logic

18

DRAM

GPU

DRAM

CPU

ALU Control

L1 Cache

L2 Cache

ALU

ALU ALU

ALU Control

L1 Cache

L2 Cache

ALU

ALU ALU

L3 Cache


The CUDA Programming Model

CUDA is a heterogeneous model, including provisions

for both host and device

19

DRAM GPU

Device

PCIe DRAM CPU

Host

CPU GPU

DRAM

Host Device

Discrete GPU

Mobile Processor



Data-parallel portions of an algorithm are

executed on the device as kernels Kernels are C/C++ functions with some restrictions, and a few

language extensions

Only one kernel is executed at a time Newer GPU architectures relax this restriction

Each kernel is executed by many threads

20


CUDA Threads

CUDA threads are conceptually similar to data-parallel tasks Each thread performs the same operations on a subset of a data

structure

Threads execute independently

CUDA threads are not CPU threads CUDA threads are extremely lightweight

Little creation overhead

Instant context-switching

CUDA threads must execute the same kernel

21


CUDA Thread Hierarchy

CUDA is designed to execute

1000s of threads

Threads are grouped together

into thread blocks

Thread blocks are grouped

together into a grid

22

Image courtesy of NVIDIA Corp.


CUDA Thread Hierarchy

Thread blocks and Grids can be

1D, 2D or 3D

Dimensions set at launch time

Thread blocks and grids do not

need to have the same

dimensionality

ie. 1D Grid of 2D Thread

Blocks

23

Block (0,0) Block (1,0) Block( 2,0)

Block (0,1) Block (1,1) Block (2,1)

Grid

Thread (3,0) Thread (1,0) Thread (2,0) Thread (0,0)



Block (1,1)



The host launches kernels

The host executes serial

code between device

kernel launches Memory management

Data exchange to/from device

Error handling

24

Block (0,0) Block (1,0)

Block (0,1) Block (1,1)

Block (0,2) Block ( 1,2)

Grid



Grid

Host

Device

Host

Device


CUDA APIs

Can use CUDA through CUDA C (Runtime API), or Driver API

This tutorial presentation uses CUDA C Uses host side C-extensions that greatly simplify code

Driver API has a much more verbose syntax that clouds CUDA (parallel) fundamentals

Same ability, same performance, but: Historically Driver API provided slightly more control over host process

interactions with GPU

Runtime API now supports all functionality of Driver API, and is interoperable with Driver API

No reason to start new development using Driver API

Don’t confuse the two when referring to CUDA Documentation cuFunctionName() – Driver API

cudaFunctionName() – Runtime API

25


CUDA Kernel Launch Syntax CUDA kernels are launched by the host using a modified C function

call syntax:

dim3 is vector type with x, y, and z components (eg. dG.x)

26

myKernel<<<dim3 dGrid, dim3 dBlock>>>(…)

Maximum Values For

Each Dimension

Compute Capability

2.x (Fermi) 3.x (Kepler) 5.x (Maxwell)

Total Threads per Block 1024 1024

Grid Size

dGrid.x 65535 231-1

dGrid.y 65535 65535

dGrid.z 65535 65535

Block Size

dBlock.x 1024 1024

dBlock.y 1024 1024

dBlock.z 64 64


CUDA Kernels

Denoted by __global__function qualifier Eg. __global__void myKernel(float* a)

Called from host, executed on device

A few noteworthy restrictions: No access to host memory (in general!)

Must return void

No static variables

No access to host functions

27


CUDA Syntax – Kernels (I)

Kernels can take arguments just like any C function

Pointers to device memory

Parameters passed by value

28

__global__ void SimpleKernel(float* a, float b) { a[0] = b; }


CUDA Syntax – Kernels (II)

Kernels must be

declared (but not

necessarily

defined) in

source/header files

before they are

called

29

// Kernel declaration __global__ void kernel(float* a); int main() { dim3 gridSize, blockSize; … kernel<<<gridSize,blockSize>>>(a); } __global__ void kernel(float* a) { … }


CUDA Syntax - Kernels

Kernels have read-only built-in variables: gridDim: dimensions of the grid

Uniform for all threads

blockIdx: unique index of a block within grid

blockDim: dimensions of the block

Uniform for all threads

threadIdx: unique index of the thread within the thread block

Cannot vary the size of blocks or grids during a kernel call

30



Built-in variables are typically used to compute

unique thread identifiers Map local thread ID to a global array index

31

blockIdx.x

blockDim.x = 5

gridDim.x = 3

threadIdx.x

blockIdx.x*blockDim.x + threadIdx.x

myKernel<<<3,5>>>(…)

Grid

0

0 1 2 3 4

1

0 1 2 3 4

2

0 1 2 3 4

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14


CUDA Syntax – Index & Size Calculations

Global index calculation idx = blockIdx.x * blockDim.x + threadIdx.x

Grid size calculation

Where Size: Total size of the array

BlkDim: Size of the block (max 1024)

GridSize: Number of blocks in the grid

32

DivisionIntegerBlkDim

BlkDimSizeGridSize

1


CUDA Syntax – Thread Identifiers

Result for each kernel launched with the following execution configuration:

MyKernel<<<3,4>>>(a);

33

__global__ void MyKernel(int* a) { int idx = blockIdx.x*blockDim.x+threadIdx.x; a[idx] = 7; } __global__ void MyKernel(int* a) { int idx = blockIdx.x*blockDim.x+threadIdx.x; a[idx] = blockIdx.x; } __global__ void MyKernel(int* a) { int idx = blockIdx.x*blockDim.x+threadIdx.x; a[idx] = threadIdx.x; }

a: 7 7 7 7 7 7 7 7 7 7 7 7

a: 0 0 0 0 1 1 1 1 2 2 2 2

a: 0 1 2 3 0 1 2 3 0 1 2 3



All C operators are supported eg. +, *, /, ^, >, >>

All functions from the standard math library eg. sinf(), cosf(), ceilf(), fabsf()

Control flow statements too! eg. if(), while(), for()

34


CUDA Kernel C++ Support

Supported Classes

Including inheritance and virtual functions

Need to add __device__ qualifiers to member functions!

Templates

C++11 features including auto and lambda functions

Not supported C++ Standard Library

Run time type information (RTTI)

Exception handling

Classes with virtual functions are not binary compatible between host and device

35


User-defined Device Functions

Can write/call your own device functions Device functions cannot be called by host

Functions declared with both __device__ and __host__ will be compiled for both the CPU and GPU

36

__device__ float myDeviceFunction(int i) {

... } __global__ void myKernel(float* a) { int idx = blockIdx.x*blockDim.x+threadIdx.x; a[idx] = myDeviceFunction(idx); }


CUDA Syntax - Memory Management

Typically, host code manages device memory:

cudaMalloc(void** pointer, size_t nbytes)

cudaMemset(void* pointer, int value, size_t count)

cudaFree(void* pointer)

37

// Memory allocation example

int n = 1024; int nBytes = 1024*sizeof(int); int* a = 0; cudaMalloc((void**)&a, nbytes); cudaMemset( a, 0, nbytes); cudaFree(a);


CUDA Syntax – Memory Spaces

Host and device have separate memory spaces For discrete GPUs data is moved between

them via PCIe bus

Pointers are just addresses Can’t tell from the pointer value whether

the address is on device or host

Must exercise caution when dereferencing pointers

Dereferencing host pointers on device likely crashed, and vice versa

38


CUDA Syntax – Data Transfers

Host code manages data transfers to and from the device: cudaMemcpy(void* dst, void* src, size_t nbytes, enum

cudaMemcpyKind direction);

Direction is one of:

cudaMemcpyHostToDevice cudaMemcpyDeviceToHost cudaMemcpyDeviceToDevice cudaMemcpyHostToHost cudaMemcpyDefault

Blocking call - returns once copy is complete

Waits for all outstanding CUDA calls to complete before starting transfer

With cudaMemcpyDefault, runtime determines which way to copy data

39


CUDA Syntax - Synchronization

Kernel launches are asynchronous Control returns to CPU immediately

Kernel starts executing once all

outstanding CUDA calls are complete

cudaMemcpy() is synchronous Blocks until copy is complete

Copy starts once all outstanding

CUDA calls are complete

cudaDeviceSynchronize() Blocks until all outstanding CUDA

calls are complete

40

cudaMemcpy(…,cudaMemcpyHostToDevice); // Data is on the GPU at this point MyKernel<<<…>>>(…); // Kernel is launched but // not necessarily complete cudaMemcpy(…,cudaMemcpyDeviceToHost); // CPU waits until kernel is complete // and then transfers data // Data is on the CPU at this point


CUDA Syntax – Error Management

Host code manages errors

Most CUDA function calls return cudaError_t

Enumeration type

cudaSuccess (value 0) indicates no errors

char* cudaGetErrorString(cudaError_t err) Returns a string describing the error condition

41

cudaError_t err; err = cudaMemcpy(…); if(err)

printf(“Error: %s\n”, cudaGetErrorString(err));


CUDA Syntax – Error Management

Kernel launches have no return value!

cudaError_t cudaGetLastError() Returns error code for last CUDA runtime function (including kernel

launches)

Resets global error state to cudaSuccess

In case of multiple errors, only the last one is reported

For kernels:

Asynchronous, must call cudaDeviceSynchronize() first, then cudaGetLastError()

42

MyKernel<<< ... >>> (...); cudaDeviceSynchronize(); e = cudaGetLastError();


Putting It All Together

This kernel assumes that the size of the

array is evenly divisible by the block size.

What happens if does not?

43

A0 A1 A2 A3 A4 A5 A6 A7

B0 B1 B2 B3 B4 B5 B6 B7

C0 C1 C2 C3 C4 C5 C6 C7

Cx = Ax + Bx

One Thread per Output

Operation

__global__

void VectorAddKernel( float* a,

float* b,

float* c)

{

int idx = threadIdx.x

+ blockIdx.x * blockDim.x;

c[idx] = a[idx] + b[idx];

}


Vector Add – Host Code

44

void VectorAdd(float* aH, float* bH, float* cH, int N) {

float* aD, *bD, *cD; int N_BYTES = N * sizeof(float); dim3 blockSize, gridSize; cudaMalloc((void**)&aD, N_BYTES); cudaMalloc((void**)&bD, N_BYTES); cudaMalloc((void**)&cD, N_BYTES); cudaMemcpy(aD, aH, N_BYTES, cudaMemcpyHostToDevice); cudaMemcpy(bD, bH, N_BYTES, cudaMemcpyHostToDevice); blockSize.x = 512; gridSize.x = N / blockSize.x; VectorAddKernel<<<gridSize, blockSize>>>(aD, bD, cD); cudaMemcpy(cH, cD, N_BYTES, cudaMemcpyDeviceToHost);

}

Allocate memory on

GPU

Transfer input

arrays to GPU

Launch kernel

Transfer output

array to CPU

This code assumes N

is a multiple of 512


CUDA Syntax – Unified Memory

CUDA 6 introduced Unified Memory

Use ‘managed memory’ instead of explicitly declaring

memory for the host and device

cudaMallocManaged(void **devPtr, size_t size)

45

// Memory allocation example

int n = 1024; int nBytes = 1024*sizeof(int); int* a = 0; cudaMallocManaged((void**)&a, nbytes); cudaFree(a);


Putting It All Together… Again!

46

int* a, *b, *c; int N_BYTES = 2 * sizeof(int); cudaMallocManaged((void**)&a, N_BYTES); cudaMallocManaged((void**)&b, N_BYTES); cudaMallocManaged((void**)&c, N_BYTES); a[0] = 5; b[0] = 7; a[1] = 3; b[1] = 4; VectorAddKernel<<<1,2>>>(a, b, c); cudaDeviceSynchronize(); printf(“%d %d\n”, c[0], c[1]);

Allocate managed

memory

Initializing memory from

host

Launch kernel and

synchronize device

Access result from host


Summary

GPUs are data-parallel architectures

CUDA provides a heterogeneous compute model:

Host:

Memory management (usually)

Data transfers

Data-parallel kernel launches on device as a grid of thread blocks

Error management

Device (GPU):

Executes data-parallel kernels in threads

Implemented in C/C++ with a few important extensions, and a few restrictions

47


Need More CUDA?

Public Training:

August 30 – September 2 Calgary, Alberta

Private Training:

Onsite, anywhere in the world

Tailored to your specific needs

NEW Online Training: Access our courses from anywhere

August 30 – September 2

Mentoring & Services: Let us help develop & optimize your

applications for maximum performance

48

Email: [email protected]


Questions?

Acceleware Ltd.

Tel: +1 403.249.9099

Email: [email protected] [email protected]

CUDA Blog: http://acceleware.com/blog

Website: http://acceleware.com

49

Chris Mason [email protected]

Documents

An Introduction to CUDA Programming - GTC On …on-demand.gputechconf.com/gtc/2016/webinar/catch-up-on-cuda.pdf · An Introduction to CUDA Programming ... CUDA C/C++, CUDA Fortran