An Introduction to CUDA Programming
Chris Mason Director of Product Management, Acceleware
GTC Express Webinar
Date: June 8, 2016
Catch Up on CUDA
© 2016 Acceleware Ltd. Reproduction or distribution strictly prohibited.
About Acceleware Programmer Training
CUDA and other HPC training classes
Over 100 courses taught worldwide
http://acceleware.com/training
Consulting Services Projects for Oil & Gas, Medical, Finance, Security and
Defence, CAD, Media & Entertainment
Mentoring, code review and complete project
implementation
http://acceleware.com/services
GPU Accelerated Software Seismic imaging & modelling
Electromagnetics
Programmer Training CUDA and other HPC training classes
Teachers with real world experience
Hands-on lab exercises
Progressive lectures
Small class sizes to maximize learning
90 days post training support
“The level of detail is fantastic. The course did not focus on
syntax but rather on how to expertly program for the GPU. I
loved the course and I hope that we can get more of our team
to take it.”
Jason Gauci, Software Engineer
Lockheed Martin
Consulting Services
Seismic Imaging & Modelling AxWAVE™
Seismic forward modelling
2D, 3D, constant, and variable density models
High fidelity finite-difference modelling
AxRTM™
High performance Reverse Time Migration
application
Isotropic, VTI, and TTI media
AxFWI™
Inversion of the full seismic data to provide an
accurate subsurface velocity model
Customizable for specific workflows
HPC Implementation
Optimized for NVIDIA Tesla GPUs
Efficient multi-GPU scaling
Electromagnetics
AxFDTD™
Finite-Difference Time-Domain
Electromagnetic Solver
Optimized for NVIDIA GPUs
Sub-gridding and large feature
coverage
Multi-GPU, GPU clusters, GPU
targeting
Available from:
Outline
CUDA overview
Data-parallelism
GPU programming model
GPU kernels
Host vs. device responsibilities
CUDA syntax
Thread hierarchy
Why use GPUs? Performance!
GPU advances are outpacing CPU advances
Continuing Moore’s Law?
Why use GPUs? Performance!
                        Intel Xeon E5-2699v3   NVIDIA Tesla M60   NVIDIA Tesla K80   NVIDIA Tesla P100
                        (Haswell-EP)           (Maxwell)          (Kepler)           (Pascal)
Processing Cores        22                     4096               4992               3584
Clock Frequency         2.2-3.6 GHz*           0.900-1.180 GHz    0.562-0.875 GHz    1.328-1.48 GHz
Memory Bandwidth        76.8 GB/s / socket     320 GB/s           480 GB/s           720 GB/s
Peak Tflops (single)    1.83 @ 2.6 GHz         9.68 @ 1.180 GHz   8.74 @ 0.875 GHz   10.6 @ 1.48 GHz
Peak Tflops (double)    0.915 @ 2.6 GHz        0.30 @ 1.180 GHz   2.91 @ 0.875 GHz   5.3 @ 1.48 GHz
Gflops/Watt (single)    9.1                    32.2               29.1               35.3
Total Memory            >>24 GB                16 GB              24 GB              16 GB
GPU Potential Advantages
Tesla P100 vs. Xeon E5-2699 v3
5.8x more single-precision floating-point
throughput
5.8x more double-precision floating-point
throughput
9.4x higher memory bandwidth
GPU Disadvantages
Architecture not as flexible as CPU
Must rewrite algorithms and maintain
software in GPU languages
Discrete GPUs attached to CPU via
relatively slow PCIe
Roughly 16 GB/s in each direction for PCIe 3.0 x16
Limited memory (though 8-24GB is
reasonable for many applications)
Software Approaches for Acceleration
Programming Languages: maximum flexibility
  CUDA C/C++, CUDA Fortran
  MATLAB, Mathematica, LabVIEW
  Python: NumbaPro, PyCUDA
OpenACC Directives: simple programming for heterogeneous systems
  Simple compiler hints/pragmas
  Compiler parallelizes code
  Target a variety of platforms
Libraries: “drop-in” acceleration
  In-depth GPU knowledge not required
  Highly optimized by GPU experts
  Provides functions used in a broad range of applications (e.g. FFT, BLAS, RNG, sparse iterative solvers, deep neural networks)
[Figure: the three approaches arranged along an "Effort" axis]
Compute Capability
Hardware architecture version number defined by a major and minor version number
  Major version specifies core architecture type
  Minor version refers to incremental improvements and new features

Architecture   Compute Capability   GPUs                                           Example Features
Tesla          1.0                  GeForce 8800, Quadro FX 5600, Tesla C870       Base functionality
               1.1                  GeForce 9800, Quadro FX 4700 x2                Asynchronous memory transfers
               1.3                  GeForce GTX 295, Quadro FX 5800, Tesla C1060   Double precision
Fermi          2.0                  GeForce GTX 480, Tesla C2050                   R/W memory cache
Kepler         3.0                  GeForce GTX 680, Tesla K10                     Warp shuffle functions, PCI-e 3.0
               3.5                  Tesla K20, K20X, and K40                       Dynamic parallelism
               3.7                  Tesla K80                                      More registers and shared memory
Maxwell        5.0                  GeForce GTX 750 Ti                             New architecture
               5.2                  GeForce GTX 970/980                            More shared memory
Pascal         6.0                  Tesla P100                                     Page Migration Engine
CUDA Overview
Parallel computing architecture
developed by NVIDIA
CUDA programming interface consists of:
C language extensions to target portions of source
code on the compute device (GPU)
A library of C functions that execute on the host
(CPU) to interact with the device
Data-Parallel Computing
1. Performs operations on a data set organized into a
common structure (eg. an array)
2. A set of tasks work collectively and simultaneously on the
same structure with each task operating on its own portion
of the structure
3. Tasks perform identical operations on their portions of the
structure. Operations on each portion must be data
independent!
Data Dependence
Data dependence occurs when a program statement refers to
the data of a preceding statement
Data dependence limits parallelism
These 3 statements are independent!
  a = 2 * x;
  b = 2 * y;
  c = 3 * x;
b depends on a, c depends on b and a!
  a = 2 * x;
  b = 2 * a * a;
  c = b * 9;
Data-Parallel Computing Example
Data set consisting of
arrays A,B, and C
Same operations performed
on each element
Cx = Ax + Bx
Two tasks operating on a
subset of the arrays. Tasks
0 and 1 are independent.
Could have more tasks.
[Figure: arrays A0..A7, B0..B7 and C0..C7 with the operation Cx = Ax + Bx; Task 0 and Task 1 each process their own subset of the elements]
Data-Parallel Computing on GPUs
Data-parallel computing maps well to GPUs:
  Identical operations executed on many data elements in parallel
  Simplified flow control allows increased ratio of compute logic (ALUs) to control logic
[Figure: block diagrams of a CPU (cores with ALUs, control logic and L1/L2 caches, a shared L3 cache, DRAM) and a GPU (with DRAM)]
The CUDA Programming Model
CUDA is a heterogeneous model, including provisions
for both host and device
[Figure: two host/device configurations. Discrete GPU: the host (CPU with DRAM) is connected over PCIe to the device (GPU with its own DRAM). Mobile processor: host CPU and device GPU share one DRAM]
The CUDA Programming Model
Data-parallel portions of an algorithm are executed on the device as kernels
  Kernels are C/C++ functions with some restrictions, and a few language extensions
Only one kernel is executed at a time
  Newer GPU architectures relax this restriction
Each kernel is executed by many threads
CUDA Threads
CUDA threads are conceptually similar to data-parallel tasks
  Each thread performs the same operations on a subset of a data structure
  Threads execute independently
CUDA threads are not CPU threads
  CUDA threads are extremely lightweight
  Little creation overhead
  Instant context-switching
  CUDA threads must execute the same kernel
CUDA Thread Hierarchy
CUDA is designed to execute
1000s of threads
Threads are grouped together
into thread blocks
Thread blocks are grouped
together into a grid
Image courtesy of NVIDIA Corp.
CUDA Thread Hierarchy
Thread blocks and Grids can be
1D, 2D or 3D
Dimensions set at launch time
Thread blocks and grids do not
need to have the same
dimensionality
e.g. 1D grid of 2D thread blocks
[Figure: a 3 x 2 grid of blocks, Block (0,0) to Block (2,1); Block (1,1) is expanded into a 4 x 3 arrangement of threads, Thread (0,0) to Thread (3,2)]
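As a minimal sketch of the mixed dimensionality described above, the following launches a 1D grid of 2D thread blocks and flattens each thread's coordinates into a single global index. The names flatIndex and fillIndexKernel, and the 4 x 3 block shape, are illustrative assumptions and do not come from the original slides.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: flattens (block, thread.x, thread.y) into one global
// index for a 1D grid of 2D blocks. x varies fastest within a block.
__host__ __device__ inline int flatIndex(int block, int dimX, int dimY, int tx, int ty)
{
    return block * (dimX * dimY) + ty * dimX + tx;
}

// Each thread writes its own flattened index into out.
__global__ void fillIndexKernel(int* out, int total)
{
    int idx = flatIndex(blockIdx.x, blockDim.x, blockDim.y, threadIdx.x, threadIdx.y);
    if (idx < total)
        out[idx] = idx;
}

int main()
{
    const int TOTAL = 36;   // 3 blocks * (4 * 3) threads per block
    dim3 gridSize(3);       // 1D grid: 3 blocks
    dim3 blockSize(4, 3);   // 2D block: 4 x 3 threads

    int* outD = 0;
    int outH[TOTAL];
    if (cudaMalloc((void**)&outD, TOTAL * sizeof(int)) != cudaSuccess) {
        printf("cudaMalloc failed\n");
        return 1;
    }

    fillIndexKernel<<<gridSize, blockSize>>>(outD, TOTAL);
    cudaError_t err = cudaMemcpy(outH, outD, TOTAL * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(outD);
    if (err != cudaSuccess) {
        printf("Error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Every element should hold its own index if the flattening is consistent.
    int mismatches = 0;
    for (int i = 0; i < TOTAL; ++i)
        if (outH[i] != i)
            ++mismatches;
    printf("mismatches: %d\n", mismatches);
    return mismatches;
}
```

Declaring the helper with both __host__ and __device__ lets the same index arithmetic be checked on the CPU without a GPU.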
The CUDA Programming Model
The host launches kernels
The host executes serial code between device kernel launches
  Memory management
  Data exchange to/from device
  Error handling
[Figure: execution alternates between host and device; each kernel launch runs on the device as a grid of thread blocks, here a 2 x 3 grid followed by a 3 x 2 grid]
CUDA APIs
Can use CUDA through CUDA C (Runtime API), or Driver API
This tutorial presentation uses CUDA C
  Uses host side C-extensions that greatly simplify code
  Driver API has a much more verbose syntax that clouds CUDA (parallel) fundamentals
Same ability, same performance, but:
  Historically Driver API provided slightly more control over host process interactions with GPU
  Runtime API now supports all functionality of Driver API, and is interoperable with Driver API
  No reason to start new development using Driver API
Don’t confuse the two when referring to CUDA Documentation
  cuFunctionName() – Driver API
  cudaFunctionName() – Runtime API
CUDA Kernel Launch Syntax
CUDA kernels are launched by the host using a modified C function call syntax:
  myKernel<<<dim3 dGrid, dim3 dBlock>>>(…)
dim3 is a vector type with x, y, and z components (e.g. dGrid.x)

Maximum Values For Each Dimension, by Compute Capability
                          2.x (Fermi)   3.x (Kepler), 5.x (Maxwell)
Total Threads per Block   1024          1024
Grid Size
  dGrid.x                 65535         2^31-1
  dGrid.y                 65535         65535
  dGrid.z                 65535         65535
Block Size
  dBlock.x                1024          1024
  dBlock.y                1024          1024
  dBlock.z                64            64
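The limits in the table can also be read at run time rather than hard-coded. The sketch below queries them with cudaGetDeviceProperties and checks a candidate block shape against both the per-dimension limits and the total-threads-per-block limit. The helper name blockFits and the choice of device 0 are assumptions made for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: true if block shape b respects the per-dimension
// limits and the total-threads-per-block limit stored in p.
bool blockFits(const cudaDeviceProp& p, dim3 b)
{
    if (b.x > (unsigned int)p.maxThreadsDim[0]) return false;
    if (b.y > (unsigned int)p.maxThreadsDim[1]) return false;
    if (b.z > (unsigned int)p.maxThreadsDim[2]) return false;
    unsigned long long total = (unsigned long long)b.x * b.y * b.z;
    return total <= (unsigned long long)p.maxThreadsPerBlock;
}

int main()
{
    cudaDeviceProp prop;
    // Device 0 is an assumption; multi-GPU systems may want another index.
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        printf("Error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("%s (compute capability %d.%d)\n", prop.name, prop.major, prop.minor);
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("Max block size: %d x %d x %d\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    printf("Max grid size:  %d x %d x %d\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);

    // A 32 x 32 block is within each per-dimension limit, but whether it is
    // launchable also depends on the total-threads limit.
    printf("32 x 32 x 1 block fits: %s\n", blockFits(prop, dim3(32, 32, 1)) ? "yes" : "no");
    printf("32 x 32 x 2 block fits: %s\n", blockFits(prop, dim3(32, 32, 2)) ? "yes" : "no");
    return 0;
}
```

Note that the per-dimension maxima cannot all be used at once: the product of the three block dimensions must also stay within the total threads per block.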
CUDA Kernels
Denoted by __global__ function qualifier
  E.g. __global__ void myKernel(float* a)
Called from host, executed on device
A few noteworthy restrictions:
  No access to host memory (in general!)
  Must return void
  No static variables
  No access to host functions
CUDA Syntax – Kernels (I)
Kernels can take arguments just like any C function
Pointers to device memory
Parameters passed by value
__global__ void SimpleKernel(float* a, float b)
{
    a[0] = b;
}
CUDA Syntax – Kernels (II)
Kernels must be declared (but not necessarily defined) in source/header files before they are called

// Kernel declaration
__global__ void kernel(float* a);

int main()
{
    dim3 gridSize, blockSize;
    …
    kernel<<<gridSize,blockSize>>>(a);
}

// Kernel definition
__global__ void kernel(float* a)
{
    …
}
CUDA Syntax - Kernels
Kernels have read-only built-in variables:
  gridDim: dimensions of the grid
    Uniform for all threads
  blockIdx: unique index of a block within grid
  blockDim: dimensions of the block
    Uniform for all threads
  threadIdx: unique index of the thread within the thread block
Cannot vary the size of blocks or grids during a kernel call
CUDA Syntax - Kernels
Built-in variables are typically used to compute unique thread identifiers
  Map local thread ID to a global array index

myKernel<<<3,5>>>(…)   gridDim.x = 3, blockDim.x = 5
blockIdx.x:                            0            1            2
threadIdx.x:                           0 1 2 3 4    0 1 2 3 4    0  1  2  3  4
blockIdx.x*blockDim.x + threadIdx.x:   0 1 2 3 4    5 6 7 8 9    10 11 12 13 14
CUDA Syntax – Index & Size Calculations
Global index calculation
  idx = blockIdx.x * blockDim.x + threadIdx.x
Grid size calculation
  GridSize = (Size + BlkDim - 1) / BlkDim   (integer division)
Where
  Size: Total size of the array
  BlkDim: Size of the block (max 1024)
  GridSize: Number of blocks in the grid
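The grid size formula rounds up so that a partially filled last block is still launched. A small host-side sketch shows the effect for a few array sizes; the helper name gridSizeFor and the block size of 256 are assumptions chosen for illustration, and the arithmetic assumes Size + BlkDim - 1 fits in an int.

```cuda
#include <cstdio>

// Hypothetical helper: number of blocks needed to cover `size` elements
// with blocks of `blkDim` threads (integer division, rounded up).
inline int gridSizeFor(int size, int blkDim)
{
    return (size + blkDim - 1) / blkDim;
}

int main()
{
    const int blkDim = 256;
    const int sizes[] = {1024, 1000, 1025};

    for (int i = 0; i < 3; ++i)
        printf("Size %d -> GridSize %d\n", sizes[i], gridSizeFor(sizes[i], blkDim));

    // prints:
    // Size 1024 -> GridSize 4
    // Size 1000 -> GridSize 4
    // Size 1025 -> GridSize 5
    return 0;
}
```

When Size is not a multiple of BlkDim, the last block contains threads whose idx is past the end of the array, so the kernel needs a bounds check before using idx.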
CUDA Syntax – Thread Identifiers
Result for each kernel launched with the following execution configuration:
  MyKernel<<<3,4>>>(a);

__global__ void MyKernel(int* a)
{
    int idx = blockIdx.x*blockDim.x+threadIdx.x;
    a[idx] = 7;
}
a: 7 7 7 7 7 7 7 7 7 7 7 7

__global__ void MyKernel(int* a)
{
    int idx = blockIdx.x*blockDim.x+threadIdx.x;
    a[idx] = blockIdx.x;
}
a: 0 0 0 0 1 1 1 1 2 2 2 2

__global__ void MyKernel(int* a)
{
    int idx = blockIdx.x*blockDim.x+threadIdx.x;
    a[idx] = threadIdx.x;
}
a: 0 1 2 3 0 1 2 3 0 1 2 3
CUDA Syntax - Kernels
All C operators are supported
  e.g. +, *, /, ^, >, >>
All functions from the standard math library
  e.g. sinf(), cosf(), ceilf(), fabsf()
Control flow statements too!
  e.g. if(), while(), for()
CUDA Kernel C++ Support
Supported
  Classes
    Including inheritance and virtual functions
    Need to add __device__ qualifiers to member functions!
  Templates
  C++11 features including auto and lambda functions
Not supported
  C++ Standard Library
  Run time type information (RTTI)
  Exception handling
  Classes with virtual functions are not binary compatible between host and device
User-defined Device Functions
Can write/call your own device functions
  Device functions cannot be called by host
Functions declared with both __device__ and __host__ will be compiled for both the CPU and GPU

__device__ float myDeviceFunction(int i)
{
    ...
}

__global__ void myKernel(float* a)
{
    int idx = blockIdx.x*blockDim.x+threadIdx.x;
    a[idx] = myDeviceFunction(idx);
}
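To illustrate the combined qualifiers, here is a small sketch in which one function is compiled for both CPU and GPU, called from a kernel, and then called again on the host to verify the device results. The names squarePlusOne and applyKernel are hypothetical, and error checking is reduced to the final copy for brevity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Compiled for both host and device, so the same code can be used in a
// kernel and in host-side verification.
__host__ __device__ inline float squarePlusOne(float x)
{
    return x * x + 1.0f;
}

__global__ void applyKernel(float* a, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        a[idx] = squarePlusOne(a[idx]);   // device-side call
}

int main()
{
    const int N = 8;
    float h[N], expected[N];
    for (int i = 0; i < N; ++i) {
        h[i] = (float)i;
        expected[i] = squarePlusOne(h[i]);   // host-side call
    }

    float* d = 0;
    cudaMalloc((void**)&d, N * sizeof(float));
    cudaMemcpy(d, h, N * sizeof(float), cudaMemcpyHostToDevice);
    applyKernel<<<1, N>>>(d, N);
    cudaError_t err = cudaMemcpy(h, d, N * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d);
    if (err != cudaSuccess) {
        printf("Error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Small integer inputs keep the arithmetic exact on both host and device.
    int mismatches = 0;
    for (int i = 0; i < N; ++i)
        if (h[i] != expected[i])
            ++mismatches;
    printf("mismatches: %d\n", mismatches);
    return mismatches;
}
```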
CUDA Syntax - Memory Management
Typically, host code manages device memory:
cudaMalloc(void** pointer, size_t nbytes)
cudaMemset(void* pointer, int value, size_t count)
cudaFree(void* pointer)
// Memory allocation example
int n = 1024;
int nBytes = n * sizeof(int);      // size of the allocation in bytes
int* a = 0;
cudaMalloc((void**)&a, nBytes);    // allocate device memory
cudaMemset(a, 0, nBytes);          // set every byte to zero
cudaFree(a);                       // release device memory
CUDA Syntax – Memory Spaces
Host and device have separate memory spaces
  For discrete GPUs data is moved between them via PCIe bus
Pointers are just addresses
  Can’t tell from the pointer value whether the address is on device or host
  Must exercise caution when dereferencing pointers
  Dereferencing a host pointer on the device will likely crash, and vice versa
CUDA Syntax – Data Transfers
Host code manages data transfers to and from the device:
  cudaMemcpy(void* dst, const void* src, size_t nbytes, enum cudaMemcpyKind direction);
Direction is one of:
  cudaMemcpyHostToDevice
  cudaMemcpyDeviceToHost
  cudaMemcpyDeviceToDevice
  cudaMemcpyHostToHost
  cudaMemcpyDefault
Blocking call - returns once copy is complete
Waits for all outstanding CUDA calls to complete before starting transfer
With cudaMemcpyDefault, runtime determines which way to copy data
CUDA Syntax - Synchronization
Kernel launches are asynchronous
  Control returns to CPU immediately
  Kernel starts executing once all outstanding CUDA calls are complete
cudaMemcpy() is synchronous
  Blocks until copy is complete
  Copy starts once all outstanding CUDA calls are complete
cudaDeviceSynchronize()
  Blocks until all outstanding CUDA calls are complete

cudaMemcpy(…,cudaMemcpyHostToDevice);
// Data is on the GPU at this point
MyKernel<<<…>>>(…);
// Kernel is launched but not necessarily complete
cudaMemcpy(…,cudaMemcpyDeviceToHost);
// CPU waits until kernel is complete and then transfers data
// Data is on the CPU at this point
CUDA Syntax – Error Management
Host code manages errors
Most CUDA function calls return cudaError_t
Enumeration type
cudaSuccess (value 0) indicates no errors
const char* cudaGetErrorString(cudaError_t err)
  Returns a string describing the error condition

cudaError_t err;
err = cudaMemcpy(…);
if (err != cudaSuccess)
    printf("Error: %s\n", cudaGetErrorString(err));
CUDA Syntax – Error Management
Kernel launches have no return value!
cudaError_t cudaGetLastError()
  Returns error code for last CUDA runtime function (including kernel launches)
  Resets global error state to cudaSuccess
  In case of multiple errors, only the last one is reported
For kernels:
  Asynchronous, must call cudaDeviceSynchronize() first, then cudaGetLastError()

MyKernel<<< ... >>> (...);
cudaDeviceSynchronize();
cudaError_t e = cudaGetLastError();
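The checks above tend to be wrapped in a small helper so that every call is tested the same way. The sketch below is one possible arrangement, not the presenters' code: checkCuda and setKernel are hypothetical names. It also shows a common variation on the pattern above, in which cudaGetLastError() is called right after the launch to catch configuration errors and the return value of cudaDeviceSynchronize() is checked for errors raised while the kernel ran.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: reports a failed CUDA call and returns false on error.
bool checkCuda(cudaError_t err, const char* what)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        return false;
    }
    return true;
}

__global__ void setKernel(int* a, int n, int value)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        a[idx] = value;
}

int main()
{
    const int N = 256;
    int* aD = 0;
    if (!checkCuda(cudaMalloc((void**)&aD, N * sizeof(int)), "cudaMalloc"))
        return 1;

    setKernel<<<1, N>>>(aD, N, 7);

    // Launch-time problems (e.g. an invalid execution configuration) are
    // visible immediately through cudaGetLastError().
    bool ok = checkCuda(cudaGetLastError(), "kernel launch");

    // Errors raised while the kernel runs are returned by the synchronize call.
    ok = checkCuda(cudaDeviceSynchronize(), "kernel execution") && ok;

    cudaFree(aD);
    return ok ? 0 : 1;
}
```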
Putting It All Together
This kernel assumes that the size of the array is evenly divisible by the block size. What happens if it is not?
[Figure: arrays A0..A7, B0..B7 and C0..C7 with the operation Cx = Ax + Bx, one thread per output element]

__global__
void VectorAddKernel(float* a,
                     float* b,
                     float* c)
{
    int idx = threadIdx.x
            + blockIdx.x * blockDim.x;
    c[idx] = a[idx] + b[idx];
}
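One common way to handle a size that is not a multiple of the block size is to pass the element count to the kernel, round the grid size up, and guard the body with a bounds check. If the grid size is instead computed as N / blockSize, the division truncates and the trailing elements are never computed. Rounding the grid up without a guard would make the extra threads access memory past the end of the arrays. The sketch below assumes hypothetical names (VectorAddGuarded, blocksNeeded) and N = 1000, and keeps error checking to the final copy for brevity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Variant of the vector add kernel with an explicit element count.
__global__ void VectorAddGuarded(const float* a, const float* b, float* c, int n)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < n)   // threads past the end of the arrays do nothing
        c[idx] = a[idx] + b[idx];
}

// Hypothetical helper: blocks needed to cover n elements, rounded up.
inline int blocksNeeded(int n, int blockSize)
{
    return (n + blockSize - 1) / blockSize;
}

int main()
{
    const int N = 1000;      // deliberately not a multiple of the block size
    const int BLOCK = 512;
    static float aH[N], bH[N], cH[N];
    for (int i = 0; i < N; ++i) {
        aH[i] = (float)i;
        bH[i] = 2.0f * (float)i;
    }

    float *aD = 0, *bD = 0, *cD = 0;
    size_t nBytes = N * sizeof(float);
    cudaMalloc((void**)&aD, nBytes);
    cudaMalloc((void**)&bD, nBytes);
    cudaMalloc((void**)&cD, nBytes);
    cudaMemcpy(aD, aH, nBytes, cudaMemcpyHostToDevice);
    cudaMemcpy(bD, bH, nBytes, cudaMemcpyHostToDevice);

    // 2 blocks of 512 threads cover 1000 elements; 24 threads are idle.
    VectorAddGuarded<<<blocksNeeded(N, BLOCK), BLOCK>>>(aD, bD, cD, N);

    cudaError_t err = cudaMemcpy(cH, cD, nBytes, cudaMemcpyDeviceToHost);
    cudaFree(aD);
    cudaFree(bD);
    cudaFree(cD);
    if (err != cudaSuccess) {
        printf("Error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Small integer-valued inputs keep the sums exact in single precision.
    int mismatches = 0;
    for (int i = 0; i < N; ++i)
        if (cH[i] != aH[i] + bH[i])
            ++mismatches;
    printf("mismatches: %d\n", mismatches);
    return mismatches;
}
```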
Vector Add – Host Code
void VectorAdd(float* aH, float* bH, float* cH, int N)
{
    float* aD, *bD, *cD;
    int N_BYTES = N * sizeof(float);
    dim3 blockSize, gridSize;

    // Allocate memory on GPU
    cudaMalloc((void**)&aD, N_BYTES);
    cudaMalloc((void**)&bD, N_BYTES);
    cudaMalloc((void**)&cD, N_BYTES);

    // Transfer input arrays to GPU
    cudaMemcpy(aD, aH, N_BYTES, cudaMemcpyHostToDevice);
    cudaMemcpy(bD, bH, N_BYTES, cudaMemcpyHostToDevice);

    // Launch kernel (this code assumes N is a multiple of 512)
    blockSize.x = 512;
    gridSize.x = N / blockSize.x;
    VectorAddKernel<<<gridSize, blockSize>>>(aD, bD, cD);

    // Transfer output array to CPU
    cudaMemcpy(cH, cD, N_BYTES, cudaMemcpyDeviceToHost);

    // Release device memory (not shown on the original slide)
    cudaFree(aD);
    cudaFree(bD);
    cudaFree(cD);
}
CUDA Syntax – Unified Memory
CUDA 6 introduced Unified Memory
Use ‘managed memory’ instead of explicitly declaring
memory for the host and device
cudaMallocManaged(void **devPtr, size_t size)
// Memory allocation example
int n = 1024;
int nBytes = n * sizeof(int);             // size of the allocation in bytes
int* a = 0;
cudaMallocManaged((void**)&a, nBytes);    // accessible from both host and device
cudaFree(a);
Putting It All Together… Again!
// VectorAddKernel takes float* arguments, so the arrays are float
float* a, *b, *c;
int N_BYTES = 2 * sizeof(float);

// Allocate managed memory
cudaMallocManaged((void**)&a, N_BYTES);
cudaMallocManaged((void**)&b, N_BYTES);
cudaMallocManaged((void**)&c, N_BYTES);

// Initialize memory from host
a[0] = 5; b[0] = 7;
a[1] = 3; b[1] = 4;

// Launch kernel and synchronize device
VectorAddKernel<<<1,2>>>(a, b, c);
cudaDeviceSynchronize();

// Access result from host
printf("%f %f\n", c[0], c[1]);
Summary
GPUs are data-parallel architectures
CUDA provides a heterogeneous compute model:
Host:
Memory management (usually)
Data transfers
Data-parallel kernel launches on device as a grid of thread blocks
Error management
Device (GPU):
Executes data-parallel kernels in threads
Implemented in C/C++ with a few important extensions, and a few restrictions
Need More CUDA?
Public Training:
August 30 – September 2 Calgary, Alberta
Private Training:
Onsite, anywhere in the world
Tailored to your specific needs
NEW Online Training: Access our courses from anywhere
August 30 – September 2
Mentoring & Services: Let us help develop & optimize your
applications for maximum performance
Email: [email protected]
Questions?
Acceleware Ltd.
Tel: +1 403.249.9099
Email: [email protected] [email protected]
CUDA Blog: http://acceleware.com/blog
Website: http://acceleware.com
Chris Mason [email protected]