
Alastair Nottingham

Security and Networks Research Group

CUDA CRASH-COURSE

Lecture Overview

Introduction To GPGPU

CUDA Architecture

Programming with CUDA

Maximising Performance

Results

Memory Model

INTRODUCTION TO GPGPU


Introduction

What is a GPU?
A GPU is essentially a massively parallel coprocessor, primarily utilised for real-time rendering of complex graphical scenes.

What is CUDA?
Compute Unified Device Architecture: a highly parallel architecture for General-Purpose computing on a GPU (GPGPU).

Benefits:
• Commodity hardware.
• Negligible CPU overhead.
• Astoundingly fast, when done right.

What should I use it for?
Highly parallelisable processing of very large datasets.

What shouldn’t I use it for?
• Sequential processing tasks (obviously).
• Context-sensitive data-set processing.
• Small data-sets.


Brief History of GPGPU

• Pre-1999 – Early graphics cards.

• 1999 – Geforce256, the first GPU.

• 2001 – nfiniteFX and custom programmable shaders.

• 2003 – Stanford Brook, the first GPGPU language.

• 2006 – DirectX10 and the Unified Shading Model.

• 2007 – NVIDIA CUDA and AMD (Fire)Stream released.

• 2009 – OpenCL & DirectCompute.

• 2011 – AMD Accelerated Parallel Processing (APP) SDK.


CPU vs. GPU (Performance)

[Chart: peak GFLOP/s from Jan 2003 to Jun 2008. NVIDIA GPUs (NV30, NV35, NV40, G70, G71, G80, G80 Ultra, G92, GT200) pull steadily ahead of Intel CPUs (Northwood, Prescott EE, Woodcrest, 3.0 GHz Core2 Duo, 3.2 GHz Harpertown).]


Currently….

[Chart: peak GFLOP/s for the Core i7 980 XE (107.6), GeForce GTX 590 (2488.3) and Radeon HD 6990 (5952).]


CPU vs. GPU (Architecture)

[Diagram: CPU vs. GPU die layout. The CPU devotes large areas to control logic and cache alongside a few ALUs; the GPU consists almost entirely of ALUs.]

The GPU devotes more transistors to data processing

For simplicity, discussion will focus on GTX 280 architecture.

CUDA ARCHITECTURE


GTX 280 Hardware Model (abstract)

[Diagram: a CUDA device with 1 GB of global memory containing Blocks 1 to N.]


CUDA Programming Model (Abstract)


A CUDA Device executes a Grid containing N Blocks.
Each Block has a Shared Memory store, and executes M threads.
Each Thread has a local memory store (registers).
Device memory is accessible to all threads in all blocks, and is divided into Global and Constant Memory.


How Thread Blocks Map to Multiprocessors

[Diagram: a GTX 280 multiprocessor (1 of 30 on the GTX 280) supports a maximum of 1024 resident threads. Four 256-thread blocks, or two 512-thread blocks (the maximum block size), give full occupancy; five 192-thread blocks leave 64 threads of unused capacity, giving partial occupancy.]

Occupancy is also affected by:
• Per-block Shared Memory requirements
• Per-block Register requirements

Data-dependent conditional branching within a warp is handled by executing both the if and else branches sequentially.
• Threads which do not match the condition sleep.
• If no threads match a condition, the associated branch is not executed at all.

Warp size is set in hardware, and is unlikely to change in successive GPU generations.


Thread Warps

CUDA enabled GPUs implement a SIMT (Single Instruction, Multiple Thread) architecture, with zero thread scheduling overhead.

While essentially SPMD (Single Program, Multiple Data), threads are executed in SIMD batches called Warps.

Each Warp contains 32 threads.

All threads in a warp execute the same instructions in unison.

[Diagram: thread divergence in different warps (e.g. Thread 0 and Thread 32) versus divergence within the same warp (e.g. Thread 0 and Thread 1 taking the if and else branches).]


Why Warps?

Each processing core can issue an instruction to 4 threads in the time it takes the instruction register to update.

This corresponds to the 32 threads of a warp (8 cores × 4 threads) sharing a single instruction. The instruction register on Fermi cards is twice as fast: a GTX 400 series multiprocessor has 32 cores and 2 instruction registers, while a GTX 500 series multiprocessor has 48 cores and 3 instruction registers.

Warps are differentiated by block-level thread index:
• Threads 00–31 are in the first warp.
• Threads 32–63 are in the second warp.
• Threads 64–95 are in the third warp, etc.

[Diagram: a GTX 280 multiprocessor: 8 processing cores sharing an instruction register, a register file, shared memory, and a texture cache.]


Thread Synchronisation

Thread Synchronisation provides a mechanism for merging divergent thread paths in a block, thus reducing divergence.

if (threadIdx.x > 7)        //if thread index is greater than 7
{
    ...Important Stuff...   //warp divergence
}
__syncthreads();            //no more warp divergence

All threads must wait at __syncthreads() until all other threads in the block reach the directive. Only then may threads recommence.

Must be used with caution: if some threads can't reach the directive, the kernel will never complete.

if (threadIdx.x > 7)        //if thread index is greater than 7
{
    ...Important Stuff...
    __syncthreads();        //FAIL
}

MEMORY MODEL


Type              Access         Resides In                      Size
Registers         Thread local   On-chip register file           16 384 32-bit registers
Shared Memory     Block local    Multiprocessor                  16 KB
Constant Memory   Global         Device DRAM                     64 KB (16 KB cached)
Global Memory     Global         Device DRAM                     1024 MB
Texture Memory    Global         Multiprocessor & Device DRAM    512 MB + 64 KB cache


Global Memory

Global Memory is the largest memory region on the device, with 1024 MB of GDDR3 memory. While plentiful, it is at least two orders of magnitude slower than on-chip memory.

Access Speed: 200 – 1000+ clock cycles

Exceptions:

Global memory performance can be improved through Coalescing, or by leveraging the texture cache.


Coalescing Global Memory
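As an illustrative sketch (not from the slides), coalescing on the GTX 280 requires the threads of a half-warp to read or write a small, contiguous, aligned region of global memory. The kernel names below are hypothetical:

__global__ void copy_coalesced(float * in, float * out)
{
    //consecutive threads touch consecutive addresses, so each half-warp
    //reads and writes one contiguous, aligned segment (coalesced)
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    out[i] = in[i];
}

__global__ void copy_strided(float * in, float * out, int stride)
{
    //consecutive threads touch addresses 'stride' elements apart, so a
    //half-warp spans many segments and the accesses are serviced separately
    int i = (blockDim.x * blockIdx.x + threadIdx.x) * stride;
    out[i] = in[i];
}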


Texture Memory

To avoid the headaches of coalescing, you can instead abuse the 64KB texture cache available on each multiprocessor, by binding regions of Global Memory to a Texture Reference.

Memory accessed using a Texture Reference reads roughly as fast as fully coalesced Global Memory.

Texture Reference Limitations:

1. READ ONLY.

2. Memory must be bound to texture references in host-side code, prior to kernel execution.

3. Only supports 32-bit and 64-bit ints, floats, doubles, and vectors of these types.

4. On GTX 280, texture memory is mysteriously limited to 512 MB...
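As a rough sketch (not from the slides) of the legacy texture-reference workflow, with illustrative names:

texture<int, 1, cudaReadModeElementType> texRef;   //file-scope texture reference

__global__ void sum_pairs(int * out)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    //reads are served through the texture cache rather than plain global loads
    out[i] = tex1Dfetch(texRef, 2 * i) + tex1Dfetch(texRef, 2 * i + 1);
}

//host side: bind a region of global memory to the reference before launching
//cudaBindTexture(0, texRef, data_d, size_in_bytes);
//sum_pairs<<<blocks, threads>>>(out_d);
//cudaUnbindTexture(texRef);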


Registers

Each Multiprocessor contains 16 384 registers, which reside in an on-chip register file.

These registers are divided equally between each of the blocks running on the multiprocessor.

• Blocks of 512 threads get 8192 registers each.
• Blocks of 256 threads get 4096 registers each.

Access Speed: 0 clock cycles

Exceptions:

Read-After-Write Dependencies:
• 24 clock cycles if thread count is less than 192.
• 0 otherwise.

Register Bank Conflicts:
• Register allocation is handled internally, so conflicts cannot be explicitly avoided.
• Can be minimised by ensuring a multiple of 64 threads per block.

Register Pressure:
• Occurs when registers are over-allocated, and values are pushed into global memory.
• Extremely expensive (200+ clock cycles).


Shared Memory

16KB of Shared memory resides on each multiprocessor, acting as an explicit cache. This memory is shared between all executing blocks.

• Blocks of 512 threads get 8KB each.
• Blocks of 256 threads get 4KB each.

Values stored in shared memory are accessible by any other thread in the executing block. This allows threads in the same block to communicate with one another, while also providing fast temporary storage.

Access Speed: 1 clock cycle

Exceptions:

Memory Bank Conflicts:
• If N threads in a half-warp try to access the same memory bank, their requests must be serialised into N separate requests.

[Diagram: shared memory indexes 0 to 15 mapped onto the memory banks.]
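A small sketch (not from the slides) of conflict-free versus conflicting access patterns, assuming the GTX 280's 16 banks of 32-bit words:

__shared__ int buffer[256];

//conflict-free: the 16 threads of a half-warp hit 16 different banks
int a = buffer[threadIdx.x];

//2-way bank conflict: threads 0 and 8 (1 and 9, ...) of a half-warp hit the
//same bank, so the access is serialised into two requests
int b = buffer[2 * threadIdx.x];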


Constant Memory

CUDA devices have 64KB of constant memory, with 16KB of cache. This region can be used to store program instructions, pointers to global memory, and kernel arguments (which would otherwise consume shared memory).

Access Speed on Cache Hit: Fast as a Register (0 – 24 clock cycles)

Access Speed on Cache Miss: Slow as Global Memory (200 – 1000+ clock cycles).

Exceptions:

On a cache hit, constant memory is as fast as a register ONLY if all threads within the half-warp access the same cached value. If threads in the half-warp request different cached values, then the requests are serialised.
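A minimal sketch (not from the slides) of declaring and filling constant memory; the names are illustrative:

__constant__ int lookup[256];                //lives in the 64KB constant region

__global__ void add_offset(int * in, int * out)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    //register-speed on a cache hit when every thread of the half-warp reads
    //the same entry; reads of different entries are serialised
    out[i] = in[i] + lookup[0];
}

//host side, before launching the kernel:
//cudaMemcpyToSymbol(lookup, host_table, 256 * sizeof(int));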

PROGRAMMING WITH CUDA

Using the CUDA Runtime API

Typical Program Structure (Host Code)

1. Includes: #include "cuda.h", #include "cuda_runtime_api.h"

2. Allocate memory for host and device
• cudaMalloc, cudaMallocHost, cudaMallocPitch, etc.

3. Fill host memory

4. Copy host memory to device memory
• cudaMemcpy, cudaMemcpyHostToDevice

5. Execute kernel
• myKernel<<<GRID_DIM, BLOCK_DIM [, SHARED_MEM]>>>(args);

6. Copy device memory to host
• cudaMemcpy, cudaMemcpyDeviceToHost

7. Free allocated memory
• cudaFree


Copying Data to and from a Device

The CUDA Runtime API provides functions for allocating Device Memory, and copying data to and from the device.

int * data_h, * data_d;

cudaMallocHost((void **) &data_h, 256 * sizeof(int));   //== data_h = (int *) malloc(256 * sizeof(int));
cudaMalloc((void **) &data_d, 256 * sizeof(int));

for (int k = 0; k < 256; ++k) data_h[k] = k;            //fill host array

// (dest, source, size, type)
cudaMemcpy(data_d, data_h, 256 * sizeof(int), cudaMemcpyHostToDevice);

/* - run the kernel (asynchronous)
   - do any host-side processing */

cudaThreadSynchronize();                                //ensure all threads have completed

cudaMemcpy(data_h, data_d, 256 * sizeof(int), cudaMemcpyDeviceToHost);

cudaFreeHost(data_h);                                   //same as free(data_h)
cudaFree(data_d);


Executing a Kernel

Only CUDA Kernels may be called from the Host.

Kernels need to be informed of Grid and Block dimensions (and optionally Shared Memory size) when called.

These are passed to the kernel at runtime. This allows kernels to be optimised for a variety of datasets.

dim3 BlockDim(256);                          //y and z components set to 1
dim3 GridDim(data_size / BlockDim.x);        //assume data_size is a multiple of 256
int Shared = BlockDim.x * 4 * sizeof(int);   //each thread uses 4 ints of shared storage

arbitraryKernel<<<GridDim, BlockDim, Shared>>>(someData);

Note that 1D grid and block dimensions may be integers, but dim3 is required for higher dimensions.
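As a hedged sketch (not from the slides) of a higher-dimensional launch, assuming a hypothetical process_image kernel and WIDTH and HEIGHT that are multiples of 16:

dim3 BlockDim2D(16, 16);                     //256 threads per block
dim3 GridDim2D(WIDTH / 16, HEIGHT / 16);     //one block per 16x16 tile

process_image<<<GridDim2D, BlockDim2D>>>(image_d);

//inside the kernel, each thread locates its pixel with:
//int x = blockIdx.x * blockDim.x + threadIdx.x;
//int y = blockIdx.y * blockDim.y + threadIdx.y;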


Kernel Orienteering

[Diagram: a grid of three blocks (blockIdx.x = 0, 1, 2; gridDim.x = 3), each with four threads (threadIdx.x = 0–3; blockDim.x = 4) and its own Shared Memory.]

Max Threads / Block: 512
Max Shared Memory / Block: 16 384 bytes
Max Blocks / Grid: 65 536 per dimension (> 281 trillion)

Global Thread ID = blockDim.x * blockIdx.x + threadIdx.x

e.g. thread 2 of block 1: 6 = 4 * 1 + 2

NOTE: Block execution order is not guaranteed. Ensure block independence.

Example 1 – Increasing 1024 integers by 1.

__global__ void add_one(int * data_in, int * data_out)
{
    //find the index of the thread
    int thread = blockDim.x * blockIdx.x + threadIdx.x;

    //read in data and increment
    int tmp = data_in[thread] + 1;

    //copy out data
    data_out[thread] = tmp;
}

add_one<<< 4, 256 >>>(data_in_dev, data_out_dev);   //on host; the arguments are device pointers

Declaring Shared Memory

Shared Memory is typically statically allocated within a kernel, using the __shared__ qualifier.

When statically allocated, multiple shared variables and arrays may be declared.

__shared__ char temp1[512];
__shared__ int temp2[SHARED_SIZE];

Shared Memory may also be dynamically allocated using the extern qualifier.

All dynamically allocated shared memory starts at the same memory offset, so layout must be explicitly managed in code.

extern __shared__ char array[];

short * array0 = (short *) array;
float * array1 = (float *) &array0[128];
int   * array2 = (int *)   &array1[64];
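The total dynamic allocation is supplied as the third launch parameter. A hedged sketch for the layout above, assuming 32 ints in the final array and a hypothetical someKernel:

size_t dynamic_size = 128 * sizeof(short) + 64 * sizeof(float) + 32 * sizeof(int);
someKernel<<<GridDim, BlockDim, dynamic_size>>>(args);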

Using Shared Memory

Once declared, shared memory can be treated as a normal array.

Because shared memory is shared by all threads in a block, it is important for threads to orientate themselves, such that they read and write to the correct elements of the array.

//each thread deals with 1 piece of data
temp1[threadIdx.x] = device_array[blockIdx.x * blockDim.x + threadIdx.x];

//each thread deals with 16 pieces of data
for (int k = 0; k < 16; ++k)
    temp2[16 * threadIdx.x + k] = device_array[16 * (blockIdx.x * blockDim.x + threadIdx.x) + k];

if (temp3[threadIdx.x] == 5) /* do something */

When reading and writing shared memory that other threads in the block also operate on, use __syncthreads() and __threadfence_block() to protect data integrity.

Atomic functions are also helpful.
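A small sketch (not from the slides) of a shared-memory atomic, reusing temp1 from above to tally matching elements within a block:

__shared__ int hits;

if (threadIdx.x == 0)
    hits = 0;                      //one thread initialises the counter
__syncthreads();

if (temp1[threadIdx.x] == 5)
    atomicAdd(&hits, 1);           //serialised, but race-free, updates
__syncthreads();                   //hits is now safe for every thread to read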

Tools: Occupancy Calculator and Visual Profiler

Occupancy Calculator:

• Excel Spreadsheet for CUDA algorithm design.

• Given Threads, Registers and Shared Memory usage per block, calculates the associated performance implications.

• Allows you to maximise GPU utilization within your kernel.

Visual Profiler:

• Visualizes Kernels, to help determine bottlenecks, defects and areas of general low performance.

• Produces graphs and tables.

• Useful for diagnosing poor performance.

MAXIMISING PERFORMANCE

Speed Bumps

Memory Access Latency: Due to the wide range of explicit criteria for optimal performance, poorly crafted kernels can suffer significant penalties.

Thread Divergence: Because warp-level thread divergence is essentially serialised by the instruction register, decisional logic should be eliminated where possible.

Operator Performance: Certain arithmetic operators perform relatively poorly, and should be avoided.

Host-Device Transfer Overhead: PCI-E 16x can transfer at up to 16 GB/s, which, in the grand scheme of things, is quite slow.

Example: Array Reversal

__global__ void reverse_array(int * array_in, int * array_out)      //naive kernel
{
    int curr_index = blockDim.x * blockIdx.x + threadIdx.x;
    array_out[curr_index] = array_in[gridDim.x * blockDim.x - 1 - curr_index];
}

__global__ void reverse_array_ex(int * array_in, int * array_out)   //less naive kernel
{
    __shared__ int tmp[BLOCK_DIM];

    tmp[blockDim.x - 1 - threadIdx.x] = array_in[blockDim.x * blockIdx.x + threadIdx.x];

    __syncthreads();

    array_out[blockDim.x * (gridDim.x - 1 - blockIdx.x) + threadIdx.x] = tmp[threadIdx.x];
}

[Diagram: array_in (indexes 0 to 11) is read block-by-block into tmp in reverse order, then each block's tmp is written to the mirrored block position in array_out, producing the fully reversed array.]

The second kernel is the better one: it performs the reversal in shared memory, so both the global reads and the global writes proceed in forward order within each block and can coalesce.

Summing Elements in an Array (Conceptual)

Index:      0    1    2    3    4    5    6    7
Initial:   15    6   32   11    2   27   19    9
Step 1:    21    0   43    0   29    0   28    0
Step 2:    64    0    0    0   57    0    0    0
Step 3:   121    0    0    0    0    0    0    0

A Better Solution

Index:      0    1    2    3    4    5    6    7
Initial:   15    6   32   11    2   27   19    9
Step 1:    21   43   29   28    0    0    0    0
Step 2:    64   57    0    0    0    0    0    0
Step 3:   121    0    0    0    0    0    0    0

Minimises Divergence in Thread Warps

Performance difference

With 256 threads in a block, both solutions take 9 iterations to sum 512 elements. Total active warps differ significantly, however.

Iteration   Reduction   Solution 1 active warps   Solution 2 active warps
    1       512 – 256              8                         8
    2       256 – 128              8                         4
    3       128 – 64               8                         2
    4        64 – 32               8                         1
    5        32 – 16               8                         1
    6        16 – 8                8                         1
    7         8 – 4                4                         1
    8         4 – 2                2                         1
    9         2 – 1                1                         1
  Total                           55                        20
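A hedged kernel sketch of the divergence-minimising reduction (Solution 2), assuming 256-thread blocks with each block summing 512 elements and writing one partial sum:

__global__ void reduce_sum(int * data_in, int * data_out)
{
    __shared__ int tmp[256];                        //one element per thread

    int tid = threadIdx.x;
    int base = 2 * blockDim.x * blockIdx.x;
    tmp[tid] = data_in[base + tid] + data_in[base + tid + blockDim.x];   //iteration 1: 512 -> 256
    __syncthreads();

    //halve the number of active threads each iteration; active threads stay
    //packed into the lowest warps, so whole warps retire early
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1)
    {
        if (tid < stride)
            tmp[tid] += tmp[tid + stride];
        __syncthreads();
    }

    if (tid == 0)
        data_out[blockIdx.x] = tmp[0];              //one partial sum per block
}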

Nested Loop Unroll-and-Jam

1. Initial Nested Loop:

for (int j = 0; j < M; j++) {
    for (int k = 0; k < N; k++) {
        foo(j, k);
    }
}

2. Partially Unrolled Outer Loop:

for (int j = 0; j < M; j += 4) {
    for (int k = 0; k < N; k++) {
        foo(j, k);
    }
    for (int k = 0; k < N; k++) {
        foo(j + 1, k);
    }
    for (int k = 0; k < N; k++) {
        foo(j + 2, k);
    }
    for (int k = 0; k < N; k++) {
        foo(j + 3, k);
    }
}

3. Unroll-and-Jam:

for (int j = 0; j < M; j += 4) {
    for (int k = 0; k < N; k++) {
        foo(j, k);
        foo(j + 1, k);
        foo(j + 2, k);
        foo(j + 3, k);
    }
}

Optimising Host-to-Device Transfer

CUDA supports two basic memory types:
1. Pageable Memory (8 GB/s)
2. Pagelocked Memory (16 GB/s)

Pagelocked memory transfers faster than pageable memory, and supports several optimisations. It is however a scarce resource, and thus overuse can degrade system performance.

Write-Combined Memory – Transfers faster over PCI-E, and frees up L1 and L2 cache resources for the rest of the program to use. It should not be read from by the host.

Mapped Memory – Map a region of host memory onto the GPU. Eliminates all transfer overhead on integrated devices, but is very slow on discrete devices (only 4 GB/s).

Asynchronous Streams – Typically, data transfer and kernel execution happen synchronously (separately). Through streams, data transfer and kernel execution can occur asynchronously (concurrently), dramatically improving the overall speed of the program. Buffers are advised.

Asynchronous Execution
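A hedged sketch (not from the slides) of splitting work across two streams; it assumes host was allocated with cudaMallocHost, dev with cudaMalloc, and a hypothetical myKernel running 256-thread blocks:

cudaStream_t streams[2];
for (int i = 0; i < 2; ++i)
    cudaStreamCreate(&streams[i]);

for (int i = 0; i < 2; ++i)
{
    int offset = i * chunk;                    //chunk = half the elements
    cudaMemcpyAsync(dev + offset, host + offset, chunk * sizeof(int),
                    cudaMemcpyHostToDevice, streams[i]);
    myKernel<<<chunk / 256, 256, 0, streams[i]>>>(dev + offset);
    cudaMemcpyAsync(host + offset, dev + offset, chunk * sizeof(int),
                    cudaMemcpyDeviceToHost, streams[i]);
}

cudaThreadSynchronize();                       //wait for both streams to finish

for (int i = 0; i < 2; ++i)
    cudaStreamDestroy(streams[i]);

On hardware with a copy engine, the transfers in one stream can overlap kernel execution in the other.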

Integer Operator Performance

Most arithmetic integer operators perform well in CUDA (between 2 and 8 operations per clock cycle). However, integer division and modulo operations are extremely expensive, taking up to 10x longer to compute.

This expense can be avoided when the divisor is a power of two, by using bitwise operations instead. These are particularly fast for 32-bit integers.

a / b == a >> log2(b)
a % b == a & (b - 1)
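For example, with a power-of-two divisor of 32 (a sketch, not from the slides):

unsigned int q = a >> 5;      //a / 32, since log2(32) == 5
unsigned int r = a & 31;      //a % 32, since 32 - 1 == 31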

The performance of the operators and functions available to kernels is provided in the CUDA Programming Guide.


OUR RESULTS


Our Results (1): Comparative Packet Classification Speeds


Our Results (2): Classification Throughput vs. Packet Count

QUESTIONS?