
Programming with CUDA, WS09 Waqar Saleem, Jens Müller

Lecture 4, Tuesday, 3 November 2009

Programming with CUDA and Parallel Algorithms

Waqar Saleem, Jens Müller


Recap

• Grid and block dimensions

• CUDA extensions to C

• built-in variables

• vector types, variable and function qualifiers

• synchronization and timing

• atomic functions

• memory fence functions, volatile variables, math, warp vote, texture memory functions


Recap

• Grid and block dimensions

• CUDA extensions to C

• Memory management (runtime API)

• cudaMalloc, cudaFree, cudaMemcpy


Loose ends

• GPGPU before CUDA? OpenGL, DirectX

• CUDA thread creation and scheduling takes only a few cycles

• CPU threads can take up to thousands

• Device initialization at first device function call

• Caution: this causes a delay at the first call; a warm-up call can pay the cost upfront (see the sketch below)
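A minimal warm-up sketch (not from the slides; calling cudaSetDevice and a dummy cudaFree(0) is just one common way to force initialization early):

#include <cuda_runtime.h>

int main() {
    cudaSetDevice( 0 );   // select device 0 explicitly
    cudaFree( 0 );        // dummy runtime call: triggers device initialization now,
                          // so later kernel launches and timings are not delayed by it
    // ... launch kernels as usual
    return 0;
}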


Handling device memory

int main() {
    // allocate h_A, h_B, h_C, size N
    // assign values to host vectors
    // initialize device
    // allocate d_A, d_B, d_C, size N
    // copy h_A, h_B to d_A, d_B
    vAdd<<<1,N>>>( d_A, d_B, d_C );
    // copy d_C to h_C
    // output h_C
    // free host variables
    // free device variables
}

int main() {
    int N;
    // assign N
    size_t size = N * sizeof( int );
    int *h_A = (int*) malloc( size );
    int *h_B = (int*) malloc( size );
    int *h_C = (int*) malloc( size );
    // assign values to vectors
    int *d_A, *d_B, *d_C;
    cudaMalloc( (void**) &d_A, size );
    cudaMalloc( (void**) &d_B, size );
    cudaMalloc( (void**) &d_C, size );
    cudaMemcpy( d_A, h_A, size, cudaMemcpyHostToDevice );
    cudaMemcpy( d_B, h_B, size, cudaMemcpyHostToDevice );
    vAdd<<<1,N>>>( d_A, d_B, d_C );
    // ...
}
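The elided part might look like the sketch below; the vAdd kernel body and the use of one thread per element are assumptions based on the outline above:

__global__ void vAdd( int *A, int *B, int *C ) {
    int i = threadIdx.x;     // one thread per vector element
    C[i] = A[i] + B[i];
}

// ... continuing main() after the kernel launch:
cudaMemcpy( h_C, d_C, size, cudaMemcpyDeviceToHost );   // copy d_C to h_C
// output h_C
free( h_A ); free( h_B ); free( h_C );                  // free host variables
cudaFree( d_A ); cudaFree( d_B ); cudaFree( d_C );      // free device variables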


Today

• Thread divergence

• Compiling CUDA programs (intro)

• Thread/block allocation in a MP

• Optimizing memory access


Thread divergence

• Threads in a block diverge when they follow different execution paths

• Divergent threads are serialized

• This serialization slows device performance (see the example below)
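As an illustration (a sketch, not lecture code): a branch on the thread index splits every warp into two serialized groups, while a branch that is uniform across each warp does not diverge:

__global__ void divergent( float *data ) {
    // even and odd threads of the same warp take different paths,
    // so the warp executes both paths one after the other
    if ( threadIdx.x % 2 == 0 )
        data[threadIdx.x] *= 2.0f;
    else
        data[threadIdx.x] += 1.0f;
}

__global__ void uniformPerWarp( float *data ) {
    // all 32 threads of a warp evaluate this condition identically,
    // so no serialization occurs
    if ( ( threadIdx.x / 32 ) % 2 == 0 )
        data[threadIdx.x] *= 2.0f;
    else
        data[threadIdx.x] += 1.0f;
}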


[Figure: control flow A → branch B? → C (yes) or D (no) → E. With divergence, the warp executes C while the threads taking D wait, then D while the others wait, before re-converging at E.]


NVIDIA C Compiler (nvcc)

• Compile CUDA programs with nvcc

• Separates host and device code

• host code compiled by host compiler

• device code compiled further by nvcc

• many options: emulation mode, fast math, optimization level, ...


Transparent scalability

• Code written once can run on any kind of device

• Scaling (scheduling) is transparent to the user

• Imposes a lack of inter-block communication


SPMD / SIMT

• SPMD: Single Program Multiple Data

• SIMT: Single Instruction Multiple Thread

• SIMD exposes the device's vector width to the programmer

• CUDA kernels can be programmed for a device of any specification (see the sketch below)
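One way to write such a specification-independent kernel (a sketch, not lecture code) is to let every thread stride over the data by the total number of launched threads, so any grid/block configuration produces a correct result:

__global__ void scale( float *data, int n ) {
    int idx    = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    int stride = gridDim.x * blockDim.x;                   // total number of launched threads

    // each thread handles every stride-th element
    for ( int i = idx; i < n; i += stride )
        data[i] *= 2.0f;
}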


• GeForce 8800GTX

• up to 8 active blocks per MP

• up to 768 active threads

• 86.4 GB/s access to global memory

• peak performance of 367 GFLOPS

• warp size of 32


Warps

• Specific to hardware, not a CUDA concept

• Scheduling unit in a MP

• Max of 768 active threads = 24 active warps per MP

• Dedicated hardware to track IDs and execution status of threads in active warps

• Limits maximum number of active warps


Priority Queue of warps

• Active warps are queued and prioritized

• While a warp waits for the result of some high latency operation, another can start executing


Variables            Memory           Scope     Lifetime
Automatic arrays     global (local)   thread    thread
Automatic scalars    registers        thread    thread
__shared__           shared           block     block
__device__           global           grid(s)   application
__constant__         constant         grid(s)   application

Variables and memory

• Shared memory allocation is static

• Caution: each block gets its own private version of a __shared__ variable

• Constant memory resides in global memory

• cached, 64 KB (65,536 bytes)

• fast access, depending on access pattern
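A small sketch putting the qualifiers from the table side by side (the variable names are made up for illustration):

__constant__ float coeffs[16];   // constant memory: cached, application lifetime
__device__   float d_total;      // global memory: application lifetime, visible to all kernels

__global__ void example( float *in ) {
    __shared__ float tile[256];  // shared memory: one copy per block, block lifetime
    float x = in[threadIdx.x];   // automatic scalar: register, private to each thread
    float tmp[4];                // automatic array: placed in (local) global memory, per thread
    tmp[0] = x * coeffs[0];
    tile[threadIdx.x] = tmp[0] + d_total;
}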


Variables and memory

• Global memory offers no synchronization between blocks, so it is a poor channel for inter-block communication

• it is used to pass information between kernel launches (see the sketch below)
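For instance (a sketch, assuming a simple two-stage computation), one kernel writes its results to a global buffer and the next kernel launch reads them:

__device__ float partial[1024];              // global memory shared by both kernels

__global__ void stage1( const float *in ) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    partial[i] = in[i] * in[i];              // first kernel writes its results
}

__global__ void stage2( float *out ) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = partial[i] + 1.0f;              // second kernel reads them after the first has finished
}

// host: stage1<<< grid, block >>>( d_in );  stage2<<< grid, block >>>( d_out );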


Memory trade-off

          Speed   Size
Global    Slow    Large
Shared    Fast    Small

• Solution: partition data into small tiles that can fit in shared memory

• Kernel computation on tiles must be independent of each other

• Might require modification of the algorithm


Example: Matrix multiplication


Simple matrix multiplication kernel

// Pd, Md, Nd: global memory; width: shared memory

__global__ void matrixMulKernel( float *Md, float *Nd, float *Pd, int width )
{
    // row and column indices
    int bx = blockIdx.x, by = blockIdx.y;
    int tx = threadIdx.x, ty = threadIdx.y;
    int Row = by * TILE_WIDTH + ty;
    int Col = bx * TILE_WIDTH + tx;

    // compute Pd entry
    float Pvalue = 0;
    for ( int k = 0; k < width; ++k )
        Pvalue += Md[Row * width + k] * Nd[k * width + Col];

    // store computed entry
    Pd[Row * width + Col] = Pvalue;
}


Kernel performance

// Pd, Md, Nd: global memory; width: shared memory

__global__ void matrixMulKernel( float *Md, float *Nd, float *Pd, int width )
{
    // ...
    for ( int k = 0; k < width; ++k )
        Pvalue += Md[Row * width + k] * Nd[k * width + Col];
    // ...
}

• Each loop iteration makes 2 global memory accesses and performs 2 compute operations (one multiply, one add)

• Compute operations to Global Memory Access (CGMA) ratio = 1.0



• Global memory bandwidth of 86.4GB/s

• Each float fetched from global memory is 4 bytes, so 86.4 GB/s supplies at most 86.4/4 = 21.6 Gfloats/s; with a CGMA of 1.0 the kernel runs at no more than 21.6 GFLOPS

• The card has peak performance of 367 GFLOPS!


Spotting data parallelism

• Need to re-use data; example: 2x2 blocks for 4x4 matrices

         T0,0             T1,0             T0,1             T1,1
k = 0    Md0,0 * Nd0,0    Md0,0 * Nd1,0    Md0,1 * Nd0,0    Md0,1 * Nd1,0
k = 1    Md1,0 * Nd0,1    Md1,0 * Nd1,1    Md1,1 * Nd0,1    Md1,1 * Nd1,1
k = 2    Md2,0 * Nd0,2    Md2,0 * Nd1,2    Md2,1 * Nd0,2    Md2,1 * Nd1,2
k = 3    Md3,0 * Nd0,3    Md3,0 * Nd1,3    Md3,1 * Nd0,3    Md3,1 * Nd1,3


Spotting data parallelism (cont'd)

• 4 rows/columns, each is fetched twice (see the table above)

• For NxN block, each is fetched N times


Re-organizing memory access

• Load each row/column once into shared memory and re-use within the block

• Reduces global memory traffic by a factor of N

• Global memory access /= N, CGMA *= N

• Loaded rows, columns form a tile

• Tile size dictated by size of shared memory

• Simplest case, block size = tile size


Tiled kernel using shared memory

__global__ void matrixMulKernel( float *Md, float *Nd, float *Pd, int width )
{
    // allocate tiles in shared memory
    __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

    // row and column indices
    int bx = blockIdx.x, by = blockIdx.y;
    int tx = threadIdx.x, ty = threadIdx.y;
    int Row = by * TILE_WIDTH + ty, Col = bx * TILE_WIDTH + tx;

    // compute Pd entry tile by tile
    float Pvalue = 0;
    for ( int tileNum = 0; tileNum < width / TILE_WIDTH; ++tileNum ) {
        // collaborative loading of one Md tile and one Nd tile into shared memory
        Mds[ty][tx] = Md[Row * width + tileNum * TILE_WIDTH + tx];
        Nds[ty][tx] = Nd[(tileNum * TILE_WIDTH + ty) * width + Col];
        __syncthreads();   // wait until the whole tile is loaded

        for ( int k = 0; k < TILE_WIDTH; ++k )
            Pvalue += Mds[ty][k] * Nds[k][tx];
        __syncthreads();   // wait before the tile is overwritten in the next phase
    }

    // store computed entry
    Pd[Row * width + Col] = Pvalue;
}
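The host-side launch for this kernel might look like the following sketch (it assumes Md, Nd, Pd are already allocated on the device and that width is a multiple of TILE_WIDTH, as the kernel above does):

#define TILE_WIDTH 16

// one thread per output element, one block per TILE_WIDTH x TILE_WIDTH output tile
dim3 dimBlock( TILE_WIDTH, TILE_WIDTH );
dim3 dimGrid( width / TILE_WIDTH, width / TILE_WIDTH );
matrixMulKernel<<< dimGrid, dimBlock >>>( Md, Nd, Pd, width );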


[Figure sequence: threads T0,0, T1,0, T0,1 and T1,1 of block B0,0, each responsible for one element of the output tile.]

[Figure sequence: block B0,0 working through the 2x2 example; in phase tileNum = 0 it loads and uses the first tiles of Md and Nd, in phase tileNum = 1 the second tiles.]

[Figure: the Md and Nd tiles loaded by blocks B0,0, B1,0, B0,1 and B1,1 in phases tile 0 and tile 1; some tiles are loaded by more than one block.]


(The tile index is sometimes called m; in the kernel above it is tileNum.)


Performance gain

• Theoretical gain for 16x16 tiles = 16

• (86.4/4) * 16 = 345.6 GFLOPS, close to the card's 367 GFLOPS peak


Memory limitations on Parallelism

• MP resources are split between active blocks

• Imposes limits on number of active blocks

• 8K registers for a max of 768 threads

• 8K / 768 ≈ 10 registers per thread

• If a kernel uses more than 10 registers, the number of blocks to be processed by the MP is reduced to fit the registers

• GeForce 8800GTX has 16K shared memory per MP for a max of 8 active blocks

• ~2K shared memory per block

• ~16x16 tiles in matrix multiplication are optimal: two 16x16 float tiles use 2 * 16 * 16 * 4 bytes = 2K per block

• If a block uses more than ~2K shared memory, the number of blocks to be processed by the MP is reduced to fit shared memory
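These per-device limits can also be queried at run time instead of being hard-coded (a sketch using the standard runtime API; the fields shown are members of cudaDeviceProp):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties( &prop, 0 );   // properties of device 0

    printf( "registers per block: %d\n", prop.regsPerBlock );
    printf( "shared memory per block: %zu bytes\n", prop.sharedMemPerBlock );
    printf( "warp size: %d\n", prop.warpSize );
    printf( "max threads per block: %d\n", prop.maxThreadsPerBlock );
    return 0;
}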


Next time

• CUDA texture memory

• CUDA runtime and driver APIs

• Streams


See you next time!
