GPU Programming
David Monismith
Based on Notes from the Udacity Parallel Programming (cs344) Course


Recap

• A grid of blocks is 1, 2, or 3D
• A block of threads is 1, 2, or 3D

• arrayMult<<<1,64>>> == arrayMult<<<dim3(1,1,1),dim3(64,1,1)>>>

• arrayMult<<<dim3(bx,by,bz), dim3(tx,ty,tz), shmem>>>(...)

• shmem is the amount of shared memory allocated per block, in bytes
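
A minimal sketch of a full launch configuration, assuming a hypothetical arrayMult kernel; the array size and launch geometry are chosen only for illustration:

#include <cuda_runtime.h>

// Hypothetical kernel matching the arrayMult launches above: each thread
// multiplies one element of a by the matching element of b.
__global__ void arrayMult(const float *a, const float *b, float *c) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    c[i] = a[i] * b[i];
}

int main() {
    const int n = 128;
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, n * sizeof(float));
    cudaMalloc(&d_b, n * sizeof(float));
    cudaMalloc(&d_c, n * sizeof(float));
    // Inputs are left uninitialized; this sketch only shows the launch syntax.

    // 2 blocks of 64 threads each, with 64*sizeof(float) bytes of dynamic
    // shared memory per block (reserved but unused here).
    arrayMult<<<dim3(2,1,1), dim3(64,1,1), 64 * sizeof(float)>>>(d_a, d_b, d_c);
    cudaDeviceSynchronize();

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}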

CUDA Communication

• Discussed threads solving a problem by working together

• CUDA communication takes place through memory
  – Read from the same input location
  – Write to the same output location
  – Exchange partial results

Communication Patterns

• How to map tasks (threads) and their memory together.

• We have seen map.
  – Perform the same task on each piece of data, e.g., in an array
  – In CUDA, have one thread do each task

• Many operations cannot be accomplished with map.
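
A minimal sketch of the map pattern in CUDA (the kernel name and launch are assumptions); one thread performs the same task on one array element:

// Map: thread i squares element i. Launch with one thread per element,
// e.g. squareMap<<<numBlocks, threadsPerBlock>>>(d_out, d_in, n).
__global__ void squareMap(float *out, const float *in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * in[i];   // same independent task on each element
}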

More Patterns

• Gather – read from many array locations and combine them into a result at one location.

• Scatter – write/update many results from information in one array location.
  – Note that an output location may be updated by more than one thread.

• Stencil – compute a result using a fixed neighborhood in an array.
  – Examples include von Neumann and Moore neighborhoods.
• Examples of each of these patterns will be drawn on the board.
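
For the stencil pattern, a minimal 1D sketch (the kernel name and 3-element neighborhood are assumptions, not from the slides):

// 1D stencil: each output element is the average of a fixed neighborhood
// of the input. Boundary elements are simply copied.
__global__ void stencil3(float *out, const float *in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (i == 0 || i == n - 1)
        out[i] = in[i];                                  // no full neighborhood at the edges
    else
        out[i] = (in[i-1] + in[i] + in[i+1]) / 3.0f;     // read the fixed neighborhood
}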

An OpenCV Pixel

struct uchar4 {
    unsigned char x;  // Red
    unsigned char y;  // Green
    unsigned char z;  // Blue
    unsigned char w;  // Alpha
};

To convert to grayscale (Red is x, Green is y, Blue is z):

I = .299f*Red + .587f*Green + .114f*Blue

Converting To Grayscale

• A grayscale conversion could be viewed as either a stencil or gather operation.

• grayscaleImg[i] = img[i].x + img[i].y + img[i].z;

• How could this be implemented in CUDA?
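
One possible implementation is sketched below, using the uchar4 pixel layout and the weighted formula from the previous slide; the kernel name and one-thread-per-pixel launch are assumptions:

// Each thread gathers the three color channels of its pixel and writes one
// grayscale value. Launch with one thread per pixel.
__global__ void rgbaToGrayscale(const uchar4 *img, unsigned char *grayscaleImg, int numPixels) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < numPixels) {
        uchar4 p = img[i];   // x = Red, y = Green, z = Blue, w = Alpha
        grayscaleImg[i] = (unsigned char)(.299f * p.x + .587f * p.y + .114f * p.z);
    }
}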

Transpose

• Often, it is necessary to transpose data such as an image or a matrix.

• In CUDA, this information is often stored in a 1D array.

• This is often useful when transforming an array of structures into a structure of arrays.
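
A minimal transpose sketch for a matrix stored in a 1D, row-major array; the names and 2D launch geometry are assumptions:

// Element (r, c) of the input becomes element (c, r) of the output.
// Launch with a 2D grid covering the matrix, e.g.
// transpose<<<dim3(cols/16, rows/16), dim3(16,16)>>>(d_out, d_in, rows, cols).
__global__ void transpose(float *out, const float *in, int rows, int cols) {
    int c = blockIdx.x * blockDim.x + threadIdx.x;   // column
    int r = blockIdx.y * blockDim.y + threadIdx.y;   // row
    if (r < rows && c < cols)
        out[c * rows + r] = in[r * cols + c];
}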

Recap of Communication

• Map and Transpose – one to one
• Gather – many to one
• Scatter – one to many
• Stencil – specialized many to one
• Reduce – all to one
• Scan and Sort – all to all

Programming Model and GPU HW

• Divide computations into kernels
  – Kernels are C/C++ functions
  – Each thread runs the kernel function
  – Different threads may take different paths

• Groups of threads are called thread blocks
  – These threads work together to solve a particular problem or subproblem
  – The GPU is responsible for assigning/allocating blocks to Streaming Multiprocessors

GPU Hardware

• GPUs may contain one or more Streaming Multiprocessors (SMs).

• SMs each have their own memory and their own simple processors (i.e., CUDA cores)
  – The CUDA cores in an SM may map to one or more thread blocks.

• The GPU is responsible for allocating blocks to SMs.

• SMs run in parallel and independently.
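
The number of SMs can be queried at run time; a small sketch using the CUDA runtime API (device 0 is an assumption):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of device 0
    printf("%s: %d SMs, up to %d threads per block\n",
           prop.name, prop.multiProcessorCount, prop.maxThreadsPerBlock);
    return 0;
}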

CUDA Guarantees and Advantages

• Advantages of the CUDA paradigm
  – Flexibility
  – Scalability
  – Efficiency

• Disadvantages
  – No communication between blocks.
  – No guarantees about where thread blocks will run.

• CUDA guarantees
  – All threads in a block run on the same SM at the same time.
  – All blocks in one kernel must finish before any block from the next kernel starts.

CUDA Compilation

• For Stampede
  – module load cuda
  – nvcc -arch=compute_30 -code=sm_30 <sourcefile> <other options> -o <executable>
  – #SBATCH -p gpudev
• For LittleFe2
  – nvcc -arch=compute_21 -code=sm_21 <sourcefile> <other options> -o <executable>

CUDA Memory Access

• Consequences of the CUDA paradigm
  – There can be no communication between blocks
  – Threads and blocks must run to completion

• Threads have both local and shared memory
  – Shared memory is only shared between threads in the same block
  – All threads have access to global memory

GPU Layout

[Diagram: inside the GPU, each thread has its own local memory, the threads of a thread block share a per-block shared memory, and all threads can access global memory; host memory sits outside the GPU and exchanges data with global memory.]

Thread Synchronization

• Barrier - a point in a program where all threads or processes stop and wait

• When all threads or processes reach the barrier, they may all continue

• __syncthreads() creates a barrier within a thread block

Barriers

• Need for barriers – the following shared-memory code has a race condition: each thread reads arr[idx-1], which another thread may be writing at the same time.

int idx = threadIdx.x;
__shared__ int arr[128];

arr[idx] = threadIdx.x;

if(idx > 0 && idx <= 127)
    arr[idx] = arr[idx-1];

Barriers Continued

• The code should be rewritten as:

int idx = threadIdx.x;
__shared__ int arr[128];

arr[idx] = threadIdx.x;
__syncthreads();            // all writes to arr complete before any thread reads

int temp = arr[idx];        // thread 0 keeps its own value
if(idx > 0)
    temp = arr[idx-1];      // read the neighbor's value into a register
__syncthreads();            // all reads complete before any thread overwrites arr

arr[idx] = temp;

Writing CUDA Programs

• CUDA is a hierarchy of computation, synchronization, and memory
• To write efficient programs, use several high-level strategies
• Maximize your program's arithmetic (math) intensity
  – Perform lots of math per unit of memory accessed
  – Maximize compute operations per thread
  – Minimize time spent on memory per thread
• Move frequently accessed data to fast memory (see the sketch below)
  – Memory speed: local > shared >> global
  – local – registers/L1 cache
  – shared – per-block memory
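
A minimal sketch of the fast-memory strategy, assuming a 128-thread block and hypothetical names: the block copies its tile into shared memory once, then reads neighbors from the fast shared copy instead of going back to global memory.

__global__ void sharedTileSum(float *out, const float *in) {
    __shared__ float tile[128];                    // per-block (shared) memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int t = threadIdx.x;

    tile[t] = in[i];                               // one global read per thread
    __syncthreads();                               // tile is complete before anyone reads it

    // Neighbor reads hit fast shared memory; block-boundary neighbors are
    // simply treated as zero in this sketch.
    float left  = (t > 0)   ? tile[t - 1] : 0.0f;
    float right = (t < 127) ? tile[t + 1] : 0.0f;
    out[i] = left + tile[t] + right;
}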

Local Memory Example

__global__ void locMemGPU(double in) {
    double f;    // f (and the parameter in) live in local memory/registers
    f = in;
}

int main(int argc, char ** argv) {
    locMemGPU<<<1,512>>>(4.5);
    cudaDeviceSynchronize();
}

Global Memory Example

__global__ void globalMemGPU(float * myArr) {
    myArr[threadIdx.x] = 5.0f + myArr[threadIdx.x];   // myArr lives in global memory
}

int main(int argc, char ** argv) {
    float * myHostArr = (float *) malloc(sizeof(float)*512);
    float * devArr;
    cudaMalloc((void **) &devArr, sizeof(float)*512);
    for(int i = 0; i < 512; i++)
        myHostArr[i] = i;
    cudaMemcpy((void *) devArr, (void *) myHostArr,
               sizeof(float)*512, cudaMemcpyHostToDevice);
    globalMemGPU<<<1,512>>>(devArr);
    cudaMemcpy((void *) myHostArr, (void *) devArr,
               sizeof(float)*512, cudaMemcpyDeviceToHost);
}

CUDA Memory Access

• CUDA works well when threads have contiguous memory accesses.

• The GPU is most efficient when threads read or write to the same area of memory at the same time.

• When a thread accesses global memory, the hardware fetches a whole chunk of memory, not just the single data item.

• Therefore, you should remember the following about memory access:
  – Contiguous access is good,
  – Strided access is not so good, and
  – Random access is bad.
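
Sketches of the contiguous and strided cases (the kernel names and stride are illustrative only):

__global__ void coalescedCopy(float *out, const float *in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];           // contiguous: neighboring threads touch neighboring elements
}

__global__ void stridedCopy(float *out, const float *in, int stride, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = (i * stride) % n;     // strided: neighboring threads touch elements far apart
    if (i < n)
        out[j] = in[j];
}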

Memory Conflicts

• Example: Assume 10,000 threads accessing/modifying 10 array elements.
• This problem can be solved with atomics.
  – atomicAdd()
  – atomicMin()
  – atomicXor()
  – atomicCAS() – compare and swap
• Atomics are only provided for certain operations and data types.
• There is no atomic mod or exponentiation.
• Most operations are only available for integer types.
• Any atomic operation can be implemented with CAS, though doing so is quite complicated.
  – There are still no ordering constraints.
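
A minimal sketch of the many-threads/few-elements situation using atomicAdd (the counting framing and names are assumptions):

// Many threads safely increment a small number of shared counters; atomicAdd
// serializes conflicting updates to the same element.
__global__ void countIntoBins(int *bins, const int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&bins[data[i] % 10], 1);
}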

Even More Memory Issues

• Floating point arithmetic is non-associative
  – (a + b) + c != a + (b + c) in general

• Synchronization of such operations serializes memory access
  – This makes atomic operations very slow

Example

• Try the following in CUDA:
  – 10^6 threads incrementing 10^6 elements (0.11728 ms)
  – 10^6 threads atomically incrementing 10^6 elements (0.1727 ms)
  – 10^6 threads incrementing 100 elements (0.33616 ms)
  – 10^6 threads atomically incrementing 100 elements (0.372 ms)
  – 10^7 threads atomically incrementing 100 elements (3.45853 ms)
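
Timings like these can be collected with CUDA events; a sketch under assumed sizes and a hypothetical kernel:

#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: roughly 10^6 threads atomically increment 100 elements.
__global__ void incrementAtomic(int *arr, int numBins) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    atomicAdd(&arr[i % numBins], 1);
}

int main() {
    int *d_arr;
    cudaMalloc(&d_arr, 100 * sizeof(int));
    cudaMemset(d_arr, 0, 100 * sizeof(int));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    incrementAtomic<<<1000000 / 256, 256>>>(d_arr, 100);   // ~10^6 threads, 100 elements
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);   // elapsed kernel time in milliseconds
    printf("atomic increments took %f ms\n", ms);

    cudaFree(d_arr);
    return 0;
}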

Improving CUDA Program Performance

• Avoid thread divergence.
• This means avoiding if statements whenever possible.
• Divergence means threads that do different things.
• Divergence can happen in loops, too, especially when loops run for different numbers of iterations in different threads.
• All other threads have to wait until all divergent threads finish.
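
A small sketch of divergence (hypothetical kernel): threads in the same warp take different branches, so the hardware runs the two paths one after the other.

__global__ void divergentKernel(float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (i % 2 == 0)
        out[i] = sinf((float)i);   // even-numbered threads take this path...
    else
        out[i] = cosf((float)i);   // ...odd-numbered threads in the same warp wait, then take this one
}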