CUDA Programming David Monismith CS599 Based on notes from the Udacity Parallel Programming (cs344) Course


Page 1:

CUDA Programming

David Monismith, CS599

Based on notes from the Udacity Parallel Programming (cs344) Course

Page 2:

CUDA Guarantees

• All threads in a block run on the same SM at the same time.

• All blocks in a kernel finish before any blocks from the next kernel start.
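As a rough sketch of the second guarantee, consider two dependent kernel launches. The kernel names, sizes, and launch configuration below are made up for illustration; they are not from the course notes.

__global__ void stepOne(int *data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] = i;
}

__global__ void stepTwo(int *data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] *= 2;        // safely reads what stepOne wrote
}

int main(int argc, char **argv) {
    int *devData;
    cudaMalloc((void **) &devData, sizeof(int) * 256);
    stepOne<<<4,64>>>(devData);   // 4 blocks of 64 threads each
    stepTwo<<<4,64>>>(devData);   // starts only after every block of stepOne has finished
    cudaDeviceSynchronize();
    cudaFree(devData);
    return 0;
}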

Page 3:

Memory Model

• Every thread has local memory (e.g., local variables).

• Threads in a block have access to a “per block” shared memory.

• All threads can read and write to and from global memory.

• CPU memory is separate from GPU memory and is called host memory.
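A minimal kernel sketch showing where each kind of device memory appears; the kernel and variable names are made up for illustration.

__global__ void memKindsEx(float *globalArr) {   // globalArr points to global memory
    float localVal = 1.0f;                       // local memory (a per-thread register)
    __shared__ float blockArr[128];              // shared memory, one copy per block
    blockArr[threadIdx.x] = localVal;
    __syncthreads();
    globalArr[threadIdx.x] = blockArr[threadIdx.x];
}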

Page 4:

Synchronization

• Warning: threads can access and modify each other’s results in shared and global memory.

• But what if a thread modifies another thread’s data?

• We need a tool to synchronize memory access and to synchronize thread operations.

Page 5:

Barriers

• Similar to MPI and OpenMP, once a thread reaches a barrier it must wait until all other threads reach the barrier.

• Then all threads may continue.
• See the next slide for example code.

Page 6:

Barriers

• Need for barriers: the code below contains a race condition.

int idx = threadIdx.x;
__shared__ int arr[128];

arr[idx] = threadIdx.x;

// Race condition: arr[idx-1] may not have been written yet
// when this thread reads it.
if(idx > 0 && idx <= 127)
    arr[idx] = arr[idx-1];

Page 7:

Barriers Continued

• Should be rewritten as:

int idx = threadIdx.x;
__shared__ int arr[128];

arr[idx] = threadIdx.x;
__syncthreads();              // all writes to arr complete before any thread reads

int temp = (idx > 0) ? arr[idx-1] : arr[0];   // read the left neighbor's value
__syncthreads();              // all reads complete before arr is overwritten

arr[idx] = temp;
// Note: each __syncthreads() is kept outside of divergent branches so that
// every thread in the block reaches every barrier.

Page 8:

__syncthreads()

• __syncthreads() creates a barrier for all threads within a block.

• Implicit barriers also exist between kernel function calls.

• So, CUDA is a hierarchy of computation, memory, and synchronization primitives.

Page 9:

Efficient CUDA Programs

• High Level Strategies
  – Modern GPUs can perform 3 trillion math operations per second (3 TFLOPS)
  – Maximize the intensity of math operations per unit of memory access (see the sketch below)
  – Maximize the number of useful compute operations per thread
  – Minimize the time spent on memory access per thread
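As a rough sketch of the "maximize math per memory access" idea, compare reloading the same global value for every result with loading it once into a register and reusing it. The kernel and variable names are illustrative, not from the course notes.

// Low intensity: y[i] = a*x[i];  z[i] = a*x[i] + b;  w[i] = a*x[i] + c;
// written naively, each result reloads x[i] from global memory.

// Higher intensity: load x[i] once and reuse it for all three results.
__global__ void highIntensityEx(double a, double b, double c,
                                double *x, double *y, double *z, double *w) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    double xi = x[i];       // single global read
    double ax = a * xi;     // reused for all three outputs
    y[i] = ax;
    z[i] = ax + b;
    w[i] = ax + c;
}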

Page 10:

Minimize Time Spent On Memory

• Move frequently accessed data to shared memory.
• Memory speed
  – Local > Shared >> Global >> Host
  – Local = registers/L1 cache

• Local Memory Example

__global__ void locMemEx(double f) {
    double local_f;           // local variable: lives in a register (local memory)
    local_f = f;
}

int main(int argc, char **argv) {
    locMemEx<<<1,256>>>(10.2);    // 1 block of 256 threads
    cudaDeviceSynchronize();      // wait for the kernel to finish
    return 0;
}

Page 11:

Global Memory Example

// Global memory
__global__ void globalMemEx(double *myArr) {
    // myArr points to global memory
    myArr[threadIdx.x] = 8.3 + myArr[threadIdx.x];
}

int main(int argc, char **argv) {
    double *myHostArr = (double *) malloc(sizeof(double)*256);
    double *devArr;
    cudaMalloc((void **) &devArr, sizeof(double)*256);
    for(int i = 0; i < 256; i++)
        myHostArr[i] = i;
    cudaMemcpy((void *) devArr, (void *) myHostArr,
               sizeof(double)*256, cudaMemcpyHostToDevice);
    globalMemEx<<<1,256>>>(devArr);
    cudaMemcpy((void *) myHostArr, (void *) devArr,
               sizeof(double)*256, cudaMemcpyDeviceToHost);
    cudaFree(devArr);
    free(myHostArr);
    return 0;
}

Page 12:

Shared Memory Example

__global__ void shmemEx(double *arr) {
    int i, idx = threadIdx.x;
    double avg, sum = 0.0;
    __shared__ double shArr[256];

    shArr[idx] = arr[idx];      // each thread loads one element into shared memory
    __syncthreads();            // wait until every element has been loaded

    for(i = 0; i < idx; i++)    // sum the elements before this thread's element
        sum += shArr[i];
    avg = sum / (idx + 1.0);

    if(arr[idx] > avg)
        arr[idx] = avg;

    // This last write does not affect the result; shArr is never read again.
    shArr[idx] += shArr[idx];
}

Page 13:

Code from Main Function

shmemEx<<<1,256>>>(devArr);
cudaMemcpy((void *) hostArr, (void *) devArr,
           sizeof(double)*256, cudaMemcpyDeviceToHost);   // copy the results back to the host

Page 14:

Memory Access

• Want threads to have contiguous memory accesses

• GPU is most efficient when threads read or write to the same area of memory at the same time

• When a thread accesses global memory, a whole chunk of memory is transferred, not just the single data item it needs
  – Contiguous accesses are good
  – Strided accesses are not so good
  – Random accesses are bad

• As an in-class exercise, we will draw pictures of each type of memory access.
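A rough sketch of the contiguous versus strided patterns described above; the kernel names are illustrative, and actual performance depends on the device.

// Contiguous (coalesced): adjacent threads touch adjacent elements.
__global__ void contiguousEx(float *a) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    a[i] = a[i] + 1.0f;
}

// Strided: adjacent threads touch elements far apart, wasting most of each chunk.
__global__ void stridedEx(float *a, int stride) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    a[i] = a[i] + 1.0f;
}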

Page 15:

Memory Conflicts

• Many threads may try to access the same memory location.

• Ex: 1,000,000 threads accessing 10 array elements

• Solve with atomics (see the sketch below)
  – atomicAdd()
  – atomicMin()
  – atomicXor()
  – atomicCAS() (compare and swap)
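A minimal sketch of many threads safely incrementing a small array with atomicAdd(); the kernel name and bin mapping are made up for illustration.

__global__ void countIntoBins(int *bins, int numBins) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Many threads map to the same bin, so the increment must be atomic.
    atomicAdd(&bins[i % numBins], 1);
}

// Launch: 1,000,000 threads updating only 10 elements
// countIntoBins<<<1000,1000>>>(devBins, 10);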

Page 16:

Atomic Limitations

• Only certain operations and data types are supported
• No mod or exponentiation
• Mostly integer types
• Any atomic operation can be implemented with atomicCAS(), though it is quite complicated
• Still no ordering constraints
  – Floating point arithmetic is non-associative: (a + b) + c != a + (b + c)
• Atomics serialize memory access
  – This makes atomic operations very slow

Page 17:

In Class Exercise

• Try each of the following:
  – 10^6 threads incrementing 10^6 elements
  – 10^5 threads atomically incrementing 10^5 elements
  – 10^6 threads incrementing 1000 elements
  – 10^6 threads atomically incrementing 1000 elements
  – 10^7 threads atomically incrementing 1000 elements

• Time your results
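One way to time the runs is with CUDA events. Below is a rough, self-contained sketch; the incrementEx kernel stands in for one of the exercise variants and is an assumption, not code from the slides.

#include <stdio.h>

// Hypothetical kernel standing in for one exercise variant.
__global__ void incrementEx(int *arr, int numElems) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    atomicAdd(&arr[i % numElems], 1);
}

int main(int argc, char **argv) {
    int numElems = 1000;
    int *devArr;
    cudaMalloc((void **) &devArr, sizeof(int) * numElems);
    cudaMemset(devArr, 0, sizeof(int) * numElems);

    cudaEvent_t start, stop;
    float ms;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    incrementEx<<<1000,1000>>>(devArr, numElems);   // 10^6 threads, 1000 elements
    cudaEventRecord(stop);

    cudaEventSynchronize(stop);               // wait for the kernel and the stop event
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel time: %f ms\n", ms);

    cudaFree(devArr);
    return 0;
}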

Page 18:

Avoiding Control Structures

• In CUDA we want to avoid thread divergence because threads operate on the same kernel code at the same time.
  – Threads with branch statements will be forced to wait if they are not operating on the same code as all other threads (e.g., if one thread needs to execute an else branch while another executes the if branch).

• This means we should avoid if statements in GPU code whenever possible.
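As an illustrative sketch, an if/else that splits a warp can sometimes be replaced by branch-free arithmetic. These kernels are hypothetical, not from the course notes.

// Divergent: half of each warp waits while the other half executes.
__global__ void divergentEx(float *a) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i % 2 == 0) a[i] = a[i] * 2.0f;
    else            a[i] = a[i] + 1.0f;
}

// Branch-free: every thread executes the same instructions.
__global__ void branchFreeEx(float *a) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float even = (float)(1 - (i % 2));    // 1.0 for even i, 0.0 for odd i
    a[i] = even * (a[i] * 2.0f) + (1.0f - even) * (a[i] + 1.0f);
}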

Page 19:

Divergence

• Divergence (in terms of threads) means threads that do different things.

• This can happen in both loops and if statements.

• This often occurs when loops run for different numbers of iterations in different threads.

• Keep in mind that the other threads in the warp have to wait until all divergent threads finish.
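A small sketch of loop divergence: each thread below runs a different number of iterations, so the warp proceeds at the pace of its slowest thread. The kernel is illustrative, not from the slides.

__global__ void loopDivergenceEx(const float *in, float *out) {
    int idx = threadIdx.x;
    float sum = 0.0f;
    // Thread 0 runs this loop once, thread 127 runs it 128 times;
    // the warp stays busy until its longest-running thread finishes.
    for (int i = 0; i <= idx; i++)
        sum += in[i];
    out[idx] = sum;
}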