CS 179: Lecture 3

Page 1: CS 179: Lecture 3

CS 179: Lecture 3

Page 2: CS 179: Lecture 3

A bold problem!

Suppose we have a polynomial P(r) with coefficients c_0, …, c_{n-1}, given by:

    P(r) = c_0 + c_1 r + c_2 r^2 + … + c_{n-1} r^{n-1} = Σ_{k=0}^{n-1} c_k r^k

We want, for r_0, …, r_{N-1}, the sum:

    S = Σ_{i=0}^{N-1} P(r_i)
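
In serial code this is just two nested loops. A minimal C++ reference (a sketch with illustrative names, to pin down what the kernels below must compute):

    // Serial reference: evaluate P at every point r[i] and sum the results.
    // c holds the n coefficients c_0..c_{n-1}; r holds the N points.
    float polynomial_sum(const float *c, int n, const float *r, int N) {
        float total = 0.0f;
        for (int i = 0; i < N; ++i) {
            // Horner's rule: P(x) = c_0 + x*(c_1 + x*(c_2 + ...))
            float p = 0.0f;
            for (int k = n - 1; k >= 0; --k)
                p = p * r[i] + c[k];
            total += p;
        }
        return total;
    }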

Page 3: CS 179: Lecture 3

A kernel

!! — every thread adds its result to *output with an unsynchronized read-modify-write: a race condition.
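
A minimal sketch of such a first-attempt kernel (the kernel name and the Horner loop are assumptions, not the slide's exact code); the flagged line is the bug:

    __global__ void polynomial_kernel(const float *c, int n,
                                      const float *r, int N,
                                      float *output) {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < N) {
            float p = 0.0f;  // evaluate P(r[idx]) by Horner's rule
            for (int k = n - 1; k >= 0; --k)
                p = p * r[idx] + c[k];
            *output += p;  // !! unsynchronized read-modify-write: a race
        }
    }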

Page 4: CS 179: Lecture 3

A more correct kernel
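
A sketch of the corrected version, assuming the fix is the atomicAdd() the next slide discusses (same hypothetical kernel as above):

    __global__ void polynomial_kernel(const float *c, int n,
                                      const float *r, int N,
                                      float *output) {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < N) {
            float p = 0.0f;
            for (int k = n - 1; k >= 0; --k)
                p = p * r[idx] + c[k];
            atomicAdd(output, p);  // correct: the update is now atomic
        }
    }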

Page 5: CS 179: Lecture 3

Performance problem!

Serialization of atomicAdd(): atomic additions to the same address must execute one at a time, so with one atomicAdd() per thread, all N updates to *output are serialized.

Page 6: CS 179: Lecture 3

Parallel accumulation

Page 7: CS 179: Lecture 3

Parallel accumulation

Page 8: CS 179: Lecture 3

Accumulation methods – A visual

(Diagram comparing the two methods' updates to *output:)
  Purely atomic: every thread atomicAdd()s its value directly to *output
  “Linear accumulation”: one thread per block sums its block's values, then issues a single atomicAdd() to *output
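
A sketch of the “linear accumulation” variant (the shared-memory array name and launch convention are assumptions): each block stores its threads' values in shared memory, thread 0 sums them, and only one atomicAdd() per block touches *output.

    __global__ void polynomial_kernel(const float *c, int n,
                                      const float *r, int N,
                                      float *output) {
        // One float per thread; size is passed at launch:
        // polynomial_kernel<<<blocks, threads, threads * sizeof(float)>>>(...)
        extern __shared__ float partial_outputs[];

        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        float p = 0.0f;
        if (idx < N)
            for (int k = n - 1; k >= 0; --k)
                p = p * r[idx] + c[k];
        partial_outputs[threadIdx.x] = p;
        __syncthreads();

        // Linear accumulation: thread 0 walks the whole block's values,
        // so *output sees one atomicAdd() per block instead of one per thread.
        if (threadIdx.x == 0) {
            float sum = 0.0f;
            for (int t = 0; t < blockDim.x; ++t)
                sum += partial_outputs[t];
            atomicAdd(output, sum);
        }
    }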

Page 9: CS 179: Lecture 3

GPU: Internals

  Warp → SIMD processing unit (warps go here)
  Block → Multiprocessor (blocks go here)

Page 10: CS 179: Lecture 3

Warp

32 threads that execute simultaneously! The A[] + B[] -> C[] problem:

Warp of threads 0-31: A[idx] + B[idx] -> C[idx]

Warp of threads 32-63: A[idx] + B[idx] -> C[idx]

Warp of threads 64-95: A[idx] + B[idx] -> C[idx]

Didn't matter! Every thread in a warp executes the same instructions.

Page 11: CS 179: Lecture 3

Divergence – Example

    // Suppose we have a pointer to some floating-point
    // value *value...
    if (threadIdx.x % 2 == 0)
        *value += 2;  // Branch A
    else
        *value /= 3;  // Branch B

Branches result in different instructions!

Page 12: CS 179: Lecture 3

Divergence

What happens:
  Executes normally until the if-statement
  Branches to calculate Branch A (blue threads)
  Goes back (!) and branches to calculate Branch B (red threads)

The warp runs the two branches serially, masking off the threads on the other branch, so it pays the cost of both.

Page 13: CS 179: Lecture 3

Calculating polynomial values – does it matter?

Warp of threads 0-31: (calculate some values)

Warp of threads 32-63: (calculate some values)

Warp of threads 64-95: (calculate some values)

Same instructions! Doesn’t matter!

Page 14: CS 179: Lecture 3

Linear reduction – does it matter? (after calculating values…)

Warp of threads 0-31:
  Thread 0: Accumulate sum
  Threads 1-31: Do nothing

Warp of threads 32-63: Do nothing

Warp of threads 64-95: Do nothing

Doesn’t really matter… in this case.

Page 15: CS 179: Lecture 3

Improving our reduction

More threads participating in the process! A “binary tree”

Page 16: CS 179: Lecture 3

Improving our reduction

Page 17: CS 179: Lecture 3

Improving our reduction

    // Let our shared memory block be partial_outputs[]...
    synchronize threads before starting
    set offset to 1
    while ((offset * 2) <= block dimension):
        if (thread index % (offset * 2) is 0) AND you won't exceed your block dimension:
            add partial_outputs[thread index + offset] to partial_outputs[thread index]
        double the offset
        synchronize threads

    Get thread 0 to atomicAdd() partial_outputs[0] to output
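
In CUDA, that pseudocode might become the following fragment (a sketch; it assumes partial_outputs[] was filled and synchronized as in the linear-accumulation sketch earlier):

    // Divergent tree reduction over the block's shared-memory values.
    for (int offset = 1; offset * 2 <= blockDim.x; offset *= 2) {
        // Only threads whose index is a multiple of (offset * 2) work,
        // and the partner index must stay inside the block.
        if (threadIdx.x % (offset * 2) == 0 &&
            threadIdx.x + offset < blockDim.x)
            partial_outputs[threadIdx.x] += partial_outputs[threadIdx.x + offset];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        atomicAdd(output, partial_outputs[0]);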

Page 18: CS 179: Lecture 3

Improving our reduction

What will happen?
  Each warp will do meaningful execution on only ~16 threads
  You use lots more warps than you have to!
  A “divergent tree”

How to improve?

Page 19: CS 179: Lecture 3

“Non-divergent tree”

Page 20: CS 179: Lecture 3

“Non-divergent tree”

    // Let our shared memory block be partial_outputs[]...
    set offset to highest power of 2 that's less than the block dimension

    // For the first iteration, check that you don't access
    // out-of-range memory
    while (offset >= 1):
        if (thread index < offset):
            add partial_outputs[thread index + offset] to partial_outputs[thread index]
        halve the offset
        synchronize threads

    Get thread 0 to atomicAdd() partial_outputs[0] to output
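
Again as a CUDA fragment (a sketch under the same assumptions as the earlier one):

    // Non-divergent tree: active threads are always a contiguous prefix.
    int offset = 1;
    while (offset * 2 < blockDim.x)  // highest power of 2 below blockDim.x
        offset *= 2;

    for (; offset >= 1; offset /= 2) {
        // The range check only matters on the first iteration, when
        // threadIdx.x + offset can exceed the block dimension.
        if (threadIdx.x < offset && threadIdx.x + offset < blockDim.x)
            partial_outputs[threadIdx.x] += partial_outputs[threadIdx.x + offset];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        atomicAdd(output, partial_outputs[0]);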

Page 21: CS 179: Lecture 3

“Non-divergent tree”

Suppose we have our block of 512 threads…

Warp of threads 0-31: Accumulate result
…
Warp of threads 224-255: Accumulate result
Warp of threads 256-287: Do nothing
…
Warp of threads 480-511: Do nothing

Page 22: CS 179: Lecture 3

“Non-divergent tree”

Suppose we're now down to 32 threads…

Warp of threads 0-31:
  Threads 0-15: Accumulate result
  Threads 16-31: Do nothing

Much less divergence! Divergence only occurs in the middle, if ever!

Page 23: CS 179: Lecture 3

Reduction – Four Approaches

(Visual comparison of the four accumulation patterns into *output:)
  Atomic only
  Divergent tree
  Linear
  Non-divergent tree

Page 24: CS 179: Lecture 3

Notes on files? (An aside)

Labs and CUDA programming typically have the following files involved:

  ____.cc – allows C++ code; g++ compiles this
  ____.cu – allows CUDA syntax; nvcc compiles this
  ____.cuh – CUDA header file (declare accessible functions)
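
For instance, a lab might split into three files like this (the file names, and the reuse of the hypothetical kernel from earlier, are illustrative only):

    // polysum.cuh – CUDA header: declare the functions other files may call
    #ifndef POLYSUM_CUH
    #define POLYSUM_CUH
    void call_polynomial_kernel(const float *d_c, int n,
                                const float *d_r, int N, float *d_output);
    #endif

    // polysum.cu – CUDA syntax allowed; nvcc compiles this
    #include "polysum.cuh"
    __global__ void polynomial_kernel(const float *c, int n, const float *r,
                                      int N, float *output) {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < N) {
            float p = 0.0f;
            for (int k = n - 1; k >= 0; --k)
                p = p * r[idx] + c[k];
            atomicAdd(output, p);
        }
    }
    void call_polynomial_kernel(const float *d_c, int n,
                                const float *d_r, int N, float *d_output) {
        polynomial_kernel<<<(N + 511) / 512, 512>>>(d_c, n, d_r, N, d_output);
    }

    // main.cc – plain C++; g++ compiles this, linked against polysum.o
    #include "polysum.cuh"
    int main() {
        // ...cudaMalloc/cudaMemcpy device arrays d_c, d_r, d_output, then:
        // call_polynomial_kernel(d_c, n, d_r, N, d_output);
        return 0;
    }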

Page 25: CS 179: Lecture 3

Big Picture (so far)

CUDA allows large speedups!
Parallelism with CUDA is different from CPU parallelism:
  Threads are “small”
  There is memory to think about
Increase speedup by thinking about problem constraints!
  Reduction (and more!)

Page 26: CS 179: Lecture 3

Big Picture (so far)

Steps in these two problems are widely applicable!
  Dot product
  Norm calculation
  (and more)
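
For example, a dot product is the same pattern with the per-thread computation swapped out. A sketch (hypothetical names; a power-of-2 block size is assumed so the reduction can start at blockDim.x / 2):

    __global__ void dot_product_kernel(const float *a, const float *b,
                                       int N, float *output) {
        extern __shared__ float partial_outputs[];
        int idx = blockIdx.x * blockDim.x + threadIdx.x;

        // Per-thread step: a multiply instead of a polynomial evaluation.
        partial_outputs[threadIdx.x] = (idx < N) ? a[idx] * b[idx] : 0.0f;
        __syncthreads();

        // Then the same non-divergent tree reduction as before.
        for (int offset = blockDim.x / 2; offset >= 1; offset /= 2) {
            if (threadIdx.x < offset)
                partial_outputs[threadIdx.x] += partial_outputs[threadIdx.x + offset];
            __syncthreads();
        }
        if (threadIdx.x == 0)
            atomicAdd(output, partial_outputs[0]);
    }

A squared-norm kernel would be identical with a[idx] * a[idx] as the per-thread step.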

Page 27: CS 179: Lecture 3

Other notes

Office hours:
  Monday: 8-10 PM
  Tuesday: 8-10 PM

Lab 1 due Wednesday, 5 PM