CS 179: Lecture 3

Page 1: CS 179: Lecture 3

CS 179: Lecture 3

Page 2: CS 179: Lecture 3

A bold problem!

Suppose we have a polynomial P(r) with coefficients c_0, …, c_{n-1}, given by:

    P(r) = c_0 + c_1 r + c_2 r^2 + … + c_{n-1} r^{n-1} = Σ_{k=0}^{n-1} c_k r^k

We want, for r_0, …, r_{N-1}, the sum:

    S = Σ_{i=0}^{N-1} P(r_i)
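
In serial code this is just two nested loops. A minimal C++ reference (a sketch with illustrative names, to pin down what the kernels below must compute):

    // Serial reference: evaluate P at every point r[i] and sum the results.
    // c holds the n coefficients c_0..c_{n-1}; r holds the N points.
    float polynomial_sum(const float *c, int n, const float *r, int N) {
        float total = 0.0f;
        for (int i = 0; i < N; ++i) {
            // Horner's rule: P(x) = c_0 + x*(c_1 + x*(c_2 + ...))
            float p = 0.0f;
            for (int k = n - 1; k >= 0; --k)
                p = p * r[i] + c[k];
            total += p;
        }
        return total;
    }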

Page 3: CS 179: Lecture 3

A kernel

!! — every thread adds its result to *output with an unsynchronized read-modify-write: a race condition.
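
A minimal sketch of such a first-attempt kernel (the kernel name and the Horner loop are assumptions, not the slide's exact code); the flagged line is the bug:

    __global__ void polynomial_kernel(const float *c, int n,
                                      const float *r, int N,
                                      float *output) {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < N) {
            float p = 0.0f;  // evaluate P(r[idx]) by Horner's rule
            for (int k = n - 1; k >= 0; --k)
                p = p * r[idx] + c[k];
            *output += p;  // !! unsynchronized read-modify-write: a race
        }
    }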

Page 4: CS 179: Lecture 3

A more correct kernel
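
A sketch of the corrected version, assuming the fix is the atomicAdd() the next slide discusses (same hypothetical kernel as above):

    __global__ void polynomial_kernel(const float *c, int n,
                                      const float *r, int N,
                                      float *output) {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < N) {
            float p = 0.0f;
            for (int k = n - 1; k >= 0; --k)
                p = p * r[idx] + c[k];
            atomicAdd(output, p);  // correct: the update is now atomic
        }
    }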

Page 5: CS 179: Lecture 3

Performance problem!

Serialization of atomicAdd(): atomic additions to the same address must execute one at a time, so with one atomicAdd() per thread, all N updates to *output are serialized.

Page 6: CS 179: Lecture 3

Parallel accumulation

Page 7: CS 179: Lecture 3

Parallel accumulation

Page 8: CS 179: Lecture 3

Accumulation methods – A visual

(Diagram comparing the two methods' updates to *output:)
  Purely atomic: every thread atomicAdd()s its value directly to *output
  “Linear accumulation”: one thread per block sums its block's values, then issues a single atomicAdd() to *output
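
A sketch of the “linear accumulation” variant (the shared-memory array name and launch convention are assumptions): each block stores its threads' values in shared memory, thread 0 sums them, and only one atomicAdd() per block touches *output.

    __global__ void polynomial_kernel(const float *c, int n,
                                      const float *r, int N,
                                      float *output) {
        // One float per thread; size is passed at launch:
        // polynomial_kernel<<<blocks, threads, threads * sizeof(float)>>>(...)
        extern __shared__ float partial_outputs[];

        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        float p = 0.0f;
        if (idx < N)
            for (int k = n - 1; k >= 0; --k)
                p = p * r[idx] + c[k];
        partial_outputs[threadIdx.x] = p;
        __syncthreads();

        // Linear accumulation: thread 0 walks the whole block's values,
        // so *output sees one atomicAdd() per block instead of one per thread.
        if (threadIdx.x == 0) {
            float sum = 0.0f;
            for (int t = 0; t < blockDim.x; ++t)
                sum += partial_outputs[t];
            atomicAdd(output, sum);
        }
    }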

Page 9: CS 179: Lecture 3

GPU: Internals

  Warp → SIMD processing unit (warps go here)
  Block → Multiprocessor (blocks go here)

Page 10: CS 179: Lecture 3

Warp

32 threads that execute simultaneously! The A[] + B[] -> C[] problem:

Warp of threads 0-31: A[idx] + B[idx] -> C[idx]

Warp of threads 32-63: A[idx] + B[idx] -> C[idx]

Warp of threads 64-95: A[idx] + B[idx] -> C[idx]

Didn't matter! Every thread in a warp executes the same instructions.

Page 11: CS 179: Lecture 3

Divergence – Example

    // Suppose we have a pointer to some floating-point
    // value *value...
    if (threadIdx.x % 2 == 0)
        *value += 2;  // Branch A
    else
        *value /= 3;  // Branch B

Branches result in different instructions!

Page 12: CS 179: Lecture 3

Divergence

What happens:
  Executes normally until the if-statement
  Branches to calculate Branch A (blue threads)
  Goes back (!) and branches to calculate Branch B (red threads)

The warp runs the two branches serially, masking off the threads on the other branch, so it pays the cost of both.

Page 13: CS 179: Lecture 3

Calculating polynomial values – does it matter?

Warp of threads 0-31: (calculate some values)

Warp of threads 32-63: (calculate some values)

Warp of threads 64-95: (calculate some values)

Same instructions! Doesn’t matter!

Page 14: CS 179: Lecture 3

Linear reduction – does it matter? (after calculating values…)

Warp of threads 0-31:
  Thread 0: Accumulate sum
  Threads 1-31: Do nothing

Warp of threads 32-63: Do nothing

Warp of threads 64-95: Do nothing

Doesn’t really matter… in this case.

Page 15: CS 179: Lecture 3

Improving our reduction

More threads participating in the process! A “binary tree”

Page 16: CS 179: Lecture 3

Improving our reduction

Page 17: CS 179: Lecture 3

Improving our reduction

    // Let our shared memory block be partial_outputs[]...
    synchronize threads before starting
    set offset to 1
    while ((offset * 2) <= block dimension):
        if (thread index % (offset * 2) is 0) AND you won't exceed your block dimension:
            add partial_outputs[thread index + offset] to partial_outputs[thread index]
        double the offset
        synchronize threads

    Get thread 0 to atomicAdd() partial_outputs[0] to output
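
In CUDA, that pseudocode might become the following fragment (a sketch; it assumes partial_outputs[] was filled and synchronized as in the linear-accumulation sketch earlier):

    // Divergent tree reduction over the block's shared-memory values.
    for (int offset = 1; offset * 2 <= blockDim.x; offset *= 2) {
        // Only threads whose index is a multiple of (offset * 2) work,
        // and the partner index must stay inside the block.
        if (threadIdx.x % (offset * 2) == 0 &&
            threadIdx.x + offset < blockDim.x)
            partial_outputs[threadIdx.x] += partial_outputs[threadIdx.x + offset];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        atomicAdd(output, partial_outputs[0]);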

Page 18: CS 179: Lecture 3

Improving our reduction

What will happen?
  Each warp will do meaningful execution on only ~16 threads
  You use lots more warps than you have to!
  A “divergent tree”

How to improve?

Page 19: CS 179: Lecture 3

“Non-divergent tree”

Page 20: CS 179: Lecture 3

“Non-divergent tree”

    // Let our shared memory block be partial_outputs[]...
    set offset to highest power of 2 that's less than the block dimension

    // For the first iteration, check that you don't access
    // out-of-range memory
    while (offset >= 1):
        if (thread index < offset):
            add partial_outputs[thread index + offset] to partial_outputs[thread index]
        halve the offset
        synchronize threads

    Get thread 0 to atomicAdd() partial_outputs[0] to output
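
Again as a CUDA fragment (a sketch under the same assumptions as the earlier one):

    // Non-divergent tree: active threads are always a contiguous prefix.
    int offset = 1;
    while (offset * 2 < blockDim.x)  // highest power of 2 below blockDim.x
        offset *= 2;

    for (; offset >= 1; offset /= 2) {
        // The range check only matters on the first iteration, when
        // threadIdx.x + offset can exceed the block dimension.
        if (threadIdx.x < offset && threadIdx.x + offset < blockDim.x)
            partial_outputs[threadIdx.x] += partial_outputs[threadIdx.x + offset];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        atomicAdd(output, partial_outputs[0]);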

Page 21: CS 179: Lecture 3

“Non-divergent tree”

Suppose we have our block of 512 threads…

Warp of threads 0-31: Accumulate result
…
Warp of threads 224-255: Accumulate result
Warp of threads 256-287: Do nothing
…
Warp of threads 480-511: Do nothing

Page 22: CS 179: Lecture 3

“Non-divergent tree”

Suppose we're now down to 32 threads…

Warp of threads 0-31:
  Threads 0-15: Accumulate result
  Threads 16-31: Do nothing

Much less divergence! Divergence only occurs in the middle, if ever!

Page 23: CS 179: Lecture 3

Reduction – Four Approaches

(Visual comparison of the four accumulation patterns into *output:)
  Atomic only
  Divergent tree
  Linear
  Non-divergent tree

Page 24: CS 179: Lecture 3

Notes on files? (An aside)

Labs and CUDA programming typically have the following files involved:

  ____.cc – allows C++ code; g++ compiles this
  ____.cu – allows CUDA syntax; nvcc compiles this
  ____.cuh – CUDA header file (declare accessible functions)
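
For instance, a lab might split into three files like this (the file names, and the reuse of the hypothetical kernel from earlier, are illustrative only):

    // polysum.cuh – CUDA header: declare the functions other files may call
    #ifndef POLYSUM_CUH
    #define POLYSUM_CUH
    void call_polynomial_kernel(const float *d_c, int n,
                                const float *d_r, int N, float *d_output);
    #endif

    // polysum.cu – CUDA syntax allowed; nvcc compiles this
    #include "polysum.cuh"
    __global__ void polynomial_kernel(const float *c, int n, const float *r,
                                      int N, float *output) {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < N) {
            float p = 0.0f;
            for (int k = n - 1; k >= 0; --k)
                p = p * r[idx] + c[k];
            atomicAdd(output, p);
        }
    }
    void call_polynomial_kernel(const float *d_c, int n,
                                const float *d_r, int N, float *d_output) {
        polynomial_kernel<<<(N + 511) / 512, 512>>>(d_c, n, d_r, N, d_output);
    }

    // main.cc – plain C++; g++ compiles this, linked against polysum.o
    #include "polysum.cuh"
    int main() {
        // ...cudaMalloc/cudaMemcpy device arrays d_c, d_r, d_output, then:
        // call_polynomial_kernel(d_c, n, d_r, N, d_output);
        return 0;
    }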

Page 25: CS 179: Lecture 3

Big Picture (so far)

CUDA allows large speedups!
Parallelism with CUDA is different from CPU parallelism:
  Threads are “small”
  There is memory to think about
Increase speedup by thinking about problem constraints!
  Reduction (and more!)

Page 26: CS 179: Lecture 3

Big Picture (so far)

Steps in these two problems are widely applicable!
  Dot product
  Norm calculation
  (and more)
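
For example, a dot product is the same pattern with the per-thread computation swapped out. A sketch (hypothetical names; a power-of-2 block size is assumed so the reduction can start at blockDim.x / 2):

    __global__ void dot_product_kernel(const float *a, const float *b,
                                       int N, float *output) {
        extern __shared__ float partial_outputs[];
        int idx = blockIdx.x * blockDim.x + threadIdx.x;

        // Per-thread step: a multiply instead of a polynomial evaluation.
        partial_outputs[threadIdx.x] = (idx < N) ? a[idx] * b[idx] : 0.0f;
        __syncthreads();

        // Then the same non-divergent tree reduction as before.
        for (int offset = blockDim.x / 2; offset >= 1; offset /= 2) {
            if (threadIdx.x < offset)
                partial_outputs[threadIdx.x] += partial_outputs[threadIdx.x + offset];
            __syncthreads();
        }
        if (threadIdx.x == 0)
            atomicAdd(output, partial_outputs[0]);
    }

A squared-norm kernel would be identical with a[idx] * a[idx] as the per-thread step.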

Page 27: CS 179: Lecture 3

Other notes

Office hours:
  Monday: 8-10 PM
  Tuesday: 8-10 PM

Lab 1 due Wednesday, 5 PM