ME964 High Performance Computing for Engineering Applications
Gauging Kernel Performance & Control Flow in CUDA
Oct. 9, 2008
Before we get started…
Last Time: Guest Lecturer Michael Garland, Researcher at NVIDIA, on "Writing Efficient CUDA Algorithms"
Today:
Gauging the extent to which you use hardware resources in CUDA
Control Flow in CUDA
Homework related:
HW6 available for download (exclusive scan operation)
HW5, 2D matrix convolution, due at 11:59 PM today
Exercise: Does Matrix Multiplication Incur Shared Memory Bank Conflicts?
[Figure: shared memory bank layout, banks 0 through 15, illustrating the access patterns in the two scenarios below]
In Scenario A, all threads in a half-warp access the same shared memory entry, which leads to a broadcast. What is highlighted in the figure is the second step of computing the 5th row of the tile.
In Scenario A, the threads in a half-warp also access elements in neighboring banks as they walk through the computation, so no bank conflicts occur.
Scenario A: the tile matrix is computed one row at a time, each row produced by one half-warp.
Scenario B: the tile matrix is computed one column at a time, each column produced by one half-warp (work this case out yourself).
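As a reminder, the inner product over the tile looks roughly like this (a sketch; the Ms/Ns tile names and TILE_WIDTH are assumed from the tiled matrix multiplication kernel, and tile loading is omitted):

#define TILE_WIDTH 16

__global__ void tileSketch(float *P, int Width)
{
    __shared__ float Ms[TILE_WIDTH][TILE_WIDTH];   // tile of M
    __shared__ float Ns[TILE_WIDTH][TILE_WIDTH];   // tile of N
    int tx = threadIdx.x, ty = threadIdx.y;
    // ... loading of Ms and Ns omitted ...
    float Pvalue = 0.0f;
    for (int k = 0; k < TILE_WIDTH; ++k)
        // Scenario A: within a half-warp, ty is fixed and tx spans 0..15,
        // so Ms[ty][k] is a single address (broadcast) while Ns[k][tx]
        // falls into 16 consecutive banks: no conflict in either access
        Pvalue += Ms[ty][k] * Ns[k][tx];
    P[ty * Width + tx] = Pvalue;                   // illustrative write-back
}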
Final Comments, Memory Access
Given the GPU memory spaces and their latency, a typical programming pattern emerges at the thread level:
Load data from device memory into shared memory (coalesced if possible)
Synchronize with all the other threads of the block to avoid data access hazards
Process the data that you just brought over in shared memory
Synchronize as needed
Write the results back to global memory (coalesced if possible)
NOTE: for CUDA computing, always try hard to make your computation fit this model
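A minimal kernel skeleton following this pattern (the kernel name, the 256-thread block size, and the doubling step are all illustrative assumptions):

__global__ void processData(const float *g_in, float *g_out)
{
    __shared__ float s_data[256];               // one element per thread
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;

    s_data[threadIdx.x] = g_in[i];              // 1. coalesced load into shared memory
    __syncthreads();                            // 2. avoid read-before-write hazards

    float result = 2.0f * s_data[threadIdx.x];  // 3. process the data in shared memory
    // (synchronize again here if threads read each other's results)

    g_out[i] = result;                          // 4. coalesced write back to global memory
}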
CUDA Programming Common Sense Advice
Keep this in mind:
Allocating memory on the device or host is expensive
Moving data back and forth between the host and device is a killer (see the sketch after this list)
Global memory accesses are going to be slow; if they are not coalesced, they are even slower…
Make sure that you keep the SM occupied (currently, 24 warps can be managed concurrently) and busy (avoid data starvation, have it crunch numbers)
If you can, avoid bank conflicts; not that big of a deal, though.
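For instance, hoist allocations and host-device transfers out of inner loops whenever possible (a sketch; myKernel, nBlocks, nThreads, h_data, N, and numIterations are illustrative names):

// Allocate and copy once, then launch many times: the data stays on the device
float *d_buf;
cudaMalloc((void**)&d_buf, N * sizeof(float));
cudaMemcpy(d_buf, h_data, N * sizeof(float), cudaMemcpyHostToDevice);
for (int it = 0; it < numIterations; ++it)
    myKernel<<<nBlocks, nThreads>>>(d_buf);
cudaMemcpy(h_data, d_buf, N * sizeof(float), cudaMemcpyDeviceToHost);
cudaFree(d_buf);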
Gauging the level of HW use
In order to gauge how well your code uses the HW, you need to use the CUDA occupancy calculator (google it):
http://developer.download.nvidia.com/compute/cuda/CUDA_Occupancy_calculator.xls
Gauging the level of HW use (cont.)
Three things are asked of you:
Number of threads per block (this is trivial to provide)
Number of registers per thread
Number of bytes of shared memory used by each block
You get the last two quantities by adding "--ptxas-options -v" to the compile command line:
$(CUDA_BIN_PATH)\nvcc.exe -cuda --ptxas-options -v -I"$(CUDA_INC_PATH)" -I./ -I../../common/inc -I"$(VCInstallDir)\include" -I"$(VCInstallDir)\PlatformSDK\include" -o $(ConfigurationName)\matrixmul.gen.c matrixmul.cu
In Visual Studio, right-click the main .cu file, go to Properties, and edit the Custom Build Step by adding "--ptxas-options -v"
Gauging the level of HW use (cont.)
Open the object file in a text editor to find, in a pile of stuff that doesn't make any sense, a chunk of text that looks like this:
code {
    name = _Z15MatrixMulKernel6MatrixS_S_
    lmem = 0
    smem = 2112
    reg = 14
    bar = 0
    bincode {
        0x3004c815 0xe43007c0 0x10008025 0x00000003
        0xd0800205 0x00400780 0xa000000d 0x04000780
        0xa0000205 0x04000780 0x10056003 0x00000100
        0x30040601 0xc4100780 0x20000001 0x04004780
This is telling you that MatrixMulKernel (which is the name I gave my kernel) uses 2112 bytes of shared memory and 14 registers per thread, and that there is no use of local memory (lmem = 0).
Alternatively, look at the build output in Developer Studio. This is what you are interested in:
smem: 2672 bytes
registers: 9
End: Discussion on Memory Spaces (Access and Latency Issues)
Begin: Control Flow
Objective
Understand the implications of control flow on:
Branch divergence overhead
SM execution resource utilization
Learn better ways to write code with control flow
Understand compiler/HW predication, an idea meant to reduce the impact of control flow; there is a cost involved with this process
Quick terminology review
Thread: concurrent code executed on the CUDA device (in parallel with other threads), together with its associated state; the unit of parallelism in CUDA; the number of threads used is controlled by the user
Warp: a group of threads executed physically in parallel in G80; the number of threads in a warp is not controlled by the user
Block: a group of threads that are executed together and form the unit of resource assignment; the number of blocks used is controlled by the user
Grid: a group of thread blocks that must all complete before the next phase of the program can begin
How thread blocks are partitioned
Each thread block is partitioned into warps; thread IDs within a warp are consecutive and increasing
Remember: in multidimensional blocks, the x thread index runs first, followed by the y thread index, and finally by the z thread index (see the sketch after this list)
Warp 0 starts with Thread ID 0
Partitioning is always the same, so you can use this knowledge in control flow; however, the exact size of warps may change from release to release
While you can rely on ordering among threads, DO NOT rely on any ordering among warps
Remember, the concept of warp is not something you control through CUDA; if there are any dependencies between threads, you must use __syncthreads() to get correct results
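In code, the linear thread ID that determines warp membership can be computed as below (a minimal sketch; the kernel name and output array are illustrative, warpSize is the CUDA built-in variable):

__global__ void whichWarp(int *warpIdOut)
{
    // x runs first, then y, then z
    unsigned int tid = threadIdx.x
                     + threadIdx.y * blockDim.x
                     + threadIdx.z * blockDim.x * blockDim.y;
    warpIdOut[tid] = tid / warpSize;   // warp 0 starts with thread ID 0
}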
Control Flow Instructions
The main performance concern with branching is divergence:
Threads within a single warp take different paths
Different execution paths are serialized in G80: the control paths taken by the threads in a warp are traversed one at a time until there are none left
NOTE: Don't forget that divergence can manifest only at the warp level. You cannot discuss this concept in relation to code executed by threads in different warps
Control Flow Instructions (cont.)
A common case: avoid divergence when the branch condition is a function of the thread ID
Example with divergence: if (threadIdx.x > 2) { }
This creates two different control paths for threads in a block. The branch granularity is smaller than the warp size: threads 0, 1, and 2 follow a different path than the rest of the threads in the first warp.
Example without divergence: if (threadIdx.x / WARP_SIZE > 2) { }
This also creates two different control paths for threads in a block, but the branch granularity is a whole multiple of the warp size: all threads in any given warp follow the same path
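Both cases side by side in a minimal self-contained sketch (the kernel name and the arithmetic in the branch bodies are purely illustrative):

#define WARP_SIZE 32

__global__ void branchExamples(float *data)
{
    // Divergent: threads 0..2 and threads 3..31 of warp 0 take different paths
    if (threadIdx.x > 2)
        data[threadIdx.x] *= 2.0f;
    else
        data[threadIdx.x] += 1.0f;

    // Non-divergent: all 32 threads of any given warp take the same path
    if (threadIdx.x / WARP_SIZE > 2)
        data[threadIdx.x] -= 1.0f;
}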
Illustration: Parallel Reduction
Use the “Parallel Reduction” algorithm as a vehicle to discuss the issue of control flow
Given an array of values, “reduce” them in parallel to a single value
Examples Sum reduction: sum of all values in the array Max reduction: maximum of all values in the array
Typical parallel implementation: recursively halve the number of active threads, each thread adding two values per step; this takes log(n) steps for n elements and requires n/2 threads
A Vector Reduction Example
Assume an in-place reduction using shared memory
We are in the process of summing up a 512-element array
The shared memory is used to hold a partial sum vector
Each iteration brings the partial sum vector closer to the final sum
The final sum will be stored in element 0
A simple implementation
Assume we have already loaded the array into __shared__ float partialSum[]
unsigned int t = threadIdx.x;
for (unsigned int stride = 1; stride < blockDim.x; stride *= 2) {
    __syncthreads();
    if (t % (2*stride) == 0)
        partialSum[t] += partialSum[t+stride];
}
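For context, a complete kernel built around this loop might look as follows (a sketch: the kernel name is illustrative and a 512-thread block is assumed, one block per 512-element segment):

__global__ void sumReduce(const float *g_in, float *g_out)
{
    __shared__ float partialSum[512];                    // one element per thread
    unsigned int t = threadIdx.x;
    partialSum[t] = g_in[blockIdx.x * blockDim.x + t];   // load phase

    for (unsigned int stride = 1; stride < blockDim.x; stride *= 2) {
        __syncthreads();
        if (t % (2*stride) == 0)
            partialSum[t] += partialSum[t + stride];
    }
    if (t == 0)
        g_out[blockIdx.x] = partialSum[0];               // final sum for this block
}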
The “Bank Conflicts” Aspect
[Figure: reduction tree over array elements 0..15; iteration 1 forms the pairwise sums 0+1, 2+3, 4+5, 6+7, 8+9, 10+11, ...; iteration 2 forms the sums 0..3, 4..7, 8..11, ...; iteration 3 forms the sums 0..7, 8..15]
The “Branch Divergence” Aspect
[Figure: the same reduction tree, annotated with the thread that owns each addition; in iteration 1 only the even-numbered threads are active (Thread 0, Thread 2, Thread 4, Thread 6, Thread 8, Thread 10, ...)]
Some Observations
In each iteration, two control flow paths are sequentially traversed for each warp:
Threads that perform the addition and threads that do not
Threads that do not perform the addition may still cost extra cycles, depending on the implementation of divergence
Some Observations (cont.)
No more than half of the threads will be executing at any time
All odd-index threads are disabled right from the beginning!
On average, fewer than ¼ of the threads will be activated for all warps over time
After the 5th iteration, entire warps in each block will be disabled: poor resource utilization, but no divergence. This can go on for a while, up to 4 more iterations (512/32 = 16 = 2^4), with only one thread active in each remaining warp, until all warps retire
Shortcomings of the implementation
Assume we have already loaded the array into __shared__ float partialSum[]
unsigned int t = threadIdx.x;
for (unsigned int stride = 1; stride < blockDim.x; stride *= 2) {
    __syncthreads();
    if (t % (2*stride) == 0)                     // BAD: divergence due to interleaved branch decisions
        partialSum[t] += partialSum[t+stride];   // BAD: bank conflicts due to the stride
}
A better implementation
Assume we have already loaded the array into __shared__ float partialSum[]
unsigned int t = threadIdx.x;
for (unsigned int stride = blockDim.x >> 1; stride > 0; stride >>= 1) {
    __syncthreads();
    if (t < stride)
        partialSum[t] += partialSum[t+stride];
}
[Figure: the improved access pattern illustrated on 32 elements; in the first step, thread t adds element t+16 into element t (0+16, 1+17, ..., 15+31); active threads stay contiguous, so there is no divergence until fewer than 16 sub-sums remain]
Some Observations About the New Implementation
Only the last 5 iterations will have divergence
Entire warps will be shut down as iterations progress: for a 512-thread block, it takes 4 iterations to shut down all but one warp in the block
Better resource utilization; warps will likely retire earlier, and thus the block executes faster
Recall that there are no bank conflicts either
A Potential Further Refinement, but a Bad Idea
For the last 6 loops only one warp is active (i.e., tids 0..31)
Shared memory reads and writes are SIMD-synchronous within a warp, so the idea is to skip __syncthreads() and unroll the last 6 iterations
unsigned int tid = threadIdx.x;
for (unsigned int d = n>>1; d > 32; d >>= 1) {
    __syncthreads();
    if (tid < d)
        shared[tid] += shared[tid + d];
}
__syncthreads();
if (tid < 32) {  // unroll last 6 predicated steps
    shared[tid] += shared[tid + 32];
    shared[tid] += shared[tid + 16];
    shared[tid] += shared[tid + 8];
    shared[tid] += shared[tid + 4];
    shared[tid] += shared[tid + 2];
    shared[tid] += shared[tid + 1];
}
A Potential Further Refinement, but a Bad Idea (cont.)
Concluding remarks on the “further refinement”:
This would not work properly if the warp size decreases.
It also doesn't look that attractive if the warp size increases.
Finally, you would need __syncthreads() between each statement! And having __syncthreads() in an if-statement is problematic.
Control Flow Instructions
if, switch, for, while: all of these can significantly impact the effective instruction throughput when threads of the same warp diverge
If this happens, the execution is serialized, which increases the number of instructions executed for this warp
When all the different execution paths have completed, the threads converge back to the same execution path
Not only do you execute more instructions, but you also need the logic associated with this process (book-keeping)
Predicated Execution Concept
• Thread divergence can be avoided in some cases by using the concept of predication
<p1> LDR r1,r2,0
• If p1 is TRUE, the assembly instruction above executes normally
• If p1 is FALSE, the instruction is treated as a NOP
Predication Example
if (x == 10)
    c = c + 1;

becomes:

      LDR r5, X
      p1 <- r5 eq 10
<p1>  LDR r1 <- C
<p1>  ADD r1, r1, 1
<p1>  STR r1 -> C
[Figure: control flow graph in which block A branches to either B or C, and both paths converge at D; with predication the four blocks are issued in a straight line: A, B, C, D]
Predication is very helpful for if-else
If-else example
The predicated code for both branches:

      p1,p2 <- r5 eq 10
<p1>  inst 1 from B
<p1>  inst 2 from B
<p1>  ...
<p2>  inst 1 from C
<p2>  inst 2 from C
<p2>  ...

This is what gets scheduled:

      p1,p2 <- r5 eq 10
<p1>  inst 1 from B
<p2>  inst 1 from C
<p1>  inst 2 from B
<p2>  inst 2 from C
      ...
The cost is that extra instructions will be issued each time the code is executed. However, there is no branch divergence.
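At the source level, the schedule above corresponds to a short, balanced branch of roughly the following shape (a hypothetical example; the function name and the statements in B and C are purely illustrative):

__device__ void predicationCandidate(int x, int *c, int *d)
{
    if (x == 10) {        // block B: issued under predicate p1
        *c = *c + 1;
        *d = *d * 2;
    } else {              // block C: issued under predicate p2
        *c = *c - 1;
        *d = *d / 2;
    }
}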
Instruction Predication in G80
Your comparison instructions set condition codes (CC)
Instructions can be predicated to write results only when CC meets criterion (CC != 0, CC >= 0, etc.)
The compiler tries to predict if a branch condition is likely to produce many divergent warps:
If that's the case, it goes ahead and predicates if the branch has fewer than 7 instructions
If that's not the case, it only predicates if the branch has fewer than 4 instructions
Note: it's pretty bad if you predicate when it was obvious that there would have been no divergence
Instruction Predication in G80 (cont.)
ALL predicated instructions take execution cycles:
Those with false conditions don't write their output, and do not evaluate addresses or read operands
This saves branch instructions, so it can be cheaper than serializing divergent paths
If all this business is confusing, remember this: Avoid thread divergence
It’s not 100% clear to me, but I believe that there is no cost if a subset of threads belonging to a warp sits there and does nothing while the other warp threads are all running the same instruction