ME964 High Performance Computing for Engineering Applications
Gauging Kernel Performance & Control Flow in CUDA
Oct. 9, 2008
Before we get started…
Last Time: Guest Lecturer Michael Garland, Researcher at NVIDIA, on "Writing Efficient CUDA Algorithms"
Today:
Gauging the extent to which you use hardware resources in CUDA
Control Flow in CUDA
Homework related:
HW6 available for download (exclusive scan operation)
HW5, 2D matrix convolution, due at 11:59 PM today
Exercise: Does Matrix Multiplication Incur Shared Memory Bank Conflicts?
[Figure: shared memory bank layout, banks 0 through 15, illustrating the access patterns in the two scenarios below]
In Scenario A, all threads in a half-warp access the same shared memory entry, which leads to a broadcast. What is highlighted in the figure is the second step of computing the 5th row of the tile.
In Scenario A, the threads in a half-warp also access elements in neighboring banks as they walk through the computation, so no bank conflicts occur.
Scenario A: the tile matrix is computed one row at a time, each row produced by one half-warp.
Scenario B: the tile matrix is computed one column at a time, each column produced by one half-warp (work this case out yourself).
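As a reminder, the inner product over the tile looks roughly like this (a sketch; the Ms/Ns tile names and TILE_WIDTH are assumed from the tiled matrix multiplication kernel, and tile loading is omitted):

#define TILE_WIDTH 16

__global__ void tileSketch(float *P, int Width)
{
    __shared__ float Ms[TILE_WIDTH][TILE_WIDTH];   // tile of M
    __shared__ float Ns[TILE_WIDTH][TILE_WIDTH];   // tile of N
    int tx = threadIdx.x, ty = threadIdx.y;
    // ... loading of Ms and Ns omitted ...
    float Pvalue = 0.0f;
    for (int k = 0; k < TILE_WIDTH; ++k)
        // Scenario A: within a half-warp, ty is fixed and tx spans 0..15,
        // so Ms[ty][k] is a single address (broadcast) while Ns[k][tx]
        // falls into 16 consecutive banks: no conflict in either access
        Pvalue += Ms[ty][k] * Ns[k][tx];
    P[ty * Width + tx] = Pvalue;                   // illustrative write-back
}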
Final Comments, Memory Access
Given the GPU memory spaces and their latency, a typical programming pattern emerges at the thread level:
Load data from device memory into shared memory (coalesced if possible)
Synchronize with all the other threads of the block to avoid data access hazards
Process the data that you just brought over in shared memory
Synchronize as needed
Write the results back to global memory (coalesced if possible)
NOTE: for CUDA computing, always try hard to make your computation fit this model
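A minimal kernel skeleton following this pattern (the kernel name, the 256-thread block size, and the doubling step are all illustrative assumptions):

__global__ void processData(const float *g_in, float *g_out)
{
    __shared__ float s_data[256];               // one element per thread
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;

    s_data[threadIdx.x] = g_in[i];              // 1. coalesced load into shared memory
    __syncthreads();                            // 2. avoid read-before-write hazards

    float result = 2.0f * s_data[threadIdx.x];  // 3. process the data in shared memory
    // (synchronize again here if threads read each other's results)

    g_out[i] = result;                          // 4. coalesced write back to global memory
}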
CUDA Programming Common Sense Advice
Keep this in mind:
Allocating memory on the device or host is expensive
Moving data back and forth between the host and device is a killer (see the sketch after this list)
Global memory accesses are going to be slow; if they are not coalesced, they are even slower…
Make sure that you keep the SM occupied (currently, 24 warps can be managed concurrently) and busy (avoid data starvation, have it crunch numbers)
If you can, avoid bank conflicts; not that big of a deal, though.
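For instance, hoist allocations and host-device transfers out of inner loops whenever possible (a sketch; myKernel, nBlocks, nThreads, h_data, N, and numIterations are illustrative names):

// Allocate and copy once, then launch many times: the data stays on the device
float *d_buf;
cudaMalloc((void**)&d_buf, N * sizeof(float));
cudaMemcpy(d_buf, h_data, N * sizeof(float), cudaMemcpyHostToDevice);
for (int it = 0; it < numIterations; ++it)
    myKernel<<<nBlocks, nThreads>>>(d_buf);
cudaMemcpy(h_data, d_buf, N * sizeof(float), cudaMemcpyDeviceToHost);
cudaFree(d_buf);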
Gauging the level of HW use
In order to gauge how well your code uses the HW, you need to use the CUDA occupancy calculator (google it):
http://developer.download.nvidia.com/compute/cuda/CUDA_Occupancy_calculator.xls
Gauging the level of HW use (cont.)
Three things are asked of you:
Number of threads per block (this is trivial to provide)
Number of registers per thread
Number of bytes of shared memory used by each block
You get the last two quantities by adding "--ptxas-options -v" to the compile command line:
$(CUDA_BIN_PATH)\nvcc.exe -cuda --ptxas-options -v -I"$(CUDA_INC_PATH)" -I./ -I../../common/inc -I"$(VCInstallDir)\include" -I"$(VCInstallDir)\PlatformSDK\include" -o $(ConfigurationName)\matrixmul.gen.c matrixmul.cu
In Visual Studio, right-click the main .cu file, go to Properties, and edit the Custom Build Step by adding "--ptxas-options -v"
Gauging the level of HW use (cont.)
Open the object file in a text editor to find, in a pile of stuff that doesn't make any sense, a chunk of text that looks like this:
code {
    name = _Z15MatrixMulKernel6MatrixS_S_
    lmem = 0
    smem = 2112
    reg = 14
    bar = 0
    bincode {
        0x3004c815 0xe43007c0 0x10008025 0x00000003
        0xd0800205 0x00400780 0xa000000d 0x04000780
        0xa0000205 0x04000780 0x10056003 0x00000100
        0x30040601 0xc4100780 0x20000001 0x04004780
This is telling you that MatrixMulKernel (which is the name I gave my kernel) uses 2112 bytes of shared memory and 14 registers per thread, and that there is no use of local memory (lmem = 0).
Alternatively, look at the build output in Developer Studio. This is what you are interested in:
smem: 2672 bytes
registers: 9
End: Discussion on Memory Spaces (Access and Latency Issues)
Begin: Control Flow
Objective
Understand the implications of control flow on:
Branch divergence overhead
SM execution resource utilization
Learn better ways to write code with control flow
Understand compiler/HW predication, an idea meant to reduce the impact of control flow; there is a cost involved with this process
Quick terminology review
Thread: concurrent code executed on the CUDA device (in parallel with other threads), together with its associated state; the unit of parallelism in CUDA; the number of threads used is controlled by the user
Warp: a group of threads executed physically in parallel in G80; the number of threads in a warp is not controlled by the user
Block: a group of threads that are executed together and form the unit of resource assignment; the number of blocks used is controlled by the user
Grid: a group of thread blocks that must all complete before the next phase of the program can begin
How thread blocks are partitioned
Each thread block is partitioned into warps; thread IDs within a warp are consecutive and increasing
Remember: in multidimensional blocks, the x thread index runs first, followed by the y thread index, and finally by the z thread index (see the sketch after this list)
Warp 0 starts with Thread ID 0
Partitioning is always the same, so you can use this knowledge in control flow; however, the exact size of warps may change from release to release
While you can rely on ordering among threads, DO NOT rely on any ordering among warps
Remember, the concept of warp is not something you control through CUDA; if there are any dependencies between threads, you must use __syncthreads() to get correct results
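In code, the linear thread ID that determines warp membership can be computed as below (a minimal sketch; the kernel name and output array are illustrative, warpSize is the CUDA built-in variable):

__global__ void whichWarp(int *warpIdOut)
{
    // x runs first, then y, then z
    unsigned int tid = threadIdx.x
                     + threadIdx.y * blockDim.x
                     + threadIdx.z * blockDim.x * blockDim.y;
    warpIdOut[tid] = tid / warpSize;   // warp 0 starts with thread ID 0
}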
Control Flow Instructions
The main performance concern with branching is divergence:
Threads within a single warp take different paths
Different execution paths are serialized in G80: the control paths taken by the threads in a warp are traversed one at a time until there are none left
NOTE: Don't forget that divergence can manifest only at the warp level. You cannot discuss this concept in relation to code executed by threads in different warps
Control Flow Instructions (cont.)
A common case: avoid divergence when the branch condition is a function of the thread ID
Example with divergence: if (threadIdx.x > 2) { }
This creates two different control paths for threads in a block. The branch granularity is smaller than the warp size: threads 0, 1, and 2 follow a different path than the rest of the threads in the first warp.
Example without divergence: if (threadIdx.x / WARP_SIZE > 2) { }
This also creates two different control paths for threads in a block, but the branch granularity is a whole multiple of the warp size: all threads in any given warp follow the same path
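Both cases side by side in a minimal self-contained sketch (the kernel name and the arithmetic in the branch bodies are purely illustrative):

#define WARP_SIZE 32

__global__ void branchExamples(float *data)
{
    // Divergent: threads 0..2 and threads 3..31 of warp 0 take different paths
    if (threadIdx.x > 2)
        data[threadIdx.x] *= 2.0f;
    else
        data[threadIdx.x] += 1.0f;

    // Non-divergent: all 32 threads of any given warp take the same path
    if (threadIdx.x / WARP_SIZE > 2)
        data[threadIdx.x] -= 1.0f;
}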
Illustration: Parallel Reduction
Use the “Parallel Reduction” algorithm as a vehicle to discuss the issue of control flow
Given an array of values, “reduce” them in parallel to a single value
Examples Sum reduction: sum of all values in the array Max reduction: maximum of all values in the array
Typical parallel implementation: recursively halve the number of active threads, each thread adding two values per step; this takes log(n) steps for n elements and requires n/2 threads
A Vector Reduction Example
Assume an in-place reduction using shared memory
We are in the process of summing up a 512-element array
The shared memory is used to hold a partial sum vector
Each iteration brings the partial sum vector closer to the final sum
The final sum will be stored in element 0
A simple implementation
Assume we have already loaded the array into __shared__ float partialSum[]
unsigned int t = threadIdx.x;
for (unsigned int stride = 1; stride < blockDim.x; stride *= 2) {
    __syncthreads();
    if (t % (2*stride) == 0)
        partialSum[t] += partialSum[t+stride];
}
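For context, a complete kernel built around this loop might look as follows (a sketch: the kernel name is illustrative and a 512-thread block is assumed, one block per 512-element segment):

__global__ void sumReduce(const float *g_in, float *g_out)
{
    __shared__ float partialSum[512];                    // one element per thread
    unsigned int t = threadIdx.x;
    partialSum[t] = g_in[blockIdx.x * blockDim.x + t];   // load phase

    for (unsigned int stride = 1; stride < blockDim.x; stride *= 2) {
        __syncthreads();
        if (t % (2*stride) == 0)
            partialSum[t] += partialSum[t + stride];
    }
    if (t == 0)
        g_out[blockIdx.x] = partialSum[0];               // final sum for this block
}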
The “Bank Conflicts” Aspect
[Figure: reduction tree over array elements 0..15; iteration 1 forms the pairwise sums 0+1, 2+3, 4+5, 6+7, 8+9, 10+11, ...; iteration 2 forms the sums 0..3, 4..7, 8..11, ...; iteration 3 forms the sums 0..7, 8..15]
The “Branch Divergence” Aspect
[Figure: the same reduction tree, annotated with the thread that owns each addition; in iteration 1 only the even-numbered threads are active (Thread 0, Thread 2, Thread 4, Thread 6, Thread 8, Thread 10, ...)]
Some Observations
In each iteration, two control flow paths are sequentially traversed for each warp:
Threads that perform the addition and threads that do not
Threads that do not perform the addition may still cost extra cycles, depending on the implementation of divergence
Some Observations (cont.)
No more than half of the threads will be executing at any time
All odd-index threads are disabled right from the beginning!
On average, fewer than ¼ of the threads will be activated for all warps over time
After the 5th iteration, entire warps in each block will be disabled: poor resource utilization, but no divergence. This can go on for a while, up to 4 more iterations (512/32 = 16 = 2^4), with only one thread active in each remaining warp, until all warps retire
Shortcomings of the implementation
Assume we have already loaded the array into __shared__ float partialSum[]
unsigned int t = threadIdx.x;
for (unsigned int stride = 1; stride < blockDim.x; stride *= 2) {
    __syncthreads();
    if (t % (2*stride) == 0)                     // BAD: divergence due to interleaved branch decisions
        partialSum[t] += partialSum[t+stride];   // BAD: bank conflicts due to the stride
}
A better implementation
Assume we have already loaded the array into __shared__ float partialSum[]
unsigned int t = threadIdx.x;
for (unsigned int stride = blockDim.x >> 1; stride > 0; stride >>= 1) {
    __syncthreads();
    if (t < stride)
        partialSum[t] += partialSum[t+stride];
}
[Figure: the improved access pattern illustrated on 32 elements; in the first step, thread t adds element t+16 into element t (0+16, 1+17, ..., 15+31); active threads stay contiguous, so there is no divergence until fewer than 16 sub-sums remain]
Some Observations About the New Implementation
Only the last 5 iterations will have divergence
Entire warps will be shut down as iterations progress: for a 512-thread block, it takes 4 iterations to shut down all but one warp in the block
Better resource utilization; warps will likely retire earlier, and thus the block executes faster
Recall that there are no bank conflicts either
A Potential Further Refinement, but a Bad Idea
For the last 6 loops only one warp is active (i.e., tids 0..31)
Shared memory reads and writes are SIMD-synchronous within a warp, so the idea is to skip __syncthreads() and unroll the last 6 iterations
unsigned int tid = threadIdx.x;
for (unsigned int d = n>>1; d > 32; d >>= 1) {
    __syncthreads();
    if (tid < d)
        shared[tid] += shared[tid + d];
}
__syncthreads();
if (tid < 32) {  // unroll last 6 predicated steps
    shared[tid] += shared[tid + 32];
    shared[tid] += shared[tid + 16];
    shared[tid] += shared[tid + 8];
    shared[tid] += shared[tid + 4];
    shared[tid] += shared[tid + 2];
    shared[tid] += shared[tid + 1];
}
A Potential Further Refinement, but a Bad Idea (cont.)
Concluding remarks on the “further refinement”:
This would not work properly if the warp size decreases.
It also doesn't look that attractive if the warp size increases.
Finally, you would need __syncthreads() between each statement! And having __syncthreads() in an if-statement is problematic.
Control Flow Instructions
if, switch, for, while: all of these can significantly impact the effective instruction throughput when threads of the same warp diverge
If this happens, the execution is serialized, which increases the number of instructions executed for this warp
When all the different execution paths have completed, the threads converge back to the same execution path
Not only do you execute more instructions, but you also need the logic associated with this process (book-keeping)
Predicated Execution Concept
• Thread divergence can be avoided in some cases by using the concept of predication
<p1> LDR r1,r2,0
• If p1 is TRUE, the assembly instruction above executes normally
• If p1 is FALSE, the instruction is treated as a NOP
Predication Example
if (x == 10)
    c = c + 1;

becomes:

      LDR r5, X
      p1 <- r5 eq 10
<p1>  LDR r1 <- C
<p1>  ADD r1, r1, 1
<p1>  STR r1 -> C
[Figure: control flow graph in which block A branches to either B or C, and both paths converge at D; with predication the four blocks are issued in a straight line: A, B, C, D]
Predication is very helpful for if-else
If-else example
The predicated code for both branches:

      p1,p2 <- r5 eq 10
<p1>  inst 1 from B
<p1>  inst 2 from B
<p1>  ...
<p2>  inst 1 from C
<p2>  inst 2 from C
<p2>  ...

This is what gets scheduled:

      p1,p2 <- r5 eq 10
<p1>  inst 1 from B
<p2>  inst 1 from C
<p1>  inst 2 from B
<p2>  inst 2 from C
      ...
The cost is that extra instructions will be issued each time the code is executed. However, there is no branch divergence.
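At the source level, the schedule above corresponds to a short, balanced branch of roughly the following shape (a hypothetical example; the function name and the statements in B and C are purely illustrative):

__device__ void predicationCandidate(int x, int *c, int *d)
{
    if (x == 10) {        // block B: issued under predicate p1
        *c = *c + 1;
        *d = *d * 2;
    } else {              // block C: issued under predicate p2
        *c = *c - 1;
        *d = *d / 2;
    }
}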
Instruction Predication in G80
Your comparison instructions set condition codes (CC)
Instructions can be predicated to write results only when CC meets criterion (CC != 0, CC >= 0, etc.)
The compiler tries to predict if a branch condition is likely to produce many divergent warps:
If that's the case, it goes ahead and predicates if the branch has fewer than 7 instructions
If that's not the case, it only predicates if the branch has fewer than 4 instructions
Note: it's pretty bad if you predicate when it was obvious that there would have been no divergence
Instruction Predication in G80 (cont.)
ALL predicated instructions take execution cycles:
Those with false conditions don't write their output, and do not evaluate addresses or read operands
This saves branch instructions, so it can be cheaper than serializing divergent paths
If all this business is confusing, remember this: Avoid thread divergence
It’s not 100% clear to me, but I believe that there is no cost if a subset of threads belonging to a warp sits there and does nothing while the other warp threads are all running the same instruction