1
UNIVERSITY OF VIRGINIA
Automated Dynamic Analysis of CUDA Programs

Michael Boyer, Kevin Skadron*, and Westley Weimer
University of Virginia
{boyer,skadron,weimer}@cs.virginia.edu
* currently on sabbatical with NVIDIA Research
Outline
GPGPU
CUDA
Automated analyses
  – Correctness: race conditions
  – Performance: bank conflicts
Preliminary results
Future work
Conclusion
Why GPGPU?
From: NVIDIA CUDA Programming Guide, Version 1.1
CPU vs. GPU Design
CPU: optimized for single-thread latency
GPU: optimized for aggregate throughput
From: NVIDIA CUDA Programming Guide, Version 1.1
GPGPU Programming
Traditional approach: graphics APIs
ATI/AMD: Close-to-the-Metal (CTM)
NVIDIA: Compute Unified Device Architecture (CUDA)
CUDA: Abstractions
Kernel functions
Scratchpad memory
Barrier synchronization
CUDA: Example Program

__host__ void example(int *cpu_mem) {
  cudaMalloc(&gpu_mem, mem_size);
  cudaMemcpy(gpu_mem, cpu_mem, HostToDevice);
  kernel <<< grid, threads, mem_size >>> (gpu_mem);
  cudaMemcpy(cpu_mem, gpu_mem, DeviceToHost);
}

__global__ void kernel(int *mem) {
  int thread_id = threadIdx.x;
  mem[thread_id] = thread_id;
}
CUDA: Hardware

[Diagram: the GPU contains multiprocessors 1…N, all connected to the Global Device Memory. Each multiprocessor contains processing elements 1…M (each with its own registers), a shared Instruction Unit, and a Per-Block Shared Memory (PBSM).]
Outline
GPGPU
CUDA
Automated analyses
  – Correctness: race conditions
  – Performance: bank conflicts
Preliminary results
Future work
Conclusion
Race Conditions
Ordering of instructions among multiple threads is arbitrary
Relaxed memory consistency model
Synchronization: __syncthreads()
  – Barrier / memory fence
Race Conditions: Example
1 extern __shared__ int s[];
2
3 __global__ void kernel(int *out) {
4   int id = threadIdx.x;
5   int nt = blockDim.x;
6
7   s[id] = id;
8   out[id] = s[(id + 1) % nt];
9 }
[Diagram: threads 0–5 each write s[id] (W), then each read s[(id + 1) % nt] (R). With no barrier between lines 7 and 8, a thread may read its neighbor's element before the neighbor has written it.]
Automatic Instrumentation

Original CUDA Source Code → Intermediate Representation → [Instrumentation] → Instrumented CUDA Source Code → [Compile] → [Execute] → Output: Race Conditions Detected?
Race Condition Instrumentation
Two global bookkeeping arrays:
  – Reads & writes of all threads
Two per-thread bookkeeping arrays:
  – Reads & writes of a single thread
After each shared memory access:
  – Update bookkeeping arrays
  – Detect & report race conditions
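The detection flow can be sketched as a small host-side simulation (plain Python, not the tool's actual CUDA instrumentation; only the global read/write arrays are modeled, and the per-thread arrays are omitted for brevity). It replays the earlier example kernel: every thread writes s[id], then, with no barrier in between, reads s[(id + 1) % nt].

```python
NT = 6  # number of threads in the block (matches the earlier example)

# Global bookkeeping arrays: one flag per shared-memory word
mem_reads  = [0] * NT
mem_writes = [0] * NT

# Line 7 of the example kernel: every thread writes s[id]
for tid in range(NT):
    mem_writes[tid] = 1

# Line 8, with no __syncthreads() in between: every thread
# reads s[(id + 1) % nt]
for tid in range(NT):
    mem_reads[(tid + 1) % NT] = 1

# The check one thread performs after the access: any word both
# read and written by the block indicates a potential RAW hazard
hazard = any(r and w for r, w in zip(mem_reads, mem_writes))
print("RAW hazard detected" if hazard else "no hazard")
```

Inserting a barrier between the write and read phases (and clearing the write flags at the barrier) would make the scan come back clean, which is exactly what the tool reports once __syncthreads() is added.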
Race Condition Detection

Original code:
  RAW hazard at expression:
  #line 8 out[id] = s[(id + 1) % nt];
Add synchronization between lines 7 and 8:
  No race conditions detected
Outline
GPGPU
CUDA
Automated analyses
  – Correctness: race conditions
  – Performance: bank conflicts
Preliminary results
Future work
Conclusion
Bank Conflicts
PBSM is fast
  – Much faster than global memory
  – Potentially as fast as register access
…assuming no bank conflicts
  – Bank conflicts cause serialized access
Non-Conflicting Access Patterns

[Diagram: threads 0–7 accessing banks 0–7. Stride = 1: thread i hits bank i. Stride = 3: thread i hits bank (3·i) mod 8. Both patterns are permutations, so no two threads share a bank.]
Conflicting Access Patterns

[Diagram: threads 0–7 accessing banks 0–7. Stride = 4: threads alternate between banks 0 and 4, four threads per bank. Stride = 16: all eight threads hit bank 0.]
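The stride patterns above can be checked with a short host-side sketch (plain Python, not CUDA). It assumes 16 banks and a 16-thread half-warp, as on G80-class hardware; the diagrams draw 8 of each for clarity, but the arithmetic is the same.

```python
NUM_BANKS = 16  # 16 banks per multiprocessor on G80-class hardware

def conflict_degree(stride, nthreads=16):
    """Worst-case number of threads hitting the same bank when
    thread i accesses shared-memory word i * stride."""
    banks = [(tid * stride) % NUM_BANKS for tid in range(nthreads)]
    return max(banks.count(b) for b in set(banks))

# Strides coprime to the bank count permute the banks: no conflicts
assert conflict_degree(1) == 1
assert conflict_degree(3) == 1
# Strides sharing a factor with the bank count pile threads up
assert conflict_degree(4) == 4    # 4-way conflict: 4x serialization
assert conflict_degree(16) == 16  # every thread hits bank 0
```

A degree of 1 means the half-warp's accesses proceed in parallel; a degree of k means the hardware serializes them into k conflict-free passes.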
Impact of Bank Conflicts

[Chart: runtime (seconds, 0–8) vs. iterations (millions, 0–1). One plot compares No Bank Conflicts against Maximal Bank Conflicts; a second adds Global Memory for reference.]
Automatic Instrumentation

Original CUDA Source Code → Intermediate Representation → [Instrumentation] → Instrumented CUDA Source Code → [Compile] → [Execute] → Output: Bank Conflicts Detected?
Bank Conflict Instrumentation
Global bookkeeping array:
  – Tracks address accessed by each thread
After each PBSM access:
  – Each thread updates its entry
  – One thread computes and reports bank conflicts
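The reporting step reduces to a per-bank histogram of the addresses in the bookkeeping array. A minimal host-side sketch (plain Python, not the tool's CUDA code; assumes 16 banks):

```python
NUM_BANKS = 16

def bank_access_counts(addresses):
    """Per-bank access counts for one half-warp's PBSM accesses.
    addresses[tid] is the shared-memory word index thread tid touched."""
    counts = [0] * NUM_BANKS
    for addr in addresses:
        counts[addr % NUM_BANKS] += 1
    return counts

# 16 threads striding by 16 words: every access lands in bank 0,
# matching the "Accesses: 16 0 0 ..." report on the next slide
counts = bank_access_counts([tid * 16 for tid in range(16)])
```

Any bank with a count above 1 indicates a conflict; the count itself is the serialization factor for that access.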
Bank Conflict Detection

CAUSE_BANK_CONFLICTS = false:
  No bank conflicts at: #line 14 mem[j]++
CAUSE_BANK_CONFLICTS = true:
  Bank conflicts at: #line 14 mem[j]++
  Bank:     0  1  2  3  4  5  6  7  8  9 …
  Accesses: 16 0  0  0  0  0  0  0  0  0 …
Preliminary Results
Scan
  – Included in CUDA SDK
  – All-prefix-sums operation
  – 400 lines of code
  – Explicitly prevents race conditions and bank conflicts
Preliminary Results: Race Condition Detection

Original code:
  – No race conditions detected
Remove any synchronization calls:
  – Race conditions detected
Preliminary Results: Bank Conflict Detection

Original code:
  – Small number of minor bank conflicts
Enable bank conflict avoidance macro:
  – Bank conflicts increased!
  – Confirmed by manual analysis
  – Culprit: incorrect emulation mode
Instrumentation Overhead
Two sources:
  – Emulation
  – Instrumentation
Assumption: for debugging, programmers will already use emulation mode
Instrumentation Overhead
Code Version
Execution Environmen
t
Average Runtim
e
Slowdown (Relative to Native)
Slowdown (Relative
to Emulation
)
Original Native 0.4 ms
Original Emulation 27 ms 62x
Instrumented (bank conflicts)
Emulation 71 ms 163x 2.6x
Instrumented (race
conditions)Emulation 324 ms 739x 12x
Future Work
Find more types of bugs
  – Correctness: array bounds checking
  – Performance: memory coalescing
Reduce instrumentation overhead
  – Execute instrumented code natively
Conclusion
GPGPU: enormous performance potential
  – But parallel programming is challenging
Automated instrumentation can help
  – Find synchronization bugs
  – Identify inefficient memory accesses
  – And more…
Questions?
Instrumentation tool will be available at:
http://www.cs.virginia.edu/~mwb7w/cuda
Domain Mapping
From: NVIDIA CUDA Programming Guide, Version 1.1
Coalesced Accesses
From: NVIDIA CUDA Programming Guide, Version 1.1
Non-Coalesced Accesses
From: NVIDIA CUDA Programming Guide, Version 1.1
Race Condition Detection Algorithm
A thread t knows a race condition exists at shared memory location m if:
  – Location m has been read from and written to
  – One of the accesses to m came from t
  – One of the accesses to m came from a thread other than t
Note that we are only checking for RAW and WAR hazards
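The three conditions above can be written directly as a predicate (a Python sketch; the reads and writes sets stand in for the tool's bookkeeping arrays for a single location m):

```python
def race_at(reads, writes, t):
    """From thread t's perspective: does shared location m race?
    reads/writes are the sets of thread ids that read/wrote m."""
    read_and_written = bool(reads) and bool(writes)          # m read AND written
    involves_t = t in reads or t in writes                   # one access from t
    involves_other = any(u != t for u in (reads | writes))   # one from another thread
    return read_and_written and involves_t and involves_other

# Thread 0 wrote m, thread 1 read it: a RAW/WAR hazard
assert race_at({1}, {0}, 0) == True
# All accesses from the same thread: no race
assert race_at({0}, {0}, 0) == False
# Two writers, no reader: not flagged -- WAW hazards are not checked,
# consistent with the note above
assert race_at(set(), {0, 1}, 0) == False
```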
Bank Conflicts: Example
extern __shared__ int mem[];
__global__ void kernel(int iters) {
  int min, stride, max, id = threadIdx.x;

  if (CAUSE_BANK_CONFLICTS)
    // Set stride to cause bank conflicts
  else
    // Set stride to avoid bank conflicts

  for (int i = 0; i < iters; i++)
    for (int j = min; j < max; j += stride)
      mem[j]++;
}
Instrumented Code Example
Original code:

extern __shared__ int s[];

__global__ void kernel() {
  int id = threadIdx.x;
  int nt = blockDim.x * blockDim.y * blockDim.z;

  s[id] = id;
  int temp = s[(nt + id - 1) % nt];
}

Instrumented code:

extern __shared__ int s[];

__global__ void kernel(void);
void kernel(void) {
  // Instrumentation code
  int block_size = blockDim.x * blockDim.y * blockDim.z;
  int thread_id = threadIdx.x + (threadIdx.y * blockDim.x) +
                  (threadIdx.z * blockDim.x * blockDim.y);
  __shared__ char mem_reads[PUT_ARRAY_SIZE_HERE];
  __shared__ char mem_writes[PUT_ARRAY_SIZE_HERE];
  if (thread_id == 0) {
    for (int i = 0; i < block_size; i++) {
      mem_reads[i] = 0;
      mem_writes[i] = 0;
    }
  }
  __syncthreads();
  char hazard = 0;

  int id; int nt; int temp;
  id = (int)threadIdx.x;
  nt = (int)((blockDim.x * blockDim.y) * blockDim.z);

  //#line 9
  s[id] = id;
  // Instrumentation code
  mem_writes[id] = 1;
  __syncthreads();
  if (thread_id == 0) {
    for (int i = 0; i < block_size; i++) {
      if (mem_reads[i] && mem_writes[i]) { hazard = 1; break; }
    }
    if (hazard)
      printf("WAR hazard at expression: #line 9 s[id] = id;\n");
    hazard = 0;
  }

  //#line 10
  temp = s[((nt + id) - 1) % nt];
  // Instrumentation code
  mem_reads[((nt + id) - 1) % nt] = 1;
  __syncthreads();
  if (thread_id == 0) {
    for (int i = 0; i < block_size; i++) {
      if (mem_reads[i] && mem_writes[i]) { hazard = 1; break; }
    }
    if (hazard)
      printf("RAW hazard at expression: #line 10 temp = s[((nt + id) - 1) %% nt];\n");
    hazard = 0;
  }
  //#line 11
  return;
}

Output:
RAW hazard at expression:
#line 10 temp = s[((nt + id) - 1) % nt];