
Page 1: Automated Dynamic Analysis of CUDA Programs


UNIVERSITY OF VIRGINIA

Automated Dynamic Analysis of CUDA Programs

Michael Boyer, Kevin Skadron*, and Westley Weimer
University of Virginia

{boyer,skadron,weimer}@cs.virginia.edu

* currently on sabbatical with NVIDIA Research

Page 2: Automated Dynamic Analysis of CUDA Programs


Outline

GPGPU
CUDA
Automated analyses
– Correctness: race conditions
– Performance: bank conflicts
Preliminary results
Future work
Conclusion

Page 3: Automated Dynamic Analysis of CUDA Programs


Why GPGPU?

From: NVIDIA CUDA Programming Guide, Version 1.1

Page 4: Automated Dynamic Analysis of CUDA Programs


CPU vs. GPU Design

[Figure: CPU design optimizes for single-thread latency; GPU design optimizes for aggregate throughput.]

From: NVIDIA CUDA Programming Guide, Version 1.1

Page 5: Automated Dynamic Analysis of CUDA Programs


GPGPU Programming

Traditional approach: graphics APIs
ATI/AMD: Close-to-the-Metal (CTM)
NVIDIA: Compute Unified Device Architecture (CUDA)

Page 6: Automated Dynamic Analysis of CUDA Programs


CUDA: Abstractions

Kernel functions
Scratchpad memory
Barrier synchronization

Page 7: Automated Dynamic Analysis of CUDA Programs


CUDA: Example Program

__host__ void example(int *cpu_mem) {
    cudaMalloc(&gpu_mem, mem_size);
    cudaMemcpy(gpu_mem, cpu_mem, HostToDevice);
    kernel <<< grid, threads, mem_size >>> (gpu_mem);
    cudaMemcpy(cpu_mem, gpu_mem, DeviceToHost);
}

__global__ void kernel(int *mem) {
    int thread_id = threadIdx.x;
    mem[thread_id] = thread_id;
}

Page 8: Automated Dynamic Analysis of CUDA Programs


CUDA: Hardware

[Diagram: the GPU contains multiprocessors 1..N, all connected to Global Device Memory. Each multiprocessor holds processing elements 1..M, each with its own registers, sharing one instruction unit and a Per-Block Shared Memory (PBSM).]

Page 9: Automated Dynamic Analysis of CUDA Programs


Outline

GPGPU
CUDA
Automated analyses
– Correctness: race conditions
– Performance: bank conflicts
Preliminary results
Future work
Conclusion

Page 10: Automated Dynamic Analysis of CUDA Programs


Race Conditions

Ordering of instructions among multiple threads is arbitrary

Relaxed memory consistency model

Synchronization: __syncthreads()
– Barrier / memory fence

Page 11: Automated Dynamic Analysis of CUDA Programs


Race Conditions: Example

1 extern __shared__ int s[];
2
3 __global__ void kernel(int *out) {
4     int id = threadIdx.x;
5     int nt = blockDim.x;
6
7     s[id] = id;
8     out[id] = s[(id + 1) % nt];
9 }

[Diagram: threads 0-5 each write their own element s[id] on line 7, then read the neighboring element s[(id + 1) % nt] on line 8. Without a barrier between the two lines, a thread's read can occur before its neighbor's write.]

Page 12: Automated Dynamic Analysis of CUDA Programs


Automatic Instrumentation

Original CUDA Source Code → Intermediate Representation → Instrumentation → Instrumented CUDA Source Code → Compile → Execute → Output: Race Conditions Detected?

Page 13: Automated Dynamic Analysis of CUDA Programs


Race Condition Instrumentation

Two global bookkeeping arrays:
– Reads & writes of all threads

Two per-thread bookkeeping arrays:
– Reads & writes of a single thread

After each shared memory access:
– Update bookkeeping arrays
– Detect & report race conditions

Page 14: Automated Dynamic Analysis of CUDA Programs

14

UNIVERSITY OF VIRGINIA

Race Condition Detection

Original code:
RAW hazard at expression:
#line 8 out[id] = s[(id + 1) % nt];

Add synchronization between lines 7 and 8:
No race conditions detected

Page 15: Automated Dynamic Analysis of CUDA Programs


Outline

GPGPU
CUDA
Automated analyses
– Correctness: race conditions
– Performance: bank conflicts
Preliminary results
Future work
Conclusion

Page 16: Automated Dynamic Analysis of CUDA Programs


Bank Conflicts

PBSM is fast
– Much faster than global memory
– Potentially as fast as register access

…assuming no bank conflicts
– Bank conflicts cause serialized access

Page 17: Automated Dynamic Analysis of CUDA Programs


Non-Conflicting Access Patterns

[Diagram: threads 0-7 mapped onto banks 0-7 with stride = 1 (each thread hits a distinct bank) and with stride = 3 (still a permutation of the banks, so no conflicts).]

Page 18: Automated Dynamic Analysis of CUDA Programs


Conflicting Access Patterns

0 1 2 3 4 5 6 7Threads

0 1 2 3 4 5 6 7Banks

Stride = 4

0 1 2 3 4 5 6 7Threads

0 1 2 3 4 5 6 7Banks

Stride = 16

Page 19: Automated Dynamic Analysis of CUDA Programs


Impact of Bank Conflicts

[Charts: runtime (seconds, 0-8) vs. iterations (millions, 0-1). Left: no bank conflicts vs. maximal bank conflicts. Right: the same two curves plus global memory accesses.]

Page 20: Automated Dynamic Analysis of CUDA Programs


Automatic Instrumentation

Original CUDA Source Code → Intermediate Representation → Instrumentation → Instrumented CUDA Source Code → Compile → Execute → Output: Bank Conflicts Detected?

Page 21: Automated Dynamic Analysis of CUDA Programs


Bank Conflict Instrumentation

Global bookkeeping array:
– Tracks address accessed by each thread

After each PBSM access:
– Each thread updates its entry
– One thread computes and reports bank conflicts

Page 22: Automated Dynamic Analysis of CUDA Programs


Bank Conflict Detection

CAUSE_BANK_CONFLICTS = false
No bank conflicts at:
#line 14 mem[j]++

CAUSE_BANK_CONFLICTS = true
Bank conflicts at:
#line 14 mem[j]++
Bank:      0  1  2  3  4  5  6  7  8  9  …
Accesses: 16  0  0  0  0  0  0  0  0  0  …

Page 23: Automated Dynamic Analysis of CUDA Programs


Preliminary Results

Scan
– Included in CUDA SDK
– All-prefix sums operation
– 400 lines of code
– Explicitly prevents race conditions and bank conflicts

Page 24: Automated Dynamic Analysis of CUDA Programs


Preliminary Results: Race Condition Detection

Original code:
– No race conditions detected

Remove any synchronization calls:
– Race conditions detected

Page 25: Automated Dynamic Analysis of CUDA Programs


Preliminary Results: Bank Conflict Detection

Original code:
– Small number of minor bank conflicts

Enable bank conflict avoidance macro:
– Bank conflicts increased!
– Confirmed by manual analysis
– Culprit: incorrect emulation mode

Page 26: Automated Dynamic Analysis of CUDA Programs


Instrumentation Overhead

Two sources:
– Emulation
– Instrumentation

Assumption: for debugging, programmers will already use emulation mode

Page 27: Automated Dynamic Analysis of CUDA Programs


Instrumentation Overhead

Code Version                    Execution Environment  Average Runtime  Slowdown (vs. Native)  Slowdown (vs. Emulation)
Original                        Native                 0.4 ms           –                      –
Original                        Emulation              27 ms            62x                    –
Instrumented (bank conflicts)   Emulation              71 ms            163x                   2.6x
Instrumented (race conditions)  Emulation              324 ms           739x                   12x

Page 28: Automated Dynamic Analysis of CUDA Programs


Future Work

Find more types of bugs
– Correctness: array bounds checking
– Performance: memory coalescing

Reduce instrumentation overhead
– Execute instrumented code natively

Page 29: Automated Dynamic Analysis of CUDA Programs


Conclusion

GPGPU: enormous performance potential
– But parallel programming is challenging

Automated instrumentation can help
– Find synchronization bugs
– Identify inefficient memory accesses
– And more…

Page 30: Automated Dynamic Analysis of CUDA Programs


Questions?

Instrumentation tool will be available at:
http://www.cs.virginia.edu/~mwb7w/cuda

Page 31: Automated Dynamic Analysis of CUDA Programs


Domain Mapping

From: NVIDIA CUDA Programming Guide, Version 1.1

Page 32: Automated Dynamic Analysis of CUDA Programs


Coalesced Accesses

From: NVIDIA CUDA Programming Guide, Version 1.1

Page 33: Automated Dynamic Analysis of CUDA Programs


Non-Coalesced Accesses

From: NVIDIA CUDA Programming Guide, Version 1.1

Page 34: Automated Dynamic Analysis of CUDA Programs


Race Condition Detection Algorithm

A thread t knows a race condition exists at shared memory location m if:
– Location m has been read from and written to
– One of the accesses to m came from t
– One of the accesses to m came from a thread other than t

Note that we are only checking for RAW and WAR hazards

Page 35: Automated Dynamic Analysis of CUDA Programs


Bank Conflicts: Example

extern __shared__ int mem[];

__global__ void kernel(int iters) {
    int min, stride, max, id = threadIdx.x;

    if (CAUSE_BANK_CONFLICTS)
        // Set stride to cause bank conflicts
    else
        // Set stride to avoid bank conflicts

    for (int i = 0; i < iters; i++)
        for (int j = min; j < max; j += stride)
            mem[j]++;
}

Page 36: Automated Dynamic Analysis of CUDA Programs


Instrumented Code Example

Original code:

extern __shared__ int s[];

__global__ void kernel() {
    int id = threadIdx.x;
    int nt = blockDim.x * blockDim.y * blockDim.z;

    s[id] = id;
    int temp = s[(nt + id - 1) % nt];
}

Instrumented code:

extern __shared__ int s[];

__global__ void kernel(void);
void kernel(void) {
    // Instrumentation code
    int block_size = blockDim.x * blockDim.y * blockDim.z;
    int thread_id = threadIdx.x + (threadIdx.y * blockDim.x) +
                    (threadIdx.z * blockDim.x * blockDim.y);
    __shared__ char mem_reads[PUT_ARRAY_SIZE_HERE];
    __shared__ char mem_writes[PUT_ARRAY_SIZE_HERE];
    if (thread_id == 0) {
        for (int i = 0; i < block_size; i++) {
            mem_reads[i] = 0;
            mem_writes[i] = 0;
        }
    }
    __syncthreads();
    char hazard = 0;
    int id; int nt; int temp;
    {
        id = (int)threadIdx.x;
        nt = (int)((blockDim.x * blockDim.y) * blockDim.z);
        //#line 9
        s[id] = id;
        // Instrumentation code
        mem_writes[id] = 1;
        __syncthreads();
        if (thread_id == 0) {
            for (int i = 0; i < block_size; i++) {
                if (mem_reads[i] && mem_writes[i]) { hazard = 1; break; }
            }
            if (hazard)
                printf("WAR hazard at expression: #line 9 s[id] = id;\n");
            hazard = 0;
        }
        //#line 10
        temp = s[((nt + id) - 1) % nt];
        // Instrumentation code
        mem_reads[((nt + id) - 1) % nt] = 1;
        __syncthreads();
        if (thread_id == 0) {
            for (int i = 0; i < block_size; i++) {
                if (mem_reads[i] && mem_writes[i]) { hazard = 1; break; }
            }
            if (hazard)
                printf("RAW hazard at expression: #line 10 temp = s[((nt + id) - 1) %% nt];\n");
            hazard = 0;
        }
        //#line 11
        return;
    }
}

Output:

RAW hazard at expression:
#line 10 temp = s[((nt + id) - 1) % nt];