GKLEE: Concolic Verification and Test Generation for GPUs

Guodong Li, Fujitsu Labs of America
Peng Li, Geof Sawaya, and Ganesh Gopalakrishnan, School of Computing, University of Utah
Indradeep Ghosh and Sreeranga P. Rajan, Fujitsu Labs of America

Work associated with The Center for Parallel Computing (CPU), and the Gauss Group at Utah
http://www.cs.utah.edu/fv/GKLEE



Motivation for this work

• GPUs are exciting in so many ways
  – Parallelism for the masses!
  – Growing relevance: hand-held devices to Exascale
• There are many ways to arrive at GPU code:
  – Write it from scratch
  – Various compilation approaches
• Debugging GPU code is important
  – Library functions, students learning GPU programming, ...
  – Compiler transformations need to be verified as well
• We contribute GKLEE, a tool that finds real bugs
  – Main take-away message: formal methods can be exciting and practical in the GPU domain!


What is GKLEE?

• A CUDA/C++ Concrete+Symbolic Execution Tool
  – Designers can decide which variables to declare as symbolic
  – Symbolic execution considers all possible values
    • Not just the test inputs that the designer happened to pick
    • This is made possible by the power of SMT (constraint) solving
  – Provides far more incisive coverage
  – Yet bugs are displayed as concrete traces
  – Concolic tools can also generate tests that can be run on the HW
• GKLEE also models all possible schedules
  – E.g. different warps executed in different orders
  – Helps expose bugs that are execution platform dependent
  – GKLEE does this very efficiently by exploring a canonical schedule


Value of GKLEE to CUDA Programmers

• Finds deadlocks caused by incorrect uses of __syncthreads
  – GKLEE detects barriers that are not textually aligned
• GKLEE can help verify functional correctness
  – Verification can be conducted over symbolic inputs
• Detects many types of races
  – Shared memory races:
    • Intra-warp under warp divergence (we call it "porting race")
    • Intra-warp without warp divergence
    • Inter-warp races
  – Global memory races
  – GKLEE can solve control flow constraints and generate test input that exposes races (Example-5 presented later)


Value of GKLEE to CUDA Programmers (contd.)

• Detects many causes of performance loss
  – Bank conflicts, warp divergences, non-coalesced memory accesses
  – Currently reported as % of affected Barrier Intervals / Warps
  – Considers all inputs and schedules
    • Again, it is sufficient to analyze the canonical schedule
• Multi-kernel examples with 2K threads have been verified
  – Additional scalability through parameterized verification (in progress)
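To make the bank-conflict item concrete, here is a minimal host-side C++ sketch of how a conflict degree could be computed for one group of simultaneous shared-memory accesses. It is not GKLEE's implementation. It assumes 4-byte words assigned round-robin to banks (16 banks per half-warp under compute capability 1.x rules, 32 banks per warp under 2.x rules) and treats several accesses to the same word as one broadcast, which simplifies the real hardware rules. The names `conflictDegree` and `stridedAccess` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <vector>

// Largest number of distinct words that fall into one bank for a group of
// simultaneous accesses. 0 means no accesses, 1 means conflict-free, and
// d > 1 means the access would be serialized d ways.
// Assumption: accesses to the same word are counted once (broadcast).
unsigned conflictDegree(const std::vector<unsigned>& wordIndices, unsigned numBanks) {
    assert(numBanks > 0);
    std::map<unsigned, std::set<unsigned>> wordsPerBank;
    for (unsigned word : wordIndices) {
        wordsPerBank[word % numBanks].insert(word);
    }
    std::size_t worst = 0;
    for (const auto& entry : wordsPerBank) {
        worst = std::max(worst, entry.second.size());
    }
    return static_cast<unsigned>(worst);
}

// Word indices touched when thread tid accesses element tid * stride.
std::vector<unsigned> stridedAccess(unsigned threads, unsigned stride) {
    std::vector<unsigned> words;
    for (unsigned tid = 0; tid < threads; ++tid) {
        words.push_back(tid * stride);
    }
    return words;
}
```

Under these assumptions, a unit-stride access by 16 threads over 16 banks has degree 1, while a stride of 2 has degree 2.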


Architecture of GKLEE

• GKLEE was realized by extending KLEE (Cadar, Dunbar, Engler, OSDI 2008)
• GKLEE employs a symbolic virtual machine that "understands" CUDA


Symbolic Virtual Machine of GKLEE


GKLEE through examples

– Basic usage (including Emacs mode)
– Example-1: Porting a prefix-sum example
  • Automation of these steps is in progress
– Example-2: Bitonic Sort
  • Shows automatic test generation covering execution paths
– Example-3: Deliberately introduced deadlock (Sanders/Kandrot, p. 88)
  • Textbook shows risk of "too much optimization"; GKLEE can be a safety net
– Example-4: A multi-kernel example: (AB)^T = B^T A^T
  • The whole assertion was verified for 2K threads
  • A broken calculation is caught immediately (no wading through results)
– Example-5: Detecting "unexpected" bank conflicts
  • Code claims that all bank conflicts have been eliminated
  • Yet GKLEE finds bank conflicts (and provides a scenario)
– Example-6: Input-dependent race/bank conflict in SDK kernel
  • The racing location was input-dependent (also for bank conflict)
  • Without symbolic analysis, nearly impossible to hit these errors


GKLEE Features not covered by these examples (see our paper)

– Test generation and reduction heuristics
  • Scripts to convert GKLEE tests to hardware
– Different kinds of races
  • Shared memory vs. global memory races
  • Intra-warp races
    – With warp divergence ("porting race")
    – Without warp divergence
  • Inter-warp races
– Bank conflicts and non-coalesced accesses
  • Computed with respect to compute capability 1.x and 2.x rules
– Bugs as a function of compiler optimization level revealed
  • Volatile bugs
  • Other compilation issues
– Handy emacs-mode with
  • Thread, block, warp stepping
  • Ability to see LLVM byte-codes
  • Trace actions wrt source code


Basic Usage

• Shell mode
    klee-l++ <options> file.C
    gklee --device-capability=[0,1,2] file.o
• Emacs mode
  – Load gklee-mode.el
  – Visit buffer file.C
  – ESC-gr: run GKLEE
  – View results in buffers *gklee-compile-debug*, *gklee-run-debug*, and *gklee-run*
  – Click on trace files in *gklee-run* and navigate traces
  – See GKLEE manual for commands that control various views


Example-1: Porting Prefix-Sum (example adapted from the Allinea DDT distribution)

#include "cutil.h"
#include "klee.h"
#include <stdio.h>
#include <stdlib.h>

bool verify(int data[], int ROM_data[], int length)
{
    // Do a prefix-sum sequentially onto ROM_data
    for (int i = 1; i < length; ++i) {
        ROM_data[i] += ROM_data[i-1];
        printf("ROM_data[%d]=%d\n", i, ROM_data[i]);
    }

    // Now, verify
    for (int i = 1; i < length; ++i) {
        if (data[i] != ROM_data[i]) {
            printf("error, results disagree at loc %d\n", i);
            return false;
        }
    }
    return true;
}

//#define BLOCK_SIZE 64
#define BLOCK_SIZE 32

__global__ void prefixsumblock(int *in, int *out, int length);
__global__ void correctsumends(int *ends, int *in, int *out);
__global__ void gathersumends(int *in, int *out);
__global__ void zarro(int *data, int length);

void prefixsum(int* in, int *out, int length)
{
    //dim3 dimGrid(blocks, 1, 1);
    __modify_Grid(blocks, 1);

    //dim3 dimBlock(BLOCK_SIZE, 1, 1);
    __modify_Block(BLOCK_SIZE, 1, 1);

    __begin_GPU();
    zarro(out, length);
    __end_GPU();
    ...
}


Example-1 (contd.)

int main(int argc, char *argv[])
{
    int length;
    if (argc < 2) {
        length = NITEMS;
    }
    else
        length = atoi(argv[1]);

    int *data = (int*) malloc(length * sizeof(int));
    int *ROM_data = (int*) malloc(length * sizeof(int));

    // The symbolic size must match the size of the allocation above
    klee_make_symbolic(data, length * sizeof(int), "data_symb");

    klee_assume(data[0] != data[1]);

    // Copy all the symbolic stuff in!
    for (int i = 0; i < length; ++i) {
        ROM_data[i] = data[i];
    }

    // Fun fooling compiler, making it do both paths
    if (data[0] < data[1]) {
        printf("a\n");
        cudasummer(data, length);
    }
    else {
        printf("b\n");
        cudasummer(data, length);
    }

    if (length < 1000)
        for (int i = 0; i < length; ++i) {
            printf("%d\n", data[i]);
        }

    // Symbolic verification
    verify(data, ROM_data, length);
}


Example-1 (contd.)

• With the indicated changes, the example can be easily verified
• With the trick to force the compiler to consider both paths, we can examine the behavior under two scenarios
• The kernel verifies fine
• Seeded calculation bugs are easily caught (try breaking the computation)


Example-2: Bitonic Sorting

• CUDA SDK 2.0 example
• Can be verified for functional correctness
• Concolic verifier generates 28 (or so) paths
  – For each conditional, GKLEE forks two executions
  – Test limiting heuristics are available
    • -Path-Reduce:
      – B: Item covered by some thread at least once
      – T: Item covered by all threads at least once


Example-2: Bitonic Sorting (contd.)

#include "cutil.h"
#include "klee.h"
#include "stdio.h"

#ifdef _SYM
#define NUM 4
#else
#define NUM 6
#endif

__shared__ int shared[NUM];

__device__ inline void swap(int & a, int & b) {
    int tmp = a;
    a = b;
    b = tmp;
}

__global__ void BitonicKernel(int * values)
{
    unsigned int tid = threadIdx.x;
    // Copy input to shared mem.
    shared[tid] = values[tid];
    printf("tid: %d, blockDim: %d\n", tid, blockDim.x);
    __syncthreads();

    // Parallel bitonic sort.
    for (unsigned int k = 2; k <= blockDim.x; k *= 2) {
        for (unsigned int j = k / 2; j > 0; j /= 2) {
            unsigned int ixj = tid ^ j;
            if (ixj > tid) {
                if ((tid & k) == 0) {
                    if (shared[tid] > shared[ixj])
                        swap(shared[tid], shared[ixj]);
                }
                else {
                    if (shared[tid] < shared[ixj])
                        swap(shared[tid], shared[ixj]);
                }
            }
            __syncthreads();
        }
    }
    // Write result.
    values[tid] = shared[tid];
}

int main() {
#ifdef _SYM
    //__device__ int values[NUM];
    __input__ int *values = (int *)malloc(sizeof(int) * NUM);
#else
    __input__ int values[NUM] = {6, 5}; // , 2, 1, 4, 3};
    //__input__ int values[NUM] = {6, 5, 2, 1, 4, 3};

    // for debugging
    printf("\nInput values:\n");
    for (int i = 0; i < NUM; i++) {
        printf("%u ", values[i]);
    }
    printf("\n");
#endif
    klee_make_symbolic(values, sizeof(int)*NUM, "values");

    int *dvalues;
    cudaMalloc((void **)&dvalues, sizeof(int) * NUM);
    cudaMemcpy(dvalues, values, sizeof(int) * NUM, cudaMemcpyHostToDevice);
    __modify_Block(NUM);
    __begin_GPU();
    BitonicKernel(dvalues);
    __end_GPU();

#ifndef _SYM
    // for debugging
    for (int i = 0; i < NUM; i++) {
        printf("%d ", values[i]);
    }
    printf("\n");
#endif

    // here blockDim.x should be NUM; we use this hack
    for (int i = 1; i < NUM; i++) {
        if (dvalues[i] < dvalues[i-1]) {
            printf("The sorting algorithm is incorrect since values[%d] < values[%d]!\n", i, i-1);
            return 1;
        }
    }

    cudaFree(dvalues);
    cudaFree(values);

    return 0;
}
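As a rough host-side analogue of the sortedness check in this example, the C++ sketch below runs the kernel's compare-exchange logic for one block and brute-forces all small inputs. It only stands in for symbolic execution: GKLEE reasons about all inputs through constraint solving rather than enumeration. The function names are hypothetical, and the block size is assumed to be a power of two.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

// Host-side stand-in for one thread block running the bitonic kernel.
// Each (k, j) step sits between two barriers, and within a step the
// pairs (tid, tid ^ j) are disjoint, so visiting the "threads" one
// after another gives the same result as running them in parallel.
void bitonicSortHost(std::vector<int>& values) {
    const unsigned n = static_cast<unsigned>(values.size());
    assert(n != 0 && (n & (n - 1)) == 0);  // assumption: power-of-two size
    for (unsigned k = 2; k <= n; k *= 2) {
        for (unsigned j = k / 2; j > 0; j /= 2) {
            for (unsigned tid = 0; tid < n; ++tid) {
                const unsigned ixj = tid ^ j;
                if (ixj > tid) {
                    const bool ascending = (tid & k) == 0;
                    if (ascending ? values[tid] > values[ixj]
                                  : values[tid] < values[ixj]) {
                        std::swap(values[tid], values[ixj]);
                    }
                }
            }
        }
    }
}

// Brute-force check: every input drawn from {0, ..., n-1}^n comes out sorted.
bool sortsAllSmallInputs(unsigned n) {
    std::vector<int> digits(n, 0);
    while (true) {
        std::vector<int> v = digits;
        bitonicSortHost(v);
        if (!std::is_sorted(v.begin(), v.end())) {
            return false;
        }
        // Advance the base-n counter; stop once it wraps around.
        unsigned pos = 0;
        while (pos < n && ++digits[pos] == static_cast<int>(n)) {
            digits[pos] = 0;
            ++pos;
        }
        if (pos == n) {
            return true;
        }
    }
}
```

Calling `sortsAllSmallInputs(4)` covers the 256 inputs over four distinct values, which matches the `NUM 4` setting used in the symbolic configuration.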


Example-3: Deadlock due to incorrect __syncthreads call in dot-product
(Illustration p. 88, Sanders and Kandrot, "CUDA By Example")

// buggy code suggested on page 88
while (i != 0) {
    if (cacheIndex < i) {
        cache[cacheIndex] += cache[cacheIndex + i];
        __syncthreads();
    }
    i /= 2;
}

// begin corrected code as suggested on page 88
while (i != 0) {
    if (cacheIndex < i)
        cache[cacheIndex] += cache[cacheIndex + i];
    __syncthreads();
    i /= 2;
}

Report:
GKLEE: Thread 128 and Thread 127 encounter different barrier sequences, one hits the end of kernel, but the other does not!
t128 found a deadlock: #barriers at the threads:
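The mismatch in the report can be reproduced with a small host-side C++ sketch that counts how many barriers each thread would reach in the two loop variants. It assumes 256 threads per block with `i` starting at half the block size, which the slide does not state. Comparing counts is only a simplification of checking that barriers are textually aligned, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Number of __syncthreads() calls each thread would reach in the
// reduction loop. barrierInsideIf = true models the buggy variant,
// false the corrected one.
// Assumption: i starts at threadsPerBlock / 2 (not shown on the slide).
std::vector<int> barrierCounts(int threadsPerBlock, bool barrierInsideIf) {
    assert(threadsPerBlock > 0);
    std::vector<int> counts(static_cast<std::size_t>(threadsPerBlock), 0);
    for (int cacheIndex = 0; cacheIndex < threadsPerBlock; ++cacheIndex) {
        for (int i = threadsPerBlock / 2; i != 0; i /= 2) {
            if (cacheIndex < i && barrierInsideIf) {
                ++counts[cacheIndex];  // barrier reached only on the taken branch
            }
            if (!barrierInsideIf) {
                ++counts[cacheIndex];  // barrier reached on every iteration
            }
        }
    }
    return counts;
}

// True when every thread reaches the same number of barriers.
bool barriersAligned(const std::vector<int>& counts) {
    for (int c : counts) {
        if (c != counts.front()) {
            return false;
        }
    }
    return true;
}
```

With 256 threads, the buggy variant has thread 127 reaching one barrier and thread 128 reaching none, which is consistent with the pair of threads named in the report.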


Example-4: Symbolic verification of multi-kernel example
Verify symbolically that (AB)^T = B^T A^T, for matrices A, B

int main(int argc, char* argv[])
{
    // const unsigned int seed = 99;
    //doGkleeTransposeTest();
    //doGkleeMultTest();

    // A^T ...
    int *A, *AT; // A: [64 * 32]
    cudaMalloc((void **)&A, sizeof(int) * AN);
    cudaMalloc((void **)&AT, sizeof(int) * AN);
    // Make the input 'A' as symbolic...
    klee_make_symbolic(A, sizeof(int) * AN, "A_var");
    __modify_Grid(GRIDSIZE_X, P/BLOCKSIZE); // (1, 2)
    __modify_Block(BLOCKSIZE, BLOCKSIZE);   // (8, 8)
    __begin_GPU();
    MatTrans(A, AT);
    __end_GPU();
    printf("After A's transpose!\n");

    // B^T ...
    int *B, *BT; // B: [32 * 64]
    cudaMalloc((void **)&B, sizeof(int) * BN);
    cudaMalloc((void **)&BT, sizeof(int) * BN);
    // Make the input 'B' as symbolic...
    klee_make_symbolic(B, sizeof(int) * BN, "B_var");
    __modify_Grid(P/BLOCKSIZE, GRIDSIZE_Y); // (1, 2)
    __modify_Block(BLOCKSIZE, BLOCKSIZE);   // (8, 8)
    __begin_GPU();
    MatTrans(B, BT);
    __end_GPU();
    printf("After B's transpose!\n");

    // A^T * B^T = C...
    int *C;
    cudaMalloc((void **)&C, sizeof(int) * CN);
    __modify_Grid(GRIDSIZE_Y, GRIDSIZE_X);  // (1, 1)
    __modify_Block(BLOCKSIZE, BLOCKSIZE);   // (8, 8)
    __begin_GPU();
    matrixMul(AT, BT, C, P, DIM_X);
    __end_GPU();
    printf("After AT and BT multiplication !\n");

    // B * A = T
    int *T;
    cudaMalloc((void **)&T, sizeof(int) * CN);
    __modify_Grid(GRIDSIZE_Y, GRIDSIZE_X);
    __modify_Block(BLOCKSIZE, BLOCKSIZE);
    __begin_GPU();
    matrixMul(B, A, T, P, DIM_X);
    __end_GPU();

    // T^T = C'
    int *C_P;
    cudaMalloc((void **)&C_P, sizeof(int) * CN);
    __modify_Grid(GRIDSIZE_X, GRIDSIZE_Y);
    __modify_Block(BLOCKSIZE, BLOCKSIZE);
    __begin_GPU();
    MatTrans(T, C_P);
    __end_GPU();

    if (!matricesEquiv(C, C_P, CN)) {
        printf("**********************\n");
        printf("Post-condition failed!\n");
        printf("**********************\n");
    } else {
        printf("**********************\n");
        printf("Post-condition succeeded!\n");
        printf("**********************\n");
    }

    cudaFree(C_P);
    cudaFree(T);
    cudaFree(C);
    cudaFree(B);
    cudaFree(BT);
    cudaFree(A);
    cudaFree(AT);
}


Example-4 (contd.): The actual seeded bug

Inside the matrix multiplication routine:

//--correct--> C[c + wB * ty + tx] = Csub;
//--buggy-> C[c + wB * ty + tx] = ++Csub;

Performs this verification in under a minute on a slow laptop
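The post-condition of this example can be mirrored on the host with concrete matrices. The C++ sketch below is only an illustration: it uses small sequential routines in place of the kernels, hypothetical names (`transpose`, `multiply`, `postConditionHolds`), and seeds the off-by-one bug into the A^T * B^T product only, which is an assumption because the slides do not say which call sites use the buggy routine.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<int>>;

Matrix transpose(const Matrix& m) {
    assert(!m.empty());
    Matrix t(m[0].size(), std::vector<int>(m.size()));
    for (std::size_t r = 0; r < m.size(); ++r) {
        for (std::size_t c = 0; c < m[r].size(); ++c) {
            t[c][r] = m[r][c];
        }
    }
    return t;
}

// Plain triple-loop product. With seedBug set, each stored element is one
// too large, mimicking the effect of storing ++Csub instead of Csub.
Matrix multiply(const Matrix& a, const Matrix& b, bool seedBug = false) {
    assert(!a.empty() && !b.empty() && a[0].size() == b.size());
    Matrix result(a.size(), std::vector<int>(b[0].size(), 0));
    for (std::size_t r = 0; r < a.size(); ++r) {
        for (std::size_t c = 0; c < b[0].size(); ++c) {
            int sum = 0;
            for (std::size_t k = 0; k < b.size(); ++k) {
                sum += a[r][k] * b[k][c];
            }
            result[r][c] = seedBug ? sum + 1 : sum;
        }
    }
    return result;
}

// Same shape as the slide's check: C = A^T * B^T versus C' = (B * A)^T.
// Assumption: the seeded bug affects only the first product.
bool postConditionHolds(const Matrix& a, const Matrix& b, bool seedBug) {
    const Matrix c = multiply(transpose(a), transpose(b), seedBug);
    const Matrix cPrime = transpose(multiply(b, a));
    return c == cPrime;
}
```

A concrete run like this checks one input pair; the point of the symbolic run is that the same comparison is established for all values of A and B at once.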


Example-5: Find deep race and bank conflict
These are functions of the input. The designer picks the locations to make symbolic.

int main() {
    __device__ unsigned int d_Histogram[BIN_COUNT];
    __device__ unsigned int d_Data[DATA_N];
    unsigned int h_result[BIN_COUNT];

    __device__ unsigned int data[10];
    klee_make_symbolic(data, sizeof(data), "input");
    for (int i = 0; i < 10; i++)
        d_Data[i] = data[i];
    ...
}

inline void addData64(unsigned char *s_Hist, int threadPos, unsigned int data)
{
    s_Hist[threadPos + IMUL(data, THREAD_N)]++;
}


// The first 10 elements of d_Data array are symbolic
__global__ void histogram64Kernel(unsigned *d_Result, unsigned *d_Data, int dataN)
{
    const int threadPos = ((threadIdx.x & (~63)) >> 0) |
                          ((threadIdx.x & 15) << 2) |
                          ((threadIdx.x & 48) >> 4);
    ...
    __syncthreads();
    for (int pos = IMUL(blockIdx.x, blockDim.x) + threadIdx.x; pos < dataN;
         pos += IMUL(blockDim.x, gridDim.x)) {
        unsigned data4 = d_Data[pos];

        addData64(s_Hist, threadPos, (data4 >> 2) & 0x3FU);
        addData64(s_Hist, threadPos, (data4 >> 10) & 0x3FU);
        addData64(s_Hist, threadPos, (data4 >> 18) & 0x3FU);
        addData64(s_Hist, threadPos, (data4 >> 26) & 0x3FU);
    }
    __syncthreads();
    ...
}


// The first 10 elements of d_Data array are symbolic
__global__ void histogram64Kernel(unsigned *d_Result, unsigned *d_Data, int dataN)
{
    const int threadPos = ((threadIdx.x & (~63)) >> 0) |
                          ((threadIdx.x & 15) << 2) |
                          ((threadIdx.x & 48) >> 4);
    ...
    __syncthreads();
    for (int pos = IMUL(blockIdx.x, blockDim.x) + threadIdx.x; pos < dataN;
         pos += IMUL(blockDim.x, gridDim.x)) {
        // d_Data[0] (symbolic) for thread 0, and d_Data[8] (symbolic) for thread 8 ...
        unsigned data4 = d_Data[pos];
        // threadPos: 0 for thread 0, and threadPos: 32 for thread 8
        // s_Hist[threadPos + IMUL(data, THREAD_N)]++;
        // threadPos + IMUL(data, THREAD_N) is being shown below... THREAD_N is 32
        //   threadPos + 32 * ((d_Data[pos] >> 2) & 0x3FU)
        //   = threadPos + ((d_Data[pos] << 3) & 2016)
        // Constraint: 0 + ((d_Data[0] << 3) & 2016) == 32 + ((d_Data[8] << 3) & 2016)
        // This constraint is satisfied when d_Data[0] is \x50\xa4\xb8\x84
        // and d_Data[8] is \x4c\xa0\xb4\x80
        addData64(s_Hist, threadPos, (data4 >> 2) & 0x3FU);
        ...
    }
    __syncthreads();
    ...
}


Generated test case leading to Race

• ktest file : 'klee-last/test000005.ktest'
• args : ['histogram64_kernel.o']
• num objects: 1
• object 0: name: 'input'
• object 0: size: 40
• object 0: data: '\x50\xa4\xb8\x84\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x4c\xa0\xb4\x80\x00\x00\x00\x00'
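The generated witness can be checked by hand. The C++ sketch below recomputes the `s_Hist` index of the first `addData64` call for threads 0 and 8, using the `threadPos` expression and `THREAD_N = 32` from the slides. It assumes the bytes in the .ktest dump are stored little-endian (as on an x86 host), and the helper names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

constexpr unsigned THREAD_N = 32;  // value stated on the slide

// Same bit shuffle as the kernel's threadPos computation.
unsigned threadPos(unsigned tid) {
    return ((tid & ~63u) >> 0) | ((tid & 15u) << 2) | ((tid & 48u) >> 4);
}

// Index into s_Hist touched by the first addData64 call.
unsigned histIndex(unsigned tid, std::uint32_t data4) {
    return threadPos(tid) + ((data4 >> 2) & 0x3Fu) * THREAD_N;
}

// The simplified form shown on the slide; it agrees with histIndex because
// multiplying the 6-bit field by 32 moves it to bits 5..10 (mask 2016).
unsigned histIndexSimplified(unsigned tid, std::uint32_t data4) {
    const unsigned idx = threadPos(tid) + ((data4 << 3) & 2016u);
    assert(idx == histIndex(tid, data4));
    return idx;
}

// Assemble a 32-bit word from four bytes of the dump.
// Assumption: little-endian byte order.
std::uint32_t fromLittleEndian(unsigned char b0, unsigned char b1,
                               unsigned char b2, unsigned char b3) {
    return static_cast<std::uint32_t>(b0) |
           (static_cast<std::uint32_t>(b1) << 8) |
           (static_cast<std::uint32_t>(b2) << 16) |
           (static_cast<std::uint32_t>(b3) << 24);
}
```

Under the little-endian assumption, bytes 0 to 3 of the object give `d_Data[0]` and bytes 32 to 35 give `d_Data[8]`; both threads then compute the same index, so their increments of `s_Hist` race.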


Concluding Remarks

• A Concolic Verifier for CUDA/C++
• Detects correctness / performance issues
• High coverage, automatic test generation
• Tool finds issues in well-known kernels (SDK)
• Tool demos during the talk will illustrate these examples
  – Can provide a LiveDVD or ISO image (will be posted at the URL below)
• Our paper provides details on all the issues glossed over here
  – Paper, user manual, and example code available from http://www.cs.utah.edu/fv/GKLEE
• Comments / suggestions are very welcome!


Some Future Directions

• Support for CUDA 4.0 features
  – Atomics + SIMD
  – GPU2GPU transfers
  – GPU + MPI
• Incorporate into GPU-oriented compilation frameworks
  – E.g. OpenACC, others
• Suggestions are welcome


Extra Slides


[Figure: threads P1, P2, …, Pi, Pi+1, …, Pj, with accesses A and B marked]

Consider an arbitrary schedule that brings the execution to the illustrated point, where a race FIRST occurs.

The race is between A and B.


[Figure: same threads, with the execution up to A and B shown in red]

Then clearly, the red execution is equivalent to the race-free execution, because it is occurring in the race-free region of the execution-space.


[Figure: same threads, with the canonical schedule drawn as dashed edges]

Our canonical schedule is shown by the dashed edges here.


[Figure: same threads, with the region labeled "Extra Execution"]

The extra executions should not matter, unless they themselves race!

But that race would then be caught!

So in the absence of ANY race, ALL schedules within a barrier interval are equivalent.
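The argument above suggests a simple picture of what can be checked on a single schedule: record the accesses each thread makes within one barrier interval, then look for conflicting pairs. The C++ sketch below illustrates that idea with concrete addresses. It is not GKLEE's algorithm, which works on symbolic addresses and solves constraints, and the `Access` record and function names are hypothetical.

```cpp
#include <cstddef>
#include <vector>

// One memory access made by a thread inside a barrier interval.
struct Access {
    unsigned thread;
    unsigned address;
    bool isWrite;
};

// Two accesses conflict when different threads touch the same address
// and at least one of them writes.
bool conflicts(const Access& x, const Access& y) {
    return x.thread != y.thread && x.address == y.address &&
           (x.isWrite || y.isWrite);
}

// Pairwise check over all accesses recorded in one barrier interval.
// The order in which the accesses were recorded does not matter, which is
// why recording them under one fixed schedule is enough for this check.
bool hasRace(const std::vector<Access>& interval) {
    for (std::size_t i = 0; i < interval.size(); ++i) {
        for (std::size_t j = i + 1; j < interval.size(); ++j) {
            if (conflicts(interval[i], interval[j])) {
                return true;
            }
        }
    }
    return false;
}
```

For example, a write by thread 0 and a read by thread 1 to the same address in one interval is reported as a race, while two reads of that address are not.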