GKLEE : Concolic Verification and Test Generation for GPUs Guodong Li Fujitsu Labs of America Peng Li, Geof Sawaya, and Ganesh Gopalakrishnan School of

GKLEE : Concolic Verification and Test Generation for GPUs

Guodong LiFujitsu Labs of America

Peng Li, Geof Sawaya, and Ganesh GopalakrishnanSchool of Computing, University of Utah

Indradeep Ghosh and Sreeranga P. RajanFujitsu Labs of America

Work associated withThe Center for Parallel Computing (CPU), and the Gauss Group at Utah

http://www.cs.utah.edu/fv/GKLEE




Motivation for this work

• GPUs are exciting in so many ways– Parallelism for the masses!– Growing relevance: hand-held devices to Exascale

• There are many ways to arrive at GPU code:– Write it from scratch– Various compilation approaches

• Debugging GPU code is important– Library functions, students learning GPU programming, ..– Compiler transformations need to be verified, as well

• We contribute GKLEE, a tool that finds real bugs– Main Take-Away Message:

Formal methods can be exciting and practical in the GPU domain !!

What is GKLEE ?

• A CUDA/C++ Concrete+Symbolic Execution Tool– Designers can decide which variables to declare as symbolic– Symbolic execution considers all possible values

• Not just the test inputs that the designer happened to pick• This is made possible by the power of SMT (constraint) solving

– Provides far more incisive coverage– Yet bugs are displayed as concrete traces– Concolic tools can also generate tests that can be run on the HW

• GKLEE also models all possible schedules– E.g. Different warps executed in different orders– Helps expose bugs that are execution platform dependent– GKLEE does this very efficiently by exploring a canonical schedule

Value of GKLEE to CUDA Programmers

• Finds deadlocks caused by incorrect uses of __syncthreads– GKLEE detects barriers that are not textually aligned

• GKLEE can help verify functional correctness• Verification can be conducted over symbolic inputs

• Detects many types of races– Shared memory races:

• Intra warp under warp divergence (we call it “porting race”)• Intra-warp without warp divergence• Inter-warp races• Global memory races• GKLEE can solve control flow constraints and generate test input

that exposes races (Example-5 presented later)

Value of GKLEE to CUDA Programmers (contd.)

• Detects many causes of performance loss– Bank conflicts, Warp divergences, Non-coalesced mem. accesses– Currently reported as % of affected Barrier Intervals / Warps– Considers all inputs and schedules

• Again, it is sufficient to analyze the canonical schedule

• Multi-kernel examples with 2K threads have been verified– Additional scalability through parameterized verification (in progress)

Architecture of GKLEE• GKLEE was realized by extending KLEE (Dunbar, Cadar, Engler – OSDI 2008)• GKLEE employs symbolic virtual machine that “understands” CUDA

Symbolic Virtual Machine of GKLEE

GKLEE through examples– Basic usage (including Emacs mode)– Example-1: Porting a prefix-sum example

• Automation of these steps is in progress

– Example-2: Bitonic Sort• Shows automatic test generation covering execution paths

– Example-3: Deliberately introduced deadlock (Sanders/Kandrot, p. 88)• Textbook shows risk of “too much optimization”; GKLEE can be safety-net

– Example-4: A multi-kernel example: (AB)^T = B^T A^T• The whole assertion was verified for 2K threads• A broken calculation immediately caught (no wading through results)

– Example-5: Detecting “unexpected” bank conflicts• Code claims that all bank conflicts have been eliminated• Yet GKLEE finds bank conflicts (and provides a scenario)

– Example-6: Input-dependent race/bank conflict in SDK kernel• The racing location was input-dependent (also for bank conflict)• Without symbolic analysis, nearly impossible to hit these errors

GKLEE Features not covered by these examples (see our paper)

– Test generation and reduction heuristics• Scripts to convert GKLEE tests to hardware

– Different kinds of races• Shared memory vs. global memory races• Intra-warp races

– With warp divergence (“porting race”)– Without warp divergence

• Inter-warp races

– Bank conflicts and non-coalesced accesses• Computed with respect to 1.x and 2.x rules

– Bugs as a function of compiler optimization level revealed• Volatile bugs• Other compilation issues

– Handy emacs-mode with• Thread, block, warp stepping• Ability to see LLVM byte-codes• Trace actions wrt source code

Basic Usage• Shell mode

klee-l++ <options> file.C gklee --device-capability=[0,1,2] file.o

• Emacs modeLoad gklee-mode.elVisit buffer file.CESC-gr – run GKLEEView results in buffers *gklee-compile-debug*, *gklee-run-debug*, and

*gklee-run*. Click on trace files in *gklee-run* and navigate tracesSee GKLEE manual for commands that control various views

Example-1 : Porting Prefix-Sum (Example adapted from Allinea DDT distribution)

#include "cutil.h"#include "klee.h"#include <stdio.h>#include <stdlib.h>

bool verify(int data[], int ROM_data[], int length){ // Do a prefix-sum sequentially onto ROM_data for (int i = 1; i < length; ++i) { ROM_data[i] += ROM_data[i-1]; printf("ROM_data[%d]=%d\n", i, ROM_data[i]); }

// Now, verify for (int i = 1 ; i < length; ++i) { if (data[i] != ROM_data[i] )

{ printf("error, results disagree at loc %d\n", i); return false; }

} return true;}

//#define BLOCK_SIZE 64 #define BLOCK_SIZE 32

__global__ void prefixsumblock(int *in, int *out, int length)__global__ void correctsumends(int *ends, int *in, int *out)__global__ void gathersumends(int *in, int *out)__global__ void zarro(int *data, int length)

void prefixsum(int* in, int *out, int length){

//dim3 dimGrid(blocks, 1, 1);__modify_Grid(blocks, 1);

//dim3 dimBlock(BLOCK_SIZE, 1, 1);__modify_Block(BLOCK_SIZE, 1, 1);

__begin_GPU();zarro(out, length);__end_GPU();...

}

Example-1 (contd.)int main(int argc, char *argv[]) {

int length;if (argc < 2) { length = NITEMS;}else length = atoi(argv[1]);

int *data = (int*) malloc(length * sizeof(int));int *ROM_data = (int*) malloc(length * sizeof(int));

klee_make_symbolic(data, NITEMS * sizeof(int), "data_symb");

klee_assume(data[0] != data[1]);

// Copy all the symbolic stuff in!for (int i = 0; i < length; ++i) { ROM_data[i] = data[i]; }

// Fun fooling compiler, making it do both pathsif (data[0] < data[1])

{ printf("a\n"); cudasummer(data, length); }else { printf("b\n"); cudasummer(data, length); }

if (length < 1000) for (int i = 0 ; i < length; ++i){

printf("%d\n", data[i]);}

// Symbolic verificationverify(data, ROM_data, length);

}

Example-1 (contd.)

• With the indicated changes, the example can be easily verified

• With the trick to force the compiler to consider both paths, we can examine the behavior under two scenarios

• The kernel verifies fine• Seeded calculation bugs are easily caught (try

breaking the computation)

Example-2: Bitonic Sorting

• CUDA SDK 2.0 example• Can be verified for functional correctness• Concolic verifier generates 28 (or so) paths– For each conditional, GKLEE forks two executions– Test limiting heuristics are available• -Path-Reduce :

– B: Item covered by some thread at least once– T : Item covered by all threads at least once

Example-2: Bitonic Sorting

#include "cutil.h"#include "klee.h"#include "stdio.h"

#ifdef _SYM#define NUM 4 #else#define NUM 6#endif

__shared__ int shared[NUM];

__device__ inline void swap(int & a, int & b) { int tmp = a; a = b; b = tmp;}__global__ void BitonicKernel(int * values){ unsigned int tid = threadIdx.x; // Copy input to shared mem. shared[tid] = values[tid]; printf("tid: %d, blockDim: %d\n", tid, blockDim.x); __syncthreads();

// Parallel bitonic sort. for (unsigned int k = 2; k <= blockDim.x; k *= 2) { for (unsigned int j = k / 2; j>0; j /= 2) { unsigned int ixj = tid ^ j; if (ixj > tid) {

if ((tid & k) == 0) { if (shared[tid] > shared[ixj]) swap(shared[tid], shared[ixj]);}else { if (shared[tid] < shared[ixj]) swap(shared[tid], shared[ixj]);}}

__syncthreads(); } } // Write result. values[tid] = shared[tid];}

#ifndef _SYM // for debugging for (int i = 0; i < NUM; i++) { printf("%d ", values[i]); } printf("\n");#endif

// here blockDim.x should be NUM; we use this hack for (int i = 1; i < NUM; i++) { if (dvalues[i] < dvalues[i-1]) { printf("The sorting algorithm is incorrect since values[%d] < values[%d]!\n", i, i-1); return 1; } }

cudaFree(dvalues); cudaFree(values);

return 0;}

int main() {#ifdef _SYM //__device__ int values[NUM]; __input__ int *values = (int *)malloc(sizeof(int) * NUM); #else __input__ int values[NUM] = {6, 5}; // , 2, 1, 4, 3}; //__input__ int values[NUM] = {6, 5, 2, 1, 4, 3};

// for debugging printf("\nInput values:\n"); for (int i = 0; i < NUM; i++) { printf("%u ", values[i]); } printf("\n");#endif klee_make_symbolic(values, sizeof(int)*NUM, "values");

int *dvalues; cudaMalloc((void **)&dvalues, sizeof(int) * NUM); cudaMemcpy(dvalues, values, sizeof(int) * NUM, cudaMemcpyHostToDevice); __modify_Block(NUM); __begin_GPU(); BitonicKernel(dvalues); __end_GPU();

Example-3: Deadlock due to incorrect __syncthread call in dot-product

(Illustration p.88, Sanders and Kandrot, “CUDA By Example”)

// buggy code suggested on page 88 while (i != 0) { if (cacheIndex < i) { cache[cacheIndex] += cache[cacheIndex + i]; __syncthreads(); } i /= 2; }

// begin corrected code as suggested on page 88 while (i != 0) { if (cacheIndex < i) cache[cacheIndex] += cache[cacheIndex + i]; __syncthreads(); i /= 2; }

GKLEE: Thread 128 and Thread 127 encounter different barrier sequences, one hits the end of kernel, but the other does not!

t128 found a deadlock: #barriers at the threads:

Report:

Example-4: Symbolic verification of multi-kernel exampleVerify symbolically that (AB)^T = B^ A^, for matrices A,B

int main(int argc, char* argv[]){ // const unsigned int seed = 99; //doGkleeTransposeTest(); //doGkleeMultTest(); // A^T ... int *A, *AT; // A: [64 * 32] cudaMalloc((void **)&A, sizeof(int) * AN); cudaMalloc((void **)&AT, sizeof(int) * AN); // Make the input 'A' as symbolic... klee_make_symbolic(A, sizeof(int) * AN, "A_var"); __modify_Grid(GRIDSIZE_X, P/BLOCKSIZE);// (1, 2) __modify_Block(BLOCKSIZE, BLOCKSIZE);// (8, 8) __begin_GPU(); MatTrans(A, AT); __end_GPU(); printf("After A's transpose!\n");

// B^T ... int *B, *BT; // B: [32 * 64] cudaMalloc((void **)&B, sizeof(int) * BN); cudaMalloc((void **)&BT, sizeof(int) * BN); // Make the input 'B' as symbolic... klee_make_symbolic(B, sizeof(int) * BN, "B_var"); __modify_Grid(P/BLOCKSIZE, GRIDSIZE_Y); // (1, 2) __modify_Block(BLOCKSIZE, BLOCKSIZE); // (8, 8) __begin_GPU(); MatTrans(B, BT); __end_GPU(); printf("After B's transpose!\n”);

// A^T * B^T = C... int *C; cudaMalloc((void **)&C, sizeof(int) * CN); __modify_Grid(GRIDSIZE_Y, GRIDSIZE_X); // (1, 1) __modify_Block(BLOCKSIZE, BLOCKSIZE); // (8, 8) __begin_GPU(); matrixMul(AT, BT, C, P, DIM_X); __end_GPU(); printf("After AT and BT multiplication !\n");

// B * A = T int *T; cudaMalloc((void **)&T, sizeof(int) * CN); __modify_Grid(GRIDSIZE_Y, GRIDSIZE_X); __modify_Block(BLOCKSIZE, BLOCKSIZE); __begin_GPU(); matrixMul(B, A, T, P, DIM_X); __end_GPU();

// T^T = C' int *C_P; cudaMalloc((void **)&C_P, sizeof(int) * CN); __modify_Grid(GRIDSIZE_X, GRIDSIZE_Y); __modify_Block(BLOCKSIZE, BLOCKSIZE); __begin_GPU(); MatTrans(T, C_P); __end_GPU();

if (!matricesEquiv(C, C_P, CN)) { printf("**********************"\n); printf("Post-condition failed!\n"); printf("**********************"\n); } else { printf("**********************"\n); printf("Post-condition succeeded!\n"); printf("**********************"\n); }

cudaFree(C_P); cudaFree(T); cudaFree(C); cudaFree(B); cudaFree(BT); cudaFree(A); cudaFree(AT);}

Example-4 (contd..): The actual seeded bug

//--correct--> C[c + wB * ty + tx] = Csub;

//--buggy-> C[c + wB * ty + tx] = ++Csub;

Inside the matrix multiplication routine…

Performs this verification in under a minute on a slow laptop

Example-5: Find deep race and bank conflict These are functions of input. Designer picks locns to make

symbolic.

int main() { __device__ unsigned int d_Histogram[BIN_COUNT]; __device__ unsigned int d_Data[DATA_N]; unsigned int h_result[BIN_COUNT];

__device__ unsigned int data[10]; klee_make_symbolic(data, sizeof(data), "input"); for (int i = 0; i < 10; i++) d_Data[i] = data[i];

inline void addData64(unsigned char *s_Hist, int threadPos, unsigned int data){ s_Hist[threadPos + IMUL(data, THREAD_N)]++;}


// The first 10 elements of d_Data array are symbolic__global__ void histogram64Kernel(unsigned *d_Result, unsigned *d_Data, int dataN){ const int threadPos = ((threadIdx.x & (~63)) >> 0) | ((threadIdx.x & 15) << 2) | ((threadIdx.x & 48) >> 4); ... __syncthreads(); for(int pos = IMUL(blockIdx.x, blockDim.x) + threadIdx.x; pos < dataN; pos += IMUL(blockDim.x, gridDim.x)) { unsigned data4 = d_Data[pos];

addData64(s_Hist, threadPos, (data4 >> 2) & 0x3FU); addData64(s_Hist, threadPos, (data4 >> 10) & 0x3FU); addData64(s_Hist, threadPos, (data4 >> 18) & 0x3FU); addData64(s_Hist, threadPos, (data4 >> 26) & 0x3FU); } __syncthreads(); ...}

Example-5 contd…


// The first 10 elements of d_Data array are symbolic__global__ void histogram64Kernel(unsigned *d_Result, unsigned *d_Data, int dataN){ const int threadPos = ((threadIdx.x & (~63)) >> 0) | ((threadIdx.x & 15) << 2) | ((threadIdx.x & 48) >> 4); ... __syncthreads(); for(int pos = IMUL(blockIdx.x, blockDim.x) + threadIdx.x; pos < dataN; pos += IMUL(blockDim.x, gridDim.x)) { unsigned data4 = d_Data[pos];

addData64(s_Hist, threadPos, (data4 >> 2) & 0x3FU); addData64(s_Hist, threadPos, (data4 >> 10) & 0x3FU); addData64(s_Hist, threadPos, (data4 >> 18) & 0x3FU); addData64(s_Hist, threadPos, (data4 >> 26) & 0x3FU); } __syncthreads(); ...}

Example-5 contd…

// The first 10 elements of d_Data array are symbolic__global__ void histogram64Kernel(unsigned *d_Result, unsigned *d_Data, int dataN){ const int threadPos = ((threadIdx.x & (~63)) >> 0) | ((threadIdx.x & 15) << 2) | ((threadIdx.x & 48) >> 4); ... __syncthreads(); for(int pos = IMUL(blockIdx.x, blockDim.x) + threadIdx.x; pos < dataN; pos += IMUL(blockDim.x, gridDim.x)) { // d_Data[0] (symbolic) for thread 0, and d_Data[8] (symbolic) for thread 8 ... unsigned data4 = d_Data[pos]; addData64(s_Hist, threadPos, (data4 >> 2) & 0x3FU); … } __syncthreads(); ...}

Example-5 contd…

// The first 10 elements of d_Data array are symbolic__global__ void histogram64Kernel(unsigned *d_Result, unsigned *d_Data, int dataN){ const int threadPos = ((threadIdx.x & (~63)) >> 0) | ((threadIdx.x & 15) << 2) | ((threadIdx.x & 48) >> 4); ... __syncthreads(); for(int pos = IMUL(blockIdx.x, blockDim.x) + threadIdx.x; pos < dataN; pos += IMUL(blockDim.x, gridDim.x)) { // d_Data[0] (symbolic) for thread 0, and d_Data[8] (symbolic) for thread 8 ... unsigned data4 = d_Data[pos]; // threadPos: 0 for thread 0, and threadPos: 32 for thread 8 // s_Hist[threadPos + IMUL(data, THREAD_N)]++; addData64(s_Hist, threadPos, (data4 >> 2) & 0x3FU); … } __syncthreads(); ...}

Example-5 contd…

// The first 10 elements of d_Data array are symbolic__global__ void histogram64Kernel(unsigned *d_Result, unsigned *d_Data, int dataN){ const int threadPos = ((threadIdx.x & (~63)) >> 0) | ((threadIdx.x & 15) << 2) | ((threadIdx.x & 48) >> 4); ... __syncthreads(); for(int pos = IMUL(blockIdx.x, blockDim.x) + threadIdx.x; pos < dataN; pos += IMUL(blockDim.x, gridDim.x)) { // d_Data[0] (symbolic) for thread 0, and d_Data[8] (symbolic) for thread 8 ... unsigned data4 = d_Data[pos]; // threadPos: 0 for thread 0, and threadPos: 32 for thread 8 // s_Hist[threadPos + IMUL(data, THREAD_N)]++; // threadPos + IMUL(data, THREAD_N) is being shown below... THREAD_N is 32 // threadPos + 32 * ((d_Data[pos] >> 2) & 0x3FU) threadPos + (((d_Data[pos] << 3) & 2016)) addData64(s_Hist, threadPos, (data4 >> 2) & 0x3FU); … } __syncthreads(); ...}

Example-5 contd…

// The first 10 elements of d_Data array are symbolic__global__ void histogram64Kernel(unsigned *d_Result, unsigned *d_Data, int dataN){ const int threadPos = ((threadIdx.x & (~63)) >> 0) | ((threadIdx.x & 15) << 2) | ((threadIdx.x & 48) >> 4); ... __syncthreads(); for(int pos = IMUL(blockIdx.x, blockDim.x) + threadIdx.x; pos < dataN; pos += IMUL(blockDim.x, gridDim.x)) { // d_Data[0] (symbolic) for thread 0, and d_Data[8] (symbolic) for thread 8 ... unsigned data4 = d_Data[pos]; // threadPos: 0 for thread 0, and threadPos: 32 for thread 8 // s_Hist[threadPos + IMUL(data, THREAD_N)]++; // threadPos + IMUL(data, THREAD_N) is being shown below... THREAD_N is 32 // threadPos + 32 * ((d_Data[pos] >> 2) & 0x3FU) threadPos + (((d_Data[pos] << 3) & 2016)) // Constraint: 0 + ((d_Data[0] << 3) & 2016) == 32 + ((d_Data[8] << 3) & 2016) // this constraint is satisfied d_Data[0] is \x50\xa4\xb8\x84, d_Data[8] is \x4c\xa0\xb4\x80 addData64(s_Hist, threadPos, (data4 >> 2) & 0x3FU); … } __syncthreads(); … }

Example-5 contd…

Generated test case leading to Race

• ktest file : 'klee-last/test000005.ktest'• args : ['histogram64_kernel.o']• num objects: 1• object 0: name: 'input'• object 0: size: 40• object 0: data: ‘\x50\xa4\xb8\x84\x00\x00\x00\x00\x00\x00\

x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x4c\xa0\xb4\x80\x00\x00\x00\x00'

Example-5 contd…

Concluding Remarks• A Concolic Verifier for CUDA/C++• Detects correctness / performance issues• High coverage, automatic test generation• Tool finds issues in well-known kernels (SDK)

• Tool Demos during talk will illustrate these examples– Can provide a LiveDVD or ISO image (will be posted in the URL below…)

• Our paper provides details on all the issues glossed over here– Paper, user-manual, and example code available fromhttp://www.cs.utah.edu/fv/GKLEE

• Comments / Suggestions are very welcome !!


Some Future Directions

• Support for CUDA 4.0 features– Atomics + SIMD– GPU2GPU transfers– GPU + MPI

• Incorporate into GPU-oriented compilation frameworks– E.g. OpenACC, others.

• Suggestions are welcome

Extra Slides

AB

P1 P2 Pi Pi+1 Pj ….Consider an arbitrary schedulethat brings the execution to theIllustrated where a race FIRST occurs

The race is between A and B

AB

P1 P2 Pi Pi+1 Pj ….Then clearly, the red executionis equivalent to the race-freeexecution, because it is occurringin the race-free region of theexecution-space.

AB

P1 P2 Pi Pi+1 Pj ….Our canonical schedule isshown by the dashed edgeshere.

AB

P1 P2 Pi Pi+1 Pj ….The Extra Executions Should not matter

Unless they themselves race !

But that race would then be caught !

So under the absence of ANY race,ALL schedules within a barrier intervalare equivalent.

Extra Execution

Documents

GKLEE : Concolic Verification and Test Generation for GPUs Guodong Li Fujitsu Labs of America Peng Li, Geof Sawaya, and Ganesh Gopalakrishnan School of