Introduction to GPU Programming for EDA
John F. Croix, Cadence Design Systems, Inc.
Sunil P. Khatri, Texas A&M University
Acknowledgements: NVIDIA, Nascentric Inc., Accelicon Inc.
Students: Kanupriya Gulati, Vinay Karkala, Kalyana Bollapalli
2
Outline
GPU Architecture Overview
GPU Programming
Algorithm Acceleration Guidelines
Case Studies
Conclusion
Q&A
3
Outline
GPU Architecture Overview
- Evolution and architecture
- Peak performance
- GPU and CPU interaction – practical considerations
GPU Programming
Algorithm Acceleration Guidelines
Case Studies
Conclusion
Q&A
4
GPU Evolution
In the early days, graphics accelerators were primitive
- Acceleration of graphics rendering tasks for (CRT) displays
- Many hardwired graphics acceleration units
With VLSI technology scaling, the GPU was born
- Many programmable processors to handle graphics rendering tasks
- Increased peak memory bandwidths and peak performance
- The goal was faster and more realistic rendering for gaming applications
Recently, several scientific communities began to leverage these GPUs
- Initially, graphics APIs like OpenGL and DirectX were used for these tasks
GPU vendors recognized this interest
- Development of C-like programming environments such as CUDA
- Development of GPU architectures tuned for scientific computations
5
GPU Introduction
A GPU is essentially a commodity stream processor
- Highly parallel (100s of processor cores)
- Very fast (>900 GFLOPS of peak performance)
- Operates in a SIMD manner; this is a key restriction: multiple processors operate in lock-step (same instruction) but on different data
GPUs, owing to their massively parallel architecture, have been used to accelerate image/stream processing, data compression and numerical algorithms; recently they have been used to accelerate CAD algorithms as well
Inexpensive, off-the-shelf cards like the NVIDIA Quadro FX 5800 / GTX 280 achieve impressive performance:
- 933 GFLOPs peak performance
- 240 SIMD cores partitioned into 30 Multiprocessors (MPs)
- 4 GB (Quadro) or 1 GB (GTX 280) device memory with 142 GB/s bandwidth
- 1.4 GHz GPU operating frequency
- Programmed with the Compute Unified Device Architecture (CUDA) framework
6
GPU Architecture
In the GTX 280, there are 10 Thread Processing Clusters (TPCs)
- Each TPC has 3 Streaming Multiprocessors (SMs), which we will refer to as multiprocessors (MPs)
- Each MP has 8 Streaming Processors (SPs) or Thread Processors (TPs); we will refer to these as processors
- 240 processors and 30 MPs in all!
- One double-precision FP unit per SM
Source : NVIDIA
GPU vs CPU: NVIDIA GTX 280 vs Intel i7 860¹

                         GPU                                    CPU
Registers                16,384 (32-bit) per multiprocessor³    128 reservation stations
Peak memory bandwidth    141.7 GB/sec                           21 GB/sec
Peak GFLOPs              562 (float) / 77 (double)              50 (double)
Cores                    240                                    4 (8 hyperthreaded)
Processor clock (MHz)    1296                                   2800
Memory                   1 GB                                   16 GB
Shared memory            16 KB per TPC²                         N/A
Virtual memory           None                                   Yes

¹ http://ark.intel.com/Product.aspx?id=41316
² TPC = Thread Processing Cluster (24 cores)
³ 30 multiprocessors in a 280
8
GPU vs CPU Peak Performance Trends GPU peak performance has grown aggressively. Hardware has kept up with Moore’s law
Source : NVIDIA
9
GPU Programming Model
The GPU is viewed as a compute device that:
- Is a coprocessor (slave) to the CPU (host)
- Has its own DRAM (device memory) but no virtual memory
  The entire design instance may not fit on the GPU!
A kernel is a CPU-callable function; a thread is an instance of a kernel; the GPU runs many threads in parallel
(Figure: host (CPU) connected over PCIe to the device (GPU) and its device memory; kernel threads (instances of the kernel) run on the device.)
Data Transfers (CPU ⇄ GPU)
GPUs and CPUs communicate via a PCIe bus
- This communication is expensive and should be minimized for target applications
Graphics applications usually require:
- Initial data to be sent from CPU to GPU
- A single transfer of processed data from GPU to CPU
General-purpose computations usually require:
- Multiple transfers between CPU and GPU (since conditional checks run on the CPU)
- Possibility of saturating the PCIe bus and reducing the achievable performance
10
GPU Threads vs. CPU Threads
GPU threads:
- Lightweight; small creation and scheduling overhead; extremely fast hardware context switching
- Need to issue 1000s of GPU threads to hide global memory latencies (600-800 cycles)
CPU threads:
- Heavyweight; large scheduling overhead; slow context switching
- Multi-GPU usage requires invocation of multiple CPU threads
- Each CPU thread creates a GPU context
- Context swapping is required for a CPU thread to access GPU memory allocated by another CPU thread
Device Memory Space Overview
Each thread runs on an SP and has:
- R/W per-thread registers (on-chip): limit usage (max 16K/MP)
- R/W per-thread local memory (off-chip)
- R/W per-block shared memory (on-chip): need to avoid bank conflicts
- R/W per-grid global memory (off-chip): not cached; the 600-800 cycle read latency is hidden by parallelism and fast context switches; it is the main means for data transfer between host and device; coalescing is recommended
- RO per-grid cached constant and texture memory (off-chip)
The host can R/W the global, constant and texture memories (visible to all threads)
(Figure: device grid of blocks; each block has its own shared memory, and each thread has registers and local memory; the global, constant and texture memories are shared across the grid and accessible from the host.)
Source : “NVIDIA CUDA Programming Guide” version 1.1
13
Outline
GPU Architecture Overview
GPU Programming
- CPU threads
- Conditional and loop processing
- Floating point
- General GPU program structure
- CUDA and OpenCL
Algorithm Acceleration Guidelines
Case Studies
Conclusion
Q&A
CPU Threading
CPU:
- All threads are equivalent
- Threads read/write concurrently to the same memory
- Synchronization primitives are required to avoid collisions
GPU (NVIDIA):
- Each CPU thread maintains a unique context
- GPU resources (e.g. memory, code modules, address space) are context-specific
- Each CPU thread can access a single context at once
- Contexts must be exchanged between CPU threads to share GPU resources between CPU threads
- Contexts use reference counting and are automatically destroyed
15
SIMD Conditional Processing
Unlike threads in a CPU-based program, SIMD programs cannot follow different execution paths
Ideal scenario:
- All GPU threads follow the same execution path
- All processors are active continuously
In divergent paths, some processors execute the then-block and others the else-block:
- Program flow cannot actually diverge; all instructions are executed
- The then- and else-blocks are both executed
- A bit is used to enable/disable processors based on the block being executed
- Parallelism is reduced, impacting performance
Idle Processors
Idle CPU processors can be dynamically rescheduled by the OS
SIMD processors are not actually idle:
- All processors scheduled are following identical execution paths
- Disabled (idle) processors are unavailable for other work and cannot be rescheduled
Effective utilization of processors is the programmer's responsibility
- Scheduling is an art, not necessarily a science
- Techniques will vary from chip to chip
16
Conditional Processing
...
if (condition)
{
    ...
}
else
{
    ...
}
...
17
Nested Conditional Processing
...
if (condition)
{
    if (condition2)
    {
        ...
    }
    else
    {
        ...
    }
}
else
{
    ...
}
...
18
Loop Processing
...
while (condition)
{
    if (cond2)
    {
        ...
    }
}
...
19
The Cost of Memory Access
Registers are extremely fast, but are a limited resource
Cached memories also tend to be small
For large data sets, global memory provides read & write access
- Accesses take between 600 and 800 clock cycles
- Accesses are *not* cached
- To hide memory latency, the hardware provides fast context switches when memory is accessed
- However, there must be enough computational work to do to hide the high cost of memory access
Programmers need to be smart
- Compilers often don't provide the necessary optimizations when optimizing for speed instead of code size
- It can sometimes be cheaper to recompute a result than to perform a memory read/write
20
Conditional Processing
...
if (condition)
{
    ...
    float a = someVar;   // access & swap
    ...
}
else
{
    ...
    float a = someVar;   // access & swap
    ...
}
...
21
...
float a = someVar;   // single access & swap
if (condition)
{
    ...
}
else
{
    ...
}
...
In the first version, each branch performs its own read of someVar, and under SIMD both branches are executed, so each read forces an access and context swap; hoisting the read above the conditional leaves a single access and swap.
Floating Point
GPUs are optimized for 32-bit accesses
64-bit double-precision values are fetched from memory as two 32-bit quantities
- May impact performance in the event of memory bank conflicts
One double-precision unit per multiprocessor¹

¹ http://www.ddj.com/hpc-high-performance-computing/210102115
OpenCL vs CUDA
CUDA uses early code binding:
- Code is compiled with normal C/C++/FORTRAN (beta) source code
- Need the CUDA occupancy calculator to determine the number of threads, based on resource utilization
- Library support: BLAS, FFT, DPT
OpenCL:
- Late binding of OpenCL code to the executable
- The OpenCL compiler/linker is embedded within the application
- No need for an occupancy calculator
- Only supports C
- No libraries
23
CUDA Occupancy Calculator
24
OpenCL vs CUDA
25
General Program Structure
- Initialize GPU
- Create GPU context
- Build GPU program
- Allocate GPU memory
- Transfer data from CPU to GPU
- Invoke GPU functions
- Transfer data from GPU to CPU
- Deallocate GPU memory
- Finalize GPU usage
26
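To make the sequence concrete, here is a minimal sketch of these steps in CUDA. The kernel, names and sizes are illustrative assumptions, not code from the tutorial:

#include <cuda_runtime.h>
#include <stdlib.h>

// Hypothetical kernel: double every element
__global__ void scaleKernel(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        d[i] *= 2.0f;
    }
}

int main()
{
    cudaSetDevice( 0 );                        // initialize GPU; CUDA creates the context implicitly
    const int n = 1 << 20;
    size_t bytes = n * sizeof( float );
    float *h = (float *)malloc( bytes );
    for (int i = 0; i < n; i++) { h[i] = (float)i; }
    float *d;
    cudaMalloc( (void **)&d, bytes );                      // allocate GPU memory
    cudaMemcpy( d, h, bytes, cudaMemcpyHostToDevice );     // transfer CPU -> GPU
    scaleKernel<<< (n + 255) / 256, 256 >>>( d, n );       // invoke GPU function
    cudaMemcpy( h, d, bytes, cudaMemcpyDeviceToHost );     // transfer GPU -> CPU (implicitly synchronizes)
    cudaFree( d );                                         // deallocate GPU memory
    free( h );
    return 0;
}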
Create GPU Context
CUDA:
- Context creation is implicit in single-threaded programs
- Multiple contexts can be explicitly created
- Each thread maintains a context stack; the top context is the current context
Threads:
- Contexts can be swapped between threads
- A thread can only have one context active at a time (stack)
- A context cannot be shared simultaneously between threads
OpenCL:
- All commands are explicitly associated with a context
- Must create a command queue to invoke commands
27
Initialize GPU
CUDA:
- cudaGetDeviceCount()
- cudaSetDevice()
- cudaGetDeviceProperties()
28
CUDA::CUDA(int Device) : Base()
{
    mValid = false;
    int DeviceCount;
    cudaGetDeviceCount( &DeviceCount );
    if (!DeviceCount) {
        return;
    }
    Device = Device == -1 ? DeviceCount - 1 : Device;
    cudaSetDevice( Device );
    mValid = true;
}
Initialize GPU
OpenCL:
- A context must be built before anything can be done on the GPU
- All commands are issued with respect to a given context
29
OpenCL::OpenCL(int Device) : Base()
{
    init();  // Initialize class pointers to NULL
    cl_int RC;
    mGPUContext = clCreateContextFromType( 0, CL_DEVICE_TYPE_GPU, NULL, NULL, &RC );
    size_t Bytes;
    RC = clGetContextInfo( mGPUContext, CL_CONTEXT_DEVICES, 0, NULL, &Bytes );
    int NumDevices = Bytes / sizeof( cl_device_id );
    cl_device_id *Devices = new cl_device_id[ NumDevices ];
    RC = clGetContextInfo( mGPUContext, CL_CONTEXT_DEVICES, Bytes, Devices, NULL );
    mCommandQueue = clCreateCommandQueue( mGPUContext, Devices[ Device ], 0, &RC );
    size_t MaxWorkItemSizes[ 256 ];
    RC = clGetDeviceInfo( Devices[ Device ], CL_DEVICE_MAX_WORK_ITEM_SIZES,
                          sizeof( MaxWorkItemSizes ), MaxWorkItemSizes, NULL );
    mMaxWorkItems = MaxWorkItemSizes[ 0 ];
    mMaxWorkItemsMask = ~(mMaxWorkItems - 1);
Build GPU Program
CUDA:
- GPU code is compiled using the nvcc compiler
- Object code is statically bound to the CPU executable
- GPU code is intrinsically part of the program
- Mapping of the problem to threads is performed at compile time
30
Build GPU Program
OpenCL:
- GPU code is bound to the GPU at runtime
- The OpenCL compiler is part of the executable
- Code can be source code or object code
- Source code can be dynamically generated by the program, or stored in an external file
31
// Continued from constructor
    char *code = shrFindFilePath( "code.cl", "." );
    size_t CodeLength = 0;
    char *Source = oclLoadProgSource( code, "", &CodeLength );
    const char *SourceCode = Source;
    mProgram = clCreateProgramWithSource( mGPUContext, 1, &SourceCode, &CodeLength, &RC );
    RC = clBuildProgram( mProgram, 0, NULL, NULL, NULL, NULL );
    std::free( code );
    std::free( Source );
    mValid = RC == CL_SUCCESS;
}
Allocate/Deallocate GPU Memory
CUDA:
- The most frequently used allocator: cudaMalloc()
- Returns a memory pointer to GPU memory
- The memory pointer cannot be used by the CPU directly; it is passed to GPU calls
32
void *CUDA::malloc(size_t Bytes)
{
    void *Memory;
    cudaError_t RC = cudaMalloc( &Memory, Bytes );
    return( RC == cudaSuccess ? Memory : NULL );
}

void CUDA::free(void *Memory)
{
    if (Memory) {
        cudaFree( Memory );
    }
}
Allocate/Deallocate GPU Memory
OpenCL:
- Like all things, memory allocation is explicitly performed within a context
33
void *OpenCL::malloc(size_t NumBytes)
{
    // Round the request up to a multiple of 32 bytes
    size_t Size = (NumBytes + 31) & ~size_t( 31 );
    cl_int RC;
    cl_mem Memory = clCreateBuffer( mGPUContext, CL_MEM_READ_WRITE, Size, NULL, &RC );
    // Note: the cl_mem handle is returned through a void*
    return( RC == CL_SUCCESS ? Memory : NULL );
}

void OpenCL::free(void *Memory)
{
    if (Memory) {
        cl_mem Ptr = reinterpret_cast<cl_mem>( Memory );
        clReleaseMemObject( Ptr );
    }
}
CPU/GPU Data Transfer
Data is moved across the PCIe bus
CUDA:
- Data transfer is accomplished via the cudaMemcpy() routine
- Implicit synchronization point; non-blocking copies are available
- Direction is determined by an enumeration: cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost
- Allocated memory can be bound to texture memory: cudaBindTexture
OpenCL:
- Memory transfer via clEnqueueWriteBuffer() and clEnqueueReadBuffer()
- Synchronization is controlled by parameters to the calls; the default is non-blocking
34
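As a sketch of how the OpenCL synchronization parameters work (buffer and pointer names are hypothetical), the third argument of clEnqueueWriteBuffer()/clEnqueueReadBuffer() selects blocking or non-blocking behavior:

// Non-blocking write: returns immediately; the transfer proceeds asynchronously
clEnqueueWriteBuffer( mCommandQueue, dBuf, CL_FALSE, 0, bytes, hSrc, 0, NULL, NULL );

// Blocking read: does not return until the data has arrived in hDst
clEnqueueReadBuffer( mCommandQueue, dBuf, CL_TRUE, 0, bytes, hDst, 0, NULL, NULL );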
Call GPU Functions (Kernels)
Functions on the CPU are executed when invoked
GPU function calls from the CPU enter an execution queue:
- The CPU does not wait until the GPU function completes; the command is simply queued
- The GPU executes commands on the queue using its own ordering
- Synchronization points cause the CPU to stall and wait for the GPU to return
- CUDA: cudaThreadSynchronize()
35
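For instance (kernel and argument names hypothetical):

someKernel<<< grid, block >>>( dData, n );  // returns immediately; the launch is queued
// ... the CPU is free to do other work here ...
cudaThreadSynchronize();                    // CPU stalls until all queued GPU work completes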
GPU Function Calls
GPU function calls have an associated dimensionality (which can be 1D, 2D or 3D)
CUDA:
- Extended language syntax to include the problem dimension
- Syntax: function<<<dimGrid,dimBlock>>>( arguments );
OpenCL:
- Must explicitly put function arguments into the context: clSetKernelArg()
- Invoke the kernel using the context
- The kernel retrieves its arguments from the context automatically
36
GPU Cleanup/Termination
CUDA:
- Manages most cleanup operations automatically as a context is destroyed
OpenCL:
- Provides low-level APIs for deallocation of all resources
- Invoked in the order opposite to invocation:
  clReleaseKernel()
  clReleaseProgram()
  clReleaseCommandQueue()
  clReleaseContext()
37
Thread Batching: Grids and Blocks
A kernel is executed as a grid of thread blocks (aka blocks)
A thread block is a batch of threads that can cooperate with each other by:
- Synchronizing their execution (diverging execution results in performance loss)
- Efficiently sharing data through a low-latency shared memory
Two threads from two different blocks cannot cooperate
(Figure: the host launches Kernel 1 on Grid 1 and Kernel 2 on Grid 2; each grid is a 2D array of blocks, and each block is an array of threads.)
Source : “NVIDIA CUDA Programming Guide” version 1.1
Block and Thread IDs
Threads and blocks have IDs, so each thread can identify what data it will operate on
- Block ID: 1D or 2D
- Thread ID: 1D, 2D, or 3D
This simplifies memory addressing when processing multidimensional data:
- Image processing
- Solving PDEs on volumes
- Other problems with underlying 1D, 2D or 3D geometry
(Figure: grid of blocks and a block of threads, as in the previous figure.)
Source : “NVIDIA CUDA Programming Guide” version 1.1
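As an illustrative sketch (not from the slides), a 2D kernel typically combines the block and thread IDs to locate its element of a width x height image:

__global__ void brighten(float *img, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // column from block/thread IDs
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // row from block/thread IDs
    if (x < width && y < height) {
        img[y * width + x] += 1.0f;                  // each thread touches a unique pixel
    }
}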
GPU Kernels
Each function invocation is passed data to create a unique ID
- The data typically specifies the 'spatial coordinates' of the executing processor within the hardware
- The ID is used to coordinate data access, ensuring that two threads' accesses do not collide
CUDA function types:
- __global__: callable by the CPU; cannot be called by the GPU
- __device__: callable by other GPU functions; cannot be called by the CPU
- CUDA expands __device__ functions as inline functions via nvcc, which adds to function resource utilization
40
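A small hedged example of the two function types (names are hypothetical):

__device__ float squaref(float x)            // callable only from GPU code; inlined by nvcc
{
    return x * x;
}

__global__ void sumOfSquares(const float *in, float *out, int n)  // callable from the CPU
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = squaref( in[i] );
    }
}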
OpenCL Kernel Invocation Use C++ templates to simplify argument handling
41
template<typename T> inline cl_int setArg(cl_kernel Kernel, unsigned Pos, T Arg)
{
    return( clSetKernelArg( Kernel, Pos, sizeof( T ), &Arg ) );
}

template<> inline cl_int setArg(cl_kernel Kernel, unsigned Pos, size_t SharedSize)
{
    // This routine, unlike the others, sets up shared memory by passing
    // NULL in as the pointer to the variable.
    return( clSetKernelArg( Kernel, Pos, SharedSize, NULL ) );
}

template<> inline cl_int setArg(cl_kernel Kernel, unsigned Pos, int Arg)
{
    cl_int ArgInt = Arg;
    return( clSetKernelArg( Kernel, Pos, sizeof( ArgInt ), &ArgInt ) );
}

template<> inline cl_int setArg(cl_kernel Kernel, unsigned Pos, float Arg)
{
    cl_float ArgFloat = Arg;
    return( clSetKernelArg( Kernel, Pos, sizeof( ArgFloat ), &ArgFloat ) );
}
...
template<typename T0> inline cl_int setArgs(cl_kernel Kernel, T0 Arg0)
{
    return( setArg( Kernel, 0, Arg0 ) );
}

template<typename T0, typename T1> inline cl_int setArgs(cl_kernel Kernel, T0 Arg0, T1 Arg1)
{
    return( setArg( Kernel, 0, Arg0 ) | setArg( Kernel, 1, Arg1 ) );
}

template<typename T0, typename T1, typename T2> inline cl_int setArgs(cl_kernel Kernel, T0 Arg0, T1 Arg1, T2 Arg2)
{
    return( setArg( Kernel, 0, Arg0 ) | setArg( Kernel, 1, Arg1 ) | setArg( Kernel, 2, Arg2 ) );
}
...
OpenCL Kernel Invocation
BLAS-like example:
- CUDA provides a BLAS library; OpenCL doesn't
- Must write our own BLAS routines in OpenCL to port between the two easily
- The swap() function swaps the contents of 2 vectors with differing vector strides
42
void OpenCL::blasSswap(int n, float *x, int incx, float *y, int incy)
{
    if (!checkBLASKernel( &mSswapKernel, "Sswap" )) {
        return;
    }
    mLastBLASStatus = Base::BLAS_INTERNAL_ERROR;
    if (x && y) {
        if (setArgs( mSswapKernel, n, x, incx, y, incy ) == CL_SUCCESS) {
            executeBLASKernel( mSswapKernel, n );
        }
    }
}
OpenCL Kernel Invocation BLAS support functions
43
bool OpenCL::checkBLASKernel(cl_kernel *Kernel, const char *KernelName)
{
    if (!mValid) {
        mLastBLASStatus = Base::BLAS_NOT_INITIALIZED;
        return( false );
    }
    if (!(*Kernel)) {
        cl_int RC;
        *Kernel = clCreateKernel( mProgram, KernelName, &RC );
        if (RC != CL_SUCCESS) {
            mLastBLASStatus = Base::BLAS_INTERNAL_ERROR;
            return( false );
        }
    }
    return( true );
}

inline void OpenCL::executeBLASKernel(cl_kernel Kernel, int n)
{
    size_t Size = n;
    size_t GlobalWorkSize = Size & mMaxWorkItemsMask;
    if (Size & ~mMaxWorkItemsMask) {
        GlobalWorkSize += mMaxWorkItems;
    }
    cl_int RC = clEnqueueNDRangeKernel( mCommandQueue, Kernel, 1, NULL,
                                        &GlobalWorkSize, &mMaxWorkItems,
                                        0, NULL, NULL );
    clFinish( mCommandQueue );
    mLastBLASStatus = (RC == CL_SUCCESS) ? Base::BLAS_SUCCESS
                                         : Base::BLAS_EXECUTION_FAILED;
}
OpenCL Kernels BLAS SSWAP example
44
__kernel void Sswap(int n, __global float *x, int incx,
                    __global float *y, int incy)
{
    const unsigned GID = get_global_id( 0 );
    if (GID < n) {
        int lx = (incx >= 0) ? 0 : ((1 - n) * incx);
        int ly = (incy >= 0) ? 0 : ((1 - n) * incy);
        float temp = y[ ly + GID * incy ];
        y[ ly + GID * incy ] = x[ lx + GID * incx ];
        x[ lx + GID * incx ] = temp;
    }
}
http://developer.download.nvidia.com/OpenCL/NVIDIA_OpenCL_JumpStart_Guide.pdf
CUDA Kernels
GPU (kernel.cu) and CPU sides:
45
// GPU (kernel.cu)
#define BLOCK_DIM 16

__global__ void transpose_naive(float *odata, float *idata, int width, int height)
{
    unsigned int xIndex = blockDim.x * blockIdx.x + threadIdx.x;
    unsigned int yIndex = blockDim.y * blockIdx.y + threadIdx.y;
    if (xIndex < width && yIndex < height) {
        unsigned int index_in  = xIndex + width * yIndex;
        unsigned int index_out = yIndex + height * xIndex;
        odata[index_out] = idata[index_in];
    }
}

// CPU
#include "kernel.cu"
...
{
    const unsigned int size_x = 256;
    const unsigned int size_y = 4096;
    ...
    dim3 grid(size_x / BLOCK_DIM, size_y / BLOCK_DIM, 1);
    dim3 threads(BLOCK_DIM, BLOCK_DIM, 1);
    transpose_naive<<< grid, threads >>>(d_odata, d_idata, size_x, size_y);
    cudaThreadSynchronize();
    ...
}
46
Outline
GPU Architecture Overview
GPU Programming
Algorithm Acceleration Guidelines
- Streams and pinned memory
- Thread scheduling
- Parallel reduction
- Program partitioning
- Simultaneous graphics and algorithm processing
Case Studies
Conclusion
Q&A
Streams
A stream is a sequence of commands that execute serially
Different streams allow overlapping of memory transfers and kernel computations:
- Hides the data transfer cost
Implementable on CUDA devices with compute capability ≥ 1.1
Host memory must be of type 'pinned'
47
(Figure: without streams, the H→D transfers, kernel computation and D→H transfers for Data1 and Data2 run back-to-back; with streams, the transfers and kernel computations of Data1 and Data2 overlap in time.)
Pinned Memory
Memory on the host that is mapped into the device's address space, and thus accessible directly by a kernel
Has several advantages:
- There is no need to allocate a block in device memory and copy data between this block and the block in host memory; data transfers are implicitly performed as needed by the kernel
- Bandwidth between host and device memories is higher
Write-combining memory:
- A type of pinned memory where individual writes are aggregated into a larger write operation
- Avoids internal L1/L2 cache writes, making more cache available for the rest of the application
- Is not snooped during transfers across the PCI Express bus, which can improve transfer performance by up to 40%
48
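A minimal sketch of two-stream overlap with pinned host memory (sizes, names and the trivial kernel are illustrative assumptions):

__global__ void process(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) { d[i] += 1.0f; }
}

const int n = 1 << 20;
const size_t bytes = n * sizeof( float );
cudaStream_t stream[2];
float *hPinned[2], *dBuf[2];
for (int s = 0; s < 2; s++) {
    cudaStreamCreate( &stream[s] );
    cudaMallocHost( (void **)&hPinned[s], bytes );   // pinned host memory, required for async copies
    cudaMalloc( (void **)&dBuf[s], bytes );
}
for (int s = 0; s < 2; s++) {
    // The H->D copy, kernel and D->H copy in stream s overlap with the other stream's work
    cudaMemcpyAsync( dBuf[s], hPinned[s], bytes, cudaMemcpyHostToDevice, stream[s] );
    process<<< (n + 255) / 256, 256, 0, stream[s] >>>( dBuf[s], n );
    cudaMemcpyAsync( hPinned[s], dBuf[s], bytes, cudaMemcpyDeviceToHost, stream[s] );
}
cudaThreadSynchronize();   // wait for both streams to drain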
Threads and Scheduling in GPU
The GPU consists of 'multiprocessors', each of which has many processors
A kernel is executed as a grid of blocks
A thread block is a batch of threads that cooperate with each other by:
- Synchronizing their execution (diverging execution results in performance loss)
- Efficiently sharing data through a low-latency shared memory
All threads of a block reside on the same multiprocessor (max 1024 threads/MP)
The number of blocks a multiprocessor can process at once depends on the register and shared memory usage per thread
(Figure: kernels launched as grids of blocks, each block containing threads, as in the earlier grid/block figure.)
Source : “NVIDIA CUDA Programming Guide” version 1.1
Threads and Scheduling in GPU (contd.)
Before execution, a block is split into warps
- A warp is a set of 32 threads which execute the same instruction on an MP
- A half-warp is either the first 16 or the second 16 threads of a warp
Full efficiency is realized when all 16 threads of a half-warp agree on their execution path
Branch divergence occurs if threads of a half-warp diverge via a data-dependent conditional branch:
- The half-warp serially executes each branch path taken, ignoring the results from threads that are not on that path
- This increases kernel execution time
Warps of the same block are executed in a time-sliced fashion
50
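As an illustrative sketch, the granularity of the condition decides whether a warp diverges:

__global__ void branchy(float *a)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Diverges inside every half-warp: both paths are serialized
    if (threadIdx.x & 1) { a[i] += 1.0f; } else { a[i] -= 1.0f; }
    // Uniform across each 32-thread warp: no divergence penalty
    if ((threadIdx.x / 32) & 1) { a[i] *= 2.0f; } else { a[i] *= 0.5f; }
}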
Program Parallelism
The GPU is designed to address applications that are data-parallel
Parallelism is an inherent factor in determining the suitability of a problem for the GPU
- In fact, applications in which enough parallelism cannot be exposed may be slower on a GPU than on a single-threaded CPU
Since the same program is executed for each data element, there is no sophisticated flow control
- Conditional checks need to be done on the CPU
- Reduce the output of all threads, transfer the reduced result to the CPU, which tests the condition and issues further GPU threads appropriately
- This can be expensive, since transfers are done over the PCIe bus!
Parallel Reduction
Perform a reduction of the data before transferring it to the CPU
A tree-based reduction approach is used within each thread block
The reduction is decomposed into multiple kernels to reduce the number of threads issued in the later stages of the tree-based reduction
52
Example of a tree-based SUM (syncThreads() between levels):
Level 0:  3  1  7  0  4  1  6  3
Level 1:    4     7     5     9
Level 2:       11          14
Level 3:             25
Parallel Reduction (contd.)
Types of optimization for efficient parallel reduction:
Algorithmic optimizations:
- Avoid divergent warps
- Avoid shared memory bank conflicts – sequential addressing
- First addition during the global load – halves the number of blocks
Code optimizations:
- Loop unrolling
- Multiple adds per thread, to increase the 'arithmetic intensity' of kernels (a high ratio of computation in the kernel to global reads and writes)
53
Example of a tree-based reduced sum
Parallel Reduction (contd.)
(Figure: shared-memory contents after each step of reducing the 16 values 10 1 8 -1 0 -2 3 5 -2 -3 2 7 0 11 0 2 to the sum 41; active thread IDs 0, 2, 4, ..., 14, then 0, 4, 8, 12, then 0, 8, then 0; annotated 'warp divergence removed'.)
Parallel Reduction (contd.)
(Figure: the same reduction trace with contiguous active thread IDs 0-7, then 0-3, then 0-1, then 0, and the shared-memory bank ID (0-7) of each entry shown.)
Parallel Reduction (contd.)
(Figure: the same reduction trace using sequential addressing – each active thread adds an element a stride of half the remaining array away, avoiding bank conflicts.)
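A hedged sketch of a block-level sum reduction applying these optimizations (sequential addressing, first add during the global load); it is modeled on the well-known CUDA SDK reduction, not code from the slides:

__global__ void reduceSum(const float *in, float *out, int n)
{
    extern __shared__ float s[];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * (blockDim.x * 2) + threadIdx.x;

    // First addition during the global load: halves the number of blocks
    float v = (i < n) ? in[i] : 0.0f;
    if (i + blockDim.x < n) { v += in[i + blockDim.x]; }
    s[tid] = v;
    __syncthreads();

    // Sequential addressing: no shared memory bank conflicts, and no divergence
    // within a half-warp until the active thread count drops below 16
    for (unsigned int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) { s[tid] += s[tid + stride]; }
        __syncthreads();
    }
    if (tid == 0) { out[blockIdx.x] = s[0]; }   // one partial sum per block
}

The kernel is launched repeatedly, each pass shrinking the array by a factor of 2 x blockDim.x, which matches the multi-kernel decomposition described above.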
Program Partitioning
Assume a subroutine S is invoked N times in an application, and that each invocation uses x registers
A multiprocessor of the GPU has 16K registers, so the maximum parallelism = 16K/x
- The GPU can do fast hardware context switches between the threads, which share the 16K registers
- Partitioning S into smaller kernels that use fewer registers (say y < x) therefore allows more threads to be resident at once
However, data transfers between kernels become a significant overhead as the number of partitions increases
(Figure: timeline of the N invocations of S over time T, before and after partitioning; register usage per invocation drops from x to y.)
Simultaneous Graphics and Algorithm Processing
If the same GPU is used for graphics and algorithmic processing:
- GPU resources may be saturated by the graphics application, leaving little bandwidth for other applications
- The fixed size of GPU memory (without swap space) may cause application launch failure
- Graphics tasks may cause cache pollution, which may cause erratic runtimes for general-purpose applications (run 'warm-up code' to flush out the caches)
- A single kernel execution cannot be longer than 5 seconds
Using a separate GPU for graphics and computation avoids the problems listed above
58
59
Outline
GPU Architecture Overview
GPU Programming
Algorithm Acceleration Guidelines
Case Studies
- Boolean Satisfiability
- Fast SPICE model evaluation
- Fault simulation
- SSTA
Conclusion
Q&A
Guidelines for GPU Acceleration of Software
- Current GPUs have an expensive communication link to the host; data transfers should be minimized
- Streams should be used to overlap communication and computation
- Partition kernels to increase the parallelism that can be leveraged
- Full efficiency is realized when all 16 threads of a half-warp agree on their execution path; reduce warp divergence
- Avoid bank conflicts when using shared memory
- Kernels should have high arithmetic intensity
60
Case Studies
Two approaches for accelerating an algorithm on the GPU:
Re-architecting approach
- Applicable when the problem does not have an inherent SIMD nature
- May require significant algorithmic modifications
- Examples: Boolean Satisfiability; Fault Dictionary Computation (not covered in this talk, slides at end)
Porting approach
- Applicable when the problem runtime is dominated by a subroutine, multiple invocations of which operate upon independent data
- Partition the subroutine into GPU kernels
- Examples: accelerating SPICE by porting model evaluation onto the GPU; Fault Simulation; Monte Carlo based statistical static timing analysis (SSTA)
Boolean Satisfiability (SAT)
Given a Boolean formula in conjunctive normal form (CNF):
- Either find a satisfying truth assignment of all variables
- Or prove that there is no satisfying assignment
Example: f = ( x + z )( y + z )( x' + y' + z' ), where each parenthesized term is a clause, x is a positive literal and x' is a negative literal
Decisions: x = true, y = true
- The unassigned literal z' gets implied because of the unit clause rule
- Implication: z = false
Iterative application of the unit clause rule is called Boolean constraint propagation (BCP)
Recent BCP-based SAT solvers incorporate conflict-driven learning
- A learned clause represents the search space that has been pruned
Approach
Complete approaches for SAT
- Are exact, but the algorithms do not easily lend themselves to parallel implementations. Examples: GRASP, zChaff, CirCUs, MiniSAT
Stochastic approaches for SAT
- Can execute at high speeds and are scalable, but are not exact. Examples: Survey Propagation, WalkSAT, RandomSAT
We present a hybrid procedure for SAT
- Retains the best features of complete and stochastic approaches
- The proposed algorithm is based on MiniSAT (implemented on the CPU)
- The variable ordering heuristic of MiniSAT is enhanced by a survey propagation (SP) based procedure, which is implemented on the GPU
- The proposed approach is called MESP (MiniSAT enhanced with SP)
Next few slides: discuss the GPU-based SP implementation, then describe our MESP approach
Survey Propagation (SP) based SAT
Factor graph – a graphical representation of a SAT instance:
- Variable nodes (variables)
- Function nodes (clauses)
- It is a tree if it has no cycles
SP is an algorithm in which 'agreement' between clauses and variables is reached by sending probabilistic 'messages' along the edges of the factor graph (message passing)
- Pros: highly scalable, parallelizable, exact for factor graphs that are trees
- Cons: incomplete for non-tree factor graphs
(Figure: a small factor graph over the variables x, y, z.)
65
Survey Propagation Equations
Notation:
- α, β are clauses; i, j are variables
- V+(i): the set of all clauses where i appears in the positive '+' form
- V−(i): the set of all clauses where i appears in the negative '−' form
- η_{α→i} is a warning (a probability) from clause α to variable i
- The η's and π's are iteratively computed until convergence

During computation (let i be in the '+' form in α):
p = ∏_{β ∈ V+(i)∖α} (1 − η_{β→i}),  q = ∏_{β ∈ V−(i)∖α} (1 − η_{β→i})
π_i^u = p(1 − q),  π_i^s = q(1 − p),  π_i^* = pq
η_{α→i} = ∏_{j ∈ α∖i} [ π_j^u / (π_j^u + π_j^s + π_j^*) ]

After convergence:
p = ∏_{β ∈ V+(i)} (1 − η_{β→i}),  q = ∏_{β ∈ V−(i)} (1 − η_{β→i})
π_i^+ = p(1 − q),  π_i^− = q(1 − p),  π_i^* = pq
W_i(+) = π_i^+ / (π_i^+ + π_i^− + π_i^*),  W_i(−) = π_i^− / (π_i^+ + π_i^− + π_i^*)
66
Survey Propagation Flowchart
- Randomly initialize the η_{α→i}
- Compute the π's, then compute the new η_{α→i}
- Convergence check: C = Σ [ |η_{α→i}^new − η_{α→i}| ≤ ε ? 0 : 1 ]; set η_{α→i} ← η_{α→i}^new
- If C ≠ 0: if the iteration count exceeds the maximum, declare non-convergence; otherwise iterate again
- If C == 0 (converged): if Σ(η_{α→i}) ≈ 0, call WalkSAT to determine a satisfying assignment
- Otherwise, compute the biases W, sort the variables in decreasing order of the W's, and fix the first x% of variables on the sorted list (fixed variables and satisfied clauses are ignored from then on); if a contradiction arises, report it and quit; then iterate again
67
Survey Propagation on the GPU
Implemented GPU kernels for the following:
- Compute π's for all variables (V) in parallel
- Compute η's for all clauses (C) in parallel; in particular, compute η_{α→i} for each variable i in clause α
- Check convergence (Σ [ |η_{α→i}^new − η_{α→i}| ≤ ε ? 0 : 1 ]) using a reduced 'integer' add operation over all literals in all clauses
- Compute Σ(η_{α→i}) (to determine whether the convergence is non-trivial) using a reduced 'float' add operation
- Compute W's for all variables in parallel
- Parallel 'bitonic' sort to find the largest x% of the W's
The CPU performs conditional checks, fixes variables and executes WalkSAT
Data Structures on the GPU
- Per-variable data (static), for variables 1 ... |V|: clause #, literal #, polarity
- Per-clause data (static), for clauses 1 ... |C|: variable #, polarity
- η's (1 ... |C|): written by clauses, read by variables
- π's (π+, π−, for variables 1 ... |V|): written by variables, read by clauses
With 1 GB of global memory, the GTX 280 GPU can fit instances with up to 10M clauses and 1M variables
69
Survey Propagation on the GPU
Memory transfers between GPU and CPU:
- A single transfer for the static per-variable and per-clause data
- During the computation of π and η, there are no transfers at all; all intermediate data is stored in the global memory of the GPU
- After convergence is detected, the sorted list of variables in decreasing order of biases is transferred (GPU → CPU)
- After the graph is simplified, the following are updated (CPU → GPU): variables that are fixed (they don't contribute to the η computation) and clauses that are satisfied (they don't contribute to the π computation)
70
Results (GPU-based SP); runtimes in seconds

Inst. Name    #Vars    #Clauses   MiniSAT    B05        Ours       Speedup
Random_1      20,000   83,999     >2 hrs.    3009.67    172.87     17.41X
Random_2      16,000   67,199     >2 hrs.    1729.48    110.60     15.63X
Random_3      12,000   50,399     >2 hrs.    1002.48    57.98      17.29X
Random_4      8,000    33,599     >2 hrs.    369.61     5.82       63.80X
Random_5      4,000    16,799     >2 hrs.    65.01      3.69       17.62X
Uf200-07      200      860        0.15       0.20       0.08       2.50X
hole10        187      792        1.30       Contrdn.   Contrdn.
Uf200-018     200      860        0.19       No conv.   No conv.
Avg. (over 20)                                                     22.37X
MESP is compared against:
- Braunstein et al. 2005 (B05) and MiniSAT, which were executed on a 3.6 GHz, 3 GB Intel machine running Linux
- Manolios et al. 2006 (M06), which uses OpenGL on an NVIDIA GTX 7900 (512 MB memory, 128 cores, 750 MHz) to implement survey propagation
For hard random instances, MESP shows a 22X speedup over B05
- M06 reports a 9X speedup over B05
MESP
The SAT instance is read into MiniSAT and onto the GPU (executing SP)
MiniSAT is first invoked on the instance and, after it has made some progress, it invokes GPU-based SP. MiniSAT transfers to SP:
- The current assignments, and
- A subset of the current learned clauses
We augment the current clause database in GPU-based SP with 3 sets of learned clauses (LC): C1, C2 and C3, where L is the number of literals in an LC
- C1 (0 < L ≤ 10); C2 (10 < L ≤ 25); C3 (25 < L ≤ 50)
- We statically allocate enough space in the GPU's global memory to store 8K clauses in each of C1, C2 and C3
Messages computed over all clauses (η) are now computed in 4 separate kernels, one for each set of clauses (C1, C2, C3 and C*)
On convergence, SP (in MESP) fixes variables for which the absolute bias difference |W(+) − W(−)| < τ
MESP
MiniSAT decides the next variable to assign based on the Variable State Independent Decaying Sum (VSIDS) heuristic
- VSIDS chooses the next decision variable with the highest activity
- Activity is the variable occurrence count, with a higher weight on the variables of the more recently added learned clauses
- The activity of the variables in the learned clauses is incremented by F_M
In MESP, a GPU-based SP invocation can return with the following outcomes:
- SP converges and fixes certain variables S; MiniSAT updates the activity of the variables in S by F_SP and continues the search
- SP converges, fixes S and determines the factor graph is a tree; it invokes WalkSAT. If WalkSAT finds an assignment, the instance is solved; else the fixed variables in S are returned to MiniSAT
- SP converges but does not fix any variable
- SP does not converge, or reports a contradiction; MiniSAT continues the search
73
(Figure: MiniSAT (complete, on the CPU) performs the initial search over its decision tree and passes the current assignments and a subset of the learned clauses to survey propagation (stochastic, on the GPU). The GPU attempts to converge on the SP messages, working in conjunction with the CPU to fix variables; the CPU instructs the GPU to ignore fixed variables and satisfied clauses. The activity table is updated for the variables S that are fixed in SP, and MiniSAT continues its search using the updated activities.)
Results

Instance      S/U   k-SAT: #Vars / #Clauses / MiniSAT(k)   3-SAT: #Vars / #Clauses / MiniSAT(3)   MESP     vs MiniSAT(k)   vs MiniSAT(3)
1394694p      S     327932 / 1283772 / 29.84               530027 / 1890057 / 39.58               15.28    1.95X           2.59X
AProVE07      U     78607 / 208911 / 110.39                104732 / 287286 / 166.25               95.91    1.15X           1.73X
eijk.bs4863   S     140089 / 530249 / 487.98               234412 / 813218 / 619.03               181.86   2.68X           3.40X
:             :     :                                      :                                      :        :               :
eijk.S298     U     73222 / 283211 / 8.42                  136731 / 473738 / 10.01                8.47     0.99X           1.18X
Avg (over 13)                                                                                              1.64X           1.92X

The MESP approach was run on a GTX 280 GPU card in an Intel i7 machine (2.6 GHz, 9 GB RAM) running Linux; MiniSAT was run on the same CPU. Runtimes are in seconds.
D = 1% of the number of variables; F_SP = F_M = 1; C = 20; τ = 0.01
The learned clauses on the GPU were updated at every 5th invocation of SP
- Up to 24K learned clauses
None of these instances were solved in MESP by an invocation of WalkSAT
75
Summary
MESP is a GPU-enhanced variable ordering heuristic for SAT
GPU-based survey propagation:
- π's for all variables and η's for all clauses are computed in parallel
- Convergence is checked using a reduced 'integer' add operation over all literals in all clauses
- Testing for non-trivial convergence uses a reduced 'float' add operation
- Biases for all variables are computed in parallel
- A parallel 'bitonic' sort finds the largest x% of the biases
Survey propagation enhances the variable ordering in MESP:
- The clause database on the GPU is augmented with 3 sets of learned clauses
- η's for all clauses are computed in 4 different kernels
On average, MESP is 64% (92%) faster than MiniSAT on the original (3-SAT) instances
SPICE Model Evaluation on a GPU
SPICE is the de facto industry standard for VLSI circuit simulation
There is significant motivation for accelerating SPICE simulations without losing accuracy:
- Increasing complexity and size of VLSI circuits
- Increasing impact of process variations on the electrical behavior of circuits, which requires Monte Carlo based simulations
We accelerate the computationally expensive portion of SPICE – transistor model evaluation – on a GPU
The proposed approach is integrated into a commercial SPICE accelerator tool, OmegaSIM
- OmegaSIM is already 10-1000x faster than traditional SPICE implementations
- With the proposed approach integrated, OmegaSIM achieves a further speedup of 2.36X (3.07X) on average (maximum)
Approach
We profiled SPICE simulations over several benchmarks:
- 75% of the time is spent in BSIM3 device model evaluations
- Billions of calls are made to the device model evaluation routines: every device in the circuit is evaluated for every time step, possibly repeatedly until the Newton-Raphson loop for solving the non-linear equations converges
- Considering Amdahl's law, the asymptotic speedup is 4X
These calls are parallelizable:
- They are independent of each other
- Each call performs identical computations on different data
- They conform to the GPU's SIMD operating paradigm
Approach
CDFG-guided manual partitioning of the BSIM3 evaluation code
There are limitations on the available hardware resources:
- Registers (8192 per multiprocessor)
- Shared memory (16 KB per multiprocessor)
- Bandwidth to global memory (the maximum sustainable is ~80 GB/s)
If the entire BSIM3 model is implemented as a single kernel, not enough threads can be issued in parallel to hide the global memory access latency
If the BSIM3 code is partitioned into many (small) kernels, large amounts of data must be transferred across kernels through global memory (not cached), which negatively impacts performance
The proposed approach:
- Creates the CDFG of the BSIM3 equations
- Uses maximally disconnected components of this graph as the different kernels, considering the above hardware limitations
Approach
Take GPU memory constraints into account:
Global memory
- Used to store intermediate data which is generated by one kernel and needed by another (instead of transferring this data to the host)
Texture memory
- Used for storing 'runtime parameters': device parameters that remain unchanged throughout the simulation
- Advantages: it is cached, unlike global memory; no coalescing requirements, unlike global memory; no bank conflicts, such as are possible in shared memory; CUDA's efficient built-in texture fetching routines are used; the small texture memory loading overhead is easily amortized
Constant memory is used for storing physical constants
- Most efficient when all threads access the same data
Experiments
The proposed approach is implemented and integrated into a commercial SPICE accelerator tool – OmegaSIM
Hardware used:
- CPU: Intel Core 2 Quad, 2.4 GHz, 4 GB RAM
- GPU: GeForce 8800 GTS, 128 processors, 675 MHz, 512 MB RAM
Comparing BSIM3 model evaluation alone:

# Evals   GPU Proc. (ms)   GPU Tran. (ms)   GPU Total (ms)   CPU (ms)    Speedup
1M        81.17            196.48           277.65           8975.63     32.33X
2M        184.91           258.79           443.70           18086.29    40.76X
Ckt. Name      #Trans.   Total #Evals    OmegaSIM CPU-alone (s)   GPU+CPU (s)   Speedup
Industrial_1   324       1.86 x 10^7     49.96                    34.06         1.47X
Industrial_2   1098      2.62 x 10^9     118.69                   38.65         3.07X
Industrial_3   1098      4.30 x 10^8     725.35                   281.5         2.58X
Buf_1          500       1.62 x 10^7     27.45                    20.26         1.35X
Buf_2          1000      5.22 x 10^7     111.5                    48.19         2.31X
Buf_3          2000      2.13 x 10^8     486.6                    164.96        2.95X
ClockTree_1    1922      1.86 x 10^8     345.69                   132.59        2.61X
ClockTree_2    7682      1.92 x 10^8     458.98                   182.88        2.51X
Avg.                                                                            2.36X
Experiments – Complete SPICE Simulation
With an increase in the number of transistors, the speedup obtained is higher:
- More device evaluation calls are made in parallel, so latencies are better hidden
High accuracy with a single-precision floating point implementation:
- Over 1M device evaluations, the average (maximum) error is 2.88 x 10^-26 (9.0 x 10^-22) Ampere
- Newer devices with double-precision capability are already on the market
Conclusions
There is significant interest in accelerating SPICE
75% of the SPICE runtime is spent in BSIM3 model evaluation – this allows an asymptotic speedup of 4X
Our approach of accelerating model evaluation using GPUs has been integrated with a commercial fast SPICE tool
- Obtained a speedup of 2.36X on average
- BSIM3 model evaluation alone can be sped up by 30-40X over 1M-2M calls
Take GPU memory constraints into account:
- Global memory is used to store intermediate data
- Texture memory is used for storing 'runtime parameters'
- Constant memory is used for storing physical constants
Partition kernels carefully, since:
- If the entire BSIM3 model is implemented as a single kernel, not enough threads can be issued in parallel to hide the global memory access latency
- If the BSIM3 code is partitioned into many (small) kernels, large amounts of data must be transferred across kernels through global memory
83
Introduction – Fault Simulation
Fault simulation (FS) is crucial in the VLSI design flow
- Given a digital design and a set of vectors V, FS evaluates the number of stuck-at faults (F_sim) tested by applying V
- The ratio F_sim/F_total is a measure of fault coverage
Current designs have millions of logic gates
- The number of faulty variations is proportional to the design size
- Each of these variations needs to be simulated for the V vectors
Therefore, it is important to explore ways to accelerate FS
The ideal FS approach should be fast, scalable and cost-effective
84
Approach
We implement a look-up table (LUT) based FS
- All gates' LUTs are stored in texture memory (cached)
- The LUTs of all library gates fit in the texture cache, avoiding cache misses during lookup
- An individual k-input gate LUT requires 2^k entries
- Each gate's LUT entries are located at a fixed offset in the texture memory
The gate output is obtained by accessing the memory at 'gate offset + input value'
- Example: the output of an AND2 gate with inputs '1' and '0' is read at offset + 2 (binary 10)
(Figure: LUT entries indexed 0, 1, 2, 3 at the gate's offset.)
85
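A hedged sketch of the lookup (the texture reference name is hypothetical):

texture<int, 1, cudaReadModeElementType> lutTex;   // bound with cudaBindTexture to the LUT array

__device__ int evalGate(int gateOffset, int inputValue)
{
    // e.g. AND2 with inputs (1,0): fetch at gateOffset + binary 10 = gateOffset + 2
    return tex1Dfetch( lutTex, gateOffset + inputValue );
}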
Approach
We evaluate two vectors for the same gate in a single thread
1/2/3/4-input gates require 4/16/64/256 entries in their LUTs, respectively
Our library consists of an INV and 2/3/4-input AND, NAND, NOR and OR gates
- Hence, the total memory required for all LUTs is 1348 words
- This fits in the texture memory cache (8 KB per MP)
We exploit both fault and pattern parallelism:
- Fault parallel: all gates at a fixed topological level are evaluated in parallel
- Pattern parallel: simulations for any gate, for different patterns, are done in parallel
86
Approach
In practice, simulations for any gate, for different patterns, are done in 2 phases, for all the faults which lie in its TFI only:
- Phase 1: good circuit simulation; the results are returned to the CPU
- Phase 2: faulty circuit simulation; the CPU does not schedule a stuck-at-v fault in a pattern which has v as the good circuit value
Fault injection is also performed in parallel
(Figure: for each of the vectors 1 ... 2N, threads compute the good circuit value and the faulty circuit value.)
Approach – Fault Injection

m0   m1   Meaning
-    11   Stuck-at-1 mask
11   00   No fault injection
00   00   Stuck-at-0 mask

typedef struct __align__(16)
{
    int offset;      // Gate type's offset
    int a, b, c, d;  // Input values
    int m0, m1;      // Mask variables
} threadData;
Approach – Fault Simulation
88
Approach – Fault Detection

typedef struct __align__(16)
{
    int offset;                  // Gate type's offset
    int a, b, c, d;              // Input values
    int Good_Circuit_threadID;   // Good circuit simulation thread ID
} threadData_Detect;
89
Approach
We maximize GPU performance by ensuring that:
- No data dependency exists between threads issued in parallel
- The same instructions, on different data, are executed by all threads
We adapt to specific G80 memory constraints:
- The LUT is stored in texture memory. Key advantages: texture memory is cached; the total LUT size easily fits into the available cache size of 8 KB/MP; no memory coalescing requirements; efficient built-in texture fetching routines are available in CUDA; the non-zero time taken to load the texture memory is easily amortized
- Global memory writes for level-i gates (and reads for level-i+1 gates) are performed in a coalesced fashion
Results
FS runtimes on the GTX 280 are compared to a commercial fault simulator for 30 IWLS and ITC benchmarks
- 32K patterns were simulated for all 30 circuits
- CPU times were obtained on a 1.5 GHz, 1.5 GB UltraSPARC-IV+ processor running Solaris 9
- GPU time includes:
  - Data transfer time between the GPU and CPU (both directions): CPU → GPU: 32K patterns and the LUT data; GPU → CPU: 32K good circuit evaluations for all gates, plus the Detect array
  - Processing time on the GPU
  - Time spent by the CPU to issue good/faulty gate evaluation calls
  - Time spent loading the LUTs
Circuit          #Gates   #Faults   Comm. (s)   GPU (s)   Speedup   Proj. Tesla (s)   Speedup
s9234_1          1462     3883      6.190       0.134     46.067    0.022             275.754
s35932           14828    34628     51.920      1.390     37.352    0.260             199.723
s5378            1907     4821      8.390       0.155     54.052    0.025             333.344
s13207           2195     5735      14.980      0.260     57.648    0.047             320.997
:                :        :         :           :         :         :                 :
b22              35280    86205     17.130      1.504     11.390    0.225             75.970
Avg (30 ckts.)                                            ~47X                        ~300X
91
Conclusions
Fault simulation is accelerated using GPUs
- We implement a pattern- and fault-parallel technique
We maximize GPU performance by ensuring that:
- No data dependency exists between threads issued in parallel
- The same instructions, on different data, are executed by all threads
We adapt to specific G280 memory constraints:
- The LUT is stored in texture memory
- Global memory writes for level-i gates (and reads for level-i+1 gates) are performed in a coalesced fashion
When using a single GTX 280 GPU, we obtain a 47X speedup compared to a commercial FS engine
When projected to a 1U NVIDIA Tesla server, a 300X speedup is possible over the commercial engine
92
Introduction – SSTA
Static timing analysis (STA) is heavily used in VLSI design to estimate circuit delay
- The impact of process variations on circuit delay is increasing
- Therefore, statistical STA (SSTA) was proposed; it includes the effect of variations while estimating circuit delay
Monte Carlo (MC) based SSTA accounts for variations by:
- Generating N delay samples for each gate (random variable)
- Executing STA for each sample
- Aggregating the results to generate the full circuit delay under variations
MC based SSTA has several advantages over block-based and path-based SSTA:
- High accuracy, simplicity and compatibility with fabrication line data
- Its main disadvantage is an extremely high runtime cost
93
Approach – STA
STA at a gate:
- Over all inputs, compute the MAX of the SUM of the input arrival time for input i and the pin-to-output (P2O) rising (or falling) delay from pin i to the output
- For example, let AT_i^fall (AT_i^rise) denote the arrival time of a falling (rising) signal at node i
- For a 2-input gate with inputs a, b and output c, MAX(D_{11→00}, D_{11→01}) (resp. MAX(D_{11→00}, D_{11→10})) denotes the P2O rising delay from a to c (resp. from b to c)
- AT_c^rise = MAX[(AT_a^fall + MAX(D_{11→00}, D_{11→01})), (AT_b^fall + MAX(D_{11→00}, D_{11→10}))]
STA at a gate on the GPU:
- The P2O rising (or falling) delay from every input to the output is stored in a look-up table (LUT) in the texture memory of the GPU
- For an n-input gate: fetch the n pin-to-output rising (or falling) delays from texture memory, using the gate type offset, pin number and falling/rising delay information; then perform n SUM computations (of the pin-to-output delay and the input arrival time) and n−1 MAX computations (CUDA only supports 2-operand MAX operations)
94
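In CUDA this per-gate computation reduces to SUMs and 2-operand MAX calls; a hedged sketch for the 2-input example above (variable names are hypothetical):

__device__ float riseArrival(float at_a_fall, float at_b_fall,
                             float D11_00, float D11_01, float D11_10)
{
    float d_ac = fmaxf( D11_00, D11_01 );                    // P2O rising delay a -> c
    float d_bc = fmaxf( D11_00, D11_10 );                    // P2O rising delay b -> c
    return fmaxf( at_a_fall + d_ac, at_b_fall + d_bc );      // MAX of the SUMs
}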
Approach – SSTA
SSTA at a gate:
- We need (µ, σ) for the 2n Gaussian distributions of the pin-to-output rising and falling delay values for the n inputs
- Store (µ, σ) for every input in the LUT, as opposed to storing the nominal delay as for STA
- The Mersenne Twister (MT) pseudo-random number generator is used; the uniformly distributed random number sequences are then transformed into the normal distribution N(0,1) using the Box-Muller transformation (BM)
- Delay of a sample = µ + k · σ
- Both algorithms, the MT and BM kernels, are available with the CUDA software development kit (SDK)
For a circuit, SSTA is performed topologically from inputs to outputs:
- Delays of gates at logic depth i are computed and stored in global memory
- Gates at higher logic depths use this data as their input arrival times
95
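A minimal sketch of the per-sample delay computation (names are hypothetical; the N(0,1) values k are assumed to come from the SDK's MT and Box-Muller kernels):

__global__ void sampleDelays(const float *mu, const float *sigma,
                             const float *k, float *delay, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        delay[i] = mu[i] + k[i] * sigma[i];   // delay of a sample = mu + k * sigma
    }
}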
Experiments – SSTA
MC based SSTA runtimes on the GTX 280 are compared to a CPU-based implementation for 30 large IWLS and ITC benchmarks
- Monte Carlo analysis was performed using 64K samples for all 30 circuits
- CPU runtimes were computed on a 3.6 GHz, 3 GB RAM Intel processor running Linux
- GPU time includes the data transfer time: CPU → GPU: the arrival time at each primary input, and µ and σ for all pin-to-output delays of all gates; GPU → CPU: 64K delay values at each primary output
- GPU time also includes the time spent in the MT and BM kernels, and loading the texture memory
- The computed results have been verified for correctness
For the SLI Quad system, the runtimes are obtained by scaling the processing times only; transfer times are included as well (not scaled)
96
Results – SSTA
When using a single GTX 280 GPU, a ~818X speedup in MC based SSTA is obtained
When the SSTA runtimes are projected onto a Quad GPU system, a ~2405X speedup is possible

Circuit           GPU (s)   SLI Quad (s)   CPU (s)     Speedup (GPU)   Speedup (SLI Quad)
s9234_1           8.11      2.92           6621.16     816.64          2269.11
s35932            46.50     18.14          36174.56    778.00          1993.97
s38584            47.24     17.24          38270.72    810.19          2219.98
s13207            14.55     6.21           10633.48    731.07          1712.24
:                 :         :              :           :               :
b22_1             51.50     15.51          45909.95    891.51          2959.80
Avg. (30 ckts.)                                        ~818X           ~2405X
Conclusions
We accelerate MC based SSTA using graphics processors
We take maximal advantage of the GPU's raw computational power and huge memory bandwidth
We maximize GPU performance by ensuring that:
- No data dependency exists between threads issued in parallel
- The same instructions, on different data, are executed by all threads
We adapt to specific G280 memory constraints:
- The LUT is stored in texture memory
- Global memory writes for level-i gates (and reads for level-i+1 gates) are performed in a coalesced fashion
Summary
We discussed the GPU platform and its use in high-performance EDA applications, with case studies
We outlined the memory and processing constraints induced by the GPU architecture
We presented programming guidelines, with sample code fragments
We suggested tips to maximize the performance of GPU-based code
We discussed case studies of EDA algorithms, and pointed out how the code was architected for maximum performance
Resources
General: www.gpgpu.org
CUDA references:
- www.nvidia.com/object/cuda_home.html
- Supported platforms: Windows, MacOS, Linux
OpenCL:
- General OpenCL information: www.khronos.org
- Apple: developer.apple.com/mac/snowleopard/opencl.html
- AMD: developer.amd.com/gpu/ATIStreamSDK/pages/TutorialOpenCL.aspx
- NVIDIA: http://developer.nvidia.com/object/opencl-download.html
Thank You
Fault Table Generation
Two key steps in VLSI testing and debug:
- Fault detection: differentiates a faulty design from a fault-free design
- Fault diagnosis: identifies and isolates a fault, to analyze the defect causing the faulty behavior
Both detection and diagnosis require a precalculated fault table:
- Whether vector v_j can detect fault f_i
- Stored as a matrix [a_ij], where a_ij = 1 (0) if fault f_i is (not) detected by vector v_j
We implemented a pattern-parallel approach on the GPU:
- Simulate several patterns simultaneously
- Other parallel efforts require dynamic load balancing:
  - Algorithm parallel: partition the fault list across many processors
  - Model parallel: partition the circuit into components, each assigned to one or more processors
Approach
FSIM [Lee et al. 91] is an efficient fault simulator; FSIM+ is FSIM modified to compute a fault table
FSIM and FSIM+ both:
- Are pattern parallel and run on a single-core microprocessor
- Simulate a circuit in a forward-levelized manner
- Prune off unnecessary simulations early
Our new approach (GFTABLE) is an enhancement of FSIM+:
- The target hardware is a GPU – a SIMD machine
- Issue thousands of threads (T) in parallel: 'word_size × T' patterns (the packet width) are computed in parallel
- Hardware and software constraints are maximally satisfied: CUDA-specific constraints (memory, device utilization); only the CPU can launch a kernel or perform efficient conditional tests; minimize the (expensive) transfers between the GPU and CPU
102
Approach
(Figure: an example circuit with stuck-at-0 and stuck-at-1 faults, showing critical path tracing (CPT) and the cumulative detectability CD(k) at a stem k; p is the dominator of k, and also the immediate dominator of k.)
- Stem region (SR): all gates on any path from a stem to its immediate dominator
- Fanout-free region (FFR): a subcircuit induced by cutting off the fanout branches of each stem; such subcircuits form a partition of the original netlist
- Stems: fanout nets and primary outputs
Approach
Sensitive input:
- The only input of a gate driving the dominant logic value (DLV), or
- All inputs, when no input drives the DLV
A critical line is the line driving a sensitive input
Critical path tracing (CPT):
- Determine the paths of critical lines in FFR(k) by backtracking from the output of FFR(k) towards its inputs
(Figure: example gate with values 1, 0, 0 illustrating the sensitive input.)
Approach
Detectabilities:
(Figure: worked example on FFR(k) with inputs a, b, c. Detectabilities within the FFR: D(a,k) = 0, D(b,k) = 1, D(i,k) = 1, D(c,k) = 0, D(j,k) = 0. Fault detectabilities: FD(a s-a-0, k) = 0, FD(c s-a-0, k) = 0, FD(b s-a-1, k) = 1, with the cumulative detectability CD(k) shown. Global fault detectabilities are obtained by combining with the stem's detectability at its dominator: FD(a s-a-0, p) = FD(a s-a-0, k) D(k,p), and similarly for FD(c s-a-0, p) and FD(b s-a-1, p).)
Approach
All threads evaluate the same gate for different patterns
Sort the gates topologically from inputs to outputs
- Fault-free data for the first L gates, for all patterns, is stored in global memory
- This avoids the need to transfer this data from the CPU
Mersenne Twister (MT) pseudo random number generator Long period, efficient use of memory, good distribution properties
Approach
CD(s) is computed using CPT
- Launch T threads in parallel
- In FSIM+, gates that are not driving any critical lines are not backtracked on during CPT
- In GFTABLE, all gates are backtracked on during CPT; the test (whether gates are driving critical lines) does not help prune 99.99% of gates, due to the large packet width – T × 32 bits (T = 16K)
- The large packet width is necessary in order to take advantage of the immense parallelism on the GPU, and to reduce the overhead of kernel launches and global memory accesses
Approach
Explicit fault simulation is performed in the forward-levelized manner, from a stem s to its immediate dominator t (or a PO)
- The input is CD(s) XORed with the fault-free value at s, thereby injecting the faults which are upstream from s and observable at s
Example: CD(k) = 0010, fault-free value at k = 0000
- Input applied at k = 0010 XOR 0000 = 0010
- Fault simulation yields p = 0010; the fault-free value at p = 0000
- Therefore D(k, p) = 0010 XOR 0000 = 0010
In FSIM+, simulation of the fanout of a gate g is scheduled only if the output at g differs from its fault-free value
Approach
On the GPU, a bitwise XOR operation is performed on the T words of the current output (gate evaluation) and the fault-free data
For the test 'is the result of the XOR all zero?', we perform a hybrid of a depth-first and breadth-first approach:
- Divide the T-long array (of the XOR's output) into groups of size Q (256)
- Compute the reduced OR of the data in each group into a single word, which is transferred to the CPU
- Avoid bank conflicts and divergent executions; minimize global memory access latencies; employ loop unrolling in the reduction code
- At the first non-zero value found on the CPU, return false
- Perform this test after simulating G (20) gates
All conditional tests, on CD(s) or D(s,t), are performed in a similar manner
(Figure: an example with Q = 3 – each group of three values is ORed into a single value, here 0, 1, 1, and these reduced values are returned to the CPU.)
Approach
For all faults in FFR(s) of the current stem s:
- If fault f_i is detectable at the stem s, and
- If stem s is detectable at a primary output, then
- Fault f_i is (globally) detectable, and the i-th row of the fault table is updated accordingly
In GFTABLE:
- Detectabilities are computed on the GPU
- Fault detectabilities are immediately transferred to the CPU ('word_size × T' bits transferred)
- The entire fault table is never stored on the GPU
Results

Circuit         FSIM+ (s)   GFTABLE (s)   GFTABLE vs FSIM+   GFTABLE-TESLA (s)   GFTABLE-TESLA vs FSIM+
b14             1502.47     100.87        14.90X             17.65               85.12X
b20             4992.73     319.82        15.61X             55.97               89.21X
:               :           :             :                  :                   :
b22             6319.47     399.34        15.82X             69.88               90.43X
Avg (20 ckts)                             15.68X                                 89.57X
112
GFTABLE was implemented on an NVIDIA Quadro FX 5800
- T = 16K, word_size = 32, L = 32K, Q = 256, G = 20
- To use the global memory effectively, the FAULT_LIST is partitioned into subsets of 1K faults, and GFTABLE is executed iteratively; this allows GFTABLE to operate on circuits with an arbitrary number of gates and faults
FSIM+ was run on a 32-bit, 3.6 GHz Intel CPU with 3 GB RAM, running Linux
The projected runtime on an NVIDIA Tesla system (8 GPUs) is 90x faster
Summary
A fault table is required by fault detection and diagnosis, and its compute time is very high
Fault table generation is accelerated using GPUs:
- All data-parallel computations are performed on the GPU; all conditional statements are evaluated on the CPU
- A tree-based reduced OR operation is performed on the GPU before transferring test data to the CPU
- The entire fault table is never stored on the GPU's memory
To handle larger circuits:
- Global memory stores a subset of the fault-free data
- The fault list is partitioned
Experimental results show a 15x speedup over FSIM+ using a single Quadro FX 5800
- The potential speedup over FSIM+ is 90x when using a Tesla GPU system
113