Introduction to GPU Programming for EDA
John F. Croix, Cadence Design Systems, Inc.
Sunil P. Khatri, Texas A&M University
Acknowledgements: NVIDIA, Nascentric Inc., Accelicon Inc.
Students: Kanupriya Gulati, Vinay Karkala, Kalyana Bollapalli
2
Outline
GPU Architecture Overview
GPU Programming
Algorithm Acceleration Guidelines
Case Studies
Conclusion
Q&A
3
Outline
GPU Architecture Overview
- Evolution and architecture
- Peak performance
- GPU and CPU interaction – practical considerations
GPU Programming
Algorithm Acceleration Guidelines
Case Studies
Conclusion
Q&A
4
GPU Evolution
In the early days, graphics accelerators were primitive
- Acceleration of graphics rendering tasks for (CRT) displays
- Many hardwired graphics acceleration units
With VLSI technology scaling, the GPU was born
- Many programmable processors to handle graphics rendering tasks
- Increased peak memory bandwidths and peak performance
- The goal was faster and more realistic rendering for gaming applications
Recently, several scientific communities began to leverage these GPUs
- Initially, graphics APIs like OpenGL and DirectX were used for these tasks
GPU vendors recognized this interest
- Development of C-like programming environments such as CUDA
- Development of GPU architectures tuned for scientific computations
5
GPU Introduction
A GPU is essentially a commodity stream processor
- Highly parallel (100s of processor cores)
- Very fast (>900 GFLOPS of peak performance)
- Operates in a SIMD manner; this is a key restriction: multiple processors operate in lock-step (same instruction) but on different data
GPUs, owing to their massively parallel architecture, have been used to accelerate image/stream processing, data compression and numerical algorithms; recently they have been used to accelerate CAD algorithms as well
Inexpensive, off-the-shelf cards like the NVIDIA Quadro FX 5800 / GTX 280 achieve impressive performance:
- 933 GFLOPs peak performance
- 240 SIMD cores partitioned into 30 Multiprocessors (MPs)
- 4 GB (Quadro) or 1 GB (GTX 280) device memory with 142 GB/s bandwidth
- 1.4 GHz GPU operating frequency
- Programmed with the Compute Unified Device Architecture (CUDA) framework
6
GPU Architecture
In the GTX 280, there are 10 Thread Processing Clusters (TPCs)
- Each TPC has 3 Streaming Multiprocessors (SMs), which we will refer to as multiprocessors (MPs)
- Each MP has 8 Streaming Processors (SPs) or Thread Processors (TPs); we will refer to these as processors
- 240 processors and 30 MPs in all!
- One double-precision FP unit per SM
Source : NVIDIA
GPU vs CPU: NVIDIA GTX 280 vs Intel i7 860¹

                         GPU                                    CPU
Registers                16,384 (32-bit) per multiprocessor³    128 reservation stations
Peak memory bandwidth    141.7 GB/sec                           21 GB/sec
Peak GFLOPs              562 (float) / 77 (double)              50 (double)
Cores                    240                                    4 (8 hyperthreaded)
Processor clock (MHz)    1296                                   2800
Memory                   1 GB                                   16 GB
Shared memory            16 KB per TPC²                         N/A
Virtual memory           None                                   Yes

¹ http://ark.intel.com/Product.aspx?id=41316
² TPC = Thread Processing Cluster (24 cores)
³ 30 multiprocessors in a 280
8
GPU vs CPU Peak Performance Trends GPU peak performance has grown aggressively. Hardware has kept up with Moore’s law
Source : NVIDIA
9
GPU Programming Model
The GPU is viewed as a compute device that:
- Is a coprocessor (slave) to the CPU (host)
- Has its own DRAM (device memory) but no virtual memory
  The entire design instance may not fit on the GPU!
A kernel is a CPU-callable function; a thread is an instance of a kernel; the GPU runs many threads in parallel
(Figure: host (CPU) connected over PCIe to the device (GPU) and its device memory; kernel threads (instances of the kernel) run on the device.)
Data Transfers (CPU ⇄ GPU)
GPUs and CPUs communicate via a PCIe bus
- This communication is expensive and should be minimized for target applications
Graphics applications usually require:
- Initial data to be sent from CPU to GPU
- A single transfer of processed data from GPU to CPU
General-purpose computations usually require:
- Multiple transfers between CPU and GPU (since conditional checks run on the CPU)
- Possibility of saturating the PCIe bus and reducing the achievable performance
10
GPU Threads vs. CPU Threads
GPU threads:
- Lightweight; small creation and scheduling overhead; extremely fast hardware context switching
- Need to issue 1000s of GPU threads to hide global memory latencies (600-800 cycles)
CPU threads:
- Heavyweight; large scheduling overhead; slow context switching
- Multi-GPU usage requires invocation of multiple CPU threads
- Each CPU thread creates a GPU context
- Context swapping is required for a CPU thread to access GPU memory allocated by another CPU thread
Device Memory Space Overview
Each thread runs on an SP and has:
- R/W per-thread registers (on-chip): limit usage (max 16K/MP)
- R/W per-thread local memory (off-chip)
- R/W per-block shared memory (on-chip): need to avoid bank conflicts
- R/W per-grid global memory (off-chip): not cached; the 600-800 cycle read latency is hidden by parallelism and fast context switches; it is the main means for data transfer between host and device; coalescing is recommended
- RO per-grid cached constant and texture memory (off-chip)
The host can R/W the global, constant and texture memories (visible to all threads)
(Figure: device grid of blocks; each block has its own shared memory, and each thread has registers and local memory; the global, constant and texture memories are shared across the grid and accessible from the host.)
Source : “NVIDIA CUDA Programming Guide” version 1.1
13
Outline
GPU Architecture Overview
GPU Programming
- CPU threads
- Conditional and loop processing
- Floating point
- General GPU program structure
- CUDA and OpenCL
Algorithm Acceleration Guidelines
Case Studies
Conclusion
Q&A
CPU Threading
CPU:
- All threads are equivalent
- Threads read/write concurrently to the same memory
- Synchronization primitives are required to avoid collisions
GPU (NVIDIA):
- Each CPU thread maintains a unique context
- GPU resources (e.g. memory, code modules, address space) are context-specific
- Each CPU thread can access a single context at once
- Contexts must be exchanged between CPU threads to share GPU resources between CPU threads
- Contexts use reference counting and are automatically destroyed
15
SIMD Conditional Processing
Unlike threads in a CPU-based program, SIMD programs cannot follow different execution paths
Ideal scenario:
- All GPU threads follow the same execution path
- All processors are active continuously
In divergent paths, some processors execute the then-block and others the else-block:
- Program flow cannot actually diverge; all instructions are executed
- The then- and else-blocks are both executed
- A bit is used to enable/disable processors based on the block being executed
- Parallelism is reduced, impacting performance
Idle Processors
Idle CPU processors can be dynamically rescheduled by the OS
SIMD processors are not actually idle:
- All processors scheduled are following identical execution paths
- Disabled (idle) processors are unavailable for other work and cannot be rescheduled
Effective utilization of processors is the programmer's responsibility
- Scheduling is an art, not necessarily a science
- Techniques will vary from chip to chip
16
Conditional Processing
...
if (condition)
{
    ...
}
else
{
    ...
}
...
17
Nested Conditional Processing
...
if (condition)
{
    if (condition2)
    {
        ...
    }
    else
    {
        ...
    }
}
else
{
    ...
}
...
18
Loop Processing
...
while (condition)
{
    if (cond2)
    {
        ...
    }
}
...
19
The Cost of Memory Access
Registers are extremely fast, but are a limited resource
Cached memories also tend to be small
For large data sets, global memory provides read & write access
- Accesses take between 600 and 800 clock cycles
- Accesses are *not* cached
- To hide memory latency, the hardware provides fast context switches when memory is accessed
- However, there must be enough computational work to do to hide the high cost of memory access
Programmers need to be smart
- Compilers often don't provide the necessary optimizations when optimizing for speed instead of code size
- It can sometimes be cheaper to recompute a result than to perform a memory read/write
20
Conditional Processing
...
if (condition)
{
    ...
    float a = someVar;   // access & swap
    ...
}
else
{
    ...
    float a = someVar;   // access & swap
    ...
}
...
21
...
float a = someVar;   // single access & swap
if (condition)
{
    ...
}
else
{
    ...
}
...
In the first version, each branch performs its own read of someVar, and under SIMD both branches are executed, so each read forces an access and context swap; hoisting the read above the conditional leaves a single access and swap.
Floating Point
GPUs are optimized for 32-bit accesses
64-bit double-precision values are fetched from memory as two 32-bit quantities
- May impact performance in the event of memory bank conflicts
One double-precision unit per multiprocessor¹

¹ http://www.ddj.com/hpc-high-performance-computing/210102115
OpenCL vs CUDA
CUDA uses early code binding:
- Code is compiled with normal C/C++/FORTRAN (beta) source code
- Need the CUDA occupancy calculator to determine the number of threads, based on resource utilization
- Library support: BLAS, FFT, DPT
OpenCL:
- Late binding of OpenCL code to the executable
- The OpenCL compiler/linker is embedded within the application
- No need for an occupancy calculator
- Only supports C
- No libraries
23
CUDA Occupancy Calculator
24
OpenCL vs CUDA
25
General Program Structure
- Initialize GPU
- Create GPU context
- Build GPU program
- Allocate GPU memory
- Transfer data from CPU to GPU
- Invoke GPU functions
- Transfer data from GPU to CPU
- Deallocate GPU memory
- Finalize GPU usage
26
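To make the sequence concrete, here is a minimal sketch of these steps in CUDA. The kernel, names and sizes are illustrative assumptions, not code from the tutorial:

#include <cuda_runtime.h>
#include <stdlib.h>

// Hypothetical kernel: double every element
__global__ void scaleKernel(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        d[i] *= 2.0f;
    }
}

int main()
{
    cudaSetDevice( 0 );                        // initialize GPU; CUDA creates the context implicitly
    const int n = 1 << 20;
    size_t bytes = n * sizeof( float );
    float *h = (float *)malloc( bytes );
    for (int i = 0; i < n; i++) { h[i] = (float)i; }
    float *d;
    cudaMalloc( (void **)&d, bytes );                      // allocate GPU memory
    cudaMemcpy( d, h, bytes, cudaMemcpyHostToDevice );     // transfer CPU -> GPU
    scaleKernel<<< (n + 255) / 256, 256 >>>( d, n );       // invoke GPU function
    cudaMemcpy( h, d, bytes, cudaMemcpyDeviceToHost );     // transfer GPU -> CPU (implicitly synchronizes)
    cudaFree( d );                                         // deallocate GPU memory
    free( h );
    return 0;
}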
Create GPU Context
CUDA:
- Context creation is implicit in single-threaded programs
- Multiple contexts can be explicitly created
- Each thread maintains a context stack; the top context is the current context
Threads:
- Contexts can be swapped between threads
- A thread can only have one context active at a time (stack)
- A context cannot be shared simultaneously between threads
OpenCL:
- All commands are explicitly associated with a context
- Must create a command queue to invoke commands
27
Initialize GPU
CUDA:
- cudaGetDeviceCount()
- cudaSetDevice()
- cudaGetDeviceProperties()
28
CUDA::CUDA(int Device) : Base()
{
    mValid = false;
    int DeviceCount;
    cudaGetDeviceCount( &DeviceCount );
    if (!DeviceCount) {
        return;
    }
    Device = Device == -1 ? DeviceCount - 1 : Device;
    cudaSetDevice( Device );
    mValid = true;
}
Initialize GPU
OpenCL:
- A context must be built before anything can be done on the GPU
- All commands are issued with respect to a given context
29
OpenCL::OpenCL(int Device) : Base()
{
    init();  // Initialize class pointers to NULL
    cl_int RC;
    mGPUContext = clCreateContextFromType( 0, CL_DEVICE_TYPE_GPU, NULL, NULL, &RC );
    size_t Bytes;
    RC = clGetContextInfo( mGPUContext, CL_CONTEXT_DEVICES, 0, NULL, &Bytes );
    int NumDevices = Bytes / sizeof( cl_device_id );
    cl_device_id *Devices = new cl_device_id[ NumDevices ];
    RC = clGetContextInfo( mGPUContext, CL_CONTEXT_DEVICES, Bytes, Devices, NULL );
    mCommandQueue = clCreateCommandQueue( mGPUContext, Devices[ Device ], 0, &RC );
    size_t MaxWorkItemSizes[ 256 ];
    RC = clGetDeviceInfo( Devices[ Device ], CL_DEVICE_MAX_WORK_ITEM_SIZES,
                          sizeof( MaxWorkItemSizes ), MaxWorkItemSizes, NULL );
    mMaxWorkItems = MaxWorkItemSizes[ 0 ];
    mMaxWorkItemsMask = ~(mMaxWorkItems - 1);
Build GPU Program
CUDA:
- GPU code is compiled using the nvcc compiler
- Object code is statically bound to the CPU executable
- GPU code is intrinsically part of the program
- Mapping of the problem to threads is performed at compile time
30
Build GPU Program
OpenCL:
- GPU code is bound to the GPU at runtime
- The OpenCL compiler is part of the executable
- Code can be source code or object code
- Source code can be dynamically generated by the program, or stored in an external file
31
// Continued from constructor
    char *code = shrFindFilePath( "code.cl", "." );
    size_t CodeLength = 0;
    char *Source = oclLoadProgSource( code, "", &CodeLength );
    const char *SourceCode = Source;
    mProgram = clCreateProgramWithSource( mGPUContext, 1, &SourceCode, &CodeLength, &RC );
    RC = clBuildProgram( mProgram, 0, NULL, NULL, NULL, NULL );
    std::free( code );
    std::free( Source );
    mValid = RC == CL_SUCCESS;
}
Allocate/Deallocate GPU Memory
CUDA:
- The most frequently used allocator: cudaMalloc()
- Returns a memory pointer to GPU memory
- The memory pointer cannot be used by the CPU directly; it is passed to GPU calls
32
void *CUDA::malloc(size_t Bytes)
{
    void *Memory;
    cudaError_t RC = cudaMalloc( &Memory, Bytes );
    return( RC == cudaSuccess ? Memory : NULL );
}

void CUDA::free(void *Memory)
{
    if (Memory) {
        cudaFree( Memory );
    }
}
Allocate/Deallocate GPU Memory
OpenCL:
- Like all things, memory allocation is explicitly performed within a context
33
void *OpenCL::malloc(size_t NumBytes)
{
    // Round the request up to a multiple of 32 bytes
    size_t Size = (NumBytes + 31) & ~size_t( 31 );
    cl_int RC;
    cl_mem Memory = clCreateBuffer( mGPUContext, CL_MEM_READ_WRITE, Size, NULL, &RC );
    // Note: the cl_mem handle is returned through a void*
    return( RC == CL_SUCCESS ? Memory : NULL );
}

void OpenCL::free(void *Memory)
{
    if (Memory) {
        cl_mem Ptr = reinterpret_cast<cl_mem>( Memory );
        clReleaseMemObject( Ptr );
    }
}
CPU/GPU Data Transfer
Data is moved across the PCIe bus
CUDA:
- Data transfer is accomplished via the cudaMemcpy() routine
- Implicit synchronization point; non-blocking copies are available
- Direction is determined by an enumeration: cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost
- Allocated memory can be bound to texture memory: cudaBindTexture
OpenCL:
- Memory transfer via clEnqueueWriteBuffer() and clEnqueueReadBuffer()
- Synchronization is controlled by parameters to the calls; the default is non-blocking
34
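As a sketch of how the OpenCL synchronization parameters work (buffer and pointer names are hypothetical), the third argument of clEnqueueWriteBuffer()/clEnqueueReadBuffer() selects blocking or non-blocking behavior:

// Non-blocking write: returns immediately; the transfer proceeds asynchronously
clEnqueueWriteBuffer( mCommandQueue, dBuf, CL_FALSE, 0, bytes, hSrc, 0, NULL, NULL );

// Blocking read: does not return until the data has arrived in hDst
clEnqueueReadBuffer( mCommandQueue, dBuf, CL_TRUE, 0, bytes, hDst, 0, NULL, NULL );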
Call GPU Functions (Kernels)
Functions on the CPU are executed when invoked
GPU function calls from the CPU enter an execution queue:
- The CPU does not wait until the GPU function completes; the command is simply queued
- The GPU executes commands on the queue using its own ordering
- Synchronization points cause the CPU to stall and wait for the GPU to return
- CUDA: cudaThreadSynchronize()
35
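For instance (kernel and argument names hypothetical):

someKernel<<< grid, block >>>( dData, n );  // returns immediately; the launch is queued
// ... the CPU is free to do other work here ...
cudaThreadSynchronize();                    // CPU stalls until all queued GPU work completes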
GPU Function Calls
GPU function calls have an associated dimensionality (which can be 1D, 2D or 3D)
CUDA:
- Extended language syntax to include the problem dimension
- Syntax: function<<<dimGrid,dimBlock>>>( arguments );
OpenCL:
- Must explicitly put function arguments into the context: clSetKernelArg()
- Invoke the kernel using the context
- The kernel retrieves its arguments from the context automatically
36
GPU Cleanup/Termination
CUDA:
- Manages most cleanup operations automatically as a context is destroyed
OpenCL:
- Provides low-level APIs for deallocation of all resources
- Invoked in the order opposite to invocation:
  clReleaseKernel()
  clReleaseProgram()
  clReleaseCommandQueue()
  clReleaseContext()
37
Thread Batching: Grids and Blocks
A kernel is executed as a grid of thread blocks (aka blocks)
A thread block is a batch of threads that can cooperate with each other by:
- Synchronizing their execution (diverging execution results in performance loss)
- Efficiently sharing data through a low-latency shared memory
Two threads from two different blocks cannot cooperate
(Figure: the host launches Kernel 1 on Grid 1 and Kernel 2 on Grid 2; each grid is a 2D array of blocks, and each block is an array of threads.)
Source : “NVIDIA CUDA Programming Guide” version 1.1
Block and Thread IDs
Threads and blocks have IDs, so each thread can identify what data it will operate on
- Block ID: 1D or 2D
- Thread ID: 1D, 2D, or 3D
This simplifies memory addressing when processing multidimensional data:
- Image processing
- Solving PDEs on volumes
- Other problems with underlying 1D, 2D or 3D geometry
(Figure: grid of blocks and a block of threads, as in the previous figure.)
Source : “NVIDIA CUDA Programming Guide” version 1.1
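As an illustrative sketch (not from the slides), a 2D kernel typically combines the block and thread IDs to locate its element of a width x height image:

__global__ void brighten(float *img, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // column from block/thread IDs
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // row from block/thread IDs
    if (x < width && y < height) {
        img[y * width + x] += 1.0f;                  // each thread touches a unique pixel
    }
}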
GPU Kernels
Each function invocation is passed data to create a unique ID
- The data typically specifies the 'spatial coordinates' of the executing processor within the hardware
- The ID is used to coordinate data access, ensuring that two threads' accesses do not collide
CUDA function types:
- __global__: callable by the CPU; cannot be called by the GPU
- __device__: callable by other GPU functions; cannot be called by the CPU
- CUDA expands __device__ functions as inline functions via nvcc, which adds to function resource utilization
40
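A small hedged example of the two function types (names are hypothetical):

__device__ float squaref(float x)            // callable only from GPU code; inlined by nvcc
{
    return x * x;
}

__global__ void sumOfSquares(const float *in, float *out, int n)  // callable from the CPU
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = squaref( in[i] );
    }
}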
OpenCL Kernel Invocation Use C++ templates to simplify argument handling
41
template<typename T> inline cl_int setArg(cl_kernel Kernel, unsigned Pos, T Arg)
{
    return( clSetKernelArg( Kernel, Pos, sizeof( T ), &Arg ) );
}

template<> inline cl_int setArg(cl_kernel Kernel, unsigned Pos, size_t SharedSize)
{
    // This routine, unlike the others, sets up shared memory by passing
    // NULL in as the pointer to the variable.
    return( clSetKernelArg( Kernel, Pos, SharedSize, NULL ) );
}

template<> inline cl_int setArg(cl_kernel Kernel, unsigned Pos, int Arg)
{
    cl_int ArgInt = Arg;
    return( clSetKernelArg( Kernel, Pos, sizeof( ArgInt ), &ArgInt ) );
}

template<> inline cl_int setArg(cl_kernel Kernel, unsigned Pos, float Arg)
{
    cl_float ArgFloat = Arg;
    return( clSetKernelArg( Kernel, Pos, sizeof( ArgFloat ), &ArgFloat ) );
}
...
template<typename T0> inline cl_int setArgs(cl_kernel Kernel, T0 Arg0)
{
    return( setArg( Kernel, 0, Arg0 ) );
}

template<typename T0, typename T1> inline cl_int setArgs(cl_kernel Kernel, T0 Arg0, T1 Arg1)
{
    return( setArg( Kernel, 0, Arg0 ) | setArg( Kernel, 1, Arg1 ) );
}

template<typename T0, typename T1, typename T2> inline cl_int setArgs(cl_kernel Kernel, T0 Arg0, T1 Arg1, T2 Arg2)
{
    return( setArg( Kernel, 0, Arg0 ) | setArg( Kernel, 1, Arg1 ) | setArg( Kernel, 2, Arg2 ) );
}
...
OpenCL Kernel Invocation
BLAS-like example:
- CUDA provides a BLAS library; OpenCL doesn't
- Must write our own BLAS routines in OpenCL to port between the two easily
- The swap() function swaps the contents of 2 vectors with differing vector strides
42
void OpenCL::blasSswap(int n, float *x, int incx, float *y, int incy)
{
    if (!checkBLASKernel( &mSswapKernel, "Sswap" )) {
        return;
    }
    mLastBLASStatus = Base::BLAS_INTERNAL_ERROR;
    if (x && y) {
        if (setArgs( mSswapKernel, n, x, incx, y, incy ) == CL_SUCCESS) {
            executeBLASKernel( mSswapKernel, n );
        }
    }
}
OpenCL Kernel Invocation BLAS support functions
43
bool OpenCL::checkBLASKernel(cl_kernel *Kernel, const char *KernelName)
{
    if (!mValid) {
        mLastBLASStatus = Base::BLAS_NOT_INITIALIZED;
        return( false );
    }
    if (!(*Kernel)) {
        cl_int RC;
        *Kernel = clCreateKernel( mProgram, KernelName, &RC );
        if (RC != CL_SUCCESS) {
            mLastBLASStatus = Base::BLAS_INTERNAL_ERROR;
            return( false );
        }
    }
    return( true );
}

inline void OpenCL::executeBLASKernel(cl_kernel Kernel, int n)
{
    size_t Size = n;
    size_t GlobalWorkSize = Size & mMaxWorkItemsMask;
    if (Size & ~mMaxWorkItemsMask) {
        GlobalWorkSize += mMaxWorkItems;
    }
    cl_int RC = clEnqueueNDRangeKernel( mCommandQueue, Kernel, 1, NULL,
                                        &GlobalWorkSize, &mMaxWorkItems,
                                        0, NULL, NULL );
    clFinish( mCommandQueue );
    mLastBLASStatus = (RC == CL_SUCCESS) ? Base::BLAS_SUCCESS
                                         : Base::BLAS_EXECUTION_FAILED;
}
OpenCL Kernels BLAS SSWAP example
44
__kernel void Sswap(int n, __global float *x, int incx,
                    __global float *y, int incy)
{
    const unsigned GID = get_global_id( 0 );
    if (GID < n) {
        int lx = (incx >= 0) ? 0 : ((1 - n) * incx);
        int ly = (incy >= 0) ? 0 : ((1 - n) * incy);
        float temp = y[ ly + GID * incy ];
        y[ ly + GID * incy ] = x[ lx + GID * incx ];
        x[ lx + GID * incx ] = temp;
    }
}
http://developer.download.nvidia.com/OpenCL/NVIDIA_OpenCL_JumpStart_Guide.pdf
CUDA Kernels
GPU (kernel.cu) and CPU sides:
45
// GPU (kernel.cu)
#define BLOCK_DIM 16

__global__ void transpose_naive(float *odata, float *idata, int width, int height)
{
    unsigned int xIndex = blockDim.x * blockIdx.x + threadIdx.x;
    unsigned int yIndex = blockDim.y * blockIdx.y + threadIdx.y;
    if (xIndex < width && yIndex < height) {
        unsigned int index_in  = xIndex + width * yIndex;
        unsigned int index_out = yIndex + height * xIndex;
        odata[index_out] = idata[index_in];
    }
}

// CPU
#include "kernel.cu"
...
{
    const unsigned int size_x = 256;
    const unsigned int size_y = 4096;
    ...
    dim3 grid(size_x / BLOCK_DIM, size_y / BLOCK_DIM, 1);
    dim3 threads(BLOCK_DIM, BLOCK_DIM, 1);
    transpose_naive<<< grid, threads >>>(d_odata, d_idata, size_x, size_y);
    cudaThreadSynchronize();
    ...
}
46
Outline
GPU Architecture Overview
GPU Programming
Algorithm Acceleration Guidelines
- Streams and pinned memory
- Thread scheduling
- Parallel reduction
- Program partitioning
- Simultaneous graphics and algorithm processing
Case Studies
Conclusion
Q&A
Streams
A stream is a sequence of commands that execute serially
Different streams allow overlapping of memory transfers and kernel computations:
- Hides the data transfer cost
Implementable on CUDA devices with compute capability ≥ 1.1
Host memory must be of type 'pinned'
47
(Figure: without streams, the H→D transfers, kernel computation and D→H transfers for Data1 and Data2 run back-to-back; with streams, the transfers and kernel computations of Data1 and Data2 overlap in time.)
Pinned Memory
Memory on the host that is mapped into the device's address space, and thus accessible directly by a kernel
Has several advantages:
- There is no need to allocate a block in device memory and copy data between this block and the block in host memory; data transfers are implicitly performed as needed by the kernel
- Bandwidth between host and device memories is higher
Write-combining memory:
- A type of pinned memory where individual writes are aggregated into a larger write operation
- Avoids internal L1/L2 cache writes, making more cache available for the rest of the application
- Is not snooped during transfers across the PCI Express bus, which can improve transfer performance by up to 40%
48
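A minimal sketch of two-stream overlap with pinned host memory (sizes, names and the trivial kernel are illustrative assumptions):

__global__ void process(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) { d[i] += 1.0f; }
}

const int n = 1 << 20;
const size_t bytes = n * sizeof( float );
cudaStream_t stream[2];
float *hPinned[2], *dBuf[2];
for (int s = 0; s < 2; s++) {
    cudaStreamCreate( &stream[s] );
    cudaMallocHost( (void **)&hPinned[s], bytes );   // pinned host memory, required for async copies
    cudaMalloc( (void **)&dBuf[s], bytes );
}
for (int s = 0; s < 2; s++) {
    // The H->D copy, kernel and D->H copy in stream s overlap with the other stream's work
    cudaMemcpyAsync( dBuf[s], hPinned[s], bytes, cudaMemcpyHostToDevice, stream[s] );
    process<<< (n + 255) / 256, 256, 0, stream[s] >>>( dBuf[s], n );
    cudaMemcpyAsync( hPinned[s], dBuf[s], bytes, cudaMemcpyDeviceToHost, stream[s] );
}
cudaThreadSynchronize();   // wait for both streams to drain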
Threads and Scheduling in GPU
The GPU consists of 'multiprocessors', each of which has many processors
A kernel is executed as a grid of blocks
A thread block is a batch of threads that cooperate with each other by:
- Synchronizing their execution (diverging execution results in performance loss)
- Efficiently sharing data through a low-latency shared memory
All threads of a block reside on the same multiprocessor (max 1024 threads/MP)
The number of blocks a multiprocessor can process at once depends on the register and shared memory usage per thread
(Figure: kernels launched as grids of blocks, each block containing threads, as in the earlier grid/block figure.)
Source : “NVIDIA CUDA Programming Guide” version 1.1
Threads and Scheduling in GPU (contd.)
Before execution, a block is split into warps
- A warp is a set of 32 threads which execute the same instruction on an MP
- A half-warp is either the first 16 or the second 16 threads of a warp
Full efficiency is realized when all 16 threads of a half-warp agree on their execution path
Branch divergence occurs if threads of a half-warp diverge via a data-dependent conditional branch:
- The half-warp serially executes each branch path taken, ignoring the results from threads that are not on that path
- This increases kernel execution time
Warps of the same block are executed in a time-sliced fashion
50
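As an illustrative sketch, the granularity of the condition decides whether a warp diverges:

__global__ void branchy(float *a)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Diverges inside every half-warp: both paths are serialized
    if (threadIdx.x & 1) { a[i] += 1.0f; } else { a[i] -= 1.0f; }
    // Uniform across each 32-thread warp: no divergence penalty
    if ((threadIdx.x / 32) & 1) { a[i] *= 2.0f; } else { a[i] *= 0.5f; }
}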
Program Parallelism
The GPU is designed to address applications that are data-parallel
Parallelism is an inherent factor in determining the suitability of a problem for the GPU
- In fact, applications in which enough parallelism cannot be exposed may be slower on a GPU than on a single-threaded CPU
Since the same program is executed for each data element, there is no sophisticated flow control
- Conditional checks need to be done on the CPU
- Reduce the output of all threads, transfer the reduced result to the CPU, which tests the condition and issues further GPU threads appropriately
- This can be expensive, since transfers are done over the PCIe bus!
Parallel Reduction
Perform a reduction of the data before transferring it to the CPU
A tree-based reduction approach is used within each thread block
The reduction is decomposed into multiple kernels to reduce the number of threads issued in the later stages of the tree-based reduction
52
Example of a tree-based SUM (syncThreads() between levels):
Level 0:  3  1  7  0  4  1  6  3
Level 1:    4     7     5     9
Level 2:       11          14
Level 3:             25
Parallel Reduction (contd.)
Types of optimization for efficient parallel reduction:
Algorithmic optimizations:
- Avoid divergent warps
- Avoid shared memory bank conflicts – sequential addressing
- First addition during the global load – halves the number of blocks
Code optimizations:
- Loop unrolling
- Multiple adds per thread, to increase the 'arithmetic intensity' of kernels (a high ratio of computation in the kernel to global reads and writes)
53
Example of a tree-based reduced sum
Parallel Reduction (contd.)
(Figure: shared-memory contents after each step of reducing the 16 values 10 1 8 -1 0 -2 3 5 -2 -3 2 7 0 11 0 2 to the sum 41; active thread IDs 0, 2, 4, ..., 14, then 0, 4, 8, 12, then 0, 8, then 0; annotated 'warp divergence removed'.)
Parallel Reduction (contd.)
(Figure: the same reduction trace with contiguous active thread IDs 0-7, then 0-3, then 0-1, then 0, and the shared-memory bank ID (0-7) of each entry shown.)
Parallel Reduction (contd.)
(Figure: the same reduction trace using sequential addressing – each active thread adds an element a stride of half the remaining array away, avoiding bank conflicts.)
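A hedged sketch of a block-level sum reduction applying these optimizations (sequential addressing, first add during the global load); it is modeled on the well-known CUDA SDK reduction, not code from the slides:

__global__ void reduceSum(const float *in, float *out, int n)
{
    extern __shared__ float s[];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * (blockDim.x * 2) + threadIdx.x;

    // First addition during the global load: halves the number of blocks
    float v = (i < n) ? in[i] : 0.0f;
    if (i + blockDim.x < n) { v += in[i + blockDim.x]; }
    s[tid] = v;
    __syncthreads();

    // Sequential addressing: no shared memory bank conflicts, and no divergence
    // within a half-warp until the active thread count drops below 16
    for (unsigned int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) { s[tid] += s[tid + stride]; }
        __syncthreads();
    }
    if (tid == 0) { out[blockIdx.x] = s[0]; }   // one partial sum per block
}

The kernel is launched repeatedly, each pass shrinking the array by a factor of 2 x blockDim.x, which matches the multi-kernel decomposition described above.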
Program Partitioning
Assume a subroutine S is invoked N times in an application, and that each invocation uses x registers
A multiprocessor of the GPU has 16K registers, so the maximum parallelism = 16K/x
- The GPU can do fast hardware context switches between the threads, which share the 16K registers
- Partitioning S into smaller kernels that use fewer registers (say y < x) therefore allows more threads to be resident at once
However, data transfers between kernels become a significant overhead as the number of partitions increases
(Figure: timeline of the N invocations of S over time T, before and after partitioning; register usage per invocation drops from x to y.)
Simultaneous Graphics and Algorithm Processing
If the same GPU is used for graphics and algorithmic processing:
- GPU resources may be saturated by the graphics application, leaving little bandwidth for other applications
- The fixed size of GPU memory (without swap space) may cause application launch failure
- Graphics tasks may cause cache pollution, which may cause erratic runtimes for general-purpose applications (run 'warm-up code' to flush out the caches)
- A single kernel execution cannot be longer than 5 seconds
Using a separate GPU for graphics and computation avoids the problems listed above
58
59
Outline
GPU Architecture Overview
GPU Programming
Algorithm Acceleration Guidelines
Case Studies
- Boolean Satisfiability
- Fast SPICE model evaluation
- Fault simulation
- SSTA
Conclusion
Q&A
Guidelines for GPU Acceleration of Software
- Current GPUs have an expensive communication link to the host; data transfers should be minimized
- Streams should be used to overlap communication and computation
- Partition kernels to increase the parallelism that can be leveraged
- Full efficiency is realized when all 16 threads of a half-warp agree on their execution path; reduce warp divergence
- Avoid bank conflicts when using shared memory
- Kernels should have high arithmetic intensity
60
Case Studies
Two approaches for accelerating an algorithm on the GPU:
Re-architecting approach
- Applicable when the problem does not have an inherent SIMD nature
- May require significant algorithmic modifications
- Examples: Boolean Satisfiability; Fault Dictionary Computation (not covered in this talk, slides at end)
Porting approach
- Applicable when the problem runtime is dominated by a subroutine, multiple invocations of which operate upon independent data
- Partition the subroutine into GPU kernels
- Examples: accelerating SPICE by porting model evaluation onto the GPU; Fault Simulation; Monte Carlo based statistical static timing analysis (SSTA)
Boolean Satisfiability (SAT)
Given a Boolean formula in conjunctive normal form (CNF):
- Either find a satisfying truth assignment of all variables
- Or prove that there is no satisfying assignment
Example: f = ( x + z )( y + z )( x' + y' + z' ), where each parenthesized term is a clause, x is a positive literal and x' is a negative literal
Decisions: x = true, y = true
- The unassigned literal z' gets implied because of the unit clause rule
- Implication: z = false
Iterative application of the unit clause rule is called Boolean constraint propagation (BCP)
Recent BCP-based SAT solvers incorporate conflict-driven learning
- A learned clause represents the search space that has been pruned
Approach
Complete approaches for SAT
- Are exact, but the algorithms do not easily lend themselves to parallel implementations. Examples: GRASP, zChaff, CirCUs, MiniSAT
Stochastic approaches for SAT
- Can execute at high speeds and are scalable, but are not exact. Examples: Survey Propagation, WalkSAT, RandomSAT
We present a hybrid procedure for SAT
- Retains the best features of complete and stochastic approaches
- The proposed algorithm is based on MiniSAT (implemented on the CPU)
- The variable ordering heuristic of MiniSAT is enhanced by a survey propagation (SP) based procedure, which is implemented on the GPU
- The proposed approach is called MESP (MiniSAT enhanced with SP)
Next few slides: discuss the GPU-based SP implementation, then describe our MESP approach
Survey Propagation (SP) based SAT
Factor graph – a graphical representation of a SAT instance:
- Variable nodes (variables)
- Function nodes (clauses)
- It is a tree if it has no cycles
SP is an algorithm in which 'agreement' between clauses and variables is reached by sending probabilistic 'messages' along the edges of the factor graph (message passing)
- Pros: highly scalable, parallelizable, exact for factor graphs that are trees
- Cons: incomplete for non-tree factor graphs
(Figure: a small factor graph over the variables x, y, z.)
65
Survey Propagation Equations
Notation:
- α, β are clauses; i, j are variables
- V+(i): the set of all clauses where i appears in the positive '+' form
- V−(i): the set of all clauses where i appears in the negative '−' form
- η_{α→i} is a warning (a probability) from clause α to variable i
- The η's and π's are iteratively computed until convergence

During computation (let i be in the '+' form in α):
p = ∏_{β ∈ V+(i)∖α} (1 − η_{β→i}),  q = ∏_{β ∈ V−(i)∖α} (1 − η_{β→i})
π_i^u = p(1 − q),  π_i^s = q(1 − p),  π_i^* = pq
η_{α→i} = ∏_{j ∈ α∖i} [ π_j^u / (π_j^u + π_j^s + π_j^*) ]

After convergence:
p = ∏_{β ∈ V+(i)} (1 − η_{β→i}),  q = ∏_{β ∈ V−(i)} (1 − η_{β→i})
π_i^+ = p(1 − q),  π_i^− = q(1 − p),  π_i^* = pq
W_i(+) = π_i^+ / (π_i^+ + π_i^− + π_i^*),  W_i(−) = π_i^− / (π_i^+ + π_i^− + π_i^*)
66
Survey Propagation Flowchart
- Randomly initialize the η_{α→i}
- Compute the π's, then compute the new η_{α→i}
- Convergence check: C = Σ [ |η_{α→i}^new − η_{α→i}| ≤ ε ? 0 : 1 ]; set η_{α→i} ← η_{α→i}^new
- If C ≠ 0: if the iteration count exceeds the maximum, declare non-convergence; otherwise iterate again
- If C == 0 (converged): if Σ(η_{α→i}) ≈ 0, call WalkSAT to determine a satisfying assignment
- Otherwise, compute the biases W, sort the variables in decreasing order of the W's, and fix the first x% of variables on the sorted list (fixed variables and satisfied clauses are ignored from then on); if a contradiction arises, report it and quit; then iterate again
67
Survey Propagation on the GPU
Implemented GPU kernels for the following:
- Compute π's for all variables (V) in parallel
- Compute η's for all clauses (C) in parallel; in particular, compute η_{α→i} for each variable i in clause α
- Check convergence (Σ [ |η_{α→i}^new − η_{α→i}| ≤ ε ? 0 : 1 ]) using a reduced 'integer' add operation over all literals in all clauses
- Compute Σ(η_{α→i}) (to determine whether the convergence is non-trivial) using a reduced 'float' add operation
- Compute W's for all variables in parallel
- Parallel 'bitonic' sort to find the largest x% of the W's
The CPU performs conditional checks, fixes variables and executes WalkSAT
Data Structures on the GPU
- Per-variable data (static), for variables 1 ... |V|: clause #, literal #, polarity
- Per-clause data (static), for clauses 1 ... |C|: variable #, polarity
- η's (1 ... |C|): written by clauses, read by variables
- π's (π+, π−, for variables 1 ... |V|): written by variables, read by clauses
With 1 GB of global memory, the GTX 280 GPU can fit instances with up to 10M clauses and 1M variables
69
Survey Propagation on the GPU
Memory transfers between GPU and CPU:
- A single transfer for the static per-variable and per-clause data
- During the computation of π and η, there are no transfers at all; all intermediate data is stored in the global memory of the GPU
- After convergence is detected, the sorted list of variables in decreasing order of biases is transferred (GPU → CPU)
- After the graph is simplified, the following are updated (CPU → GPU): variables that are fixed (they don't contribute to the η computation) and clauses that are satisfied (they don't contribute to the π computation)
70
Results (GPU-based SP); runtimes in seconds

Inst. Name    #Vars    #Clauses   MiniSAT    B05        Ours       Speedup
Random_1      20,000   83,999     >2 hrs.    3009.67    172.87     17.41X
Random_2      16,000   67,199     >2 hrs.    1729.48    110.60     15.63X
Random_3      12,000   50,399     >2 hrs.    1002.48    57.98      17.29X
Random_4      8,000    33,599     >2 hrs.    369.61     5.82       63.80X
Random_5      4,000    16,799     >2 hrs.    65.01      3.69       17.62X
Uf200-07      200      860        0.15       0.20       0.08       2.50X
hole10        187      792        1.30       Contrdn.   Contrdn.
Uf200-018     200      860        0.19       No conv.   No conv.
Avg. (over 20)                                                     22.37X
MESP is compared against:
- Braunstein et al. 2005 (B05) and MiniSAT, which were executed on a 3.6 GHz, 3 GB Intel machine running Linux
- Manolios et al. 2006 (M06), which uses OpenGL on an NVIDIA GTX 7900 (512 MB memory, 128 cores, 750 MHz) to implement survey propagation
For hard random instances, MESP shows a 22X speedup over B05
- M06 reports a 9X speedup over B05
MESP
The SAT instance is read into MiniSAT and onto the GPU (executing SP)
MiniSAT is first invoked on the instance and, after it has made some progress, it invokes GPU-based SP. MiniSAT transfers to SP:
- The current assignments, and
- A subset of the current learned clauses
We augment the current clause database in GPU-based SP with 3 sets of learned clauses (LC): C1, C2 and C3, where L is the number of literals in an LC
- C1 (0 < L ≤ 10); C2 (10 < L ≤ 25); C3 (25 < L ≤ 50)
- We statically allocate enough space in the GPU's global memory to store 8K clauses in each of C1, C2 and C3
Messages computed over all clauses (η) are now computed in 4 separate kernels, one for each set of clauses (C1, C2, C3 and C*)
On convergence, SP (in MESP) fixes variables for which the absolute bias difference |W(+) − W(−)| < τ
MESP
MiniSAT decides the next variable to assign based on the Variable State Independent Decaying Sum (VSIDS) heuristic
- VSIDS chooses the next decision variable with the highest activity
- Activity is the variable occurrence count, with a higher weight on the variables of the more recently added learned clauses
- The activity of the variables in the learned clauses is incremented by F_M
In MESP, a GPU-based SP invocation can return with the following outcomes:
- SP converges and fixes certain variables S; MiniSAT updates the activity of the variables in S by F_SP and continues the search
- SP converges, fixes S and determines the factor graph is a tree; it invokes WalkSAT. If WalkSAT finds an assignment, the instance is solved; else the fixed variables in S are returned to MiniSAT
- SP converges but does not fix any variable
- SP does not converge, or reports a contradiction; MiniSAT continues the search
73
(Figure: MiniSAT (complete, on the CPU) performs the initial search over its decision tree and passes the current assignments and a subset of the learned clauses to survey propagation (stochastic, on the GPU). The GPU attempts to converge on the SP messages, working in conjunction with the CPU to fix variables; the CPU instructs the GPU to ignore fixed variables and satisfied clauses. The activity table is updated for the variables S that are fixed in SP, and MiniSAT continues its search using the updated activities.)
Results

Instance      S/U   k-SAT: #Vars / #Clauses / MiniSAT(k)   3-SAT: #Vars / #Clauses / MiniSAT(3)   MESP     vs MiniSAT(k)   vs MiniSAT(3)
1394694p      S     327932 / 1283772 / 29.84               530027 / 1890057 / 39.58               15.28    1.95X           2.59X
AProVE07      U     78607 / 208911 / 110.39                104732 / 287286 / 166.25               95.91    1.15X           1.73X
eijk.bs4863   S     140089 / 530249 / 487.98               234412 / 813218 / 619.03               181.86   2.68X           3.40X
:             :     :                                      :                                      :        :               :
eijk.S298     U     73222 / 283211 / 8.42                  136731 / 473738 / 10.01                8.47     0.99X           1.18X
Avg (over 13)                                                                                              1.64X           1.92X

The MESP approach was run on a GTX 280 GPU card in an Intel i7 machine (2.6 GHz, 9 GB RAM) running Linux; MiniSAT was run on the same CPU. Runtimes are in seconds.
D = 1% of the number of variables; F_SP = F_M = 1; C = 20; τ = 0.01
The learned clauses on the GPU were updated at every 5th invocation of SP
- Up to 24K learned clauses
None of these instances were solved in MESP by an invocation of WalkSAT
75
Summary
MESP is a GPU-enhanced variable ordering heuristic for SAT
GPU-based survey propagation:
- π's for all variables and η's for all clauses are computed in parallel
- Convergence is checked using a reduced 'integer' add operation over all literals in all clauses
- Testing for non-trivial convergence uses a reduced 'float' add operation
- Biases for all variables are computed in parallel
- A parallel 'bitonic' sort finds the largest x% of the biases
Survey propagation enhances the variable ordering in MESP:
- The clause database on the GPU is augmented with 3 sets of learned clauses
- η's for all clauses are computed in 4 different kernels
On average, MESP is 64% (92%) faster than MiniSAT on the original (3-SAT) instances
SPICE Model Evaluation on a GPU
SPICE is the de facto industry standard for VLSI circuit simulation
There is significant motivation for accelerating SPICE simulations without losing accuracy:
- Increasing complexity and size of VLSI circuits
- Increasing impact of process variations on the electrical behavior of circuits, which requires Monte Carlo based simulations
We accelerate the computationally expensive portion of SPICE – transistor model evaluation – on a GPU
The proposed approach is integrated into a commercial SPICE accelerator tool, OmegaSIM
- OmegaSIM is already 10-1000x faster than traditional SPICE implementations
- With the proposed approach integrated, OmegaSIM achieves a further speedup of 2.36X (3.07X) on average (maximum)
Approach
We profiled SPICE simulations over several benchmarks:
- 75% of the time is spent in BSIM3 device model evaluations
- Billions of calls are made to the device model evaluation routines: every device in the circuit is evaluated for every time step, possibly repeatedly until the Newton-Raphson loop for solving the non-linear equations converges
- Considering Amdahl's law, the asymptotic speedup is 4X
These calls are parallelizable:
- They are independent of each other
- Each call performs identical computations on different data
- They conform to the GPU's SIMD operating paradigm
Approach
CDFG-guided manual partitioning of the BSIM3 evaluation code
There are limitations on the available hardware resources:
- Registers (8192 per multiprocessor)
- Shared memory (16 KB per multiprocessor)
- Bandwidth to global memory (the maximum sustainable is ~80 GB/s)
If the entire BSIM3 model is implemented as a single kernel, not enough threads can be issued in parallel to hide the global memory access latency
If the BSIM3 code is partitioned into many (small) kernels, large amounts of data must be transferred across kernels through global memory (not cached), which negatively impacts performance
The proposed approach:
- Creates the CDFG of the BSIM3 equations
- Uses maximally disconnected components of this graph as the different kernels, considering the above hardware limitations
Approach
Take GPU memory constraints into account:
Global memory
- Used to store intermediate data which is generated by one kernel and needed by another (instead of transferring this data to the host)
Texture memory
- Used for storing 'runtime parameters': device parameters that remain unchanged throughout the simulation
- Advantages: it is cached, unlike global memory; no coalescing requirements, unlike global memory; no bank conflicts, such as are possible in shared memory; CUDA's efficient built-in texture fetching routines are used; the small texture memory loading overhead is easily amortized
Constant memory is used for storing physical constants
- Most efficient when all threads access the same data
Experiments
The proposed approach is implemented and integrated into a commercial SPICE accelerator tool – OmegaSIM
Hardware used:
- CPU: Intel Core 2 Quad, 2.4 GHz, 4 GB RAM
- GPU: GeForce 8800 GTS, 128 processors, 675 MHz, 512 MB RAM
Comparing BSIM3 model evaluation alone:

# Evals   GPU Proc. (ms)   GPU Tran. (ms)   GPU Total (ms)   CPU (ms)    Speedup
1M        81.17            196.48           277.65           8975.63     32.33X
2M        184.91           258.79           443.70           18086.29    40.76X
Ckt. Name      #Trans.   Total #Evals    OmegaSIM CPU-alone (s)   GPU+CPU (s)   Speedup
Industrial_1   324       1.86 x 10^7     49.96                    34.06         1.47X
Industrial_2   1098      2.62 x 10^9     118.69                   38.65         3.07X
Industrial_3   1098      4.30 x 10^8     725.35                   281.5         2.58X
Buf_1          500       1.62 x 10^7     27.45                    20.26         1.35X
Buf_2          1000      5.22 x 10^7     111.5                    48.19         2.31X
Buf_3          2000      2.13 x 10^8     486.6                    164.96        2.95X
ClockTree_1    1922      1.86 x 10^8     345.69                   132.59        2.61X
ClockTree_2    7682      1.92 x 10^8     458.98                   182.88        2.51X
Avg.                                                                            2.36X
Experiments – Complete SPICE Simulation
With an increase in the number of transistors, the speedup obtained is higher:
- More device evaluation calls are made in parallel, so latencies are better hidden
High accuracy with a single-precision floating point implementation:
- Over 1M device evaluations, the average (maximum) error is 2.88 x 10^-26 (9.0 x 10^-22) Ampere
- Newer devices with double-precision capability are already on the market
Conclusions
There is significant interest in accelerating SPICE
75% of the SPICE runtime is spent in BSIM3 model evaluation – this allows an asymptotic speedup of 4X
Our approach of accelerating model evaluation using GPUs has been integrated with a commercial fast SPICE tool
- Obtained a speedup of 2.36X on average
- BSIM3 model evaluation alone can be sped up by 30-40X over 1M-2M calls
Take GPU memory constraints into account:
- Global memory is used to store intermediate data
- Texture memory is used for storing 'runtime parameters'
- Constant memory is used for storing physical constants
Partition kernels carefully, since:
- If the entire BSIM3 model is implemented as a single kernel, not enough threads can be issued in parallel to hide the global memory access latency
- If the BSIM3 code is partitioned into many (small) kernels, large amounts of data must be transferred across kernels through global memory
83
Introduction – Fault Simulation
Fault simulation (FS) is crucial in the VLSI design flow
- Given a digital design and a set of vectors V, FS evaluates the number of stuck-at faults (F_sim) tested by applying V
- The ratio F_sim/F_total is a measure of fault coverage
Current designs have millions of logic gates
- The number of faulty variations is proportional to the design size
- Each of these variations needs to be simulated for the V vectors
Therefore, it is important to explore ways to accelerate FS
The ideal FS approach should be fast, scalable and cost-effective
84
Approach
We implement a look-up table (LUT) based FS
- All gates' LUTs are stored in texture memory (cached)
- The LUTs of all library gates fit in the texture cache, avoiding cache misses during lookup
- An individual k-input gate LUT requires 2^k entries
- Each gate's LUT entries are located at a fixed offset in the texture memory
The gate output is obtained by accessing the memory at 'gate offset + input value'
- Example: the output of an AND2 gate with inputs '1' and '0' is read at offset + 2 (binary 10)
(Figure: LUT entries indexed 0, 1, 2, 3 at the gate's offset.)
85
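A hedged sketch of the lookup (the texture reference name is hypothetical):

texture<int, 1, cudaReadModeElementType> lutTex;   // bound with cudaBindTexture to the LUT array

__device__ int evalGate(int gateOffset, int inputValue)
{
    // e.g. AND2 with inputs (1,0): fetch at gateOffset + binary 10 = gateOffset + 2
    return tex1Dfetch( lutTex, gateOffset + inputValue );
}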
Approach
We evaluate two vectors for the same gate in a single thread
1/2/3/4-input gates require 4/16/64/256 entries in their LUTs, respectively
Our library consists of an INV and 2/3/4-input AND, NAND, NOR and OR gates
- Hence, the total memory required for all LUTs is 1348 words
- This fits in the texture memory cache (8 KB per MP)
We exploit both fault and pattern parallelism:
- Fault parallel: all gates at a fixed topological level are evaluated in parallel
- Pattern parallel: simulations for any gate, for different patterns, are done in parallel
86
Approach
In practice, simulations for any gate, for different patterns, are done in 2 phases, for all the faults which lie in its TFI only:
- Phase 1: good circuit simulation; the results are returned to the CPU
- Phase 2: faulty circuit simulation; the CPU does not schedule a stuck-at-v fault in a pattern which has v as the good circuit value
Fault injection is also performed in parallel
(Figure: for each of the vectors 1 ... 2N, threads compute the good circuit value and the faulty circuit value.)
Approach – Fault Injection

m0   m1   Meaning
-    11   Stuck-at-1 mask
11   00   No fault injection
00   00   Stuck-at-0 mask

typedef struct __align__(16)
{
    int offset;      // Gate type's offset
    int a, b, c, d;  // Input values
    int m0, m1;      // Mask variables
} threadData;
Approach – Fault Simulation
88
Approach – Fault Detection

typedef struct __align__(16)
{
    int offset;                  // Gate type's offset
    int a, b, c, d;              // Input values
    int Good_Circuit_threadID;   // Good circuit simulation thread ID
} threadData_Detect;
89
Approach
We maximize GPU performance by ensuring that:
- No data dependency exists between threads issued in parallel
- The same instructions, on different data, are executed by all threads
We adapt to specific G80 memory constraints:
- The LUT is stored in texture memory. Key advantages: texture memory is cached; the total LUT size easily fits into the available cache size of 8 KB/MP; no memory coalescing requirements; efficient built-in texture fetching routines are available in CUDA; the non-zero time taken to load the texture memory is easily amortized
- Global memory writes for level-i gates (and reads for level-i+1 gates) are performed in a coalesced fashion
Results
FS runtimes on the GTX 280 are compared to a commercial fault simulator for 30 IWLS and ITC benchmarks
- 32K patterns were simulated for all 30 circuits
- CPU times were obtained on a 1.5 GHz, 1.5 GB UltraSPARC-IV+ processor running Solaris 9
- GPU time includes:
  - Data transfer time between the GPU and CPU (both directions): CPU → GPU: 32K patterns and the LUT data; GPU → CPU: 32K good circuit evaluations for all gates, plus the Detect array
  - Processing time on the GPU
  - Time spent by the CPU to issue good/faulty gate evaluation calls
  - Time spent loading the LUTs
Circuit          #Gates   #Faults   Comm. (s)   GPU (s)   Speedup   Proj. Tesla (s)   Speedup
s9234_1          1462     3883      6.190       0.134     46.067    0.022             275.754
s35932           14828    34628     51.920      1.390     37.352    0.260             199.723
s5378            1907     4821      8.390       0.155     54.052    0.025             333.344
s13207           2195     5735      14.980      0.260     57.648    0.047             320.997
:                :        :         :           :         :         :                 :
b22              35280    86205     17.130      1.504     11.390    0.225             75.970
Avg (30 ckts.)                                            ~47X                        ~300X
91
Conclusions
Fault simulation is accelerated using GPUs
- We implement a pattern- and fault-parallel technique
We maximize GPU performance by ensuring that:
- No data dependency exists between threads issued in parallel
- The same instructions, on different data, are executed by all threads
We adapt to specific G280 memory constraints:
- The LUT is stored in texture memory
- Global memory writes for level-i gates (and reads for level-i+1 gates) are performed in a coalesced fashion
When using a single GTX 280 GPU, we obtain a 47X speedup compared to a commercial FS engine
When projected to a 1U NVIDIA Tesla server, a 300X speedup is possible over the commercial engine
92
Introduction – SSTA
Static timing analysis (STA) is heavily used in VLSI design to estimate circuit delay
- The impact of process variations on circuit delay is increasing
- Therefore, statistical STA (SSTA) was proposed; it includes the effect of variations while estimating circuit delay
Monte Carlo (MC) based SSTA accounts for variations by:
- Generating N delay samples for each gate (random variable)
- Executing STA for each sample
- Aggregating the results to generate the full circuit delay under variations
MC based SSTA has several advantages over block-based and path-based SSTA:
- High accuracy, simplicity and compatibility with fabrication line data
- Its main disadvantage is an extremely high runtime cost
93
Approach – STA
STA at a gate:
- Over all inputs, compute the MAX of the SUM of the input arrival time for input i and the pin-to-output (P2O) rising (or falling) delay from pin i to the output
- For example, let AT_i^fall (AT_i^rise) denote the arrival time of a falling (rising) signal at node i
- For a 2-input gate with inputs a, b and output c, MAX(D_{11→00}, D_{11→01}) (resp. MAX(D_{11→00}, D_{11→10})) denotes the P2O rising delay from a to c (resp. from b to c)
- AT_c^rise = MAX[(AT_a^fall + MAX(D_{11→00}, D_{11→01})), (AT_b^fall + MAX(D_{11→00}, D_{11→10}))]
STA at a gate on the GPU:
- The P2O rising (or falling) delay from every input to the output is stored in a look-up table (LUT) in the texture memory of the GPU
- For an n-input gate: fetch the n pin-to-output rising (or falling) delays from texture memory, using the gate type offset, pin number and falling/rising delay information; then perform n SUM computations (of the pin-to-output delay and the input arrival time) and n−1 MAX computations (CUDA only supports 2-operand MAX operations)
94
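In CUDA this per-gate computation reduces to SUMs and 2-operand MAX calls; a hedged sketch for the 2-input example above (variable names are hypothetical):

__device__ float riseArrival(float at_a_fall, float at_b_fall,
                             float D11_00, float D11_01, float D11_10)
{
    float d_ac = fmaxf( D11_00, D11_01 );                    // P2O rising delay a -> c
    float d_bc = fmaxf( D11_00, D11_10 );                    // P2O rising delay b -> c
    return fmaxf( at_a_fall + d_ac, at_b_fall + d_bc );      // MAX of the SUMs
}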
Approach – SSTA
SSTA at a gate:
- We need (µ, σ) for the 2n Gaussian distributions of the pin-to-output rising and falling delay values for the n inputs
- Store (µ, σ) for every input in the LUT, as opposed to storing the nominal delay as for STA
- The Mersenne Twister (MT) pseudo-random number generator is used; the uniformly distributed random number sequences are then transformed into the normal distribution N(0,1) using the Box-Muller transformation (BM)
- Delay of a sample = µ + k · σ
- Both algorithms, the MT and BM kernels, are available with the CUDA software development kit (SDK)
For a circuit, SSTA is performed topologically from inputs to outputs:
- Delays of gates at logic depth i are computed and stored in global memory
- Gates at higher logic depths use this data as their input arrival times
95
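A minimal sketch of the per-sample delay computation (names are hypothetical; the N(0,1) values k are assumed to come from the SDK's MT and Box-Muller kernels):

__global__ void sampleDelays(const float *mu, const float *sigma,
                             const float *k, float *delay, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        delay[i] = mu[i] + k[i] * sigma[i];   // delay of a sample = mu + k * sigma
    }
}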
Experiments – SSTA
MC based SSTA runtimes on the GTX 280 are compared to a CPU-based implementation for 30 large IWLS and ITC benchmarks
- Monte Carlo analysis was performed using 64K samples for all 30 circuits
- CPU runtimes were computed on a 3.6 GHz, 3 GB RAM Intel processor running Linux
- GPU time includes the data transfer time: CPU → GPU: the arrival time at each primary input, and µ and σ for all pin-to-output delays of all gates; GPU → CPU: 64K delay values at each primary output
- GPU time also includes the time spent in the MT and BM kernels, and loading the texture memory
- The computed results have been verified for correctness
For the SLI Quad system, the runtimes are obtained by scaling the processing times only; transfer times are included as well (not scaled)
96
Results – SSTA
When using a single GTX 280 GPU, a ~818X speedup in MC based SSTA is obtained
When the SSTA runtimes are projected onto a Quad GPU system, a ~2405X speedup is possible

Circuit           GPU (s)   SLI Quad (s)   CPU (s)     Speedup (GPU)   Speedup (SLI Quad)
s9234_1           8.11      2.92           6621.16     816.64          2269.11
s35932            46.50     18.14          36174.56    778.00          1993.97
s38584            47.24     17.24          38270.72    810.19          2219.98
s13207            14.55     6.21           10633.48    731.07          1712.24
:                 :         :              :           :               :
b22_1             51.50     15.51          45909.95    891.51          2959.80
Avg. (30 ckts.)                                        ~818X           ~2405X
Conclusions
We accelerate MC based SSTA using graphics processors
We take maximal advantage of the GPU's raw computational power and huge memory bandwidth
We maximize GPU performance by ensuring that:
- No data dependency exists between threads issued in parallel
- The same instructions, on different data, are executed by all threads
We adapt to specific G280 memory constraints:
- The LUT is stored in texture memory
- Global memory writes for level-i gates (and reads for level-i+1 gates) are performed in a coalesced fashion
Summary
We discussed the GPU platform and its use in high-performance EDA applications, with case studies
We outlined the memory and processing constraints induced by the GPU architecture
We presented programming guidelines, with sample code fragments
We suggested tips to maximize the performance of GPU-based code
We discussed case studies of EDA algorithms, and pointed out how the code was architected for maximum performance
Resources
General: www.gpgpu.org
CUDA references:
- www.nvidia.com/object/cuda_home.html
- Supported platforms: Windows, MacOS, Linux
OpenCL:
- General OpenCL information: www.khronos.org
- Apple: developer.apple.com/mac/snowleopard/opencl.html
- AMD: developer.amd.com/gpu/ATIStreamSDK/pages/TutorialOpenCL.aspx
- NVIDIA: http://developer.nvidia.com/object/opencl-download.html
Thank You
Fault Table Generation
Two key steps in VLSI testing and debug:
- Fault detection: differentiates a faulty design from a fault-free design
- Fault diagnosis: identifies and isolates a fault, to analyze the defect causing the faulty behavior
Both detection and diagnosis require a precalculated fault table:
- Whether vector v_j can detect fault f_i
- Stored as a matrix [a_ij], where a_ij = 1 (0) if fault f_i is (not) detected by vector v_j
We implemented a pattern-parallel approach on the GPU:
- Simulate several patterns simultaneously
- Other parallel efforts require dynamic load balancing:
  - Algorithm parallel: partition the fault list across many processors
  - Model parallel: partition the circuit into components, each assigned to one or more processors
Approach
FSIM [Lee et al. 91] is an efficient fault simulator; FSIM+ is FSIM modified to compute a fault table
FSIM and FSIM+ both:
- Are pattern parallel and run on a single-core microprocessor
- Simulate a circuit in a forward-levelized manner
- Prune off unnecessary simulations early
Our new approach (GFTABLE) is an enhancement of FSIM+:
- The target hardware is a GPU – a SIMD machine
- Issue thousands of threads (T) in parallel: 'word_size × T' patterns (the packet width) are computed in parallel
- Hardware and software constraints are maximally satisfied: CUDA-specific constraints (memory, device utilization); only the CPU can launch a kernel or perform efficient conditional tests; minimize the (expensive) transfers between the GPU and CPU
102
Approach
(Figure: an example circuit with stuck-at-0 and stuck-at-1 faults, showing critical path tracing (CPT) and the cumulative detectability CD(k) at a stem k; p is the dominator of k, and also the immediate dominator of k.)
- Stem region (SR): all gates on any path from a stem to its immediate dominator
- Fanout-free region (FFR): a subcircuit induced by cutting off the fanout branches of each stem; such subcircuits form a partition of the original netlist
- Stems: fanout nets and primary outputs
Approach
Sensitive input:
- The only input of a gate driving the dominant logic value (DLV), or
- All inputs, when no input drives the DLV
A critical line is the line driving a sensitive input
Critical path tracing (CPT):
- Determine the paths of critical lines in FFR(k) by backtracking from the output of FFR(k) towards its inputs
(Figure: example gate with values 1, 0, 0 illustrating the sensitive input.)
Approach
Detectabilities:
(Figure: worked example on FFR(k) with inputs a, b, c. Detectabilities within the FFR: D(a,k) = 0, D(b,k) = 1, D(i,k) = 1, D(c,k) = 0, D(j,k) = 0. Fault detectabilities: FD(a s-a-0, k) = 0, FD(c s-a-0, k) = 0, FD(b s-a-1, k) = 1, with the cumulative detectability CD(k) shown. Global fault detectabilities are obtained by combining with the stem's detectability at its dominator: FD(a s-a-0, p) = FD(a s-a-0, k) D(k,p), and similarly for FD(c s-a-0, p) and FD(b s-a-1, p).)
Approach
All threads evaluate the same gate for different patterns
Sort the gates topologically from inputs to outputs
- Fault-free data for the first L gates, for all patterns, is stored in global memory
- This avoids the need to transfer this data from the CPU
Mersenne Twister (MT) pseudo random number generator Long period, efficient use of memory, good distribution properties
Approach
CD(s) is computed using CPT
- Launch T threads in parallel
- In FSIM+, gates that are not driving any critical lines are not backtracked on during CPT
- In GFTABLE, all gates are backtracked on during CPT; the test (whether gates are driving critical lines) does not help prune 99.99% of gates, due to the large packet width – T × 32 bits (T = 16K)
- The large packet width is necessary in order to take advantage of the immense parallelism on the GPU, and to reduce the overhead of kernel launches and global memory accesses
Approach
Explicit fault simulation is performed in the forward-levelized manner, from a stem s to its immediate dominator t (or a PO)
- The input is CD(s) XORed with the fault-free value at s, thereby injecting the faults which are upstream from s and observable at s
Example: CD(k) = 0010, fault-free value at k = 0000
- Input applied at k = 0010 XOR 0000 = 0010
- Fault simulation yields p = 0010; the fault-free value at p = 0000
- Therefore D(k, p) = 0010 XOR 0000 = 0010
In FSIM+, simulation of the fanout of a gate g is scheduled only if the output at g differs from its fault-free value
Approach
On the GPU, a bitwise XOR operation is performed on the T words of the current output (gate evaluation) and the fault-free data
For the test 'is the result of the XOR all zero?', we perform a hybrid of a depth-first and breadth-first approach:
- Divide the T-long array (of the XOR's output) into groups of size Q (256)
- Compute the reduced OR of the data in each group into a single word, which is transferred to the CPU
- Avoid bank conflicts and divergent executions; minimize global memory access latencies; employ loop unrolling in the reduction code
- At the first non-zero value found on the CPU, return false
- Perform this test after simulating G (20) gates
All conditional tests, on CD(s) or D(s,t), are performed in a similar manner
(Figure: an example with Q = 3 – each group of three values is ORed into a single value, here 0, 1, 1, and these reduced values are returned to the CPU.)
Approach
For all faults in FFR(s) of the current stem s:
- If fault f_i is detectable at the stem s, and
- If stem s is detectable at a primary output, then
- Fault f_i is (globally) detectable, and the i-th row of the fault table is updated accordingly
In GFTABLE:
- Detectabilities are computed on the GPU
- Fault detectabilities are immediately transferred to the CPU ('word_size × T' bits transferred)
- The entire fault table is never stored on the GPU
Results

Circuit         FSIM+ (s)   GFTABLE (s)   GFTABLE vs FSIM+   GFTABLE-TESLA (s)   GFTABLE-TESLA vs FSIM+
b14             1502.47     100.87        14.90X             17.65               85.12X
b20             4992.73     319.82        15.61X             55.97               89.21X
:               :           :             :                  :                   :
b22             6319.47     399.34        15.82X             69.88               90.43X
Avg (20 ckts)                             15.68X                                 89.57X
112
GFTABLE was implemented on an NVIDIA Quadro FX 5800
- T = 16K, word_size = 32, L = 32K, Q = 256, G = 20
- To use the global memory effectively, the FAULT_LIST is partitioned into subsets of 1K faults, and GFTABLE is executed iteratively; this allows GFTABLE to operate on circuits with an arbitrary number of gates and faults
FSIM+ was run on a 32-bit, 3.6 GHz Intel CPU with 3 GB RAM, running Linux
The projected runtime on an NVIDIA Tesla system (8 GPUs) is 90x faster
Summary
A fault table is required by fault detection and diagnosis, and its compute time is very high
Fault table generation is accelerated using GPUs:
- All data-parallel computations are performed on the GPU; all conditional statements are evaluated on the CPU
- A tree-based reduced OR operation is performed on the GPU before transferring test data to the CPU
- The entire fault table is never stored on the GPU's memory
To handle larger circuits:
- Global memory stores a subset of the fault-free data
- The fault list is partitioned
Experimental results show a 15x speedup over FSIM+ using a single Quadro FX 5800
- The potential speedup over FSIM+ is 90x when using a Tesla GPU system
113