1) Leverage the raw computational power of the GPU
   - Order-of-magnitude performance gains possible
2) Leverage the maturation of GPU hardware and software
   - 1995 - 2000: HW: dedicated fixed 3D accelerators / SW: assembly code
   - 2000 - 2005: HW: programmable graphics pipeline (shaders) / SW: shader programming languages (Cg/HLSL)
   - 2006 -     : HW: general computing (nVidia G80) / SW: general programming languages (CUDA)
---
Nanoscale Molecular Dynamics (NAMD), University of Illinois at Urbana-Champaign: tools for simulating and visualizing biomolecular processes. Yielded 3.5x - 8x performance gains on the GPU.
Develop a high-performance library of core computational methods using the GPU
- Library level: BLAS (Basic Linear Algebra Subprograms), numerical methods, image-processing kernels
- Application level: port LONI algorithms
G80 chipset (nVidia 8800 GTX): 680 million transistors (Intel Core 2: 290 million)
- 128 micro-processors: 16 multiprocessor units @ 1.3 GHz, 8 processors per multiprocessor unit
- Device memory: 768 MB
- High-performance parallel architecture: on-chip shared memory (16 KB), texture cache (8 KB), constant memory (64 KB) with cache (8 KB)
- Compatible with all cards with the CUDA driver (Linux / Windows): mobile (GeForce 8M), desktop (GeForce 8), server (Quadro)
- Scalable to multiple GPUs: nVidia SLI, workstation cluster (nVidia Tesla: 1.5 GB dedicated memory, 2 or 4 G80 GPUs, i.e. 256 or 512 micro-processors)
- Attractive cost-to-performance ratio: nVidia 8800 GTX: $550; nVidia Tesla: $7,500 - $12,000
nVidia CUDA is a 1st-generation platform
- Not all algorithms scale well to the GPU
- Host-memory-to-device-memory transfer bottleneck
- Single-precision floating point only
- Cross-GPU development currently not available
Task                                                  Time
a) Identify computational methods to implement;
b) evaluate whether each scales to the GPU            2 - 4 weeks
Experimentation / implementation                      3 - 4 months
Develop prototype                                     Feb 2008
Basic definitions
- BLOCK = conceptual computational node
  - Max number of blocks = 65,535
  - Optimal if the number of blocks is a multiple of the number of multiprocessors (16)
- Each BLOCK runs a number of threads
  - Max threads per block = 512
  - Optimal if the number of threads is a multiple of the warp size (32)
Pivot-divide for 3D volume data
- Matrix pivot-divide applied to each slice independently
- Each slice mapped to a "block" (NUMBLOCKS = N)
- Each thread in a block handles one row in the slice (NUMTHREADS = N)
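The slice-to-block mapping above can be sketched as a kernel. This is a minimal illustration, not the actual LONI implementation; the kernel name, the in-place row normalization, and the flat volume layout are all assumptions.

```cuda
// Hypothetical sketch of the mapping: one block per N x N slice,
// one thread per row (NUMBLOCKS = N, NUMTHREADS = N).
__global__ void pivotDivideSlice(float *volume, int N)
{
    int slice = blockIdx.x;            // one block per slice
    int row   = threadIdx.x;           // one thread per row
    float *m  = volume + slice * N * N + row * N;

    float pivot = m[row];              // diagonal element of this row
    if (pivot != 0.0f) {
        for (int col = 0; col < N; ++col)
            m[col] /= pivot;           // divide the row by its pivot
    }
}

// Host-side launch: N blocks of N threads.
// pivotDivideSlice<<<N, N>>>(d_volume, N);
```

Because no thread touches another slice's rows, no inter-block synchronization is needed, which is exactly why the scheme scales well to the GPU.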
As long as there is no synchronization among slices, this scales well to the GPU; concurrent reads of other slices should be possible.
Host-to-device latency: 1 GB/s measured (2 GB/s reported) - PCIe settings?
Needs investigating:
- NUMBLOCKS vs. multiprocessor count?
- Fine-tune the number of slices per block? The CUDA scheduler seems to handle it well when NUMBLOCKS = N
- Scaling issues: N > NUMTHREADS? Will we ever hit the block limit?
t(total) = t(mem) + t(compute)
- GPU: t(mem) = host-to-device transfer time; t(compute) = kernel time
- CPU: t(mem) = memcpy() time; t(compute) = loop time
Parameters: for N = 16…256, BLOCKS = 256; for N = 272…512, BLOCKS = 512
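The GPU side of this timing model can be measured with CUDA events, recording one event between the transfer and the kernel launch. A rough sketch, where myKernel, d_data, h_data, bytes, numBlocks, and numThreads are placeholders:

```cuda
// Measure t(mem) and t(compute) separately with CUDA events.
cudaEvent_t start, mid, stop;
cudaEventCreate(&start); cudaEventCreate(&mid); cudaEventCreate(&stop);

cudaEventRecord(start, 0);
cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);  // t(mem)
cudaEventRecord(mid, 0);
myKernel<<<numBlocks, numThreads>>>(d_data, N);             // t(compute)
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);                                 // wait for kernel

float tMem, tCompute;                                       // milliseconds
cudaEventElapsedTime(&tMem, start, mid);
cudaEventElapsedTime(&tCompute, mid, stop);
```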
Host-to-device memory bottleneck: pageable vs. pinned memory allocation (2x faster with pinned)
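The difference is in how the host buffer is allocated; a pinned (page-locked) buffer can be DMA-transferred directly, which is where the roughly 2x speedup comes from. A minimal sketch (buffer names and sizes are illustrative):

```cuda
// Pageable vs. pinned host allocation for the same transfer.
float *h_pageable = (float *)malloc(bytes);   // pageable: copy is staged
float *h_pinned;
cudaMallocHost((void **)&h_pinned, bytes);    // pinned: direct DMA, faster

// Same call, ~2x bandwidth with the pinned buffer in our measurements.
cudaMemcpy(d_data, h_pinned, bytes, cudaMemcpyHostToDevice);

cudaFreeHost(h_pinned);
free(h_pageable);
```

Pinned memory is a finite resource; over-allocating it can degrade overall system performance.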
Single Instruction Multiple Data (SIMD) model
- Less synchronization, higher performance (v1.0 - no synchronization among blocks)
High arithmetic intensity
- Arithmetic intensity = arithmetic ops / memory ops
- Computations can overlap with memory operations
Memory operations have the highest latency
- Shared memory: as fast as a register with no bank conflicts; limited to 16 KB
- Texture memory: cached from device memory; optimized for 2D spatial locality; built-in filtering/interpolation methods; reads packed data in one operation (e.g. RGBA)
- Constant memory: cached from device memory; as fast as a register if all threads read the same address
- Device memory: uncached, very slow; faster if accesses are byte-aligned and coalesced into a single contiguous access
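The coalescing point is worth illustrating: on this hardware, a half-warp's loads merge into one contiguous device-memory transaction only when consecutive threads touch consecutive addresses. A sketch with two hypothetical copy kernels:

```cuda
// Coalesced: thread i reads element i, so adjacent threads read
// adjacent floats and the half-warp's loads merge into one transaction.
__global__ void copyCoalesced(const float *in, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];                   // stride-1 across threads
}

// Uncoalesced: with stride > 1 each thread's load becomes a separate
// transaction against uncached device memory, which is far slower.
__global__ void copyStrided(const float *in, float *out, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    out[i] = in[i];
}
```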
Arithmetic operations
- 4 clock cycles for float (+, *, *+) and int (+)
- 16 clock cycles for 32-bit int multiply (4 cycles for 24-bit)
- 36 clock cycles for float division (int division and modulo are very costly)
- v1.0 - floats only (doubles are converted to floats)
Atomic operations (v1.1 only) provide locking mechanisms
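As a sketch of what v1.1 atomics buy: atomicAdd serializes conflicting updates to the same address, so many threads can safely accumulate into shared counters without an explicit lock. The histogram kernel below is an illustrative example, not from the project:

```cuda
// v1.1 integer atomics on device memory: concurrent increments to the
// same bin are serialized by the hardware, so no locking code is needed.
__global__ void histogram(const int *data, int *bins, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&bins[data[i]], 1);   // safe concurrent increment
}
```

Note that contended atomics serialize, so heavy contention on a few addresses costs performance.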
Minimize host-to-device memory transfers
Minimize device memory access
- Optimize with byte alignment and coalescing
Minimize execution divergence
- Minimize branching in the kernel; unroll loops
Make heavy use of shared memory
- Data must be striped correctly to avoid bank conflicts
For image-processing tasks, texture memory may be more efficient
# threads per block = multiple of the warp size (32); # blocks = ?
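One common way to resolve the open question is to fix the threads per block at a multiple of the warp size and derive the block count from the problem size. A sketch (128 is an illustrative choice, and myKernel is a placeholder):

```cuda
// Pick a warp-size multiple for the block, then cover N elements.
int numThreads = 128;                                // multiple of 32
int numBlocks  = (N + numThreads - 1) / numThreads;  // ceil(N / numThreads)

// myKernel<<<numBlocks, numThreads>>>(d_data, N);
// Each kernel should still bounds-check: if (i < N) ...
```

Per the slide on basic definitions, numBlocks ideally also comes out a multiple of the multiprocessor count (16) so all multiprocessors stay busy.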