1) Leverage the raw computational power of the GPU
   - Order-of-magnitude performance gains possible
2) Leverage the maturation of GPU hardware and software
   - 1995 - 2000: HW: dedicated fixed 3D accelerators / SW: assembly code
   - 2000 - 2005: HW: programmable graphics pipeline (shaders) / SW: shader programming languages (Cg/HLSL)
   - 2006 -     : HW: general computing (nVidia G80) / SW: general programming languages (CUDA)
---
Nanoscale Molecular Dynamics (NAMD), University of Illinois at Urbana-Champaign: tools for simulating and visualizing biomolecular processes. Yielded 3.5x - 8x performance gains on the GPU.
Develop a high-performance library of core computational methods using the GPU
- Library level: BLAS (Basic Linear Algebra Subprograms), numerical methods, image-processing kernels
- Application level: port LONI algorithms
G80 chipset (nVidia 8800 GTX): 680 million transistors (Intel Core 2: 290 million)
- 128 micro-processors: 16 multiprocessor units @ 1.3 GHz, 8 processors per multiprocessor unit
- Device memory: 768 MB
- High-performance parallel architecture: on-chip shared memory (16 KB), texture cache (8 KB), constant memory (64 KB) with cache (8 KB)
- Compatible with all cards with the CUDA driver (Linux / Windows): mobile (GeForce 8M), desktop (GeForce 8), server (Quadro)
- Scalable to multiple GPUs: nVidia SLI, workstation cluster (nVidia Tesla: 1.5 GB dedicated memory, 2 or 4 G80 GPUs, i.e. 256 or 512 micro-processors)
- Attractive cost-to-performance ratio: nVidia 8800 GTX: $550; nVidia Tesla: $7,500 - $12,000
nVidia CUDA is a 1st-generation platform
- Not all algorithms scale well to the GPU
- Host-memory-to-device-memory transfer bottleneck
- Single-precision floating point only
- Cross-GPU development currently not available
Task                                                  Time
a) Identify computational methods to implement;
b) evaluate whether each scales to the GPU            2 - 4 weeks
Experimentation / implementation                      3 - 4 months
Develop prototype                                     Feb 2008
Basic definitions
- BLOCK = conceptual computational node
  - Max number of blocks = 65,535
  - Optimal if the number of blocks is a multiple of the number of multiprocessors (16)
- Each BLOCK runs a number of threads
  - Max threads per block = 512
  - Optimal if the number of threads is a multiple of the warp size (32)
Pivot-divide for 3D volume data
- Matrix pivot-divide applied to each slice independently
- Each slice mapped to a "block" (NUMBLOCKS = N)
- Each thread in a block handles one row in the slice (NUMTHREADS = N)
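The slice-to-block mapping above can be sketched as a kernel. This is a minimal illustration, not the actual LONI implementation; the kernel name, the in-place row normalization, and the flat volume layout are all assumptions.

```cuda
// Hypothetical sketch of the mapping: one block per N x N slice,
// one thread per row (NUMBLOCKS = N, NUMTHREADS = N).
__global__ void pivotDivideSlice(float *volume, int N)
{
    int slice = blockIdx.x;            // one block per slice
    int row   = threadIdx.x;           // one thread per row
    float *m  = volume + slice * N * N + row * N;

    float pivot = m[row];              // diagonal element of this row
    if (pivot != 0.0f) {
        for (int col = 0; col < N; ++col)
            m[col] /= pivot;           // divide the row by its pivot
    }
}

// Host-side launch: N blocks of N threads.
// pivotDivideSlice<<<N, N>>>(d_volume, N);
```

Because no thread touches another slice's rows, no inter-block synchronization is needed, which is exactly why the scheme scales well to the GPU.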
As long as there is no synchronization among slices, this scales well to the GPU; concurrent reads of other slices should be possible.
Host-to-device latency: 1 GB/s measured (2 GB/s reported) - PCIe settings?
Needs investigating:
- NUMBLOCKS vs. multiprocessor count?
- Fine-tune the number of slices per block? The CUDA scheduler seems to handle it well when NUMBLOCKS = N
- Scaling issues: N > NUMTHREADS? Will we ever hit the block limit?
t(total) = t(mem) + t(compute)
- GPU: t(mem) = host-to-device transfer time; t(compute) = kernel time
- CPU: t(mem) = memcpy() time; t(compute) = loop time
Parameters: for N = 16…256, BLOCKS = 256; for N = 272…512, BLOCKS = 512
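The GPU side of this timing model can be measured with CUDA events, recording one event between the transfer and the kernel launch. A rough sketch, where myKernel, d_data, h_data, bytes, numBlocks, and numThreads are placeholders:

```cuda
// Measure t(mem) and t(compute) separately with CUDA events.
cudaEvent_t start, mid, stop;
cudaEventCreate(&start); cudaEventCreate(&mid); cudaEventCreate(&stop);

cudaEventRecord(start, 0);
cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);  // t(mem)
cudaEventRecord(mid, 0);
myKernel<<<numBlocks, numThreads>>>(d_data, N);             // t(compute)
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);                                 // wait for kernel

float tMem, tCompute;                                       // milliseconds
cudaEventElapsedTime(&tMem, start, mid);
cudaEventElapsedTime(&tCompute, mid, stop);
```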
Host-to-device memory bottleneck: pageable vs. pinned memory allocation (2x faster with pinned)
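The difference is in how the host buffer is allocated; a pinned (page-locked) buffer can be DMA-transferred directly, which is where the roughly 2x speedup comes from. A minimal sketch (buffer names and sizes are illustrative):

```cuda
// Pageable vs. pinned host allocation for the same transfer.
float *h_pageable = (float *)malloc(bytes);   // pageable: copy is staged
float *h_pinned;
cudaMallocHost((void **)&h_pinned, bytes);    // pinned: direct DMA, faster

// Same call, ~2x bandwidth with the pinned buffer in our measurements.
cudaMemcpy(d_data, h_pinned, bytes, cudaMemcpyHostToDevice);

cudaFreeHost(h_pinned);
free(h_pageable);
```

Pinned memory is a finite resource; over-allocating it can degrade overall system performance.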
Single Instruction Multiple Data (SIMD) model
- Less synchronization, higher performance (v1.0 - no synchronization among blocks)
High arithmetic intensity
- Arithmetic intensity = arithmetic ops / memory ops
- Computations can overlap with memory operations
Memory operations have the highest latency
- Shared memory: as fast as a register with no bank conflicts; limited to 16 KB
- Texture memory: cached from device memory; optimized for 2D spatial locality; built-in filtering/interpolation methods; reads packed data in one operation (e.g. RGBA)
- Constant memory: cached from device memory; as fast as a register if all threads read the same address
- Device memory: uncached, very slow; faster if accesses are byte-aligned and coalesced into a single contiguous access
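The coalescing point is worth illustrating: on this hardware, a half-warp's loads merge into one contiguous device-memory transaction only when consecutive threads touch consecutive addresses. A sketch with two hypothetical copy kernels:

```cuda
// Coalesced: thread i reads element i, so adjacent threads read
// adjacent floats and the half-warp's loads merge into one transaction.
__global__ void copyCoalesced(const float *in, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];                   // stride-1 across threads
}

// Uncoalesced: with stride > 1 each thread's load becomes a separate
// transaction against uncached device memory, which is far slower.
__global__ void copyStrided(const float *in, float *out, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    out[i] = in[i];
}
```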
Arithmetic operations
- 4 clock cycles for float (+, *, *+) and int (+)
- 16 clock cycles for 32-bit int multiply (4 cycles for 24-bit)
- 36 clock cycles for float division (int division and modulo are very costly)
- v1.0 - floats only (doubles are converted to floats)
Atomic operations (v1.1 only) provide locking mechanisms
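As a sketch of what v1.1 atomics buy: atomicAdd serializes conflicting updates to the same address, so many threads can safely accumulate into shared counters without an explicit lock. The histogram kernel below is an illustrative example, not from the project:

```cuda
// v1.1 integer atomics on device memory: concurrent increments to the
// same bin are serialized by the hardware, so no locking code is needed.
__global__ void histogram(const int *data, int *bins, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&bins[data[i]], 1);   // safe concurrent increment
}
```

Note that contended atomics serialize, so heavy contention on a few addresses costs performance.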
Minimize host-to-device memory transfers
Minimize device memory access
- Optimize with byte alignment and coalescing
Minimize execution divergence
- Minimize branching in the kernel; unroll loops
Make heavy use of shared memory
- Data must be striped correctly to avoid bank conflicts
For image-processing tasks, texture memory may be more efficient
# threads per block = multiple of the warp size (32); # blocks = ?
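One common way to resolve the open question is to fix the threads per block at a multiple of the warp size and derive the block count from the problem size. A sketch (128 is an illustrative choice, and myKernel is a placeholder):

```cuda
// Pick a warp-size multiple for the block, then cover N elements.
int numThreads = 128;                                // multiple of 32
int numBlocks  = (N + numThreads - 1) / numThreads;  // ceil(N / numThreads)

// myKernel<<<numBlocks, numThreads>>>(d_data, N);
// Each kernel should still bounds-check: if (i < N) ...
```

Per the slide on basic definitions, numBlocks ideally also comes out a multiple of the multiprocessor count (16) so all multiprocessors stay busy.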