CUDA 5.0
By Peter Holvenstot, CS6260
CUDA 5.0
Latest iteration of CUDA toolkit
Major new features require Compute Capability 3.0+
Compatible Kepler cards are being installed at WMU
Major New Features
GPUDirect: allows direct memory access
GPU object linking: libraries for GPU code
Dynamic parallelism: kernels inside kernels
GPUDirect
Allows direct memory access (DMA) over the PCIe bus
Third-party device access is now supported
Requires the use of pinned memory
DMAs can be chained across the network
GPUDirect
Pinned Memory
malloc() - unpinned; can be paged out
cudaHostAlloc() - pinned
Cannot be paged out
Takes longer to allocate, but enables features that require DMA and increases copy performance
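A minimal host-side sketch of the difference, assuming a CUDA-capable device (error checking omitted; buffer size is arbitrary):

```cuda
#include <cuda_runtime.h>
#include <stdlib.h>

int main(void) {
    const size_t N = 1 << 20;

    // Pageable allocation: the OS may swap these pages out,
    // so the driver must stage copies through an internal pinned buffer.
    float *pageable = (float *)malloc(N * sizeof(float));

    // Page-locked (pinned) allocation: slower to allocate,
    // but the driver can DMA from it directly.
    float *pinned;
    cudaHostAlloc((void **)&pinned, N * sizeof(float), cudaHostAllocDefault);

    float *dev;
    cudaMalloc((void **)&dev, N * sizeof(float));

    // Both copies are legal; the pinned source typically transfers faster.
    cudaMemcpy(dev, pageable, N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dev, pinned,   N * sizeof(float), cudaMemcpyHostToDevice);

    cudaFree(dev);
    cudaFreeHost(pinned);   // pinned memory is freed with cudaFreeHost, not free()
    free(pageable);
    return 0;
}
```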
Kernel Linking
Device code now supports compilation to object (.obj) files
Allows compiling into/against static libraries
Allows closed-source distribution of libraries
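A sketch of the separate-compilation workflow this enables (file names are hypothetical; -dc and -lib are the relevant nvcc flags):

```shell
# Compile device code to relocatable object files (-dc = device compile)
nvcc -arch=sm_30 -dc kernels.cu -o kernels.o
nvcc -arch=sm_30 -dc main.cu -o main.o

# nvcc links device code across object files at the final step
nvcc -arch=sm_30 kernels.o main.o -o app

# Device objects can also be archived into a static library
# and distributed without source
nvcc -lib kernels.o -o libkernels.a
nvcc -arch=sm_30 main.o libkernels.a -o app2
```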
Dynamic Parallelism
CUDA 4.1: __device__ functions may make recursive calls
However, __global__ functions (kernels) cannot launch other kernels
CUDA 5: kernels may launch additional kernels from the GPU
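A minimal sketch of a kernel launching a kernel. Dynamic parallelism requires Compute Capability 3.5 and compilation with relocatable device code (e.g. nvcc -arch=sm_35 -rdc=true); the kernel names here are illustrative:

```cuda
#include <cstdio>

__global__ void child(int depth) {
    printf("child at depth %d, thread %d\n", depth, threadIdx.x);
}

__global__ void parent(void) {
    // New in CUDA 5: a kernel launching another kernel from the device.
    if (threadIdx.x == 0) {
        child<<<1, 4>>>(1);
        cudaDeviceSynchronize();  // thread 0 waits for the child to finish
    }
}

int main(void) {
    parent<<<1, 32>>>();
    cudaDeviceSynchronize();      // host waits for the whole launch tree
    return 0;
}
```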
Dynamic Parallelism
Most important feature in release
Reduces need for synchronization
Allows program flow to be controlled by GPU
Allows recursion and subdivision of problems
Dynamic Parallelism
CPU code can now become a kernel
Kernel calls can be used as tasks
GPU controls kernel launch/flow/scheduling
Increases practical thread count to thousands
Dynamic Parallelism
Interesting data is not uniformly distributed
Dynamic parallelism can launch additional threads in interesting areas
Allows higher resolution in critical areas without slowing down others
Source: NVIDIA
Dynamic Parallelism
Nested dependencies (figure omitted; source: NVIDIA)
Dynamic Parallelism
Scheduling can be controlled by streams
No new concurrency guarantees
Launches into the same stream are serialized; sibling launches may otherwise execute in any order
Separate named streams allow (but do not guarantee) concurrency
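A sketch of launching children into a device-side named stream so they may overlap (concurrency is permitted, not guaranteed; device-side streams must be created with cudaStreamNonBlocking):

```cuda
#include <cstdio>

__global__ void child(int id) {
    printf("child %d running\n", id);
}

__global__ void parent(void) {
    if (threadIdx.x == 0) {
        cudaStream_t s;
        // Only non-blocking streams can be created in device code
        cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking);

        child<<<1, 32>>>(0);        // default stream of this block
        child<<<1, 32, 0, s>>>(1);  // named stream: may overlap with the first

        cudaStreamDestroy(s);       // resources released once work completes
    }
}
```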
Dynamic Parallelism
Nested dependencies - cudaDeviceSynchronize()
Can be used inside a kernel
Synchronizes on all kernels launched by any thread in the block
Does NOT imply __syncthreads()!
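A sketch of that distinction: device-side cudaDeviceSynchronize() waits on the block's child kernels, but a __syncthreads() is still needed before other threads read the results (buffer and sizes are illustrative):

```cuda
__global__ void child(int *out) {
    out[threadIdx.x] = threadIdx.x;
}

__global__ void parent(int *out) {  // 'out' points to global memory
    if (threadIdx.x == 0) {
        child<<<1, 32>>>(out);
        cudaDeviceSynchronize();    // waits for the child kernel only --
                                    // does NOT synchronize this block's threads
    }
    __syncthreads();                // required before other threads read 'out'
    int v = out[threadIdx.x];       // now safe for every thread
    (void)v;
}
```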
Dynamic Parallelism
A kernel launch implies a memory sync operation
Child sees the parent's memory state at time of launch
Parent sees the child's writes after synchronizing
Local and shared memory are private to the parent and cannot be passed to children
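A sketch of those visibility rules: pass global memory to a child, never local or shared memory (names are illustrative):

```cuda
__global__ void child(int *data) {
    data[0] += 1;                      // child sees the parent's write (42)
}

__global__ void parent(int *global_buf) {
    __shared__ int tile[32];
    if (threadIdx.x == 0) {
        global_buf[0] = 42;            // visible to the child at launch time
        child<<<1, 1>>>(global_buf);   // OK: global memory pointer
        // child<<<1, 1>>>(tile);      // ILLEGAL: shared memory is private
        cudaDeviceSynchronize();       // after sync, parent sees the child's
                                       // update (global_buf[0] == 43)
    }
}
```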
Questions?
Sources
http://docs.nvidia.com/cuda/pdf/CUDA_Dynamic_Parallelism_Programming_Guide.pdf
http://docs.nvidia.com/cuda/gpudirect-rdma/index.html
http://developer.download.nvidia.com/GTC/PDF/GTC2012/PresentationPDF/S0338-GTC2012-CUDA-Programming-Model.pdf
https://developer.nvidia.com/content/trenches-gtc-cuda-5-and-beyond