CUDA
Tom Guerin
Agenda
● Why use GPUs/GPGPU?
○ Graphics Processing Pipeline
○ GPU Architecture
● What is CUDA?
● CUDA Programming Model (CUDA C)
● Practical Applications
● Performance Benchmarks
● CUDA-Aware MPI
● CUDA 8 Key Features
● Alternative: OpenCL
● Conclusions
● http://www.geforce.com/whats-new/articles/geforce-gtx-1080
Why use GPUs/GPGPU? [1][2][5][7]
● CPUs are best suited for sequential computing and coarse-grained parallel computing
● CPUs contain a small number of complex ALUs which support many operations
● CPUs have relatively large caches
● CPUs: low latency, low throughput
● GPUs are best suited for fine-grained parallel computing (SIMD)
○ Obvious example: image processing
● GPUs: high latency, high throughput
Graphics Processing Pipeline [4]
● Pipeline Entities
● Graphics Pipeline
○ All pipeline entities are processed independently!
Graphics Processing Pipeline [4]
● Since entities are independent, there can be a "shader core" dedicated to each
○ Also known as a streaming processor, shading unit, or CUDA core (in the case of NVIDIA)
● Raster is dependent on shaded fragments, so it is a separate module
GPU Architecture [4]
● Make shader cores small and fast by providing a limited number of ops
○ 2006: GeForce 8800: 128 cores at 575 MHz
○ 2017: GeForce GTX 1080 Ti: 3584 cores at 1582 MHz
● Single Instruction, Multiple Data (SIMD)
○ Shared instruction stream
○ Many ALUs
GPU Architecture [1]
Nvidia GPU Architecture [5]
● Global memory
● Streaming Multiprocessors (SMs)
○ Set of CUDA cores
○ Two schedulers (supporting thousands of concurrent threads)
○ Shared memory
○ L1 cache
○ Thousands of registers
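These per-SM resources can be inspected at runtime with cudaGetDeviceProperties. A minimal sketch (not from the original deck):

```
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;                // properties of one CUDA device
    cudaGetDeviceProperties(&prop, 0);  // query device 0
    printf("SMs:                   %d\n", prop.multiProcessorCount);
    printf("Shared memory / block: %zu bytes\n", prop.sharedMemPerBlock);
    printf("Registers / block:     %d\n", prop.regsPerBlock);
    printf("Max threads / block:   %d\n", prop.maxThreadsPerBlock);
    return 0;
}
```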
What is CUDA?
● Compute Unified Device Architecture
● Leverages parallelism provided in GPU architecture
CUDA Processing Flow [5]
● Data Input:
○ CPU sends input (problem) data to GPU memory
○ GPU runs programs (instructions) as they are sent over the PCIe bus from the CPU
● Computation is performed on GPU cores
● Data Output:
○ Results are stored in GPU memory
○ Results are copied from GPU memory to CPU memory (a host-code sketch follows below)
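The flow above maps onto a handful of CUDA runtime calls. A minimal host-side sketch, assuming a hypothetical kernel named process that doubles each element:

```
#include <cuda_runtime.h>

// Hypothetical kernel: doubles each element. Any __global__ function works here.
__global__ void process(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    float *h_data = new float[n];               // host (CPU) buffer
    for (int i = 0; i < n; ++i) h_data[i] = i;  // problem data
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));     // allocate GPU memory

    // 1. Input: copy problem data from CPU memory to GPU memory over PCIe
    cudaMemcpy(d_data, h_data, n * sizeof(float), cudaMemcpyHostToDevice);

    // 2. Compute: run the program on the GPU cores
    process<<<(n + 255) / 256, 256>>>(d_data, n);

    // 3. Output: results land in GPU memory, then are copied back to the CPU
    cudaMemcpy(h_data, d_data, n * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d_data);
    delete[] h_data;
    return 0;
}
```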
Anatomy of a CUDA C/C++ Application [5][7]
● Two parts:
○ Sequential code run on the host CPU
○ Parallel code run on the GPU
● Compilation workflow:
○ One program is written
○ The program is split into CUDA parts and serial parts
○ CUDA parts are compiled by NVCC for the GPU
○ Serial parts are compiled for the CPU
CUDA C/C++ [5][7]
Example: SAXPY: Single-Precision A*X Plus Y (vector operations)
● CUDA kernel: a function called by the host (CPU) that is executed on the GPU
○ Can only access GPU memory
○ Fixed number of arguments
○ No static variables
● Qualifiers:
○ __global__: kernel returning void
○ __device__: can be called from GPU functions
○ __host__: can be called from CPU functions
○ __host__ and __device__ can both be used on the same function
● The GPU executes kernels while the CPU executes functions (see the sketch below)
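A sketch of the SAXPY kernel named above, showing the qualifiers in use (illustrative code, not taken from the deck):

```
// saxpy.cu (compile with: nvcc saxpy.cu -o saxpy)

// __global__ marks a kernel: launched from the host, runs on the GPU,
// must return void, and can only access GPU memory.
__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per element
    if (i < n)
        y[i] = a * x[i] + y[i];                     // single-precision a*x + y
}

// __host__ and __device__ together compile one function for both sides.
__host__ __device__ float square(float v) { return v * v; }

// Host-side launch (d_x and d_y must already be in GPU memory):
//   saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, d_x, d_y);
```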
CUDA Kernels: Threads, Blocks, and Grids [5][7]
● Threads are grouped into one or more blocks
● Blocks are grouped into one or more grids
● A kernel is executed as a grid of blocks and threads (launch sketch below)
● Each block is mapped to an SM
● A kernel can only execute on one device at a time
● Multiple kernels can execute on a device
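As an illustration of launch configuration (the kernel name and sizes here are assumptions), a 2D image kernel can be launched as a grid of 2D blocks sized to cover every pixel:

```
#include <cuda_runtime.h>

// Hypothetical 2D kernel: each thread inverts one pixel.
__global__ void invert(unsigned char *img, int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        img[y * width + x] = 255 - img[y * width + x];
}

void launchInvert(unsigned char *d_img, int width, int height) {
    dim3 block(16, 16);                          // 256 threads per block
    dim3 grid((width  + block.x - 1) / block.x,  // enough blocks to cover
              (height + block.y - 1) / block.y); // every pixel
    invert<<<grid, block>>>(d_img, width, height);
}
```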
Advantages of Thread Blocks [5][7]
● Cooperation via shared memory (SAS: shared address space)
○ Recall that each SM has some memory and some registers shared between cores
○ May be needed for synchronization
○ Declared with the __shared__ qualifier (see the reduction sketch below)
○ Low latency, high bandwidth
○ Programmer can specify cache use
● Scalability
○ Blocks can execute in any order, concurrently or sequentially, depending on the resources available
● Memory hierarchy
○ Threads use registers and local memory
○ Blocks of threads have shared memory
○ Grids have global memory
■ Accessible by all threads
■ High latency, low bandwidth
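A minimal sketch of block-level cooperation, assuming 256 threads per block: a tree reduction staged in __shared__ memory and synchronized with __syncthreads():

```
#include <cuda_runtime.h>

// Sums 256 elements per block; writes one partial sum per block.
// Launch with 256 threads per block to match the tile[] size.
__global__ void blockSum(const float *in, float *out, int n) {
    __shared__ float tile[256];          // lives in the SM's shared memory
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    tile[tid] = (i < n) ? in[i] : 0.0f;  // stage from slow global memory
    __syncthreads();                     // wait for the whole block

    // Tree reduction entirely in low-latency shared memory.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) tile[tid] += tile[tid + s];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = tile[0];
}
```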
Practical Applications [6][7]
● Video transcoding
● Video enhancement
● Oil and natural resource exploration
● Medical imaging
● Data science / neural networks / graphs
○ Heavy reliance on matrix multiplication
● Euclidean distance calculation (see the kernel sketch below)
○ This is done repeatedly in many algorithms
○ Each axis is independent
○ Recall: gravitational problems, k-means clustering
● VLSI simulation
● Fluid dynamics
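To make the Euclidean distance case concrete, here is a hypothetical kernel (not from the deck) computing each point's distance to one centroid, as in a k-means step; one thread per point, since the points are independent:

```
#include <math.h>

// Distance from each 3D point to a single centroid (cx, cy, cz).
__global__ void distToCentroid(const float *px, const float *py, const float *pz,
                               float cx, float cy, float cz,
                               float *dist, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float dx = px[i] - cx, dy = py[i] - cy, dz = pz[i] - cz;
        dist[i] = sqrtf(dx * dx + dy * dy + dz * dz);  // hardware sqrt unit
    }
}
```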
Performance Benchmarks [8][9]
● Square root
○ A common operation for large simulations
○ Ex: Euclidean distance in the n-body problem
○ GPUs have specialized hardware on each ALU for this purpose, and hundreds of them
○ GTX 280 vs. Core 2 Duo
● Matrix multiplication (a naive kernel is sketched below)
○ GeForce 210 (16 cores)
○ CPU faster up to a 25x25 matrix
○ GPU much faster for large matrices
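For reference, the kind of kernel such benchmarks exercise looks like the naive sketch below (illustrative only, not the benchmarked code):

```
// Naive dense multiply C = A * B for n x n matrices; one thread per element of C.
__global__ void matmul(const float *A, const float *B, float *C, int n) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n && col < n) {
        float sum = 0.0f;
        for (int k = 0; k < n; ++k)
            sum += A[row * n + k] * B[k * n + col];  // dot of row and column
        C[row * n + col] = sum;
    }
}
```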
Performance Benchmarks [10][11]
● GPU-accelerated MATLAB
○ Built-in CUDA support for matrix multiplication and other operations
● Ray tracing
○ GTX 1080 (2560 CUDA cores)
○ 4-core / 8-thread CPU (presumably a recent Intel Core i7)
CUDA-Aware MPI [12]
● Needed when a problem is divided among two or more GPUs and/or two or more hosts
○ For instance, when one GPU does not have enough memory
● Recall: heterogeneous clusters
● Some implementations:
○ OpenMPI
○ CRAY
○ IBM Platform MPI
○ SGI MPI
● A CUDA-aware MPI accepts device pointers directly in the standard MPI_Send/MPI_Recv calls; without it, data must be staged through host memory with cudaMemcpy (see the sketch below)
● Unified Virtual Addressing (UVA)
○ Introduced in CUDA 4.0
○ Abstracts away the physical location of GPU memory
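A minimal sketch of the pattern [12] describes: with UVA, a device pointer from cudaMalloc can be passed straight to the standard MPI calls:

```
#include <mpi.h>

// d_buf is GPU memory (from cudaMalloc). A CUDA-aware MPI detects this
// via UVA and handles the transfer itself; a plain MPI would need a
// cudaMemcpy to a host buffer first.
void exchange(float *d_buf, int count, int rank) {
    if (rank == 0)
        MPI_Send(d_buf, count, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(d_buf, count, MPI_FLOAT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
}
```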
CUDA-Aware MPI [12]
● Communication between GPUs on different hosts (diagram in [12])
CUDA-Aware MPI [12]
● GPUDirect
○ Introduced in CUDA 3.1
○ Provides high-bandwidth, low-latency communication between GPUs
○ Like cut-through networking
CUDA-Aware MPI [12]
● GPUDirect RDMA
○ Provides high-bandwidth, low-latency communication between GPUs on different hosts
○ Provides accelerated access to other hosts on the network via the HCA/NIC, without staging through host memory
○ Introduced in CUDA 5.0
● GPUDirect P2P (enabling sketch below)
○ Provides accelerated access to other GPUs on the same host via the CPU chipset
○ Introduced in CUDA 4.0
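P2P access must be enabled explicitly through the CUDA runtime; a minimal sketch for devices 0 and 1:

```
#include <cuda_runtime.h>

// Let device 0 access device 1's memory directly (GPUDirect P2P).
void enableP2P(void) {
    int canAccess = 0;
    cudaDeviceCanAccessPeer(&canAccess, 0, 1);  // is this pair P2P-capable?
    if (canAccess) {
        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);       // second argument must be 0
    }
}
```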
CUDA 8 Key Features [13]
● nvGRAPH library for graph analytics
● Support for the new Pascal GPU architecture
● Page Migration Engine
● Improved performance
● Unified memory abstraction makes programming easier (sketch below)
● Developer/debug tools
○ Critical path analysis
○ 2x faster compiler
○ Debugging on the display GPU
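A minimal sketch of unified memory (the kernel name is an assumption): cudaMallocManaged returns one pointer usable from both CPU and GPU, with the Page Migration Engine moving pages on demand:

```
#include <cuda_runtime.h>

__global__ void halve(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 0.5f;
}

int main() {
    const int n = 1 << 20;
    float *data;
    cudaMallocManaged(&data, n * sizeof(float)); // visible to CPU and GPU
    for (int i = 0; i < n; ++i) data[i] = 1.0f;  // touched on the host...
    halve<<<(n + 255) / 256, 256>>>(data, n);    // ...then on the device;
    cudaDeviceSynchronize();                     // pages migrate on demand
    cudaFree(data);
    return 0;
}
```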
Alternative: OpenCL [14][15]
● Unlike CUDA, OpenCL is supported on many architectures, including Nvidia, AMD, and Intel
● Advantage: host code is platform agnostic
● Advantage: GPU architectures and OEMs can be mixed
● Advantage: more options for synchronization/queueing
● Disadvantage: standard OpenCL kernels are also platform agnostic, but performance is not necessarily optimal
○ Proprietary code is needed to optimize, which can be a development and debugging nightmare
● Disadvantage: very few debugging tools are available
● OpenCL is more common in popular applications, but CUDA-enabled applications usually perform better
Conclusions
● CUDA/GPU programming is an efficient way of solving highly parallel tasks on low-cost commodity hardware
● MPI implementations allow for scalability to very large problem sizes
● The CUDA programming model provides an easy and efficient API to developers
Questions?
References
[1] http://www.training.prace-ri.eu/uploads/tx_pracetmo/CCC.pdf
[2] http://haifux.org/lectures/267/Introduction-to-GPUs.pdf
[3] http://www.techspot.com/article/650-history-of-the-gpu/
[4] https://www.cs.cmu.edu/afs/cs/academic/class/15462-f11/www/lec_slides/lec19.pdf
[5] http://www.cc.gatech.edu/~vetter/keeneland/tutorial-2011-04-14/02-cuda-overview.pdf
[6] http://supercomputingblog.com/cuda/practical-applications-for-cuda/
[7] http://www.kdnuggets.com/2016/11/parallelism-machine-learning-gpu-cuda-threading.html
[8] http://supercomputingblog.com/cuda/performance-of-sqrt-in-cuda/
[9] https://rohitnarurkar.wordpress.com/2013/11/02/cuda-matrix-multiplication/
[10] https://www.microway.com/hpc-tech-tips/benchmark-matlab-gpu-acceleration-nvidia-tesla-k40-gpus/
[11] https://www.andrew.cmu.edu/user/kunfungl/15418/report.html
[12] https://devblogs.nvidia.com/parallelforall/introduction-cuda-aware-mpi/
[13] http://on-demand.gputechconf.com/gtc/2016/webinar/cuda-8-features-overview.pdf
[14] https://wiki.tiker.net/CudaVsOpenCL
[15] http://create.pro/blog/opencl-vs-cuda/