CUDA
Tom Guerin
Agenda
● Why use GPUs/GPGPU?
○ Graphics Processing Pipeline
○ GPU Architecture
● What is CUDA?
● CUDA Programming Model (CUDA C)
● Practical Applications
● Performance Benchmarks
● CUDA-Aware MPI
● CUDA 8 Key Features
● Alternative: OpenCL
● Conclusions
● http://www.geforce.com/whats-new/articles/geforce-gtx-1080
Why use GPUs/GPGPU? [1][2][5][7]
● CPUs are best suited for sequential computing and coarse-grained parallel computing
● CPUs contain a small number of complex ALUs which support many operations
● CPUs have relatively large caches
● CPUs: low latency, low throughput
● GPUs are best suited for fine-grained parallel computing (SIMD)
○ Obvious example: image processing
● GPUs: high latency, high throughput
Graphics Processing Pipeline [4]
● Pipeline Entities
● Graphics Pipeline
○ All pipeline entities are processed independently!
Graphics Processing Pipeline [4]
● Since entities are independent, there can be a "shader core" dedicated to each
○ Also known as a streaming processor, shading unit, or CUDA core (in the case of NVIDIA)
● Raster is dependent on shaded fragments, so it is a separate module
GPU Architecture [4]
● Make shader cores small and fast by providing a limited number of ops
○ 2006: GeForce 8800: 128 cores at 575 MHz
○ 2017: GeForce GTX 1080 Ti: 3584 cores at 1582 MHz
● Single Instruction, Multiple Data (SIMD)
○ Shared instruction stream
○ Many ALUs
GPU Architecture [1]
Nvidia GPU Architecture [5]
● Global memory
● Streaming Multiprocessors (SMs)
○ Set of CUDA cores
○ Two schedulers (supporting thousands of concurrent threads)
○ Shared memory
○ L1 cache
○ Thousands of registers
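These per-SM resources can be inspected at runtime with cudaGetDeviceProperties. A minimal sketch (not from the original deck):

```
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;                // properties of one CUDA device
    cudaGetDeviceProperties(&prop, 0);  // query device 0
    printf("SMs:                   %d\n", prop.multiProcessorCount);
    printf("Shared memory / block: %zu bytes\n", prop.sharedMemPerBlock);
    printf("Registers / block:     %d\n", prop.regsPerBlock);
    printf("Max threads / block:   %d\n", prop.maxThreadsPerBlock);
    return 0;
}
```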
What is CUDA?
● Compute Unified Device Architecture
● Leverages parallelism provided in GPU architecture
CUDA Processing Flow [5]
● Data Input:
○ CPU sends input (problem) data to GPU memory
○ GPU runs programs (instructions) as they are sent over the PCIe bus from the CPU
● Computation is performed on GPU cores
● Data Output:
○ Results are stored in GPU memory
○ Results are copied from GPU memory to CPU memory (a host-code sketch follows below)
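The flow above maps onto a handful of CUDA runtime calls. A minimal host-side sketch, assuming a hypothetical kernel named process that doubles each element:

```
#include <cuda_runtime.h>

// Hypothetical kernel: doubles each element. Any __global__ function works here.
__global__ void process(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    float *h_data = new float[n];               // host (CPU) buffer
    for (int i = 0; i < n; ++i) h_data[i] = i;  // problem data
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));     // allocate GPU memory

    // 1. Input: copy problem data from CPU memory to GPU memory over PCIe
    cudaMemcpy(d_data, h_data, n * sizeof(float), cudaMemcpyHostToDevice);

    // 2. Compute: run the program on the GPU cores
    process<<<(n + 255) / 256, 256>>>(d_data, n);

    // 3. Output: results land in GPU memory, then are copied back to the CPU
    cudaMemcpy(h_data, d_data, n * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d_data);
    delete[] h_data;
    return 0;
}
```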
Anatomy of a CUDA C/C++ Application [5][7]
● Two parts:
○ Sequential code run on the host CPU
○ Parallel code run on the GPU
● Compilation workflow:
○ One program is written
○ The program is split into CUDA parts and serial parts
○ CUDA parts are compiled by NVCC for the GPU
○ Serial parts are compiled for the CPU
CUDA C/C++ [5][7]
Example: SAXPY: Single-Precision A*X Plus Y (vector operations)
● CUDA kernel: a function called by the host (CPU) that is executed on the GPU
○ Can only access GPU memory
○ Fixed number of arguments
○ No static variables
● Qualifiers:
○ __global__: kernel returning void
○ __device__: can be called from GPU functions
○ __host__: can be called from CPU functions
○ __host__ and __device__ can both be used on the same function
● The GPU executes kernels while the CPU executes functions (see the sketch below)
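A sketch of the SAXPY kernel named above, showing the qualifiers in use (illustrative code, not taken from the deck):

```
// saxpy.cu (compile with: nvcc saxpy.cu -o saxpy)

// __global__ marks a kernel: launched from the host, runs on the GPU,
// must return void, and can only access GPU memory.
__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per element
    if (i < n)
        y[i] = a * x[i] + y[i];                     // single-precision a*x + y
}

// __host__ and __device__ together compile one function for both sides.
__host__ __device__ float square(float v) { return v * v; }

// Host-side launch (d_x and d_y must already be in GPU memory):
//   saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, d_x, d_y);
```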
CUDA Kernels: Threads, Blocks, and Grids [5][7]
● Threads are grouped into one or more blocks
● Blocks are grouped into one or more grids
● A kernel is executed as a grid of blocks and threads (launch sketch below)
● Each block is mapped to an SM
● A kernel can only execute on one device at a time
● Multiple kernels can execute on a device
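As an illustration of launch configuration (the kernel name and sizes here are assumptions), a 2D image kernel can be launched as a grid of 2D blocks sized to cover every pixel:

```
#include <cuda_runtime.h>

// Hypothetical 2D kernel: each thread inverts one pixel.
__global__ void invert(unsigned char *img, int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        img[y * width + x] = 255 - img[y * width + x];
}

void launchInvert(unsigned char *d_img, int width, int height) {
    dim3 block(16, 16);                          // 256 threads per block
    dim3 grid((width  + block.x - 1) / block.x,  // enough blocks to cover
              (height + block.y - 1) / block.y); // every pixel
    invert<<<grid, block>>>(d_img, width, height);
}
```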
Advantages of Thread Blocks [5][7]
● Cooperation via shared memory (SAS: shared address space)
○ Recall that each SM has some memory and some registers shared between cores
○ May be needed for synchronization
○ Declared with the __shared__ qualifier (see the reduction sketch below)
○ Low latency, high bandwidth
○ Programmer can specify cache use
● Scalability
○ Blocks can execute in any order, concurrently or sequentially, depending on the resources available
● Memory hierarchy
○ Threads use registers and local memory
○ Blocks of threads have shared memory
○ Grids have global memory
■ Accessible by all threads
■ High latency, low bandwidth
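A minimal sketch of block-level cooperation, assuming 256 threads per block: a tree reduction staged in __shared__ memory and synchronized with __syncthreads():

```
#include <cuda_runtime.h>

// Sums 256 elements per block; writes one partial sum per block.
// Launch with 256 threads per block to match the tile[] size.
__global__ void blockSum(const float *in, float *out, int n) {
    __shared__ float tile[256];          // lives in the SM's shared memory
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    tile[tid] = (i < n) ? in[i] : 0.0f;  // stage from slow global memory
    __syncthreads();                     // wait for the whole block

    // Tree reduction entirely in low-latency shared memory.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) tile[tid] += tile[tid + s];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = tile[0];
}
```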
Practical Applications [6][7]
● Video transcoding
● Video enhancement
● Oil and natural resource exploration
● Medical imaging
● Data science / neural networks / graphs
○ Heavy reliance on matrix multiplication
● Euclidean distance calculation (see the kernel sketch below)
○ This is done repeatedly in many algorithms
○ Each axis is independent
○ Recall: gravitational problems, k-means clustering
● VLSI simulation
● Fluid dynamics
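To make the Euclidean distance case concrete, here is a hypothetical kernel (not from the deck) computing each point's distance to one centroid, as in a k-means step; one thread per point, since the points are independent:

```
#include <math.h>

// Distance from each 3D point to a single centroid (cx, cy, cz).
__global__ void distToCentroid(const float *px, const float *py, const float *pz,
                               float cx, float cy, float cz,
                               float *dist, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float dx = px[i] - cx, dy = py[i] - cy, dz = pz[i] - cz;
        dist[i] = sqrtf(dx * dx + dy * dy + dz * dz);  // hardware sqrt unit
    }
}
```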
Performance Benchmarks [8][9]
● Square root
○ A common operation for large simulations
○ Ex: Euclidean distance in the n-body problem
○ GPUs have specialized hardware on each ALU for this purpose, and hundreds of them
○ GTX 280 vs. Core 2 Duo
● Matrix multiplication (a naive kernel is sketched below)
○ GeForce 210 (16 cores)
○ CPU faster up to a 25x25 matrix
○ GPU much faster for large matrices
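For reference, the kind of kernel such benchmarks exercise looks like the naive sketch below (illustrative only, not the benchmarked code):

```
// Naive dense multiply C = A * B for n x n matrices; one thread per element of C.
__global__ void matmul(const float *A, const float *B, float *C, int n) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n && col < n) {
        float sum = 0.0f;
        for (int k = 0; k < n; ++k)
            sum += A[row * n + k] * B[k * n + col];  // dot of row and column
        C[row * n + col] = sum;
    }
}
```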
Performance Benchmarks [10][11]
● GPU-accelerated MATLAB
○ Built-in CUDA support for matrix multiplication and other operations
● Ray tracing
○ GTX 1080 (2560 CUDA cores)
○ 4-core / 8-thread CPU (presumably a recent Intel Core i7)
CUDA-Aware MPI [12]
● Needed when a problem is divided among two or more GPUs and/or two or more hosts
○ For instance, when one GPU does not have enough memory
● Recall: heterogeneous clusters
● Some implementations:
○ OpenMPI
○ CRAY
○ IBM Platform MPI
○ SGI MPI
● A CUDA-aware MPI accepts device pointers directly in the standard MPI_Send/MPI_Recv calls; without it, data must be staged through host memory with cudaMemcpy (see the sketch below)
● Unified Virtual Addressing (UVA)
○ Introduced in CUDA 4.0
○ Abstracts away the physical location of GPU memory
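A minimal sketch of the pattern [12] describes: with UVA, a device pointer from cudaMalloc can be passed straight to the standard MPI calls:

```
#include <mpi.h>

// d_buf is GPU memory (from cudaMalloc). A CUDA-aware MPI detects this
// via UVA and handles the transfer itself; a plain MPI would need a
// cudaMemcpy to a host buffer first.
void exchange(float *d_buf, int count, int rank) {
    if (rank == 0)
        MPI_Send(d_buf, count, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(d_buf, count, MPI_FLOAT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
}
```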
CUDA-Aware MPI [12]
● Communication between GPUs on different hosts (diagram in [12])
CUDA-Aware MPI [12]
● GPUDirect
○ Introduced in CUDA 3.1
○ Provides high-bandwidth, low-latency communication between GPUs
○ Like cut-through networking
CUDA-Aware MPI [12]
● GPUDirect RDMA
○ Provides high-bandwidth, low-latency communication between GPUs on different hosts
○ Provides accelerated access to other hosts on the network via the HCA/NIC, without staging through host memory
○ Introduced in CUDA 5.0
● GPUDirect P2P (enabling sketch below)
○ Provides accelerated access to other GPUs on the same host via the CPU chipset
○ Introduced in CUDA 4.0
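P2P access must be enabled explicitly through the CUDA runtime; a minimal sketch for devices 0 and 1:

```
#include <cuda_runtime.h>

// Let device 0 access device 1's memory directly (GPUDirect P2P).
void enableP2P(void) {
    int canAccess = 0;
    cudaDeviceCanAccessPeer(&canAccess, 0, 1);  // is this pair P2P-capable?
    if (canAccess) {
        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);       // second argument must be 0
    }
}
```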
CUDA 8 Key Features [13]
● nvGRAPH library for graph analytics
● Support for the new Pascal GPU architecture
● Page Migration Engine
● Improved performance
● Unified memory abstraction makes programming easier (sketch below)
● Developer/debug tools
○ Critical path analysis
○ 2x faster compiler
○ Debugging on the display GPU
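A minimal sketch of unified memory (the kernel name is an assumption): cudaMallocManaged returns one pointer usable from both CPU and GPU, with the Page Migration Engine moving pages on demand:

```
#include <cuda_runtime.h>

__global__ void halve(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 0.5f;
}

int main() {
    const int n = 1 << 20;
    float *data;
    cudaMallocManaged(&data, n * sizeof(float)); // visible to CPU and GPU
    for (int i = 0; i < n; ++i) data[i] = 1.0f;  // touched on the host...
    halve<<<(n + 255) / 256, 256>>>(data, n);    // ...then on the device;
    cudaDeviceSynchronize();                     // pages migrate on demand
    cudaFree(data);
    return 0;
}
```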
Alternative: OpenCL [14][15]
● Unlike CUDA, OpenCL is supported on many architectures, including Nvidia, AMD, and Intel
● Advantage: host code is platform agnostic
● Advantage: GPU architectures and OEMs can be mixed
● Advantage: more options for synchronization/queueing
● Disadvantage: standard OpenCL kernels are also platform agnostic, but performance is not necessarily optimal
○ Proprietary code is needed to optimize, which can be a development and debugging nightmare
● Disadvantage: very few debugging tools are available
● OpenCL is more common in popular applications, but CUDA-enabled applications usually perform better
Conclusions
● CUDA/GPU programming is an efficient way of solving highly parallel tasks on low-cost commodity hardware
● MPI implementations allow for scalability to very large problem sizes
● The CUDA programming model provides an easy and efficient API to developers
Questions?
References
[1] http://www.training.prace-ri.eu/uploads/tx_pracetmo/CCC.pdf
[2] http://haifux.org/lectures/267/Introduction-to-GPUs.pdf
[3] http://www.techspot.com/article/650-history-of-the-gpu/
[4] https://www.cs.cmu.edu/afs/cs/academic/class/15462-f11/www/lec_slides/lec19.pdf
[5] http://www.cc.gatech.edu/~vetter/keeneland/tutorial-2011-04-14/02-cuda-overview.pdf
[6] http://supercomputingblog.com/cuda/practical-applications-for-cuda/
[7] http://www.kdnuggets.com/2016/11/parallelism-machine-learning-gpu-cuda-threading.html
[8] http://supercomputingblog.com/cuda/performance-of-sqrt-in-cuda/
[9] https://rohitnarurkar.wordpress.com/2013/11/02/cuda-matrix-multiplication/
[10] https://www.microway.com/hpc-tech-tips/benchmark-matlab-gpu-acceleration-nvidia-tesla-k40-gpus/
[11] https://www.andrew.cmu.edu/user/kunfungl/15418/report.html
[12] https://devblogs.nvidia.com/parallelforall/introduction-cuda-aware-mpi/
[13] http://on-demand.gputechconf.com/gtc/2016/webinar/cuda-8-features-overview.pdf
[14] https://wiki.tiker.net/CudaVsOpenCL
[15] http://create.pro/blog/opencl-vs-cuda/