
Page 1: CUDA

Tom Guerin

Page 2: Agenda

● Why use GPUs/GPGPU?

○ Graphics Processing Pipeline

○ GPU Architecture

● What is CUDA?

● CUDA Programming Model (CUDA C)

● Practical Applications

● Performance Benchmarks

● CUDA-Aware MPI

● CUDA 8 Key Features

● Alternative: OpenCL

● Conclusions

http://www.geforce.com/whats-new/articles/geforce-gtx-1080

Page 3: Why use GPUs/GPGPU? [1][2][5][7]

● CPUs are best suited for sequential computing and coarse-grain parallel computing.

● CPUs contain a small number of complex ALUs which support many operations.

● CPUs have relatively large caches.

● CPUs: low latency, low throughput.

● GPUs are best suited for fine-grain parallel computing (SIMD).

○ Obvious example: image processing

● GPUs: high latency, high throughput.

Page 4: Graphics Processing Pipeline [4]

● Pipeline Entities

● Graphics Pipeline

○ All entities are processed independently!

Page 5: Graphics Processing Pipeline [4]

● Since entities are independent, there can be a "shader core" dedicated to each

○ Also known as a streaming processor, shading unit, or CUDA core (in the case of Nvidia)

● Rasterization depends on shaded fragments, so it is a separate module

Page 6: GPU Architecture [4]

● Make shader cores small and fast by providing a limited number of ops

○ 2006: GeForce 8800: 128 cores at 575 MHz

○ 2017: GeForce GTX 1080 Ti: 3584 cores at 1582 MHz

● Single Instruction Multiple Data (SIMD)

○ Shared instruction stream

○ Many ALUs

Page 7: GPU Architecture [1]

Page 8: Nvidia GPU Architecture [5]

● Global memory

● Streaming Multiprocessors (SMs)

○ Set of CUDA cores

○ Two schedulers (supporting thousands of concurrent threads)

○ Shared memory

○ L1 cache

○ Thousands of registers

Page 9: What is CUDA?

● Compute Unified Device Architecture

● Leverages the parallelism provided by the GPU architecture

Page 10: CUDA Processing Flow [5]

● Data input:

○ The CPU sends input (problem) data to GPU memory

○ The GPU runs programs (instructions) as they are sent over the PCIe bus from the CPU

● Computation is performed on the GPU cores

● Data output:

○ Results are stored in GPU memory

○ Results are copied from GPU memory to CPU memory (the full flow is sketched below)
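
As a minimal sketch (not from the slides), a host program mirroring this flow might look like the following; the kernel process and all buffer names are hypothetical placeholders:

    // Hypothetical host program mirroring the processing flow above.
    #include <cuda_runtime.h>
    #include <stdlib.h>

    __global__ void process(float *data, int n)      // placeholder kernel
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;                  // some per-element work
    }

    int main(void)
    {
        const int n = 1 << 20;
        float *h_data = (float *)calloc(n, sizeof(float));  // CPU (host) buffer
        float *d_data;                                      // GPU (device) buffer
        cudaMalloc(&d_data, n * sizeof(float));

        // 1. Input: copy problem data from CPU memory to GPU memory over PCIe
        cudaMemcpy(d_data, h_data, n * sizeof(float), cudaMemcpyHostToDevice);

        // 2. Compute: run the kernel on the GPU cores
        process<<<(n + 255) / 256, 256>>>(d_data, n);

        // 3. Output: copy results from GPU memory back to CPU memory
        cudaMemcpy(h_data, d_data, n * sizeof(float), cudaMemcpyDeviceToHost);

        cudaFree(d_data);
        free(h_data);
        return 0;
    }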


Page 13: Anatomy of a CUDA C/C++ Application [5][7]

● Two parts

○ Sequential code run on the host CPU

○ Parallel code run on the GPU

● Compilation workflow:

○ One program is written

○ The program is split into CUDA parts and serial parts

○ CUDA parts are compiled by NVCC for the GPU

○ Serial parts are compiled for the CPU

Page 14: CUDA C/C++ [5][7]

Example: SAXPY: Single-precision A·X Plus Y (vector operations; kernel sketched below)

● CUDA kernel: a function called by the host (CPU) that is executed on the GPU

○ Can only access GPU memory

○ Fixed number of arguments

○ No static variables

● Qualifiers:

○ __global__: kernel returning void

○ __device__: can be called from GPU functions

○ __host__: can be called from CPU functions

○ __host__ and __device__ can both be used

● The GPU executes kernels while the CPU executes functions
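
The kernel source is not reproduced in these slides; a minimal SAXPY kernel in CUDA C could look like this sketch:

    // SAXPY kernel: each GPU thread computes one element of y = a*x + y.
    __global__ void saxpy(int n, float a, const float *x, float *y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
        if (i < n)                                      // guard: grid may exceed n
            y[i] = a * x[i] + y[i];
    }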

Page 15: CUDA Kernels: Threads, Blocks, and Grids [5][7]

● Threads are grouped into one or more blocks

● Blocks are grouped into one or more grids

● A kernel is executed as a grid of blocks of threads (launch configuration sketched below)

● Each block is mapped to an SM

● A kernel can only execute on one device at a time

● Multiple kernels can execute on a device
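
To illustrate, the host chooses the grid and block dimensions at launch time; a sketch reusing the hypothetical saxpy kernel above, with d_x and d_y assumed to be device allocations:

    // Launch configuration: map n elements onto a 1-D grid of 1-D blocks.
    int threadsPerBlock = 256;
    int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;  // ceiling division
    saxpy<<<blocksPerGrid, threadsPerBlock>>>(n, 2.0f, d_x, d_y);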

Page 16: Advantages of Thread Blocks [5][7]

● Cooperation via shared memory (SAS)

○ Recall each SM has some memory and some registers shared between cores

○ May be needed for synchronization

○ Declared with the __shared__ qualifier (see the sketch after this list)

○ Low latency, high bandwidth

○ The programmer can specify cache use

● Scalability

○ Blocks can execute in any order, concurrently or sequentially, depending on the resources available

● Memory hierarchy

○ Threads use registers and local memory

○ Blocks of threads have shared memory

○ Grids have global memory

■ Accessible by all threads

■ High latency, low bandwidth
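
As one illustration of block-level cooperation (a sketch, not from the slides), a sum reduction can stage data in __shared__ memory and synchronize with __syncthreads(); the block size is assumed to be 256, a power of two:

    // Each block reduces up to 256 elements to one partial sum in shared memory.
    __global__ void blockSum(const float *in, float *out, int n)
    {
        __shared__ float cache[256];          // visible to all threads in the block
        int i   = blockIdx.x * blockDim.x + threadIdx.x;
        int tid = threadIdx.x;
        cache[tid] = (i < n) ? in[i] : 0.0f;
        __syncthreads();                      // wait until every thread has written
        for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
            if (tid < stride)
                cache[tid] += cache[tid + stride];
            __syncthreads();                  // synchronize between reduction steps
        }
        if (tid == 0)
            out[blockIdx.x] = cache[0];       // one partial sum per block
    }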

Page 17: Practical Applications [6][7]

● Video transcoding

● Video enhancement

● Oil and natural resource exploration

● Medical imaging

● Data science / neural networks / graphs

○ Heavy reliance on matrix multiplication

● Euclidean distance calculation (see the sketch after this list)

○ This is done repeatedly in many algorithms

○ Each axis is independent

○ Recall: gravitational problems, k-means clustering

● VLSI simulation

● Fluid dynamics
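
Since each axis (and each point) is independent, the distance computation maps naturally onto one thread per point; a sketch with hypothetical names:

    // Squared Euclidean distance from each d-dimensional point to a query point.
    __global__ void sqDist(const float *points, const float *query,
                           float *dist, int n, int d)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per point
        if (i < n) {
            float sum = 0.0f;
            for (int k = 0; k < d; ++k) {               // axes are independent
                float diff = points[i * d + k] - query[k];
                sum += diff * diff;
            }
            dist[i] = sum;  // apply sqrtf(sum) if the true distance is needed
        }
    }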

Page 18: Performance Benchmarks [8][9]

● Square root

○ A common operation in large simulations

○ Ex: Euclidean distance in the n-body problem

○ GPUs have specialized hardware on each ALU for this purpose, and hundreds of ALUs

○ Benchmark: GTX 280 vs. Core 2 Duo

● Matrix multiplication (a naive kernel is sketched below)

○ GeForce 210 (16 cores)

○ The CPU is faster up to a 25x25 matrix

○ The GPU is much faster for large matrices
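
The benchmark kernel itself is not shown in the slides; a naive one-thread-per-output-element version (a sketch, square n×n row-major matrices assumed) makes the available parallelism clear:

    // Naive matrix multiply: thread (row, col) computes one element of C = A * B.
    __global__ void matMul(const float *A, const float *B, float *C, int n)
    {
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < n && col < n) {
            float sum = 0.0f;
            for (int k = 0; k < n; ++k)
                sum += A[row * n + k] * B[k * n + col];  // dot product of row and column
            C[row * n + col] = sum;
        }
    }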

Page 19: Performance Benchmarks [10][11]

● GPU-accelerated MATLAB

○ Built-in CUDA support for matrix multiplication and other operations

● Ray tracing

○ GTX 1080 (2560 CUDA cores)

○ 4-core / 8-thread CPU (presumably a recent Intel Core i7)

Page 20: CUDA-Aware MPI [12]

● Needed when a problem is divided among two or more GPUs, or two or more hosts

○ For instance, when one GPU does not have enough memory

● Recall: heterogeneous clusters

● Some implementations:

○ OpenMPI

○ Cray

○ IBM Platform MPI

○ SGI MPI

● A CUDA-aware MPI implementation accepts GPU buffers directly in standard MPI calls such as MPI_Send/MPI_Recv; without one, data must first be staged to host memory with cudaMemcpy (see the sketch after this list)

● Unified Virtual Addressing (UVA)

○ Introduced in CUDA 4.0

○ Abstracts away the physical location of GPU memory
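
A hedged sketch of the difference, where d_buf is assumed to be a device allocation, h_buf a host buffer, and ranks and sizes are hypothetical:

    // With CUDA-aware MPI: pass the GPU buffer directly to standard MPI calls.
    if (rank == 0)
        MPI_Send(d_buf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(d_buf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    // Without CUDA-aware MPI: stage through host memory first.
    // cudaMemcpy(h_buf, d_buf, n * sizeof(float), cudaMemcpyDeviceToHost);
    // MPI_Send(h_buf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);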

Page 21: CUDA-Aware MPI [12]

● Communication between GPUs on different hosts:

Page 22: CUDA-Aware MPI [12]

● GPUDirect

○ Introduced in CUDA 3.1

○ Provides high-bandwidth, low-latency communication between GPUs

○ Like cut-through networking

Page 23: CUDA-Aware MPI [12]

● GPUDirect RDMA

○ Provides high-bandwidth, low-latency communication between GPUs on different hosts

○ Provides accelerated access to other hosts on the network via the CA/NIC and the CPU chipset

○ Introduced in CUDA 5.0

● GPUDirect P2P

○ Provides accelerated access to other GPUs on the same host via the CPU chipset

○ Introduced in CUDA 4.0

Page 24: CUDA 8 Key Features [13]

● nvGRAPH library for graph analytics

● Support for the new Pascal GPU architecture

● Page Migration Engine

● Improved performance

● Unified memory abstraction makes programming easier (see the sketch after this list)

● Developer/debug tools

○ Critical path analysis

○ 2x faster compiler

○ Debug on the display GPU
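
A sketch of the unified memory model; the kernel process is the hypothetical one from the processing-flow sketch earlier:

    // cudaMallocManaged returns one pointer valid on both CPU and GPU;
    // pages are migrated on demand by the Page Migration Engine.
    float *data;
    cudaMallocManaged(&data, n * sizeof(float));
    for (int i = 0; i < n; ++i)
        data[i] = 1.0f;                          // initialize on the CPU
    process<<<(n + 255) / 256, 256>>>(data, n);  // use the same pointer on the GPU
    cudaDeviceSynchronize();                     // wait before the CPU touches data again
    cudaFree(data);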

Page 25: Alternative: OpenCL [14][15]

● Unlike CUDA, OpenCL is supported on many architectures, including Nvidia, AMD, and Intel

● Advantage: host code is platform agnostic

● Advantage: GPU architectures and OEMs can be mixed

● Advantage: more options for synchronization/queueing

● Disadvantage: standard OpenCL kernels are also platform agnostic, but performance is not necessarily optimal

○ Proprietary code is needed to optimize, which can be a development and debugging nightmare

● Disadvantage: very few debugging tools are available

● OpenCL is more common in popular applications, but CUDA-enabled applications usually perform better.

Page 26: Conclusions

● CUDA/GPU programming is an efficient way of solving highly parallel tasks on low-cost commodity hardware

● MPI implementations allow for scalability to very large problem sizes

● The CUDA programming model provides an easy and efficient API to developers

Page 27: Questions?

Page 28: References

[1] http://www.training.prace-ri.eu/uploads/tx_pracetmo/CCC.pdf

[2] http://haifux.org/lectures/267/Introduction-to-GPUs.pdf

[3] http://www.techspot.com/article/650-history-of-the-gpu/

[4] https://www.cs.cmu.edu/afs/cs/academic/class/15462-f11/www/lec_slides/lec19.pdf

[5] http://www.cc.gatech.edu/~vetter/keeneland/tutorial-2011-04-14/02-cuda-overview.pdf

[6] http://supercomputingblog.com/cuda/practical-applications-for-cuda/

[7] http://www.kdnuggets.com/2016/11/parallelism-machine-learning-gpu-cuda-threading.html

[8] http://supercomputingblog.com/cuda/performance-of-sqrt-in-cuda/

[9] https://rohitnarurkar.wordpress.com/2013/11/02/cuda-matrix-multiplication/

[10] https://www.microway.com/hpc-tech-tips/benchmark-matlab-gpu-acceleration-nvidia-tesla-k40-gpus/

[11] https://www.andrew.cmu.edu/user/kunfungl/15418/report.html

[12] https://devblogs.nvidia.com/parallelforall/introduction-cuda-aware-mpi/

[13] http://on-demand.gputechconf.com/gtc/2016/webinar/cuda-8-features-overview.pdf

[14] https://wiki.tiker.net/CudaVsOpenCL

[15] http://create.pro/blog/opencl-vs-cuda/