
© NVIDIA Corporation 2010

Advanced CUDA Optimization

1. Introduction

Thomas Bradley


Agenda

CUDA Review

Review of CUDA Architecture

Programming & Memory Models

Programming Environment

Execution

Performance

Optimization Guidelines

Productivity

Resources


REVIEW OF CUDA ARCHITECTURE

CUDA Review


Processing Flow

1. Copy input data from CPU memory to GPU memory

2. Load GPU program and execute, caching data on chip for performance

3. Copy results from GPU memory to CPU memory

(Diagram: transfers between CPU memory and GPU memory cross the PCI bus.)
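The three steps above can be sketched in CUDA C roughly as follows. This is a minimal illustration rather than code from the slides: the kernel `scale`, the helper `scale_on_gpu`, and the block size of 256 are all assumptions, and error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: each thread scales one element.
__global__ void scale(const float* in, float* out, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard the partial last block
        out[i] = factor * in[i];
}

// Hypothetical host helper that follows the three steps of the processing flow.
std::vector<float> scale_on_gpu(const std::vector<float>& host_in, float factor)
{
    const int n = static_cast<int>(host_in.size());
    std::vector<float> host_out(n);
    if (n == 0) return host_out;
    const size_t bytes = n * sizeof(float);

    float *dev_in = nullptr, *dev_out = nullptr;
    cudaMalloc(&dev_in, bytes);
    cudaMalloc(&dev_out, bytes);

    // 1. Copy input data from CPU memory to GPU memory.
    cudaMemcpy(dev_in, host_in.data(), bytes, cudaMemcpyHostToDevice);

    // 2. Launch the GPU program (kernel).
    const int threads = 256;                        // assumed block size
    const int blocks = (n + threads - 1) / threads;
    scale<<<blocks, threads>>>(dev_in, dev_out, factor, n);

    // 3. Copy results from GPU memory to CPU memory.
    //    This blocking copy also waits for the kernel to finish.
    cudaMemcpy(host_out.data(), dev_out, bytes, cudaMemcpyDeviceToHost);

    cudaFree(dev_in);
    cudaFree(dev_out);
    return host_out;
}
```

Calling `scale_on_gpu({1.0f, 2.0f, 3.0f}, 2.0f)` from a `main` function in a `.cu` file compiled with `nvcc` would exercise all three steps.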


CUDA Parallel Computing Architecture

Parallel computing architecture and programming model

Includes a CUDA C compiler, support for OpenCL and DirectCompute

Architected to natively support multiple computational interfaces (standard languages and APIs)


CUDA Parallel Computing Architecture

CUDA defines:

Programming model

Memory model

Execution model

CUDA uses the GPU, but is for general-purpose computing

Facilitate heterogeneous computing: CPU + GPU

CUDA is scalable

Scale to run on 100s of cores/1000s of parallel threads


PROGRAMMING MODEL

CUDA Review


CUDA Kernels

Parallel portion of application: execute as a kernel

Entire GPU executes kernel, many threads

CUDA threads:

Lightweight

Fast switching

1000s execute simultaneously

CPU (host): executes functions

GPU (device): executes kernels


CUDA Kernels: Parallel Threads

A kernel is a function executed on the GPU

Array of threads, in parallel

All threads execute the same code, can take different paths

Each thread has an ID

Select input/output data

Control decisions

    float x = input[threadID];
    float y = func(x);
    output[threadID] = y;
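A complete version of the fragment above might look like the sketch below. The slide does not define `threadID` or `func`, so both are filled in with assumptions: `threadID` is taken from `threadIdx.x` for a single block of threads, and `func` is a hypothetical squaring function. The even/odd branch is added only to illustrate threads taking different paths. Error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

// The slide names func() but does not define it; squaring is a hypothetical stand-in.
__device__ float func(float x) { return x * x; }

__global__ void apply_func(const float* input, float* output, int n)
{
    int threadID = threadIdx.x;          // single block, so the thread index is the ID
    if (threadID >= n) return;           // control decision: surplus threads do nothing

    float x = input[threadID];           // the ID selects the input element
    float y = (threadID % 2 == 0) ? func(x) : x;   // even and odd IDs take different paths
    output[threadID] = y;                // the ID selects the output element
}

constexpr int kMaxThreads = 256;         // assumed size of the single block

// Hypothetical host helper; handles at most kMaxThreads elements.
std::vector<float> run_apply_func(const std::vector<float>& in)
{
    const int n = static_cast<int>(in.size());
    std::vector<float> out(n, 0.0f);
    if (n == 0 || n > kMaxThreads) return out;   // unsupported sizes stay zero-filled

    const size_t bytes = n * sizeof(float);
    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, in.data(), bytes, cudaMemcpyHostToDevice);

    apply_func<<<1, kMaxThreads>>>(d_in, d_out, n);   // one block of threads

    cudaMemcpy(out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return out;
}
```

With the input `{1, 2, 3, 4}`, threads 0 and 2 square their element and threads 1 and 3 pass theirs through unchanged.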


CUDA Kernels: Subdivide into Blocks

Threads are grouped into blocks

Blocks are grouped into a grid

A kernel is executed as a grid of blocks of threads



Communication Within a Block

Threads may need to cooperate

Memory accesses

Share results

Cooperate using shared memory

Accessible by all threads within a block

Restriction to “within a block” permits scalability

Fast communication between N threads is not feasible when N is large
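Cooperation within a block can be sketched with a per-block sum, as below. The kernel name `block_sum`, the helper `sum_on_gpu`, and the block size of 256 are assumptions made for illustration, and error checking is omitted. Because shared memory is visible only inside one block, each block writes a partial sum and the host combines them.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kBlockSize = 256;   // assumed block size; this reduction needs a power of two

// Each block sums its slice of `in` and writes one partial result.
__global__ void block_sum(const float* in, float* partial, int n)
{
    __shared__ float cache[kBlockSize];      // accessible by all threads of this block only

    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    cache[tid] = (i < n) ? in[i] : 0.0f;     // out-of-range threads contribute zero
    __syncthreads();                         // every load is visible before any thread reads

    // Tree reduction: halve the number of active threads at each step.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            cache[tid] += cache[tid + stride];
        __syncthreads();
    }

    if (tid == 0)
        partial[blockIdx.x] = cache[0];      // one result per block
}

// Hypothetical host helper: combines the per-block partial sums on the CPU.
float sum_on_gpu(const std::vector<float>& host_in)
{
    const int n = static_cast<int>(host_in.size());
    if (n == 0) return 0.0f;
    const int blocks = (n + kBlockSize - 1) / kBlockSize;

    float *d_in = nullptr, *d_partial = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_partial, blocks * sizeof(float));
    cudaMemcpy(d_in, host_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    block_sum<<<blocks, kBlockSize>>>(d_in, d_partial, n);

    std::vector<float> partial(blocks);
    cudaMemcpy(partial.data(), d_partial, blocks * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_partial);

    float total = 0.0f;
    for (float p : partial) total += p;
    return total;
}
```

Finishing the sum on the host is a deliberate simplification. It keeps all fast communication inside a block, which is the restriction the slide describes.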


Transparent Scalability – G84

(Figure: a grid of 12 thread blocks, numbered 1 to 12, is shown running two blocks at a time: 1 2, then 3 4, 5 6, 7 8, 9 10, 11 12.)


Transparent Scalability – G80

(Figure: the same 12 thread blocks are shown running eight at a time: blocks 1 to 8, then blocks 9 to 12.)


Transparent Scalability – GT200

(Figure: all 12 thread blocks are shown running at once, with three further slots marked Idle.)


CUDA Programming Model - Summary

A kernel executes as a grid of thread blocks

A block is a batch of threads

Communicate through shared memory

Each block has a block ID

Each thread has a thread ID

(Figure: the host launches Kernel 1 and Kernel 2 on the device. IDs can be 1D (0, 1, 2, 3) or 2D (0,0 through 1,3).)
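The 2D IDs in the summary can be illustrated with a small sketch in which every thread records which block processed its element. The names `record_block_ids` and `block_id_map` and the 16 x 16 block shape are assumptions, not part of the slides, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each thread writes the linearized ID of its block into a row-major 2D array.
__global__ void record_block_ids(int* out, int width, int height)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;   // x: block ID and thread ID combined
    int row = blockIdx.y * blockDim.y + threadIdx.y;   // y: block ID and thread ID combined
    if (col < width && row < height)
        out[row * width + col] = blockIdx.y * gridDim.x + blockIdx.x;
}

// Hypothetical host helper returning the block ID that handled each element.
std::vector<int> block_id_map(int width, int height)
{
    if (width <= 0 || height <= 0) return {};
    std::vector<int> host(static_cast<size_t>(width) * height);

    int* dev = nullptr;
    cudaMalloc(&dev, host.size() * sizeof(int));

    dim3 block(16, 16);                                 // assumed 2D block of 256 threads
    dim3 grid((width + block.x - 1) / block.x,          // enough blocks to cover the width
              (height + block.y - 1) / block.y);        // and the height
    record_block_ids<<<grid, block>>>(dev, width, height);

    cudaMemcpy(host.data(), dev, host.size() * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return host;
}
```

For a 64 x 32 array this launches a grid of 4 x 2 blocks, the same 2 x 4 arrangement as the 2D IDs in the figure.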


MEMORY MODEL

CUDA Review


Memory hierarchy

Thread: Registers

Thread: Local memory

Block of threads: Shared memory

All blocks: Global memory


Additional Memories

Host can also allocate textures and arrays of constants

Textures and constants have dedicated caches
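A host-allocated array of constants might be used as in the sketch below, which evaluates a small polynomial whose coefficients live in constant memory. The names `coeffs`, `eval_poly`, and `eval_poly_on_gpu` are hypothetical, textures are not covered, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kNumCoeffs = 4;
__constant__ float coeffs[kNumCoeffs];   // constant memory, read through the constant cache

// Evaluates c0 + c1*x + c2*x^2 + c3*x^3 for each element using Horner's rule.
__global__ void eval_poly(const float* x, float* y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float xi = x[i];
    float acc = 0.0f;
    for (int k = kNumCoeffs - 1; k >= 0; --k)
        acc = acc * xi + coeffs[k];      // all threads read the same coeffs[k] at each step
    y[i] = acc;
}

// Hypothetical host helper; host_coeffs must point to kNumCoeffs floats.
std::vector<float> eval_poly_on_gpu(const float* host_coeffs, const std::vector<float>& x)
{
    const int n = static_cast<int>(x.size());
    std::vector<float> y(n);
    if (n == 0) return y;
    const size_t bytes = n * sizeof(float);

    // The host fills the constant array before launching the kernel.
    cudaMemcpyToSymbol(coeffs, host_coeffs, kNumCoeffs * sizeof(float));

    float *d_x = nullptr, *d_y = nullptr;
    cudaMalloc(&d_x, bytes);
    cudaMalloc(&d_y, bytes);
    cudaMemcpy(d_x, x.data(), bytes, cudaMemcpyHostToDevice);

    eval_poly<<<(n + 255) / 256, 256>>>(d_x, d_y, n);

    cudaMemcpy(y.data(), d_y, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_x);
    cudaFree(d_y);
    return y;
}
```

Coefficients were chosen for this example because every thread reads the same value at the same time, an access pattern that constant memory is generally described as serving well.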


PROGRAMMING ENVIRONMENT

CUDA Review


CUDA C and OpenCL

Shared back-end compiler and optimization technology

Entry point for developers who want low-level API

Entry point for developers who prefer high-level C


Visual Studio

Separate file types

.c/.cpp for host code

.cu for device/mixed code

Compilation rules: cuda.rules

Syntax highlighting

Intellisense

Integrated debugger and profiler: Nexus


NVIDIA Nexus IDE

The industry’s first IDE for massively parallel applications

Accelerates co-processing (CPU + GPU) application development

Complete Visual Studio-integrated development environment


Linux

Separate file types

.c/.cpp for host code

.cu for device/mixed code

Typically makefile driven

cuda-gdb for debugging

CUDA Visual Profiler


OPTIMIZATION GUIDELINES

Performance


Optimize Algorithms for GPU

Algorithm selection

Understand the problem, consider alternate algorithms

Maximize independent parallelism

Maximize arithmetic intensity (math/bandwidth)

Recompute?

GPU allocates transistors to arithmetic, not memory

Sometimes better to recompute rather than cache

Serial computation on GPU?

Low parallelism computation may be faster on GPU vs copy to/from host
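As a rough worked example of arithmetic intensity, consider a single-precision update of the form y_i = a * x_i + y_i (commonly called SAXPY). This operation is not mentioned in the slides and is used here only as an assumed illustration. Per element it performs one multiply and one add, reads two 4-byte values, and writes one, ignoring any caching:

```latex
\[
  I \;=\; \frac{\text{math}}{\text{bandwidth}}
    \;=\; \frac{2\ \text{flops}}{(4 + 4 + 4)\ \text{bytes}}
    \;=\; \frac{1}{6}\ \text{flop/byte}
    \;\approx\; 0.17\ \text{flop/byte}
\]
```

An intensity this low suggests the operation is limited by memory traffic rather than arithmetic, which is the situation in which recomputing a value can be cheaper than loading it.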


Optimize Memory Access

Coalesce global memory access

Maximize DRAM efficiency

Order of magnitude impact on performance

Avoid serialization

Minimize shared memory bank conflicts

Understand constant cache semantics

Understand spatial locality

Optimize use of textures to ensure spatial locality
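The difference between coalesced and non-coalesced global memory reads can be sketched with two copy kernels, timed with CUDA events. All names here (`copy_coalesced`, `copy_strided`, `time_copies`, `CopyTimes`) are hypothetical, error checking is omitted, and a single timed run is noisy, so the numbers are only indicative.

```cuda
#include <cuda_runtime.h>

// Coalesced: consecutive threads read consecutive addresses.
__global__ void copy_coalesced(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: consecutive threads read addresses `stride` elements apart,
// so the reads of one warp are scattered. The writes stay coalesced.
__global__ void copy_strided(const float* in, float* out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int j = static_cast<int>((static_cast<long long>(i) * stride) % n);
        out[i] = in[j];
    }
}

struct CopyTimes { float coalesced_ms; float strided_ms; };

// Hypothetical helper: times one launch of each kernel on n floats.
CopyTimes time_copies(int n, int stride)
{
    float *in = nullptr, *out = nullptr;
    cudaMalloc(&in, n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));
    cudaMemset(in, 0, n * sizeof(float));

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    CopyTimes t = {0.0f, 0.0f};

    copy_coalesced<<<blocks, threads>>>(in, out, n);   // warm-up launch, not timed

    cudaEventRecord(start);
    copy_coalesced<<<blocks, threads>>>(in, out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&t.coalesced_ms, start, stop);

    cudaEventRecord(start);
    copy_strided<<<blocks, threads>>>(in, out, n, stride);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&t.strided_ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(in);
    cudaFree(out);
    return t;
}
```

How large the gap is depends on the GPU generation and its caches, so the slide's "order of magnitude" figure should be measured rather than assumed on a given device.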


Exploit Shared Memory

Hundreds of times faster than global memory

Inter-thread cooperation via shared memory and synchronization

Cache data that is reused by multiple threads

Stage loads/stores to allow reordering

Avoid non-coalesced global memory accesses
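Staging through shared memory to avoid non-coalesced accesses is often illustrated with a matrix transpose, sketched below. The transpose itself is an assumed example rather than something taken from the slides. The names `transpose_tiled` and `transpose_on_gpu` and the tile size of 16 are hypothetical, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kTile = 16;   // assumed tile edge; blocks are kTile x kTile threads

// `in` is height rows by width columns (row-major); `out` is width rows by height columns.
__global__ void transpose_tiled(const float* in, float* out, int width, int height)
{
    __shared__ float tile[kTile][kTile + 1];   // extra column reduces bank conflicts

    int x = blockIdx.x * kTile + threadIdx.x;  // column in `in`
    int y = blockIdx.y * kTile + threadIdx.y;  // row in `in`
    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];    // row-wise (coalesced) read
    __syncthreads();                           // the whole tile is loaded before it is reused

    x = blockIdx.y * kTile + threadIdx.x;      // column in `out`
    y = blockIdx.x * kTile + threadIdx.y;      // row in `out`
    if (x < height && y < width)
        out[y * height + x] = tile[threadIdx.x][threadIdx.y];  // row-wise (coalesced) write
}

// Hypothetical host helper; assumes in.size() == width * height.
std::vector<float> transpose_on_gpu(const std::vector<float>& in, int width, int height)
{
    std::vector<float> out(in.size());
    if (in.empty()) return out;
    const size_t bytes = in.size() * sizeof(float);

    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, in.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(kTile, kTile);
    dim3 grid((width + kTile - 1) / kTile, (height + kTile - 1) / kTile);
    transpose_tiled<<<grid, block>>>(d_in, d_out, width, height);

    cudaMemcpy(out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return out;
}
```

The reordering happens inside shared memory, so both the global read and the global write walk along rows. A direct transpose would have to make one of the two column-wise.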


Use Resources Efficiently

Partition the computation to keep multiprocessors busy

Many threads, many thread blocks

Multiple GPUs

Monitor per-multiprocessor resource utilization

Registers and shared memory

Low utilization per thread block permits multiple active blocks per multiprocessor

Overlap computation with I/O

Use asynchronous memory transfers
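Overlapping computation with transfers is commonly done with streams and page-locked host memory, as in the sketch below. The kernel `add_one`, the helper `add_one_overlapped`, and the chunking scheme are assumptions made for illustration. Error checking is omitted, and whether copies and kernels actually overlap depends on the device.

```cuda
#include <cuda_runtime.h>
#include <algorithm>
#include <vector>

__global__ void add_one(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

// Hypothetical helper: splits the array into chunks and queues each chunk's
// copy-in, kernel, and copy-out on its own stream.
std::vector<float> add_one_overlapped(const std::vector<float>& input, int num_streams)
{
    const int n = static_cast<int>(input.size());
    std::vector<float> result(n);
    if (n == 0) return result;
    num_streams = std::max(num_streams, 1);

    float* pinned = nullptr;   // page-locked host buffer, needed for truly asynchronous copies
    float* dev = nullptr;
    cudaMallocHost(&pinned, n * sizeof(float));
    cudaMalloc(&dev, n * sizeof(float));
    std::copy(input.begin(), input.end(), pinned);

    std::vector<cudaStream_t> streams(num_streams);
    for (cudaStream_t& s : streams) cudaStreamCreate(&s);

    const int chunk = (n + num_streams - 1) / num_streams;
    for (int s = 0; s < num_streams; ++s) {
        const int offset = s * chunk;
        if (offset >= n) break;
        const int count = std::min(chunk, n - offset);
        const size_t bytes = count * sizeof(float);

        // Work within one stream runs in order; different streams may overlap.
        cudaMemcpyAsync(dev + offset, pinned + offset, bytes,
                        cudaMemcpyHostToDevice, streams[s]);
        add_one<<<(count + 255) / 256, 256, 0, streams[s]>>>(dev + offset, count);
        cudaMemcpyAsync(pinned + offset, dev + offset, bytes,
                        cudaMemcpyDeviceToHost, streams[s]);
    }

    cudaDeviceSynchronize();   // wait for every stream to finish
    std::copy(pinned, pinned + n, result.begin());

    for (cudaStream_t& s : streams) cudaStreamDestroy(s);
    cudaFreeHost(pinned);
    cudaFree(dev);
    return result;
}
```

With two or more streams, the copy for one chunk can proceed while the kernel for another chunk is running, on devices that support concurrent copy and execution.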


RESOURCES

Productivity


Getting Started

CUDA Zone

www.nvidia.com/cuda

Introductory tutorials/webinars

Forums

Documentation

Programming Guide

Best Practices Guide

Examples

CUDA SDK


Libraries

NVIDIA

cuBLAS: dense linear algebra (subset of full BLAS suite)

cuFFT: 1D/2D/3D real and complex transforms

Third party

NAG: numeric libraries, e.g. RNGs

cuLAPACK/MAGMA

Open Source

Thrust: STL/Boost-style template library

cuDPP: data-parallel primitives (e.g. scan, sort and reduction)

CUSP: sparse linear algebra and graph computation

Many more...
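As a small taste of the STL-like style of Thrust, the sketch below sorts and sums a vector on the GPU without writing a kernel. The helper name `sort_and_sum` is hypothetical. The Thrust calls are standard, but this is only a minimal illustration of the library.

```cuda
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/sort.h>
#include <vector>

// Hypothetical helper: sorts `values` in place using the GPU and returns their sum.
int sort_and_sum(std::vector<int>& values)
{
    // Constructing a device_vector from host iterators copies the data to the GPU.
    thrust::device_vector<int> d(values.begin(), values.end());

    thrust::sort(d.begin(), d.end());                    // sort on the device
    int total = thrust::reduce(d.begin(), d.end(), 0);   // sum on the device

    thrust::copy(d.begin(), d.end(), values.begin());    // copy the sorted data back
    return total;
}
```

For example, passing a vector holding `{3, 1, 2}` leaves it as `{1, 2, 3}` and returns 6.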
