
Page 1

Intermediate GPGPU Programming in CUDA

Supada Laosooksathit

Page 2

NVIDIA Hardware Architecture

(Diagram: host memory and the GPU device's memory hierarchy)

Page 3

Recall

• 5 steps for CUDA programming (a minimal skeleton follows):
  – Initialize device
  – Allocate device memory
  – Copy data to device memory
  – Execute kernel
  – Copy data back from device memory
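
A minimal skeleton of the five steps in order; the kernel, sizes, and names are placeholders, not from the slides:

    #include <cuda_runtime.h>

    __global__ void kernel(float *d_data) { /* ... */ }

    int main(void)
    {
        const int N = 256;
        float h_data[N];

        cudaSetDevice(0);                                  // 1. initialize device

        float *d_data;
        cudaMalloc((void **)&d_data, N * sizeof(float));   // 2. allocate device memory

        cudaMemcpy(d_data, h_data, N * sizeof(float),
                   cudaMemcpyHostToDevice);                // 3. copy data to device

        kernel<<<N / 64, 64>>>(d_data);                    // 4. execute kernel

        cudaMemcpy(h_data, d_data, N * sizeof(float),
                   cudaMemcpyDeviceToHost);                // 5. copy data back

        cudaFree(d_data);
        return 0;
    }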

Page 4

Initialize Device Calls

• To select the device associated with the host thread:
  – cudaSetDevice(device)
  – Must be called before any __global__ function is launched, otherwise device 0 is selected automatically
• To get the number of devices:
  – cudaGetDeviceCount(&devicecount)
• To retrieve a device's properties:
  – cudaGetDeviceProperties(&deviceProp, device)
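
A short sketch combining the three calls; picking device 0 and the printout are illustrative choices:

    #include <stdio.h>
    #include <cuda_runtime.h>

    int main(void)
    {
        int devicecount = 0;
        cudaGetDeviceCount(&devicecount);          // how many CUDA devices are visible

        cudaDeviceProp deviceProp;
        cudaGetDeviceProperties(&deviceProp, 0);   // properties of device 0
        printf("Device 0: %s, compute capability %d.%d\n",
               deviceProp.name, deviceProp.major, deviceProp.minor);

        cudaSetDevice(0);                          // bind this host thread to device 0
        return 0;
    }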

Page 5

Hello World Example

• Allocate host and device memory
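
A minimal sketch of this step; the buffer name and size are assumed for illustration:

    const int N = 16;
    char *h_data = (char *)malloc(N);     // host buffer
    char *d_data;
    cudaMalloc((void **)&d_data, N);      // matching device buffer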

Page 6

Hello World Example

• Host code
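
Continuing the sketch from the previous slide; the block and thread counts are assumptions:

    cudaMemcpy(d_data, h_data, N, cudaMemcpyHostToDevice);

    hello<<<2, 8>>>(d_data);              // 2 blocks of 8 threads each

    cudaMemcpy(h_data, d_data, N, cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    free(h_data);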

Page 7

Hello World Example

• Kernel code
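
A minimal kernel matching the demo's "print block and thread IDs"; note that in-kernel printf requires compute capability 2.0 or later:

    __global__ void hello(char *data)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
        printf("block %d, thread %d\n", blockIdx.x, threadIdx.x);
        data[idx] = (char)idx;
    }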

Page 8

To Try CUDA Programming

• SSH to 138.47.102.111
• Set environment variables in .bashrc in your home directory:

  export PATH=$PATH:/usr/local/cuda/bin
  export LD_LIBRARY_PATH=/usr/local/cuda/lib:$LD_LIBRARY_PATH
  export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH

• Copy the SDK from /home/students/NVIDIA_GPU_Computing_SDK
• Compile the following directories:
  – NVIDIA_GPU_Computing_SDK/shared/
  – NVIDIA_GPU_Computing_SDK/C/common/
• The sample codes are in NVIDIA_GPU_Computing_SDK/C/src/

Page 9

Demo

• Hello World
  – Print out block and thread IDs
• Vector Add
  – C = A + B (a sketch of the kernel follows)
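
A minimal sketch of the vector-add kernel the demo refers to; the names and the bounds check are illustrative:

    __global__ void vecAdd(const float *A, const float *B, float *C, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)                // guard threads past the end in the last block
            C[i] = A[i] + B[i];
    }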

Page 10

NVIDIA Hardware Architecture

(Diagram: a streaming multiprocessor (SM))

Page 11

Specifications of a Device

• For more details:
  – deviceQuery in the CUDA SDK
  – Appendix F in the Programming Guide 4.0

Specifications        Compute Capability 1.3    Compute Capability 2.0
Warp size             32                        32
Max threads/block     512                       1024
Max blocks/grid       65535                     65535
Shared memory         16 KB/SM                  48 KB/SM

Page 12

Demo

• deviceQuery
  – Shows the hardware specifications in detail

Page 13

Memory Optimizations

• Reduce the time of memory transfers between host and device:
  – Use asynchronous memory transfers (CUDA streams)
  – Use zero copy
• Reduce the number of transactions between on-chip and off-chip memory:
  – Memory coalescing
• Avoid bank conflicts in shared memory

Page 14

Reduce Time of Host-Device Memory Transfer

• Regular (synchronous) memory transfer, sketched below
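
For reference, the blocking pattern looks like this; the names are placeholders:

    // cudaMemcpy blocks the host until each transfer completes,
    // so copies and kernel execution never overlap
    cudaMemcpy(d_data, h_data, size, cudaMemcpyHostToDevice);
    kernel<<<grid, block>>>(d_data);
    cudaMemcpy(h_data, d_data, size, cudaMemcpyDeviceToHost);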

Page 15

Reduce Time of Host-Device Memory Transfer

• CUDA streams
  – Allow overlap between kernel execution and memory copies

Page 16

CUDA Streams Example

Page 17

CUDA Streams Example
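
A hedged sketch of a two-stream pipeline under the slide's description; the kernel, sizes, and names are assumptions:

    #include <cuda_runtime.h>

    __global__ void scale(float *d, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) d[i] *= 2.0f;
    }

    int main(void)
    {
        const int N = 1 << 20, HALF = N / 2;
        float *h_data, *d_data;
        cudaMallocHost((void **)&h_data, N * sizeof(float));  // page-locked, required for async copies
        cudaMalloc((void **)&d_data, N * sizeof(float));

        cudaStream_t stream[2];
        for (int i = 0; i < 2; ++i) cudaStreamCreate(&stream[i]);

        // each stream copies its half in, runs the kernel, and copies back;
        // the copy in one stream can overlap the kernel in the other
        for (int i = 0; i < 2; ++i) {
            int off = i * HALF;
            cudaMemcpyAsync(d_data + off, h_data + off, HALF * sizeof(float),
                            cudaMemcpyHostToDevice, stream[i]);
            scale<<<HALF / 256, 256, 0, stream[i]>>>(d_data + off, HALF);
            cudaMemcpyAsync(h_data + off, d_data + off, HALF * sizeof(float),
                            cudaMemcpyDeviceToHost, stream[i]);
        }
        cudaDeviceSynchronize();                 // wait for both streams

        for (int i = 0; i < 2; ++i) cudaStreamDestroy(stream[i]);
        cudaFree(d_data);
        cudaFreeHost(h_data);
        return 0;
    }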

Page 18

GPU Timers

• CUDA events
  – An API based on the GPU clock
  – Accurate for timing kernel executions
• CUDA timer calls
  – Timer libraries implemented in the CUDA SDK

Page 19

CUDA Events Example
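
A sketch of the usual events timing pattern; the kernel and sizes are placeholders:

    #include <stdio.h>
    #include <cuda_runtime.h>

    __global__ void busy(float *d, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) d[i] = d[i] * d[i] + 1.0f;
    }

    int main(void)
    {
        const int N = 1 << 20;
        float *d_data;
        cudaMalloc((void **)&d_data, N * sizeof(float));

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start, 0);               // timestamp before the kernel
        busy<<<N / 256, 256>>>(d_data, N);
        cudaEventRecord(stop, 0);                // timestamp after the kernel
        cudaEventSynchronize(stop);              // wait until stop is recorded

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);  // elapsed GPU time in milliseconds
        printf("kernel time: %.3f ms\n", ms);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        cudaFree(d_data);
        return 0;
    }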

Page 20

Demo

• simpleStreams

Page 21

Reduce Time of Host-Device Memory Transfer

• Zero copy
  – Allows device pointers to access page-locked host memory directly
  – Page-locked host memory is allocated with cudaHostAlloc()
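
A minimal zero-copy sketch under that description; the kernel and sizes are placeholders:

    #include <cuda_runtime.h>

    __global__ void scale(float *d, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) d[i] *= 2.0f;
    }

    int main(void)
    {
        const int N = 1024;
        cudaSetDeviceFlags(cudaDeviceMapHost);   // enable mapping; call before first CUDA use

        float *h_data, *d_ptr;
        cudaHostAlloc((void **)&h_data, N * sizeof(float), cudaHostAllocMapped);
        cudaHostGetDevicePointer((void **)&d_ptr, h_data, 0);  // device alias of h_data

        scale<<<N / 256, 256>>>(d_ptr, N);       // kernel accesses host memory directly
        cudaDeviceSynchronize();                 // no cudaMemcpy anywhere

        cudaFreeHost(h_data);
        return 0;
    }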

Page 22

Demo

• Zero copy

Page 23

Reduce Number of On-chip and Off-chip Memory Transactions

• Threads in a warp access global memory together
• Memory coalescing
  – The warp's accesses are combined so a whole batch of words is copied in one transaction

Page 24

Memory Coalescing

• Threads in a warp access global memory in the straightforward way: consecutive 4-byte words, one word per thread

Page 25

Memory Coalescing

• Memory addresses are aligned in the same segment but the accesses are not sequential

Page 26

Memory Coalescing

• Memory addresses are not aligned in the same segment
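
A sketch contrasting the coalesced pattern with a strided one; both kernels are illustrative, not from the slides:

    // Coalesced: thread k reads word k of an aligned segment, so the warp's
    // 32 accesses are serviced by a minimal number of transactions.
    __global__ void copyCoalesced(const float *in, float *out)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        out[i] = in[i];
    }

    // Strided: consecutive threads touch addresses 32 words apart, spreading
    // the warp across many segments and multiplying the transactions.
    __global__ void copyStrided(const float *in, float *out)
    {
        int i = (blockIdx.x * blockDim.x + threadIdx.x) * 32;
        out[i] = in[i];
    }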

Page 27

Shared Memory

• 16 banks for compute capability 1.x, 32 banks for compute capability 2.x
• Helps in achieving memory coalescing: stage data in shared memory, then access global memory in coalesced patterns
• Bank conflicts may occur:
  – Two or more threads access different words in the same bank
  – Compute capability 1.x: a word is broadcast only when all threads read the same word
  – Compute capability 2.x: the same word can be broadcast to any number of threads that request it

Page 28

Bank Conflicts

(Diagram: left, threads 0–3 each access a distinct bank 0–3 — no bank conflict; right, pairs of threads access the same bank — a 2-way bank conflict.)

Page 29

Matrix Multiplication Example

Page 30

Matrix Multiplication Example

• Shared-memory tiling reduces accesses to global memory:
  – A is read (B.width/BLOCK_SIZE) times from global memory
  – B is read (A.height/BLOCK_SIZE) times from global memory
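
A sketch of the tiled kernel, after the Programming Guide's shared-memory example; BLOCK_SIZE, square matrices, and width being a multiple of the tile are illustrative assumptions:

    #define BLOCK_SIZE 16

    __global__ void matMul(const float *A, const float *B, float *C, int width)
    {
        __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];   // tile of A in shared memory
        __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];   // tile of B in shared memory

        int row = blockIdx.y * BLOCK_SIZE + threadIdx.y;
        int col = blockIdx.x * BLOCK_SIZE + threadIdx.x;
        float sum = 0.0f;

        for (int t = 0; t < width / BLOCK_SIZE; ++t) {
            // each thread stages one element of each tile (coalesced loads)
            As[threadIdx.y][threadIdx.x] = A[row * width + t * BLOCK_SIZE + threadIdx.x];
            Bs[threadIdx.y][threadIdx.x] = B[(t * BLOCK_SIZE + threadIdx.y) * width + col];
            __syncthreads();                           // tiles fully loaded

            for (int k = 0; k < BLOCK_SIZE; ++k)
                sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
            __syncthreads();                           // done with these tiles
        }
        C[row * width + col] = sum;
    }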

Page 31

Demo

• Matrix Multiplication
  – With and without shared memory
  – Different block sizes

Page 32

Control Flow

• if, switch, do, for, while
• Branch divergence in a warp
  – Threads in a warp take different execution paths
  – The different paths are serialized, increasing the number of instructions executed by that warp (see the sketch below)
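
The first kernel below diverges within each warp while the second branches uniformly per warp; f and g are hypothetical per-element functions:

    __device__ float f(float x) { return x * 2.0f; }
    __device__ float g(float x) { return x + 1.0f; }

    // Divergent: even and odd threads of the same warp take different paths,
    // so the warp executes both branches one after the other.
    __global__ void divergent(float *data)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (threadIdx.x % 2 == 0)
            data[i] = f(data[i]);
        else
            data[i] = g(data[i]);
    }

    // Branching on the warp index keeps all 32 threads of a warp on one path,
    // so no divergence occurs.
    __global__ void uniform(float *data)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if ((threadIdx.x / 32) % 2 == 0)
            data[i] = f(data[i]);
        else
            data[i] = g(data[i]);
    }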

Page 33

Branch Divergence

Page 34

Summary

• 5 steps for CUDA programming
• NVIDIA hardware architecture
  – Memory hierarchy: global memory, shared memory, register file
  – Specifications of a device: block, warp, thread, SM

Page 35

Summary

• Memory optimization
  – Reduce the overhead of host-device memory transfers with CUDA streams and zero copy
  – Reduce the number of transactions between on-chip and off-chip memory by utilizing memory coalescing (shared memory)
  – Try to avoid bank conflicts in shared memory
• Control flow
  – Try to avoid branch divergence within a warp
