GPU Programming with CUDA – CUDA 5 and 6 Paul Richmond GPUComputing@Sheffield http://gpucomputing.sites.sheffield.ac.uk/


Page 1

GPU Programming with CUDA – CUDA 5 and 6

Paul Richmond

GPUComputing@Sheffield
http://gpucomputing.sites.sheffield.ac.uk/

Page 2

• Dynamic Parallelism (CUDA 5+)
• GPU Object Linking (CUDA 5+)
• Unified Memory (CUDA 6+)
• Other Developer Tools

Overview

Page 3

• Before CUDA 5, kernels could only be launched from the host
• Limited ability to perform recursive functions

• Dynamic Parallelism allows kernels to be launched from the device
• Improved load balancing
• Deep recursion

Dynamic Parallelism

[Diagram: the CPU launches Kernel A on the GPU; Kernels B, C and D are then launched directly from the device]

Page 4

//Host Code
...
A<<<...>>>(data);
B<<<...>>>(data);
C<<<...>>>(data);

//Kernel Code
__global__ void vectorAdd(float *data)
{
    do_stuff(data);
    X<<<...>>>(data);
    X<<<...>>>(data);
    X<<<...>>>(data);
    do_more_stuff(data);
}

An Example
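The deep-recursion point above can be sketched as a kernel that relaunches itself from the device. This is a minimal illustration, not from the original slides: the kernel name, depth limit and launch configuration are all assumptions, and it requires compute capability 3.5+ with relocatable device code enabled.

```cuda
#include <cstdio>

// Hypothetical recursive kernel: each launch handles one level of a
// problem and, if more work remains, launches itself from the device.
__global__ void recurse(int depth, int maxDepth)
{
    // Only one thread per block relaunches, to avoid a launch explosion.
    if (threadIdx.x == 0 && depth < maxDepth)
        recurse<<<1, 32>>>(depth + 1, maxDepth);
}

int main()
{
    recurse<<<1, 32>>>(0, 3);   // host launches only the first level
    cudaDeviceSynchronize();    // wait for the entire launch tree
    return 0;
}
```

A sketch like this would be compiled with something like `nvcc -arch=sm_35 -rdc=true recurse.cu -lcudadevrt`.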

Page 5

• CUDA 4 required all device code for a kernel to be in a single source file
• No linking of compiled device code

• CUDA 5.0+ allows separately compiled object files to be linked
• Kernels and host code can be built independently

GPU Object Linking

[Diagram: a.cu, b.cu and c.cu are compiled separately into a.o, b.o and c.o, then linked with Main.cpp to produce Program.exe]
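The separate-compilation workflow in the diagram can be sketched with nvcc's relocatable device code support. The file names follow the diagram; the architecture flag and exact invocations are assumptions for a typical CUDA 5+ toolchain, not taken from the slides.

```shell
# Compile each .cu file to an object containing relocatable device code
nvcc -arch=sm_35 -dc a.cu -o a.o
nvcc -arch=sm_35 -dc b.cu -o b.o
nvcc -arch=sm_35 -dc c.cu -o c.o

# Device-link the objects and link them with the host code
nvcc -arch=sm_35 a.o b.o c.o Main.cpp -o Program.exe
```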

Page 6

• Objects can also be built into static libraries
• Shared by different sources
• Much better code reuse
• Reduces compilation time
• Closed source device libraries

GPU Object Linking

[Diagram: a.cu and b.cu are compiled to a.o and b.o and combined into a static library, ab.culib; Main.cpp links against ab.culib to build Program.exe, while Main2.cpp (together with foo.cu and bar.cu) reuses the same ab.culib to build Program2.exe]

Page 7

• The developer's view is that the GPU and CPU have separate memories
• Memory must be explicitly copied between them
• Deep copies are required for complex data structures

• Unified Memory changes that view
• A single pointer to data, accessible anywhere
• Simpler code porting

Unified Memory

[Diagram: without Unified Memory, the CPU uses System Memory and the GPU uses GPU Memory; with Unified Memory, both CPU and GPU access a single Unified Memory]

Page 8

Unified Memory Example

//CPU code
void sortfile(FILE *fp, int N)
{
    char *data;
    data = (char *)malloc(N);
    fread(data, 1, N, fp);
    qsort(data, N, 1, compare);
    use_data(data);
    free(data);
}

//CUDA 6 code with Unified Memory
void sortfile(FILE *fp, int N)
{
    char *data;
    cudaMallocManaged(&data, N);
    fread(data, 1, N, fp);
    qsort<<<...>>>(data, N, 1, compare);  // device-side sort on the same pointer
    cudaDeviceSynchronize();
    use_data(data);
    cudaFree(data);                       // managed memory is freed with cudaFree
}
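The deep-copy point from the previous slide can be illustrated with a struct containing an embedded pointer: when both the struct and its buffer are allocated with cudaMallocManaged, the GPU can follow the pointer directly, with no manual deep copy. This is a sketch assuming CUDA 6+ and a managed-memory-capable GPU; the struct, kernel and variable names are invented for illustration.

```cuda
#include <cstdio>

struct Record {      // hypothetical structure with an embedded pointer
    int   n;
    float *values;
};

__global__ void scale(Record *r, float f)
{
    int i = threadIdx.x;
    if (i < r->n)               // follow the embedded pointer directly:
        r->values[i] *= f;      // no separate device copy of the buffer
}

int main()
{
    Record *r;
    cudaMallocManaged(&r, sizeof(Record));               // the struct itself
    r->n = 32;
    cudaMallocManaged(&r->values, r->n * sizeof(float)); // the nested buffer
    for (int i = 0; i < r->n; ++i) r->values[i] = 1.0f;

    scale<<<1, 32>>>(r, 2.0f);  // the same pointers are valid on the GPU
    cudaDeviceSynchronize();
    printf("%f\n", r->values[0]);

    cudaFree(r->values);
    cudaFree(r);
    return 0;
}
```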

Page 9

• XT and Drop-in libraries
• cuFFT and cuBLAS optimised for multi-GPU (on the same node)

• GPUDirect
• Direct transfer between GPUs (cutting out the host)
• Supports direct transfer via InfiniBand (over a network)

• Developer Tools
• Remote development using Nsight Eclipse
• Enhanced Visual Profiler

Other Developer Tools