GPU Programming with CUDA – CUDA 5 and 6
Paul Richmond
GPUComputing@Sheffield
http://gpucomputing.sites.sheffield.ac.uk/
Overview

• Dynamic Parallelism (CUDA 5+)
• GPU Object Linking (CUDA 5+)
• Unified Memory (CUDA 6+)
• Other Developer Tools
Dynamic Parallelism

• Before CUDA 5, kernels could only be launched from the host
• Limited ability to write recursive algorithms

• Dynamic Parallelism allows kernels to be launched from the device
• Improved load balancing
• Deep recursion on the GPU
[Diagram: the CPU launches Kernel A on the GPU; with Dynamic Parallelism, kernels B, C and D are then launched directly from the device without returning to the CPU]
An Example

//Host Code
...
A<<<...>>>(data);
B<<<...>>>(data);
C<<<...>>>(data);

//Kernel Code
__global__ void vectorAdd(float *data)
{
    do_stuff(data);
    X<<<...>>>(data);
    X<<<...>>>(data);
    X<<<...>>>(data);
    do_more_stuff(data);
}
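Dynamic Parallelism also changes how code is built: it needs relocatable device code and the device runtime library at link time. A minimal build sketch, assuming the example above is saved as dynpar.cu (a hypothetical filename):

```shell
# Dynamic Parallelism requires compute capability 3.5+ and relocatable device code
nvcc -arch=sm_35 -rdc=true dynpar.cu -lcudadevrt -o dynpar
```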
GPU Object Linking

• CUDA 4 required all device code for a kernel to be in a single source file
• No linking of compiled device code

• CUDA 5.0+ allows separately compiled object files to be linked
• Kernels and host code can be built independently
[Diagram: main.cpp together with a.cu, b.cu and c.cu; each .cu file compiles to its own object file (a.o, b.o, c.o), and the objects are linked with the host code into program.exe]
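The build flow in the diagram can be sketched with nvcc's separate-compilation support (file names taken from the slide; -dc compiles to an object with relocatable device code):

```shell
# Compile each source to an object file containing relocatable device code
nvcc -arch=sm_35 -dc a.cu b.cu c.cu
# Link the device objects and the host code into one executable
nvcc -arch=sm_35 a.o b.o c.o main.cpp -o program
```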
GPU Object Linking

• Objects can also be built into static libraries
• Shared by different sources
• Much better code reuse
• Reduces compilation time
• Closed-source device libraries
[Diagram: a.cu and b.cu compile to a.o and b.o, which are combined into the static library ab.culib; ab.culib is then linked with main.cpp into program.exe and with main2.cpp into program2.exe; further sources (foo.cu, bar.cu, ...) can be added in the same way]
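A sketch of the library flow above, assuming the slide's file names; nvcc's -lib option packs the device objects into a static library that both host programs link against:

```shell
# Build the device objects and archive them into a static library
nvcc -arch=sm_35 -dc a.cu b.cu
nvcc -arch=sm_35 -lib a.o b.o -o ab.a
# Link the same library into two different programs
nvcc -arch=sm_35 main.cpp ab.a -o program
nvcc -arch=sm_35 main2.cpp ab.a -o program2
```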
Unified Memory

• The developer's view is that the GPU and CPU have separate memories
• Memory must be explicitly copied
• Deep copies are required for complex data structures

• Unified Memory changes that view
• A single pointer to data, accessible anywhere
• Simpler code porting
[Diagram: without Unified Memory, the CPU uses system memory and the GPU uses its own memory; with Unified Memory, the CPU and GPU share a single unified memory space]
Unified Memory Example

// CPU-only version
void sortfile(FILE *fp, int N) {
    char *data;
    data = (char *)malloc(N);

    fread(data, 1, N, fp);
    qsort(data, N, 1, compare);

    use_data(data);
    free(data);
}

// Unified Memory version
void sortfile(FILE *fp, int N) {
    char *data;
    cudaMallocManaged(&data, N);

    fread(data, 1, N, fp);
    qsort(data, N, 1, compare);
    cudaDeviceSynchronize();

    use_data(data);
    cudaFree(data);
}
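A minimal self-contained sketch of the single-pointer model (the kernel name and sizes are illustrative, not from the slides): the same pointer returned by cudaMallocManaged is written by the CPU, modified by the GPU, then read again by the CPU after a synchronise, with no explicit cudaMemcpy anywhere.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: double each element in place
__global__ void doubleElements(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2;
}

int main()
{
    const int n = 256;
    int *data;
    cudaMallocManaged(&data, n * sizeof(int)); // one pointer, valid on CPU and GPU

    for (int i = 0; i < n; ++i) data[i] = i;   // initialise directly on the CPU

    doubleElements<<<1, n>>>(data, n);         // pass the same pointer to the GPU
    cudaDeviceSynchronize();                   // wait before the CPU touches it again

    printf("data[10] = %d\n", data[10]);
    cudaFree(data);
    return 0;
}
```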
Other Developer Tools

• XT and Drop-in libraries
• cuFFT and cuBLAS optimised for multiple GPUs (on the same node)

• GPUDirect
• Direct transfer between GPUs (cutting out the host)
• Support for direct transfer via InfiniBand (over a network)

• Developer Tools
• Remote development using Nsight Eclipse
• Enhanced Visual Profiler