Heterogeneous Programming
Martin Kruliš
4. 11. 2014
GPU
◦ “Independent” device
◦ Controlled by the host
◦ Used for “offloading” work

Host Code
◦ Needs to be designed so that it
   utilizes the GPU(s) efficiently
   utilizes the CPU while the GPU is working
   never has the CPU and GPU waiting for each other
Bad Example
cudaMemcpy(..., HostToDevice);
Kernel1<<<...>>>(...);
cudaDeviceSynchronize();
cudaMemcpy(..., DeviceToHost);
...
cudaMemcpy(..., HostToDevice);
Kernel2<<<...>>>(...);
cudaDeviceSynchronize();
cudaMemcpy(..., DeviceToHost);
...
(Timeline: while the device is working, the CPU merely waits at each synchronization point, so the CPU and GPU are never busy at the same time.)
Overlapping Work
Overlapping CPU and GPU work
◦ Kernels
   Started asynchronously
   Can be waited for (cudaDeviceSynchronize())
   A little more can be done with streams
◦ Memory transfers
   cudaMemcpy() is synchronous and blocking
   Alternatively, cudaMemcpyAsync() starts the transfer and returns immediately
   Can be synchronized the same way as a kernel
Using Asynchronous Transfers
cudaMemcpyAsync(..., HostToDevice);
Kernel1<<<...>>>(...);
cudaMemcpyAsync(..., DeviceToHost);
...
do_something_on_cpu();
...
cudaDeviceSynchronize();
(Timeline: the CPU works concurrently with the device; the workload balance between the two becomes an issue.)
CPU Threads
◦ Multiple CPU threads may use the GPU

GPU Overlapping Capabilities
◦ Multiple kernels may run simultaneously
   Since the Fermi architecture
   cudaDeviceProp.concurrentKernels
◦ Kernel execution may overlap with data transfers
   Or even with multiple data transfers
   cudaDeviceProp.asyncEngineCount
◦ Both capabilities can be queried at runtime, as sketched below
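A minimal sketch of querying these capability flags; device 0 is an illustrative choice:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of device 0

    printf("concurrentKernels: %d\n", prop.concurrentKernels);
    printf("asyncEngineCount:  %d\n", prop.asyncEngineCount);
    return 0;
}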
Streams
Stream
◦ In-order GPU command queue (like in OpenCL)
   Asynchronous GPU operations are registered in the queue
      Kernel executions
      Memory data transfers
   Commands in different streams may overlap
   Streams provide means for explicit and implicit synchronization
◦ Default stream (stream 0)
   Always present, does not have to be created
   Has global synchronization capabilities
Stream Creation
   cudaStream_t stream;
   cudaStreamCreate(&stream);

Stream Usage
   cudaMemcpyAsync(dst, src, size, kind, stream);
   kernel<<<grid, block, sharedMem, stream>>>(...);

Stream Destruction
   cudaStreamDestroy(stream);
Synchronization
◦ Explicit
   cudaStreamSynchronize(stream) – waits until all commands issued to the stream have completed
   cudaStreamQuery(stream) – a non-blocking test of whether the stream has finished (see the sketch after this list)
◦ Implicit
   Operations in different streams cannot overlap if a special operation is issued between them, such as
      a memory allocation
      a CUDA command issued to the default stream
      a switch of the L1/shared memory configuration
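A minimal polling sketch, assuming a stream that already has work issued to it; do_something_on_cpu() is a hypothetical placeholder:

cudaError_t status = cudaStreamQuery(stream);
while (status == cudaErrorNotReady) {
    do_something_on_cpu();               // the stream is still working
    status = cudaStreamQuery(stream);    // poll again without blocking
}
// status is now cudaSuccess (or an actual error that should be handled)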
Overlapping Behavior
◦ Commands in different streams overlap if the hardware is capable of running them concurrently
◦ Unless implicit/explicit synchronization prohibits it

for (int i = 0; i < 2; ++i) {
    cudaMemcpyAsync(...HostToDevice, stream[i]);
    MyKernel<<<g, b, 0, stream[i]>>>(...);
    cudaMemcpyAsync(...DeviceToHost, stream[i]);
}
This depth-first issue order may cause many implicit synchronizations, depending on the compute capability and the hardware overlapping capabilities.
◦ Issuing the same commands breadth-first (one type of operation at a time, across all streams) avoids most of these hazards:

for (int i = 0; i < 2; ++i)
    cudaMemcpyAsync(...HostToDevice, stream[i]);
for (int i = 0; i < 2; ++i)
    MyKernel<<<g, b, 0, stream[i]>>>(...);
for (int i = 0; i < 2; ++i)
    cudaMemcpyAsync(...DeviceToHost, stream[i]);
This ordering leaves far fewer opportunities for implicit synchronization.
Callbacks
◦ Callbacks are registered in streams by
   cudaStreamAddCallback(stream, fnc, data, 0);
◦ The callback function is invoked asynchronously after all preceding commands terminate
◦ A callback registered to the default stream is invoked after the previous commands in all streams terminate
◦ Operations issued after the registration start only after the callback returns
◦ The callback looks like
   void CUDART_CB MyCallback(cudaStream_t stream, cudaError_t status, void *userData) { ... }
A complete sketch follows this list.
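A minimal, self-contained sketch of the whole mechanism; the tag value is illustrative:

#include <cstdio>
#include <cuda_runtime.h>

// Invoked on a CUDA runtime thread once all preceding commands in the
// stream have finished. No CUDA API calls may be made inside a callback.
void CUDART_CB MyCallback(cudaStream_t stream, cudaError_t status, void *userData) {
    printf("stream finished, status = %d, tag = %d\n",
           (int)status, *(int *)userData);
}

int main() {
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    int tag = 42;   // illustrative user data passed to the callback
    cudaStreamAddCallback(stream, MyCallback, &tag, 0);

    cudaStreamSynchronize(stream);   // wait so the callback has fired
    cudaStreamDestroy(stream);
    return 0;
}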
Events
◦ Special markers that can be used for synchronization and performance monitoring
◦ Typical usage
   Waiting until all commands issued before the marker have finished
   Explicit synchronization between selected streams
   Measuring the time between two events (see the timing sketch below)
◦ Example
   cudaEvent_t event;
   cudaEventCreate(&event);
   cudaEventRecord(event, stream);
   cudaEventSynchronize(event);
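A minimal timing sketch; some_kernel and its launch configuration are assumed to exist elsewhere:

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, stream);
some_kernel<<<grid, block, 0, stream>>>(...);   // hypothetical kernel
cudaEventRecord(stop, stream);

cudaEventSynchronize(stop);              // wait until the second marker is reached
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);  // elapsed time in milliseconds

cudaEventDestroy(start);
cudaEventDestroy(stop);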
Pipelining
Making Good Use of Overlapping
◦ Split the work into smaller fragments
◦ Create a pipeline effect (load, process, store), as sketched below
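A minimal two-stream pipeline sketch; N, blocks, threads, the buffers (dIn/dOut on the device, hIn/hOut pinned on the host), and the process_chunk kernel are illustrative assumptions:

const int CHUNKS = 8;
const size_t CHUNK = N / CHUNKS;   // assumes N divides evenly

cudaStream_t stream[2];
for (int i = 0; i < 2; ++i)
    cudaStreamCreate(&stream[i]);

for (int i = 0; i < CHUNKS; ++i) {
    cudaStream_t s = stream[i % 2];
    size_t off = i * CHUNK;
    // load, process, store: the copies of one chunk overlap with the
    // kernel of another (hIn/hOut must be pinned for async copies)
    cudaMemcpyAsync(dIn + off, hIn + off, CHUNK * sizeof(float),
                    cudaMemcpyHostToDevice, s);
    process_chunk<<<blocks, threads, 0, s>>>(dIn + off, dOut + off, CHUNK);
    cudaMemcpyAsync(hOut + off, dOut + off, CHUNK * sizeof(float),
                    cudaMemcpyDeviceToHost, s);
}
cudaDeviceSynchronize();   // wait for the whole pipeline to drain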
Feeding Threads
Data Gather and Scatter Problem
(Diagram: input data are gathered from host memory into GPU memory, processed by the kernel, and the results are scattered back to host memory. Multiple cudaMemcpy() calls may be quite inefficient.)
Gather and Scatter
◦ Reducing the overhead of many small transfers
◦ Performed by the CPU before/after cudaMemcpy() (see the sketch below)
(Diagram: the main thread staggers the gather, HtD copy, kernel execution, DtH copy, and scatter stages across Stream 0 and Stream 1, forming a pipeline. The number of threads per GPU and the number of streams per thread depend on the workload structure.)
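A minimal sketch of the CPU-side gather; the indices array, the pinned staging buffer, and dIn are illustrative assumptions:

// gather scattered input items into one contiguous (pinned) staging buffer
for (size_t i = 0; i < n; ++i)
    staging[i] = input[indices[i]];

// one large transfer instead of many small cudaMemcpy() calls
cudaMemcpyAsync(dIn, staging, n * sizeof(float),
                cudaMemcpyHostToDevice, stream);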
Page-locked Memory
Page-locked (Pinned) Host Memory
◦ Host memory that is prevented from being swapped out
◦ Created/dismissed by
   cudaHostAlloc(), cudaFreeHost()
   cudaHostRegister(), cudaHostUnregister()
◦ Optionally with flags
   cudaHostAllocWriteCombined (optimized for writing, not cached on the CPU)
   cudaHostAllocMapped
   cudaHostAllocPortable
◦ Copies between pinned host memory and the device are automatically performed asynchronously
◦ Pinned memory is a scarce resource
A short allocation sketch follows.
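A minimal allocation sketch, assuming a buffer of N floats and an existing stream and device buffer dBuf:

float *hBuf = nullptr;
// pinned allocation; ordinary malloc() memory would be staged synchronously
cudaHostAlloc((void **)&hBuf, N * sizeof(float), cudaHostAllocDefault);

cudaMemcpyAsync(dBuf, hBuf, N * sizeof(float),
                cudaMemcpyHostToDevice, stream);   // truly asynchronous

cudaStreamSynchronize(stream);   // make sure the transfer is done
cudaFreeHost(hBuf);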
Memory Mapping
Device Memory Mapping
◦ Allows the GPU to access portions of host memory directly (i.e., without explicit copy operations)
   For both reading and writing
◦ The memory must be allocated/registered with the cudaHostAllocMapped flag
◦ The context must have the cudaDeviceMapHost flag set (by cudaSetDeviceFlags())
◦ cudaHostGetDevicePointer() takes a host pointer and returns the corresponding device pointer, as sketched below
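A minimal sketch, assuming the flag is set before any other CUDA call and that a suitable kernel exists:

cudaSetDeviceFlags(cudaDeviceMapHost);   // must precede context creation

float *hPtr = nullptr, *dPtr = nullptr;
cudaHostAlloc((void **)&hPtr, N * sizeof(float), cudaHostAllocMapped);
cudaHostGetDevicePointer((void **)&dPtr, hPtr, 0);   // flags must be 0

kernel<<<grid, block>>>(dPtr, N);   // reads/writes host memory directly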
Asynchronous Error Handling
Asynchronous Errors
◦ An error may occur outside of any CUDA call
   In the case of asynchronous memory transfers or kernel execution
◦ Such an error is reported by a subsequent CUDA call
◦ To make sure all errors have been reported, the device must be synchronized (cudaDeviceSynchronize())
◦ Error handling functions (see the checking sketch below)
   cudaGetLastError()
   cudaPeekAtLastError()
   cudaGetErrorString(error)
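A minimal checking sketch after an asynchronous kernel launch; the kernel and its launch configuration are assumed:

kernel<<<grid, block>>>(...);            // asynchronous launch
cudaError_t err = cudaGetLastError();    // errors detected at launch time
if (err != cudaSuccess)
    fprintf(stderr, "launch failed: %s\n", cudaGetErrorString(err));

err = cudaDeviceSynchronize();           // errors raised during execution
if (err != cudaSuccess)
    fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));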
Discussion