
Heterogeneous Programming

Martin Kruliš

4. 11. 2014 (v1.0)

Page 2: Heterogeneous Programming

GPU
◦ “Independent” device
◦ Controlled by host
◦ Used for “offloading”

Host Code
◦ Needs to be designed in a way that
   Utilizes GPU(s) efficiently
   Utilizes CPU while GPU is working
   CPU and GPU do not wait for each other


Page 3: Heterogeneous Programming

Bad Example

cudaMemcpy(..., HostToDevice);
Kernel1<<<...>>>(...);
cudaDeviceSynchronize();
cudaMemcpy(..., DeviceToHost);
...
cudaMemcpy(..., HostToDevice);
Kernel2<<<...>>>(...);
cudaDeviceSynchronize();
cudaMemcpy(..., DeviceToHost);
...

[Timeline diagram: the CPU blocks at every synchronization point while the device is working, so CPU and GPU never work concurrently]

Page 4: Overlapping Work

Overlapping CPU and GPU work
◦ Kernels
   Started asynchronously
   Can be waited for (cudaDeviceSynchronize())
   A little more can be done with streams
◦ Memory transfers
   cudaMemcpy() is synchronous and blocking
   Alternatively, cudaMemcpyAsync() starts the transfer and returns immediately
   Can be synchronized the same way as the kernel


Page 5: Overlapping Work

Using Asynchronous Transfers

cudaMemcpyAsync(..., HostToDevice);
Kernel1<<<...>>>(...);
cudaMemcpyAsync(..., DeviceToHost);
...
do_something_on_cpu();
...
cudaDeviceSynchronize();

[Timeline diagram: the CPU keeps working while the GPU performs the transfers and the kernel execution]

Note: workload balance becomes an issue.

Page 6: Overlapping Work

CPU Threads
◦ Multiple CPU threads may use the GPU

GPU Overlapping Capabilities
◦ Multiple kernels may run simultaneously
   Since the Fermi architecture
   cudaDeviceProp.concurrentKernels
◦ Kernel execution may overlap with data transfers
   Or even with multiple data transfers
   cudaDeviceProp.asyncEngineCount
◦ Both properties can be queried at runtime, as sketched below
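
A minimal sketch of querying the two capability fields (assuming device 0):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // properties of device 0
    printf("concurrent kernels: %s\n", prop.concurrentKernels ? "yes" : "no");
    printf("async copy engines: %d\n", prop.asyncEngineCount);
    return 0;
}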


Page 7: Streams

Stream
◦ In-order GPU command queue (like in OpenCL)
   Asynchronous GPU operations are registered in the queue
     Kernel execution
     Memory data transfers
   Commands in different streams may overlap
   Provides means for explicit and implicit synchronization
◦ Default stream (stream 0)
   Always present, does not have to be created
   Has global synchronization capabilities


Page 8: Streams

Stream Creation
cudaStream_t stream;
cudaStreamCreate(&stream);

Stream Usage
cudaMemcpyAsync(dst, src, size, kind, stream);
kernel<<<grid, block, sharedMem, stream>>>(...);

Stream Destruction
cudaStreamDestroy(stream);

A complete lifecycle is sketched below.
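
Putting the three steps together, a minimal sketch of one stream's lifecycle (the buffer names, size, and kernel are hypothetical; allocations are omitted):

cudaStream_t stream;
cudaStreamCreate(&stream);

// enqueue copy-in, kernel, and copy-out in order; all calls return immediately
cudaMemcpyAsync(dData, hData, size, cudaMemcpyHostToDevice, stream);
kernel<<<grid, block, 0, stream>>>(dData);
cudaMemcpyAsync(hData, dData, size, cudaMemcpyDeviceToHost, stream);

cudaStreamSynchronize(stream);  // wait until all three commands complete
cudaStreamDestroy(stream);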


Page 9: Streams

Synchronization
◦ Explicit (see the sketch below)
   cudaStreamSynchronize(stream) – waits until all commands issued to the stream have completed
   cudaStreamQuery(stream) – a non-blocking test whether the stream has finished
◦ Implicit
   Operations in different streams cannot overlap if a special operation is issued between them:
     Memory allocation
     A CUDA command issued to the default stream
     A switch of the L1/shared memory configuration
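
A minimal sketch of explicit synchronization (do_something_on_cpu() is a hypothetical placeholder): poll the stream while keeping the CPU busy, then block until everything is done:

while (cudaStreamQuery(stream) == cudaErrorNotReady)
    do_something_on_cpu();       // overlap CPU work with the GPU

cudaStreamSynchronize(stream);   // block until every command has completed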


Page 10: Streams

Overlapping Behavior
◦ Commands in different streams overlap if the hardware is capable of running them concurrently
◦ Unless implicit/explicit synchronization prohibits it

for (int i = 0; i < 2; ++i) {
    cudaMemcpyAsync(…HostToDevice, stream[i]);
    MyKernel<<<g, b, 0, stream[i]>>>(...);
    cudaMemcpyAsync(…DeviceToHost, stream[i]);
}


Note: this depth-first issue order may incur many implicit synchronizations, depending on the compute capability and the hardware's overlapping capabilities.

Page 11: Streams

The same work issued breadth-first – all host-to-device copies first, then all kernels, then all device-to-host copies:

for (int i = 0; i < 2; ++i)
    cudaMemcpyAsync(…HostToDevice, stream[i]);
for (int i = 0; i < 2; ++i)
    MyKernel<<<g, b, 0, stream[i]>>>(...);
for (int i = 0; i < 2; ++i)
    cudaMemcpyAsync(…DeviceToHost, stream[i]);


Note: this order leaves far fewer opportunities for implicit synchronization.

Page 12: Streams

Callbacks
◦ Callbacks are registered in streams by
  cudaStreamAddCallback(stream, fnc, data, 0);

◦ The callback function is invoked asynchronously after all preceding commands terminate

◦ Callback registered to the default stream is invoked after previous commands in all streams terminate

◦ Operations issued after registration start after the callback returns

◦ The callback looks like (see the sketch below)
  void CUDART_CB MyCallback(cudaStream_t stream, cudaError_t status, void *userData) { ... }
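
A minimal sketch of registering a callback (the payload and its use are hypothetical; note that the callback runs on a host thread and must not make CUDA API calls itself):

// invoked once all preceding commands in 'stream' have terminated
void CUDART_CB MyCallback(cudaStream_t stream, cudaError_t status,
                          void *userData) {
    int chunk = *(int *)userData;
    printf("chunk %d finished, %s\n", chunk,
           (status == cudaSuccess) ? "ok" : "failed");
}

...
static int chunkId = 0;                                  // hypothetical payload
cudaStreamAddCallback(stream, MyCallback, &chunkId, 0);  // flags must be 0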


Page 13: Streams

Events
◦ Special markers that can be used for synchronization and performance monitoring
◦ Typical usage
   Waiting until all commands issued before the marker have finished
   Explicit synchronization between selected streams
   Measuring the time between two events (see the sketch below)

◦ Example
  cudaEvent_t event;
  cudaEventCreate(&event);
  cudaEventRecord(event, stream);
  cudaEventSynchronize(event);
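
A minimal sketch of timing a kernel with an event pair (the kernel and its launch configuration are hypothetical):

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, stream);
MyKernel<<<g, b, 0, stream>>>(...);
cudaEventRecord(stop, stream);

cudaEventSynchronize(stop);             // wait until 'stop' is reached
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop); // elapsed time in milliseconds

cudaEventDestroy(start);
cudaEventDestroy(stop);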


Page 14: Pipelining

Making Good Use of Overlapping
◦ Split the work into smaller fragments
◦ Create a pipeline effect (load, process, store), as sketched below
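
A minimal sketch of the pipeline (hIn/hOut are pinned host buffers, dIn/dOut device buffers, processChunk a kernel; all names and sizes are hypothetical). Chunks are issued round-robin to two streams, so the copies of one chunk may overlap with the kernel of another:

const int CHUNKS = 8;
const size_t chunkSize = totalSize / CHUNKS;

cudaStream_t streams[2];
for (int i = 0; i < 2; ++i)
    cudaStreamCreate(&streams[i]);

for (int c = 0; c < CHUNKS; ++c) {
    cudaStream_t s = streams[c % 2];
    size_t off = c * chunkSize;
    cudaMemcpyAsync(dIn + off, hIn + off, chunkSize * sizeof(float),
                    cudaMemcpyHostToDevice, s);              // load
    processChunk<<<grid, block, 0, s>>>(dIn + off, dOut + off); // process
    cudaMemcpyAsync(hOut + off, dOut + off, chunkSize * sizeof(float),
                    cudaMemcpyDeviceToHost, s);              // store
}

cudaDeviceSynchronize();  // drain the whole pipeline
for (int i = 0; i < 2; ++i)
    cudaStreamDestroy(streams[i]);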


Page 15: Feeding Threads

Data Gather and Scatter Problem

[Diagram: input data are gathered from host memory into a contiguous buffer, copied to GPU memory for the kernel execution, and the results are copied back and scattered into host memory]

Note: issuing multiple cudaMemcpy() calls for scattered data may be quite inefficient.

Page 16: Feeding Threads

Gather and Scatter
◦ Reducing overhead
◦ Performed by the CPU before/after cudaMemcpy()

[Diagram: the main thread feeds two streams; each pipelines gather → HtD copy → kernel → DtH copy → scatter, and the stages of stream 0 and stream 1 overlap]

Note: the number of threads per GPU and the number of streams per thread depend on the workload structure.

Page 17: Page-locked Memory

Page-locked (Pinned) Host Memory
◦ Host memory that is prevented from being swapped out
◦ Created/dismissed by
   cudaHostAlloc(), cudaFreeHost()
   cudaHostRegister(), cudaHostUnregister()
◦ Optionally with flags
   cudaHostAllocWriteCombined – optimized for writing, not cached on the CPU
   cudaHostAllocMapped
   cudaHostAllocPortable
◦ Copies between pinned host memory and the device are automatically performed asynchronously
◦ Pinned memory is a scarce resource

A minimal allocation sketch follows.
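
A minimal sketch of using pinned memory for an asynchronous copy (the buffer names, element count N, and the stream are hypothetical):

float *hData;
cudaHostAlloc((void **)&hData, N * sizeof(float), cudaHostAllocDefault);

// a pinned source buffer makes cudaMemcpyAsync() truly asynchronous
cudaMemcpyAsync(dData, hData, N * sizeof(float),
                cudaMemcpyHostToDevice, stream);
...
cudaFreeHost(hData);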

Page 18: Memory Mapping

Device Memory Mapping
◦ Allows the GPU to access portions of host memory directly (i.e., without explicit copy operations)
   For both reading and writing
◦ The memory must be allocated/registered with the cudaHostAllocMapped flag
◦ The context must have the cudaDeviceMapHost flag (set by cudaSetDeviceFlags())
◦ Function cudaHostGetDevicePointer() takes a host pointer and returns the corresponding device pointer (see the sketch below)
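
A minimal sketch of mapping a host buffer into the device address space (the buffer names, N, and the kernel are hypothetical):

cudaSetDeviceFlags(cudaDeviceMapHost);  // must be set before the context is created

float *hBuf, *dBuf;
cudaHostAlloc((void **)&hBuf, N * sizeof(float), cudaHostAllocMapped);
cudaHostGetDevicePointer((void **)&dBuf, hBuf, 0);  // flags must be 0

MyKernel<<<g, b>>>(dBuf);   // the kernel accesses host memory directly
cudaDeviceSynchronize();    // make the kernel's writes visible to the host
cudaFreeHost(hBuf);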


Page 19: Asynchronous Error Handling

Asynchronous Errors
◦ An error may occur outside of a CUDA call
   In case of asynchronous memory transfers or kernel execution
◦ The error is reported by the following CUDA call
◦ To make sure all errors have been reported, the device must synchronize (cudaDeviceSynchronize())
◦ Error handling functions (see the sketch below)
   cudaGetLastError()
   cudaPeekAtLastError()
   cudaGetErrorString(error)
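
A minimal sketch of checking both launch-time and asynchronous errors (the kernel and its configuration are hypothetical):

MyKernel<<<g, b>>>(...);
cudaError_t err = cudaGetLastError();   // errors detected at launch time
if (err != cudaSuccess)
    fprintf(stderr, "launch failed: %s\n", cudaGetErrorString(err));

err = cudaDeviceSynchronize();          // surfaces asynchronous errors
if (err != cudaSuccess)
    fprintf(stderr, "execution failed: %s\n", cudaGetErrorString(err));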


Page 20: Discussion