Optimizing OpenCL for nVidia GPGPU
Emil Karlson
January 29, 2010
Differences to CPU programming
GPU has less on-chip memory, but it can be explicitly controlled
GPU has a single instruction unit shared by several stream processors
The graphics card is optimized for throughput, not latency or flexibility
Threads on the GPU are very lightweight
General guidelines
GPU invocation is expensive
Transfer between the host and the graphics card has low bandwidth
Use the proper functions and the provided profiling tools for benchmarking (see the sketch below)
If the GPU is used for some operations, it can also be used for others without direct performance gains, to avoid wasting bandwidth
PCI-E is about 10 times slower than on-chip memory access
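A minimal sketch of the benchmarking point above, using OpenCL's event profiling to time a kernel on the device clock. It assumes a command queue created with CL_QUEUE_PROFILING_ENABLE; the names queue, kernel and global are placeholders, and error handling is omitted.

#include <CL/cl.h>
#include <stdio.h>

/* Time one kernel launch with the built-in event profiling counters. */
void time_kernel(cl_command_queue queue, cl_kernel kernel, size_t global)
{
    cl_event evt;
    cl_ulong start = 0, end = 0;

    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL, 0, NULL, &evt);
    clWaitForEvents(1, &evt);                    /* wait until the kernel has finished */

    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START, sizeof(start), &start, NULL);
    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END, sizeof(end), &end, NULL);

    printf("kernel took %.3f ms\n", (end - start) * 1e-6);   /* counters are in ns */
    clReleaseEvent(evt);
}

The device-side timestamps exclude the PCI-E transfers, so host-device copies should be timed separately (for example with their own events) when deciding what is worth offloading.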
Choosing the algorithm
Use a parallel algorithm
Avoid algorithms with complex control or data structures
Avoid algorithms with excessively complex message passing
Minimal memory usage per thread is optimal
Interleaving latencies
Figure: Interleaving latencies (timeline of memory and arithmetic operations, their latencies, and idle periods)
Device memory overview
Figure: Overview of the device [nVidia best practices guide]
Reading the global and private device memory
Use coalescence to merge per-thread transactions (see the sketch below)
Coalescence happens per half warp with native data sizes and proper alignment
Data structure sizes can be explicitly padded up to native data sizes
The latency of 400-600 cycles needs to be interleaved
Maximum bandwidth is 10 to 150 GB/s
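A minimal sketch of a coalescence-friendly kernel, assuming a device of compute capability 1.2 or higher: consecutive work-items of a half warp touch consecutive, naturally aligned float4 elements, so their accesses merge into a few transactions. The padded struct shows one way to round an element up to a native size; both names are illustrative.

/* 12-byte payload explicitly padded up to a native 16-byte element. */
typedef struct __attribute__((aligned(16))) { float x, y, z, pad; } point3;

__kernel void scale(__global float4 *data, float factor)
{
    size_t gid = get_global_id(0);   /* consecutive ids -> consecutive addresses */
    data[gid] = data[gid] * factor;  /* coalesced read and write per half warp   */
}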
Coalescence in compute capability 1.2 or higher devices
Figure: Coalescence patterns on compute capability 1.2 or higher devices
Global memory access with offset
Figure: Effective bandwidth of sequential global memory access [nVidia OpenCL best practices guide]
Shared memory
Shared memory accesses are done by half warps
Memory accesses to different banks are done simultaneously, otherwise serially
Banks are defined by (address / 4) % 16
Use shared memory to avoid data transfers to and from device memory (see the sketch below)
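A minimal sketch of staging data in __local (shared) memory, here for transposing a square matrix whose width is a multiple of the tile size; names and sizes are illustrative. With 16 four-byte banks, the elements of a 16-wide column land in one bank, so each row is padded to 17 floats to spread the accesses out.

#define TILE 16

__kernel void transpose_tile(__global const float *in, __global float *out, int width)
{
    __local float tile[TILE][TILE + 1];        /* +1 column avoids bank conflicts   */

    int lx = get_local_id(0), ly = get_local_id(1);
    int gx = get_global_id(0), gy = get_global_id(1);

    tile[ly][lx] = in[gy * width + gx];        /* coalesced read from global memory */
    barrier(CLK_LOCAL_MEM_FENCE);              /* whole work-group has written      */

    int ox = get_group_id(1) * TILE + lx;      /* origin of the transposed tile     */
    int oy = get_group_id(0) * TILE + ly;
    out[oy * width + ox] = tile[lx][ly];       /* conflict-free reads, coalesced write */
}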
Constant memory
Constant memory has a very fast cache
The physical maximum for constant memory is 64 KiB
If all threads don't fetch the same constant address, the operation is done serially
Use constant memory to avoid excessively long kernel argument lists (see the sketch below)
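A minimal sketch of moving a small read-only table into __constant space instead of passing every coefficient as a separate argument; the polynomial kernel is illustrative. In each iteration every work-item reads the same coeffs[k], which the constant cache can serve as a broadcast.

__kernel void polynomial(__global const float *x, __global float *y,
                         __constant float *coeffs, int degree)
{
    size_t gid = get_global_id(0);
    float xi  = x[gid];                     /* one global read per work-item         */
    float acc = coeffs[degree];
    for (int k = degree - 1; k >= 0; --k)   /* Horner's rule, one broadcast per step */
        acc = acc * xi + coeffs[k];
    y[gid] = acc;
}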
Texture memory access
Regular global memory can be accessed as texture memory
Texture memory is cached with 2D locality for better bandwidth
Texture memory allows efficient operations without coalescence rules (see the sketch below)
Texture memory has no cache coherency for writing
Texture memory offers a few special flavours of operations
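A minimal sketch of reading through the texture path with an OpenCL image object and sampler: the transposed lookup below breaks the coalescence rules for plain buffers, but the 2D-locality cache still serves it reasonably. The image is assumed to be a CL_RGBA / CL_FLOAT image2d_t; all names are illustrative.

__constant sampler_t smp = CLK_NORMALIZED_COORDS_FALSE |
                           CLK_ADDRESS_CLAMP_TO_EDGE |
                           CLK_FILTER_NEAREST;

__kernel void transpose_sample(__read_only image2d_t src, __global float4 *dst, int width)
{
    int x = get_global_id(0), y = get_global_id(1);
    /* Neighbouring work-items read texels far apart in memory; the texture
       cache copes where a plain buffer read would not coalesce. */
    dst[y * width + x] = read_imagef(src, smp, (int2)(y, x));
}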
Memory speeds and amounts
On compute capability 1.2 chips there are 16K 32-bit registers; earlier ones have only 8K registers
There is 16 KiB of local memory per thread
There is 16 KiB of shared memory (queryable at run time, see the sketch below)
There is about 8 KiB each of texture and constant cache
Shared memory is very fast
Other memory types are considerably slower
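A minimal sketch of querying some of these limits at run time instead of hard-coding them; dev is assumed to be a valid cl_device_id and error handling is omitted.

#include <CL/cl.h>
#include <stdio.h>

void print_mem_limits(cl_device_id dev)
{
    cl_ulong local_mem = 0, const_buf = 0;
    cl_device_local_mem_type type;

    clGetDeviceInfo(dev, CL_DEVICE_LOCAL_MEM_SIZE, sizeof(local_mem), &local_mem, NULL);
    clGetDeviceInfo(dev, CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE, sizeof(const_buf), &const_buf, NULL);
    clGetDeviceInfo(dev, CL_DEVICE_LOCAL_MEM_TYPE, sizeof(type), &type, NULL);

    printf("shared (local) memory: %llu B, constant buffer: %llu B, dedicated: %s\n",
           (unsigned long long)local_mem, (unsigned long long)const_buf,
           type == CL_LOCAL ? "yes" : "no");
}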
Choosing block size
A streaming multiprocessor may run multiple blocks
Block size should be a multiple of the warp size, 32
Block size should be a multiple of 64 for register dependency interleaving optimizations
Benchmarking different block sizes is recommended; a typical choice is between 128 and 256 (see the sketch below)
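A minimal sketch of launching with an explicit work-group ("block") size that is a multiple of 64, after checking the kernel's per-device limit. The names are placeholders, error handling is omitted, and the kernel is assumed to guard against indices beyond n.

#include <CL/cl.h>

void launch(cl_command_queue queue, cl_kernel kernel, cl_device_id dev, size_t n)
{
    size_t max_wg = 0;
    clGetKernelWorkGroupInfo(kernel, dev, CL_KERNEL_WORK_GROUP_SIZE,
                             sizeof(max_wg), &max_wg, NULL);

    size_t local = 128;                     /* benchmark 64, 128, 192, 256, ... */
    if (local > max_wg)
        local = max_wg;

    size_t global = ((n + local - 1) / local) * local;   /* round up to whole blocks */
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local, 0, NULL, NULL);
}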
Block resource usage
The number of registers used by a block is
R_block = ceil(R * ceil(T, 32), R_max / 32)    (1)
where ceil(x, y) denotes x rounded up to the nearest multiple of y, R is the number of registers used by a thread, T is the number of threads in the block, and R_max is the number of registers on the streaming multiprocessor. Shared memory is likewise shared between the blocks resident on a streaming multiprocessor.
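A minimal sketch of formula (1) as code, with an illustrative worked example; the helper name ceil_to is an assumption, not part of the guide.

/* ceil_to(x, m): x rounded up to the nearest multiple of m. */
static size_t ceil_to(size_t x, size_t m) { return ((x + m - 1) / m) * m; }

static size_t registers_per_block(size_t R, size_t T, size_t Rmax)
{
    return ceil_to(R * ceil_to(T, 32), Rmax / 32);   /* formula (1) */
}

/* Example: R = 20, T = 128, Rmax = 8192 gives ceil_to(20 * 128, 256) = 2560,
   so three such blocks (7680 registers) fit on one multiprocessor; a fourth does not. */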
Branching optimization
When conditionals like if, while and for are used, each taken branch is run separately by the instruction unit
Short divergences are run as predicated instructions
Loops can be explicitly or implicitly unrolled for the whole block (see the sketch below)
Optimally the condition is the same for the entire block
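A minimal sketch contrasting a divergent per-thread branch with a block-uniform one, plus an explicit unroll hint; #pragma unroll is an NVIDIA compiler hint that other implementations may ignore, and the kernel is purely illustrative.

__kernel void branch_example(__global float *data, int mode)
{
    size_t gid = get_global_id(0);

    /* Divergent: neighbouring work-items take different paths, so the unit
       runs both sides (as predicated instructions when they are short). */
    if (gid % 2 == 0) data[gid] *= 2.0f;
    else              data[gid] *= 0.5f;

    /* Uniform: mode is the same for every work-item, so the whole block
       takes one path and nothing is serialized. */
    if (mode == 1) {
        float acc = 0.0f;
        #pragma unroll 4                 /* explicit unroll of the small loop */
        for (int i = 0; i < 4; ++i)
            acc += data[gid] + (float)i;
        data[gid] = acc;
    }
}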
Arithmetic instruction optimizations
The hardware has comparatively fast native operations for a few functions, with some precision tradeoff
Use reciprocal square root and fused multiply-add when appropriate (see the sketch below)
The compiler may have flags for arithmetic instruction optimizations
There is quite a bit of detail in the instruction throughput reference
Use float constants, i.e. 1.0f, not 1.0
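A minimal sketch of the fast-math built-ins on an illustrative normalization kernel; mad and native_rsqrt are standard OpenCL C built-ins, and build options such as -cl-mad-enable or -cl-fast-relaxed-math (passed to clBuildProgram) relax precision further.

__kernel void normalize3(__global float4 *v)
{
    size_t gid = get_global_id(0);
    float4 p = v[gid];

    float len2 = mad(p.x, p.x, mad(p.y, p.y, p.z * p.z));   /* fused multiply-adds   */
    float inv  = native_rsqrt(len2);                         /* fast, lower precision */

    v[gid] = (float4)(p.x * inv, p.y * inv, p.z * inv, 1.0f);  /* float literals: 1.0f */
}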
Integer instruction optimizations
Bit shift operators can be used for several operations (see the sketch below)
Division and modulo are slow
Multiplication with 24-bit precision is faster on current GPUs
Conversions and addition are as fast as floating point ops
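A minimal sketch replacing division and modulo by a power of two with a shift and a mask, and using the 24-bit multiply for indices known to fit in 24 bits; the gather kernel itself is illustrative.

__kernel void gather(__global const float *in, __global float *out, int stride)
{
    int gid = (int)get_global_id(0);

    int row = gid >> 5;                  /* gid / 32 as a shift                   */
    int col = gid & 31;                  /* gid % 32 as a mask                    */

    int idx = mul24(row, stride) + col;  /* 24-bit multiply, faster on these GPUs */
    out[gid] = in[idx];
}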
Other sources
Some embedded programming practices are probably applicable to GPGPU programming
Some MPI parallel programming practices are applicable
Bunnies
Figure: Bunny with a pancake on top of its head