Optimizing OpenCL for nVidia GPGPU
Emil Karlson
January 29, 2010
Differences to CPU programming
GPU has less on-chip memory, but it can be explicitly controlled
GPU has a single instruction unit shared by several stream processors
The graphics card is optimized for throughput, not latency or flexibility
Threads on the GPU are very lightweight
General guidelines
GPU invocation is expensive
Transfer between the host and the graphics card has low bandwidth
Use the proper functions and the provided profiling tools for benchmarking (see the sketch below)
If the GPU is used for some operations, it can also be used for others without direct performance gains, to avoid wasting bandwidth
PCI-E is about 10 times slower than on-chip memory access
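A minimal sketch of the benchmarking point above, using OpenCL's event profiling to time a kernel on the device clock. It assumes a command queue created with CL_QUEUE_PROFILING_ENABLE; the names queue, kernel and global are placeholders, and error handling is omitted.

#include <CL/cl.h>
#include <stdio.h>

/* Time one kernel launch with the built-in event profiling counters. */
void time_kernel(cl_command_queue queue, cl_kernel kernel, size_t global)
{
    cl_event evt;
    cl_ulong start = 0, end = 0;

    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL, 0, NULL, &evt);
    clWaitForEvents(1, &evt);                    /* wait until the kernel has finished */

    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START, sizeof(start), &start, NULL);
    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END, sizeof(end), &end, NULL);

    printf("kernel took %.3f ms\n", (end - start) * 1e-6);   /* counters are in ns */
    clReleaseEvent(evt);
}

The device-side timestamps exclude the PCI-E transfers, so host-device copies should be timed separately (for example with their own events) when deciding what is worth offloading.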
Choosing the algorithm
Use a parallel algorithm
Avoid algorithms with complex control or data structures
Avoid algorithms with excessively complex message passing
Minimal memory usage per thread is optimal
Interleaving latencies
Figure: Interleaving latencies (timeline of memory and arithmetic operations, their latencies, and idle periods)
Device memory overview
Figure: Overview of the device [nVidia best practices guide]
Reading the global and private device memory
Use coalescence to merge per-thread transactions (see the sketch below)
Coalescence happens per half warp with native data sizes and proper alignment
Data structure sizes can be explicitly padded up to native data sizes
The latency of 400-600 cycles needs to be interleaved
Maximum bandwidth is 10 to 150 GB/s
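A minimal sketch of a coalescence-friendly kernel, assuming a device of compute capability 1.2 or higher: consecutive work-items of a half warp touch consecutive, naturally aligned float4 elements, so their accesses merge into a few transactions. The padded struct shows one way to round an element up to a native size; both names are illustrative.

/* 12-byte payload explicitly padded up to a native 16-byte element. */
typedef struct __attribute__((aligned(16))) { float x, y, z, pad; } point3;

__kernel void scale(__global float4 *data, float factor)
{
    size_t gid = get_global_id(0);   /* consecutive ids -> consecutive addresses */
    data[gid] = data[gid] * factor;  /* coalesced read and write per half warp   */
}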
Coalescence in compute capability 1.2 or higher devices
Figure: Coalescence patterns on compute capability 1.2 or higher devices
Global memory access with offset
Figure: Effective bandwidth of sequential global memory access [nVidia OpenCL best practices guide]
Shared memory
Shared memory accesses are done by half warps
Memory accesses to different banks are done simultaneously, otherwise serially
Banks are defined by (address / 4) % 16
Use shared memory to avoid data transfers to and from device memory (see the sketch below)
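A minimal sketch of staging data in __local (shared) memory, here for transposing a square matrix whose width is a multiple of the tile size; names and sizes are illustrative. With 16 four-byte banks, the elements of a 16-wide column land in one bank, so each row is padded to 17 floats to spread the accesses out.

#define TILE 16

__kernel void transpose_tile(__global const float *in, __global float *out, int width)
{
    __local float tile[TILE][TILE + 1];        /* +1 column avoids bank conflicts   */

    int lx = get_local_id(0), ly = get_local_id(1);
    int gx = get_global_id(0), gy = get_global_id(1);

    tile[ly][lx] = in[gy * width + gx];        /* coalesced read from global memory */
    barrier(CLK_LOCAL_MEM_FENCE);              /* whole work-group has written      */

    int ox = get_group_id(1) * TILE + lx;      /* origin of the transposed tile     */
    int oy = get_group_id(0) * TILE + ly;
    out[oy * width + ox] = tile[lx][ly];       /* conflict-free reads, coalesced write */
}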
Constant memory
Constant memory has a very fast cache
The physical maximum for constant memory is 64 KiB
If all threads don't fetch the same constant address, the operation is done serially
Use constant memory to avoid excessively long kernel argument lists (see the sketch below)
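A minimal sketch of moving a small read-only table into __constant space instead of passing every coefficient as a separate argument; the polynomial kernel is illustrative. In each iteration every work-item reads the same coeffs[k], which the constant cache can serve as a broadcast.

__kernel void polynomial(__global const float *x, __global float *y,
                         __constant float *coeffs, int degree)
{
    size_t gid = get_global_id(0);
    float xi  = x[gid];                     /* one global read per work-item         */
    float acc = coeffs[degree];
    for (int k = degree - 1; k >= 0; --k)   /* Horner's rule, one broadcast per step */
        acc = acc * xi + coeffs[k];
    y[gid] = acc;
}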
Texture memory access
Regular global memory can be accessed as texture memory
Texture memory is cached with 2D locality for better bandwidth
Texture memory allows efficient operations without coalescence rules (see the sketch below)
Texture memory has no cache coherency for writing
Texture memory offers a few special flavours of operations
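A minimal sketch of reading through the texture path with an OpenCL image object and sampler: the transposed lookup below breaks the coalescence rules for plain buffers, but the 2D-locality cache still serves it reasonably. The image is assumed to be a CL_RGBA / CL_FLOAT image2d_t; all names are illustrative.

__constant sampler_t smp = CLK_NORMALIZED_COORDS_FALSE |
                           CLK_ADDRESS_CLAMP_TO_EDGE |
                           CLK_FILTER_NEAREST;

__kernel void transpose_sample(__read_only image2d_t src, __global float4 *dst, int width)
{
    int x = get_global_id(0), y = get_global_id(1);
    /* Neighbouring work-items read texels far apart in memory; the texture
       cache copes where a plain buffer read would not coalesce. */
    dst[y * width + x] = read_imagef(src, smp, (int2)(y, x));
}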
Memory speeds and amounts
On compute capability 1.2 chips there are 16K 32-bit registers; earlier ones have only 8K registers
There is 16 KiB of local memory per thread
There is 16 KiB of shared memory (queryable at run time, see the sketch below)
There is about 8 KiB each of texture and constant cache
Shared memory is very fast
Other memory types are considerably slower
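A minimal sketch of querying some of these limits at run time instead of hard-coding them; dev is assumed to be a valid cl_device_id and error handling is omitted.

#include <CL/cl.h>
#include <stdio.h>

void print_mem_limits(cl_device_id dev)
{
    cl_ulong local_mem = 0, const_buf = 0;
    cl_device_local_mem_type type;

    clGetDeviceInfo(dev, CL_DEVICE_LOCAL_MEM_SIZE, sizeof(local_mem), &local_mem, NULL);
    clGetDeviceInfo(dev, CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE, sizeof(const_buf), &const_buf, NULL);
    clGetDeviceInfo(dev, CL_DEVICE_LOCAL_MEM_TYPE, sizeof(type), &type, NULL);

    printf("shared (local) memory: %llu B, constant buffer: %llu B, dedicated: %s\n",
           (unsigned long long)local_mem, (unsigned long long)const_buf,
           type == CL_LOCAL ? "yes" : "no");
}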
Choosing block size
A streaming multiprocessor may run multiple blocks
Block size should be a multiple of the warp size, 32
Block size should be a multiple of 64 for register dependency interleaving optimizations
Benchmarking different block sizes is recommended; a typical choice is between 128 and 256 (see the sketch below)
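A minimal sketch of launching with an explicit work-group ("block") size that is a multiple of 64, after checking the kernel's per-device limit. The names are placeholders, error handling is omitted, and the kernel is assumed to guard against indices beyond n.

#include <CL/cl.h>

void launch(cl_command_queue queue, cl_kernel kernel, cl_device_id dev, size_t n)
{
    size_t max_wg = 0;
    clGetKernelWorkGroupInfo(kernel, dev, CL_KERNEL_WORK_GROUP_SIZE,
                             sizeof(max_wg), &max_wg, NULL);

    size_t local = 128;                     /* benchmark 64, 128, 192, 256, ... */
    if (local > max_wg)
        local = max_wg;

    size_t global = ((n + local - 1) / local) * local;   /* round up to whole blocks */
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local, 0, NULL, NULL);
}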
Block resource usage
The number of registers used by a block is
R_block = ceil(R * ceil(T, 32), R_max / 32)    (1)
where ceil(x, y) denotes x rounded up to the nearest multiple of y, R is the number of registers used by a thread, T is the number of threads in the block, and R_max is the number of registers on the streaming multiprocessor. Shared memory is likewise shared between the blocks resident on a streaming multiprocessor.
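A minimal sketch of formula (1) as code, with an illustrative worked example; the helper name ceil_to is an assumption, not part of the guide.

/* ceil_to(x, m): x rounded up to the nearest multiple of m. */
static size_t ceil_to(size_t x, size_t m) { return ((x + m - 1) / m) * m; }

static size_t registers_per_block(size_t R, size_t T, size_t Rmax)
{
    return ceil_to(R * ceil_to(T, 32), Rmax / 32);   /* formula (1) */
}

/* Example: R = 20, T = 128, Rmax = 8192 gives ceil_to(20 * 128, 256) = 2560,
   so three such blocks (7680 registers) fit on one multiprocessor; a fourth does not. */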
Branching optimization
When conditionals like if, while and for are used, each taken branch is run separately by the instruction unit
Short divergences are run as predicated instructions
Loops can be explicitly or implicitly unrolled for the whole block (see the sketch below)
Optimally the condition is the same for the entire block
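A minimal sketch contrasting a divergent per-thread branch with a block-uniform one, plus an explicit unroll hint; #pragma unroll is an NVIDIA compiler hint that other implementations may ignore, and the kernel is purely illustrative.

__kernel void branch_example(__global float *data, int mode)
{
    size_t gid = get_global_id(0);

    /* Divergent: neighbouring work-items take different paths, so the unit
       runs both sides (as predicated instructions when they are short). */
    if (gid % 2 == 0) data[gid] *= 2.0f;
    else              data[gid] *= 0.5f;

    /* Uniform: mode is the same for every work-item, so the whole block
       takes one path and nothing is serialized. */
    if (mode == 1) {
        float acc = 0.0f;
        #pragma unroll 4                 /* explicit unroll of the small loop */
        for (int i = 0; i < 4; ++i)
            acc += data[gid] + (float)i;
        data[gid] = acc;
    }
}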
Arithmetic instruction optimizations
The hardware has comparatively fast native operations for a few functions, with some precision tradeoff
Use reciprocal square root and fused multiply-add when appropriate (see the sketch below)
The compiler may have flags for arithmetic instruction optimizations
There is quite a bit of detail in the instruction throughput reference
Use float constants, i.e. 1.0f, not 1.0
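A minimal sketch of the fast-math built-ins on an illustrative normalization kernel; mad and native_rsqrt are standard OpenCL C built-ins, and build options such as -cl-mad-enable or -cl-fast-relaxed-math (passed to clBuildProgram) relax precision further.

__kernel void normalize3(__global float4 *v)
{
    size_t gid = get_global_id(0);
    float4 p = v[gid];

    float len2 = mad(p.x, p.x, mad(p.y, p.y, p.z * p.z));   /* fused multiply-adds   */
    float inv  = native_rsqrt(len2);                         /* fast, lower precision */

    v[gid] = (float4)(p.x * inv, p.y * inv, p.z * inv, 1.0f);  /* float literals: 1.0f */
}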
Integer instruction optimizations
Bit shift operators can be used for several operations (see the sketch below)
Division and modulo are slow
Multiplication with 24-bit precision is faster on current GPUs
Conversions and addition are as fast as floating point ops
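A minimal sketch replacing division and modulo by a power of two with a shift and a mask, and using the 24-bit multiply for indices known to fit in 24 bits; the gather kernel itself is illustrative.

__kernel void gather(__global const float *in, __global float *out, int stride)
{
    int gid = (int)get_global_id(0);

    int row = gid >> 5;                  /* gid / 32 as a shift                   */
    int col = gid & 31;                  /* gid % 32 as a mask                    */

    int idx = mul24(row, stride) + col;  /* 24-bit multiply, faster on these GPUs */
    out[gid] = in[idx];
}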
Other sources
Some embedded programming practices are probably applicable to GPGPU programming
Some MPI parallel programming practices are applicable
Bunnies
Figure: Bunny with a pancake on top of its head