

Page 1: Optimizing OpenCL for nVidia GPGPU (Aalto)

Optimizing OpenCL for nVidia GPGPU

Emil Karlson

January 29, 2010

Emil Karlson () Optimizing OpenCL for nVidia GPGPU January 29, 2010 1 / 20

Page 2

Differences to CPU programming

- The GPU has less on-chip memory than a CPU, but it can be explicitly controlled
- The GPU has a single instruction unit per several stream processors
- A graphics card is optimized for throughput, not latency or flexibility
- Threads on the GPU are very lightweight


Page 3

General guidelines

- GPU kernel invocation is expensive
- Transfers between the host and the graphics card have low bandwidth
- Use the proper functions and the provided profiling tools for benchmarking
- If the GPU is already used for some operations, it can also be used for others without direct performance gains, to avoid wasting transfer bandwidth
- PCI-E is about 10 times slower than on-chip memory access
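The profiling advice above can be sketched with OpenCL's event API. This is a host-side fragment, not a complete program: it assumes a `queue` created with `CL_QUEUE_PROFILING_ENABLE`, an already-built `kernel`, and omits error checking.

```c
/* Sketch: timing a kernel with OpenCL event profiling (OpenCL 1.0 API). */
cl_event ev;
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local, 0, NULL, &ev);
clWaitForEvents(1, &ev);

cl_ulong t0, t1;  /* nanoseconds on the device clock */
clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START, sizeof t0, &t0, NULL);
clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,   sizeof t1, &t1, NULL);
printf("kernel time: %f ms\n", (t1 - t0) * 1e-6);
```

Timing on the device clock avoids including the kernel-launch overhead that a naive host-side timer would measure.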


Page 4

Choosing the algorithm

- Use a parallel algorithm
- Avoid algorithms with complex control flow or data structures
- Avoid algorithms with excessive or complex message passing
- Minimal memory usage per thread is optimal


Page 5

Interleaving latencies

Figure: Interleaving latencies (memory operation latencies are hidden by overlapping them with arithmetic operations; with too few memory operations in flight, the processor sits idle waiting for memory)


Page 6

Device memory overview

Figure: Overview of the device [nVidia best practices guide]


Page 7

Reading the global and private device memory

- Use coalescing to merge per-thread transactions into fewer memory transactions
- Coalescing is done per half warp, with native data sizes and proper alignment
- Data structure sizes can be explicitly padded up to native data sizes
- The latency of 400 to 600 cycles needs to be interleaved with computation
- Maximum bandwidth is 10 to 150 GB/s
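The coalescing rule above can be illustrated with two minimal OpenCL kernel sketches; these require an OpenCL runtime to build and run, and the kernel names are made up for this example.

```c
/* Coalesced: thread i reads element i, so a half warp touches one
   contiguous, aligned segment and the accesses merge into few transactions. */
__kernel void copy_coalesced(__global const float *in, __global float *out) {
    size_t i = get_global_id(0);
    out[i] = in[i];
}

/* Strided: thread i reads element i * stride; the half warp scatters
   across many segments and effective bandwidth drops sharply. */
__kernel void copy_strided(__global const float *in, __global float *out,
                           int stride) {
    size_t i = get_global_id(0);
    out[i] = in[i * stride];
}
```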


Page 8

Coalescence in compute capability 1.2 or higher devices

Figure: coalesced global memory access patterns on compute capability 1.2 and higher devices

Page 9

Global memory access with offset

Figure: Effective bandwidth of sequential global memory access [nVidia OpenCL best practices guide]


Page 10

Shared memory

- Shared memory accesses are done by half warps
- Memory accesses to different banks are done simultaneously; accesses to the same bank are serialized
- The bank of an address is (address / 4) % 16
- Use shared memory to avoid data transfers to and from device memory


Page 11

Constant memory

- Constant memory has a very fast cache
- The physical maximum for constant memory is 64 KiB
- If all threads don't fetch the same constant address, the operation is serialized
- Constant memory can also be used to avoid excessively long kernel argument lists
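The argument-list advice can be sketched as an OpenCL kernel that takes its read-only parameters through a single `__constant` struct pointer; the struct layout and names here are assumptions for illustration.

```c
/* Sketch: parameters in __constant memory instead of many kernel args. */
typedef struct { float scale; float offset; int n; } params_t;

__kernel void apply(__global float *data, __constant params_t *p) {
    size_t i = get_global_id(0);
    if (i < (size_t)p->n)
        /* All threads read the same constant address, so the cached
           fetch is broadcast rather than serialized. */
        data[i] = data[i] * p->scale + p->offset;
}
```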


Page 12

Texture memory access

- Regular global memory can be accessed as texture memory
- Texture memory is cached with 2D locality for better bandwidth
- Texture memory allows efficient reads without coalescing rules
- Texture memory has no cache coherency for writes
- Texture memory offers a few extra operation flavors, such as filtering and addressing modes


Page 13

Memory speeds and amounts

- Compute capability 1.2 chips have 16 K 32-bit registers per streaming multiprocessor; earlier chips have only 8 K
- There is 16 KiB of local memory per thread
- There is 16 KiB of shared memory per streaming multiprocessor
- There is about 8 KiB each of texture cache and constant cache
- Shared memory is very fast
- The other memory types are considerably slower


Page 14

Choosing blocksize

- A streaming multiprocessor may run multiple blocks
- The block size should be a multiple of the warp size, 32
- The block size should be a multiple of 64 for register dependency interleaving optimizations
- Benchmarking different block sizes is recommended; a good size is typically between 128 and 256


Page 15

Block resource usage

The amount of registers used by a block is

Rblock = ceil(R × ceil(T, 32), Rmax / 32)   (1)

where ceil(x, y) denotes x rounded up to the nearest multiple of y, R is the amount of registers used by a thread, T is the number of threads in the block, and Rmax is the amount of registers in the streaming multiprocessor. Shared memory is likewise shared between the blocks running on a streaming multiprocessor.


Page 16

Branching optimization

- When conditionals like if, while, and for are used, each divergent branch is run separately by the instruction unit
- Short divergences are run as predicated instructions
- Loops can be explicitly or implicitly unrolled for the whole block
- Optimally the condition is the same for the entire block
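The last point can be sketched as two OpenCL kernels; these are illustrative fragments that need an OpenCL runtime, and the kernel names are made up.

```c
/* Divergent: odd and even threads take different sides within each warp,
   so the instruction unit runs both sides serially. */
__kernel void divergent(__global float *x) {
    size_t i = get_global_id(0);
    if (i % 2 == 0) x[i] *= 2.0f;
    else            x[i] += 1.0f;
}

/* Uniform: every thread of a work-group takes the same side, so no warp
   diverges and only one side is executed per warp. */
__kernel void uniform_branch(__global float *x) {
    size_t i = get_global_id(0);
    if (get_group_id(0) % 2 == 0) x[i] *= 2.0f;
    else                          x[i] += 1.0f;
}
```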


Page 17

Arithmetic instruction optimizations

- The hardware has comparatively fast native operations for a few functions, with some precision tradeoff
- Use the reciprocal square root and fused multiply-add when appropriate
- The compiler may have flags affecting arithmetic instruction selection
- There is quite a bit of detail in the instruction throughput reference
- Use float constants, i.e. 1.0f, not 1.0


Page 18

Integer instruction optimizations

- Bit shift operators can replace several arithmetic operations
- Division and modulo are slow
- Multiplication with 24-bit precision is faster on current GPUs
- Conversions and addition are as fast as floating point operations


Page 19

Other sources

- Some embedded programming practices are probably applicable to GPGPU programming
- Some MPI parallel programming practices are applicable as well


Page 20

Bunnies

Figure: Bunny with a pancake on top of its head
