GPU Optimization using CUDA Framework
Anusha Vidap, Divya Sree Yamani, Raffath Sulthana



Graphical processing units (GPUs) are praised for their enormous computational performance: they devote their transistors to data processing. Then why do we still need a CPU? Let us go into the details.

Talk Structure

GPU?????

History !!!

CUDA.

Is optimization really needed???

Techniques to optimize.

Project research goal.

What is this GPU? Do we really need it? Why not just a CPU? How did this actually start, and what circumstances led to the origin of the GPU? We then go into the need for optimizing the GPU and the reason CUDA was devised as a way to optimize it, look at ways of optimizing for better performance, and finally present our project goal pursued through these optimization methods.

Why GPU???

What is SIMD??

Before going into the details of the GPU, let us first know what led to its invention. The GPU is essentially an illustration of SIMD (Single Instruction, Multiple Data). So what is SIMD?

The GPU is an implementation of the SIMD concept illustrated above.

History

Necessity is the mother of invention, and knowing the necessity helps in better understanding the invention. The Graphics Processing Unit, familiar to us as the GPU, also had a need behind it before it actually originated.

The first GPUs were designed as graphics accelerators, supporting only specific fixed-function pipelines.

Starting in the late 1990s, the hardware became increasingly programmable, culminating in NVIDIA's first GPU in 1999.

Less than a year after NVIDIA coined the term GPU, the General Purpose GPU (GPGPU) movement had dawned.

But GPGPU was far from easy back then, even for those who knew graphics programming languages such as OpenGL. Developers had to map scientific calculations onto problems that could be represented by triangles and polygons. GPGPU was practically off-limits to those who hadn't memorized the latest graphics APIs until a group of Stanford University researchers set out to re-imagine the GPU as a "streaming processor."

In 2003, a team of researchers led by Ian Buck unveiled Brook, the first widely adopted programming model to extend C with data-parallel constructs. Programs written with it were seven times faster than similar existing code.

NVIDIA knew that blazingly fast hardware had to be coupled with intuitive software and hardware tools, and it started evolving a solution to seamlessly run C on the GPU. Putting the software and hardware together, NVIDIA unveiled CUDA in 2006, the world's first solution for general-purpose computing on GPUs.

The GPU is much more than a graphics processor in this era. But then, what is the need for CPUs?

So, here is more detail about our graphics processing unit, beyond its history. The reasons for the increasing innovation around GPUs are various: dedication of transistors to data processing, speed and cost effectiveness, and hiding latency efficiently through data processing.

And the combination of a multi-core CPU and a GPU has many advantages.

The following slide shows the timeline of the GPU.


GPU hardware

CUDA

CUDA, what is this again???

COMPUTE UNIFIED DEVICE ARCHITECTURE

CUDA is a parallel computing platform and programming model invented by NVIDIA. It enables dramatic increases in computing performance by harnessing the power of the graphics processing unit (GPU). Example uses: identifying hidden plaque in arteries, analyzing air traffic flow, and visualizing molecules.

Why only CUDA???

CUDA will stay a big player in professional and scientific markets because of the legacy software currently being built on it and its friendlier development support.

On NVIDIA hardware, other frameworks are up to 10% slower; this is mainly because they are implemented on top of the CUDA architecture.

CUDA also provides an FFT library, whereas in alternatives such as OpenCL you need to write your own kernels for that.

In short, CUDA has certainly made a lot of things easier for the developer.

Areas where this matters:

Functionality.

User interface.

API and library usage: separation of concerns, less development time and fewer dependencies.

Data types: handling video is different from handling video formats.

OS and platform: besides many visible specifics, an OS is also a collection of APIs.

Hardware performance.

CUDA tries to be an all-in-one package for developers, while OpenCL is mostly a language description only. For OpenCL the SDK, IDE, debugger, etc. all come from different vendors, so if you have an Intel Sandy Bridge and an AMD Radeon you need even more software when performance-optimizing kernels for different hardware. In reality this is not ideal, but everything you need really is there; you have to go to different places for it, but it is not that the software is unavailable, as is claimed much too often.

Rich Libraries

OpenCL works on different hardware, but the software needs to be adapted for each architecture. This is not mind-blowing: you need different types of cars to be fastest on different kinds of terrain.

Some code cannot run in parallel without some adaptation.

Automatic Scalability

DYNAMIC PARALLELISM

CUDA supports dynamic parallelism, in which kernels running on the device can themselves launch additional blocks of work, scaling the allocation of blocks to the amount of work that needs to be processed.
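The slide itself is not reproduced in this transcript, but as a minimal sketch of what a device-side launch looks like (assuming compute capability 3.5+ and compilation with -rdc=true; the kernel names are illustrative):

    __global__ void childKernel(float *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;
    }

    __global__ void parentKernel(float *data, int n) {
        if (blockIdx.x == 0 && threadIdx.x == 0) {
            // The device itself sizes the child launch to the amount of work.
            int blocks = (n + 255) / 256;
            childKernel<<<blocks, 256>>>(data, n);
        }
    }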

CUDA driving GPU growth


CPU code in C (left) vs. CUDA C (right), written to gain the benefits of parallel computation on the GPU.

HOST CODE vs CUDA CODE
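The side-by-side code from this slide is not in the transcript; the following is a minimal sketch of the kind of comparison usually shown, using a hypothetical SAXPY example:

    // Host (CPU) code in plain C: one thread walks the whole array.
    void saxpy_cpu(int n, float a, const float *x, float *y) {
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }

    // CUDA C: each GPU thread handles one element.
    __global__ void saxpy_gpu(int n, float a, const float *x, float *y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            y[i] = a * x[i] + y[i];
    }

    // Launch enough blocks of 256 threads to cover n elements (d_x, d_y are device pointers):
    //   saxpy_gpu<<<(n + 255) / 256, 256>>>(n, 2.0f, d_x, d_y);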

CUDA Memories Overview

Global, Local, Constant, Texture, Shared, Registers. These are the memory hierarchies; each has its own defined scope.

Threads, Blocks, Grid and Warp

Representation:
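The diagram from this slide is not in the transcript; as a rough sketch of how the thread hierarchy maps to indices (a hypothetical 1D launch):

    // Grid -> blocks -> threads; threads execute in warps of 32.
    __global__ void whereAmI(int *out, int n) {
        int globalId = blockIdx.x * blockDim.x + threadIdx.x;  // unique per thread
        int warpId   = threadIdx.x / 32;                       // warp within the block
        if (globalId < n) out[globalId] = warpId;
    }

    // Launch: 4 blocks of 128 threads covers n = 512 elements.
    //   whereAmI<<<4, 128>>>(d_out, 512);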

Global Memory

Global memory resides in device DRAM. The name global refers to its scope: it can be accessed and modified from both host and device. It is declared with a __device__ declaration, or dynamically allocated with cudaMalloc() and assigned to a regular C pointer variable. It is allocated explicitly by the host (CPU) thread.


Global memory (__device__): for data available to all threads on the device.

Declared outside function bodies.

Scope of the grid and lifetime of the application.

Basic operation:
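A minimal sketch of basic global-memory usage, combining a statically declared __device__ variable with a buffer allocated by cudaMalloc() (names and sizes are illustrative, error checking omitted):

    #include <cuda_runtime.h>

    __device__ float d_scale;               // statically declared, global scope

    __global__ void scaleArray(float *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= d_scale;      // every thread can read the global variable
    }

    int main(void) {
        const int n = 1 << 20;
        float scale = 3.0f, *d_data;
        cudaMalloc(&d_data, n * sizeof(float));              // dynamic global memory
        cudaMemcpyToSymbol(d_scale, &scale, sizeof(float));  // set the __device__ variable
        scaleArray<<<(n + 255) / 256, 256>>>(d_data, n);
        cudaFree(d_data);
        return 0;
    }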

Local Memory

Resides in device memory. A thread cannot read another thread's local memory. Much slower than register memory. Used to hold arrays if they are not indexed with constant values, and to hold variables when no more registers are available for them.
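A small sketch of a case that typically lands in local memory: a per-thread array indexed with a value not known at compile time (the kernel is hypothetical):

    __global__ void runtimeIndexed(const int *idx, float *out, int n) {
        float scratch[32];                       // per-thread, private array
        for (int k = 0; k < 32; ++k)
            scratch[k] = (float)k;

        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = scratch[idx[i] & 31];       // dynamic index: likely placed in local memory
    }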

Constant Memory

Step 1: A constant memory request for a warp is first split into two requests, one for each half-warp, that are issued independently.

Step 2: A request is then split into as many separate requests as there are different memory addresses in the initial request, decreasing throughput by a factor equal to the number of separate requests.

Final Step: The resulting requests are then serviced at the throughput of the constant cache in case of a cache hit, or at the throughput of device memory otherwise.

Constant memory (__constant__)

For data not altered by device.

Although stored in global memory, it is cached and has fast access.

Declared outside function bodies
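A minimal sketch of __constant__ memory usage; the coefficients and kernel are illustrative:

    __constant__ float c_coeffs[4];          // read-only for the device, cached

    __global__ void polynomial(const float *x, float *y, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            float v = x[i];
            // All threads read the same addresses, which the constant cache broadcasts.
            y[i] = c_coeffs[0] + v * (c_coeffs[1] + v * (c_coeffs[2] + v * c_coeffs[3]));
        }
    }

    // Host side: the constants are set before the kernel launch.
    //   float h_coeffs[4] = {1.0f, 0.5f, 0.25f, 0.125f};
    //   cudaMemcpyToSymbol(c_coeffs, h_coeffs, sizeof(h_coeffs));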

Textures

The GPU's sophisticated texture memory may also be used for general-purpose computing. Although NVIDIA designed the texture units for the classical OpenGL and DirectX rendering pipelines, texture memory has some properties that make it extremely useful for computing: it is cached on chip, and the texture caches are designed for graphics applications.

Texture memory is accessed through a dedicated addressing mechanism called texture fetching. All threads share texture memory, which is addressed as a one-dimensional or two-dimensional array; elements of the array are called texels, short for texture elements.
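A minimal sketch of general-purpose reads through the texture path, using the texture-object API (names are illustrative; the older texture-reference API used on Fermi-era hardware is similar in spirit):

    #include <cuda_runtime.h>

    cudaTextureObject_t makeTexture(float *devBuf, size_t n) {
        cudaResourceDesc resDesc = {};
        resDesc.resType = cudaResourceTypeLinear;          // bind plain linear memory
        resDesc.res.linear.devPtr = devBuf;
        resDesc.res.linear.desc = cudaCreateChannelDesc<float>();
        resDesc.res.linear.sizeInBytes = n * sizeof(float);

        cudaTextureDesc texDesc = {};
        texDesc.readMode = cudaReadModeElementType;

        cudaTextureObject_t tex = 0;
        cudaCreateTextureObject(&tex, &resDesc, &texDesc, nullptr);
        return tex;
    }

    __global__ void scaleThroughTexture(float *out, cudaTextureObject_t tex, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = 2.0f * tex1Dfetch<float>(tex, i);   // cached texture fetch
    }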

Shared Memory

Each block has its own shared memory. It enables fast communication between threads in a block, is used to hold data that will be read and written by multiple threads, reduces memory latency, and is very fast to access.

Shared memory (__shared__) is on the GPU chip and very fast.

Separate data available to all threads in one block.

Declared inside function bodies

So each block would have its own array A[N]
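A minimal sketch of per-block shared memory, assuming a hypothetical block-wise sum with 256 threads per block:

    #define N 256   // threads per block; each block gets its own A[N]

    __global__ void blockSum(const float *input, float *blockSums, int n) {
        __shared__ float A[N];                        // one copy per block
        int tid = threadIdx.x;
        int i   = blockIdx.x * blockDim.x + tid;

        A[tid] = (i < n) ? input[i] : 0.0f;           // each thread writes one element
        __syncthreads();                              // wait until the whole tile is loaded

        for (int stride = N / 2; stride > 0; stride /= 2) {
            if (tid < stride) A[tid] += A[tid + stride];
            __syncthreads();
        }
        if (tid == 0) blockSums[blockIdx.x] = A[0];   // one partial sum per block
    }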

Registers

Registers are the fastest memory on the GPU. Unlike the CPU, a GPU has thousands of registers. Register usage per thread has to be weighed against the number of threads per block.

The compiler will place variables declared in a kernel in registers when possible. There is a limit to the number of registers; they are divided across warps (groups of 32 threads that operate in SIMT mode) and have the lifetime of the warp.

    __global__ void kernel() {
        int x, y, z;   // scalar locals like these are normally placed in registers
    }

Overview of all Memories

Declaration :

So, how does this GPU actually work? Data transfer and communication between host and GPU.

So, what are the issues with data transfer? Optimizing data traffic between the host and GPU has been, and is still considered to be, an important issue in achieving maximum benefit from the massive performance of the GPU. This applies to any model you use to program GPU applications.

A GPGPU computation begins by transferring the data to graphics memory over PCI Express. PCI Express speed and latency depend on the size of the data, but even in the best case PCI Express bandwidth is about 20 times less than the GPU memory's bandwidth, and 5 times less than the CPU's memory bandwidth.

1) These transfers make the GPU's effective performance slower than it could be.
2) For applications where the data complexity is greater than or equal to the computing complexity, data transfers can become a large overhead.
3) To avoid this latency penalty, the developer has to make a trade-off: compute the whole task on the GPU or compute the whole task on the CPU.


Processing Flow:

1. Copy input data from CPU memory to GPU memory.

2. Load the GPU program and execute it, caching data on chip for performance.

3. Copy the results from GPU memory back to CPU memory.

Techniques to optimize:

Coalescing data transfers to and from global memory

Shared memory bank conflicts

Performance of L1 cache settings

Optimization through CUDA Streams

How can GPU performance be improved? What sorts of things can hinder performance, and why? Are there any tricks that can improve performance in specific situations? How and why? Many approaches exist to optimizing a GPU for best performance. We will take a look at how memory is accessed and shared, and how certain memory bank addressing can get sticky. Also, if memory access is optimal, should we not be able to skip L1 caching? And how can the pipeline between the CPU and GPU be better utilized?
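Before turning to the individual techniques, here is the three-step processing flow above expressed in code (a minimal sketch; the kernel and sizes are hypothetical and error checking is omitted):

    #include <cuda_runtime.h>

    __global__ void square(const float *in, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i] * in[i];
    }

    void run(const float *h_in, float *h_out, int n) {
        size_t bytes = n * sizeof(float);
        float *d_in, *d_out;
        cudaMalloc(&d_in, bytes);
        cudaMalloc(&d_out, bytes);

        cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);      // 1. copy in
        square<<<(n + 255) / 256, 256>>>(d_in, d_out, n);           // 2. execute on the GPU
        cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);    // 3. copy results out

        cudaFree(d_in);
        cudaFree(d_out);
    }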

Coalescing data transfers to and from global memory

Memory Coalescing

A coalesced memory transaction is one in which all of the threads in a half-warp access global memory at the same time. This is oversimplified, but the correct way to do it is simply to have consecutive threads access consecutive memory addresses.

This is uncoalesced access: all elements in a row are placed in consecutive locations (row-major order), but the column elements are accessed first rather than the consecutive elements, so consecutive threads do not touch consecutive addresses.


This is coalesced access: all elements in a row are placed in consecutive locations (row-major order), and the row elements are accessed along with the consecutive elements, so consecutive threads touch consecutive addresses.
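A minimal sketch contrasting the two access patterns for a row-major matrix (the width and kernels are illustrative):

    #define WIDTH 1024   // square, row-major matrix: element (row, col) = m[row * WIDTH + col]

    // Coalesced: consecutive threads (threadIdx.x) read consecutive addresses within a row.
    __global__ void rowRead(const float *m, float *out) {
        int row = blockIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        out[row * WIDTH + col] = m[row * WIDTH + col];
    }

    // Uncoalesced: consecutive threads read elements WIDTH apart (walking down a column).
    __global__ void colRead(const float *m, float *out) {
        int col = blockIdx.y;
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        out[row * WIDTH + col] = m[row * WIDTH + col];
    }

    // Launch for both: dim3 grid(WIDTH / 256, WIDTH); kernel<<<grid, 256>>>(d_m, d_out);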


Shared memory bank conflicts

Memory Banks

The device can fetch A[0], A[1], A[2], A[3], ..., A[B-1] at the same time, where there are B banks.

Shared Memory Banks

Shared memory is divided into 16 or 32 banks of 32-bit width. Banks can be accessed simultaneously. To achieve maximum bandwidth, threads in a half-warp should access different banks of shared memory. Exception: all threads read the same location, which results in a broadcast operation.

Shared Memory Architecture

Many threads need to access memory at the same time, so memory is divided into banks and interleaved among them. Each bank services one address per cycle, so shared memory can service as many accesses as there are banks. Multiple simultaneous accesses to the same bank cause a bank conflict, and the accesses are serialized. Successive 32-bit words are assigned to successive banks, so with N banks a given bank services the words whose byte addresses are equal modulo 4N.

Shared Memory Bank Conflicts

Shared memory is as fast as registers, since it is stored on chip, provided there are no bank conflicts.

Fast cases: all threads in a warp access different banks, or all threads in a warp access the same address (a broadcast: the data is read once and given to all threads).

Slow case: a bank conflict, where multiple threads access the same bank at different addresses, so the accesses must be serialized.

Shared Memory Banks

One helpful feature for reading from memory banks is broadcast: if multiple threads read the same address at the same time, only one read is performed and the data is broadcast to all reading threads.

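A minimal sketch of a classic bank-conflict case and the usual padding fix, using a shared-memory matrix transpose with an illustrative 32x32 tile (square matrix, blockDim = (TILE, TILE)):

    #define TILE 32

    // Naive version: reading tile[threadIdx.x][threadIdx.y] makes the 32 threads of a
    // warp hit the same bank (stride of 32 words), causing a 32-way bank conflict.
    __global__ void transposeConflicts(float *out, const float *in) {
        __shared__ float tile[TILE][TILE];
        int width = gridDim.x * TILE;
        int x = blockIdx.x * TILE + threadIdx.x;
        int y = blockIdx.y * TILE + threadIdx.y;
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];
        __syncthreads();
        x = blockIdx.y * TILE + threadIdx.x;   // transposed block offsets
        y = blockIdx.x * TILE + threadIdx.y;
        out[y * width + x] = tile[threadIdx.x][threadIdx.y];
    }

    // Padding each row by one word shifts rows into different banks: conflict-free.
    __global__ void transposePadded(float *out, const float *in) {
        __shared__ float tile[TILE][TILE + 1];
        int width = gridDim.x * TILE;
        int x = blockIdx.x * TILE + threadIdx.x;
        int y = blockIdx.y * TILE + threadIdx.y;
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];
        __syncthreads();
        x = blockIdx.y * TILE + threadIdx.x;
        y = blockIdx.x * TILE + threadIdx.y;
        out[y * width + x] = tile[threadIdx.x][threadIdx.y];
    }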

Performance of L1 cache settings

What does the L1 cache do?

In general the L1 cache is fast and enhances the performance of the system. When a request comes in, the L1 cache is checked first, followed by L2 and so on: if the requested data is in L1 it is serviced there; if not, L2 is checked, and so on until main memory is accessed. When the request is serviced, the resulting memory line is brought all the way through the caches, from main memory to L1.

When a large number of non-consecutive data fetches needs to occur, the L1 cache may actually hinder overall performance instead of enhancing it. As a result, better performance might be achieved by turning off the L1 cache when it can be determined that a large number of non-consecutive fetches is about to occur. We will therefore create an experiment that alternates between turning the L1 cache on and off while performing large numbers of consecutive and non-consecutive data fetches, to determine in which situations performance benefits from disabling the L1 cache. Cache performance can be analyzed with the profiler. As an aside to this experiment, the L2 cache data will also be analyzed to attempt to gain an understanding of L2 cache performance in the CUDA Fermi architecture.

Disabling the L1 cache

Caches increase request throughput and minimize request latency, and thus improve performance. So why would we want to disable the L1 cache? When fetches occur during a large number of non-consecutive data accesses, the L1 cache delays overall performance instead of enhancing it, so better performance can be achieved by disabling it.

Can we disable L2 or L3? No. L2 is generally quite a bit larger than L1 and its miss rate is a bit better; L2 is also typically shared by the GPU cores, so disabling it for a process on one core would affect the other cores. L3 is not available on the GPU.

How is it done?

We can disable L1 caching of global loads using the nvcc compiler flag -Xptxas -dlcm=cg. The program can be compiled with the L1 cache enabled or disabled, and we then examine whether it runs faster with the cache enabled or disabled.
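A small sketch of how such an experiment might be set up; the kernel and file name are illustrative, and the flag affects global-memory loads:

    // strided.cu -- a deliberately non-consecutive (strided) access pattern.
    __global__ void stridedCopy(const float *in, float *out, int n, int stride) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int j = (i * stride) % n;            // scattered fetches defeat the L1 cache
        if (i < n) out[i] = in[j];
    }

    // Build the same source twice and time both binaries:
    //   nvcc -O3 strided.cu -o l1_on                     (default -dlcm=ca: cache in L1 and L2)
    //   nvcc -O3 -Xptxas -dlcm=cg strided.cu -o l1_off   (cache global loads in L2 only)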

Optimization through CUDA Streams

We have seen that various applications run much faster on the GPU than on the CPU. Then why are not all applications run on the GPU rather than on the CPU? The major showstopper is the data transfer time needed to send data from RAM to GPU memory. Any GPGPU application first needs to handle the CPU-GPU transfer time properly: if the data transfer time cancels out the performance gain of running on the GPU, there is no point in running the application on the GPU. Hence there is a need to optimize, or reduce, the time for data transfers between CPU and GPU.

CPU-GPU interaction

This is one of the key optimizations for any GPGPU application. Cause: PCI bandwidth is much lower than GPU memory bandwidth. Problems faced:

Tuning and optimizing data traffic between the host and GPU has been, and continues to be, important to achieving maximum benefit from the massive performance of the GPU. This is true regardless of which model you use to program GPU applications.

Before a GPGPU computation, data is transferred to graphics memory over PCI Express. PCI Express speed and latency depend on the size of the data, but even in the best case PCI Express bandwidth is about 20 times less than the GPU memory's bandwidth, and 5 times less than the CPU's memory bandwidth.

1) These transfers make the GPU's effective performance slower than it could be.
2) For applications where the data complexity is greater than or equal to the computing complexity, data transfers can become a large overhead.
3) To avoid this latency penalty, the developer has to make a trade-off: compute the whole task on the GPU or compute the whole task on the CPU. (Example: PCI Express has very high latencies. Because of this latency penalty, it is not possible to choose the most efficient computing device, CPU or GPU, for a small task. If a code has three parts A, B and C, where A and C are faster on the GPU and B is faster on the CPU, the developer cannot use the best device for each part of the code.)

Optimizing data transfer between host and device

Minimize the amount of data transferred between host and device.

Batch many small transfers into one larger transfer.

Optimizations through CUDA: higher bandwidth is possible between the host and the device when using page-locked (or pinned) memory, and data transfers between the host and device can sometimes be overlapped with kernel execution and other data transfers.

Optimizing, contd.

Minimizing data transfers: manipulate intermediate data directly on the GPU, and prefer approaches that need less data transfer to and from the GPU. Minimal transfer is not applicable to every kind of GPGPU application.

Group transfer: one larger transfer rather than multiple smaller transfers. Group transfer does not, however, reduce or hide the CPU-GPU data transfer latency.

Optimizations through CUDA

Pinned or non-pageable memory optimization: decreases the time to copy data from CPU to GPU.

Optimization through multiple streams: hides the transfer time by overlapping kernel execution and memory transfers.

Pinned memory

Pinned memory is page-locked memory that is not paged in or out of main memory by the OS; it remains resident. Pinned memory enables faster PCI copies, reducing the data transfer time between CPU and GPU.

It also allows memory copies to be asynchronous with respect to the CPU and GPU. CUDA has a mechanism to allocate pinned memory on the host.

Zero copy: zero-copy memory means using pinned host memory directly from a kernel. Global device memory access is slow, and zero-copy host memory access is even slower, but since you do not have to do a memory copy to or from the buffer around the kernel, it may make the overall application faster if the data is only used as a one-time input or output.

The main drawback of pinned memory is that overdoing it reduces the memory available to the host for paging, which reduces host performance. Allocate pinned host memory in CUDA C/C++ using cudaMallocHost(); a sketch follows below.

What are streams?

A stream in CUDA is a sequence of operations that execute on the device in the order in which they are issued by the host code. One host thread can define multiple CUDA streams. The typical operations in a stream are: invoking a data transfer, invoking a kernel execution, and handling events.
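A minimal sketch of allocating and freeing pinned host memory, as mentioned above (sizes are illustrative):

    #include <cuda_runtime.h>

    int main(void) {
        const int n = 1 << 20;
        float *h_pinned, *d_buf;

        cudaMallocHost(&h_pinned, n * sizeof(float));   // page-locked host allocation
        cudaMalloc(&d_buf, n * sizeof(float));

        // Pinned memory gives faster copies, and allows asynchronous ones if desired.
        cudaMemcpy(d_buf, h_pinned, n * sizeof(float), cudaMemcpyHostToDevice);

        cudaFree(d_buf);
        cudaFreeHost(h_pinned);                         // pinned memory has its own free
        return 0;
    }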

Asynchronous and overlapping transfers

Blocking transfers: control is returned to the host thread only after the data transfer is complete. Data transfers between the host and the device using cudaMemcpy() are blocking transfers. The cudaMemcpyAsync() function is a non-blocking variant of cudaMemcpy() in which control is returned immediately to the host thread. On all CUDA-enabled devices, it is possible to overlap host computation with asynchronous data transfers and with device computations. In order to make executions run concurrently we need asynchronous calls.

Asynchronous calls are calls for which control is returned to the host thread before the device has completed the requested task.

In general, host-device data transfers using cudaMemcpy() are blocking: control is returned to the host thread only after the data transfer is complete.

There is a non-blocking variant, cudaMemcpyAsync(). The host does not wait for the device to finish the memory copy or the kernel call before it starts executing a following cpuFunction() call. A kernel issued in the same stream as the copy still only launches after the memory copy finishes; issued in a different stream, the two can overlap:

    cudaStream_t stream1, stream2;
    cudaStreamCreate(&stream1);
    cudaStreamCreate(&stream2);
    cudaMemcpyAsync(dst, src, size, dir, stream1);
    kernel<<<grid, block, 0, stream2>>>(...);
    // The last two statements are overlapped.
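A fuller sketch of the same idea: splitting the work into chunks and issuing all host-to-device copies first, then the kernels, then the device-to-host copies, each chunk in its own stream so copies and execution can overlap (this matches "version 2" discussed below; the kernel and sizes are illustrative, and the host buffer must be pinned):

    #include <cuda_runtime.h>

    __global__ void process(float *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] = data[i] * 2.0f + 1.0f;
    }

    // h_data must be allocated with cudaMallocHost() for the async copies to overlap.
    void runChunked(float *h_data, float *d_data, int n, int nStreams) {
        int chunk = n / nStreams;                 // assume n divides evenly
        size_t bytes = chunk * sizeof(float);
        cudaStream_t streams[8];                  // assume nStreams <= 8
        for (int s = 0; s < nStreams; ++s) cudaStreamCreate(&streams[s]);

        for (int s = 0; s < nStreams; ++s)        // all host-to-device copies first
            cudaMemcpyAsync(d_data + s * chunk, h_data + s * chunk, bytes,
                            cudaMemcpyHostToDevice, streams[s]);
        for (int s = 0; s < nStreams; ++s)        // then the kernels
            process<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(d_data + s * chunk, chunk);
        for (int s = 0; s < nStreams; ++s)        // then the device-to-host copies
            cudaMemcpyAsync(h_data + s * chunk, d_data + s * chunk, bytes,
                            cudaMemcpyDeviceToHost, streams[s]);

        cudaDeviceSynchronize();
        for (int s = 0; s < nStreams; ++s) cudaStreamDestroy(streams[s]);
    }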

CUDA Streams

For the sequential version, there is no overlap in any of the operations. For the first asynchronous version of the code, the order of execution in the copy engine is: H2D stream(1), D2H stream(1), H2D stream(2), D2H stream(2), and so forth. This is why we do not see any speed-up when using the first asynchronous version on the C1060: tasks were issued to the copy engine in an order that precludes any overlap of kernel execution and data transfer. For version two, however, where all the host-to-device transfers are issued before any of the device-to-host transfers, overlap is possible, as indicated by the lower execution time. From the schematic, we expect the execution of asynchronous version 2 to be 8/12 of the sequential version, or 8.7 ms, which is confirmed by the timing results shown on the slide.

Implementation Plan

Implement the previously mentioned optimization techniques and analyze GPU applications with respect to:

Coalescing data transfers to and from global memory

Performance of L1 cache settings

Examine whether our assumptions match the hypothetical results.
