GPU with CUDA Architecture


Presented by: Dhaval Kaneria (13014061010)

Guided by: Mr. Rajesh K. Navandar

Table Of Contents

• Introduction of GPU
• Performance Factors Of GPU
• GPU Pipeline
• Block Diagram Of Pipeline Process Flow
• Introduction Of CUDA
• Thread Batching
• Simple Processing Flow
• CUDA C/C++
• Applications
• The Future Scope Of CUDA Technology
• Conclusion
• References


Introduction of GPU

• A Graphics Processing Unit (GPU) is a microprocessor that has been designed specifically for the processing of 3D graphics.

• The processor is built with integrated transform, lighting, triangle setup/clipping, and rendering engines, capable of handling millions of math-intensive processes per second.

• GPUs form the heart of modern graphics cards, relieving the CPU (central processing unit) of much of the graphics processing load. They allow products such as desktop PCs, portable computers, and game consoles to process real-time 3D graphics that, only a few years ago, were available only on high-end workstations.

• Used primarily for 3D applications, a graphics processing unit is a single-chip processor that creates lighting effects and transforms objects every time a 3D scene is redrawn. These are mathematically intensive tasks that would otherwise put quite a strain on the CPU. Lifting this burden from the CPU frees up cycles that can be used for other jobs.


Performance Factors Of GPU

• Fill Rate: The number of pixels or texels (textured pixels) that the GPU renders to memory per second. It is a direct indicator of the GPU's rendering power. Modern GPUs have fill rates as high as 3.2 billion pixels per second. The fill rate can be increased by raising the GPU clock.

• Memory Bandwidth: The data transfer speed between the graphics chip and its local frame buffer. More bandwidth usually gives better performance when the image to be rendered is of high quality and at very high resolution.

• Memory Management: GPU performance also depends on how efficiently memory is managed, because memory bandwidth can become the bottleneck if it is not managed properly.

• Hidden Surface Removal: Reducing overdraw when rendering a scene by not rendering surfaces that are not visible. This greatly improves GPU performance.
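As a rough worked example, peak fill rate and peak memory bandwidth are often estimated from clock rates. The unit counts, clocks, and bus width below are hypothetical values chosen only so that the fill rate lands on the 3.2 billion pixels per second quoted above; they do not describe any particular GPU.

```latex
\text{fill rate} \approx N_{\text{pixels per clock}} \times f_{\text{core}}
  = 16 \times 200\,\text{MHz} = 3.2 \times 10^{9}\ \text{pixels/s}

\text{memory bandwidth} \approx \frac{W_{\text{bus}}}{8\ \text{bit/byte}} \times f_{\text{mem, effective}}
  = \frac{256\ \text{bit}}{8\ \text{bit/byte}} \times 1\,\text{GHz} = 32\ \text{GB/s}
```

Under this estimate, doubling the core clock doubles the peak fill rate, which is why raising the clock raises the fill rate.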


GPU Pipeline

• The GPU receives geometry information from the CPU as input and provides a picture as output.
• The host interface is the communication bridge between the CPU and the GPU.
• It receives commands from the CPU and also pulls geometry information from system memory.
• It outputs a stream of vertices in object space with all their associated information (normals, texture coordinates, per-vertex color, etc.).
• The vertex processing stage receives vertices from the host interface in object space and outputs them in screen space.
• This may be a simple linear transformation, or a complex operation involving morphing effects.

[Figure: pipeline stages, in order: host interface → vertex processing → triangle setup → pixel processing → memory interface]

GPU Pipeline (cont.)

• A fragment is generated if and only if its center is inside the triangle.
• Every generated fragment has its attributes computed as the perspective-correct interpolation of the attributes of the three vertices that make up the triangle.
• Each fragment provided by triangle setup is fed into fragment processing as a set of attributes (position, normal, texture coordinates, etc.), which are used to compute the final color for the pixel.
• Before the final write occurs, some fragments are rejected by the z-buffer, stencil, and alpha tests.
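The "center inside the triangle" rule can be sketched with three edge-function tests per pixel. This is only an illustration in CUDA, not how fixed-function triangle setup hardware is built: the `Triangle` struct, the function names, the counter-clockwise vertex order, and the inclusive treatment of edges (real rasterizers apply a tie-breaking fill rule) are all assumptions of this sketch.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Screen-space triangle; counter-clockwise vertex order is assumed.
struct Triangle { float x0, y0, x1, y1, x2, y2; };

// Edge function: non-negative when (px, py) is on or to the left of edge a -> b.
__host__ __device__ float edge(float ax, float ay, float bx, float by,
                               float px, float py) {
    return (bx - ax) * (py - ay) - (by - ay) * (px - ax);
}

// A fragment is generated only when the pixel center passes all three edge tests.
__host__ __device__ bool centerInside(Triangle t, float px, float py) {
    return edge(t.x0, t.y0, t.x1, t.y1, px, py) >= 0.0f &&
           edge(t.x1, t.y1, t.x2, t.y2, px, py) >= 0.0f &&
           edge(t.x2, t.y2, t.x0, t.y0, px, py) >= 0.0f;
}

// One thread per pixel; the pixel center sits at (x + 0.5, y + 0.5).
__global__ void coverage(Triangle t, int width, int height, unsigned char* mask) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= width * height) return;
    int x = i % width;
    int y = i / width;
    mask[i] = centerInside(t, x + 0.5f, y + 0.5f) ? 1 : 0;
}

int main() {
    const int width = 8, height = 8, n = width * height;
    Triangle t = {0.0f, 0.0f, 8.0f, 0.0f, 0.0f, 8.0f};
    unsigned char hMask[n];
    unsigned char* dMask = nullptr;

    cudaMalloc(&dMask, n);
    coverage<<<1, n>>>(t, width, height, dMask);
    cudaMemcpy(hMask, dMask, n, cudaMemcpyDeviceToHost);
    cudaFree(dMask);

    // Print the coverage mask: '#' marks pixels that produce a fragment.
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) putchar(hMask[y * width + x] ? '#' : '.');
        putchar('\n');
    }
    return 0;
}
```

The same three edge values, once normalized, are the barycentric weights that attribute interpolation starts from; the perspective correction itself is not shown here.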


Block Diagram Of Pipeline Process Flow


Block Diagram Of Pipeline Process Flow (cont.)

• Allows a shader to be applied to each vertex: transformation and other per-vertex operations
• Allows the vertex shader to fetch texture data
• Cull/clip: per-primitive operations and data preparation for rasterization
• Rasterization: primitive-to-pixel mapping
• Z culling: quick pixel elimination based on depth
• Fragment: a candidate pixel
• Varying number of pixel pipelines
• SIMD processing hides texture fetch latency


Introduction Of CUDA


• CUDA (Compute Unified Device Architecture) is a parallel computing platform and programming model developed by NVIDIA and implemented on its graphics processing units.

CUDA Programming Model: A Highly Multithreaded Coprocessor

• The GPU is viewed as a compute device that:
  – Is a coprocessor to the CPU, or host
  – Has its own DRAM (device memory)
  – Runs many threads in parallel
• Data-parallel portions of an application are executed on the device as kernels, which run in parallel on many threads
• Differences between GPU and CPU threads:
  – GPU threads are extremely lightweight, with very little creation overhead
  – A GPU needs thousands of threads for full efficiency; a multi-core CPU needs only a few
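A minimal sketch of this model: the data-parallel loop of a SAXPY computation (y = a*x + y) becomes a kernel, and the host launches about a million threads, one per element. The names `saxpy` and `runSaxpy` are illustrative, and managed memory (`cudaMallocManaged`) is an assumption made for brevity so that explicit host-to-device copies can be left out; error checking is omitted.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Kernel: runs on the device, one lightweight thread per array element.
__global__ void saxpy(int n, float a, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

// Host side: launches n threads and returns true when every result is correct.
bool runSaxpy(int n) {
    float *x = nullptr, *y = nullptr;
    // Managed memory (assumed for brevity) is reachable from both host and device.
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    int threads = 256;                          // threads per block
    int blocks = (n + threads - 1) / threads;   // enough blocks to cover n
    saxpy<<<blocks, threads>>>(n, 3.0f, x, y);
    cudaDeviceSynchronize();                    // wait before reading y on the host

    bool ok = true;
    for (int i = 0; i < n; ++i) ok = ok && (y[i] == 5.0f);   // 3 * 1 + 2
    cudaFree(x);
    cudaFree(y);
    return ok;
}

int main() {
    const int n = 1 << 20;                      // about a million threads
    printf("saxpy on %d threads: %s\n", n, runSaxpy(n) ? "ok" : "failed");
    return 0;
}
```

The serial parts (setup, verification) stay on the CPU, while only the per-element arithmetic runs on the device.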

Thread Batching: Grids and Blocks

• A kernel is executed as a grid of thread blocks
  – All threads share data memory space
• A thread block is a batch of threads that can cooperate with each other by:
  – Synchronizing their execution, for hazard-free shared memory accesses
  – Efficiently sharing data through a low-latency shared memory
• Two threads from two different blocks cannot cooperate
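The cooperation rules above can be sketched with a per-block sum. Threads of one block share a tile in shared memory and synchronize with `__syncthreads()`; because blocks cannot cooperate, each block writes its own partial result and the host adds them up. The names `blockSum` and `sumOfOnes` and the block size of 128 are assumptions of this sketch, and error checking is omitted.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define THREADS 128   // threads per block; must be a power of two for this reduction

// Each block sums THREADS consecutive inputs and writes one partial sum.
__global__ void blockSum(const int* in, int* partial) {
    __shared__ int tile[THREADS];               // visible to this block only
    int tid = threadIdx.x;
    tile[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();                            // all loads done before any thread reads

    // Tree reduction: halve the number of active threads each step.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride) tile[tid] += tile[tid + stride];
        __syncthreads();                        // avoid read/write hazards between steps
    }
    if (tid == 0) partial[blockIdx.x] = tile[0];
}

// Host helper: sums an array of ones with `blocks` thread blocks.
int sumOfOnes(int blocks) {
    int n = blocks * THREADS;
    std::vector<int> hIn(n, 1), hPartial(blocks, 0);
    int *dIn = nullptr, *dPartial = nullptr;
    cudaMalloc(&dIn, n * sizeof(int));
    cudaMalloc(&dPartial, blocks * sizeof(int));
    cudaMemcpy(dIn, hIn.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    blockSum<<<blocks, THREADS>>>(dIn, dPartial);
    cudaMemcpy(hPartial.data(), dPartial, blocks * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dPartial);

    // Blocks cannot cooperate, so the host combines the per-block results.
    int total = 0;
    for (int b = 0; b < blocks; ++b) total += hPartial[b];
    return total;
}

int main() {
    int blocks = 4;
    printf("total = %d (input held %d ones)\n", sumOfOnes(blocks), blocks * THREADS);
    return 0;
}
```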

[Figure: The host launches Kernel 1 and Kernel 2. Each kernel runs on the device as a grid (Grid 1, Grid 2) of thread blocks, Block(0, 0) through Block(2, 1). Block(1, 1) is expanded to show its threads, Thread(0, 0) through Thread(4, 2). Courtesy: NVIDIA]

Block and Thread IDs

• Threads and blocks have IDs
  – So each thread can decide what data to work on
  – Block ID: 1D or 2D (also 3D on devices of compute capability 2.0 and later)
  – Thread ID: 1D, 2D, or 3D
• Simplifies memory addressing when processing multidimensional data
  – Image processing
  – Solving PDEs on volumes
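For image processing, 2D block and thread IDs map directly onto pixel coordinates. The sketch below inverts a grayscale image; the kernel name `invert`, the helper `invertOnGpu`, the 16 x 16 block shape, and the synthetic test image are all assumptions made for illustration.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Each thread derives its pixel coordinates from its 2D block and thread IDs.
__global__ void invert(unsigned char* img, int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // column
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // row
    if (x < width && y < height) {                   // guard the partial edge blocks
        int i = y * width + x;
        img[i] = 255 - img[i];
    }
}

// Host helper: inverts a synthetic image on the GPU and checks it against the CPU.
bool invertOnGpu(int width, int height) {
    std::vector<unsigned char> h(width * height);
    for (size_t i = 0; i < h.size(); ++i) h[i] = (unsigned char)(i % 256);

    unsigned char* d = nullptr;
    cudaMalloc(&d, h.size());
    cudaMemcpy(d, h.data(), h.size(), cudaMemcpyHostToDevice);

    dim3 block(16, 16);                              // 256 threads per block
    dim3 grid((width + block.x - 1) / block.x,       // round up to cover the image
              (height + block.y - 1) / block.y);
    invert<<<grid, block>>>(d, width, height);

    cudaMemcpy(h.data(), d, h.size(), cudaMemcpyDeviceToHost);
    cudaFree(d);

    for (size_t i = 0; i < h.size(); ++i)
        if (h[i] != (unsigned char)(255 - i % 256)) return false;
    return true;
}

int main() {
    printf("invert 100x60: %s\n", invertOnGpu(100, 60) ? "ok" : "failed");
    return 0;
}
```

The image size is deliberately not a multiple of 16, so the bounds check in the kernel is exercised by the blocks on the right and bottom edges.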

[Figure: Grid 1 on the device, made up of Block(0, 0) through Block(2, 1). Block(1, 1) is expanded to show its threads, Thread(0, 0) through Thread(4, 2). Courtesy: NVIDIA]

CUDA Device Memory Space Overview

• Each thread can:
  – R/W per-thread registers
  – R/W per-thread local memory
  – R/W per-block shared memory
  – R/W per-grid global memory
  – Read only per-grid constant memory
  – Read only per-grid texture memory

[Figure: device memory spaces. The (device) grid holds global, constant, and texture memory. Each block, Block (0, 0) and Block (1, 0), has its own shared memory. Each thread, Thread (0, 0) and Thread (1, 0), has its own registers and local memory. The host sits outside the device.]

The host can R/W global, constant, and texture memories

Global, Constant, and Texture Memories

• Global memory
  – Main means of communicating R/W data between host and device
  – Contents visible to all threads
• Texture and constant memories
  – Constants initialized by host
  – Contents visible to all threads
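A small sketch of how the host uses these spaces: it writes two coefficients into constant memory with `cudaMemcpyToSymbol`, moves the input and output arrays through global memory with `cudaMemcpy`, and every thread reads the same constants. The symbol `coeff`, the kernel `affine`, and the helper `affineOnGpu` are hypothetical names; texture memory is not shown and error checking is omitted.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define N 8

// Constant memory: initialized by the host, read-only and visible to all threads.
__constant__ float coeff[2];

// in and out point to global memory, readable and writable by every thread.
__global__ void affine(const float* in, float* out) {
    int i = threadIdx.x;
    out[i] = coeff[0] * in[i] + coeff[1];
}

// Host helper: computes a * x + b on the device for N values.
void affineOnGpu(float a, float b, const float* hIn, float* hOut) {
    float hCoeff[2] = {a, b};
    float *dIn = nullptr, *dOut = nullptr;

    cudaMemcpyToSymbol(coeff, hCoeff, sizeof(hCoeff));                  // host writes constants
    cudaMalloc(&dIn, N * sizeof(float));
    cudaMalloc(&dOut, N * sizeof(float));
    cudaMemcpy(dIn, hIn, N * sizeof(float), cudaMemcpyHostToDevice);    // host writes global

    affine<<<1, N>>>(dIn, dOut);

    cudaMemcpy(hOut, dOut, N * sizeof(float), cudaMemcpyDeviceToHost);  // host reads global
    cudaFree(dIn);
    cudaFree(dOut);
}

int main() {
    float hIn[N], hOut[N];
    for (int i = 0; i < N; ++i) hIn[i] = (float)i;
    affineOnGpu(2.0f, 1.0f, hIn, hOut);
    for (int i = 0; i < N; ++i) printf("%g -> %g\n", hIn[i], hOut[i]);
    return 0;
}
```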

[Figure: device memory spaces. The host reads and writes the per-grid global, constant, and texture memories; each block has its own shared memory, and each thread has its own registers and local memory. Courtesy: NVIDIA]

Simple Processing Flow

1. Copy input data from CPU memory to GPU memory
2. The CPU instructs the GPU to start processing
3. Load the GPU program and execute it, caching data on chip for performance
4. Copy results from GPU memory to CPU memory
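The four steps map onto a handful of CUDA runtime calls. The sketch below squares an array of integers and labels each step in the comments; the `CHECK` macro, the kernel `square`, and the helper `squareOnGpu` are names invented for this example, not part of the CUDA API.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a readable message if a CUDA runtime call fails.
#define CHECK(call)                                                        \
    do {                                                                   \
        cudaError_t err = (call);                                          \
        if (err != cudaSuccess) {                                          \
            fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));  \
            exit(EXIT_FAILURE);                                            \
        }                                                                  \
    } while (0)

__global__ void square(const int* in, int* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * in[i];
}

void squareOnGpu(const int* hIn, int* hOut, int n) {
    int *dIn = nullptr, *dOut = nullptr;
    size_t bytes = n * sizeof(int);
    CHECK(cudaMalloc(&dIn, bytes));
    CHECK(cudaMalloc(&dOut, bytes));

    // Step 1: copy input data from CPU memory to GPU memory.
    CHECK(cudaMemcpy(dIn, hIn, bytes, cudaMemcpyHostToDevice));

    // Steps 2 and 3: the CPU launches the kernel and the GPU executes it.
    int threads = 128;
    int blocks = (n + threads - 1) / threads;
    square<<<blocks, threads>>>(dIn, dOut, n);
    CHECK(cudaGetLastError());                  // catches launch errors

    // Step 4: copy results back; this cudaMemcpy waits for the kernel to finish.
    CHECK(cudaMemcpy(hOut, dOut, bytes, cudaMemcpyDeviceToHost));

    CHECK(cudaFree(dIn));
    CHECK(cudaFree(dOut));
}

int main() {
    const int n = 8;
    int hIn[n], hOut[n];
    for (int i = 0; i < n; ++i) hIn[i] = i;
    squareOnGpu(hIn, hOut, n);
    for (int i = 0; i < n; ++i) printf("%d^2 = %d\n", hIn[i], hOut[i]);
    return 0;
}
```

Since both copies cross the bus between CPU and GPU, keeping data on the device between kernels is usually preferable when several kernels work on the same arrays.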


CUDA C/C++


• CUDA Language: C with Minimal Extensions

• Philosophy: provide the minimal set of extensions necessary to expose the GPU's power

• Declaration specifiers to indicate where things live
    __global__ void KernelFunc(...);  // kernel function, runs on device
    __device__ int GlobalVar;         // variable in device memory
    __shared__ int SharedVar;         // variable in per-block shared memory

• Extended function invocation syntax for parallel kernel launch
    KernelFunc<<<500, 128>>>(...);    // launch 500 blocks with 128 threads each

• Special variables for thread identification in kernels
    dim3 threadIdx;  dim3 blockIdx;  dim3 blockDim;  dim3 gridDim;

• Intrinsics that expose specific operations in kernel code
    __syncthreads();                  // barrier synchronization within a thread block

Applications


• 3D image analysis
• Adaptive radiation therapy
• Astronomy
• Automobile vision
• Bioinformatics
• Biological simulation
• Broadcast
• Computational Fluid Dynamics
• Computer Vision
• Cryptography
• CT reconstruction
• Data Mining
• Electromagnetic simulation
• Equity training
• Financial (lots of areas)
• Mathematics research
• Military (lots)
• Mine planning
• Molecular dynamics
• MRI reconstruction
• Network processing
• Neural network
• Protein folding
• Quantum chemistry
• Ray tracing
• Radar
• Reservoir simulation
• Robotic vision/AI
• Robotic surgery
• Satellite data analysis
• Seismic imaging
• Surgery simulation

Simulation Result


• If the CUDA software is installed and configured correctly, the deviceQuery sample lists the detected CUDA device and its properties. [Screenshot of the deviceQuery output not reproduced]


• Valid results from the bandwidthTest CUDA sample. [Screenshot of the bandwidthTest output not reproduced]


• Create an array of size BLOCKS, allocate space for the array on the device, and call
    generateArray<<<BLOCKS, 1>>>(deviceArray);

• The kernel now runs as BLOCKS parallel thread blocks of one thread each, creating the entire array in one call.
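The slide does not show the body of `generateArray`, so the version below is a guess at its shape: each block uses `blockIdx.x` to pick the one element it fills. The value written (the square of the block index), the size `BLOCKS = 16`, and the helper `generateOnGpu` are assumptions for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define BLOCKS 16

// Hypothetical kernel body: block b fills element b with b squared.
__global__ void generateArray(int* deviceArray) {
    deviceArray[blockIdx.x] = blockIdx.x * blockIdx.x;
}

// Host helper: allocates the device array, launches the kernel, copies it back.
void generateOnGpu(int* hostArray) {
    int* deviceArray = nullptr;
    cudaMalloc(&deviceArray, BLOCKS * sizeof(int));
    generateArray<<<BLOCKS, 1>>>(deviceArray);   // BLOCKS blocks, one thread each
    cudaMemcpy(hostArray, deviceArray, BLOCKS * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(deviceArray);
}

int main() {
    int hostArray[BLOCKS];
    generateOnGpu(hostArray);
    for (int i = 0; i < BLOCKS; ++i) printf("%d ", hostArray[i]);
    printf("\n");
    return 0;
}
```

One thread per block keeps the example simple, but it leaves most of each multiprocessor idle; a launch such as `<<<1, BLOCKS>>>` indexed by `threadIdx.x` would typically use the hardware better for an array this small.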

The Future Scope Of CUDA Technology

• Most current research focuses on general-purpose GPU computing. Because GPUs are highly efficient, flexible, and programmable parallel processors, a growing number of researchers and businesses have started to use them for computations other than graphics rendering. This has created a new field of study, GPGPU (General-Purpose computation on GPU), whose objective is to use the GPU for a wider range of scientific computing. GPGPU has been used successfully in algebra, fluid simulation, database applications, spectrum analysis, and other non-graphical applications.

• Region-based Software Virtual Memory (RSVM): a software virtual memory that runs on both the CPU and the GPU in a distributed and cooperative way.

• Size reduction
• Cooling techniques


Conclusion

• CUDA is a powerful parallel programming model:
  – Heterogeneous: mixed serial-parallel programming
  – Scalable: hierarchical thread execution model
  – Accessible: minimal but expressive changes to C
• CUDA on GPUs can achieve great results on data-parallel computations with a few simple performance optimization strategies:
  – Structure your application and select execution configurations to maximize exploitation of the GPU's parallel capabilities.
  – Minimize CPU ↔ GPU data transfers.
  – Coalesce global memory accesses.
  – Take advantage of shared memory.
  – Minimize divergent warps.
  – Minimize use of low-throughput instructions.
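To make "coalesce global memory accesses" concrete, the sketch below copies the same array twice: once with consecutive threads touching consecutive addresses, and once with consecutive threads touching addresses 33 elements apart. Both kernels copy every element; only the access pattern differs. The kernel names, the stride of 33, and the helper `timedCopy` are assumptions of this example, and the timings it prints depend entirely on the device, so none are quoted here.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Coalesced: consecutive threads access consecutive addresses.
__global__ void copyCoalesced(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: consecutive threads access addresses `stride` elements apart.
// With n a power of two and stride odd, j still visits every element once.
__global__ void copyStrided(const float* in, float* out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int j = (i * stride) % n;
        out[j] = in[j];
    }
}

// Runs one copy kernel and returns its elapsed time in milliseconds.
float timedCopy(bool strided, const float* dIn, float* dOut, int n) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    int threads = 256;
    int blocks = (n + threads - 1) / threads;

    cudaEventRecord(start);
    if (strided) copyStrided<<<blocks, threads>>>(dIn, dOut, n, 33);
    else         copyCoalesced<<<blocks, threads>>>(dIn, dOut, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);                 // wait for the kernel to finish

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main() {
    const int n = 1 << 22;                      // power of two (see copyStrided)
    float *dIn = nullptr, *dOut = nullptr;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dOut, n * sizeof(float));
    cudaMemset(dIn, 0, n * sizeof(float));

    printf("coalesced: %.3f ms\n", timedCopy(false, dIn, dOut, n));
    printf("strided:   %.3f ms\n", timedCopy(true, dIn, dOut, n));

    cudaFree(dIn);
    cudaFree(dOut);
    return 0;
}
```

The first launch also pays one-time startup costs, so for a fair comparison each kernel would normally be run several times and the later runs compared.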


References

1. Xiao Yang, Shamik K. Valia, Michael J. Schulte, Ruby B. Lee, "Exploration and Evaluation of PLX Floating-point Instructions and Implementations for 3D Graphics", IEEE, 2004.
2. Lei Wang, Yong-zhong Huang, Xin Chen, Chun-yan Zhang, "Task Scheduling of Parallel Processing in CPU-GPU Collaborative Environment", CSIT, 2008.
3. Feng Ji, Heshan Lin, Xiaosong Ma, "RSVM: a Region-based Software Virtual Memory for GPU", IEEE, 2013.
4. Nathan Whitehead, Alex Fit-Florea, "CUDA_Architecture_Overview", NVIDIA Corporation.
5. Cyril Zeller, "CUDA C/C++ Basics", NVIDIA Corporation.
6. Mark Harris, "Optimizing Parallel Reduction in CUDA", NVIDIA Developer Technology.


Thank-You
