Final Review Content 1
GPGPU PROGRAMMING WITH CUDA
Prepared by,
SAVITH. S (14CA65)
MCA, II Sem, NIT-K, Surathkal
19/17/2015
INTRODUCTION
GPU stands for Graphics Processing Unit (also called a visual processing unit, VPU); in everyday terms, the graphics card.
An electronic circuit used to accelerate the creation of images in a frame buffer and enhance output quality.
Generally interacts with the motherboard through a PCI Express (PCIe) or AGP port.
A very efficient tool for manipulating computer graphics.
Today, parallel GPUs have begun making computational inroads against the CPU.
The GPU's parallel architecture opens up a wide range of uses, which led to the development of GPGPU.
ABSTRACT
GPGPU stands for general-purpose computing on graphics processing units: using the GPU for algorithms that are traditionally run on the CPU. It makes such algorithms much faster to execute and saves processing time, so a wide range of applications becomes possible. Any GPU providing a functionally complete set of operations performed on arbitrary bits can compute any computable value. Additionally, the use of multiple graphics cards in one computer, or large numbers of graphics chips, further parallelizes the already parallel nature of graphics processing. So, by customizing a GPU, we can implement a GPGPU system that can be up to a hundred times faster than a traditional CPU on suitable workloads.
GPGPU
Utilization of the GPU for computations traditionally handled by the CPU.
Can perform any set of operations accurately and can compute any computable value.
The use of multiple graphics cards in one computer, or large numbers of graphics chips, further parallelizes the already parallel nature of graphics processing.
This concept turns the massive computational power of a modern graphics accelerator's shader pipeline into general-purpose computing power.
STREAM PROCESSING
A stream is a set of records that need similar computations.
A traditional GPU can read multiple independent records simultaneously, perform operations, and write multiple outputs, but it never has a piece of memory that is both readable and writable.
Stream processing harnesses the massive computational power of a GPU for general-purpose, CPU-style operations.
It is a generalization of the GPU.
CUDA
CUDA stands for Compute Unified Device Architecture, developed by the NVIDIA Corporation.
It is used to develop software for graphics processors: a variety of general-purpose applications that are highly parallel in nature and run on the GPU's hundreds of processor cores.
CUDA is supported only on NVIDIA GPUs based on the Tesla architecture. Graphics cards that support CUDA include the GeForce 8 series, Quadro, and Tesla.
CUDA programs contain special functions, called kernels. A kernel is executed N times in parallel on the GPU by N threads.
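A minimal sketch of the kernel idea in CUDA C (the kernel name, data, and launch sizes are illustrative; building and running it requires NVIDIA's nvcc compiler and a CUDA-capable GPU):

```cuda
#include <stdio.h>

/* Kernel: each of the N threads increments one element. */
__global__ void addOne(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  /* unique thread index */
    if (i < n)
        data[i] += 1;
}

int main(void) {
    const int N = 256;
    int *d_data;
    cudaMalloc(&d_data, N * sizeof(int));
    cudaMemset(d_data, 0, N * sizeof(int));

    /* Launch N threads in parallel: 2 blocks of 128 threads each. */
    addOne<<<2, 128>>>(d_data, N);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}
```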
EXISTING SYSTEM
The existing system may be a traditional GPU or a CPU.
A CPU is used in the normal case; compared to a GPU it has less processing power and throughput, and far less parallelism.
A GPU is the basic form of a GPGPU, but it is used only for graphics acceleration: game consoles, high-definition images, computer-aided design, etc.
PROPOSED SYSTEM
The GPGPU overcomes the limitations of a traditional CPU through its highly parallel nature.
In principle, any Boolean function can be built up from a functionally complete set of logic operators.
GPGPU applications need high arithmetic intensity; otherwise, memory access latency will limit the computational speed-up.
Ideal GPGPU applications have large data sets, high parallelism, and minimal dependency between data elements.
EXPECTED FUNCTIONALITIES
High performance: up to about a hundred times faster than a traditional CPU on suitable workloads.
Highly parallel behavior.
Contains multiple cores, each able to execute independently.
Installing multiple GPUs in a single system improves its capabilities further.
Can be customized for many purposes using platforms such as CUDA, OpenCL, etc.
Major upcoming applications in high-performance computing areas.
WORKING
SYSTEM IMPLEMENTATION
The system is implemented by assembling the GPU hardware in a normal or special-purpose computer and installing the software.
No separate software is needed to customize a GPU; it can be done in CUDA C or CUDA C++ with NVIDIA's compiler.
Programming proceeds in the usual way, but some additional header files, such as #include <mpi.h> (explained in a later section), must be included in the program.
HARDWARE REQUIREMENTS
GPU shader cores, which run GPU kernels, are both parallel and deeply multithreaded to provide significant computational power, currently on the order of a teraflop per GPU.
Graphics memory, which is directly accessible by GPU kernels, has a high clock rate and a wide bus to provide substantial bandwidth, currently about a hundred gigabytes per second.
GPU interconnect, providing mainboard access to the GPU. This is typically PCI Express, which delivers a few gigabytes per second of bandwidth.
Main board RAM, which is directly accessible by CPU programs and the network.
CPU cores, which are deeply pipelined and superscalar to provide good performance on sequential programs.
Network hardware, which moves bytes between nodes; its performance can be modeled simply as latency plus bandwidth.
THE GRID AND BLOCK STRUCTURE
The grid consists of one-dimensional, two-dimensional, or three-dimensional thread blocks.
Each thread block is further divided into one-dimensional or two-dimensional threads.
A thread block is a set of threads running on one processor.
All thread creation, execution, and termination is automatic, handled by the GPU, and invisible to the programmer.
The user only needs to specify the number of threads in a thread block and the number of thread blocks in a grid.
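The grid/block specification above can be sketched in CUDA C (the kernel name and sizes are illustrative; only the two launch numbers are supplied by the user, everything else is handled by the GPU):

```cuda
#include <stdio.h>

__global__ void whoAmI(void) {
    /* Each thread derives a unique global index from the block and
     * thread coordinates that the GPU assigns automatically. */
    int global = blockIdx.x * blockDim.x + threadIdx.x;
    printf("block %d, thread %d -> global %d\n",
           blockIdx.x, threadIdx.x, global);
}

int main(void) {
    dim3 grid(4);    /* 4 thread blocks in a one-dimensional grid */
    dim3 block(64);  /* 64 threads per block                      */
    whoAmI<<<grid, block>>>();   /* the user specifies only these two numbers */
    cudaDeviceSynchronize();
    return 0;
}
```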
SINGLE PROGRAM MULTIPLE DATA (SPMD) & MPICH
The GPU is suited to single-program, multiple-data parallel calculations and works well with the message-passing interface (MPI) approach to programming.
In the SPMD model, a single program controls the various activities performed on the GPU.
There is no direct connection between the network device and GPU memory.
Thus, to send GPU data across the network, we must first copy the send-side GPU data to CPU memory.
We then use a standard CPU interface such as MPI, and finally copy the received data from CPU memory into GPU memory.
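The copy-then-send pattern above can be sketched as follows; the function and buffer names are illustrative, and the sketch assumes the standard CUDA runtime and MPI APIs (it must be built with nvcc plus an MPI library and run under mpirun/mpdboot):

```cuda
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdlib.h>

/* Send one GPU buffer from rank 0 to rank 1 via CPU staging buffers,
 * since the network cannot read GPU memory directly. */
void send_gpu_buffer(float *d_buf, int n, int rank) {
    float *h_buf = (float *)malloc(n * sizeof(float));

    if (rank == 0) {
        /* 1. GPU memory -> CPU memory */
        cudaMemcpy(h_buf, d_buf, n * sizeof(float), cudaMemcpyDeviceToHost);
        /* 2. CPU memory -> network, via standard MPI */
        MPI_Send(h_buf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(h_buf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        /* 3. CPU memory -> GPU memory */
        cudaMemcpy(d_buf, h_buf, n * sizeof(float), cudaMemcpyHostToDevice);
    }
    free(h_buf);
}
```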
MPICH is a freely available, portable implementation of MPI, a message-passing standard for distributed-memory applications used in parallel computing.
The CH part of the name was derived from "Chameleon", which was a portable parallel programming library developed by William Gropp, one of the founders of MPICH.
After installing MPICH, create a user with useradd or the GUI, and set a password. Then put the following line in the user's .mpd.conf file:
MPD_SECRETWORD=password
Here "password" refers to the password given for that user id.
Next, change the read/write/execute permissions of .mpd.conf using chmod 600 .mpd.conf. Then create a file named mpd.hosts containing the following:
Master
Node1
Node2
...
Node m-1
where m in "Node m-1" refers to the total number of nodes.
Next, to boot MPICH, type:
mpdboot -n m -r ssh -f mpd.hosts
Given below is a sample program using MPI in C:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int i, root = 0, rank, size, mysum, total;
    MPI_Init(&argc, &argv);
    /* gets the rank (identity) of each processor */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    /* gets the total number of available processors */
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    mysum = 0;
    total = 0;
    for (i = rank + 1; i <= 100; i = i + size)
        mysum = mysum + i;
    /* adds all the partial sums (mysum) into total at the root
       using the MPI_SUM operation */
    MPI_Reduce(&mysum, &total, 1, MPI_INT, MPI_SUM, root, MPI_COMM_WORLD);
    if (rank == 0)
        printf("The total is %d\n", total);
    MPI_Finalize();
    return 0;
}

In this program, size processors together add the numbers 1 to 100: the first processor adds 1, size+1, 2*size+1, ...; the second adds 2, size+2, 2*size+2, ...; and so on.
APPLICATIONS
Research: higher education and supercomputing.
Computational chemistry and biology.
Bioinformatics.
Molecular dynamics.
High-performance computing (HPC) clusters.
Grid computing.
Audio signal processing.
Scientific computing.
CONCLUSION AND FUTURE WORK
It is clear that by using GPGPU we can process many records of data in parallel, and thus it provides high performance.
NVIDIA's CUDA is well suited for building the GPGPU platform.
We have presented and benchmarked cudaMPI and glMPI, message-passing libraries for distributed-memory GPU clusters.
Many functions and variables of MPI are still under development.
A common platform for various GPUs is needed and is under development.
THANK YOU
Interactive Section