Final Review Content 1
GPGPU PROGRAMMING WITH CUDA
Prepared by,
SAVITH. S (14CA65)
MCA, II Sem, NIT-K, Surathkal
19/17/2015
INTRODUCTION
GPU stands for Graphics Processing Unit (also called a visual processing unit, VPU); in everyday terms, the graphics card.
An electronic circuit used to accelerate the creation of images in a frame buffer and enhance output quality.
Generally interacts with the motherboard through a PCI Express (PCIe) or AGP port.
A very efficient tool for manipulating computer graphics.
Today, parallel GPUs have begun making computational inroads against the CPU.
The GPU's parallel architecture opens up a wide range of uses, which led to the development of GPGPU.
ABSTRACT
GPGPU stands for general-purpose computing on graphics processing units: using the GPU for algorithms that are traditionally run on the CPU. It makes such algorithms much faster to execute and saves processing time, so a wide range of applications becomes possible. Any GPU providing a functionally complete set of operations performed on arbitrary bits can compute any computable value. Additionally, the use of multiple graphics cards in one computer, or large numbers of graphics chips, further parallelizes the already parallel nature of graphics processing. So, by customizing a GPU, we can implement a GPGPU system that can be up to a hundred times faster than a traditional CPU on suitable workloads.
GPGPU
Utilization of the GPU for computations traditionally handled by the CPU.
Can perform any set of operations accurately and can compute any computable value.
The use of multiple graphics cards in one computer, or large numbers of graphics chips, further parallelizes the already parallel nature of graphics processing.
This concept turns the massive computational power of a modern graphics accelerator's shader pipeline into general-purpose computing power.
STREAM PROCESSING
A stream is a set of records that need similar computations.
A traditional GPU can read multiple independent records simultaneously, perform operations, and write multiple outputs, but it never has a piece of memory that is both readable and writable.
Stream processing harnesses the massive computational power of a GPU for general-purpose, CPU-style operations.
It is a generalization of the GPU.
CUDA
CUDA stands for Compute Unified Device Architecture, developed by the NVIDIA Corporation.
It is used to develop software for graphics processors: a variety of general-purpose applications that are highly parallel in nature and run on the GPU's hundreds of processor cores.
CUDA is supported only on NVIDIA GPUs based on the Tesla architecture. Graphics cards that support CUDA include the GeForce 8 series, Quadro, and Tesla.
CUDA programs contain special functions, called kernels. A kernel is executed N times in parallel on the GPU by N threads.
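A minimal sketch of the kernel idea in CUDA C (the kernel name, data, and launch sizes are illustrative; building and running it requires NVIDIA's nvcc compiler and a CUDA-capable GPU):

```cuda
#include <stdio.h>

/* Kernel: each of the N threads increments one element. */
__global__ void addOne(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  /* unique thread index */
    if (i < n)
        data[i] += 1;
}

int main(void) {
    const int N = 256;
    int *d_data;
    cudaMalloc(&d_data, N * sizeof(int));
    cudaMemset(d_data, 0, N * sizeof(int));

    /* Launch N threads in parallel: 2 blocks of 128 threads each. */
    addOne<<<2, 128>>>(d_data, N);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}
```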
EXISTING SYSTEM
The existing system may be a traditional GPU or a CPU.
A CPU is used in the normal case; compared to a GPU it has less processing power and throughput, and far less parallelism.
A GPU is the basic form of a GPGPU, but it is used only for graphics acceleration: game consoles, high-definition images, computer-aided design, etc.
PROPOSED SYSTEM
The GPGPU overcomes the limitations of a traditional CPU through its highly parallel nature.
In principle, any Boolean function can be built up from a functionally complete set of logic operators.
GPGPU applications need high arithmetic intensity; otherwise, memory access latency will limit the computational speed-up.
Ideal GPGPU applications have large data sets, high parallelism, and minimal dependency between data elements.
EXPECTED FUNCTIONALITIES
High performance: up to about a hundred times faster than a traditional CPU on suitable workloads.
Highly parallel behavior.
Contains multiple cores, each able to execute independently.
Installing multiple GPUs in a single system improves its capabilities further.
Can be customized for many purposes using platforms such as CUDA, OpenCL, etc.
Major upcoming applications in high-performance computing areas.
WORKING
SYSTEM IMPLEMENTATION
The system is implemented by assembling the GPU hardware in a normal or special-purpose computer and installing the software.
No separate software is needed to customize a GPU; it can be done in CUDA C or CUDA C++ with NVIDIA's compiler.
Programming proceeds in the usual way, but some additional header files, such as #include <mpi.h> (explained in a later section), must be included in the program.
HARDWARE REQUIREMENTS
GPU shader cores, which run GPU kernels, are both parallel and deeply multithreaded to provide significant computational power, currently on the order of a teraflop per GPU.
Graphics memory, which is directly accessible by GPU kernels, has a high clock rate and a wide bus to provide substantial bandwidth, currently about a hundred gigabytes per second.
GPU interconnect, providing mainboard access to the GPU. This is typically PCI Express, which delivers a few gigabytes per second of bandwidth.
Main board RAM, which is directly accessible by CPU programs and the network.
CPU cores, which are deeply pipelined and superscalar to provide good performance on sequential programs.
Network hardware, which moves bytes between nodes; its performance can be modeled simply as latency plus bandwidth.
THE GRID AND BLOCK STRUCTURE
The grid consists of one-dimensional, two-dimensional, or three-dimensional thread blocks.
Each thread block is further divided into one-dimensional or two-dimensional threads.
A thread block is a set of threads running on one processor.
All thread creation, execution, and termination is automatic, handled by the GPU, and invisible to the programmer.
The user only needs to specify the number of threads in a thread block and the number of thread blocks in a grid.
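The grid/block specification above can be sketched in CUDA C (the kernel name and sizes are illustrative; only the two launch numbers are supplied by the user, everything else is handled by the GPU):

```cuda
#include <stdio.h>

__global__ void whoAmI(void) {
    /* Each thread derives a unique global index from the block and
     * thread coordinates that the GPU assigns automatically. */
    int global = blockIdx.x * blockDim.x + threadIdx.x;
    printf("block %d, thread %d -> global %d\n",
           blockIdx.x, threadIdx.x, global);
}

int main(void) {
    dim3 grid(4);    /* 4 thread blocks in a one-dimensional grid */
    dim3 block(64);  /* 64 threads per block                      */
    whoAmI<<<grid, block>>>();   /* the user specifies only these two numbers */
    cudaDeviceSynchronize();
    return 0;
}
```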
SINGLE PROGRAM MULTIPLE DATA (SPMD) & MPICH
The GPU is suited to single-program, multiple-data parallel calculations and works well with the message-passing interface (MPI) approach to programming.
In the SPMD model, a single program controls the various activities performed on the GPU.
There is no direct connection between the network device and GPU memory.
Thus, to send GPU data across the network, we must first copy the send-side GPU data to CPU memory.
We then use a standard CPU interface such as MPI, and finally copy the received data from CPU memory into GPU memory.
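The copy-then-send pattern above can be sketched as follows; the function and buffer names are illustrative, and the sketch assumes the standard CUDA runtime and MPI APIs (it must be built with nvcc plus an MPI library and run under mpirun/mpdboot):

```cuda
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdlib.h>

/* Send one GPU buffer from rank 0 to rank 1 via CPU staging buffers,
 * since the network cannot read GPU memory directly. */
void send_gpu_buffer(float *d_buf, int n, int rank) {
    float *h_buf = (float *)malloc(n * sizeof(float));

    if (rank == 0) {
        /* 1. GPU memory -> CPU memory */
        cudaMemcpy(h_buf, d_buf, n * sizeof(float), cudaMemcpyDeviceToHost);
        /* 2. CPU memory -> network, via standard MPI */
        MPI_Send(h_buf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(h_buf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        /* 3. CPU memory -> GPU memory */
        cudaMemcpy(d_buf, h_buf, n * sizeof(float), cudaMemcpyHostToDevice);
    }
    free(h_buf);
}
```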
MPICH is a freely available, portable implementation of MPI, a message-passing standard for distributed-memory applications used in parallel computing.
The CH part of the name was derived from "Chameleon", which was a portable parallel programming library developed by William Gropp, one of the founders of MPICH.
After installing MPICH, create a user with useradd or the GUI, and set a password. Then put the following line in the user's .mpd.conf file:
MPD_SECRETWORD=password
Here "password" refers to the password given for that user id.
Next, change the read/write/execute permissions of .mpd.conf using chmod 600 .mpd.conf. Then create a file named mpd.hosts containing the following:
Master
Node1
Node2
...
Node m-1
where m in "Node m-1" refers to the total number of nodes.
Next, to boot MPICH, type:
mpdboot -n m -r ssh -f mpd.hosts
Given below is a sample program using MPI in C:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int i, root = 0, rank, size, mysum, total;
    MPI_Init(&argc, &argv);
    /* gets the rank (identity) of each processor */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    /* gets the total number of available processors */
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    mysum = 0;
    total = 0;
    for (i = rank + 1; i <= 100; i = i + size)
        mysum = mysum + i;
    /* adds all the partial sums (mysum) into total at the root
       using the MPI_SUM operation */
    MPI_Reduce(&mysum, &total, 1, MPI_INT, MPI_SUM, root, MPI_COMM_WORLD);
    if (rank == 0)
        printf("The total is %d\n", total);
    MPI_Finalize();
    return 0;
}

In this program, size processors together add the numbers 1 to 100: the first processor adds 1, size+1, 2*size+1, ...; the second adds 2, size+2, 2*size+2, ...; and so on.
APPLICATIONS
Research: higher education and supercomputing.
Computational chemistry and biology.
Bioinformatics.
Molecular dynamics.
High-performance computing (HPC) clusters.
Grid computing.
Audio signal processing.
Scientific computing.
CONCLUSION AND FUTURE WORK
It is clear that by using GPGPU we can process many records of data in parallel, and thus it provides high performance.
NVIDIA's CUDA is well suited for building the GPGPU platform.
We have presented and benchmarked cudaMPI and glMPI, message-passing libraries for distributed-memory GPU clusters.
Many functions and variables of MPI are still under development.
A common platform for various GPUs is needed and is under development.
THANK YOU
Interactive Section