
A

PROJECT REPORT

on

Numerical Methods

Implementation On CUDA

submitted for partial fulfillment for the degree of

Bachelor of Technology

in

Department of Computer Engineering

(2007-11)

Supervisor: Dr. Vijay Laxmi Ankur Sharma (2007UCP132)

Nihar Amin (2007UCP161)

Praveen Khokher (2007UCP157)

Shehjad Khan (2007UCP113)

MALAVIYA NATIONAL INSTITUTE Of TECHNOLOGY, JAIPUR

May 2011


Contents

Acknowledgements ix

Certificate xi

1 Overview Of CUDA Programming Model 1

1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

1.2 Thread Level Hierarchy . . . . . . . . . . . . . . . . . . . . . . . . . 2

1.3 Memory Level Hierarchy . . . . . . . . . . . . . . . . . . . . . . . . 3

2 Implementation Of Matrix Multiplication Algorithm On CUDA 5

2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

2.2 Matrix multiplication proves to be advantageous in the implementation of following logics: 6

2.3 Sequential matrix-multiplication: . . . . . . . . . . . . . . . . . . . 6

2.4 Parallel matrix-multiplications on CUDA:- . . . . . . . . . . . . . . 6

2.4.1 Implementation: . . . . . . . . . . . . . . . . . . . . . . . . . 7

2.5 Kernel Specifications: . . . . . . . . . . . . . . . . . . . . . . . . . . 7

2.6 Salient Features: . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

2.7 Limitations: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

2.8 Observations: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

2.9 Conclusions: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

3 Implementation Of Prefix Sum Algorithm On CUDA 11

3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

3.2 Sequential Prefix-sum algorithm: . . . . . . . . . . . . . . . . . . . 12

3.3 Parallel Prefix-Sum On CUDA: . . . . . . . . . . . . . . . . . . . . 12

3.3.1 Implementation- . . . . . . . . . . . . . . . . . . . . . . . . 12

3.4 Kernel Specifications: . . . . . . . . . . . . . . . . . . . . . . . . . . 12

3.5 Salient Features: . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

3.6 Limitations: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

3.7 Observations: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

3.8 Conclusions: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

4 Implementation Of Bitonic Sort Algorithm On CUDA 17

4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

4.2 Parallel Bitonic-Sort On CUDA: . . . . . . . . . . . . . . . . . . . . 18

4.3 Salient Features: . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19


4.4 Limitations: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

4.5 Observations: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

4.6 Conclusions: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

5 Implementation of Odd Even transposition Sort 23

5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

5.2 The odd even merge sort is advantageous as it can . . . . . . . . . 23

5.3 Sequential Odd-Even Merge Sort: . . . . . . . . . . . . . . . . . . . 24

5.4 Parallel Odd Even Transposition Sort: . . . . . . . . . . . . . . . . 24

5.4.1 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . 25

5.5 Kernel Specification: . . . . . . . . . . . . . . . . . . . . . . . . . . 25

5.6 Salient Features:- . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

5.7 Limitations: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

5.8 Observations: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

5.9 Conclusions: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

6 Implementation Of Parallel Quicksort By Regular Sampling Algorithm On CUDA 29

6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

6.2 Sequential Quicksort: . . . . . . . . . . . . . . . . . . . . . . . . . . 29

6.3 Parallel Quicksort Using Regular Sampling: . . . . . . . . . . . . . 30

6.3.1 Implementation: . . . . . . . . . . . . . . . . . . . . . . . . . 31

6.4 Kernel Specifications: . . . . . . . . . . . . . . . . . . . . . . . . . . 31

6.5 Salient features: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

6.6 Limitations: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

6.7 Observations: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

6.8 Conclusions: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

7 Implementation of matrix transpose algorithm on CUDA 35

7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

7.2 Matrix transpose proves to be advantageous in the implementation of following logics: 36

7.3 Sequential matrix transpose: . . . . . . . . . . . . . . . . . . . . . . 36

7.4 Parallel matrix transpose: . . . . . . . . . . . . . . . . . . . . . . . 36

7.4.1 Implementation: . . . . . . . . . . . . . . . . . . . . . . . . . 36

7.5 Kernel specifications: . . . . . . . . . . . . . . . . . . . . . . . . . . 37

7.6 Salient features: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

7.7 Limitations: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

7.8 Observations: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

7.9 Conclusions: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

8 Implementation of parallel sum algorithm on CUDA 41

8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

8.2 Parallel-sum proves to be advantageous in the implementation of following logics: 41

8.3 Sequential Parallel-Sum Algorithm:- . . . . . . . . . . . . . . . . . . 42

8.4 Parallel Prefix-Sum: . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

8.4.1 Implementation: . . . . . . . . . . . . . . . . . . . . . . . . . 42


8.5 Kernel Specification:- . . . . . . . . . . . . . . . . . . . . . . . . . . 43

8.6 Salient Features:- . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

8.7 Limitations:- . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

8.8 Observations: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

8.9 Conclusions: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

9 Calculation Of Variance and Standard Deviations on CUDA 47

9.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

9.2 Finding VARIANCE AND DEVIATION proves to be advantageous 47

9.3 Sequentially Calculate Variance and SD: . . . . . . . . . . . . . . . 48

9.4 Parallely Calculate Variance and SD: . . . . . . . . . . . . . . . . . 48

9.4.1 Implementation: . . . . . . . . . . . . . . . . . . . . . . . . . 48

9.5 Kernel Specification: . . . . . . . . . . . . . . . . . . . . . . . . . . 49

9.6 Limitations: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

9.7 Observations:- . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

9.8 Conclusions: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50

10 Data of Algorithms 53


List of Figures

1.1 Thread Level Hierarchy . . . . . . . . . . . . . . . . . . . . . . . . . 2

1.2 Memory Level Hierarchy . . . . . . . . . . . . . . . . . . . . . . . . 3

2.1 Thread Level Hierarchy . . . . . . . . . . . . . . . . . . . . . . . . . 5

2.2 execution time vs Input size . . . . . . . . . . . . . . . . . . . . . . 8

2.3 SpeedUp vs input size . . . . . . . . . . . . . . . . . . . . . . . . . 9

2.4 SpeedUp vs input size . . . . . . . . . . . . . . . . . . . . . . . . . 9

3.1 Prefix-sum algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 14

3.2 Prefix-sum algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 15

3.3 Prefix-sum algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 15

4.1 Sample Bitonic Sorting . . . . . . . . . . . . . . . . . . . . . . . . . 18

4.2 Kernel Used in Bitonic Sorting . . . . . . . . . . . . . . . . . . . . . 19

4.3 Execution time vs input size . . . . . . . . . . . . . . . . . . . . . . 20

4.4 slope of speedUp vs input size . . . . . . . . . . . . . . . . . . . . . 20

4.5 speedUp vs input size . . . . . . . . . . . . . . . . . . . . . . . . . . 21

5.1 Execution time vs input size . . . . . . . . . . . . . . . . . . . . . . 26

5.2 speedUp vs input size . . . . . . . . . . . . . . . . . . . . . . . . . . 27

6.1 Sequential Quicksort algorithm . . . . . . . . . . . . . . . . . . . . 30

6.2 execution time vs input size . . . . . . . . . . . . . . . . . . . . . . 33

6.3 speedUp vs input size . . . . . . . . . . . . . . . . . . . . . . . . . . 33

6.4 speedUp vs input size . . . . . . . . . . . . . . . . . . . . . . . . . . 34

7.1 Execution time vs input size . . . . . . . . . . . . . . . . . . . . . . 38

7.2 speedUp vs input size . . . . . . . . . . . . . . . . . . . . . . . . . . 38

7.3 speedUp vs input size . . . . . . . . . . . . . . . . . . . . . . . . . . 39

8.1 Execution time vs input size . . . . . . . . . . . . . . . . . . . . . . 44

8.2 speedUp vs input size . . . . . . . . . . . . . . . . . . . . . . . . . . 45

8.3 speedUp vs input size . . . . . . . . . . . . . . . . . . . . . . . . . . 45

9.1 Execution time vs input size . . . . . . . . . . . . . . . . . . . . . . 50

9.2 speedUp vs input size . . . . . . . . . . . . . . . . . . . . . . . . . . 50

9.3 speedUp vs input size . . . . . . . . . . . . . . . . . . . . . . . . . . 51


List of Tables

10.1 Matrix Multiplication(time in 10−6s) . . . . . . . . . . . . . . . . . 54

10.2 Bitonic Sort Algorithm (time in 10−6s) . . . . . . . . . . . . . . . . 54

10.3 Prefix Sum (time in 10−6s) . . . . . . . . . . . . . . . . . . . . . . . 54

10.4 Odd-Even Transposition Sort (time in 10−6s) . . . . . . . . . . . . . 55

10.5 Quicksort (time in 10−6s) . . . . . . . . . . . . . . . . . . . . . . . . 55

10.6 Matrix-transpose (time in 10−6s) . . . . . . . . . . . . . . . . . . . 55

10.7 Summation Algorithm (time in 10−6s) . . . . . . . . . . . . . . . . 56

10.8 Variance and SD (time in 10−6s) . . . . . . . . . . . . . . . . . . . . 56


Acknowledgements

We wish to express our gratitude to all the people involved in the successful completion of our Final Year Major Project, especially to our project mentor Dr. Vijay Laxmi for her guidance and critical reviews.

Our sincere thanks to Dr. M. S. Gaur, who was very generous in devoting his precious time, sharing his knowledge with us, and helping us out in every possible manner.

We are also thankful to all of our team members, working with whom was a

great experience.

And finally, our deep gratitude to our family members for their unflinching emo-

tional support during the whole period.

Ankur Sharma

Nihar Amin

Praveen Khokher

Shehjad Khan

May 2011


Certificate

This is to certify that the work contained in this report entitled "Numerical Methods Implementation On CUDA" by Ankur Sharma (2007UCP132), Nihar Amin (2007UCP161), Praveen Khokher (2007UCP157) and Shehjad Khan (2007UCP113) has been carried out under my supervision and this work has not been submitted elsewhere for a degree.

May, 2011

Dr. Vijay Laxmi

Department of Computer Engineering,

Malaviya National Institute of Technology,

Jaipur.


ABSTRACT

Parallel computing is the process of dividing large problems into smaller ones and executing them concurrently. This implies that many computations are carried out simultaneously. The main objective of devising parallel algorithms is to check whether they give faster responses than their sequential versions. The basic aim of the project is the implementation of numerical methods for heavy calculations on the CUDA architecture and their comparison with the time taken for the same calculations performed sequentially on the CPU. First, the CUDA architecture and how computations are mapped onto threads and blocks is studied. Algorithms that can be implemented in parallel are identified, their sequential CPU codes are written, and then their parallel implementation on the CUDA architecture is done. Sets of data are used to study the time taken by both implementations, and inferences are made. These are primarily on the basis of the complexities of the sequential algorithms and their method of implementation on CUDA. Some parallel algorithms give a sufficient speed-up and some are slower than the sequential versions. The reasons and conclusions are inferred, and the optimizations that can be done are mentioned.


Chapter 1

Overview Of CUDA Programming Model

1.1 Introduction

Compute Unified Device Architecture (CUDA) is an application programming interface to the graphics processors. It is basically a parallel computing architecture developed by Nvidia. The architecture emphasizes running many threads slowly in parallel rather than running a particular thread very fast. CUDA-specific computations are performed on the GPU (graphics processing unit). The architecture favours applications which are compute intensive rather than memory intensive. It is a scalable programming model. Programmers generally use C for CUDA for executing code on the GPU.

There are three levels of abstraction in CUDA which are visible to the programmer:

1. Thread level hierarchy

2. Memory level hierarchy

3. Barrier synchronization

The basic advantage of using CUDA is to run the parallel fraction of a large code efficiently and quickly. It basically follows the approach of dividing a large set of input data into blocks and executing the different blocks in parallel. The main features to look out for in parallel processing of blocks are efficient communication of data between different blocks and between the threads of the same block, and synchronization between blocks and threads of a block.

CUDA executes the sequential part of the code on the CPU, while the parallel portion is executed on the GPU. The GPU code is compiled by the Open64 compiler, which produces parallel thread execution (PTX) files to run on the GPU. Qualifiers are used to distinguish between the variables and functions of the CPU code and the GPU code. CUDA operates on a single instruction multiple data (SIMD) architecture, but a thread can diverge from this on the basis of conditional operators, blockIdx and threadIdx.
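As a small illustration of these qualifiers and of how a kernel is launched from host code (a generic sketch, not code from this project; the function names are invented):

__device__ float square(float x) { return x * x; }    // callable only from GPU code

// __global__ marks a kernel: it runs on the GPU but is launched from the CPU.
__global__ void squareAll(float *data, int n)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;    // unique global thread index
    if (id < n)
        data[id] = square(data[id]);
}

// Host side: the execution configuration <<<blocks, threads>>> defines the grid.
// squareAll<<<(n + 511) / 512, 512>>>(d_data, n);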

1.2 Thread Level Hierarchy

The thread level abstraction can be viewed as shown in the figure below:

Figure 1.1: Thread Level Hierarchy

The thread level abstraction on CUDA can be viewed as a grid of blocks containing threads. Each thread possesses a unique ID associated with it. A block can contain up to a maximum of 512 threads on the Quadro FX 1700 GPGPU architecture. A thread can have its unique ID in the x, y and z dimensions, i.e. threadIdx.x, threadIdx.y, threadIdx.z. Similarly, a collection of blocks is called a grid, and a grid can contain blocks in all three dimensions. The threads within a block can communicate with each other using the shared memory visible per block and can synchronize their execution using the built-in __syncthreads() function. The execution of the different blocks launched by the kernel cannot be synchronized using the __syncthreads() function; different blocks communicate with each other using the device memory or the global memory. When a kernel is launched, a grid of thread blocks gets created on the device, with each thread block containing many threads. Both fine-grained and coarse-grained data parallelism can be implemented in CUDA: the threads provide fine-grained parallelism while the blocks provide coarse-grained parallelism.
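A minimal sketch of how a thread derives its position from the block and thread indices described above, and synchronizes with the other threads of its block (illustrative only; the kernel name and the 16 x 16 tile size are assumptions):

__global__ void threadHierarchyDemo(float *out, int width)
{
    // 2-D position of this thread within the whole grid (coarse grain: blocks,
    // fine grain: threads inside a block). Assumes a 16 x 16 thread block.
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;

    __shared__ float tile[16][16];        // shared memory, visible to this block only
    tile[threadIdx.y][threadIdx.x] = (float)(row * width + col);
    __syncthreads();                      // barrier for the threads of this block

    if (row < width && col < width)
        out[row * width + col] = tile[threadIdx.y][threadIdx.x];
}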

1.3 Memory Level Hierarchy

The memory level abstraction can be viewed as shown in the figure below:

Figure 1.2: Memory Level Hierarchy

There are four different types of memory shown above: registers, shared, global and constant (not including the texture memory). The global memory can be accessed by every thread, by different blocks and by the CPU. The registers are specific to each thread and are the fastest type of memory. The shared memory is visible to a particular block, and thus the threads of a block can access the shared memory. Constant memory is faster than global memory but slower than registers and shared memory; however, it can only be written to in host code. Device code can read constant memory but cannot write to it. The sizes of global and constant memory can scale to GBs, but the size of shared memory is very limited (usually up to 16 KB).

The memory allocation and deallocation of the global memory is done by the host. Functions like cudaMalloc() and cudaMemcpy() are used for the allocation and movement of data to or from the device. Identifiers like cudaMemcpyDeviceToHost are used to indicate the direction of data transfer. The memory transfer functions can be synchronous as well as asynchronous. Synchronous means the CPU can resume its execution only after the entire data has been transferred to the GPU.
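A typical host-side sequence using these functions might look like the following sketch (generic CUDA host code, not taken from the report; the kernel is assumed):

void runOnGpu(float *h_data, int n)
{
    float *d_data = NULL;
    size_t bytes = n * sizeof(float);

    cudaMalloc((void **)&d_data, bytes);                          // allocate on the device
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);    // host -> device

    // someKernel<<<(n + 511) / 512, 512>>>(d_data, n);           // hypothetical kernel launch

    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);    // device -> host
    cudaFree(d_data);
}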


Chapter 2

Implementation Of Matrix Multiplication Algorithm On CUDA

2.1 Introduction

Matrix multiplication has inherent parallelism in it, and thus by using a parallel architecture we can complete the work in less time, i.e. achieve speed-up. We multiply two matrices of size M x N and N x O and get a resulting matrix of dimension M x O. It is a necessary condition that the number of columns of the 1st matrix equals the number of rows of the 2nd matrix, otherwise multiplication is not possible.

Figure 2.1: Thread Level Hierarchy

INPUT: Two matrices, say A and B, with dimensions M x N and N x O.

OUTPUT: The final matrix with dimension M x O.


2.2 Matrix multiplication proves to be advantageous in the implementation of following logics:

1. Graph Theory

2. Probability theory and statistics

3. Symmetries and transformations of physics

4. MATLAB

2.3 Sequential matrix-multiplication:

Suppose we have to multiply two matrices A and B and obtain the final result in matrix C. Then each element of C can be found by accumulating

sum = sum + mat1[i][k] * mat2[k][j];
mat3[i][j] = sum;

Here r1 is the number of rows of the first matrix, c1 the number of columns of the first matrix, and c2 the number of columns of the second matrix:

for (i = 0; i < r1; i = i + 1)
{
    for (j = 0; j < c2; j = j + 1)
    {
        sum = 0;
        for (k = 0; k < c1; k++)
            sum = sum + mat1[i][k] * mat2[k][j];
        mat3[i][j] = sum;
    }
}

2.4 Parallel matrix-multiplications on CUDA:-

As matrix multiplication has many independent stages, we can expect to get some speed-up using a parallel architecture like CUDA.


2.4.1 Implementation:

We launch the same number of threads as the number of elements in the resultant matrix. Each thread simultaneously calculates the corresponding index of the resultant matrix. Our blocks are of 2D nature and have dimension N x O (here we have taken input values such that N and O are equal). Both dimensions of the 2D grid are equal to sqrt(total number of blocks launched). Indexing of each element is done using threadIdx.x, threadIdx.y, blockIdx.x and blockIdx.y:

dim3 threads(My_block, My_block);
float grid_D = sqrt(My_block);
dim3 grid(grid_D, grid_D);

// Indexing
int row = blockIdx.y * block_D + threadIdx.y;
int col = blockIdx.x * block_D + threadIdx.x;
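A minimal sketch of a global-memory matrix multiplication kernel of the kind described above (the name and exact signature are assumptions, not the report's code; each thread computes one element of the M x O result):

__global__ void matrixMulGlobal(const float *A, const float *B, float *C,
                                int M, int N, int O)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < M && col < O) {
        float sum = 0.0f;
        for (int k = 0; k < N; ++k)
            sum += A[row * N + k] * B[k * O + col];   // dot product of row and column
        C[row * O + col] = sum;
    }
}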

2.5 Kernel Specifications:

A) __global__ void matrixMul_globalmemory - 9 registers, 28+16 bytes of smem, 4 bytes of cmem[1].

2.6 Salient Features:

1. We have implemented on global memory as our threads are independent of

each other and we face no synchronisation problem.

2. Motivation for using global memory was to run our code for matrices with

large dimensions.

3. The code is generalised to run on a very large number of values.

4. Both the times t1 (without considering memory copy overhead) and t2 (considering memory transfer overhead) are calculated.

2.7 Limitations:

1. For large arrays (>512 values), the input size was limited to multiples of 512.


2. GPU occupancy of 67% was achieved as calculated by the GPU Occupancy Calculator.

2.8 Observations:

1. Immediate speedUp for N > 32, due to the n^3 complexity of the sequential algorithm.

2. Sequential time is almost linearly proportional to the size of the resultant matrix.

3. The initial speedUp/N, i.e. the slope of the speedUp graph, is very steep.

4. With the increase in the size of the input, the time taken by the sequential code increases almost linearly, whereas the time taken by the kernel to execute remains constant, but the overall performance of the parallel code is degraded by the time accounted for the memory copy overhead between host and device.

2.9 Conclusions:

1. As the sequential algorithm is of order n^3, for large dimensions we got a decent speed-up.

2. The parallel approach is very favourable when the sequential complexity is higher.

3. Even better speedUps can be achieved with memory optimization techniques.

Figure 2.2: execution time vs Input size


Figure 2.3: SpeedUp vs input size

Figure 2.4: SpeedUp vs input size


Chapter 3

Implementation Of Prefix Sum Algorithm On CUDA

3.1 Introduction

Prefix sum, also known as the partial sum of a series, is in programming terms the fold of the addition operation. The prefix sum is considered to be the simplest and most useful building block of parallel algorithms. The prefix sum can be calculated for very large sets of input data and is generally described as below:

For a set of N values { a1, a2, a3, a4, ..., an }

the prefix-sum is calculated as { a1, (a1+a2), (a1+a2+a3), ..., (a1+a2+...+an) }

For Example - a[8] = {1,3,4,2,6,3,7,1}

prefix-sum = {1,4,8,10,16,19,26,27}

Prefix-sum proves to be advantageous in the implementation of following logics:-

1. In the implementation of radix sort and quicksort.

2. Performing lexical analysis and searching for regular expressions.

3. In evaluating polynomials, solving recurrences and the addition of multiprecision numbers.

4. It can be very helpful in performing string matching algorithms.


3.2 Sequential Prefix-sum algorithm:

The sequential prefix-sum algorithm is a very simple method to calculate the prefix-sum of a given input array of numbers, just by looping through the array and adding the current value to the previous indexed value. The logic is demonstrated below:

for (i = 1; i < size; i = i + 1)
    a[i] = a[i] + a[i-1];

This code performs exactly N-1 additions for an array of size N and thus is a very simple implementation.

3.3 Parallel Prefix-Sum On CUDA:

The prefix-sum algorithm can be performed very efficiently using the parallel architecture. We just need to divide the input array into blocks of proper dimension and launch the kernel.

3.3.1 Implementation-

For an input array of size N (which can be very large), a single-dimension grid is created with (N/512) blocks. If the size of the input is N < 512, then a grid with one block containing N threads is launched by the kernel function.

Each of the blocks is provided with a shared array of size 512 and its set of shared variables. All the values of the input array, which are stored in global memory, are mapped to a specific thread ID dependent on the number of blocks:

ID = blockIdx.x * dim_block + threadIdx.x;

Thus, respective elements are copied from the global memory to the shared memory of each block. The partial sums of the values in each block are generated and stored in a global array according to the respective block index.
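A minimal sketch of a per-block inclusive prefix-sum (scan) kernel along these lines (illustrative only, assuming 512 threads per block; the block totals still have to be propagated across blocks using a global array, as noted in the features below):

__global__ void blockPrefixSum(int *data, int *blockSums, int n)
{
    __shared__ int s[512];
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    int t  = threadIdx.x;
    s[t] = (id < n) ? data[id] : 0;
    __syncthreads();

    // Step the offset 1, 2, 4, ... and add the element 'offset' positions back.
    for (int offset = 1; offset < blockDim.x; offset *= 2) {
        int val = (t >= offset) ? s[t - offset] : 0;
        __syncthreads();
        s[t] += val;
        __syncthreads();
    }

    if (id < n) data[id] = s[t];
    if (t == blockDim.x - 1) blockSums[blockIdx.x] = s[t];   // total of this block
}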

3.4 Kernel Specifications:

1. __global__ void Sum_prefix() - 6 registers, 4120+16 bytes of smem, 4 bytes of cmem[1]

2. __global__ void sum() - 5 registers, 2076+16 bytes of smem, 8 bytes of cmem[1]

3.5 Salient Features:

1. The use of shared memory to perform consecutive reads, which reduces the time that would have been spent in performing the same reads and writes using global memory.

2. Performing a proper synchronization between threads operating in parallel inside a block.

3. It was difficult to perform synchronization between different blocks, so the sums of previous blocks were propagated to the consecutive blocks using a global array.

4. The code is generalised to run on a very large number of values.

5. Both the times t1 (without considering memory copy overhead) and t2 (considering memory transfer overhead) are calculated.

3.6 Limitations:

1. For large arrays (>512 values), the input size was limited to multiples of 512.

2. GPU occupancy of 67% was achieved as calculated by the GPU Occupancy Calculator.

3.7 Observations:

1. For very small input sizes, the sequential prefix sum appears to be much faster than the parallel code.

2. With the increase in the size of the input, the time taken by the sequential code increases almost linearly, whereas the time taken by the kernel to execute remains a constant.


3. Very large speedups w.r.t. kernel execution times are achieved, which demonstrates the efficiency of running the parallel code on CUDA, but the memory overhead for large values limits the overall speed-up.

3.8 Conclusions:

1. Using efficient memory optimization techniques, the memory transfer overhead between the host and the device can be reduced.

2. Using much better kernel optimization, the speedUp can be increased.

Figure 3.1: Prefix-sum algorithm


Figure 3.2: Prefix-sum algorithm

Figure 3.3: Prefix-sum algorithm


Chapter 4

Implementation Of Bitonic Sort Algorithm On CUDA

4.1 Introduction

It is a fast method to sort a large number of values. It basically contains two types of operations, shown by a down arrow (also the (+) operation, just a symbolic representation) and an up arrow (also the (-) operation). In the (+) operation both values are compared, and after the comparison the larger value should be at the higher index (for this purpose swapping might be required). In the (-) operation both values are compared and the larger value should be at the lower index (again swapping may or may not be required).

INPUT: An array A of N elements.

OUTPUT: A sorted array sort(A) such that A[i] <= A[j] for all 0 <= i < j <= N-1.

Bitonic-sort proves to be advantageous in the implementation of following logics:-

1. In any application which requires sorted input, for example the binary search algorithm.

2. In forming directories and managing large data.


Figure 4.1: Sample Bitonic Sorting

4.2 Parallel Bitonic-Sort On CUDA:

The parallel bitonic sort can be performed very efficiently using the parallel CUDA architecture. For N elements, we can divide the problem into log2(N) stages, and each stage can further be divided into a number of sub-stages. For stage i the number of sub-stages is equal to i, i.e. if we have 8 elements then the total number of stages is 3: the 1st stage has 1 sub-stage, the 2nd stage has 2 sub-stages and the 3rd stage has 3 sub-stages. Each sub-stage has to do N/2 independent comparisons, so we can launch N/2 threads for these N/2 computations. But sub-stages are not independent from each other, and thus we have to ensure proper synchronization between threads, otherwise we will get incorrect results.

As our CUDA architecture can have at most 512 threads in a block, for more than 512 values we would have to launch multiple blocks. Since this requires inter-block synchronization, which we tried but could not implement, we have computed results only up to 512 values.

We have to find whether a thread has to perform the (+) or the (-) operation. For this purpose we have used a flag variable in our kernel: flag = (int)(id / power(i)) % 2; If flag has value 0 then we perform the (+) operation, otherwise the (-) operation. Threads in a block are of 1D nature and can be accessed by indexing them using threadIdx.x:

id = threadIdx.x;

For synchronisation of the threads of the same block we have used the standard library function __syncthreads().

Figure 4.2: Kernel Used in Bitonic Sorting
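The report's kernel is the one shown in Figure 4.2; as a rough illustration of the same idea, a single-block bitonic sort kernel could be sketched as follows (assuming n is a power of two, n <= 512, and one thread per element rather than the N/2 threads per sub-stage described above):

__global__ void bitonicSort(int *data, int n)
{
    __shared__ int s[512];
    int t = threadIdx.x;
    if (t < n) s[t] = data[t];
    __syncthreads();

    // k: size of the bitonic sequences being built; j: compare distance per sub-stage.
    for (int k = 2; k <= n; k <<= 1) {
        for (int j = k >> 1; j > 0; j >>= 1) {
            int partner = t ^ j;
            if (t < n && partner > t) {
                int ascending = ((t & k) == 0);        // (+) or (-) operation
                if ((ascending && s[t] > s[partner]) ||
                    (!ascending && s[t] < s[partner])) {
                    int tmp = s[t]; s[t] = s[partner]; s[partner] = tmp;
                }
            }
            __syncthreads();
        }
    }
    if (t < n) data[t] = s[t];
}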

4.3 Salient Features:

1. Different sub-stages at the same stage level are not independent.

2. In the last stage we only have to perform (+) operations.

4.4 Limitations:

1. We have assumed that the number of input values must be a power of 2, like 4, 8, 16, 32, 64, 128, 256, 512.

2. As we have only used a 1D block, we can take at most 512 values for the sorting.

3. GPU occupancy of 67% was achieved as calculated by the GPU Occupancy Calculator.

4.5 Observations:

1. SpeedUp gained for (N>256).

2. For sequential nearly linear increase in time with increasing N.

3. Very sharp increase in speedUp after (N=256).


4.6 Conclusions:

1. The speedup decreases significantly due to memory overhead.

2. Much higher SpeedUps can be achieved with multiple blocks.

Figure 4.3: Execution time vs input size

Figure 4.4: slope of speedUp vs input size


Figure 4.5: speedUp vs input size


Chapter 5

Implementation of Odd Even transposition Sort

5.1 Introduction

The network odd-even transposition sort for n input data consists of n comparing

stages. In each stage, either all inputs at odd index positions or all inputs at even

index positions are compared with their next element. Odd stages are followed

by the even stages and only after the completion of an Odd stage an Even stage

can start and vice versa. It is similar to the bubble sort except for the fact that

odd-even transposition sort compares disjointed pairs by using alternating odd

and even index values during different phases of the sort.

5.2 The odd even merge sort is advantageous as it can

1. Can be used for sorting on 2-D processor arrays, and

2. Be implemented in parallel, which can achieve speed-ups of more than 2.0 even on a marginally small number of elements.


5.3 Sequential Odd-Even Merge Sort:

The algorithm is simple to implement and is synonymous with bubble sort. In the first phase of the odd-even exchange, control jumps to all the even indices and compares each with its neighbouring element. In the second phase, control jumps to the odd indices and compares their neighbouring elements. These pairs of phases continue till the array is sorted. Thus, there are exactly half as many pairs of phases as there are elements in the array to be sorted. The looping logic is as follows:

for (i = 0; i < n/2; i = i + 1)
{
    for (j = 0; j + 1 < n; j = j + 2)
        if (A[j] > A[j+1])
        {
            int T = A[j];
            A[j] = A[j+1];
            A[j+1] = T;
        }

    for (j = 1; j + 1 < n; j = j + 2)
        if (A[j] > A[j+1])
        {
            int T = A[j];
            A[j] = A[j+1];
            A[j+1] = T;
        }
}

5.4 Parallel Odd Even Transposition Sort:

The odd-even transposition sort on CUDA architecture is implemented on a single

block with a max size of 512 elements. Each thread process one element and hence

even threads process even indexed elements and odd threads process odd indexed

elements.


5.4.1 Implementation

For an input size of N, a block with N threads is created and each thread processes one element. The kernel creates a shared memory portion for the block and copies the array into it. All the values of the input array, which are stored in global memory, are mapped to a specific thread ID dependent on the number of blocks:

ID = blockIdx.x * dim_block + threadIdx.x;

Thus, respective elements are copied from the global memory to the shared memory for the block. The kernel then sorts the array in combinations of odd-even phases, and the result is copied back to the host memory. The kernel function can be examined as follows.
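A minimal sketch of such a single-block odd-even transposition sort kernel (assumed names, not the report's exact code; launched with N threads, N <= 512):

__global__ void oddEvenSort(int *data, int n)
{
    __shared__ int s[512];
    int t = threadIdx.x;
    if (t < n) s[t] = data[t];
    __syncthreads();

    for (int phase = 0; phase < n; ++phase) {
        // Even phases compare pairs (0,1),(2,3),...; odd phases compare (1,2),(3,4),...
        int i = 2 * t + (phase & 1);
        if (i + 1 < n && s[i] > s[i + 1]) {
            int tmp = s[i]; s[i] = s[i + 1]; s[i + 1] = tmp;
        }
        __syncthreads();   // ensures the even phase always follows the odd phase
    }
    if (t < n) data[t] = s[t];
}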

5.5 Kernel Specification:

__global__ Sort() - 8 registers, 2068+16 bytes of smem, 4 bytes of cmem[1].

5.6 Salient Features:-

1. The use of shared memory to perform consecutive reads, which reduces the time that would have been spent in performing the same reads and writes using global memory.

2. Performing a proper synchronization between threads operating in parallel inside a block.

3. It was difficult to perform synchronization between different blocks, so the sums of previous blocks were propagated to the consecutive blocks using a global array.

4. Both the times t1 (without considering memory copy overhead) and t2 (considering memory transfer overhead) are calculated.

5. Synchronization is done to ensure that during parallel execution of threads the even phase always follows the odd phase.


5.7 Limitations:

1. Maximum size of array can be 512, limited to maximum threads in a block

2. GPU occupancy of 67% was achieved as calculated by the GPU Occupancy Calculator.

5.8 Observations:

1. Steep increase in speedUp as N increases.

2. Due to N being limited to 512, the memory overhead time is less than the calculation time; therefore the memory overhead has less effect on the performance graph.

5.9 Conclusions:

1. Due to the calculation complexity of the sequential approach, the parallel approach gains a recognizable speedUp.

2. Due to N being limited to 512, the memory overhead time is less than the calculation time; therefore the memory overhead has less effect on the performance graph.

Figure 5.1: Execution time vs input size


Figure 5.2: speedUp vs input size


Chapter 6

Implementation Of Parallel Quicksort By Regular Sampling Algorithm On CUDA

6.1 Introduction

Quicksort (also known as partition-exchange sort) is a very well known sorting algorithm developed by C. A. R. Hoare. It is a comparison sort and, in efficient implementations, is not a stable sort. Quicksort tends to make excellent usage of the memory hierarchy, taking perfect advantage of virtual memory and the available caches. It is very well suited for modern computer architectures, as it uses no temporary memory and thus is an in-place sort.

6.2 Sequential Quicksort:

The sequential implementation of the quicksort algorithm follows a divide and conquer approach to sort a large input array of values. The procedure involves:

1. Selecting one of the numbers (any random number may be selected) from the input as the pivot element.

2. Locating the index (position) of the number in the input array and then dividing the array into sub-arrays: the lower sub-array contains elements with values smaller than the pivot, and the upper sub-array contains elements with values higher than that of the pivot element.

3. Applying step one recursively on both the lower and upper arrays.

4. Finally a sorted list of values is obtained (sorted here in ascending order).

ILLUSTRATION OF QUICKSORT

Figure 6.1: Sequential Quicksort algorithm

Quicksort is known to be the fastest comparison-based sorting algorithm in the average case, and quicksort has some natural concurrency (the lower and upper lists can be sorted concurrently).

6.3 Parallel Quicksort Using Regular Sampling:

Parallel quicksort using regular sampling can be applied to very large sets of data. It basically involves segmenting the unsorted list into blocks; the unsorted list is evenly distributed among the blocks. There are in all four phases involved:

1. Individual sorting of the values in each segment, selecting the data items at local indices 0, n/p^2, 2n/p^2, ..., (p-1)n/p^2 as a regular sample of its locally sorted block.

2. All the selected pivots are then again sorted, and (p-1) pivots are selected and broadcast to every block.

3. Each block then partitions its sorted subarray into p disjoint partitions.

4. Each block (i) keeps its (i-th) partition and sends the (j-th) partition to block (j), for all (j != i), and then each block merges its p partitions into a single global array.


6.3.1 Implementation:

1. The input unsorted list is divided into N blocks (size/512), and the unsorted partitions are then copied from the global array to the shared array of each block on the GPU.

2. Sorting of the segmented list stored in the shared array is performed by every block, independent of each other.

3. Local pivots are selected and copied to a global array, indexed according to the blockId.

4. The list of pivots is then again sorted, and P-1 pivots are again selected and broadcast to every block.

5. The local sorted arrays are partitioned according to the pivots, and then the partitions are merged into a global array accordingly.
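A minimal sketch of the regular-sampling step from steps 1 and 3 above, assuming each block has already sorted its 512-element segment in memory (the names, the samples layout and the use of a single thread per block are assumptions for illustration):

__global__ void pickRegularSamples(const int *sortedSegments, int *samples,
                                   int segSize, int P)
{
    if (threadIdx.x == 0) {                            // one thread per block suffices here
        const int *seg = sortedSegments + blockIdx.x * segSize;
        int stride = segSize / P;                      // distance between regular samples
        for (int k = 0; k < P; ++k)
            samples[blockIdx.x * P + k] = seg[k * stride];
    }
}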

6.4 Kernel Specifications:

1. kernel1 - 6 registers, 6810+16 bytes smem, 4 bytes cmem

2. kernel2 - 8 registers, 24+16 bytes smem, 4 bytes cmem

3. kernel3 - 7 registers, 2084+16 bytes smem, 8 bytes cmem

6.5 Salient features:

1. The use of shared memory to perform consecutive reads, which reduces the time that would have been spent in performing the same reads and writes using global memory.

2. The code is generalised to run on a very large number of values.

3. Better load balance.

4. Repeated communications of the same value are avoided.

5. Use of three kernel functions to increase the extent of parallelization while continuously using shared memory.


6.6 Limitations:

1. The input size is limited to multiples of 512.

2. The sorting of the segmented array performed at block level is implemented using a single thread, affecting the overall efficiency and reducing parallelism.

3. Better load balance

4. There is a constant use of global memory for broadcasting the pivots and globally sorting them.

5. GPU occupancy of 67% was achieved as calculated by the GPU Occupancy Calculator.

6.7 Observations:

1. Highly efficient and recursive sequential code

2. Use of three kernels drastically increases the execution time.

6.8 Conclusions:

1. Efficient sequential codes can outperform the parallel versions.


Figure 6.2: execution time vs input size

Figure 6.3: speedUp vs input size


Figure 6.4: speedUp vs input size


Chapter 7

Implementation of matrix transpose algorithm on CUDA

7.1 Introduction

Matrix transpose is an operation in which we exchange the rows with their corresponding columns, i.e. the values of the 1st row become the values of the 1st column. Here the transpose is computed for a square matrix, i.e. both dimensions of the matrix are the same. The matrix transpose can be calculated for very large sets of input data and is generally described as below:

INPUT: A matrix A of dimension N x N.

OUTPUT: The matrix transpose(A) of the same dimensions; the 1st row of A must match the 1st column of transpose(A), and so on.

Example:

matrix A =
1 2 3
4 5 6
7 8 9

transpose(A) =
1 4 7
2 5 8
3 6 9


7.2 Matrix transpose proves to be advantageous in the implementation of following logics:

1. Used to find the inverse of a matrix.

2. Orthogonal matrix applications.

7.3 Sequential matrix transpose:

The logic for the sequential version is pretty straightforward: as the rows and columns are exchanged, we basically swap the two indices, i.e.

transpose(A)[j][i] = A[i][j];

thus we have to index our program to follow the above logic:

for (i = 0; i < r1; i = i + 1)
{
    for (j = 0; j < c1; j = j + 1)
    {
        transpose(A)[j][i] = A[i][j];
    }
}

Here r1 is the number of rows of matrix A and c1 the number of columns, and we know both must be equal as it is a square matrix.

7.4 Parallel matrix transpose:

As matrix A and transpose(A) are different, we can launch as many threads as there are elements, and thus we do not even have to synchronize them.

7.4.1 Implementation:

For an input matrix of dimension N (which can be very large), a 2-D grid is created. If N*N < 512, then a grid with one block containing N*N threads is launched by the kernel function. If N*N > 512, then the number of blocks launched is (N*N)/256, and 2-D blocks each of dimension 16 x 16 are launched. Indexing of each element is done using threadIdx.x, threadIdx.y, blockIdx.x and blockIdx.y:

int row = blockIdx.y * block_D + threadIdx.y;
int col = blockIdx.x * block_D + threadIdx.x;
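A minimal sketch of a global-memory transpose kernel of this kind (the name and signature are assumptions, not the report's code):

__global__ void transposeGlobal(const float *A, float *At, int N)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N)
        At[col * N + row] = A[row * N + col];   // element (row, col) goes to (col, row)
}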

7.5 Kernel specifications:

__global__ void matrixMul_globalmemory - 9 registers, 28+16 bytes of smem, 4 bytes of cmem[1].

7.6 Salient features:

1. We have implemented the kernel on global memory, as our threads are independent of each other and we face no synchronisation problem.

2. The motivation for using global memory was to run our code for matrices with large dimensions.

3. The code is generalised to run on a very large number of values.

4. Both the times t1 (without considering memory copy overhead) and t2 (considering memory transfer overhead) are calculated.

7.7 Limitations:

1. For large arrays (>512 values), the input size was limited to multiples of 512.

2. GPU occupancy of 67% was achieved as calculated by the GPU Occupancy Calculator.

7.8 Observations:

1. As N increases, the ratio of calculation time to memory overhead decreases significantly. This is due to the simple calculation logic.

2. Due to memory overhead, the speedUp did not increase beyond 0.91.


7.9 Conclusions:

1. SpeedUp in the calculations (CPU vs GPU) is easily achieved.

2. Better memory optimizations can yield a significant speedUp.

Figure 7.1: Execution time vs input size

Figure 7.2: speedUp vs input size


Figure 7.3: speedUp vs input size


Chapter 8

Implementation of parallel sum algorithm on CUDA

8.1 Introduction

Parallel sum is the program to find the sum of all the elements present in an array. The parallel sum can be calculated for very large sets of input data and is generally described as below:

INPUT: A set of N values [a1, a2, a3, ..., an-1, an]

OUTPUT: The final sum of the array, SUM = a1 + a2 + a3 + ... + an-1 + an

For Example - a[8] = {1,3,4,2,6,3,7,1}

SUM = 1+3+4+2+6+3+7+1 = 27

8.2 Parallel-sum proves to be advantageous in the implementation of following logics:

1. In the implementation of finding the mean of a set of values.

2. In the implementation of finding the variance.


8.3 Sequential Parallel-Sum Algorithm:-

The sequential sum algorithm is a very simple method to calculate the total sum of a given input array of numbers, just by looping through the array and adding the current value to the variable SUM. The logic is demonstrated below:

SUM = 0;
for (i = 0; i < size; i = i + 1)
    SUM = a[i] + SUM;

This code performs exactly N additions for an array of size N and thus is a very simple implementation.

8.4 Parallel Prefix-Sum:

The parallel sum can be performed very efficiently using the parallel architecture. We assume the size of the input array to be a power of two, i.e. 2, 4, 16, 32, ..., 1024, ..., 8192 and so on.

8.4.1 Implementation:

For an input array of size N (which can be very large), a single-dimension grid is created with (N/512) blocks. If the size of the input is N < 512, then a grid with one block containing N threads is launched by the kernel function.

Basically, in the kernel function each thread executes its code by performing the sum of two elements and storing that sum at the lower of the two indices. For example, if we have an input array A = {1,2,3,4,5,6,7,8}, then in the first run for 8 values we create 4 threads. The first thread, i.e. the thread with threadIdx = 0, adds the values (a[0] = a[0] + a[1] = 1 + 2 = 3) and stores the result at the lower index, i.e. 0; similarly the second thread (threadIdx = 1) adds the values (a[2] = a[2] + a[3] = 3 + 4 = 7), the third thread (threadIdx = 2) adds the values (a[4] = a[4] + a[5] = 5 + 6 = 11), and the fourth thread (threadIdx = 3) adds the values (a[6] = a[6] + a[7] = 7 + 8 = 15). Now the number of values has reduced from 8 to 4, and we require only 2 threads instead of 4. This is done using the thread IDs, with the condition if ((int)(threadIdx) - power(j) >= 0), where j denotes the run number, i.e. it is equal to 0 for the 1st run, 1 for the second run, and so on. As we observe, in each run the number of values reduces by a factor of 2; thus to compute the sum of N values we need log2(N) runs.

Each of the blocks is provided with a shared array of size 512 and its set of shared variables. All the values of the input array, which are stored in global memory, are mapped to a specific thread ID dependent on the number of blocks:

ID = blockIdx.x * dim_block + threadIdx.x;

Proper synchronisation must be ensured between the different runs of threads. We have used the standard function from the CUDA library, __syncthreads().
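A minimal sketch of a per-block summation kernel following the pairwise scheme described above (assumed names, not the report's exact code; the per-block partial sums would still be combined afterwards):

__global__ void blockSum(const int *data, int *blockSums, int n)
{
    __shared__ int s[512];
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    int t  = threadIdx.x;
    s[t] = (id < n) ? data[id] : 0;
    __syncthreads();

    // In each run, pairs 'stride' apart are combined and stored at the lower index.
    for (int stride = 1; stride < blockDim.x; stride *= 2) {
        if (t % (2 * stride) == 0)
            s[t] += s[t + stride];
        __syncthreads();
    }
    if (t == 0) blockSums[blockIdx.x] = s[0];   // partial sum of this block
}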

8.5 Kernel Specification:-

1. __global__ void sum() - 5 registers, 2076+16 bytes of smem, 8 bytes of cmem[1].

8.6 Salient Features:-

1. The use of shared memory to perform consecutive reads, which reduces the time that would have been spent in performing the same reads and writes using global memory.

2. Performing a proper synchronization between threads operating in parallel inside a block.

3. The code is generalised to run on a very large number of values.

4. Both the times t1 (without considering memory copy overhead) and t2 (considering memory transfer overhead) are calculated.

8.7 Limitations:-

1. We have assumed that the number of input values must be a power of 2.

2. We can run it for large values until the maximum number of blocks is reached, i.e. we can have at most 65536 blocks; thus we can compute the parallel sum of an array having 65536*512 = 33554432 elements.

3. GPU occupancy of 67% was achieved as calculated by the GPU Occupancy Calculator.

8.8 Observations:

1. For very small input sizes, the sequential sum appears to be much faster than the parallel code.

2. A good speedup w.r.t. kernel execution times is achieved, which demonstrates the efficiency of running the parallel code on CUDA.

8.9 Conclusions:

(a) The use of shared memory requires careful synchronization logic.

(b) Bank conflicts are very common due to unrestricted access of shared memory.

Figure 8.1: Execution time vs input size


Figure 8.2: speedUp vs input size

Figure 8.3: speedUp vs input size


Chapter 9

Calculation Of Variance and Standard Deviations on CUDA

9.1 Introduction

The mean of a data set is simply the arithmetic average of the values in the set, obtained by summing the values and dividing by the number of values. The mean is a measure of the center of the distribution. The variance is used as a measure of how far a set of numbers is spread out from each other. It gives a measure of how far the numbers lie from their mean. The variance of a data set is the arithmetic average of the squared differences between the values and the mean.

The standard deviation gives a measure of how much variation or dispersion there is from the mean. Mathematically it is the square root of the variance. The variance and the standard deviation are both measures of the spread of the distribution about the mean.

9.2 Finding VARIANCE AND DEVIATION proves to be advantageous

(a) When the spread of the data around the mean is to be found.

(b) When large data is to be analyzed on the basis of the extent of the spread in the data.

(c) For example, the margin of error in polling data is determined by calculating the standard deviation of the results if the polling were to be done multiple times.

9.3 Sequentially Calculate Variance and SD:

The sum is easily calculated by adding each element of the N-sized array, and the mean is found by dividing this sum by N:

for (i = 0; i < n; i = i + 1)
{
    sum = sum + A[i];
}
avrg = sum / n;

The variance is then calculated using the deviation from this mean value, by using the formula stated above. The looping would be:

for (i = 0; i < n; i++)
{
    sum1 += (A[i] - avrg) * (A[i] - avrg);
}
var = sum1 / n;
SD = sqrt(var);

Here SD is the standard deviation, which is the square root of the variance.

9.4 Parallely Calculate Variance and SD:

The process of finding the sum in parallel on CUDA is a complex one due to synchronization problems. The sum is calculated using the kernel described in chapter 3. The sum gives the average by dividing by N, and this is used by the 2nd kernel for the calculation of the variance and SD.

9.4.1 Implementation:

For an input array of size N (which can be very large), a single-dimension grid is created with (N/512) blocks. If the size of the input is N < 512, then a grid with one block containing N threads is launched by the kernel function. Each of the blocks is provided with a shared array of size 512 and its set of shared variables. All the values of the input array, which are stored in global memory, are mapped to a specific thread ID dependent on the number of blocks:

ID = blockIdx.x * dim_block + threadIdx.x;

Thus, respective elements are copied from the global memory to the shared memory of each block. The average calculated by kernel 1 is passed on to kernel 2, and the variance contribution of each block is calculated and stored in an array. Its summation gives the variance of the data, and the square root of the variance gives the SD.
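A minimal sketch of the second kernel described above, which accumulates each block's sum of squared deviations from the mean (assumed names; the host would sum the per-block results, divide by N and take the square root):

__global__ void blockSquaredDev(const float *data, float *blockOut,
                                float mean, int n)
{
    __shared__ float s[512];
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    int t  = threadIdx.x;
    float d = (id < n) ? (data[id] - mean) : 0.0f;
    s[t] = d * d;
    __syncthreads();

    // Tree reduction of the squared deviations within the block.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (t < stride)
            s[t] += s[t + stride];
        __syncthreads();
    }
    if (t == 0) blockOut[blockIdx.x] = s[0];
}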

9.5 Kernel Specification:

(a) __global__ void sum() - 5 registers, 2076+16 bytes of smem, 8 bytes of cmem[1].

9.6 Limitations:

(a) For large arrays (>512 values), the input size was limited to multiples of 512.

(b) GPU occupancy of 67% was achieved as calculated by the GPU Occupancy Calculator.

9.7 Observations:-

(a) For very small input sizes, the sequential code appears to be much faster than the parallel code.

(b) With the increase in the size of the input, the time taken by the sequential code increases almost linearly, whereas the time taken by the kernel to execute remains constant, but the overall performance of the parallel code is degraded by the time accounted for the memory copy overhead between host and device.

(c) Very large speedups w.r.t. kernel execution times are achieved, which demonstrates the efficiency of running the parallel code on CUDA, but the memory overhead for large values limits the overall speed-up.


9.8 Conclusions:

(a) Finding the mean, variance and SD sequentially is of O(n); hence no speed-up is achieved, as the kernel for finding the sum has synchronization problems to be met.

(b) Memory optimization techniques can be used to control synchronization of shared memory, and a speed-up may be achieved, but it is not guaranteed.

Figure 9.1: Execution time vs input size

Figure 9.2: speedUp vs input size


Figure 9.3: speedUp vs input size


Chapter 10

Data of Algorithms

The CPU we used has the following specifications:

Processor : Intel(R) Core(TM)2 Duo CPU E6550 @ 2.33GHz

Memory : 1GB DDR2 RAM

L2 Cache : 4 MB

The Nvidia Quadro FX 1700 GPGPU we used has the following specifica-

tions:

CUDA Parallel Processor Cores : 32

Memory Size : 512 MB

Memory Interface : 128-bit

Graphics Memory Bandwidth : 12.8 GB/sec

The graphics card used for our experiments (Quadro FX 1700) is of compute capability 1.1. This version does not support double-precision floating point. Also, the mathematical functions used are not fully accurate. This leads to a mild loss of accuracy in the final results.

In the tables that follow, PEx-time1 is the parallel execution time without the memory copy overhead (t1) and PEx-time2 is the parallel execution time including the memory transfer overhead (t2); Speed-up1 and Speed-up2 are the corresponding speed-ups relative to the sequential time.


Input     SeqEx-time   PEx-time1   PEx-time2   Speed-up1   Speed-up2
4         1            43          67          0.02        0.01
8         554          924         1009        0.17        0.15
16        2360         2118        2250        0.60        0.55
32        9414         8160        8405        1.11        1.05
64        37784        32041       32486       1.18        1.01
128       131292       133058      133952      0.99        0.98
256       538807       526462      528415      1.02        1.02
512       2378744      2118810     2122760     1.12        1.12
1024      11560038     8538991     8547882     1.35        1.35
2048      52087845     34331100    34357273    1.52        1.52

Table 10.1: Matrix Multiplication(time in 10−6s)

Input     SeqEx-time   PEx-time1   PEx-time2   Speed-up1   Speed-up2
4         1            51          190         0.02        0.01
8         2            61          200         0.03        0.01
16        6            77          226         0.08        0.03
32        13           94          243         0.14        0.05
64        33           120         280         0.28        0.12
128       77           147         297         0.52        0.26
256       179          182         332         0.98        0.54
512       423          251         402         1.69        1.05

Table 10.2: Bitonic Sort Algorithm (time in 10−6s)

Input     SeqEx-time   PEx-time1   PEx-time2   Speed-up1   Speed-up2
16        1            76          97          0.01        0.01
32        1            68          93          0.01        0.01
64        2            75          98          0.03        0.02
128       3            81          105         0.04        0.03
256       4            98          123         0.04        0.03
512       7            146         172         0.05        0.04
1024      14           151         179         0.09        0.08
2048      28           151         179         0.19        0.16
4096      54           250         301         0.22        0.18
8192      108          467         553         0.23        0.20
16384     215          934         1075        0.23        0.20
32768     430          1958        2266        0.22        0.19
65536     858          4503        5087        0.19        0.17
262144    2956         33192       35403       0.09        0.08
524288    5922         107130      111562      0.06        0.05

Table 10.3: Prefix Sum (time in 10−6s)


Input     SeqEx-time   PEx-time1   PEx-time2   Speed-up1   Speed-up2
4         1            43          67          0.02        0.01
8         1            47          73          0.02        0.01
16        3            50          73          0.06        0.04
32        9            58          83          0.16        0.11
64        32           78          105         0.41        0.30
128       113          144         67          0.96        0.78
256       446          282         67          1.75        1.58
512       1786         838         67          2.21        2.13

Table 10.4: Odd-Even Transposition Sort (time in 10−6s)

Input     SeqEx-time   PEx-time1   PEx-time2   Speed-up1   Speed-up2
4         1            63          87          0.02        0.01
16        2            177         201         0.01        0.01
32        3            549         574         0.01        0.01
64        7            1023        1030        0.01        0.01
256       32           2000        2018        0.02        0.02
512       68           2608        2646        0.03        0.03
1024      144          4608        4698        0.03        0.03
2048      290          7500        7568        0.04        0.04
8192      1252         17062       17124       0.07        0.07
32768     5392         29865       29936       0.18        0.18
131072    23079        92452       92498       0.25        0.25

Table 10.5: Quicksort (time in 10−6s)

Input     SeqEx-time   PEx-time1   PEx-time2   Speed-up1   Speed-up2
4         1            45          71          0.02        0.01
8         1            45          72          0.02        0.01
16        3            46          74          0.07        0.04
32        8            47          79          0.17        0.10
64        33           58          113         0.57        0.29
128       127          147         298         0.86        0.43
256       458          411         1045        1.11        0.44
512       2057         1528        3748        1.35        0.55
1024      9906         6076        13738       1.36        0.72
2048      45757        24740       50454       1.85        0.91
4096      202233       206395      307262      0.98        0.82

Table 10.6: Matrix-transpose (time in 10−6s)


Input     SeqEx-time   PEx-time1   PEx-time2   Speed-up1   Speed-up2
16        1            53          80          0.02        0.01
64        1            56          79          0.02        0.01
256       2            70          98          0.03        0.02
512       3            87          113         0.03        0.03
1024      6            89          117         0.07        0.05
4096      22           136         190         0.16        0.12
8192      41           227         312         0.18        0.13
16384     83           418         572         0.20        0.15
32768     168          768         1102        0.22        0.15
262144    1155         5829        8045        0.20        0.14
1048576   4650         23189       30856       0.20        0.15
4194304   18459        92547       118429      0.20        0.16
16777216  73597        369930      470787      0.20        0.16

Table 10.7: Summation Algorithm (time in 10−6s)

Input     SeqEx-time   PEx-time1   PEx-time2   Speed-up1   Speed-up2
512       7            272         294         0.03        0.02
1024      13           273         297         0.05        0.04
2048      25           281         311         0.09        0.08
4096      49           380         418         0.13        0.12
8192      98           582         639         0.17        0.15
16384     166          997         1090        0.17        0.15
32768     384          1767        1976        0.22        0.19
65536     767          3415        3813        0.22        0.20
131072    1323         6666        7427        0.20        0.18
262144    2639         13148       14643       0.20        0.18
1048576   10651        51131       55262       0.21        0.19
4194304   42947        200280      214462      0.21        0.20
16777216  171841       799691      854827      0.21        0.20

Table 10.8: Variance and SD (time in 10−6s)
