Parallel Programming in C with MPI and OpenMP - Michael J. Quinn


Page 1: Parallel Programming in C with MPI and OpenMP

Parallel Programming in C with MPI and OpenMP

Michael J. Quinn

Page 2: Parallel Programming in C with MPI and OpenMP

Chapter 5

The Sieve of Eratosthenes

Page 3: Parallel Programming in C with MPI and OpenMP

Chapter Objectives

Analysis of block allocation schemes
Function MPI_Bcast
Performance enhancements

Page 4: Parallel Programming in C with MPI and OpenMP

Outline

Sequential algorithm
Sources of parallelism
Data decomposition options
Parallel algorithm development, analysis
MPI program
Benchmarking
Optimizations

Page 5: Parallel Programming in C with MPI and OpenMP

Sequential Algorithm

[Figure: sieving the integers 2 through 61. Successive passes mark the multiples of 2, then of 3, then of 5 (25, 35, 55), then of 7 (49); the numbers left unmarked are the primes.]

Complexity: Θ(n ln ln n)

Page 6: Parallel Programming in C with MPI and OpenMP

Pseudocode

1. Create list of unmarked natural numbers 2, 3, …, n
2. k ← 2
3. Repeat
   (a) Mark all multiples of k between k² and n
   (b) k ← smallest unmarked number > k
   until k² > n
4. The unmarked numbers are primes
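For reference, a minimal sequential C implementation of this pseudocode (an illustrative sketch, not the book's benchmark program):

#include <stdio.h>
#include <stdlib.h>

int main (int argc, char *argv[])
{
   int n = (argc > 1) ? atoi (argv[1]) : 100;
   char *marked = (char *) calloc (n + 1, 1);   /* marked[i] != 0 means i is composite */
   int k = 2;
   do {
      for (int i = k * k; i <= n; i += k)       /* step 3(a): mark multiples of k from k^2 */
         marked[i] = 1;
      while (marked[++k]);                      /* step 3(b): next unmarked number > k */
   } while (k * k <= n);

   int count = 0;
   for (int i = 2; i <= n; i++)
      if (!marked[i]) count++;                  /* step 4: unmarked numbers are primes */
   printf ("%d primes are less than or equal to %d\n", count, n);
   free (marked);
   return 0;
}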

Page 7: Parallel Programming in C with MPI and OpenMP

Sources of Parallelism

Domain decomposition
   Divide data into pieces
   Associate computational steps with data

One primitive task per array element

Page 8: Parallel Programming in C with MPI and OpenMP

Making 3(a) Parallel

Mark all multiples of k between k² and n:

for all j where k² ≤ j ≤ n do
   if j mod k = 0 then
      mark j (it is not a prime)
   endif
endfor

Page 9: Parallel Programming in C with MPI and OpenMP

Making 3(b) Parallel

Find smallest unmarked number > k

Min-reduction (to find smallest unmarked number > k)

Broadcast (to get result to all tasks)

Page 10: Parallel Programming in C with MPI and OpenMP

Agglomeration Goals

Consolidate tasks
Reduce communication cost
Balance computations among processes

Page 11: Parallel Programming in C with MPI and OpenMP

Data Decomposition Options

Interleaved (cyclic)
   Easy to determine “owner” of each index
   Leads to load imbalance for this problem

Block
   Balances loads
   More complicated to determine owner if n not a multiple of p

Page 12: Parallel Programming in C with MPI and OpenMP

Block Decomposition Options

Want to balance workload when n not a multiple of p

Each process gets either ⌈n/p⌉ or ⌊n/p⌋ elements

Seek simple expressions
   Find low, high indices given an owner
   Find owner given an index

Page 13: Parallel Programming in C with MPI and OpenMP

Method #1

Let r = n mod p
If r = 0, all blocks have the same size
Else
   First r blocks have size ⌈n/p⌉
   Remaining p - r blocks have size ⌊n/p⌋

Page 14: Parallel Programming in C with MPI and OpenMP

Examples

17 elements divided among 7 processes

17 elements divided among 5 processes

17 elements divided among 3 processes

Page 15: Parallel Programming in C with MPI and OpenMP

Method #1 Calculations

First element controlled by process i:
   i⌊n/p⌋ + min(i, r)

Last element controlled by process i:
   (i + 1)⌊n/p⌋ + min(i + 1, r) - 1

Process controlling element j:
   min(⌊j / (⌊n/p⌋ + 1)⌋, ⌊(j - r) / ⌊n/p⌋⌋)
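A small stand-alone program (illustrative, not from the book) that evaluates the Method #1 formulas for 17 elements divided among 5 processes:

#include <stdio.h>

#define MIN(a,b) ((a) < (b) ? (a) : (b))

int main (void)
{
   int n = 17, p = 5;
   int r = n % p;                                /* number of "large" blocks */
   for (int i = 0; i < p; i++) {
      int low  = i * (n / p) + MIN(i, r);        /* first element of process i */
      int high = (i + 1) * (n / p) + MIN(i + 1, r) - 1;   /* last element of process i */
      printf ("process %d: elements %d..%d (size %d)\n",
              i, low, high, high - low + 1);
   }
   return 0;
}

It prints blocks 0..3, 4..7, 8..10, 11..13, and 14..16, i.e. two blocks of 4 followed by three blocks of 3.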

Page 16: Parallel Programming in C with MPI and OpenMP

Method #2

Scatters larger blocks among processes

First element controlled by process i:
   ⌊in/p⌋

Last element controlled by process i:
   ⌊(i + 1)n/p⌋ - 1

Process controlling element j:
   ⌊(p(j + 1) - 1)/n⌋

Page 17: Parallel Programming in C with MPI and OpenMP

Examples

17 elements divided among 7 processes

17 elements divided among 5 processes

17 elements divided among 3 processes

Page 18: Parallel Programming in C with MPI and OpenMP

Comparing Methods

Operations    Method #1   Method #2
Low index         4           2
High index        6           4
Owner             7           4

(Assuming no operations for the “floor” function.)

Method #2 is our choice.

Page 19: Parallel Programming in C with MPI and OpenMP

Pop Quiz

Illustrate how block decomposition method #2 would divide 13 elements among 5 processes.

First element of process i is ⌊13i/5⌋:
   ⌊13·0/5⌋ = 0
   ⌊13·1/5⌋ = 2
   ⌊13·2/5⌋ = 5
   ⌊13·3/5⌋ = 7
   ⌊13·4/5⌋ = 10

So the five blocks have sizes 2, 3, 2, 3, 3.

Page 20: Parallel Programming in C with MPI and OpenMP

Block Decomposition Macros

#define BLOCK_LOW(id,p,n) \
   ((id)*(n)/(p))

#define BLOCK_HIGH(id,p,n) \
   (BLOCK_LOW((id)+1,p,n)-1)

#define BLOCK_SIZE(id,p,n) \
   (BLOCK_LOW((id)+1,p,n)-BLOCK_LOW(id,p,n))

#define BLOCK_OWNER(index,p,n) \
   (((p)*((index)+1)-1)/(n))
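A small driver (illustrative, not part of MyMPI.h) showing the macros applied to the pop-quiz case of 13 elements among 5 processes:

#include <stdio.h>

#define BLOCK_LOW(id,p,n)  ((id)*(n)/(p))
#define BLOCK_HIGH(id,p,n) (BLOCK_LOW((id)+1,p,n)-1)
#define BLOCK_SIZE(id,p,n) (BLOCK_LOW((id)+1,p,n)-BLOCK_LOW(id,p,n))
#define BLOCK_OWNER(index,p,n) (((p)*((index)+1)-1)/(n))

int main (void)
{
   int n = 13, p = 5;
   for (int id = 0; id < p; id++)
      printf ("process %d: global indices %d..%d (size %d)\n",
              id, BLOCK_LOW(id,p,n), BLOCK_HIGH(id,p,n), BLOCK_SIZE(id,p,n));
   printf ("global index 6 is owned by process %d\n", BLOCK_OWNER(6,p,n));
   return 0;
}

The printed sizes are 2, 3, 2, 3, 3, matching the local/global index layout on the next slide.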

Page 21: Parallel Programming in C with MPI and OpenMP

Local vs. Global Indices

Process 0: local L 0 1    = global G 0 1
Process 1: local L 0 1 2  = global G 2 3 4
Process 2: local L 0 1    = global G 5 6
Process 3: local L 0 1 2  = global G 7 8 9
Process 4: local L 0 1 2  = global G 10 11 12

Page 22: Parallel Programming in C with MPI and OpenMP

Looping over Elements

Sequential program:

for (i = 0; i < n; i++) {
   …
}

Parallel program:

size = BLOCK_SIZE (id,p,n);
for (i = 0; i < size; i++) {
   gi = i + BLOCK_LOW(id,p,n);
}

Global index gi on this process takes the place of the sequential program's index i.

Page 23: Parallel Programming in C with MPI and OpenMP

Decomposition Affects Implementation

Largest prime used to sieve is √n
First process has ⌊n/p⌋ elements
It has all sieving primes if p < √n
First process always broadcasts next sieving prime
No reduction step needed

Page 24: Parallel Programming in C with MPI and OpenMP

Fast Marking

Block decomposition allows same marking as sequential algorithm:

   j, j + k, j + 2k, j + 3k, …

instead of:

   for all j in block
      if j mod k = 0 then mark j (it is not a prime)
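The index arithmetic for finding the first cell to mark can be written as a small helper (a sketch; the same logic appears inline in Code (3/4) below). Here low_value is the first global value stored in this process's block:

/* Local index of the first cell this process must mark for a given
   sieving prime (illustrative helper mirroring Code (3/4) below). */
long first_to_mark (long prime, long low_value)
{
   if (prime * prime > low_value)
      return prime * prime - low_value;     /* first multiple to mark is prime^2 */
   if (low_value % prime == 0)
      return 0;                             /* block begins on a multiple of prime */
   return prime - (low_value % prime);      /* next multiple after low_value */
}

The process then marks marked[first], marked[first + prime], marked[first + 2*prime], … with no modulo test inside the loop.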

Page 25: Parallel Programming in C with MPI and OpenMP

Parallel Algorithm Development

1. Create list of unmarked natural numbers 2, 3, …, n
      (each process creates its share of the list)
2. k ← 2
      (each process does this)
3. Repeat
   (a) Mark all multiples of k between k² and n
      (each process marks its share of the list)
   (b) k ← smallest unmarked number > k
      (process 0 only)
   (c) Process 0 broadcasts k to rest of processes
   until k² > n
4. The unmarked numbers are primes
5. Reduction to determine number of primes

Page 26: Parallel Programming in C with MPI and OpenMP

Function MPI_Bcast

int MPI_Bcast (

void *buffer, /* Addr of 1st element */

int count, /* # elements to broadcast */

MPI_Datatype datatype, /* Type of elements */

int root, /* ID of root process */

MPI_Comm comm) /* Communicator */

MPI_Bcast (&k, 1, MPI_INT, 0, MPI_COMM_WORLD);

Page 27: Parallel Programming in C with MPI and OpenMP

Task/Channel Graph

Page 28: Parallel Programming in C with MPI and OpenMP

Analysis

χ is the time needed to mark a cell; λ is the message latency
Sequential execution time: χ n ln ln n
Number of broadcasts: √n / ln √n
Broadcast time: λ ⌈log p⌉
Expected parallel execution time:

   χ (n ln ln n) / p + (√n / ln √n) λ ⌈log p⌉
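Plugging in the constants measured on the benchmarking slide below (χ = 85.47 ns, λ = 250 μs) gives the predicted times in the table that follows. The sketch below assumes n = 10⁸, which is consistent with the 24.9-second sequential time:

#include <stdio.h>
#include <math.h>

/* Evaluate chi*n*ln(ln n)/p + (sqrt(n)/ln(sqrt(n))) * lambda * ceil(log2(p))
   for p = 1..8; n = 1e8 is an assumption consistent with the 24.9 s
   sequential time measured at chi = 85.47 ns per cell. */
int main (void)
{
   const double chi    = 85.47e-9;   /* seconds to mark one cell */
   const double lambda = 250e-6;     /* seconds of message latency */
   const double n      = 1.0e8;

   for (int p = 1; p <= 8; p++) {
      double compute   = chi * n * log (log (n)) / p;
      double broadcast = (sqrt (n) / log (sqrt (n))) * lambda * ceil (log2 ((double) p));
      printf ("p = %d: predicted time %.3f sec\n", p, compute + broadcast);
   }
   return 0;
}

Its output should reproduce the Predicted column of the table below (24.900, 12.721, 8.843, … seconds).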

Page 29: Parallel Programming in C with MPI and OpenMP

Code (1/4)

#include <mpi.h>
#include <math.h>
#include <stdio.h>
#include "MyMPI.h"
#define MIN(a,b) ((a)<(b)?(a):(b))

int main (int argc, char *argv[])
{
   ...
   MPI_Init (&argc, &argv);
   MPI_Barrier (MPI_COMM_WORLD);
   elapsed_time = -MPI_Wtime();
   MPI_Comm_rank (MPI_COMM_WORLD, &id);
   MPI_Comm_size (MPI_COMM_WORLD, &p);
   if (argc != 2) {
      if (!id) printf ("Command line: %s <m>\n", argv[0]);
      MPI_Finalize();
      exit (1);
   }

Page 30: Parallel Programming in C with MPI and OpenMP

Code (2/4)

   n = atoi(argv[1]);
   low_value = 2 + BLOCK_LOW(id,p,n-1);
   high_value = 2 + BLOCK_HIGH(id,p,n-1);
   size = BLOCK_SIZE(id,p,n-1);
   proc0_size = (n-1)/p;
   if ((2 + proc0_size) < (int) sqrt((double) n)) {
      if (!id) printf ("Too many processes\n");
      MPI_Finalize();
      exit (1);
   }

   marked = (char *) malloc (size);
   if (marked == NULL) {
      printf ("Cannot allocate enough memory\n");
      MPI_Finalize();
      exit (1);
   }

Page 31: Parallel Programming in C with MPI and OpenMP

Code (3/4)

   for (i = 0; i < size; i++) marked[i] = 0;
   if (!id) index = 0;
   prime = 2;
   do {
      if (prime * prime > low_value)
         first = prime * prime - low_value;
      else {
         if (!(low_value % prime)) first = 0;
         else first = prime - (low_value % prime);
      }
      for (i = first; i < size; i += prime) marked[i] = 1;
      if (!id) {
         while (marked[++index]);
         prime = index + 2;
      }
      MPI_Bcast (&prime, 1, MPI_INT, 0, MPI_COMM_WORLD);
   } while (prime * prime <= n);

Page 32: Parallel Programming in C with MPI and OpenMP

Code (4/4)

   count = 0;
   for (i = 0; i < size; i++)
      if (!marked[i]) count++;
   MPI_Reduce (&count, &global_count, 1, MPI_INT, MPI_SUM,
               0, MPI_COMM_WORLD);
   elapsed_time += MPI_Wtime();
   if (!id) {
      printf ("%d primes are less than or equal to %d\n",
              global_count, n);
      printf ("Total elapsed time: %10.6f\n", elapsed_time);
   }
   MPI_Finalize ();
   return 0;
}

Page 33: Parallel Programming in C with MPI and OpenMP

Benchmarking

Execute sequential algorithm
   Determine χ = 85.47 nanosec
Execute series of broadcasts
   Determine λ = 250 μsec

Page 34: Parallel Programming in C with MPI and OpenMP

Execution Times (sec)

Processors   Predicted   Actual (sec)
    1          24.900       24.900
    2          12.721       13.011
    3           8.843        9.039
    4           6.768        7.055
    5           5.794        5.993
    6           4.964        5.159
    7           4.371        4.687
    8           3.927        4.222

Page 35: Parallel Programming in C with MPI and OpenMP

Improvements

Delete even integers
   Cuts number of computations in half
   Frees storage for larger values of n

Each process finds own sieving primes
   Replicating computation of primes to √n
   Eliminates broadcast step

Reorganize loops
   Increases cache hit rate
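A sketch of the second improvement, where each process computes its own table of sieving primes up to √n instead of receiving each one by broadcast (illustrative only, not the book's code):

#include <stdlib.h>
#include <math.h>

/* Each process builds its own copy of the sieving primes up to sqrt(n).
   Returns a char array where sieving[k] == 0 means k is prime
   (entries 0 and 1 are unused); the caller frees it. */
char *local_sieving_primes (int n)
{
   int sqrt_n = (int) sqrt ((double) n);
   char *sieving = (char *) calloc (sqrt_n + 1, 1);
   for (int k = 2; k * k <= sqrt_n; k++)
      if (!sieving[k])
         for (int j = k * k; j <= sqrt_n; j += k)
            sieving[j] = 1;
   return sieving;
}

Each process then iterates over the unmarked entries of this local table to drive the marking of its own block, so the MPI_Bcast inside the main loop is no longer needed.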

Page 36: Parallel Programming in C with MPI and OpenMP

Reorganize Loops

[Figure: two loop organizations compared by cache hit rate, one lower and one higher.]

Page 37: Parallel Programming in C with MPI and OpenMP

Comparing 4 Versions

Execution times (sec):

Procs   Sieve 1   Sieve 2   Sieve 3   Sieve 4
  1      24.900    12.237    12.466    2.543
  2      12.721     6.609     6.378    1.330
  3       8.843     5.019     4.272    0.901
  4       6.768     4.072     3.201    0.679
  5       5.794     3.652     2.559    0.543
  6       4.964     3.270     2.127    0.456
  7       4.371     3.059     1.820    0.391
  8       3.927     2.856     1.585    0.342

10-fold improvement from Sieve 1 to Sieve 4 on a single processor (24.900 s vs. 2.543 s)

7-fold improvement for Sieve 4 going from 1 to 8 processors (2.543 s vs. 0.342 s)

Page 38: Parallel Programming in C with MPI and OpenMP

Summary

Sieve of Eratosthenes: parallel design uses domain decomposition

Compared two block distributions
   Chose one with simpler formulas

Introduced MPI_Bcast

Optimizations reveal importance of maximizing single-processor performance