37
Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display. Chapter 5 The Sieve of Eratosthenes The Sieve of Eratosthenes

Parallel Programming in C with the Message Passing Interfaceacc6.its.brooklyn.cuny.edu/~cisc7340/examples/Chapter5.pdf6.378 12.466 Sieve 3 Sieve 4 8 3.927 2.856 0.342 7 4.371 3.059

  • Upload
    others

  • View
    7

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Parallel Programming in C with the Message Passing Interfaceacc6.its.brooklyn.cuny.edu/~cisc7340/examples/Chapter5.pdf6.378 12.466 Sieve 3 Sieve 4 8 3.927 2.856 0.342 7 4.371 3.059

Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display.

Chapter 5

The Sieve of EratosthenesThe Sieve of Eratosthenes

Page 2: Parallel Programming in C with the Message Passing Interfaceacc6.its.brooklyn.cuny.edu/~cisc7340/examples/Chapter5.pdf6.378 12.466 Sieve 3 Sieve 4 8 3.927 2.856 0.342 7 4.371 3.059

Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display.

Chapter Objectives

Analysis of block allocation schemesAnalysis of block allocation schemes Function MPI_BcastFunction MPI_Bcast Performance enhancementsPerformance enhancements

Page 3: Parallel Programming in C with the Message Passing Interfaceacc6.its.brooklyn.cuny.edu/~cisc7340/examples/Chapter5.pdf6.378 12.466 Sieve 3 Sieve 4 8 3.927 2.856 0.342 7 4.371 3.059

Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display.

Outline

Sequential algorithmSequential algorithm Sources of parallelismSources of parallelism Data decomposition optionsData decomposition options Parallel algorithm development, analysisParallel algorithm development, analysis MPI programMPI program BenchmarkingBenchmarking OptimizationsOptimizations

Page 4: Parallel Programming in C with the Message Passing Interfaceacc6.its.brooklyn.cuny.edu/~cisc7340/examples/Chapter5.pdf6.378 12.466 Sieve 3 Sieve 4 8 3.927 2.856 0.342 7 4.371 3.059

Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display.

Sequential Algorithm

2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

17 18 19 20 21 22 23 24 25 26 27 28 29 30 31

32 33 34 35 36 37 38 39 40 41 42 43 44 45 46

47 48 49 50 51 52 53 54 55 56 57 58 59 60 61

2 4 6 8 10 12 14 16

18 20 22 24 26 28 30

32 34 36 38 40 42 44 46

48 50 52 54 56 58 60

3 9 15

21 27

33 39 45

51 57

5

25

35

55

7

49

Complexity: (n ln ln n)

Page 5: Parallel Programming in C with the Message Passing Interfaceacc6.its.brooklyn.cuny.edu/~cisc7340/examples/Chapter5.pdf6.378 12.466 Sieve 3 Sieve 4 8 3.927 2.856 0.342 7 4.371 3.059

Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display.

Pseudocode1. Create list of unmarked natural numbers 2, 3, …, n2. k 23. Repeat

(a) Mark all multiples of k between k2 and n(b) k smallest unmarked number > k

until k2 > n4. The unmarked numbers are primes

Page 6: Parallel Programming in C with the Message Passing Interfaceacc6.its.brooklyn.cuny.edu/~cisc7340/examples/Chapter5.pdf6.378 12.466 Sieve 3 Sieve 4 8 3.927 2.856 0.342 7 4.371 3.059

Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display.

Sources of Parallelism

Domain decompositionDomain decompositionDivide data into piecesDivide data into piecesAssociate computational steps with dataAssociate computational steps with data

One primitive task per array elementOne primitive task per array element

Page 7: Parallel Programming in C with the Message Passing Interfaceacc6.its.brooklyn.cuny.edu/~cisc7340/examples/Chapter5.pdf6.378 12.466 Sieve 3 Sieve 4 8 3.927 2.856 0.342 7 4.371 3.059

Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display.

Making 3(a) Parallel

Mark all multiples of k between k2 and n

for all j where k2 j n do if j mod k = 0 then mark j (it is not a prime) endifendfor

Page 8: Parallel Programming in C with the Message Passing Interfaceacc6.its.brooklyn.cuny.edu/~cisc7340/examples/Chapter5.pdf6.378 12.466 Sieve 3 Sieve 4 8 3.927 2.856 0.342 7 4.371 3.059

Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display.

Making 3(b) ParallelFind smallest unmarked number > k

Min-reduction (to find smallest unmarked number > k)

Broadcast (to get result to all tasks)

Page 9: Parallel Programming in C with the Message Passing Interfaceacc6.its.brooklyn.cuny.edu/~cisc7340/examples/Chapter5.pdf6.378 12.466 Sieve 3 Sieve 4 8 3.927 2.856 0.342 7 4.371 3.059

Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display.

Agglomeration Goals

Consolidate tasksConsolidate tasks Reduce communication costReduce communication cost Balance computations among processesBalance computations among processes

Page 10: Parallel Programming in C with the Message Passing Interfaceacc6.its.brooklyn.cuny.edu/~cisc7340/examples/Chapter5.pdf6.378 12.466 Sieve 3 Sieve 4 8 3.927 2.856 0.342 7 4.371 3.059

Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display.

Data Decomposition Options

Interleaved (cyclic)Interleaved (cyclic)Easy to determine “owner” of each indexEasy to determine “owner” of each indexLeads to load imbalance Leads to load imbalance for this problemfor this problem

BlockBlockBalances loadsBalances loadsMore complicated to determine owner if More complicated to determine owner if

nn not a multiple of not a multiple of pp

Page 11: Parallel Programming in C with the Message Passing Interfaceacc6.its.brooklyn.cuny.edu/~cisc7340/examples/Chapter5.pdf6.378 12.466 Sieve 3 Sieve 4 8 3.927 2.856 0.342 7 4.371 3.059

Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display.

Block Decomposition Options

Want to balance workload when Want to balance workload when nn not a not a multiple of multiple of pp

Each process gets either Each process gets either n/pn/p or or n/pn/p elementselements

Seek simple expressionsSeek simple expressionsFind low, high indices given an ownerFind low, high indices given an ownerFind owner given an indexFind owner given an index

Page 12: Parallel Programming in C with the Message Passing Interfaceacc6.its.brooklyn.cuny.edu/~cisc7340/examples/Chapter5.pdf6.378 12.466 Sieve 3 Sieve 4 8 3.927 2.856 0.342 7 4.371 3.059

Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display.

Method #1

Let Let r r = = nn mod mod pp If If rr = 0, all blocks have same size = 0, all blocks have same size ElseElseFirst First rr blocks have size blocks have size n/pn/pRemaining Remaining p-rp-r blocks have size blocks have size n/pn/p

Page 13: Parallel Programming in C with the Message Passing Interfaceacc6.its.brooklyn.cuny.edu/~cisc7340/examples/Chapter5.pdf6.378 12.466 Sieve 3 Sieve 4 8 3.927 2.856 0.342 7 4.371 3.059

Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display.

Examples17 elements divided among 7 processes

17 elements divided among 5 processes

17 elements divided among 3 processes

Page 14: Parallel Programming in C with the Message Passing Interfaceacc6.its.brooklyn.cuny.edu/~cisc7340/examples/Chapter5.pdf6.378 12.466 Sieve 3 Sieve 4 8 3.927 2.856 0.342 7 4.371 3.059

Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display.

Method #1 Calculations

First element controlled by process First element controlled by process ii

Last element controlled by process Last element controlled by process ii

Process controlling element Process controlling element jj

),min(/ ripni

1),1min(/)1( ripni

)//)(,)1//(min( pnrjpnj

Page 15: Parallel Programming in C with the Message Passing Interfaceacc6.its.brooklyn.cuny.edu/~cisc7340/examples/Chapter5.pdf6.378 12.466 Sieve 3 Sieve 4 8 3.927 2.856 0.342 7 4.371 3.059

Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display.

Method #2

Scatters larger blocks among processesScatters larger blocks among processes First element controlled by process First element controlled by process ii

Last element controlled by process Last element controlled by process ii

Process controlling element Process controlling element jj

pin /

1/)1( pni

njp /)1)1(

Page 16: Parallel Programming in C with the Message Passing Interfaceacc6.its.brooklyn.cuny.edu/~cisc7340/examples/Chapter5.pdf6.378 12.466 Sieve 3 Sieve 4 8 3.927 2.856 0.342 7 4.371 3.059

Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display.

Examples17 elements divided among 7 processes

17 elements divided among 5 processes

17 elements divided among 3 processes

Page 17: Parallel Programming in C with the Message Passing Interfaceacc6.its.brooklyn.cuny.edu/~cisc7340/examples/Chapter5.pdf6.378 12.466 Sieve 3 Sieve 4 8 3.927 2.856 0.342 7 4.371 3.059

Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display.

Comparing Methods

24Low index

47Owner

46High index

Method 2Method 1Operations

Assuming no operations for “floor” function

Our choice

Page 18: Parallel Programming in C with the Message Passing Interfaceacc6.its.brooklyn.cuny.edu/~cisc7340/examples/Chapter5.pdf6.378 12.466 Sieve 3 Sieve 4 8 3.927 2.856 0.342 7 4.371 3.059

Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display.

Pop Quiz

Illustrate how block decomposition method Illustrate how block decomposition method #2 would divide 13 elements among 5 #2 would divide 13 elements among 5 processes.processes.

13(0)/ 5 = 0

13(1)/5 = 2

13(2)/ 5 = 5

13(3)/ 5 = 7

13(4)/ 5 = 10

Page 19: Parallel Programming in C with the Message Passing Interfaceacc6.its.brooklyn.cuny.edu/~cisc7340/examples/Chapter5.pdf6.378 12.466 Sieve 3 Sieve 4 8 3.927 2.856 0.342 7 4.371 3.059

Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display.

Block Decomposition Macros#define BLOCK_LOW(id,p,n) ((i)*(n)/(p))

#define BLOCK_HIGH(id,p,n) \ (BLOCK_LOW((id)+1,p,n)-1)

#define BLOCK_SIZE(id,p,n) \ (BLOCK_LOW((id)+1)-BLOCK_LOW(id))

#define BLOCK_OWNER(index,p,n) \ (((p)*(index)+1)-1)/(n))

Page 20: Parallel Programming in C with the Message Passing Interfaceacc6.its.brooklyn.cuny.edu/~cisc7340/examples/Chapter5.pdf6.378 12.466 Sieve 3 Sieve 4 8 3.927 2.856 0.342 7 4.371 3.059

Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display.

Local vs. Global Indices

L 0 1

L 0 1 2

L 0 1

L 0 1 2

L 0 1 2

G 0 1 G 2 3 4

G 5 6

G 7 8 9 G 10 11 12

Page 21: Parallel Programming in C with the Message Passing Interfaceacc6.its.brooklyn.cuny.edu/~cisc7340/examples/Chapter5.pdf6.378 12.466 Sieve 3 Sieve 4 8 3.927 2.856 0.342 7 4.371 3.059

Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display.

Looping over Elements

Sequential programSequential programfor (i = 0; i < n; i++) {for (i = 0; i < n; i++) { … …}}

Parallel programParallel programsize = BLOCK_SIZE (id,p,n);size = BLOCK_SIZE (id,p,n);for (i = 0; i < size; i++) {for (i = 0; i < size; i++) { gi = i + BLOCK_LOW(id,p,n);gi = i + BLOCK_LOW(id,p,n);}}

Index i on this process…

…takes place of sequential program’s index gi

Page 22: Parallel Programming in C with the Message Passing Interfaceacc6.its.brooklyn.cuny.edu/~cisc7340/examples/Chapter5.pdf6.378 12.466 Sieve 3 Sieve 4 8 3.927 2.856 0.342 7 4.371 3.059

Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display.

Decomposition Affects Implementation

Largest prime used to sieve is Largest prime used to sieve is nn First process has First process has nn//pp elements elements It has all sieving primes if It has all sieving primes if pp < < nn First process always broadcasts next sieving First process always broadcasts next sieving

primeprime No reduction step neededNo reduction step needed

Page 23: Parallel Programming in C with the Message Passing Interfaceacc6.its.brooklyn.cuny.edu/~cisc7340/examples/Chapter5.pdf6.378 12.466 Sieve 3 Sieve 4 8 3.927 2.856 0.342 7 4.371 3.059

Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display.

Fast Marking

Block decomposition allows same marking as Block decomposition allows same marking as sequential algorithm:sequential algorithm:

jj, , j j + + kk, , j j + 2+ 2kk, , j j + 3+ 3kk, …, …

instead ofinstead of

for all for all jj in block in blockif if jj mod mod kk = 0 then mark = 0 then mark jj (it is not a prime) (it is not a prime)

Page 24: Parallel Programming in C with the Message Passing Interfaceacc6.its.brooklyn.cuny.edu/~cisc7340/examples/Chapter5.pdf6.378 12.466 Sieve 3 Sieve 4 8 3.927 2.856 0.342 7 4.371 3.059

Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display.

Parallel Algorithm Development1. Create list of unmarked natural numbers 2, 3, …, n

2. k 2

3. Repeat

(a) Mark all multiples of k between k2 and n

(b) k smallest unmarked number > k

until k2 > m

4. The unmarked numbers are primes

Each process creates its share of listEach process does this

Each process marks its share of list

Process 0 only

(c) Process 0 broadcasts k to rest of processes

5. Reduction to determine number of primes

Page 25: Parallel Programming in C with the Message Passing Interfaceacc6.its.brooklyn.cuny.edu/~cisc7340/examples/Chapter5.pdf6.378 12.466 Sieve 3 Sieve 4 8 3.927 2.856 0.342 7 4.371 3.059

Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display.

Function MPI_Bcastint MPI_Bcast (

void *buffer, /* Addr of 1st element */

int count, /* # elements to broadcast */

MPI_Datatype datatype, /* Type of elements */

int root, /* ID of root process */

MPI_Comm comm) /* Communicator */

MPI_Bcast (&k, 1, MPI_INT, 0, MPI_COMM_WORLD);

Page 26: Parallel Programming in C with the Message Passing Interfaceacc6.its.brooklyn.cuny.edu/~cisc7340/examples/Chapter5.pdf6.378 12.466 Sieve 3 Sieve 4 8 3.927 2.856 0.342 7 4.371 3.059

Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display.

Task/Channel Graph

Page 27: Parallel Programming in C with the Message Passing Interfaceacc6.its.brooklyn.cuny.edu/~cisc7340/examples/Chapter5.pdf6.378 12.466 Sieve 3 Sieve 4 8 3.927 2.856 0.342 7 4.371 3.059

Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display.

Analysis

is time needed to mark a cellis time needed to mark a cell Sequential execution time: Sequential execution time: nn ln ln ln ln nn Number of broadcasts: Number of broadcasts: n n / ln / ln n n Broadcast time: Broadcast time: log log pp Expected execution time:Expected execution time:

pnnpnn log)ln/(/lnln

Page 28: Parallel Programming in C with the Message Passing Interfaceacc6.its.brooklyn.cuny.edu/~cisc7340/examples/Chapter5.pdf6.378 12.466 Sieve 3 Sieve 4 8 3.927 2.856 0.342 7 4.371 3.059

Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display.

Code (1/4)#include <mpi.h>#include <math.h>#include <stdio.h>#include "MyMPI.h"#define MIN(a,b) ((a)<(b)?(a):(b))

int main (int argc, char *argv[]){ ... MPI_Init (&argc, &argv); MPI_Barrier(MPI_COMM_WORLD); elapsed_time = -MPI_Wtime(); MPI_Comm_rank (MPI_COMM_WORLD, &id); MPI_Comm_size (MPI_COMM_WORLD, &p);if (argc != 2) { if (!id) printf ("Command line: %s <m>\n", argv[0]); MPI_Finalize(); exit (1);}

Page 29: Parallel Programming in C with the Message Passing Interfaceacc6.its.brooklyn.cuny.edu/~cisc7340/examples/Chapter5.pdf6.378 12.466 Sieve 3 Sieve 4 8 3.927 2.856 0.342 7 4.371 3.059

Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display.

Code (2/4) n = atoi(argv[1]); low_value = 2 + BLOCK_LOW(id,p,n-1); high_value = 2 + BLOCK_HIGH(id,p,n-1); size = BLOCK_SIZE(id,p,n-1); proc0_size = (n-1)/p; if ((2 + proc0_size) < (int) sqrt((double) n)) { if (!id) printf ("Too many processes\n"); MPI_Finalize(); exit (1); }

marked = (char *) malloc (size); if (marked == NULL) { printf ("Cannot allocate enough memory\n"); MPI_Finalize(); exit (1); }

Page 30: Parallel Programming in C with the Message Passing Interfaceacc6.its.brooklyn.cuny.edu/~cisc7340/examples/Chapter5.pdf6.378 12.466 Sieve 3 Sieve 4 8 3.927 2.856 0.342 7 4.371 3.059

Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display.

Code (3/4) for (i = 0; i < size; i++) marked[i] = 0; if (!id) index = 0; prime = 2; do { if (prime * prime > low_value) first = prime * prime - low_value; else { if (!(low_value % prime)) first = 0; else first = prime - (low_value % prime); } for (i = first; i < size; i += prime) marked[i] = 1; if (!id) { while (marked[++index]); prime = index + 2; } MPI_Bcast (&prime, 1, MPI_INT, 0, MPI_COMM_WORLD); } while (prime * prime <= n);

Page 31: Parallel Programming in C with the Message Passing Interfaceacc6.its.brooklyn.cuny.edu/~cisc7340/examples/Chapter5.pdf6.378 12.466 Sieve 3 Sieve 4 8 3.927 2.856 0.342 7 4.371 3.059

Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display.

Code (4/4)

count = 0; for (i = 0; i < size; i++) if (!marked[i]) count++; MPI_Reduce (&count, &global_count, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD); elapsed_time += MPI_Wtime(); if (!id) { printf ("%d primes are less than or equal to %d\n", global_count, n); printf ("Total elapsed time: %10.6f\n", elapsed_time); } MPI_Finalize (); return 0;}

Page 32: Parallel Programming in C with the Message Passing Interfaceacc6.its.brooklyn.cuny.edu/~cisc7340/examples/Chapter5.pdf6.378 12.466 Sieve 3 Sieve 4 8 3.927 2.856 0.342 7 4.371 3.059

Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display.

Benchmarking

Execute sequential algorithmExecute sequential algorithm Determine Determine = 85.47 nanosec = 85.47 nanosec Execute series of broadcastsExecute series of broadcasts Determine Determine = 250 = 250 secsec

Page 33: Parallel Programming in C with the Message Passing Interfaceacc6.its.brooklyn.cuny.edu/~cisc7340/examples/Chapter5.pdf6.378 12.466 Sieve 3 Sieve 4 8 3.927 2.856 0.342 7 4.371 3.059

Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display.

Execution Times (sec)

4.2223.9278

4.6874.3717

5.1594.9646

5.9935.7945

7.0556.7684

9.0398.8433

13.01112.7212

24.90024.9001

Actual (sec)PredictedProcessors

Page 34: Parallel Programming in C with the Message Passing Interfaceacc6.its.brooklyn.cuny.edu/~cisc7340/examples/Chapter5.pdf6.378 12.466 Sieve 3 Sieve 4 8 3.927 2.856 0.342 7 4.371 3.059

Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display.

Improvements

Delete even integersDelete even integers Cuts number of computations in halfCuts number of computations in half Frees storage for larger values of Frees storage for larger values of nn

Each process finds own sieving primesEach process finds own sieving primes Replicating computation of primes to Replicating computation of primes to nn Eliminates broadcast stepEliminates broadcast step

Reorganize loopsReorganize loops Increases cache hit rateIncreases cache hit rate

Page 35: Parallel Programming in C with the Message Passing Interfaceacc6.its.brooklyn.cuny.edu/~cisc7340/examples/Chapter5.pdf6.378 12.466 Sieve 3 Sieve 4 8 3.927 2.856 0.342 7 4.371 3.059

Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display.

Reorganize Loops

Cache hit rate

Lower

Higher

Page 36: Parallel Programming in C with the Message Passing Interfaceacc6.its.brooklyn.cuny.edu/~cisc7340/examples/Chapter5.pdf6.378 12.466 Sieve 3 Sieve 4 8 3.927 2.856 0.342 7 4.371 3.059

Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display.

Comparing 4 Versions

1.585

1.820

2.127

2.559

3.201

4.272

6.378

12.466

Sieve 3 Sieve 4

0.3422.8563.9278

0.3913.0594.3717

0.4563.2704.9646

0.5433.6525.7945

0.6794.0726.7684

0.9015.0198.8433

1.3306.60912.7212

2.54312.23724.9001

Procs Sieve 2Sieve 110-fold improvement

7-fold improvement

Page 37: Parallel Programming in C with the Message Passing Interfaceacc6.its.brooklyn.cuny.edu/~cisc7340/examples/Chapter5.pdf6.378 12.466 Sieve 3 Sieve 4 8 3.927 2.856 0.342 7 4.371 3.059

Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display.

Summary

Sieve of Eratosthenes: parallel design uses Sieve of Eratosthenes: parallel design uses domain decompositiondomain decomposition

Compared two block distributionsCompared two block distributionsChose one with simpler formulasChose one with simpler formulas

Introduced Introduced MPI_BcastMPI_Bcast Optimizations reveal importance of Optimizations reveal importance of

maximizing single-processor performancemaximizing single-processor performance