
Parallel Algorithms, Fall, 2008

Parallel Algorithm Design

Instructor: Tsung-Che Chiang ([email protected])

Department of Computer Science and Information Engineering, National Taiwan Normal University


Outline

Task/Channel Model
Foster’s Design Methodology
Boundary Value Problem
Finding the Maximum
The n-body Problem
Summary


Task/Channel Model

It represents a parallel computation as a set of tasks that may interact with each other by sending messages through channels.

A task is a program, its local memory, and a collection of I/O ports.

A channel is a message queue that connects one task’s output port with another task’s input port.


Task/Channel Model

By courtesy of M.J. Quinn, Parallel Programming in C with MPI and OpenMP, 2003


Task/Channel Model

Send vs. receive: receiving is synchronous, while sending is asynchronous.

Local memory vs. non-local memory: it is easy to distinguish between private data and non-local data accessed over channels, which is good since we should think of local accesses as being much faster.

Execution time of a parallel algorithm: the period of time during which any task is active.
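As a concrete illustration (not part of the original slides), a channel between two tasks can be expressed with MPI point-to-point messages, the library used in the course textbook. The sketch below assumes exactly two processes; the tag 0 and the single double payload are arbitrary choices.

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank;
        double value = 0.0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 3.14;
            /* "Output port": the send may return before the receiver is ready. */
            MPI_Send(&value, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* "Input port": the receive blocks until the message arrives. */
            MPI_Recv(&value, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("task 1 received %f\n", value);
        }

        MPI_Finalize();
        return 0;
    }

The blocking MPI_Recv mirrors the synchronous receive of the model, while MPI_Send may complete as soon as the message is buffered, mirroring the asynchronous send.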


Foster’s Design Methodology

By courtesy of M.J. Quinn, Parallel Programming in C with MPI and OpenMP, 2003


Partitioning

To discover as much parallelism as possible, partitioning is the process of dividing the computation and the data into pieces.

Foster’s Design Methodology


Partitioning

Decompositions:
Domain decomposition (data parallelism): remember to focus on the largest and/or most frequently accessed data.
Functional decomposition: it often yields collections of tasks that achieve concurrency through pipelining.

Whichever decomposition we choose, we call each of these pieces a primitive task.

The number of primitive tasks is an upper bound on the parallelism we can exploit.

Foster’s Design Methodology


Partitioning

Three domain decompositions of a 3-D matrix; the 3-D decomposition, which creates the most primitive tasks, is the preferred one.

Foster’s Design Methodology

By courtesy of M.J. Quinn, Parallel Programming in C with MPI and OpenMP, 2003


Partitioning

Functional decomposition of a system supporting interactive image-guided surgery, with tasks: acquire patient images, register images, track position of instruments, determine image locations, and display image.

Concurrency: while one task is converting an image from physical coordinates, a second task can be displaying the previous image, and a third task can be tracking instrument positions for the next image.

Foster’s Design Methodology


Partitioning

Checklist to evaluate the quality of a partitioning:
There are at least an order of magnitude more primitive tasks than processors. (potential)
Redundant computations and redundant data structure storage are minimized. (scalability)
Primitive tasks are roughly the same size. (workload balance)
The number of tasks is an increasing function of the problem size. (future potential)

Foster’s Design Methodology


Communication

Parallel algorithms have two kinds of communication patterns:
Local: a task needs values from a small number of other tasks.
Global: a significant number of tasks must contribute data to perform the computation, e.g., computing the sum of values held by the primitive processes. It is generally not helpful to draw channels for global communications at this stage.

Foster’s Design Methodology


Communication

Communication is the overhead of a parallel algorithm, and we want to minimize it.

Checklist:
The communication operations are balanced among the tasks.
Each task communicates with only a small number of neighbors.
Tasks can perform their communications concurrently.
Tasks can perform their computations concurrently.

Foster’s Design Methodology


Agglomeration
Foster’s Design Methodology

In the last two steps of the design process, we have a target architecture in mind and consider how to combine primitive tasks into larger tasks and map them onto physical processors to reduce the amount of parallel overhead.

Goals of agglomeration:
to lower communication overhead
to maintain the scalability of the design
to reduce software engineering costs


Agglomeration
Foster’s Design Methodology

To lower communication overhead:
Combining tasks that are connected by a channel eliminates that communication.
Combining sending and receiving tasks reduces the number of message transmissions.


Agglomeration
Foster’s Design Methodology

To maintain the scalability of the design:
Suppose we want to develop a parallel program that manipulates a 3-D matrix of size 8 × 128 × 256.
If we agglomerate the second and third dimensions, we will not be able to port the program to a parallel computer with more than 8 CPUs.


Agglomeration
Foster’s Design Methodology

To reduce software engineering costs:
If we are parallelizing a sequential program, one agglomeration may allow us to make greater use of the existing sequential code, reducing the time and expense of developing the parallel program.


Agglomeration
Foster’s Design Methodology

Checklist:
The agglomeration has increased the locality of the parallel algorithm.
Replicated computations take less time than the communications they replace.
The amount of replicated data is small enough to allow the algorithm to scale.
Agglomerated tasks have similar computational and communication costs. (workload balance)


Agglomeration
Foster’s Design Methodology

Checklist (continued):
The number of tasks is an increasing function of the problem size, and is as small as possible, yet at least as great as the number of processors in the likely target computers. (future potential)
The trade-off between the chosen agglomeration and the cost of modifications to existing sequential code is reasonable. (software engineering)


Mapping

Mapping is the process of assigning tasks to processors.

Its goal is to maximize processor utilization and minimize inter-processor communication.
Processor utilization is maximized when the computation is balanced evenly.
Inter-processor communication increases when two tasks connected by a channel are mapped to different processors.

Foster’s Design Methodology


Mapping

Mapping 8 tasks to 3 processors: the middle processor deals with twice as many tasks and inter-processor communications as the other processors.

Foster’s Design Methodology

By courtesy of M.J. Quinn, Parallel Programming in C with MPI and OpenMP, 2003


Mapping

Maximizing processor utilization and minimizing inter-processor communication are often conflicting goals. E.g., suppose there are p processors: mapping all tasks to one processor eliminates inter-processor communication, but also reduces the utilization to 1/p.

There are no known polynomial-time algorithms for mapping tasks to processors so as to minimize the execution time. Hence, we must rely on heuristics.

Foster’s Design Methodology

By courtesy of M.J. Quinn, Parallel Programming in C with MPI and OpenMP, 2003


Mapping
Foster’s Design Methodology

Decision tree for choosing a mapping strategy:
Static number of tasks
  Structured communication pattern
    Constant computation time per task: agglomerate tasks to minimize communication; create one task per processor.
    Varied computation time per task: cyclically map tasks to processors to balance the computational load.
  Unstructured communication pattern: use a static load-balancing algorithm.
Dynamic number of tasks
  Frequent communication between tasks: use a dynamic load-balancing algorithm.
  No intertask communication: use a run-time task-scheduling algorithm.


Mapping
Foster’s Design Methodology

Checklist:
Designs based on one task per processor and on multiple tasks per processor have been considered.
Both static and dynamic allocation of tasks to processors have been evaluated.
If a dynamic allocation has been chosen, the task manager is not a bottleneck to performance.
If a static allocation has been chosen, the ratio of tasks to processors is at least 10:1.


Boundary Value Problem

Introduction

By courtesy of M.J. Quinn, Parallel Programming in C with MPI and OpenMP, 2003

[Figure: a thin rod whose ends (x = 0 and x = 1) are held at 0°C; the initial temperature at position x is T0 = 100 sin(πx).]


Boundary Value Problem

Introduction

By courtesy of M.J. Quinn, Parallel Programming in C with MPI and OpenMP, 2003


Boundary Value Problem

Introduction

By courtesy of M.J. Quinn, Parallel Programming in C with MPI and OpenMP, 2003

u(i, j+1) = r·u(i-1, j) + (1 - 2r)·u(i, j) + r·u(i+1, j)

r = k/h², where h is the spacing between grid points along the rod and k is the length of one time step.
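To make the update rule concrete, here is a small sequential sketch (not from the slides); the grid size n, the number of time steps m, and the parameter r are assumed to be given, and the temperatures are kept in two rows (current and next) instead of the full (m+1) × (n+1) table.

    /* Explicit finite-difference update for the 1-D heat equation.
     * cur[0..n] holds the temperatures at the current time step,
     * with the boundary values cur[0] and cur[n] fixed at 0. */
    double *simulate(double *cur, double *next, int n, int m, double r)
    {
        for (int j = 0; j < m; j++) {            /* time steps */
            for (int i = 1; i < n; i++)          /* interior grid points */
                next[i] = r * cur[i-1] + (1.0 - 2.0 * r) * cur[i] + r * cur[i+1];
            next[0] = cur[0];                    /* boundaries stay fixed */
            next[n] = cur[n];
            double *tmp = cur; cur = next; next = tmp;   /* reuse the two rows */
        }
        return cur;   /* row holding the temperatures after m time steps */
    }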


Boundary Value Problem

Partition
There is one data item per grid point. Let’s associate one primitive task with each grid point.
This yields a 2-D domain decomposition (one dimension in space, one in time).


Boundary Value Problem

Communication

u(i, j+1) = r·u(i-1, j) + (1 - 2r)·u(i, j) + r·u(i+1, j)

Each task therefore needs the previous-time-step values of its left and right neighbors, so channels connect each interior task to its two neighbors.


Boundary Value Problem

Agglomeration
Tasks computing rod temperatures later in time depend upon the results produced by tasks computing rod temperatures earlier in time, so the tasks are agglomerated along the time dimension, leaving one task per grid point of the rod.


Boundary Value Problem

Consulting the mapping decision tree: the number of tasks is static, the communication pattern is structured, and the computation time per task is constant, so we agglomerate tasks to minimize communication and create one task per processor.


Boundary Value Problem

Mapping
A good strategy is to associate a contiguous piece of the rod with each task. This preserves the simple nearest-neighbor communication between tasks and eliminates unnecessary communication for the data points within a single task.
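For illustration (not from the slides), each agglomerated task can keep its contiguous block of grid points plus two ghost cells and exchange edge values with its neighbors once per time step. The sketch assumes MPI, a block of local_n interior points per process, and process ranks ordered along the rod.

    #include <mpi.h>

    /* One time step for a block of local_n interior points.
     * u[0] and u[local_n+1] are ghost cells holding the neighbors' edge values
     * (or the fixed 0°C boundary values on the first and last process). */
    void exchange_and_update(double *u, double *unew, int local_n, double r,
                             int rank, int size)
    {
        int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
        int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

        /* Send my leftmost value to the left neighbor; receive the right
         * neighbor's leftmost value into my right ghost cell. */
        MPI_Sendrecv(&u[1], 1, MPI_DOUBLE, left, 0,
                     &u[local_n + 1], 1, MPI_DOUBLE, right, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        /* Send my rightmost value to the right neighbor; receive the left
         * neighbor's rightmost value into my left ghost cell. */
        MPI_Sendrecv(&u[local_n], 1, MPI_DOUBLE, right, 0,
                     &u[0], 1, MPI_DOUBLE, left, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        for (int i = 1; i <= local_n; i++)
            unew[i] = r * u[i-1] + (1.0 - 2.0 * r) * u[i] + r * u[i+1];
    }

Each call performs the two nearest-neighbor exchanges and then local_n updates, which correspond to the 2λ and ⌈(n-1)/p⌉χ terms in the analysis that follows.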


Boundary Value Problem

Analysis – sequential algorithm
The grid has n+1 points along the rod and m+1 points in time (m time steps).
Let χ be the time needed to compute one update u(i, j+1) = r·u(i-1, j) + (1 - 2r)·u(i, j) + r·u(i+1, j).
At each time step the n-1 interior values are updated, so the sequential execution time is m(n-1)χ.


Boundary Value Problem

Analysis – parallel algorithm
Let p be the number of processors and λ the communication time per message.
Each processor updates at most ⌈(n-1)/p⌉ points per time step and exchanges values with at most two neighbors, so the parallel execution time is m(⌈(n-1)/p⌉χ + 2λ).


Finding the Maximum

Reduction
Given a set of n values a0, a1, …, an-1 and an associative binary operator ⊕, reduction is the process of computing a0 ⊕ a1 ⊕ … ⊕ an-1.
Examples: addition, minimum, maximum.
Since reduction requires exactly n - 1 operations, it has time complexity Θ(n) on a sequential computer.
How quickly can we perform a reduction on a parallel computer?


Finding the Maximum

Partitioning
Since the list has n values, let’s divide it into n pieces, as finely as possible.
If we associate one task per piece, we have n tasks, each with one value.
Our goal is to find the maximum of all n values.


Finding the Maximum

Communication
Since a task cannot directly access a value stored in the memory of another task, we must set up channels between the tasks.
In one communication step, each task may either send or receive one message.
At the end of the computation, we want one task (the root task) to have the global maximum.


Finding the Maximum

Communication
Let χ be the time to perform the max operation and λ the communication time for one message.
If the other n-1 tasks all send their values to a single root task, the root needs (n-1)(λ + χ) time.
If the tasks are split into two halves that reduce concurrently, each half needs (n/2 - 1)(λ + χ), plus one more (λ + χ) to combine the two partial results: (n/2 - 1)(λ + χ) + (λ + χ) = (n/2)(λ + χ).


Finding the Maximum

Communication
With four groups of n/4 tasks reducing concurrently, each group needs (n/4 - 1)(λ + χ), plus two more combining steps: (n/4 - 1)(λ + χ) + 2(λ + χ) = (n/4 + 1)(λ + χ).


Finding the Maximum

Communication
A single message-passing step is sufficient to combine two values into one. Two message-passing steps are sufficient to combine four values into one.
In general, it is possible to perform a reduction of n values in log n message-passing steps.
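For illustration (not from the slides), the log-step pattern can be written directly with MPI point-to-point messages. The sketch assumes the number of processes is a power of two and that each process starts with one integer value.

    #include <mpi.h>

    /* Binomial-tree max-reduction: after log2(size) steps,
     * rank 0 holds the maximum of all initial values.
     * Assumes size is a power of two. */
    int tree_max(int value, int rank, int size)
    {
        for (int step = 1; step < size; step *= 2) {
            if (rank % (2 * step) == step) {
                /* Send the partial maximum to a partner and become inactive. */
                MPI_Send(&value, 1, MPI_INT, rank - step, 0, MPI_COMM_WORLD);
                break;
            } else if (rank % (2 * step) == 0) {
                int incoming;
                MPI_Recv(&incoming, 1, MPI_INT, rank + step, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                if (incoming > value)
                    value = incoming;
            }
        }
        return value;   /* meaningful only on rank 0 */
    }

In step i, the tasks whose rank is an odd multiple of 2^(i-1) send their partial maximum and drop out, which is exactly the halving shown in the example that follows.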


Finding the Maximum

Communication – an example (step 1)
Sixteen tasks hold the values

  4   2  -3   5
  0   7  -6  -3
  8   1  -4   4
  2   3   6  -1

In the first step, half of the tasks send their values to partner tasks, which keep the pairwise maxima 4, 5, 0, 7, 8, 4, 6, 3.


Finding the Maximum

Communication – an example (steps 2-4)
Step 2 leaves the four partial maxima 5, 7, 8, 6; step 3 leaves 8 and 7; step 4 combines these two, and the root task holds the global maximum, 8.


Finding the Maximum

Communication – what if n is not a power of 2?
Suppose n = 2^k + r. In the first step, the r extra tasks send their values to r other tasks and then become inactive, leaving 2^k active tasks.
If n is not a power of 2, ⌊log n⌋ + 1 communication steps are required.
In general, ⌈log n⌉ communication steps are required.

By courtesy of M.J. Quinn, Parallel Programming in C with MPI and OpenMP, 2003


Finding the Maximum

Agglomeration

Consulting the mapping decision tree again: the number of tasks is static, the communication pattern is structured, and the computation time per task is constant, so we agglomerate tasks to minimize communication and create one task per processor.


Finding the Maximum

Agglomeration & mapping
We try to minimize communication by assigning n/p tasks (values) to each processor. Each processor first computes the maximum of its own values, and the partial maxima are then combined as before.

By courtesy of M.J. Quinn, Parallel Programming in C with MPI and OpenMP, 2003


Finding the Maximum

Analysis
If n integers are divided evenly among the p processors, each processor has no more than ⌈n/p⌉ integers.
Since all processors work concurrently, the time to compute the maximum of their local integers is (⌈n/p⌉ - 1)χ.
Each reduction step requires λ + χ to send/receive a message and compute the maximum.
Since there are ⌈log p⌉ communication steps, the overall time of this parallel program is (⌈n/p⌉ - 1)χ + ⌈log p⌉(λ + χ).

By courtesy of M.J. Quinn, Parallel Programming in C with MPI and OpenMP, 2003
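A compact way to express the agglomerated design (not from the slides) is to let each process reduce its own block locally and then call MPI's built-in reduction, which is typically implemented with a logarithmic number of combining steps. The array distribution and the names below are assumptions.

    #include <mpi.h>

    /* Each process finds the maximum of its local block of the array,
     * then the partial maxima are combined onto rank 0. */
    int parallel_max(const int *local, int local_n)
    {
        int local_max = local[0];
        for (int i = 1; i < local_n; i++)        /* (n/p - 1) comparisons */
            if (local[i] > local_max)
                local_max = local[i];

        int global_max;
        MPI_Reduce(&local_max, &global_max, 1, MPI_INT, MPI_MAX,
                   0, MPI_COMM_WORLD);
        return global_max;   /* valid on rank 0 */
    }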


The n-body Problem

We are simulating the motion of n particles of varying mass in two dimensions.
During each iteration, we need to compute the new position and velocity vector of each particle, given the positions of all the other particles.
In the figure, the future position of the pink particle is influenced by the gravitational forces exerted by the other two particles.


The n-body Problem

Partitioning
Let’s assume we have one task per particle.
To compute the new location of its particle, a task must know the locations of all the other particles.


The n-body Problem

Communication
A gather operation is a global communication that takes a data set distributed among a group of tasks and collects the items on a single task.


The n-body Problem

Communication – an intuitive way

By courtesy of M.J. Quinn, Parallel Programming in C with MPI and OpenMP, 2003

After n - 1 communications, each task has the positions of all the other particles.

Is there a quicker way?


The n-body Problem

Communication – a faster way

By courtesy of M.J. Quinn, Parallel Programming in C with MPI and OpenMP, 2003

Four tasks initially hold v0, v1, v2, and v3. After the first exchange step, pairs of tasks hold {v0, v1} and {v2, v3}; after the second exchange step, every task holds {v0, v1, v2, v3}.


The n-body Problem

Communication – a faster way
A logarithmic number of exchange steps is necessary and sufficient to allow every task to acquire all the values.
In the ith exchange step, the messages have length 2^(i-1).


The n-body Problem

Agglomeration & Mapping
Assume that n is a multiple of p. We associate one task per processor and agglomerate n/p particles into each task.
Now the all-gather communication requires ⌈log p⌉ communication steps.
In the ith exchange step, the messages have length 2^(i-1)·(n/p).


The n-body Problem

Analysis
In the previous examples, we assumed that it took λ units of time to send a message, since in those examples the messages always had length 1.
From now on, the time for communication is determined by:
λ (latency): the time needed to initiate a message.
β (bandwidth): the number of data items that can be sent down a channel in one unit of time.
Sending a message containing n data items requires time λ + n/β.
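As a back-of-envelope helper (purely illustrative; the parameter values in the comment are assumptions, not measurements), the model is a one-line function:

    /* Estimated time to send a message of n data items under the
     * latency/bandwidth model: lambda + n / beta.
     * Example: lambda = 1e-4 s and beta = 1e6 items/s give about
     * 1.1 ms for a 1000-item message. */
    double message_time(double lambda, double beta, double n)
    {
        return lambda + n / beta;
    }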


The n-body Problem

Analysis
The communication time for each iteration of the simulation (an all-gather of the 2n position values, 2n/p per task) is

  Σ (i = 1 to ⌈log p⌉) of (λ + 2^(i-1)·2n/(βp)) = λ⌈log p⌉ + 2n(p-1)/(βp)

Hence, the time per iteration, communication plus computation, is

  λ⌈log p⌉ + 2n(p-1)/(βp) + χ(n/p)(n-1)

where χ(n-1) is the time to update one particle, since it must consider the other n-1 particles.


The n-body Problem

How about adding data I/O?
Let’s assign I/O duties to processor 0.
We need to input/output the 2-D positions px and py and the velocities vx and vy of n particles, i.e., 4n values.
It takes λ_io + 4n/β_io to read or write all the values, where λ_io and β_io are the latency and bandwidth of the I/O channel.


The n-body Problem

After the data are read, we need to assign a suitable subsection containing 4n/p values to each processor.
A straightforward way, in which processor 0 sends to each of the other p-1 processors in turn, takes (p-1)(λ + 4n/(βp)) = (p-1)λ + (p-1)·4n/(βp) units of time.


The n-body Problem

We need another global communication: scatter. Can you see how scatter is like a gather operation in reverse?

For example, with four processors, processor 0 starts with all of v0, v1, v2, v3; it first sends {v2, v3} to processor 2; then processors 0 and 2 concurrently send v1 and v3 to processors 1 and 3.

With this tree-structured scatter, the total time is

  Σ (i = 1 to ⌈log p⌉) of (λ + 2^(i-1)·4n/(βp)) = λ⌈log p⌉ + 4n(p-1)/(βp)
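In MPI this distribution is a single collective call (a sketch, not from the slides); here each particle's state is assumed to be packed as four consecutive doubles (px, py, vx, vy) in the buffer on rank 0.

    #include <mpi.h>

    /* Rank 0 holds all[4*n]; every rank receives its own block of
     * 4*(n/p) values in mine.  Assumes n is a multiple of p. */
    void distribute_particles(double *all, double *mine, int n, int p)
    {
        MPI_Scatter(all, 4 * (n / p), MPI_DOUBLE,
                    mine, 4 * (n / p), MPI_DOUBLE,
                    0, MPI_COMM_WORLD);
    }

The matching MPI_Gather call, with the send and receive roles swapped, collects the final particle states back onto rank 0 at the end of the simulation.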


The n-body Problem

Scatter communication time:
Straightforward way: (p-1)λ + (p-1)·4n/(βp)
Smart way: ⌈log p⌉·λ + (p-1)·4n/(βp)

Note that the data-transmission time (the term with β) is identical for the two ways.
We assume that our task/channel model supports the concurrent transmission of messages from multiple tasks. This is reasonable for commercial parallel systems, but perhaps unreasonable for a network of workstations connected by a hub or any other shared medium that supports only a single message at a time.


The n-body Problem

The expected overall execution time of the parallel computation (for m iterations) is composed of:
data input and output,
scatter at the beginning and gather at the end, and
m × (all-gather + computation),

that is,

  2(λ_io + 4n/β_io) + 2(λ⌈log p⌉ + 4n(p-1)/(βp)) + m(λ⌈log p⌉ + 2n(p-1)/(βp) + χ(n/p)(n-1))


Summary

Task/channel model for parallel computation
Foster’s design methodology: partitioning, communication, agglomeration, mapping
Goals: maximize processor utilization, minimize inter-processor communication
Common operations: reduction, all-gather, gather/scatter