
Parallel Algorithms, Fall, 2008

Parallel Algorithm Design

Instructor: Tsung-Che Chiang ([email protected])

Department of Computer Science and Information Engineering, National Taiwan Normal University


Outline

Task/Channel Model
Foster’s Design Methodology
Boundary Value Problem
Finding the Maximum
The n-body Problem
Summary


Task/Channel Model

It represents a parallel computation as a set of tasks that may interact with each other by sending messages through channels.

A task is a program, its local memory, and a collection of I/O ports.

A channel is a message queue that connects one task’s output port with another task’s input port.


Task/Channel Model

By courtesy of M.J. Quinn, Parallel Programming in C with MPI and OpenMP, 2003


Task/Channel Model

Send vs. receive: receiving is synchronous, while sending is asynchronous.

Local memory vs. non-local memory: it is easy to distinguish between private data and non-local data accessed over channels, which is good since we should think of local accesses as being much faster.

Execution time of a parallel algorithm: the period of time during which any task is active.
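As a concrete illustration (not part of the original slides), a channel between two tasks can be expressed with MPI point-to-point messages, the library used in the course textbook. The sketch below assumes exactly two processes; the tag 0 and the single double payload are arbitrary choices.

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank;
        double value = 0.0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 3.14;
            /* "Output port": the send may return before the receiver is ready. */
            MPI_Send(&value, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* "Input port": the receive blocks until the message arrives. */
            MPI_Recv(&value, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("task 1 received %f\n", value);
        }

        MPI_Finalize();
        return 0;
    }

The blocking MPI_Recv mirrors the synchronous receive of the model, while MPI_Send may complete as soon as the message is buffered, mirroring the asynchronous send.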


Foster’s Design Methodology

By courtesy of M.J. Quinn, Parallel Programming in C with MPI and OpenMP, 2003


Partitioning

To discover as much parallelism as possible, partitioning is the process of dividing the computation and the data into pieces.

Foster’s Design Methodology


Partitioning

Decompositions:
Domain decomposition (data parallelism): remember to focus on the largest and/or most frequently accessed data.
Functional decomposition: it often yields collections of tasks that achieve concurrency through pipelining.

Whichever decomposition we choose, we call each of these pieces a primitive task.

The number of primitive tasks is an upper bound on the parallelism we can exploit.

Foster’s Design Methodology


Partitioning

Three domain decompositions of a 3-D matrix; the 3-D decomposition, which creates the most primitive tasks, is the preferred one.

Foster’s Design Methodology

By courtesy of M.J. Quinn, Parallel Programming in C with MPI and OpenMP, 2003


Partitioning

Functional decomposition of a system supporting interactive image-guided surgery, with tasks: acquire patient images, register images, track position of instruments, determine image locations, and display image.

Concurrency: while one task is converting an image from physical coordinates, a second task can be displaying the previous image, and a third task can be tracking instrument positions for the next image.

Foster’s Design Methodology


Partitioning

Checklist to evaluate the quality of a partitioning:
There are at least an order of magnitude more primitive tasks than processors. (potential)
Redundant computations and redundant data structure storage are minimized. (scalability)
Primitive tasks are roughly the same size. (workload balance)
The number of tasks is an increasing function of the problem size. (future potential)

Foster’s Design Methodology


Communication

Parallel algorithms have two kinds of communication patterns:
Local: a task needs values from a small number of other tasks.
Global: a significant number of tasks must contribute data to perform the computation, e.g., computing the sum of values held by the primitive processes. It is generally not helpful to draw channels for global communications at this stage.

Foster’s Design Methodology


Communication

Communication is the overhead of a parallel algorithm, and we want to minimize it.

Checklist:
The communication operations are balanced among the tasks.
Each task communicates with only a small number of neighbors.
Tasks can perform their communications concurrently.
Tasks can perform their computations concurrently.

Foster’s Design Methodology


Agglomeration
Foster’s Design Methodology

In the last two steps of the design process, we have a target architecture in mind and consider how to combine primitive tasks into larger tasks and map them onto physical processors to reduce the amount of parallel overhead.

Goals of agglomeration:
to lower communication overhead
to maintain the scalability of the design
to reduce software engineering costs


Agglomeration
Foster’s Design Methodology

To lower communication overhead:
Combining tasks that are connected by a channel eliminates that communication.
Combining sending and receiving tasks reduces the number of message transmissions.


Agglomeration
Foster’s Design Methodology

To maintain the scalability of the design:
Suppose we want to develop a parallel program that manipulates a 3-D matrix of size 8 × 128 × 256.
If we agglomerate the second and third dimensions, we will not be able to port the program to a parallel computer with more than 8 CPUs.


Agglomeration
Foster’s Design Methodology

To reduce software engineering costs:
If we are parallelizing a sequential program, one agglomeration may allow us to make greater use of the existing sequential code, reducing the time and expense of developing the parallel program.


Agglomeration
Foster’s Design Methodology

Checklist:
The agglomeration has increased the locality of the parallel algorithm.
Replicated computations take less time than the communications they replace.
The amount of replicated data is small enough to allow the algorithm to scale.
Agglomerated tasks have similar computational and communication costs. (workload balance)


Agglomeration
Foster’s Design Methodology

Checklist (continued):
The number of tasks is an increasing function of the problem size, and is as small as possible, yet at least as great as the number of processors in the likely target computers. (future potential)
The trade-off between the chosen agglomeration and the cost of modifications to existing sequential code is reasonable. (software engineering)


Mapping

Mapping is the process of assigning tasks to processors.

Its goal is to maximize processor utilization and minimize inter-processor communication.
Processor utilization is maximized when the computation is balanced evenly.
Inter-processor communication increases when two tasks connected by a channel are mapped to different processors.

Foster’s Design Methodology


Mapping

Mapping 8 tasks to 3 processors: the middle processor deals with twice as many tasks and inter-processor communications as the other processors.

Foster’s Design Methodology

By courtesy of M.J. Quinn, Parallel Programming in C with MPI and OpenMP, 2003


Mapping

Maximizing processor utilization and minimizing inter-processor communication are often conflicting goals. E.g., suppose there are p processors: mapping all tasks to one processor eliminates inter-processor communication, but also reduces the utilization to 1/p.

There are no known polynomial-time algorithms for mapping tasks to processors so as to minimize the execution time. Hence, we must rely on heuristics.

Foster’s Design Methodology

By courtesy of M.J. Quinn, Parallel Programming in C with MPI and OpenMP, 2003


Mapping
Foster’s Design Methodology

Decision tree for choosing a mapping strategy:
Static number of tasks
  Structured communication pattern
    Constant computation time per task: agglomerate tasks to minimize communication; create one task per processor.
    Varied computation time per task: cyclically map tasks to processors to balance the computational load.
  Unstructured communication pattern: use a static load-balancing algorithm.
Dynamic number of tasks
  Frequent communication between tasks: use a dynamic load-balancing algorithm.
  No intertask communication: use a run-time task-scheduling algorithm.


Mapping
Foster’s Design Methodology

Checklist:
Designs based on one task per processor and on multiple tasks per processor have been considered.
Both static and dynamic allocation of tasks to processors have been evaluated.
If a dynamic allocation has been chosen, the task manager is not a bottleneck to performance.
If a static allocation has been chosen, the ratio of tasks to processors is at least 10:1.


Boundary Value Problem

Introduction

By courtesy of M.J. Quinn, Parallel Programming in C with MPI and OpenMP, 2003

[Figure: a thin rod whose ends (x = 0 and x = 1) are held at 0°C; the initial temperature at position x is T0 = 100 sin(πx).]


Boundary Value Problem

Introduction

By courtesy of M.J. Quinn, Parallel Programming in C with MPI and OpenMP, 2003


Boundary Value Problem

Introduction

By courtesy of M.J. Quinn, Parallel Programming in C with MPI and OpenMP, 2003

u(i, j+1) = r·u(i-1, j) + (1 - 2r)·u(i, j) + r·u(i+1, j)

r = k/h², where h is the spacing between grid points along the rod and k is the length of one time step.
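To make the update rule concrete, here is a small sequential sketch (not from the slides); the grid size n, the number of time steps m, and the parameter r are assumed to be given, and the temperatures are kept in two rows (current and next) instead of the full (m+1) × (n+1) table.

    /* Explicit finite-difference update for the 1-D heat equation.
     * cur[0..n] holds the temperatures at the current time step,
     * with the boundary values cur[0] and cur[n] fixed at 0. */
    double *simulate(double *cur, double *next, int n, int m, double r)
    {
        for (int j = 0; j < m; j++) {            /* time steps */
            for (int i = 1; i < n; i++)          /* interior grid points */
                next[i] = r * cur[i-1] + (1.0 - 2.0 * r) * cur[i] + r * cur[i+1];
            next[0] = cur[0];                    /* boundaries stay fixed */
            next[n] = cur[n];
            double *tmp = cur; cur = next; next = tmp;   /* reuse the two rows */
        }
        return cur;   /* row holding the temperatures after m time steps */
    }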


Boundary Value Problem

Partition
There is one data item per grid point. Let’s associate one primitive task with each grid point.
This yields a 2-D domain decomposition (one dimension in space, one in time).


Boundary Value Problem

Communication

u(i, j+1) = r·u(i-1, j) + (1 - 2r)·u(i, j) + r·u(i+1, j)

Each task therefore needs the previous-time-step values of its left and right neighbors, so channels connect each interior task to its two neighbors.


Boundary Value Problem

Agglomeration
Tasks computing rod temperatures later in time depend upon the results produced by tasks computing rod temperatures earlier in time, so the tasks are agglomerated along the time dimension, leaving one task per grid point of the rod.


Boundary Value Problem

Consulting the mapping decision tree: the number of tasks is static, the communication pattern is structured, and the computation time per task is constant, so we agglomerate tasks to minimize communication and create one task per processor.


Boundary Value Problem

Mapping
A good strategy is to associate a contiguous piece of the rod with each task. This preserves the simple nearest-neighbor communication between tasks and eliminates unnecessary communication for the data points within a single task.
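For illustration (not from the slides), each agglomerated task can keep its contiguous block of grid points plus two ghost cells and exchange edge values with its neighbors once per time step. The sketch assumes MPI, a block of local_n interior points per process, and process ranks ordered along the rod.

    #include <mpi.h>

    /* One time step for a block of local_n interior points.
     * u[0] and u[local_n+1] are ghost cells holding the neighbors' edge values
     * (or the fixed 0°C boundary values on the first and last process). */
    void exchange_and_update(double *u, double *unew, int local_n, double r,
                             int rank, int size)
    {
        int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
        int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

        /* Send my leftmost value to the left neighbor; receive the right
         * neighbor's leftmost value into my right ghost cell. */
        MPI_Sendrecv(&u[1], 1, MPI_DOUBLE, left, 0,
                     &u[local_n + 1], 1, MPI_DOUBLE, right, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        /* Send my rightmost value to the right neighbor; receive the left
         * neighbor's rightmost value into my left ghost cell. */
        MPI_Sendrecv(&u[local_n], 1, MPI_DOUBLE, right, 0,
                     &u[0], 1, MPI_DOUBLE, left, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        for (int i = 1; i <= local_n; i++)
            unew[i] = r * u[i-1] + (1.0 - 2.0 * r) * u[i] + r * u[i+1];
    }

Each call performs the two nearest-neighbor exchanges and then local_n updates, which correspond to the 2λ and ⌈(n-1)/p⌉χ terms in the analysis that follows.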


Boundary Value Problem

Analysis – sequential algorithm
The grid has n+1 points along the rod and m+1 points in time (m time steps).
Let χ be the time needed to compute one update u(i, j+1) = r·u(i-1, j) + (1 - 2r)·u(i, j) + r·u(i+1, j).
At each time step the n-1 interior values are updated, so the sequential execution time is m(n-1)χ.


Boundary Value Problem

Analysis – parallel algorithm
Let p be the number of processors and λ the communication time per message.
Each processor updates at most ⌈(n-1)/p⌉ points per time step and exchanges values with at most two neighbors, so the parallel execution time is m(⌈(n-1)/p⌉χ + 2λ).


Finding the Maximum

Reduction
Given a set of n values a0, a1, …, an-1 and an associative binary operator ⊕, reduction is the process of computing a0 ⊕ a1 ⊕ … ⊕ an-1.
Examples: addition, minimum, maximum.
Since reduction requires exactly n - 1 operations, it has time complexity Θ(n) on a sequential computer.
How quickly can we perform a reduction on a parallel computer?


Finding the Maximum

Partitioning
Since the list has n values, let’s divide it into n pieces, as finely as possible.
If we associate one task per piece, we have n tasks, each with one value.
Our goal is to find the maximum of all n values.


Finding the Maximum

Communication
Since a task cannot directly access a value stored in the memory of another task, we must set up channels between the tasks.
In one communication step, each task may either send or receive one message.
At the end of the computation, we want one task (the root task) to have the global maximum.


Finding the Maximum

Communication
Let χ be the time to perform the max operation and λ the communication time for one message.
If the other n-1 tasks all send their values to a single root task, the root needs (n-1)(λ + χ) time.
If the tasks are split into two halves that reduce concurrently, each half needs (n/2 - 1)(λ + χ), plus one more (λ + χ) to combine the two partial results: (n/2 - 1)(λ + χ) + (λ + χ) = (n/2)(λ + χ).


Finding the Maximum

Communication
With four groups of n/4 tasks reducing concurrently, each group needs (n/4 - 1)(λ + χ), plus two more combining steps: (n/4 - 1)(λ + χ) + 2(λ + χ) = (n/4 + 1)(λ + χ).


Finding the Maximum

Communication
A single message-passing step is sufficient to combine two values into one. Two message-passing steps are sufficient to combine four values into one.
In general, it is possible to perform a reduction of n values in log n message-passing steps.
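For illustration (not from the slides), the log-step pattern can be written directly with MPI point-to-point messages. The sketch assumes the number of processes is a power of two and that each process starts with one integer value.

    #include <mpi.h>

    /* Binomial-tree max-reduction: after log2(size) steps,
     * rank 0 holds the maximum of all initial values.
     * Assumes size is a power of two. */
    int tree_max(int value, int rank, int size)
    {
        for (int step = 1; step < size; step *= 2) {
            if (rank % (2 * step) == step) {
                /* Send the partial maximum to a partner and become inactive. */
                MPI_Send(&value, 1, MPI_INT, rank - step, 0, MPI_COMM_WORLD);
                break;
            } else if (rank % (2 * step) == 0) {
                int incoming;
                MPI_Recv(&incoming, 1, MPI_INT, rank + step, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                if (incoming > value)
                    value = incoming;
            }
        }
        return value;   /* meaningful only on rank 0 */
    }

In step i, the tasks whose rank is an odd multiple of 2^(i-1) send their partial maximum and drop out, which is exactly the halving shown in the example that follows.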


Finding the Maximum

Communication – an example (step 1)
Sixteen tasks hold the values

  4   2  -3   5
  0   7  -6  -3
  8   1  -4   4
  2   3   6  -1

In the first step, half of the tasks send their values to partner tasks, which keep the pairwise maxima 4, 5, 0, 7, 8, 4, 6, 3.


Finding the Maximum

Communication – an example (steps 2-4)
Step 2 leaves the four partial maxima 5, 7, 8, 6; step 3 leaves 8 and 7; step 4 combines these two, and the root task holds the global maximum, 8.


Finding the Maximum

Communication – what if n is not a power of 2?
Suppose n = 2^k + r. In the first step, the r extra tasks send their values to r other tasks and then become inactive, leaving 2^k active tasks.
If n is not a power of 2, ⌊log n⌋ + 1 communication steps are required.
In general, ⌈log n⌉ communication steps are required.

By courtesy of M.J. Quinn, Parallel Programming in C with MPI and OpenMP, 2003


Finding the Maximum

Agglomeration

Consulting the mapping decision tree again: the number of tasks is static, the communication pattern is structured, and the computation time per task is constant, so we agglomerate tasks to minimize communication and create one task per processor.


Finding the Maximum

Agglomeration & mapping
We try to minimize communication by assigning n/p tasks (values) to each processor. Each processor first computes the maximum of its own values, and the partial maxima are then combined as before.

By courtesy of M.J. Quinn, Parallel Programming in C with MPI and OpenMP, 2003


Finding the Maximum

Analysis
If n integers are divided evenly among the p processors, each processor has no more than ⌈n/p⌉ integers.
Since all processors work concurrently, the time to compute the maximum of their local integers is (⌈n/p⌉ - 1)χ.
Each reduction step requires λ + χ to send/receive a message and compute the maximum.
Since there are ⌈log p⌉ communication steps, the overall time of this parallel program is (⌈n/p⌉ - 1)χ + ⌈log p⌉(λ + χ).

By courtesy of M.J. Quinn, Parallel Programming in C with MPI and OpenMP, 2003
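A compact way to express the agglomerated design (not from the slides) is to let each process reduce its own block locally and then call MPI's built-in reduction, which is typically implemented with a logarithmic number of combining steps. The array distribution and the names below are assumptions.

    #include <mpi.h>

    /* Each process finds the maximum of its local block of the array,
     * then the partial maxima are combined onto rank 0. */
    int parallel_max(const int *local, int local_n)
    {
        int local_max = local[0];
        for (int i = 1; i < local_n; i++)        /* (n/p - 1) comparisons */
            if (local[i] > local_max)
                local_max = local[i];

        int global_max;
        MPI_Reduce(&local_max, &global_max, 1, MPI_INT, MPI_MAX,
                   0, MPI_COMM_WORLD);
        return global_max;   /* valid on rank 0 */
    }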


The n-body Problem

We are simulating the motion of n particles of varying mass in two dimensions.
During each iteration, we need to compute the new position and velocity vector of each particle, given the positions of all the other particles.
In the figure, the future position of the pink particle is influenced by the gravitational forces exerted by the other two particles.


The n-body Problem

Partitioning
Let’s assume we have one task per particle.
To compute the new location of its particle, a task must know the locations of all the other particles.


The n-body Problem

Communication
A gather operation is a global communication that takes a data set distributed among a group of tasks and collects the items on a single task.


The n-body Problem

Communication – an intuitive way

By courtesy of M.J. Quinn, Parallel Programming in C with MPI and OpenMP, 2003

After n - 1 communications, each task has the positions of all the other particles.

Is there a quicker way?


The n-body Problem

Communication – a faster way

By courtesy of M.J. Quinn, Parallel Programming in C with MPI and OpenMP, 2003

Four tasks initially hold v0, v1, v2, and v3. After the first exchange step, pairs of tasks hold {v0, v1} and {v2, v3}; after the second exchange step, every task holds {v0, v1, v2, v3}.


The n-body Problem

Communication – a faster way
A logarithmic number of exchange steps is necessary and sufficient to allow every task to acquire all the values.
In the ith exchange step, the messages have length 2^(i-1).


The n-body Problem

Agglomeration & Mapping
Assume that n is a multiple of p. We associate one task per processor and agglomerate n/p particles into each task.
Now the all-gather communication requires ⌈log p⌉ communication steps.
In the ith exchange step, the messages have length 2^(i-1)·(n/p).


The n-body Problem

Analysis
In the previous examples, we assumed that it took λ units of time to send a message, since in those examples the messages always had length 1.
From now on, the time for communication is determined by:
λ (latency): the time needed to initiate a message.
β (bandwidth): the number of data items that can be sent down a channel in one unit of time.
Sending a message containing n data items requires time λ + n/β.
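As a back-of-envelope helper (purely illustrative; the parameter values in the comment are assumptions, not measurements), the model is a one-line function:

    /* Estimated time to send a message of n data items under the
     * latency/bandwidth model: lambda + n / beta.
     * Example: lambda = 1e-4 s and beta = 1e6 items/s give about
     * 1.1 ms for a 1000-item message. */
    double message_time(double lambda, double beta, double n)
    {
        return lambda + n / beta;
    }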


The n-body Problem

Analysis
The communication time for each iteration of the simulation (an all-gather of the 2n position values, 2n/p per task) is

  Σ (i = 1 to ⌈log p⌉) of (λ + 2^(i-1)·2n/(βp)) = λ⌈log p⌉ + 2n(p-1)/(βp)

Hence, the time per iteration, communication plus computation, is

  λ⌈log p⌉ + 2n(p-1)/(βp) + χ(n/p)(n-1)

where χ(n-1) is the time to update one particle, since it must consider the other n-1 particles.


The n-body Problem

How about adding data I/O?
Let’s assign I/O duties to processor 0.
We need to input/output the 2-D positions px and py and the velocities vx and vy of n particles, i.e., 4n values.
It takes λ_io + 4n/β_io to read or write all the values, where λ_io and β_io are the latency and bandwidth of the I/O channel.


The n-body Problem

After the data are read, we need to assign a suitable subsection containing 4n/p values to each processor.
A straightforward way, in which processor 0 sends to each of the other p-1 processors in turn, takes (p-1)(λ + 4n/(βp)) = (p-1)λ + (p-1)·4n/(βp) units of time.


The n-body Problem

We need another global communication: scatter. Can you see how scatter is like a gather operation in reverse?

For example, with four processors, processor 0 starts with all of v0, v1, v2, v3; it first sends {v2, v3} to processor 2; then processors 0 and 2 concurrently send v1 and v3 to processors 1 and 3.

With this tree-structured scatter, the total time is

  Σ (i = 1 to ⌈log p⌉) of (λ + 2^(i-1)·4n/(βp)) = λ⌈log p⌉ + 4n(p-1)/(βp)
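In MPI this distribution is a single collective call (a sketch, not from the slides); here each particle's state is assumed to be packed as four consecutive doubles (px, py, vx, vy) in the buffer on rank 0.

    #include <mpi.h>

    /* Rank 0 holds all[4*n]; every rank receives its own block of
     * 4*(n/p) values in mine.  Assumes n is a multiple of p. */
    void distribute_particles(double *all, double *mine, int n, int p)
    {
        MPI_Scatter(all, 4 * (n / p), MPI_DOUBLE,
                    mine, 4 * (n / p), MPI_DOUBLE,
                    0, MPI_COMM_WORLD);
    }

The matching MPI_Gather call, with the send and receive roles swapped, collects the final particle states back onto rank 0 at the end of the simulation.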


The n-body Problem

Scatter communication time:
Straightforward way: (p-1)λ + (p-1)·4n/(βp)
Smart way: ⌈log p⌉·λ + (p-1)·4n/(βp)

Note that the data-transmission time (the term with β) is identical for the two ways.
We assume that our task/channel model supports the concurrent transmission of messages from multiple tasks. This is reasonable for commercial parallel systems, but perhaps unreasonable for a network of workstations connected by a hub or any other shared medium that supports only a single message at a time.


The n-body Problem

The expected overall execution time of the parallel computation (for m iterations) is composed of:
data input and output,
scatter at the beginning and gather at the end, and
m × (all-gather + computation),

that is,

  2(λ_io + 4n/β_io) + 2(λ⌈log p⌉ + 4n(p-1)/(βp)) + m(λ⌈log p⌉ + 2n(p-1)/(βp) + χ(n/p)(n-1))


Summary

Task/channel model for parallel computation
Foster’s design methodology: partitioning, communication, agglomeration, mapping
Goals: maximize processor utilization, minimize inter-processor communication
Common operations: reduction, all-gather, gather/scatter