Parallel Algorithms, Fall 2008
Parallel Algorithm Design
Instructor: Tsung-Che Chiang ([email protected])
Department of Computer Science and Information Engineering, National Taiwan Normal University
Outline
- Task/Channel Model
- Foster's Design Methodology
- Boundary Value Problem
- Finding the Maximum
- The n-body Problem
- Summary
Task/Channel Model
The task/channel model represents a parallel computation as a set of tasks that may interact with each other by sending messages through channels.
- A task is a program, its local memory, and a collection of I/O ports.
- A channel is a message queue that connects one task's output port with another task's input port.
[Figure: a task/channel graph, with tasks as nodes and channels as directed edges. By courtesy of M.J. Quinn, Parallel Programming in C with MPI and OpenMP, 2003]
Task/Channel Model
- Send vs. receive: sending is asynchronous, while receiving is synchronous (a task blocks until the value it needs has arrived).
- Local memory vs. non-local memory: it is easy to distinguish between private data and non-local data accessed over channels, which is good since we should think of local accesses as being much faster.
- Execution time of a parallel algorithm: the period of time during which any task is active.
Foster’s Design Methodology
[Figure: the four stages of the methodology: partitioning, communication, agglomeration, and mapping. By courtesy of M.J. Quinn, Parallel Programming in C with MPI and OpenMP, 2003]
Partitioning
Partitioning is the process of dividing the computation and the data into pieces; its purpose is to discover as much parallelism as possible.
Partitioning
Decompositions:
- Domain decomposition (data parallelism): remember to focus on the largest and/or most frequently accessed data.
- Functional decomposition: it often yields collections of tasks that achieve concurrency through pipelining.
Whichever decomposition we choose, we call each of the resulting pieces a primitive task. The number of primitive tasks is an upper bound on the parallelism we can exploit.
Partitioning
[Figure: three domain decompositions of a 3-D matrix (1-D, 2-D, and 3-D). The 3-D decomposition is preferred, since it yields the largest number of primitive tasks. By courtesy of M.J. Quinn, Parallel Programming in C with MPI and OpenMP, 2003]
Partitioning
Functional decomposition of a system supporting interactive image-guided surgery. The tasks:
- Acquire patient images
- Register images
- Track position of instruments
- Determine image locations
- Display image
Concurrency: while one task is converting an image from physical coordinates, a second task can be displaying the previous image, and a third task can be tracking instrument positions for the next image.
Partitioning
Checklist to evaluate the quality of a partitioning:
- There are at least an order of magnitude more primitive tasks than processors. (potential)
- Redundant computations and redundant data structure storage are minimized. (scalability)
- Primitive tasks are roughly the same size. (workload balance)
- The number of tasks is an increasing function of the problem size. (future potential)
Communication
Parallel algorithms have two kinds of communication patterns:
- Local: a task needs values from a small number of other tasks.
- Global: a significant number of tasks must contribute data to perform the computation, e.g., computing the sum of the values held by the primitive tasks. It is generally not helpful to draw channels for global communications at this stage.
Communication
Communication is the overhead of a parallel algorithm, and we want to minimize it.
Checklist:
- The communication operations are balanced among the tasks.
- Each task communicates with only a small number of neighbors.
- Tasks can perform their communications concurrently.
- Tasks can perform their computations concurrently.
Agglomeration
In the last two steps of the design process, we have a target architecture in mind and consider how to combine primitive tasks into larger tasks and map them onto physical processors to reduce parallel overhead.
Goals of agglomeration:
- to lower communication overhead
- to maintain scalability
- to reduce software engineering costs
Agglomeration
To lower communication overhead:
- Combining tasks that are connected by a channel eliminates that communication.
- Combining sending and receiving tasks reduces the number of message transmissions.
Agglomeration
To maintain scalability:
- Suppose we want to develop a parallel program that manipulates a 3-D matrix of size 8 × 128 × 256.
- If we agglomerate the second and third dimensions, only 8 tasks remain, so we will not be able to port the program to a parallel computer with more than 8 CPUs.
Agglomeration
To reduce software engineering costs:
- If we are parallelizing a sequential program, one agglomeration may allow us to make greater use of the existing sequential code, reducing the time and expense of developing the parallel program.
Agglomeration
Checklist:
- The agglomeration has increased the locality of the parallel algorithm.
- Replicated computations take less time than the communications they replace.
- The amount of replicated data is small enough to allow the algorithm to scale.
- Agglomerated tasks have similar computational and communication costs. (workload balance)
Agglomeration
Checklist (continued):
- The number of tasks is an increasing function of the problem size, and is as small as possible yet at least as great as the number of processors in the likely target computers. (future potential)
- The trade-off between the chosen agglomeration and the cost of modifications to existing sequential code is reasonable. (software engineering)
Mapping
Mapping is the process of assigning tasks to processors.
Its goal is to maximize processor utilization and minimize inter-processor communication.
- Processor utilization is maximized when the computation is balanced evenly.
- Inter-processor communication increases when two tasks connected by a channel are mapped to different processors.
Mapping
Mapping 8 tasks to 3 processors: the middle processor deals with twice as many tasks and inter-processor communications as the other processors.
[Figure by courtesy of M.J. Quinn, Parallel Programming in C with MPI and OpenMP, 2003]
Mapping
Maximizing processor utilization and minimizing inter-processor communication are often conflicting goals.
e.g., suppose there are p processors: mapping all tasks to one processor eliminates inter-processor communication entirely, but also reduces the utilization to 1/p.
The general problem of mapping tasks to processors so as to minimize execution time is NP-hard; there are no known polynomial-time algorithms for it. Hence, we must rely on heuristics.
Mapping
The mapping decision tree:
- Static number of tasks:
  - Structured communication pattern:
    - Constant computation time per task → agglomerate tasks to minimize communication; create one task per processor.
    - Varied computation time per task → cyclically map tasks to processors to balance the computational load.
  - Unstructured communication pattern → use a static load-balancing algorithm.
- Dynamic number of tasks:
  - Frequent communication between tasks → use a dynamic load-balancing algorithm.
  - No intertask communication → use a run-time task-scheduling algorithm.
(A dynamic strategy may be organized in a centralized or a distributed fashion.)
Mapping
Checklist:
- Designs based on one task per processor and on multiple tasks per processor have been considered.
- Both static and dynamic allocation of tasks to processors have been evaluated.
- If dynamic allocation has been chosen, the task manager is not a bottleneck to performance.
- If static allocation has been chosen, the ratio of tasks to processors is at least 10:1.
Boundary Value Problem
Introduction
By courtesy of M.J. Quinn, Parallel Programming in C with MPI and OpenMP, 2003
0C 0C
x
T0 = 100sin(x)
1
Boundary Value Problem
Introduction
[Figure: discretizing the problem yields a 2-D grid, with position along one axis and time along the other; u(i, j) denotes the temperature at position i and time j. By courtesy of M.J. Quinn, Parallel Programming in C with MPI and OpenMP, 2003]
Boundary Value Problem
Introduction
[Figure: the finite difference stencil, in which each point depends on three points at the previous time step. By courtesy of M.J. Quinn, Parallel Programming in C with MPI and OpenMP, 2003]
The finite difference update rule is
  u(i, j+1) = r·u(i−1, j) + (1 − 2r)·u(i, j) + r·u(i+1, j), where r = k/h²
(k: the length of a time step; h: the distance between adjacent grid points).
Boundary Value Problem
Partitioning
- There is one data item per grid point, so let's associate one primitive task with each grid point.
- This yields a 2-D domain decomposition.
Boundary Value Problem
Communication
u(i, j+1) = r·u(i−1, j) + (1 − 2r)·u(i, j) + r·u(i+1, j)
Each task needs the previous-time values of its own grid point and of its two neighbors, so channels connect each task with the tasks of the adjacent grid points.
Boundary Value Problem
Agglomeration
- Tasks computing rod temperatures later in time depend upon the results produced by tasks computing rod temperatures earlier in time, so tasks belonging to different time steps can never be active concurrently.
- We can therefore agglomerate all the tasks associated with the same spatial point (one per time step) into a single task, leaving a 1-D row of tasks.
Boundary Value Problem
Consulting the mapping decision tree: the number of tasks is static, the communication pattern is structured, and the computation time per task is constant, so we should agglomerate tasks to minimize communication and create one task per processor.
Boundary Value Problem
Mapping
A good strategy is to associate a contiguous piece of the rod with each task. This preserves the simple nearest-neighbor communication between tasks and eliminates communication entirely for the data points in the interior of a single task's piece.
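A minimal sketch of this mapping in C with MPI (the language of the course text): each process owns a contiguous block of local_n points, with ghost cells u[0] and u[local_n+1] holding the neighbors' boundary values. The function name time_step and the variable names are illustrative, not from the slides.

```c
#include <mpi.h>

/* One time step for a process owning a contiguous block of the rod.
   u[1..local_n] are this process's points; u[0] and u[local_n+1] are
   ghost cells. 'left' and 'right' are the neighbor ranks, or
   MPI_PROC_NULL at the ends of the rod (there the ghost cell simply
   keeps the fixed 0-degree boundary value). */
void time_step(double *u, double *u_new, int local_n, double r,
               int left, int right)
{
    /* Exchange boundary values with the two neighbors. */
    MPI_Sendrecv(&u[1], 1, MPI_DOUBLE, left, 0,
                 &u[local_n + 1], 1, MPI_DOUBLE, right, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Sendrecv(&u[local_n], 1, MPI_DOUBLE, right, 0,
                 &u[0], 1, MPI_DOUBLE, left, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* u(i,j+1) = r*u(i-1,j) + (1-2r)*u(i,j) + r*u(i+1,j) */
    for (int i = 1; i <= local_n; i++)
        u_new[i] = r * u[i - 1] + (1.0 - 2.0 * r) * u[i] + r * u[i + 1];
}
```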
Boundary Value Problem
Analysis – sequential algorithm
The grid has n+1 points in space and m+1 points in time. Let χ be the time needed to compute one update u(i, j+1) = r·u(i−1, j) + (1 − 2r)·u(i, j) + r·u(i+1, j). Each of the m time steps updates the n−1 interior points, so the sequential execution time is m(n−1)χ.
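For reference, a sketch of the sequential computation (illustrative names; the two endpoints stay at the fixed 0°C boundary). The nested loops perform exactly m(n−1) updates, each costing χ:

```c
/* Sequential solver sketch: u[0..n] holds the initial temperatures;
   u[0] and u[n] are the fixed boundary points.
   Returns a pointer to the buffer holding the final temperatures. */
double *solve(double *u, double *u_next, int n, int m, double r)
{
    for (int j = 0; j < m; j++) {          /* m time steps */
        for (int i = 1; i < n; i++)        /* n-1 interior points */
            u_next[i] = r * u[i - 1] + (1.0 - 2.0 * r) * u[i]
                        + r * u[i + 1];
        u_next[0] = u[0];                  /* keep boundary values */
        u_next[n] = u[n];
        double *tmp = u; u = u_next; u_next = tmp;  /* swap buffers */
    }
    return u;   /* after each swap, u points at the newest values */
}
```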
Boundary Value Problem
Analysis – parallel algorithm
With p processors, each task updates at most ⌈(n−1)/p⌉ points per time step, and in each step it also sends and receives one value on each of its two neighbor channels. With λ denoting the time to communicate one message, the parallel execution time is m(χ⌈(n−1)/p⌉ + 2λ).
Finding the Maximum
Reduction
- Given a set of n values a0, a1, …, a(n−1) and an associative binary operator ⊕, reduction is the process of computing a0 ⊕ a1 ⊕ … ⊕ a(n−1).
- Examples: addition, minimum, maximum.
- Since a sequential reduction requires exactly n − 1 operations, it has time complexity Θ(n).
- How quickly can we perform a reduction on a parallel computer?
Finding the Maximum
Partitioning
- Since the list has n values, let's divide it into n pieces, as finely as possible.
- Associating one task with each piece gives n tasks, each holding one value.
- Our goal is to find the maximum of all n values.
Finding the Maximum
Communication
- Since a task cannot directly access a value stored in the memory of another task, we must set up channels between the tasks.
- In one communication step, each task may either send or receive one message.
- At the end of the computation, we want one task (the root task) to have the global maximum.
Finding the Maximum
Communication
(χ: time to perform one max operation; λ: time to communicate one value)
- One root task receiving from the other n − 1 tasks, one value at a time: (n − 1)(λ + χ).
- Two groups of n/2 tasks reducing concurrently, followed by one final combining step: (n/2 − 1)(λ + χ) + (λ + χ) = (n/2)(λ + χ).
Finding the Maximum
Communication
- Four groups of n/4 tasks reducing concurrently, followed by two more combining steps: (n/4 − 1)(λ + χ) + 2(λ + χ) = (n/4 + 1)(λ + χ).
Finding the Maximum
Communication
- A single message-passing step is sufficient to combine two values into one.
- Two message-passing steps are sufficient to combine four values into one.
- In general, a reduction of n values can be performed in ⌈log n⌉ message-passing steps (see the sketch below).
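The log-step pattern can be written down directly in C with MPI. A minimal sketch, with the illustrative function name tree_max: tasks that have sent their value become inactive, as in the slides, and task 0 ends with the global maximum.

```c
#include <mpi.h>

/* Binomial-tree max reduction over all ranks; returns the global
   maximum on rank 0. Works for any number of ranks, not only
   powers of two. */
int tree_max(int local, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    for (int step = 1; step < size; step *= 2) {
        if (rank % (2 * step) == step) {
            /* Send my partial result to my partner, then go inactive. */
            MPI_Send(&local, 1, MPI_INT, rank - step, 0, comm);
            break;
        }
        if (rank % (2 * step) == 0 && rank + step < size) {
            int other;
            MPI_Recv(&other, 1, MPI_INT, rank + step, 0, comm,
                     MPI_STATUS_IGNORE);
            if (other > local)
                local = other;   /* combine with max */
        }
    }
    return local;
}
```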
Finding the Maximum
Communication – An example
[Figure: reducing 16 values (4, 2, −3, 5, 0, 7, −6, −3, 8, 1, −4, 4, 2, 3, 6, −1) with a binomial tree.]
Step 1: pairwise maxima leave eight values: 4, 5, 0, 7, 8, 4, 6, 3.
Finding the Maximum
Steps 2–4: the eight values reduce to four (5, 7, 8, 6), then to two (8, 7), and finally to one. Maximum = 8. Sixteen values are thus reduced in log₂ 16 = 4 steps.
Finding the Maximum
Communication – What if n is not a power of 2?
- Suppose n = 2^k + r (with r < 2^k). In the first step, r tasks send their values to r of the other tasks and then become inactive, leaving a power-of-2 reduction.
- So if n is not a power of 2, ⌊log n⌋ + 1 communication steps are required.
- In general, ⌈log n⌉ communication steps are required.
By courtesy of M.J. Quinn, Parallel Programming in C with MPI and OpenMP, 2003
Finding the Maximum
Agglomeration
Consulting the mapping decision tree again: the number of tasks is static, the communication pattern is structured, and the computation time per task is constant, so we agglomerate tasks to minimize communication and create one task per processor.
Finding the Maximum
Agglomeration & mapping
We try to minimize communication by assigning ⌈n/p⌉ tasks to each processor.
[Figure: each processor first computes the maximum of its own values, then the p local maxima are combined by a tree reduction. By courtesy of M.J. Quinn, Parallel Programming in C with MPI and OpenMP, 2003]
Finding the Maximum
Analysis
- If the n integers are divided evenly among the p processors, each processor has no more than ⌈n/p⌉ integers.
- Since all processors work concurrently, the time to compute the maxima of their local integers is χ(⌈n/p⌉ − 1).
- Each reduction step requires λ + χ: send/receive a message and compute one maximum.
- Since there are ⌈log p⌉ communication steps, the overall time of this parallel program is χ(⌈n/p⌉ − 1) + ⌈log p⌉(λ + χ).
By courtesy of M.J. Quinn, Parallel Programming in C with MPI and OpenMP, 2003
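In MPI, this whole pattern (local maxima followed by a ⌈log p⌉-step combine) is a single collective call. A minimal sketch; the helper name global_max_of is illustrative, and a typical MPI_Reduce implementation uses a reduction tree like the one sketched earlier:

```c
#include <mpi.h>

/* Each process scans its own block (cost about chi*(count-1)), then
   MPI_Reduce combines the p local maxima (cost about
   ceil(log p)*(lambda+chi)). */
int global_max_of(const int *block, int count, MPI_Comm comm)
{
    int local = block[0], global = 0;
    for (int i = 1; i < count; i++)
        if (block[i] > local)
            local = block[i];
    MPI_Reduce(&local, &global, 1, MPI_INT, MPI_MAX, 0, comm);
    return global;   /* meaningful on rank 0 only */
}
```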
The n-body Problem
- We are simulating the motion of n particles of varying mass in two dimensions.
- During each iteration, we need to compute the new position and velocity vector of each particle, given the positions of all the other particles.
[Figure: three particles; the future position of one is influenced by the gravitational forces exerted by the other two.]
The n-body Problem
Partitioning
- Let's assume we have one task per particle.
- To compute the new location of its particle, a task must know the locations of all the other particles.
The n-body Problem
Communication
- A gather operation is a global communication that takes a dataset distributed among a group of tasks and collects the items on a single task.
- Here every task needs the whole dataset, which calls for an all-gather: a gather whose result ends up on every task.
The n-body Problem
Communication – an intuitive way
[Figure: each task sends its particle's position to every other task over point-to-point channels. By courtesy of M.J. Quinn, Parallel Programming in C with MPI and OpenMP, 2003]
After n − 1 communication steps, each task has the positions of all the other particles. Is there a quicker way?
The n-body Problem
Communication – a faster way
[Figure: recursive doubling among four tasks, by courtesy of M.J. Quinn, Parallel Programming in C with MPI and OpenMP, 2003. Initially the tasks hold v0, v1, v2, and v3 respectively. After one exchange step, two tasks hold {v0, v1} and the other two hold {v2, v3}; after a second exchange, every task holds {v0, v1, v2, v3}.]
The n-body Problem
Communication – a faster way
- A logarithmic number of exchange steps is necessary and sufficient to allow every task to acquire all the values.
- In the i-th exchange step, the messages have length 2^(i−1).
The n-body Problem
Agglomeration & mapping
- Assume that n is a multiple of p. We associate one task with each processor and agglomerate n/p particles into each task.
- Now the all-gather communication requires log p exchange steps.
- In the i-th exchange step, the messages have length 2^(i−1)·(n/p). (An MPI sketch follows.)
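In MPI the agglomerated all-gather is one collective call. A minimal sketch, assuming n is a multiple of p and two doubles (x, y) per particle; the function name share_positions and the buffer layout are illustrative:

```c
#include <mpi.h>

/* mine: 2*(n/p) doubles, the (x, y) positions of this task's particles.
   all:  2*n doubles, filled with every task's positions on return.
   A good MPI_Allgather implementation uses about log p exchange
   steps internally, matching the analysis above. */
void share_positions(const double *mine, double *all, int n, int p,
                     MPI_Comm comm)
{
    MPI_Allgather(mine, 2 * (n / p), MPI_DOUBLE,
                  all,  2 * (n / p), MPI_DOUBLE, comm);
}
```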
The n-body Problem
Analysis
- In the previous examples, we assumed it took λ units of time to send a message, since in those examples the messages always had length 1.
- From now on, the communication time is determined by two parameters:
  - λ (latency): the time needed to initiate a message.
  - β (bandwidth): the number of data items that can be sent down a channel in one unit of time.
- Sending a message containing n data items requires time λ + n/β.
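For intuition (the numbers here are hypothetical): if λ = 100 µs and β = 10⁶ items per second, a 1-item message costs 100 µs + 1 µs ≈ 101 µs, almost pure latency, while a 1,000-item message costs 100 µs + 1,000 µs = 1.1 ms, mostly data transfer. Latency dominates short messages and bandwidth dominates long ones, which is one reason agglomerating communication into fewer, larger messages pays off.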
The n-body Problem
Analysis
The communication time for each iteration of the simulation is the all-gather time. A particle's position is two data items (px and py), so in the i-th exchange step each message carries 2^(i−1)·2n/p items:
  Σ_{i=1..log p} (λ + 2^(i−1)·2n/(pβ)) = λ log p + 2n(p−1)/(βp)
Hence, letting χ be the time to compute the influence of one particle on another, the total time per iteration is
  χ(n/p)(n−1) + λ log p + 2n(p−1)/(βp)
The n-body Problem
How about adding data I/O?
- Let's assign I/O duties to processor 0.
- We need to input/output the 2-D positions px and py and the velocities vx and vy of the n particles: 4n values in total.
- It takes λio + 4n/βio to send all the values, where λio and βio are the latency and bandwidth of the I/O channels.
[Figure: processor 0 is attached to the input and output channels.]
The n-body Problem
- After the data are read, we need to assign a suitable subsection of 4n/p values to each processor.
- A straightforward way (processor 0 sending each of the other processors its block in turn) takes (p−1)(λ + 4n/(βp)) = (p−1)λ + (p−1)·4n/(βp) units of time.
[Figure: processor 0 handles input/output and sends one block each to processors 1, 2, and 3.]
The n-body Problem
- We need another global communication: scatter.
- Can you see how scatter is like a gather operation in reverse?
Performed smartly, by reversing the all-gather's exchange pattern, the scatter takes
  Σ_{i=1..log p} (λ + 2^(i−1)·4n/(pβ)) = λ log p + 4n(p−1)/(βp)
[Figure: scatter among four tasks. Task 0 starts with {v0, v1, v2, v3} and sends {v2, v3} to task 2; then task 0 sends v1 to task 1 while task 2 sends v3 to task 3, leaving task i holding vi.]
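In MPI the scatter is likewise a single collective, and a good implementation can use the tree-like pattern above rather than p−1 sequential sends (the two costs are compared just below). A minimal sketch, assuming n is a multiple of p and four doubles (px, py, vx, vy) per particle; the function name distribute is illustrative:

```c
#include <mpi.h>

/* all:  4*n doubles on the root (px, py, vx, vy per particle).
   mine: 4*(n/p) doubles, this task's block on return. */
void distribute(const double *all, double *mine, int n, int p,
                MPI_Comm comm)
{
    MPI_Scatter(all,  4 * (n / p), MPI_DOUBLE,
                mine, 4 * (n / p), MPI_DOUBLE,
                0 /* root = processor 0 */, comm);
}
```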
The n-body Problem
Scatter communication time:
- Straightforward way: (p−1)λ + (p−1)·4n/(βp)
- Smart way: (log p)λ + (p−1)·4n/(βp)
Note that the data transmission time (the term with β) is identical for the two ways.
- We assume that our task/channel model supports the concurrent transmission of messages from multiple tasks.
- That is reasonable for commercial parallel systems, but perhaps unreasonable for a network of workstations connected by a hub or any shared medium that supports only a single message at a time.
The n-body Problem
The expected overall execution time of the parallel computation (suppose m iterations) is composed of:
- data input and output: 2(λio + 4n/βio)
- a scatter at the beginning and a gather at the end: 2(λ log p + 4n(p−1)/(βp))
- m iterations of all-gather plus computation: m(χ(n/p)(n−1) + λ log p + 2n(p−1)/(βp))
Total: 2(λio + 4n/βio) + 2(λ log p + 4n(p−1)/(βp)) + m(χ(n/p)(n−1) + λ log p + 2n(p−1)/(βp))
Summary
- Task/channel model for parallel computation
- Foster's design methodology: partitioning, communication, agglomeration, mapping
- Goals: maximize processor utilization; minimize inter-processor communication
- Common operations: reduction, all-gather, gather/scatter