
Page 1: Parallel Algorithm Design

Parallel Computing

Parallel Algorithm Design

Page 2: Parallel Algorithm Design

2010@FEUP Parallel Algorithm Design 2

Task/Channel Model

• Parallel computation = set of tasks

• A task consists of:

  • a program

  • local memory

  • a collection of I/O ports

• Tasks interact by sending messages through channels
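The slides contain no code; as a rough Python sketch of the task/channel model (all names are mine), two tasks with local state can communicate through a FIFO queue standing in for a channel:

```python
# Minimal task/channel sketch: tasks are threads with local memory,
# channels are FIFO queues carrying messages between them.
import threading
import queue

def producer(channel):
    # Task with local memory; sends its values through an output port.
    local_data = [1, 2, 3]
    for x in local_data:
        channel.put(x)
    channel.put(None)  # sentinel: no more messages

def consumer(channel, result):
    # Task that receives values through an input port and sums them.
    total = 0
    while True:
        x = channel.get()
        if x is None:
            break
        total += x
    result.append(total)

channel = queue.Queue()   # the channel connecting the two tasks
result = []
t1 = threading.Thread(target=producer, args=(channel,))
t2 = threading.Thread(target=consumer, args=(channel, result))
t1.start(); t2.start()
t1.join(); t2.join()
print(result[0])  # 6
```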

Page 3: Parallel Algorithm Design

Task/Channel Model

[Figure: tasks (nodes) connected by channels (directed edges)]

Page 4: Parallel Algorithm Design

Foster’s Design Methodology

[Diagram: Problem → Partitioning → Communication → Agglomeration → Mapping]

1. Partitioning

2. Communication

3. Agglomeration

4. Mapping

Page 5: Parallel Algorithm Design

1. Partitioning

• Dividing computation and data into pieces

• Domain decomposition

• Divide data into pieces

• e.g., an array into sub-arrays (reduction); a loop into sub-loops (matrix multiplication); a search space into sub-spaces (chess)

• Functional decomposition

• Divide computation into pieces

• e.g., pipelines (floating-point multiplication), workflows (payroll processing)

• Determine how to associate data with

computations
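A minimal Python illustration of the two decomposition styles (the arrays and stage functions are invented examples, not from the slides):

```python
# Domain decomposition: divide the data (an array) into roughly equal
# sub-arrays, one per primitive task.
def domain_decompose(data, num_tasks):
    n = len(data)
    chunks = []
    for t in range(num_tasks):
        # Task t gets the block [t*n//num_tasks, (t+1)*n//num_tasks)
        chunks.append(data[t * n // num_tasks:(t + 1) * n // num_tasks])
    return chunks

parts = domain_decompose(list(range(10)), 3)
print(parts)  # [[0, 1, 2], [3, 4, 5], [6, 7, 8, 9]]

# Functional decomposition: divide the computation into stages
# (a pipeline), each stage a primitive task.
stages = [lambda x: x + 1, lambda x: x * 2]   # two illustrative stages

def pipeline(x):
    for stage in stages:
        x = stage(x)
    return x

print(pipeline(3))  # 8
```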

Page 6: Parallel Algorithm Design

Partitioning

• The individual pieces are called primitive

tasks.

• Desirable attributes of a partition

• Many more primitive tasks than processors on

target computer.

• Tasks of roughly equal size (in computation

and data).

• Number of tasks increases with problem size.
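The first two attributes can be sketched as simple checks; the thresholds below (10× tasks per processor, 2× size spread) are illustrative assumptions, not from the slides:

```python
# Sketch of the partition checklist as predicates (thresholds are
# illustrative assumptions). The third attribute -- task count growing
# with problem size -- is a property of the design, not of one instance.
def check_partition(num_tasks, num_procs, task_sizes):
    many_more = num_tasks >= 10 * num_procs            # "many more tasks than processors"
    balanced = max(task_sizes) <= 2 * min(task_sizes)  # "roughly equal size"
    return many_more and balanced

print(check_partition(1000, 8, [4, 5, 5, 4]))  # True
print(check_partition(8, 8, [4, 9]))           # False
```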

Page 7: Parallel Algorithm Design

Example of domain decomposition

Page 8: Parallel Algorithm Design

Example of Functional Decomposition

Page 9: Parallel Algorithm Design

2. Communication

• Determine values passed among tasks

• Local communication

• Task needs values from a small number of

other tasks

• Create channels illustrating data flow

• Global communication

• A significant number of tasks contribute data to the computation

• Don’t create channels for them early in design

Page 10: Parallel Algorithm Design

Desirable attributes for communication

• Balanced

• Communication operations balanced among

tasks

• Small degree:

• Each task communicates with only small group

of neighbors

• Concurrency

• Tasks can perform communications

concurrently

• Tasks can perform computations concurrently

Page 11: Parallel Algorithm Design

3. Agglomeration

• Agglomeration is the process of grouping

tasks into larger tasks to improve

performance.

• Here, minimizing communication is typically a design goal

• Grouping tasks that communicate with each other eliminates that communication; this is called increasing locality

• Grouping tasks can also allow us to combine multiple communications into one

Page 12: Parallel Algorithm Design

Desirable attributes of agglomeration

• Increases the locality of the parallel algorithm

• Agglomerated tasks have similar computational

and communication costs

• Number of tasks increases with problem size

• Number of tasks is as small as possible, yet at

least as great as the number of processors on

target computer

Page 13: Parallel Algorithm Design

4. Mapping

• Mapping is the process of assigning

agglomerated tasks to the processors

• Here, we’re thinking of a distributed-memory machine

• If we choose the number of agglomerated tasks

to equal the number of processors then the

mapping is already done. Each processor gets

one agglomerated task

Page 14: Parallel Algorithm Design

Mapping Goals

• Processor utilization: we would like processors to have roughly equal computational and communication costs

• Minimize interprocessor communication

• This can be posed as a graph partitioning

problem:

• Each partition should have roughly the same

number of nodes

• The partition should cut a minimal amount of

edges

Page 15: Parallel Algorithm Design

Partitioning a graph

[Figure: a graph’s nodes divided between processors P0 and P1]

Equalizing processor utilization and minimizing interprocessor

communication are often competing forces
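Counting cut edges and per-processor node counts makes that trade-off concrete; a small Python sketch (graph and assignment invented for illustration):

```python
# Evaluate a graph partition: count nodes per processor (balance) and
# edges whose endpoints live on different processors (communication).
def partition_cost(edges, assignment):
    # assignment maps node -> processor; edges is a list of (u, v) pairs
    cut = sum(1 for u, v in edges if assignment[u] != assignment[v])
    counts = {}
    for node, proc in assignment.items():
        counts[proc] = counts.get(proc, 0) + 1
    return cut, counts

edges = [(0, 1), (1, 2), (2, 3), (3, 0)]          # a 4-cycle
assignment = {0: "P0", 1: "P0", 2: "P1", 3: "P1"}  # 2 nodes each
print(partition_cost(edges, assignment))  # (2, {'P0': 2, 'P1': 2})
```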

Page 16: Parallel Algorithm Design

Mapping heuristics

• Static number of tasks

• Structured communication

• Constant computation time per task

− Agglomerate tasks to minimize communication

− Create one task per processor

• Variable computation time per task

− Cyclically map tasks to processors

• Unstructured communication

− Use a static load balancing algorithm

• Dynamic number of tasks

• Use a run-time task-scheduling algorithm

− e.g., a master-slave strategy

• Use a dynamic load balancing algorithm

− e.g., share load among neighboring processors; remap periodically
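The decision tree above can be written as a function; the return strings are my labels for the slide’s heuristics:

```python
# The mapping decision tree as a function (a sketch; labels are mine).
def mapping_strategy(static_tasks, structured_comm=True, constant_time=True):
    if not static_tasks:
        # dynamic number of tasks
        return "run-time task scheduling or dynamic load balancing"
    if not structured_comm:
        return "static load balancing algorithm"
    if constant_time:
        return "agglomerate to one task per processor"
    return "cyclically map tasks to processors"

print(mapping_strategy(True, True, False))  # cyclically map tasks to processors
```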

Page 17: Parallel Algorithm Design

Example

• 1. Boundary value problems

[Figure: a rod surrounded by insulation, with its ends in ice water]

Page 18: Parallel Algorithm Design

Boundary Value Problem

Heat conduction physics:

∂u/∂t = a² ∂²u/∂x², with a² = k/c

Discretization:

ui,j = temperature at position i and time j

∂u/∂t ≈ (ui,j+1 − ui,j) / Δt

∂²u/∂x² ≈ (ui−1,j − 2·ui,j + ui+1,j) / (Δx)²

Substituting gives the update rule:

ui,j+1 = r·ui−1,j + (1 − 2r)·ui,j + r·ui+1,j, with r = a²·Δt / (Δx)²
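A short Python sketch of one time step of this explicit scheme, assuming the rod’s ends are held at temperature 0 and r is chosen in the stable range (r ≤ 1/2):

```python
# One time step of the explicit finite-difference scheme:
# u[i, j+1] = r*u[i-1, j] + (1 - 2r)*u[i, j] + r*u[i+1, j]
def step(u, r):
    # Ends of the rod stay at the boundary temperature (0 here).
    new = u[:]
    for i in range(1, len(u) - 1):
        new[i] = r * u[i - 1] + (1 - 2 * r) * u[i] + r * u[i + 1]
    return new

u = [0.0, 100.0, 100.0, 100.0, 0.0]   # initial temperatures along the rod
u = step(u, r=0.25)
print(u)  # [0.0, 75.0, 100.0, 75.0, 0.0]
```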

Page 19: Parallel Algorithm Design

Boundary Value Problem

• Partition

• One data item per grid point

• Associate one primitive task with each grid

point

• Two-dimensional domain decomposition

• Communication

• Identify communication pattern between

primitive tasks

• Each interior primitive task has three incoming

and three outgoing channels

Page 20: Parallel Algorithm Design

Boundary Value Problem

• Agglomeration and mapping

[Figure: primitive tasks agglomerated into larger tasks, one per processor]

Page 21: Parallel Algorithm Design

Model Analysis

• Sequential execution

• χ – time to update one element

• n – number of elements

• m – number of iterations

• Sequential execution time: m·n·χ

• Parallel execution

• p – number of processors

• λ – message latency, β – bandwidth

• message time = λ + q/β ≈ λ for small q

• Parallel execution time: m·(χ·n/p + 2λ)
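The two cost formulas as Python functions (symbols spelled out as chi and lam; the sample parameter values are arbitrary):

```python
# Performance model for the boundary value problem:
# chi = time to update one element, lam = message latency.
def seq_time(m, n, chi):
    # m iterations, each updating all n elements
    return m * n * chi

def par_time(m, n, p, chi, lam):
    # each iteration: n/p local updates plus 2 boundary exchanges
    return m * (chi * n / p + 2 * lam)

print(seq_time(100, 1000, 1e-6))           # ~0.1
print(par_time(100, 1000, 10, 1e-6, 1e-5)) # ~0.012
```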

Page 22: Parallel Algorithm Design

Example – Parallel reduction

• Given an associative operator ⊕

• a0 ⊕ a1 ⊕ a2 ⊕ … ⊕ an-1

• Examples

• Add

• Multiply

• And, Or

• Maximum, Minimum

Each primitive task is assigned one of the values to operate on (one of the a’s)

Data decomposition

Page 23: Parallel Algorithm Design

Parallel reduction

Further steps to

reach a binomial tree

Page 24: Parallel Algorithm Design

Parallel reduction

4 2 0 7

-3 5 -6 -3

8 1 2 3

-4 4 6 -1

Page 25: Parallel Algorithm Design

Parallel reduction

1 7 -6 4

4 5 8 2

Page 26: Parallel Algorithm Design

Parallel reduction

8 -2

9 10

Page 27: Parallel Algorithm Design

Parallel reduction

17 8

Page 28: Parallel Algorithm Design

Parallel reduction

25

Binomial tree
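A sequential simulation of the binomial-tree reduction, assuming the number of values is a power of two; applied to the 16 values from the preceding slides it yields 25:

```python
# Binomial-tree reduction: in each of log2(p) steps, half of the active
# tasks send their partial value to a partner, which combines it.
def tree_reduce(values, op):
    vals = values[:]
    active = len(vals)          # assumes a power-of-two count
    while active > 1:
        half = active // 2
        for i in range(half):
            # task i receives from task i + half and combines
            vals[i] = op(vals[i], vals[i + half])
        active = half
    return vals[0]

grid = [4, 2, 0, 7, -3, 5, -6, -3, 8, 1, 2, 3, -4, 4, 6, -1]
print(tree_reduce(grid, lambda a, b: a + b))  # 25
```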

Page 29: Parallel Algorithm Design

Agglomeration

[Figure: four agglomerated tasks, each computing a local sum before the reduction]

Page 30: Parallel Algorithm Design

Analysis

• Parallel running time

• χ – time to perform the binary operation

• λ – time to communicate a value via a channel

• n values and p tasks

• Time for each task to perform its inner calculations: (n/p − 1)·χ

• Communication steps: ⌈log p⌉

• After each receive there is one operation

• Total time: (n/p − 1)·χ + ⌈log p⌉·(λ + χ)

Page 31: Parallel Algorithm Design

Example: the N-body problem

[Figure: body B1 of mass m at position (x,y) moving with velocity v; f1 and f2 are the forces exerted on B1 by bodies B2 and B3]

Page 32: Parallel Algorithm Design

The N-body problem

Page 33: Parallel Algorithm Design

The N-body problem partitioning

• Domain partitioning

• Assume one task per particle

• Task has particle’s position, velocity

vector and mass

• Iteration

• Get positions and mass of all other particles

• Compute new position and velocity
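A sequential sketch of one force-computation pass, using simple 2-D gravitational attraction with G = 1 (the constants and data are illustrative):

```python
# One naive N-body force pass: every task needs the position and mass
# of every other particle (hence the all-gather discussed next).
def gravity_forces(positions, masses, G=1.0):
    n = len(positions)
    forces = [[0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            dx = positions[j][0] - positions[i][0]
            dy = positions[j][1] - positions[i][1]
            dist2 = dx * dx + dy * dy
            dist = dist2 ** 0.5
            f = G * masses[i] * masses[j] / dist2   # force magnitude
            forces[i][0] += f * dx / dist           # components toward body j
            forces[i][1] += f * dy / dist
    return forces

f = gravity_forces([(0.0, 0.0), (1.0, 0.0)], [1.0, 1.0])
print(f)  # [[1.0, 0.0], [-1.0, 0.0]] -- equal and opposite
```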

Page 34: Parallel Algorithm Design

Gather and All-Gather operations

[Figure: a Gather operation (sequential, one task receives from the other p − 1) and an All-Gather operation]

Page 35: Parallel Algorithm Design

All-Gather

To avoid conflicts, the all-gather is performed in ⌈log p⌉ steps, doubling the data in each step

Communication time for n items = λ + n/β

With p tasks there are ⌈log p⌉ iterations and the number of items doubles at each iteration:

Σi=1..⌈log p⌉ (λ + 2^(i−1)·n/(p·β)) = λ·⌈log p⌉ + n·(p − 1)/(p·β)
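The step-by-step sum and the closed form can be checked numerically (parameter values arbitrary):

```python
# Check that summing the per-step all-gather costs gives the closed form:
# sum_{i=1..log p} (lam + 2**(i-1) * n/(p*beta))
#   = lam*log2(p) + n*(p-1)/(p*beta)
import math

def allgather_time(n, p, lam, beta):
    steps = int(math.log2(p))           # assumes p is a power of two
    return sum(lam + (2 ** (i - 1)) * n / (p * beta)
               for i in range(1, steps + 1))

n, p, lam, beta = 1024, 8, 1e-4, 1e6
closed = lam * math.log2(p) + n * (p - 1) / (p * beta)
print(math.isclose(allgather_time(n, p, lam, beta), closed))  # True
```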

Page 36: Parallel Algorithm Design

Analysis

• N-body problem parallel version

• n bodies and p tasks

• m iterations over time

Total time excluding I/O:

m·(λ·⌈log p⌉ + n·(p − 1)/(p·β) + χ·n/p)

Page 37: Parallel Algorithm Design

Considering I/O

Reading or writing n items of data through an I/O channel takes λio + n/βio

In the N-body problem, the initial values must be transmitted to the other tasks

Page 38: Parallel Algorithm Design

Scatter operation

Improving:

1. The first task transmits n/2 items to another task

2. The 2 tasks transmit n/4 items to 2 other tasks

3. The 4 tasks transmit n/8 items to 4 other tasks

4. And so on …

Scatter time:

Σi=1..⌈log p⌉ (λ + n/(2^i·β)) = λ·⌈log p⌉ + n·(p − 1)/(p·β)

Page 39: Parallel Algorithm Design

Analysis considering I/O

• Total time after m iterations

• Initial reading + scattering

• Computing m iterations

• Final gathering + writing

2·(λio + n/βio) + 2·(λ·⌈log p⌉ + n·(p − 1)/(p·β)) + m·(λ·⌈log p⌉ + n·(p − 1)/(p·β) + χ·n/p)
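The full model as one Python function, assuming p is an exact power of two so log2(p) is integral (all parameter values illustrative):

```python
# Total time model: initial read + scatter, m compute iterations
# (each with an all-gather), final gather + write.
import math

def nbody_total_time(m, n, p, chi, lam, beta, lam_io, beta_io):
    io = lam_io + n / beta_io                                   # read or write n items
    collective = lam * math.log2(p) + n * (p - 1) / (p * beta)  # scatter/gather/all-gather
    iteration = collective + chi * n / p                        # all-gather + local update
    return 2 * io + 2 * collective + m * iteration

t = nbody_total_time(m=10, n=1024, p=8, chi=1e-6,
                     lam=1e-4, beta=1e6, lam_io=1e-3, beta_io=1e5)
print(t > 0)  # True
```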