
Page 1: Parallel Algorithm Design

Parallel Computing

Parallel Algorithm Design

Page 2: Parallel Algorithm Design

2010@FEUP Parallel Algorithm Design 2

Task/Channel Model

• Parallel computation = set of tasks

• A task consists of:

  • a program

  • local memory

  • a collection of I/O ports

• Tasks interact by sending messages through channels
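The slides contain no code; as a rough Python sketch of the task/channel model (all names are mine), two tasks with local state can communicate through a FIFO queue standing in for a channel:

```python
# Minimal task/channel sketch: tasks are threads with local memory,
# channels are FIFO queues carrying messages between them.
import threading
import queue

def producer(channel):
    # Task with local memory; sends its values through an output port.
    local_data = [1, 2, 3]
    for x in local_data:
        channel.put(x)
    channel.put(None)  # sentinel: no more messages

def consumer(channel, result):
    # Task that receives values through an input port and sums them.
    total = 0
    while True:
        x = channel.get()
        if x is None:
            break
        total += x
    result.append(total)

channel = queue.Queue()   # the channel connecting the two tasks
result = []
t1 = threading.Thread(target=producer, args=(channel,))
t2 = threading.Thread(target=consumer, args=(channel, result))
t1.start(); t2.start()
t1.join(); t2.join()
print(result[0])  # 6
```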

Page 3: Parallel Algorithm Design

Task/Channel Model

[Figure: tasks (nodes) connected by channels (directed edges)]

Page 4: Parallel Algorithm Design

Foster’s Design Methodology

[Diagram: Problem → Partitioning → Communication → Agglomeration → Mapping]

1. Partitioning

2. Communication

3. Agglomeration

4. Mapping

Page 5: Parallel Algorithm Design

1. Partitioning

• Dividing computation and data into pieces

• Domain decomposition

• Divide data into pieces

• e.g., an array into sub-arrays (reduction); a loop into sub-loops (matrix multiplication); a search space into sub-spaces (chess)

• Functional decomposition

• Divide computation into pieces

• e.g., pipelines (floating-point multiplication), workflows (payroll processing)

• Determine how to associate data with

computations
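A minimal Python illustration of the two decomposition styles (the arrays and stage functions are invented examples, not from the slides):

```python
# Domain decomposition: divide the data (an array) into roughly equal
# sub-arrays, one per primitive task.
def domain_decompose(data, num_tasks):
    n = len(data)
    chunks = []
    for t in range(num_tasks):
        # Task t gets the block [t*n//num_tasks, (t+1)*n//num_tasks)
        chunks.append(data[t * n // num_tasks:(t + 1) * n // num_tasks])
    return chunks

parts = domain_decompose(list(range(10)), 3)
print(parts)  # [[0, 1, 2], [3, 4, 5], [6, 7, 8, 9]]

# Functional decomposition: divide the computation into stages
# (a pipeline), each stage a primitive task.
stages = [lambda x: x + 1, lambda x: x * 2]   # two illustrative stages

def pipeline(x):
    for stage in stages:
        x = stage(x)
    return x

print(pipeline(3))  # 8
```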

Page 6: Parallel Algorithm Design

Partitioning

• The individual pieces are called primitive

tasks.

• Desirable attributes of a partition

• Many more primitive tasks than processors on

target computer.

• Tasks of roughly equal size (in computation

and data).

• Number of tasks increases with problem size.
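The first two attributes can be sketched as simple checks; the thresholds below (10× tasks per processor, 2× size spread) are illustrative assumptions, not from the slides:

```python
# Sketch of the partition checklist as predicates (thresholds are
# illustrative assumptions). The third attribute -- task count growing
# with problem size -- is a property of the design, not of one instance.
def check_partition(num_tasks, num_procs, task_sizes):
    many_more = num_tasks >= 10 * num_procs            # "many more tasks than processors"
    balanced = max(task_sizes) <= 2 * min(task_sizes)  # "roughly equal size"
    return many_more and balanced

print(check_partition(1000, 8, [4, 5, 5, 4]))  # True
print(check_partition(8, 8, [4, 9]))           # False
```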

Page 7: Parallel Algorithm Design

Example of domain decomposition

Page 8: Parallel Algorithm Design

Example of Functional Decomposition

Page 9: Parallel Algorithm Design

2. Communication

• Determine values passed among tasks

• Local communication

• Task needs values from a small number of

other tasks

• Create channels illustrating data flow

• Global communication

• A significant number of tasks contribute data to the computation

• Don’t create channels for them early in design

Page 10: Parallel Algorithm Design

Desirable attributes for communication

• Balanced

• Communication operations balanced among

tasks

• Small degree:

• Each task communicates with only small group

of neighbors

• Concurrency

• Tasks can perform communications

concurrently

• Tasks can perform computations concurrently

Page 11: Parallel Algorithm Design

3. Agglomeration

• Agglomeration is the process of grouping

tasks into larger tasks to improve

performance.

• Here, minimizing communication is typically a design goal

• Grouping tasks that communicate with each other eliminates that communication; this is called increasing locality

• Grouping tasks can also allow us to combine multiple communications into one

Page 12: Parallel Algorithm Design

Desirable attributes of agglomeration

• Increases the locality of the parallel algorithm

• Agglomerated tasks have similar computational

and communication costs

• Number of tasks increases with problem size

• Number of tasks is as small as possible, yet at

least as great as the number of processors on

target computer

Page 13: Parallel Algorithm Design

4. Mapping

• Mapping is the process of assigning

agglomerated tasks to the processors

• Here, we’re thinking of a distributed-memory machine

• If we choose the number of agglomerated tasks

to equal the number of processors then the

mapping is already done. Each processor gets

one agglomerated task

Page 14: Parallel Algorithm Design

Mapping Goals

• Processor utilization: we would like processors to have roughly equal computational and communication costs

• Minimize interprocessor communication

• This can be posed as a graph partitioning

problem:

• Each partition should have roughly the same

number of nodes

• The partition should cut a minimal amount of

edges

Page 15: Parallel Algorithm Design

Partitioning a graph

[Figure: a graph’s nodes divided between processors P0 and P1]

Equalizing processor utilization and minimizing interprocessor

communication are often competing forces
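Counting cut edges and per-processor node counts makes that trade-off concrete; a small Python sketch (graph and assignment invented for illustration):

```python
# Evaluate a graph partition: count nodes per processor (balance) and
# edges whose endpoints live on different processors (communication).
def partition_cost(edges, assignment):
    # assignment maps node -> processor; edges is a list of (u, v) pairs
    cut = sum(1 for u, v in edges if assignment[u] != assignment[v])
    counts = {}
    for node, proc in assignment.items():
        counts[proc] = counts.get(proc, 0) + 1
    return cut, counts

edges = [(0, 1), (1, 2), (2, 3), (3, 0)]          # a 4-cycle
assignment = {0: "P0", 1: "P0", 2: "P1", 3: "P1"}  # 2 nodes each
print(partition_cost(edges, assignment))  # (2, {'P0': 2, 'P1': 2})
```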

Page 16: Parallel Algorithm Design

Mapping heuristics

• Static number of tasks

• Structured communication

• Constant computation time per task

− Agglomerate tasks to minimize communication

− Create one task per processor

• Variable computation time per task

− Cyclically map tasks to processors

• Unstructured communication

− Use a static load balancing algorithm

• Dynamic number of tasks

• Use a run-time task-scheduling algorithm

− e.g., a master-slave strategy

• Use a dynamic load balancing algorithm

− e.g., share load among neighboring processors; remap periodically
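The decision tree above can be written as a function; the return strings are my labels for the slide’s heuristics:

```python
# The mapping decision tree as a function (a sketch; labels are mine).
def mapping_strategy(static_tasks, structured_comm=True, constant_time=True):
    if not static_tasks:
        # dynamic number of tasks
        return "run-time task scheduling or dynamic load balancing"
    if not structured_comm:
        return "static load balancing algorithm"
    if constant_time:
        return "agglomerate to one task per processor"
    return "cyclically map tasks to processors"

print(mapping_strategy(True, True, False))  # cyclically map tasks to processors
```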

Page 17: Parallel Algorithm Design

Example

• 1. Boundary value problems

[Figure: a rod surrounded by insulation, with its ends in ice water]

Page 18: Parallel Algorithm Design

Boundary Value Problem

Heat conduction physics:

∂u/∂t = a² ∂²u/∂x², with a² = k/c

Discretization:

ui,j = temperature at position i and time j

∂u/∂t ≈ (ui,j+1 − ui,j) / Δt

∂²u/∂x² ≈ (ui−1,j − 2·ui,j + ui+1,j) / (Δx)²

Substituting gives the update rule:

ui,j+1 = r·ui−1,j + (1 − 2r)·ui,j + r·ui+1,j, with r = a²·Δt / (Δx)²
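A short Python sketch of one time step of this explicit scheme, assuming the rod’s ends are held at temperature 0 and r is chosen in the stable range (r ≤ 1/2):

```python
# One time step of the explicit finite-difference scheme:
# u[i, j+1] = r*u[i-1, j] + (1 - 2r)*u[i, j] + r*u[i+1, j]
def step(u, r):
    # Ends of the rod stay at the boundary temperature (0 here).
    new = u[:]
    for i in range(1, len(u) - 1):
        new[i] = r * u[i - 1] + (1 - 2 * r) * u[i] + r * u[i + 1]
    return new

u = [0.0, 100.0, 100.0, 100.0, 0.0]   # initial temperatures along the rod
u = step(u, r=0.25)
print(u)  # [0.0, 75.0, 100.0, 75.0, 0.0]
```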

Page 19: Parallel Algorithm Design

Boundary Value Problem

• Partition

• One data item per grid point

• Associate one primitive task with each grid

point

• Two-dimensional domain decomposition

• Communication

• Identify communication pattern between

primitive tasks

• Each interior primitive task has three incoming

and three outgoing channels

Page 20: Parallel Algorithm Design

Boundary Value Problem

• Agglomeration and mapping

[Figure: primitive tasks agglomerated into larger tasks, one per processor]

Page 21: Parallel Algorithm Design

Model Analysis

• Sequential execution

• χ – time to update one element

• n – number of elements

• m – number of iterations

• Sequential execution time: m·n·χ

• Parallel execution

• p – number of processors

• λ – message latency, β – bandwidth

• message time = λ + q/β ≈ λ for small q

• Parallel execution time: m·(χ·n/p + 2λ)
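The two cost formulas as Python functions (symbols spelled out as chi and lam; the sample parameter values are arbitrary):

```python
# Performance model for the boundary value problem:
# chi = time to update one element, lam = message latency.
def seq_time(m, n, chi):
    # m iterations, each updating all n elements
    return m * n * chi

def par_time(m, n, p, chi, lam):
    # each iteration: n/p local updates plus 2 boundary exchanges
    return m * (chi * n / p + 2 * lam)

print(seq_time(100, 1000, 1e-6))           # ~0.1
print(par_time(100, 1000, 10, 1e-6, 1e-5)) # ~0.012
```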

Page 22: Parallel Algorithm Design

Example – Parallel reduction

• Given an associative operator ⊕

• a0 ⊕ a1 ⊕ a2 ⊕ … ⊕ an-1

• Examples

• Add

• Multiply

• And, Or

• Maximum, Minimum

Each primitive task is assigned one of the values to operate on (one of the a’s)

Data decomposition

Page 23: Parallel Algorithm Design

Parallel reduction

Further steps to

reach a binomial tree

Page 24: Parallel Algorithm Design

Parallel reduction

4 2 0 7

-3 5 -6 -3

8 1 2 3

-4 4 6 -1

Page 25: Parallel Algorithm Design

Parallel reduction

1 7 -6 4

4 5 8 2

Page 26: Parallel Algorithm Design

Parallel reduction

8 -2

9 10

Page 27: Parallel Algorithm Design

Parallel reduction

17 8

Page 28: Parallel Algorithm Design

Parallel reduction

25

Binomial tree
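A sequential simulation of the binomial-tree reduction, assuming the number of values is a power of two; applied to the 16 values from the preceding slides it yields 25:

```python
# Binomial-tree reduction: in each of log2(p) steps, half of the active
# tasks send their partial value to a partner, which combines it.
def tree_reduce(values, op):
    vals = values[:]
    active = len(vals)          # assumes a power-of-two count
    while active > 1:
        half = active // 2
        for i in range(half):
            # task i receives from task i + half and combines
            vals[i] = op(vals[i], vals[i + half])
        active = half
    return vals[0]

grid = [4, 2, 0, 7, -3, 5, -6, -3, 8, 1, 2, 3, -4, 4, 6, -1]
print(tree_reduce(grid, lambda a, b: a + b))  # 25
```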

Page 29: Parallel Algorithm Design

Agglomeration

[Figure: four agglomerated tasks, each computing a local sum before the reduction]

Page 30: Parallel Algorithm Design

Analysis

• Parallel running time

• χ – time to perform the binary operation

• λ – time to communicate a value via a channel

• n values and p tasks

• Time for each task to perform its inner calculations: (n/p − 1)·χ

• Communication steps: ⌈log p⌉

• After each receive there is one operation

• Total time: (n/p − 1)·χ + ⌈log p⌉·(λ + χ)

Page 31: Parallel Algorithm Design

Example: the N-body problem

[Figure: body B1 of mass m at position (x,y) moving with velocity v; f1 and f2 are the forces exerted on B1 by bodies B2 and B3]

Page 32: Parallel Algorithm Design

The N-body problem

Page 33: Parallel Algorithm Design

The N-body problem partitioning

• Domain partitioning

• Assume one task per particle

• Task has particle’s position, velocity

vector and mass

• Iteration

• Get positions and mass of all other particles

• Compute new position and velocity
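A sequential sketch of one force-computation pass, using simple 2-D gravitational attraction with G = 1 (the constants and data are illustrative):

```python
# One naive N-body force pass: every task needs the position and mass
# of every other particle (hence the all-gather discussed next).
def gravity_forces(positions, masses, G=1.0):
    n = len(positions)
    forces = [[0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            dx = positions[j][0] - positions[i][0]
            dy = positions[j][1] - positions[i][1]
            dist2 = dx * dx + dy * dy
            dist = dist2 ** 0.5
            f = G * masses[i] * masses[j] / dist2   # force magnitude
            forces[i][0] += f * dx / dist           # components toward body j
            forces[i][1] += f * dy / dist
    return forces

f = gravity_forces([(0.0, 0.0), (1.0, 0.0)], [1.0, 1.0])
print(f)  # [[1.0, 0.0], [-1.0, 0.0]] -- equal and opposite
```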

Page 34: Parallel Algorithm Design

Gather and All-Gather operations

[Figure: a Gather operation (sequential, one task receives from the other p − 1) and an All-Gather operation]

Page 35: Parallel Algorithm Design

All-Gather

To avoid conflicts, the all-gather is performed in ⌈log p⌉ steps, doubling the data in each step

Communication time for n items = λ + n/β

With p tasks there are ⌈log p⌉ iterations and the number of items doubles at each iteration:

Σi=1..⌈log p⌉ (λ + 2^(i−1)·n/(p·β)) = λ·⌈log p⌉ + n·(p − 1)/(p·β)
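The step-by-step sum and the closed form can be checked numerically (parameter values arbitrary):

```python
# Check that summing the per-step all-gather costs gives the closed form:
# sum_{i=1..log p} (lam + 2**(i-1) * n/(p*beta))
#   = lam*log2(p) + n*(p-1)/(p*beta)
import math

def allgather_time(n, p, lam, beta):
    steps = int(math.log2(p))           # assumes p is a power of two
    return sum(lam + (2 ** (i - 1)) * n / (p * beta)
               for i in range(1, steps + 1))

n, p, lam, beta = 1024, 8, 1e-4, 1e6
closed = lam * math.log2(p) + n * (p - 1) / (p * beta)
print(math.isclose(allgather_time(n, p, lam, beta), closed))  # True
```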

Page 36: Parallel Algorithm Design

Analysis

• N-body problem parallel version

• n bodies and p tasks

• m iterations over time

Total time excluding I/O:

m·(λ·⌈log p⌉ + n·(p − 1)/(p·β) + χ·n/p)

Page 37: Parallel Algorithm Design

Considering I/O

Reading or writing n items of data through an I/O channel takes λio + n/βio

In the N-body problem, the initial values must be transmitted to the other tasks

Page 38: Parallel Algorithm Design

Scatter operation

Improving:

1. The first task transmits n/2 items to another task

2. The 2 tasks transmit n/4 items to 2 other tasks

3. The 4 tasks transmit n/8 items to 4 other tasks

4. And so on …

Scatter time:

Σi=1..⌈log p⌉ (λ + n/(2^i·β)) = λ·⌈log p⌉ + n·(p − 1)/(p·β)

Page 39: Parallel Algorithm Design

Analysis considering I/O

• Total time after m iterations

• Initial reading + scattering

• Computing m iterations

• Final gathering + writing

2·(λio + n/βio) + 2·(λ·⌈log p⌉ + n·(p − 1)/(p·β)) + m·(λ·⌈log p⌉ + n·(p − 1)/(p·β) + χ·n/p)
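The full model as one Python function, assuming p is an exact power of two so log2(p) is integral (all parameter values illustrative):

```python
# Total time model: initial read + scatter, m compute iterations
# (each with an all-gather), final gather + write.
import math

def nbody_total_time(m, n, p, chi, lam, beta, lam_io, beta_io):
    io = lam_io + n / beta_io                                   # read or write n items
    collective = lam * math.log2(p) + n * (p - 1) / (p * beta)  # scatter/gather/all-gather
    iteration = collective + chi * n / p                        # all-gather + local update
    return 2 * io + 2 * collective + m * iteration

t = nbody_total_time(m=10, n=1024, p=8, chi=1e-6,
                     lam=1e-4, beta=1e6, lam_io=1e-3, beta_io=1e5)
print(t > 0)  # True
```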