Chapter 3
Parallel Algorithm Design
Outline
Task/channel model
Algorithm design methodology
Case studies
Task/Channel Model
Parallel computation = set of tasks
Task = program + local memory + collection of I/O ports
Tasks interact by sending messages through channels
Task/Channel Model
Figure: a directed graph of tasks (vertices) connected by channels (edges)
Foster’s Design Methodology
Partitioning
Communication
Agglomeration
Mapping
Foster’s Methodology
Figure: Problem → Partitioning → Communication → Agglomeration → Mapping
Partitioning
Dividing computation and data into pieces
Domain decomposition
  Divide data into pieces
    e.g., an array into sub-arrays (reduction); a loop into sub-loops (matrix multiplication); a search space into sub-spaces (chess)
  Determine how to associate computations with the data
Functional decomposition
  Divide computation into pieces
    e.g., pipelines (floating-point multiplication), workflows (payroll processing)
  Determine how to associate data with the computations
Example Domain Decompositions
Example Functional Decomposition
Partitioning Checklist
Large-grained tasks
  e.g., at least 10x more primitive tasks than processors in target computer
Balance load
  Primitive tasks roughly the same size
Scalable
  Number of tasks an increasing function of problem size
Communication
Determine values passed among tasks
Local communication
  Task needs values from a small number of other tasks
  Create channels illustrating data flow
Global communication
  Significant number of tasks contribute data to perform a computation
  Don’t create channels for them early in design
Communication Checklist
Balanced
  Communication operations balanced among tasks
Small degree
  Each task communicates with only a small group of neighbors
Concurrency
  Tasks can perform communications concurrently
  Tasks can perform computations concurrently
Agglomeration
Grouping tasks into larger tasks
Goals
  Improve performance
  Maintain scalability of program
  Simplify programming
In MPI programming, goal often to create one agglomerated task per processor
Agglomeration Can Improve Performance
Eliminate communication between primitive tasks agglomerated into a consolidated task
Combine groups of sending and receiving tasks
Agglomeration Checklist
Locality of parallel algorithm has increased
Tradeoff between agglomeration and code modification costs is reasonable
Agglomerated tasks have similar computational and communication costs
Number of tasks increases with problem size
Number of tasks suitable for likely target systems
Mapping
Process of assigning tasks to processors
Centralized multiprocessor: mapping done by operating system
Distributed memory system: mapping done by user
Conflicting goals of mapping
  Maximize processor utilization
  Minimize interprocessor communication
Mapping Example
Optimal Mapping
Finding optimal mapping is NP-hard
Must rely on heuristics
Mapping Decision Tree
Static number of tasks
  Structured communication
    Constant computation time per task
      • Agglomerate tasks to minimize communication
      • Create one task per processor
    Variable computation time per task
      • Cyclically map tasks to processors (see the sketch below)
  Unstructured communication
    • Use a static load balancing algorithm
Dynamic number of tasks
Mapping Strategy
Static number of tasks (see previous slide)
Dynamic number of tasks
  • Use a run-time task-scheduling algorithm
      e.g., a master/slave strategy
  • Use a dynamic load balancing algorithm
      e.g., share load among neighboring processors; remap periodically
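The sketch below (not from the slides; function names are placeholders) contrasts the two static mappings mentioned in the decision tree: a block mapping that keeps neighboring tasks on the same processor, and a cyclic mapping that deals tasks out round-robin to balance variable per-task work.

/* Hedged sketch: two common static mappings of n_tasks tasks onto p processors. */
#include <stdio.h>

/* Block mapping: contiguous groups of tasks per processor;
   keeps neighboring tasks together (good for structured communication). */
int block_owner(int task, int n_tasks, int p) {
    int chunk = (n_tasks + p - 1) / p;      /* ceiling(n_tasks / p) */
    return task / chunk;
}

/* Cyclic mapping: deal tasks out round-robin;
   balances load when computation time varies from task to task. */
int cyclic_owner(int task, int p) {
    return task % p;
}

int main(void) {
    int n_tasks = 10, p = 3;
    for (int t = 0; t < n_tasks; t++)
        printf("task %2d -> block proc %d, cyclic proc %d\n",
               t, block_owner(t, n_tasks, p), cyclic_owner(t, p));
    return 0;
}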
Mapping Checklist
Considered designs based on one task per processor and multiple tasks per processor
If multiple tasks per processor chosen, ratio of tasks to processors is at least 10:1
Evaluated static and dynamic task allocation
If dynamic task allocation chosen, task allocator is not a bottleneck to performance
Case Studies
Boundary value problem
Finding the maximum
The n-body problem
Adding data input
Boundary Value Problem
Figure: a rod in ice water, surrounded by insulation
Rod Cools as Time Progresses
Finite Difference Approximation
Partitioning
One data item per grid point
Associate one primitive task with each grid point
Two-dimensional domain decomposition
Communication
Identify communication pattern between primitive tasks
Each interior primitive task has three incoming and three outgoing channels
Agglomeration and Mapping
Agglomeration
Sequential execution time
χ – time to update one element
n – number of elements
m – number of iterations
Sequential execution time: m n χ
Parallel Execution Time
p – number of processors
λ – message latency
Parallel execution time: m (χ ⌈n/p⌉ + 2λ)
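A minimal C/MPI sketch of this agglomerated design (not from the slides): one contiguous block of the rod per process, with two ghost-value exchanges per iteration. The constants and names (N, M, u, u_new) are placeholders.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define N 1000      /* grid points (n), assumed divisible by p */
#define M 10000     /* time steps (m) */

int main(int argc, char *argv[])
{
    int id, p;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &id);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    int local_n = N / p;
    /* local_n interior points plus one ghost point at each end */
    double *u     = calloc(local_n + 2, sizeof(double));
    double *u_new = calloc(local_n + 2, sizeof(double));
    /* ... set initial temperatures in u[1..local_n] ... */

    int left  = (id > 0)     ? id - 1 : MPI_PROC_NULL;
    int right = (id < p - 1) ? id + 1 : MPI_PROC_NULL;

    for (int step = 0; step < M; step++) {
        /* exchange ghost values with the two neighbors: 2 messages per iteration */
        MPI_Sendrecv(&u[1], 1, MPI_DOUBLE, left, 0,
                     &u[local_n + 1], 1, MPI_DOUBLE, right, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Sendrecv(&u[local_n], 1, MPI_DOUBLE, right, 1,
                     &u[0], 1, MPI_DOUBLE, left, 1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        /* finite-difference update: about n/p element updates per iteration */
        for (int i = 1; i <= local_n; i++)
            u_new[i] = u[i] + 0.25 * (u[i - 1] - 2.0 * u[i] + u[i + 1]);

        double *tmp = u; u = u_new; u_new = tmp;
    }

    if (id == 0) printf("u[1] after %d steps: %f\n", M, u[1]);
    free(u); free(u_new);
    MPI_Finalize();
    return 0;
}

Each iteration costs roughly χ·⌈n/p⌉ computation plus two message latencies, which is where the m(χ⌈n/p⌉ + 2λ) estimate above comes from.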
Finding the Maximum Error
Computed:  0.15   0.16   0.16   0.19
Correct:   0.15   0.16   0.17   0.18
Error (%): 0.00%  0.00%  6.25%  5.26%
Maximum error: 6.25%
Reduction
Given associative operator ⊕
Compute a₀ ⊕ a₁ ⊕ a₂ ⊕ … ⊕ aₙ₋₁
Examples
  Add
  Multiply
  And, Or
  Maximum, Minimum
Parallel Reduction Evolution
Binomial Trees
Subgraph of hypercube
Finding Global Sum (step 0: one value per task)
4 2 0 7
-3 5 -6 -3
8 1 2 3
-4 4 6 -1
Finding Global Sum (step 1: pairwise partial sums)
1 7 -6 4
4 5 8 2
Finding Global Sum (step 2)
8 -2
9 10
Finding Global Sum (step 3)
17 8
Finding Global Sum (step 4: final result)
25
Binomial Tree
Agglomeration
Figure: primitive tasks of the reduction tree agglomerated; each consolidated task computes a local sum, then the partial sums are combined
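A hedged C/MPI sketch of the agglomerated global sum shown above: each process holds one (locally computed) value, and partial sums travel up a binomial tree via point-to-point messages. The starting value is a placeholder; in practice a single MPI_Reduce call performs this pattern.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int id, p;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &id);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    double local = (double)(id + 1);   /* placeholder for this task's local sum */
    double incoming;

    /* about log2(p) steps: in the step with distance 'mask', the higher-ranked
       task of each pair sends its partial sum to the lower-ranked task and
       drops out; task 0 ends up with the global sum */
    for (int mask = 1; mask < p; mask <<= 1) {
        if (id & mask) {
            MPI_Send(&local, 1, MPI_DOUBLE, id - mask, 0, MPI_COMM_WORLD);
            break;
        } else if (id + mask < p) {
            MPI_Recv(&incoming, 1, MPI_DOUBLE, id + mask, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            local += incoming;
        }
    }

    if (id == 0) printf("global sum = %f\n", local);
    MPI_Finalize();
    return 0;
}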
The n-body Problem
Partitioning
Domain partitioning
Assume one task per particle
Task has particle’s position, velocity vector
Iteration
  Get positions of all other particles
  Compute new position, velocity
Gather
All-gather
Complete Graph for All-gather
Hypercube for All-gather
Communication Time
Hypercube:
  Σ_{i=1..log p} ( λ + 2^(i−1) · n / (β p) )  =  λ log p + n(p−1) / (β p)
Complete graph:
  (p−1) ( λ + n / (β p) )  =  (p−1) λ + n(p−1) / (β p)
(λ – message latency; β – channel bandwidth, in data items per unit time; n – total data items; p – tasks/processors)
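A minimal sketch (assuming one block of particle positions per process; counts and array names are placeholders) of how the all-gather step might be written with MPI before each force computation.

#include <mpi.h>
#include <stdlib.h>

#define PARTICLES 1024                /* total particles, assumed divisible by p */

int main(int argc, char *argv[])
{
    int id, p;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &id);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    int local = PARTICLES / p;                     /* particles owned by this process */
    double *my_pos  = malloc(3 * local * sizeof(double));
    double *all_pos = malloc(3 * PARTICLES * sizeof(double));
    for (int i = 0; i < 3 * local; i++)
        my_pos[i] = id + 0.001 * i;                /* placeholder position data */

    /* every process receives every particle's (x, y, z) position */
    MPI_Allgather(my_pos, 3 * local, MPI_DOUBLE,
                  all_pos, 3 * local, MPI_DOUBLE, MPI_COMM_WORLD);

    /* ... compute forces and new positions/velocities from all_pos ... */

    free(my_pos); free(all_pos);
    MPI_Finalize();
    return 0;
}

On a hypercube-style combination this exchange takes log p message steps, which is what the hypercube formula above counts.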
Adding Data Input
Scatter
Scatter in log p Steps
Step 0: 12345678 (all items on one task)
Step 1: 1234 | 5678
Step 2: 12 | 34 | 56 | 78
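A hedged C/MPI sketch of adding data input: process 0 holds the full input and MPI_Scatter deals equal blocks to every process (the slides show how this can be organized in log p steps). N and the array names are placeholders.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define N 8                            /* total items, assumed divisible by p */

int main(int argc, char *argv[])
{
    int id, p;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &id);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    int local = N / p;
    int *full = NULL;
    int *mine = malloc(local * sizeof(int));

    if (id == 0) {                     /* only the root reads/creates the input */
        full = malloc(N * sizeof(int));
        for (int i = 0; i < N; i++) full[i] = i + 1;   /* items 1..8 as pictured */
    }

    MPI_Scatter(full, local, MPI_INT, mine, local, MPI_INT, 0, MPI_COMM_WORLD);
    printf("process %d received items %d..%d\n", id, mine[0], mine[local - 1]);

    if (id == 0) free(full);
    free(mine);
    MPI_Finalize();
    return 0;
}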
Summary: Task/channel Model
Parallel computation
  Set of tasks
  Interactions through channels
Good designs
  Maximize local computations
  Minimize communications
  Scale up
Summary: Design Steps
Partition computation
Agglomerate tasks
Map tasks to processors
Goals
  Maximize processor utilization
  Minimize inter-processor communication
Summary: Fundamental Algorithms
Reduction
Gather and scatter
All-gather