Chapter 3
Parallel Algorithm Design
Outline
Task/channel model
Algorithm design methodology
Case studies
Task/Channel Model
Parallel computation = set of tasks
Task = program + local memory + collection of I/O ports
Tasks interact by sending messages through channels
Task/Channel Model
Figure: a directed graph of tasks (vertices) connected by channels (edges)
Foster’s Design Methodology
Partitioning
Communication
Agglomeration
Mapping
Foster’s Methodology
Figure: Problem → Partitioning → Communication → Agglomeration → Mapping
Partitioning
Dividing computation and data into pieces
Domain decomposition
  Divide data into pieces
    e.g., an array into sub-arrays (reduction); a loop into sub-loops (matrix multiplication); a search space into sub-spaces (chess)
  Determine how to associate computations with the data
Functional decomposition
  Divide computation into pieces
    e.g., pipelines (floating-point multiplication), workflows (payroll processing)
  Determine how to associate data with the computations
Example Domain Decompositions
Example Functional Decomposition
Partitioning Checklist
Large-grained tasks
  e.g., at least 10x more primitive tasks than processors in target computer
Balance load
  Primitive tasks roughly the same size
Scalable
  Number of tasks an increasing function of problem size
Communication
Determine values passed among tasks
Local communication
  Task needs values from a small number of other tasks
  Create channels illustrating data flow
Global communication
  Significant number of tasks contribute data to perform a computation
  Don’t create channels for them early in design
Communication Checklist
Balanced
  Communication operations balanced among tasks
Small degree
  Each task communicates with only a small group of neighbors
Concurrency
  Tasks can perform communications concurrently
  Tasks can perform computations concurrently
Agglomeration
Grouping tasks into larger tasks
Goals
  Improve performance
  Maintain scalability of program
  Simplify programming
In MPI programming, goal often to create one agglomerated task per processor
Agglomeration Can Improve Performance
Eliminate communication between primitive tasks agglomerated into a consolidated task
Combine groups of sending and receiving tasks
Agglomeration Checklist
Locality of parallel algorithm has increased
Tradeoff between agglomeration and code modification costs is reasonable
Agglomerated tasks have similar computational and communication costs
Number of tasks increases with problem size
Number of tasks suitable for likely target systems
Mapping
Process of assigning tasks to processors
Centralized multiprocessor: mapping done by operating system
Distributed memory system: mapping done by user
Conflicting goals of mapping
  Maximize processor utilization
  Minimize interprocessor communication
Mapping Example
Optimal Mapping
Finding optimal mapping is NP-hard
Must rely on heuristics
Mapping Decision Tree
Static number of tasks
  Structured communication
    Constant computation time per task
      • Agglomerate tasks to minimize communication
      • Create one task per processor
    Variable computation time per task
      • Cyclically map tasks to processors (see the sketch below)
  Unstructured communication
    • Use a static load balancing algorithm
Dynamic number of tasks
Mapping Strategy
Static number of tasks (see previous slide)
Dynamic number of tasks
  • Use a run-time task-scheduling algorithm
      e.g., a master/slave strategy
  • Use a dynamic load balancing algorithm
      e.g., share load among neighboring processors; remap periodically
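The sketch below (not from the slides; function names are placeholders) contrasts the two static mappings mentioned in the decision tree: a block mapping that keeps neighboring tasks on the same processor, and a cyclic mapping that deals tasks out round-robin to balance variable per-task work.

/* Hedged sketch: two common static mappings of n_tasks tasks onto p processors. */
#include <stdio.h>

/* Block mapping: contiguous groups of tasks per processor;
   keeps neighboring tasks together (good for structured communication). */
int block_owner(int task, int n_tasks, int p) {
    int chunk = (n_tasks + p - 1) / p;      /* ceiling(n_tasks / p) */
    return task / chunk;
}

/* Cyclic mapping: deal tasks out round-robin;
   balances load when computation time varies from task to task. */
int cyclic_owner(int task, int p) {
    return task % p;
}

int main(void) {
    int n_tasks = 10, p = 3;
    for (int t = 0; t < n_tasks; t++)
        printf("task %2d -> block proc %d, cyclic proc %d\n",
               t, block_owner(t, n_tasks, p), cyclic_owner(t, p));
    return 0;
}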
Mapping Checklist
Considered designs based on one task per processor and multiple tasks per processor
If multiple tasks per processor chosen, ratio of tasks to processors is at least 10:1
Evaluated static and dynamic task allocation
If dynamic task allocation chosen, task allocator is not a bottleneck to performance
Case Studies
Boundary value problem
Finding the maximum
The n-body problem
Adding data input
Boundary Value Problem
Figure: a rod in ice water, surrounded by insulation
Rod Cools as Time Progresses
Finite Difference Approximation
Partitioning
One data item per grid point
Associate one primitive task with each grid point
Two-dimensional domain decomposition
Communication
Identify communication pattern between primitive tasks
Each interior primitive task has three incoming and three outgoing channels
Agglomeration and Mapping
Agglomeration
Sequential execution time
χ – time to update one element
n – number of elements
m – number of iterations
Sequential execution time: m n χ
Parallel Execution Time
p – number of processors
λ – message latency
Parallel execution time: m (χ ⌈n/p⌉ + 2λ)
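A minimal C/MPI sketch of this agglomerated design (not from the slides): one contiguous block of the rod per process, with two ghost-value exchanges per iteration. The constants and names (N, M, u, u_new) are placeholders.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define N 1000      /* grid points (n), assumed divisible by p */
#define M 10000     /* time steps (m) */

int main(int argc, char *argv[])
{
    int id, p;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &id);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    int local_n = N / p;
    /* local_n interior points plus one ghost point at each end */
    double *u     = calloc(local_n + 2, sizeof(double));
    double *u_new = calloc(local_n + 2, sizeof(double));
    /* ... set initial temperatures in u[1..local_n] ... */

    int left  = (id > 0)     ? id - 1 : MPI_PROC_NULL;
    int right = (id < p - 1) ? id + 1 : MPI_PROC_NULL;

    for (int step = 0; step < M; step++) {
        /* exchange ghost values with the two neighbors: 2 messages per iteration */
        MPI_Sendrecv(&u[1], 1, MPI_DOUBLE, left, 0,
                     &u[local_n + 1], 1, MPI_DOUBLE, right, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Sendrecv(&u[local_n], 1, MPI_DOUBLE, right, 1,
                     &u[0], 1, MPI_DOUBLE, left, 1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        /* finite-difference update: about n/p element updates per iteration */
        for (int i = 1; i <= local_n; i++)
            u_new[i] = u[i] + 0.25 * (u[i - 1] - 2.0 * u[i] + u[i + 1]);

        double *tmp = u; u = u_new; u_new = tmp;
    }

    if (id == 0) printf("u[1] after %d steps: %f\n", M, u[1]);
    free(u); free(u_new);
    MPI_Finalize();
    return 0;
}

Each iteration costs roughly χ·⌈n/p⌉ computation plus two message latencies, which is where the m(χ⌈n/p⌉ + 2λ) estimate above comes from.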
Finding the Maximum Error
Computed:  0.15   0.16   0.16   0.19
Correct:   0.15   0.16   0.17   0.18
Error (%): 0.00%  0.00%  6.25%  5.26%
Maximum error: 6.25%
Reduction
Given associative operator ⊕
Compute a₀ ⊕ a₁ ⊕ a₂ ⊕ … ⊕ aₙ₋₁
Examples
  Add
  Multiply
  And, Or
  Maximum, Minimum
Parallel Reduction Evolution
Binomial Trees
Subgraph of hypercube
Finding Global Sum (step 0: one value per task)
4 2 0 7
-3 5 -6 -3
8 1 2 3
-4 4 6 -1
Finding Global Sum (step 1: pairwise partial sums)
1 7 -6 4
4 5 8 2
Finding Global Sum (step 2)
8 -2
9 10
Finding Global Sum (step 3)
17 8
Finding Global Sum (step 4: final result)
25
Binomial Tree
Agglomeration
Figure: primitive tasks of the reduction tree agglomerated; each consolidated task computes a local sum, then the partial sums are combined
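A hedged C/MPI sketch of the agglomerated global sum shown above: each process holds one (locally computed) value, and partial sums travel up a binomial tree via point-to-point messages. The starting value is a placeholder; in practice a single MPI_Reduce call performs this pattern.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int id, p;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &id);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    double local = (double)(id + 1);   /* placeholder for this task's local sum */
    double incoming;

    /* about log2(p) steps: in the step with distance 'mask', the higher-ranked
       task of each pair sends its partial sum to the lower-ranked task and
       drops out; task 0 ends up with the global sum */
    for (int mask = 1; mask < p; mask <<= 1) {
        if (id & mask) {
            MPI_Send(&local, 1, MPI_DOUBLE, id - mask, 0, MPI_COMM_WORLD);
            break;
        } else if (id + mask < p) {
            MPI_Recv(&incoming, 1, MPI_DOUBLE, id + mask, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            local += incoming;
        }
    }

    if (id == 0) printf("global sum = %f\n", local);
    MPI_Finalize();
    return 0;
}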
The n-body Problem
Partitioning
Domain partitioning
Assume one task per particle
Task has particle’s position, velocity vector
Iteration
  Get positions of all other particles
  Compute new position, velocity
Gather
All-gather
Complete Graph for All-gather
Hypercube for All-gather
Communication Time
Hypercube:
  Σ_{i=1..log p} ( λ + 2^(i−1) · n / (β p) )  =  λ log p + n(p−1) / (β p)
Complete graph:
  (p−1) ( λ + n / (β p) )  =  (p−1) λ + n(p−1) / (β p)
(λ – message latency; β – channel bandwidth, in data items per unit time; n – total data items; p – tasks/processors)
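A minimal sketch (assuming one block of particle positions per process; counts and array names are placeholders) of how the all-gather step might be written with MPI before each force computation.

#include <mpi.h>
#include <stdlib.h>

#define PARTICLES 1024                /* total particles, assumed divisible by p */

int main(int argc, char *argv[])
{
    int id, p;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &id);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    int local = PARTICLES / p;                     /* particles owned by this process */
    double *my_pos  = malloc(3 * local * sizeof(double));
    double *all_pos = malloc(3 * PARTICLES * sizeof(double));
    for (int i = 0; i < 3 * local; i++)
        my_pos[i] = id + 0.001 * i;                /* placeholder position data */

    /* every process receives every particle's (x, y, z) position */
    MPI_Allgather(my_pos, 3 * local, MPI_DOUBLE,
                  all_pos, 3 * local, MPI_DOUBLE, MPI_COMM_WORLD);

    /* ... compute forces and new positions/velocities from all_pos ... */

    free(my_pos); free(all_pos);
    MPI_Finalize();
    return 0;
}

On a hypercube-style combination this exchange takes log p message steps, which is what the hypercube formula above counts.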
Adding Data Input
Scatter
Scatter in log p Steps
Step 0: 12345678 (all items on one task)
Step 1: 1234 | 5678
Step 2: 12 | 34 | 56 | 78
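A hedged C/MPI sketch of adding data input: process 0 holds the full input and MPI_Scatter deals equal blocks to every process (the slides show how this can be organized in log p steps). N and the array names are placeholders.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define N 8                            /* total items, assumed divisible by p */

int main(int argc, char *argv[])
{
    int id, p;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &id);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    int local = N / p;
    int *full = NULL;
    int *mine = malloc(local * sizeof(int));

    if (id == 0) {                     /* only the root reads/creates the input */
        full = malloc(N * sizeof(int));
        for (int i = 0; i < N; i++) full[i] = i + 1;   /* items 1..8 as pictured */
    }

    MPI_Scatter(full, local, MPI_INT, mine, local, MPI_INT, 0, MPI_COMM_WORLD);
    printf("process %d received items %d..%d\n", id, mine[0], mine[local - 1]);

    if (id == 0) free(full);
    free(mine);
    MPI_Finalize();
    return 0;
}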
Summary: Task/channel Model
Parallel computation
  Set of tasks
  Interactions through channels
Good designs
  Maximize local computations
  Minimize communications
  Scale up
Summary: Design Steps
Partition computation
Agglomerate tasks
Map tasks to processors
Goals
  Maximize processor utilization
  Minimize inter-processor communication
Summary: Fundamental Algorithms
Reduction
Gather and scatter
All-gather