Parallel Algorithm Design - Decomposition Techniques & Load Balancing Techniques
June 03-06, 2002
C-DAC, Pune
PCOPP-2002
Copyright C-DAC, 2002
Parallel Algorithm Design - Decomposition Techniques and Load Balancing Techniques
PCOPP-2002
Day 4 Classroom Lecture
Parallel Algorithmic Design
Lecture Outline : Following topics will be discussed
� Data parallelism; Task parallelism; Combination of Data and Task parallelism
� Decomposition Techniques
� Static and Dynamic Load Balancing
� Mapping for load balancing
� Minimizing Interaction
� Overheads in parallel algorithm design
� Data Sharing Overheads
Layers of Abstraction in Parallel Computers
� Critical layers of abstraction lie between the application program and the actual hardware:
� Parallel applications (e.g., CAD, databases, scientific modeling, multiprogramming)
� Programming models: shared address space, message passing, data parallel
� Communication abstraction (user/system boundary): compilation or library, operating systems support
� Hardware/software boundary: communication hardware, physical communication medium
Parallel Algorithms and Design
Questions to be answered
� How to partition the data?
� Which data is going to be partitioned?
� How many types of concurrency are there?
� What are the key principles of designing parallel algorithms?
� What are the overheads in the algorithm design?
� How is the mapping for balancing the load done effectively?
Principles of Design of Parallel Algorithms
Two key steps
� Decompose the problem into tasks, i.e., identify the portions of the work that can be performed concurrently.
� Map the tasks onto processors so that the processors are efficiently utilized.
Remarks
� Different decompositions and mappings may yield good performance on different computers for a given problem.
� It is therefore crucial for programmers to understand the relationship between the underlying machine model and the parallel program in order to develop efficient programs.
(Contd…)
Parallel Algorithmic Design
Types of Parallelism
� Data parallelism
� Task parallelism
� Combination of Data and Task parallelism
� Stream parallelism
Data parallelism
� Identical operations applied concurrently to different data items is called data parallelism.
� It applies the SAME OPERATION in parallel to different elements of a data set.
� It uses a simpler model and reduces the programmer's work.
Examples
� Adding two n x n matrices.
� Structured grid computations in CFD.
� Genetic algorithms.
Types of Parallelism : Data Parallelism (Contd…)
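As a minimal sketch of data parallelism (not from the slides; the helper names and the use of Python threads are illustrative assumptions), the matrix-addition example can be expressed by applying the same elementwise operation concurrently to different row chunks of the data:

```python
# Data parallelism sketch: the SAME operation (elementwise add) is applied
# concurrently to different row chunks of two n x n matrices.
# add_rows and parallel_matrix_add are hypothetical names for illustration.
from concurrent.futures import ThreadPoolExecutor

def add_rows(a_rows, b_rows):
    # The identical operation, applied to one task's share of the data.
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a_rows, b_rows)]

def parallel_matrix_add(A, B, workers=4):
    n = len(A)
    chunk = max(1, n // workers)
    slices = [slice(i, i + chunk) for i in range(0, n, chunk)]
    with ThreadPoolExecutor(max_workers=workers) as ex:
        parts = ex.map(lambda s: add_rows(A[s], B[s]), slices)
    return [row for part in parts for row in part]

A = [[1, 2], [3, 4]]
B = [[10, 20], [30, 40]]
print(parallel_matrix_add(A, B))  # [[11, 22], [33, 44]]
```

Note that the degree of parallelism here grows with the matrix size, as the slides observe.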
Data parallelism
� For most application problems, the degree of data parallelism grows with the size of the problem.
� More processors can therefore be used to solve larger problems.
� f90 and HPF are data-parallel languages.
Programmer's responsibility
� Specifying the distribution of data structures.
Compiler's responsibility
� Generating the necessary code for communication and synchronization.
Types of Parallelism : Data Parallelism (Contd…)
Task parallelism
� The concurrent execution of many different tasks is called task parallelism.
� It can be visualized with a task graph: each node represents a task to be executed, and the edges represent the dependencies between tasks.
� A task in the task graph can be executed as soon as all of its preceding tasks have completed.
� The programmer defines the different types of processes; these processes communicate and synchronize with each other through MPI or other mechanisms.
Types of Parallelism : Task Parallelism (Contd…)
Figure : A task graph with tasks T0-T10 (T0 at the root; T1-T2, T3-T6 and T7-T10 on successive levels), together with sub-task graphs that further decompose T0 (into T00-T08) and T1 (into T10-T12).
Types of Parallelism : Task Parallelism
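The task-graph execution rule described above (a task may run once all of its predecessors have completed) can be sketched as follows; the example graph, the trivial task bodies, and the wave-at-a-time scheduling policy are illustrative assumptions, not the deck's own code:

```python
# Task parallelism sketch: tasks run as soon as every task they depend on
# has finished. The graph below is a small illustrative DAG.
from concurrent.futures import ThreadPoolExecutor

deps = {                      # task -> set of tasks it depends on
    "T0": set(),
    "T1": {"T0"}, "T2": {"T0"},
    "T3": {"T1"}, "T4": {"T1"}, "T5": {"T2"}, "T6": {"T2"},
}

def run_task_graph(deps, work):
    done, order = set(), []
    remaining = dict(deps)
    with ThreadPoolExecutor(max_workers=4) as ex:
        while remaining:
            # A task is "ready" once all of its predecessors are done.
            ready = [t for t, d in remaining.items() if d <= done]
            for t, r in zip(ready, ex.map(work, ready)):
                order.append(r)
                done.add(t)
                del remaining[t]
    return order

result = run_task_graph(deps, lambda t: t)  # completion order of the tasks
```

Each "wave" of ready tasks executes concurrently, so T1 and T2 can overlap, as can T3-T6.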
Programmer’s responsibility
Programmer must deal explicitly with process creation, communication and synchronization.
Task parallelism
� Example
Vehicle relational database to process the following query
(MODEL = “-------” AND YEAR = “-------”) AND (COLOR = “Green” OR COLOR = “Black”)
Types of Parallelism : Task Parallelism (Contd…)
The different tables generated by processing the query, and their dependencies.
Types of Parallelism : Task Parallelism (Contd…)
� Parallel Unstructured Adaptive Mesh / Mesh Repartitioning methods
� HPF / automatic compiler techniques may not yield good performance for unstructured mesh computations.
� Choosing the right algorithm with message passing is a good candidate for partitioning (decomposing) an unstructured mesh (graph) onto processors; task parallelism is the right way to obtain concurrency.
Finite Element Unstructured Mesh for Lake Superior Region
Types of Parallelism : Task Parallelism (Contd…)
Remarks : Data and Task Parallelism
� Unfortunately, the data parallel model's simplicity makes it less suitable for applications that do not fit the model.
� In particular, applications that use irregular data structures often do not match the model.
� These applications impose difficulties on both the language designer and the compiler writer.
� Because task and data parallelism each have strengths and weaknesses, integrating both in one model is attractive.
Types of Parallelism : Data and Task Parallelism
Integration of Task and Data Parallelism
� Two Approaches
� Add task parallel constructs to data parallel constructs.
� Add data parallel constructs to task parallel constructs.
� Approaches to Integration
� Language based approaches.
� Library based approaches.
Types of Parallelism : Data and Task Parallelism
(Contd…)
Example
� Multidisciplinary optimization application for aircraft design.
� Need for supporting task parallel constructs and communication between data parallel modules.
� An optimizer initiates and monitors the application's execution until the results satisfy some objective function (such as minimal aircraft weight).
Types of Parallelism : Data and Task Parallelism
(Contd…)
Task and Data Parallelism example for aircraft design
Types of Parallelism : Data and Task Parallelism
(Contd…)
Advantages
� Generality
� Ability to increase scalability by exploiting both forms of parallelism in an application.
� Ability to co-ordinate multidisciplinary applications.
Problems
� Differences in parallel program structure
� Address space organization
� Language implementation
Types of Parallelism : Data and Task Parallelism (Contd…)
Stream Parallelism
� Stream parallelism refers to the simultaneous execution of different programs on a data stream. It is also referred to as pipelining.
� The computation is parallelized by executing a different program at each processor and sending intermediate results to the next processor.
� The result is a pipeline of data flow between processors.
Types of Parallelism : Stream Parallelism
Pipelined execution of instructions : Higher execution rates and better hardware utilization can be achieved.
Types of Parallelism : Stream Parallelism (Contd…)
Stream Parallelism
Remark :
� If each stage is executed in a single clock cycle, the entire instruction execution can be accomplished in five clock cycles; in a sequential execution of the instructions, most of the processor hardware idles for a majority of the time.
� A compiler can take advantage of the fact that many operations can be divided into stages which proceed concurrently, in an attempt to utilize 100% of the CPU resources.
� In pipelining, this is exploited by assigning a different physical unit to each stage.
Types of Parallelism : Stream Parallelism (Contd…)
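The pipeline of data flowing between processors can be sketched with stages connected by queues; the two stage functions and the use of Python threads with sentinel values are illustrative assumptions, standing in for processors exchanging messages:

```python
# Stream parallelism sketch: each "processor" runs a different stage and
# forwards intermediate results downstream through a queue (a pipeline).
import queue
import threading

def stage(fn, q_in, q_out):
    while True:
        item = q_in.get()
        if item is None:          # sentinel: propagate end-of-stream and stop
            q_out.put(None)
            return
        q_out.put(fn(item))

q0, q1, q2 = queue.Queue(), queue.Queue(), queue.Queue()
fns = [lambda x: x + 1, lambda x: x * 2]          # two illustrative stages
threads = [threading.Thread(target=stage, args=(fns[0], q0, q1)),
           threading.Thread(target=stage, args=(fns[1], q1, q2))]
for t in threads:
    t.start()
for x in [1, 2, 3]:                               # the input data stream
    q0.put(x)
q0.put(None)
results = []
while (item := q2.get()) is not None:
    results.append(item)
for t in threads:
    t.join()
print(results)  # [4, 6, 8]
```

Once the pipeline is full, both stages work on different items simultaneously, which is the source of the concurrency.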
Observations
� Many problems exhibit a combination of data, task and stream parallelism.
� The amount of stream parallelism available in a problem is usually independent of the size of the problem.
� The amount of data and task parallelism in a problem usually increases with the size of the problem.
� Combinations of task and data parallelism often allow us to utilize the coarse granularity inherent in task parallelism with the fine granularity in data parallelism to effectively utilize a large number of processors.
Types of Parallelism : Stream Parallelism (Contd…)
Decomposition Techniques
The process of splitting the computations in a problem into a set of concurrent tasks is referred to as decomposition.
� Decomposing a problem effectively is of paramount importance in parallel computing.
� Without a good decomposition, we may not be able to achieve a high degree of concurrency.
� Decomposing a problem must ensure good load balance.
What is meant by good decomposition?
� It should lead to a high degree of concurrency.
� The interaction among tasks should be as little as possible. These objectives often conflict with each other.
� Experience in parallel algorithm design has helped in the formulation of certain heuristics for decomposition.
Decomposition Techniques (Contd…)
Decomposition techniques
� Recursive decomposition
� Data decomposition
� Exploratory decomposition
� Speculative decomposition
� Hybrid decomposition
Decomposition Techniques (Contd…)
Recursive decomposition
� A method for inducing concurrency in problems that can be solved using the "divide and conquer" paradigm.
� A problem is solved by first dividing it into a set of independent subproblems.
� Each of these subproblems is solved recursively, and the solution to the original problem is obtained by combining the results of the subproblems.
� Recursive decomposition can be applied to extract concurrency in such problems.
Recursive Decomposition
The quicksort recursion tree for sorting a sequence of 12 numbers. Each shaded box corresponds to the pivot element used to split the particular list.
Recursive Decomposition (Contd…)
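A minimal sketch of recursive decomposition via quicksort (not the deck's code; the depth-limited spawning policy and thread-based tasks are illustrative assumptions): the two sublists produced by the pivot are independent subproblems and can be sorted concurrently.

```python
# Recursive decomposition sketch: quicksort splits a list around a pivot
# into two independent subproblems, solved concurrently near the top of
# the recursion tree, then combined.
from concurrent.futures import ThreadPoolExecutor

def quicksort(xs, depth=2):
    if len(xs) <= 1:
        return xs
    pivot = xs[0]
    lo = [x for x in xs[1:] if x < pivot]
    hi = [x for x in xs[1:] if x >= pivot]
    if depth > 0:
        # The two subproblems are independent -> solve them in parallel.
        with ThreadPoolExecutor(max_workers=2) as ex:
            f_lo = ex.submit(quicksort, lo, depth - 1)
            f_hi = ex.submit(quicksort, hi, depth - 1)
            return f_lo.result() + [pivot] + f_hi.result()
    return quicksort(lo, 0) + [pivot] + quicksort(hi, 0)

print(quicksort([3, 1, 4, 1, 5, 9, 2, 6]))  # [1, 1, 2, 3, 4, 5, 6, 9]
```

Limiting the spawning depth is a common practical choice so that small sublists are sorted serially rather than paying task-creation overhead.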
Data decomposition
� Data decomposition is a powerful method for deriving concurrency in algorithms that operate on large data structures.
� Decomposition of the computations is done in two steps:
� First step : The data (or the domain) on which the computations are performed is partitioned.
� Second step : This data partitioning is used to induce a partitioning of the computations into tasks, which leads to data-level parallelism.
Data Decomposition
Data Decomposition (Contd…)
Figure : Decomposition of input, intermediate and output data. Partitioning may be applied to the input data, to the intermediate data produced between the stages of the algorithm (stages 1, 2 and 3), or to the output data.
� Any partitioning of the output suggests a decomposition of the overall computation performed by the algorithm.
� Inducing a decomposition from a partitioning of the output is a powerful method for finding concurrency.
� This decomposition leads to parallel algorithms that require little or no interaction among the various concurrent tasks.
Examples:
� Adding square matrices A and B to give matrix C = A + B (partition the computations by partitioning the output matrix into groups, each containing an equal number of matrix elements).
� Multiplying square matrices A and B to give matrix C = A * B.
� Rendering a scene using the ray-tracing algorithm.
Data Decomposition : Partitioning Output Data
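A sketch of output-data partitioning for the matrix-multiplication example (illustrative helper names; a row-wise partition of C is only one possible grouping of output elements): each task independently computes the entries of C that it owns, and the tasks need no interaction.

```python
# Output partitioning sketch for C = A * B: the OUTPUT matrix C is
# partitioned by rows; each task computes its own row of C from A and B
# with no interaction with other tasks.
from concurrent.futures import ThreadPoolExecutor

def compute_row(i, A, B):
    # Compute row i of C; depends only on row i of A and all of B.
    cols = len(B[0])
    return [sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(cols)]

def matmul_by_output_rows(A, B, workers=4):
    with ThreadPoolExecutor(max_workers=workers) as ex:
        return list(ex.map(lambda i: compute_row(i, A, B), range(len(A))))

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(matmul_by_output_rows(A, B))  # [[19, 22], [43, 50]]
```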
Example : Rendering a scene using the ray-tracing algorithm
The amount of computation performed for each ray is different.
(Contd…)
Data Decomposition : Partitioning Output Data
Example : Ray tracing algorithm
� The output of the ray-tracing algorithm is a two-dimensional array corresponding to the size of the image.
� Each entry of this array stores the value of the corresponding pixel, computed by a ray passing through the pixel.
� Since the computation of the color value for each pixel is independent of the computation for other pixels, we can decompose the computations performed by the algorithm by partitioning the array of pixels.
(Contd…)
Data Decomposition : Partitioning Output Data
� In many algorithms, the computation of some output elements depends upon the computation of other output elements.
� In such cases, the computation of different output elements cannot be performed concurrently.
� Sometimes these computations can be restructured as a multi-stage computation such that the output elements of any stage can be computed independently of the other output elements of that stage.
Examples:
� Parallel Gauss elimination method to solve the linear system AX = B.
� LU factorization to solve the linear system AX = B.
� Adaptive repartitioning of finite element meshes (dynamic load balancing methods).
Data Decomposition : Partitioning Intermediate Data
Example : Gauss elimination to solve the matrix system Ax = b
For k varying from 0 to n-1, the Gaussian elimination procedure systematically eliminates variable x[k] from equations k+1 to n-1, so that the matrix of coefficients becomes upper-triangular.
(Contd…)
Data Decomposition : Partitioning Intermediate Data
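The elimination loop just described can be sketched as follows (a serial sketch for clarity; in the parallel algorithm, the row updates within each step k are the independent tasks, and no pivoting is performed here, which assumes nonzero pivots):

```python
# Gaussian elimination sketch: for k = 0..n-2, eliminate x[k] from
# equations k+1..n-1, leaving an upper-triangular coefficient matrix.
def forward_eliminate(A, b):
    n = len(A)
    for k in range(n - 1):
        for i in range(k + 1, n):       # these row updates are independent
            f = A[i][k] / A[k][k]       # assumes a nonzero pivot (no pivoting)
            for j in range(k, n):
                A[i][j] -= f * A[k][j]
            b[i] -= f * b[k]
    return A, b

A = [[2.0, 1.0], [4.0, 3.0]]
b = [3.0, 7.0]
forward_eliminate(A, b)
print(A, b)  # [[2.0, 1.0], [0.0, 1.0]] [3.0, 1.0]
```

Because step k produces the intermediate matrix consumed by step k+1, the concurrency lies within each step, which is exactly why this example is classified under partitioning of intermediate data.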
Remark
� Partitioning the output data can be performed only if each output element can be naturally computed as a function of the input. In many algorithms, it is not possible or desirable to partition the output data.
� It is then possible to decompose the computation by partitioning the input array and assigning each input partition to a different task.
Example : Partitioning input data: partitioning a list around a pivot element
� For a sorting algorithm, the individual elements of the output cannot be efficiently determined in isolation.
� In such cases, it is sometimes possible to partition the input data and then use this partitioning to induce concurrency.
Data Decomposition : Partitioning Input Data
Example : Sorting algorithm
� Partitioning the input data around a pivot element to obtain concurrency.
Figure : A 16-element input list is split among four tasks. Each task partitions its sublist into elements smaller than the pivot and elements greater than the pivot and counts the two groups; prefix sums of the per-task counts of "smaller" and "greater" elements give each task its offsets in the output, so all tasks can write their elements into the output concurrently. Recursive decomposition is then applied to each side of the pivot.
(Contd…)
Data Decomposition : Partitioning Input Data
Partitioning the input data
Example : Data mining and database applications
� Consider the problem of computing the frequency of a set of itemsets in a transaction database (association rules).
� In this problem, we are given a set of n transactions T and a set of m itemsets I (I = I1, I2, ..., Im).
� Each transaction and itemset contains a small number of items, out of a possible set of items M.
� Our goal, for each itemset Ii, is to find how many transactions it appears in.
(Contd…)
Data Decomposition : Partitioning Input Data
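The itemset-frequency problem partitions naturally over the input transactions; in this sketch (illustrative data and task count), each task counts itemset occurrences within its partition of T, and the per-task count vectors are summed in a final reduction:

```python
# Input partitioning sketch for itemset frequencies: split the transaction
# set T among tasks, count itemset occurrences per partition, then sum.
from concurrent.futures import ThreadPoolExecutor

transactions = [{"a", "b", "c"}, {"a", "c"}, {"b", "d"}, {"a", "b"}]
itemsets = [{"a"}, {"a", "b"}, {"d"}]

def count_partition(part):
    # For each itemset, count the transactions in this partition containing it.
    return [sum(1 for t in part if i <= t) for i in itemsets]

def itemset_frequencies(T, ntasks=2):
    chunk = (len(T) + ntasks - 1) // ntasks
    parts = [T[i:i + chunk] for i in range(0, len(T), chunk)]
    with ThreadPoolExecutor(max_workers=ntasks) as ex:
        per_task = list(ex.map(count_partition, parts))
    # Reduction: elementwise sum of the per-task count vectors.
    return [sum(col) for col in zip(*per_task)]

print(itemset_frequencies(transactions))  # [3, 2, 1]
```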
Summary
� We must examine which of the possible partitionings will yield a good (easiest) computation partitioning.
� In using data decomposition, one must explore and evaluate all possible ways of partitioning the data.
The Owner-Computes Rule :
� All these examples use a widely used method of inducing a computational partitioning from a data partitioning, called the owner-computes rule.
� Depending on the nature of the data or the type of data used to perform the data partitioning, the owner-computes rule may mean different things.
Data Decomposition (Contd…)
Exploratory Decomposition
� Used to decompose problems whose underlying computations correspond to a search of a space for solutions.
� The search space is partitioned into smaller parts, and each of these parts is searched concurrently, until the desired solutions are found.
Example : A large company wants to award ten prizes to ten of its randomly selected highly valued customers (a highly valued customer is one that buys over $100,000 worth of products annually).
Exploratory Decomposition (Contd…)
Exploratory decomposition
� The task at hand is to select these ten customers. One way of solving this problem is to first randomly permute the customer records and then scan them for records that correspond to highly valued customers.
� So, given this problem, how can we decompose it so it will run in parallel?
Solution :
One way is to use data decomposition and split the database into ten equal parts. Then, in each of these ten smaller databases, we can apply a serial algorithm to find one valued customer.
Exploratory Decomposition (Contd…)
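The ten-part search described in the solution can be sketched as follows; the customer records, the revenue field, and the helper name are illustrative assumptions, not real data from the slides:

```python
# Exploratory decomposition sketch: split the customer database into ten
# equal parts and search each part concurrently for one highly valued
# customer (annual purchases over $100,000). Data is purely illustrative.
from concurrent.futures import ThreadPoolExecutor

def find_valued(part):
    # Serial search within one partition of the database.
    for name, annual in part:
        if annual > 100_000:
            return name
    return None

customers = [("c%d" % i, 150_000 if i % 10 == 0 else 50_000)
             for i in range(100)]
parts = [customers[i * 10:(i + 1) * 10] for i in range(10)]  # ten equal parts
with ThreadPoolExecutor(max_workers=10) as ex:
    winners = [w for w in ex.map(find_valued, parts) if w is not None]
print(winners)  # one highly valued customer found per part
```

Note the exploratory flavor: each task may stop as soon as it finds a solution in its portion of the space, rather than examining all of its data.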
Example : Consider the problem of searching a state space (such as finding a solution to a puzzle problem)
Steps :
� The 15-puzzle problem is typically solved using tree search techniques. Starting from the initial configuration, all possible successors are generated.
� One way of decomposing this computation is to split it into tasks, each one searching a different portion of the search space.
� The task of finding the shortest path from the initial to the final configuration now translates to finding a path from one of these newly generated nodes to the final configuration.
Exploratory Decomposition (Contd…)
Figure : A 15-puzzle problem instance. (a) initial configuration; (b)-(e) a sequence of moves leading from the initial to the final configuration; (f) final configuration.
Exploratory Decomposition (Contd…)
Hybrid Decomposition
� These decomposition techniques are not exclusive, and can often be combined to obtain a higher degree of concurrency than that obtained by applying any one decomposition alone.
� In fact, for many problems it is necessary to combine different decomposition techniques to obtain an acceptable degree of concurrency; at different stages, different types of decomposition have to be applied to derive concurrency in the parallel algorithm.
Example:
� Parallel quicksort algorithm : sample sort algorithm.
� In many state-space search problems, the computation associated with generating successors is significant, and concurrency can be increased by decomposing the task of generating the successors.
Hybrid Decomposition
Speculative Decomposition
� Speculative decomposition leads to effective parallel algorithms when the speculative tasks are later found to be needed for the solution of the problem.
� However, in situations in which a large fraction of these speculative tasks are found to be unnecessary, the parallel algorithm ends up performing a lot of non-essential computation.
Examples :
� Optimistic parallel discrete event simulation algorithms.
� Randomized algorithms in graph partitioning.
Speculative Decomposition
Objectives :
� The amount of computation assigned to each processor is balanced, so that some processors do not idle while others are executing tasks.
� The interaction among the different processors is minimized, so that the processors spend most of their time doing useful work.
� To balance the load among processors, it may be necessary to assign tasks that interact heavily to different processors.
Remark : These objectives conflict with each other, so the problem of finding a good mapping becomes harder.
Load Balancing Techniques
� Static load balancing
� Distribute the work among processors prior to the execution of the algorithm.
� Example: matrix-matrix computation.
� Easy to design and implement.
� Dynamic load balancing
� Distribute the work among processors during the execution of the algorithm.
� Algorithms that require dynamic load balancing are somewhat more complicated (e.g., parallel graph partitioning and adaptive finite element computations).
Load Balancing Techniques (Contd…)
� The concurrency and interdependency of the tasks in a parallel program are fully specified by its task graph. Two characteristics determine the suitability of static or dynamic load balancing:
� The first characteristic is whether the tasks that are required to be solved are available beforehand or are dynamically created during the solution of the problem.
� The second characteristic that influences the choice of load balancing scheme is knowledge of the task sizes, i.e., the amount of time required to solve them.
� There are problems in which the amount of computation per task is not known a priori and can be significantly different for different data.
Characteristics for Suitability of Load Balancing
Using the block-cyclic distribution shown in (b) to distribute the computations performed in array (a) will lead to load imbalances, e.g., for a sparse matrix in Gaussian elimination.
(Contd…)
Array distribution schemes : block distributions of a matrix, one-dimensional (strip) or two-dimensional (checkerboard); block-cyclic distributions; randomized block distributions.
Schemes for Static Load Balancing
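The block and block-cyclic schemes can be contrasted with simple index-to-processor mapping functions (a 1-D sketch with illustrative sizes; the 2-D checkerboard case applies the same idea per dimension):

```python
# Static distribution sketches: map array index i to an owning processor.
# Block: contiguous chunks. Block-cyclic: fixed-size blocks dealt
# round-robin, which spreads uneven work (e.g., the shrinking active
# region of Gaussian elimination) more evenly across processors.
def block_owner(i, n, p):
    # Contiguous chunk of ceil(n/p) indices per processor.
    return min(i // ((n + p - 1) // p), p - 1)

def block_cyclic_owner(i, block, p):
    # Blocks of `block` indices are dealt round-robin to the p processors.
    return (i // block) % p

n, p = 8, 2
print([block_owner(i, n, p) for i in range(n)])         # [0, 0, 0, 0, 1, 1, 1, 1]
print([block_cyclic_owner(i, 2, p) for i in range(n)])  # [0, 0, 1, 1, 0, 0, 1, 1]
```

With the block distribution, processor 0 owns exactly the rows that finish early in Gaussian elimination; the cyclic interleaving gives every processor a share of the late, expensive rows.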
Remark : Different decompositions require different amounts of data movement.
� Dense matrix-matrix multiplication
� Matrix-matrix addition
Data decomposition : When data decomposition is used to derive concurrency, a suitable decomposition of the data can itself be used to balance the load and minimize interactions.
Schemes for Static Load Balancing
Dynamic partitioning
� There are problems in which we cannot statically partition the work among the processors.
� In these problems, a static work partitioning is either impossible (e.g., the first class below) or can potentially lead to serious load imbalance (e.g., the second and third classes).
� The only way to develop efficient message passing programs for these classes of problems is to allow dynamic load balancing.
� Thus, during the execution of the program, work is dynamically transferred from the processors that have a lot of work to the processors that have little or no work.
Schemes for Dynamic Load Balancing (Contd…)
� Three general classes of problems fall into this category:
� The first class consists of problems in which all the tasks are available at the beginning of the computation, but the amount of time required by each task is different and cannot be determined a priori.
� The second class consists of problems in which the tasks are available at the beginning, but as the computation progresses, the amount of time required by each task changes.
� The third class consists of problems in which tasks are generated dynamically.
Schemes for Dynamic Load Balancing (Contd…)
� Self-scheduling methods
� Chunk scheduling methods
� Master-slave paradigm
� Randomization techniques
� Round-robin methods
Schemes for Dynamic Load Balancing (Contd…)
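The self-scheduling idea from the list above can be sketched with a shared work queue (an illustrative shared-memory analogue of the master-slave scheme; the squaring task stands in for work of unpredictable cost): workers that finish early simply pull more tasks, balancing the load automatically.

```python
# Self-scheduling sketch: workers repeatedly pull the next task from a
# shared queue, so faster workers automatically take on more tasks.
import queue
import threading

def worker(tasks, results):
    while True:
        try:
            t = tasks.get_nowait()
        except queue.Empty:
            return                  # no work left -> worker exits
        results.append(t * t)       # stand-in for a variable-cost task

tasks = queue.Queue()
for t in range(10):
    tasks.put(t)
results = []                        # list.append is atomic under the GIL
threads = [threading.Thread(target=worker, args=(tasks, results))
           for _ in range(4)]
for th in threads:
    th.start()
for th in threads:
    th.join()
print(sorted(results))  # the squares of 0..9, in some completion order
```

In the message-passing master-slave version, the queue lives on a master processor that hands out task indices on request; the structure is otherwise the same.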
Example : Adaptive load balancing of unstructured meshes
� Refining a mesh in parallel often requires dynamic load balancing algorithms.
� Use diffusive and non-diffusive algorithms.
� The adapted graph is partitioned from scratch using state-of-the-art multilevel graph partitioning.
� Incremental modification of re-partitioned graphs.
� Cut-and-paste re-partitioning algorithms.
Schemes for Dynamic Load Balancing (Contd…)
Random distribution of elements in a two dimensional region
Graph partitioning of a two dimensional region
Graph Partitioning Algorithms
�Elements (evenly distributed)
�Processor level nodes (determined by elements on each processor)
�Global nodes (aligned with processor-level nodes as much as possible)
Mesh Distribution for Finite Element Computations
Example for coloring a graph on three processors
Coloring a Sparse Graph
Example : Rendering a scene using the ray tracing algorithm
� The amount of computation performed for each ray is different.
� The output of the ray-tracing algorithm is a two-dimensional array corresponding to the size of the image.
Dynamic Load Balancing (Contd…)
Example : Ray tracing algorithm

• Each entry of this array stores the value of the corresponding pixel, computed by a ray passing through the pixel.
• Even though the tasks are known statically, the amount of time required by each task is not known in advance and can vary significantly.
• In general, the amount of computation associated with each ray depends on the type of the objects that it hits, their surface properties, and whether or not they are in direct view of the different light sources.
• Some rays hit only the background, requiring very little processing.
• It is not possible to get an accurate estimate of the work associated with each task; thus statically assigning a fixed number of rays to each processor may lead to significant load imbalances.

Dynamic Load Balancing (Contd…)
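Since per-ray cost is unknown in advance, rays are typically drawn from a shared pool rather than pre-assigned. A minimal serial simulation of such dynamic assignment (the cost model, names, and numbers are assumptions for illustration, not part of the lecture):

```python
from collections import deque

def dynamic_schedule(task_costs, num_workers):
    """Simulate a work pool: the next task always goes to the worker
    that becomes idle first, so costly rays do not pile up on one worker."""
    pool = deque(range(len(task_costs)))
    clock = [0] * num_workers   # time at which each worker becomes free
    load = [0] * num_workers    # total work done by each worker
    while pool:
        w = min(range(num_workers), key=clock.__getitem__)
        t = pool.popleft()
        clock[w] += task_costs[t]
        load[w] += task_costs[t]
    return load
```

With costs `[10, 1, 1, 1]` and two workers, the expensive task occupies one worker while the other drains the remaining three, giving loads `[10, 3]`; a fixed cyclic split of the same tasks would give `[11, 2]`.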
A 15-puzzle problem instance : (a) initial configuration; (b)-(e) a sequence of moves leading from the initial to the final configuration.

In this problem, it is not possible to get an accurate estimate of the work associated with each task, as the size of the tree rooted at any node can vary significantly.

(Figure: six 15-puzzle board configurations, panels (a)-(f))

Example : 15-puzzle problem

Dynamic Load Balancing
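The varying subtree sizes come from the search tree itself: each configuration has between two and four successors depending on where the blank sits, and the subtree sizes below them cannot be predicted. A small sketch of successor generation (assuming a 16-tuple encoding with 0 for the blank; names are my own, not from the slides):

```python
def successors(state):
    """Generate the 15-puzzle states reachable by sliding one tile
    into the blank (encoded as 0); state is a 16-tuple, row-major."""
    blank = state.index(0)
    r, c = divmod(blank, 4)
    result = []
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        nr, nc = r + dr, c + dc
        if 0 <= nr < 4 and 0 <= nc < 4:      # stay on the 4x4 board
            s = list(state)
            j = nr * 4 + nc
            s[blank], s[j] = s[j], s[blank]  # slide the tile into the blank
            result.append(tuple(s))
    return result
```

A corner blank yields 2 successors, an edge blank 3, and an interior blank 4; this irregular branching is what makes per-task work unpredictable.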
Overheads in algorithm design

• The time that is required by the parallel algorithm but is not essential to the serial algorithm is referred to as overhead due to parallelization.
• Minimizing these overheads is critical to developing efficient parallel algorithms.
• Parallel algorithms can potentially incur three types of overheads:
  • The first is due to idling
  • The second is due to data sharing and synchronization
  • The third is due to extraneous computations
Overheads in Algorithm Design
Idling overheads :

Processors can become idle for three main reasons.

• The first is due to load imbalances
• The second is due to computational dependencies
• The third is due to interprocessor communication, information exchange, and synchronization
Overheads in Algorithm Design (Contd…)
Data sharing overheads

• In most non-trivial parallel programs, the tasks executed by different processors require access to the same data.
• For example, in the matrix-vector multiplication y = Ax, the tasks corresponding to computing individual elements of y all require access to the entire vector x. (Here A may be a dense or a sparse matrix.)
• Some tasks require data that are generated dynamically by earlier tasks.
Overheads in Algorithm Design (Contd…)
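For the dense case, a sketch of why every task touches all of x under a row-block decomposition (illustrative only; the outer serial loop stands in for the conceptually parallel processors):

```python
def rowwise_matvec(A, x, num_procs):
    """Row-block decomposition of y = A x: processor p owns a block of
    rows, yet every processor needs access to the entire vector x."""
    n = len(A)
    block = (n + num_procs - 1) // num_procs
    y = [0] * n
    for p in range(num_procs):                     # conceptually parallel
        for i in range(p * block, min((p + 1) * block, n)):
            # each owned row i reads every element of the shared x
            y[i] = sum(A[i][j] * x[j] for j in range(len(x)))
    return y
```

The data-sharing overhead is the cost of making the whole of x available to every processor, whether by replication or by communication.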
Data sharing overheads

• Data sharing can take different forms depending on the underlying architecture.
• On distributed-memory machines, data is initially partitioned among the processors and shared via explicit messages.
• On shared-address-space machines, data sharing is often performed implicitly.

Overheads in Algorithm Design (Contd…)
Data sharing overheads

• Developing algorithms that minimize data sharing overheads often involves intelligent task decomposition,
  • Task distribution
  • Initial data distribution
  • Proper scheduling of the data sharing operations
  • Collective data sharing operations

Overheads in Algorithm Design (Contd…)
Point-to-point data sharing operations

• Collective Data Sharing Operations
  • Collective Data Transfers
    • Broadcast
    • Gather
    • Scatter
    • All-to-All
  • Collective Computations
    • Reduction
  • Collective Synchronization
    • Barrier

Overheads in Algorithm Design (Contd…)
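The semantics of these collectives can be pinned down with serial models; in a real program they would map onto library primitives such as MPI's `MPI_Scatter`, `MPI_Gather`, and `MPI_Reduce`. A hedged sketch (function names and chunking policy are my own):

```python
def scatter(data, num_procs):
    """Root splits its data into one contiguous chunk per process."""
    block = (len(data) + num_procs - 1) // num_procs
    return [data[p * block:(p + 1) * block] for p in range(num_procs)]

def gather(chunks):
    """Inverse of scatter: collect every process's chunk at the root."""
    return [item for chunk in chunks for item in chunk]

def reduce_sum(contributions):
    """Collective computation: combine one contribution per process."""
    total = 0
    for value in contributions:
        total += value
    return total
```

Using such highly optimized collective operations, instead of hand-rolled point-to-point exchanges, is one of the standard ways of containing data sharing overheads.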
Cost of Data Sharing Operations

• Structured Data Sharing
  • Dense Matrix-Matrix Multiplication
• Unstructured Data Sharing
  • Sparse Matrix-Vector Multiplication

Overheads in Algorithm Design (Contd…)
Multiplication of a sparse matrix A with a vector b

P1 needs to receive elements {8, 9, 10} from P2.
P2 needs to receive only elements {4, 6} from P1.

Sparse Matrix A and Vector Multiplication
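Such receive sets follow mechanically from the row partition and the nonzero structure: a processor must fetch exactly those vector elements that its rows reference but that it does not own. A sketch of that derivation with a made-up nonzero pattern (the indices below are hypothetical, not the slide's matrix):

```python
def elements_to_receive(my_rows_cols, my_cols):
    """Column indices of the vector that this processor's rows reference
    but that are owned by other processors, i.e. what it must receive.
    my_rows_cols: list of nonzero column indices for each owned row."""
    referenced = set()
    for cols in my_rows_cols:
        referenced.update(cols)
    return sorted(referenced - set(my_cols))

# Hypothetical: processor owns columns 0-2; its rows have nonzeros below
print(elements_to_receive([[0, 3], [1, 4], [2]], [0, 1, 2]))  # [3, 4]
```

Because the nonzero pattern is irregular, these sets differ from processor to processor, which is what makes the data sharing unstructured.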
Extraneous Computations

• Parallel algorithms can potentially perform two types of extraneous computations.
• The first type consists of non-essential computations that are performed by the parallel algorithm but are not required by the serial one.
• In parallel algorithms, non-essential computations often arise because of speculative execution.
• The second type of extraneous computation occurs when more than one processor ends up performing the same computation.

Example : Parallel search of a sparse graph.

Overheads in Algorithm Design (Contd…)
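The second kind of extraneous work can be made concrete: in an uncoordinated parallel graph search, any overlap between the node sets expanded by different processors is computation performed twice. A toy sketch (node names and representation are hypothetical):

```python
from collections import Counter

def redundant_expansions(per_proc_expanded):
    """Count graph nodes expanded by more than one processor - the
    second kind of extraneous computation in an uncoordinated search."""
    counts = Counter()
    for expanded in per_proc_expanded:
        counts.update(set(expanded))   # each processor counts a node once
    return sum(1 for c in counts.values() if c > 1)
```

Eliminating this overlap requires coordination (e.g. a shared or partitioned visited set), which itself costs communication; the design trade-off is between the two overheads.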
Methods for containing interaction overheads

• Data sharing overheads
• Static vs dynamic data sharing interactions
  Example : Sparse matrix-vector multiplication
• Regular vs irregular interactions

Remark : One must strive to minimize the interaction overheads just as one strives to minimize the overheads due to idling and extraneous computations.

• Two general schemes for dealing with interaction overheads:
  • Perform useful work while the interaction takes place
  • Try to reduce the interactions in parallel programs

Overheads in Algorithm Design (Contd…)
General Techniques for Choosing the Right Parallel Algorithm

• Maximize data locality
• Minimize the volume of shared data
• Minimize the frequency of interactions
• Overlap computations with interactions
• Data replication
• Minimize contention and hot spots
• Use highly optimized collective interaction operations
  • Collective data transfers and computations
• Maximize concurrency

Overheads in Algorithm Design (Contd…)
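A toy latency/bandwidth cost model (the startup and per-byte constants below are arbitrary illustrations, not measured values) shows why minimizing the frequency of interactions by aggregating messages pays off:

```python
def transfer_time(num_messages, bytes_per_message, startup=1000, per_byte=1):
    """Toy model: every message pays a fixed startup (latency) cost plus
    a per-byte (bandwidth) cost, so total time grows with message count."""
    return num_messages * (startup + per_byte * bytes_per_message)

# Same 800 bytes of data, two interaction patterns:
many_small = transfer_time(100, 8)   # 100 messages of 8 bytes
one_large = transfer_time(1, 800)    # one aggregated message
```

Here the aggregated transfer is cheaper by roughly the 99 avoided startup costs, which is why reducing interaction frequency usually matters more than reducing volume alone.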
Conclusions

A five-step process for the design of parallel programs consists of the following steps:

• Identification of the type of parallelism
• Decomposition to expose concurrency
• Mapping and interaction cost
• Extraneous computations
• Handling overheads

Overheads in Algorithm Design (Contd…)