
Starting Parallel Algorithm Design

David Monismith
Based on notes from Introduction to Parallel Programming, 2nd Edition by Grama, Gupta, Karypis, and Kumar

Outline

• Decomposition
• Tasks and Interaction
• Load Balancing
• Managing Overhead
• Parallel Models

Decomposition

• Decomposition - dividing a computation into parts that may be executed in parallel

• Tasks - programmer defined units of computation into which the main computation is subdivided

• Task-dependency graphs - abstraction used to express dependencies between tasks and their relative order of execution

Granularity

• Granularity - number/size of tasks that a computation can be divided into

• Fine grained - task divided into many small tasks
• Coarse grained - task divided into few large tasks
• Degree of concurrency - maximum number of tasks that can be executed in parallel in a program at any time

• The average degree of concurrency can be more useful than the maximum, as it provides a better indication of achievable performance

Example

• Matrix-Vector Multiplication
– Figure will be drawn upon the board
– Generally considered fine grained if parallelizing based upon each dot product (see the sketch below)
– Could be considered coarse-grained if using a dual core processor and each task computes half of the dot products
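• A minimal OpenMP sketch of the fine-grained version, assuming an n x n matrix A stored in row-major order and vectors x and y (the names and signature are illustrative, not from the slides); each iteration of the outer loop is one task that computes a single dot product:

/* Fine-grained decomposition: one task (loop iteration) per dot product. */
void matvec(int n, const double A[], const double x[], double y[])
{
  int i, j;
  #pragma omp parallel for private(j)
  for(i = 0; i < n; i++) {
    double dot = 0.0;
    for(j = 0; j < n; j++)
      dot += A[i*n + j] * x[j];   /* row i of A times x */
    y[i] = dot;
  }
}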

Task Graphs

• Critical path - longest directed path between a pair of start and finish nodes in the task graph

• Critical path length – sum of the weights of the nodes along a critical path (a small worked example appears at the end of this slide)

• Weight of a node is the size of the task or amount of work associated with the task

• In addition to these factors, interaction between tasks running on different processors may add to the runtime

• An example of a task dependency graph will be drawn in class to aid in the understanding of these concepts
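• As a hypothetical illustration (not from the slides): if tasks t1 (weight 10) and t2 (weight 6) both feed t3 (weight 8), which feeds t4 (weight 4), the critical path is t1 → t3 → t4 with length 10 + 8 + 4 = 22; the total work is 28, so the average degree of concurrency is 28/22 ≈ 1.27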

Processes and Threads vs. Processors

• Mapping - mechanism by which tasks are assigned to processes and/or threads for execution

• Threads and processes are logical units that perform tasks
• Processors physically perform the computations
• Important to realize this because we may have multiple stages of computation
• For example, internode communication vs. shared memory communication
• Drawing a task dependency or task interaction graph may help us to understand how tasks interact with one another and will aid in development of a parallel algorithm

Decomposition Techniques

• Embarrassingly Parallel
• Recursive decomposition
• Data Decomposition
• Exploratory decomposition
• Speculative decomposition

Embarrassingly Parallel Tasks

• Some tasks lend themselves to direct parallelization
• Such tasks are said to be embarrassingly parallel and can be directly mapped to processes or threads
• A subset of these types of tasks represents the map pattern
• Note that the map pattern represents a function that can be “replicated and applied to all elements in a collection” – source https://software.intel.com/en-us/blogs/2009/06/10/parallel-patterns-3-map

• Map operations occur in independent loop iterations

Embarrassingly Parallel (Map)

• Performing array (or matrix) addition is a straightforward example that is easily parallelized

• The serial example of this follows:

for(i = 0; i < N; i++) C[i] = A[i] + B[i];

• Three OpenMP parallel versions follow on the next slides

OpenMP First Try

• We could parallelize the loop on the last slide directly as follows:

#pragma omp parallel private(i) shared(A,B,C)
{
  /* Each thread computes its own contiguous block of the arrays.
     Note: this simple partitioning assumes N is evenly divisible
     by the number of threads. */
  int start = omp_get_thread_num() * (N / omp_get_num_threads());
  int end = start + (N / omp_get_num_threads());
  for(i = start; i < end; i++)
    C[i] = A[i] + B[i];
}

• Notice that i is declared private because it is not shared between threads – each thread gets its own copy of i
• Arrays A, B, and C are declared shared because they are shared between threads

OpenMP for Directive

• It is preferred to allow OpenMP to directly parallelize loops using the for directive as follows:

#pragma omp parallel private(i) shared(A,B,C)
{
  #pragma omp for
  for(i = 0; i < N; i++)
    C[i] = A[i] + B[i];
}

• Notice that the loop can be written in a serial fashion; its iterations will be automatically partitioned and assigned to threads

Shortened OpenMP for

• When using a single for loop, the parallel and for directives may be combined

#pragma omp parallel for private(i) shared(A,B,C)
for(i = 0; i < N; i++)
  C[i] = A[i] + B[i];

Recursive Decomposition

• Used to include concurrency in problems that can be solved with divide-and-conquer

• Such a problem is solved by dividing it into independent sub-problems

• A special type of this decomposition is the Reduction Pattern, wherein elements of a collection are combined with a binary associative operator (e.g., +, *, min, max) – source https://software.intel.com/en-us/blogs/2009/07/23/parallel-pattern-7-reduce

Example

• To find a minimum serially, given an array A of size N, use the following algorithm:

min = A[0];
for(i = 1; i < N; i++)
  if(A[i] < min)
    min = A[i];

Example

• This task can be decomposed for parallelism using a recursive solution:

int findMinRec(int A[], int i, int n)
{
  if(n == 1)
    return A[i];
  else {
    int lmin = findMinRec(A, i, n/2);
    int rmin = findMinRec(A, i + n/2, n - n/2);
    return (lmin < rmin) ? lmin : rmin;  /* min of the two halves */
  }
}
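• A minimal sketch of executing this recursive decomposition in parallel with OpenMP tasks (assumes OpenMP 3.0 or later; the cutoff value and the name findMinTask are illustrative, not from the slides):

/* Each half of the range becomes an OpenMP task; small ranges fall
   back to the serial recursion to avoid creating tiny tasks. */
int findMinTask(int A[], int i, int n)
{
  if(n < 1000)                    /* illustrative cutoff */
    return findMinRec(A, i, n);

  int lmin, rmin;
  #pragma omp task shared(lmin)
  lmin = findMinTask(A, i, n/2);
  #pragma omp task shared(rmin)
  rmin = findMinTask(A, i + n/2, n - n/2);
  #pragma omp taskwait
  return (lmin < rmin) ? lmin : rmin;
}

• A typical call site wraps the first call in a parallel region with a single construct: #pragma omp parallel followed by #pragma omp single around findMinTask(A, 0, N)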

OpenMP Implementation

for(i = 0; i < N; i++) A[i] = rand() % 100;

small = A[0];
/* min reduction requires OpenMP 3.1 or later */
#pragma omp parallel for reduction(min:small)
for(i = 0; i < N; i++) {
  if(A[i] < small)
    small = A[i];
}

OpenMP Sum Reduction

for(i = 0; i < N; i++) A[i] = i+1;

sum = 0;
#pragma omp parallel for reduction(+:sum)
for(i = 0; i < N; i++)
  sum += A[i];

printf("The sum is %d\n", sum);

Data Decomposition

• Commonly used on algorithms that operate on large data structures

• Involves two steps
– Data is partitioned
– Data partitioning is used to cause partitioning of computations into tasks
• Operations on different data partitions are typically similar or are chosen from a small set of operations

Partitioning

• Partitioning output data – outputs computed independently of others as a function of input
– Example – matrix multiplication can be partitioned into submatrices (a sketch follows at the end of this slide)
• Partitioning input data – task is created for each partition of the input data
– Example – finding a minimum or maximum

• Partitioning input and output – combination of the two cases above

• Partitioning intermediate data
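• A minimal sketch of output-data partitioning for matrix multiplication, assuming n x n matrices in row-major order (the names and signature are illustrative, not from the slides); each iteration of the outer parallel loop owns one row of the output C and computes it independently from A and B:

/* Output partitioning: each task (outer loop iteration) computes one
   row of C; all tasks share read-only access to A and B. */
void matmul(int n, const double A[], const double B[], double C[])
{
  int i, j, k;
  #pragma omp parallel for private(j, k)
  for(i = 0; i < n; i++) {
    for(j = 0; j < n; j++) {
      double sum = 0.0;
      for(k = 0; k < n; k++)
        sum += A[i*n + k] * B[k*n + j];
      C[i*n + j] = sum;
    }
  }
}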

Next Time

• More decompositions
– Exploratory Decomposition
– Speculative Decomposition

• Tasks and Interactions
• Load balancing
• Handling overhead
• Parallel Algorithm Models