Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display.
Parallel Programming in C with MPI and OpenMP
Michael J. Quinn
Chapter 17
Shared-memory Programming
Outline
OpenMP
Shared-memory model
Parallel for loops
Declaring private variables
Critical sections
Reductions
Performance improvements
More general data parallelism
Functional parallelism
OpenMP
OpenMP: An application programming interface (API) for parallel programming on multiprocessors
Compiler directives
Library of support functions
OpenMP works in conjunction with Fortran, C, or C++
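As a minimal illustration of these two pieces working together (a sketch, not from the original slides), the following program uses a compiler directive plus a support-library call; it assumes an OpenMP-capable compiler, e.g. gcc -fopenmp:

#include <stdio.h>
#include <omp.h>                /* support-library functions */

int main (void)
{
#pragma omp parallel            /* compiler directive: fork a team of threads */
   {
      /* library call: each thread reports its own ID */
      printf ("Hello from thread %d\n", omp_get_thread_num());
   }
   return 0;
}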
What’s OpenMP Good For?
C + OpenMP sufficient to program multiprocessors
C + MPI + OpenMP a good way to program multicomputers built out of multiprocessors
IBM RS/6000 SP
Fujitsu AP3000
Dell High Performance Computing Cluster
Shared-memory Model
[Figure: multiple processors connected to a single shared memory]
Processors interact and synchronize with each other through shared variables.
Fork/Join Parallelism
Initially only master thread is active
Master thread executes sequential code
Fork: Master thread creates or awakens additional threads to execute parallel code
Join: At end of parallel code created threads die or are suspended
Fork/Join Parallelism
[Figure: time line showing the master thread running alone, forking other threads, joining them, then forking and joining again]
Shared-memory Model vs. Message-passing Model (#1)
Shared-memory model
Number of active threads is 1 at start and finish of program; changes dynamically during execution
Message-passing model
All processes active throughout execution of program
Incremental Parallelization
Sequential program a special case of a shared-memory parallel program
Parallel shared-memory programs may have only a single parallel loop
Incremental parallelization: process of converting a sequential program to a parallel program a little bit at a time
Shared-memory Model vs. Message-passing Model (#2)
Shared-memory model
Execute and profile sequential program
Incrementally make it parallel
Stop when further effort not warranted
Message-passing model
Sequential-to-parallel transformation requires major effort
Transformation done in one giant step rather than many tiny steps
Parallel for Loops
C programs often express data-parallel operations as for loops
for (i = first; i < size; i += prime)
   marked[i] = 1;
OpenMP makes it easy to indicate when the iterations of a loop may execute in parallel
Compiler takes care of generating code that forks/joins threads and allocates the iterations to threads
Pragmas
Pragma: a compiler directive in C or C++
Stands for “pragmatic information”
A way for the programmer to communicate with the compiler
Compiler free to ignore pragmas
Syntax:
#pragma omp <rest of pragma>
Parallel for Pragma
Format:
#pragma omp parallel for
for (i = 0; i < N; i++)
   a[i] = b[i] + c[i];
Compiler must be able to verify the run-time system will have information it needs to schedule loop iterations
Canonical Shape of for Loop Control Clause
The loop must have this canonical form:

for (index = start; index relop end; incr)

where relop is one of <, <=, >=, or >, and incr takes one of these forms:
index++ or ++index
index-- or --index
index += inc or index -= inc
index = index + inc, index = inc + index, or index = index - inc
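For instance (illustrative examples, not from the original slides), both of the following loops fit the canonical shape, while a while loop, or a for loop whose trip count cannot be determined at loop entry (e.g., one containing break), does not:

for (i = 0; i < n; i++)            /* index++, tested with <  */
   a[i] = 0;
for (i = n - 1; i >= 0; i -= 2)    /* index -= inc, tested with >= */
   b[i] = 1;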
Execution Context
Every thread has its own execution context
Execution context: address space containing all of the variables a thread may access
Contents of execution context:
static variables
dynamically allocated data structures in the heap
variables on the run-time stack
additional run-time stack for functions invoked by the thread
Shared and Private Variables
Shared variable: has same address in execution context of every thread
Private variable: has different address in execution context of every thread
A thread cannot access the private variables of another thread
Shared and Private Variables
int main (int argc, char *argv[])
{
   int b[3];
   char *cptr;
   int i;

   cptr = malloc (1);
#pragma omp parallel for
   for (i = 0; i < 3; i++)
      b[i] = i;

[Figure: cptr points into the heap; b and the master thread's i live on the shared stack, while Thread 1 gets its own private copy of loop index i]
Function omp_get_num_procs
Returns number of physical processors available for use by the parallel program
int omp_get_num_procs (void)
Function omp_set_num_threads
Uses the parameter value to set the number of threads to be active in parallel sections of code
May be called at multiple points in a program
void omp_set_num_threads (int t)
Pop Quiz:
Write a C program segment that sets the number of threads equal to the number of processors that are available.
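One possible answer (a sketch combining the two functions just introduced):

omp_set_num_threads (omp_get_num_procs ());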
Declaring Private Variables
for (i = 0; i < BLOCK_SIZE(id,p,n); i++)
   for (j = 0; j < n; j++)
      a[i][j] = MIN(a[i][j],a[i][k]+tmp);
Either loop could be executed in parallel
We prefer to make the outer loop parallel, to reduce the number of forks/joins
We then must give each thread its own private copy of variable j
private Clause
Clause: an optional, additional component to a pragma
private clause: directs compiler to make one or more variables private
private ( <variable list> )
Example Use of private Clause
#pragma omp parallel for private(j)
for (i = 0; i < BLOCK_SIZE(id,p,n); i++)
   for (j = 0; j < n; j++)
      a[i][j] = MIN(a[i][j],a[i][k]+tmp);
firstprivate Clause
Used to create private variables having initial values identical to the variable controlled by the master thread as the loop is entered
Variables are initialized once per thread, not once per loop iteration
If a thread modifies a variable's value in an iteration, subsequent iterations will get the modified value
Example Use of firstprivate Clause
x[0] = complex_function();
for (i=0; i<n; i++) {
   for (j=1; j<4; j++)
      x[j] = g(i, x[j-1]);
   answer[i] = x[1] - x[3];
}

x[0] = complex_function();
#pragma omp parallel for private(j) firstprivate(x)
for (i=0; i<n; i++) {
   for (j=1; j<4; j++)
      x[j] = g(i, x[j-1]);
   answer[i] = x[1] - x[3];
}
lastprivate Clause
Sequentially last iteration: iteration that occurs last when the loop is executed sequentially
lastprivate clause: used to copy back to the master thread's copy of a variable the private copy of the variable from the thread that executed the sequentially last iteration
Example Use of lastprivate Clause
for (i=0; i<n; i++) {
   x[0] = 1.0;
   for (j=1; j<4; j++)
      x[j] = x[j-1] * (i+1);
   sum_of_powers[i] = x[0] + x[1] + x[2] + x[3];
}
n_cubed = x[3];

#pragma omp parallel for private(j) lastprivate(x)
for (i=0; i<n; i++) {
   x[0] = 1.0;
   for (j=1; j<4; j++)
      x[j] = x[j-1] * (i+1);
   sum_of_powers[i] = x[0] + x[1] + x[2] + x[3];
}
n_cubed = x[3];
Critical Sections
double area, pi, x;
int i, n;
...
area = 0.0;
for (i = 0; i < n; i++) {
   x = (i+0.5)/n;
   area += 4.0/(1.0 + x*x);
}
pi = area / n;
Race Condition
Consider this C program segment to compute π using the rectangle rule:
double area, pi, x;
int i, n;
...
area = 0.0;
for (i = 0; i < n; i++) {
   x = (i+0.5)/n;
   area += 4.0/(1.0 + x*x);
}
pi = area / n;
Race Condition (cont.)
If we simply parallelize the loop...
double area, pi, x;
int i, n;
...
area = 0.0;
#pragma omp parallel for private(x)
for (i = 0; i < n; i++) {
   x = (i+0.5)/n;
   area += 4.0/(1.0 + x*x);
}
pi = area / n;
Race Condition (cont.)
... we set up a race condition in which one process may “race ahead” of another and not see its change to shared variable area
[Figure: Threads A and B both execute area += 4.0/(1.0 + x*x) starting from area = 11.667. Thread A stores 15.432; Thread B then stores 15.230, overwriting A's update. The answer should be 18.995.]
Race Condition Time Line
[Figure: time line of the value of area. Both threads read 11.667; Thread A adds 3.765 and stores 15.432; Thread B adds 3.563 to its stale copy and stores 15.230.]
critical Pragma
Critical section: a portion of code that only one thread at a time may execute
We denote a critical section by putting the pragma
#pragma omp critical
in front of a block of C code
Correct, But Inefficient, Code
double area, pi, x;
int i, n;
...
area = 0.0;
#pragma omp parallel for private(x)
for (i = 0; i < n; i++) {
   x = (i+0.5)/n;
#pragma omp critical
   area += 4.0/(1.0 + x*x);
}
pi = area / n;
Source of Inefficiency
Update to area inside a critical section
Only one thread at a time may execute the statement; i.e., it is sequential code
Time to execute statement a significant part of loop
By Amdahl's Law we know speedup will be severely constrained
Reductions
Reductions are so common that OpenMP provides support for them
May add reduction clause to parallel for pragma
Specify reduction operation and reduction variable
OpenMP takes care of storing partial results in private variables and combining partial results after the loop
reduction Clause
The reduction clause has this syntax:
reduction (<op> : <variable>)
Operators
+    Sum
*    Product
&    Bitwise and
|    Bitwise or
^    Bitwise exclusive or
&&   Logical and
||   Logical or
π-finding Code with Reduction Clause
double area, pi, x;
int i, n;
...
area = 0.0;
#pragma omp parallel for \
   private(x) reduction(+:area)
for (i = 0; i < n; i++) {
   x = (i + 0.5)/n;
   area += 4.0/(1.0 + x*x);
}
pi = area / n;
Performance Improvement #1
Too many fork/joins can lower performance
Inverting loops may help performance if
Parallelism is in inner loop
After inversion, the outer loop can be made parallel
Inversion does not significantly lower cache hit rate
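As an illustrative sketch (not from the original slides), suppose a dependence on i forces the parallelism into the inner loop, so the code forks and joins on every outer iteration:

for (i = 1; i < m; i++)
#pragma omp parallel for
   for (j = 0; j < n; j++)
      a[i][j] = 2 * a[i-1][j];   /* forks/joins m-1 times */

Since each column j is independent, inverting the loops makes the outer loop parallel, forking only once:

#pragma omp parallel for private(i)
for (j = 0; j < n; j++)
   for (i = 1; i < m; i++)
      a[i][j] = 2 * a[i-1][j];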
Performance Improvement #2
If loop has too few iterations, fork/join overhead is greater than time savings from parallel execution
The if clause instructs compiler to insert code that determines at run-time whether loop should be executed in parallel; e.g.,
#pragma omp parallel for if(n > 5000)
Performance Improvement #3
We can use schedule clause to specify how iterations of a loop should be allocated to threads
Static schedule: all iterations allocated to threads before any iterations executed
Dynamic schedule: only some iterations allocated to threads at beginning of loop's execution; remaining iterations allocated to threads that complete their assigned iterations
Static vs. Dynamic Scheduling
Static scheduling
Low overhead
May exhibit high workload imbalance
Dynamic scheduling
Higher overhead
Can reduce workload imbalance
Chunks
A chunk is a contiguous range of iterations
Increasing chunk size reduces overhead and may increase cache hit rate
Decreasing chunk size allows finer balancing of workloads
schedule Clause
Syntax of schedule clause
schedule (<type>[,<chunk>])
Schedule type required, chunk size optional
Allowable schedule types
static: static allocation
dynamic: dynamic allocation
guided: guided self-scheduling
runtime: type chosen at run-time based on value of environment variable OMP_SCHEDULE
Scheduling Options
schedule(static): block allocation of about n/t contiguous iterations to each thread
schedule(static,C): interleaved allocation of chunks of size C to threads
schedule(dynamic): dynamic one-at-a-time allocation of iterations to threads
schedule(dynamic,C): dynamic allocation of C iterations at a time to threads
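For example (an illustrative sketch; expensive_function is a hypothetical routine whose cost varies widely across iterations), dynamic scheduling with chunks of 4 lets faster threads grab more work:

#pragma omp parallel for schedule(dynamic, 4)
for (i = 0; i < n; i++)
   result[i] = expensive_function (i);   /* uneven per-iteration cost */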
Scheduling Options (cont.)
schedule(guided): guided self-scheduling with minimum chunk size 1
schedule(guided,C): dynamic allocation of chunks to tasks using guided self-scheduling heuristic; initial chunks are bigger, later chunks are smaller, minimum chunk size is C
schedule(runtime): schedule chosen at run-time based on value of OMP_SCHEDULE; Unix example:
setenv OMP_SCHEDULE “static,1”
More General Data Parallelism
Our focus has been on the parallelization of for loops
Other opportunities for data parallelism
processing items on a “to do” list
for loop + additional code outside of loop
Processing a “To Do” List
[Figure: shared variable job_ptr points to a “to do” list of tasks in the heap; the master thread (Thread 0) and Thread 1 each hold a private task_ptr]
Sequential Code (1/2)
int main (int argc, char *argv[])
{
   struct job_struct *job_ptr;
   struct task_struct *task_ptr;
   ...
   task_ptr = get_next_task (&job_ptr);
   while (task_ptr != NULL) {
      complete_task (task_ptr);
      task_ptr = get_next_task (&job_ptr);
   }
   ...
}
Sequential Code (2/2)
struct task_struct *get_next_task(struct job_struct **job_ptr) {
   struct task_struct *answer;

   if (*job_ptr == NULL) answer = NULL;
   else {
      answer = (*job_ptr)->task;
      *job_ptr = (*job_ptr)->next;
   }
   return answer;
}
Parallelization Strategy
Every thread should repeatedly take the next task from the list and complete it, until there are no more tasks
We must ensure no two threads take the same task from the list; i.e., we must declare a critical section
parallel Pragma
The parallel pragma precedes a block of code that should be executed by all of the threads
Note: execution is replicated among all threads
Use of parallel Pragma
#pragma omp parallel private(task_ptr)
{
   task_ptr = get_next_task (&job_ptr);
   while (task_ptr != NULL) {
      complete_task (task_ptr);
      task_ptr = get_next_task (&job_ptr);
   }
}
Critical Section for get_next_task
struct task_struct *get_next_task(struct job_struct **job_ptr) {
   struct task_struct *answer;
#pragma omp critical
   {
      if (*job_ptr == NULL) answer = NULL;
      else {
         answer = (*job_ptr)->task;
         *job_ptr = (*job_ptr)->next;
      }
   }
   return answer;
}
Functions for SPMD-style Programming
The parallel pragma allows us to write SPMD-style programs
In these programs we often need to know number of threads and thread ID number
OpenMP provides functions to retrieve this information
Function omp_get_thread_num
This function returns the thread identification number
If there are t threads, the ID numbers range from 0 to t-1
The master thread has ID number 0
int omp_get_thread_num (void)
Function omp_get_num_threads
Function omp_get_num_threads returns the number of active threads
If we call this function from the sequential portion of the program, it will return 1
int omp_get_num_threads (void)
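Combining the two functions (an illustrative sketch, not from the original slides; process is a hypothetical work function), an SPMD-style block can divide n iterations among threads in interleaved fashion:

#pragma omp parallel
{
   int id = omp_get_thread_num ();
   int t = omp_get_num_threads ();
   int i;
   /* thread id handles iterations id, id+t, id+2t, ... */
   for (i = id; i < n; i += t)
      process (i);
}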
for Pragma
The parallel pragma instructs every thread to execute all of the code inside the block
If we encounter a for loop that we want to divide among threads, we use the for pragma
#pragma omp for
Example Use of for Pragma
#pragma omp parallel private(i,j)
for (i = 0; i < m; i++) {
   low = a[i];
   high = b[i];
   if (low > high) {
      printf ("Exiting (%d)\n", i);
      break;
   }
#pragma omp for
   for (j = low; j < high; j++)
      c[j] = (c[j] - a[i])/b[i];
}
single Pragma
Suppose we only want to see the output once
The single pragma directs compiler that only a single thread should execute the block of code the pragma precedes
Syntax:
#pragma omp single
Use of single Pragma
#pragma omp parallel private(i,j)
for (i = 0; i < m; i++) {
   low = a[i];
   high = b[i];
   if (low > high) {
#pragma omp single
      printf ("Exiting (%d)\n", i);
      break;
   }
#pragma omp for
   for (j = low; j < high; j++)
      c[j] = (c[j] - a[i])/b[i];
}
nowait Clause
Compiler puts a barrier synchronization at end of every parallel for statement
In our example, this is necessary: if a thread leaves the loop and changes low or high, it may affect the behavior of another thread
If we make these private variables, then it would be okay to let threads move ahead, which could reduce execution time
Use of nowait Clause
#pragma omp parallel private(i,j,low,high)
for (i = 0; i < m; i++) {
   low = a[i];
   high = b[i];
   if (low > high) {
#pragma omp single
      printf ("Exiting (%d)\n", i);
      break;
   }
#pragma omp for nowait
   for (j = low; j < high; j++)
      c[j] = (c[j] - a[i])/b[i];
}
Functional Parallelism
To this point all of our focus has been on exploiting data parallelism
OpenMP allows us to assign different threads to different portions of code (functional parallelism)
Functional Parallelism Example
v = alpha();
w = beta();
x = gamma(v, w);
y = delta();
printf ("%6.2f\n", epsilon(x,y));
[Figure: dependence graph. alpha and beta feed gamma; gamma and delta feed epsilon. May execute alpha, beta, and delta in parallel.]
parallel sections Pragma
Precedes a block of k blocks of code that may be executed concurrently by k threads
Syntax:
#pragma omp parallel sections
section Pragma
Precedes each block of code within the encompassing block preceded by the parallel sections pragma
May be omitted for first parallel section after the parallel sections pragma
Syntax:
#pragma omp section
Example of parallel sections
#pragma omp parallel sections
{
#pragma omp section   /* Optional */
   v = alpha();
#pragma omp section
   w = beta();
#pragma omp section
   y = delta();
}
x = gamma(v, w);
printf ("%6.2f\n", epsilon(x,y));
Another Approach
[Figure: same dependence graph. Execute alpha and beta in parallel; then execute gamma and delta in parallel.]
sections Pragma
Appears inside a parallel block of code
Has same meaning as the parallel sections pragma
If multiple sections pragmas inside one parallel block, may reduce fork/join costs
Use of sections Pragma
#pragma omp parallel
{
#pragma omp sections
   {
      v = alpha();
#pragma omp section
      w = beta();
   }
#pragma omp sections
   {
      x = gamma(v, w);
#pragma omp section
      y = delta();
   }
}
printf ("%6.2f\n", epsilon(x,y));
Summary (1/3)
OpenMP an API for shared-memory parallel programming
Shared-memory model based on fork/join parallelism
Data parallelism
parallel for pragma
reduction clause
Summary (2/3)
Functional parallelism (parallel sections pragma)
SPMD-style programming (parallel pragma)
Critical sections (critical pragma)
Enhancing performance of parallel for loops
Inverting loops
Conditionally parallelizing loops
Changing loop scheduling
Summary (3/3)
Characteristic                          OpenMP   MPI
Suitable for multiprocessors            Yes      Yes
Suitable for multicomputers             No       Yes
Supports incremental parallelization    Yes      No
Minimal extra code                      Yes      No
Explicit control of memory hierarchy    No       Yes