8/6/2019 Omp Hands on SC08
A Hands-on Introduction to OpenMP*

Tim Mattson, Principal Engineer, Intel Corporation
Larry Meadows, Principal Engineer, Intel Corporation

* The name OpenMP is the property of the OpenMP Architecture Review Board.
Preliminaries: part 1 Disclosures
- The views expressed in this tutorial are those of the people delivering the tutorial. We are not speaking for our employers, and we are not speaking for the OpenMP ARB.
- This is a new tutorial for us: help us improve it by telling us how you would make this tutorial better.
Preliminaries: Part 2 Our plan for the day ... active learning!
- We will mix short lectures with short exercises. You will use your laptop for the exercises; that way you'll have an OpenMP environment to take home so you can keep learning on your own.
- Please follow these simple rules:
  - Do the exercises we assign, and then change things around and experiment.
  - Embrace active learning!
  - Don't cheat: do not look at the solutions before you complete an exercise, even if you get really frustrated.
Our Plan for the day

Topic                         | Exercise                | Concepts
I. OMP Intro                  | Install sw, hello_world | Parallel regions
II. Creating threads          | Pi_spmd_simple          | Parallel, default data environment, runtime library calls
III. Synchronization          | Pi_spmd_final           | False sharing, critical, atomic
IV. Parallel loops            | Pi_loop                 | For, reduction
V. Odds and ends              | No exercise             | Single, master, runtime libraries, environment variables, synchronization, etc.
VI. Data Environment          | Pi_mc                   | Data environment details, modular software, threadprivate
VII. Worksharing and schedule | Linked list, matmul     | For, schedules, sections
VIII. Memory model            | Producer consumer       | Point-to-point synch with flush
IX. OpenMP 3 and tasks        | Linked list             | Tasks and other OpenMP 3 features

(Breaks and lunch are interspersed between the sections.)
Outline
- Introduction to OpenMP
- Creating Threads
- Synchronization
- Parallel Loops
- Synchronize single masters and stuff
- Data environment
- Schedule your for and sections
- Memory model
- OpenMP 3.0 and Tasks
OpenMP* Overview:
omp_set_lock(lck)
#pragma omp parallel for private(A, B)
#pragma omp critical
C$OMP parallel do shared(a, b, c)
C$OMP PARALLEL REDUCTION (+: A, B)
call OMP_INIT_LOCK (ilok)
call omp_test_lock(jlok)
setenv OMP_SCHEDULE dynamic
CALL OMP_SET_NUM_THREADS(10)
C$OMP DO lastprivate(XX)
C$OMP ORDERED
C$OMP SINGLE PRIVATE(X)
C$OMP SECTIONS
C$OMP MASTER
C$OMP ATOMIC
C$OMP FLUSH
C$OMP PARALLEL DO ORDERED PRIVATE (A, B, C)
C$OMP THREADPRIVATE(/ABC/)
C$OMP PARALLEL COPYIN(/blk/)
Nthrds = OMP_GET_NUM_PROCS()
!$OMP BARRIER
OpenMP: An API for Writing Multithreaded Applications
- A set of compiler directives and library routines for parallel application programmers
- Greatly simplifies writing multi-threaded (MT) programs in Fortran, C and C++
- Standardizes the last 20 years of SMP practice
* The name OpenMP is the property of the OpenMP Architecture Review Board.
OpenMP Basic Defs: Solution Stack

User layer:   End User | Application
Prog. layer:  Directives, Compiler | OpenMP library | Environment variables
System layer: OpenMP Runtime library | OS/system support for shared memory and threading
HW:           Proc1 | Proc2 | Proc3 | ... | ProcN, over a Shared Address Space
OpenMP core syntax
- Most of the constructs in OpenMP are compiler directives:
      #pragma omp construct [clause [clause] ...]
  Example:
      #pragma omp parallel num_threads(4)
- Function prototypes and types are in the file:
      #include <omp.h>
- Most OpenMP* constructs apply to a structured block: a block of one or more statements with one point of entry at the top and one point of exit at the bottom.
- It's OK to have an exit() within the structured block.
Exercise 1, Part A: Hello world
Verify that your environment works.
- Write a program that prints "hello world".

    #include <stdio.h>
    int main()
    {
        int ID = 0;
        printf(" hello(%d) ", ID);
        printf(" world(%d) \n", ID);
        return 0;
    }
Exercise 1, Part B: Hello world
Verify that your OpenMP environment works.
- Write a multithreaded program that prints "hello world".

    #include <stdio.h>
    #include <omp.h>
    int main()
    {
        #pragma omp parallel
        {
            int ID = 0;
            printf(" hello(%d) ", ID);
            printf(" world(%d) \n", ID);
        }
        return 0;
    }

Switches for compiling and linking:
    gcc:   -fopenmp
    pgi:   -mp
    intel: /Qopenmp
Exercise 1: Solution
A multi-threaded "Hello world" program
- Write a multithreaded program where each thread prints "hello world".

    #include <omp.h>                         // OpenMP include file
    #include <stdio.h>
    int main()
    {
        #pragma omp parallel                 // parallel region with default number of threads
        {
            int ID = omp_get_thread_num();   // runtime library function to return a thread ID
            printf(" hello(%d) ", ID);
            printf(" world(%d) \n", ID);
        }                                    // end of the parallel region
        return 0;
    }

Sample output (4 threads; the interleaving varies from run to run):
    hello(1) hello(0) world(1)
    world(0)
    hello(3) hello(2) world(3)
    world(2)
OpenMP Overview: How do threads interact?
- OpenMP is a multi-threaded, shared-address model: threads communicate by sharing variables.
- Unintended sharing of data causes race conditions.
  Race condition: when the program's outcome changes as the threads are scheduled differently.
- To control race conditions, use synchronization to protect data conflicts.
- Synchronization is expensive, so change how data is accessed to minimize the need for synchronization.
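To make the race concrete, here is a small sketch (the function names are mine, not the tutorial's; assumes compilation with -fopenmp): the unprotected counter can lose updates, while the synchronized version never does.

```c
#include <omp.h>

/* Racy: the shared counter is updated with an unprotected
   read-modify-write, so concurrent threads can lose increments. */
long count_racy(long n)
{
    long count = 0;
    #pragma omp parallel for
    for (long i = 0; i < n; i++)
        count++;                /* race: outcome depends on scheduling */
    return count;
}

/* Fixed: synchronization protects the data conflict. */
long count_safe(long n)
{
    long count = 0;
    #pragma omp parallel for
    for (long i = 0; i < n; i++) {
        #pragma omp atomic      /* one thread at a time updates count */
        count++;
    }
    return count;               /* always equals n */
}
```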
Outline
- Introduction to OpenMP
- Creating Threads
- Synchronization
- Parallel Loops
- Synchronize single masters and stuff
- Data environment
- Schedule your for and sections
- Memory model
- OpenMP 3.0 and Tasks
OpenMP Programming Model: Fork-Join Parallelism
- The master thread spawns a team of threads as needed.
- Parallelism is added incrementally until performance goals are met: i.e., the sequential program evolves into a parallel program.
(Figure: sequential parts run on the master thread, alternating with parallel regions, which may themselves contain nested parallel regions.)
Thread Creation: Parallel Regions
- You create threads in OpenMP* with the parallel construct.
- For example, to create a 4-thread parallel region:

    double A[1000];
    omp_set_num_threads(4);              // runtime function to request a certain number of threads
    #pragma omp parallel
    {
        int ID = omp_get_thread_num();   // runtime function returning a thread ID
        pooh(ID, A);
    }

- Each thread calls pooh(ID, A) for ID = 0 to 3.
- Each thread executes a copy of the code within the structured block.

* The name OpenMP is the property of the OpenMP Architecture Review Board
Thread Creation: Parallel Regions
- You create threads in OpenMP* with the parallel construct.
- For example, to create a 4-thread parallel region:

    double A[1000];
    #pragma omp parallel num_threads(4)  // clause to request a certain number of threads
    {
        int ID = omp_get_thread_num();   // runtime function returning a thread ID
        pooh(ID, A);
    }

- Each thread calls pooh(ID, A) for ID = 0 to 3.
- Each thread executes a copy of the code within the structured block.

* The name OpenMP is the property of the OpenMP Architecture Review Board
Thread Creation: Parallel Regions example
- Each thread executes the same code redundantly.

    double A[1000];                      // a single copy of A is shared between all threads
    omp_set_num_threads(4);
    #pragma omp parallel
    {
        int ID = omp_get_thread_num();
        pooh(ID, A);                     // pooh(0,A), pooh(1,A), pooh(2,A) and pooh(3,A) run concurrently
    }
    printf("all done\n");                // threads wait here for all threads to finish
                                         // before proceeding (i.e., a barrier)

* The name OpenMP is the property of the OpenMP Architecture Review Board
Exercises 2 to 4: Numerical Integration

Mathematically, we know that:

    ∫₀¹ 4.0 / (1 + x²) dx = π

We can approximate the integral as a sum of rectangles:

    Σ (i = 0 to N) F(xᵢ) Δx ≈ π

where each rectangle has width Δx and height F(xᵢ) at the middle of interval i.

(Figure: the curve F(x) = 4.0 / (1 + x²) on [0, 1], approximated by rectangles.)
Exercises 2 to 4: Serial PI Program

    static long num_steps = 100000;
    double step;
    void main ()
    {
        int i; double x, pi, sum = 0.0;
        step = 1.0/(double) num_steps;
        for (i = 0; i < num_steps; i++) {
            x = (i+0.5)*step;
            sum = sum + 4.0/(1.0+x*x);
        }
        pi = step * sum;
    }
Exercise 2
- Create a parallel version of the pi program using a parallel construct.
- Pay close attention to shared versus private variables.
- In addition to a parallel construct, you will need the runtime library routines:
    int omp_get_num_threads();   // number of threads in the team
    int omp_get_thread_num();    // thread ID or rank
    double omp_get_wtime();      // time in seconds since a fixed point in the past
Synchronization
High level synchronization:
- critical
- atomic
- barrier
- ordered
Low level synchronization (discussed later):
- flush
- locks (both simple and nested)

Synchronization is used to impose order constraints and to protect access to shared data.
Synchronization: critical
- Mutual exclusion: only one thread at a time can enter a critical region.

    float res;
    #pragma omp parallel
    {
        float B; int i, id, nthrds;
        id = omp_get_thread_num();
        nthrds = omp_get_num_threads();
        for (i = id; i < niters; i += nthrds) {
            B = big_job(i);
            #pragma omp critical
            consume(B, res);    // threads wait their turn: only one
                                // at a time calls consume()
        }
    }
Synchronization: Atomic
- Atomic provides mutual exclusion but only applies to the update of a memory location (the update of X in the following example):

    #pragma omp parallel
    {
        double tmp, B;
        B = DOIT();
        #pragma omp atomic
        X += big_ugly(B);
    }

    #pragma omp parallel
    {
        double tmp, B;
        B = DOIT();
        tmp = big_ugly(B);
        #pragma omp atomic
        X += tmp;
    }

- Atomic only protects the read/update of X.
Exercise 3
- In exercise 2, you probably used an array to create space for each thread to store its partial sum.
- If array elements happen to share a cache line, this leads to false sharing: non-shared data in the same cache line, so each update invalidates the cache line, in essence sloshing independent data back and forth between threads.
- Modify your pi program from exercise 2 to avoid false sharing due to the sum array.
Outline
- Introduction to OpenMP
- Creating Threads
- Synchronization
- Parallel Loops
- Synchronize single masters and stuff
- Data environment
- Schedule your for and sections
- Memory model
- OpenMP 3.0 and Tasks
SPMD vs. worksharing
- A parallel construct by itself creates an SPMD or "Single Program Multiple Data" program, i.e., each thread redundantly executes the same code.
- How do you split up pathways through the code between threads within a team? This is called worksharing:
  - Loop construct
  - Sections/section constructs
  - Single construct
  - Task construct (discussed later; coming in OpenMP 3.0)
The loop worksharing Constructs
- The loop worksharing construct splits up loop iterations among the threads in a team:

    #pragma omp parallel
    {
        #pragma omp for
        for (I = 0; I < N; I++) {
            NEAT_STUFF(I);      // the loop variable I is made "private"
                                // to each thread by default
        }
    }
Loop worksharing Constructs: A motivating example

Sequential code:

    for (i = 0; i < N; i++) { a[i] = a[i] + b[i]; }

OpenMP parallel region (SPMD, splitting the range by hand):

    #pragma omp parallel
    {
        int id, i, Nthrds, istart, iend;
        id = omp_get_thread_num();
        Nthrds = omp_get_num_threads();
        istart = id * N / Nthrds;
        iend = (id + 1) * N / Nthrds;
        if (id == Nthrds - 1) iend = N;
        for (i = istart; i < iend; i++) { a[i] = a[i] + b[i]; }
    }

OpenMP parallel region and a worksharing for construct:

    #pragma omp parallel
    #pragma omp for
    for (i = 0; i < N; i++) { a[i] = a[i] + b[i]; }
Combined parallel/worksharing construct
- OpenMP shortcut: put the parallel and the worksharing directive on the same line.

These are equivalent:

    double res[MAX]; int i;
    #pragma omp parallel
    {
        #pragma omp for
        for (i = 0; i < MAX; i++) {
            res[i] = huge();
        }
    }

    double res[MAX]; int i;
    #pragma omp parallel for
    for (i = 0; i < MAX; i++) {
        res[i] = huge();
    }
Working with loops
Basic approach:
- Find compute intensive loops.
- Make the loop iterations independent, so they can safely execute in any order without loop-carried dependencies.
- Place the appropriate OpenMP directive and test.

Before (loop-carried dependence on j):

    int i, j, A[MAX];
    j = 5;
    for (i = 0; i < MAX; i++) {
        j += 2;
        A[i] = big(j);
    }

After (dependence removed; note the loop index i is private by default):

    int i, A[MAX];
    #pragma omp parallel for
    for (i = 0; i < MAX; i++) {
        int j = 5 + 2*(i+1);   // same sequence of j values as the serial loop
        A[i] = big(j);
    }
Reduction

How do we handle this case?

    double ave = 0.0, A[MAX]; int i;
    for (i = 0; i < MAX; i++) {
        ave += A[i];
    }
    ave = ave/MAX;

- We are combining values into a single accumulation variable (ave); there is a true dependence between loop iterations that can't be trivially removed.
- This is a very common situation; it is called a "reduction".
- Support for reduction operations is included in most parallel programming environments.
Reduction
- OpenMP reduction clause:
      reduction (op : list)
- Inside a parallel or a work-sharing construct:
  - A local copy of each list variable is made and initialized depending on the op (e.g. 0 for "+").
  - The compiler finds standard reduction expressions containing op and uses them to update the local copy.
  - Local copies are reduced into a single value and combined with the original global value.
- The variables in "list" must be shared in the enclosing parallel region.

    double ave = 0.0, A[MAX]; int i;
    #pragma omp parallel for reduction(+:ave)
    for (i = 0; i < MAX; i++) {
        ave += A[i];
    }
    ave = ave/MAX;
OpenMP: Reduction operands/initial-values
- Many different associative operands can be used with reduction.
- Initial values are the ones that make sense mathematically.

Operator | Initial value
+        | 0
*        | 1
-        | 0

C/C++ only:
Operator | Initial value
&        | ~0
|        | 0
^        | 0
&&       | 1
||       | 0

Fortran only:
Operator | Initial value
.AND.    | .true.
.OR.     | .false.
.NEQV.   | .false.
.EQV.    | .true.
.IEOR.   | 0
.IOR.    | 0
.IAND.   | All bits on
MIN*     | Largest pos. number
MAX*     | Most neg. number
Exercise 4
- Go back to the serial pi program and parallelize it with a loop construct.
- Your goal is to minimize the number of changes made to the serial program.
Outline
- Introduction to OpenMP
- Creating Threads
- Synchronization
- Parallel Loops
- Synchronize single masters and stuff
- Data environment
- Schedule your for and sections
- Memory model
- OpenMP 3.0 and Tasks
Synchronization: Barrier
- Barrier: each thread waits until all threads arrive.

    #pragma omp parallel shared(A, B, C) private(id)
    {
        id = omp_get_thread_num();
        A[id] = big_calc1(id);
        #pragma omp barrier
        #pragma omp for
        for (i = 0; i < N; i++) { C[i] = big_calc3(i, A); }
        // implicit barrier at the end of a for worksharing construct
        #pragma omp for nowait
        for (i = 0; i < N; i++) { B[i] = big_calc2(C, i); }
        // no implicit barrier due to nowait
        A[id] = big_calc4(id);
    }   // implicit barrier at the end of a parallel region
Master Construct
- The master construct denotes a structured block that is only executed by the master thread.
- The other threads just skip it (no synchronization is implied).

    #pragma omp parallel
    {
        do_many_things();
        #pragma omp master
        { exchange_boundaries(); }
        #pragma omp barrier
        do_many_other_things();
    }
Single worksharing Construct
- The single construct denotes a block of code that is executed by only one thread (not necessarily the master thread).
- A barrier is implied at the end of the single block (you can remove the barrier with a nowait clause).

    #pragma omp parallel
    {
        do_many_things();
        #pragma omp single
        { exchange_boundaries(); }
        do_many_other_things();
    }
Synchronization: ordered
- The ordered region executes in the sequential order.

    #pragma omp parallel private(tmp)
    #pragma omp for ordered reduction(+:res)
    for (I = 0; I < N; I++) {
        tmp = NEAT_STUFF(I);
        #pragma omp ordered
        res += consum(tmp);
    }
Synchronization: Lock routines
- Simple lock routines: a simple lock is available if it is unset.
    omp_init_lock(), omp_set_lock(), omp_unset_lock(), omp_test_lock(), omp_destroy_lock()
- Nested locks: a nested lock is available if it is unset, or if it is set but owned by the thread executing the nested lock function.
    omp_init_nest_lock(), omp_set_nest_lock(), omp_unset_nest_lock(), omp_test_nest_lock(), omp_destroy_nest_lock()
- A lock implies a memory fence (a "flush") of all thread-visible variables.
- Note: a thread always accesses the most recent copy of the lock, so you don't need to use a flush on the lock variable.
Synchronization: Simple Locks
- Protect resources with locks.

    omp_lock_t lck;
    omp_init_lock(&lck);
    #pragma omp parallel private(tmp, id)
    {
        id = omp_get_thread_num();
        tmp = do_lots_of_work(id);
        omp_set_lock(&lck);       // wait here for your turn
        printf("%d %d", id, tmp);
        omp_unset_lock(&lck);     // release the lock so the next thread gets a turn
    }
    omp_destroy_lock(&lck);       // free up storage when done
Runtime Library routines
- Runtime environment routines:
  - Modify/check the number of threads:
      omp_set_num_threads(), omp_get_num_threads(), omp_get_thread_num(), omp_get_max_threads()
  - Are we in an active parallel region?
      omp_in_parallel()
  - Do you want the system to dynamically vary the number of threads from one parallel construct to another?
      omp_set_dynamic(), omp_get_dynamic()
  - How many processors in the system?
      omp_get_num_procs()
- ... plus a few less commonly used routines.
Runtime Library routines
- To use a known, fixed number of threads in a program: (1) tell the system that you don't want dynamic adjustment of the number of threads, (2) set the number of threads, then (3) save the number you got.

    #include <omp.h>
    void main()
    {
        int num_threads;
        omp_set_dynamic(0);                          // disable dynamic adjustment of the number of threads
        omp_set_num_threads(omp_get_num_procs());    // request as many threads as you have processors
        #pragma omp parallel
        {
            int id = omp_get_thread_num();
            #pragma omp single                       // protect this op since memory stores are not atomic
            num_threads = omp_get_num_threads();
            do_lots_of_stuff(id);
        }
    }

- Even in this case, the system may give you fewer threads than requested. If the precise # of threads matters, test for it and respond accordingly.
Environment Variables
- Set the default number of threads to use:
      OMP_NUM_THREADS int_literal
- Control how "omp for schedule(RUNTIME)" loop iterations are scheduled:
      OMP_SCHEDULE "schedule[, chunk_size]"
- ... plus several less commonly used environment variables.
Outline
- Introduction to OpenMP
- Creating Threads
- Synchronization
- Parallel Loops
- Synchronize single masters and stuff
- Data environment
- Schedule your for and sections
- Memory model
- OpenMP 3.0 and Tasks
Data environment: Default storage attributes
- Shared memory programming model: most variables are shared by default.
- Global variables are SHARED among threads:
  - Fortran: COMMON blocks, SAVE variables, MODULE variables
  - C: file scope variables, static
  - Both: dynamically allocated memory (ALLOCATE, malloc, new)
- But not everything is shared...
  - Stack variables in subprograms (Fortran) or functions (C) called from parallel regions are PRIVATE.
  - Automatic variables within a statement block are PRIVATE.
Data sharing: Examples

    double A[10];
    int main() {
        int index[10];
        #pragma omp parallel
        work(index);
        printf("%d\n", index[0]);
    }

    extern double A[10];
    void work(int *index) {
        double temp[10];
        static int count;
        ...
    }

- A, index and count are shared by all threads.
- temp is local to each thread.
* Third party trademarks and names are the property of their respective owner.
Data sharing: Changing storage attributes
- One can selectively change storage attributes for constructs using the following clauses*:
  - SHARED
  - PRIVATE
  - FIRSTPRIVATE
- The final value of a private inside a parallel loop can be transmitted to the shared variable outside the loop with:
  - LASTPRIVATE
- The default attributes can be overridden with:
  - DEFAULT (PRIVATE | SHARED | NONE)
- All the clauses on this page apply to the OpenMP construct, NOT to the entire region.

* All data clauses apply to parallel constructs and worksharing constructs, except "shared", which only applies to parallel constructs. DEFAULT(PRIVATE) is Fortran only.
Data Sharing: Private Clause
- private(var) creates a new local copy of var for each thread.
  - The value is uninitialized.
  - In OpenMP 2.5 the value of the shared variable is undefined after the region.

    void wrong() {
        int tmp = 0;
        #pragma omp for private(tmp)
        for (int j = 0; j < 1000; ++j)
            tmp += j;              // tmp was not initialized
        printf("%d\n", tmp);       // tmp: 0 in 3.0, unspecified in 2.5
    }
Data Sharing: Private Clause
When is the original variable valid?
- The original variable's value is unspecified in OpenMP 2.5.
- In OpenMP 3.0, if it is referenced outside of the construct, implementations may reference the original variable or a copy... a dangerous programming practice!

    int tmp;
    void danger() {
        tmp = 0;
        #pragma omp parallel private(tmp)
        work();                    // unspecified which copy of tmp work() updates
        printf("%d\n", tmp);       // tmp has unspecified value
    }

    extern int tmp;
    void work() {
        tmp = 5;
    }
Data Sharing: Firstprivate Clause
- Firstprivate is a special case of private.
- Initializes each private copy with the corresponding value from the master thread.

    void useless() {
        int tmp = 0;
        #pragma omp for firstprivate(tmp)
        for (int j = 0; j < 1000; ++j)
            tmp += j;              // each thread gets its own tmp with an initial value of 0
        printf("%d\n", tmp);       // tmp: 0 in 3.0, unspecified in 2.5
    }
Data sharing: Lastprivate Clause
- Lastprivate passes the value of a private from the last iteration to a global variable.

    void closer() {
        int tmp = 0;
        #pragma omp parallel for firstprivate(tmp) \
                                 lastprivate(tmp)
        for (int j = 0; j < 1000; ++j)
            tmp += j;              // each thread gets its own tmp with an initial value of 0
        printf("%d\n", tmp);       // tmp is defined as its value at the last
                                   // sequential iteration (i.e., for j = 999)
    }
Data Sharing: A data environment test
- Consider this example of PRIVATE and FIRSTPRIVATE:

    variables A, B, and C = 1
    #pragma omp parallel private(B) firstprivate(C)

- Are A, B, C local to each thread or shared inside the parallel region? What are their initial values inside, and their values after, the parallel region?

Inside this parallel region:
- "A" is shared by all threads; equals 1.
- "B" and "C" are local to each thread.
  - B's initial value is undefined.
  - C's initial value equals 1.

Outside this parallel region:
- The values of "B" and "C" are unspecified in OpenMP 2.5, and in OpenMP 3.0 if referenced in the region but outside the construct.
Data Sharing: Default Clause
- Note that the default storage attribute is DEFAULT(SHARED), so there is no need to use it. Exception: #pragma omp task.
- To change the default: DEFAULT(PRIVATE)
  - Each variable in the construct is made private as if specified in a private clause.
  - Mostly saves typing.
- DEFAULT(NONE): no default for variables in static extent. Must list the storage attribute for each variable in static extent. Good programming practice!
- Only the Fortran API supports default(private). C/C++ only has default(shared) or default(none).
Data Sharing: tasks (OpenMP 3.0)
- The default for tasks is usually firstprivate, because the task may not be executed until later (and variables may have gone out of scope).
- Variables that are shared in all constructs starting from the innermost enclosing parallel construct are shared, because the barrier guarantees task completion.

    #pragma omp parallel shared(A) private(B)
    {
        ...
        #pragma omp task
        {
            int C;
            compute(A, B, C);   // A is shared, B is firstprivate, C is private
        }
    }
Data sharing: Threadprivate
- Makes global data private to a thread:
  - Fortran: COMMON blocks
  - C: file scope and static variables, static class members
- Different from making them PRIVATE:
  - With PRIVATE, global variables are masked.
  - THREADPRIVATE preserves global scope within each thread.
- Threadprivate variables can be initialized using COPYIN or at time of definition (using language-defined initialization capabilities).
A threadprivate example (C)
Use threadprivate to create a counter for each thread.

    int counter = 0;
    #pragma omp threadprivate(counter)

    int increment_counter()
    {
        counter++;
        return (counter);
    }
Data Copying: Copyin
You initialize threadprivate data using a copyin clause.

          parameter (N=1000)
          common/buf/A(N)
    !$OMP THREADPRIVATE(/buf/)

    C     Initialize the A array
          call init_data(N,A)

    !$OMP PARALLEL COPYIN(A)

    C     Now each thread sees threadprivate array A initialized
    C     to the global value set in the subroutine init_data()

    !$OMP END PARALLEL

          end
Exercise 5: Monte Carlo Calculations
Using random numbers to solve tough problems
- Sample a problem domain to estimate areas, compute probabilities, find optimal values, etc.
- Example: computing π with a digital dart board of side 2*r:
  - Throw darts at the circle/square.
  - The chance of falling in the circle is proportional to the ratio of areas:
        Ac = r² * π
        As = (2*r) * (2*r) = 4 * r²
        P = Ac/As = π/4
  - Compute π by randomly choosing points; count the fraction that falls in the circle; compute pi.

    N = 10:   π ≈ 2.8
    N = 100:  π ≈ 3.16
    N = 1000: π ≈ 3.148
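For reference, the serial dart-throwing loop might look like this sketch (using C's rand() for brevity; the exercise's own random.c replaces it, and pi_mc is my name):

```c
#include <stdlib.h>

/* Throw num_trials darts at the unit square; the fraction landing
   inside the quarter circle approaches pi/4. */
double pi_mc(long num_trials)
{
    long in_circle = 0;
    srand(12345);                        /* fixed seed: repeatable result */
    for (long i = 0; i < num_trials; i++) {
        double x = (double) rand() / RAND_MAX;
        double y = (double) rand() / RAND_MAX;
        if (x * x + y * y <= 1.0)        /* dart landed inside the circle */
            in_circle++;
    }
    return 4.0 * (double) in_circle / (double) num_trials;
}
```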
Exercise 5
- We provide three files for this exercise:
  - pi_mc.c: the Monte Carlo method pi program
  - random.c: a simple random number generator
  - random.h: include file for the random number generator
- Create a parallel version of this program without changing the interfaces to functions in random.c.
  - This is an exercise in modular software: why should a user of your parallel random number generator have to know any details of the generator, or make any changes to how the generator is called?
- Extra credit:
  - Make the random number generator threadsafe.
  - Make your random number generator numerically correct (non-overlapping sequences of pseudo-random numbers).
Outline
- Introduction to OpenMP
- Creating Threads
- Synchronization
- Parallel Loops
- Synchronize single masters and stuff
- Data environment
- Schedule your for and sections
- Memory model
- OpenMP 3.0 and Tasks
Sections worksharing Construct
- The sections worksharing construct gives a different structured block to each thread.

    #pragma omp parallel
    {
        #pragma omp sections
        {
            #pragma omp section
            X_calculation();
            #pragma omp section
            y_calculation();
            #pragma omp section
            z_calculation();
        }
    }

- By default, there is a barrier at the end of the "omp sections". Use the "nowait" clause to turn off the barrier.
loop worksharing constructs: The schedule clause
- The schedule clause affects how loop iterations are mapped onto threads.
  - schedule(static [,chunk]): deal out blocks of iterations of size "chunk" to each thread.
  - schedule(dynamic [,chunk]): each thread grabs "chunk" iterations off a queue until all iterations have been handled.
  - schedule(guided [,chunk]): threads dynamically grab blocks of iterations. The size of the block starts large and shrinks down to size "chunk" as the calculation proceeds.
  - schedule(runtime): schedule and chunk size taken from the OMP_SCHEDULE environment variable (or the runtime library for OpenMP 3.0).
The schedule clause

Schedule clause | When to use
STATIC          | Pre-determined and predictable by the programmer. Least work at runtime: scheduling done at compile-time.
DYNAMIC         | Unpredictable, highly variable work per iteration. Most work at runtime: complex scheduling logic used at run-time.
GUIDED          | Special case of dynamic to reduce scheduling overhead.
Exercise 6: hard
- Consider the program linked.c: it traverses a linked list, computing a sequence of Fibonacci numbers at each node.
- Parallelize this program using constructs defined in OpenMP 2.5 (loop worksharing constructs).
- Once you have a correct program, optimize it.
Exercise 6: easy
- Parallelize the matrix multiplication program in the file matmul.c.
- Can you optimize the program by playing with how the loops are scheduled?
Outline
- Introduction to OpenMP
- Creating Threads
- Synchronization
- Parallel Loops
- Synchronize single masters and stuff
- Data environment
- Schedule your for and sections
- Memory model
- OpenMP 3.0 and Tasks
OpenMP memory model
- OpenMP supports a shared memory model.
- All threads share an address space, but it can get complicated:
  (Figure: proc1 ... procN, each with its own cache holding a copy of variable "a", above one shared memory.)
- A memory model is defined in terms of:
  - Coherence: behavior of the memory system when a single address is accessed by multiple threads.
  - Consistency: orderings of accesses to different addresses by multiple threads.
OpenMP Memory Model: Basic Terms
(Figure: source code in "program order" is compiled into executable code whose reads and writes may appear in any semantically equivalent "code order"; the machine then commits memory operations in yet another "commit order". Each thread also has a private/threadprivate view of variables a and b.)

    Program order: Wa Wb Ra Rb ...
    Code order:    Wb Rb Wa Ra ...
Consistency: Memory Access Re-ordering
Re-ordering:
- The compiler re-orders program order into the code order.
- The machine re-orders code order into the memory commit order.
- At a given point in time, the "temporary" view of memory may vary from shared memory.
- Consistency models are based on orderings of Reads (R), Writes (W) and Synchronizations (S):
  R→R, W→W, R→W, R→S, S→S, W→S
Consistency
- Sequential Consistency: in a multi-processor, ops (R, W, S) are sequentially consistent if:
  - They remain in program order for each processor.
  - They are seen to be in the same overall order by each of the other processors.
  - Program order = code order = commit order.
- Relaxed consistency: remove some of the ordering constraints for memory ops (R, W, S).
OpenMP and Relaxed Consistency
OpenMP defines consistency as a variant of weak consistency:
S ops must be in sequential order across threads.
Can not reorder S ops with R or W ops on the same thread
Weak consistency guarantees S→W, S→R, R→S, W→S, S→S
The synchronization operation relevant to this discussion is flush.
Flush
Defines a sequence point at which a thread is guaranteed to see a consistent view of memory with respect to the flush set.
The flush set is:
all thread-visible variables for a flush construct without an argument list.
a list of variables when the flush(list) construct is used.
The action of flush is to guarantee that:
All R, W ops that overlap the flush set and occur prior to the flush complete before the flush executes.
All R, W ops that overlap the flush set and occur after the flush don't execute until after the flush.
Flushes with overlapping flush sets can not be reordered.
Memory ops: R = read, W = write, S = synchronization
Synchronization: flush example
Flush forces data to be updated in memory so other threads see the most recent value

double A;
A = compute();
#pragma omp flush(A)  // flush to memory to make sure other
                      // threads can pick up the right value

Note: OpenMP's flush is analogous to a fence in other shared memory APIs.
Exercise 7: producer consumer
Parallelize the prod_cons.c program.
This is a well known pattern called the producer-consumer pattern
One thread produces values that another thread consumes.
Often used with a stream of produced values to implement pipeline parallelism
The key is to implement pairwise synchronization between threads.
Exercise 7: prod_cons.c
int main()
{
  double *A, sum, runtime; int flag = 0;
  A = (double *)malloc(N*sizeof(double));
  runtime = omp_get_wtime();
  fill_rand(N, A);        // Producer: fill an array of data
  sum = Sum_array(N, A);  // Consumer: sum the array
  runtime = omp_get_wtime() - runtime;
  printf(" In %lf seconds, The sum is %lf \n", runtime, sum);
}

I need to put the prod/cons pair inside a loop so it's true pipeline parallelism.
What is the Big Deal with Flush?
Compilers routinely reorder the instructions implementing a program
This helps better exploit the functional units, keep the machine busy, hide memory latencies, etc.
A compiler generally cannot move instructions:
past a barrier
past a flush on all variables
But it can move them past a flush with a list of variables so long as those variables are not accessed
Keeping track of consistency when flushes are used can be confusing, especially if flush(list) is used.
Note: the flush operation does not actually synchronize different threads. It just ensures that a thread's values are made consistent with main memory.
Outline
Introduction to OpenMP
Creating Threads
Synchronization
Parallel Loops
Synchronize single masters and stuff
Data environment
Schedule your for and sections
Memory model
OpenMP 3.0 and Tasks
OpenMP pre-history
OpenMP is based upon SMP directive standardization efforts: PCF and the aborted ANSI X3H5 (late 80s)
Nobody fully implemented either standard
Only a couple of partial implementations
Vendors considered proprietary APIs to be a competitive feature:
Every vendor had proprietary directive sets
Even KAP, a portable multi-platform parallelization tool, used different directives on each platform
PCF = Parallel Computing Forum; KAP = parallelization tool from KAI.
History of OpenMP

[Diagram, 1997: DEC, SGI, and Cray merged and needed commonality across products; KAI, an ISV, needed a larger market; ASCI was tired of recoding for SMPs and urged vendors to standardize; together they wrote a rough draft "straw man" SMP API; other vendors (IBM, Intel, HP) were invited to join.]
OpenMP Release History
1997: OpenMP Fortran 1.0
1998: OpenMP C/C++ 1.0
1999: OpenMP Fortran 1.1
2000: OpenMP Fortran 2.0
2002: OpenMP C/C++ 2.0
2005: OpenMP 2.5 (a single specification for Fortran, C and C++)
2008: OpenMP 3.0 (tasking, other new features)
Tasks
3.0
8/6/2019 Omp Hands on SC08
85/153
85
Adding tasking is the biggest addition for 3.0
Worked on by a separate subcommittee
led by Jay Hoeflinger at Intel
Re-examined issue from ground up
quite different from Intel taskqs
General task characteristics
3.0
A task has:
Code to execute
A data environment (it owns its data)
An assigned thread that executes the code and uses the data
Two activities: packaging and execution
Each encountering thread packages a new instance of a task (code and data)
Some thread in the team executes the task at some later time
Definitions
3.0
Task construct: a task directive plus a structured block
Task: the package of code and instructions for allocating data, created when a thread encounters a task construct
Task region: the dynamic sequence of instructions produced by the execution of a task by a thread
Tasks and OpenMP 3.0
Tasks have been fully integrated into OpenMP
Key concept: OpenMP has always had tasks, we just never called them that.
A thread encountering a parallel construct packages up a set of implicit tasks, one per thread.
A team of threads is created.
Each thread in the team is assigned to one of the tasks (and tied to it).
A barrier holds the original master thread until all implicit tasks are finished.
We have simply added a way to create a task explicitly for the team to execute.
Every part of an OpenMP program is part of one task or another!
task Construct
3.0
#pragma omp task [clause[[,]clause] ...]
   structured-block

where clause can be one of:
if (expression)
untied
shared (list)
private (list)
firstprivate (list)
default( shared | none )
The if clause
3.0
When the if clause argument is false:
The task is executed immediately by the encountering thread.
The data environment is still local to the new task...
...and it's still a different task with respect to synchronization.
It's a user-directed optimization:
when the cost of deferring the task is too great compared to the cost of executing the task code
to control cache and memory affinity
When/where are tasks complete?
3.0
At thread barriers, explicit or implicit
applies to all tasks generated in the current parallel region up to the barrier
matches user expectation
At task barriers:
#pragma omp taskwait
i.e. wait until all tasks defined in the current task have completed.
Note: applies only to tasks generated in the current task, not to descendants.
Example: parallel pointer chasing using tasks
3.0

#pragma omp parallel
{
  #pragma omp single private(p)
  {
    p = listhead;
    while (p) {
      #pragma omp task
        process(p)
      p = next(p);
    }
  }
}

p is firstprivate inside this task
Example: parallel pointer chasing on multiple lists using tasks
3.0

#pragma omp parallel
{
  #pragma omp for private(p)
  for ( int i = 0; i < numlists; i++) {
    p = listheads[i];
    while (p) {
      #pragma omp task
        process(p)
      p = next(p);
    }
  }
}
Example: postorder tree traversal
3.0

void postorder(node *p) {
  if (p->left)
    #pragma omp task
      postorder(p->left);
  if (p->right)
    #pragma omp task
      postorder(p->right);
  #pragma omp taskwait   // wait for descendants
  process(p->data);
}

Parent task suspended until children tasks complete
Task scheduling point
Task switching
3.0
Certain constructs have task scheduling points at defined locations within them
When a thread encounters a task scheduling point, it is allowed to suspend the current task and execute another (called task switching)
It can then return to the original task and resume
Task switching example
3.0

#pragma omp single
{
  for (i=0; i<ONEZILLION; i++)
    #pragma omp task
      process(item[i]);
}

Too many tasks generated in an eye-blink
The generating task will have to suspend for a while
With task switching, the executing thread can execute an already generated task (draining the task pool)
#pragma omp single
{
  #pragma omp task untied
  for (i=0; i<ONEZILLION; i++)
    #pragma omp task
      process(item[i]);
}
The taskprivate directive was removed from OpenMP 3.0
Too expensive to implement
Restrictions on task scheduling allow threadprivate data to be used
User can avoid thread switching with tied tasks
Task scheduling points are well defined
Conclusions on tasks
3.0
Enormous amount of work by many people
Tightly integrated into 3.0 spec
Flexible model for irregular parallelism
Provides balanced solution despite often conflictinggoals
Appears that performance can be reasonable
Nested parallelism
3.0
Better support for nested parallelism
Per-thread internal control variables
Allows, for example, calling omp_set_num_threads() inside a parallel region.
Controls the team sizes for the next level of parallelism
Library routines to determine depth of nesting, IDs of parent/grandparent etc. threads, team sizes of parent/grandparent etc. teams:
omp_get_active_level()
omp_get_ancestor_thread_num(level)
omp_get_team_size(level)
Parallel loops
3.0

Guarantee that this works... i.e. that the same schedule is used in the two loops:
!$omp do schedule(static)
do i=1,n
  a(i) = ....
end do
!$omp end do nowait
!$omp do schedule(static)
do i=1,n
  .... = a(i)
end do
Loops (cont.)
3.0
Allow collapsing of perfectly nested loops
Will form a single loop and then parallelize that:

!$omp parallel do collapse(2)
do i=1,n
  do j=1,n
    .....
  end do
end do
Portable control of threads
3.0
Added environment variable to control the size of child threads' stack
OMP_STACKSIZE
Added environment variable to hint to the runtime how to treat idle threads
OMP_WAIT_POLICY
ACTIVE: keep threads alive at barriers/locks
PASSIVE: try to release the processor at barriers/locks
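A typical way to use these variables from a shell before launching a program; the 16M stack size and the program name foo are arbitrary placeholders:

```shell
# Give each worker thread a 16 MB stack, and keep idle threads
# spinning at barriers (ACTIVE suits short waits on dedicated cores).
export OMP_STACKSIZE=16M
export OMP_WAIT_POLICY=ACTIVE
export OMP_NUM_THREADS=4
echo "OMP_STACKSIZE=$OMP_STACKSIZE OMP_WAIT_POLICY=$OMP_WAIT_POLICY"
# then launch the OpenMP program, e.g.:  ./foo
```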
Control program execution
3.0
Added environment variable and runtime routines to get/set the maximum number of active levels of nested parallelism
OMP_MAX_ACTIVE_LEVELS
omp_set_max_active_levels()
omp_get_max_active_levels()
Added environment variable to set the maximum number of threads in use
OMP_THREAD_LIMIT
omp_get_thread_limit()
Odds and ends
3.0
Allow unsigned ints in parallel for loops
Disallow use of the original variable as the master thread's private variable
Make it clearer where/how private objects are constructed/destructed
Relax some restrictions on allocatable arrays and Fortran pointers
Plug some minor gaps in the memory model
Allow C++ static class members to be threadprivate
Improve C/C++ grammar
Minor fixes and clarifications to 2.5
Exercise 8: tasks in OpenMP
Consider the program linked.c
Traverses a linked list, computing a sequence of Fibonacci numbers at each node.
Parallelize this program using tasks.
Compare your solution's complexity to the approach without tasks.
Conclusion
OpenMP 3.0 is a major upgrade: it expands the range of algorithms accessible from OpenMP.
OpenMP is fun and about as easy as we can make it for applications programmers working with shared memory platforms.
OpenMP Organizations
OpenMP Architecture Review Board URL, the owner of the OpenMP specification:
www.openmp.org
OpenMP User's Group (cOMPunity) URL:
www.compunity.org
Get involved, join cOMPunity and help define the future of OpenMP
Books about OpenMP
A new book about OpenMP 2.5 by a team of authors at the forefront of OpenMP's evolution.
A book about how to think parallel, with examples in OpenMP, MPI and Java
OpenMP Papers
Sosa CP, Scalmani C, Gomperts R, Frisch MJ. Ab initio quantum chemistry on a ccNUMA architecture using OpenMP. III. Parallel Computing, vol.26, no.7-8, July 2000, pp.843-56. Publisher: Elsevier, Netherlands.
Couturier R, Chipot C. Parallel molecular dynamics using OPENMP on a shared memory machine. Computer Physics Communications, vol.124, no.1, Jan. 2000, pp.49-59. Publisher: Elsevier, Netherlands.
Bentz J., Kendall R., Parallelization of General Matrix Multiply Routines Using OpenMP, Shared Memory Parallel Programming with OpenMP, Lecture Notes in Computer Science, Vol. 3349, P. 1, 2005
Bova SW, Breshearsz CP, Cuicchi CE, Demirbilek Z, Gabb HA. Dual-level parallel analysis of harbor wave response using MPI and OpenMP. International Journal of High Performance Computing Applications, vol.14, no.1, Spring 2000, pp.49-64. Publisher: Sage Science Press, USA.
Ayguade E, Martorell X, Labarta J, Gonzalez M, Navarro N. Exploiting multiple levels of parallelism in OpenMP: a case study. Proceedings of the 1999 International Conference on Parallel Processing. IEEE Comput. Soc. 1999, pp.172-80. Los Alamitos, CA, USA.
Bova SW, Breshears CP, Cuicchi C, Demirbilek Z, Gabb H. Nesting OpenMP in an MPI application. Proceedings of the ISCA 12th International Conference. Parallel and Distributed Systems. ISCA. 1999, pp.566-71. Cary, NC, USA.
OpenMP Papers (continued)
B. Chapman, F. Bregier, A. Patil, A. Prabhakar, Achieving performance under OpenMP on ccNUMA and software distributed shared memory systems, Concurrency and Computation: Practice and Experience. 14(8-9): 713-739, 2002.
J. M. Bull and M. E. Kambites. JOMP: an OpenMP-like interface for Java. Proceedings of the ACM 2000 conference on Java Grande, 2000, Pages 44-53.
L. Adhianto and B. Chapman, Performance modeling of communication and computation in hybrid MPI and OpenMP applications, Simulation Modeling Practice and Theory, vol 15, p. 481-491, 2007.
Shah S, Haab G, Petersen P, Throop J. Flexible control structures for parallelism in OpenMP; Concurrency: Practice and Experience, 2000; 12:1219-1239. Publisher: John Wiley & Sons, Ltd.
Mattson, T.G., How Good is OpenMP? Scientific Programming, Vol. 11, Number 2, p.81-93, 2003.
Duran A., Silvera R., Corbalan J., Labarta J., Runtime Adjustment of Parallel Nested Loops, Shared Memory Parallel Programming with OpenMP, Lecture Notes in Computer Science, Vol. 3349, P. 137, 2005
Appendix: Solutions to exercises
Exercise 1: hello world
Exercise 2: Simple SPMD Pi program
Exercise 3: SPMD Pi without false sharing
Exercise 4: Loop level Pi
Exercise 5: Monte Carlo Pi and random numbers
Exercise 6: hard, linked lists without tasks
Exercise 6: easy, matrix multiplication
Exercise 7: Producer-consumer
Exercise 8: linked lists with tasks
Exercise 1: Solution
A multi-threaded "Hello world" program
Write a multithreaded program where each thread prints "hello world".
#include <omp.h>
void main()
{
  #pragma omp parallel
  {
    int ID = omp_get_thread_num();
    printf(" hello(%d) ", ID);
    printf(" world(%d) \n", ID);
  }
}

Sample Output:
hello(1) hello(0) world(1) world(0)
hello(3) hello(2) world(3) world(2)

OpenMP include file
Parallel region with default number of threads
Runtime library function to return a thread ID.
End of the parallel region
Appendix: Solutions to exercises
Exercise 1: hello world
Exercise 2: Simple SPMD Pi program
Exercise 3: SPMD Pi without false sharing
Exercise 4: Loop level Pi
Exercise 5: Monte Carlo Pi and random numbers
Exercise 6: hard, linked lists without tasks
Exercise 6: easy, matrix multiplication
Exercise 7: Producer-consumer
Exercise 8: linked lists with tasks
Exercise 2: A simple SPMD pi program

#include <omp.h>
static long num_steps = 100000;   double step;
#define NUM_THREADS 2
void main ()
{ int i, nthreads; double pi, sum[NUM_THREADS];
  step = 1.0/(double) num_steps;
  omp_set_num_threads(NUM_THREADS);
  #pragma omp parallel
  {
    int i, id, nthrds;
    double x;
    id = omp_get_thread_num();
    nthrds = omp_get_num_threads();
    if (id == 0) nthreads = nthrds;
    for (i=id, sum[id]=0.0; i < num_steps; i=i+nthrds) {
      x = (i+0.5)*step;
      sum[id] += 4.0/(1.0+x*x);
    }
  }
  for(i=0, pi=0.0; i < nthreads; i++) pi += sum[i] * step;
}

Promote the scalar to an array dimensioned by the number of threads to avoid a race condition.
Appendix: Solutions to exercises
Exercise 1: hello world
Exercise 2: Simple SPMD Pi program
Exercise 3: SPMD Pi without false sharing
Exercise 4: Loop level Pi
Exercise 5: Monte Carlo Pi and random numbers
Exercise 6: hard, linked lists without tasks
Exercise 6: easy, matrix multiplication
Exercise 7: Producer-consumer
Exercise 8: linked lists with tasks
False sharing
If independent data elements happen to sit on the same cache line, each update will cause the cache lines to slosh back and forth between threads.
This is called false sharing.
If you promote scalars to an array to support creation of an SPMD program, the array elements are contiguous in memory and hence share cache lines.
Result: poor scalability
Solution:
When updates to an item are frequent, work with local copies of data instead of an array indexed by the thread ID.
Pad arrays so elements you use are on distinct cache lines.
Exercise 3: SPMD Pi without false sharing

#include <omp.h>
static long num_steps = 100000;   double step;
#define NUM_THREADS 2
void main ()
{ double pi; step = 1.0/(double) num_steps;
  omp_set_num_threads(NUM_THREADS);
  #pragma omp parallel
  {
    int i, id, nthrds; double x, sum;
    id = omp_get_thread_num();
    nthrds = omp_get_num_threads();
    for (i=id, sum=0.0; i < num_steps; i=i+nthrds){
      x = (i+0.5)*step;
      sum += 4.0/(1.0+x*x);
    }
    #pragma omp critical
      pi += sum * step;
  }
}

No array, so no false sharing.
Create a scalar local to each thread to accumulate partial sums.
Sum goes out of scope beyond the parallel region, so you must sum it in here. Must protect summation into pi in a critical region so updates don't conflict.
Appendix: Solutions to exercises
Exercise 1: hello world
Exercise 2: Simple SPMD Pi program
Exercise 3: SPMD Pi without false sharing
Exercise 4: Loop level Pi
Exercise 5: Monte Carlo Pi and random numbers
Exercise 6: hard, linked lists without tasks
Exercise 6: easy, matrix multiplication
Exercise 7: Producer-consumer
Exercise 8: linked lists with tasks
Exercise 4: solution
#include <omp.h>
static long num_steps = 100000;   double step;
#define NUM_THREADS 2
void main ()
{ int i; double x, pi, sum = 0.0;
  step = 1.0/(double) num_steps;
  omp_set_num_threads(NUM_THREADS);
  #pragma omp parallel for private(x) reduction(+:sum)
  for (i=0; i < num_steps; i++){
    x = (i+0.5)*step;
    sum = sum + 4.0/(1.0+x*x);
  }
  pi = step * sum;
}

Note: we created a parallel program without changing any code and by adding 4 simple lines!
i is private by default
For good OpenMP implementations, reduction is more scalable than critical.
Appendix: Solutions to exercises
Exercise 1: hello world
Exercise 2: Simple SPMD Pi program
Exercise 3: SPMD Pi without false sharing
Exercise 4: Loop level Pi
Exercise 5: Monte Carlo Pi and random numbers
Exercise 6: hard, linked lists without tasks
Exercise 6: easy, matrix multiplication
Exercise 7: Producer-consumer
Exercise 8: linked lists with tasks
Monte Carlo Calculations
Using random numbers to solve tough problems
Sample a problem domain to estimate areas, compute probabilities, find optimal values, etc.
Example: Computing π with a digital dart board:
[Figure: a circle of radius r inscribed in a square of side 2*r]
Throw darts at the circle/square.
The chance of falling in the circle is proportional to the ratio of areas:
A_c = π * r²
A_s = (2*r) * (2*r) = 4 * r²
P = A_c/A_s = π/4
Compute π by randomly choosing points; count the fraction that falls in the circle and compute π.
N = 10:   π ≈ 2.8
N = 100:  π ≈ 3.16
N = 1000: π ≈ 3.148
Parallel programmers love Monte Carlo algorithms

Embarrassingly parallel: the parallelism is so easy it's embarrassing.
Add two lines and you have a parallel program.

#include <omp.h>
static long num_trials = 10000;
int main ()
{
  long i; long Ncirc = 0; double pi, x, y;
  double r = 1.0;   // radius of circle. Side of square is 2*r
  seed(0, -r, r);   // The circle and square are centered at the origin
  #pragma omp parallel for private (x, y) reduction (+:Ncirc)
  for(i=0; i < num_trials; i++)
  {
    x = random(); y = random();
    if ( x*x + y*y <= r*r ) Ncirc++;
  }
  pi = 4.0 * ((double)Ncirc/(double)num_trials);
  printf("\n %d trials, pi is %f \n", num_trials, pi);
}
If you pick the multiplier and addend correctly, an LCG has a period of PMOD.
Picking good LCG parameters is complicated, so look it up (Numerical Recipes is a good source). I used the following:
MULTIPLIER = 1366
ADDEND = 150889
PMOD = 714025

random_next = (MULTIPLIER * random_last + ADDEND) % PMOD;
random_last = random_next;
LCG code

static long MULTIPLIER = 1366;
static long ADDEND = 150889;
static long PMOD = 714025;
long random_last = 0;
double random ()
{
  long random_next;
  random_next = (MULTIPLIER * random_last + ADDEND) % PMOD;
  random_last = random_next;
  return ((double)random_next/(double)PMOD);
}

Seed the pseudo random sequence by setting random_last
Running the PI_MC program with the LCG generator

[Plot: log10 relative error vs. log10 number of samples, for LCG with one thread and three 4-thread trials]

Run the same program the same way and get different answers!
That is not acceptable!
Issue: my LCG generator is not threadsafe

Program written using the Intel C/C++ compiler (10.0.659.2005) in Microsoft Visual Studio 2005 (8.0.50727.42) and running on a dual-core laptop (Intel T2400 @ 1.83 Ghz with 2 GB RAM) running Microsoft Windows XP.
LCG code: threadsafe version

static long MULTIPLIER = 1366;
static long ADDEND = 150889;
static long PMOD = 714025;
long random_last = 0;
#pragma omp threadprivate(random_last)
double random ()
{
  long random_next;
  random_next = (MULTIPLIER * random_last + ADDEND) % PMOD;
  random_last = random_next;
  return ((double)random_next/(double)PMOD);
}

random_last carries state between random number computations.
To make the generator threadsafe, make random_last threadprivate so each thread has its own copy.
Thread safe random number generators

[Plot: log10 relative error vs. log10 number of samples, for LCG with one thread, three 4-thread trials, and the 4-thread thread safe version]

The thread safe version gives the same answer each time you run the program.
But for a large number of samples, its quality is lower than the one-thread result!
Why?
Pseudo Random Sequences

Random number Generators (RNGs) define a sequence of pseudo-random numbers of length equal to the period of the RNG
In a typical problem, you grab a subsequence of the RNG range
Seed determines the starting point
Grab arbitrary seeds and you may generate overlapping sequences
E.g. three sequences... the last one wraps at the end of the RNG period.
[Figure: subsequences for Thread 1, Thread 2, and Thread 3 laid out along the RNG period, with Thread 3's wrapping around and overlapping Thread 1's]
Overlapping sequences = over-sampling and bad statistics... lower quality or even wrong answers!
Parallel random number generators

Multiple threads cooperate to generate and use random numbers.
Solutions:
Replicate and Pray
Give each thread a separate, independent generator
Have one thread generate all the numbers.
Leapfrog: deal out sequence values "round robin" as if dealing a deck of cards.
Block method: pick your seed so each thread gets a distinct contiguous block.
Other than replicate and pray, these are difficult to implement. Be smart: buy a math library that does it right.

If done right, these can generate the same sequence regardless of the number of threads.
Nice for debugging, but not really needed scientifically.

Intel's Math Kernel Library supports all of these methods.
MKL Random number generators (RNG)

MKL includes several families of RNGs in its vector statistics library.
Specialized to efficiently generate vectors of random numbers

#define BLOCK 100
double buff[BLOCK];
VSLStreamStatePtr stream;
vslNewStream(&stream, VSL_BRNG_WH, (int)seed_val);
vdRngUniform (VSL_METHOD_DUNIFORM_STD, stream, BLOCK, buff, low, hi);
vslDeleteStream( &stream );

Initialize a stream of pseudo random numbers
Select the type of RNG and set the seed
Fill buff with BLOCK pseudo random numbers, uniformly distributed with values between lo and hi.
Delete the stream when you are done
Wichmann-Hill generators (WH)

WH is a family of 273 parameter sets, each defining a non-overlapping and independent RNG.
Easy to use: just make each stream threadprivate and initiate the RNG stream so each thread gets a unique WH RNG.

VSLStreamStatePtr stream;
#pragma omp threadprivate(stream)
vslNewStream(&stream, VSL_BRNG_WH+Thrd_ID, (int)seed);
1
Log10 number of samplesNotice that
once you get
8/6/2019 Omp Hands on SC08
137/153
137
0.0001
0.001
0.01
0.1
1
1 2 3 4 5 6
WH one
thread
WH, 2
threads
WH, 4
threads
Log10
Relat
iveerror
once you getbeyond the
high error,small samplecount range,
adding threadsdoesnt
decreasequality ofrandom
sampling.
Leap Frog method

Interleave samples in the sequence of pseudo random numbers:
Thread i starts at the ith number in the sequence
Stride through the sequence; stride length = number of threads.
Result: the same sequence of values regardless of the number of threads.

#pragma omp single
{  nthreads = omp_get_num_threads();
   iseed = PMOD/MULTIPLIER;   // just pick a seed
   pseed[0] = iseed;
   mult_n = MULTIPLIER;
   for (i = 1; i < nthreads; ++i)
   {
     iseed = (unsigned long long)((MULTIPLIER * iseed) % PMOD);
     pseed[i] = iseed;
     mult_n = (mult_n * MULTIPLIER) % PMOD;
   }
}
random_last = (unsigned long long) pseed[id];

One thread computes the offsets and the strided multiplier
LCG with Addend = 0 just to keep things simple
Each thread stores its offset starting point into its threadprivate "last random" value
Same sequence with many threads.

We can use the leapfrog method to generate the same answer for any number of threads

Steps      One thread   2 threads   4 threads
1000       3.156        3.156       3.156
10000      3.1168       3.1168      3.1168
100000     3.13964      3.13964     3.13964
1000000    3.140348     3.140348    3.140348
10000000   3.141658     3.141658    3.141658

Used the MKL library with two generator streams per computation: one for the x values (WH) and one for the y values (WH+1). Also used the leapfrog method to deal out iterations among threads.
Appendix: Solutions to exercises
Exercise 1: hello world
Exercise 2: Simple SPMD Pi program
Exercise 3: SPMD Pi without false sharing
Exercise 4: Loop level Pi
Exercise 5: Monte Carlo Pi and random numbers
Exercise 6: hard, linked lists without tasks
Exercise 6: easy, matrix multiplication
Exercise 7: Producer-consumer
Exercise 8: linked lists with tasks
Linked lists without tasks
See the file Linked_omp25.c

while (p != NULL) {
  p = p->next;
  count++;
}
Count number of items in the linked list

p = head;
for(i=0; i < count; i++) {
  parr[i] = p;
  p = p->next;
}
Copy a pointer to each node into an array

#pragma omp parallel
{
  #pragma omp for schedule(static,1)
  for(i=0; i < count; i++)
    processwork(parr[i]);
}
Process nodes in parallel with a for loop

C++ STL version:
for (p = head; p != NULL; p = p->next)
  nodelist.push_back(p);
int j = (int)nodelist.size();
#pragma omp parallel for schedule(static,1)
for (int i = 0; i < j; ++i)
  processwork(nodelist[i]);

Results on an Intel dual core 1.83 GHz CPU, Intel IA-32 compiler 10.1 build 2:

              C, (static,1)   C++, (static,1)   C++, default sched.
One Thread    45 seconds      49 seconds        37 seconds
Two Threads   28 seconds      32 seconds        47 seconds
Matrix multiplication

#pragma omp parallel for private(tmp, i, j, k)
for (i=0; i<N; i++){
  for (j=0; j<N; j++){
    tmp = 0.0;
    for(k=0; k<N; k++){
      tmp += A[i][k] * B[k][j];
    }
    C[i][j] = tmp;
  }
}
Exercise 3: SPMD Pi without false sharing
Exercise 4: Loop level Pi
Exercise 5: Monte Carlo Pi and random numbers
Exercise 6: hard, linked lists without tasks
Exercise 6: easy, matrix multiplication
Exercise 7: Producer-consumer
Exercise 8: linked lists with tasks
Pairwise synchronization in OpenMP

OpenMP lacks synchronization constructs that work between pairs of threads.
When this is needed you have to build it yourself.
Pairwise synchronization:
Use a shared flag variable
Reader spins waiting for the new flag value
Use flushes to force updates to and from memory
Exercise 7: producer consumer

int main()
{
  double *A, sum, runtime; int numthreads, flag = 0;
  A = (double *)malloc(N*sizeof(double));
  #pragma omp parallel sections
  {
    #pragma omp section
    {
      fill_rand(N, A);
      #pragma omp flush
      flag = 1;
      #pragma omp flush (flag)
    }
    #pragma omp section
    {
      #pragma omp flush (flag)
      while (flag != 1){
        #pragma omp flush (flag)
      }
      #pragma omp flush
      sum = Sum_array(N, A);
    }
  }
}

Use flag to signal when the produced value is ready
Flush forces a refresh to memory. Guarantees that the other thread sees the new value of A
Notice you must put the flush inside the while loop to make sure the updated flag variable is seen
Flush needed on both the reader and writer sides of the communication
Appendix: Solutions to exercises
Exercise 1: hello world
Exercise 2: Simple SPMD Pi program
Exercise 3: SPMD Pi without false sharing
Exercise 4: Loop level Pi
Exercise 5: Monte Carlo Pi and random numbers
Exercise 6: hard, linked lists without tasks
Exercise 6: easy, matrix multiplication
Exercise 7: Producer-consumer
Exercise 8: linked lists with tasks
Linked lists with tasks (intel taskq)
See the file Linked_intel_taskq.c

#pragma omp parallel
{
  #pragma intel omp taskq
  {
    while (p != NULL) {
      #pragma intel omp task captureprivate(p)
        processwork(p);
      p = p->next;
    }
  }
}

Results on an Intel dual core 1.83 GHz CPU, Intel IA-32 compiler 10.1 build 2:

              Array, (static,1)   Intel taskq
One Thread    45 seconds          48 seconds
Two Threads   28 seconds          30 seconds
Linked lists with tasks (OpenMP 3)
See the file Linked_intel_taskq.c

#pragma omp parallel
{
  #pragma omp single
  {
    p = head;
    while (p) {
      #pragma omp task firstprivate(p)
        processwork(p);
      p = p->next;
    }
  }
}

Creates a task with its own copy of p, initialized to the value of p when the task is defined
Compiler notes: Intel on Windows

Intel compiler:
   Launch the SW dev environment. On my laptop I use:
      start / Intel software development tools / Intel C++
      compiler 10.1 / C++ build environment for 32-bit apps
   cd to the directory that holds your source code
   Build software for program foo.c:
      icl /Qopenmp foo.c
   Set the number-of-threads environment variable:
      set OMP_NUM_THREADS=4
   Run your program:
      foo.exe

To get rid of the pwd on the prompt, type: prompt = %
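The slides cover only the Intel compiler on Windows. For reference, the analogous steps on Linux with gcc (an addition here, not part of the original tutorial) would be:

```shell
# gcc enables OpenMP with -fopenmp (the counterpart of icl's /Qopenmp)
gcc -fopenmp foo.c -o foo

# Set the thread count via the same OMP_NUM_THREADS environment variable
export OMP_NUM_THREADS=4

# Run the program
./foo
```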
Compiler notes: Visual Studio

Start a new project:
   Select Win32 console project
   Set name and path
   On the next panel, click "next" instead of "finish" so you can
   select an empty project on the following panel.
   Drag and drop your source file into the source folder in the
   Visual Studio solution explorer
Activate OpenMP:
   Go to project properties / configuration properties / C/C++ /
   language, and activate OpenMP
Set the number of threads inside the program
Build the project
Run without debugging from the debug menu
Notes from the SC08 tutorial

It seemed to go very well. We had over 50 people who stuck it out throughout the day.

Most people brought their laptops (only 7 loaner laptops were used). And of those with laptops, most had preloaded an OS.

The chaos at the beginning was much less than I expected. I had fears of an hour or longer to get everyone set up. But thanks to PGI providing a license key in a temp file, we were able to get everyone installed in short order.

Having a team of 4 (two speakers and two assistants) worked well. It would have been messier without a hardcore compiler person such as Larry. With dozens of different configurations, he had to do some serious trouble-shooting to get the most difficult cases up and running.

The exercises used early in the course were good. The ones after lunch were too hard. I need to refine the exercise set. One idea is, for each slot, to have an easy exercise and a hard exercise. This will help me keep the group's experience more balanced.

Most people didn't get the threadprivate exercise. The few people who tried the linked-list exercise were amazingly creative; one even got a single/nowait version to work.