8/6/2019 Omp Hands on SC08
A Hands-on Introduction to OpenMP*

Tim Mattson, Principal Engineer, Intel Corporation
Larry Meadows, Principal Engineer, Intel Corporation

* The name OpenMP is the property of the OpenMP Architecture Review Board.
Preliminaries: part 1 Disclosures
- The views expressed in this tutorial are those of the people delivering the tutorial. We are not speaking for our employers, and we are not speaking for the OpenMP ARB.
- This is a new tutorial for us: help us improve it by telling us how you would make this tutorial better.
Preliminaries: Part 2 Our plan for the day ... active learning!
- We will mix short lectures with short exercises. You will use your laptop for the exercises; that way you'll have an OpenMP environment to take home so you can keep learning on your own.
- Please follow these simple rules:
  - Do the exercises we assign, and then change things around and experiment.
  - Embrace active learning!
  - Don't cheat: do not look at the solutions before you complete an exercise, even if you get really frustrated.
Our Plan for the day

Topic                         | Exercise                | Concepts
I. OMP Intro                  | Install sw, hello_world | Parallel regions
II. Creating threads          | Pi_spmd_simple          | Parallel, default data environment, runtime library calls
III. Synchronization          | Pi_spmd_final           | False sharing, critical, atomic
IV. Parallel loops            | Pi_loop                 | For, reduction
V. Odds and ends              | No exercise             | Single, master, runtime libraries, environment variables, synchronization, etc.
VI. Data Environment          | Pi_mc                   | Data environment details, modular software, threadprivate
VII. Worksharing and schedule | Linked list, matmul     | For, schedules, sections
VIII. Memory model            | Producer consumer       | Point-to-point synch with flush
IX. OpenMP 3 and tasks        | Linked list             | Tasks and other OpenMP 3 features

(Breaks and lunch are interspersed between the sections.)
Outline
- Introduction to OpenMP
- Creating Threads
- Synchronization
- Parallel Loops
- Synchronize single masters and stuff
- Data environment
- Schedule your for and sections
- Memory model
- OpenMP 3.0 and Tasks
OpenMP* Overview:
omp_set_lock(lck)
#pragma omp parallel for private(A, B)
#pragma omp critical
C$OMP parallel do shared(a, b, c)
C$OMP PARALLEL REDUCTION (+: A, B)
call OMP_INIT_LOCK (ilok)
call omp_test_lock(jlok)
setenv OMP_SCHEDULE dynamic
CALL OMP_SET_NUM_THREADS(10)
C$OMP DO lastprivate(XX)
C$OMP ORDERED
C$OMP SINGLE PRIVATE(X)
C$OMP SECTIONS
C$OMP MASTER
C$OMP ATOMIC
C$OMP FLUSH
C$OMP PARALLEL DO ORDERED PRIVATE (A, B, C)
C$OMP THREADPRIVATE(/ABC/)
C$OMP PARALLEL COPYIN(/blk/)
Nthrds = OMP_GET_NUM_PROCS()
!$OMP BARRIER
OpenMP: An API for Writing Multithreaded Applications
- A set of compiler directives and library routines for parallel application programmers
- Greatly simplifies writing multi-threaded (MT) programs in Fortran, C and C++
- Standardizes the last 20 years of SMP practice
* The name OpenMP is the property of the OpenMP Architecture Review Board.
OpenMP Basic Defs: Solution Stack

User layer:   End User | Application
Prog. layer:  Directives, Compiler | OpenMP library | Environment variables
System layer: OpenMP Runtime library | OS/system support for shared memory and threading
HW:           Proc1 | Proc2 | Proc3 | ... | ProcN, over a Shared Address Space
OpenMP core syntax
- Most of the constructs in OpenMP are compiler directives:
      #pragma omp construct [clause [clause] ...]
  Example:
      #pragma omp parallel num_threads(4)
- Function prototypes and types are in the file:
      #include <omp.h>
- Most OpenMP* constructs apply to a structured block: a block of one or more statements with one point of entry at the top and one point of exit at the bottom.
- It's OK to have an exit() within the structured block.
Exercise 1, Part A: Hello world
Verify that your environment works.
- Write a program that prints "hello world".

    #include <stdio.h>
    int main()
    {
        int ID = 0;
        printf(" hello(%d) ", ID);
        printf(" world(%d) \n", ID);
        return 0;
    }
Exercise 1, Part B: Hello world
Verify that your OpenMP environment works.
- Write a multithreaded program that prints "hello world".

    #include <stdio.h>
    #include <omp.h>
    int main()
    {
        #pragma omp parallel
        {
            int ID = 0;
            printf(" hello(%d) ", ID);
            printf(" world(%d) \n", ID);
        }
        return 0;
    }

Switches for compiling and linking:
    gcc:   -fopenmp
    pgi:   -mp
    intel: /Qopenmp
Exercise 1: Solution
A multi-threaded "Hello world" program
- Write a multithreaded program where each thread prints "hello world".

    #include <omp.h>                         // OpenMP include file
    #include <stdio.h>
    int main()
    {
        #pragma omp parallel                 // parallel region with default number of threads
        {
            int ID = omp_get_thread_num();   // runtime library function to return a thread ID
            printf(" hello(%d) ", ID);
            printf(" world(%d) \n", ID);
        }                                    // end of the parallel region
        return 0;
    }

Sample output (4 threads; the interleaving varies from run to run):
    hello(1) hello(0) world(1)
    world(0)
    hello(3) hello(2) world(3)
    world(2)
OpenMP Overview: How do threads interact?
- OpenMP is a multi-threaded, shared-address model: threads communicate by sharing variables.
- Unintended sharing of data causes race conditions.
  Race condition: when the program's outcome changes as the threads are scheduled differently.
- To control race conditions, use synchronization to protect data conflicts.
- Synchronization is expensive, so change how data is accessed to minimize the need for synchronization.
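To make the race concrete, here is a small sketch (the function names are mine, not the tutorial's; assumes compilation with -fopenmp): the unprotected counter can lose updates, while the synchronized version never does.

```c
#include <omp.h>

/* Racy: the shared counter is updated with an unprotected
   read-modify-write, so concurrent threads can lose increments. */
long count_racy(long n)
{
    long count = 0;
    #pragma omp parallel for
    for (long i = 0; i < n; i++)
        count++;                /* race: outcome depends on scheduling */
    return count;
}

/* Fixed: synchronization protects the data conflict. */
long count_safe(long n)
{
    long count = 0;
    #pragma omp parallel for
    for (long i = 0; i < n; i++) {
        #pragma omp atomic      /* one thread at a time updates count */
        count++;
    }
    return count;               /* always equals n */
}
```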
Outline
- Introduction to OpenMP
- Creating Threads
- Synchronization
- Parallel Loops
- Synchronize single masters and stuff
- Data environment
- Schedule your for and sections
- Memory model
- OpenMP 3.0 and Tasks
OpenMP Programming Model: Fork-Join Parallelism
- The master thread spawns a team of threads as needed.
- Parallelism is added incrementally until performance goals are met: i.e., the sequential program evolves into a parallel program.
(Figure: sequential parts run on the master thread, alternating with parallel regions, which may themselves contain nested parallel regions.)
Thread Creation: Parallel Regions
- You create threads in OpenMP* with the parallel construct.
- For example, to create a 4-thread parallel region:

    double A[1000];
    omp_set_num_threads(4);              // runtime function to request a certain number of threads
    #pragma omp parallel
    {
        int ID = omp_get_thread_num();   // runtime function returning a thread ID
        pooh(ID, A);
    }

- Each thread calls pooh(ID, A) for ID = 0 to 3.
- Each thread executes a copy of the code within the structured block.

* The name OpenMP is the property of the OpenMP Architecture Review Board
Thread Creation: Parallel Regions
- You create threads in OpenMP* with the parallel construct.
- For example, to create a 4-thread parallel region:

    double A[1000];
    #pragma omp parallel num_threads(4)  // clause to request a certain number of threads
    {
        int ID = omp_get_thread_num();   // runtime function returning a thread ID
        pooh(ID, A);
    }

- Each thread calls pooh(ID, A) for ID = 0 to 3.
- Each thread executes a copy of the code within the structured block.

* The name OpenMP is the property of the OpenMP Architecture Review Board
Thread Creation: Parallel Regions example
- Each thread executes the same code redundantly.

    double A[1000];                      // a single copy of A is shared between all threads
    omp_set_num_threads(4);
    #pragma omp parallel
    {
        int ID = omp_get_thread_num();
        pooh(ID, A);                     // pooh(0,A), pooh(1,A), pooh(2,A) and pooh(3,A) run concurrently
    }
    printf("all done\n");                // threads wait here for all threads to finish
                                         // before proceeding (i.e., a barrier)

* The name OpenMP is the property of the OpenMP Architecture Review Board
Exercises 2 to 4: Numerical Integration

Mathematically, we know that:

    ∫₀¹ 4.0 / (1 + x²) dx = π

We can approximate the integral as a sum of rectangles:

    Σ (i = 0 to N) F(xᵢ) Δx ≈ π

where each rectangle has width Δx and height F(xᵢ) at the middle of interval i.

(Figure: the curve F(x) = 4.0 / (1 + x²) on [0, 1], approximated by rectangles.)
Exercises 2 to 4: Serial PI Program

    static long num_steps = 100000;
    double step;
    void main ()
    {
        int i; double x, pi, sum = 0.0;
        step = 1.0/(double) num_steps;
        for (i = 0; i < num_steps; i++) {
            x = (i+0.5)*step;
            sum = sum + 4.0/(1.0+x*x);
        }
        pi = step * sum;
    }
Exercise 2
- Create a parallel version of the pi program using a parallel construct.
- Pay close attention to shared versus private variables.
- In addition to a parallel construct, you will need the runtime library routines:
    int omp_get_num_threads();   // number of threads in the team
    int omp_get_thread_num();    // thread ID or rank
    double omp_get_wtime();      // time in seconds since a fixed point in the past
Synchronization
High level synchronization:
- critical
- atomic
- barrier
- ordered
Low level synchronization (discussed later):
- flush
- locks (both simple and nested)

Synchronization is used to impose order constraints and to protect access to shared data.
Synchronization: critical
- Mutual exclusion: only one thread at a time can enter a critical region.

    float res;
    #pragma omp parallel
    {
        float B; int i, id, nthrds;
        id = omp_get_thread_num();
        nthrds = omp_get_num_threads();
        for (i = id; i < niters; i += nthrds) {
            B = big_job(i);
            #pragma omp critical
            consume(B, res);    // threads wait their turn: only one
                                // at a time calls consume()
        }
    }
Synchronization: Atomic
- Atomic provides mutual exclusion but only applies to the update of a memory location (the update of X in the following example):

    #pragma omp parallel
    {
        double tmp, B;
        B = DOIT();
        #pragma omp atomic
        X += big_ugly(B);
    }

    #pragma omp parallel
    {
        double tmp, B;
        B = DOIT();
        tmp = big_ugly(B);
        #pragma omp atomic
        X += tmp;
    }

- Atomic only protects the read/update of X.
Exercise 3
- In exercise 2, you probably used an array to create space for each thread to store its partial sum.
- If array elements happen to share a cache line, this leads to false sharing: non-shared data in the same cache line, so each update invalidates the cache line, in essence sloshing independent data back and forth between threads.
- Modify your pi program from exercise 2 to avoid false sharing due to the sum array.
Outline
- Introduction to OpenMP
- Creating Threads
- Synchronization
- Parallel Loops
- Synchronize single masters and stuff
- Data environment
- Schedule your for and sections
- Memory model
- OpenMP 3.0 and Tasks
SPMD vs. worksharing
- A parallel construct by itself creates an SPMD or "Single Program Multiple Data" program, i.e., each thread redundantly executes the same code.
- How do you split up pathways through the code between threads within a team? This is called worksharing:
  - Loop construct
  - Sections/section constructs
  - Single construct
  - Task construct (discussed later; coming in OpenMP 3.0)
The loop worksharing Constructs
- The loop worksharing construct splits up loop iterations among the threads in a team:

    #pragma omp parallel
    {
        #pragma omp for
        for (I = 0; I < N; I++) {
            NEAT_STUFF(I);      // the loop variable I is made "private"
                                // to each thread by default
        }
    }
Loop worksharing Constructs: A motivating example

Sequential code:

    for (i = 0; i < N; i++) { a[i] = a[i] + b[i]; }

OpenMP parallel region (SPMD, splitting the range by hand):

    #pragma omp parallel
    {
        int id, i, Nthrds, istart, iend;
        id = omp_get_thread_num();
        Nthrds = omp_get_num_threads();
        istart = id * N / Nthrds;
        iend = (id + 1) * N / Nthrds;
        if (id == Nthrds - 1) iend = N;
        for (i = istart; i < iend; i++) { a[i] = a[i] + b[i]; }
    }

OpenMP parallel region and a worksharing for construct:

    #pragma omp parallel
    #pragma omp for
    for (i = 0; i < N; i++) { a[i] = a[i] + b[i]; }
Combined parallel/worksharing construct
- OpenMP shortcut: put the parallel and the worksharing directive on the same line.

These are equivalent:

    double res[MAX]; int i;
    #pragma omp parallel
    {
        #pragma omp for
        for (i = 0; i < MAX; i++) {
            res[i] = huge();
        }
    }

    double res[MAX]; int i;
    #pragma omp parallel for
    for (i = 0; i < MAX; i++) {
        res[i] = huge();
    }
Working with loops
Basic approach:
- Find compute intensive loops.
- Make the loop iterations independent, so they can safely execute in any order without loop-carried dependencies.
- Place the appropriate OpenMP directive and test.

Before (loop-carried dependence on j):

    int i, j, A[MAX];
    j = 5;
    for (i = 0; i < MAX; i++) {
        j += 2;
        A[i] = big(j);
    }

After (dependence removed; note the loop index i is private by default):

    int i, A[MAX];
    #pragma omp parallel for
    for (i = 0; i < MAX; i++) {
        int j = 5 + 2*(i+1);   // same sequence of j values as the serial loop
        A[i] = big(j);
    }
Reduction

How do we handle this case?

    double ave = 0.0, A[MAX]; int i;
    for (i = 0; i < MAX; i++) {
        ave += A[i];
    }
    ave = ave/MAX;

- We are combining values into a single accumulation variable (ave); there is a true dependence between loop iterations that can't be trivially removed.
- This is a very common situation; it is called a "reduction".
- Support for reduction operations is included in most parallel programming environments.
Reduction
- OpenMP reduction clause:
      reduction (op : list)
- Inside a parallel or a work-sharing construct:
  - A local copy of each list variable is made and initialized depending on the op (e.g. 0 for "+").
  - The compiler finds standard reduction expressions containing op and uses them to update the local copy.
  - Local copies are reduced into a single value and combined with the original global value.
- The variables in "list" must be shared in the enclosing parallel region.

    double ave = 0.0, A[MAX]; int i;
    #pragma omp parallel for reduction(+:ave)
    for (i = 0; i < MAX; i++) {
        ave += A[i];
    }
    ave = ave/MAX;
OpenMP: Reduction operands/initial-values
- Many different associative operands can be used with reduction.
- Initial values are the ones that make sense mathematically.

Operator | Initial value
+        | 0
*        | 1
-        | 0

C/C++ only:
Operator | Initial value
&        | ~0
|        | 0
^        | 0
&&       | 1
||       | 0

Fortran only:
Operator | Initial value
.AND.    | .true.
.OR.     | .false.
.NEQV.   | .false.
.EQV.    | .true.
.IEOR.   | 0
.IOR.    | 0
.IAND.   | All bits on
MIN*     | Largest pos. number
MAX*     | Most neg. number
Exercise 4
- Go back to the serial pi program and parallelize it with a loop construct.
- Your goal is to minimize the number of changes made to the serial program.
Outline
- Introduction to OpenMP
- Creating Threads
- Synchronization
- Parallel Loops
- Synchronize single masters and stuff
- Data environment
- Schedule your for and sections
- Memory model
- OpenMP 3.0 and Tasks
Synchronization: Barrier
- Barrier: each thread waits until all threads arrive.

    #pragma omp parallel shared(A, B, C) private(id)
    {
        id = omp_get_thread_num();
        A[id] = big_calc1(id);
        #pragma omp barrier
        #pragma omp for
        for (i = 0; i < N; i++) { C[i] = big_calc3(i, A); }
        // implicit barrier at the end of a for worksharing construct
        #pragma omp for nowait
        for (i = 0; i < N; i++) { B[i] = big_calc2(C, i); }
        // no implicit barrier due to nowait
        A[id] = big_calc4(id);
    }   // implicit barrier at the end of a parallel region
Master Construct
- The master construct denotes a structured block that is only executed by the master thread.
- The other threads just skip it (no synchronization is implied).

    #pragma omp parallel
    {
        do_many_things();
        #pragma omp master
        { exchange_boundaries(); }
        #pragma omp barrier
        do_many_other_things();
    }
Single worksharing Construct
- The single construct denotes a block of code that is executed by only one thread (not necessarily the master thread).
- A barrier is implied at the end of the single block (you can remove the barrier with a nowait clause).

    #pragma omp parallel
    {
        do_many_things();
        #pragma omp single
        { exchange_boundaries(); }
        do_many_other_things();
    }
Synchronization: ordered
- The ordered region executes in the sequential order.

    #pragma omp parallel private(tmp)
    #pragma omp for ordered reduction(+:res)
    for (I = 0; I < N; I++) {
        tmp = NEAT_STUFF(I);
        #pragma omp ordered
        res += consum(tmp);
    }
Synchronization: Lock routines
- Simple lock routines: a simple lock is available if it is unset.
    omp_init_lock(), omp_set_lock(), omp_unset_lock(), omp_test_lock(), omp_destroy_lock()
- Nested locks: a nested lock is available if it is unset, or if it is set but owned by the thread executing the nested lock function.
    omp_init_nest_lock(), omp_set_nest_lock(), omp_unset_nest_lock(), omp_test_nest_lock(), omp_destroy_nest_lock()
- A lock implies a memory fence (a "flush") of all thread-visible variables.
- Note: a thread always accesses the most recent copy of the lock, so you don't need to use a flush on the lock variable.
Synchronization: Simple Locks
- Protect resources with locks.

    omp_lock_t lck;
    omp_init_lock(&lck);
    #pragma omp parallel private(tmp, id)
    {
        id = omp_get_thread_num();
        tmp = do_lots_of_work(id);
        omp_set_lock(&lck);       // wait here for your turn
        printf("%d %d", id, tmp);
        omp_unset_lock(&lck);     // release the lock so the next thread gets a turn
    }
    omp_destroy_lock(&lck);       // free up storage when done
Runtime Library routines
- Runtime environment routines:
  - Modify/check the number of threads:
      omp_set_num_threads(), omp_get_num_threads(), omp_get_thread_num(), omp_get_max_threads()
  - Are we in an active parallel region?
      omp_in_parallel()
  - Do you want the system to dynamically vary the number of threads from one parallel construct to another?
      omp_set_dynamic(), omp_get_dynamic()
  - How many processors in the system?
      omp_get_num_procs()
- ... plus a few less commonly used routines.
Runtime Library routines
- To use a known, fixed number of threads in a program: (1) tell the system that you don't want dynamic adjustment of the number of threads, (2) set the number of threads, then (3) save the number you got.

    #include <omp.h>
    void main()
    {
        int num_threads;
        omp_set_dynamic(0);                          // disable dynamic adjustment of the number of threads
        omp_set_num_threads(omp_get_num_procs());    // request as many threads as you have processors
        #pragma omp parallel
        {
            int id = omp_get_thread_num();
            #pragma omp single                       // protect this op since memory stores are not atomic
            num_threads = omp_get_num_threads();
            do_lots_of_stuff(id);
        }
    }

- Even in this case, the system may give you fewer threads than requested. If the precise # of threads matters, test for it and respond accordingly.
Environment Variables
- Set the default number of threads to use:
      OMP_NUM_THREADS int_literal
- Control how "omp for schedule(RUNTIME)" loop iterations are scheduled:
      OMP_SCHEDULE "schedule[, chunk_size]"
- ... plus several less commonly used environment variables.
Outline
- Introduction to OpenMP
- Creating Threads
- Synchronization
- Parallel Loops
- Synchronize single masters and stuff
- Data environment
- Schedule your for and sections
- Memory model
- OpenMP 3.0 and Tasks
Data environment: Default storage attributes
- Shared memory programming model: most variables are shared by default.
- Global variables are SHARED among threads:
  - Fortran: COMMON blocks, SAVE variables, MODULE variables
  - C: file scope variables, static
  - Both: dynamically allocated memory (ALLOCATE, malloc, new)
- But not everything is shared...
  - Stack variables in subprograms (Fortran) or functions (C) called from parallel regions are PRIVATE.
  - Automatic variables within a statement block are PRIVATE.
Data sharing: Examples

    double A[10];
    int main() {
        int index[10];
        #pragma omp parallel
        work(index);
        printf("%d\n", index[0]);
    }

    extern double A[10];
    void work(int *index) {
        double temp[10];
        static int count;
        ...
    }

- A, index and count are shared by all threads.
- temp is local to each thread.
* Third party trademarks and names are the property of their respective owner.
Data sharing: Changing storage attributes
- One can selectively change storage attributes for constructs using the following clauses*:
  - SHARED
  - PRIVATE
  - FIRSTPRIVATE
- The final value of a private inside a parallel loop can be transmitted to the shared variable outside the loop with:
  - LASTPRIVATE
- The default attributes can be overridden with:
  - DEFAULT (PRIVATE | SHARED | NONE)
- All the clauses on this page apply to the OpenMP construct, NOT to the entire region.

* All data clauses apply to parallel constructs and worksharing constructs, except "shared", which only applies to parallel constructs. DEFAULT(PRIVATE) is Fortran only.
Data Sharing: Private Clause
- private(var) creates a new local copy of var for each thread.
  - The value is uninitialized.
  - In OpenMP 2.5 the value of the shared variable is undefined after the region.

    void wrong() {
        int tmp = 0;
        #pragma omp for private(tmp)
        for (int j = 0; j < 1000; ++j)
            tmp += j;              // tmp was not initialized
        printf("%d\n", tmp);       // tmp: 0 in 3.0, unspecified in 2.5
    }
Data Sharing: Private Clause
When is the original variable valid?
- The original variable's value is unspecified in OpenMP 2.5.
- In OpenMP 3.0, if it is referenced outside of the construct, implementations may reference the original variable or a copy... a dangerous programming practice!

    int tmp;
    void danger() {
        tmp = 0;
        #pragma omp parallel private(tmp)
        work();                    // unspecified which copy of tmp work() updates
        printf("%d\n", tmp);       // tmp has unspecified value
    }

    extern int tmp;
    void work() {
        tmp = 5;
    }
Data Sharing: Firstprivate Clause
- Firstprivate is a special case of private.
- Initializes each private copy with the corresponding value from the master thread.

    void useless() {
        int tmp = 0;
        #pragma omp for firstprivate(tmp)
        for (int j = 0; j < 1000; ++j)
            tmp += j;              // each thread gets its own tmp with an initial value of 0
        printf("%d\n", tmp);       // tmp: 0 in 3.0, unspecified in 2.5
    }
Data sharing: Lastprivate Clause
- Lastprivate passes the value of a private from the last iteration to a global variable.

    void closer() {
        int tmp = 0;
        #pragma omp parallel for firstprivate(tmp) \
                                 lastprivate(tmp)
        for (int j = 0; j < 1000; ++j)
            tmp += j;              // each thread gets its own tmp with an initial value of 0
        printf("%d\n", tmp);       // tmp is defined as its value at the last
                                   // sequential iteration (i.e., for j = 999)
    }
Data Sharing: A data environment test
- Consider this example of PRIVATE and FIRSTPRIVATE:

    variables A, B, and C = 1
    #pragma omp parallel private(B) firstprivate(C)

- Are A, B, C local to each thread or shared inside the parallel region? What are their initial values inside, and their values after, the parallel region?

Inside this parallel region:
- "A" is shared by all threads; equals 1.
- "B" and "C" are local to each thread.
  - B's initial value is undefined.
  - C's initial value equals 1.

Outside this parallel region:
- The values of "B" and "C" are unspecified in OpenMP 2.5, and in OpenMP 3.0 if referenced in the region but outside the construct.
Data Sharing: Default Clause
- Note that the default storage attribute is DEFAULT(SHARED), so there is no need to use it. Exception: #pragma omp task.
- To change the default: DEFAULT(PRIVATE)
  - Each variable in the construct is made private as if specified in a private clause.
  - Mostly saves typing.
- DEFAULT(NONE): no default for variables in static extent. Must list the storage attribute for each variable in static extent. Good programming practice!
- Only the Fortran API supports default(private). C/C++ only has default(shared) or default(none).
Data Sharing: tasks (OpenMP 3.0)
- The default for tasks is usually firstprivate, because the task may not be executed until later (and variables may have gone out of scope).
- Variables that are shared in all constructs starting from the innermost enclosing parallel construct are shared, because the barrier guarantees task completion.

    #pragma omp parallel shared(A) private(B)
    {
        ...
        #pragma omp task
        {
            int C;
            compute(A, B, C);   // A is shared, B is firstprivate, C is private
        }
    }
Data sharing: Threadprivate
- Makes global data private to a thread:
  - Fortran: COMMON blocks
  - C: file scope and static variables, static class members
- Different from making them PRIVATE:
  - With PRIVATE, global variables are masked.
  - THREADPRIVATE preserves global scope within each thread.
- Threadprivate variables can be initialized using COPYIN or at time of definition (using language-defined initialization capabilities).
A threadprivate example (C)
Use threadprivate to create a counter for each thread.

    int counter = 0;
    #pragma omp threadprivate(counter)

    int increment_counter()
    {
        counter++;
        return (counter);
    }
Data Copying: Copyin
You initialize threadprivate data using a copyin clause.

          parameter (N=1000)
          common/buf/A(N)
    !$OMP THREADPRIVATE(/buf/)

    C     Initialize the A array
          call init_data(N,A)

    !$OMP PARALLEL COPYIN(A)

    C     Now each thread sees threadprivate array A initialized
    C     to the global value set in the subroutine init_data()

    !$OMP END PARALLEL

          end
Exercise 5: Monte Carlo Calculations
Using random numbers to solve tough problems
- Sample a problem domain to estimate areas, compute probabilities, find optimal values, etc.
- Example: computing π with a digital dart board of side 2*r:
  - Throw darts at the circle/square.
  - The chance of falling in the circle is proportional to the ratio of areas:
        Ac = r² * π
        As = (2*r) * (2*r) = 4 * r²
        P = Ac/As = π/4
  - Compute π by randomly choosing points; count the fraction that falls in the circle; compute pi.

    N = 10:   π ≈ 2.8
    N = 100:  π ≈ 3.16
    N = 1000: π ≈ 3.148
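For reference, the serial dart-throwing loop might look like this sketch (using C's rand() for brevity; the exercise's own random.c replaces it, and pi_mc is my name):

```c
#include <stdlib.h>

/* Throw num_trials darts at the unit square; the fraction landing
   inside the quarter circle approaches pi/4. */
double pi_mc(long num_trials)
{
    long in_circle = 0;
    srand(12345);                        /* fixed seed: repeatable result */
    for (long i = 0; i < num_trials; i++) {
        double x = (double) rand() / RAND_MAX;
        double y = (double) rand() / RAND_MAX;
        if (x * x + y * y <= 1.0)        /* dart landed inside the circle */
            in_circle++;
    }
    return 4.0 * (double) in_circle / (double) num_trials;
}
```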
Exercise 5
- We provide three files for this exercise:
  - pi_mc.c: the Monte Carlo method pi program
  - random.c: a simple random number generator
  - random.h: include file for the random number generator
- Create a parallel version of this program without changing the interfaces to functions in random.c.
  - This is an exercise in modular software: why should a user of your parallel random number generator have to know any details of the generator, or make any changes to how the generator is called?
- Extra credit:
  - Make the random number generator threadsafe.
  - Make your random number generator numerically correct (non-overlapping sequences of pseudo-random numbers).
Outline
- Introduction to OpenMP
- Creating Threads
- Synchronization
- Parallel Loops
- Synchronize single masters and stuff
- Data environment
- Schedule your for and sections
- Memory model
- OpenMP 3.0 and Tasks
Sections worksharing Construct
- The sections worksharing construct gives a different structured block to each thread.

    #pragma omp parallel
    {
        #pragma omp sections
        {
            #pragma omp section
            X_calculation();
            #pragma omp section
            y_calculation();
            #pragma omp section
            z_calculation();
        }
    }

- By default, there is a barrier at the end of the "omp sections". Use the "nowait" clause to turn off the barrier.
loop worksharing constructs: The schedule clause
- The schedule clause affects how loop iterations are mapped onto threads.
  - schedule(static [,chunk]): deal out blocks of iterations of size "chunk" to each thread.
  - schedule(dynamic [,chunk]): each thread grabs "chunk" iterations off a queue until all iterations have been handled.
  - schedule(guided [,chunk]): threads dynamically grab blocks of iterations. The size of the block starts large and shrinks down to size "chunk" as the calculation proceeds.
  - schedule(runtime): schedule and chunk size taken from the OMP_SCHEDULE environment variable (or the runtime library for OpenMP 3.0).
The schedule clause

Schedule clause | When to use
STATIC          | Pre-determined and predictable by the programmer. Least work at runtime: scheduling done at compile-time.
DYNAMIC         | Unpredictable, highly variable work per iteration. Most work at runtime: complex scheduling logic used at run-time.
GUIDED          | Special case of dynamic to reduce scheduling overhead.
Exercise 6: hard
- Consider the program linked.c: it traverses a linked list, computing a sequence of Fibonacci numbers at each node.
- Parallelize this program using constructs defined in OpenMP 2.5 (loop worksharing constructs).
- Once you have a correct program, optimize it.
Exercise 6: easy
- Parallelize the matrix multiplication program in the file matmul.c.
- Can you optimize the program by playing with how the loops are scheduled?
Outline
- Introduction to OpenMP
- Creating Threads
- Synchronization
- Parallel Loops
- Synchronize single masters and stuff
- Data environment
- Schedule your for and sections
- Memory model
- OpenMP 3.0 and Tasks
OpenMP memory model
- OpenMP supports a shared memory model.
- All threads share an address space, but it can get complicated:
  (Figure: proc1 ... procN, each with its own cache holding a copy of variable "a", above one shared memory.)
- A memory model is defined in terms of:
  - Coherence: behavior of the memory system when a single address is accessed by multiple threads.
  - Consistency: orderings of accesses to different addresses by multiple threads.
OpenMP Memory Model: Basic Terms
(Figure: source code in "program order" is compiled into executable code whose reads and writes may appear in any semantically equivalent "code order"; the machine then commits memory operations in yet another "commit order". Each thread also has a private/threadprivate view of variables a and b.)

    Program order: Wa Wb Ra Rb ...
    Code order:    Wb Rb Wa Ra ...
Consistency: Memory Access Re-ordering
Re-ordering:
- The compiler re-orders program order into the code order.
- The machine re-orders code order into the memory commit order.
- At a given point in time, the "temporary" view of memory may vary from shared memory.
- Consistency models are based on orderings of Reads (R), Writes (W) and Synchronizations (S):
  R→R, W→W, R→W, R→S, S→S, W→S
Consistency
- Sequential Consistency: in a multi-processor, ops (R, W, S) are sequentially consistent if:
  - They remain in program order for each processor.
  - They are seen to be in the same overall order by each of the other processors.
  - Program order = code order = commit order.
- Relaxed consistency: remove some of the ordering constraints for memory ops (R, W, S).
OpenMP and Relaxed Consistency
OpenMP defines consistency as a variant of weak consistency:
S ops must be in sequential order across threads.
Can not reorder S ops with R or W ops on the same thread
Weak consistency guarantees S→W, S→R, R→S, W→S, S→S
The synchronization operation relevant to this discussion is flush.
Flush
Defines a sequence point at which a thread is guaranteed to see a consistent view of memory with respect to the flush set.
The flush set is:
all thread-visible variables for a flush construct without an argument list.
a list of variables when the flush(list) construct is used.
The action of flush is to guarantee that:
All R, W ops that overlap the flush set and occur prior to the flush complete before the flush executes.
All R, W ops that overlap the flush set and occur after the flush don't execute until after the flush.
Flushes with overlapping flush sets can not be reordered.
Memory ops: R = read, W = write, S = synchronization
Synchronization: flush example
Flush forces data to be updated in memory so other threads see the most recent value

double A;
A = compute();
#pragma omp flush(A)  // flush to memory to make sure other
                      // threads can pick up the right value

Note: OpenMP's flush is analogous to a fence in other shared memory APIs.
Exercise 7: producer consumer
Parallelize the prod_cons.c program.
This is a well known pattern called the producer-consumer pattern
One thread produces values that another thread consumes.
Often used with a stream of produced values to implement pipeline parallelism
The key is to implement pairwise synchronization between threads.
Exercise 7: prod_cons.c
int main()
{
  double *A, sum, runtime; int flag = 0;
  A = (double *)malloc(N*sizeof(double));
  runtime = omp_get_wtime();
  fill_rand(N, A);        // Producer: fill an array of data
  sum = Sum_array(N, A);  // Consumer: sum the array
  runtime = omp_get_wtime() - runtime;
  printf(" In %lf seconds, The sum is %lf \n", runtime, sum);
}

I need to put the prod/cons pair inside a loop so it's true pipeline parallelism.
What is the Big Deal with Flush?
Compilers routinely reorder the instructions implementing a program
This helps better exploit the functional units, keep the machine busy, hide memory latencies, etc.
A compiler generally cannot move instructions:
past a barrier
past a flush on all variables
But it can move them past a flush with a list of variables so long as those variables are not accessed
Keeping track of consistency when flushes are used can be confusing, especially if flush(list) is used.
Note: the flush operation does not actually synchronize different threads. It just ensures that a thread's values are made consistent with main memory.
Outline
Introduction to OpenMP
Creating Threads
Synchronization
Parallel Loops
Synchronize single masters and stuff
Data environment
Schedule your for and sections
Memory model
OpenMP 3.0 and Tasks
OpenMP pre-history
OpenMP is based upon SMP directive standardization efforts: PCF and the aborted ANSI X3H5 (late 80s)
Nobody fully implemented either standard
Only a couple of partial implementations
Vendors considered proprietary APIs to be a competitive feature:
Every vendor had proprietary directive sets
Even KAP, a portable multi-platform parallelization tool, used different directives on each platform
PCF = Parallel Computing Forum; KAP = parallelization tool from KAI.
History of OpenMP

[Diagram, 1997: DEC, SGI, and Cray merged and needed commonality across products; KAI, an ISV, needed a larger market; ASCI was tired of recoding for SMPs and urged vendors to standardize; together they wrote a rough draft "straw man" SMP API; other vendors (IBM, Intel, HP) were invited to join.]
OpenMP Release History
1997: OpenMP Fortran 1.0
1998: OpenMP C/C++ 1.0
1999: OpenMP Fortran 1.1
2000: OpenMP Fortran 2.0
2002: OpenMP C/C++ 2.0
2005: OpenMP 2.5 (a single specification for Fortran, C and C++)
2008: OpenMP 3.0 (tasking, other new features)
Tasks
3.0
8/6/2019 Omp Hands on SC08
85/153
85
Adding tasking is the biggest addition for 3.0
Worked on by a separate subcommittee
led by Jay Hoeflinger at Intel
Re-examined issue from ground up
quite different from Intel taskqs
General task characteristics
3.0
A task has:
Code to execute
A data environment (it owns its data)
An assigned thread that executes the code and uses the data
Two activities: packaging and execution
Each encountering thread packages a new instance of a task (code and data)
Some thread in the team executes the task at some later time
Definitions
3.0
Task construct: a task directive plus a structured block
Task: the package of code and instructions for allocating data, created when a thread encounters a task construct
Task region: the dynamic sequence of instructions produced by the execution of a task by a thread
Tasks and OpenMP 3.0
Tasks have been fully integrated into OpenMP
Key concept: OpenMP has always had tasks, we just never called them that.
A thread encountering a parallel construct packages up a set of implicit tasks, one per thread.
A team of threads is created.
Each thread in the team is assigned to one of the tasks (and tied to it).
A barrier holds the original master thread until all implicit tasks are finished.
We have simply added a way to create a task explicitly for the team to execute.
Every part of an OpenMP program is part of one task or another!
task Construct
3.0
#pragma omp task [clause[[,]clause] ...]
   structured-block

where clause can be one of:
if (expression)
untied
shared (list)
private (list)
firstprivate (list)
default( shared | none )
The if clause
3.0
When the if clause argument is false:
The task is executed immediately by the encountering thread.
The data environment is still local to the new task...
...and it's still a different task with respect to synchronization.
It's a user-directed optimization:
when the cost of deferring the task is too great compared to the cost of executing the task code
to control cache and memory affinity
When/where are tasks complete?
3.0
At thread barriers, explicit or implicit
applies to all tasks generated in the current parallel region up to the barrier
matches user expectation
At task barriers:
#pragma omp taskwait
i.e. wait until all tasks defined in the current task have completed.
Note: applies only to tasks generated in the current task, not to descendants.
Example: parallel pointer chasing using tasks
3.0

#pragma omp parallel
{
  #pragma omp single private(p)
  {
    p = listhead;
    while (p) {
      #pragma omp task
        process(p)
      p = next(p);
    }
  }
}

p is firstprivate inside this task
Example: parallel pointer chasing on multiple lists using tasks
3.0

#pragma omp parallel
{
  #pragma omp for private(p)
  for ( int i = 0; i < numlists; i++) {
    p = listheads[i];
    while (p) {
      #pragma omp task
        process(p)
      p = next(p);
    }
  }
}
Example: postorder tree traversal
3.0

void postorder(node *p) {
  if (p->left)
    #pragma omp task
      postorder(p->left);
  if (p->right)
    #pragma omp task
      postorder(p->right);
  #pragma omp taskwait   // wait for descendants
  process(p->data);
}

Parent task suspended until children tasks complete
Task scheduling point
Task switching
3.0
Certain constructs have task scheduling points at defined locations within them
When a thread encounters a task scheduling point, it is allowed to suspend the current task and execute another (called task switching)
It can then return to the original task and resume
Task switching example
3.0

#pragma omp single
{
  for (i=0; i<ONEZILLION; i++)
    #pragma omp task
      process(item[i]);
}

Too many tasks generated in an eye-blink
The generating task will have to suspend for a while
With task switching, the executing thread can execute an already generated task (draining the task pool)
#pragma omp single
{
  #pragma omp task untied
  for (i=0; i<ONEZILLION; i++)
    #pragma omp task
      process(item[i]);
}
The taskprivate directive was removed from OpenMP 3.0
Too expensive to implement
Restrictions on task scheduling allow threadprivate data to be used
User can avoid thread switching with tied tasks
Task scheduling points are well defined
Conclusions on tasks
3.0
Enormous amount of work by many people
Tightly integrated into 3.0 spec
Flexible model for irregular parallelism
Provides balanced solution despite often conflictinggoals
Appears that performance can be reasonable
Nested parallelism
3.0
Better support for nested parallelism
Per-thread internal control variables
Allows, for example, calling omp_set_num_threads() inside a parallel region.
Controls the team sizes for the next level of parallelism
Library routines to determine depth of nesting, IDs of parent/grandparent etc. threads, team sizes of parent/grandparent etc. teams:
omp_get_active_level()
omp_get_ancestor_thread_num(level)
omp_get_team_size(level)
Parallel loops
3.0

Guarantee that this works... i.e. that the same schedule is used in the two loops:
!$omp do schedule(static)
do i=1,n
  a(i) = ....
end do
!$omp end do nowait
!$omp do schedule(static)
do i=1,n
  .... = a(i)
end do
Loops (cont.)
3.0
Allow collapsing of perfectly nested loops
Will form a single loop and then parallelize that:

!$omp parallel do collapse(2)
do i=1,n
  do j=1,n
    .....
  end do
end do
Portable control of threads
3.0
Added environment variable to control the size of child threads' stack
OMP_STACKSIZE
Added environment variable to hint to the runtime how to treat idle threads
OMP_WAIT_POLICY
ACTIVE: keep threads alive at barriers/locks
PASSIVE: try to release the processor at barriers/locks
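A typical way to use these variables from a shell before launching a program; the 16M stack size and the program name foo are arbitrary placeholders:

```shell
# Give each worker thread a 16 MB stack, and keep idle threads
# spinning at barriers (ACTIVE suits short waits on dedicated cores).
export OMP_STACKSIZE=16M
export OMP_WAIT_POLICY=ACTIVE
export OMP_NUM_THREADS=4
echo "OMP_STACKSIZE=$OMP_STACKSIZE OMP_WAIT_POLICY=$OMP_WAIT_POLICY"
# then launch the OpenMP program, e.g.:  ./foo
```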
Control program execution
3.0
Added environment variable and runtime routines to get/set the maximum number of active levels of nested parallelism
OMP_MAX_ACTIVE_LEVELS
omp_set_max_active_levels()
omp_get_max_active_levels()
Added environment variable to set the maximum number of threads in use
OMP_THREAD_LIMIT
omp_get_thread_limit()
Odds and ends
3.0
Allow unsigned ints in parallel for loops
Disallow use of the original variable as the master thread's private variable
Make it clearer where/how private objects are constructed/destructed
Relax some restrictions on allocatable arrays and Fortran pointers
Plug some minor gaps in the memory model
Allow C++ static class members to be threadprivate
Improve C/C++ grammar
Minor fixes and clarifications to 2.5
Exercise 8: tasks in OpenMP
Consider the program linked.c
Traverses a linked list, computing a sequence of Fibonacci numbers at each node.
Parallelize this program using tasks.
Compare your solution's complexity to the approach without tasks.
Conclusion
OpenMP 3.0 is a major upgrade: it expands the range of algorithms accessible from OpenMP.
OpenMP is fun and about as easy as we can make it for applications programmers working with shared memory platforms.
OpenMP Organizations
OpenMP Architecture Review Board URL, the owner of the OpenMP specification:
www.openmp.org
OpenMP User's Group (cOMPunity) URL:
www.compunity.org
Get involved, join cOMPunity and help define the future of OpenMP
Books about OpenMP
A new book about OpenMP 2.5 by a team of authors at the forefront of OpenMP's evolution.
A book about how to think parallel, with examples in OpenMP, MPI and Java
OpenMP Papers
Sosa CP, Scalmani C, Gomperts R, Frisch MJ. Ab initio quantum chemistry on a ccNUMA architecture using OpenMP. III. Parallel Computing, vol.26, no.7-8, July 2000, pp.843-56. Publisher: Elsevier, Netherlands.
Couturier R, Chipot C. Parallel molecular dynamics using OPENMP on a shared memory machine. Computer Physics Communications, vol.124, no.1, Jan. 2000, pp.49-59. Publisher: Elsevier, Netherlands.
Bentz J., Kendall R., Parallelization of General Matrix Multiply Routines Using OpenMP, Shared Memory Parallel Programming with OpenMP, Lecture Notes in Computer Science, Vol. 3349, P. 1, 2005
Bova SW, Breshearsz CP, Cuicchi CE, Demirbilek Z, Gabb HA. Dual-level parallel analysis of harbor wave response using MPI and OpenMP. International Journal of High Performance Computing Applications, vol.14, no.1, Spring 2000, pp.49-64. Publisher: Sage Science Press, USA.
Ayguade E, Martorell X, Labarta J, Gonzalez M, Navarro N. Exploiting multiple levels of parallelism in OpenMP: a case study. Proceedings of the 1999 International Conference on Parallel Processing. IEEE Comput. Soc. 1999, pp.172-80. Los Alamitos, CA, USA.
Bova SW, Breshears CP, Cuicchi C, Demirbilek Z, Gabb H. Nesting OpenMP in an MPI application. Proceedings of the ISCA 12th International Conference. Parallel and Distributed Systems. ISCA. 1999, pp.566-71. Cary, NC, USA.
OpenMP Papers (continued)
B. Chapman, F. Bregier, A. Patil, A. Prabhakar, Achieving performance under OpenMP on ccNUMA and software distributed shared memory systems, Concurrency and Computation: Practice and Experience. 14(8-9): 713-739, 2002.
J. M. Bull and M. E. Kambites. JOMP: an OpenMP-like interface for Java. Proceedings of the ACM 2000 conference on Java Grande, 2000, Pages 44-53.
L. Adhianto and B. Chapman, Performance modeling of communication and computation in hybrid MPI and OpenMP applications, Simulation Modeling Practice and Theory, vol 15, p. 481-491, 2007.
Shah S, Haab G, Petersen P, Throop J. Flexible control structures for parallelism in OpenMP; Concurrency: Practice and Experience, 2000; 12:1219-1239. Publisher: John Wiley & Sons, Ltd.
Mattson, T.G., How Good is OpenMP? Scientific Programming, Vol. 11, Number 2, p.81-93, 2003.
Duran A., Silvera R., Corbalan J., Labarta J., Runtime Adjustment of Parallel Nested Loops, Shared Memory Parallel Programming with OpenMP, Lecture Notes in Computer Science, Vol. 3349, P. 137, 2005
Appendix: Solutions to exercises
Exercise 1: hello world
Exercise 2: Simple SPMD Pi program
Exercise 3: SPMD Pi without false sharing
Exercise 4: Loop level Pi
Exercise 5: Monte Carlo Pi and random numbers
Exercise 6: hard, linked lists without tasks
Exercise 6: easy, matrix multiplication
Exercise 7: Producer-consumer
Exercise 8: linked lists with tasks
Exercise 1: Solution
A multi-threaded "Hello world" program
Write a multithreaded program where each thread prints "hello world".
#include <omp.h>
void main()
{
  #pragma omp parallel
  {
    int ID = omp_get_thread_num();
    printf(" hello(%d) ", ID);
    printf(" world(%d) \n", ID);
  }
}

Sample Output:
hello(1) hello(0) world(1) world(0)
hello(3) hello(2) world(3) world(2)

OpenMP include file
Parallel region with default number of threads
Runtime library function to return a thread ID.
End of the parallel region
Appendix: Solutions to exercises
Exercise 1: hello world
Exercise 2: Simple SPMD Pi program
Exercise 3: SPMD Pi without false sharing
Exercise 4: Loop level Pi
Exercise 5: Monte Carlo Pi and random numbers
Exercise 6: hard, linked lists without tasks
Exercise 6: easy, matrix multiplication
Exercise 7: Producer-consumer
Exercise 8: linked lists with tasks
Exercise 2: A simple SPMD pi program

#include <omp.h>
static long num_steps = 100000;   double step;
#define NUM_THREADS 2
void main ()
{ int i, nthreads; double pi, sum[NUM_THREADS];
  step = 1.0/(double) num_steps;
  omp_set_num_threads(NUM_THREADS);
  #pragma omp parallel
  {
    int i, id, nthrds;
    double x;
    id = omp_get_thread_num();
    nthrds = omp_get_num_threads();
    if (id == 0) nthreads = nthrds;
    for (i=id, sum[id]=0.0; i < num_steps; i=i+nthrds) {
      x = (i+0.5)*step;
      sum[id] += 4.0/(1.0+x*x);
    }
  }
  for(i=0, pi=0.0; i < nthreads; i++) pi += sum[i] * step;
}

Promote the scalar to an array dimensioned by the number of threads to avoid a race condition.
Appendix: Solutions to exercises
Exercise 1: hello world
Exercise 2: Simple SPMD Pi program
Exercise 3: SPMD Pi without false sharing
Exercise 4: Loop level Pi
Exercise 5: Monte Carlo Pi and random numbers
Exercise 6: hard, linked lists without tasks
Exercise 6: easy, matrix multiplication
Exercise 7: Producer-consumer
Exercise 8: linked lists with tasks
False sharing
If independent data elements happen to sit on the same cache line, each update will cause the cache lines to slosh back and forth between threads.
This is called false sharing.
If you promote scalars to an array to support creation of an SPMD program, the array elements are contiguous in memory and hence share cache lines.
Result: poor scalability
Solution:
When updates to an item are frequent, work with local copies of data instead of an array indexed by the thread ID.
Pad arrays so elements you use are on distinct cache lines.
Exercise 3: SPMD Pi without false sharing

#include <omp.h>
static long num_steps = 100000;   double step;
#define NUM_THREADS 2
void main ()
{ double pi; step = 1.0/(double) num_steps;
  omp_set_num_threads(NUM_THREADS);
  #pragma omp parallel
  {
    int i, id, nthrds; double x, sum;
    id = omp_get_thread_num();
    nthrds = omp_get_num_threads();
    for (i=id, sum=0.0; i < num_steps; i=i+nthrds){
      x = (i+0.5)*step;
      sum += 4.0/(1.0+x*x);
    }
    #pragma omp critical
      pi += sum * step;
  }
}

No array, so no false sharing.
Create a scalar local to each thread to accumulate partial sums.
Sum goes out of scope beyond the parallel region, so you must sum it in here. Must protect summation into pi in a critical region so updates don't conflict.
Appendix: Solutions to exercises
Exercise 1: hello world
Exercise 2: Simple SPMD Pi program
Exercise 3: SPMD Pi without false sharing
Exercise 4: Loop level Pi
Exercise 5: Monte Carlo Pi and random numbers
Exercise 6: hard, linked lists without tasks
Exercise 6: easy, matrix multiplication
Exercise 7: Producer-consumer
Exercise 8: linked lists with tasks
Exercise 4: solution
#include <omp.h>
static long num_steps = 100000;   double step;
#define NUM_THREADS 2
void main ()
{ int i; double x, pi, sum = 0.0;
  step = 1.0/(double) num_steps;
  omp_set_num_threads(NUM_THREADS);
  #pragma omp parallel for private(x) reduction(+:sum)
  for (i=0; i < num_steps; i++){
    x = (i+0.5)*step;
    sum = sum + 4.0/(1.0+x*x);
  }
  pi = step * sum;
}

Note: we created a parallel program without changing any code and by adding 4 simple lines!
i is private by default
For good OpenMP implementations, reduction is more scalable than critical.
Appendix: Solutions to exercises
Exercise 1: hello world
Exercise 2: Simple SPMD Pi program
Exercise 3: SPMD Pi without false sharing
Exercise 4: Loop level Pi
Exercise 5: Monte Carlo Pi and random numbers
Exercise 6: hard, linked lists without tasks
Exercise 6: easy, matrix multiplication
Exercise 7: Producer-consumer
Exercise 8: linked lists with tasks
Monte Carlo Calculations
Using random numbers to solve tough problems
Sample a problem domain to estimate areas, compute probabilities, find optimal values, etc.
Example: Computing π with a digital dart board:
[Figure: a circle of radius r inscribed in a square of side 2*r]
Throw darts at the circle/square.
The chance of falling in the circle is proportional to the ratio of areas:
A_c = π * r²
A_s = (2*r) * (2*r) = 4 * r²
P = A_c/A_s = π/4
Compute π by randomly choosing points; count the fraction that falls in the circle and compute π.
N = 10:   π ≈ 2.8
N = 100:  π ≈ 3.16
N = 1000: π ≈ 3.148
Parallel programmers love Monte Carlo algorithms

Embarrassingly parallel: the parallelism is so easy it's embarrassing.
Add two lines and you have a parallel program.

#include <omp.h>
static long num_trials = 10000;
int main ()
{
  long i; long Ncirc = 0; double pi, x, y;
  double r = 1.0;   // radius of circle. Side of square is 2*r
  seed(0, -r, r);   // The circle and square are centered at the origin
  #pragma omp parallel for private (x, y) reduction (+:Ncirc)
  for(i=0; i < num_trials; i++)
  {
    x = random(); y = random();
    if ( x*x + y*y <= r*r ) Ncirc++;
  }
  pi = 4.0 * ((double)Ncirc/(double)num_trials);
  printf("\n %d trials, pi is %f \n", num_trials, pi);
}
If you pick the multiplier and addend correctly, an LCG has a period of PMOD.
Picking good LCG parameters is complicated, so look it up (Numerical Recipes is a good source). I used the following:
MULTIPLIER = 1366
ADDEND = 150889
PMOD = 714025

random_next = (MULTIPLIER * random_last + ADDEND) % PMOD;
random_last = random_next;
LCG code

static long MULTIPLIER = 1366;
static long ADDEND = 150889;
static long PMOD = 714025;
long random_last = 0;
double random ()
{
  long random_next;
  random_next = (MULTIPLIER * random_last + ADDEND) % PMOD;
  random_last = random_next;
  return ((double)random_next/(double)PMOD);
}

Seed the pseudo random sequence by setting random_last
Running the PI_MC program with the LCG generator

[Plot: log10 relative error vs. log10 number of samples, for LCG with one thread and three 4-thread trials]

Run the same program the same way and get different answers!
That is not acceptable!
Issue: my LCG generator is not threadsafe

Program written using the Intel C/C++ compiler (10.0.659.2005) in Microsoft Visual Studio 2005 (8.0.50727.42) and running on a dual-core laptop (Intel T2400 @ 1.83 Ghz with 2 GB RAM) running Microsoft Windows XP.
LCG code: threadsafe version

static long MULTIPLIER = 1366;
static long ADDEND = 150889;
static long PMOD = 714025;
long random_last = 0;
#pragma omp threadprivate(random_last)
double random ()
{
  long random_next;
  random_next = (MULTIPLIER * random_last + ADDEND) % PMOD;
  random_last = random_next;
  return ((double)random_next/(double)PMOD);
}

random_last carries state between random number computations.
To make the generator threadsafe, make random_last threadprivate so each thread has its own copy.
Thread safe random number generators

[Plot: log10 relative error vs. log10 number of samples, for LCG with one thread, three 4-thread trials, and the 4-thread thread safe version]

The thread safe version gives the same answer each time you run the program.
But for a large number of samples, its quality is lower than the one-thread result!
Why?
Pseudo Random Sequences

Random number Generators (RNGs) define a sequence of pseudo-random numbers of length equal to the period of the RNG
In a typical problem, you grab a subsequence of the RNG range
Seed determines the starting point
Grab arbitrary seeds and you may generate overlapping sequences
E.g. three sequences... the last one wraps at the end of the RNG period.
[Figure: subsequences for Thread 1, Thread 2, and Thread 3 laid out along the RNG period, with Thread 3's wrapping around and overlapping Thread 1's]
Overlapping sequences = over-sampling and bad statistics... lower quality or even wrong answers!
Parallel random number generators

Multiple threads cooperate to generate and use random numbers.
Solutions:
Replicate and Pray
Give each thread a separate, independent generator
Have one thread generate all the numbers.
Leapfrog: deal out sequence values "round robin" as if dealing a deck of cards.
Block method: pick your seed so each thread gets a distinct contiguous block.
Other than replicate and pray, these are difficult to implement. Be smart: buy a math library that does it right.

If done right, these can generate the same sequence regardless of the number of threads.
Nice for debugging, but not really needed scientifically.

Intel's Math Kernel Library supports all of these methods.
MKL Random number generators (RNG)

MKL includes several families of RNGs in its vector statistics library.
Specialized to efficiently generate vectors of random numbers

#define BLOCK 100
double buff[BLOCK];
VSLStreamStatePtr stream;
vslNewStream(&stream, VSL_BRNG_WH, (int)seed_val);
vdRngUniform (VSL_METHOD_DUNIFORM_STD, stream, BLOCK, buff, low, hi);
vslDeleteStream( &stream );

Initialize a stream of pseudo random numbers
Select the type of RNG and set the seed
Fill buff with BLOCK pseudo random numbers, uniformly distributed with values between lo and hi.
Delete the stream when you are done
Wichmann-Hill generators (WH)

WH is a family of 273 parameter sets, each defining a non-overlapping and independent RNG.
Easy to use: just make each stream threadprivate and initiate the RNG stream so each thread gets a unique WH RNG.

VSLStreamStatePtr stream;
#pragma omp threadprivate(stream)
vslNewStream(&stream, VSL_BRNG_WH+Thrd_ID, (int)seed);
1
Log10 number of samplesNotice that
once you get
8/6/2019 Omp Hands on SC08
137/153
137
0.0001
0.001
0.01
0.1
1
1 2 3 4 5 6
WH one
thread
WH, 2
threads
WH, 4
threads
Log10
Relat
iveerror
once you getbeyond the
high error,small samplecount range,
adding threadsdoesnt
decreasequality ofrandom
sampling.
Leap Frog method

Interleave samples in the sequence of pseudo random numbers:
Thread i starts at the ith number in the sequence
Stride through the sequence; stride length = number of threads.
Result: the same sequence of values regardless of the number of threads.

#pragma omp single
{  nthreads = omp_get_num_threads();
   iseed = PMOD/MULTIPLIER;   // just pick a seed
   pseed[0] = iseed;
   mult_n = MULTIPLIER;
   for (i = 1; i < nthreads; ++i)
   {
     iseed = (unsigned long long)((MULTIPLIER * iseed) % PMOD);
     pseed[i] = iseed;
     mult_n = (mult_n * MULTIPLIER) % PMOD;
   }
}
random_last = (unsigned long long) pseed[id];

One thread computes the offsets and the strided multiplier
LCG with Addend = 0 just to keep things simple
Each thread stores its offset starting point into its threadprivate "last random" value
Same sequence with many threads.

We can use the leapfrog method to generate the same answer for any number of threads

Steps      One thread   2 threads   4 threads
1000       3.156        3.156       3.156
10000      3.1168       3.1168      3.1168
100000     3.13964      3.13964     3.13964
1000000    3.140348     3.140348    3.140348
10000000   3.141658     3.141658    3.141658

Used the MKL library with two generator streams per computation: one for the x values (WH) and one for the y values (WH+1). Also used the leapfrog method to deal out iterations among threads.
Appendix: Solutions to exercises
Exercise 1: hello world
Exercise 2: Simple SPMD Pi program
Exercise 3: SPMD Pi without false sharing
Exercise 4: Loop level Pi
Exercise 5: Monte Carlo Pi and random numbers
Exercise 6: hard, linked lists without tasks
Exercise 6: easy, matrix multiplication
Exercise 7: Producer-consumer
Exercise 8: linked lists with tasks
Linked lists without tasks
See the file Linked_omp25.c

while (p != NULL) {
  p = p->next;
  count++;
}
Count number of items in the linked list

p = head;
for(i=0; i < count; i++) {
  parr[i] = p;
  p = p->next;
}
Copy a pointer to each node into an array

#pragma omp parallel
{
  #pragma omp for schedule(static,1)
  for(i=0; i < count; i++)
    processwork(parr[i]);
}
Process nodes in parallel with a for loop

C++ STL version:
for (p = head; p != NULL; p = p->next)
  nodelist.push_back(p);
int j = (int)nodelist.size();
#pragma omp parallel for schedule(static,1)
for (int i = 0; i < j; ++i)
  processwork(nodelist[i]);

Results on an Intel dual core 1.83 GHz CPU, Intel IA-32 compiler 10.1 build 2:

              C, (static,1)   C++, (static,1)   C++, default sched.
One Thread    45 seconds      49 seconds        37 seconds
Two Threads   28 seconds      32 seconds        47 seconds
Matrix multiplication

#pragma omp parallel for private(tmp, i, j, k)
for (i=0; i<N; i++){
  for (j=0; j<N; j++){
    tmp = 0.0;
    for(k=0; k<N; k++){
      tmp += A[i][k] * B[k][j];
    }
    C[i][j] = tmp;
  }
}
Exercise 3: SPMD Pi without false sharing
Exercise 4: Loop level Pi
Exercise 5: Monte Carlo Pi and random numbers
Exercise 6: hard, linked lists without tasks
Exercise 6: easy, matrix multiplication
Exercise 7: Producer-consumer
Exercise 8: linked lists with tasks
Pairwise synchronization in OpenMP

OpenMP lacks synchronization constructs that work between pairs of threads.
When this is needed you have to build it yourself.
Pairwise synchronization:
Use a shared flag variable
Reader spins waiting for the new flag value
Use flushes to force updates to and from memory
Exercise 7: producer consumer

int main()
{
  double *A, sum, runtime; int numthreads, flag = 0;
  A = (double *)malloc(N*sizeof(double));
  #pragma omp parallel sections
  {
    #pragma omp section
    {
      fill_rand(N, A);
      #pragma omp flush
      flag = 1;
      #pragma omp flush (flag)
    }
    #pragma omp section
    {
      #pragma omp flush (flag)
      while (flag != 1){
        #pragma omp flush (flag)
      }
      #pragma omp flush
      sum = Sum_array(N, A);
    }
  }
}

Use flag to signal when the produced value is ready
Flush forces a refresh to memory. Guarantees that the other thread sees the new value of A
Notice you must put the flush inside the while loop to make sure the updated flag variable is seen
Flush needed on both the reader and writer sides of the communication
Appendix: Solutions to exercises
Exercise 1: hello world
Exercise 2: Simple SPMD Pi program
Exercise 3: SPMD Pi without false sharing
Exercise 4: Loop level Pi
Exercise 5: Monte Carlo Pi and random numbers
Exercise 6: hard, linked lists without tasks
Exercise 6: easy, matrix multiplication
Exercise 7: Producer-consumer
Exercise 8: linked lists with tasks
Linked lists with tasks (intel taskq)
See the file Linked_intel_taskq.c

#pragma omp parallel
{
  #pragma intel omp taskq
  {
    while (p != NULL) {
      #pragma intel omp task captureprivate(p)
        processwork(p);
      p = p->next;
    }
  }
}

Results on an Intel dual core 1.83 GHz CPU, Intel IA-32 compiler 10.1 build 2:

              Array, (static,1)   Intel taskq
One Thread    45 seconds          48 seconds
Two Threads   28 seconds          30 seconds
Linked lists with tasks (OpenMP 3)
See the file Linked_intel_taskq.c

#pragma omp parallel
{
  #pragma omp single
  {
    p = head;
    while (p) {
      #pragma omp task firstprivate(p)
        processwork(p);
      p = p->next;
    }
  }
}

Creates a task with its own copy of p, initialized to the value of p when the task is defined
Compiler notes: Intel on Windows

Intel compiler:
   Launch the SW dev environment. On my laptop I use:
      start / Intel software development tools / Intel C++
      compiler 10.1 / C++ build environment for 32-bit apps
   cd to the directory that holds your source code
   Build software for program foo.c:
      icl /Qopenmp foo.c
   Set the number-of-threads environment variable:
      set OMP_NUM_THREADS=4
   Run your program:
      foo.exe

To get rid of the pwd on the prompt, type: prompt = %
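The slides cover only the Intel compiler on Windows. For reference, the analogous steps on Linux with gcc (an addition here, not part of the original tutorial) would be:

```shell
# gcc enables OpenMP with -fopenmp (the counterpart of icl's /Qopenmp)
gcc -fopenmp foo.c -o foo

# Set the thread count via the same OMP_NUM_THREADS environment variable
export OMP_NUM_THREADS=4

# Run the program
./foo
```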
Compiler notes: Visual Studio

Start a new project:
   Select Win32 console project
   Set name and path
   On the next panel, click "next" instead of "finish" so you can
   select an empty project on the following panel.
   Drag and drop your source file into the source folder in the
   Visual Studio solution explorer
Activate OpenMP:
   Go to project properties / configuration properties / C/C++ /
   language, and activate OpenMP
Set the number of threads inside the program
Build the project
Run without debugging from the debug menu
Notes from the SC08 tutorial

It seemed to go very well. We had over 50 people who stuck it out throughout the day.

Most people brought their laptops (only 7 loaner laptops were used). And of those with laptops, most had preloaded an OS.

The chaos at the beginning was much less than I expected. I had fears of an hour or longer to get everyone set up. But thanks to PGI providing a license key in a temp file, we were able to get everyone installed in short order.

Having a team of 4 (two speakers and two assistants) worked well. It would have been messier without a hardcore compiler person such as Larry. With dozens of different configurations, he had to do some serious trouble-shooting to get the most difficult cases up and running.

The exercises used early in the course were good. The ones after lunch were too hard. I need to refine the exercise set. One idea is, for each slot, to have an easy exercise and a hard exercise. This will help me keep the group's experience more balanced.

Most people didn't get the threadprivate exercise. The few people who tried the linked-list exercise were amazingly creative; one even got a single/nowait version to work.