
Page 1: Lecture 04-06: Programming with OpenMP

Concurrent and Multicore Programming, CSE536

Department of Computer Science and Engineering
Yonghong Yan
[email protected]
www.secs.oakland.edu/~yan

1

Page 2: Topics (Part 1)

• Introduction
• Programming on shared memory system (Chapter 7)
  – OpenMP
• Principles of parallel algorithm design (Chapter 3)
• Programming on shared memory system (Chapter 7)
  – Cilk/Cilkplus (?)
  – PThreads, mutual exclusion, locks, synchronizations
• Analysis of parallel program executions (Chapter 5)
  – Performance Metrics for Parallel Systems
    • Execution Time, Overhead, Speedup, Efficiency, Cost
  – Scalability of Parallel Systems
  – Use of performance tools

2

Page 3: Outline

• OpenMP Introduction
• Parallel Programming with OpenMP
  – OpenMP parallel region and worksharing
  – OpenMP data environment, tasking and synchronization
• OpenMP Performance and Best Practices
• More Case Studies and Examples
• Reference Materials

3

Page 4: What is OpenMP

• Standard API to write shared memory parallel applications in C, C++, and Fortran
  – Compiler directives, runtime routines, environment variables
• OpenMP Architecture Review Board (ARB)
  – Maintains the OpenMP specification
  – Permanent members
    • AMD, Cray, Fujitsu, HP, IBM, Intel, NEC, PGI, Oracle, Microsoft, Texas Instruments, NVIDIA, Convey
  – Auxiliary members
    • ANL, ASC/LLNL, cOMPunity, EPCC, LANL, NASA, TACC, RWTH Aachen University, UH
  – http://www.openmp.org
• Latest version 4.5, released November 2015

4

Page 5: My Role with OpenMP

5

Page 6: “Hello World” Example/1

#include <stdlib.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    printf("Hello World\n");
    return(0);
}

6

Page 7: “Hello World” – An Example/2

#include <stdlib.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    #pragma omp parallel
    {
        printf("Hello World\n");
    } // End of parallel region
    return(0);
}

7

Page 8: “Hello World” – An Example/3

$ gcc -fopenmp hello.c
$ export OMP_NUM_THREADS=2
$ ./a.out
Hello World
Hello World
$ export OMP_NUM_THREADS=4
$ ./a.out
Hello World
Hello World
Hello World
Hello World
$

#include <stdlib.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    #pragma omp parallel
    {
        printf("Hello World\n");
    } // End of parallel region
    return(0);
}

8

Page 9: OpenMP Components

Directives:
• Parallel region
• Worksharing constructs
• Tasking
• Offloading
• Affinity
• Error handling
• SIMD
• Synchronization
• Data-sharing attributes

Runtime Environment (routines):
• Number of threads
• Thread ID
• Dynamic thread adjustment
• Nested parallelism
• Schedule
• Active levels
• Thread limit
• Nesting level
• Ancestor thread
• Team size
• Locking
• Wallclock timer

Environment Variables:
• Number of threads
• Scheduling type
• Dynamic thread adjustment
• Nested parallelism
• Stack size
• Idle threads
• Active levels
• Thread limit

9

Page 10: “Hello World” – An Example/3

#include <stdlib.h>
#include <stdio.h>
#include <omp.h>

int main(int argc, char *argv[]) {
    #pragma omp parallel
    {
        int thread_id = omp_get_thread_num();
        int num_threads = omp_get_num_threads();
        printf("Hello World from thread %d of %d\n", thread_id, num_threads);
    }
    return(0);
}

Directives: #pragma omp parallel. Runtime environment: omp_get_thread_num(), omp_get_num_threads().

10

Page 11: “Hello World” – An Example/4

Environment Variable (e.g. OMP_NUM_THREADS): similar to a program argument, it changes the configuration of the execution without recompiling the program.

NOTE: the order in which the threads print is not deterministic.

11

Page 12: The Design Principle Behind

• Each printf is a task
• A parallel region claims a set of cores for computation
  – Cores are presented as multiple threads
• Each thread executes a single task
  – Task id is the same as thread id: omp_get_thread_num()
  – Num_tasks is the same as the total number of threads: omp_get_num_threads()
• 1:1 mapping between task and thread
  – Every task/core does similar work in this simple example

#pragma omp parallel
{
    int thread_id = omp_get_thread_num();
    int num_threads = omp_get_num_threads();
    printf("Hello World from thread %d of %d\n", thread_id, num_threads);
}

12

Page 13: OpenMP Parallel Computing Solution Stack

User layer: End User, Application
Programming layer (OpenMP API): Directives/Compiler, OpenMP library, Environment variables
System layer: Runtime library, OS/system

13

Page 14: OpenMP Syntax

• Most OpenMP constructs are compiler directives using pragmas.
  – For C and C++, the pragmas take the form: #pragma …
• pragma vs language
  – A pragma is not part of the language and should not express program logic
  – It provides the compiler/preprocessor additional information on how to process the directive-annotated code
  – Similar to #include, #define

14

Page 15: OpenMP Syntax

• For C and C++, the pragmas take the form:
  #pragma omp construct [clause [clause]…]
• For Fortran, the directives take one of the forms:
  – Fixed form
    *$OMP construct [clause [clause]…]
    C$OMP construct [clause [clause]…]
  – Free form (but works for fixed form too)
    !$OMP construct [clause [clause]…]
• Include file and the OpenMP lib module
  #include <omp.h>
  use omp_lib

15

Page 16: OpenMP Compiler

• OpenMP: thread programming at a “high level”
  – The user does not need to specify the details
    • Program decomposition, assignment of work to threads
    • Mapping tasks to hardware threads
  – The user makes strategic decisions; the compiler figures out the details
  – Compiler flags enable OpenMP (e.g. -openmp, -xopenmp, -fopenmp, -mp)

16

Page 17: OpenMP Memory Model

• OpenMP assumes a shared memory
• Threads communicate by sharing variables
• Synchronization protects against data conflicts
  – Synchronization is expensive
  – Change how data is accessed to minimize the need for synchronization

17

Page 18: OpenMP Fork-Join Execution Model

• Master thread spawns multiple worker threads as needed; together they form a team
• A parallel region is a block of code executed by all threads in a team simultaneously

[Figure: the master thread forks worker threads at each parallel region and joins them afterwards; parallel regions may be nested.]

18

Page 19: OpenMP Parallel Regions

• In C/C++: a block is a single statement or a group of statements between { }
• In Fortran: a block is a single statement or a group of statements between directive/end-directive pairs.

#pragma omp parallel
{
    id = omp_get_thread_num();
    res[id] = lots_of_work(id);
}

#pragma omp parallel for
for(i=0; i<N; i++) {
    res[i] = big_calc(i);
    A[i] = B[i] + res[i];
}

C$OMP PARALLEL
10    wrk(id) = garbage(id)
      res(id) = wrk(id)**2
      if(.not.conv(res(id))) goto 10
C$OMP END PARALLEL

C$OMP PARALLEL DO
      do i=1,N
        res(i) = bigComp(i)
      end do
C$OMP END PARALLEL DO

19

Page 20: Scope of OpenMP Region

A parallel region can span multiple source files. The dynamic extent of a parallel region includes its lexical extent, and orphaned directives can appear outside a parallel construct.

foo.f (lexical extent of the parallel region):
C$OMP PARALLEL
      call whoami
C$OMP END PARALLEL

bar.f (dynamic extent; the CRITICAL directive is orphaned):
      subroutine whoami
      external omp_get_thread_num
      integer iam, omp_get_thread_num
      iam = omp_get_thread_num()
C$OMP CRITICAL
      print*, 'Hello from ', iam
C$OMP END CRITICAL
      return
      end

20

Page 21: SPMD Program Models

• SPMD (Single Program, Multiple Data) for parallel regions
  – All threads of the parallel region execute the same code
  – Each thread has a unique ID
• Use the thread ID to diverge the execution of the threads
  – Different threads can follow different paths through the same code
• SPMD is by far the most commonly used pattern for structuring parallel programs
  – MPI, OpenMP, CUDA, etc.

if (my_id == x) { } else { }

21

Page 22: Modify the Hello World Program so that…

• Only one thread prints the total number of threads
  – gcc -fopenmp hello.c -o hello
• Only one thread reads the total number of threads, and all threads print that info

#pragma omp parallel
{
    int thread_id = omp_get_thread_num();
    int num_threads = omp_get_num_threads();
    if (thread_id == 0)
        printf("Hello World from thread %d of %d\n", thread_id, num_threads);
    else
        printf("Hello World from thread %d\n", thread_id);
}

int num_threads = 99999;
#pragma omp parallel
{
    int thread_id = omp_get_thread_num();
    if (thread_id == 0)
        num_threads = omp_get_num_threads();
    #pragma omp barrier
    printf("Hello World from thread %d of %d\n", thread_id, num_threads);
}

22

Page 23: Barrier

#pragma omp barrier

23

Page 24: OpenMP Master

• Denotes a structured block executed by the master thread
• The other threads just skip it
  – no synchronization is implied

#pragma omp parallel private (tmp)
{
    do_many_things_together();
    #pragma omp master
    { exchange_boundaries_by_master_only(); }
    #pragma omp barrier
    do_many_other_things_together();
}

24

Page 25: OpenMP Single

• Denotes a block of code that is executed by only one thread.
  – Could be the master
• A barrier is implied at the end of the single block.

#pragma omp parallel private (tmp)
{
    do_many_things_together();
    #pragma omp single
    { exchange_boundaries_by_one(); }
    do_many_other_things_together();
}

25

Page 26: Using omp master/single to Modify the Hello World Program so that…

• Only one thread prints the total number of threads
• Only one thread reads the total number of threads, and all threads print that info (a sketch of one solution follows this slide)

#pragma omp parallel
{
    int thread_id = omp_get_thread_num();
    int num_threads = omp_get_num_threads();
    printf("Hello World from thread %d of %d\n", thread_id, num_threads);
}

26
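One possible solution is sketched below (an illustration, not the course's reference answer). It assumes <stdio.h> and <omp.h> are included and relies on the barrier implied at the end of "single", so every thread sees the value before printing.

int num_threads = 0;
#pragma omp parallel
{
    /* exactly one thread reads the team size */
    #pragma omp single
    num_threads = omp_get_num_threads();
    /* implied barrier of "single": num_threads is now visible to all threads */
    printf("Hello World from thread %d of %d\n",
           omp_get_thread_num(), num_threads);
}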

Page 27: Distributing Work Based on Thread ID

Sequential code:
for(i=0; i<N; i++) { a[i] = a[i] + b[i]; }

OpenMP parallel region:
#pragma omp parallel shared (a, b)
{
    int id, i, Nthrds, istart, iend;
    id = omp_get_thread_num();
    Nthrds = omp_get_num_threads();
    istart = id * N / Nthrds;
    iend = (id+1) * N / Nthrds;
    for(i=istart; i<iend; i++) { a[i] = a[i] + b[i]; }
}

cat /proc/cpuinfo

27

Page 28: Implementing axpy using OpenMP parallel

28
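The exercise code is not reproduced in the slides; a minimal sketch of one way to do it with a bare parallel region, mirroring the thread-ID-based loop splitting of the previous slide (the function name and signature are illustrative only):

#include <omp.h>

/* y = a*x + y; each thread handles its own [istart, iend) range */
void axpy(int n, float a, const float *x, float *y) {
    #pragma omp parallel
    {
        int id = omp_get_thread_num();
        int nthrds = omp_get_num_threads();
        int istart = id * n / nthrds;
        int iend = (id + 1) * n / nthrds;
        for (int i = istart; i < iend; i++)
            y[i] = a * x[i] + y[i];
    }
}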

Page 29: OpenMP Worksharing Constructs

• Divides the execution of the enclosed code region among the members of the team
• The “for” worksharing construct splits up loop iterations among the threads in a team
  – Each thread gets one or more “chunks” -> loop chunking

#pragma omp parallel
#pragma omp for
for(i=0; i<N; i++) {
    work(i);
}

By default, there is a barrier at the end of the “omp for”. Use the “nowait” clause to turn off the barrier:
#pragma omp for nowait

“nowait” is useful between two consecutive, independent omp for loops.

29

Page 30: Worksharing Constructs

Sequential code:
for(i=0; i<N; i++) { a[i] = a[i] + b[i]; }

OpenMP parallel region:
#pragma omp parallel shared (a, b)
{
    int id, i, Nthrds, istart, iend;
    id = omp_get_thread_num();
    Nthrds = omp_get_num_threads();
    istart = id * N / Nthrds;
    iend = (id+1) * N / Nthrds;
    for(i=istart; i<iend; i++) { a[i] = a[i] + b[i]; }
}

OpenMP parallel region and a worksharing for construct:
#pragma omp parallel shared (a, b) private (i)
#pragma omp for schedule(static)
for(i=0; i<N; i++) { a[i] = a[i] + b[i]; }

30

Page 31: OpenMP schedule Clause

• schedule(static | dynamic | guided [, chunk])
• schedule(auto | runtime)

static: Distribute iterations in blocks of size "chunk" over the threads in a round-robin fashion
dynamic: Fixed portions of work; size is controlled by the value of chunk; when a thread finishes, it starts on the next portion of work
guided: Same dynamic behavior as "dynamic", but the size of the portion of work decreases exponentially
auto: The compiler (or runtime system) decides what is best to use; the choice could be implementation dependent
runtime: The iteration scheduling scheme is set at runtime through the environment variable OMP_SCHEDULE

31
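As a small illustration of the clause (not from the slides), a loop whose per-iteration cost varies strongly can be given a dynamic schedule with a small chunk size; work() here is a placeholder routine:

#pragma omp parallel for schedule(dynamic, 4)
for (int i = 0; i < N; i++)
    work(i);   /* placeholder: cost varies from iteration to iteration */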

Page 32: OpenMP Sections

• Worksharing construct
• Gives a different structured block to each thread

#pragma omp parallel
#pragma omp sections
{
    #pragma omp section
    x_calculation();
    #pragma omp section
    y_calculation();
    #pragma omp section
    z_calculation();
}

By default, there is a barrier at the end of the “omp sections”. Use the “nowait” clause to turn off the barrier.

32

Page 33: Loop Collapse

• Allows parallelization of perfectly nested loops without using nested parallelism
• The collapse clause on a for/do loop indicates how many loops should be collapsed

!$omp parallel do collapse(2) ...
      do i = il, iu, is
        do j = jl, ju, js
          do k = kl, ku, ks
            .....
          end do
        end do
      end do
!$omp end parallel do

33

Page 34: Exercise: OpenMP Matrix Multiplication

• Parallel version
• Parallel for version
  – Experiment with different schedule policies and chunk sizes
    • #pragma omp parallel for
  – Experiment with collapse(2)

#pragma omp parallel shared (a, b) private (i)
#pragma omp for schedule(static)
for(i=0; i<N; i++) { a[i] = a[i] + b[i]; }

#pragma omp parallel for schedule(static) private (i) num_threads(num_ths)
for(i=0; i<N; i++) { a[i] = a[i] + b[i]; }

gcc -fopenmp mm.c -o mm

34
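A sketch of the parallel-for version of the exercise, assuming square row-major matrices (one possible layout, not the reference solution in mm.c):

#include <omp.h>

void matmul(int n, const double *A, const double *B, double *C) {
    /* the outer two loops are collapsed into one iteration space and split statically */
    #pragma omp parallel for collapse(2) schedule(static)
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += A[i*n + k] * B[k*n + j];
            C[i*n + j] = sum;
        }
}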

Page 35: Barrier

• Barrier: each thread waits until all threads arrive.

#pragma omp parallel shared (A, B, C) private(id)
{
    id = omp_get_thread_num();
    A[id] = big_calc1(id);
    #pragma omp barrier
    #pragma omp for
    for(i=0; i<N; i++){ C[i] = big_calc3(i, A); }   // implicit barrier at the end of a for work-sharing construct
    #pragma omp for nowait
    for(i=0; i<N; i++){ B[i] = big_calc2(C, i); }   // no implicit barrier due to nowait
    A[id] = big_calc4(id);
}   // implicit barrier at the end of a parallel region

35

Page 36: Data Environment

• Most variables are shared by default
• Global variables are SHARED among threads
  – Fortran: COMMON blocks, SAVE variables, MODULE variables
  – C: file scope variables, static
• But not everything is shared...
  – Stack variables in sub-programs called from parallel regions are PRIVATE
  – Automatic variables defined inside the parallel region are PRIVATE

36

Page 37: OpenMP Data Environment

double a[size][size], b = 4;
#pragma omp parallel private (b)
{
    ....
}

Shared data: a[size][size]
Each thread T0..T3 gets its own private copy of b with an undefined initial value (b' = ?); the original b becomes undefined inside the region.

37

Page 38: OpenMP Data Environment

      program sort
      common /input/ A(10)
      integer index(10)
C$OMP PARALLEL
      call work (index)
C$OMP END PARALLEL
      print*, index(1)

      subroutine work (index)
      common /input/ A(10)
      integer index(*)
      real temp(10)
      integer count
      save count
      …………

A, index and count are shared by all threads.
temp is local to each thread.

38

Page 39: Data Environment: Changing Storage Attributes

• Selectively change the storage attributes of constructs using the following clauses
  – SHARED
  – PRIVATE
  – FIRSTPRIVATE
  – THREADPRIVATE
• The value of a private inside a parallel loop and the global value outside the loop can be exchanged with
  – FIRSTPRIVATE and LASTPRIVATE
• The default status can be modified with:
  – DEFAULT (PRIVATE | SHARED | NONE)

39

Page 40: OpenMP Private Clause

• private(var) creates a local copy of var for each thread.
  – The value is uninitialized
  – The private copy is not storage-associated with the original
  – The original is undefined at the end

      IS = 0
C$OMP PARALLEL DO PRIVATE(IS)
      DO J=1,1000
        IS = IS + J
      END DO
C$OMP END PARALLEL DO
      print *, IS

40

Page 41: OpenMP Private Clause

• private(var) creates a local copy of var for each thread.
  – The value is uninitialized
  – The private copy is not storage-associated with the original
  – The original is undefined at the end

      IS = 0
C$OMP PARALLEL DO PRIVATE(IS)
      DO J=1,1000
        IS = IS + J          ! ✗ IS was not initialized
      END DO
C$OMP END PARALLEL DO
      print *, IS            ! ✗ IS is undefined here

41

Page 42: Firstprivate Clause

• firstprivate is a special case of private.
  – Initializes each private copy with the corresponding value from the master thread.

      IS = 0
C$OMP PARALLEL DO FIRSTPRIVATE(IS)
      DO 20 J=1,1000
        IS = IS + J
20    CONTINUE
C$OMP END PARALLEL DO
      print *, IS

42

Page 43: Firstprivate Clause

• firstprivate is a special case of private.
  – Initializes each private copy with the corresponding value from the master thread.

      IS = 0
C$OMP PARALLEL DO FIRSTPRIVATE(IS)
      DO 20 J=1,1000
        IS = IS + J          ! ✔ Each thread gets its own IS with an initial value of 0
20    CONTINUE
C$OMP END PARALLEL DO
      print *, IS            ! ✗ Regardless of initialization, IS is undefined at this point

43

Page 44: Lastprivate Clause

• Lastprivate passes the value of a private from the last iteration to the variable of the master thread

      IS = 0
C$OMP PARALLEL DO FIRSTPRIVATE(IS)
C$OMP& LASTPRIVATE(IS)
      DO 20 J=1,1000
        IS = IS + J          ! ✔ Each thread gets its own IS with an initial value of 0
20    CONTINUE
C$OMP END PARALLEL DO
      print *, IS            ! IS is defined as its value at the last iteration (i.e. for J=1000)

Is this code meaningful?

44

Page 45: OpenMP Reduction

• Here is the correct way to parallelize this code.

      IS = 0
C$OMP PARALLEL DO REDUCTION(+:IS)
      DO 20 J=1,1000
        IS = IS + J
20    CONTINUE
      print *, IS

Reduction does NOT imply firstprivate; where does the initial 0 come from?

45
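The same pattern in C, as a sketch with illustrative names: each thread gets a private copy initialized to the identity of "+" (0), and the copies are combined into the shared variable at the end of the loop.

long is = 0;
#pragma omp parallel for reduction(+:is)
for (int j = 1; j <= 1000; j++)
    is += j;
printf("%ld\n", is);   /* prints 500500 */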

Page 46: Reduction Operands / Initial Values

• Associative operands can be used with reduction
• Initial values are the ones that make sense mathematically

Operand   Initial value
+         0
*         1
-         0
.AND.     All 1’s
.OR.      0
MAX       1
MIN       0
//        All 0’s

46

Page 47: Exercise: OpenMP Sum.c

• Two versions
  – Parallel for with reduction
  – Parallel version, not using the “omp for” or “reduction” clause

47
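A sketch of what the second version might look like (assuming a hypothetical array x of length n): per-thread partial sums are combined under a critical section instead of a reduction clause.

double sum = 0.0;
#pragma omp parallel
{
    int id = omp_get_thread_num();
    int nthrds = omp_get_num_threads();
    double partial = 0.0;
    /* each thread sums its own index range */
    for (int i = id * n / nthrds; i < (id + 1) * n / nthrds; i++)
        partial += x[i];
    /* combine the per-thread results one at a time */
    #pragma omp critical
    sum += partial;
}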

Page 48: OpenMP Threadprivate

• Makes global data private to a thread, thus crossing parallel region boundaries
  – Fortran: COMMON blocks
  – C: file scope and static variables
• Different from making them PRIVATE
  – With PRIVATE, global variables are masked.
  – THREADPRIVATE preserves global scope within each thread
• Threadprivate variables can be initialized using COPYIN or by using DATA statements.

48

Page 49: Threadprivate/copyin

• You initialize threadprivate data using a copyin clause.

      parameter (N=1000)
      common/buf/A(N)
C$OMP THREADPRIVATE(/buf/)
C     Initialize the A array
      call init_data(N,A)
C$OMP PARALLEL COPYIN(A)
      … Now each thread sees the threadprivate array A initialized
      … to the global value set in the subroutine init_data()
C$OMP END PARALLEL
      ....
C$OMP PARALLEL
      ... Values of threadprivate are persistent across parallel regions
C$OMP END PARALLEL

49
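For comparison, a C sketch of threadprivate with illustrative names: a file-scope counter that keeps a separate, persistent value in each thread across parallel regions.

#include <stdio.h>
#include <omp.h>

static int counter = 0;
#pragma omp threadprivate(counter)

void demo(void) {
    #pragma omp parallel
    counter++;                      /* each thread increments its own copy */
    #pragma omp parallel
    printf("thread %d: counter = %d\n", omp_get_thread_num(), counter);
}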

Page 50: OpenMP Synchronization

• High-level synchronization:
  – critical section
  – atomic
  – barrier
  – ordered
• Low-level synchronization:
  – flush
  – locks (both simple and nested)

50

Page 51: Critical Section

• Only one thread at a time can enter a critical section.

C$OMP PARALLEL DO PRIVATE(B)
C$OMP& SHARED(RES)
      DO 100 I=1,NITERS
        B = DOIT(I)
C$OMP CRITICAL
        CALL CONSUME (B, RES)
C$OMP END CRITICAL
100   CONTINUE
C$OMP END PARALLEL DO

51

Page 52: Atomic

• Atomic is a special case of a critical section that can be used for certain simple statements
• It applies only to the update of a memory location

C$OMP PARALLEL PRIVATE(B)
      B = DOIT(I)
      tmp = big_ugly()
C$OMP ATOMIC
      X = X + tmp
C$OMP END PARALLEL

52

Page 53: OpenMP Tasks

Define a task:
  – C/C++: #pragma omp task
  – Fortran: !$omp task

• A task is generated when a thread encounters a task construct
  – Contains a task region and its data environment
  – Tasks can be nested
• A task region is a region consisting of all code encountered during the execution of a task.
• The data environment consists of all the variables associated with the execution of a given task.
  – It is constructed when the task is generated.

53

Page 54: Task Completion and Synchronization

• Task completion occurs when the task reaches the end of the task region code
• Multiple tasks are joined to complete through the use of task synchronization constructs
  – taskwait
  – barrier construct
• taskwait constructs:
  – #pragma omp taskwait
  – !$omp taskwait

int fib(int n) {
    int x, y;
    if (n < 2) return n;
    else {
        #pragma omp task shared(x)
        x = fib(n-1);
        #pragma omp task shared(y)
        y = fib(n-2);
        #pragma omp taskwait
        return x + y;
    }
}

54
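A hedged note on how such a task-recursive routine is usually launched (a sketch, not shown on the slide; assumes the fib() above plus <stdio.h>): the first call is made by one thread inside a parallel region, so the generated tasks can be picked up by the whole team.

int main(void) {
    int result;
    #pragma omp parallel
    {
        #pragma omp single
        result = fib(30);   /* one thread starts the recursion; all threads execute tasks */
    }
    printf("fib(30) = %d\n", result);
    return 0;
}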

Page 55: Example: A Linked List

........
while (my_pointer) {
    (void) do_independent_work (my_pointer);
    my_pointer = my_pointer->next;
} // End of while loop
........

(From “An Overview of OpenMP”, RvdP/V1, Tutorial IWOMP 2010 – CCS Un. of Tsukuba, June 14, 2010)

55

Page 56: Example: A Linked List With Tasking

my_pointer = listhead;
#pragma omp parallel
{
    #pragma omp single nowait
    {
        while (my_pointer) {
            #pragma omp task firstprivate(my_pointer)
            {
                (void) do_independent_work (my_pointer);
            }
            my_pointer = my_pointer->next;
        }
    } // End of single - no implied barrier (nowait)
} // End of parallel region - implied barrier

The OpenMP task is specified here (and executed in parallel).

56

Page 57: Ordered

• The ordered construct enforces the sequential order for a block.

#pragma omp parallel private (tmp)
#pragma omp for ordered
for (i=0; i<N; i++) {
    tmp = NEAT_STUFF_IN_PARALLEL(i);
    #pragma omp ordered
    res += consum(tmp);
}

57

Page 58: OpenMP Synchronization

• The flush construct denotes a sequence point where a thread tries to create a consistent view of memory.
  – All memory operations (both reads and writes) defined prior to the sequence point must complete.
  – All memory operations (both reads and writes) defined after the sequence point must follow the flush.
  – Variables in registers or write buffers must be updated in memory.
• Arguments to flush specify which variables are flushed. No arguments specifies that all thread-visible variables are flushed.

58

Page 59: A flush Example

• Pair-wise synchronization.

      integer ISYNC(NUM_THREADS)
C$OMP PARALLEL DEFAULT (PRIVATE) SHARED (ISYNC)
      IAM = OMP_GET_THREAD_NUM()
      ISYNC(IAM) = 0
C$OMP BARRIER
      CALL WORK()
      ISYNC(IAM) = 1        ! I’m all done; signal this to other threads
C$OMP FLUSH(ISYNC)          ! Make sure other threads can see my write.
      DO WHILE (ISYNC(NEIGH) .EQ. 0)
C$OMP FLUSH(ISYNC)          ! Make sure the read picks up a good copy from memory.
      END DO
C$OMP END PARALLEL

Note: flush is analogous to a fence in other shared memory APIs.

59

Page 60: OpenMP Lock Routines

• Simple lock routines: a simple lock is available if it is unset.
  – omp_init_lock(), omp_set_lock(), omp_unset_lock(), omp_test_lock(), omp_destroy_lock()
• Nested locks: available if it is unset, or if it is set but owned by the thread executing the nested lock function
  – omp_init_nest_lock(), omp_set_nest_lock(), omp_unset_nest_lock(), omp_test_nest_lock(), omp_destroy_nest_lock()

60

Page 61: OpenMP Locks

• Protect resources with locks.

omp_lock_t lck;
omp_init_lock(&lck);
#pragma omp parallel private (tmp, id)
{
    id = omp_get_thread_num();
    tmp = do_lots_of_work(id);
    omp_set_lock(&lck);        // Wait here for your turn.
    printf("%d %d", id, tmp);
    omp_unset_lock(&lck);      // Release the lock so the next thread gets a turn.
}
omp_destroy_lock(&lck);        // Free up storage when done.

61

Page 62: OpenMP Library Routines

• Modify/check the number of threads
  – omp_set_num_threads(), omp_get_num_threads(), omp_get_thread_num(), omp_get_max_threads()
• Are we in a parallel region?
  – omp_in_parallel()
• How many processors in the system?
  – omp_get_num_procs()

62

Page 63: OpenMP Environment Variables

• Set the default number of threads to use.
  – OMP_NUM_THREADS int_literal
• Control how “omp for schedule(RUNTIME)” loop iterations are scheduled.
  – OMP_SCHEDULE “schedule[, chunk_size]”

63

Page 64: Outline

• OpenMP Introduction
• Parallel Programming with OpenMP
  – Worksharing, tasks, data environment, synchronization
• OpenMP Performance and Best Practices
• Case Studies and Examples
• Reference Materials

64

Page 65: OpenMP Performance

• The relative ease of using OpenMP is a mixed blessing
• We can quickly write a correct OpenMP program, but without the desired level of performance.
• There are certain “best practices” to avoid common performance problems.
• Extra work is needed to program with large thread counts.

65

Page 66: Typical OpenMP Performance Issues

• Overheads of OpenMP constructs and thread management, e.g.
  – dynamic loop schedules have much higher overheads than static schedules
  – Synchronization is expensive; use NOWAIT if possible
  – Large parallel regions help reduce overheads and enable better cache usage and standard optimizations
• Overheads of runtime library routines
  – Some are called frequently
• Load balance
• Cache utilization and false sharing

66

Page 67: Overheads of OpenMP Directives

[Figure: OpenMP overheads measured with the EPCC microbenchmarks on an SGI Altix 3600. Overhead in cycles (up to ~1,400,000) vs. number of threads (1–256) for PARALLEL, FOR, PARALLEL FOR, BARRIER, SINGLE, CRITICAL, LOCK/UNLOCK, ORDERED, ATOMIC, and REDUCTION.]

67

Page 68: OpenMP Best Practices

• Reduce usage of barriers with the nowait clause

#pragma omp parallel
{
    #pragma omp for
    for (i=0; i<n; i++)
        ….
    #pragma omp for nowait
    for (i=0; i<n; i++)
        ….
}

68

Page 69: OpenMP Best Practices

#pragma omp parallel private(i)
{
    #pragma omp for nowait
    for (i=0; i<n; i++)
        a[i] += b[i];
    #pragma omp for nowait
    for (i=0; i<n; i++)
        c[i] += d[i];
    #pragma omp barrier
    #pragma omp for nowait reduction(+:sum)
    for (i=0; i<n; i++)
        sum += a[i] + c[i];
}

69

Page 70: OpenMP Best Practices

• Avoid large ordered constructs
• Avoid large critical regions

#pragma omp parallel shared(a,b) private(c,d)
{
    ….
    #pragma omp critical
    {
        a += 2 * c;
        c = d * d;        // Move this statement out of the critical region
    }
}

70
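A sketch of the revised region after applying that advice: only the update of the shared variable stays protected, and the purely private computation is moved out (its position after the shared update is kept so the old value of c is still used).

#pragma omp parallel shared(a,b) private(c,d)
{
    ….
    #pragma omp critical
    a += 2 * c;           /* only the shared update needs protection */
    c = d * d;            /* private work, no longer inside the critical region */
}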

Page 71: OpenMP Best Practices

• Maximize parallel regions

Multiple parallel regions:
#pragma omp parallel
{
    #pragma omp for
    for (…) { /* Work-sharing loop 1 */ }
}
opt = opt + N;   // sequential
#pragma omp parallel
{
    #pragma omp for
    for (…) { /* Work-sharing loop 2 */ }
    #pragma omp for
    for (…) { /* Work-sharing loop N */ }
}

Single combined parallel region:
#pragma omp parallel
{
    #pragma omp for
    for (…) { /* Work-sharing loop 1 */ }
    #pragma omp single nowait
    opt = opt + N;   // sequential
    #pragma omp for
    for (…) { /* Work-sharing loop 2 */ }
    #pragma omp for
    for (…) { /* Work-sharing loop N */ }
}

71

Page 72: OpenMP Best Practices

• Single parallel region enclosing all work-sharing loops.

Parallel region per innermost loop:
for (i=0; i<n; i++)
    for (j=0; j<n; j++)
        #pragma omp parallel for private(k)
        for (k=0; k<n; k++) { …… }

Single enclosing parallel region:
#pragma omp parallel private(i,j,k)
{
    for (i=0; i<n; i++)
        for (j=0; j<n; j++)
            #pragma omp for
            for (k=0; k<n; k++) { ……. }
}

72

Page 73: OpenMP Best Practices

• Address load imbalances
• Use parallel for dynamic schedules and different chunk sizes

Smith-Waterman Sequence Alignment Algorithm

73

Page 74: OpenMP Best Practices

• Smith-Waterman Algorithm
  – The default schedule is static even → load imbalance

#pragma omp for
for (…)
    for (…)
        for (…)
            for (…) { /* compute alignments */ }
#pragma omp critical
{ . /* compute scores */ }

74

Page 75: OpenMP Best Practices

Smith-Waterman Sequence Alignment Algorithm

[Figure: speedup vs. number of threads (2–128) for sequence lengths 100, 600, and 1000, compared to ideal speedup. Left: "#pragma omp for" (default static schedule). Right: "#pragma omp for schedule(dynamic, 1)".]

With the dynamic schedule, the code reaches 128 threads with 80% efficiency.

75

Page 76: OpenMP Best Practices

• Address load imbalances by selecting the best schedule and chunk size
• Avoid selecting a small chunk size when the work in a chunk is small.

[Figures: overheads of OpenMP for on an SGI Altix 3600, in cycles vs. number of threads (1–256). Static scheduling (default and chunk sizes 2, 8, 32, 128) reaches roughly 80,000 cycles; dynamic scheduling (chunk sizes 1, 4, 16, 64) reaches roughly 14,000,000 cycles.]

76

Page 77: OpenMP Best Practices

• Pipeline processing to overlap I/O and computations

for (i=0; i<N; i++) {
    ReadFromFile(i,…);
    for (j=0; j<ProcessingNum; j++)
        ProcessData(i,j);
    WriteResultsToFile(i);
}

77

Page 78: OpenMP Best Practices

• Pipeline processing
• Pre-fetches I/O
• Threads reading or writing files join the computations

#pragma omp parallel
{
    #pragma omp single
    { ReadFromFile(0,...); }
    for (i=0; i<N; i++) {
        #pragma omp single nowait
        { if (i < N-1) ReadFromFile(i+1,….); }   // the check deals with the last file
        #pragma omp for schedule(dynamic)
        for (j=0; j<ProcessingNum; j++)
            ProcessChunkOfData(i,j);
        // The implicit barrier here is very important: 1) file i is finished, so we
        // can write it out; 2) file i+1 has been read in, so we can process it in
        // the next loop iteration
        #pragma omp single nowait
        { WriteResultsToFile(i); }
    }
}

78

Page 79: OpenMP Best Practices

• single vs. master work-sharing
  – master is more efficient but requires thread 0 to be available
  – single is more efficient if the master thread is not available
  – single has an implied barrier

79

Page 80: Cache Coherence

• Real-world shared memory systems have caches between memory and CPU
• Copies of a single data item can exist in multiple caches
• Modification of a shared data item by one CPU leads to outdated copies in the cache of another CPU

[Figure: memory holds the original data item; CPU0 and CPU1 each hold a copy of the data item in their caches.]

Page 81: OpenMP Best Practices

• False sharing
  – When at least one thread writes to a cache line while others access it
    • Thread 0: = A[1] (read)
    • Thread 1: A[0] = … (write)
• Solution: use array padding

int a[max_threads];
#pragma omp parallel for schedule(static,1)
for(int i=0; i<max_threads; i++)
    a[i] += i;

int a[max_threads][cache_line_size];
#pragma omp parallel for schedule(static,1)
for(int i=0; i<max_threads; i++)
    a[i][0] += i;

False sharing: a store into a shared cache line invalidates the other copies of that line; the system is not able to distinguish between changes within one individual line.

81

Page 82: Exercise: Feel the false sharing with axpy-papi.c

82
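The exercise file axpy-papi.c is not reproduced in the slides; a self-contained sketch of the same effect (illustrative sizes, assuming 64-byte cache lines and at most 64 threads) times adjacent per-thread counters against padded ones:

#include <stdio.h>
#include <omp.h>

#define REPS 100000000L
#define PAD  16               /* 16 ints = 64 bytes: one counter per cache line */

int main(void) {
    static int adjacent[64];        /* counters share cache lines: false sharing */
    static int padded[64][PAD];     /* each counter sits in its own cache line */

    double t0 = omp_get_wtime();
    #pragma omp parallel
    {
        int id = omp_get_thread_num();
        for (long r = 0; r < REPS; r++) adjacent[id]++;
    }
    double t1 = omp_get_wtime();

    #pragma omp parallel
    {
        int id = omp_get_thread_num();
        for (long r = 0; r < REPS; r++) padded[id][0]++;
    }
    double t2 = omp_get_wtime();

    printf("adjacent: %.3f s   padded: %.3f s\n", t1 - t0, t2 - t1);
    printf("check: %d %d\n", adjacent[0], padded[0][0]);  /* use the results */
    return 0;
}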

Page 83: OpenMP Best Practices

• Data placement policy on NUMA architectures
• First Touch Policy
  – The process that first touches a page of memory causes that page to be allocated in the node on which the process is running

[Figure: a generic cc-NUMA architecture.]

83

Page 84: NUMA First-touch Placement/1

int a[100];   // Only reserves the virtual memory address

for (i=0; i<100; i++)
    a[i] = 0;

First touch: all array elements (a[0] … a[99]) are in the memory of the processor executing this thread.

84

Page 85: NUMA First-touch Placement/2

#pragma omp parallel for num_threads(2)
for (i=0; i<100; i++)
    a[i] = 0;

First touch: both memories each have “their half” of the array (a[0] … a[49] in one node, a[50] … a[99] in the other).

85

Page 86: OpenMP Best Practices

• First-touch in practice
  – Initialize data consistently with the computations

#pragma omp parallel for
for (i=0; i<N; i++) {
    a[i] = 0.0; b[i] = 0.0; c[i] = 0.0;
}
readfile(a,b,c);
#pragma omp parallel for
for (i=0; i<N; i++) {
    a[i] = b[i] + c[i];
}

86

Page 87: OpenMP Best Practices

• Privatize variables as much as possible
  – Private variables are stored on the thread's local stack
  – Private data is close in the cache

Shared array indexed by thread:
double a[MaxThreads][N][N];
#pragma omp parallel for
for (i=0; i<MaxThreads; i++) {
    for (int j…)
        for (int k…)
            a[i][j][k] = …
}

Private array:
double a[N][N];
#pragma omp parallel private(a)
{
    for (int j…)
        for (int k…)
            a[j][k] = …
}

87

Page 88: OpenMP Best Practices

• CFD application pseudo-code
  – Shared arrays initialized incorrectly (first-touch policy)
  – Delays in remote memory accesses are the probable cause, through saturation of the interconnect

procedure diff_coeff() {
    array allocation by master thread
    initialization of shared arrays
    PARALLEL REGION {
        loop lower_bn[id], upper_bn[id]
        computation on shared arrays
        …..
    }
}

88

Page 89: OpenMP Best Practices

• Array privatization
  – Improved the performance of the whole program by 30%
  – Speedup of 10 for the procedure, which now takes only 5% of total time
• Processor stalls are reduced significantly

[Figure: stall cycle breakdown for the non-privatized (NP) and privatized (P) versions of diff_coeff, plus their difference (NP-P) — D-cache stalls, branch misprediction, instruction miss stalls, FLP units, front-end flushes — cycle counts on the order of 10^10.]

89

Page 90: OpenMP Best Practices

• Avoid thread migration
  – It affects data locality
• Bind threads to cores.
• Linux:
  – numactl --cpubind=0 foobar
  – taskset -c 0,1 foobar
• SGI Altix:
  – dplace -x2 foobar

90

Page 91: OpenMP Sources of Errors

• Incorrect use of synchronization constructs
  – Less likely if users stick to directives
  – Erroneous use of NOWAIT
• Race conditions (true sharing)
  – Can be very hard to find
• Wrong “spelling” of sentinel
• Use tools to check for data races.

91

Page 92: Outline

• OpenMP Introduction
• Parallel Programming with OpenMP
  – Worksharing, tasks, data environment, synchronization
• OpenMP Performance and Best Practices
• Hybrid MPI/OpenMP
• Case Studies and Examples
• Reference Materials

92

Page 93: Matrix-Vector Multiplication

[Figures: the sequential source and the OpenMP source of a matrix-vector multiplication, with the loop over rows i parallelized. From “Getting OpenMP Up To Speed”, RvdP/V1, Tutorial IWOMP 2010 – CCS Un. of Tsukuba, June 14, 2010.]

93

Page 94: Performance – 2-Socket Nehalem

[Figure: matrix-vector multiplication performance on a 2-socket Nehalem system.]

94

2-socketNehalem

95

Getting OpenMP Up To Speed

RvdP/V1 Tutorial IWOMP 2010 – CCS Un. of Tsukuba, June 14, 2010

A Two Socket Nehalem System

Page 96: Lecture 04-06: Programming with OpenMP - GitHub Pages · Lecture 04-06: Programming with OpenMP Concurrent and Mul

Dataini<aliza<on

96

Getting OpenMP Up To Speed

RvdP/V1 Tutorial IWOMP 2010 – CCS Un. of Tsukuba, June 14, 2010

Data Initialization

= *

j

i

IniAalizaAonwillcausetheallocaAonofmemoryaccordingtothefirsttouchpolicy

Page 97: Lecture 04-06: Programming with OpenMP - GitHub Pages · Lecture 04-06: Programming with OpenMP Concurrent and Mul

ExploitFirstTouch

97

Getting OpenMP Up To Speed

RvdP/V1 Tutorial IWOMP 2010 – CCS Un. of Tsukuba, June 14, 2010

Exploit First Touch

Page 98: Lecture 04-06: Programming with OpenMP - GitHub Pages · Lecture 04-06: Programming with OpenMP Concurrent and Mul

A3Dmatrixupdate

Getting OpenMP Up To Speed

RvdP/V1 Tutorial IWOMP 2010 – CCS Un. of Tsukuba, June 14, 2010

Observation No data dependency on 'I' Therefore we can split the 3D

matrix in larger blocks and process these in parallel

JI

K

98

Page 99: The Idea

• We need to distribute the M iterations over the number of processors
• We do this by controlling the start (IS) and end (IE) value of the inner loop
• Each thread will calculate these values for its portion of the work

[Figure: the K-J-I iteration space split along I into per-thread [is, ie) ranges.]

99

Page 100: A 3D Matrix Update

• The loops are correctly nested for serial performance
• Due to a data dependency on J and K, only the inner loop can be parallelized
• This will cause the barrier to be executed (N-1)^2 times

[Figure: data dependency graph — element (j,k) depends on (j-1,k) and (j,k-1).]

100

Page 101: The Performance

• Dimensions: M=7,500, N=20; footprint: ~24 MByte
• The inner loop over I has been parallelized
• Scaling is very poor (as is to be expected)

[Figure: performance (Mflop/s) vs. number of threads.]

101

Page 102: Performance Analyzer Data

• Using 10 threads: some routines scale somewhat, others do not scale at all
• Using 20 threads
• Question: Why is __mt_WaitForWork so high in the profile?

[Figure: profiler output for 10 and 20 threads.]

102

Page 103: False Sharing at Work

• This is false sharing at work!
• With P=1 there is no sharing
• False sharing increases as we increase the number of threads

[Figure: cache-line sharing for P = 1, 2, 4, 8 threads.]

103

Page 104: Performance Comparison

• For a higher value of M, the program scales better

[Figure: performance (Mflop/s) vs. number of threads for M = 7,500 and M = 75,000.]

104

Page 105: The First Implementation

[Figure: source of the first (manually chunked) implementation.]

105

Page 106: Another Idea: Use OpenMP!

[Figure: the OpenMP version of the code.]

106

Page 107: How This Works on 2 Threads

• Alternating parallel regions and work-sharing splits the operation in a way that is similar to our manual implementation

[Figure: timeline for 2 threads — parallel region, work sharing, parallel region, work sharing, … etc.]

107

Page 108: Performance

• We have set M=7,500 and N=20
• This problem size does not scale at all when we explicitly parallelize the inner loop over 'I'
• We have tested 4 versions of this program:
  – Inner Loop Over 'I' – our first OpenMP version
  – AutoPar – the automatically parallelized version of 'kernel'
  – OMP_Chunks – the manually parallelized version with our explicit calculation of the chunks
  – OMP_DO – the version with the OpenMP parallel region and work-sharing DO

108

Page 109: Performance

• The performance for M=7,500; dimensions: M=7,500, N=20; footprint: ~24 MByte
• The auto-parallelizing compiler does really well!

[Figure: performance (Mflop/s) vs. number of threads for the Inner Loop, OMP DO, and OMP Chunks versions.]

109

Page 110: Reference Material on OpenMP

• OpenMP Homepage, www.openmp.org:
  – The primary source of information about OpenMP and its development.
• OpenMP User’s Group (cOMPunity) Homepage, www.compunity.org
• Books:
  – Using OpenMP, Barbara Chapman, Gabriele Jost, Ruud Van Der Pas, Cambridge, MA: The MIT Press, 2007, ISBN: 978-0-262-53302-7
  – Parallel Programming in OpenMP, Rohit Chandra, San Francisco, Calif.: Morgan Kaufmann; London: Harcourt, 2000, ISBN: 1558606718

110

Page 111: Standard OpenMP Implementation

• Directives are implemented via code modification and insertion of runtime library calls
  – The basic step is outlining of the code in the parallel region
• The runtime library is responsible for managing threads
  – Scheduling loops
  – Scheduling tasks
  – Implementing synchronization
• The implementation effort is reasonable
• Each compiler has custom runtime support. The quality of the runtime system has a major impact on performance.

OpenMP Code Translation

int main(void) {
    int a, b, c;
    #pragma omp parallel private(c)
    do_sth(a, b, c);
    return 0;
}

_INT32 main() {
    int a, b, c;
    /* microtask */
    void __ompregion_main1() {
        _INT32 __mplocal_c;
        /* shared variables are kept intact; accesses to the private variable are substituted */
        do_sth(a, b, __mplocal_c);
    }
    …
    /* OpenMP runtime calls */
    __ompc_fork(&__ompregion_main1);
    …
}