
Lecture 04-06: Programming with OpenMP

Concurrent and Multicore Programming, CSE 536

Department of Computer Science and Engineering
Yonghong Yan

yan@oakland.edu
www.secs.oakland.edu/~yan

1

Topics (Part 1)

•  Introduction
•  Programming on shared memory system (Chapter 7)
   –  OpenMP
•  Principles of parallel algorithm design (Chapter 3)
•  Programming on shared memory system (Chapter 7)
   –  Cilk/Cilkplus (?)
   –  PThread, mutual exclusion, locks, synchronizations
•  Analysis of parallel program executions (Chapter 5)
   –  Performance Metrics for Parallel Systems
      •  Execution Time, Overhead, Speedup, Efficiency, Cost
   –  Scalability of Parallel Systems
   –  Use of performance tools

2

Outline

•  OpenMP Introduction
•  Parallel Programming with OpenMP
   –  OpenMP parallel region and worksharing
   –  OpenMP data environment, tasking and synchronization
•  OpenMP Performance and Best Practices
•  More Case Studies and Examples
•  Reference Materials

3

What is OpenMP

•  Standard API to write shared memory parallel applications in C, C++, and Fortran
   –  Compiler directives, runtime routines, environment variables
•  OpenMP Architecture Review Board (ARB)
   –  Maintains the OpenMP specification
   –  Permanent members
      •  AMD, Cray, Fujitsu, HP, IBM, Intel, NEC, PGI, Oracle, Microsoft, Texas Instruments, NVIDIA, Convey
   –  Auxiliary members
      •  ANL, ASC/LLNL, cOMPunity, EPCC, LANL, NASA, TACC, RWTH Aachen University, UH
   –  http://www.openmp.org
•  Latest version 4.5 released Nov 2015

4

My role with OpenMP

5

“Hello World” Example/1

#include <stdlib.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    printf("Hello World\n");
    return(0);
}

6

“Hello World” - An Example/2

#include <stdlib.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
#pragma omp parallel
    {
        printf("Hello World\n");
    } // End of parallel region
    return(0);
}

7

“Hello World” - An Example/3

$ gcc -fopenmp hello.c
$ export OMP_NUM_THREADS=2
$ ./a.out
Hello World
Hello World
$ export OMP_NUM_THREADS=4
$ ./a.out
Hello World
Hello World
Hello World
Hello World
$

8

#include <stdlib.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
#pragma omp parallel
    {
        printf("Hello World\n");
    } // End of parallel region
    return(0);
}

OpenMP Components

9

Directives:
•  Parallel region
•  Worksharing constructs
•  Tasking
•  Offloading
•  Affinity
•  Error handling
•  SIMD
•  Synchronization
•  Data-sharing attributes

Runtime Environment:
•  Number of threads
•  Thread ID
•  Dynamic thread adjustment
•  Nested parallelism
•  Schedule
•  Active levels
•  Thread limit
•  Nesting level
•  Ancestor thread
•  Team size
•  Locking
•  Wallclock timer

Environment Variables:
•  Number of threads
•  Scheduling type
•  Dynamic thread adjustment
•  Nested parallelism
•  Stack size
•  Idle threads
•  Active levels
•  Thread limit

“Hello World” - An Example/3

#include <stdlib.h>
#include <stdio.h>
#include <omp.h>

int main(int argc, char *argv[]) {
#pragma omp parallel
    {
        int thread_id = omp_get_thread_num();
        int num_threads = omp_get_num_threads();
        printf("Hello World from thread %d of %d\n", thread_id, num_threads);
    }
    return(0);
}

10

Directives

Runtime Environment

“Hello World” - An Example/4

11

Environment Variable

Environment variable: similar to program arguments, it is used to change the configuration of the execution without recompiling the program.

NOTE: the order of the print output is not deterministic.

The Design Principle Behind

•  Each printf is a task
•  A parallel region claims a set of cores for computation
   –  Cores are presented as multiple threads
•  Each thread executes a single task
   –  Task id is the same as thread id
      •  omp_get_thread_num()
   –  Num_tasks is the same as the total number of threads
      •  omp_get_num_threads()
•  1:1 mapping between task and thread
   –  Every task/core does similar work in this simple example

12

#pragma omp parallel
{
    int thread_id = omp_get_thread_num();
    int num_threads = omp_get_num_threads();
    printf("Hello World from thread %d of %d\n", thread_id, num_threads);
}

OpenMP Parallel Computing Solution Stack

13

[Figure: OpenMP solution stack. User layer: End User, Application. Programming layer (OpenMP API): directives and compiler, OpenMP runtime library, environment variables. System layer: runtime library, OS/system.]

OpenMP Syntax

14

•  Most OpenMP constructs are compiler directives using pragmas.
   –  For C and C++, the pragmas take the form: #pragma …
•  pragma vs language
   –  A pragma is not part of the language and should not express program logic
   –  It provides the compiler/preprocessor with additional information on how to process the directive-annotated code
   –  Similar to #include, #define

OpenMP Syntax

15

•  For C and C++, the pragmas take the form:
   #pragma omp construct [clause [clause]…]
•  For Fortran, the directives take one of the forms:
   –  Fixed form
      *$OMP construct [clause [clause]…]
      C$OMP construct [clause [clause]…]
   –  Free form (but works for fixed form too)
      !$OMP construct [clause [clause]…]
•  Include file and the OpenMP lib module
   #include <omp.h>
   use omp_lib

OpenMP Compiler
•  OpenMP: thread programming at a “high level”.
   –  The user does not need to specify the details
      •  Program decomposition, assignment of work to threads
      •  Mapping tasks to hardware threads
•  User makes strategic decisions
•  Compiler figures out details
   –  Compiler flags enable OpenMP (e.g. -openmp, -xopenmp, -fopenmp, -mp)

16

OpenMP Memory Model

•  OpenMP assumes a shared memory
•  Threads communicate by sharing variables.
•  Synchronization protects data conflicts.
   –  Synchronization is expensive.
•  Change how data is accessed to minimize the need for synchronization.
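As a minimal illustration (not from the slides) of why shared variables need protection, the sketch below has every thread update one shared counter; without the atomic directive the increments would race and updates could be lost.

#include <stdio.h>
#include <omp.h>

int main(void) {
    int counter = 0;                 // shared by all threads in the team
    #pragma omp parallel
    {
        #pragma omp atomic           // serialize just this one update
        counter++;
    }
    printf("counter = %d\n", counter);   // equals the number of threads
    return 0;
}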

17

OpenMP Fork-Join Execution Model

•  The master thread spawns multiple worker threads as needed; together they form a team

•  Parallel region is a block of code executed by all threads in a team simultaneously

18

[Figure: fork-join model. The master thread forks a team of worker threads at each parallel region (a nested parallel region forks again) and joins them at the end of the region.]

OpenMP Parallel Regions

•  In C/C++: a block is a single statement or a group of statements between { }
•  In Fortran: a block is a single statement or a group of statements between directive/end-directive pairs.

19

C$OMP PARALLEL
10    wrk(id) = garbage(id)
      res(id) = wrk(id)**2
      if(.not.conv(res(id))) goto 10
C$OMP END PARALLEL

C$OMP PARALLEL DO
      do i=1,N
        res(i) = bigComp(i)
      end do
C$OMP END PARALLEL DO

#pragma omp parallel
{
    id = omp_get_thread_num();
    res[id] = lots_of_work(id);
}

#pragma omp parallel for
for(i=0;i<N;i++) {
    res[i] = big_calc(i);
    A[i] = B[i] + res[i];
}

20

lexical extent of parallel region

C$OMP PARALLEL
      call whoami
C$OMP END PARALLEL

Dynamic extent of parallel region includes lexical extent

      subroutine whoami
      external omp_get_thread_num
      integer iam, omp_get_thread_num
      iam = omp_get_thread_num()
C$OMP CRITICAL
      print*, 'Hello from ', iam
C$OMP END CRITICAL
      return
      end

Orphaned directives can appear outside a parallel construct: here the CRITICAL inside whoami is orphaned, and the parallel region and the subroutine live in separate source files (foo.f and bar.f).

A parallel region can span multiple source files.

Scope of OpenMP Region

SPMD Program Models

21

•  SPMD (Single Program, Multiple Data) for parallel regions
   –  All threads of the parallel region execute the same code
   –  Each thread has a unique ID
•  Use the thread ID to diverge the execution of the threads
   –  Different threads can follow different paths through the same code
•  SPMD is by far the most commonly used pattern for structuring parallel programs
   –  MPI, OpenMP, CUDA, etc.

if(my_id == x) { } else { }

Modify the Hello World Program so …

•  Only one thread prints the total number of threads
   –  gcc -fopenmp hello.c -o hello
•  Only one thread reads the total number of threads and all threads print that info

22

#pragma omp parallel
{
    int thread_id = omp_get_thread_num();
    int num_threads = omp_get_num_threads();
    if (thread_id == 0)
        printf("Hello World from thread %d of %d\n", thread_id, num_threads);
    else
        printf("Hello World from thread %d\n", thread_id);
}

int num_threads = 99999;
#pragma omp parallel
{
    int thread_id = omp_get_thread_num();
    if (thread_id == 0)
        num_threads = omp_get_num_threads();
    #pragma omp barrier
    printf("Hello World from thread %d of %d\n", thread_id, num_threads);
}

Barrier

23

#pragma omp barrier

OpenMP Master

•  Denotes a structured block executed by the master thread
•  The other threads just skip it
   –  no synchronization is implied

24

#pragma omp parallel private (tmp)
{
    do_many_things_together();
    #pragma omp master
    {
        exchange_boundaries_by_master_only();
    }
    #pragma omp barrier
    do_many_other_things_together();
}

•  Denotes a block of code that is executed by only one thread.
   –  Could be the master
•  A barrier is implied at the end of the single block.

25

#pragma omp parallel private (tmp)
{
    do_many_things_together();
    #pragma omp single
    {
        exchange_boundaries_by_one();
    }
    do_many_other_things_together();
}

OpenMP Single

Using omp master/single to modify the Hello World Program so …

•  Only one thread prints the total number of threads
•  Only one thread reads the total number of threads and all threads print that info

26

#pragma omp parallel
{
    int thread_id = omp_get_thread_num();
    int num_threads = omp_get_num_threads();
    printf("Hello World from thread %d of %d\n", thread_id, num_threads);
}

Distributing Work Based on Thread ID

27

for(i=0;i<N;i++) {
    a[i] = a[i] + b[i];
}

#pragma omp parallel shared (a, b)
{
    int id, i, Nthrds, istart, iend;
    id = omp_get_thread_num();
    Nthrds = omp_get_num_threads();
    istart = id * N / Nthrds;
    iend = (id+1) * N / Nthrds;
    for(i=istart;i<iend;i++) {
        a[i] = a[i] + b[i];
    }
}

Sequential code

OpenMP parallel region

cat /proc/cpuinfo

Implementing axpy using OpenMP parallel

28

•  Divides the execution of the enclosed code region among the members of the team
•  The “for” worksharing construct splits up loop iterations among the threads in a team
   –  Each thread gets one or more “chunks” of iterations (loop chunking)

29

#pragma omp parallel
#pragma omp for
for(i=0;i<N;i++) {
    work(i);
}

By default, there is a barrier at the end of the “omp for”. Use the “nowait” clause to turn off the barrier.

#pragma omp for nowait

“nowait” is useful between two consecutive, independent omp for loops.
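As a hedged sketch (not from the slides; the arrays a, b and the functions compute_a, compute_b are placeholders), two independent loops in one parallel region can drop the first loop's barrier:

#pragma omp parallel
{
    #pragma omp for nowait              // no barrier: the next loop never reads a[]
    for (int i = 0; i < n; i++)
        a[i] = compute_a(i);

    #pragma omp for                     // implicit barrier at the end of this loop
    for (int i = 0; i < n; i++)
        b[i] = compute_b(i);
}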

OpenMP Worksharing Constructs

Worksharing Constructs

30

for(i=0;i<N;i++) {
    a[i] = a[i] + b[i];
}

#pragma omp parallel shared (a, b)
{
    int id, i, Nthrds, istart, iend;
    id = omp_get_thread_num();
    Nthrds = omp_get_num_threads();
    istart = id * N / Nthrds;
    iend = (id+1) * N / Nthrds;
    for(i=istart;i<iend;i++) {
        a[i] = a[i] + b[i];
    }
}

#pragma omp parallel shared (a, b) private (i)
#pragma omp for schedule(static)
for(i=0;i<N;i++) {
    a[i] = a[i] + b[i];
}

Sequential code

OpenMP parallel region

OpenMP parallel region and a worksharing for construct

•  schedule(static | dynamic | guided [, chunk])
•  schedule(auto | runtime)

31

static    Distribute iterations in blocks of size “chunk” over the threads in a round-robin fashion
dynamic   Fixed portions of work; the size is controlled by the value of chunk; when a thread finishes, it starts on the next portion of work
guided    Same dynamic behavior as “dynamic”, but the size of the portion of work decreases exponentially
auto      The compiler (or runtime system) decides what is best to use; the choice could be implementation dependent
runtime   The iteration scheduling scheme is set at runtime through the environment variable OMP_SCHEDULE

OpenMP schedule Clause
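A minimal sketch (not from the slides; N and work() are placeholders) of how the schedule clause appears in code:

// Dynamic scheduling with chunks of 4 iterations: threads grab the next
// chunk as they finish, which helps when iteration costs vary.
#pragma omp parallel for schedule(dynamic, 4)
for (int i = 0; i < N; i++)
    work(i);

// Guided scheduling: chunk sizes start large and shrink, trading
// scheduling overhead against load balance.
#pragma omp parallel for schedule(guided)
for (int i = 0; i < N; i++)
    work(i);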

OpenMP Sections

•  Worksharing construct
•  Gives a different structured block to each thread

32

#pragma omp parallel
#pragma omp sections
{
    #pragma omp section
    x_calculation();
    #pragma omp section
    y_calculation();
    #pragma omp section
    z_calculation();
}

By default, there is a barrier at the end of the “omp sections”. Use the “nowait” clause to turn off the barrier.

Loop Collapse

•  Allows parallelization of perfectly nested loops without using nested parallelism
•  The collapse clause on a for/do loop indicates how many loops should be collapsed

33

!$omp parallel do collapse(2) ...
do i = il, iu, is
  do j = jl, ju, js
    do k = kl, ku, ks
      .....
    end do
  end do
end do
!$omp end parallel do
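For C/C++, a hedged sketch of the same idea (the arrays a, b, c and the bounds M, N are placeholders): collapse(2) fuses the i and j loops into a single iteration space that is divided among the threads.

#pragma omp parallel for collapse(2)
for (int i = 0; i < M; i++)
    for (int j = 0; j < N; j++)
        c[i][j] = a[i][j] + b[i][j];   // perfectly nested: no code between the two for statements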

Exercise: OpenMP Matrix Multiplication

•  Parallel version
•  Parallel for version
   –  Experiment with different schedule policies and chunk sizes
      •  #pragma omp parallel for
   –  Experiment with collapse(2)

34

#pragma omp parallel shared (a, b) private (i)
#pragma omp for schedule(static)
for(i=0;i<N;i++) {
    a[i] = a[i] + b[i];
}

#pragma omp parallel for schedule(static) private (i) num_threads(num_ths)
for(i=0;i<N;i++) {
    a[i] = a[i] + b[i];
}

gcc -fopenmp mm.c -o mm
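One hedged way the exercise could look (allocation and initialization of the matrices are omitted; the schedule and collapse clauses are the knobs to experiment with):

// C = A * B for N x N matrices.
void matmul(int N, double A[N][N], double B[N][N], double C[N][N]) {
    #pragma omp parallel for collapse(2) schedule(static)
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            double sum = 0.0;            // private to each (i, j) iteration
            for (int k = 0; k < N; k++)
                sum += A[i][k] * B[k][j];
            C[i][j] = sum;
        }
    }
}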

Barrier

•  Barrier: each thread waits until all threads arrive.

35

#pragma omp parallel shared (A, B, C) private(id)
{
    id = omp_get_thread_num();
    A[id] = big_calc1(id);
    #pragma omp barrier
    #pragma omp for
    for(i=0;i<N;i++){ C[i] = big_calc3(i, A); }
    #pragma omp for nowait
    for(i=0;i<N;i++){ B[i] = big_calc2(C, i); }
    A[id] = big_calc3(id);
}

implicit barrier at the end of a parallel region

implicit barrier at the end of a for work-sharing construct

no implicit barrier due to nowait

•  Most variables are shared by default
•  Global variables are SHARED among threads
   –  Fortran: COMMON blocks, SAVE variables, MODULE variables
   –  C: file scope variables, static
•  But not everything is shared ...
   –  Stack variables in sub-programs called from parallel regions are PRIVATE
   –  Automatic variables defined inside the parallel region are PRIVATE.

36

Data Environment

OpenMP Data Environment

double a[size][size], b = 4;
#pragma omp parallel private (b)
{
    ....
}

[Figure: a[size][size] is shared data, visible to all threads T0–T3; each thread gets its own private copy b' whose initial value is undefined; the original b becomes undefined after the region.]

37

38

      program sort
      common /input/ A(10)
      integer index(10)
C$OMP PARALLEL
      call work (index)
C$OMP END PARALLEL
      print*, index(1)

      subroutine work (index)
      common /input/ A(10)
      integer index(*)
      real temp(10)
      integer count
      save count
      …………


A, index and count are shared by all threads.

temp is local to each thread

OpenMP Data Environment

Data Environment: Changing storage attributes

•  Selectively change the storage attributes of constructs using the following clauses
   –  SHARED
   –  PRIVATE
   –  FIRSTPRIVATE
   –  THREADPRIVATE
•  The value of a private variable inside a parallel loop and its global value outside the loop can be exchanged with
   –  FIRSTPRIVATE and LASTPRIVATE
•  The default status can be modified with:
   –  DEFAULT(PRIVATE | SHARED | NONE)

39

OpenMP Private Clause

•  private(var) creates a local copy of var for each thread.
   –  The value is uninitialized
   –  The private copy is not storage-associated with the original
   –  The original is undefined at the end

40

      IS = 0
C$OMP PARALLEL DO PRIVATE(IS)
      DO J=1,1000
        IS = IS + J
      END DO
C$OMP END PARALLEL DO
      print *, IS

OpenMP Private Clause

•  private(var) creates a local copy of var for each thread.
   –  The value is uninitialized
   –  The private copy is not storage-associated with the original
   –  The original is undefined at the end

41

      IS = 0
C$OMP PARALLEL DO PRIVATE(IS)
      DO J=1,1000
        IS = IS + J
      END DO
C$OMP END PARALLEL DO
      print *, IS

IS was not initialized

IS is undefined here

✗✗

Firstprivate Clause

•  firstprivate is a special case of private.
   –  Initializes each private copy with the corresponding value from the master thread.

42

      IS = 0
C$OMP PARALLEL DO FIRSTPRIVATE(IS)
      DO 20 J=1,1000
        IS = IS + J
20    CONTINUE
C$OMP END PARALLEL DO
      print *, IS

Firstprivate Clause

•  firstprivate is a special case of private.
   –  Initializes each private copy with the corresponding value from the master thread.

43

      IS = 0
C$OMP PARALLEL DO FIRSTPRIVATE(IS)
      DO 20 J=1,1000
        IS = IS + J
20    CONTINUE
C$OMP END PARALLEL DO
      print *, IS

Regardless of initialization, IS is undefined at this point

Each thread gets its own IS with an initial value of 0

✔ ✗

Lastprivate Clause

•  Lastprivate passes the value of a private variable from the last iteration to the variable of the master thread

44

      IS = 0
C$OMP PARALLEL DO FIRSTPRIVATE(IS)
C$OMP& LASTPRIVATE(IS)
      DO 20 J=1,1000
        IS = IS + J
20    CONTINUE
C$OMP END PARALLEL DO
      print *, IS

IS is defined as its value at the last iteration (i.e. for J=1000)

Each thread gets its own IS with an initial value of 0

✔ Is this code meaningful?
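For the C side, a hedged sketch (not from the slides) of how firstprivate and lastprivate combine; as with the Fortran example, it is not a correct way to compute the sum, it only shows the clause semantics:

int is = 0;
// firstprivate: each thread's copy of is starts at 0 (the value before the loop);
// lastprivate: after the loop, is holds the value from the sequentially last iteration.
#pragma omp parallel for firstprivate(is) lastprivate(is)
for (int j = 0; j < 1000; j++)
    is = is + j;
printf("%d\n", is);   // partial sum from the thread that ran the last iteration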

OpenMP Reduction

45

•  Here is the correct way to parallelize this code.

      IS = 0
C$OMP PARALLEL DO REDUCTION(+:IS)
      DO 20 J=1,1000
        IS = IS + J
20    CONTINUE
      print *, IS

Reduction does NOT imply firstprivate, so where does the initial 0 come from? Each thread's private copy starts at the operator's initial value (see the next slide).
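A hedged C sketch of the same reduction (not from the slides):

int is = 0;
// Each thread accumulates into a private copy initialized to 0, the identity
// of +; the partial sums are combined into is at the end of the loop.
#pragma omp parallel for reduction(+:is)
for (int j = 1; j <= 1000; j++)
    is += j;
printf("%d\n", is);   // 500500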

Reduction operands/initial-values

•  Associative operands used with reduction
•  Initial values are the ones that make sense mathematically

46

Operand   Initial value
+         0
*         1
-         0
.AND.     All 1’s
.OR.      0
MAX       1
MIN       0
//        All 0’s

Exercise: OpenMP Sum.c

•  Two versions
   –  Parallel for with reduction
   –  Parallel version, not using the “omp for” or “reduction” clause

47
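A hedged sketch of what the two versions of the exercise could look like (the array a and length n are placeholders; omp.h is assumed to be included):

// Version 1: worksharing loop with a reduction clause.
double sum_reduction(const double *a, int n) {
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

// Version 2: no "omp for"/"reduction"; each thread sums its own block
// (SPMD style) and adds its partial sum to the shared total under critical.
double sum_manual(const double *a, int n) {
    double sum = 0.0;
    #pragma omp parallel
    {
        int id = omp_get_thread_num();
        int nth = omp_get_num_threads();
        int start = id * n / nth, end = (id + 1) * n / nth;
        double local = 0.0;
        for (int i = start; i < end; i++)
            local += a[i];
        #pragma omp critical
        sum += local;
    }
    return sum;
}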

OpenMP Threadprivate

•  Makes global data private to a thread, thus crossing parallel region boundaries
   –  Fortran: COMMON blocks
   –  C: file scope and static variables
•  Different from making them PRIVATE
   –  With PRIVATE, global variables are masked.
   –  THREADPRIVATE preserves global scope within each thread
•  Threadprivate variables can be initialized using COPYIN or by using DATA statements.

48

Threadprivate/copyin

49

      parameter (N=1000)
      common/buf/A(N)
C$OMP THREADPRIVATE(/buf/)

C     Initialize the A array
      call init_data(N,A)

C$OMP PARALLEL COPYIN(A)
      … Now each thread sees the threadprivate array A initialized
      … to the global value set in the subroutine init_data()
C$OMP END PARALLEL
      ....
C$OMP PARALLEL
      ... Values of threadprivate variables are persistent across parallel regions
C$OMP END PARALLEL

•  You initialize threadprivate data using a copyin clause.
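For C, a hedged sketch (not from the slides) of the same pattern with a file-scope variable:

#include <stdio.h>
#include <omp.h>

int counter = 0;                      // file-scope: one copy per thread below
#pragma omp threadprivate(counter)

int main(void) {
    counter = 100;                    // set in the initial (master) thread
    #pragma omp parallel copyin(counter)
    {
        // copyin makes every thread's copy start at 100
        counter += omp_get_thread_num();
    }
    #pragma omp parallel
    {
        // threadprivate values persist across parallel regions
        printf("thread %d: counter = %d\n", omp_get_thread_num(), counter);
    }
    return 0;
}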

OpenMP Synchronization

•  High level synchronization:
   –  critical section
   –  atomic
   –  barrier
   –  ordered
•  Low level synchronization
   –  flush
   –  locks (both simple and nested)

50

Critical section

•  Only one thread at a time can enter a critical section.

51

C$OMP PARALLEL DO PRIVATE(B)
C$OMP& SHARED(RES)
      DO 100 I=1,NITERS
        B = DOIT(I)
C$OMP CRITICAL
        CALL CONSUME (B, RES)
C$OMP END CRITICAL
100   CONTINUE
C$OMP END PARALLEL DO

Atomic

•  Atomic is a special case of a critical section that can be used for certain simple statements
•  It applies only to the update of a memory location

52

C$OMP PARALLEL PRIVATE(B)
      B = DOIT(I)
      tmp = big_ugly()
C$OMP ATOMIC
      X = X + tmp
C$OMP END PARALLEL
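The same idea in C, as a hedged sketch (not from the slides; some_expensive_work() is a hypothetical helper):

double x = 0.0;
#pragma omp parallel
{
    double tmp = some_expensive_work();   // hypothetical helper, runs in parallel
    // Only the single update of x is serialized, which is cheaper than
    // wrapping the whole statement in a critical section.
    #pragma omp atomic
    x += tmp;
}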

OpenMP Tasks

Define a task:
   –  C/C++: #pragma omp task
   –  Fortran: !$omp task

53

•  A task is generated when a thread encounters a task construct
   –  Contains a task region and its data environment
   –  Tasks can be nested
•  A task region is a region consisting of all code encountered during the execution of a task.
•  The data environment consists of all the variables associated with the execution of a given task.
   –  It is constructed when the task is generated.

Task completion and synchronization

•  Task completion occurs when the task reaches the end of the task region code
•  Multiple tasks are joined to complete through the use of task synchronization constructs
   –  taskwait
   –  barrier construct
•  taskwait constructs:
   –  #pragma omp taskwait
   –  !$omp taskwait

54

int fib(int n) {
    int x, y;
    if (n < 2) return n;
    else {
        #pragma omp task shared(x)
        x = fib(n-1);
        #pragma omp task shared(y)
        y = fib(n-2);
        #pragma omp taskwait
        return x + y;
    }
}

Example: A Linked List

55

(Slides adapted from “An Overview of OpenMP”, RvdP/V1 Tutorial, IWOMP 2010, CCS Un. of Tsukuba, June 14, 2010.)

Example - A Linked List

........

while (my_pointer) {
    (void) do_independent_work(my_pointer);
    my_pointer = my_pointer->next;
} // End of while loop

........

Example: A Linked List with Tasking

56


Example - A Linked List With Tasking

my_pointer = listhead;
#pragma omp parallel
{
    #pragma omp single nowait
    {
        while (my_pointer) {
            #pragma omp task firstprivate(my_pointer)
            {
                (void) do_independent_work(my_pointer);
            }
            my_pointer = my_pointer->next;
        }
    } // End of single - no implied barrier (nowait)
} // End of parallel region - implied barrier

The OpenMP task is specified here (executed in parallel)

Ordered

•  The ordered construct enforces sequential order for a block.

57

#pragma omp parallel private (tmp)
#pragma omp for ordered
for (i=0;i<N;i++){
    tmp = NEAT_STUFF_IN_PARALLEL(i);
    #pragma omp ordered
    res += consum(tmp);
}

OpenMP Synchronization

•  The flush construct denotes a sequence point where a thread tries to create a consistent view of memory.
   –  All memory operations (both reads and writes) defined prior to the sequence point must complete.
   –  All memory operations (both reads and writes) defined after the sequence point must follow the flush.
   –  Variables in registers or write buffers must be updated in memory.
•  Arguments to flush specify which variables are flushed. No arguments specifies that all thread-visible variables are flushed.

58

A flush example

59

•  Pair-wise synchronization.

      integer ISYNC(NUM_THREADS)
C$OMP PARALLEL DEFAULT (PRIVATE) SHARED (ISYNC)
      IAM = OMP_GET_THREAD_NUM()
      ISYNC(IAM) = 0
C$OMP BARRIER
      CALL WORK()
      ISYNC(IAM) = 1     ! I'm all done; signal this to other threads
C$OMP FLUSH(ISYNC)
      DO WHILE (ISYNC(NEIGH) .EQ. 0)
C$OMP FLUSH(ISYNC)
      END DO
C$OMP END PARALLEL

Make sure other threads can see my write.

Make sure the read picks up a good copy from memory.

Note: flush is analogous to a fence in other shared memory APIs.

OpenMP Lock routines
•  Simple lock routines: a simple lock is available if it is unset.
   –  omp_init_lock(), omp_set_lock(), omp_unset_lock(), omp_test_lock(), omp_destroy_lock()
•  Nested locks: available if it is unset or if it is set but owned by the thread executing the nested lock function
   –  omp_init_nest_lock(), omp_set_nest_lock(), omp_unset_nest_lock(), omp_test_nest_lock(), omp_destroy_nest_lock()

60

OpenMP Locks
•  Protect resources with locks.

61

omp_lock_t lck;
omp_init_lock(&lck);
#pragma omp parallel private (tmp, id)
{
    id = omp_get_thread_num();
    tmp = do_lots_of_work(id);
    omp_set_lock(&lck);
    printf("%d %d", id, tmp);
    omp_unset_lock(&lck);
}
omp_destroy_lock(&lck);

Wait here for your turn.

Release the lock so the next thread gets a turn.

Free-up storage when done.

OpenMP Library Routines
•  Modify/check the number of threads
   –  omp_set_num_threads(), omp_get_num_threads(), omp_get_thread_num(), omp_get_max_threads()
•  Are we in a parallel region?
   –  omp_in_parallel()
•  How many processors in the system?
   –  omp_get_num_procs()

62
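A small hedged C sketch (not from the slides) exercising these routines:

#include <stdio.h>
#include <omp.h>

int main(void) {
    printf("outside: in_parallel=%d, procs=%d, max threads=%d\n",
           omp_in_parallel(), omp_get_num_procs(), omp_get_max_threads());

    omp_set_num_threads(4);             // request 4 threads for later regions
    #pragma omp parallel
    {
        if (omp_get_thread_num() == 0)  // only thread 0 reports the team size
            printf("inside: in_parallel=%d, team size=%d\n",
                   omp_in_parallel(), omp_get_num_threads());
    }
    return 0;
}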

OpenMP Environment Variables

•  Set the default number of threads to use.
   –  OMP_NUM_THREADS int_literal
•  Control how “omp for schedule(RUNTIME)” loop iterations are scheduled.
   –  OMP_SCHEDULE “schedule[, chunk_size]”

63
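A hedged sketch of how the two variables are typically used together (the shell lines are shown as comments; N and work() are placeholders):

//   $ export OMP_NUM_THREADS=8
//   $ export OMP_SCHEDULE="dynamic,16"
//   $ ./a.out
// The schedule(runtime) clause below then picks up OMP_SCHEDULE.
#pragma omp parallel for schedule(runtime)
for (int i = 0; i < N; i++)
    work(i);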

Outline

•  OpenMP Introduction
•  Parallel Programming with OpenMP
   –  Worksharing, tasks, data environment, synchronization
•  OpenMP Performance and Best Practices
•  Case Studies and Examples
•  Reference Materials

64

OpenMP Performance

•  The relative ease of using OpenMP is a mixed blessing
•  We can quickly write a correct OpenMP program, but without the desired level of performance.
•  There are certain “best practices” to avoid common performance problems.
•  Extra work is needed to program with large thread counts

65

Typical OpenMP Performance Issues

66

•  Overheads of OpenMP constructs and thread management, e.g.
   –  dynamic loop schedules have much higher overheads than static schedules
   –  synchronization is expensive, use NOWAIT if possible
   –  large parallel regions help reduce overheads and enable better cache usage and standard optimizations
•  Overheads of runtime library routines
   –  Some are called frequently
•  Load balance
•  Cache utilization and false sharing

Overheads of OpenMP Directives

[Chart: OpenMP overheads measured with the EPCC microbenchmarks on an SGI Altix 3600; overhead in cycles vs. number of threads (1 to 256) for the PARALLEL, FOR, PARALLEL FOR, BARRIER, SINGLE, CRITICAL, LOCK/UNLOCK, ORDERED, ATOMIC, and REDUCTION constructs.]

67

OpenMP Best Practices

•  Reduce the usage of barriers with the nowait clause

#pragma omp parallel
{
    #pragma omp for
    for(i=0;i<n;i++)
        ….
    #pragma omp for nowait
    for(i=0;i<n;i++)
        ….
}

68

OpenMP Best Practices

#pragma omp parallel private(i)
{
    #pragma omp for nowait
    for(i=0;i<n;i++)
        a[i] += b[i];
    #pragma omp for nowait
    for(i=0;i<n;i++)
        c[i] += d[i];
    #pragma omp barrier
    #pragma omp for nowait reduction(+:sum)
    for(i=0;i<n;i++)
        sum += a[i] + c[i];
}

69

OpenMP Best Practices

•  Avoid large ordered constructs
•  Avoid large critical regions

#pragma omp parallel shared(a,b) private(c,d)
{
    ….
    #pragma omp critical
    {
        a += 2 * c;
        c = d * d;
    }
}

Move this statement (c = d*d) out of the critical region: it only involves private variables.

70

OpenMP Best Practices

•  Maximize Parallel Regions

#pragma omp parallel
{
    #pragma omp for
    for (…) { /* Work-sharing loop 1 */ }
}
opt = opt + N;   // sequential
#pragma omp parallel
{
    #pragma omp for
    for (…) { /* Work-sharing loop 2 */ }
    #pragma omp for
    for (…) { /* Work-sharing loop N */ }
}

#pragma omp parallel
{
    #pragma omp for
    for (…) { /* Work-sharing loop 1 */ }
    #pragma omp single nowait
    opt = opt + N;   // sequential
    #pragma omp for
    for (…) { /* Work-sharing loop 2 */ }
    #pragma omp for
    for (…) { /* Work-sharing loop N */ }
}

71

OpenMP Best Practices

•  Single parallel region enclosing all work-sharing loops.

for (i=0; i<n; i++)
  for (j=0; j<n; j++)
    #pragma omp parallel for private(k)
    for (k=0; k<n; k++) { …… }

#pragma omp parallel private(i,j,k)
{
  for (i=0; i<n; i++)
    for (j=0; j<n; j++)
      #pragma omp for
      for (k=0; k<n; k++) { ……. }
}

72

OpenMP Best Practices

73

•  Address load imbalances
•  Use parallel for with dynamic schedules and different chunk sizes

Smith-Waterman Sequence Alignment Algorithm

OpenMP Best Practices

•  Smith-Waterman Algorithm
   –  The default schedule is static and even, which leads to load imbalance

74

#pragma omp for
for (…)
  for (…)
    for (…)
      for (…) { /* compute alignments */ }
#pragma omp critical
{
  . /* compute scores */
}

OpenMP Best Practices
Smith-Waterman Sequence Alignment Algorithm

[Charts: speedup vs. number of threads (2 to 128) for problem sizes 100, 600, 1000 against Ideal; left: default “#pragma omp for” (static schedule); right: “#pragma omp for schedule(dynamic, 1)”, which reaches 128 threads with 80% efficiency.]

75

OpenMP Best Practices

[Charts: overheads (in cycles) of the OpenMP for construct on an SGI Altix 3600 vs. number of threads (1 to 256); static scheduling with chunk sizes Default, 2, 8, 32, 128, and dynamic scheduling with chunk sizes 1, 4, 16, 64.]

•  Address load imbalances by selecting the best schedule and chunk size
•  Avoid selecting a small chunk size when the work in a chunk is small

76

OpenMP Best Practices

•  Pipeline processing to overlap I/O and computation

for (i=0; i<N; i++) {
    ReadFromFile(i, …);
    for (j=0; j<ProcessingNum; j++)
        ProcessData(i, j);
    WriteResultsToFile(i);
}

77

OpenMP Best Practices

#pragma omp parallel
{
    #pragma omp single
    { ReadFromFile(0, ...); }

    for (i=0; i<N; i++) {
        #pragma omp single nowait
        { if (i < N-1) ReadFromFile(i+1, ….); }

        #pragma omp for schedule(dynamic)
        for (j=0; j<ProcessingNum; j++)
            ProcessChunkOfData(i, j);

        #pragma omp single nowait
        { WriteResultsToFile(i); }
    }
}

•  Pipeline processing
•  Pre-fetches I/O
•  Threads reading or writing files join the computations

The implicit barrier here (at the end of the omp for) is very important: 1) file i is finished, so we can write its results; 2) file i+1 has been read in, so we can process it in the next loop iteration.

78

For dealing with the last file

OpenMP Best Practices

•  single vs. master work-sharing
   –  master is more efficient but requires thread 0 to be available
   –  single is more efficient if the master thread is not available
   –  single has an implicit barrier

79

Cache Coherence

•  Real-world shared memory systems have caches between memory and CPU
•  Copies of a single data item can exist in multiple caches
•  Modification of a shared data item by one CPU leads to outdated copies in the cache of another CPU

[Figure: two CPUs, each with its own cache, attached to memory; the original data item lives in memory and copies of it exist in the caches of CPU 0 and CPU 1.]

•  False sharing
   –  When at least one thread writes to a cache line while others access it
      •  Thread 0: = A[1] (read)
      •  Thread 1: A[0] = … (write)
•  Solution: use array padding

int a[max_threads];
#pragma omp parallel for schedule(static,1)
for(int i=0; i<max_threads; i++)
    a[i] += i;

int a[max_threads][cache_line_size];
#pragma omp parallel for schedule(static,1)
for(int i=0; i<max_threads; i++)
    a[i][0] += i;

OpenMP Best Practices


False Sharing

A store into a shared cache line invalidates the other copies of that line: the system is not able to distinguish between changes within one individual line.

[Figure: threads T0 and T1 on different CPUs update different elements of array A that fall in the same cache line.]

81

Exercise: Feel the false sharing with axpy-papi.c

82

OpenMP Best Practices
•  Data placement policy on NUMA architectures
•  First Touch Policy
   –  The process that first touches a page of memory causes that page to be allocated in the node on which the process is running


A generic cc-NUMA architecture

83

NUMA First-touch placement/1

84


About “First Touch” placement/1

for (i=0; i<100; i++) a[i] = 0;

int a[100];   // only reserves the virtual memory address range

First Touch: all array elements a[0] through a[99] end up in the memory of the processor executing this thread.

NUMA First-touch placement/2

85


About “First Touch” placement/2

#pragma omp parallel for num_threads(2)
for (i=0; i<100; i++)
    a[i] = 0;

First Touch: both memories each have “their half” of the array, a[0] through a[49] in one node's memory and a[50] through a[99] in the other's.

OpenMP Best Practices

•  First-touch in practice
   –  Initialize data consistently with the computations

#pragma omp parallel for
for(i=0; i<N; i++) {
    a[i] = 0.0; b[i] = 0.0; c[i] = 0.0;
}

readfile(a, b, c);

#pragma omp parallel for
for(i=0; i<N; i++) {
    a[i] = b[i] + c[i];
}

86

•  Privatize variables as much as possible
   –  Private variables are stored in the local stack of the thread
   –  Private data is close to the cache

double a[MaxThreads][N][N];
#pragma omp parallel for
for(i=0; i<MaxThreads; i++) {
    for(int j …)
        for(int k …)
            a[i][j][k] = …
}

double a[N][N];
#pragma omp parallel private(a)
{
    for(int j …)
        for(int k …)
            a[j][k] = …
}

OpenMP Best Practices

87

OpenMP Best Practices

procedure diff_coeff() {
    array allocation by master thread
    initialization of shared arrays
    PARALLEL REGION {
        loop lower_bn[id], upper_bn[id]
        computation on shared arrays
        …..
    }
}

•  CFD application pseudo-code
   –  Shared arrays are initialized incorrectly (first touch policy)
   –  Delays in remote memory accesses are the probable cause, due to saturation of the interconnect

88

Stall Cycle Breakdown for Non-Privatized (NP) and Privatized (P) Versions of diff_coeff

[Chart: stall cycle breakdown (cycles) for the non-privatized (NP) and privatized (P) versions: D-cache stalls, branch misprediction, instruction miss stalls, FLP units, front-end flushes; the NP and P bars and their difference (NP-P) are shown.]

OpenMP Best Practices
•  Array privatization
   –  Improved the performance of the whole program by 30%
   –  Speedup of 10 for the procedure, which now takes only 5% of the total time
•  Processor stalls are reduced significantly

89

•  Avoid Thread Migration
   –  It affects data locality
•  Bind threads to cores.
•  Linux:
   –  numactl --cpubind=0 foobar
   –  taskset -c 0,1 foobar
•  SGI Altix:
   –  dplace -x2 foobar

OpenMP Best Practices

90

OpenMP Sources of Errors

•  Incorrect use of synchronization constructs
   –  Less likely if users stick to directives
   –  Erroneous use of NOWAIT
•  Race conditions (true sharing)
   –  Can be very hard to find
•  Wrong “spelling” of the sentinel
•  Use tools to check for data races.

91

Outline

•  OpenMP Introduction
•  Parallel Programming with OpenMP
   –  Worksharing, tasks, data environment, synchronization
•  OpenMP Performance and Best Practices
•  Hybrid MPI/OpenMP
•  Case Studies and Examples
•  Reference Materials

92

Matrix vector multiplication

93


The Sequential Source

[Figure: matrix-vector product; rows indexed by i, columns by j.]


The OpenMP Source


Performance - 2-socket Nehalem

94


Performance - 2 Socket Nehalem

2-socket Nehalem

95


A Two Socket Nehalem System

Data initialization

96


Data Initialization


Initialization will cause the allocation of memory according to the first touch policy

Exploit First Touch

97


Exploit First Touch

A 3D matrix update


Observation: there is no data dependency on 'I'. Therefore we can split the 3D matrix into larger blocks and process these in parallel.

[Figure: 3D matrix with axes I, J, K.]

98

The idea


The Idea

[Figure: 3D matrix with axes I, J, K; per-thread bounds is and ie on the I dimension.]

We need to distribute the M iterations over the number of processors.
We do this by controlling the start (IS) and end (IE) value of the inner loop.
Each thread will calculate these values for its portion of the work.

99

A 3D matrix update

100


A 3D matrix update

The loops are correctly nested for serial performance

Due to a data dependency on J and K, only the inner loop can be parallelized

This will cause the barrier to be executed (N-1)² times

[Figure: data dependency graph; element (j, k) depends on (j-1, k) and (j, k-1), with no dependency along I.]

101


The performance

Dimensions: M=7,500, N=20. Footprint: ~24 MByte

[Chart: performance (Mflop/s) vs. number of threads.]

The inner loop over I has been parallelized. Scaling is very poor (as is to be expected).

The performance

Performance analyzer data

102


Performance Analyzer data

[Profiles using 10 threads and using 20 threads: some routines scale somewhat, others do not scale at all.]

Question: why is __mt_WaitForWork so high in the profile?

False sharing at work

103

This is False Sharing at work!

[Chart: performance for P = 1, 2, 4, 8 threads; the single-thread run has no sharing, and false sharing increases as we increase the number of threads.]

Performance compared

104


Performance comparison

[Chart: performance (Mflop/s) vs. number of threads for M = 7,500 and M = 75,000.]

For a higher value of M, the program scales better.

The first implementation


The first implementation

105

OpenMP version

106


Another Idea: Use OpenMP !

How this works

107


How this works on 2 threads

[Figure: each of the two threads alternates between parallel region and work sharing phases, and so on.]

This splits the operation in a way that is similar to our manual implementation.

Performance

108


Performance

We have set M=7,500 and N=20. This problem size does not scale at all when we explicitly parallelize the inner loop over 'I'.

We have tested 4 versions of this program:
•  Inner Loop Over 'I': our first OpenMP version
•  AutoPar: the automatically parallelized version of 'kernel'
•  OMP_Chunks: the manually parallelized version with our explicit calculation of the chunks
•  OMP_DO: the version with the OpenMP parallel region and work-sharing DO

Performance

109


The performance (M=7,500)

Dimensions: M=7,500, N=20. Footprint: ~24 MByte

[Chart: performance (Mflop/s) vs. number of threads for the OMP DO, Inner loop, and OMP Chunks versions.]

The auto-parallelizing compiler does really well!

Reference Material on OpenMP

110

•  OpenMP Homepage, www.openmp.org:
   –  The primary source of information about OpenMP and its development.
•  OpenMP User’s Group (cOMPunity) Homepage, www.compunity.org
•  Books:
   –  Using OpenMP, Barbara Chapman, Gabriele Jost, Ruud Van Der Pas, Cambridge, MA: The MIT Press, 2007, ISBN: 978-0-262-53302-7
   –  Parallel Programming in OpenMP, Chandra, Rohit, San Francisco, Calif.: Morgan Kaufmann; London: Harcourt, 2000, ISBN: 1558606718

110

•  Directives are implemented via code modification and insertion of runtime library calls
   –  The basic step is outlining of the code in a parallel region
•  The runtime library is responsible for managing threads
   –  Scheduling loops
   –  Scheduling tasks
   –  Implementing synchronization
•  The implementation effort is reasonable

OpenMP Code Translation

int main(void)
{
    int a, b, c;
    #pragma omp parallel private(c)
    do_sth(a, b, c);
    return 0;
}

_INT32 main()
{
    int a, b, c;
    /* microtask */
    void __ompregion_main1()
    {
        _INT32 __mplocal_c;
        /* shared variables are kept intact; substitute accesses to the private variable */
        do_sth(a, b, __mplocal_c);
    }
    …
    /* OpenMP runtime calls */
    __ompc_fork(&__ompregion_main1);
    …
}

Each compiler has custom run-time support. The quality of the runtime system has a major impact on performance.

Standard OpenMP Implementation
