
Page 1: Lecture 04-06: Programming with OpenMP

Concurrent and Multicore Programming, CSE536

Department of Computer Science and Engineering
Yonghong Yan
[email protected]
www.secs.oakland.edu/~yan

1

Page 2: Topics (Part 1)

• Introduction
• Programming on shared memory system (Chapter 7)
  – OpenMP
• Principles of parallel algorithm design (Chapter 3)
• Programming on shared memory system (Chapter 7)
  – Cilk/Cilkplus (?)
  – PThreads, mutual exclusion, locks, synchronizations
• Analysis of parallel program executions (Chapter 5)
  – Performance Metrics for Parallel Systems
    • Execution Time, Overhead, Speedup, Efficiency, Cost
  – Scalability of Parallel Systems
  – Use of performance tools

2

Page 3: Outline

• OpenMP Introduction
• Parallel Programming with OpenMP
  – OpenMP parallel region and worksharing
  – OpenMP data environment, tasking and synchronization
• OpenMP Performance and Best Practices
• More Case Studies and Examples
• Reference Materials

3

Page 4: What is OpenMP

• Standard API to write shared memory parallel applications in C, C++, and Fortran
  – Compiler directives, runtime routines, environment variables
• OpenMP Architecture Review Board (ARB)
  – Maintains the OpenMP specification
  – Permanent members
    • AMD, Cray, Fujitsu, HP, IBM, Intel, NEC, PGI, Oracle, Microsoft, Texas Instruments, NVIDIA, Convey
  – Auxiliary members
    • ANL, ASC/LLNL, cOMPunity, EPCC, LANL, NASA, TACC, RWTH Aachen University, UH
  – http://www.openmp.org
• Latest version 4.5, released November 2015

4

Page 5: My Role with OpenMP

5

Page 6: “Hello World” Example/1

#include <stdlib.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    printf("Hello World\n");
    return(0);
}

6

Page 7: “Hello World” – An Example/2

#include <stdlib.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    #pragma omp parallel
    {
        printf("Hello World\n");
    } // End of parallel region
    return(0);
}

7

Page 8: “Hello World” – An Example/3

$ gcc -fopenmp hello.c
$ export OMP_NUM_THREADS=2
$ ./a.out
Hello World
Hello World
$ export OMP_NUM_THREADS=4
$ ./a.out
Hello World
Hello World
Hello World
Hello World
$

#include <stdlib.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    #pragma omp parallel
    {
        printf("Hello World\n");
    } // End of parallel region
    return(0);
}

8

Page 9: OpenMP Components

Directives:
• Parallel region
• Worksharing constructs
• Tasking
• Offloading
• Affinity
• Error handling
• SIMD
• Synchronization
• Data-sharing attributes

Runtime Environment (routines):
• Number of threads
• Thread ID
• Dynamic thread adjustment
• Nested parallelism
• Schedule
• Active levels
• Thread limit
• Nesting level
• Ancestor thread
• Team size
• Locking
• Wallclock timer

Environment Variables:
• Number of threads
• Scheduling type
• Dynamic thread adjustment
• Nested parallelism
• Stack size
• Idle threads
• Active levels
• Thread limit

9

Page 10: “Hello World” – An Example/3

#include <stdlib.h>
#include <stdio.h>
#include <omp.h>

int main(int argc, char *argv[]) {
    #pragma omp parallel
    {
        int thread_id = omp_get_thread_num();
        int num_threads = omp_get_num_threads();
        printf("Hello World from thread %d of %d\n", thread_id, num_threads);
    }
    return(0);
}

Directives: #pragma omp parallel. Runtime environment: omp_get_thread_num(), omp_get_num_threads().

10

Page 11: “Hello World” – An Example/4

Environment Variable (e.g. OMP_NUM_THREADS): similar to a program argument, it changes the configuration of the execution without recompiling the program.

NOTE: the order in which the threads print is not deterministic.

11

Page 12: The Design Principle Behind

• Each printf is a task
• A parallel region claims a set of cores for computation
  – Cores are presented as multiple threads
• Each thread executes a single task
  – Task id is the same as thread id: omp_get_thread_num()
  – Num_tasks is the same as the total number of threads: omp_get_num_threads()
• 1:1 mapping between task and thread
  – Every task/core does similar work in this simple example

#pragma omp parallel
{
    int thread_id = omp_get_thread_num();
    int num_threads = omp_get_num_threads();
    printf("Hello World from thread %d of %d\n", thread_id, num_threads);
}

12

Page 13: OpenMP Parallel Computing Solution Stack

User layer: End User, Application
Programming layer (OpenMP API): Directives/Compiler, OpenMP library, Environment variables
System layer: Runtime library, OS/system

13

Page 14: OpenMP Syntax

• Most OpenMP constructs are compiler directives using pragmas.
  – For C and C++, the pragmas take the form: #pragma …
• pragma vs language
  – A pragma is not part of the language and should not express program logic
  – It provides the compiler/preprocessor additional information on how to process the directive-annotated code
  – Similar to #include, #define

14

Page 15: OpenMP Syntax

• For C and C++, the pragmas take the form:
  #pragma omp construct [clause [clause]…]
• For Fortran, the directives take one of the forms:
  – Fixed form
    *$OMP construct [clause [clause]…]
    C$OMP construct [clause [clause]…]
  – Free form (but works for fixed form too)
    !$OMP construct [clause [clause]…]
• Include file and the OpenMP lib module
  #include <omp.h>
  use omp_lib

15

Page 16: OpenMP Compiler

• OpenMP: thread programming at a “high level”
  – The user does not need to specify the details
    • Program decomposition, assignment of work to threads
    • Mapping tasks to hardware threads
  – The user makes strategic decisions; the compiler figures out the details
  – Compiler flags enable OpenMP (e.g. -openmp, -xopenmp, -fopenmp, -mp)

16

Page 17: OpenMP Memory Model

• OpenMP assumes a shared memory
• Threads communicate by sharing variables
• Synchronization protects against data conflicts
  – Synchronization is expensive
  – Change how data is accessed to minimize the need for synchronization

17

Page 18: OpenMP Fork-Join Execution Model

• Master thread spawns multiple worker threads as needed; together they form a team
• A parallel region is a block of code executed by all threads in a team simultaneously

[Figure: the master thread forks worker threads at each parallel region and joins them afterwards; parallel regions may be nested.]

18

Page 19: OpenMP Parallel Regions

• In C/C++: a block is a single statement or a group of statements between { }
• In Fortran: a block is a single statement or a group of statements between directive/end-directive pairs.

#pragma omp parallel
{
    id = omp_get_thread_num();
    res[id] = lots_of_work(id);
}

#pragma omp parallel for
for(i=0; i<N; i++) {
    res[i] = big_calc(i);
    A[i] = B[i] + res[i];
}

C$OMP PARALLEL
10    wrk(id) = garbage(id)
      res(id) = wrk(id)**2
      if(.not.conv(res(id))) goto 10
C$OMP END PARALLEL

C$OMP PARALLEL DO
      do i=1,N
        res(i) = bigComp(i)
      end do
C$OMP END PARALLEL DO

19

Page 20: Scope of OpenMP Region

A parallel region can span multiple source files. The dynamic extent of a parallel region includes its lexical extent, and orphaned directives can appear outside a parallel construct.

foo.f (lexical extent of the parallel region):
C$OMP PARALLEL
      call whoami
C$OMP END PARALLEL

bar.f (dynamic extent; the CRITICAL directive is orphaned):
      subroutine whoami
      external omp_get_thread_num
      integer iam, omp_get_thread_num
      iam = omp_get_thread_num()
C$OMP CRITICAL
      print*, 'Hello from ', iam
C$OMP END CRITICAL
      return
      end

20

Page 21: SPMD Program Models

• SPMD (Single Program, Multiple Data) for parallel regions
  – All threads of the parallel region execute the same code
  – Each thread has a unique ID
• Use the thread ID to diverge the execution of the threads
  – Different threads can follow different paths through the same code
• SPMD is by far the most commonly used pattern for structuring parallel programs
  – MPI, OpenMP, CUDA, etc.

if (my_id == x) { } else { }

21

Page 22: Modify the Hello World Program so that…

• Only one thread prints the total number of threads
  – gcc -fopenmp hello.c -o hello
• Only one thread reads the total number of threads, and all threads print that info

#pragma omp parallel
{
    int thread_id = omp_get_thread_num();
    int num_threads = omp_get_num_threads();
    if (thread_id == 0)
        printf("Hello World from thread %d of %d\n", thread_id, num_threads);
    else
        printf("Hello World from thread %d\n", thread_id);
}

int num_threads = 99999;
#pragma omp parallel
{
    int thread_id = omp_get_thread_num();
    if (thread_id == 0)
        num_threads = omp_get_num_threads();
    #pragma omp barrier
    printf("Hello World from thread %d of %d\n", thread_id, num_threads);
}

22

Page 23: Barrier

#pragma omp barrier

23

Page 24: OpenMP Master

• Denotes a structured block executed by the master thread
• The other threads just skip it
  – no synchronization is implied

#pragma omp parallel private (tmp)
{
    do_many_things_together();
    #pragma omp master
    { exchange_boundaries_by_master_only(); }
    #pragma omp barrier
    do_many_other_things_together();
}

24

Page 25: OpenMP Single

• Denotes a block of code that is executed by only one thread.
  – Could be the master
• A barrier is implied at the end of the single block.

#pragma omp parallel private (tmp)
{
    do_many_things_together();
    #pragma omp single
    { exchange_boundaries_by_one(); }
    do_many_other_things_together();
}

25

Page 26: Using omp master/single to Modify the Hello World Program so that…

• Only one thread prints the total number of threads
• Only one thread reads the total number of threads, and all threads print that info (a sketch of one solution follows this slide)

#pragma omp parallel
{
    int thread_id = omp_get_thread_num();
    int num_threads = omp_get_num_threads();
    printf("Hello World from thread %d of %d\n", thread_id, num_threads);
}

26
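One possible solution is sketched below (an illustration, not the course's reference answer). It assumes <stdio.h> and <omp.h> are included and relies on the barrier implied at the end of "single", so every thread sees the value before printing.

int num_threads = 0;
#pragma omp parallel
{
    /* exactly one thread reads the team size */
    #pragma omp single
    num_threads = omp_get_num_threads();
    /* implied barrier of "single": num_threads is now visible to all threads */
    printf("Hello World from thread %d of %d\n",
           omp_get_thread_num(), num_threads);
}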

Page 27: Distributing Work Based on Thread ID

Sequential code:
for(i=0; i<N; i++) { a[i] = a[i] + b[i]; }

OpenMP parallel region:
#pragma omp parallel shared (a, b)
{
    int id, i, Nthrds, istart, iend;
    id = omp_get_thread_num();
    Nthrds = omp_get_num_threads();
    istart = id * N / Nthrds;
    iend = (id+1) * N / Nthrds;
    for(i=istart; i<iend; i++) { a[i] = a[i] + b[i]; }
}

cat /proc/cpuinfo

27

Page 28: Implementing axpy using OpenMP parallel

28
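The exercise code is not reproduced in the slides; a minimal sketch of one way to do it with a bare parallel region, mirroring the thread-ID-based loop splitting of the previous slide (the function name and signature are illustrative only):

#include <omp.h>

/* y = a*x + y; each thread handles its own [istart, iend) range */
void axpy(int n, float a, const float *x, float *y) {
    #pragma omp parallel
    {
        int id = omp_get_thread_num();
        int nthrds = omp_get_num_threads();
        int istart = id * n / nthrds;
        int iend = (id + 1) * n / nthrds;
        for (int i = istart; i < iend; i++)
            y[i] = a * x[i] + y[i];
    }
}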

Page 29: OpenMP Worksharing Constructs

• Divides the execution of the enclosed code region among the members of the team
• The “for” worksharing construct splits up loop iterations among the threads in a team
  – Each thread gets one or more “chunks” -> loop chunking

#pragma omp parallel
#pragma omp for
for(i=0; i<N; i++) {
    work(i);
}

By default, there is a barrier at the end of the “omp for”. Use the “nowait” clause to turn off the barrier:
#pragma omp for nowait

“nowait” is useful between two consecutive, independent omp for loops.

29

Page 30: Worksharing Constructs

Sequential code:
for(i=0; i<N; i++) { a[i] = a[i] + b[i]; }

OpenMP parallel region:
#pragma omp parallel shared (a, b)
{
    int id, i, Nthrds, istart, iend;
    id = omp_get_thread_num();
    Nthrds = omp_get_num_threads();
    istart = id * N / Nthrds;
    iend = (id+1) * N / Nthrds;
    for(i=istart; i<iend; i++) { a[i] = a[i] + b[i]; }
}

OpenMP parallel region and a worksharing for construct:
#pragma omp parallel shared (a, b) private (i)
#pragma omp for schedule(static)
for(i=0; i<N; i++) { a[i] = a[i] + b[i]; }

30

Page 31: OpenMP schedule Clause

• schedule(static | dynamic | guided [, chunk])
• schedule(auto | runtime)

static: Distribute iterations in blocks of size "chunk" over the threads in a round-robin fashion
dynamic: Fixed portions of work; size is controlled by the value of chunk; when a thread finishes, it starts on the next portion of work
guided: Same dynamic behavior as "dynamic", but the size of the portion of work decreases exponentially
auto: The compiler (or runtime system) decides what is best to use; the choice could be implementation dependent
runtime: The iteration scheduling scheme is set at runtime through the environment variable OMP_SCHEDULE

31
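As a small illustration of the clause (not from the slides), a loop whose per-iteration cost varies strongly can be given a dynamic schedule with a small chunk size; work() here is a placeholder routine:

#pragma omp parallel for schedule(dynamic, 4)
for (int i = 0; i < N; i++)
    work(i);   /* placeholder: cost varies from iteration to iteration */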

Page 32: OpenMP Sections

• Worksharing construct
• Gives a different structured block to each thread

#pragma omp parallel
#pragma omp sections
{
    #pragma omp section
    x_calculation();
    #pragma omp section
    y_calculation();
    #pragma omp section
    z_calculation();
}

By default, there is a barrier at the end of the “omp sections”. Use the “nowait” clause to turn off the barrier.

32

Page 33: Loop Collapse

• Allows parallelization of perfectly nested loops without using nested parallelism
• The collapse clause on a for/do loop indicates how many loops should be collapsed

!$omp parallel do collapse(2) ...
      do i = il, iu, is
        do j = jl, ju, js
          do k = kl, ku, ks
            .....
          end do
        end do
      end do
!$omp end parallel do

33

Page 34: Exercise: OpenMP Matrix Multiplication

• Parallel version
• Parallel for version
  – Experiment with different schedule policies and chunk sizes
    • #pragma omp parallel for
  – Experiment with collapse(2)

#pragma omp parallel shared (a, b) private (i)
#pragma omp for schedule(static)
for(i=0; i<N; i++) { a[i] = a[i] + b[i]; }

#pragma omp parallel for schedule(static) private (i) num_threads(num_ths)
for(i=0; i<N; i++) { a[i] = a[i] + b[i]; }

gcc -fopenmp mm.c -o mm

34
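A sketch of the parallel-for version of the exercise, assuming square row-major matrices (one possible layout, not the reference solution in mm.c):

#include <omp.h>

void matmul(int n, const double *A, const double *B, double *C) {
    /* the outer two loops are collapsed into one iteration space and split statically */
    #pragma omp parallel for collapse(2) schedule(static)
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += A[i*n + k] * B[k*n + j];
            C[i*n + j] = sum;
        }
}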

Page 35: Barrier

• Barrier: each thread waits until all threads arrive.

#pragma omp parallel shared (A, B, C) private(id)
{
    id = omp_get_thread_num();
    A[id] = big_calc1(id);
    #pragma omp barrier
    #pragma omp for
    for(i=0; i<N; i++){ C[i] = big_calc3(i, A); }   // implicit barrier at the end of a for work-sharing construct
    #pragma omp for nowait
    for(i=0; i<N; i++){ B[i] = big_calc2(C, i); }   // no implicit barrier due to nowait
    A[id] = big_calc4(id);
}   // implicit barrier at the end of a parallel region

35

Page 36: Data Environment

• Most variables are shared by default
• Global variables are SHARED among threads
  – Fortran: COMMON blocks, SAVE variables, MODULE variables
  – C: file scope variables, static
• But not everything is shared...
  – Stack variables in sub-programs called from parallel regions are PRIVATE
  – Automatic variables defined inside the parallel region are PRIVATE

36

Page 37: OpenMP Data Environment

double a[size][size], b = 4;
#pragma omp parallel private (b)
{
    ....
}

Shared data: a[size][size]
Each thread T0..T3 gets its own private copy of b with an undefined initial value (b' = ?); the original b becomes undefined inside the region.

37

Page 38: OpenMP Data Environment

      program sort
      common /input/ A(10)
      integer index(10)
C$OMP PARALLEL
      call work (index)
C$OMP END PARALLEL
      print*, index(1)

      subroutine work (index)
      common /input/ A(10)
      integer index(*)
      real temp(10)
      integer count
      save count
      …………

A, index and count are shared by all threads.
temp is local to each thread.

38

Page 39: Data Environment: Changing Storage Attributes

• Selectively change the storage attributes of constructs using the following clauses
  – SHARED
  – PRIVATE
  – FIRSTPRIVATE
  – THREADPRIVATE
• The value of a private inside a parallel loop and the global value outside the loop can be exchanged with
  – FIRSTPRIVATE and LASTPRIVATE
• The default status can be modified with:
  – DEFAULT (PRIVATE | SHARED | NONE)

39

Page 40: OpenMP Private Clause

• private(var) creates a local copy of var for each thread.
  – The value is uninitialized
  – The private copy is not storage-associated with the original
  – The original is undefined at the end

      IS = 0
C$OMP PARALLEL DO PRIVATE(IS)
      DO J=1,1000
        IS = IS + J
      END DO
C$OMP END PARALLEL DO
      print *, IS

40

Page 41: OpenMP Private Clause

• private(var) creates a local copy of var for each thread.
  – The value is uninitialized
  – The private copy is not storage-associated with the original
  – The original is undefined at the end

      IS = 0
C$OMP PARALLEL DO PRIVATE(IS)
      DO J=1,1000
        IS = IS + J          ! ✗ IS was not initialized
      END DO
C$OMP END PARALLEL DO
      print *, IS            ! ✗ IS is undefined here

41

Page 42: Firstprivate Clause

• firstprivate is a special case of private.
  – Initializes each private copy with the corresponding value from the master thread.

      IS = 0
C$OMP PARALLEL DO FIRSTPRIVATE(IS)
      DO 20 J=1,1000
        IS = IS + J
20    CONTINUE
C$OMP END PARALLEL DO
      print *, IS

42

Page 43: Firstprivate Clause

• firstprivate is a special case of private.
  – Initializes each private copy with the corresponding value from the master thread.

      IS = 0
C$OMP PARALLEL DO FIRSTPRIVATE(IS)
      DO 20 J=1,1000
        IS = IS + J          ! ✔ Each thread gets its own IS with an initial value of 0
20    CONTINUE
C$OMP END PARALLEL DO
      print *, IS            ! ✗ Regardless of initialization, IS is undefined at this point

43

Page 44: Lastprivate Clause

• Lastprivate passes the value of a private from the last iteration to the variable of the master thread

      IS = 0
C$OMP PARALLEL DO FIRSTPRIVATE(IS)
C$OMP& LASTPRIVATE(IS)
      DO 20 J=1,1000
        IS = IS + J          ! ✔ Each thread gets its own IS with an initial value of 0
20    CONTINUE
C$OMP END PARALLEL DO
      print *, IS            ! IS is defined as its value at the last iteration (i.e. for J=1000)

Is this code meaningful?

44

Page 45: OpenMP Reduction

• Here is the correct way to parallelize this code.

      IS = 0
C$OMP PARALLEL DO REDUCTION(+:IS)
      DO 20 J=1,1000
        IS = IS + J
20    CONTINUE
      print *, IS

Reduction does NOT imply firstprivate; where does the initial 0 come from?

45
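The same pattern in C, as a sketch with illustrative names: each thread gets a private copy initialized to the identity of "+" (0), and the copies are combined into the shared variable at the end of the loop.

long is = 0;
#pragma omp parallel for reduction(+:is)
for (int j = 1; j <= 1000; j++)
    is += j;
printf("%ld\n", is);   /* prints 500500 */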

Page 46: Reduction Operands / Initial Values

• Associative operands can be used with reduction
• Initial values are the ones that make sense mathematically

Operand   Initial value
+         0
*         1
-         0
.AND.     All 1’s
.OR.      0
MAX       1
MIN       0
//        All 0’s

46

Page 47: Exercise: OpenMP Sum.c

• Two versions
  – Parallel for with reduction
  – Parallel version, not using the “omp for” or “reduction” clause

47
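A sketch of what the second version might look like (assuming a hypothetical array x of length n): per-thread partial sums are combined under a critical section instead of a reduction clause.

double sum = 0.0;
#pragma omp parallel
{
    int id = omp_get_thread_num();
    int nthrds = omp_get_num_threads();
    double partial = 0.0;
    /* each thread sums its own index range */
    for (int i = id * n / nthrds; i < (id + 1) * n / nthrds; i++)
        partial += x[i];
    /* combine the per-thread results one at a time */
    #pragma omp critical
    sum += partial;
}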

Page 48: OpenMP Threadprivate

• Makes global data private to a thread, thus crossing parallel region boundaries
  – Fortran: COMMON blocks
  – C: file scope and static variables
• Different from making them PRIVATE
  – With PRIVATE, global variables are masked.
  – THREADPRIVATE preserves global scope within each thread
• Threadprivate variables can be initialized using COPYIN or by using DATA statements.

48

Page 49: Threadprivate/copyin

• You initialize threadprivate data using a copyin clause.

      parameter (N=1000)
      common/buf/A(N)
C$OMP THREADPRIVATE(/buf/)
C     Initialize the A array
      call init_data(N,A)
C$OMP PARALLEL COPYIN(A)
      … Now each thread sees the threadprivate array A initialized
      … to the global value set in the subroutine init_data()
C$OMP END PARALLEL
      ....
C$OMP PARALLEL
      ... Values of threadprivate are persistent across parallel regions
C$OMP END PARALLEL

49
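For comparison, a C sketch of threadprivate with illustrative names: a file-scope counter that keeps a separate, persistent value in each thread across parallel regions.

#include <stdio.h>
#include <omp.h>

static int counter = 0;
#pragma omp threadprivate(counter)

void demo(void) {
    #pragma omp parallel
    counter++;                      /* each thread increments its own copy */
    #pragma omp parallel
    printf("thread %d: counter = %d\n", omp_get_thread_num(), counter);
}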

Page 50: OpenMP Synchronization

• High-level synchronization:
  – critical section
  – atomic
  – barrier
  – ordered
• Low-level synchronization:
  – flush
  – locks (both simple and nested)

50

Page 51: Critical Section

• Only one thread at a time can enter a critical section.

C$OMP PARALLEL DO PRIVATE(B)
C$OMP& SHARED(RES)
      DO 100 I=1,NITERS
        B = DOIT(I)
C$OMP CRITICAL
        CALL CONSUME (B, RES)
C$OMP END CRITICAL
100   CONTINUE
C$OMP END PARALLEL DO

51

Page 52: Atomic

• Atomic is a special case of a critical section that can be used for certain simple statements
• It applies only to the update of a memory location

C$OMP PARALLEL PRIVATE(B)
      B = DOIT(I)
      tmp = big_ugly()
C$OMP ATOMIC
      X = X + tmp
C$OMP END PARALLEL

52

Page 53: OpenMP Tasks

Define a task:
  – C/C++: #pragma omp task
  – Fortran: !$omp task

• A task is generated when a thread encounters a task construct
  – Contains a task region and its data environment
  – Tasks can be nested
• A task region is a region consisting of all code encountered during the execution of a task.
• The data environment consists of all the variables associated with the execution of a given task.
  – It is constructed when the task is generated.

53

Page 54: Task Completion and Synchronization

• Task completion occurs when the task reaches the end of the task region code
• Multiple tasks are joined to complete through the use of task synchronization constructs
  – taskwait
  – barrier construct
• taskwait constructs:
  – #pragma omp taskwait
  – !$omp taskwait

int fib(int n) {
    int x, y;
    if (n < 2) return n;
    else {
        #pragma omp task shared(x)
        x = fib(n-1);
        #pragma omp task shared(y)
        y = fib(n-2);
        #pragma omp taskwait
        return x + y;
    }
}

54
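A hedged note on how such a task-recursive routine is usually launched (a sketch, not shown on the slide; assumes the fib() above plus <stdio.h>): the first call is made by one thread inside a parallel region, so the generated tasks can be picked up by the whole team.

int main(void) {
    int result;
    #pragma omp parallel
    {
        #pragma omp single
        result = fib(30);   /* one thread starts the recursion; all threads execute tasks */
    }
    printf("fib(30) = %d\n", result);
    return 0;
}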

Page 55: Example: A Linked List

........
while (my_pointer) {
    (void) do_independent_work (my_pointer);
    my_pointer = my_pointer->next;
} // End of while loop
........

(From “An Overview of OpenMP”, RvdP/V1, Tutorial IWOMP 2010 – CCS Un. of Tsukuba, June 14, 2010)

55

Page 56: Example: A Linked List With Tasking

my_pointer = listhead;
#pragma omp parallel
{
    #pragma omp single nowait
    {
        while (my_pointer) {
            #pragma omp task firstprivate(my_pointer)
            {
                (void) do_independent_work (my_pointer);
            }
            my_pointer = my_pointer->next;
        }
    } // End of single - no implied barrier (nowait)
} // End of parallel region - implied barrier

The OpenMP task is specified here (and executed in parallel).

56

Page 57: Ordered

• The ordered construct enforces the sequential order for a block.

#pragma omp parallel private (tmp)
#pragma omp for ordered
for (i=0; i<N; i++) {
    tmp = NEAT_STUFF_IN_PARALLEL(i);
    #pragma omp ordered
    res += consum(tmp);
}

57

Page 58: OpenMP Synchronization

• The flush construct denotes a sequence point where a thread tries to create a consistent view of memory.
  – All memory operations (both reads and writes) defined prior to the sequence point must complete.
  – All memory operations (both reads and writes) defined after the sequence point must follow the flush.
  – Variables in registers or write buffers must be updated in memory.
• Arguments to flush specify which variables are flushed. No arguments specifies that all thread-visible variables are flushed.

58

Page 59: A flush Example

• Pair-wise synchronization.

      integer ISYNC(NUM_THREADS)
C$OMP PARALLEL DEFAULT (PRIVATE) SHARED (ISYNC)
      IAM = OMP_GET_THREAD_NUM()
      ISYNC(IAM) = 0
C$OMP BARRIER
      CALL WORK()
      ISYNC(IAM) = 1        ! I’m all done; signal this to other threads
C$OMP FLUSH(ISYNC)          ! Make sure other threads can see my write.
      DO WHILE (ISYNC(NEIGH) .EQ. 0)
C$OMP FLUSH(ISYNC)          ! Make sure the read picks up a good copy from memory.
      END DO
C$OMP END PARALLEL

Note: flush is analogous to a fence in other shared memory APIs.

59

Page 60: OpenMP Lock Routines

• Simple lock routines: a simple lock is available if it is unset.
  – omp_init_lock(), omp_set_lock(), omp_unset_lock(), omp_test_lock(), omp_destroy_lock()
• Nested locks: available if it is unset, or if it is set but owned by the thread executing the nested lock function
  – omp_init_nest_lock(), omp_set_nest_lock(), omp_unset_nest_lock(), omp_test_nest_lock(), omp_destroy_nest_lock()

60

Page 61: OpenMP Locks

• Protect resources with locks.

omp_lock_t lck;
omp_init_lock(&lck);
#pragma omp parallel private (tmp, id)
{
    id = omp_get_thread_num();
    tmp = do_lots_of_work(id);
    omp_set_lock(&lck);        // Wait here for your turn.
    printf("%d %d", id, tmp);
    omp_unset_lock(&lck);      // Release the lock so the next thread gets a turn.
}
omp_destroy_lock(&lck);        // Free up storage when done.

61

Page 62: OpenMP Library Routines

• Modify/check the number of threads
  – omp_set_num_threads(), omp_get_num_threads(), omp_get_thread_num(), omp_get_max_threads()
• Are we in a parallel region?
  – omp_in_parallel()
• How many processors in the system?
  – omp_get_num_procs()

62

Page 63: OpenMP Environment Variables

• Set the default number of threads to use.
  – OMP_NUM_THREADS int_literal
• Control how “omp for schedule(RUNTIME)” loop iterations are scheduled.
  – OMP_SCHEDULE “schedule[, chunk_size]”

63

Page 64: Outline

• OpenMP Introduction
• Parallel Programming with OpenMP
  – Worksharing, tasks, data environment, synchronization
• OpenMP Performance and Best Practices
• Case Studies and Examples
• Reference Materials

64

Page 65: OpenMP Performance

• The relative ease of using OpenMP is a mixed blessing
• We can quickly write a correct OpenMP program, but without the desired level of performance.
• There are certain “best practices” to avoid common performance problems.
• Extra work is needed to program with large thread counts.

65

Page 66: Typical OpenMP Performance Issues

• Overheads of OpenMP constructs and thread management, e.g.
  – dynamic loop schedules have much higher overheads than static schedules
  – Synchronization is expensive; use NOWAIT if possible
  – Large parallel regions help reduce overheads and enable better cache usage and standard optimizations
• Overheads of runtime library routines
  – Some are called frequently
• Load balance
• Cache utilization and false sharing

66

Page 67: Overheads of OpenMP Directives

[Figure: OpenMP overheads measured with the EPCC microbenchmarks on an SGI Altix 3600. Overhead in cycles (up to ~1,400,000) vs. number of threads (1–256) for PARALLEL, FOR, PARALLEL FOR, BARRIER, SINGLE, CRITICAL, LOCK/UNLOCK, ORDERED, ATOMIC, and REDUCTION.]

67

Page 68: OpenMP Best Practices

• Reduce usage of barriers with the nowait clause

#pragma omp parallel
{
    #pragma omp for
    for (i=0; i<n; i++)
        ….
    #pragma omp for nowait
    for (i=0; i<n; i++)
        ….
}

68

Page 69: OpenMP Best Practices

#pragma omp parallel private(i)
{
    #pragma omp for nowait
    for (i=0; i<n; i++)
        a[i] += b[i];
    #pragma omp for nowait
    for (i=0; i<n; i++)
        c[i] += d[i];
    #pragma omp barrier
    #pragma omp for nowait reduction(+:sum)
    for (i=0; i<n; i++)
        sum += a[i] + c[i];
}

69

Page 70: OpenMP Best Practices

• Avoid large ordered constructs
• Avoid large critical regions

#pragma omp parallel shared(a,b) private(c,d)
{
    ….
    #pragma omp critical
    {
        a += 2 * c;
        c = d * d;        // Move this statement out of the critical region
    }
}

70
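A sketch of the revised region after applying that advice: only the update of the shared variable stays protected, and the purely private computation is moved out (its position after the shared update is kept so the old value of c is still used).

#pragma omp parallel shared(a,b) private(c,d)
{
    ….
    #pragma omp critical
    a += 2 * c;           /* only the shared update needs protection */
    c = d * d;            /* private work, no longer inside the critical region */
}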

Page 71: OpenMP Best Practices

• Maximize parallel regions

Multiple parallel regions:
#pragma omp parallel
{
    #pragma omp for
    for (…) { /* Work-sharing loop 1 */ }
}
opt = opt + N;   // sequential
#pragma omp parallel
{
    #pragma omp for
    for (…) { /* Work-sharing loop 2 */ }
    #pragma omp for
    for (…) { /* Work-sharing loop N */ }
}

Single combined parallel region:
#pragma omp parallel
{
    #pragma omp for
    for (…) { /* Work-sharing loop 1 */ }
    #pragma omp single nowait
    opt = opt + N;   // sequential
    #pragma omp for
    for (…) { /* Work-sharing loop 2 */ }
    #pragma omp for
    for (…) { /* Work-sharing loop N */ }
}

71

Page 72: OpenMP Best Practices

• Single parallel region enclosing all work-sharing loops.

Parallel region per innermost loop:
for (i=0; i<n; i++)
    for (j=0; j<n; j++)
        #pragma omp parallel for private(k)
        for (k=0; k<n; k++) { …… }

Single enclosing parallel region:
#pragma omp parallel private(i,j,k)
{
    for (i=0; i<n; i++)
        for (j=0; j<n; j++)
            #pragma omp for
            for (k=0; k<n; k++) { ……. }
}

72

Page 73: OpenMP Best Practices

• Address load imbalances
• Use parallel for dynamic schedules and different chunk sizes

Smith-Waterman Sequence Alignment Algorithm

73

Page 74: OpenMP Best Practices

• Smith-Waterman Algorithm
  – The default schedule is static even → load imbalance

#pragma omp for
for (…)
    for (…)
        for (…)
            for (…) { /* compute alignments */ }
#pragma omp critical
{ . /* compute scores */ }

74

Page 75: OpenMP Best Practices

Smith-Waterman Sequence Alignment Algorithm

[Figure: speedup vs. number of threads (2–128) for sequence lengths 100, 600, and 1000, compared to ideal speedup. Left: "#pragma omp for" (default static schedule). Right: "#pragma omp for schedule(dynamic, 1)".]

With the dynamic schedule, the code reaches 128 threads with 80% efficiency.

75

Page 76: OpenMP Best Practices

• Address load imbalances by selecting the best schedule and chunk size
• Avoid selecting a small chunk size when the work in a chunk is small.

[Figures: overheads of OpenMP for on an SGI Altix 3600, in cycles vs. number of threads (1–256). Static scheduling (default and chunk sizes 2, 8, 32, 128) reaches roughly 80,000 cycles; dynamic scheduling (chunk sizes 1, 4, 16, 64) reaches roughly 14,000,000 cycles.]

76

Page 77: OpenMP Best Practices

• Pipeline processing to overlap I/O and computations

for (i=0; i<N; i++) {
    ReadFromFile(i,…);
    for (j=0; j<ProcessingNum; j++)
        ProcessData(i,j);
    WriteResultsToFile(i);
}

77

Page 78: OpenMP Best Practices

• Pipeline processing
• Pre-fetches I/O
• Threads reading or writing files join the computations

#pragma omp parallel
{
    #pragma omp single
    { ReadFromFile(0,...); }
    for (i=0; i<N; i++) {
        #pragma omp single nowait
        { if (i < N-1) ReadFromFile(i+1,….); }   // the check deals with the last file
        #pragma omp for schedule(dynamic)
        for (j=0; j<ProcessingNum; j++)
            ProcessChunkOfData(i,j);
        // The implicit barrier here is very important: 1) file i is finished, so we
        // can write it out; 2) file i+1 has been read in, so we can process it in
        // the next loop iteration
        #pragma omp single nowait
        { WriteResultsToFile(i); }
    }
}

78

Page 79: OpenMP Best Practices

• single vs. master work-sharing
  – master is more efficient but requires thread 0 to be available
  – single is more efficient if the master thread is not available
  – single has an implied barrier

79

Page 80: Cache Coherence

• Real-world shared memory systems have caches between memory and CPU
• Copies of a single data item can exist in multiple caches
• Modification of a shared data item by one CPU leads to outdated copies in the cache of another CPU

[Figure: memory holds the original data item; CPU0 and CPU1 each hold a copy of the data item in their caches.]

Page 81: OpenMP Best Practices

• False sharing
  – When at least one thread writes to a cache line while others access it
    • Thread 0: = A[1] (read)
    • Thread 1: A[0] = … (write)
• Solution: use array padding

int a[max_threads];
#pragma omp parallel for schedule(static,1)
for(int i=0; i<max_threads; i++)
    a[i] += i;

int a[max_threads][cache_line_size];
#pragma omp parallel for schedule(static,1)
for(int i=0; i<max_threads; i++)
    a[i][0] += i;

False sharing: a store into a shared cache line invalidates the other copies of that line; the system is not able to distinguish between changes within one individual line.

81

Page 82: Exercise: Feel the false sharing with axpy-papi.c

82
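The exercise file axpy-papi.c is not reproduced in the slides; a self-contained sketch of the same effect (illustrative sizes, assuming 64-byte cache lines and at most 64 threads) times adjacent per-thread counters against padded ones:

#include <stdio.h>
#include <omp.h>

#define REPS 100000000L
#define PAD  16               /* 16 ints = 64 bytes: one counter per cache line */

int main(void) {
    static int adjacent[64];        /* counters share cache lines: false sharing */
    static int padded[64][PAD];     /* each counter sits in its own cache line */

    double t0 = omp_get_wtime();
    #pragma omp parallel
    {
        int id = omp_get_thread_num();
        for (long r = 0; r < REPS; r++) adjacent[id]++;
    }
    double t1 = omp_get_wtime();

    #pragma omp parallel
    {
        int id = omp_get_thread_num();
        for (long r = 0; r < REPS; r++) padded[id][0]++;
    }
    double t2 = omp_get_wtime();

    printf("adjacent: %.3f s   padded: %.3f s\n", t1 - t0, t2 - t1);
    printf("check: %d %d\n", adjacent[0], padded[0][0]);  /* use the results */
    return 0;
}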

Page 83: OpenMP Best Practices

• Data placement policy on NUMA architectures
• First Touch Policy
  – The process that first touches a page of memory causes that page to be allocated in the node on which the process is running

[Figure: a generic cc-NUMA architecture.]

83

Page 84: NUMA First-touch Placement/1

int a[100];   // Only reserves the virtual memory address

for (i=0; i<100; i++)
    a[i] = 0;

First touch: all array elements (a[0] … a[99]) are in the memory of the processor executing this thread.

84

Page 85: NUMA First-touch Placement/2

#pragma omp parallel for num_threads(2)
for (i=0; i<100; i++)
    a[i] = 0;

First touch: both memories each have “their half” of the array (a[0] … a[49] in one node, a[50] … a[99] in the other).

85

Page 86: OpenMP Best Practices

• First-touch in practice
  – Initialize data consistently with the computations

#pragma omp parallel for
for (i=0; i<N; i++) {
    a[i] = 0.0; b[i] = 0.0; c[i] = 0.0;
}
readfile(a,b,c);
#pragma omp parallel for
for (i=0; i<N; i++) {
    a[i] = b[i] + c[i];
}

86

Page 87: OpenMP Best Practices

• Privatize variables as much as possible
  – Private variables are stored on the thread's local stack
  – Private data is close in the cache

Shared array indexed by thread:
double a[MaxThreads][N][N];
#pragma omp parallel for
for (i=0; i<MaxThreads; i++) {
    for (int j…)
        for (int k…)
            a[i][j][k] = …
}

Private array:
double a[N][N];
#pragma omp parallel private(a)
{
    for (int j…)
        for (int k…)
            a[j][k] = …
}

87

Page 88: OpenMP Best Practices

• CFD application pseudo-code
  – Shared arrays initialized incorrectly (first-touch policy)
  – Delays in remote memory accesses are the probable cause, through saturation of the interconnect

procedure diff_coeff() {
    array allocation by master thread
    initialization of shared arrays
    PARALLEL REGION {
        loop lower_bn[id], upper_bn[id]
        computation on shared arrays
        …..
    }
}

88

Page 89: OpenMP Best Practices

• Array privatization
  – Improved the performance of the whole program by 30%
  – Speedup of 10 for the procedure, which now takes only 5% of total time
• Processor stalls are reduced significantly

[Figure: stall cycle breakdown for the non-privatized (NP) and privatized (P) versions of diff_coeff, plus their difference (NP-P) — D-cache stalls, branch misprediction, instruction miss stalls, FLP units, front-end flushes — cycle counts on the order of 10^10.]

89

Page 90: OpenMP Best Practices

• Avoid thread migration
  – It affects data locality
• Bind threads to cores.
• Linux:
  – numactl --cpubind=0 foobar
  – taskset -c 0,1 foobar
• SGI Altix:
  – dplace -x2 foobar

90

Page 91: OpenMP Sources of Errors

• Incorrect use of synchronization constructs
  – Less likely if users stick to directives
  – Erroneous use of NOWAIT
• Race conditions (true sharing)
  – Can be very hard to find
• Wrong “spelling” of sentinel
• Use tools to check for data races.

91

Page 92: Outline

• OpenMP Introduction
• Parallel Programming with OpenMP
  – Worksharing, tasks, data environment, synchronization
• OpenMP Performance and Best Practices
• Hybrid MPI/OpenMP
• Case Studies and Examples
• Reference Materials

92

Page 93: Matrix-Vector Multiplication

[Figures: the sequential source and the OpenMP source of a matrix-vector multiplication, with the loop over rows i parallelized. From “Getting OpenMP Up To Speed”, RvdP/V1, Tutorial IWOMP 2010 – CCS Un. of Tsukuba, June 14, 2010.]

93

Page 94: Performance – 2-Socket Nehalem

[Figure: matrix-vector multiplication performance on a 2-socket Nehalem system.]

94

2-socketNehalem

95

Getting OpenMP Up To Speed

RvdP/V1 Tutorial IWOMP 2010 – CCS Un. of Tsukuba, June 14, 2010

A Two Socket Nehalem System

Page 96: Lecture 04-06: Programming with OpenMP - GitHub Pages · Lecture 04-06: Programming with OpenMP Concurrent and Mul

Dataini<aliza<on

96

Getting OpenMP Up To Speed

RvdP/V1 Tutorial IWOMP 2010 – CCS Un. of Tsukuba, June 14, 2010

Data Initialization

= *

j

i

IniAalizaAonwillcausetheallocaAonofmemoryaccordingtothefirsttouchpolicy

Page 97: Lecture 04-06: Programming with OpenMP - GitHub Pages · Lecture 04-06: Programming with OpenMP Concurrent and Mul

ExploitFirstTouch

97

Getting OpenMP Up To Speed

RvdP/V1 Tutorial IWOMP 2010 – CCS Un. of Tsukuba, June 14, 2010

Exploit First Touch

Page 98: Lecture 04-06: Programming with OpenMP - GitHub Pages · Lecture 04-06: Programming with OpenMP Concurrent and Mul

A3Dmatrixupdate

Getting OpenMP Up To Speed

RvdP/V1 Tutorial IWOMP 2010 – CCS Un. of Tsukuba, June 14, 2010

Observation No data dependency on 'I' Therefore we can split the 3D

matrix in larger blocks and process these in parallel

JI

K

98

Page 99: The Idea

• We need to distribute the M iterations over the number of processors
• We do this by controlling the start (IS) and end (IE) value of the inner loop
• Each thread will calculate these values for its portion of the work

[Figure: the K-J-I iteration space split along I into per-thread [is, ie) ranges.]

99

Page 100: A 3D Matrix Update

• The loops are correctly nested for serial performance
• Due to a data dependency on J and K, only the inner loop can be parallelized
• This will cause the barrier to be executed (N-1)^2 times

[Figure: data dependency graph — element (j,k) depends on (j-1,k) and (j,k-1).]

100

Page 101: The Performance

• Dimensions: M=7,500, N=20; footprint: ~24 MByte
• The inner loop over I has been parallelized
• Scaling is very poor (as is to be expected)

[Figure: performance (Mflop/s) vs. number of threads.]

101

Page 102: Performance Analyzer Data

• Using 10 threads: some routines scale somewhat, others do not scale at all
• Using 20 threads
• Question: Why is __mt_WaitForWork so high in the profile?

[Figure: profiler output for 10 and 20 threads.]

102

Page 103: False Sharing at Work

• This is false sharing at work!
• With P=1 there is no sharing
• False sharing increases as we increase the number of threads

[Figure: cache-line sharing for P = 1, 2, 4, 8 threads.]

103

Page 104: Performance Comparison

• For a higher value of M, the program scales better

[Figure: performance (Mflop/s) vs. number of threads for M = 7,500 and M = 75,000.]

104

Page 105: The First Implementation

[Figure: source of the first (manually chunked) implementation.]

105

Page 106: Another Idea: Use OpenMP!

[Figure: the OpenMP version of the code.]

106

Page 107: How This Works on 2 Threads

• Alternating parallel regions and work-sharing splits the operation in a way that is similar to our manual implementation

[Figure: timeline for 2 threads — parallel region, work sharing, parallel region, work sharing, … etc.]

107

Page 108: Performance

• We have set M=7,500 and N=20
• This problem size does not scale at all when we explicitly parallelize the inner loop over 'I'
• We have tested 4 versions of this program:
  – Inner Loop Over 'I' – our first OpenMP version
  – AutoPar – the automatically parallelized version of 'kernel'
  – OMP_Chunks – the manually parallelized version with our explicit calculation of the chunks
  – OMP_DO – the version with the OpenMP parallel region and work-sharing DO

108

Page 109: Performance

• The performance for M=7,500; dimensions: M=7,500, N=20; footprint: ~24 MByte
• The auto-parallelizing compiler does really well!

[Figure: performance (Mflop/s) vs. number of threads for the Inner Loop, OMP DO, and OMP Chunks versions.]

109

Page 110: Reference Material on OpenMP

• OpenMP Homepage, www.openmp.org:
  – The primary source of information about OpenMP and its development.
• OpenMP User’s Group (cOMPunity) Homepage, www.compunity.org
• Books:
  – Using OpenMP, Barbara Chapman, Gabriele Jost, Ruud Van Der Pas, Cambridge, MA: The MIT Press, 2007, ISBN: 978-0-262-53302-7
  – Parallel Programming in OpenMP, Rohit Chandra, San Francisco, Calif.: Morgan Kaufmann; London: Harcourt, 2000, ISBN: 1558606718

110

Page 111: Standard OpenMP Implementation

• Directives are implemented via code modification and insertion of runtime library calls
  – The basic step is outlining of the code in the parallel region
• The runtime library is responsible for managing threads
  – Scheduling loops
  – Scheduling tasks
  – Implementing synchronization
• The implementation effort is reasonable
• Each compiler has custom runtime support. The quality of the runtime system has a major impact on performance.

OpenMP Code Translation

int main(void) {
    int a, b, c;
    #pragma omp parallel private(c)
    do_sth(a, b, c);
    return 0;
}

_INT32 main() {
    int a, b, c;
    /* microtask */
    void __ompregion_main1() {
        _INT32 __mplocal_c;
        /* shared variables are kept intact; accesses to the private variable are substituted */
        do_sth(a, b, __mplocal_c);
    }
    …
    /* OpenMP runtime calls */
    __ompc_fork(&__ompregion_main1);
    …
}