GPU Hackathon 2017- OpenACC

Programming GPUs with OpenACC

Saber Feki, Computational Scientist Lead

Supercomputing Core Laboratory, KAUST, saber.feki@kaust.edu.sa

GPU architecture

CPU-GPU memory model

PCIe interconnect, 16x: 8 GB/s (gen2) and 15.75 GB/s (gen3), a very thin pipe!
Kepler K40: 2,880 CUDA cores, 1.48 Tflop/s
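
To see why the slide calls PCIe a thin pipe, the quoted peak bandwidths can be turned into transfer times. The helper below is only a back-of-the-envelope sketch: `transfer_seconds` is a hypothetical name, not part of any library, and it assumes the peak rate is actually sustained, which real transfers rarely achieve.

```c
#include <assert.h>

/* Lower bound, in seconds, on the time needed to move `bytes` bytes over a
   link whose peak bandwidth is `gb_per_s` gigabytes per second
   (1 GB = 1e9 bytes). Hypothetical helper, for illustration only. */
double transfer_seconds(double bytes, double gb_per_s)
{
    assert(gb_per_s > 0.0);
    return bytes / (gb_per_s * 1e9);
}
```

At the gen2 rate of 8 GB/s, `transfer_seconds(1e9, 8.0)` gives 0.125 s for a 1 GB array in one direction, so data that crosses the bus on every iteration can easily cost more than the computation it feeds.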

GPU programming

OpenACC, the standard

• By NVIDIA, Cray, PGI and CAPS
• The standard was announced in November 2011 at the SC11 conference
• http://www.openacc-standard.org
• OpenACC 2.0 released in summer 2013
• Now, 20+ partners from academia and industry

OpenACC advantages

• Easy: Directives are the easy path to accelerate compute-intensive applications

• Open: OpenACC is an open GPU directives standard, making GPU programming straightforward and portable across parallel and multi-core processors

• Powerful: GPU directives allow complete access to the massive parallel power of a GPU

PGI and CAPS compilers study (I)

S. Feki, A. Al-Jarro, H. Bağcı. Porting an Explicit Time Domain Volume Integral Equation Solver on GPUs with OpenACC, IEEE Antennas and Propagation Magazine, July 2014

CAPS:

#pragma acc kernels
{
for (l = 0; l < nt; ++l) { // time loop
    #pragma acc loop independent collapse(3)
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            for (int k = 0; k < n; ++k) {
                B[i][j][k] = B[i][j][k] + ....
            }
        }
    }
    #pragma acc loop independent collapse(3)
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            for (int k = 0; k < n; ++k) {
                B[i][j][k] = B[i][j][k] + ....
            }
        }
    }
} // end time loop
}

PGI:

#pragma acc data
for (l = 0; l < nt; ++l) { // time loop
    #pragma acc kernels
    #pragma acc loop independent gang
    for (int i = 0; i < n; ++i) {
        #pragma acc loop independent gang,vector
        for (int j = 0; j < n; ++j) {
            #pragma acc loop independent gang,vector
            for (int k = 0; k < n; ++k) {
                B[i][j][k] = B[i][j][k] + ....
            }
        }
    }
    #pragma acc kernels
    #pragma acc loop independent gang
    for (int i = 0; i < n; ++i) {
        #pragma acc loop independent gang,vector
        for (int j = 0; j < n; ++j) {
            #pragma acc loop independent gang,vector
            for (int k = 0; k < n; ++k) {
                B[i][j][k] = B[i][j][k] + ....
            }
        }
    }
} // end time loop

PGI and CAPS compilers study (II)

[Figure: speedup versus number of degrees of freedom (x1000), from 6 to 176, for the CAPS and PGI compilers]

S. Feki, A. Al-Jarro, H. Bağcı. Porting an Explicit Time Domain Volume Integral Equation Solver on GPUs with OpenACC, IEEE Antennas and Propagation Magazine, July 2014

Directive syntax

• Fortran
  !$acc directive [clause [[,] clause] …]
  … often paired with a matching end directive
  !$acc end directive
• C
  #pragma acc directive [clause [[,] clause] …]
  Often followed by a structured code block

kernels: Your first OpenACC Directive

• Each loop is executed as a separate kernel (a parallel function that runs on the GPU)

!$acc kernels
do i=1,n
  a(i) = 0.0
  b(i) = 1.0
  c(i) = 2.0
end do
do i=1,n
  a(i) = b(i) + c(i)
end do
!$acc end kernels
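
For readers working in C, the same two-loop kernels region might look like the sketch below. The function name `init_and_add` and the `copyout` shapes are assumptions added for illustration; the slide itself only shows the Fortran form.

```c
#include <assert.h>

/* Two loops inside one kernels region: the compiler may turn each loop
   into its own GPU kernel. Without an OpenACC compiler the pragma is
   ignored and the loops simply run on the host. */
void init_and_add(int n, float *restrict a, float *restrict b, float *restrict c)
{
    assert(n >= 0);
    /* All three arrays are fully written on the device, so copyout suffices. */
    #pragma acc kernels copyout(a[0:n], b[0:n], c[0:n])
    {
        for (int i = 0; i < n; ++i) {
            a[i] = 0.0f;
            b[i] = 1.0f;
            c[i] = 2.0f;
        }
        for (int i = 0; i < n; ++i) {
            a[i] = b[i] + c[i];
        }
    }
}
```

The `restrict` qualifiers tell the compiler the arrays do not overlap, which is often what it needs to prove the loops are safe to parallelize.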

Compile and run

• C: pgcc -acc [-Minfo=accel] -o saxpy_acc saxpy.c
• Fortran: pgf90 -acc [-Minfo=accel] -o saxpy_acc saxpy.f90
• Compiler output:

[sfeki@c4hdn saxpy]$ pgcc -acc -ta=nvidia -Minfo=accel -o saxpy saxpy.c
saxpy:
      5, Generating present_or_copyin(x[0:n])
         Generating present_or_copy(y[0:n])
         Generating compute capability 1.0 binary
         Generating compute capability 2.0 binary
      6, Loop is parallelizable
         Accelerator kernel generated
          6, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
             CC 1.0 : 8 registers; 48 shared, 0 constant, 0 local memory bytes
             CC 2.0 : 12 registers; 0 shared, 64 constant, 0 local memory bytes

SAXPY example, revisited
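
The code on this slide did not survive extraction. A minimal SAXPY sketch, written to be consistent with the PGI compiler output on the previous slide (x copied in, y copied in and out), could look like this; the exact clauses the original slide used are an assumption.

```c
#include <assert.h>

/* y = a*x + y. The data clauses mirror what the compiler reported:
   copyin for x (read only) and copy for y (read and written). */
void saxpy(int n, float a, const float *restrict x, float *restrict y)
{
    assert(n >= 0);
    #pragma acc kernels copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```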

Jacobi Iteration: C code

Jacobi Iteration: OpenACC code

PGI Accelerator Compiler output

Performance

CPU: Intel Xeon X5680, 6 cores @ 3.33 GHz

GPU: NVIDIA Tesla M2070

What went wrong?

Excessive data transfer

Another way of detecting it: NVIDIA Profiler

• Use nvprof for profiling the GPU application:

• Use the NVVP GUI (NVIDIA Visual Profiler):

Data construct

• Fortran
  !$acc data [clause …]
    structured block
  !$acc end data
• C
  #pragma acc data [clause …]
  { structured block }
• Manage data movement. Data regions may be nested
• General clauses
  if(condition)
  async(expression)

Data clauses

• copy(list): Allocates memory on the GPU, copies data from host to GPU when entering the region, and copies data back to the host when exiting the region.

• copyin(list): Allocates memory on the GPU and copies data from host to GPU when entering the region.

• copyout(list): Allocates memory on the GPU and copies data to the host when exiting the region.

• create(list): Allocates memory on the GPU but does not copy.
• present(list): Data is already present on the GPU from another containing data region.
• and present_or_copy[in|out], present_or_create, deviceptr.
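
The clauses above can be combined on a single data region so that arrays stay on the GPU across several kernels. The sketch below is illustrative only: `scale_then_shift` and the temporary `tmp` are hypothetical names, chosen to show one array per clause (`copyin` for input, `create` for device-only scratch, `copyout` for output).

```c
#include <assert.h>
#include <stdlib.h>

/* out[i] = 2*in[i] + 1, computed in two kernels that share a scratch array.
   Returns 0 on success, -1 if the host allocation fails. */
int scale_then_shift(int n, const float *restrict in, float *restrict out)
{
    assert(n > 0);
    float *tmp = malloc((size_t)n * sizeof *tmp);
    if (tmp == NULL)
        return -1;

    /* in: host to device on entry; tmp: device only, never copied;
       out: device to host on exit. */
    #pragma acc data copyin(in[0:n]) create(tmp[0:n]) copyout(out[0:n])
    {
        #pragma acc kernels loop independent present(in[0:n], tmp[0:n])
        for (int i = 0; i < n; ++i)
            tmp[i] = 2.0f * in[i];

        #pragma acc kernels loop independent present(tmp[0:n], out[0:n])
        for (int i = 0; i < n; ++i)
            out[i] = tmp[i] + 1.0f;
    }

    free(tmp);
    return 0;
}
```

Because `tmp` is only ever touched on the device, `create` avoids two transfers that `copy` would otherwise perform.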

Array shaping

• The compiler sometimes cannot determine the size of arrays
• Must specify explicitly using data clauses and array "shape"
• C
  #pragma acc data copyin(a[0:size]), copyout(b[s/4:3*s/4])
• Fortran
  !$acc data copyin(a(1:size)), copyout(b(s/4:3*s/4))
• Note: data clauses can be used on data, kernels or parallel
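
One detail worth keeping in mind when shaping arrays: in C the shape is written `[start:length]`, whereas in Fortran it is `(lower:upper)`. The sketch below shapes only the tail of an output array in C; `scale_tail`, `start` and `count` are hypothetical names introduced for the example.

```c
#include <assert.h>

/* Writes b[i] = 2*a[i] for i in [s/4, s) and leaves the rest of b untouched.
   Only the part of b that is actually written is shaped for copyout. */
void scale_tail(int s, const float *restrict a, float *restrict b)
{
    assert(s > 0);
    int start = s / 4;       /* first element written */
    int count = s - start;   /* C shapes take a length, not an end index */

    #pragma acc data copyin(a[0:s]) copyout(b[start:count])
    {
        #pragma acc kernels loop independent present(a[0:s], b[start:count])
        for (int i = start; i < s; ++i)
            b[i] = 2.0f * a[i];
    }
}
```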

Jacobi Iteration: OpenACC C Code, Revisited
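
The revisited code on this slide was an image and is not recoverable here. As a hedged reconstruction of the idea, the sketch below wraps the whole convergence loop in one data region so the grids are transferred once rather than on every iteration. The function signature, the variable names and the explicit `reduction(max:error)` clause are assumptions, not the slide's exact code.

```c
#include <assert.h>

/* Jacobi relaxation on an n-by-m grid. Returns the number of sweeps done.
   A holds the solution; Anew is scratch that lives only on the device. */
int jacobi(int n, int m, double A[n][m], double Anew[n][m],
           double tol, int iter_max)
{
    assert(n >= 3 && m >= 3);
    double error = tol + 1.0;   /* forces at least one sweep */
    int iter = 0;

    /* One transfer of A in, one out; without this region every kernels
       construct below would move both grids on each iteration. */
    #pragma acc data copy(A[0:n][0:m]) create(Anew[0:n][0:m])
    while (error > tol && iter < iter_max) {
        error = 0.0;

        #pragma acc kernels loop independent collapse(2) reduction(max:error)
        for (int j = 1; j < n - 1; ++j) {
            for (int i = 1; i < m - 1; ++i) {
                Anew[j][i] = 0.25 * (A[j][i + 1] + A[j][i - 1]
                                   + A[j - 1][i] + A[j + 1][i]);
                double diff = Anew[j][i] - A[j][i];
                if (diff < 0.0)
                    diff = -diff;
                error = diff > error ? diff : error;
            }
        }

        #pragma acc kernels loop independent collapse(2)
        for (int j = 1; j < n - 1; ++j) {
            for (int i = 1; i < m - 1; ++i) {
                A[j][i] = Anew[j][i];
            }
        }

        ++iter;
    }
    return iter;
}
```

Only the scalar `error` needs to return to the host each sweep, which is what lets the loop condition be evaluated without moving the grids.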

Performance numbers

New NVIDIA profiles

CUDA Kernels

• Threads are grouped into blocks
• Blocks are grouped into a grid
• A kernel is executed as a grid of blocks of threads

Thread blocks

• Thread blocks allow cooperation
  – Cooperatively load/store blocks of memory that they all use
  – Share results with each other or cooperate to produce a single result
  – Synchronize with each other
• Thread blocks allow scalability
  – Blocks can execute in any order, concurrently or sequentially
  – This independence between blocks gives scalability:
    • A kernel scales across any number of SMs

Mapping OpenACC to CUDA I

• The OpenACC execution model has three levels: gang, worker, and vector

• Allows mapping to an architecture that is a collection of Processing Elements (PEs)
  • One or more PEs per node
  • Each PE is multi-threaded
  • Each thread can execute vector instructions

• Tile pragma in OpenACC 2.0

Mapping OpenACC to CUDA II

• For GPUs, the mapping is implementation-dependent. Some possibilities:
  – gang == block, worker == warp, and vector == threads of a warp
  – omit "worker" and just have gang == block, vector == threads of a block

• Depends on what the compiler thinks is the best mapping for the problem

• ... But explicitly specifying that a given loop should map to gangs, workers, and/or vectors is optional anyway
  – Further specifying the number of gangs/workers/vectors is also optional
  – So why do it? To tune the code to fit a particular target architecture in a straightforward and easily re-tuned way.

OpenACC loop directive and clauses

#pragma acc kernels loop
for (int i = 0; i < n; ++i)
    y[i] += a*x[i];
Uses whatever mapping to threads and blocks the compiler chooses. Perhaps 16 blocks, 256 threads each.

#pragma acc kernels loop gang(100), vector(128)
for (int i = 0; i < n; ++i)
    y[i] += a*x[i];
100 thread blocks, each with 128 threads, each thread executes one iteration of the loop, using kernels.

#pragma acc parallel num_gangs(100), vector_length(128)
{
    #pragma acc loop gang, vector
    for (int i = 0; i < n; ++i)
        y[i] += a*x[i];
}
100 thread blocks, each with 128 threads, each thread executes one iteration of the loop, using parallel.

Mapping OpenACC to CUDA threads and blocks

• Nested loops generate multi-dimensional blocks and grids:

#pragma acc kernels loop gang(100), vector(16)
for (…)
    #pragma acc loop gang(200), vector(32)
    for (…)

Outer loop: 100 blocks tall (row/Y direction), each block 16 threads tall
Inner loop: 200 blocks wide (column/X direction), each block 32 threads wide

Other clauses for loop directive

#pragma acc loop [clauses]

• independent: for independent loops
• seq: for sequential execution of the loop
• reduction: for reduction operations such as min, max, etc.
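
As an illustration of the reduction clause listed above, here is a small dot-product sketch. The function name `dot` and the choice of a sum reduction are assumptions made for the example; min and max reductions follow the same pattern with `reduction(min:var)` or `reduction(max:var)`.

```c
#include <assert.h>

/* Dot product of x and y. Every iteration adds into the same scalar, so
   the loop is only safe to parallelize once `sum` is declared a reduction. */
float dot(int n, const float *restrict x, const float *restrict y)
{
    assert(n >= 0);
    float sum = 0.0f;
    #pragma acc kernels loop independent reduction(+:sum) copyin(x[0:n], y[0:n])
    for (int i = 0; i < n; ++i)
        sum += x[i] * y[i];
    return sum;
}
```

On a GPU the partial sums are combined in an unspecified order, so floating-point results can differ slightly from a sequential run.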

Jacobi example … again

With kernels and data directives

Jacobi example … again

After adding the loop directive with gang and vector clauses

An opportunity for Auto-tuning

• Gang and vector values can be auto-tuned for the application, targeting the available accelerator device

[Figure: performance speedup versus problem size]

S. Siddiqui, F. Al-Zayer, S. Feki. Historic Learning Approach for Auto-tuning OpenACC Accelerated Scientific Applications, iWAPT 2014, Eugene, Oregon, USA

An opportunity for Auto-tuning

Input code annotated with OpenACC

#pragma acc kernels
#pragma acc loop independent
for (x = 4; x < nx-4; x++) {
    #pragma acc loop independent
    for (y = 4; y < ny-4; y++) {
        #pragma acc loop independent
        for (z = 4; z < nz-4; z++) {
            U[x][y][z] = c1*V[x][y][z] + ....
        }
    }
}

Accelerator Specification

Automatic code generator

Variant 1:

#pragma acc kernels
#pragma acc loop independent gang(a),vector(b)
for (x = 4; x < nx-4; x++) {
    #pragma acc loop independent gang(c)
    for (y = 4; y < ny-4; y++) {
        #pragma acc loop independent vector(d)
        for (z = 4; z < nz-4; z++) {
            U[x][y][z] = c1*V[x][y][z] + ....
        }
    }
}

Variant 2:

#pragma acc kernels
#pragma acc loop independent gang(a)
for (x = 4; x < nx-4; x++) {
    #pragma acc loop independent gang(b),vector(c)
    for (y = 4; y < ny-4; y++) {
        #pragma acc loop independent vector(d)
        for (z = 4; z < nz-4; z++) {
            U[x][y][z] = c1*V[x][y][z] + ....
        }
    }
}

Variant 3:

#pragma acc kernels
#pragma acc loop independent gang(a),vector(b)
for (x = 4; x < nx-4; x++) {
    #pragma acc loop independent vector(c)
    for (y = 4; y < ny-4; y++) {
        #pragma acc loop independent gang(d),vector(e)
        for (z = 4; z < nz-4; z++) {
            U[x][y][z] = c1*V[x][y][z] + ....
        }
    }
}

Variant 4:

#pragma acc kernels
#pragma acc loop independent
for (x = 4; x < nx-4; x++) {
    #pragma acc loop independent gang(a),vector(b)
    for (y = 4; y < ny-4; y++) {
        #pragma acc loop independent gang(c),vector(d)
        for (z = 4; z < nz-4; z++) {
            U[x][y][z] = c1*V[x][y][z] + ....
        }
    }
}

Runtime evaluation and selection

Database

Jacobi example … again

• Which other optimizations can we still apply?
  – Restructuring the code will enhance both the CPU and GPU versions
  – Hint: reduce memory operations

OpenACC Runtime Library

• In C: #include "openacc.h"
• In Fortran: #include 'openacc_lib.h' or use openacc
• Contains:
  – Prototypes of all routines
  – Definitions of the data types used in these routines, including an enumeration type describing types of accelerators

OpenACC Runtime Library Definitions

• openacc_version with a value yyyymm (year and month of the OpenACC version)

• acc_device_t: type of accelerator device
  – acc_device_none
  – acc_device_default
  – acc_device_host
  – acc_device_not_host

OpenACC Runtime Library Routines I

• acc_get_num_devices: returns the number of devices of the given type attached to the host
• acc_set_device_type: tells which type of device to use when executing an accelerator parallel or kernels region
• acc_get_device_type: tells which type of device will be used for the next accelerated region
• acc_set_device_num: specifies which device to use
• acc_get_device_num: returns the device number of the specified device type that will be used to run the next accelerator parallel or kernels region

OpenACC Runtime Library Routines II

• acc_init: initializes the runtime; can be used to isolate the initialization cost from the computation cost
• acc_shutdown: shuts down the connection to the device and frees any allocated resources
• acc_malloc: allocates memory on the accelerator device
• acc_free: frees memory on the accelerator device
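
A short sketch of how these routines are typically combined is given below. The wrapper names `count_accelerators` and `select_accelerator` are hypothetical, and the `_OPENACC` guard is an added convenience so that the same file still builds with a compiler that has no OpenACC support.

```c
#include <assert.h>
#ifdef _OPENACC
#include <openacc.h>
#endif

/* Number of non-host accelerator devices, or 0 when built without OpenACC. */
int count_accelerators(void)
{
#ifdef _OPENACC
    return acc_get_num_devices(acc_device_not_host);
#else
    return 0;
#endif
}

/* Initializes the runtime up front (so its cost is not charged to the first
   compute region) and selects device `wanted` modulo the device count.
   Returns the selected device number, or -1 if no accelerator is available. */
int select_accelerator(int wanted)
{
    assert(wanted >= 0);
    int ndev = count_accelerators();
    if (ndev == 0)
        return -1;
    int dev = wanted % ndev;
#ifdef _OPENACC
    acc_init(acc_device_not_host);
    acc_set_device_num(dev, acc_device_not_host);
#endif
    return dev;
}
```

Calling `select_accelerator` before starting a timer is one way to keep device initialization out of the measured computation time.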

OpenACC Runtime Library Routines: use case

• Porting an MPI code to multiple GPUs.
• Example: running on 8 nodes with 4 GPUs each, i.e. 32 MPI processes
• acc_init()
• acc_set_device_num(rank % 4)
• Each node runs 4 MPI processes, each of them offloading compute kernels to a separate GPU
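
The rank-to-GPU mapping above can be sketched as follows. The slide abbreviates the calls; in the OpenACC API both `acc_init` and `acc_set_device_num` also take a device type. The helper names `device_for_rank` and `bind_rank_to_gpu` are hypothetical, and the rank is passed in as a plain integer (in a real code it would come from `MPI_Comm_rank`). The sketch assumes consecutive ranks are placed on the same node; with round-robin placement across nodes, `rank % 4` would send several processes on one node to the same GPU.

```c
#include <assert.h>
#ifdef _OPENACC
#include <openacc.h>
#endif

/* Device index for an MPI rank, assuming ranks fill one node before the next. */
int device_for_rank(int rank, int gpus_per_node)
{
    assert(rank >= 0 && gpus_per_node > 0);
    return rank % gpus_per_node;
}

/* Binds the calling process to its GPU. A no-op without OpenACC. */
void bind_rank_to_gpu(int rank, int gpus_per_node)
{
    int dev = device_for_rank(rank, gpus_per_node);
#ifdef _OPENACC
    acc_init(acc_device_not_host);
    acc_set_device_num(dev, acc_device_not_host);
#else
    (void)dev;  /* nothing to bind when compiled without OpenACC */
#endif
}
```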

S. Feki, A. Al-Jarro, H. Bağcı. Multiple GPUs Electromagnetics Simulations using MPI and OpenACC, Poster in GPU Technology Conference, San Jose, California, USA, March 24-27, 2014

OpenACC and CUDA libraries

GPU accelerated libraries

Sharing data with libraries

• CUDA libraries and OpenACC both operate on device arrays
• OpenACC provides mechanisms for interoperability with library calls
  – deviceptr data clause
  – host_data construct

• Note: the same mechanisms are useful for interoperability with custom CUDA C/C++/Fortran code

deviceptr Data Clause

deviceptr(list): Declares that the pointers in list are device pointers, so the data need not be allocated or moved between the host and device for these pointers.
Example:
• C
  #pragma acc data deviceptr(d_input)
• Fortran
  !$acc data deviceptr(d_input)
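
A small sketch of the deviceptr clause in use follows. It allocates device memory with `acc_malloc` and hands the raw pointer to OpenACC loops. The function `fill_and_sum` is a hypothetical example, and the `malloc` fallback exists only so the file also builds and runs without an OpenACC compiler.

```c
#include <assert.h>
#include <stdlib.h>
#ifdef _OPENACC
#include <openacc.h>
#endif

/* Fills a device-resident array with `value` and returns the sum of its
   elements, or -1.0f if the allocation fails. */
float fill_and_sum(int n, float value)
{
    assert(n > 0);
    size_t bytes = (size_t)n * sizeof(float);
#ifdef _OPENACC
    float *d_input = (float *)acc_malloc(bytes);   /* device memory */
#else
    float *d_input = (float *)malloc(bytes);       /* host stand-in */
#endif
    if (d_input == NULL)
        return -1.0f;

    float sum = 0.0f;
    /* d_input is already a device address: no allocation, no transfers. */
    #pragma acc data deviceptr(d_input)
    {
        #pragma acc kernels loop independent
        for (int i = 0; i < n; ++i)
            d_input[i] = value;

        #pragma acc kernels loop independent reduction(+:sum)
        for (int i = 0; i < n; ++i)
            sum += d_input[i];
    }

#ifdef _OPENACC
    acc_free(d_input);
#else
    free(d_input);
#endif
    return sum;
}
```

The same pattern applies when the pointer comes from `cudaMalloc` or from a CUDA library instead of `acc_malloc`.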

host_data Construct

• Makes the address of device data available on the host.
• use_device(list): Tells the compiler to use the device address for any variable in list. Variables in the list must be present in device memory due to data regions that contain this construct

• Example
  • C: #pragma acc host_data use_device(d_input)
  • Fortran: !$acc host_data use_device(d_input)
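
To make the host_data construct concrete, the sketch below passes OpenACC-managed arrays to cuBLAS. It is an assumption-laden illustration rather than code from the slides: `USE_CUBLAS` is a hypothetical build flag, `saxpy_via_library` is a hypothetical wrapper, and error checking of the cuBLAS and CUDA calls is omitted. Without the flag, a plain host loop stands in so the example still runs.

```c
#include <assert.h>
#ifdef USE_CUBLAS            /* hypothetical flag: build with OpenACC + CUDA */
#include <cublas_v2.h>
#include <cuda_runtime.h>
#endif

/* y = a*x + y, computed by a library routine on device data when available. */
void saxpy_via_library(int n, float a, float *x, float *y)
{
    assert(n >= 0);
#ifdef USE_CUBLAS
    cublasHandle_t handle;
    cublasCreate(&handle);
    #pragma acc data copyin(x[0:n]) copy(y[0:n])
    {
        /* Inside host_data, x and y evaluate to their device addresses. */
        #pragma acc host_data use_device(x, y)
        {
            cublasSaxpy(handle, n, &a, x, 1, y, 1);
        }
        /* Make sure the library call has finished before y is copied back. */
        cudaDeviceSynchronize();
    }
    cublasDestroy(handle);
#else
    /* Host stand-in used when the library path is not compiled in. */
    for (int i = 0; i < n; ++i)
        y[i] += a * x[i];
#endif
}
```

The data region owns the transfers, and host_data only exposes the addresses, so the library never sees a host pointer.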

Summary on device pointers

• Use the deviceptr data clause to pass pre-allocated device data to OpenACC regions and loops

• Use host_data to get the device address for pointers inside acc data regions

• The same techniques shown here can be used to share device data between OpenACC loops and
  – Your custom CUDA C/C++/Fortran/etc. device code
  – Any CUDA library that uses CUDA device pointers

Thanks!
