Laplace's equation – MPI + CUDA

  • Upload
    others

  • View
    13

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Laplace's equation – MPI + CUDA

• Let's first assume we don't use datatypes (instead we will manually pack/unpack)
  – Data distribution
  – Create datatypes
  – Exchange data with neighbors (north, south, east, west)
  – Do local computation

[Figure: 2-D domain decomposition; each local tile exchanges its north, south, east, and west borders with its neighbors]

Page 2: Laplace's equation – MPI + CUDA

Programming CUDA

• Each little box is a light computational thread
• Going back to the data distribution methods (k-cyclic), the CUDA work distribution can be assimilated to a <X-cyclic, Y-cyclic> distribution
  – The goal is to evenly distribute the computational work over all threads available on the GPU
  – Warp: a group of 32 parallel threads that execute exactly the same thing (but are named distinguishably)
  – A warp executes one common instruction at a time (efficiency requires that all threads in the warp do the same thing, take the same branches, do the same operation)
  – If multiple execution paths are possible (due to branches), and the decision on which branch to take is not uniform across the threads, all of the possible execution paths are executed!
• This also hints that atomic operations issued by threads in a warp to the same memory location would be executed sequentially
  – occupancy
• More info @ https://docs.nvidia.com/cuda/cuda-c-programming-guide/

[Figure: a tile with north, south, east, and west borders, covered by thread blocks of THREADS_PER_BLOCK_X × THREADS_PER_BLOCK_Y threads]

Page 3: Laplace's equation – MPI + CUDA

Debugging

• Commercial tools (DDT, TotalView, …)
• If it is possible to export an xterm: mpirun -np 2 xterm -e gdb --args <myapp args>
• If not, add a sleep (or a loop around a sleep) in your application and use "gdb -p <pid>" to attach to your process (once connected to the same node where the application is running)
• gdb can execute GDB commands from a FILE (with --command=FILE, or -x FILE)

Page 4: Laplace's equation – MPI + CUDA

Profiling

• Non-CUDA application: valgrind (free), or VTune (Intel), Score-P, TAU, Vampir
• CUDA application: nvprof from the CUDA toolkit

Page 5: Laplace's equation – MPI + CUDA

Possible code optimizations

• CUDA:
  – As the computation is symmetrical and highly balanced, one can use a different work distribution and do more computations per thread
  – Use shared memory
  – Divide the computation in 2 parts: what needs external data and what doesn't
• MPI:
  – Use datatypes
  – Use RMA
• Overlap communication and computations
  – Create a specialized kernel to pack and unpack all the borders in one operation
  – As starting a kernel has a high latency, merge this pack/unpack kernel with the updates based on the ghost regions