Laplace's equation – MPI + CUDA

  • Upload
    others

  • View
    13

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Laplace's equation – MPI + CUDA

• Let's first assume we don't use datatypes (instead we will manually pack/unpack)
  – Data distribution
  – Create datatypes
  – Exchange data with neighbors (north, south, east, west)
  – Do local computation

[Figure: 2-D domain decomposition; each local tile exchanges its north, south, east, and west borders with its neighbors]

Page 2: Laplace's equation – MPI + CUDA

Programming CUDA

• Each little box is a light computational thread
• Going back to the data distribution methods (k-cyclic), the CUDA work distribution can be assimilated to a <X-cyclic, Y-cyclic> distribution
  – The goal is to evenly distribute the computational work over all threads available on the GPU
  – Warp: a group of 32 parallel threads that execute exactly the same thing (but are named distinguishably)
  – A warp executes one common instruction at a time (efficiency requires that all threads in the warp do the same thing, take the same branches, do the same operation)
  – If multiple execution paths are possible (due to branches), and the decision on which branch to take is not uniform across the threads, all of the possible execution paths are executed!
• This also hints that atomic operations issued by threads in a warp to the same memory location would be executed sequentially
  – occupancy
• More info @ https://docs.nvidia.com/cuda/cuda-c-programming-guide/

[Figure: a tile with north, south, east, and west borders, covered by thread blocks of THREADS_PER_BLOCK_X × THREADS_PER_BLOCK_Y threads]

Page 3: Laplace's equation – MPI + CUDA

Debugging

• Commercial tools (DDT, TotalView, …)
• If it is possible to export an xterm: mpirun -np 2 xterm -e gdb --args <myapp args>
• If not, add a sleep (or a loop around a sleep) in your application and use "gdb -p <pid>" to attach to your process (once connected to the same node where the application is running)
• gdb can execute GDB commands from a FILE (with --command=FILE, or -x FILE)

Page 4: Laplace's equation – MPI + CUDA

Profiling

• Non-CUDA application: valgrind (free), or VTune (Intel), Score-P, TAU, Vampir
• CUDA application: nvprof from the CUDA toolkit

Page 5: Laplace's equation – MPI + CUDA

Possible code optimizations

• CUDA:
  – As the computation is symmetrical and highly balanced, one can use a different work distribution and do more computations per thread
  – Use shared memory
  – Divide the computation in 2 parts: what needs external data and what doesn't
• MPI:
  – Use datatypes
  – Use RMA
• Overlap communication and computations
  – Create a specialized kernel to pack and unpack all the borders in one operation
  – As starting a kernel has a high latency, merge this pack/unpack kernel with the updates based on the ghost regions