Exploiting Maximal Overlap for Non-Contiguous Data Movement Processing on Modern GPU-enabled Systems

Ching-Hsiang Chu, Khaled Hamidouche, Akshay Venkatesh, Dip S. Banerjee, Hari Subramoni and Dhabaleswar K. (DK) Panda

Network-Based Computing Laboratory, Department of Computer Science and Engineering, The Ohio State University

IPDPS 2016
Outline
• Introduction
• Proposed Designs
• Performance Evaluation
• Conclusion
Drivers of Modern HPC Cluster Architectures
• Multi-core processors are ubiquitous
• InfiniBand is very popular in HPC clusters
• Accelerators/Coprocessors becoming common in high-end systems
• Pushing the envelope for Exascale computing

[Figure: Multi-core processors; high-performance interconnects (InfiniBand, <1 us latency, >100 Gbps bandwidth); accelerators/coprocessors (high compute density, high performance/watt, >1 Tflop/s DP on a chip); example systems: Tianhe-2, Titan, Stampede, Tianhe-1A]
Accelerators in HPC Systems
• Growth of accelerator-enabled clusters in the last 3 years
  – 22% of Top 50 clusters are boosted by NVIDIA GPUs in Nov '15
  – From the Top500 list (http://www.top500.org)
[Figure: System count of accelerator-enabled systems in the Top500, June 2013 to Nov 2015]

                  June-2013  Nov-2013  June-2014  Nov-2014  June-2015  Nov-2015
  NVIDIA Kepler       8         15        23         28        33         52
  NVIDIA Fermi       31         22        20         18        15         14
  Intel Xeon Phi     11         12        16         20        30         29
Motivation
• Parallel applications on GPU clusters
  – CUDA (Compute Unified Device Architecture):
    • Kernel computation on NVIDIA GPUs
  – CUDA-Aware MPI (Message Passing Interface):
    • Communication across processes/nodes
    • Non-blocking communication to overlap with CUDA kernels

  MPI_Isend(Buf1, ..., request1);
  MPI_Isend(Buf2, ..., request2);
  /* Independent computations on CPU/GPU */
  MPI_Wait(request1, status1);
  MPI_Wait(request2, status2);
Motivation
• Use of non-contiguous data becoming common
  – Easy to represent complex data structures
    • MPI Datatypes
  – E.g., fluid dynamics, image processing, ...
• What if the data are in GPU memory?
  1. Copy data to the CPU to perform the packing/unpacking
     • Slower for large messages
     • Data movements between GPU and CPU are expensive
  2. Utilize a GPU kernel to perform the packing/unpacking*
     • No explicit copies, faster for large messages (see the sketch below)

* R. Shi et al., "HAND: A Hybrid Approach to Accelerate Non-contiguous Data Movement Using MPI Datatypes on GPU Clusters," in 43rd ICPP, Sept 2014, pp. 221–230.
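For illustration, a minimal sketch of a GPU packing kernel of the kind referred to in point 2 above (this is not the HAND implementation; the kernel name, buffer names, and launch configuration are assumptions). It gathers one strided column of an n x n matrix resident in GPU memory into a contiguous GPU buffer, with no staging copy through the host:

  __global__ void pack_vector(const double *d_matrix, double *d_sbuf,
                              int n, int stride)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n)
          d_sbuf[i] = d_matrix[i * stride];   /* one strided element per thread */
  }

  /* Launched on its own CUDA stream so it can overlap with other work:
     pack_vector<<<(n + 255) / 256, 256, 0, stream>>>(d_matrix, d_sbuf, n, n); */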
Motivation – Non-Contiguous Data Movement in MPI

Common scenario:
  MPI_Isend(Buf1, ..., req1);
  MPI_Isend(Buf2, ..., req2);
  /* Application work on the CPU/GPU */
  MPI_Waitall(req, ...);
  /* Buf1, Buf2, ... contain non-contiguous MPI Datatypes */

[Timeline figure: waste of computing resources on CPU and GPU]
Problem Statement
• Low overlap between CPU and GPU for applications
  – Packing/unpacking operations are serialized
• CPU/GPU resources are not fully utilized
  – GPU threads remain idle for most of the time
  – Low utilization, low efficiency

[Radar chart comparing the Proposed, User Naive, and User Advanced approaches on Overlap, Productivity, Performance, and Resource Utilization axes; farther from the center is better]

Can we have designs that leverage new GPU technology to address these issues?
Goals of this work
• Propose new designs that leverage new NVIDIA GPU technologies
  – Hyper-Q technology (multi-streaming)
  – CUDA Event and Callback
• Achieve
  – High performance and resource utilization for applications
  – High productivity for developers
Outline
• Introduction
• Proposed Designs
  – Event-based
  – Callback-based
• Performance Evaluation
• Conclusion
Overview

[Timeline figure comparing the existing and proposed designs along CPU and GPU timelines. Existing design: for each Isend, the CPU initiates the packing kernel on a stream, waits for the kernel (WFK), and only then starts the send, so successive Isends are serialized. Proposed design: the CPU initiates the packing kernels for Isend(1), Isend(2), and Isend(3) back-to-back on streams and returns, deferring WFK and Start Send to the Wait/progress phase so kernels and sends overlap. Expected benefit: the proposed design finishes earlier than the existing design.]
Event-based Design
• CUDA Event Management
  – Provides a mechanism to signal when tasks have occurred in a CUDA stream
• Basic design idea (sketched below)
  1. CPU launches a CUDA packing/unpacking kernel
  2. CPU creates a CUDA event and then returns immediately
     • GPU sets the status as 'completed' when the kernel is completed
  3. In MPI_Wait/MPI_Waitall:
     • CPU queries the events when the packed/unpacked data is required for communication
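A minimal sketch of this event-based pattern inside the MPI library (illustrative pseudocode, not MVAPICH2's actual source; the request fields, pack_kernel, and start_send helper are assumptions):

  /* In MPI_Isend, for a GPU-resident non-contiguous datatype */
  pack_kernel<<<grid, block, 0, req->stream>>>(d_userbuf, req->d_packbuf, count);
  cudaEventRecord(req->pack_done, req->stream);   /* event signals kernel completion */
  return MPI_SUCCESS;                             /* return to the application immediately */

  /* In the MPI_Wait/MPI_Waitall progress loop */
  if (cudaEventQuery(req->pack_done) == cudaSuccess)
      start_send(req->d_packbuf, req->nbytes);    /* packed data ready: start communication */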
Event-based Design

[Sequence diagram across HCA, CPU, and GPU: each MPI_Isend() launches pack_kernel1/2/3<<< >>> on the GPU and records an event with cudaEventRecord(); in MPI_Waitall() the CPU queries the events and makes progress, issuing each send to the HCA once its packing kernel completes, until the requests complete.]
Event-based Design
• Major benefits
  – Overlap between CPU communication and GPU packing kernel
  – GPU resources are highly utilized
• Limitation
  – CPU is required to keep checking the status of the event
    • Lower CPU utilization

  MPI_Isend(Buf1, ..., request1);
  MPI_Isend(Buf2, ..., request2);
  MPI_Wait(request1, status1);
  MPI_Wait(request2, status2);
Callback-based Design
• CUDA Stream Callback
  – Launches work automatically on the CPU when something has completed on the CUDA stream
  – Restrictions:
    • Callbacks are processed by a driver thread, where no CUDA APIs can be called
    • Overhead when initializing the callback function
• Basic design idea (sketched below)
  1. CPU launches a CUDA packing/unpacking kernel
  2. CPU adds a callback function and then returns immediately
  3. The callback function wakes up a helper thread to process the communication
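A minimal sketch of this callback-based pattern (illustrative pseudocode, not the MVAPICH2-GDR implementation; request_t, wake_helper_thread, and the stream bookkeeping are assumptions):

  /* Runs on a CUDA driver thread: no CUDA APIs may be called here,
   * so it only wakes the helper thread that owns the communication. */
  static void CUDART_CB pack_done_cb(cudaStream_t stream, cudaError_t status,
                                     void *userData)
  {
      request_t *req = (request_t *)userData;
      wake_helper_thread(req);        /* helper thread issues the send for this request */
  }

  /* In MPI_Isend, for a GPU-resident non-contiguous datatype */
  pack_kernel<<<grid, block, 0, req->stream>>>(d_userbuf, req->d_packbuf, count);
  cudaStreamAddCallback(req->stream, pack_done_cb, req, 0);
  return MPI_SUCCESS;                 /* main thread is free for application CPU work */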
Callback-based Design

[Sequence diagram across HCA, CPU (main, helper, and callback threads), and GPU: each MPI_Isend() launches pack_kernel1/2/3<<< >>> and registers a callback with addCallback(); the main thread continues with CPU computations, each callback wakes the helper thread to issue the send to the HCA as its kernel completes, and MPI_Waitall() completes the requests.]
Callback-based Design
• Major benefits
  – Overlap between CPU communication and GPU packing kernel
  – Overlap between CPU communication and other computations
  – Higher CPU and GPU utilization

  MPI_Isend(Buf1, ..., &requests[0]);
  MPI_Isend(Buf2, ..., &requests[1]);
  MPI_Isend(Buf3, ..., &requests[2]);
  // Application work on the CPU
  MPI_Waitall(requests, status);
Outline
• Introduction
• Proposed Designs
• Performance Evaluation
  – Benchmark
  – Halo Exchange-based Application Kernel
• Conclusion
Overview of the MVAPICH2 Project
• High-performance open-source MPI library for InfiniBand, 10-40 GigE/iWARP, and RDMA over Converged Enhanced Ethernet (RoCE)
  – MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.0), available since 2002
  – MVAPICH2-X (MPI+PGAS), available since 2011
  – Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), available since 2014
  – Support for Virtualization (MVAPICH2-Virt), available since 2015
  – Support for Energy-Awareness (MVAPICH2-EA), available since 2015
  – Used by more than 2,575 organizations in 80 countries
  – More than 376,000 (0.37 million) downloads from the OSU site directly
  – Empowering many TOP500 clusters (Nov '15 ranking)
    • 10th-ranked 519,640-core cluster (Stampede) at TACC
    • 13th-ranked 185,344-core cluster (Pleiades) at NASA
    • 25th-ranked 76,032-core cluster (Tsubame 2.5) at Tokyo Institute of Technology, and many others
  – Available with software stacks of many vendors and Linux distros (RedHat and SuSE)
  – http://mvapich.cse.ohio-state.edu
• Empowering Top500 systems for over a decade
  – System-X from Virginia Tech (3rd in Nov 2003, 2,200 processors, 12.25 TFlops) ->
  – Stampede at TACC (10th in Nov '15, 519,640 cores, 5.168 PFlops)
Experimental Environments
1. Wilkes cluster @ University of Cambridge
   – 2 NVIDIA K20c GPUs per node
   – Up to 32 GPU nodes
2. CSCS cluster @ Swiss National Supercomputing Centre
   – Cray CS-Storm system
   – 8 NVIDIA K80 GPUs per node
   – Up to 96 GPUs over 12 nodes
Benchmark-level Evaluation - Performance
• Modified 'CUDA-Aware' DDTBench (http://htor.inf.ethz.ch/research/datatypes/ddtbench/)

[Figure: Normalized execution time vs. input size for the Default, Event-based, and Callback-based designs on the NAS_MG_y, SPECFEM3D_OC, WRF_sa, and SPECFEM3D_CM tests; lower is better. The proposed designs improve performance by up to 1.5X-3.4X across the tests.]
Benchmark-level Evaluation - Overlap
• Modified 'CUDA-Aware' DDTBench for the NAS_MG_y test
  – Injected dummy computations

  MPI_Isend(Buf1, ..., request1);
  MPI_Isend(Buf2, ..., request2);
  MPI_Isend(Buf3, ..., request3);
  Dummy_comp(); // Application work on the CPU
  MPI_Waitall(requests, status);

[Figure: Overlap (%) vs. input size for the Default, Event-based, and Callback-based designs; higher is better]
Application-level Evaluation - Halo Data Exchange
• MeteoSwiss weather forecasting COSMO* application kernel @ CSCS cluster
• Multi-dimensional data
  – Contiguous on one dimension
  – Non-contiguous on other dimensions (see the datatype sketch below)
• Halo data exchange
  – Duplicate the boundary
  – Exchange the boundary

* http://www.cosmo-model.org/
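As an illustration of how such a non-contiguous halo face can be described with MPI Datatypes (a minimal sketch under assumed field dimensions nx, ny, nz and a GPU-resident buffer d_field; this is not the COSMO kernel's actual code):

  /* One y-z boundary face of an nx x ny x nz field stored x-fastest:
   * ny*nz single elements separated by a stride of nx elements. */
  MPI_Datatype face_x;
  MPI_Type_vector(ny * nz, 1, nx, MPI_DOUBLE, &face_x);
  MPI_Type_commit(&face_x);

  /* CUDA-Aware MPI: pass the GPU buffer directly; the library packs it on the GPU. */
  MPI_Isend(d_field, 1, face_x, neighbor, tag, MPI_COMM_WORLD, &req);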
Application-level (Halo Exchange) Evaluation

  MPI_Isend(Buf1, ..., request1);
  MPI_Isend(Buf2, ..., request2);
  // Computations on GPU
  MPI_Wait(request1, status1);
  MPI_Wait(request2, status2);

[Figures: Normalized execution time vs. number of GPUs for the Default, Callback-based, and Event-based designs; lower is better. CSCS GPU cluster: 16, 32, 64, and 96 GPUs. Wilkes GPU cluster: 4, 8, 16, and 32 GPUs. The proposed designs improve performance by up to 2X and 1.6X.]
Conclusion
• Proposed designs can improve the overall performance and utilization of CPU as well as GPU
  – Event-based design: overlapping CPU communication with GPU computation
  – Callback-based design: further overlapping with CPU computation
• Future Work
  – Non-blocking collective operations
  – Contiguous data movements
  – Next-generation GPU architectures
  – Will be available in the MVAPICH2-GDR library
The High-Performance Big Data Project: http://hibd.cse.ohio-state.edu/
Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
The MVAPICH2 Project: http://mvapich.cse.ohio-state.edu/
Motivation – NVIDIA GPU Feature
• NVIDIA CUDA Hyper-Q (multi-stream) technology
  – Multiple CPU threads/processes can launch kernels on a single GPU simultaneously (see the sketch below)
  – Increases GPU utilization and reduces CPU idle times

[Figure: Hyper-Q, http://www.hpc.co.jp/images/hyper-q.png]
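A minimal sketch of the multi-stream usage that Hyper-Q accelerates (illustrative only; the stream count, buffers, and pack_kernel are assumptions):

  cudaStream_t streams[4];
  for (int s = 0; s < 4; s++)
      cudaStreamCreate(&streams[s]);

  /* Independent packing kernels issued to different streams can run
   * concurrently on a single GPU instead of serializing in one queue. */
  for (int s = 0; s < 4; s++)
      pack_kernel<<<grid, block, 0, streams[s]>>>(d_in[s], d_out[s], count);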
Motivation – Non-Contiguous Data Movement in MPI

Manual packing/unpacking:
  sbuf = malloc(...); rbuf = malloc(...);
  /* Packing */
  for (i = 0; i < n; i++)
      sbuf[i] = matrix[i][0];
  MPI_Send(sbuf, n, MPI_DOUBLE, ...);
  MPI_Recv(rbuf, n, MPI_DOUBLE, ...);
  /* Unpacking */
  for (i = 0; i < n; i++)
      matrix[i][0] = rbuf[i];
  free(sbuf); free(rbuf);

Using MPI Datatypes:
  MPI_Datatype nt;
  MPI_Type_vector(n, 1, n, MPI_DOUBLE, &nt);
  MPI_Type_commit(&nt);
  MPI_Send(matrix, 1, nt, ...);
  MPI_Recv(matrix, 1, nt, ...);

• No explicit copies in applications
  – Better performance
• Less code
  – Higher productivity