Exploiting Maximal Overlap for Non-Contiguous Data Movement Processing on Modern GPU-enabled Systems

Ching-Hsiang Chu, Khaled Hamidouche, Akshay Venkatesh, Dip S. Banerjee, Hari Subramoni and Dhabaleswar K. (DK) Panda

Network-Based Computing Laboratory, Department of Computer Science and Engineering, The Ohio State University

IPDPS 2016
Outline
• Introduction
• Proposed Designs
• Performance Evaluation
• Conclusion
Drivers of Modern HPC Cluster Architectures
• Multi-core processors are ubiquitous
• InfiniBand is very popular in HPC clusters
• Accelerators/Coprocessors becoming common in high-end systems
• Pushing the envelope for Exascale computing

[Figure: Multi-core processors; high-performance interconnects (InfiniBand, <1 us latency, >100 Gbps bandwidth); accelerators/coprocessors (high compute density, high performance/watt, >1 Tflop/s DP on a chip); example systems: Tianhe-2, Titan, Stampede, Tianhe-1A]
Accelerators in HPC Systems
• Growth of accelerator-enabled clusters in the last 3 years
  – 22% of Top 50 clusters are boosted by NVIDIA GPUs in Nov '15
  – From the Top500 list (http://www.top500.org)
[Figure: System count of accelerator-enabled systems in the Top500, June 2013 to Nov 2015]

                  June-2013  Nov-2013  June-2014  Nov-2014  June-2015  Nov-2015
  NVIDIA Kepler       8         15        23         28        33         52
  NVIDIA Fermi       31         22        20         18        15         14
  Intel Xeon Phi     11         12        16         20        30         29
Motivation
• Parallel applications on GPU clusters
  – CUDA (Compute Unified Device Architecture):
    • Kernel computation on NVIDIA GPUs
  – CUDA-Aware MPI (Message Passing Interface):
    • Communication across processes/nodes
    • Non-blocking communication to overlap with CUDA kernels

  MPI_Isend(Buf1, ..., request1);
  MPI_Isend(Buf2, ..., request2);
  /* Independent computations on CPU/GPU */
  MPI_Wait(request1, status1);
  MPI_Wait(request2, status2);
Motivation
• Use of non-contiguous data becoming common
  – Easy to represent complex data structures
    • MPI Datatypes
  – E.g., fluid dynamics, image processing, ...
• What if the data are in GPU memory?
  1. Copy data to the CPU to perform the packing/unpacking
     • Slower for large messages
     • Data movements between GPU and CPU are expensive
  2. Utilize a GPU kernel to perform the packing/unpacking*
     • No explicit copies, faster for large messages (see the sketch below)

* R. Shi et al., "HAND: A Hybrid Approach to Accelerate Non-contiguous Data Movement Using MPI Datatypes on GPU Clusters," in 43rd ICPP, Sept 2014, pp. 221–230.
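For illustration, a minimal sketch of a GPU packing kernel of the kind referred to in point 2 above (this is not the HAND implementation; the kernel name, buffer names, and launch configuration are assumptions). It gathers one strided column of an n x n matrix resident in GPU memory into a contiguous GPU buffer, with no staging copy through the host:

  __global__ void pack_vector(const double *d_matrix, double *d_sbuf,
                              int n, int stride)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n)
          d_sbuf[i] = d_matrix[i * stride];   /* one strided element per thread */
  }

  /* Launched on its own CUDA stream so it can overlap with other work:
     pack_vector<<<(n + 255) / 256, 256, 0, stream>>>(d_matrix, d_sbuf, n, n); */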
Motivation – Non-Contiguous Data Movement in MPI

Common scenario:
  MPI_Isend(Buf1, ..., req1);
  MPI_Isend(Buf2, ..., req2);
  /* Application work on the CPU/GPU */
  MPI_Waitall(req, ...);
  /* Buf1, Buf2, ... contain non-contiguous MPI Datatypes */

[Timeline figure: waste of computing resources on CPU and GPU]
Problem Statement
• Low overlap between CPU and GPU for applications
  – Packing/unpacking operations are serialized
• CPU/GPU resources are not fully utilized
  – GPU threads remain idle for most of the time
  – Low utilization, low efficiency

[Radar chart comparing the Proposed, User Naive, and User Advanced approaches on Overlap, Productivity, Performance, and Resource Utilization axes; farther from the center is better]

Can we have designs that leverage new GPU technology to address these issues?
Goals of this work
• Propose new designs that leverage new NVIDIA GPU technologies
  – Hyper-Q technology (multi-streaming)
  – CUDA Event and Callback
• Achieve
  – High performance and resource utilization for applications
  – High productivity for developers
Outline
• Introduction
• Proposed Designs
  – Event-based
  – Callback-based
• Performance Evaluation
• Conclusion
Overview

[Timeline figure comparing the existing and proposed designs along CPU and GPU timelines. Existing design: for each Isend, the CPU initiates the packing kernel on a stream, waits for the kernel (WFK), and only then starts the send, so successive Isends are serialized. Proposed design: the CPU initiates the packing kernels for Isend(1), Isend(2), and Isend(3) back-to-back on streams and returns, deferring WFK and Start Send to the Wait/progress phase so kernels and sends overlap. Expected benefit: the proposed design finishes earlier than the existing design.]
Event-based Design
• CUDA Event Management
  – Provides a mechanism to signal when tasks have occurred in a CUDA stream
• Basic design idea (sketched below)
  1. CPU launches a CUDA packing/unpacking kernel
  2. CPU creates a CUDA event and then returns immediately
     • GPU sets the status as 'completed' when the kernel is completed
  3. In MPI_Wait/MPI_Waitall:
     • CPU queries the events when the packed/unpacked data is required for communication
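A minimal sketch of this event-based pattern inside the MPI library (illustrative pseudocode, not MVAPICH2's actual source; the request fields, pack_kernel, and start_send helper are assumptions):

  /* In MPI_Isend, for a GPU-resident non-contiguous datatype */
  pack_kernel<<<grid, block, 0, req->stream>>>(d_userbuf, req->d_packbuf, count);
  cudaEventRecord(req->pack_done, req->stream);   /* event signals kernel completion */
  return MPI_SUCCESS;                             /* return to the application immediately */

  /* In the MPI_Wait/MPI_Waitall progress loop */
  if (cudaEventQuery(req->pack_done) == cudaSuccess)
      start_send(req->d_packbuf, req->nbytes);    /* packed data ready: start communication */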
Event-based Design

[Sequence diagram across HCA, CPU, and GPU: each MPI_Isend() launches pack_kernel1/2/3<<< >>> on the GPU and records an event with cudaEventRecord(); in MPI_Waitall() the CPU queries the events and makes progress, issuing each send to the HCA once its packing kernel completes, until the requests complete.]
Event-based Design
• Major benefits
  – Overlap between CPU communication and GPU packing kernel
  – GPU resources are highly utilized
• Limitation
  – CPU is required to keep checking the status of the event
    • Lower CPU utilization

  MPI_Isend(Buf1, ..., request1);
  MPI_Isend(Buf2, ..., request2);
  MPI_Wait(request1, status1);
  MPI_Wait(request2, status2);
Callback-based Design
• CUDA Stream Callback
  – Launches work automatically on the CPU when something has completed on the CUDA stream
  – Restrictions:
    • Callbacks are processed by a driver thread, where no CUDA APIs can be called
    • Overhead when initializing the callback function
• Basic design idea (sketched below)
  1. CPU launches a CUDA packing/unpacking kernel
  2. CPU adds a callback function and then returns immediately
  3. The callback function wakes up a helper thread to process the communication
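A minimal sketch of this callback-based pattern (illustrative pseudocode, not the MVAPICH2-GDR implementation; request_t, wake_helper_thread, and the stream bookkeeping are assumptions):

  /* Runs on a CUDA driver thread: no CUDA APIs may be called here,
   * so it only wakes the helper thread that owns the communication. */
  static void CUDART_CB pack_done_cb(cudaStream_t stream, cudaError_t status,
                                     void *userData)
  {
      request_t *req = (request_t *)userData;
      wake_helper_thread(req);        /* helper thread issues the send for this request */
  }

  /* In MPI_Isend, for a GPU-resident non-contiguous datatype */
  pack_kernel<<<grid, block, 0, req->stream>>>(d_userbuf, req->d_packbuf, count);
  cudaStreamAddCallback(req->stream, pack_done_cb, req, 0);
  return MPI_SUCCESS;                 /* main thread is free for application CPU work */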
Callback-based Design

[Sequence diagram across HCA, CPU (main, helper, and callback threads), and GPU: each MPI_Isend() launches pack_kernel1/2/3<<< >>> and registers a callback with addCallback(); the main thread continues with CPU computations, each callback wakes the helper thread to issue the send to the HCA as its kernel completes, and MPI_Waitall() completes the requests.]
Callback-based Design
• Major benefits
  – Overlap between CPU communication and GPU packing kernel
  – Overlap between CPU communication and other computations
  – Higher CPU and GPU utilization

  MPI_Isend(Buf1, ..., &requests[0]);
  MPI_Isend(Buf2, ..., &requests[1]);
  MPI_Isend(Buf3, ..., &requests[2]);
  // Application work on the CPU
  MPI_Waitall(requests, status);
Outline
• Introduction
• Proposed Designs
• Performance Evaluation
  – Benchmark
  – Halo Exchange-based Application Kernel
• Conclusion
Overview of the MVAPICH2 Project
• High-performance open-source MPI library for InfiniBand, 10-40 GigE/iWARP, and RDMA over Converged Enhanced Ethernet (RoCE)
  – MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.0), available since 2002
  – MVAPICH2-X (MPI+PGAS), available since 2011
  – Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), available since 2014
  – Support for Virtualization (MVAPICH2-Virt), available since 2015
  – Support for Energy-Awareness (MVAPICH2-EA), available since 2015
  – Used by more than 2,575 organizations in 80 countries
  – More than 376,000 (0.37 million) downloads from the OSU site directly
  – Empowering many TOP500 clusters (Nov '15 ranking)
    • 10th-ranked 519,640-core cluster (Stampede) at TACC
    • 13th-ranked 185,344-core cluster (Pleiades) at NASA
    • 25th-ranked 76,032-core cluster (Tsubame 2.5) at Tokyo Institute of Technology, and many others
  – Available with software stacks of many vendors and Linux distros (RedHat and SuSE)
  – http://mvapich.cse.ohio-state.edu
• Empowering Top500 systems for over a decade
  – System-X from Virginia Tech (3rd in Nov 2003, 2,200 processors, 12.25 TFlops) ->
  – Stampede at TACC (10th in Nov '15, 519,640 cores, 5.168 PFlops)
Experimental Environments
1. Wilkes cluster @ University of Cambridge
   – 2 NVIDIA K20c GPUs per node
   – Up to 32 GPU nodes
2. CSCS cluster @ Swiss National Supercomputing Centre
   – Cray CS-Storm system
   – 8 NVIDIA K80 GPUs per node
   – Up to 96 GPUs over 12 nodes
Benchmark-level Evaluation - Performance
• Modified 'CUDA-Aware' DDTBench (http://htor.inf.ethz.ch/research/datatypes/ddtbench/)

[Figure: Normalized execution time vs. input size for the Default, Event-based, and Callback-based designs on the NAS_MG_y, SPECFEM3D_OC, WRF_sa, and SPECFEM3D_CM tests; lower is better. The proposed designs improve performance by up to 1.5X-3.4X across the tests.]
Benchmark-level Evaluation - Overlap
• Modified 'CUDA-Aware' DDTBench for the NAS_MG_y test
  – Injected dummy computations

  MPI_Isend(Buf1, ..., request1);
  MPI_Isend(Buf2, ..., request2);
  MPI_Isend(Buf3, ..., request3);
  Dummy_comp(); // Application work on the CPU
  MPI_Waitall(requests, status);

[Figure: Overlap (%) vs. input size for the Default, Event-based, and Callback-based designs; higher is better]
Application-level Evaluation - Halo Data Exchange
• MeteoSwiss weather forecasting COSMO* application kernel @ CSCS cluster
• Multi-dimensional data
  – Contiguous on one dimension
  – Non-contiguous on other dimensions (see the datatype sketch below)
• Halo data exchange
  – Duplicate the boundary
  – Exchange the boundary

* http://www.cosmo-model.org/
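As an illustration of how such a non-contiguous halo face can be described with MPI Datatypes (a minimal sketch under assumed field dimensions nx, ny, nz and a GPU-resident buffer d_field; this is not the COSMO kernel's actual code):

  /* One y-z boundary face of an nx x ny x nz field stored x-fastest:
   * ny*nz single elements separated by a stride of nx elements. */
  MPI_Datatype face_x;
  MPI_Type_vector(ny * nz, 1, nx, MPI_DOUBLE, &face_x);
  MPI_Type_commit(&face_x);

  /* CUDA-Aware MPI: pass the GPU buffer directly; the library packs it on the GPU. */
  MPI_Isend(d_field, 1, face_x, neighbor, tag, MPI_COMM_WORLD, &req);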
Application-level (Halo Exchange) Evaluation

  MPI_Isend(Buf1, ..., request1);
  MPI_Isend(Buf2, ..., request2);
  // Computations on GPU
  MPI_Wait(request1, status1);
  MPI_Wait(request2, status2);

[Figures: Normalized execution time vs. number of GPUs for the Default, Callback-based, and Event-based designs; lower is better. CSCS GPU cluster: 16, 32, 64, and 96 GPUs. Wilkes GPU cluster: 4, 8, 16, and 32 GPUs. The proposed designs improve performance by up to 2X and 1.6X.]
Conclusion
• Proposed designs can improve the overall performance and utilization of CPU as well as GPU
  – Event-based design: overlapping CPU communication with GPU computation
  – Callback-based design: further overlapping with CPU computation
• Future Work
  – Non-blocking collective operations
  – Contiguous data movements
  – Next-generation GPU architectures
  – Will be available in the MVAPICH2-GDR library
The High-Performance Big Data Project: http://hibd.cse.ohio-state.edu/
Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
The MVAPICH2 Project: http://mvapich.cse.ohio-state.edu/
Motivation – NVIDIA GPU Feature
• NVIDIA CUDA Hyper-Q (multi-stream) technology
  – Multiple CPU threads/processes can launch kernels on a single GPU simultaneously (see the sketch below)
  – Increases GPU utilization and reduces CPU idle times

[Figure: Hyper-Q, http://www.hpc.co.jp/images/hyper-q.png]
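A minimal sketch of the multi-stream usage that Hyper-Q accelerates (illustrative only; the stream count, buffers, and pack_kernel are assumptions):

  cudaStream_t streams[4];
  for (int s = 0; s < 4; s++)
      cudaStreamCreate(&streams[s]);

  /* Independent packing kernels issued to different streams can run
   * concurrently on a single GPU instead of serializing in one queue. */
  for (int s = 0; s < 4; s++)
      pack_kernel<<<grid, block, 0, streams[s]>>>(d_in[s], d_out[s], count);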
Motivation – Non-Contiguous Data Movement in MPI

Manual packing/unpacking:
  sbuf = malloc(...); rbuf = malloc(...);
  /* Packing */
  for (i = 0; i < n; i++)
      sbuf[i] = matrix[i][0];
  MPI_Send(sbuf, n, MPI_DOUBLE, ...);
  MPI_Recv(rbuf, n, MPI_DOUBLE, ...);
  /* Unpacking */
  for (i = 0; i < n; i++)
      matrix[i][0] = rbuf[i];
  free(sbuf); free(rbuf);

Using MPI Datatypes:
  MPI_Datatype nt;
  MPI_Type_vector(n, 1, n, MPI_DOUBLE, &nt);
  MPI_Type_commit(&nt);
  MPI_Send(matrix, 1, nt, ...);
  MPI_Recv(matrix, 1, nt, ...);

• No explicit copies in applications
  – Better performance
• Less code
  – Higher productivity