Exploiting Maximal Overlap for Non-Contiguous Data Movement Processing on Modern GPU-enabled Systems
Ching-Hsiang Chu, Khaled Hamidouche, Akshay Venkatesh, Dip S. Banerjee, Hari Subramoni and Dhabaleswar K. (DK) Panda
Network-Based Computing Laboratory, Department of Computer Science and Engineering, The Ohio State University



Page 1: Exploiting Maximal Overlap for Non-Contiguous Data Movement Processing on Modern GPU-enabled Systems (title slide)

Page 2

IPDPS 2016, Network-Based Computing Laboratory

Outline

• Introduction
• Proposed Designs
• Performance Evaluation
• Conclusion

Page 3

Drivers of Modern HPC Cluster Architectures

• Multi-core processors are ubiquitous
• InfiniBand is very popular in HPC clusters
• Accelerators/Coprocessors are becoming common in high-end systems
• Pushing the envelope for Exascale computing

Multi-core processors; high-performance interconnects (InfiniBand: <1 us latency, >100 Gbps bandwidth); accelerators/coprocessors (high compute density, high performance/watt, >1 Tflop/s DP on a chip)

Example systems: Tianhe-2, Titan, Stampede, Tianhe-1A

Page 4

Accelerators in HPC Systems

• Growth of accelerator-enabled clusters in the last 3 years
  – 22% of the Top 50 clusters were boosted by NVIDIA GPUs in Nov '15
  – From the Top500 list (http://www.top500.org)

[Chart: number of Top 50 systems with NVIDIA Kepler, NVIDIA Fermi, and Intel Xeon Phi accelerators, June 2013 through Nov 2015]

Page 5

Motivation

• Parallel applications on GPU clusters
  – CUDA (Compute Unified Device Architecture): kernel computation on NVIDIA GPUs
  – CUDA-Aware MPI (Message Passing Interface): communication across processes/nodes; non-blocking communication overlaps with CUDA kernels

MPI_Isend(Buf1, ..., &request1);
MPI_Isend(Buf2, ..., &request2);
/* Independent computations on CPU/GPU */
MPI_Wait(&request1, &status1);
MPI_Wait(&request2, &status2);

Page 6

Motivation

• Use of non-contiguous data is becoming common
  – MPI Datatypes make it easy to represent complex data structures
  – E.g., fluid dynamics, image processing, ...

• What if the data are in GPU memory?
  1. Copy data to the CPU to perform the packing/unpacking
     • Slower for large messages
     • Data movement between GPU and CPU is expensive
  2. Utilize a GPU kernel to perform the packing/unpacking*
     • No explicit copies; faster for large messages

*R. Shi et al., "HAND: A Hybrid Approach to Accelerate Non-contiguous Data Movement Using MPI Datatypes on GPU Clusters," in 43rd ICPP, Sept 2014, pp. 221-230.
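To make the packing/unpacking operation concrete, here is a minimal plain-C sketch of what a GPU packing kernel does for one common case, a strided column of a row-major matrix (on the GPU this loop would run with one thread per element; `pack_column` and `unpack_column` are illustrative names, not functions from the paper):

```c
#include <assert.h>
#include <stddef.h>

/* Pack one column of an n x n row-major matrix (a strided, non-contiguous
 * region) into a contiguous send buffer. */
void pack_column(const double *matrix, size_t n, size_t col, double *sbuf)
{
    for (size_t i = 0; i < n; i++)
        sbuf[i] = matrix[i * n + col];   /* elements are n doubles apart */
}

/* Inverse operation on the receive side. */
void unpack_column(const double *rbuf, size_t n, size_t col, double *matrix)
{
    for (size_t i = 0; i < n; i++)
        matrix[i * n + col] = rbuf[i];
}
```

The copy-to-CPU approach performs this loop on the host after a device-to-host transfer; the GPU-kernel approach runs it on the device so the contiguous buffer can be sent directly.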

Page 7

Motivation: Non-Contiguous Data Movement in MPI

Common scenario (timeline):

MPI_Isend(Buf1, ..., &req1);
MPI_Isend(Buf2, ..., &req2);
/* Application work on the CPU/GPU */
MPI_Waitall(req, ...);

*Buf1, Buf2, ... contain non-contiguous MPI Datatypes

Waste of computing resources on CPU and GPU

Page 8

Problem Statement

• Low overlap between CPU and GPU for applications
  – Packing/unpacking operations are serialized

• CPU/GPU resources are not fully utilized
  – GPU threads remain idle most of the time
  – Low utilization, low efficiency

[Radar chart comparing Overlap, Productivity, Performance, and Resource Utilization for User-Naive, User-Advanced, and Proposed designs; farther from the center is better]

Can we design solutions that leverage new GPU technology to address these issues?

Page 9

Goals of this Work

• Propose new designs leveraging new NVIDIA GPU technologies
  Ø Hyper-Q technology (multi-streaming)
  Ø CUDA Event and Callback

• Achieving
  Ø High performance and resource utilization for applications
  Ø High productivity for developers

Page 10

Outline

• Introduction
• Proposed Designs
  – Event-based
  – Callback-based
• Performance Evaluation
• Conclusion

Page 11

Overview

[Timeline diagram contrasting the existing and proposed designs for Isend(1), Isend(2), and Isend(3). Existing design: for each Isend, the CPU initiates the packing kernel on a stream, waits for the kernel (WFK), then starts the send, so kernels and sends are serialized. Proposed design: the CPU initiates all packing kernels on separate streams up front and defers WFK to the Wait phase, so kernels overlap with each other and with CPU progress. Expected benefit: the proposed design finishes earlier than the existing design.]

Page 12

Event-based Design

• CUDA Event Management
  – Provides a mechanism to signal when tasks have completed in a CUDA stream

• Basic design idea
  1. CPU launches a CUDA packing/unpacking kernel
  2. CPU creates a CUDA event and then returns immediately
     • The GPU sets the event status to 'completed' when the kernel completes
  3. In MPI_Wait/MPI_Waitall, the CPU queries the events when the packed/unpacked data is required for communication

Page 13

Event-based Design

[Sequence diagram across HCA, CPU, and GPU: for each of three MPI_Isend() calls the CPU launches pack_kernel1/2/3<<< >>> and calls cudaEventRecord(), returning immediately; in MPI_Waitall() the CPU queries/progresses the events and posts each send to the HCA as its kernel completes]

Page 14

Event-based Design

• Major benefits
  – Overlap between CPU communication and GPU packing kernels
  – GPU resources are highly utilized

• Limitation
  – The CPU must keep polling the status of the event
    • Lower CPU utilization

MPI_Isend(Buf1, ..., &request1);
MPI_Isend(Buf2, ..., &request2);
MPI_Wait(&request1, &status1);
MPI_Wait(&request2, &status2);

Page 15

Callback-based Design

• CUDA Stream Callback
  – Launches work on the CPU automatically when preceding work on the CUDA stream has completed
  – Restrictions:
    • Callbacks are processed by a driver thread, in which no CUDA APIs can be called
    • Overhead when initializing the callback function

• Basic design idea
  1. CPU launches a CUDA packing/unpacking kernel
  2. CPU adds a callback function and then returns immediately
  3. The callback function wakes up a helper thread to process the communication
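The wake-up in step 3 can be sketched with a condition variable: because the driver-thread callback may not call CUDA (or, in practice, blocking MPI) routines, it only signals, and a helper thread does the actual communication. This is a CUDA-free sketch of the control flow with illustrative names, not the MVAPICH2 code:

```c
#include <assert.h>
#include <pthread.h>

typedef struct {
    pthread_mutex_t mtx;
    pthread_cond_t  cv;
    int             ready;   /* set by the callback when packing is done */
    int             sent;    /* set by the helper's "communication" */
} callback_ctx_t;

/* Stands in for the function registered via cudaStreamAddCallback():
 * it runs after the packing kernel and only signals the helper. */
void stream_callback(callback_ctx_t *c)
{
    pthread_mutex_lock(&c->mtx);
    c->ready = 1;                    /* packed data is now available */
    pthread_cond_signal(&c->cv);     /* wake the helper; do no real work here */
    pthread_mutex_unlock(&c->mtx);
}

/* Helper thread: blocks (no CPU polling, unlike the event-based design)
 * until the callback fires, then performs the send on the CPU. */
void *helper_thread(void *arg)
{
    callback_ctx_t *c = arg;
    pthread_mutex_lock(&c->mtx);
    while (!c->ready)
        pthread_cond_wait(&c->cv, &c->mtx);
    c->sent = 1;                     /* stands in for the actual MPI send */
    pthread_mutex_unlock(&c->mtx);
    return NULL;
}
```

Blocking in `pthread_cond_wait` rather than polling is what frees the CPU for other application computations, the extra overlap claimed for this design.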

Page 16

Callback-based Design

[Sequence diagram across HCA, CPU (main, helper, and callback threads), and GPU: for each of three MPI_Isend() calls the main thread launches pack_kernel1/2/3<<< >>> and calls addCallback(), then continues with CPU computations; each callback fires when its kernel completes and wakes the helper thread, which posts the send to the HCA; MPI_Waitall() collects the completions]

Page 17

Callback-based Design

• Major benefits
  – Overlap between CPU communication and GPU packing kernels
  – Overlap between CPU communication and other computations
  – Higher CPU and GPU utilization

MPI_Isend(Buf1, ..., &requests[0]);
MPI_Isend(Buf2, ..., &requests[1]);
MPI_Isend(Buf3, ..., &requests[2]);
// Application work on the CPU
MPI_Waitall(3, requests, status);

Page 18

Outline

• Introduction
• Proposed Designs
• Performance Evaluation
  – Benchmark
  – Halo Exchange-based Application Kernel
• Conclusion

Page 19

Overview of the MVAPICH2 Project

• High-performance open-source MPI library for InfiniBand, 10-40 GigE/iWARP, and RDMA over Converged Enhanced Ethernet (RoCE)
  – MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.0), available since 2002
  – MVAPICH2-X (MPI+PGAS), available since 2011
  – Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), available since 2014
  – Support for virtualization (MVAPICH2-Virt), available since 2015
  – Support for energy-awareness (MVAPICH2-EA), available since 2015
  – Used by more than 2,575 organizations in 80 countries
  – More than 376,000 (0.37 million) downloads from the OSU site directly
  – Empowering many TOP500 clusters (Nov '15 ranking)
    • 10th-ranked 519,640-core cluster (Stampede) at TACC
    • 13th-ranked 185,344-core cluster (Pleiades) at NASA
    • 25th-ranked 76,032-core cluster (Tsubame 2.5) at Tokyo Institute of Technology, and many others
  – Available with software stacks of many vendors and Linux distros (RedHat and SuSE)
  – http://mvapich.cse.ohio-state.edu

• Empowering Top500 systems for over a decade
  – System-X from Virginia Tech (3rd in Nov 2003, 2,200 processors, 12.25 TFlops) ->
  – Stampede at TACC (10th in Nov '15, 519,640 cores, 5.168 PFlops)

Page 20

Experimental Environments

1. Wilkes cluster @ University of Cambridge
   – 2 NVIDIA K20c GPUs per node
   – Up to 32 GPU nodes
2. CSCS cluster @ Swiss National Supercomputing Centre
   – Cray CS-Storm system
   – 8 NVIDIA K80 GPUs per node
   – Up to 96 GPUs over 12 nodes

Page 21

Benchmark-level Evaluation: Performance

• Modified 'CUDA-Aware' DDTBench (http://htor.inf.ethz.ch/research/datatypes/ddtbench/)

[Chart: normalized execution time vs. input size for the Default, Event-based, and Callback-based designs on the NAS_MG_y, SPECFEM3D_OC, WRF_sa, and SPECFEM3D_CM tests; speedups of up to 1.5X to 3.4X across the tests; lower is better]

Page 22

Benchmark-level Evaluation: Overlap

• Modified 'CUDA-Aware' DDTBench for the NAS_MG_y test
  – Injected dummy computations

MPI_Isend(Buf1, ..., &requests[0]);
MPI_Isend(Buf2, ..., &requests[1]);
MPI_Isend(Buf3, ..., &requests[2]);
Dummy_comp(); // Application work on the CPU
MPI_Waitall(3, requests, status);

[Chart: overlap (%) vs. input size for the Default, Event-based, and Callback-based designs; higher is better]

Page 23

Application-level Evaluation: Halo Data Exchange

• MeteoSwiss weather-forecasting COSMO* application kernel @ CSCS cluster
• Multi-dimensional data
  – Contiguous in one dimension
  – Non-contiguous in the other dimensions
• Halo data exchange
  – Duplicate the boundary
  – Exchange the boundary

*http://www.cosmo-model.org/

Page 24

Application-level (Halo Exchange) Evaluation

[Charts: normalized execution time vs. number of GPUs for the Default, Callback-based, and Event-based designs; CSCS GPU cluster (16 to 96 GPUs) and Wilkes GPU cluster (4 to 32 GPUs); improvements of up to 2X and 1.6X; lower is better]

MPI_Isend(Buf1, ..., &request1);
MPI_Isend(Buf2, ..., &request2);
// Computations on GPU
MPI_Wait(&request1, &status1);
MPI_Wait(&request2, &status2);

Page 25

Conclusion

• The proposed designs improve overall performance and the utilization of both CPU and GPU
  – Event-based design: overlaps CPU communication with GPU computation
  – Callback-based design: further overlaps with CPU computation

• Future work
  – Non-blocking collective operations
  – Contiguous data movement
  – Next-generation GPU architectures
  – Will be available in the MVAPICH2-GDR library

Page 26

[email protected]

The MVAPICH2 Project: http://mvapich.cse.ohio-state.edu/
The High-Performance Big Data Project: http://hibd.cse.ohio-state.edu/
Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/

Page 27

Motivation: NVIDIA GPU Feature

• NVIDIA CUDA Hyper-Q (multi-stream) technology
  – Multiple CPU threads/processes can launch kernels on a single GPU simultaneously
  – Increases GPU utilization and reduces CPU idle time

http://www.hpc.co.jp/images/hyper-q.png

Page 28

Motivation: Non-Contiguous Data Movement in MPI

Manual packing/unpacking (here packing one column of an n x n matrix):

sbuf = malloc(...); rbuf = malloc(...);
/* Packing */
for (i = 0; i < n; i++)
    sbuf[i] = matrix[i][0];
MPI_Send(sbuf, n, MPI_DOUBLE, ...);
MPI_Recv(rbuf, n, MPI_DOUBLE, ...);
/* Unpacking */
for (i = 0; i < n; i++)
    matrix[i][0] = rbuf[i];
free(sbuf); free(rbuf);

Using MPI Datatypes:

MPI_Datatype nt;
MPI_Type_vector(n, 1, n, MPI_DOUBLE, &nt);
MPI_Type_commit(&nt);
MPI_Send(matrix, 1, nt, ...);
MPI_Recv(matrix, 1, nt, ...);

Using MPI Datatypes:
• No explicit copies in applications -> better performance
• Less code -> higher productivity
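For reference, the packing that MPI_Type_vector(count, blocklength, stride, ...) implies, and that the MPI library performs internally, can be sketched in plain C (`pack_vector` is a hypothetical helper, not an MPI routine):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Pack `count` blocks of `blocklen` doubles, `stride` doubles apart,
 * into a contiguous buffer: the layout MPI_Type_vector describes. */
void pack_vector(const double *src, int count, int blocklen, int stride,
                 double *dst)
{
    for (int b = 0; b < count; b++)
        memcpy(dst + (size_t)b * blocklen,
               src + (size_t)b * stride,
               (size_t)blocklen * sizeof(double));
}
```

With count = n, blocklen = 1, and stride = n, this packs one matrix column, matching the manual loop above; the datatype version simply moves this copy out of the application and into the library (or, in the proposed designs, onto the GPU).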