21
OpenSHMEM NonBlocking Data Movement Opera8ons with MVAPICH2-X: Early Experiences Khaled Hamidouche*, Jie Zhang*, Karen Tomko+, D.K Panda* *: The Ohio State University (OSU), +: Ohio Supercompu;ng Center (OSC) E-mail: [email protected] PGAS Applica8ons Workshop (PAW’16) by

OpenSHMEM NonBlocking Data Movement Opera8ons with …mvapich.cse.ohio-state.edu/static/media/talks/slide/PAW... · 2017. 7. 18. · OpenSHMEM NonBlocking Data Movement Opera8ons

  • Upload
    others

  • View
    2

  • Download
    0

Embed Size (px)

Citation preview

Page 1: OpenSHMEM NonBlocking Data Movement Opera8ons with …mvapich.cse.ohio-state.edu/static/media/talks/slide/PAW... · 2017. 7. 18. · OpenSHMEM NonBlocking Data Movement Opera8ons

OpenSHMEMNonBlockingDataMovementOpera8onswithMVAPICH2-X:EarlyExperiences

KhaledHamidouche*,JieZhang*,KarenTomko+,D.KPanda**:TheOhioStateUniversity(OSU),+:OhioSupercompu;ngCenter(OSC)

E-mail:[email protected]

PGASApplica8onsWorkshop(PAW’16)

by

Page 2: OpenSHMEM NonBlocking Data Movement Opera8ons with …mvapich.cse.ohio-state.edu/static/media/talks/slide/PAW... · 2017. 7. 18. · OpenSHMEM NonBlocking Data Movement Opera8ons

PAW’16 2NetworkBasedCompu8ngLaboratory

DriversofModernHPCClusterArchitectures

Tianhe–2 Titan Stampede Tianhe–1A

•  Mul;-core/many-coretechnologies

•  RemoteDirectMemoryAccess(RDMA)-enablednetworking(InfiniBandandRoCE)

•  SolidStateDrives(SSDs),Non-Vola;leRandom-AccessMemory(NVRAM),NVMe-SSD

•  Accelerators(NVIDIAGPGPUsandIntelXeonPhi)

Accelerators/Coprocessorshighcomputedensity,high

performance/wa\>1TFlopDPonachip

HighPerformanceInterconnects-InfiniBand

<1useclatency,200GbpsBandwidth>Mul8-coreProcessors SSD,NVMe-SSD,NVRAM

Page 3: OpenSHMEM NonBlocking Data Movement Opera8ons with …mvapich.cse.ohio-state.edu/static/media/talks/slide/PAW... · 2017. 7. 18. · OpenSHMEM NonBlocking Data Movement Opera8ons

PAW’16 3NetworkBasedCompu8ngLaboratory

ParallelProgrammingModelsOverviewP1 P2 P3

SharedMemory

P1 P2 P3

Memory Memory Memory

P1 P2 P3

Memory Memory MemoryLogicalsharedmemory

SharedMemoryModel

DSMDistributedMemoryModel

MPI(MessagePassingInterface)

Par88onedGlobalAddressSpace(PGAS)

OpenSHMEM,GA,UPC,Chapel,X10,CAF,…

•  Programmingmodelsprovideabstractmachinemodels

•  Modelscanbemappedondifferenttypesofsystems

–  e.g.DistributedSharedMemory(DSM),MPIwithinanode,etc.

•  Addi;onally,OpenMPcanbeusedtoparallelizecomputa;onwithinthenode

•  Eachmodelhasstrengthsanddrawbacks-suitedifferentproblemsorapplica;ons

Page 4: OpenSHMEM NonBlocking Data Movement Opera8ons with …mvapich.cse.ohio-state.edu/static/media/talks/slide/PAW... · 2017. 7. 18. · OpenSHMEM NonBlocking Data Movement Opera8ons

PAW’16 4NetworkBasedCompu8ngLaboratory

TheOpenSHMEMMemoryModel•  Symmetricdataobjects

–  GlobalVariables

–  Allocatedusingcollec;veshmem_malloc,shmem_memalign,shmem_reallocrou;ne

•  Globallyaddressable–objectshavesame–  Type

–  Size

–  SamevirtualaddressoroffsetatallPEs

–  Addressofaremoteobjectcanbecalculatedbasedoninfooflocalobject

–  OpenSHMEM1.3introducesNon-BlockingDataMovementOpera;ons

SymmetricObjects

b

b

PE0 PE1

a a

VirtualAddressSpace

(global)

(alloce’d)

Page 5: OpenSHMEM NonBlocking Data Movement Opera8ons with …mvapich.cse.ohio-state.edu/static/media/talks/slide/PAW... · 2017. 7. 18. · OpenSHMEM NonBlocking Data Movement Opera8ons

PAW’16 5NetworkBasedCompu8ngLaboratory

•  Blockingopera;on=>Op;mizeforlatency–  Bufferreusea_erreturningfromthecall

•  Non-Blockingopera;on=>Op;mizeComputa;on/Communica;onoverlap–  Returnassoonasweposttherequest

–  Comple;onisensuredlateronacomple;on/synchroniza;oncall

–  Bufferreusea_ercomple;on

–  APIextensionwith_nbi(ex:shmem_putmem_nbi)

–  shmem_fence/Shmem_barrier.Completeallpreviousopera;ons

Non-BlockingDataMovementOpera8ons

shmem_putmem

Computa;on

shmem_putmem_nbi

Computa;onshmem_fence

BlockingSeman8cs

Non-BlockingSeman8cs

Page 6: OpenSHMEM NonBlocking Data Movement Opera8ons with …mvapich.cse.ohio-state.edu/static/media/talks/slide/PAW... · 2017. 7. 18. · OpenSHMEM NonBlocking Data Movement Opera8ons

PAW’16 6NetworkBasedCompu8ngLaboratory

•  Introduc;on•  Contribu;ons•  Alterna;veDesigns•  PerformanceEvalua;on•  ConclusionsandFutureWork

Outline

Page 7: OpenSHMEM NonBlocking Data Movement Opera8ons with …mvapich.cse.ohio-state.edu/static/media/talks/slide/PAW... · 2017. 7. 18. · OpenSHMEM NonBlocking Data Movement Opera8ons

PAW’16 7NetworkBasedCompu8ngLaboratory

•  Proposehigh-performancedesignsandimplementa;onsofOpenSHMEMNBIopera;onsontopoftheMVAPICH2-Xlibrary.

•  ExtendOMBwithnewNBIbenchmarksforevalua;ngOpenSHMEM1.3NBIopera;onsinastandardizedmanner.

•  Designcommunica;onkernelsincluding3DstencilandalltoallpafernsusingOpenSHMEM.

•  DemonstratethebenefitsandimpactofOpenSHMEMNBIopera;onsonbothlatencyandoverlapmetrics.

Contribu8ons

Page 8: OpenSHMEM NonBlocking Data Movement Opera8ons with …mvapich.cse.ohio-state.edu/static/media/talks/slide/PAW... · 2017. 7. 18. · OpenSHMEM NonBlocking Data Movement Opera8ons

PAW’16 8NetworkBasedCompu8ngLaboratory

•  Introduc;on•  Contribu;ons•  Alterna;veDesigns•  PerformanceEvalua;on•  ConclusionsandFutureWork

Outline

Page 9: OpenSHMEM NonBlocking Data Movement Opera8ons with …mvapich.cse.ohio-state.edu/static/media/talks/slide/PAW... · 2017. 7. 18. · OpenSHMEM NonBlocking Data Movement Opera8ons

PAW’16 9NetworkBasedCompu8ngLaboratory

OverviewoftheMVAPICH2Project•  HighPerformanceopen-sourceMPILibraryforInfiniBand,Omni-Path,Ethernet/iWARP,andRDMAoverConvergedEthernet(RoCE)

–  MVAPICH(MPI-1),MVAPICH2(MPI-2.2andMPI-3.0),Startedin2001,Firstversionavailablein2002

–  MVAPICH2-X(MPI+PGAS),Availablesince2011

–  SupportforGPGPUs(MVAPICH2-GDR)andMIC(MVAPICH2-MIC),Availablesince2014

–  SupportforVirtualiza;on(MVAPICH2-Virt),Availablesince2015

–  SupportforEnergy-Awareness(MVAPICH2-EA),Availablesince2015

–  SupportforInfiniBandNetworkAnalysisandMonitoring(OSUINAM)since2015

–  Usedbymorethan2,675organiza8onsin83countries

–  Morethan399,000(>0.39million)downloadsfromtheOSUsitedirectly–  EmpoweringmanyTOP500clusters(Jun‘16ranking)

•  12thranked519,640-corecluster(Stampede)atTACC

•  15thranked185,344-corecluster(Pleiades)atNASA

•  31stranked76,032-corecluster(Tsubame2.5)atTokyoIns;tuteofTechnologyandmanyothers

–  Availablewithso_warestacksofmanyvendorsandLinuxDistros(RedHatandSuSE)

–  hfp://mvapich.cse.ohio-state.edu

•  EmpoweringTop500systemsforoveradecade

–  System-XfromVirginiaTech(3rdinNov2003,2,200processors,12.25TFlops)->

–  StampedeatTACC(12thinJun’16,462,462cores,5.168Plops)

Page 10: OpenSHMEM NonBlocking Data Movement Opera8ons with …mvapich.cse.ohio-state.edu/static/media/talks/slide/PAW... · 2017. 7. 18. · OpenSHMEM NonBlocking Data Movement Opera8ons

PAW’16 10NetworkBasedCompu8ngLaboratory

MVAPICH2-XforHybridMPI+PGASApplica8ons

•  Unifiedcommunica;onrun;meforMPI,UPC,UPC++,OpenSHMEM,CAF–  Availablesince2012(star;ngwithMVAPICH2-X1.9)–  hfp://mvapich.cse.ohio-state.edu

•  FeatureHighlights–  SupportsMPI(+OpenMP),OpenSHMEM,UPC,CAF,UPC++,

MPI(+OpenMP)+OpenSHMEM,MPI(+OpenMP)+UPC–  MPI-3compliant,OpenSHMEMv1.3standardcompliant,UPC

v1.2standardcompliant(withini;alsupportforUPC1.3),CAF2008standard(OpenUH),UPC++

–  ScalableInter-nodeandintra-nodecommunica;on–point-to-pointandcollec;ves

Page 11: OpenSHMEM NonBlocking Data Movement Opera8ons with …mvapich.cse.ohio-state.edu/static/media/talks/slide/PAW... · 2017. 7. 18. · OpenSHMEM NonBlocking Data Movement Opera8ons

PAW’16 11NetworkBasedCompu8ngLaboratory

OpenSHMEMDesigninMVAPICH2-X

•  OpenSHMEMStackbasedonOpenSHMEMReferenceImplementa;on

•  OpenSHMEMCommunica;onoverMVAPICH2-XRun;me

–  Usesac;vemessages,atomicandone-sidedopera;onsandremoteregistra;oncache

Communica8onAPI SymmetricMemoryManagementAPI

MinimalSetofInternalAPI

OpenSHMEMAPI

InfiniBand,RoCE,iWARP

DataMovement Collec;vesAtomicsMemory

Management

Ac;veMessages

One-sidedOpera;ons

MVAPICH2-XRun8me

RemoteAtomicOps

EnhancedRegistra;onCache

J.Jose,K.Kandalla,M.LuoandD.K.Panda,Suppor8ngHybridMPIandOpenSHMEMoverInfiniBand:DesignandPerformanceEvalua8on,Int'lConferenceonParallelProcessing(ICPP'12),September2012

Page 12: OpenSHMEM NonBlocking Data Movement Opera8ons with …mvapich.cse.ohio-state.edu/static/media/talks/slide/PAW... · 2017. 7. 18. · OpenSHMEM NonBlocking Data Movement Opera8ons

PAW’16 12NetworkBasedCompu8ngLaboratory

•  SharedMemorybased

–  SymmetricMemory(heap)insharedmemory

–  Directcopyfromsourcetodes;na;on

–  Latencyop;mized(nooverlap)

•  CMAbased

–  Sharedmemoryisnotavailable

–  Directcopy(Zero-copy)

–  Latencyop;mized(nooverlap)

•  IBLoopbackbased

–  Offloadthecopyopera;onstoanexternalengine(IB)

–  Exchangethel_key/r-keyduringini;aliza;on

–  Overlapop;mized(returnassoonasweposttheIBopera;on)

•  On-loadenginebased(Workinprogress)–  Kernelbasedhelperthreads

–  Offloadthecopyopera;onstohelperthreads(SimilartoCMA)

–  Op;mizedforbothLatencyandOverlap

Intra-nodeAlterna8veDesigns

Page 13: OpenSHMEM NonBlocking Data Movement Opera8ons with …mvapich.cse.ohio-state.edu/static/media/talks/slide/PAW... · 2017. 7. 18. · OpenSHMEM NonBlocking Data Movement Opera8ons

PAW’16 13NetworkBasedCompu8ngLaboratory

•  List-baseddesign–  Onthecall:Createanden-queueaninternalrequest

•  AssociateanIBComple;onEventwitharequest

–  Intheprogressengine:De-queueanddeleteacompletedrequest

–  Duringcomple;oncall(Fence):Pollstheprogressengineun;lthelistisempty

–  OverheadofCreate/DeleteoftherequestinCri;calpath

•  Counter-baseddesign–  Globalintegercounter

–  Onthecall:Incrementthecounter

–  Intheprogressengine:decreasethecounter

–  Duringcomple;oncall:Polltheprogressengineun;lcounter==0

–  Minimaloverheadinthecri;calpath

Inter-nodesAlterna8veDesigns

Page 14: OpenSHMEM NonBlocking Data Movement Opera8ons with …mvapich.cse.ohio-state.edu/static/media/talks/slide/PAW... · 2017. 7. 18. · OpenSHMEM NonBlocking Data Movement Opera8ons

PAW’16 14NetworkBasedCompu8ngLaboratory

•  Introduc;on•  Contribu;ons•  Alterna;veDesigns•  PerformanceEvalua;on•  ConclusionsandFutureWork

Outline

Page 15: OpenSHMEM NonBlocking Data Movement Opera8ons with …mvapich.cse.ohio-state.edu/static/media/talks/slide/PAW... · 2017. 7. 18. · OpenSHMEM NonBlocking Data Movement Opera8ons

PAW’16 15NetworkBasedCompu8ngLaboratory

•  StampedeSystem@TACC–  16-coresSandyBridgenodes

–  MellanoxFDRinterconnec;on

•  MVAPICH2-X2.2RC1–  UnifiedCommunica;onRun;mesupport(UCR)

–  OpenSHMEM1.3gitbranchofthestandardimplementa;on

•  ExtensionofOpenSHMEMOMBwith–  NBIpt-ptlatencytests(forputandget)

–  Messageratebenchmarks

–  OverlapBenchmarks

•  RedesignedAll-to-Alland3D-StencilbenchmarkswithNBIinterface

ExperimentalSetup

Page 16: OpenSHMEM NonBlocking Data Movement Opera8ons with …mvapich.cse.ohio-state.edu/static/media/talks/slide/PAW... · 2017. 7. 18. · OpenSHMEM NonBlocking Data Movement Opera8ons

PAW’16 16NetworkBasedCompu8ngLaboratory

Intra-nodeEvalua8on

0

1

2

3

4

1 2 4 8 16 32 64 128256512 1K 2K 4K

Latency(us)

MessageSize(Bytes)

Blocking NBI_SHM NBI_LB

050100150200250

8K 16K 32K 64K 128K 256K 512K 1M

Latency(us)

MessageSize(Bytes)

Blocking NBI_SHM NBI_LB

05000000

100000001500000020000000

MessageRate

MessageSize(Bytes)

Blocking NBI_SHM NBI_LB•  LB-baseddesignhasoverheadinlatency

•  SHM-baseddesignachievesverygoodmessagerate(3X)improvementscomparedtoLB-design

•  Overheadinlatencyforsmallmessage–  So_wareoverheadduetosynchroniza;on

opera;on

Page 17: OpenSHMEM NonBlocking Data Movement Opera8ons with …mvapich.cse.ohio-state.edu/static/media/talks/slide/PAW... · 2017. 7. 18. · OpenSHMEM NonBlocking Data Movement Opera8ons

PAW’16 17NetworkBasedCompu8ngLaboratory

Inter-nodeEvalua8on

0100200300400

Latency(us)

MessageSize(Bytes)

Blocking NBI

0

50

100

150

32K 64K 128K 256K 512K 1M

Overla

p(%

)

MessageSize(Byte)

Blocking NBI

01000000200000030000004000000

2 8 32

128

512 2K

8K

32K

128K

512K

2M

MessageRate

MessageSize(Bytes)

Blocking NBI Blockin_Opt

•  NBIdeliverssamelatencyperformanceasBlocking

•  NBIdesignachievesverygoodmessagerate(5X)comparedtoblocking

•  Maximaloverlappoten;al–  Hidethecommunica;onoverheadwithRDMA

Page 18: OpenSHMEM NonBlocking Data Movement Opera8ons with …mvapich.cse.ohio-state.edu/static/media/talks/slide/PAW... · 2017. 7. 18. · OpenSHMEM NonBlocking Data Movement Opera8ons

PAW’16 18NetworkBasedCompu8ngLaboratory

3D-StencilCommunica8onKernelEvalua8on

0

0.2

0.4

0.6

0.8

1

1.2

8 27 64 125

Execu8

onTim

e

#ofPEs

Blocking NBI

0

0.2

0.4

0.6

0.8

1

8 27 64 125

Execu8

onTim

e

#ofPEs

Blocking NBI

•  50%,30%and15%performanceimprovementon27,64and125cores

•  Overlapbenefits

–  Smallmessage:bothcomputa;on/Communica;onandCommunica;on/Communica;onoverlap

SmallInputSize(512Byte) LargeInputSize(2KByte)

Page 19: OpenSHMEM NonBlocking Data Movement Opera8ons with …mvapich.cse.ohio-state.edu/static/media/talks/slide/PAW... · 2017. 7. 18. · OpenSHMEM NonBlocking Data Movement Opera8ons

PAW’16 19NetworkBasedCompu8ngLaboratory

•  Introduc;on•  Contribu;ons•  Alterna;veDesigns•  PerformanceEvalua;on•  ConclusionsandFutureWork

Outline

Page 20: OpenSHMEM NonBlocking Data Movement Opera8ons with …mvapich.cse.ohio-state.edu/static/media/talks/slide/PAW... · 2017. 7. 18. · OpenSHMEM NonBlocking Data Movement Opera8ons

PAW’16 20NetworkBasedCompu8ngLaboratory

•  Highlightedthealterna;veapproachesfordesigningNBIopera;ons–  Bothintraandinternodeopera;ononRDMAnetworks

•  DemonstratethebenefitsofOpenSHMEMNBIopera;ons–  MessageRate,Overlap

•  Evaluatetheimpactofsuchseman;cs/designsatapplica;onlevel

•  ThesupportwillbeavailablewiththenextreleaseofMVAPICH2-X

•  ThenewBenchmarkswillbeavailablewiththenextreleaseofOMB

Conclusions

Page 21: OpenSHMEM NonBlocking Data Movement Opera8ons with …mvapich.cse.ohio-state.edu/static/media/talks/slide/PAW... · 2017. 7. 18. · OpenSHMEM NonBlocking Data Movement Opera8ons

PAW’16 [email protected]

ThankYou!

Network-BasedCompu;ngLaboratoryhfp://nowlab.cse.ohio-state.edu/

TheMVAPICHProjecthfp://mvapich.cse.ohio-state.edu/