
Programming Models for Exascale Systems


Page 1: Programming Models for Exascale Systems

Programming Models for Exascale Systems

Dhabaleswar K. (DK) Panda
The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~panda

Keynote Talk at HPCAC-Stanford (Feb 2016)

Page 2: Programming Models for Exascale Systems

High-End Computing (HEC): ExaFlop & ExaByte

ExaFlop & HPC:
- 100-200 PFlops in 2016-2018
- 1 EFlops in 2020-2024?

ExaByte & Big Data:
- 10K-20K EBytes in 2016-2018
- 40K EBytes in 2020?

[Figure 1: growth of the digital universe. Source: IDC's Digital Universe Study, sponsored by EMC, December 2012]

Within these broad outlines of the digital universe are some singularities worth noting.

First, while the portion of the digital universe holding potential analytic value is growing, only a tiny fraction of territory has been explored. IDC estimates that by 2020, as much as 33% of the digital universe will contain information that might be valuable if analyzed, compared with 25% today. This untapped value could be found in patterns in social media usage, correlations in scientific data from discrete studies, medical information intersected with sociological data, faces in security footage, and so on. However, even with a generous estimate, the amount of information in the digital universe that is "tagged" accounts for only about 3% of the digital universe in 2012, and that which is analyzed is half a percent of the digital universe. Herein is the promise of "Big Data" technology — the extraction of value from the large untapped pools of data in the digital universe.

Page 3: Programming Models for Exascale Systems

Trends for Commodity Computing Clusters in the Top500 List (http://www.top500.org)

[Figure: timeline of the number of clusters and the percentage of clusters in the Top500; the commodity-cluster share reaches 85%.]

Page 4: Programming Models for Exascale Systems

Drivers of Modern HPC Cluster Architectures

- Multi-core/many-core technologies
- Remote Direct Memory Access (RDMA)-enabled networking (InfiniBand and RoCE)
- Solid State Drives (SSDs), Non-Volatile Random-Access Memory (NVRAM), NVMe-SSD
- Accelerators (NVIDIA GPGPUs and Intel Xeon Phi)

[Figure: example systems Tianhe-2, Titan, Stampede, and Tianhe-1A, built from multi-core processors; high-performance interconnects (InfiniBand: <1 usec latency, 100 Gbps bandwidth); accelerators/coprocessors (high compute density, high performance/watt, >1 TFlop DP on a chip); and SSD, NVMe-SSD, NVRAM.]

Page 5: Programming Models for Exascale Systems

Large-scale InfiniBand Installations

- 235 IB clusters (47%) in the Nov 2015 Top500 list (http://www.top500.org)
- Installations in the Top 50 (21 systems):

  462,462 cores (Stampede) at TACC (10th)                    76,032 cores (Tsubame 2.5) at Japan/GSIC (25th)
  185,344 cores (Pleiades) at NASA/Ames (13th)               194,616 cores (Cascade) at PNNL (27th)
  72,800 cores Cray CS-Storm in US (15th)                    76,032 cores (Makman-2) at Saudi Aramco (32nd)
  72,800 cores Cray CS-Storm in US (16th)                    110,400 cores (Pangea) in France (33rd)
  265,440 cores SGI ICE at Tulip Trading Australia (17th)    37,120 cores (Lomonosov-2) at Russia/MSU (35th)
  124,200 cores (Topaz) SGI ICE at ERDC DSRC in US (18th)    57,600 cores (SwiftLucy) in US (37th)
  72,000 cores (HPC2) in Italy (19th)                        55,728 cores (Prometheus) at Poland/Cyfronet (38th)
  152,692 cores (Thunder) at AFRL/USA (21st)                 50,544 cores (Occigen) at France/GENCI-CINES (43rd)
  147,456 cores (SuperMUC) in Germany (22nd)                 76,896 cores (Salomon) SGI ICE in Czech Republic (47th)
  86,016 cores (SuperMUC Phase 2) in Germany (24th)          and many more!

Page 6: Programming Models for Exascale Systems

Towards Exascale System (Today and Target)

Systems                     | 2016 (Tianhe-2)                       | 2020-2024                      | Difference Today & Exascale
System peak                 | 55 PFlop/s                            | 1 EFlop/s                      | ~20x
Power                       | 18 MW (3 Gflops/W)                    | ~20 MW (50 Gflops/W)           | O(1) ~15x
System memory               | 1.4 PB (1.024 PB CPU + 0.384 PB CoP)  | 32-64 PB                       | ~50x
Node performance            | 3.43 TF/s (0.4 CPU + 3 CoP)           | 1.2 or 15 TF                   | O(1)
Node concurrency            | 24-core CPU + 171 cores CoP           | O(1k) or O(10k)                | ~5x - ~50x
Total node interconnect BW  | 6.36 GB/s                             | 200-400 GB/s                   | ~40x - ~60x
System size (nodes)         | 16,000                                | O(100,000) or O(1M)            | ~6x - ~60x
Total concurrency           | 3.12M (12.48M threads, 4/core)        | O(billion) for latency hiding  | ~100x
MTTI                        | Few/day                               | Many/day                       | O(?)

Courtesy: Prof. Jack Dongarra

Page 7: Programming Models for Exascale Systems

Two Major Categories of Applications

- Scientific Computing
  - Message Passing Interface (MPI), including MPI+OpenMP, is the dominant programming model
  - Many discussions towards Partitioned Global Address Space (PGAS): UPC, OpenSHMEM, CAF, etc.
  - Hybrid programming: MPI + PGAS (OpenSHMEM, UPC)
- Big Data/Enterprise/Commercial Computing
  - Focuses on large data and data analysis
  - Hadoop (HDFS, HBase, MapReduce)
  - Spark is emerging for in-memory computing
  - Memcached is also used for Web 2.0

Page 8: Programming Models for Exascale Systems

Parallel Programming Models Overview

[Figure: three abstract machine models — Shared Memory Model (processes P1-P3 over one shared memory; SHMEM, DSM), Distributed Memory Model (P1-P3 each with private memory; MPI - Message Passing Interface), and Partitioned Global Address Space (PGAS) (private memories plus a logical shared memory; Global Arrays, UPC, Chapel, X10, CAF, ...).]

- Programming models provide abstract machine models
- Models can be mapped onto different types of systems (e.g., Distributed Shared Memory (DSM), MPI within a node, etc.)
- PGAS models and hybrid MPI+PGAS models are gradually gaining importance (a minimal message-passing sketch follows)
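To make the distributed-memory model concrete, here is a minimal sketch of two-sided message passing with standard MPI calls; it is an illustrative example added here, not code from the talk.

    #include <mpi.h>
    #include <stdio.h>

    /* Minimal two-sided message passing: rank 0 sends an integer to rank 1. */
    int main(int argc, char **argv)
    {
        int rank, value = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 42;
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* explicit send */
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);                          /* matching receive */
            printf("Rank 1 received %d\n", value);
        }

        MPI_Finalize();
        return 0;
    }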

Page 9: Programming Models for Exascale Systems

Partitioned Global Address Space (PGAS) Models

- Key features
  - Simple shared-memory abstractions
  - Lightweight one-sided communication
  - Easier to express irregular communication
- Different approaches to PGAS
  - Languages
    - Unified Parallel C (UPC)
    - Co-Array Fortran (CAF)
    - X10
    - Chapel
  - Libraries
    - OpenSHMEM
    - Global Arrays

(A small OpenSHMEM sketch follows.)
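As a library-based PGAS illustration, the following is a minimal OpenSHMEM sketch (not from the slides) using a one-sided put to a symmetric variable; it assumes an OpenSHMEM 1.2-style API (shmem_init/shmem_finalize).

    #include <shmem.h>
    #include <stdio.h>

    static int counter = 0;   /* symmetric variable: exists on every PE */

    int main(void)
    {
        shmem_init();
        int me   = shmem_my_pe();
        int npes = shmem_n_pes();

        int value = me;
        /* One-sided put: write my PE number into 'counter' on my right
           neighbor, without the target PE posting a matching receive. */
        shmem_int_put(&counter, &value, 1, (me + 1) % npes);

        shmem_barrier_all();  /* ensure all puts are complete and visible */
        printf("PE %d received %d\n", me, counter);

        shmem_finalize();
        return 0;
    }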

Page 10: Programming Models for Exascale Systems

Hybrid (MPI+PGAS) Programming

- Application sub-kernels can be re-written in MPI/PGAS based on communication characteristics
- Benefits:
  - Best of the distributed-memory computing model
  - Best of the shared-memory computing model
  (a hybrid MPI+OpenSHMEM sketch follows the figure below)
- Exascale Roadmap*: "Hybrid Programming is a practical way to program exascale systems"

* The International Exascale Software Roadmap, Dongarra, J., Beckman, P. et al., Volume 25, Number 1, 2011, International Journal of High Performance Computer Applications, ISSN 1094-3420

[Figure: an HPC application composed of kernels 1..N, where individual kernels (e.g., Kernel 2, Kernel N) are implemented in PGAS while the remaining kernels stay in MPI.]
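A conceptual sketch of this structure, assuming a unified runtime (such as MVAPICH2-X) in which every process is both an MPI rank and an OpenSHMEM PE; the kernel bodies are illustrative, not application code from the talk.

    #include <mpi.h>
    #include <shmem.h>

    #define HISTO_BINS 1024

    static long histogram[HISTO_BINS];   /* symmetric array: one-sided updates land here */

    /* Bulk-synchronous exchange expressed in MPI, irregular updates in OpenSHMEM. */
    void hybrid_step(long *keys, long *recv_keys, int n, int peer, int npes)
    {
        /* Kernel 1: regular pairwise exchange with two-sided MPI */
        MPI_Sendrecv(keys, n, MPI_LONG, peer, 0,
                     recv_keys, n, MPI_LONG, peer, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        /* Kernel 2: irregular, key-dependent updates with one-sided
           OpenSHMEM atomics -- no matching receive on the target PE */
        for (int i = 0; i < n; i++) {
            int target_pe  = (int)(recv_keys[i] % npes);
            int target_bin = (int)(recv_keys[i] % HISTO_BINS);
            shmem_long_inc(&histogram[target_bin], target_pe);
        }
        shmem_barrier_all();   /* complete all remote increments before reading */
    }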

Page 11: Programming Models for Exascale Systems

Designing Communication Libraries for Multi-Petaflop and Exaflop Systems: Challenges

[Figure: layered co-design view. Application kernels/applications sit on top of programming models (MPI, PGAS (UPC, Global Arrays, OpenSHMEM), CUDA, OpenMP, OpenACC, Cilk, Hadoop (MapReduce), Spark (RDD, DAG), etc.), which rely on a communication library or runtime providing point-to-point communication, collective communication, energy-awareness, synchronization and locks, I/O and file systems, and fault tolerance. These run over networking technologies (InfiniBand, 40/100 GigE, Aries, and Omni-Path), multi/many-core architectures, and accelerators (NVIDIA and MIC). Middleware co-design exposes opportunities and challenges across the layers: performance, scalability, and fault-resilience.]

Page 12: Programming Models for Exascale Systems

Broad Challenges in Designing Communication Libraries for (MPI+X) at Exascale

- Scalability for million to billion processors
  - Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided)
  - Scalable job start-up
- Scalable collective communication
  - Offload
  - Non-blocking
  - Topology-aware
- Balancing intra-node and inter-node communication for next-generation nodes (128-1024 cores)
  - Multiple end-points per node
- Support for efficient multi-threading
- Integrated support for GPGPUs and accelerators
- Fault-tolerance/resiliency
- QoS support for communication and I/O
- Support for hybrid MPI+PGAS programming (MPI+OpenMP, MPI+UPC, MPI+OpenSHMEM, CAF, ...)
- Virtualization
- Energy-awareness

Page 13: Programming Models for Exascale Systems

Additional Challenges for Designing Exascale Software Libraries

- Extreme low memory footprint
  - Memory per core continues to decrease
- D-L-A framework
  - Discover
    - Overall network topology (fat-tree, 3D, ...), network topology for processes of a given job
    - Node architecture, health of network and node
  - Learn
    - Impact on performance and scalability
    - Potential for failure
  - Adapt
    - Internal protocols and algorithms
    - Process mapping
    - Fault-tolerance solutions
  - Low-overhead techniques while delivering performance, scalability, and fault-tolerance

Page 14: Programming Models for Exascale Systems

Overview of the MVAPICH2 Project

- High-performance open-source MPI library for InfiniBand, 10-40 Gig/iWARP, and RDMA over Converged Enhanced Ethernet (RoCE)
  - MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.0), available since 2002
  - MVAPICH2-X (MPI+PGAS), available since 2011
  - Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), available since 2014
  - Support for virtualization (MVAPICH2-Virt), available since 2015
  - Support for energy-awareness (MVAPICH2-EA), available since 2015
  - Used by more than 2,525 organizations in 77 countries
  - More than 351,000 (>0.35 million) downloads from the OSU site directly
  - Empowering many Top500 clusters (Nov '15 ranking)
    - 10th-ranked 519,640-core cluster (Stampede) at TACC
    - 13th-ranked 185,344-core cluster (Pleiades) at NASA
    - 25th-ranked 76,032-core cluster (Tsubame 2.5) at Tokyo Institute of Technology, and many others
  - Available with software stacks of many vendors and Linux distros (RedHat and SuSE)
  - http://mvapich.cse.ohio-state.edu
- Empowering Top500 systems for over a decade
  - System-X from Virginia Tech (3rd in Nov 2003, 2,200 processors, 12.25 TFlops) ->
  - Stampede at TACC (10th in Nov '15, 519,640 cores, 5.168 PFlops)

Page 15: Programming Models for Exascale Systems

MVAPICH2 Architecture

[Figure: MVAPICH2 software architecture. High-performance parallel programming models — Message Passing Interface (MPI), PGAS (UPC, OpenSHMEM, CAF, UPC++*), and Hybrid MPI+X (MPI + PGAS + OpenMP/Cilk) — sit on a high-performance and scalable communication runtime with diverse APIs and mechanisms: point-to-point primitives, collectives algorithms, energy-awareness, remote memory access, I/O and file systems, fault tolerance, virtualization, active messages, job startup, and introspection & analysis. The runtime supports modern networking technologies (InfiniBand, iWARP, RoCE, Omni-Path) and modern multi-/many-core architectures (Intel Xeon, OpenPower*, Xeon Phi (MIC, KNL*), NVIDIA GPGPU), with transport protocols (RC, XRC, UD, DC), modern features (UMR, ODP*, SR-IOV, multi-rail, MCDRAM*, NVLink*, CAPI*), and transport mechanisms (shared memory, CMA, IVSHMEM). * - upcoming]

Page 16: Programming Models for Exascale Systems

Overview of A Few Challenges being Addressed by the MVAPICH2 Project for Exascale

- Scalability for million to billion processors
  - Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided RMA)
  - Support for advanced IB mechanisms (UMR and ODP)
  - Extremely minimal memory footprint
  - Scalable job start-up
- Collective communication
- Integrated support for GPGPUs
- Integrated support for MICs
- Unified runtime for hybrid MPI+PGAS programming (MPI+OpenSHMEM, MPI+UPC, CAF, ...)
- Virtualization
- Energy-awareness
- InfiniBand Network Analysis and Monitoring (INAM)

Page 17: Programming Models for Exascale Systems

One-way Latency: MPI over IB with MVAPICH2

[Figure: OSU one-way latency for small and large messages with TrueScale-QDR, ConnectX-3-FDR, ConnectIB-Dual FDR, and ConnectX-4-EDR; annotated small-message latencies of 1.26, 1.19, 0.95, and 1.15 us. Platforms: 2.8 GHz deca-core (IvyBridge) Intel PCI Gen3 with IB switch (QDR/FDR/Dual-FDR); 2.8 GHz deca-core (Haswell) Intel PCI Gen3 back-to-back (EDR).]

Page 18: Programming Models for Exascale Systems

Bandwidth: MPI over IB with MVAPICH2

[Figure: OSU unidirectional and bidirectional bandwidth with TrueScale-QDR, ConnectX-3-FDR, ConnectIB-Dual FDR, and ConnectX-4-EDR; annotated peaks of 3387, 6356, 12104, and 12465 MBytes/sec (unidirectional) and 6308, 12161, 21425, and 24353 MBytes/sec (bidirectional). Platforms: 2.8 GHz deca-core (IvyBridge) Intel PCI Gen3 with IB switch (QDR/FDR/Dual-FDR); 2.8 GHz deca-core (Haswell) Intel PCI Gen3 back-to-back (EDR).]

Page 19: Programming Models for Exascale Systems

MVAPICH2 Two-Sided Intra-Node Performance
(Shared memory and kernel-based zero-copy support (LiMIC and CMA))

[Figure: latest MVAPICH2 2.2b on Intel Ivy Bridge. Small-message latency: 0.18 us intra-socket and 0.45 us inter-socket. Intra-socket and inter-socket bandwidth compared for CMA, Shmem, and LiMIC channels, peaking at about 14,250 MB/s and 13,749 MB/s.]

Page 20: Programming Models for Exascale Systems

User-mode Memory Registration (UMR)

- Introduced by Mellanox to support direct local and remote non-contiguous memory access
  - Avoids packing at the sender and unpacking at the receiver
- Available with MVAPICH2-X 2.2b

[Figure: small/medium (4K-1M bytes) and large (2M-16M bytes) message latency for non-contiguous datatypes, UMR vs. Default, on Connect-IB (54 Gbps): 2.8 GHz dual ten-core (IvyBridge) Intel PCI Gen3 with Mellanox IB FDR switch.]

M. Li, H. Subramoni, K. Hamidouche, X. Lu and D. K. Panda, High Performance MPI Datatype Support with User-mode Memory Registration: Challenges, Designs and Benefits, CLUSTER 2015

Page 21: Programming Models for Exascale Systems

On-Demand Paging (ODP)

- Introduced by Mellanox to support direct remote memory access without pinning
- Memory regions paged in/out dynamically by the HCA/OS
- Size of registered buffers can be larger than physical memory
- Will be available in the upcoming MVAPICH2-X 2.2 RC1

[Figure: Graph500 pin-down buffer sizes (MB) and BFS kernel execution time (s) for 16, 32, and 64 processes, Pin-down vs. ODP, on Connect-IB (54 Gbps): 2.6 GHz dual octa-core (SandyBridge) Intel PCI Gen3 with Mellanox IB FDR switch.]

Page 22: Programming Models for Exascale Systems

Minimizing Memory Footprint by Direct Connect (DC) Transport

[Figure: processes P0-P7 on four nodes communicating over the IB network with DC.]

- Constant connection cost (one QP for any peer)
- Full feature set (RDMA, atomics, etc.)
- Separate objects for send (DC Initiator) and receive (DC Target)
  - DC Target identified by "DCT Number"
  - Messages routed with (DCT Number, LID)
  - Requires the same "DC Key" to enable communication
- Available since MVAPICH2-X 2.2a

[Figure: NAMD Apoa1 (large dataset) normalized execution time for 160-620 processes and connection memory footprint (KB) for Alltoall at 80-640 processes, comparing RC, DC-Pool, UD, and XRC; RC connection memory grows into the thousands of KB (annotated at 1022 and 4797 KB) while DC-Pool, UD, and XRC remain one to two orders of magnitude smaller.]

H. Subramoni, K. Hamidouche, A. Venkatesh, S. Chakraborty and D. K. Panda, Designing MPI Library with Dynamic Connected Transport (DCT) of InfiniBand: Early Experiences, IEEE International Supercomputing Conference (ISC '14)

Page 23: Programming Models for Exascale Systems

Towards High Performance and Scalable Startup at Exascale

- Near-constant MPI and OpenSHMEM initialization time at any process count
- 10x and 30x improvement in startup time of MPI and OpenSHMEM respectively at 16,384 processes
- Memory consumption reduced for remote endpoint information by O(processes per node)
- 1 GB memory saved per node with 1M processes and 16 processes per node

[Figure: job startup performance and memory required to store endpoint information, comparing state-of-the-art PGAS (P) and MPI (M) startup with the optimized PGAS/MPI design (O); contributing techniques are (a) on-demand connection, (b) PMIX_Ring, (c) PMIX_Ibarrier, (d) PMIX_Iallgather, and (e) shmem-based PMI.]

On-demand Connection Management for OpenSHMEM and OpenSHMEM+MPI. S. Chakraborty, H. Subramoni, J. Perkins, A. A. Awan, and D. K. Panda, 20th International Workshop on High-level Parallel Programming Models and Supportive Environments (HIPS '15)

PMI Extensions for Scalable MPI Startup. S. Chakraborty, H. Subramoni, A. Moody, J. Perkins, M. Arnold, and D. K. Panda, Proceedings of the 21st European MPI Users' Group Meeting (EuroMPI/Asia '14)

Non-blocking PMI Extensions for Fast MPI Startup. S. Chakraborty, H. Subramoni, A. Moody, A. Venkatesh, J. Perkins, and D. K. Panda, 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid '15)

SHMEMPMI - Shared Memory based PMI for Improved Performance and Scalability. S. Chakraborty, H. Subramoni, J. Perkins, and D. K. Panda, 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid '16), Accepted for Publication

Page 24: Programming Models for Exascale Systems

Process Management Interface over Shared Memory (SHMEMPMI)

- SHMEMPMI allows MPI processes to directly read remote endpoint (EP) information from the process manager through shared memory segments
- Only a single copy per node - O(processes per node) reduction in memory usage
- Estimated savings of 1 GB per node with 1 million processes and 16 processes per node
- Up to 1,000 times faster PMI Gets compared to the default design; will be available in MVAPICH2 2.2 RC1 (a conceptual sketch of the shared-memory read path follows)

TACC Stampede - Connect-IB (54 Gbps): 2.6 GHz quad octa-core (SandyBridge) Intel PCI Gen3 with Mellanox IB FDR

SHMEMPMI - Shared Memory Based PMI for Performance and Scalability. S. Chakraborty, H. Subramoni, J. Perkins, and D. K. Panda, 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid '16), Accepted for publication
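The following is a conceptual sketch (not MVAPICH2 source) of the idea behind SHMEMPMI: instead of requesting endpoint data over the PMI wire protocol, each process maps a per-node shared-memory segment published by the process manager and reads the endpoint table directly. The segment name, table layout, and struct are hypothetical.

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* Hypothetical layout of the endpoint table published by the process manager. */
    typedef struct {
        int num_eps;                          /* number of endpoint entries that follow */
        struct { int lid; int qpn; } ep[];    /* per-process endpoint information */
    } ep_table_t;

    /* Map the per-node segment once; later lookups are plain memory reads. */
    static ep_table_t *attach_ep_table(const char *segment_name)
    {
        int fd = shm_open(segment_name, O_RDONLY, 0);
        if (fd < 0) return NULL;

        off_t len = lseek(fd, 0, SEEK_END);
        ep_table_t *table = mmap(NULL, (size_t)len, PROT_READ, MAP_SHARED, fd, 0);
        close(fd);                            /* the mapping stays valid after close */
        return (table == MAP_FAILED) ? NULL : table;
    }

    int main(void)
    {
        /* "/pmi_ep_table" is a hypothetical segment name for illustration only */
        ep_table_t *table = attach_ep_table("/pmi_ep_table");
        if (table)   /* a "PMI Get" becomes a local shared-memory read */
            printf("endpoint 0: lid=%d qpn=%d\n", table->ep[0].lid, table->ep[0].qpn);
        return 0;
    }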

[Figure: time taken by one PMI_Get (milliseconds) for 1-32 processes per node, Default vs. SHMEMPMI (about 1000x faster, estimated), and memory usage per node (MB) for remote EP information for 16 to 1M processes per job, comparing Fence-Default, Allgather-Default, Fence-Shmem, and Allgather-Shmem (16x actual reduction shown).]

Page 25: Programming Models for Exascale Systems

Overview of A Few Challenges being Addressed by the MVAPICH2 Project for Exascale

- Scalability for million to billion processors
- Collective communication
  - Offload and non-blocking
  - Topology-aware
- Integrated support for GPGPUs
- Integrated support for MICs
- Unified runtime for hybrid MPI+PGAS programming (MPI+OpenSHMEM, MPI+UPC, CAF, ...)
- Virtualization
- Energy-awareness
- InfiniBand Network Analysis and Monitoring (INAM)

Page 26: Programming Models for Exascale Systems

Co-Design with MPI-3 Non-Blocking Collectives and Collective Offload (CORE-Direct) Hardware (Available since MVAPICH2-X 2.2a)

- Modified P3DFFT with Offload-Alltoall does up to 17% better than the default version (128 processes)
- Modified HPL with Offload-Bcast does up to 4.5% better than the default version (512 processes)
- Modified Pre-Conjugate Gradient solver with Offload-Allreduce does up to 21.8% better than the default version

[Figure: application run-time (s) vs. data size (512-800); PCG run-time (s) for 64-512 processes, PCG-Default vs. Modified-PCG-Offload; and normalized HPL performance for HPL-Offload, HPL-1ring, and HPL-Host as the HPL problem size (N) grows from 10% to 70% of total memory. Annotated gains: 17%, 4.5%, 21.8%.]

K. Kandalla, et al., High-Performance and Scalable Non-Blocking All-to-All with Collective Offload on InfiniBand Clusters: A Study with Parallel 3D FFT, ISC 2011

K. Kandalla, et al., Designing Non-blocking Broadcast with Collective Offload on InfiniBand Clusters: A Case Study with HPL, HotI 2011

K. Kandalla, et al., Designing Non-blocking Allreduce with Collective Offload on InfiniBand Clusters: A Case Study with Conjugate Gradient Solvers, IPDPS '12

Can Network-Offload based Non-Blocking Neighborhood MPI Collectives Improve Communication Overheads of Irregular Graph Algorithms? K. Kandalla, A. Buluc, H. Subramoni, K. Tomko, J. Vienne, L. Oliker, and D. K. Panda, IWPAPS '12
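For context, the pattern these modified applications exploit is an MPI-3 non-blocking collective: start the collective, do independent work while offload-capable hardware progresses it, and complete it later. A minimal sketch (illustrative, not the modified HPL/P3DFFT/PCG code):

    #include <mpi.h>

    /* Overlap an allreduce with independent computation using MPI-3
       non-blocking collectives (progressed by collective-offload hardware
       when the MPI library and HCA support it). */
    void overlapped_allreduce(double *local_sum, double *global_sum,
                              double *work, int n)
    {
        MPI_Request req;

        MPI_Iallreduce(local_sum, global_sum, 1, MPI_DOUBLE, MPI_SUM,
                       MPI_COMM_WORLD, &req);     /* start the collective */

        for (int i = 0; i < n; i++)               /* independent computation */
            work[i] = work[i] * 0.5 + 1.0;        /* overlaps with the reduction */

        MPI_Wait(&req, MPI_STATUS_IGNORE);        /* complete the collective */
    }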

Page 27: Programming Models for Exascale Systems

Network-Topology-Aware Placement of Processes

- Can we design a highly scalable network topology detection service for IB?
- How do we design the MPI communication library in a network-topology-aware manner to efficiently leverage the topology information generated by our service?
- What are the potential benefits of using a network-topology-aware MPI library on the performance of parallel scientific applications?

[Figure: overall performance and split-up of physical communication for MILC on Ranger — performance for varying system sizes, plus the default and topology-aware mappings for a 2048-core run (15% improvement).]

- Reduce network topology discovery time from O(N^2 hosts) to O(N hosts)
- 15% improvement in MILC execution time @ 2048 cores
- 15% improvement in Hypre execution time @ 1024 cores

H. Subramoni, S. Potluri, K. Kandalla, B. Barth, J. Vienne, J. Keasler, K. Tomko, K. Schulz, A. Moody, and D. K. Panda, Design of a Scalable InfiniBand Topology Service to Enable Network-Topology-Aware Placement of Processes, SC '12. Best Paper and Best Student Paper Finalist

Page 28: Programming Models for Exascale Systems

Overview of A Few Challenges being Addressed by the MVAPICH2 Project for Exascale

- Scalability for million to billion processors
- Collective communication
- Integrated support for GPGPUs
  - CUDA-aware MPI
  - GPUDirect RDMA (GDR) support
  - CUDA-aware non-blocking collectives
  - Support for managed memory
  - Efficient datatype processing
- Integrated support for MICs
- Unified runtime for hybrid MPI+PGAS programming (MPI+OpenSHMEM, MPI+UPC, CAF, ...)
- Virtualization
- Energy-awareness
- InfiniBand Network Analysis and Monitoring (INAM)

Page 29: Programming Models for Exascale Systems

MPI + CUDA - Naive

- Data movement in applications with standard MPI and CUDA interfaces
- High productivity and low performance

[Figure: data path through CPU, PCIe, GPU, NIC, and switch; data is staged through host memory.]

At Sender:
  cudaMemcpy(s_hostbuf, s_devbuf, . . .);
  MPI_Send(s_hostbuf, size, . . .);

At Receiver:
  MPI_Recv(r_hostbuf, size, . . .);
  cudaMemcpy(r_devbuf, r_hostbuf, . . .);

Page 30: Programming Models for Exascale Systems

MPI + CUDA - Advanced

- Pipelining at user level with non-blocking MPI and CUDA interfaces
- Low productivity and high performance

[Figure: data path through CPU, PCIe, GPU, NIC, and switch; chunks are pipelined through host memory.]

At Sender:
  for (j = 0; j < pipeline_len; j++)
      cudaMemcpyAsync(s_hostbuf + j * blk, s_devbuf + j * blksz, ...);
  for (j = 0; j < pipeline_len; j++) {
      while (result != cudaSuccess) {
          result = cudaStreamQuery(...);
          if (j > 0) MPI_Test(...);
      }
      MPI_Isend(s_hostbuf + j * block_sz, blksz, . . .);
  }
  MPI_Waitall(...);

<<Similar at receiver>>

Page 31: Programming Models for Exascale Systems

GPU-Aware MPI Library: MVAPICH2-GPU

- Standard MPI interfaces used for unified data movement
- Takes advantage of Unified Virtual Addressing (>= CUDA 4.0)
- Overlaps data movement from the GPU with RDMA transfers
- High performance and high productivity

At Sender:
  MPI_Send(s_devbuf, size, ...);

At Receiver:
  MPI_Recv(r_devbuf, size, ...);

(staging/pipelining handled inside MVAPICH2)
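A self-contained sketch of the same idea, assuming a CUDA-aware MPI library (such as MVAPICH2-GPU/GDR) that accepts device pointers directly; error checking is omitted for brevity.

    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
        const int n = 1 << 20;
        int rank;
        double *devbuf;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        cudaMalloc((void **)&devbuf, n * sizeof(double));   /* GPU device memory */
        cudaMemset(devbuf, 0, n * sizeof(double));

        /* With a CUDA-aware MPI, device pointers are passed directly;
           the library pipelines/stages the transfer internally. */
        if (rank == 0)
            MPI_Send(devbuf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(devbuf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);

        cudaFree(devbuf);
        MPI_Finalize();
        return 0;
    }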

Page 32: Programming Models for Exascale Systems

GPU-Direct RDMA (GDR) with CUDA

- OFED with support for GPUDirect RDMA is developed by NVIDIA and Mellanox
- OSU has a design of MVAPICH2 using GPUDirect RDMA
  - Hybrid design using GPUDirect RDMA
    - GPUDirect RDMA and host-based pipelining
    - Alleviates P2P bandwidth bottlenecks on SandyBridge and IvyBridge
  - Support for communication using multi-rail
  - Support for Mellanox Connect-IB and ConnectX VPI adapters
  - Support for RoCE with Mellanox ConnectX VPI adapters

[Figure: IB adapter, chipset, CPU, GPU, system memory, and GPU memory data paths. P2P write/read bandwidth: 5.2 GB/s / <1.0 GB/s on SandyBridge E5-2670; 6.4 GB/s / 3.5 GB/s on IvyBridge E5-2680 V2.]

Page 33: Programming Models for Exascale Systems

CUDA-Aware MPI: MVAPICH2-GDR 1.8-2.2 Releases

- Support for MPI communication from NVIDIA GPU device memory
- High-performance RDMA-based inter-node point-to-point communication (GPU-GPU, GPU-Host and Host-GPU)
- High-performance intra-node point-to-point communication for multi-GPU adapters/node (GPU-GPU, GPU-Host and Host-GPU)
- Taking advantage of CUDA IPC (available since CUDA 4.1) in intra-node communication for multiple GPU adapters/node
- Optimized and tuned collectives for GPU device buffers
- MPI datatype support for point-to-point and collective communication from GPU device buffers

Page 34: Programming Models for Exascale Systems

Performance of MVAPICH2-GPU with GPU-Direct RDMA (GDR)

[Figure: GPU-GPU inter-node latency, bandwidth, and bi-bandwidth for MV2-GDR 2.2b, MV2-GDR 2.0b, and MV2 without GDR. Small-message latency drops to 2.18 us (annotated improvements of about 10x over MV2 without GDR and 2x over MV2-GDR 2.0b); bandwidth and bi-bandwidth show annotated gains of about 11x and 2x.]

Platform: MVAPICH2-GDR 2.2b, Intel Ivy Bridge (E5-2680 v2) node with 20 cores, NVIDIA Tesla K40c GPU, Mellanox Connect-IB dual-FDR HCA, CUDA 7, Mellanox OFED 2.4 with GPU-Direct-RDMA

Page 35: Programming Models for Exascale Systems

Application-Level Evaluation (HOOMD-blue)

- Platform: Wilkes (Intel Ivy Bridge + NVIDIA Tesla K20c + Mellanox Connect-IB)
- HOOMD-blue version 1.0.5
- GDRCOPY enabled: MV2_USE_CUDA=1 MV2_IBA_HCA=mlx5_0 MV2_IBA_EAGER_THRESHOLD=32768 MV2_VBUF_TOTAL_SIZE=32768 MV2_USE_GPUDIRECT_LOOPBACK_LIMIT=32768 MV2_USE_GPUDIRECT_GDRCOPY=1 MV2_USE_GPUDIRECT_GDRCOPY_LIMIT=16384

[Figure: average time steps per second (TPS) for 4-32 processes with 64K and 256K particles, MV2 vs. MV2+GDR; about 2x improvement in both cases.]

Page 36: Programming Models for Exascale Systems

CUDA-Aware Non-Blocking Collectives

[Figure: medium/large message overlap (%) on 64 GPU nodes for MPI_Ialltoall and MPI_Igather with 1 process/node and 2 processes/node (1 process/GPU), message sizes 4K-1M bytes.]

Platform: Wilkes - Intel Ivy Bridge, NVIDIA Tesla K20c + Mellanox Connect-IB
Available since MVAPICH2-GDR 2.2a

A. Venkatesh, K. Hamidouche, H. Subramoni, and D. K. Panda, Offloaded GPU Collectives using CORE-Direct and CUDA Capabilities on IB Clusters, HiPC 2015

Page 37: Programming Models for Exascale Systems

Communication Runtime with GPU Managed Memory

- Since CUDA 6.0, NVIDIA provides CUDA Managed (or Unified) Memory, allowing a common memory allocation for GPU or CPU through the cudaMallocManaged() call
- Significant productivity benefits due to abstraction of explicit allocation and cudaMemcpy()
- Extended MVAPICH2 to perform communication directly from managed buffers (available in MVAPICH2-GDR 2.2b)
- OSU Micro-benchmarks extended to evaluate the performance of point-to-point and collective communication using managed buffers (available in OMB 5.2)

D. S. Banerjee, K. Hamidouche, and D. K. Panda, Designing High Performance Communication Runtime for GPU Managed Memory: Early Experiences, GPGPU-9 Workshop, to be held in conjunction with PPoPP '16
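A minimal sketch of what this enables, assuming an MPI library (such as MVAPICH2-GDR 2.2b) that accepts managed buffers directly; the same pointer could also be passed to a CUDA kernel without any explicit cudaMemcpy().

    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
        const int n = 1 << 20;
        int rank;
        double *buf;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* One allocation visible to both host code and GPU kernels. */
        cudaMallocManaged((void **)&buf, n * sizeof(double), cudaMemAttachGlobal);

        if (rank == 0) {
            for (int i = 0; i < n; i++) buf[i] = 1.0;  /* filled on the host (or by a kernel) */
            MPI_Send(buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);  /* managed pointer passed directly */
        } else if (rank == 1) {
            MPI_Recv(buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }

        cudaFree(buf);
        MPI_Finalize();
        return 0;
    }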

[Figure: 2D stencil halo-exchange time (ms) for halo width = 1 as the total dimension size grows from 1 byte to 16K bytes, comparing explicit device buffers and managed buffers.]

Page 38: Programming Models for Exascale Systems

MPI Datatype Processing (Communication Optimization)

Common scenario (waste of computing resources on CPU and GPU):

  MPI_Isend(A, ..., Datatype, ...);
  MPI_Isend(B, ..., Datatype, ...);
  MPI_Isend(C, ..., Datatype, ...);
  MPI_Isend(D, ..., Datatype, ...);
  ...
  MPI_Waitall(...);

  *Buf1, Buf2, ... contain a non-contiguous MPI Datatype

[Figure: CPU/GPU timeline of the existing vs. proposed design. In the existing design, each MPI_Isend initiates a packing kernel on a CUDA stream, waits for that kernel (WFK), and then starts the send, serializing kernels and sends. In the proposed design, the kernel initiations for Isend(1), Isend(2), Isend(3) are issued back-to-back and the wait-for-kernel and send start are deferred to MPI progress, so packing kernels overlap with sends and the overall finish time is earlier. Expected benefit: better use of CPU and GPU resources.]
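For reference, the kind of non-contiguous transfer being optimized can be expressed with a derived datatype. A small sketch (illustrative only, assuming a CUDA-aware MPI so the vector type can describe a strided region of a device buffer; the packing is then the library's responsibility):

    #include <mpi.h>

    /* Post a non-blocking send of one column of an n x n row-major matrix,
       described by an MPI vector datatype; the matrix may live in GPU memory
       when the MPI library is CUDA-aware. */
    MPI_Request send_column(double *matrix, int n, int col, int dest)
    {
        MPI_Datatype column_t;
        MPI_Request  req;

        MPI_Type_vector(n, 1, n, MPI_DOUBLE, &column_t);  /* n blocks of 1, stride n */
        MPI_Type_commit(&column_t);

        MPI_Isend(matrix + col, 1, column_t, dest, 0, MPI_COMM_WORLD, &req);

        MPI_Type_free(&column_t);  /* safe: the datatype remains in use until the send completes */
        return req;
    }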

Page 39: Programming Models for Exascale Systems

Application-Level Evaluation (Halo Exchange - COSMO)

[Figure: normalized execution time on the CSCS GPU cluster (16-96 GPUs) and the Wilkes GPU cluster (4-32 GPUs), comparing the default, callback-based, and event-based designs.]

- 2x improvement on 32 GPU nodes
- 30% improvement on 96 GPU nodes (8 GPUs/node)

C. Chu, K. Hamidouche, A. Venkatesh, D. Banerjee, H. Subramoni, and D. K. Panda, Exploiting Maximal Overlap for Non-Contiguous Data Movement Processing on Modern GPU-enabled Systems, IPDPS '16

Page 40: Programming Models for Exascale Systems

Overview of A Few Challenges being Addressed by the MVAPICH2 Project for Exascale

- Scalability for million to billion processors
- Collective communication
- Integrated support for GPGPUs
- Integrated support for MICs
- Unified runtime for hybrid MPI+PGAS programming (MPI+OpenSHMEM, MPI+UPC, CAF, ...)
- Virtualization
- Energy-awareness
- InfiniBand Network Analysis and Monitoring (INAM)

Page 41: Programming Models for Exascale Systems

MPI Applications on MIC Clusters

- Flexibility in launching MPI jobs on clusters with Xeon Phi

[Figure: execution modes ranging from multi-core-centric to many-core-centric: Host-only (MPI program on the Xeon only), Offload/reverse offload (MPI program on the host with offloaded computation on the Xeon Phi), Symmetric (MPI programs on both Xeon and Xeon Phi), and Coprocessor-only (MPI program on the Xeon Phi only).]

Page 42: Programming Models for Exascale Systems

MVAPICH2-MIC 2.0 Design for Clusters with IB and MIC

- Offload mode
- Intra-node communication
  - Coprocessor-only and symmetric mode
- Inter-node communication
  - Coprocessor-only and symmetric mode
- Multi-MIC node configurations
- Running on three major systems
  - Stampede, Blueridge (Virginia Tech) and Beacon (UTK)

Page 43: Programming Models for Exascale Systems

MIC-Remote-MIC P2P Communication with Proxy-based Communication

[Figure: intra-socket and inter-socket P2P latency for large messages (8K-2M bytes) and bandwidth (1 byte-1M bytes); peak bandwidths annotated at 5236 and 5594 MB/sec. Lower latency and higher bandwidth are better.]

Page 44: Programming Models for Exascale Systems

Optimized MPI Collectives for MIC Clusters (Allgather & Alltoall)

[Figure: 32-node Allgather small-message latency (16H + 16M) and large-message latency (8H + 8M), and 32-node Alltoall large-message latency (8H + 8M), MV2-MIC vs. MV2-MIC-Opt, with annotated improvements of 76%, 58%, and 55%; P3DFFT performance on 32 nodes (8H + 8M, size 2K x 2K x 1K) broken into communication and computation time, with the optimized version reducing execution time.]

A. Venkatesh, S. Potluri, R. Rajachandrasekar, M. Luo, K. Hamidouche and D. K. Panda, High Performance Alltoall and Allgather designs for InfiniBand MIC Clusters, IPDPS '14, May 2014

Page 45: Programming Models for Exascale Systems

Overview of A Few Challenges being Addressed by the MVAPICH2 Project for Exascale

- Scalability for million to billion processors
- Collective communication
- Integrated support for GPGPUs
- Integrated support for MICs
- Unified runtime for hybrid MPI+PGAS programming (MPI+OpenSHMEM, MPI+UPC, CAF, ...)
- Virtualization
- Energy-awareness
- InfiniBand Network Analysis and Monitoring (INAM)

Page 46: Programming Models for Exascale Systems

MVAPICH2-X for Advanced MPI and Hybrid MPI+PGAS Applications

[Figure: MPI, OpenSHMEM, UPC, CAF, or hybrid (MPI+PGAS) applications issue MPI calls, OpenSHMEM calls, UPC calls, and CAF calls into a unified MVAPICH2-X runtime over InfiniBand, RoCE, and iWARP.]

- Unified communication runtime for MPI, UPC, OpenSHMEM, CAF available with MVAPICH2-X 1.9 (2012) onwards!
- UPC++ support will be available in the upcoming MVAPICH2-X 2.2 RC1
- Feature highlights
  - Supports MPI(+OpenMP), OpenSHMEM, UPC, CAF, MPI(+OpenMP)+OpenSHMEM, MPI(+OpenMP)+UPC+CAF
  - MPI-3 compliant, OpenSHMEM v1.0 standard compliant, UPC v1.2 standard compliant (with initial support for UPC 1.3), CAF 2008 standard (OpenUH)
  - Scalable inter-node and intra-node communication - point-to-point and collectives

Page 47: Programming Models for Exascale Systems

Application-Level Performance with Graph500 and Sort

Graph500 execution time:

- Performance of the hybrid (MPI+OpenSHMEM) Graph500 design
- 8,192 processes: 2.4x improvement over MPI-CSR, 7.6x improvement over MPI-Simple
- 16,384 processes: 1.5x improvement over MPI-CSR, 13x improvement over MPI-Simple

[Figure: Graph500 execution time (s) for 4K-16K processes, comparing MPI-Simple, MPI-CSC, MPI-CSR, and Hybrid (MPI+OpenSHMEM); annotated speedups of 7.6x and 13x.]

J. Jose, S. Potluri, K. Tomko and D. K. Panda, Designing Scalable Graph500 Benchmark with Hybrid MPI+OpenSHMEM Programming Models, International Supercomputing Conference (ISC '13), June 2013

J. Jose, K. Kandalla, M. Luo and D. K. Panda, Supporting Hybrid MPI and OpenSHMEM over InfiniBand: Design and Performance Evaluation, Int'l Conference on Parallel Processing (ICPP '12), September 2012

J. Jose, K. Kandalla, S. Potluri, J. Zhang and D. K. Panda, Optimizing Collective Communication in OpenSHMEM, Int'l Conference on Partitioned Global Address Space Programming Models (PGAS '13), October 2013

Sort execution time:

- Performance of the hybrid (MPI+OpenSHMEM) Sort application
- 4,096 processes, 4 TB input size: MPI - 2408 sec (0.16 TB/min); Hybrid - 1172 sec (0.36 TB/min); 51% improvement over the MPI design

[Figure: Sort execution time (seconds) for input sizes 500 GB-4 TB at 512-4K processes, MPI vs. Hybrid; 51% improvement annotated.]

Page 48: Programming Models for Exascale Systems

MiniMD - Total Execution Time

- Hybrid design performs better than the MPI implementation
- 1,024 processes: 17% improvement over the MPI version
- Strong scaling; input size: 128 x 128 x 128

[Figure: total execution time (ms) for Hybrid-Barrier, MPI-Original, and Hybrid-Advanced at 512 and 1,024 cores (performance) and 256-1,024 cores (strong scaling); 17% improvement annotated.]

M. Li, J. Lin, X. Lu, K. Hamidouche, K. Tomko and D. K. Panda, Scalable MiniMD Design with Hybrid MPI and OpenSHMEM, OpenSHMEM User Group Meeting (OUG '14), held in conjunction with the 8th International Conference on Partitioned Global Address Space Programming Models (PGAS '14).

Page 49: Programming Models for Exascale Systems

Hybrid MPI+UPC NAS-FT

- Modified NAS FT UPC all-to-all pattern using MPI_Alltoall
- Truly hybrid program
- For FT (Class C, 128 processes):
  - 34% improvement over UPC-GASNet
  - 30% improvement over UPC-OSU

[Figure: NAS-FT time (s) for problem sizes B-64, C-64, B-128, and C-128, comparing UPC-GASNet, UPC-OSU, and Hybrid-OSU; 34% improvement annotated.]

Hybrid MPI+UPC support available since MVAPICH2-X 1.9 (2012)

J. Jose, M. Luo, S. Sur and D. K. Panda, Unifying UPC and MPI Runtimes: Experience with MVAPICH, Fourth Conference on Partitioned Global Address Space Programming Model (PGAS '10), October 2010

Page 50: Programming Models for Exascale Systems

Overview of A Few Challenges being Addressed by the MVAPICH2 Project for Exascale

- Scalability for million to billion processors
- Collective communication
- Integrated support for GPGPUs
- Integrated support for MICs
- Unified runtime for hybrid MPI+PGAS programming (MPI+OpenSHMEM, MPI+UPC, CAF, ...)
- Virtualization
- Energy-awareness
- InfiniBand Network Analysis and Monitoring (INAM)

Page 51: Programming Models for Exascale Systems

Can HPC and Virtualization be Combined?

- Virtualization has many benefits
  - Fault-tolerance
  - Job migration
  - Compaction
- It has not been very popular in HPC due to the overhead associated with virtualization
- New SR-IOV (Single Root - IO Virtualization) support available with Mellanox InfiniBand adapters changes the field
- Enhanced MVAPICH2 support for SR-IOV
- MVAPICH2-Virt 2.1 (with and without OpenStack) is publicly available

J. Zhang, X. Lu, J. Jose, R. Shi and D. K. Panda, Can Inter-VM Shmem Benefit MPI Applications on SR-IOV based Virtualized InfiniBand Clusters? EuroPar '14
J. Zhang, X. Lu, J. Jose, M. Li, R. Shi and D. K. Panda, High Performance MPI Library over SR-IOV enabled InfiniBand Clusters, HiPC '14
J. Zhang, X. Lu, M. Arnold and D. K. Panda, MVAPICH2 over OpenStack with SR-IOV: an Efficient Approach to build HPC Clouds, CCGrid '15

Page 52: Programming Models for Exascale Systems

Overview of MVAPICH2-Virt with SR-IOV and IVSHMEM

- Redesign MVAPICH2 to make it virtual machine aware
  - SR-IOV shows near-to-native performance for inter-node point-to-point communication
  - IVSHMEM offers zero-copy access to data on shared memory of co-resident VMs
  - Locality Detector: maintains the locality information of co-resident virtual machines
  - Communication Coordinator: selects the communication channel (SR-IOV, IVSHMEM) adaptively

[Figure: host environment with two guest VMs; each guest's MPI process uses a VF driver over an SR-IOV virtual function of the InfiniBand adapter (SR-IOV channel) and an IV-SHM PCI device backed by /dev/shm on the host (IV-Shmem channel); the hypervisor holds the PF driver for the physical function.]

J. Zhang, X. Lu, J. Jose, R. Shi, D. K. Panda. Can Inter-VM Shmem Benefit MPI Applications on SR-IOV based Virtualized InfiniBand Clusters? Euro-Par, 2014.
J. Zhang, X. Lu, J. Jose, R. Shi, M. Li, D. K. Panda. High Performance MPI Library over SR-IOV Enabled InfiniBand Clusters. HiPC, 2014.

Page 53: Programming Models for Exascale Systems

MVAPICH2-Virt with SR-IOV and IVSHMEM over OpenStack

- OpenStack is one of the most popular open-source solutions to build clouds and manage virtual machines
- Deployment with OpenStack
  - Supporting SR-IOV configuration
  - Supporting IVSHMEM configuration
  - Virtual-machine-aware design of MVAPICH2 with SR-IOV
- An efficient approach to build HPC clouds with MVAPICH2-Virt and OpenStack

[Figure: OpenStack services around a VM - Nova provisions VMs, Glance stores and provides images, Neutron provides the network, Swift stores images and backs up volumes, Cinder provides volumes, Keystone provides authentication, Ceilometer monitors, Horizon provides the UI, and Heat orchestrates the cloud.]

J. Zhang, X. Lu, M. Arnold, D. K. Panda. MVAPICH2 over OpenStack with SR-IOV: An Efficient Approach to Build HPC Clouds. CCGrid, 2015.

Page 54: Programming Models for Exascale Systems

Application-Level Performance on Chameleon

- 32 VMs, 6 cores/VM
- Compared to native, 2-5% overhead for Graph500 with 128 procs
- Compared to native, 1-9.5% overhead for SPEC MPI2007 with 128 procs

[Figure: SPEC MPI2007 execution time (s) for milc, leslie3d, pop2, GAPgeofem, zeusmp2, and lu, and Graph500 execution time (ms) for problem sizes (scale, edge factor) from (22,20) to (26,16), comparing MV2-SR-IOV-Def, MV2-SR-IOV-Opt, and MV2-Native; annotated overheads of 1%-9.5% (SPEC MPI2007) and 2%-5% (Graph500).]

Page 55: Programming Models for Exascale Systems

NSF Chameleon Cloud: A Powerful and Flexible Experimental Instrument

- Large-scale instrument
  - Targeting Big Data, Big Compute, Big Instrument research
  - ~650 nodes (~14,500 cores), 5 PB disk over two sites, 2 sites connected with a 100G network
- Reconfigurable instrument
  - Bare-metal reconfiguration, operated as a single instrument, graduated approach for ease-of-use
- Connected instrument
  - Workload and Trace Archive
  - Partnerships with production clouds: CERN, OSDC, Rackspace, Google, and others
  - Partnerships with users
- Complementary instrument
  - Complementing GENI, Grid'5000, and other testbeds
- Sustainable instrument
  - Industry connections

http://www.chameleoncloud.org/

Page 56: Programming Models for Exascale Systems

Overview of A Few Challenges being Addressed by the MVAPICH2 Project for Exascale

- Scalability for million to billion processors
- Collective communication
- Integrated support for GPGPUs
- Integrated support for MICs
- Unified runtime for hybrid MPI+PGAS programming (MPI+OpenSHMEM, MPI+UPC, CAF, ...)
- Virtualization
- Energy-awareness
- InfiniBand Network Analysis and Monitoring (INAM)

Page 57: Programming Models for Exascale Systems

Energy-Aware MVAPICH2 & OSU Energy Management Tool (OEMT)

- MVAPICH2-EA 2.1 (Energy-Aware)
  - A white-box approach
  - New energy-efficient communication protocols for pt-pt and collective operations
  - Intelligently apply the appropriate energy-saving techniques
  - Application-oblivious energy saving
- OEMT
  - A library utility to measure energy consumption for MPI applications
  - Works with all MPI runtimes
  - PRELOAD option for precompiled applications
  - Does not require ROOT permission: a safe kernel module to read only a subset of MSRs

Page 58: Programming Models for Exascale Systems

MVAPICH2-EA: Application-Oblivious Energy-Aware MPI (EAM)

- An energy-efficient runtime that provides energy savings without application knowledge
- Uses the best energy lever automatically and transparently
- Provides guarantees on maximum degradation, with 5-41% savings at <=5% degradation
- Pessimistic MPI applies an energy-reduction lever to each MPI call

A Case for Application-Oblivious Energy-Efficient MPI Runtime. A. Venkatesh, A. Vishnu, K. Hamidouche, N. Tallent, D. K. Panda, D. Kerbyson, and A. Hoise, Supercomputing '15, Nov 2015 [Best Student Paper Finalist]

Page 59: Programming Models for Exascale Systems

Overview of A Few Challenges being Addressed by the MVAPICH2 Project for Exascale

- Scalability for million to billion processors
- Collective communication
- Integrated support for GPGPUs
- Integrated support for MICs
- Unified runtime for hybrid MPI+PGAS programming (MPI+OpenSHMEM, MPI+UPC, CAF, ...)
- Virtualization
- Energy-awareness
- InfiniBand Network Analysis and Monitoring (INAM)

Page 60: Programming Models for Exascale Systems

Overview of OSU INAM

- OSU INAM monitors IB clusters in real time by querying various subnet management entities in the network
- Major features of the OSU INAM tool include:
  - Analyze and profile network-level activities with many parameters (data and errors) at user-specified granularity
  - Capability to analyze and profile node-level, job-level and process-level activities for MPI communication (pt-to-pt, collectives and RMA)
  - Remotely monitor CPU utilization of MPI processes at user-specified granularity
  - Visualize data transfer as it happens - Live View for:
    - Entire network - Live Network Level View
    - Particular job - Live Job Level View
    - One or multiple nodes - Live Node Level View
  - Capability to visualize data transfer that happened in the network over a time duration in the past:
    - Entire network - Historical Network Level View
    - Particular job - Historical Job Level View
    - One or multiple nodes - Historical Node Level View

Page 61: Programming Models for Exascale Systems

OSU INAM - Network Level View

- Show the network topology of large clusters
- Visualize traffic patterns on different links
- Quickly identify congested links and links in error state
- See the history unfold - playback historical state of the network

[Figure: full network view (152 nodes) and a zoomed-in view of the network.]

Page 62: Programming Models for Exascale Systems

OSU INAM - Job and Node Level Views

[Figure: visualizing a job (5 nodes) and finding routes between nodes.]

- Job level view
  - Show different network metrics (load, error, etc.) for any live job
  - Playback historical data for completed jobs to identify bottlenecks
- Node level view provides details per processor per node
  - CPU utilization for each rank/node
  - Bytes sent/received for MPI operations (pt-to-pt, collective, RMA)
  - Network metrics (e.g., XmitDiscard, RcvError) per rank/node

Page 63: Programming Models for Exascale Systems

MVAPICH2 - Plans for Exascale

- Performance and memory scalability toward 1M cores
- Hybrid programming (MPI+OpenSHMEM, MPI+UPC, MPI+CAF, ...)
  - Support for task-based parallelism (UPC++)
- Enhanced optimization for GPU support and accelerators
- Taking advantage of advanced features
  - User-Mode Memory Registration (UMR)
  - On-Demand Paging
- Enhanced inter-node and intra-node communication schemes for upcoming Omni-Path and Knights Landing architectures
- Extended RMA support (as in MPI 3.0)
- Extended topology-aware collectives
- Energy-aware point-to-point (one-sided and two-sided) and collectives
- Extended support for MPI Tools interface (as in MPI 3.0)
- Extended checkpoint-restart and migration support with SCR

Page 64: Programming Models for Exascale Systems

Looking into the Future....

- Exascale systems will be constrained by
  - Power
  - Memory per core
  - Data-movement cost
  - Faults
- Programming models and runtimes for HPC need to be designed for
  - Scalability
  - Performance
  - Fault-resilience
  - Energy-awareness
  - Programmability
  - Productivity
- Highlighted some of the issues and challenges
- Need continuous innovation on all these fronts

Page 65: Programming Models for Exascale Systems

Funding Acknowledgments

Funding support by: [sponsor logos]

Equipment support by: [vendor logos]

Page 66: Programming Models for Exascale Systems

Personnel Acknowledgments

Current Students
- A. Augustine (M.S.)
- A. Awan (Ph.D.)
- S. Chakraborthy (Ph.D.)
- C.-H. Chu (Ph.D.)
- N. Islam (Ph.D.)
- M. Li (Ph.D.)
- M. Rahman (Ph.D.)
- D. Shankar (Ph.D.)
- A. Venkatesh (Ph.D.)
- J. Zhang (Ph.D.)

Past Students
- P. Balaji (Ph.D.)
- S. Bhagvat (M.S.)
- A. Bhat (M.S.)
- D. Buntinas (Ph.D.)
- L. Chai (Ph.D.)
- B. Chandrasekharan (M.S.)
- N. Dandapanthula (M.S.)
- V. Dhanraj (M.S.)
- T. Gangadharappa (M.S.)
- K. Gopalakrishnan (M.S.)
- W. Huang (Ph.D.)
- W. Jiang (M.S.)
- J. Jose (Ph.D.)
- S. Kini (M.S.)
- M. Koop (Ph.D.)
- R. Kumar (M.S.)
- S. Krishnamoorthy (M.S.)
- K. Kandalla (Ph.D.)
- K. Kulkarni (M.S.)
- P. Lai (M.S.)
- J. Liu (Ph.D.)
- M. Luo (Ph.D.)
- A. Mamidala (Ph.D.)
- G. Marsh (M.S.)
- V. Meshram (M.S.)
- A. Moody (M.S.)
- S. Naravula (Ph.D.)
- R. Noronha (Ph.D.)
- X. Ouyang (Ph.D.)
- S. Pai (M.S.)
- S. Potluri (Ph.D.)
- R. Rajachandrasekar (Ph.D.)
- G. Santhanaraman (Ph.D.)
- A. Singh (Ph.D.)
- J. Sridhar (M.S.)
- S. Sur (Ph.D.)
- H. Subramoni (Ph.D.)
- K. Vaidyanathan (Ph.D.)
- A. Vishnu (Ph.D.)
- J. Wu (Ph.D.)
- W. Yu (Ph.D.)

Past Research Scientist
- S. Sur

Current Research Scientists
- K. Hamidouche
- X. Lu

Current Senior Research Associate
- H. Subramoni

Current Post-Docs
- J. Lin
- D. Banerjee

Past Post-Docs
- H. Wang
- X. Besseron
- H.-W. Jin
- M. Luo
- E. Mancini
- S. Marcarelli
- J. Vienne

Current Programmer
- J. Perkins

Past Programmer
- D. Bureddy

Current Research Specialist
- M. Arnold

Page 67: Programming Models for Exascale Systems

International Workshop on Communication Architectures at Extreme Scale (ExaComm)

ExaComm 2015 was held with the Int'l Supercomputing Conference (ISC '15), at Frankfurt, Germany, on Thursday, July 16th, 2015

One keynote talk: John M. Shalf, CTO, LBL/NERSC
Four invited talks: Dror Goldenberg (Mellanox); Martin Schulz (LLNL); Cyriel Minkenberg (IBM-Zurich); Arthur (Barney) Maccabe (ORNL)
Panel: Ron Brightwell (Sandia)
Two research papers

ExaComm 2016 will be held in conjunction with ISC '16
http://web.cse.ohio-state.edu/~subramon/ExaComm16/exacomm16.html
Technical paper submission deadline: Friday, April 15, 2016

Page 68: Programming Models for Exascale Systems

Thank You!

[email protected]

Network-Based Computing Laboratory
http://nowlab.cse.ohio-state.edu/

The MVAPICH2 Project
http://mvapich.cse.ohio-state.edu/

The High-Performance Big Data Project
http://hibd.cse.ohio-state.edu/