Programming Models for Exascale Systems

Dhabaleswar K. (DK) Panda, The Ohio State University
E-mail: panda@cse.ohio-state.edu
http://www.cse.ohio-state.edu/~panda

Keynote Talk at HPCAC-Stanford (Feb 2016)
High-End Computing (HEC): ExaFlop & ExaByte

ExaFlop & HPC:
- 100-200 PFlops in 2016-2018
- 1 EFlops in 2020-2024?

ExaByte & Big Data:
- 10K-20K EBytes in 2016-2018
- 40K EBytes in 2020?

[Figure: growth of the digital universe. Source: IDC's Digital Universe Study, sponsored by EMC, December 2012]
"Within these broad outlines of the digital universe are some singularities worth noting. First, while the portion of the digital universe holding potential analytic value is growing, only a tiny fraction of territory has been explored. IDC estimates that by 2020, as much as 33% of the digital universe will contain information that might be valuable if analyzed, compared with 25% today. This untapped value could be found in patterns in social media usage, correlations in scientific data from discrete studies, medical information intersected with sociological data, faces in security footage, and so on. However, even with a generous estimate, the amount of information in the digital universe that is 'tagged' accounts for only about 3% of the digital universe in 2012, and that which is analyzed is half a percent of the digital universe. Herein is the promise of 'Big Data' technology -- the extraction of value from the large untapped pools of data in the digital universe."
Trends for Commodity Computing Clusters in the Top500 List (http://www.top500.org)

[Chart: number of clusters and percentage of clusters in the Top500 over time; clusters now account for about 85% of Top500 systems]
Drivers of Modern HPC Cluster Architectures

- Multi-core/many-core technologies
- Remote Direct Memory Access (RDMA)-enabled networking (InfiniBand and RoCE)
- Solid State Drives (SSDs), Non-Volatile Random-Access Memory (NVRAM), NVMe-SSD
- Accelerators (NVIDIA GPGPUs and Intel Xeon Phi)

Building blocks: multi-core processors; high-performance interconnects such as InfiniBand (<1 usec latency, 100 Gbps bandwidth); SSD, NVMe-SSD, NVRAM; accelerators/coprocessors with high compute density and high performance/watt (>1 TFlop DP on a chip).

Example systems: Tianhe-2, Titan, Stampede, Tianhe-1A
Large-scale InfiniBand Installations

- 235 IB clusters (47%) in the Nov 2015 Top500 list (http://www.top500.org)
- Installations in the Top 50 (21 systems):
  - 462,462 cores (Stampede) at TACC (10th)
  - 185,344 cores (Pleiades) at NASA/Ames (13th)
  - 72,800 cores, Cray CS-Storm in US (15th)
  - 72,800 cores, Cray CS-Storm in US (16th)
  - 265,440 cores, SGI ICE at Tulip Trading Australia (17th)
  - 124,200 cores (Topaz), SGI ICE at ERDC DSRC in US (18th)
  - 72,000 cores (HPC2) in Italy (19th)
  - 152,692 cores (Thunder) at AFRL/USA (21st)
  - 147,456 cores (SuperMUC) in Germany (22nd)
  - 86,016 cores (SuperMUC Phase 2) in Germany (24th)
  - 76,032 cores (Tsubame 2.5) at Japan/GSIC (25th)
  - 194,616 cores (Cascade) at PNNL (27th)
  - 76,032 cores (Makman-2) at Saudi Aramco (32nd)
  - 110,400 cores (Pangea) in France (33rd)
  - 37,120 cores (Lomonosov-2) at Russia/MSU (35th)
  - 57,600 cores (SwiftLucy) in US (37th)
  - 55,728 cores (Prometheus) at Poland/Cyfronet (38th)
  - 50,544 cores (Occigen) at France/GENCI-CINES (43rd)
  - 76,896 cores (Salomon), SGI ICE in Czech Republic (47th)
  - and many more!
Towards Exascale System (Today and Target)

Today (2016, Tianhe-2) vs. exascale (2020-2024), with the difference between today and exascale:
- System peak: 55 PFlop/s -> 1 EFlop/s (~20x)
- Power: 18 MW (3 Gflops/W) -> ~20 MW (50 Gflops/W) (O(1), ~15x)
- System memory: 1.4 PB (1.024 PB CPU + 0.384 PB CoP) -> 32-64 PB (~50x)
- Node performance: 3.43 TF/s (0.4 CPU + 3 CoP) -> 1.2 or 15 TF (O(1))
- Node concurrency: 24-core CPU + 171-core CoP -> O(1k) or O(10k) (~5x-~50x)
- Total node interconnect BW: 6.36 GB/s -> 200-400 GB/s (~40x-~60x)
- System size (nodes): 16,000 -> O(100,000) or O(1M) (~6x-~60x)
- Total concurrency: 3.12 M (12.48 M threads, 4/core) -> O(billion) for latency hiding (~100x)
- MTTI: few/day -> many/day (O(?))

Courtesy: Prof. Jack Dongarra
Two Major Categories of Applications

- Scientific Computing
  - Message Passing Interface (MPI), including MPI+OpenMP, is the dominant programming model
  - Many discussions towards Partitioned Global Address Space (PGAS): UPC, OpenSHMEM, CAF, etc.
  - Hybrid programming: MPI + PGAS (OpenSHMEM, UPC)
- Big Data / Enterprise / Commercial Computing
  - Focuses on large data and data analysis
  - Hadoop (HDFS, HBase, MapReduce)
  - Spark is emerging for in-memory computing
  - Memcached is also used for Web 2.0
Parallel Programming Models Overview

- Shared Memory Model (SHMEM, DSM): processes P1, P2, P3 operate on a single shared memory
- Distributed Memory Model (MPI, the Message Passing Interface): each process has its own memory and data moves by explicit messages
- Partitioned Global Address Space (PGAS) (Global Arrays, UPC, Chapel, X10, CAF, ...): separate per-process memories presented as a logical shared memory

- Programming models provide abstract machine models
- Models can be mapped onto different types of systems (e.g., Distributed Shared Memory (DSM), MPI within a node, etc.)
- PGAS models and hybrid MPI+PGAS models are gradually gaining importance
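As a concrete illustration of the distributed-memory model above, a minimal MPI exchange might look like the following sketch (two ranks; buffer names and values are illustrative, not from the talk):

  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
      int rank, value = 0;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      if (rank == 0) {
          value = 42;                                   /* data lives only in rank 0's memory */
          MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
      } else if (rank == 1) {
          MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
          printf("Rank 1 received %d\n", value);        /* an explicit message carried it across */
      }

      MPI_Finalize();
      return 0;
  }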
Partitioned Global Address Space (PGAS) Models

- Key features (see the sketch after this list):
  - Simple shared-memory abstractions
  - Lightweight one-sided communication
  - Easier to express irregular communication
- Different approaches to PGAS:
  - Languages: Unified Parallel C (UPC), Co-Array Fortran (CAF), X10, Chapel
  - Libraries: OpenSHMEM, Global Arrays
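For contrast with the two-sided MPI example earlier, here is a minimal sketch of the lightweight one-sided communication mentioned above, using the OpenSHMEM library route (a hedged illustration assuming an OpenSHMEM 1.2-style API; PE numbers and sizes are arbitrary):

  #include <shmem.h>
  #include <stdio.h>

  int main(void)
  {
      shmem_init();
      int me   = shmem_my_pe();
      int npes = shmem_n_pes();

      /* Symmetric allocation: the same remotely accessible buffer exists on every PE */
      int *dst = shmem_malloc(sizeof(int));
      *dst = -1;
      shmem_barrier_all();

      if (me == 0 && npes > 1)
          shmem_int_put(dst, &me, 1, 1);   /* one-sided: PE 1 never posts a receive */

      shmem_barrier_all();
      if (me == 1)
          printf("PE 1 sees %d\n", *dst);

      shmem_free(dst);
      shmem_finalize();
      return 0;
  }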
Hybrid (MPI+PGAS) Programming

- Application sub-kernels can be re-written in MPI or PGAS based on their communication characteristics: an HPC application may keep Kernel 1 ... Kernel N in MPI while selected kernels (e.g., Kernel 2, Kernel N) are re-written in PGAS (a sketch follows this slide)
- Benefits:
  - Best of the distributed memory computing model
  - Best of the shared memory computing model
- Exascale Roadmap*: "Hybrid Programming is a practical way to program exascale systems"

* The International Exascale Software Roadmap, Dongarra, J., Beckman, P., et al., International Journal of High Performance Computer Applications, Volume 25, Number 1, 2011, ISSN 1094-3420
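A hedged sketch of the hybrid pattern above: an MPI application in which one kernel keeps bulk collectives in MPI while another uses OpenSHMEM one-sided puts for irregular updates. It assumes a runtime where both models can be initialized in one job (as in the unified runtime discussed later); the kernel names, table sizes, and update pattern are illustrative only.

  #include <mpi.h>
  #include <shmem.h>

  /* Kernel with regular, bulk communication: stays in MPI */
  static void kernel_mpi(double *buf, int n)
  {
      MPI_Allreduce(MPI_IN_PLACE, buf, n, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
  }

  /* Kernel with irregular, fine-grained updates: re-written with one-sided puts */
  static void kernel_pgas(long *table, int nupdates)
  {
      int npes = shmem_n_pes();
      for (int i = 0; i < nupdates; i++) {
          long val = i;
          shmem_long_put(&table[i % 16], &val, 1, i % npes);
      }
      shmem_barrier_all();
  }

  int main(int argc, char **argv)
  {
      MPI_Init(&argc, &argv);
      shmem_init();                       /* assumption: both models share one runtime */

      double acc[4] = {1, 2, 3, 4};
      long *table = shmem_malloc(16 * sizeof(long));   /* symmetric table on every PE */

      kernel_mpi(acc, 4);
      kernel_pgas(table, 64);

      shmem_free(table);
      shmem_finalize();
      MPI_Finalize();
      return 0;
  }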
Designing Communication Libraries for Multi-Petaflop and Exaflop Systems: Challenges

Layers, from top to bottom:
- Application kernels / applications
- Programming models: MPI, PGAS (UPC, Global Arrays, OpenSHMEM), CUDA, OpenMP, OpenACC, Cilk, Hadoop (MapReduce), Spark (RDD, DAG), etc.
- Communication library or runtime for programming models: point-to-point communication, collective communication, energy-awareness, synchronization and locks, I/O and file systems, fault tolerance
- Networking technologies (InfiniBand, 40/100 GigE, Aries, and Omni-Path)
- Multi-/many-core architectures
- Accelerators (NVIDIA and MIC)

Middleware co-design opportunities and challenges exist across these layers: performance, scalability, and fault-resilience.
Broad Challenges in Designing Communication Libraries for (MPI+X) at Exascale

- Scalability for million to billion processors
  - Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided)
  - Scalable job start-up
- Scalable collective communication
  - Offload
  - Non-blocking
  - Topology-aware
- Balancing intra-node and inter-node communication for next-generation nodes (128-1024 cores)
  - Multiple end-points per node
- Support for efficient multi-threading
- Integrated support for GPGPUs and accelerators
- Fault-tolerance/resiliency
- QoS support for communication and I/O
- Support for hybrid MPI+PGAS programming (MPI+OpenMP, MPI+UPC, MPI+OpenSHMEM, CAF, ...)
- Virtualization
- Energy-awareness
Additional Challenges for Designing Exascale Software Libraries

- Extreme low memory footprint
  - Memory per core continues to decrease
- D-L-A framework
  - Discover: overall network topology (fat-tree, 3D, ...), network topology for the processes of a given job, node architecture, health of network and node
  - Learn: impact on performance and scalability, potential for failure
  - Adapt: internal protocols and algorithms, process mapping, fault-tolerance solutions
- Low-overhead techniques while delivering performance, scalability, and fault-tolerance
Overview of the MVAPICH2 Project

- High-performance open-source MPI library for InfiniBand, 10-40 GigE/iWARP, and RDMA over Converged Enhanced Ethernet (RoCE)
  - MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.0), available since 2002
  - MVAPICH2-X (MPI + PGAS), available since 2011
  - Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), available since 2014
  - Support for virtualization (MVAPICH2-Virt), available since 2015
  - Support for energy-awareness (MVAPICH2-EA), available since 2015
  - Used by more than 2,525 organizations in 77 countries
  - More than 351,000 (>0.35 million) downloads from the OSU site directly
  - Empowering many TOP500 clusters (Nov '15 ranking):
    - 10th-ranked 519,640-core cluster (Stampede) at TACC
    - 13th-ranked 185,344-core cluster (Pleiades) at NASA
    - 25th-ranked 76,032-core cluster (Tsubame 2.5) at Tokyo Institute of Technology, and many others
  - Available with the software stacks of many vendors and Linux distros (RedHat and SuSE)
  - http://mvapich.cse.ohio-state.edu
- Empowering Top500 systems for over a decade
  - From System-X at Virginia Tech (3rd in Nov 2003, 2,200 processors, 12.25 TFlops) -> Stampede at TACC (10th in Nov '15, 519,640 cores, 5.168 PFlops)
MVAPICH2 Architecture

High-performance parallel programming models:
- Message Passing Interface (MPI)
- PGAS (UPC, OpenSHMEM, CAF, UPC++*)
- Hybrid --- MPI + X (MPI + PGAS + OpenMP/Cilk)

High-performance and scalable communication runtime with diverse APIs and mechanisms:
- Point-to-point primitives, collective algorithms, energy-awareness, remote memory access, I/O and file systems, fault tolerance, virtualization, active messages, job startup, introspection & analysis

Support for modern networking technology (InfiniBand, iWARP, RoCE, Omni-Path) and modern multi-/many-core architectures (Intel Xeon, OpenPower*, Xeon Phi (MIC, KNL*), NVIDIA GPGPU):
- Transport protocols: RC, XRC, UD, DC
- Modern network features: UMR, ODP*, SR-IOV, multi-rail
- Transport mechanisms: shared memory, CMA, IVSHMEM
- Modern node features: MCDRAM*, NVLink*, CAPI*

* Upcoming
Overview of a Few Challenges Being Addressed by the MVAPICH2 Project for Exascale

- Scalability for million to billion processors
  - Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided RMA)
  - Support for advanced IB mechanisms (UMR and ODP)
  - Extremely minimal memory footprint
  - Scalable job start-up
- Collective communication
- Integrated support for GPGPUs
- Integrated support for MICs
- Unified runtime for hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, ...)
- Virtualization
- Energy-awareness
- InfiniBand Network Analysis and Monitoring (INAM)
One-way Latency: MPI over IB with MVAPICH2

[Charts: small-message and large-message one-way latency (us) vs. message size (bytes) for TrueScale-QDR, ConnectX-3-FDR, ConnectIB-Dual FDR, and ConnectX-4-EDR; small-message latencies of 0.95-1.26 us across the four adapters (1.26, 1.19, 1.15, and 0.95 us)]

Platforms: TrueScale-QDR, ConnectX-3-FDR, and ConnectIB-Dual FDR on 2.8 GHz deca-core (IvyBridge) Intel, PCIe Gen3, with IB switch; ConnectX-4-EDR on 2.8 GHz deca-core (Haswell) Intel, PCIe Gen3, back-to-back
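These numbers come from the OSU micro-benchmarks; a stripped-down ping-pong sketch of how such one-way latency is typically measured (iteration count and message size are illustrative, not the exact benchmark):

  #include <mpi.h>
  #include <stdio.h>

  #define ITERS 1000
  #define SIZE  8            /* small-message case, in bytes */

  int main(int argc, char **argv)
  {
      char buf[SIZE];
      int rank;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      MPI_Barrier(MPI_COMM_WORLD);
      double t0 = MPI_Wtime();
      for (int i = 0; i < ITERS; i++) {
          if (rank == 0) {
              MPI_Send(buf, SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
              MPI_Recv(buf, SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
          } else if (rank == 1) {
              MPI_Recv(buf, SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
              MPI_Send(buf, SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
          }
      }
      double t1 = MPI_Wtime();

      /* One-way latency = round-trip time / 2, averaged over the iterations */
      if (rank == 0)
          printf("latency: %.2f us\n", (t1 - t0) * 1e6 / (2.0 * ITERS));

      MPI_Finalize();
      return 0;
  }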
Bandwidth: MPI over IB with MVAPICH2

[Charts: unidirectional and bidirectional bandwidth (MBytes/sec) vs. message size (bytes) for the same four adapters]
- Unidirectional bandwidth: 3387, 6356, 12104, and 12465 MBytes/sec across the four adapters
- Bidirectional bandwidth: 6308, 12161, 21425, and 24353 MBytes/sec across the four adapters

Platforms: TrueScale-QDR, ConnectX-3-FDR, and ConnectIB-Dual FDR on 2.8 GHz deca-core (IvyBridge) Intel, PCIe Gen3, with IB switch; ConnectX-4-EDR on 2.8 GHz deca-core (Haswell) Intel, PCIe Gen3, back-to-back
MVAPICH2 Two-Sided Intra-Node Performance (Shared Memory and Kernel-based Zero-copy Support (LiMIC and CMA))

Latest MVAPICH2 2.2b, Intel Ivy Bridge

[Charts: latency (us) vs. message size (0 to 1K bytes) for intra-socket and inter-socket, and bandwidth (MB/s) vs. message size for intra-socket and inter-socket using CMA, shared memory, and LiMIC]
- Small-message latency: 0.18 us intra-socket, 0.45 us inter-socket
- Peak bandwidth: about 13,749-14,250 MB/s
User-mode Memory Registration (UMR)

- Introduced by Mellanox to support direct local and remote non-contiguous memory access
  - Avoids packing at the sender and unpacking at the receiver
- Available with MVAPICH2-X 2.2b

[Charts: small/medium message latency (4K-1M bytes) and large message latency (2M-16M bytes) vs. message size, UMR vs. default]

Connect-IB (54 Gbps): 2.8 GHz dual ten-core (IvyBridge) Intel, PCIe Gen3, with Mellanox IB FDR switch

M. Li, H. Subramoni, K. Hamidouche, X. Lu and D. K. Panda, High Performance MPI Datatype Support with User-mode Memory Registration: Challenges, Designs and Benefits, CLUSTER 2015
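UMR targets exactly the non-contiguous transfers that MPI derived datatypes describe. A hedged sketch of the kind of strided datatype whose send/receive path can benefit (the grid size is illustrative; whether the runtime uses UMR or pack/unpack is its own decision, not visible to the application):

  #include <mpi.h>

  int main(int argc, char **argv)
  {
      int rank;
      double grid[64][64];                 /* illustrative 2D grid */
      MPI_Datatype column;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      /* One column of the grid: 64 blocks of 1 double, stride 64 doubles apart */
      MPI_Type_vector(64, 1, 64, MPI_DOUBLE, &column);
      MPI_Type_commit(&column);

      if (rank == 0)
          MPI_Send(&grid[0][0], 1, column, 1, 0, MPI_COMM_WORLD);
      else if (rank == 1)
          MPI_Recv(&grid[0][0], 1, column, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

      MPI_Type_free(&column);
      MPI_Finalize();
      return 0;
  }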
On-Demand Paging (ODP)

- Introduced by Mellanox to support direct remote memory access without pinning
- Memory regions are paged in/out dynamically by the HCA/OS
- The size of registered buffers can be larger than physical memory
- Will be available in the upcoming MVAPICH2-X 2.2 RC1

[Charts: Graph500 pin-down buffer sizes (MB) and Graph500 BFS kernel execution time (s) vs. number of processes (16, 32, 64), pin-down vs. ODP]

Connect-IB (54 Gbps): 2.6 GHz dual octa-core (SandyBridge) Intel, PCIe Gen3, with Mellanox IB FDR switch
Minimizing Memory Footprint by Direct Connect (DC) Transport

- Constant connection cost (one QP for any peer)
- Full feature set (RDMA, atomics, etc.)
- Separate objects for send (DC Initiator) and receive (DC Target)
  - DC Target identified by "DCT Number"
  - Messages routed with (DCT Number, LID)
  - Requires the same "DC Key" to enable communication
- Available since MVAPICH2-X 2.2a

[Charts: connection memory (KB, log scale) for Alltoall vs. number of processes (80-640), and normalized execution time for NAMD Apoa1 (large dataset) vs. number of processes (160-640), comparing RC, DC-Pool, UD, and XRC; nodes P0-P7 spread over Node 0-Node 3 on the IB network]

H. Subramoni, K. Hamidouche, A. Venkatesh, S. Chakraborty and D. K. Panda, Designing MPI Library with Dynamic Connected Transport (DCT) of InfiniBand: Early Experiences, IEEE International Supercomputing Conference (ISC '14)
Towards High Performance and Scalable Startup at Exascale

- Near-constant MPI and OpenSHMEM initialization time at any process count
- 10x and 30x improvement in startup time of MPI and OpenSHMEM, respectively, at 16,384 processes
- Memory consumption for remote endpoint information reduced by O(processes per node)
- 1 GB of memory saved per node with 1M processes and 16 processes per node

[Diagram: job startup performance vs. memory required to store endpoint information, comparing the PGAS state of the art (P), the MPI state of the art (M), and the optimized PGAS/MPI design (O), built from (a) on-demand connection, (b) PMIX_Ring, (c) PMIX_Ibarrier, (d) PMIX_Iallgather, and (e) shmem-based PMI]

On-demand Connection Management for OpenSHMEM and OpenSHMEM+MPI. S. Chakraborty, H. Subramoni, J. Perkins, A. A. Awan, and D. K. Panda, 20th International Workshop on High-level Parallel Programming Models and Supportive Environments (HIPS '15)
PMI Extensions for Scalable MPI Startup. S. Chakraborty, H. Subramoni, A. Moody, J. Perkins, M. Arnold, and D. K. Panda, Proceedings of the 21st European MPI Users' Group Meeting (EuroMPI/Asia '14)
Non-blocking PMI Extensions for Fast MPI Startup. S. Chakraborty, H. Subramoni, A. Moody, A. Venkatesh, J. Perkins, and D. K. Panda, 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid '15)
SHMEMPMI -- Shared Memory based PMI for Improved Performance and Scalability. S. Chakraborty, H. Subramoni, J. Perkins, and D. K. Panda, 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid '16), accepted for publication
Process Management Interface over Shared Memory (SHMEMPMI)

- SHMEMPMI allows MPI processes to directly read remote endpoint (EP) information from the process manager through shared memory segments
- Only a single copy per node: an O(processes per node) reduction in memory usage
- Estimated savings of 1 GB per node with 1 million processes and 16 processes per node
- Up to 1,000 times faster PMI Gets compared to the default design
- Will be available in MVAPICH2 2.2 RC1

[Charts: time taken by one PMI_Get vs. number of processes per node (default vs. SHMEMPMI) and memory usage per node for remote EP information vs. number of processes per job (Fence/Allgather, default vs. shmem); annotated improvements of 1000x (estimated) and 16x (actual)]

TACC Stampede, Connect-IB (54 Gbps): 2.6 GHz quad octa-core (SandyBridge) Intel, PCIe Gen3, with Mellanox IB FDR
SHMEMPMI -- Shared Memory Based PMI for Performance and Scalability. S. Chakraborty, H. Subramoni, J. Perkins, and D. K. Panda, 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid '16), accepted for publication
Overview of a Few Challenges Being Addressed by the MVAPICH2 Project for Exascale

- Scalability for million to billion processors
- Collective communication
  - Offload and non-blocking
  - Topology-aware
- Integrated support for GPGPUs
- Integrated support for MICs
- Unified runtime for hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, ...)
- Virtualization
- Energy-awareness
- InfiniBand Network Analysis and Monitoring (INAM)
Co-Design with MPI-3 Non-Blocking Collectives and Collective Offload Co-Direct Hardware (Available since MVAPICH2-X 2.2a)

- Modified HPL with offloaded broadcast (Offload-Bcast) performs up to 4.5% better than the default version (512 processes)
- Modified P3DFFT with offloaded Alltoall (Offload-Alltoall) performs up to 17% better than the default version (128 processes)
- Modified Pre-Conjugate Gradient solver with offloaded Allreduce (Offload-Allreduce) performs up to 21.8% better than the default version

[Charts: normalized HPL performance vs. HPL problem size (N) as % of total memory for HPL-Offload, HPL-1ring, and HPL-Host; P3DFFT application run-time (s) vs. data size; PCG run-time (s) vs. number of processes (64-512) for PCG-Default vs. Modified-PCG-Offload]

K. Kandalla et al., High-Performance and Scalable Non-Blocking All-to-All with Collective Offload on InfiniBand Clusters: A Study with Parallel 3D FFT, ISC 2011
K. Kandalla et al., Designing Non-blocking Broadcast with Collective Offload on InfiniBand Clusters: A Case Study with HPL, HotI 2011
K. Kandalla et al., Designing Non-blocking Allreduce with Collective Offload on InfiniBand Clusters: A Case Study with Conjugate Gradient Solvers, IPDPS '12
Can Network-Offload based Non-Blocking Neighborhood MPI Collectives Improve Communication Overheads of Irregular Graph Algorithms? K. Kandalla, A. Buluc, H. Subramoni, K. Tomko, J. Vienne, L. Oliker, and D. K. Panda, IWPAPS '12
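The application modifications above rely on the MPI-3 non-blocking collectives; a minimal sketch of the overlap pattern they exploit (the computation routine is a placeholder for the application's local work, and the buffer sizes are arbitrary):

  #include <mpi.h>

  /* Placeholder for work that does not depend on the broadcast data */
  static void independent_compute(double *x, int n)
  {
      for (int i = 0; i < n; i++) x[i] *= 1.0001;
  }

  int main(int argc, char **argv)
  {
      double panel[1024] = {0}, local[1024] = {0};
      MPI_Request req;

      MPI_Init(&argc, &argv);

      /* Start the broadcast; with collective offload, progress can happen on the HCA */
      MPI_Ibcast(panel, 1024, MPI_DOUBLE, 0, MPI_COMM_WORLD, &req);

      independent_compute(local, 1024);    /* overlap: the CPU keeps computing */

      MPI_Wait(&req, MPI_STATUS_IGNORE);   /* the broadcast data is needed beyond this point */

      MPI_Finalize();
      return 0;
  }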
Network-Topology-Aware Placement of Processes

- Can we design a highly scalable network topology detection service for IB?
- How do we design the MPI communication library in a network-topology-aware manner to efficiently leverage the topology information generated by our service?
- What are the potential benefits of using a network-topology-aware MPI library on the performance of parallel scientific applications?

Results:
- Reduce network topology discovery time from O(N^2 hosts) to O(N hosts)
- 15% improvement in MILC execution time at 2,048 cores
- 15% improvement in Hypre execution time at 1,024 cores

[Charts: overall performance and split-up of physical communication for MILC on Ranger; performance for varying system sizes; default vs. topology-aware placement for the 2,048-core run]

H. Subramoni, S. Potluri, K. Kandalla, B. Barth, J. Vienne, J. Keasler, K. Tomko, K. Schulz, A. Moody, and D. K. Panda, Design of a Scalable InfiniBand Topology Service to Enable Network-Topology-Aware Placement of Processes, SC '12. Best Paper and Best Student Paper Finalist
Overview of a Few Challenges Being Addressed by the MVAPICH2 Project for Exascale

- Scalability for million to billion processors
- Collective communication
- Integrated support for GPGPUs
  - CUDA-aware MPI
  - GPUDirect RDMA (GDR) support
  - CUDA-aware non-blocking collectives
  - Support for managed memory
  - Efficient datatype processing
- Integrated support for MICs
- Unified runtime for hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, ...)
- Virtualization
- Energy-awareness
- InfiniBand Network Analysis and Monitoring (INAM)
MPI + CUDA - Naive

- Data movement in applications with standard MPI and CUDA interfaces (data path: GPU - PCIe - CPU - NIC - switch)

  At sender:
    cudaMemcpy(s_hostbuf, s_devbuf, ...);
    MPI_Send(s_hostbuf, size, ...);

  At receiver:
    MPI_Recv(r_hostbuf, size, ...);
    cudaMemcpy(r_devbuf, r_hostbuf, ...);

High productivity, but low performance.
MPI + CUDA - Advanced

- Pipelining at user level with non-blocking MPI and CUDA interfaces (data path: GPU - PCIe - CPU - NIC - switch)

  At sender:
    for (j = 0; j < pipeline_len; j++)
        cudaMemcpyAsync(s_hostbuf + j * blksz, s_devbuf + j * blksz, ...);
    for (j = 0; j < pipeline_len; j++) {
        while (result != cudaSuccess) {
            result = cudaStreamQuery(...);
            if (j > 0) MPI_Test(...);
        }
        MPI_Isend(s_hostbuf + j * blksz, blksz, ...);
    }
    MPI_Waitall(...);

  (Similar pipelining at the receiver.)

Low productivity, but high performance.
GPU-Aware MPI Library: MVAPICH2-GPU

- Standard MPI interfaces used for unified data movement:

  At sender:    MPI_Send(s_devbuf, size, ...);
  At receiver:  MPI_Recv(r_devbuf, size, ...);

  (staging and pipelining handled inside MVAPICH2)

- Takes advantage of Unified Virtual Addressing (>= CUDA 4.0)
- Overlaps data movement from the GPU with RDMA transfers

High performance and high productivity.
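A hedged end-to-end sketch of the usage above: device pointers are handed straight to MPI, and the library stages or pipelines internally. It assumes a CUDA-aware MPI build such as MVAPICH2-GPU; buffer size and error handling are illustrative and omitted, respectively.

  #include <mpi.h>
  #include <cuda_runtime.h>

  int main(int argc, char **argv)
  {
      int rank;
      const int n = 1 << 20;
      double *devbuf;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      cudaMalloc((void **)&devbuf, n * sizeof(double));   /* GPU device memory */

      if (rank == 0)
          MPI_Send(devbuf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);   /* device pointer passed directly */
      else if (rank == 1)
          MPI_Recv(devbuf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

      cudaFree(devbuf);
      MPI_Finalize();
      return 0;
  }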
GPU-Direct RDMA (GDR) with CUDA

- OFED with support for GPUDirect RDMA is developed by NVIDIA and Mellanox
- OSU has a design of MVAPICH2 using GPUDirect RDMA
  - Hybrid design using GPUDirect RDMA and host-based pipelining
  - Alleviates P2P bandwidth bottlenecks on SandyBridge and IvyBridge
    (SandyBridge E5-2670: P2P write 5.2 GB/s, P2P read < 1.0 GB/s; IvyBridge E5-2680 v2: P2P write 6.4 GB/s, P2P read 3.5 GB/s)
  - Support for communication using multi-rail
  - Support for Mellanox Connect-IB and ConnectX VPI adapters
  - Support for RoCE with Mellanox ConnectX VPI adapters

[Diagram: IB adapter, system memory, GPU memory, GPU, CPU, and chipset, showing the P2P paths]
CUDA-Aware MPI: MVAPICH2-GDR 1.8-2.2 Releases

- Support for MPI communication from NVIDIA GPU device memory
- High-performance RDMA-based inter-node point-to-point communication (GPU-GPU, GPU-Host, and Host-GPU)
- High-performance intra-node point-to-point communication for multi-GPU adapters/node (GPU-GPU, GPU-Host, and Host-GPU)
- Taking advantage of CUDA IPC (available since CUDA 4.1) in intra-node communication for multiple GPU adapters/node
- Optimized and tuned collectives for GPU device buffers
- MPI datatype support for point-to-point and collective communication from GPU device buffers
Performance of MVAPICH2-GPU with GPU-Direct RDMA (GDR)

[Charts: GPU-GPU inter-node latency, bandwidth, and bi-bandwidth vs. message size for MV2-GDR 2.2b, MV2-GDR 2.0b, and MV2 without GDR]
- Small-message latency down to 2.18 us
- Annotated improvements of roughly 10x and 11x over MV2 without GDR and about 2x over MV2-GDR 2.0b

MVAPICH2-GDR 2.2b; Intel Ivy Bridge (E5-2680 v2) node with 20 cores; NVIDIA Tesla K40c GPU; Mellanox Connect-IB dual-FDR HCA; CUDA 7; Mellanox OFED 2.4 with GPU-Direct-RDMA
Application-Level Evaluation (HOOMD-blue)

- Platform: Wilkes (Intel Ivy Bridge + NVIDIA Tesla K20c + Mellanox Connect-IB)
- HOOMD-blue version 1.0.5
- GDRCOPY enabled: MV2_USE_CUDA=1 MV2_IBA_HCA=mlx5_0 MV2_IBA_EAGER_THRESHOLD=32768 MV2_VBUF_TOTAL_SIZE=32768 MV2_USE_GPUDIRECT_LOOPBACK_LIMIT=32768 MV2_USE_GPUDIRECT_GDRCOPY=1 MV2_USE_GPUDIRECT_GDRCOPY_LIMIT=16384

[Charts: average time steps per second (TPS) vs. number of processes (4-32) for 64K particles and 256K particles, MV2 vs. MV2+GDR; about 2x improvement in both cases]
CUDA-Aware Non-Blocking Collectives

- Available since MVAPICH2-GDR 2.2a

[Charts: medium/large message overlap (%) vs. message size (4K-1M bytes) on 64 GPU nodes for Igather and Ialltoall, with 1 process/node and with 2 processes/node (1 process/GPU)]

Platform: Wilkes (Intel Ivy Bridge, NVIDIA Tesla K20c + Mellanox Connect-IB)

A. Venkatesh, K. Hamidouche, H. Subramoni, and D. K. Panda, Offloaded GPU Collectives using CORE-Direct and CUDA Capabilities on IB Clusters, HiPC 2015
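A hedged sketch of the pattern evaluated above: a non-blocking collective issued on GPU device buffers, with room for independent GPU or CPU work before the wait. It assumes a GDR-enabled, CUDA-aware MPI build that accepts device pointers in collectives; counts and types are illustrative.

  #include <mpi.h>
  #include <cuda_runtime.h>

  int main(int argc, char **argv)
  {
      int rank, size;
      MPI_Request req;
      float *d_send, *d_recv = NULL;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      cudaMalloc((void **)&d_send, 1024 * sizeof(float));
      if (rank == 0)
          cudaMalloc((void **)&d_recv, 1024 * sizeof(float) * size);

      /* Non-blocking gather on device buffers; independent CUDA kernels or CPU work
         could run between the Igather and the Wait to realize the overlap shown above */
      MPI_Igather(d_send, 1024, MPI_FLOAT, d_recv, 1024, MPI_FLOAT, 0, MPI_COMM_WORLD, &req);
      MPI_Wait(&req, MPI_STATUS_IGNORE);

      cudaFree(d_send);
      if (rank == 0) cudaFree(d_recv);
      MPI_Finalize();
      return 0;
  }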
Communication Runtime with GPU Managed Memory

- In CUDA 6.0, NVIDIA introduced CUDA Managed (or Unified) Memory, allowing a common memory allocation for GPU or CPU through the cudaMallocManaged() call
- Significant productivity benefits due to abstraction of explicit allocation and cudaMemcpy()
- Extended MVAPICH2 to perform communication directly from managed buffers (available in MVAPICH2-GDR 2.2b)
- OSU micro-benchmarks extended to evaluate the performance of point-to-point and collective communication using managed buffers; available in OMB 5.2

[Chart: 2D stencil halo exchange time (ms) for halo width = 1 vs. total dimension size (bytes), device buffers vs. managed buffers]

D. S. Banerjee, K. Hamidouche, and D. K. Panda, Designing High Performance Communication Runtime for GPU Managed Memory: Early Experiences, GPGPU-9 Workshop, to be held in conjunction with PPoPP '16
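A hedged sketch of the managed-memory path described above: one allocation is touched by a GPU kernel and then handed directly to MPI, assuming an MVAPICH2-GDR build that recognizes managed buffers. The kernel, sizes, and launch configuration are illustrative.

  #include <mpi.h>
  #include <cuda_runtime.h>

  __global__ void fill(double *p, int n)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) p[i] = i;
  }

  int main(int argc, char **argv)
  {
      int rank;
      const int n = 4096;
      double *buf;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      cudaMallocManaged(&buf, n * sizeof(double));   /* single allocation visible to CPU and GPU */

      if (rank == 0) {
          fill<<<(n + 255) / 256, 256>>>(buf, n);    /* produce data on the GPU */
          cudaDeviceSynchronize();
          MPI_Send(buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);   /* managed pointer passed directly */
      } else if (rank == 1) {
          MPI_Recv(buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
      }

      cudaFree(buf);
      MPI_Finalize();
      return 0;
  }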
MPI Datatype Processing (Communication Optimization)

Common scenario (Buf1, Buf2, ... contain non-contiguous MPI datatypes):

  MPI_Isend(A, ..., Datatype, ...);
  MPI_Isend(B, ..., Datatype, ...);
  MPI_Isend(C, ..., Datatype, ...);
  MPI_Isend(D, ..., Datatype, ...);
  ...
  MPI_Waitall(...);

- Existing design: for each Isend, the CPU initiates a packing kernel on a CUDA stream, waits for the kernel (WFK), and only then starts the send; this serializes the Isends and wastes computing resources on both the CPU and the GPU
- Proposed design: kernel initiation, wait-for-kernel, and send start are decoupled and progressed asynchronously, so multiple Isends overlap and the sequence finishes earlier than in the existing design

[Timeline diagram: CPU and GPU activity (Initiate Kernel, Kernel on Stream, WFK, Start Send, Wait, Progress) for the existing vs. proposed designs, with the expected benefit shown as an earlier finish]
Application-Level Evaluation (Halo Exchange - Cosmo)

- 2x improvement on 32 GPU nodes
- 30% improvement on 96 GPU nodes (8 GPUs/node)

[Charts: normalized execution time vs. number of GPUs on the Wilkes GPU cluster (4-32 GPUs) and the CSCS GPU cluster (16-96 GPUs) for the default, callback-based, and event-based designs]

C. Chu, K. Hamidouche, A. Venkatesh, D. Banerjee, H. Subramoni, and D. K. Panda, Exploiting Maximal Overlap for Non-Contiguous Data Movement Processing on Modern GPU-enabled Systems, IPDPS '16
Overview of a Few Challenges Being Addressed by the MVAPICH2 Project for Exascale

- Scalability for million to billion processors
- Collective communication
- Integrated support for GPGPUs
- Integrated support for MICs
- Unified runtime for hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, ...)
- Virtualization
- Energy-awareness
- InfiniBand Network Analysis and Monitoring (INAM)
MPI Applications on MIC Clusters

- Flexibility in launching MPI jobs on clusters with Xeon Phi:
  - Host-only: MPI program on the Xeon host only (multi-core centric)
  - Offload (/reverse offload): MPI program on the host with computation offloaded to the Xeon Phi
  - Symmetric: MPI programs on both the host and the Xeon Phi
  - Coprocessor-only: MPI program on the Xeon Phi only (many-core centric)
MVAPICH2-MIC 2.0 Design for Clusters with IB and MIC

- Offload mode
- Intra-node communication
- Coprocessor-only and symmetric modes
- Inter-node communication
- Coprocessor-only and symmetric modes
- Multi-MIC node configurations
- Running on three major systems: Stampede, Blueridge (Virginia Tech), and Beacon (UTK)
MIC-Remote-MIC P2P Communication with Proxy-based Communication

[Charts: intra-socket P2P and inter-socket P2P latency (large messages, 8K-2M bytes) and bandwidth vs. message size (1 byte to 1M); peak bandwidths of about 5,236 and 5,594 MB/sec for the two cases]
Optimized MPI Collectives for MIC Clusters (Allgather & Alltoall)

[Charts: 32-node Allgather small-message latency (16H + 16M), 32-node Allgather large-message latency (8H + 8M), and 32-node Alltoall large-message latency (8H + 8M), MV2-MIC vs. MV2-MIC-Opt; P3DFFT execution time (communication and computation) on 32 nodes (8H + 8M), size = 2K x 2K x 1K; annotated improvements of 76%, 58%, and 55%]

A. Venkatesh, S. Potluri, R. Rajachandrasekar, M. Luo, K. Hamidouche and D. K. Panda, High Performance Alltoall and Allgather Designs for InfiniBand MIC Clusters, IPDPS '14, May 2014
Overview of a Few Challenges Being Addressed by the MVAPICH2 Project for Exascale

- Scalability for million to billion processors
- Collective communication
- Integrated support for GPGPUs
- Integrated support for MICs
- Unified runtime for hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, ...)
- Virtualization
- Energy-awareness
- InfiniBand Network Analysis and Monitoring (INAM)
MVAPICH2-X for Advanced MPI and Hybrid MPI+PGAS Applications

- MPI, OpenSHMEM, UPC, CAF, or hybrid (MPI + PGAS) applications: OpenSHMEM calls, MPI calls, UPC calls, and CAF calls all run over the unified MVAPICH2-X runtime on InfiniBand, RoCE, or iWARP
- Unified communication runtime for MPI, UPC, OpenSHMEM, and CAF, available with MVAPICH2-X 1.9 (2012) onwards!
- UPC++ support will be available in the upcoming MVAPICH2-X 2.2 RC1
- Feature highlights:
  - Supports MPI(+OpenMP), OpenSHMEM, UPC, CAF, MPI(+OpenMP)+OpenSHMEM, MPI(+OpenMP)+UPC+CAF
  - MPI-3 compliant, OpenSHMEM v1.0 standard compliant, UPC v1.2 standard compliant (with initial support for UPC 1.3), CAF 2008 standard (OpenUH)
  - Scalable inter-node and intra-node communication - point-to-point and collectives
Application-Level Performance with Graph500 and Sort

Graph500 execution time:
- Performance of the hybrid (MPI + OpenSHMEM) Graph500 design
- 8,192 processes: 2.4x improvement over MPI-CSR, 7.6x improvement over MPI-Simple
- 16,384 processes: 1.5x improvement over MPI-CSR, 13x improvement over MPI-Simple
[Chart: execution time (s) vs. number of processes (4K, 8K, 16K) for MPI-Simple, MPI-CSC, MPI-CSR, and Hybrid (MPI+OpenSHMEM)]

Sort execution time:
- Performance of the hybrid (MPI + OpenSHMEM) Sort application
- 4,096 processes, 4 TB input size: MPI 2408 sec (0.16 TB/min); Hybrid 1172 sec (0.36 TB/min); 51% improvement over the MPI design
[Chart: execution time (seconds) vs. input data / number of processes (500 GB-512 to 4 TB-4K) for MPI and Hybrid]

J. Jose, S. Potluri, K. Tomko and D. K. Panda, Designing Scalable Graph500 Benchmark with Hybrid MPI+OpenSHMEM Programming Models, International Supercomputing Conference (ISC '13), June 2013
J. Jose, K. Kandalla, M. Luo and D. K. Panda, Supporting Hybrid MPI and OpenSHMEM over InfiniBand: Design and Performance Evaluation, Int'l Conference on Parallel Processing (ICPP '12), September 2012
J. Jose, K. Kandalla, S. Potluri, J. Zhang and D. K. Panda, Optimizing Collective Communication in OpenSHMEM, Int'l Conference on Partitioned Global Address Space Programming Models (PGAS '13), October 2013
MiniMD - Total Execution Time

- The hybrid design performs better than the MPI implementation
- 1,024 processes: 17% improvement over the MPI version
- Strong scaling; input size: 128 x 128 x 128

[Charts: execution time (ms) vs. number of cores (512 and 1,024 for performance; 256, 512, and 1,024 for strong scaling) for Hybrid-Barrier, MPI-Original, and Hybrid-Advanced]

M. Li, J. Lin, X. Lu, K. Hamidouche, K. Tomko and D. K. Panda, Scalable MiniMD Design with Hybrid MPI and OpenSHMEM, OpenSHMEM User Group Meeting (OUG '14), held in conjunction with the 8th International Conference on Partitioned Global Address Space Programming Models (PGAS '14)
Hybrid MPI+UPC NAS-FT

- Modified the NAS FT UPC all-to-all pattern to use MPI_Alltoall
- Truly hybrid program
- For FT (Class C, 128 processes):
  - 34% improvement over UPC-GASNet
  - 30% improvement over UPC-OSU
- Hybrid MPI+UPC support available since MVAPICH2-X 1.9 (2012)

[Chart: time (s) for NAS problem size / system size B-64, C-64, B-128, C-128 for UPC-GASNet, UPC-OSU, and Hybrid-OSU]

J. Jose, M. Luo, S. Sur and D. K. Panda, Unifying UPC and MPI Runtimes: Experience with MVAPICH, Fourth Conference on Partitioned Global Address Space Programming Model (PGAS '10), October 2010
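A strongly hedged sketch of the hybrid idea behind the modified NAS-FT: keep the UPC shared-array data layout, but perform the dense exchange with MPI_Alltoall. This assumes a unified runtime (as in MVAPICH2-X) where UPC threads and MPI ranks coincide and MPI can be initialized alongside the UPC runtime; array names, sizes, and the divisibility of NLOCAL by THREADS are illustrative assumptions.

  #include <upc.h>
  #include <mpi.h>

  #define NLOCAL 1024

  shared [NLOCAL] double src[NLOCAL * THREADS];   /* UPC shared arrays used by the kernel */
  shared [NLOCAL] double dst[NLOCAL * THREADS];

  int main(int argc, char **argv)
  {
      MPI_Init(&argc, &argv);      /* assumption: interoperable with the UPC runtime */

      /* Private pointers into this thread's own blocks of the shared arrays */
      double *my_src = (double *)&src[MYTHREAD * NLOCAL];
      double *my_dst = (double *)&dst[MYTHREAD * NLOCAL];

      upc_barrier;

      /* The all-to-all transpose done with one MPI collective instead of
         fine-grained per-thread UPC copies (assumes NLOCAL % THREADS == 0) */
      MPI_Alltoall(my_src, NLOCAL / THREADS, MPI_DOUBLE,
                   my_dst, NLOCAL / THREADS, MPI_DOUBLE, MPI_COMM_WORLD);

      upc_barrier;
      MPI_Finalize();
      return 0;
  }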
Overview of a Few Challenges Being Addressed by the MVAPICH2 Project for Exascale

- Scalability for million to billion processors
- Collective communication
- Integrated support for GPGPUs
- Integrated support for MICs
- Unified runtime for hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, ...)
- Virtualization
- Energy-awareness
- InfiniBand Network Analysis and Monitoring (INAM)
Can HPC and Virtualization be Combined?

- Virtualization has many benefits: fault-tolerance, job migration, compaction
- It has not been very popular in HPC due to the overhead associated with virtualization
- New SR-IOV (Single Root - I/O Virtualization) support available with Mellanox InfiniBand adapters changes the field
- Enhanced MVAPICH2 support for SR-IOV
- MVAPICH2-Virt 2.1 (with and without OpenStack) is publicly available

J. Zhang, X. Lu, J. Jose, R. Shi and D. K. Panda, Can Inter-VM Shmem Benefit MPI Applications on SR-IOV based Virtualized InfiniBand Clusters? EuroPar '14
J. Zhang, X. Lu, J. Jose, M. Li, R. Shi and D. K. Panda, High Performance MPI Library over SR-IOV enabled InfiniBand Clusters, HiPC '14
J. Zhang, X. Lu, M. Arnold and D. K. Panda, MVAPICH2 over OpenStack with SR-IOV: an Efficient Approach to Build HPC Clouds, CCGrid '15
Overview of MVAPICH2-Virt with SR-IOV and IVSHMEM

- Redesign MVAPICH2 to make it virtual-machine aware
  - SR-IOV shows near-to-native performance for inter-node point-to-point communication
  - IVSHMEM offers zero-copy access to data on the shared memory of co-resident VMs
  - Locality detector: maintains the locality information of co-resident virtual machines
  - Communication coordinator: selects the communication channel (SR-IOV, IVSHMEM) adaptively

[Diagram: host environment with hypervisor, PF driver, and InfiniBand adapter (physical function); two guest VMs, each with an MPI process and a VF driver; the IV-Shmem channel through /dev/shm and the SR-IOV channel through virtual functions]

J. Zhang, X. Lu, J. Jose, R. Shi, D. K. Panda, Can Inter-VM Shmem Benefit MPI Applications on SR-IOV based Virtualized InfiniBand Clusters? Euro-Par, 2014
J. Zhang, X. Lu, J. Jose, R. Shi, M. Li, D. K. Panda, High Performance MPI Library over SR-IOV Enabled InfiniBand Clusters, HiPC, 2014
MVAPICH2-Virt with SR-IOV and IVSHMEM over OpenStack

- OpenStack is one of the most popular open-source solutions to build clouds and manage virtual machines
- Deployment with OpenStack
  - Supporting SR-IOV configuration
  - Supporting IVSHMEM configuration
  - Virtual-machine-aware design of MVAPICH2 with SR-IOV
- An efficient approach to build HPC clouds with MVAPICH2-Virt and OpenStack

[Diagram: OpenStack services around a VM - Nova provisions, Glance provides images, Swift stores images, Cinder provides volumes and backs up volumes, Neutron provides the network, Keystone provides authentication, Ceilometer monitors, Horizon provides the UI, Heat orchestrates the cloud]

J. Zhang, X. Lu, M. Arnold, D. K. Panda, MVAPICH2 over OpenStack with SR-IOV: An Efficient Approach to Build HPC Clouds, CCGrid, 2015
Application-Level Performance on Chameleon

- 32 VMs, 6 cores/VM
- Compared to native, 2-5% overhead for Graph500 with 128 processes
- Compared to native, 1-9.5% overhead for SPEC MPI2007 with 128 processes

[Charts: SPEC MPI2007 execution time (s) for milc, leslie3d, pop2, GAPgeofem, zeusmp2, and lu; Graph500 execution time (ms) for problem sizes (scale, edge factor) of (22,20), (24,10), (24,16), (24,20), (26,10), and (26,16); MV2-SR-IOV-Def vs. MV2-SR-IOV-Opt vs. MV2-Native]
NSF Chameleon Cloud: A Powerful and Flexible Experimental Instrument

- Large-scale instrument
  - Targeting Big Data, Big Compute, Big Instrument research
  - ~650 nodes (~14,500 cores), 5 PB disk over two sites, the two sites connected with a 100G network
- Reconfigurable instrument
  - Bare-metal reconfiguration, operated as a single instrument, graduated approach for ease-of-use
- Connected instrument
  - Workload and trace archive
  - Partnerships with production clouds: CERN, OSDC, Rackspace, Google, and others
  - Partnerships with users
- Complementary instrument
  - Complementing GENI, Grid'5000, and other testbeds
- Sustainable instrument
  - Industry connections

http://www.chameleoncloud.org/
Overview of a Few Challenges Being Addressed by the MVAPICH2 Project for Exascale

- Scalability for million to billion processors
- Collective communication
- Integrated support for GPGPUs
- Integrated support for MICs
- Unified runtime for hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, ...)
- Virtualization
- Energy-awareness
- InfiniBand Network Analysis and Monitoring (INAM)
Energy-Aware MVAPICH2 & OSU Energy Management Tool (OEMT)

- MVAPICH2-EA 2.1 (Energy-Aware)
  - A white-box approach
  - New energy-efficient communication protocols for point-to-point and collective operations
  - Intelligently applies the appropriate energy-saving techniques
  - Application-oblivious energy saving
- OEMT
  - A library utility to measure energy consumption for MPI applications
  - Works with all MPI runtimes
  - PRELOAD option for precompiled applications
  - Does not require ROOT permission: a safe kernel module to read only a subset of MSRs
MVAPICH2-EA: Application-Oblivious Energy-Aware MPI (EAM)

- An energy-efficient runtime that provides energy savings without application knowledge
- Automatically and transparently uses the best energy lever
- Provides guarantees on maximum degradation, with 5-41% savings at <=5% degradation
- A pessimistic MPI applies the energy-reduction lever to each MPI call

A Case for Application-Oblivious Energy-Efficient MPI Runtime. A. Venkatesh, A. Vishnu, K. Hamidouche, N. Tallent, D. K. Panda, D. Kerbyson, and A. Hoisie, Supercomputing '15, Nov 2015 [Best Student Paper Finalist]
Overview of a Few Challenges Being Addressed by the MVAPICH2 Project for Exascale

- Scalability for million to billion processors
- Collective communication
- Integrated support for GPGPUs
- Integrated support for MICs
- Unified runtime for hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, ...)
- Virtualization
- Energy-awareness
- InfiniBand Network Analysis and Monitoring (INAM)
Overview of OSU INAM

- OSU INAM monitors IB clusters in real time by querying various subnet management entities in the network
- Major features of the OSU INAM tool include:
  - Analyze and profile network-level activities with many parameters (data and errors) at user-specified granularity
  - Capability to analyze and profile node-level, job-level, and process-level activities for MPI communication (point-to-point, collectives, and RMA)
  - Remotely monitor CPU utilization of MPI processes at user-specified granularity
  - Visualize the data transfer happening in a "live" fashion - Live View for:
    - the entire network (Live Network Level View)
    - a particular job (Live Job Level View)
    - one or multiple nodes (Live Node Level View)
  - Capability to visualize data transfer that happened in the network over a time duration in the past:
    - the entire network (Historical Network Level View)
    - a particular job (Historical Job Level View)
    - one or multiple nodes (Historical Node Level View)
OSU INAM - Network Level View

- Show the network topology of large clusters
- Visualize traffic patterns on different links
- Quickly identify congested links / links in error state
- See the history unfold - play back historical state of the network

[Screenshots: full network (152 nodes) and a zoomed-in view of the network]
OSU INAM - Job and Node Level Views

- Job level view
  - Show different network metrics (load, error, etc.) for any live job
  - Play back historical data for completed jobs to identify bottlenecks
- Node level view provides details per processor per node
  - CPU utilization for each rank/node
  - Bytes sent/received for MPI operations (point-to-point, collective, RMA)
  - Network metrics (e.g., XmitDiscard, RcvError) per rank/node

[Screenshots: visualizing a job (5 nodes) and finding routes between nodes]
MVAPICH2 - Plans for Exascale

- Performance and memory scalability toward 1M cores
- Hybrid programming (MPI + OpenSHMEM, MPI + UPC, MPI + CAF, ...)
  - Support for task-based parallelism (UPC++)
- Enhanced optimization for GPU support and accelerators
- Taking advantage of advanced features
  - User-Mode Memory Registration (UMR)
  - On-Demand Paging (ODP)
- Enhanced inter-node and intra-node communication schemes for upcoming Omni-Path and Knights Landing architectures
- Extended RMA support (as in MPI 3.0)
- Extended topology-aware collectives
- Energy-aware point-to-point (one-sided and two-sided) and collectives
- Extended support for the MPI Tools interface (as in MPI 3.0)
- Extended checkpoint-restart and migration support with SCR
Looking into the Future...

- Exascale systems will be constrained by
  - Power
  - Memory per core
  - Data-movement cost
  - Faults
- Programming models and runtimes for HPC need to be designed for
  - Scalability
  - Performance
  - Fault-resilience
  - Energy-awareness
  - Programmability
  - Productivity
- This talk highlighted some of the issues and challenges
- Continuous innovation is needed on all these fronts
Funding Acknowledgments

Funding support by: [sponsor logos]
Equipment support by: [vendor logos]
Personnel Acknowledgments

Current Students: A. Augustine (M.S.), A. Awan (Ph.D.), S. Chakraborthy (Ph.D.), C.-H. Chu (Ph.D.), N. Islam (Ph.D.), M. Li (Ph.D.), K. Kulkarni (M.S.), M. Rahman (Ph.D.), D. Shankar (Ph.D.), A. Venkatesh (Ph.D.), J. Zhang (Ph.D.)

Past Students: P. Balaji (Ph.D.), S. Bhagvat (M.S.), A. Bhat (M.S.), D. Buntinas (Ph.D.), L. Chai (Ph.D.), B. Chandrasekharan (M.S.), N. Dandapanthula (M.S.), V. Dhanraj (M.S.), T. Gangadharappa (M.S.), K. Gopalakrishnan (M.S.), W. Huang (Ph.D.), W. Jiang (M.S.), J. Jose (Ph.D.), S. Kini (M.S.), M. Koop (Ph.D.), R. Kumar (M.S.), S. Krishnamoorthy (M.S.), K. Kandalla (Ph.D.), P. Lai (M.S.), J. Liu (Ph.D.), M. Luo (Ph.D.), A. Mamidala (Ph.D.), G. Marsh (M.S.), V. Meshram (M.S.), A. Moody (M.S.), S. Naravula (Ph.D.), R. Noronha (Ph.D.), X. Ouyang (Ph.D.), S. Pai (M.S.), S. Potluri (Ph.D.), R. Rajachandrasekar (Ph.D.), G. Santhanaraman (Ph.D.), A. Singh (Ph.D.), J. Sridhar (M.S.), S. Sur (Ph.D.), H. Subramoni (Ph.D.), K. Vaidyanathan (Ph.D.), A. Vishnu (Ph.D.), J. Wu (Ph.D.), W. Yu (Ph.D.)

Current Research Scientists: H. Subramoni, X. Lu
Current Senior Research Associate: K. Hamidouche
Current Post-Docs: J. Lin, D. Banerjee
Current Programmer: J. Perkins
Current Research Specialist: M. Arnold

Past Research Scientist: S. Sur
Past Post-Docs: H. Wang, X. Besseron, H.-W. Jin, M. Luo, E. Mancini, S. Marcarelli, J. Vienne
Past Programmers: D. Bureddy
International Workshop on Communication Architectures at Extreme Scale (ExaComm)

- ExaComm 2015 was held with the Int'l Supercomputing Conference (ISC '15), in Frankfurt, Germany, on Thursday, July 16th, 2015
  - One keynote talk: John M. Shalf, CTO, LBL/NERSC
  - Four invited talks: Dror Goldenberg (Mellanox), Martin Schulz (LLNL), Cyriel Minkenberg (IBM-Zurich), Arthur (Barney) Maccabe (ORNL)
  - Panel: Ron Brightwell (Sandia)
  - Two research papers
- ExaComm 2016 will be held in conjunction with ISC '16
  - http://web.cse.ohio-state.edu/~subramon/ExaComm16/exacomm16.html
  - Technical paper submission deadline: Friday, April 15, 2016
Thank You!
panda@cse.ohio-state.edu

Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
The MVAPICH2 Project: http://mvapich.cse.ohio-state.edu/
The High-Performance Big Data Project: http://hibd.cse.ohio-state.edu/