
Programming Models for Exascale Systems


Page 1: Programming Models for Exascale Systems

Programming Models for Exascale Systems

Dhabaleswar K. (DK) Panda
The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~panda

Keynote Talk at HPCAC-Stanford (Feb 2016)

Page 2: Programming Models for Exascale Systems

High-End Computing (HEC): ExaFlop & ExaByte

ExaFlop & HPC:
- 100-200 PFlops in 2016-2018
- 1 EFlops in 2020-2024?

ExaByte & Big Data:
- 10K-20K EBytes in 2016-2018
- 40K EBytes in 2020?

[Figure 1: growth of the digital universe. Source: IDC's Digital Universe Study, sponsored by EMC, December 2012]

Within these broad outlines of the digital universe are some singularities worth noting.

First, while the portion of the digital universe holding potential analytic value is growing, only a tiny fraction of territory has been explored. IDC estimates that by 2020, as much as 33% of the digital universe will contain information that might be valuable if analyzed, compared with 25% today. This untapped value could be found in patterns in social media usage, correlations in scientific data from discrete studies, medical information intersected with sociological data, faces in security footage, and so on. However, even with a generous estimate, the amount of information in the digital universe that is "tagged" accounts for only about 3% of the digital universe in 2012, and that which is analyzed is half a percent of the digital universe. Herein is the promise of "Big Data" technology — the extraction of value from the large untapped pools of data in the digital universe.

Page 3: Programming Models for Exascale Systems

Trends for Commodity Computing Clusters in the Top500 List (http://www.top500.org)

[Figure: timeline of the number of clusters and the percentage of clusters in the Top500; the commodity-cluster share reaches 85%.]

Page 4: Programming Models for Exascale Systems

Drivers of Modern HPC Cluster Architectures

- Multi-core/many-core technologies
- Remote Direct Memory Access (RDMA)-enabled networking (InfiniBand and RoCE)
- Solid State Drives (SSDs), Non-Volatile Random-Access Memory (NVRAM), NVMe-SSD
- Accelerators (NVIDIA GPGPUs and Intel Xeon Phi)

[Figure: example systems Tianhe-2, Titan, Stampede, and Tianhe-1A, built from multi-core processors; high-performance interconnects (InfiniBand: <1 usec latency, 100 Gbps bandwidth); accelerators/coprocessors (high compute density, high performance/watt, >1 TFlop DP on a chip); and SSD, NVMe-SSD, NVRAM.]

Page 5: Programming Models for Exascale Systems

Large-scale InfiniBand Installations

- 235 IB clusters (47%) in the Nov 2015 Top500 list (http://www.top500.org)
- Installations in the Top 50 (21 systems):

  462,462 cores (Stampede) at TACC (10th)                    76,032 cores (Tsubame 2.5) at Japan/GSIC (25th)
  185,344 cores (Pleiades) at NASA/Ames (13th)               194,616 cores (Cascade) at PNNL (27th)
  72,800 cores Cray CS-Storm in US (15th)                    76,032 cores (Makman-2) at Saudi Aramco (32nd)
  72,800 cores Cray CS-Storm in US (16th)                    110,400 cores (Pangea) in France (33rd)
  265,440 cores SGI ICE at Tulip Trading Australia (17th)    37,120 cores (Lomonosov-2) at Russia/MSU (35th)
  124,200 cores (Topaz) SGI ICE at ERDC DSRC in US (18th)    57,600 cores (SwiftLucy) in US (37th)
  72,000 cores (HPC2) in Italy (19th)                        55,728 cores (Prometheus) at Poland/Cyfronet (38th)
  152,692 cores (Thunder) at AFRL/USA (21st)                 50,544 cores (Occigen) at France/GENCI-CINES (43rd)
  147,456 cores (SuperMUC) in Germany (22nd)                 76,896 cores (Salomon) SGI ICE in Czech Republic (47th)
  86,016 cores (SuperMUC Phase 2) in Germany (24th)          and many more!

Page 6: Programming Models for Exascale Systems

Towards Exascale System (Today and Target)

Systems                     | 2016 (Tianhe-2)                       | 2020-2024                      | Difference Today & Exascale
System peak                 | 55 PFlop/s                            | 1 EFlop/s                      | ~20x
Power                       | 18 MW (3 Gflops/W)                    | ~20 MW (50 Gflops/W)           | O(1) ~15x
System memory               | 1.4 PB (1.024 PB CPU + 0.384 PB CoP)  | 32-64 PB                       | ~50x
Node performance            | 3.43 TF/s (0.4 CPU + 3 CoP)           | 1.2 or 15 TF                   | O(1)
Node concurrency            | 24-core CPU + 171 cores CoP           | O(1k) or O(10k)                | ~5x - ~50x
Total node interconnect BW  | 6.36 GB/s                             | 200-400 GB/s                   | ~40x - ~60x
System size (nodes)         | 16,000                                | O(100,000) or O(1M)            | ~6x - ~60x
Total concurrency           | 3.12M (12.48M threads, 4/core)        | O(billion) for latency hiding  | ~100x
MTTI                        | Few/day                               | Many/day                       | O(?)

Courtesy: Prof. Jack Dongarra

Page 7: Programming Models for Exascale Systems

Two Major Categories of Applications

- Scientific Computing
  - Message Passing Interface (MPI), including MPI+OpenMP, is the dominant programming model
  - Many discussions towards Partitioned Global Address Space (PGAS): UPC, OpenSHMEM, CAF, etc.
  - Hybrid programming: MPI + PGAS (OpenSHMEM, UPC)
- Big Data/Enterprise/Commercial Computing
  - Focuses on large data and data analysis
  - Hadoop (HDFS, HBase, MapReduce)
  - Spark is emerging for in-memory computing
  - Memcached is also used for Web 2.0

Page 8: Programming Models for Exascale Systems

Parallel Programming Models Overview

[Figure: three abstract machine models — Shared Memory Model (processes P1-P3 over one shared memory; SHMEM, DSM), Distributed Memory Model (P1-P3 each with private memory; MPI - Message Passing Interface), and Partitioned Global Address Space (PGAS) (private memories plus a logical shared memory; Global Arrays, UPC, Chapel, X10, CAF, ...).]

- Programming models provide abstract machine models
- Models can be mapped onto different types of systems (e.g., Distributed Shared Memory (DSM), MPI within a node, etc.)
- PGAS models and hybrid MPI+PGAS models are gradually gaining importance (a minimal message-passing sketch follows)
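To make the distributed-memory model concrete, here is a minimal sketch of two-sided message passing with standard MPI calls; it is an illustrative example added here, not code from the talk.

    #include <mpi.h>
    #include <stdio.h>

    /* Minimal two-sided message passing: rank 0 sends an integer to rank 1. */
    int main(int argc, char **argv)
    {
        int rank, value = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 42;
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* explicit send */
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);                          /* matching receive */
            printf("Rank 1 received %d\n", value);
        }

        MPI_Finalize();
        return 0;
    }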

Page 9: Programming Models for Exascale Systems

Partitioned Global Address Space (PGAS) Models

- Key features
  - Simple shared-memory abstractions
  - Lightweight one-sided communication
  - Easier to express irregular communication
- Different approaches to PGAS
  - Languages
    - Unified Parallel C (UPC)
    - Co-Array Fortran (CAF)
    - X10
    - Chapel
  - Libraries
    - OpenSHMEM
    - Global Arrays

(A small OpenSHMEM sketch follows.)
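As a library-based PGAS illustration, the following is a minimal OpenSHMEM sketch (not from the slides) using a one-sided put to a symmetric variable; it assumes an OpenSHMEM 1.2-style API (shmem_init/shmem_finalize).

    #include <shmem.h>
    #include <stdio.h>

    static int counter = 0;   /* symmetric variable: exists on every PE */

    int main(void)
    {
        shmem_init();
        int me   = shmem_my_pe();
        int npes = shmem_n_pes();

        int value = me;
        /* One-sided put: write my PE number into 'counter' on my right
           neighbor, without the target PE posting a matching receive. */
        shmem_int_put(&counter, &value, 1, (me + 1) % npes);

        shmem_barrier_all();  /* ensure all puts are complete and visible */
        printf("PE %d received %d\n", me, counter);

        shmem_finalize();
        return 0;
    }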

Page 10: Programming Models for Exascale Systems

Hybrid (MPI+PGAS) Programming

- Application sub-kernels can be re-written in MPI/PGAS based on communication characteristics
- Benefits:
  - Best of the distributed-memory computing model
  - Best of the shared-memory computing model
  (a hybrid MPI+OpenSHMEM sketch follows the figure below)
- Exascale Roadmap*: "Hybrid Programming is a practical way to program exascale systems"

* The International Exascale Software Roadmap, Dongarra, J., Beckman, P. et al., Volume 25, Number 1, 2011, International Journal of High Performance Computer Applications, ISSN 1094-3420

[Figure: an HPC application composed of kernels 1..N, where individual kernels (e.g., Kernel 2, Kernel N) are implemented in PGAS while the remaining kernels stay in MPI.]
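A conceptual sketch of this structure, assuming a unified runtime (such as MVAPICH2-X) in which every process is both an MPI rank and an OpenSHMEM PE; the kernel bodies are illustrative, not application code from the talk.

    #include <mpi.h>
    #include <shmem.h>

    #define HISTO_BINS 1024

    static long histogram[HISTO_BINS];   /* symmetric array: one-sided updates land here */

    /* Bulk-synchronous exchange expressed in MPI, irregular updates in OpenSHMEM. */
    void hybrid_step(long *keys, long *recv_keys, int n, int peer, int npes)
    {
        /* Kernel 1: regular pairwise exchange with two-sided MPI */
        MPI_Sendrecv(keys, n, MPI_LONG, peer, 0,
                     recv_keys, n, MPI_LONG, peer, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        /* Kernel 2: irregular, key-dependent updates with one-sided
           OpenSHMEM atomics -- no matching receive on the target PE */
        for (int i = 0; i < n; i++) {
            int target_pe  = (int)(recv_keys[i] % npes);
            int target_bin = (int)(recv_keys[i] % HISTO_BINS);
            shmem_long_inc(&histogram[target_bin], target_pe);
        }
        shmem_barrier_all();   /* complete all remote increments before reading */
    }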

Page 11: Programming Models for Exascale Systems

Designing Communication Libraries for Multi-Petaflop and Exaflop Systems: Challenges

[Figure: layered co-design view. Application kernels/applications sit on top of programming models (MPI, PGAS (UPC, Global Arrays, OpenSHMEM), CUDA, OpenMP, OpenACC, Cilk, Hadoop (MapReduce), Spark (RDD, DAG), etc.), which rely on a communication library or runtime providing point-to-point communication, collective communication, energy-awareness, synchronization and locks, I/O and file systems, and fault tolerance. These run over networking technologies (InfiniBand, 40/100 GigE, Aries, and Omni-Path), multi/many-core architectures, and accelerators (NVIDIA and MIC). Middleware co-design exposes opportunities and challenges across the layers: performance, scalability, and fault-resilience.]

Page 12: Programming Models for Exascale Systems

Broad Challenges in Designing Communication Libraries for (MPI+X) at Exascale

- Scalability for million to billion processors
  - Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided)
  - Scalable job start-up
- Scalable collective communication
  - Offload
  - Non-blocking
  - Topology-aware
- Balancing intra-node and inter-node communication for next-generation nodes (128-1024 cores)
  - Multiple end-points per node
- Support for efficient multi-threading
- Integrated support for GPGPUs and accelerators
- Fault-tolerance/resiliency
- QoS support for communication and I/O
- Support for hybrid MPI+PGAS programming (MPI+OpenMP, MPI+UPC, MPI+OpenSHMEM, CAF, ...)
- Virtualization
- Energy-awareness

Page 13: Programming Models for Exascale Systems

Additional Challenges for Designing Exascale Software Libraries

- Extreme low memory footprint
  - Memory per core continues to decrease
- D-L-A framework
  - Discover
    - Overall network topology (fat-tree, 3D, ...), network topology for processes of a given job
    - Node architecture, health of network and node
  - Learn
    - Impact on performance and scalability
    - Potential for failure
  - Adapt
    - Internal protocols and algorithms
    - Process mapping
    - Fault-tolerance solutions
  - Low-overhead techniques while delivering performance, scalability, and fault-tolerance

Page 14: Programming Models for Exascale Systems

Overview of the MVAPICH2 Project

- High-performance open-source MPI library for InfiniBand, 10-40 Gig/iWARP, and RDMA over Converged Enhanced Ethernet (RoCE)
  - MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.0), available since 2002
  - MVAPICH2-X (MPI+PGAS), available since 2011
  - Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), available since 2014
  - Support for virtualization (MVAPICH2-Virt), available since 2015
  - Support for energy-awareness (MVAPICH2-EA), available since 2015
  - Used by more than 2,525 organizations in 77 countries
  - More than 351,000 (>0.35 million) downloads from the OSU site directly
  - Empowering many Top500 clusters (Nov '15 ranking)
    - 10th-ranked 519,640-core cluster (Stampede) at TACC
    - 13th-ranked 185,344-core cluster (Pleiades) at NASA
    - 25th-ranked 76,032-core cluster (Tsubame 2.5) at Tokyo Institute of Technology, and many others
  - Available with software stacks of many vendors and Linux distros (RedHat and SuSE)
  - http://mvapich.cse.ohio-state.edu
- Empowering Top500 systems for over a decade
  - System-X from Virginia Tech (3rd in Nov 2003, 2,200 processors, 12.25 TFlops) ->
  - Stampede at TACC (10th in Nov '15, 519,640 cores, 5.168 PFlops)

Page 15: Programming Models for Exascale Systems

MVAPICH2 Architecture

[Figure: MVAPICH2 software architecture. High-performance parallel programming models — Message Passing Interface (MPI), PGAS (UPC, OpenSHMEM, CAF, UPC++*), and Hybrid MPI+X (MPI + PGAS + OpenMP/Cilk) — sit on a high-performance and scalable communication runtime with diverse APIs and mechanisms: point-to-point primitives, collectives algorithms, energy-awareness, remote memory access, I/O and file systems, fault tolerance, virtualization, active messages, job startup, and introspection & analysis. The runtime supports modern networking technologies (InfiniBand, iWARP, RoCE, Omni-Path) and modern multi-/many-core architectures (Intel Xeon, OpenPower*, Xeon Phi (MIC, KNL*), NVIDIA GPGPU), with transport protocols (RC, XRC, UD, DC), modern features (UMR, ODP*, SR-IOV, multi-rail, MCDRAM*, NVLink*, CAPI*), and transport mechanisms (shared memory, CMA, IVSHMEM). * - upcoming]

Page 16: Programming Models for Exascale Systems

Overview of A Few Challenges being Addressed by the MVAPICH2 Project for Exascale

- Scalability for million to billion processors
  - Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided RMA)
  - Support for advanced IB mechanisms (UMR and ODP)
  - Extremely minimal memory footprint
  - Scalable job start-up
- Collective communication
- Integrated support for GPGPUs
- Integrated support for MICs
- Unified runtime for hybrid MPI+PGAS programming (MPI+OpenSHMEM, MPI+UPC, CAF, ...)
- Virtualization
- Energy-awareness
- InfiniBand Network Analysis and Monitoring (INAM)

Page 17: Programming Models for Exascale Systems

One-way Latency: MPI over IB with MVAPICH2

[Figure: OSU one-way latency for small and large messages with TrueScale-QDR, ConnectX-3-FDR, ConnectIB-Dual FDR, and ConnectX-4-EDR; annotated small-message latencies of 1.26, 1.19, 0.95, and 1.15 us. Platforms: 2.8 GHz deca-core (IvyBridge) Intel PCI Gen3 with IB switch (QDR/FDR/Dual-FDR); 2.8 GHz deca-core (Haswell) Intel PCI Gen3 back-to-back (EDR).]

Page 18: Programming Models for Exascale Systems

Bandwidth: MPI over IB with MVAPICH2

[Figure: OSU unidirectional and bidirectional bandwidth with TrueScale-QDR, ConnectX-3-FDR, ConnectIB-Dual FDR, and ConnectX-4-EDR; annotated peaks of 3387, 6356, 12104, and 12465 MBytes/sec (unidirectional) and 6308, 12161, 21425, and 24353 MBytes/sec (bidirectional). Platforms: 2.8 GHz deca-core (IvyBridge) Intel PCI Gen3 with IB switch (QDR/FDR/Dual-FDR); 2.8 GHz deca-core (Haswell) Intel PCI Gen3 back-to-back (EDR).]

Page 19: Programming Models for Exascale Systems

MVAPICH2 Two-Sided Intra-Node Performance
(Shared memory and kernel-based zero-copy support (LiMIC and CMA))

[Figure: latest MVAPICH2 2.2b on Intel Ivy Bridge. Small-message latency: 0.18 us intra-socket and 0.45 us inter-socket. Intra-socket and inter-socket bandwidth compared for CMA, Shmem, and LiMIC channels, peaking at about 14,250 MB/s and 13,749 MB/s.]

Page 20: Programming Models for Exascale Systems

User-mode Memory Registration (UMR)

- Introduced by Mellanox to support direct local and remote non-contiguous memory access
  - Avoids packing at the sender and unpacking at the receiver
- Available with MVAPICH2-X 2.2b

[Figure: small/medium (4K-1M bytes) and large (2M-16M bytes) message latency for non-contiguous datatypes, UMR vs. Default, on Connect-IB (54 Gbps): 2.8 GHz dual ten-core (IvyBridge) Intel PCI Gen3 with Mellanox IB FDR switch.]

M. Li, H. Subramoni, K. Hamidouche, X. Lu and D. K. Panda, High Performance MPI Datatype Support with User-mode Memory Registration: Challenges, Designs and Benefits, CLUSTER 2015

Page 21: Programming Models for Exascale Systems

On-Demand Paging (ODP)

- Introduced by Mellanox to support direct remote memory access without pinning
- Memory regions paged in/out dynamically by the HCA/OS
- Size of registered buffers can be larger than physical memory
- Will be available in the upcoming MVAPICH2-X 2.2 RC1

[Figure: Graph500 pin-down buffer sizes (MB) and BFS kernel execution time (s) for 16, 32, and 64 processes, Pin-down vs. ODP, on Connect-IB (54 Gbps): 2.6 GHz dual octa-core (SandyBridge) Intel PCI Gen3 with Mellanox IB FDR switch.]

Page 22: Programming Models for Exascale Systems

Minimizing Memory Footprint by Direct Connect (DC) Transport

[Figure: processes P0-P7 on four nodes communicating over the IB network with DC.]

- Constant connection cost (one QP for any peer)
- Full feature set (RDMA, atomics, etc.)
- Separate objects for send (DC Initiator) and receive (DC Target)
  - DC Target identified by "DCT Number"
  - Messages routed with (DCT Number, LID)
  - Requires the same "DC Key" to enable communication
- Available since MVAPICH2-X 2.2a

[Figure: NAMD Apoa1 (large dataset) normalized execution time for 160-620 processes and connection memory footprint (KB) for Alltoall at 80-640 processes, comparing RC, DC-Pool, UD, and XRC; RC connection memory grows into the thousands of KB (annotated at 1022 and 4797 KB) while DC-Pool, UD, and XRC remain one to two orders of magnitude smaller.]

H. Subramoni, K. Hamidouche, A. Venkatesh, S. Chakraborty and D. K. Panda, Designing MPI Library with Dynamic Connected Transport (DCT) of InfiniBand: Early Experiences, IEEE International Supercomputing Conference (ISC '14)

Page 23: Programming Models for Exascale Systems

Towards High Performance and Scalable Startup at Exascale

- Near-constant MPI and OpenSHMEM initialization time at any process count
- 10x and 30x improvement in startup time of MPI and OpenSHMEM respectively at 16,384 processes
- Memory consumption reduced for remote endpoint information by O(processes per node)
- 1 GB memory saved per node with 1M processes and 16 processes per node

[Figure: job startup performance and memory required to store endpoint information, comparing state-of-the-art PGAS (P) and MPI (M) startup with the optimized PGAS/MPI design (O); contributing techniques are (a) on-demand connection, (b) PMIX_Ring, (c) PMIX_Ibarrier, (d) PMIX_Iallgather, and (e) shmem-based PMI.]

On-demand Connection Management for OpenSHMEM and OpenSHMEM+MPI. S. Chakraborty, H. Subramoni, J. Perkins, A. A. Awan, and D. K. Panda, 20th International Workshop on High-level Parallel Programming Models and Supportive Environments (HIPS '15)

PMI Extensions for Scalable MPI Startup. S. Chakraborty, H. Subramoni, A. Moody, J. Perkins, M. Arnold, and D. K. Panda, Proceedings of the 21st European MPI Users' Group Meeting (EuroMPI/Asia '14)

Non-blocking PMI Extensions for Fast MPI Startup. S. Chakraborty, H. Subramoni, A. Moody, A. Venkatesh, J. Perkins, and D. K. Panda, 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid '15)

SHMEMPMI - Shared Memory based PMI for Improved Performance and Scalability. S. Chakraborty, H. Subramoni, J. Perkins, and D. K. Panda, 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid '16), Accepted for Publication

Page 24: Programming Models for Exascale Systems

Process Management Interface over Shared Memory (SHMEMPMI)

- SHMEMPMI allows MPI processes to directly read remote endpoint (EP) information from the process manager through shared memory segments
- Only a single copy per node - O(processes per node) reduction in memory usage
- Estimated savings of 1 GB per node with 1 million processes and 16 processes per node
- Up to 1,000 times faster PMI Gets compared to the default design; will be available in MVAPICH2 2.2 RC1 (a conceptual sketch of the shared-memory read path follows)

TACC Stampede - Connect-IB (54 Gbps): 2.6 GHz quad octa-core (SandyBridge) Intel PCI Gen3 with Mellanox IB FDR

SHMEMPMI - Shared Memory Based PMI for Performance and Scalability. S. Chakraborty, H. Subramoni, J. Perkins, and D. K. Panda, 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid '16), Accepted for publication
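The following is a conceptual sketch (not MVAPICH2 source) of the idea behind SHMEMPMI: instead of requesting endpoint data over the PMI wire protocol, each process maps a per-node shared-memory segment published by the process manager and reads the endpoint table directly. The segment name, table layout, and struct are hypothetical.

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* Hypothetical layout of the endpoint table published by the process manager. */
    typedef struct {
        int num_eps;                          /* number of endpoint entries that follow */
        struct { int lid; int qpn; } ep[];    /* per-process endpoint information */
    } ep_table_t;

    /* Map the per-node segment once; later lookups are plain memory reads. */
    static ep_table_t *attach_ep_table(const char *segment_name)
    {
        int fd = shm_open(segment_name, O_RDONLY, 0);
        if (fd < 0) return NULL;

        off_t len = lseek(fd, 0, SEEK_END);
        ep_table_t *table = mmap(NULL, (size_t)len, PROT_READ, MAP_SHARED, fd, 0);
        close(fd);                            /* the mapping stays valid after close */
        return (table == MAP_FAILED) ? NULL : table;
    }

    int main(void)
    {
        /* "/pmi_ep_table" is a hypothetical segment name for illustration only */
        ep_table_t *table = attach_ep_table("/pmi_ep_table");
        if (table)   /* a "PMI Get" becomes a local shared-memory read */
            printf("endpoint 0: lid=%d qpn=%d\n", table->ep[0].lid, table->ep[0].qpn);
        return 0;
    }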

[Figure: time taken by one PMI_Get (milliseconds) for 1-32 processes per node, Default vs. SHMEMPMI (about 1000x faster, estimated), and memory usage per node (MB) for remote EP information for 16 to 1M processes per job, comparing Fence-Default, Allgather-Default, Fence-Shmem, and Allgather-Shmem (16x actual reduction shown).]

Page 25: Programming Models for Exascale Systems

Overview of A Few Challenges being Addressed by the MVAPICH2 Project for Exascale

- Scalability for million to billion processors
- Collective communication
  - Offload and non-blocking
  - Topology-aware
- Integrated support for GPGPUs
- Integrated support for MICs
- Unified runtime for hybrid MPI+PGAS programming (MPI+OpenSHMEM, MPI+UPC, CAF, ...)
- Virtualization
- Energy-awareness
- InfiniBand Network Analysis and Monitoring (INAM)

Page 26: Programming Models for Exascale Systems

Co-Design with MPI-3 Non-Blocking Collectives and Collective Offload (CORE-Direct) Hardware (Available since MVAPICH2-X 2.2a)

- Modified P3DFFT with Offload-Alltoall does up to 17% better than the default version (128 processes)
- Modified HPL with Offload-Bcast does up to 4.5% better than the default version (512 processes)
- Modified Pre-Conjugate Gradient solver with Offload-Allreduce does up to 21.8% better than the default version

[Figure: application run-time (s) vs. data size (512-800); PCG run-time (s) for 64-512 processes, PCG-Default vs. Modified-PCG-Offload; and normalized HPL performance for HPL-Offload, HPL-1ring, and HPL-Host as the HPL problem size (N) grows from 10% to 70% of total memory. Annotated gains: 17%, 4.5%, 21.8%.]

K. Kandalla, et al., High-Performance and Scalable Non-Blocking All-to-All with Collective Offload on InfiniBand Clusters: A Study with Parallel 3D FFT, ISC 2011

K. Kandalla, et al., Designing Non-blocking Broadcast with Collective Offload on InfiniBand Clusters: A Case Study with HPL, HotI 2011

K. Kandalla, et al., Designing Non-blocking Allreduce with Collective Offload on InfiniBand Clusters: A Case Study with Conjugate Gradient Solvers, IPDPS '12

Can Network-Offload based Non-Blocking Neighborhood MPI Collectives Improve Communication Overheads of Irregular Graph Algorithms? K. Kandalla, A. Buluc, H. Subramoni, K. Tomko, J. Vienne, L. Oliker, and D. K. Panda, IWPAPS '12
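For context, the pattern these modified applications exploit is an MPI-3 non-blocking collective: start the collective, do independent work while offload-capable hardware progresses it, and complete it later. A minimal sketch (illustrative, not the modified HPL/P3DFFT/PCG code):

    #include <mpi.h>

    /* Overlap an allreduce with independent computation using MPI-3
       non-blocking collectives (progressed by collective-offload hardware
       when the MPI library and HCA support it). */
    void overlapped_allreduce(double *local_sum, double *global_sum,
                              double *work, int n)
    {
        MPI_Request req;

        MPI_Iallreduce(local_sum, global_sum, 1, MPI_DOUBLE, MPI_SUM,
                       MPI_COMM_WORLD, &req);     /* start the collective */

        for (int i = 0; i < n; i++)               /* independent computation */
            work[i] = work[i] * 0.5 + 1.0;        /* overlaps with the reduction */

        MPI_Wait(&req, MPI_STATUS_IGNORE);        /* complete the collective */
    }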

Page 27: Programming Models for Exascale Systems

Network-Topology-Aware Placement of Processes

- Can we design a highly scalable network topology detection service for IB?
- How do we design the MPI communication library in a network-topology-aware manner to efficiently leverage the topology information generated by our service?
- What are the potential benefits of using a network-topology-aware MPI library on the performance of parallel scientific applications?

[Figure: overall performance and split-up of physical communication for MILC on Ranger — performance for varying system sizes, plus the default and topology-aware mappings for a 2048-core run (15% improvement).]

- Reduce network topology discovery time from O(N^2 hosts) to O(N hosts)
- 15% improvement in MILC execution time @ 2048 cores
- 15% improvement in Hypre execution time @ 1024 cores

H. Subramoni, S. Potluri, K. Kandalla, B. Barth, J. Vienne, J. Keasler, K. Tomko, K. Schulz, A. Moody, and D. K. Panda, Design of a Scalable InfiniBand Topology Service to Enable Network-Topology-Aware Placement of Processes, SC '12. Best Paper and Best Student Paper Finalist

Page 28: Programming Models for Exascale Systems

Overview of A Few Challenges being Addressed by the MVAPICH2 Project for Exascale

- Scalability for million to billion processors
- Collective communication
- Integrated support for GPGPUs
  - CUDA-aware MPI
  - GPUDirect RDMA (GDR) support
  - CUDA-aware non-blocking collectives
  - Support for managed memory
  - Efficient datatype processing
- Integrated support for MICs
- Unified runtime for hybrid MPI+PGAS programming (MPI+OpenSHMEM, MPI+UPC, CAF, ...)
- Virtualization
- Energy-awareness
- InfiniBand Network Analysis and Monitoring (INAM)

Page 29: Programming Models for Exascale Systems

MPI + CUDA - Naive

- Data movement in applications with standard MPI and CUDA interfaces
- High productivity and low performance

[Figure: data path through CPU, PCIe, GPU, NIC, and switch; data is staged through host memory.]

At Sender:
  cudaMemcpy(s_hostbuf, s_devbuf, . . .);
  MPI_Send(s_hostbuf, size, . . .);

At Receiver:
  MPI_Recv(r_hostbuf, size, . . .);
  cudaMemcpy(r_devbuf, r_hostbuf, . . .);

Page 30: Programming Models for Exascale Systems

MPI + CUDA - Advanced

- Pipelining at user level with non-blocking MPI and CUDA interfaces
- Low productivity and high performance

[Figure: data path through CPU, PCIe, GPU, NIC, and switch; chunks are pipelined through host memory.]

At Sender:
  for (j = 0; j < pipeline_len; j++)
      cudaMemcpyAsync(s_hostbuf + j * blk, s_devbuf + j * blksz, ...);
  for (j = 0; j < pipeline_len; j++) {
      while (result != cudaSuccess) {
          result = cudaStreamQuery(...);
          if (j > 0) MPI_Test(...);
      }
      MPI_Isend(s_hostbuf + j * block_sz, blksz, . . .);
  }
  MPI_Waitall(...);

<<Similar at receiver>>

Page 31: Programming Models for Exascale Systems

GPU-Aware MPI Library: MVAPICH2-GPU

- Standard MPI interfaces used for unified data movement
- Takes advantage of Unified Virtual Addressing (>= CUDA 4.0)
- Overlaps data movement from the GPU with RDMA transfers
- High performance and high productivity

At Sender:
  MPI_Send(s_devbuf, size, ...);

At Receiver:
  MPI_Recv(r_devbuf, size, ...);

(staging/pipelining handled inside MVAPICH2)
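A self-contained sketch of the same idea, assuming a CUDA-aware MPI library (such as MVAPICH2-GPU/GDR) that accepts device pointers directly; error checking is omitted for brevity.

    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
        const int n = 1 << 20;
        int rank;
        double *devbuf;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        cudaMalloc((void **)&devbuf, n * sizeof(double));   /* GPU device memory */
        cudaMemset(devbuf, 0, n * sizeof(double));

        /* With a CUDA-aware MPI, device pointers are passed directly;
           the library pipelines/stages the transfer internally. */
        if (rank == 0)
            MPI_Send(devbuf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(devbuf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);

        cudaFree(devbuf);
        MPI_Finalize();
        return 0;
    }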

Page 32: Programming Models for Exascale Systems

GPU-Direct RDMA (GDR) with CUDA

- OFED with support for GPUDirect RDMA is developed by NVIDIA and Mellanox
- OSU has a design of MVAPICH2 using GPUDirect RDMA
  - Hybrid design using GPUDirect RDMA
    - GPUDirect RDMA and host-based pipelining
    - Alleviates P2P bandwidth bottlenecks on SandyBridge and IvyBridge
  - Support for communication using multi-rail
  - Support for Mellanox Connect-IB and ConnectX VPI adapters
  - Support for RoCE with Mellanox ConnectX VPI adapters

[Figure: IB adapter, chipset, CPU, GPU, system memory, and GPU memory data paths. P2P write/read bandwidth: 5.2 GB/s / <1.0 GB/s on SandyBridge E5-2670; 6.4 GB/s / 3.5 GB/s on IvyBridge E5-2680 V2.]

Page 33: Programming Models for Exascale Systems

CUDA-Aware MPI: MVAPICH2-GDR 1.8-2.2 Releases

- Support for MPI communication from NVIDIA GPU device memory
- High-performance RDMA-based inter-node point-to-point communication (GPU-GPU, GPU-Host and Host-GPU)
- High-performance intra-node point-to-point communication for multi-GPU adapters/node (GPU-GPU, GPU-Host and Host-GPU)
- Taking advantage of CUDA IPC (available since CUDA 4.1) in intra-node communication for multiple GPU adapters/node
- Optimized and tuned collectives for GPU device buffers
- MPI datatype support for point-to-point and collective communication from GPU device buffers

Page 34: Programming Models for Exascale Systems

Performance of MVAPICH2-GPU with GPU-Direct RDMA (GDR)

[Figure: GPU-GPU inter-node latency, bandwidth, and bi-bandwidth for MV2-GDR 2.2b, MV2-GDR 2.0b, and MV2 without GDR. Small-message latency drops to 2.18 us (annotated improvements of about 10x over MV2 without GDR and 2x over MV2-GDR 2.0b); bandwidth and bi-bandwidth show annotated gains of about 11x and 2x.]

Platform: MVAPICH2-GDR 2.2b, Intel Ivy Bridge (E5-2680 v2) node with 20 cores, NVIDIA Tesla K40c GPU, Mellanox Connect-IB dual-FDR HCA, CUDA 7, Mellanox OFED 2.4 with GPU-Direct-RDMA

Page 35: Programming Models for Exascale Systems

Application-Level Evaluation (HOOMD-blue)

- Platform: Wilkes (Intel Ivy Bridge + NVIDIA Tesla K20c + Mellanox Connect-IB)
- HOOMD-blue version 1.0.5
- GDRCOPY enabled: MV2_USE_CUDA=1 MV2_IBA_HCA=mlx5_0 MV2_IBA_EAGER_THRESHOLD=32768 MV2_VBUF_TOTAL_SIZE=32768 MV2_USE_GPUDIRECT_LOOPBACK_LIMIT=32768 MV2_USE_GPUDIRECT_GDRCOPY=1 MV2_USE_GPUDIRECT_GDRCOPY_LIMIT=16384

[Figure: average time steps per second (TPS) for 4-32 processes with 64K and 256K particles, MV2 vs. MV2+GDR; about 2x improvement in both cases.]

Page 36: Programming Models for Exascale Systems

CUDA-Aware Non-Blocking Collectives

[Figure: medium/large message overlap (%) on 64 GPU nodes for MPI_Ialltoall and MPI_Igather with 1 process/node and 2 processes/node (1 process/GPU), message sizes 4K-1M bytes.]

Platform: Wilkes - Intel Ivy Bridge, NVIDIA Tesla K20c + Mellanox Connect-IB
Available since MVAPICH2-GDR 2.2a

A. Venkatesh, K. Hamidouche, H. Subramoni, and D. K. Panda, Offloaded GPU Collectives using CORE-Direct and CUDA Capabilities on IB Clusters, HiPC 2015

Page 37: Programming Models for Exascale Systems

Communication Runtime with GPU Managed Memory

- Since CUDA 6.0, NVIDIA provides CUDA Managed (or Unified) Memory, allowing a common memory allocation for GPU or CPU through the cudaMallocManaged() call
- Significant productivity benefits due to abstraction of explicit allocation and cudaMemcpy()
- Extended MVAPICH2 to perform communication directly from managed buffers (available in MVAPICH2-GDR 2.2b)
- OSU Micro-benchmarks extended to evaluate the performance of point-to-point and collective communication using managed buffers (available in OMB 5.2)

D. S. Banerjee, K. Hamidouche, and D. K. Panda, Designing High Performance Communication Runtime for GPU Managed Memory: Early Experiences, GPGPU-9 Workshop, to be held in conjunction with PPoPP '16
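A minimal sketch of what this enables, assuming an MPI library (such as MVAPICH2-GDR 2.2b) that accepts managed buffers directly; the same pointer could also be passed to a CUDA kernel without any explicit cudaMemcpy().

    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
        const int n = 1 << 20;
        int rank;
        double *buf;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* One allocation visible to both host code and GPU kernels. */
        cudaMallocManaged((void **)&buf, n * sizeof(double), cudaMemAttachGlobal);

        if (rank == 0) {
            for (int i = 0; i < n; i++) buf[i] = 1.0;  /* filled on the host (or by a kernel) */
            MPI_Send(buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);  /* managed pointer passed directly */
        } else if (rank == 1) {
            MPI_Recv(buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }

        cudaFree(buf);
        MPI_Finalize();
        return 0;
    }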

[Figure: 2D stencil halo-exchange time (ms) for halo width = 1 as the total dimension size grows from 1 byte to 16K bytes, comparing explicit device buffers and managed buffers.]

Page 38: Programming Models for Exascale Systems

MPI Datatype Processing (Communication Optimization)

Common scenario (waste of computing resources on CPU and GPU):

  MPI_Isend(A, ..., Datatype, ...);
  MPI_Isend(B, ..., Datatype, ...);
  MPI_Isend(C, ..., Datatype, ...);
  MPI_Isend(D, ..., Datatype, ...);
  ...
  MPI_Waitall(...);

  *Buf1, Buf2, ... contain a non-contiguous MPI Datatype

[Figure: CPU/GPU timeline of the existing vs. proposed design. In the existing design, each MPI_Isend initiates a packing kernel on a CUDA stream, waits for that kernel (WFK), and then starts the send, serializing kernels and sends. In the proposed design, the kernel initiations for Isend(1), Isend(2), Isend(3) are issued back-to-back and the wait-for-kernel and send start are deferred to MPI progress, so packing kernels overlap with sends and the overall finish time is earlier. Expected benefit: better use of CPU and GPU resources.]
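For reference, the kind of non-contiguous transfer being optimized can be expressed with a derived datatype. A small sketch (illustrative only, assuming a CUDA-aware MPI so the vector type can describe a strided region of a device buffer; the packing is then the library's responsibility):

    #include <mpi.h>

    /* Post a non-blocking send of one column of an n x n row-major matrix,
       described by an MPI vector datatype; the matrix may live in GPU memory
       when the MPI library is CUDA-aware. */
    MPI_Request send_column(double *matrix, int n, int col, int dest)
    {
        MPI_Datatype column_t;
        MPI_Request  req;

        MPI_Type_vector(n, 1, n, MPI_DOUBLE, &column_t);  /* n blocks of 1, stride n */
        MPI_Type_commit(&column_t);

        MPI_Isend(matrix + col, 1, column_t, dest, 0, MPI_COMM_WORLD, &req);

        MPI_Type_free(&column_t);  /* safe: the datatype remains in use until the send completes */
        return req;
    }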

Page 39: Programming Models for Exascale Systems

Application-Level Evaluation (Halo Exchange - COSMO)

[Figure: normalized execution time on the CSCS GPU cluster (16-96 GPUs) and the Wilkes GPU cluster (4-32 GPUs), comparing the default, callback-based, and event-based designs.]

- 2x improvement on 32 GPU nodes
- 30% improvement on 96 GPU nodes (8 GPUs/node)

C. Chu, K. Hamidouche, A. Venkatesh, D. Banerjee, H. Subramoni, and D. K. Panda, Exploiting Maximal Overlap for Non-Contiguous Data Movement Processing on Modern GPU-enabled Systems, IPDPS '16

Page 40: Programming Models for Exascale Systems

Overview of A Few Challenges being Addressed by the MVAPICH2 Project for Exascale

- Scalability for million to billion processors
- Collective communication
- Integrated support for GPGPUs
- Integrated support for MICs
- Unified runtime for hybrid MPI+PGAS programming (MPI+OpenSHMEM, MPI+UPC, CAF, ...)
- Virtualization
- Energy-awareness
- InfiniBand Network Analysis and Monitoring (INAM)

Page 41: Programming Models for Exascale Systems

MPI Applications on MIC Clusters

- Flexibility in launching MPI jobs on clusters with Xeon Phi

[Figure: execution modes ranging from multi-core-centric to many-core-centric: Host-only (MPI program on the Xeon only), Offload/reverse offload (MPI program on the host with offloaded computation on the Xeon Phi), Symmetric (MPI programs on both Xeon and Xeon Phi), and Coprocessor-only (MPI program on the Xeon Phi only).]

Page 42: Programming Models for Exascale Systems

MVAPICH2-MIC 2.0 Design for Clusters with IB and MIC

- Offload mode
- Intra-node communication
  - Coprocessor-only and symmetric mode
- Inter-node communication
  - Coprocessor-only and symmetric mode
- Multi-MIC node configurations
- Running on three major systems
  - Stampede, Blueridge (Virginia Tech) and Beacon (UTK)

Page 43: Programming Models for Exascale Systems

MIC-Remote-MIC P2P Communication with Proxy-based Communication

[Figure: intra-socket and inter-socket P2P latency for large messages (8K-2M bytes) and bandwidth (1 byte-1M bytes); peak bandwidths annotated at 5236 and 5594 MB/sec. Lower latency and higher bandwidth are better.]

Page 44: Programming Models for Exascale Systems

Optimized MPI Collectives for MIC Clusters (Allgather & Alltoall)

[Figure: 32-node Allgather small-message latency (16H + 16M) and large-message latency (8H + 8M), and 32-node Alltoall large-message latency (8H + 8M), MV2-MIC vs. MV2-MIC-Opt, with annotated improvements of 76%, 58%, and 55%; P3DFFT performance on 32 nodes (8H + 8M, size 2K x 2K x 1K) broken into communication and computation time, with the optimized version reducing execution time.]

A. Venkatesh, S. Potluri, R. Rajachandrasekar, M. Luo, K. Hamidouche and D. K. Panda, High Performance Alltoall and Allgather designs for InfiniBand MIC Clusters, IPDPS '14, May 2014

Page 45: Programming Models for Exascale Systems

Overview of A Few Challenges being Addressed by the MVAPICH2 Project for Exascale

- Scalability for million to billion processors
- Collective communication
- Integrated support for GPGPUs
- Integrated support for MICs
- Unified runtime for hybrid MPI+PGAS programming (MPI+OpenSHMEM, MPI+UPC, CAF, ...)
- Virtualization
- Energy-awareness
- InfiniBand Network Analysis and Monitoring (INAM)

Page 46: Programming Models for Exascale Systems

MVAPICH2-X for Advanced MPI and Hybrid MPI+PGAS Applications

[Figure: MPI, OpenSHMEM, UPC, CAF, or hybrid (MPI+PGAS) applications issue MPI calls, OpenSHMEM calls, UPC calls, and CAF calls into a unified MVAPICH2-X runtime over InfiniBand, RoCE, and iWARP.]

- Unified communication runtime for MPI, UPC, OpenSHMEM, CAF available with MVAPICH2-X 1.9 (2012) onwards!
- UPC++ support will be available in the upcoming MVAPICH2-X 2.2 RC1
- Feature highlights
  - Supports MPI(+OpenMP), OpenSHMEM, UPC, CAF, MPI(+OpenMP)+OpenSHMEM, MPI(+OpenMP)+UPC+CAF
  - MPI-3 compliant, OpenSHMEM v1.0 standard compliant, UPC v1.2 standard compliant (with initial support for UPC 1.3), CAF 2008 standard (OpenUH)
  - Scalable inter-node and intra-node communication - point-to-point and collectives

Page 47: Programming Models for Exascale Systems

Application-Level Performance with Graph500 and Sort

Graph500 execution time:

- Performance of the hybrid (MPI+OpenSHMEM) Graph500 design
- 8,192 processes: 2.4x improvement over MPI-CSR, 7.6x improvement over MPI-Simple
- 16,384 processes: 1.5x improvement over MPI-CSR, 13x improvement over MPI-Simple

[Figure: Graph500 execution time (s) for 4K-16K processes, comparing MPI-Simple, MPI-CSC, MPI-CSR, and Hybrid (MPI+OpenSHMEM); annotated speedups of 7.6x and 13x.]

J. Jose, S. Potluri, K. Tomko and D. K. Panda, Designing Scalable Graph500 Benchmark with Hybrid MPI+OpenSHMEM Programming Models, International Supercomputing Conference (ISC '13), June 2013

J. Jose, K. Kandalla, M. Luo and D. K. Panda, Supporting Hybrid MPI and OpenSHMEM over InfiniBand: Design and Performance Evaluation, Int'l Conference on Parallel Processing (ICPP '12), September 2012

J. Jose, K. Kandalla, S. Potluri, J. Zhang and D. K. Panda, Optimizing Collective Communication in OpenSHMEM, Int'l Conference on Partitioned Global Address Space Programming Models (PGAS '13), October 2013

Sort execution time:

- Performance of the hybrid (MPI+OpenSHMEM) Sort application
- 4,096 processes, 4 TB input size: MPI - 2408 sec (0.16 TB/min); Hybrid - 1172 sec (0.36 TB/min); 51% improvement over the MPI design

[Figure: Sort execution time (seconds) for input sizes 500 GB-4 TB at 512-4K processes, MPI vs. Hybrid; 51% improvement annotated.]

Page 48: Programming Models for Exascale Systems

MiniMD - Total Execution Time

- Hybrid design performs better than the MPI implementation
- 1,024 processes: 17% improvement over the MPI version
- Strong scaling; input size: 128 x 128 x 128

[Figure: total execution time (ms) for Hybrid-Barrier, MPI-Original, and Hybrid-Advanced at 512 and 1,024 cores (performance) and 256-1,024 cores (strong scaling); 17% improvement annotated.]

M. Li, J. Lin, X. Lu, K. Hamidouche, K. Tomko and D. K. Panda, Scalable MiniMD Design with Hybrid MPI and OpenSHMEM, OpenSHMEM User Group Meeting (OUG '14), held in conjunction with the 8th International Conference on Partitioned Global Address Space Programming Models (PGAS '14).

Page 49: Programming Models for Exascale Systems

Hybrid MPI+UPC NAS-FT

- Modified NAS FT UPC all-to-all pattern using MPI_Alltoall
- Truly hybrid program
- For FT (Class C, 128 processes):
  - 34% improvement over UPC-GASNet
  - 30% improvement over UPC-OSU

[Figure: NAS-FT time (s) for problem sizes B-64, C-64, B-128, and C-128, comparing UPC-GASNet, UPC-OSU, and Hybrid-OSU; 34% improvement annotated.]

Hybrid MPI+UPC support available since MVAPICH2-X 1.9 (2012)

J. Jose, M. Luo, S. Sur and D. K. Panda, Unifying UPC and MPI Runtimes: Experience with MVAPICH, Fourth Conference on Partitioned Global Address Space Programming Model (PGAS '10), October 2010

Page 50: Programming Models for Exascale Systems

Overview of A Few Challenges being Addressed by the MVAPICH2 Project for Exascale

- Scalability for million to billion processors
- Collective communication
- Integrated support for GPGPUs
- Integrated support for MICs
- Unified runtime for hybrid MPI+PGAS programming (MPI+OpenSHMEM, MPI+UPC, CAF, ...)
- Virtualization
- Energy-awareness
- InfiniBand Network Analysis and Monitoring (INAM)

Page 51: Programming Models for Exascale Systems

Can HPC and Virtualization be Combined?

- Virtualization has many benefits
  - Fault-tolerance
  - Job migration
  - Compaction
- It has not been very popular in HPC due to the overhead associated with virtualization
- New SR-IOV (Single Root - IO Virtualization) support available with Mellanox InfiniBand adapters changes the field
- Enhanced MVAPICH2 support for SR-IOV
- MVAPICH2-Virt 2.1 (with and without OpenStack) is publicly available

J. Zhang, X. Lu, J. Jose, R. Shi and D. K. Panda, Can Inter-VM Shmem Benefit MPI Applications on SR-IOV based Virtualized InfiniBand Clusters? EuroPar '14
J. Zhang, X. Lu, J. Jose, M. Li, R. Shi and D. K. Panda, High Performance MPI Library over SR-IOV enabled InfiniBand Clusters, HiPC '14
J. Zhang, X. Lu, M. Arnold and D. K. Panda, MVAPICH2 over OpenStack with SR-IOV: an Efficient Approach to build HPC Clouds, CCGrid '15

Page 52: Programming Models for Exascale Systems

Overview of MVAPICH2-Virt with SR-IOV and IVSHMEM

- Redesign MVAPICH2 to make it virtual machine aware
  - SR-IOV shows near-to-native performance for inter-node point-to-point communication
  - IVSHMEM offers zero-copy access to data on shared memory of co-resident VMs
  - Locality Detector: maintains the locality information of co-resident virtual machines
  - Communication Coordinator: selects the communication channel (SR-IOV, IVSHMEM) adaptively

[Figure: host environment with two guest VMs; each guest's MPI process uses a VF driver over an SR-IOV virtual function of the InfiniBand adapter (SR-IOV channel) and an IV-SHM PCI device backed by /dev/shm on the host (IV-Shmem channel); the hypervisor holds the PF driver for the physical function.]

J. Zhang, X. Lu, J. Jose, R. Shi, D. K. Panda. Can Inter-VM Shmem Benefit MPI Applications on SR-IOV based Virtualized InfiniBand Clusters? Euro-Par, 2014.
J. Zhang, X. Lu, J. Jose, R. Shi, M. Li, D. K. Panda. High Performance MPI Library over SR-IOV Enabled InfiniBand Clusters. HiPC, 2014.

Page 53: Programming Models for Exascale Systems

MVAPICH2-Virt with SR-IOV and IVSHMEM over OpenStack

- OpenStack is one of the most popular open-source solutions to build clouds and manage virtual machines
- Deployment with OpenStack
  - Supporting SR-IOV configuration
  - Supporting IVSHMEM configuration
  - Virtual-machine-aware design of MVAPICH2 with SR-IOV
- An efficient approach to build HPC clouds with MVAPICH2-Virt and OpenStack

[Figure: OpenStack services around a VM - Nova provisions VMs, Glance stores and provides images, Neutron provides the network, Swift stores images and backs up volumes, Cinder provides volumes, Keystone provides authentication, Ceilometer monitors, Horizon provides the UI, and Heat orchestrates the cloud.]

J. Zhang, X. Lu, M. Arnold, D. K. Panda. MVAPICH2 over OpenStack with SR-IOV: An Efficient Approach to Build HPC Clouds. CCGrid, 2015.

Page 54: Programming Models for Exascale Systems

Application-Level Performance on Chameleon

- 32 VMs, 6 cores/VM
- Compared to native, 2-5% overhead for Graph500 with 128 procs
- Compared to native, 1-9.5% overhead for SPEC MPI2007 with 128 procs

[Figure: SPEC MPI2007 execution time (s) for milc, leslie3d, pop2, GAPgeofem, zeusmp2, and lu, and Graph500 execution time (ms) for problem sizes (scale, edge factor) from (22,20) to (26,16), comparing MV2-SR-IOV-Def, MV2-SR-IOV-Opt, and MV2-Native; annotated overheads of 1%-9.5% (SPEC MPI2007) and 2%-5% (Graph500).]

Page 55: Programming Models for Exascale Systems

NSF Chameleon Cloud: A Powerful and Flexible Experimental Instrument

- Large-scale instrument
  - Targeting Big Data, Big Compute, Big Instrument research
  - ~650 nodes (~14,500 cores), 5 PB disk over two sites, 2 sites connected with a 100G network
- Reconfigurable instrument
  - Bare-metal reconfiguration, operated as a single instrument, graduated approach for ease-of-use
- Connected instrument
  - Workload and Trace Archive
  - Partnerships with production clouds: CERN, OSDC, Rackspace, Google, and others
  - Partnerships with users
- Complementary instrument
  - Complementing GENI, Grid'5000, and other testbeds
- Sustainable instrument
  - Industry connections

http://www.chameleoncloud.org/

Page 56: Programming Models for Exascale Systems

Overview of A Few Challenges being Addressed by the MVAPICH2 Project for Exascale

- Scalability for million to billion processors
- Collective communication
- Integrated support for GPGPUs
- Integrated support for MICs
- Unified runtime for hybrid MPI+PGAS programming (MPI+OpenSHMEM, MPI+UPC, CAF, ...)
- Virtualization
- Energy-awareness
- InfiniBand Network Analysis and Monitoring (INAM)

Page 57: Programming Models for Exascale Systems

Energy-Aware MVAPICH2 & OSU Energy Management Tool (OEMT)

- MVAPICH2-EA 2.1 (Energy-Aware)
  - A white-box approach
  - New energy-efficient communication protocols for pt-pt and collective operations
  - Intelligently apply the appropriate energy-saving techniques
  - Application-oblivious energy saving
- OEMT
  - A library utility to measure energy consumption for MPI applications
  - Works with all MPI runtimes
  - PRELOAD option for precompiled applications
  - Does not require ROOT permission: a safe kernel module to read only a subset of MSRs

Page 58: Programming Models for Exascale Systems

MVAPICH2-EA: Application-Oblivious Energy-Aware MPI (EAM)

- An energy-efficient runtime that provides energy savings without application knowledge
- Uses the best energy lever automatically and transparently
- Provides guarantees on maximum degradation, with 5-41% savings at <=5% degradation
- Pessimistic MPI applies an energy-reduction lever to each MPI call

A Case for Application-Oblivious Energy-Efficient MPI Runtime. A. Venkatesh, A. Vishnu, K. Hamidouche, N. Tallent, D. K. Panda, D. Kerbyson, and A. Hoise, Supercomputing '15, Nov 2015 [Best Student Paper Finalist]

Page 59: Programming Models for Exascale Systems

Overview of A Few Challenges being Addressed by the MVAPICH2 Project for Exascale

- Scalability for million to billion processors
- Collective communication
- Integrated support for GPGPUs
- Integrated support for MICs
- Unified runtime for hybrid MPI+PGAS programming (MPI+OpenSHMEM, MPI+UPC, CAF, ...)
- Virtualization
- Energy-awareness
- InfiniBand Network Analysis and Monitoring (INAM)

Page 60: Programming Models for Exascale Systems

Overview of OSU INAM

- OSU INAM monitors IB clusters in real time by querying various subnet management entities in the network
- Major features of the OSU INAM tool include:
  - Analyze and profile network-level activities with many parameters (data and errors) at user-specified granularity
  - Capability to analyze and profile node-level, job-level and process-level activities for MPI communication (pt-to-pt, collectives and RMA)
  - Remotely monitor CPU utilization of MPI processes at user-specified granularity
  - Visualize data transfer as it happens - Live View for:
    - Entire network - Live Network Level View
    - Particular job - Live Job Level View
    - One or multiple nodes - Live Node Level View
  - Capability to visualize data transfer that happened in the network over a time duration in the past:
    - Entire network - Historical Network Level View
    - Particular job - Historical Job Level View
    - One or multiple nodes - Historical Node Level View

Page 61: Programming Models for Exascale Systems

OSU INAM - Network Level View

- Show the network topology of large clusters
- Visualize traffic patterns on different links
- Quickly identify congested links and links in error state
- See the history unfold - playback historical state of the network

[Figure: full network view (152 nodes) and a zoomed-in view of the network.]

Page 62: Programming Models for Exascale Systems

OSU INAM - Job and Node Level Views

[Figure: visualizing a job (5 nodes) and finding routes between nodes.]

- Job level view
  - Show different network metrics (load, error, etc.) for any live job
  - Playback historical data for completed jobs to identify bottlenecks
- Node level view provides details per processor per node
  - CPU utilization for each rank/node
  - Bytes sent/received for MPI operations (pt-to-pt, collective, RMA)
  - Network metrics (e.g., XmitDiscard, RcvError) per rank/node

Page 63: Programming Models for Exascale Systems

MVAPICH2 - Plans for Exascale

- Performance and memory scalability toward 1M cores
- Hybrid programming (MPI+OpenSHMEM, MPI+UPC, MPI+CAF, ...)
  - Support for task-based parallelism (UPC++)
- Enhanced optimization for GPU support and accelerators
- Taking advantage of advanced features
  - User-Mode Memory Registration (UMR)
  - On-Demand Paging
- Enhanced inter-node and intra-node communication schemes for upcoming Omni-Path and Knights Landing architectures
- Extended RMA support (as in MPI 3.0)
- Extended topology-aware collectives
- Energy-aware point-to-point (one-sided and two-sided) and collectives
- Extended support for MPI Tools interface (as in MPI 3.0)
- Extended checkpoint-restart and migration support with SCR

Page 64: Programming Models for Exascale Systems

Looking into the Future....

- Exascale systems will be constrained by
  - Power
  - Memory per core
  - Data-movement cost
  - Faults
- Programming models and runtimes for HPC need to be designed for
  - Scalability
  - Performance
  - Fault-resilience
  - Energy-awareness
  - Programmability
  - Productivity
- Highlighted some of the issues and challenges
- Need continuous innovation on all these fronts

Page 65: Programming Models for Exascale Systems

Funding Acknowledgments

Funding support by: [sponsor logos]

Equipment support by: [vendor logos]

Page 66: Programming Models for Exascale Systems

Personnel Acknowledgments

Current Students
- A. Augustine (M.S.)
- A. Awan (Ph.D.)
- S. Chakraborthy (Ph.D.)
- C.-H. Chu (Ph.D.)
- N. Islam (Ph.D.)
- M. Li (Ph.D.)
- M. Rahman (Ph.D.)
- D. Shankar (Ph.D.)
- A. Venkatesh (Ph.D.)
- J. Zhang (Ph.D.)

Past Students
- P. Balaji (Ph.D.)
- S. Bhagvat (M.S.)
- A. Bhat (M.S.)
- D. Buntinas (Ph.D.)
- L. Chai (Ph.D.)
- B. Chandrasekharan (M.S.)
- N. Dandapanthula (M.S.)
- V. Dhanraj (M.S.)
- T. Gangadharappa (M.S.)
- K. Gopalakrishnan (M.S.)
- W. Huang (Ph.D.)
- W. Jiang (M.S.)
- J. Jose (Ph.D.)
- S. Kini (M.S.)
- M. Koop (Ph.D.)
- R. Kumar (M.S.)
- S. Krishnamoorthy (M.S.)
- K. Kandalla (Ph.D.)
- K. Kulkarni (M.S.)
- P. Lai (M.S.)
- J. Liu (Ph.D.)
- M. Luo (Ph.D.)
- A. Mamidala (Ph.D.)
- G. Marsh (M.S.)
- V. Meshram (M.S.)
- A. Moody (M.S.)
- S. Naravula (Ph.D.)
- R. Noronha (Ph.D.)
- X. Ouyang (Ph.D.)
- S. Pai (M.S.)
- S. Potluri (Ph.D.)
- R. Rajachandrasekar (Ph.D.)
- G. Santhanaraman (Ph.D.)
- A. Singh (Ph.D.)
- J. Sridhar (M.S.)
- S. Sur (Ph.D.)
- H. Subramoni (Ph.D.)
- K. Vaidyanathan (Ph.D.)
- A. Vishnu (Ph.D.)
- J. Wu (Ph.D.)
- W. Yu (Ph.D.)

Past Research Scientist
- S. Sur

Current Research Scientists
- K. Hamidouche
- X. Lu

Current Senior Research Associate
- H. Subramoni

Current Post-Docs
- J. Lin
- D. Banerjee

Past Post-Docs
- H. Wang
- X. Besseron
- H.-W. Jin
- M. Luo
- E. Mancini
- S. Marcarelli
- J. Vienne

Current Programmer
- J. Perkins

Past Programmer
- D. Bureddy

Current Research Specialist
- M. Arnold

Page 67: Programming Models for Exascale Systems

International Workshop on Communication Architectures at Extreme Scale (ExaComm)

ExaComm 2015 was held with the Int'l Supercomputing Conference (ISC '15), at Frankfurt, Germany, on Thursday, July 16th, 2015

One keynote talk: John M. Shalf, CTO, LBL/NERSC
Four invited talks: Dror Goldenberg (Mellanox); Martin Schulz (LLNL); Cyriel Minkenberg (IBM-Zurich); Arthur (Barney) Maccabe (ORNL)
Panel: Ron Brightwell (Sandia)
Two research papers

ExaComm 2016 will be held in conjunction with ISC '16
http://web.cse.ohio-state.edu/~subramon/ExaComm16/exacomm16.html
Technical paper submission deadline: Friday, April 15, 2016

Page 68: Programming Models for Exascale Systems

Thank You!

[email protected]

Network-Based Computing Laboratory
http://nowlab.cse.ohio-state.edu/

The MVAPICH2 Project
http://mvapich.cse.ohio-state.edu/

The High-Performance Big Data Project
http://hibd.cse.ohio-state.edu/