Programming Models for Exascale Systems

Page 1: Programming Models for Exascale Systems

High-Performance and Scalable Designs of Programming Models for Exascale Systems

Dhabaleswar K. (DK) Panda
The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~panda

Talk at HPCAC-Switzerland (Mar 2016)

Page 2: Programming Models for Exascale Systems

High-End Computing (HEC): ExaFlop & ExaByte

ExaFlop & HPC:
•  100-200 PFlops in 2016-2018
•  1 EFlops in 2020-2024?

ExaByte & Big Data:
•  10K-20K EBytes in 2016-2018
•  40K EBytes in 2020?

Figure 1 - Source: IDC's Digital Universe Study, sponsored by EMC, December 2012:
"Within these broad outlines of the digital universe are some singularities worth noting. First, while the portion of the digital universe holding potential analytic value is growing, only a tiny fraction of territory has been explored. IDC estimates that by 2020, as much as 33% of the digital universe will contain information that might be valuable if analyzed, compared with 25% today. This untapped value could be found in patterns in social media usage, correlations in scientific data from discrete studies, medical information intersected with sociological data, faces in security footage, and so on. However, even with a generous estimate, the amount of information in the digital universe that is "tagged" accounts for only about 3% of the digital universe in 2012, and that which is analyzed is half a percent of the digital universe. Herein is the promise of "Big Data" technology - the extraction of value from the large untapped pools of data in the digital universe."

Page 3: Programming Models for Exascale Systems

Trends for Commodity Computing Clusters in the Top500 List (http://www.top500.org)

[Chart: number of clusters and percentage of clusters in the Top500 over time; commodity clusters now account for about 85% of the systems.]

Page 4: Programming Models for Exascale Systems

Drivers of Modern HPC Cluster Architectures

[Pictured systems: Tianhe-2, Titan, Stampede, Tianhe-1A]

•  Multi-core/many-core technologies
•  Remote Direct Memory Access (RDMA)-enabled networking (InfiniBand and RoCE)
•  Solid State Drives (SSDs), Non-Volatile Random-Access Memory (NVRAM), NVMe-SSD
•  Accelerators (NVIDIA GPGPUs and Intel Xeon Phi)

Accelerators/coprocessors: high compute density, high performance/watt (>1 TFlop DP on a chip)
High-performance interconnects (InfiniBand): <1 usec latency, 100 Gbps bandwidth
Multi-core processors; SSD, NVMe-SSD, NVRAM

Page 5: Programming Models for Exascale Systems

Large-scale InfiniBand Installations

•  235 IB clusters (47%) in the Nov 2015 Top500 list (http://www.top500.org)
•  Installations in the Top 50 (21 systems):

462,462 cores (Stampede) at TACC (10th)
185,344 cores (Pleiades) at NASA/Ames (13th)
72,800 cores Cray CS-Storm in US (15th)
72,800 cores Cray CS-Storm in US (16th)
265,440 cores SGI ICE at Tulip Trading Australia (17th)
124,200 cores (Topaz) SGI ICE at ERDC DSRC in US (18th)
72,000 cores (HPC2) in Italy (19th)
152,692 cores (Thunder) at AFRL/USA (21st)
147,456 cores (SuperMUC) in Germany (22nd)
86,016 cores (SuperMUC Phase 2) in Germany (24th)
76,032 cores (Tsubame 2.5) at Japan/GSIC (25th)
194,616 cores (Cascade) at PNNL (27th)
76,032 cores (Makman-2) at Saudi Aramco (32nd)
110,400 cores (Pangea) in France (33rd)
37,120 cores (Lomonosov-2) at Russia/MSU (35th)
57,600 cores (SwiftLucy) in US (37th)
55,728 cores (Prometheus) at Poland/Cyfronet (38th)
50,544 cores (Occigen) at France/GENCI-CINES (43rd)
76,896 cores (Salomon) SGI ICE in Czech Republic (47th)
and many more!

Page 6: Programming Models for Exascale Systems

Two Major Categories of Applications

•  Scientific Computing
  -  Message Passing Interface (MPI), including MPI+OpenMP, is the dominant programming model
  -  Many discussions towards Partitioned Global Address Space (PGAS): UPC, OpenSHMEM, CAF, etc.
  -  Hybrid programming: MPI + PGAS (OpenSHMEM, UPC)
•  Big Data/Enterprise/Commercial Computing
  -  Focuses on large data and data analysis
  -  Hadoop (HDFS, HBase, MapReduce)
  -  Spark is emerging for in-memory computing
  -  Memcached is also used for Web 2.0

Page 7: Programming Models for Exascale Systems

Towards Exascale System (Today and Target)

  Systems                     2016 (Tianhe-2)                        2020-2024                       Difference (Today & Exascale)
  System peak                 55 PFlop/s                             1 EFlop/s                       ~20x
  Power                       18 MW (3 GFlops/W)                     ~20 MW (50 GFlops/W)            O(1), ~15x
  System memory               1.4 PB (1.024 PB CPU + 0.384 PB CoP)   32-64 PB                        ~50x
  Node performance            3.43 TF/s (0.4 CPU + 3 CoP)            1.2 or 15 TF                    O(1)
  Node concurrency            24-core CPU + 171-core CoP             O(1k) or O(10k)                 ~5x - ~50x
  Total node interconnect BW  6.36 GB/s                              200-400 GB/s                    ~40x - ~60x
  System size (nodes)         16,000                                 O(100,000) or O(1M)             ~6x - ~60x
  Total concurrency           3.12M (12.48M threads, 4/core)         O(billion) for latency hiding   ~100x
  MTTI                        Few/day                                Many/day                        O(?)

Courtesy: Prof. Jack Dongarra

Page 8: Programming Models for Exascale Systems

Basic Design Challenges for Exascale Systems

•  Energy and Power Challenge
  -  Hard to solve power requirements for data movement
•  Memory and Storage Challenge
  -  Hard to achieve high capacity and high data rate
•  Concurrency and Locality Challenge
  -  Management of a very large amount of concurrency (billions of threads)
•  Resiliency Challenge
  -  Low-voltage devices (for low power) introduce more faults

Page 9: Programming Models for Exascale Systems

Parallel Programming Models Overview

[Diagram of three models:]
- Shared Memory Model (e.g., SHMEM, DSM): processes P1, P2, P3 access a single shared memory
- Distributed Memory Model (e.g., MPI - Message Passing Interface): P1, P2, P3 each have their own private memory
- Partitioned Global Address Space (PGAS) (e.g., Global Arrays, UPC, Chapel, X10, CAF, ...): per-process memories combined into a logical shared memory

•  Programming models provide abstract machine models
•  Models can be mapped on different types of systems
  -  e.g., Distributed Shared Memory (DSM), MPI within a node, etc.
•  PGAS models and hybrid MPI+PGAS models are gradually gaining importance

Page 10: Programming Models for Exascale Systems

MPI Overview and History

•  Message Passing Library standardized by the MPI Forum
  -  C and Fortran
•  Goal: portable, efficient and flexible standard for writing parallel applications
•  Not an IEEE or ISO standard, but widely considered the "industry standard" for HPC applications
•  Evolution of MPI
  -  MPI-1: 1994
  -  MPI-2: 1996
  -  MPI-3.0: 2008-2012, standardized on September 21, 2012
  -  MPI-3.1: 2012-2015, standardized on June 4, 2015
  -  Next plan is for MPI 4.0
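For readers new to MPI, the following minimal C program (not part of the original slides; shown only to illustrate the message-passing model with standard MPI calls) sends one integer from rank 0 to rank 1:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);                 /* initialize the MPI library  */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* my process id               */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes   */

        if (rank == 0 && size > 1) {
            int value = 42;
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* two-sided send */
        } else if (rank == 1) {
            int value;
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received %d from rank 0\n", value);
        }

        MPI_Finalize();
        return 0;
    }

With an MPI library such as MVAPICH2 this would typically be built with the mpicc wrapper and started with the library's job launcher; exact commands depend on the installation.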

Page 11: Programming Models for Exascale Systems

How does MPI Plan to Meet Exascale Challenges?

•  Power required for data movement operations is one of the main challenges
•  Non-blocking collectives
  -  Overlap computation and communication
•  Much improved one-sided interface
  -  Reduce synchronization of sender/receiver
•  Manage concurrency
  -  Improved interoperability with PGAS (e.g., UPC, Global Arrays, OpenSHMEM, CAF)
•  Resiliency
  -  New interface for detecting failures

Page 12: Programming Models for Exascale Systems

Major New Features in MPI-3.0

•  Major features in MPI 3.0
  -  Non-blocking collectives
  -  Improved one-sided (RMA) model
  -  MPI Tools Interface
•  Specification is available from: http://www.mpi-forum.org/docs/mpi-3.0/mpi30-report.pdf

Page 13: Programming Models for Exascale Systems

MPI-3 RMA: One-sided Communication Model

[Diagram: processes P1, P2 and P3, each with an HCA and a local buffer. After global region creation (buffer information exchanged), P1 issues "Write to P2" and P2 issues "Write to P3"; each write is posted to the local HCA, and the HCA writes the data directly into the remote buffer without software involvement at the target.]

Page 14: Programming Models for Exascale Systems

MPI-3 RMA: Communication and Synchronization Primitives

•  Non-blocking one-sided communication routines
  -  Put, Get (Rput, Rget)
  -  Accumulate, Get_accumulate
  -  Atomics
•  Flexible synchronization operations to control initiation and completion

MPI one-sided synchronization/completion primitives:
  Synchronization: Lock/Unlock, Lock_all/Unlock_all, Fence, Post-Wait/Start-Complete
  Completion: Win_sync, Flush, Flush_all, Flush_local, Flush_local_all
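To make these primitives concrete, here is a small hedged sketch (not from the slides) using passive-target MPI-3 RMA: a window is allocated, rank 0 writes into rank 1's window with MPI_Put, MPI_Win_flush completes the operation, and MPI_Win_sync plus a barrier order the later local read; buffer sizes and values are illustrative only:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, nprocs;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        int *win_buf;
        MPI_Win win;
        /* Each process exposes one integer through an RMA window */
        MPI_Win_allocate(sizeof(int), sizeof(int), MPI_INFO_NULL,
                         MPI_COMM_WORLD, &win_buf, &win);
        *win_buf = -1;

        /* Passive-target epoch: no explicit involvement of the target */
        MPI_Win_lock_all(0, win);

        if (rank == 0 && nprocs > 1) {
            int value = 42;
            MPI_Put(&value, 1, MPI_INT, 1, 0, 1, MPI_INT, win);  /* write into rank 1 */
            MPI_Win_flush(1, win);     /* complete the Put at origin and target */
        }

        MPI_Barrier(MPI_COMM_WORLD);   /* order the Put before the read below   */
        MPI_Win_sync(win);             /* sync private/public window copies     */
        if (rank == 1)
            printf("rank 1: window value = %d\n", *win_buf);

        MPI_Win_unlock_all(win);
        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }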

Page 15: Programming Models for Exascale Systems

MPI-3 RMA: Overlapping Communication and Computation

•  Network adapters can provide an RDMA feature that doesn't require software involvement at the remote side
•  As long as puts/gets are executed as soon as they are issued, overlap can be achieved
•  RDMA-based implementations do just that

Page 16: Programming Models for Exascale Systems

MPI-3 Non-blocking Collective (NBC) Operations

•  Enables overlap of computation with communication
•  Non-blocking calls do not match blocking collective calls
  -  MPI may use different algorithms for blocking and non-blocking collectives
  -  Blocking collectives: optimized for latency
  -  Non-blocking collectives: optimized for overlap
•  A process calling an NBC operation
  -  Schedules the collective operation and immediately returns
  -  Executes application computation code
  -  Waits for the end of the collective
•  The communication is progressed by
  -  Application code through MPI_Test
  -  Network adapter (HCA) with hardware support
  -  Dedicated processes/thread in the MPI library
•  There is a non-blocking equivalent for each blocking operation
  -  Has an "I" in the name (MPI_Bcast -> MPI_Ibcast; MPI_Reduce -> MPI_Ireduce)
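A brief sketch (not from the slides) of the NBC usage pattern described above: schedule MPI_Ibcast, perform independent computation, then complete with MPI_Wait; the work loop is a placeholder:

    #include <mpi.h>
    #include <stdio.h>

    /* Placeholder for application work that does not touch the broadcast data */
    static double do_independent_work(void)
    {
        double s = 0.0;
        for (int i = 0; i < 1000000; i++)
            s += (double)i * 1e-9;
        return s;
    }

    int main(int argc, char **argv)
    {
        int rank;
        int data[1024] = {0};
        MPI_Request req;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0)
            for (int i = 0; i < 1024; i++) data[i] = i;

        /* Schedule the collective and return immediately */
        MPI_Ibcast(data, 1024, MPI_INT, 0, MPI_COMM_WORLD, &req);

        /* Overlap: compute while the broadcast progresses
           (MPI_Test could also be called here to drive progress) */
        double s = do_independent_work();

        /* Wait for the end of the collective before touching 'data' */
        MPI_Wait(&req, MPI_STATUS_IGNORE);

        printf("rank %d: data[1023] = %d, work = %f\n", rank, data[1023], s);
        MPI_Finalize();
        return 0;
    }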

Page 17: Programming Models for Exascale Systems

MPI Tools Interface

•  Extended tools support in MPI-3, beyond the PMPI interface
•  Provides a standardized interface (MPI_T) to access MPI internal information
  -  Configuration and control information: eager limit, buffer sizes, ...
  -  Performance information: time spent in blocking, memory usage, ...
  -  Debugging information: packet counters, thresholds, ...
•  External tools can build on top of this standard interface
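A hedged sketch of how an external tool can use the MPI_T interface to enumerate the control variables an MPI library exposes; the number and names of the variables are implementation-specific, and error checking is omitted:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided, num_cvars;
        char name[256], desc[256];
        int name_len, desc_len, verbosity, binding, scope;
        MPI_Datatype dtype;
        MPI_T_enum enumtype;

        /* The tools interface is initialized separately from MPI itself */
        MPI_T_init_thread(MPI_THREAD_SINGLE, &provided);
        MPI_Init(&argc, &argv);

        /* How many control variables does this MPI library expose? */
        MPI_T_cvar_get_num(&num_cvars);
        printf("MPI library exposes %d control variables\n", num_cvars);

        for (int i = 0; i < num_cvars && i < 5; i++) {
            name_len = sizeof(name);
            desc_len = sizeof(desc);
            MPI_T_cvar_get_info(i, name, &name_len, &verbosity, &dtype,
                                &enumtype, desc, &desc_len, &binding, &scope);
            printf("  cvar[%d]: %s - %s\n", i, name, desc);
        }

        MPI_Finalize();
        MPI_T_finalize();
        return 0;
    }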

Page 18: Programming Models for Exascale Systems

MPI-3.1 Enhancements

•  MPI 3.1 was approved on June 4, 2015
  -  Specification is available from: http://mpi-forum.org/docs/mpi-3.1/mpi31-report.pdf
•  Major features and enhancements:
  -  Corrections to the Fortran bindings introduced in MPI-3.0
  -  New functions added include routines to manipulate MPI_Aint values in a portable manner
  -  Nonblocking collective I/O routines
  -  Routines to get the index value by name for MPI_T performance and control variables

Page 19: Programming Models for Exascale Systems

Partitioned Global Address Space (PGAS) Models

•  Key features
  -  Simple shared memory abstractions
  -  Lightweight one-sided communication
  -  Easier to express irregular communication
•  Different approaches to PGAS
  -  Languages: Unified Parallel C (UPC), Co-Array Fortran (CAF), X10, Chapel
  -  Libraries: OpenSHMEM, UPC++, Global Arrays

Page 20: Programming Models for Exascale Systems

OpenSHMEM

•  SHMEM implementations: Cray SHMEM, SGI SHMEM, Quadrics SHMEM, HP SHMEM, GSHMEM
•  Subtle differences in API, across versions - example:

                   SGI SHMEM      Quadrics SHMEM   Cray SHMEM
  Initialization   start_pes(0)   shmem_init       start_pes
  Process ID       _my_pe         my_pe            shmem_my_pe

•  Made application codes non-portable
•  OpenSHMEM is an effort to address this:
  "A new, open specification to consolidate the various extant SHMEM versions into a widely accepted standard." - OpenSHMEM Specification v1.0
  by University of Houston and Oak Ridge National Lab
•  SGI SHMEM is the baseline
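To illustrate the consolidated API, here is a minimal OpenSHMEM sketch (not from the slides; it uses the shmem_init/shmem_malloc names adopted by later OpenSHMEM revisions) in which PE 0 performs a one-sided put into PE 1's symmetric buffer:

    #include <shmem.h>
    #include <stdio.h>

    int main(void)
    {
        shmem_init();                        /* standardized initialization   */
        int me   = shmem_my_pe();            /* standardized process-id query */
        int npes = shmem_n_pes();

        /* Symmetric heap allocation: the same object exists on every PE */
        int *dest = (int *) shmem_malloc(sizeof(int));
        *dest = -1;
        shmem_barrier_all();

        if (me == 0 && npes > 1) {
            int value = 42;
            shmem_int_put(dest, &value, 1, 1);   /* one-sided put into PE 1 */
        }

        shmem_barrier_all();                 /* completes and orders the put  */
        if (me == 1)
            printf("PE 1: dest = %d\n", *dest);

        shmem_free(dest);
        shmem_finalize();
        return 0;
    }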

Page 21: Programming Models for Exascale Systems

UPC, CAF and UPC++

•  UPC: Unified Parallel C - PGAS-based language extension to C
  -  An ISO C99-based language providing a uniform programming model for both shared and distributed memory hardware to support HPC
  -  UPC = UPC translator + C compiler + UPC runtime
•  Coarray Fortran (CAF): language-level PGAS support in Fortran
  -  An extension to Fortran to support global shared arrays (coarrays) in parallel Fortran applications
  -  CAF = CAF compiler + CAF runtime (libcaf)
  -  Basic support in Fortran 2008 and extended support for collectives in Fortran 2015
•  UPC++: an object-oriented PGAS programming model
  -  A compiler-free PGAS programming model in the context of C++
  -  Built on top of C++ standard templates and runtime libraries
  -  Extension of UPC's programming idioms
  -  Register tasks for async execution

Page 22: Programming Models for Exascale Systems

MPI + PGAS for Exascale Architectures and Applications

•  Hierarchical architectures with multiple address spaces
•  (MPI + PGAS) model
  -  MPI across address spaces
  -  PGAS within an address space
•  MPI is good at moving data between address spaces
•  Within an address space, MPI can interoperate with other shared memory programming models
•  Applications can have kernels with different communication patterns
  -  Can benefit from different models
•  Re-writing complete applications can be a huge effort
  -  Port critical kernels to the desired model instead

Page 23: Programming Models for Exascale Systems

Hybrid (MPI + PGAS) Programming

•  Application sub-kernels can be re-written in MPI/PGAS based on communication characteristics (see the code sketch below)
•  Benefits:
  -  Best of the distributed computing model
  -  Best of the shared memory computing model
•  Exascale Roadmap*:
  -  "Hybrid Programming is a practical way to program exascale systems"

[Diagram: an HPC application composed of Kernel 1 ... Kernel N, with some kernels (e.g., Kernel 2 and Kernel N) re-written in PGAS while the remaining kernels stay in MPI.]

* The International Exascale Software Roadmap, Dongarra, J., Beckman, P. et al., Volume 25, Number 1, 2011, International Journal of High Performance Computer Applications, ISSN 1094-3420
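A hedged sketch of the hybrid model (not taken from a real application or from MVAPICH2-X sources): the same set of processes uses MPI for a collective phase and OpenSHMEM for a one-sided phase. Whether MPI_Init and shmem_init may be combined in one program depends on the runtime; a unified runtime such as MVAPICH2-X is designed to support exactly this kind of mixing.

    #include <mpi.h>
    #include <shmem.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        /* One set of processes, two programming models */
        MPI_Init(&argc, &argv);
        shmem_init();

        int rank, me, npes;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        me   = shmem_my_pe();
        npes = shmem_n_pes();

        /* "Kernel 1": bulk/regular communication expressed in MPI */
        int sum = 0;
        MPI_Allreduce(&rank, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

        /* "Kernel 2": irregular one-sided updates expressed in OpenSHMEM */
        long *slots = (long *) shmem_malloc(npes * sizeof(long));
        long my_val = (long) me * 10;
        shmem_barrier_all();
        shmem_long_put(&slots[me], &my_val, 1, 0);   /* write my slot on PE 0 */
        shmem_barrier_all();

        if (me == 0)
            printf("allreduce sum = %d, slot[%d] on PE 0 = %ld\n",
                   sum, npes - 1, slots[npes - 1]);

        shmem_free(slots);
        shmem_finalize();
        MPI_Finalize();
        return 0;
    }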

Page 24: Programming Models for Exascale Systems

Designing Communication Libraries for Multi-Petaflop and Exaflop Systems: Challenges

[Layered diagram:]
- Application Kernels/Applications
- Programming Models: MPI, PGAS (UPC, Global Arrays, OpenSHMEM), CUDA, OpenMP, OpenACC, Cilk, Hadoop (MapReduce), Spark (RDD, DAG), etc.
- Communication Library or Runtime for Programming Models: point-to-point communication, collective communication, energy-awareness, synchronization and locks, I/O and file systems, fault tolerance
- Networking Technologies (InfiniBand, 40/100 GigE, Aries, and OmniPath), Multi-/Many-core Architectures, Accelerators (NVIDIA and MIC)

Middleware co-design opens opportunities and challenges across the various layers: performance, scalability and fault-resilience.

Page 25: Programming Models for Exascale Systems

Broad Challenges in Designing Communication Libraries for (MPI+X) at Exascale

•  Scalability for million to billion processors
  -  Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided)
  -  Scalable job start-up
•  Scalable collective communication
  -  Offload
  -  Non-blocking
  -  Topology-aware
•  Balancing intra-node and inter-node communication for next-generation nodes (128-1024 cores)
  -  Multiple end-points per node
•  Support for efficient multi-threading
•  Integrated support for GPGPUs and accelerators
•  Fault-tolerance/resiliency
•  QoS support for communication and I/O
•  Support for hybrid MPI+PGAS programming (MPI+OpenMP, MPI+UPC, MPI+OpenSHMEM, CAF, ...)
•  Virtualization
•  Energy-awareness

Page 26: Programming Models for Exascale Systems

Additional Challenges for Designing Exascale Software Libraries

•  Extreme low memory footprint
  -  Memory per core continues to decrease
•  D-L-A framework
  -  Discover
    -  Overall network topology (fat-tree, 3D, ...), network topology for the processes of a given job
    -  Node architecture, health of network and node
  -  Learn
    -  Impact on performance and scalability
    -  Potential for failure
  -  Adapt
    -  Internal protocols and algorithms
    -  Process mapping
    -  Fault-tolerance solutions
  -  Low-overhead techniques while delivering performance, scalability and fault-tolerance

Page 27: Programming Models for Exascale Systems

Overview of the MVAPICH2 Project

•  High-performance open-source MPI library for InfiniBand, 10-40 GigE/iWARP, and RDMA over Converged Enhanced Ethernet (RoCE)
  -  MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.0), available since 2002
  -  MVAPICH2-X (MPI + PGAS), available since 2011
  -  Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), available since 2014
  -  Support for virtualization (MVAPICH2-Virt), available since 2015
  -  Support for energy-awareness (MVAPICH2-EA), available since 2015
  -  Used by more than 2,525 organizations in 77 countries
  -  More than 356,000 (>0.36 million) downloads from the OSU site directly
  -  Empowering many TOP500 clusters (Nov '15 ranking):
    -  10th-ranked 519,640-core cluster (Stampede) at TACC
    -  13th-ranked 185,344-core cluster (Pleiades) at NASA
    -  25th-ranked 76,032-core cluster (Tsubame 2.5) at Tokyo Institute of Technology, and many others
  -  Available with software stacks of many vendors and Linux distros (RedHat and SuSE)
  -  http://mvapich.cse.ohio-state.edu
•  Empowering Top500 systems for over a decade
  -  System-X from Virginia Tech (3rd in Nov 2003, 2,200 processors, 12.25 TFlops) ->
  -  Stampede at TACC (10th in Nov '15, 519,640 cores, 5.168 PFlops)

Page 28: Programming Models for Exascale Systems

MVAPICH2 Architecture

High-performance parallel programming models:
- Message Passing Interface (MPI)
- PGAS (UPC, OpenSHMEM, CAF, UPC++*)
- Hybrid --- MPI + X (MPI + PGAS + OpenMP/Cilk)

High-performance and scalable communication runtime with diverse APIs and mechanisms:
- Point-to-point primitives, collective algorithms, energy-awareness, remote memory access, I/O and file systems, fault tolerance, virtualization, active messages, job startup, introspection & analysis

Support for modern networking technology (InfiniBand, iWARP, RoCE, OmniPath) and modern multi-/many-core architectures (Intel Xeon, OpenPower*, Xeon Phi (MIC, KNL*), NVIDIA GPGPU):
- Transport protocols: RC, XRC, UD, DC
- Modern features: UMR, ODP*, SR-IOV, Multi-Rail
- Transport mechanisms: shared memory, CMA, IVSHMEM
- Modern features: MCDRAM*, NVLink*, CAPI*

* Upcoming

Page 29: Programming Models for Exascale Systems

MVAPICH Project Timeline

[Timeline chart, Oct 2002 to 2016, showing the evolution of the MVAPICH family: MVAPICH (now EOL), MVAPICH2, OMB, MVAPICH2-X, MVAPICH2-GDR, MVAPICH2-MIC, MVAPICH2-Virt, MVAPICH2-EA and OSU-INAM, with milestones at Oct '02, Jan '04, Nov '04, Jan '10, Nov '12, Aug '14, Apr '15, Jul '15, Aug '15 and Sep '15.]

Page 30: Programming Models for Exascale Systems

MVAPICH2 Software Family

  Requirements                                                    MVAPICH2 Library to use
  MPI with IB, iWARP and RoCE                                     MVAPICH2
  Advanced MPI, OSU INAM, PGAS and MPI+PGAS with IB and RoCE      MVAPICH2-X
  MPI with IB & GPU                                               MVAPICH2-GDR
  MPI with IB & MIC                                               MVAPICH2-MIC
  HPC Cloud with MPI & IB                                         MVAPICH2-Virt
  Energy-aware MPI with IB, iWARP and RoCE                        MVAPICH2-EA

Page 31: Programming Models for Exascale Systems

MVAPICH/MVAPICH2 Release Timeline and Downloads

[Chart: cumulative number of downloads from Sep 2004 to Jan 2016, growing to more than 350,000, annotated with release points from MV 0.9.4, MV 1.0, MV 1.1 and MV2 0.9.0 through MV2 2.1, MV2 2.2b, MV2-X 2.2b, MV2-GDR 2.0b/2.2b, MV2-MIC 2.0 and MV2-Virt 2.1rc2.]

Page 32: Programming Models for Exascale Systems

Overview of a Few Challenges being Addressed by the MVAPICH2 Project for Exascale

•  Scalability for million to billion processors
  -  Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided RMA)
  -  Support for advanced IB mechanisms (UMR and ODP)
  -  Extremely minimal memory footprint
  -  Scalable job start-up
•  Collective communication
•  Unified runtime for hybrid MPI+PGAS programming (MPI+OpenSHMEM, MPI+UPC, CAF, ...)
•  InfiniBand Network Analysis and Monitoring (INAM)
•  Integrated support for GPGPUs
•  Integrated support for MICs
•  Virtualization (SR-IOV and Container)
•  Energy-awareness

Page 33: Programming Models for Exascale Systems

One-way Latency: MPI over IB with MVAPICH2

[Charts: small-message and large-message latency vs. message size for TrueScale-QDR, ConnectX-3-FDR, ConnectIB-DualFDR and ConnectX-4-EDR. Small-message latencies shown: 1.26, 1.19, 0.95 and 1.15 us; large-message latencies range up to about 120 us.]

TrueScale-QDR: 2.8 GHz deca-core (IvyBridge) Intel, PCI Gen3, with IB switch
ConnectX-3-FDR: 2.8 GHz deca-core (IvyBridge) Intel, PCI Gen3, with IB switch
ConnectIB-DualFDR: 2.8 GHz deca-core (IvyBridge) Intel, PCI Gen3, with IB switch
ConnectX-4-EDR: 2.8 GHz deca-core (Haswell) Intel, PCI Gen3, back-to-back

Page 34: Programming Models for Exascale Systems

Bandwidth: MPI over IB with MVAPICH2

[Charts: unidirectional and bidirectional bandwidth vs. message size for TrueScale-QDR, ConnectX-3-FDR, ConnectIB-DualFDR and ConnectX-4-EDR. Peak unidirectional bandwidths shown: 3387, 6356, 12104 and 12465 MBytes/sec; peak bidirectional bandwidths shown: 6308, 12161, 21425 and 24353 MBytes/sec.]

TrueScale-QDR: 2.8 GHz deca-core (IvyBridge) Intel, PCI Gen3, with IB switch
ConnectX-3-FDR: 2.8 GHz deca-core (IvyBridge) Intel, PCI Gen3, with IB switch
ConnectIB-DualFDR: 2.8 GHz deca-core (IvyBridge) Intel, PCI Gen3, with IB switch
ConnectX-4-EDR: 2.8 GHz deca-core (Haswell) Intel, PCI Gen3, back-to-back

Page 35: Programming Models for Exascale Systems

MVAPICH2 Two-Sided Intra-Node Performance (Shared Memory and Kernel-based Zero-copy Support (LiMIC and CMA))

[Charts, latest MVAPICH2 2.2b on Intel Ivy-Bridge: intra-socket and inter-socket small-message latency (0.18 us and 0.45 us) and intra-socket/inter-socket bandwidth with CMA, shared-memory and LiMIC paths, peaking at 14,250 MB/s and 13,749 MB/s.]

Page 36: Programming Models for Exascale Systems

User-mode Memory Registration (UMR)

•  Introduced by Mellanox to support direct local and remote noncontiguous memory access
  -  Avoids packing at the sender and unpacking at the receiver
•  Available with MVAPICH2-X 2.2b

[Charts: small/medium (4K-1M) and large (2M-16M) message latency, UMR-based datatype processing vs. the default scheme.]

Connect-IB (54 Gbps): 2.8 GHz dual ten-core (IvyBridge) Intel, PCI Gen3, with Mellanox IB FDR switch

M. Li, H. Subramoni, K. Hamidouche, X. Lu and D. K. Panda, High Performance MPI Datatype Support with User-mode Memory Registration: Challenges, Designs and Benefits, CLUSTER 2015

Page 37: Programming Models for Exascale Systems

On-Demand Paging (ODP)

•  Introduced by Mellanox to support direct remote memory access without pinning
•  Memory regions are paged in/out dynamically by the HCA/OS
•  Size of registered buffers can be larger than physical memory
•  Will be available in a future MVAPICH2 release

[Charts, Graph500 at 16, 32 and 64 processes: pin-down buffer sizes (MB) and BFS kernel execution time (s), pin-down vs. ODP.]

Connect-IB (54 Gbps): 2.6 GHz dual octa-core (SandyBridge) Intel, PCI Gen3, with Mellanox IB FDR switch

Page 38: Programming Models for Exascale Systems

Minimizing Memory Footprint by Direct Connect (DC) Transport

[Diagram: nodes 0-3 with processes P0-P7 communicating through the IB network.]

•  Constant connection cost (one QP for any peer)
•  Full feature set (RDMA, atomics, etc.)
•  Separate objects for send (DC Initiator) and receive (DC Target)
  -  DC Target identified by "DCT Number"
  -  Messages routed with (DCT Number, LID)
  -  Requires the same "DC Key" to enable communication
•  Available since MVAPICH2-X 2.2a

[Charts: connection memory (KB) for Alltoall at 80-640 processes and normalized NAMD Apoa1 (large dataset) execution time at 160-620 processes, comparing RC, DC-Pool, UD and XRC.]

H. Subramoni, K. Hamidouche, A. Venkatesh, S. Chakraborty and D. K. Panda, Designing MPI Library with Dynamic Connected Transport (DCT) of InfiniBand: Early Experiences, IEEE International Supercomputing Conference (ISC '14)

Page 39: Programming Models for Exascale Systems

Towards High Performance and Scalable Startup at Exascale

•  Near-constant MPI and OpenSHMEM initialization time at any process count
•  10x and 30x improvement in startup time of MPI and OpenSHMEM respectively at 16,384 processes
•  Memory consumption reduced for remote endpoint information by O(processes per node)
•  1 GB memory saved per node with 1M processes and 16 processes per node

[Diagram: job startup performance vs. memory required to store endpoint information, contrasting state-of-the-art PGAS (P) and MPI (M) startup with the optimized PGAS/MPI design (O), built from (a) on-demand connection management, (b) PMIX_Ring, (c) PMIX_Ibarrier, (d) PMIX_Iallgather and (e) shmem-based PMI.]

On-demand Connection Management for OpenSHMEM and OpenSHMEM+MPI. S. Chakraborty, H. Subramoni, J. Perkins, A. A. Awan, and D. K. Panda, 20th International Workshop on High-level Parallel Programming Models and Supportive Environments (HIPS '15)

PMI Extensions for Scalable MPI Startup. S. Chakraborty, H. Subramoni, A. Moody, J. Perkins, M. Arnold, and D. K. Panda, Proceedings of the 21st European MPI Users' Group Meeting (EuroMPI/Asia '14)

Non-blocking PMI Extensions for Fast MPI Startup. S. Chakraborty, H. Subramoni, A. Moody, A. Venkatesh, J. Perkins, and D. K. Panda, 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid '15)

SHMEMPMI - Shared Memory based PMI for Improved Performance and Scalability. S. Chakraborty, H. Subramoni, J. Perkins, and D. K. Panda, 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid '16), accepted for publication

Page 40: Programming Models for Exascale Systems

Process Management Interface over Shared Memory (SHMEMPMI)

•  SHMEMPMI allows MPI processes to directly read remote endpoint (EP) information from the process manager through shared memory segments
•  Only a single copy per node - O(processes per node) reduction in memory usage
•  Estimated savings of 1 GB per node with 1 million processes and 16 processes per node
•  Up to 1,000 times faster PMI Gets compared to the default design; will be available in MVAPICH2 2.2 RC1

[Charts: time taken by one PMI_Get vs. processes per node (default vs. SHMEMPMI) and memory usage per node for remote EP information vs. processes per job (Fence/Allgather, default vs. shmem), with annotations showing a 16x actual and a 1000x estimated improvement.]

TACC Stampede - Connect-IB (54 Gbps): 2.6 GHz quad octa-core (SandyBridge) Intel, PCI Gen3, with Mellanox IB FDR

SHMEMPMI - Shared Memory Based PMI for Performance and Scalability. S. Chakraborty, H. Subramoni, J. Perkins, and D. K. Panda, 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid '16), accepted for publication

Page 41: Programming Models for Exascale Systems

Overview of a Few Challenges being Addressed by the MVAPICH2 Project for Exascale

•  Scalability for million to billion processors
•  Collective communication
  -  Offload and non-blocking
  -  Topology-aware
•  Unified runtime for hybrid MPI+PGAS programming (MPI+OpenSHMEM, MPI+UPC, CAF, ...)
•  InfiniBand Network Analysis and Monitoring (INAM)

Page 42: Programming Models for Exascale Systems

Co-Design with MPI-3 Non-Blocking Collectives and Collective Offload Co-Direct Hardware (Available since MVAPICH2-X 2.2a)

•  Modified HPL with Offload-Bcast does up to 4.5% better than the default version (512 processes)
•  Modified P3DFFT with Offload-Alltoall does up to 17% better than the default version (128 processes)
•  Modified Pre-Conjugate Gradient solver with Offload-Allreduce does up to 21.8% better than the default version

[Charts: application run time vs. data size (512-800) showing the 17% P3DFFT improvement; normalized HPL performance vs. problem size N as % of total memory for HPL-Offload, HPL-1ring and HPL-Host showing the 4.5% improvement; PCG run time vs. number of processes (64-512) for PCG-Default vs. Modified-PCG-Offload showing the 21.8% improvement.]

K. Kandalla, et al., High-Performance and Scalable Non-Blocking All-to-All with Collective Offload on InfiniBand Clusters: A Study with Parallel 3D FFT, ISC 2011
K. Kandalla, et al., Designing Non-blocking Broadcast with Collective Offload on InfiniBand Clusters: A Case Study with HPL, HotI 2011
K. Kandalla, et al., Designing Non-blocking Allreduce with Collective Offload on InfiniBand Clusters: A Case Study with Conjugate Gradient Solvers, IPDPS '12
Can Network-Offload based Non-Blocking Neighborhood MPI Collectives Improve Communication Overheads of Irregular Graph Algorithms? K. Kandalla, A. Buluc, H. Subramoni, K. Tomko, J. Vienne, L. Oliker, and D. K. Panda, IWPAPS '12

Page 43: Programming Models for Exascale Systems

Network-Topology-Aware Placement of Processes

•  Can we design a highly scalable network topology detection service for IB?
•  How do we design the MPI communication library in a network-topology-aware manner to efficiently leverage the topology information generated by our service?
•  What are the potential benefits of using a network-topology-aware MPI library on the performance of parallel scientific applications?

[Charts: overall performance and split-up of physical communication for MILC on Ranger - performance for varying system sizes, plus default vs. topology-aware placement for a 2048-core run, showing a 15% improvement.]

•  Reduce network topology discovery time from O(N^2 hosts) to O(N hosts)
•  15% improvement in MILC execution time @ 2048 cores
•  15% improvement in Hypre execution time @ 1024 cores

H. Subramoni, S. Potluri, K. Kandalla, B. Barth, J. Vienne, J. Keasler, K. Tomko, K. Schulz, A. Moody, and D. K. Panda, Design of a Scalable InfiniBand Topology Service to Enable Network-Topology-Aware Placement of Processes, SC '12. Best Paper and Best Student Paper Finalist

Page 44: Programming Models for Exascale Systems

Overview of a Few Challenges being Addressed by the MVAPICH2 Project for Exascale

•  Scalability for million to billion processors
•  Collective communication
•  Unified runtime for hybrid MPI+PGAS programming (MPI+OpenSHMEM, MPI+UPC, CAF, ...)
•  InfiniBand Network Analysis and Monitoring (INAM)

Page 45: Programming Models for Exascale Systems

MVAPICH2-X for Advanced MPI and Hybrid MPI + PGAS Applications

[Diagram: MPI, OpenSHMEM, UPC, CAF, UPC++ or hybrid (MPI + PGAS) applications issue MPI, OpenSHMEM, UPC, CAF and UPC++ calls into the unified MVAPICH2-X runtime, which runs over InfiniBand, RoCE and iWARP.]

•  Unified communication runtime for MPI, UPC, OpenSHMEM, CAF, UPC++, available with MVAPICH2-X 1.9 onwards! (since 2012)
•  UPC++ support will be available in the upcoming MVAPICH2-X 2.2 RC1
•  Feature highlights
  -  Supports MPI(+OpenMP), OpenSHMEM, UPC, CAF, UPC++, MPI(+OpenMP)+OpenSHMEM, MPI(+OpenMP)+UPC
  -  MPI-3 compliant, OpenSHMEM v1.0 standard compliant, UPC v1.2 standard compliant (with initial support for UPC 1.3), CAF 2008 standard (OpenUH), UPC++
  -  Scalable inter-node and intra-node communication - point-to-point and collectives

Page 46: Programming Models for Exascale Systems

Application-Level Performance with Graph500 and Sort

Graph500 execution time:
•  Performance of hybrid (MPI+OpenSHMEM) Graph500 design
•  8,192 processes: 2.4x improvement over MPI-CSR, 7.6x improvement over MPI-Simple
•  16,384 processes: 1.5x improvement over MPI-CSR, 13x improvement over MPI-Simple

[Chart: execution time (s) at 4K, 8K and 16K processes for MPI-Simple, MPI-CSC, MPI-CSR and Hybrid (MPI+OpenSHMEM); annotations show 7.6x and 13x.]

Sort execution time:
•  Performance of hybrid (MPI+OpenSHMEM) Sort application
•  4,096 processes, 4 TB input size: MPI 2408 sec (0.16 TB/min); Hybrid 1172 sec (0.36 TB/min); 51% improvement over the MPI design

[Chart: execution time (seconds) for 500GB-512, 1TB-1K, 2TB-2K and 4TB-4K (input data - number of processes), MPI vs. Hybrid; annotation shows 51%.]

J. Jose, S. Potluri, K. Tomko and D. K. Panda, Designing Scalable Graph500 Benchmark with Hybrid MPI+OpenSHMEM Programming Models, International Supercomputing Conference (ISC '13), June 2013
J. Jose, K. Kandalla, M. Luo and D. K. Panda, Supporting Hybrid MPI and OpenSHMEM over InfiniBand: Design and Performance Evaluation, Int'l Conference on Parallel Processing (ICPP '12), September 2012
J. Jose, K. Kandalla, S. Potluri, J. Zhang and D. K. Panda, Optimizing Collective Communication in OpenSHMEM, Int'l Conference on Partitioned Global Address Space Programming Models (PGAS '13), October 2013

Page 47: Programming Models for Exascale Systems

MiniMD - Total Execution Time

•  Hybrid design performs better than the MPI implementation
•  1,024 processes: 17% improvement over the MPI version
•  Strong scaling; input size: 128 x 128 x 128

[Charts: execution time (ms) for Hybrid-Barrier, MPI-Original and Hybrid-Advanced - performance at 512 and 1,024 cores (17% improvement) and strong scaling at 256, 512 and 1,024 cores.]

M. Li, J. Lin, X. Lu, K. Hamidouche, K. Tomko and D. K. Panda, Scalable MiniMD Design with Hybrid MPI and OpenSHMEM, OpenSHMEM User Group Meeting (OUG '14), held in conjunction with the 8th International Conference on Partitioned Global Address Space Programming Models (PGAS '14)

Page 48: Programming Models for Exascale Systems

Hybrid MPI+UPC NAS-FT

•  Modified the NAS FT UPC all-to-all pattern to use MPI_Alltoall
•  Truly hybrid program
•  For FT (Class C, 128 processes):
  -  34% improvement over UPC-GASNet
  -  30% improvement over UPC-OSU

[Chart: time (s) for NAS problem size - system size B-64, C-64, B-128 and C-128, comparing UPC-GASNet, UPC-OSU and Hybrid-OSU; annotation shows 34%.]

Hybrid MPI+UPC support available since MVAPICH2-X 1.9 (2012)

J. Jose, M. Luo, S. Sur and D. K. Panda, Unifying UPC and MPI Runtimes: Experience with MVAPICH, Fourth Conference on Partitioned Global Address Space Programming Model (PGAS '10), October 2010

Page 49: Programming Models for Exascale Systems

Overview of a Few Challenges being Addressed by the MVAPICH2 Project for Exascale

•  Scalability for million to billion processors
•  Collective communication
•  Unified runtime for hybrid MPI+PGAS programming (MPI+OpenSHMEM, MPI+UPC, CAF, ...)
•  InfiniBand Network Analysis and Monitoring (INAM)

Page 50: Programming Models for Exascale Systems

Overview of OSU INAM

•  A network monitoring and analysis tool that is capable of analyzing traffic on the InfiniBand network with inputs from the MPI runtime
  -  http://mvapich.cse.ohio-state.edu/tools/osu-inam/
  -  http://mvapich.cse.ohio-state.edu/userguide/osu-inam/
•  Monitors IB clusters in real time by querying various subnet management entities and gathering input from the MPI runtimes
•  Capability to analyze and profile node-level, job-level and process-level activities for MPI communication (point-to-point, collectives and RMA)
•  Ability to filter data based on type of counters using a "drop down" list
•  Remotely monitor various metrics of MPI processes at user-specified granularity
•  "Job Page" to display jobs in ascending/descending order of various performance metrics, in conjunction with MVAPICH2-X
•  Visualize the data transfer happening in a "live" or "historical" fashion for the entire network, a job or a set of nodes

Page 51: Programming Models for Exascale Systems

OSU INAM - Network Level View

•  Show network topology of large clusters
•  Visualize traffic pattern on different links
•  Quickly identify congested links/links in error state
•  See the history unfold - play back historical state of the network

[Screenshots: full network (152 nodes) and a zoomed-in view of the network.]

Page 52: Programming Models for Exascale Systems

OSU INAM - Job and Node Level Views

[Screenshots: visualizing a job (5 nodes); finding routes between nodes.]

•  Job level view
  -  Show different network metrics (load, error, etc.) for any live job
  -  Play back historical data for completed jobs to identify bottlenecks
•  Node level view provides details per process or per node
  -  CPU utilization for each rank/node
  -  Bytes sent/received for MPI operations (pt-to-pt, collective, RMA)
  -  Network metrics (e.g., XmitDiscard, RcvError) per rank/node

Page 53: Programming Models for Exascale Systems

Live Node Level View

[Screenshot of the live node-level view.]

Page 54: Programming Models for Exascale Systems

Live Switch Level View

[Screenshot of the live switch-level view.]

Page 55: Programming Models for Exascale Systems

List of Supported Switch Counters

The following counters are queried from the InfiniBand switches:

•  XmitData
  -  Total number of data octets, divided by 4, transmitted on all VLs from the port
  -  This includes all octets between (and not including) the start-of-packet delimiter and the VCRC, and may include packets containing errors
  -  Excludes all link packets
•  RcvData
  -  Total number of data octets, divided by 4, received on all VLs from the port
  -  This includes all octets between (and not including) the start-of-packet delimiter and the VCRC, and may include packets containing errors
  -  Excludes all link packets
•  Max[XmitData/RcvData]: maximum of the two values above

Page 56: Programming Models for Exascale Systems

List of Supported MPI Process Level Counters

MVAPICH2-X collects additional information about the process's network usage, which can be displayed by OSU INAM:

•  XmitData: total number of bytes transmitted as part of the MPI application
•  RcvData: total number of bytes received as part of the MPI application
•  Max[XmitData/RcvData]: maximum of the two values above
•  PointToPointSend: total number of bytes transmitted as part of MPI point-to-point operations
•  PointToPointRcvd: total number of bytes received as part of MPI point-to-point operations
•  Max[PointToPointSent/Rcvd]: maximum of the two values above
•  CollBytesSent: total number of bytes transmitted as part of MPI collective operations
•  CollBytesRcvd: total number of bytes received as part of MPI collective operations

Page 57: Programming Models for Exascale Systems

List of Supported MPI Process Level Counters (Cont.)

•  Max[CollBytesSent/Rcvd]: maximum of the two values above
•  RMABytesSent: total number of bytes transmitted as part of MPI RMA operations
  -  Note that, due to the nature of RMA operations, bytes received for RMA operations cannot be counted
•  RC VBUF: the number of internal communication buffers used for reliable connection (RC)
•  UD VBUF: the number of internal communication buffers used for unreliable datagram (UD)
•  VMSize: total number of bytes used by the program for its virtual memory
•  VMPeak: maximum number of virtual memory bytes for the program
•  VMRSS: the number of bytes resident in memory (resident set size)
•  VMHWM: the maximum number of bytes that have been resident in memory (peak resident set size or high water mark)

Page 58: Programming Models for Exascale Systems

List of Supported Network Error Counters (Cont.)

•  XmtDiscards
  -  Total number of outbound packets discarded by the port because the port is down or congested. Reasons for this include:
    -  Output port is not in the active state
    -  Packet length exceeded NeighborMTU
    -  Switch Lifetime Limit exceeded
    -  Switch HOQ Lifetime Limit exceeded
    -  This may also include packets discarded while in VLStalled state
•  XmtConstraintErrors
  -  Total number of packets not transmitted from the switch physical port for the following reasons:
    -  FilterRawOutbound is true and the packet is raw
    -  PartitionEnforcementOutbound is true and the packet fails the partition key check or IP version check
•  RcvConstraintErrors
  -  Total number of packets not received from the switch physical port for the following reasons:
    -  FilterRawInbound is true and the packet is raw
    -  PartitionEnforcementInbound is true and the packet fails the partition key check or IP version check
•  LinkIntegrityErrors
  -  The number of times that the count of local physical errors exceeded the threshold specified by LocalPhyErrors
•  ExcBufOverrunErrors
  -  The number of times that OverrunErrors consecutive flow-control update periods occurred, each having at least one overrun error
•  VL15Dropped: number of incoming VL15 packets dropped due to resource limitations (e.g., lack of buffers) in the port

Page 59: Programming Models for Exascale Systems

List of Supported Network Error Counters

The following error counters are available both at the switch and process level:

•  SymbolErrors
  -  Total number of minor link errors detected on one or more physical lanes
•  LinkRecovers
  -  Total number of times the Port Training state machine has successfully completed the link error recovery process
•  LinkDowned
  -  Total number of times the Port Training state machine has failed the link error recovery process and downed the link
•  RcvErrors
  -  Total number of packets containing an error that were received on the port. These errors include:
    -  Local physical errors
    -  Malformed data packet errors
    -  Malformed link packet errors
    -  Packets discarded due to buffer overrun
•  RcvRemotePhysErrors
  -  Total number of packets marked with the EBP delimiter received on the port
•  RcvSwitchRelayErrors
  -  Total number of packets received on the port that were discarded because they could not be forwarded by the switch relay

Page 60: Programming Models for Exascale Systems

Conclusions

•  Provided an overview of programming models for exascale systems
•  Outlined the associated challenges in designing runtimes for these programming models
•  Demonstrated how the MVAPICH2 project is addressing some of these challenges

Page 61: Programming Models for Exascale Systems

Additional Challenges to be Covered in Today's 1:30 pm Talk

•  Integrated support for GPGPUs
•  Integrated support for MICs
•  Virtualization (SR-IOV and Container)
•  Energy-awareness
•  Best practice: set of tunings for common applications (available through the MVAPICH website)

Page 62: Programming Models for Exascale Systems

Thank You!
[email protected]

Network-Based Computing Laboratory
http://nowlab.cse.ohio-state.edu/

The MVAPICH2 Project
http://mvapich.cse.ohio-state.edu/

The High-Performance Big Data Project
http://hibd.cse.ohio-state.edu/