Programming Models for Exascale Systems

Dhabaleswar K. (DK) Panda, The Ohio State University
E-mail: panda@cse.ohio-state.edu
http://www.cse.ohio-state.edu/~panda

Keynote Talk at HPCAC-Stanford (Feb 2016)
High-End Computing (HEC): ExaFlop & ExaByte

ExaFlop & HPC:
- 100-200 PFlops in 2016-2018
- 1 EFlops in 2020-2024?

ExaByte & Big Data:
- 10K-20K EBytes in 2016-2018
- 40K EBytes in 2020?

[Figure: growth of the digital universe. Source: IDC's Digital Universe Study, sponsored by EMC, December 2012]
"Within these broad outlines of the digital universe are some singularities worth noting. First, while the portion of the digital universe holding potential analytic value is growing, only a tiny fraction of territory has been explored. IDC estimates that by 2020, as much as 33% of the digital universe will contain information that might be valuable if analyzed, compared with 25% today. This untapped value could be found in patterns in social media usage, correlations in scientific data from discrete studies, medical information intersected with sociological data, faces in security footage, and so on. However, even with a generous estimate, the amount of information in the digital universe that is 'tagged' accounts for only about 3% of the digital universe in 2012, and that which is analyzed is half a percent of the digital universe. Herein is the promise of 'Big Data' technology -- the extraction of value from the large untapped pools of data in the digital universe."
Trends for Commodity Computing Clusters in the Top500 List (http://www.top500.org)

[Chart: number of clusters and percentage of clusters in the Top500 over time; clusters now account for about 85% of Top500 systems]
Drivers of Modern HPC Cluster Architectures

- Multi-core/many-core technologies
- Remote Direct Memory Access (RDMA)-enabled networking (InfiniBand and RoCE)
- Solid State Drives (SSDs), Non-Volatile Random-Access Memory (NVRAM), NVMe-SSD
- Accelerators (NVIDIA GPGPUs and Intel Xeon Phi)

Building blocks: multi-core processors; high-performance interconnects such as InfiniBand (<1 usec latency, 100 Gbps bandwidth); SSD, NVMe-SSD, NVRAM; accelerators/coprocessors with high compute density and high performance/watt (>1 TFlop DP on a chip).

Example systems: Tianhe-2, Titan, Stampede, Tianhe-1A
Large-scale InfiniBand Installations

- 235 IB clusters (47%) in the Nov 2015 Top500 list (http://www.top500.org)
- Installations in the Top 50 (21 systems):
  - 462,462 cores (Stampede) at TACC (10th)
  - 185,344 cores (Pleiades) at NASA/Ames (13th)
  - 72,800 cores, Cray CS-Storm in US (15th)
  - 72,800 cores, Cray CS-Storm in US (16th)
  - 265,440 cores, SGI ICE at Tulip Trading Australia (17th)
  - 124,200 cores (Topaz), SGI ICE at ERDC DSRC in US (18th)
  - 72,000 cores (HPC2) in Italy (19th)
  - 152,692 cores (Thunder) at AFRL/USA (21st)
  - 147,456 cores (SuperMUC) in Germany (22nd)
  - 86,016 cores (SuperMUC Phase 2) in Germany (24th)
  - 76,032 cores (Tsubame 2.5) at Japan/GSIC (25th)
  - 194,616 cores (Cascade) at PNNL (27th)
  - 76,032 cores (Makman-2) at Saudi Aramco (32nd)
  - 110,400 cores (Pangea) in France (33rd)
  - 37,120 cores (Lomonosov-2) at Russia/MSU (35th)
  - 57,600 cores (SwiftLucy) in US (37th)
  - 55,728 cores (Prometheus) at Poland/Cyfronet (38th)
  - 50,544 cores (Occigen) at France/GENCI-CINES (43rd)
  - 76,896 cores (Salomon), SGI ICE in Czech Republic (47th)
  - and many more!
Towards Exascale System (Today and Target)

Today (2016, Tianhe-2) vs. exascale (2020-2024), with the difference between today and exascale:
- System peak: 55 PFlop/s -> 1 EFlop/s (~20x)
- Power: 18 MW (3 Gflops/W) -> ~20 MW (50 Gflops/W) (O(1), ~15x)
- System memory: 1.4 PB (1.024 PB CPU + 0.384 PB CoP) -> 32-64 PB (~50x)
- Node performance: 3.43 TF/s (0.4 CPU + 3 CoP) -> 1.2 or 15 TF (O(1))
- Node concurrency: 24-core CPU + 171-core CoP -> O(1k) or O(10k) (~5x-~50x)
- Total node interconnect BW: 6.36 GB/s -> 200-400 GB/s (~40x-~60x)
- System size (nodes): 16,000 -> O(100,000) or O(1M) (~6x-~60x)
- Total concurrency: 3.12 M (12.48 M threads, 4/core) -> O(billion) for latency hiding (~100x)
- MTTI: few/day -> many/day (O(?))

Courtesy: Prof. Jack Dongarra
Two Major Categories of Applications

- Scientific Computing
  - Message Passing Interface (MPI), including MPI+OpenMP, is the dominant programming model
  - Many discussions towards Partitioned Global Address Space (PGAS): UPC, OpenSHMEM, CAF, etc.
  - Hybrid programming: MPI + PGAS (OpenSHMEM, UPC)
- Big Data / Enterprise / Commercial Computing
  - Focuses on large data and data analysis
  - Hadoop (HDFS, HBase, MapReduce)
  - Spark is emerging for in-memory computing
  - Memcached is also used for Web 2.0
Parallel Programming Models Overview

- Shared Memory Model (SHMEM, DSM): processes P1, P2, P3 operate on a single shared memory
- Distributed Memory Model (MPI, the Message Passing Interface): each process has its own memory and data moves by explicit messages
- Partitioned Global Address Space (PGAS) (Global Arrays, UPC, Chapel, X10, CAF, ...): separate per-process memories presented as a logical shared memory

- Programming models provide abstract machine models
- Models can be mapped onto different types of systems (e.g., Distributed Shared Memory (DSM), MPI within a node, etc.)
- PGAS models and hybrid MPI+PGAS models are gradually gaining importance
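As a concrete illustration of the distributed-memory model above, a minimal MPI exchange might look like the following sketch (two ranks; buffer names and values are illustrative, not from the talk):

  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
      int rank, value = 0;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      if (rank == 0) {
          value = 42;                                   /* data lives only in rank 0's memory */
          MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
      } else if (rank == 1) {
          MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
          printf("Rank 1 received %d\n", value);        /* an explicit message carried it across */
      }

      MPI_Finalize();
      return 0;
  }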
Partitioned Global Address Space (PGAS) Models

- Key features (see the sketch after this list):
  - Simple shared-memory abstractions
  - Lightweight one-sided communication
  - Easier to express irregular communication
- Different approaches to PGAS:
  - Languages: Unified Parallel C (UPC), Co-Array Fortran (CAF), X10, Chapel
  - Libraries: OpenSHMEM, Global Arrays
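For contrast with the two-sided MPI example earlier, here is a minimal sketch of the lightweight one-sided communication mentioned above, using the OpenSHMEM library route (a hedged illustration assuming an OpenSHMEM 1.2-style API; PE numbers and sizes are arbitrary):

  #include <shmem.h>
  #include <stdio.h>

  int main(void)
  {
      shmem_init();
      int me   = shmem_my_pe();
      int npes = shmem_n_pes();

      /* Symmetric allocation: the same remotely accessible buffer exists on every PE */
      int *dst = shmem_malloc(sizeof(int));
      *dst = -1;
      shmem_barrier_all();

      if (me == 0 && npes > 1)
          shmem_int_put(dst, &me, 1, 1);   /* one-sided: PE 1 never posts a receive */

      shmem_barrier_all();
      if (me == 1)
          printf("PE 1 sees %d\n", *dst);

      shmem_free(dst);
      shmem_finalize();
      return 0;
  }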
Hybrid (MPI+PGAS) Programming

- Application sub-kernels can be re-written in MPI or PGAS based on their communication characteristics: an HPC application may keep Kernel 1 ... Kernel N in MPI while selected kernels (e.g., Kernel 2, Kernel N) are re-written in PGAS (a sketch follows this slide)
- Benefits:
  - Best of the distributed memory computing model
  - Best of the shared memory computing model
- Exascale Roadmap*: "Hybrid Programming is a practical way to program exascale systems"

* The International Exascale Software Roadmap, Dongarra, J., Beckman, P., et al., International Journal of High Performance Computer Applications, Volume 25, Number 1, 2011, ISSN 1094-3420
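A hedged sketch of the hybrid pattern above: an MPI application in which one kernel keeps bulk collectives in MPI while another uses OpenSHMEM one-sided puts for irregular updates. It assumes a runtime where both models can be initialized in one job (as in the unified runtime discussed later); the kernel names, table sizes, and update pattern are illustrative only.

  #include <mpi.h>
  #include <shmem.h>

  /* Kernel with regular, bulk communication: stays in MPI */
  static void kernel_mpi(double *buf, int n)
  {
      MPI_Allreduce(MPI_IN_PLACE, buf, n, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
  }

  /* Kernel with irregular, fine-grained updates: re-written with one-sided puts */
  static void kernel_pgas(long *table, int nupdates)
  {
      int npes = shmem_n_pes();
      for (int i = 0; i < nupdates; i++) {
          long val = i;
          shmem_long_put(&table[i % 16], &val, 1, i % npes);
      }
      shmem_barrier_all();
  }

  int main(int argc, char **argv)
  {
      MPI_Init(&argc, &argv);
      shmem_init();                       /* assumption: both models share one runtime */

      double acc[4] = {1, 2, 3, 4};
      long *table = shmem_malloc(16 * sizeof(long));   /* symmetric table on every PE */

      kernel_mpi(acc, 4);
      kernel_pgas(table, 64);

      shmem_free(table);
      shmem_finalize();
      MPI_Finalize();
      return 0;
  }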
Designing Communication Libraries for Multi-Petaflop and Exaflop Systems: Challenges

Layers, from top to bottom:
- Application kernels / applications
- Programming models: MPI, PGAS (UPC, Global Arrays, OpenSHMEM), CUDA, OpenMP, OpenACC, Cilk, Hadoop (MapReduce), Spark (RDD, DAG), etc.
- Communication library or runtime for programming models: point-to-point communication, collective communication, energy-awareness, synchronization and locks, I/O and file systems, fault tolerance
- Networking technologies (InfiniBand, 40/100 GigE, Aries, and Omni-Path)
- Multi-/many-core architectures
- Accelerators (NVIDIA and MIC)

Middleware co-design opportunities and challenges exist across these layers: performance, scalability, and fault-resilience.
Broad Challenges in Designing Communication Libraries for (MPI+X) at Exascale

- Scalability for million to billion processors
  - Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided)
  - Scalable job start-up
- Scalable collective communication
  - Offload
  - Non-blocking
  - Topology-aware
- Balancing intra-node and inter-node communication for next-generation nodes (128-1024 cores)
  - Multiple end-points per node
- Support for efficient multi-threading
- Integrated support for GPGPUs and accelerators
- Fault-tolerance/resiliency
- QoS support for communication and I/O
- Support for hybrid MPI+PGAS programming (MPI+OpenMP, MPI+UPC, MPI+OpenSHMEM, CAF, ...)
- Virtualization
- Energy-awareness
Additional Challenges for Designing Exascale Software Libraries

- Extreme low memory footprint
  - Memory per core continues to decrease
- D-L-A framework
  - Discover: overall network topology (fat-tree, 3D, ...), network topology for the processes of a given job, node architecture, health of network and node
  - Learn: impact on performance and scalability, potential for failure
  - Adapt: internal protocols and algorithms, process mapping, fault-tolerance solutions
- Low-overhead techniques while delivering performance, scalability, and fault-tolerance
Overview of the MVAPICH2 Project

- High-performance open-source MPI library for InfiniBand, 10-40 GigE/iWARP, and RDMA over Converged Enhanced Ethernet (RoCE)
  - MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.0), available since 2002
  - MVAPICH2-X (MPI + PGAS), available since 2011
  - Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), available since 2014
  - Support for virtualization (MVAPICH2-Virt), available since 2015
  - Support for energy-awareness (MVAPICH2-EA), available since 2015
  - Used by more than 2,525 organizations in 77 countries
  - More than 351,000 (>0.35 million) downloads from the OSU site directly
  - Empowering many TOP500 clusters (Nov '15 ranking):
    - 10th-ranked 519,640-core cluster (Stampede) at TACC
    - 13th-ranked 185,344-core cluster (Pleiades) at NASA
    - 25th-ranked 76,032-core cluster (Tsubame 2.5) at Tokyo Institute of Technology, and many others
  - Available with the software stacks of many vendors and Linux distros (RedHat and SuSE)
  - http://mvapich.cse.ohio-state.edu
- Empowering Top500 systems for over a decade
  - From System-X at Virginia Tech (3rd in Nov 2003, 2,200 processors, 12.25 TFlops) -> Stampede at TACC (10th in Nov '15, 519,640 cores, 5.168 PFlops)
MVAPICH2 Architecture

High-performance parallel programming models:
- Message Passing Interface (MPI)
- PGAS (UPC, OpenSHMEM, CAF, UPC++*)
- Hybrid --- MPI + X (MPI + PGAS + OpenMP/Cilk)

High-performance and scalable communication runtime with diverse APIs and mechanisms:
- Point-to-point primitives, collective algorithms, energy-awareness, remote memory access, I/O and file systems, fault tolerance, virtualization, active messages, job startup, introspection & analysis

Support for modern networking technology (InfiniBand, iWARP, RoCE, Omni-Path) and modern multi-/many-core architectures (Intel Xeon, OpenPower*, Xeon Phi (MIC, KNL*), NVIDIA GPGPU):
- Transport protocols: RC, XRC, UD, DC
- Modern network features: UMR, ODP*, SR-IOV, multi-rail
- Transport mechanisms: shared memory, CMA, IVSHMEM
- Modern node features: MCDRAM*, NVLink*, CAPI*

* Upcoming
Overview of a Few Challenges Being Addressed by the MVAPICH2 Project for Exascale

- Scalability for million to billion processors
  - Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided RMA)
  - Support for advanced IB mechanisms (UMR and ODP)
  - Extremely minimal memory footprint
  - Scalable job start-up
- Collective communication
- Integrated support for GPGPUs
- Integrated support for MICs
- Unified runtime for hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, ...)
- Virtualization
- Energy-awareness
- InfiniBand Network Analysis and Monitoring (INAM)
One-way Latency: MPI over IB with MVAPICH2

[Charts: small-message and large-message one-way latency (us) vs. message size (bytes) for TrueScale-QDR, ConnectX-3-FDR, ConnectIB-Dual FDR, and ConnectX-4-EDR; small-message latencies of 0.95-1.26 us across the four adapters (1.26, 1.19, 1.15, and 0.95 us)]

Platforms: TrueScale-QDR, ConnectX-3-FDR, and ConnectIB-Dual FDR on 2.8 GHz deca-core (IvyBridge) Intel, PCIe Gen3, with IB switch; ConnectX-4-EDR on 2.8 GHz deca-core (Haswell) Intel, PCIe Gen3, back-to-back
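These numbers come from the OSU micro-benchmarks; a stripped-down ping-pong sketch of how such one-way latency is typically measured (iteration count and message size are illustrative, not the exact benchmark):

  #include <mpi.h>
  #include <stdio.h>

  #define ITERS 1000
  #define SIZE  8            /* small-message case, in bytes */

  int main(int argc, char **argv)
  {
      char buf[SIZE];
      int rank;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      MPI_Barrier(MPI_COMM_WORLD);
      double t0 = MPI_Wtime();
      for (int i = 0; i < ITERS; i++) {
          if (rank == 0) {
              MPI_Send(buf, SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
              MPI_Recv(buf, SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
          } else if (rank == 1) {
              MPI_Recv(buf, SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
              MPI_Send(buf, SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
          }
      }
      double t1 = MPI_Wtime();

      /* One-way latency = round-trip time / 2, averaged over the iterations */
      if (rank == 0)
          printf("latency: %.2f us\n", (t1 - t0) * 1e6 / (2.0 * ITERS));

      MPI_Finalize();
      return 0;
  }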
Bandwidth: MPI over IB with MVAPICH2

[Charts: unidirectional and bidirectional bandwidth (MBytes/sec) vs. message size (bytes) for the same four adapters]
- Unidirectional bandwidth: 3387, 6356, 12104, and 12465 MBytes/sec across the four adapters
- Bidirectional bandwidth: 6308, 12161, 21425, and 24353 MBytes/sec across the four adapters

Platforms: TrueScale-QDR, ConnectX-3-FDR, and ConnectIB-Dual FDR on 2.8 GHz deca-core (IvyBridge) Intel, PCIe Gen3, with IB switch; ConnectX-4-EDR on 2.8 GHz deca-core (Haswell) Intel, PCIe Gen3, back-to-back
MVAPICH2 Two-Sided Intra-Node Performance (Shared Memory and Kernel-based Zero-copy Support (LiMIC and CMA))

Latest MVAPICH2 2.2b, Intel Ivy Bridge

[Charts: latency (us) vs. message size (0 to 1K bytes) for intra-socket and inter-socket, and bandwidth (MB/s) vs. message size for intra-socket and inter-socket using CMA, shared memory, and LiMIC]
- Small-message latency: 0.18 us intra-socket, 0.45 us inter-socket
- Peak bandwidth: about 13,749-14,250 MB/s
User-mode Memory Registration (UMR)

- Introduced by Mellanox to support direct local and remote non-contiguous memory access
  - Avoids packing at the sender and unpacking at the receiver
- Available with MVAPICH2-X 2.2b

[Charts: small/medium message latency (4K-1M bytes) and large message latency (2M-16M bytes) vs. message size, UMR vs. default]

Connect-IB (54 Gbps): 2.8 GHz dual ten-core (IvyBridge) Intel, PCIe Gen3, with Mellanox IB FDR switch

M. Li, H. Subramoni, K. Hamidouche, X. Lu and D. K. Panda, High Performance MPI Datatype Support with User-mode Memory Registration: Challenges, Designs and Benefits, CLUSTER 2015
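UMR targets exactly the non-contiguous transfers that MPI derived datatypes describe. A hedged sketch of the kind of strided datatype whose send/receive path can benefit (the grid size is illustrative; whether the runtime uses UMR or pack/unpack is its own decision, not visible to the application):

  #include <mpi.h>

  int main(int argc, char **argv)
  {
      int rank;
      double grid[64][64];                 /* illustrative 2D grid */
      MPI_Datatype column;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      /* One column of the grid: 64 blocks of 1 double, stride 64 doubles apart */
      MPI_Type_vector(64, 1, 64, MPI_DOUBLE, &column);
      MPI_Type_commit(&column);

      if (rank == 0)
          MPI_Send(&grid[0][0], 1, column, 1, 0, MPI_COMM_WORLD);
      else if (rank == 1)
          MPI_Recv(&grid[0][0], 1, column, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

      MPI_Type_free(&column);
      MPI_Finalize();
      return 0;
  }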
On-Demand Paging (ODP)

- Introduced by Mellanox to support direct remote memory access without pinning
- Memory regions are paged in/out dynamically by the HCA/OS
- The size of registered buffers can be larger than physical memory
- Will be available in the upcoming MVAPICH2-X 2.2 RC1

[Charts: Graph500 pin-down buffer sizes (MB) and Graph500 BFS kernel execution time (s) vs. number of processes (16, 32, 64), pin-down vs. ODP]

Connect-IB (54 Gbps): 2.6 GHz dual octa-core (SandyBridge) Intel, PCIe Gen3, with Mellanox IB FDR switch
Minimizing Memory Footprint by Direct Connect (DC) Transport

- Constant connection cost (one QP for any peer)
- Full feature set (RDMA, atomics, etc.)
- Separate objects for send (DC Initiator) and receive (DC Target)
  - DC Target identified by "DCT Number"
  - Messages routed with (DCT Number, LID)
  - Requires the same "DC Key" to enable communication
- Available since MVAPICH2-X 2.2a

[Charts: connection memory (KB, log scale) for Alltoall vs. number of processes (80-640), and normalized execution time for NAMD Apoa1 (large dataset) vs. number of processes (160-640), comparing RC, DC-Pool, UD, and XRC; nodes P0-P7 spread over Node 0-Node 3 on the IB network]

H. Subramoni, K. Hamidouche, A. Venkatesh, S. Chakraborty and D. K. Panda, Designing MPI Library with Dynamic Connected Transport (DCT) of InfiniBand: Early Experiences, IEEE International Supercomputing Conference (ISC '14)
Towards High Performance and Scalable Startup at Exascale

- Near-constant MPI and OpenSHMEM initialization time at any process count
- 10x and 30x improvement in startup time of MPI and OpenSHMEM, respectively, at 16,384 processes
- Memory consumption for remote endpoint information reduced by O(processes per node)
- 1 GB of memory saved per node with 1M processes and 16 processes per node

[Diagram: job startup performance vs. memory required to store endpoint information, comparing the PGAS state of the art (P), the MPI state of the art (M), and the optimized PGAS/MPI design (O), built from (a) on-demand connection, (b) PMIX_Ring, (c) PMIX_Ibarrier, (d) PMIX_Iallgather, and (e) shmem-based PMI]

On-demand Connection Management for OpenSHMEM and OpenSHMEM+MPI. S. Chakraborty, H. Subramoni, J. Perkins, A. A. Awan, and D. K. Panda, 20th International Workshop on High-level Parallel Programming Models and Supportive Environments (HIPS '15)
PMI Extensions for Scalable MPI Startup. S. Chakraborty, H. Subramoni, A. Moody, J. Perkins, M. Arnold, and D. K. Panda, Proceedings of the 21st European MPI Users' Group Meeting (EuroMPI/Asia '14)
Non-blocking PMI Extensions for Fast MPI Startup. S. Chakraborty, H. Subramoni, A. Moody, A. Venkatesh, J. Perkins, and D. K. Panda, 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid '15)
SHMEMPMI -- Shared Memory based PMI for Improved Performance and Scalability. S. Chakraborty, H. Subramoni, J. Perkins, and D. K. Panda, 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid '16), accepted for publication
Process Management Interface over Shared Memory (SHMEMPMI)

- SHMEMPMI allows MPI processes to directly read remote endpoint (EP) information from the process manager through shared memory segments
- Only a single copy per node: an O(processes per node) reduction in memory usage
- Estimated savings of 1 GB per node with 1 million processes and 16 processes per node
- Up to 1,000 times faster PMI Gets compared to the default design
- Will be available in MVAPICH2 2.2 RC1

[Charts: time taken by one PMI_Get vs. number of processes per node (default vs. SHMEMPMI) and memory usage per node for remote EP information vs. number of processes per job (Fence/Allgather, default vs. shmem); annotated improvements of 1000x (estimated) and 16x (actual)]

TACC Stampede, Connect-IB (54 Gbps): 2.6 GHz quad octa-core (SandyBridge) Intel, PCIe Gen3, with Mellanox IB FDR
SHMEMPMI -- Shared Memory Based PMI for Performance and Scalability. S. Chakraborty, H. Subramoni, J. Perkins, and D. K. Panda, 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid '16), accepted for publication
Overview of a Few Challenges Being Addressed by the MVAPICH2 Project for Exascale

- Scalability for million to billion processors
- Collective communication
  - Offload and non-blocking
  - Topology-aware
- Integrated support for GPGPUs
- Integrated support for MICs
- Unified runtime for hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, ...)
- Virtualization
- Energy-awareness
- InfiniBand Network Analysis and Monitoring (INAM)
Co-Design with MPI-3 Non-Blocking Collectives and Collective Offload Co-Direct Hardware (Available since MVAPICH2-X 2.2a)

- Modified HPL with offloaded broadcast (Offload-Bcast) performs up to 4.5% better than the default version (512 processes)
- Modified P3DFFT with offloaded Alltoall (Offload-Alltoall) performs up to 17% better than the default version (128 processes)
- Modified Pre-Conjugate Gradient solver with offloaded Allreduce (Offload-Allreduce) performs up to 21.8% better than the default version

[Charts: normalized HPL performance vs. HPL problem size (N) as % of total memory for HPL-Offload, HPL-1ring, and HPL-Host; P3DFFT application run-time (s) vs. data size; PCG run-time (s) vs. number of processes (64-512) for PCG-Default vs. Modified-PCG-Offload]

K. Kandalla et al., High-Performance and Scalable Non-Blocking All-to-All with Collective Offload on InfiniBand Clusters: A Study with Parallel 3D FFT, ISC 2011
K. Kandalla et al., Designing Non-blocking Broadcast with Collective Offload on InfiniBand Clusters: A Case Study with HPL, HotI 2011
K. Kandalla et al., Designing Non-blocking Allreduce with Collective Offload on InfiniBand Clusters: A Case Study with Conjugate Gradient Solvers, IPDPS '12
Can Network-Offload based Non-Blocking Neighborhood MPI Collectives Improve Communication Overheads of Irregular Graph Algorithms? K. Kandalla, A. Buluc, H. Subramoni, K. Tomko, J. Vienne, L. Oliker, and D. K. Panda, IWPAPS '12
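The application modifications above rely on the MPI-3 non-blocking collectives; a minimal sketch of the overlap pattern they exploit (the computation routine is a placeholder for the application's local work, and the buffer sizes are arbitrary):

  #include <mpi.h>

  /* Placeholder for work that does not depend on the broadcast data */
  static void independent_compute(double *x, int n)
  {
      for (int i = 0; i < n; i++) x[i] *= 1.0001;
  }

  int main(int argc, char **argv)
  {
      double panel[1024] = {0}, local[1024] = {0};
      MPI_Request req;

      MPI_Init(&argc, &argv);

      /* Start the broadcast; with collective offload, progress can happen on the HCA */
      MPI_Ibcast(panel, 1024, MPI_DOUBLE, 0, MPI_COMM_WORLD, &req);

      independent_compute(local, 1024);    /* overlap: the CPU keeps computing */

      MPI_Wait(&req, MPI_STATUS_IGNORE);   /* the broadcast data is needed beyond this point */

      MPI_Finalize();
      return 0;
  }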
Network-Topology-Aware Placement of Processes

- Can we design a highly scalable network topology detection service for IB?
- How do we design the MPI communication library in a network-topology-aware manner to efficiently leverage the topology information generated by our service?
- What are the potential benefits of using a network-topology-aware MPI library on the performance of parallel scientific applications?

Results:
- Reduce network topology discovery time from O(N^2 hosts) to O(N hosts)
- 15% improvement in MILC execution time at 2,048 cores
- 15% improvement in Hypre execution time at 1,024 cores

[Charts: overall performance and split-up of physical communication for MILC on Ranger; performance for varying system sizes; default vs. topology-aware placement for the 2,048-core run]

H. Subramoni, S. Potluri, K. Kandalla, B. Barth, J. Vienne, J. Keasler, K. Tomko, K. Schulz, A. Moody, and D. K. Panda, Design of a Scalable InfiniBand Topology Service to Enable Network-Topology-Aware Placement of Processes, SC '12. Best Paper and Best Student Paper Finalist
Overview of a Few Challenges Being Addressed by the MVAPICH2 Project for Exascale

- Scalability for million to billion processors
- Collective communication
- Integrated support for GPGPUs
  - CUDA-aware MPI
  - GPUDirect RDMA (GDR) support
  - CUDA-aware non-blocking collectives
  - Support for managed memory
  - Efficient datatype processing
- Integrated support for MICs
- Unified runtime for hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, ...)
- Virtualization
- Energy-awareness
- InfiniBand Network Analysis and Monitoring (INAM)
MPI + CUDA - Naive

- Data movement in applications with standard MPI and CUDA interfaces (data path: GPU - PCIe - CPU - NIC - switch)

  At sender:
    cudaMemcpy(s_hostbuf, s_devbuf, ...);
    MPI_Send(s_hostbuf, size, ...);

  At receiver:
    MPI_Recv(r_hostbuf, size, ...);
    cudaMemcpy(r_devbuf, r_hostbuf, ...);

High productivity, but low performance.
MPI + CUDA - Advanced

- Pipelining at user level with non-blocking MPI and CUDA interfaces (data path: GPU - PCIe - CPU - NIC - switch)

  At sender:
    for (j = 0; j < pipeline_len; j++)
        cudaMemcpyAsync(s_hostbuf + j * blksz, s_devbuf + j * blksz, ...);
    for (j = 0; j < pipeline_len; j++) {
        while (result != cudaSuccess) {
            result = cudaStreamQuery(...);
            if (j > 0) MPI_Test(...);
        }
        MPI_Isend(s_hostbuf + j * blksz, blksz, ...);
    }
    MPI_Waitall(...);

  (Similar pipelining at the receiver.)

Low productivity, but high performance.
GPU-Aware MPI Library: MVAPICH2-GPU

- Standard MPI interfaces used for unified data movement:

  At sender:    MPI_Send(s_devbuf, size, ...);
  At receiver:  MPI_Recv(r_devbuf, size, ...);

  (staging and pipelining handled inside MVAPICH2)

- Takes advantage of Unified Virtual Addressing (>= CUDA 4.0)
- Overlaps data movement from the GPU with RDMA transfers

High performance and high productivity.
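A hedged end-to-end sketch of the usage above: device pointers are handed straight to MPI, and the library stages or pipelines internally. It assumes a CUDA-aware MPI build such as MVAPICH2-GPU; buffer size and error handling are illustrative and omitted, respectively.

  #include <mpi.h>
  #include <cuda_runtime.h>

  int main(int argc, char **argv)
  {
      int rank;
      const int n = 1 << 20;
      double *devbuf;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      cudaMalloc((void **)&devbuf, n * sizeof(double));   /* GPU device memory */

      if (rank == 0)
          MPI_Send(devbuf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);   /* device pointer passed directly */
      else if (rank == 1)
          MPI_Recv(devbuf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

      cudaFree(devbuf);
      MPI_Finalize();
      return 0;
  }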
GPU-Direct RDMA (GDR) with CUDA

- OFED with support for GPUDirect RDMA is developed by NVIDIA and Mellanox
- OSU has a design of MVAPICH2 using GPUDirect RDMA
  - Hybrid design using GPUDirect RDMA and host-based pipelining
  - Alleviates P2P bandwidth bottlenecks on SandyBridge and IvyBridge
    (SandyBridge E5-2670: P2P write 5.2 GB/s, P2P read < 1.0 GB/s; IvyBridge E5-2680 v2: P2P write 6.4 GB/s, P2P read 3.5 GB/s)
  - Support for communication using multi-rail
  - Support for Mellanox Connect-IB and ConnectX VPI adapters
  - Support for RoCE with Mellanox ConnectX VPI adapters

[Diagram: IB adapter, system memory, GPU memory, GPU, CPU, and chipset, showing the P2P paths]
CUDA-Aware MPI: MVAPICH2-GDR 1.8-2.2 Releases

- Support for MPI communication from NVIDIA GPU device memory
- High-performance RDMA-based inter-node point-to-point communication (GPU-GPU, GPU-Host, and Host-GPU)
- High-performance intra-node point-to-point communication for multi-GPU adapters/node (GPU-GPU, GPU-Host, and Host-GPU)
- Taking advantage of CUDA IPC (available since CUDA 4.1) in intra-node communication for multiple GPU adapters/node
- Optimized and tuned collectives for GPU device buffers
- MPI datatype support for point-to-point and collective communication from GPU device buffers
Performance of MVAPICH2-GPU with GPU-Direct RDMA (GDR)

[Charts: GPU-GPU inter-node latency, bandwidth, and bi-bandwidth vs. message size for MV2-GDR 2.2b, MV2-GDR 2.0b, and MV2 without GDR]
- Small-message latency down to 2.18 us
- Annotated improvements of roughly 10x and 11x over MV2 without GDR and about 2x over MV2-GDR 2.0b

MVAPICH2-GDR 2.2b; Intel Ivy Bridge (E5-2680 v2) node with 20 cores; NVIDIA Tesla K40c GPU; Mellanox Connect-IB dual-FDR HCA; CUDA 7; Mellanox OFED 2.4 with GPU-Direct-RDMA
Application-Level Evaluation (HOOMD-blue)

- Platform: Wilkes (Intel Ivy Bridge + NVIDIA Tesla K20c + Mellanox Connect-IB)
- HOOMD-blue version 1.0.5
- GDRCOPY enabled: MV2_USE_CUDA=1 MV2_IBA_HCA=mlx5_0 MV2_IBA_EAGER_THRESHOLD=32768 MV2_VBUF_TOTAL_SIZE=32768 MV2_USE_GPUDIRECT_LOOPBACK_LIMIT=32768 MV2_USE_GPUDIRECT_GDRCOPY=1 MV2_USE_GPUDIRECT_GDRCOPY_LIMIT=16384

[Charts: average time steps per second (TPS) vs. number of processes (4-32) for 64K particles and 256K particles, MV2 vs. MV2+GDR; about 2x improvement in both cases]
CUDA-Aware Non-Blocking Collectives

- Available since MVAPICH2-GDR 2.2a

[Charts: medium/large message overlap (%) vs. message size (4K-1M bytes) on 64 GPU nodes for Igather and Ialltoall, with 1 process/node and with 2 processes/node (1 process/GPU)]

Platform: Wilkes (Intel Ivy Bridge, NVIDIA Tesla K20c + Mellanox Connect-IB)

A. Venkatesh, K. Hamidouche, H. Subramoni, and D. K. Panda, Offloaded GPU Collectives using CORE-Direct and CUDA Capabilities on IB Clusters, HiPC 2015
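A hedged sketch of the pattern evaluated above: a non-blocking collective issued on GPU device buffers, with room for independent GPU or CPU work before the wait. It assumes a GDR-enabled, CUDA-aware MPI build that accepts device pointers in collectives; counts and types are illustrative.

  #include <mpi.h>
  #include <cuda_runtime.h>

  int main(int argc, char **argv)
  {
      int rank, size;
      MPI_Request req;
      float *d_send, *d_recv = NULL;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      cudaMalloc((void **)&d_send, 1024 * sizeof(float));
      if (rank == 0)
          cudaMalloc((void **)&d_recv, 1024 * sizeof(float) * size);

      /* Non-blocking gather on device buffers; independent CUDA kernels or CPU work
         could run between the Igather and the Wait to realize the overlap shown above */
      MPI_Igather(d_send, 1024, MPI_FLOAT, d_recv, 1024, MPI_FLOAT, 0, MPI_COMM_WORLD, &req);
      MPI_Wait(&req, MPI_STATUS_IGNORE);

      cudaFree(d_send);
      if (rank == 0) cudaFree(d_recv);
      MPI_Finalize();
      return 0;
  }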
Communication Runtime with GPU Managed Memory

- In CUDA 6.0, NVIDIA introduced CUDA Managed (or Unified) Memory, allowing a common memory allocation for GPU or CPU through the cudaMallocManaged() call
- Significant productivity benefits due to abstraction of explicit allocation and cudaMemcpy()
- Extended MVAPICH2 to perform communication directly from managed buffers (available in MVAPICH2-GDR 2.2b)
- OSU micro-benchmarks extended to evaluate the performance of point-to-point and collective communication using managed buffers; available in OMB 5.2

[Chart: 2D stencil halo exchange time (ms) for halo width = 1 vs. total dimension size (bytes), device buffers vs. managed buffers]

D. S. Banerjee, K. Hamidouche, and D. K. Panda, Designing High Performance Communication Runtime for GPU Managed Memory: Early Experiences, GPGPU-9 Workshop, to be held in conjunction with PPoPP '16
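A hedged sketch of the managed-memory path described above: one allocation is touched by a GPU kernel and then handed directly to MPI, assuming an MVAPICH2-GDR build that recognizes managed buffers. The kernel, sizes, and launch configuration are illustrative.

  #include <mpi.h>
  #include <cuda_runtime.h>

  __global__ void fill(double *p, int n)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) p[i] = i;
  }

  int main(int argc, char **argv)
  {
      int rank;
      const int n = 4096;
      double *buf;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      cudaMallocManaged(&buf, n * sizeof(double));   /* single allocation visible to CPU and GPU */

      if (rank == 0) {
          fill<<<(n + 255) / 256, 256>>>(buf, n);    /* produce data on the GPU */
          cudaDeviceSynchronize();
          MPI_Send(buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);   /* managed pointer passed directly */
      } else if (rank == 1) {
          MPI_Recv(buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
      }

      cudaFree(buf);
      MPI_Finalize();
      return 0;
  }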
MPI Datatype Processing (Communication Optimization)

Common scenario (Buf1, Buf2, ... contain non-contiguous MPI datatypes):

  MPI_Isend(A, ..., Datatype, ...);
  MPI_Isend(B, ..., Datatype, ...);
  MPI_Isend(C, ..., Datatype, ...);
  MPI_Isend(D, ..., Datatype, ...);
  ...
  MPI_Waitall(...);

- Existing design: for each Isend, the CPU initiates a packing kernel on a CUDA stream, waits for the kernel (WFK), and only then starts the send; this serializes the Isends and wastes computing resources on both the CPU and the GPU
- Proposed design: kernel initiation, wait-for-kernel, and send start are decoupled and progressed asynchronously, so multiple Isends overlap and the sequence finishes earlier than in the existing design

[Timeline diagram: CPU and GPU activity (Initiate Kernel, Kernel on Stream, WFK, Start Send, Wait, Progress) for the existing vs. proposed designs, with the expected benefit shown as an earlier finish]
Application-Level Evaluation (Halo Exchange - Cosmo)

- 2x improvement on 32 GPU nodes
- 30% improvement on 96 GPU nodes (8 GPUs/node)

[Charts: normalized execution time vs. number of GPUs on the Wilkes GPU cluster (4-32 GPUs) and the CSCS GPU cluster (16-96 GPUs) for the default, callback-based, and event-based designs]

C. Chu, K. Hamidouche, A. Venkatesh, D. Banerjee, H. Subramoni, and D. K. Panda, Exploiting Maximal Overlap for Non-Contiguous Data Movement Processing on Modern GPU-enabled Systems, IPDPS '16
Overview of a Few Challenges Being Addressed by the MVAPICH2 Project for Exascale

- Scalability for million to billion processors
- Collective communication
- Integrated support for GPGPUs
- Integrated support for MICs
- Unified runtime for hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, ...)
- Virtualization
- Energy-awareness
- InfiniBand Network Analysis and Monitoring (INAM)
MPI Applications on MIC Clusters

- Flexibility in launching MPI jobs on clusters with Xeon Phi:
  - Host-only: MPI program on the Xeon host only (multi-core centric)
  - Offload (/reverse offload): MPI program on the host with computation offloaded to the Xeon Phi
  - Symmetric: MPI programs on both the host and the Xeon Phi
  - Coprocessor-only: MPI program on the Xeon Phi only (many-core centric)
MVAPICH2-MIC 2.0 Design for Clusters with IB and MIC

- Offload mode
- Intra-node communication
- Coprocessor-only and symmetric modes
- Inter-node communication
- Coprocessor-only and symmetric modes
- Multi-MIC node configurations
- Running on three major systems: Stampede, Blueridge (Virginia Tech), and Beacon (UTK)
MIC-Remote-MIC P2P Communication with Proxy-based Communication

[Charts: intra-socket P2P and inter-socket P2P latency (large messages, 8K-2M bytes) and bandwidth vs. message size (1 byte to 1M); peak bandwidths of about 5,236 and 5,594 MB/sec for the two cases]
Optimized MPI Collectives for MIC Clusters (Allgather & Alltoall)

[Charts: 32-node Allgather small-message latency (16H + 16M), 32-node Allgather large-message latency (8H + 8M), and 32-node Alltoall large-message latency (8H + 8M), MV2-MIC vs. MV2-MIC-Opt; P3DFFT execution time (communication and computation) on 32 nodes (8H + 8M), size = 2K x 2K x 1K; annotated improvements of 76%, 58%, and 55%]

A. Venkatesh, S. Potluri, R. Rajachandrasekar, M. Luo, K. Hamidouche and D. K. Panda, High Performance Alltoall and Allgather Designs for InfiniBand MIC Clusters, IPDPS '14, May 2014
Overview of a Few Challenges Being Addressed by the MVAPICH2 Project for Exascale

- Scalability for million to billion processors
- Collective communication
- Integrated support for GPGPUs
- Integrated support for MICs
- Unified runtime for hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, ...)
- Virtualization
- Energy-awareness
- InfiniBand Network Analysis and Monitoring (INAM)
MVAPICH2-X for Advanced MPI and Hybrid MPI+PGAS Applications

- MPI, OpenSHMEM, UPC, CAF, or hybrid (MPI + PGAS) applications: OpenSHMEM calls, MPI calls, UPC calls, and CAF calls all run over the unified MVAPICH2-X runtime on InfiniBand, RoCE, or iWARP
- Unified communication runtime for MPI, UPC, OpenSHMEM, and CAF, available with MVAPICH2-X 1.9 (2012) onwards!
- UPC++ support will be available in the upcoming MVAPICH2-X 2.2 RC1
- Feature highlights:
  - Supports MPI(+OpenMP), OpenSHMEM, UPC, CAF, MPI(+OpenMP)+OpenSHMEM, MPI(+OpenMP)+UPC+CAF
  - MPI-3 compliant, OpenSHMEM v1.0 standard compliant, UPC v1.2 standard compliant (with initial support for UPC 1.3), CAF 2008 standard (OpenUH)
  - Scalable inter-node and intra-node communication - point-to-point and collectives
Application-Level Performance with Graph500 and Sort

Graph500 execution time:
- Performance of the hybrid (MPI + OpenSHMEM) Graph500 design
- 8,192 processes: 2.4x improvement over MPI-CSR, 7.6x improvement over MPI-Simple
- 16,384 processes: 1.5x improvement over MPI-CSR, 13x improvement over MPI-Simple
[Chart: execution time (s) vs. number of processes (4K, 8K, 16K) for MPI-Simple, MPI-CSC, MPI-CSR, and Hybrid (MPI+OpenSHMEM)]

Sort execution time:
- Performance of the hybrid (MPI + OpenSHMEM) Sort application
- 4,096 processes, 4 TB input size: MPI 2408 sec (0.16 TB/min); Hybrid 1172 sec (0.36 TB/min); 51% improvement over the MPI design
[Chart: execution time (seconds) vs. input data / number of processes (500 GB-512 to 4 TB-4K) for MPI and Hybrid]

J. Jose, S. Potluri, K. Tomko and D. K. Panda, Designing Scalable Graph500 Benchmark with Hybrid MPI+OpenSHMEM Programming Models, International Supercomputing Conference (ISC '13), June 2013
J. Jose, K. Kandalla, M. Luo and D. K. Panda, Supporting Hybrid MPI and OpenSHMEM over InfiniBand: Design and Performance Evaluation, Int'l Conference on Parallel Processing (ICPP '12), September 2012
J. Jose, K. Kandalla, S. Potluri, J. Zhang and D. K. Panda, Optimizing Collective Communication in OpenSHMEM, Int'l Conference on Partitioned Global Address Space Programming Models (PGAS '13), October 2013
MiniMD - Total Execution Time

- The hybrid design performs better than the MPI implementation
- 1,024 processes: 17% improvement over the MPI version
- Strong scaling; input size: 128 x 128 x 128

[Charts: execution time (ms) vs. number of cores (512 and 1,024 for performance; 256, 512, and 1,024 for strong scaling) for Hybrid-Barrier, MPI-Original, and Hybrid-Advanced]

M. Li, J. Lin, X. Lu, K. Hamidouche, K. Tomko and D. K. Panda, Scalable MiniMD Design with Hybrid MPI and OpenSHMEM, OpenSHMEM User Group Meeting (OUG '14), held in conjunction with the 8th International Conference on Partitioned Global Address Space Programming Models (PGAS '14)
Hybrid MPI+UPC NAS-FT

- Modified the NAS FT UPC all-to-all pattern to use MPI_Alltoall
- Truly hybrid program
- For FT (Class C, 128 processes):
  - 34% improvement over UPC-GASNet
  - 30% improvement over UPC-OSU
- Hybrid MPI+UPC support available since MVAPICH2-X 1.9 (2012)

[Chart: time (s) for NAS problem size / system size B-64, C-64, B-128, C-128 for UPC-GASNet, UPC-OSU, and Hybrid-OSU]

J. Jose, M. Luo, S. Sur and D. K. Panda, Unifying UPC and MPI Runtimes: Experience with MVAPICH, Fourth Conference on Partitioned Global Address Space Programming Model (PGAS '10), October 2010
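A strongly hedged sketch of the hybrid idea behind the modified NAS-FT: keep the UPC shared-array data layout, but perform the dense exchange with MPI_Alltoall. This assumes a unified runtime (as in MVAPICH2-X) where UPC threads and MPI ranks coincide and MPI can be initialized alongside the UPC runtime; array names, sizes, and the divisibility of NLOCAL by THREADS are illustrative assumptions.

  #include <upc.h>
  #include <mpi.h>

  #define NLOCAL 1024

  shared [NLOCAL] double src[NLOCAL * THREADS];   /* UPC shared arrays used by the kernel */
  shared [NLOCAL] double dst[NLOCAL * THREADS];

  int main(int argc, char **argv)
  {
      MPI_Init(&argc, &argv);      /* assumption: interoperable with the UPC runtime */

      /* Private pointers into this thread's own blocks of the shared arrays */
      double *my_src = (double *)&src[MYTHREAD * NLOCAL];
      double *my_dst = (double *)&dst[MYTHREAD * NLOCAL];

      upc_barrier;

      /* The all-to-all transpose done with one MPI collective instead of
         fine-grained per-thread UPC copies (assumes NLOCAL % THREADS == 0) */
      MPI_Alltoall(my_src, NLOCAL / THREADS, MPI_DOUBLE,
                   my_dst, NLOCAL / THREADS, MPI_DOUBLE, MPI_COMM_WORLD);

      upc_barrier;
      MPI_Finalize();
      return 0;
  }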
Overview of a Few Challenges Being Addressed by the MVAPICH2 Project for Exascale

- Scalability for million to billion processors
- Collective communication
- Integrated support for GPGPUs
- Integrated support for MICs
- Unified runtime for hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, ...)
- Virtualization
- Energy-awareness
- InfiniBand Network Analysis and Monitoring (INAM)
Can HPC and Virtualization be Combined?

- Virtualization has many benefits: fault-tolerance, job migration, compaction
- It has not been very popular in HPC due to the overhead associated with virtualization
- New SR-IOV (Single Root - I/O Virtualization) support available with Mellanox InfiniBand adapters changes the field
- Enhanced MVAPICH2 support for SR-IOV
- MVAPICH2-Virt 2.1 (with and without OpenStack) is publicly available

J. Zhang, X. Lu, J. Jose, R. Shi and D. K. Panda, Can Inter-VM Shmem Benefit MPI Applications on SR-IOV based Virtualized InfiniBand Clusters? EuroPar '14
J. Zhang, X. Lu, J. Jose, M. Li, R. Shi and D. K. Panda, High Performance MPI Library over SR-IOV enabled InfiniBand Clusters, HiPC '14
J. Zhang, X. Lu, M. Arnold and D. K. Panda, MVAPICH2 over OpenStack with SR-IOV: an Efficient Approach to Build HPC Clouds, CCGrid '15
Overview of MVAPICH2-Virt with SR-IOV and IVSHMEM

- Redesign MVAPICH2 to make it virtual-machine aware
  - SR-IOV shows near-to-native performance for inter-node point-to-point communication
  - IVSHMEM offers zero-copy access to data on the shared memory of co-resident VMs
  - Locality detector: maintains the locality information of co-resident virtual machines
  - Communication coordinator: selects the communication channel (SR-IOV, IVSHMEM) adaptively

[Diagram: host environment with hypervisor, PF driver, and InfiniBand adapter (physical function); two guest VMs, each with an MPI process and a VF driver; the IV-Shmem channel through /dev/shm and the SR-IOV channel through virtual functions]

J. Zhang, X. Lu, J. Jose, R. Shi, D. K. Panda, Can Inter-VM Shmem Benefit MPI Applications on SR-IOV based Virtualized InfiniBand Clusters? Euro-Par, 2014
J. Zhang, X. Lu, J. Jose, R. Shi, M. Li, D. K. Panda, High Performance MPI Library over SR-IOV Enabled InfiniBand Clusters, HiPC, 2014
MVAPICH2-Virt with SR-IOV and IVSHMEM over OpenStack

- OpenStack is one of the most popular open-source solutions to build clouds and manage virtual machines
- Deployment with OpenStack
  - Supporting SR-IOV configuration
  - Supporting IVSHMEM configuration
  - Virtual-machine-aware design of MVAPICH2 with SR-IOV
- An efficient approach to build HPC clouds with MVAPICH2-Virt and OpenStack

[Diagram: OpenStack services around a VM - Nova provisions, Glance provides images, Swift stores images, Cinder provides volumes and backs up volumes, Neutron provides the network, Keystone provides authentication, Ceilometer monitors, Horizon provides the UI, Heat orchestrates the cloud]

J. Zhang, X. Lu, M. Arnold, D. K. Panda, MVAPICH2 over OpenStack with SR-IOV: An Efficient Approach to Build HPC Clouds, CCGrid, 2015
Application-Level Performance on Chameleon

- 32 VMs, 6 cores/VM
- Compared to native, 2-5% overhead for Graph500 with 128 processes
- Compared to native, 1-9.5% overhead for SPEC MPI2007 with 128 processes

[Charts: SPEC MPI2007 execution time (s) for milc, leslie3d, pop2, GAPgeofem, zeusmp2, and lu; Graph500 execution time (ms) for problem sizes (scale, edge factor) of (22,20), (24,10), (24,16), (24,20), (26,10), and (26,16); MV2-SR-IOV-Def vs. MV2-SR-IOV-Opt vs. MV2-Native]
NSF Chameleon Cloud: A Powerful and Flexible Experimental Instrument

- Large-scale instrument
  - Targeting Big Data, Big Compute, Big Instrument research
  - ~650 nodes (~14,500 cores), 5 PB disk over two sites, the two sites connected with a 100G network
- Reconfigurable instrument
  - Bare-metal reconfiguration, operated as a single instrument, graduated approach for ease-of-use
- Connected instrument
  - Workload and trace archive
  - Partnerships with production clouds: CERN, OSDC, Rackspace, Google, and others
  - Partnerships with users
- Complementary instrument
  - Complementing GENI, Grid'5000, and other testbeds
- Sustainable instrument
  - Industry connections

http://www.chameleoncloud.org/
Overview of a Few Challenges Being Addressed by the MVAPICH2 Project for Exascale

- Scalability for million to billion processors
- Collective communication
- Integrated support for GPGPUs
- Integrated support for MICs
- Unified runtime for hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, ...)
- Virtualization
- Energy-awareness
- InfiniBand Network Analysis and Monitoring (INAM)
Energy-Aware MVAPICH2 & OSU Energy Management Tool (OEMT)

- MVAPICH2-EA 2.1 (Energy-Aware)
  - A white-box approach
  - New energy-efficient communication protocols for point-to-point and collective operations
  - Intelligently applies the appropriate energy-saving techniques
  - Application-oblivious energy saving
- OEMT
  - A library utility to measure energy consumption for MPI applications
  - Works with all MPI runtimes
  - PRELOAD option for precompiled applications
  - Does not require ROOT permission: a safe kernel module to read only a subset of MSRs
MVAPICH2-EA: Application-Oblivious Energy-Aware MPI (EAM)

- An energy-efficient runtime that provides energy savings without application knowledge
- Automatically and transparently uses the best energy lever
- Provides guarantees on maximum degradation, with 5-41% savings at <=5% degradation
- A pessimistic MPI applies the energy-reduction lever to each MPI call

A Case for Application-Oblivious Energy-Efficient MPI Runtime. A. Venkatesh, A. Vishnu, K. Hamidouche, N. Tallent, D. K. Panda, D. Kerbyson, and A. Hoisie, Supercomputing '15, Nov 2015 [Best Student Paper Finalist]
Overview of a Few Challenges Being Addressed by the MVAPICH2 Project for Exascale

- Scalability for million to billion processors
- Collective communication
- Integrated support for GPGPUs
- Integrated support for MICs
- Unified runtime for hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, ...)
- Virtualization
- Energy-awareness
- InfiniBand Network Analysis and Monitoring (INAM)
Overview of OSU INAM

- OSU INAM monitors IB clusters in real time by querying various subnet management entities in the network
- Major features of the OSU INAM tool include:
  - Analyze and profile network-level activities with many parameters (data and errors) at user-specified granularity
  - Capability to analyze and profile node-level, job-level, and process-level activities for MPI communication (point-to-point, collectives, and RMA)
  - Remotely monitor CPU utilization of MPI processes at user-specified granularity
  - Visualize the data transfer happening in a "live" fashion - Live View for:
    - the entire network (Live Network Level View)
    - a particular job (Live Job Level View)
    - one or multiple nodes (Live Node Level View)
  - Capability to visualize data transfer that happened in the network over a time duration in the past:
    - the entire network (Historical Network Level View)
    - a particular job (Historical Job Level View)
    - one or multiple nodes (Historical Node Level View)
OSU INAM - Network Level View

- Show the network topology of large clusters
- Visualize traffic patterns on different links
- Quickly identify congested links / links in error state
- See the history unfold - play back historical state of the network

[Screenshots: full network (152 nodes) and a zoomed-in view of the network]
OSU INAM - Job and Node Level Views

- Job level view
  - Show different network metrics (load, error, etc.) for any live job
  - Play back historical data for completed jobs to identify bottlenecks
- Node level view provides details per processor per node
  - CPU utilization for each rank/node
  - Bytes sent/received for MPI operations (point-to-point, collective, RMA)
  - Network metrics (e.g., XmitDiscard, RcvError) per rank/node

[Screenshots: visualizing a job (5 nodes) and finding routes between nodes]
MVAPICH2 - Plans for Exascale

- Performance and memory scalability toward 1M cores
- Hybrid programming (MPI + OpenSHMEM, MPI + UPC, MPI + CAF, ...)
  - Support for task-based parallelism (UPC++)
- Enhanced optimization for GPU support and accelerators
- Taking advantage of advanced features
  - User-Mode Memory Registration (UMR)
  - On-Demand Paging (ODP)
- Enhanced inter-node and intra-node communication schemes for upcoming Omni-Path and Knights Landing architectures
- Extended RMA support (as in MPI 3.0)
- Extended topology-aware collectives
- Energy-aware point-to-point (one-sided and two-sided) and collectives
- Extended support for the MPI Tools interface (as in MPI 3.0)
- Extended checkpoint-restart and migration support with SCR
Looking into the Future...

- Exascale systems will be constrained by
  - Power
  - Memory per core
  - Data-movement cost
  - Faults
- Programming models and runtimes for HPC need to be designed for
  - Scalability
  - Performance
  - Fault-resilience
  - Energy-awareness
  - Programmability
  - Productivity
- This talk highlighted some of the issues and challenges
- Continuous innovation is needed on all these fronts
Funding Acknowledgments

Funding support by: [sponsor logos]
Equipment support by: [vendor logos]
Personnel Acknowledgments

Current Students: A. Augustine (M.S.), A. Awan (Ph.D.), S. Chakraborthy (Ph.D.), C.-H. Chu (Ph.D.), N. Islam (Ph.D.), M. Li (Ph.D.), K. Kulkarni (M.S.), M. Rahman (Ph.D.), D. Shankar (Ph.D.), A. Venkatesh (Ph.D.), J. Zhang (Ph.D.)

Past Students: P. Balaji (Ph.D.), S. Bhagvat (M.S.), A. Bhat (M.S.), D. Buntinas (Ph.D.), L. Chai (Ph.D.), B. Chandrasekharan (M.S.), N. Dandapanthula (M.S.), V. Dhanraj (M.S.), T. Gangadharappa (M.S.), K. Gopalakrishnan (M.S.), W. Huang (Ph.D.), W. Jiang (M.S.), J. Jose (Ph.D.), S. Kini (M.S.), M. Koop (Ph.D.), R. Kumar (M.S.), S. Krishnamoorthy (M.S.), K. Kandalla (Ph.D.), P. Lai (M.S.), J. Liu (Ph.D.), M. Luo (Ph.D.), A. Mamidala (Ph.D.), G. Marsh (M.S.), V. Meshram (M.S.), A. Moody (M.S.), S. Naravula (Ph.D.), R. Noronha (Ph.D.), X. Ouyang (Ph.D.), S. Pai (M.S.), S. Potluri (Ph.D.), R. Rajachandrasekar (Ph.D.), G. Santhanaraman (Ph.D.), A. Singh (Ph.D.), J. Sridhar (M.S.), S. Sur (Ph.D.), H. Subramoni (Ph.D.), K. Vaidyanathan (Ph.D.), A. Vishnu (Ph.D.), J. Wu (Ph.D.), W. Yu (Ph.D.)

Current Research Scientists: H. Subramoni, X. Lu
Current Senior Research Associate: K. Hamidouche
Current Post-Docs: J. Lin, D. Banerjee
Current Programmer: J. Perkins
Current Research Specialist: M. Arnold

Past Research Scientist: S. Sur
Past Post-Docs: H. Wang, X. Besseron, H.-W. Jin, M. Luo, E. Mancini, S. Marcarelli, J. Vienne
Past Programmers: D. Bureddy
International Workshop on Communication Architectures at Extreme Scale (ExaComm)

- ExaComm 2015 was held with the Int'l Supercomputing Conference (ISC '15), in Frankfurt, Germany, on Thursday, July 16th, 2015
  - One keynote talk: John M. Shalf, CTO, LBL/NERSC
  - Four invited talks: Dror Goldenberg (Mellanox), Martin Schulz (LLNL), Cyriel Minkenberg (IBM-Zurich), Arthur (Barney) Maccabe (ORNL)
  - Panel: Ron Brightwell (Sandia)
  - Two research papers
- ExaComm 2016 will be held in conjunction with ISC '16
  - http://web.cse.ohio-state.edu/~subramon/ExaComm16/exacomm16.html
  - Technical paper submission deadline: Friday, April 15, 2016
Thank You!
panda@cse.ohio-state.edu

Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
The MVAPICH2 Project: http://mvapich.cse.ohio-state.edu/
The High-Performance Big Data Project: http://hibd.cse.ohio-state.edu/