The MVAPICH2 Project: Latest Status and Future Plans
Presentation at the MPICH BoF (SC '19)
by Hari Subramoni, The Ohio State University
E-mail: subramon@cse.ohio-state.edu
http://www.cse.ohio-state.edu/~subramon
History of MVAPICH
• A long time ago, in a galaxy far, far away... (actually 20 years ago), there existed:
• MPICH
– High-performance and widely portable implementation of the MPI standard
– From ANL
• MVICH
– Implementation of MPICH ADI-2 for VIA
– VIA: Virtual Interface Architecture (precursor to InfiniBand)
– From LBL
• VAPI
– Verbs-level API
– Initial InfiniBand API from IB vendors (older version of OFED/IB verbs)
MPICH + MVICH + VAPI = MVAPICH
Overview of the MVAPICH2 Project
• High-performance open-source MPI library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)
– Based on MPICH 3.2.1
– MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.1); started in 2001, first version available in 2002 (SC '02)
– MVAPICH2-X (MPI + PGAS), available since 2011
– Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), available since 2014
– Support for virtualization (MVAPICH2-Virt), available since 2015
– Support for energy awareness (MVAPICH2-EA), available since 2015
– Support for InfiniBand network analysis and monitoring (OSU INAM), available since 2015
– Used by more than 3,050 organizations in 89 countries
– More than 615,000 (> 0.6 million) downloads from the OSU site directly
– Empowering many TOP500 clusters (Nov '19 ranking):
• 3rd: 10,649,600 cores (Sunway TaihuLight) at the National Supercomputing Center in Wuxi, China
• 5th: 448,448 cores (Frontera) at TACC
• 8th: 391,680 cores (ABCI) in Japan
• 14th: 570,020 cores (Nurion) in South Korea, and many others
– Available with the software stacks of many vendors and Linux distros (RedHat, SuSE, and OpenHPC)
– http://mvapich.cse.ohio-state.edu
• Empowering Top500 systems for over a decade
• Partner in the 5th-ranked TACC Frontera system
MVAPICH2 Release Timeline and Downloads
[Figure: cumulative number of downloads from the OSU site, Sep 2004 to May 2019 (0 to 600,000), annotated with release milestones: MV 0.9.4, MV2 0.9.0, MV2 0.9.8, MV2 1.0, MV 1.0, MV2 1.0.3, MV 1.1, MV2 1.4, MV2 1.5, MV2 1.6, MV2 1.7, MV2 1.8, MV2 1.9, MV2-GDR 2.0b, MV2-MIC 2.0, MV2-GDR 2.3.2, MV2-X 2.3rc2, MV2-Virt 2.2, MV2 2.3.2, OSU INAM 0.9.3, MV2-Azure 2.3.2, MV2-AWS 2.3]
Architecture of MVAPICH2 Software Family (HPC and DL)
• High-performance parallel programming models:
– Message Passing Interface (MPI)
– PGAS (UPC, OpenSHMEM, CAF, UPC++)
– Hybrid: MPI + X (MPI + PGAS + OpenMP/Cilk)
• High-performance and scalable communication runtime with diverse APIs and mechanisms:
– Point-to-point primitives, collectives algorithms, energy awareness, remote memory access, I/O and file systems, fault tolerance, virtualization, active messages, job startup, introspection & analysis
• Support for modern networking technology (InfiniBand, iWARP, RoCE, Omni-Path, Elastic Fabric Adapter) and modern multi-/many-core architectures (Intel Xeon, OpenPOWER, Xeon Phi, ARM, NVIDIA GPGPU):
– Transport protocols: RC, SRD, UD, DC
– Modern network features: UMR, ODP, SR-IOV, multi-rail
– Transport mechanisms: shared memory, CMA, XPMEM, IVSHMEM
– Modern node-level features: Optane*, NVLink, CAPI*
(* upcoming)
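To illustrate the hybrid MPI + PGAS model above: the sketch below mixes OpenSHMEM one-sided puts with an MPI collective in a single program, assuming a unified runtime such as MVAPICH2-X's. It is an illustration, not MVAPICH2-X sample code; the initialization ordering shown is an assumption and is implementation-specific.

#include <mpi.h>
#include <shmem.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    /* Hybrid start-up: a unified runtime (as in MVAPICH2-X) lets one job
     * drive both interfaces; this init order is an assumption. */
    MPI_Init(&argc, &argv);
    shmem_init();

    int pe = shmem_my_pe();
    int npes = shmem_n_pes();

    /* Symmetric heap allocation: same remote address on every PE */
    long *inbox = (long *)shmem_malloc(sizeof(long));
    *inbox = 0;
    shmem_barrier_all();

    /* PGAS phase: one-sided put of my PE id into the right neighbor */
    long me = pe;
    shmem_long_put(inbox, &me, 1, (pe + 1) % npes);
    shmem_barrier_all();

    /* MPI phase: global reduction over the same set of processes */
    long sum = 0;
    MPI_Allreduce(inbox, &sum, 1, MPI_LONG, MPI_SUM, MPI_COMM_WORLD);
    if (pe == 0)
        printf("sum of received PE ids = %ld\n", sum);

    shmem_free(inbox);
    shmem_finalize();
    MPI_Finalize();
    return 0;
}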
MVAPICH2 Software Family
High-performance parallel programming libraries:
• MVAPICH2 – Support for InfiniBand, Omni-Path, Ethernet/iWARP, and RoCE
• MVAPICH2-X – Advanced MPI features, OSU INAM, PGAS (OpenSHMEM, UPC, UPC++, and CAF), and MPI+PGAS programming models with a unified communication runtime
• MVAPICH2-GDR – Optimized MPI for clusters with NVIDIA GPUs
• MVAPICH2-Virt – High-performance and scalable MPI for hypervisor- and container-based HPC clouds
• MVAPICH2-EA – Energy-aware and high-performance MPI
• MVAPICH2-MIC – Optimized MPI for clusters with Intel KNC
Microbenchmarks:
• OMB – Microbenchmark suite to evaluate MPI and PGAS (OpenSHMEM, UPC, and UPC++) libraries for CPUs and GPUs
Tools:
• OSU INAM – Network monitoring, profiling, and analysis for clusters, with MPI and scheduler integration
• OEMT – Utility to measure the energy consumption of MPI applications
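OMB's point-to-point latency tests follow the classic ping-pong pattern. Below is a minimal sketch of that measurement loop; it is a simplified illustration, not the actual OMB source, which adds warm-up iterations, multiple message sizes, and GPU buffer options.

/* Minimal ping-pong latency sketch in the spirit of OMB's osu_latency.
 * Run with exactly 2 ranks. */
#include <mpi.h>
#include <stdio.h>

#define ITERS 1000
#define SIZE  8          /* message size in bytes */

int main(int argc, char **argv)
{
    int rank;
    char buf[SIZE];
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < ITERS; i++) {
        if (rank == 0) {
            MPI_Send(buf, SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)   /* one-way latency = half the round-trip time */
        printf("latency: %.2f us\n", (t1 - t0) * 1e6 / (2.0 * ITERS));

    MPI_Finalize();
    return 0;
}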
MVAPICH2 – Basic MPI: Fast Startup on Emerging Many-Cores; Enhanced Intra-node Performance for ARM and OpenPOWER
[Figure: MPI_Init on Frontera – time taken (ms) vs. number of processes (56 to 57,344): Intel MPI 2019 reaches 4.5s, MVAPICH2 2.3.2 stays at 3.9s]
[Figure: enhanced intra-node performance for ARM – small-message latency (us) for MVAPICH2-2.3.2, message sizes 0B to 4KB]
[Figure: enhanced intra-node performance for OpenPOWER – latency (us), message sizes 4B to 2KB: MVAPICH2 2.3.2 achieves 0.25us vs. SpectrumMPI-10.3.0.01]
[Figure: advanced Allreduce with SHARP – latency (seconds) vs. (number of nodes, PPN) at (4,28), (8,28), (16,28): MVAPICH2-SHArP up to 13% better than MVAPICH2]
[Figure: MPI_Bcast using RDMA_CM-based multicast – latency (us) vs. number of nodes (8 to 512) for message sizes 16B to 512KB]
[Figure: advanced non-blocking Allreduce with SHARP – communication-computation overlap (%) vs. message size (4B to 128B), 1 PPN, 8 nodes: MVAPICH2-SHArP vs. MVAPICH2; higher is better]
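For reference, startup numbers like the MPI_Init times above are typically gathered with a small timing harness. The sketch below is a hypothetical example of such a harness, not the actual benchmark used on Frontera.

/* Sketch: measure MPI_Init cost as seen by the slowest rank. */
#include <mpi.h>
#include <stdio.h>
#include <sys/time.h>

static double now(void)   /* wall clock usable before MPI_Init */
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

int main(int argc, char **argv)
{
    double t0 = now();
    MPI_Init(&argc, &argv);
    double my_cost = now() - t0;

    int rank;
    double max_cost;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Reduce(&my_cost, &max_cost, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("MPI_Init (slowest rank): %.1f ms\n", max_cost * 1e3);

    MPI_Finalize();
    return 0;
}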
MVAPICH2-X – Advanced MPI + PGAS + Tools
[Figure: CMA-aware MPI_Bcast – latency (us) vs. message size (1KB to 8MB) on Power8 (160 processes): MVAPICH2-X 2.3rc1 (using CMA) vs. MVAPICH2-2.3a and OpenMPI 2.1.0 (using shared memory)]
[Figure: shared address space (XPMEM)-based collectives design – OSU_Allreduce and OSU_Reduce latency (us), message sizes 16KB to 4MB on Broadwell (256 processes): MVAPICH2-X-2.3rc1 up to 1.8X better (Allreduce) and 4X better (Reduce) than MVAPICH2-2.3b and IMPI-2017v1.132]
[Figure: performance of P3DFFT with optimized async progress – time per loop (seconds) vs. number of processes (56 to 448): MVAPICH2-X Async is up to 44% better than Intel MPI 18.1.163 and 33% better than MVAPICH2-X Default]
[Figure: Neuron with YuEtAl2012 – execution time (s) vs. number of processes (512 to 4,096): MVAPICH2-X improves on MVAPICH2 by 10% to 76% (39% at scale) by avoiding the overhead of the RC protocol for connection establishment and communication]
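The async-progress gains for P3DFFT come from overlapping independent computation with non-blocking collectives. A minimal sketch of that pattern follows; it is hypothetical, and compute_independent_work() is a placeholder for the application's FFT/transpose work.

/* Sketch: overlap independent computation with a non-blocking allreduce.
 * Async progress (as in MVAPICH2-X) lets the collective advance while
 * compute_independent_work() runs. Note that b, not the pending send
 * buffer a, is modified during the overlap. */
#include <mpi.h>

#define N 1048576

static double a[N], r[N], b[N];

static void compute_independent_work(void)
{
    for (int i = 0; i < N; i++)   /* stand-in for FFT/transpose work */
        b[i] = b[i] * 0.5 + 1.0;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    MPI_Request req;
    MPI_Iallreduce(a, r, N, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD, &req);
    compute_independent_work();   /* overlapped with the collective */
    MPI_Wait(&req, MPI_STATUS_IGNORE);

    MPI_Finalize();
    return 0;
}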
MVAPICH2-GDR – Optimized MPI for clusters with NVIDIA GPUs
[Figure: best performance for GPU-based transfers – latency (us) vs. message size (0B to 8KB): MV2-GDR-2.3 achieves 1.85us, about 10x better than MV2 without GDR]
[Figure: enhanced kernel-based datatype processing for the COSMO weather prediction model on x86 – normalized execution time vs. number of GPUs (16 to 96) for default, callback-based, and event-based designs]
[Figure: GPU-based MPI_Allreduce on Summit – bandwidth (GB/s) vs. number of GPUs (24 to 1,536) for a 128MB message: MVAPICH2-GDR-2.3.2 vs. SpectrumMPI 10.2.0.11, OpenMPI 4.0.1, and NCCL 2.4]
[Figure: TensorFlow training with MVAPICH2-GDR on Summit – images per second (thousands) vs. number of GPUs (1 to 1,536): MVAPICH2-GDR-2.3.2 vs. NCCL-2.4; time/epoch = 3.6 seconds, so total time for 90 epochs = 3.6 x 90 = 324 seconds, i.e. about 5.4 minutes!]
Enhanced kernel-based datatype processing for the COMB application kernel on POWER9
16 GPUs on a POWER9 system (test: comm mpi, mesh cuda, device buffers, mpi_type); times in seconds:
Library                        | pre-comm | post-recv | post-send | wait-recv | wait-send | post-comm | start-up | test-comm | bench-comm
Spectrum MPI 10.3              | 0.0001   | 0.0000    | 1.6021    | 1.7204    | 0.0112    | 0.0001    | 0.0004   | 7.7383    | 83.6229
MVAPICH2-GDR 2.3.2             | 0.0001   | 0.0000    | 0.0862    | 0.0871    | 0.0018    | 0.0001    | 0.0009   | 0.3558    | 4.4396
MVAPICH2-GDR 2.3.3 (upcoming)  | 0.0001   | 0.0000    | 0.0030    | 0.0032    | 0.0001    | 0.0001    | 0.0009   | 0.0133    | 0.1602
On bench-comm, MVAPICH2-GDR 2.3.2 is about 18x faster than Spectrum MPI 10.3, and the upcoming 2.3.3 is a further 27x faster.
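CUDA-aware MPI libraries such as MVAPICH2-GDR accept GPU device pointers directly in MPI calls, which is what enables GPU-based transfers like those above. Below is a minimal sketch of a hypothetical two-rank device-buffer exchange; MVAPICH2-GDR's documentation enables its CUDA support at run time via MV2_USE_CUDA=1.

/* Sketch: pass cudaMalloc'd device buffers straight to MPI.
 * With a CUDA-aware library (e.g., MVAPICH2-GDR), no explicit
 * device-to-host staging is needed; GPUDirect RDMA can move the
 * data directly. Run with 2 ranks, one GPU each. */
#include <mpi.h>
#include <cuda_runtime.h>

#define N (1 << 20)

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    float *dbuf;
    cudaMalloc((void **)&dbuf, N * sizeof(float));
    cudaMemset(dbuf, 0, N * sizeof(float));

    if (rank == 0)
        MPI_Send(dbuf, N, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);   /* device pointer */
    else if (rank == 1)
        MPI_Recv(dbuf, N, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    cudaFree(dbuf);
    MPI_Finalize();
    return 0;
}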
MVAPICH2-X Advanced Support for HPC Clouds
Performance on Amazon EFA:
[Figure: miniGhost – execution time (seconds) vs. processes (72, 144, 288 as nodes x 36 PPN): MV2X is 10% better than OpenMPI]
[Figure: CloverLeaf – execution time (seconds) vs. processes (72, 144, 288): MV2X-SRD is 27.5% better than OpenMPI; MV2X-UD also shown]
(Instance type: c5n.18xlarge; CPU: Intel Xeon Platinum 8124M @ 3.00GHz; MVAPICH2 version: MVAPICH2-X 2.3rc2 + SRD support; OpenMPI version: Open MPI v3.1.3 with libfabric 1.7)
Performance of Radix on Microsoft Azure:
[Figure: total execution time (seconds) on HC instances vs. number of processes (nodes x PPN): MVAPICH2-X is 3x faster than HPCx; lower is better]
[Figure: total execution time (seconds) on HB instances vs. number of processes (60(1x60), 120(2x60), 240(4x60)): MVAPICH2-X is 38% faster than HPCx; lower is better]
• MVAPICH2-Azure 2.3.2
– Released on 08/16/2019
– Major features and enhancements:
• Based on MVAPICH2-2.3.2
• Enhanced tuning for point-to-point and collective operations
• Targeted for Azure HB and HC virtual machine instances
• Flexibility for 'one-click' deployment
• Tested with Azure HB and HC VM instances
– Available for download from http://mvapich.cse.ohio-state.edu/downloads/
– Detailed user guide: http://mvapich.cse.ohio-state.edu/userguide/mv2-azure/
• MVAPICH2-X-AWS 2.3
– Released on 08/12/2019
– Major features and enhancements:
• Based on MVAPICH2-X 2.3
• Support for the Amazon EFA adapter's Scalable Reliable Datagram (SRD) transport
• Support for XPMEM-based intra-node communication for point-to-point and collectives
• Enhanced tuning for point-to-point and collective operations
• Targeted for AWS instances with the Amazon Linux 2 AMI and EFA support
• Tested with c5n.18xlarge instances
MVAPICH2 – Future Roadmap and Plans for Exascale
• Update to the MPICH 3.3.2 CH3 channel (2020)
• Initial support for the CH4 channel (2020/2021)
• Making the CH4 channel the default (2021/2022)
• Performance and memory scalability toward 1M-10M cores
• Hybrid programming (MPI + OpenSHMEM, MPI + UPC, MPI + CAF, ...)
• MPI + Task*
• Enhanced optimization for GPUs and FPGAs*
• Taking advantage of advanced features of Mellanox InfiniBand
– Tag Matching*
– Adapter Memory*
• Enhanced communication schemes for upcoming architectures
– NVLINK*
– CAPI*
• Extended topology-aware collectives
• Extended energy-aware designs and virtualization support
• Extended support for the MPI Tools Interface (as in MPI 3.0), as sketched below
• Extended FT support
• Support for * features will be available in future MVAPICH2 releases
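The MPI Tools Interface mentioned in the roadmap is the MPI-3 introspection mechanism through which a library can expose its internal control and performance variables. The sketch below uses only standard MPI_T calls to list the control variables an MPI-3 library exposes; it is a generic illustration, not MVAPICH2-specific code.

/* Sketch: enumerate MPI_T control variables (MPI-3 Tools Interface). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, ncvar;
    MPI_T_init_thread(MPI_THREAD_SINGLE, &provided);  /* legal before MPI_Init */
    MPI_Init(&argc, &argv);

    MPI_T_cvar_get_num(&ncvar);
    for (int i = 0; i < ncvar; i++) {
        char name[256], desc[256];
        int name_len = sizeof(name), desc_len = sizeof(desc);
        int verbosity, bind, scope;
        MPI_Datatype dtype;
        MPI_T_enum enumtype;
        MPI_T_cvar_get_info(i, name, &name_len, &verbosity, &dtype,
                            &enumtype, desc, &desc_len, &bind, &scope);
        printf("cvar %d: %s\n", i, name);
    }

    MPI_Finalize();
    MPI_T_finalize();
    return 0;
}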
Thank You!
Network-Based Computing Laboratory
http://nowlab.cse.ohio-state.edu/
subramoni.1@osu.edu
http://web.cse.ohio-state.edu/~subramon
The High-Performance MPI/PGAS Project
http://mvapich.cse.ohio-state.edu/
The High-Performance Deep Learning Project
http://hidl.cse.ohio-state.edu/