
Page 1: The MVAPICH2 Project: Latest Status and Future Plans

The MVAPICH2 Project: Latest Status and Future Plans

Hari Subramoni, The Ohio State University

E-mail: [email protected]
http://www.cse.ohio-state.edu/~subramon

Presentation at the MPICH BoF (SC '19)

Page 2: History of MVAPICH


• A long time ago, in a galaxy far, far away…. (actually 20 years ago), there existed…

• MPICH – High-performance and widely portable implementation of the MPI standard

– From ANL

• MVICH – Implementation of MPICH ADI-2 for VIA

– VIA – Virtual Interface Architecture (precursor to InfiniBand)

– From LBL

• VAPI – Verbs-level API

– Initial InfiniBand API from IB Vendors (older version of OFED/IB verbs)

MPICH + MVICH + VAPI = MVAPICH


Page 3: Overview of the MVAPICH2 Project


Overview of the MVAPICH2 Project

• High-performance open-source MPI library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)

– Based on MPICH 3.2.1

– MVAPICH (MPI-1) and MVAPICH2 (MPI-2.2 and MPI-3.1); started in 2001, first version available in 2002 (SC '02)

– MVAPICH2-X (MPI + PGAS), Available since 2011

– Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), Available since 2014

– Support for Virtualization (MVAPICH2-Virt), Available since 2015

– Support for Energy-Awareness (MVAPICH2-EA), Available since 2015

– Support for InfiniBand Network Analysis and Monitoring (OSU INAM) since 2015

– Used by more than 3,050 organizations in 89 countries

– More than 615,000 (> 0.6 million) downloads from the OSU site directly

– Empowering many TOP500 clusters (Nov ‘19 ranking)

• 3rd, 10,649,600-core (Sunway TaihuLight) at National Supercomputing Center in Wuxi, China

• 5th, 448,448 cores (Frontera) at TACC

• 8th, 391,680 cores (ABCI) in Japan

• 14th, 570,020 cores (Nurion) in South Korea, and many others

– Available with software stacks of many vendors and Linux Distros (RedHat, SuSE, and OpenHPC)

– http://mvapich.cse.ohio-state.edu

• Empowering Top500 systems for over a decade

Partner in the 5th-ranked TACC Frontera system

Page 4: MVAPICH2 Release Timeline and Downloads


Release timeline chart: cumulative downloads from the OSU site (0 to 600,000) plotted from Sep 2004 through May 2019, annotated with major releases: MV 0.9.4, MV2 0.9.0, MV2 0.9.8, MV2 1.0, MV 1.0, MV2 1.0.3, MV 1.1, MV2 1.4, MV2 1.5, MV2 1.6, MV2 1.7, MV2 1.8, MV2 1.9, MV2-GDR 2.0b, MV2-MIC 2.0, MV2-GDR 2.3.2, MV2-X 2.3rc2, MV2-Virt 2.2, MV2 2.3.2, OSU INAM 0.9.3, MV2-Azure 2.3.2, and MV2-AWS 2.3.

Page 5: Architecture of MVAPICH2 Software Family (HPC and DL)


Architecture of MVAPICH2 Software Family (HPC and DL)

High Performance Parallel Programming Models
– Message Passing Interface (MPI)
– PGAS (UPC, OpenSHMEM, CAF, UPC++)
– Hybrid MPI + X (MPI + PGAS + OpenMP/Cilk); a minimal hybrid sketch follows this list

High Performance and Scalable Communication Runtime with Diverse APIs and Mechanisms
– Point-to-point primitives, collective algorithms, remote memory access, and active messages
– Energy-awareness, fault tolerance, virtualization, and I/O and file systems
– Job startup, introspection & analysis

Support for Modern Networking Technology (InfiniBand, iWARP, RoCE, Omni-Path, Elastic Fabric Adapter)
– Transport protocols: RC, SRD, UD, DC
– Modern features: UMR, ODP, SR-IOV, multi-rail

Support for Modern Multi-/Many-core Architectures (Intel Xeon, OpenPOWER, Xeon Phi, ARM, NVIDIA GPGPU)
– Transport mechanisms: shared memory, CMA, XPMEM, IVSHMEM
– Modern features: Optane*, NVLink, CAPI*

(* Upcoming)
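The unified runtime is what allows MPI and PGAS calls to coexist in one application. The following is a minimal, illustrative hybrid MPI + OpenSHMEM sketch (not taken from the slides); it assumes a runtime that supports both models in the same processes, such as MVAPICH2-X, and a wrapper compiler along the lines of oshcc. Interoperability and initialization-order details are implementation-specific.

    /* Hybrid MPI + OpenSHMEM sketch (illustrative only).
     * Assumes a unified runtime such as MVAPICH2-X; compile with its
     * OpenSHMEM wrapper compiler (e.g., oshcc hybrid.c -o hybrid). */
    #include <mpi.h>
    #include <shmem.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);   /* MPI side of the hybrid program */
        shmem_init();             /* OpenSHMEM side (same set of processes) */

        int rank, npes = shmem_n_pes();
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* One-sided PGAS communication: put my rank into my neighbor's slot */
        int *slot = (int *) shmem_malloc(sizeof(int));
        *slot = -1;
        shmem_int_p(slot, rank, (rank + 1) % npes);
        shmem_barrier_all();

        /* Two-sided MPI communication on the same data */
        int sum = 0;
        MPI_Allreduce(slot, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
        if (rank == 0)
            printf("sum of received ranks = %d\n", sum);

        shmem_free(slot);
        shmem_finalize();
        MPI_Finalize();
        return 0;
    }

Each process is both an MPI rank and an OpenSHMEM PE: shmem_int_p writes one-sided into a neighbor's symmetric heap, and MPI_Allreduce then operates on the same buffers with two-sided semantics over the shared runtime.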

Page 6: MVAPICH2 Software Family


MVAPICH2 Software Family

High-Performance Parallel Programming Libraries
– MVAPICH2: Support for InfiniBand, Omni-Path, Ethernet/iWARP, and RoCE
– MVAPICH2-X: Advanced MPI features, OSU INAM, PGAS (OpenSHMEM, UPC, UPC++, and CAF), and MPI+PGAS programming models with a unified communication runtime
– MVAPICH2-GDR: Optimized MPI for clusters with NVIDIA GPUs
– MVAPICH2-Virt: High-performance and scalable MPI for hypervisor- and container-based HPC clouds
– MVAPICH2-EA: Energy-aware and high-performance MPI
– MVAPICH2-MIC: Optimized MPI for clusters with Intel KNC

Microbenchmarks
– OMB: Microbenchmark suite to evaluate MPI and PGAS (OpenSHMEM, UPC, and UPC++) libraries for CPUs and GPUs (a simplified sketch of such a measurement follows below)

Tools
– OSU INAM: Network monitoring, profiling, and analysis for clusters, with MPI and scheduler integration
– OEMT: Utility to measure the energy consumption of MPI applications
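OMB's osu_latency-style benchmarks report one-way latency as half of the average ping-pong round-trip time between two ranks. Below is a simplified sketch of that measurement loop (illustrative only; not the OMB source, which adds warm-up iterations, message-size sweeps, and validation):

    /* Simplified ping-pong latency sketch in the spirit of OMB's osu_latency.
     * Run with exactly two ranks. Illustrative only. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        const int iters = 1000, msg_size = 8;   /* 8-byte messages */
        char *buf = malloc(msg_size);
        int rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Barrier(MPI_COMM_WORLD);
        double start = MPI_Wtime();
        for (int i = 0; i < iters; i++) {
            if (rank == 0) {
                MPI_Send(buf, msg_size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, msg_size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, msg_size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, msg_size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double elapsed = MPI_Wtime() - start;

        if (rank == 0)   /* one-way latency = half of the average round-trip time */
            printf("latency: %.2f us\n", elapsed * 1e6 / (2.0 * iters));

        free(buf);
        MPI_Finalize();
        return 0;
    }

With MVAPICH2, the actual OMB binaries are typically launched with the library's own launcher, for example mpirun_rsh -np 2 host1 host2 ./osu_latency (see the OMB README for the exact options).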

Page 7: MVAPICH2 – Basic MPI


MVAPICH2 – Basic MPI

Performance highlights (chart panels):
– Fast startup on emerging many-cores: MPI_Init time on Frontera for 56 to 57,344 processes, MVAPICH2 2.3.2 vs. Intel MPI 2019 (annotated times: 3.9 s and 4.5 s at the largest scale).
– Enhanced intra-node performance for ARM: small-message latency (0 B to 4 KB) with MVAPICH2 2.3.2.
– Enhanced intra-node performance for OpenPOWER: small-message latency (4 B to 2 KB), MVAPICH2 2.3.2 vs. SpectrumMPI 10.3.0.01 (annotated: 0.25 us).
– Advanced Allreduce with SHARP: latency at (4, 28), (8, 28), and (16, 28) (nodes, PPN), MVAPICH2 vs. MVAPICH2-SHArP (13% improvement shown).
– MPI_Bcast using RDMA_CM-based multicast: latency vs. number of nodes (8 to 512) for 16 B to 512 KB messages.
– Advanced non-blocking Allreduce with SHARP: communication-computation overlap (%) across message sizes at 1 PPN on 8 nodes, MVAPICH2 vs. MVAPICH2-SHArP; higher is better (the pattern is sketched below).
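The overlap percentages above come from issuing the reduction as a non-blocking collective and doing independent computation before waiting on it; offload such as SHARP lets the reduction progress while the CPU computes. A minimal sketch of the pattern (illustrative; not the benchmark code behind the figure):

    /* Overlapping computation with a non-blocking MPI_Iallreduce
     * (sketch of the pattern behind communication-computation overlap). */
    #include <mpi.h>

    #define N 4096

    static void do_independent_work(double *x, int n)
    {
        for (int i = 0; i < n; i++)      /* any computation that does not   */
            x[i] = x[i] * 1.0001 + 1.0;  /* depend on the reduction result  */
    }

    int main(int argc, char **argv)
    {
        static double in[N], out[N], work[N];
        MPI_Request req;

        MPI_Init(&argc, &argv);

        MPI_Iallreduce(in, out, N, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD, &req);
        do_independent_work(work, N);        /* overlapped with the reduction */
        MPI_Wait(&req, MPI_STATUS_IGNORE);   /* reduction result needed here  */

        MPI_Finalize();
        return 0;
    }

How much of the reduction actually progresses during the computation phase depends on the library's progress engine and on collective offload; that is what the overlap percentage in the figure quantifies.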

Page 8: MVAPICH2-X – Advanced MPI + PGAS + Tools


MVAPICH2-X – Advanced MPI + PGAS + Tools

Performance highlights (chart panels):
– CMA-aware MPI_Bcast: latency vs. message size (1 KB to 8 MB) on Power8 with 160 processes, MVAPICH2-2.3a vs. OpenMPI 2.1.0 vs. MVAPICH2-X 2.3rc1 (regions annotated "Use CMA" and "Use SHMEM").
– Shared address space (XPMEM)-based collectives: OSU_Allreduce and OSU_Reduce latency (16 KB to 4 MB) on Broadwell with 256 processes, MVAPICH2-2.3b vs. IMPI-2017v1.132 vs. MVAPICH2-X 2.3rc1 (annotated improvements: 1.8X for Allreduce, 4X for Reduce).
– Performance of P3DFFT with optimized async progress: time per loop for 56 to 448 processes, MVAPICH2-X Async vs. MVAPICH2-X Default vs. Intel MPI 18.1.163 (44% and 33% improvements shown).
– Neuron with the YuEtAl2012 model: execution time for 512 to 4,096 processes, MVAPICH2 vs. MVAPICH2-X (10%, 76%, and 39% improvements shown), highlighting the overhead of the RC protocol for connection establishment and communication.

Page 9: MVAPICH2-GDR – Optimized MPI for clusters with NVIDIA GPUs


MVAPICH2-GDR – Optimized MPI for clusters with NVIDIA GPUs

Performance highlights (chart panels):
– Best performance for GPU-based transfers: GPU-to-GPU latency vs. message size (0 B to 8 KB), MV2 (no GDR) vs. MV2-GDR 2.3 (annotated: 10x improvement, 1.85 us).
– Enhanced kernel-based datatype processing for the COSMO weather prediction model on x86: normalized execution time on 16 to 96 GPUs for Default, Callback-based, and Event-based designs.
– TensorFlow training with MVAPICH2-GDR on Summit: images per second (thousands) on 1 to 1,536 GPUs, NCCL-2.4 vs. MVAPICH2-GDR-2.3.2; time per epoch = 3.6 seconds, so 90 epochs finish in roughly 5.5 minutes.
– GPU-based MPI_Allreduce on Summit: bandwidth (GB/s) for a 128 MB message on 24 to 1,536 GPUs, SpectrumMPI 10.2.0.11 vs. OpenMPI 4.0.1 vs. NCCL 2.4 vs. MVAPICH2-GDR-2.3.2.

Enhanced kernel-based datatype processing for the COMB application kernel on POWER9, 16 GPUs (test: Comm mpi Mesh cuda Device Buffers mpi_type); annotated speedups of 18x and 27x for MVAPICH2-GDR over Spectrum MPI (a CUDA-aware datatype sketch follows below):

                                   pre-comm  post-recv  post-send  wait-recv  wait-send  post-comm  start-up  test-comm  bench-comm
    Spectrum MPI 10.3               0.0001    0.0000     1.6021     1.7204     0.0112     0.0001     0.0004    7.7383     83.6229
    MVAPICH2-GDR 2.3.2              0.0001    0.0000     0.0862     0.0871     0.0018     0.0001     0.0009    0.3558     4.4396
    MVAPICH2-GDR 2.3.3 (upcoming)   0.0001    0.0000     0.0030     0.0032     0.0001     0.0001     0.0009    0.0133     0.1602
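MVAPICH2-GDR is CUDA-aware: GPU device pointers can be passed directly to MPI calls, including with non-contiguous derived datatypes, which is what the datatype-processing results above stress. Below is a minimal sketch under those assumptions; the vector layout is purely illustrative, and the documented MV2_USE_CUDA=1 runtime setting is assumed to be in effect.

    /* CUDA-aware MPI sketch: send a strided GPU buffer between two ranks.
     * Assumes MVAPICH2-GDR (run with MV2_USE_CUDA=1); compile with mpicc and
     * link against the CUDA runtime (e.g., -lcudart). Illustrative only. */
    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
        int rank;
        double *d_buf;                       /* device memory, not host memory */
        MPI_Datatype col;                    /* every 4th element, 1024 times  */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        cudaMalloc((void **)&d_buf, 4 * 1024 * sizeof(double));
        cudaMemset(d_buf, 0, 4 * 1024 * sizeof(double));

        MPI_Type_vector(1024, 1, 4, MPI_DOUBLE, &col);
        MPI_Type_commit(&col);

        /* The device pointer is passed straight to MPI; the library moves the
         * non-contiguous GPU data (packing may itself run as a GPU kernel). */
        if (rank == 0)
            MPI_Send(d_buf, 1, col, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(d_buf, 1, col, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        MPI_Type_free(&col);
        cudaFree(d_buf);
        MPI_Finalize();
        return 0;
    }

Run with two ranks, e.g. one per GPU; consult the MVAPICH2-GDR user guide for the exact build and runtime options on a given system.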

Page 10: MVAPICH2-X Advanced Support for HPC Clouds


MVAPICH2-X Advanced Support for HPC Clouds

Performance highlights (chart panels):
– Performance on Amazon EFA: execution time of miniGhost (MVAPICH2-X vs. OpenMPI) and CloverLeaf (MVAPICH2-X UD vs. MVAPICH2-X SRD vs. OpenMPI) at 72 (2x36), 144 (4x36), and 288 (8x36) processes; MVAPICH2-X is annotated as 10% and 27.5% better. Setup: c5n.18xlarge instances, Intel Xeon Platinum 8124M @ 3.00 GHz, MVAPICH2-X 2.3rc2 + SRD support, Open MPI v3.1.3 with libfabric 1.7.
– Performance of Radix on Microsoft Azure: total execution time (lower is better) on HC instances, MVAPICH2-X vs. HPCx (up to 3x faster), and on HB instances at 60 (1x60), 120 (2x60), and 240 (4x60) processes (up to 38% faster).

• MVAPICH2-Azure 2.3.2 (released on 08/16/2019); major features and enhancements:
– Based on MVAPICH2 2.3.2
– Enhanced tuning for point-to-point and collective operations
– Targeted for Azure HB and HC virtual machine instances
– Flexibility for 'one-click' deployment
– Tested with Azure HB and HC VM instances
• Available for download from http://mvapich.cse.ohio-state.edu/downloads/
• Detailed user guide: http://mvapich.cse.ohio-state.edu/userguide/mv2-azure/

• MVAPICH2-X-AWS 2.3 (released on 08/12/2019); major features and enhancements:
– Based on MVAPICH2-X 2.3
– Support for the Amazon EFA adapter's Scalable Reliable Datagram (SRD) transport
– Support for XPMEM-based intra-node communication for point-to-point and collective operations
– Enhanced tuning for point-to-point and collective operations
– Targeted for AWS instances with the Amazon Linux 2 AMI and EFA support
– Tested with c5n.18xlarge instances

Page 11: MVAPICH2 – Future Roadmap and Plans for Exascale


MVAPICH2 – Future Roadmap and Plans for Exascale

• Update to the MPICH 3.3.2 CH3 channel (2020)
• Initial support for the CH4 channel (2020/2021)
• Making the CH4 channel the default (2021/2022)
• Performance and memory scalability toward 1M-10M cores
• Hybrid programming (MPI + OpenSHMEM, MPI + UPC, MPI + CAF, …)
– MPI + Task*
• Enhanced optimization for GPUs and FPGAs*
• Taking advantage of advanced features of Mellanox InfiniBand
– Tag Matching*
– Adapter Memory*
• Enhanced communication schemes for upcoming architectures
– NVLINK*
– CAPI*
• Extended topology-aware collectives
• Extended energy-aware designs and virtualization support
• Extended support for the MPI Tools Interface (as in MPI 3.0); a minimal MPI_T example follows this list
• Extended fault-tolerance (FT) support
• Support for features marked with * will be available in future MVAPICH2 releases
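For context on the MPI Tools Interface item above: MPI_T (introduced in MPI 3.0) lets tools and applications enumerate the control and performance variables an MPI library exposes. The sketch below uses only standard MPI 3.0 calls; how many variables are reported, and what they are named, is implementation-specific.

    /* Count the MPI_T control and performance variables the MPI library
     * exposes (MPI 3.0 tools interface; results vary by library). */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided, num_cvars = 0, num_pvars = 0;

        MPI_T_init_thread(MPI_THREAD_SINGLE, &provided);
        MPI_Init(&argc, &argv);

        MPI_T_cvar_get_num(&num_cvars);   /* control variables (tuning knobs) */
        MPI_T_pvar_get_num(&num_pvars);   /* performance variables (counters) */
        printf("MPI_T: %d control variables, %d performance variables\n",
               num_cvars, num_pvars);

        MPI_Finalize();
        MPI_T_finalize();
        return 0;
    }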

Page 12: Thank You!


Thank You!

Network-Based Computing Laboratory
http://nowlab.cse.ohio-state.edu/

[email protected]

http://web.cse.ohio-state.edu/~subramon

The High-Performance MPI/PGAS Project
http://mvapich.cse.ohio-state.edu/

The High-Performance Deep Learning Project
http://hidl.cse.ohio-state.edu/