
Page 1: The MVAPICH2 Project: Latest Status and Future Plans

The MVAPICH2 Project: Latest Status and Future Plans

Hari Subramoni, The Ohio State University

E-mail: [email protected]
http://www.cse.ohio-state.edu/~subramon

Presentation at the MPICH BoF (SC '19)

Page 2: History of MVAPICH


• A long time ago, in a galaxy far, far away…. (actually 20 years ago), there existed…

• MPICH – High-performance and widely portable implementation of the MPI standard

– From ANL

• MVICH – Implementation of MPICH ADI-2 for VIA

– VIA – Virtual Interface Architecture (precursor to InfiniBand)

– From LBL

• VAPI – Verbs-level API

– Initial InfiniBand API from IB Vendors (older version of OFED/IB verbs)

MPICH + MVICH + VAPI = MVAPICH


Page 3: Overview of the MVAPICH2 Project


Overview of the MVAPICH2 Project

• High-performance open-source MPI library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)

– Based on MPICH 3.2.1

– MVAPICH (MPI-1) and MVAPICH2 (MPI-2.2 and MPI-3.1); started in 2001, first version available in 2002 (SC '02)

– MVAPICH2-X (MPI + PGAS), Available since 2011

– Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), Available since 2014

– Support for Virtualization (MVAPICH2-Virt), Available since 2015

– Support for Energy-Awareness (MVAPICH2-EA), Available since 2015

– Support for InfiniBand Network Analysis and Monitoring (OSU INAM) since 2015

– Used by more than 3,050 organizations in 89 countries

– More than 615,000 (> 0.6 million) downloads from the OSU site directly

– Empowering many TOP500 clusters (Nov ‘19 ranking)

• 3rd, 10,649,600-core (Sunway TaihuLight) at National Supercomputing Center in Wuxi, China

• 5th, 448,448 cores (Frontera) at TACC

• 8th, 391,680 cores (ABCI) in Japan

• 14th, 570,020 cores (Nurion) in South Korea, and many others

– Available with software stacks of many vendors and Linux Distros (RedHat, SuSE, and OpenHPC)

– http://mvapich.cse.ohio-state.edu

• Empowering Top500 systems for over a decade

Partner in the 5th-ranked TACC Frontera system

Page 4: MVAPICH2 Release Timeline and Downloads


Release timeline chart: cumulative downloads from the OSU site (0 to 600,000) plotted from Sep 2004 through May 2019, annotated with major releases: MV 0.9.4, MV2 0.9.0, MV2 0.9.8, MV2 1.0, MV 1.0, MV2 1.0.3, MV 1.1, MV2 1.4, MV2 1.5, MV2 1.6, MV2 1.7, MV2 1.8, MV2 1.9, MV2-GDR 2.0b, MV2-MIC 2.0, MV2-GDR 2.3.2, MV2-X 2.3rc2, MV2-Virt 2.2, MV2 2.3.2, OSU INAM 0.9.3, MV2-Azure 2.3.2, and MV2-AWS 2.3.

Page 5: Architecture of MVAPICH2 Software Family (HPC and DL)


Architecture of MVAPICH2 Software Family (HPC and DL)

High Performance Parallel Programming Models
– Message Passing Interface (MPI)
– PGAS (UPC, OpenSHMEM, CAF, UPC++)
– Hybrid MPI + X (MPI + PGAS + OpenMP/Cilk); a minimal hybrid sketch follows this list

High Performance and Scalable Communication Runtime with Diverse APIs and Mechanisms
– Point-to-point primitives, collective algorithms, remote memory access, and active messages
– Energy-awareness, fault tolerance, virtualization, and I/O and file systems
– Job startup, introspection & analysis

Support for Modern Networking Technology (InfiniBand, iWARP, RoCE, Omni-Path, Elastic Fabric Adapter)
– Transport protocols: RC, SRD, UD, DC
– Modern features: UMR, ODP, SR-IOV, multi-rail

Support for Modern Multi-/Many-core Architectures (Intel Xeon, OpenPOWER, Xeon Phi, ARM, NVIDIA GPGPU)
– Transport mechanisms: shared memory, CMA, XPMEM, IVSHMEM
– Modern features: Optane*, NVLink, CAPI*

(* Upcoming)
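The unified runtime is what allows MPI and PGAS calls to coexist in one application. The following is a minimal, illustrative hybrid MPI + OpenSHMEM sketch (not taken from the slides); it assumes a runtime that supports both models in the same processes, such as MVAPICH2-X, and a wrapper compiler along the lines of oshcc. Interoperability and initialization-order details are implementation-specific.

    /* Hybrid MPI + OpenSHMEM sketch (illustrative only).
     * Assumes a unified runtime such as MVAPICH2-X; compile with its
     * OpenSHMEM wrapper compiler (e.g., oshcc hybrid.c -o hybrid). */
    #include <mpi.h>
    #include <shmem.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);   /* MPI side of the hybrid program */
        shmem_init();             /* OpenSHMEM side (same set of processes) */

        int rank, npes = shmem_n_pes();
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* One-sided PGAS communication: put my rank into my neighbor's slot */
        int *slot = (int *) shmem_malloc(sizeof(int));
        *slot = -1;
        shmem_int_p(slot, rank, (rank + 1) % npes);
        shmem_barrier_all();

        /* Two-sided MPI communication on the same data */
        int sum = 0;
        MPI_Allreduce(slot, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
        if (rank == 0)
            printf("sum of received ranks = %d\n", sum);

        shmem_free(slot);
        shmem_finalize();
        MPI_Finalize();
        return 0;
    }

Each process is both an MPI rank and an OpenSHMEM PE: shmem_int_p writes one-sided into a neighbor's symmetric heap, and MPI_Allreduce then operates on the same buffers with two-sided semantics over the shared runtime.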

Page 6: MVAPICH2 Software Family


MVAPICH2 Software Family

High-Performance Parallel Programming Libraries
– MVAPICH2: Support for InfiniBand, Omni-Path, Ethernet/iWARP, and RoCE
– MVAPICH2-X: Advanced MPI features, OSU INAM, PGAS (OpenSHMEM, UPC, UPC++, and CAF), and MPI+PGAS programming models with a unified communication runtime
– MVAPICH2-GDR: Optimized MPI for clusters with NVIDIA GPUs
– MVAPICH2-Virt: High-performance and scalable MPI for hypervisor- and container-based HPC clouds
– MVAPICH2-EA: Energy-aware and high-performance MPI
– MVAPICH2-MIC: Optimized MPI for clusters with Intel KNC

Microbenchmarks
– OMB: Microbenchmark suite to evaluate MPI and PGAS (OpenSHMEM, UPC, and UPC++) libraries for CPUs and GPUs (a simplified sketch of such a measurement follows below)

Tools
– OSU INAM: Network monitoring, profiling, and analysis for clusters, with MPI and scheduler integration
– OEMT: Utility to measure the energy consumption of MPI applications
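OMB's osu_latency-style benchmarks report one-way latency as half of the average ping-pong round-trip time between two ranks. Below is a simplified sketch of that measurement loop (illustrative only; not the OMB source, which adds warm-up iterations, message-size sweeps, and validation):

    /* Simplified ping-pong latency sketch in the spirit of OMB's osu_latency.
     * Run with exactly two ranks. Illustrative only. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        const int iters = 1000, msg_size = 8;   /* 8-byte messages */
        char *buf = malloc(msg_size);
        int rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Barrier(MPI_COMM_WORLD);
        double start = MPI_Wtime();
        for (int i = 0; i < iters; i++) {
            if (rank == 0) {
                MPI_Send(buf, msg_size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, msg_size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, msg_size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, msg_size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double elapsed = MPI_Wtime() - start;

        if (rank == 0)   /* one-way latency = half of the average round-trip time */
            printf("latency: %.2f us\n", elapsed * 1e6 / (2.0 * iters));

        free(buf);
        MPI_Finalize();
        return 0;
    }

With MVAPICH2, the actual OMB binaries are typically launched with the library's own launcher, for example mpirun_rsh -np 2 host1 host2 ./osu_latency (see the OMB README for the exact options).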

Page 7: MVAPICH2 – Basic MPI


MVAPICH2 – Basic MPI

Performance highlights (chart panels):
– Fast startup on emerging many-cores: MPI_Init time on Frontera for 56 to 57,344 processes, MVAPICH2 2.3.2 vs. Intel MPI 2019 (annotated times: 3.9 s and 4.5 s at the largest scale).
– Enhanced intra-node performance for ARM: small-message latency (0 B to 4 KB) with MVAPICH2 2.3.2.
– Enhanced intra-node performance for OpenPOWER: small-message latency (4 B to 2 KB), MVAPICH2 2.3.2 vs. SpectrumMPI 10.3.0.01 (annotated: 0.25 us).
– Advanced Allreduce with SHARP: latency at (4, 28), (8, 28), and (16, 28) (nodes, PPN), MVAPICH2 vs. MVAPICH2-SHArP (13% improvement shown).
– MPI_Bcast using RDMA_CM-based multicast: latency vs. number of nodes (8 to 512) for 16 B to 512 KB messages.
– Advanced non-blocking Allreduce with SHARP: communication-computation overlap (%) across message sizes at 1 PPN on 8 nodes, MVAPICH2 vs. MVAPICH2-SHArP; higher is better (the pattern is sketched below).
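The overlap percentages above come from issuing the reduction as a non-blocking collective and doing independent computation before waiting on it; offload such as SHARP lets the reduction progress while the CPU computes. A minimal sketch of the pattern (illustrative; not the benchmark code behind the figure):

    /* Overlapping computation with a non-blocking MPI_Iallreduce
     * (sketch of the pattern behind communication-computation overlap). */
    #include <mpi.h>

    #define N 4096

    static void do_independent_work(double *x, int n)
    {
        for (int i = 0; i < n; i++)      /* any computation that does not   */
            x[i] = x[i] * 1.0001 + 1.0;  /* depend on the reduction result  */
    }

    int main(int argc, char **argv)
    {
        static double in[N], out[N], work[N];
        MPI_Request req;

        MPI_Init(&argc, &argv);

        MPI_Iallreduce(in, out, N, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD, &req);
        do_independent_work(work, N);        /* overlapped with the reduction */
        MPI_Wait(&req, MPI_STATUS_IGNORE);   /* reduction result needed here  */

        MPI_Finalize();
        return 0;
    }

How much of the reduction actually progresses during the computation phase depends on the library's progress engine and on collective offload; that is what the overlap percentage in the figure quantifies.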

Page 8: MVAPICH2-X – Advanced MPI + PGAS + Tools


MVAPICH2-X – Advanced MPI + PGAS + Tools

Performance highlights (chart panels):
– CMA-aware MPI_Bcast: latency vs. message size (1 KB to 8 MB) on Power8 with 160 processes, MVAPICH2-2.3a vs. OpenMPI 2.1.0 vs. MVAPICH2-X 2.3rc1 (regions annotated "Use CMA" and "Use SHMEM").
– Shared address space (XPMEM)-based collectives: OSU_Allreduce and OSU_Reduce latency (16 KB to 4 MB) on Broadwell with 256 processes, MVAPICH2-2.3b vs. IMPI-2017v1.132 vs. MVAPICH2-X 2.3rc1 (annotated improvements: 1.8X for Allreduce, 4X for Reduce).
– Performance of P3DFFT with optimized async progress: time per loop for 56 to 448 processes, MVAPICH2-X Async vs. MVAPICH2-X Default vs. Intel MPI 18.1.163 (44% and 33% improvements shown).
– Neuron with the YuEtAl2012 model: execution time for 512 to 4,096 processes, MVAPICH2 vs. MVAPICH2-X (10%, 76%, and 39% improvements shown), highlighting the overhead of the RC protocol for connection establishment and communication.

Page 9: MVAPICH2-GDR – Optimized MPI for clusters with NVIDIA GPUs


MVAPICH2-GDR – Optimized MPI for clusters with NVIDIA GPUs

Performance highlights (chart panels):
– Best performance for GPU-based transfers: GPU-to-GPU latency vs. message size (0 B to 8 KB), MV2 (no GDR) vs. MV2-GDR 2.3 (annotated: 10x improvement, 1.85 us).
– Enhanced kernel-based datatype processing for the COSMO weather prediction model on x86: normalized execution time on 16 to 96 GPUs for Default, Callback-based, and Event-based designs.
– TensorFlow training with MVAPICH2-GDR on Summit: images per second (thousands) on 1 to 1,536 GPUs, NCCL-2.4 vs. MVAPICH2-GDR-2.3.2; time per epoch = 3.6 seconds, so 90 epochs finish in roughly 5.5 minutes.
– GPU-based MPI_Allreduce on Summit: bandwidth (GB/s) for a 128 MB message on 24 to 1,536 GPUs, SpectrumMPI 10.2.0.11 vs. OpenMPI 4.0.1 vs. NCCL 2.4 vs. MVAPICH2-GDR-2.3.2.

Enhanced kernel-based datatype processing for the COMB application kernel on POWER9, 16 GPUs (test: Comm mpi Mesh cuda Device Buffers mpi_type); annotated speedups of 18x and 27x for MVAPICH2-GDR over Spectrum MPI (a CUDA-aware datatype sketch follows below):

                                   pre-comm  post-recv  post-send  wait-recv  wait-send  post-comm  start-up  test-comm  bench-comm
    Spectrum MPI 10.3               0.0001    0.0000     1.6021     1.7204     0.0112     0.0001     0.0004    7.7383     83.6229
    MVAPICH2-GDR 2.3.2              0.0001    0.0000     0.0862     0.0871     0.0018     0.0001     0.0009    0.3558     4.4396
    MVAPICH2-GDR 2.3.3 (upcoming)   0.0001    0.0000     0.0030     0.0032     0.0001     0.0001     0.0009    0.0133     0.1602
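MVAPICH2-GDR is CUDA-aware: GPU device pointers can be passed directly to MPI calls, including with non-contiguous derived datatypes, which is what the datatype-processing results above stress. Below is a minimal sketch under those assumptions; the vector layout is purely illustrative, and the documented MV2_USE_CUDA=1 runtime setting is assumed to be in effect.

    /* CUDA-aware MPI sketch: send a strided GPU buffer between two ranks.
     * Assumes MVAPICH2-GDR (run with MV2_USE_CUDA=1); compile with mpicc and
     * link against the CUDA runtime (e.g., -lcudart). Illustrative only. */
    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
        int rank;
        double *d_buf;                       /* device memory, not host memory */
        MPI_Datatype col;                    /* every 4th element, 1024 times  */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        cudaMalloc((void **)&d_buf, 4 * 1024 * sizeof(double));
        cudaMemset(d_buf, 0, 4 * 1024 * sizeof(double));

        MPI_Type_vector(1024, 1, 4, MPI_DOUBLE, &col);
        MPI_Type_commit(&col);

        /* The device pointer is passed straight to MPI; the library moves the
         * non-contiguous GPU data (packing may itself run as a GPU kernel). */
        if (rank == 0)
            MPI_Send(d_buf, 1, col, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(d_buf, 1, col, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        MPI_Type_free(&col);
        cudaFree(d_buf);
        MPI_Finalize();
        return 0;
    }

Run with two ranks, e.g. one per GPU; consult the MVAPICH2-GDR user guide for the exact build and runtime options on a given system.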

Page 10: MVAPICH2-X Advanced Support for HPC Clouds


MVAPICH2-X Advanced Support for HPC Clouds

Performance highlights (chart panels):
– Performance on Amazon EFA: execution time of miniGhost (MVAPICH2-X vs. OpenMPI) and CloverLeaf (MVAPICH2-X UD vs. MVAPICH2-X SRD vs. OpenMPI) at 72 (2x36), 144 (4x36), and 288 (8x36) processes; MVAPICH2-X is annotated as 10% and 27.5% better. Setup: c5n.18xlarge instances, Intel Xeon Platinum 8124M @ 3.00 GHz, MVAPICH2-X 2.3rc2 + SRD support, Open MPI v3.1.3 with libfabric 1.7.
– Performance of Radix on Microsoft Azure: total execution time (lower is better) on HC instances, MVAPICH2-X vs. HPCx (up to 3x faster), and on HB instances at 60 (1x60), 120 (2x60), and 240 (4x60) processes (up to 38% faster).

• MVAPICH2-Azure 2.3.2 (released on 08/16/2019); major features and enhancements:
– Based on MVAPICH2 2.3.2
– Enhanced tuning for point-to-point and collective operations
– Targeted for Azure HB and HC virtual machine instances
– Flexibility for 'one-click' deployment
– Tested with Azure HB and HC VM instances
• Available for download from http://mvapich.cse.ohio-state.edu/downloads/
• Detailed user guide: http://mvapich.cse.ohio-state.edu/userguide/mv2-azure/

• MVAPICH2-X-AWS 2.3 (released on 08/12/2019); major features and enhancements:
– Based on MVAPICH2-X 2.3
– Support for the Amazon EFA adapter's Scalable Reliable Datagram (SRD) transport
– Support for XPMEM-based intra-node communication for point-to-point and collective operations
– Enhanced tuning for point-to-point and collective operations
– Targeted for AWS instances with the Amazon Linux 2 AMI and EFA support
– Tested with c5n.18xlarge instances

Page 11: MVAPICH2 – Future Roadmap and Plans for Exascale


MVAPICH2 – Future Roadmap and Plans for Exascale

• Update to the MPICH 3.3.2 CH3 channel (2020)
• Initial support for the CH4 channel (2020/2021)
• Making the CH4 channel the default (2021/2022)
• Performance and memory scalability toward 1M-10M cores
• Hybrid programming (MPI + OpenSHMEM, MPI + UPC, MPI + CAF, …)
– MPI + Task*
• Enhanced optimization for GPUs and FPGAs*
• Taking advantage of advanced features of Mellanox InfiniBand
– Tag Matching*
– Adapter Memory*
• Enhanced communication schemes for upcoming architectures
– NVLINK*
– CAPI*
• Extended topology-aware collectives
• Extended energy-aware designs and virtualization support
• Extended support for the MPI Tools Interface (as in MPI 3.0); a minimal MPI_T example follows this list
• Extended fault-tolerance (FT) support
• Support for features marked with * will be available in future MVAPICH2 releases
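For context on the MPI Tools Interface item above: MPI_T (introduced in MPI 3.0) lets tools and applications enumerate the control and performance variables an MPI library exposes. The sketch below uses only standard MPI 3.0 calls; how many variables are reported, and what they are named, is implementation-specific.

    /* Count the MPI_T control and performance variables the MPI library
     * exposes (MPI 3.0 tools interface; results vary by library). */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided, num_cvars = 0, num_pvars = 0;

        MPI_T_init_thread(MPI_THREAD_SINGLE, &provided);
        MPI_Init(&argc, &argv);

        MPI_T_cvar_get_num(&num_cvars);   /* control variables (tuning knobs) */
        MPI_T_pvar_get_num(&num_pvars);   /* performance variables (counters) */
        printf("MPI_T: %d control variables, %d performance variables\n",
               num_cvars, num_pvars);

        MPI_Finalize();
        MPI_T_finalize();
        return 0;
    }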

Page 12: Thank You!


Thank You!

Network-Based Computing Laboratory
http://nowlab.cse.ohio-state.edu/

[email protected]

http://web.cse.ohio-state.edu/~subramon

The High-Performance MPI/PGAS Project
http://mvapich.cse.ohio-state.edu/

The High-Performance Deep Learning Project
http://hidl.cse.ohio-state.edu/