MVAPICH2-GDR: High-Performance and Scalable CUDA-Aware MPI Library for HPC and AI
Dhabaleswar K. (DK) Panda and Hari Subramoni, The Ohio State University
GPU Technology Conference (GTC 2019)


Page 1

MVAPICH2-GDR: High-Performance and Scalable CUDA-Aware MPI Library for HPC and AI

Dhabaleswar K. (DK) Panda

The Ohio State University, E-mail: [email protected]

http://www.cse.ohio-state.edu/~panda

GPU Technology Conference (GTC 2019)

by

Hari Subramoni

The Ohio State University, E-mail: [email protected]

http://www.cse.ohio-state.edu/~subramon

Page 2

Outline

• Overview of the MVAPICH2 Project
• MVAPICH2-GPU with GPUDirect-RDMA (GDR)
• Current Features
  • Multi-stream Communication for IPC
  • CMA-based Intra-node Host-to-Host Communication Support
  • Maximal Overlap in MPI Datatype Processing
  • Efficient Support for Managed Memory
  • Streaming Support with InfiniBand Multicast and GDR
  • Support for Deep Learning
  • Support for OpenPOWER with NVLink
  • Support for Containers
• Upcoming Features
  • CMA-based Intra-node Collective Communication Support
  • XPMEM-based Collective Communication Support
  • Optimized Datatype Processing
  • Out-of-core Processing for Deep Learning
• Conclusions

Page 3

Overview of the MVAPICH2 Project
• High-performance open-source MPI library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)

– MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.1), Started in 2001, First version available in 2002

– MVAPICH2-X (MPI + PGAS), Available since 2011

– Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), Available since 2014

– Support for Virtualization (MVAPICH2-Virt), Available since 2015

– Support for Energy-Awareness (MVAPICH2-EA), Available since 2015

– Support for InfiniBand Network Analysis and Monitoring (OSU INAM) since 2015

– Used by more than 2,975 organizations in 88 countries

– More than 529,000 (> 0.5 million) downloads from the OSU site directly

– Empowering many TOP500 clusters (Nov ‘18 ranking)

• 3rd ranked 10,649,640-core cluster (Sunway TaihuLight) at NSC, Wuxi, China

• 14th, 556,104 cores (Oakforest-PACS) in Japan

• 17th, 367,024 cores (Stampede2) at TACC

• 27th, 241,108-core (Pleiades) at NASA and many others

– Available with software stacks of many vendors and Linux Distros (RedHat, SuSE, and OpenHPC)

– http://mvapich.cse.ohio-state.edu

• Empowering Top500 systems for over a decade
• Partner in the upcoming TACC Frontera system

Page 4

MVAPICH2 Release Timeline and Downloads

(Figure: cumulative number of downloads from the OSU site, Sep-04 through Nov-18, rising to nearly 600,000. Release markers along the timeline include MV 0.9.4, MV2 0.9.0, MV2 0.9.8, MV2 1.0, MV 1.0, MV2 1.0.3, MV 1.1, MV2 1.4, MV2 1.5, MV2 1.6, MV2 1.7, MV2 1.8, MV2 1.9, MV2-GDR 2.0b, MV2-MIC 2.0, MV2 2.3.1, MV2-X 2.3rc1, MV2 Virt 2.2, MV2-GDR 2.3, and OSU INAM 0.9.4.)

Page 5

Architecture of MVAPICH2 Software Family

High Performance Parallel Programming Models
• Message Passing Interface (MPI)
• PGAS (UPC, OpenSHMEM, CAF, UPC++)
• Hybrid --- MPI + X (MPI + PGAS + OpenMP/Cilk)

High Performance and Scalable Communication Runtime with Diverse APIs and Mechanisms
• Point-to-point Primitives, Collectives Algorithms, Energy-Awareness, Remote Memory Access, I/O and File Systems, Fault Tolerance, Virtualization, Active Messages, Job Startup, Introspection & Analysis

Support for Modern Networking Technology (InfiniBand, iWARP, RoCE, Omni-Path) and Modern Multi-/Many-core Architectures (Intel Xeon, OpenPOWER, Xeon Phi, ARM, NVIDIA GPGPU)
• Transport Protocols: RC, XRC, UD, DC
• Modern Features: UMR, ODP, SR-IOV, Multi-Rail
• Transport Mechanisms: Shared Memory, CMA, XPMEM, IVSHMEM
• Modern Features: MCDRAM*, NVLink, CAPI* (* Upcoming)

Page 6

MVAPICH2 Software Family

High-Performance Parallel Programming Libraries
• MVAPICH2 – Support for InfiniBand, Omni-Path, Ethernet/iWARP, and RoCE
• MVAPICH2-X – Advanced MPI features, OSU INAM, PGAS (OpenSHMEM, UPC, UPC++, and CAF), and MPI+PGAS programming models with a unified communication runtime
• MVAPICH2-GDR – Optimized MPI for clusters with NVIDIA GPUs and for GPU-enabled Deep Learning applications
• MVAPICH2-Virt – High-performance and scalable MPI for hypervisor- and container-based HPC clouds
• MVAPICH2-EA – Energy-aware and high-performance MPI
• MVAPICH2-MIC – Optimized MPI for clusters with Intel KNC

Microbenchmarks
• OMB – Microbenchmark suite to evaluate MPI and PGAS (OpenSHMEM, UPC, and UPC++) libraries for CPUs and GPUs

Tools
• OSU INAM – Network monitoring, profiling, and analysis for clusters, with MPI and scheduler integration
• OEMT – Utility to measure the energy consumption of MPI applications

Page 7

MVAPICH2-GDR: Optimizing MPI Data Movement on GPU Clusters

(Figure: two nodes, each with two CPUs connected over QPI, GPUs attached over PCIe, and an IB HCA; memory buffers may reside on any of these devices.)

Communication paths to optimize:
1. Intra-GPU
2. Intra-Socket GPU-GPU
3. Inter-Socket GPU-GPU
4. Inter-Node GPU-GPU
5. Intra-Socket GPU-Host
6. Inter-Socket GPU-Host
7. Inter-Node GPU-Host
8. Inter-Node GPU-GPU with IB adapter on remote socket
and more . . .

• GPUs are connected as PCIe devices, which brings flexibility but also complexity
• NVLink is leading to even more paths
• For each path there are different schemes: shared memory, IPC, GPUDirect RDMA, pipelining, and so on
• It is critical for the runtime to optimize data movement while hiding this complexity

Page 8

At Sender:
    MPI_Send(s_devbuf, size, …);

At Receiver:
    MPI_Recv(r_devbuf, size, …);

(The GPU data movement is handled inside MVAPICH2.)

• Standard MPI interfaces used for unified data movement
• Takes advantage of Unified Virtual Addressing (>= CUDA 4.0)
• Overlaps data movement from the GPU with RDMA transfers

High Performance and High Productivity

GPU-Aware (CUDA-Aware) MPI Library: MVAPICH2-GPU
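To make the calling convention above concrete, here is a minimal, self-contained sketch (illustrative only, not taken from the slides): two ranks exchange a cudaMalloc'ed buffer directly through MPI_Send/MPI_Recv, and the CUDA-aware library moves the GPU data underneath. With MVAPICH2-GDR, such a program is typically run with MV2_USE_CUDA=1.

    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
        int rank;
        int n = 1 << 20;
        float *devbuf;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        cudaSetDevice(0);                                 /* assumes one visible GPU per rank */
        cudaMalloc((void **)&devbuf, n * sizeof(float));  /* device memory, no explicit host staging */
        if (rank == 0)
            MPI_Send(devbuf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(devbuf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        cudaFree(devbuf);
        MPI_Finalize();
        return 0;
    }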

Page 9

CUDA-Aware MPI: MVAPICH2-GDR 1.8-2.3.1 Releases

• Support for MPI communication from NVIDIA GPU device memory
• High-performance RDMA-based inter-node point-to-point communication (GPU-GPU, GPU-Host, and Host-GPU)
• High-performance intra-node point-to-point communication for multi-GPU nodes (GPU-GPU, GPU-Host, and Host-GPU)
• Taking advantage of CUDA IPC (available since CUDA 4.1) in intra-node communication for multiple GPU adapters/node
• Optimized and tuned collectives for GPU device buffers
• MPI datatype support for point-to-point and collective communication from GPU device buffers
• Unified memory

Page 10

• MVAPICH2-GDR 2.3.1 requires the following software to be installed on your system:
  1. Mellanox OFED 3.2 or later
  2. NVIDIA Driver 367.48 or later
  3. NVIDIA CUDA Toolkit 7.5 or later
  4. NVIDIA Peer Memory (nv_peer_mem) module to enable GPUDirect RDMA (GDR) support
• Strongly recommended for best performance:
  5. GDRCOPY library by NVIDIA: https://github.com/NVIDIA/gdrcopy
• Comprehensive instructions are available in the MVAPICH2-GDR User Guide:
  – http://mvapich.cse.ohio-state.edu/userguide/gdr/

MVAPICH2-GDR: Pre-requisites for OpenPOWER & x86 Systems
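A quick way to sanity-check these prerequisites on a node (illustrative commands for a typical MLNX_OFED-based system; exact output differs per distribution):

    $ ofed_info -s                 # Mellanox OFED version
    $ nvidia-smi                   # NVIDIA driver version and GPU visibility
    $ nvcc --version               # CUDA toolkit version
    $ lsmod | grep nv_peer_mem     # is the GPUDirect RDMA kernel module loaded?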

Page 11

• Simple installation steps for both systems
• Pick the right MVAPICH2-GDR RPM from the Downloads page:
  – http://mvapich.cse.ohio-state.edu/downloads/
  – e.g. http://mvapich.cse.ohio-state.edu/download/mvapich/gdr/2.3/mofed4.5/mvapich2-gdr-mcast.cuda10.0.mofed4.5.gnu4.8.5-2.3-1.el7.x86_64.rpm (== <mv2-gdr-rpm-name>.rpm)

$ wget http://mvapich.cse.ohio-state.edu/download/mvapich/gdr/2.3/<mv2-gdr-rpm-name>.rpm

Root users:
$ rpm -Uvh --nodeps <mv2-gdr-rpm-name>.rpm

Non-root users:
$ rpm2cpio <mv2-gdr-rpm-name>.rpm | cpio -id

• Contact the MVAPICH help list with any questions related to the package: [email protected]

MVAPICH2-GDR: Download and Setup on OpenPOWER & x86 Systems
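After unpacking the RPM, one way to verify the installation is to run a bundled OSU micro-benchmark on GPU buffers (a sketch; the install prefix, hostnames, and benchmark path are illustrative and depend on where the RPM contents were placed):

    $ export MV2_HOME=<mv2-gdr-install-prefix>
    $ export LD_LIBRARY_PATH=$MV2_HOME/lib64:$LD_LIBRARY_PATH
    $ $MV2_HOME/bin/mpirun_rsh -np 2 node1 node2 MV2_USE_CUDA=1 \
          $MV2_HOME/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency D D    # D D selects device buffers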

Page 12

• Released on 03/16/2019

• Major Features and Enhancements

– Based on MVAPICH2 2.3.1

– Enhanced intra-node and inter-node point-to-point performance for DGX-2 and IBM POWER8 and IBM POWER9 systems

– Enhanced Allreduce performance for DGX-2 and IBM POWER8/POWER9 systems

– Enhanced small message performance for CUDA-Aware MPI_Put and MPI_Get

– Support for PGI 18.10

– Flexible support for running TensorFlow (Horovod) jobs

– Add support for Volta (V100) GPU

– Support for OpenPOWER with NVLink

– Efficient Multiple CUDA stream-based IPC communication for multi-GPU systems with and without NVLink

– Leverage Linux Cross Memory Attach (CMA) feature for enhanced host-based communication

– InfiniBand Multicast (IB-MCAST) based designs for GPU-based broadcast and streaming applications

– Efficient broadcast designs for Deep Learning applications

MVAPICH2-GDR 2.3.1

Page 13

Optimized MVAPICH2-GDR Design

(Figure: GPU-GPU inter-node latency, bandwidth, and bi-directional bandwidth for 1-byte to 4K/8K messages, MV2 (no GDR) vs. MV2-GDR 2.3.1.)
• GPU-GPU inter-node latency: 1.85 us, about 11X better than MV2 without GDR
• GPU-GPU inter-node bandwidth and bi-directional bandwidth: about 9X and 10X higher, respectively

Platform: MVAPICH2-GDR 2.3.1, Intel Haswell (E5-2687W @ 3.10 GHz) node with 20 cores, NVIDIA Volta V100 GPU, Mellanox ConnectX-4 EDR HCA, CUDA 10.0, Mellanox OFED 4.2 with GPUDirect RDMA

Page 14

• RoCE V1 and V2 support

• RDMA_CM connection support

• CUDA-Aware Collective Tuning
  – Point-to-point tuning (available since MVAPICH2-GDR 2.0)

• Tuned thresholds for the different communication patterns and features

• Depending on the system configuration (CPU, HCA and GPU models)

– Tuning framework for GPU-based collectives
  • Selects the best algorithm depending on message size, system size, and system configuration

• Support for Bcast and Gather operations for different GDR-enabled systems

• Available since MVAPICH2-GDR 2.2RC1 release

RoCE and Optimized Collectives Support

Page 15

• Platform: Wilkes (Intel Ivy Bridge + NVIDIA Tesla K20c + Mellanox Connect-IB)
• HOOMD-blue version 1.0.5
• GDRCOPY enabled, with the following runtime parameters (a launch example follows the title below): MV2_USE_CUDA=1 MV2_IBA_HCA=mlx5_0 MV2_IBA_EAGER_THRESHOLD=32768 MV2_VBUF_TOTAL_SIZE=32768 MV2_USE_GPUDIRECT_LOOPBACK_LIMIT=32768 MV2_USE_GPUDIRECT_GDRCOPY=1 MV2_USE_GPUDIRECT_GDRCOPY_LIMIT=16384

Application-Level Evaluation (HOOMD-blue)
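For reference, a run with the parameters listed above would be launched along these lines (a sketch; the process count, hostfile, and application command line are illustrative):

    $ mpirun_rsh -np 32 -hostfile hosts MV2_USE_CUDA=1 MV2_IBA_HCA=mlx5_0 \
          MV2_IBA_EAGER_THRESHOLD=32768 MV2_VBUF_TOTAL_SIZE=32768 \
          MV2_USE_GPUDIRECT_LOOPBACK_LIMIT=32768 MV2_USE_GPUDIRECT_GDRCOPY=1 \
          MV2_USE_GPUDIRECT_GDRCOPY_LIMIT=16384 ./hoomd benchmark_script.py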

(Figure: average time steps per second (TPS) for HOOMD-blue with 64K and 256K particles on 4-32 processes; MV2+GDR delivers about 2X higher TPS than MV2 in both cases.)

Page 16

Application-Level Evaluation (Cosmo) and Weather Forecasting in Switzerland

(Figure: normalized execution time on the Wilkes GPU cluster (4-32 GPUs) and the CSCS GPU cluster (16-96 GPUs) for the Default, Callback-based, and Event-based designs.)

• 2X improvement on 32 GPUs (Wilkes)
• 30% improvement on 96 GPUs (CSCS, 8 GPUs/node)

C. Chu, K. Hamidouche, A. Venkatesh, D. Banerjee , H. Subramoni, and D. K. Panda, Exploiting Maximal Overlap for Non-Contiguous Data Movement Processing on Modern GPU-enabled Systems, IPDPS’16

On-going collaboration with CSCS and MeteoSwiss (Switzerland) in co-designing MV2-GDR and Cosmo Application

Cosmo model: http://www2.cosmo-model.org/content/tasks/operational/meteoSwiss/

Page 17

Outline (repeated from Page 2)

Page 18

Multi-stream Communication using CUDA IPC on OpenPOWER and DGX-1

• Up to 16% higher Device to Device (D2D) bandwidth on OpenPOWER + NVLink inter-connect

• Up to 30% higher D2D bandwidth on DGX-1 with NVLink

(Figure: point-to-point D-D bandwidth, 1-stream vs. 4-streams CUDA IPC design, for 128K-4M messages on OpenPOWER with NVLink and 16K-4M messages on DGX-1; 16% and 30% better, respectively.)

Available since MVAPICH2-GDR-2.3a

Page 19

CMA-based Intra-node Host-to-Host Communication Support

(Figure: intra-node point-to-point host-to-host latency and bandwidth, MV2-GDR without CMA vs. with CMA.)

• Up to 30% lower host-to-host (H2H) latency and 30% higher H2H bandwidth

Platform: MVAPICH2-GDR 2.3a, Intel Broadwell (E5-2680 v4 @ 2.4 GHz) node with 28 cores, NVIDIA Tesla K80 GPU, Mellanox ConnectX-4 EDR HCA, CUDA 8.0, Mellanox OFED 4.0 with GPUDirect RDMA

Page 20

Non-contiguous Data Exchange

• Multi-dimensional data
  • Row-based organization
  • Contiguous in one dimension
  • Non-contiguous in the other dimensions
• Halo data exchange
  • Duplicate the boundary
  • Exchange the boundary in each iteration

(Figure: halo data exchange between neighboring sub-domains.)

Page 21

MPI Datatype Support in MVAPICH2

• Datatype support in MPI
  – Operate on customized datatypes to improve productivity
  – Enable the MPI library to optimize non-contiguous data

At Sender:
    MPI_Type_vector(n_blocks, n_elements, stride, old_type, &new_type);
    MPI_Type_commit(&new_type);
    …
    MPI_Send(s_buf, size, new_type, dest, tag, MPI_COMM_WORLD);

• Inside MVAPICH2
  – Uses datatype-specific CUDA kernels to pack data in chunks
  – Efficiently moves data between nodes using RDMA
  – In progress: currently optimizes vector and hindexed datatypes
  – Transparent to the user

H. Wang, S. Potluri, D. Bureddy, C. Rosales and D. K. Panda, GPU-aware MPI on RDMA-Enabled Clusters: Design, Implementation and Evaluation, IEEE Transactions on Parallel and Distributed Systems, Vol. 25, No. 10, pp. 2595-2605, Oct 2014.
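For completeness, a matching receiver-side sketch (illustrative; r_devbuf, count, src, and tag are placeholders) shows that the same derived datatype can describe a GPU-resident buffer, with MVAPICH2 packing and unpacking it internally:

    MPI_Datatype new_type;
    MPI_Type_vector(n_blocks, n_elements, stride, old_type, &new_type);
    MPI_Type_commit(&new_type);
    MPI_Recv(r_devbuf, count, new_type, src, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);  /* r_devbuf may be a cudaMalloc'ed buffer */
    MPI_Type_free(&new_type);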

Page 22

MPI Datatype Processing (Computation Optimization)

• Comprehensive support
  • Targeted kernels for regular datatypes: vector, subarray, indexed_block
  • Generic kernels for all other irregular datatypes
• Separate non-blocking stream for kernels launched by the MPI library
  • Avoids stream conflicts with application kernels
• Flexible set of parameters for users to tune the kernels
  • Vector: MV2_CUDA_KERNEL_VECTOR_TIDBLK_SIZE, MV2_CUDA_KERNEL_VECTOR_YSIZE
  • Subarray: MV2_CUDA_KERNEL_SUBARR_TIDBLK_SIZE, MV2_CUDA_KERNEL_SUBARR_XDIM, MV2_CUDA_KERNEL_SUBARR_YDIM, MV2_CUDA_KERNEL_SUBARR_ZDIM
  • Indexed_block: MV2_CUDA_KERNEL_IDXBLK_XDIM
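These tuning knobs are ordinary environment variables, so they can be set per run; for example (the value shown is arbitrary rather than a recommended setting, and ./my_datatype_app is a placeholder):

    $ mpirun_rsh -np 2 node1 node2 MV2_USE_CUDA=1 MV2_CUDA_KERNEL_VECTOR_YSIZE=32 ./my_datatype_app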

Page 23

MPI Datatype Processing (Communication Optimization)

Common scenario (A, B, C, D contain non-contiguous MPI datatypes):
    MPI_Isend(A, .., datatype, …)
    MPI_Isend(B, .., datatype, …)
    MPI_Isend(C, .., datatype, …)
    MPI_Isend(D, .., datatype, …)
    …
    MPI_Waitall(…);

In this common scenario, serialized packing of each non-contiguous message wastes computing resources on both the CPU and the GPU.

Page 24

Outline (repeated from Page 2)

Page 25

Enhanced Support for Intra-node Unified Memory

• CUDA Unified Memory (UM) => no memory pin-down
  • No IPC support for intra-node communication
  • No GDR support for inter-node communication
• Initial and basic support in MVAPICH2-GDR
  • For both intra- and inter-node transfers, "pipeline through" host memory
• Enhanced intra-node UM design using IPC
  • Double-buffering, pair-wise IPC-based scheme
  • Brings IPC performance to UM
  • High performance and high productivity
• Available since MVAPICH2-GDR 2.2RC1

K. Hamidouche, A. Awan, A. Venkatesh, and D. K Panda, CUDA M3: Designing Efficient CUDA Managed Memory-aware MPI by Exploiting GDR and IPC, HiPC ‘16

(Figure: results on K80 with MV2-GDR.)
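A minimal managed-memory sketch (illustrative, not from the slides): the same MPI calls work on a cudaMallocManaged buffer, and MVAPICH2-GDR detects the managed allocation and selects an appropriate path (IPC-based intra-node, pipelined through host memory otherwise):

    float *umbuf;
    cudaMallocManaged((void **)&umbuf, n * sizeof(float), cudaMemAttachGlobal);
    if (rank == 0)
        MPI_Send(umbuf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);      /* unified-memory buffer passed directly */
    else if (rank == 1)
        MPI_Recv(umbuf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    cudaFree(umbuf);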

Page 26

Characterizing Unified Memory-Aware MPI on Modern GPUs

• Improved UM support in Pascal and Volta GPUs through:
  • Advanced GPU page-fault engines
  • The cudaMemPrefetchAsync and cudaMemAdvise APIs, which provide more control over UM data placement
• Are the UM designs developed during the Kepler era still valid?
• Carried out an in-depth characterization
• Our characterization studies show:
  • The UM designs from the Kepler era are still valid
  • They are 4.2X and 2.8X better in latency compared to MVAPICH2-GDR and Open MPI, respectively

K. V. Manian, A. Awan, A. Ruhela, C. Chu, H. Subramoni and D. K Panda, Characterizing CUDA Unified Memory (UM)-Aware MPI Designs on Modern GPU Architectures, GPGPU ‘19 Workshop, in conjunction with ASPLOS ’19, April ‘19

(Figures: results on V100 with MV2-GDR and Open MPI, and on V100 with MV2-GDR.)
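The Pascal/Volta-era hints mentioned above look roughly as follows (an illustrative sketch; umbuf and n are placeholders carried over from the previous example):

    int dev = 0;
    cudaSetDevice(dev);
    cudaMemAdvise(umbuf, n * sizeof(float), cudaMemAdviseSetPreferredLocation, dev);  /* keep pages resident on this GPU */
    cudaMemPrefetchAsync(umbuf, n * sizeof(float), dev, 0);                           /* migrate pages before they are touched */
    cudaDeviceSynchronize();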

Page 27

Outline (repeated from Page 2)

Page 28

Streaming Applications

• Streaming applications on HPC systems
  1. Communication (MPI): broadcast-type operations
  2. Computation (CUDA): multiple GPU nodes as workers

(Figure: a data source streams in real time to a sender that uses HPC resources for real-time analytics; data-streaming-like broadcast operations distribute the data to many CPU+GPU worker nodes.)

Page 29

Hardware Multicast-based Broadcast

(Figure: a source node and destinations 1..N, each with an IB HCA, CPU, and GPU, connected through an IB switch. The data path combines 1. IB gather + GDR read at the source, 2. IB hardware multicast through the switch, and 3. IB scatter + GDR write at the destinations, with the header handled on the host.)

• For GPU-resident data, uses
  – GPUDirect RDMA (GDR)
  – InfiniBand hardware multicast (IB-MCAST)
• Overheads
  – IB UD limit
  – GDR limit

A. Venkatesh, H. Subramoni, K. Hamidouche, and D. K. Panda, “A High Performance Broadcast Design with Hardware Multicast and GPUDirect RDMA for Streaming Applications on InfiniBand Clusters,” in HiPC 2014, Dec 2014.

Available since MVAPICH2-GDR 2.3a

Page 30

Streaming Benchmark @ CSCS (88 GPUs)

(Figure: broadcast latency of MCAST-GDR-OPT vs. MCAST-GDR for 1-byte to 16K and 32K to 4M messages.)

• IB-MCAST + GDR + topology-aware IPC-based schemes
  – Up to 58% and 79% latency reduction for small and large messages, respectively

C.-H. Chu, K. Hamidouche, H. Subramoni, A. Venkatesh, B. Elton, and D. K. Panda, "Designing High Performance Heterogeneous Broadcast for Streaming Applications on GPU Clusters, " SBAC-PAD'16, Oct. 26-28, 2016.

Page 31

Application-based Evaluation: CUDA-Aware CNTK

• @ RI2 cluster, 16 GPUs, 1 GPU/node, with the CUDA-Aware Microsoft Cognitive Toolkit (CA-CNTK) [2]

(Figure: speedup of MV2-GDR-Knomial, MV2-GDR-Ring, and MV2-MCAST-GDR-Opt for AlexNet, VGG, and ResNet-50 training on 8 and 16 GPUs; higher is better.)

• Reduces latency by up to 24%, 16%, and 18% for the AlexNet, VGG, and ResNet-50 models, respectively
• Higher improvement can be observed for larger system sizes

[1] C.-H. Chu, X. Lu, A. A. Awan, H. Subramoni, J. Hashmi, B. Elton, and D. K. Panda, Efficient and Scalable Multi-Source Streaming Broadcast on GPU Clusters for Deep Learning, ICPP '17.
[2] D. S. Banerjee, K. Hamidouche, and D. K. Panda, Re-Designing CNTK Deep Learning Framework on Modern GPU Enabled Clusters, IEEE CloudCom '16.

Research Poster (P9242)

Page 32

Outline (repeated from Page 2)

Page 33

• Scale-up: Intra-node Communication

– Many improvements like:• NVIDIA cuDNN, cuBLAS, NCCL, etc.

• CUDA 9 Co-operative Groups

• Scale-out: Inter-node Communication

– DL Frameworks – most are optimized for single-node only

– Distributed (Parallel) Training is an emerging trend

• OSU-Caffe – MPI-based

• Microsoft CNTK – MPI/NCCL2

• Google TensorFlow – gRPC-based/MPI/NCCL2

• Facebook Caffe2 – Hybrid (NCCL2/Gloo/MPI)

Deep Learning: New Challenges for Runtimes

(Figure: qualitative positioning of communication back-ends on a scale-up performance vs. scale-out performance plane; cuDNN, MKL-DNN, NCCL2, MPI, gRPC, and Hadoop are placed relative to a "Desired" region that combines both.)

Page 34

Data Parallel Deep Learning and MPI Collectives

(Figure: the data-parallel training loop across GPUs 0-3. 1. Data propagation: the packed parameter buffer (packed_comm_buff) is broadcast from GPU 0 with MPI_Bcast. 2. Forward/backward pass: each GPU runs forward (F) and backward (B) over layers L1..Ln. 3. Gradient aggregation: the packed gradient buffers (packed_reduce_buff) are reduced to GPU 0 with MPI_Reduce, updates are applied, and the loop repeats.)

• Major MPI collectives involved in designing distributed frameworks (an illustrative sketch follows the citation below):
  • MPI_Bcast – required for DNN parameter exchange
  • MPI_Reduce – needed for gradient accumulation from multiple solvers
  • MPI_Allreduce – use just one Allreduce instead of Reduce and Broadcast

A. A. Awan, K. Hamidouche, J. M. Hashmi, and D. K. Panda, S-Caffe: Co-designing MPI Runtimes and Caffe for Scalable Deep Learning on Modern GPU Clusters. In Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '17)
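An illustrative MPI-level sketch of this loop (not OSU-Caffe's actual code; forward_backward() and apply_updates() stand in for the framework's compute phases, and the packed buffers may be GPU-resident when a CUDA-aware MPI is used):

    MPI_Bcast(params, nparams, MPI_FLOAT, 0, MPI_COMM_WORLD);      /* 1. data propagation: parameters from rank/GPU 0 */
    for (int iter = 0; iter < max_iters; iter++) {
        forward_backward(params, grads);                           /* 2. local forward/backward pass (framework code) */
        MPI_Allreduce(MPI_IN_PLACE, grads, nparams, MPI_FLOAT,     /* 3. gradient aggregation: one Allreduce instead  */
                      MPI_SUM, MPI_COMM_WORLD);                    /*    of Reduce followed by Bcast                  */
        apply_updates(params, grads);                              /* apply the aggregated gradients                  */
    }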

Page 35

MVAPICH2-GDR: Allreduce Comparison with Baidu and OpenMPI

• 16 GPUs (4 nodes): MVAPICH2-GDR vs. Baidu-allreduce and OpenMPI 3.0
• Available since MVAPICH2-GDR 2.3a

(Figure: MPI_Allreduce latency vs. message size for MVAPICH2, Baidu, and OpenMPI over three ranges: 4 bytes to 256K, 512K to 4M, and 8M to 512M. Annotated results: MVAPICH2 is up to ~30X better than OpenMPI and ~2X better than Baidu, ~10X better in the mid range (where OpenMPI is ~5X slower than Baidu), and ~4X better for the largest messages.)

Page 36

MVAPICH2-GDR vs. NCCL2 – Allreduce Operation

• Optimized designs since MVAPICH2-GDR 2.3 offer better or comparable performance for most cases

• MPI_Allreduce (MVAPICH2-GDR) vs. ncclAllreduce (NCCL2) on 16 GPUs
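Such a comparison can be reproduced on the MPI side with the OSU micro-benchmarks on device buffers (a sketch; assumes a CUDA-enabled OMB build and a suitable hostfile, with -d cuda placing the buffers on the GPU):

    $ mpirun_rsh -np 16 -hostfile hosts MV2_USE_CUDA=1 ./osu_allreduce -d cuda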

(Figure: MPI_Allreduce latency vs. message size, MVAPICH2-GDR 2.3.1 vs. NCCL2; about 3X better for small messages (4 bytes to 64K) and about 1.2X better for large messages.)

Platform: Intel Xeon (Broadwell) nodes with a dual-socket CPU, 1 K-80 GPU per node, and an EDR InfiniBand interconnect

Page 37

MVAPICH2-GDR vs. NCCL2 – Allreduce Operation (DGX-2)

• Optimized designs in MVAPICH2-GDR 2.3.1 offer better or comparable performance for most cases
• MPI_Allreduce (MVAPICH2-GDR) vs. ncclAllreduce (NCCL2) on 1 DGX-2 node (16 Volta GPUs)

(Figure: Allreduce latency vs. message size, MVAPICH2-GDR 2.3.1 vs. NCCL 2.3; about 2.5X better for small messages (8 bytes to 128K) and about 1.7X better for large messages.)

Platform: NVIDIA DGX-2 system (16 Volta GPUs connected with NVSwitch), CUDA 9.2

Page 38

• Efficient Allreduce is crucial for Horovod’s overall training performance

– Both MPI and NCCL designs are available

• We have evaluated Horovod extensively and compared across a wide range of designs using gRPC and gRPC extensions

• MVAPICH2-GDR achieved up to 90% scaling efficiency for ResNet-50 training on 64 Pascal GPUs

Scalable TensorFlow using Horovod, MPI, and NCCL

A. A. Awan, J. Bedorf, C.-H. Chu, H. Subramoni and D. K. Panda, “Scalable Distributed DNN Training using TensorFlow and CUDA-Aware MPI: Characterization, Designs, and Performance Evaluation”, (To be presented) CCGrid ‘19. https://arxiv.org/abs/1810.11112

Page 39

Distributed Training with TensorFlow and MVAPICH2-GDR

• ResNet-50 training using the TensorFlow benchmark on 1 DGX-2 node (8 Volta GPUs)

(Figure: images per second and scaling efficiency for 1-8 GPUs, NCCL 2.3 vs. MVAPICH2-GDR-Next; MVAPICH2-GDR-Next delivers 7.5% higher throughput.)

Scaling Efficiency = (Actual throughput / Ideal throughput at scale) x 100%

Platform: NVIDIA DGX-2 system (16 Volta GPUs connected with NVSwitch), CUDA 9.2

Page 40

OSU-Caffe: Scalable Deep Learning

• Caffe: a flexible and layered Deep Learning framework
• Benefits and weaknesses
  – Multi-GPU training within a single node
  – Performance degradation for GPUs across different sockets
  – Limited scale-out
• OSU-Caffe: MPI-based parallel training
  – Enables scale-up (within a node) and scale-out (across multi-GPU nodes)
  – Scale-out to 64 GPUs for training the CIFAR-10 network on the CIFAR-10 dataset
  – Scale-out to 128 GPUs for training the GoogLeNet network on the ImageNet dataset

(Figure: GoogLeNet (ImageNet) training time on 8-128 GPUs for Caffe, OSU-Caffe (1024), and OSU-Caffe (2048); the Caffe configuration at this scale is marked as an invalid use case.)

OSU-Caffe is publicly available from http://hidl.cse.ohio-state.edu/
Support on OpenPOWER will be available soon

Page 41

Outline (repeated from Page 2)

Page 42

Point-to-Point Host-level Performance on OpenPOWER

(Figure: MVAPICH2-GDR 2.3.1 intra-node and inter-node latency, bandwidth, and bi-directional bandwidth.)
• Intra-node latency: ~0.5 us; inter-node latency: ~2.3 us
• Intra-node bandwidth: ~30 GB/s; intra-node bi-directional bandwidth: ~60 GB/s
• Inter-node bandwidth: ~12 GB/s; inter-node bi-directional bandwidth: ~24 GB/s

Platform: OpenPOWER (POWER8, ppc64le) CPU with a Mellanox EDR (MT4115) HCA

Page 43

Device-to-Device Performance on OpenPOWER (NVLink2 + Volta)

(Figure: intra-node and inter-node GPU-GPU latency and bandwidth, intra-socket vs. inter-socket.)
• Intra-node latency: 5.36 us (without GDRCopy)
• Intra-node bandwidth: 70.4 GB/s for 128 MB messages (via NVLink2)
• Inter-node latency: 5.66 us (without GDRCopy)
• Inter-node bandwidth: 23.7 GB/s (2-port EDR)

Available since MVAPICH2-GDR 2.3a

Platform: OpenPOWER (POWER9, ppc64le) nodes with a dual-socket CPU, 4 Volta V100 GPUs, and a 2-port EDR InfiniBand interconnect

Page 44

Outline (repeated from Page 2)

Page 45

Container Support

• Increasing trend to provide container support for MPI libraries
  • Ease of build
  • Portability
  • Reproducibility
• MVAPICH2-GDR 2.3.1 provides container (Docker) support
• More details are available in the MVAPICH2-GDR User Guide
  – http://mvapich.cse.ohio-state.edu/userguide/gdr/
• Synergistic with the HPC Container Maker (hpccm) efforts by NVIDIA
  – https://github.com/NVIDIA/hpc-container-maker

Page 46

MVAPICH2-GDR on Containers with Negligible Overhead

(Figure: GPU-GPU inter-node latency, bandwidth, and bi-directional bandwidth, Docker vs. native; the curves are nearly identical across message sizes.)

Platform: MVAPICH2-GDR 2.3.1, Intel Haswell (E5-2687W @ 3.10 GHz) node with 20 cores, NVIDIA Volta V100 GPU, Mellanox ConnectX-4 EDR HCA, CUDA 9.0, Mellanox OFED 4.0 with GPUDirect RDMA

Page 47

Outline (repeated from Page 2)

Page 48

Scalable Host-based Collectives on OpenPOWER with CMA (Intra-node Reduce & Alltoall)

(Figure: MPI_Reduce and MPI_Alltoall latency on 1 node with 20 processes per node, MVAPICH2-GDR-Next vs. SpectrumMPI 10.1.0.2 vs. OpenMPI 3.0.0, for small (4B-4K) and large (8K-1M) messages; annotated speedups range from 1.2X to 5.2X.)

• Up to 5X and 3X performance improvement by MVAPICH2 for small and large messages, respectively

Page 49

Scalable Host-based Collectives on OpenPOWER with CMA (Multi-node Reduce & Alltoall)

(Figure: MPI_Reduce and MPI_Alltoall latency on 4 nodes with 20 processes per node, MVAPICH2-GDR-Next vs. OpenMPI 3.0.0, for small (4B-4K) and large (8K-1M) messages; annotated speedups include 1.9X.)

• Up to 12.4X and 8.5X performance improvement by MVAPICH2 for small and large messages, respectively

Page 50

Outline (repeated from Page 2)

Page 51

• Offload reduction computation and communication to peer MPI ranks
  – Every peer has direct "load/store" access to the other peers' buffers
  – Multiple pseudo-roots independently carry out intra- and inter-node reductions
  – Reduced data is put directly into the root's receive buffer
• True "zero-copy" design for Allreduce and Reduce
  – No copies are required during the entire reduction operation
  – Scalable to multiple nodes
• Zero contention overheads, since memory copies happen in user space

Shared Address Space (XPMEM-based) Collectives

J. Hashmi, S. Chakraborty, M. Bayatpour, H. Subramoni, and D. Panda, Designing Efficient Shared Address Space Reduction Collectives for Multi-/Many-cores, International Parallel & Distributed Processing Symposium (IPDPS '18), May 2018.

Available since MVAPICH2-X 2.3rc1

Page 52

Benefits of XPMEM-based MPI_Bcast

• 28 MPI processes on a single dual-socket Broadwell E5-2680 v4 (2 x 14-core) node

(Figure: MPI_Bcast latency for 4K-4M messages, Intel MPI 2018 vs. OpenMPI 3.0.1 vs. MV2X-2.3rc1 (CMA collectives) vs. MV2X-2.3rc2 (XPMEM collectives); up to 5X better than OpenMPI.)

Page 53

Benefits of XPMEM-based MPI_Scatter

• High cache locality and contention-free access compared to CMA

(Figure: MPI_Scatter latency for 512B-4M messages, Intel MPI 2018 vs. OpenMPI 3.0.1 vs. MV2X-2.3rc1 (CMA collectives) vs. MV2X-2.3rc2 (XPMEM collectives); up to ~28X better than the state-of-the-art.)

Page 54

Optimized Allreduce with XPMEM

(Figure: MPI_Allreduce latency for 16K-2M messages on 1 node and 2 nodes (20 processes per node), MVAPICH2-GDR-Next vs. SpectrumMPI 10.1.0 vs. OpenMPI 3.0.0; annotated gains range from 34% and 48% up to 2X, 3X, 3.3X, and 4X.)

• Optimized MPI Allreduce design in MVAPICH2
  – Up to 2X performance improvement over Spectrum MPI and 4X over OpenMPI for intra-node runs

Optimized runtime parameters: MV2_CPU_BINDING_POLICY=hybrid MV2_HYBRID_BINDING_POLICY=bunch

Page 55

Application-Level Benefits of XPMEM-based Collectives

(Figure: execution time of CNTK AlexNet training (Broadwell, default batch size, 50 iterations, ppn=28) on 28-224 processes and of the MiniAMR kernel (Broadwell, ppn=16) on 16-256 processes, for Intel MPI, MVAPICH2, and MVAPICH2-XPMEM.)

• Up to 20% benefit over Intel MPI (and about 9% over MVAPICH2) for CNTK DNN training using Allreduce
• Up to 27% benefit over Intel MPI and up to 15% improvement over MVAPICH2 for the MiniAMR application kernel

Page 56

Outline (repeated from Page 2)

Page 57

MVAPICH2-GDR: Enhanced Derived Datatype Processing

• Kernel-based and GDRCOPY-based one-shot packing for inter-socket and inter-node communication
• Zero-copy (packing-free) for GPUs with peer-to-peer direct access over PCIe/NVLink

(Figure: speedup of a GPU-based DDTBench that mimics the MILC communication kernel for several problem sizes, OpenMPI 4.0.0 vs. MVAPICH2-GDR 2.3 vs. MVAPICH2-GDR-Next, on an NVIDIA DGX-2 (CUDA 9.2); and execution time of the COSMO model communication kernel on 8-64 GPUs of a Cray CS-Storm (8 Tesla K80 GPUs per node, CUDA 8.0), MVAPICH2-GDR 2.3 vs. MVAPICH2-GDR-Next. The annotated improvement is about 10X.)

Page 58

Outline (repeated from Page 2)

Page 59

Scalability and Large (Out-of-core) Models?

• Large DNNs cannot be trained on GPUs due to memory limitations!
  – ResNet-50 for image recognition: current frameworks can only go up to a small batch size of 45
  – Next-generation models like Neural Machine Translation (NMT) are ridiculously large, consist of billions of parameters, and require even more memory
  – Can we design out-of-core DNN training support using new software features in CUDA 8/9 and hardware mechanisms in Pascal/Volta GPUs?
• The general intuition is that managed allocations "will be" slow!
  – The proposed framework, OC-Caffe (Out-of-Core Caffe), shows that managed-memory designs can provide performance with negligible or no overhead
• OC-Caffe-Opt: up to 80% better than Intel-optimized CPU Caffe for ResNet-50 training on the Volta V100 GPU with CUDA 9 and cuDNN 7

A. A. Awan, C.-H. Chu, H. Subramoni, X. Lu, and D. K. Panda, OC-DNN: Exploiting Advanced Unified Memory Capabilities in CUDA 9 and Volta GPUs for Out-of-Core DNN Training, HiPC ’18

Research Poster (P9243)

Page 60

Outline (repeated from Page 2)

Page 61

• MVAPICH2-GDR MPI library optimizes MPI communication on InfiniBand and RoCE (V1 and V2) clusters with GPUs on both x86 and OpenPOWER platforms (including NVLink)

• Provides optimized designs for point-to-point two-sided and one-sided communication, datatype processing and collective operations

• Takes advantage of CUDA features like IPC and the GPUDirect RDMA family of technologies

• Allows flexible solutions for streaming applications with GPUs

• Provides optimized solutions for both HPC and High-Performance Deep Learning (HiDL) frameworks and applications

• Upcoming releases will be supporting advanced designs

Conclusions

Page 62

Please join us for more events:

Research Posters – Monday, March 18, 06:00 PM - 08:00 PM, SJCC Upper Concourse
1. P9243 - Exploiting CUDA Unified Memory for Efficient Out-of-Core DNN Training
2. P9242 - Exploiting GPUDirect Technology and Hardware Multicast for Streaming and Deep Learning Applications

Talk – Tuesday, March 19, 03:00 PM - 03:50 PM, SJCC Room 211A (Concourse Level)
S9476 - MVAPICH2-GDR: High-Performance and Scalable CUDA-Aware MPI Library for HPC and AI

Instructor-Led Training – Wednesday, March 20, 08:00 AM - 10:00 AM, SJCC Room LL21D (Lower Level)
L9121 - How to Boost the Performance of HPC/AI Applications Using MVAPICH2 Library

Page 63

Personnel Acknowledgments

Current Students (Graduate)

– A. Awan (Ph.D.)

– M. Bayatpour (Ph.D.)

– S. Chakraborthy (Ph.D.)

– C.-H. Chu (Ph.D.)– S. Guganani (Ph.D.)

Past Students – A. Augustine (M.S.)

– P. Balaji (Ph.D.)

– R. Biswas (M.S.)

– S. Bhagvat (M.S.)

– A. Bhat (M.S.)

– D. Buntinas (Ph.D.)

– L. Chai (Ph.D.)

– B. Chandrasekharan (M.S.)

– N. Dandapanthula (M.S.)

– V. Dhanraj (M.S.)

– T. Gangadharappa (M.S.)

– K. Gopalakrishnan (M.S.)

– R. Rajachandrasekar (Ph.D.)

– G. Santhanaraman (Ph.D.)

– A. Singh (Ph.D.)

– J. Sridhar (M.S.)

– S. Sur (Ph.D.)

– H. Subramoni (Ph.D.)

– K. Vaidyanathan (Ph.D.)

– A. Vishnu (Ph.D.)

– J. Wu (Ph.D.)

– W. Yu (Ph.D.)

– J. Zhang (Ph.D.)

Past Research Scientist– K. Hamidouche

– S. Sur

Past Post-Docs– D. Banerjee

– X. Besseron

– H.-W. Jin

– W. Huang (Ph.D.)

– W. Jiang (M.S.)

– J. Jose (Ph.D.)

– S. Kini (M.S.)

– M. Koop (Ph.D.)

– K. Kulkarni (M.S.)

– R. Kumar (M.S.)

– S. Krishnamoorthy (M.S.)

– K. Kandalla (Ph.D.)

– M. Li (Ph.D.)

– P. Lai (M.S.)

– J. Liu (Ph.D.)

– M. Luo (Ph.D.)

– A. Mamidala (Ph.D.)

– G. Marsh (M.S.)

– V. Meshram (M.S.)

– A. Moody (M.S.)

– S. Naravula (Ph.D.)

– R. Noronha (Ph.D.)

– X. Ouyang (Ph.D.)

– S. Pai (M.S.)

– S. Potluri (Ph.D.)

– J. Hashmi (Ph.D.)

– A. Jain (Ph.D.)– K. S. Khorassani (Ph.D.)

– P. Kousha (Ph.D.)

– D. Shankar (Ph.D.)

– J. Lin

– M. Luo

– E. Mancini

Current Research Asst. Professor– X. Lu

Past Programmers– D. Bureddy

– J. Perkins

Current Research Specialist– J. Smith

– S. Marcarelli

– J. Vienne

– H. Wang

Current Post-doc– A. Ruhela

– K. Manian

Current Students (Undergraduate)– V. Gangal (B.S.)

– M. Haupt (B.S.)

– N. Sarkauskas (B.S.)

– A. Yeretzian (B.S.)

Past Research Specialist– M. Arnold

Current Research Scientist– H. Subramoni

Page 64

• Looking for Bright and Enthusiastic Personnel to join as

– PhD Students

– Post-Doctoral Researchers

– MPI Programmer/Software Engineer

– Hadoop/Big Data Programmer/Software Engineer

– Deep Learning and Cloud Programmer/Software Engineer

• If interested, please send an e-mail to [email protected]

Multiple Positions Available in My Group

Page 65

Thank You!

Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
[email protected], [email protected]

The High-Performance MPI/PGAS Project: http://mvapich.cse.ohio-state.edu/
The High-Performance Deep Learning Project: http://hidl.cse.ohio-state.edu/