MVAPICH2-GDR: Pushing the Frontier of MPI Libraries Enabling GPUDirect Technologies
Dhabaleswar K. (DK) Panda
The Ohio State University, E-mail: [email protected]
http://www.cse.ohio-state.edu/~panda
GPU Technology Conference (GTC 2018)
by
Hari Subramoni
The Ohio State University, E-mail: [email protected]
http://www.cse.ohio-state.edu/~subramon
Outline
• Overview of the MVAPICH2 Project
• MVAPICH2-GPU with GPUDirect-RDMA (GDR)
• Current Features
  • Multi-stream Communication for IPC
  • CMA-based Intra-node Host-to-Host Communication Support
  • Maximal Overlap in MPI Datatype Processing
  • Efficient Support for Managed Memory
  • Streaming Support with InfiniBand Multicast and GDR
  • Support for Deep Learning
  • Support for OpenPOWER with NVLink
  • Support for Container
• Upcoming Features
  • CMA-based Intra-node Collective Communication Support
  • XPMEM-based Collective Communication Support
  • Optimized Collectives for Deep Learning
  • Out-of-core Processing for Deep Learning
• Conclusions
Overview of the MVAPICH2 Project
• High Performance open-source MPI Library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)
– MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.1), Started in 2001, First version available in 2002
– MVAPICH2-X (MPI + PGAS), Available since 2011
– Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), Available since 2014
– Support for Virtualization (MVAPICH2-Virt), Available since 2015
– Support for Energy-Awareness (MVAPICH2-EA), Available since 2015
– Support for InfiniBand Network Analysis and Monitoring (OSU INAM) since 2015
– Used by more than 2,875 organizations in 86 countries
– More than 460,000 (> 0.46 million) downloads from the OSU site directly
– Empowering many TOP500 clusters (Nov '17 ranking)
  • 1st, 10,649,600-core (Sunway TaihuLight) at National Supercomputing Center in Wuxi, China
• 9th, 556,104 cores (Oakforest-PACS) in Japan
• 12th, 368,928-core (Stampede2) at TACC
• 17th, 241,108-core (Pleiades) at NASA
• 48th, 76,032-core (Tsubame 2.5) at Tokyo Institute of Technology
– Available with software stacks of many vendors and Linux Distros (RedHat and SuSE)
– http://mvapich.cse.ohio-state.edu
• Empowering Top500 systems for over a decade
MVAPICH2 Release Timeline and Downloads
[Figure: cumulative number of downloads from the OSU site, Sep-04 through Jan-18, approaching 500,000, annotated with release milestones: MV 0.9.4, MV2 0.9.0, MV2 0.9.8, MV2 1.0, MV 1.0, MV2 1.0.3, MV 1.1, MV2 1.4, MV2 1.5, MV2 1.6, MV2 1.7, MV2 1.8, MV2 1.9, MV2-GDR 2.0b, MV2-MIC 2.0, MV2-Virt 2.2, MV2-X 2.3b, MV2-GDR 2.3a, MV2 2.3rc1, OSU INAM 0.9.3]
MVAPICH2 Architecture
• High Performance Parallel Programming Models: Message Passing Interface (MPI), PGAS (UPC, OpenSHMEM, CAF, UPC++), and Hybrid MPI + X (MPI + PGAS + OpenMP/Cilk)
• High Performance and Scalable Communication Runtime with Diverse APIs and Mechanisms: point-to-point primitives, collective algorithms, energy-awareness, remote memory access, I/O and file systems, fault tolerance, virtualization, active messages, job startup, and introspection & analysis
• Support for Modern Networking Technology (InfiniBand, iWARP, RoCE, Omni-Path) and Modern Multi-/Many-core Architectures (Intel Xeon, OpenPOWER, Xeon Phi (MIC, KNL*), NVIDIA GPGPU)
• Transport protocols: RC, XRC, UD, DC, UMR, ODP*, SR-IOV, multi-rail
• Transport mechanisms: shared memory, CMA, IVSHMEM
• Modern features: MCDRAM*, NVLink*, CAPI*
(* Upcoming)
MVAPICH2 Software Family

High-Performance Parallel Programming Libraries
• MVAPICH2 – Support for InfiniBand, Omni-Path, Ethernet/iWARP, and RoCE
• MVAPICH2-X – Advanced MPI features, OSU INAM, PGAS (OpenSHMEM, UPC, UPC++, and CAF), and MPI+PGAS programming models with unified communication runtime
• MVAPICH2-GDR – Optimized MPI for clusters with NVIDIA GPUs
• MVAPICH2-Virt – High-performance and scalable MPI for hypervisor- and container-based HPC cloud
• MVAPICH2-EA – Energy-aware and high-performance MPI
• MVAPICH2-MIC – Optimized MPI for clusters with Intel KNC

Microbenchmarks
• OMB – Microbenchmark suite to evaluate MPI and PGAS (OpenSHMEM, UPC, and UPC++) libraries for CPUs and GPUs

Tools
• OSU INAM – Network monitoring, profiling, and analysis for clusters with MPI and scheduler integration
• OEMT – Utility to measure the energy consumption of MPI applications
MVAPICH2-GDR: Optimizing MPI Data Movement on GPU Clusters
[Figure: two nodes, each with CPUs, multiple GPUs, PCIe, QPI, and an IB adapter, illustrating the communication paths between memory buffers]
Communication paths:
1. Intra-GPU
2. Intra-socket GPU-GPU
3. Inter-socket GPU-GPU
4. Inter-node GPU-GPU
5. Intra-socket GPU-Host
6. Inter-socket GPU-Host
7. Inter-node GPU-Host
8. Inter-node GPU-GPU with the IB adapter on the remote socket, and more ...
• GPUs are connected as PCIe devices – flexibility, but complexity
• NVLink is leading to even more paths
• For each path there are different schemes: shared memory, IPC, GPUDirect RDMA, pipelining, ...
• It is critical for runtimes to optimize data movement while hiding the complexity
GPU-Aware (CUDA-Aware) MPI Library: MVAPICH2-GPU
• Standard MPI interfaces used for unified data movement
  At Sender:   MPI_Send(s_devbuf, size, ...);
  At Receiver: MPI_Recv(r_devbuf, size, ...);
  (the GPU-to-GPU data movement is handled inside MVAPICH2)
• Takes advantage of Unified Virtual Addressing (>= CUDA 4.0)
• Overlaps data movement from the GPU with RDMA transfers
• High Performance and High Productivity
A minimal sketch of this usage model follows.
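A minimal sketch, assuming a CUDA-aware MVAPICH2-GDR build (with MV2_USE_CUDA=1 at run time); the buffer size and tag are illustrative:

/* Minimal sketch: cudaMalloc'd buffers passed directly to MPI_Send/MPI_Recv. */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int count = 1 << 20;                     /* illustrative: 1M floats */
    float *devbuf;
    cudaMalloc((void **)&devbuf, count * sizeof(float));

    if (rank == 0)
        MPI_Send(devbuf, count, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);   /* device pointer, no explicit cudaMemcpy */
    else if (rank == 1)
        MPI_Recv(devbuf, count, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    cudaFree(devbuf);
    MPI_Finalize();
    return 0;
}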
CUDA-Aware MPI: MVAPICH2-GDR 1.8-2.3 Releases
• Support for MPI communication from NVIDIA GPU device memory
• High-performance RDMA-based inter-node point-to-point communication (GPU-GPU, GPU-Host, and Host-GPU)
• High-performance intra-node point-to-point communication for multi-GPU adapters/node (GPU-GPU, GPU-Host, and Host-GPU)
• Taking advantage of CUDA IPC (available since CUDA 4.1) for intra-node communication with multiple GPU adapters/node
• Optimized and tuned collectives for GPU device buffers
• MPI datatype support for point-to-point and collective communication from GPU device buffers
• Unified memory
Using MVAPICH2-GPUDirect Version
• MVAPICH2-2.3 with GDR support can be downloaded from https://mvapich.cse.ohio-state.edu/download/mvapich2gdr/
• System software requirements
  • Mellanox OFED 3.2 or later
  • NVIDIA Driver 367.48 or later
  • NVIDIA CUDA Toolkit 7.5/8.0/9.0 or later
  • Plugin for GPUDirect RDMA: http://www.mellanox.com/page/products_dyn?product_family=116
• Strongly recommended: GDRCOPY module from NVIDIA, https://github.com/NVIDIA/gdrcopy
• Contact the MVAPICH help list with any questions related to the package
MVAPICH2-GDR 2.3a
• Released on 11/09/2017
• Major Features and Enhancements
– Based on MVAPICH2 2.2
– Support for CUDA 9.0
– Add support for Volta (V100) GPU
– Support for OpenPOWER with NVLink
– Efficient Multiple CUDA stream-based IPC communication for multi-GPU systems with and without NVLink
– Enhanced performance of GPU-based point-to-point communication
– Leverage Linux Cross Memory Attach (CMA) feature for enhanced host-based communication
– Enhanced performance of MPI_Allreduce for GPU-resident data
– InfiniBand Multicast (IB-MCAST) based designs for GPU-based broadcast and streaming applications
  • Basic support for IB-MCAST designs with GPUDirect RDMA
• Advanced support for zero-copy IB-MCAST designs with GPUDirect RDMA
• Advanced reliability support for IB-MCAST designs
– Efficient broadcast designs for Deep Learning applications
– Enhanced collective tuning on Xeon, OpenPOWER, and NVIDIA DGX-1 systems
Optimized MVAPICH2-GDR Design
[Figures: GPU-GPU inter-node latency, bandwidth, and bi-directional bandwidth vs. message size for MV2 (NO-GDR) vs. MV2-GDR 2.3a; up to 11X lower latency (1.88 us) and roughly 9x-10x higher bandwidth and bi-bandwidth]
Platform: MVAPICH2-GDR 2.3a, Intel Haswell (E5-2687W @ 3.10 GHz) node with 20 cores, NVIDIA Volta V100 GPU, Mellanox ConnectX-4 EDR HCA, CUDA 9.0, Mellanox OFED 4.0 with GPUDirect RDMA
RoCE and Optimized Collectives Support
• RoCE v1 and v2 support
• RDMA_CM connection support
• CUDA-Aware Collective Tuning
  – Point-to-point tuning (available since MVAPICH2-GDR 2.0)
    • Tuned thresholds for the different communication patterns and features
    • Depending on the system configuration (CPU, HCA, and GPU models)
  – Tuning framework for GPU-based collectives
    • Selects the best algorithm depending on message size, system size, and system configuration
    • Support for Bcast and Gather operations for different GDR-enabled systems
    • Available since the MVAPICH2-GDR 2.2RC1 release
Application-Level Evaluation (HOOMD-blue)
[Figures: average time steps per second (TPS) vs. number of processes (4-32) for 64K-particle and 256K-particle runs; MV2+GDR achieves about 2X higher TPS than MV2 in both cases]
• Platform: Wilkes (Intel Ivy Bridge + NVIDIA Tesla K20c + Mellanox Connect-IB)
• HOOMD-blue version 1.0.5
• GDRCOPY enabled: MV2_USE_CUDA=1 MV2_IBA_HCA=mlx5_0 MV2_IBA_EAGER_THRESHOLD=32768 MV2_VBUF_TOTAL_SIZE=32768 MV2_USE_GPUDIRECT_LOOPBACK_LIMIT=32768 MV2_USE_GPUDIRECT_GDRCOPY=1 MV2_USE_GPUDIRECT_GDRCOPY_LIMIT=16384
Application-Level Evaluation (Cosmo) and Weather Forecasting in Switzerland
[Figures: normalized execution time vs. number of GPUs on the Wilkes GPU cluster (4-32 GPUs) and the CSCS GPU cluster (16-96 GPUs), comparing Default, Callback-based, and Event-based designs]
• 2X improvement on 32 GPU nodes
• 30% improvement on 96 GPU nodes (8 GPUs/node)
C. Chu, K. Hamidouche, A. Venkatesh, D. Banerjee , H. Subramoni, and D. K. Panda, Exploiting Maximal Overlap for Non-Contiguous Data Movement Processing on Modern GPU-enabled Systems, IPDPS’16
On-going collaboration with CSCS and MeteoSwiss (Switzerland) in co-designing MV2-GDR and Cosmo Application
Cosmo model: http://www2.cosmo-model.org/content/tasks/operational/meteoSwiss/
Multi-stream Communication using CUDA IPC on OpenPOWER and DGX-1
• Up to 16% higher Device-to-Device (D2D) bandwidth on OpenPOWER + NVLink interconnect
• Up to 30% higher D2D bandwidth on DGX-1 with NVLink
[Figures: point-to-point (D-D) bandwidth vs. message size showing the benefits of the multi-stream CUDA IPC design; 4 streams deliver up to 16% (OpenPOWER) and 30% (DGX-1) higher bandwidth than 1 stream for large messages]
Available with MVAPICH2-GDR 2.3a
CMA-based Intra-node Host-to-Host Communication Support
• Up to 30% lower Host-to-Host (H2H) latency and 30% higher H2H bandwidth
[Figures: intra-node point-to-point (H2H) latency and bandwidth vs. message size for MV2-GDR with and without CMA; CMA is about 30% better in both]
Platform: MVAPICH2-GDR 2.3a, Intel Broadwell (E5-2680 v4 @ 2.40 GHz) node with 28 cores, NVIDIA Tesla K80 GPU, Mellanox ConnectX-4 EDR HCA, CUDA 8.0, Mellanox OFED 4.0 with GPUDirect RDMA
Non-contiguous Data Exchange
• Multi-dimensional data
  • Row-based organization
  • Contiguous in one dimension
  • Non-contiguous in the other dimensions
• Halo data exchange
  • Duplicate the boundary
  • Exchange the boundary in each iteration
[Figure: halo data exchange between neighboring subdomains]
MPI Datatype Support in MVAPICH2
• Datatype support in MPI
  – Operate on customized datatypes to improve productivity
  – Enable the MPI library to optimize non-contiguous data
At Sender:
  MPI_Type_vector(n_blocks, n_elements, stride, old_type, &new_type);
  MPI_Type_commit(&new_type);
  ...
  MPI_Send(s_buf, size, new_type, dest, tag, MPI_COMM_WORLD);
• Inside MVAPICH2
  - Use datatype-specific CUDA kernels to pack data in chunks
  - Efficiently move data between nodes using RDMA
  - In progress - currently optimizes vector and hindexed datatypes
  - Transparent to the user (a complete GPU-buffer sketch follows the citation below)
H. Wang, S. Potluri, D. Bureddy, C. Rosales and D. K. Panda, GPU-aware MPI on RDMA-Enabled Clusters: Design, Implementation and Evaluation, IEEE Transactions on Parallel and Distributed Systems, Vol. 25, No. 10, pp. 2595-2605, Oct 2014.
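A minimal end-to-end sketch of the vector-datatype path above, here with a GPU (cudaMalloc'd) buffer so the library's CUDA packing kernels are exercised; the block/element/stride geometry is illustrative only:

/* Sketch: sending a strided (vector) region of a GPU buffer via an MPI datatype. */
#include <mpi.h>
#include <cuda_runtime.h>

void exchange_vector(int rank) {
    const int n_blocks = 64, n_elements = 16, stride = 128;   /* illustrative geometry */
    double *d_buf;
    cudaMalloc((void **)&d_buf, (size_t)n_blocks * stride * sizeof(double));

    MPI_Datatype new_type;
    MPI_Type_vector(n_blocks, n_elements, stride, MPI_DOUBLE, &new_type);
    MPI_Type_commit(&new_type);

    if (rank == 0)
        MPI_Send(d_buf, 1, new_type, 1, 0, MPI_COMM_WORLD);    /* packed by CUDA kernels inside the library */
    else if (rank == 1)
        MPI_Recv(d_buf, 1, new_type, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    MPI_Type_free(&new_type);
    cudaFree(d_buf);
}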
MPI Datatype Processing (Computation Optimization)
• Comprehensive support
  • Targeted kernels for regular datatypes - vector, subarray, indexed_block
  • Generic kernels for all other irregular datatypes
• Separate non-blocking stream for kernels launched by the MPI library
  • Avoids stream conflicts with application kernels
• Flexible set of parameters for users to tune kernels
  • Vector
    • MV2_CUDA_KERNEL_VECTOR_TIDBLK_SIZE
    • MV2_CUDA_KERNEL_VECTOR_YSIZE
  • Subarray
    • MV2_CUDA_KERNEL_SUBARR_TIDBLK_SIZE
    • MV2_CUDA_KERNEL_SUBARR_XDIM
    • MV2_CUDA_KERNEL_SUBARR_YDIM
    • MV2_CUDA_KERNEL_SUBARR_ZDIM
  • Indexed_block
    • MV2_CUDA_KERNEL_IDXBLK_XDIM
MPI Datatype Processing (Communication Optimization)
Common scenario: a stream of non-blocking sends of non-contiguous datatypes
  MPI_Isend(A, ..., Datatype, ...);
  MPI_Isend(B, ..., Datatype, ...);
  MPI_Isend(C, ..., Datatype, ...);
  MPI_Isend(D, ..., Datatype, ...);
  ...
  MPI_Waitall(...);
  (*A, B, ... contain non-contiguous MPI datatypes)
Serializing the datatype packing kernels and the transfers in this scenario wastes computing resources on both the CPU and the GPU; the optimized design overlaps them.
Initial (Basic) Support for GPU Managed Memory
● With CUDA 6.0, NVIDIA introduced CUDA Managed (or Unified) Memory, allowing a common memory allocation for GPU or CPU through the cudaMallocManaged() call
● Significant productivity benefits due to abstraction of explicit allocation and cudaMemcpy()
● Extended MVAPICH2 to perform communication directly from managed buffers (available since MVAPICH2-GDR 2.2b); a minimal sketch follows the citation below
● OSU Micro-benchmarks extended to evaluate the performance of point-to-point and collective communication using managed buffers
● Available since OMB 5.2
D. S. Banerjee, K. Hamidouche, and D. K. Panda, Designing High Performance Communication Runtime for GPU Managed Memory: Early Experiences, GPGPU-9 Workshop, held in conjunction with PPoPP '16
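A minimal sketch of communication from a managed buffer, assuming MVAPICH2-GDR 2.2b or later; the buffer size and kernel step are illustrative:

/* Sketch: MPI communication directly from CUDA managed (unified) memory. */
#include <mpi.h>
#include <cuda_runtime.h>

void send_from_managed(int rank, int n) {
    float *buf;
    cudaMallocManaged(&buf, (size_t)n * sizeof(float));   /* single allocation usable by CPU and GPU */

    /* ... launch kernels that produce data in buf ... */
    cudaDeviceSynchronize();

    if (rank == 0)
        MPI_Send(buf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);   /* managed pointer passed as-is */
    else if (rank == 1)
        MPI_Recv(buf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    cudaFree(buf);
}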
[Figure: 2D stencil halo-exchange time (ms) vs. total dimension size (1 B - 16 KB) for halo width = 1, comparing Device and Managed buffers]
Enhanced Support for Intra-node Managed Memory
● CUDA Managed memory => no memory pin-down
  ● No IPC support for intra-node communication
  ● No GDR support for inter-node communication
● Initial and basic support in MVAPICH2-GDR
  ● For both intra- and inter-node, "pipeline through" host memory
● Enhanced intra-node managed memory support uses IPC
  ● Double-buffering, pair-wise IPC-based scheme (a generic double-buffering sketch follows)
  ● Brings IPC performance to managed memory
  ● High performance and high productivity
  ● 2.5X improvement in bandwidth
● Available since MVAPICH2-GDR 2.2RC1
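For intuition only, a generic double-buffering pattern (not the MVAPICH2-GDR internals): a large transfer is split into chunks issued on two alternating CUDA streams so one chunk's copy can overlap the handling of the previous one:

/* Conceptual sketch of chunked, double-buffered pipelining on two CUDA streams. */
#include <cuda_runtime.h>

void pipelined_copy(char *dst, const char *src, size_t total, size_t chunk) {
    cudaStream_t s[2];
    cudaStreamCreate(&s[0]);
    cudaStreamCreate(&s[1]);

    for (size_t off = 0, i = 0; off < total; off += chunk, ++i) {
        size_t len = (total - off < chunk) ? (total - off) : chunk;
        cudaMemcpyAsync(dst + off, src + off, len,
                        cudaMemcpyDefault, s[i % 2]);          /* alternate the two streams */
    }
    cudaStreamSynchronize(s[0]);
    cudaStreamSynchronize(s[1]);
    cudaStreamDestroy(s[0]);
    cudaStreamDestroy(s[1]);
}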
[Figures: intra-node managed-memory latency (4 KB - 4 MB) and bandwidth (32 KB - 2 MB) for the Enhanced design vs. MV2-GDR 2.2b; up to 2.5X higher bandwidth]
Streaming Applications
• Streaming applications on HPC systems
  1. Communication (MPI)
     • Broadcast-type operations
  2. Computation (CUDA)
     • Multiple GPU nodes as workers
[Figure: a data source streams in real time to a sender that uses HPC resources for real-time analytics; data-streaming-like broadcast operations deliver the data to multiple worker nodes, each with CPUs and GPUs]
Hardware Multicast-based Broadcast
• For GPU-resident data, using
  – GPUDirect RDMA (GDR)
  – InfiniBand Hardware Multicast (IB-MCAST)
• Overhead
  – IB UD limit
  – GDR limit
[Figure: the source's IB HCA gathers the header from the CPU and reads the data from the GPU (1. IB Gather + GDR Read), the IB switch replicates the message (2. IB Hardware Multicast), and each destination's HCA scatters the header to the CPU and writes the data to the GPU (3. IB Scatter + GDR Write)]
A. Venkatesh, H. Subramoni, K. Hamidouche, and D. K. Panda, “A High Performance Broadcast Design with Hardware Multicast and GPUDirect RDMA for Streaming Applications on InfiniBand Clusters,” in HiPC 2014, Dec 2014.
Optimized Broadcast Send: MPI_Bcast(d_out, ...)
• Preparing an intermediate buffer (im_buf)
  – Page-locked (pinned) host buffer => fast device-to-host data movement
  – Allocated at initialization phase => low overhead
• Streaming data through the host
  – Fine-tuned chunked data
  – Asynchronous copy operations
  => Three-stage pipeline: 1. data preparation, 2. IB gather, 3. IB hardware multicast
A streaming-worker sketch follows the citation below.
C.-H. Chu, X. Lu, A. A. Awan, H. Subramoni, J. Hashmi, B. Elton and D. K. Panda., "Efficient and Scalable Multi-Source Streaming Broadcast on GPU Clusters for Deep Learning, " ICPP 2017, Aug 14-17, 2017.
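From the application side, the streaming path is simply repeated MPI_Bcast calls on a GPU buffer. A minimal worker-side sketch (chunk size and loop structure are illustrative):

/* Sketch: a streaming worker receiving broadcast chunks directly into GPU memory. */
#include <mpi.h>
#include <cuda_runtime.h>

void stream_worker(MPI_Comm comm, int root, int n_chunks, int chunk_elems) {
    float *d_out;
    cudaMalloc((void **)&d_out, (size_t)chunk_elems * sizeof(float));

    for (int i = 0; i < n_chunks; ++i) {
        MPI_Bcast(d_out, chunk_elems, MPI_FLOAT, root, comm);   /* delivered via the IB-MCAST + GDR designs */
        /* ... launch analysis kernels on d_out ... */
    }
    cudaFree(d_out);
}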
Optimized Broadcast Receive: MPI_Bcast(d_in, ...)
• Zero-copy broadcast receive
  – Pre-posted user buffer (d_in)
  – Avoids additional data movement
  – Leverages IB Scatter and GDR features
  => Low latency
  => Frees up PCIe resources for applications
[Figure: the IB switch multicasts to each destination, whose HCA scatters the header to the CPU and GDR-writes the data directly into the pre-posted GPU buffer d_in]
C.-H. Chu, X. Lu, A. A. Awan, H. Subramoni, J. Hashmi, B. Elton and D. K. Panda., "Efficient and Scalable Multi-Source Streaming Broadcast on GPU Clusters for Deep Learning, " ICPP 2017, Aug 14-17, 2017.
Broadcast on Multi-GPU Systems
• Proposed intra-node topology-aware broadcast
  – CUDA Inter-Process Communication (IPC); a conceptual IPC sketch follows below
[Figure: IB hardware multicast steps deliver the data to one GPU per node; within each node, topology-aware cudaMemcpy(Device <-> Device) steps distribute it to GPU 0 ... GPU N]
C.-H. Chu, K. Hamidouche, H. Subramoni, A. Venkatesh, B. Elton, and D. K. Panda, "Designing High Performance Heterogeneous Broadcast for Streaming Applications on GPU Clusters, " SBAC-PAD'16, Oct. 26-28, 2016.
Available in MVAPICH2-GDR 2.3a
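For reference, a conceptual sketch of the CUDA IPC mechanism the intra-node steps build on (not the library's internal code): one process exports a device allocation, and a peer on the same node maps it and copies device-to-device.

/* Exporter: obtain a handle for a device allocation (the handle can be sent to a peer, e.g., via MPI). */
#include <cuda_runtime.h>

cudaIpcMemHandle_t export_buffer(void *d_buf) {
    cudaIpcMemHandle_t handle;
    cudaIpcGetMemHandle(&handle, d_buf);
    return handle;
}

/* Importer: map the peer's buffer and copy it into a local device buffer. */
void import_and_copy(cudaIpcMemHandle_t handle, void *d_local, size_t bytes) {
    void *d_remote;
    cudaIpcOpenMemHandle(&d_remote, handle, cudaIpcMemLazyEnablePeerAccess);
    cudaMemcpy(d_local, d_remote, bytes, cudaMemcpyDeviceToDevice);
    cudaIpcCloseMemHandle(d_remote);
}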
Streaming Benchmark @ CSCS (88 GPUs)
• IB-MCAST + GDR + topology-aware IPC-based schemes
  – Up to 58% and 79% latency reduction for small and large messages, respectively
[Figures: broadcast latency vs. message size (1 B - 16 KB and 32 KB - 4 MB) for MCAST-GDR-OPT vs. MCAST-GDR; 58% lower for small messages and 79% lower for large messages]
C.-H. Chu, K. Hamidouche, H. Subramoni, A. Venkatesh, B. Elton, and D. K. Panda, "Designing High Performance Heterogeneous Broadcast for Streaming Applications on GPU Clusters, " SBAC-PAD'16, Oct. 26-28, 2016.
Application-based Evaluation: CUDA-Aware CNTK
• @ RI2 cluster, 16 GPUs, 1 GPU/node
  – CUDA-Aware Microsoft Cognitive Toolkit (CA-CNTK) [2]
[Figure: speedup (higher is better) of MV2-GDR-Knomial, MV2-GDR-Ring, and MV2-MCAST-GDR-Opt for AlexNet, VGG, and ResNet-50 on 8 and 16 GPUs]
• Reduces latency by up to 24%, 16%, and 18% for the AlexNet, VGG, and ResNet models, respectively
• Higher improvement can be observed for larger system sizes
[1] C.-H. Chu, X. Lu, A. A. Awan, H. Subramoni, J. Hashmi, B. Elton, and D. K. Panda, Efficient and Scalable Multi-Source Streaming Broadcast on GPU Clusters for Deep Learning, ICPP'17.
[2] D. S. Banerjee, K. Hamidouche, and D. K. Panda, Re-Designing CNTK Deep Learning Framework on Modern GPU Enabled Clusters, IEEE CloudCom'16.
Deep Learning: New Challenges for MPI Runtimes
• Deep Learning frameworks are a different game altogether
  – Unusually large message sizes (order of megabytes)
  – Most communication based on GPU buffers
• Existing state-of-the-art
  – cuDNN, cuBLAS, NCCL --> scale-up performance
  – NCCL2, CUDA-Aware MPI --> scale-out performance
    • For small and medium message sizes only!
• Proposed: Can we co-design the MPI runtime (MVAPICH2-GDR) and the DL framework (Caffe) to achieve both?
  – Efficient overlap of computation and communication
  – Efficient large-message communication (reductions); a GPU-buffer Allreduce sketch follows the figure note below
  – What application co-designs are needed to exploit communication-runtime co-designs?
[Figure: scale-up vs. scale-out performance quadrant placing cuDNN, cuBLAS, NCCL, and NCCL2 high on scale-up, gRPC and Hadoop high on scale-out, MPI in between, and the proposed co-designs targeting both]
A. A. Awan, K. Hamidouche, J. M. Hashmi, and D. K. Panda, S-Caffe: Co-designing MPI Runtimes and Caffe for Scalable Deep Learning on Modern GPU Clusters. In Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '17)
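The large-message pattern these co-designs target is a gradient reduction issued directly on GPU buffers; a minimal sketch (the buffer and count are illustrative):

/* Sketch: in-place summation of GPU-resident gradients across all ranks. */
#include <mpi.h>
#include <cuda_runtime.h>

void allreduce_gradients(float *d_grad, int count) {
    /* d_grad is a cudaMalloc'd device pointer; DL messages are typically tens of MB. */
    MPI_Allreduce(MPI_IN_PLACE, d_grad, count, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);
}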
Large Message Optimized Collectives for Deep Learning
• MV2-GDR provides optimized collectives for large message sizes
• Optimized Reduce, Allreduce, and Bcast
• Good scaling with large numbers of GPUs
• Available since MVAPICH2-GDR 2.2GA
[Figures: latency (ms) of Reduce on 192 GPUs and of Allreduce and Bcast on 64 GPUs for 2-128 MB messages, and latency of Reduce (64 MB), Allreduce (128 MB), and Bcast (128 MB) as the number of GPUs scales]
MVAPICH2: Allreduce Comparison with Baidu and OpenMPI
• Initial evaluation shows promising performance gains for MVAPICH2-GDR 2.3a*
• 8 GPUs (2 nodes): MVAPICH2-GDR vs. Baidu-Allreduce and OpenMPI 3.0
[Figures: Allreduce latency vs. message size from 4 bytes up to 512 MB; annotated highlights: ~10X better, MV2 is ~20% better than Baidu, ~3.5X better, and OpenMPI is ~4X slower than Baidu]
*Available with MVAPICH2-GDR 2.3a
MVAPICH2: Allreduce Comparison with Baidu and OpenMPI
• 16 GPUs (4 nodes): MVAPICH2-GDR vs. Baidu-Allreduce and OpenMPI 3.0
[Figures: Allreduce latency vs. message size from 4 bytes up to 512 MB; annotated highlights: ~30X better, MV2 is ~2X better than Baidu, ~10X better, OpenMPI is ~5X slower than Baidu, and ~4X better]
*Available with MVAPICH2-GDR 2.3a
OSU-Caffe: Scalable Deep Learning
• Caffe: a flexible and layered Deep Learning framework
• Benefits and weaknesses
  – Multi-GPU training within a single node
  – Performance degradation for GPUs across different sockets
  – Limited scale-out
• OSU-Caffe: MPI-based parallel training
  – Enables scale-up (within a node) and scale-out (across multi-GPU nodes)
  – Scale-out on 64 GPUs for training the CIFAR-10 network on the CIFAR-10 dataset
  – Scale-out on 128 GPUs for training the GoogLeNet network on the ImageNet dataset
[Figure: GoogLeNet (ImageNet) training time (seconds) on 8-128 GPUs comparing Caffe, OSU-Caffe (1024), and OSU-Caffe (2048); the plain Caffe configuration is marked as an invalid use case at larger GPU counts]
OSU-Caffe is publicly available from http://hidl.cse.ohio-state.edu/
Support on OpenPOWER will be available soon
Point-to-Point Host-level Performance on OpenPOWER
[Figures: MVAPICH2-GDR 2.3a intra-node latency (~0.5 μs), intra-node bandwidth (~30 GB/s), and intra-node bi-directional bandwidth (~60 GB/s); inter-node latency (~2.3 μs), inter-node bandwidth (~12 GB/s), and inter-node bi-directional bandwidth (~24 GB/s)]
Platform: OpenPOWER (Power8-ppc64le) CPU using Mellanox EDR (MT4115) HCA
Device-to-Device Performance on OpenPOWER (NVLink + Pascal)
[Figures: intra-node D2D latency (small and large messages, intra-socket NVLink vs. inter-socket), intra-node D2D bandwidth, inter-node D2D latency, and inter-node D2D bandwidth]
• Intra-node latency: 14.6 us (without GPUDirect RDMA); intra-node bandwidth: 33.9 GB/sec (NVLink)
• Inter-node latency: 23.8 us (without GPUDirect RDMA); inter-node bandwidth: 11.9 GB/sec (EDR)
Platform: OpenPOWER (ppc64le) nodes equipped with a dual-socket CPU, 4 Pascal P100-SXM GPUs, and EDR InfiniBand interconnect
Available in MVAPICH2-GDR 2.3a
Container Support
• Increasing trend to provide container support for MPI libraries
  • Ease of build
  • Portability
  • Reproducibility
• MVAPICH2-GDR 2.3a provides container (Docker) support
• More details are available in the MVAPICH2-GDR User Guide
  • http://mvapich.cse.ohio-state.edu/userguide/gdr/2.3a/
• Synergistic with the HPC-Container-Maker and hpccm efforts by NVIDIA (https://github.com/NVIDIA/hpc-container-maker)
MVAPICH2-GDR on Container with Negligible Overhead
[Figures: GPU-GPU inter-node latency, bandwidth, and bi-bandwidth vs. message size for Docker vs. native execution; the Docker and native curves are nearly identical]
Platform: MVAPICH2-GDR 2.3a, Intel Haswell (E5-2687W @ 3.10 GHz) node with 20 cores, NVIDIA Volta V100 GPU, Mellanox ConnectX-4 EDR HCA, CUDA 9.0, Mellanox OFED 4.0 with GPUDirect RDMA
Scalable Host-based Collectives with CMA (Intra-node, Reduce & Alltoall)
[Figures: intra-node (Nodes=1, PPN=20) Reduce and Alltoall latency vs. message size (4 B - 4 KB and 8 KB - 1 MB) for MVAPICH2-GDR-Next, SpectrumMPI-10.1.0.2, and OpenMPI-3.0.0; per-panel speedups include 5.2X, 3.6X, 3.3X, and 3.2X for small messages and 3.2X, 1.4X, 1.3X, and 1.2X for large messages]
• Up to 5X and 3X performance improvement by MVAPICH2 for small and large messages, respectively
Scalable Host-based Collectives with CMA (Intra-node, Gather & Scatter)
[Figures: intra-node (Nodes=1, PPN=20) Gather and Scatter latency vs. message size for MVAPICH2-GDR-Next, SpectrumMPI-10.1.0.2, and OpenMPI-3.0.0; per-panel speedups include 24.5X, 15.6X, 8.3X, 6.7X, 6.4X, 5.6X, 4.9X, and 1.5X]
• Up to 24X and 15X performance improvement by MVAPICH2 for medium and large messages, respectively
Scalable Host-based Collectives with CMA (Multi-node, Reduce & Alltoall)
[Figures: multi-node (Nodes=4, PPN=20) Reduce and Alltoall latency vs. message size for MVAPICH2-GDR-Next and OpenMPI-3.0.0; per-panel speedups include 12.4X, 8.5X, and 1.9X]
• Up to 12.4X and 8.5X performance improvement by MVAPICH2 for small and large messages, respectively
Scalable Host-based Collectives with CMA (Multi-node, Gather & Scatter)
[Figures: multi-node (Nodes=4, PPN=20) Gather and Scatter latency vs. message size for MVAPICH2-GDR-Next and OpenMPI-3.0.0; per-panel speedups include 17.8X, 3.9X, and 1.6X]
• Up to 17.8X and 3.9X improvement with MVAPICH2-GDR-Next for medium and large messages, respectively
Optimized All-Reduce with XPMEM (Nodes=1 and 2, PPN=20)
[Figures: Allreduce latency for 16 KB - 2 MB messages at Nodes=1, PPN=20 and Nodes=2, PPN=20 for MVAPICH2-GDR-Next, SpectrumMPI-10.1.0, and OpenMPI-3.0.0; annotated gains include 2X, 3X, 3.3X, 4X, 34%, and 48%]
• Optimized MPI All-Reduce design in MVAPICH2
  – Up to 2X performance improvement over Spectrum MPI and 4X over OpenMPI for intra-node
Optimized runtime parameters: MV2_CPU_BINDING_POLICY=hybrid MV2_HYBRID_BINDING_POLICY=bunch
Optimized All-Reduce with XPMEM (Nodes=3 and 4, PPN=20)
[Figures: Allreduce latency for 16 KB - 2 MB messages at Nodes=3, PPN=20 and Nodes=4, PPN=20 for MVAPICH2-GDR-Next and OpenMPI-3.0.0; annotated gains include 2X, 42%, 35%, and 27%]
• Optimized MPI All-Reduce design in MVAPICH2
  – Up to 2X performance improvement over OpenMPI for inter-node
Optimized runtime parameters: MV2_CPU_BINDING_POLICY=hybrid MV2_HYBRID_BINDING_POLICY=bunch
Optimized Reduce with XPMEM (Nodes=1 and 2, PPN=20)
[Figures: Reduce latency vs. message size for MVAPICH2-GDR-Next, SpectrumMPI-10.1.0, and OpenMPI-3.0.0; annotated gains include 26%, 27%, 37%, 93%, 1.9X, 2.8X, 3.1X, and 5.6X]
• Optimized MPI Reduce design in MVAPICH2
  – Up to 3.1X performance improvement over OpenMPI at scale and up to 37% over Spectrum MPI intra-node
Optimized runtime parameters: MV2_CPU_BINDING_POLICY=hybrid MV2_HYBRID_BINDING_POLICY=bunch
Optimized Reduce with XPMEM (Nodes=3 and 4, PPN=20)
[Figures: Reduce latency vs. message size for MVAPICH2-GDR-Next and OpenMPI-3.0.0; annotated gains include 4X, 6X, and 7X]
• Optimized MPI Reduce design in MVAPICH2
  – Up to 7X performance improvement over OpenMPI at scale
Optimized runtime parameters: MV2_CPU_BINDING_POLICY=hybrid MV2_HYBRID_BINDING_POLICY=bunch
MVAPICH2-GDR vs. NCCL2 – Broadcast Operation
• Optimized designs in MVAPICH2-GDR 2.3b* offer better/comparable performance for most cases
• MPI_Bcast (MVAPICH2-GDR) vs. ncclBcast (NCCL2) on 16 K-80 GPUs
[Figures: broadcast latency vs. message size with 1 GPU/node and 2 GPUs/node; MVAPICH2-GDR is up to ~10X better in one configuration and ~4X better in the other]
Platform: Intel Xeon (Broadwell) nodes equipped with a dual-socket CPU, 2 K-80 GPUs, and EDR InfiniBand interconnect
*Will be available with the upcoming MVAPICH2-GDR 2.3b
MVAPICH2-GDR vs. NCCL2 – Reduce Operation
• Optimized designs in MVAPICH2-GDR 2.3b* offer better/comparable performance for most cases
• MPI_Reduce (MVAPICH2-GDR) vs. ncclReduce (NCCL2) on 16 GPUs
[Figures: reduce latency vs. message size for small (4 B - 64 KB) and larger messages; MVAPICH2-GDR is up to ~2.5X better for small and ~5X better for large messages]
Platform: Intel Xeon (Broadwell) nodes equipped with a dual-socket CPU, 1 K-80 GPU, and EDR InfiniBand interconnect
*Will be available with the upcoming MVAPICH2-GDR 2.3b
MVAPICH2-GDR vs. NCCL2 – Allreduce Operation
• Optimized designs in MVAPICH2-GDR 2.3b* offer better/comparable performance for most cases
• MPI_Allreduce (MVAPICH2-GDR) vs. ncclAllreduce (NCCL2) on 16 GPUs
[Figures: allreduce latency vs. message size for small (4 B - 64 KB) and larger messages; MVAPICH2-GDR is up to ~3X better for small and ~1.2X better for large messages]
Platform: Intel Xeon (Broadwell) nodes equipped with a dual-socket CPU, 1 K-80 GPU, and EDR InfiniBand interconnect
*Will be available with the upcoming MVAPICH2-GDR 2.3b
Out-of-Core Deep Neural Network Training with Caffe
• Large DNNs cannot be trained on GPUs due to memory limitations!
  – ResNet-50 is the state-of-the-art DNN architecture for image recognition, but current frameworks can only go up to a small batch size of 45
  – Next-generation models like Neural Machine Translation (NMT) are extremely large, consist of billions of parameters, and require even more memory
– Can we design Out-of-core DNN training support using new software features in CUDA 8/9 and hardware mechanisms in Pascal/Volta GPUs?
• General intuition is that managed allocations “will be” slow!
– The proposed framework called OC-Caffe (Out-of-Core Caffe) shows the potential of managed memory designs that can provide performance with negligible/no overhead.
– In addition to Out-of-core Training support, productivity can be greatly enhanced in terms of DL framework design by using the new Unified Memory features.
Submission Under Review
Performance Trends for OC-Caffe
• Comparable performance to Caffe-Default for "in-memory" batch sizes
• OC-Caffe-Opt: up to 5X improvement over Intel-MKL-optimized CPU-based AlexNet training on a Volta V100 GPU with CUDA 9 and cuDNN 7
[Figures: training performance for trainable (in-memory) and out-of-core (over-subscription) batch sizes]
OC-Caffe will be released by the HiDL [email protected]
Submission Under Review
Performance Trends for OC-Caffe
• OC-Caffe-Opt: up to 80% better than Intel-optimized CPU Caffe for ResNet-50 training on the Volta V100 GPU with CUDA 9 and cuDNN 7
• OC-Caffe allows efficient scale-up on a DGX-1 system with Pascal P100 GPUs with CUDA 9 and cuDNN 7
[Figures: out-of-core (over-subscription) performance and scale-up on DGX-1]
Submission Under Review
Can Unified Memory also Simplify Framework Design?
• Enhanced and simplified the Caffe framework
• Simplified the Layer class and all inherited classes (e.g., ConvolutionLayer)
• Removed almost all of the memory allocation, movement, and state management code in the SyncedMemory and Blob classes
• An estimated 3,000 lines of repetitive and error-prone code can be eliminated
• Developers can add new inherited Layer classes in a much simpler manner
  – E.g., implement a single Forward() function instead of separate forward_gpu() and forward_cpu() functions (an illustrative sketch follows)

Existing Design:
class ConvolutionLayer {
 public:
  void cpu_data();           void cpu_diff();
  void gpu_data();           void gpu_diff();
  void mutable_cpu_data();   void mutable_cpu_diff();
  void mutable_gpu_data();   void mutable_gpu_diff();
  void Forward_cpu();        void Forward_gpu();
  void forward_cpu_gemm();   void forward_gpu_gemm();
  void forward_cpu_bias();   void forward_gpu_bias();
  void Backward_cpu();       void Backward_gpu();
  void backward_cpu_gemm();  void backward_gpu_gemm();
  void backward_cpu_bias();  void backward_gpu_bias();
};

Proposed High-Productivity Design based on Managed Memory Allocation and Data Movement:
class ConvolutionLayer {
 public:
  void data();               void diff();
  void mutable_data();       void mutable_diff();
  void Forward();            void forward_gemm();   void forward_bias();
  void Backward();           void backward_gemm();  void backward_bias();
};

Submission Under Review
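An illustrative sketch (not the OC-Caffe source) of why managed memory collapses the cpu_/gpu_ split: one allocation and one Forward() serve both processors.

/* Sketch: a Blob backed by cudaMallocManaged and a layer with a single Forward(). */
#include <cuda_runtime.h>
#include <algorithm>
#include <cstddef>

class Blob {
 public:
    explicit Blob(size_t n) : count_(n) { cudaMallocManaged(&data_, n * sizeof(float)); }
    ~Blob() { cudaFree(data_); }
    float *data() { return data_; }            /* same pointer is valid on host and device */
    size_t count() const { return count_; }
 private:
    float *data_ = nullptr;
    size_t count_ = 0;
};

class ReLULayer {
 public:
    void Forward(Blob &bottom, Blob &top) {     /* one implementation, no _cpu/_gpu variants */
        for (size_t i = 0; i < bottom.count(); ++i)
            top.data()[i] = std::max(0.0f, bottom.data()[i]);
    }
};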
Conclusions
• MVAPICH2-GDR MPI library optimizes MPI communication on InfiniBand and RoCE (V1 and V2) clusters with GPUs on both x86 and OpenPOWER platforms
• Provides optimized designs for point-to-point two-sided and one-sided communication, datatype processing and collective operations
• Takes advantage of CUDA features like IPC and the GPUDirect RDMA family of technologies
• Allows flexible solutions for streaming applications with GPUs
• Provides optimized solutions for High-Performance Deep Learning (HiDL) frameworks and applications
Join Us for a Connect with the Experts Session
• CE8118 - Connect with the Experts: MVAPICH for GPU
• Session Schedule
  • Wednesday, Mar 28, 3:00 PM - 4:00 PM, LL Pod A
• Session Speakers
  • Dhabaleswar Panda - Professor and University Distinguished Scholar, OSU
  • Davide Rossetti - Senior Software Engineer, NVIDIA
  • Hari Subramoni - Research Scientist, OSU
• Session Description
  • MVAPICH is a well-established MPI library supporting GPUDirect. Meet with the experts to discuss how you can optimize your HPC and AI applications using GPUDirect RDMA with MVAPICH-GDR, and learn about MVAPICH-GDS, the newest feature, which supports GPUDirect Async.
Personnel Acknowledgments
Current Students (Graduate)
– A. Awan (Ph.D.)
– R. Biswas (M.S.)
– M. Bayatpour (Ph.D.)
– S. Chakraborthy (Ph.D.)– C.-H. Chu (Ph.D.)
– S. Guganani (Ph.D.)
Past Students – A. Augustine (M.S.)
– P. Balaji (Ph.D.)
– S. Bhagvat (M.S.)
– A. Bhat (M.S.)
– D. Buntinas (Ph.D.)
– L. Chai (Ph.D.)
– B. Chandrasekharan (M.S.)
– N. Dandapanthula (M.S.)
– V. Dhanraj (M.S.)
– T. Gangadharappa (M.S.)
– K. Gopalakrishnan (M.S.)
– R. Rajachandrasekar (Ph.D.)
– G. Santhanaraman (Ph.D.)
– A. Singh (Ph.D.)
– J. Sridhar (M.S.)
– S. Sur (Ph.D.)
– H. Subramoni (Ph.D.)
– K. Vaidyanathan (Ph.D.)
– A. Vishnu (Ph.D.)
– J. Wu (Ph.D.)
– W. Yu (Ph.D.)
Past Research Scientist– K. Hamidouche
– S. Sur
Past Post-Docs– D. Banerjee
– X. Besseron
– H.-W. Jin
– W. Huang (Ph.D.)
– W. Jiang (M.S.)
– J. Jose (Ph.D.)
– S. Kini (M.S.)
– M. Koop (Ph.D.)
– K. Kulkarni (M.S.)
– R. Kumar (M.S.)
– S. Krishnamoorthy (M.S.)
– K. Kandalla (Ph.D.)
– M. Li (Ph.D.)
– P. Lai (M.S.)
– J. Liu (Ph.D.)
– M. Luo (Ph.D.)
– A. Mamidala (Ph.D.)
– G. Marsh (M.S.)
– V. Meshram (M.S.)
– A. Moody (M.S.)
– S. Naravula (Ph.D.)
– R. Noronha (Ph.D.)
– X. Ouyang (Ph.D.)
– S. Pai (M.S.)
– S. Potluri (Ph.D.)
– J. Hashmi (Ph.D.)
– H. Javed (Ph.D.)– P. Kousha (Ph.D.)
– D. Shankar (Ph.D.)
– H. Shi (Ph.D.)
– J. Zhang (Ph.D.)
– J. Lin
– M. Luo
– E. Mancini
Current Research Scientists– X. Lu
– H. Subramoni
Past Programmers– D. Bureddy
– J. Perkins
Current Research Specialist– J. Smith
– M. Arnold
– S. Marcarelli
– J. Vienne
– H. Wang
Current Post-doc– A. Ruhela
Current Students (Undergraduate)– N. Sarkauskas (B.S.)
Thank You!
[email protected], [email protected]
Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
The High-Performance MPI/PGAS Project: http://mvapich.cse.ohio-state.edu/
The High-Performance Deep Learning Project: http://hidl.cse.ohio-state.edu/