Latest Advances in MVAPICH2 MPI Library for NVIDIA GPU Clusters with InfiniBand

Dhabaleswar K. (DK) Panda, The Ohio State University
E-mail: [email protected]   http://www.cse.ohio-state.edu/~panda

Presentation at GTC 2014
Current and Next Generation HPC Systems and Applications

• Growth of High Performance Computing (HPC)
  – Growth in processor performance
    • Chip density doubles every 18 months
  – Growth in commodity networking
    • Increase in speed/features + reducing cost
  – Growth in accelerators (NVIDIA GPUs)
Outline

• Communication on InfiniBand Clusters with GPUs
• MVAPICH2-GPU with GPUDirect-RDMA (GDR)
  – Two-sided Communication
  – One-sided Communication
  – MPI Datatype Processing
• More Optimizations
• MPI and OpenACC
• Ongoing Work
MVAPICH2-GPU: CUDA-Aware MPI

• Before CUDA 4: Additional copies
  – Low performance and low productivity
• After CUDA 4: Host-based pipeline
  – Unified Virtual Addressing (UVA)
  – Pipeline CUDA copies with IB transfers
  – High performance and high productivity
• After CUDA 5.5: GPUDirect-RDMA support
  – GPU-to-GPU direct transfer
  – Bypass the host memory
  – Hybrid design to avoid PCIe bottlenecks

[Diagram: CPU, chipset, GPU and GPU memory connected to an InfiniBand adapter]

A minimal usage sketch of CUDA-aware point-to-point MPI follows below.
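To make the CUDA-aware interface concrete, here is a minimal sketch (not from the talk) of a point-to-point exchange in which device pointers are handed directly to MPI; a CUDA-aware library such as MVAPICH2 then chooses the pipelined or GPUDirect-RDMA path internally. The buffer size and ranks are illustrative assumptions.

/* Minimal sketch: CUDA-aware MPI passes device pointers straight to MPI calls.
 * Assumes a CUDA-aware MPI library (e.g. MVAPICH2 built with CUDA support).
 * Error checking omitted for brevity. */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    int rank, n = 1 << 20;
    float *d_buf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    cudaMalloc((void **)&d_buf, n * sizeof(float));  /* buffer in GPU memory */

    if (rank == 0)
        MPI_Send(d_buf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(d_buf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}

Without a CUDA-aware MPI, the application would have to stage the data through a host buffer with cudaMemcpy around every MPI call, which is exactly the "additional copies" case above.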
Data Movement on GPU Clusters

[Diagram: Node 0 and Node 1, each with CPUs connected over QPI, GPUs attached via PCIe, host memory buffers, and an IB adapter]

1. Intra-GPU
2. Intra-Socket GPU-GPU
3. Inter-Socket GPU-GPU
4. Inter-Node GPU-GPU
5. Intra-Socket GPU-Host
6. Inter-Socket GPU-Host
7. Inter-Node GPU-Host
8. Inter-Node GPU-GPU with IB adapter on remote socket
   ... and more

• GPUs are connected as PCIe devices – flexibility, but also complexity
• Each path calls for different schemes: Shared_mem, IPC, GDR, pipelining, ...
• It is critical for the runtime to optimize data movement while hiding this complexity
MVAPICH2/MVAPICH2-X Software

• High-performance, open-source MPI library for InfiniBand, 10GigE/iWARP and RDMA over Converged Enhanced Ethernet (RoCE)
  – MVAPICH (MPI-1) and MVAPICH2 (MPI-2.2 and MPI-3.0), available since 2002
  – MVAPICH2-X (MPI + PGAS), available since 2012
  – Support for NVIDIA GPUs, available since 2011
  – Used by more than 2,150 organizations (HPC centers, industry and universities) in 72 countries
  – More than 205,000 downloads directly from the OSU site
  – Empowering many TOP500 clusters
    • 7th-ranked 519,640-core cluster (Stampede) at TACC
    • 11th-ranked 74,358-core cluster (Tsubame 2.0) at Tokyo Institute of Technology
    • 16th-ranked 96,192-core cluster (Pleiades) at NASA
    • 105th-ranked 16,896-core cluster (Keeneland) at GaTech, and many others . . .
  – Available with the software stacks of many IB, HSE and server vendors, including Linux distros (RedHat and SuSE)
  – http://mvapich.cse.ohio-state.edu
Outline

• Communication on InfiniBand Clusters with GPUs
• MVAPICH2-GPU with GPUDirect-RDMA (GDR)
  – Two-sided Communication
  – One-sided Communication
  – MPI Datatype Processing
• More Optimizations
• MPI and OpenACC
• Ongoing Work
GPUDirect RDMA (GDR) with CUDA

• Hybrid design using GPUDirect RDMA
  – GPUDirect RDMA and host-based pipelining
  – Alleviates P2P bandwidth bottlenecks on Sandy Bridge and Ivy Bridge
• Support for communication using multi-rail
• Support for Mellanox Connect-IB and ConnectX VPI adapters
• Support for RoCE with Mellanox ConnectX VPI adapters

[Diagram: IB adapter, system memory, CPU, chipset, GPU and GPU memory]
  SNB E5-2670:   P2P write: 5.2 GB/s, P2P read: < 1.0 GB/s
  IVB E5-2680V2: P2P write: 6.4 GB/s, P2P read: 3.5 GB/s

S. Potluri, K. Hamidouche, A. Venkatesh, D. Bureddy and D. K. Panda, "Efficient Inter-node MPI Communication using GPUDirect RDMA for InfiniBand Clusters with NVIDIA GPUs", Int'l Conference on Parallel Processing (ICPP '13)
Performance of MVAPICH2 with GPUDirect-RDMA: Latency

GPU-GPU internode MPI latency, comparing 1-Rail, 2-Rail, 1-Rail-GDR and 2-Rail-GDR configurations:

[Plot: Small Message Latency – Latency (us) vs. Message Size (1 B to 4 KB); GDR lowers small-message latency by 67%, down to 5.49 usec]
[Plot: Large Message Latency – Latency (us) vs. Message Size (8 KB to 2 MB); about 10% improvement with GDR]

Based on MVAPICH2-2.0b; Intel Ivy Bridge (E5-2680 v2) node with 20 cores; NVIDIA Tesla K40c GPU; Mellanox Connect-IB Dual-FDR HCA; CUDA 5.5; Mellanox OFED 2.0 with GPUDirect-RDMA patch
Performance of MVAPICH2 with GPUDirect-RDMA: Bandwidth

GPU-GPU internode MPI uni-directional bandwidth, comparing 1-Rail, 2-Rail, 1-Rail-GDR and 2-Rail-GDR configurations:

[Plot: Small Message Bandwidth – Bandwidth (MB/s) vs. Message Size (1 B to 4 KB); roughly 5x higher with GDR]
[Plot: Large Message Bandwidth – Bandwidth (MB/s) vs. Message Size (8 KB to 2 MB); up to 9.8 GB/s with dual-rail GDR]

Based on MVAPICH2-2.0b; Intel Ivy Bridge (E5-2680 v2) node with 20 cores; NVIDIA Tesla K40c GPU; Mellanox Connect-IB Dual-FDR HCA; CUDA 5.5; Mellanox OFED 2.0 with GPUDirect-RDMA patch
Performance of MVAPICH2 with GPUDirect-RDMA: Bi-Bandwidth

GPU-GPU internode MPI bi-directional bandwidth, comparing 1-Rail, 2-Rail, 1-Rail-GDR and 2-Rail-GDR configurations:

[Plot: Small Message Bi-Bandwidth – Bi-Bandwidth (MB/s) vs. Message Size (1 B to 4 KB); roughly 4.3x higher with GDR]
[Plot: Large Message Bi-Bandwidth – Bi-Bandwidth (MB/s) vs. Message Size (8 KB to 2 MB); 19% improvement, reaching 19 GB/s]

Based on MVAPICH2-2.0b; Intel Ivy Bridge (E5-2680 v2) node with 20 cores; NVIDIA Tesla K40c GPU; Mellanox Connect-IB Dual-FDR HCA; CUDA 5.5; Mellanox OFED 2.0 with GPUDirect-RDMA patch
Application-level Benefits: AWP-ODC with MVAPICH2-GPU

• A widely used seismic modeling application, Gordon Bell finalist at SC 2010
• An initial version using MPI + CUDA for GPU clusters
• Takes advantage of CUDA-aware MPI; two nodes, 1 GPU/node, 64x32x32 problem size
• GPUDirect-RDMA delivers better performance with the newer architecture

[Chart: total execution time, MV2 vs. MV2-GDR; 11.6% improvement on Platform A and 15.4% on Platform B]
[Chart: computation vs. communication time breakdown; GDR reduces communication time by 21% on Platform A and 30% on Platform B]

Platform A: Intel Sandy Bridge + NVIDIA Tesla K20 + Mellanox ConnectX-3
Platform B: Intel Ivy Bridge + NVIDIA Tesla K40 + Mellanox Connect-IB

Based on MVAPICH2-2.0b, CUDA 5.5, Mellanox OFED 2.0 with GPUDirect-RDMA patch; two nodes, one GPU/node, one process/GPU
Continuous Enhancements for Improved Point-to-point Performance

GPU-GPU internode MPI latency:

[Plot: Small Message Latency – Latency (us) vs. Message Size (1 B to 16 KB), MV2-2.0b-GDR vs. Improved; 31% lower latency with the enhancements]

• Reduced synchronization overhead while avoiding expensive copies

Based on MVAPICH2-2.0b + enhancements; Intel Ivy Bridge (E5-2630 v2) node with 12 cores; NVIDIA Tesla K40c GPU; Mellanox Connect-IB Dual-FDR HCA; CUDA 5.5; Mellanox OFED 2.0 with GPUDirect-RDMA patch
Dynamic Tuning for Point-to-point Performance

GPU-GPU internode MPI performance, comparing latency-optimized (Opt_latency), bandwidth-optimized (Opt_bw) and dynamically tuned (Opt_dynamic) configurations:

[Plot: Latency – Latency (us) vs. Message Size (128 KB to 2 MB)]
[Plot: Bandwidth – Bi-Bandwidth (MB/s) vs. Message Size (1 B to 1 MB)]

Based on MVAPICH2-2.0b + enhancements; Intel Ivy Bridge (E5-2630 v2) node with 12 cores; NVIDIA Tesla K40c GPU; Mellanox Connect-IB Dual-FDR HCA; CUDA 5.5; Mellanox OFED 2.0 with GPUDirect-RDMA patch
Outline

• Communication on InfiniBand Clusters with GPUs
• MVAPICH2-GPU with GPUDirect-RDMA (GDR)
  – Two-sided Communication
  – One-sided Communication
  – MPI Datatype Processing
• More Optimizations
• MPI and OpenACC
• Conclusion
One-sided Communication

• Send/Recv semantics incur overheads
  – Distributed buffer information
  – Message matching
  – Additional copies or rendezvous exchange
• One-sided communication (see the sketch below)
  – Separates synchronization from communication
  – Maps directly onto RDMA semantics
  – Lower overheads and better overlap

Latency (half round trip) for 4-byte messages on Sandy Bridge nodes with FDR Connect-IB, in usec:

                  Host-Host   GPU-GPU
  IB send/recv       0.98       1.84
  MPI send/recv      1.25       6.95
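To illustrate the one-sided model above, here is a minimal sketch (not from the talk) of an MPI-3 put with flush-based completion on GPU buffers. It assumes a CUDA-aware MPI such as MVAPICH2-GDR, so device memory can back the window and serve as the origin buffer; the sizes and ranks are illustrative.

/* Minimal sketch: MPI_Put + MPI_Win_flush on a window backed by GPU memory.
 * Assumes a CUDA-aware MPI so device pointers are valid for the window base
 * and the local origin buffer. Error checking omitted for brevity. */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    int rank, n = 4096;
    float *d_win_buf, *d_src;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    cudaMalloc((void **)&d_win_buf, n * sizeof(float));
    cudaMalloc((void **)&d_src,     n * sizeof(float));

    /* Expose each rank's GPU buffer as an RMA window */
    MPI_Win_create(d_win_buf, n * sizeof(float), sizeof(float),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_lock_all(0, win);                  /* passive-target epoch */
    if (rank == 0) {
        MPI_Put(d_src, n, MPI_FLOAT, 1, 0, n, MPI_FLOAT, win);
        MPI_Win_flush(1, win);                 /* complete the put at rank 1 */
    }
    MPI_Win_unlock_all(win);

    MPI_Win_free(&win);
    cudaFree(d_win_buf);
    cudaFree(d_src);
    MPI_Finalize();
    return 0;
}

The Put+Flush and Put+ActiveSync curves on the next slide correspond to this flush-based completion and to active synchronization, respectively, compared against plain send/recv.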
MPI-3 RMA Support with GPUDirect RDMA

MPI-3 RMA provides flexible synchronization and completion primitives.

[Plot: Small Message Latency – Latency (usec) vs. Message Size (1 B to 4 KB) for Send-Recv, Put+ActiveSync and Put+Flush; Put+Flush stays below 2 usec]
[Plot: Small Message Rate – Message Rate (Msgs/sec) vs. Message Size (1 B to 4 KB) for Send-Recv and Put+ActiveSync]

Based on MVAPICH2-2.0b + extensions; Intel Sandy Bridge (E5-2670) node with 16 cores; NVIDIA Tesla K40c GPU; Mellanox Connect-IB Dual-FDR HCA; CUDA 5.5; Mellanox OFED 2.1 with GPUDirect-RDMA plugin
Communication Kernel Evaluation: 3D Stencil and Alltoall

[Plot: 3D Stencil with 16 GPU nodes – Latency (usec) vs. Message Size (1 B to 4 KB), Send/Recv vs. One-sided; up to 42% lower latency with one-sided]
[Plot: Alltoall with 16 GPU nodes – Latency (usec) vs. Message Size (1 B to 4 KB), Send/Recv vs. One-sided; up to 50% lower latency with one-sided]

Based on MVAPICH2-2.0b + extensions; Intel Sandy Bridge (E5-2670) node with 16 cores; NVIDIA Tesla K40c GPU; Mellanox Connect-IB Dual-FDR HCA; CUDA 5.5; Mellanox OFED 2.1 with GPUDirect-RDMA plugin
Outline

• Communication on InfiniBand Clusters with GPUs
• MVAPICH2-GPU with GPUDirect-RDMA (GDR)
  – Two-sided Communication
  – One-sided Communication
  – MPI Datatype Processing
• More Optimizations
• MPI and OpenACC
• Conclusion
Non-contiguous Data Exchange

• Multi-dimensional data
  – Row-based organization
  – Contiguous in one dimension
  – Non-contiguous in the other dimensions
• Halo data exchange
  – Duplicate the boundary
  – Exchange the boundary in each iteration

[Figure: halo data exchange between neighboring domains]
MPI Datatype Processing

• Comprehensive support
  – Targeted kernels for regular datatypes: vector, subarray, indexed_block (see the vector-datatype sketch below)
  – Generic kernels for all other irregular datatypes
• Separate non-blocking stream for kernels launched by the MPI library
  – Avoids stream conflicts with application kernels
• Flexible set of parameters for users to tune the kernels
  – Vector: MV2_CUDA_KERNEL_VECTOR_TIDBLK_SIZE, MV2_CUDA_KERNEL_VECTOR_YSIZE
  – Subarray: MV2_CUDA_KERNEL_SUBARR_TIDBLK_SIZE, MV2_CUDA_KERNEL_SUBARR_XDIM, MV2_CUDA_KERNEL_SUBARR_YDIM, MV2_CUDA_KERNEL_SUBARR_ZDIM
  – Indexed_block: MV2_CUDA_KERNEL_IDXBLK_XDIM
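As an illustration of where the vector-datatype kernels come into play, here is a minimal sketch (not from the talk) that exchanges one non-contiguous boundary column of a 2D field resident in GPU memory using MPI_Type_vector; with a CUDA-aware MVAPICH2 the strided packing is handled inside the library. The field dimensions and ranks are illustrative assumptions.

/* Minimal sketch: sending a strided column of an NX x NY row-major field that
 * lives in GPU memory. The application only describes the layout with an MPI
 * derived datatype; the library packs it (via its CUDA datatype kernels when
 * built with CUDA support). Error checking omitted for brevity. */
#include <mpi.h>
#include <cuda_runtime.h>

#define NX 512   /* rows    */
#define NY 512   /* columns */

int main(int argc, char **argv)
{
    int rank;
    double *d_field;
    MPI_Datatype column;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    cudaMalloc((void **)&d_field, (size_t)NX * NY * sizeof(double));

    /* One element per row, NX rows, stride of NY elements between them */
    MPI_Type_vector(NX, 1, NY, MPI_DOUBLE, &column);
    MPI_Type_commit(&column);

    if (rank == 0)        /* send the rightmost column to the neighbor */
        MPI_Send(d_field + (NY - 1), 1, column, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)   /* receive it into the leftmost (ghost) column */
        MPI_Recv(d_field, 1, column, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    MPI_Type_free(&column);
    cudaFree(d_field);
    MPI_Finalize();
    return 0;
}

The tuning parameters listed above (e.g. MV2_CUDA_KERNEL_VECTOR_YSIZE) control how the library's packing kernels process exactly this kind of vector layout.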
Application-Level Evaluation (LBMGPU-3D)

• LBM-CUDA (courtesy: Carlos Rosale, TACC)
  – Lattice Boltzmann Method for multiphase flows with large density ratios
  – 3D LBM-CUDA: one process/GPU per node, 512x512x512 data grid, up to 64 nodes
• Oakley cluster at OSC: two hex-core Intel Westmere processors, two NVIDIA Tesla M2070 GPUs, one Mellanox IB QDR MT26428 adapter and 48 GB of main memory

[Chart: 3D LBM-CUDA total execution time (sec) vs. number of GPUs (8, 16, 32, 64), MPI vs. MPI-GPU; improvements of 5.6%, 8.2%, 13.5% and 15.5%, respectively]
Outline

• Communication on InfiniBand Clusters with GPUs
• MVAPICH2-GPU with GPUDirect-RDMA (GDR)
  – Two-sided Communication
  – One-sided Communication
  – MPI Datatype Processing
• More Optimizations
• MPI and OpenACC
• Conclusion
More Optimizations!!!

• Topology detection
  – Avoids the inter-socket QPI bottlenecks
• Dynamic threshold selection between GDR and host-based transfers
• All these and other features will be available with the next release of MVAPICH2-GDR => coming very soon
Outline

• Communication on InfiniBand Clusters with GPUs
• MVAPICH2-GPU with GPUDirect-RDMA (GDR)
  – Two-sided Communication
  – One-sided Communication
  – MPI Datatype Processing
• More Optimizations
• MPI and OpenACC
• Conclusion
OpenACC

• OpenACC is gaining popularity
  – Several sessions during GTC
• A set of compiler directives (#pragma)
• Offload specific loops or parallelizable sections of the code onto accelerators

#pragma acc region
{
    for (i = 0; i < size; i++) {
        A[i] = B[i] + C[i];
    }
}

• Routines to allocate/free memory on accelerators

buffer = acc_malloc(MYBUFSIZE);
acc_free(buffer);

• Supported for C, C++ and Fortran
• Huge list of modifiers – copy, copyout, private, independent, etc.
Using MVAPICH2 with the new OpenACC 2.0

• acc_deviceptr to get the device pointer (in OpenACC 2.0)
  – Enables MPI communication from compiler-managed memory once it is available in OpenACC 2.0 implementations
  – MVAPICH2 will detect the device pointer and optimize communication
  – Delivers the same performance as with CUDA

A = malloc(sizeof(int) * N);
......
#pragma acc data copyin(A[0:N])
{
    #pragma acc parallel loop
    // compute for loop

    MPI_Send(acc_deviceptr(A), N, MPI_INT, 0, 1, MPI_COMM_WORLD);
}
......
free(A);
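A self-contained sketch of the same pattern, under the assumption of an OpenACC 2.0 compiler (for acc_deviceptr) and a CUDA-aware MVAPICH2; the problem size, ranks and tag are illustrative.

/* Hypothetical end-to-end sketch: an OpenACC-managed buffer is passed to MPI
 * through acc_deviceptr (OpenACC 2.0). Rank 1 computes on the device and sends
 * the device copy directly; rank 0 receives into host memory for simplicity. */
#include <stdlib.h>
#include <mpi.h>
#include <openacc.h>

#define N (1 << 20)

int main(int argc, char **argv)
{
    int rank, i;
    int *A = (int *)malloc(N * sizeof(int));

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 1) {
        #pragma acc data copyin(A[0:N])
        {
            #pragma acc parallel loop
            for (i = 0; i < N; i++)
                A[i] = i;                /* compute on the accelerator */

            /* Send straight from the device copy of A */
            MPI_Send(acc_deviceptr(A), N, MPI_INT, 0, 1, MPI_COMM_WORLD);
        }
    } else if (rank == 0) {
        MPI_Recv(A, N, MPI_INT, 1, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    free(A);
    MPI_Finalize();
    return 0;
}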
How can I get Started with GDR Experimentation?

• MVAPICH2-2.0b with GDR support can be downloaded from https://mvapich.cse.ohio-state.edu/download/mvapich2gdr/
• System software requirements
  – Mellanox OFED 2.1
  – NVIDIA Driver 331.20 or later
  – NVIDIA CUDA Toolkit 5.5
  – Plugin for GPUDirect RDMA (http://www.mellanox.com/page/products_dyn?product_family=116)
• Has optimized designs for point-to-point communication using GDR
• Work is in progress on optimizing collective and one-sided communication
• Contact the MVAPICH help list with any questions related to the package: [email protected]
• MVAPICH2-GDR-RC1 with additional optimizations coming soon!!
Conclusions

• MVAPICH2 optimizes MPI communication on InfiniBand clusters with GPUs
• Provides optimized designs for point-to-point two-sided and one-sided communication, and for datatype processing
• Takes advantage of CUDA features like IPC and GPUDirect RDMA
• Delivers
  – High performance
  – High productivity
  with support for the latest NVIDIA GPUs and InfiniBand adapters
Acknowledgments

Dr. Davide Rossetti and others @NVIDIA
Want to improve the TOP500 ranking of your heterogeneous GPU cluster?

Yes!! Then do not miss our next talk, on hybrid HPL for heterogeneous clusters:

S4535 – Accelerating HPL on Heterogeneous Clusters with NVIDIA GPUs
Tuesday, 03/25 (today), 17:00 – 17:25, Room LL21A
Thank You!
Network-Based Computing Laboratory http://nowlab.cse.ohio-state.edu/
MVAPICH Web Page http://mvapich.cse.ohio-state.edu/