Latest Advances in MVAPICH2 MPI Library for NVIDIA GPU Clusters with InfiniBand

Dhabaleswar K. (DK) Panda, The Ohio State University
E-mail: [email protected]   http://www.cse.ohio-state.edu/~panda

Presentation at GTC 2014
Current and Next Generation HPC Systems and Applications

• Growth of High Performance Computing (HPC)
  – Growth in processor performance
    • Chip density doubles every 18 months
  – Growth in commodity networking
    • Increase in speed/features + reducing cost
  – Growth in accelerators (NVIDIA GPUs)
Outline

• Communication on InfiniBand Clusters with GPUs
• MVAPICH2-GPU with GPUDirect-RDMA (GDR)
  – Two-sided Communication
  – One-sided Communication
  – MPI Datatype Processing
• More Optimizations
• MPI and OpenACC
• Ongoing Work
MVAPICH2-GPU: CUDA-Aware MPI

• Before CUDA 4: Additional copies
  – Low performance and low productivity
• After CUDA 4: Host-based pipeline
  – Unified Virtual Addressing (UVA)
  – Pipeline CUDA copies with IB transfers
  – High performance and high productivity
• After CUDA 5.5: GPUDirect-RDMA support
  – GPU-to-GPU direct transfer
  – Bypass the host memory
  – Hybrid design to avoid PCIe bottlenecks

[Diagram: CPU, chipset, GPU and GPU memory connected to an InfiniBand adapter]

A minimal usage sketch of CUDA-aware point-to-point MPI follows below.
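To make the CUDA-aware interface concrete, here is a minimal sketch (not from the talk) of a point-to-point exchange in which device pointers are handed directly to MPI; a CUDA-aware library such as MVAPICH2 then chooses the pipelined or GPUDirect-RDMA path internally. The buffer size and ranks are illustrative assumptions.

/* Minimal sketch: CUDA-aware MPI passes device pointers straight to MPI calls.
 * Assumes a CUDA-aware MPI library (e.g. MVAPICH2 built with CUDA support).
 * Error checking omitted for brevity. */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    int rank, n = 1 << 20;
    float *d_buf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    cudaMalloc((void **)&d_buf, n * sizeof(float));  /* buffer in GPU memory */

    if (rank == 0)
        MPI_Send(d_buf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(d_buf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}

Without a CUDA-aware MPI, the application would have to stage the data through a host buffer with cudaMemcpy around every MPI call, which is exactly the "additional copies" case above.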
Data Movement on GPU Clusters

[Diagram: Node 0 and Node 1, each with CPUs connected over QPI, GPUs attached via PCIe, host memory buffers, and an IB adapter]

1. Intra-GPU
2. Intra-Socket GPU-GPU
3. Inter-Socket GPU-GPU
4. Inter-Node GPU-GPU
5. Intra-Socket GPU-Host
6. Inter-Socket GPU-Host
7. Inter-Node GPU-Host
8. Inter-Node GPU-GPU with IB adapter on remote socket
   ... and more

• GPUs are connected as PCIe devices – flexibility, but also complexity
• Each path calls for different schemes: Shared_mem, IPC, GDR, pipelining, ...
• It is critical for the runtime to optimize data movement while hiding this complexity
MVAPICH2/MVAPICH2-X Software

• High-performance, open-source MPI library for InfiniBand, 10GigE/iWARP and RDMA over Converged Enhanced Ethernet (RoCE)
  – MVAPICH (MPI-1) and MVAPICH2 (MPI-2.2 and MPI-3.0), available since 2002
  – MVAPICH2-X (MPI + PGAS), available since 2012
  – Support for NVIDIA GPUs, available since 2011
  – Used by more than 2,150 organizations (HPC centers, industry and universities) in 72 countries
  – More than 205,000 downloads directly from the OSU site
  – Empowering many TOP500 clusters
    • 7th-ranked 519,640-core cluster (Stampede) at TACC
    • 11th-ranked 74,358-core cluster (Tsubame 2.0) at Tokyo Institute of Technology
    • 16th-ranked 96,192-core cluster (Pleiades) at NASA
    • 105th-ranked 16,896-core cluster (Keeneland) at GaTech, and many others . . .
  – Available with the software stacks of many IB, HSE and server vendors, including Linux distros (RedHat and SuSE)
  – http://mvapich.cse.ohio-state.edu
Outline

• Communication on InfiniBand Clusters with GPUs
• MVAPICH2-GPU with GPUDirect-RDMA (GDR)
  – Two-sided Communication
  – One-sided Communication
  – MPI Datatype Processing
• More Optimizations
• MPI and OpenACC
• Ongoing Work
GPUDirect RDMA (GDR) with CUDA

• Hybrid design using GPUDirect RDMA
  – GPUDirect RDMA and host-based pipelining
  – Alleviates P2P bandwidth bottlenecks on Sandy Bridge and Ivy Bridge
• Support for communication using multi-rail
• Support for Mellanox Connect-IB and ConnectX VPI adapters
• Support for RoCE with Mellanox ConnectX VPI adapters

[Diagram: IB adapter, system memory, CPU, chipset, GPU and GPU memory]
  SNB E5-2670:   P2P write: 5.2 GB/s, P2P read: < 1.0 GB/s
  IVB E5-2680V2: P2P write: 6.4 GB/s, P2P read: 3.5 GB/s

S. Potluri, K. Hamidouche, A. Venkatesh, D. Bureddy and D. K. Panda, "Efficient Inter-node MPI Communication using GPUDirect RDMA for InfiniBand Clusters with NVIDIA GPUs", Int'l Conference on Parallel Processing (ICPP '13)
Performance of MVAPICH2 with GPUDirect-RDMA: Latency

GPU-GPU internode MPI latency, comparing 1-Rail, 2-Rail, 1-Rail-GDR and 2-Rail-GDR configurations:

[Plot: Small Message Latency – Latency (us) vs. Message Size (1 B to 4 KB); GDR lowers small-message latency by 67%, down to 5.49 usec]
[Plot: Large Message Latency – Latency (us) vs. Message Size (8 KB to 2 MB); about 10% improvement with GDR]

Based on MVAPICH2-2.0b; Intel Ivy Bridge (E5-2680 v2) node with 20 cores; NVIDIA Tesla K40c GPU; Mellanox Connect-IB Dual-FDR HCA; CUDA 5.5; Mellanox OFED 2.0 with GPUDirect-RDMA patch
Performance of MVAPICH2 with GPUDirect-RDMA: Bandwidth

GPU-GPU internode MPI uni-directional bandwidth, comparing 1-Rail, 2-Rail, 1-Rail-GDR and 2-Rail-GDR configurations:

[Plot: Small Message Bandwidth – Bandwidth (MB/s) vs. Message Size (1 B to 4 KB); roughly 5x higher with GDR]
[Plot: Large Message Bandwidth – Bandwidth (MB/s) vs. Message Size (8 KB to 2 MB); up to 9.8 GB/s with dual-rail GDR]

Based on MVAPICH2-2.0b; Intel Ivy Bridge (E5-2680 v2) node with 20 cores; NVIDIA Tesla K40c GPU; Mellanox Connect-IB Dual-FDR HCA; CUDA 5.5; Mellanox OFED 2.0 with GPUDirect-RDMA patch
Performance of MVAPICH2 with GPUDirect-RDMA: Bi-Bandwidth

GPU-GPU internode MPI bi-directional bandwidth, comparing 1-Rail, 2-Rail, 1-Rail-GDR and 2-Rail-GDR configurations:

[Plot: Small Message Bi-Bandwidth – Bi-Bandwidth (MB/s) vs. Message Size (1 B to 4 KB); roughly 4.3x higher with GDR]
[Plot: Large Message Bi-Bandwidth – Bi-Bandwidth (MB/s) vs. Message Size (8 KB to 2 MB); 19% improvement, reaching 19 GB/s]

Based on MVAPICH2-2.0b; Intel Ivy Bridge (E5-2680 v2) node with 20 cores; NVIDIA Tesla K40c GPU; Mellanox Connect-IB Dual-FDR HCA; CUDA 5.5; Mellanox OFED 2.0 with GPUDirect-RDMA patch
Application-level Benefits: AWP-ODC with MVAPICH2-GPU

• A widely used seismic modeling application, Gordon Bell finalist at SC 2010
• An initial version using MPI + CUDA for GPU clusters
• Takes advantage of CUDA-aware MPI; two nodes, 1 GPU/node, 64x32x32 problem size
• GPUDirect-RDMA delivers better performance with the newer architecture

[Chart: total execution time, MV2 vs. MV2-GDR; 11.6% improvement on Platform A and 15.4% on Platform B]
[Chart: computation vs. communication time breakdown; GDR reduces communication time by 21% on Platform A and 30% on Platform B]

Platform A: Intel Sandy Bridge + NVIDIA Tesla K20 + Mellanox ConnectX-3
Platform B: Intel Ivy Bridge + NVIDIA Tesla K40 + Mellanox Connect-IB

Based on MVAPICH2-2.0b, CUDA 5.5, Mellanox OFED 2.0 with GPUDirect-RDMA patch; two nodes, one GPU/node, one process/GPU
Continuous Enhancements for Improved Point-to-point Performance

GPU-GPU internode MPI latency:

[Plot: Small Message Latency – Latency (us) vs. Message Size (1 B to 16 KB), MV2-2.0b-GDR vs. Improved; 31% lower latency with the enhancements]

• Reduced synchronization overhead while avoiding expensive copies

Based on MVAPICH2-2.0b + enhancements; Intel Ivy Bridge (E5-2630 v2) node with 12 cores; NVIDIA Tesla K40c GPU; Mellanox Connect-IB Dual-FDR HCA; CUDA 5.5; Mellanox OFED 2.0 with GPUDirect-RDMA patch
Dynamic Tuning for Point-to-point Performance

GPU-GPU internode MPI performance, comparing latency-optimized (Opt_latency), bandwidth-optimized (Opt_bw) and dynamically tuned (Opt_dynamic) configurations:

[Plot: Latency – Latency (us) vs. Message Size (128 KB to 2 MB)]
[Plot: Bandwidth – Bi-Bandwidth (MB/s) vs. Message Size (1 B to 1 MB)]

Based on MVAPICH2-2.0b + enhancements; Intel Ivy Bridge (E5-2630 v2) node with 12 cores; NVIDIA Tesla K40c GPU; Mellanox Connect-IB Dual-FDR HCA; CUDA 5.5; Mellanox OFED 2.0 with GPUDirect-RDMA patch
Outline

• Communication on InfiniBand Clusters with GPUs
• MVAPICH2-GPU with GPUDirect-RDMA (GDR)
  – Two-sided Communication
  – One-sided Communication
  – MPI Datatype Processing
• More Optimizations
• MPI and OpenACC
• Conclusion
One-sided Communication

• Send/Recv semantics incur overheads
  – Distributed buffer information
  – Message matching
  – Additional copies or rendezvous exchange
• One-sided communication (see the sketch below)
  – Separates synchronization from communication
  – Maps directly onto RDMA semantics
  – Lower overheads and better overlap

Latency (half round trip) for 4-byte messages on Sandy Bridge nodes with FDR Connect-IB, in usec:

                  Host-Host   GPU-GPU
  IB send/recv       0.98       1.84
  MPI send/recv      1.25       6.95
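To illustrate the one-sided model above, here is a minimal sketch (not from the talk) of an MPI-3 put with flush-based completion on GPU buffers. It assumes a CUDA-aware MPI such as MVAPICH2-GDR, so device memory can back the window and serve as the origin buffer; the sizes and ranks are illustrative.

/* Minimal sketch: MPI_Put + MPI_Win_flush on a window backed by GPU memory.
 * Assumes a CUDA-aware MPI so device pointers are valid for the window base
 * and the local origin buffer. Error checking omitted for brevity. */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    int rank, n = 4096;
    float *d_win_buf, *d_src;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    cudaMalloc((void **)&d_win_buf, n * sizeof(float));
    cudaMalloc((void **)&d_src,     n * sizeof(float));

    /* Expose each rank's GPU buffer as an RMA window */
    MPI_Win_create(d_win_buf, n * sizeof(float), sizeof(float),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_lock_all(0, win);                  /* passive-target epoch */
    if (rank == 0) {
        MPI_Put(d_src, n, MPI_FLOAT, 1, 0, n, MPI_FLOAT, win);
        MPI_Win_flush(1, win);                 /* complete the put at rank 1 */
    }
    MPI_Win_unlock_all(win);

    MPI_Win_free(&win);
    cudaFree(d_win_buf);
    cudaFree(d_src);
    MPI_Finalize();
    return 0;
}

The Put+Flush and Put+ActiveSync curves on the next slide correspond to this flush-based completion and to active synchronization, respectively, compared against plain send/recv.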
MPI-3 RMA Support with GPUDirect RDMA

MPI-3 RMA provides flexible synchronization and completion primitives.

[Plot: Small Message Latency – Latency (usec) vs. Message Size (1 B to 4 KB) for Send-Recv, Put+ActiveSync and Put+Flush; Put+Flush stays below 2 usec]
[Plot: Small Message Rate – Message Rate (Msgs/sec) vs. Message Size (1 B to 4 KB) for Send-Recv and Put+ActiveSync]

Based on MVAPICH2-2.0b + extensions; Intel Sandy Bridge (E5-2670) node with 16 cores; NVIDIA Tesla K40c GPU; Mellanox Connect-IB Dual-FDR HCA; CUDA 5.5; Mellanox OFED 2.1 with GPUDirect-RDMA plugin
Communication Kernel Evaluation: 3D Stencil and Alltoall

[Plot: 3D Stencil with 16 GPU nodes – Latency (usec) vs. Message Size (1 B to 4 KB), Send/Recv vs. One-sided; up to 42% lower latency with one-sided]
[Plot: Alltoall with 16 GPU nodes – Latency (usec) vs. Message Size (1 B to 4 KB), Send/Recv vs. One-sided; up to 50% lower latency with one-sided]

Based on MVAPICH2-2.0b + extensions; Intel Sandy Bridge (E5-2670) node with 16 cores; NVIDIA Tesla K40c GPU; Mellanox Connect-IB Dual-FDR HCA; CUDA 5.5; Mellanox OFED 2.1 with GPUDirect-RDMA plugin
Outline

• Communication on InfiniBand Clusters with GPUs
• MVAPICH2-GPU with GPUDirect-RDMA (GDR)
  – Two-sided Communication
  – One-sided Communication
  – MPI Datatype Processing
• More Optimizations
• MPI and OpenACC
• Conclusion
Non-contiguous Data Exchange

• Multi-dimensional data
  – Row-based organization
  – Contiguous in one dimension
  – Non-contiguous in the other dimensions
• Halo data exchange
  – Duplicate the boundary
  – Exchange the boundary in each iteration

[Figure: halo data exchange between neighboring domains]
MPI Datatype Processing

• Comprehensive support
  – Targeted kernels for regular datatypes: vector, subarray, indexed_block (see the vector-datatype sketch below)
  – Generic kernels for all other irregular datatypes
• Separate non-blocking stream for kernels launched by the MPI library
  – Avoids stream conflicts with application kernels
• Flexible set of parameters for users to tune the kernels
  – Vector: MV2_CUDA_KERNEL_VECTOR_TIDBLK_SIZE, MV2_CUDA_KERNEL_VECTOR_YSIZE
  – Subarray: MV2_CUDA_KERNEL_SUBARR_TIDBLK_SIZE, MV2_CUDA_KERNEL_SUBARR_XDIM, MV2_CUDA_KERNEL_SUBARR_YDIM, MV2_CUDA_KERNEL_SUBARR_ZDIM
  – Indexed_block: MV2_CUDA_KERNEL_IDXBLK_XDIM
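As an illustration of where the vector-datatype kernels come into play, here is a minimal sketch (not from the talk) that exchanges one non-contiguous boundary column of a 2D field resident in GPU memory using MPI_Type_vector; with a CUDA-aware MVAPICH2 the strided packing is handled inside the library. The field dimensions and ranks are illustrative assumptions.

/* Minimal sketch: sending a strided column of an NX x NY row-major field that
 * lives in GPU memory. The application only describes the layout with an MPI
 * derived datatype; the library packs it (via its CUDA datatype kernels when
 * built with CUDA support). Error checking omitted for brevity. */
#include <mpi.h>
#include <cuda_runtime.h>

#define NX 512   /* rows    */
#define NY 512   /* columns */

int main(int argc, char **argv)
{
    int rank;
    double *d_field;
    MPI_Datatype column;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    cudaMalloc((void **)&d_field, (size_t)NX * NY * sizeof(double));

    /* One element per row, NX rows, stride of NY elements between them */
    MPI_Type_vector(NX, 1, NY, MPI_DOUBLE, &column);
    MPI_Type_commit(&column);

    if (rank == 0)        /* send the rightmost column to the neighbor */
        MPI_Send(d_field + (NY - 1), 1, column, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)   /* receive it into the leftmost (ghost) column */
        MPI_Recv(d_field, 1, column, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    MPI_Type_free(&column);
    cudaFree(d_field);
    MPI_Finalize();
    return 0;
}

The tuning parameters listed above (e.g. MV2_CUDA_KERNEL_VECTOR_YSIZE) control how the library's packing kernels process exactly this kind of vector layout.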
Application-Level Evaluation (LBMGPU-3D)

• LBM-CUDA (courtesy: Carlos Rosale, TACC)
  – Lattice Boltzmann Method for multiphase flows with large density ratios
  – 3D LBM-CUDA: one process/GPU per node, 512x512x512 data grid, up to 64 nodes
• Oakley cluster at OSC: two hex-core Intel Westmere processors, two NVIDIA Tesla M2070 GPUs, one Mellanox IB QDR MT26428 adapter and 48 GB of main memory

[Chart: 3D LBM-CUDA total execution time (sec) vs. number of GPUs (8, 16, 32, 64), MPI vs. MPI-GPU; improvements of 5.6%, 8.2%, 13.5% and 15.5%, respectively]
Outline

• Communication on InfiniBand Clusters with GPUs
• MVAPICH2-GPU with GPUDirect-RDMA (GDR)
  – Two-sided Communication
  – One-sided Communication
  – MPI Datatype Processing
• More Optimizations
• MPI and OpenACC
• Conclusion
More Optimizations!!!

• Topology detection
  – Avoids the inter-socket QPI bottlenecks
• Dynamic threshold selection between GDR and host-based transfers
• All these and other features will be available with the next release of MVAPICH2-GDR => coming very soon
Outline

• Communication on InfiniBand Clusters with GPUs
• MVAPICH2-GPU with GPUDirect-RDMA (GDR)
  – Two-sided Communication
  – One-sided Communication
  – MPI Datatype Processing
• More Optimizations
• MPI and OpenACC
• Conclusion
OpenACC

• OpenACC is gaining popularity
  – Several sessions during GTC
• A set of compiler directives (#pragma)
• Offload specific loops or parallelizable sections of the code onto accelerators

#pragma acc region
{
    for (i = 0; i < size; i++) {
        A[i] = B[i] + C[i];
    }
}

• Routines to allocate/free memory on accelerators

buffer = acc_malloc(MYBUFSIZE);
acc_free(buffer);

• Supported for C, C++ and Fortran
• Huge list of modifiers – copy, copyout, private, independent, etc.
Using MVAPICH2 with the new OpenACC 2.0

• acc_deviceptr to get the device pointer (in OpenACC 2.0)
  – Enables MPI communication from compiler-managed memory once it is available in OpenACC 2.0 implementations
  – MVAPICH2 will detect the device pointer and optimize communication
  – Delivers the same performance as with CUDA

A = malloc(sizeof(int) * N);
......
#pragma acc data copyin(A[0:N])
{
    #pragma acc parallel loop
    // compute for loop

    MPI_Send(acc_deviceptr(A), N, MPI_INT, 0, 1, MPI_COMM_WORLD);
}
......
free(A);
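A self-contained sketch of the same pattern, under the assumption of an OpenACC 2.0 compiler (for acc_deviceptr) and a CUDA-aware MVAPICH2; the problem size, ranks and tag are illustrative.

/* Hypothetical end-to-end sketch: an OpenACC-managed buffer is passed to MPI
 * through acc_deviceptr (OpenACC 2.0). Rank 1 computes on the device and sends
 * the device copy directly; rank 0 receives into host memory for simplicity. */
#include <stdlib.h>
#include <mpi.h>
#include <openacc.h>

#define N (1 << 20)

int main(int argc, char **argv)
{
    int rank, i;
    int *A = (int *)malloc(N * sizeof(int));

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 1) {
        #pragma acc data copyin(A[0:N])
        {
            #pragma acc parallel loop
            for (i = 0; i < N; i++)
                A[i] = i;                /* compute on the accelerator */

            /* Send straight from the device copy of A */
            MPI_Send(acc_deviceptr(A), N, MPI_INT, 0, 1, MPI_COMM_WORLD);
        }
    } else if (rank == 0) {
        MPI_Recv(A, N, MPI_INT, 1, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    free(A);
    MPI_Finalize();
    return 0;
}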
How can I get Started with GDR Experimentation?

• MVAPICH2-2.0b with GDR support can be downloaded from https://mvapich.cse.ohio-state.edu/download/mvapich2gdr/
• System software requirements
  – Mellanox OFED 2.1
  – NVIDIA Driver 331.20 or later
  – NVIDIA CUDA Toolkit 5.5
  – Plugin for GPUDirect RDMA (http://www.mellanox.com/page/products_dyn?product_family=116)
• Has optimized designs for point-to-point communication using GDR
• Work is in progress on optimizing collective and one-sided communication
• Contact the MVAPICH help list with any questions related to the package: [email protected]
• MVAPICH2-GDR-RC1 with additional optimizations coming soon!!
Conclusions

• MVAPICH2 optimizes MPI communication on InfiniBand clusters with GPUs
• Provides optimized designs for point-to-point two-sided and one-sided communication, and for datatype processing
• Takes advantage of CUDA features like IPC and GPUDirect RDMA
• Delivers
  – High performance
  – High productivity
  with support for the latest NVIDIA GPUs and InfiniBand adapters
Acknowledgments

Dr. Davide Rossetti and others @NVIDIA
Want to improve the TOP500 ranking of your heterogeneous GPU cluster?

Yes!! Then do not miss our next talk, on hybrid HPL for heterogeneous clusters:

S4535 – Accelerating HPL on Heterogeneous Clusters with NVIDIA GPUs
Tuesday, 03/25 (today), 17:00 – 17:25, Room LL21A
Thank You!
Network-Based Computing Laboratory http://nowlab.cse.ohio-state.edu/
MVAPICH Web Page http://mvapich.cse.ohio-state.edu/