SUMMIT NODE
(2) IBM POWER9 + (6) NVIDIA VOLTA V100

[Figure: Summit node diagram. Each of the two POWER9 sockets has 256 GB of DDR4 (135 GB/s) and three V100 GPUs, each with 16 GB of HBM2 (900 GB/s). CPU and GPUs within a socket are connected by NVLink2 links at 50 GB/s each; the two sockets are linked at 64 GB/s. CPU 0 exposes cores 0-20 (hardware threads 0-83) and CPU 1 exposes cores 22-42 (hardware threads 88-171), with four hardware threads per core.]
NVIDIA GPUDIRECT™
Accelerated Communication with Network & Storage Devices

[Figure: GPU1 and GPU2 with their memories, the CPU/chipset with system memory, and an InfiniBand (IB) adapter, connected over PCI-e/NVLink; GPUDirect accelerates communication between GPU memory and the network/storage device.]
NVIDIA GPUDIRECT™
Peer to Peer Transfers

[Figure: the same node layout; data moves directly between GPU1 and GPU2 memory over PCI-e/NVLink, without passing through system memory.]
NVIDIA GPUDIRECT™
Support for RDMA

[Figure: the same node layout; the IB adapter reads and writes GPU memory directly, without staging through system memory.]
REGULAR MPI GPU TO REMOTE GPU

// MPI rank 0: stage device data through the host, then send
cudaMemcpy(s_buf_h, s_buf_d, size, cudaMemcpyDeviceToHost);
MPI_Send(s_buf_h, size, MPI_CHAR, 1, tag, MPI_COMM_WORLD);

// MPI rank 1: receive into host memory, then copy down to the device
MPI_Recv(r_buf_h, size, MPI_CHAR, 0, tag, MPI_COMM_WORLD, &stat);
cudaMemcpy(r_buf_d, r_buf_h, size, cudaMemcpyHostToDevice);

[Figure: MPI Rank 0 and MPI Rank 1, each with a GPU and a Host; the message is staged through host memory on both the sending and the receiving side.]
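To make the pattern concrete, a complete minimal program around this staged transfer might look as follows. This is an illustrative sketch: the buffer setup, the 1 MiB size, and the single send/recv pair are our assumptions, and error checking is omitted.

// staged.c -- illustrative sketch of the staged (non-CUDA-aware) transfer
// Build sketch (assumption): mpicc staged.c -lcudart
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int size = 1 << 20;            /* 1 MiB, arbitrary */
    const int tag  = 0;
    char *buf_h = (char *)malloc(size);  /* host staging buffer */
    char *buf_d;
    cudaMalloc((void **)&buf_d, size);   /* device buffer */
    MPI_Status stat;

    if (rank == 0) {
        /* Stage device data through the host, then send. */
        cudaMemcpy(buf_h, buf_d, size, cudaMemcpyDeviceToHost);
        MPI_Send(buf_h, size, MPI_CHAR, 1, tag, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Receive into host memory, then copy down to the device. */
        MPI_Recv(buf_h, size, MPI_CHAR, 0, tag, MPI_COMM_WORLD, &stat);
        cudaMemcpy(buf_d, buf_h, size, cudaMemcpyHostToDevice);
    }

    cudaFree(buf_d);
    free(buf_h);
    MPI_Finalize();
    return 0;
}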
MPI GPU TO REMOTE GPU
without GPUDirect

With a CUDA-aware MPI, device pointers can be passed straight to MPI:

// MPI rank 0
MPI_Send(s_buf_d, size, MPI_CHAR, 1, tag, MPI_COMM_WORLD);

// MPI rank 1
MPI_Recv(r_buf_d, size, MPI_CHAR, 0, tag, MPI_COMM_WORLD, &stat);

From OpenACC, expose the device addresses to MPI with host_data:

#pragma acc host_data use_device(s_buf, r_buf)
{
    MPI_Send(s_buf, size, MPI_CHAR, 1, tag, MPI_COMM_WORLD);
    MPI_Recv(r_buf, size, MPI_CHAR, 0, tag, MPI_COMM_WORLD, &stat);
}

[Figure: MPI Rank 0 and MPI Rank 1, each with a GPU and a Host; without GPUDirect, the MPI library still stages the transfer through host memory internally.]
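host_data only makes sense if the buffers are actually present on the device. A minimal sketch of the surrounding context (illustrative: the exchange function, the peer argument, and the use of MPI_Sendrecv are our assumptions, not from the deck):

/* host_data hands the *device* addresses of s_buf/r_buf to MPI for the
 * enclosed block, so a CUDA-aware MPI can read/write GPU memory directly. */
#include <mpi.h>

void exchange(char *s_buf, char *r_buf, int size, int peer, int tag)
{
    MPI_Status stat;
    #pragma acc data copyin(s_buf[0:size]) copyout(r_buf[0:size])
    {
        #pragma acc host_data use_device(s_buf, r_buf)
        {
            /* Sendrecv avoids the deadlock a blocking Send/Recv pair can hit. */
            MPI_Sendrecv(s_buf, size, MPI_CHAR, peer, tag,
                         r_buf, size, MPI_CHAR, peer, tag,
                         MPI_COMM_WORLD, &stat);
        }
    }
}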
MPI GPU TO REMOTE GPU
Support for RDMA

The application code is unchanged; with GPUDirect RDMA support, the MPI library lets the network adapter move data directly to and from GPU memory:

// MPI rank 0
MPI_Send(s_buf_d, size, MPI_CHAR, 1, tag, MPI_COMM_WORLD);

// MPI rank 1
MPI_Recv(r_buf_d, size, MPI_CHAR, 0, tag, MPI_COMM_WORLD, &stat);

The OpenACC form is likewise unchanged:

#pragma acc host_data use_device(s_buf, r_buf)
{
    MPI_Send(s_buf, size, MPI_CHAR, 1, tag, MPI_COMM_WORLD);
    MPI_Recv(r_buf, size, MPI_CHAR, 0, tag, MPI_COMM_WORLD, &stat);
}

[Figure: MPI Rank 0 and MPI Rank 1, each with a GPU and a Host; the data path goes directly between the GPUs and the interconnect, bypassing host memory.]
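Whether device pointers can be passed to MPI at all (and whether GPUDirect RDMA is used underneath) depends on the MPI build. With Open MPI-derived implementations such as Spectrum MPI, CUDA-awareness can be checked through the mpi-ext.h extension; a sketch (this extension is Open MPI specific and may be absent in other MPIs):

/* Check for CUDA-aware MPI support (Open MPI extension). */
#include <stdio.h>
#include <mpi.h>
#if defined(OPEN_MPI) && OPEN_MPI
#include <mpi-ext.h>   /* defines MPIX_CUDA_AWARE_SUPPORT if available */
#endif

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
#if defined(MPIX_CUDA_AWARE_SUPPORT) && MPIX_CUDA_AWARE_SUPPORT
    printf("Compile-time CUDA-aware support; runtime reports: %d\n",
           MPIX_Query_cuda_support());
#else
    printf("This MPI build does not advertise CUDA-aware support.\n");
#endif
    MPI_Finalize();
    return 0;
}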
KNOWN ISSUES ON SUMMIT
Things to watch out for (as of January 2019)

Problems with multiple resource sets per node:

$> jsrun -g 1 -a 1 --smpiargs="-gpu" ....
[1]Error opening IPC Memhandle from peer:0, invalid argument

• One workaround: set PAMI_DISABLE_IPC=1
• Expect poor performance, but a good functionality check
• Will be resolved by software updates later this year
PERFORMANT WORKAROUNDS
Running on Summit

Option 1: Run in one resource set and set GPU affinity in your code
(do NOT restrict CUDA_VISIBLE_DEVICES, but you can permute it; a sketch follows below)

Option 2: Use a wrapper script
• Add "#BSUB -step_cgroup n" to your LSF options
• Run with `jsrun <your-jsrun-options> --smpiargs="-gpu" ./gpu_setter.sh <your app>`
• (script below)
• Be careful about your CPU bindings!
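For Option 1, a minimal sketch of setting GPU affinity in code (illustrative: OMPI_COMM_WORLD_LOCAL_RANK is Open MPI/Spectrum MPI specific, and the modulo mapping mirrors the wrapper script below):

/* Each rank selects a device from its node-local rank; Summit has 6 GPUs
 * per node, so local ranks 0-5 map onto devices 0-5. */
#include <stdlib.h>
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    const char *lr = getenv("OMPI_COMM_WORLD_LOCAL_RANK");
    int local_rank = lr ? atoi(lr) : 0;

    int ndev = 0;
    cudaGetDeviceCount(&ndev);          /* 6 on a Summit node */
    cudaSetDevice(local_rank % ndev);   /* permute, don't restrict */

    /* ... rest of the application ... */

    MPI_Finalize();
    return 0;
}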
#!/bin/bash
# gpu_setter.sh
# Rudimentary GPU affinity setter for Summit
# $> jsrun --rs_per_host 1 --gpu_per_rs 6 <task/cpu option> ./gpu_setter.sh <your app>

# This script assumes your code does not attempt to set its own
# GPU affinity (e.g. with cudaSetDevice). Using this affinity script
# with a code that does its own internal GPU selection probably won't work!

# Compute device number from the Open MPI local rank environment variable,
# keeping in mind Summit has 6 GPUs per node
mydevice=$((${OMPI_COMM_WORLD_LOCAL_RANK} % 6))

# CUDA_VISIBLE_DEVICES controls both which GPUs are visible to your process
# and the order they appear in. By putting "mydevice" first in the list, we
# make sure it shows up as device "0" to the process so it's automatically selected.
# The order of the other devices doesn't matter, only that all devices (0-5) are present.
CUDA_VISIBLE_DEVICES="${mydevice},0,1,2,3,4,5"

# Process with sed to remove the duplicate and reform the list, keeping the order we set
CUDA_VISIBLE_DEVICES=$(sed -r ':a; s/\b([[:alnum:]]+)\b(.*)\b\1\b/\1\2/g; ta; s/(,,)+/,/g; s/, *$//' <<< "$CUDA_VISIBLE_DEVICES")
export CUDA_VISIBLE_DEVICES

# Launch the application we were given
exec "$@"
GPU TO GPU COMMUNICATION

• CUDA-aware MPI is functionally portable
• Interoperable with OpenACC/OpenMP
• Performance may vary with on-node vs. off-node placement, socket, and hardware support for GPUDirect
• Unified memory support varies between implementations, but it is becoming common
• For more information on the following advanced on-node communication models, see the OLCF Training Archive or Summit Training Workshops:
  • Single-process, multi-GPU
  • Multi-process, single-GPU (recommended)
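As a taste of the single-process, multi-GPU model mentioned above, a minimal sketch of an on-node device-to-device copy (illustrative, not from the deck; error checking omitted):

/* Copy a buffer from device 0 to device 1 inside one process. */
#include <cuda_runtime.h>

int main(void)
{
    const size_t size = 1 << 20;
    void *buf0, *buf1;

    cudaSetDevice(0);
    cudaMalloc(&buf0, size);

    int can_access = 0;
    cudaDeviceCanAccessPeer(&can_access, 0, 1);
    if (can_access)
        cudaDeviceEnablePeerAccess(1, 0);  /* direct path 0 -> 1 (NVLink on Summit) */

    cudaSetDevice(1);
    cudaMalloc(&buf1, size);

    /* cudaMemcpyPeer works either way; with peer access enabled the copy
     * goes GPU-to-GPU without staging through host memory. */
    cudaMemcpyPeer(buf1, 1, buf0, 0, size);
    cudaDeviceSynchronize();

    cudaFree(buf1);
    cudaSetDevice(0);
    cudaFree(buf0);
    return 0;
}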