High-Performance Broadcast Designs for Streaming Applications on Multi-GPU InfiniBand Clusters
by
Dhabaleswar K. (DK) Panda
The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~panda
GPU Technology Conference (GTC 2017)
Streaming Applications
• Examples: surveillance, habitat monitoring, etc.
• Require efficient transport of data from/to distributed sources/sinks
• Sensitive to latency and throughput metrics
• Require HPC resources to efficiently carry out compute-intensive tasks
Nature of Streaming Applications
• Pipelined data-parallel compute phases
  • Form the crux of streaming applications and lend themselves to GPGPUs
• Data distribution to GPGPU sites
  • Over PCIe within the node
  • Over InfiniBand interconnects across nodes
• Back-to-back broadcast operations
  – Key dictator of the throughput of streaming applications (a minimal sketch follows the figure below)
[Figure: real-time data streamed from a data source through a data distributor to HPC worker nodes (each with a CPU and multiple GPUs), i.e., data streaming-like broadcast operations]
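To make the broadcast-driven pattern concrete, here is a minimal sketch of the back-to-back broadcast loop, assuming a CUDA-aware MPI library (e.g., MVAPICH2-GDR) so device pointers can be passed to MPI_Bcast directly; the frame size and frame count are illustrative.

```c
/* Back-to-back broadcast of streaming frames from a data distributor
 * (rank 0) to GPU-equipped workers. Assumes a CUDA-aware MPI library
 * so device pointers can be passed directly to MPI_Bcast. */
#include <mpi.h>
#include <cuda_runtime.h>

#define FRAME_BYTES (1 << 20)   /* illustrative 1 MB frame */
#define NUM_FRAMES  1024        /* illustrative stream length */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    void *d_frame;
    cudaMalloc(&d_frame, FRAME_BYTES);

    for (int i = 0; i < NUM_FRAMES; i++) {
        /* Rank 0 would fill d_frame from the data source here
         * (e.g., a camera or ingest pipeline). */
        MPI_Bcast(d_frame, FRAME_BYTES, MPI_BYTE, 0, MPI_COMM_WORLD);
        /* Each worker launches its compute kernel on d_frame here. */
    }

    cudaFree(d_frame);
    MPI_Finalize();
    return 0;
}
```

The throughput of this loop is dictated almost entirely by how fast the broadcast completes, which is why the rest of the talk focuses on the broadcast path.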
GTC 2017 4Network Based Computing Laboratory
Drivers of Modern HPC Cluster Architectures
• Multi-core/many-core technologies
• Remote Direct Memory Access (RDMA)-enabled networking (InfiniBand and RoCE)
• Solid State Drives (SSDs), Non-Volatile Random-Access Memory (NVRAM), NVMe-SSD
• Accelerators (NVIDIA GPGPUs and Intel Xeon Phi)
• Available on HPC Clouds, e.g., Amazon EC2, NSF Chameleon, Microsoft Azure, etc.
Accelerators / Coprocessors high compute density, high
performance/watt>1 TFlop DP on a chip
High Performance Interconnects -InfiniBand
<1usec latency, 100Gbps Bandwidth>Multi-core Processors SSD, NVMe-SSD, NVRAM
Tianhe – 2 TitanK - ComputerSunway TaihuLight
Large-scale InfiniBand Installations
• 187 IB clusters (37%) in the Nov'16 Top500 list (http://www.top500.org)
• Installations in the Top 50 (15 systems):
  – 241,108 cores (Pleiades) at NASA/Ames (13th)
  – 220,800 cores (Pangea) in France (16th)
  – 462,462 cores (Stampede) at TACC (17th)
  – 144,900 cores (Cheyenne) at NCAR/USA (20th)
  – 72,800 cores Cray CS-Storm in US (25th)
  – 72,800 cores Cray CS-Storm in US (26th)
  – 124,200 cores (Topaz) SGI ICE at ERDC DSRC in US (27th)
  – 60,512 cores (DGX SATURNV) at NVIDIA/USA (28th)
  – 72,000 cores (HPC2) in Italy (29th)
  – 152,692 cores (Thunder) at AFRL/USA (32nd)
  – 147,456 cores (SuperMUC) in Germany (36th)
  – 86,016 cores (SuperMUC Phase 2) in Germany (37th)
  – 74,520 cores (Tsubame 2.5) at Japan/GSIC (40th)
  – 194,616 cores (Cascade) at PNNL (44th)
  – 76,032 cores (Makman-2) at Saudi Aramco (49th)
  – 72,000 cores (Prolix) at Meteo France, France (50th)
  – 73,440 cores (Beaufix2) at Meteo France, France (51st)
  – 42,688 cores (Lomonosov-2) at Russia/MSU (52nd)
  – 60,240 cores SGI ICE X at JAEA, Japan (54th)
  – and many more!
InfiniBand Networking Technology
• Introduced in Oct 2000
• High-performance point-to-point data transfer
  – Interprocessor communication and I/O
  – Low latency (<1.0 microsec), high bandwidth (up to 25 GigaBytes/sec, i.e., 200 Gbps), and low CPU utilization (5-10%)
• Multiple features
  – Offloaded Send/Recv, RDMA Read/Write, Atomic Operations
  – Hardware multicast support through Unreliable Datagram (UD)
    • A message sent from a single source can reach all destinations in a single pass over the network through switch-based replication
    • Restricted to one MTU; large messages need to be sent in a chunked manner (sketched below)
    • Reliability needs to be addressed
• Leading to big changes in designing HPC clusters, file systems, cloud computing systems, grid computing systems, ...
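Since each UD multicast packet is limited to one MTU, a sender must fragment larger payloads itself. A minimal sketch of that chunking loop follows; ud_mcast_post() is a hypothetical stand-in for the verbs-level post (an ibv_post_send on a UD QP attached to the multicast group), and reliability is deliberately left out.

```c
#include <stddef.h>

/* Hypothetical stand-in for posting one UD multicast packet (at most
 * one MTU). A real implementation would use ibv_post_send() on a UD
 * QP attached to the multicast group. */
static void ud_mcast_post(const void *payload, size_t len)
{
    (void)payload; (void)len;   /* stub */
}

/* Fragment a large message into MTU-sized multicast packets. UD gives
 * no delivery guarantee, so loss detection and recovery must be
 * layered on top (addressed later in this talk). */
void mcast_send_chunked(const char *buf, size_t len, size_t mtu)
{
    for (size_t off = 0; off < len; off += mtu) {
        size_t chunk = (len - off < mtu) ? (len - off) : mtu;
        ud_mcast_post(buf + off, chunk);
    }
}
```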
Multicast-aware CPU-Based MPI_Bcast on Stampede using MVAPICH2 (6K nodes with 102K cores)
[Figure: MPI_Bcast latency (µs), Default vs. Multicast — four panels: small messages at 102,400 cores, large messages at 102,400 cores, 16-byte messages vs. number of nodes, and 32 KByte messages vs. number of nodes]
ConnectX-3 FDR (54 Gbps): 2.7 GHz dual octa-core (Sandy Bridge) Intel, PCIe Gen3, with Mellanox IB FDR switch
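For reference, latencies like these come from a broadcast latency loop of the kind used by the OSU micro-benchmarks; a minimal host-buffer sketch is below. The iteration count and message sizes are illustrative, and the multicast vs. default choice is made inside the MPI library, not in the benchmark.

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int iters = 1000;                          /* illustrative */
    for (size_t size = 2; size <= 128 * 1024; size *= 2) {
        char *buf = malloc(size);
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++)
            MPI_Bcast(buf, (int)size, MPI_BYTE, 0, MPI_COMM_WORLD);
        double t1 = MPI_Wtime();
        if (rank == 0)
            printf("%8zu bytes: %.2f us\n", size, 1e6 * (t1 - t0) / iters);
        free(buf);
    }
    MPI_Finalize();
    return 0;
}
```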
GPUDirect RDMA (GDR) and CUDA-Aware MPI
• Before CUDA 4: additional copies
  • Low performance and low productivity
• After CUDA 4: host-based pipeline
  • Unified Virtual Addressing
  • Pipeline CUDA copies with IB transfers
  • High performance and high productivity
• After CUDA 5.5: GPUDirect RDMA support
  • GPU-to-GPU direct transfer
  • Bypasses host memory
  • Hybrid design to avoid PCIe bottlenecks
[Figure: data path between GPU memory, GPU, CPU, chipset, and the InfiniBand HCA]
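At the API level, CUDA-awareness means a device pointer can be handed to MPI directly and the library internally chooses host-staged pipelining or the GDR path; without it, the application must stage through the host itself. A minimal sketch (buffer size illustrative):

```c
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdlib.h>

#define N (1 << 20)   /* illustrative message size in bytes */

/* With a CUDA-aware MPI library, the device pointer is passed
 * directly; the library chooses host-staged pipelining or GPUDirect
 * RDMA internally. */
void send_cuda_aware(void *d_buf, int peer)
{
    MPI_Send(d_buf, N, MPI_BYTE, peer, 0, MPI_COMM_WORLD);
}

/* Without CUDA-awareness the application stages through host memory
 * itself (the "additional copies" case above). */
void send_host_staged(void *d_buf, int peer)
{
    void *h_buf = malloc(N);
    cudaMemcpy(h_buf, d_buf, N, cudaMemcpyDeviceToHost);
    MPI_Send(h_buf, N, MPI_BYTE, peer, 0, MPI_COMM_WORLD);
    free(h_buf);
}
```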
Performance of MVAPICH2-GDR with GPUDirect RDMA (GDR)
[Figure: GPU-GPU inter-node latency, bandwidth, and bi-bandwidth vs. message size for MV2-GDR 2.2, MV2-GDR 2.0b, and MV2 w/o GDR; annotated improvements of 2x-11x, with 2.18 µs small-message latency]
MVAPICH2-GDR 2.2; Intel Ivy Bridge (E5-2680 v2) node with 20 cores; NVIDIA Tesla K40c GPU; Mellanox ConnectX-4 EDR HCA; CUDA 8.0; Mellanox OFED 3.0 with GPUDirect RDMA
More details in the 2:00pm session today: S7356 - MVAPICH2-GDR: Pushing the Frontier of HPC and Deep Learning
Multicasting Data from One GPU to Other GPUs: Shortcomings
• Host-Staged Multicast (HSM): traditional short-message broadcast operation between GPUs
  • Data copied from the GPU to host memory
  • Uses InfiniBand UD-based hardware multicast
• Sub-optimal use of the near-scale-invariant UD-multicast performance
• PCIe resources are wasted and the benefits of multicast are nullified
• GPUDirect RDMA capabilities remain unused
Problem Statement
• Can we design a GPU broadcast mechanism that delivers low latency and high throughput for streaming applications?
• Can we combine the GDR and MCAST features to
  • Achieve the best performance?
  • Free up the host-device PCIe bandwidth for application needs?
• Can such a design be extended to support heterogeneous configurations?
  • Host-to-Device (H2D): most common in streaming applications
    • E.g., a camera connected to the host and devices used for computation
  • Device-to-Device (D2D)
  • Device-to-Host (D2H)
• Can we design an efficient MCAST-based broadcast for multi-GPU systems?
• Can we design efficient reliability support on top of the UD-based MCAST broadcast?
• How much performance benefit can be achieved with the new designs?
Existing Protocol for GPU Multicast
• Copy user GPU data to host buffers
• Perform multicast
• Copy data back to the user GPU buffer
• Drawbacks:
  • cudaMemcpy dictates performance
  • Requires PCIe host-device resources
[Figure: data path GPU -> host vbuf (cudaMemcpy) -> HCA (MCAST) -> network, and back]
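Written out, the host-staged protocol looks roughly like the sketch below, with MPI_Bcast standing in for the UD-multicast step; the real implementation manages a pool of registered vbufs rather than allocating per call.

```c
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdlib.h>

/* Host-staged GPU broadcast: D2H copy at the root, multicast/broadcast
 * of the host buffer, then H2D copy at every non-root rank. Both
 * cudaMemcpy calls consume PCIe host-device bandwidth, which is the
 * drawback noted above. */
void bcast_host_staged(void *d_user, size_t len, int root, MPI_Comm comm)
{
    int rank;
    MPI_Comm_rank(comm, &rank);

    void *h_vbuf = malloc(len);
    if (rank == root)
        cudaMemcpy(h_vbuf, d_user, len, cudaMemcpyDeviceToHost);

    MPI_Bcast(h_vbuf, (int)len, MPI_BYTE, root, comm);  /* UD-MCAST inside */

    if (rank != root)
        cudaMemcpy(d_user, h_vbuf, len, cudaMemcpyHostToDevice);
    free(h_vbuf);
}
```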
Enhanced Solution #1: GDRCOPY-based Design
• Copy user GPU data to host buffers
  • Using the GDRCOPY module*
• Perform multicast
• Copy data back to the user GPU buffer
  • Using the GDRCOPY module
• Drawbacks:
  • The D-H operation limits performance
  • Can we avoid GDRCOPY for D-H copies?
[Figure: data path GPU -> host vbuf (GDRCOPY) -> HCA (MCAST) -> network, and back]
*https://github.com/NVIDIA/gdrcopy
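A hedged sketch of how the GDRCOPY module is used for the host-to-device leg, based on the public gdrcopy API; the function names follow recent gdrapi.h (older releases used gdr_copy_to_bar/gdr_copy_from_bar), and a real design would open/pin/map once and cache the mapping rather than doing it per copy.

```c
#include <stddef.h>
#include <stdint.h>
#include <gdrapi.h>

/* CPU-driven host->device copy through a BAR1 mapping of GPU memory
 * (the gdrcopy module). d_buf must come from cudaMalloc and, per the
 * gdrcopy docs, should be GPU-page aligned. Error handling omitted. */
void gdrcopy_h2d(void *d_buf, const void *h_src, size_t len)
{
    gdr_t    g = gdr_open();
    gdr_mh_t mh;
    void    *bar_ptr = NULL;

    gdr_pin_buffer(g, (unsigned long)(uintptr_t)d_buf, len, 0, 0, &mh);
    gdr_map(g, mh, &bar_ptr, len);

    /* Small-message copies driven by the CPU avoid the overhead of a
     * cudaMemcpy/DMA launch, which is what makes this attractive for
     * the vbuf staging step. */
    gdr_copy_to_mapping(mh, bar_ptr, h_src, len);

    gdr_unmap(g, mh, bar_ptr, len);
    gdr_unpin_buffer(g, mh);
    gdr_close(g);
}
```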
Enhanced Solution #2: (GDRCOPY + Loopback)-based Design
• Copy user GPU data to host buffers
  • Using a loopback scheme
• Perform multicast
• Copy the data back to the GPU
  • Using the GDRCOPY scheme
• Good performance for both H-D and D-H copies
• Good performance expected only for small messages
• Still uses the PCIe H-D resources
[Figure: data path GPU -> host vbuf (loopback) -> HCA (MCAST) -> network, with GDRCOPY on the return path]
Can we do Better?
• How do we design an efficient and reliable broadcast operation from host to device for streaming applications on multi-GPU node systems?
• Challenges
  • How do we handle the heterogeneity of the configuration, including H2D broadcast?
  • Can we have topology-aware broadcast designs on multi-GPU nodes?
  • Can we enhance the reliability support for streaming applications?
  • Can we mimic such behavior at the benchmark level?
    • Mimic the need for PCIe H-D resources at the application level
    • Demonstrate the benefits of such designs on such application patterns
Three Major Solutions
• Handling efficient broadcast on multi-GPU node systems
  • C.-H. Chu, K. Hamidouche, H. Subramoni, A. Venkatesh, B. Elton, and D. K. Panda, "Designing High Performance Heterogeneous Broadcast for Streaming Applications on GPU Clusters," SBAC-PAD '16, Oct 2016.
• Providing reliability support
  • C.-H. Chu, K. Hamidouche, H. Subramoni, A. Venkatesh, B. Elton, and D. K. Panda, "Efficient Reliability Support for Hardware Multicast-based Broadcast in GPU-enabled Streaming Applications," COMHPC 2016 (SC Workshop), Nov 2016.
• Optimizing broadcast for multi-source streaming
  • C.-H. Chu, X. Lu, A. Awan, H. Subramoni, J. Hashmi, B. Elton, and D. K. Panda, "Efficient and Scalable Multi-Source Streaming Broadcast on GPU Clusters for Deep Learning," accepted for presentation at the Int'l Conference on Parallel Processing (ICPP '17), Aug 2017.
SL-based Design for Heterogeneous Configuration (H2D)
• Combines the MCAST+GDR hardware features for heterogeneous configurations:
  – Source on the host and destinations on the device
  – SL design: scatter at the destination
    • Source: data and control on the host
    • Destinations: data on the device and control on the host
  – Combines IB MCAST and GDR features at the receivers
  – CUDA IPC-based topology-aware intra-node broadcast
  – Minimizes use of PCIe resources
  – Maximizes availability of PCIe host-device resources
[Figure: the source node multicasts through the IB switch; at each destination node an IB SL step scatters the payload directly into GPU memory and the control message onto the host]
C.-H. Chu, K. Hamidouche, H. Subramoni, A. Venkatesh, B. Elton, and D. K. Panda, "Designing High Performance Heterogeneous Broadcast for Streaming Applications on GPU Clusters," SBAC-PAD '16, Oct 2016.
Intra-node Topology-aware (Hybrid SL+IPC) Design for Multi-GPU Node Configuration
• Socket-based leader (1 HCA per socket)
  – Control synchronization through host shared memory
    • Polling on a shared flag
    • Reading the buffer addresses
  – IPC read of the GPU data (sketched below)
    • Direct (RMA semantics) IPC read
    • IPC reads with other access patterns in the future: k-nomial tree, ring structure
[Figure: the source node multicasts (SL MCAST) to the leader on each destination node; the leader's GPU data is then distributed to GPU 0 ... GPU N within the node via cudaMemcpy (D2D) / CUDA IPC]
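The IPC-read step maps onto the CUDA IPC API roughly as sketched below: the socket leader exports a handle to its device buffer through host shared memory, and each peer process on the node opens the handle and reads the data device-to-device. Synchronization via the shared flag is omitted.

```c
#include <cuda_runtime.h>

/* Leader process: export a handle to its device buffer so other
 * processes on the same node can map it. The handle is a small POD
 * struct that can be passed through shared host memory. */
cudaIpcMemHandle_t export_handle(void *d_buf)
{
    cudaIpcMemHandle_t handle;
    cudaIpcGetMemHandle(&handle, d_buf);
    return handle;
}

/* Peer process: map the leader's buffer and read it directly into its
 * own GPU memory (a device-to-device copy over NVLink/PCIe), instead
 * of being forwarded through the host. */
void ipc_read(cudaIpcMemHandle_t handle, void *d_dst, size_t len)
{
    void *d_remote = NULL;
    cudaIpcOpenMemHandle(&d_remote, handle, cudaIpcMemLazyEnablePeerAccess);
    cudaMemcpy(d_dst, d_remote, len, cudaMemcpyDeviceToDevice);
    cudaIpcCloseMemHandle(d_remote);
}
```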
SL-based Design for H-D Heterogeneous Support
• Redesigned broadcast benchmark with the root buffer on the host and non-root buffers on the device
• Inter-node experiments on the Wilkes cluster: 32 GPUs, 1 GPU/node
[Figure: broadcast latency (µs) vs. message size for SL-MCAST, SGL-MCAST, and Host-MCAST, small and large messages (lower is better); SL-MCAST shows annotated latency reductions of 56% and 39%]
Evaluation of the Topology-aware (SL+IPC) Design
• Evaluates H-D heterogeneous support, mixing inter-node and intra-node experiments on the CSCS cluster: 88 GPUs, 8 NVIDIA K80 GPUs per node
[Figure: broadcast latency (µs) vs. message size for IPC SL-MCAST and SHMEM SL-MCAST, small and large messages (lower is better); annotated latency reductions of 58% and 79%]
Scalability Evaluation of the Proposed Design
• Inter-node experiments on the Wilkes cluster: 32 GPUs, 1 GPU/node
  – 1 KByte messages
[Figure: latency (µs) vs. system size (2-32 GPU nodes) for SL-MCAST, SGL-MCAST, and Host-MCAST]
• Maintains good scalability while yielding up to 64% reduction in latency
Benefits of the Availability of Host-Device PCIe Resources
• Mimics the behavior of streaming applications on the CSCS cluster: 88 GPUs, 8 NVIDIA K80 GPUs per node
  – Broadcast operations overlapped with application-level host-device transfers (sketched below)
[Figure: throughput (GB/s) vs. message size for IPC SL-MCAST and SHMEM SL-MCAST (higher is better); up to 3.2x improvement]
• Maintains near-peak throughput over all message sizes
C.-H. Chu, K. Hamidouche, H. Subramoni, A. Venkatesh, B. Elton, and D. K. Panda, "Designing High Performance Heterogeneous Broadcast for Streaming Applications on GPU Clusters," SBAC-PAD '16, Oct 2016.
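The benchmark pattern behind this measurement can be sketched as follows: the application issues its own host-device copy asynchronously while a broadcast is in flight, so whatever PCIe host-device bandwidth the broadcast design leaves free shows up as higher overall throughput. A minimal sketch, assuming a CUDA-aware MPI library and illustrative buffer sizes:

```c
#include <mpi.h>
#include <cuda_runtime.h>

/* Overlap a broadcast of d_bcast with an application-level H2D copy
 * on a separate stream. If the broadcast design stays off the PCIe
 * host-device path (SL-MCAST), the copy and the broadcast proceed
 * concurrently; a host-staged broadcast would contend with it. */
void bcast_with_app_copy(void *d_bcast, size_t bcast_len,
                         void *d_app, const void *h_app, size_t app_len,
                         int root, MPI_Comm comm, cudaStream_t stream)
{
    cudaMemcpyAsync(d_app, h_app, app_len, cudaMemcpyHostToDevice, stream);
    MPI_Bcast(d_bcast, (int)bcast_len, MPI_BYTE, root, comm);
    cudaStreamSynchronize(stream);
}
```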
Three Major Solutions (recap): continuing with the second solution, Providing Reliability Support.
Efficient Reliability Support for MCAST-based Broadcast
• Remote Memory Access (RMA)-based design
  – The sender maintains a backup buffer for the MCAST packets
    • The sender is not interrupted
  – The receiver performs an RMA Get operation on the sender's backup buffer to retrieve lost MCAST packets (sketched below)
[Figure: timeline of the broadcast sender and receiver (MPI and IB HCA); a receiver-side timeout triggers an RMA Get from the sender's backup buffer]
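In MPI terms the retransmission path can be approximated with one-sided operations, as in the sketch below: the sender exposes its backup buffer through a window, and a receiver that times out waiting for a chunk pulls it with MPI_Get without involving the sender's CPU. Chunk bookkeeping and the timeout logic are omitted.

```c
#include <mpi.h>

/* Sender side: expose the backup buffer holding recently multicast
 * chunks. The sender keeps broadcasting and never has to service
 * retransmission requests explicitly. (MPI_Win_create is collective.) */
MPI_Win expose_backup(void *backup_buf, MPI_Aint bytes)
{
    MPI_Win win;
    MPI_Win_create(backup_buf, bytes, 1, MPI_INFO_NULL, MPI_COMM_WORLD, &win);
    return win;
}

/* Receiver side: after a timeout indicates a lost multicast chunk,
 * fetch it straight from the sender's backup buffer. */
void recover_chunk(MPI_Win win, int sender_rank, MPI_Aint offset,
                   void *local_buf, int bytes)
{
    MPI_Win_lock(MPI_LOCK_SHARED, sender_rank, 0, win);
    MPI_Get(local_buf, bytes, MPI_BYTE, sender_rank, offset, bytes,
            MPI_BYTE, win);
    MPI_Win_unlock(sender_rank, win);
}
```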
Evaluation: Efficient Reliability Design
• Evaluates the RMA-based reliability support for the SL-based MCAST design on the CSCS cluster: 88 GPUs, 8 NVIDIA K80 GPUs per node
  – Negligible overhead
  – The RMA-based design performs better than the NACK-based scheme for large messages
[Figure: broadcast latency (µs) vs. message size (1 B - 8 KB and 16 KB - 4 MB) for w/o reliability, NACK, and RMA]
Benefits of the RMA-based Reliability Design
• Latency reduction compared to the existing NACK-based scheme
[Figure: latency normalized to SL-based MCAST with the NACK-based retransmission scheme, vs. message size, for injected error rates of 0.01%, 0.1%, and 1%]
• Latency reduction by message size and error rate:
  Error rate    8 KB    128 KB    2 MB
  0.01%         16%     31%       11%
  0.1%          21%     36%       19%
  1%            24%     21%       10%
C.-H. Chu, K. Hamidouche, H. Subramoni, A. Venkatesh, B. Elton, and D. K. Panda, "Efficient Reliability Support for Hardware Multicast-based Broadcast in GPU-enabled Streaming Applications," COMHPC 2016 (SC Workshop), Nov 2016.
Three Major Solutions (recap): continuing with the third solution, Optimizing Broadcast for Multi-source Streaming.
Optimized Design for Multi-Source Streaming
• Optimizing the MCAST+GDR broadcast:
  – Source and destination buffers are on the GPU device
    • Typically very large messages (>1 MB)
  – Pipelines data from device to host (sketched below)
    • Avoids the GDR read limit
    • Leverages the high-performance SL design
  – Combines IB MCAST and GDR features
  – Minimizes use of PCIe resources on the receiver side
  – Maximizes availability of PCIe host-device resources
[Figure: the source node (1) pipelines data movement from GPU to host, (2) performs an IB gather of header and data, and (3) issues the IB hardware multicast through the switch; each destination node (4) performs an IB scatter + GDR write, placing the header on the host and the data directly in GPU memory]
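The source-side pipelining can be sketched with two host staging buffers and two CUDA streams: while chunk i is being broadcast from the host, chunk i+1 is copied down from the device. A minimal sketch, where MPI_Bcast stands in for the IB hardware-multicast step, the chunk size is illustrative, and h_stage should be cudaMallocHost-allocated so the copies genuinely overlap:

```c
#include <mpi.h>
#include <cuda_runtime.h>

#define CHUNK (512 * 1024)   /* illustrative pipeline chunk size */

/* Root-side pipeline: overlap the D2H staging of chunk i+1 with the
 * broadcast of chunk i. h_stage[0]/h_stage[1] are pinned host buffers
 * of CHUNK bytes; s[0]/s[1] are CUDA streams. */
void bcast_pipelined_root(const char *d_src, char *h_stage[2],
                          size_t len, MPI_Comm comm, cudaStream_t s[2])
{
    size_t nchunks = (len + CHUNK - 1) / CHUNK;

    /* Prime the pipeline with chunk 0. */
    size_t first = (len < CHUNK) ? len : (size_t)CHUNK;
    cudaMemcpyAsync(h_stage[0], d_src, first, cudaMemcpyDeviceToHost, s[0]);

    for (size_t i = 0; i < nchunks; i++) {
        size_t off = i * CHUNK;
        size_t sz  = (len - off < CHUNK) ? (len - off) : (size_t)CHUNK;

        /* Start staging chunk i+1 while chunk i is broadcast below. */
        if (i + 1 < nchunks) {
            size_t noff = (i + 1) * (size_t)CHUNK;
            size_t nsz  = (len - noff < CHUNK) ? (len - noff) : (size_t)CHUNK;
            cudaMemcpyAsync(h_stage[(i + 1) % 2], d_src + noff, nsz,
                            cudaMemcpyDeviceToHost, s[(i + 1) % 2]);
        }

        cudaStreamSynchronize(s[i % 2]);    /* chunk i is now on the host */
        MPI_Bcast(h_stage[i % 2], (int)sz, MPI_BYTE, 0, comm);
    }
}
```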
Analysis of the Optimized Design
• Pipelined MCAST+GDR design
  – Pipelines data from device to host on the source node
    • Streaming broadcast
  – Leverages the high-performance SL-based design
• High scalability
• High overlap between multiple broadcast calls
Benchmark Evaluation
• On the OSU RI2 cluster: 16 NVIDIA K80 GPUs, 1 GPU/node
[Figure: broadcast latency (µs) vs. message size (4 KB - 16 MB) and vs. number of GPU nodes (2-16, 2 MB messages) for MV2-GDR-Knomial, MV2-GDR-Ring, MCAST-GDR, and MCAST-GDR-Opt (lower is better)]
• Provides near-constant latency over the system sizes
• Reduces latency by up to 65% for large messages
Application Evaluation: Deep Learning Frameworks
• On the OSU RI2 cluster: 16 NVIDIA K80 GPUs, 1 GPU/node
  – Microsoft Cognitive Toolkit (CNTK) with CUDA-Aware MPI*
[Figure: training time (s) on 8 and 16 GPU nodes for the AlexNet and VGG models, MV2-GDR-Knomial vs. MV2-GDR-Ring (lower is better)]
• Reduces latency by up to 24% and 15% for the AlexNet and VGG models, respectively
  – Average training time of one epoch
• Higher improvement can be observed for larger system sizes
*D. Banerjee, K. Hamidouche, D. Panda, "Re-designing CNTK Deep Learning Framework on Modern GPU Enabled Clusters," IEEE CloudCom '16, Dec 2016.
Conclusions
• The IB MCAST feature provides high scalability and low latency
• The NVIDIA GDR feature provides direct access between IB and GPUs
• MVAPICH2-GDR provides schemes to efficiently broadcast from/to GPU memories using host-staged techniques
• Presented a set of designs that couple the GDR and IB MCAST features for
  • Heterogeneous systems
  • Multi-GPU systems
  • Single-source and multi-source streaming
• The new designs will be available in a future MVAPICH2-GDR release
Two Additional Talks
• S7356 - MVAPICH2-GDR: Pushing the Frontier of HPC and Deep Learning
  – Day: Today, 05/11
  – Time: 14:00 - 14:50
  – Location: Room 211B
• S7324 - Bringing NVIDIA GPUs to the PGAS/OpenSHMEM World: Challenges and Solutions
  – Day: Today, 05/11
  – Time: 15:00 - 15:25
  – Location: Room 211B
Thank You!
Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
The MVAPICH2 Project: http://mvapich.cse.ohio-state.edu/