High-Performance Broadcast Designs for Streaming Applications on Multi-GPU InfiniBand Clusters
by
Dhabaleswar K. (DK) Panda
The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~panda
GPU Technology Conference (GTC 2017)
Streaming Applications
• Examples: surveillance, habitat monitoring, etc.
• Require efficient transport of data from/to distributed sources/sinks
• Sensitive to latency and throughput metrics
• Require HPC resources to efficiently carry out compute-intensive tasks
Nature of Streaming Applications
• Pipelined data-parallel compute phases
  • Form the crux of streaming applications and lend themselves to GPGPUs
• Data distribution to GPGPU sites
  • Over PCIe within the node
  • Over InfiniBand interconnects across nodes
• Back-to-back broadcast operations
  – Key dictator of the throughput of streaming applications (a minimal sketch follows the figure below)
[Figure: real-time data streamed from a data source through a data distributor to HPC worker nodes (each with a CPU and multiple GPUs), i.e., data streaming-like broadcast operations]
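To make the broadcast-driven pattern concrete, here is a minimal sketch of the back-to-back broadcast loop, assuming a CUDA-aware MPI library (e.g., MVAPICH2-GDR) so device pointers can be passed to MPI_Bcast directly; the frame size and frame count are illustrative.

```c
/* Back-to-back broadcast of streaming frames from a data distributor
 * (rank 0) to GPU-equipped workers. Assumes a CUDA-aware MPI library
 * so device pointers can be passed directly to MPI_Bcast. */
#include <mpi.h>
#include <cuda_runtime.h>

#define FRAME_BYTES (1 << 20)   /* illustrative 1 MB frame */
#define NUM_FRAMES  1024        /* illustrative stream length */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    void *d_frame;
    cudaMalloc(&d_frame, FRAME_BYTES);

    for (int i = 0; i < NUM_FRAMES; i++) {
        /* Rank 0 would fill d_frame from the data source here
         * (e.g., a camera or ingest pipeline). */
        MPI_Bcast(d_frame, FRAME_BYTES, MPI_BYTE, 0, MPI_COMM_WORLD);
        /* Each worker launches its compute kernel on d_frame here. */
    }

    cudaFree(d_frame);
    MPI_Finalize();
    return 0;
}
```

The throughput of this loop is dictated almost entirely by how fast the broadcast completes, which is why the rest of the talk focuses on the broadcast path.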
GTC 2017 4Network Based Computing Laboratory
Drivers of Modern HPC Cluster Architectures
• Multi-core/many-core technologies
• Remote Direct Memory Access (RDMA)-enabled networking (InfiniBand and RoCE)
• Solid State Drives (SSDs), Non-Volatile Random-Access Memory (NVRAM), NVMe-SSD
• Accelerators (NVIDIA GPGPUs and Intel Xeon Phi)
• Available on HPC Clouds, e.g., Amazon EC2, NSF Chameleon, Microsoft Azure, etc.
Accelerators / Coprocessors high compute density, high
performance/watt>1 TFlop DP on a chip
High Performance Interconnects -InfiniBand
<1usec latency, 100Gbps Bandwidth>Multi-core Processors SSD, NVMe-SSD, NVRAM
Tianhe – 2 TitanK - ComputerSunway TaihuLight
Large-scale InfiniBand Installations
• 187 IB clusters (37%) in the Nov'16 Top500 list (http://www.top500.org)
• Installations in the Top 50 (15 systems):
  – 241,108 cores (Pleiades) at NASA/Ames (13th)
  – 220,800 cores (Pangea) in France (16th)
  – 462,462 cores (Stampede) at TACC (17th)
  – 144,900 cores (Cheyenne) at NCAR/USA (20th)
  – 72,800 cores Cray CS-Storm in US (25th)
  – 72,800 cores Cray CS-Storm in US (26th)
  – 124,200 cores (Topaz) SGI ICE at ERDC DSRC in US (27th)
  – 60,512 cores (DGX SATURNV) at NVIDIA/USA (28th)
  – 72,000 cores (HPC2) in Italy (29th)
  – 152,692 cores (Thunder) at AFRL/USA (32nd)
  – 147,456 cores (SuperMUC) in Germany (36th)
  – 86,016 cores (SuperMUC Phase 2) in Germany (37th)
  – 74,520 cores (Tsubame 2.5) at Japan/GSIC (40th)
  – 194,616 cores (Cascade) at PNNL (44th)
  – 76,032 cores (Makman-2) at Saudi Aramco (49th)
  – 72,000 cores (Prolix) at Meteo France, France (50th)
  – 73,440 cores (Beaufix2) at Meteo France, France (51st)
  – 42,688 cores (Lomonosov-2) at Russia/MSU (52nd)
  – 60,240 cores SGI ICE X at JAEA, Japan (54th)
  – and many more!
InfiniBand Networking Technology
• Introduced in Oct 2000
• High-performance point-to-point data transfer
  – Interprocessor communication and I/O
  – Low latency (<1.0 microsec), high bandwidth (up to 25 GigaBytes/sec, i.e., 200 Gbps), and low CPU utilization (5-10%)
• Multiple features
  – Offloaded Send/Recv, RDMA Read/Write, Atomic Operations
  – Hardware multicast support through Unreliable Datagram (UD)
    • A message sent from a single source can reach all destinations in a single pass over the network through switch-based replication
    • Restricted to one MTU; large messages need to be sent in a chunked manner (sketched below)
    • Reliability needs to be addressed
• Leading to big changes in designing HPC clusters, file systems, cloud computing systems, grid computing systems, ...
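Since each UD multicast packet is limited to one MTU, a sender must fragment larger payloads itself. A minimal sketch of that chunking loop follows; ud_mcast_post() is a hypothetical stand-in for the verbs-level post (an ibv_post_send on a UD QP attached to the multicast group), and reliability is deliberately left out.

```c
#include <stddef.h>

/* Hypothetical stand-in for posting one UD multicast packet (at most
 * one MTU). A real implementation would use ibv_post_send() on a UD
 * QP attached to the multicast group. */
static void ud_mcast_post(const void *payload, size_t len)
{
    (void)payload; (void)len;   /* stub */
}

/* Fragment a large message into MTU-sized multicast packets. UD gives
 * no delivery guarantee, so loss detection and recovery must be
 * layered on top (addressed later in this talk). */
void mcast_send_chunked(const char *buf, size_t len, size_t mtu)
{
    for (size_t off = 0; off < len; off += mtu) {
        size_t chunk = (len - off < mtu) ? (len - off) : mtu;
        ud_mcast_post(buf + off, chunk);
    }
}
```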
Multicast-aware CPU-Based MPI_Bcast on Stampede using MVAPICH2 (6K nodes with 102K cores)
[Figure: MPI_Bcast latency (µs), Default vs. Multicast — four panels: small messages at 102,400 cores, large messages at 102,400 cores, 16-byte messages vs. number of nodes, and 32 KByte messages vs. number of nodes]
ConnectX-3 FDR (54 Gbps): 2.7 GHz dual octa-core (Sandy Bridge) Intel, PCIe Gen3, with Mellanox IB FDR switch
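For reference, latencies like these come from a broadcast latency loop of the kind used by the OSU micro-benchmarks; a minimal host-buffer sketch is below. The iteration count and message sizes are illustrative, and the multicast vs. default choice is made inside the MPI library, not in the benchmark.

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int iters = 1000;                          /* illustrative */
    for (size_t size = 2; size <= 128 * 1024; size *= 2) {
        char *buf = malloc(size);
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++)
            MPI_Bcast(buf, (int)size, MPI_BYTE, 0, MPI_COMM_WORLD);
        double t1 = MPI_Wtime();
        if (rank == 0)
            printf("%8zu bytes: %.2f us\n", size, 1e6 * (t1 - t0) / iters);
        free(buf);
    }
    MPI_Finalize();
    return 0;
}
```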
GPUDirect RDMA (GDR) and CUDA-Aware MPI
• Before CUDA 4: additional copies
  • Low performance and low productivity
• After CUDA 4: host-based pipeline
  • Unified Virtual Addressing
  • Pipeline CUDA copies with IB transfers
  • High performance and high productivity
• After CUDA 5.5: GPUDirect RDMA support
  • GPU-to-GPU direct transfer
  • Bypasses host memory
  • Hybrid design to avoid PCIe bottlenecks
[Figure: data path between GPU memory, GPU, CPU, chipset, and the InfiniBand HCA]
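At the API level, CUDA-awareness means a device pointer can be handed to MPI directly and the library internally chooses host-staged pipelining or the GDR path; without it, the application must stage through the host itself. A minimal sketch (buffer size illustrative):

```c
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdlib.h>

#define N (1 << 20)   /* illustrative message size in bytes */

/* With a CUDA-aware MPI library, the device pointer is passed
 * directly; the library chooses host-staged pipelining or GPUDirect
 * RDMA internally. */
void send_cuda_aware(void *d_buf, int peer)
{
    MPI_Send(d_buf, N, MPI_BYTE, peer, 0, MPI_COMM_WORLD);
}

/* Without CUDA-awareness the application stages through host memory
 * itself (the "additional copies" case above). */
void send_host_staged(void *d_buf, int peer)
{
    void *h_buf = malloc(N);
    cudaMemcpy(h_buf, d_buf, N, cudaMemcpyDeviceToHost);
    MPI_Send(h_buf, N, MPI_BYTE, peer, 0, MPI_COMM_WORLD);
    free(h_buf);
}
```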
Performance of MVAPICH2-GDR with GPUDirect RDMA (GDR)
[Figure: GPU-GPU inter-node latency, bandwidth, and bi-bandwidth vs. message size for MV2-GDR 2.2, MV2-GDR 2.0b, and MV2 w/o GDR; annotated improvements of 2x-11x, with 2.18 µs small-message latency]
MVAPICH2-GDR 2.2; Intel Ivy Bridge (E5-2680 v2) node with 20 cores; NVIDIA Tesla K40c GPU; Mellanox ConnectX-4 EDR HCA; CUDA 8.0; Mellanox OFED 3.0 with GPUDirect RDMA
More details in the 2:00pm session today: S7356 - MVAPICH2-GDR: Pushing the Frontier of HPC and Deep Learning
Multicasting Data from One GPU to Other GPUs: Shortcomings
• Host-Staged Multicast (HSM): traditional short-message broadcast operation between GPUs
  • Data copied from the GPU to host memory
  • Uses InfiniBand UD-based hardware multicast
• Sub-optimal use of the near-scale-invariant UD-multicast performance
• PCIe resources are wasted and the benefits of multicast are nullified
• GPUDirect RDMA capabilities remain unused
Problem Statement
• Can we design a GPU broadcast mechanism that delivers low latency and high throughput for streaming applications?
• Can we combine the GDR and MCAST features to
  • Achieve the best performance?
  • Free up the host-device PCIe bandwidth for application needs?
• Can such a design be extended to support heterogeneous configurations?
  • Host-to-Device (H2D): most common in streaming applications
    • E.g., a camera connected to the host and devices used for computation
  • Device-to-Device (D2D)
  • Device-to-Host (D2H)
• Can we design an efficient MCAST-based broadcast for multi-GPU systems?
• Can we design efficient reliability support on top of the UD-based MCAST broadcast?
• How much performance benefit can be achieved with the new designs?
Existing Protocol for GPU Multicast
• Copy user GPU data to host buffers
• Perform multicast
• Copy data back to the user GPU buffer
• Drawbacks:
  • cudaMemcpy dictates performance
  • Requires PCIe host-device resources
[Figure: data path GPU -> host vbuf (cudaMemcpy) -> HCA (MCAST) -> network, and back]
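Written out, the host-staged protocol looks roughly like the sketch below, with MPI_Bcast standing in for the UD-multicast step; the real implementation manages a pool of registered vbufs rather than allocating per call.

```c
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdlib.h>

/* Host-staged GPU broadcast: D2H copy at the root, multicast/broadcast
 * of the host buffer, then H2D copy at every non-root rank. Both
 * cudaMemcpy calls consume PCIe host-device bandwidth, which is the
 * drawback noted above. */
void bcast_host_staged(void *d_user, size_t len, int root, MPI_Comm comm)
{
    int rank;
    MPI_Comm_rank(comm, &rank);

    void *h_vbuf = malloc(len);
    if (rank == root)
        cudaMemcpy(h_vbuf, d_user, len, cudaMemcpyDeviceToHost);

    MPI_Bcast(h_vbuf, (int)len, MPI_BYTE, root, comm);  /* UD-MCAST inside */

    if (rank != root)
        cudaMemcpy(d_user, h_vbuf, len, cudaMemcpyHostToDevice);
    free(h_vbuf);
}
```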
Enhanced Solution #1: GDRCOPY-based Design
• Copy user GPU data to host buffers
  • Using the GDRCOPY module*
• Perform multicast
• Copy data back to the user GPU buffer
  • Using the GDRCOPY module
• Drawbacks:
  • The D-H operation limits performance
  • Can we avoid GDRCOPY for D-H copies?
[Figure: data path GPU -> host vbuf (GDRCOPY) -> HCA (MCAST) -> network, and back]
*https://github.com/NVIDIA/gdrcopy
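A hedged sketch of how the GDRCOPY module is used for the host-to-device leg, based on the public gdrcopy API; the function names follow recent gdrapi.h (older releases used gdr_copy_to_bar/gdr_copy_from_bar), and a real design would open/pin/map once and cache the mapping rather than doing it per copy.

```c
#include <stddef.h>
#include <stdint.h>
#include <gdrapi.h>

/* CPU-driven host->device copy through a BAR1 mapping of GPU memory
 * (the gdrcopy module). d_buf must come from cudaMalloc and, per the
 * gdrcopy docs, should be GPU-page aligned. Error handling omitted. */
void gdrcopy_h2d(void *d_buf, const void *h_src, size_t len)
{
    gdr_t    g = gdr_open();
    gdr_mh_t mh;
    void    *bar_ptr = NULL;

    gdr_pin_buffer(g, (unsigned long)(uintptr_t)d_buf, len, 0, 0, &mh);
    gdr_map(g, mh, &bar_ptr, len);

    /* Small-message copies driven by the CPU avoid the overhead of a
     * cudaMemcpy/DMA launch, which is what makes this attractive for
     * the vbuf staging step. */
    gdr_copy_to_mapping(mh, bar_ptr, h_src, len);

    gdr_unmap(g, mh, bar_ptr, len);
    gdr_unpin_buffer(g, mh);
    gdr_close(g);
}
```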
Enhanced Solution #2: (GDRCOPY + Loopback)-based Design
• Copy user GPU data to host buffers
  • Using a loopback scheme
• Perform multicast
• Copy the data back to the GPU
  • Using the GDRCOPY scheme
• Good performance for both H-D and D-H copies
• Good performance expected only for small messages
• Still uses the PCIe H-D resources
[Figure: data path GPU -> host vbuf (loopback) -> HCA (MCAST) -> network, with GDRCOPY on the return path]
Can we do Better?
• How do we design an efficient and reliable broadcast operation from host to device for streaming applications on multi-GPU node systems?
• Challenges
  • How do we handle the heterogeneity of the configuration, including H2D broadcast?
  • Can we have topology-aware broadcast designs on multi-GPU nodes?
  • Can we enhance the reliability support for streaming applications?
  • Can we mimic such behavior at the benchmark level?
    • Mimic the need for PCIe H-D resources at the application level
    • Demonstrate the benefits of such designs on such application patterns
Three Major Solutions
• Handling efficient broadcast on multi-GPU node systems
  • C.-H. Chu, K. Hamidouche, H. Subramoni, A. Venkatesh, B. Elton, and D. K. Panda, "Designing High Performance Heterogeneous Broadcast for Streaming Applications on GPU Clusters," SBAC-PAD '16, Oct 2016.
• Providing reliability support
  • C.-H. Chu, K. Hamidouche, H. Subramoni, A. Venkatesh, B. Elton, and D. K. Panda, "Efficient Reliability Support for Hardware Multicast-based Broadcast in GPU-enabled Streaming Applications," COMHPC 2016 (SC Workshop), Nov 2016.
• Optimizing broadcast for multi-source streaming
  • C.-H. Chu, X. Lu, A. Awan, H. Subramoni, J. Hashmi, B. Elton, and D. K. Panda, "Efficient and Scalable Multi-Source Streaming Broadcast on GPU Clusters for Deep Learning," accepted for presentation at the Int'l Conference on Parallel Processing (ICPP '17), Aug 2017.
SL-based Design for Heterogeneous Configuration (H2D)
• Combines the MCAST+GDR hardware features for heterogeneous configurations:
  – Source on the host and destinations on the device
  – SL design: scatter at the destination
    • Source: data and control on the host
    • Destinations: data on the device and control on the host
  – Combines IB MCAST and GDR features at the receivers
  – CUDA IPC-based topology-aware intra-node broadcast
  – Minimizes use of PCIe resources
  – Maximizes availability of PCIe host-device resources
[Figure: the source node multicasts through the IB switch; at each destination node an IB SL step scatters the payload directly into GPU memory and the control message onto the host]
C.-H. Chu, K. Hamidouche, H. Subramoni, A. Venkatesh, B. Elton, and D. K. Panda, "Designing High Performance Heterogeneous Broadcast for Streaming Applications on GPU Clusters," SBAC-PAD '16, Oct 2016.
Intra-node Topology-aware (Hybrid SL+IPC) Design for Multi-GPU Node Configuration
• Socket-based leader (1 HCA per socket)
  – Control synchronization through host shared memory
    • Polling on a shared flag
    • Reading the buffer addresses
  – IPC read of the GPU data (sketched below)
    • Direct (RMA semantics) IPC read
    • IPC reads with other access patterns in the future: k-nomial tree, ring structure
[Figure: the source node multicasts (SL MCAST) to the leader on each destination node; the leader's GPU data is then distributed to GPU 0 ... GPU N within the node via cudaMemcpy (D2D) / CUDA IPC]
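The IPC-read step maps onto the CUDA IPC API roughly as sketched below: the socket leader exports a handle to its device buffer through host shared memory, and each peer process on the node opens the handle and reads the data device-to-device. Synchronization via the shared flag is omitted.

```c
#include <cuda_runtime.h>

/* Leader process: export a handle to its device buffer so other
 * processes on the same node can map it. The handle is a small POD
 * struct that can be passed through shared host memory. */
cudaIpcMemHandle_t export_handle(void *d_buf)
{
    cudaIpcMemHandle_t handle;
    cudaIpcGetMemHandle(&handle, d_buf);
    return handle;
}

/* Peer process: map the leader's buffer and read it directly into its
 * own GPU memory (a device-to-device copy over NVLink/PCIe), instead
 * of being forwarded through the host. */
void ipc_read(cudaIpcMemHandle_t handle, void *d_dst, size_t len)
{
    void *d_remote = NULL;
    cudaIpcOpenMemHandle(&d_remote, handle, cudaIpcMemLazyEnablePeerAccess);
    cudaMemcpy(d_dst, d_remote, len, cudaMemcpyDeviceToDevice);
    cudaIpcCloseMemHandle(d_remote);
}
```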
SL-based Design for H-D Heterogeneous Support
• Redesigned broadcast benchmark with the root buffer on the host and non-root buffers on the device
• Inter-node experiments on the Wilkes cluster: 32 GPUs, 1 GPU/node
[Figure: broadcast latency (µs) vs. message size for SL-MCAST, SGL-MCAST, and Host-MCAST, small and large messages (lower is better); SL-MCAST shows annotated latency reductions of 56% and 39%]
Evaluation of the Topology-aware (SL+IPC) Design
• Evaluates H-D heterogeneous support, mixing inter-node and intra-node experiments on the CSCS cluster: 88 GPUs, 8 NVIDIA K80 GPUs per node
[Figure: broadcast latency (µs) vs. message size for IPC SL-MCAST and SHMEM SL-MCAST, small and large messages (lower is better); annotated latency reductions of 58% and 79%]
Scalability Evaluation of the Proposed Design
• Inter-node experiments on the Wilkes cluster: 32 GPUs, 1 GPU/node
  – 1 KByte messages
[Figure: latency (µs) vs. system size (2-32 GPU nodes) for SL-MCAST, SGL-MCAST, and Host-MCAST]
• Maintains good scalability while yielding up to 64% reduction in latency
Benefits of the Availability of Host-Device PCIe Resources
• Mimics the behavior of streaming applications on the CSCS cluster: 88 GPUs, 8 NVIDIA K80 GPUs per node
  – Broadcast operations overlapped with application-level host-device transfers (sketched below)
[Figure: throughput (GB/s) vs. message size for IPC SL-MCAST and SHMEM SL-MCAST (higher is better); up to 3.2x improvement]
• Maintains near-peak throughput over all message sizes
C.-H. Chu, K. Hamidouche, H. Subramoni, A. Venkatesh, B. Elton, and D. K. Panda, "Designing High Performance Heterogeneous Broadcast for Streaming Applications on GPU Clusters," SBAC-PAD '16, Oct 2016.
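The benchmark pattern behind this measurement can be sketched as follows: the application issues its own host-device copy asynchronously while a broadcast is in flight, so whatever PCIe host-device bandwidth the broadcast design leaves free shows up as higher overall throughput. A minimal sketch, assuming a CUDA-aware MPI library and illustrative buffer sizes:

```c
#include <mpi.h>
#include <cuda_runtime.h>

/* Overlap a broadcast of d_bcast with an application-level H2D copy
 * on a separate stream. If the broadcast design stays off the PCIe
 * host-device path (SL-MCAST), the copy and the broadcast proceed
 * concurrently; a host-staged broadcast would contend with it. */
void bcast_with_app_copy(void *d_bcast, size_t bcast_len,
                         void *d_app, const void *h_app, size_t app_len,
                         int root, MPI_Comm comm, cudaStream_t stream)
{
    cudaMemcpyAsync(d_app, h_app, app_len, cudaMemcpyHostToDevice, stream);
    MPI_Bcast(d_bcast, (int)bcast_len, MPI_BYTE, root, comm);
    cudaStreamSynchronize(stream);
}
```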
Three Major Solutions (recap): continuing with the second solution, Providing Reliability Support.
Efficient Reliability Support for MCAST-based Broadcast
• Remote Memory Access (RMA)-based design
  – The sender maintains a backup buffer for the MCAST packets
    • The sender is not interrupted
  – The receiver performs an RMA Get operation on the sender's backup buffer to retrieve lost MCAST packets (sketched below)
[Figure: timeline of the broadcast sender and receiver (MPI and IB HCA); a receiver-side timeout triggers an RMA Get from the sender's backup buffer]
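In MPI terms the retransmission path can be approximated with one-sided operations, as in the sketch below: the sender exposes its backup buffer through a window, and a receiver that times out waiting for a chunk pulls it with MPI_Get without involving the sender's CPU. Chunk bookkeeping and the timeout logic are omitted.

```c
#include <mpi.h>

/* Sender side: expose the backup buffer holding recently multicast
 * chunks. The sender keeps broadcasting and never has to service
 * retransmission requests explicitly. (MPI_Win_create is collective.) */
MPI_Win expose_backup(void *backup_buf, MPI_Aint bytes)
{
    MPI_Win win;
    MPI_Win_create(backup_buf, bytes, 1, MPI_INFO_NULL, MPI_COMM_WORLD, &win);
    return win;
}

/* Receiver side: after a timeout indicates a lost multicast chunk,
 * fetch it straight from the sender's backup buffer. */
void recover_chunk(MPI_Win win, int sender_rank, MPI_Aint offset,
                   void *local_buf, int bytes)
{
    MPI_Win_lock(MPI_LOCK_SHARED, sender_rank, 0, win);
    MPI_Get(local_buf, bytes, MPI_BYTE, sender_rank, offset, bytes,
            MPI_BYTE, win);
    MPI_Win_unlock(sender_rank, win);
}
```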
Evaluation: Efficient Reliability Design
• Evaluates the RMA-based reliability support for the SL-based MCAST design on the CSCS cluster: 88 GPUs, 8 NVIDIA K80 GPUs per node
  – Negligible overhead
  – The RMA-based design performs better than the NACK-based scheme for large messages
[Figure: broadcast latency (µs) vs. message size (1 B - 8 KB and 16 KB - 4 MB) for w/o reliability, NACK, and RMA]
Benefits of the RMA-based Reliability Design
• Latency reduction compared to the existing NACK-based scheme
[Figure: latency normalized to SL-based MCAST with the NACK-based retransmission scheme, vs. message size, for injected error rates of 0.01%, 0.1%, and 1%]
• Latency reduction by message size and error rate:
  Error rate    8 KB    128 KB    2 MB
  0.01%         16%     31%       11%
  0.1%          21%     36%       19%
  1%            24%     21%       10%
C.-H. Chu, K. Hamidouche, H. Subramoni, A. Venkatesh, B. Elton, and D. K. Panda, "Efficient Reliability Support for Hardware Multicast-based Broadcast in GPU-enabled Streaming Applications," COMHPC 2016 (SC Workshop), Nov 2016.
Three Major Solutions (recap): continuing with the third solution, Optimizing Broadcast for Multi-source Streaming.
Optimized Design for Multi-Source Streaming
• Optimizing the MCAST+GDR broadcast:
  – Source and destination buffers are on the GPU device
    • Typically very large messages (>1 MB)
  – Pipelines data from device to host (sketched below)
    • Avoids the GDR read limit
    • Leverages the high-performance SL design
  – Combines IB MCAST and GDR features
  – Minimizes use of PCIe resources on the receiver side
  – Maximizes availability of PCIe host-device resources
[Figure: the source node (1) pipelines data movement from GPU to host, (2) performs an IB gather of header and data, and (3) issues the IB hardware multicast through the switch; each destination node (4) performs an IB scatter + GDR write, placing the header on the host and the data directly in GPU memory]
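The source-side pipelining can be sketched with two host staging buffers and two CUDA streams: while chunk i is being broadcast from the host, chunk i+1 is copied down from the device. A minimal sketch, where MPI_Bcast stands in for the IB hardware-multicast step, the chunk size is illustrative, and h_stage should be cudaMallocHost-allocated so the copies genuinely overlap:

```c
#include <mpi.h>
#include <cuda_runtime.h>

#define CHUNK (512 * 1024)   /* illustrative pipeline chunk size */

/* Root-side pipeline: overlap the D2H staging of chunk i+1 with the
 * broadcast of chunk i. h_stage[0]/h_stage[1] are pinned host buffers
 * of CHUNK bytes; s[0]/s[1] are CUDA streams. */
void bcast_pipelined_root(const char *d_src, char *h_stage[2],
                          size_t len, MPI_Comm comm, cudaStream_t s[2])
{
    size_t nchunks = (len + CHUNK - 1) / CHUNK;

    /* Prime the pipeline with chunk 0. */
    size_t first = (len < CHUNK) ? len : (size_t)CHUNK;
    cudaMemcpyAsync(h_stage[0], d_src, first, cudaMemcpyDeviceToHost, s[0]);

    for (size_t i = 0; i < nchunks; i++) {
        size_t off = i * CHUNK;
        size_t sz  = (len - off < CHUNK) ? (len - off) : (size_t)CHUNK;

        /* Start staging chunk i+1 while chunk i is broadcast below. */
        if (i + 1 < nchunks) {
            size_t noff = (i + 1) * (size_t)CHUNK;
            size_t nsz  = (len - noff < CHUNK) ? (len - noff) : (size_t)CHUNK;
            cudaMemcpyAsync(h_stage[(i + 1) % 2], d_src + noff, nsz,
                            cudaMemcpyDeviceToHost, s[(i + 1) % 2]);
        }

        cudaStreamSynchronize(s[i % 2]);    /* chunk i is now on the host */
        MPI_Bcast(h_stage[i % 2], (int)sz, MPI_BYTE, 0, comm);
    }
}
```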
Analysis of the Optimized Design
• Pipelined MCAST+GDR design
  – Pipelines data from device to host on the source node
    • Streaming broadcast
  – Leverages the high-performance SL-based design
• High scalability
• High overlap between multiple broadcast calls
Benchmark Evaluation
• On the OSU RI2 cluster: 16 NVIDIA K80 GPUs, 1 GPU/node
[Figure: broadcast latency (µs) vs. message size (4 KB - 16 MB) and vs. number of GPU nodes (2-16, 2 MB messages) for MV2-GDR-Knomial, MV2-GDR-Ring, MCAST-GDR, and MCAST-GDR-Opt (lower is better)]
• Provides near-constant latency over the system sizes
• Reduces latency by up to 65% for large messages
Application Evaluation: Deep Learning Frameworks
• On the OSU RI2 cluster: 16 NVIDIA K80 GPUs, 1 GPU/node
  – Microsoft Cognitive Toolkit (CNTK) with CUDA-Aware MPI*
[Figure: training time (s) on 8 and 16 GPU nodes for the AlexNet and VGG models, MV2-GDR-Knomial vs. MV2-GDR-Ring (lower is better)]
• Reduces latency by up to 24% and 15% for the AlexNet and VGG models, respectively
  – Average training time of one epoch
• Higher improvement can be observed for larger system sizes
*D. Banerjee, K. Hamidouche, D. Panda, "Re-designing CNTK Deep Learning Framework on Modern GPU Enabled Clusters," IEEE CloudCom '16, Dec 2016.
Conclusions
• The IB MCAST feature provides high scalability and low latency
• The NVIDIA GDR feature provides direct access between IB and GPUs
• MVAPICH2-GDR provides schemes to efficiently broadcast from/to GPU memories using host-staged techniques
• Presented a set of designs that couple the GDR and IB MCAST features for
  • Heterogeneous systems
  • Multi-GPU systems
  • Single-source and multi-source streaming
• The new designs will be available in a future MVAPICH2-GDR release
Two Additional Talks
• S7356 - MVAPICH2-GDR: Pushing the Frontier of HPC and Deep Learning
  – Day: Today, 05/11
  – Time: 14:00 - 14:50
  – Location: Room 211B
• S7324 - Bringing NVIDIA GPUs to the PGAS/OpenSHMEM World: Challenges and Solutions
  – Day: Today, 05/11
  – Time: 15:00 - 15:25
  – Location: Room 211B
Thank You!
Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
The MVAPICH2 Project: http://mvapich.cse.ohio-state.edu/