16th ANNUAL WORKSHOP 2020
Ammar Ahmad Awan, Jahanzeb Maqbool Hashmi, Ching-Hsiang Chu, Hari Subramoni, and Dhabaleswar K. (DK) Panda
The Ohio State University
http://nowlab.cse.ohio-state.edu
Designing a Deep-Learning Aware MPI Library: An MVAPICH2 Approach
WHAT IS DEEP LEARNING?
Courtesy: https://hackernoon.com/difference-between-artificial-intelligence-machine-learning-and-deep-learning-1pcv3zeg, https://blog.dataiku.com/ai-vs.-machine-learning-vs.-deep-learning
• Deep Learning (DL)
– A subset of Machine Learning that uses Deep Neural Networks (DNNs)
– Perhaps the most revolutionary subset!
• Based on learning data representations
• Examples: Convolutional Neural Networks, Recurrent Neural Networks, Hybrid Networks
• Data Scientist or Developer Perspective
1. Identify DL as a solution to a problem
2. Determine the data set
3. Select the Deep Learning algorithm to use
4. Use a large data set to train the algorithm
DEEP LEARNING AND HIGH-PERFORMANCE ARCHITECTURES
*https://blogs.nvidia.com/blog/2014/09/07/imagenet/
▪ NVIDIA GPUs are the main driving force for faster training of DL models
• The ImageNet Challenge (ILSVRC): 90% of the teams used GPUs (2014)*
• Deep Neural Networks (DNNs) like ResNet(s) and Inception
▪ However, High Performance Architectures for DL and HPC are evolving
• 135/500 Top HPC systems use NVIDIA GPUs (Nov ’19)
• DGX-1 (Pascal) and DGX-2 (Volta)
• Dedicated DL supercomputers
• Cascade-Lake Xeon CPUs have 28 cores/socket (TACC Frontera– #5 on Top500)
• AMD EPYC (Rome) CPUs have 64 cores/socket (Upcoming DOE Clusters)
• AMD GPUs will be powering the Frontier – DOE’s Exascale System at ORNL
• Domain-specific accelerators for DNNs are also emerging
[Chart: Accelerator/Co-Processor performance share on the Top500, www.top500.org]
BROAD CHALLENGE: EXPLOITING HPC FOR DEEP LEARNING
How can we efficiently scale out Deep Learning (DL) workloads by better exploiting High Performance Computing (HPC) resources like multi-/many-core CPUs and GPUs?
▪ gRPC
• Officially available and supported
• Open source – can be enhanced by others
• Accelerated gRPC (adds RDMA to gRPC)
▪ gRPC+X
• Uses gRPC for bootstrap and rendezvous
• Actual communication is in "X"
• X → MPI, Verbs, GPUDirect RDMA (GDR), etc.
▪ No-gRPC
• Baidu – the first to use MPI collectives for TF
• Horovod – uses NCCL, MPI, or any other future library (e.g., IBM DDL support recently added); a minimal Horovod sketch follows below
HIGH-PERFORMANCE DISTRIBUTED DATA PARALLEL TRAINING WITH TENSORFLOW
[Diagram: Distributed TensorFlow communication taxonomy]
– gRPC (including Accelerated gRPC)
– gRPC+X: gRPC+MPI, gRPC+Verbs, gRPC+GDR
– No-gRPC: Baidu-MPI; Horovod (MPI or NCCL)
A. A. Awan, J. Bedorf, C.-H. Chu, H. Subramoni, and DK Panda, "Scalable Distributed DNN Training using TensorFlow and CUDA-Aware MPI: Characterization, Designs, and Performance Evaluation", CCGrid '19.
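To make the "X → MPI" path concrete, below is a minimal sketch of Horovod-style data-parallel training in TensorFlow 2.x. It is illustrative boilerplate (the model, learning rate, and loss are placeholders, not this talk's configuration); whether gradients travel over MPI_Allreduce (e.g., via MVAPICH2) or ncclAllReduce is decided when Horovod is built.

```python
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()  # bootstrap: one process per GPU (or per CPU socket)

# Pin each process to its local GPU, if any
gpus = tf.config.list_physical_devices('GPU')
if gpus:
    tf.config.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

model = tf.keras.applications.ResNet50(weights=None)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()
opt = tf.optimizers.SGD(0.01 * hvd.size())  # scale LR with worker count

@tf.function
def train_step(images, labels, first_batch):
    with tf.GradientTape() as tape:
        loss = loss_fn(labels, model(images, training=True))
    # DistributedGradientTape allreduces gradients across all ranks
    tape = hvd.DistributedGradientTape(tape)
    grads = tape.gradient(loss, model.trainable_variables)
    opt.apply_gradients(zip(grads, model.trainable_variables))
    if first_batch:
        # Start all replicas from rank 0's weights
        hvd.broadcast_variables(model.variables, root_rank=0)
        hvd.broadcast_variables(opt.variables(), root_rank=0)
    return loss
```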
OVERVIEW OF THE MVAPICH2 PROJECT
▪ High Performance open-source MPI Library
▪ Support for multiple interconnects
• InfiniBand, Omni-Path, Ethernet/iWARP, RDMA over Converged Ethernet (RoCE), and AWS EFA
▪ Support for multiple platforms
• x86, OpenPOWER, ARM, Xeon-Phi, GPGPUs
▪ Started in 2001, first open-source version demonstrated at SC ‘02
▪ Supports the latest MPI-3.1 standard
▪ http://mvapich.cse.ohio-state.edu
▪ Additional optimized versions for different systems/environments:
• MVAPICH2-X (Advanced MPI + PGAS), since 2011
• MVAPICH2-GDR with support for NVIDIA GPGPUs, since 2014
• MVAPICH2-MIC with support for Intel Xeon-Phi, since 2014
• MVAPICH2-Virt with virtualization support, since 2015
• MVAPICH2-EA with support for Energy-Awareness, since 2015
• MVAPICH2-Azure for Azure HPC IB instances, since 2019
• MVAPICH2-X-AWS for AWS HPC+EFA instances, since 2019
▪ Tools:
• OSU MPI Micro-Benchmarks (OMB), since 2003
• OSU InfiniBand Network Analysis and Monitoring (INAM), since 2015
• Used by more than 3,090 organizations in 89 countries
• More than 761,000 (> 0.76 million) downloads directly from the OSU site
• Empowering many TOP500 clusters (Nov '19 ranking)
– 3rd: 10,649,600-core Sunway TaihuLight at NSC, Wuxi, China
– 5th: 448,448-core Frontera at TACC
– 8th: 391,680-core ABCI in Japan
– 14th: 570,020-core Nurion in South Korea, and many others
• Available with the software stacks of many vendors and Linux distros (RedHat, SuSE, OpenHPC, and Spack)
• Partner in the 5th-ranked TACC Frontera system
• Empowering Top500 systems for more than 15 years
MVAPICH2 (MPI)-DRIVEN INFRASTRUCTURE FOR ML/DL TRAINING
[Diagram: MVAPICH2-driven software stack – ML/DL applications (TensorFlow, PyTorch, MXNet) run on Horovod, which uses MVAPICH2-X for CPU-based training and MVAPICH2-GDR for GPU-based training]
More details from http://hidl.cse.ohio-state.edu
▪ CPU-based Deep Learning
• Using MVAPICH2-X
▪ GPU-based Deep Learning
• Using MVAPICH2-GDR
HIGH-PERFORMANCE DEEP LEARNING
CPU-BASED TRAINING: SINGLE-NODE MULTI-PROCESS (MP) MODE
[Chart: ResNet-152 training, images/sec vs. batch size per process (16, 32, 64) for PPN 2, PPN 4, and PPN 8]
[Chart: ResNet-152 training, images/sec vs. effective batch size (64–512), Single-Process (SP) vs. Multi-Process (MP)]
ResNet-152 training performance:
• BS=64: 4 PPN is better; BS=32: 8 PPN is slightly better
• However, keeping the effective batch size (EBS) low is more important! Why? DNNs do not converge to state-of-the-art (SOTA) accuracy when the batch size is large.
ResNet-152, Single-Process (SP) vs. Multi-Process (MP):
• MP is better for all effective batch sizes
• Up to 1.35X better performance for MP compared to SP for BS=64
A. Jain, A. A. Awan, Q. Anthony, H. Subramoni, and DK Panda, "Performance Characterization of DNN Training using TensorFlow and PyTorch on Modern Clusters", Cluster '19.
CPU-BASED TRAINING: MULTI-NODE (MN): MP VS. SP?
Images/sec on 32 Skylake nodes (48 cores, 96 threads per node):

Model          MP-Tuned   MP-Default   SP
ResNet-50      3086       2900         2152
ResNet-101     1766       1718         1292
ResNet-152     1213       1197         919
Inception-v3   2478       2271         1729
Inception-v4   1245       1128         832
• Scale: 32 nodes
• MP-Tuned: up to 1.5X better than SP
• MP-Tuned: 10% better than MP-Default
• Why is MP-Tuned better? It uses the best possible number of inter-op and intra-op threads (see the sketch below)
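As a hedged illustration of that tuning knob, the sketch below divides a node's cores among the MPI processes via TensorFlow 2.x's threading API. MV2_COMM_WORLD_LOCAL_SIZE is MVAPICH2's processes-per-node count; the fallback value and core count are assumptions, not the tuned values from the paper.

```python
import os
import tensorflow as tf

# Processes per node, as exported by MVAPICH2 (fallback of 4 is an assumption)
ppn = int(os.environ.get('MV2_COMM_WORLD_LOCAL_SIZE', '4'))
cores_per_node = 48  # e.g., a 48-core Skylake node

# Intra-op: each process gets its share of the node's cores.
# Inter-op: a small pool for independent operators.
tf.config.threading.set_intra_op_parallelism_threads(cores_per_node // ppn)
tf.config.threading.set_inter_op_parallelism_threads(2)
```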
A. Jain, A. A. Awan, Q. Anthony, H. Subramoni, and DK Panda, "Performance Characterization of DNN Training using TensorFlow and PyTorch on Modern Clusters", Cluster '19.
[Chart: ResNet-50 scaling on Frontera, images/sec (log scale) vs. number of nodes (1–2,048) for Intel MPI (IMPI), MVAPICH2-X, and ideal scaling]
SCALING RESNET-50 ON TACC FRONTERA: 2,048 NODES!
• Scaled TensorFlow to 2,048 nodes on Frontera using MVAPICH2 and Intel MPI
• MVAPICH2 and Intel MPI give similar performance for DNN training
• Reported a peak of 260,000 images/sec on 2,048 nodes
• On 2,048 nodes, ResNet-50 can be trained in 7 minutes!
A. Jain, A. A. Awan, H. Subramoni, and DK Panda, "Scaling TensorFlow, PyTorch, and MXNet using MVAPICH2 for High-Performance Deep Learning on Frontera", DLS '19 (in conjunction with SC '19).
SCALING DL FRAMEWORKS USING MVAPICH2-X ON TACC FRONTERA
• On a single node, TensorFlow (TF) is 8% better than MXNet
• TF (tf_cnn_benchmark) is 2.13X better than PyTorch
• TensorFlow is 1.7X better than MXNet
• TF (Keras) gives better performance compared to PyTorch and MXNet
[Chart: images/sec vs. number of nodes (1–256) on Frontera for Keras, TF_CNN, MXNet, and PyTorch]
A. Jain, A. A. Awan, H. Subramoni, and DK Panda, "Scaling TensorFlow, PyTorch, and MXNet using MVAPICH2 for High-Performance Deep Learning on Frontera", DLS '19 (SC '19 Workshop).
PERFORMANCE OF CNTK WITH MVAPICH2-X ON CPU-BASED DEEP LEARNING
[Chart: CNTK AlexNet training, execution time (s) vs. number of processes (28–224) for Intel MPI, MVAPICH2, and MVAPICH2-XPMEM (BS=default, iterations=50, PPN=28); annotations: 20% and 9% improvements]
▪ CPU-based training of the AlexNet neural network using the ImageNet ILSVRC2012 dataset
▪ Advanced XPMEM-based designs show up to 20% benefit over Intel MPI (IMPI) for CNTK DNN training using MPI_Allreduce (see the sketch below)
▪ The proposed designs show good scalability with increasing system size
J. Hashmi, S. Chakraborty, M. Bayatpour, H. Subramoni, and DK Panda, "Designing Efficient Shared Address Space Reduction Collectives for Multi-/Many-cores", 32nd IEEE International Parallel & Distributed Processing Symposium (IPDPS '18), May 2018.
Available since MVAPICH2-X 2.3rc1 release
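For reference, the communication pattern these designs accelerate is an ordinary MPI Allreduce over large gradient buffers. A minimal mpi4py sketch follows; the buffer size and contents are placeholders, not values from the benchmark.

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
grads = np.random.rand(25_000_000).astype(np.float32)  # flattened DNN gradients
summed = np.empty_like(grads)

# With MVAPICH2-X, large intra-node reductions like this one can take the
# XPMEM shared-address-space path instead of copy-in/copy-out.
comm.Allreduce(grads, summed, op=MPI.SUM)
avg = summed / comm.Get_size()  # average gradients across all processes
```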
▪ CPU-based hybrid-parallel (data parallelism + model parallelism) training on Stampede2 (an illustrative MPI sketch follows below)
▪ Benchmarks developed for various configurations:
• Batch sizes
• Number of model partitions
• Number of model replicas
▪ Evaluation on a very deep model: ResNet-1001 (a 1,001-layer model)
BENCHMARKING HYPAR-FLOW ON STAMPEDE2
A. A. Awan, A. Jain, Q. Anthony, H. Subramoni, and D. K. Panda, “HyPar-Flow: Exploiting MPI and Keras for Hybrid Parallel Training of TensorFlow models”, ISC ‘20 (Accepted to be presented), https://arxiv.org/pdf/1911.05146.pdf
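To illustrate the hybrid-parallel layout, ranks can be grouped so that one communicator carries model-parallel traffic within a replica and another carries data-parallel allreduces across replicas. This is plain MPI communicator splitting, not HyPar-Flow's actual API, and the partition count is an assumption.

```python
from mpi4py import MPI

world = MPI.COMM_WORLD
partitions_per_replica = 48  # assumption: 48 model partitions per replica

replica_id = world.Get_rank() // partitions_per_replica    # which model replica
partition_id = world.Get_rank() % partitions_per_replica   # which layer shard

# Model-parallel communicator: activations/gradients between layer shards
model_comm = world.Split(color=replica_id, key=partition_id)
# Data-parallel communicator: gradient allreduce for the same shard across replicas
data_comm = world.Split(color=partition_id, key=replica_id)
```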
[Chart: HyPar-Flow on Stampede2, images/sec (higher is better) vs. number of nodes (1–256) for 48 partitions (1 node), 96 partitions (2 nodes), 192 partitions (4 nodes), and 768 partitions (16 nodes); EBS = diameter of circle]
110x speedup on 128 Intel Xeon Skylake nodes (TACC Stampede2 Cluster)
▪ ResNet-1001 with variable batch size
▪ Approach:
• 48 model-partitions for 56 cores
• 512 model-replicas for 512 nodes
• Total cores: 48 x 512 = 24,576
▪ Speedup
• 253X on 256 nodes
• 481X on 512 nodes
▪ Scaling Efficiency
• 98% up to 256 nodes
• 93.9% for 512 nodes
OUT-OF-CORE TRAINING WITH HYPAR-FLOW (512 NODES ON TACC FRONTERA)
481x speedup on 512 Intel Xeon Skylake nodes (TACC Frontera)
[Chart: HyPar-Flow out-of-core training on Frontera, images/sec (higher is better) vs. number of nodes (= replicas, 1–512) with 56 partitions per node, against ideal scaling]
A. A. Awan, A. Jain, Q. Anthony, H. Subramoni, and D. K. Panda, “HyPar-Flow: Exploiting MPI and Keras for Hybrid Parallel Training of TensorFlow models”, ISC ‘20 (Accepted to be presented), https://arxiv.org/pdf/1911.05146.pdf
▪ CPU-based Deep Learning
• Using MVAPICH2-X
▪ GPU-based Deep Learning
• Using MVAPICH2-GDR
HIGH-PERFORMANCE DEEP LEARNING
OPTIMIZED MVAPICH2-GDR (GPUDIRECT RDMA) DESIGN
[Chart: GPU-GPU inter-node bi-bandwidth (MB/s) vs. message size, MV2 (no GDR) vs. MV2-GDR 2.3.4 – up to 10x improvement]
[Chart: GPU-GPU inter-node bandwidth (MB/s) vs. message size, MV2 (no GDR) vs. MV2-GDR 2.3.4 – up to 9x improvement]
[Chart: GPU-GPU inter-node latency (µs) vs. message size, MV2 (no GDR) vs. MV2-GDR 2.3.4 – down to 1.85 µs, up to 11X improvement]
Platform: MVAPICH2-GDR 2.3.4 on an Intel Haswell (E5-2687W @ 3.10 GHz, 20-core) node, NVIDIA Volta V100 GPU, Mellanox ConnectX-4 EDR HCA, CUDA 9.0, Mellanox OFED 4.0 with GPUDirect RDMA
MVAPICH2-GDR VS. NCCL2 – ALLREDUCE ON GPU SYSTEMS (DGX-2)
• Optimized designs in MVAPICH2-GDR offer better or comparable performance in most cases
• MPI_Allreduce (MVAPICH2-GDR) vs. ncclAllreduce (NCCL2) on 1 DGX-2 node (16 Volta GPUs); a CUDA-aware Allreduce sketch follows below
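Below is a hedged sketch of the MPI side of this comparison: an MPI_Allreduce issued directly on GPU buffers via mpi4py and CuPy. It assumes a CUDA-aware MPI (such as MVAPICH2-GDR) underneath and an mpi4py built against it; the buffer size and device mapping are arbitrary choices.

```python
import cupy as cp
from mpi4py import MPI

comm = MPI.COMM_WORLD
# Map each rank to a local GPU (simple round-robin assumption)
cp.cuda.Device(comm.Get_rank() % cp.cuda.runtime.getDeviceCount()).use()

send = cp.ones(8 * 1024 * 1024, dtype=cp.float32)  # 32 MB device buffer
recv = cp.empty_like(send)
# With a CUDA-aware MPI, the reduction operates on device memory directly,
# with no staging through host buffers
comm.Allreduce(send, recv, op=MPI.SUM)
```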
[Chart: Allreduce latency (µs) vs. message size, MVAPICH2-GDR 2.3.4 vs. NCCL 2.6 – up to ~2.5X better]
Platform: NVIDIA DGX-2 system (16 NVIDIA Volta GPUs connected with NVSwitch), CUDA 10.1
[Chart: Allreduce latency (µs) vs. message size (8 B–128 KB), MVAPICH2-GDR 2.3.4 vs. NCCL 2.6 – up to ~4.7X better]
C.-H. Chu, P. Kousha, A. Awan, K. S. Khorassani, H. Subramoni and D. K. Panda, "NV-Group: Link-Efficient Reductions for Distributed Deep Learning on Modern Dense GPU Systems, " ICS-2020, Accepted to be Presented.
[Chart: small-message Allreduce latency (µs) on 128 GPUs, message sizes 4 B–8 KB, MVAPICH2-GDR 2.3.4 vs. NCCL 2.4 – up to 8X better]
MVAPICH2-GDR VS. NCCL2 – ALLREDUCE ON GPU SYSTEMS (ABCI)
• Optimized designs in upcoming MVAPICH2-GDR offer better performance for most cases
• MPI_Allreduce (MVAPICH2-GDR) vs. ncclAllreduce (NCCL2) up to 128 GPUs (32 nodes on ABCI)
ABCI Platform: dual-socket Intel Xeon Gold, 4 NVIDIA Volta V100 GPUs with NVLink, and two InfiniBand EDR interconnects
[Chart: large-message Allreduce latency (µs) on 128 GPUs, message sizes 16 KB–256 MB, MVAPICH2-GDR 2.3.4 vs. NCCL 2.4 – up to 1.6X better]
MVAPICH2-GDR: MPI_ALLREDUCE (DEVICE BUFFERS) ON SUMMIT
• Optimized designs in MVAPICH2-GDR offer better performance for most cases
• MPI_Allreduce (MVAPICH2-GDR) vs. ncclAllreduce (NCCL2) up to 1,536 GPUs
[Chart: Allreduce bandwidth (GB/s) on 1,536 GPUs, message sizes 32 MB–256 MB, MVAPICH2-GDR 2.3.4 vs. NCCL 2.5 – 1.7X better]
[Chart: Allreduce latency (µs) on 1,536 GPUs, message sizes 4 B–16 KB, MVAPICH2-GDR 2.3.4 vs. NCCL 2.5 – 1.6X better]
Platform: Dual-socket IBM POWER9 CPU, 6 NVIDIA Volta V100 GPUs, and 2-port InfiniBand EDR Interconnect
[Chart: Allreduce bandwidth (GB/s) for a 128 MB message vs. number of GPUs (24–1,536), SpectrumMPI 10.3, OpenMPI 4.0.1, NCCL 2.5, and MVAPICH2-GDR 2.3.4 – MVAPICH2-GDR 1.7X better]
C.-H. Chu, P. Kousha, A. Awan, K. S. Khorassani, H. Subramoni and D. K. Panda, "NV-Group: Link-Efficient Reductions for Distributed Deep Learning on Modern Dense GPU Systems, " ICS-2020. Accepted to be Presented.
SCALABLE TENSORFLOW USING HOROVOD AND MVAPICH2-GDR
• ResNet-50 Training using TensorFlow benchmark on 1 DGX-2 node (16 Volta GPUs)
[Chart: images/sec vs. number of GPUs (1–16), NCCL 2.5 vs. MVAPICH2-GDR 2.3.4 – 9% higher throughput]
Platform: NVIDIA DGX-2 system, CUDA 10.1
[Chart: scaling efficiency (%) vs. number of GPUs (1–16), NCCL 2.5 vs. MVAPICH2-GDR 2.3.4]
Scaling Efficiency = (Actual throughput / Ideal throughput at scale) × 100%
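A small sketch of how the plotted metric is computed (the numbers in the example are illustrative, not measured values from the chart):

```python
def scaling_efficiency(actual_imgs_per_sec, single_gpu_imgs_per_sec, num_gpus):
    """Actual throughput over ideal (linear) throughput at scale, in percent."""
    ideal = single_gpu_imgs_per_sec * num_gpus
    return 100.0 * actual_imgs_per_sec / ideal

# e.g., 6000 img/s on 16 GPUs vs. 400 img/s on one GPU -> 93.75%
print(scaling_efficiency(6000, 400, 16))
```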
C.-H. Chu, P. Kousha, A. Awan, K. S. Khorassani, H. Subramoni and D. K. Panda, "NV-Group: Link-Efficient Reductions for Distributed Deep Learning on Modern Dense GPU Systems," ICS-2020. Accepted to be Presented.
DISTRIBUTED TRAINING WITH TENSORFLOW AND MVAPICH2-GDR ON SUMMIT
• ResNet-50 training using the TensorFlow benchmark on Summit – 1,536 Volta GPUs!
• 1,281,167 (~1.2 million) images per epoch
• Time/epoch ≈ 3.7 seconds
• Total time (90 epochs) ≈ 3.7 × 90 ≈ 332 seconds ≈ 5.5 minutes! (a quick check of this arithmetic follows below)
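A quick check of that arithmetic, assuming the ~0.35 million images/sec peak reported below: time per epoch is simply the dataset size divided by throughput.

```python
imagenet_images = 1_281_167     # ImageNet-1k training images
imgs_per_sec = 350_000          # ~0.35 million images/sec on 1,536 GPUs
epoch_sec = imagenet_images / imgs_per_sec  # ~3.7 seconds per epoch
total_sec = 90 * epoch_sec                  # ~330 seconds for 90 epochs
print(f"{epoch_sec:.1f} s/epoch, {total_sec / 60:.1f} min for 90 epochs")
```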
[Chart: images/sec (thousands) vs. number of GPUs (1–1,536) on Summit, NCCL 2.5 vs. MVAPICH2-GDR 2.3.2]
Platform: The Summit Supercomputer (#1 on Top500.org) – 6 NVIDIA Volta GPUs per node connected with NVLink, CUDA 9.2
*We observed errors for NCCL2 beyond 96 GPUs
MVAPICH2-GDR reaches ~0.35 million images per second for ImageNet-1k (ImageNet-1k has 1.2 million images)!
▪ PyTorch is becoming a very important DL framework
▪ Scaling PyTorch models with Horovod is simple (a minimal sketch follows below)
▪ MVAPICH2-GDR provides better performance and scalability than NCCL2
SCALING PYTORCH ON ORNL/SUMMIT USING MVAPICH2-GDR
[Chart: images/sec (higher is better) vs. number of GPUs (6–768) on Summit, NCCL 2.6 vs. MVAPICH2-GDR 2.3.4 – up to 1.5X higher]
C.-H. Chu, P. Kousha, A. Awan, K. S. Khorassani, H. Subramoni and D. K. Panda, "NV-Group: Link-Efficient Reductions for Distributed Deep Learning on Modern Dense GPU Systems, " ICS-2020. Accepted to be Presented.
Platform: The Summit supercomputer (#1 on Top500.org) – 6 NVIDIA Volta GPUs per node connected with NVLink, CUDA 10.1
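As a hedged illustration of how simple this is, below is the standard Horovod recipe for a PyTorch model; the model and data are stand-ins for brevity, not the benchmark configuration, and whether gradients flow through NCCL or MVAPICH2-GDR is fixed at Horovod build time.

```python
import torch
import horovod.torch as hvd

hvd.init()
torch.cuda.set_device(hvd.local_rank())  # one process per GPU

model = torch.nn.Linear(1024, 1000).cuda()  # stand-in for a real DNN
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())
# Wrap the optimizer: gradients are allreduced across ranks on each step
optimizer = hvd.DistributedOptimizer(optimizer,
                                     named_parameters=model.named_parameters())
# Start every replica from identical state
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

for _ in range(10):  # placeholder training loop with random data
    x = torch.randn(64, 1024).cuda()
    y = torch.randint(0, 1000, (64,)).cuda()
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(x), y)
    loss.backward()
    optimizer.step()
```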
▪ The RTX 5000 is an NVIDIA GPU targeted at data centers
▪ Different from the GTX series:
• Supports GPUDirect RDMA (GDR)
• Supports GDRCOPY
▪ MVAPICH2-GDR offers good performance and reasonable scaling (a Horovod/MXNet sketch follows below)
▪ Scaling is not as good as on Lassen because:
• Nodes are connected with IB FDR
• There is no NVLink between GPUs (only PCIe)
EARLY EXPLORATION OF MXNET USING MVAPICH2-GDR ON TACC FRONTERA RTX GPU NODES
[Chart: images/sec (higher is better) vs. number of GPUs (1–64) with MVAPICH2-GDR]
Platform: TACC Frontera-RTX – 4 NVIDIA RTX 5000 GPUs per node, CUDA 10.1
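For completeness, here is a hedged sketch of Horovod's MXNet/Gluon path used in explorations like this one; the model, data, and hyperparameters are placeholders, not the benchmark setup.

```python
import mxnet as mx
import horovod.mxnet as hvd
from mxnet import autograd, gluon

hvd.init()
ctx = mx.gpu(hvd.local_rank())  # one process per GPU

net = gluon.model_zoo.vision.resnet50_v1()
net.initialize(mx.init.Xavier(), ctx=ctx)
params = net.collect_params()
# DistributedTrainer allreduces gradients (MPI or NCCL, per Horovod build)
trainer = hvd.DistributedTrainer(params, 'sgd',
                                 {'learning_rate': 0.01 * hvd.size()})
hvd.broadcast_parameters(params, root_rank=0)  # identical initial weights

loss_fn = gluon.loss.SoftmaxCrossEntropyLoss()
data = mx.nd.random.uniform(shape=(32, 3, 224, 224), ctx=ctx)  # dummy batch
label = mx.nd.zeros((32,), ctx=ctx)
with autograd.record():
    loss = loss_fn(net(data), label)
loss.backward()
trainer.step(batch_size=32)
```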
▪ Scalable distributed training is becoming increasingly important
▪ It requires high-performance middleware designs that exploit modern interconnects
▪ Presented a set of MPI (MVAPICH2)-driven solutions to achieve scalable distributed training for TensorFlow, PyTorch, and MXNet
▪ More details on these solutions and their usage are available from http://hidl.cse.ohio-state.edu and http://mvapich.cse.ohio-state.edu
▪ The proposed solutions will continue to enable the DL community to achieve scalability and high performance for distributed training
CONCLUSIONS
THANK YOU!
Network-Based Computing Laboratory
http://nowlab.cse.ohio-state.edu/
{awan.10, hashmi.29, chu.368}@osu.edu
[email protected], [email protected]
The High-Performance MPI/PGAS Project
http://mvapich.cse.ohio-state.edu/
The High-Performance Deep Learning Project
http://hidl.cse.ohio-state.edu/