[Figure: Layer-wise overlapped gradient aggregation. The main thread's backward pass computes layers Ln, Ln-1, ..., L1 while a helper thread overlaps communication, issuing Reduce(Ln), Reduce(Ln-1), ..., Reduce(L1) as each layer's gradients become available.]
Co-designing Communication Middleware and Deep Learning Frameworks for High-Performance DNN Training on HPC Systems
Ammar A. Awan and DK Panda (Advisor), The Ohio State University
MOTIVATION
• Resurgence of Deep Learning (DL)
  - Availability of large datasets like ImageNet and massively parallel modern hardware like NVIDIA GPUs
  - Emergence of DL frameworks (Caffe, TensorFlow, CNTK, etc.)
• Compute demands of Deep Neural Networks (DNNs): a single GPU/node is not enough!
• Scale-up and scale-out training: an emerging research area

RESEARCH CHALLENGES
• Various strategies to deal with large DNNs: data parallelism, out-of-core training, and model parallelism
• Parameter-server and reduction-tree approaches; distributed address-space design constraints
• Challenges for communication middleware
  - Very large GPU-based buffers
  - Reduction collectives
  - Overlap of computation and communication

SUMMARY OF CONTRIBUTIONS
• Co-design of MPI middleware and tensor communication for efficient DNN training: large-message, CUDA-Aware, and non-blocking communication
• Novel techniques to deal with out-of-core DNN workloads on GPUs: exploiting CUDA Unified Memory, efficient prefetching, and page migration
• In-depth characterization and analysis of DL workloads and frameworks: profiling MPI and NCCL communication; holistic understanding of single/multiple compute elements (CPU/GPU performance)
• Significant broader impact at the intersection of ML and HPC, a new research area: tutorials and a course (OSU) on High-Performance Deep Learning; outreach through the MVAPICH2-GDR and HiDL public releases

PROPOSED DESIGNS AND PERFORMANCE CHARACTERIZATION
2 Accelerating Data-Parallel Training -- EuroMPI '16, EuroMPI '18, and J. Parallel Computing '19
Layer-wise Overlapped Gradient Aggregation
http://hidl.cse.ohio-state.edu
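The layer-wise overlapped aggregation can be sketched with a plain Python computation thread and communication helper thread; the function name and toy reduce below are illustrative stand-ins, not the actual MVAPICH2-GDR implementation:

```python
import threading
import queue

def backward_with_overlapped_reduce(layer_grads, reduce_fn):
    # Main thread: backward pass over layers Ln .. L1.
    # Helper thread: reduces each layer's gradient as soon as it is ready,
    # overlapping communication with the remaining backward computation.
    work = queue.Queue()
    reduced = {}

    def helper():
        while True:
            item = work.get()
            if item is None:  # sentinel: backward pass is done
                return
            layer, grad = item
            reduced[layer] = reduce_fn(grad)

    t = threading.Thread(target=helper)
    t.start()
    for layer in reversed(range(len(layer_grads))):  # Ln first, L1 last
        grad = layer_grads[layer]                    # stand-in for dL/dW
        work.put((layer, grad))                      # hand off immediately
    work.put(None)
    t.join()
    return reduced

# Toy 3-layer model; the "reduce" here is an identity placeholder.
grads = backward_with_overlapped_reduce([0.1, 0.2, 0.3], lambda g: g)
```

The point of the design is that Reduce(Ln) starts while Ln-1's gradients are still being computed, hiding communication latency behind the backward pass.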
MPI Reduce Benchmark: 160 GPUs (10 nodes)
[Figure: MPI_Reduce latency (ms, log scale) vs. message size (1 B to 128 MB), comparing MV2-GDR-NCCL and MV2-GDR-Opt.]
VGG Training with CNTK
[Figure: Training time (seconds) vs. number of GPUs (2 to 128), comparing MV2-GDR-NCCL and MV2-GDR-Opt.]
[Figures: GoogLeNet training (strong scaling) and AlexNet training (weak scaling).]
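The data-parallel step these benchmarks exercise averages per-worker gradients, as an Allreduce(SUM) followed by a divide would across ranks; a minimal single-process sketch (the helper name is ours, and plain lists stand in for GPU tensors):

```python
def allreduce_average(worker_grads):
    # Element-wise average across workers, i.e. Allreduce(SUM) / num_workers.
    n = len(worker_grads)
    return [sum(vals) / n for vals in zip(*worker_grads)]

# Four hypothetical workers, each holding a 3-element gradient.
grads = [[1, 2, 3], [3, 2, 1], [2, 2, 2], [2, 2, 2]]
print(allreduce_average(grads))  # -> [2.0, 2.0, 2.0]
```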
PROPOSED FRAMEWORK
[Figure: The proposed co-design stack.
- Application Layer (DNN Training): Caffe, TensorFlow, CNTK, PyTorch, and OSU-Caffe
- Distributed Training Middleware: data-parallel designs (MPI-Nets, Horovod) and model-parallel designs
- Communication Middleware (DL-Aware MPI, MVAPICH): large-message reduction, CUDA-Aware reductions, and CUDA-Aware broadcast
- HPC Platforms: multi-/many-core CPUs (Intel Xeon, AMD EPYC, and IBM POWER9), NVIDIA GPUs, and high-performance interconnects (InfiniBand, Omni-Path)
Cross-cutting themes: co-designs, out-of-core training, and performance and design analysis.]
Distributed TensorFlow Interfaces
[Figure: Characterization of distributed TensorFlow (TF) interfaces.
- Deep Learning Application: distributed TF programs (tf_cnn_benchmarks)
- Programming models for distributed TF: Parameter Server (gRPC, gRPC+X: gRPC+Verbs, gRPC+MPI) and No-gRPC (Horovod Distributed Optimizer, Baidu-Allreduce)
- Communication runtimes/libraries: NCCL2 and CUDA-Aware MPI (CUDA-Aware Allreduce with pointer cache and GDR optimizations)
- HPC Platforms: CPUs and GPUs over InfiniBand and Cray Aries
Workflow: (1) program launch, (2) proposed Allreduce designs and optimizations, (3) characterization and performance analysis.]
[Figure: Distributed TF training throughput. Images/second (higher is better) vs. number of nodes (GPUs), 1 to 16, comparing Baidu-MPI, Horovod-MPI, Horovod-NCCL2, gRPC+verbs, gRPC+MPI, and gRPC (IPoIB).]
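Horovod and Baidu-Allreduce both build on the ring allreduce, a reduce-scatter phase followed by an allgather. A single-process simulation of the schedule, with one chunk per rank for simplicity (the function name is ours):

```python
def ring_allreduce(bufs):
    # Simulate p ranks in a ring; bufs[r] is rank r's buffer with one
    # chunk per rank. Each step, rank r sends one chunk to rank (r+1) % p.
    p = len(bufs)
    bufs = [list(b) for b in bufs]
    # Phase 1, reduce-scatter: after p-1 steps, rank r holds the fully
    # reduced sum of chunk (r + 1) % p.
    for step in range(p - 1):
        sends = [(r, (r - step) % p) for r in range(p)]
        vals = [bufs[r][c] for r, c in sends]  # snapshot before updating
        for (r, c), v in zip(sends, vals):
            bufs[(r + 1) % p][c] += v
    # Phase 2, allgather: circulate the reduced chunks for p-1 more steps.
    for step in range(p - 1):
        sends = [(r, (r + 1 - step) % p) for r in range(p)]
        vals = [bufs[r][c] for r, c in sends]
        for (r, c), v in zip(sends, vals):
            bufs[(r + 1) % p][c] = v
    return bufs

# Three ranks; every rank ends with the element-wise sum [12, 15, 18].
print(ring_allreduce([[1, 2, 3], [4, 5, 6], [7, 8, 9]]))
```

Each rank transfers only 2(p-1)/p of the buffer in total, which is why ring allreduce scales well for the large messages DNN gradient exchange produces.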
[Figure: Existing vs. proposed out-of-core DNN training.
- Existing forward/backward propagation: data flows from the file system through the data layer (L1) and layer kernels (Layer 2 ... Layer N) using explicit I/O and F2H/H2F, H2D, D2H, and D2D copies.
- Proposed forward/backward propagation with a Unified Data Layer: CUDA Unified Memory transfers (M2M, F2M/M2F) driven by Prefetch() and Advise(Evict()) calls issued around each layer's kernel.]
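The Prefetch()/Advise(Evict()) pattern can be sketched as an LRU-managed "GPU memory" that stages each layer just before its kernel runs. The function, capacity, and lookahead below are illustrative, not the actual HiPC '18 design; the CUDA calls they stand in for are cudaMemPrefetchAsync and cudaMemAdvise:

```python
from collections import OrderedDict

def forward_with_prefetch(num_layers, gpu_capacity, lookahead=1):
    # Before computing layer i, prefetch layers i .. i+lookahead into a
    # bounded GPU memory; when full, evict the least recently used layer.
    gpu = OrderedDict()      # resident layers, in LRU order
    prefetches = 0
    for i in range(num_layers):
        for j in range(i, min(i + lookahead + 1, num_layers)):
            if j not in gpu:
                if len(gpu) >= gpu_capacity:
                    gpu.popitem(last=False)  # Advise(Evict()) the LRU layer
                gpu[j] = True                # Prefetch() layer j
                prefetches += 1
        gpu.move_to_end(i)   # running layer i's kernel touches its data
    return prefetches

# 8 layers, room for only 3 on the GPU: evictions always hit layers that
# have already been computed, so every layer is staged exactly once.
print(forward_with_prefetch(num_layers=8, gpu_capacity=3))  # -> 8
```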
1 Performance Characterization and Design Analysis: Caffe, CNTK, TensorFlow -- MLHPC ‘17, CCGrid ’19, and HotI ‘19
Pure MPI Design for DL-Aware Broadcast (MPI_Bcast())
- Communicator selection: a flexible communicator built from an intra-node communicator (shared memory, loopback, GPUDirect RDMA, GDRCopy, CUDA IPC, pipelined IPC) and an inter-node communicator (GPUDirect RDMA, host staging, GDR write, pipelined)
- Algorithm selection: K-nomial tree, scatter-allgather, chain (ring), and several others
- Collectives design selection: staged designs and direct designs
- Point-to-point (P2P) design selection: intra-node and inter-node P2P built on MPI_Isend/MPI_Irecv primitives
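For the algorithm-selection step, a k-nomial tree broadcast finishes in ceil(log_k n) rounds, because the set of ranks holding the message grows by a factor of k per round. A greedy single-process sketch of the send schedule (the function name is ours):

```python
def knomial_bcast_rounds(n, k=2):
    # Round-by-round send schedule of a k-nomial broadcast from rank 0:
    # each round, every rank holding the message forwards it to up to
    # k - 1 ranks that do not yet have it.
    have = [0]
    rounds = []
    while len(have) < n:
        missing = [r for r in range(n) if r not in have]
        sends = []
        i = 0
        for src in list(have):
            for _ in range(k - 1):
                if i == len(missing):
                    break
                sends.append((src, missing[i]))
                i += 1
        have.extend(dst for _, dst in sends)
        rounds.append(sends)
    return rounds

print(len(knomial_bcast_rounds(8, k=2)))   # binomial tree: 3 rounds
print(len(knomial_bcast_rounds(16, k=4)))  # 4-nomial tree: 2 rounds
```

Larger k trades fewer rounds (lower latency for small messages) for more sends per rank per round, which is one axis the runtime's algorithm selection tunes.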
[Figures: Out-of-core training throughput, Img/sec (higher is better), comparing caffe-gpu, oc-caffe-naïve, oc-caffe-opt, caffe-cpu, intel-caffe, and intel-caffe-opt.
- Out-of-Core AlexNet: oc-caffe-opt is 5X better than intel-caffe-opt; caffe-gpu cannot run.
- Out-of-Core GoogLeNet: oc-caffe-opt is 2.7X better than intel-caffe-opt; caffe-gpu cannot run.
- Out-of-Core ResNet-50: oc-caffe-opt is 80% better than intel-caffe; caffe-gpu cannot run; intel-caffe-opt is N/A.]
3 Out-of-Core DNN Training -- HiPC ‘18
4 Co-designing MPI and Caffe for Data Parallelism -- PPoPP '17
Layer-wise Overlapped Model Propagation
• Faster convolutions → faster training
• Performance of Intel KNL is comparable to NVIDIA P100 for AlexNet training; Volta is in a different league!
[Figure: Proposed profiling infrastructure (hvprof).
- Deep Learning Frameworks: TensorFlow, PyTorch, and MXNet
- Distributed Training Middleware (Horovod), instrumented by hvprof
- Communication Middleware: NCCL and MPI
- HPC Platforms: CPUs and GPUs over high-performance interconnects (InfiniBand, Omni-Path, PCIe, NVLink)]
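A profiler in this spirit interposes on communication calls and accumulates per-primitive statistics; a minimal stand-in (not the actual hvprof implementation) that records call counts and elapsed time:

```python
import time
from collections import defaultdict

class CommProfiler:
    # Wrap communication primitives and accumulate per-name statistics.
    def __init__(self):
        self.stats = defaultdict(lambda: {"calls": 0, "seconds": 0.0})

    def wrap(self, name, fn):
        def timed(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                rec = self.stats[name]
                rec["calls"] += 1
                rec["seconds"] += time.perf_counter() - start
        return timed

prof = CommProfiler()
# A toy "allreduce" standing in for the NCCL/MPI call being profiled.
allreduce = prof.wrap("allreduce", lambda xs: [sum(xs)] * len(xs))
allreduce([1, 2, 3])
allreduce([4, 5, 6])
print(prof.stats["allreduce"]["calls"])  # -> 2
```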
[Figure: ImageNet-1k training throughput, images per second (thousands) vs. number of GPUs (1 to 1536), comparing NCCL-2.4 and MVAPICH2-GDR-Next.]
MVAPICH2-GDR reaching ~0.35 million images per second for ImageNet-1k! (ImageNet-1k has 1.2 million images.)
Details of all publications are available from: http://go.osu.edu/ammar