

[Figure: Layer-wise overlapped gradient aggregation. During the backward pass, the main thread computes layers L_n, L_n-1, ..., L_2, L_1 while a helper thread overlaps communication, issuing Reduce(L_n), Reduce(L_n-1), ..., Reduce(L_2), Reduce(L_1) as each layer's gradients become ready]

Co-designing Communication Middleware and Deep Learning Frameworks for High-Performance DNN Training on HPC Systems

Ammar A. Awan and DK Panda (Advisor)
[email protected], [email protected]

MOTIVATION

• Resurgence of Deep Learning (DL)

• Availability of Large Datasets like ImageNet and massively-parallel modern hardware like NVIDIA GPUs

• Emergence of DL frameworks (Caffe, TensorFlow, CNTK, etc.)

• Computability of Deep Neural Networks (DNNs)
• Single GPU/node is not enough!
• Scale-up and scale-out training: an emerging research area

• Various strategies to deal with large DNNs
  • Data Parallelism, Out-of-Core Training, and Model Parallelism

• Parameter-Server approach / Reduction-Tree approach
• Distributed Address-Space Design Constraints

• Challenges for Communication Middleware
  • Very Large GPU-based Buffers
  • Reduction Collectives

• Overlap of Computation and Communication

• Co-design MPI middleware and communication of tensors for efficient DNN training
  • Large-message, CUDA-Aware, and Non-Blocking communication

• Design novel techniques to deal with Out-of-Core Workloads on GPUs

• Exploiting CUDA Unified Memory, efficient prefetching, and page-migration

• In-depth characterization and analysis of DL workloads and frameworks

• Profiling MPI and NCCL Communication

• Holistic understanding of single/multiple compute elements (CPU/GPU performance)

• Significant broader impact – the intersection of ML and HPC is a new research area

• Tutorials and Course (OSU) on High Performance Deep Learning

• Outreach through MVAPICH2-GDR and HiDL public releases

RESEARCH CHALLENGES

PROPOSED DESIGNS AND PERFORMANCE CHARACTERIZATION

PROPOSED FRAMEWORK

SUMMARY OF CONTRIBUTIONS

Accelerating Data-Parallel Training -- EuroMPI ‘16, EuroMPI ’18 and J. Parallel Computing ’19

Layer-wise Overlapped Gradient Aggregation

http://hidl.cse.ohio-state.edu

[Figure: MPI Reduce benchmark on 160 GPUs (10 nodes): latency (ms, log scale) vs. message size (1 B to 128 MB) for MV2-GDR-NCCL and MV2-GDR-Opt]

[Figure: VGG training with CNTK: training time (s) vs. number of GPUs (2 to 128) for MV2-GDR-NCCL and MV2-GDR-Opt]

[Figures: GoogLeNet training (strong scaling) and AlexNet training (weak scaling)]
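The layer-wise overlapped gradient aggregation scheme can be sketched in pure Python. This is a toy stand-in for the actual design: the helper thread, queue, and `sum()` reduction below mimic the hand-off of each layer's gradients to a non-blocking CUDA-aware MPI_Reduce, and the layer sizes and gradient values are illustrative assumptions.

```python
import threading, queue

def train_step(num_layers, grads_out):
    """Toy layer-wise overlapped gradient aggregation.

    The main thread runs the backward pass layer by layer (L_n .. L_1)
    and enqueues each layer's gradients as soon as they are ready; a
    helper thread concurrently 'reduces' queued gradients, standing in
    for the non-blocking CUDA-aware MPI_Reduce in the real design.
    """
    ready = queue.Queue()

    def helper():
        while True:
            layer, grad = ready.get()
            if layer is None:          # sentinel: backward pass finished
                return
            grads_out[layer] = sum(grad)   # stand-in for Reduce(L_layer)

    t = threading.Thread(target=helper)
    t.start()
    for layer in range(num_layers, 0, -1):   # backward pass: L_n .. L_1
        grad = [layer] * 4                   # fake per-layer gradient
        ready.put((layer, grad))             # overlap: hand off immediately
    ready.put((None, None))
    t.join()

grads = {}
train_step(3, grads)
print(grads)   # {3: 12, 2: 8, 1: 4} -- reduced in L_3, L_2, L_1 order
```

The key property is that Reduce(L_n) can complete while the main thread is still computing L_n-1's backward pass, which is where the overlap in the training-time charts comes from.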

[Figure: Proposed framework stack: Application Layer (DNN training with Caffe/OSU-Caffe, TensorFlow, CNTK, and PyTorch); Distributed Training Middleware (data-parallel MPI-Nets and Horovod, model-parallel designs, and out-of-core co-designs); Communication Middleware (DL-aware MPI: large-message reduction, CUDA-aware reductions, CUDA-aware broadcast, performance and design analysis); HPC platforms: multi-/many-core CPUs (Intel Xeon, AMD EPYC, and IBM POWER9), NVIDIA GPUs, and high-performance interconnects (InfiniBand, Omni-Path)]

[Figure: Programming models for distributed TensorFlow: TF programs (tf_cnn_benchmarks) run over parameter-server interfaces (gRPC, or gRPC+X with Verbs or MPI) or no-gRPC interfaces (Baidu-Allreduce, Horovod Distributed Optimizer), mapped onto communication runtimes/libraries (NCCL2, CUDA-aware MPI with CUDA-aware Allreduce, pointer cache, and GDR) on CPU/GPU platforms over InfiniBand and Cray Aries. Highlighted contributions: (1) program-launch optimizations, (2) proposed Allreduce designs and optimizations, (3) characterization and performance analysis]

[Figure: Images/second (higher is better) vs. number of nodes (1 to 16, one GPU each) for Baidu-MPI, Horovod-MPI, Horovod-NCCL2, gRPC+verbs, gRPC+MPI, and gRPC (IPoIB)]
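The allreduce at the heart of Baidu-Allreduce, Horovod, and NCCL2 is typically the bandwidth-optimal ring algorithm: a reduce-scatter phase followed by an allgather, 2(p-1) steps in total. A pure-Python toy (no real network; `p` simulated ranks exchange chunks through shared lists) illustrates the chunk movement:

```python
def ring_allreduce(bufs):
    """Toy ring allreduce over p simulated ranks.

    Each rank's buffer is split into p chunks. A reduce-scatter phase
    (p-1 steps) leaves rank r holding the fully reduced chunk (r+1) % p;
    an allgather phase (p-1 steps) then circulates the reduced chunks so
    every rank ends with the complete reduced buffer.
    """
    p = len(bufs)
    n = len(bufs[0])
    assert n % p == 0
    c = n // p
    chunks = [[buf[i * c:(i + 1) * c] for i in range(p)] for buf in bufs]
    # Reduce-scatter: in step s, rank r sends chunk (r - s) to rank r+1,
    # which accumulates it into its own copy of that chunk.
    for s in range(p - 1):
        for r in range(p):
            dst, idx = (r + 1) % p, (r - s) % p
            chunks[dst][idx] = [a + b for a, b in
                                zip(chunks[dst][idx], chunks[r][idx])]
    # Allgather: in step s, rank r forwards chunk (r + 1 - s) to rank r+1,
    # which simply overwrites (the chunk is already fully reduced).
    for s in range(p - 1):
        for r in range(p):
            dst, idx = (r + 1) % p, (r + 1 - s) % p
            chunks[dst][idx] = list(chunks[r][idx])
    return [[x for ch in chunks[r] for x in ch] for r in range(p)]

out = ring_allreduce([[1, 2, 3, 4], [10, 20, 30, 40]])
print(out[0])   # [11, 22, 33, 44] on every rank
```

Each rank sends and receives only 2(p-1)/p times the buffer size regardless of p, which is why ring-based designs dominate for the large gradient buffers in DNN training.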

[Figure: Unified Data Layer for out-of-core DNN training. In the existing design, forward and backward propagation for each layer's kernel move data through explicit file-to-host I/O (F2H/H2F), host-to-device and device-to-host copies (H2D/D2H), and device-to-device (D2D) transfers. In the proposed design, the data layer (L1) through layer N use managed memory (M2M, F2M/M2F) with per-layer Prefetch() and Advise(Evict()) hints, so layers stream through GPU memory in both forward and backward propagation]
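The Prefetch()/Advise(Evict()) pattern of the proposed Unified Data Layer can be mimicked with a toy memory manager in plain Python. The budget, layer sizes, and the `prefetch`/`evict` helpers below are illustrative assumptions standing in for cudaMemPrefetchAsync/cudaMemAdvise over CUDA Unified Memory, not the actual HiPC '18 implementation:

```python
def run_forward(layer_sizes, gpu_budget):
    """Toy out-of-core forward pass with layer-wise prefetch/evict.

    Before computing layer i, evict layer i-1 and prefetch layer i+1
    into the simulated GPU working set, so the resident footprint stays
    within gpu_budget even when the whole network does not fit.
    Returns the peak resident size observed.
    """
    resident, peak = {}, 0

    def prefetch(i):               # stand-in for Prefetch() on layer i
        resident[i] = layer_sizes[i]

    def evict(i):                  # stand-in for Advise(Evict()) on layer i
        resident.pop(i, None)

    prefetch(0)
    for i in range(len(layer_sizes)):
        evict(i - 1)               # previous layer's data no longer needed
        if i + 1 < len(layer_sizes):
            prefetch(i + 1)        # migrate next layer while computing this one
        peak = max(peak, sum(resident.values()))
        assert sum(resident.values()) <= gpu_budget, "out of GPU memory"
        # << Kernel >> for layer i would run here, on resident data only
    return peak

# A 6-layer network totalling 21 units fits in a 10-unit budget,
# because at most two adjacent layers are resident at once:
print(run_forward([2, 3, 4, 5, 4, 3], gpu_budget=10))   # -> 9
```

The point the figure makes is exactly this: with prefetch/evict hints, peak GPU residency is bounded by a small window of layers rather than the whole model, which is what lets oc-caffe run DNNs that caffe-gpu cannot.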

1 Performance Characterization and Design Analysis: Caffe, CNTK, TensorFlow -- MLHPC ‘17, CCGrid ’19, and HotI ‘19

2 Pure MPI Design for DL-Aware Broadcast (MPI_Bcast())
• Flexible communicator selection
  • Intra-node communicator: shared memory, loopback, GPUDirect RDMA, GDRCopy, CUDA IPC, pipelined IPC
  • Inter-node communicator: GPUDirect RDMA, host-staging, GDR write
• Algorithm selection: pipelined, K-nomial tree, scatter-allgather, and several others
• Collectives design selection: staged designs vs. direct designs
• Point-to-point (P2P) design selection: chain (ring), with P2P primitives (MPI_Isend/MPI_Irecv) for intra-node and inter-node P2P
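The algorithm-selection step picks, for example, a K-nomial tree for latency-bound small messages versus scatter-allgather for bandwidth-bound large ones. A k-nomial broadcast schedule can be sketched in pure Python; the size threshold in `select_bcast_algorithm` is an illustrative assumption, not MVAPICH's actual tuning table:

```python
def knomial_bcast_schedule(p, k):
    """Per-step (sender, receiver) pairs for a k-nomial broadcast from
    rank 0 among p ranks: in step s, every rank that already holds the
    data sends to ranks offset by j * k**s (j = 1 .. k-1), so the
    broadcast finishes in ceil(log_k(p)) steps."""
    have = {0}
    steps, s = [], 0
    while len(have) < p:
        pairs = []
        for src in sorted(have):
            for j in range(1, k):
                dst = src + j * k**s
                if dst < p and dst not in have:
                    pairs.append((src, dst))
        for _, dst in pairs:
            have.add(dst)
        steps.append(pairs)
        s += 1
    return steps

def select_bcast_algorithm(msg_bytes, threshold=64 * 1024):
    """Illustrative selection rule: tree algorithms for small messages,
    scatter-allgather for large ones (threshold is a made-up default)."""
    return "k-nomial-tree" if msg_bytes <= threshold else "scatter-allgather"

for s, pairs in enumerate(knomial_bcast_schedule(8, 2)):
    print(s, pairs)
# 0 [(0, 1)]
# 1 [(0, 2), (1, 3)]
# 2 [(0, 4), (1, 5), (2, 6), (3, 7)]
```

With k = 2 this is the classic binomial tree; raising k trades more sends per step for fewer steps, which is why the design exposes it as a tunable rather than fixing one algorithm.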

[Figures: Img/sec (higher is better) for caffe-gpu, oc-caffe-naïve, oc-caffe-opt, caffe-cpu, intel-caffe, and intel-caffe-opt on out-of-core AlexNet, GoogLeNet, and ResNet-50. caffe-gpu cannot run any of the three out-of-core models. oc-caffe-opt is 5X better than intel-caffe-opt for AlexNet, 2.7X better than intel-caffe-opt for GoogLeNet, and 80% better than intel-caffe for ResNet-50 (intel-caffe-opt N/A)]

3 Out-of-Core DNN Training -- HiPC ‘18

4 Co-designing MPI and Caffe for Data Parallelism – PPoPP ‘17

Layer-wise Overlapped Model Propagation

• Faster Convolutions → Faster Training

• Intel KNL matches NVIDIA P100 performance for AlexNet training – NVIDIA Volta is in a different league!

[Figure: Proposed profiling infrastructure (hvprof): Deep Learning Frameworks (TensorFlow, PyTorch, MXNet) run over the Distributed Training Middleware (Horovod), with hvprof interposed on the Communication Middleware (NCCL, MPI), on HPC platforms with CPUs, GPUs, and high-performance interconnects (InfiniBand, Omni-Path, PCIe, NVLink)]
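At its core, hvprof-style profiling interposes on each communication call and accumulates per-call timings bucketed by message size. A minimal stdlib sketch (the `fake_allreduce` stub and power-of-two buckets are hypothetical, standing in for wrapped NCCL/MPI calls):

```python
import time
from collections import defaultdict

class CommProfiler:
    """Minimal hvprof-style interposer: wraps a communication function and
    accumulates call counts and total latency per message-size bucket."""

    def __init__(self):
        self.stats = defaultdict(lambda: [0, 0.0])  # bucket -> [calls, secs]

    def wrap(self, fn):
        def timed(buf, *args, **kwargs):
            t0 = time.perf_counter()
            out = fn(buf, *args, **kwargs)
            # Bucket by next power of two of the element count.
            bucket = 1 << (max(len(buf), 1) - 1).bit_length()
            rec = self.stats[bucket]
            rec[0] += 1
            rec[1] += time.perf_counter() - t0
            return out
        return timed

def fake_allreduce(buf):        # stand-in for a wrapped NCCL/MPI allreduce
    return [x * 2 for x in buf]

prof = CommProfiler()
allreduce = prof.wrap(fake_allreduce)
for n in (3, 4, 100):
    allreduce(list(range(n)))
for bucket in sorted(prof.stats):
    calls, secs = prof.stats[bucket]
    print(f"<= {bucket} elems: {calls} calls")
# <= 4 elems: 2 calls
# <= 128 elems: 1 calls
```

Because the wrapper sits between the framework and the communication library, the same instrumentation works unchanged whether Horovod is routing tensors over NCCL or MPI, which is what makes side-by-side comparisons like the NCCL-2.4 vs. MVAPICH2-GDR chart below possible.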

[Figure: Images per second (thousands) vs. number of GPUs (1 to 1,536) for NCCL-2.4 and MVAPICH2-GDR-Next]

MVAPICH2-GDR reaches ~0.35 million images per second for ImageNet-1k training (ImageNet-1k has 1.2 million images)!

Details of all publications are available from: http://go.osu.edu/ammar