
Page 1

Benchmarking Deep Learning Workloads on Large-scale HPC Systems

Ammar Ahmad Awan and Dhabaleswar K. Panda
[email protected], [email protected]

The Ohio State University

Benchmarking in the Datacenter Workshop @ PPoPP ‘20

Follow us on

https://twitter.com/mvapich

Page 2

• Introduction

• Background

• ML/DL Benchmarks

• Solutions and Case Studies

• Conclusion and Future Directions

Agenda

Page 3

Overview of Artificial Intelligence

Courtesy: http://www.zdnet.com/article/caffe2-deep-learning-wide-ambitions-flexibility-scalability-and-advocacy/

Page 4

Applications: Style Transfer, Caption Generation, Translation, etc.

Courtesy: https://github.com/alexjc/neural-doodle

Courtesy: https://machinelearningmastery.com/inspirational-applications-deep-learning/

Courtesy: https://research.googleblog.com/2015/07/how-google-translate-squeezes-deep.html

Page 5

The Deep Learning (DL) Revolution

Adopted from: http://www.deeplearningbook.org/contents/intro.html

• DL – a revolutionary subset of ML

– Feature extraction vs. hand-crafted features

• Key success: Deep Neural Networks (DNNs)

– Almost everything was invented in the late ’80s – except what?

[Diagram: AI ⊃ Machine Learning (ML) ⊃ Deep Learning (DL); ML examples: Logistic Regression; DL examples: MLPs, DNNs]

Courtesy: https://hackernoon.com/difference-between-artificial-intelligence-machine-learning-and-deep-learning-1pcv3zeg, https://blog.dataiku.com/ai-vs.-machine-learning-vs.-deep-learning

Page 6

Deep Learning on GPUs

*https://blogs.nvidia.com/blog/2014/09/07/imagenet/

• NVIDIA GPUs - driving force for faster DL!

– 90% of ImageNet teams used GPUs (2014*)

– DNNs like Inception, ResNet(s), NASNets, and AmoebaNets

– Natural fit for DL workloads – throughput-oriented

• High Performance Computing (HPC) arena

– 135/500 Top HPC systems used NVIDIA GPUs (Nov ’19)

– CUDA-Aware Message Passing Interface (MPI)

• MVAPICH2-GDR, SpectrumMPI, OpenMPI, etc.

– DGX-1/DGX-2 – dedicated DL supercomputers

Courtesy: www.top500.org

Page 7

Deep Learning on CPUs

1. https://dl.acm.org/citation.cfm?id=1993516, 2. http://ieeexplore.ieee.org/abstract/document/5762730/, 3. https://dspace.mit.edu/bitstream/handle/1721.1/51839/MIT-CSAIL-TR-2010-013.pdf?sequence=1

• Dense many-core CPUs are emerging

• CPUs exist on nodes with GPUs

– Many-core Xeon, POWER9, EPYC, ARM, etc.

• Are CPUs really 10x – 100x slower than GPUs? [1-3]

• But CPU-based DL is getting much better and faster

– MKL-DNN, Vectorization, MPI optimized for large messages

– Data-Parallelism (Intel-Caffe – MLHPC ‘17)

– Model/Hybrid-Parallelism (HyPar-Flow – ISC ‘20)

Page 8

Three Key Pieces in the story:

• Computability of DNNs

– HPC enables faster DL!

• Datasets – ImageNet and beyond…

• State-of-the-art Accuracy – Vision, NLP, Translation, etc.

Why HPC and DL?

Courtesy: A. Canziani et al., “An Analysis of Deep Neural Network Models for Practical Applications”, CoRR, 2016.

[Chart: Minutes to train AlexNet – 2x GTX 580 vs. DGX-2 – ~500X improvement in 5 years]

Page 9

• Introduction

• Background

• ML/DL Benchmarks

• Solutions and Case Studies

• Conclusion and Future Directions

Agenda

Page 10

Two Phases in Deep Learning

Courtesy: https://devblogs.nvidia.com/

• Training is compute-intensive

– Many passes over data

– Can take days to weeks

– Model weights are adjusted (learned)

• Inference

– Single pass over the data

– Takes seconds

– No model adjustment

• Challenge: How to make “Training” faster?

– Exploit HPC hardware and software

Page 11

• Data Parallelism (most common) – see the sketch after this slide

• Model and Hybrid Parallelism (emerging)

• ‘X’-Parallelism
  – ‘X’ → Spatial, Channel, Filter, etc.

Parallelization Strategies for DNN Training

[Diagrams: Data Parallelism, Model Parallelism, and Hybrid (Model and Data) Parallelism]

Courtesy: http://engineering.skymind.io/distributed-deep-learning-part-1-an-introduction-to-distributed-training-of-neural-networks
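To make the data-parallel pattern above concrete, here is a minimal sketch of per-rank gradient averaging with MPI_Allreduce, written with mpi4py and NumPy; the library choice, the array shape, and the random "gradients" are illustrative assumptions, not the talk's actual training code (which relies on Horovod over MVAPICH2 or NCCL).

```python
# Minimal data-parallel gradient-averaging sketch using mpi4py (an assumption).
# Each rank trains on its own data shard, then averages gradients with Allreduce.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Hypothetical local gradients for one layer (stand-in for a real backward pass).
local_grad = np.random.rand(1024).astype(np.float32)

# Sum gradients across all ranks, then divide to get the average.
global_grad = np.empty_like(local_grad)
comm.Allreduce(local_grad, global_grad, op=MPI.SUM)
global_grad /= size

if rank == 0:
    print(f"Averaged gradients across {size} ranks")
```

Launched with, e.g., `mpirun -np 4 python data_parallel_sketch.py`, every rank ends up with identical averaged gradients – exactly the synchronization step that data-parallel training repeats every iteration.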

Page 12

• Introduction

• Background

• ML/DL Benchmarks

• Solutions and Case Studies

• Conclusion and Future Directions

Agenda

Page 13

Benchmarking DL Workloads

Benchmarking Scale-up and Scale-out

performance of DL workloads on large-scale HPC systems

[Chart: Scale-up performance (vertical axis) vs. scale-out performance (horizontal axis) for cuDNN, MKL-DNN, gRPC, Hadoop, MPI, and NCCL2, with the desired region delivering both]

Page 14

DNN Training: A Complex Solution Space

Data-Parallel Training

Out-of-Core Training

Model/Hybrid Parallel Training

In-depth Performance Characterization and Profiling Analysis

Page 15

Several Efforts focused on DL Benchmarking

Many benchmarks, but most are not suitable for HPC systems:

• DAWNBench (Stanford)
• convnet-benchmarks (Soumith Chintala)
• tf_cnn_benchmarks (TensorFlow)
• Time Benchmark (Caffe)
• DL Bench (Hong Kong University)
• Deep500 (ETH Zurich)
• MLPerf (mlperf.org – strong industry support)

Page 16

Simplified Deep Learning Stack

[Diagram: Application areas (language, style transfer, translation, image captioning) on top of DL frameworks, which sit on communication libraries, I/O (?), and compute libraries]

Page 17

DL Execution Stack: Details

• DL Frameworks: TensorFlow, PyTorch, MXNet, Caffe
• Distributed Training Middleware: Horovod, Distributed TensorFlow, PyTorch Distributed, OSU-Caffe, HyPar-Flow
• Communication Middleware: MPI, NCCL, Gloo, gRPC
• HPC Platforms: multi-/many-core CPUs (Intel Xeon, AMD EPYC, and IBM POWER9), NVIDIA GPUs, and high-performance interconnects (InfiniBand, Omni-Path)

Page 18

• Introduction

• Background

• ML/DL Benchmarks

• Solutions and Case Studies

• Conclusion and Future Directions

Agenda

Page 19

• Single Node
  – Computation only – cuDNN, MKL-DNN, etc.
  – Communication – shared memory, CUDA IPC, etc.

• Multiple Nodes
  – Computation – same as single node
  – Communication – MPI, NCCL, Gloo, etc.

• Studies covered in today’s talk
  1. CPU vs. GPU comparison is usually unfair – MLHPC ‘17
  2. Different approaches, different end-to-end performance for the same DL framework – CCGrid ’19
  3. Communication in DL is not the same as communication in HPC – HotI ‘19, IEEE Micro ’19
  4. Beyond Data-Parallelism – HyPar-Flow (accepted for presentation at ISC ’20)

Solutions and Case Studies: Different Benchmarking Directions

[Diagram: Benchmarking directions – single node with a single core/GPU, single node with multiple cores/GPUs, and multiple nodes with multiple cores/GPUs; the remaining combination is marked N/A]

Page 20

• Faster convolutions → faster training

• Intel KNL performance ≈ NVIDIA P100 for AlexNet

• Volta is in a different league!

1. Caffe: CPUs vs. GPUs

A. A. Awan, H. Subramoni, and Dhabaleswar K. Panda. “An In-depth Performance Characterization of CPU- and GPU-based DNN Training on Modern Architectures”, In Proceedings of the Machine Learning on HPC Environments (MLHPC'17), in conjunction with SC ‘17, Denver, CO

[Diagram: DL applications (image recognition, speech processing, etc.) run on DL frameworks (Caffe, TensorFlow, etc.), whose convolution layers come in generic, MKL-optimized, and cuDNN-optimized variants, backed by BLAS libraries (MKL 2017, cuDNN/cuBLAS, OpenBLAS, ATLAS, and other BLAS libraries) on the underlying hardware (multi-/many-core Xeon and Xeon Phi, many-core Pascal P100 GPUs, and other processors)]

• Caffe – the first framework NVIDIA optimized using cuDNN

– Everyone has optimized it ever since!

• Holistic View of Performance is needed!
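As a rough illustration of the "faster convolutions → faster training" point, the sketch below times a single Conv2D layer on CPU and GPU with TensorFlow 2.x; the layer shape, batch size, and iteration count are arbitrary assumptions, and this is not the Caffe/MKL/cuDNN methodology of the MLHPC '17 study.

```python
# Rough sketch: time one Conv2D layer on CPU vs. GPU with TensorFlow 2.x.
# Only an illustration of "faster convolutions -> faster training".
import time
import tensorflow as tf

def time_conv(device, iters=50):
    with tf.device(device):
        x = tf.random.normal([64, 224, 224, 3])            # batch of images (assumed size)
        conv = tf.keras.layers.Conv2D(64, 7, strides=2, padding="same")
        conv(x)                                            # warm-up, builds the weights
        start = time.perf_counter()
        for _ in range(iters):
            y = conv(x)
        _ = y.numpy()                                      # force pending work to finish
        return (time.perf_counter() - start) / iters

print("CPU:", time_conv("/CPU:0"), "s/iter")
if tf.config.list_physical_devices("GPU"):
    print("GPU:", time_conv("/GPU:0"), "s/iter")
```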

Page 21

• Out-of-Core workloads – no good baseline to compare

– Easiest fallback is to use the CPU → a lot more CPU memory is available than GPU memory

• OC-Caffe-Optimized (Opt) designs perform much better than CPU/optimized-CPU designs!
  – DNN depth is the major cause of slowdowns → significantly more intra-GPU communication

OC-Caffe: GPU (Unified Memory) vs. CPU

[Charts: Img/sec (higher is better) for caffe-gpu, oc-caffe-naïve, oc-caffe-opt, caffe-cpu, intel-caffe, and intel-caffe-opt – caffe-gpu cannot run any of the out-of-core cases:
• Out-of-Core AlexNet – oc-caffe-opt is 5X better than intel-caffe-opt
• Out-of-Core GoogLeNet – oc-caffe-opt is 2.7X better than intel-caffe-opt
• Out-of-Core ResNet-50 – oc-caffe-opt is 80% better than intel-caffe (intel-caffe-opt: N/A)]

A. A. Awan, C-H Chu, H. Subramoni, X. Lu, and DK Panda, “OC-DNN: Exploiting Advanced Unified Memory Capabilities in CUDA 9 and Volta GPUs for Out-of-Core DNN Training”, HiPC ’18
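A back-of-the-envelope way to see when training goes "out of core" is to compare an estimated training footprint (weights, gradients, optimizer state, and saved activations) against GPU memory; the sketch below does that with entirely hypothetical model sizes and a 16 GB GPU, and is only meant to illustrate the motivation for the unified-memory designs in OC-DNN.

```python
# Back-of-the-envelope sketch: estimate training memory vs. GPU capacity to see
# when a model becomes "out-of-core". All sizes below are rough assumptions.
def training_memory_gb(num_params, activations_per_image, batch_size,
                       bytes_per_value=4, optimizer_copies=3):
    # weights + gradients + optimizer state (e.g., momentum/Adam copies)
    param_bytes = num_params * bytes_per_value * (1 + 1 + optimizer_copies)
    # forward activations kept around for the backward pass
    act_bytes = activations_per_image * batch_size * bytes_per_value
    return (param_bytes + act_bytes) / 1e9

GPU_MEMORY_GB = 16  # e.g., a 16 GB P100/V100 card (assumption)

# Hypothetical ResNet-style model: 25M parameters, ~30M activation values/image.
need = training_memory_gb(num_params=25e6, activations_per_image=30e6,
                          batch_size=256)
print(f"Estimated {need:.1f} GB needed vs. {GPU_MEMORY_GB} GB on the GPU")
print("out-of-core (unified memory) required" if need > GPU_MEMORY_GB
      else "fits in GPU memory")
```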

Page 22

• gRPC (official support)

– Open-source – can be enhanced by others

– Accelerated gRPC (add RDMA to gRPC)

• gRPC+X

– Use gRPC for bootstrap and rendezvous

– Actual communication is in “X”

– X → MPI, Verbs, GPUDirect RDMA (GDR), etc.

• No-gRPC

– Baidu – the first one to use MPI Collectives for TF

– Horovod – Use NCCL, or MPI, or any other future library (e.g. IBM DDL support recently added)

2. Same Framework, Different Communication Approaches

[Diagram: Distributed TensorFlow communication taxonomy – gRPC (including Accelerated gRPC), gRPC+X (gRPC+MPI, gRPC+Verbs, gRPC+GDR), and No-gRPC (Baidu-MPI, and Horovod over MPI or NCCL)]

A. A. Awan, J. Bedorf, C.-H. Chu, H. Subramoni and D. K. Panda, “Scalable Distributed DNN Training using TensorFlow and CUDA-Aware MPI: Characterization, Designs, and Performance Evaluation”, CCGrid ‘19. https://arxiv.org/abs/1810.11112
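The "No-gRPC" Horovod path can be sketched in a few lines with the horovod.tensorflow.keras API; whether the allreduce underneath runs over NCCL or an MPI library such as MVAPICH2-GDR is decided when Horovod is built and launched, not in this script. The synthetic data and hyperparameters are placeholders.

```python
# Minimal Horovod-on-Keras sketch (the "No-gRPC" path). Whether allreduce uses
# NCCL or MPI underneath is chosen when Horovod is built and launched.
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()

# Pin each process to one GPU (if GPUs are present).
gpus = tf.config.list_physical_devices("GPU")
if gpus:
    tf.config.set_visible_devices(gpus[hvd.local_rank()], "GPU")

model = tf.keras.applications.ResNet50(weights=None, classes=1000)

# Scale the learning rate by the number of workers and wrap the optimizer
# so that gradients are averaged with allreduce.
opt = hvd.DistributedOptimizer(tf.keras.optimizers.SGD(0.01 * hvd.size()))
model.compile(loss="sparse_categorical_crossentropy", optimizer=opt)

# Synthetic data stands in for ImageNet here.
x = tf.random.normal([32, 224, 224, 3])
y = tf.random.uniform([32], maxval=1000, dtype=tf.int32)

callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]
model.fit(x, y, epochs=1, callbacks=callbacks,
          verbose=1 if hvd.rank() == 0 else 0)
```

Such a script is typically launched with `horovodrun -np <ranks> python train.py` or an equivalent mpirun command.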

Page 23

• gRPC and gRPC+X designs are slower than No-gRPC designs

• Baidu design (ring-allreduce) still slower than Horovod-NCCL and gRPC+‘X’

• Horovod-MPI is about 10% slower than Horovod-NCCL2.

Performance Characterization of Distributed TensorFlow

[Chart: Images/second (higher is better) vs. number of nodes (GPUs), 1–16, for Baidu-MPI, Horovod-MPI, Horovod-NCCL2, gRPC+verbs, gRPC+MPI, and gRPC (IPoIB)]

A. A. Awan, J. Bedorf, C.-H. Chu, H. Subramoni and D. K. Panda, “Scalable Distributed DNN Training using TensorFlow and CUDA-Aware MPI: Characterization, Designs, and Performance Evaluation”, CCGrid ‘19. https://arxiv.org/abs/1810.11112

Page 24

Overview of the MVAPICH2 Project

• High-Performance open-source MPI Library

• Support for multiple interconnects
  – InfiniBand, Omni-Path, Ethernet/iWARP, RDMA over Converged Ethernet (RoCE), and AWS EFA

• Support for multiple platforms
  – x86, OpenPOWER, ARM, Xeon-Phi, GPGPUs

• Started in 2001, first open-source version demonstrated at SC ‘02

• Supports the latest MPI-3.1 standard

• http://mvapich.cse.ohio-state.edu

• Additional optimized versions for different systems/environments:
  – MVAPICH2-X (Advanced MPI + PGAS), since 2011

– MVAPICH2-GDR with support for NVIDIA GPGPUs, since 2014

– MVAPICH2-MIC with support for Intel Xeon-Phi, since 2014

– MVAPICH2-Virt with virtualization support, since 2015

– MVAPICH2-EA with support for Energy-Awareness, since 2015

– MVAPICH2-Azure for Azure HPC IB instances, since 2019

– MVAPICH2-X-AWS for AWS HPC+EFA instances, since 2019

• Tools:
  – OSU MPI Micro-Benchmarks (OMB), since 2003

– OSU InfiniBand Network Analysis and Monitoring (INAM), since 2015

• Used by more than 3,075 organizations in 89 countries

• More than 694,000 (> 0.6 million) downloads from the OSU site directly

• Empowering many TOP500 clusters (Nov ‘19 ranking)
  – 3rd, 10,649,600-core (Sunway TaihuLight) at NSC, Wuxi, China
  – 5th, 448,448 cores (Frontera) at TACC

– 8th, 391,680 cores (ABCI) in Japan

– 14th, 570,020 cores (Nurion) in South Korea and many others

• Available with software stacks of many vendors and Linux Distros (RedHat, SuSE, OpenHPC, and Spack)

• Partner in the 5th ranked TACC Frontera system

• Empowering Top500 systems for more than 15 years

Page 25

MVAPICH2 (MPI)-driven Infrastructure for ML/DL Training

[Diagram: ML/DL applications → TensorFlow / PyTorch / MXNet → Horovod → MVAPICH2-X for CPU-based training and MVAPICH2-GDR for GPU-based training]

Page 26

Communication Benchmark: MV2-GDR vs. NCCL2 – Allreduce (DGX-2)

• Optimized designs in upcoming MVAPICH2-GDR offer better or comparable performance for most cases

• MPI_Allreduce (MVAPICH2-GDR) vs. ncclAllreduce (NCCL2) on 1 DGX-2 node (16 Volta GPUs)

[Charts: Latency (µs) vs. message size for MVAPICH2-GDR-Next and NCCL 2.5 – ~4.7X better for small messages (8 B–128 KB) and ~2.5X better for large messages (256 KB–256 MB)]

Platform: NVIDIA DGX-2 system (16 NVIDIA Volta GPUs connected with NVSwitch), CUDA 9.2
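The measurement pattern behind these latency curves is a simple warm-up-then-time loop over increasing message sizes; the sketch below reproduces that pattern with mpi4py on host buffers only, so it will not reproduce the slide's GPU-buffer numbers, which come from the OSU Micro-Benchmarks (OMB) with MVAPICH2-GDR and NCCL.

```python
# Simplified osu_allreduce-style latency sweep with mpi4py on host buffers.
# The slide's numbers come from OSU Micro-Benchmarks on GPU buffers;
# this sketch only shows the measurement pattern.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
WARMUP, ITERS = 10, 100

size = 8
while size <= 1 << 20:                                  # 8 B ... 1 MB
    sendbuf = np.zeros(size // 4, dtype=np.float32)     # 4 bytes per float
    recvbuf = np.empty_like(sendbuf)
    for _ in range(WARMUP):
        comm.Allreduce(sendbuf, recvbuf, op=MPI.SUM)
    comm.Barrier()
    start = MPI.Wtime()
    for _ in range(ITERS):
        comm.Allreduce(sendbuf, recvbuf, op=MPI.SUM)
    elapsed = (MPI.Wtime() - start) / ITERS
    if rank == 0:
        print(f"{size:>8} bytes  {elapsed * 1e6:8.2f} us")
    size *= 2
```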

Page 27

Communication Benchmark at Scale

• Optimized designs in upcoming MVAPICH2-GDR offer better performance for most cases

• MPI_Allreduce (MVAPICH2-GDR) vs. ncclAllreduce (NCCL2) up to 1,536 GPUs

[Charts: (left) Allreduce latency (µs) vs. message size (4 B–16 KB) on 1,536 GPUs for MVAPICH2-GDR-2.3.3 and NCCL 2.5 – 1.6X better; (right) bandwidth (GB/s) for a 128 MB message on 24–1,536 GPUs for Spectrum MPI 10.3, Open MPI 4.0.1, NCCL 2.5, and MVAPICH2-GDR-2.3.3 – 1.7X better]

Platform: Dual-socket IBM POWER9 CPU, 6 NVIDIA Volta V100 GPUs, and 2-port InfiniBand EDR interconnect

Page 28

End-to-end Performance Benchmark for TF on Summit

• ResNet-50 training using the TensorFlow benchmark on Summit – 1,536 Volta GPUs!

• 1,281,167 images

• Time/epoch ≈ 3.6 seconds

• Total time (90 epochs) ≈ 332 seconds ≈ 5.5 minutes!

[Chart: Images per second (thousands) vs. number of GPUs (1–1,536) for NCCL2 and MVAPICH2-GDR]

Platform: The Summit Supercomputer (#1 on Top500.org) – 6 NVIDIA Volta GPUs per node connected with NVLink, CUDA 9.2

*We observed errors for NCCL2 beyond 96 GPUs

MVAPICH2-GDR reaching ~0.35 million images per second for ImageNet-1k!

ImageNet-1k has 1.2 million images
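The epoch-time arithmetic on this slide follows directly from the measured throughput; a tiny script makes the calculation explicit (both input numbers are taken from the slide).

```python
# The slide's arithmetic: at ~0.35 million images/second on 1,536 GPUs,
# one pass over ImageNet-1k (1,281,167 images) takes only a few seconds.
IMAGES = 1_281_167          # ImageNet-1k training images (from the slide)
IMG_PER_SEC = 350_000       # ~0.35 million images/second (from the slide)
EPOCHS = 90

epoch_time = IMAGES / IMG_PER_SEC
total = epoch_time * EPOCHS
print(f"time/epoch ~ {epoch_time:.1f} s")                    # ~3.7 s
print(f"90 epochs  ~ {total:.0f} s (~{total / 60:.1f} min)")  # ~330 s, ~5.5 min
```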

Page 29

• White-box profiling is needed for complex DL frameworks

• hvprof provides multiple types of valuable metrics for 1) ML/DL developers and 2) designers of MPI libraries

• Profile of Latency for Allreduce (NVLink, PCIe, IB, Omni-Path)

• Summary: non-power-of-2 message sizes are under-optimized in all libraries!

3. Communication in DL vs. HPC


Awan et al., “Communication Profiling and Characterization of Deep Learning Workloads on Clusters with High-Performance Interconnects”, IEEE Micro (Magazine) ‘20, Hot Interconnects ’19.

[Charts: Allreduce latency profiles – Inception-v4 with Intel MPI and ResNet-101 with MVAPICH2]

[Diagram: Proposed profiling infrastructure (hvprof) spanning DL frameworks (TensorFlow, PyTorch, MXNet), distributed training middleware (Horovod), communication middleware (NCCL, MPI), and HPC platforms (CPUs, GPUs, and high-performance interconnects: InfiniBand, Omni-Path, PCIe, NVLink)]

Page 30

• Data-Parallelism – only for models that fit in memory

• Out-of-core models
  – Deeper model → better accuracy, but more memory required!

• Model parallelism can work for out-of-core models! (see the partitioning sketch after this slide)

• Key Challenges
  – Model partitioning is difficult for application programmers
  – Finding the right partition (grain) size is hard
  – Cut at which layer, and why?
  – Developing a practical system for model-parallelism
    • Redesign the DL framework or create additional layers?
    • Existing communication middleware, or are extensions needed?

4. Beyond Data Parallelism in DNNs

Memory Consumption (Extrapolated)

[Chart: Memory (GB) vs. input image size (width x height) for ResNet-1k and ResNet-5k, compared against Pascal GPU, Volta GPU, CPU Broadwell (128 GB), and CPU Skylake (192 GB) memory capacities – only possible with Model Parallelism!]

A. A. Awan, A. Jain, Q. Anthony, H. Subramoni, and DK Panda, “HyPar-Flow: Exploiting MPI and Keras for Hybrid Parallel Training of TensorFlow models”, ISC ‘20 (accepted to be presented), https://arxiv.org/pdf/1911.05146.pdf
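To illustrate the "cut at which layer?" question, the toy sketch below splits a small Keras model into contiguous layer partitions; the splitting heuristic and the model itself are invented for illustration and are not HyPar-Flow's actual partitioning algorithm.

```python
# Toy sketch: split a Keras Sequential model into contiguous layer partitions,
# one per rank/device. This only illustrates the "cut at which layer?" question;
# it is not HyPar-Flow's partitioning logic.
import tensorflow as tf

def make_partitions(layers, num_partitions):
    """Greedily split a layer list into num_partitions contiguous chunks."""
    chunk = -(-len(layers) // num_partitions)   # ceiling division
    return [layers[i:i + chunk] for i in range(0, len(layers), chunk)]

base = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(32, 32, 3)),
    tf.keras.layers.Conv2D(64, 3, activation="relu"),
    tf.keras.layers.Conv2D(64, 3, activation="relu"),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(10),
])

partitions = make_partitions(base.layers, num_partitions=3)
for i, part in enumerate(partitions):
    print(f"Partition-{i}: {[l.name for l in part]}")
# At run time, partition i would forward activations to partition i+1 and
# receive gradients back (the send/recv arrows in the HyPar-Flow diagram).
```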

Page 31

• HyPar-Flow is practical (easy-to-use) and high-performance (uses MPI)

– Based on Keras models and exploits TF 2.0 Eager Execution

– Leverages performance of MPI pt-to-pt. and collectives for communication

Model and Hybrid Parallelism

[Diagram: HyPar-Flow hybrid parallelism for a DNN with identity (residual) mappings – the model (Image → Conv1 → Conv2 → Conv3 → Conv4 → FC → Output) is split into Partition-0/1/2 within each replica; partitions exchange data (D) and gradients (∇) via point-to-point send/recv, while Replica-0 and Replica-1 synchronize with Allreduce communication]

hf.fit(model, num_partitions, num_replicas, strategy)

E.g. hf.fit(model, 3, 2, hybrid)

A. A. Awan, A. Jain, Q. Anthony, H. Subramoni, and DK Panda, “HyPar-Flow: Exploiting MPI and Keras for Hybrid Parallel Training of TensorFlow models”, ISC ‘20 (accepted to be presented), https://arxiv.org/pdf/1911.05146.pdf
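Based only on the interface shown above, using HyPar-Flow from a standard Keras model might look like the sketch below; the module name hypar_flow, the keyword-argument spellings, and the string form of the strategy are assumptions beyond what the slide states.

```python
# Hedged sketch of the HyPar-Flow interface shown on the slide:
#   hf.fit(model, num_partitions, num_replicas, strategy)
# The import name "hypar_flow" and the exact argument spellings are assumptions;
# only the call shape comes from the slide/paper.
import tensorflow as tf
import hypar_flow as hf   # assumed import name for HyPar-Flow

# A plain Keras model definition -- no parallelism code in the model itself.
model = tf.keras.applications.ResNet50(weights=None, classes=1000)

# 3 model partitions per replica, 2 replicas, hybrid (model + data) parallelism,
# matching the slide's example hf.fit(model, 3, 2, hybrid).
hf.fit(model, num_partitions=3, num_replicas=2, strategy="hybrid")
```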

Page 32

• CPU based results

– AMD EPYC

– Intel Xeon

• Excellent speedups for

– VGG-19

– ResNet-110

– ResNet-1000 (1k layers)

• Able to train “future” models

– E.g. ResNet-5000 (a synthetic 5000-layer model we benchmarked)

Benchmarking HyPar-Flow in Different Configurations

[Chart: Img/sec (higher is better) vs. number of nodes (1–256) for 48 partitions (1 node), 96 partitions (2 nodes), 192 partitions (4 nodes), and 768 partitions (16 nodes); EBS = diameter of circle]

110x speedup on 128 Intel Xeon Skylake nodes (TACC Stampede2)

Page 33

• ResNet-1001 with variable batch size

• Approach:
  – 48 model-partitions for 56 cores
  – 512 model-replicas for 512 nodes
  – Total cores: 48 x 512 = 24,576

• Speedup
  – 253X on 256 nodes
  – 481X on 512 nodes

• Scaling Efficiency (see the sketch after this slide)
  – 98% up to 256 nodes
  – 93.9% for 512 nodes

End-to-end Performance at Scale (512 nodes on Frontera)

481x speedup on 512 Intel Xeon Skylake nodes (TACC Frontera)

[Chart: Img/sec (higher is better) vs. nodes (= replicas, 1–512) for 56 partitions (1 node), compared to ideal scaling]
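The scaling-efficiency figures follow from the reported speedups divided by the ideal (the number of nodes); a short script checks the arithmetic using the slide's numbers.

```python
# Scaling efficiency as reported on the slide: efficiency = speedup / #nodes.
results = {256: 253, 512: 481}   # nodes -> measured speedup (from the slide)
for nodes, speedup in results.items():
    eff = 100.0 * speedup / nodes
    print(f"{nodes} nodes: {speedup}X speedup -> {eff:.1f}% efficiency")
# 256 nodes: 253X -> 98.8%   (slide: ~98%)
# 512 nodes: 481X -> 93.9%   (slide: 93.9%)
```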

Page 34

• Introduction

• Background

• ML/DL Benchmarks

• Solutions and Case Studies

• Conclusion and Future Directions

Agenda

Page 35

• Reproducibility is a challenge for HPC and DL

• Challenges

– No standard and user-friendly way of benchmarking ML/DL models

– Disconnect between HPC and DL communities

– Metrics cause confusion between communities – images/second, time to train, latency, bandwidth, etc.

• MLPerf is a good start, but mostly targets a single node/GPU

– Can we extend for HPC systems?

• Deep500 – a meta DL framework to evaluate DL or DL frameworks?

• Other benchmarks – framework specific (tf_cnn_benchmarks) or low-level (latency/bw)

Future Direction 1: Reproducibility via Open Infrastructures

Page 36

• Models beyond ResNet(s) – Generative Adversarial Networks, Transformer, BERT, GPT-2, Mini-go, etc.

• Applications beyond Image Classification – Neural Machine Translation, Language Processing, Recommendation Systems, Reinforcement Learning, Neural Code Generation, etc.

Investigate scale-up/scale-out behavior of published models on current systems like Sierra and Summit, and on upcoming systems like El Capitan and Frontier

Future Direction 2: Benchmarking Emerging Models and Areas

Page 37

• Deep Learning is an important area

• Several benchmarking efforts to better understand DL workloads

– MLPerf, Deep500, DAWNBench, etc.

• DL on single-node systems is complex enough

– cuDNN, MKL, BLAS, shared-memory communication, CUDA IPC, etc.

• DL on HPC systems is even more challenging

– All of the single node + MPI, NCCL, Gloo, etc.

• Several design choices and benchmarking studies

• No single benchmark can cover all aspects!

Conclusion

Page 38

Thank You!

Questions?

Network-Based Computing Laboratory
http://nowlab.cse.ohio-state.edu/

[email protected]

[email protected]

https://awan-10.github.io

The High-Performance MPI/PGAS Project
http://mvapich.cse.ohio-state.edu/

The High-Performance Deep Learning Project
http://hidl.cse.ohio-state.edu/

Page 39

Please join us here from 1pm - 5pm

• Tutorial on High Performance Distributed Deep Learning

• Several topics on Distributed DL Trends and Designs

• Includes a Hands-on section on Distributed TensorFlow

• Speakers: DK Panda, Ammar Awan, and Hari Subramoni