THE MLPERF BENCHMARKS: DEEP LEARNING AT ......THE MLPERF BENCHMARKS: DEEP LEARNING AT SCALE WITH NVIDIA GPUS Vishal Mehta, DevTech Compute HPC-AI Advisory Council, Swiss Conference,

THE MLPERF BENCHMARKS: DEEP LEARNING AT SCALE WITH NVIDIA GPUS

Vishal Mehta, DevTech Compute

HPC-AI Advisory Council, Swiss Conference, Lugano Date:02/04/2019

2

MLPERF BENCHMARKING AREAS

Problem Model DataSet

Image Classification Resnet-50 v1.5 ImageNet

Object Detection – Heavy Weight Mask R-CNN COCO

Object Detection – Light Weight Single-shot Detector (SSD) COCO

Translation (non-recurrent) Transformer WMT English-German

Translation (recurrent) Neural Machine Translator (NMT) WMT English-German

Recommendation Neural Collaborative Filtering (NCF) MovieLens 20M

Reinforcement Learning Mini Go (Based on Alpha Go)

*Closed Divisions: fixed model parameters, fixed data format, results must be reproducible*Open Divisions: encourage innovations, tricks and model adjustment welcomed

Diverse Use Cases Towards a Full Performance Picture

3

MLPERFFirst Industry Benchmark for Measuring AI Performance

https://mlperf.org/

• Open source code on single/multi node DGX systems

• Everywhere: Workstation/Clusters with SLURM/Cloud

• Reproducible Performance!

4

MLPERF RESULTS: CLOSED DIVISIONSSingle Node & At Scale

Results are Time to Complete Model Training

Single Node At Scale

* Full reference: https://mlperf.org/results/

5

AI TRAINING REQUIRES FULL STACK INNOVATION

2015

36000 Mins (25 Days)1xK80 | 2015

CUDA

2016

1200 Mins (20 Hours)DGX-1P | 2016

NVLink

2017

480 Mins (8 Hours)DGX-1V | 2017

Tensor Core

70 Minutes on MLPerfDGX-2H | 2018

NVSwitch

2018

6.3 Minutes on MLPerfAt Scale | 2018

DGX Cluster

6

PILLARS OF GPU PERFORMANCE

CUDA Architecture

NVLink/NVSwitch Integrated Software

Massively Parallel Processing

High Speed Connecting between GPUs for Distributed

Algorithms

Fully Integrated Software and Hardware for Instant

Productivity

NVSwitch

6x NVLink

CUDA

PYTHON

APACHE ARROW on GPU Memory

DASK

cuDNN

RAPIDS

cuMLcuDF

DL

FRAMEWORKS

Tensor Cores

Mixed Precision Matrix Math Support

7

TENSOR CORESMixed Precision Matrix Math

• CUDA TensorOp instructions & data formats. Automated Mixed Precision

• Using Tensor cores via

• Volta optimized frameworks and libraries (cuDNN, CuBLAS, TensorRT, ..)

• CUDA C++ Warp Level Matrix Operations

• cuTENSOR library (pre-release version available)

8

cuTENSORA New High-Performance CUDA Library for Tensor Primitives

9

TENSOR CORE AUTOMATIC MIXED PRECISIONOver 3x Speedup With Just Two Lines of Code

TOOLS AND LIBRARIES MAINTAIN NETWORK ACCURACY

Tensor Core Journey Page

Github

Profiler Tools

Performance increase using automatic mixed precision on a variety of

training data sets. All performance collected using 1xV100-16GB except

bert-squadqa, which ran on 1xV100-32GB.

Enable AMP in TensorFlow (NGC Container 19.03)

export TF_ENABLE_AUTO_MIXED_PRECISION=1

https://developer.nvidia.com/tensor-cores

https://ngc.nvidia.com/catalog/containers?orderBy=&query=&quickFilter=deep-learning&filters=

https://github.com/NVIDIA/DeepLearningExamples

https://devblogs.nvidia.com/using-nsight-compute-nvprof-mixed-precision-deep-learning-models/

10

MIXED PRECISION MAINTAINS ACCURACYBenefit From Higher Throughput Without Compromise

ILSVRC12 classification top-1 accuracy.(Sharan Narang, Paulius Micikevicius et al., "Mixed Precision Training“, ICLR 2018)

**Same hyperparameters and learning rate schedule as FP32.

0.00%

10.00%

20.00%

30.00%

40.00%

50.00%

60.00%

70.00%

80.00%

AlexNet VGG-D GoogleNet(Inception v1)

Inception v2 Inception v3 Resnet50

Model Accura

cy

FP32 Mixed Precision**

11

TENSOR CORES FOR SCIENCEMulti-precision computing

AI-POWERED WEATHER PREDICTION

PLASMA FUSION APPLICATION EARTHQUAKE SIMULATION

7.815.7

125

0

20

40

60

80

100

120

140

V100 TFLOPS

FP64+ MULTI-PRECISION

FP16 Solver

3.5x times faster

FP16/FP32

1.15x ExaOPS

FP16-FP21-FP32-FP64

25x times faster

12

NVIDIA DGX-2

1

2

3

5

4

6 Two Intel Xeon Platinum CPUs

7 1.5 TB System Memory

12

30 TB NVME SSDs Internal Storage

NVIDIA Tesla V100 32GB

Two GPU Boards8 V100 32GB GPUs per board6 NVSwitches per board512GB Total HBM2 Memoryinterconnected byPlane Card

Twelve NVSwitches2.4 TB/sec bi-section

bandwidth

Eight EDR Infiniband/100 GigE1600 Gb/sec Total Bi-directional Bandwidth

PCIe Switch Complex

8

9

9Dual 10/25 Gb/secEthernet

1313

• 18 NVLINK ports

• @50 GB/s per port bi-directional

• 900 GB/s total bi-directional

• Fully connected crossbar

• 2 billion transistors

NVSWITCH

14

FULL NON-BLOCKING BANDWIDTH

15

UNIFIED MEMORY PROVIDES

• Single memory view shared by all GPUs

• Automatic migration of data between GPUs

• User control of data locality

• CUDA cooperative kernels across multiple

GPU.

NVLINK PROVIDES

• All-to-all high-bandwidth peer mapping

between GPUs

• Full inter-GPU memory interconnect

(incl. Atomics)

NVSWITCH

16

NCCL

• Building block that abstracts highly-optimized communication for each topology

• Rings within node over NVLink

• Trees among nodes over network

• Like MPI collectives, but with a CUDA stream

• Essential to deep learning and machine learning, can be relevant to traditional HPC

• Open sourced as of v2.3

NVIDIA Collectives Communication Library

17

NCCLTrees vs Rings

https://devblogs.nvidia.com/massively-scale-deep-learning-training-nccl-2-4/

Runs on Summit supercomputer on up to 24,576 GPUs.

18

NCCLPerformance comparison on Resnet-50

19

NVIDIA GPU CLOUDGPU-optimized Software Hub. Simplifying DL, ML and HPC Workflows

NGC50+ Containers

DL, ML, HPC

Pre-trained ModelsNLP, Classification, Object Detection & more

Industry WorkflowsMedical Imaging, Intelligent Video Analytics

Model Training ScriptsNLP, Image Classification, Object Detection & more

Innovate Faster

Deploy Anywhere

Simplify Deployments

NGC Support Services

https://ngc.nvidia.com

https://ngc.nvidia.com/

20

MLPERF KEY POINTSNVIDIA’s Platform Available Everywhere to All Developers

• Software innovations used to achieve this industry-leading

performance is available via the NGC container registry.

• NGC containers are available for all key AI frameworks and can be

used anywhere, on desktops, workstations, servers and all leading

cloud services

21

MLPERF KEY POINTSNVIDIA Sets New Records in AI Performance

• NVIDIA ran models of all kinds of complexity, in the industry's first

comprehensive AI benchmark.

• Tensor Core GPUs are the fastest and combined with CUDA the most

versatile platform for AI.

• NVIDIA platform, is available everywhere from desktop to cloud services

22

MLPERF KEY POINTSState-of-the-art AI Computing Requires Full Stack Innovation

APPS &FRAMEWORKS

CUDA-XNVIDIA LIBRARIES

VIRTUAL GPU

CUDA & CORE LIBRARIES - cuBLAS | NCCL

DEEP LEARNING

cuDNN

HPC

cuFFTOpenACC

+550 Applications

Amber

NAMD

CUSTOMER USE CASES

VIRTUAL GRAPHICS

Speech Translate Recommender

SCIENTIFIC APPLICATIONS

Molecular Simulations

WeatherForecasting

SeismicMapping

CONSUMER INTERNET & INDUSTRY APPLICATIONS

ManufacturingHealthcare Finance

GPUs & SYSTEMS

SYSTEM OEM CLOUDTESLA GPU NVIDIA HGXNVIDIA DGX FAMILY

MACHINE LEARNING

cuMLcuDF cuGRAPH cuDNN CUTLASS TensorRTvDWS vPC

Creative & Technical

Knowledge Workers

vAPPS

+600 Applications

DX/OGL

https://aws.amazon.com/canada/

23

RAPIDS: MACHINE LEARNING AT SCALE

24

25

26

27

28

29

30

THANK YOU

Documents

THE MLPERF BENCHMARKS: DEEP LEARNING AT ......THE MLPERF BENCHMARKS: DEEP LEARNING AT SCALE WITH NVIDIA GPUS Vishal Mehta, DevTech Compute HPC-AI Advisory Council, Swiss Conference,