31
THE MLPERF BENCHMARKS: DEEP LEARNING AT SCALE WITH NVIDIA GPUS Vishal Mehta, DevTech Compute HPC-AI Advisory Council, Swiss Conference, Lugano Date:02/04/2019

THE MLPERF BENCHMARKS: DEEP LEARNING AT ......THE MLPERF BENCHMARKS: DEEP LEARNING AT SCALE WITH NVIDIA GPUS Vishal Mehta, DevTech Compute HPC-AI Advisory Council, Swiss Conference,

  • Upload
    others

  • View
    26

  • Download
    0

Embed Size (px)

Citation preview

Page 1: THE MLPERF BENCHMARKS: DEEP LEARNING AT ......THE MLPERF BENCHMARKS: DEEP LEARNING AT SCALE WITH NVIDIA GPUS Vishal Mehta, DevTech Compute HPC-AI Advisory Council, Swiss Conference,

THE MLPERF BENCHMARKS: DEEP LEARNING AT SCALE WITH NVIDIA GPUS

Vishal Mehta, DevTech Compute

HPC-AI Advisory Council, Swiss Conference, Lugano Date:02/04/2019

Page 2: THE MLPERF BENCHMARKS: DEEP LEARNING AT ......THE MLPERF BENCHMARKS: DEEP LEARNING AT SCALE WITH NVIDIA GPUS Vishal Mehta, DevTech Compute HPC-AI Advisory Council, Swiss Conference,

2

MLPERF BENCHMARKING AREAS

Problem Model DataSet

Image Classification Resnet-50 v1.5 ImageNet

Object Detection – Heavy Weight Mask R-CNN COCO

Object Detection – Light Weight Single-shot Detector (SSD) COCO

Translation (non-recurrent) Transformer WMT English-German

Translation (recurrent) Neural Machine Translator (NMT) WMT English-German

Recommendation Neural Collaborative Filtering (NCF) MovieLens 20M

Reinforcement Learning Mini Go (Based on Alpha Go)

*Closed Divisions: fixed model parameters, fixed data format, results must be reproducible*Open Divisions: encourage innovations, tricks and model adjustment welcomed

Diverse Use Cases Towards a Full Performance Picture

Page 3: THE MLPERF BENCHMARKS: DEEP LEARNING AT ......THE MLPERF BENCHMARKS: DEEP LEARNING AT SCALE WITH NVIDIA GPUS Vishal Mehta, DevTech Compute HPC-AI Advisory Council, Swiss Conference,

3

MLPERFFirst Industry Benchmark for Measuring AI Performance

https://mlperf.org/

• Open source code on single/multi node DGX systems

• Everywhere: Workstation/Clusters with SLURM/Cloud

• Reproducible Performance!

Page 4: THE MLPERF BENCHMARKS: DEEP LEARNING AT ......THE MLPERF BENCHMARKS: DEEP LEARNING AT SCALE WITH NVIDIA GPUS Vishal Mehta, DevTech Compute HPC-AI Advisory Council, Swiss Conference,

4

MLPERF RESULTS: CLOSED DIVISIONSSingle Node & At Scale

Results are Time to Complete Model Training

Single Node At Scale

* Full reference: https://mlperf.org/results/

Page 5: THE MLPERF BENCHMARKS: DEEP LEARNING AT ......THE MLPERF BENCHMARKS: DEEP LEARNING AT SCALE WITH NVIDIA GPUS Vishal Mehta, DevTech Compute HPC-AI Advisory Council, Swiss Conference,

5

AI TRAINING REQUIRES FULL STACK INNOVATION

2015

36000 Mins (25 Days)1xK80 | 2015

CUDA

2016

1200 Mins (20 Hours)DGX-1P | 2016

NVLink

2017

480 Mins (8 Hours)DGX-1V | 2017

Tensor Core

70 Minutes on MLPerfDGX-2H | 2018

NVSwitch

2018

6.3 Minutes on MLPerfAt Scale | 2018

DGX Cluster

Page 6: THE MLPERF BENCHMARKS: DEEP LEARNING AT ......THE MLPERF BENCHMARKS: DEEP LEARNING AT SCALE WITH NVIDIA GPUS Vishal Mehta, DevTech Compute HPC-AI Advisory Council, Swiss Conference,

6

PILLARS OF GPU PERFORMANCE

CUDA Architecture

NVLink/NVSwitch Integrated Software

Massively Parallel Processing

High Speed Connecting between GPUs for Distributed

Algorithms

Fully Integrated Software and Hardware for Instant

Productivity

NVSwitch

6x NVLink

CUDA

PYTHON

APACHE ARROW on GPU Memory

DASK

cuDNN

RAPIDS

cuMLcuDF

DL

FRAMEWORKS

Tensor Cores

Mixed Precision Matrix Math Support

Page 7: THE MLPERF BENCHMARKS: DEEP LEARNING AT ......THE MLPERF BENCHMARKS: DEEP LEARNING AT SCALE WITH NVIDIA GPUS Vishal Mehta, DevTech Compute HPC-AI Advisory Council, Swiss Conference,

7

TENSOR CORESMixed Precision Matrix Math

• CUDA TensorOp instructions & data formats. Automated Mixed Precision

• Using Tensor cores via

• Volta optimized frameworks and libraries (cuDNN, CuBLAS, TensorRT, ..)

• CUDA C++ Warp Level Matrix Operations

• cuTENSOR library (pre-release version available)

Page 8: THE MLPERF BENCHMARKS: DEEP LEARNING AT ......THE MLPERF BENCHMARKS: DEEP LEARNING AT SCALE WITH NVIDIA GPUS Vishal Mehta, DevTech Compute HPC-AI Advisory Council, Swiss Conference,

8

cuTENSORA New High-Performance CUDA Library for Tensor Primitives

Page 9: THE MLPERF BENCHMARKS: DEEP LEARNING AT ......THE MLPERF BENCHMARKS: DEEP LEARNING AT SCALE WITH NVIDIA GPUS Vishal Mehta, DevTech Compute HPC-AI Advisory Council, Swiss Conference,

9

TENSOR CORE AUTOMATIC MIXED PRECISIONOver 3x Speedup With Just Two Lines of Code

TOOLS AND LIBRARIES MAINTAIN NETWORK ACCURACY

Tensor Core Journey Page

Github

Profiler Tools

Performance increase using automatic mixed precision on a variety of

training data sets. All performance collected using 1xV100-16GB except

bert-squadqa, which ran on 1xV100-32GB.

Enable AMP in TensorFlow (NGC Container 19.03)

export TF_ENABLE_AUTO_MIXED_PRECISION=1

Page 10: THE MLPERF BENCHMARKS: DEEP LEARNING AT ......THE MLPERF BENCHMARKS: DEEP LEARNING AT SCALE WITH NVIDIA GPUS Vishal Mehta, DevTech Compute HPC-AI Advisory Council, Swiss Conference,

10

MIXED PRECISION MAINTAINS ACCURACYBenefit From Higher Throughput Without Compromise

ILSVRC12 classification top-1 accuracy.(Sharan Narang, Paulius Micikevicius et al., "Mixed Precision Training“, ICLR 2018)

**Same hyperparameters and learning rate schedule as FP32.

0.00%

10.00%

20.00%

30.00%

40.00%

50.00%

60.00%

70.00%

80.00%

AlexNet VGG-D GoogleNet(Inception v1)

Inception v2 Inception v3 Resnet50

Model Accura

cy

FP32 Mixed Precision**

Page 11: THE MLPERF BENCHMARKS: DEEP LEARNING AT ......THE MLPERF BENCHMARKS: DEEP LEARNING AT SCALE WITH NVIDIA GPUS Vishal Mehta, DevTech Compute HPC-AI Advisory Council, Swiss Conference,

11

TENSOR CORES FOR SCIENCEMulti-precision computing

AI-POWERED WEATHER PREDICTION

PLASMA FUSION APPLICATION EARTHQUAKE SIMULATION

7.815.7

125

0

20

40

60

80

100

120

140

V100 TFLOPS

FP64+ MULTI-PRECISION

FP16 Solver

3.5x times faster

FP16/FP32

1.15x ExaOPS

FP16-FP21-FP32-FP64

25x times faster

Page 12: THE MLPERF BENCHMARKS: DEEP LEARNING AT ......THE MLPERF BENCHMARKS: DEEP LEARNING AT SCALE WITH NVIDIA GPUS Vishal Mehta, DevTech Compute HPC-AI Advisory Council, Swiss Conference,

12

NVIDIA DGX-2

1

2

3

5

4

6 Two Intel Xeon Platinum CPUs

7 1.5 TB System Memory

12

30 TB NVME SSDs Internal Storage

NVIDIA Tesla V100 32GB

Two GPU Boards8 V100 32GB GPUs per board6 NVSwitches per board512GB Total HBM2 Memoryinterconnected byPlane Card

Twelve NVSwitches2.4 TB/sec bi-section

bandwidth

Eight EDR Infiniband/100 GigE1600 Gb/sec Total Bi-directional Bandwidth

PCIe Switch Complex

8

9

9Dual 10/25 Gb/secEthernet

Page 13: THE MLPERF BENCHMARKS: DEEP LEARNING AT ......THE MLPERF BENCHMARKS: DEEP LEARNING AT SCALE WITH NVIDIA GPUS Vishal Mehta, DevTech Compute HPC-AI Advisory Council, Swiss Conference,

1313

• 18 NVLINK ports

• @50 GB/s per port bi-directional

• 900 GB/s total bi-directional

• Fully connected crossbar

• 2 billion transistors

NVSWITCH

Page 14: THE MLPERF BENCHMARKS: DEEP LEARNING AT ......THE MLPERF BENCHMARKS: DEEP LEARNING AT SCALE WITH NVIDIA GPUS Vishal Mehta, DevTech Compute HPC-AI Advisory Council, Swiss Conference,

14

FULL NON-BLOCKING BANDWIDTH

Page 15: THE MLPERF BENCHMARKS: DEEP LEARNING AT ......THE MLPERF BENCHMARKS: DEEP LEARNING AT SCALE WITH NVIDIA GPUS Vishal Mehta, DevTech Compute HPC-AI Advisory Council, Swiss Conference,

15

UNIFIED MEMORY PROVIDES

• Single memory view shared by all GPUs

• Automatic migration of data between GPUs

• User control of data locality

• CUDA cooperative kernels across multiple

GPU.

NVLINK PROVIDES

• All-to-all high-bandwidth peer mapping

between GPUs

• Full inter-GPU memory interconnect

(incl. Atomics)

NVSWITCH

Page 16: THE MLPERF BENCHMARKS: DEEP LEARNING AT ......THE MLPERF BENCHMARKS: DEEP LEARNING AT SCALE WITH NVIDIA GPUS Vishal Mehta, DevTech Compute HPC-AI Advisory Council, Swiss Conference,

16

NCCL

• Building block that abstracts highly-optimized communication for each topology

• Rings within node over NVLink

• Trees among nodes over network

• Like MPI collectives, but with a CUDA stream

• Essential to deep learning and machine learning, can be relevant to traditional HPC

• Open sourced as of v2.3

NVIDIA Collectives Communication Library

Page 17: THE MLPERF BENCHMARKS: DEEP LEARNING AT ......THE MLPERF BENCHMARKS: DEEP LEARNING AT SCALE WITH NVIDIA GPUS Vishal Mehta, DevTech Compute HPC-AI Advisory Council, Swiss Conference,

17

NCCLTrees vs Rings

https://devblogs.nvidia.com/massively-scale-deep-learning-training-nccl-2-4/

Runs on Summit supercomputer on up to 24,576 GPUs.

Page 18: THE MLPERF BENCHMARKS: DEEP LEARNING AT ......THE MLPERF BENCHMARKS: DEEP LEARNING AT SCALE WITH NVIDIA GPUS Vishal Mehta, DevTech Compute HPC-AI Advisory Council, Swiss Conference,

18

NCCLPerformance comparison on Resnet-50

Page 19: THE MLPERF BENCHMARKS: DEEP LEARNING AT ......THE MLPERF BENCHMARKS: DEEP LEARNING AT SCALE WITH NVIDIA GPUS Vishal Mehta, DevTech Compute HPC-AI Advisory Council, Swiss Conference,

19

NVIDIA GPU CLOUDGPU-optimized Software Hub. Simplifying DL, ML and HPC Workflows

NGC50+ Containers

DL, ML, HPC

Pre-trained ModelsNLP, Classification, Object Detection & more

Industry WorkflowsMedical Imaging, Intelligent Video Analytics

Model Training ScriptsNLP, Image Classification, Object Detection & more

Innovate Faster

Deploy Anywhere

Simplify Deployments

NGC Support Services

https://ngc.nvidia.com

Page 20: THE MLPERF BENCHMARKS: DEEP LEARNING AT ......THE MLPERF BENCHMARKS: DEEP LEARNING AT SCALE WITH NVIDIA GPUS Vishal Mehta, DevTech Compute HPC-AI Advisory Council, Swiss Conference,

20

MLPERF KEY POINTSNVIDIA’s Platform Available Everywhere to All Developers

• Software innovations used to achieve this industry-leading

performance is available via the NGC container registry.

• NGC containers are available for all key AI frameworks and can be

used anywhere, on desktops, workstations, servers and all leading

cloud services

Page 21: THE MLPERF BENCHMARKS: DEEP LEARNING AT ......THE MLPERF BENCHMARKS: DEEP LEARNING AT SCALE WITH NVIDIA GPUS Vishal Mehta, DevTech Compute HPC-AI Advisory Council, Swiss Conference,

21

MLPERF KEY POINTSNVIDIA Sets New Records in AI Performance

• NVIDIA ran models of all kinds of complexity, in the industry's first

comprehensive AI benchmark.

• Tensor Core GPUs are the fastest and combined with CUDA the most

versatile platform for AI.

• NVIDIA platform, is available everywhere from desktop to cloud services

Page 22: THE MLPERF BENCHMARKS: DEEP LEARNING AT ......THE MLPERF BENCHMARKS: DEEP LEARNING AT SCALE WITH NVIDIA GPUS Vishal Mehta, DevTech Compute HPC-AI Advisory Council, Swiss Conference,

22

MLPERF KEY POINTSState-of-the-art AI Computing Requires Full Stack Innovation

APPS &FRAMEWORKS

CUDA-XNVIDIA LIBRARIES

VIRTUAL GPU

CUDA & CORE LIBRARIES - cuBLAS | NCCL

DEEP LEARNING

cuDNN

HPC

cuFFTOpenACC

+550 Applications

Amber

NAMD

CUSTOMER USE CASES

VIRTUAL GRAPHICS

Speech Translate Recommender

SCIENTIFIC APPLICATIONS

Molecular Simulations

WeatherForecasting

SeismicMapping

CONSUMER INTERNET & INDUSTRY APPLICATIONS

ManufacturingHealthcare Finance

GPUs & SYSTEMS

SYSTEM OEM CLOUDTESLA GPU NVIDIA HGXNVIDIA DGX FAMILY

MACHINE LEARNING

cuMLcuDF cuGRAPH cuDNN CUTLASS TensorRTvDWS vPC

Creative & Technical

Knowledge Workers

vAPPS

+600 Applications

DX/OGL

Page 23: THE MLPERF BENCHMARKS: DEEP LEARNING AT ......THE MLPERF BENCHMARKS: DEEP LEARNING AT SCALE WITH NVIDIA GPUS Vishal Mehta, DevTech Compute HPC-AI Advisory Council, Swiss Conference,

23

RAPIDS: MACHINE LEARNING AT SCALE

Page 24: THE MLPERF BENCHMARKS: DEEP LEARNING AT ......THE MLPERF BENCHMARKS: DEEP LEARNING AT SCALE WITH NVIDIA GPUS Vishal Mehta, DevTech Compute HPC-AI Advisory Council, Swiss Conference,

24

Page 25: THE MLPERF BENCHMARKS: DEEP LEARNING AT ......THE MLPERF BENCHMARKS: DEEP LEARNING AT SCALE WITH NVIDIA GPUS Vishal Mehta, DevTech Compute HPC-AI Advisory Council, Swiss Conference,

25

Page 26: THE MLPERF BENCHMARKS: DEEP LEARNING AT ......THE MLPERF BENCHMARKS: DEEP LEARNING AT SCALE WITH NVIDIA GPUS Vishal Mehta, DevTech Compute HPC-AI Advisory Council, Swiss Conference,

26

Page 27: THE MLPERF BENCHMARKS: DEEP LEARNING AT ......THE MLPERF BENCHMARKS: DEEP LEARNING AT SCALE WITH NVIDIA GPUS Vishal Mehta, DevTech Compute HPC-AI Advisory Council, Swiss Conference,

27

Page 28: THE MLPERF BENCHMARKS: DEEP LEARNING AT ......THE MLPERF BENCHMARKS: DEEP LEARNING AT SCALE WITH NVIDIA GPUS Vishal Mehta, DevTech Compute HPC-AI Advisory Council, Swiss Conference,

28

Page 29: THE MLPERF BENCHMARKS: DEEP LEARNING AT ......THE MLPERF BENCHMARKS: DEEP LEARNING AT SCALE WITH NVIDIA GPUS Vishal Mehta, DevTech Compute HPC-AI Advisory Council, Swiss Conference,

29

Page 30: THE MLPERF BENCHMARKS: DEEP LEARNING AT ......THE MLPERF BENCHMARKS: DEEP LEARNING AT SCALE WITH NVIDIA GPUS Vishal Mehta, DevTech Compute HPC-AI Advisory Council, Swiss Conference,

30

Page 31: THE MLPERF BENCHMARKS: DEEP LEARNING AT ......THE MLPERF BENCHMARKS: DEEP LEARNING AT SCALE WITH NVIDIA GPUS Vishal Mehta, DevTech Compute HPC-AI Advisory Council, Swiss Conference,

THANK YOU