Upload
others
View
26
Download
0
Embed Size (px)
Citation preview
THE MLPERF BENCHMARKS: DEEP LEARNING AT SCALE WITH NVIDIA GPUS
Vishal Mehta, DevTech Compute
HPC-AI Advisory Council, Swiss Conference, Lugano Date:02/04/2019
2
MLPERF BENCHMARKING AREAS
Problem Model DataSet
Image Classification Resnet-50 v1.5 ImageNet
Object Detection – Heavy Weight Mask R-CNN COCO
Object Detection – Light Weight Single-shot Detector (SSD) COCO
Translation (non-recurrent) Transformer WMT English-German
Translation (recurrent) Neural Machine Translator (NMT) WMT English-German
Recommendation Neural Collaborative Filtering (NCF) MovieLens 20M
Reinforcement Learning Mini Go (Based on Alpha Go)
*Closed Divisions: fixed model parameters, fixed data format, results must be reproducible*Open Divisions: encourage innovations, tricks and model adjustment welcomed
Diverse Use Cases Towards a Full Performance Picture
3
MLPERFFirst Industry Benchmark for Measuring AI Performance
https://mlperf.org/
• Open source code on single/multi node DGX systems
• Everywhere: Workstation/Clusters with SLURM/Cloud
• Reproducible Performance!
4
MLPERF RESULTS: CLOSED DIVISIONSSingle Node & At Scale
Results are Time to Complete Model Training
Single Node At Scale
* Full reference: https://mlperf.org/results/
5
AI TRAINING REQUIRES FULL STACK INNOVATION
2015
36000 Mins (25 Days)1xK80 | 2015
CUDA
2016
1200 Mins (20 Hours)DGX-1P | 2016
NVLink
2017
480 Mins (8 Hours)DGX-1V | 2017
Tensor Core
70 Minutes on MLPerfDGX-2H | 2018
NVSwitch
2018
6.3 Minutes on MLPerfAt Scale | 2018
DGX Cluster
6
PILLARS OF GPU PERFORMANCE
CUDA Architecture
NVLink/NVSwitch Integrated Software
Massively Parallel Processing
High Speed Connecting between GPUs for Distributed
Algorithms
Fully Integrated Software and Hardware for Instant
Productivity
NVSwitch
6x NVLink
CUDA
PYTHON
APACHE ARROW on GPU Memory
DASK
cuDNN
RAPIDS
cuMLcuDF
DL
FRAMEWORKS
Tensor Cores
Mixed Precision Matrix Math Support
7
TENSOR CORESMixed Precision Matrix Math
• CUDA TensorOp instructions & data formats. Automated Mixed Precision
• Using Tensor cores via
• Volta optimized frameworks and libraries (cuDNN, CuBLAS, TensorRT, ..)
• CUDA C++ Warp Level Matrix Operations
• cuTENSOR library (pre-release version available)
8
cuTENSORA New High-Performance CUDA Library for Tensor Primitives
9
TENSOR CORE AUTOMATIC MIXED PRECISIONOver 3x Speedup With Just Two Lines of Code
TOOLS AND LIBRARIES MAINTAIN NETWORK ACCURACY
Tensor Core Journey Page
Github
Profiler Tools
Performance increase using automatic mixed precision on a variety of
training data sets. All performance collected using 1xV100-16GB except
bert-squadqa, which ran on 1xV100-32GB.
Enable AMP in TensorFlow (NGC Container 19.03)
export TF_ENABLE_AUTO_MIXED_PRECISION=1
10
MIXED PRECISION MAINTAINS ACCURACYBenefit From Higher Throughput Without Compromise
ILSVRC12 classification top-1 accuracy.(Sharan Narang, Paulius Micikevicius et al., "Mixed Precision Training“, ICLR 2018)
**Same hyperparameters and learning rate schedule as FP32.
0.00%
10.00%
20.00%
30.00%
40.00%
50.00%
60.00%
70.00%
80.00%
AlexNet VGG-D GoogleNet(Inception v1)
Inception v2 Inception v3 Resnet50
Model Accura
cy
FP32 Mixed Precision**
11
TENSOR CORES FOR SCIENCEMulti-precision computing
AI-POWERED WEATHER PREDICTION
PLASMA FUSION APPLICATION EARTHQUAKE SIMULATION
7.815.7
125
0
20
40
60
80
100
120
140
V100 TFLOPS
FP64+ MULTI-PRECISION
FP16 Solver
3.5x times faster
FP16/FP32
1.15x ExaOPS
FP16-FP21-FP32-FP64
25x times faster
12
NVIDIA DGX-2
1
2
3
5
4
6 Two Intel Xeon Platinum CPUs
7 1.5 TB System Memory
12
30 TB NVME SSDs Internal Storage
NVIDIA Tesla V100 32GB
Two GPU Boards8 V100 32GB GPUs per board6 NVSwitches per board512GB Total HBM2 Memoryinterconnected byPlane Card
Twelve NVSwitches2.4 TB/sec bi-section
bandwidth
Eight EDR Infiniband/100 GigE1600 Gb/sec Total Bi-directional Bandwidth
PCIe Switch Complex
8
9
9Dual 10/25 Gb/secEthernet
1313
• 18 NVLINK ports
• @50 GB/s per port bi-directional
• 900 GB/s total bi-directional
• Fully connected crossbar
• 2 billion transistors
NVSWITCH
14
FULL NON-BLOCKING BANDWIDTH
15
UNIFIED MEMORY PROVIDES
• Single memory view shared by all GPUs
• Automatic migration of data between GPUs
• User control of data locality
• CUDA cooperative kernels across multiple
GPU.
NVLINK PROVIDES
• All-to-all high-bandwidth peer mapping
between GPUs
• Full inter-GPU memory interconnect
(incl. Atomics)
NVSWITCH
16
NCCL
• Building block that abstracts highly-optimized communication for each topology
• Rings within node over NVLink
• Trees among nodes over network
• Like MPI collectives, but with a CUDA stream
• Essential to deep learning and machine learning, can be relevant to traditional HPC
• Open sourced as of v2.3
NVIDIA Collectives Communication Library
17
NCCLTrees vs Rings
https://devblogs.nvidia.com/massively-scale-deep-learning-training-nccl-2-4/
Runs on Summit supercomputer on up to 24,576 GPUs.
18
NCCLPerformance comparison on Resnet-50
19
NVIDIA GPU CLOUDGPU-optimized Software Hub. Simplifying DL, ML and HPC Workflows
NGC50+ Containers
DL, ML, HPC
Pre-trained ModelsNLP, Classification, Object Detection & more
Industry WorkflowsMedical Imaging, Intelligent Video Analytics
Model Training ScriptsNLP, Image Classification, Object Detection & more
Innovate Faster
Deploy Anywhere
Simplify Deployments
NGC Support Services
https://ngc.nvidia.com
20
MLPERF KEY POINTSNVIDIA’s Platform Available Everywhere to All Developers
• Software innovations used to achieve this industry-leading
performance is available via the NGC container registry.
• NGC containers are available for all key AI frameworks and can be
used anywhere, on desktops, workstations, servers and all leading
cloud services
21
MLPERF KEY POINTSNVIDIA Sets New Records in AI Performance
• NVIDIA ran models of all kinds of complexity, in the industry's first
comprehensive AI benchmark.
• Tensor Core GPUs are the fastest and combined with CUDA the most
versatile platform for AI.
• NVIDIA platform, is available everywhere from desktop to cloud services
22
MLPERF KEY POINTSState-of-the-art AI Computing Requires Full Stack Innovation
APPS &FRAMEWORKS
CUDA-XNVIDIA LIBRARIES
VIRTUAL GPU
CUDA & CORE LIBRARIES - cuBLAS | NCCL
DEEP LEARNING
cuDNN
HPC
cuFFTOpenACC
+550 Applications
Amber
NAMD
CUSTOMER USE CASES
VIRTUAL GRAPHICS
Speech Translate Recommender
SCIENTIFIC APPLICATIONS
Molecular Simulations
WeatherForecasting
SeismicMapping
CONSUMER INTERNET & INDUSTRY APPLICATIONS
ManufacturingHealthcare Finance
GPUs & SYSTEMS
SYSTEM OEM CLOUDTESLA GPU NVIDIA HGXNVIDIA DGX FAMILY
MACHINE LEARNING
cuMLcuDF cuGRAPH cuDNN CUTLASS TensorRTvDWS vPC
Creative & Technical
Knowledge Workers
vAPPS
+600 Applications
DX/OGL
23
RAPIDS: MACHINE LEARNING AT SCALE
24
25
26
27
28
29
30
THANK YOU