ACCELERATED SOLUTIONS FOR HPC, DL & ML
Gabriel Noaje, PhD
Senior Solutions Architect, APAC
[email protected]
THE NEW HPC MARKET
SIMULATION | MACHINE LEARNING | DEEP LEARNING
[Chart: peak TFLOPS, P100 vs V100 - FP64: 5.3 vs 7.8 | FP32: 10.6 vs 15.7 | FP16: 21.2 vs 125 (Tensor Core)]

ACCELERATED COMPUTING - THE PATH FORWARD
• CPU + Accelerator
• Simulation + AI
• Volta Tensor Core: AI + Multi-Precision
• Full-Stack Optimization
• Developer Productivity
NVIDIA DATA CENTER PLATFORM
Single Platform Drives Utilization and Productivity

CUSTOMER USE CASES
• Consumer Internet: Speech | Translate | Recommender
• Scientific Applications: Molecular Simulations | Weather Forecasting | Seismic Mapping
• Industry Applications: Manufacturing | Healthcare | Finance

APPS & FRAMEWORKS: 600+ applications, including Amber and NAMD

CUDA-X & NVIDIA SDKs
• HPC: cuFFT | OpenACC
• Deep Learning: cuDNN
• Machine Learning: cuML | cuDF | cuGRAPH | cuDNN | CUTLASS | TensorRT
• Virtual GPU: vDWS for creative & technical users | vPC and vApps for knowledge workers

CUDA & CORE LIBRARIES: cuBLAS | NCCL

TESLA GPUs & SYSTEMS: Tesla GPU | NVIDIA DGX family | NVIDIA HGX | every OEM | every major cloud
NVIDIA POWERS TODAY'S FASTEST SUPERCOMPUTERS
22 of Top 25 Greenest
• ORNL Summit, World's Fastest: 27,648 GPUs | 149 PF
• LLNL Sierra, World's 2nd Fastest: 17,280 GPUs | 95 PF
• Piz Daint, Europe's Fastest: 5,704 GPUs | 21 PF
• ABCI, Japan's Fastest: 4,352 GPUs | 20 PF
• Total Pangea 3, Fastest Industrial: 3,348 GPUs | 18 PF
NVIDIA POWERS GORDON BELL WINNERS & 5 OF 6 FINALISTS
GPU Acceleration Critical To HPC At Scale Today
• Weather: 1.13 ExaFLOPS (Winner)
• Genomics: 2.36 ExaFLOPS (Winner)
• Material Science: 300X higher performance
• Seismic: first soil & structure simulation
• Quantum Chromodynamics: <1% uncertainty margin
GPU-ACCELERATED HPC APPLICATIONS
600+ Applications
• MFG, CAD & CAE: 129 apps, including Ansys Fluent, Abaqus SIMULIA, AutoCAD, CST Studio Suite
• MEDICAL IMAGING: 20 apps, including Gaussian, VASP, AMBER, HOOMD-Blue, GAMESS
• DATA SCIENCE & ANALYTICS: 27 apps, including MapD, Kinetica, Graphistry
• DEEP LEARNING: 36 apps, including Caffe2, MXNet, TensorFlow
• MEDIA & ENTERTAINMENT: 148 apps, including DaVinci Resolve, Premiere Pro CC, Redshift Renderer
• RESEARCH, HIGHER ED & SUPERCOMPUTING: 126 apps, including Amber, MILC, NAMD, Relion, VASP
• OIL & GAS: 19 apps, including RTM, SPECFEM3D
• SAFETY & SECURITY: 24 apps, including Cylance, FaceControl, Syndex Pro
• TOOLS & MANAGEMENT: 16 apps, including Bright Cluster Manager, HPCToolkit, Vampir
• FEDERAL & DEFENSE: 15 apps, including ArcGIS Pro, ENVI, SocetGXP
• CLIMATE & WEATHER: 4 apps, including Cosmos, Gales, WRF
• COMPUTATIONAL FINANCE: 16 apps, including O-Quant Options Pricing, MUREX, MISYS
www.nvidia.com/en-us/data-center/gpu-accelerated-applications/catalog/
INTRODUCING CUDA 10
• TURING AND NEW SYSTEMS: new GPU architecture, Tensor Cores, NVSwitch fabric
• CUDA PLATFORM: CUDA Graphs, Vulkan & DX12 interop, Warp Matrix
• LIBRARIES: GPU-accelerated hybrid JPEG decoding, symmetric eigenvalue solvers, FFT scaling
• DEVELOPER TOOLS: new Nsight products - Nsight Systems and Nsight Compute
NVIDIA CUDA-X UPDATES
Software To Deliver Acceleration For HPC & AI Apps; 500+ New Updates
• CUDA at the base; 600+ apps on top
• CUDA-X HPC & AI: 40+ GPU acceleration libraries for parallel algorithms, signal processing, deep learning, machine learning and visualization
• Domains covered: Linear Algebra | Machine Learning & Deep Learning | Computational Physics & Chemistry | Computational Fluid Dynamics | Life Sciences & Bioinformatics | Structural Mechanics | Weather & Climate | Geoscience, Seismology & Imaging | Numerical Analytics | Electronic Design Automation
• Deploys everywhere: Desktop Development | Data Center | Supercomputers | GPU-Accelerated Cloud
nvJPEG 10.0
GPU-Accelerated Hybrid JPEG Decoder: Up to 1.8x Faster JPEG Decoding
• Low-latency hybrid decoding using CPU and GPU resources
• Single and batched image decode for optimum throughput and latency
• Multiple resolutions and subsampling modes
• Color space conversion to RGB, BGR, RGBI, BGRI, YUV
https://developer.nvidia.com/nvjpeg
[Chart: decode speedup over libjpeg-turbo, up to ~1.8x across JPEG quality levels 40-90]
JPEG decoding performance (images/sec) on Tesla V100 vs. libjpeg-turbo on an Intel Skylake CPU 6140 @ 2.3 GHz, hyperthreading off. Image size: 640x480. Decoding performed at various JPEG compression/quality levels as defined by the ImageMagick quality parameter.
cuTENSOR
A New High-Performance CUDA Library for Tensor Primitives (e.g. contractions of the form D = A·B + C)
• Tensor Contractions
• Elementwise Operations
• Mixed Precision
Coming Soon:
• Tensor Reductions
• Out-of-Core Contractions
• Tensor Decompositions
Pre-release version available at developer.nvidia.com/cuTENSOR
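The contraction primitive named above (D = A·B + C, summing over shared indices) can be illustrated on the CPU with NumPy's einsum. This is only a semantic sketch of the operation cuTENSOR accelerates on the GPU, not its C API; the helper name `contract` is ours.

```python
import numpy as np

# Semantic sketch of a cuTENSOR-style contraction: D = alpha * (A x B) + beta * C.
# A is (m, k), B is (k, n); the contraction sums over the shared mode k.
def contract(A, B, C, alpha=1.0, beta=1.0):
    return alpha * np.einsum("mk,kn->mn", A, B) + beta * C

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))
B = rng.standard_normal((3, 5))
C = rng.standard_normal((4, 5))

D = contract(A, B, C)
assert np.allclose(D, A @ B + C)  # for matrices this contraction is a GEMM
```

Higher-order tensors simply add modes to the subscripts, e.g. `"mnk,kop->mnop"` contracts a 3-D tensor with another 3-D tensor over one shared mode.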
NEW CUDA MATH LIBRARIES

cuFFTDx - cuFFT device extensions: build your own custom FFT kernels
• Inline FFT routines into user kernels.
• Significantly improved performance vs. cuFFT.
• Application operations can be fused with the FFT.

cuBLASMg - state-of-the-art multi-GPU matrix-matrix multiplication in which each matrix can be distributed in a 2D block-cyclic fashion across multiple devices.
• Multi-GPU, out-of-core GEMM with best-in-class performance.
• Each matrix can be stored in a 2D block-cyclic fashion across multiple devices.
• Intelligently leverages user-provided per-GPU workspace to reduce memory traffic across devices.

developer.nvidia.com/CUDAMathLibraryEA
ANNOUNCING CUDA TO ARM
Energy-Efficient Supercomputing
• NVIDIA GPU accelerated computing platform on ARM
• Optimized CUDA-X HPC & AI software stack
• CUDA, development tools and compilers
• Available end of 2019
NSIGHT PRODUCT FAMILY
• Nsight Systems: system-wide application algorithm tuning
• Nsight Compute: CUDA kernel profiling and debugging
• Nsight Graphics: graphics shader profiling and debugging
• IDE Plugins: Nsight Eclipse Edition / Visual Studio (editor, debugger)
EASE OF ACCELERATION WITH OPENACC
~200 Apps Being Accelerated Across Scientific Domains
Including: SYNOPSYS (Material Science), LSDALTON (Quantum Chemistry), CGYRO (Plasma), GAUSSIAN (Chemistry), VASP (Material Science), MAS (Astrophysics), COSMO (Weather), GTC (Plasma Physics), VMD (Molecular Dynamics), MPAS-A (Weather), E3SM (Climate), HIFUN (CFD), SOMA (Physics), LAVA (CFD), CASTRO/MAESTRO (Astrophysics), GAMERA (Seismic), NEKCEM (Electromagnetics), HIPSTAR (CFD), ADS CFD (CFD), FINE/OPEN (CFD), SANJEEVINI (Drug Discovery), IBM-CFD (CFD), ANSYS FLUENT (CFD), FLASH (Astrophysics)
[Chart: cumulative OpenACC-accelerated apps, Mar 2015 to Mar 2019, growing toward ~200]
NVIDIA PLATFORM BUILT FOR AI
Rapidly Deploy AI at the Highest Performance and Lowest TCO
• Record-breaking AI training performance: MLPerf benchmark winners and lowest TCO (1 NVIDIA V100 server = 300 CPU servers)
• Train on any framework
• Optimized model scripts across all use cases (Speech, Vision, ...)
• End-to-end software stack: race from model conception to deployment at scale with the TensorRT Inference Server
• Backed by NGC Support Services and easy-to-use containers on NVIDIA GPU Cloud
NVIDIA DGX SUPERPOD BREAKS AT-SCALE AI RECORDS
Under 20 Minutes To Train Each MLPerf Benchmark
[Chart: minutes to train at max scale (lower is better), NVIDIA GPU vs. Google TPU vs. Intel CPU. NVIDIA times: Image Classification (ResNet-50 v1.5) 1.33 | Translation, non-recurrent (Transformer) 1.59 | Translation, recurrent (GNMT) 1.8 | Object Detection, light-weight (SSD) 2.23 | Reinforcement Learning (MiniGo) 13.57, no TPU submission | Object Detection, heavy-weight (Mask R-CNN) 18.47]
MLPerf 0.6 performance at max scale | MLPerf IDs at scale: RN50 v1.5: 0.6-30, 0.6-6 | Transformer: 0.6-28, 0.6-6 | GNMT: 0.6-26, 0.6-5 | SSD: 0.6-27, 0.6-6 | MiniGo: 0.6-11, 0.6-7 | Mask R-CNN: 0.6-23, 0.6-3
UP TO 80% MORE PERFORMANCE ON THE SAME SERVER
Software Innovation Delivers Continuous Improvements
[Chart: MLPerf 0.5 vs. 0.6 relative speedup on a DGX-2 server (7-month improvement): 1.2x Image Classification (ResNet-50 v1.5) | 1.3x Translation, non-recurrent (Transformer) | 1.2x Object Detection, light-weight (SSD) | 1.5x Translation, recurrent (GNMT) | 1.8x Object Detection, heavy-weight (Mask R-CNN)]
Comparing the throughput of a single DGX-2H server on a single epoch (one pass of the dataset through the neural network) | MLPerf ID 0.5/0.6 comparison: ResNet-50 v1.5: 0.5-20/0.6-30 | Transformer: 0.5-21/0.6-20 | SSD: 0.5-21/0.6-20 | GNMT: 0.5-19/0.6-20 | Mask R-CNN: 0.5-21/0.6-20
NVIDIA DGX SUPERPOD
AI Leadership Requires AI Infrastructure Leadership
Test bed for the highest-performance scale-up systems
• 9.4 PF on HPL | ~200 AI PF | #22 on the Top500 list
• <2 minutes to train ResNet-50
Modular & scalable GPU SuperPOD architecture
• Built in 3 weeks
• Optimized for compute, networking, storage & software
Integrates fully optimized software stacks
• Freely available through NGC
System: 96 DGX-2H | 1,536 V100 Tensor Core GPUs | 10 Mellanox EDR IB per node | 1 megawatt of power
Workloads: Autonomous Vehicles | Speech AI | Healthcare | Graphics | HPC
NVIDIA DGX-2
Designed To Train The Previously Impossible
1. NVIDIA Tesla V100 32GB
2. Two HGX-2 GPU motherboards: 8 V100 32GB GPUs per board, 6 NVSwitches per board, 512 GB total HBM2 memory, interconnected by plane card
3. Twelve NVSwitches: 2.4 TB/sec bisection bandwidth
4. Eight EDR InfiniBand/100 GigE: 1,600 Gb/sec total bidirectional bandwidth
5. Two Intel Xeon Platinum CPUs
6. 1.5 TB system memory
7. Two high-speed Ethernet ports: 10/25/40/100 GigE
8. 30 TB NVMe SSD internal storage
UNIFIED MEMORY + DGX-2
Unified Memory provides:
• A single memory view shared by all GPUs
• Automatic migration of data between GPUs
• User control of data locality
• 512 GB of unified memory across all 16 GPUs (GPU0-GPU15)
2X HIGHER PERFORMANCE WITH NVSWITCH
DGX-2 with NVSwitch vs. 2x DGX-1 (Volta):
• Physics (MILC benchmark, 4D grid): 2x faster
• Weather (ECMWF benchmark, all-to-all): 2.4x faster
• Recommender (sparse embedding, reduce & broadcast): 2x faster
• Language model (Transformer with MoE, all-to-all): 2.7x faster
2x DGX-1V servers with dual-socket Xeon E5-2698 v4 processors and 8x V100 GPUs each, connected via 4x 100 Gb IB ports | DGX-2 server with dual-socket Xeon Platinum 8168 processors and 16 V100 GPUs
PURPOSE-BUILT AI SUPERCOMPUTERS
AI WORKSTATION | AI DATA CENTER
• DGX Station: AI workstation for data science teams
• DGX-1: the essential instrument for AI research
• DGX-2: the world's most powerful AI system for the most complex AI challenges
DGX software stack: universal SW for AI | predictable execution across platforms | pervasive reach
Pre-trained models and model scripts
DGX SYSTEMS AND DGX POD
Purpose-Built Systems and Infrastructure for Enterprise AI
• Revolutionary AI performance, scalable via DGX POD reference architecture (RA) storage solutions
• Fastest path to AI: faster, simplified deployment
• Effortless productivity, backed by trusted expertise and support
WORLD RECORDS FOR CONVERSATIONAL AI
BERT Training and Inference Records on DGX SuperPOD; Largest Transformer-Based Model Ever Trained

EXPLODING MODEL SIZE - Complexity to Train
Number of parameters by network: Image Recognition 26M | NLP (Q&A, Translation) 340M | NLP, Generative Tasks (Chatbots, Auto-Completion) 1.5Bn to 8.3Bn

CONVERSATIONAL AI RECORDS - Code Available on GitHub
• BERT-Large: speed training record, 53 minutes
• GPT-2 8B: largest Transformer-based model trained, 8.3Bn parameters
• BERT-Base: fastest inference, 2.2 ms latency (18X faster than CPU)

[Chart: normalized training speedup (1/time) on BERT-Large vs. number of V100 GPUs; near-linear scaling requires leading AI infrastructure]

BERT-Large training record: 1,472 Tesla V100-SXM3-32GB 450W GPUs | 92 DGX-2H servers | 8 Mellanox InfiniBand adapters per node. BERT-Base inference record: SQuAD dataset | Tesla T4 16GB GPU | CPU: Intel Xeon Gold 6240 & OpenVINO v2. Scaling measured on BERT-Large with 1x, 16x, 64x and 92x DGX-2H servers, 16 NVIDIA V100 GPUs each.
INTERSECTION OF HPC & AI TRANSFORMING SCIENCE
HPC: algorithms based on first-principles theory; proven models for accurate results
AI: neural networks that learn patterns from large data sets; improved predictive accuracy and faster response times
• Exascale weather modeling: Tensor Cores achieved 1.13 EF; 2018 Gordon Bell winner
• Identifying chemical compounds: orders-of-magnitude speedup; 3M new compounds in 1 day
• O&G fault interpretation: time-to-solution reduced from weeks to 2 hours
• Speeding the path to fusion energy: 90% prediction accuracy; published in Nature, April 2019
TENSOR CORE AUTOMATIC MIXED PRECISION
3x Speedup With Just One Line of Code
Tools and libraries maintain network accuracy.
• Training speedup over 3x: PyTorch GNMT, 3.4x total tokens/sec with mixed precision vs. FP32 on 1x V100
• Inference speedup over 4x: TensorRT ResNet-50, 4.4x images/sec with INT8/mixed precision vs. FP32 at 7 ms latency on 1x V100
Resources: Tensor Core journey page | GitHub | profiler tools
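In the frameworks, the "one line" is enabling automatic mixed precision (e.g. PyTorch's `torch.cuda.amp`). The numerical trick that keeps FP16 training accurate is loss scaling, sketched below in plain NumPy; the gradient magnitude and scale factor are illustrative, not taken from any benchmark above.

```python
import numpy as np

# Loss scaling, the idea behind automatic mixed precision: very small gradients
# underflow to zero in FP16, so AMP scales the loss up before the backward pass
# and unscales the gradients in FP32 before the weight update.
true_grad = 1e-8                         # below FP16's smallest subnormal (~6e-8)
naive = np.float16(true_grad)            # flushes to zero: the update is lost
assert naive == 0.0

scale = 1024.0                           # AMP adjusts this factor dynamically
scaled = np.float16(true_grad * scale)   # now representable in FP16
recovered = np.float32(scaled) / scale   # unscale in FP32 before the update
assert abs(recovered - true_grad) / true_grad < 0.05
```

In PyTorch this pattern becomes a `torch.cuda.amp.GradScaler` plus an `autocast()` context around the forward pass, with Tensor Cores executing the FP16 matrix math.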
HPL-AI & ITERATIVE REFINEMENT SOLVERS
Fusion of HPC & AI precision:
• HPC (simulation): FP64
• AI (machine learning): FP16, FP32
3x more performance on Summit with Tensor Core GPUs: 149 PF at FP64 (HPL) vs. 445 PF at mixed precision (HPL-AI)
HPL-AI: a new approach to benchmarking AI supercomputing, proposed by Prof. Jack Dongarra et al.
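The technique behind HPL-AI is mixed-precision iterative refinement: solve the system cheaply in low precision (FP16 Tensor Cores on Summit), then recover FP64 accuracy by iterating on the residual. A minimal NumPy sketch, using FP32 as the stand-in low precision since NumPy has no FP16 solver:

```python
import numpy as np

def refine_solve(A, b, iters=5):
    """Solve Ax = b: low-precision solve plus FP64 residual correction."""
    A32 = A.astype(np.float32)
    # Low-precision solve; on HPL-AI this is an FP16 Tensor Core LU
    # factorization, reused for every correction step below.
    x = np.linalg.solve(A32, b.astype(np.float32)).astype(np.float64)
    for _ in range(iters):
        r = b - A @ x                    # residual computed in full FP64
        d = np.linalg.solve(A32, r.astype(np.float32)).astype(np.float64)
        x += d                           # correction step
    return x

rng = np.random.default_rng(1)
n = 200
A = rng.standard_normal((n, n)) + n * np.eye(n)   # well-conditioned system
b = rng.standard_normal(n)
x = refine_solve(A, b)
assert np.linalg.norm(A @ x - b) / np.linalg.norm(b) < 1e-12
```

Because most of the flops land in the low-precision factorization, the solver runs at near-FP16 speed while the FP64 residual loop restores double-precision accuracy, which is how Summit reaches 445 PF on HPL-AI vs. 149 PF on FP64 HPL.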
NGC: GPU-OPTIMIZED SOFTWARE HUB
Simplifying DL, ML and HPC Workflows
• 50+ containers: DL, ML, HPC
• Pre-trained models: NLP, classification, object detection & more
• Model training scripts: NLP, image classification, object detection & more
• Industry workflows: medical imaging, intelligent video analytics
Innovate faster | Deploy anywhere | Simplify deployments
NVIDIA CLARA AI PLATFORM
Organ Segmentation for Medical Imaging
• Clara Train SDK: pre-trained models | transfer learning | AI-assisted annotation | DICOM-to-NIfTI conversion | training pipelines | tune and retrain with new data
• Clara Deploy SDK: pipeline manager | deployment pipelines | DICOM adapter | TRT Inference Server | streaming render | web UI
Example workflow: CT scans of a patient's liver in, segmented liver out
https://developer.nvidia.com/clara
CONTINUOUS PERFORMANCE IMPROVEMENT
Developers' Software Optimizations Deliver Better Performance on the Same Hardware
Monthly DL framework updates & HPC software stack optimizations drive performance:
• MXNet: images/sec climbing across the 18.02, 18.09 and 19.02 container releases (mixed precision | 128 batch size | ResNet-50 training | 8x V100)
• PyTorch: tokens/sec climbing across the 18.05, 18.09 and 19.02 releases (mixed precision | 128 batch size | GNMT | 8x V100)
• TensorFlow: images/sec climbing across the 18.02, 18.09 and 19.02 releases (mixed precision | 256 batch size | ResNet-50 training | 8x V100)
• HPC applications: speedup over CPU growing steadily from Mar '18 to Mar '19 across Chroma, GROMACS, LAMMPS, QE, MILC, VASP, SPECFEM3D, NAMD, AMBER, GTC, RTM (4x V100 vs. dual Skylake | CUDA 9 for Mar '18 & Nov '18, CUDA 10 for Mar '19)
ML WORKFLOW STIFLES INNOVATION
A time-consuming, inefficient workflow that wastes data science productivity:
Data sources - ETL into data lake - wrangle data - train - evaluate predictions - deploy
Stages: data preparation | train | deploy
FASTER SPEEDS, REAL-WORLD BENEFITS
Time in seconds; shorter is better. Benchmark: 200 GB CSV dataset; data preparation includes joins and variable transformations. CPU cluster: nodes with 61 GB of memory, 8 vCPUs, 64-bit platform, Apache Spark. DGX cluster: 5x DGX-1 on an InfiniBand network.
• cuIO/cuDF, load and data preparation: 20 CPU nodes 2,741 | 30 CPU nodes 1,675 | 50 CPU nodes 715 | 100 CPU nodes 379 | DGX-2 42 | 5x DGX-1 19
• cuML XGBoost training: 20 CPU nodes 2,290 | 30 CPU nodes 1,956 | 50 CPU nodes 1,999 | 100 CPU nodes 1,948 | DGX-2 169 | 5x DGX-1 157
• End-to-end (load, data preparation, data conversion, XGBoost): CPU configurations run to several thousand seconds; the DGX systems are dramatically faster
GPUDIRECT STORAGE (GDS)
https://devblogs.nvidia.com/gpudirect-storage/
COMPLEX STACKS JUST WORK - FULLY OPTIMIZED
NVIDIA GPU Cloud: Innovation for Every Industry
• Developer support for every application; NVIDIA works with ISVs and ecosystem partners to optimize the stack over time
• Say goodbye to DIY and stay up to date
• From desktop to data center, on dedicated infrastructure
The stack:
• Industry frameworks & applications: 600+ applications, including NAMD and LAMMPS
• NVIDIA SDK & libraries: cuDNN | cuBLAS | cuFFT | cuRAND | cuSPARSE | NCCL | TensorRT | DeepStream | NVENC
• CUDA