ACCELERATED SOLUTIONS FOR HPC, DL & ML
Gabriel Noaje, PhD
Senior Solutions Architect, APAC
[email protected]
THE NEW HPC MARKET
SIMULATION | MACHINE LEARNING | DEEP LEARNING
[Chart: peak TFLOPS, P100 vs V100 - FP64: 5.3 vs 7.8 | FP32: 10.6 vs 15.7 | FP16: 21.2 vs 125 (Tensor Core)]

ACCELERATED COMPUTING - THE PATH FORWARD
• CPU + Accelerator
• Simulation + AI
• Volta Tensor Core: AI + Multi-Precision
• Full-Stack Optimization
• Developer Productivity
NVIDIA DATA CENTER PLATFORM
Single Platform Drives Utilization and Productivity

CUSTOMER USE CASES
• Consumer Internet: Speech | Translate | Recommender
• Scientific Applications: Molecular Simulations | Weather Forecasting | Seismic Mapping
• Industry Applications: Manufacturing | Healthcare | Finance

APPS & FRAMEWORKS: 600+ applications, including Amber and NAMD

CUDA-X & NVIDIA SDKs
• HPC: cuFFT | OpenACC
• Deep Learning: cuDNN
• Machine Learning: cuML | cuDF | cuGRAPH | cuDNN | CUTLASS | TensorRT
• Virtual GPU: vDWS for creative & technical users | vPC and vApps for knowledge workers

CUDA & CORE LIBRARIES: cuBLAS | NCCL

TESLA GPUs & SYSTEMS: Tesla GPU | NVIDIA DGX family | NVIDIA HGX | every OEM | every major cloud
NVIDIA POWERS TODAY'S FASTEST SUPERCOMPUTERS
22 of Top 25 Greenest
• ORNL Summit, World's Fastest: 27,648 GPUs | 149 PF
• LLNL Sierra, World's 2nd Fastest: 17,280 GPUs | 95 PF
• Piz Daint, Europe's Fastest: 5,704 GPUs | 21 PF
• ABCI, Japan's Fastest: 4,352 GPUs | 20 PF
• Total Pangea 3, Fastest Industrial: 3,348 GPUs | 18 PF
NVIDIA POWERS GORDON BELL WINNERS & 5 OF 6 FINALISTS
GPU Acceleration Critical To HPC At Scale Today
• Weather: 1.13 ExaFLOPS (Winner)
• Genomics: 2.36 ExaFLOPS (Winner)
• Material Science: 300X higher performance
• Seismic: first soil & structure simulation
• Quantum Chromodynamics: <1% uncertainty margin
GPU-ACCELERATED HPC APPLICATIONS
600+ Applications
• MFG, CAD & CAE: 129 apps, including Ansys Fluent, Abaqus SIMULIA, AutoCAD, CST Studio Suite
• MEDICAL IMAGING: 20 apps, including Gaussian, VASP, AMBER, HOOMD-Blue, GAMESS
• DATA SCIENCE & ANALYTICS: 27 apps, including MapD, Kinetica, Graphistry
• DEEP LEARNING: 36 apps, including Caffe2, MXNet, TensorFlow
• MEDIA & ENTERTAINMENT: 148 apps, including DaVinci Resolve, Premiere Pro CC, Redshift Renderer
• RESEARCH, HIGHER ED & SUPERCOMPUTING: 126 apps, including Amber, MILC, NAMD, Relion, VASP
• OIL & GAS: 19 apps, including RTM, SPECFEM3D
• SAFETY & SECURITY: 24 apps, including Cylance, FaceControl, Syndex Pro
• TOOLS & MANAGEMENT: 16 apps, including Bright Cluster Manager, HPCToolkit, Vampir
• FEDERAL & DEFENSE: 15 apps, including ArcGIS Pro, ENVI, SocetGXP
• CLIMATE & WEATHER: 4 apps, including Cosmos, Gales, WRF
• COMPUTATIONAL FINANCE: 16 apps, including O-Quant Options Pricing, MUREX, MISYS
www.nvidia.com/en-us/data-center/gpu-accelerated-applications/catalog/
INTRODUCING CUDA 10
• TURING AND NEW SYSTEMS: new GPU architecture, Tensor Cores, NVSwitch fabric
• CUDA PLATFORM: CUDA Graphs, Vulkan & DX12 interop, Warp Matrix
• LIBRARIES: GPU-accelerated hybrid JPEG decoding, symmetric eigenvalue solvers, FFT scaling
• DEVELOPER TOOLS: new Nsight products - Nsight Systems and Nsight Compute
NVIDIA CUDA-X UPDATES
Software To Deliver Acceleration For HPC & AI Apps; 500+ New Updates
• CUDA at the base; 600+ apps on top
• CUDA-X HPC & AI: 40+ GPU acceleration libraries for parallel algorithms, signal processing, deep learning, machine learning and visualization
• Domains covered: Linear Algebra | Machine Learning & Deep Learning | Computational Physics & Chemistry | Computational Fluid Dynamics | Life Sciences & Bioinformatics | Structural Mechanics | Weather & Climate | Geoscience, Seismology & Imaging | Numerical Analytics | Electronic Design Automation
• Deploys everywhere: Desktop Development | Data Center | Supercomputers | GPU-Accelerated Cloud
nvJPEG 10.0
GPU-Accelerated Hybrid JPEG Decoder: Up to 1.8x Faster JPEG Decoding
• Low-latency hybrid decoding using CPU and GPU resources
• Single and batched image decode for optimum throughput and latency
• Multiple resolutions and subsampling modes
• Color space conversion to RGB, BGR, RGBI, BGRI, YUV
https://developer.nvidia.com/nvjpeg
[Chart: decode speedup over libjpeg-turbo, up to ~1.8x across JPEG quality levels 40-90]
JPEG decoding performance (images/sec) on Tesla V100 vs. libjpeg-turbo on an Intel Skylake CPU 6140 @ 2.3 GHz, hyperthreading off. Image size: 640x480. Decoding performed at various JPEG compression/quality levels as defined by the ImageMagick quality parameter.
cuTENSOR
A New High-Performance CUDA Library for Tensor Primitives (e.g. contractions of the form D = A·B + C)
• Tensor Contractions
• Elementwise Operations
• Mixed Precision
Coming Soon:
• Tensor Reductions
• Out-of-Core Contractions
• Tensor Decompositions
Pre-release version available at developer.nvidia.com/cuTENSOR
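The contraction primitive named above (D = A·B + C, summing over shared indices) can be illustrated on the CPU with NumPy's einsum. This is only a semantic sketch of the operation cuTENSOR accelerates on the GPU, not its C API; the helper name `contract` is ours.

```python
import numpy as np

# Semantic sketch of a cuTENSOR-style contraction: D = alpha * (A x B) + beta * C.
# A is (m, k), B is (k, n); the contraction sums over the shared mode k.
def contract(A, B, C, alpha=1.0, beta=1.0):
    return alpha * np.einsum("mk,kn->mn", A, B) + beta * C

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))
B = rng.standard_normal((3, 5))
C = rng.standard_normal((4, 5))

D = contract(A, B, C)
assert np.allclose(D, A @ B + C)  # for matrices this contraction is a GEMM
```

Higher-order tensors simply add modes to the subscripts, e.g. `"mnk,kop->mnop"` contracts a 3-D tensor with another 3-D tensor over one shared mode.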
NEW CUDA MATH LIBRARIES

cuFFTDx - cuFFT device extensions: build your own custom FFT kernels
• Inline FFT routines into user kernels.
• Significantly improved performance vs. cuFFT.
• Application operations can be fused with the FFT.

cuBLASMg - state-of-the-art multi-GPU matrix-matrix multiplication in which each matrix can be distributed in a 2D block-cyclic fashion across multiple devices.
• Multi-GPU, out-of-core GEMM with best-in-class performance.
• Each matrix can be stored in a 2D block-cyclic fashion across multiple devices.
• Intelligently leverages user-provided per-GPU workspace to reduce memory traffic across devices.

developer.nvidia.com/CUDAMathLibraryEA
ANNOUNCING CUDA TO ARM
Energy-Efficient Supercomputing
• NVIDIA GPU accelerated computing platform on ARM
• Optimized CUDA-X HPC & AI software stack
• CUDA, development tools and compilers
• Available end of 2019
NSIGHT PRODUCT FAMILY
• Nsight Systems: system-wide application algorithm tuning
• Nsight Compute: CUDA kernel profiling and debugging
• Nsight Graphics: graphics shader profiling and debugging
• IDE Plugins: Nsight Eclipse Edition / Visual Studio (editor, debugger)
EASE OF ACCELERATION WITH OPENACC
~200 Apps Being Accelerated Across Scientific Domains
Including: SYNOPSYS (Material Science), LSDALTON (Quantum Chemistry), CGYRO (Plasma), GAUSSIAN (Chemistry), VASP (Material Science), MAS (Astrophysics), COSMO (Weather), GTC (Plasma Physics), VMD (Molecular Dynamics), MPAS-A (Weather), E3SM (Climate), HIFUN (CFD), SOMA (Physics), LAVA (CFD), CASTRO/MAESTRO (Astrophysics), GAMERA (Seismic), NEKCEM (Electromagnetics), HIPSTAR (CFD), ADS CFD (CFD), FINE/OPEN (CFD), SANJEEVINI (Drug Discovery), IBM-CFD (CFD), ANSYS FLUENT (CFD), FLASH (Astrophysics)
[Chart: cumulative OpenACC-accelerated apps, Mar 2015 to Mar 2019, growing toward ~200]
NVIDIA PLATFORM BUILT FOR AI
Rapidly Deploy AI at the Highest Performance and Lowest TCO
• Record-breaking AI training performance: MLPerf benchmark winners and lowest TCO (1 NVIDIA V100 server = 300 CPU servers)
• Train on any framework
• Optimized model scripts across all use cases (Speech, Vision, ...)
• End-to-end software stack: race from model conception to deployment at scale with the TensorRT Inference Server
• Backed by NGC Support Services and easy-to-use containers on NVIDIA GPU Cloud
NVIDIA DGX SUPERPOD BREAKS AT-SCALE AI RECORDS
Under 20 Minutes To Train Each MLPerf Benchmark
[Chart: minutes to train at max scale (lower is better), NVIDIA GPU vs. Google TPU vs. Intel CPU. NVIDIA times: Image Classification (ResNet-50 v1.5) 1.33 | Translation, non-recurrent (Transformer) 1.59 | Translation, recurrent (GNMT) 1.8 | Object Detection, light-weight (SSD) 2.23 | Reinforcement Learning (MiniGo) 13.57, no TPU submission | Object Detection, heavy-weight (Mask R-CNN) 18.47]
MLPerf 0.6 performance at max scale | MLPerf IDs at scale: RN50 v1.5: 0.6-30, 0.6-6 | Transformer: 0.6-28, 0.6-6 | GNMT: 0.6-26, 0.6-5 | SSD: 0.6-27, 0.6-6 | MiniGo: 0.6-11, 0.6-7 | Mask R-CNN: 0.6-23, 0.6-3
UP TO 80% MORE PERFORMANCE ON THE SAME SERVER
Software Innovation Delivers Continuous Improvements
[Chart: MLPerf 0.5 vs. 0.6 relative speedup on a DGX-2 server (7-month improvement): 1.2x Image Classification (ResNet-50 v1.5) | 1.3x Translation, non-recurrent (Transformer) | 1.2x Object Detection, light-weight (SSD) | 1.5x Translation, recurrent (GNMT) | 1.8x Object Detection, heavy-weight (Mask R-CNN)]
Comparing the throughput of a single DGX-2H server on a single epoch (one pass of the dataset through the neural network) | MLPerf ID 0.5/0.6 comparison: ResNet-50 v1.5: 0.5-20/0.6-30 | Transformer: 0.5-21/0.6-20 | SSD: 0.5-21/0.6-20 | GNMT: 0.5-19/0.6-20 | Mask R-CNN: 0.5-21/0.6-20
NVIDIA DGX SUPERPOD
AI Leadership Requires AI Infrastructure Leadership
Test bed for the highest-performance scale-up systems
• 9.4 PF on HPL | ~200 AI PF | #22 on the Top500 list
• <2 minutes to train ResNet-50
Modular & scalable GPU SuperPOD architecture
• Built in 3 weeks
• Optimized for compute, networking, storage & software
Integrates fully optimized software stacks
• Freely available through NGC
System: 96 DGX-2H | 1,536 V100 Tensor Core GPUs | 10 Mellanox EDR IB per node | 1 megawatt of power
Workloads: Autonomous Vehicles | Speech AI | Healthcare | Graphics | HPC
NVIDIA DGX-2
Designed To Train The Previously Impossible
1. NVIDIA Tesla V100 32GB
2. Two HGX-2 GPU motherboards: 8 V100 32GB GPUs per board, 6 NVSwitches per board, 512 GB total HBM2 memory, interconnected by plane card
3. Twelve NVSwitches: 2.4 TB/sec bisection bandwidth
4. Eight EDR InfiniBand/100 GigE: 1,600 Gb/sec total bidirectional bandwidth
5. Two Intel Xeon Platinum CPUs
6. 1.5 TB system memory
7. Two high-speed Ethernet ports: 10/25/40/100 GigE
8. 30 TB NVMe SSD internal storage
UNIFIED MEMORY + DGX-2
Unified Memory provides:
• A single memory view shared by all GPUs
• Automatic migration of data between GPUs
• User control of data locality
• 512 GB of unified memory across all 16 GPUs (GPU0-GPU15)
2X HIGHER PERFORMANCE WITH NVSWITCH
DGX-2 with NVSwitch vs. 2x DGX-1 (Volta):
• Physics (MILC benchmark, 4D grid): 2x faster
• Weather (ECMWF benchmark, all-to-all): 2.4x faster
• Recommender (sparse embedding, reduce & broadcast): 2x faster
• Language model (Transformer with MoE, all-to-all): 2.7x faster
2x DGX-1V servers with dual-socket Xeon E5-2698 v4 processors and 8x V100 GPUs each, connected via 4x 100 Gb IB ports | DGX-2 server with dual-socket Xeon Platinum 8168 processors and 16 V100 GPUs
PURPOSE-BUILT AI SUPERCOMPUTERS
AI WORKSTATION | AI DATA CENTER
• DGX Station: AI workstation for data science teams
• DGX-1: the essential instrument for AI research
• DGX-2: the world's most powerful AI system for the most complex AI challenges
DGX software stack: universal SW for AI | predictable execution across platforms | pervasive reach
Pre-trained models and model scripts
DGX SYSTEMS AND DGX POD
Purpose-Built Systems and Infrastructure for Enterprise AI
• Revolutionary AI performance, scalable via DGX POD reference architecture (RA) storage solutions
• Fastest path to AI: faster, simplified deployment
• Effortless productivity, backed by trusted expertise and support
WORLD RECORDS FOR CONVERSATIONAL AI
BERT Training and Inference Records on DGX SuperPOD; Largest Transformer-Based Model Ever Trained

EXPLODING MODEL SIZE - Complexity to Train
Number of parameters by network: Image Recognition 26M | NLP (Q&A, Translation) 340M | NLP, Generative Tasks (Chatbots, Auto-Completion) 1.5Bn to 8.3Bn

CONVERSATIONAL AI RECORDS - Code Available on GitHub
• BERT-Large: speed training record, 53 minutes
• GPT-2 8B: largest Transformer-based model trained, 8.3Bn parameters
• BERT-Base: fastest inference, 2.2 ms latency (18X faster than CPU)

[Chart: normalized training speedup (1/time) on BERT-Large vs. number of V100 GPUs; near-linear scaling requires leading AI infrastructure]

BERT-Large training record: 1,472 Tesla V100-SXM3-32GB 450W GPUs | 92 DGX-2H servers | 8 Mellanox InfiniBand adapters per node. BERT-Base inference record: SQuAD dataset | Tesla T4 16GB GPU | CPU: Intel Xeon Gold 6240 & OpenVINO v2. Scaling measured on BERT-Large with 1x, 16x, 64x and 92x DGX-2H servers, 16 NVIDIA V100 GPUs each.
INTERSECTION OF HPC & AI TRANSFORMING SCIENCE
HPC: algorithms based on first-principles theory; proven models for accurate results
AI: neural networks that learn patterns from large data sets; improved predictive accuracy and faster response times
• Exascale weather modeling: Tensor Cores achieved 1.13 EF; 2018 Gordon Bell winner
• Identifying chemical compounds: orders-of-magnitude speedup; 3M new compounds in 1 day
• O&G fault interpretation: time-to-solution reduced from weeks to 2 hours
• Speeding the path to fusion energy: 90% prediction accuracy; published in Nature, April 2019
TENSOR CORE AUTOMATIC MIXED PRECISION
3x Speedup With Just One Line of Code
Tools and libraries maintain network accuracy.
• Training speedup over 3x: PyTorch GNMT, 3.4x total tokens/sec with mixed precision vs. FP32 on 1x V100
• Inference speedup over 4x: TensorRT ResNet-50, 4.4x images/sec with INT8/mixed precision vs. FP32 at 7 ms latency on 1x V100
Resources: Tensor Core journey page | GitHub | profiler tools
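In the frameworks, the "one line" is enabling automatic mixed precision (e.g. PyTorch's `torch.cuda.amp`). The numerical trick that keeps FP16 training accurate is loss scaling, sketched below in plain NumPy; the gradient magnitude and scale factor are illustrative, not taken from any benchmark above.

```python
import numpy as np

# Loss scaling, the idea behind automatic mixed precision: very small gradients
# underflow to zero in FP16, so AMP scales the loss up before the backward pass
# and unscales the gradients in FP32 before the weight update.
true_grad = 1e-8                         # below FP16's smallest subnormal (~6e-8)
naive = np.float16(true_grad)            # flushes to zero: the update is lost
assert naive == 0.0

scale = 1024.0                           # AMP adjusts this factor dynamically
scaled = np.float16(true_grad * scale)   # now representable in FP16
recovered = np.float32(scaled) / scale   # unscale in FP32 before the update
assert abs(recovered - true_grad) / true_grad < 0.05
```

In PyTorch this pattern becomes a `torch.cuda.amp.GradScaler` plus an `autocast()` context around the forward pass, with Tensor Cores executing the FP16 matrix math.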
HPL-AI & ITERATIVE REFINEMENT SOLVERS
Fusion of HPC & AI precision:
• HPC (simulation): FP64
• AI (machine learning): FP16, FP32
3x more performance on Summit with Tensor Core GPUs: 149 PF at FP64 (HPL) vs. 445 PF at mixed precision (HPL-AI)
HPL-AI: a new approach to benchmarking AI supercomputing, proposed by Prof. Jack Dongarra et al.
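The technique behind HPL-AI is mixed-precision iterative refinement: solve the system cheaply in low precision (FP16 Tensor Cores on Summit), then recover FP64 accuracy by iterating on the residual. A minimal NumPy sketch, using FP32 as the stand-in low precision since NumPy has no FP16 solver:

```python
import numpy as np

def refine_solve(A, b, iters=5):
    """Solve Ax = b: low-precision solve plus FP64 residual correction."""
    A32 = A.astype(np.float32)
    # Low-precision solve; on HPL-AI this is an FP16 Tensor Core LU
    # factorization, reused for every correction step below.
    x = np.linalg.solve(A32, b.astype(np.float32)).astype(np.float64)
    for _ in range(iters):
        r = b - A @ x                    # residual computed in full FP64
        d = np.linalg.solve(A32, r.astype(np.float32)).astype(np.float64)
        x += d                           # correction step
    return x

rng = np.random.default_rng(1)
n = 200
A = rng.standard_normal((n, n)) + n * np.eye(n)   # well-conditioned system
b = rng.standard_normal(n)
x = refine_solve(A, b)
assert np.linalg.norm(A @ x - b) / np.linalg.norm(b) < 1e-12
```

Because most of the flops land in the low-precision factorization, the solver runs at near-FP16 speed while the FP64 residual loop restores double-precision accuracy, which is how Summit reaches 445 PF on HPL-AI vs. 149 PF on FP64 HPL.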
NGC: GPU-OPTIMIZED SOFTWARE HUB
Simplifying DL, ML and HPC Workflows
• 50+ containers: DL, ML, HPC
• Pre-trained models: NLP, classification, object detection & more
• Model training scripts: NLP, image classification, object detection & more
• Industry workflows: medical imaging, intelligent video analytics
Innovate faster | Deploy anywhere | Simplify deployments
NVIDIA CLARA AI PLATFORM
Organ Segmentation for Medical Imaging
• Clara Train SDK: pre-trained models | transfer learning | AI-assisted annotation | DICOM-to-NIfTI conversion | training pipelines | tune and retrain with new data
• Clara Deploy SDK: pipeline manager | deployment pipelines | DICOM adapter | TRT Inference Server | streaming render | web UI
Example workflow: CT scans of a patient's liver in, segmented liver out
https://developer.nvidia.com/clara
CONTINUOUS PERFORMANCE IMPROVEMENT
Developers' Software Optimizations Deliver Better Performance on the Same Hardware
Monthly DL framework updates & HPC software stack optimizations drive performance:
• MXNet: images/sec climbing across the 18.02, 18.09 and 19.02 container releases (mixed precision | 128 batch size | ResNet-50 training | 8x V100)
• PyTorch: tokens/sec climbing across the 18.05, 18.09 and 19.02 releases (mixed precision | 128 batch size | GNMT | 8x V100)
• TensorFlow: images/sec climbing across the 18.02, 18.09 and 19.02 releases (mixed precision | 256 batch size | ResNet-50 training | 8x V100)
• HPC applications: speedup over CPU growing steadily from Mar '18 to Mar '19 across Chroma, GROMACS, LAMMPS, QE, MILC, VASP, SPECFEM3D, NAMD, AMBER, GTC, RTM (4x V100 vs. dual Skylake | CUDA 9 for Mar '18 & Nov '18, CUDA 10 for Mar '19)
ML WORKFLOW STIFLES INNOVATION
A time-consuming, inefficient workflow that wastes data science productivity:
Data sources - ETL into data lake - wrangle data - train - evaluate predictions - deploy
Stages: data preparation | train | deploy
FASTER SPEEDS, REAL-WORLD BENEFITS
Time in seconds; shorter is better. Benchmark: 200 GB CSV dataset; data preparation includes joins and variable transformations. CPU cluster: nodes with 61 GB of memory, 8 vCPUs, 64-bit platform, Apache Spark. DGX cluster: 5x DGX-1 on an InfiniBand network.
• cuIO/cuDF, load and data preparation: 20 CPU nodes 2,741 | 30 CPU nodes 1,675 | 50 CPU nodes 715 | 100 CPU nodes 379 | DGX-2 42 | 5x DGX-1 19
• cuML XGBoost training: 20 CPU nodes 2,290 | 30 CPU nodes 1,956 | 50 CPU nodes 1,999 | 100 CPU nodes 1,948 | DGX-2 169 | 5x DGX-1 157
• End-to-end (load, data preparation, data conversion, XGBoost): CPU configurations run to several thousand seconds; the DGX systems are dramatically faster
GPUDIRECT STORAGE (GDS)
https://devblogs.nvidia.com/gpudirect-storage/
COMPLEX STACKS JUST WORK - FULLY OPTIMIZED
NVIDIA GPU Cloud: Innovation for Every Industry
• Developer support for every application; NVIDIA works with ISVs and ecosystem partners to optimize the stack over time
• Say goodbye to DIY and stay up to date
• From desktop to data center, on dedicated infrastructure
The stack:
• Industry frameworks & applications: 600+ applications, including NAMD and LAMMPS
• NVIDIA SDK & libraries: cuDNN | cuBLAS | cuFFT | cuRAND | cuSPARSE | NCCL | TensorRT | DeepStream | NVENC
• CUDA