27th January 2016
REAL TIME CONTROL FOR ADAPTIVE OPTICS WORKSHOP (3RD EDITION)
François Courteille |Senior Solutions Architect, NVIDIA |[email protected]
2
THE WORLD LEADER IN VISUAL COMPUTING
Enterprise | Auto | Gaming | Data Center | Pro Visualization
3
TESLA ACCELERATED COMPUTING PLATFORM: Focused on Co-Design from Top to Bottom
Productive programming model & tools | Expert co-design | Accessibility
Co-design stack: Application | Middleware | System Software | Large Systems | Processor
Fast GPU engineered for high throughput + strong CPU
[Chart: peak TFLOPS, 2008-2014, NVIDIA GPUs (M1060, M2090, K20, K40, K80) vs. x86 CPU]
4
PERFORMANCE LEAD CONTINUES TO GROW
[Chart: peak double-precision GFLOPS, 2008-2014, NVIDIA GPUs (M1060, M2090, K20, K40, K80) vs. x86 CPUs (Westmere, Sandy Bridge, Ivy Bridge, Haswell)]
[Chart: peak memory bandwidth (GB/s), 2008-2014, the same NVIDIA GPUs vs. x86 CPUs]
5
GPU ARCHITECTURE ROADMAP
[Chart: SGEMM performance per watt, 2008-2018, across the Tesla, Fermi, Kepler, Maxwell, and Pascal generations]
Pascal: Mixed Precision, 3D Memory, NVLink
Kepler SM (SMX)
• Scheduler not tied to cores; double issue for maximum utilization
• 192 CUDA cores per SMX (SP, DP, SFU, and LD/ST units)
[Diagram: instruction cache, four warp schedulers, register file, 192 CUDA cores, shared memory / L1 cache, on-chip network]
Maxwell SM (SMM)
• Simplified design: power-of-two, quadrant-based layout; scheduler tied to cores
• Better utilization: single issue is sufficient; lower instruction latency
• Efficiency: <10% performance difference from SMX in ~50% of the SMX chip area
[Diagram: instruction cache, Tex/L1 caches, and shared memory shared across the SM; per quadrant an instruction buffer, warp scheduler, register file, and 32 SP CUDA cores plus SFU and LD/ST units]
Histogram: Performance per SM
[Chart: bandwidth per SM (GiB/s) vs. elements per thread (1-128) for Fermi M2070, Kepler K20X, and Maxwell GTX 750 Ti; Maxwell up to 5.5x faster]
Higher performance expected with larger GPUs (more SMs)
16
TESLA GPU ACCELERATORS 2015-2016* (roadmap spanning 2015-2017; status ranges from POR to in definition)
KEPLER K40: 1.43 TF DP / 4.3 TF SP peak; 3.3 TF SGEMM / 1.22 TF DGEMM; 12 GB, 288 GB/s, 235 W; PCIe active/passive
KEPLER K80: 2x GPU; 2.9 TF DP / 8.7 TF SP peak; 4.4 TF SGEMM / 1.59 TF DGEMM; 24 GB, ~480 GB/s, 300 W; PCIe passive
MAXWELL M60 (GRID enabled): 2x GPU; 7.4 TF SP peak, ~6 TF SGEMM; 16 GB, 320 GB/s, 300 W; PCIe active/passive
MAXWELL M6 (GRID enabled): 1x GPU; TBD TF SP peak; 8 GB, 160 GB/s; 75-100 W; MXM
MAXWELL M40: 1x GPU; 7 TF SP peak (boost clock); 12 GB, 288 GB/s, 250 W; PCIe passive
MAXWELL M4: 1x GPU; 2.2 TF SP peak; 4 GB, 88 GB/s; 50-75 W; PCIe low profile
*For end-customer deployments
17
TESLA PLATFORM PRODUCT STACK
Software: Accelerated Computing Toolkit (HPC) | GRID 2.0 (enterprise virtualization) | Hyperscale Suite (web services)
System tools & services: Enterprise Services ∙ Data Center GPU Manager ∙ Mesos ∙ Docker
Accelerators: Tesla K80 (HPC) | Tesla M60, M6 (enterprise virtualization) | Tesla M40, M4 (DL training, hyperscale)
18
NVLINK HIGH-SPEED GPU INTERCONNECT
[Diagram: 2014, Kepler GPUs attach to x86, ARM64, or POWER CPUs over PCIe; 2016, Pascal GPUs interconnect over NVLink, with NVLink to POWER CPUs and PCIe to x86/ARM64 CPUs]
Node design flexibility
19
UNIFIED MEMORY: SIMPLER & FASTER WITH NVLINK
[Diagram: traditional developer view (separate system memory and GPU memory), developer view with Unified Memory, and developer view with Pascal & NVLink (a single unified memory)]
Share data structures at CPU-memory speeds, not PCIe speeds
Oversubscribe GPU memory
20
MOVE DATA WHERE IT IS NEEDED, FAST
GPUDirect RDMA: accelerated communication and fast access to other nodes; eliminates CPU latency; up to 2x application performance
NVLink: eliminates the CPU bottleneck; fast access to system memory; ~5x faster than PCIe
GPUDirect P2P: multi-GPU scaling; fast GPU-to-GPU communication and fast GPU memory access
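As a hedged sketch of what GPUDirect P2P looks like from CUDA C (helper name and device numbering are illustrative; error checks omitted):

#include <cuda_runtime.h>

// Copy a buffer that lives on GPU 0 into a buffer on GPU 1, bypassing host memory
void copy_gpu0_to_gpu1(void *dst_on_gpu1, const void *src_on_gpu0, size_t bytes)
{
    int can_access = 0;
    cudaDeviceCanAccessPeer(&can_access, 1, 0);   // can device 1 access device 0?
    if (can_access) {
        cudaSetDevice(1);
        cudaDeviceEnablePeerAccess(0, 0);         // enable direct loads/stores to GPU 0
    }
    // Device-to-device copy over PCIe (or NVLink on later systems)
    cudaMemcpyPeer(dst_on_gpu1, 1, src_on_gpu0, 0, bytes);
}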
22
NEXT-GEN SUPERCOMPUTERS ARE GPU-ACCELERATED
U.S. Dept. of Energy pre-exascale supercomputers for science: SUMMIT and SIERRA
NOAA: new supercomputer for next-generation weather forecasting
IBM Watson: breakthrough natural-language processing for cognitive computing
23
U.S. TO BUILD TWO FLAGSHIP SUPERCOMPUTERS Powered by the Tesla Platform
100-300 PFLOPS Peak
10x in Scientific App Performance
IBM POWER9 CPU + NVIDIA Volta GPU
NVLink High Speed Interconnect
40 TFLOPS per Node, >3,400 Nodes
2017
Major Step Forward on the Path to Exascale
25
ACCELERATORS SURGE IN WORLD'S TOP SUPERCOMPUTERS
[Chart: Top500, number of accelerated supercomputers, 2013-2015]
100+ accelerated systems now on the Top500 list
1/3 of total FLOPS powered by accelerators
NVIDIA Tesla GPUs sweep 23 of 24 new accelerated supercomputers
Tesla supercomputers growing at 50% CAGR over the past five years
26
TESLA PLATFORM FOR HPC
27
TESLA ACCELERATED COMPUTING PLATFORM
Development: programming languages; libraries (cuBLAS, …); compiler solutions (LLVM, …); profiling and debugging tools (CUDA Debugging API, …)
Data center infrastructure: GPU accelerators (GPU Boost, …); interconnect (GPUDirect, NVLink, …); system management (NVML, …); infrastructure management, communication, and system/software solutions
“Accelerators will be installed in more than half of new systems.” (Source: Top 6 Predictions for HPC in 2015)
“In 2014, NVIDIA enjoyed a dominant market share with 85% of the accelerator market.”
28
370 GPU-Accelerated Applications
www.nvidia.com/appscatalog
29
70% OF TOP HPC APPS ACCELERATED
Intersect360 survey of top applications (top 25 apps in the survey):
GROMACS, SIMULIA Abaqus, NAMD, AMBER, ANSYS Mechanical, Exelis IDL, MSC NASTRAN, ANSYS Fluent, WRF, VASP, OpenFOAM, CHARMM, Quantum ESPRESSO, LAMMPS, NWChem, LS-DYNA, Schrodinger, Gaussian, GAMESS, ANSYS CFX, Star-CD, CCSM, COMSOL, Star-CCM+, BLAST
Legend: all popular functions accelerated | some popular functions accelerated | in development | not supported
Top 10 HPC apps: 90% accelerated
Top 50 HPC apps: 70% accelerated
Source: Intersect360, Nov 2015, “HPC Application Support for GPU Computing”
33
TESLA FOR HYPERSCALE
Hyperscale Suite: Deep Learning Toolkit, GPU REST Engine, GPU-accelerated FFmpeg, Image Compute Engine, GPU support in Mesos
TESLA M40 (POWERFUL): fastest deep learning performance
TESLA M4 (LOW POWER): highest hyperscale throughput
http://devblogs.nvidia.com/parallelforall/accelerating-hyperscale-datacenter-applications-tesla-gpus/
34
TESLA PLATFORM FOR DEVELOPERS
35
TESLA FOR SIMULATION
TESLA ACCELERATED COMPUTING
Accelerated Computing Toolkit: libraries | directives | languages
37
DROP-IN ACCELERATION WITH GPU LIBRARIES
5x-10x speedups out of the box
Automatically scale with multi-GPU libraries (cuBLAS-XT, cuFFT-XT, AmgX, …)
75% of developers use GPU libraries to accelerate their applications
Libraries: cuBLAS, cuFFT, cuSPARSE, cuRAND, NPP, AmgX (BLAS | LAPACK | SPARSE | FFT | math | deep learning | image processing)
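As a hedged illustration of "drop-in" library use, a single cuBLAS SGEMM call on data already resident on the GPU (helper and buffer names are illustrative; error checks omitted):

#include <cublas_v2.h>

// C = A * B for n x n column-major matrices on the device
void gemm_on_gpu(int n, const float *d_A, const float *d_B, float *d_C)
{
    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;
    // Same interface shape as a CPU BLAS sgemm, executed on the GPU
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                n, n, n, &alpha, d_A, n, d_B, n, &beta, d_C, n);
    cublasDestroy(handle);
}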
38
“DROP-IN” ACCELERATION: NVBLAS
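NVBLAS intercepts standard Level-3 BLAS calls and routes large ones to the GPU, so the host code itself does not change. A minimal sketch, assuming NVBLAS is configured outside the code (for example by preloading the NVBLAS library and pointing it at a fallback CPU BLAS in its configuration file; the routine name below is illustrative):

#include <cblas.h>

// Unmodified CPU-style BLAS call; with NVBLAS preloaded, large DGEMMs run on the GPU
void dgemm_host_interface(int n, const double *A, const double *B, double *C)
{
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, A, n, B, n, 0.0, C, n);
}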
39
OpenACC: Simple | Powerful | Portable
Fueling the next wave of scientific discoveries in HPC
University of Illinois, PowerGrid MRI reconstruction: 70x speed-up with 2 days of effort
RIKEN (Japan), NICAM climate modeling: 7-8x speed-up with 5% of the code modified
References:
http://www.cray.com/sites/default/files/resources/OpenACC_213462.12_OpenACC_Cosmo_CS_FNL.pdf
http://www.hpcwire.com/off-the-wire/first-round-of-2015-hackathons-gets-underway
http://on-demand.gputechconf.com/gtc/2015/presentation/S5297-Hisashi-Yashiro.pdf
http://www.openacc.org/content/experiences-porting-molecular-dynamics-code-gpus-cray-xk7
main() {
  <serial code>
  #pragma acc kernels  // automatically runs on GPU
  {
    <parallel code>
  }
}
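To make the pattern above concrete, a minimal OpenACC SAXPY sketch in the same spirit (illustrative only; an OpenACC-capable compiler such as PGI with the -acc flag is assumed):

#include <stdlib.h>

void saxpy(int n, float a, float *restrict x, float *restrict y)
{
    #pragma acc kernels   // the compiler offloads this loop to the GPU
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

int main(void)
{
    int n = 1 << 20;
    float *x = malloc(n * sizeof(float));
    float *y = malloc(n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }
    saxpy(n, 2.0f, x, y);
    free(x); free(y);
    return 0;
}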
8,000+ developers using OpenACC
40
Janus Juul Eriksen, PhD Fellow, qLEAP Center for Theoretical Chemistry, Aarhus University:
“OpenACC makes GPU computing approachable for domain scientists. Initial OpenACC implementation required only minor effort, and more importantly, no modifications of our existing CPU implementation.”
Minimal effort: <100 lines of code modified | 1 week required | 1 source code to maintain
Big performance: [Chart: speedup vs. CPU for Alanine-1 (13 atoms), Alanine-2 (23 atoms), and Alanine-3 (33 atoms)]
LS-DALTON: large-scale application for calculating high-accuracy molecular energies; CCSD(T) module benchmarked on the Titan supercomputer (AMD CPU vs. Tesla K20X)
41
OPENACC DELIVERS TRUE PERFORMANCE PORTABILITY
Paving the path forward: single code for all HPC processors
[Chart: application speedup vs. a single CPU core for 359.miniGhost (Mantevo), NEMO (climate & ocean), and CloverLeaf (physics), comparing CPU with MPI + OpenMP, CPU with MPI + OpenACC, and CPU + GPU with MPI + OpenACC; measured speedups range from 4.1x to 30.3x]
System details: 359.miniGhost: CPU: Intel Xeon E5-2698 v3, 2 sockets, 32 cores total; GPU: Tesla K80, single GPU. NEMO: each socket Intel Xeon E5-2698 v3, 16 cores; GPU: NVIDIA K80, both GPUs. CloverLeaf: dual-socket Intel Xeon E5-2690 v2, 20 cores total; GPU: Tesla K80, both GPUs.
42
INTRODUCING THE NEW OPENACC TOOLKIT
Free toolkit offers a simple and powerful path to accelerated computing
PGI compiler: free OpenACC compiler for academia
NVProf profiler: easily find where to add compiler directives
GPU Wizard: identify which GPU libraries can jump-start your code
Code samples: learn from examples of real-world algorithms
Documentation: quick-start guide, best practices, forums
http://developer.nvidia.com/openacc
44
FREE OPENACC COURSES: Begin Accelerating Applications with OpenACC
March 2016: Intro to Performance Portability with OpenACC (China)
March 2016: Intro to Performance Portability with OpenACC (India)
May 2016: Advanced OpenACC (Worldwide)
September 2016: Intro to Performance Portability with OpenACC (Worldwide)
Registration page: https://developer.nvidia.com/openacc-courses
Self-paced labs: http://nvidia.qwiklab.com
46
PROGRAMMING LANGUAGES
Fortran: OpenACC, CUDA Fortran
C: OpenACC, CUDA C
C++: Thrust, CUDA C++, Kokkos, RAJA, Hemi, OCCA
Python: PyCUDA, Copperhead, Numba, NumbaPro
Java, C#: GPU.NET, Hybridizer (Altimesh), JCUDA, CUDA4J
Numerical analytics: MATLAB, Mathematica, LabVIEW, Scilab, Octave
47
COMPILE PYTHON FOR PARALLEL ARCHITECTURES
Anaconda Accelerate from Continuum Analytics
NumbaPro: array-oriented compiler for Python and NumPy
Compile for CPUs or GPUs (uses LLVM and the NVIDIA Compiler SDK)
Fast development + fast execution: the ideal combination
Free academic license: http://continuum.io
49
MORE C++ PARALLEL FOR LOOPS
GPU lambdas enable custom parallel programming models

Kokkos (https://github.com/kokkos):
Kokkos::parallel_for(N, KOKKOS_LAMBDA (int i) { y[i] = a * x[i] + y[i]; });

RAJA (https://e-reports-ext.llnl.gov/pdf/782261.pdf):
RAJA::forall<cuda_exec>(0, N, [=] __device__ (int i) { y[i] = a * x[i] + y[i]; });

Hemi, CUDA portability library (http://github.com/harrism/hemi):
hemi::parallel_for(0, N, [=] HEMI_LAMBDA (int i) { y[i] = a * x[i] + y[i]; });
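For reference, a hand-written CUDA kernel for the same axpy loop, roughly what the lambda-based libraries above map to (a sketch; the launch configuration is illustrative):

__global__ void axpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per element
    if (i < n) y[i] = a * x[i] + y[i];
}

// launched as: axpy<<<(N + 255) / 256, 256>>>(N, a, d_x, d_y);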
50
THRUST LIBRARY
Programming with algorithms and policies today
Bundled with NVIDIA's CUDA Toolkit
Supports execution on GPUs and CPUs
Ongoing performance and feature improvements
Functionality beyond the Parallel STL
[Chart: Thrust sort speedup, CUDA 7.0 vs. 6.5, 32M samples, for char/short/int/long/float/double keys, up to ~1.8x. From the CUDA 7.0 Performance Report; run on a K40m, ECC on, input and output data on the device. Performance may vary based on OS, software versions, and motherboard configuration.]
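A minimal sketch of the sort pattern measured above (sizes and data are illustrative):

#include <cstdlib>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>

int main()
{
    const size_t n = 32 << 20;                 // 32M keys, as in the benchmark
    thrust::host_vector<int> h(n);
    for (size_t i = 0; i < n; ++i) h[i] = rand();
    thrust::device_vector<int> d = h;          // copy the keys to the GPU
    thrust::sort(d.begin(), d.end());          // sorts on the device
    return 0;
}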
51
Portable, High-level Parallel Code TODAY
Thrust library allows the same C++ code to target both:
NVIDIA GPUs
x86, ARM and POWER CPUs
Thrust was the inspiration for a proposal to the ISO C++ committee
The committee voted unanimously to accept it as an official technical-specification working draft
N3960 Technical Specification Working Draft: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n3960.pdf
Prototype: https://github.com/n3554/n3554
52
STANDARDIZING THE PARALLEL STL
Technical Specification for C++ Extensions for Parallelism, published as ISO/IEC TS 19570:2015, July 2015
Draft available online: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/n4507.pdf
We've proposed adding this to C++17: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/p0024r0.html
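For illustration, the style of call the TS defines (header and namespace names follow the TS draft; few standard libraries shipped an implementation at the time, so treat this as a sketch):

#include <vector>
#include <experimental/algorithm>
#include <experimental/execution_policy>

void parallel_sort(std::vector<float> &v)
{
    using namespace std::experimental::parallel;
    sort(par, v.begin(), v.end());   // `par` requests parallel execution
}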
53
CUDA: Super Simplified Memory Management Code

CPU code:
void sortfile(FILE *fp, int N) {
    char *data;
    data = (char *)malloc(N);
    fread(data, 1, N, fp);
    qsort(data, N, 1, compare);
    use_data(data);
    free(data);
}

CUDA 6 code with Unified Memory:
void sortfile(FILE *fp, int N) {
    char *data;
    cudaMallocManaged(&data, N);
    fread(data, 1, N, fp);
    qsort<<<...>>>(data, N, 1, compare);
    cudaDeviceSynchronize();
    use_data(data);
    cudaFree(data);
}
54
INTRODUCING NCCL (“NICKEL”): ACCELERATED COLLECTIVES FOR MULTI-GPU SYSTEMS
55
INTRODUCING NCCL
Accelerating multi-GPU collective communications
GOAL:
• Build a research library of accelerated collectives that is easily integrated and topology-aware, so as to improve the scalability of multi-GPU applications
APPROACH:
• Pattern the library after MPI's collectives
• Handle the intra-node communication in an optimal way
• Provide the necessary functionality for MPI to build on top, to handle inter-node communication
56
NCCL FEATURES AND FUTURES (green = currently available)
Collectives:
• Broadcast
• All-Gather
• Reduce
• All-Reduce
• Reduce-Scatter
• Scatter
• Gather
• All-To-All
• Neighborhood
Key Features
• Single-node, up to 8 GPUs
• Host-side API
• Asynchronous/non-blocking interface
• Multi-thread, multi-process support
• In-place and out-of-place operation
• Integration with MPI
• Topology Detection
• NVLink & PCIe/QPI* support
57
NCCL IMPLEMENTATION
Implemented as monolithic CUDA C++ kernels combining the following:
• GPUDirect P2P Direct Access
• Three primitive operations: Copy, Reduce, ReduceAndCopy
• Intra-kernel synchronization between GPUs
• One CUDA thread block per ring-direction
58
NCCL EXAMPLE: All-reduce

#include <nccl.h>

ncclComm_t comm[4];
ncclCommInitAll(comm, 4, {0, 1, 2, 3});

foreach g in (GPUs) {  // or foreach thread
  cudaSetDevice(g);
  double *d_send, *d_recv;
  // allocate d_send, d_recv; fill d_send with data
  ncclAllReduce(d_send, d_recv, N, ncclDouble, ncclSum, comm[g], stream[g]);
  // consume d_recv
}
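A hedged completion of the sketch: once the collective has been issued, each stream would typically be synchronized and the communicators released (ncclCommDestroy is part of the NCCL API; the loop mirrors the pseudocode above):

foreach g in (GPUs) {
  cudaSetDevice(g);
  cudaStreamSynchronize(stream[g]);   // wait for the all-reduce to complete
  ncclCommDestroy(comm[g]);           // free the communicator
}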
59
NCCL PERFORMANCE
[Charts: bandwidth at different problem sizes on 4 Maxwell GPUs for Broadcast, All-Reduce, All-Gather, and Reduce-Scatter]
60
AVAILABLE NOW github.com/NVIDIA/nccl
61
COMMON PROGRAMMING MODELS ACROSS MULTIPLE CPUS
[Diagram: libraries (AmgX, cuBLAS, …), compiler directives, and programming languages targeting x86 and other CPUs]
62
GPU DEVELOPER ECOSYSTEM
Debuggers & profilers: cuda-gdb, NVIDIA Visual Profiler, Parallel Nsight (Visual Studio), Allinea, TotalView
Numerical packages: MATLAB, Mathematica, NI LabVIEW, PyCUDA
Auto-parallelizing & cluster tools: OpenACC, mCUDA, OpenMP, Ocelot
Libraries: BLAS, FFT, LAPACK, NPP, video/imaging, GPULib
GPU compilers: C, C++, Fortran, Java, Python
Consultants & training: ANEO, GPU Tech
OEM solution providers
63
DEVELOP ON GEFORCE, DEPLOY ON TESLA
GeForce: designed for developers & gamers; available everywhere (https://developer.nvidia.com/cuda-gpus)
Tesla: designed for the data center; ECC, 24x7 runtime, GPU monitoring, cluster management, GPUDirect RDMA, Hyper-Q for MPI, 3-year warranty, integrated OEM systems, professional support
64
RESOURCES
CUDA resource center:
http://docs.nvidia.com/cuda
GTC on-demand and webinars:
http://on-demand-gtc.gputechconf.com
http://www.gputechconf.com/gtc-webinars
Parallel Forall Blog:
http://devblogs.nvidia.com/parallelforall
Self-paced labs:
http://nvidia.qwiklab.com
Learn more about GPUs
65
TEGRA TX1
24
JETSON TX1: Supercomputer on a module
Under 10 W for typical use cases
KEY SPECS
GPU: 1 TFLOP/s, 256-core Maxwell
CPU: 64-bit ARM A57 CPUs
Memory: 4 GB LPDDR4 | 25.6 GB/s
Storage: 16 GB eMMC
Wi-Fi/BT: 802.11ac 2x2 / BT ready
Networking: 1 Gigabit Ethernet
Size: 50 mm x 87 mm
Interface: 400-pin board-to-board connector
Power: under 10 W
JETSON LINUX SDK
Tools: NVTX (NVIDIA Tools eXtension), debugger, profiler, system trace
Components: GPU compute, graphics, deep learning and computer vision
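As a hedged example of the NVTX piece, a host-side annotation that shows up as a named range on the profiler timeline (the annotated function is hypothetical; link against the NVIDIA Tools Extension library):

#include <nvToolsExt.h>

void process_frame(void)
{
    nvtxRangePushA("process_frame");   // open a named range visible in the profiler
    /* ... GPU compute / vision work for one frame ... */
    nvtxRangePop();                    // close the range
}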
26
10X ENERGY EFFICIENCY FOR MACHINE LEARNING
[Chart: AlexNet efficiency in images/sec/watt, Intel Core i7-6700K (Skylake) vs. Jetson TX1]
27
PATH TO AN AUTONOMOUS DRONE
*Based on SGEMM performance
                     Today's drone (GPS-based) | Core i7    | Jetson TX1
Performance*         1x                        | 100x       | 100x
Power (compute)      2 W                       | 60 W       | 6 W
Power (mechanical)   70 W                      | 100 W      | 80 W
Flight time          20 minutes                | 9 minutes  | 18 minutes
28
Comprehensive developer platform http://developer.nvidia.com/embedded-computing
31
Jetson TX1 Developer Kit
$599 retail | $299 EDU
Pre-order Nov 12; shipping Nov 16 (US), international to follow
32
Jetson TX1 Module
$299 (1,000-unit quantity); available Q1 2016 from distributors worldwide
33
ONE ARCHITECTURE — END-TO-END AI
Jetson for embedded | DRIVE PX for auto | Titan X for PC | Tesla for cloud
74
FIVE THINGS TO REMEMBER
The time of accelerators has come
NVIDIA is focused on co-design from top to bottom
Accelerators are surging in supercomputing
Machine learning is the next killer application for HPC
The Tesla platform leads in every way