27th January 2016
REAL TIME CONTROL FOR ADAPTIVE OPTICS WORKSHOP (3RD EDITION)
François Courteille |Senior Solutions Architect, NVIDIA |[email protected]
2
THE WORLD LEADER IN VISUAL COMPUTING
Enterprise | Auto | Gaming | Data Center | Pro Visualization
3
TESLA ACCELERATED COMPUTING PLATFORM: Focused on Co-Design from Top to Bottom
Productive programming model & tools | Expert co-design | Accessibility
Co-design stack: Application | Middleware | System Software | Large Systems | Processor
Fast GPU engineered for high throughput + strong CPU
[Chart: peak TFLOPS, 2008-2014, NVIDIA GPUs (M1060, M2090, K20, K40, K80) vs. x86 CPU]
4
PERFORMANCE LEAD CONTINUES TO GROW
[Chart: peak double-precision GFLOPS, 2008-2014, NVIDIA GPUs (M1060, M2090, K20, K40, K80) vs. x86 CPUs (Westmere, Sandy Bridge, Ivy Bridge, Haswell)]
[Chart: peak memory bandwidth (GB/s), 2008-2014, the same NVIDIA GPUs vs. x86 CPUs]
5
GPU ARCHITECTURE ROADMAP
[Chart: SGEMM performance per watt, 2008-2018, across the Tesla, Fermi, Kepler, Maxwell, and Pascal generations]
Pascal: Mixed Precision, 3D Memory, NVLink
Kepler SM (SMX)
• Scheduler not tied to cores; double issue for maximum utilization
• 192 CUDA cores per SMX (SP, DP, SFU, and LD/ST units)
[Diagram: instruction cache, four warp schedulers, register file, 192 CUDA cores, shared memory / L1 cache, on-chip network]
Maxwell SM (SMM)
• Simplified design: power-of-two, quadrant-based layout; scheduler tied to cores
• Better utilization: single issue is sufficient; lower instruction latency
• Efficiency: <10% performance difference from SMX in ~50% of the SMX chip area
[Diagram: instruction cache, Tex/L1 caches, and shared memory shared across the SM; per quadrant an instruction buffer, warp scheduler, register file, and 32 SP CUDA cores plus SFU and LD/ST units]
Histogram: Performance per SM
[Chart: bandwidth per SM (GiB/s) vs. elements per thread (1-128) for Fermi M2070, Kepler K20X, and Maxwell GTX 750 Ti; Maxwell up to 5.5x faster]
Higher performance expected with larger GPUs (more SMs)
16
TESLA GPU ACCELERATORS 2015-2016* (roadmap spanning 2015-2017; status ranges from POR to in definition)
KEPLER K40: 1.43 TF DP / 4.3 TF SP peak; 3.3 TF SGEMM / 1.22 TF DGEMM; 12 GB, 288 GB/s, 235 W; PCIe active/passive
KEPLER K80: 2x GPU; 2.9 TF DP / 8.7 TF SP peak; 4.4 TF SGEMM / 1.59 TF DGEMM; 24 GB, ~480 GB/s, 300 W; PCIe passive
MAXWELL M60 (GRID enabled): 2x GPU; 7.4 TF SP peak, ~6 TF SGEMM; 16 GB, 320 GB/s, 300 W; PCIe active/passive
MAXWELL M6 (GRID enabled): 1x GPU; TBD TF SP peak; 8 GB, 160 GB/s; 75-100 W; MXM
MAXWELL M40: 1x GPU; 7 TF SP peak (boost clock); 12 GB, 288 GB/s, 250 W; PCIe passive
MAXWELL M4: 1x GPU; 2.2 TF SP peak; 4 GB, 88 GB/s; 50-75 W; PCIe low profile
*For end-customer deployments
17
TESLA PLATFORM PRODUCT STACK
Software: Accelerated Computing Toolkit (HPC) | GRID 2.0 (enterprise virtualization) | Hyperscale Suite (web services)
System tools & services: Enterprise Services ∙ Data Center GPU Manager ∙ Mesos ∙ Docker
Accelerators: Tesla K80 (HPC) | Tesla M60, M6 (enterprise virtualization) | Tesla M40, M4 (DL training, hyperscale)
18
NVLINK HIGH-SPEED GPU INTERCONNECT
[Diagram: 2014, Kepler GPUs attach to x86, ARM64, or POWER CPUs over PCIe; 2016, Pascal GPUs interconnect over NVLink, with NVLink to POWER CPUs and PCIe to x86/ARM64 CPUs]
Node design flexibility
19
UNIFIED MEMORY: SIMPLER & FASTER WITH NVLINK
[Diagram: traditional developer view (separate system memory and GPU memory), developer view with Unified Memory, and developer view with Pascal & NVLink (a single unified memory)]
Share data structures at CPU-memory speeds, not PCIe speeds
Oversubscribe GPU memory
20
MOVE DATA WHERE IT IS NEEDED, FAST
GPUDirect RDMA: accelerated communication and fast access to other nodes; eliminates CPU latency; up to 2x application performance
NVLink: eliminates the CPU bottleneck; fast access to system memory; ~5x faster than PCIe
GPUDirect P2P: multi-GPU scaling; fast GPU-to-GPU communication and fast GPU memory access
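As a hedged sketch of what GPUDirect P2P looks like from CUDA C (helper name and device numbering are illustrative; error checks omitted):

#include <cuda_runtime.h>

// Copy a buffer that lives on GPU 0 into a buffer on GPU 1, bypassing host memory
void copy_gpu0_to_gpu1(void *dst_on_gpu1, const void *src_on_gpu0, size_t bytes)
{
    int can_access = 0;
    cudaDeviceCanAccessPeer(&can_access, 1, 0);   // can device 1 access device 0?
    if (can_access) {
        cudaSetDevice(1);
        cudaDeviceEnablePeerAccess(0, 0);         // enable direct loads/stores to GPU 0
    }
    // Device-to-device copy over PCIe (or NVLink on later systems)
    cudaMemcpyPeer(dst_on_gpu1, 1, src_on_gpu0, 0, bytes);
}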
22
NEXT-GEN SUPERCOMPUTERS ARE GPU-ACCELERATED
U.S. Dept. of Energy pre-exascale supercomputers for science: SUMMIT and SIERRA
NOAA: new supercomputer for next-generation weather forecasting
IBM Watson: breakthrough natural-language processing for cognitive computing
23
U.S. TO BUILD TWO FLAGSHIP SUPERCOMPUTERS Powered by the Tesla Platform
100-300 PFLOPS Peak
10x in Scientific App Performance
IBM POWER9 CPU + NVIDIA Volta GPU
NVLink High Speed Interconnect
40 TFLOPS per Node, >3,400 Nodes
2017
Major Step Forward on the Path to Exascale
25
ACCELERATORS SURGE IN WORLD'S TOP SUPERCOMPUTERS
[Chart: Top500, number of accelerated supercomputers, 2013-2015]
100+ accelerated systems now on the Top500 list
1/3 of total FLOPS powered by accelerators
NVIDIA Tesla GPUs sweep 23 of 24 new accelerated supercomputers
Tesla supercomputers growing at 50% CAGR over the past five years
26
TESLA PLATFORM FOR HPC
27
TESLA ACCELERATED COMPUTING PLATFORM
Development: programming languages; libraries (cuBLAS, …); compiler solutions (LLVM, …); profiling and debugging tools (CUDA Debugging API, …)
Data center infrastructure: GPU accelerators (GPU Boost, …); interconnect (GPUDirect, NVLink, …); system management (NVML, …); infrastructure management, communication, and system/software solutions
“Accelerators will be installed in more than half of new systems.” (Source: Top 6 Predictions for HPC in 2015)
“In 2014, NVIDIA enjoyed a dominant market share with 85% of the accelerator market.”
28
370 GPU-Accelerated Applications
www.nvidia.com/appscatalog
29
70% OF TOP HPC APPS ACCELERATED
Intersect360 survey of top applications (top 25 apps in the survey):
GROMACS, SIMULIA Abaqus, NAMD, AMBER, ANSYS Mechanical, Exelis IDL, MSC NASTRAN, ANSYS Fluent, WRF, VASP, OpenFOAM, CHARMM, Quantum ESPRESSO, LAMMPS, NWChem, LS-DYNA, Schrodinger, Gaussian, GAMESS, ANSYS CFX, Star-CD, CCSM, COMSOL, Star-CCM+, BLAST
Legend: all popular functions accelerated | some popular functions accelerated | in development | not supported
Top 10 HPC apps: 90% accelerated
Top 50 HPC apps: 70% accelerated
Source: Intersect360, Nov 2015, “HPC Application Support for GPU Computing”
33
TESLA FOR HYPERSCALE
Hyperscale Suite: Deep Learning Toolkit, GPU REST Engine, GPU-accelerated FFmpeg, Image Compute Engine, GPU support in Mesos
TESLA M40 (POWERFUL): fastest deep learning performance
TESLA M4 (LOW POWER): highest hyperscale throughput
http://devblogs.nvidia.com/parallelforall/accelerating-hyperscale-datacenter-applications-tesla-gpus/
34
TESLA PLATFORM FOR DEVELOPERS
35
TESLA FOR SIMULATION
TESLA ACCELERATED COMPUTING
Accelerated Computing Toolkit: libraries | directives | languages
37
DROP-IN ACCELERATION WITH GPU LIBRARIES
5x-10x speedups out of the box
Automatically scale with multi-GPU libraries (cuBLAS-XT, cuFFT-XT, AmgX, …)
75% of developers use GPU libraries to accelerate their applications
Libraries: cuBLAS, cuFFT, cuSPARSE, cuRAND, NPP, AmgX (BLAS | LAPACK | SPARSE | FFT | math | deep learning | image processing)
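As a hedged illustration of "drop-in" library use, a single cuBLAS SGEMM call on data already resident on the GPU (helper and buffer names are illustrative; error checks omitted):

#include <cublas_v2.h>

// C = A * B for n x n column-major matrices on the device
void gemm_on_gpu(int n, const float *d_A, const float *d_B, float *d_C)
{
    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;
    // Same interface shape as a CPU BLAS sgemm, executed on the GPU
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                n, n, n, &alpha, d_A, n, d_B, n, &beta, d_C, n);
    cublasDestroy(handle);
}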
38
“DROP-IN” ACCELERATION: NVBLAS
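NVBLAS intercepts standard Level-3 BLAS calls and routes large ones to the GPU, so the host code itself does not change. A minimal sketch, assuming NVBLAS is configured outside the code (for example by preloading the NVBLAS library and pointing it at a fallback CPU BLAS in its configuration file; the routine name below is illustrative):

#include <cblas.h>

// Unmodified CPU-style BLAS call; with NVBLAS preloaded, large DGEMMs run on the GPU
void dgemm_host_interface(int n, const double *A, const double *B, double *C)
{
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, A, n, B, n, 0.0, C, n);
}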
39
OpenACC: Simple | Powerful | Portable
Fueling the next wave of scientific discoveries in HPC
University of Illinois, PowerGrid MRI reconstruction: 70x speed-up with 2 days of effort
RIKEN (Japan), NICAM climate modeling: 7-8x speed-up with 5% of the code modified
References:
http://www.cray.com/sites/default/files/resources/OpenACC_213462.12_OpenACC_Cosmo_CS_FNL.pdf
http://www.hpcwire.com/off-the-wire/first-round-of-2015-hackathons-gets-underway
http://on-demand.gputechconf.com/gtc/2015/presentation/S5297-Hisashi-Yashiro.pdf
http://www.openacc.org/content/experiences-porting-molecular-dynamics-code-gpus-cray-xk7
main() {
  <serial code>
  #pragma acc kernels  // automatically runs on GPU
  {
    <parallel code>
  }
}
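To make the pattern above concrete, a minimal OpenACC SAXPY sketch in the same spirit (illustrative only; an OpenACC-capable compiler such as PGI with the -acc flag is assumed):

#include <stdlib.h>

void saxpy(int n, float a, float *restrict x, float *restrict y)
{
    #pragma acc kernels   // the compiler offloads this loop to the GPU
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

int main(void)
{
    int n = 1 << 20;
    float *x = malloc(n * sizeof(float));
    float *y = malloc(n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }
    saxpy(n, 2.0f, x, y);
    free(x); free(y);
    return 0;
}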
8,000+ developers using OpenACC
40
Janus Juul Eriksen, PhD Fellow, qLEAP Center for Theoretical Chemistry, Aarhus University:
“OpenACC makes GPU computing approachable for domain scientists. Initial OpenACC implementation required only minor effort, and more importantly, no modifications of our existing CPU implementation.”
Minimal effort: <100 lines of code modified | 1 week required | 1 source code to maintain
Big performance: [Chart: speedup vs. CPU for Alanine-1 (13 atoms), Alanine-2 (23 atoms), and Alanine-3 (33 atoms)]
LS-DALTON: large-scale application for calculating high-accuracy molecular energies; CCSD(T) module benchmarked on the Titan supercomputer (AMD CPU vs. Tesla K20X)
41
OPENACC DELIVERS TRUE PERFORMANCE PORTABILITY
Paving the path forward: single code for all HPC processors
[Chart: application speedup vs. a single CPU core for 359.miniGhost (Mantevo), NEMO (climate & ocean), and CloverLeaf (physics), comparing CPU with MPI + OpenMP, CPU with MPI + OpenACC, and CPU + GPU with MPI + OpenACC; measured speedups range from 4.1x to 30.3x]
System details: 359.miniGhost: CPU: Intel Xeon E5-2698 v3, 2 sockets, 32 cores total; GPU: Tesla K80, single GPU. NEMO: each socket Intel Xeon E5-2698 v3, 16 cores; GPU: NVIDIA K80, both GPUs. CloverLeaf: dual-socket Intel Xeon E5-2690 v2, 20 cores total; GPU: Tesla K80, both GPUs.
42
INTRODUCING THE NEW OPENACC TOOLKIT
Free toolkit offers a simple and powerful path to accelerated computing
PGI compiler: free OpenACC compiler for academia
NVProf profiler: easily find where to add compiler directives
GPU Wizard: identify which GPU libraries can jump-start your code
Code samples: learn from examples of real-world algorithms
Documentation: quick-start guide, best practices, forums
http://developer.nvidia.com/openacc
44
FREE OPENACC COURSES: Begin Accelerating Applications with OpenACC
March 2016: Intro to Performance Portability with OpenACC (China)
March 2016: Intro to Performance Portability with OpenACC (India)
May 2016: Advanced OpenACC (Worldwide)
September 2016: Intro to Performance Portability with OpenACC (Worldwide)
Registration page: https://developer.nvidia.com/openacc-courses
Self-paced labs: http://nvidia.qwiklab.com
46
PROGRAMMING LANGUAGES
Fortran: OpenACC, CUDA Fortran
C: OpenACC, CUDA C
C++: Thrust, CUDA C++, Kokkos, RAJA, Hemi, OCCA
Python: PyCUDA, Copperhead, Numba, NumbaPro
Java, C#: GPU.NET, Hybridizer (Altimesh), JCUDA, CUDA4J
Numerical analytics: MATLAB, Mathematica, LabVIEW, Scilab, Octave
47
COMPILE PYTHON FOR PARALLEL ARCHITECTURES
Anaconda Accelerate from Continuum Analytics
NumbaPro: array-oriented compiler for Python and NumPy
Compile for CPUs or GPUs (uses LLVM and the NVIDIA Compiler SDK)
Fast development + fast execution: the ideal combination
Free academic license: http://continuum.io
49
MORE C++ PARALLEL FOR LOOPS
GPU lambdas enable custom parallel programming models

Kokkos (https://github.com/kokkos):
Kokkos::parallel_for(N, KOKKOS_LAMBDA (int i) { y[i] = a * x[i] + y[i]; });

RAJA (https://e-reports-ext.llnl.gov/pdf/782261.pdf):
RAJA::forall<cuda_exec>(0, N, [=] __device__ (int i) { y[i] = a * x[i] + y[i]; });

Hemi, CUDA portability library (http://github.com/harrism/hemi):
hemi::parallel_for(0, N, [=] HEMI_LAMBDA (int i) { y[i] = a * x[i] + y[i]; });
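For reference, a hand-written CUDA kernel for the same axpy loop, roughly what the lambda-based libraries above map to (a sketch; the launch configuration is illustrative):

__global__ void axpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per element
    if (i < n) y[i] = a * x[i] + y[i];
}

// launched as: axpy<<<(N + 255) / 256, 256>>>(N, a, d_x, d_y);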
50
THRUST LIBRARY
Programming with algorithms and policies today
Bundled with NVIDIA's CUDA Toolkit
Supports execution on GPUs and CPUs
Ongoing performance and feature improvements
Functionality beyond the Parallel STL
[Chart: Thrust sort speedup, CUDA 7.0 vs. 6.5, 32M samples, for char/short/int/long/float/double keys, up to ~1.8x. From the CUDA 7.0 Performance Report; run on a K40m, ECC on, input and output data on the device. Performance may vary based on OS, software versions, and motherboard configuration.]
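A minimal sketch of the sort pattern measured above (sizes and data are illustrative):

#include <cstdlib>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>

int main()
{
    const size_t n = 32 << 20;                 // 32M keys, as in the benchmark
    thrust::host_vector<int> h(n);
    for (size_t i = 0; i < n; ++i) h[i] = rand();
    thrust::device_vector<int> d = h;          // copy the keys to the GPU
    thrust::sort(d.begin(), d.end());          // sorts on the device
    return 0;
}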
51
Portable, High-level Parallel Code TODAY
Thrust library allows the same C++ code to target both:
NVIDIA GPUs
x86, ARM and POWER CPUs
Thrust was the inspiration for a proposal to the ISO C++ committee
The committee voted unanimously to accept it as an official technical-specification working draft
N3960 Technical Specification Working Draft: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n3960.pdf
Prototype: https://github.com/n3554/n3554
52
STANDARDIZING THE PARALLEL STL
Technical Specification for C++ Extensions for Parallelism, published as ISO/IEC TS 19570:2015, July 2015
Draft available online: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/n4507.pdf
We've proposed adding this to C++17: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/p0024r0.html
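For illustration, the style of call the TS defines (header and namespace names follow the TS draft; few standard libraries shipped an implementation at the time, so treat this as a sketch):

#include <vector>
#include <experimental/algorithm>
#include <experimental/execution_policy>

void parallel_sort(std::vector<float> &v)
{
    using namespace std::experimental::parallel;
    sort(par, v.begin(), v.end());   // `par` requests parallel execution
}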
53
CUDA: Super Simplified Memory Management Code

CPU code:
void sortfile(FILE *fp, int N) {
    char *data;
    data = (char *)malloc(N);
    fread(data, 1, N, fp);
    qsort(data, N, 1, compare);
    use_data(data);
    free(data);
}

CUDA 6 code with Unified Memory:
void sortfile(FILE *fp, int N) {
    char *data;
    cudaMallocManaged(&data, N);
    fread(data, 1, N, fp);
    qsort<<<...>>>(data, N, 1, compare);
    cudaDeviceSynchronize();
    use_data(data);
    cudaFree(data);
}
54
INTRODUCING NCCL (“NICKEL”): ACCELERATED COLLECTIVES FOR MULTI-GPU SYSTEMS
55
INTRODUCING NCCL
Accelerating multi-GPU collective communications
GOAL:
• Build a research library of accelerated collectives that is easily integrated and topology-aware, so as to improve the scalability of multi-GPU applications
APPROACH:
• Pattern the library after MPI's collectives
• Handle the intra-node communication in an optimal way
• Provide the necessary functionality for MPI to build on top, to handle inter-node communication
56
NCCL FEATURES AND FUTURES (green = currently available)
Collectives:
• Broadcast
• All-Gather
• Reduce
• All-Reduce
• Reduce-Scatter
• Scatter
• Gather
• All-To-All
• Neighborhood
Key Features
• Single-node, up to 8 GPUs
• Host-side API
• Asynchronous/non-blocking interface
• Multi-thread, multi-process support
• In-place and out-of-place operation
• Integration with MPI
• Topology Detection
• NVLink & PCIe/QPI* support
57
NCCL IMPLEMENTATION
Implemented as monolithic CUDA C++ kernels combining the following:
• GPUDirect P2P Direct Access
• Three primitive operations: Copy, Reduce, ReduceAndCopy
• Intra-kernel synchronization between GPUs
• One CUDA thread block per ring-direction
58
NCCL EXAMPLE: All-reduce

#include <nccl.h>

ncclComm_t comm[4];
ncclCommInitAll(comm, 4, {0, 1, 2, 3});

foreach g in (GPUs) {  // or foreach thread
  cudaSetDevice(g);
  double *d_send, *d_recv;
  // allocate d_send, d_recv; fill d_send with data
  ncclAllReduce(d_send, d_recv, N, ncclDouble, ncclSum, comm[g], stream[g]);
  // consume d_recv
}
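A hedged completion of the sketch: once the collective has been issued, each stream would typically be synchronized and the communicators released (ncclCommDestroy is part of the NCCL API; the loop mirrors the pseudocode above):

foreach g in (GPUs) {
  cudaSetDevice(g);
  cudaStreamSynchronize(stream[g]);   // wait for the all-reduce to complete
  ncclCommDestroy(comm[g]);           // free the communicator
}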
59
NCCL PERFORMANCE
[Charts: bandwidth at different problem sizes on 4 Maxwell GPUs for Broadcast, All-Reduce, All-Gather, and Reduce-Scatter]
60
AVAILABLE NOW github.com/NVIDIA/nccl
61
COMMON PROGRAMMING MODELS ACROSS MULTIPLE CPUS
[Diagram: libraries (AmgX, cuBLAS, …), compiler directives, and programming languages targeting x86 and other CPUs]
62
GPU DEVELOPER ECOSYSTEM
Debuggers & profilers: cuda-gdb, NVIDIA Visual Profiler, Parallel Nsight (Visual Studio), Allinea, TotalView
Numerical packages: MATLAB, Mathematica, NI LabVIEW, PyCUDA
Auto-parallelizing & cluster tools: OpenACC, mCUDA, OpenMP, Ocelot
Libraries: BLAS, FFT, LAPACK, NPP, video/imaging, GPULib
GPU compilers: C, C++, Fortran, Java, Python
Consultants & training: ANEO, GPU Tech
OEM solution providers
63
DEVELOP ON GEFORCE, DEPLOY ON TESLA
GeForce: designed for developers & gamers; available everywhere (https://developer.nvidia.com/cuda-gpus)
Tesla: designed for the data center; ECC, 24x7 runtime, GPU monitoring, cluster management, GPUDirect RDMA, Hyper-Q for MPI, 3-year warranty, integrated OEM systems, professional support
64
RESOURCES
CUDA resource center:
http://docs.nvidia.com/cuda
GTC on-demand and webinars:
http://on-demand-gtc.gputechconf.com
http://www.gputechconf.com/gtc-webinars
Parallel Forall Blog:
http://devblogs.nvidia.com/parallelforall
Self-paced labs:
http://nvidia.qwiklab.com
Learn more about GPUs
65
TEGRA TX1
24
JETSON TX1: Supercomputer on a module
Under 10 W for typical use cases
KEY SPECS
GPU: 1 TFLOP/s, 256-core Maxwell
CPU: 64-bit ARM A57 CPUs
Memory: 4 GB LPDDR4 | 25.6 GB/s
Storage: 16 GB eMMC
Wi-Fi/BT: 802.11ac 2x2 / BT ready
Networking: 1 Gigabit Ethernet
Size: 50 mm x 87 mm
Interface: 400-pin board-to-board connector
Power: under 10 W
JETSON LINUX SDK
Tools: NVTX (NVIDIA Tools eXtension), debugger, profiler, system trace
Components: GPU compute, graphics, deep learning and computer vision
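As a hedged example of the NVTX piece, a host-side annotation that shows up as a named range on the profiler timeline (the annotated function is hypothetical; link against the NVIDIA Tools Extension library):

#include <nvToolsExt.h>

void process_frame(void)
{
    nvtxRangePushA("process_frame");   // open a named range visible in the profiler
    /* ... GPU compute / vision work for one frame ... */
    nvtxRangePop();                    // close the range
}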
26
10X ENERGY EFFICIENCY FOR MACHINE LEARNING
[Chart: AlexNet efficiency in images/sec/watt, Intel Core i7-6700K (Skylake) vs. Jetson TX1]
27
PATH TO AN AUTONOMOUS DRONE
*Based on SGEMM performance
                     Today's drone (GPS-based) | Core i7    | Jetson TX1
Performance*         1x                        | 100x       | 100x
Power (compute)      2 W                       | 60 W       | 6 W
Power (mechanical)   70 W                      | 100 W      | 80 W
Flight time          20 minutes                | 9 minutes  | 18 minutes
28
Comprehensive developer platform http://developer.nvidia.com/embedded-computing
31
Jetson TX1 Developer Kit
$599 retail | $299 EDU
Pre-order Nov 12; shipping Nov 16 (US), international to follow
32
Jetson TX1 Module
$299 (1,000-unit quantity); available Q1 2016 from distributors worldwide
33
ONE ARCHITECTURE — END-TO-END AI
Jetson for embedded | DRIVE PX for auto | Titan X for PC | Tesla for cloud
74
FIVE THINGS TO REMEMBER
The time of accelerators has come
NVIDIA is focused on co-design from top to bottom
Accelerators are surging in supercomputing
Machine learning is the next killer application for HPC
The Tesla platform leads in every way