GPU Technology Conference 2014 Keynote

DESCRIPTION

NVIDIA CEO Jen-Hsun Huang introduces NVLink and shares a roadmap of the GPU. Primary topics also include an introduction of the GeForce GTX Titan Z, CUDA for machine learning, and Iray VCA.

[Chart: TeraFLOPS, GPU vs. CPU, 2003-2013]

GTC — GROWING AND EXPANDING

2010 2012 2014

397 429

729

FASTEST GROWING TOPICS

Big Data Analytics

Machine Learning

Computer Vision

Energy Exploration

Life Science & Genomics

Molecular Dynamics

#1 TOPIC

HPC / Supercomputing

2012 2013 2014

FOSTERING THE GPU ECOSYSTEM Big Data / Cloud / Computer Vision

AudioStreamTV

CUDA EVERYWHERE

Takayuki Aoki Global Scientific Information and Computing Center

Tokyo Institute of Technology

“Large-scale CFD Applications and a Full GPU Implementation of a Weather Prediction Code on the TSUBAME Supercomputer”

BANDWIDTH BOTTLENECKS

PCI Express (CPU to GPU): 16GB/sec

CPU Memory: 60GB/sec

GPU Memory: 288GB/sec
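
To make the bottleneck concrete, the sketch below estimates the ideal time to move one payload over each link. It reads the slide's three figures as PCI Express at 16 GB/sec, CPU memory at 60 GB/sec, and GPU memory at 288 GB/sec; that pairing is an assumption based on the order of the labels. The 12 GB payload is hypothetical, and real transfers would be slower than these peak-rate numbers.

```python
# Bandwidth figures from the slide, in GB/sec. The pairing of each figure
# with a link is an assumption based on the order of the slide's labels.
BANDWIDTH_GB_PER_SEC = {
    "PCI Express": 16.0,
    "CPU memory": 60.0,
    "GPU memory": 288.0,
}

PAYLOAD_GB = 12.0  # hypothetical working set, chosen only for illustration


def transfer_seconds(payload_gb, bandwidth_gb_per_sec):
    """Ideal time to move payload_gb at the given peak bandwidth."""
    return payload_gb / bandwidth_gb_per_sec


times = {
    link: transfer_seconds(PAYLOAD_GB, bandwidth)
    for link, bandwidth in BANDWIDTH_GB_PER_SEC.items()
}

for link, seconds in times.items():
    print(f"{link:12s} {seconds * 1000:8.1f} ms")

# Ratio between the fastest and slowest link on the slide.
gap = BANDWIDTH_GB_PER_SEC["GPU memory"] / BANDWIDTH_GB_PER_SEC["PCI Express"]
print(f"GPU memory is {gap:.0f}x faster than PCI Express")
```

Under these assumptions the same data that streams out of GPU memory in tens of milliseconds needs most of a second to cross PCI Express.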

INTRODUCING NVLINK

Differential with embedded clock

PCIe programming model (w/ DMA+)

Unified Memory

Cache coherency in Gen 2.0

5 to 12X PCIe

5X More Bandwidth for Multi-GPU Scaling

[Diagram: CPU, PCIe switch, and four GPUs]
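
The slide gives NVLink bandwidth only as a multiple of PCIe. As a rough sketch, the snippet below turns "5 to 12X PCIe" into an implied range, assuming the 16 GB/sec PCIe figure from the bandwidth slide is the baseline. The deck does not state the resulting absolute numbers, so treat them as an illustration rather than a specification.

```python
# Assumed baseline: the 16 GB/sec PCIe figure from the bandwidth slide.
PCIE_GB_PER_SEC = 16.0

# Multipliers quoted on the NVLink slide ("5 to 12X PCIe").
NVLINK_MULTIPLIERS = (5, 12)

# Implied range; derived here, not stated anywhere in the deck.
nvlink_low, nvlink_high = (m * PCIE_GB_PER_SEC for m in NVLINK_MULTIPLIERS)

print(f"Implied NVLink bandwidth: {nvlink_low:.0f} to {nvlink_high:.0f} GB/sec")
```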

3D MEMORY

3D Chip-on-Wafer integration

Many X bandwidth

2.5X capacity

4X energy efficiency

[Chart: Memory Bandwidth, 2008-2016]

Blaise Pascal 1623-1662

Mechanical Calculator

Probability Theory

Pascal’s Theorem

Pascal’s Law

PASCAL

NVLink

3D Memory

Module

5 to 12X PCIe 3.0

2 to 4X memory BW & size

1/3 size of PCIe card

GPU ROADMAP

[Chart: SGEMM / W (normalized), 2008-2016, axis 0-20]

Tesla: CUDA

Fermi: FP64

Kepler: Dynamic Parallelism

Maxwell: DX12

Pascal: Unified Memory, 3D Memory, NVLink

MACHINE LEARNING

Branch of Artificial Intelligence

Computers that learn from data

person

car

helmet

motorcycle

bird

frog

person

dog

chair

person

hammer

flower pot

power drill

Machine Learning using Deep Neural Networks

Input Result

Building High-level Features Using Large Scale Unsupervised Learning

Q. Le, M. Ranzato, R. Monga, M. Devin, K. Chen, G. Corrado, J. Dean, A. Ng

Stanford / Google

1 billion connections

10 million 200x200 pixel images

1,000 machines (16,000 cores)

3 days

1,000 CPU Servers 2,000 CPUs • 16,000 cores

600 kWatts

$5,000,000

GOOGLE BRAIN

Today’s Largest Networks

1B connections

10M images

~3 days

~30 ExaFLOPS

Human Brain

~100B neurons x 1000 connections

500M images

5,000,000X “Google Brain”

~150 YottaFLOPS

~40,000 “Google Brain-Years”
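
The human-brain figures can be reproduced from the Google Brain figures with a few lines of arithmetic. The sketch below assumes that training work scales with the product of connection count and image count; the slide does not spell that out, but it is the reading that yields the 5,000,000X factor. It also treats "ExaFLOPS" and "YottaFLOPS" as totals of floating-point operations, not rates.

```python
# Figures taken from the slide.
GOOGLE_BRAIN_CONNECTIONS = 10**9        # 1B connections
GOOGLE_BRAIN_IMAGES = 10 * 10**6        # 10M images
GOOGLE_BRAIN_DAYS = 3                   # ~3 days of training
GOOGLE_BRAIN_TOTAL_FLOP = 30 * 10**18   # ~30 ExaFLOPS, read as total operations

HUMAN_NEURONS = 100 * 10**9             # ~100B neurons
HUMAN_CONNECTIONS_PER_NEURON = 1000     # x 1000 connections
HUMAN_IMAGES = 500 * 10**6              # 500M images

# Assumption: total work scales with (connections x images).
connection_ratio = (
    HUMAN_NEURONS * HUMAN_CONNECTIONS_PER_NEURON // GOOGLE_BRAIN_CONNECTIONS
)
image_ratio = HUMAN_IMAGES // GOOGLE_BRAIN_IMAGES
scale = connection_ratio * image_ratio

# Scale the Google Brain totals up by that factor.
total_yotta = GOOGLE_BRAIN_TOTAL_FLOP * scale / 10**24
brain_years = GOOGLE_BRAIN_DAYS * scale / 365.25

print(f"Scale factor:       {scale:,}X")
print(f"Total work:         ~{total_yotta:.0f} Yotta floating-point operations")
print(f"Google Brain-years: ~{brain_years:,.0f}")
```

The computed training time comes out near 41,000 years, which the slide rounds to ~40,000.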

SOURCE: Ian Goodfellow

Deep Learning with COTS HPC Systems

A. Coates, B. Huval, T. Wang, D. Wu, A. Ng, B. Catanzaro

Stanford / NVIDIA • ICML 2013

STANFORD AI LAB

3 GPU-Accelerated Servers 12 GPUs • 18,432 cores

4 kWatts

$33,000

“Now You Can Build Google’s $1M Artificial Brain on the Cheap”

-Wired

1,000 CPU Servers 2,000 CPUs • 16,000 cores

600 kWatts

$5,000,000

GOOGLE BRAIN

DEMO: MACHINE LEARNING, SIMPLE TRAINING SET

Image training set: 1.2M

Classes: 1000

Weeks of training: 2

GPUs: 7

EXAFLOPS total to train: 25
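
As a rough sanity check, the demo's totals imply a sustained throughput per GPU. The sketch below pairs the slide's figures in order (2 weeks of training, 7 GPUs, 25 ExaFLOPS total); that pairing is an assumption, as is reading "ExaFLOPS total" as a count of operations and assuming the GPUs ran continuously for the whole period.

```python
# Figures from the slide, paired with labels by order (an assumption).
TOTAL_FLOP = 25 * 10**18   # 25 ExaFLOPS, read as total operations
WEEKS_OF_TRAINING = 2
GPUS = 7

# Assumption: training ran continuously for the full two weeks.
seconds = WEEKS_OF_TRAINING * 7 * 24 * 3600

sustained_flops = TOTAL_FLOP / seconds
per_gpu_flops = sustained_flops / GPUS

print(f"Training time:        {seconds:,} s")
print(f"Sustained throughput: {sustained_flops / 1e12:.1f} TFLOPS")
print(f"Per GPU:              {per_gpu_flops / 1e12:.2f} TFLOPS")
```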

DEMO: MACHINE LEARNING, NYU OVERFEAT

CUDA for MACHINE LEARNING

Talks @ GTC

Image Detection

Face Recognition

Gesture Recognition

Video Search & Analytics

Speech Recognition & Translation

Recommendation Engines

Indexing & Search

Use Cases Early Adopters

Image Analytics for Creative Cloud

Image Classification

Speech/Image Recognition

Recommendation

Hadoop

Search Rankings

Big Data & Infinite Compute Turbocharge Deep Learning

SOURCE: KPCB/Mary Meeker, company data. Unstructured data: IDC's Digital Universe Study.

800M photos uploaded per day

100 hours of video uploaded per minute

Unstructured data exploding

[Chart: Photos uploaded per day (millions), 2007-2014: Facebook, Instagram, Snapchat, Flickr]

[Chart: Hours (YouTube), 2007-2013]

[Chart: Exabytes of data: 1,104 (2010), 5,379 (2015)]

DEMO: TITAN Z REVEAL

5,760 CUDA cores

12GB memory

8 TeraFLOPS

$2999

STANFORD AI LAB

1 Titan Z-Accelerated Server 3 Titan Zs • 17,280 cores

2 kWatts

$12,000

1,000 CPU Servers 2,000 CPUs • 16,000 cores

600 kWatts

$5,000,000

GOOGLE BRAIN

300X energy efficiency

400X lower cost

Fits next to a desk
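
The "300X" and "400X" claims follow directly from the power and cost figures on the slide. The sketch below recomputes both ratios. The `System` class is a hypothetical helper for illustration, and equating the power ratio with energy efficiency assumes both systems do the same work in the same time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class System:
    """Hypothetical container for the figures quoted on the slide."""
    name: str
    kilowatts: float
    cost_usd: float


google_brain = System("Google Brain (1,000 CPU servers)", 600, 5_000_000)
titan_z_server = System("1 Titan Z-accelerated server", 2, 12_000)

# Ratios of the slide's figures; equal work per system is assumed.
power_ratio = google_brain.kilowatts / titan_z_server.kilowatts
cost_ratio = google_brain.cost_usd / titan_z_server.cost_usd

print(f"Power: {power_ratio:.0f}X lower")
print(f"Cost:  {cost_ratio:.0f}X lower")
```

The cost ratio works out to roughly 417X, which the slide rounds down to 400X.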

RenderMan with programmable shading

1.5 hours to render each frame

CCI 6/32 minicomputer

First CGI Film Nominated for an Academy Award®

State-of-the-art water simulator

48 hours to simulate the base water

250 hours to render each frame

2013 Academy Award® Winner

BEST VISUAL EFFECTS

DEMO: WHALE

DEMO: FLEX

DEMO: FLAMEWORKS

DEMO: UE4

One is a photo, One is Iray…

Bunkspeed Maya

Catia 3ds Max

IRAY VCA SCALABLE GPU RENDERING APPLIANCE

GPUs: 8 Kepler-class

GPU memory: 12GB per GPU

CUDA cores: 23,040

Network: 2 x 1GigE, 2 x 10GigE, 1 x InfiniBand

DEMO: IRAY / HONDA

[Chart: Relative Performance (axis 0-80): CPU-only Workstation, Quadro K5000 Workstation, Iray VCA]


MSRP $50,000

GRID GPU in the Cloud

Ben Fathi Chief Technology Officer

Horizon DaaS Platform

Mobile CUDA

“10 of the Top 10” Greenest Supercomputers Powered by CUDA GPUs

Unify GPU and Tegra Architecture

192 fully programmable CUDA cores

326 GFLOPS

4X energy efficiency over A15

TEGRA K1 Mobile Super Chip

MOBILE ARCHITECTURE: Tegra 3, Tegra 4, Tegra K1

GPU ARCHITECTURE: Tesla, Fermi, Kepler, Maxwell

Computer Vision on CUDA

Feature Detection / Tracking

~30 GFLOPS @ 30 Hz

Object Recognition / Tracking

~180 GFLOPS @ 30 Hz

3D Scene Interpretation

~280 GFLOPS @ 30 Hz
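
To put these workloads in context, the sketch below converts each figure into a per-frame compute budget and compares it with the 326 GFLOPS quoted for Tegra K1 elsewhere in the deck. It assumes "~N GFLOPS @ 30 Hz" means sustained throughput at 30 frames per second and that 326 GFLOPS is a theoretical peak, so the shares shown are optimistic lower bounds on utilization.

```python
TEGRA_K1_PEAK_GFLOPS = 326.0  # figure quoted for Tegra K1 in the deck
FRAME_RATE_HZ = 30

# Workload figures from the slide, in GFLOPS at 30 Hz.
WORKLOADS_GFLOPS = {
    "Feature Detection / Tracking": 30.0,
    "Object Recognition / Tracking": 180.0,
    "3D Scene Interpretation": 280.0,
}

budget = {}
for name, gflops in WORKLOADS_GFLOPS.items():
    per_frame_gflop = gflops / FRAME_RATE_HZ       # work available per frame
    share_of_peak = gflops / TEGRA_K1_PEAK_GFLOPS  # fraction of quoted peak
    budget[name] = (per_frame_gflop, share_of_peak)
    print(f"{name:30s} {per_frame_gflop:5.1f} GFLOP/frame "
          f"{share_of_peak:6.1%} of peak")
```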

JETSON TK1 1st MOBILE SUPERCOMPUTER FOR EMBEDDED SYSTEMS

192 CUDA cores

326 GFLOPS

VisionWorks SDK

$192

VISIONWORKS COMPUTER VISION ON CUDA

Driver Assistance Computational Photography

Augmented Reality Robotics CUDA

Jetson TK1

VisionWorks Primitives

Your Code

Sample Pipelines

Object Detection / Tracking

Structure from Motion …

Classifier Corner Detection …

TEGRA ROADMAP

[Chart: Single Precision GFLOPS / W (normalized), 2011-2015, axis 0-80]

Tegra 2

Tegra 3

Tegra 4

Tegra K1: Kepler GPU, CUDA, 64b & 32b CPU

Erista: Maxwell GPU

Andreas Reich Head of Audi Pre-Development

VIDEO: AUDI ADAS

CUDA EVERYWHERE PASCAL PC CLOUD MOBILE

DEMO: PORTAL ON SHIELD