Tegra – at the Convergence of Mobile and GPU Supercomputingon-demand.gputechconf.com/gtc/2013/presentations/S3494... · 2013-04-19 · Ecosystem Broad View – including Ouya Development

© 2012 NVIDIA

Tegra – at the Convergence of Mobile and GPU Supercomputing
Neil Trevett, VP Mobile Content, NVIDIA

http://www.gputechconf.com/page/home.html


Welcome to the Inaugural GTC Mobile Summit!

Tuesday Afternoon - Room 210C
- Ecosystem Broad View – including Ouya
- Development Tools – including Tegra 4 and Shield

Wednesday Morning - Marriott Ballroom 3
- Visualization – including using H.264 for still imagery
- Augmented device interaction – including depth camera on Tegra

Wednesday Afternoon - Room 210C
- Vision and Computational Photography – including Chimera
- Web – the fastest mobile browser
- Mobile Panel – your chance to ask gnarly questions!

Select Mobile Summit Tag in your GTC Mobile App!


Why Mobile GPU Compute?

Courtesy Metaio http://www.youtube.com/watch?v=xw3M-TNOo44&feature=related

State-of-the-art Augmented Reality without GPU Compute





Augmented Reality with GPU Compute

“High-Quality Reflections, Refractions, and Caustics in Augmented Reality and their Contribution to Visual Coherence”
P. Kán, H. Kaufmann, Institute of Software Technology and Interactive Systems, Vienna University of Technology, Vienna, Austria

Research today on CUDA equipped laptop PCs

How will this GPU Compute Capability migrate from high-end PCs to mobile?


Mobile SOC Performance Increases

[Chart: CPU/GPU aggregate performance (vertical axis, log scale: 1, 10, 100) against device shipping dates (horizontal axis, 2011 to 2015)]

Tegra roadmap points, in order:
- Tegra 2: 1st Dual A9
- Tegra 3: 1st Quad A9, 1st Power saver 5th core
- Tegra 4: 1st Quad A15, Chimera Computational Photography
- Logan: Full Kepler GPU, CUDA 5.0, OpenGL 4.3
- Parker: Denver CPU, Maxwell GPU, FinFET

Reference points on the chart: Core 2 Duo, Core i5
Example devices: HTC One X+, Google Nexus 7

100x perf increase in four years
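To put the "100x perf increase in four years" claim in per-year terms, the following minimal sketch computes the implied compound growth factor. It assumes steady year-on-year growth, which the slide does not state; the variable names are illustrative only.

```python
# Implied compound annual growth for "100x perf increase in four years".
# Assumption (not from the slide): growth is the same factor every year.
total_gain = 100.0   # from the slide
years = 4            # from the slide

annual_factor = total_gain ** (1.0 / years)

print(f"Implied growth per year: {annual_factor:.2f}x")
# Sanity check: compounding the annual factor recovers the total gain.
print(f"Compounded over {years} years: {annual_factor ** years:.0f}x")
```

Under that assumption the roadmap implies roughly a tripling of aggregate performance each year.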


Power is the New Design Limit
The Process Fairy keeps bringing more transistors... but the ‘End of Voltage Scaling’ means power is much more of an issue than in the past

In the Good Old Days
Leakage was not important, and voltage scaled with feature size

$L' = L/2$, $D' = 1/L'^2 = 4D$, $f' = 2f$, $V' = V/2$, $E' = C'V'^2 = E/8$, $P' = P$

Halve L and get 4x the transistors and 8x the capability for the same power

The New Reality
Leakage has limited threshold voltage, largely ending voltage scaling

$L' = L/2$, $D' = 1/L'^2 = 4D$, $f' \approx 2f$, $V' \approx V$, $E' = C'V'^2 = E/2$, $P' = 4P$

Halve L and get 4x the transistors and 8x the capability for 4x the power!!
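The two scaling regimes can be checked with a few lines of arithmetic. This is a sketch under stated assumptions: capacitance is taken to scale linearly with feature size and power is taken as density times frequency times energy per switch, which is the usual reading of these relations but is not spelled out on the slide. The function name `scale` is illustrative.

```python
def scale(length_factor, voltage_factor, freq_factor):
    """Return (density, energy, power) as new/old ratios after a process shrink.

    Assumptions (not stated on the slide): capacitance scales linearly with
    feature size, switching energy is E = C * V^2, and power is
    density * frequency * energy.
    """
    density = 1.0 / length_factor ** 2            # D' = 1/L'^2
    capacitance = length_factor                    # C' scales with L
    energy = capacitance * voltage_factor ** 2     # E' = C' V'^2
    power = density * freq_factor * energy         # P' = D' f' E'
    return density, energy, power

# Good old days: voltage halves along with feature size.
classic = scale(length_factor=0.5, voltage_factor=0.5, freq_factor=2.0)
# New reality: voltage stays roughly constant.
modern = scale(length_factor=0.5, voltage_factor=1.0, freq_factor=2.0)

for name, (density, energy, power) in (("classic", classic), ("modern", modern)):
    print(f"{name}: {density:g}x transistors, {energy:g}x energy/op, {power:g}x power")
```

Both cases give 4x the transistors at 2x the frequency (8x the capability); only the energy per switch differs, which is what turns constant power into 4x the power.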


Mobile Thermal Design Point

Typical max system power levels before thermal failure: 2-4W, 4-7W, 6-10W, 30-90W

- 4-5” Screen takes 250-500mW
- 7” Screen takes 1W
- 10” Screen takes 1-2W

Resolution makes a difference - the iPad3 screen takes up to 8W!

Even as battery technology improves - these thermal limits remain
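As a rough illustration of how much of a thermal budget the display alone can consume, the sketch below divides a screen power range by a system power range. Pairing the 2-4W budget with the 4-5” screen is an assumption made here for illustration; the extracted slide text does not say which budget belongs to which device class. The helper `screen_share` is hypothetical.

```python
def screen_share(screen_watts, budget_watts):
    """Fraction of a thermal budget consumed by the screen alone."""
    return screen_watts / budget_watts

# Figures from the slide; pairing them with each other is an assumption.
budget_low, budget_high = 2.0, 4.0     # "2-4W" system limit
screen_low, screen_high = 0.25, 0.5    # "250-500mW" for a 4-5 inch screen

best_case = screen_share(screen_low, budget_high)    # dim screen, roomy budget
worst_case = screen_share(screen_high, budget_low)   # bright screen, tight budget

print(f"Screen share of thermal budget: {best_case:.2%} to {worst_case:.2%}")
```

Whatever the exact pairing, the point of the slide holds: the display takes a fixed slice of the budget before any computation happens.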


How to Save Power?
- Much more expensive to MOVE data than COMPUTE data
- Energy efficiency must now be a key metric during silicon AND software design
- Awareness of where data lives, where computation happens, and how it is scheduled

Need to use hardware acceleration
- Reduce data movement
- Lots of local processing in parallel
- Efficient caching and memory usage

Energy per operation (for 40nm, 1V process):
- 32-bit Integer Add: 1pJ
- 32-bit Float Operation: 7pJ
- 32-bit Register Write: 0.5pJ
- Send 32-bits 2mm: 24pJ
- Send 32-bits Off-chip: 50pJ
- Write 32-bits to Memory: 600pJ
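The energy figures above make the "move versus compute" argument easy to quantify. The sketch below uses only the slide's numbers; the dictionary keys and the helper functions `relative_cost` and `ops_per_second` are illustrative names, and the throughput estimate deliberately ignores every other source of power draw.

```python
# Energy per 32-bit operation in picojoules, from the slide (40nm, 1V process).
ENERGY_PJ = {
    "int_add": 1.0,
    "float_op": 7.0,
    "register_write": 0.5,
    "send_2mm": 24.0,
    "send_off_chip": 50.0,
    "memory_write": 600.0,
}

def relative_cost(op, baseline="int_add"):
    """Energy of `op` as a multiple of the baseline operation."""
    return ENERGY_PJ[op] / ENERGY_PJ[baseline]

def ops_per_second(op, watts):
    """Upper bound on `op` executions per second within a power budget.

    Simplification: all power goes into this one operation type.
    """
    return watts / (ENERGY_PJ[op] * 1e-12)

for op in ENERGY_PJ:
    print(f"{op:>15}: {relative_cost(op):6.1f}x an integer add, "
          f"{ops_per_second(op, 1.0):.2e} per second at 1W")
```

By these figures a single memory write costs as much energy as 600 integer adds, which is why reducing data movement matters more than reducing arithmetic.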


Dark Silicon, Mobile SOCs and Power Efficiency

Lots of space for transistors - can’t turn them on at same time!
- Would exceed Thermal Design Point

Dark Silicon - specialized hardware turned on when needed
- Dedicated units can increase locality and parallelism of computation

GPUs are also much more power efficient than CPUs
- When exploiting data parallelism

[Chart: Power Consumption (vertical axis) against Computation Flexibility (horizontal axis). Dedicated Hardware, GPU Compute and Multi-core CPU are marked X1, X10 and X100 respectively.]

Enabling new mobile experiences requires pushing computation onto GPUs and dedicated hardware


Mobile GPU Compute Adoption

NVIDIA invented GPU Computing
What we learned - it’s not technology alone, it’s USE CASES

Mobile GPU Compute Use Case Pipeline
- Augmented Reality
- Face, Body and Gesture Tracking
- Computational Photography
- 3D Scene/Object Reconstruction


ISP – Dedicated Hardware for Sensor Processing

Camera ISP (Image Signal Processor) typically has little or no programmability
- Scan-line-based, data flows through compact hardware pipe
- No global memory used to minimize power

BUT… computational photography apps now want to mix non-programmable ISP processing with more flexible GPU processing
-> Chimera – new NVIDIA Computational Photography Architecture

Camera ISP: ~760 math Ops, ~42K vals = 670Kb, ~250Gops @ 300MHz
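The ISP figures are roughly self-consistent, as the following back-of-the-envelope sketch shows. Two assumptions are made here that the slide does not state: that the ~760 math operations are performed per pixel with one pixel processed per clock, and that each of the ~42K stored values is 16 bits wide.

```python
# Figures from the slide.
ops_per_pixel = 760        # "~760 math Ops" (assumed here to be per pixel)
clock_hz = 300e6           # "300MHz"
stored_values = 42_000     # "~42K vals"

# Assumptions, not from the slide.
pixels_per_clock = 1
bits_per_value = 16

throughput_gops = ops_per_pixel * pixels_per_clock * clock_hz / 1e9
storage_kilobits = stored_values * bits_per_value / 1000

print(f"Throughput: {throughput_gops:.0f} Gops (slide: ~250Gops)")
print(f"On-chip storage: {storage_kilobits:.0f} Kb (slide: 670Kb)")
```

Under these assumptions the throughput comes out somewhat below the slide's rounded ~250Gops and the storage lands close to 670Kb, so the quoted numbers are plausible as approximations of a fixed-function pipeline running at one pixel per clock.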


Flexible Use of ISP, GPU and CPU

Flexible routing of image frames between computation engines

Potential to integrate more hardware blocks over time
- ISPs for different types of sensors – e.g. IR and depth cameras
- ‘Scanners’ - very low power, always on, to detect things in the environment to process


Tegra 4 Family

              Tegra 4 (“Wayne”)                   Tegra 4i (“Grey”)
              World’s Fastest Mobile Processor    1st Integrated Tegra 4 LTE Processor
              Superphone / Tablet                 Smartphone
Quad CPU      Cortex A15, 4+1                     Cortex A9 r4, 4+1
NVIDIA GPU    72 Core                             60 Core
LTE           Optional with i500                  Integrated i500
Chimera


Android Three Layer Ecosystem

- API Drivers: Java (SDK) and Native (NDK)
- Apps and Games: Most use Java, cutting-edge apps/games use native APIs
- Middleware and Apps Engines: Use native APIs for power and performance

Partners

VisX: Turn-key vision middleware developed by NVIDIA, e.g. Tap-to-track, Panorama Paint


?

APIs for Mobile Imaging and Vision

[Grid: columns Graphics | Camera and Images; rows Java | Native]

- MediaCodec
- SurfaceTexture
- FilterScript (RenderScript Subset)
- Java Binding to OpenGL ES (similar to JSR239)
- OpenCV
- Use GLSL shaders for imaging
- OpenGL 4.3 Compute Shaders provide general purpose computation on uniforms, images and textures for image and vision processing
- Open source research project for advanced camera control
- OpenCV4Tegra: Open source OpenCV vision library with OpenGL ES GLSL, ARM Multithreading and NEON optimizations
- Open standard under development at Khronos for optimized, power efficient vision acceleration


APIs for Mobile GPU Compute

[Grid: columns Graphics | GPU Compute; rows Java | Native]

- RenderScript: Run performance-critical sections as native C. Automatically offload C code segments to the GPU if possible
- Java Binding to OpenGL ES (similar to JSR239)
- Use GLSL shaders for GPGPU compute
- OpenGL 4.3 Compute Shaders provide sufficient flexibility for physics, AI, Global Illumination and Ray-tracing acceleration

? Program GPUs in C - over 375 million CUDA-enabled GPUs in notebooks, workstations and supercomputers


CUDA 5.0 and OpenGL 4.3 on Tegra Today

Kayla: Tegra + discrete GPU development platforms
- Available to select developers

OpenGL 4.3 and CUDA 5.0
- Full Kepler support on Linux
- PhysX, VisX …

Enables early development of ARM-based applications with desktop-class graphics and compute
Talk to us if you are interested

Or email [email protected]



Thank You!

- Powerful GPU Compute is coming to a mobile device near you!
- New use cases need GPUs for acceptable battery consumption
- Logan will bring full Kepler-class GPU to Mobile!
- Desktop APIs for full GPU Compute: OpenGL 4.3 and CUDA 5.0

If you have apps that need Mobile GPU Compute now is the time to be talking to us…

Questions? [email protected]

