Silicon Acceleration APIs - Khronos Group · 2016-11-23

© Copyright Khronos Group 2016 - Page 1

Silicon Acceleration APIs
Embedded Technology 2016, Yokohama

Neil Trevett, Vice President Developer Ecosystem, NVIDIA | President, Khronos

ntrevett@nvidia.com | @neilt3d | November 2016


Khronos and Open Standards

Khronos is an industry consortium of over 100 companies. We create royalty-free, open standard APIs that connect software to silicon for hardware acceleration of graphics, parallel compute, neural networks and vision.


Accelerated API Landscape

[Diagram: vision frameworks and neural net libraries (including the OpenVX Neural Net Extension) layered over high-level language-based acceleration frameworks, explicit kernels and 3D graphics APIs, targeting GPUs, FPGAs, DSPs and dedicated hardware]


OpenCL – Low-level Parallel Programming
• Low-level programming of heterogeneous parallel compute resources
- One code tree can be executed on CPUs, GPUs, DSPs and FPGAs
• OpenCL C language to write kernel programs to execute on any compute device
- Platform Layer API - to query, select and initialize compute devices
- Runtime API - to build and execute kernel programs on multiple devices

• New in OpenCL 2.2 - OpenCL C++ kernel language - a static subset of C++14

- Adaptable and elegant sharable code – great for building libraries

- Templates enable meta-programming for highly adaptive software

- Lambdas used to implement nested/dynamic parallelism
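The host-side flow described above (platform layer to discover devices, runtime to build one kernel source per device and execute it) can be modelled in plain Python. This is a conceptual sketch only, not real OpenCL bindings; the class and method names are invented analogues of the C API.

```python
# Conceptual sketch of the OpenCL host-side flow in plain Python
# (no real OpenCL bindings): one kernel "code tree" is built for,
# and executed on, several device types. Names are illustrative.

KERNEL_SRC = "out[i] = a[i] + b[i]"  # stands in for an OpenCL C kernel

class Device:
    def __init__(self, kind):
        self.kind = kind          # e.g. "CPU", "GPU", "DSP", "FPGA"
        self.binary = None
    def build(self, source):      # rough clBuildProgram analogue
        self.binary = f"[{self.kind} binary of: {source}]"
    def enqueue_ndrange(self, a, b):  # rough clEnqueueNDRangeKernel analogue
        # each work-item i runs the kernel body once
        return [x + y for x, y in zip(a, b)]

# Platform-layer analogue: query and select devices
devices = [Device(k) for k in ("CPU", "GPU", "DSP", "FPGA")]
for d in devices:
    d.build(KERNEL_SRC)           # same source, per-device binary

results = [d.enqueue_ndrange([1, 2, 3], [4, 5, 6]) for d in devices]
print(results[0])  # every device computes the same answer: [5, 7, 9]
```

The point of the sketch is the separation the slide names: device discovery (platform layer) is distinct from building and launching kernels (runtime), and the same kernel source serves every device type.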

[Diagram: a single OpenCL kernel code tree is compiled for each target device (CPU, GPU, DSP, FPGA); the host CPU uses the Runtime API to load and execute kernels across devices]


OpenCL Conformant Implementations

OpenCL specification releases: 1.0 (Dec08), 1.1 (Jun10), 1.2 (Nov11), 2.0 (Nov13), 2.1 (Nov15)

[Timeline chart: per-vendor conformance dates across desktop, mobile, FPGA and embedded platforms; vendor timelines show the first implementation of each spec generation, from the earliest 1.0 conformance in May09 through the first 2.0 conformance in Jul14 and 2.1 in Jun15]


SYCL for OpenCL
• Single-source heterogeneous programming using STANDARD C++
- Use C++ templates and lambda functions for host & device code
• Aligns the hardware acceleration of OpenCL with the direction of the C++ standard
- C++14 with open source C++17 Parallel STL hosted by Khronos

• C++ Kernel Language: low-level control
- 'GPGPU'-style separation of device-side kernel source code and host code
• Single-source C++: programmer familiarity
- Approach also taken by C++ AMP and OpenMP
• Developer choice: the development of the two specifications is aligned so code can be easily shared between the two approaches
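The single-source idea (host code and the parallel kernel, written as a lambda, living in one file) can be illustrated with a plain Python analogue. This is not SYCL; the thread pool merely stands in for a SYCL queue, and the lambda for an inline device kernel.

```python
# Illustrative single-source sketch (plain Python, not real SYCL):
# host code and "device" code live in one source file, and the
# parallel work is expressed as a lambda, as SYCL does in C++.
from concurrent.futures import ThreadPoolExecutor

data = list(range(8))                # host-side buffer

with ThreadPoolExecutor() as pool:   # stands in for a SYCL queue
    # parallel_for analogue: one task per work-item index,
    # with the kernel written inline as a lambda
    squared = list(pool.map(lambda i: data[i] * data[i], range(len(data))))

print(squared)  # [0, 1, 4, 9, 16, 25, 36, 49]
```

The design point the slide makes is visible here: because the kernel is an ordinary lambda in the host language, it can capture host data directly instead of being compiled from a separate kernel-language source string.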


Embedded Vision Acceleration
Embedded Technology, Yokohama

Shorin Kyo, Huawei


OpenVX – Low Power Vision Acceleration
• Targeted at vision acceleration in real-time, mobile and embedded platforms
- Precisely defined API for production deployment
• Higher abstraction than OpenCL for performance portability across diverse architectures
- Multi-core CPUs, GPUs, DSPs and DSP arrays, ISPs, dedicated hardware…
• Extends portable vision acceleration to very low power domains
- Doesn't require a high-power CPU/GPU complex or OpenCL precision

[Diagram: power efficiency vs computation flexibility, with vision processing efficiency rising roughly 1x → 10x → 100x moving from multi-core CPU to GPU compute to vision DSPs to dedicated hardware; OpenVX sits as a vision engine layer between application/middleware and the GPU, DSP and hardware]


OpenVX Graphs
• OpenVX developers express a graph of image operations ('Nodes')
- Nodes can be on any hardware or processor, coded in any language
• Graphs can execute almost autonomously
- Possible to minimize host interaction during frame-rate graph execution
• Graphs are the key to run-time optimization opportunities…

[Diagram: Feature Extraction Example Graph, built from OpenVX nodes: camera input (YUV frame) → color conversion (RGB frame) → channel extract (gray frame) → image pyramid (Pyr_t) → optical flow (using previous-frame features Ft-1) and Harris track → array of keypoints/features → rendering output]
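The graph idea above (declare nodes and their data dependencies once, then let the whole pipeline run without per-node host calls) can be sketched in plain Python. This is a toy model, not the real OpenVX C API; the class and node names are invented for illustration.

```python
# Minimal sketch of an OpenVX-style graph (plain Python, not the
# real OpenVX API): nodes are declared up front with their data
# dependencies, then the whole graph executes end to end with no
# host interaction between nodes.

class Graph:
    def __init__(self):
        self.nodes = []           # (name, fn, inputs, output)
    def add_node(self, name, fn, inputs, output):
        self.nodes.append((name, fn, inputs, output))
    def process(self, **initial_data):
        data = dict(initial_data)
        # declaration order is already topological in this toy example
        for name, fn, inputs, output in self.nodes:
            data[output] = fn(*(data[i] for i in inputs))
        return data

# Toy pipeline echoing the feature-extraction example
g = Graph()
g.add_node("color_convert",   lambda yuv: f"rgb({yuv})",  ["yuv"],  "rgb")
g.add_node("channel_extract", lambda rgb: f"gray({rgb})", ["rgb"],  "gray")
g.add_node("pyramid",         lambda gray: f"pyr({gray})", ["gray"], "pyr")

out = g.process(yuv="frame0")
print(out["pyr"])  # pyr(gray(rgb(frame0)))
```

Because the full pipeline is known before execution, a real implementation can analyze it as a whole, which is exactly what enables the run-time optimizations on the next slide.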


OpenVX Efficiency through Graphs
• Memory Management: reuse pre-allocated memory for multiple intermediate data (less allocation overhead, more memory for other applications)
• Kernel Merge: replace a sub-graph with a single faster node (better memory locality, less kernel-launch overhead)
• Graph Scheduling: split the graph execution across the whole system, CPU / GPU / dedicated HW (faster execution or lower power consumption)
• Data Tiling: execute a sub-graph at tile granularity instead of image granularity (better use of data cache and local memory)
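Of the four optimizations, kernel merge is easy to show concretely. The sketch below is plain Python with two made-up element-wise "nodes"; fusing them removes the intermediate buffer and one kernel launch while producing identical results.

```python
# Hedged sketch of the "kernel merge" optimization (plain Python,
# not a real OpenVX implementation): two element-wise nodes are
# fused into one, removing the intermediate image between them.

def blur(img):       # hypothetical node 1
    return [x * 0.5 for x in img]

def threshold(img):  # hypothetical node 2
    return [1 if x > 0.4 else 0 for x in img]

# Unfused: blur writes a full intermediate image, threshold reads it back
def run_unfused(img):
    tmp = blur(img)           # intermediate allocation + extra traffic
    return threshold(tmp)

# Fused: one pass, one kernel launch, no intermediate image
def run_fused(img):
    return [1 if x * 0.5 > 0.4 else 0 for x in img]

img = [0.2, 0.9, 1.0, 0.5]
assert run_unfused(img) == run_fused(img)   # same result, less overhead
print(run_fused(img))  # [0, 1, 1, 0]
```

An implementation can only perform this substitution safely because the graph exposes the whole sub-pipeline before execution starts.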


Example Relative Performance

[Chart: relative performance of OpenVX (GPU accelerated) vs OpenCV (GPU accelerated), from NVIDIA implementation experience: Arithmetic 1.1x, Analysis 2.9x, Filter 8.7x, Geometric 1.5x, Overall 2.5x. Geometric mean over >2200 primitives grouped into these categories, running at different image sizes and parameter settings]


Layered Vision Processing Ecosystem
• Implementers may use OpenCL or Compute Shaders to implement OpenVX nodes on programmable processors
• Developers can then use OpenVX to easily connect those nodes into a graph
• The OpenVX graph enables implementers to optimize execution across diverse hardware architectures and drive to lower-power implementations
• OpenVX enables the graph to be extended to include hardware architectures that don't support programmable APIs

[Diagram: Application and C/C++ on top of OpenVX, which spans programmable vision processors and dedicated vision hardware]

AMD OpenVX
- Open source, highly optimized for x86 CPU and OpenCL for GPU
- "Graph Optimizer" looks at the entire processing pipeline and removes/replaces/merges functions to improve performance and bandwidth
- Scripting for rapid prototyping, without re-compiling, at production performance levels
http://gpuopen.com/compute-product/amd-openvx/


OpenVX 1.0 Shipping, OpenVX 1.1 Released!
• Multiple OpenVX 1.0 implementations shipping – spec released October 2014

- Open source sample implementation and conformance tests available

• OpenVX 1.1 Specification released 2nd May 2016

- Expands node functionality AND enhances graph framework

• OpenVX is EXTENSIBLE

- Implementers can add their own nodes at any time to meet customer and market needs



OpenVX Neural Net Extension
• Convolutional Neural Network topologies can be represented as OpenVX graphs
- Layers are represented as OpenVX nodes
- Layers are connected by multi-dimensional tensor objects
- Layer types include convolution, activation, pooling, fully-connected, soft-max
- CNN nodes can be mixed with traditional vision nodes
• Import/Export Extension
- Efficient handling of network weights/biases or complete networks
• The specification is provisional
- Feedback from the deep learning community is welcome
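The "layers connected by tensors" idea can be made concrete by propagating tensor shapes through a small layer chain. This is an illustrative Python sketch, not the OpenVX NN API; the layer sizes below are invented for the example (no padding, for simplicity).

```python
# Illustrative sketch (plain Python, not the real OpenVX NN API):
# layers as nodes connected by multi-dimensional tensors, with the
# tensor shape propagated through a tiny conv -> pool -> fc chain.

def conv2d_shape(shape, out_ch, k, stride=1):
    c, h, w = shape                      # C x H x W, no padding
    return (out_ch, (h - k) // stride + 1, (w - k) // stride + 1)

def pool_shape(shape, k):
    c, h, w = shape
    return (c, h // k, w // k)

shape = (3, 32, 32)                           # input tensor: C x H x W
shape = conv2d_shape(shape, out_ch=16, k=5)   # -> (16, 28, 28)
shape = pool_shape(shape, k=2)                # -> (16, 14, 14)
fc_inputs = shape[0] * shape[1] * shape[2]    # flatten for fully-connected
print(shape, fc_inputs)  # (16, 14, 14) 3136
```

Each intermediate shape here corresponds to one tensor object on a graph edge; a vision node producing a matching tensor could feed the same chain, which is how CNN and traditional vision nodes mix in one graph.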

[Diagram: an OpenVX graph mixing CNN nodes with traditional vision nodes: native camera control feeding vision nodes and CNN nodes, then downstream application processing]


NNEF - Neural Network Exchange Format

[Diagram: without NNEF, every NN authoring framework needs a direct path to every inference engine; with NNEF, frameworks export and engines import one common Neural Net Exchange Format]

NNEF encapsulates neural network structure, data formats, commonly used operations (such as convolution, pooling, normalization, etc.) and formal network semantics

NNEF 1.0 is currently being defined

OpenVX will import NNEF files
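Since NNEF 1.0 was still being defined at the time of this talk, the sketch below invents a toy exchange payload purely to illustrate the workflow: an authoring framework serializes structure plus weights, and a separate inference engine reconstructs the network from that one artifact. Nothing here reflects the actual NNEF syntax.

```python
# Hypothetical sketch of the exchange-format idea (the format below
# is invented for illustration and is NOT real NNEF): an authoring
# framework exports structure + weights, an inference engine imports.
import json

# "Authoring framework" side: export structure and weights
network = {
    "ops": [
        {"type": "conv",     "kernel": 3, "out_channels": 8},
        {"type": "relu"},
        {"type": "max_pool", "size": 2},
    ],
    "weights": {"conv_0": [0.1, -0.2, 0.3]},   # toy payload
}
exchange_blob = json.dumps(network)

# "Inference engine" side: import and rebuild the op list
loaded = json.loads(exchange_blob)
op_types = [op["type"] for op in loaded["ops"]]
print(op_types)  # ['conv', 'relu', 'max_pool']
```

The value of the format is the topology change in the diagram above: N frameworks and M engines need N+M converters instead of N×M.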

http://hwstats.unity3d.com/mobile/gpu.html

• OpenGL ES 2.0 (2007): vertex and fragment shaders replace the ES 1.x fixed-function pipeline
• OpenGL ES 3.x (from 2012): 32-bit integers and floats, NPOT and 3D/depth textures, texture arrays, multiple render targets, floating-point render targets
• OpenGL ES compute shaders (ES 3.1, 2014)
• OpenGL ES 3.2 (2015, folding in the AEP): tessellation and geometry shaders, ASTC texture compression, debug and robustness for security

[Timeline: ES 1.0 (2003) → ES 1.1 (2004) → ES 2.0 (2007) → ES 3.0 (2012) → ES 3.1 (2014) → ES 3.2 (2015), alternating silicon and driver updates]

Close to 2 billion OpenGL ES devices shipped in 2015

Epic's Rivalry demo using full Unreal Engine 4: https://www.youtube.com/watch?v=jRr-G95GdaM


The Key Principle of Vulkan: Explicit Control

• Application tells the driver what it is going to do

- In enough detail that driver doesn’t have to guess

- When the driver needs to know it

• In return, driver promises to do

- What the application asks for

- When it asks for it

- Very quickly

• No driver magic – no surprises

You have to supply a LOT of information, and you have to supply it early.
The result: efficient, predictable performance.


Vulkan Explicit GPU Control

• Layered GPU control (traditional): the application runs a single thread per context; a high-level driver abstraction handles context management, memory allocation, a full GLSL compiler and error detection
• Explicit GPU control (Vulkan): the application owns memory allocation, thread management, synchronization and multi-threaded generation of command buffers over a thin driver; language front-end compilers (initially GLSL) and debug/validation layers are loadable rather than built into the driver

Vulkan 1.0 provides access to OpenGL ES 3.1 / OpenGL 4.X-class GPU functionality, but with increased performance and flexibility

Vulkan Benefits
• Simpler drivers: improved efficiency/performance, reduced CPU bottlenecks, lower latency, increased portability
• SPIR-V pre-compiled shaders: no front-end compiler in the driver, future shading-language flexibility
• Loadable layers: no error-handling overhead in production code
• Multi-threaded command buffers: command creation can be multi-threaded, so multiple CPU cores increase performance
• Graphics, compute and DMA queues: work-dispatch flexibility
• Resource management in app code: fewer driver hitches and surprises


Vulkan Multi-threading Efficiency

[Diagram: several CPU threads each build their own command buffer in parallel; a single command queue submits the finished buffers to the GPU]

1. Multiple threads can construct command buffers in parallel; the application is responsible for thread management and synchronization
2. Command buffers are placed in the command queue by a separate submission thread

Applications can create graphics, compute and DMA command buffers with a general queue model that can be extended to more heterogeneous processing in the future
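The two-step pattern above can be modelled with standard Python threading. This is a conceptual sketch only (no real Vulkan calls): worker threads record command buffers in parallel, and one submission thread hands the finished buffers to a queue standing in for a VkQueue.

```python
# Conceptual model of Vulkan-style multi-threaded command recording
# in plain Python (no real Vulkan API): workers record command
# buffers in parallel, one thread submits them to the queue.
import threading, queue

recorded = []                      # completed command buffers
lock = threading.Lock()
submit_q = queue.Queue()           # stands in for a VkQueue

def record(thread_id):
    # each thread builds its own command buffer independently
    cmd_buf = [f"draw(thread={thread_id}, call={i})" for i in range(3)]
    with lock:                     # app-managed synchronization
        recorded.append(cmd_buf)

threads = [threading.Thread(target=record, args=(t,)) for t in range(4)]
for t in threads: t.start()
for t in threads: t.join()

# single submission thread: enqueue all recorded buffers
for buf in recorded:
    submit_q.put(buf)

print(submit_q.qsize())  # 4 command buffers queued for the "GPU"
```

Note that the lock and the join are the application's job, which is exactly the trade the slide describes: the driver stops guessing, and the app takes on explicit synchronization in exchange for parallel recording.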


Khronos Safety Critical APIs
• OpenGL ES 1.0 (2003, fixed-function graphics) → OpenGL SC 1.0 (2005, fixed-function graphics subset)
• OpenGL ES 2.0 (2007, shader programmable pipeline) → OpenGL SC 2.0 (April 2016, shader programmable pipeline subset)
• Experience and guidelines (small driver size, advanced functionality, graphics and compute) feed new-generation APIs for safety-certifiable vision, graphics and compute
• Targets safety-critical vision processing and safety-critical heterogeneous compute (ISO 26262, DO-178B/C)
https://www.khronos.org/news/press/advisory-panel-to-create-design-guidelines-for-safety-critical-apis


Please Consider Joining Khronos!
• Influence how standards evolve!
• Access draft specs to build products faster!
• Understand early industry requirements!
• Ship products that conform to international standards!

Khronos is proven to RAPIDLY generate hardware API standards that create significant market opportunities. Any company or organization is welcome to join Khronos for a voice and a vote in any of its standards.
www.khronos.org


VeriSilicon

Computer Vision solution with OpenVX over “VIP”

Gaku Ogura

Simon Jones

18 November 2016


VeriSilicon – Who We Are

• Founded in 2001

• Over 650 employees

• 70% dedicated to R&D


VeriSilicon – What We Can Do
• IP development & licensing, system architecture, silicon design
• Idea to Spec → Spec to RTL → RTL to Netlist → Netlist to GDS → Manufacture → Test → Package → Shipping


Movement of Intelligent Devices

• Natural User Interface: vision, natural language processing, sensors
• Multi-media: graphics, video, audio, voice
• Many applications: need flexibility
• 1000s of algorithms, a new one every day
• Compute intensive: need hardware acceleration
• Algorithms keep changing: need programmability


Deep Learning: Neural Networks Return with a Vengeance
• With big data and high-density compute, deep neural networks can be trained to solve "impossible" computer vision problems

[Chart: % of teams using deep learning, rising over time]


The Basics of Deep Neural Networks (DNN)
• Training: 10^16–10^22 MACs/dataset
- Offline (mostly); analogous to compiling
- Forward and backward passes refine the 100K–100M network coefficients until the prediction (e.g. "Pug") matches the label ("Rottweiler") with low error
• Inference: 10^6–10^11 MACs/image
- On-the-fly; a single forward pass; analogous to running s/w
- Embedded systems use this


Challenges of Real-Time Inference
• Compute scales with the task: classification ~100G MAC/s, detection ~1T MAC/s, segmentation ~10T MAC/s
• The majority of deep neural net computation is multiply-accumulate (MAC), for convolutions and inner products
• Memory bandwidth is a bigger challenge
- Convolution in deep neural nets works not just on 2D images but on high-dimensional arrays (i.e. "tensors")
- Example: AlexNet (classification) has 60M parameters (240MB FP32); DDR bandwidth with a brute-force implementation: 35GB/s
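The MAC and memory figures above are easy to sanity-check with arithmetic. The layer shape below is the standard AlexNet conv1 configuration (227x227x3 input, 96 filters of 11x11x3, stride 4); the storage line simply re-derives 60M parameters → 240MB at FP32.

```python
# Back-of-envelope numbers behind the slide, as runnable arithmetic.

# MACs for one conv layer: out_h * out_w * out_ch * in_ch * k * k
out_h = out_w = 55          # AlexNet conv1 output: (227 - 11) / 4 + 1 = 55
macs_conv1 = out_h * out_w * 96 * 3 * 11 * 11
print(f"conv1 MACs/image: {macs_conv1:,}")   # about 105 million

# Parameter storage: 60M parameters at 4 bytes each (FP32)
params = 60_000_000
print(f"weights: {params * 4 / 1e6:.0f} MB") # 240 MB

# One layer already sits inside the slide's 10^6..10^11 MACs/image range
assert 10**6 < macs_conv1 < 10**11
```

A single early layer thus costs on the order of 10^8 MACs per image, and the full 240MB of weights must stream through memory every frame in a brute-force implementation, which is where the tens of GB/s of DDR bandwidth come from.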


Processor Architectures

[Diagram: energy efficiency vs programmability: custom RTL, VeriSilicon's VIP, DSPs (VeriSilicon, CEVA, Videantis, Cadence, Synopsys), GPUs (VeriSilicon, ARM, Imagination) and CPUs, programmed via OpenCL, OpenVX and OpenCV]


Efficient Processor Architecture to Enable DNN for Embedded Vision
• Mature CV block HW accelerators: custom RTL to accelerate mature CV algorithms for 24/7 low-power operation
• Programmable engines: cope with new CV algorithms; future new NN layers can be implemented here; scalable architecture to support different PPA (Performance, Power, Area) requirements
• Scalable MACs: high-utilization MACs for DNN; handle other DNN functions (normalization, pooling, …); handle pruning, compression, batching … to increase MAC utilization and decrease DDR bandwidth; scalable to different PPA requirements
• Intelligent data fabrics and local memory/cache: data synchronization between threads/blocks while minimizing DDR bandwidth
• Customer-defined HW accelerator interface: extends HW acceleration for application-specific purposes
• Above the hardware: vision applications drive deep learning frameworks and a vision framework through the API


Use Case Study: Faster RCNN
• One of the state-of-the-art object detection CNNs
- Network configuration: 300 region proposals; 60M parameters, 30G MACs/image
• Collaborative computing in VIP
- NN Engine (number crunching): convolution layers, fully-connected layers
- Tensor Processing Fabric (data rearrangement): LRN, max pooling, ROI pooling
- Programmable Engine (precision compute): RPN, NMS, softmax
- Output of a layer is queued in DDR if the next layer is on a different processing engine
• Real-time performance (30fps HD) under 0.5W @ 28nm HPM

[Diagram: Programmable Engine, NN Cores and TP Fabric sharing local memory/cache]
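The quoted figures imply a concrete efficiency number, re-derived here as runnable arithmetic (the inputs are exactly the slide's 30G MACs/image, 30 fps and 0.5 W).

```python
# Quick arithmetic on the Faster RCNN use-case figures.
macs_per_image = 30e9      # 30G MACs/image
fps = 30                   # real-time HD target
power_w = 0.5              # sustained power budget

macs_per_s = macs_per_image * fps          # sustained compute rate
eff = macs_per_s / power_w                 # MAC/s per watt
print(f"{macs_per_s/1e9:.0f} GMAC/s, {eff/1e12:.1f} TMAC/s per watt")
# 900 GMAC/s, 1.8 TMAC/s per watt
```

So meeting 30fps on this network means sustaining roughly 900 GMAC/s, i.e. about 1.8 TMAC/s per watt at the stated power budget.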


Scalable Performance from VIP8000
• Scalable performance with the SAME application code on different processor variants

VIP8000 Series | VIP Nano | VIP8000UL | VIP8000L | VIP8000
Number of execution cores | 1 | 2 | 4 | 8
Clock frequency SVT @WC125C (MHz) | 800 | 800 | 800 | 800
HD performance @800MHz:
- Perspective warping | 60 fps | 120 fps | 240 fps | 480 fps
- Optical flow LK | 30 fps | 60 fps | 120 fps | 240 fps
- Pedestrian detection | 30 fps | 60 fps | 120 fps | 240 fps
- Convolutional neural network (AlexNet) | 125 fps | 250 fps | 500 fps | 1000 fps
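The table's pattern is linear scaling with core count; the snippet below just re-checks that each quoted workload's fps doubles as the execution cores double from 1 to 8.

```python
# Sanity check of the VIP8000 table: fps scales linearly with cores.
cores = [1, 2, 4, 8]
table = {
    "perspective_warp": [60, 120, 240, 480],
    "optical_flow_lk":  [30, 60, 120, 240],
    "pedestrian_det":   [30, 60, 120, 240],
    "alexnet_cnn":      [125, 250, 500, 1000],
}
for name, fps in table.items():
    assert all(f == fps[0] * c for f, c in zip(fps, cores))
print("fps scales linearly with execution cores")
```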


Comprehensive Multi-Media/Pixel Subsystems

[Diagram: CPU, VPU core (video), VIP core (vision/image), GC core (GPU), DC (display), ZSP (audio/voice) and ISP on an AXI bus, with VeriSilicon Universal Compression, DMA engine, memory and IO controllers, wireless connectivity, local bus and mixed-signal IPs]


japan-sales@verisilicon.com
