Виктор Ерухимов Open VX mixar moscow sept15

© Copyright Khronos Group 2014 - Page 1

Vision AccelerationmixAR, September 2015

Victor ErukhimovItseez, Itseez3D

Itseez• Real time computer vision solutions on embedded platforms:– Mobile products: ItSeez3D, Facense– Automotive: driver assistance systems– Ecosystem: OpenCV, OpenVX

S CANNE R

VICTOR [email protected]

Capture the world in 3D!

Sample models

Embedded vision challenges• Intense and power hungry computations•Need to run in real-time on embedded/mobile/wearable devices•Very few specialized hardware products•Software ecosystem not ready for embedded real-time scenarios


Vision AccelerationmixAR, September 2015

Victor ErukhimovItseez, Itseez3D


Khronos Connects Software to Silicon

Open Consortium creating ROYALTY-FREE, OPEN STANDARD APIs for hardware acceleration

Defining the roadmap for low-level silicon interfaces needed on every platform

Graphics, compute, rich media,vision, sensor and camera

processing

Rigorous specifications AND conformance tests for cross-

vendor portability

Acceleration APIs BY the Industry

FOR the Industry

Well over a BILLION people use Khronos APIs Every Day…http://accelerateyourworld.org/


Khronos Standards

Visual Computing- 3D Graphics- Heterogeneous Parallel Computing

3D Asset Handling- 3D authoring asset interchange

- 3D asset transmission format with compression

Acceleration in HTML5- 3D in browser – no Plug-in

- Heterogeneous computing for JavaScript

Over 100 companies defining royalty-free APIs to connect software to silicon

Sensor Processing- Vision Acceleration- Camera Control - Sensor Fusion


Mobile Vision Acceleration = New Experiences

Augmented Reality

Face, Body andGesture Tracking

ComputationalPhotography and

Videography

3D Scene/ObjectReconstruction

Need for advanced sensors and the acceleration to process them


Visual Computing = Graphics PLUS Vision

Real-time GPU ComputeResearch project on GPU-accelerated laptop

High-Quality Reflections, Refractions, and Caustics in Augmented Reality and their Contribution to Visual Coherence

P. Kán, H. Kaufmann, Institute of Software Technology and Interactive Systems, Vienna University of Technology, Vienna, Austria

https://www.youtube.com/watch?v=i2MEwVZzDaA

Imagery

Data

VisionProcessing

GraphicsProcessing

Enhanced sensor and vision

capability deepens the interaction

between real and virtual worlds


Vision Pipeline Challenges and Opportunities

• Light / Proximity

• 2 cameras• 3 microphones• Touch• Position

- GPS- WiFi (fingerprint)- Cellular trilateration- NFC/Bluetooth Beacons

• Accelerometer• Magnetometer• Gyroscope• Pressure / Temp / Humidity

19

Sensor ProliferationDiverse sensor awareness of the user and surroundings

• Camera sensors >20MPix

• Novel sensor configurations

• Stereo pairs

• Plenoptic Arrays• Active Structured Light

• Active TOF

Growing Camera DiversityCapturing color, range

and lightfields

Diverse Vision ProcessorsDriving for high performance

and low power

• Multi-core CPUs

• Programmable GPUs

• DSPs and DSP arrays

• Camera ISPs• Dedicated vision IP blocks

Flexible sensor and camera control to generate

required image stream

Use best processing available for image stream processing –

with code portability

Control/fuse vision data by/with all other sensor data

on device


Vision Processing Power Efficiency• Depth sensors = significant processing- Generate/use environmental information

• Wearables will need ‘always-on’ vision- With smaller thermal limit / battery than phones!

• GPUs has x10 CPU imaging power efficiency- GPUs architected for efficient pixel handling

• Traditional cameras have dedicated hardware- ISP = Image Signal Processor – on all SOCs today

• SOCs have space for more transistors- But can’t turn on at same time = Dark Silicon

• Potential for dedicated sensor/vision silicon- Can trigger full CPU/GPU complex

Pow

er E

ffic

ienc

y

Computation Flexibility

Dedicated Hardware

GPUCompute

Multi-coreCPUX1

X10

X100

AdvancedSensors

Wearables

But how to program specialized processors?Performance and Functional Portability


OpenVX – Power Efficient Vision Acceleration• Out-of-the-Box vision acceleration framework- Enables low-power, real-time applications- Targeted at mobile and embedded platforms

• Functional Portability- Tightly defined specification - Full conformance tests

• Performance portability across diverse HW- Higher-level abstraction hides hardware details- ISPs, Dedicated hardware, DSPs and DSP arrays,

GPUs, Multi-core CPUs …

• Enables low-power, always-on acceleration- Can run solely on dedicated vision hardware- Does not require full SOC CPU/GPU complex to

be powered on

VisionAccelerator

ApplicationApplication

ApplicationApplication

VisionAcceleratorVision

AcceleratorVisionAccelerator


OpenVX Graphs – The Key to Efficiency• Vision processing directed graphs for power and performance efficiency- Each Node can be implemented in software or accelerated hardware- Nodes may be fused by the implementation to eliminate memory transfers- Processing can be tiled to keep data entirely in local memory/cache

• VXU Utility Library for access to single nodes - Easy way to start using OpenVX by calling each node independently

• EGLStreams can provide data and event interop with other Khronos APIs- BUT use of other Khronos APIs are not mandated

OpenVXNode

OpenVXNode

OpenVXNode

OpenVXNode

Downstream ApplicationProcessing

NativeCamera Control

Example OpenVX Graph


OpenVX 1.0 Function Overview• Core data structures- Images and Image Pyramids- Processing Graphs, Kernels, Parameters

• Image Processing- Arithmetic, Logical, and statistical operations- Multichannel Color and BitDepth Extraction and Conversion- 2D Filtering and Morphological operations- Image Resizing and Warping

• Core Computer Vision- Pyramid computation- Integral Image computation

• Feature Extraction and Tracking- Histogram Computation and Equalization- Canny Edge Detection- Harris and FAST Corner detection- Sparse Optical Flow

OpenVX 1.0 defines framework for

creating, managing and executing graphs

Focused set of widely used functions that are

readily accelerated

Implementers can add functions as extensions

Widely used extensions adopted into future versions of the core

OpenVX SpecificationIs Extensible

Khronos maintains extension registry


Example Graph - Stereo Machine Vision

Camera 1Compute Depth

Map(User Node)

Detect and track objects(User Node)

Camera 2Image

PyramidStereo

Rectify with Remap

Stereo Rectify with

Remap

Compute Optical Flow

Object coordinates

OpenVX Graph

Delay

Tiling extension enables user nodes (extensions) to also optimally run in local memory


OpenVX and OpenCV are Complementary

Governance Community driven open sourcewith no formal specification

Formal specification defined andimplemented by hardware vendors

Conformance No conformance tests for consistency and every vendor implements different subset

Full conformance test suite / processcreates a reliable acceleration platform

Portability APIs can vary depending on processor Hardware abstracted for portability

ScopeVery wide

1000s of imaging and vision functionsMultiple camera APIs/interfaces

Tight focus on hardware accelerated functions for mobile visionUse external camera API

Efficiency Memory-based architectureEach operation reads and writes memory

Graph-based executionOptimizable computation, data transfer

Use Case Rapid experimentation Production development & deployment


OpenVX Announcement• Finalized OpenVX 1.0.1 specification released June 2015- www.khronos.org/openvx

• Full conformance test suite and Adopters Program immediately available- $20K Adopters fee ($15K for members) – working group reviews submitted results- Test suite exercises graph framework and functionality of each OpenVX 1.0 node- Approved Conformant implementations can use the OpenVX trademark

• Open source sample implementation of OpenVX 1.0.1 released


Khronos APIs for Vision ProcessingGPU Compute Shaders (OpenGL 4.X and OpenGL ES 3.1)Pervasively available on almost any mobile device or OSEasy integration into graphics apps – no vision/compute API interop neededProgram in GLSL not CLimited to acceleration on a single GPU

General Purpose Heterogeneous Programming FrameworkFlexible, low-level access to any devices with OpenCL compilerSingle programming and run-time framework for CPUs, GPUs, DSPs, hardwareOpen standard for any device or OS – being used as backed by many languages and frameworksNeeds full compiler stack and IEEE precision

Out of the Box Vision Framework - Operators and graph framework libraryCan run some or all modes on dedicated hardware – no compiler neededHigher-level abstraction means easier performance portability to diverse hardwareGraph optimization opens up possibility of low-power, always-on vision accelerationFixed set of operators – but can be extended

It is possible to use OpenCL or GLSL to build OpenVX Nodes on programmable devices!


Example : Feature tracking Graph

color convert

channel extract

pyramid

optical flow pyrLK

pyr 0

RGB frame

frameYUV

frameGray

Array of keypoints

Image captureAPI

Display API

Pyramids

pyr -‐1 pts 0pts -‐1

pyr_delay ptr_delay


First processing : Keypoint detection

color convert

channel extract

pyramid

pyr 0

RGB image

frameYUV

frameGray


Harriscorner

pyr_delay ptr_delay


Context & Data Objects Creation

// Create the ‘OpenVX world’vx_context context = vxCreateContext();

// Create image objects necessary for the RGB -> Y transformationvx_image frameYUV = vxCreateImage(context, width_, height_, VX_DF_IMAGE_IYUV);vx_image frameGray = vxCreateImage(context, width_, height_, VX_DF_IMAGE_U8);

// Image pyramids for two successive frames are necessary for the computation.// A delay object with 2 slots is created for this purposevx_pyramid pyr_exemplar = vxCreatePyramid(context, 4, VX_SCALE_PYRAMID_HALF,

width_, height_, VX_DF_IMAGE_U8);vx_delay pyr_delay = vxCreateDelay(context, (vx_reference)pyr_exemplar, 2);vxReleasePyramid(&pyr_exemplar);

// Tracked points need to be stored for two successive frames.// A delay object with 2 slots is created for this purposevx_array pts_exemplar = vxCreateArray(context, VX_TYPE_KEYPOINT, 2000);vx_delay pts_delay = vxCreateDelay(context, (vx_reference)pts_exemplar, 2);vxReleaseArray(&pts_exemplar);

pyr 0

frameYUV

pyr_delay


frameGray

ptr_delay


Initial step: Keypoint Detection

// RGB to Y conversionvxuColorConvert(context, frameRGB, frameYUV);vxuChannelExtract (context, frameYUV, VX_CHANNEL_Y, frameGray);

// Keypoint detection : Harris cornervx_float32 strength_thresh = 0.0f;vx_scalar s_strength_thresh = vxCreateScalar(context, VX_TYPE_FLOAT32, &strength_thresh);vx_float32 min_distance = 3.0f;vx_scalar s_min_distance = vxCreateScalar(context, VX_TYPE_FLOAT32, &min_distance);vx_float32 k_sensitivity = 0.04f;vx_scalar s_k_sensitivity = vxCreateScalar(context_ VX_TYPE_FLOAT32, &k_sensitivity);vx_int32 gradientSize = 3;vx_int32 blockSize = 3;

vxuHarrisCorners(context, frameGray, s_strength_thresh, s_min_distance,s_k_sensitivity, gradientSize, blockSize,(vx_array)vxGetReferenceFromDelay(pts_delay, -1), 0 );

// Create the first pyramid needed for optical flowvxuGaussianPyramid(context, frameGray, (vx_pyramid)vxGetReferenceFromDelay(pyr_delay, -1))

;

color convert

channel extract

pyramid

pyr 0

RGB frame

frameYUV

frameGray


Harriscorner

pyr_delay ptr_delay


Feature tracking: Graph Creation

color convert

channel extract

pyramid

optical flow pyrLK

pyr 0

RGB frame

frameYUV

frameGray

Array of keypoints


pyr_delay ptr_delay

vx_graph graph = vxCreateGraph(context);

// RGB to Y conversion nodesvx_node cvt_color_node = vxColorConvertNode(graph, frame, frameYUV);vx_node ch_extract_node = vxChannelExtractNode(graph, frameYUV, VX_CHANNEL_Y,

frameGray);

// Pyramid image nodevx_node pyr_node = vxGaussianPyramidNode(graph, frameGray,

(vx_pyramid) vxGetReferenceFromDelay(pyr_delay, 0));

// Lucas-Kanade optical flow node// Note: keypoints of the previous frame are also given as 'new points estimates'vx_float32 lk_epsilon = 0.01f;vx_scalar s_lk_epsilon = vxCreateScalar(context, VX_TYPE_FLOAT32, &lk_epsilon);vx_uint32 lk_num_iters = 5;vx_scalar s_lk_num_iters = vxCreateScalar(context, VX_TYPE_UINT32, &lk_num_iters);vx_bool lk_use_init_est = vx_false_e;vx_scalar s_lk_use_init_est = vxCreateScalar(context, VX_TYPE_BOOL, &lk_use_init_est);

vx_node opt_flow_node = vxOpticalFlowPyrLKNode(graph,(vx_pyramid) vxGetReferenceFromDelay(pyr_delay, -1),(vx_pyramid) vxGetReferenceFromDelay(pyr_delay, 0),(vx_array) vxGetReferenceFromDelay(pts_delay, -1),(vx_array) vxGetReferenceFromDelay(pts_delay, -1),(vx_array) vxGetReferenceFromDelay(pts_delay, 0),VX_TERM_CRITERIA_BOTH, s_lk_epsilon, s_lk_num_iters,s_lk_use_init_est, 10);

vxReleaseScalar(&s_lk_epsilon);vxReleaseScalar(&s_lk_num_iters);vxReleaseScalar(&s_lk_use_init_est);


Feature tracking: Execution

color convert

channel extract

pyramid

optical flow pyrLK

pyr 0

new_frame

frameYUV

frameGray

Array of keypoints


pyr_delay ptr_delay

// Context & data creation// <…>

// Graph the first image// <…>

// Keypoints detection// <…>

// Graph creation// <…>

// Graph verification (mandatory before executing the graph)vxVerifyGraph(graph);

// MAIN PROCESSING LOOPfor (;;) {

// Grab next frame // <…>

// Set the new graph inputvxSetParameterByIndex(cvt_color_node, 0, (vx_reference)new_frame);

// Process graphvxProcessGraph(graph);

// ‘Age’ pyramid and keypoint delay objects for the next frame processingvxAgeDelay(pyr_delay);vxAgeDelay(pts_delay);

}


OpenVX 1.0.1 extensions• Tiling extension: more efficient processing of graphs with user nodes- Provisional spec released

• XML Schema extension: cross-platform graph saving and loading- Provisional spec released

OpenVX™ User Kernel Tiling Extension Specification

Motivation and Overview

Large external memory

HD Output ImageHD Output Image

Tile-based processing

9/25/15 30

Accelerators

Small local memory

F2F1

HD Input Image200 cycles

1 cycle

• Optimal performance & memory utilization• Full intermediate (purple) image never exists• Tedious programming

F1 F2

DMA Engine

PingPong

OpenVX Codevx_context context = vxCreateContext();

vx_image input = vxCreateImage(context, 640, 480, VX_DF_IMAGE_U8);

vx_image output = vxCreateImage(context, 640, 480, VX_DF_IMAGE_U8);

vx_image intermediate = vxCreateVirtualImage(context, 640, 480, VX_DF_IMAGE_U8);

vx_graph graph = vxCreateGraph(context);

vx_node F1 = vxF1Node(input, intermediate);

vx_node F2 = vxF2Node(intermediate, output);

vxVerifyGraph(graph);

vxProcessGraph(graph);

outputinput F1 F2

context

graph

OpenVX handles the tiling!

inter-‐mediate

OpenVX 1.0 tiling and user kernels• An implementation of OpenVX 1.0 can already do tiled processing with the standard

kernels– The user/programmer just needs to be sure to declare intermediate images as “virtual”– “Virtual” indicates the user will not try to access the intermediate results, so they to not need to be fully allocated/constructed

• User can already create their own kernels per the existing OpenVX 1.0 specification– There is a User Kernel section in the OpenVX 1.0 Advanced Framework API section– But the image data for these user-defined kernels cannot be “tiled”– Note: a “kernel” is analogous to a C++ “class” and a “node” is analogous to an “instance”

The use of kernels versus nodes enables object-oriented programming within the C programming language

• The new User Kernel Tiling Extension is only needed for tiled processing of user-defined kernels– The user/programmer needs to provide additional information about their kernel to enable the OpenVX implementation to properly

decompose the image into tiles and run the user node on these tiles– The User Kernel Tiling Extension defines an API that can be used to provide this additional information

O

The User Kernel Tiling Extension1.The user writes the kernel function to be executed on each tile– The OpenVX runtime will call this function on a specific tile during vxProcessGraph()– The extension defines macros this function can use to determine information about the given tile and its parent image

– E.g., the tile’s height and width, the tile’s (x, y) location in the parent image, and the parent image’s height and width

2.The user adds this new kernel to the OpenVX system via vxAddTilingKernel()– vxAddTilingKernel() takes a name, a pointer to the user’s function, and the number of kernel parameters

3.The user describes each of the kernel’s parameters via vxAddParameterToKernel()– This is the same function used to describe non-tiled user kernel parameters

4.The user tells OpenVX about its pixel-access behavior via vxSetKernelAttribute()– Must set the output block size, input neighborhood size, and border mode

5.The user calls vxFinalizeKernel() to indicate that the kernel description is complete

f

Required user tiling kernel attributes• VX_KERNEL_ATTRIBUTE_OUTPUT_TILE_BLOCK_SIZE– The size of the region the user’s kernel prefers to write on each loop iteration– The OpenVX implementation will ensure that the tile sizes are a multiple of this block size

– Except possibly at the edges of the image

• VX_KERNEL_ATTRIBUTE_INPUT_NEIGHBORHOOD– The “extra” input pixels needed to compute an output block– E.g., a pixelwise function has an input neighborhood of 0 on all sides– A 3x3 filter has a neighborhood of 1, and a 5x5 filter has a neighborhood of 2 (on all sides)

• VX_KERNEL_ATTRIBUTE_BORDER– Indicates whether the kernel function can correctly handle the odd-sized tiles near the edges of the image (VX_BORDER_MODE_SELF) or

not (VX_BORDER_MODE_UNDEFINED)

• Examples:

tileBlocksize = (1, 1)Neighborhood = (0, 0, 0, 0)

e.g., pixelwise add


e.g., 3x3 box filter


e.g., 5x5 box filter


e.g., 4x4 pixelate


e.g., SIMD-‐optimized 5x5 boxthat writes 4 pixels/cycle

Additional optimization• The user may provide two versions of the function for the user kernel• The fast version and the flexible version• The OpenVX implementation will only call the fast function when it’s “safe”

– The tile size is a whole-number multiple of the output tile block size– The input neighborhood doesn’t extend beyond the boundaries of the input image

• The fast version of the function doesn’t have to check any edge conditions– Computes efficiently without conditional checks and branches

• The flexible version needs to make the appropriate checks to handle the edge conditions

• There is a relationship between the fast function, flexible function, and border mode– Read the spec

Fast

Flexible


Applications and Middleware

Tegra K1

CUDA Libraries

VisionWorks Primitives

Classifier Corner Detection

3rd Party

Vision Pipeline Samples

ObjectDetection

3rd Party Pipelines …

…

SLAMVisionWorks Framework

NVIDIA VisionWorks is Integrating OpenVX• VisionWorks library contains diverse vision and imaging primitives• Will leverage OpenVX for optimized primitive execution• Can extend VisionWorks nodes through GPU-accelerated primitives

• Provided with sample library of fully accelerated pipelines

GPU Libraries


Khronos APIs for Augmented Reality

Advanced Camera Control and stream

generation

3D Rendering and Video Composition

On GPU

AudioRendering

Application on CPUs, GPUs

and DSPs

SensorFusion

VisionProcessing

MEMSSensors

EGLStream -stream data

between APIs

Precision timestamps on all sensor samples

AR needs not just advanced sensor processing, vision acceleration, computation and rendering - but also for

all these subsystems to work efficiently together


Summary• Khronos is building a trio of interoperating APIs for portable / power-efficient

vision and sensor processing• OpenVX 1.0 specification is now finalized and released- Full conformance tests and Adopters program immediately available- Khronos open source sample implementation by end of 2014- First commercial implementations already close to shipping

• Any company is welcome to join Khronos to influence the direction of mobile and embedded vision processing!- $15K annual membership fee for access to all Khronos API working groups- Well-defined IP framework protects your IP and conformant implementations

• More Information- www.khronos.org- [email protected] @neilt3d


Background Material


Need for Camera Control API - OpenKCAM• Advanced control of ISP and camera subsystem – with cross-platform portability- Generate sophisticated image stream for advanced imaging & vision apps

• No platform API currently fulfills all developer requirements- Portable access to growing sensor diversity: e.g. depth sensors and sensor arrays- Cross sensor synch: e.g. synch of camera and MEMS sensors- Advanced, high-frequency per-frame burst control of camera/sensor: e.g. ROI- Multiple input, output re-circulating streams with RAW, Bayer or YUV Processing

Image Signal Processor (ISP)

Image/VisionApplications

Defines control of Sensor, Color Filter ArrayLens, Flash, Focus, Aperture

Auto Exposure (AE)Auto White Balance (AWB)

Auto Focus (AF)

EGLStreams


OpenKCAM is FCAM-based• FCAM (2010) Stanford/Nokia, open source• Capture stream of camera images with precision control- A pipeline that converts requests into image stream- All parameters packed into the requests - no visible state- Programmer has full control over sensor settings for each frame in stream

• Control over focus and flash- No hidden daemon running

• Control ISP- Can access supplemental

statistics from ISP if available

• No global state- State travels with image requests- Every pipeline stage may have different state- Enables fast, deterministic state changes

Khronos coordinating with MIPI on camera control and

data formats


Sensor Industry Fragmentation …


Sensor Data Types• Raw sensor data- Acceleration, Magnetic Field, Angular Rates- Pressure, Ambient Light, Proximity, Temperature, Humidity, RGB light, UV light- Heart rate, Blood Oxygen Level, Skin Hydration, Breathalyzer

• Fused sensor data- Orientation (Quaternion or Euler Angles)- Gravity, Linear Acceleration, Position

• Contextual awareness- Device Motion: general movement of the device: still, free-fall, …- Carry: how the device is being held by a user: in pocket, in hand, …- Posture: how the body holding the device is positioned: standing, sitting, step, …- Transport: about the environment around the device: in elevator, in car, …


Low-level Sensor Abstraction API

Apps Need SophisticatedAccess to Sensor DataWithout coding to specific

sensor hardware

Apps request semantic sensor informationStreamInput defines possible requests, e.g.

Read Physical or Virtual Sensors e.g. “Game Quaternion” Context detection e.g. “Am I in an elevator?”

StreamInput processing graph provides optimized sensor data stream

High-value, smart sensor fusion middleware can connect to apps in a portable way

Apps can gain ‘magical’ situational awareness

Advanced Sensors EverywhereMulti-axis motion/position, quaternions,

context-awareness, gestures, activity monitoring, health and environmental sensorsSensor Discoverability

Sensor Code Portability

Presentations & Public Speaking

Виктор Ерухимов Open VX mixar moscow sept15