"The Evolution of Object Recognition in Embedded Systems," a Presentation from CEVA

Moshe Shahar

12 May 2015


Timeline of CEVA’s Feature Extraction and Classification Algorithms

[Figure: timeline chart. Axes: "count scale" and Time (2012 to 2015). Labels: HARRIS, 3D Objects (KD-Tree), CEVA-MM3101, CEVA-XM4]

• Find peak responses over scale in Laplacian pyramid

• Find response with sub-pixel accuracy

• Only keep “corner-like” responses

• Assign orientation

• Create recognition signature

• (Patent protected)

SIFT—Scale Invariant Feature Transform

Performance estimates were 30-50 MCycles per VGA frame

[Figure: scale-space pyramid per octave, showing the Gaussian and Difference of Gaussian images]
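The first SIFT step listed above (peak responses over scale) can be sketched on a small image. This is a minimal illustration in plain Python, not CEVA's implementation: the function names, the sigma values and the clamped border handling are assumptions, and sub-pixel refinement, corner filtering, orientation and the descriptor are left out.

```python
import math

def gaussian_kernel(sigma):
    """Normalized 1D Gaussian kernel with radius ceil(3 * sigma)."""
    radius = max(1, math.ceil(3 * sigma))
    weights = [math.exp(-(i * i) / (2 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def blur(image, sigma):
    """Separable Gaussian blur with clamped (replicated) borders."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    height, width = len(image), len(image[0])

    def clamp(v, size):
        return min(max(v, 0), size - 1)

    # Horizontal pass, then vertical pass.
    rows = [[sum(kernel[k + radius] * image[y][clamp(x + k, width)]
                 for k in range(-radius, radius + 1))
             for x in range(width)] for y in range(height)]
    return [[sum(kernel[k + radius] * rows[clamp(y + k, height)][x]
                 for k in range(-radius, radius + 1))
             for x in range(width)] for y in range(height)]

def dog_stack(image, sigmas):
    """Difference of Gaussian levels: coarser blur minus the next finer blur."""
    blurred = [blur(image, s) for s in sigmas]
    return [[[b[y][x] - a[y][x] for x in range(len(image[0]))]
             for y in range(len(image))]
            for a, b in zip(blurred, blurred[1:])]

def scale_space_extrema(dog):
    """Return (level, y, x) for values strictly above or below all 26 neighbours."""
    found = []
    for s in range(1, len(dog) - 1):
        for y in range(1, len(dog[0]) - 1):
            for x in range(1, len(dog[0][0]) - 1):
                centre = dog[s][y][x]
                neighbours = [dog[s + ds][y + dy][x + dx]
                              for ds in (-1, 0, 1)
                              for dy in (-1, 0, 1)
                              for dx in (-1, 0, 1)
                              if (ds, dy, dx) != (0, 0, 0)]
                if centre > max(neighbours) or centre < min(neighbours):
                    found.append((s, y, x))
    return found
```

Calling `scale_space_extrema(dog_stack(image, [1.0, 1.6, 2.56, 4.1]))` on a grayscale image stored as a list of rows returns the candidate keypoints. The 3x3x3 neighbourhood test is the part that makes the search data dependent and therefore awkward to vectorize.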

• SIFT is very accurate but very complex to implement on a programmable platform

• SURF includes fast implementation and fast descriptor matching

• Invariant to various image transforms like rotation and illumination changes

• Still widely used

• Partial patent protection

SURF—Speeded Up Robust Features

[Figure: SURF processing flow in three stages. Detector: Integral Image, Response Map Calculation, Maxima 3x3x3 + interpolate, Detected Interest Points. Sort: Sort according to levels and regions, Sorted Interest Points. Descriptor: Integral sum, Build 64 sum Descriptor, Descriptor DB]
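The SURF flow starts from an integral image, which is what makes its box-filter responses cheap. A minimal sketch of that building block follows, using plain Python lists; the function names are illustrative and the padded table layout (one extra zero row and column) is an assumption chosen to keep the lookup free of special cases.

```python
def integral_image(image):
    """Return a (H+1) x (W+1) summed-area table with a zero first row and column."""
    height, width = len(image), len(image[0])
    table = [[0] * (width + 1) for _ in range(height + 1)]
    for y in range(height):
        row_sum = 0
        for x in range(width):
            row_sum += image[y][x]
            table[y + 1][x + 1] = table[y][x + 1] + row_sum
    return table

def box_sum(table, top, left, bottom, right):
    """Sum of the pixels in rows top..bottom-1 and columns left..right-1."""
    return (table[bottom][right] - table[top][right]
            - table[bottom][left] + table[top][left])
```

Any rectangular sum costs four lookups regardless of the box size, so larger filters at coarser scales cost the same as small ones.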

HOG—Histogram of Gradients

[Figure: HOG pipeline. Input image, Bilinear Scaling into scaled images (Scale 1 through Scale 9), Normalization, Gradient Calculation, Descriptor Calculation]

HOG algorithm is based on Dalal & Triggs paper (2005)

Common use is object detection, especially pedestrian detection

Reference code: OpenCV 2.4.3
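The gradient and descriptor stages can be illustrated with a single-cell histogram. This is a simplified sketch, not the OpenCV reference code: it uses hard binning (no vote interpolation between bins), skips block normalization, and the function name and the 9-bin default are assumptions made for the example.

```python
import math

def cell_histogram(cell, bins=9):
    """Magnitude-weighted histogram of unsigned gradient orientation (0..180 degrees)."""
    height, width = len(cell), len(cell[0])
    hist = [0.0] * bins
    bin_width = 180.0 / bins
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = cell[y][x + 1] - cell[y][x - 1]   # centred [-1, 0, 1] derivative
            gy = cell[y + 1][x] - cell[y - 1][x]
            magnitude = math.hypot(gx, gy)
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            # The extra modulo guards against an angle that lands exactly on 180.
            hist[int(angle // bin_width) % bins] += magnitude
    return hist
```

A vertical edge puts all of its gradient energy into the first bin, and a horizontal edge puts it into the middle bin, which is the property a detector trained on these histograms relies on.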

1. Load 2 vectors in a single cycle

2. Perform multiple operations in single

3. Store a transposed rectangle of 4x4 pixels in a single cycle

4. Perform the load and filter again

5. Store 4x4 transposed to memory in a single cycle

HOG—Bilinear Scaling

[Figure: data flow between Memory and Vector Registers (vA, vB, vC, vD) through the filter and transpose steps]
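For reference, the arithmetic that the vectorized load, filter and transpose sequence has to produce is ordinary bilinear interpolation. The scalar sketch below shows only that arithmetic; the function name and the corner-aligned sampling convention are assumptions, and it says nothing about how the vector unit schedules the work.

```python
def bilinear_resize(image, out_height, out_width):
    """Resize a 2D list of numbers with bilinear interpolation (corner-aligned)."""
    in_height, in_width = len(image), len(image[0])
    y_step = (in_height - 1) / (out_height - 1) if out_height > 1 else 0.0
    x_step = (in_width - 1) / (out_width - 1) if out_width > 1 else 0.0
    out = []
    for j in range(out_height):
        fy = j * y_step
        y0 = min(int(fy), in_height - 1)
        y1 = min(y0 + 1, in_height - 1)
        wy = fy - y0
        row = []
        for i in range(out_width):
            fx = i * x_step
            x0 = min(int(fx), in_width - 1)
            x1 = min(x0 + 1, in_width - 1)
            wx = fx - x0
            # Interpolate horizontally on the two source rows, then vertically.
            top = image[y0][x0] * (1 - wx) + image[y0][x1] * wx
            bottom = image[y1][x0] * (1 - wx) + image[y1][x1] * wx
            row.append(top * (1 - wy) + bottom * wy)
        out.append(row)
    return out
```

Because the filter is separable (a horizontal pass followed by a vertical pass), a vector implementation can run the same one-dimensional filter twice with a transpose in between, which matches the load, filter, transpose, filter, store sequence in the steps above.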

• Implemented using ‘Look Up Table’ (LUT): N-way parallel access to local memory in one cycle

• Parallel load mechanism: load N gamma values in a cycle

HOG—Gamma Normalization

Memory Map
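The LUT approach to gamma normalization can be sketched as follows. The table is computed once and each pixel then costs a single lookup. The function names are illustrative, and the default exponent of 0.5 (square-root compression) is an assumption for the example rather than a value given on the slide.

```python
def build_gamma_lut(gamma=0.5, levels=256):
    """Precompute the gamma-corrected output for every possible input level."""
    top = levels - 1
    return [int(round(top * (v / top) ** gamma)) for v in range(levels)]

def apply_lut(image, lut):
    """Replace every pixel by its table entry (one lookup per pixel)."""
    return [[lut[p] for p in row] for row in image]
```

On a vector processor the inner list comprehension of `apply_lut` corresponds to the N-way parallel lookup: N pixel values are used as N independent addresses into local memory in the same cycle.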

ORB—Oriented FAST and Rotated BRIEF

• An efficient alternative to SIFT (and patent free)

• Pyramid is used for scale-invariance

• Features are detected using FAST9, Harris and non-maximum suppression

• Descriptors are based on BRIEF with normalized orientation

ORB—Feature Extraction

[Figure: ORB feature extraction pipeline. Block labels: Image, Pyramid, FAST9, Harris, Non-Max-Suppress, Oriented Descriptors]

ORB—FAST9 Implementation

Continuous arc of 9 or more pixels:

All much brighter than (p+Th), or

All much darker than (p-Th)
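The segment test described above can be written directly as a scalar reference. This sketch assumes the usual FAST layout of 16 pixels on a radius-3 circle around the candidate; the names `CIRCLE`, `longest_run` and `is_fast9_corner` are illustrative, and the caller is assumed to keep at least 3 pixels of margin from the image border.

```python
# (dx, dy) offsets of the 16 pixels on a radius-3 circle, in circular order.
CIRCLE = [(0, -3), (1, -3), (2, -2), (3, -1), (3, 0), (3, 1), (2, 2), (1, 3),
          (0, 3), (-1, 3), (-2, 2), (-3, 1), (-3, 0), (-3, -1), (-2, -2), (-1, -3)]

def longest_run(flags):
    """Length of the longest run of True values on a circular list."""
    if all(flags):
        return len(flags)
    best = run = 0
    for flag in flags + flags:      # doubling the list handles wrap-around runs
        run = run + 1 if flag else 0
        best = max(best, run)
    return best

def is_fast9_corner(image, x, y, threshold):
    """True if 9 or more contiguous circle pixels are all brighter or all darker."""
    p = image[y][x]
    ring = [image[y + dy][x + dx] for dx, dy in CIRCLE]
    brighter = [v > p + threshold for v in ring]
    darker = [v < p - threshold for v in ring]
    return longest_run(brighter) >= 9 or longest_run(darker) >= 9
```

A pixel at the tip of a bright quadrant passes the test (11 contiguous darker pixels), while a pixel on a straight edge does not (only 7), which is why the test responds to corners rather than edges.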

ORB—FAST9 Implementation

• Early exit is used to detect potential positions

• Long memory access of 32 bytes to quickly load consecutive pixels

• Vector compare is used to compare the center of the corner to the borders

• Building a binary (bit) map with positions that need to be calculated

• Calculation of N positions in parallel

• Using different two-dimensional loads

• Vector predicates are used to selectively calculate only the locations that pass the threshold

• Use N-way parallel lookup
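The early-exit idea and the candidate bit map can be sketched with a cheap pre-test. Any arc of 9 contiguous pixels on the 16-pixel circle must contain at least 2 of the 4 compass pixels (top, right, bottom, left), so a position with fewer than 2 brighter and fewer than 2 darker compass pixels cannot be a FAST9 corner. The names below are illustrative and this particular pre-test is an assumption; the slides do not say which early-exit rule is used.

```python
# (dx, dy) offsets of the four compass pixels on the radius-3 circle.
COMPASS = [(0, -3), (3, 0), (0, 3), (-3, 0)]

def candidate_mask(image, threshold):
    """Return a 2D list of booleans marking pixels that survive the quick test."""
    height, width = len(image), len(image[0])
    mask = [[False] * width for _ in range(height)]
    for y in range(3, height - 3):
        for x in range(3, width - 3):
            p = image[y][x]
            ring = [image[y + dy][x + dx] for dx, dy in COMPASS]
            brighter = sum(v > p + threshold for v in ring)
            darker = sum(v < p - threshold for v in ring)
            # Fewer than 2 hits on both sides means no 9-pixel arc is possible.
            mask[y][x] = brighter >= 2 or darker >= 2
    return mask
```

The mask plays the role of the bit map: only positions marked True need the full 16-pixel test, and on a vector machine the same mask can drive per-element predicates so that rejected positions cost no further work.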

• More MACs per cycle

• Complex multi-operation instructions

• Wider bandwidth to local memory (while conserving power)

• More orthogonal memory accesses

• Supporting ISA to allow conditional execution per vector element

• More operations per loaded bit

• Higher performance per mW

• Improve local data reuse

Vision Processor Evolution

[Figure labels: Computing power, “Multi Scalar”, Improve efficiency]

• Ability to process 2D data efficiently is critical for many algorithms and especially for convolutions

• Idea is to take advantage of the pixel overlap in image processing by reusing same data to produce multiple outputs

• Significantly increases processing capability

• Saves external memory bandwidth and frees system buses for other tasks

• Reduces power consumption

2-Dimension Data Processing Capability

Input Image

For 16 MACs with 512 bits of bandwidth, only 176 bits are actually loaded
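The saving from pixel overlap can be checked with a small count. The sketch below compares how many input pixels a block of outputs needs when every output fetches its own window against the number of distinct pixels the block actually touches. The block and filter sizes are hypothetical, and the example does not attempt to reproduce the 512-bit and 176-bit figures above, whose exact configuration is not given.

```python
def loads_without_reuse(out_w, out_h, k_w, k_h):
    """Every output fetches its own k_w x k_h window."""
    return out_w * out_h * k_w * k_h

def loads_with_reuse(out_w, out_h, k_w, k_h):
    """Count the distinct input pixels touched by the whole output block."""
    touched = {(x + i, y + j)
               for y in range(out_h) for x in range(out_w)
               for j in range(k_h) for i in range(k_w)}
    return len(touched)
```

For a 4x4 block of outputs and a 3x3 filter the two counts are 144 and 36, a factor of four fewer loads for the same arithmetic. The distinct-pixel count always equals `(out_w + k_w - 1) * (out_h + k_h - 1)`.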

• Convolutional neural network (CNN)

• A deep learning neural network algorithm

• Used for classification, localization and detection of objects in images

• CNN value

1. Best recognition quality

2. Re-trainable to any object without code change

• CNN combines 2D convolutions, 2D max and 1D MAC operations

• Good match for vector DSPs

Vector Accelerated Deep Neural Network

[Figure: CNN layer diagram. Labels: Convolution, Subsampling, Subsampling, Fully Connected]

• The algorithm makes it possible to exploit the overlapping convolutions for efficient processing

• Executing one or several filters in parallel on the same input is ideal for using a 2-dimensional data processing capability

CNN—2D Convolution Layer

Convolution layer
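Running several filters over the same input can be sketched as below. Each input window is read once and then reused for every filter, which is the reuse pattern the bullets describe. The function name is illustrative; as in most CNN code the operation is cross-correlation (the kernel is not flipped), only the "valid" region is computed, and all kernels are assumed to have the same size.

```python
def conv2d_valid(image, kernels):
    """Apply every kernel to the same input and return one output map per kernel."""
    k_h, k_w = len(kernels[0]), len(kernels[0][0])
    out_h = len(image) - k_h + 1
    out_w = len(image[0]) - k_w + 1
    flat_kernels = [[w for row in kernel for w in row] for kernel in kernels]
    outputs = [[[0] * out_w for _ in range(out_h)] for _ in kernels]
    for y in range(out_h):
        for x in range(out_w):
            # Load the window once, then reuse it for every filter.
            window = [image[y + j][x + i] for j in range(k_h) for i in range(k_w)]
            for f, weights in enumerate(flat_kernels):
                outputs[f][y][x] = sum(p * w for p, w in zip(window, weights))
    return outputs
```

With `image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]` and the two 2x2 kernels `[[1, 0], [0, 1]]` and `[[1, 1], [1, 1]]`, the result is `[[[6, 8], [12, 14]], [[12, 16], [24, 28]]]`.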

• Subsampling stage: max filter operation used to find the strongest response on an MxN patch from the previous layer, reducing the scale in each axis

• Example processor capability: calculate MxN max filter using vector max on 3 input vectors, 16 elements each, 16 bits per element

CNN—Pooling Layer

Pooling layer
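The subsampling stage can be sketched as non-overlapping max pooling. The function name is illustrative, and dropping any ragged rows or columns at the edges is an assumption made to keep the example short.

```python
def max_pool(feature_map, pool_h, pool_w):
    """Keep the strongest response in each pool_h x pool_w patch."""
    out_h = len(feature_map) // pool_h
    out_w = len(feature_map[0]) // pool_w
    return [[max(feature_map[y * pool_h + j][x * pool_w + i]
                 for j in range(pool_h) for i in range(pool_w))
             for x in range(out_w)]
            for y in range(out_h)]
```

For example, 2x2 pooling of the 4x4 map with rows `[1, 2, 3, 4]`, `[5, 6, 7, 8]`, `[9, 10, 11, 12]`, `[13, 14, 15, 16]` gives `[[6, 8], [14, 16]]`, halving the size along each axis.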

• Includes many multiplications with different weights accumulated to a single result

• Requires high accumulation precision and a large number of MAC operations

• Ideal for vector processor

CNN—Fully Connected Layer

Fully Connected
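A fully connected layer reduces to one long multiply-accumulate chain per output. The sketch below uses Python integers, which never overflow, so the accumulator width that matters on real hardware does not show up here. The function name and the bias term are assumptions for the example.

```python
def fully_connected(inputs, weights, biases):
    """Return one accumulated dot product per output neuron."""
    outputs = []
    for row, bias in zip(weights, biases):
        acc = bias                      # stands in for a wide hardware accumulator
        for x, w in zip(inputs, row):
            acc += x * w                # one multiply-accumulate (MAC)
        outputs.append(acc)
    return outputs
```

For instance, `fully_connected([1, 2, 3], [[1, 0, -1], [2, 2, 2]], [0, 1])` returns `[-2, 13]`. On the precision point: assuming signed 16-bit inputs and weights, each product fits in 32 bits, and summing N products can need roughly log2(N) additional bits, which is why the accumulator has to be wider than the operands.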

• CNN: the new king of the block

• Will dominate object recognition once real-time is possible

• Allows a lot of algorithmic freedom within the implementation

• Ideal for programmable or accelerator+processor solutions

• 3D becoming more widely used

• 3D object detection, classification and recognition will evolve rapidly

• No clear winner; each vendor has its own flow

• Dominant database seems to be KD-Tree, very serial in nature

• Rapid developments, new innovations coming soon…

What Do We See as The Next Trends?

Timeline of CEVA’s Feature Extraction and Classification Algorithms

[Figure: timeline chart. Axes: "count scale" and Time (2012 to 2016). Annotations: Parallel access, reuse, Fast Filters, 64 bit, Filters, Filters+, Parallel access+]

• 4th-generation imaging and vision processor IP

• Vector-type processor; combines fixed- and floating-point math; up to 4096-bit processing per cycle

• Platform includes vision processor, libraries, tools and applications

Introducing CEVA-XM4™

Come see us at our booth for real-time demos.