

High-Performance Power-Efficient Solutions for Embedded Vision Computing

PhD dissertation of Hamed Tabkhi; PhD adviser: Prof. Gunar Schirner
Department of Electrical and Computer Engineering,
Northeastern University, Boston (MA), USA
{tabkhi, schirner}@ece.neu.edu

Challenges of Embedded Vision

Traffic Separation

C) Communication-Centric Arch. Template

A) Streaming vs Algorithm-Intrinsic

Insight: Not all traffic is equal!

Streaming:
- Input/output stream (independent of algorithm selection)

Algorithm-intrinsic:
- Generated by the algorithm itself (algorithm dependent)

Function-Level Processor

F) Experimental Results

A) Flexibility/Efficiency

A) Embedded Vision

Application areas
- Advanced Driver Assistance Systems (ADAS)

- Security / video surveillance

- Robotics

Rapidly growing
- ADAS alone: 13x over 5 years

- 2011: $10B -> 2016: $130B

B) Market Requirements
- Complex, adaptive advanced algorithms

- Diversity of scenes (e.g. indoor, outdoor)

- High res. (1080p) and rate (60fps)

- Significant computation (~50 GOPS)

- Huge bandwidth (~10 GB/s)

- Very low power (~ 1 Watt)

E) Current Approaches

HW solutions exist for filters; mid-processing is stuck in SW
- Flexible, but inefficient

SW cannot handle adaptive algorithms!
- Inefficient execution in SW
- Cannot handle heavy traffic
- Low resolution / quality

Problem:
(1) How to realize an individual adaptive vision algorithm?
(2) How to construct a single, larger vision flow in a platform?
(3) How to support many vision flows on the same platform?

Contributions:

1) Traffic Separation (addresses prob. 1 & 2)

- Manages traffic of adaptive algorithms

- Simplifies chaining of vision algorithms

2) Function-Level Processor (addresses prob. 3)
- Offers function-level flexibility with efficiency close to custom HW

C) Coarse-Grained Vision Pipeline

Pre-Processing (vision filters)

- High but regular compute, limited traffic

Mid-Processing (adaptive)

- High compute, high traffic

Post-Processing (intelligent / control)

- Limited compute / traffic


[Figure: communication-centric architecture template. The computation clock domain holds the adaptive vision algorithm with its precision adjustment; the communication clock domain holds the control unit (CU), read/write DMAs, and the operational stream interconnect to system memory through input/output interfaces; asynchronous FIFOs bridge the two domains, carrying the input stream, output stream, and algorithm-intrinsic data.]

Architecture support for traffic separation
- Streaming clock domain (computation)
  - Algorithm execution
  - Autonomous quality adjustment
- Operational clock domain (communication)
  - Dedicated DMAs
  - Stream access to memory
- Asynchronous FIFOs bridging the clock domains
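The clock-domain split above can be modeled in software: a bounded queue stands in for each asynchronous FIFO between the communication (DMA) domain and the computation domain. This is a minimal sketch; the function names and the doubling "algorithm" are illustrative stand-ins, not the dissertation's implementation.

```python
import threading
import queue

def dma_reader(pixels, fifo_in):
    """Communication domain: streams input pixels into the async FIFO."""
    for p in pixels:
        fifo_in.put(p)          # blocks when the FIFO is full
    fifo_in.put(None)           # end-of-stream marker

def compute(fifo_in, fifo_out):
    """Computation domain: consumes the stream, runs the algorithm."""
    while True:
        p = fifo_in.get()
        if p is None:
            fifo_out.put(None)
            break
        fifo_out.put(p * 2)     # stand-in for the vision algorithm

def run_pipeline(pixels, depth=4):
    """Bounded queues model the asynchronous FIFOs between domains."""
    fifo_in = queue.Queue(maxsize=depth)
    fifo_out = queue.Queue(maxsize=depth)
    out = []
    t1 = threading.Thread(target=dma_reader, args=(pixels, fifo_in))
    t2 = threading.Thread(target=compute, args=(fifo_in, fifo_out))
    t1.start(); t2.start()
    while True:                  # write-DMA side: drain results
        r = fifo_out.get()
        if r is None:
            break
        out.append(r)
    t1.join(); t2.join()
    return out
```

The bounded `maxsize` mirrors the back-pressure a real FIFO exerts when the two domains run at different rates.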

D) System-Level Benefits

E) Vision SoC Solution on Zynq

[Figure: object-tracking vision flow: the pixel stream enters MoG background subtraction (maintaining Gaussian parameters) to produce the FG mask; component labeling turns the FG mask and FG labels into object labels; mean-shift object tracking consumes the object stream and object histograms to produce new positions.]

[Figure: Zynq mapping: in the programmable logic, the HDMI input and packing unit (HDMI clock domain) feed, over asynchronous FIFOs, the HDMI2Gray, smoothing, MoG, morphology (erosion/dilation), and video-overlay blocks on AXI buses 0/1 (AXI clock domain); AXI video DMAs 0 and 1 stream frames and the FG mask through the memory controller; object detection (component labeling) and object tracking (mean-shift) run on the processor subsystem; an unpacking unit and the HDMI output return to the HDMI clock domain.]

[Figure: each vision algorithm exposes a stream-in/stream-out interface plus algorithm-intrinsic data (the scene model); example frames contrast the original scene with the extracted foreground (FG).]


[Figure: chained deployment: Vision Algo 0, 1, and 2 sit between System In and System Out; each algorithm's intrinsic data passes through precision adjustment, asynchronous FIFOs, and a DMA onto AXI (AXI clk, AXI_data) toward system memory, alongside the host processor and the FLP-PVP with its cache and DMA configuration.]

B) Programming Abstraction

C) Function-Set Architecture

[Figure: FLP-PVP deployment: DMA-fed function blocks such as a low-pass filter (convolution) and color/illumination extraction operate alongside multiple ILP-BFDSP cores, each with its own cache.]

E) System-Level Integration

[Charts: ILP vs ILP+ACC vs FLP, split into communication and computation: operations [GOPs] (axis 0-24), number of ILP cores (axis 0-12), off-chip traffic [GB/s] (axis 0-1.6), and power [W] (axis 0-3).]

D) Adaptive Vision Algorithms

Complex scene analysis
- Track multiple objects

Machine-learning principles
- Keep a model of the scene
- e.g. MoG background subtraction, optical flow, SVM

Observation: Not all traffic is equal!

Algorithm-intrinsic traffic dominates

- 60x in MoG, 20x in Optical flow

Streaming: fixed, algorithm-intrinsic: adjustable

Observed traffic separation in:

• Mixture of Gaussians (MoG)
• Kanade-Lucas-Tomasi (KLT) optical flow
• Component labeling
• Mean-shift object tracker
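The MoG scene model above can be sketched per pixel. This is a simplified Stauffer/Grimson-style update, not the dissertation's exact formulation (here any matched Gaussian counts as background); the comments note where the algorithm-intrinsic parameter traffic arises.

```python
# Simplified per-pixel Mixture-of-Gaussians update. Note the traffic pattern:
# each 1-byte input pixel drags K * (weight, mean, variance) 32-bit parameters
# through memory twice (read + write) -- the algorithm-intrinsic traffic that
# dominates the streaming pixel traffic.

K = 3            # Gaussians per pixel
ALPHA = 0.05     # learning rate
MATCH = 2.5      # match threshold, in standard deviations

def mog_update(pixel, model):
    """model: list of K [weight, mean, var] entries (read from memory).
    Returns (is_foreground, updated model) (written back to memory)."""
    matched = None
    for g in model:
        w, mean, var = g
        if (pixel - mean) ** 2 <= (MATCH ** 2) * var:   # match test
            matched = g
            break
    if matched is not None:
        w, mean, var = matched
        matched[0] = w + ALPHA * (1 - w)                # grow weight
        matched[1] = mean + ALPHA * (pixel - mean)      # pull mean
        matched[2] = var + ALPHA * ((pixel - mean) ** 2 - var)
        is_fg = False                                   # matched: background
    else:
        # replace the weakest Gaussian with one centered on the pixel
        weakest = min(range(K), key=lambda i: model[i][0])
        model[weakest] = [ALPHA, float(pixel), 30.0]
        is_fg = True
    total = sum(g[0] for g in model)                    # renormalize weights
    for g in model:
        g[0] /= total
    return is_fg, model
```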

B) Optimization: Compression for Algorithm-Intrinsic Data

[Figure: MoG background subtraction with precision adjustment on its parameter traffic: outgoing 32-bit parameters pass a 32-bit-to-N-bit stage that keeps only the N most significant bits (MSBs); incoming parameters pass an N-bit-to-32-bit stage that restores the word with zeroed LSBs (00...0); the pixel stream passes through unchanged.]

Precision adjustment on algorithm-intrinsic data access
- Bandwidth/quality trade-off

- Pareto front over the trade-off points
- Quality evaluated with MS-SSIM

- Significant bandwidth reduction in MoG

- Simple scene: 63%

- Medium scene: 59%

- Complex scene: 56%

- Same trade-off observed for optical flow
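The N-bit adjustment can be sketched directly: plain MSB truncation on the way out to memory, restoration with zeroed LSBs on the way back, as described in section B. The function names are hypothetical.

```python
# Precision adjustment on algorithm-intrinsic data: keep only the N most
# significant bits of each 32-bit parameter word (bandwidth reduction),
# restore with 00...0 in the LSBs on read-back (quality loss bounded by
# the dropped LSBs). Illustrative sketch of the scheme in section B.

def compress(word32, n):
    """32-bit word -> N-bit word: keep the N MSBs."""
    return (word32 & 0xFFFFFFFF) >> (32 - n)

def decompress(wordn, n):
    """N-bit word -> 32-bit word: restore with zeroed LSBs."""
    return (wordn << (32 - n)) & 0xFFFFFFFF

def roundtrip_error(word32, n):
    """Quantization error introduced by the bandwidth reduction."""
    return word32 - decompress(compress(word32, n), n)
```

With n = 16, parameter bandwidth halves and the per-word error stays below 2^16, which is the bandwidth/quality knob behind the Pareto front above.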

[Chart: memory bandwidth [GB/s] (axis 0-8) for complex, medium, and simple scenes, comparing original vs tuned parameters.]

Pipeline construction of multiple vision algorithms
- Streaming data: point-to-point connections, hidden from memory
- Algorithm-intrinsic data: routed to the communication interface, with dedicated precision adjustment
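The point-to-point chaining can be sketched with generators: the pixel stream flows stage to stage without ever touching "memory". The two stages are toy stand-ins for real vision filters, not the dissertation's blocks.

```python
# Point-to-point pipeline construction: each stage consumes the previous
# stage's stream directly; only algorithm-intrinsic state would go through
# DMA/memory in hardware. Stage functions are hypothetical stand-ins.

def smooth(stream):
    """Toy 1D smoothing: average each pixel with its predecessor."""
    prev = None
    for p in stream:
        yield p if prev is None else (prev + p) // 2
        prev = p

def threshold(stream, t=128):
    """Toy foreground mask: 1 where the pixel exceeds the threshold."""
    for p in stream:
        yield 1 if p >= t else 0

def build_flow(source, *stages):
    """Chain stages point-to-point: output of one feeds the next."""
    stream = iter(source)
    for stage in stages:
        stream = stage(stream)
    return stream

mask = list(build_flow([100, 200, 60], smooth, threshold))
```

Because each connection is a generator, no intermediate frame is materialized, which mirrors the "hidden from memory" streaming connections above.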

[Figure: implemented flow: HDMI in -> Smoothing (CNV) -> MoG -> Morphology (HW) -> Component Labeling -> Histogram Checking (SW) -> Video Overlay (HW) -> HDMI out.]

Object tracking vision flow
- Smoothing
  - 1x CNV on 8-bit data, 5x5 window
- Mixture of Gaussians
- Morphology: dilation, erosion, erosion
  - 3x CNVs on 1-bit data, 15x15 window

- Component labeling

- Histogram checking

- Video Overlay
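The 1-bit morphology stages can be sketched as min/max window filters. This is a simplified version (square window, zero padding, 3x3 default instead of the poster's 15x15), with the dilation-erosion-erosion order taken from the flow above.

```python
# Binary morphology on the FG mask: erosion keeps a pixel only if its whole
# window is foreground (min filter); dilation keeps it if any window pixel
# is foreground (max filter). In HW these are realized as 1-bit convolutions.

def _window(mask, r, c, k):
    """Yield the k x k neighborhood of (r, c), zero-padded at borders."""
    h, w = len(mask), len(mask[0])
    half = k // 2
    for dr in range(-half, half + 1):
        for dc in range(-half, half + 1):
            rr, cc = r + dr, c + dc
            yield mask[rr][cc] if 0 <= rr < h and 0 <= cc < w else 0

def erode(mask, k=3):
    return [[min(_window(mask, r, c, k)) for c in range(len(mask[0]))]
            for r in range(len(mask))]

def dilate(mask, k=3):
    return [[max(_window(mask, r, c, k)) for c in range(len(mask[0]))]
            for r in range(len(mask))]

def morphology(mask, k=3):
    """Dilation followed by two erosions, as listed in the flow."""
    return erode(erode(dilate(mask, k), k), k)
```

Applied to the MoG foreground mask, this removes speckle noise before component labeling.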

Implementation results
- 1080p at 30 Hz, or 768p at 60 Hz
- Limitation: on-chip memory
- Great performance / power efficiency
  - 40 GOPs at 1.7 Watt
- 30x faster than SW-only execution on a desktop machine

Instruction-Level Processors (ILPs)
- High flexibility, low efficiency

Custom HW Accelerators (HWACCs)
- Low flexibility, high efficiency

[Chart: efficiency [GOPs/Watt] vs flexibility: custom HWACCs sit at high efficiency but application-level (low) flexibility; ILPs (control processors, DSPs, GPUs) offer instruction-level flexibility at low efficiency; the FLP targets the gap at function-level granularity.]

Insight: Mismatch in granularity
- Programming granularity: how to compose a program?
- Architecture granularity: how to execute a program?

[Figure: matching programming and architecture abstraction: at instruction granularity, programs composed of Add/Sub/For need a compiler to map onto the architecture; at function granularity, programs composed of Filter/CNV/Sort map directly onto a function-level architecture.]

Function-Level Processor
- Matches abstractions at function-level granularity
- Architecture for function-level programming
- Increases efficiency
- Maintains flexibility
- Simplifies application composition

[Figure: applications of a market (domain) as compositions over a shared function set: e.g. one application uses Functions A, B, E, I; another B, D, E, F, H; others C, G, J, K; D, H, I, J, K; and B, E, F, J. Together the domain's applications draw on the common function set {A..K}.]

Streaming applications are composed of functions
- e.g. OpenCV, OpenSDR

Requirement:

- Compute inside FLP as much as possible

- Identify common functionality and composition rules
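The function-set idea can be sketched as set coverage: each application is a composition drawn from the shared set, and an FLP providing that set can be programmed for any of them. The function and application names below are hypothetical placeholders.

```python
# Function-set architecture sketch: the applications of a market (domain)
# are compositions drawn from one shared function set; an FLP that covers
# the set can run any application in the domain.

FUNCTION_SET = {"CNV", "MoG", "Erosion", "Dilation", "Labeling", "Histogram"}

APPLICATIONS = {
    "object_tracking": ["CNV", "MoG", "Dilation", "Erosion", "Labeling"],
    "motion_detect":   ["CNV", "MoG", "Histogram"],
}

def runnable_on_flp(app):
    """An application maps onto the FLP iff all its functions are covered."""
    return set(APPLICATIONS[app]) <= FUNCTION_SET

def coverage(apps):
    """Which function blocks the FLP must provide to serve all apps."""
    needed = set()
    for chain in apps.values():
        needed |= set(chain)
    return needed
```

`coverage` captures the "identify common functionality" step: the union of all application chains defines the function set the FLP must implement.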

D) FLP Architecture

FLP Components:
- Optimized Function Blocks (FBs)

- MUX-based interconnect

- Separation of data traffic

- Autonomous control / synchronization

[Figure: FLP architecture: an input encoder/formatter feeds a chain of function blocks (Function0..FunctionN, including an arithmetic unit) through forward and backward MUXes, ending in an output encoder/formatter; the FLP streaming-pipe controller sequences the pipe; per-block parameter buffers/caches and a shared operational (algorithm-intrinsic) buffer/cache connect through direct-memory access (DMA) units and the system input interface, keeping streaming data separate from operational (algorithm-intrinsic) data.]

Selected Publications

H. Tabkhi, M. Sabbagh, and G. Schirner, "Power-efficient real-time solution for adaptive vision algorithms," IET Computers & Digital Techniques, vol. 9, no. 1, pp. 16-26, 2015.

H. Tabkhi, R. Bushey, and G. Schirner, "Algorithm and architecture co-design of Mixture of Gaussians (MoG) background subtraction for embedded vision," in IEEE 47th Asilomar Conference on Signals, Systems and Computers, Nov 2013, pp. 1815-1820.

——, "Function-level processor (FLP): A high performance, minimal bandwidth, low power architecture for market-oriented MPSoCs," IEEE Embedded Systems Letters, vol. 6, no. 4, pp. 65-68, Dec 2014.

——, "Function-level processor (FLP): Raising efficiency by operating at function granularity for market-oriented MPSoC," in IEEE 25th International Conference on Application-specific Systems, Architectures and Processors (ASAP), June 2014, pp. 121-130.

[Figure: FLP/ILP integration: the FLP (function blocks FB0..FBN with MUX interconnect, DMAs, and LSPMs) and an ILP share a streaming communication fabric and shared memory; a control unit reaches both over the control bus, with an interrupt line to the interrupt controller, DMAs, and system I/O.]

FLP pairs with ILP cores
- To create complete control and analytic processing
- FLP for pre-/mid-processing
- ILPs for post-processing (control and intelligence)

Pipeline Vision Processor (PVP)
- The FLP is a generalization of the PVP
- Result of joint work with the PVP chief architect

Results on 10 selected vision applications
- Computation:
  - FLP-PVP <= 22.5 GOPs
  - ILP+ACC requires 2 ILP cores
  - ILP requires 7 ILP cores
- Off-chip communication:
  - FLP offers 5x less than ILP and 3x less than ILP+ACC
- Power:
  - FLP offers 18x less than ILP and 5x less than ILP+ACC

FLP Principles:
- Target stream-processing applications

- Compute contiguously inside FLP

- Limited ILP interaction