

High-Performance Power-Efficient Solutions for Embedded Vision Computing

PhD dissertation of Hamed Tabkhi; PhD adviser: Prof. Gunar Schirner
Department of Electrical and Computer Engineering,
Northeastern University, Boston (MA), USA
{tabkhi, schirner}@ece.neu.edu

Challenges of Embedded Vision

Traffic Separation

C) Communication-Centric Arch. Template

A) Streaming vs Algorithm-Intrinsic

Insight: Not all traffic is equal!

Streaming:
- Input/output stream (independent of algorithm selection)

Algorithm-intrinsic:
- Generated by the algorithm itself (algorithm dependent)

Function-Level Processor

F) Experimental Results

A) Flexibility/Efficiency

A) Embedded Vision

Application areas
- Advanced Driver Assistance Systems (ADAS)

- Security / video surveillance

- Robotics

Rapidly growing
- ADAS alone: 13x over 5 years

- 2011: $10B -> 2016: $130B

B) Market Requirements
- Complex, adaptive advanced algorithms

- Diversity of scenes (e.g. indoor, outdoor)

- High res. (1080p) and rate (60fps)

- Significant computation (~50 GOPS)

- Huge bandwidth (~10 GB/s)

- Very low power (~ 1 Watt)

E) Current Approaches

HW solutions exist for filters; mid-processing is stuck in SW
- Flexible, but inefficient

SW cannot handle adaptive algorithms!
- Inefficient execution in SW
- Cannot handle heavy traffic
- Low resolution / quality

Problem:
(1) How to realize an individual adaptive vision algorithm?
(2) How to construct a single, larger vision flow in a platform?
(3) How to support many vision flows on the same platform?

Contributions:

1) Traffic Separation (addresses prob. 1 & 2)

- Manages traffic of adaptive algorithms

- Simplifies chaining of vision algorithms

2) Function-Level Processor (addresses prob. 3)
- Offers function-level flexibility with efficiency close to custom HW

C) Coarse-Grained Vision Pipeline

Pre-Processing (vision filters)

- High but regular compute, limited traffic

Mid-Processing (adaptive)

- High compute, high traffic

Post-Processing (intelligent / control)

- Limited compute / traffic


[Figure: communication-centric architecture template. The computation clock domain holds the adaptive vision algorithm with its precision adjustment; the communication clock domain holds the control unit (CU), read/write DMAs, and the operational stream interconnect to system memory through input/output interfaces; asynchronous FIFOs bridge the two domains, carrying the input stream, output stream, and algorithm-intrinsic data.]

Architecture support for traffic separation
- Streaming clock domain (computation)
  - Algorithm execution
  - Autonomous quality adjustment
- Operational clock domain (communication)
  - Dedicated DMAs
  - Stream access to memory
- Asynchronous FIFOs bridging the clock domains
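The clock-domain split above can be modeled in software: a bounded queue stands in for each asynchronous FIFO between the communication (DMA) domain and the computation domain. This is a minimal sketch; the function names and the doubling "algorithm" are illustrative stand-ins, not the dissertation's implementation.

```python
import threading
import queue

def dma_reader(pixels, fifo_in):
    """Communication domain: streams input pixels into the async FIFO."""
    for p in pixels:
        fifo_in.put(p)          # blocks when the FIFO is full
    fifo_in.put(None)           # end-of-stream marker

def compute(fifo_in, fifo_out):
    """Computation domain: consumes the stream, runs the algorithm."""
    while True:
        p = fifo_in.get()
        if p is None:
            fifo_out.put(None)
            break
        fifo_out.put(p * 2)     # stand-in for the vision algorithm

def run_pipeline(pixels, depth=4):
    """Bounded queues model the asynchronous FIFOs between domains."""
    fifo_in = queue.Queue(maxsize=depth)
    fifo_out = queue.Queue(maxsize=depth)
    out = []
    t1 = threading.Thread(target=dma_reader, args=(pixels, fifo_in))
    t2 = threading.Thread(target=compute, args=(fifo_in, fifo_out))
    t1.start(); t2.start()
    while True:                  # write-DMA side: drain results
        r = fifo_out.get()
        if r is None:
            break
        out.append(r)
    t1.join(); t2.join()
    return out
```

The bounded `maxsize` mirrors the back-pressure a real FIFO exerts when the two domains run at different rates.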

D) System-Level Benefits

E) Vision SoC Solution on Zynq

[Figure: object-tracking vision flow: the pixel stream enters MoG background subtraction (maintaining Gaussian parameters) to produce the FG mask; component labeling turns the FG mask and FG labels into object labels; mean-shift object tracking consumes the object stream and object histograms to produce new positions.]

[Figure: Zynq mapping: in the programmable logic, the HDMI input and packing unit (HDMI clock domain) feed, over asynchronous FIFOs, the HDMI2Gray, smoothing, MoG, morphology (erosion/dilation), and video-overlay blocks on AXI buses 0/1 (AXI clock domain); AXI video DMAs 0 and 1 stream frames and the FG mask through the memory controller; object detection (component labeling) and object tracking (mean-shift) run on the processor subsystem; an unpacking unit and the HDMI output return to the HDMI clock domain.]

[Figure: each vision algorithm exposes a stream-in/stream-out interface plus algorithm-intrinsic data (the scene model); example frames contrast the original scene with the extracted foreground (FG).]


[Figure: chained deployment: Vision Algo 0, 1, and 2 sit between System In and System Out; each algorithm's intrinsic data passes through precision adjustment, asynchronous FIFOs, and a DMA onto AXI (AXI clk, AXI_data) toward system memory, alongside the host processor and the FLP-PVP with its cache and DMA configuration.]

B) Programming Abstraction

C) Function-Set Architecture

[Figure: FLP-PVP deployment: DMA-fed function blocks such as a low-pass filter (convolution) and color/illumination extraction operate alongside multiple ILP-BFDSP cores, each with its own cache.]

E) System-Level Integration

[Charts: ILP vs ILP+ACC vs FLP, split into communication and computation: operations [GOPs] (axis 0-24), number of ILP cores (axis 0-12), off-chip traffic [GB/s] (axis 0-1.6), and power [W] (axis 0-3).]

D) Adaptive Vision Algorithms

Complex scene analysis
- Track multiple objects

Machine-learning principles
- Keep a model of the scene
- e.g. MoG background subtraction, optical flow, SVM

Observation: Not all traffic is equal!

Algorithm-intrinsic traffic dominates

- 60x in MoG, 20x in Optical flow

Streaming: fixed, algorithm-intrinsic: adjustable

Observed traffic separation in:

• Mixture of Gaussians (MoG)
• Kanade-Lucas-Tomasi (KLT) optical flow
• Component labeling
• Mean-shift object tracker
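The MoG scene model above can be sketched per pixel. This is a simplified Stauffer/Grimson-style update, not the dissertation's exact formulation (here any matched Gaussian counts as background); the comments note where the algorithm-intrinsic parameter traffic arises.

```python
# Simplified per-pixel Mixture-of-Gaussians update. Note the traffic pattern:
# each 1-byte input pixel drags K * (weight, mean, variance) 32-bit parameters
# through memory twice (read + write) -- the algorithm-intrinsic traffic that
# dominates the streaming pixel traffic.

K = 3            # Gaussians per pixel
ALPHA = 0.05     # learning rate
MATCH = 2.5      # match threshold, in standard deviations

def mog_update(pixel, model):
    """model: list of K [weight, mean, var] entries (read from memory).
    Returns (is_foreground, updated model) (written back to memory)."""
    matched = None
    for g in model:
        w, mean, var = g
        if (pixel - mean) ** 2 <= (MATCH ** 2) * var:   # match test
            matched = g
            break
    if matched is not None:
        w, mean, var = matched
        matched[0] = w + ALPHA * (1 - w)                # grow weight
        matched[1] = mean + ALPHA * (pixel - mean)      # pull mean
        matched[2] = var + ALPHA * ((pixel - mean) ** 2 - var)
        is_fg = False                                   # matched: background
    else:
        # replace the weakest Gaussian with one centered on the pixel
        weakest = min(range(K), key=lambda i: model[i][0])
        model[weakest] = [ALPHA, float(pixel), 30.0]
        is_fg = True
    total = sum(g[0] for g in model)                    # renormalize weights
    for g in model:
        g[0] /= total
    return is_fg, model
```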

B) Optimization: Compression for Algorithm-Intrinsic Data

[Figure: MoG background subtraction with precision adjustment on its parameter traffic: outgoing 32-bit parameters pass a 32-bit-to-N-bit stage that keeps only the N most significant bits (MSBs); incoming parameters pass an N-bit-to-32-bit stage that restores the word with zeroed LSBs (00...0); the pixel stream passes through unchanged.]

Precision adjustment on algorithm-intrinsic data access
- Bandwidth/quality trade-off

- Pareto front over the trade-off points
- Quality evaluated with MS-SSIM

- Significant bandwidth reduction in MoG

- Simple scene: 63%

- Medium scene: 59%

- Complex scene: 56%

- Same trade-off observed for optical flow
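The N-bit adjustment can be sketched directly: plain MSB truncation on the way out to memory, restoration with zeroed LSBs on the way back, as described in section B. The function names are hypothetical.

```python
# Precision adjustment on algorithm-intrinsic data: keep only the N most
# significant bits of each 32-bit parameter word (bandwidth reduction),
# restore with 00...0 in the LSBs on read-back (quality loss bounded by
# the dropped LSBs). Illustrative sketch of the scheme in section B.

def compress(word32, n):
    """32-bit word -> N-bit word: keep the N MSBs."""
    return (word32 & 0xFFFFFFFF) >> (32 - n)

def decompress(wordn, n):
    """N-bit word -> 32-bit word: restore with zeroed LSBs."""
    return (wordn << (32 - n)) & 0xFFFFFFFF

def roundtrip_error(word32, n):
    """Quantization error introduced by the bandwidth reduction."""
    return word32 - decompress(compress(word32, n), n)
```

With n = 16, parameter bandwidth halves and the per-word error stays below 2^16, which is the bandwidth/quality knob behind the Pareto front above.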

[Chart: memory bandwidth [GB/s] (axis 0-8) for complex, medium, and simple scenes, comparing original vs tuned parameters.]

Pipeline construction of multiple vision algorithms
- Streaming data: point-to-point connections, hidden from memory
- Algorithm-intrinsic data: routed to the communication interface, with dedicated precision adjustment
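The point-to-point chaining can be sketched with generators: the pixel stream flows stage to stage without ever touching "memory". The two stages are toy stand-ins for real vision filters, not the dissertation's blocks.

```python
# Point-to-point pipeline construction: each stage consumes the previous
# stage's stream directly; only algorithm-intrinsic state would go through
# DMA/memory in hardware. Stage functions are hypothetical stand-ins.

def smooth(stream):
    """Toy 1D smoothing: average each pixel with its predecessor."""
    prev = None
    for p in stream:
        yield p if prev is None else (prev + p) // 2
        prev = p

def threshold(stream, t=128):
    """Toy foreground mask: 1 where the pixel exceeds the threshold."""
    for p in stream:
        yield 1 if p >= t else 0

def build_flow(source, *stages):
    """Chain stages point-to-point: output of one feeds the next."""
    stream = iter(source)
    for stage in stages:
        stream = stage(stream)
    return stream

mask = list(build_flow([100, 200, 60], smooth, threshold))
```

Because each connection is a generator, no intermediate frame is materialized, which mirrors the "hidden from memory" streaming connections above.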

[Figure: implemented flow: HDMI in -> Smoothing (CNV) -> MoG -> Morphology (HW) -> Component Labeling -> Histogram Checking (SW) -> Video Overlay (HW) -> HDMI out.]

Object tracking vision flow
- Smoothing
  - 1x CNV on 8-bit data, 5x5 window
- Mixture of Gaussians
- Morphology: dilation, erosion, erosion
  - 3x CNVs on 1-bit data, 15x15 window

- Component labeling

- Histogram checking

- Video Overlay
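The 1-bit morphology stages can be sketched as min/max window filters. This is a simplified version (square window, zero padding, 3x3 default instead of the poster's 15x15), with the dilation-erosion-erosion order taken from the flow above.

```python
# Binary morphology on the FG mask: erosion keeps a pixel only if its whole
# window is foreground (min filter); dilation keeps it if any window pixel
# is foreground (max filter). In HW these are realized as 1-bit convolutions.

def _window(mask, r, c, k):
    """Yield the k x k neighborhood of (r, c), zero-padded at borders."""
    h, w = len(mask), len(mask[0])
    half = k // 2
    for dr in range(-half, half + 1):
        for dc in range(-half, half + 1):
            rr, cc = r + dr, c + dc
            yield mask[rr][cc] if 0 <= rr < h and 0 <= cc < w else 0

def erode(mask, k=3):
    return [[min(_window(mask, r, c, k)) for c in range(len(mask[0]))]
            for r in range(len(mask))]

def dilate(mask, k=3):
    return [[max(_window(mask, r, c, k)) for c in range(len(mask[0]))]
            for r in range(len(mask))]

def morphology(mask, k=3):
    """Dilation followed by two erosions, as listed in the flow."""
    return erode(erode(dilate(mask, k), k), k)
```

Applied to the MoG foreground mask, this removes speckle noise before component labeling.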

Implementation results
- 1080p at 30 Hz, or 768p at 60 Hz
- Limitation: on-chip memory
- Great performance / power efficiency
  - 40 GOPs at 1.7 Watt
- 30x faster than SW-only execution on a desktop machine

Instruction-Level Processors (ILPs)
- High flexibility, low efficiency

Custom HW Accelerators (HWACCs)
- Low flexibility, high efficiency

[Chart: efficiency [GOPs/Watt] vs flexibility: custom HWACCs sit at high efficiency but application-level (low) flexibility; ILPs (control processors, DSPs, GPUs) offer instruction-level flexibility at low efficiency; the FLP targets the gap at function-level granularity.]

Insight: Mismatch in granularity
- Programming granularity: how to compose a program?
- Architecture granularity: how to execute a program?

[Figure: matching programming and architecture abstraction: at instruction granularity, programs composed of Add/Sub/For need a compiler to map onto the architecture; at function granularity, programs composed of Filter/CNV/Sort map directly onto a function-level architecture.]

Function-Level Processor
- Matches abstractions at function-level granularity
- Architecture for function-level programming
- Increases efficiency
- Maintains flexibility
- Simplifies application composition

[Figure: applications of a market (domain) as compositions over a shared function set: e.g. one application uses Functions A, B, E, I; another B, D, E, F, H; others C, G, J, K; D, H, I, J, K; and B, E, F, J. Together the domain's applications draw on the common function set {A..K}.]

Streaming applications are composed of functions
- e.g. OpenCV, OpenSDR

Requirement:

- Compute inside FLP as much as possible

- Identify common functionality and composition rules
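The function-set idea can be sketched as set coverage: each application is a composition drawn from the shared set, and an FLP providing that set can be programmed for any of them. The function and application names below are hypothetical placeholders.

```python
# Function-set architecture sketch: the applications of a market (domain)
# are compositions drawn from one shared function set; an FLP that covers
# the set can run any application in the domain.

FUNCTION_SET = {"CNV", "MoG", "Erosion", "Dilation", "Labeling", "Histogram"}

APPLICATIONS = {
    "object_tracking": ["CNV", "MoG", "Dilation", "Erosion", "Labeling"],
    "motion_detect":   ["CNV", "MoG", "Histogram"],
}

def runnable_on_flp(app):
    """An application maps onto the FLP iff all its functions are covered."""
    return set(APPLICATIONS[app]) <= FUNCTION_SET

def coverage(apps):
    """Which function blocks the FLP must provide to serve all apps."""
    needed = set()
    for chain in apps.values():
        needed |= set(chain)
    return needed
```

`coverage` captures the "identify common functionality" step: the union of all application chains defines the function set the FLP must implement.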

D) FLP Architecture

FLP Components:
- Optimized Function Blocks (FBs)

- MUX-based interconnect

- Separation of data traffic

- Autonomous control / synchronization

[Figure: FLP architecture: an input encoder/formatter feeds a chain of function blocks (Function0..FunctionN, including an arithmetic unit) through forward and backward MUXes, ending in an output encoder/formatter; the FLP streaming-pipe controller sequences the pipe; per-block parameter buffers/caches and a shared operational (algorithm-intrinsic) buffer/cache connect through direct-memory access (DMA) units and the system input interface, keeping streaming data separate from operational (algorithm-intrinsic) data.]

Selected Publications

H. Tabkhi, M. Sabbagh, and G. Schirner, "Power-efficient real-time solution for adaptive vision algorithms," IET Computers & Digital Techniques, vol. 9, no. 1, pp. 16-26, 2015.

H. Tabkhi, R. Bushey, and G. Schirner, "Algorithm and architecture co-design of Mixture of Gaussians (MoG) background subtraction for embedded vision," in IEEE 47th Asilomar Conference on Signals, Systems and Computers, Nov 2013, pp. 1815-1820.

——, "Function-level processor (FLP): A high performance, minimal bandwidth, low power architecture for market-oriented MPSoCs," IEEE Embedded Systems Letters, vol. 6, no. 4, pp. 65-68, Dec 2014.

——, "Function-level processor (FLP): Raising efficiency by operating at function granularity for market-oriented MPSoC," in IEEE 25th International Conference on Application-specific Systems, Architectures and Processors (ASAP), June 2014, pp. 121-130.

[Figure: FLP/ILP integration: the FLP (function blocks FB0..FBN with MUX interconnect, DMAs, and LSPMs) and an ILP share a streaming communication fabric and shared memory; a control unit reaches both over the control bus, with an interrupt line to the interrupt controller, DMAs, and system I/O.]

FLP pairs with ILP cores
- To create complete control and analytic processing
- FLP for pre-/mid-processing
- ILPs for post-processing (control and intelligence)

Pipeline Vision Processor (PVP)
- The FLP is a generalization of the PVP
- Result of joint work with the PVP chief architect

Results on 10 selected vision applications
- Computation:
  - FLP-PVP <= 22.5 GOPs
  - ILP+ACC requires 2 ILP cores
  - ILP requires 7 ILP cores
- Off-chip communication:
  - FLP offers 5x less than ILP and 3x less than ILP+ACC
- Power:
  - FLP offers 18x less than ILP and 5x less than ILP+ACC

FLP Principles:
- Target stream-processing applications

- Compute contiguously inside FLP

- Limited ILP interaction