Traffic Separation
Challenges of Embedded Vision
Contributions
High-Performance Power-Efficient Solutions for Embedded Vision Computing
PhD dissertation of Hamed Tabkhi; adviser: Prof. Gunar Schirner
Department of Electrical and Computer Engineering,
Northeastern University, Boston (MA), USA
{tabkhi, schirner}@ece.neu.edu
C) Communication-Centric Arch. Template
A) Streaming vs Algorithm-Intrinsic
Function-Level Processor
Insight: Not all traffic is equal!
- Streaming: the input/output stream (independent of which algorithm is selected)
- Algorithm-intrinsic: generated by the algorithm itself (algorithm dependent)
F) Experimental Results
A) Flexibility/Efficiency
A) Embedded Vision
Application areas:
- Advanced Driver Assistance Systems (ADAS)
- Security / video surveillance
- Robotics
Rapidly growing market:
- ADAS alone: 13x growth over 5 years
- 2011: $10B -> 2016: $130B
B) Market Requirements
- Complex, adaptive advanced algorithms
- Diversity of scenes (e.g. indoor, outdoor)
- High resolution (1080p) and frame rate (60 fps)
- Significant computation (~50 GOPS)
- Huge bandwidth (~10 GB/s)
- Very low power (~1 Watt)
E) Current Approaches
- HW solutions for vision filters; mid-processing remains stuck in SW
  - Flexible, but inefficient
- Cannot handle adaptive algorithms:
  - Inefficient execution in SW
  - Cannot handle the heavy traffic
  - Low resolution / quality
Problem:
(1) How to realize an individual adaptive vision algorithm?
(2) How to construct a single, larger vision flow on a platform?
(3) How to support many vision flows on the same platform?
Contributions:
1) Traffic Separation (addresses problems 1 & 2)
- Manages the traffic of adaptive algorithms
- Simplifies chaining of vision algorithms
2) Function-Level Processor (addresses problem 3)
- Offers function-level flexibility with efficiency close to custom HW
C) Coarse-Grained Vision Pipeline
- Pre-processing (vision filters): high but regular compute, limited traffic
- Mid-processing (adaptive): high compute, high traffic
- Post-processing (intelligent / control): limited compute / traffic
[Figure: communication-centric architecture template — an adaptive vision algorithm in the computation clock domain with precision adjustment; read/write DMAs and a control unit (CU) in the communication clock domain; asynchronous FIFOs bridging the two domains; input/output interfaces on an operational stream interconnect to system memory; input stream, output stream, and algorithm-intrinsic data kept on separate paths]
Architecture support for traffic separation:
- Streaming clock domain (computation)
  - Algorithm execution
  - Autonomous quality adjustment
- Operational clock domain (communication)
  - Dedicated DMAs
  - Stream access to memory
- Asynchronous FIFOs bridging the clock domains
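The decoupling above can be modeled in software: the communication side (DMAs) and the computation side (the vision kernel) run concurrently and exchange data only through bounded FIFOs, mimicking the asynchronous FIFOs that bridge the two clock domains. This is a conceptual sketch; the names (`read_dma`, `kernel`, `write_dma`) and the toy `* 2` computation are illustrative, not taken from the design.

```python
import threading
import queue

def read_dma(src, fifo_in):
    for word in src:              # communication domain: fetch the input stream
        fifo_in.put(word)
    fifo_in.put(None)             # end-of-stream marker

def kernel(fifo_in, fifo_out):
    while True:                   # computation domain: consumes at its own rate
        word = fifo_in.get()
        if word is None:
            fifo_out.put(None)
            break
        fifo_out.put(word * 2)    # stand-in for the vision algorithm

def write_dma(fifo_out, dst):
    while True:                   # communication domain: drain the output stream
        word = fifo_out.get()
        if word is None:
            break
        dst.append(word)

def run_pipeline(src):
    # Bounded queues play the role of the asynchronous FIFOs: each side
    # blocks when its FIFO is full/empty, never touching the other's state.
    fifo_in, fifo_out = queue.Queue(maxsize=4), queue.Queue(maxsize=4)
    dst = []
    threads = [
        threading.Thread(target=read_dma, args=(src, fifo_in)),
        threading.Thread(target=kernel, args=(fifo_in, fifo_out)),
        threading.Thread(target=write_dma, args=(fifo_out, dst)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return dst

print(run_pipeline([1, 2, 3, 4]))  # → [2, 4, 6, 8]
```

Because the only coupling is through the FIFOs, either side can be re-clocked (or stalled) without the other noticing, which is exactly what the hardware template exploits.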
D) System-Level Benefits
E) Vision SoC Solution on Zynq
[Block diagram: Zynq-based vision SoC — HDMI input through a packing unit and HDMI output through an unpacking unit, with async FIFOs bridging the HDMI and AXI clock domains. The programmable logic implements the pixel-stream components (smoothing, MoG background subtraction producing the FG mask and Gaussian parameters, morphology via erosion/dilation) and the object-stream components (object detection via component labeling producing FG labels, mean-shift object tracking producing new positions and object histograms), plus video overlay and HDMI-to-gray conversion. AXI Video DMA 0/1 with read/write channels on AXI buses 0/1 connect through the AXI interconnect to the memory controller and the processor subsystem.]
[Figure: traffic separation in a vision algorithm — input frames stream in and output frames stream out, while the algorithm-intrinsic scene model is held and accessed separately]
[Images: original scene and its foreground (FG) mask. Bar charts: communication / computation / stream-pixel breakdowns of 18% / 19% / 67% and 16% / 28% / 4% / 24% / 26%]
[Diagram: system-level pipeline — SystemIn feeding Vision Algo 0/1/2 chained point-to-point, each with precision adjustment, async FIFOs, and DMAs on the AXI clock/data interfaces; host processor, system memory, and FLP-PVP with cache/DMA configuration]
B) Programming Abstraction
C) Function-Set Architecture
[Diagram: ILP-based alternative — six Blackfin DSP (ILP-BF) cores, each with a cache, plus DMAs for the low-pass filter (convolution) and color/illumination extraction]
E) System-Level Integration
[Bar charts comparing ILP, ILP+ACC, and FLP: operations [GOPs] (scale 0–24), number of ILP cores (0–12), off-chip traffic [GB/s] (0–1.6), and power [W] (0–3), each split into communication and computation]
D) Adaptive Vision Algorithms
- Complex scene analysis: track multiple objects
- Machine-learning principles: keep a model of the scene
  - e.g. MoG background subtraction, optical flow, SVM
Observation: Not all traffic is equal!
- Algorithm-intrinsic traffic dominates: 60x in MoG, 20x in optical flow
- Streaming traffic is fixed; algorithm-intrinsic traffic is adjustable
Traffic separation observed in:
- Mixture of Gaussians (MoG)
- Kanade-Lucas-Tomasi (KLT) optical flow
- Component labeling
- Mean-shift object tracking
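A back-of-the-envelope model shows why the algorithm-intrinsic traffic dominates in MoG: per pixel, the streaming traffic is one input pixel plus one output mask byte, while the scene model (K Gaussians, each with mean, variance, and weight) must be read and written back every frame. The parameter sizes used here (K = 5, 3 parameters, 32-bit words) are illustrative assumptions, not figures from the dissertation.

```python
# Per-pixel traffic model for MoG background subtraction (assumed sizes).
K = 5        # Gaussians per pixel (a typical MoG choice)
PARAMS = 3   # mean, variance, weight per Gaussian
BYTES = 4    # 32-bit parameter words

stream_bytes = 1 + 1                      # pixel in + foreground-mask byte out
intrinsic_bytes = K * PARAMS * BYTES * 2  # read + write-back of the scene model

ratio = intrinsic_bytes / stream_bytes
print(ratio)  # → 60.0, the same order as the ~60x reported for MoG
```

Under these assumptions the scene model alone moves 120 bytes per pixel per frame, which is why hiding the stream from memory is not enough: the intrinsic traffic is what needs managing.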
B) Optimization: Compression for Algorithm-Intrinsic Data
[Figure: precision adjustment around MoG background subtraction — on write-out (PixelOut/ParametersOut), 32-bit values are reduced to their N most significant bits (MSBs) via a 32-bit to N-bit adjustment; on read-back (PixelIn/ParametersIn), the N-bit values are expanded back to 32 bits with zero-filled LSBs via an N-bit to 32-bit adjustment]
Precision adjustment on algorithm-intrinsic data accesses:
- Bandwidth/quality trade-off
  - Pareto front (blue line)
  - Quality evaluated with MS-SSIM
- Significant bandwidth reduction in MoG:
  - Simple scene: 63%
  - Medium scene: 59%
  - Complex scene: 56%
- The same trade-off is observed for optical flow
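The adjustment itself is simple to sketch: keep the N most significant bits of each 32-bit word on write-out, and zero-fill the discarded LSBs on read-back. The function names below are illustrative (the hardware does this transparently in the communication interface), and N = 12 is an assumed operating point, not one quoted on the poster.

```python
def compress(word32: int, n: int) -> int:
    """Keep the N MSBs of a 32-bit word (the value stored off-chip)."""
    return (word32 & 0xFFFFFFFF) >> (32 - n)

def decompress(word_n: int, n: int) -> int:
    """Re-expand to 32 bits by zero-filling the discarded LSBs."""
    return word_n << (32 - n)

n = 12                                   # bits kept per parameter (assumed)
saving = 1 - n / 32                      # fraction of bandwidth removed

# Round-trip loses only the low 32-n bits:
assert decompress(compress(0xABCD1234, n), n) == 0xABC00000
print(f"{saving:.0%}")  # → 62%, close to the 63% reported for simple scenes
```

Sweeping N trades reconstruction error (measured here with MS-SSIM) against bandwidth, which is exactly the Pareto front the poster refers to.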
[Bar chart: bandwidth [GB/s] (scale 0–8) for complex, medium, and simple scenes, comparing original vs. tuned parameters]
Pipeline construction of multiple vision algorithms:
- Streaming data: point-to-point connections, hidden from system memory
- Algorithm-intrinsic data: routed to the communication interface, each with dedicated precision adjustment
[Pipeline diagram: HDMI in → smoothing (CNV) → MoG → morphology → component labeling → histogram checking → video overlay → HDMI out, partitioned across HW and SW]
Object-tracking vision flow:
- Smoothing: 1x CNV on 8-bit data, 5x5 window
- Mixture of Gaussians
- Morphology (dilation, erosion, erosion): 3x CNVs on 1-bit data, 15x15 window
- Component labeling
- Histogram checking
- Video overlay
Implementation results:
- 1080p at 30 Hz, or 768p at 60 Hz
  - Limitation: on-chip memory
- High performance / power efficiency: 40 GOPs at 1.7 Watt
- 30x faster than SW-only execution on a desktop machine
Instruction-Level Processors (ILPs): high flexibility, low efficiency
Custom HW Accelerators (HWACCs): low flexibility, high efficiency
[Plot: efficiency [GOPs/Watt] vs. flexibility — HWACCs at low flexibility / high efficiency; control processors, DSPs, and GPUs (ILPs) at high flexibility / low efficiency; the FLP sits between them, at function granularity between application and instruction flexibility]
Insight: Mismatch in granularity
- Programming granularity: how to compose a program
- Architecture granularity: how to execute a program
[Diagram: abstraction gap between programming and architecture — instruction-level primitives (Add, Sub, For) vs. function-level primitives (Filter, CNV, Sort); a compiler maps the programming abstraction onto the architecture; the function-level architecture raises the architectural abstraction to match]
Function-Level Processor:
- Matches abstractions at function-level granularity
- An architecture for function-level programming
- Increases efficiency
- Maintains flexibility
- Simplifies application composition
[Diagram: the applications of a market (domain), each composed of functions drawn from FunctionA–FunctionK; the union of the functions used across all applications defines the function set]
Streaming applications are composed of functions
- e.g. OpenCV, OpenSDR
Requirements:
- Compute inside the FLP as much as possible
- Identify common functionality and composition rules
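Deriving the function set from a market can be sketched as a set union: each application is the set of functions it calls, and the function set is their union. The application and function names are the generic placeholders from the poster's figure, not real workloads.

```python
# Each application of the market (domain), as the set of functions it uses.
apps = {
    "app0": {"FunctionA", "FunctionB", "FunctionE", "FunctionI"},
    "app1": {"FunctionB", "FunctionD", "FunctionE", "FunctionF"},
    "app2": {"FunctionC", "FunctionG", "FunctionH", "FunctionJ", "FunctionK"},
}

# The function set: every function any application in the market needs.
function_set = set().union(*apps.values())
print(sorted(function_set))
```

In practice one would also record composition rules (which function outputs feed which inputs), since the FLP must chain these blocks, not just provide them.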
D) FLP Architecture
FLP components:
- Optimized Function Blocks (FBs)
- MUX-based interconnect
- Separation of data traffic
- Autonomous control / synchronization
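The MUX-based interconnect can be pictured as a toy model: a streaming pipe is configured as an ordered chain of function-block names, and the MUX selects simply route each block's output to the next block's input. The block implementations below are illustrative stand-ins, not the optimized hardware FBs.

```python
# Assumed, simplified function blocks; real FBs are fixed-function hardware.
FBS = {
    "CNV":    lambda x: x + 1,   # stand-in for a convolution block
    "Filter": lambda x: x * 3,   # stand-in for a filter block
    "Sort":   lambda x: -x,      # stand-in for a reordering block
}

def run_pipe(chain, stream):
    """Route each sample through the configured chain of function blocks."""
    out = []
    for sample in stream:
        for fb in chain:         # each hop is one MUX select in the fabric
            sample = FBS[fb](sample)
        out.append(sample)
    return out

print(run_pipe(["CNV", "Filter"], [1, 2, 3]))  # → [6, 9, 12]
```

Reprogramming the FLP amounts to rewriting the `chain` (the MUX configuration), which is why it stays flexible at function granularity without paying instruction-level overheads.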
[Block diagram: FLP architecture — an input encoder/formatter on the system input interface; function blocks Function0–FunctionN (including an arithmetic unit); forward/backward MUXes forming the interconnect; an output encoder/formatter; the FLP streaming-pipe controller; parameter buffers/caches per block; an operational (algorithm-intrinsic) buffer/cache; and multiple direct-memory-access (DMA) engines. Streaming data and operational (algorithm-intrinsic) data travel on separate paths.]
Selected Publications
- H. Tabkhi, M. Sabbagh, and G. Schirner, "Power-efficient real-time solution for adaptive vision algorithms," IET Computers & Digital Techniques, vol. 9, no. 1, pp. 16–26, 2015.
- H. Tabkhi, R. Bushey, and G. Schirner, "Algorithm and architecture co-design of Mixture of Gaussian (MoG) background subtraction for embedded vision," in IEEE 47th Asilomar Conference on Signals, Systems and Computers, Nov 2013, pp. 1815–1820.
- H. Tabkhi, R. Bushey, and G. Schirner, "Function-level processor (FLP): A high performance, minimal bandwidth, low power architecture for market-oriented MPSoCs," IEEE Embedded Systems Letters, vol. 6, no. 4, pp. 65–68, Dec 2014.
- H. Tabkhi, R. Bushey, and G. Schirner, "Function-level processor (FLP): Raising efficiency by operating at function granularity for market-oriented MPSoC," in IEEE 25th International Conference on Application-specific Systems, Architectures and Processors (ASAP), June 2014, pp. 121–130.
[Diagram: FLP/ILP system integration — the FLP (function blocks FB0–FBN with MUX interconnect, DMAs, and LSPMs) and an ILP share a streaming communication fabric and shared memory; a control unit, control bus, interrupt line with interrupt controller, DMA, and system I/O complete the platform]
The FLP pairs with ILP cores to create complete control and analytic processing:
- FLP for pre-/mid-processing
- ILPs for post-processing (control and intelligence)
Pipeline Vision Processor (PVP):
- The FLP is a generalization of the PVP
- Result of joint work with the PVP chief architect
Results on 10 selected vision applications:
- Computation:
  - FLP-PVP <= 22.5 GOPs
  - ILP+ACC requires 2 ILP cores
  - ILP requires 7 ILP cores
- Off-chip communication: the FLP needs 5x less than ILP and 3x less than ILP+ACC
- Power: the FLP draws 18x less than ILP and 5x less than ILP+ACC
FLP principles:
- Target stream-processing applications
- Compute contiguously inside the FLP
- Limited ILP interaction