Exploring Computation-Communication Tradeoffs in Camera Systems
Amrita Mazumdar, Thierry Moreau, Sung Kim, Meghan Cowan,
Armin Alaghi, Luis Ceze, Mark Oskin, Visvesh Sathe
IISWC 2017
Camera applications are a prominent workload with tight constraints
Example systems: video surveillance cameras, 3D-360 virtual reality camera rigs, energy-harvesting cameras, augmented reality glasses
Constraints: large data size, light weight, real-time processing, low power
Hardware implementations compound the camera system design space
Constraints: power, time, size, bandwidth
Implementations: ASIC, FPGA, DSP, CPU, GPU
Running example: the "DogChat™" camera system
We can represent camera applications as camera processing pipelines to clarify design space exploration
Generic pipeline: sensor → block 1 → block 2 → block 3 → block 4, where each block is a function in the application
DogChat™ example: sensor → image processing → face detection → feature tracking → image rendering
Any suffix of the pipeline can be offloaded to the cloud
Developers can trade off between computation and communication costs
The split between in-camera processing and cloud offload can fall anywhere along the pipeline:
sensor → image processing → face detection → feature tracking → image rendering
Optional and required blocks in camera pipelines introduce more tradeoffs
Required: sensor → image processing → face detection → feature tracking → image rendering
Optional: edge detection, motion detection, motion tracking
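One way to make this design space concrete is to enumerate pipeline variants programmatically. A minimal sketch, with illustrative block names from the slides and a deliberately simplified insertion rule (real insertion points vary per block):

```python
from itertools import combinations

# Required pipeline blocks, in order, and the optional blocks that may be
# added as extra stages. Names follow the slides; the structure is illustrative.
REQUIRED = ["sensor", "image processing", "face detection",
            "feature tracking", "image rendering"]
OPTIONAL = ["edge detection", "motion detection", "motion tracking"]

def configurations(required, optional):
    """Enumerate pipeline variants: every subset of optional blocks,
    inserted here right after the sensor for simplicity."""
    for k in range(len(optional) + 1):
        for subset in combinations(optional, k):
            yield [required[0], *subset, *required[1:]]

configs = list(configurations(REQUIRED, OPTIONAL))
print(len(configs))  # 2^3 = 8 pipeline variants
```

Even before choosing hardware platforms, three optional blocks already multiply the candidate pipelines by eight.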
Custom hardware platforms explode the camera system design space
Each block, required or optional, can map to a different platform: ASIC, FPGA, DSP, CPU, or GPU
In-camera processing pipelines can help us evaluate these tradeoffs!
Challenges for modern camera systems
Low power: face authentication for energy-harvesting cameras with ASIC design (motion detection → face detection → neural network)
Low latency: real-time virtual reality for multi-camera rigs with FPGA acceleration (prep → align → depth → stitch)
Face authentication with energy-harvesting cameras
WISPCam energy-harvesting camera: powered by RF, 1 frame/second, ~1 mW processing/frame
Goal: answer "Is this Armin?" ✅ on-camera
CPU-based face authentication neural networks can exceed WISPCam power budgets
Baseline pipeline: sensor → neural network (on-chip CPU) → other application functions (cloud)
Moving the neural network into ASIC hardware, and adding the optional blocks (motion detection, face detection) as on-chip circuits, can reduce power consumption
Exploring design tradeoffs in ASIC accelerators
Evaluated the impact of NN topology and hardware on energy and accuracy
Selected a 400-8-1 network topology with 8-bit datapaths as the optimal energy/accuracy point
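A minimal pure-Python fixed-point sketch of that 400-8-1 forward pass, assuming signed 8-bit weights and a simple shift-based rescale between layers (the accelerator's actual quantization and activation scheme may differ; the bit widths follow the 8-bit datapath and wide-accumulator design):

```python
import random

random.seed(0)

def q8(x):
    """Saturate to a signed 8-bit range."""
    return max(-128, min(127, int(x)))

def layer(inputs, weights, shift=8):
    """Fully connected layer: 8-bit x 8-bit products, wide accumulate,
    then an assumed fixed-point shift-and-saturate back to 8 bits."""
    outs = []
    for neuron_w in weights:
        acc = sum(i * w for i, w in zip(inputs, neuron_w))  # fits in ~26 bits
        outs.append(q8(acc >> shift))
    return outs

def make_weights(n_in, n_out):
    return [[random.randint(-128, 127) for _ in range(n_in)]
            for _ in range(n_out)]

# 400-8-1: a 20x20-pixel window in, 8 hidden neurons, 1 authentication score out
w_hidden = make_weights(400, 8)
w_out = make_weights(8, 1)

pixels = [random.randint(0, 127) for _ in range(400)]  # 8-bit input window
hidden = layer(pixels, w_hidden)
score = layer(hidden, w_out)[0]
print(score)
```

Note that the widest accumulation (400 products of two 8-bit values) stays under 2^25, which is why a 26-bit accumulator suffices for this topology.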
[Figure: SNNAP neural network accelerator — DMA master, bus scheduler, and a processing unit of four 8-bit multiply-accumulate PEs with SRAM weight storage, 26-bit accumulators, accumulator/sigmoid FIFOs, and a shared sigmoid unit]
[Figure: Streaming Viola-Jones face detection accelerator — integral image accumulator (each integral row built from the previous row), window buffer, and a classifier unit of stage, threshold, and four-corner feature units]
Streaming face detection accelerator: explored classifier and other algorithm parameters to optimize energy efficiency
Many more details on the neural network and face detection accelerators in the paper!
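The face detection accelerator's integral image accumulator builds each integral row in streaming fashion from the previous one, so only one row of state is kept. A minimal sketch of that recurrence (pure Python, illustrative):

```python
def integral_rows(pixel_rows):
    """Yield integral-image rows, where I[y][x] is the sum of all pixels
    at or above-left of (x, y). Keeps only the previous integral row."""
    prev = None
    for row in pixel_rows:
        out = []
        run = 0  # running sum of the current pixel row
        for x, p in enumerate(row):
            run += p
            out.append(run + (prev[x] if prev else 0))
        prev = out
        yield out

img = [[1, 2],
       [3, 4]]
print(list(integral_rows(img)))  # [[1, 3], [4, 10]]
```

Once the integral image exists, any rectangular pixel sum needs only four corner lookups (I[d] - I[b] - I[c] + I[a]), which is what lets each Haar feature unit evaluate a feature in constant time.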
Synthesized the ASIC accelerators in Synopsys
Constructed a simulator to evaluate power consumption on real-world video input
Computed power for computation, and for transfer of the resulting data, for each pipeline configuration
Evaluation: which pipeline achieves the lowest overall power?

platform configuration        | compute | transfer | total power (µW)
sensor                        | <1%     | >99%     | 11,340
sensor motion                 | <1%     | >99%     | 3,731
sensor face detect            | 10%     | 90%      | 374
sensor NN                     | 16%     | 84%      | 782,090
sensor motion face detect     | >99%    | <1%      | 132
sensor motion NN              | >99%    | <1%      | 257,236
sensor face detect NN         | >99%    | <1%      | 419
sensor motion face detect NN  | >99%    | <1%      | 160

(power plotted on a log scale in the talk)
Takeaways: prefilters reduce overall power; prefilters combined with the NN use far less power than just using the NN; "sensor motion face detect" is the most power-efficient pipeline overall, and "sensor motion face detect NN" is the most power-efficient with the on-chip NN.
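The trend above follows from a simple cost model: each pipeline pays compute power for every block that actually runs, plus radio power for whatever data survives the last block. A sketch with hypothetical per-block numbers (illustrative only, not the paper's measurements):

```python
# Hypothetical per-block costs: compute power when the block runs, and the
# fraction of incoming data it passes downstream. Numbers are made up to
# illustrate the model, not taken from the paper.
BLOCKS = {
    "motion":      {"compute_uW": 20,  "pass_fraction": 0.25},
    "face detect": {"compute_uW": 35,  "pass_fraction": 0.05},
    "NN":          {"compute_uW": 120, "pass_fraction": 0.01},
}
TRANSFER_UW_PER_UNIT = 10_000  # radio cost to ship one unit of data off-camera

def pipeline_power(blocks):
    """Total power = compute on surviving data at each block + final transfer."""
    frac = 1.0      # fraction of data reaching the current block
    compute = 0.0
    for name in blocks:
        b = BLOCKS[name]
        compute += b["compute_uW"] * frac  # block runs only on surviving data
        frac *= b["pass_fraction"]
    transfer = TRANSFER_UW_PER_UNIT * frac
    return compute + transfer

print(round(pipeline_power(["NN"])))                           # 220
print(round(pipeline_power(["motion", "face detect", "NN"])))  # 32
```

Even with made-up constants, the model reproduces the qualitative result: cheap prefilters shrink both the expensive NN's workload and the radio traffic, so the full prefiltered pipeline beats the NN alone.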
In-camera processing for face authentication
In isolation, even well-designed hardware can show sub-optimal performance
Optional blocks (motion detection, face detection) can improve the overall cost beyond the original design, if they balance compute and communication
Challenges for modern camera systems
Low power: face authentication for energy-harvesting cameras with ASIC design (motion detection → face detection → neural network)
Low latency: real-time virtual reality for multi-camera rigs with FPGA acceleration (prep → align → depth → stitch)
Producing real-time VR video from a camera rig
16 GoPro cameras at 4K, 30 fps → 3.6 GB/s raw video
Goal: 3D-360 stereo video at 30 fps, 1.8 GB/s output
Cloud processing prevents real-time video
The VR pipeline is usually offloaded to the cloud to perform the heavy computation
sensor stream → prep → image align → depth from flow → image stitch → to viewer
Processing time breakdown: prep 5%, image align 20%, depth from flow 70%, image stitch 5%
We need to accelerate "depth from flow" to achieve high performance
Offloading before the costly step doesn't avoid compute-communication tradeoffs
[Chart: video frame size (MB) per pipeline stage, 0-600 MB]
The image alignment step produces significant intermediate data; offloading early on is still 2x the final output size
Evaluation
Designed a simple parallel accelerator for a Xilinx Zynq SoC, simulated for Virtex UltraScale+ (implementation details in paper)
Evaluated against CPU and GPU implementations in Halide
Assumed a 2 GB/s network link for communication
Which pipeline achieves the highest frame rate?

pipeline configuration                | compute (FPS) | transfer (FPS) | effective FPS
sensor                                | 100           | 15.8           | 15.8
sensor prep                           | 100           | 15.8           | 15.8
sensor prep align                     | 100           | 3.95           | 4.0
sensor prep align depth (CPU)         | 0.09          | 5.27           | 0.09
sensor prep align depth (GPU)         | 11.2          | 5.27           | 5.3
sensor prep align depth (FPGA)        | 174           | 5.27           | 5.3
sensor prep align depth (CPU) stitch  | 0.09          | 31.6           | 0.09
sensor prep align depth (GPU) stitch  | 11.2          | 31.6           | 11.2
sensor prep align depth (FPGA) stitch | 174           | 31.6           | 31.6

Takeaways: the effective frame rate is the minimum of the compute and transfer rates; the CPU results are slowest; data size is too big after the depth step for offloading; the full pipeline with the FPGA is the only one that achieves a real-time frame rate (>30 FPS)
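Each configuration's effective frame rate is simply the bottleneck of its two halves, min(compute FPS, transfer FPS). A minimal sketch using a few of the (compute, transfer) pairs from the table above:

```python
# (compute FPS, transfer FPS) for selected pipeline splits, from the table above.
CONFIGS = {
    "sensor only":                         (100.0, 15.8),
    "prep align depth (CPU) stitch":       (0.09, 31.6),
    "prep align depth (FPGA) stitch":      (174.0, 31.6),
}

def effective_fps(compute_fps, transfer_fps):
    """A pipeline split runs at the rate of its slower half."""
    return min(compute_fps, transfer_fps)

for name, (c, t) in CONFIGS.items():
    print(f"{name}: {effective_fps(c, t)} FPS")
```

Only the full in-camera pipeline with the FPGA depth accelerator clears the 30 FPS real-time target; every earlier split is throttled by either transfer bandwidth or CPU compute.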
In-camera processing for real-time VR video
Considering computation and communication together highlights benefits not seen when each is considered separately
For VR video (prep → align → depth → stitch), in-camera processing pipelines enable applications that could not be achieved via cloud offload at all
In-camera processing pipelines help characterize camera systems
In-camera pipelines let us evaluate computation-communication trade-offs
Use hardware-software co-design to balance constraints and optimize designs
Achieve optimal performance by considering bottlenecks in the context of the full system
Thank you!