Upload
others
View
8
Download
0
Embed Size (px)
Citation preview
A 1920 × 1080 25fps, 2.4TOPS/W Unified Optical Flow and Depth 6D Vision Processor for Energy-Efficient, Low Power Autonomous Navigation
Ziyun Li, Jingcheng Wang, Dennis Sylvester, David Blaauw, and Hun-Seok Kim University of Michigan, Ann Arbor, MI, [email protected]
Abstract This paper presents a unified 6D vision (3D coordinate + 3D motion) processor that is the first to achieve real-time dense optical flow and stereo depth for full HD (1920×1080, FHD) resolution. The proposed design is also the first ASIC that accelerates real-time optical flow computation. It supports a wide search range (176×176 pixels) to enable dense optical flow on FHD images with real-time 25fps/30fps (flow/depth) throughput, consuming 760mW in 28nm CMOS.
Introduction Autonomous navigation of micro aerial vehicles (MAVs) and self-driving cars (Fig. 1) requires real-time accurate and dense perception of 3D coordinates (depth) and 3D apparent motion (optical flow). The semi-global matching (SGM) [1] algorithm is widely used [2,3] in real-time stereo depth estimation and achieves high accuracy by applying a semi-global optimization over the entire image. However, directly applying SGM to optical flow is impractical for real-time applications because each pixel requires evaluation of d2 (= 31k for d = 176 in our chip) flow candidates as well as SGM on each matching candidate. Recently, a variant of SGM for optical flow, NG-fSGM [4], was proposed to avoid the quadratic complexity by performing neighbor-guided aggressive candidate pruning. However, NG-fSGM still requires a large memory footprint (~4MB) with massive memory access bandwidth (2.6Tb/s). Because of aggressive neighbor guided pruning and randomly added candidates to the search space [1], the memory access pattern is highly irregular and inter-pixel dependency has variable latency. Addressing these challenges, the proposed design uses: 1) four new custom designed 16×16 (136b/access) coalescing crossbar (xbar) switches that efficiently eliminate redundant memory accesses with built-in arbitration at each cross point for 4.08Tb/s peak on-chip memory bandwidth; 2) a new rotating buffer to maximize on-chip memory reuse for a wide 2D search range; 3) a diagonal image-scanning stride to resolve the variable length inter-pixel dependency in the deep pipeline.
Proposed Processor Architecture As shown in Fig. 2, the processor streams previous and current image blocks into 64 on-chip rotating image buffers (20 Kb each, Fig. 3). Pixel matching is performed using census transformed pixels (80b each) of the current vs. previous (optical flow) or left vs. right frames (depth). Rather than exhaustively evaluating all candidates, each pixel selectively evaluates 60 candidates guided by previously processed neighboring pixels plus 4 random positions (64 total). This produces 64 ‘local’ matching costs (7b each) for the 64 optical flow/depth vectors. The processor then aggregates the local matching costs along 8 paths. A subset of optical flow/depth vectors with the minimum aggregated cost for each path are passed to next processing pixel as flow/depth candidates. The minimum of combined aggregated costs is selected as the final optical flow/depth. To eliminate the need for off-chip DRAM and to ensure scalability to different image resolutions, the proposed design processes the input in units of 88×88 overlapping blocks. Adjacent blocks are overlapped by 8 pixels to allow candidate propagation across block boundaries [2,4]. However, block-based processing incurs significant memory inefficiency on optical flow computation because, unlike 1D disparity matching, the same blocks are read multiple times for 2D matching to process different parts of the image. Therefore, we propose a rotating on-chip image buffer scheme with 64 SRAMs (Fig. 4) to maximize on-chip memory reuse and reduce interface bandwidth by 2× at the cost of 28% larger on-chip memory size. Furthermore, processing the image in the conventional raster scan order results in a severe dependency for NG-fSGM where the previous pixel processing must be completed to pass candidates to the adjacent pixel (Fig. 5). This issue is even more critical in NG-fSGM where pixel processing has variable latency. Inspired by [2], we process
pixels diagonally to hide the variable length dependency in the pipeline, achieving better frequency and voltage scalability from a shorter critical path. Although NG-fSGM aggressively prunes the search space, it still requires a tremendous memory bandwidth of >2Tb/s with irregular and unpredictable memory access patterns (Fig. 2). This motivates a new multi-bank memory access with a coalescing crossbar discussed below. Once pixel data is fetched, NG-fSGM computation can be highly parallelized. Therefore, the proposed 6D vision processor is partitioned into 3 clock domains and 2 voltage domains to optimize and balance compute vs. memory access performance and power (Fig. 3). Image access, census transform, and local cost computation are in the (Fhigh, Vhigh) domain to improve throughput, while cost aggregation and flow selection are performed with (Flow, Vlow) to improve energy efficiency. Fig. 7 shows that crossbar and memory accesses account for ~62% of overall power (simulation).
High Bandwidth, Coalescing Xpoint Crossbar Fig. 6 (top) shows the proposed architecture and timing of the custom-designed high-bandwidth coalescing xbar. Fig. 6 (bottom) details the circuits of a cross point (Xpoint). In the proposed design, computing census transform of 64 flow candidates on-the-fly requires 2.6Tb/s random memory access. To provide such a high throughput with an area-/power-efficient solution, we use 4 custom 136b/word, 2-cycle pipelined coalescing xbars that enable 4.08Tb/s peak memory bandwidth. NG-fSGM memory accesses are highly irregular but neighboring pixels tend to generate overlapping accesses. Only ~55 out of 162 accesses are unique on average (simulation). Our crossbar detects and arbitrates memory access (query) collisions at each Xpoint, and it broadcasts memory data and address to queries that lost arbitration. In the first cycle, each Xpoint decides which query wins or loses arbitration by examining neighboring arbitration lines and it launches the access address for the winning query. In second cycle, both winning query and losing queries receive the broadcasted data and address. If a query in the queue has a matching address, the read data is consumed and the query is removed from the queue (to avoid redundant memory accesses). This coalescing yields 54% performance improvement. Fig. 8 compares different xbar design choices. The proposed four 16×16 coalescing xbars achieve an access throughput comparable to that of a single 64×64 xbar while achieving 3.1× reduction in power and 3.4× reduction in area. Overall, this approach enables 2.6Tb/s average bandwidth from 4 xbars.
Measurement Results and Comparisons The processor was fabricated in TSMC 28nm CMOS (Fig. 9). Real-time images are block-partitioned and streamed to the processor through a USB3.0 interface. Figs. 10 and 11 show the measured optical flow and depth results for the KITTI automotive and JPL-captured MAV benchmarks, respectively. Fig. 12 shows the measured throughput and energy efficiency trade-off with varying Fhigh. The optimal energy point occurs at Fhigh≈2.7×Flow. Fig. 13 shows the voltage and frequency scaling behavior of the chip. Table 1 and Fig. 15 provide comparison to prior works. The proposed processor consumes 760mW to process optical flow/depth on 25/30fps FHD images. Normalized energy [6] (Fig. 15) on depth @ 30fps FHD images is 0.069 nJ/pixel, marking >1.5× improvement over prior arts specialized for depth processing only. Normalized energy on optical flow @ 25fps FHD is 0.0048, yielding additional 14.3× improvement while all prior work is unable to support optical flow computation.
Acknowledgement We thank TSMC for chip fabrication, and JPL for MAV benchmark.
References [1] Hirschmuller, CVPR, pp.807-814, 2005 [2] Li, ISSCC, pp. 62-63, 2017 [5] Xiang, SiPS, pp. 1-6, 2016 [3] Lee, ISSCC, pp. 256-257, 2016 [6] Chen, ISSCC, pp. 422-423, 2015 [4] Xiang, ICIP, pp. 4483–4487, 2016
2018 Symposium on VLSI Circuits Digest of Technical Papers978-1-5386-4214-6/18/$31.00 ©2018 IEEE 135
9.52%1.79%
3.47%
10.91%
12%
14.19%
48.12%
xbar mem access census transform clock aggregation selection rest
Fig. 7 Simulated power break down
50 100 150 200 250 300
60
80
100
120
140
160
180
200
220
Flow (MHz)
Th
roug
hp
ut (
fps
@ V
GA
)
0.06
0.08
0.10
0.12
0.14
0.16
0.18
0.20
E
nerg
y E
ffic
ien
cy (
nJ/
pix
el)
Fig. 12 Throughput & efficiency vs Flow when Fhigh = 300MHz
ThroughputEnergy Efficiency
Fig. 9 Die photo
-88
-88
+88
+88
0
88
Optical flow
Depth
Rawimage
USB/FIFO Protocol & interface Clock
Xbar & image buffer
Local cost generation
Aggregation & disparity selection
3.1m
m
3.0mm
Fig. 11 Chip measurement results on KITTI automotive scenes
Fig. 4 Rotating on-chip image memory management scheme
0
0
2
3
4
5
6
7
8
9
10
11 15
14
13
12 16
17
18
19
0
0
2
3
4
5
6
7
8
9
10
11 15
14
13
12 16
17
18
19
20
21
0
0
2
3
4
5
6
7
8
9
10
11 15
14
13
12 16
17
t0 t1 t2
Unused block Flow search blocks Discarded blockFetch block
*Four image buffers form a block
0.7 0.8 0.9 1.0 1.1 1.2 1.3
20
40
60
80
100
120
140
160
180
200
220
Throughput Efficiency
Vhigh / V
Th
roug
hp
ut
(fp
s @
VG
A)
0.02
0.03
0.04
0.05
0.06
0.07
E
ne
rgy
Eff
icie
ncy
(n
J/p
ixel
)Fhigh: 500MHzFlow: 180MHz
Fhigh: 300MHzFlow: 90MHz
Fhigh: 225MHzFlow: 90MHz
Fhigh: 90MHzFlow: 55MHz
Fhigh: 70MHzFlow: 55MHz
Fig. 13 Measured voltage/frequency scaling
Forward aggregation
(4 path, proposed)Latencyhidden
B
C
D
E
FA
Current pixel Buffered Finished Critical path Scan order Data dependency
1~7 cycles variable latency(depending on access addresses)
Fig. 5 Variable critical path hidden diagonal scan
Optical Flow
Depth
Table. 1 Comparison with state-of-the-art vision chips
ISSCC 2015[6] ISSCC 2016[3]
Method
Technology
Chip area
On-chip memory
Frequency
Throughput & image size
Accuracy (Outlier %)
Operating voltage
Power
5-View BP Truncated SGM NG-SGM40nm 65nm 28nm
22mm2 16mm2 9.3mm2
352Kb 3946.9Kb 1568Kb215MHz 250MHz Flow: 180MHz, Fhigh: 500MHz
1920 X 108030fps
1280 X 72030fps
1920 X 1080 @ 25fps (flow)1920 X 1080 @ 30fps (depth)
4.7% @ Middlebury
0.9 V 1.2V Vlow: 0.9V Vhigh: 1.3V @ 30 fps HDVlow: 0.58V Vhigh: 0.75V @ 30 fps VGA
1019mW(includes DRAM power)
582mW (excludes DRAM power)
760mW @ 25 fps HD (flow) / 30 fps HD (depth) 62 mW @ 30 fps VGA (flow )/ 40 fps VGA (depth)
Normalized energy*(depth) 0.153nJ 0.329nJ
0.069nJ @ 30 fps HD (depth)0.029nJ @ 30 fps VGA (depth)
SGM
40nm
10.8mm2
1064Kb
170MHz1920 X 1080
30fps
Not reported due to limited depth range0.75V @ 30 fps HD
0.52V @ 30 fps VGA836mW @ 30 fps HD55 mW @ 30 fps VGA
0.104nJ @ 30 fps HD0.047nJ @ 30 fps VGA
Search Range 1x64 1x64 176x176=309761x1287% @ KITTI
Normalized energy*(Optical flow)
0.0048nJ @ 25 fps HD (flow)0.0022nJ @ 30 fps VGA (flow)Optical flow NOT supported
This workISSCC 2017[2]
Stereo depth only Depth & Optical flow
Normalized energy [6]
nJSearch range • # of pixel=
Curr(Right)
Neighbor-Guided candidatesRandom candidates
Fig. 2 Datapath of the proposed NG-SGM 6D vision algorithm
Processing block
Prev(Left)
Repeat 8 timesto get aggregated costs
on 8 paths
Image fetch & scheduling
Flow/depth selection
Summation of aggregated
costs on 8 paths
Min selection
Depth of one pixel
Search block
Curr(Right)88 x 3
Top K minbecomes center of
flow candidates
orFlow of one
pixelPrev(Left)
aggregation over
sparse candidates
Prev(Left)
64 sparse candidates
Local matching costs,99% sparsity
Neighbor-Guidedflow search
To do census transform:Need to access 1824 pixels
at random locations!
Pixel-wise cost generation
Cost aggregation
Curr(Right)
Prev(Left)
9
9 10
1015 ×
9
9
Prev(Left)
Curr(Right)
Census transform Census transform
81 bits: 101...10Hamming distance
on 64 flow candidates per pixel
15 ×
4 ×
4 ×
Candidates from neighbors
Randomcandidate
s
88 x 3
Interface Local cost generation NG-SGM aggregation Interface
Vlow, Finterface
Path buffer (12 kb)
Prot
ocol
&bu
s co
ntro
l
Chip boundary
Resu
lt bu
ffer
Resu
lt bu
ffer
Prot
ocol
&bu
s con
trolUSB/
32b fifo
USB/32b fifo
1.9Gb/s
1.6Gb/s
96bits
Path buffer (12 kb)
Path buffer (12 kb)
Path buffer (12 kb)
Aggregationunit
Kth minselection
Aggregationunit
Kth minselection
Aggregationunit
Kth minselection
Aggregationunit
Kth minselection
Glo
bal a
ggre
gatio
n un
it
Glo
bal K
th m
in se
lect
ion
Asyn
chro
nous
loca
l cos
t fifo…...
7 bits
128
bits
960bits
960bits
16 × 16xbar
16 × 16xbar
16 × 16xbar
16 × 16xbar
…...
Addr mux
Census transform
& local cost
Data mux448bits
8192bits
Image read addr
generation
420bits
asyn
chro
nous
addr
fifo
Image buffers (20 kb × 64)
2.4Tbs/s access BW
16bits
128bits
Vlow, Fcore Vhigh, Fxbar SRAM Proposed coalescing-resolving xbar
240 bits
Fig. 3 NG-fSGM Chip block diagram with partition of various voltage/frequency domains
Fig. 1 Optical flow and depth estimation on autonomous MAVs
CameraMotion
disparity =
Optical flow =
2D search, track object/camera motion
1D search, estimate object depth
Coordinate Motion
Optical flow
Drone with visual navigation
Depth mapOptical Flow map far
close
Stereo image stream (>30fps HD/720p/VGA)
0t
1t 0 1t t−1 0 1 0( , )x x y y− −
0 0x x′ −
' ' '( , , , , , )P X Y Z X Y Z
00
0
(,
)
px
y
'
00
0
(,
)
qx
y
1 1 1( , )p x y '1 1 1( , )q x y
Left, Previous Right, Previous
time
Left, Current Right, Current
Fig. 10 Chip measurement results on captured MAV scenes
Optical Flow
Depth
Fig. 6 Block diagram, timing of the coalescing Xbar and schematic of a Xpoint
Timing control
Row SA& Peripheral...
...
...
...
... ...
Row SA& Peripheral
Row SA& Peripheral
...
Column Peripheral & Driver
Column Peripheral & Driver
Mem 0 Mem 2 Mem 14...
Mem 1 Mem 3 Mem 1516×16 mesh
...
Query queue 0
Query queue 1
Query queue 15
...
...
...
Xpoint Xpoint Xpoint
XpointXpointXpoint
Xpoint Xpoint Xpoint
Q D
Q D
Q D
DATA_MEM
ARB_EN_bar
IN_bar
ARB_EN_bar
IN_bar
× 16
BL for query 15
800n 800n
ARB BL for query 0
DATA_EN_xpoint_bar
DATA_MEM_bar800n
DATA_REPLYDATA
× 128
ADDR_ENLOSE _LINE ARB_CURR
WIN _LINE1200n
1200n
ADDR_EN_xpoint
ADDR_REPLY_EN
ADDR_EN
ADDR_SA_EN
ARB_SA_EN
Q DDATA_EN CLK
DATA_ENxpoint
Control
CLK
× 8
ADDR_EN_xpoint_bar
ADDR REPLY_ENREPLY_ADDR
ADDR_BL
1200n
1200n
ADDR_bar
ADDR
Check Coalescing
BroadcastAddress
BroadcastData
Q D
Q D
...
2 cycle pipelined xbarCLK
PCH
ARB
MEM Tcd
PCH
SA
Addr
PCH
Data
SA
Addr
SA
PCH
Tsetup
SA
ARB
PCH
ARB
ADDR
DATA
2ns
0.00
0.05
0.10
0.15
0.20
0.25
0.30
0.35
16
64
256
1k
4k
16k
Fig. 15 Measured energy efficiency compared with prior arts @ HD
No
rmal
ized
ene
rgy
(nJ)
0.0048
0.069
0.104
0.153
0.329
64 64
128176
30976
This work, flow, HD
This work, depth, HD
[3] [6] [2]
Normalized energy [6]
nJSearch range • # of pixel=
Search Range
Normalized energy
Sea
rch
Ra
ng
e (#
of
pix
els
)
Fig. 8 Simulated avg/worst case # of xbar cycles/pixel
5.49
3.61 3.552.44
17
10 10 9
0
2
4
6
8
10
12
14
16
18
# o
f xb
ar a
cce
ss
cyc
les
Design choice
Avg # of cyclesWorst case
Without Coalescing
With Coalescing
2018 Symposium on VLSI Circuits Digest of Technical Papers 136