A 1920 × 1080 25fps, 2.4TOPS/W Unified Optical Flow and Depth …blaauw.engin.umich.edu/wp-content/uploads/sites/342/2018/... · 2019. 9. 10. · A 1920 × 1080 25fps, 2.4TOPS/W

A 1920 × 1080 25fps, 2.4TOPS/W Unified Optical Flow and Depth 6D Vision Processor for Energy-Efficient, Low Power Autonomous Navigation

Ziyun Li, Jingcheng Wang, Dennis Sylvester, David Blaauw, and Hun-Seok Kim University of Michigan, Ann Arbor, MI, [email protected]

Abstract This paper presents a unified 6D vision (3D coordinate + 3D motion) processor that is the first to achieve real-time dense optical flow and stereo depth for full HD (1920×1080, FHD) resolution. The proposed design is also the first ASIC that accelerates real-time optical flow computation. It supports a wide search range (176×176 pixels) to enable dense optical flow on FHD images with real-time 25fps/30fps (flow/depth) throughput, consuming 760mW in 28nm CMOS.

Introduction Autonomous navigation of micro aerial vehicles (MAVs) and self-driving cars (Fig. 1) requires real-time accurate and dense perception of 3D coordinates (depth) and 3D apparent motion (optical flow). The semi-global matching (SGM) [1] algorithm is widely used [2,3] in real-time stereo depth estimation and achieves high accuracy by applying a semi-global optimization over the entire image. However, directly applying SGM to optical flow is impractical for real-time applications because each pixel requires evaluation of d2 (= 31k for d = 176 in our chip) flow candidates as well as SGM on each matching candidate. Recently, a variant of SGM for optical flow, NG-fSGM [4], was proposed to avoid the quadratic complexity by performing neighbor-guided aggressive candidate pruning. However, NG-fSGM still requires a large memory footprint (~4MB) with massive memory access bandwidth (2.6Tb/s). Because of aggressive neighbor guided pruning and randomly added candidates to the search space [1], the memory access pattern is highly irregular and inter-pixel dependency has variable latency. Addressing these challenges, the proposed design uses: 1) four new custom designed 16×16 (136b/access) coalescing crossbar (xbar) switches that efficiently eliminate redundant memory accesses with built-in arbitration at each cross point for 4.08Tb/s peak on-chip memory bandwidth; 2) a new rotating buffer to maximize on-chip memory reuse for a wide 2D search range; 3) a diagonal image-scanning stride to resolve the variable length inter-pixel dependency in the deep pipeline.

Proposed Processor Architecture As shown in Fig. 2, the processor streams previous and current image blocks into 64 on-chip rotating image buffers (20 Kb each, Fig. 3). Pixel matching is performed using census transformed pixels (80b each) of the current vs. previous (optical flow) or left vs. right frames (depth). Rather than exhaustively evaluating all candidates, each pixel selectively evaluates 60 candidates guided by previously processed neighboring pixels plus 4 random positions (64 total). This produces 64 ‘local’ matching costs (7b each) for the 64 optical flow/depth vectors. The processor then aggregates the local matching costs along 8 paths. A subset of optical flow/depth vectors with the minimum aggregated cost for each path are passed to next processing pixel as flow/depth candidates. The minimum of combined aggregated costs is selected as the final optical flow/depth. To eliminate the need for off-chip DRAM and to ensure scalability to different image resolutions, the proposed design processes the input in units of 88×88 overlapping blocks. Adjacent blocks are overlapped by 8 pixels to allow candidate propagation across block boundaries [2,4]. However, block-based processing incurs significant memory inefficiency on optical flow computation because, unlike 1D disparity matching, the same blocks are read multiple times for 2D matching to process different parts of the image. Therefore, we propose a rotating on-chip image buffer scheme with 64 SRAMs (Fig. 4) to maximize on-chip memory reuse and reduce interface bandwidth by 2× at the cost of 28% larger on-chip memory size. Furthermore, processing the image in the conventional raster scan order results in a severe dependency for NG-fSGM where the previous pixel processing must be completed to pass candidates to the adjacent pixel (Fig. 5). This issue is even more critical in NG-fSGM where pixel processing has variable latency. Inspired by [2], we process

pixels diagonally to hide the variable length dependency in the pipeline, achieving better frequency and voltage scalability from a shorter critical path. Although NG-fSGM aggressively prunes the search space, it still requires a tremendous memory bandwidth of >2Tb/s with irregular and unpredictable memory access patterns (Fig. 2). This motivates a new multi-bank memory access with a coalescing crossbar discussed below. Once pixel data is fetched, NG-fSGM computation can be highly parallelized. Therefore, the proposed 6D vision processor is partitioned into 3 clock domains and 2 voltage domains to optimize and balance compute vs. memory access performance and power (Fig. 3). Image access, census transform, and local cost computation are in the (Fhigh, Vhigh) domain to improve throughput, while cost aggregation and flow selection are performed with (Flow, Vlow) to improve energy efficiency. Fig. 7 shows that crossbar and memory accesses account for ~62% of overall power (simulation).

High Bandwidth, Coalescing Xpoint Crossbar Fig. 6 (top) shows the proposed architecture and timing of the custom-designed high-bandwidth coalescing xbar. Fig. 6 (bottom) details the circuits of a cross point (Xpoint). In the proposed design, computing census transform of 64 flow candidates on-the-fly requires 2.6Tb/s random memory access. To provide such a high throughput with an area-/power-efficient solution, we use 4 custom 136b/word, 2-cycle pipelined coalescing xbars that enable 4.08Tb/s peak memory bandwidth. NG-fSGM memory accesses are highly irregular but neighboring pixels tend to generate overlapping accesses. Only ~55 out of 162 accesses are unique on average (simulation). Our crossbar detects and arbitrates memory access (query) collisions at each Xpoint, and it broadcasts memory data and address to queries that lost arbitration. In the first cycle, each Xpoint decides which query wins or loses arbitration by examining neighboring arbitration lines and it launches the access address for the winning query. In second cycle, both winning query and losing queries receive the broadcasted data and address. If a query in the queue has a matching address, the read data is consumed and the query is removed from the queue (to avoid redundant memory accesses). This coalescing yields 54% performance improvement. Fig. 8 compares different xbar design choices. The proposed four 16×16 coalescing xbars achieve an access throughput comparable to that of a single 64×64 xbar while achieving 3.1× reduction in power and 3.4× reduction in area. Overall, this approach enables 2.6Tb/s average bandwidth from 4 xbars.

Measurement Results and Comparisons The processor was fabricated in TSMC 28nm CMOS (Fig. 9). Real-time images are block-partitioned and streamed to the processor through a USB3.0 interface. Figs. 10 and 11 show the measured optical flow and depth results for the KITTI automotive and JPL-captured MAV benchmarks, respectively. Fig. 12 shows the measured throughput and energy efficiency trade-off with varying Fhigh. The optimal energy point occurs at Fhigh≈2.7×Flow. Fig. 13 shows the voltage and frequency scaling behavior of the chip. Table 1 and Fig. 15 provide comparison to prior works. The proposed processor consumes 760mW to process optical flow/depth on 25/30fps FHD images. Normalized energy [6] (Fig. 15) on depth @ 30fps FHD images is 0.069 nJ/pixel, marking >1.5× improvement over prior arts specialized for depth processing only. Normalized energy on optical flow @ 25fps FHD is 0.0048, yielding additional 14.3× improvement while all prior work is unable to support optical flow computation.

Acknowledgement We thank TSMC for chip fabrication, and JPL for MAV benchmark.

References [1] Hirschmuller, CVPR, pp.807-814, 2005 [2] Li, ISSCC, pp. 62-63, 2017 [5] Xiang, SiPS, pp. 1-6, 2016 [3] Lee, ISSCC, pp. 256-257, 2016 [6] Chen, ISSCC, pp. 422-423, 2015 [4] Xiang, ICIP, pp. 4483–4487, 2016

2018 Symposium on VLSI Circuits Digest of Technical Papers978-1-5386-4214-6/18/$31.00 ©2018 IEEE 135

9.52%1.79%

3.47%

10.91%

12%

14.19%

48.12%

xbar mem access census transform clock aggregation selection rest

Fig. 7 Simulated power break down

50 100 150 200 250 300

60

80

100

120

140

160

180

200

220

Flow (MHz)

Th

roug

hp

ut (

fps

@ V

GA

)

0.06

0.08

0.10

0.12

0.14

0.16

0.18

0.20

E

nerg

y E

ffic

ien

cy (

nJ/

pix

el)

Fig. 12 Throughput & efficiency vs Flow when Fhigh = 300MHz

ThroughputEnergy Efficiency

Fig. 9 Die photo

-88

-88

+88

+88

0

88

Optical flow

Depth

Rawimage

USB/FIFO Protocol & interface Clock

Xbar & image buffer

Local cost generation

Aggregation & disparity selection

3.1m

m

3.0mm

Fig. 11 Chip measurement results on KITTI automotive scenes

Fig. 4 Rotating on-chip image memory management scheme

0

0

2

3

4

5

6

7

8

9

10

11 15

14

13

12 16

17

18

19

0

0

2

3

4

5

6

7

8

9

10

11 15

14

13

12 16

17

18

19

20

21

0

0

2

3

4

5

6

7

8

9

10

11 15

14

13

12 16

17

t0 t1 t2

Unused block Flow search blocks Discarded blockFetch block

*Four image buffers form a block

0.7 0.8 0.9 1.0 1.1 1.2 1.3

20

40

60

80

100

120

140

160

180

200

220

Throughput Efficiency

Vhigh / V

Th

roug

hp

ut

(fp

s @

VG

A)

0.02

0.03

0.04

0.05

0.06

0.07

E

ne

rgy

Eff

icie

ncy

(n

J/p

ixel

)Fhigh: 500MHzFlow: 180MHz

Fhigh: 300MHzFlow: 90MHz




Fig. 13 Measured voltage/frequency scaling

Forward aggregation

(4 path, proposed)Latencyhidden

B

C

D

E

FA

Current pixel Buffered Finished Critical path Scan order Data dependency

1~7 cycles variable latency(depending on access addresses)

Fig. 5 Variable critical path hidden diagonal scan

Optical Flow

Depth

Table. 1 Comparison with state-of-the-art vision chips

ISSCC 2015[6] ISSCC 2016[3]

Method

Technology

Chip area

On-chip memory

Frequency

Throughput & image size

Accuracy (Outlier %)

Operating voltage

Power

5-View BP Truncated SGM NG-SGM40nm 65nm 28nm

22mm2 16mm2 9.3mm2

352Kb 3946.9Kb 1568Kb215MHz 250MHz Flow: 180MHz, Fhigh: 500MHz

1920 X 108030fps

1280 X 72030fps

1920 X 1080 @ 25fps (flow)1920 X 1080 @ 30fps (depth)

4.7% @ Middlebury

0.9 V 1.2V Vlow: 0.9V Vhigh: 1.3V @ 30 fps HDVlow: 0.58V Vhigh: 0.75V @ 30 fps VGA

1019mW(includes DRAM power)

582mW (excludes DRAM power)

760mW @ 25 fps HD (flow) / 30 fps HD (depth) 62 mW @ 30 fps VGA (flow )/ 40 fps VGA (depth)

Normalized energy*(depth) 0.153nJ 0.329nJ

0.069nJ @ 30 fps HD (depth)0.029nJ @ 30 fps VGA (depth)

SGM

40nm

10.8mm2

1064Kb

170MHz1920 X 1080

30fps

Not reported due to limited depth range0.75V @ 30 fps HD

0.52V @ 30 fps VGA836mW @ 30 fps HD55 mW @ 30 fps VGA

0.104nJ @ 30 fps HD0.047nJ @ 30 fps VGA

Search Range 1x64 1x64 176x176=309761x1287% @ KITTI

Normalized energy*(Optical flow)

0.0048nJ @ 25 fps HD (flow)0.0022nJ @ 30 fps VGA (flow)Optical flow NOT supported

This workISSCC 2017[2]

Stereo depth only Depth & Optical flow

Normalized energy [6]

nJSearch range • # of pixel=

Curr(Right)

Neighbor-Guided candidatesRandom candidates

Fig. 2 Datapath of the proposed NG-SGM 6D vision algorithm

Processing block

Prev(Left)

Repeat 8 timesto get aggregated costs

on 8 paths

Image fetch & scheduling

Flow/depth selection

Summation of aggregated

costs on 8 paths

Min selection

Depth of one pixel

Search block

Curr(Right)88 x 3

Top K minbecomes center of

flow candidates

orFlow of one

pixelPrev(Left)

aggregation over

sparse candidates

Prev(Left)

64 sparse candidates

Local matching costs,99% sparsity

Neighbor-Guidedflow search

To do census transform:Need to access 1824 pixels

at random locations!

Pixel-wise cost generation

Cost aggregation

Curr(Right)

Prev(Left)

9

9 10

1015 ×

9

9

Prev(Left)

Curr(Right)

Census transform Census transform

81 bits: 101...10Hamming distance

on 64 flow candidates per pixel

15 ×

4 ×

4 ×

Candidates from neighbors

Randomcandidate

s

88 x 3

Interface Local cost generation NG-SGM aggregation Interface

Vlow, Finterface

Path buffer (12 kb)

Prot

ocol

&bu

s co

ntro

l

Chip boundary

Resu

lt bu

ffer

Resu

lt bu

ffer

Prot

ocol

&bu

s con

trolUSB/

32b fifo

USB/32b fifo

1.9Gb/s

1.6Gb/s

96bits

Path buffer (12 kb)

Path buffer (12 kb)

Path buffer (12 kb)

Aggregationunit

Kth minselection

Aggregationunit

Kth minselection

Aggregationunit

Kth minselection

Aggregationunit

Kth minselection

Glo

bal a

ggre

gatio

n un

it

Glo

bal K

th m

in se

lect

ion

Asyn

chro

nous

loca

l cos

t fifo…...

7 bits

128

bits

960bits

960bits

16 × 16xbar

16 × 16xbar

16 × 16xbar

16 × 16xbar

…...

Addr mux

Census transform

& local cost

Data mux448bits

8192bits

Image read addr

generation

420bits

asyn

chro

nous

addr

fifo

Image buffers (20 kb × 64)

2.4Tbs/s access BW

16bits

128bits

Vlow, Fcore Vhigh, Fxbar SRAM Proposed coalescing-resolving xbar

240 bits

Fig. 3 NG-fSGM Chip block diagram with partition of various voltage/frequency domains

Fig. 1 Optical flow and depth estimation on autonomous MAVs

CameraMotion

disparity =

Optical flow =

2D search, track object/camera motion

1D search, estimate object depth

Coordinate Motion

Optical flow

Drone with visual navigation

Depth mapOptical Flow map far

close

Stereo image stream (>30fps HD/720p/VGA)

0t

1t 0 1t t−1 0 1 0( , )x x y y− −

0 0x x′ −

' ' '( , , , , , )P X Y Z X Y Z

00

0

(,

)

px

y

'

00

0

(,

)

qx

y

1 1 1( , )p x y '1 1 1( , )q x y

Left, Previous Right, Previous

time

Left, Current Right, Current

Fig. 10 Chip measurement results on captured MAV scenes

Optical Flow

Depth

Fig. 6 Block diagram, timing of the coalescing Xbar and schematic of a Xpoint

Timing control

Row SA& Peripheral...

...

...

...

... ...

Row SA& Peripheral

Row SA& Peripheral

...

Column Peripheral & Driver

Column Peripheral & Driver

Mem 0 Mem 2 Mem 14...

Mem 1 Mem 3 Mem 1516×16 mesh

...

Query queue 0

Query queue 1

Query queue 15

...

...

...

Xpoint Xpoint Xpoint

XpointXpointXpoint

Xpoint Xpoint Xpoint

Q D

Q D

Q D

DATA_MEM

ARB_EN_bar

IN_bar

ARB_EN_bar

IN_bar

× 16

BL for query 15

800n 800n

ARB BL for query 0

DATA_EN_xpoint_bar

DATA_MEM_bar800n

DATA_REPLYDATA

× 128

ADDR_ENLOSE _LINE ARB_CURR

WIN _LINE1200n

1200n

ADDR_EN_xpoint

ADDR_REPLY_EN

ADDR_EN

ADDR_SA_EN

ARB_SA_EN

Q DDATA_EN CLK

DATA_ENxpoint

Control

CLK

× 8

ADDR_EN_xpoint_bar

ADDR REPLY_ENREPLY_ADDR

ADDR_BL

1200n

1200n

ADDR_bar

ADDR

Check Coalescing

BroadcastAddress

BroadcastData

Q D

Q D

...

2 cycle pipelined xbarCLK

PCH

ARB

MEM Tcd

PCH

SA

Addr

PCH

Data

SA

Addr

SA

PCH

Tsetup

SA

ARB

PCH

ARB

ADDR

DATA

2ns

0.00

0.05

0.10

0.15

0.20

0.25

0.30

0.35

16

64

256

1k

4k

16k

Fig. 15 Measured energy efficiency compared with prior arts @ HD

No

rmal

ized

ene

rgy

(nJ)

0.0048

0.069

0.104

0.153

0.329

64 64

128176

30976

This work, flow, HD

This work, depth, HD

[3] [6] [2]

Normalized energy [6]

nJSearch range • # of pixel=

Search Range

Normalized energy

Sea

rch

Ra

ng

e (#

of

pix

els

)

Fig. 8 Simulated avg/worst case # of xbar cycles/pixel

5.49

3.61 3.552.44

17

10 10 9

0

2

4

6

8

10

12

14

16

18

# o

f xb

ar a

cce

ss

cyc

les

Design choice

Avg # of cyclesWorst case

Without Coalescing

With Coalescing

2018 Symposium on VLSI Circuits Digest of Technical Papers 136

Documents

A 1920 × 1080 25fps, 2.4TOPS/W Unified Optical Flow and Depth …blaauw.engin.umich.edu/wp-content/uploads/sites/342/2018/... · 2019. 9. 10. · A 1920 × 1080 25fps, 2.4TOPS/W