
Zhao Chen

Machine Learning Intern, NVIDIA

JOINT DETECTION AND SEGMENTATION WITH DEEP HIERARCHICAL NETWORKS

2

ABOUT ME

• 5th-year PhD student in physics @ Stanford by day, deep learning computer vision scientist by night.

• Intern with Deep Learning Applied Research (Autonomous Vehicles) @ NVIDIA, Oct-Dec 2016.

Zhao Chen, Joint Detection and Segmentation with Deep Hierarchical Networks, GTC 2017.

3

TALK OVERVIEW

(1) Problem statement and summary.

(2) Dataset and preliminaries.

(3) Model motivation.

(4) Results and visualizations.








5

FROM SINGLE TO MULTITASK LEARNING

Putting deep learning to work in the real world

[Diagram: Detection Model -> Object Bounding Boxes; Segmentation Model -> Segmentation Mask]

6

FROM SINGLE TO MULTITASK LEARNING

Putting deep learning to work in the real world

[Diagram: Detection Model and Segmentation Model as two separate networks; Segmentation Mask]

Poor scalability + inefficient use of information!

7

FROM SINGLE TO MULTITASK LEARNING

Putting deep learning to work in the real world

How do we use one model to perform multiple tasks faster and better?

[Diagram: Shared Model -> Object Bounding Boxes, Segmentation Mask]

8




[Diagram: Shared Model -> Segmentation Mask]

+ edge detection, + surface normals, + distance estimation…

9




[Diagram: Shared Model -> Segmentation Mask]

How do you relate various tasks to each other in a multi-task neural network?

10

WHAT WE WILL SHOW

• By ordering tasks based on required receptive field and information density, we improve segmentation and detection accuracy by ~2% and ~8%, respectively, over single-task networks.

• The joint network is robust and easy to tune compared to non-hierarchical baselines.








12

CITYSCAPES DATASET

• 2975 training images @ resolution 1024 x 2048.

• 20 classes for semantic segmentation, including 8 object classes. Of these 8, 4 are much better represented (car, bicycle, person, rider): the “easy classes.”

• Segmentation, bounding box, and edge ground truth can all be generated.

[Figure panels: Raw Image, Edge Detection, Semantic Seg., Bounding Box]
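A minimal sketch of how bounding box and edge ground truth might be derived from label maps. It assumes an instance-ID map (0 = no object) and a semantic label map stored as nested lists. The function names and the simple neighbour-difference edge rule are illustrative assumptions, not the procedure used in the talk.

```python
def boxes_from_instances(inst):
    """Return {instance_id: (x_min, y_min, x_max, y_max)} for every id > 0."""
    boxes = {}
    for y, row in enumerate(inst):
        for x, instance_id in enumerate(row):
            if instance_id == 0:  # assumption: 0 means "no object"
                continue
            if instance_id not in boxes:
                boxes[instance_id] = (x, y, x, y)
            else:
                x0, y0, x1, y1 = boxes[instance_id]
                boxes[instance_id] = (min(x0, x), min(y0, y),
                                      max(x1, x), max(y1, y))
    return boxes


def edges_from_labels(seg):
    """Mark a pixel as edge (1) if its right or lower neighbour has a different label."""
    h, w = len(seg), len(seg[0])
    edge = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if x + 1 < w and seg[y][x] != seg[y][x + 1]:
                edge[y][x] = 1
            if y + 1 < h and seg[y][x] != seg[y + 1][x]:
                edge[y][x] = 1
    return edge


# Toy label maps (hypothetical data, not taken from Cityscapes).
instance_map = [[0, 1, 1, 0],
                [0, 1, 1, 2],
                [0, 0, 0, 2]]
semantic_map = [[0, 0, 1],
                [0, 0, 1]]
toy_boxes = boxes_from_instances(instance_map)
toy_edges = edges_from_labels(semantic_map)
```

Real pipelines would typically operate on image arrays rather than nested lists; the logic is the same.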

13

HOW TO TRAIN A SEGMENTATION NETWORK

• Standard FCN (Shelhamer 2015) architecture: convolutions followed by a deconvolution to retrieve a pixel-dense prediction mask.
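To make the deconvolution step concrete, here is a small sketch of a 2D transposed convolution written in plain Python. The kernel, stride, and input below are assumptions chosen for illustration; the talk does not specify the layer parameters.

```python
def conv_transpose2d(x, kernel, stride):
    """Transposed convolution: each input value stamps a scaled copy of the
    kernel onto the output, stepping `stride` pixels per input pixel."""
    h, w = len(x), len(x[0])
    kh, kw = len(kernel), len(kernel[0])
    out_h = (h - 1) * stride + kh
    out_w = (w - 1) * stride + kw
    out = [[0.0] * out_w for _ in range(out_h)]
    for i in range(h):
        for j in range(w):
            for a in range(kh):
                for b in range(kw):
                    out[i * stride + a][j * stride + b] += x[i][j] * kernel[a][b]
    return out


# A 2x2 low-resolution map upsampled to 4x4 with an all-ones 2x2 kernel
# (this particular kernel reproduces nearest-neighbour upsampling).
low_res = [[1, 2],
           [3, 4]]
upsampled = conv_transpose2d(low_res, [[1, 1], [1, 1]], stride=2)
```

In a trained FCN the kernel weights are learned, so the upsampling is not restricted to simple replication.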


14

HOW TO TRAIN A DETECTION NETWORK

• Network outputs confidence that a pixel lies near the center of an object.

• Points of high confidence produce bounding box coordinates.

• Confidences are rougher than full segmentation but robust to occlusion.
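One way to turn such outputs into detections is sketched below. It assumes a per-cell confidence grid and a matching grid of box coordinates, and keeps cells that pass a threshold and are local maxima. The threshold value, the 3x3 neighbourhood, and the function name are assumptions for illustration; the talk does not describe its decoding step.

```python
def decode_detections(conf, boxes, threshold=0.5):
    """Return [(confidence, box), ...] for cells that pass the threshold and
    are the maximum of their 3x3 neighbourhood, highest confidence first."""
    h, w = len(conf), len(conf[0])
    detections = []
    for y in range(h):
        for x in range(w):
            c = conf[y][x]
            if c < threshold:
                continue
            window = [conf[ny][nx]
                      for ny in range(max(0, y - 1), min(h, y + 2))
                      for nx in range(max(0, x - 1), min(w, x + 2))]
            if c >= max(window):
                detections.append((c, boxes[y][x]))
    detections.sort(key=lambda d: d[0], reverse=True)
    return detections


# Toy outputs (hypothetical values): one strong peak at the grid centre.
confidence = [[0.1, 0.2, 0.1],
              [0.2, 0.9, 0.3],
              [0.1, 0.3, 0.6]]
box_grid = [[None] * 3 for _ in range(3)]
box_grid[1][1] = (10, 20, 50, 60)  # (x_min, y_min, x_max, y_max)
box_grid[2][2] = (12, 22, 52, 62)
found = decode_detections(confidence, box_grid)
```

The 0.6 cell passes the threshold but is suppressed because its neighbour (0.9) is higher, which is a crude stand-in for non-maximum suppression.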








16

[Diagram: baseline joint model. Blocks: Input (1024 x 2048); Shared Feature Map (from base CNN); Low-Res Seg Predictions (W x H x 20); Deconv; Obj. Confidence Positions; Bbox Coordinate Positions]

L = α L_seg + (1 - α) L_det
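The weighted loss above can be written as a one-line function. This is only a sketch: the function name and the example loss values are assumptions, and the individual loss terms are treated as already-computed scalars.

```python
def joint_loss(loss_seg, loss_det, alpha):
    """Combine the two task losses: L = alpha * L_seg + (1 - alpha) * L_det."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * loss_seg + (1.0 - alpha) * loss_det


# Hypothetical loss values; alpha = 1 trains segmentation only,
# alpha = 0 trains detection only.
seg_only = joint_loss(2.0, 4.0, alpha=1.0)
det_only = joint_loss(2.0, 4.0, alpha=0.0)
mixed = joint_loss(2.0, 4.0, alpha=0.25)
```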

17

OUR BASELINE MODEL PERFORMANCE

(α controls how much attention we pay to segmentation vs detection at training)

[Plot: baseline model performance as α is varied; labels: Seg. Weight, Det. Weight, = α]




25

A LABEL HIERARCHY ALONG TWO AXES

[Diagram, built up over slides 25-28: tasks placed on two axes, Density of Information and Required Receptive Field. Labels, in the order they appear: Object Confidence; Semantic Segmentation; Object Bounding Boxes; Edge Detection, with a "(plus)" annotation]


29


[Diagram, built up over slides 29-31: task feature branches. Labels: Deconv; Segmentation Features; Obj. Confidence Features; Obj. BBox Features. Arrow: Decreasing information density]
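The ordering by information density suggests a chain of task heads in which each head can see the features of the denser task before it. The sketch below shows only that wiring pattern. The head functions are toy stand-ins, and the exact connections in the diagram are not recoverable from the slides, so every name here is hypothetical.

```python
def run_hierarchy(shared, heads):
    """Run task heads in order; each head receives the shared features and the
    features produced by the head before it (None for the first head)."""
    outputs = {}
    upstream = None
    for name, head in heads:
        upstream = head(shared, upstream)
        outputs[name] = upstream
    return outputs


def make_head(name):
    """Toy head that only records which inputs it was given."""
    def head(shared, upstream):
        source = "shared" if upstream is None else "shared+" + upstream["name"]
        return {"name": name, "inputs": source}
    return head


# Assumed order: decreasing information density, as labelled on the slide.
ORDER = ["seg", "obj_conf", "bbox"]
result = run_hierarchy("feature_map", [(n, make_head(n)) for n in ORDER])
```

In a real network each head would be a block of convolutions and `upstream` would be a feature tensor rather than a dictionary.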

32

[Diagram: Input (1024 x 2048); Shared Feature Map (from base CNN); Edge Features; Deconv; Low-Res Edge Predictions (W x H x 3); Deconv; Obj. BBox Features. Arrow: Decreasing information density]

33


[Diagram, built up over slides 33-35: Input (1024 x 2048); Edge Features; Deconv; Obj. BBox Features; an "X" marker. Arrow: Increasing receptive field]

36


[Diagram, built up over slides 36-37: Input (1024 x 2048); Edge Features; Deconv; Obj. BBox Features; Dilated Convs; Dilated Bbox Coordinate Positions. Arrow: Increasing receptive field]

Deep Hierarchical Network (DHM)
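To see why dilated convolutions increase the receptive field, the following sketch computes the receptive field of a stack of convolution layers. The kernel sizes and dilation rates are assumptions for illustration; the slides do not state the values used in the network.

```python
def receptive_field(layers):
    """Receptive field (in input pixels, along one axis) of stacked conv layers.

    `layers` is a list of (kernel_size, stride, dilation) tuples, input first.
    """
    rf = 1    # a single output pixel sees one input pixel before any layer
    jump = 1  # distance in input pixels between adjacent positions at this depth
    for kernel_size, stride, dilation in layers:
        rf += (kernel_size - 1) * dilation * jump
        jump *= stride
    return rf


# Three 3x3 convolutions, without and with dilation (hypothetical rates).
plain = receptive_field([(3, 1, 1), (3, 1, 1), (3, 1, 1)])
dilated = receptive_field([(3, 1, 1), (3, 1, 2), (3, 1, 4)])
```

With the same number of layers and weights, the dilated stack covers 15 input pixels per axis instead of 7.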








39

RESULTS: HIGH ROBUSTNESS



42

[Figure panels: Raw Image, Edge Predictions, Segmentation Predictions, Bounding Box Predictions]


43

VISUALIZATIONS

[Figures, slides 43-50: comparisons of SINGLE NETWORK vs. DHM (OURS). Panels are labelled DETECTION and SEGMENTATION on slides 43, 45, 46, 48, 49, and 50; SALIENCY (CAR) and SEGMENTATION on slide 44; SALIENCY (BUS) and SEGMENTATION on slide 47.]

51

SUMMARY

• The two hierarchies within our model allow our network to reason about inter-task relationships:

    • Information density: (Seg +) Edge > Seg > Object Conf > Bbox

    • Receptive field: (Seg +) Edge = Bbox >> Object Conf > Seg

• With these relationships wired in, our network is:

    • More accurate

    • Robust to tuning

    • Simultaneously better at fine detail and more instance aware

    • Efficient and scalable (3 tasks, 1 network!)


52

REFERENCES

• J. Yao, S. Fidler, and R. Urtasun. Describing the scene as a whole: Joint object detection, scene classification and semantic segmentation. In CVPR, 2012.

• S. Gidaris and N. Komodakis. Object detection via a multi-region and semantic segmentation-aware CNN model. In ICCV, 2015.

• B. Hariharan, P. Arbelaez, R. Girshick, and J. Malik. Simultaneous detection and segmentation. In ECCV, 2014.

• S. Liu, X. Qi, J. Shi, H. Zhang, and J. Jia. Multi-scale patch aggregation (MPA) for simultaneous detection and segmentation. In CVPR, 2016.

• E. Shelhamer, J. Long, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.

• B. Hariharan, P. Arbelaez, R. Girshick, and J. Malik. Hypercolumns for object segmentation and fine-grained localization. In CVPR, 2015.

• J. Dai, K. He, and J. Sun. Instance-aware semantic segmentation via multi-task network cascades. https://arxiv.org/pdf/1512.04412.pdf, 2015.

53

THANK YOU!

Special thanks to:

My internship mentor: Jian Yao

My managers: John Zedlewski and Andrew Tao

All the wonderful people in DLAR/DLAV.

Additional questions/comments: [email protected]

