A Deep Belief Network Approach to Learning Depth from Optical Flow

Reuben Feinman

Applied Mathematics Honors Thesis


Page 1: Thesis Presentation

A Deep Belief Network Approach to Learning Depth from Optical Flow

Reuben Feinman

Applied Mathematics Honors Thesis

Page 2: Thesis Presentation

Background

• The visual systems of insects are exquisitely sensitive to motion

• Srinivasan et al. (1989) showed that bees judge the range of their targets using absolute motion and motion relative to the background

•Key idea: optical flow is important to navigation

Page 3: Thesis Presentation

Motion Parallax in the Dorsal Stream

Humans perceive depth rather precisely via motion parallax

• Motion is a powerful monocular cue to depth understanding

• Assists with interpretation of spatial relationships

• “Optical flow”: the motion information encoded in the visual system


source: opticflow.bu.edu

Page 4: Thesis Presentation

Deep Learning

• The mapping from motion to depth is highly nonlinear (Braunstein, 1976)

• Deep learning has made great progress: multiple layers of nonlinear processing can capture more complex input-to-output functions

source: www.deeplearning.stanford.edu

Motion information --> Depth prediction

Page 5: Thesis Presentation

Computer Graphics

• Supervised learning needs labeled training data, and real videos do not come with ground-truth depth

• Graphical scenes generated by a gaming engine provide a large number of training samples for supervised learning

A scene excerpt from our CryEngine forest database

RGB frame

ground truth depth map

Page 6: Thesis Presentation

MT Motion Model

• Hierarchical model of motion processing; alternating template matching and max pooling

• Convolutional learning of spatio-temporal features

• Extension of HMAX (Serre et al 2007)

Jhuang et al 2007

Page 7: Thesis Presentation

Population Responses

The dorsal velocity model outputs a motion energy feature map

• Shape: (# Speeds) x (# Directions) x Height x Width

• In other words, each pixel contains a feature vector X with (# Speeds) x (# Directions) dimensions
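As a concrete illustration, the feature map can be reshaped so that each pixel exposes its per-pixel feature vector; the height and width values below are made up for the sketch:

```python
import numpy as np

# Tuning dimensions from the slides: 9 speeds x 8 directions; the spatial
# size here is illustrative only
n_speeds, n_dirs, height, width = 9, 8, 4, 4
energy = np.random.rand(n_speeds, n_dirs, height, width)

# Collapse the speed and direction axes so each pixel holds one feature
# vector X of length (# Speeds) x (# Directions) = 72
features = energy.reshape(n_speeds * n_dirs, height * width).T
print(features.shape)  # one 72-dimensional vector per pixel
```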

Page 8: Thesis Presentation


Deep Belief Networks

• A plain MLP fails to learn the mapping

• Lots of unlabeled data is available; maybe we can exploit it to extract deep hierarchical representations of our motion model outputs

• These representations initialize the network with useful feature detectors

source: http://deeplearning.net

Page 9: Thesis Presentation

The RBM Model


Maximum likelihood learning: update model parameters to maximize the likelihood of our training data

Standard RBM energy:

E(v,h) = −b·v − c·h − v·W·h

Gaussian-Bernoulli RBM energy (with unit-variance visible units):

E(v,h) = ∑_i (v_i − b_i)²/2 − c·h − v·W·h

Both define the joint distribution

P(v,h) = (1/Z)·exp(−E(v,h))

We then create a "free energy" version which sums over all possible hidden states:

P(v) = (1/Z)·exp(−F(v)),  where F(v) = −log ∑_h exp(−E(v,h))
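For a binary RBM the sum over hidden states has the closed form F(v) = −b·v − ∑_j log(1 + exp(c_j + (v·W)_j)), which can be computed directly; the weights and sizes in this sketch are made up:

```python
import numpy as np

def free_energy(v, W, b, c):
    # F(v) = -b.v - sum_j log(1 + exp(c_j + (v W)_j)) for a binary RBM;
    # np.logaddexp(0, x) evaluates log(1 + exp(x)) stably
    return -v @ b - np.sum(np.logaddexp(0.0, v @ W + c))

# Tiny illustrative RBM: 6 visible units, 4 hidden units
rng = np.random.default_rng(0)
W = rng.normal(scale=0.01, size=(6, 4))
b, c = np.zeros(6), np.zeros(4)
v = rng.integers(0, 2, size=6).astype(float)
print(free_energy(v, W, b, c))
```

With all parameters zero, F(v) reduces to −n_hidden·log 2, a quick sanity check on the formula.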

source: http://deeplearning.net

Page 10: Thesis Presentation

Justifying Greedy Layer-Wise Pre-Training

• We use a Markov chain with alternating Gibbs sampling:

h' ~ P(h | v = v)
v' ~ P(v | h = h')

•Gibbs Sampling is guaranteed to reduce the KL divergence between the posterior distribution in a given layer and the model’s equilibrium distribution
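One alternating Gibbs step for a binary RBM can be sketched in a few lines; the weight matrix and layer sizes here are illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_step(v, W, b, c, rng):
    # h' ~ P(h | v = v): sample binary hidden units from their conditional
    p_h = sigmoid(v @ W + c)
    h = (rng.random(p_h.shape) < p_h).astype(float)
    # v' ~ P(v | h = h'): sample binary visible units given the new hiddens
    p_v = sigmoid(h @ W.T + b)
    v_new = (rng.random(p_v.shape) < p_v).astype(float)
    return v_new, h

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(6, 4))
v0 = rng.integers(0, 2, size=6).astype(float)
v1, h1 = gibbs_step(v0, W, np.zeros(6), np.zeros(4), rng)
```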

Hinton et al 2006

Page 11: Thesis Presentation

The DBN

• The data: feature vectors have 72 elements, tuned to 9 speeds and 8 directions (9 x 8 = 72)

• The DBN takes in a 3x3 pixel window

• 3 hidden layers of 800 units with sigmoidal activations

• Linear output layer

Technicalities:

• Mini-batch training with a batch size of 5000

• Sparse initialization scheme

• RMSprop (root mean square propagation) learning rule

• Backpropagation fine-tuning with dropout, dropping 20% of units at each layer except the input layer

• Geometrically decaying learning rate (LR = 0.998 · LR at each epoch)
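Two of these technicalities can be sketched briefly; this uses one common formulation of RMSprop, and the numeric values are illustrative rather than the thesis settings:

```python
import numpy as np

def rmsprop_update(param, grad, cache, lr, decay=0.9, eps=1e-8):
    # Keep a running average of squared gradients and rescale each
    # parameter's step by the root of that average
    cache = decay * cache + (1.0 - decay) * grad ** 2
    param = param - lr * grad / (np.sqrt(cache) + eps)
    return param, cache

# Geometrically decaying learning rate, as in the slides
lr = 0.01
for epoch in range(10):
    lr *= 0.998
```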

Page 12: Thesis Presentation

Results

Depth maps: DBN prediction, linear regression prediction, ground truth

DBN test set R²: 0.445; linear regression test set R²: 0.240
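The R² scores reported here follow the standard coefficient-of-determination formula, which can be computed in a few lines:

```python
import numpy as np

def r2_score(y_true, y_pred):
    # R^2 = 1 - SS_res / SS_tot: the fraction of variance explained
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

y = np.array([1.0, 2.0, 3.0, 4.0])
print(r2_score(y, y))  # 1.0: a perfect prediction explains all variance
```

Predicting the mean everywhere gives R² = 0, so the 0.445 vs. 0.240 gap measures how much more depth variance the DBN explains than the linear baseline.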

Page 13: Thesis Presentation

Bar chart: R² score per model, comparing MLP (sparse initialization), single-pixel linear regression, 3x3 window linear regression, single-pixel DBN, and 3x3 window DBN

Page 14: Thesis Presentation

Markov Random Field Smoothing

The receptive field can be a powerful tool for decoding

MRF defined by two potential functions:

1) Φ = ∑_i (w · x_i − d_i)²

2) Ψ = ∑_<i,j> (d_i − d_j)² / ((d_i − d_j)² + 1)

(note: <i,j> ranges over all neighboring pairs i, j)

P(d | x; α, w) = (1/Z) · exp(−(α·Ψ + Φ))

Figure: Peter Orchard, University of Edinburgh
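A minimal sketch of the corresponding energy α·Ψ + Φ (maximizing P(d | x) means minimizing this energy); the 4-neighborhood on the pixel grid and the array shapes are assumptions for illustration:

```python
import numpy as np

def mrf_energy(d, x, w, alpha):
    # Phi: squared error between the per-pixel linear prediction w.x_i
    # and the depth value d_i
    phi = np.sum((x @ w - d) ** 2)
    # Psi: robust penalty on depth differences between neighboring pairs
    # <i,j>, here taken as horizontal and vertical grid neighbors
    dh = d[:, 1:] - d[:, :-1]
    dv = d[1:, :] - d[:-1, :]
    psi = np.sum(dh ** 2 / (dh ** 2 + 1)) + np.sum(dv ** 2 / (dv ** 2 + 1))
    return alpha * psi + phi

# Toy check: a flat depth map that the linear term predicts exactly
# has zero energy
d = np.ones((2, 2))
x = np.ones((2, 2, 1))
print(mrf_energy(d, x, np.array([1.0]), alpha=0.5))
```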

Depth maps: ground truth; original prediction (0.595); MRF prediction (0.630)

Page 15: Thesis Presentation

Drone Test


Page 16: Thesis Presentation


Page 17: Thesis Presentation

Future Work

• Increase the size of the pre-training dataset

• Collect labeled real-video data with an Xbox Kinect

• Down-sample the motion features and ground truth

Page 18: Thesis Presentation

Thanks!

• Thomas Serre

• Stuart Geman

• David Mely

• Youssef Barhomi


Questions?

Page 19: Thesis Presentation

Normalizing the Data

• Training a GB-RBM is hard; the distributions of spike firing rates vary considerably from dataset to dataset

• We propose a normalized GB-RBM in which the training data is normalized to zero mean and unit variance; all later datasets (validation & test) are normalized with the same parameters
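The normalization scheme can be sketched as follows; the sample data here is synthetic, standing in for the motion-model firing rates:

```python
import numpy as np

rng = np.random.default_rng(0)
train = rng.normal(loc=3.0, scale=2.0, size=(1000, 72))  # synthetic rates
valid = rng.normal(loc=3.0, scale=2.0, size=(200, 72))

# Fit the normalization parameters on the training set only...
mu, sigma = train.mean(axis=0), train.std(axis=0)

# ...and reuse those same parameters for every later dataset,
# so validation and test data are transformed consistently
train_n = (train - mu) / sigma
valid_n = (valid - mu) / sigma
```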


Dataset histograms before and after normalization