
Page 1:

Greedy Layer-Wise Training of Deep Networks

Yoshua Bengio, Pascal Lamblin, Dan Popovici, Hugo Larochelle (NIPS 2007)

Presented by Ahmed Hefny

Page 2:

Story so far …

• Deep neural nets are more expressive: they can learn wider classes of functions with fewer hidden units (parameters) and fewer training examples.

• Unfortunately, they are not easy to train with gradient-based methods starting from random initialization.

Page 3:

Story so far …

• Hinton et al. (2006) proposed greedy unsupervised layer-wise training:
  • Greedy layer-wise: train layers sequentially, starting from the bottom (input) layer.
  • Unsupervised: each layer learns a higher-level representation of the layer below; the training criterion does not depend on the labels.
• Each layer is trained as a Restricted Boltzmann Machine (the RBM is the building block of Deep Belief Networks).
• The trained model can be fine-tuned using a supervised method.

[Figure: a stack of RBMs (RBM 0, RBM 1, RBM 2), trained from the bottom up]

Page 4:

This paper

• Extends the concept to:
  • Continuous variables
  • Uncooperative input distributions
  • Simultaneous layer training
• Explores variations to better understand the training method:
  • What if we use greedy supervised layer-wise training?
  • What if we replace RBMs with auto-encoders?
[Figure: a stack of RBMs (RBM 0, RBM 1, RBM 2)]

Page 5:

Outline

• Review
  • Restricted Boltzmann Machines
  • Deep Belief Networks
  • Greedy Layer-wise Training
• Supervised Fine-tuning
• Extensions
  • Continuous Inputs
  • Uncooperative Input Distributions
  • Simultaneous Training
• Analysis Experiments

Page 6:

Outline

• Review
  • Restricted Boltzmann Machines
  • Deep Belief Networks
  • Greedy Layer-wise Training
• Supervised Fine-tuning
• Extensions
  • Continuous Inputs
  • Uncooperative Input Distributions
  • Simultaneous Training
• Analysis Experiments

Page 7:

Restricted Boltzmann Machine

[Figure: bipartite graph with a visible layer v and a hidden layer h]
Undirected bipartite graphical model with connections between visible nodes and hidden nodes.
Corresponds to the joint probability distribution given below.
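The joint distribution referred to here did not survive extraction; the standard energy-based form, with weight matrix W and visible/hidden bias vectors b and c, is:

```latex
P(v, h) \;\propto\; e^{-\mathrm{energy}(v, h)} \;=\; e^{\,h^{\top} W v \,+\, b^{\top} v \,+\, c^{\top} h}
```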

Page 8:

Restricted Boltzmann Machine

[Figure: bipartite graph with a visible layer v and a hidden layer h]
Undirected bipartite graphical model with connections between visible nodes and hidden nodes.
Corresponds to the joint probability distribution above.

Factorized conditionals: because the graph is bipartite, conditioned on one layer the units of the other layer are independent (see below).
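The factorized conditionals, reconstructed for binary units in the same notation (sigm denotes the logistic sigmoid):

```latex
P(h \mid v) = \prod_i P(h_i \mid v), \qquad
P(h_i = 1 \mid v) = \operatorname{sigm}\Bigl(c_i + \sum_j W_{ij} v_j\Bigr)

P(v \mid h) = \prod_j P(v_j \mid h), \qquad
P(v_j = 1 \mid h) = \operatorname{sigm}\Bigl(b_j + \sum_i W_{ij} h_i\Bigr)
```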

Page 9:

Restricted Boltzmann Machine (Training)
• Given input vectors v, adjust the parameters (W, b, c) to increase the likelihood P(v).

Page 10:

Restricted Boltzmann Machine (Training)
• Given input vectors v, adjust the parameters (W, b, c) to increase the likelihood P(v).

Page 11:

Restricted Boltzmann Machine (Training)
• Given input vectors v, adjust the parameters (W, b, c) to increase log P(v).
• Sample h given v; then sample a negative pair (ṽ, h̃) using Gibbs sampling.
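For reference, the log-likelihood gradient these samples are used to estimate can be written, with θ standing for (W, b, c), as:

```latex
\frac{\partial \log P(v)}{\partial \theta}
  = -\,\mathbb{E}_{h \sim P(h \mid v)}\!\left[\frac{\partial\, \mathrm{energy}(v, h)}{\partial \theta}\right]
    + \mathbb{E}_{(\tilde{v}, \tilde{h}) \sim P(v, h)}\!\left[\frac{\partial\, \mathrm{energy}(\tilde{v}, \tilde{h})}{\partial \theta}\right]
```

Contrastive divergence approximates the second expectation with the sample (ṽ, h̃) obtained after a small number of Gibbs steps (often just one) started at the training example.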

Page 12:

Restricted Boltzmann Machine (Training)
• Now we can perform stochastic gradient descent on the negative data log-likelihood.
• Stop based on some criterion (e.g. the reconstruction error); a minimal code sketch follows.
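To make the training loop concrete, here is a minimal NumPy sketch of CD-1 training for a binary RBM. It is an illustration, not the authors' code; the name train_rbm, the learning rate, and the epoch count are hypothetical choices.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, n_epochs=10, lr=0.05, seed=0):
    """Fit a binary RBM to `data` (n_samples x n_visible, values in [0, 1])
    with CD-1, monitoring reconstruction error as a stopping heuristic."""
    rng = np.random.RandomState(seed)
    n_visible = data.shape[1]
    W = 0.01 * rng.randn(n_hidden, n_visible)   # weights
    b = np.zeros(n_visible)                     # visible biases
    c = np.zeros(n_hidden)                      # hidden biases
    for epoch in range(n_epochs):
        recon_err = 0.0
        for v0 in data:
            # positive phase: hidden activations given the data
            ph0 = sigmoid(c + W @ v0)
            h0 = (rng.rand(n_hidden) < ph0).astype(float)
            # negative phase: one Gibbs step (CD-1)
            pv1 = sigmoid(b + W.T @ h0)
            v1 = (rng.rand(n_visible) < pv1).astype(float)
            ph1 = sigmoid(c + W @ v1)
            # approximate log-likelihood gradient step
            W += lr * (np.outer(ph0, v0) - np.outer(ph1, v1))
            b += lr * (v0 - v1)
            c += lr * (ph0 - ph1)
            recon_err += np.sum((v0 - pv1) ** 2)
        # recon_err can serve as the stopping criterion mentioned above
    return W, b, c
```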

Page 13:

Deep Belief Network

• A DBN is a model of the form P(x, g^1, …, g^ℓ) (factorization below), where
  x denotes the input variables and g^1, …, g^ℓ denote the hidden layers of causal variables.
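The factorization referred to on this slide, reconstructed in the paper's notation with the top two layers kept as a joint (RBM) term:

```latex
P(x, g^1, \dots, g^\ell)
  \;=\; P(x \mid g^1)\,\Bigl(\prod_{i=1}^{\ell-2} P(g^i \mid g^{i+1})\Bigr)\, P(g^{\ell-1}, g^\ell)
```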

Page 14:

Deep Belief Network

• A DBN is a model of the form P(x, g^1, …, g^ℓ), where
  x denotes the input variables and g^1, …, g^ℓ denote the hidden layers of causal variables.

Page 15:

Deep Belief Network

• A DBN is a model of the form P(x, g^1, …, g^ℓ), where
  x denotes the input variables and g^1, …, g^ℓ denote the hidden layers of causal variables.
• The top two layers, P(g^{ℓ-1}, g^ℓ), form an RBM.
• An RBM is equivalent to an infinitely deep directed network with tied weights.

Page 16:

Greedy layer-wise training

• The posterior P(g^1 | x) of the full model is intractable.
• Approximate it with the RBM posterior Q(g^1 | x):
  • Treat the bottom two layers (x, g^1) as an RBM.
  • Fit its parameters using contrastive divergence.

Page 17:

Greedy layer-wise training

• The posterior P(g^1 | x) of the full model is intractable.
• Approximate it with the RBM posterior Q(g^1 | x):
  • Treat the bottom two layers (x, g^1) as an RBM.
  • Fit its parameters using contrastive divergence.
• That gives an approximate marginal distribution over g^1 (the data propagated through Q).
• We need to match it with P(g^1), the prior defined by the layers above; that is what the next level is trained to model.

Page 18:

Greedy layer-wise training

• For each higher level, approximate P(g^i | g^{i-1}) with an RBM posterior Q(g^i | g^{i-1}):
  • Treat the pair of layers (g^{i-1}, g^i) as an RBM.
  • Fit its parameters using contrastive divergence.
  • Sample its training input g^{i-1} recursively, using Q(g^j | g^{j-1}) starting from the data x (a code sketch follows).
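A minimal sketch of the greedy procedure. It assumes a helper train_rbm(inputs, n_hidden) returning (W, b, c), such as the CD-1 sketch above, and uses mean-field propagation through each trained layer in place of sampling from Q; both the helper signature and that choice are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def greedy_pretrain(data, layer_sizes, train_rbm):
    """Train a stack of RBMs layer by layer.

    Hypothetical helper signature: train_rbm(inputs, n_hidden) -> (W, b, c).
    """
    layers = []
    layer_input = data
    for n_hidden in layer_sizes:
        # fit the current pair of layers as an RBM with contrastive divergence
        W, b, c = train_rbm(layer_input, n_hidden)
        layers.append((W, b, c))
        # propagate the data upward through Q (mean-field values of the
        # hidden units) to obtain the next layer's training input
        layer_input = sigmoid(c + layer_input @ W.T)
    return layers
```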

Page 19:

Outline

• Review
  • Restricted Boltzmann Machines
  • Deep Belief Networks
  • Greedy Layer-wise Training
• Supervised Fine-tuning
• Extensions
  • Continuous Inputs
  • Uncooperative Input Distributions
  • Simultaneous Training
• Analysis Experiments

Page 20:

Supervised Fine-Tuning (in this paper)
• Use greedy layer-wise training to initialize the weights of all layers except the output layer.
• For fine-tuning, run stochastic gradient descent on a cost function defined on the outputs, where the conditional expected values of the hidden nodes are approximated by their mean-field values (each stochastic unit replaced by its sigmoid activation).

Page 21:

Supervised Fine-Tuning (in this paper)
• Use greedy layer-wise training to initialize the weights of all layers except the output layer.
• Then train the whole network with backpropagation (a sketch follows).
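A hedged sketch of one fine-tuning step: the forward pass uses the pre-trained RBM weights as sigmoid layers, a randomly initialized linear output layer sits on top, and backpropagation adjusts everything. The squared-error cost, the finetune_step name, and plain SGD are illustrative assumptions, not necessarily the paper's exact setup.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def finetune_step(layers, W_out, b_out, x, y, lr=0.01):
    """One backpropagation step through pre-trained sigmoid layers plus a
    linear output layer, using squared error as the supervised cost.
    `layers` is a list of (W, b_vis, c) from pre-training; the visible
    biases b_vis are unused in the feed-forward pass."""
    # forward pass, keeping every layer's activation
    activations = [x]
    for W, b_vis, c in layers:
        activations.append(sigmoid(c + W @ activations[-1]))
    y_hat = W_out @ activations[-1] + b_out

    # backward pass
    delta = y_hat - y                          # gradient of 0.5 * ||y_hat - y||^2
    delta_hidden = W_out.T @ delta             # backprop into the top hidden layer
    W_out = W_out - lr * np.outer(delta, activations[-1])
    b_out = b_out - lr * delta
    delta = delta_hidden
    for i in reversed(range(len(layers))):
        W, b_vis, c = layers[i]
        h = activations[i + 1]
        dpre = delta * h * (1.0 - h)           # through the sigmoid nonlinearity
        delta = W.T @ dpre                     # propagate to the layer below
        layers[i] = (W - lr * np.outer(dpre, activations[i]), b_vis, c - lr * dpre)
    return layers, W_out, b_out
```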

Page 22:

Outline

• Review
  • Restricted Boltzmann Machines
  • Deep Belief Networks
  • Greedy Layer-wise Training
• Supervised Fine-tuning
• Extensions
  • Continuous Inputs
  • Uncooperative Input Distributions
  • Simultaneous Training
• Analysis Experiments

Page 23:

Continuous Inputs

• Recall that in an RBM, the conditional density of a visible unit has the form exp(a(h)·v) over its allowed range, up to normalization, where a(h) is the unit's total input (its bias plus the weighted contribution from h).
• If we restrict v to {0, 1}, normalization gives a binomial unit with success probability given by the sigmoid of a(h).
• If instead v ranges over [0, ∞), we get an exponential density.
• If v is restricted to a closed interval such as [0, 1], we get a truncated exponential.

Page 24:

Continuous Inputs (case of the truncated exponential on [0, 1])
• Sampling: for the truncated exponential, the inverse CDF can be used, where U is sampled uniformly from [0, 1].
• Conditional expectation: E[v | h] has a closed form (both expressions are given below).
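The missing expressions, reconstructed for a conditional density proportional to exp(a(h)·v) on [0, 1], where a(h) again denotes the visible unit's total input:

```latex
v \;=\; F^{-1}(U) \;=\; \frac{\log\!\bigl(1 - U\,(1 - e^{\,a(h)})\bigr)}{a(h)},
\qquad U \sim \mathrm{Uniform}[0, 1]

\mathbb{E}[\,v \mid h\,] \;=\; \frac{1}{1 - e^{-a(h)}} \;-\; \frac{1}{a(h)}
```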

Page 25:

Continuous Inputs

• To handle Gaussian inputs, we need to augment the energy function with a term quadratic in v.
• For a diagonal covariance matrix, this adds an independent quadratic term per visible unit, giving the Gaussian conditional shown below.
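A reconstruction of the missing expressions, assuming the usual per-unit quadratic term d_j² v_j² is added to the energy, with a_j(h) the unit's total linear input:

```latex
\mathrm{energy}(v, h) \;=\; \dots \;+\; \sum_j d_j^{\,2} v_j^{2}
\quad\Longrightarrow\quad
P(v_j \mid h) \;=\; \mathcal{N}\!\left(\frac{a_j(h)}{2 d_j^{\,2}},\; \frac{1}{2 d_j^{\,2}}\right)
```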

Page 26:

Continuous Hidden Nodes?

Page 27:

Continuous Hidden Nodes?

• Truncated Exponential
• Gaussian
(The same constructions as for the visible units apply, with the roles of v and h exchanged: the hidden unit's conditional depends on its total input a(v).)

Page 28:

Uncooperative Input Distributions

• Setting: x ~ p(x), y = f(x) + noise.
• No particular relation between p and f (e.g. p Gaussian and f a sinusoid).

Page 29:

Uncooperative Input Distributions

• Setting: x ~ p(x), y = f(x) + noise.
• No particular relation between p and f (e.g. p Gaussian and f a sinusoid).
• Problem: unsupervised pre-training may not help prediction, since modeling p(x) reveals little about f.

Page 30:

Outline

• Review
  • Restricted Boltzmann Machines
  • Deep Belief Networks
  • Greedy Layer-wise Training
• Supervised Fine-tuning
• Extensions
• Analysis Experiments

Page 31:

Uncooperative Input Distributions

• Proposal: mix unsupervised and supervised training for each layer (partially supervised pre-training).
• While a layer is being trained, attach a temporary output layer to it and combine two gradients:
  • the stochastic gradient of the input log-likelihood, estimated by contrastive divergence, and
  • the stochastic gradient of the prediction error through the temporary output layer.
• Combined update: the sum of the two (sketched below).
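A hedged sketch of the combined update, assuming a learning rate ε and a supervised cost C computed through the temporary output layer (the CD term estimates the unsupervised log-likelihood gradient):

```latex
\theta \;\leftarrow\; \theta \;+\; \varepsilon\left(
    \widehat{\frac{\partial \log P(x)}{\partial \theta}}_{\mathrm{CD}}
    \;-\; \frac{\partial C\bigl(y, \hat{y}(x)\bigr)}{\partial \theta}
\right)
```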

Page 32:

Simultaneous Layer Training

• Greedy layer-wise training:
  • For each layer:
    • Repeat until criterion met:
      • Sample the layer's input (by recursively applying the trained layers to the data).
      • Update parameters using contrastive divergence.

Page 33:

Simultaneous Layer Training

• Simultaneous training:
  • Repeat until criterion met:
    • Sample the input to all layers.
    • Update the parameters of all layers using contrastive divergence.
• Simpler: one criterion for the entire network.
• Takes more time (a sketch follows).
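A hedged sketch of the simultaneous variant. It assumes a helper cd1_update(W, b, c, v) that applies one in-place contrastive-divergence step for a single example (as in the earlier CD-1 sketch); every iteration updates all layers on inputs propagated through the layers below, rather than finishing one layer before starting the next.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def simultaneous_train(data, layers, cd1_update, n_iters=10000, seed=0):
    """`layers` is a list of (W, b, c) tuples; `cd1_update(W, b, c, v)` is a
    hypothetical helper applying one in-place CD step for a single example."""
    rng = np.random.RandomState(seed)
    for _ in range(n_iters):
        v = data[rng.randint(len(data))]
        for W, b, c in layers:
            cd1_update(W, b, c, v)     # update this layer on its current input
            v = sigmoid(c + W @ v)     # mean-field propagation to the next layer
    return layers
```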

Page 34:

Outline

• Review
  • Restricted Boltzmann Machines
  • Deep Belief Networks
  • Greedy Layer-wise Training
• Supervised Fine-tuning
• Extensions
  • Continuous Inputs
  • Uncooperative Input Distributions
  • Simultaneous Training
• Analysis Experiments

Page 35:

Experiments

• Does greedy unsupervised pre-training help?
• What if we replace RBMs with auto-encoders?
• What if we do greedy supervised pre-training?
• Does continuous variable modeling help?
• Does partially supervised pre-training help?

Page 36:

Experiment 1

• Does greedy unsupervised pre-training help?
• What if we replace RBMs with auto-encoders?
• What if we do greedy supervised pre-training?
• Does continuous variable modeling help?
• Does partially supervised pre-training help?

Page 37:

Experiment 1

Page 38:

Experiment 1

Page 39:

Experiment 1 (MSE and Training Errors)

Results (lower error is better): partially supervised pre-training < unsupervised pre-training < no pre-training.
Gaussian input units < binomial input units.

Page 40:

Experiment 2

• Does greedy unsupervised pre-training help?
• What if we replace RBMs with auto-encoders?
• What if we do greedy supervised pre-training?
• Does continuous variable modeling help?
• Does partially supervised pre-training help?

Page 41:

Experiment 2

• Auto-encoders: learn a compact representation from which X can be reconstructed.
• Trained to minimize the reconstruction cross-entropy between X and its reconstruction (a code sketch follows).
[Figure: X is encoded to a hidden code and decoded back to a reconstruction of X]
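A minimal sketch of one such auto-encoder layer, assuming a sigmoid encoder and decoder with tied weights and a cross-entropy reconstruction cost; the function name and hyperparameters are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_autoencoder(data, n_hidden, n_epochs=10, lr=0.05, seed=0):
    """One auto-encoder layer: encode x -> h -> reconstruct x_hat, trained to
    minimize the reconstruction cross-entropy (tied encoder/decoder weights)."""
    rng = np.random.RandomState(seed)
    n_visible = data.shape[1]
    W = 0.01 * rng.randn(n_hidden, n_visible)
    b = np.zeros(n_visible)                 # decoder bias
    c = np.zeros(n_hidden)                  # encoder bias
    for _ in range(n_epochs):
        for x in data:
            h = sigmoid(c + W @ x)          # encoder
            x_hat = sigmoid(b + W.T @ h)    # decoder (tied weights)
            # for a sigmoid output, the cross-entropy gradient w.r.t. the
            # decoder pre-activation is simply (x_hat - x)
            d_out = x_hat - x
            d_hid = (W @ d_out) * h * (1.0 - h)
            W -= lr * (np.outer(h, d_out) + np.outer(d_hid, x))
            b -= lr * d_out
            c -= lr * d_hid
    return W, b, c
```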

Page 42:

Experiment 2

[Results table: layer widths of 500 to 1000 units; 20 nodes in the last two layers]

Page 43:

Experiment 2

• Auto-encoder pre-training outperforms supervised pre-training but is still outperformed by RBM pre-training.

• Without pre-training, deep nets do not generalize well, but they can still fit the training data if the final hidden layers are wide enough.

Page 44:

Conclusions

• Unsupervised pre-training is important for deep networks.

• Partial supervision further enhances the results, especially when the input distribution and the function to be estimated are not closely related.

• Explicitly modeling continuous inputs (e.g. with Gaussian units) works better than forcing them into binomial models.

Page 45:

Thanks