Modelling image patches with a directed hierarchy of Markov random fields
Simon Osindero* and Geoffrey Hinton
Department of Computer Science, University of Toronto
Motivation
• There’s a notion that we can efficiently describe the world in terms of hierarchical structures.
– E.g. features of features of features of…
• Being able to automatically discover and perform inference about this structure seems useful for many tasks in AI, signal processing, etc.
• Deep architectures have long been of interest in this respect.
– but until recently have typically been very difficult to learn well
• Developments in unsupervised learning for deep generative models.
– Greedy layerwise training and subsequent fine tuning.
• Interesting recognition/discriminative models made practical and “regularized” by initialization from generative solutions.
– Excellent performance on MNIST, etc.
• Lateral connections can introduce useful additional generative modeling power.
• One difficult way to maintain the constraints between the parts is to generate each part very precisely.
• Vague top-down specification of the parts is less demanding
– but it can mess up relationships between features
– so use redundant features and use lateral interactions to clean up the mess.
• Cooperative and competitive interactions between features help to coordinate locations.
[Diagram: vague top-down activation of parts, followed by clean-up using known lateral interactions; pose parameters and the features that have top-down support combine to form a “square”.]
• Similar to soldiers on a parade ground.
Motivation
• Lateral connections can help enforce statistical structure that is difficult to capture with directed connections alone.
– leads us to MRFs in the hidden layers
– should allow us to model image patches very well.
• The basic learning algorithm remains efficient and scalable – It learns one hidden layer at a time.
• Even though we have hidden MRFs, we can still use a fast, simple method to perform good, approximate inference in the learned model.
[Architecture diagram: image pixels at the bottom, with two hidden MRF (feature) layers stacked above.]
Roadmap for the rest of the talk
• Learning in deep belief networks.
– Brief overview of Restricted Boltzmann Machines (RBMs) and introduction to Semi-Restricted Boltzmann Machines (SRBMs).
– Brief overview of contrastive divergence learning.
– Compositional learning and layer-wise training with lateral connections.
– Inference and generation in deep networks.
• A hierarchical generative model for natural image patches.
– Comparison of models with and without lateral connections.
• Some thoughts on deep nets for image enhancement.
Brief background: Boltzmann Machines
• Can think in terms of an undirected graphical model or random field parametrised by a potential/energy function (next slide).
• Or in terms of a neural net with binary stochastic units.
– Each unit has a state of 1 or 0.
– The probability of turning on is determined by a sigmoid function of the weighted input from other units (plus a bias):
$p(h_i = 1) = \dfrac{1}{1 + \exp\!\left(-b_i - \sum_j v_j w_{ij}\right)}$
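As a minimal sketch, this turn-on rule can be written directly in NumPy (the function names here are illustrative, not from the talk):

```python
import numpy as np

def sigmoid(x):
    # logistic sigmoid
    return 1.0 / (1.0 + np.exp(-x))

def p_unit_on(v, w_i, b_i):
    # p(h_i = 1) = sigmoid(b_i + sum_j v_j * w_ij)
    return sigmoid(b_i + v @ w_i)

# With zero bias and zero weights the unit is on with probability 0.5.
v = np.array([1.0, 0.0, 1.0])
p = p_unit_on(v, np.zeros(3), 0.0)  # -> 0.5
```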
Restricted Boltzmann Machines (RBMs)

Energy function (for binary stochastic units):
$E(\mathbf{v},\mathbf{h}) = -\sum_{i,j} v_i h_j w_{ij}$

Joint distribution:
$p(\mathbf{v},\mathbf{h}) = \dfrac{e^{-E(\mathbf{v},\mathbf{h})}}{\sum_{\mathbf{u},\mathbf{g}} e^{-E(\mathbf{u},\mathbf{g})}}$
(the denominator is the normalising constant, aka partition function)

Factorial conditionals:
$p(\mathbf{v}\mid\mathbf{h}) = \prod_i p(v_i\mid\mathbf{h})$
$p(\mathbf{h}\mid\mathbf{v}) = \prod_j p(h_j\mid\mathbf{v})$

• Hidden units are conditionally independent given the visible units, and vice versa.
– This affords simple and effective Gibbs sampling schemes.
• Can also be viewed as a directed model with a non-factorial prior over the hidden variables.
• The energy function shown here is for binary stochastic units.
– It can be adapted to other exponential-family distributions, e.g. Gaussian, Poisson, etc.
• Parameters can be learnt using contrastive divergence as an approximation to maximum likelihood.

[Diagram: visible units $\mathbf{v}$ (the data) connected to hidden units $\mathbf{h}$ by weights $W$.]
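The factorial conditionals make alternating Gibbs sampling trivial. A minimal binary-RBM sketch in NumPy (the class layout and initialisation scheme are illustrative, not the authors' code):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class RBM:
    """Minimal binary RBM for illustration."""
    def __init__(self, n_vis, n_hid, seed=0):
        rng = np.random.default_rng(seed)
        self.W = 0.01 * rng.normal(size=(n_vis, n_hid))
        self.b_v = np.zeros(n_vis)
        self.b_h = np.zeros(n_hid)
        self.rng = rng

    def p_h_given_v(self, v):
        # factorial conditional: hidden units independent given v
        return sigmoid(self.b_h + v @ self.W)

    def p_v_given_h(self, h):
        # ...and visible units independent given h
        return sigmoid(self.b_v + h @ self.W.T)

    def gibbs_step(self, v):
        # one full alternating update: sample h | v, then v | h
        h = (self.rng.random(self.b_h.size) < self.p_h_given_v(v)).astype(float)
        v_new = (self.rng.random(self.b_v.size) < self.p_v_given_h(h)).astype(float)
        return v_new, h

rbm = RBM(n_vis=6, n_hid=4)
v0 = np.array([1., 0., 1., 1., 0., 0.])
v1, h1 = rbm.gibbs_step(v0)
```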
Semi-Restricted Boltzmann Machines (SRBMs)

Energy function, with lateral interactions $L$ among the visible units:
$E(\mathbf{v},\mathbf{h}) = -\sum_{i,j} v_i h_j w_{ij} - \sum_{i<i'} v_i v_{i'} L_{ii'}$

Factorial posterior:
$p(\mathbf{h}\mid\mathbf{v}) = \prod_j p(h_j\mid\mathbf{v})$

However:
$p(\mathbf{v}\mid\mathbf{h}) \neq \prod_i p(v_i\mid\mathbf{h})$

• Introduce lateral interactions between the visible units – an MRF.
– Still straightforward to learn parameters using contrastive divergence.
• Very effective at capturing constraints – this can be quite useful.
• Visible units are no longer conditionally independent given the hidden units.
– But inference is still fast.
• Can sample from the conditional distribution over the visible units by Gibbs sampling.
• Alternatively, we can use a mean-field approximation.
Contrastive Divergence Learning: Gibbs sampling and mean-field updates
RBM
• Hidden units conditionally independent given visible units and vice versa.
• Can use one iteration of Gibbs sampling to get “negative phase” statistics.

SRBM
• Hidden units remain conditionally independent given visible units.
• Use mean-field settling (conditioned on hidden units) to approximate negative-phase statistics involving visible units.

[Diagram: alternating Gibbs chain running from the data distribution at $t = 0$ (start of learning) towards the model distribution at $t = \infty$ (a “fantasy”). Maximum likelihood needs the equilibrium statistics $\langle v_i h_j\rangle^\infty$; contrastive divergence substitutes the one-step statistics $\langle v_i h_j\rangle^1$.]

Parameter updates:
$\Delta W_{ij} = \varepsilon\left(\langle v_i h_j\rangle^0 - \langle v_i h_j\rangle^1\right)$
$\Delta L_{ii'} = \varepsilon\left(\langle v_i v_{i'}\rangle^0 - \langle v_i v_{i'}\rangle^1\right)$
Learning a semi-restricted Boltzmann Machine

[Diagram: visible units $i$ and hidden units $j$ at $t = 0$ (the data) and $t = 1$ (the reconstruction), giving the statistics $\langle v_i h_j\rangle^0$ and $\langle v_i h_j\rangle^1$.]

1. Start with a training vector on the visible units.
2. Update all of the hidden units in parallel.
3. Repeatedly update all of the visible units in parallel. This uses mean-field updates (with the hidden units fixed) to get a “reconstruction”.
4. Update all of the hidden units again.

$\Delta W_{ij} = \varepsilon\left(\langle v_i h_j\rangle^0 - \langle v_i h_j\rangle^1\right)$
$\Delta L_{ii'} = \varepsilon\left(\langle v_i v_{i'}\rangle^0 - \langle v_i v_{i'}\rangle^1\right)$ (update for a lateral weight)
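Steps 1–4 and the two update rules can be sketched as follows. This assumes a symmetric lateral matrix `L` with zero diagonal and omits biases for brevity; it is an illustrative sketch, not the talk's implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def srbm_cd1(v0, W, L, rng, eps=0.01, mf_steps=5, lam=0.5):
    """One CD-1 update for an SRBM (biases omitted; L symmetric, zero diagonal)."""
    # 1-2. Clamp the data and sample the hidden units in parallel.
    ph0 = sigmoid(v0 @ W)
    h0 = (rng.random(ph0.size) < ph0).astype(float)
    # 3. Mean-field settle the visibles with the hiddens fixed
    #    (the "reconstruction"), using damped updates.
    v = v0.astype(float).copy()
    top_down = h0 @ W.T
    for _ in range(mf_steps):
        v = lam * v + (1 - lam) * sigmoid(top_down + v @ L)
    # 4. Recompute hidden probabilities from the reconstruction.
    ph1 = sigmoid(v @ W)
    # Contrastive divergence updates for W and the lateral weights L.
    dW = eps * (np.outer(v0, ph0) - np.outer(v, ph1))
    dL = eps * (np.outer(v0, v0) - np.outer(v, v))
    np.fill_diagonal(dL, 0.0)  # no self-connections
    return dW, dL

rng = np.random.default_rng(0)
n_vis, n_hid = 8, 5
W = 0.01 * rng.normal(size=(n_vis, n_hid))
L = np.zeros((n_vis, n_vis))
v0 = (rng.random(n_vis) < 0.5).astype(float)
dW, dL = srbm_cd1(v0, W, L, rng)
```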
Learning a semi-restricted Boltzmann Machine
• Method 1: To form a reconstruction, cycle through the visible units updating each in turn using the top-down input from the hiddens plus the lateral input from the other visibles.
• Method 2: Use “mean field” visible units that have real values. Update them all in parallel.
– Use damping to prevent oscillations:
$p_i^{t+1} = \lambda\, p_i^t + (1 - \lambda)\,\sigma(x_i^t)$
where $x_i^t$ is the total input to unit $i$ and $\lambda$ is the damping coefficient.
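A tiny numerical check of the damped update: with a constant total input the iteration settles at the sigmoid of that input instead of oscillating (λ = 0.5 is an assumed value for illustration):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def damped_update(p, x, lam=0.5):
    # p_i^{t+1} = lam * p_i^t + (1 - lam) * sigmoid(x_i^t)
    return lam * p + (1 - lam) * sigmoid(x)

# Fixed total input of 2.0: the iterate converges to sigmoid(2.0).
p = 0.99
for _ in range(60):
    p = damped_update(p, 2.0)
```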
Overview: Learning deep belief networks
• (S)RBMs will form the building blocks for the deep networks we will explore.
• We “stack” the constituent models on top of each other in order to compose the final network.
Deep Belief Nets (DBNs): Compositional learning
Basic Idea & Intuition:
• Learn a DBN greedily, one layer at a time, by sequentially training a series of undirected models.
• Inferred hidden states of the “previous layer” are used as training data on the visible units for the “subsequent layer”.
• After learning the first hidden layer:
$p(\mathbf{v}) = \sum_{\mathbf{h}} p(\mathbf{v}\mid\mathbf{h})\, p(\mathbf{h})$
• If we leave $p(\mathbf{v}\mid\mathbf{h})$ alone and improve $p(\mathbf{h})$, we will improve $p(\mathbf{v})$.
• To improve $p(\mathbf{h})$, we need it to be a better model of the aggregated posterior distribution over hidden vectors inferred from the data.
• We can train another (S)RBM on this aggregated posterior distribution to produce an improved model of $p(\mathbf{h})$.
• This process can be repeated recursively…
• Can also gain insights and justify the greedy learning more formally by considering an undirected model as an infinitely deep directed model and applying variational bounds. (E.g. see Hinton, Osindero & Teh 2006.)
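The greedy recursion can be sketched as below; `train_rbm_cd1` is a deliberately minimal CD-1 trainer (biases omitted), illustrative only and not the configuration used in the talk:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm_cd1(data, n_hid, rng, eps=0.05, epochs=5):
    # Minimal CD-1 trainer for a binary RBM (biases omitted for brevity).
    n_vis = data.shape[1]
    W = 0.01 * rng.normal(size=(n_vis, n_hid))
    for _ in range(epochs):
        for v0 in data:
            ph0 = sigmoid(v0 @ W)
            h0 = (rng.random(n_hid) < ph0).astype(float)
            v1 = sigmoid(h0 @ W.T)          # mean-field reconstruction
            ph1 = sigmoid(v1 @ W)
            W += eps * (np.outer(v0, ph0) - np.outer(v1, ph1))
    return W

def greedy_stack(data, layer_sizes, rng):
    # Train one layer at a time; the inferred hidden probabilities
    # become the "data" for the next layer up.
    weights, x = [], data
    for n_hid in layer_sizes:
        W = train_rbm_cd1(x, n_hid, rng)
        weights.append(W)
        x = sigmoid(x @ W)                  # re-represent the data
    return weights

rng = np.random.default_rng(0)
data = (rng.random((20, 8)) < 0.5).astype(float)
ws = greedy_stack(data, [6, 4], rng)
```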
Deep Belief Nets (DBNs): Compositional learning

[Diagram: data layer with a hidden MRF above it, joined by directed connections; a second hidden MRF is added on top as further layers are learnt.]

• Learn an initial model over the “actual” data.
• Re-represent the data.
• ‘Freeze’ the lower-level connections – effectively considering them to define a directed model.
• Learn a new model using the inferred distribution over hidden units as data.
• At each layer we apply contrastive divergence to learn an undirected model.
• The same basic method applies whether we have lateral connections or not – the main difference is the form of the contrastive divergence updates.
Deep Belief Nets (DBNs): Fast approximate inference
• The inter-layer parameters learnt during the greedy layer-wise training also provide us with a fast way to do approximate inference in the final model.
– Methods exist to refine these inference parameters.
• This variational inference simply uses the real-valued probabilities as activities in a feed-forward neural network.
– The probabilities come from the $p(\mathbf{h}\mid\mathbf{v})$ terms given by the component (S)RBMs during learning.
• The lateral connections do not enter directly into this approximate inference. Representations can be formed in an entirely feed-forward manner – no iterative settling is required.
• Effectively, the influence of the lateral interactions is taken into account during training – the way in which the model is learnt leads to a set of parameters for which the simple approximate inference process should work well.
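The feed-forward up-pass amounts to a few matrix multiplies and sigmoids; a sketch with hypothetical weight shapes:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def feedforward_infer(v, weights):
    # One deterministic up-pass: real-valued probabilities from the
    # p(h|v) factors act as activities in a feed-forward net.
    # The lateral connections play no part at inference time.
    activities = []
    x = v
    for W in weights:
        x = sigmoid(x @ W)
        activities.append(x)
    return activities

rng = np.random.default_rng(0)
weights = [0.1 * rng.normal(size=(5, 4)), 0.1 * rng.normal(size=(4, 3))]
acts = feedforward_infer(rng.random(5), weights)
```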
Deep Belief Nets (DBNs): Sample generation
• To generate samples from the model we initialise the top-level (S)RBM with random states and perform a long run of Gibbs sampling amongst the topmost layers.
• At the end of this Gibbs chain we do an ancestral pass down through the layers of the model.
• If the model employs lateral connections, we must perform iterative settling to equilibrium (Gibbs sampling again) at each level before proceeding to the level below.
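A sketch of the ancestral down-pass, with damped mean-field settling (in place of full Gibbs settling) at any layer that has lateral connections. The container layout `down_weights` / `laterals` is a hypothetical convention for illustration:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ancestral_down(h_top, down_weights, laterals, rng, settle=20, lam=0.5):
    """Ancestral pass from a top-level sample down to the data layer.

    down_weights[k] has shape (n_below, n_above) for each step down;
    laterals[k] is a symmetric within-layer matrix for the layer below,
    or None if that layer has no lateral connections.
    """
    x = h_top
    for W, L in zip(down_weights, laterals):
        p = sigmoid(x @ W.T)            # vague top-down activation of the layer below
        if L is not None:
            for _ in range(settle):     # settle the within-layer MRF (damped mean field)
                p = lam * p + (1 - lam) * sigmoid(x @ W.T + p @ L)
        x = (rng.random(p.size) < p).astype(float)
    return x

rng = np.random.default_rng(1)
W1 = 0.1 * rng.normal(size=(6, 4))      # top (4 units) down to middle (6 units)
W0 = 0.1 * rng.normal(size=(8, 6))      # middle down to bottom (8 units)
Lb = 0.05 * rng.normal(size=(8, 8))
Lb = (Lb + Lb.T) / 2
np.fill_diagonal(Lb, 0.0)               # symmetric laterals, no self-connections
h_top = (rng.random(4) < 0.5).astype(float)
v = ancestral_down(h_top, [W1, W0], [None, Lb], rng)
```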
Application to natural image data
• Unsupervised deep learning applied to small patches of natural images.
• We explore the properties of models with and without lateral interactions.
• The lowest-level (S)RBM has real-valued input units to model the pixel intensities.
– Formally, the visible units are conditionally Gaussian (although we use a mean-field approximation).
– The units in the other layers are binary-stochastic, as before.
Natural image data
• 150K 20×20 patches extracted from 10 different ‘natural scene’ images from the van Hateren image database.
• Log-transform pixel values, then preprocess to have zero mean across each pixel, and then zero mean per patch.
• Whitened using ZCA filters.
• Models were trained with and without lateral connections.
• Same architecture in both cases (with and without lateral connections):
– 400 visible units (20 pixel by 20 pixel patches)
– 2000 units in the 1st hidden layer
– 500 units in the 2nd hidden layer
– 1000 units in the 3rd hidden layer
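The ZCA whitening step can be sketched as below. This is a generic symmetric-whitening construction (with an assumed regulariser `eps`), not necessarily the exact filters used for the experiments:

```python
import numpy as np

def zca_whiten(X, eps=1e-5):
    """ZCA-whiten patches; X is (n_patches, n_pixels), zero mean per pixel."""
    C = X.T @ X / X.shape[0]                              # pixel covariance
    U, s, _ = np.linalg.svd(C)
    W_zca = U @ np.diag(1.0 / np.sqrt(s + eps)) @ U.T     # symmetric whitening filter
    return X @ W_zca

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 16))
X -= X.mean(axis=0)        # zero mean across each pixel, as in the preprocessing
Xw = zca_whiten(X)
```

After whitening, the pixel covariance of `Xw` is (approximately) the identity, while ZCA's symmetric filter keeps the whitened patches as close as possible to the originals.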
Filters learned from natural images
• 2000 learnt filters used to provide input to the first hidden layer.
• Tuned & tiled across:
– Locations
– Orientations
– Spatial frequencies
– Phase
• Some are seemingly ‘random’, but even these show structure.
Images seen at different levels

[Figure: response maps for units in the first, second, and third hidden layers. Each row shows the output probability of three example units when the patch model is ‘scanned’ over the large 768×512 image in the upper right.]
Generative samples from models

[Figure panels: samples from the true data; samples from a model with lateral interactions; samples from a model without lateral interactions; and nearest-neighbour matches (cosine distance) between the training data and samples generated from the lateral model.]
Marginal and pairwise statistics
• Conditional response histograms from 2 Gabor filters applied to 10K sample images from different sources.
• Columns 1–3: filters are 2, 4, or 8 pixels apart.
• Column 4: same location but orthogonal orientations.
• Column 5: same location and orientation but one octave apart in spatial frequency.

[Figure panels: conditional histograms; pixel marginals.]
Assessing models that have computationally intractable distributions
• Small, toy systems with computable constants.
• Attempt to estimate the necessary normalising constants.
– E.g. try to find clever new bounds on partition functions, or do very patient Monte Carlo simulation.
• “Turing tests” for generative samples.
– Mix true samples from the data with those from the model; ask humans to make positive identifications, or use TAFC (two-alternative forced choice).
• Use the models as part of an actual application with a computable cost function.
– E.g. employ the prior as part of another task of interest.
Potential applications: Bayesian image enhancement
• Use the model as a prior over “true” undistorted images: $p(\mathbf{X})$.
• Propose/learn a model for the distortion or noise process: $p(\mathbf{Y}\mid\mathbf{X})$.
• Image enhancement becomes a matter of inference:
$\hat{\mathbf{X}} = \arg\max_{\mathbf{X}}\left[\, p(\mathbf{Y}\mid\mathbf{X})\, p(\mathbf{X}) \,\right]$
• Performance depends on:
– Quality of the prior (and noise model).
– Tractability of the inference problem and the solutions obtainable.
• Could be several promising approaches here. But…
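As a toy illustration of enhancement-as-inference: in the fully Gaussian conjugate case the MAP estimate above has a closed form, a simple shrinkage rule. (The deep-network prior would replace this trivially simple $p(\mathbf{X})$; the function below is illustrative only.)

```python
import numpy as np

def map_estimate_gaussian(y, prior_var, noise_var):
    """Toy MAP inference: zero-mean Gaussian prior p(x) = N(0, prior_var)
    and additive-noise model y = x + N(0, noise_var).  Maximising
    p(y|x) p(x) in this conjugate case gives a closed-form shrinkage rule."""
    return (prior_var / (prior_var + noise_var)) * y

# Shrinks the noisy observation towards the prior mean (zero).
x_hat = map_estimate_gaussian(np.array([2.0, -4.0]), prior_var=3.0, noise_var=1.0)
# -> [1.5, -3.0]
```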
Sketch: Deep network heteroencoders for “direct” image enhancement
• Abundant data for many applications.
– Take clean images and apply “distortions” (e.g. JPEG compression, Gaussian noise, blur, downsampling).
• Could “spatialise” the lateral interactions over multiple parametrically coupled layers.
• General “skip layer” connections and additional units may be useful too (especially for models with laterals).
• Graduated learning curriculum:
– Begin with small distortions and then increase slowly as learning progresses.
• Regularisation:
– Could use an autoencoding training objective (and possibly continued unsupervised learning) as additional forms of dynamic stabilisation.
• Efficient whole images & multiscale:
– Convolution/patch based? Wavelet domain? Multiple models?
• Initialise a discriminative model using the unsupervised solution (with/without laterals?).
• Train with SGD/backprop.
• Aim to directly predict the (known) true image from the distorted ones.

[Diagram: distorted image as input, true image as target.]
A few other possible future directions
• Exploration and characterisation of different architectures.
– E.g. units per layer and network depth.
• Imposition of additional priors on the representations learnt.
– E.g. topologically restricted connectivity.
• Higher-order interactions to modulate lateral connectivity based on ‘top-down’ input.
– E.g. “gating” the lateral connection between two co-linear edges.
• Better characterisation of the responses of units deep in the network and the roles played by lateral connections.
• In-depth comparisons to results and predictions from other models and with data from experimental studies of visual cortical areas (e.g. V2 receptive fields).
• Exploration of other exponential families within the models.
• Extension of deep networks with lateral connections to other domains.
Summary & Close
• Composing deep belief networks from Semi-Restricted Boltzmann Machines is not much harder than composing them from Restricted Boltzmann Machines.
– Allows us to build deep generative models with hidden MRFs.
• This additional representational power seems very useful for modelling statistical structure in natural images. (And this intuitively makes sense.)
– The generative samples from such a model are of notably high quality.
• Other types of Boltzmann machines with higher-order interactions should be equally amenable.
• Deep network pre-training seems like a promising starting point for fast, efficient image enhancement.
– Work in progress…
Thank you for your attention
• Any questions?
• Further details and background info at:
– www.cs.toronto.edu/~osindero
– www.cs.toronto.edu/~hinton