Modelling image patches with a directed hierarchy of Markov random fields
Simon Osindero* and Geoffrey Hinton
Department of Computer Science, University of Toronto
Motivation
• There’s a notion that we can efficiently describe the world in terms of hierarchical structures.
– E.g. features of features of features of…
• Being able to automatically discover and perform inference about this structure seems useful for many tasks in AI, signal processing, etc.
• Deep architectures have long been of interest in this respect.
– but until recently have typically been very difficult to learn well
• Developments in unsupervised learning for deep generative models.
– Greedy layerwise training and subsequent fine tuning.
• Interesting recognition/discriminative models made practical and “regularized” by initialization from generative solutions.
– Excellent performance on MNIST, etc.
• Lateral connections can introduce useful additional generative modeling power.
• One difficult way to maintain the constraints between the parts is to generate each part very precisely.
• Vague top-down specification of the parts is less demanding
– but it can mess up relationships between features
– so use redundant features and use lateral interactions to clean up the mess.
• Cooperative and competitive interactions between features help to coordinate locations.
[Diagram: vague top-down activation of parts, followed by clean-up using known lateral interactions; pose parameters and the features that have top-down support combine to form a “square”.]
• Similar to soldiers on a parade ground.
Motivation
• Lateral connections can help enforce statistical structure that is difficult to capture with directed connections alone.
– leads us to MRFs in the hidden layers
– should allow us to model image patches very well.
• The basic learning algorithm remains efficient and scalable – It learns one hidden layer at a time.
• Even though we have hidden MRFs, we can still use a fast, simple method to perform good, approximate inference in the learned model.
[Architecture diagram: image pixels at the bottom, with two hidden MRF (feature) layers stacked above.]
Roadmap for the rest of the talk
• Learning in deep belief networks.
– Brief overview of Restricted Boltzmann Machines (RBMs) and introduction to Semi-Restricted Boltzmann Machines (SRBMs).
– Brief overview of contrastive divergence learning.
– Compositional learning and layer-wise training with lateral connections.
– Inference and generation in deep networks.
• A hierarchical generative model for natural image patches.
– Comparison of models with and without lateral connections.
• Some thoughts on deep nets for image enhancement.
Brief background: Boltzmann Machines
• Can think in terms of an undirected graphical model or random field parametrised by a potential/energy function (next slide).
• Or in terms of a neural net with binary stochastic units.
– Each unit has a state of 1 or 0.
– The probability of turning on is determined by a sigmoid function of the weighted input from other units (plus a bias):
$p(h_i = 1) = \dfrac{1}{1 + \exp\!\left(-b_i - \sum_j v_j w_{ij}\right)}$
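As a minimal sketch, this turn-on rule can be written directly in NumPy (the function names here are illustrative, not from the talk):

```python
import numpy as np

def sigmoid(x):
    # logistic sigmoid
    return 1.0 / (1.0 + np.exp(-x))

def p_unit_on(v, w_i, b_i):
    # p(h_i = 1) = sigmoid(b_i + sum_j v_j * w_ij)
    return sigmoid(b_i + v @ w_i)

# With zero bias and zero weights the unit is on with probability 0.5.
v = np.array([1.0, 0.0, 1.0])
p = p_unit_on(v, np.zeros(3), 0.0)  # -> 0.5
```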
Restricted Boltzmann Machines (RBMs)

Energy function (for binary stochastic units):
$E(\mathbf{v},\mathbf{h}) = -\sum_{i,j} v_i h_j w_{ij}$

Joint distribution:
$p(\mathbf{v},\mathbf{h}) = \dfrac{e^{-E(\mathbf{v},\mathbf{h})}}{\sum_{\mathbf{u},\mathbf{g}} e^{-E(\mathbf{u},\mathbf{g})}}$
(the denominator is the normalising constant, aka partition function)

Factorial conditionals:
$p(\mathbf{v}\mid\mathbf{h}) = \prod_i p(v_i\mid\mathbf{h})$
$p(\mathbf{h}\mid\mathbf{v}) = \prod_j p(h_j\mid\mathbf{v})$

• Hidden units are conditionally independent given the visible units, and vice versa.
– This affords simple and effective Gibbs sampling schemes.
• Can also be viewed as a directed model with a non-factorial prior over the hidden variables.
• The energy function shown here is for binary stochastic units.
– It can be adapted to other exponential-family distributions, e.g. Gaussian, Poisson, etc.
• Parameters can be learnt using contrastive divergence as an approximation to maximum likelihood.

[Diagram: visible units $\mathbf{v}$ (the data) connected to hidden units $\mathbf{h}$ by weights $W$.]
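The factorial conditionals make alternating Gibbs sampling trivial. A minimal binary-RBM sketch in NumPy (the class layout and initialisation scheme are illustrative, not the authors' code):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class RBM:
    """Minimal binary RBM for illustration."""
    def __init__(self, n_vis, n_hid, seed=0):
        rng = np.random.default_rng(seed)
        self.W = 0.01 * rng.normal(size=(n_vis, n_hid))
        self.b_v = np.zeros(n_vis)
        self.b_h = np.zeros(n_hid)
        self.rng = rng

    def p_h_given_v(self, v):
        # factorial conditional: hidden units independent given v
        return sigmoid(self.b_h + v @ self.W)

    def p_v_given_h(self, h):
        # ...and visible units independent given h
        return sigmoid(self.b_v + h @ self.W.T)

    def gibbs_step(self, v):
        # one full alternating update: sample h | v, then v | h
        h = (self.rng.random(self.b_h.size) < self.p_h_given_v(v)).astype(float)
        v_new = (self.rng.random(self.b_v.size) < self.p_v_given_h(h)).astype(float)
        return v_new, h

rbm = RBM(n_vis=6, n_hid=4)
v0 = np.array([1., 0., 1., 1., 0., 0.])
v1, h1 = rbm.gibbs_step(v0)
```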
Semi-Restricted Boltzmann Machines (SRBMs)

Energy function, with lateral interactions $L$ among the visible units:
$E(\mathbf{v},\mathbf{h}) = -\sum_{i,j} v_i h_j w_{ij} - \sum_{i<i'} v_i v_{i'} L_{ii'}$

Factorial posterior:
$p(\mathbf{h}\mid\mathbf{v}) = \prod_j p(h_j\mid\mathbf{v})$

However:
$p(\mathbf{v}\mid\mathbf{h}) \neq \prod_i p(v_i\mid\mathbf{h})$

• Introduce lateral interactions between the visible units – an MRF.
– Still straightforward to learn parameters using contrastive divergence.
• Very effective at capturing constraints – this can be quite useful.
• Visible units are no longer conditionally independent given the hidden units.
– But inference is still fast.
• Can sample from the conditional distribution over the visible units by Gibbs sampling.
• Alternatively, we can use a mean-field approximation.
Contrastive Divergence Learning: Gibbs sampling and mean-field updates
RBM
• Hidden units conditionally independent given visible units and vice versa.
• Can use one iteration of Gibbs sampling to get “negative phase” statistics.

SRBM
• Hidden units remain conditionally independent given visible units.
• Use mean-field settling (conditioned on hidden units) to approximate negative-phase statistics involving visible units.

[Diagram: alternating Gibbs chain running from the data distribution at $t = 0$ (start of learning) towards the model distribution at $t = \infty$ (a “fantasy”). Maximum likelihood needs the equilibrium statistics $\langle v_i h_j\rangle^\infty$; contrastive divergence substitutes the one-step statistics $\langle v_i h_j\rangle^1$.]

Parameter updates:
$\Delta W_{ij} = \varepsilon\left(\langle v_i h_j\rangle^0 - \langle v_i h_j\rangle^1\right)$
$\Delta L_{ii'} = \varepsilon\left(\langle v_i v_{i'}\rangle^0 - \langle v_i v_{i'}\rangle^1\right)$
Learning a semi-restricted Boltzmann Machine

[Diagram: visible units $i$ and hidden units $j$ at $t = 0$ (the data) and $t = 1$ (the reconstruction), giving the statistics $\langle v_i h_j\rangle^0$ and $\langle v_i h_j\rangle^1$.]

1. Start with a training vector on the visible units.
2. Update all of the hidden units in parallel.
3. Repeatedly update all of the visible units in parallel. This uses mean-field updates (with the hidden units fixed) to get a “reconstruction”.
4. Update all of the hidden units again.

$\Delta W_{ij} = \varepsilon\left(\langle v_i h_j\rangle^0 - \langle v_i h_j\rangle^1\right)$
$\Delta L_{ii'} = \varepsilon\left(\langle v_i v_{i'}\rangle^0 - \langle v_i v_{i'}\rangle^1\right)$ (update for a lateral weight)
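Steps 1–4 and the two update rules can be sketched as follows. This assumes a symmetric lateral matrix `L` with zero diagonal and omits biases for brevity; it is an illustrative sketch, not the talk's implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def srbm_cd1(v0, W, L, rng, eps=0.01, mf_steps=5, lam=0.5):
    """One CD-1 update for an SRBM (biases omitted; L symmetric, zero diagonal)."""
    # 1-2. Clamp the data and sample the hidden units in parallel.
    ph0 = sigmoid(v0 @ W)
    h0 = (rng.random(ph0.size) < ph0).astype(float)
    # 3. Mean-field settle the visibles with the hiddens fixed
    #    (the "reconstruction"), using damped updates.
    v = v0.astype(float).copy()
    top_down = h0 @ W.T
    for _ in range(mf_steps):
        v = lam * v + (1 - lam) * sigmoid(top_down + v @ L)
    # 4. Recompute hidden probabilities from the reconstruction.
    ph1 = sigmoid(v @ W)
    # Contrastive divergence updates for W and the lateral weights L.
    dW = eps * (np.outer(v0, ph0) - np.outer(v, ph1))
    dL = eps * (np.outer(v0, v0) - np.outer(v, v))
    np.fill_diagonal(dL, 0.0)  # no self-connections
    return dW, dL

rng = np.random.default_rng(0)
n_vis, n_hid = 8, 5
W = 0.01 * rng.normal(size=(n_vis, n_hid))
L = np.zeros((n_vis, n_vis))
v0 = (rng.random(n_vis) < 0.5).astype(float)
dW, dL = srbm_cd1(v0, W, L, rng)
```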
Learning a semi-restricted Boltzmann Machine
• Method 1: To form a reconstruction, cycle through the visible units updating each in turn using the top-down input from the hiddens plus the lateral input from the other visibles.
• Method 2: Use “mean field” visible units that have real values. Update them all in parallel.
– Use damping to prevent oscillations:
$p_i^{t+1} = \lambda\, p_i^t + (1 - \lambda)\,\sigma(x_i^t)$
where $x_i^t$ is the total input to unit $i$ and $\lambda$ is the damping coefficient.
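A tiny numerical check of the damped update: with a constant total input the iteration settles at the sigmoid of that input instead of oscillating (λ = 0.5 is an assumed value for illustration):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def damped_update(p, x, lam=0.5):
    # p_i^{t+1} = lam * p_i^t + (1 - lam) * sigmoid(x_i^t)
    return lam * p + (1 - lam) * sigmoid(x)

# Fixed total input of 2.0: the iterate converges to sigmoid(2.0).
p = 0.99
for _ in range(60):
    p = damped_update(p, 2.0)
```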
Overview: Learning deep belief networks
• (S)RBMs will form the building blocks for the deep networks we will explore.
• We “stack” the constituent models on top of each other in order to compose the final network.
Deep Belief Nets (DBNs): Compositional learning
Basic Idea & Intuition:
• Learn a DBN greedily, one layer at a time, by sequentially training a series of undirected models.
• Inferred hidden states of the “previous layer” are used as training data on the visible units for the “subsequent layer”.
• After learning the first hidden layer:
$p(\mathbf{v}) = \sum_{\mathbf{h}} p(\mathbf{v}\mid\mathbf{h})\, p(\mathbf{h})$
• If we leave $p(\mathbf{v}\mid\mathbf{h})$ alone and improve $p(\mathbf{h})$, we will improve $p(\mathbf{v})$.
• To improve $p(\mathbf{h})$, we need it to be a better model of the aggregated posterior distribution over hidden vectors inferred from the data.
• We can train another (S)RBM on this aggregated posterior distribution to produce an improved model of $p(\mathbf{h})$.
• This process can be repeated recursively…
• Can also gain insights and justify the greedy learning more formally by considering an undirected model as an infinitely deep directed model and applying variational bounds. (E.g. see Hinton, Osindero & Teh 2006.)
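The greedy recursion can be sketched as below; `train_rbm_cd1` is a deliberately minimal CD-1 trainer (biases omitted), illustrative only and not the configuration used in the talk:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm_cd1(data, n_hid, rng, eps=0.05, epochs=5):
    # Minimal CD-1 trainer for a binary RBM (biases omitted for brevity).
    n_vis = data.shape[1]
    W = 0.01 * rng.normal(size=(n_vis, n_hid))
    for _ in range(epochs):
        for v0 in data:
            ph0 = sigmoid(v0 @ W)
            h0 = (rng.random(n_hid) < ph0).astype(float)
            v1 = sigmoid(h0 @ W.T)          # mean-field reconstruction
            ph1 = sigmoid(v1 @ W)
            W += eps * (np.outer(v0, ph0) - np.outer(v1, ph1))
    return W

def greedy_stack(data, layer_sizes, rng):
    # Train one layer at a time; the inferred hidden probabilities
    # become the "data" for the next layer up.
    weights, x = [], data
    for n_hid in layer_sizes:
        W = train_rbm_cd1(x, n_hid, rng)
        weights.append(W)
        x = sigmoid(x @ W)                  # re-represent the data
    return weights

rng = np.random.default_rng(0)
data = (rng.random((20, 8)) < 0.5).astype(float)
ws = greedy_stack(data, [6, 4], rng)
```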
Deep Belief Nets (DBNs): Compositional learning

[Diagram: data layer with a hidden MRF above it, joined by directed connections; a second hidden MRF is added on top as further layers are learnt.]

• Learn an initial model over the “actual” data.
• Re-represent the data.
• ‘Freeze’ the lower-level connections – effectively considering them to define a directed model.
• Learn a new model using the inferred distribution over hidden units as data.
• At each layer we apply contrastive divergence to learn an undirected model.
• The same basic method applies whether we have lateral connections or not – the main difference is the form of the contrastive divergence updates.
Deep Belief Nets (DBNs): Fast approximate inference
• The inter-layer parameters learnt during the greedy layer-wise training also provide us with a fast way to do approximate inference in the final model.
– Methods exist to refine these inference parameters.
• This variational inference simply uses the real-valued probabilities as activities in a feed-forward neural network.
– The probabilities come from the $p(\mathbf{h}\mid\mathbf{v})$ terms given by the component (S)RBMs during learning.
• The lateral connections do not enter directly into this approximate inference. Representations can be formed in an entirely feed-forward manner – no iterative settling is required.
• Effectively, the influence of the lateral interactions is taken into account during training – the way in which the model is learnt leads to a set of parameters for which the simple approximate inference process should work well.
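The feed-forward up-pass amounts to a few matrix multiplies and sigmoids; a sketch with hypothetical weight shapes:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def feedforward_infer(v, weights):
    # One deterministic up-pass: real-valued probabilities from the
    # p(h|v) factors act as activities in a feed-forward net.
    # The lateral connections play no part at inference time.
    activities = []
    x = v
    for W in weights:
        x = sigmoid(x @ W)
        activities.append(x)
    return activities

rng = np.random.default_rng(0)
weights = [0.1 * rng.normal(size=(5, 4)), 0.1 * rng.normal(size=(4, 3))]
acts = feedforward_infer(rng.random(5), weights)
```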
Deep Belief Nets (DBNs): Sample generation
• To generate samples from the model we initialise the top-level (S)RBM with random states and perform a long run of Gibbs sampling amongst the topmost layers.
• At the end of this Gibbs chain we do an ancestral pass down through the layers of the model.
• If the model employs lateral connections, we must perform iterative settling to equilibrium (Gibbs sampling again) at each level before proceeding to the level below.
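A sketch of the ancestral down-pass, with damped mean-field settling (in place of full Gibbs settling) at any layer that has lateral connections. The container layout `down_weights` / `laterals` is a hypothetical convention for illustration:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ancestral_down(h_top, down_weights, laterals, rng, settle=20, lam=0.5):
    """Ancestral pass from a top-level sample down to the data layer.

    down_weights[k] has shape (n_below, n_above) for each step down;
    laterals[k] is a symmetric within-layer matrix for the layer below,
    or None if that layer has no lateral connections.
    """
    x = h_top
    for W, L in zip(down_weights, laterals):
        p = sigmoid(x @ W.T)            # vague top-down activation of the layer below
        if L is not None:
            for _ in range(settle):     # settle the within-layer MRF (damped mean field)
                p = lam * p + (1 - lam) * sigmoid(x @ W.T + p @ L)
        x = (rng.random(p.size) < p).astype(float)
    return x

rng = np.random.default_rng(1)
W1 = 0.1 * rng.normal(size=(6, 4))      # top (4 units) down to middle (6 units)
W0 = 0.1 * rng.normal(size=(8, 6))      # middle down to bottom (8 units)
Lb = 0.05 * rng.normal(size=(8, 8))
Lb = (Lb + Lb.T) / 2
np.fill_diagonal(Lb, 0.0)               # symmetric laterals, no self-connections
h_top = (rng.random(4) < 0.5).astype(float)
v = ancestral_down(h_top, [W1, W0], [None, Lb], rng)
```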
Application to natural image data
• Unsupervised deep learning applied to small patches of natural images.
• We explore the properties of models with and without lateral interactions.
• The lowest-level (S)RBM has real-valued input units to model the pixel intensities.
– Formally, the visible units are conditionally Gaussian (although we use a mean-field approximation).
– The units in the other layers are binary-stochastic, as before.
Natural image data
• 150K 20×20 patches extracted from 10 different ‘natural scene’ images from the van Hateren image database.
• Log-transform pixel values, then preprocess to have zero mean across each pixel, and then zero mean per patch.
• Whitened using ZCA filters.
• Models were trained with and without lateral connections.
• Same architecture in both cases (with and without lateral connections):
– 400 visible units (20 pixel by 20 pixel patches)
– 2000 units in the 1st hidden layer
– 500 units in the 2nd hidden layer
– 1000 units in the 3rd hidden layer
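The ZCA whitening step can be sketched as below. This is a generic symmetric-whitening construction (with an assumed regulariser `eps`), not necessarily the exact filters used for the experiments:

```python
import numpy as np

def zca_whiten(X, eps=1e-5):
    """ZCA-whiten patches; X is (n_patches, n_pixels), zero mean per pixel."""
    C = X.T @ X / X.shape[0]                              # pixel covariance
    U, s, _ = np.linalg.svd(C)
    W_zca = U @ np.diag(1.0 / np.sqrt(s + eps)) @ U.T     # symmetric whitening filter
    return X @ W_zca

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 16))
X -= X.mean(axis=0)        # zero mean across each pixel, as in the preprocessing
Xw = zca_whiten(X)
```

After whitening, the pixel covariance of `Xw` is (approximately) the identity, while ZCA's symmetric filter keeps the whitened patches as close as possible to the originals.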
Filters learned from natural images
• 2000 learnt filters used to provide input to the first hidden layer.
• Tuned & tiled across:
– Locations
– Orientations
– Spatial frequencies
– Phase
• Some are seemingly ‘random’, but even these show structure.
Images seen at different levels

[Figure: response maps for units in the first, second, and third hidden layers. Each row shows the output probability of three example units when the patch model is ‘scanned’ over the large 768×512 image in the upper right.]
Generative samples from models

[Figure panels: samples from the true data; samples from a model with lateral interactions; samples from a model without lateral interactions; and nearest-neighbour matches (cosine distance) between the training data and samples generated from the lateral model.]
Marginal and pairwise statistics
• Conditional response histograms from 2 Gabor filters applied to 10K sample images from different sources.
• Columns 1–3: filters are 2, 4, or 8 pixels apart.
• Column 4: same location but orthogonal orientations.
• Column 5: same location and orientation but one octave apart in spatial frequency.

[Figure panels: conditional histograms; pixel marginals.]
Assessing models that have computationally intractable distributions
• Small, toy systems with computable constants.
• Attempt to estimate the necessary normalising constants.
– E.g. try to find clever new bounds on partition functions, or do very patient Monte Carlo simulation.
• “Turing tests” for generative samples.
– Mix true samples from the data with those from the model; ask humans to make positive identifications, or use TAFC (two-alternative forced choice).
• Use the models as part of an actual application with a computable cost function.
– E.g. employ the prior as part of another task of interest.
Potential applications: Bayesian image enhancement
• Use the model as a prior over “true” undistorted images: $p(\mathbf{X})$.
• Propose/learn a model for the distortion or noise process: $p(\mathbf{Y}\mid\mathbf{X})$.
• Image enhancement becomes a matter of inference:
$\hat{\mathbf{X}} = \arg\max_{\mathbf{X}}\left[\, p(\mathbf{Y}\mid\mathbf{X})\, p(\mathbf{X}) \,\right]$
• Performance depends on:
– Quality of the prior (and noise model).
– Tractability of the inference problem and the solutions obtainable.
• Could be several promising approaches here. But…
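As a toy illustration of enhancement-as-inference: in the fully Gaussian conjugate case the MAP estimate above has a closed form, a simple shrinkage rule. (The deep-network prior would replace this trivially simple $p(\mathbf{X})$; the function below is illustrative only.)

```python
import numpy as np

def map_estimate_gaussian(y, prior_var, noise_var):
    """Toy MAP inference: zero-mean Gaussian prior p(x) = N(0, prior_var)
    and additive-noise model y = x + N(0, noise_var).  Maximising
    p(y|x) p(x) in this conjugate case gives a closed-form shrinkage rule."""
    return (prior_var / (prior_var + noise_var)) * y

# Shrinks the noisy observation towards the prior mean (zero).
x_hat = map_estimate_gaussian(np.array([2.0, -4.0]), prior_var=3.0, noise_var=1.0)
# -> [1.5, -3.0]
```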
Sketch: Deep network heteroencoders for “direct” image enhancement
• Abundant data for many applications.
– Take clean images and apply “distortions” (e.g. JPEG compression, Gaussian noise, blur, downsampling).
• Could “spatialise” the lateral interactions over multiple parametrically coupled layers.
• General “skip layer” connections and additional units may be useful too (especially for models with laterals).
• Graduated learning curriculum:
– Begin with small distortions and then increase slowly as learning progresses.
• Regularisation:
– Could use an autoencoding training objective (and possibly continued unsupervised learning) as additional forms of dynamic stabilisation.
• Efficient whole images & multiscale:
– Convolution/patch based? Wavelet domain? Multiple models?
• Initialise a discriminative model using the unsupervised solution (with/without laterals?).
• Train with SGD/backprop.
• Aim to directly predict the (known) true image from the distorted ones.

[Diagram: distorted image as input, true image as target.]
A few other possible future directions
• Exploration and characterisation of different architectures.
– E.g. units per layer and network depth.
• Imposition of additional priors on the representations learnt.
– E.g. topologically restricted connectivity.
• Higher-order interactions to modulate lateral connectivity based on ‘top-down’ input.
– E.g. “gating” the lateral connection between two co-linear edges.
• Better characterisation of the responses of units deep in the network and the roles played by lateral connections.
• In-depth comparisons to results and predictions from other models and with data from experimental studies of visual cortical areas (e.g. V2 receptive fields).
• Exploration of other exponential families within the models.
• Extension of deep networks with lateral connections to other domains.
Summary & Close
• Composing deep belief networks from Semi-Restricted Boltzmann Machines is not much harder than composing them from Restricted Boltzmann Machines.
– Allows us to build deep generative models with hidden MRFs.
• This additional representational power seems very useful for modelling statistical structure in natural images. (And this intuitively makes sense.)
– The generative samples from such a model are of notably high quality.
• Other types of Boltzmann machines with higher-order interactions should be equally amenable.
• Deep network pre-training seems like a promising starting point for fast, efficient image enhancement.
– Work in progress…
Thank you for your attention
• Any questions?
• Further details and background info at:
– www.cs.toronto.edu/~osindero
– www.cs.toronto.edu/~hinton