Feature representation with Deep Gaussian processes


Andreas Damianou

Department of Neuro- and Computer Science, University of Sheffield, UK

Feature Extraction with Gaussian Processes Workshop, Sheffield, 18/09/2014

Outline

Part 1: A general view
  - Deep modelling and deep GPs

Part 2: Structure in the latent space (priors for the features)
  - Dynamics
  - Autoencoders

Part 3: Deep Gaussian processes
  - Bayesian regularization
  - Inducing Points
  - Structure: ARD and MRD (multi-view)
  - Examples

Summary


Feature representation + feature relationships

- Good data representations facilitate learning / inference.
- Deep structures facilitate learning associated with abstract information (Bengio, 2009).

Hierarchy of features

[Figure: a learned hierarchy of features. Lee et al. 09, "Convolutional DBNs for Scalable Unsupervised Learning of Hierarchical Representations"]

[Figure: layers of latent variables, from local features at the bottom to abstract concepts at the top. Damianou and Lawrence 2013, "Deep Gaussian Processes"]

Deep learning (directed graph)

Y = f(f(· · · f(X))),   H_i = f_i(H_{i−1})
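As a toy illustration of the composition above (my own sketch, not from the slides): stacking arbitrary non-linear mappings layer by layer, with tanh layers standing in for the mappings f_i.

```python
# Minimal sketch: a deep mapping as repeated composition H_i = f_i(H_{i-1}).
# The tanh-based f_i are arbitrary placeholders for the non-linear mappings.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))   # N inputs of dimension 5

H = X
for i in range(3):                  # three hidden layers
    W = rng.standard_normal((H.shape[1], 4))
    H = np.tanh(H @ W)              # H_i = f_i(H_{i-1})
Y = H                               # output of the composed mapping
```

In a deep GP the parametric f_i above are replaced by non-parametric GP mappings, as the next slides describe.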

Deep Gaussian processes - Big Picture

Deep GP:

- Directed graphical model
- Non-parametric, non-linear mappings f
- Mappings f marginalised out analytically
- Likelihood is a non-linear function of the inputs
- Continuous variables
- NOT a GP!

Challenges:

- Marginalise out H
- No sampling: analytic approximation of the objective

Solution:

- Variational approximation
- This also gives access to the model evidence


Hierarchical GP-LVM

- Hidden layers are not marginalised out.
- This leads to some difficulties.

[Lawrence and Moore, 2007]


Sampling from a deep GP

[Figure: a sample propagated through a deep GP, from the input through unobserved intermediate layers (GP mappings f) to the output Y.]
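Sampling forward through the model is straightforward: draw the hidden layer from a GP prior on the inputs, then draw the output from a GP whose inputs are that hidden layer. A minimal numpy sketch (mine, not the talk's code), with the RBF kernel and jitter values chosen only for illustration:

```python
# Minimal sketch: one sample from a 2-layer deep GP, y = f2(f1(x)) + noise.
import numpy as np

def rbf(A, B, variance=1.0, lengthscale=0.5):
    """Exponentiated-quadratic (RBF) covariance between the rows of A and B."""
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 200)[:, None]            # observed inputs

# Layer 1: h ~ GP(0, k(x, x))
K1 = rbf(x, x) + 1e-6 * np.eye(len(x))          # jitter for numerical stability
h = rng.multivariate_normal(np.zeros(len(x)), K1)[:, None]

# Layer 2: y ~ GP(0, k(h, h)) + noise, warping the output of the first layer
K2 = rbf(h, h) + 1e-6 * np.eye(len(h))
y = rng.multivariate_normal(np.zeros(len(h)), K2) + 0.01 * rng.standard_normal(len(h))
```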

Outline

Part 1: A general view
  - Deep modelling and deep GPs

Part 2: Structure in the latent space (priors for the features)
  - Dynamics
  - Autoencoders

Part 3: Deep Gaussian processes
  - Bayesian regularization
  - Inducing Points
  - Structure: ARD and MRD (multi-view)
  - Examples

Summary


Dynamics

GP-LVM

[Graphical model: latent H maps to observed Y; a time index t drives a GP prior on H.]

- If Y is a multivariate time-series, then H also has to be one.
- Place a temporal GP prior on the latent space:

  h = h(t) ∼ GP(0, k_h(t, t'))
  f = f(h) ∼ GP(0, k_f(h, h'))
  y = f(h) + ε

Dynamics

- Dynamics are encoded in the covariance matrix K = k(t, t).
- We can consider special forms for K, e.g. to model individual sequences or to model periodic data (a sketch follows after the videos).

- https://www.youtube.com/watch?v=i9TEoYxaBxQ (missa)
- https://www.youtube.com/watch?v=mUY1XHPnoCU (dog)
- https://www.youtube.com/watch?v=fHDWloJtgk8 (mocap)
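As an illustration of the "special forms for K" mentioned above, here is a minimal numpy sketch (mine, not from the talk): a periodic kernel for periodic data, and a block-diagonal covariance so that independently modelled sequences are uncorrelated.

```python
# Minimal sketch: two special forms for the temporal covariance K = k(t, t).
import numpy as np

def rbf(t1, t2, lengthscale=1.0):
    """Exponentiated-quadratic kernel on scalar time inputs."""
    return np.exp(-0.5 * (t1[:, None] - t2[None, :])**2 / lengthscale**2)

def periodic(t1, t2, period=2.0, lengthscale=1.0):
    """Standard periodic kernel: strongly correlates points one period apart."""
    return np.exp(-2.0 * np.sin(np.pi * np.abs(t1[:, None] - t2[None, :]) / period)**2
                  / lengthscale**2)

t = np.linspace(0, 4, 50)

# Periodic data: use the periodic kernel for K.
K_periodic = periodic(t, t)

# Individual sequences: block-diagonal K, one block per sequence,
# so latent points in different sequences are independent a priori.
t_a, t_b = t[:25], t[25:]
K_blocks = np.block([
    [rbf(t_a, t_a), np.zeros((25, 25))],
    [np.zeros((25, 25)), rbf(t_b, t_b)],
])
```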

Autoencoder

[Graphical model: Y mapped to latent H and back to Y.]

GP-LVM:

[Figure: 2-D latent embedding found by the GP-LVM.]

Non-parametric auto-encoder:

[Figure: 2-D latent embedding found by the non-parametric auto-encoder.]

Outline

Part 1: A general view
  - Deep modelling and deep GPs

Part 2: Structure in the latent space (priors for the features)
  - Dynamics
  - Autoencoders

Part 3: Deep Gaussian processes
  - Bayesian regularization
  - Inducing Points
  - Structure: ARD and MRD (multi-view)
  - Examples

Summary

Sampling from a deep GP

[Figure: a sample propagated through a deep GP, from the input through unobserved intermediate layers (GP mappings f) to the output Y.]

Modelling a step function - Posterior samples

[Figure: posterior samples on step-function data, three panels over inputs in roughly [-1, 2].]

- GP (top), 2 and 4 layer deep GP (middle, bottom).
- Posterior is bimodal.

Image by J. Hensman
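For comparison with the top panel, a minimal sketch (mine, using GPy) of fitting a standard GP to step-function data; the data and parameter choices are illustrative assumptions, not the talk's settings. The single-layer GP posterior is unimodal and tends to over-smooth the discontinuity, which is what motivates the deeper models.

```python
# Minimal sketch: a standard GP fit to noisy step-function data with GPy.
import numpy as np
import GPy

rng = np.random.default_rng(0)
X = np.linspace(-1, 2, 40)[:, None]
Y = (X > 0.5).astype(float) + 0.02 * rng.standard_normal(X.shape)  # noisy step

kernel = GPy.kern.RBF(input_dim=1)
model = GPy.models.GPRegression(X, Y, kernel)
model.optimize(messages=False)

Xtest = np.linspace(-1, 2, 200)[:, None]
mean, var = model.predict(Xtest)   # smooth, unimodal posterior around the step
```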

MAP optimisation?

[Graphical model: x → h_2 → h_1 → y, with GP mappings f between consecutive layers.]

- Joint = p(y|h_1) p(h_1|h_2) p(h_2|x)
- MAP optimization is extremely problematic because:
  • Dimensionality of the h's has to be decided a priori
  • Prone to overfitting, if the h's are treated as parameters
  • Deep structures are not supported by the model's objective but have to be forced [Lawrence & Moore '07]

Regularization solution: approximate Bayesian framework

- Analytic variational bound F ≤ log p(y|x)
  • Extend the Variational Free Energy framework for sparse GPs (Titsias 09) / variational compression tricks (the standard single-layer bound is sketched below).
  • Approximately marginalise out h
- Automatic structure discovery (nodes, connections, layers)
  • Use the Automatic / Manifold Relevance Determination trick
- ...
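For reference, the single-layer bound being extended is Titsias' variational free energy for sparse GP regression; a sketch of its usual form in generic notation (mine, not copied from the slides), with m inducing points and Q_nn the Nystrom-style approximation of K_nn:

```latex
% Titsias (2009) variational lower bound for sparse GP regression.
% The deep GP applies this style of compression at every layer.
\log p(\mathbf{y} \mid X) \;\ge\;
  \log \mathcal{N}\!\left(\mathbf{y} \,\middle|\, \mathbf{0},\; Q_{nn} + \sigma^2 I \right)
  \;-\; \frac{1}{2\sigma^2}\,\mathrm{tr}\!\left( K_{nn} - Q_{nn} \right),
\qquad
Q_{nn} = K_{nm} K_{mm}^{-1} K_{mn}
```

The deep-GP construction repeats this compression at every layer, with the additional complication that each layer's inputs are themselves random, as the next slide shows.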

Direct marginalisation of h is intractable (O_o)

- New objective: p(y|x) = ∫_{h_1} p(y|h_1) [ ∫_{h_2} p(h_1|h_2) p(h_2|x) ]
- p(h_1|x) = ∫_{h_2, f_2} p(h_1|f_2) p(f_2|h_2) p(h_2|x), where p(f_2|h_2) contains k(h_2, h_2)^{-1}, i.e. the inverse of a covariance built on the random inputs h_2 that we need to integrate over.
- Augment the layer with inducing outputs u_2 evaluated at inducing inputs h̃_2:
  p(h_1|x, h̃_2) = ∫_{h_2, f_2, u_2} p(h_1|f_2) p(f_2|u_2, h_2) p(u_2|h̃_2) p(h_2|x)
- Lower-bound with a variational distribution Q = p(f_2|u_2, h_2) q(u_2) q(h_2):
  log p(h_1|x, h̃_2) ≥ ∫_{h_2, f_2, u_2} Q log [ p(h_1|f_2) p(f_2|u_2, h_2) p(u_2|h̃_2) p(h_2|x) / Q ]
- The difficult conditional p(f_2|u_2, h_2) cancels between the numerator and Q:
  log p(h_1|x, h̃_2) ≥ ∫_{h_2, f_2, u_2} Q log [ p(h_1|f_2) p(u_2|h̃_2) p(h_2|x) / (q(u_2) q(h_2)) ]
- p(u_2|h̃_2) contains k(h̃_2, h̃_2)^{-1}, which depends only on the known inducing inputs h̃_2, so the bound is tractable.


Inducing points: sparseness, tractability and Big Data

[Figure: per-layer inputs h_1, ..., h_N with function values f_1, ..., f_N, compressed through a small set of inducing inputs h̃ with inducing outputs u.]

- Inducing points were originally introduced for faster (sparse) GPs (see the sketch below).
- Our manipulation allows us to compress information from the inputs of every layer (variational compression).
- This induces tractability.
- Viewing them as global variables ⇒ extension to Big Data [Hensman et al., UAI 2013]
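A minimal sketch (mine, not the talk's code) of the single-layer case: sparse GP regression in GPy with a handful of inducing points, the construction that the deep GP stacks per layer. The data and parameter choices are illustrative assumptions.

```python
# Minimal sketch: sparse GP regression with M inducing points in GPy.
import numpy as np
import GPy

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, (500, 1))
Y = np.sin(X) + 0.1 * rng.standard_normal(X.shape)

# 20 inducing inputs summarise the 500 training inputs (variational compression).
model = GPy.models.SparseGPRegression(X, Y, kernel=GPy.kern.RBF(1), num_inducing=20)
model.optimize(messages=False)
print(model.Z.shape)   # (20, 1): the learned inducing inputs
# Treating the inducing outputs as global variables is what enables the
# stochastic, Big-Data extension of Hensman et al. (UAI 2013).
```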

Automatic dimensionality detection

- Achieved by employing automatic relevance determination (ARD) priors for the mapping f (a GPy sketch follows below).
- f ∼ GP(0, k_f) with:

  k_f(x_i, x_j) = σ² exp( -1/2 Σ_{q=1}^{Q} w_q (x_{i,q} - x_{j,q})² )

- Example:

[Figure.]
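A minimal sketch (mine) of the same idea in GPy, which parameterises ARD via per-dimension lengthscales (so w_q corresponds to 1/lengthscale_q²). The model class, data, and dimensions here are illustrative assumptions.

```python
# Minimal sketch: ARD in GPy via a Bayesian GP-LVM with an ARD RBF kernel.
# After optimisation, irrelevant latent dimensions get long lengthscales
# (equivalently, small weights w_q = 1 / lengthscale_q**2).
import numpy as np
import GPy

rng = np.random.default_rng(2)
Y = rng.standard_normal((100, 20))          # placeholder data matrix

Q = 10                                      # latent dimensions offered to the model
kernel = GPy.kern.RBF(input_dim=Q, ARD=True)
model = GPy.models.BayesianGPLVM(Y, input_dim=Q, kernel=kernel, num_inducing=30)
model.optimize(messages=False)

weights = 1.0 / model.kern.lengthscale**2   # relevance of each latent dimension
print(weights)
```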

Deep GP: MNIST example

http://staffwww.dcs.sheffield.ac.uk/people/A.Damianou/research/index.html#DeepGPs

MNIST: The first layer

1 layer GP-LVM:

[Bar chart over dimensions 1-15.]

5 layer deep GP (showing 1st layer):

[Bar chart over dimensions 1-15.]

Manifold Relevance Determination

- Observations come in two different views: Y and Z.
- The latent space is segmented into parts private to Y, private to Z, and shared between Y and Z.
- Used for data consolidation and for discovering commonalities (a GPy sketch follows below).

[Figure: MRD weights per view.]
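A minimal sketch (mine) of MRD in GPy; the constructor arguments, views, dimensions, and inducing-point count are assumptions on my part rather than the talk's settings, so check GPy.models.MRD's documentation before relying on them.

```python
# Minimal sketch: Manifold Relevance Determination over two views in GPy.
# Each view gets ARD weights over a shared latent space; comparing the
# per-view weights separates shared from private latent dimensions.
import numpy as np
import GPy

rng = np.random.default_rng(3)
Y = rng.standard_normal((80, 12))   # view 1 (placeholder data)
Z = rng.standard_normal((80, 8))    # view 2 (placeholder data)

Q = 6
model = GPy.models.MRD([Y, Z], input_dim=Q,
                       kernel=GPy.kern.RBF(Q, ARD=True),
                       num_inducing=25)
model.optimize(messages=False)
print(model)   # inspect each view's ARD lengthscales to find shared/private dims
```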

Deep GPs: Another multi-view example

[Figure: learned latent spaces X for the two views.]

Automatic structure discovery

Tools:

- ARD: eliminate unnecessary nodes/connections
- MRD: conditional independencies
- Approximating the evidence: number of layers (?)


Deep GP variants

[Graphical models of the variants; arrows denote the generative direction.]

- Deep GP - Supervised: X → H2 → H1 → Y
- Deep GP - Unsupervised: H2 → H1 → Y
- Multi-view: X → a latent layer split into a shared part H1^s and parts H1^Y, H1^Z private to the views Y and Z
- Warped GP: X → H → Y
- Temporal: t → H → Y
- Autoencoder: Y mapped through latent layers H1, H2 back to Y

Summary

- A deep GP is not a GP.
- Sampling is straightforward. Regularization and training need to be worked out.
- The solution is a special treatment of auxiliary variables.
- Many variants: multi-view, temporal, autoencoders, ...
- Future: how to make it scalable?
- Future: how does it compare to / complement more traditional deep models?

Thanks

Thanks to Neil Lawrence, James Hensman, Michalis Titsias, Carl Henrik Ek.

References:

• N. D. Lawrence (2006), "The Gaussian process latent variable model", Technical Report CS-06-03, Department of Computer Science, The University of Sheffield.

• N. D. Lawrence (2006), "Probabilistic dimensional reduction with the Gaussian process latent variable model" (talk).

• C. E. Rasmussen (2008), "Learning with Gaussian Processes", Max Planck Institute for Biological Cybernetics, published Feb. 5, 2008 (Videolectures.net).

• C. E. Rasmussen and C. K. I. Williams (2006), Gaussian Processes for Machine Learning, MIT Press, Cambridge, MA. ISBN 026218253X.

• M. K. Titsias (2009), "Variational learning of inducing variables in sparse Gaussian processes", AISTATS 2009.

• A. C. Damianou, M. K. Titsias and N. D. Lawrence (2011), "Variational Gaussian process dynamical systems", NIPS 2011.

• A. C. Damianou, C. H. Ek, M. K. Titsias and N. D. Lawrence (2012), "Manifold Relevance Determination", ICML 2012.

• A. C. Damianou and N. D. Lawrence (2013), "Deep Gaussian processes", AISTATS 2013.

• J. Hensman, N. Fusi and N. D. Lawrence (2013), "Gaussian processes for Big Data", UAI 2013.
