Transcript
Page 1: Feature representation with Deep Gaussian processes

Feature representation with Deep Gaussian processes

Andreas Damianou

Department of Neuro- and Computer Science, University of Sheffield, UK

Feature Extraction with Gaussian Processes Workshop, Sheffield, 18/09/2014

Page 2: Feature representation with Deep Gaussian processes

Outline

Part 1: A general view: deep modelling and deep GPs

Part 2: Structure in the latent space (priors for the features): dynamics, autoencoders

Part 3: Deep Gaussian processes: Bayesian regularization, inducing points, structure (ARD and MRD, multi-view), examples

Summary


Page 4: Feature representation with Deep Gaussian processes

Feature representation + feature relationships

Page 5: Feature representation with Deep Gaussian processes

Feature representation + feature relationships

I Good data representations facilitate learning / inference.

I Deep structures facilitate learning associated with abstract information (Bengio, 2009).

Page 6: Feature representation with Deep Gaussian processes

Hierarchy of features

[Lee et al. 09, “Convolutional DBNs for Scalable Unsupervised Learning of Hierarchical Representations”]

Page 7: Feature representation with Deep Gaussian processes

Hierarchy of features

[Figure: a feature hierarchy, from local features in the lower layers to abstract concepts in the upper layers]

[Damianou and Lawrence 2013, “Deep Gaussian Processes”]

Page 12: Feature representation with Deep Gaussian processes

Deep learning (directed graph)

Y = f(f(· · · f(X))), with H_i = f_i(H_{i−1})
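To make the composition concrete, here is a minimal sketch in plain NumPy; the tanh maps are illustrative stand-ins for generic learned functions f_i (not GPs).

```python
import numpy as np

# Minimal sketch of a directed deep model: Y = f3(f2(f1(X))), H_i = f_i(H_{i-1}).
rng = np.random.default_rng(0)

def make_layer(in_dim, out_dim):
    W = rng.normal(size=(in_dim, out_dim))
    b = rng.normal(size=out_dim)
    return lambda H: np.tanh(H @ W + b)      # H_i = f_i(H_{i-1})

X = rng.normal(size=(100, 3))                # N = 100 inputs of dimension 3
layers = [make_layer(3, 5), make_layer(5, 5), make_layer(5, 2)]

H = X
for f in layers:
    H = f(H)                                 # push the representation one layer up
Y = H                                        # Y = f3(f2(f1(X)))
print(Y.shape)                               # (100, 2)
```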

Page 13: Feature representation with Deep Gaussian processes

Deep Gaussian processes - Big Picture

Deep GP:

I Directed graphical model

I Non-parametric, non-linear mappings f

I Mappings f marginalised out analytically

I Likelihood is a non-linear function of the inputs

I Continuous variables

I NOT a GP!

Challenges:

I Marginalise out H

I No sampling: analytic approximation of objective

Solution:

I Variational approximation

I This also gives access to the model evidence


Page 16: Feature representation with Deep Gaussian processes

Hierarchical GP-LVM

I Hidden layers are not marginalised out.

I This leads to some difficulties.

[Lawrence and Moore, 2007]


Page 18: Feature representation with Deep Gaussian processes

Sampling from a deep GP

[Figure: a chain of GP mappings f from the input, through unobserved layers, to the output Y]
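A minimal sketch of this sampling process in plain NumPy: each layer draws a function from a GP (an RBF kernel with illustrative hyperparameters) evaluated at the previous layer's outputs.

```python
import numpy as np

def rbf(A, B, variance=1.0, lengthscale=1.0):
    """Squared-exponential covariance between the rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d2 / lengthscale ** 2)

def sample_gp_layer(H, rng, jitter=1e-8):
    """One GP function sample evaluated at inputs H (N x Q), returned as (N x 1)."""
    K = rbf(H, H) + jitter * np.eye(len(H))
    return np.linalg.cholesky(K) @ rng.normal(size=(len(H), 1))

rng = np.random.default_rng(1)
X = np.linspace(-1.0, 1.0, 200)[:, None]     # observed inputs
H = X
for layer in range(3):                       # three stacked GP layers
    H = sample_gp_layer(H, rng)              # this layer's output feeds the next layer
Y = H + 0.01 * rng.normal(size=H.shape)      # observation noise at the output layer
```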

Page 19: Feature representation with Deep Gaussian processes

Outline

Part 1: A general view: deep modelling and deep GPs

Part 2: Structure in the latent space (priors for the features): dynamics, autoencoders

Part 3: Deep Gaussian processes: Bayesian regularization, inducing points, structure (ARD and MRD, multi-view), examples

Summary


Page 22: Feature representation with Deep Gaussian processes

Dynamics

GP-LVM

[Figure: graphical model in which time t drives the latent H, which generates Y]

I If Y is a multivariate time series, then H also has to be one.

I Place a temporal GP prior on the latent space:

h = h(t), with h ∼ GP(0, k_h(t, t′))
f = f(h), with f ∼ GP(0, k_f(h, h′))
y = f(h) + ε
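A hedged NumPy sketch of this construction (kernel choices and hyperparameter values are illustrative): sample a latent path h(t) from a temporal GP, map it through a second GP, and add noise.

```python
import numpy as np

def rbf(A, B, variance=1.0, lengthscale=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d2 / lengthscale ** 2)

def draw(K, rng, jitter=1e-8):
    """One zero-mean sample with covariance K."""
    return np.linalg.cholesky(K + jitter * np.eye(len(K))) @ rng.normal(size=(len(K), 1))

rng = np.random.default_rng(2)
t = np.linspace(0.0, 10.0, 300)[:, None]       # time stamps
h = draw(rbf(t, t, lengthscale=1.5), rng)      # h ~ GP(0, k_h(t, t')): smooth latent path
f = draw(rbf(h, h, lengthscale=0.5), rng)      # f ~ GP(0, k_f(h, h')): mapping to data space
y = f + 0.05 * rng.normal(size=f.shape)        # y = f(h) + eps
```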

Page 23: Feature representation with Deep Gaussian processes

Dynamics

I Dynamics are encoded in the covariance matrix K = k(t, t).

I We can consider special forms for K, e.g. a block-diagonal K to model individual sequences independently, or a periodic covariance to model periodic data (see the sketch after the links below).

I https://www.youtube.com/watch?v=i9TEoYxaBxQ (missa)

I https://www.youtube.com/watch?v=mUY1XHPnoCU (dog)

I https://www.youtube.com/watch?v=fHDWloJtgk8 (mocap)
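The two special forms of K mentioned above, sketched with illustrative values (assuming SciPy is available for the block-diagonal helper):

```python
import numpy as np
from scipy.linalg import block_diag

def rbf(A, B, variance=1.0, lengthscale=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d2 / lengthscale ** 2)

def periodic(A, B, variance=1.0, lengthscale=1.0, period=1.0):
    """Standard periodic (exp-sine-squared) covariance over 1-D time stamps."""
    d = np.abs(A[:, None, 0] - B[None, :, 0])
    return variance * np.exp(-2.0 * np.sin(np.pi * d / period) ** 2 / lengthscale ** 2)

t = np.linspace(0.0, 5.0, 100)[:, None]

# Individual sequences: one covariance block per sequence, stacked block-diagonally,
# so that points belonging to different sequences are a priori uncorrelated.
K_sequences = block_diag(rbf(t[:50], t[:50]), rbf(t[50:], t[50:]))

# Periodic data: correlations recur with the chosen period.
K_periodic = periodic(t, t, period=2.0)
```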

Page 24: Feature representation with Deep Gaussian processes

Autoencoder

[Figure: autoencoder graphical model, with the observations Y at both ends and the latent H in between]

GP-LVM: [figure: 2-D latent space learned by the GP-LVM]

Non-parametric auto-encoder: [figure: 2-D latent space learned by the non-parametric auto-encoder]
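For orientation only, a sketch of fitting the GP-LVM baseline (left-hand plot) with the GPy library; this assumes GPy is installed and that the constructor arguments, which can vary between versions, are as below. The non-parametric auto-encoder additionally ties a mapping from Y back to H, which is not shown here.

```python
import numpy as np
import GPy  # assumption: the GPy package is available; APIs may differ across versions

rng = np.random.default_rng(3)
Y = rng.normal(size=(100, 12))                 # stand-in data matrix (N x D)

kern = GPy.kern.RBF(input_dim=2, ARD=True)     # 2-D latent space, ARD lengthscales
m = GPy.models.GPLVM(Y, input_dim=2, kernel=kern)
m.optimize(messages=False)                     # fit latent X and kernel parameters

H = np.asarray(m.X)                            # learned latent coordinates (N x 2)
print(H.shape)
```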

Page 25: Feature representation with Deep Gaussian processes

Outline

Part 1: A general view: deep modelling and deep GPs

Part 2: Structure in the latent space (priors for the features): dynamics, autoencoders

Part 3: Deep Gaussian processes: Bayesian regularization, inducing points, structure (ARD and MRD, multi-view), examples

Summary

Page 26: Feature representation with Deep Gaussian processes

Sampling from a deep GP

[Figure: a chain of GP mappings f from the input, through unobserved layers, to the output Y]

Page 27: Feature representation with Deep Gaussian processes

Modelling a step function - Posterior samples

[Figure: posterior samples for the three models]

I GP (top), 2 and 4 layer deep GP (middle, bottom).

I Posterior is bimodal.

Image by J. Hensman

Page 28: Feature representation with Deep Gaussian processes

MAP optimisation?

[Figure: graphical model x → h2 → h1 → y, with a GP mapping f at each layer]

I Joint = p(y|h1) p(h1|h2) p(h2|x)

I MAP optimization is extremely problematic because:

• The dimensionality of each h has to be decided a priori

• It is prone to overfitting, if the h are treated as parameters

• Deep structures are not supported by the model's objective but have to be forced [Lawrence & Moore '07]

Page 29: Feature representation with Deep Gaussian processes

Regularization solution: approximate Bayesian framework

I Analytic variational bound F ≤ log p(y|x)

• Extend the variational free energy approach of sparse GPs (Titsias, 2009) / variational compression tricks.

• Approximately marginalise out h.

I Automatic structure discovery (nodes, connections, layers)

• Use the automatic / manifold relevance determination (ARD / MRD) trick.

I ...

Page 30: Feature representation with Deep Gaussian processes

Direct marginalisation of h is intractable (O o)

I New objective: p(y|x) = ∫ p(y|h1) [ ∫ p(h1|h2) p(h2|x) dh2 ] dh1

I First layer: p(h1|x) = ∫ p(h1|f2) p(f2|h2) p(h2|x) dh2 df2, where p(f2|h2) contains k(h2, h2)^{-1}, so h2 cannot be integrated out analytically.

I Augment with inducing outputs u2 at inducing inputs h̃2: p(h1|x, h̃2) = ∫ p(h1|f2) p(f2|u2, h2) p(u2|h̃2) p(h2|x) dh2 df2 du2

I Variational bound with Q = p(f2|u2, h2) q(u2) q(h2): log p(h1|x, h̃2) ≥ ∫ Q log [ p(h1|f2) p(f2|u2, h2) p(u2|h̃2) p(h2|x) / Q ] dh2 df2 du2

I The p(f2|u2, h2) factors cancel, leaving: log p(h1|x, h̃2) ≥ ∫ Q log [ p(h1|f2) p(u2|h̃2) p(h2|x) / (q(u2) q(h2)) ] dh2 df2 du2

I p(u2|h̃2) contains only k(h̃2, h̃2)^{-1}, which involves the fixed inducing inputs rather than the latent h2, so the bound is tractable.
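A cleaned-up LaTeX rendering of the two key steps above (the augmented integral and the bound after the conditional GP term cancels), with Q = p(f2|u2, h2) q(u2) q(h2); this is just a typeset version of the slide.

```latex
% requires \usepackage{amsmath} and \usepackage{cancel}
\begin{align*}
p(h_1 \mid x, \tilde{h}_2)
  &= \int p(h_1 \mid f_2)\, p(f_2 \mid u_2, h_2)\, p(u_2 \mid \tilde{h}_2)\, p(h_2 \mid x)
     \,\mathrm{d}h_2 \,\mathrm{d}f_2 \,\mathrm{d}u_2 \\[4pt]
\log p(h_1 \mid x, \tilde{h}_2)
  &\geq \int \mathcal{Q}\,
     \log \frac{p(h_1 \mid f_2)\,\cancel{p(f_2 \mid u_2, h_2)}\,
                p(u_2 \mid \tilde{h}_2)\, p(h_2 \mid x)}
               {\cancel{p(f_2 \mid u_2, h_2)}\, q(u_2)\, q(h_2)}
     \,\mathrm{d}h_2 \,\mathrm{d}f_2 \,\mathrm{d}u_2,
  \quad \mathcal{Q} = p(f_2 \mid u_2, h_2)\, q(u_2)\, q(h_2)
\end{align*}
```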



Page 41: Feature representation with Deep Gaussian processes

Inducing points: sparseness, tractability and Big Data

[Figure: the N input–output pairs (h_n, f_n), n = 1, ..., N, augmented with inducing inputs h̃ and inducing outputs u]

I Inducing points were originally introduced for faster (sparse) GPs.

I Our manipulation allows us to compress information from the inputs of every layer (variational compression).

I This induces tractability.

I Viewing them as global variables ⇒ extension to Big Data [Hensman et al., UAI 2013]
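Since the construction leans on the sparse variational GP bound (Titsias, 2009), here is a hedged NumPy sketch of that collapsed bound for a single GP regression layer; the kernel, noise level and inducing locations are illustrative.

```python
import numpy as np

def rbf(A, B, variance=1.0, lengthscale=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d2 / lengthscale ** 2)

def sparse_gp_bound(y, X, Z, noise_var=0.1, jitter=1e-8):
    """Collapsed variational bound F <= log p(y | X) with inducing inputs Z (Titsias, 2009)."""
    N, M = len(X), len(Z)
    Kmm = rbf(Z, Z) + jitter * np.eye(M)
    Kmn = rbf(Z, X)
    Qnn = Kmn.T @ np.linalg.solve(Kmm, Kmn)          # Knm Kmm^{-1} Kmn
    cov = Qnn + noise_var * np.eye(N)
    _, logdet = np.linalg.slogdet(cov)
    alpha = np.linalg.solve(cov, y)
    quad = float((y * alpha).sum())                  # y^T (Qnn + sigma^2 I)^{-1} y
    log_gauss = -0.5 * (N * np.log(2 * np.pi) + logdet + quad)
    trace_term = -0.5 / noise_var * (N * 1.0 - np.trace(Qnn))  # tr(Knn - Qnn), unit signal variance
    return log_gauss + trace_term

rng = np.random.default_rng(4)
X = np.linspace(-3.0, 3.0, 200)[:, None]
y = np.sin(X) + 0.1 * rng.normal(size=X.shape)       # (200, 1) targets
Z = np.linspace(-3.0, 3.0, 15)[:, None]              # M = 15 inducing inputs
print(sparse_gp_bound(y, X, Z))
```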

Page 42: Feature representation with Deep Gaussian processes

Automatic dimensionality detection

I Achieved by employing automatic relevance determination (ARD) priors for the mapping f.

I f ∼ GP(0, k_f) with:

k_f(x_i, x_j) = σ² exp( −(1/2) Σ_{q=1}^{Q} w_q (x_{i,q} − x_{j,q})² )

I Example: [figure] (an implementation sketch of this kernel follows below)
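A direct, hedged implementation of the ARD kernel above (the function name and example weights are illustrative): setting a weight w_q to zero switches dimension q off.

```python
import numpy as np

def ard_rbf(Xi, Xj, variance=1.0, weights=None):
    """k(x_i, x_j) = sigma^2 * exp(-0.5 * sum_q w_q (x_{i,q} - x_{j,q})^2)."""
    Xi, Xj = np.atleast_2d(Xi), np.atleast_2d(Xj)
    if weights is None:
        weights = np.ones(Xi.shape[1])               # all dimensions equally relevant
    d2 = ((Xi[:, None, :] - Xj[None, :, :]) ** 2 * weights).sum(-1)
    return variance * np.exp(-0.5 * d2)

X = np.random.default_rng(5).normal(size=(4, 3))
K = ard_rbf(X, X, weights=np.array([1.0, 0.2, 0.0]))  # third input dimension is switched off
```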

Page 43: Feature representation with Deep Gaussian processes

Deep GP: MNIST example

http://staffwww.dcs.sheffield.ac.uk/people/A.Damianou/research/index.html#DeepGPs

Page 44: Feature representation with Deep Gaussian processes

MNIST: The first layer

1 layer GP-LVM: [figure: ARD weights for latent dimensions 1–15]

5 layer deep GP (showing 1st layer): [figure: ARD weights for latent dimensions 1–15]

Page 45: Feature representation with Deep Gaussian processes

Manifold Relevance Determination

I Observations come in two different views: Y and Z.

I The latent space is segmented into parts private to Y, private to Z, and shared between Y and Z.

I Used for data consolidation and discovering commonalities.
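Purely illustrative: once each view has its own vector of ARD weights over a common latent space, the segmentation described above can be read off by thresholding. The threshold and the hard rule below are assumptions for the sketch; MRD itself learns the weights by optimising the variational bound.

```python
import numpy as np

def segment_latent_space(w_Y, w_Z, threshold=1e-2):
    """Label each latent dimension as shared, private to one view, or switched off."""
    labels = []
    for wy, wz in zip(w_Y, w_Z):
        on_Y, on_Z = wy > threshold, wz > threshold
        if on_Y and on_Z:
            labels.append("shared")
        elif on_Y:
            labels.append("private_Y")
        elif on_Z:
            labels.append("private_Z")
        else:
            labels.append("switched_off")
    return labels

w_Y = np.array([0.9, 0.6, 0.0, 0.0, 0.3])   # ARD weights learned for view Y
w_Z = np.array([0.8, 0.0, 0.7, 0.0, 0.2])   # ARD weights learned for view Z
print(segment_latent_space(w_Y, w_Z))
# ['shared', 'private_Y', 'private_Z', 'switched_off', 'shared']
```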

Page 46: Feature representation with Deep Gaussian processes

MRD weights

Page 47: Feature representation with Deep Gaussian processes

Deep GPs: Another multi-view example

[Figure: learned latent spaces X for this multi-view example]

Page 48: Feature representation with Deep Gaussian processes

Automatic structure discovery

Tools:

I ARD: Eliminate unnecessary nodes/connections

I MRD: Conditional independencies

I Approximating evidence: Number of layers (?)


Page 50: Feature representation with Deep Gaussian processes

Deep GP variants

[Figure: graphical models of the deep GP variants listed below]

I Deep GP, supervised: X → H2 → H1 → Y

I Deep GP, unsupervised: H2 → H1 → Y

I Multi-view: X drives a latent layer split into H1^Y (private to Y), H1^s (shared) and H1^Z (private to Z), which generates the views Y and Z

I Warped GP: X → H → Y

I Temporal: t → H → Y

I Autoencoder: Y at both ends, with latent layers H1 and H2 in between

Page 51: Feature representation with Deep Gaussian processes

Summary

I A deep GP is not a GP.

I Sampling is straightforward; regularization and training need to be worked out.

I The solution is a special treatment of auxiliary variables.

I Many variants: multi-view, temporal, autoencoders ...

I Future: how to make it scalable?

I Future: how does it compare to / complement more traditional deep models?

Page 52: Feature representation with Deep Gaussian processes

Thanks

Thanks to Neil Lawrence, James Hensman, Michalis Titsias, Carl Henrik Ek.

Page 53: Feature representation with Deep Gaussian processes

References:

• N. D. Lawrence (2006), "The Gaussian process latent variable model", Technical Report no. CS-06-03, The University of Sheffield, Department of Computer Science.

• N. D. Lawrence (2006), "Probabilistic dimensional reduction with the Gaussian process latent variable model" (talk).

• C. E. Rasmussen (2008), "Learning with Gaussian Processes", Max Planck Institute for Biological Cybernetics, published Feb. 5, 2008 (Videolectures.net).

• C. E. Rasmussen and C. K. I. Williams (2006), Gaussian Processes for Machine Learning, MIT Press, Cambridge, MA. ISBN 026218253X.

• M. K. Titsias (2009), "Variational learning of inducing variables in sparse Gaussian processes", AISTATS 2009.

• A. C. Damianou, M. K. Titsias and N. D. Lawrence (2011), "Variational Gaussian process dynamical systems", NIPS 2011.

• A. C. Damianou, C. H. Ek, M. K. Titsias and N. D. Lawrence (2012), "Manifold Relevance Determination", ICML 2012.

• A. C. Damianou and N. D. Lawrence (2013), "Deep Gaussian processes", AISTATS 2013.

• J. Hensman (2013), "Gaussian processes for Big Data", UAI 2013.

