Feature representation with Deep Gaussian processes


Andreas Damianou

Department of Neuro- and Computer Science, University of Sheffield, UK

Feature Extraction with Gaussian Processes Workshop, Sheffield, 18/09/2014

Outline

Part 1: A general view
  - Deep modelling and deep GPs

Part 2: Structure in the latent space (priors for the features)
  - Dynamics
  - Autoencoders

Part 3: Deep Gaussian processes
  - Bayesian regularization
  - Inducing Points
  - Structure: ARD and MRD (multi-view)
  - Examples

Summary


Feature representation + feature relationships

- Good data representations facilitate learning / inference.
- Deep structures facilitate learning associated with abstract information (Bengio, 2009).

Hierarchy of features

[Figure: a learned hierarchy of features. Lee et al. 09, "Convolutional DBNs for Scalable Unsupervised Learning of Hierarchical Representations"]

[Figure: layers of latent variables, from local features at the bottom to abstract concepts at the top. Damianou and Lawrence 2013, "Deep Gaussian Processes"]

Deep learning (directed graph)

Y = f(f(· · · f(X))),   H_i = f_i(H_{i−1})
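As a toy illustration of the composition above (my own sketch, not from the slides): stacking arbitrary non-linear mappings layer by layer, with tanh layers standing in for the mappings f_i.

```python
# Minimal sketch: a deep mapping as repeated composition H_i = f_i(H_{i-1}).
# The tanh-based f_i are arbitrary placeholders for the non-linear mappings.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))   # N inputs of dimension 5

H = X
for i in range(3):                  # three hidden layers
    W = rng.standard_normal((H.shape[1], 4))
    H = np.tanh(H @ W)              # H_i = f_i(H_{i-1})
Y = H                               # output of the composed mapping
```

In a deep GP the parametric f_i above are replaced by non-parametric GP mappings, as the next slides describe.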

Deep Gaussian processes - Big Picture

Deep GP:

- Directed graphical model
- Non-parametric, non-linear mappings f
- Mappings f marginalised out analytically
- Likelihood is a non-linear function of the inputs
- Continuous variables
- NOT a GP!

Challenges:

- Marginalise out H
- No sampling: analytic approximation of the objective

Solution:

- Variational approximation
- This also gives access to the model evidence


Hierarchical GP-LVM

- Hidden layers are not marginalised out.
- This leads to some difficulties.

[Lawrence and Moore, 2007]


Sampling from a deep GP

[Figure: a sample propagated through a deep GP, from the input through unobserved intermediate layers (GP mappings f) to the output Y.]
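Sampling forward through the model is straightforward: draw the hidden layer from a GP prior on the inputs, then draw the output from a GP whose inputs are that hidden layer. A minimal numpy sketch (mine, not the talk's code), with the RBF kernel and jitter values chosen only for illustration:

```python
# Minimal sketch: one sample from a 2-layer deep GP, y = f2(f1(x)) + noise.
import numpy as np

def rbf(A, B, variance=1.0, lengthscale=0.5):
    """Exponentiated-quadratic (RBF) covariance between the rows of A and B."""
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 200)[:, None]            # observed inputs

# Layer 1: h ~ GP(0, k(x, x))
K1 = rbf(x, x) + 1e-6 * np.eye(len(x))          # jitter for numerical stability
h = rng.multivariate_normal(np.zeros(len(x)), K1)[:, None]

# Layer 2: y ~ GP(0, k(h, h)) + noise, warping the output of the first layer
K2 = rbf(h, h) + 1e-6 * np.eye(len(h))
y = rng.multivariate_normal(np.zeros(len(h)), K2) + 0.01 * rng.standard_normal(len(h))
```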

Outline

Part 1: A general view
  - Deep modelling and deep GPs

Part 2: Structure in the latent space (priors for the features)
  - Dynamics
  - Autoencoders

Part 3: Deep Gaussian processes
  - Bayesian regularization
  - Inducing Points
  - Structure: ARD and MRD (multi-view)
  - Examples

Summary


Dynamics

GP-LVM

[Graphical model: latent H maps to observed Y; a time index t drives a GP prior on H.]

- If Y is a multivariate time-series, then H also has to be one.
- Place a temporal GP prior on the latent space:

  h = h(t) ∼ GP(0, k_h(t, t'))
  f = f(h) ∼ GP(0, k_f(h, h'))
  y = f(h) + ε

Dynamics

- Dynamics are encoded in the covariance matrix K = k(t, t).
- We can consider special forms for K, e.g. to model individual sequences or to model periodic data (a sketch follows after the videos).

- https://www.youtube.com/watch?v=i9TEoYxaBxQ (missa)
- https://www.youtube.com/watch?v=mUY1XHPnoCU (dog)
- https://www.youtube.com/watch?v=fHDWloJtgk8 (mocap)
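As an illustration of the "special forms for K" mentioned above, here is a minimal numpy sketch (mine, not from the talk): a periodic kernel for periodic data, and a block-diagonal covariance so that independently modelled sequences are uncorrelated.

```python
# Minimal sketch: two special forms for the temporal covariance K = k(t, t).
import numpy as np

def rbf(t1, t2, lengthscale=1.0):
    """Exponentiated-quadratic kernel on scalar time inputs."""
    return np.exp(-0.5 * (t1[:, None] - t2[None, :])**2 / lengthscale**2)

def periodic(t1, t2, period=2.0, lengthscale=1.0):
    """Standard periodic kernel: strongly correlates points one period apart."""
    return np.exp(-2.0 * np.sin(np.pi * np.abs(t1[:, None] - t2[None, :]) / period)**2
                  / lengthscale**2)

t = np.linspace(0, 4, 50)

# Periodic data: use the periodic kernel for K.
K_periodic = periodic(t, t)

# Individual sequences: block-diagonal K, one block per sequence,
# so latent points in different sequences are independent a priori.
t_a, t_b = t[:25], t[25:]
K_blocks = np.block([
    [rbf(t_a, t_a), np.zeros((25, 25))],
    [np.zeros((25, 25)), rbf(t_b, t_b)],
])
```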

Autoencoder

[Graphical model: Y mapped to latent H and back to Y.]

GP-LVM:

[Figure: 2-D latent embedding found by the GP-LVM.]

Non-parametric auto-encoder:

[Figure: 2-D latent embedding found by the non-parametric auto-encoder.]

Outline

Part 1: A general view
  - Deep modelling and deep GPs

Part 2: Structure in the latent space (priors for the features)
  - Dynamics
  - Autoencoders

Part 3: Deep Gaussian processes
  - Bayesian regularization
  - Inducing Points
  - Structure: ARD and MRD (multi-view)
  - Examples

Summary

Sampling from a deep GP

[Figure: a sample propagated through a deep GP, from the input through unobserved intermediate layers (GP mappings f) to the output Y.]

Modelling a step function - Posterior samples

[Figure: posterior samples on step-function data, three panels over inputs in roughly [-1, 2].]

- GP (top), 2 and 4 layer deep GP (middle, bottom).
- Posterior is bimodal.

Image by J. Hensman
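For comparison with the top panel, a minimal sketch (mine, using GPy) of fitting a standard GP to step-function data; the data and parameter choices are illustrative assumptions, not the talk's settings. The single-layer GP posterior is unimodal and tends to over-smooth the discontinuity, which is what motivates the deeper models.

```python
# Minimal sketch: a standard GP fit to noisy step-function data with GPy.
import numpy as np
import GPy

rng = np.random.default_rng(0)
X = np.linspace(-1, 2, 40)[:, None]
Y = (X > 0.5).astype(float) + 0.02 * rng.standard_normal(X.shape)  # noisy step

kernel = GPy.kern.RBF(input_dim=1)
model = GPy.models.GPRegression(X, Y, kernel)
model.optimize(messages=False)

Xtest = np.linspace(-1, 2, 200)[:, None]
mean, var = model.predict(Xtest)   # smooth, unimodal posterior around the step
```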

MAP optimisation?

[Graphical model: x → h_2 → h_1 → y, with GP mappings f between consecutive layers.]

- Joint = p(y|h_1) p(h_1|h_2) p(h_2|x)
- MAP optimization is extremely problematic because:
  • Dimensionality of the h's has to be decided a priori
  • Prone to overfitting, if the h's are treated as parameters
  • Deep structures are not supported by the model's objective but have to be forced [Lawrence & Moore '07]

Regularization solution: approximate Bayesian framework

- Analytic variational bound F ≤ log p(y|x)
  • Extend the Variational Free Energy framework for sparse GPs (Titsias 09) / variational compression tricks (the standard single-layer bound is sketched below).
  • Approximately marginalise out h
- Automatic structure discovery (nodes, connections, layers)
  • Use the Automatic / Manifold Relevance Determination trick
- ...
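For reference, the single-layer bound being extended is Titsias' variational free energy for sparse GP regression; a sketch of its usual form in generic notation (mine, not copied from the slides), with m inducing points and Q_nn the Nystrom-style approximation of K_nn:

```latex
% Titsias (2009) variational lower bound for sparse GP regression.
% The deep GP applies this style of compression at every layer.
\log p(\mathbf{y} \mid X) \;\ge\;
  \log \mathcal{N}\!\left(\mathbf{y} \,\middle|\, \mathbf{0},\; Q_{nn} + \sigma^2 I \right)
  \;-\; \frac{1}{2\sigma^2}\,\mathrm{tr}\!\left( K_{nn} - Q_{nn} \right),
\qquad
Q_{nn} = K_{nm} K_{mm}^{-1} K_{mn}
```

The deep-GP construction repeats this compression at every layer, with the additional complication that each layer's inputs are themselves random, as the next slide shows.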

Direct marginalisation of h is intractable (O_o)

- New objective: p(y|x) = ∫_{h_1} p(y|h_1) [ ∫_{h_2} p(h_1|h_2) p(h_2|x) ]
- p(h_1|x) = ∫_{h_2, f_2} p(h_1|f_2) p(f_2|h_2) p(h_2|x), where p(f_2|h_2) contains k(h_2, h_2)^{-1}, i.e. the inverse of a covariance built on the random inputs h_2 that we need to integrate over.
- Augment the layer with inducing outputs u_2 evaluated at inducing inputs h̃_2:
  p(h_1|x, h̃_2) = ∫_{h_2, f_2, u_2} p(h_1|f_2) p(f_2|u_2, h_2) p(u_2|h̃_2) p(h_2|x)
- Lower-bound with a variational distribution Q = p(f_2|u_2, h_2) q(u_2) q(h_2):
  log p(h_1|x, h̃_2) ≥ ∫_{h_2, f_2, u_2} Q log [ p(h_1|f_2) p(f_2|u_2, h_2) p(u_2|h̃_2) p(h_2|x) / Q ]
- The difficult conditional p(f_2|u_2, h_2) cancels between the numerator and Q:
  log p(h_1|x, h̃_2) ≥ ∫_{h_2, f_2, u_2} Q log [ p(h_1|f_2) p(u_2|h̃_2) p(h_2|x) / (q(u_2) q(h_2)) ]
- p(u_2|h̃_2) contains k(h̃_2, h̃_2)^{-1}, which depends only on the known inducing inputs h̃_2, so the bound is tractable.


Inducing points: sparseness, tractability and Big Data

[Figure: per-layer inputs h_1, ..., h_N with function values f_1, ..., f_N, compressed through a small set of inducing inputs h̃ with inducing outputs u.]

- Inducing points were originally introduced for faster (sparse) GPs (see the sketch below).
- Our manipulation allows us to compress information from the inputs of every layer (variational compression).
- This induces tractability.
- Viewing them as global variables ⇒ extension to Big Data [Hensman et al., UAI 2013]
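A minimal sketch (mine, not the talk's code) of the single-layer case: sparse GP regression in GPy with a handful of inducing points, the construction that the deep GP stacks per layer. The data and parameter choices are illustrative assumptions.

```python
# Minimal sketch: sparse GP regression with M inducing points in GPy.
import numpy as np
import GPy

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, (500, 1))
Y = np.sin(X) + 0.1 * rng.standard_normal(X.shape)

# 20 inducing inputs summarise the 500 training inputs (variational compression).
model = GPy.models.SparseGPRegression(X, Y, kernel=GPy.kern.RBF(1), num_inducing=20)
model.optimize(messages=False)
print(model.Z.shape)   # (20, 1): the learned inducing inputs
# Treating the inducing outputs as global variables is what enables the
# stochastic, Big-Data extension of Hensman et al. (UAI 2013).
```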

Automatic dimensionality detection

- Achieved by employing automatic relevance determination (ARD) priors for the mapping f (a GPy sketch follows below).
- f ∼ GP(0, k_f) with:

  k_f(x_i, x_j) = σ² exp( -1/2 Σ_{q=1}^{Q} w_q (x_{i,q} - x_{j,q})² )

- Example:

[Figure.]
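A minimal sketch (mine) of the same idea in GPy, which parameterises ARD via per-dimension lengthscales (so w_q corresponds to 1/lengthscale_q²). The model class, data, and dimensions here are illustrative assumptions.

```python
# Minimal sketch: ARD in GPy via a Bayesian GP-LVM with an ARD RBF kernel.
# After optimisation, irrelevant latent dimensions get long lengthscales
# (equivalently, small weights w_q = 1 / lengthscale_q**2).
import numpy as np
import GPy

rng = np.random.default_rng(2)
Y = rng.standard_normal((100, 20))          # placeholder data matrix

Q = 10                                      # latent dimensions offered to the model
kernel = GPy.kern.RBF(input_dim=Q, ARD=True)
model = GPy.models.BayesianGPLVM(Y, input_dim=Q, kernel=kernel, num_inducing=30)
model.optimize(messages=False)

weights = 1.0 / model.kern.lengthscale**2   # relevance of each latent dimension
print(weights)
```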

Deep GP: MNIST example

http://staffwww.dcs.sheffield.ac.uk/people/A.Damianou/research/index.html#DeepGPs

MNIST: The first layer

1 layer GP-LVM:

[Bar chart over dimensions 1-15.]

5 layer deep GP (showing 1st layer):

[Bar chart over dimensions 1-15.]

Manifold Relevance Determination

- Observations come in two different views: Y and Z.
- The latent space is segmented into parts private to Y, private to Z, and shared between Y and Z.
- Used for data consolidation and for discovering commonalities (a GPy sketch follows below).

[Figure: MRD weights per view.]
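A minimal sketch (mine) of MRD in GPy; the constructor arguments, views, dimensions, and inducing-point count are assumptions on my part rather than the talk's settings, so check GPy.models.MRD's documentation before relying on them.

```python
# Minimal sketch: Manifold Relevance Determination over two views in GPy.
# Each view gets ARD weights over a shared latent space; comparing the
# per-view weights separates shared from private latent dimensions.
import numpy as np
import GPy

rng = np.random.default_rng(3)
Y = rng.standard_normal((80, 12))   # view 1 (placeholder data)
Z = rng.standard_normal((80, 8))    # view 2 (placeholder data)

Q = 6
model = GPy.models.MRD([Y, Z], input_dim=Q,
                       kernel=GPy.kern.RBF(Q, ARD=True),
                       num_inducing=25)
model.optimize(messages=False)
print(model)   # inspect each view's ARD lengthscales to find shared/private dims
```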

Deep GPs: Another multi-view example

[Figure: learned latent spaces X for the two views.]

Automatic structure discovery

Tools:

- ARD: eliminate unnecessary nodes/connections
- MRD: conditional independencies
- Approximating the evidence: number of layers (?)


Deep GP variants

[Graphical models of the variants; arrows denote the generative direction.]

- Deep GP - Supervised: X → H2 → H1 → Y
- Deep GP - Unsupervised: H2 → H1 → Y
- Multi-view: X → a latent layer split into a shared part H1^s and parts H1^Y, H1^Z private to the views Y and Z
- Warped GP: X → H → Y
- Temporal: t → H → Y
- Autoencoder: Y mapped through latent layers H1, H2 back to Y

Summary

- A deep GP is not a GP.
- Sampling is straightforward. Regularization and training need to be worked out.
- The solution is a special treatment of auxiliary variables.
- Many variants: multi-view, temporal, autoencoders, ...
- Future: how to make it scalable?
- Future: how does it compare to / complement more traditional deep models?

Thanks

Thanks to Neil Lawrence, James Hensman, Michalis Titsias, Carl Henrik Ek.

References:

• N. D. Lawrence (2006), "The Gaussian process latent variable model", Technical Report CS-06-03, Department of Computer Science, The University of Sheffield.

• N. D. Lawrence (2006), "Probabilistic dimensional reduction with the Gaussian process latent variable model" (talk).

• C. E. Rasmussen (2008), "Learning with Gaussian Processes", Max Planck Institute for Biological Cybernetics, published Feb. 5, 2008 (Videolectures.net).

• C. E. Rasmussen and C. K. I. Williams (2006), Gaussian Processes for Machine Learning, MIT Press, Cambridge, MA. ISBN 026218253X.

• M. K. Titsias (2009), "Variational learning of inducing variables in sparse Gaussian processes", AISTATS 2009.

• A. C. Damianou, M. K. Titsias and N. D. Lawrence (2011), "Variational Gaussian process dynamical systems", NIPS 2011.

• A. C. Damianou, C. H. Ek, M. K. Titsias and N. D. Lawrence (2012), "Manifold Relevance Determination", ICML 2012.

• A. C. Damianou and N. D. Lawrence (2013), "Deep Gaussian processes", AISTATS 2013.

• J. Hensman, N. Fusi and N. D. Lawrence (2013), "Gaussian processes for Big Data", UAI 2013.
