Sparse Approximations to Bayesian Gaussian Processes
Matthias Seeger, University of Edinburgh

Page 1: Sparse Approximations to Bayesian Gaussian Processes

Sparse Approximations to Bayesian Gaussian Processes

Matthias Seeger

University of Edinburgh

Page 2: Sparse Approximations to Bayesian Gaussian Processes

Collaborators

Neil Lawrence (Sheffield)

Chris Williams (Edinburgh)

Ralf Herbrich (MSR Cambridge)

Page 3: Sparse Approximations to Bayesian Gaussian Processes

Overview of the Talk

Gaussian processes and approximations

Understanding sparse schemes as likelihood approximations

Two schemes and their relationships

Fast greedy selection for the projected latent variables scheme (GP regression)

Page 4: Sparse Approximations to Bayesian Gaussian Processes

Why Sparse Approximations?

GPs lead to very powerful Bayesian methods for function fitting, classification, etc. Yet: (almost) nobody uses them!

Reason: horrible O(n^3) scaling

If sparse approximations work, there is a host of applications, e.g. as building blocks in Bayesian networks, etc.

Page 5: Sparse Approximations to Bayesian Gaussian Processes

Gaussian Process Models

Target y separated by latent u from all other variables: inference is a finite problem

[Figure: graphical model with inputs x_1, x_2, x_3, latent values u_1, u_2, u_3 and targets y_1, y_2, y_3; the latents share a dense Gaussian prior with kernel K]
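As a concrete illustration of this generative picture, here is a minimal numpy sketch that draws latent values from a dense GP prior and then targets from the latents; the RBF kernel, noise level and toy inputs are illustrative choices, not from the talk.

```python
import numpy as np

def rbf_kernel(X, Z, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel: K[i, j] = variance * exp(-||x_i - z_j||^2 / (2 * lengthscale^2))."""
    d2 = np.sum(X**2, 1)[:, None] + np.sum(Z**2, 1)[None, :] - 2.0 * X @ Z.T
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

rng = np.random.default_rng(0)
n = 5
X = rng.uniform(-3.0, 3.0, size=(n, 1))        # inputs x_1, ..., x_n

K = rbf_kernel(X, X)                            # dense n x n prior covariance (kernel K)
u = rng.multivariate_normal(np.zeros(n), K)     # latent values u ~ N(0, K)
y = u + 0.1 * rng.standard_normal(n)            # each target y_i depends on its u_i only
```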

Page 6: Sparse Approximations to Bayesian Gaussian Processes

Parameterisation

Data D = {(x_i, y_i) | i = 1, …, n}.

Latent outputs u = (u_1, …, u_n).

Approximate the posterior process P(u(·) | D) by a GP Q(u(·) | D):

Conditional GP (prior), n-dim. Gaussian
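The two labels presumably annotate the standard representation

```latex
Q(u(\cdot) \mid D) \;=\; \int \underbrace{P\bigl(u(\cdot) \mid \mathbf{u}\bigr)}_{\text{conditional GP (prior)}}
\; \underbrace{Q(\mathbf{u} \mid D)}_{n\text{-dim.\ Gaussian}} \, d\mathbf{u} .
```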

Page 7: Sparse Approximations to Bayesian Gaussian Processes

GP Approximations

Most (non-MCMC) GP approximations use this representation

Exact computation of Q(u | D) intractable, needs …

Attractive for sparse approximations: sequential fitting of Q(u | D) to P(u | D)

Page 8: Sparse Approximations to Bayesian Gaussian Processes

Assumed Density Filtering

Update (ADF step):
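In sketch form, the standard ADF moment-matching step referred to here incorporates one likelihood term and projects back onto a Gaussian:

```latex
\hat{P}(\mathbf{u}) \;\propto\; P(y_i \mid u_i)\, Q(\mathbf{u} \mid D), \qquad
Q_{\mathrm{new}}(\mathbf{u} \mid D) \;=\; \operatorname*{argmin}_{Q' \text{ Gaussian}} \mathrm{D}\bigl[\hat{P} \,\|\, Q'\bigr].
```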

Page 9: Sparse Approximations to Bayesian Gaussian Processes

Towards Sparsity

ADF = Bayesian Online [Opper]. Multiple updates: cavity method [Opper, Winther], EP [Minka]

Generalizations: EP [Minka], ADATAP [Csato, Opper, Winther: COW]

Sequential updates suitable for sparse online or greedy methods

Page 10: Sparse Approximations to Bayesian Gaussian Processes

Likelihood Approximations

Active set: I ⊂ {1, …, n}, |I| = d ≪ n

Several sparse schemes can be understood as likelihood approximations

Depends on u_I only
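Schematically, such a likelihood approximation replaces the exact likelihood by a term involving the active latents only:

```latex
\prod_{i=1}^{n} P(y_i \mid u_i) \;\approx\; t(\mathbf{u}_I), \qquad
Q(\mathbf{u} \mid D) \;\propto\; P(\mathbf{u})\, t(\mathbf{u}_I).
```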

Page 11: Sparse Approximations to Bayesian Gaussian Processes

Likelihood Approximations (II)

[Figure: graphical model with inputs x_1, …, x_4, latent values u_1, …, u_4 and targets y_1, …, y_4; active set I = {2, 3}]

Page 12: Sparse Approximations to Bayesian Gaussian Processes

Likelihood Approximations (III)

For such sparse schemes:

O(d^2) parameters at most

Prediction in O(d^2), O(d) for mean only

Approximations to the marginal likelihood (variational lower bound, ADATAP [COW]), PAC bounds [Seeger], etc. become cheap as well!

Page 13: Sparse Approximations to Bayesian Gaussian Processes

Two Schemes

IVM [Lawrence, Seeger, Herbrich: LSH]: ADF with fast greedy forward selection

Sparse Greedy GPR [Smola, Bartlett: SB]: greedy, expensive. Can be sped up: Projected Latent Variables [Seeger, Lawrence, Williams]. More general: sparse batch ADATAP [COW]

Not here: Sparse Online GP [Csato, Opper]

Page 14: Sparse Approximations to Bayesian Gaussian Processes

Informative Vector Machine

ADF, stopped after d inclusions [could do deletions, exchanges]

Fast greedy forward selection using criteria known in active learning

Faster than SVM on hard MNIST binary tasks, yet probabilistic (error bars, etc.)

Only d site parameters are non-zero
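A minimal sketch of greedy ADF-style forward selection for the Gaussian-noise regression case (the IVM in the talk targets classification and its scoring criteria are more elaborate; kernel, noise level and variable names below are illustrative assumptions):

```python
import numpy as np

def ivm_regression(K, y, noise_var, d):
    """Greedy ADF-style forward selection with a Gaussian likelihood.

    Keeps the posterior marginals of all n latents: covariance A = K - M.T @ M,
    marginal variances zeta = diag(A), marginal means h. Each inclusion is a
    rank-1 update costing O(n * d); scoring a candidate costs O(1).
    """
    n = K.shape[0]
    h = np.zeros(n)              # posterior marginal means
    zeta = np.diag(K).copy()     # posterior marginal variances
    M = np.zeros((0, n))         # low-rank representation, one row per inclusion
    active, remaining = [], list(range(n))

    for _ in range(d):
        # Differential-entropy score: including j shrinks its marginal variance
        # by the factor noise_var / (noise_var + zeta[j]).
        scores = [0.5 * np.log1p(zeta[j] / noise_var) for j in remaining]
        j = remaining.pop(int(np.argmax(scores)))
        active.append(j)

        s_j = K[:, j] - M.T @ M[:, j]            # column j of the current posterior covariance
        nu = 1.0 / (noise_var + zeta[j])
        h = h + (y[j] - h[j]) * nu * s_j         # rank-1 mean update
        zeta = zeta - nu * s_j ** 2              # rank-1 variance update
        M = np.vstack([M, np.sqrt(nu) * s_j])    # extend the representation

    return active, h, zeta
```

For example, `active, h, zeta = ivm_regression(K, y, noise_var=0.1, d=50)` picks 50 points greedily; the total cost is O(n d^2), matching the scaling quoted later in the talk.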

Page 15: Sparse Approximations to Bayesian Gaussian Processes

Why So Simple?

Locality property of ADF:
Marginal Q_new(u_i) in O(1) from Q(u_i)

Locality property and Gaussianity: relations like:

Fast evaluation of differential criteria

Page 16: Sparse Approximations to Bayesian Gaussian Processes

KL-Optimal Projections

Csato/Opper observed:

Page 17: Sparse Approximations to Bayesian Gaussian Processes

KL-Optimal Projections (II)

For Gaussian likelihood:

Can be used online or batch

A bit unfortunate: we use the relative entropy both ways around!
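Judging from the projected-latent-variables slide that follows, the observation is presumably that the KL-optimal way to make a term depend on u_I alone replaces each latent by its conditional prior mean,

```latex
u_i \;\longrightarrow\; \mathrm{E}[u_i \mid \mathbf{u}_I] \;=\; \mathbf{k}_I(x_i)^{\top}\, \mathbf{K}_I^{-1}\, \mathbf{u}_I ,
```

where K_I denotes the active-set kernel matrix and k_I(x_i) the vector of kernel values between x_i and the active points (my notation).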

Page 18: Sparse Approximations to Bayesian Gaussian Processes

Projected Latent Variables

Full GPR samples u_I ~ P(u_I), u_R ~ P(u_R | u_I), y ~ N(y | u, σ^2 I).

Instead: y ~ N(y | E[u | u_I], σ^2 I). The latent variables u_R are replaced by projections in the likelihood [SB] (without this interpretation).

Note: sparse batch ADATAP [COW] is more general (non-Gaussian likelihoods)
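A minimal numpy sketch of the resulting regression equations, assuming the standard projected / subset-of-regressors form; K_II (active-set kernel matrix), K_nI (cross-kernel matrix between all n inputs and the active set) and the jitter term are my own naming and stabilisation choices.

```python
import numpy as np

def plv_fit(K_II, K_nI, y, noise_var):
    """Fit under the projected likelihood y ~ N(K_nI K_II^{-1} u_I, noise_var * I).

    Returns the weight vector beta such that the predictive mean at a test
    point x* is k_I(x*) @ beta. Cost is O(n d^2) for n data and d active points.
    """
    d = K_II.shape[0]
    A = noise_var * K_II + K_nI.T @ K_nI            # d x d system matrix
    L = np.linalg.cholesky(A + 1e-10 * np.eye(d))   # small jitter for stability
    beta = np.linalg.solve(L.T, np.linalg.solve(L, K_nI.T @ y))
    return beta

def plv_predict_mean(K_sI, beta):
    """Predictive mean at test points: O(d) per point once beta is computed."""
    return K_sI @ beta
```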

Page 19: Sparse Approximations to Bayesian Gaussian Processes

Fast Greedy Selections

With this likelihood approximation, typical forward selection criteria (MAP [SB]; diff. entropy, info-gain [LSH]) are too expensive

Problem: upon inclusion, latent u_i is coupled with all targets y

Cheap criterion: ignore most couplings for score evaluation (not for inclusion!)

Page 20: Sparse Approximations to Bayesian Gaussian Processes

Yet Another Approximation

To score x_i, we approximate Q_new(u | D) after inclusion of i by

Example: information gain
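As a hedged illustration only: an information-gain style score measures how much the approximate posterior changes when point i is tentatively included, e.g.

```latex
\Delta_i \;=\; \mathrm{D}\bigl[\,Q_{\mathrm{new}}(\mathbf{u} \mid D)\,\big\|\,Q(\mathbf{u} \mid D)\,\bigr],
```

which, by the locality property and Gaussianity, reduces to an O(1) expression in the marginal moments of u_i before and after the update.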

Page 21: Sparse Approximations to Bayesian Gaussian Processes

Fast Greedy Selections (II)

Leads to O(1) criteria. Cost of searching over all remaining points is dominated by the cost of inclusion

Can easily be generalized to allow for couplings between u_i and some targets, if desired

Can be done for sparse batch ADATAP as well

Page 22: Sparse Approximations to Bayesian Gaussian Processes

Marginal Likelihood

Can be optimized efficiently w.r.t. σ^2 and kernel parameters, O(n d (d + p)) per gradient, p the number of parameters

Keep I fixed during line searches, reselect for search directions

The marginal likelihood is
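For the projected likelihood described above, this is presumably the Gaussian marginal

```latex
\log P(\mathbf{y}) \;=\; \log \mathcal{N}\!\bigl(\mathbf{y} \,\big|\, \mathbf{0},\; \mathbf{K}_{nI}\mathbf{K}_I^{-1}\mathbf{K}_{In} + \sigma^2 \mathbf{I}\bigr),
```

which can be evaluated and differentiated in O(n d (d + p)) per gradient using the Woodbury identity and the matrix determinant lemma (notation K_I, K_nI as above, mine).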

Page 23: Sparse Approximations to Bayesian Gaussian Processes

Conclusions

Most sparse approximations can be understood as likelihood approximations

Several schemes available, all O(n d^2), yet constants do matter here!

Fast information-theoretic criteria effective for classification

Extension to active learning straightforward

Page 24: Sparse Approximations to Bayesian Gaussian Processes

Conclusions (II)

Missing: experimental comparison, esp. to test the effectiveness of marginal likelihood optimization

Extensions:

C classes: easy in O(n d^2 C^2), maybe in O(n d^2 C)

Integrate with Bayesian networks [Friedman, Nachman]