Sparse Approximations to Bayesian Gaussian Processes

Sparse Approximations to Bayesian Gaussian Processes

Matthias Seeger

University of Edinburgh

Collaborators

Neil Lawrence (Sheffield)
Chris Williams (Edinburgh)
Ralf Herbrich (MSR Cambridge)

Overview of the Talk

Gaussian processes and approximations
Understanding sparse schemes as likelihood approximations
Two schemes and their relationships
Fast greedy selection for the projected latent variables scheme (GP regression)

Why Sparse Approximations?

GPs lead to very powerful Bayesian methods for function fitting, classification, etc. Yet: (almost) nobody uses them!

Reason: horrible scaling, O(n³).
If sparse approximations work, there is a host of applications, e.g. as building blocks in Bayesian networks, etc.

Gaussian Process Models

Target y separated by latent u from all other variables. Inference is a finite problem.

[Figure: graphical model with chains x_i → u_i → y_i for i = 1, 2, 3; the latent u_i are coupled by a dense Gaussian prior with kernel K]

Parameterisation: data D = {(x_i, y_i) | i = 1, …, n}.

Latent outputs u = (u_1, …, u_n).

Approximate the posterior process P(u(·) | D) by a GP Q(u(·) | D).

Representation: conditional GP (prior) combined with an n-dim. Gaussian over u.
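A sketch of this representation in assumed notation (h and A denote the mean and covariance of the finite-dimensional approximation):

```latex
Q(u(\cdot)\,|\,D) \;=\; \int P\!\left(u(\cdot)\,|\,\mathbf{u}\right) Q(\mathbf{u}\,|\,D)\,d\mathbf{u},
\qquad Q(\mathbf{u}\,|\,D) = N(\mathbf{u}\,|\,\mathbf{h},\,A)
```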

GP Approximations

Most (non-MCMC) GP approximations use this representation.

Exact computation of Q(u | D) is intractable; it needs to be approximated.

Attractive for sparse approximations: sequential fitting of Q(u | D) to P(u | D).

Assumed Density Filtering

Update (ADF step):
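A sketch of the standard ADF step in assumed notation: one likelihood term is folded in and the result is projected back onto the Gaussian family by KL minimisation (moment matching):

```latex
\hat{P}(\mathbf{u}) \;\propto\; P(y_i\,|\,u_i)\,Q(\mathbf{u}\,|\,D),
\qquad
Q^{\mathrm{new}}(\mathbf{u}\,|\,D) \;=\; \operatorname*{argmin}_{Q'\ \mathrm{Gaussian}}\ \mathrm{KL}\!\left[\hat{P}\,\big\|\,Q'\right]
```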

Towards Sparsity

ADF = Bayesian online [Opper]. Multiple updates: cavity method [Opper, Winther], EP [Minka].

Generalizations: EP [Minka], ADATAP [Csato, Opper, Winther: COW].

Sequential updates are suitable for sparse online or greedy methods.

Likelihood Approximations

Active set: I ⊂ {1, …, n}, |I| = d ≪ n.

Several sparse schemes can be understood as likelihood approximations.

Depends on u_I only.
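In symbols (a sketch in assumed notation): the exact likelihood is replaced by a term that depends on u only through the active-set components u_I,

```latex
\prod_{i=1}^{n} P(y_i\,|\,u_i) \;\approx\; t(\mathbf{u}_I)
```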

Likelihood Approximations (II)

[Figure: graphical model x_i → u_i → y_i for i = 1, …, 4]

Active set I = {2, 3}

Likelihood Approximations (III)

For such sparse schemes:
O(d²) parameters at most
Prediction in O(d²), O(d) for the mean only
Approximations to the marginal likelihood (variational lower bound, ADATAP [COW]), PAC bounds [Seeger], etc., become cheap as well!
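A minimal sketch of where these prediction costs come from, assuming the scheme has already produced a length-d weight vector a and a d × d matrix B over the active set (names hypothetical; what a and B are depends on the particular scheme):

```python
import numpy as np

def predict(k_I_star, k_star_star, a, B):
    """Prediction at one test point under a sparse scheme with |I| = d.

    k_I_star:    (d,) kernel values between the test input and the active set
    k_star_star: scalar prior variance k(x*, x*)
    a, B:        (d,) and (d, d) quantities precomputed from the sparse posterior
    """
    mean = k_I_star @ a                               # O(d)
    var = k_star_star - k_I_star @ (B @ k_I_star)     # O(d^2)
    return mean, var
```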

Two Schemes

IVM [Lawrence, Seeger, Herbrich: LSH]

ADF with fast greedy forward selection
Sparse Greedy GPR [Smola, Bartlett: SB]

Greedy, expensive. Can be sped up: Projected Latent Variables [Seeger, Lawrence, Williams]. More general: sparse batch ADATAP [COW].

Not covered here: Sparse Online GP [Csato, Opper]

Informative Vector Machine

ADF, stopped after d inclusions [could do deletions, exchanges]

Fast greedy forward selection using criteria known from active learning

Faster than the SVM on hard MNIST binary tasks, yet probabilistic (error bars, etc.)

Only d are non-zero

Why So Simple?

Locality property of ADF:

Marginal Q_new(u_i) in O(1) from Q(u_i)

Locality property and Gaussianity: relations like:

Fast evaluation of differential criteria
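A minimal sketch of such a scheme for the regression case (the published IVM targets classification with non-Gaussian sites; here Gaussian noise is assumed so the ADF update is exact and the differential entropy score has a simple closed form; names hypothetical):

```python
import numpy as np

def ivm_greedy_select(K, y, noise_var, d):
    """Greedy forward selection in the spirit of the IVM, sketched for Gaussian
    (regression) noise. Each remaining point is scored by the differential
    entropy decrease of its own marginal, an O(1) quantity; marginals are kept
    up to date by rank-one (ADF-style) updates."""
    n = K.shape[0]
    mu = np.zeros(n)            # posterior marginal means under Q(u | D)
    var = np.diag(K).copy()     # posterior marginal variances
    M = np.zeros((d, n))        # current posterior covariance is K - M^T M
    active, remaining = [], list(range(n))
    for j in range(d):
        # Entropy decrease if point c were included: 0.5 * log(1 + var_c / noise_var)
        i = max(remaining, key=lambda c: 0.5 * np.log(1.0 + var[c] / noise_var))
        s = K[i, :] - M[:j].T @ M[:j, i]        # column i of the posterior covariance
        denom = var[i] + noise_var
        mu = mu + s * (y[i] - mu[i]) / denom    # ADF mean update
        var = var - s**2 / denom                # ADF variance update (diagonal)
        M[j, :] = s / np.sqrt(denom)
        active.append(i)
        remaining.remove(i)
    return active, mu, var
```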

KL-Optimal Projections

Csato/Opper observed:

KL-Optimal Projections (II)

For Gaussian likelihood:
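A plausible form in assumed notation: each latent u_i is replaced by its conditional prior mean given the active set,

```latex
P(y_i\,|\,u_i) \;\longrightarrow\; N\!\left(y_i\,\big|\,\mathrm{E}[u_i\,|\,\mathbf{u}_I],\ \sigma^2\right)
\;=\; N\!\left(y_i\,\big|\,\mathbf{k}_{iI}^{\top} K_I^{-1}\,\mathbf{u}_I,\ \sigma^2\right)
```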

Can be used online or batch.
A bit unfortunate: we use the relative entropy both ways around!

Projected Latent Variables

Full GPR samples u_I ~ P(u_I), u_R ~ P(u_R | u_I), y ~ N(y | u, σ²I).
Instead: y ~ N(y | E[u | u_I], σ²I). The latent variables u_R are replaced by projections in the likelihood [SB] (there without this interpretation).
Note: Sparse batch ADATAP [COW] is more general (non-Gaussian likelihoods).
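A minimal sketch of the resulting model in Python, in assumed notation (K_I is the d × d active-set kernel matrix, K_nI the n × d cross-covariance block; this is the naive computation, not the fast greedy implementation discussed next):

```python
import numpy as np

def plv_posterior(K_I, K_nI, y, noise_var):
    """Gaussian posterior over the active-set latents u_I in the projected
    latent variables model: u_I ~ N(0, K_I), y ~ N(P u_I, noise_var * I),
    with projection matrix P = K_nI K_I^{-1}."""
    P = np.linalg.solve(K_I, K_nI.T).T                    # n x d projection
    precision = np.linalg.inv(K_I) + (P.T @ P) / noise_var
    cov = np.linalg.inv(precision)                        # d x d posterior covariance
    mean = cov @ (P.T @ y) / noise_var                    # length-d posterior mean
    return mean, cov
```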

Fast Greedy Selections

With this likelihood approximation, typical forward selection criteria (MAP [SB]; differential entropy, info-gain [LSH]) are too expensive.

Problem: upon inclusion, the latent u_i is coupled with all targets y.
Cheap criterion: ignore most couplings for score evaluation (not for the inclusion itself!)

Yet Another Approximation

To score x_i, we approximate Q_new(u | D) after inclusion of i by:

Example: information gain

Fast Greedy Selections (II)

Leads to O(1) criteria.
The cost of searching over all remaining points is dominated by the cost of an inclusion.

Can easily be generalized to allow couplings between u_i and some targets, if desired.
Can be done for sparse batch ADATAP as well.

Marginal Likelihood

Can be optimized efficiently w.r.t. σ² and kernel parameters: O(n d (d + p)) per gradient, where p is the number of parameters.

Keep I fixed during line searches, reselect for search directions.

The marginal likelihood is:
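For the projected latent variables model it takes the standard form below (a sketch in assumed notation, obtained by integrating u_I out of y ~ N(K_{·I} K_I⁻¹ u_I, σ²I)):

```latex
\log P(\mathbf{y}\,|\,\theta) \;=\; \log N\!\left(\mathbf{y}\,\big|\,\mathbf{0},\;
K_{\cdot I}\,K_I^{-1}\,K_{I\cdot} + \sigma^2 I\right)
```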

Conclusions

Most sparse approximations can be understood as likelihood approximations.

Several schemes available, all O(n d²), yet constants do matter here!

Fast information-theoretic criteria are effective for classification.
Extension to active learning is straightforward.

Conclusions (II)

Missing: experimental comparison, especially to test the effectiveness of marginal likelihood optimization.

Extensions:
C classes: easy in O(n d² C²), maybe in O(n d² C)
Integrate with Bayesian networks [Friedman, Nachman]
