Sparse Approximations to Bayesian Gaussian Processes

Sparse Approximations to Bayesian Gaussian Processes

Matthias Seeger

University of Edinburgh

Collaborators

Neil Lawrence (Sheffield)
Chris Williams (Edinburgh)
Ralf Herbrich (MSR Cambridge)

Overview of the Talk

Gaussian processes and approximations
Understanding sparse schemes as likelihood approximations
Two schemes and their relationships
Fast greedy selection for the projected latent variables scheme (GP regression)

Why Sparse Approximations?

GPs lead to very powerful Bayesian methods for function fitting, classification, etc. Yet: (almost) nobody uses them!

Reason: horrible scaling, O(n³).
If sparse approximations work, there is a host of applications, e.g. as building blocks in Bayesian networks, etc.

Gaussian Process Models

Target y separated by latent u from all other variables. Inference is a finite problem.

[Figure: graphical model with chains x_i → u_i → y_i for i = 1, 2, 3; the latent u_i are coupled by a dense Gaussian prior with kernel K]

Parameterisation: data D = {(x_i, y_i) | i = 1, …, n}.

Latent outputs u = (u_1, …, u_n).

Approximate the posterior process P(u(·) | D) by a GP Q(u(·) | D).

Representation: conditional GP (prior) combined with an n-dim. Gaussian over u.
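A sketch of this representation in assumed notation (h and A denote the mean and covariance of the finite-dimensional approximation):

```latex
Q(u(\cdot)\,|\,D) \;=\; \int P\!\left(u(\cdot)\,|\,\mathbf{u}\right) Q(\mathbf{u}\,|\,D)\,d\mathbf{u},
\qquad Q(\mathbf{u}\,|\,D) = N(\mathbf{u}\,|\,\mathbf{h},\,A)
```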

GP Approximations

Most (non-MCMC) GP approximations use this representation.

Exact computation of Q(u | D) is intractable; it needs to be approximated.

Attractive for sparse approximations: sequential fitting of Q(u | D) to P(u | D).

Assumed Density Filtering

Update (ADF step):
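A sketch of the standard ADF step in assumed notation: one likelihood term is folded in and the result is projected back onto the Gaussian family by KL minimisation (moment matching):

```latex
\hat{P}(\mathbf{u}) \;\propto\; P(y_i\,|\,u_i)\,Q(\mathbf{u}\,|\,D),
\qquad
Q^{\mathrm{new}}(\mathbf{u}\,|\,D) \;=\; \operatorname*{argmin}_{Q'\ \mathrm{Gaussian}}\ \mathrm{KL}\!\left[\hat{P}\,\big\|\,Q'\right]
```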

Towards Sparsity

ADF = Bayesian online [Opper]. Multiple updates: cavity method [Opper, Winther], EP [Minka].

Generalizations: EP [Minka], ADATAP [Csato, Opper, Winther: COW].

Sequential updates are suitable for sparse online or greedy methods.

Likelihood Approximations

Active set: I ⊂ {1, …, n}, |I| = d ≪ n.

Several sparse schemes can be understood as likelihood approximations.

Depends on u_I only.
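In symbols (a sketch in assumed notation): the exact likelihood is replaced by a term that depends on u only through the active-set components u_I,

```latex
\prod_{i=1}^{n} P(y_i\,|\,u_i) \;\approx\; t(\mathbf{u}_I)
```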

Likelihood Approximations (II)

[Figure: graphical model x_i → u_i → y_i for i = 1, …, 4]

Active set I = {2, 3}

Likelihood Approximations (III)

For such sparse schemes:
O(d²) parameters at most
Prediction in O(d²), O(d) for the mean only
Approximations to the marginal likelihood (variational lower bound, ADATAP [COW]), PAC bounds [Seeger], etc., become cheap as well!
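A minimal sketch of where these prediction costs come from, assuming the scheme has already produced a length-d weight vector a and a d × d matrix B over the active set (names hypothetical; what a and B are depends on the particular scheme):

```python
import numpy as np

def predict(k_I_star, k_star_star, a, B):
    """Prediction at one test point under a sparse scheme with |I| = d.

    k_I_star:    (d,) kernel values between the test input and the active set
    k_star_star: scalar prior variance k(x*, x*)
    a, B:        (d,) and (d, d) quantities precomputed from the sparse posterior
    """
    mean = k_I_star @ a                               # O(d)
    var = k_star_star - k_I_star @ (B @ k_I_star)     # O(d^2)
    return mean, var
```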

Two Schemes

IVM [Lawrence, Seeger, Herbrich: LSH]

ADF with fast greedy forward selection
Sparse Greedy GPR [Smola, Bartlett: SB]

Greedy, expensive. Can be sped up: Projected Latent Variables [Seeger, Lawrence, Williams]. More general: sparse batch ADATAP [COW].

Not covered here: Sparse Online GP [Csato, Opper]

Informative Vector Machine

ADF, stopped after d inclusions [could do deletions, exchanges]

Fast greedy forward selection using criteria known from active learning

Faster than the SVM on hard MNIST binary tasks, yet probabilistic (error bars, etc.)

Only d are non-zero

Why So Simple?

Locality property of ADF:

Marginal Q_new(u_i) in O(1) from Q(u_i)

Locality property and Gaussianity: relations like:

Fast evaluation of differential criteria
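A minimal sketch of such a scheme for the regression case (the published IVM targets classification with non-Gaussian sites; here Gaussian noise is assumed so the ADF update is exact and the differential entropy score has a simple closed form; names hypothetical):

```python
import numpy as np

def ivm_greedy_select(K, y, noise_var, d):
    """Greedy forward selection in the spirit of the IVM, sketched for Gaussian
    (regression) noise. Each remaining point is scored by the differential
    entropy decrease of its own marginal, an O(1) quantity; marginals are kept
    up to date by rank-one (ADF-style) updates."""
    n = K.shape[0]
    mu = np.zeros(n)            # posterior marginal means under Q(u | D)
    var = np.diag(K).copy()     # posterior marginal variances
    M = np.zeros((d, n))        # current posterior covariance is K - M^T M
    active, remaining = [], list(range(n))
    for j in range(d):
        # Entropy decrease if point c were included: 0.5 * log(1 + var_c / noise_var)
        i = max(remaining, key=lambda c: 0.5 * np.log(1.0 + var[c] / noise_var))
        s = K[i, :] - M[:j].T @ M[:j, i]        # column i of the posterior covariance
        denom = var[i] + noise_var
        mu = mu + s * (y[i] - mu[i]) / denom    # ADF mean update
        var = var - s**2 / denom                # ADF variance update (diagonal)
        M[j, :] = s / np.sqrt(denom)
        active.append(i)
        remaining.remove(i)
    return active, mu, var
```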

KL-Optimal Projections

Csato/Opper observed:

KL-Optimal Projections (II)

For Gaussian likelihood:
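A plausible form in assumed notation: each latent u_i is replaced by its conditional prior mean given the active set,

```latex
P(y_i\,|\,u_i) \;\longrightarrow\; N\!\left(y_i\,\big|\,\mathrm{E}[u_i\,|\,\mathbf{u}_I],\ \sigma^2\right)
\;=\; N\!\left(y_i\,\big|\,\mathbf{k}_{iI}^{\top} K_I^{-1}\,\mathbf{u}_I,\ \sigma^2\right)
```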

Can be used online or batch.
A bit unfortunate: we use the relative entropy both ways around!

Projected Latent Variables

Full GPR samples u_I ~ P(u_I), u_R ~ P(u_R | u_I), y ~ N(y | u, σ²I).
Instead: y ~ N(y | E[u | u_I], σ²I). The latent variables u_R are replaced by projections in the likelihood [SB] (there without this interpretation).
Note: Sparse batch ADATAP [COW] is more general (non-Gaussian likelihoods).
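A minimal sketch of the resulting model in Python, in assumed notation (K_I is the d × d active-set kernel matrix, K_nI the n × d cross-covariance block; this is the naive computation, not the fast greedy implementation discussed next):

```python
import numpy as np

def plv_posterior(K_I, K_nI, y, noise_var):
    """Gaussian posterior over the active-set latents u_I in the projected
    latent variables model: u_I ~ N(0, K_I), y ~ N(P u_I, noise_var * I),
    with projection matrix P = K_nI K_I^{-1}."""
    P = np.linalg.solve(K_I, K_nI.T).T                    # n x d projection
    precision = np.linalg.inv(K_I) + (P.T @ P) / noise_var
    cov = np.linalg.inv(precision)                        # d x d posterior covariance
    mean = cov @ (P.T @ y) / noise_var                    # length-d posterior mean
    return mean, cov
```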

Fast Greedy Selections

With this likelihood approximation, typical forward selection criteria (MAP [SB]; differential entropy, info-gain [LSH]) are too expensive.

Problem: upon inclusion, the latent u_i is coupled with all targets y.
Cheap criterion: ignore most couplings for score evaluation (not for the inclusion itself!)

Yet Another Approximation

To score x_i, we approximate Q_new(u | D) after inclusion of i by:

Example: information gain

Fast Greedy Selections (II)

Leads to O(1) criteria.
The cost of searching over all remaining points is dominated by the cost of an inclusion.

Can easily be generalized to allow couplings between u_i and some targets, if desired.
Can be done for sparse batch ADATAP as well.

Marginal Likelihood

Can be optimized efficiently w.r.t. σ² and kernel parameters: O(n d (d + p)) per gradient, where p is the number of parameters.

Keep I fixed during line searches, reselect for search directions.

The marginal likelihood is:
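For the projected latent variables model it takes the standard form below (a sketch in assumed notation, obtained by integrating u_I out of y ~ N(K_{·I} K_I⁻¹ u_I, σ²I)):

```latex
\log P(\mathbf{y}\,|\,\theta) \;=\; \log N\!\left(\mathbf{y}\,\big|\,\mathbf{0},\;
K_{\cdot I}\,K_I^{-1}\,K_{I\cdot} + \sigma^2 I\right)
```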

Conclusions

Most sparse approximations can be understood as likelihood approximations.

Several schemes available, all O(n d²), yet constants do matter here!

Fast information-theoretic criteria are effective for classification.
Extension to active learning is straightforward.

Conclusions (II)

Missing: experimental comparison, especially to test the effectiveness of marginal likelihood optimization.

Extensions:
C classes: easy in O(n d² C²), maybe in O(n d² C)
Integrate with Bayesian networks [Friedman, Nachman]
