Bayesian machine learning: a tutorial
Remi Bardenet
CNRS & CRIStAL, Univ. Lille, France
Remi Bardenet (CNRS & Univ. Lille) Bayesian ML 1
Outline

The what
- Typical statistical problems
- Statistical decision theory
- Posterior expected utility and Bayes rules

The why
- The philosophical why
- The practical why

The how
- Conjugacy
- Monte Carlo methods
- Metropolis-Hastings
- Variational approximations

In depth with Gaussian processes in ML
- From linear regression to GPs
- Modeling and learning
- More applications
- References and open issues
Typical jobs for statisticians

Estimation
- You have data x1, …, xn that you assume drawn from p(x1, …, xn | θ⋆), with θ⋆ ∈ Rᵈ.
- You want an estimate θ̂(x1, …, xn) of θ⋆ ∈ Rᵈ.

Confidence regions
- You have data x1, …, xn that you assume drawn from p(x1, …, xn | θ⋆), with θ⋆ ∈ Rᵈ.
- You want a region A(x1, …, xn) ⊂ Rᵈ and make the statement that θ⋆ ∈ A(x1, …, xn) with some certainty.
Statistical decision theory¹

Figure: Abraham Wald (1902–1950)

¹ A. Wald. Statistical decision functions. Wiley, 1950.
Statistical decision theory

- Let Θ be the “states of the world”, typically the space of parameters of interest.
- Decisions are functions d(x1, …, xn) ∈ D.
- Let L(d, θ) denote the loss of making decision d when the state of the world is θ.
- Wald defines the risk of a decision as

    R(d, θ) = ∫ L(d(x1:n), θ) p(x1:n | θ) dx1:n.

- Wald says d1 is a better decision than d2 if

    ∀θ ∈ Θ, R(d1, θ) ≤ R(d2, θ).   (1)

- d is called admissible if there is no better decision than d.
Illustration with a simple estimation problem

- You have data x1, …, xn that you assume drawn from

    p(x1, …, xn | θ⋆) = ∏_{i=1}^n N(xi | θ⋆, σ²),

  and you know σ².
- You choose a loss function, say L(θ̂, θ) = ‖θ̂ − θ‖².
- You restrict your decision space to unbiased estimators.
- The sample mean θ̂ := n⁻¹ ∑_{i=1}^n xi is unbiased, and has minimum variance among unbiased estimators.
- Since R(θ̂, θ) = Var θ̂ = σ²/n, the sample mean θ̂ is the best decision you can make in Wald’s framework.
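The closed-form risk of the sample mean is easy to check by simulation; here is a quick sketch (the values of n, σ and θ⋆ below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma, theta_star = 20, 2.0, 1.5
n_rep = 100_000

# Draw n_rep datasets of size n from N(theta_star, sigma^2)
x = rng.normal(theta_star, sigma, size=(n_rep, n))
theta_hat = x.mean(axis=1)  # the sample-mean decision rule

# Empirical risk under squared loss, to compare with the theoretical sigma^2 / n
risk_mc = np.mean((theta_hat - theta_star) ** 2)
print(risk_mc, sigma**2 / n)  # the two numbers should agree closely
```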
Wald’s view of frequentist estimation

Estimation
- You have data x1, …, xn that you assume drawn from p(x1, …, xn | θ⋆), with θ⋆ ∈ Rᵈ.
- You want an estimate θ̂(x1, …, xn) of θ⋆ ∈ Rᵈ.

A Waldian answer
- Our decisions are estimates d(x1, …, xn) = θ̂(x1, …, xn).
- We pick a loss, say L(d, θ) = L(θ̂, θ) = ‖θ̂ − θ‖².
- If you have an unbiased estimator with minimum variance, then this is the best decision among unbiased estimators.
- In general, the loss can be more complex and unbiased estimators unknown or irrelevant.
- In these cases, you may settle for a minimax estimator

    θ̂(x1, …, xn) = argmin_d sup_θ R(d, θ).
Wald’s is only one view of frequentist statistics...

- On estimation, some would argue in favour of maximum likelihood².

Figure: Ronald Fisher (1890–1962)

² S. M. Stigler. “The epic story of maximum likelihood”. Statistical Science (2007), pp. 598–620.
... but bear with me, since it is predominant in machine learning

For instance, supervised learning is usually formalized as

    g⋆ = argmin_g E L(y, g(x)),   (2)

which you approximate by

    ĝ = argmin_g ∑_{i=1}^n L(yi, g(xi)) + penalty(g),

while trying to control the excess risk

    E L(y, ĝ(x)) − E L(y, g⋆(x)).
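As a concrete instance of this penalized empirical risk minimization, here is a hypothetical sketch with squared loss and an L2 penalty (ridge regression), which has a closed-form minimizer; the data and the penalty weight `lam` are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, lam = 50, 3, 0.1
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=n)

# Minimize sum_i (y_i - w^T x_i)^2 + lam * ||w||^2 over linear predictors
w_hat = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
```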
Wald’s view of frequentist confidence regions

Confidence regions
- You have data x1, …, xn that you assume drawn from p(x1, …, xn | θ⋆), with θ⋆ ∈ Rᵈ.
- You want a region A(x1, …, xn) ⊂ Rᵈ and make the statement that θ⋆ ∈ A(x1, …, xn) with some certainty.

A Waldian answer
- Our decisions are subsets of Rᵈ: d(x1:n) = A(x1:n).
- A common loss is L(d, θ) = L(A, θ) = 1_{θ∉A} + γ|A|.
- So you want to find A(x1:n) that minimizes

    R(A, θ⋆) = ∫ [1_{θ⋆∉A(x1:n)} + γ|A(x1:n)|] p(x1:n | θ⋆) dx1:n.
Illustration with a simple confidence interval problem

- You have data x1, …, xn that you assume drawn from

    p(x1, …, xn | θ⋆) = ∏_{i=1}^n N(xi | θ⋆, σ²).

- You choose a loss function, say L(A, θ) = 1_{θ∉A} + γ|A|.
- You restrict your decisions to intervals centered around the sample mean θ̂.
- Since (θ̂ − θ)/(σ/√n) ∼ N(0, 1), we know (exercise) that for

    A := [θ̂ − kσ/√n, θ̂ + kσ/√n],

  it comes

    R(A, θ) = P(|N(0, 1)| > k) + 2γkσ/√n.

- All that is left to do is choose k.
- Textbook examples bypass the need for γ: they fix α > 0 and find the smallest k such that P(|N(0, 1)| > k) ≤ α.
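The textbook choice of k in the last bullet is a one-liner; a sketch using SciPy's standard normal quantile function:

```python
from scipy.stats import norm

alpha = 0.05
# Smallest k such that P(|N(0,1)| > k) <= alpha, i.e. the (1 - alpha/2) quantile
k = norm.ppf(1 - alpha / 2)
print(k)  # ≈ 1.96 for alpha = 0.05
```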
Summary so far

- Waldian frequentists measure risks as expectations w.r.t. the data-generating process:

    R(d, θ) = ∫ L(d(x1:n), θ) p(x1:n | θ) dx1:n.

- One major difficulty is that the risk remains a function of θ.
- Without additional structure (unbiasedness, Gaussianity, etc.), it is difficult to go beyond minimax rules.

Idea
What if we introduced a distribution on Θ, and tried to minimize

    r(d) = ∫ R(d, θ) p(θ) dθ
         = ∫ [ ∫ L(d(x1:n), θ) p(x1:n | θ) dx1:n ] p(θ) dθ ?
From expected frequentist loss to posterior expected loss

    r(d) = ∫ R(d, θ) p(θ) dθ
         = ∫ [ ∫ L(d(x1:n), θ) p(x1:n | θ) dx1:n ] p(θ) dθ
         = ∫ [ ∫ L(d(x1:n), θ) p(x1:n | θ) p(θ) dθ ] dx1:n
         = ∫ [ ∫ L(d(x1:n), θ) (p(x1:n | θ) p(θ) / p(x1:n)) dθ ] p(x1:n) dx1:n
         = ∫ [ ∫ L(d(x1:n), θ) p(θ | x1:n) dθ ] p(x1:n) dx1:n.
Bayesians minimize posterior expected utility

The posterior expected utility paradigm: Bayes rules
Pick d to solve

    argmin_d ∫ L(d(x1:n), θ) p(θ | x1:n) dθ.

Bayes rules have good frequentist properties³
- Under general conditions, Bayes decision rules are admissible, and all admissible rules are limits of Bayes rules.
- Bayes rules with “least favourable priors” are minimax.

³ G. Parmigiani and L. Inoue. Decision theory: principles and approaches. Vol. 812. John Wiley & Sons, 2009.
Illustration with a simple estimation problem

- You have data x1, …, xn that you assume drawn from

    p(x1, …, xn | θ⋆) = ∏_{i=1}^n N(xi | θ⋆, σ²),

  and you know σ².
- You choose a loss function, say L(θ̂, θ) = ‖θ̂ − θ‖².
- You choose a prior p over θ.
- Your Bayes decision minimizes

    ∫ ‖θ̂ − θ‖² p(θ | x1:n) dθ,

  so you pick the posterior mean

    θ̂ = ∫ θ p(θ | x1:n) dθ.

- Conceptually, it is simpler. In practice, you need to compute an integral.
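With a conjugate Gaussian prior N(m0, s0²) on θ (an assumption added here for illustration), both the posterior and the Bayes estimate are closed-form:

```python
import numpy as np

rng = np.random.default_rng(2)
n, sigma, theta_star = 30, 1.0, 0.7
x = rng.normal(theta_star, sigma, size=n)

# Gaussian prior N(m0, s0^2) on theta; the posterior is again Gaussian
m0, s0 = 0.0, 10.0
post_var = 1.0 / (1.0 / s0**2 + n / sigma**2)
post_mean = post_var * (m0 / s0**2 + x.sum() / sigma**2)
theta_bayes = post_mean  # under squared loss, the Bayes rule is the posterior mean
```

With the nearly flat prior above, the posterior mean is close to the sample mean, and the posterior variance is slightly smaller than σ²/n.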
Illustration with a simple confidence interval problem

- You have data x1, …, xn that you assume drawn from

    p(x1, …, xn | θ⋆) = ∏_{i=1}^n N(xi | θ⋆, σ²).

- You choose a loss function, say L(A, θ) = 1_{θ∉A} + γ|A|.
- You choose a prior p over θ.
- Your Bayes decision minimizes

    ∫ 1_{θ∉A} p(θ | x1:n) dθ + γ|A|.

- Conceptually, it is simpler. In practice, you need to carefully pick your prior and/or restrict the decision space and/or compute many integrals.
Summary so far

- Bayes rules fit into Wald’s framework.
- For a fixed prior, the Bayesian risk completely orders decision rules.
- The key idea is posterior expected utility.
- You can answer most basic statistical questions using this principle: [more examples].
A recent motivating success
GW170814: A Three-Detector Observation of Gravitational Waves from a Binary Black Hole Coalescence
B. P. Abbott et al. (LIGO Scientific Collaboration and Virgo Collaboration)
Phys. Rev. Lett. 119, 141101 (2017). DOI: 10.1103/PhysRevLett.119.141101

[First page of the paper, reproduced on the slide.] The analysis is Bayesian throughout: the inferred initial black hole masses, 30.5 (+5.7/−3.0) M⊙ and 25.3 (+2.8/−4.2) M⊙, and the luminosity distance, 540 (+130/−210) Mpc (redshift z = 0.11 (+0.03/−0.04)), are quoted at the 90% credible level; adding Virgo shrinks the 90% credible region for sky localization from 1160 deg² to 60 deg²; and the coherent three-detector signal model is preferred over pure noise in Virgo with a Bayes factor of more than 1600. The event was observed on August 14, 2017 at 10:30:43 UTC, with a false-alarm rate of ≲ 1 in 27 000 years and a network matched-filter signal-to-noise ratio of 18.
The subjectivistic viewpoint

- The top requirement is internal coherence of decisions.
- Favours interpreting probability distributions as personal beliefs.

Figure: Bruno de Finetti (1906–1985) and L. Jimmie Savage (1917–1971)
The logical justification

- The top requirement is to find a version of propositional logic that allows taking uncertainty into account.
- Also favours interpreting probability distributions as beliefs, but aims for objective priors.

Figure: Richard T. Cox (1898–1991), Edwin T. Jaynes (1922–1998), and Harold Jeffreys (1891–1989)
The hybrid view⁴

- The starting point is posterior expected utility, loosely justified by Wald’s theory.
- It is simple, widely applicable, and has good frequentist properties.
- It satisfies the likelihood principle.
- It is easy to interpret: beliefs are
  - represented by probabilities,
  - updated using Bayes’ rule,
  - integrated when making decisions.
- It is easy to communicate your uncertainty:
  - simply give your posterior;
  - when making a decision, make sure that the priors of everyone involved would yield the same decision.

⁴ C. P. Robert. The Bayesian choice: from decision-theoretic foundations to computational implementation. Springer Science & Business Media, 2007.
Practical advantages of posterior expected utility

- Conceptually answers all ML problems.
- Suits all applications where quantifying uncertainty matters more than computational cost: all basic sciences, health, even one-shot commercial decisions.
- We never invoked any large-sample argument, so it suits datasets of all sizes.
Conjugacy

- Say we have a linear regression problem

    yi = f(xi) + εi,   f(x) = θᵀx,   εi i.i.d. Gaussian N(0, σ²).

- If we choose p(θ) = N(0, Σ), then (exercise)

    p(θ | (x, y)1:n) ∝ p((x, y)1:n | θ) p(θ) = N(σ⁻²A⁻¹Xᵀy, A⁻¹),

  where A = σ⁻²XᵀX + Σ⁻¹ and X is the matrix with rows x1ᵀ, …, xnᵀ.
- If the loss is not too complicated, then integrals are easy. For instance, prediction with L2 loss is simple:

    argmin_{y⋆} ∫ ‖y⋆ − θᵀx⋆‖² p(θ | (x, y)1:n) dθ = σ⁻²x⋆ᵀA⁻¹Xᵀy =: ŷ⋆.
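A minimal numpy sketch of this conjugate computation, with X the matrix whose rows are the inputs (the data and the prior below are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, sigma = 100, 2, 0.5
theta_true = np.array([1.0, -0.5])
X = rng.normal(size=(n, d))  # rows are the inputs x_i
y = X @ theta_true + sigma * rng.normal(size=n)

# Prior N(0, Sigma) on theta; the posterior is N(sigma^-2 A^-1 X^T y, A^-1)
Sigma = np.eye(d)
A = X.T @ X / sigma**2 + np.linalg.inv(Sigma)
post_mean = np.linalg.solve(A, X.T @ y) / sigma**2

# Bayes-optimal prediction under L2 loss at a new input x_star
x_star = np.array([0.3, 0.3])
y_star = x_star @ post_mean
```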
Monte Carlo methods

- Sometimes, you’re less lucky. Say we’re doing logistic regression:

    yi ∼ Bernoulli[σ(f(xi))],

  with f(x) = θᵀx, σ(x) = 1/(1 + e⁻ˣ).
- Even if we choose p(θ) = N(0, Σ),

    p(θ | (x, y)1:n) ∝ p((x, y)1:n | θ) p(θ)

  does not have a simple closed form.
- We need powerful numerical integration methods, that is, constructions of nodes (θi) and weights (wi) such that

    ∫ h dπ ≈ ∑_{i=1}^N wi h(θi).
Metropolis-Hastings

MH(π(θ), q(θ′|θ), θ0, Niter)
1  for k ← 1 to Niter
2      θ ← θ_{k−1}
3      θ′ ∼ q(·|θ), u ∼ U(0, 1)
4      α ← [π(θ′) q(θ|θ′)] / [π(θ) q(θ′|θ)]
5      if u < α
6          θ_k ← θ′    ▷ Accept
7      else θ_k ← θ    ▷ Reject
8  return (θ_k)_{k=1,…,Niter}
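A runnable random-walk version of the pseudocode above (a sketch: with a symmetric Gaussian proposal the q-ratio in α cancels, and we work with log densities for numerical stability):

```python
import numpy as np

def metropolis_hastings(log_pi, theta0, n_iter, step=0.5, rng=None):
    """Random-walk Metropolis-Hastings targeting the density exp(log_pi)."""
    if rng is None:
        rng = np.random.default_rng()
    theta = np.atleast_1d(np.asarray(theta0, dtype=float))
    log_p = log_pi(theta)
    chain = np.empty((n_iter, theta.size))
    for k in range(n_iter):
        prop = theta + step * rng.normal(size=theta.size)  # theta' ~ q(.|theta)
        log_p_prop = log_pi(prop)
        # accept with probability min(1, pi(theta') / pi(theta))
        if np.log(rng.uniform()) < log_p_prop - log_p:
            theta, log_p = prop, log_p_prop  # accept
        chain[k] = theta                     # on rejection, keep the old theta
    return chain

# Sanity check on a standard Gaussian target
chain = metropolis_hastings(lambda t: -0.5 * np.sum(t**2), [0.0], 50_000,
                            rng=np.random.default_rng(4))
```

After discarding a short burn-in, the chain's mean and standard deviation should be close to 0 and 1.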
The MCMC magic

- Under assumptions⁵,

    √Niter [ (1/Niter) ∑_{k=1}^{Niter} h(θ_k) − ∫ h(θ) π(θ) dθ ] → N(0, σ²_lim(h)).

- If you choose q carefully, you can hope for a polynomial increase of the mixing time and σ²_lim(h) with d.
- Most MCMC algorithms are instances of Metropolis-Hastings with clever choices of proposal⁶, even the NUTS HMC of Stan and PyMC3.
- For nice illustrations, check out https://chi-feng.github.io/mcmc-demo/

⁵ R. Douc et al. Nonlinear time series. Chapman-Hall, 2014.
⁶ C. P. Robert and G. Casella. Monte Carlo Statistical Methods. New York: Springer-Verlag, 2004.
Variational approximations

- When in a hurry, you can settle for a good approximation to your posterior

    π(θ) = p(θ|x) ∝ p(x|θ) p(θ),

  say minimize over q

    KL(q, π) = E_q log q − E_q log p(θ|x)
             = −[−E_q log q + E_q log p(x, θ)] + log p(x).

- Equivalently, we can maximize the evidence lower bound (ELBO)⁷.
- Ideally, I would rather cast the choice of q into a Wald-like problem.

⁷ D. M. Blei et al. “Variational inference: A review for statisticians”. Journal of the American Statistical Association 112.518 (2017), pp. 859–877.
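A toy check of the KL objective (not from the slides): when the target π is itself Gaussian, KL(q, π) between Gaussians is available in closed form, and minimizing it over Gaussian q recovers π:

```python
import numpy as np

def kl_gauss(m, v, mu, s):
    """KL( N(m, v^2) || N(mu, s^2) ), in closed form."""
    return np.log(s / v) + (v**2 + (m - mu) ** 2) / (2 * s**2) - 0.5

mu, s = 1.0, 2.0  # target pi = N(1, 4)
ms = np.linspace(-2, 4, 121)
vs = np.linspace(0.5, 4, 121)
kls = np.array([[kl_gauss(m, v, mu, s) for v in vs] for m in ms])

# The grid minimizer of KL(q, pi) sits at q = pi (up to grid resolution)
i, j = np.unravel_index(kls.argmin(), kls.shape)
```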
Linear regression

- yi = f(xi) + εi, f(x) = θᵀx, εi i.i.d. Gaussian N(0, σ²).
- If we choose p(θ) = N(0, Σ), then, with X the matrix whose rows are x1ᵀ, …, xnᵀ,

    p(θ | (x, y)1:n) ∝ p((x, y)1:n | θ) p(θ)
                    ∝ exp[ −‖y − Xθ‖²/(2σ²) − θᵀΣ⁻¹θ/2 ]
                    ∝ exp[ yᵀXθ/σ² − θᵀXᵀXθ/(2σ²) − θᵀΣ⁻¹θ/2 ]
                    ∝ exp[ −(1/2) (θ − σ⁻²A⁻¹Xᵀy)ᵀ A (θ − σ⁻²A⁻¹Xᵀy) ]
                    = N(σ⁻²A⁻¹Xᵀy, A⁻¹),

  where A = σ⁻²XᵀX + Σ⁻¹.
- Remember that prediction with L2 loss is simple:

    argmin_{y⋆} ∫ ‖y⋆ − f(x⋆)‖² p(θ | (x, y)1:n) dθ = σ⁻²x⋆ᵀA⁻¹Xᵀy.

- Actually, we can even check that

    p(f(x⋆) | x⋆, (x, y)1:n) = N(σ⁻²x⋆ᵀA⁻¹Xᵀy, x⋆ᵀA⁻¹x⋆).
Linear regression with nonlinear features

- Replace each x by a vector of features φ(x) ∈ Rᵖ:

    yi = f(xi) + εi,  i = 1, …, n,
    f(x) = θᵀφ(x),  εi i.i.d. Gaussian N(0, σ²),  θ ∼ N(0, Σ).

- Think φ(x) = (1, x1, x2, x1x2, …).
- Recall, writing φ⋆ := φ(x⋆) and Φ for the matrix with rows φ(x1)ᵀ, …, φ(xn)ᵀ,

    p(f(x⋆) | x⋆, (x, y)1:n) = N(σ⁻²φ⋆ᵀA⁻¹Φᵀy, φ⋆ᵀA⁻¹φ⋆),

  where A = σ⁻²ΦᵀΦ + Σ⁻¹.
- This requires a p × p inversion.
- But let K = ΦΣΦᵀ; then we can rewrite (exercise)

    p(f(x⋆) | (x, y)1:n) = N(µ⋆, σ⋆²),

  where

    µ⋆ = φ⋆ᵀΣΦᵀ(K + σ²I)⁻¹y,
    σ⋆² = φ⋆ᵀΣφ⋆ − φ⋆ᵀΣΦᵀ(K + σ²I)⁻¹ΦΣφ⋆,

  which requires only an n × n inversion.
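The agreement between the weight-space form (p × p inversion) and the kernel form (n × n inversion) can be checked numerically; a sketch with invented quadratic features:

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, sigma = 30, 3, 0.1
x = rng.uniform(-1, 1, size=n)
Phi = np.stack([np.ones(n), x, x**2], axis=1)  # rows are phi(x_i)
y = Phi @ rng.normal(size=p) + sigma * rng.normal(size=n)

Sigma = np.eye(p)
phi_star = np.array([1.0, 0.5, 0.25])  # phi(x_star) for x_star = 0.5

# Weight-space form: invert the p x p matrix A
A = Phi.T @ Phi / sigma**2 + np.linalg.inv(Sigma)
mu_w = phi_star @ np.linalg.solve(A, Phi.T @ y) / sigma**2
var_w = phi_star @ np.linalg.solve(A, phi_star)

# Kernel form: invert the n x n matrix K + sigma^2 I, with K = Phi Sigma Phi^T
C = Phi @ Sigma @ Phi.T + sigma**2 * np.eye(n)
k_star = Phi @ Sigma @ phi_star
mu_k = k_star @ np.linalg.solve(C, y)
var_k = phi_star @ Sigma @ phi_star - k_star @ np.linalg.solve(C, k_star)
```

Both routes give the same predictive mean and variance, up to numerical error.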
Gaussian processes

- A distribution over a space of functions f: Rᵈ → R.

Gaussian processes
If, for all p ∈ N and all x1, …, xp ∈ Rᵈ,

    [f(x1), …, f(xp)]ᵀ ∼ N(m, K),

where m = [µ(x1), …, µ(xp)]ᵀ and K = ((K(xi, xj))), then we say f ∼ GP(µ, K).

- Uniqueness is usually easy; existence is tricky.
- A necessary condition is that all matrices K are positive semi-definite.
Sampling, conditioning and predicting

See notebook 01 on https://github.com/rbardenet/bnp-course
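In the spirit of that notebook, here is a minimal sketch of sampling from a zero-mean GP prior with a squared-exponential kernel (the kernel and lengthscale are chosen here for illustration):

```python
import numpy as np

def rbf_kernel(xs, ys, lengthscale=0.2):
    """Squared-exponential kernel k(x, y) = exp(-(x - y)^2 / (2 l^2))."""
    d = xs[:, None] - ys[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

rng = np.random.default_rng(6)
x = np.linspace(0, 1, 100)

# A GP prior evaluated on a grid is just a multivariate Gaussian
K = rbf_kernel(x, x)
jitter = 1e-8 * np.eye(len(x))  # stabilizes the nearly singular Gram matrix
f = rng.multivariate_normal(np.zeros(len(x)), K + jitter)
```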
Learning

- In regression, writing θ for the kernel hyperparameters,

    p(y | x, θ) = ∫ p(y | f) p(f | x, θ) df = N(y | 0, K_θ + σ²I_n).

- So simply put a prior over η = (σ, θ) and integrate.
- Prediction becomes

    f⋆ ∼ ∫ p(f⋆ | x, η) p(η | x, y) dη.

- Alternatively, lots of people maximize the marginal likelihood p(y | x, θ).
Beyond regression: classification⁸

- (Exercise) Find a simple classification model with GPs.
- Take for instance

    p(y = +1 | x, f) = σ(f(x)),   f ∼ GP(0, K).

- Problem: prediction is not easy anymore:

    p(f⋆ | X, y, x⋆) = ∫ p(f⋆ | X, f, x⋆) p(f | X, y) df.

⁸ C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
Beyond regression: ranking⁹

- (Exercise) Find a simple ranking model with GPs: your data is (u, v)1:n where ∀i, ui ≺ vi. Your user wants to know whether a new u⋆ ≺ v⋆.
- Take for instance

    p(u ≺ v | u, v, f) = φ(f(v) − f(u)),   f ∼ GP(0, K),   φ increasing.

- Same difficulties with learning.

⁹ W. Chu and Z. Ghahramani. “Preference learning with Gaussian processes”. In: Proceedings of the 22nd International Conference on Machine Learning. 2005, pp. 137–144.
Emulators of expensive models

E. T. Y. Chang, M. Strong, and R. H. Clayton. “Bayesian Sensitivity Analysis of a Cardiac Cell Model Using a Gaussian Process Emulator”. PLoS ONE 10(6): e0130252, 2015. (University of Sheffield)

Abstract: Models of electrical activity in cardiac cells have become important research tools as they can provide a quantitative description of detailed and integrative physiology. However, cardiac cell models have many parameters, and how uncertainties in these parameters affect the model output is difficult to assess without undertaking large numbers of model runs. In this study we show that a surrogate statistical model of a cardiac cell model (the Luo-Rudy 1991 model) can be built using Gaussian process (GP) emulators. Using this approach we examined how eight outputs describing the action potential shape and action potential duration restitution depend on six inputs, which we selected to be the maximum conductances in the Luo-Rudy 1991 model. We found that the GP emulators could be fitted to a small number of model runs, and behaved as would be expected based on the underlying physiology that the model represents. We have shown that an emulator approach is a powerful tool for uncertainty and sensitivity analysis in cardiac cell models.

[The slide shows the first page of the article; the introduction is truncated here.]
Nonparametric fits

A. Shafieloo, A. G. Kim, and E. V. Linder. “Gaussian Process Cosmography”. arXiv:1204.2272v2 [astro-ph.CO], July 2012.

Abstract: Gaussian processes provide a method for extracting cosmological information from observations without assuming a cosmological model. We carry out cosmography – mapping the time evolution of the cosmic expansion – in a model-independent manner using kinematic variables and a geometric probe of cosmology. Using the state of the art supernova distance data from the Union2.1 compilation, we constrain, without any assumptions about dark energy parametrization or matter density, the Hubble parameter and deceleration parameter as a function of redshift. Extraction of these relations is tested successfully against models with features on various coherence scales, subject to certain statistical cautions.

[The slide shows the first page of the article; the introduction is truncated here.]
Natural language processing

M. Lukasik, K. Bontcheva, T. Cohn, A. Zubiaga, M. Liakata, and R. Procter. “Using Gaussian Processes for Rumour Stance Classification in Social Media”. arXiv:1609.01962v1 [cs.CL], September 2016.

Abstract: Social media tend to be rife with rumours while new reports are released piecemeal during breaking news. Interestingly, one can mine multiple reactions expressed by social media users in those situations, exploring their stance towards rumours, ultimately enabling the flagging of highly disputed rumours as being potentially false. In this work, we set out to develop an automated, supervised classifier that uses multi-task learning to classify the stance expressed in each individual tweet in a rumourous conversation as either supporting, denying or questioning the rumour. Using a classifier based on Gaussian Processes, and exploring its effectiveness on two datasets with very different characteristics and varying distributions of stances, we show that our approach consistently outperforms competitive baseline classifiers. Our classifier is especially effective in estimating the distribution of different types of stance associated with a given rumour, which we set forth as a desired characteristic for a rumour-tracking system that will warn both ordinary users of Twitter and professional news practitioners when a rumour is being rebutted.

[The slide shows the first page of the article; the introduction is truncated here.]
Bayesian optimization for hyperparameter tuning

J. Bergstra (Rowland Institute, Harvard), R. Bardenet (LRI, Universite Paris-Sud), Y. Bengio (DIRO, Universite de Montreal), and B. Kegl (LAL, CNRS & Universite Paris-Sud). “Algorithms for Hyper-Parameter Optimization”.

Abstract: Several recent advances to the state of the art in image classification benchmarks have come from better configurations of existing techniques rather than novel approaches to feature learning. Traditionally, hyper-parameter optimization has been the job of humans because they can be very efficient in regimes where only a few trials are possible. Presently, computer clusters and GPU processors make it possible to run more trials and we show that algorithmic approaches can find better results. We present hyper-parameter optimization results on tasks of training neural networks and deep belief networks (DBNs). We optimize hyper-parameters using random search and two new greedy sequential methods based on the expected improvement criterion. Random search has been shown to be sufficiently efficient for learning neural networks for several datasets, but we show it is unreliable for training DBNs. The sequential algorithms are applied to the most difficult DBN learning problems from [1] and find significantly better results than the best previously reported. This work contributes novel techniques for making response surface models P(y|x) in which many elements of hyper-parameter assignment (x) are known to be irrelevant given particular values of other elements.

[The slide shows the first page of the article; the introduction is truncated here.]
Bayesian optimization

I Goal: minimize a noisy f within N iterations, N small.
I Key application: find the hyperparameters of your ML algorithm that minimize the validation error.
I Idea: sequentially
  I update your model of f ,
  I optimize an acquisition criterion.
I Check out notebook 03 on https://github.com/rbardenet/bnp-course.
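The loop above can be sketched end to end with a toy GP surrogate and expected improvement maximized over a fixed grid. Everything below, the RBF kernel, the fixed lengthscale, the standardization of y, is an illustrative choice and not the course notebook's actual code:

```python
import numpy as np
from math import erf, sqrt, pi

def rbf(a, b, ell=1.0):
    """Squared-exponential kernel matrix between 1D point sets."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell**2)

def gp_posterior(x, y, xs, ell=1.0, noise=1e-6):
    """Posterior mean/variance of a zero-mean, unit-variance RBF GP."""
    K = rbf(x, x, ell) + noise * np.eye(len(x))
    L = np.linalg.cholesky(K)
    Ks = rbf(x, xs, ell)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    v = np.linalg.solve(L, Ks)
    mean = Ks.T @ alpha
    var = 1.0 - np.sum(v**2, axis=0)      # k(xs, xs) has unit diagonal
    return mean, np.maximum(var, 1e-12)

def expected_improvement(mean, var, y_best):
    """Closed-form EI for minimization: sigma (u Phi(u) + phi(u))."""
    sd = np.sqrt(var)
    u = (y_best - mean) / sd
    Phi = 0.5 * (1.0 + np.vectorize(erf)(u / sqrt(2.0)))
    phi = np.exp(-0.5 * u**2) / sqrt(2.0 * pi)
    return sd * (u * Phi + phi)

def bayes_opt(f, grid, n_init=3, n_iter=10, seed=0):
    """Minimal Bayesian optimization: fit GP, maximize EI on a grid."""
    rng = np.random.default_rng(seed)
    x = rng.choice(grid, size=n_init, replace=False)
    y = f(x)
    for _ in range(n_iter):
        yn = (y - y.mean()) / (y.std() + 1e-12)   # standardize outputs
        mean, var = gp_posterior(x, yn, grid)
        z = grid[np.argmax(expected_improvement(mean, var, yn.min()))]
        x = np.append(x, z)
        y = np.append(y, f(np.array([z])))
    i = np.argmin(y)
    return x[i], y[i]
```

On a smooth toy objective the loop concentrates evaluations near the minimizer; in practice one would also maximize the marginal likelihood over kernel hyperparameters rather than fixing the lengthscale.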
An example in 1D

[Figure: eleven successive iterations of Bayesian optimization on a 1D target over [0, 10]. Each slide shows the GP model of the target (top) and the expected improvement criterion EI (bottom); the maximum of EI shrinks from about 0.16 to below 0.01 as evaluations accumulate.]
Popular acquisition criteria

Expected improvement10

EI(z) = E( max(mN − f(z), 0) | (x, y)1:N ),

where mN = min1≤i≤N f(xi).

An easy computation yields

EI(z) = σ(z)( uΦ(u) + ϕ(u) ), (3)

where u = (mN − m(z))/σ(z), and Φ and ϕ denote the cdf and pdf of the N(0, 1) distribution.

10 D. R. Jones. “A Taxonomy of Global Optimization Methods Based on Response Surfaces”. In: Journal of Global Optimization 21 (2001), pp. 345–383.
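As a sanity check on the closed form (3), not on the slide, one can compare it against a direct Monte Carlo estimate of E[max(mN − f, 0)] with f ∼ N(m(z), σ(z)²); function names here are illustrative:

```python
import numpy as np
from math import erf, sqrt, pi

def ei_closed_form(m_best, mean, sd):
    """EI(z) = sd * (u Phi(u) + phi(u)) with u = (m_best - mean) / sd."""
    u = (m_best - mean) / sd
    Phi = 0.5 * (1.0 + erf(u / sqrt(2.0)))
    phi = np.exp(-0.5 * u * u) / sqrt(2.0 * pi)
    return sd * (u * Phi + phi)

def ei_monte_carlo(m_best, mean, sd, n=200_000, seed=0):
    """Direct estimate of E[max(m_best - f, 0)], f ~ N(mean, sd^2)."""
    f = np.random.default_rng(seed).normal(mean, sd, size=n)
    return float(np.maximum(m_best - f, 0.0).mean())
```

The two agree to Monte Carlo accuracy; for m_best = 0, mean = 0.5, sd = 1 both are close to 0.198.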
GP-UCB (Srinivas et al., 2010)

GP-UCB

GP-UCB(z) = m(z) + βσ(z).

I Stated for maximization; when minimizing f , use the lower confidence bound m(z) − βσ(z) instead.
I If β is properly tuned, one can use bandit results.
I First acquisition criterion to come with theoretical guarantees.
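For the minimization setting, the same rule reads as a lower confidence bound, and β directly trades exploitation against exploration. A toy illustration (the candidate numbers are made up):

```python
import numpy as np

def gp_lcb(mean, sd, beta):
    """Lower confidence bound for minimization: m(z) - beta * sigma(z).
    Small values are attractive because the mean is low (exploitation)
    or because the uncertainty is high (exploration)."""
    return mean - beta * sd

# Two candidates: A has the better mean, B the larger uncertainty.
mean = np.array([1.0, 1.5])
sd = np.array([0.1, 1.0])
explore = np.argmin(gp_lcb(mean, sd, beta=2.0))   # large beta picks B
exploit = np.argmin(gp_lcb(mean, sd, beta=0.1))   # small beta picks A
```

The bandit analyses referenced on the slide prescribe how β should grow with the iteration count to guarantee sublinear regret.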
Bayesian optimization for hyperparameter tuning
[Same first page of Bergstra et al., “Algorithms for Hyper-Parameter Optimization”, as shown earlier.]
I Check out hyperopt and spearmint.
Going further: Hyperopt across datasets10
10R. Bardenet et al. “Collaborative hyperparameter tuning”. In:International Conference on Machine Learning (ICML). Atlanta, Georgia, 2013.
Some useful hyperlinks

I Textbook by C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
  I great for understanding, methods, pointers to ML and stats.
I Videolecture by C. Rasmussen.
I Lecture notes by P. Orbanz. Lecture notes on Bayesian nonparametrics. 2014.
  I mathematically clean, without losing the focus on ML.
Some open issues
I Fully Bayesian scalable approaches!
I Natural approaches to constrained GPs.
I Links with other models based on Gaussians and geometry.
Back to the roots

I Formulate hyperparameter tuning across datasets and algorithms as a posterior expected loss problem, including computational constraints.
I Solve the resulting dynamic programming problem.
References I
Bardenet, R. et al. “Collaborative hyperparameter tuning”. In:International Conference on Machine Learning (ICML). Atlanta,Georgia, 2013.
Blei, D. M., A. Kucukelbir, and J. D. McAuliffe. “Variational inference: A review for statisticians”. In: Journal of the American Statistical Association 112.518 (2017), pp. 859–877.
Chu, W. and Z. Ghahramani. “Preference learning with Gaussian processes”. In: Proceedings of the 22nd International Conference on Machine Learning. 2005, pp. 137–144.
Douc, R., E. Moulines, and D. Stoffer. Nonlinear time series. Chapman & Hall, 2014.
Jones, D. R. “A Taxonomy of Global Optimization Methods Basedon Response Surfaces”. In: Journal of Global Optimization 21(2001), pp. 345–383.
Orbanz, P. Lecture notes on Bayesian nonparametrics. 2014.
References II
Parmigiani, G. and L. Inoue. Decision theory: principles and approaches. Vol. 812. John Wiley & Sons, 2009.
Rasmussen, C. E. and C. K. I. Williams. Gaussian Processes forMachine Learning. MIT Press, 2006.
Robert, C. P. The Bayesian choice: from decision-theoretic foundations to computational implementation. Springer Science & Business Media, 2007.
Robert, C. P. and G. Casella. Monte Carlo Statistical Methods.New York: Springer-Verlag, 2004.
Stigler, S. M. “The epic story of maximum likelihood”. In:Statistical Science (2007), pp. 598–620.
Wald, A. Statistical decision functions. Wiley, 1950.