
Modern Gaussian Processes:

Scalable Inference and Novel Applications

(Part I-b) Introduction to Gaussian Processes

Edwin V. Bonilla and Maurizio Filippone

CSIRO’s Data61, Sydney, Australia and EURECOM, Sophia Antipolis, France

July 14th, 2019


Outline

1 Bayesian Modeling

2 Gaussian Processes
  - Bayesian Linear Models
  - Gaussian Processes
  - Connections with Deep Neural Nets
  - Optimizing Kernel Parameters

3 Challenges


Bayesian Modeling


Learning from Data — Function Estimation

• Take these two examples

[Figure: two example datasets of (x, y) pairs: one with real-valued outputs and one with outputs between 0 and 1]

• We are interested in estimating a function f(x) from data

• Most problems in Machine Learning can be cast this way!


What do Bayesian Models Have to Offer?

• Regression example

[Figure: regression example, two panels plotting y against x for the same dataset]


What do Bayesian Models Have to Offer?

• Classification example

[Figure: classification example, two panels plotting y ∈ [0, 1] against x for the same dataset]


Gaussian Processes


Linear Models

• Implement a linear combination of basis functions

$$f(\mathbf{x}) = \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}), \qquad \text{with} \quad \boldsymbol{\phi}(\mathbf{x}) = (\phi_1(\mathbf{x}), \ldots, \phi_D(\mathbf{x}))^\top$$


Probabilistic Interpretation of Loss Minimization

• Inputs: $\mathbf{X} = (\mathbf{x}_1, \ldots, \mathbf{x}_N)^\top$

• Labels: $\mathbf{y} = (y_1, \ldots, y_N)^\top$

• Weights: $\mathbf{w} = (w_1, \ldots, w_D)^\top$

Quadratic loss: $p(\mathbf{y}|\mathbf{X},\mathbf{w}) \propto \exp(-\mathrm{Loss})$

[Figure: a quadratic loss and the corresponding likelihood $\propto \exp(-\mathrm{Loss})$]

• Minimization of a loss function . . .

• . . . is equivalent to maximizing the likelihood $p(\mathbf{y}|\mathbf{X},\mathbf{w})$
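For example, assuming i.i.d. Gaussian noise with variance $\sigma^2$ (the likelihood used later in the deck), the negative log-likelihood is a quadratic loss up to additive constants:

$$-\log p(\mathbf{y}|\mathbf{X},\mathbf{w}) = \frac{1}{2\sigma^2}\sum_{n=1}^{N}\big(y_n - \mathbf{w}^\top\boldsymbol{\phi}(\mathbf{x}_n)\big)^2 + \frac{N}{2}\log(2\pi\sigma^2),$$

so minimizing the quadratic loss over $\mathbf{w}$ and maximizing the likelihood yield the same solution.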


Bayesian Inference

• Inputs: $\mathbf{X} = (\mathbf{x}_1, \ldots, \mathbf{x}_N)^\top$

• Labels: $\mathbf{y} = (y_1, \ldots, y_N)^\top$

• Weights: $\mathbf{w} = (w_1, \ldots, w_D)^\top$

[Figure: prior density $p(\mathbf{w})$ and posterior density $p(\mathbf{w}|\mathbf{y},\mathbf{X})$]

$$p(\mathbf{w}|\mathbf{y},\mathbf{X}) = \frac{p(\mathbf{y}|\mathbf{X},\mathbf{w})\,p(\mathbf{w})}{\int p(\mathbf{y}|\mathbf{X},\mathbf{w})\,p(\mathbf{w})\,d\mathbf{w}}$$


Bayesian Linear Models in Action

• Today’s posterior is tomorrow’s prior

[Figure: Bayesian linear model fits (x vs. y), updated as observations are added a few at a time]


Bayesian Linear Regression

• Modeling observations as noisy realizations of a linear combination of the features:

$$p(\mathbf{y}|\mathbf{w},\mathbf{X},\sigma^2) = \mathcal{N}(\boldsymbol{\Phi}\mathbf{w},\, \sigma^2\mathbf{I})$$

• $\boldsymbol{\Phi} = \boldsymbol{\Phi}(\mathbf{X})$ has entries

$$\boldsymbol{\Phi} = \begin{pmatrix} \phi_1(\mathbf{x}_1) & \cdots & \phi_D(\mathbf{x}_1) \\ \vdots & \ddots & \vdots \\ \phi_1(\mathbf{x}_N) & \cdots & \phi_D(\mathbf{x}_N) \end{pmatrix}$$

• Gaussian prior over model parameters:

$$p(\mathbf{w}) = \mathcal{N}(\mathbf{0}, \mathbf{S})$$


Bayesian Linear Regression

• Bayes rule:

$$p(\mathbf{w}|\mathbf{X},\mathbf{y}) = \frac{p(\mathbf{y}|\mathbf{X},\mathbf{w})\,p(\mathbf{w})}{\int p(\mathbf{y}|\mathbf{X},\mathbf{w})\,p(\mathbf{w})\,d\mathbf{w}} = \frac{p(\mathbf{y}|\mathbf{X},\mathbf{w})\,p(\mathbf{w})}{p(\mathbf{y}|\mathbf{X})}$$

• Posterior density: $p(\mathbf{w}|\mathbf{X},\mathbf{y})$
  - Distribution over the parameters after observing the data

• Conditional likelihood: $p(\mathbf{y}|\mathbf{X},\mathbf{w})$
  - Measure of "fitness"

• Prior density: $p(\mathbf{w})$
  - Anything we know about the parameters before we see any data

• Marginal likelihood: $p(\mathbf{y}|\mathbf{X})$
  - A normalization constant; it ensures $\int p(\mathbf{w}|\mathbf{X},\mathbf{y})\,d\mathbf{w} = 1$


Bayesian Linear Regression

• Posterior must be Gaussian (proof in the Appendix):

$$p(\mathbf{w}|\mathbf{X},\mathbf{y},\sigma^2) = \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$$

• Covariance:

$$\boldsymbol{\Sigma} = \left(\frac{1}{\sigma^2}\boldsymbol{\Phi}^\top\boldsymbol{\Phi} + \mathbf{S}^{-1}\right)^{-1}$$

• Mean:

$$\boldsymbol{\mu} = \frac{1}{\sigma^2}\boldsymbol{\Sigma}\boldsymbol{\Phi}^\top\mathbf{y}$$

• Predictions follow from a similar (tedious) exercise, with $\boldsymbol{\phi}_* = \boldsymbol{\phi}(\mathbf{x}_*)$:

$$p(y_*|\mathbf{X},\mathbf{y},\mathbf{x}_*,\sigma^2) = \mathcal{N}\big(\boldsymbol{\phi}_*^\top\boldsymbol{\mu},\; \sigma^2 + \boldsymbol{\phi}_*^\top\boldsymbol{\Sigma}\boldsymbol{\phi}_*\big)$$
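These expressions are only a few lines of linear algebra. A minimal numpy sketch (the cubic-polynomial features, the toy data, and the unit prior covariance are illustrative assumptions, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data and polynomial basis functions (illustrative choices)
X = rng.uniform(-3, 3, size=20)
y = np.sin(X) + 0.1 * rng.standard_normal(20)
Phi = np.vander(X, N=4, increasing=True)         # phi(x) = (1, x, x^2, x^3)

sigma2 = 0.1 ** 2                                # noise variance
S = np.eye(Phi.shape[1])                         # prior covariance, p(w) = N(0, S)

# Posterior: Sigma = (Phi^T Phi / sigma2 + S^-1)^-1,  mu = Sigma Phi^T y / sigma2
Sigma = np.linalg.inv(Phi.T @ Phi / sigma2 + np.linalg.inv(S))
mu = Sigma @ Phi.T @ y / sigma2

# Predictive distribution at a new input x_*
x_star = 0.5
phi_star = np.array([x_star ** d for d in range(4)])
pred_mean = phi_star @ mu
pred_var = sigma2 + phi_star @ Sigma @ phi_star
print(pred_mean, pred_var)
```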


Gaussian Processes

• Linear models require specifying a set of basis functions
  - Polynomials, trigonometric, . . . ?

• Gaussian Processes work implicitly with a possibly infinite set of basis functions!


Bayesian Linear Regression as a Kernel Machine

• Predictions can be expressed exclusively in terms of scalar products of the form

$$k(\mathbf{x}_i, \mathbf{x}_j) = \boldsymbol{\psi}(\mathbf{x}_i)^\top \boldsymbol{\psi}(\mathbf{x}_j)$$

• This allows us to work with either $k(\cdot,\cdot)$ or $\boldsymbol{\psi}(\cdot)$

• Why is this useful?


Bayesian Linear Regression as a Kernel Machine

• Working with $\boldsymbol{\psi}(\cdot)$ costs $\mathcal{O}(D^2)$ storage and $\mathcal{O}(D^3)$ time

• Working with $k(\cdot,\cdot)$ costs $\mathcal{O}(N^2)$ storage and $\mathcal{O}(N^3)$ time


Bayesian Linear Regression as a Kernel Machine

Proof sketch - more in the Appendix

• To show that Bayesian Linear Regression can be formulated through scalar products only, we need the Woodbury identity:

$$(\mathbf{A} + \mathbf{U}\mathbf{C}\mathbf{V})^{-1} = \mathbf{A}^{-1} - \mathbf{A}^{-1}\mathbf{U}(\mathbf{C}^{-1} + \mathbf{V}\mathbf{A}^{-1}\mathbf{U})^{-1}\mathbf{V}\mathbf{A}^{-1}$$

[Figure: block-matrix illustration of the Woodbury identity]


Bayesian Linear Regression as a Kernel Machine

Proof sketch - more in the Appendix

• Woodbury identity:

$$(\mathbf{A} + \mathbf{U}\mathbf{C}\mathbf{V})^{-1} = \mathbf{A}^{-1} - \mathbf{A}^{-1}\mathbf{U}(\mathbf{C}^{-1} + \mathbf{V}\mathbf{A}^{-1}\mathbf{U})^{-1}\mathbf{V}\mathbf{A}^{-1}$$

• We can rewrite:

$$\boldsymbol{\Sigma} = \left(\frac{1}{\sigma^2}\boldsymbol{\Phi}^\top\boldsymbol{\Phi} + \mathbf{S}^{-1}\right)^{-1} = \mathbf{S} - \mathbf{S}\boldsymbol{\Phi}^\top\left(\sigma^2\mathbf{I} + \boldsymbol{\Phi}\mathbf{S}\boldsymbol{\Phi}^\top\right)^{-1}\boldsymbol{\Phi}\mathbf{S}$$

• Here we set $\mathbf{A} = \mathbf{S}^{-1}$, $\mathbf{U} = \mathbf{V}^\top = \boldsymbol{\Phi}^\top$, and $\mathbf{C} = \frac{1}{\sigma^2}\mathbf{I}$
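A quick numerical sanity check of this rewriting, with randomly generated $\boldsymbol{\Phi}$ and $\mathbf{S}$ (the dimensions and values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
N, D, sigma2 = 6, 4, 0.5

Phi = rng.standard_normal((N, D))
A = rng.standard_normal((D, D))
S = A @ A.T + np.eye(D)                     # a valid (positive-definite) prior covariance

# Direct form: (Phi^T Phi / sigma2 + S^-1)^-1
direct = np.linalg.inv(Phi.T @ Phi / sigma2 + np.linalg.inv(S))

# Woodbury form: S - S Phi^T (sigma2 I + Phi S Phi^T)^-1 Phi S
woodbury = S - S @ Phi.T @ np.linalg.inv(sigma2 * np.eye(N) + Phi @ S @ Phi.T) @ Phi @ S

print(np.allclose(direct, woodbury))        # True
```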


Kernels

• We can pick $k(\cdot,\cdot)$ so that $\boldsymbol{\psi}(\cdot)$ is infinite dimensional!

• It is possible to show that for

$$k(\mathbf{x}_i, \mathbf{x}_j) = \exp\left(-\frac{\|\mathbf{x}_i - \mathbf{x}_j\|^2}{2}\right)$$

there exists a corresponding $\boldsymbol{\psi}(\cdot)$ that is infinite dimensional! (Proof in the Appendix)

• There are other kernels satisfying this property


Gaussian Processes as Infinitely-Wide Shallow Neural Nets

• Take $\mathbf{W}^{(i)} \sim \mathcal{N}(\mathbf{0}, \alpha_i\mathbf{I})$

• The Central Limit Theorem implies that $\mathbf{f}$ is Gaussian

[Figure: one-hidden-layer network with inputs $\mathbf{X}$, hidden features $\boldsymbol{\Phi}$, outputs $\mathbf{F}$, and weights $\mathbf{W}^{(0)}$, $\mathbf{W}^{(1)}$]

• $\mathbf{f}$ has zero mean

• $\mathrm{cov}(\mathbf{f}) = \mathbb{E}_{p(\mathbf{W}^{(0)},\mathbf{W}^{(1)})}\big[\boldsymbol{\Phi}(\mathbf{X}\mathbf{W}^{(0)})\,\mathbf{W}^{(1)}\mathbf{W}^{(1)\top}\,\boldsymbol{\Phi}(\mathbf{X}\mathbf{W}^{(0)})^\top\big]$

• Taking the expectation over $\mathbf{W}^{(1)}$ (using $\mathbb{E}[\mathbf{W}^{(1)}\mathbf{W}^{(1)\top}] = \alpha_1\mathbf{I}$) gives

$$\mathrm{cov}(\mathbf{f}) = \alpha_1\, \mathbb{E}_{p(\mathbf{W}^{(0)})}\big[\boldsymbol{\Phi}(\mathbf{X}\mathbf{W}^{(0)})\,\boldsymbol{\Phi}(\mathbf{X}\mathbf{W}^{(0)})^\top\big]$$

• Some choices of $\boldsymbol{\Phi}$ lead to analytic expressions of known kernels (RBF, Matérn, arc-cosine, Brownian motion, . . .)

Neal, LNS, 1996
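This construction can be simulated directly. A minimal numpy sketch; the tanh hidden units and the $1/H$ scaling of the output-weight variance (the usual device for a finite wide limit) are assumptions made for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
H = 2000                                  # hidden-layer width
x = np.linspace(-3, 3, 5)[:, None]        # a few 1-D test inputs

# W0 ~ N(0, alpha0 I); the output-weight variance is scaled by 1/H so that
# cov(f) stays finite as the width H grows.
alpha0, alpha1 = 1.0, 1.0
draws = []
for _ in range(2000):
    W0 = rng.normal(0.0, np.sqrt(alpha0), size=(1, H))
    W1 = rng.normal(0.0, np.sqrt(alpha1 / H), size=(H, 1))
    f = np.tanh(x @ W0) @ W1              # f = Phi(X W0) W1 for one weight draw
    draws.append(f.ravel())
draws = np.asarray(draws)

# By the CLT, f approaches a zero-mean Gaussian; the covariance across draws
# approximates the induced kernel evaluated at the chosen inputs.
print(draws.mean(axis=0))
print(np.cov(draws, rowvar=False))
```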


Gaussian Processes for Regression

• Latent function:

$$f(\mathbf{x}) = \mathbf{w}^\top\boldsymbol{\phi}(\mathbf{x}),$$

with $\boldsymbol{\phi}(\cdot)$ possibly infinite dimensional!

• The choice of $\boldsymbol{\phi}(\cdot)$ and the prior over $\mathbf{w}$ induce a distribution over functions

Definition

$f$ is distributed according to a Gaussian process iff, for any subset $\{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$, the evaluations of $f$ are jointly Gaussian:

$$f(\mathbf{x}) \sim \mathcal{GP}(\mu(\mathbf{x}), \kappa(\mathbf{x}, \mathbf{x}')) \quad\Longrightarrow\quad \mathbf{f} \sim \mathcal{N}(\boldsymbol{\mu}, \mathbf{K})$$


Gaussian Processes for Regression

• Bayes rule:

$$p(\mathbf{f}|\mathbf{X},\mathbf{y}) = \frac{p(\mathbf{y}|\mathbf{f})\,p(\mathbf{f}|\mathbf{X})}{\int p(\mathbf{y}|\mathbf{f})\,p(\mathbf{f}|\mathbf{X})\,d\mathbf{f}} = \frac{p(\mathbf{y}|\mathbf{f})\,p(\mathbf{f}|\mathbf{X})}{p(\mathbf{y}|\mathbf{X})}$$

• Conditional likelihood: $p(\mathbf{y}|\mathbf{f}) = \mathcal{N}(\mathbf{y}|\mathbf{f}, \sigma^2\mathbf{I})$

• Prior over latent variables (implied by the prior over $\mathbf{w}$): $p(\mathbf{f}|\mathbf{X}) = \mathcal{N}(\mathbf{f}|\mathbf{0}, \mathbf{K})$

• Marginal likelihood: $p(\mathbf{y}|\mathbf{X}) = \mathcal{N}(\mathbf{y}|\mathbf{0}, \mathbf{K} + \sigma^2\mathbf{I})$
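With a Gaussian likelihood everything above stays in closed form. A minimal numpy sketch of GP regression with an RBF kernel, using the kernel-form predictive equations derived in the Appendix (the toy data and hyperparameters are illustrative):

```python
import numpy as np

def rbf(A, B, alpha=1.0, beta=1.0):
    """k(x, x') = alpha * exp(-beta * ||x - x'||^2)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return alpha * np.exp(-beta * d2)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(15, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(15)
sigma2 = 0.1 ** 2

Xs = np.linspace(-3, 3, 100)[:, None]           # test inputs x_*
K = rbf(X, X)
Ks = rbf(Xs, X)                                  # entries k(x_*, x_i)
kss = rbf(Xs, Xs).diagonal()                     # k(x_*, x_*)

# Predictive mean k_*^T (K + sigma2 I)^-1 y and variance
# sigma2 + k_** - k_*^T (K + sigma2 I)^-1 k_*   (kernel form, as in the Appendix)
Ky_inv_y = np.linalg.solve(K + sigma2 * np.eye(len(X)), y)
Ky_inv_Ks = np.linalg.solve(K + sigma2 * np.eye(len(X)), Ks.T)
mean = Ks @ Ky_inv_y
var = sigma2 + kss - np.sum(Ks * Ky_inv_Ks.T, axis=1)
print(mean[:3], var[:3])
```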


Optimization of Gaussian Process parameters

• The kernel has parameters that have to be tuned,

$$k(\mathbf{x}_i, \mathbf{x}_j) = \alpha \exp\left(-\beta\,\|\mathbf{x}_i - \mathbf{x}_j\|^2\right),$$

. . . and there is also the noise parameter $\sigma^2$.

• Define $\boldsymbol{\theta} = (\alpha, \beta, \sigma^2)$

• How should we tune them?


Optimization of Gaussian Process parameters

• Define $\mathbf{K}_y = \mathbf{K} + \sigma^2\mathbf{I}$

• Maximize the logarithm of the marginal likelihood $p(\mathbf{y}|\mathbf{X},\boldsymbol{\theta}) = \mathcal{N}(\mathbf{0}, \mathbf{K}_y)$, that is,

$$\log p(\mathbf{y}|\mathbf{X},\boldsymbol{\theta}) = -\frac{1}{2}\log|\mathbf{K}_y| - \frac{1}{2}\mathbf{y}^\top\mathbf{K}_y^{-1}\mathbf{y} + \text{const.}$$

• Derivatives can be useful for gradient-based optimization:

$$\frac{\partial \log p(\mathbf{y}|\mathbf{X},\boldsymbol{\theta})}{\partial \theta_i}$$


Optimization of Gaussian Process parameters

• Log-marginal likelihood:

$$\log p(\mathbf{y}|\mathbf{X},\boldsymbol{\theta}) = -\frac{1}{2}\log|\mathbf{K}_y| - \frac{1}{2}\mathbf{y}^\top\mathbf{K}_y^{-1}\mathbf{y} + \text{const.}$$

• Derivatives can be useful for gradient-based optimization:

$$\frac{\partial \log p(\mathbf{y}|\mathbf{X},\boldsymbol{\theta})}{\partial \theta_i} = -\frac{1}{2}\,\mathrm{Tr}\!\left(\mathbf{K}_y^{-1}\frac{\partial \mathbf{K}_y}{\partial \theta_i}\right) + \frac{1}{2}\,\mathbf{y}^\top\mathbf{K}_y^{-1}\frac{\partial \mathbf{K}_y}{\partial \theta_i}\mathbf{K}_y^{-1}\mathbf{y}$$
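A minimal numpy sketch of this objective for the RBF kernel above; the Cholesky factorization is an implementation choice for numerical stability, and in practice the objective (or its gradient) would be handed to a gradient-based optimizer:

```python
import numpy as np

def rbf(A, B, alpha, beta):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return alpha * np.exp(-beta * d2)

def log_marginal_likelihood(theta, X, y):
    """theta = (alpha, beta, sigma2); returns log N(y | 0, K + sigma2 I)."""
    alpha, beta, sigma2 = theta
    N = len(y)
    Ky = rbf(X, X, alpha, beta) + sigma2 * np.eye(N)
    L = np.linalg.cholesky(Ky)                       # Ky = L L^T
    a = np.linalg.solve(L.T, np.linalg.solve(L, y))  # Ky^-1 y
    return (-0.5 * y @ a
            - np.log(np.diag(L)).sum()               # equals -0.5 * log|Ky|
            - 0.5 * N * np.log(2 * np.pi))

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(30, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(30)
print(log_marginal_likelihood((1.0, 1.0, 0.01), X, y))
```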


Challenges


Challenges

• Non-Gaussian Likelihoods?

• Scalability?

• Kernel design?


Marginal likelihood of GP models - non-Gaussian case

• Marginal likelihood

$$p(\mathbf{y}|\mathbf{X},\boldsymbol{\theta}) = \int p(\mathbf{y}|\mathbf{f})\,p(\mathbf{f}|\mathbf{X},\boldsymbol{\theta})\,d\mathbf{f}$$

can only be computed in closed form if $p(\mathbf{y}|\mathbf{f})$ is Gaussian

• What if $p(\mathbf{y}|\mathbf{f})$ is not Gaussian?


Scalability

• Marginal likelihood

$$p(\mathbf{y}|\mathbf{X},\boldsymbol{\theta}) = \int p(\mathbf{y}|\mathbf{f})\,p(\mathbf{f}|\mathbf{X},\boldsymbol{\theta})\,d\mathbf{f}$$

can only be computed in closed form if $p(\mathbf{y}|\mathbf{f})$ is Gaussian

• . . . and even then

$$\log p(\mathbf{y}|\mathbf{X},\boldsymbol{\theta}) = -\frac{1}{2}\log|\mathbf{K}_y| - \frac{1}{2}\mathbf{y}^\top\mathbf{K}_y^{-1}\mathbf{y} + \text{const.},$$

where $\mathbf{K}_y = \mathbf{K}(\mathbf{X},\boldsymbol{\theta}) + \sigma^2\mathbf{I}$ involves an $N \times N$ dense matrix!

• The complexity of the exact method is $\mathcal{O}(N^3)$ time and $\mathcal{O}(N^2)$ space!


Kernel Design

• The choice of a kernel is critical for good performance

• It encodes our assumptions about the prior over functions

[Figure: functions drawn with three different kernels (RBF, Matérn, Polynomial), plotted as y against x]
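To make the comparison concrete, here is a small numpy sketch of the three kernels and of drawing prior functions from each (the Matérn shown is the $\nu = 3/2$ form, and all hyperparameter values are illustrative assumptions):

```python
import numpy as np

def rbf(xi, xj, lengthscale=1.0):
    return np.exp(-(xi - xj) ** 2 / (2 * lengthscale ** 2))

def matern32(xi, xj, lengthscale=1.0):
    r = np.abs(xi - xj) / lengthscale
    return (1.0 + np.sqrt(3.0) * r) * np.exp(-np.sqrt(3.0) * r)

def polynomial(xi, xj, degree=3, c=1.0):
    return (xi * xj + c) ** degree

def sample_prior(K, n=3, jitter=1e-8, seed=0):
    """Draw n functions f ~ N(0, K) via an eigendecomposition (robust to low-rank K)."""
    vals, vecs = np.linalg.eigh(K + jitter * np.eye(len(K)))
    root = vecs * np.sqrt(np.clip(vals, 0.0, None))
    z = np.random.default_rng(seed).standard_normal((len(K), n))
    return (root @ z).T

x = np.linspace(-3, 3, 200)
for kern in (rbf, matern32, polynomial):
    K = kern(x[:, None], x[None, :])
    f = sample_prior(K)                   # three prior draws under this kernel
    print(kern.__name__, f.shape)
```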


Appendix


Kernels

• For simplicity, consider one-dimensional inputs $x_i, x_j$

• Expand the Gaussian kernel $k(x_i, x_j)$ as

$$\exp\left(-\frac{(x_i - x_j)^2}{2}\right) = \exp\left(-\frac{x_i^2}{2}\right)\exp\left(-\frac{x_j^2}{2}\right)\exp\left(x_i x_j\right)$$

• Focusing on the last term and applying the Taylor expansion of the $\exp(\cdot)$ function,

$$\exp(x_i x_j) = 1 + (x_i x_j) + \frac{(x_i x_j)^2}{2!} + \frac{(x_i x_j)^3}{3!} + \frac{(x_i x_j)^4}{4!} + \ldots$$


Kernels

• Define the infinite-dimensional mapping

$$\psi(x) = \exp\left(-\frac{x^2}{2}\right)\left(1,\; x,\; \frac{x^2}{\sqrt{2!}},\; \frac{x^3}{\sqrt{3!}},\; \frac{x^4}{\sqrt{4!}},\; \ldots\right)^\top$$

• It is easy to verify that

$$k(x_i, x_j) = \exp\left(-\frac{(x_i - x_j)^2}{2}\right) = \psi(x_i)^\top\psi(x_j)$$
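A quick numerical check of this identity with the feature map truncated after a finite number of terms (the truncation order is an arbitrary choice):

```python
import numpy as np
from math import factorial

def psi(x, order=20):
    """Truncated version of the infinite feature map psi(x)."""
    return np.exp(-x ** 2 / 2) * np.array([x ** d / np.sqrt(factorial(d)) for d in range(order)])

xi, xj = 0.7, -1.2
approx = psi(xi) @ psi(xj)                   # psi(x_i)^T psi(x_j), truncated
exact = np.exp(-(xi - xj) ** 2 / 2)          # Gaussian kernel
print(approx, exact)                         # the two agree to high precision
```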


Bayesian Linear Regression - Finding posterior parameters

• Ignoring normalizing constants, the posterior is:

$$p(\mathbf{w}|\mathbf{X},\mathbf{y},\sigma^2) \propto \exp\left\{-\frac{1}{2}(\mathbf{w}-\boldsymbol{\mu})^\top\boldsymbol{\Sigma}^{-1}(\mathbf{w}-\boldsymbol{\mu})\right\}$$
$$= \exp\left\{-\frac{1}{2}\left(\mathbf{w}^\top\boldsymbol{\Sigma}^{-1}\mathbf{w} - 2\mathbf{w}^\top\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu} + \boldsymbol{\mu}^\top\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}\right)\right\}$$
$$\propto \exp\left\{-\frac{1}{2}\left(\mathbf{w}^\top\boldsymbol{\Sigma}^{-1}\mathbf{w} - 2\mathbf{w}^\top\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}\right)\right\}$$


Bayesian Linear Regression - Finding posterior parameters

• Ignoring non-$\mathbf{w}$ terms, the likelihood multiplied by the prior is:

$$p(\mathbf{y}|\mathbf{w},\mathbf{X},\sigma^2)\,p(\mathbf{w}) \propto \exp\left\{-\frac{1}{2\sigma^2}(\mathbf{y}-\boldsymbol{\Phi}\mathbf{w})^\top(\mathbf{y}-\boldsymbol{\Phi}\mathbf{w})\right\}\exp\left\{-\frac{1}{2}\mathbf{w}^\top\mathbf{S}^{-1}\mathbf{w}\right\}$$
$$\propto \exp\left\{-\frac{1}{2}\left(\mathbf{w}^\top\left[\frac{1}{\sigma^2}\boldsymbol{\Phi}^\top\boldsymbol{\Phi} + \mathbf{S}^{-1}\right]\mathbf{w} - \frac{2}{\sigma^2}\mathbf{w}^\top\boldsymbol{\Phi}^\top\mathbf{y}\right)\right\}$$

• Posterior (from the previous slide):

$$\propto \exp\left\{-\frac{1}{2}\left(\mathbf{w}^\top\boldsymbol{\Sigma}^{-1}\mathbf{w} - 2\mathbf{w}^\top\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}\right)\right\}$$


Bayesian Linear Regression - Finding posterior parameters

• Equate individual terms on each side.

• Covariance:

$$\mathbf{w}^\top\boldsymbol{\Sigma}^{-1}\mathbf{w} = \mathbf{w}^\top\left[\frac{1}{\sigma^2}\boldsymbol{\Phi}^\top\boldsymbol{\Phi} + \mathbf{S}^{-1}\right]\mathbf{w} \quad\Longrightarrow\quad \boldsymbol{\Sigma} = \left(\frac{1}{\sigma^2}\boldsymbol{\Phi}^\top\boldsymbol{\Phi} + \mathbf{S}^{-1}\right)^{-1}$$

• Mean:

$$2\,\mathbf{w}^\top\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu} = \frac{2}{\sigma^2}\mathbf{w}^\top\boldsymbol{\Phi}^\top\mathbf{y} \quad\Longrightarrow\quad \boldsymbol{\mu} = \frac{1}{\sigma^2}\boldsymbol{\Sigma}\boldsymbol{\Phi}^\top\mathbf{y}$$


Bayesian Linear Regression as a Kernel Machine

• To show that Bayesian Linear Regression can be formulated through scalar products only, we need the Woodbury identity:

$$(\mathbf{A} + \mathbf{U}\mathbf{C}\mathbf{V})^{-1} = \mathbf{A}^{-1} - \mathbf{A}^{-1}\mathbf{U}(\mathbf{C}^{-1} + \mathbf{V}\mathbf{A}^{-1}\mathbf{U})^{-1}\mathbf{V}\mathbf{A}^{-1}$$

• Intuitively:

[Figure: block-matrix illustration of the Woodbury identity]


Bayesian Linear Regression as a Kernel Machine

• Woodbury identity:

$$(\mathbf{A} + \mathbf{U}\mathbf{C}\mathbf{V})^{-1} = \mathbf{A}^{-1} - \mathbf{A}^{-1}\mathbf{U}(\mathbf{C}^{-1} + \mathbf{V}\mathbf{A}^{-1}\mathbf{U})^{-1}\mathbf{V}\mathbf{A}^{-1}$$

• We can rewrite:

$$\boldsymbol{\Sigma} = \left(\frac{1}{\sigma^2}\boldsymbol{\Phi}^\top\boldsymbol{\Phi} + \mathbf{S}^{-1}\right)^{-1} = \mathbf{S} - \mathbf{S}\boldsymbol{\Phi}^\top\left(\sigma^2\mathbf{I} + \boldsymbol{\Phi}\mathbf{S}\boldsymbol{\Phi}^\top\right)^{-1}\boldsymbol{\Phi}\mathbf{S}$$

• Here we set $\mathbf{A} = \mathbf{S}^{-1}$, $\mathbf{U} = \mathbf{V}^\top = \boldsymbol{\Phi}^\top$, and $\mathbf{C} = \frac{1}{\sigma^2}\mathbf{I}$


Bayesian Linear Regression as a Kernel Machine

• Mean and variance of the predictions:

$$p(y_*|\mathbf{X},\mathbf{y},\mathbf{x}_*,\sigma^2) = \mathcal{N}\big(\boldsymbol{\phi}_*^\top\boldsymbol{\mu},\; \sigma^2 + \boldsymbol{\phi}_*^\top\boldsymbol{\Sigma}\boldsymbol{\phi}_*\big)$$

• Rewrite the variance:

$$\sigma^2 + \boldsymbol{\phi}_*^\top\boldsymbol{\Sigma}\boldsymbol{\phi}_* = \sigma^2 + \boldsymbol{\phi}_*^\top\mathbf{S}\boldsymbol{\phi}_* - \boldsymbol{\phi}_*^\top\mathbf{S}\boldsymbol{\Phi}^\top\left(\sigma^2\mathbf{I} + \boldsymbol{\Phi}\mathbf{S}\boldsymbol{\Phi}^\top\right)^{-1}\boldsymbol{\Phi}\mathbf{S}\boldsymbol{\phi}_*$$

. . . continued


Bayesian Linear Regression as a Kernel Machine

• Mean and variance of the predictions:

$$p(y_*|\mathbf{X},\mathbf{y},\mathbf{x}_*,\sigma^2) = \mathcal{N}\big(\boldsymbol{\phi}_*^\top\boldsymbol{\mu},\; \sigma^2 + \boldsymbol{\phi}_*^\top\boldsymbol{\Sigma}\boldsymbol{\phi}_*\big)$$

• Rewrite the variance:

$$\sigma^2 + \boldsymbol{\phi}_*^\top\mathbf{S}\boldsymbol{\phi}_* - \boldsymbol{\phi}_*^\top\mathbf{S}\boldsymbol{\Phi}^\top\left(\sigma^2\mathbf{I} + \boldsymbol{\Phi}\mathbf{S}\boldsymbol{\Phi}^\top\right)^{-1}\boldsymbol{\Phi}\mathbf{S}\boldsymbol{\phi}_* = \sigma^2 + k_{**} - \mathbf{k}_*^\top\left(\sigma^2\mathbf{I} + \mathbf{K}\right)^{-1}\mathbf{k}_*$$

• Where the mapping defining the kernel is $\boldsymbol{\psi}(\mathbf{x}) = \mathbf{S}^{1/2}\boldsymbol{\phi}(\mathbf{x})$, and

$$k_{**} = k(\mathbf{x}_*, \mathbf{x}_*) = \boldsymbol{\psi}(\mathbf{x}_*)^\top\boldsymbol{\psi}(\mathbf{x}_*), \qquad (\mathbf{k}_*)_i = k(\mathbf{x}_*, \mathbf{x}_i) = \boldsymbol{\psi}(\mathbf{x}_*)^\top\boldsymbol{\psi}(\mathbf{x}_i), \qquad (\mathbf{K})_{ij} = k(\mathbf{x}_i, \mathbf{x}_j) = \boldsymbol{\psi}(\mathbf{x}_i)^\top\boldsymbol{\psi}(\mathbf{x}_j)$$


Bayesian Linear Regression as a Kernel Machine

• Mean and variance of the predictions:

$$p(y_*|\mathbf{X},\mathbf{y},\mathbf{x}_*,\sigma^2) = \mathcal{N}\big(\boldsymbol{\phi}_*^\top\boldsymbol{\mu},\; \sigma^2 + \boldsymbol{\phi}_*^\top\boldsymbol{\Sigma}\boldsymbol{\phi}_*\big)$$

• Rewrite the mean:

$$\boldsymbol{\phi}_*^\top\boldsymbol{\mu} = \frac{1}{\sigma^2}\boldsymbol{\phi}_*^\top\boldsymbol{\Sigma}\boldsymbol{\Phi}^\top\mathbf{y} = \frac{1}{\sigma^2}\boldsymbol{\phi}_*^\top\left(\mathbf{S} - \mathbf{S}\boldsymbol{\Phi}^\top\left(\sigma^2\mathbf{I} + \boldsymbol{\Phi}\mathbf{S}\boldsymbol{\Phi}^\top\right)^{-1}\boldsymbol{\Phi}\mathbf{S}\right)\boldsymbol{\Phi}^\top\mathbf{y}$$
$$= \frac{1}{\sigma^2}\boldsymbol{\phi}_*^\top\mathbf{S}\boldsymbol{\Phi}^\top\left(\mathbf{I} - \left(\sigma^2\mathbf{I} + \boldsymbol{\Phi}\mathbf{S}\boldsymbol{\Phi}^\top\right)^{-1}\boldsymbol{\Phi}\mathbf{S}\boldsymbol{\Phi}^\top\right)\mathbf{y}$$
$$= \frac{1}{\sigma^2}\boldsymbol{\phi}_*^\top\mathbf{S}\boldsymbol{\Phi}^\top\left(\mathbf{I} - \left(\mathbf{I} + \frac{\boldsymbol{\Phi}\mathbf{S}\boldsymbol{\Phi}^\top}{\sigma^2}\right)^{-1}\frac{\boldsymbol{\Phi}\mathbf{S}\boldsymbol{\Phi}^\top}{\sigma^2}\right)\mathbf{y}$$

. . . continued


Bayesian Linear Regression as a Kernel Machine

• Define $\mathbf{H} = \frac{\boldsymbol{\Phi}\mathbf{S}\boldsymbol{\Phi}^\top}{\sigma^2}$

• The term in the parentheses,

$$\mathbf{I} - \left(\mathbf{I} + \frac{\boldsymbol{\Phi}\mathbf{S}\boldsymbol{\Phi}^\top}{\sigma^2}\right)^{-1}\frac{\boldsymbol{\Phi}\mathbf{S}\boldsymbol{\Phi}^\top}{\sigma^2},$$

becomes

$$\mathbf{I} - (\mathbf{I} + \mathbf{H})^{-1}\mathbf{H} = \mathbf{I} - (\mathbf{H}^{-1} + \mathbf{I})^{-1}$$

• Using Woodbury ($\mathbf{A}, \mathbf{U}, \mathbf{V} = \mathbf{I}$ and $\mathbf{C} = \mathbf{H}^{-1}$),

$$\mathbf{I} - (\mathbf{H}^{-1} + \mathbf{I})^{-1} = (\mathbf{I} + \mathbf{H})^{-1}$$


Bayesian Linear Regression as a Kernel Machine

• Substituting into the expression for the predictive mean,

$$\boldsymbol{\phi}_*^\top\boldsymbol{\mu} = \frac{1}{\sigma^2}\boldsymbol{\phi}_*^\top\mathbf{S}\boldsymbol{\Phi}^\top\left(\mathbf{I} - \left(\mathbf{I} + \frac{\boldsymbol{\Phi}\mathbf{S}\boldsymbol{\Phi}^\top}{\sigma^2}\right)^{-1}\frac{\boldsymbol{\Phi}\mathbf{S}\boldsymbol{\Phi}^\top}{\sigma^2}\right)\mathbf{y}$$
$$= \frac{1}{\sigma^2}\boldsymbol{\phi}_*^\top\mathbf{S}\boldsymbol{\Phi}^\top\left(\mathbf{I} + \frac{\boldsymbol{\Phi}\mathbf{S}\boldsymbol{\Phi}^\top}{\sigma^2}\right)^{-1}\mathbf{y} = \boldsymbol{\phi}_*^\top\mathbf{S}\boldsymbol{\Phi}^\top\left(\sigma^2\mathbf{I} + \boldsymbol{\Phi}\mathbf{S}\boldsymbol{\Phi}^\top\right)^{-1}\mathbf{y} = \mathbf{k}_*^\top\left(\sigma^2\mathbf{I} + \mathbf{K}\right)^{-1}\mathbf{y}$$

• All definitions are as in the case of the variance:

$$\boldsymbol{\psi}(\mathbf{x}) = \mathbf{S}^{1/2}\boldsymbol{\phi}(\mathbf{x}), \qquad (\mathbf{k}_*)_i = k(\mathbf{x}_*, \mathbf{x}_i) = \boldsymbol{\psi}(\mathbf{x}_*)^\top\boldsymbol{\psi}(\mathbf{x}_i), \qquad (\mathbf{K})_{ij} = k(\mathbf{x}_i, \mathbf{x}_j) = \boldsymbol{\psi}(\mathbf{x}_i)^\top\boldsymbol{\psi}(\mathbf{x}_j)$$
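A numerical check that the weight-space and kernel-space predictions coincide, using an arbitrary finite feature map and $\boldsymbol{\psi}(\mathbf{x}) = \mathbf{S}^{1/2}\boldsymbol{\phi}(\mathbf{x})$ as above (dimensions and data are made up):

```python
import numpy as np

rng = np.random.default_rng(2)
N, D, sigma2 = 10, 5, 0.3

Phi = rng.standard_normal((N, D))            # Phi(X), rows phi(x_n)^T
phi_star = rng.standard_normal(D)            # phi(x_*)
y = rng.standard_normal(N)
S = np.eye(D)                                # prior covariance of w

# Weight-space prediction: mu = Sigma Phi^T y / sigma2, mean = phi_*^T mu
Sigma = np.linalg.inv(Phi.T @ Phi / sigma2 + np.linalg.inv(S))
mean_w = phi_star @ (Sigma @ Phi.T @ y) / sigma2
var_w = sigma2 + phi_star @ Sigma @ phi_star

# Kernel-space prediction with k(x, x') = psi(x)^T psi(x'), psi(x) = S^{1/2} phi(x)
K = Phi @ S @ Phi.T
k_star = Phi @ S @ phi_star
k_ss = phi_star @ S @ phi_star
A = np.linalg.inv(sigma2 * np.eye(N) + K)
mean_k = k_star @ A @ y
var_k = sigma2 + k_ss - k_star @ A @ k_star

print(np.allclose(mean_w, mean_k), np.allclose(var_w, var_k))   # True True
```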
