
Introduction

• Goal: Learn input-output systems: given an input, predict output.

• Gaussian Process Regression (GPR): a powerful nonparametric regression technique

• Kriging and spline fits are both instances of GPR

Outline for today:

• Gaussian random vectors, marginals, and conditionals

• Gaussian processes

• Covariance functions

• GPR prediction


How You Get Your Grade¹

Now that I have your undivided attention . . .

• Students are sometimes ‘graded on a curve’

• Originally, this was a bell curve: the normal (Gaussian) distribution

• If a random variable X is normally distributed with mean µ and variance σ², its density is

p(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left( -\frac{(x-\mu)^2}{2\sigma^2} \right)
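As a quick numerical check (my addition, not part of the original slides), this sketch evaluates the density above with NumPy and compares it against SciPy's implementation:

```python
import numpy as np
from scipy.stats import norm

mu, sigma = 1.0, 2.0                    # example values, chosen arbitrarily
x = np.linspace(-5.0, 7.0, 101)

# Density from the formula on this slide.
p = np.exp(-(x - mu)**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

# Should match scipy.stats.norm to machine precision.
assert np.allclose(p, norm.pdf(x, loc=mu, scale=sigma))
```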

¹ Not really.


Multiple Gaussian Random Variables

What can we say about multiple normally-distributed random variables?

x1 and x2 are Gaussian, µ1 = 1, µ2 = 3, σ1 = σ2 = 1.


Multiple Gaussian Random Variables

If x1 and x2 are independent,

p(x_1, x_2) = p(x_1)\,p(x_2)

= \frac{1}{\sigma_1\sqrt{2\pi}} \cdot \frac{1}{\sigma_2\sqrt{2\pi}} \exp\left( -\frac{(x_1-\mu_1)^2}{2\sigma_1^2} \right) \exp\left( -\frac{(x_2-\mu_2)^2}{2\sigma_2^2} \right)

= \frac{1}{\sigma_1\sqrt{2\pi}} \cdot \frac{1}{\sigma_2\sqrt{2\pi}} \exp\left( -\frac{(x_1-\mu_1)^2}{2\sigma_1^2} - \frac{(x_2-\mu_2)^2}{2\sigma_2^2} \right)

= \frac{1}{2\pi\,\sigma_1\sigma_2} \exp\left( -\frac{1}{2} \begin{bmatrix} x_1-\mu_1 \\ x_2-\mu_2 \end{bmatrix}^T \begin{bmatrix} 1/\sigma_1^2 & 0 \\ 0 & 1/\sigma_2^2 \end{bmatrix} \begin{bmatrix} x_1-\mu_1 \\ x_2-\mu_2 \end{bmatrix} \right)

= \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left( -\frac{(x-\mu)^T \Sigma^{-1} (x-\mu)}{2} \right).
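To make the last step concrete (a check I added, not from the slides): with a diagonal Σ, the product of the two 1-D densities agrees with the joint multivariate Gaussian density, e.g. via SciPy:

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

mu = np.array([1.0, 3.0])
sigma = np.array([1.0, 2.0])            # example standard deviations
Sigma = np.diag(sigma**2)               # diagonal covariance: independence

x = np.array([0.5, 2.0])                # an arbitrary evaluation point

# Product of the two 1-D densities ...
p_prod = norm.pdf(x[0], mu[0], sigma[0]) * norm.pdf(x[1], mu[1], sigma[1])

# ... equals the joint density with the diagonal covariance.
p_joint = multivariate_normal.pdf(x, mean=mu, cov=Sigma)
assert np.isclose(p_prod, p_joint)
```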


Correlated Gaussian Random Variables

p(x) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left( -\frac{(x-\mu)^T \Sigma^{-1} (x-\mu)}{2} \right)

• What about cross-terms in Σ?

• They denote covariances: components of x don’t vary independently

• Plots below show samples generated when σ12 = 0, 0.7, 0.9
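To reproduce plots like these, samples can be drawn with NumPy; this is an illustrative sketch, assuming σ1 = σ2 = 1 as on the earlier slide:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, 3.0])

for s12 in (0.0, 0.7, 0.9):
    # The cross-term s12 sets how strongly the two components co-vary.
    Sigma = np.array([[1.0, s12],
                      [s12, 1.0]])
    samples = rng.multivariate_normal(mu, Sigma, size=500)
    # Empirical covariance should be close to Sigma.
    print(s12, np.cov(samples.T).round(2))
```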


Gaussian Random Vector

• A vector X whose components Xi are Gaussian random variables

• Density function analogous to 1-D case, but note covariances!

p(x) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left( -\frac{(x-\mu)^T \Sigma^{-1} (x-\mu)}{2} \right)

Probability density for a 2-D Gaussian random vector.


Marginal Distribution

• Idea: ignore some components; “marginalize over” or “integrate out”

• Picture: project joint distribution onto appropriate ‘wall’, normalize

• Math: p(x_i) = \int_{X_{-i}} p(x_i, x_{-i}) \, dx_{-i}

• Marginals of Gaussians are Gaussian!

• Mean: omit components of µ

E[X1] = µ1

• Covariance: omit rows and columns of Σ

cov(X_1) = \Sigma_{11}

Marginal density for a 2-D Gaussian (note: not normalized!)
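An empirical confirmation of the marginalization rule (my addition): draw from the joint and check that the first component has mean µ1 and variance Σ11:

```python
import numpy as np

rng = np.random.default_rng(1)
mu = np.array([1.0, 3.0])
Sigma = np.array([[1.0, 0.7],
                  [0.7, 1.0]])

samples = rng.multivariate_normal(mu, Sigma, size=100_000)

# Marginalizing over x2 = simply ignoring that component of each sample.
x1 = samples[:, 0]
print(x1.mean())  # ~ mu[0]      = 1.0
print(x1.var())   # ~ Sigma[0,0] = 1.0
```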


Conditional Distribution

• Idea: fix the “given” value of one or more components; X1 | X2 = a

• Picture: slice of joint distribution (suitably normalized)

• Math: Bayes' rule. p(x_1 \mid x_2) = \frac{p(x_1, x_2)}{p(x_2)}

• Conditionals of Gaussians are Gaussian!

• Mean: Depends on “given” a via covariance

E(X_1 \mid X_2 = a) = \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(a - \mu_2)

• Covariance: independent of a!

cov(X_1 \mid X_2 = a) = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}

Conditional pdfs for a 2-D Gaussian (note: not normalized!)
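A minimal sketch of these conditioning formulas for a 2-D Gaussian (my illustration; the numbers are made up):

```python
import numpy as np

mu = np.array([1.0, 3.0])
Sigma = np.array([[1.0, 0.7],
                  [0.7, 1.0]])

a = 4.0  # the observed ("given") value of X2

# E(X1 | X2 = a) = mu1 + Sigma12 * Sigma22^{-1} * (a - mu2)
cond_mean = mu[0] + Sigma[0, 1] / Sigma[1, 1] * (a - mu[1])

# cov(X1 | X2 = a) = Sigma11 - Sigma12 * Sigma22^{-1} * Sigma21
# Note: independent of the observed value a.
cond_var = Sigma[0, 0] - Sigma[0, 1] / Sigma[1, 1] * Sigma[1, 0]

print(cond_mean, cond_var)  # 1.7, 0.51
```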



Motivation for Gaussian Process Regression

• Suppose we want to model a system x −→ G −→ y

• What if we consider the output for each x as a random variable?

• Outputs corresponding to 'nearby' inputs are positively correlated

• Outputs corresponding to 'distant' inputs are uncorrelated


Gaussian Processes

Stochastic process: a (possibly infinite) set of random variables.

Model a system x −→ G −→ y using a stochastic process.

• The random variables themselves represent the outputs of the system

• The indices into the set represent the inputs of the system

G = {Yx, x ∈ X}

e.g., Y_(4.5,−6.7) is the output at the (x1 = 4.5, x2 = −6.7)-th input.

GP: any finite subset of outputs forms a Gaussian random vector.

Specify the marginal mean and variance: regardless of the other outputs, the output at any single location has mean µ0 and variance σ0².

What about the covariance?


Covariance Function

Covariance between outputs depends on distance between inputs.

• Specify a correlation function instead, then scale by marginal variance

• As d(x1, x2) → 0, corr(Y1, Y2) → 1

• As d(x1, x2) → ∞, corr(Y1, Y2) → 0

• Monotonic decrease in between may not be a bad idea!

• Negative exponential, squared exponential, linear decrease to 0, all possible.

• For smooth functions, use corr(Y_i, Y_j) = \exp\left( -\frac{(x_i - x_j)^2}{2\tau^2} \right)

• For other covariance functions, see Rasmussen and Williams, 2006
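As an illustration (my sketch, not from the slides), here is the squared-exponential correlation applied to a set of 1-D inputs to build the matrix R; the length scale τ is a hyperparameter:

```python
import numpy as np

def sq_exp_corr(xi, xj, tau=1.0):
    """Squared-exponential correlation between (broadcast) 1-D inputs."""
    return np.exp(-(xi - xj)**2 / (2 * tau**2))

x = np.array([0.0, 0.5, 2.0, 5.0])

# R[i, j] = corr(Y_i, Y_j): near 1 for nearby inputs, near 0 for distant ones.
R = sq_exp_corr(x[:, None], x[None, :])
print(R.round(3))
```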


GPR Prediction is Straightforward

• Consider a finite subset of outputs {Y_x(i), i = 1, . . . , m}

• Concatenate the outputs to form a Gaussian random vector Y

• The correlation function specifies a correlation matrix R, with Rij = corr(Yi, Yj)

• Multiply by the marginal variance σ0² to get the covariance matrix Σ

• Suppose a subset of these outputs, Y1, is known ("given")

• What is the conditional distribution of (Y2 | Y1 = y)?

• We have seen this before . . .

• We've just made a prediction using a Kriging model!


GPR Prediction: A Summary

• Begin with given data D = {(x(i), y(i)), i = 1, . . . , m1}

• Append desired prediction locations and corresponding unknown outputs

• Write it all together as one large set {Y_x(i), i = 1, . . . , m}

• Concatenate outputs to form a Gaussian random vector Y

• Use the correlation function to compute the correlation of every output Yi with every other output Yj. Scale the resulting correlation matrix R by σ0² to get the covariance matrix Σ. The mean vector is just µ0·1

• Use the formulae for Gaussian conditional distributions (k = known, u = unknown):

E(Y_u \mid Y_k = y) = \mu_0 \mathbf{1} + \Sigma_{uk}\Sigma_{kk}^{-1}(y - \mu_0 \mathbf{1}),

cov(Y_u \mid Y_k = y) = \Sigma_{uu} - \Sigma_{uk}\Sigma_{kk}^{-1}\Sigma_{ku}.
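Putting the recipe together, here is a minimal end-to-end sketch (my illustration of the summary above; the data and the hyperparameter values µ0, σ0, τ are made up):

```python
import numpy as np

def sq_exp_corr(a, b, tau):
    return np.exp(-(a[:, None] - b[None, :])**2 / (2 * tau**2))

# Given data D = {(x_i, y_i)} and prediction locations (all assumed values).
x_k = np.array([0.0, 1.0, 2.0, 3.0])   # known input locations
y_k = np.array([0.1, 0.9, 0.8, 0.2])   # known outputs
x_u = np.array([1.5, 2.5])             # locations where we want predictions

mu0, sigma0, tau = 0.5, 1.0, 0.8       # hyperparameters (assumed)

# Covariance blocks: scale correlations by the marginal variance sigma0^2.
S_kk = sigma0**2 * sq_exp_corr(x_k, x_k, tau)
S_uk = sigma0**2 * sq_exp_corr(x_u, x_k, tau)
S_uu = sigma0**2 * sq_exp_corr(x_u, x_u, tau)

# Conditional (predictive) mean and covariance.
A = S_uk @ np.linalg.inv(S_kk)         # for clarity; prefer np.linalg.solve
mean_u = mu0 + A @ (y_k - mu0)
cov_u = S_uu - A @ S_uk.T

print(mean_u)
print(np.diag(cov_u))                  # predictive variances
```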


Calibrating: Tuning Hyperparameters

How do we set µ0, σ0, τ?

Cross-validation: Out of m known data points,

• Leave one out

• Construct fit using all the others

• See how it performed on the left-out datum

• Repeat for all points, compute the average² 'leave-one-out error'

Find the values of µ0, σ0, τ that minimize the average held-out error.
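A brute-force version of this procedure might look like the sketch below (my illustration; here only τ is tuned, with µ0 and σ0 held fixed, and none of the algebra tricks from the footnote are used):

```python
import numpy as np

def loo_error(x, y, mu0, sigma0, tau):
    """Average squared leave-one-out prediction error for a GPR fit."""
    m, err = len(x), 0.0
    for i in range(m):
        keep = np.arange(m) != i
        xk, yk = x[keep], y[keep]
        # Covariance blocks for the held-out point vs. the rest.
        S_kk = sigma0**2 * np.exp(-(xk[:, None] - xk[None, :])**2 / (2 * tau**2))
        s_uk = sigma0**2 * np.exp(-(x[i] - xk)**2 / (2 * tau**2))
        pred = mu0 + s_uk @ np.linalg.solve(S_kk, yk - mu0)
        err += (pred - y[i])**2
    return err / m

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.1, 0.9, 0.8, 0.2, -0.5])

# Pick the length scale tau with the smallest held-out error.
taus = [0.3, 0.8, 1.5]
best = min(taus, key=lambda t: loo_error(x, y, mu0=0.0, sigma0=1.0, tau=t))
print(best)
```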

Alternative statistical approach: maximum marginal likelihood (ML-II); see Rasmussen and Williams, 2006.

² Not as difficult as it seems: algebra tricks allow us to simplify the math.


Questions?
