
Introduction

• Goal: Learn input-output systems: given an input, predict output.

• Gaussian Process Regression (GPR): a powerful nonparametric regression technique

• Kriging and spline fits are both instances of GPR

Outline for today:

• Gaussian random vectors, marginals, and conditionals

• Gaussian processes

• Covariance functions

• GPR prediction


How You Get Your Grade¹

Now that I have your undivided attention . . .

• Students are sometimes ‘graded on a curve’

• Originally, this was a bell curve: the normal (Gaussian) distribution

• If a random variable X is normally distributed with mean µ and variance σ², its density is

p(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left( -\frac{(x-\mu)^2}{2\sigma^2} \right)
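As a quick numerical check (my addition, not part of the original slides), this sketch evaluates the density above with NumPy and compares it against SciPy's implementation:

```python
import numpy as np
from scipy.stats import norm

mu, sigma = 1.0, 2.0                    # example values, chosen arbitrarily
x = np.linspace(-5.0, 7.0, 101)

# Density from the formula on this slide.
p = np.exp(-(x - mu)**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

# Should match scipy.stats.norm to machine precision.
assert np.allclose(p, norm.pdf(x, loc=mu, scale=sigma))
```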

¹ Not really.


Multiple Gaussian Random Variables

What can we say about multiple normally-distributed random variables?

x1 and x2 are Gaussian, µ1 = 1, µ2 = 3, σ1 = σ2 = 1.


Multiple Gaussian Random Variables

If x1 and x2 are independent,

p(x_1, x_2) = p(x_1)\,p(x_2)

= \frac{1}{\sigma_1\sqrt{2\pi}} \cdot \frac{1}{\sigma_2\sqrt{2\pi}} \exp\left( -\frac{(x_1-\mu_1)^2}{2\sigma_1^2} \right) \exp\left( -\frac{(x_2-\mu_2)^2}{2\sigma_2^2} \right)

= \frac{1}{\sigma_1\sqrt{2\pi}} \cdot \frac{1}{\sigma_2\sqrt{2\pi}} \exp\left( -\frac{(x_1-\mu_1)^2}{2\sigma_1^2} - \frac{(x_2-\mu_2)^2}{2\sigma_2^2} \right)

= \frac{1}{2\pi\,\sigma_1\sigma_2} \exp\left( -\frac{1}{2} \begin{bmatrix} x_1-\mu_1 \\ x_2-\mu_2 \end{bmatrix}^T \begin{bmatrix} 1/\sigma_1^2 & 0 \\ 0 & 1/\sigma_2^2 \end{bmatrix} \begin{bmatrix} x_1-\mu_1 \\ x_2-\mu_2 \end{bmatrix} \right)

= \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left( -\frac{(x-\mu)^T \Sigma^{-1} (x-\mu)}{2} \right).
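To make the last step concrete (a check I added, not from the slides): with a diagonal Σ, the product of the two 1-D densities agrees with the joint multivariate Gaussian density, e.g. via SciPy:

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

mu = np.array([1.0, 3.0])
sigma = np.array([1.0, 2.0])            # example standard deviations
Sigma = np.diag(sigma**2)               # diagonal covariance: independence

x = np.array([0.5, 2.0])                # an arbitrary evaluation point

# Product of the two 1-D densities ...
p_prod = norm.pdf(x[0], mu[0], sigma[0]) * norm.pdf(x[1], mu[1], sigma[1])

# ... equals the joint density with the diagonal covariance.
p_joint = multivariate_normal.pdf(x, mean=mu, cov=Sigma)
assert np.isclose(p_prod, p_joint)
```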


Correlated Gaussian Random Variables

p(x) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left( -\frac{(x-\mu)^T \Sigma^{-1} (x-\mu)}{2} \right)

• What about cross-terms in Σ?

• They denote covariances: components of x don’t vary independently

• Plots below show samples generated when σ12 = 0, 0.7, 0.9
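To reproduce plots like these, samples can be drawn with NumPy; this is an illustrative sketch, assuming σ1 = σ2 = 1 as on the earlier slide:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, 3.0])

for s12 in (0.0, 0.7, 0.9):
    # The cross-term s12 sets how strongly the two components co-vary.
    Sigma = np.array([[1.0, s12],
                      [s12, 1.0]])
    samples = rng.multivariate_normal(mu, Sigma, size=500)
    # Empirical covariance should be close to Sigma.
    print(s12, np.cov(samples.T).round(2))
```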


Gaussian Random Vector

• A vector X whose components Xi are Gaussian random variables

• Density function analogous to 1-D case, but note covariances!

p(x) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left( -\frac{(x-\mu)^T \Sigma^{-1} (x-\mu)}{2} \right)

Probability density for a 2-D Gaussian random vector.


Marginal Distribution

• Idea: ignore some components; “marginalize over” or “integrate out”

• Picture: project joint distribution onto appropriate ‘wall’, normalize

• Math: p(x_i) = \int_{X_{-i}} p(x_i, x_{-i}) \, dx_{-i}

• Marginals of Gaussians are Gaussian!

• Mean: omit components of µ

E[X1] = µ1

• Covariance: omit rows and columns of Σ

cov(X_1) = \Sigma_{11}

Marginal density for a 2-D Gaussian (note: not normalized!)
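An empirical confirmation of the marginalization rule (my addition): draw from the joint and check that the first component has mean µ1 and variance Σ11:

```python
import numpy as np

rng = np.random.default_rng(1)
mu = np.array([1.0, 3.0])
Sigma = np.array([[1.0, 0.7],
                  [0.7, 1.0]])

samples = rng.multivariate_normal(mu, Sigma, size=100_000)

# Marginalizing over x2 = simply ignoring that component of each sample.
x1 = samples[:, 0]
print(x1.mean())  # ~ mu[0]      = 1.0
print(x1.var())   # ~ Sigma[0,0] = 1.0
```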


Conditional Distribution

• Idea: fix the “given” value of one or more components; X1 | X2 = a

• Picture: slice of joint distribution (suitably normalized)

• Math: Bayes' rule. p(x_1 \mid x_2) = \frac{p(x_1, x_2)}{p(x_2)}

• Conditionals of Gaussians are Gaussian!

• Mean: Depends on “given” a via covariance

E(X_1 \mid X_2 = a) = \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(a - \mu_2)

• Covariance: independent of a!

cov(X_1 \mid X_2 = a) = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}

Conditional pdfs for a 2-D Gaussian (note: not normalized!)
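A minimal sketch of these conditioning formulas for a 2-D Gaussian (my illustration; the numbers are made up):

```python
import numpy as np

mu = np.array([1.0, 3.0])
Sigma = np.array([[1.0, 0.7],
                  [0.7, 1.0]])

a = 4.0  # the observed ("given") value of X2

# E(X1 | X2 = a) = mu1 + Sigma12 * Sigma22^{-1} * (a - mu2)
cond_mean = mu[0] + Sigma[0, 1] / Sigma[1, 1] * (a - mu[1])

# cov(X1 | X2 = a) = Sigma11 - Sigma12 * Sigma22^{-1} * Sigma21
# Note: independent of the observed value a.
cond_var = Sigma[0, 0] - Sigma[0, 1] / Sigma[1, 1] * Sigma[1, 0]

print(cond_mean, cond_var)  # 1.7, 0.51
```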



Motivation for Gaussian Process Regression

• Suppose we want to model a system x −→ G −→ y

• What if we consider the output for each x as a random variable?

• Outputs corresponding to 'nearby' inputs are positively correlated

• Outputs corresponding to 'distant' inputs are uncorrelated


Gaussian Processes

Stochastic process: a (possibly infinite) set of random variables.

Model a system x −→ G −→ y using a stochastic process.

• The random variables themselves represent the outputs of the system

• The indices into the set represent the inputs of the system

G = {Yx, x ∈ X}

e.g., Y_(4.5,−6.7) is the output at the (x1 = 4.5, x2 = −6.7)-th input.

GP: any finite subset of outputs forms a Gaussian random vector.

Specify the marginal mean and variance: regardless of the other outputs, the output at any single location has mean µ0 and variance σ0².

What about the covariance?


Covariance Function

Covariance between outputs depends on distance between inputs.

• Specify a correlation function instead, then scale by marginal variance

• As d(x1, x2) → 0, corr(Y1, Y2) → 1

• As d(x1, x2) → ∞, corr(Y1, Y2) → 0

• Monotonic decrease in between may not be a bad idea!

• Negative exponential, squared exponential, linear decrease to 0, all possible.

• For smooth functions, use corr(Y_i, Y_j) = \exp\left( -\frac{(x_i - x_j)^2}{2\tau^2} \right)

• For other covariance functions, see Rasmussen and Williams, 2006
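As an illustration (my sketch, not from the slides), here is the squared-exponential correlation applied to a set of 1-D inputs to build the matrix R; the length scale τ is a hyperparameter:

```python
import numpy as np

def sq_exp_corr(xi, xj, tau=1.0):
    """Squared-exponential correlation between (broadcast) 1-D inputs."""
    return np.exp(-(xi - xj)**2 / (2 * tau**2))

x = np.array([0.0, 0.5, 2.0, 5.0])

# R[i, j] = corr(Y_i, Y_j): near 1 for nearby inputs, near 0 for distant ones.
R = sq_exp_corr(x[:, None], x[None, :])
print(R.round(3))
```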


GPR Prediction is Straightforward

• Consider a finite subset of outputs {Y_x(i), i = 1, . . . , m}

• Concatenate the outputs to form a Gaussian random vector Y

• The correlation function specifies a correlation matrix R, with Rij = corr(Yi, Yj)

• Multiply by the marginal variance σ0² to get the covariance matrix Σ

• Suppose a subset of these outputs, Y1, is known ("given")

• What is the conditional distribution of (Y2 | Y1 = y)?

• We have seen this before . . .

• We've just made a prediction using a Kriging model!


GPR Prediction: A Summary

• Begin with given data D = {(x(i), y(i)), i = 1, . . . , m1}

• Append desired prediction locations and corresponding unknown outputs

• Write it all together as one large set {Y_x(i), i = 1, . . . , m}

• Concatenate outputs to form a Gaussian random vector Y

• Use the correlation function to compute the correlation of every output Yi with every other output Yj. Scale the resulting correlation matrix R by σ0² to get the covariance matrix Σ. The mean vector is just µ0·1

• Use the formulae for Gaussian conditional distributions (k = known, u = unknown):

E(Y_u \mid Y_k = y) = \mu_0 \mathbf{1} + \Sigma_{uk}\Sigma_{kk}^{-1}(y - \mu_0 \mathbf{1}),

cov(Y_u \mid Y_k = y) = \Sigma_{uu} - \Sigma_{uk}\Sigma_{kk}^{-1}\Sigma_{ku}.
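Putting the recipe together, here is a minimal end-to-end sketch (my illustration of the summary above; the data and the hyperparameter values µ0, σ0, τ are made up):

```python
import numpy as np

def sq_exp_corr(a, b, tau):
    return np.exp(-(a[:, None] - b[None, :])**2 / (2 * tau**2))

# Given data D = {(x_i, y_i)} and prediction locations (all assumed values).
x_k = np.array([0.0, 1.0, 2.0, 3.0])   # known input locations
y_k = np.array([0.1, 0.9, 0.8, 0.2])   # known outputs
x_u = np.array([1.5, 2.5])             # locations where we want predictions

mu0, sigma0, tau = 0.5, 1.0, 0.8       # hyperparameters (assumed)

# Covariance blocks: scale correlations by the marginal variance sigma0^2.
S_kk = sigma0**2 * sq_exp_corr(x_k, x_k, tau)
S_uk = sigma0**2 * sq_exp_corr(x_u, x_k, tau)
S_uu = sigma0**2 * sq_exp_corr(x_u, x_u, tau)

# Conditional (predictive) mean and covariance.
A = S_uk @ np.linalg.inv(S_kk)         # for clarity; prefer np.linalg.solve
mean_u = mu0 + A @ (y_k - mu0)
cov_u = S_uu - A @ S_uk.T

print(mean_u)
print(np.diag(cov_u))                  # predictive variances
```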


Calibrating: Tuning Hyperparameters

How do we set µ0, σ0, τ?

Cross-validation: Out of m known data points,

• Leave one out

• Construct fit using all the others

• See how it performed on the left-out datum

• Repeat for all points, compute the average² 'leave-one-out error'

Find the values of µ0, σ0, τ that minimize the average held-out error.
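A brute-force version of this procedure might look like the sketch below (my illustration; here only τ is tuned, with µ0 and σ0 held fixed, and none of the algebra tricks from the footnote are used):

```python
import numpy as np

def loo_error(x, y, mu0, sigma0, tau):
    """Average squared leave-one-out prediction error for a GPR fit."""
    m, err = len(x), 0.0
    for i in range(m):
        keep = np.arange(m) != i
        xk, yk = x[keep], y[keep]
        # Covariance blocks for the held-out point vs. the rest.
        S_kk = sigma0**2 * np.exp(-(xk[:, None] - xk[None, :])**2 / (2 * tau**2))
        s_uk = sigma0**2 * np.exp(-(x[i] - xk)**2 / (2 * tau**2))
        pred = mu0 + s_uk @ np.linalg.solve(S_kk, yk - mu0)
        err += (pred - y[i])**2
    return err / m

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.1, 0.9, 0.8, 0.2, -0.5])

# Pick the length scale tau with the smallest held-out error.
taus = [0.3, 0.8, 1.5]
best = min(taus, key=lambda t: loo_error(x, y, mu0=0.0, sigma0=1.0, tau=t))
print(best)
```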

Alternative statistical approach: maximum marginal likelihood (ML-II); see Rasmussen and Williams, 2006.

² Not as difficult as it seems: algebra tricks allow us to simplify the math.


Questions?
