
Bayesian Regression: Basis Functions, MLE & Regularized Least Squares, Multiple Outputs, Inference with \sigma^2 Unknown, Zellner's g-Prior, Uninformative Priors

Prof. Nicholas Zabaras

University of Notre Dame

Notre Dame, IN, USA

Email: [email protected]

URL: https://www.zabaras.com/

September 18, 2017


Contents

Linear basis function models, Maximum likelihood and least squares, Geometry of least squares, Convexity of the NLL, Sequential learning, Robust linear regression, Regularized least squares, Multiple outputs.

Bayesian linear regression, Parameter posterior distribution, A note on data centering, Numerical example, Predictive distribution, Bayesian inference in linear regression when \sigma^2 is unknown, Zellner's g-Prior, Uninformative (semi-conjugate) prior, Evidence approximation.

Following closely:

Chris Bishop's PRML book, Chapter 3

Kevin Murphy, Machine Learning: A Probabilistic Perspective, Chapter 7

Regression using parametric discriminative models in PMTK3 (run TutRegr.m in PMTK3)


Linear Regression

We already considered an example of linear regression in an earlier lecture: polynomial curve fitting.

We are interested in a linear combination (regression) over a ``fixed set'' of nonlinear basis functions.

Supervised learning: N observations {x_n} with corresponding target values {t_n} are provided. The goal is to predict t for a new value x.

We construct a function y(x) that serves as a prediction of t. We follow a Bayesian perspective and model the predictive distribution p(t|x).


Linear Regression

From the conditional distribution p(t|x) we can make point estimates of t for a given x by minimizing a loss function.

For a quadratic loss function, the point estimate is the conditional mean, y(x, w) = E[t|x].


Linear Regression

The simplest linear model for regression involves a linear combination of the input variables,

y(x, w) = w_0 + w_1 x_1 + \ldots + w_D x_D = \sum_{i=0}^{D} w_i x_i,

where x = (x_1, x_2, \ldots, x_D)^T and we have defined the dummy input x_0 \equiv 1.

This is often simply known as linear regression. D is the input dimensionality.


Linear Basis Function Models

More generally,

y(x, w) = w_0 + w_1\phi_1(x) + \ldots + w_{M-1}\phi_{M-1}(x) = \sum_{i=0}^{M-1} w_i\phi_i(x) = w^T\phi(x), \quad \phi_0(x) = 1,

where the \phi_i(x) are known as basis functions and

\phi(x) = \left(\phi_0(x), \phi_1(x), \ldots, \phi_{M-1}(x)\right)^T, \qquad w = \left(w_0, w_1, \ldots, w_{M-1}\right)^T.

The parameter w_0 allows for any fixed offset in the data and is called the bias parameter. For convenience, we define the additional dummy 'basis function' \phi_0(x) = 1 so that the bias is absorbed into the sum.

Often the \phi_i(x) represent features extracted from the data x.


Polynomial and Gaussian Basis Functions

Polynomial basis functions (scalar input, global support):

\phi_j(x) = x^j

Gaussian basis functions (local support):

\phi_j(x) = \exp\left(-\frac{(x - \mu_j)^2}{2 s^2}\right)

[Figure: polynomial and Gaussian basis functions plotted on [-1, 1]; the Gaussian centers are \mu_j = -1:0.2:1 with scale s = 0.2. MatLab code.]
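The following is a minimal MATLAB sketch of how such a design matrix of Gaussian basis functions can be evaluated and plotted; the grid, the centers \mu_j = -1:0.2:1 and the scale s = 0.2 follow the figure description above, and everything else is an illustrative assumption rather than the code behind the original figure.

% Sketch: evaluate Gaussian basis functions phi_j(x) = exp(-(x - mu_j)^2 / (2*s^2)).
% Centers and scale follow the figure above; uses implicit expansion (R2016b+).
x   = linspace(-1, 1, 200)';          % input grid (column vector)
mu  = -1:0.2:1;                       % basis-function centers (row vector)
s   = 0.2;                            % common scale
Phi = exp(-(x - mu).^2 / (2*s^2));    % N x M matrix, column j is phi_j evaluated on x
plot(x, Phi); xlabel('x'); ylabel('\phi_j(x)'); title('Gaussian basis functions');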


Logistic Sigmoidal Basis Functions

Sigmoidal basis functions:

\phi_j(x) = \sigma\!\left(\frac{x - \mu_j}{s}\right), \qquad \sigma(a) = \frac{1}{1 + e^{-a}} \;\; \text{(logistic sigmoid function)}.

The tanh function is related to the logistic sigmoid by \tanh(a) = 2\sigma(2a) - 1.

[Figure: sigmoidal basis functions on [-1, 1]; centers \mu_j = -1:0.2:1, scale s = 0.1. MatLab code.]


Sigmoidal and Tanh Basis Functions

The sigmoidal and tanh basis functions are related by

\tanh(a) = 2\sigma(2a) - 1, \qquad \sigma(a) = \frac{1}{1 + e^{-a}}, \qquad \text{so that} \quad \sigma(a) = \frac{1}{2}\left(1 + \tanh(a/2)\right).

A general linear combination of logistic sigmoidal functions is therefore equivalent to a linear combination of tanh functions:

y(x, w) = w_0 + \sum_{j=1}^{M} w_j\,\sigma\!\left(\frac{x - \mu_j}{s}\right)
        = w_0 + \frac{1}{2}\sum_{j=1}^{M} w_j\left(1 + \tanh\!\left(\frac{x - \mu_j}{2s}\right)\right)
        = u_0 + \sum_{j=1}^{M} u_j\tanh\!\left(\frac{x - \mu_j}{2s}\right),

where u_0 = w_0 + \frac{1}{2}\sum_{j=1}^{M} w_j and u_j = \frac{w_j}{2}.


Choice of Basis Functions

We are interested in functions of local support to explore adaptivity.

Local support functions comprise a spectrum of different spatial frequencies. An example is wavelets, which are local both spatially and in frequency. They are, however, useful only when the input is defined on a lattice.


Maximum Likelihood and Least Squares

Assume the observations come from a deterministic function with added Gaussian noise,

t = y(x, w) + \epsilon, \qquad p(\epsilon \mid \beta) = \mathcal{N}\left(\epsilon \mid 0, \beta^{-1}\right),

which is the same as saying

p(t \mid x, w, \beta) = \mathcal{N}\left(t \mid y(x, w), \beta^{-1}\right).

Here \beta is the noise precision. Under a squared loss function, the optimal point prediction is the conditional mean, E[t|x] = y(x, w).

An example for 2D input x is shown below: a linear model E[t|x] = w_0 + w_1 x_1 + w_2 x_2 and a quadratic model E[t|x] = w_0 + w_1 x_1 + w_2 x_2 + w_3 x_1^2 + w_4 x_2^2.

[Figure: linear and quadratic surfaces fit to 2D data. Run surfaceFitDemo from PMTK3.]


Maximum Likelihood and Least Squares

Given observed inputs X = \{x_1, \ldots, x_N\} and targets t = (t_1, \ldots, t_N)^T, we obtain the likelihood function

p(t \mid X, w, \beta) = \prod_{n=1}^{N}\mathcal{N}\left(t_n \mid w^T\phi(x_n), \beta^{-1}\right).

We often work with the log-likelihood,

\ell(w, \beta) = \log p(t \mid X, w, \beta) = \sum_{n=1}^{N}\log\mathcal{N}\left(t_n \mid w^T\phi(x_n), \beta^{-1}\right).

Instead of maximizing the log-likelihood, one can equivalently minimize the negative log-likelihood (NLL),

NLL(w, \beta) = -\sum_{n=1}^{N}\log\mathcal{N}\left(t_n \mid w^T\phi(x_n), \beta^{-1}\right).


Maximum Likelihood and Least Squares

Taking the log of the likelihood, we obtain

\ln p(t \mid w, \beta) = \sum_{n=1}^{N}\ln\mathcal{N}\left(t_n \mid w^T\phi(x_n), \beta^{-1}\right) = \frac{N}{2}\ln\beta - \frac{N}{2}\ln(2\pi) - \beta E_D(w),

where we have defined

E_D(w) = \frac{1}{2}\sum_{n=1}^{N}\left(t_n - w^T\phi(x_n)\right)^2, \qquad RSS(w) = \sum_{n=1}^{N}\left(t_n - w^T\phi(x_n)\right)^2, \qquad MSE = RSS/N.

RSS is often known as the residual sum of squares or sum of squared errors (SSE), and MSE is the mean squared error. Computing w via MLE is therefore the same as least squares.

The NLL is a quadratic bowl with a unique minimum (the MLE estimate).

[Figures: data, truth and prediction for a linear fit; sum-of-squares error contours in the (w0, w1) plane. Run contoursSSEdemo and residualsDemo from PMTK3.]


Maximum Likelihood and Least Squares

Recall that

\ln p(t \mid w, \beta) = \frac{N}{2}\ln\beta - \frac{N}{2}\ln 2\pi - \beta E_D(w), \qquad E_D(w) = \frac{1}{2}\sum_{n=1}^{N}\left(t_n - w^T\phi(x_n)\right)^2.

Setting the gradient of the log-likelihood with respect to w (written here as a row vector) equal to zero,

\nabla_w\ln p(t \mid w, \beta) = \beta\sum_{n=1}^{N}\left(t_n - w^T\phi(x_n)\right)\phi(x_n)^T = 0 \;\;\Rightarrow\;\; \sum_{n=1}^{N} t_n\phi(x_n)^T = w^T\sum_{n=1}^{N}\phi(x_n)\phi(x_n)^T.

This equation can be solved for w.


Maximum Likelihood and Least Squares

We obtain the normal equations and the ordinary least squares solution,

w_{ML} = \left(\Phi^T\Phi\right)^{-1}\Phi^T t \equiv \Phi^{\dagger} t, \qquad \Phi^{\dagger} := \left(\Phi^T\Phi\right)^{-1}\Phi^T \;\; \text{(Moore-Penrose pseudo-inverse)},

where we have defined the N \times M design matrix

\Phi = \begin{pmatrix} \phi_0(x_1) & \phi_1(x_1) & \cdots & \phi_{M-1}(x_1) \\ \phi_0(x_2) & \phi_1(x_2) & \cdots & \phi_{M-1}(x_2) \\ \vdots & \vdots & & \vdots \\ \phi_0(x_N) & \phi_1(x_N) & \cdots & \phi_{M-1}(x_N) \end{pmatrix} = \begin{pmatrix} \phi(x_1)^T \\ \phi(x_2)^T \\ \vdots \\ \phi(x_N)^T \end{pmatrix}, \qquad \phi(x_i) = \left(\phi_0(x_i), \phi_1(x_i), \ldots, \phi_{M-1}(x_i)\right)^T.

Note that indeed the stationarity condition of the previous slide,

\sum_{n=1}^{N} t_n\phi(x_n)^T = w^T\sum_{n=1}^{N}\phi(x_n)\phi(x_n)^T,

is the same as \Phi^T t = \left(\Phi^T\Phi\right)w, since \Phi^T\Phi = \sum_n\phi(x_n)\phi(x_n)^T and \Phi^T t = \sum_n t_n\phi(x_n).
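As an illustration, a minimal MATLAB sketch of the MLE/least-squares fit follows; the synthetic data and the Gaussian design matrix (with a bias column prepended) are assumptions for the example, and in practice the backslash solve Phi \ t is preferred over forming the pseudo-inverse explicitly.

% Sketch: ordinary least squares / MLE for a linear-in-the-parameters model.
N  = 50;  x = linspace(-1, 1, N)';
t  = sin(2*pi*x) + 0.2*randn(N, 1);                 % assumed synthetic targets
mu = -1:0.2:1;  s = 0.2;
Phi = [ones(N,1), exp(-(x - mu).^2/(2*s^2))];       % design matrix with bias column phi_0 = 1
w_ml    = (Phi'*Phi) \ (Phi'*t);                    % normal equations
w_ml_qr = Phi \ t;                                  % equivalent, numerically safer (QR-based)
beta_ml = N / sum((t - Phi*w_ml).^2);               % MLE of the noise precision (see next slide)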


Maximum Likelihood and Least Squares

Maximizing the log-likelihood

\ln p(t \mid w, \beta) = \sum_{n=1}^{N}\ln\mathcal{N}\left(t_n \mid w^T\phi(x_n), \beta^{-1}\right) = \frac{N}{2}\ln\beta - \frac{N}{2}\ln 2\pi - \beta E_D(w)

now with respect to \beta gives

\frac{1}{\beta_{ML}} = \frac{2}{N}E_D(w_{ML}) = \frac{1}{N}\sum_{n=1}^{N}\left(t_n - w_{ML}^T\phi(x_n)\right)^2.

So the MLE noise variance 1/\beta_{ML} is equal to the residual variance of the target values around the regression function.


Computing the Bias Parameter

If we make the bias parameter w_0 explicit, the error function becomes

E_D(w) = \frac{1}{2}\sum_{n=1}^{N}\left(t_n - w_0 - \sum_{j=1}^{M-1} w_j\phi_j(x_n)\right)^2.

Setting the derivative with respect to w_0 equal to zero and solving for w_0, we obtain

w_0 = \bar{t} - \sum_{j=1}^{M-1} w_j\bar{\phi}_j, \qquad \bar{t} = \frac{1}{N}\sum_{n=1}^{N} t_n, \qquad \bar{\phi}_j = \frac{1}{N}\sum_{n=1}^{N}\phi_j(x_n).

The bias parameter w_0 compensates for the difference between the average of the target values and the weighted sum of the averages of the basis function values.


Geometry of Least Squares

We look for a geometrical interpretation of the least-squares solution in an N-dimensional space: t is a vector in that space with components t_1, \ldots, t_N (N > M).

The least-squares regression function is obtained by finding the orthogonal projection of the data vector t onto the subspace spanned by the basis functions \phi_j(x). Here \varphi_j denotes the jth column of \Phi,

\varphi_j = \left(\phi_j(x_1), \ldots, \phi_j(x_N)\right)^T, \qquad \Phi = \left(\varphi_0, \varphi_1, \ldots, \varphi_{M-1}\right) = \begin{pmatrix} \phi_0(x_1) & \cdots & \phi_{M-1}(x_1) \\ \vdots & & \vdots \\ \phi_0(x_N) & \cdots & \phi_{M-1}(x_N) \end{pmatrix},

and the vector of fitted values is y = \Phi w.


Geometry of Least Squares

We are looking for w such that the projection error t - y = t - \Phi w is orthogonal to the basis vectors \varphi_0, \varphi_1, \ldots, \varphi_{M-1}, i.e. such that

\varphi_j^T\left(t - \Phi w\right) = 0 \;\; \forall j \quad\Longleftrightarrow\quad \Phi^T\left(t - \Phi w\right) = 0.

These are the normal equations we derived earlier. Geometrically, y = \Phi w_{ML} is the orthogonal projection of t onto S, the M-dimensional subspace spanned by the columns \varphi_j.


Convexity of the NLL

Convexity of the NLL (positive-definite Hessian) leads to a unique, globally optimal MLE.

Some models of interest do not have concave likelihoods, and only locally optimal MLE estimates can be found.

[Figure: examples of a convex function, a concave function, and a function that is neither (with two local minima A and B); convex and non-convex regions are indicated. Run convexFnHand from PMTK3.]


Sequential Learning: LMS Algorithm

If the data set is large, we use sequential (on-line) algorithms, applying the technique of stochastic (sequential) gradient descent.

If the error function comprises a sum over data points, E = \sum_n E_n, e.g.

E_D(w) = \frac{1}{2}\sum_{n=1}^{N}\left(t_n - w^T\phi(x_n)\right)^2,

then after presentation of pattern n the stochastic gradient descent algorithm updates the parameter vector w using

w^{(\tau+1)} = w^{(\tau)} - \eta\,\nabla E_n = w^{(\tau)} + \eta\left(t_n - w^{(\tau)T}\phi(x_n)\right)\phi(x_n),

where \tau is the iteration number and \eta the learning-rate parameter.

This is known as least-mean-squares or the LMS algorithm (a code sketch follows below).
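A minimal sketch of the LMS loop follows, reusing Phi and t from the least-squares sketch above; the learning rate and the number of passes are illustrative choices, not values from the slides.

% Sketch: LMS / stochastic gradient descent for linear basis-function regression.
[N, M] = size(Phi);
w   = zeros(M, 1);                       % initial weights
eta = 0.05;                              % learning rate (illustrative)
for pass = 1:20                          % a few passes over the data
    for n = randperm(N)                  % present the patterns in random order
        phi_n = Phi(n, :)';              % M x 1 feature vector of pattern n
        err   = t(n) - w' * phi_n;       % prediction error on pattern n
        w     = w + eta * err * phi_n;   % LMS update
    end
end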


Robust Linear Regression

Using a Gaussian distribution for the noise,

t = y(x, w) + \epsilon, \qquad \epsilon \sim \mathcal{N}\left(\epsilon \mid 0, \beta^{-1}\right),

can result in a poor fit, especially if there are outliers in the data. The squared error penalizes deviations quadratically, so points far from the line have more effect on the fit than points near the line.

To achieve robustness to outliers, one can replace the Gaussian with a distribution that has heavy tails (e.g. the Laplace distribution). Such a distribution assigns higher likelihood to outliers, without having to perturb the regression line to "explain" them:

p(t \mid x, w, b) = \mathrm{Lap}\left(t \mid y(x, w), b\right) \propto \exp\left(-\frac{1}{b}\left|t - y(x, w)\right|\right), \qquad NLL(w) = \sum_i\left|r_i(w)\right|, \quad r_i(w) \triangleq t_i - y(x_i, w).

An alternative is the Huber loss (next slides),

L_H(r, \delta) = \begin{cases} r^2/2 & \text{if } |r| \le \delta \\ \delta|r| - \delta^2/2 & \text{if } |r| > \delta. \end{cases}

[Figures: fits to data with outliers using least squares, the Laplace likelihood and a Student-t likelihood (dof = 0.630); and least squares vs. the Huber loss with \delta = 1.0 and \delta = 5.0. Run linregRobustDemoCombined from PMTK3.]


Robust Linear Regression

Using the Laplace likelihood leads to an L1 error norm (a non-differentiable objective) that is more difficult to optimize.

A solution is to transform the problem (by increasing its dimension to 2N + M) into a linear program. Splitting each residual as r_i \triangleq r_i^+ - r_i^-,

\min_{w,\, r^+,\, r^-}\;\sum_i\left(r_i^+ + r_i^-\right) \quad \text{s.t.} \quad r_i^+ \ge 0, \;\; r_i^- \ge 0, \;\; w^T x_i + r_i^+ - r_i^- = t_i.

Note that with this definition,

r_i^+ = \frac{|r_i| + r_i}{2} = \begin{cases} r_i & \text{if } r_i \ge 0 \\ 0 & \text{otherwise} \end{cases}, \qquad r_i^- = \frac{|r_i| - r_i}{2} = \begin{cases} 0 & \text{if } r_i \ge 0 \\ -r_i & \text{otherwise}, \end{cases}

so that r_i^+ + r_i^- = |r_i| and r_i^+ - r_i^- = r_i.

Boyd, S. and L. Vandenberghe (2004). Convex Optimization. Cambridge University Press.


Huber Loss Function

The Huber loss is defined as

L_H(r, \delta) = \begin{cases} r^2/2 & \text{if } |r| \le \delta \\ \delta|r| - \delta^2/2 & \text{if } |r| > \delta \end{cases}, \qquad \frac{dL_H}{dr} = \begin{cases} r & \text{if } |r| \le \delta \\ \delta\,\mathrm{sign}(r) & \text{if } |r| > \delta. \end{cases}

This is equivalent to L2 for errors that are smaller than \delta, and equivalent to L1 for larger errors.

This loss function is everywhere differentiable, using the fact that d|r|/dr = \mathrm{sign}(r) for r \ne 0. The function is also C^1 continuous, since the gradients of the two parts match at r = \pm\delta.

Optimizing the Huber loss is much faster than using the Laplace likelihood, since we can use standard optimization methods (quasi-Newton) rather than linear programming.

The Huber method also has a probabilistic interpretation, although it is rather unnatural (Pontil et al. 1998).

[Figure: the L2, L1 and Huber losses as functions of r. Run huberLossDemo from PMTK3.]

Pontil, M., S. Mukherjee, and F. Girosi (1998). On the Noise Model of Support Vector Machine Regression. Technical report, MIT AI Lab.
Huber, P. (1964). Robust estimation of a location parameter. Annals of Mathematical Statistics 35(1), 73-101.
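As a sketch of how the Huber loss can be used in practice, the following MATLAB fragment minimizes the Huber objective by plain gradient descent on the Phi and t of the earlier sketches; delta, the step size and the iteration count are illustrative, and a quasi-Newton routine would converge much faster.

% Sketch: Huber loss, its derivative, and a simple robust fit by gradient descent.
huber  = @(r, d) (abs(r) <= d).*(r.^2/2) + (abs(r) > d).*(d*abs(r) - d^2/2);
dhuber = @(r, d) (abs(r) <= d).*r        + (abs(r) > d).*(d*sign(r));
delta = 1.0;  w = zeros(size(Phi, 2), 1);  eta = 1e-3;
for it = 1:5000
    r = t - Phi*w;                  % residuals
    g = -Phi' * dhuber(r, delta);   % gradient of sum_n L_H(r_n, delta) with respect to w
    w = w - eta * g;                % plain gradient step
end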


Regularized LS - Ridge Regression

Consider an error function of the form

E_D(w) + \lambda E_W(w) \qquad \text{(data term + regularization term)}.

With the sum-of-squares error function and a quadratic regularizer, we get

\frac{1}{2}\sum_{n=1}^{N}\left(t_n - w^T\phi(x_n)\right)^2 + \frac{\lambda}{2}w^T w.

Setting the gradient with respect to w to zero and solving for w as before, we obtain

w = \left(\lambda I + \Phi^T\Phi\right)^{-1}\Phi^T t.

This is a simple extension of the least-squares solution we encountered earlier (regularized least squares, or ridge regression).
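A one-line MATLAB sketch of the ridge solution, reusing Phi and t from the earlier sketches; lambda is an illustrative value, and in practice the bias column is often left unpenalized.

% Sketch: regularized least squares (ridge regression).
lambda  = 0.1;                                    % regularization coefficient (illustrative)
M       = size(Phi, 2);
w_ridge = (lambda*eye(M) + Phi'*Phi) \ (Phi'*t);  % shrinks the weights toward zero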


Regularized Least Squares

Regularized solution: w = \left(\lambda I + \Phi^T\Phi\right)^{-1}\Phi^T t.

Regularization limits the effective model complexity (the appropriate number of basis functions). The problem of choosing the number of basis functions is replaced with the problem of finding a suitable value of the regularization coefficient \lambda.

\lambda controls how strongly the weights are shrunk, i.e. how many of the w_j (and hence basis functions) are effectively non-zero.


Regularized Least Squares

With a more general regularizer, we have

\frac{1}{2}\sum_{n=1}^{N}\left(t_n - w^T\phi(x_n)\right)^2 + \frac{\lambda}{2}\sum_{j=1}^{M}|w_j|^q.

q = 1 is known as the lasso regularizer.

[Figure: contour plots of the regularizer term \frac{\lambda}{2}\sum_j|w_j|^q alone, for q = 0.5, 1, 2 and 4, with \lambda = 0.7334. MatLab code.]


Regularized Least Squares

With the same general regularizer,

\frac{1}{2}\sum_{n=1}^{N}\left(t_n - w^T\phi(x_n)\right)^2 + \frac{\lambda}{2}\sum_{j=1}^{M}|w_j|^q,

q = 2 corresponds to the quadratic (ridge) regularizer. MatLab code.


Regularized Least Squares

Lasso tends to generate sparser solutions than a quadratic regularizer: if \lambda is large, some of the w_j \to 0 (in the figure, w_1 = 0).

Here we use the fact that the regularized least-squares solution is equivalent to minimizing the unregularized sum of squares subject to the constraint

\sum_{j=1}^{M}|w_j|^q \le \eta

for some \eta (see the proof next).

[Figure: contours of the unregularized error function together with the constraint region for q = 2 (ridge) and q = 1 (lasso); the corner of the lasso region gives w_1 = 0.]


Regularized Least Squares

Let us write the constraint in the equivalent form

\frac{1}{2}\left(\sum_{j=1}^{M}|w_j|^q - \eta\right) \le 0.

This leads to the Lagrangian function

L(w, \lambda) = \frac{1}{2}\sum_{n=1}^{N}\left(t_n - w^T\phi(x_n)\right)^2 + \frac{\lambda}{2}\left(\sum_{j=1}^{M}|w_j|^q - \eta\right),

which is identical to the regularized least-squares (RLS) objective

\frac{1}{2}\sum_{n=1}^{N}\left(t_n - w^T\phi(x_n)\right)^2 + \frac{\lambda}{2}\sum_{j=1}^{M}|w_j|^q \qquad (*)

in its dependence on w.

For a particular \lambda > 0, let w^\star(\lambda) be the solution of the RLS problem (*). From the Kuhn-Tucker optimality conditions for L(w, \lambda) we then see that the two problems coincide when

\eta = \sum_{j=1}^{M}\left|w_j^\star(\lambda)\right|^q.


Kuhn-Tucker Optimality Conditions

Consider the following constrained minimization problem:

\min_x f(x) \quad \text{subject to} \quad g(x) \ge 0.

This is equivalent to minimizing, with respect to x and \lambda, the Lagrangian

L(x, \lambda) = f(x) - \lambda\, g(x),

subject to the following (Kuhn-Tucker) conditions:

\lambda \ge 0, \qquad g(x) \ge 0, \qquad \lambda\, g(x) = 0.

Note that for maximization problems the Lagrangian should be modified as L(x, \lambda) = f(x) + \lambda\, g(x).


Multiple Outputs - Isotropic Covariance

If we want to predict K > 1 target variables, we can use the same basis for all components of the target vector:

p(t \mid x, W, \beta) = \mathcal{N}\left(t \mid y(x, W), \beta^{-1}I\right) = \mathcal{N}\left(t \mid W^T\phi(x), \beta^{-1}I\right),

where W is an M \times K matrix of parameters and t is K-dimensional.

Given observed inputs X = \{x_1, \ldots, x_N\} and targets T = (t_1, \ldots, t_N)^T, we obtain the log-likelihood function

\ln p(T \mid X, W, \beta) = \sum_{n=1}^{N}\ln\mathcal{N}\left(t_n \mid W^T\phi(x_n), \beta^{-1}I\right) = \frac{NK}{2}\ln\frac{\beta}{2\pi} - \frac{\beta}{2}\sum_{n=1}^{N}\left\|t_n - W^T\phi(x_n)\right\|^2.


K-Independent Regression Problems

As before, we can maximize this function with respect to W, giving

W_{ML} = \left(\Phi^T\Phi\right)^{-1}\Phi^T T.

If we examine this result for each target variable t_k (take the kth column of W and of T), we have

w_k = \left(\Phi^T\Phi\right)^{-1}\Phi^T t_k = \Phi^{\dagger} t_k,

which is identical to the single-output case, so there is decoupling between the target variables.

As expected, we obtain K independent regression problems.
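A minimal sketch of the multi-output MLE; the design matrix Phi is reused from the earlier sketches and the two target columns are an assumed synthetic example.

% Sketch: MLE for K > 1 outputs that share the same basis.
K = 2;
T = [sin(2*pi*x), cos(2*pi*x)] + 0.1*randn(numel(x), K);  % N x K target matrix (assumed)
W_ml = (Phi'*Phi) \ (Phi'*T);    % M x K: each column solves its own least-squares problem
w1   = Phi \ T(:, 1);            % identical to W_ml(:, 1), illustrating the decoupling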


Multiple Outputs - Full Covariance

Let us repeat the earlier formulation, now with a full covariance matrix \Sigma. If we want to predict K > 1 target variables using the same basis for all components of the target vector,

p(t \mid x, W, \Sigma) = \mathcal{N}\left(t \mid y(x, W), \Sigma\right) = \mathcal{N}\left(t \mid W^T\phi(x), \Sigma\right),

where W is an M \times K matrix of parameters.

Given observed inputs X = \{x_1, \ldots, x_N\} and targets T = (t_1, \ldots, t_N)^T, we obtain the log-likelihood function

\ln p(T \mid X, W, \Sigma) = \sum_{n=1}^{N}\ln\mathcal{N}\left(t_n \mid W^T\phi(x_n), \Sigma\right) = -\frac{N}{2}\ln|\Sigma| - \frac{1}{2}\sum_{n=1}^{N}\left(t_n - W^T\phi(x_n)\right)^T\Sigma^{-1}\left(t_n - W^T\phi(x_n)\right) + \text{const}.


Multiple Outputs - Full Covariance

As before, we maximize this function with respect to W:

\nabla_W\ln p(T \mid X, W, \Sigma) = 0 \;\;\Rightarrow\;\; W_{ML} = \left(\Phi^T\Phi\right)^{-1}\Phi^T T.

For the ML estimate of \Sigma, use the result for the MLE of the covariance of a multivariate Gaussian:

\Sigma_{ML} = \frac{1}{N}\sum_{n=1}^{N}\left(t_n - W_{ML}^T\phi(x_n)\right)\left(t_n - W_{ML}^T\phi(x_n)\right)^T.

Note that each column of W_{ML} is of the form

w_k = \left(\Phi^T\Phi\right)^{-1}\Phi^T t_k,

as seen for the isotropic noise distribution, and is independent of \Sigma.


Bayesian Linear Regression

Effective model complexity in MLE is governed by the number of basis functions and is controlled by the size of the data set.

With regularization, the effective model complexity is controlled mainly by \lambda, and still by the number and form of the basis functions.

Thus the model complexity for a particular problem cannot be decided simply by maximizing the likelihood function, as this leads to excessively complex models and over-fitting.

Independent hold-out data can be used to determine model complexity, but this wastes data and is computationally expensive.


Bayesian Linear Regression

A Bayesian treatment of linear regression avoids the over-fitting of maximum likelihood.

Bayesian approaches lead to automatic methods of determining model complexity using the training data alone.


Bayesian Linear Regression

Assume additive Gaussian noise with known precision \beta. The likelihood function p(t \mid w) is the exponential of a quadratic function of w,

p(t \mid X, w, \beta) = \prod_{n=1}^{N}\mathcal{N}\left(t_n \mid w^T\phi(x_n), \beta^{-1}\right) = \left(\frac{\beta}{2\pi}\right)^{N/2}\exp\left(-\frac{\beta}{2}\sum_{n=1}^{N}\left(t_n - w^T\phi(x_n)\right)^2\right),

and its conjugate prior is Gaussian:

p(w) = \mathcal{N}\left(w \mid m_0, S_0\right).

Combining this with the likelihood, and using the results for marginal and conditional Gaussian distributions, gives the posterior

p(w \mid t) = \mathcal{N}\left(w \mid m_N, S_N\right), \qquad m_N = S_N\left(S_0^{-1}m_0 + \beta\Phi^T t\right), \qquad S_N^{-1} = S_0^{-1} + \beta\Phi^T\Phi.
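A minimal MATLAB sketch of this posterior update, reusing Phi and t from the earlier sketches; the zero-mean isotropic prior m_0 = 0, S_0 = \alpha^{-1}I used later in the lecture is assumed, with illustrative \alpha and \beta.

% Sketch: Gaussian posterior over the weights for known noise precision beta.
alpha = 2.0;  beta = 25;                        % illustrative hyperparameters
M     = size(Phi, 2);
S0    = (1/alpha)*eye(M);  m0 = zeros(M, 1);    % prior N(w | m0, S0)
SN    = inv(inv(S0) + beta*(Phi'*Phi));         % posterior covariance
mN    = SN * (S0\m0 + beta*(Phi'*t));           % posterior mean (also the MAP estimate)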


Posterior Distribution: Derivation

We now have the product of two Gaussians, and the posterior is easily computed by completing the square in w. The prior and likelihood are

p(w \mid m_0, S_0) \propto \exp\left(-\frac{1}{2}\left(w - m_0\right)^T S_0^{-1}\left(w - m_0\right)\right), \qquad p(t \mid \Phi, w, \beta) \propto \exp\left(-\frac{\beta}{2}\sum_{n=1}^{N}\left(t_n - \phi(x_n)^T w\right)^2\right),

so that

p(w \mid \Phi, t, \beta) \propto \exp\left(-\frac{1}{2}\left(w - m_0\right)^T S_0^{-1}\left(w - m_0\right) - \frac{\beta}{2}\sum_{n=1}^{N}\left(w^T\phi(x_n)\phi(x_n)^T w - 2 t_n\phi(x_n)^T w\right)\right)
\propto \exp\left(-\frac{1}{2}w^T\left(S_0^{-1} + \beta\sum_{n}\phi(x_n)\phi(x_n)^T\right)w + w^T\left(S_0^{-1}m_0 + \beta\sum_{n} t_n\phi(x_n)\right)\right).

Completing the square in w,

p(w \mid \Phi, t, \beta) = \mathcal{N}\left(w \mid m_N, S_N\right), \qquad m_N = S_N\left(S_0^{-1}m_0 + \beta\Phi^T t\right), \qquad S_N^{-1} = S_0^{-1} + \beta\sum_{n=1}^{N}\phi(x_n)\phi(x_n)^T = S_0^{-1} + \beta\Phi^T\Phi.


Sequential Posterior Calculation

Note that because the posterior distribution is Gaussian, its mode coincides with its mean, so w_{MAP} = m_N, where

p(w \mid t) = \mathcal{N}\left(w \mid m_N, S_N\right), \qquad m_N = S_N\left(S_0^{-1}m_0 + \beta\Phi^T t\right), \qquad S_N^{-1} = S_0^{-1} + \beta\Phi^T\Phi.

The above expressions for the posterior mean and covariance can also be written for a sequential calculation: having already observed N data points, we now consider an additional data point (x_{N+1}, t_{N+1}). The current posterior plays the role of the prior, and

p(w \mid t_{N+1}, x_{N+1}, m_N, S_N) = \mathcal{N}\left(w \mid m_{N+1}, S_{N+1}\right),
m_{N+1} = S_{N+1}\left(S_N^{-1}m_N + \beta\, t_{N+1}\phi(x_{N+1})\right), \qquad S_{N+1}^{-1} = S_N^{-1} + \beta\,\phi(x_{N+1})\phi(x_{N+1})^T.

(A code sketch of this sequential update follows below.)
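A minimal sketch of the sequential update; the scalar feature map, the hyperparameters and the synthetic data stream (the straight-line example used later in this lecture) are assumptions of the sketch.

% Sketch: one-point-at-a-time update of the Gaussian posterior.
phi   = @(x) [1; x];                       % assumed feature map: bias + linear term
alpha = 2.0;  beta = 25;  M = 2;
Sinv  = alpha*eye(M);  m = zeros(M, 1);    % running S_N^{-1} and m_N (prior: alpha*I, 0)
xs = 2*rand(20,1) - 1;  ts = -0.3 + 0.5*xs + 0.2*randn(20,1);   % synthetic data stream
for n = 1:numel(xs)
    p        = phi(xs(n));
    Sinv_old = Sinv;
    Sinv     = Sinv_old + beta*(p*p');              % S_{N+1}^{-1} = S_N^{-1} + beta*phi*phi'
    m        = Sinv \ (Sinv_old*m + beta*ts(n)*p);  % m_{N+1} = S_{N+1}(S_N^{-1} m_N + beta t phi)
end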


Bayesian Linear Regression

Let us consider as prior a zero-mean isotropic Gaussian governed by a single precision parameter \alpha,

p(w) = \mathcal{N}\left(w \mid 0, \alpha^{-1}I\right),

and the corresponding posterior distribution over w is then given by

p(w \mid t) = \mathcal{N}\left(w \mid m_N, S_N\right), \qquad m_N = \beta S_N\Phi^T t, \qquad S_N^{-1} = \alpha I + \beta\Phi^T\Phi.

The log of the posterior is the sum of the log-likelihood and the log of the prior and, as a function of w, takes the form

\ln p(w \mid t) = -\frac{\beta}{2}\sum_{n=1}^{N}\left(t_n - w^T\phi(x_n)\right)^2 - \frac{\alpha}{2}w^T w + \text{const}.

Thus the MAP estimate is the same as regularized least squares (ridge regression) with \lambda = \alpha/\beta.


A Note on Data Centering

In linear regression it helps to center the data in a way that does not require us to compute the offset term \mu. Write the likelihood as

p(t \mid \Phi, w, \mu, \beta) \propto \exp\left(-\frac{\beta}{2}\left(t - \mu\mathbf{1} - \Phi w\right)^T\left(t - \mu\mathbf{1} - \Phi w\right)\right),

where \Phi is the design matrix built from the basis functions \phi_1, \ldots, \phi_M (no constant column) and \mathbf{1} is the vector of ones.

Let us assume that the input data are centered in each dimension, such that

\sum_{i=1}^{N}\phi_j(x_i) = 0, \qquad j = 1, \ldots, M.

The mean of the output is equally likely to be positive or negative. Let us put an improper prior p(\mu) \propto 1 on the offset and integrate \mu out.


A Note on Data Centering

Introducing the output mean \bar{t} = \frac{1}{N}\sum_{i=1}^{N} t_i, the marginal likelihood becomes

p(t \mid \Phi, w, \beta) \propto \int \exp\left(-\frac{\beta}{2}\left(t - \mu\mathbf{1} - \Phi w\right)^T\left(t - \mu\mathbf{1} - \Phi w\right)\right) d\mu.

Completing the square in \mu, and using the centering of the input (so that \mathbf{1}^T\Phi w = 0 and \mathbf{1}^T\left(t - \Phi w\right) = N\bar{t}),

\left(t - \mu\mathbf{1} - \Phi w\right)^T\left(t - \mu\mathbf{1} - \Phi w\right) = N\left(\mu - \bar{t}\right)^2 + \left\|t - \Phi w\right\|^2 - N\bar{t}^{\,2},

and the Gaussian integral over \mu contributes only a constant, so that

p(t \mid \Phi, w, \beta) \propto \exp\left(-\frac{\beta}{2}\left(\left\|t - \Phi w\right\|^2 - N\bar{t}^{\,2}\right)\right) = \exp\left(-\frac{\beta}{2}\left(t_c - \Phi w\right)^T\left(t_c - \Phi w\right)\right), \qquad t_c \triangleq t - \bar{t}\mathbf{1}.

Our model is thus simplified if instead of t we use the centered output t_c, for which the likelihood is simply

p(t_c \mid \Phi, w, \beta) \propto \exp\left(-\frac{\beta}{2}\left(t_c - \Phi w\right)^T\left(t_c - \Phi w\right)\right).

Recall that the MLE estimate for \mu is \hat{\mu} = \bar{t} - \sum_{j=1}^{M}\bar{\phi}_j w_j, where \bar{\phi}_1, \ldots, \bar{\phi}_M are formed by averaging each column of \Phi (with centered inputs these averages vanish and \hat{\mu} = \bar{t}).


A Note on Data Centering

To simplify the earlier notation, consider a linear regression model of the form

y(x \mid w) = w_0 + w^T x.

In the context of MLE, for example, we need to minimize

\min_{w_0, w}\;\sum_{i=1}^{N}\left(t_i - w_0 - w^T x_i\right)^2.

Minimization with respect to w_0 gives

\sum_{i=1}^{N}\left(t_i - w_0 - w^T x_i\right) = 0 \;\;\Rightarrow\;\; N\bar{t} - N w_0 - N\bar{x}^T w = 0,

where

\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i, \qquad \bar{t} = \frac{1}{N}\sum_{i=1}^{N} t_i.

Thus

\hat{w}_0 = \bar{t} - \bar{x}^T w.


A Note on Data Centering

Substituting the bias term \hat{w}_0 = \bar{t} - \bar{x}^T w back into the objective function gives

\min_w\;\sum_{i=1}^{N}\left(t_i - \bar{t} - w^T\left(x_i - \bar{x}\right)\right)^2,

i.e. a least-squares problem in the centered inputs X_c (with rows \left(x_i - \bar{x}\right)^T) and centered outputs t_c = t - \bar{t}\mathbf{1}.

Minimization with respect to w gives

\sum_{i=1}^{N}\left(x_i - \bar{x}\right)\left(x_i - \bar{x}\right)^T\hat{w} = \sum_{i=1}^{N}\left(x_i - \bar{x}\right)\left(t_i - \bar{t}\right).

We thus first compute the MLE of w using the centered input and output,

\hat{w} = \left(X_c^T X_c\right)^{-1}X_c^T t_c = \left(\sum_{i=1}^{N}\left(x_i - \bar{x}\right)\left(x_i - \bar{x}\right)^T\right)^{-1}\sum_{i=1}^{N}\left(x_i - \bar{x}\right)\left(t_i - \bar{t}\right),

and then recover the MLE of the bias as

\hat{w}_0 = \bar{t} - \bar{x}^T\hat{w}.
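A minimal sketch of the centered MLE computation; the synthetic X and t below are assumptions of the example (implicit expansion, R2016b+, is used for the centering).

% Sketch: MLE of the weights and the bias via data centering.
N = 100;  D = 3;
X = randn(N, D);  t = X*[1; -2; 0.5] + 4 + 0.1*randn(N, 1);   % assumed synthetic data
xbar = mean(X, 1);   tbar = mean(t);
Xc   = X - xbar;     tc   = t - tbar;          % centered inputs and outputs
w_hat  = (Xc'*Xc) \ (Xc'*tc);                  % slope weights from the centered data
w0_hat = tbar - xbar*w_hat;                    % recovered bias term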


Bayesian Regression: Example

We generate synthetic data from the function f(x, a) = a_0 + a_1 x with parameter values a_0 = -0.3 and a_1 = 0.5, by first choosing values of x_n from the uniform distribution U(x | -1, 1), then evaluating f(x_n, a), and finally adding Gaussian noise with standard deviation 0.2 to obtain the target values t_n.

We assume \beta = (1/0.2)^2 = 25 and \alpha = 2.0.

We perform Bayesian inference sequentially, one point at a time, so the posterior at each step becomes the new prior.

We show results after 1, 2 and 22 points have been collected. The results include the likelihood contours (for a single point), the posterior, and samples of the regression function from the posterior.


Bayesian Regression: Example

[Figure: prior over (w0, w1) before any data are observed, and plots of y(x, w) for samples of w drawn from the prior. MatLab Code.]


Example: One Data Point Collected

[Figure: likelihood contours for the first data point, contours of the resulting posterior over (w0, w1), and plots of y(x, w) for samples of w from the posterior. MatLab Code.]

Note that the regression lines pass close to the data point (shown with a circle).


Example: 2nd Data Point Collected

[Figure: likelihood contours for the second data point, contours of the updated posterior over (w0, w1), and plots of y(x, w) for samples of w from the posterior. MatLab Code.]

Note that the regression lines now pass close to both data points.


Example: 22 Data Points Collected

[Figure: likelihood contours for the latest data point, contours of the posterior over (w0, w1), and plots of y(x, w) for samples of w from the posterior, after 22 data points have been collected. MatLab Code.]

Note the regression lines after 22 data points have been collected.


Summary of Results

[Figure: rows showing, as data arrive (no data, one point, two points, many points), the likelihood of the latest point, the prior/posterior over (w0, w1), and the data space with sampled regression lines. MatLab Code.]


Summary of Results

[Figure: the same sequence of likelihood, prior/posterior in (W0, W1), and data-space panels, shown also after 20 data points. Run bayesLinRegDemo2d from PMTK3.]


Predictive Distribution

We are usually not interested in w itself but in making predictions of t for new values of x. This requires the predictive distribution

p(t \mid x, \mathbf{x}, \mathbf{t}, \alpha, \beta) = \int p(t \mid x, w, \beta)\, p(w \mid \mathbf{x}, \mathbf{t}, \alpha, \beta)\, dw = \mathcal{N}\left(t \mid m_N^T\phi(x),\, \sigma_N^2(x)\right),

where

\sigma_N^2(x) = \frac{1}{\beta} + \phi(x)^T S_N\,\phi(x), \qquad S_N^{-1} = \alpha I + \beta\Phi^T\Phi.

The first term represents the noise on the data, whereas the second term reflects the uncertainty associated with w. Because the noise process and the distribution of w are independent Gaussians, their variances are additive.

The error bars get larger as we move away from the training points. By contrast, in the plug-in approximation the error bars are of constant size.

As additional data points are observed, the posterior distribution becomes narrower.
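A minimal sketch of the pointwise predictive mean and variance on a test grid, reusing mN, SN and beta from the posterior sketch above together with the same Gaussian basis (an assumption of these sketches).

% Sketch: predictive mean and variance sigma_N^2(x) = 1/beta + phi(x)' * SN * phi(x).
xs_t  = linspace(-1, 1, 200)';
PhiT  = [ones(numel(xs_t),1), exp(-(xs_t - mu).^2/(2*s^2))];   % test design matrix
m_pred = PhiT * mN;                                 % predictive mean at each test point
v_pred = 1/beta + sum((PhiT*SN).*PhiT, 2);          % rowwise phi' * SN * phi plus noise term
plot(xs_t, m_pred, 'b', xs_t, m_pred + 2*sqrt(v_pred), 'r--', xs_t, m_pred - 2*sqrt(v_pred), 'r--');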


Predictive Distribution

In a full Bayesian treatment we want to compute the predictive distribution, i.e. given the training data \mathbf{x}, \mathbf{t} and a new test point x, we want

p(t \mid x, \mathbf{x}, \mathbf{t}) = \int p(t \mid x, w)\, p(w \mid \mathbf{x}, \mathbf{t})\, dw,

where

p(t \mid x, w, \beta) = \mathcal{N}\left(t \mid y(x, w), \beta^{-1}\right), \qquad p(w \mid \mathbf{x}, \mathbf{t}, \alpha, \beta) = \mathcal{N}\left(w \mid m_N, S_N\right),
m_N = \beta S_N\sum_{n=1}^{N}\phi(x_n)\,t_n = \beta S_N\Phi^T\mathbf{t}, \qquad S_N^{-1} = \alpha I + \beta\sum_{n=1}^{N}\phi(x_n)\phi(x_n)^T.

To compute the marginal, we use the standard results for Gaussian linear systems from an earlier lecture.


Appendix: Useful Result

For the linear Gaussian model below, we proved in an earlier lecture the following very useful results about the marginal and conditional distributions:

p(x) = \mathcal{N}\left(x \mid \mu, \Lambda^{-1}\right), \qquad p(y \mid x) = \mathcal{N}\left(y \mid Ax + b, L^{-1}\right)

\Rightarrow\quad p(y) = \mathcal{N}\left(y \mid A\mu + b,\; L^{-1} + A\Lambda^{-1}A^T\right),

p(x \mid y) = \mathcal{N}\left(x \mid \Sigma\left(A^T L\left(y - b\right) + \Lambda\mu\right),\; \Sigma\right), \qquad \Sigma = \left(\Lambda + A^T L A\right)^{-1}.


Predictive Distribution

Thus for our problem we identify, in the result above,

x \to w, \quad \mu \to m_N = \beta S_N\sum_{n}\phi(x_n)\,t_n, \quad \Lambda^{-1} \to S_N, \quad y \to t, \quad A \to \phi(x)^T, \quad b \to 0, \quad L^{-1} \to \beta^{-1},

i.e. p(w \mid \mathbf{x}, \mathbf{t}, \alpha, \beta) = \mathcal{N}\left(w \mid m_N, S_N\right) and p(t \mid x, w, \beta) = \mathcal{N}\left(t \mid y(x, w), \beta^{-1}\right).

The predictive distribution then takes the form

p(t \mid x, \mathbf{x}, \mathbf{t}) = \mathcal{N}\left(t \mid \phi(x)^T m_N,\; \beta^{-1} + \phi(x)^T S_N\,\phi(x)\right).


Predictive Distribution

In the full Bayesian treatment, the predictive distribution

p(t \mid x, \mathbf{x}, \mathbf{t}) = \int p(t \mid x, w)\, p(w \mid \mathbf{x}, \mathbf{t})\, dw = \mathcal{N}\left(t \mid m(x), \sigma_N^2(x)\right)

has mean and variance given by

m(x) = \beta\,\phi(x)^T S_N\sum_{n=1}^{N}\phi(x_n)\,t_n = m_N^T\phi(x),

\sigma_N^2(x) = \beta^{-1} + \phi(x)^T S_N\,\phi(x) \quad \text{(uncertainty in the data + uncertainty in } w\text{)}, \qquad S_N^{-1} = \alpha I + \beta\Phi^T\Phi.

Note that \sigma_{N+1}^2(x) \le \sigma_N^2(x): the predictive variance can only shrink as additional data are observed (shown on the next slide).


Predictive Distribution

It is easy to show that \sigma_{N+1}^2(x) \le \sigma_N^2(x). Note that

S_{N+1}^{-1} = S_N^{-1} + \beta\,\phi(x_{N+1})\phi(x_{N+1})^T,

and recall the matrix-inversion identity

\left(M + v v^T\right)^{-1} = M^{-1} - \frac{M^{-1}v\,v^T M^{-1}}{1 + v^T M^{-1} v}.

Using these results (with M = S_N^{-1} and v = \sqrt{\beta}\,\phi(x_{N+1})), we can write

\sigma_{N+1}^2(x) = \frac{1}{\beta} + \phi(x)^T S_{N+1}\,\phi(x) = \frac{1}{\beta} + \phi(x)^T S_N\,\phi(x) - \frac{\beta\left(\phi(x)^T S_N\,\phi(x_{N+1})\right)^2}{1 + \beta\,\phi(x_{N+1})^T S_N\,\phi(x_{N+1})} \;\le\; \sigma_N^2(x).


Predictive Distribution: Summary

The notation used here is as follows:

p(t \mid x, \mathbf{x}, \mathbf{t}) = \mathcal{N}\left(t \mid m(x), \sigma_N^2(x)\right),
m(x) = \beta\,\phi(x)^T S_N\sum_{n=1}^{N}\phi(x_n)\,t_n, \qquad \sigma_N^2(x) = \beta^{-1} + \phi(x)^T S_N\,\phi(x), \qquad S_N^{-1} = \alpha I + \beta\sum_{n=1}^{N}\phi(x_n)\phi(x_n)^T,

where \phi(x_n) = \left(\phi_0(x_n), \phi_1(x_n), \ldots, \phi_{M-1}(x_n)\right)^T and I is the M \times M unit matrix. For polynomial regression, for example, \phi(x_n) = \left(1, x_n, x_n^2, \ldots\right)^T.

Note: the predictive mean and variance are functions of x.


Pointwise Uncertainty in the Predictions

[Figure: predictive distribution with M = 9 Gaussian basis functions (10 parameters) fit to random data from the generating function sin(2\pi x), for N = 1 and N = 2 data points; each panel shows the data, the predictive mean and the predictive standard deviation. MatLab code.]

The scale of the Gaussians is adjusted to the data; \alpha = 5\times 10^{-3} and \beta = 11.1, using N = 1, 2, 4 and 10 data points.

The predictive uncertainty is smaller near the data, and the level of uncertainty decreases with N.


Pointwise Uncertainty in the Predictions

[Figure: the same predictive distribution (M = 9 Gaussian basis functions) for N = 4 and N = 10 data points; each panel shows the generating function sin(2\pi x), the random data points, the predictive mean and the predictive standard deviation. MatLab code.]


Summary of Results

[Figure: predictive distributions with M = 9 basis functions for N = 1, 2, 4 and 10 data points, each panel showing the generating function sin(2\pi x), the data, the predictive mean and the predictive standard deviation. MatLab code.]


Plugin Approximation

In the plug-in approximation, the posterior over w is replaced by a delta function at a point estimate \hat{w} (e.g. the MLE):

p(t \mid x, \mathbf{x}, \mathbf{t}) = \int p(t \mid x, w)\,\delta_{\hat{w}}(w)\, dw = p(t \mid x, \hat{w}).

[Figure: plug-in approximation (MLE) vs. the posterior predictive (known variance), each with the prediction and the training data; and functions sampled from the posterior vs. functions sampled from the plug-in approximation to the posterior. The plug-in error bars are of constant size. Run linregPostPredDemo from PMTK3.]


Covariance Between the Predictions

Draw samples from the posterior of w and then plot y(x, w). We use the same data and basis functions as in the earlier example.

We are visualizing the joint uncertainty in the posterior distribution between the y values at two or more x values.

[Figure: plots of y(x, w) where w is a sample from the posterior over w, for N = 1 and N = 2 data points. MatLab Code.]
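A minimal sketch of how such curves can be produced: draw samples of w from N(m_N, S_N) using a Cholesky factor and plot the corresponding regression functions; mN, SN and the test design matrix are reused from the earlier sketches.

% Sketch: regression functions y(x, w) for w sampled from the posterior.
nSamples = 10;
L = chol(SN + 1e-10*eye(size(SN,1)), 'lower');   % small jitter for numerical safety
W = mN + L*randn(size(mN,1), nSamples);          % each column is a posterior sample of w
plot(xs_t, PhiT*W);                              % one sampled regression curve per column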


Covariance Between the Predictions

Draw samples from the posterior of w and then plot y(x, w). We are visualizing the joint uncertainty in the posterior distribution between the y values at two or more x values.

[Figure: plots of y(x, w) where w is a sample from the posterior over w, for N = 4 and N = 12 data points. MatLab Code.]


Summary of Results

[Figure: sampled regression functions y(x, w) drawn from the posterior, for increasing numbers of data points. MatLab Code.]


Gaussian Basis vs. Gaussian Process

If we use localized basis functions such as Gaussians, then in regions away from the basis-function support the contribution from the second term in the predictive variance goes to zero, leaving only the noise contribution \beta^{-1}:

\sigma_N^2(x) = \beta^{-1} + \phi(x)^T S_N\,\phi(x) \;\longrightarrow\; \beta^{-1} \quad \text{away from the support of } \phi(x).

The model therefore becomes very confident in its predictions when extrapolating outside the region occupied by the basis functions. This is an undesirable behavior.

This problem can be avoided by adopting an alternative Bayesian approach to regression: Gaussian processes.


Bayesian Inference when \sigma^2 is Unknown

Let us extend the previous results for linear regression, assuming now that \sigma^2 is unknown.

Assume a likelihood of the form*

p(y \mid X, w, \sigma^2) = \mathcal{N}\left(y \mid Xw, \sigma^2 I_N\right) = \left(2\pi\sigma^2\right)^{-N/2}\exp\left(-\frac{1}{2\sigma^2}\left(y - Xw\right)^T\left(y - Xw\right)\right).

A conjugate prior has the normal-inverse-gamma (NIG) form

p(w, \sigma^2) = \mathcal{NIG}\left(w, \sigma^2 \mid w_0, V_0, a_0, b_0\right) \triangleq \mathcal{N}\left(w \mid w_0, \sigma^2 V_0\right)\,\mathcal{IG}\left(\sigma^2 \mid a_0, b_0\right)
= \frac{b_0^{a_0}}{(2\pi)^{D/2}|V_0|^{1/2}\Gamma(a_0)}\left(\sigma^2\right)^{-\left(a_0 + D/2 + 1\right)}\exp\left(-\frac{\left(w - w_0\right)^T V_0^{-1}\left(w - w_0\right) + 2b_0}{2\sigma^2}\right).

The posterior is derived on the next slide.

* In the remainder of this lecture, the response is denoted as y, the dimensionality of w as D, and the design matrix as X.


Bayesian Inference when \sigma^2 is Unknown

Let us define the following:

V_N = \left(V_0^{-1} + X^T X\right)^{-1}, \qquad w_N = V_N\left(V_0^{-1}w_0 + X^T y\right),
a_N = a_0 + N/2, \qquad b_N = b_0 + \frac{1}{2}\left(w_0^T V_0^{-1} w_0 + y^T y - w_N^T V_N^{-1} w_N\right).

With these definitions, simple algebra shows that the posterior is again normal-inverse-gamma:

p(w, \sigma^2 \mid \mathcal{D}) = \mathcal{NIG}\left(w, \sigma^2 \mid w_N, V_N, a_N, b_N\right) \triangleq \mathcal{N}\left(w \mid w_N, \sigma^2 V_N\right)\,\mathcal{IG}\left(\sigma^2 \mid a_N, b_N\right)
\propto \left(\sigma^2\right)^{-\left(a_N + D/2 + 1\right)}\exp\left(-\frac{\left(w - w_N\right)^T V_N^{-1}\left(w - w_N\right) + 2b_N}{2\sigma^2}\right).

The posterior marginals can now be derived explicitly:

p(\sigma^2 \mid \mathcal{D}) = \mathcal{IG}\left(\sigma^2 \mid a_N, b_N\right), \qquad p(w \mid \mathcal{D}) = \mathcal{T}_D\left(w \;\middle|\; w_N, \frac{b_N}{a_N}V_N, 2a_N\right) \propto \left(1 + \frac{\left(w - w_N\right)^T V_N^{-1}\left(w - w_N\right)}{2 b_N}\right)^{-\left(a_N + D/2\right)}.
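A minimal sketch of this conjugate update in MATLAB; the design matrix X, the response y and the (deliberately vague) prior hyperparameters are assumptions of the sketch.

% Sketch: normal-inverse-gamma posterior update for unknown noise variance.
[N, D] = size(X);
w0 = zeros(D, 1);  V0 = 100*eye(D);   % vague Gaussian prior on w (given sigma^2)
a0 = 0.01;         b0 = 0.01;         % vague inverse-gamma prior on sigma^2
VN = inv(inv(V0) + X'*X);
wN = VN*(V0\w0 + X'*y);
aN = a0 + N/2;
bN = b0 + 0.5*(w0'*(V0\w0) + y'*y - wN'*(VN\wN));
sigma2_mean = bN/(aN - 1);            % posterior mean of sigma^2 (requires aN > 1)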


Posterior Marginals

70

The marginal posterior of w can be written directly by integrating out σ²:

\[
p(\mathbf w\mid\mathcal D) = \int_0^\infty p(\mathbf w,\sigma^2\mid\mathcal D)\,d\sigma^2
\propto \int_0^\infty \left(\sigma^2\right)^{-(a_N+D/2+1)}
\exp\!\left(-\frac{(\mathbf w-\mathbf w_N)^T\mathbf V_N^{-1}(\mathbf w-\mathbf w_N)+2b_N}{2\sigma^2}\right)d\sigma^2
\]
\[
\propto \left[1 + \frac{(\mathbf w-\mathbf w_N)^T\mathbf V_N^{-1}(\mathbf w-\mathbf w_N)}{2b_N}\right]^{-(a_N+D/2)},
\]

which is the (unnormalized) density of a multivariate Student's t,

\[
p(\mathbf w\mid\mathcal D) = \mathcal T_D\!\left(\mathbf w\;\middle|\;\mathbf w_N,\ \tfrac{b_N}{a_N}\mathbf V_N,\ 2a_N\right).
\]

To compute the integral above, substitute λ = σ⁻² (so dσ² = −λ⁻² dλ) and use the normalizing constant of the Gamma distribution, ∫₀^∞ λ^{a−1} e^{−bλ} dλ = Γ(a) b^{−a} ∝ b^{−a}.
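As a sanity check of this marginalization (a throwaway sketch for a scalar w with arbitrarily chosen hyperparameters), one can integrate the joint posterior over σ² numerically and compare against the Student-t density in SciPy's standard location-scale parameterization:

import numpy as np
from scipy.integrate import quad
from scipy.stats import t as student_t

# Assumed posterior hyperparameters for a single weight (D = 1); values are arbitrary.
wN, VN, aN, bN = 0.7, 0.05, 6.0, 2.5

def joint_unnorm(sigma2, w):
    # NIG(w, sigma2 | wN, VN, aN, bN) up to a constant that depends on neither w nor sigma2.
    q = (w - wN) ** 2 / VN
    return sigma2 ** -(aN + 0.5 + 1.0) * np.exp(-(q + 2.0 * bN) / (2.0 * sigma2))

w_grid = np.linspace(-0.5, 2.0, 6)
marg = np.array([quad(joint_unnorm, 1e-12, np.inf, args=(w,))[0] for w in w_grid])
exact = student_t.pdf(w_grid, df=2 * aN, loc=wN, scale=np.sqrt(bN / aN * VN))

# The numerically marginalized density matches the Student-t up to normalization.
print(np.allclose(marg / marg[0], exact / exact[0], rtol=1e-5))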


Posterior Predictive Distribution


Consider the posterior predictive for m new test inputs, collected in the matrix X̃:

\[
p(\mathbf y\mid\tilde{\mathbf X},\mathcal D) \propto \iint
\frac{1}{(2\pi)^{m/2}}\left(\sigma^2\right)^{-m/2}
\exp\!\left(-\frac{(\mathbf y-\tilde{\mathbf X}\mathbf w)^T(\mathbf y-\tilde{\mathbf X}\mathbf w)}{2\sigma^2}\right)
\left(\sigma^2\right)^{-(a_N+D/2+1)}
\exp\!\left(-\frac{(\mathbf w-\mathbf w_N)^T\mathbf V_N^{-1}(\mathbf w-\mathbf w_N)+2b_N}{2\sigma^2}\right)
d\mathbf w\,d\sigma^2.
\]

As a first step, let us integrate over w by completing the square in the exponent:

\[
\begin{aligned}
&(\mathbf y-\tilde{\mathbf X}\mathbf w)^T(\mathbf y-\tilde{\mathbf X}\mathbf w) + (\mathbf w-\mathbf w_N)^T\mathbf V_N^{-1}(\mathbf w-\mathbf w_N) + 2b_N \\
&\quad = \left[\mathbf w - \left(\tilde{\mathbf X}^T\tilde{\mathbf X}+\mathbf V_N^{-1}\right)^{-1}\left(\tilde{\mathbf X}^T\mathbf y+\mathbf V_N^{-1}\mathbf w_N\right)\right]^T
\left(\tilde{\mathbf X}^T\tilde{\mathbf X}+\mathbf V_N^{-1}\right)
\left[\mathbf w - \left(\tilde{\mathbf X}^T\tilde{\mathbf X}+\mathbf V_N^{-1}\right)^{-1}\left(\tilde{\mathbf X}^T\mathbf y+\mathbf V_N^{-1}\mathbf w_N\right)\right] \\
&\qquad - \left(\tilde{\mathbf X}^T\mathbf y+\mathbf V_N^{-1}\mathbf w_N\right)^T\left(\tilde{\mathbf X}^T\tilde{\mathbf X}+\mathbf V_N^{-1}\right)^{-1}\left(\tilde{\mathbf X}^T\mathbf y+\mathbf V_N^{-1}\mathbf w_N\right)
+ \mathbf w_N^T\mathbf V_N^{-1}\mathbf w_N + \mathbf y^T\mathbf y + 2b_N.
\end{aligned}
\]

The quadratic-in-w term drops out of the integration over w: its Gaussian integral contributes a factor (2πσ²)^{D/2} |X̃ᵀX̃ + V_N⁻¹|^{−1/2} that does not depend on y. Let us denote the remaining terms in the equation above as 2β:

\[
2\beta = -\left(\tilde{\mathbf X}^T\mathbf y+\mathbf V_N^{-1}\mathbf w_N\right)^T\left(\tilde{\mathbf X}^T\tilde{\mathbf X}+\mathbf V_N^{-1}\right)^{-1}\left(\tilde{\mathbf X}^T\mathbf y+\mathbf V_N^{-1}\mathbf w_N\right)
+ \mathbf w_N^T\mathbf V_N^{-1}\mathbf w_N + \mathbf y^T\mathbf y + 2b_N.
\]


The posterior predictive (the double integral above) is now simplified by setting λ = 1/σ² and recalling the normalization of the Gamma distribution:

\[
p(\mathbf y\mid\tilde{\mathbf X},\mathcal D) \propto \int_0^\infty \lambda^{\,m/2+a_N-1}\exp(-\beta\lambda)\,d\lambda
\;\propto\; \beta^{-(m/2+a_N)}.
\]

Substituting β and comparing the two expressions, one can verify that

\[
\begin{aligned}
p(\mathbf y\mid\tilde{\mathbf X},\mathcal D)
&\propto \left[-\left(\tilde{\mathbf X}^T\mathbf y+\mathbf V_N^{-1}\mathbf w_N\right)^T\left(\tilde{\mathbf X}^T\tilde{\mathbf X}+\mathbf V_N^{-1}\right)^{-1}\left(\tilde{\mathbf X}^T\mathbf y+\mathbf V_N^{-1}\mathbf w_N\right)
+ \mathbf w_N^T\mathbf V_N^{-1}\mathbf w_N + \mathbf y^T\mathbf y + 2b_N\right]^{-(m/2+a_N)} \\
&\propto \left[1 + \frac{(\mathbf y-\tilde{\mathbf X}\mathbf w_N)^T\left[\tfrac{b_N}{a_N}\left(\mathbf I_m + \tilde{\mathbf X}\mathbf V_N\tilde{\mathbf X}^T\right)\right]^{-1}(\mathbf y-\tilde{\mathbf X}\mathbf w_N)}{2a_N}\right]^{-(m/2+a_N)}.
\end{aligned}
\]

Use the Sherman-Morrison-Woodbury formula here to show that (symmetry of V₀ is assumed)

\[
\left(\mathbf I_m + \tilde{\mathbf X}\mathbf V_N\tilde{\mathbf X}^T\right)^{-1}
= \mathbf I_m - \tilde{\mathbf X}\left(\tilde{\mathbf X}^T\tilde{\mathbf X}+\mathbf V_N^{-1}\right)^{-1}\tilde{\mathbf X}^T.
\]
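This matrix identity is easy to confirm numerically; in the quick NumPy check below, V_N is built as a symmetric positive definite matrix, as it would be in the posterior update:

import numpy as np

rng = np.random.default_rng(0)
m, D = 5, 3
X_tilde = rng.normal(size=(m, D))
A = rng.normal(size=(D, D))
V_N = A @ A.T + np.eye(D)          # symmetric positive definite, as in the posterior

lhs = np.linalg.inv(np.eye(m) + X_tilde @ V_N @ X_tilde.T)
rhs = np.eye(m) - X_tilde @ np.linalg.inv(X_tilde.T @ X_tilde + np.linalg.inv(V_N)) @ X_tilde.T
print(np.allclose(lhs, rhs))       # True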


Bayesian Inference when σ² is Unknown


The posterior predictive is therefore also a Student's t:

\[
p(\mathbf y\mid\tilde{\mathbf X},\mathcal D) = \mathcal T_m\!\left(\mathbf y\;\middle|\;\tilde{\mathbf X}\mathbf w_N,\ \frac{b_N}{a_N}\left(\mathbf I_m + \tilde{\mathbf X}\mathbf V_N\tilde{\mathbf X}^T\right),\ 2a_N\right).
\]

The predictive variance has two terms: \((b_N/a_N)\,\mathbf I_m\), due to the measurement noise, and \((b_N/a_N)\,\tilde{\mathbf X}\mathbf V_N\tilde{\mathbf X}^T\), due to the uncertainty in w. The second term depends on how close a test input is to the training data.
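Once (w_N, V_N, a_N, b_N) are available, the predictive mean, scale matrix, and degrees of freedom are cheap to compute. The sketch below is my own code (not a library routine); the per-point 95% intervals come from the univariate Student-t marginals via SciPy, and the posterior hyperparameter values are illustrative only:

import numpy as np
from scipy.stats import t as student_t

def predictive_params(X_test, wN, VN, aN, bN):
    """Mean, scale matrix, and dof of p(y | X_test, D) = T_m(X_test wN, (bN/aN)(I + X_test VN X_test^T), 2 aN)."""
    mean = X_test @ wN
    scale = (bN / aN) * (np.eye(X_test.shape[0]) + X_test @ VN @ X_test.T)
    return mean, scale, 2.0 * aN

def predictive_interval(X_test, wN, VN, aN, bN, level=0.95):
    # Marginal (per-test-point) intervals from the diagonal of the scale matrix.
    mean, scale, dof = predictive_params(X_test, wN, VN, aN, bN)
    half = student_t.ppf(0.5 + level / 2.0, df=dof) * np.sqrt(np.diag(scale))
    return mean - half, mean + half

# Illustrative values (e.g. the output of the NIG update sketched earlier).
wN, VN, aN, bN = np.array([1.0, -2.0]), 0.05 * np.eye(2), 16.0, 4.0
X_test = np.array([[1.0, 0.0], [1.0, 2.5]])
print(predictive_interval(X_test, wN, VN, aN, bN))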


Zellner’s G-Prior


It is common to set a₀ = b₀ = 0, corresponding to an uninformative prior for σ², and to set w₀ = 0 and V₀ = g(XᵀX)⁻¹ for any positive value g. That is, in the conjugate prior

\[
p(\mathbf w,\sigma^2) = \mathcal{NIG}(\mathbf w,\sigma^2\mid\mathbf w_0,\mathbf V_0,a_0,b_0)
\triangleq \mathcal N(\mathbf w\mid\mathbf w_0,\sigma^2\mathbf V_0)\,\mathcal{IG}(\sigma^2\mid a_0,b_0)
\]

we take

\[
p(\mathbf w,\sigma^2) = \mathcal{NIG}\!\left(\mathbf w,\sigma^2\mid\mathbf 0,\ g(\mathbf X^T\mathbf X)^{-1},\ 0,\ 0\right)
\triangleq \mathcal N\!\left(\mathbf w\mid\mathbf 0,\ \sigma^2 g(\mathbf X^T\mathbf X)^{-1}\right)\mathcal{IG}(\sigma^2\mid 0,0).
\]

This is called Zellner's g-prior. Here g plays a role analogous to 1/λ in ridge regression. However, the prior covariance is proportional to (XᵀX)⁻¹ rather than I. This ensures that the posterior is invariant to scaling of the inputs.

Zellner, A. (1986). On assessing prior distributions and Bayesian regression analysis with g-prior distributions. In Bayesian Inference and Decision Techniques, Studies in Bayesian Econometrics and Statistics, volume 6. North Holland.
Minka, T. (2000b). Bayesian linear regression. Technical report, MIT.
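The scaling invariance is easy to see numerically: with w₀ = 0 and V₀ = g(XᵀX)⁻¹, the update equations give w_N = (g/(1+g)) ŵ_MLE, so rescaling the columns of X rescales w_N inversely and leaves the fitted values X w_N unchanged. A small check (my own sketch, with synthetic data):

import numpy as np

rng = np.random.default_rng(2)
N, D, g = 40, 3, 10.0
X = rng.normal(size=(N, D))
y = X @ np.array([1.0, -0.5, 2.0]) + rng.normal(size=N)

def gprior_posterior_mean(X, y, g):
    # With w0 = 0 and V0 = g (X^T X)^(-1):  VN = g/(1+g) (X^T X)^(-1),  wN = g/(1+g) * w_MLE.
    return g / (1.0 + g) * np.linalg.solve(X.T @ X, X.T @ y)

S = np.diag([10.0, 0.1, 3.0])              # arbitrary rescaling of the input columns
w1 = gprior_posterior_mean(X, y, g)
w2 = gprior_posterior_mean(X @ S, y, g)
print(np.allclose(X @ w1, (X @ S) @ w2))   # True: fitted values are invariant to the rescaling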


Unit Information Prior


We will see below that if we use an uninformative prior, the posterior precision given N measurements is V_N⁻¹ = XᵀX.

The unit information prior is defined to contain as much information as one sample. To create a unit information prior for linear regression, we therefore need V₀⁻¹ = (1/N) XᵀX, which is equivalent to the g-prior with g = N.

Note that Zellner's prior depends on the data: this is contrary to much of our earlier discussion of Bayesian inference!

Kass, R. and L. Wasserman (1995). A reference Bayesian test for nested hypotheses and its relationship to the Schwarz criterion. J. of the Am. Stat. Assoc. 90(431), 928-934.


Uninformative Prior


An uninformative prior can be obtained by considering the uninformative limit of the conjugate g-prior, which corresponds to setting g = ∞. This is equivalent to an improper NIG prior with w₀ = 0, V₀ = ∞I, a₀ = 0 and b₀ = 0, which gives p(w, σ²) ∝ σ^{−(D+2)}:

\[
p(\mathbf w,\sigma^2) = \mathcal{NIG}(\mathbf w,\sigma^2\mid\mathbf 0,\infty\mathbf I,0,0)
\triangleq \mathcal N(\mathbf w\mid\mathbf 0,\sigma^2\infty\mathbf I)\,\mathcal{IG}(\sigma^2\mid 0,0)
\;\propto\; \sigma^{-(D+2)}.
\]

Alternatively, we can start with the semi-conjugate prior p(w, σ²) = p(w)p(σ²) and take each term to its uninformative limit individually, which gives p(w, σ²) ∝ σ⁻². This is equivalent to an improper NIG prior with w₀ = 0, V₀ = ∞I, a₀ = −D/2 and b₀ = 0:

\[
p(\mathbf w,\sigma^2) = \mathcal{NIG}\!\left(\mathbf w,\sigma^2\,\middle|\,\mathbf 0,\infty\mathbf I,-\tfrac{D}{2},0\right)
\triangleq \mathcal N(\mathbf w\mid\mathbf 0,\sigma^2\infty\mathbf I)\,\mathcal{IG}\!\left(\sigma^2\,\middle|\,-\tfrac{D}{2},0\right)
\;\propto\; \sigma^{-2}.
\]


Using the uninformative prior p(w, σ²) ∝ σ⁻², the corresponding posterior and marginal posteriors are given by

\[
p(\mathbf w,\sigma^2\mid\mathcal D) = \mathcal{NIG}(\mathbf w,\sigma^2\mid\mathbf w_N,\mathbf V_N,a_N,b_N), \qquad
p(\mathbf w\mid\mathcal D) = \mathcal T_D\!\left(\mathbf w_N,\ \tfrac{b_N}{a_N}\mathbf V_N,\ 2a_N\right)
= \mathcal T_D\!\left(\mathbf w\;\middle|\;\hat{\mathbf w}_{MLE},\ \frac{s^2}{N-D}\,\mathbf C,\ N-D\right),
\]

where (taking the limit V₀⁻¹ → 0, with w₀ = 0, a₀ = −D/2 and b₀ = 0)

\[
\mathbf V_N = \mathbf C = \left(\mathbf V_0^{-1}+\mathbf X^T\mathbf X\right)^{-1} \to \left(\mathbf X^T\mathbf X\right)^{-1}, \qquad
\mathbf w_N = \mathbf V_N\left(\mathbf V_0^{-1}\mathbf w_0+\mathbf X^T\mathbf y\right) \to \left(\mathbf X^T\mathbf X\right)^{-1}\mathbf X^T\mathbf y = \hat{\mathbf w}_{MLE},
\]
\[
a_N = a_0 + N/2 = (N-D)/2, \qquad
b_N = b_0 + \tfrac12\left(\mathbf w_0^T\mathbf V_0^{-1}\mathbf w_0 + \mathbf y^T\mathbf y - \mathbf w_N^T\mathbf V_N^{-1}\mathbf w_N\right) = s^2/2, \qquad
s^2 = \left(\mathbf y-\mathbf X\hat{\mathbf w}_{MLE}\right)^T\left(\mathbf y-\mathbf X\hat{\mathbf w}_{MLE}\right).
\]

Note, in the calculation of s², that with w_N = ŵ_MLE = (XᵀX)⁻¹Xᵀy,

\[
s^2 = \left(\mathbf y-\mathbf X\hat{\mathbf w}_{MLE}\right)^T\left(\mathbf y-\mathbf X\hat{\mathbf w}_{MLE}\right)
= \left(\mathbf y-\mathbf X(\mathbf X^T\mathbf X)^{-1}\mathbf X^T\mathbf y\right)^T\left(\mathbf y-\mathbf X(\mathbf X^T\mathbf X)^{-1}\mathbf X^T\mathbf y\right)
= \mathbf y^T\mathbf y - \mathbf y^T\mathbf X(\mathbf X^T\mathbf X)^{-1}\mathbf X^T\mathbf y
= \mathbf y^T\mathbf y - \hat{\mathbf w}_{MLE}^T\mathbf V_N^{-1}\hat{\mathbf w}_{MLE}.
\]
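In code, the posterior under this uninformative prior reduces to familiar least-squares quantities. The sketch below (my own minimal implementation, with synthetic data for illustration) returns ŵ_MLE, C = (XᵀX)⁻¹, s², and the degrees of freedom N − D of the Student-t marginal:

import numpy as np

def uninformative_posterior(X, y):
    """Posterior quantities under p(w, sigma^2) proportional to sigma^(-2)."""
    N, D = X.shape
    C = np.linalg.inv(X.T @ X)            # V_N
    w_mle = C @ X.T @ y                   # w_N
    resid = y - X @ w_mle
    s2 = resid @ resid                    # s^2 = 2 b_N,  a_N = (N - D) / 2
    return w_mle, C, s2, N - D            # w | D ~ T_D(w_mle, s2 / (N - D) * C, N - D)

# Synthetic example (assumed data, for illustration only).
rng = np.random.default_rng(4)
X = np.column_stack([np.ones(25), rng.normal(size=(25, 2))])
y = X @ np.array([2.0, 0.0, -1.0]) + 0.4 * rng.normal(size=25)
w_mle, C, s2, dof = uninformative_posterior(X, y)
scale = np.sqrt(np.diag(C) * s2 / dof)    # per-coefficient scale of the Student-t marginal
print(w_mle, scale)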


Frequentist Confidence Interval Vs. Bayesian Marginal Credible Interval


The use of a (semi-conjugate) uninformative prior is quite interesting, since the resulting posterior turns out to be equivalent to the results obtained from frequentist statistics. In particular, the marginal posterior of each coefficient is

\[
p(w_j\mid\mathcal D) = \mathcal T\!\left(w_j\;\middle|\;\hat w_j,\ \frac{C_{jj}\,s^2}{N-D},\ N-D\right).
\]

This is equivalent to the sampling distribution of the MLE, which is given by

\[
\frac{w_j-\hat w_j}{s_j} \sim \mathcal T_{N-D}, \qquad s_j = \sqrt{\frac{C_{jj}\,s^2}{N-D}},
\]

where s_j is the standard error of the estimated parameter. Consequently, the frequentist confidence interval and the Bayesian marginal credible interval for the parameters are the same.

Rice, J. (1995). Mathematical Statistics and Data Analysis. Duxbury, 2nd edition (page 542).
Casella, G. and R. Berger (2002). Statistical Inference. Duxbury, 2nd edition (page 554).
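Since each marginal is a shifted and scaled Student-t with N − D degrees of freedom, the 95% marginal credible interval is ŵ_j ± t_{0.975, N−D} · s_j, which is exactly the classical confidence interval. A small helper (my own sketch; the numbers in the example are placeholders, not the caterpillar data used below):

import numpy as np
from scipy.stats import t as student_t

def credible_interval_95(w_hat_j, C_jj, s2, dof):
    s_j = np.sqrt(C_jj * s2 / dof)             # standard error of w_j
    half = student_t.ppf(0.975, df=dof) * s_j
    return w_hat_j - half, w_hat_j + half      # identical to the frequentist 95% CI

# Placeholder numbers, for illustration only.
print(credible_interval_95(w_hat_j=0.5, C_jj=0.04, s2=3.2, dof=22))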


The Caterpillar Example


As a worked example of the uninformative prior, consider the caterpillar dataset. We can compute the posterior mean and standard deviation, and the 95% credible intervals (CI), for the regression coefficients:

coeff    mean      stddev      95% CI                sig
w0       10.998    3.06027     [  4.652,  17.345 ]   *
w1       -0.004    0.00156     [ -0.008,  -0.001 ]   *
w2       -0.054    0.02190     [ -0.099,  -0.008 ]   *
w3        0.068    0.09947     [ -0.138,   0.274 ]
w4       -1.294    0.56381     [ -2.463,  -0.124 ]   *
w5        0.232    0.10438     [  0.015,   0.448 ]   *
w6       -0.357    1.56646     [ -3.605,   2.892 ]
w7       -0.237    1.00601     [ -2.324,   1.849 ]
w8        0.181    0.23672     [ -0.310,   0.672 ]
w9       -1.285    0.86485     [ -3.079,   0.508 ]
w10      -0.433    0.73487     [ -1.957,   1.091 ]

The 95% credible intervals are identical to the 95% confidence intervals computed using standard frequentist methods. (Run linregBayesCaterpillar from PMTK3.)

Marin, J.-M. and C. Robert (2007). Bayesian Core: A Practical Approach to Computational Bayesian Statistics. Springer.


We can use these marginal posteriors to check whether each coefficient is significantly different from 0, i.e., whether its 95% CI excludes 0. By this criterion, coefficients 0, 1, 2, 4, and 5 are all significant.

These results are the same as those produced by a frequentist approach using p-values at the 5% level.

Note, however, that the MLE does not even exist when N < D, so standard frequentist inference theory breaks down in this setting. Bayesian inference theory still works, using proper priors.

Maruyama, Y. and E. George (2008). A g-prior extension for p > n. Technical report, U. Tokyo.


Empirical Bayes for Linear Regression


We describe next an empirical Bayes procedure for picking the hyperparameters in the prior (we will come back to this and to relevance determination in a forthcoming lecture).

More precisely, we choose η = (α, λ) to maximize the marginal likelihood, where λ = 1/σ² is the precision of the observation noise and α is the precision of the prior, p(w) = N(w|0, α⁻¹I). This is known as the evidence procedure.

MacKay, D. (1995b). Probable networks and plausible predictions — a review of practical Bayesian methods for supervised neural networks. Network.
Buntine, W. and A. Weigend (1991). Bayesian backpropagation. Complex Systems 5, 603-643.
MacKay, D. (1999). Comparison of approximate methods for handling hyperparameters. Neural Computation 11(5), 1035-1068.


The evidence procedure provides an alternative to using cross validation. In the figure below, the log marginal likelihood is plotted for different values of α, as well as the maximum value found by the optimizer.

[Figure: log evidence vs. log alpha, with the maximum found by the optimizer marked. Run linregPolyVsRegDemo from PMTK3.]


[Figure: negative log marginal likelihood and the 5-fold CV estimate of MSE vs. log lambda, alongside the log-evidence plot above. Run linregPolyVsRegDemo from PMTK3.]

We obtain the same result as 5-fold CV (λ = 1/σ² is fixed in both methods). The key advantage of the evidence procedure over CV is that it allows a different αj to be used for every feature.
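For concreteness, the log marginal likelihood of this linear-Gaussian model with prior precision α and noise precision λ has the standard closed form log p(y|X, α, λ) = (D/2) log α + (N/2) log λ − (λ/2)‖y − X m_N‖² − (α/2) m_Nᵀm_N − (1/2) log|A| − (N/2) log 2π, where A = αI + λXᵀX and m_N = λA⁻¹Xᵀy. The sketch below (my own code, not the PMTK3 demo; the data are synthetic) grid-searches α at a fixed λ:

import numpy as np

def log_evidence(X, y, alpha, lam):
    """log p(y | X, alpha, lam) for y = Xw + noise, with w ~ N(0, alpha^-1 I) and noise precision lam."""
    N, D = X.shape
    A = alpha * np.eye(D) + lam * X.T @ X
    mN = lam * np.linalg.solve(A, X.T @ y)
    E = 0.5 * lam * np.sum((y - X @ mN) ** 2) + 0.5 * alpha * mN @ mN
    return (0.5 * D * np.log(alpha) + 0.5 * N * np.log(lam) - E
            - 0.5 * np.linalg.slogdet(A)[1] - 0.5 * N * np.log(2 * np.pi))

# Grid search over alpha at fixed lambda (the empirical Bayes / evidence procedure).
rng = np.random.default_rng(3)
X = rng.normal(size=(50, 4))
y = X @ np.array([1.0, 0.0, -2.0, 0.5]) + 0.3 * rng.normal(size=50)
lam = 1.0 / 0.3 ** 2
log_alphas = np.linspace(-5, 5, 101)
evidences = [log_evidence(X, y, np.exp(la), lam) for la in log_alphas]
print("log alpha maximizing the evidence:", log_alphas[int(np.argmax(evidences))])

In practice the optimizer alternates (or jointly optimizes) over α and λ rather than using a grid, but the grid makes the shape of the evidence curve in the figure easy to reproduce.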


Automatic Relevancy Determination


The evidence procedure can be used to perform feature selection; this is known as automatic relevancy determination (ARD).

The evidence procedure is also useful when comparing different kinds of models:

\[
p(\mathcal D\mid m) = \iint p(\mathcal D\mid\mathbf w,\boldsymbol\eta,m)\,p(\mathbf w\mid\boldsymbol\eta,m)\,p(\boldsymbol\eta\mid m)\,d\mathbf w\,d\boldsymbol\eta
\;\approx\; \max_{\boldsymbol\eta}\int p(\mathcal D\mid\mathbf w,\boldsymbol\eta,m)\,p(\mathbf w\mid\boldsymbol\eta,m)\,p(\boldsymbol\eta\mid m)\,d\mathbf w.
\]

It is important to (at least approximately) integrate over η rather than setting it arbitrarily. Using variational Bayes, we can model our uncertainty in η rather than computing point estimates.