
Bayesian Regression: Basis Functions, MLE & Regularized Least Squares, Multiple Outputs, Inference with \sigma^2 Unknown, Zellner's g-Prior, Uninformative Priors

Prof. Nicholas Zabaras

University of Notre Dame

Notre Dame, IN, USA

Email: [email protected]

URL: https://www.zabaras.com/

September 18, 2017


Contents

Linear basis function models, Maximum likelihood and least squares, Geometry of least squares, Convexity of the NLL, Sequential learning, Robust linear regression, Regularized least squares, Multiple outputs.

Bayesian linear regression, Parameter posterior distribution, A note on data centering, Numerical example, Predictive distribution, Bayesian inference in linear regression when \sigma^2 is unknown, Zellner's g-Prior, Uninformative (semi-conjugate) prior, Evidence approximation.

Following closely:

Chris Bishop's PRML book, Chapter 3

Kevin Murphy, Machine Learning: A Probabilistic Perspective, Chapter 7

Regression using parametric discriminative models in PMTK3 (run TutRegr.m in PMTK3)


Linear Regression

We already considered an example of linear regression in an earlier lecture: polynomial curve fitting.

We are interested in a linear combination (regression) over a ``fixed set'' of nonlinear basis functions.

Supervised learning: N observations {x_n} with corresponding target values {t_n} are provided. The goal is to predict t for a new value x.

We construct a function y(x) that serves as a prediction of t. We follow a Bayesian perspective and model the predictive distribution p(t|x).


Linear Regression

From the conditional distribution p(t|x) we can make point estimates of t for a given x by minimizing a loss function.

For a quadratic loss function, the point estimate is the conditional mean, y(x, w) = E[t|x].


Linear Regression

The simplest linear model for regression involves a linear combination of the input variables,

y(x, w) = w_0 + w_1 x_1 + \ldots + w_D x_D = \sum_{i=0}^{D} w_i x_i,

where x = (x_1, x_2, \ldots, x_D)^T and we have defined the dummy input x_0 \equiv 1.

This is often simply known as linear regression. D is the input dimensionality.


Linear Basis Function Models

More generally,

y(x, w) = w_0 + w_1\phi_1(x) + \ldots + w_{M-1}\phi_{M-1}(x) = \sum_{i=0}^{M-1} w_i\phi_i(x) = w^T\phi(x), \quad \phi_0(x) = 1,

where the \phi_i(x) are known as basis functions and

\phi(x) = \left(\phi_0(x), \phi_1(x), \ldots, \phi_{M-1}(x)\right)^T, \qquad w = \left(w_0, w_1, \ldots, w_{M-1}\right)^T.

The parameter w_0 allows for any fixed offset in the data and is called the bias parameter. For convenience, we define the additional dummy 'basis function' \phi_0(x) = 1 so that the bias is absorbed into the sum.

Often the \phi_i(x) represent features extracted from the data x.


Polynomial and Gaussian Basis Functions

Polynomial basis functions (scalar input, global support):

\phi_j(x) = x^j

Gaussian basis functions (local support):

\phi_j(x) = \exp\left(-\frac{(x - \mu_j)^2}{2 s^2}\right)

[Figure: polynomial and Gaussian basis functions plotted on [-1, 1]; the Gaussian centers are \mu_j = -1:0.2:1 with scale s = 0.2. MatLab code.]
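The following is a minimal MATLAB sketch of how such a design matrix of Gaussian basis functions can be evaluated and plotted; the grid, the centers \mu_j = -1:0.2:1 and the scale s = 0.2 follow the figure description above, and everything else is an illustrative assumption rather than the code behind the original figure.

% Sketch: evaluate Gaussian basis functions phi_j(x) = exp(-(x - mu_j)^2 / (2*s^2)).
% Centers and scale follow the figure above; uses implicit expansion (R2016b+).
x   = linspace(-1, 1, 200)';          % input grid (column vector)
mu  = -1:0.2:1;                       % basis-function centers (row vector)
s   = 0.2;                            % common scale
Phi = exp(-(x - mu).^2 / (2*s^2));    % N x M matrix, column j is phi_j evaluated on x
plot(x, Phi); xlabel('x'); ylabel('\phi_j(x)'); title('Gaussian basis functions');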


Logistic Sigmoidal Basis Functions

Sigmoidal basis functions:

\phi_j(x) = \sigma\!\left(\frac{x - \mu_j}{s}\right), \qquad \sigma(a) = \frac{1}{1 + e^{-a}} \;\; \text{(logistic sigmoid function)}.

The tanh function is related to the logistic sigmoid by \tanh(a) = 2\sigma(2a) - 1.

[Figure: sigmoidal basis functions on [-1, 1]; centers \mu_j = -1:0.2:1, scale s = 0.1. MatLab code.]


Sigmoidal and Tanh Basis Functions

The sigmoidal and tanh basis functions are related by

\tanh(a) = 2\sigma(2a) - 1, \qquad \sigma(a) = \frac{1}{1 + e^{-a}}, \qquad \text{so that} \quad \sigma(a) = \frac{1}{2}\left(1 + \tanh(a/2)\right).

A general linear combination of logistic sigmoidal functions is therefore equivalent to a linear combination of tanh functions:

y(x, w) = w_0 + \sum_{j=1}^{M} w_j\,\sigma\!\left(\frac{x - \mu_j}{s}\right)
        = w_0 + \frac{1}{2}\sum_{j=1}^{M} w_j\left(1 + \tanh\!\left(\frac{x - \mu_j}{2s}\right)\right)
        = u_0 + \sum_{j=1}^{M} u_j\tanh\!\left(\frac{x - \mu_j}{2s}\right),

where u_0 = w_0 + \frac{1}{2}\sum_{j=1}^{M} w_j and u_j = \frac{w_j}{2}.


Choice of Basis Functions

We are interested in functions of local support to explore adaptivity.

Local support functions comprise a spectrum of different spatial frequencies. An example is wavelets, which are local both spatially and in frequency. They are, however, useful only when the input is defined on a lattice.


Maximum Likelihood and Least Squares

Assume the observations come from a deterministic function with added Gaussian noise,

t = y(x, w) + \epsilon, \qquad p(\epsilon \mid \beta) = \mathcal{N}\left(\epsilon \mid 0, \beta^{-1}\right),

which is the same as saying

p(t \mid x, w, \beta) = \mathcal{N}\left(t \mid y(x, w), \beta^{-1}\right).

Here \beta is the noise precision. Under a squared loss function, the optimal point prediction is the conditional mean, E[t|x] = y(x, w).

An example for 2D input x is shown below: a linear model E[t|x] = w_0 + w_1 x_1 + w_2 x_2 and a quadratic model E[t|x] = w_0 + w_1 x_1 + w_2 x_2 + w_3 x_1^2 + w_4 x_2^2.

[Figure: linear and quadratic surfaces fit to 2D data. Run surfaceFitDemo from PMTK3.]


Maximum Likelihood and Least Squares

Given observed inputs X = \{x_1, \ldots, x_N\} and targets t = (t_1, \ldots, t_N)^T, we obtain the likelihood function

p(t \mid X, w, \beta) = \prod_{n=1}^{N}\mathcal{N}\left(t_n \mid w^T\phi(x_n), \beta^{-1}\right).

We often work with the log-likelihood,

\ell(w, \beta) = \log p(t \mid X, w, \beta) = \sum_{n=1}^{N}\log\mathcal{N}\left(t_n \mid w^T\phi(x_n), \beta^{-1}\right).

Instead of maximizing the log-likelihood, one can equivalently minimize the negative log-likelihood (NLL),

NLL(w, \beta) = -\sum_{n=1}^{N}\log\mathcal{N}\left(t_n \mid w^T\phi(x_n), \beta^{-1}\right).


Maximum Likelihood and Least Squares

Taking the log of the likelihood, we obtain

\ln p(t \mid w, \beta) = \sum_{n=1}^{N}\ln\mathcal{N}\left(t_n \mid w^T\phi(x_n), \beta^{-1}\right) = \frac{N}{2}\ln\beta - \frac{N}{2}\ln(2\pi) - \beta E_D(w),

where we have defined

E_D(w) = \frac{1}{2}\sum_{n=1}^{N}\left(t_n - w^T\phi(x_n)\right)^2, \qquad RSS(w) = \sum_{n=1}^{N}\left(t_n - w^T\phi(x_n)\right)^2, \qquad MSE = RSS/N.

RSS is often known as the residual sum of squares or sum of squared errors (SSE), and MSE is the mean squared error. Computing w via MLE is therefore the same as least squares.

The NLL is a quadratic bowl with a unique minimum (the MLE estimate).

[Figures: data, truth and prediction for a linear fit; sum-of-squares error contours in the (w0, w1) plane. Run contoursSSEdemo and residualsDemo from PMTK3.]


Maximum Likelihood and Least Squares

Recall that

\ln p(t \mid w, \beta) = \frac{N}{2}\ln\beta - \frac{N}{2}\ln 2\pi - \beta E_D(w), \qquad E_D(w) = \frac{1}{2}\sum_{n=1}^{N}\left(t_n - w^T\phi(x_n)\right)^2.

Setting the gradient of the log-likelihood with respect to w (written here as a row vector) equal to zero,

\nabla_w\ln p(t \mid w, \beta) = \beta\sum_{n=1}^{N}\left(t_n - w^T\phi(x_n)\right)\phi(x_n)^T = 0 \;\;\Rightarrow\;\; \sum_{n=1}^{N} t_n\phi(x_n)^T = w^T\sum_{n=1}^{N}\phi(x_n)\phi(x_n)^T.

This equation can be solved for w.


Maximum Likelihood and Least Squares

We obtain the normal equations and the ordinary least squares solution,

w_{ML} = \left(\Phi^T\Phi\right)^{-1}\Phi^T t \equiv \Phi^{\dagger} t, \qquad \Phi^{\dagger} := \left(\Phi^T\Phi\right)^{-1}\Phi^T \;\; \text{(Moore-Penrose pseudo-inverse)},

where we have defined the N \times M design matrix

\Phi = \begin{pmatrix} \phi_0(x_1) & \phi_1(x_1) & \cdots & \phi_{M-1}(x_1) \\ \phi_0(x_2) & \phi_1(x_2) & \cdots & \phi_{M-1}(x_2) \\ \vdots & \vdots & & \vdots \\ \phi_0(x_N) & \phi_1(x_N) & \cdots & \phi_{M-1}(x_N) \end{pmatrix} = \begin{pmatrix} \phi(x_1)^T \\ \phi(x_2)^T \\ \vdots \\ \phi(x_N)^T \end{pmatrix}, \qquad \phi(x_i) = \left(\phi_0(x_i), \phi_1(x_i), \ldots, \phi_{M-1}(x_i)\right)^T.

Note that indeed the stationarity condition of the previous slide,

\sum_{n=1}^{N} t_n\phi(x_n)^T = w^T\sum_{n=1}^{N}\phi(x_n)\phi(x_n)^T,

is the same as \Phi^T t = \left(\Phi^T\Phi\right)w, since \Phi^T\Phi = \sum_n\phi(x_n)\phi(x_n)^T and \Phi^T t = \sum_n t_n\phi(x_n).
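As an illustration, a minimal MATLAB sketch of the MLE/least-squares fit follows; the synthetic data and the Gaussian design matrix (with a bias column prepended) are assumptions for the example, and in practice the backslash solve Phi \ t is preferred over forming the pseudo-inverse explicitly.

% Sketch: ordinary least squares / MLE for a linear-in-the-parameters model.
N  = 50;  x = linspace(-1, 1, N)';
t  = sin(2*pi*x) + 0.2*randn(N, 1);                 % assumed synthetic targets
mu = -1:0.2:1;  s = 0.2;
Phi = [ones(N,1), exp(-(x - mu).^2/(2*s^2))];       % design matrix with bias column phi_0 = 1
w_ml    = (Phi'*Phi) \ (Phi'*t);                    % normal equations
w_ml_qr = Phi \ t;                                  % equivalent, numerically safer (QR-based)
beta_ml = N / sum((t - Phi*w_ml).^2);               % MLE of the noise precision (see next slide)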


Maximum Likelihood and Least Squares

Maximizing the log-likelihood

\ln p(t \mid w, \beta) = \sum_{n=1}^{N}\ln\mathcal{N}\left(t_n \mid w^T\phi(x_n), \beta^{-1}\right) = \frac{N}{2}\ln\beta - \frac{N}{2}\ln 2\pi - \beta E_D(w)

now with respect to \beta gives

\frac{1}{\beta_{ML}} = \frac{2}{N}E_D(w_{ML}) = \frac{1}{N}\sum_{n=1}^{N}\left(t_n - w_{ML}^T\phi(x_n)\right)^2.

So the MLE noise variance 1/\beta_{ML} is equal to the residual variance of the target values around the regression function.


Computing the Bias Parameter

If we make the bias parameter w_0 explicit, the error function becomes

E_D(w) = \frac{1}{2}\sum_{n=1}^{N}\left(t_n - w_0 - \sum_{j=1}^{M-1} w_j\phi_j(x_n)\right)^2.

Setting the derivative with respect to w_0 equal to zero and solving for w_0, we obtain

w_0 = \bar{t} - \sum_{j=1}^{M-1} w_j\bar{\phi}_j, \qquad \bar{t} = \frac{1}{N}\sum_{n=1}^{N} t_n, \qquad \bar{\phi}_j = \frac{1}{N}\sum_{n=1}^{N}\phi_j(x_n).

The bias parameter w_0 compensates for the difference between the average of the target values and the weighted sum of the averages of the basis function values.


Geometry of Least Squares

We look for a geometrical interpretation of the least-squares solution in an N-dimensional space: t is a vector in that space with components t_1, \ldots, t_N (N > M).

The least-squares regression function is obtained by finding the orthogonal projection of the data vector t onto the subspace spanned by the basis functions \phi_j(x). Here \varphi_j denotes the jth column of \Phi,

\varphi_j = \left(\phi_j(x_1), \ldots, \phi_j(x_N)\right)^T, \qquad \Phi = \left(\varphi_0, \varphi_1, \ldots, \varphi_{M-1}\right) = \begin{pmatrix} \phi_0(x_1) & \cdots & \phi_{M-1}(x_1) \\ \vdots & & \vdots \\ \phi_0(x_N) & \cdots & \phi_{M-1}(x_N) \end{pmatrix},

and the vector of fitted values is y = \Phi w.


Geometry of Least Squares

We are looking for w such that the projection error t - y = t - \Phi w is orthogonal to the basis vectors \varphi_0, \varphi_1, \ldots, \varphi_{M-1}, i.e. such that

\varphi_j^T\left(t - \Phi w\right) = 0 \;\; \forall j \quad\Longleftrightarrow\quad \Phi^T\left(t - \Phi w\right) = 0.

These are the normal equations we derived earlier. Geometrically, y = \Phi w_{ML} is the orthogonal projection of t onto S, the M-dimensional subspace spanned by the columns \varphi_j.


Convexity of the NLL

Convexity of the NLL (positive-definite Hessian) leads to a unique, globally optimal MLE.

Some models of interest do not have concave likelihoods, and only locally optimal MLE estimates can be found.

[Figure: examples of a convex function, a concave function, and a function that is neither (with two local minima A and B); convex and non-convex regions are indicated. Run convexFnHand from PMTK3.]


Sequential Learning: LMS Algorithm

If the data set is large, we use sequential (on-line) algorithms, applying the technique of stochastic (sequential) gradient descent.

If the error function comprises a sum over data points, E = \sum_n E_n, e.g.

E_D(w) = \frac{1}{2}\sum_{n=1}^{N}\left(t_n - w^T\phi(x_n)\right)^2,

then after presentation of pattern n the stochastic gradient descent algorithm updates the parameter vector w using

w^{(\tau+1)} = w^{(\tau)} - \eta\,\nabla E_n = w^{(\tau)} + \eta\left(t_n - w^{(\tau)T}\phi(x_n)\right)\phi(x_n),

where \tau is the iteration number and \eta the learning-rate parameter.

This is known as least-mean-squares or the LMS algorithm (a code sketch follows below).
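A minimal sketch of the LMS loop follows, reusing Phi and t from the least-squares sketch above; the learning rate and the number of passes are illustrative choices, not values from the slides.

% Sketch: LMS / stochastic gradient descent for linear basis-function regression.
[N, M] = size(Phi);
w   = zeros(M, 1);                       % initial weights
eta = 0.05;                              % learning rate (illustrative)
for pass = 1:20                          % a few passes over the data
    for n = randperm(N)                  % present the patterns in random order
        phi_n = Phi(n, :)';              % M x 1 feature vector of pattern n
        err   = t(n) - w' * phi_n;       % prediction error on pattern n
        w     = w + eta * err * phi_n;   % LMS update
    end
end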


Robust Linear Regression

Using a Gaussian distribution for the noise,

t = y(x, w) + \epsilon, \qquad \epsilon \sim \mathcal{N}\left(\epsilon \mid 0, \beta^{-1}\right),

can result in a poor fit, especially if there are outliers in the data. The squared error penalizes deviations quadratically, so points far from the line have more effect on the fit than points near the line.

To achieve robustness to outliers, one can replace the Gaussian with a distribution that has heavy tails (e.g. the Laplace distribution). Such a distribution assigns higher likelihood to outliers, without having to perturb the regression line to "explain" them:

p(t \mid x, w, b) = \mathrm{Lap}\left(t \mid y(x, w), b\right) \propto \exp\left(-\frac{1}{b}\left|t - y(x, w)\right|\right), \qquad NLL(w) = \sum_i\left|r_i(w)\right|, \quad r_i(w) \triangleq t_i - y(x_i, w).

An alternative is the Huber loss (next slides),

L_H(r, \delta) = \begin{cases} r^2/2 & \text{if } |r| \le \delta \\ \delta|r| - \delta^2/2 & \text{if } |r| > \delta. \end{cases}

[Figures: fits to data with outliers using least squares, the Laplace likelihood and a Student-t likelihood (dof = 0.630); and least squares vs. the Huber loss with \delta = 1.0 and \delta = 5.0. Run linregRobustDemoCombined from PMTK3.]


Robust Linear Regression

Using the Laplace likelihood leads to an L1 error norm (a non-differentiable objective) that is more difficult to optimize.

A solution is to transform the problem (by increasing its dimension to 2N + M) into a linear program. Splitting each residual as r_i \triangleq r_i^+ - r_i^-,

\min_{w,\, r^+,\, r^-}\;\sum_i\left(r_i^+ + r_i^-\right) \quad \text{s.t.} \quad r_i^+ \ge 0, \;\; r_i^- \ge 0, \;\; w^T x_i + r_i^+ - r_i^- = t_i.

Note that with this definition,

r_i^+ = \frac{|r_i| + r_i}{2} = \begin{cases} r_i & \text{if } r_i \ge 0 \\ 0 & \text{otherwise} \end{cases}, \qquad r_i^- = \frac{|r_i| - r_i}{2} = \begin{cases} 0 & \text{if } r_i \ge 0 \\ -r_i & \text{otherwise}, \end{cases}

so that r_i^+ + r_i^- = |r_i| and r_i^+ - r_i^- = r_i.

Boyd, S. and L. Vandenberghe (2004). Convex Optimization. Cambridge University Press.


Huber Loss Function

The Huber loss is defined as

L_H(r, \delta) = \begin{cases} r^2/2 & \text{if } |r| \le \delta \\ \delta|r| - \delta^2/2 & \text{if } |r| > \delta \end{cases}, \qquad \frac{dL_H}{dr} = \begin{cases} r & \text{if } |r| \le \delta \\ \delta\,\mathrm{sign}(r) & \text{if } |r| > \delta. \end{cases}

This is equivalent to L2 for errors that are smaller than \delta, and equivalent to L1 for larger errors.

This loss function is everywhere differentiable, using the fact that d|r|/dr = \mathrm{sign}(r) for r \ne 0. The function is also C^1 continuous, since the gradients of the two parts match at r = \pm\delta.

Optimizing the Huber loss is much faster than using the Laplace likelihood, since we can use standard optimization methods (quasi-Newton) rather than linear programming.

The Huber method also has a probabilistic interpretation, although it is rather unnatural (Pontil et al. 1998).

[Figure: the L2, L1 and Huber losses as functions of r. Run huberLossDemo from PMTK3.]

Pontil, M., S. Mukherjee, and F. Girosi (1998). On the Noise Model of Support Vector Machine Regression. Technical report, MIT AI Lab.
Huber, P. (1964). Robust estimation of a location parameter. Annals of Mathematical Statistics 35(1), 73-101.
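As a sketch of how the Huber loss can be used in practice, the following MATLAB fragment minimizes the Huber objective by plain gradient descent on the Phi and t of the earlier sketches; delta, the step size and the iteration count are illustrative, and a quasi-Newton routine would converge much faster.

% Sketch: Huber loss, its derivative, and a simple robust fit by gradient descent.
huber  = @(r, d) (abs(r) <= d).*(r.^2/2) + (abs(r) > d).*(d*abs(r) - d^2/2);
dhuber = @(r, d) (abs(r) <= d).*r        + (abs(r) > d).*(d*sign(r));
delta = 1.0;  w = zeros(size(Phi, 2), 1);  eta = 1e-3;
for it = 1:5000
    r = t - Phi*w;                  % residuals
    g = -Phi' * dhuber(r, delta);   % gradient of sum_n L_H(r_n, delta) with respect to w
    w = w - eta * g;                % plain gradient step
end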


Regularized LS - Ridge Regression

Consider an error function of the form

E_D(w) + \lambda E_W(w) \qquad \text{(data term + regularization term)}.

With the sum-of-squares error function and a quadratic regularizer, we get

\frac{1}{2}\sum_{n=1}^{N}\left(t_n - w^T\phi(x_n)\right)^2 + \frac{\lambda}{2}w^T w.

Setting the gradient with respect to w to zero and solving for w as before, we obtain

w = \left(\lambda I + \Phi^T\Phi\right)^{-1}\Phi^T t.

This is a simple extension of the least-squares solution we encountered earlier (regularized least squares, or ridge regression).
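A one-line MATLAB sketch of the ridge solution, reusing Phi and t from the earlier sketches; lambda is an illustrative value, and in practice the bias column is often left unpenalized.

% Sketch: regularized least squares (ridge regression).
lambda  = 0.1;                                    % regularization coefficient (illustrative)
M       = size(Phi, 2);
w_ridge = (lambda*eye(M) + Phi'*Phi) \ (Phi'*t);  % shrinks the weights toward zero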


Regularized Least Squares

Regularized solution: w = \left(\lambda I + \Phi^T\Phi\right)^{-1}\Phi^T t.

Regularization limits the effective model complexity (the appropriate number of basis functions). The problem of choosing the number of basis functions is replaced with the problem of finding a suitable value of the regularization coefficient \lambda.

\lambda controls how strongly the weights are shrunk, i.e. how many of the w_j (and hence basis functions) are effectively non-zero.


Regularized Least Squares

With a more general regularizer, we have

\frac{1}{2}\sum_{n=1}^{N}\left(t_n - w^T\phi(x_n)\right)^2 + \frac{\lambda}{2}\sum_{j=1}^{M}|w_j|^q.

q = 1 is known as the lasso regularizer.

[Figure: contour plots of the regularizer term \frac{\lambda}{2}\sum_j|w_j|^q alone, for q = 0.5, 1, 2 and 4, with \lambda = 0.7334. MatLab code.]


Regularized Least Squares

With the same general regularizer,

\frac{1}{2}\sum_{n=1}^{N}\left(t_n - w^T\phi(x_n)\right)^2 + \frac{\lambda}{2}\sum_{j=1}^{M}|w_j|^q,

q = 2 corresponds to the quadratic (ridge) regularizer. MatLab code.


Regularized Least Squares

Lasso tends to generate sparser solutions than a quadratic regularizer: if \lambda is large, some of the w_j \to 0 (in the figure, w_1 = 0).

Here we use the fact that the regularized least-squares solution is equivalent to minimizing the unregularized sum of squares subject to the constraint

\sum_{j=1}^{M}|w_j|^q \le \eta

for some \eta (see the proof next).

[Figure: contours of the unregularized error function together with the constraint region for q = 2 (ridge) and q = 1 (lasso); the corner of the lasso region gives w_1 = 0.]


Regularized Least Squares

Let us write the constraint in the equivalent form

\frac{1}{2}\left(\sum_{j=1}^{M}|w_j|^q - \eta\right) \le 0.

This leads to the Lagrangian function

L(w, \lambda) = \frac{1}{2}\sum_{n=1}^{N}\left(t_n - w^T\phi(x_n)\right)^2 + \frac{\lambda}{2}\left(\sum_{j=1}^{M}|w_j|^q - \eta\right),

which is identical to the regularized least-squares (RLS) objective

\frac{1}{2}\sum_{n=1}^{N}\left(t_n - w^T\phi(x_n)\right)^2 + \frac{\lambda}{2}\sum_{j=1}^{M}|w_j|^q \qquad (*)

in its dependence on w.

For a particular \lambda > 0, let w^\star(\lambda) be the solution of the RLS problem (*). From the Kuhn-Tucker optimality conditions for L(w, \lambda) we then see that the two problems coincide when

\eta = \sum_{j=1}^{M}\left|w_j^\star(\lambda)\right|^q.


Kuhn-Tucker Optimality Conditions

Consider the following constrained minimization problem:

\min_x f(x) \quad \text{subject to} \quad g(x) \ge 0.

This is equivalent to minimizing, with respect to x and \lambda, the Lagrangian

L(x, \lambda) = f(x) - \lambda\, g(x),

subject to the following (Kuhn-Tucker) conditions:

\lambda \ge 0, \qquad g(x) \ge 0, \qquad \lambda\, g(x) = 0.

Note that for maximization problems the Lagrangian should be modified as L(x, \lambda) = f(x) + \lambda\, g(x).


Multiple Outputs - Isotropic Covariance

If we want to predict K > 1 target variables, we can use the same basis for all components of the target vector:

p(t \mid x, W, \beta) = \mathcal{N}\left(t \mid y(x, W), \beta^{-1}I\right) = \mathcal{N}\left(t \mid W^T\phi(x), \beta^{-1}I\right),

where W is an M \times K matrix of parameters and t is K-dimensional.

Given observed inputs X = \{x_1, \ldots, x_N\} and targets T = (t_1, \ldots, t_N)^T, we obtain the log-likelihood function

\ln p(T \mid X, W, \beta) = \sum_{n=1}^{N}\ln\mathcal{N}\left(t_n \mid W^T\phi(x_n), \beta^{-1}I\right) = \frac{NK}{2}\ln\frac{\beta}{2\pi} - \frac{\beta}{2}\sum_{n=1}^{N}\left\|t_n - W^T\phi(x_n)\right\|^2.


K-Independent Regression Problems

As before, we can maximize this function with respect to W, giving

W_{ML} = \left(\Phi^T\Phi\right)^{-1}\Phi^T T.

If we examine this result for each target variable t_k (take the kth column of W and of T), we have

w_k = \left(\Phi^T\Phi\right)^{-1}\Phi^T t_k = \Phi^{\dagger} t_k,

which is identical to the single-output case, so there is decoupling between the target variables.

As expected, we obtain K independent regression problems.
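A minimal sketch of the multi-output MLE; the design matrix Phi is reused from the earlier sketches and the two target columns are an assumed synthetic example.

% Sketch: MLE for K > 1 outputs that share the same basis.
K = 2;
T = [sin(2*pi*x), cos(2*pi*x)] + 0.1*randn(numel(x), K);  % N x K target matrix (assumed)
W_ml = (Phi'*Phi) \ (Phi'*T);    % M x K: each column solves its own least-squares problem
w1   = Phi \ T(:, 1);            % identical to W_ml(:, 1), illustrating the decoupling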


Multiple Outputs - Full Covariance

Let us repeat the earlier formulation, now with a full covariance matrix \Sigma. If we want to predict K > 1 target variables using the same basis for all components of the target vector,

p(t \mid x, W, \Sigma) = \mathcal{N}\left(t \mid y(x, W), \Sigma\right) = \mathcal{N}\left(t \mid W^T\phi(x), \Sigma\right),

where W is an M \times K matrix of parameters.

Given observed inputs X = \{x_1, \ldots, x_N\} and targets T = (t_1, \ldots, t_N)^T, we obtain the log-likelihood function

\ln p(T \mid X, W, \Sigma) = \sum_{n=1}^{N}\ln\mathcal{N}\left(t_n \mid W^T\phi(x_n), \Sigma\right) = -\frac{N}{2}\ln|\Sigma| - \frac{1}{2}\sum_{n=1}^{N}\left(t_n - W^T\phi(x_n)\right)^T\Sigma^{-1}\left(t_n - W^T\phi(x_n)\right) + \text{const}.


Multiple Outputs - Full Covariance

As before, we maximize this function with respect to W:

\nabla_W\ln p(T \mid X, W, \Sigma) = 0 \;\;\Rightarrow\;\; W_{ML} = \left(\Phi^T\Phi\right)^{-1}\Phi^T T.

For the ML estimate of \Sigma, use the result for the MLE of the covariance of a multivariate Gaussian:

\Sigma_{ML} = \frac{1}{N}\sum_{n=1}^{N}\left(t_n - W_{ML}^T\phi(x_n)\right)\left(t_n - W_{ML}^T\phi(x_n)\right)^T.

Note that each column of W_{ML} is of the form

w_k = \left(\Phi^T\Phi\right)^{-1}\Phi^T t_k,

as seen for the isotropic noise distribution, and is independent of \Sigma.


Bayesian Linear Regression

Effective model complexity in MLE is governed by the number of basis functions and is controlled by the size of the data set.

With regularization, the effective model complexity is controlled mainly by \lambda, and still by the number and form of the basis functions.

Thus the model complexity for a particular problem cannot be decided simply by maximizing the likelihood function, as this leads to excessively complex models and over-fitting.

Independent hold-out data can be used to determine model complexity, but this wastes data and is computationally expensive.


Bayesian Linear Regression

A Bayesian treatment of linear regression avoids the over-fitting of maximum likelihood.

Bayesian approaches lead to automatic methods of determining model complexity using the training data alone.


Bayesian Linear Regression

Assume additive Gaussian noise with known precision \beta. The likelihood function p(t \mid w) is the exponential of a quadratic function of w,

p(t \mid X, w, \beta) = \prod_{n=1}^{N}\mathcal{N}\left(t_n \mid w^T\phi(x_n), \beta^{-1}\right) = \left(\frac{\beta}{2\pi}\right)^{N/2}\exp\left(-\frac{\beta}{2}\sum_{n=1}^{N}\left(t_n - w^T\phi(x_n)\right)^2\right),

and its conjugate prior is Gaussian:

p(w) = \mathcal{N}\left(w \mid m_0, S_0\right).

Combining this with the likelihood, and using the results for marginal and conditional Gaussian distributions, gives the posterior

p(w \mid t) = \mathcal{N}\left(w \mid m_N, S_N\right), \qquad m_N = S_N\left(S_0^{-1}m_0 + \beta\Phi^T t\right), \qquad S_N^{-1} = S_0^{-1} + \beta\Phi^T\Phi.
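A minimal MATLAB sketch of this posterior update, reusing Phi and t from the earlier sketches; the zero-mean isotropic prior m_0 = 0, S_0 = \alpha^{-1}I used later in the lecture is assumed, with illustrative \alpha and \beta.

% Sketch: Gaussian posterior over the weights for known noise precision beta.
alpha = 2.0;  beta = 25;                        % illustrative hyperparameters
M     = size(Phi, 2);
S0    = (1/alpha)*eye(M);  m0 = zeros(M, 1);    % prior N(w | m0, S0)
SN    = inv(inv(S0) + beta*(Phi'*Phi));         % posterior covariance
mN    = SN * (S0\m0 + beta*(Phi'*t));           % posterior mean (also the MAP estimate)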


Posterior Distribution: Derivation

We now have the product of two Gaussians, and the posterior is easily computed by completing the square in w. The prior and likelihood are

p(w \mid m_0, S_0) \propto \exp\left(-\frac{1}{2}\left(w - m_0\right)^T S_0^{-1}\left(w - m_0\right)\right), \qquad p(t \mid \Phi, w, \beta) \propto \exp\left(-\frac{\beta}{2}\sum_{n=1}^{N}\left(t_n - \phi(x_n)^T w\right)^2\right),

so that

p(w \mid \Phi, t, \beta) \propto \exp\left(-\frac{1}{2}\left(w - m_0\right)^T S_0^{-1}\left(w - m_0\right) - \frac{\beta}{2}\sum_{n=1}^{N}\left(w^T\phi(x_n)\phi(x_n)^T w - 2 t_n\phi(x_n)^T w\right)\right)
\propto \exp\left(-\frac{1}{2}w^T\left(S_0^{-1} + \beta\sum_{n}\phi(x_n)\phi(x_n)^T\right)w + w^T\left(S_0^{-1}m_0 + \beta\sum_{n} t_n\phi(x_n)\right)\right).

Completing the square in w,

p(w \mid \Phi, t, \beta) = \mathcal{N}\left(w \mid m_N, S_N\right), \qquad m_N = S_N\left(S_0^{-1}m_0 + \beta\Phi^T t\right), \qquad S_N^{-1} = S_0^{-1} + \beta\sum_{n=1}^{N}\phi(x_n)\phi(x_n)^T = S_0^{-1} + \beta\Phi^T\Phi.


Sequential Posterior Calculation

Note that because the posterior distribution is Gaussian, its mode coincides with its mean, so w_{MAP} = m_N, where

p(w \mid t) = \mathcal{N}\left(w \mid m_N, S_N\right), \qquad m_N = S_N\left(S_0^{-1}m_0 + \beta\Phi^T t\right), \qquad S_N^{-1} = S_0^{-1} + \beta\Phi^T\Phi.

The above expressions for the posterior mean and covariance can also be written for a sequential calculation: having already observed N data points, we now consider an additional data point (x_{N+1}, t_{N+1}). The current posterior plays the role of the prior, and

p(w \mid t_{N+1}, x_{N+1}, m_N, S_N) = \mathcal{N}\left(w \mid m_{N+1}, S_{N+1}\right),
m_{N+1} = S_{N+1}\left(S_N^{-1}m_N + \beta\, t_{N+1}\phi(x_{N+1})\right), \qquad S_{N+1}^{-1} = S_N^{-1} + \beta\,\phi(x_{N+1})\phi(x_{N+1})^T.

(A code sketch of this sequential update follows below.)
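A minimal sketch of the sequential update; the scalar feature map, the hyperparameters and the synthetic data stream (the straight-line example used later in this lecture) are assumptions of the sketch.

% Sketch: one-point-at-a-time update of the Gaussian posterior.
phi   = @(x) [1; x];                       % assumed feature map: bias + linear term
alpha = 2.0;  beta = 25;  M = 2;
Sinv  = alpha*eye(M);  m = zeros(M, 1);    % running S_N^{-1} and m_N (prior: alpha*I, 0)
xs = 2*rand(20,1) - 1;  ts = -0.3 + 0.5*xs + 0.2*randn(20,1);   % synthetic data stream
for n = 1:numel(xs)
    p        = phi(xs(n));
    Sinv_old = Sinv;
    Sinv     = Sinv_old + beta*(p*p');              % S_{N+1}^{-1} = S_N^{-1} + beta*phi*phi'
    m        = Sinv \ (Sinv_old*m + beta*ts(n)*p);  % m_{N+1} = S_{N+1}(S_N^{-1} m_N + beta t phi)
end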


Bayesian Linear Regression

Let us consider as prior a zero-mean isotropic Gaussian governed by a single precision parameter \alpha,

p(w) = \mathcal{N}\left(w \mid 0, \alpha^{-1}I\right),

and the corresponding posterior distribution over w is then given by

p(w \mid t) = \mathcal{N}\left(w \mid m_N, S_N\right), \qquad m_N = \beta S_N\Phi^T t, \qquad S_N^{-1} = \alpha I + \beta\Phi^T\Phi.

The log of the posterior is the sum of the log-likelihood and the log of the prior and, as a function of w, takes the form

\ln p(w \mid t) = -\frac{\beta}{2}\sum_{n=1}^{N}\left(t_n - w^T\phi(x_n)\right)^2 - \frac{\alpha}{2}w^T w + \text{const}.

Thus the MAP estimate is the same as regularized least squares (ridge regression) with \lambda = \alpha/\beta.


A Note on Data Centering

In linear regression it helps to center the data in a way that does not require us to compute the offset term \mu. Write the likelihood as

p(t \mid \Phi, w, \mu, \beta) \propto \exp\left(-\frac{\beta}{2}\left(t - \mu\mathbf{1} - \Phi w\right)^T\left(t - \mu\mathbf{1} - \Phi w\right)\right),

where \Phi is the design matrix built from the basis functions \phi_1, \ldots, \phi_M (no constant column) and \mathbf{1} is the vector of ones.

Let us assume that the input data are centered in each dimension, such that

\sum_{i=1}^{N}\phi_j(x_i) = 0, \qquad j = 1, \ldots, M.

The mean of the output is equally likely to be positive or negative. Let us put an improper prior p(\mu) \propto 1 on the offset and integrate \mu out.


A Note on Data Centering

Introducing the output mean \bar{t} = \frac{1}{N}\sum_{i=1}^{N} t_i, the marginal likelihood becomes

p(t \mid \Phi, w, \beta) \propto \int \exp\left(-\frac{\beta}{2}\left(t - \mu\mathbf{1} - \Phi w\right)^T\left(t - \mu\mathbf{1} - \Phi w\right)\right) d\mu.

Completing the square in \mu, and using the centering of the input (so that \mathbf{1}^T\Phi w = 0 and \mathbf{1}^T\left(t - \Phi w\right) = N\bar{t}),

\left(t - \mu\mathbf{1} - \Phi w\right)^T\left(t - \mu\mathbf{1} - \Phi w\right) = N\left(\mu - \bar{t}\right)^2 + \left\|t - \Phi w\right\|^2 - N\bar{t}^{\,2},

and the Gaussian integral over \mu contributes only a constant, so that

p(t \mid \Phi, w, \beta) \propto \exp\left(-\frac{\beta}{2}\left(\left\|t - \Phi w\right\|^2 - N\bar{t}^{\,2}\right)\right) = \exp\left(-\frac{\beta}{2}\left(t_c - \Phi w\right)^T\left(t_c - \Phi w\right)\right), \qquad t_c \triangleq t - \bar{t}\mathbf{1}.

Our model is thus simplified if instead of t we use the centered output t_c, for which the likelihood is simply

p(t_c \mid \Phi, w, \beta) \propto \exp\left(-\frac{\beta}{2}\left(t_c - \Phi w\right)^T\left(t_c - \Phi w\right)\right).

Recall that the MLE estimate for \mu is \hat{\mu} = \bar{t} - \sum_{j=1}^{M}\bar{\phi}_j w_j, where \bar{\phi}_1, \ldots, \bar{\phi}_M are formed by averaging each column of \Phi (with centered inputs these averages vanish and \hat{\mu} = \bar{t}).


A Note on Data Centering

To simplify the earlier notation, consider a linear regression model of the form

y(x \mid w) = w_0 + w^T x.

In the context of MLE, for example, we need to minimize

\min_{w_0, w}\;\sum_{i=1}^{N}\left(t_i - w_0 - w^T x_i\right)^2.

Minimization with respect to w_0 gives

\sum_{i=1}^{N}\left(t_i - w_0 - w^T x_i\right) = 0 \;\;\Rightarrow\;\; N\bar{t} - N w_0 - N\bar{x}^T w = 0,

where

\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i, \qquad \bar{t} = \frac{1}{N}\sum_{i=1}^{N} t_i.

Thus

\hat{w}_0 = \bar{t} - \bar{x}^T w.


A Note on Data Centering

Substituting the bias term \hat{w}_0 = \bar{t} - \bar{x}^T w back into the objective function gives

\min_w\;\sum_{i=1}^{N}\left(t_i - \bar{t} - w^T\left(x_i - \bar{x}\right)\right)^2,

i.e. a least-squares problem in the centered inputs X_c (with rows \left(x_i - \bar{x}\right)^T) and centered outputs t_c = t - \bar{t}\mathbf{1}.

Minimization with respect to w gives

\sum_{i=1}^{N}\left(x_i - \bar{x}\right)\left(x_i - \bar{x}\right)^T\hat{w} = \sum_{i=1}^{N}\left(x_i - \bar{x}\right)\left(t_i - \bar{t}\right).

We thus first compute the MLE of w using the centered input and output,

\hat{w} = \left(X_c^T X_c\right)^{-1}X_c^T t_c = \left(\sum_{i=1}^{N}\left(x_i - \bar{x}\right)\left(x_i - \bar{x}\right)^T\right)^{-1}\sum_{i=1}^{N}\left(x_i - \bar{x}\right)\left(t_i - \bar{t}\right),

and then recover the MLE of the bias as

\hat{w}_0 = \bar{t} - \bar{x}^T\hat{w}.
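A minimal sketch of the centered MLE computation; the synthetic X and t below are assumptions of the example (implicit expansion, R2016b+, is used for the centering).

% Sketch: MLE of the weights and the bias via data centering.
N = 100;  D = 3;
X = randn(N, D);  t = X*[1; -2; 0.5] + 4 + 0.1*randn(N, 1);   % assumed synthetic data
xbar = mean(X, 1);   tbar = mean(t);
Xc   = X - xbar;     tc   = t - tbar;          % centered inputs and outputs
w_hat  = (Xc'*Xc) \ (Xc'*tc);                  % slope weights from the centered data
w0_hat = tbar - xbar*w_hat;                    % recovered bias term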


Bayesian Regression: Example

We generate synthetic data from the function f(x, a) = a_0 + a_1 x with parameter values a_0 = -0.3 and a_1 = 0.5, by first choosing values of x_n from the uniform distribution U(x | -1, 1), then evaluating f(x_n, a), and finally adding Gaussian noise with standard deviation 0.2 to obtain the target values t_n.

We assume \beta = (1/0.2)^2 = 25 and \alpha = 2.0.

We perform Bayesian inference sequentially, one point at a time, so the posterior at each step becomes the new prior.

We show results after 1, 2 and 22 points have been collected. The results include the likelihood contours (for a single point), the posterior, and samples of the regression function from the posterior.


Bayesian Regression: Example

[Figure: prior over (w0, w1) before any data are observed, and plots of y(x, w) for samples of w drawn from the prior. MatLab Code.]


Example: One Data Point Collected

[Figure: likelihood contours for the first data point, contours of the resulting posterior over (w0, w1), and plots of y(x, w) for samples of w from the posterior. MatLab Code.]

Note that the regression lines pass close to the data point (shown with a circle).


Example: 2nd Data Point Collected

[Figure: likelihood contours for the second data point, contours of the updated posterior over (w0, w1), and plots of y(x, w) for samples of w from the posterior. MatLab Code.]

Note that the regression lines now pass close to both data points.


Example: 22 Data Points Collected

[Figure: likelihood contours for the latest data point, contours of the posterior over (w0, w1), and plots of y(x, w) for samples of w from the posterior, after 22 data points have been collected. MatLab Code.]

Note the regression lines after 22 data points have been collected.


Summary of Results

[Figure: rows showing, as data arrive (no data, one point, two points, many points), the likelihood of the latest point, the prior/posterior over (w0, w1), and the data space with sampled regression lines. MatLab Code.]


Summary of Results

[Figure: the same sequence of likelihood, prior/posterior in (W0, W1), and data-space panels, shown also after 20 data points. Run bayesLinRegDemo2d from PMTK3.]


Predictive Distribution

We are usually not interested in w itself but in making predictions of t for new values of x. This requires the predictive distribution

p(t \mid x, \mathbf{x}, \mathbf{t}, \alpha, \beta) = \int p(t \mid x, w, \beta)\, p(w \mid \mathbf{x}, \mathbf{t}, \alpha, \beta)\, dw = \mathcal{N}\left(t \mid m_N^T\phi(x),\, \sigma_N^2(x)\right),

where

\sigma_N^2(x) = \frac{1}{\beta} + \phi(x)^T S_N\,\phi(x), \qquad S_N^{-1} = \alpha I + \beta\Phi^T\Phi.

The first term represents the noise on the data, whereas the second term reflects the uncertainty associated with w. Because the noise process and the distribution of w are independent Gaussians, their variances are additive.

The error bars get larger as we move away from the training points. By contrast, in the plug-in approximation the error bars are of constant size.

As additional data points are observed, the posterior distribution becomes narrower.
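A minimal sketch of the pointwise predictive mean and variance on a test grid, reusing mN, SN and beta from the posterior sketch above together with the same Gaussian basis (an assumption of these sketches).

% Sketch: predictive mean and variance sigma_N^2(x) = 1/beta + phi(x)' * SN * phi(x).
xs_t  = linspace(-1, 1, 200)';
PhiT  = [ones(numel(xs_t),1), exp(-(xs_t - mu).^2/(2*s^2))];   % test design matrix
m_pred = PhiT * mN;                                 % predictive mean at each test point
v_pred = 1/beta + sum((PhiT*SN).*PhiT, 2);          % rowwise phi' * SN * phi plus noise term
plot(xs_t, m_pred, 'b', xs_t, m_pred + 2*sqrt(v_pred), 'r--', xs_t, m_pred - 2*sqrt(v_pred), 'r--');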


Predictive Distribution

In a full Bayesian treatment we want to compute the predictive distribution, i.e. given the training data \mathbf{x}, \mathbf{t} and a new test point x, we want

p(t \mid x, \mathbf{x}, \mathbf{t}) = \int p(t \mid x, w)\, p(w \mid \mathbf{x}, \mathbf{t})\, dw,

where

p(t \mid x, w, \beta) = \mathcal{N}\left(t \mid y(x, w), \beta^{-1}\right), \qquad p(w \mid \mathbf{x}, \mathbf{t}, \alpha, \beta) = \mathcal{N}\left(w \mid m_N, S_N\right),
m_N = \beta S_N\sum_{n=1}^{N}\phi(x_n)\,t_n = \beta S_N\Phi^T\mathbf{t}, \qquad S_N^{-1} = \alpha I + \beta\sum_{n=1}^{N}\phi(x_n)\phi(x_n)^T.

To compute the marginal, we use the standard results for Gaussian linear systems from an earlier lecture.


Appendix: Useful Result

For the linear Gaussian model below, we proved in an earlier lecture the following very useful results about the marginal and conditional distributions:

p(x) = \mathcal{N}\left(x \mid \mu, \Lambda^{-1}\right), \qquad p(y \mid x) = \mathcal{N}\left(y \mid Ax + b, L^{-1}\right)

\Rightarrow\quad p(y) = \mathcal{N}\left(y \mid A\mu + b,\; L^{-1} + A\Lambda^{-1}A^T\right),

p(x \mid y) = \mathcal{N}\left(x \mid \Sigma\left(A^T L\left(y - b\right) + \Lambda\mu\right),\; \Sigma\right), \qquad \Sigma = \left(\Lambda + A^T L A\right)^{-1}.


Predictive Distribution

Thus for our problem we identify, in the result above,

x \to w, \quad \mu \to m_N = \beta S_N\sum_{n}\phi(x_n)\,t_n, \quad \Lambda^{-1} \to S_N, \quad y \to t, \quad A \to \phi(x)^T, \quad b \to 0, \quad L^{-1} \to \beta^{-1},

i.e. p(w \mid \mathbf{x}, \mathbf{t}, \alpha, \beta) = \mathcal{N}\left(w \mid m_N, S_N\right) and p(t \mid x, w, \beta) = \mathcal{N}\left(t \mid y(x, w), \beta^{-1}\right).

The predictive distribution then takes the form

p(t \mid x, \mathbf{x}, \mathbf{t}) = \mathcal{N}\left(t \mid \phi(x)^T m_N,\; \beta^{-1} + \phi(x)^T S_N\,\phi(x)\right).


Predictive Distribution

In the full Bayesian treatment, the predictive distribution

p(t \mid x, \mathbf{x}, \mathbf{t}) = \int p(t \mid x, w)\, p(w \mid \mathbf{x}, \mathbf{t})\, dw = \mathcal{N}\left(t \mid m(x), \sigma_N^2(x)\right)

has mean and variance given by

m(x) = \beta\,\phi(x)^T S_N\sum_{n=1}^{N}\phi(x_n)\,t_n = m_N^T\phi(x),

\sigma_N^2(x) = \beta^{-1} + \phi(x)^T S_N\,\phi(x) \quad \text{(uncertainty in the data + uncertainty in } w\text{)}, \qquad S_N^{-1} = \alpha I + \beta\Phi^T\Phi.

Note that \sigma_{N+1}^2(x) \le \sigma_N^2(x): the predictive variance can only shrink as additional data are observed (shown on the next slide).


Predictive Distribution

It is easy to show that \sigma_{N+1}^2(x) \le \sigma_N^2(x). Note that

S_{N+1}^{-1} = S_N^{-1} + \beta\,\phi(x_{N+1})\phi(x_{N+1})^T,

and recall the matrix-inversion identity

\left(M + v v^T\right)^{-1} = M^{-1} - \frac{M^{-1}v\,v^T M^{-1}}{1 + v^T M^{-1} v}.

Using these results (with M = S_N^{-1} and v = \sqrt{\beta}\,\phi(x_{N+1})), we can write

\sigma_{N+1}^2(x) = \frac{1}{\beta} + \phi(x)^T S_{N+1}\,\phi(x) = \frac{1}{\beta} + \phi(x)^T S_N\,\phi(x) - \frac{\beta\left(\phi(x)^T S_N\,\phi(x_{N+1})\right)^2}{1 + \beta\,\phi(x_{N+1})^T S_N\,\phi(x_{N+1})} \;\le\; \sigma_N^2(x).


Predictive Distribution: Summary

The notation used here is as follows:

p(t \mid x, \mathbf{x}, \mathbf{t}) = \mathcal{N}\left(t \mid m(x), \sigma_N^2(x)\right),
m(x) = \beta\,\phi(x)^T S_N\sum_{n=1}^{N}\phi(x_n)\,t_n, \qquad \sigma_N^2(x) = \beta^{-1} + \phi(x)^T S_N\,\phi(x), \qquad S_N^{-1} = \alpha I + \beta\sum_{n=1}^{N}\phi(x_n)\phi(x_n)^T,

where \phi(x_n) = \left(\phi_0(x_n), \phi_1(x_n), \ldots, \phi_{M-1}(x_n)\right)^T and I is the M \times M unit matrix. For polynomial regression, for example, \phi(x_n) = \left(1, x_n, x_n^2, \ldots\right)^T.

Note: the predictive mean and variance are functions of x.


Pointwise Uncertainty in the Predictions

[Figure: predictive distribution with M = 9 Gaussian basis functions (10 parameters) fit to random data from the generating function sin(2\pi x), for N = 1 and N = 2 data points; each panel shows the data, the predictive mean and the predictive standard deviation. MatLab code.]

The scale of the Gaussians is adjusted to the data; \alpha = 5\times 10^{-3} and \beta = 11.1, using N = 1, 2, 4 and 10 data points.

The predictive uncertainty is smaller near the data, and the level of uncertainty decreases with N.


Pointwise Uncertainty in the Predictions

[Figure: the same predictive distribution (M = 9 Gaussian basis functions) for N = 4 and N = 10 data points; each panel shows the generating function sin(2\pi x), the random data points, the predictive mean and the predictive standard deviation. MatLab code.]


Summary of Results

[Figure: predictive distributions with M = 9 basis functions for N = 1, 2, 4 and 10 data points, each panel showing the generating function sin(2\pi x), the data, the predictive mean and the predictive standard deviation. MatLab code.]


Plugin Approximation

In the plug-in approximation, the posterior over w is replaced by a delta function at a point estimate \hat{w} (e.g. the MLE):

p(t \mid x, \mathbf{x}, \mathbf{t}) = \int p(t \mid x, w)\,\delta_{\hat{w}}(w)\, dw = p(t \mid x, \hat{w}).

[Figure: plug-in approximation (MLE) vs. the posterior predictive (known variance), each with the prediction and the training data; and functions sampled from the posterior vs. functions sampled from the plug-in approximation to the posterior. The plug-in error bars are of constant size. Run linregPostPredDemo from PMTK3.]


Covariance Between the Predictions

Draw samples from the posterior of w and then plot y(x, w). We use the same data and basis functions as in the earlier example.

We are visualizing the joint uncertainty in the posterior distribution between the y values at two or more x values.

[Figure: plots of y(x, w) where w is a sample from the posterior over w, for N = 1 and N = 2 data points. MatLab Code.]
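A minimal sketch of how such curves can be produced: draw samples of w from N(m_N, S_N) using a Cholesky factor and plot the corresponding regression functions; mN, SN and the test design matrix are reused from the earlier sketches.

% Sketch: regression functions y(x, w) for w sampled from the posterior.
nSamples = 10;
L = chol(SN + 1e-10*eye(size(SN,1)), 'lower');   % small jitter for numerical safety
W = mN + L*randn(size(mN,1), nSamples);          % each column is a posterior sample of w
plot(xs_t, PhiT*W);                              % one sampled regression curve per column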


Covariance Between the Predictions

Draw samples from the posterior of w and then plot y(x, w). We are visualizing the joint uncertainty in the posterior distribution between the y values at two or more x values.

[Figure: plots of y(x, w) where w is a sample from the posterior over w, for N = 4 and N = 12 data points. MatLab Code.]


Summary of Results

[Figure: sampled regression functions y(x, w) drawn from the posterior, for increasing numbers of data points. MatLab Code.]


Gaussian Basis vs. Gaussian Process

If we use localized basis functions such as Gaussians, then in regions away from the basis-function support the contribution from the second term in the predictive variance goes to zero, leaving only the noise contribution \beta^{-1}:

\sigma_N^2(x) = \beta^{-1} + \phi(x)^T S_N\,\phi(x) \;\longrightarrow\; \beta^{-1} \quad \text{away from the support of } \phi(x).

The model therefore becomes very confident in its predictions when extrapolating outside the region occupied by the basis functions. This is an undesirable behavior.

This problem can be avoided by adopting an alternative Bayesian approach to regression: Gaussian processes.


Bayesian Inference when \sigma^2 is Unknown

Let us extend the previous results for linear regression, assuming now that \sigma^2 is unknown.

Assume a likelihood of the form*

p(y \mid X, w, \sigma^2) = \mathcal{N}\left(y \mid Xw, \sigma^2 I_N\right) = \left(2\pi\sigma^2\right)^{-N/2}\exp\left(-\frac{1}{2\sigma^2}\left(y - Xw\right)^T\left(y - Xw\right)\right).

A conjugate prior has the normal-inverse-gamma (NIG) form

p(w, \sigma^2) = \mathcal{NIG}\left(w, \sigma^2 \mid w_0, V_0, a_0, b_0\right) \triangleq \mathcal{N}\left(w \mid w_0, \sigma^2 V_0\right)\,\mathcal{IG}\left(\sigma^2 \mid a_0, b_0\right)
= \frac{b_0^{a_0}}{(2\pi)^{D/2}|V_0|^{1/2}\Gamma(a_0)}\left(\sigma^2\right)^{-\left(a_0 + D/2 + 1\right)}\exp\left(-\frac{\left(w - w_0\right)^T V_0^{-1}\left(w - w_0\right) + 2b_0}{2\sigma^2}\right).

The posterior is derived on the next slide.

* In the remainder of this lecture, the response is denoted as y, the dimensionality of w as D, and the design matrix as X.


Bayesian Inference when \sigma^2 is Unknown

Let us define the following:

V_N = \left(V_0^{-1} + X^T X\right)^{-1}, \qquad w_N = V_N\left(V_0^{-1}w_0 + X^T y\right),
a_N = a_0 + N/2, \qquad b_N = b_0 + \frac{1}{2}\left(w_0^T V_0^{-1} w_0 + y^T y - w_N^T V_N^{-1} w_N\right).

With these definitions, simple algebra shows that the posterior is again normal-inverse-gamma:

p(w, \sigma^2 \mid \mathcal{D}) = \mathcal{NIG}\left(w, \sigma^2 \mid w_N, V_N, a_N, b_N\right) \triangleq \mathcal{N}\left(w \mid w_N, \sigma^2 V_N\right)\,\mathcal{IG}\left(\sigma^2 \mid a_N, b_N\right)
\propto \left(\sigma^2\right)^{-\left(a_N + D/2 + 1\right)}\exp\left(-\frac{\left(w - w_N\right)^T V_N^{-1}\left(w - w_N\right) + 2b_N}{2\sigma^2}\right).

The posterior marginals can now be derived explicitly:

p(\sigma^2 \mid \mathcal{D}) = \mathcal{IG}\left(\sigma^2 \mid a_N, b_N\right), \qquad p(w \mid \mathcal{D}) = \mathcal{T}_D\left(w \;\middle|\; w_N, \frac{b_N}{a_N}V_N, 2a_N\right) \propto \left(1 + \frac{\left(w - w_N\right)^T V_N^{-1}\left(w - w_N\right)}{2 b_N}\right)^{-\left(a_N + D/2\right)}.
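A minimal sketch of this conjugate update in MATLAB; the design matrix X, the response y and the (deliberately vague) prior hyperparameters are assumptions of the sketch.

% Sketch: normal-inverse-gamma posterior update for unknown noise variance.
[N, D] = size(X);
w0 = zeros(D, 1);  V0 = 100*eye(D);   % vague Gaussian prior on w (given sigma^2)
a0 = 0.01;         b0 = 0.01;         % vague inverse-gamma prior on sigma^2
VN = inv(inv(V0) + X'*X);
wN = VN*(V0\w0 + X'*y);
aN = a0 + N/2;
bN = b0 + 0.5*(w0'*(V0\w0) + y'*y - wN'*(VN\wN));
sigma2_mean = bN/(aN - 1);            % posterior mean of sigma^2 (requires aN > 1)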


Posterior Marginals

70

The marginal posterior of w can be written directly by integrating out σ²:

\[
p(\mathbf w\mid\mathcal D) = \int_0^\infty p(\mathbf w,\sigma^2\mid\mathcal D)\,d\sigma^2
\propto \int_0^\infty \left(\sigma^2\right)^{-(a_N+D/2+1)}
\exp\!\left(-\frac{(\mathbf w-\mathbf w_N)^T\mathbf V_N^{-1}(\mathbf w-\mathbf w_N)+2b_N}{2\sigma^2}\right)d\sigma^2
\]
\[
\propto \left[1 + \frac{(\mathbf w-\mathbf w_N)^T\mathbf V_N^{-1}(\mathbf w-\mathbf w_N)}{2b_N}\right]^{-(a_N+D/2)},
\]

which is the (unnormalized) density of a multivariate Student's t,

\[
p(\mathbf w\mid\mathcal D) = \mathcal T_D\!\left(\mathbf w\;\middle|\;\mathbf w_N,\ \tfrac{b_N}{a_N}\mathbf V_N,\ 2a_N\right).
\]

To compute the integral above, substitute λ = σ⁻² (so dσ² = −λ⁻² dλ) and use the normalizing constant of the Gamma distribution, ∫₀^∞ λ^{a−1} e^{−bλ} dλ = Γ(a) b^{−a} ∝ b^{−a}.
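As a sanity check of this marginalization (a throwaway sketch for a scalar w with arbitrarily chosen hyperparameters), one can integrate the joint posterior over σ² numerically and compare against the Student-t density in SciPy's standard location-scale parameterization:

import numpy as np
from scipy.integrate import quad
from scipy.stats import t as student_t

# Assumed posterior hyperparameters for a single weight (D = 1); values are arbitrary.
wN, VN, aN, bN = 0.7, 0.05, 6.0, 2.5

def joint_unnorm(sigma2, w):
    # NIG(w, sigma2 | wN, VN, aN, bN) up to a constant that depends on neither w nor sigma2.
    q = (w - wN) ** 2 / VN
    return sigma2 ** -(aN + 0.5 + 1.0) * np.exp(-(q + 2.0 * bN) / (2.0 * sigma2))

w_grid = np.linspace(-0.5, 2.0, 6)
marg = np.array([quad(joint_unnorm, 1e-12, np.inf, args=(w,))[0] for w in w_grid])
exact = student_t.pdf(w_grid, df=2 * aN, loc=wN, scale=np.sqrt(bN / aN * VN))

# The numerically marginalized density matches the Student-t up to normalization.
print(np.allclose(marg / marg[0], exact / exact[0], rtol=1e-5))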


Posterior Predictive Distribution


Consider the posterior predictive for m new test inputs, collected in the matrix X̃:

\[
p(\mathbf y\mid\tilde{\mathbf X},\mathcal D) \propto \iint
\frac{1}{(2\pi)^{m/2}}\left(\sigma^2\right)^{-m/2}
\exp\!\left(-\frac{(\mathbf y-\tilde{\mathbf X}\mathbf w)^T(\mathbf y-\tilde{\mathbf X}\mathbf w)}{2\sigma^2}\right)
\left(\sigma^2\right)^{-(a_N+D/2+1)}
\exp\!\left(-\frac{(\mathbf w-\mathbf w_N)^T\mathbf V_N^{-1}(\mathbf w-\mathbf w_N)+2b_N}{2\sigma^2}\right)
d\mathbf w\,d\sigma^2.
\]

As a first step, let us integrate over w by completing the square in the exponent:

\[
\begin{aligned}
&(\mathbf y-\tilde{\mathbf X}\mathbf w)^T(\mathbf y-\tilde{\mathbf X}\mathbf w) + (\mathbf w-\mathbf w_N)^T\mathbf V_N^{-1}(\mathbf w-\mathbf w_N) + 2b_N \\
&\quad = \left[\mathbf w - \left(\tilde{\mathbf X}^T\tilde{\mathbf X}+\mathbf V_N^{-1}\right)^{-1}\left(\tilde{\mathbf X}^T\mathbf y+\mathbf V_N^{-1}\mathbf w_N\right)\right]^T
\left(\tilde{\mathbf X}^T\tilde{\mathbf X}+\mathbf V_N^{-1}\right)
\left[\mathbf w - \left(\tilde{\mathbf X}^T\tilde{\mathbf X}+\mathbf V_N^{-1}\right)^{-1}\left(\tilde{\mathbf X}^T\mathbf y+\mathbf V_N^{-1}\mathbf w_N\right)\right] \\
&\qquad - \left(\tilde{\mathbf X}^T\mathbf y+\mathbf V_N^{-1}\mathbf w_N\right)^T\left(\tilde{\mathbf X}^T\tilde{\mathbf X}+\mathbf V_N^{-1}\right)^{-1}\left(\tilde{\mathbf X}^T\mathbf y+\mathbf V_N^{-1}\mathbf w_N\right)
+ \mathbf w_N^T\mathbf V_N^{-1}\mathbf w_N + \mathbf y^T\mathbf y + 2b_N.
\end{aligned}
\]

The quadratic-in-w term drops out of the integration over w: its Gaussian integral contributes a factor (2πσ²)^{D/2} |X̃ᵀX̃ + V_N⁻¹|^{−1/2} that does not depend on y. Let us denote the remaining terms in the equation above as 2β:

\[
2\beta = -\left(\tilde{\mathbf X}^T\mathbf y+\mathbf V_N^{-1}\mathbf w_N\right)^T\left(\tilde{\mathbf X}^T\tilde{\mathbf X}+\mathbf V_N^{-1}\right)^{-1}\left(\tilde{\mathbf X}^T\mathbf y+\mathbf V_N^{-1}\mathbf w_N\right)
+ \mathbf w_N^T\mathbf V_N^{-1}\mathbf w_N + \mathbf y^T\mathbf y + 2b_N.
\]


The posterior predictive (the double integral above) is now simplified by setting λ = 1/σ² and recalling the normalization of the Gamma distribution:

\[
p(\mathbf y\mid\tilde{\mathbf X},\mathcal D) \propto \int_0^\infty \lambda^{\,m/2+a_N-1}\exp(-\beta\lambda)\,d\lambda
\;\propto\; \beta^{-(m/2+a_N)}.
\]

Substituting β and comparing the two expressions, one can verify that

\[
\begin{aligned}
p(\mathbf y\mid\tilde{\mathbf X},\mathcal D)
&\propto \left[-\left(\tilde{\mathbf X}^T\mathbf y+\mathbf V_N^{-1}\mathbf w_N\right)^T\left(\tilde{\mathbf X}^T\tilde{\mathbf X}+\mathbf V_N^{-1}\right)^{-1}\left(\tilde{\mathbf X}^T\mathbf y+\mathbf V_N^{-1}\mathbf w_N\right)
+ \mathbf w_N^T\mathbf V_N^{-1}\mathbf w_N + \mathbf y^T\mathbf y + 2b_N\right]^{-(m/2+a_N)} \\
&\propto \left[1 + \frac{(\mathbf y-\tilde{\mathbf X}\mathbf w_N)^T\left[\tfrac{b_N}{a_N}\left(\mathbf I_m + \tilde{\mathbf X}\mathbf V_N\tilde{\mathbf X}^T\right)\right]^{-1}(\mathbf y-\tilde{\mathbf X}\mathbf w_N)}{2a_N}\right]^{-(m/2+a_N)}.
\end{aligned}
\]

Use the Sherman-Morrison-Woodbury formula here to show that (symmetry of V₀ is assumed)

\[
\left(\mathbf I_m + \tilde{\mathbf X}\mathbf V_N\tilde{\mathbf X}^T\right)^{-1}
= \mathbf I_m - \tilde{\mathbf X}\left(\tilde{\mathbf X}^T\tilde{\mathbf X}+\mathbf V_N^{-1}\right)^{-1}\tilde{\mathbf X}^T.
\]
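This matrix identity is easy to confirm numerically; in the quick NumPy check below, V_N is built as a symmetric positive definite matrix, as it would be in the posterior update:

import numpy as np

rng = np.random.default_rng(0)
m, D = 5, 3
X_tilde = rng.normal(size=(m, D))
A = rng.normal(size=(D, D))
V_N = A @ A.T + np.eye(D)          # symmetric positive definite, as in the posterior

lhs = np.linalg.inv(np.eye(m) + X_tilde @ V_N @ X_tilde.T)
rhs = np.eye(m) - X_tilde @ np.linalg.inv(X_tilde.T @ X_tilde + np.linalg.inv(V_N)) @ X_tilde.T
print(np.allclose(lhs, rhs))       # True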


Bayesian Inference when σ² is Unknown


The posterior predictive is therefore also a Student's t:

\[
p(\mathbf y\mid\tilde{\mathbf X},\mathcal D) = \mathcal T_m\!\left(\mathbf y\;\middle|\;\tilde{\mathbf X}\mathbf w_N,\ \frac{b_N}{a_N}\left(\mathbf I_m + \tilde{\mathbf X}\mathbf V_N\tilde{\mathbf X}^T\right),\ 2a_N\right).
\]

The predictive variance has two terms: \((b_N/a_N)\,\mathbf I_m\), due to the measurement noise, and \((b_N/a_N)\,\tilde{\mathbf X}\mathbf V_N\tilde{\mathbf X}^T\), due to the uncertainty in w. The second term depends on how close a test input is to the training data.
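Once (w_N, V_N, a_N, b_N) are available, the predictive mean, scale matrix, and degrees of freedom are cheap to compute. The sketch below is my own code (not a library routine); the per-point 95% intervals come from the univariate Student-t marginals via SciPy, and the posterior hyperparameter values are illustrative only:

import numpy as np
from scipy.stats import t as student_t

def predictive_params(X_test, wN, VN, aN, bN):
    """Mean, scale matrix, and dof of p(y | X_test, D) = T_m(X_test wN, (bN/aN)(I + X_test VN X_test^T), 2 aN)."""
    mean = X_test @ wN
    scale = (bN / aN) * (np.eye(X_test.shape[0]) + X_test @ VN @ X_test.T)
    return mean, scale, 2.0 * aN

def predictive_interval(X_test, wN, VN, aN, bN, level=0.95):
    # Marginal (per-test-point) intervals from the diagonal of the scale matrix.
    mean, scale, dof = predictive_params(X_test, wN, VN, aN, bN)
    half = student_t.ppf(0.5 + level / 2.0, df=dof) * np.sqrt(np.diag(scale))
    return mean - half, mean + half

# Illustrative values (e.g. the output of the NIG update sketched earlier).
wN, VN, aN, bN = np.array([1.0, -2.0]), 0.05 * np.eye(2), 16.0, 4.0
X_test = np.array([[1.0, 0.0], [1.0, 2.5]])
print(predictive_interval(X_test, wN, VN, aN, bN))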


Zellner’s G-Prior


It is common to set a₀ = b₀ = 0, corresponding to an uninformative prior for σ², and to set w₀ = 0 and V₀ = g(XᵀX)⁻¹ for any positive value g. That is, in the conjugate prior

\[
p(\mathbf w,\sigma^2) = \mathcal{NIG}(\mathbf w,\sigma^2\mid\mathbf w_0,\mathbf V_0,a_0,b_0)
\triangleq \mathcal N(\mathbf w\mid\mathbf w_0,\sigma^2\mathbf V_0)\,\mathcal{IG}(\sigma^2\mid a_0,b_0)
\]

we take

\[
p(\mathbf w,\sigma^2) = \mathcal{NIG}\!\left(\mathbf w,\sigma^2\mid\mathbf 0,\ g(\mathbf X^T\mathbf X)^{-1},\ 0,\ 0\right)
\triangleq \mathcal N\!\left(\mathbf w\mid\mathbf 0,\ \sigma^2 g(\mathbf X^T\mathbf X)^{-1}\right)\mathcal{IG}(\sigma^2\mid 0,0).
\]

This is called Zellner's g-prior. Here g plays a role analogous to 1/λ in ridge regression. However, the prior covariance is proportional to (XᵀX)⁻¹ rather than I. This ensures that the posterior is invariant to scaling of the inputs.

Zellner, A. (1986). On assessing prior distributions and Bayesian regression analysis with g-prior distributions. In Bayesian Inference and Decision Techniques, Studies in Bayesian Econometrics and Statistics, volume 6. North Holland.
Minka, T. (2000b). Bayesian linear regression. Technical report, MIT.
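The scaling invariance is easy to see numerically: with w₀ = 0 and V₀ = g(XᵀX)⁻¹, the update equations give w_N = (g/(1+g)) ŵ_MLE, so rescaling the columns of X rescales w_N inversely and leaves the fitted values X w_N unchanged. A small check (my own sketch, with synthetic data):

import numpy as np

rng = np.random.default_rng(2)
N, D, g = 40, 3, 10.0
X = rng.normal(size=(N, D))
y = X @ np.array([1.0, -0.5, 2.0]) + rng.normal(size=N)

def gprior_posterior_mean(X, y, g):
    # With w0 = 0 and V0 = g (X^T X)^(-1):  VN = g/(1+g) (X^T X)^(-1),  wN = g/(1+g) * w_MLE.
    return g / (1.0 + g) * np.linalg.solve(X.T @ X, X.T @ y)

S = np.diag([10.0, 0.1, 3.0])              # arbitrary rescaling of the input columns
w1 = gprior_posterior_mean(X, y, g)
w2 = gprior_posterior_mean(X @ S, y, g)
print(np.allclose(X @ w1, (X @ S) @ w2))   # True: fitted values are invariant to the rescaling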


Unit Information Prior


We will see below that if we use an uninformative prior, the posterior precision given N measurements is V_N⁻¹ = XᵀX.

The unit information prior is defined to contain as much information as one sample. To create a unit information prior for linear regression, we therefore need V₀⁻¹ = (1/N) XᵀX, which is equivalent to the g-prior with g = N.

Note that Zellner's prior depends on the data: this is contrary to much of our earlier discussion of Bayesian inference!

Kass, R. and L. Wasserman (1995). A reference Bayesian test for nested hypotheses and its relationship to the Schwarz criterion. J. of the Am. Stat. Assoc. 90(431), 928-934.


Uninformative Prior


An uninformative prior can be obtained by considering the uninformative limit of the conjugate g-prior, which corresponds to setting g = ∞. This is equivalent to an improper NIG prior with w₀ = 0, V₀ = ∞I, a₀ = 0 and b₀ = 0, which gives p(w, σ²) ∝ σ^{−(D+2)}:

\[
p(\mathbf w,\sigma^2) = \mathcal{NIG}(\mathbf w,\sigma^2\mid\mathbf 0,\infty\mathbf I,0,0)
\triangleq \mathcal N(\mathbf w\mid\mathbf 0,\sigma^2\infty\mathbf I)\,\mathcal{IG}(\sigma^2\mid 0,0)
\;\propto\; \sigma^{-(D+2)}.
\]

Alternatively, we can start with the semi-conjugate prior p(w, σ²) = p(w)p(σ²) and take each term to its uninformative limit individually, which gives p(w, σ²) ∝ σ⁻². This is equivalent to an improper NIG prior with w₀ = 0, V₀ = ∞I, a₀ = −D/2 and b₀ = 0:

\[
p(\mathbf w,\sigma^2) = \mathcal{NIG}\!\left(\mathbf w,\sigma^2\,\middle|\,\mathbf 0,\infty\mathbf I,-\tfrac{D}{2},0\right)
\triangleq \mathcal N(\mathbf w\mid\mathbf 0,\sigma^2\infty\mathbf I)\,\mathcal{IG}\!\left(\sigma^2\,\middle|\,-\tfrac{D}{2},0\right)
\;\propto\; \sigma^{-2}.
\]


Using the uninformative prior p(w, σ²) ∝ σ⁻², the corresponding posterior and marginal posteriors are given by

\[
p(\mathbf w,\sigma^2\mid\mathcal D) = \mathcal{NIG}(\mathbf w,\sigma^2\mid\mathbf w_N,\mathbf V_N,a_N,b_N), \qquad
p(\mathbf w\mid\mathcal D) = \mathcal T_D\!\left(\mathbf w_N,\ \tfrac{b_N}{a_N}\mathbf V_N,\ 2a_N\right)
= \mathcal T_D\!\left(\mathbf w\;\middle|\;\hat{\mathbf w}_{MLE},\ \frac{s^2}{N-D}\,\mathbf C,\ N-D\right),
\]

where (taking the limit V₀⁻¹ → 0, with w₀ = 0, a₀ = −D/2 and b₀ = 0)

\[
\mathbf V_N = \mathbf C = \left(\mathbf V_0^{-1}+\mathbf X^T\mathbf X\right)^{-1} \to \left(\mathbf X^T\mathbf X\right)^{-1}, \qquad
\mathbf w_N = \mathbf V_N\left(\mathbf V_0^{-1}\mathbf w_0+\mathbf X^T\mathbf y\right) \to \left(\mathbf X^T\mathbf X\right)^{-1}\mathbf X^T\mathbf y = \hat{\mathbf w}_{MLE},
\]
\[
a_N = a_0 + N/2 = (N-D)/2, \qquad
b_N = b_0 + \tfrac12\left(\mathbf w_0^T\mathbf V_0^{-1}\mathbf w_0 + \mathbf y^T\mathbf y - \mathbf w_N^T\mathbf V_N^{-1}\mathbf w_N\right) = s^2/2, \qquad
s^2 = \left(\mathbf y-\mathbf X\hat{\mathbf w}_{MLE}\right)^T\left(\mathbf y-\mathbf X\hat{\mathbf w}_{MLE}\right).
\]

Note, in the calculation of s², that with w_N = ŵ_MLE = (XᵀX)⁻¹Xᵀy,

\[
s^2 = \left(\mathbf y-\mathbf X\hat{\mathbf w}_{MLE}\right)^T\left(\mathbf y-\mathbf X\hat{\mathbf w}_{MLE}\right)
= \left(\mathbf y-\mathbf X(\mathbf X^T\mathbf X)^{-1}\mathbf X^T\mathbf y\right)^T\left(\mathbf y-\mathbf X(\mathbf X^T\mathbf X)^{-1}\mathbf X^T\mathbf y\right)
= \mathbf y^T\mathbf y - \mathbf y^T\mathbf X(\mathbf X^T\mathbf X)^{-1}\mathbf X^T\mathbf y
= \mathbf y^T\mathbf y - \hat{\mathbf w}_{MLE}^T\mathbf V_N^{-1}\hat{\mathbf w}_{MLE}.
\]
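In code, the posterior under this uninformative prior reduces to familiar least-squares quantities. The sketch below (my own minimal implementation, with synthetic data for illustration) returns ŵ_MLE, C = (XᵀX)⁻¹, s², and the degrees of freedom N − D of the Student-t marginal:

import numpy as np

def uninformative_posterior(X, y):
    """Posterior quantities under p(w, sigma^2) proportional to sigma^(-2)."""
    N, D = X.shape
    C = np.linalg.inv(X.T @ X)            # V_N
    w_mle = C @ X.T @ y                   # w_N
    resid = y - X @ w_mle
    s2 = resid @ resid                    # s^2 = 2 b_N,  a_N = (N - D) / 2
    return w_mle, C, s2, N - D            # w | D ~ T_D(w_mle, s2 / (N - D) * C, N - D)

# Synthetic example (assumed data, for illustration only).
rng = np.random.default_rng(4)
X = np.column_stack([np.ones(25), rng.normal(size=(25, 2))])
y = X @ np.array([2.0, 0.0, -1.0]) + 0.4 * rng.normal(size=25)
w_mle, C, s2, dof = uninformative_posterior(X, y)
scale = np.sqrt(np.diag(C) * s2 / dof)    # per-coefficient scale of the Student-t marginal
print(w_mle, scale)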


Frequentist Confidence Interval Vs. Bayesian Marginal Credible Interval


The use of a (semi-conjugate) uninformative prior is quite interesting, since the resulting posterior turns out to be equivalent to the results obtained from frequentist statistics. In particular, the marginal posterior of each coefficient is

\[
p(w_j\mid\mathcal D) = \mathcal T\!\left(w_j\;\middle|\;\hat w_j,\ \frac{C_{jj}\,s^2}{N-D},\ N-D\right).
\]

This is equivalent to the sampling distribution of the MLE, which is given by

\[
\frac{w_j-\hat w_j}{s_j} \sim \mathcal T_{N-D}, \qquad s_j = \sqrt{\frac{C_{jj}\,s^2}{N-D}},
\]

where s_j is the standard error of the estimated parameter. Consequently, the frequentist confidence interval and the Bayesian marginal credible interval for the parameters are the same.

Rice, J. (1995). Mathematical Statistics and Data Analysis. Duxbury, 2nd edition (page 542).
Casella, G. and R. Berger (2002). Statistical Inference. Duxbury, 2nd edition (page 554).
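Since each marginal is a shifted and scaled Student-t with N − D degrees of freedom, the 95% marginal credible interval is ŵ_j ± t_{0.975, N−D} · s_j, which is exactly the classical confidence interval. A small helper (my own sketch; the numbers in the example are placeholders, not the caterpillar data used below):

import numpy as np
from scipy.stats import t as student_t

def credible_interval_95(w_hat_j, C_jj, s2, dof):
    s_j = np.sqrt(C_jj * s2 / dof)             # standard error of w_j
    half = student_t.ppf(0.975, df=dof) * s_j
    return w_hat_j - half, w_hat_j + half      # identical to the frequentist 95% CI

# Placeholder numbers, for illustration only.
print(credible_interval_95(w_hat_j=0.5, C_jj=0.04, s2=3.2, dof=22))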


The Caterpillar Example


As a worked example of the uninformative prior, consider the caterpillar dataset. We can compute the posterior mean and standard deviation, and the 95% credible intervals (CI), for the regression coefficients:

coeff    mean      stddev      95% CI                sig
w0       10.998    3.06027     [  4.652,  17.345 ]   *
w1       -0.004    0.00156     [ -0.008,  -0.001 ]   *
w2       -0.054    0.02190     [ -0.099,  -0.008 ]   *
w3        0.068    0.09947     [ -0.138,   0.274 ]
w4       -1.294    0.56381     [ -2.463,  -0.124 ]   *
w5        0.232    0.10438     [  0.015,   0.448 ]   *
w6       -0.357    1.56646     [ -3.605,   2.892 ]
w7       -0.237    1.00601     [ -2.324,   1.849 ]
w8        0.181    0.23672     [ -0.310,   0.672 ]
w9       -1.285    0.86485     [ -3.079,   0.508 ]
w10      -0.433    0.73487     [ -1.957,   1.091 ]

The 95% credible intervals are identical to the 95% confidence intervals computed using standard frequentist methods. (Run linregBayesCaterpillar from PMTK3.)

Marin, J.-M. and C. Robert (2007). Bayesian Core: A Practical Approach to Computational Bayesian Statistics. Springer.


We can use these marginal posteriors to check whether each coefficient is significantly different from 0, i.e., whether its 95% CI excludes 0. By this criterion, coefficients 0, 1, 2, 4, and 5 are all significant.

These results are the same as those produced by a frequentist approach using p-values at the 5% level.

Note, however, that the MLE does not even exist when N < D, so standard frequentist inference theory breaks down in this setting. Bayesian inference theory still works, using proper priors.

Maruyama, Y. and E. George (2008). A g-prior extension for p > n. Technical report, U. Tokyo.


Empirical Bayes for Linear Regression


We describe next an empirical Bayes procedure for picking the hyperparameters in the prior (we will come back to this and to relevance determination in a forthcoming lecture).

More precisely, we choose η = (α, λ) to maximize the marginal likelihood, where λ = 1/σ² is the precision of the observation noise and α is the precision of the prior, p(w) = N(w|0, α⁻¹I). This is known as the evidence procedure.

MacKay, D. (1995b). Probable networks and plausible predictions — a review of practical Bayesian methods for supervised neural networks. Network.
Buntine, W. and A. Weigend (1991). Bayesian backpropagation. Complex Systems 5, 603-643.
MacKay, D. (1999). Comparison of approximate methods for handling hyperparameters. Neural Computation 11(5), 1035-1068.


The evidence procedure provides an alternative to using cross validation. In the figure below, the log marginal likelihood is plotted for different values of α, as well as the maximum value found by the optimizer.

[Figure: log evidence vs. log alpha, with the maximum found by the optimizer marked. Run linregPolyVsRegDemo from PMTK3.]


[Figure: negative log marginal likelihood and the 5-fold CV estimate of MSE vs. log lambda, alongside the log-evidence plot above. Run linregPolyVsRegDemo from PMTK3.]

We obtain the same result as 5-fold CV (λ = 1/σ² is fixed in both methods). The key advantage of the evidence procedure over CV is that it allows a different αj to be used for every feature.
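For concreteness, the log marginal likelihood of this linear-Gaussian model with prior precision α and noise precision λ has the standard closed form log p(y|X, α, λ) = (D/2) log α + (N/2) log λ − (λ/2)‖y − X m_N‖² − (α/2) m_Nᵀm_N − (1/2) log|A| − (N/2) log 2π, where A = αI + λXᵀX and m_N = λA⁻¹Xᵀy. The sketch below (my own code, not the PMTK3 demo; the data are synthetic) grid-searches α at a fixed λ:

import numpy as np

def log_evidence(X, y, alpha, lam):
    """log p(y | X, alpha, lam) for y = Xw + noise, with w ~ N(0, alpha^-1 I) and noise precision lam."""
    N, D = X.shape
    A = alpha * np.eye(D) + lam * X.T @ X
    mN = lam * np.linalg.solve(A, X.T @ y)
    E = 0.5 * lam * np.sum((y - X @ mN) ** 2) + 0.5 * alpha * mN @ mN
    return (0.5 * D * np.log(alpha) + 0.5 * N * np.log(lam) - E
            - 0.5 * np.linalg.slogdet(A)[1] - 0.5 * N * np.log(2 * np.pi))

# Grid search over alpha at fixed lambda (the empirical Bayes / evidence procedure).
rng = np.random.default_rng(3)
X = rng.normal(size=(50, 4))
y = X @ np.array([1.0, 0.0, -2.0, 0.5]) + 0.3 * rng.normal(size=50)
lam = 1.0 / 0.3 ** 2
log_alphas = np.linspace(-5, 5, 101)
evidences = [log_evidence(X, y, np.exp(la), lam) for la in log_alphas]
print("log alpha maximizing the evidence:", log_alphas[int(np.argmax(evidences))])

In practice the optimizer alternates (or jointly optimizes) over α and λ rather than using a grid, but the grid makes the shape of the evidence curve in the figure easy to reproduce.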


Automatic Relevancy Determination


The evidence procedure can be used to perform feature selection; this is known as automatic relevancy determination (ARD).

The evidence procedure is also useful when comparing different kinds of models:

\[
p(\mathcal D\mid m) = \iint p(\mathcal D\mid\mathbf w,\boldsymbol\eta,m)\,p(\mathbf w\mid\boldsymbol\eta,m)\,p(\boldsymbol\eta\mid m)\,d\mathbf w\,d\boldsymbol\eta
\;\approx\; \max_{\boldsymbol\eta}\int p(\mathcal D\mid\mathbf w,\boldsymbol\eta,m)\,p(\mathbf w\mid\boldsymbol\eta,m)\,p(\boldsymbol\eta\mid m)\,d\mathbf w.
\]

It is important to (at least approximately) integrate over η rather than setting it arbitrarily. Using variational Bayes, we can model our uncertainty in η rather than computing point estimates.