Lecture 7: Advanced Topics in Least Squares

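The multivariate normal distribution for data, $\mathbf{d}$

$$p(\mathbf{d}) = (2\pi)^{-N/2} \, |\mathbf{C}_d|^{-1/2} \exp\{ -\tfrac{1}{2} (\mathbf{d}-\bar{\mathbf{d}})^T \mathbf{C}_d^{-1} (\mathbf{d}-\bar{\mathbf{d}}) \}$$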

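Let’s assume that the expectation $\bar{\mathbf{d}}$ is given by a general linear model

$$\bar{\mathbf{d}} = \mathbf{G}\mathbf{m}$$

and that the covariance $\mathbf{C}_d$ is known (prior covariance).

Then we have a distribution $p(\mathbf{d}; \mathbf{m})$ with unknown parameters, $\mathbf{m}$

$$p(\mathbf{d}; \mathbf{m}) = (2\pi)^{-N/2} \, |\mathbf{C}_d|^{-1/2} \exp\{ -\tfrac{1}{2} (\mathbf{d}-\mathbf{G}\mathbf{m})^T \mathbf{C}_d^{-1} (\mathbf{d}-\mathbf{G}\mathbf{m}) \}$$

We can now apply the principle of maximum likelihood to estimate the unknown parameters $\mathbf{m}$.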

Principle of Maximum Likelihood

Last lecture we stated this principle as

Maximize $L(\mathbf{m}) = \sum_i \ln p(d_i; \mathbf{m})$ with respect to $\mathbf{m}$

but in this distribution the whole data vector $\mathbf{d}$ is being treated as a single quantity.

So the principle becomes simply

Maximize $L(\mathbf{m}) = \ln p(\mathbf{d}; \mathbf{m})$

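$$p(\mathbf{d}; \mathbf{m}) = (2\pi)^{-N/2} \, |\mathbf{C}_d|^{-1/2} \exp\{ -\tfrac{1}{2} (\mathbf{d}-\mathbf{G}\mathbf{m})^T \mathbf{C}_d^{-1} (\mathbf{d}-\mathbf{G}\mathbf{m}) \}$$

$$L(\mathbf{m}) = \ln p(\mathbf{d}; \mathbf{m}) = -\tfrac{1}{2} N \ln(2\pi) - \tfrac{1}{2} \ln|\mathbf{C}_d| - \tfrac{1}{2} (\mathbf{d}-\mathbf{G}\mathbf{m})^T \mathbf{C}_d^{-1} (\mathbf{d}-\mathbf{G}\mathbf{m})$$

The first two terms do not contain $\mathbf{m}$, so the principle of maximum likelihood is

Maximize $-\tfrac{1}{2} (\mathbf{d}-\mathbf{G}\mathbf{m})^T \mathbf{C}_d^{-1} (\mathbf{d}-\mathbf{G}\mathbf{m})$

or

Minimize $(\mathbf{d}-\mathbf{G}\mathbf{m})^T \mathbf{C}_d^{-1} (\mathbf{d}-\mathbf{G}\mathbf{m})$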

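Special case of uncorrelated data with equal variance

$$\mathbf{C}_d = \sigma_d^2 \mathbf{I}$$

Minimize $\sigma_d^{-2} (\mathbf{d}-\mathbf{G}\mathbf{m})^T (\mathbf{d}-\mathbf{G}\mathbf{m})$ with respect to $\mathbf{m}$

which is the same as

Minimize $(\mathbf{d}-\mathbf{G}\mathbf{m})^T (\mathbf{d}-\mathbf{G}\mathbf{m})$ with respect to $\mathbf{m}$

This is the Principle of Least Squares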

The Principle of Least Squares

minimize $E = \mathbf{e}^T\mathbf{e} = (\mathbf{d}-\mathbf{G}\mathbf{m})^T(\mathbf{d}-\mathbf{G}\mathbf{m})$ with respect to $\mathbf{m}$

follows from the Principle of Maximum Likelihood in the special case of

a multivariate normal distribution, with

the data being uncorrelated and of equal variance
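For reference, the minimization itself can be written out in two lines. This is the standard derivation of the normal equations rather than something taken from the slides, and it assumes that $\mathbf{G}^T\mathbf{G}$ is invertible:

```latex
\begin{aligned}
E(\mathbf{m}) &= (\mathbf{d}-\mathbf{G}\mathbf{m})^T(\mathbf{d}-\mathbf{G}\mathbf{m})
  = \mathbf{d}^T\mathbf{d} - 2\,\mathbf{m}^T\mathbf{G}^T\mathbf{d}
  + \mathbf{m}^T\mathbf{G}^T\mathbf{G}\,\mathbf{m} \\
\frac{\partial E}{\partial \mathbf{m}} &= -2\,\mathbf{G}^T\mathbf{d}
  + 2\,\mathbf{G}^T\mathbf{G}\,\mathbf{m} = 0
  \;\;\Longrightarrow\;\;
  \mathbf{G}^T\mathbf{G}\,\mathbf{m} = \mathbf{G}^T\mathbf{d}
  \;\;\Longrightarrow\;\;
  \mathbf{m} = [\mathbf{G}^T\mathbf{G}]^{-1}\mathbf{G}^T\mathbf{d}
\end{aligned}
```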

Corollary

If your data are NOT NORMALLY DISTRIBUTED, then least squares is not the right method to use!

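What if $\mathbf{C}_d = \sigma_d^2 \mathbf{I}$ but $\sigma_d$ is unknown?

Note $|\mathbf{C}_d| = \sigma_d^{2N}$

$$L(\mathbf{m}, \sigma_d) = -\tfrac{1}{2} N \ln(2\pi) - \tfrac{1}{2} \ln|\mathbf{C}_d| - \tfrac{1}{2} (\mathbf{d}-\mathbf{G}\mathbf{m})^T \mathbf{C}_d^{-1} (\mathbf{d}-\mathbf{G}\mathbf{m})$$

$$= -\tfrac{1}{2} N \ln(2\pi) - N \ln(\sigma_d) - \tfrac{1}{2} \sigma_d^{-2} (\mathbf{d}-\mathbf{G}\mathbf{m})^T (\mathbf{d}-\mathbf{G}\mathbf{m})$$

The first two terms do not contain $\mathbf{m}$, so the principle of maximum likelihood still implies:

Minimize $(\mathbf{d}-\mathbf{G}\mathbf{m})^T(\mathbf{d}-\mathbf{G}\mathbf{m}) = \mathbf{e}^T\mathbf{e} = E$

Then

$$\partial L / \partial \sigma_d = 0 = -N \sigma_d^{-1} + \sigma_d^{-3} (\mathbf{d}-\mathbf{G}\mathbf{m})^T (\mathbf{d}-\mathbf{G}\mathbf{m})$$

Or, solving for $\sigma_d$

$$\sigma_d^2 = N^{-1} (\mathbf{d}-\mathbf{G}\mathbf{m})^T (\mathbf{d}-\mathbf{G}\mathbf{m}) = N^{-1} \mathbf{e}^T\mathbf{e}$$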

Thus the Principle of Maximum Likelihood implies that

$$\sigma_d^2 = N^{-1} (\mathbf{d}-\mathbf{G}\mathbf{m})^T (\mathbf{d}-\mathbf{G}\mathbf{m}) = N^{-1} \mathbf{e}^T\mathbf{e}$$

is a good posterior estimate of the variance of the data, when

the data follow a multivariate normal distribution, and

the data are uncorrelated and with uniform (but unknown) variance, $\sigma_d^2$
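As a rough numerical illustration of this variance estimate, the sketch below fits a straight line to synthetic data and then forms N^-1 e'e. Everything in it (the seed, the line parameters, the noise level, the variable names) is a hypothetical choice made for the example, not something taken from the lecture.

```matlab
% Hypothetical synthetic test: all numbers below are illustrative assumptions
rng(1);                          % fix the random seed for repeatability
N      = 1000;                   % number of data
x      = linspace(0, 10, N)';    % auxiliary variable
G      = [ones(N,1), x];         % data kernel for a straight line
mtrue  = [1; 2];                 % assumed true intercept and slope
sdtrue = 0.5;                    % assumed true sigma_d
d      = G*mtrue + sdtrue*randn(N,1);

mest   = (G'*G) \ (G'*d);        % least-squares estimate of m
e      = d - G*mest;             % prediction error
sd2est = (e'*e) / N;             % maximum likelihood estimate of sigma_d^2
sdest  = sqrt(sd2est);
```

With this many data the estimate should land close to the assumed value of 0.5. Note that dividing by N gives the maximum likelihood estimate; the commonly used unbiased estimate divides by N minus the number of model parameters instead, which makes little difference when N is large.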

But back to the general case …

What formula for $\mathbf{m}$ does the rule

Minimize $(\mathbf{d}-\mathbf{G}\mathbf{m})^T \mathbf{C}_d^{-1} (\mathbf{d}-\mathbf{G}\mathbf{m})$

imply?

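Trick …

Minimize $(\mathbf{d}-\mathbf{G}\mathbf{m})^T (\mathbf{d}-\mathbf{G}\mathbf{m})$ implies $\mathbf{m} = [\mathbf{G}^T\mathbf{G}]^{-1} \mathbf{G}^T \mathbf{d}$

Now write the quantity to be minimized as

$$(\mathbf{d}-\mathbf{G}\mathbf{m})^T \mathbf{C}_d^{-1} (\mathbf{d}-\mathbf{G}\mathbf{m}) = (\mathbf{d}-\mathbf{G}\mathbf{m})^T \mathbf{C}_d^{-1/2} \mathbf{C}_d^{-1/2} (\mathbf{d}-\mathbf{G}\mathbf{m})$$

$$= (\mathbf{C}_d^{-1/2}\mathbf{d} - \mathbf{C}_d^{-1/2}\mathbf{G}\mathbf{m})^T (\mathbf{C}_d^{-1/2}\mathbf{d} - \mathbf{C}_d^{-1/2}\mathbf{G}\mathbf{m}) = (\mathbf{d}'-\mathbf{G}'\mathbf{m})^T (\mathbf{d}'-\mathbf{G}'\mathbf{m})$$

with $\mathbf{d}' = \mathbf{C}_d^{-1/2}\mathbf{d}$

$\mathbf{G}' = \mathbf{C}_d^{-1/2}\mathbf{G}$

($\mathbf{C}_d$ is symmetric, so its inverse and square root are symmetric, too)

This is simple least squares, so

$$\mathbf{m} = [\mathbf{G}'^T\mathbf{G}']^{-1} \mathbf{G}'^T \mathbf{d}'$$

or

$$\mathbf{m} = [\mathbf{G}^T \mathbf{C}_d^{-1/2} \mathbf{C}_d^{-1/2} \mathbf{G}]^{-1} \mathbf{G}^T \mathbf{C}_d^{-1/2} \mathbf{C}_d^{-1/2} \mathbf{d} = [\mathbf{G}^T \mathbf{C}_d^{-1} \mathbf{G}]^{-1} \mathbf{G}^T \mathbf{C}_d^{-1} \mathbf{d}$$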
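The equivalence between the transformed ("primed") problem and the closed-form expression can be checked numerically. The sketch below is only an illustration: the data, the problem size and the variable names are made up, and `sqrtm` is used as one possible way to form the symmetric square root of the inverse covariance.

```matlab
% Hypothetical small problem: straight line with an exponential covariance
N  = 20;
x  = (1:N)';
G  = [ones(N,1), x];                 % data kernel
[XX, YY] = meshgrid(1:N, 1:N);
Cd = exp(-0.25*abs(XX-YY));          % symmetric, positive definite covariance
d  = 1 + 2*x + sin(x);               % made-up data (deterministic)

Cdi  = inv(Cd);
Cdi  = (Cdi + Cdi')/2;               % remove round-off asymmetry
Cdih = sqrtm(Cdi);                   % symmetric square root, Cd^(-1/2)

dp = Cdih*d;                         % d' = Cd^(-1/2) d
Gp = Cdih*G;                         % G' = Cd^(-1/2) G
m1 = (Gp'*Gp) \ (Gp'*dp);            % simple least squares on primed problem
m2 = (G'*(Cd\G)) \ (G'*(Cd\d));      % [G' Cd^-1 G]^-1 G' Cd^-1 d
```

Both `m1` and `m2` should agree to within round-off, which is the content of the trick.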

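So, minimize $(\mathbf{d}-\mathbf{G}\mathbf{m})^T \mathbf{C}_d^{-1} (\mathbf{d}-\mathbf{G}\mathbf{m})$

implies

$$\mathbf{m} = [\mathbf{G}^T \mathbf{C}_d^{-1} \mathbf{G}]^{-1} \mathbf{G}^T \mathbf{C}_d^{-1} \mathbf{d}$$

and

$$\mathbf{C}_m = \{ [\mathbf{G}^T \mathbf{C}_d^{-1} \mathbf{G}]^{-1} \mathbf{G}^T \mathbf{C}_d^{-1} \} \, \mathbf{C}_d \, \{ [\mathbf{G}^T \mathbf{C}_d^{-1} \mathbf{G}]^{-1} \mathbf{G}^T \mathbf{C}_d^{-1} \}^T$$

$$= [\mathbf{G}^T \mathbf{C}_d^{-1} \mathbf{G}]^{-1} \mathbf{G}^T \mathbf{C}_d^{-1} \mathbf{G} \, [\mathbf{G}^T \mathbf{C}_d^{-1} \mathbf{G}]^{-1} = [\mathbf{G}^T \mathbf{C}_d^{-1} \mathbf{G}]^{-1}$$

Remember the formula $\mathbf{C}_m = \mathbf{M} \mathbf{C}_d \mathbf{M}^T$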
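The collapse of the long error-propagation product into the short form can also be confirmed numerically. This is a minimal sketch with a made-up kernel and covariance; the variable names are assumptions chosen for the example.

```matlab
% Hypothetical setup: straight-line kernel and exponential covariance
N  = 20;
x  = (1:N)';
G  = [ones(N,1), x];
[XX, YY] = meshgrid(1:N, 1:N);
Cd = exp(-0.25*abs(XX-YY));

A  = G'*(Cd\G);              % G' Cd^-1 G
M  = A \ (G'/Cd);            % M = [G' Cd^-1 G]^-1 G' Cd^-1, so that m = M d

Cm_long  = M*Cd*M';          % error propagation, Cm = M Cd M'
Cm_short = inv(A);           % simplified form, [G' Cd^-1 G]^-1
sm = sqrt(diag(Cm_short));   % one-sigma errors of intercept and slope
```

The two covariance matrices should match to round-off; the square roots of the diagonal are the error bars one would quote for the model parameters.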

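Example with Correlated Noise

[Figure: two noise series, with panels "Uncorrelated Noise" and "Correlated Noise"]

[Figure: scatter plots of the correlated noise. $d_i$ vs. $d_{i+1}$: high correlation; $d_i$ vs. $d_{i+2}$: some correlation; $d_i$ vs. $d_{i+3}$: little correlation]

data = straight line + correlated noise

$d = a + bx + n$

[Figure: data $d$ plotted against $x$]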

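Model for $\mathbf{C}_d$

$[\mathbf{C}_d]_{ij} = \sigma_d^2 \exp\{ -c\,|i-j| \}$ with $c = 0.25$ (the factor $\sigma_d^2$ is the `sd^2` in the code and sets the overall variance)

exponential falloff from main diagonal

MatLab Code:

    c = 0.25;
    [XX, YY] = meshgrid( [1:N], [1:N] );
    Cd = (sd^2)*exp(-c*abs(XX-YY));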
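Because this covariance depends only on |i-j|, it is a symmetric Toeplitz matrix, so it can also be built from its first row. The sketch below compares the two constructions; the values of `N` and `sd` are hypothetical, chosen small so the matrix is easy to inspect.

```matlab
% Hypothetical sizes, for illustration only
N  = 5;
sd = 2;
c  = 0.25;

% construction used in the lecture
[XX, YY] = meshgrid(1:N, 1:N);
Cd = (sd^2)*exp(-c*abs(XX-YY));

% alternative construction from the first row of the matrix
Cd2 = (sd^2)*toeplitz(exp(-c*(0:N-1)));

isSym  = isequal(Cd, Cd');      % a covariance matrix must be symmetric
minEig = min(eig(Cd));          % and positive definite
```

Checking symmetry and the smallest eigenvalue is a cheap sanity test before the matrix is inverted in the least-squares formula.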

Results

[Figure: data $d = a + bx + n$ plotted against $x$, with the two fitted lines]

Both fits about the same … but

                  Intercept        Slope
    Correlated    10.96 ± 20.6     1.92 ± 0.35
    Uncorrelated   8.42 ± 7.9      1.97 ± 0.14
    True           1.0             2.0

… note error estimates are larger (more realistic?) for the correlated case

How to make correlated noise

    % Define weighting function
    w = [0.1, 0.3, 0.7, 1.0, 0.7, 0.3, 0.1]';
    w = w/sum(w);
    Nw = length(w);
    Nw2 = (Nw-1)/2;

    % Start with uncorrelated noise
    N=101;
    N2=(N-1)/2;
    n1 = random('Normal',0,1,N+Nw,1);

    % Correlated noise is a weighted average of
    % neighboring uncorrelated noise values
    n = zeros(N,1);
    for i = [-Nw2:Nw2]
        n = n + w(i+Nw2+1)*n1(i+Nw-Nw2:i+Nw+N-1-Nw2);
    end
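The weighted average above is a convolution of the uncorrelated noise with the weights, so the loop can be cross-checked against the built-in `conv`. In this sketch `randn` stands in for the `random('Normal',...)` call (an assumption made so that no toolbox is needed), and the seed is arbitrary.

```matlab
rng(0);                                   % arbitrary seed, for repeatability
w   = [0.1, 0.3, 0.7, 1.0, 0.7, 0.3, 0.1]';
w   = w/sum(w);                           % weights sum to one
Nw  = length(w);
Nw2 = (Nw-1)/2;
N   = 101;
n1  = randn(N+Nw, 1);                     % uncorrelated, unit-variance noise

% loop version, as in the lecture code
n = zeros(N,1);
for i = -Nw2:Nw2
    n = n + w(i+Nw2+1)*n1(i+Nw-Nw2 : i+Nw+N-1-Nw2);
end

% convolution version; 'valid' keeps only fully overlapped samples
nc = conv(n1, w, 'valid');                % length (N+Nw)-Nw+1 = N+1
nc = nc(1:N);                             % keep the same N samples as the loop
```

The two results agree here because the weights are symmetric; for an asymmetric weighting function the convolution would need the weights reversed.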

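Let’s look at the transformations …

$\mathbf{d}' = \mathbf{C}_d^{-1/2}\mathbf{d}$

$\mathbf{G}' = \mathbf{C}_d^{-1/2}\mathbf{G}$

in the special case of uncorrelated data with different variances

$$\mathbf{C}_d = \mathrm{diag}(\sigma_1^2, \sigma_2^2, \dots, \sigma_N^2) =
\begin{bmatrix}
\sigma_1^2 & 0 & 0 & \cdots \\
0 & \sigma_2^2 & 0 & \cdots \\
0 & 0 & \sigma_3^2 & \cdots \\
\vdots & \vdots & \vdots & \ddots
\end{bmatrix}$$

$d_i' = \sigma_i^{-1} d_i$: multiply each datum by the reciprocal of its error

$G_{ij}' = \sigma_i^{-1} G_{ij}$: multiply each row of the data kernel by the same amount

Then solve by ordinary least squares

$$\begin{bmatrix}
\sigma_1^{-1} G_{11} & \sigma_1^{-1} G_{12} & \sigma_1^{-1} G_{13} & \cdots \\
\sigma_2^{-1} G_{21} & \sigma_2^{-1} G_{22} & \sigma_2^{-1} G_{23} & \cdots \\
\sigma_3^{-1} G_{31} & \sigma_3^{-1} G_{32} & \sigma_3^{-1} G_{33} & \cdots \\
\vdots & \vdots & \vdots & \\
\sigma_N^{-1} G_{N1} & \sigma_N^{-1} G_{N2} & \sigma_N^{-1} G_{N3} & \cdots
\end{bmatrix}
\mathbf{m} =
\begin{bmatrix}
\sigma_1^{-1} d_1 \\
\sigma_2^{-1} d_2 \\
\sigma_3^{-1} d_3 \\
\vdots \\
\sigma_N^{-1} d_N
\end{bmatrix}$$

Rows have been weighted by a factor of $\sigma_i^{-1}$

So this special case is often called

Weighted Least Squares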
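The row-scaling recipe can be checked against the general formula on a tiny problem. The six data values and their errors below are invented purely for illustration.

```matlab
% Hypothetical data: six points with unequal one-sigma errors
x  = (1:6)';
d  = [1.1; 2.9; 5.2; 7.1; 8.8; 11.3];
sd = [1; 1; 2; 2; 4; 4];
G  = [ones(6,1), x];                 % straight-line data kernel

% scale each row of G and each datum by 1/sigma_i, then ordinary least squares
W    = diag(1./sd);
Gp   = W*G;
dp   = W*d;
mrow = (Gp'*Gp) \ (Gp'*dp);

% general formula with the diagonal inverse covariance Cd^-1
Cdi  = diag(sd.^(-2));
mgen = (G'*Cdi*G) \ (G'*Cdi*d);
```

Both routes give the same intercept and slope, since `W'*W` equals `Cdi`.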

Note that the total error is

$$E = \mathbf{e}^T \mathbf{C}_d^{-1} \mathbf{e} = \sum_i \sigma_i^{-2} e_i^2$$

Each individual error is weighted by the reciprocal of its variance, so errors involving data with SMALL variance get MORE weight

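Example: fitting a straight line

100 data, first 50 have a different $\sigma_d$ than the last 50

MatLab Code (the vectors x and d are defined elsewhere)

    N=101;
    N2=(N-1)/2;
    sd(1:N2-1) = 5;
    sd(N2:N) = 100;
    sd2i = sd.^(-2);
    Cdi = diag(sd2i);           % inverse covariance, Cd^-1

    G(:,1)=ones(N,1);
    G(:,2)=x;

    GTCdiGI=inv(G'*Cdi*G);      % [G' Cd^-1 G]^-1
    m = GTCdiGI*G'*Cdi*d;       % weighted least-squares solution

    d2 = m(1) + m(2) .* x;      % predicted data

Note that $\mathbf{C}_d^{-1}$ is explicitly defined as a diagonal matrix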
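To run the example end to end, some data are needed. The sketch below supplies synthetic ones; the seed, the range of `x`, and the true intercept and slope are all assumptions made for illustration, and `randn` is used to generate the noise.

```matlab
rng(2);                               % arbitrary seed, for repeatability
N  = 101;
N2 = (N-1)/2;
x  = (0:N-1)';                        % assumed auxiliary variable
sd = zeros(N,1);
sd(1:N2-1) = 5;                       % first part: small errors
sd(N2:N)   = 100;                     % last part: large errors
atrue = 1;                            % assumed true intercept
btrue = 2;                            % assumed true slope
d  = atrue + btrue*x + sd.*randn(N,1);

G   = [ones(N,1), x];
Cdi = diag(sd.^(-2));                 % diagonal inverse covariance
GTCdiGI = inv(G'*Cdi*G);              % also the model covariance Cm
m   = GTCdiGI*G'*Cdi*d;               % weighted least-squares solution
sm  = sqrt(diag(GTCdiGI));            % one-sigma errors of m

mols = (G'*G) \ (G'*d);               % unweighted fit, for comparison
dpre = G*m;                           % predicted data
```

Comparing `m` with `mols` shows how strongly the noisy half of the data can pull an unweighted fit; the weighted fit is controlled mainly by the low-variance points.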

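[Figure: three straight-line fits, one per case below]

Equal variance. Left 50: $\sigma_d = 5$; right 50: $\sigma_d = 5$

Left has smaller variance. First 50: $\sigma_d = 5$; last 50: $\sigma_d = 100$

Right has smaller variance. First 50: $\sigma_d = 100$; last 50: $\sigma_d = 5$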

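Finally, two miscellaneous comments about least-squares

Comment 1

Case of fitting functions to a dataset

$d_i = m_1 f_1(x_i) + m_2 f_2(x_i) + m_3 f_3(x_i) + \dots$

e.g. $d_i = m_1 \sin(x_i) + m_2 \cos(x_i) + m_3 \sin(2x_i) + \dots$

$$\begin{bmatrix}
f_1(x_1) & f_2(x_1) & f_3(x_1) & \cdots \\
f_1(x_2) & f_2(x_2) & f_3(x_2) & \cdots \\
f_1(x_3) & f_2(x_3) & f_3(x_3) & \cdots \\
\vdots & \vdots & \vdots & \\
f_1(x_N) & f_2(x_N) & f_3(x_N) & \cdots
\end{bmatrix}
\mathbf{m} =
\begin{bmatrix}
d_1 \\ d_2 \\ d_3 \\ \vdots \\ d_N
\end{bmatrix}$$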

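Note that the matrix $\mathbf{G}^T\mathbf{G}$ has element $i,j$

$$[\mathbf{G}^T\mathbf{G}]_{ij} = \sum_k f_i(x_k) f_j(x_k) = \mathbf{f}_i \cdot \mathbf{f}_j$$

and thus is diagonal if the functions are orthogonal

if the functions are normalized so $\mathbf{f}_i \cdot \mathbf{f}_i = 1$

then $\mathbf{G}^T\mathbf{G} = \mathbf{I}$ and the least squares solution is

$\mathbf{m} = \mathbf{G}^T\mathbf{d}$ and $\mathbf{C}_m = \sigma_d^2 \mathbf{I}$

super-simple formula! $m_i = \mathbf{f}_i \cdot \mathbf{d}$

guaranteed uncorrelated errors!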
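A concrete case where this works exactly is a set of sines and cosines sampled evenly over one full period. The sketch below uses the three functions from the example above with a hypothetical N = 16 and made-up coefficients; the factor sqrt(2/N) is the normalization that makes each column a unit vector.

```matlab
N = 16;                               % hypothetical number of samples
x = 2*pi*(0:N-1)'/N;                  % one full period, evenly sampled
s = sqrt(2/N);                        % normalization so that fi . fi = 1
G = s*[sin(x), cos(x), sin(2*x)];     % columns are the sampled functions

GTG = G'*G;                           % should be the 3-by-3 identity

mtrue = [3; -1; 2];                   % made-up coefficients
d = G*mtrue;                          % noise-free synthetic data
m = G'*d;                             % super-simple formula, mi = fi . d
```

No matrix inverse is needed here, and `m` reproduces the coefficients that generated the data.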

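Example of Straight line

[Figure: straight line $y(x)$ sampled at $x_1, x_2, x_3, x_4, x_5$]

$y_i = a + b x_i$ implies

$f_1(x) = 1$ and $f_2(x) = x$

so condition $\mathbf{f}_1 \cdot \mathbf{f}_2 = 0$

implies $\sum_i x_i = 0$ or $\bar{x} = 0$

this happens when the x’s straddle the origin

The choice

$f_1(x) = 1$ and $f_2(x) = x - \bar{x}$

i.e. $y = a' + b' (x - \bar{x})$

leads to uncorrelated errors in $(a', b')$

[Figure: the line with the intercepts $a$ and $a'$ labeled]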
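The effect of centering is easy to see on five points. In this sketch the x values 1 to 5 are a hypothetical choice; any set of x's behaves the same way once its mean is subtracted.

```matlab
x  = [1; 2; 3; 4; 5];                 % hypothetical sample positions
N  = length(x);

G    = [ones(N,1), x];                % f1 = 1, f2 = x
GTG  = G'*G;                          % off-diagonal element is sum(x) = 15

Gc   = [ones(N,1), x - mean(x)];      % f1 = 1, f2 = x - xbar
GTGc = Gc'*Gc;                        % off-diagonal element is zero

% with Cd = sd^2 I, Cm = sd^2 inv(G'G); a diagonal G'G gives a diagonal Cm
sd   = 1;                             % assumed data error
Cm   = sd^2*inv(GTG);                 % correlated errors in (a, b)
Cmc  = sd^2*inv(GTGc);                % uncorrelated errors in (a', b')
```

Here `GTGc` is diag(5, 10), so the covariance of (a', b') has no off-diagonal term, while the covariance of (a, b) does.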

Example – wavelet functions

Localized oscillation with a characteristic frequency

[Figure: the matrix $\mathbf{G}^T\mathbf{G}$ for wavelet functions]

$\mathbf{G}^T\mathbf{G}$ is “almost” diagonal

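Comment 2

sometimes writing least-squares as

$[\mathbf{G}^T\mathbf{G}]\, \mathbf{m} = \mathbf{G}^T \mathbf{d}$ or

$\mathbf{G}^T [\mathbf{G}\mathbf{m}] = \mathbf{G}^T \mathbf{d}$

is more useful than $\mathbf{m} = [\mathbf{G}^T\mathbf{G}]^{-1} \mathbf{G}^T \mathbf{d}$

since you can use some method other than a matrix inverse for solving the equation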
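As a sketch of what "some other method" might look like in MatLab, the example below solves the same small problem four ways: with an explicit inverse, with the backslash operator on the normal equations, with backslash applied to G directly, and with the conjugate gradient solver `pcg` given a function that evaluates G'[G m] without ever forming G'G. The data are made up, and the tolerance and iteration limit passed to `pcg` are arbitrary choices.

```matlab
% Hypothetical straight-line problem
x = (1:20)';
G = [ones(20,1), x];
d = 1 + 2*x + 0.1*sin(7*x);           % made-up data

m_inv = inv(G'*G)*(G'*d);             % explicit inverse
m_bs  = (G'*G) \ (G'*d);              % solve [G'G] m = G'd directly
m_qr  = G \ d;                        % least squares without forming G'G

% iterative solution of G'[G m] = G'd; afun applies G, then G'
afun = @(v) G'*(G*v);
[m_cg, flag] = pcg(afun, G'*d, 1e-8, 50);
```

The function-handle form matters most when G is large and sparse, since only products with G and G' are needed; for a problem this small all four answers should agree.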
