View
218
Download
2
Embed Size (px)
Citation preview
Lecture 7
Advanced Topics in
Least Squares
the multivariate normal distribution for data, d
p(d) = (2)-N/2 |Cd|-1/2 exp{ -1/2 (d-d)T Cd-1 (d-d) }
Let’s assume that the expectation d Is given by a general linear model
d = Gm
And that the covariance Cd
is known (prior covariance)
Then we have a distribution P(d; m)with unknown parameters, m
p(d)=(2)-N/2|Cd|-1/2exp{ -½ (d-Gm)T Cd-1 (d-Gm) }
We can now apply theprinciple of maximum likelihood
To estimate the unknown parameters m
Principle of Maximum LikelihoodLast lecture we stated this principle as
L(m) = i ln p(di; m) with respect to m
but in this distribution
the whole data vector d is being treated as a single quantity
So the princple becomes simply
Maximize L(m) = ln p(d; m)
p(d;m)=(2)-N/2|Cd|-1/2exp{ -½ (d-Gm)T Cd-1 (d-Gm) }
L(m) = ln p(d; m) =
- ½Nln(2) - ½ln(|Cd|) - ½(d-Gm)T Cd-1 (d-Gm)
The first two terms do not contain m, so the principle of maximum likelihood is
Maximize -½ (d-Gm) T Cd-1 (d-Gm)
or
Minimize (d-Gm) T Cd-1 (d-Gm)
Minimize (d-Gm) T Cd-1 (d-Gm)
Special case of uncorrelated data with equal variance
Cd = d2I
Minimize d-2 (d-Gm)T (d-Gm) with respect to m
Which is the same as
Minimize (d-Gm)T (d- Gm) with respect to m
This is the Principle of Least Squares
This is the Principle of Least Squares
minimize E = eTe = (d-Gm)T(d-Gm) with respect to m
follows from the Principle of Maximum Likelihood in the special case of
a multivariate Normal distribution
the data being uncorrelated and of equal variance
Corollary
If your data are NOT NORMALLY DISTRIBUTED
Then least-squares isnot the right method to use!
What if Cd=d2I but d is unknown?
note |Cd| = 2N
L(m,d) =
-½Nln(2) - ½ln(|Cd|) - ½(d-Gm)T Cd-1 (d-Gm) =
-½Nln(2) – Nln(d) - ½d-2(d-Gm)T (d-Gm)
The first two terms do not contain m, so the principle of maximum likelihood still implies:
Minimize (d-Gm)T(d-Gm) = eTe = E
Then L/d= 0 = Nd-1 + d
-3(d-Gm)T (d-Gm)
Or, solving for d
d2 = N-1 (d-Gm)T (d-Gm) = N-1 eTe
This is the Principle of Maximum Likelihood
implies that
d2 = N-1 (d-Gm)T (d-Gm) = N-1 eTe
Is a good posterior estimate of the variance of the data, when
the data follow a multivariate normal distribution
the data are uncorrelated and with uniform
(but unknown) variance, d2
But back to the general case …
What formula for m does the rule
Minimize (d-Gm)T Cd-1 (d-Gm)
imply ?
Trick … Minimize (d-Gm)T (d-Gm)Implies m = [GTG]-1 GT d
Now write, Minimize (d-Gm)T Cd
-1 (d-Gm) = (d-Gm)T Cd
-1/2 Cd-1/2 (d-Gm) =
(Cd-1/2d-Cd
-1/2Gm)T (Cd-1/2d-Cd
-1/2Gm) =(d’-G’m)T (d’-G’m)
with d’=Cd-1/2d
G’ = Cd-1/2G
This is simple least squares, so m= [G’TG’]-1 G’T d’orm = [GTCd
-1/2Cd-1/2G]-1GTCd
-1/2Cd-1/2d = [GT Cd
-1G]-1GTCd-1d
Symmetric, so it inverse and square root is symmetric, too
So, minimize (d-Gm)T Cd
-1 (d-Gm)
implies
m = [GT Cd-1G]-1GTCd
-1d
and
Cm = {[GTCd-1G]-1GTCd
-1} Cd {[GTCd
-1G]-1GTCd-1 }T =
= [GTCd-1G]-1GTCd
-1G [GTCd-1G]-1 = [GTCd
-1G]-1
Remember formula
Cm = M Cd MT
Example with Correlated Noise
Uncorrelated Noise
Correlated Noise
Scatter Plots
di vs. di+1 high correlation
di vs. di+2 some correlation
di vs. di+3 little correlation
data = straight line + correlated noise
x
d =
a +
bx
+ n
Model for Cd
[Cd ]ij= exp{ -c |i-j| } with c=0.25
exponential falloff from main diagonal
MatLab Code:c = 0.25;[XX, YY] = meshgrid( [1:N], [1:N] );Cd = (sd^2)*exp(-c*abs(XX-YY));
Resultsd
= a
+ b
x +
n
x
Both fits about the same … but
Intercept
Correlated 10.96 ± 20.6
Uncorrelated 8.42 ± 7.9
True 1.0
Slope
Correlated 1.92 ± 0.35
Uncorrelated 1.97 ± 0.14
True 2.0
… note error estimates are larger (more realistic ?) for the correlated case
How to make correlated noisew = [0.1, 0.3, 0.7, 1.0, 0.7, 0.3, 0.1]';w = w/sum(w);Nw = length(w);Nw2 = (Nw-1)/2;
N=101;N2=(N-1)/2;n1 = random('Normal',0,1,N+Nw,1);
n = zeros(N,1);for i = [-Nw2:Nw2] n = n + w(i+Nw2+1)*n1(i+Nw-Nw2:i+Nw+N-1-Nw2);end
Define weighting function
Start with uncorrelated noise
Correlated noise is a weighted average of neighboring uncorrelated noise values
Let’s look at the transformations … d’=Cd
-1/2dG’ = Cd
-1/2G
In the special case of uncorrelated data with different variances Cd = diag( 1
2, 22, … N
2)
di’=i-1 di multiply each data by
the reciprocal of its error
Gij’ = i-1 Gij multiply each row of the
data kernel by the same amount
Then solve by ordinary least squares
12 0 0 …
0 22 0 …
0 0 32 …
...
G11 1
G12 1G13 …
2G21 2
G22 2G13 …
3G31 3
G32 3G33 …
…
NGN1 N
GN2 NGN3 …
m =
1d1
2d2
3d3
…
NdN
Rows have been weighted by a factor of i-1
So this special case is often called
Weighted Least Squares
Note that the total error is
E = eT Cd-1 e = i i
-2 ei2
Each individual error is weighted by the reciprocal of its variance, so errors involving data with SMALL variance get MORE weight
weight
Example: fitting a straight line
100 data, first 50 have a different d than the last 50
N=101;N2=(N-1)/2;sd(1:N2-1) = 5;sd(N2:N) = 100;sd2i = sd.^(-2);Cdi = diag(sd2i);
G(:,1)=ones(N,1);G(:,2)=x;
GTCdiGI=inv(G'*Cdi*G);m = GTCdiGI*G'*Cdi*d;
d2 = m(1) + m(2) .* x;
MatLab Code
Note that Cd
-1 is explicitly defines as a diagonal matrix
Equal variance
Left 50: d = 5 right 50: d = 5
Left has smaller variance
first 50: d = 5 last 50: d = 100
Right has smaller variance
first 50: d = 100 last 50: d = 5
Finally, two miscellaneous comments about least-squares
Comment 1
Case of fitting functions to a dataset
di = m1 f1(xi) + m2 f2(xi) + m3 f3(xi) …
e.g. di = m1 sin(xi) + m2 cos(xi) + m3 sin(2xi) …
f1(x1) f2(x1) f3(x1) …
f1(x2) f2(x2) f3(x2) …
f1(x3) f2(x3) f3(x3) …
…
f1(xN) f2(xN) f3(xN) …
m =
d1
d2
d3
…
dN
Note that the matrix GTG
has element i,j
[GTG]ij = i fi(xk)fj(xk) = fi fj
and thus is diagonal if the functions are orthogonal
if the functions are normalized so fifi=
then GTG = I and the least squares solution is
m = GTd and Cm=d2 I
super-simple formula! mi = fi d
guaranteed uncorrelated errors!
Example of Straight line
x
y
x1 x2 x3 x4 x5
x
yi = a + bxi implies
f1(x) = 1 and f2(x) = x
so condition f1(x)f2(x)=0
implies i xi = 0 or x=0
this happens when the x’s straddle the origin
The choice
f1(x) = 1 and f2(x) = x-x
i.e. y = a’ + b’ (x-x)
leads to uncorrelated errors in (a’,b’)
aa’
Example – wavelet functions
Localized oscillation with a character-istic frequency
GTG
“Almost” diagonal
Comment 2
sometimes writing least-squares as
[GTG] m = GT d or
GT[G m] = GT d
is more useful than m = [GTG]-1 GT d
since you can use some method other than a matrix inverse for solving the equation