EE 5510 Random Processes E - 1
Estimation Theory Basics
What is Estimation?
Make a best determination of something given measurements (information) related (in a known, or partially known way) to that something, and possibly also given some prior information.
Prior Information may be
• a random variable (this is the Bayesian approach), or
• non-random, or given as a set of possible values.
Example: Estimation of the mean of a random variable:
\[
\hat{m}_x = \frac{1}{n}\sum_{i=1}^{n} X_i
\]
is an estimator for m_x = E(X), the mean of the random variable X. We know that \(\hat{m}_x\) is a "good" estimator, since
\[
P\big(|\hat{m}_x - m_x| < \varepsilon\big) \to 1
\]
by the law of large numbers.
EE 5510 Random Processes E - 2
Parameter Estimation
We are given a number of observations x_i, i.e., x = [x_1, x_2, \cdots, x_n]. This sample vector of "data" is called a statistic. From this statistic we are to estimate the desired parameter, i.e.,
\[
\hat\theta = g(x)
\]
Unbiased Estimator
An estimator is unbiased if
\[
E(\hat\theta) = \theta
\]
Example: The estimator \(\hat{m}_x = \frac{1}{n}\sum_{i=1}^{n} x_i\) is an unbiased estimator, since
\[
E(\hat{m}_x) = \frac{1}{n}\sum_{i=1}^{n} E(x_i) = m_x
\]
Example: Estimation of unknown variance (and unknown mean):
\[
\hat\sigma^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \hat{m}_x)^2
\]
This estimator is unbiased, since
\[
E\big((x_i - \hat{m}_x)^2\big)
= E(x_i^2) - 2E\Big(\frac{1}{n}\,x_i\sum_{j=1}^{n} x_j\Big) + E\Big(\frac{1}{n^2}\sum_{j=1}^{n}\sum_{i=1}^{n} x_i x_j\Big)
\]
\[
= E(x_i^2) - \frac{2}{n}E(x_i^2) - 2\,\frac{n-1}{n}\,m_x^2 + \frac{1}{n^2}\sum_{j=1}^{n} E(x_j^2) + \frac{n(n-1)}{n^2}\,m_x^2
\]
\[
= E(x_i^2)\Big(1 - \frac{1}{n}\Big) - \frac{n-1}{n}\,m_x^2
= \sigma^2\,\frac{n-1}{n}
\]
and hence \(E(\hat\sigma^2) = \frac{1}{n-1}\sum_{i=1}^{n} E\big((x_i - \hat{m}_x)^2\big) = \sigma^2\).
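A quick numerical check of this result (a Python sketch, not part of the original derivation; the sample size n = 5 and σ = 2 are arbitrary choices): dividing by n−1 makes the average of \(\hat\sigma^2\) match σ², while dividing by n underestimates it by the factor (n−1)/n.

```python
import random

random.seed(1)
n, trials = 5, 200_000   # small n makes the bias factor (n-1)/n visible

sum_biased = sum_unbiased = 0.0
for _ in range(trials):
    xs = [random.gauss(0.0, 2.0) for _ in range(n)]   # sigma^2 = 4
    m = sum(xs) / n                       # sample mean m_hat
    ss = sum((x - m) ** 2 for x in xs)    # sum of squared deviations
    sum_biased += ss / n                  # divide by n: biased
    sum_unbiased += ss / (n - 1)          # divide by n-1: unbiased

avg_biased = sum_biased / trials          # ~ sigma^2 (n-1)/n = 3.2
avg_unbiased = sum_unbiased / trials      # ~ sigma^2 = 4.0
print(avg_biased, avg_unbiased)
```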
EE 5510 Random Processes E - 3
Biased Coin Toss
Assume that a coin is biased and shows head (=1) with probability θ ≠ 0.5, and tail with probability 1 − θ. We observe n tosses of this coin and are to estimate the probability θ of a head toss. Let us form the statistic
\[
z = \frac{x_1 + x_2 + \cdots + x_n}{n}
\]
The mean of this estimator is
\[
E(Z) = \frac{1}{n}\sum_{i=1}^{n} E(X_i) = \theta
\]
and hence this estimator is unbiased. The variance of this estimator is
\[
\sigma_z^2 = E\big((Z - \theta)^2\big)
= E\Big(\frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n} X_i X_j\Big) - 2E\Big(\frac{1}{n}\sum_{i=1}^{n} X_i\,\theta\Big) + \theta^2
\]
\[
= E\Big(\frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n} X_i X_j\Big) - \theta^2
= \frac{1}{n^2}\big(n\theta + n(n-1)\theta^2\big) - \theta^2
= \frac{\theta(1-\theta)}{n}
\]
where we used \(E(X_i^2) = E(X_i) = \theta\) for the 0/1-valued tosses.
The question arises whether this is the "best" estimator.
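A simulation sketch of the coin-toss estimator (θ = 0.3 and n = 20 are arbitrary choices) confirms E(Z) = θ and var(Z) = θ(1−θ)/n:

```python
import random

random.seed(2)
theta, n, trials = 0.3, 20, 100_000

est = []
for _ in range(trials):
    tosses = [1 if random.random() < theta else 0 for _ in range(n)]
    est.append(sum(tosses) / n)           # z = (x_1 + ... + x_n)/n

mean_z = sum(est) / trials                # ~ theta = 0.3
var_z = sum((z - mean_z) ** 2 for z in est) / trials
print(mean_z, var_z)                      # var ~ theta(1-theta)/n = 0.0105
```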
Minimum Variance Estimators (MVE)
Let us find the estimator which minimizes \(E\big((\hat\theta - \theta)^2\big)\). However, the MVE cannot, in general, be determined, since
\[
E\big((\hat\theta - \theta)^2\big)
= E\Big(\big((\hat\theta - E(\hat\theta)) + (E(\hat\theta) - \theta)\big)^2\Big)
= \big(E(\hat\theta) - \theta\big)^2 + \operatorname{var}(\hat\theta)
\]
since the term \((E(\hat\theta) - \theta)^2\), the squared bias, depends on the unknown quantity θ.
EE 5510 Random Processes E - 4
Cramer-Rao Lower Bound
Assumptions
We assume that we have knowledge of the parametric distribution f(x, θ), and that this distribution is differentiable with respect to θ. We also assume that the domain of the statistic x does not depend on the parameter θ.
Derivation
We now concentrate on unbiased estimators:
\[
E(\hat\theta - \theta) = \int (\hat\theta - \theta)\, f(x,\theta)\, dx = 0
\]
due to the unbiased nature of \(\hat\theta\). Now
\[
\frac{d}{d\theta}\int (\hat\theta - \theta)\, f(x,\theta)\, dx
= \int \frac{d}{d\theta}\Big[(\hat\theta - \theta)\, f(x,\theta)\Big] dx
= -\int f(x,\theta)\, dx + \int (\hat\theta - \theta)\, \frac{d}{d\theta} f(x,\theta)\, dx = 0
\]
Using the identity
\[
\frac{d}{d\theta} f(x,\theta) = \frac{d\ln(f(x,\theta))}{d\theta}\, f(x,\theta)
\]
we develop, using the Schwarz inequality:
\[
1 = \Big(\int (\hat\theta - \theta)\,\frac{d\ln(f(x,\theta))}{d\theta}\, f(x,\theta)\, dx\Big)^2
\le \int (\hat\theta - \theta)^2 f(x,\theta)\, dx \int \Big(\frac{d\ln(f(x,\theta))}{d\theta}\Big)^2 f(x,\theta)\, dx
\]
from which we find the Cramer-Rao lower bound (CRLB):
var(θ) ≥ 1
E(d ln(f(x,θ))
dθ
)2
EE 5510 Random Processes E - 5
Equality in the CRLB is achieved if equality is achieved in the Schwarz inequality (see Page ??) used to prove it, i.e., if
\[
\frac{d\ln(f(x,\theta))}{d\theta} = I(\theta)\,(\hat\theta - \theta)
\]
where
\[
I(\theta) = \int \Big(\frac{d\ln(f(x,\theta))}{d\theta}\Big)^2 f(x,\theta)\, dx
= -\int \frac{d^2\ln(f(x,\theta))}{d\theta^2}\, f(x,\theta)\, dx
\]
is the Fisher Information of θ contained in the observation x.
Example:
Consider n observations of a constant A in Gaussian noise, i.e.,
\[
x_i = A + n_i;\qquad 1 \le i \le n
\]
The PDF of the observation vector \(x = (x_1, \cdots, x_n)\) is given by
\[
f(x, A) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\Big(-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - A)^2\Big)
\]
We calculate
\[
\frac{d\ln(f(x,A))}{dA} = \frac{1}{\sigma^2}\sum_{i=1}^{n}(x_i - A)
= \frac{n}{\sigma^2}\Big(\frac{1}{n}\sum_{i=1}^{n} x_i - A\Big)
\]
from which we can identify the optimal unbiased estimator as
\[
\hat{A} = \frac{1}{n}\sum_{i=1}^{n} x_i
\]
Furthermore
\[
\frac{d^2\ln(f(x,A))}{dA^2} = -\frac{n}{\sigma^2} = -I(A)
\]
and the variance of the estimator is given by
\[
\operatorname{var}(\hat{A}) = \frac{1}{I(A)} = \frac{\sigma^2}{n}
\]
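A simulation sketch (A = 1.5, σ = 2 and n = 10 are arbitrary choices) confirming that the sample mean attains the bound σ²/n:

```python
import random

random.seed(3)
A_true, sigma, n, trials = 1.5, 2.0, 10, 100_000

errs = []
for _ in range(trials):
    x = [A_true + random.gauss(0.0, sigma) for _ in range(n)]
    A_hat = sum(x) / n                    # the efficient estimator
    errs.append(A_hat - A_true)

var_hat = sum(e * e for e in errs) / trials
crlb = sigma ** 2 / n                     # 1/I(A) = sigma^2/n = 0.4
print(var_hat, crlb)                      # the two should agree closely
```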
EE 5510 Random Processes E - 6
Signals in Gaussian Noise
As an example, consider the application of a radar range estimator:
[Figure: radar system block diagram; the received signal passes through the receive filter H(f), producing the observation x(t).]
We transmit s(t) and receive r(t) = s(t − τ) + n_w(t), and wish to estimate the delay τ. We first treat the more general problem of a parameterized signal in Gaussian noise:
\[
x_i = s[i;\theta] + n_i;\qquad 1 \le i \le n
\]
We calculate
d ln(f(x, θ))
dθ=
1
σ2
n∑i=1
(xi − s[n; θ])ds[n; θ]
dθ
and the second derivative
d2 ln(f(x, θ))
dθ2 = − 1
σ2
n∑i=1
((xi − s[n; θ])
d2s[n; θ]
dθ2 −(ds[n; θ]
dθ
)2)
EE 5510 Random Processes E - 7
The Fisher information is now calculated as
\[
I(\theta) = -E\Big(\frac{d^2\ln(f(x,\theta))}{d\theta^2}\Big) = \frac{1}{\sigma^2}\sum_{i=1}^{n}\Big(\frac{ds[i;\theta]}{d\theta}\Big)^2
\]
since \(E(x_i - s[i;\theta]) = 0\), and the CRLB is given by
\[
\operatorname{var}(\hat\theta) \ge \frac{\sigma^2}{\sum_{i=1}^{n}\big(\frac{ds[i;\theta]}{d\theta}\big)^2}
\]
Application to the Radar Problem
In the radar application the receive filter is a brickwall lowpass filter with cutoff frequency W. The output of this filter is sampled at the Nyquist rate, and we have a discrete-time system
\[
x(t) \to x(i/(2W)) = x_i = s(i/(2W) - \tau) + n_i
\]
where n_i is sampled Gaussian noise with variance σ² = N_0 W (page ??). By the CRLB, the variance of any estimator is bounded by
\[
\operatorname{var}(\hat\tau) \ge \frac{N_0 W}{\displaystyle\sum_{i=1}^{n}\Big(\frac{ds(i/2W - \tau)}{d\tau}\Big)^2}
= \frac{N_0 W}{\,2W \displaystyle\sum_{i=1}^{n}\Big(\frac{ds(i/2W - \tau)}{d\tau}\Big)^2 \frac{1}{2W}}
\approx \frac{N_0 W}{\,2W \displaystyle\int_0^T \Big(\frac{ds(t - \tau)}{d\tau}\Big)^2 dt}
\]
\[
= \frac{N_0/2}{\displaystyle\int_{-\infty}^{\infty} (2\pi f)^2 S^2(f)\, df}
= \frac{N_0/2}{E_s}\,\frac{1}{F^2};\qquad
F^2 = \frac{\displaystyle\int_{-\infty}^{\infty} (2\pi f)^2 S^2(f)\, df}{\displaystyle\int_{-\infty}^{\infty} S^2(f)\, df}
\]
where \(E_s = \int_{-\infty}^{\infty} S^2(f)\, df\) is the signal energy.
The parameter F² is the mean-square bandwidth of the signal.
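The mean-square bandwidth can be checked numerically. The sketch below is an illustration only: the ideal lowpass spectrum S(f) = 1 for |f| ≤ W is an assumed example, for which F² has the closed form (2π)²W²/3.

```python
import math

# Assumed example spectrum: S(f) = 1 for |f| <= W, 0 elsewhere.
W = 100.0
m = 100_000
df = 2 * W / m
num = den = 0.0
for k in range(m):
    f = -W + (k + 0.5) * df               # midpoint rule over [-W, W]
    num += (2 * math.pi * f) ** 2 * df    # integral of (2 pi f)^2 S^2(f)
    den += df                             # integral of S^2(f)
F2 = num / den
print(F2, (2 * math.pi) ** 2 * W ** 2 / 3)   # numeric vs. closed form
```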
EE 5510 Random Processes E - 8
Bayesian Estimation
[Figure: classical vs. Bayesian estimation models. In both, an estimate θ̂ = g(x) is formed from the observation x; in the Bayesian model θ is itself drawn from a distribution.]
While in classical estimation nothing is known about the parameter θ, other than its range, in the Bayesian approach θ is considered to be a random variable whose PDF is known or can be estimated. We hence work with the joint and conditional PDFs
f(x, θ) = f(θ|x)f(x)
Minimum Mean Square Error (MMSE) Estimator
What was impossible in classical estimation theory is now doable:
\[
E\big((g(x) - \theta)^2\big)
= \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} (g(x) - \theta)^2\, f(x)\, f(\theta|x)\, d\theta\, dx
= \int_{-\infty}^{\infty} f(x) \int_{-\infty}^{\infty} (g(x) - \theta)^2\, f(\theta|x)\, d\theta\, dx
\]
Note that we can minimize the second integral with respect to g(·) foreach x, and thus minimize the entire expression.
\[
\frac{d}{dg}\int_{-\infty}^{\infty} (g(x) - \theta)^2\, f(\theta|x)\, d\theta
= 2\int_{-\infty}^{\infty} (g(x) - \theta)\, f(\theta|x)\, d\theta = 0
\]
This leads to the solution
\[
g(x) = \int_{-\infty}^{\infty} \theta\, f(\theta|x)\, d\theta
\]
The MMSE estimator is the conditional expectation given x:
\[
\hat{g}_{MMSE}(x) = E(\theta\,|\,X = x)
\]
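That the conditional mean minimizes the mean square error can be seen in a small discrete sketch (the posterior below is a made-up example for one fixed x): sweeping candidate estimates g over a grid shows the minimum MSE is attained at E(θ|x).

```python
# Hypothetical discrete posterior f(theta|x): theta value -> probability
post = {0.0: 0.2, 1.0: 0.5, 2.0: 0.3}

cond_mean = sum(t * p for t, p in post.items())       # E(theta|x) = 1.1

def mse(g):
    # Posterior mean square error of the estimate g
    return sum((g - t) ** 2 * p for t, p in post.items())

# Sweep candidate estimates 0.00, 0.01, ..., 2.00
best_g = min((g / 100 for g in range(0, 201)), key=mse)
print(cond_mean, best_g, mse(cond_mean))   # minimum sits at the conditional mean
```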
EE 5510 Random Processes E - 9
Example: Consider again n observations of an amplitude A in Gaussian noise, i.e.,
\[
x_i = A + n_i;\qquad 1 \le i \le n
\]
The PDF of the observation vector \(x = (x_1, \cdots, x_n)\) is now interpreted as a conditional probability density
\[
f_{X|A}(x|a) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\Big(-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - a)^2\Big)
\]
The critical assumption is now the PDF of A. We assume
\[
f_A(a) = \frac{1}{(2\pi\sigma_A^2)^{1/2}} \exp\Big(-\frac{1}{2\sigma_A^2}(a - \mu_A)^2\Big)
\]
We now need to calculate
\[
\hat{A} = E(A|x)
\]
We use the following result: If x and y are jointly Gaussian distributed, then
\[
E(Y|x) = E(Y) + C_{yx} C_{xx}^{-1}\,(x - E(X));\qquad
C_{y|x} = C_{yy} - C_{yx} C_{xx}^{-1} C_{xy}
\]
where
\[
C_{xx} = E(XX^T) - \mu_x\mu_x^T;\quad
C_{yy} = E(YY^T) - \mu_y\mu_y^T;\quad
C_{yx} = E(YX^T) - \mu_y\mu_x^T;\quad
C_{xy} = C_{yx}^T
\]
Applying these formulas to our case with y = A we obtain
\[
C_{xx} = \sigma_A^2\,\mathbf{1}\mathbf{1}^T + \sigma^2 I;\qquad
C_{yy} = \sigma_A^2;\qquad
C_{yx} = \sigma_A^2\,\mathbf{1}^T
\]
EE 5510 Random Processes E - 10
continued: Our estimator is now given by
\[
\hat{A} = E(A|x) = \mu_A + \sigma_A^2\,\mathbf{1}^T \big(\sigma^2 I + \sigma_A^2\,\mathbf{1}\mathbf{1}^T\big)^{-1} (x - \mu_A\mathbf{1})
\]
Using Woodbury's Identity:
\[
\big(I + c\,\mathbf{1}\mathbf{1}^T\big)^{-1} = I - \frac{c\,\mathbf{1}\mathbf{1}^T}{1 + nc}
\]
allows us to rewrite the conditional expectation as
\[
\hat{A} = E(A|x) = \mu_A + \frac{\sigma_A^2}{\sigma^2}\,\mathbf{1}^T \Bigg(I - \frac{\frac{\sigma_A^2}{\sigma^2}\,\mathbf{1}\mathbf{1}^T}{1 + n\,\frac{\sigma_A^2}{\sigma^2}}\Bigg)(x - \mu_A\mathbf{1})
\]
and after some manipulations we obtain
\[
\hat{A} = \underbrace{\alpha\,\hat{m}_x}_{\text{Data Part}} + \underbrace{(1-\alpha)\,\mu_A}_{\text{Prior Knowledge}};\qquad
\alpha = \frac{\sigma_A^2}{\sigma_A^2 + \sigma^2/n}
\]
where \(\hat{m}_x = \frac{1}{n}\sum_{i=1}^{n} x_i\) is the sample mean.
Mean Square Estimator Error
Let us calculate the MMSE of our Bayesian estimator:
\[
E\big((\hat{A} - A)^2\big)
= E\Bigg(\Big(\frac{\alpha}{n}\sum_{i=1}^{n}(A + n_i) + (1-\alpha)\mu_A - A\Big)^2\Bigg)
\]
\[
= E\big((\alpha - 1)^2 (A - \mu_A)^2\big) + E\Big(\frac{\alpha^2}{n^2}\sum_{i=1}^{n} n_i^2\Big)
= (\alpha - 1)^2 (A - \mu_A)^2 + \alpha^2\,\frac{\sigma^2}{n}
\]
and the Bayesian estimator can give a substantially lower error than the unbiased classical estimator, whose MMSE equals σ²/n.
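A simulation sketch of this comparison (all parameter values are arbitrary choices): drawing A from its prior and averaging squared errors shows the shrinkage estimator beating the sample mean whenever the prior is informative.

```python
import random

random.seed(6)
mu_A, sigma_A, sigma, n, trials = 0.0, 0.5, 2.0, 10, 100_000
alpha = sigma_A ** 2 / (sigma_A ** 2 + sigma ** 2 / n)

se_bayes = se_classical = 0.0
for _ in range(trials):
    A = random.gauss(mu_A, sigma_A)       # draw the parameter from its prior
    m = sum(A + random.gauss(0.0, sigma) for _ in range(n)) / n  # sample mean
    A_bayes = alpha * m + (1 - alpha) * mu_A                     # shrinkage
    se_bayes += (A_bayes - A) ** 2
    se_classical += (m - A) ** 2

print(se_bayes / trials, se_classical / trials)
# classical MSE ~ sigma^2/n = 0.4; Bayesian MSE ~ alpha * sigma^2/n < 0.4
```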
EE 5510 Random Processes E - 11
Linear Estimators
A linear estimator is one of the form:
\[
\hat\theta = \sum_{i=1}^{n} a_i x_i + b
\]
where the coefficients a_i and b are constants. To find the optimal values for a_i and b we minimize the MSE, i.e.,
\[
\frac{\partial}{\partial b}\, E\big((\theta - \hat\theta)^2\big)
= -2E\Big(\theta - \sum_{j=1}^{n} a_j x_j - b\Big) = 0
\quad\Longrightarrow\quad
b = E(\theta) - \sum_{j=1}^{n} a_j \mu_j;\qquad \mu_j = E(x_j)
\]
This is the equivalent of the unbiased condition in the classical case. Furthermore:
\[
\frac{\partial}{\partial a_i}\, E\Bigg(\Big(\theta - \sum_{j=1}^{n} a_j x_j - b\Big)^2\Bigg)
= -2\,E\Bigg(\underbrace{\Big(\theta - \sum_{j=1}^{n} a_j x_j - b\Big)}_{\text{error}}\ \underbrace{x_i}_{\text{data}}\Bigg) = 0
\]
This leads to the famous orthogonality principle:
The data is orthogonal to the estimation error
\[
E\Big(x_i\Big(\theta - \sum_{j=1}^{n} a_j x_j - b\Big)\Big) = 0
\quad\Longrightarrow\quad
E(x_i\theta) = \sum_{j=1}^{n} a_j\big(E(x_i x_j) - \mu_i\mu_j\big) + \mu_i E(\theta)
\]
EE 5510 Random Processes E - 12
Letting \(E(x_i\theta) = R_{\theta i}\), \(E(x_i x_j) = R_{ij}\) and \(E(\theta) = \mu_\theta\), we obtain
\[
R_{\theta i} = \sum_{j=1}^{n} a_j (R_{ij} - \mu_i\mu_j) + \mu_i\mu_\theta;\qquad \forall i
\]
or, in matrix notation,
\[
\big(R - \mu_x\mu_x^T\big)\, a = r_\theta - \mu_\theta\mu_x
\]
where
where
\(R = \{R_{ij}\} = \{E(x_i x_j)\}\) is the n × n correlation matrix of the data,
\(\mu_x\mu_x^T = \{E(x_i)E(x_j)\}\) is the n × n matrix of products of means of the data,
\(a = [a_1, \cdots, a_n]^T\) is the n × 1 vector of linear estimation coefficients, and
\(r_\theta = [R_{\theta 1}, \cdots, R_{\theta n}]^T\).
The optimal, linear solution is given by the filter coefficients
\[
a = \big(R - \mu_x\mu_x^T\big)^{-1}\big(r_\theta - \mu_\theta\mu_x\big)
\]
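These normal equations can be exercised numerically. The sketch below uses a made-up linear model (θ is exactly linear in x plus independent noise; all constants are arbitrary), so the solved coefficients should recover the true relationship θ = 2x₁ − x₂ − 4.

```python
import numpy as np

rng = np.random.default_rng(7)
N = 200_000
# Correlated data with nonzero means
z = rng.standard_normal((N, 2))
x = z @ np.array([[1.0, 0.5], [0.0, 1.0]]) + np.array([1.0, -2.0])
theta = 2.0 * x[:, 0] - x[:, 1] - 4.0 + 0.1 * rng.standard_normal(N)

mu_x = x.mean(axis=0)
mu_t = theta.mean()
C = np.cov(x, rowvar=False)                   # sample version of R - mu mu^T
r = np.array([np.mean(x[:, i] * theta) for i in range(2)]) - mu_t * mu_x
a = np.linalg.solve(C, r)                     # optimal coefficients ~ [2, -1]
b = mu_t - a @ mu_x                           # constant term ~ -4
print(a, b)
```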
EE 5510 Random Processes E - 13
Linear Prediction
Consider a stationary discrete random process \(x = (x_1, x_2, x_3, \cdots)\). We want to predict the value of \(x_{n+j}\), given the values up to \(x_n\), and let us assume that \(E(x_i) = 0\) for simplicity.
We will use a linear estimator, i.e., we let \(\theta = x_{n+j}\) and calculate
\[
\hat{x}_{n+j} = \sum_{i=0}^{N-1} a_i x_{n-i}
\]
The parameter N is the predictor order, i.e., the number of previous samples which are used to predict the new sample value. The solution is given by:
\[
a = R^{-1} r
\]
where, since the process is stationary and \(R_{ij} = R_{j-i}\), the cross-correlation vector matches the data ordering \(x_n, \cdots, x_{n-N+1}\):
\[
r = [R_j, R_{j+1}, \cdots, R_{j+N-1}]^T
\]
and
\[
R = \begin{pmatrix}
R_0 & R_1 & \cdots & R_{N-1}\\
R_1 & R_0 & \cdots & R_{N-2}\\
\vdots & & \ddots & \vdots\\
R_{N-1} & R_{N-2} & \cdots & R_0
\end{pmatrix}
\]
These equations are known as the Yule-Walker prediction equations, also known as the Wiener-Hopf equations.
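For an AR(1)-type correlation \(R_k = \rho^{|k|}\) (an assumed example, normalized so that σ_x² = 1), the Yule-Walker equations for one-step prediction (j = 1) have the known solution a = [ρ, 0, …, 0], which the sketch below confirms together with the prediction error σ_x² − rᵀR⁻¹r:

```python
import numpy as np

rho, N = 0.8, 4
# Toeplitz correlation matrix R_ij = rho^|i-j| and cross-correlation vector
R = np.array([[rho ** abs(i - j) for j in range(N)] for i in range(N)])
r = np.array([rho ** (1 + i) for i in range(N)])  # E(x_{n+1} x_{n-i}) = R_{i+1}
a = np.linalg.solve(R, r)                         # Yule-Walker solution
mse = 1.0 - r @ a                                 # sigma_x^2 - r^T R^{-1} r
print(a, mse)                                     # a ~ [0.8, 0, 0, 0], mse ~ 0.36
```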
EE 5510 Random Processes E - 14
Linear Prediction: Examples
The following two examples illustrate a linear predictor for a random process with sinusoidal and exponentially decaying data correlation. The examples are for a 3-step and a 1-step prediction.
[Figures: two sample paths with predictions overlaid; panel titles "Random Process with Power: 1.2805" and "Random Process with Power: 0.99726".]
EE 5510 Random Processes E - 15
Mean Square Error
\[
E\big((x_{n+1} - \hat{x}_{n+1})^2\big)
= E(x_{n+1}^2) - 2E(x_{n+1}\hat{x}_{n+1}) + E(\hat{x}_{n+1}^2)
\]
Orthogonality: \(E\big((x_{n+1} - \hat{x}_{n+1})\,\hat{x}_{n+1}\big) = 0\), i.e., \(E(x_{n+1}\hat{x}_{n+1}) = E(\hat{x}_{n+1}^2)\), so
\[
E\big((x_{n+1} - \hat{x}_{n+1})^2\big)
= \sigma_x^2 - E(x_{n+1}\hat{x}_{n+1})
= \sigma_x^2 - \sum_{i=0}^{N-1} a_i\, E(x_{n+1} x_{n-i})
= \sigma_x^2 - \sum_{i=0}^{N-1} a_i R_{i+1}
= \sigma_x^2 - a^T r = \sigma_x^2 - r^T R^{-1} r
\]
These principles can be used for filtering, prediction, or smoothing:

[Figure: the three cases differ in where the estimated sample lies relative to the observed data window: smoothing estimates a sample inside the data window, prediction a sample beyond it, and filtering the sample at the current data edge.]
EE 5510 Random Processes E - 16
Linear Estimation of Many Parameters
Let us assume that we want to estimate L parameters θ_l, where the l-th parameter is estimated via the linear equation:
\[
\hat\theta_l = \sum_{i=1}^{n} a_{li} x_i + b_l = a_l^T x + b_l
\]
• First we find the constant term b_l via the MMSE principle:
\[
\frac{\partial}{\partial b_l}\, E\Big(\big(\Theta_l - a_l^T X - b_l\big)^2\Big)
= -2E\big(\Theta_l - a_l^T X - b_l\big) = 0
\]
and hence
\[
b_l = E(\Theta_l) - a_l^T E(X)
\]
• Second, we proceed to evaluate the values of the coefficients a_l:
\[
\frac{\partial}{\partial a_{li}}\, E\Big(\big(\Theta_l - a_l^T(X - E(X)) - E(\Theta_l)\big)^2\Big)
= -2E\Big(\big(\Theta_l - a_l^T(X - E(X)) - E(\Theta_l)\big)\big(X_i - E(X_i)\big)\Big)
\]
\[
= -2E\big((\Theta_l - E(\Theta_l))(X_i - E(X_i))\big) + 2E\big(a_l^T(X - E(X))(X_i - E(X_i))\big)
= -2C_{\theta_l x_i} + 2\,a_l^T C_{x x_i} = 0,\qquad \forall\, l, i
\]
where
\[
C_{x x_i} = \big[E\big((X_1 - E(X_1))(X_i - E(X_i))\big), \cdots, E\big((X_n - E(X_n))(X_i - E(X_i))\big)\big]^T
\]
Collecting all i terms into a vector equation we obtain:
\[
\big[C_{\theta_l x_1}, \cdots, C_{\theta_l x_n}\big] = a_l^T\big[C_{x x_1}, \cdots, C_{x x_n}\big]
\quad\Longrightarrow\quad
C_{\theta_l x}^T = a_l^T C_{xx}
\]
EE 5510 Random Processes E - 17
From this we obtain the optimal estimator coefficients:
\[
a_l^T = C_{\theta_l x}^T C_{xx}^{-1}
\]
and we derive the estimator for θ_l as
\[
\hat\theta_l = a_l^T (x - E(X)) + E(\Theta_l) = C_{\theta_l x}^T C_{xx}^{-1}(x - E(X)) + E(\Theta_l);\qquad \forall l
\]
Summarizing the L equations above into a single matrix equation:
\[
\hat\theta = A(x - E(X)) + b = C_{\theta x} C_{xx}^{-1}(x - E(X)) + E(\Theta)
\]
Note: This has exactly the same form as the conditional expectation for jointly Gaussian random vectors, see page E - 9, and hence \(\hat\theta\) is the optimal estimator if θ and x are Gaussian distributed.
Note: This linear estimator uses only second order statistics, i.e.,
Cθx, Cxx, E(X), E(Θ)
Mean Square Error (MSE)
The mean square error of the linear estimator can be calculated as
\[
MSE_{\hat\theta} = E\Big(\big(\Theta - \hat\Theta\big)\big(\Theta - \hat\Theta\big)^T\Big)
= E\Big(\big(\Theta - E(\Theta) - A(X - E(X))\big)\big(\Theta - E(\Theta) - A(X - E(X))\big)^T\Big)
\]
\[
= C_{\theta\theta} + A C_{xx} A^T - A C_{x\theta} - C_{\theta x} A^T
\]
\[
MSE_{\hat\theta} = C_{\theta\theta} - C_{\theta x} C_{xx}^{-1} C_{x\theta}
\]
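The matrix MSE formula can be verified by simulation. The sketch below uses a toy jointly Gaussian model (the mixing matrix G and noise level are arbitrary assumptions): the empirical error covariance should match C_θθ − C_θx C_xx⁻¹ C_xθ.

```python
import numpy as np

rng = np.random.default_rng(9)
N = 400_000
# Assumed toy model: x = G theta + noise, theta zero-mean with C_tt = I
G = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]])
theta = rng.standard_normal((N, 2))
x = theta @ G.T + 0.5 * rng.standard_normal((N, 3))

Ctt = np.eye(2)
Ctx = G.T                                  # C_theta_x = C_tt G^T = G^T
Cxx = G @ G.T + 0.25 * np.eye(3)           # G C_tt G^T + noise covariance
A = Ctx @ np.linalg.inv(Cxx)
theta_hat = x @ A.T                        # linear MMSE estimate (zero means)

emp_mse = np.cov(theta - theta_hat, rowvar=False)
theory = Ctt - Ctx @ np.linalg.inv(Cxx) @ Ctx.T
print(np.round(emp_mse, 3), np.round(theory, 3))
```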
EE 5510 Random Processes E - 18
Channel Estimation
Consider the unknown discrete channel shown below:
[Figure: tapped-delay-line channel model; the input u_n passes through delays z⁻¹ with taps h_0, h_1, h_2, ..., h_{p-1}, and noise w_n is added to produce the output x_n.]
The output data is given by
\[
x = \begin{pmatrix}
u_0 & & & \\
u_1 & u_0 & & \\
\vdots & \vdots & \ddots & \\
u_{N-1} & u_{N-2} & \cdots & u_{N-p}
\end{pmatrix}
\begin{pmatrix}
h_0\\ h_1\\ \vdots\\ h_{p-1}
\end{pmatrix} + w = Uh + w
\]
The data covariance matrix is given by:
\[
C_{xx} = E\big((X - E(X))(X - E(X))^T\big)
= E\big((U(H - \mu_h))(U(H - \mu_h))^T\big) + E\big(ww^T\big)
\]
\[
= U\,E\big((H - \mu_h)(H - \mu_h)^T\big)\,U^T + E\big(ww^T\big)
= U C_{hh} U^T + C_{ww}
\]
The optimal Bayesian estimator for this problem is given by:
\[
\hat{h} = C_{hh} U^T\big(U C_{hh} U^T + C_{ww}\big)^{-1}(x - U\mu_h) + \mu_h
\]
If the taps are uncorrelated with zero mean, \(C_{hh} = I\), \(\mu_h = 0\), then
\[
\hat{h} = U^T\big(UU^T + C_{ww}\big)^{-1} x
\]
EE 5510 Random Processes E - 19
MMSE Channel Estimator
The matrix \(UU^T\) has rank at most p, and hence \(\big(UU^T + C_{ww}\big)\) is only invertible if \(C_{ww}\) is not (close to) zero. That is, the noise drives the estimator. In this case we apply the matrix inversion lemma:
\[
(A + BCD)^{-1} = A^{-1} - A^{-1}B\big(DA^{-1}B + C^{-1}\big)^{-1}DA^{-1}
\]
to the required inverse with the identifications:
\[
A = C_{ww},\qquad B = U,\qquad C = I,\qquad D = U^T
\]
This gives us, after some manipulations, and with \(C_{ww} = \sigma^2 I\):
\[
\big(UU^T + \sigma^2 I\big)^{-1} = \frac{1}{\sigma^2}\Big(I - U\big(U^T U + \sigma^2 I\big)^{-1} U^T\Big)
\]
and, finally, a simpler formula for the optimal channel estimates:
\[
\hat{h} = \big(U^T U + \sigma^2 I\big)^{-1} U^T x
\]
Remarks:
• The size of the inverse is now only p × p, a significant gain in simplicity.
• Signal Design: If the probing signals are designed such that \(U^T U \approx nI\), the inverse becomes trivial, and
\[
\hat{h} = \frac{1}{n + \sigma^2}\, U^T x
\]
In this case the error is also best, given by
\[
MSE_{\hat{h}} = C_{hh} - C_{hx} C_{xx}^{-1} C_{xh} = \frac{\sigma^2/n}{1 + \sigma^2/n}\, I
\]
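A sketch of the estimator in action (the random ±1 probing sequence, the channel taps, and σ = 0.1 are assumptions for illustration; a ±1 sequence gives U^T U ≈ nI as in the signal-design remark):

```python
import numpy as np

rng = np.random.default_rng(10)
p, n, sigma = 4, 64, 0.1
h = rng.standard_normal(p)                    # unknown channel taps

# Convolution matrix for a random +/-1 probing sequence (assumed design)
u = rng.choice([-1.0, 1.0], size=n + p - 1)
U = np.array([[u[i + p - 1 - k] for k in range(p)] for i in range(n)])
x = U @ h + sigma * rng.standard_normal(n)    # noisy channel output

# MMSE estimate with C_hh = I, mu_h = 0: only a p x p inverse is needed
h_hat = np.linalg.solve(U.T @ U + sigma ** 2 * np.eye(p), U.T @ x)
print(np.max(np.abs(h_hat - h)))              # small tap error at this SNR
```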
EE 5510 Random Processes E - 20
Wiener Filtering
Consider a system where a signal is observed in noise, i.e.,
\[
x = \theta + w = s + w
\]
• parameters evolve continuously
• as many parameters as measurements
Using an optimal linear estimator we obtain the Wiener Filter for this process:
\[
\hat{s} = C_{ss}\big(C_{ss} + C_{ww}\big)^{-1} x
\]
Note that the filter makes use of prior information in the form of the signal correlation \(C_{ss}\).
Filter Interpretation
\[
\hat{s} = C_{ss}\big(C_{ss} + C_{ww}\big)^{-1} x = Fx
\]
This implies the time-varying filter:
\[
\hat{s}_n = \sum_{k=0}^{n} f^{(n)}_{n-k}\, x_k = f^{(n)T} x;
\qquad
\hat{s}_n = r_{ss}^T\big(C_{ss} + C_{ww}\big)^{-1} x
\]
where
• \(r_{ss}^T\) is the last row of \(C_{ss}\),
• \(f^{(n)} = [f^{(n)}_n, \cdots, f^{(n)}_0]\) is the last row of F (note the inverse ordering of the subscripts).
EE 5510 Random Processes E - 21
The equations
\[
\big(C_{ss} + C_{ww}\big)\, f^{(n)} = r_{ss}
\]
are the well-known Wiener-Hopf equations (see E - 13):
\[
\begin{pmatrix}
r_0 + \sigma^2 & r_1 & \cdots & r_n\\
r_1 & r_0 + \sigma^2 & \cdots & r_{n-1}\\
\vdots & & \ddots & \vdots\\
r_n & r_{n-1} & \cdots & r_0 + \sigma^2
\end{pmatrix}
\begin{pmatrix}
f^{(n)}_n\\ f^{(n)}_{n-1}\\ \vdots\\ f^{(n)}_0
\end{pmatrix}
=
\begin{pmatrix}
r_n\\ r_{n-1}\\ \vdots\\ r_0
\end{pmatrix}
\]
where \(r_{ij} = r_{|i-j|}\) is the (i, j)-th entry of \(C_{ss}\), and \(C_{ww} = \sigma^2 I\) is diagonal.
Note: For large values of n the filter taps will become constant since \(r_n \to 0\) as \(n \to \infty\):
\[
\begin{pmatrix}
r_0 + \sigma^2 & \cdots & r_{p-1} & 0 & \cdots & 0\\
r_1 & r_0 + \sigma^2 & \cdots & r_{p-1} & \ddots & \vdots\\
\vdots & & \ddots & & \ddots & 0\\
0 & \cdots & 0 & r_{p-1} & \cdots & r_0 + \sigma^2
\end{pmatrix}
\begin{pmatrix}
0\\ \vdots\\ 0\\ f^{(n)}_{p-1}\\ \vdots\\ f^{(n)}_0
\end{pmatrix}
=
\begin{pmatrix}
0\\ \vdots\\ 0\\ r_{p-1}\\ \vdots\\ r_0
\end{pmatrix}
\]
This leads to stationary filter taps, and we obtain the time-invariantdiscrete Wiener filter:
[Figure: time-invariant discrete Wiener filter as a tapped delay line; input x_n, delays z⁻¹, taps f_0, f_1, f_2, ..., f_{p-1}, output ŝ_n.]
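The stationary filter can be sketched end-to-end. The example below is an assumed AR(1) signal with known correlation \(r_k = \rho^{|k|}/(1-\rho^2)\); the taps are ordered \(f_0\) for the current sample up to \(f_{p-1}\), so the filter solves \((C_{ss} + \sigma^2 I)f = r_{ss}\) as above.

```python
import numpy as np

rng = np.random.default_rng(11)
n, rho, sigma = 50_000, 0.9, 1.0

# AR(1) signal s with autocorrelation r_k = rho^|k| / (1 - rho^2)
w = rng.standard_normal(n)
s = np.zeros(n)
for i in range(1, n):
    s[i] = rho * s[i - 1] + w[i]
x = s + sigma * rng.standard_normal(n)        # observed signal in white noise

# p-tap stationary Wiener filter: (C_ss + sigma^2 I) f = r_ss
p = 8
r = np.array([rho ** k / (1 - rho ** 2) for k in range(p)])
Css = np.array([[rho ** abs(i - j) / (1 - rho ** 2) for j in range(p)]
                for i in range(p)])
f = np.linalg.solve(Css + sigma ** 2 * np.eye(p), r)

# Estimate s_n from the p most recent samples x_n, ..., x_{n-p+1}
s_hat = np.convolve(x, f)[:n]
mse_filt = np.mean((s_hat[p:] - s[p:]) ** 2)
mse_raw = np.mean((x - s) ** 2)               # ~ sigma^2 = 1 without filtering
print(mse_filt, mse_raw)                      # filtering beats raw observation
```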