




Bayesian Estimation

• Bayesian estimators differ from all classical estimators studied so far in that they consider the parameters as random variables instead of unknown constants.

• As such, the parameters also have a PDF, which needs to be taken into account when deriving an estimator.

• The PDF of the parameters can be used for incorporating any prior knowledge we may have about their values.


Bayesian Estimation

• For example, we might know that the normalized frequency f0 of an observed sinusoid cannot be greater than 0.1. This is ensured by choosing

$$p(f_0) = \begin{cases} 10, & 0 \le f_0 \le 0.1 \\ 0, & \text{otherwise} \end{cases}$$

as the prior PDF in the Bayesian framework.

• Usually differentiable PDF’s are easier, and we could approximate the uniform PDF with, e.g., the Rayleigh PDF.

[Figure: two prior densities over the normalized frequency f0 ∈ [0, 1] — top: the uniform density; bottom: a Rayleigh density with σ = 0.035.]


Prior and Posterior estimates

• One of the key properties of the Bayesian approach is that it can also be used for small data records, and the estimate can be improved sequentially as new data arrives.

• For example, consider tossing a coin and estimating the probability of a head, µ.

• As we saw earlier, the ML estimate is the number of observed heads divided by the total number of tosses: µ̂_ML = #heads / #tosses.

• However, if we cannot afford to make more than, say, three experiments, we may end up seeing three heads and no tails. Thus, we are forced to infer that µ̂_ML = 1, i.e., the coin always lands heads.


Prior and Posterior estimates

• The Bayesian approach can circumvent this problem, because the prior regularizes the likelihood and avoids overfitting to the small amount of data.

• The pictures below illustrate this. The one on the top is the likelihood function

$$p(x \mid \mu) = \mu^{\#\text{heads}} (1 - \mu)^{\#\text{tails}}$$

with #heads = 3 and #tails = 0. The maximum of the function is at unity.

• The second curve is the prior density p(µ) of our choice. It was selected to reflect the fact that we assume that the coin is probably quite fair.


Prior and Posterior estimates

• The third curve is the posterior density p(µ | x) after observing the samples, which can be evaluated using the Bayes formula

$$p(\mu \mid x) = \frac{p(x \mid \mu)\, p(\mu)}{p(x)} = \frac{\text{likelihood} \cdot \text{prior}}{p(x)}$$

• Thus, the third curve is the product of the first two (with normalization), and one Bayesian alternative is to use the maximum as the estimate.
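As a concrete illustration of this product rule, here is a minimal numerical sketch (Python with NumPy is assumed; the grid resolution and the Gaussian prior with mean 0.5 and σ = 0.1 are illustrative choices matching the coin example):

import numpy as np

# grid over the parameter mu (probability of a head)
mu = np.linspace(1e-6, 1 - 1e-6, 10001)

# likelihood after observing 3 heads and 0 tails: p(x | mu) = mu^3 (1 - mu)^0
heads, tails = 3, 0
likelihood = mu**heads * (1 - mu)**tails

# Gaussian prior centred at 0.5 with sigma = 0.1 ("the coin is probably quite fair")
sigma = 0.1
prior = np.exp(-0.5 * ((mu - 0.5) / sigma) ** 2)

# posterior is proportional to likelihood * prior; normalize over the grid
posterior = likelihood * prior
posterior /= posterior.sum()

print("posterior mode (MAP):", mu[np.argmax(posterior)])   # about 0.554, cf. the analytical solution later
print("posterior mean (MMSE):", np.sum(mu * posterior))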


Prior and Posterior estimates

[Figure: three curves over µ ∈ [0, 1] — top: the likelihood function p(x | µ) after three tosses, all heads; middle: the prior density p(µ) before observing any data; bottom: the posterior density p(µ | x) after observing three heads.]


Cost Functions

• Bayesian estimators are defined by a minimization problem

$$\hat\theta = \arg\min_{\hat\theta} \int\!\!\int C(\theta - \hat\theta)\, p(x, \theta)\, dx\, d\theta$$

which seeks the value of θ̂ that minimizes the average cost.


Cost Functions

• The cost function C(x) is typically one of the following:

1. Quadratic: C(x) = x²

2. Absolute: C(x) = |x|

3. Hit-or-miss:
$$C(x) = \begin{cases} 0, & |x| < \delta \\ 1, & |x| > \delta \end{cases}$$

• Additional cost functions include Huber’s robust loss and ε-insensitive loss.


Cost Functions

• These three cost functions are favoured, because we can find the minimum cost solution in closed form. We will introduce the solutions next.

• Functions 1 and 3 are slightly easier to use than 2. Thus, we’ll concentrate on those.

• Regardless of the cost function, the above double integral can be evaluated and minimized using the rule for joint probabilities:

$$p(x, \theta) = p(\theta \mid x)\, p(x).$$


Cost Functions

• This results in

$$\int\!\!\int C(\theta - \hat\theta)\, p(\theta \mid x)\, p(x)\, dx\, d\theta = \int \underbrace{\left( \int C(\theta - \hat\theta)\, p(\theta \mid x)\, d\theta \right)}_{(*)} p(x)\, dx$$

• Because p(x) is always nonnegative, it suffices to minimize the multiplier inside the brackets, (∗):¹

$$\hat\theta = \arg\min_{\hat\theta} \int C(\theta - \hat\theta)\, p(\theta \mid x)\, d\theta$$

¹ Note that there is a slight shift in the paradigm. The double integral results in the theoretical estimator that requires knowledge of p(x). When minimizing only the inner integral, we get the optimum for a particular realization, not all possible realizations.


1. Quadratic Cost Solution (or the MMSE estimator)

• If we select the quadratic cost, then the Bayesian estimator is defined by

$$\arg\min_{\hat\theta} \int (\theta - \hat\theta)^2\, p(\theta \mid x)\, d\theta$$

• Simple differentiation gives:

$$\frac{\partial}{\partial \hat\theta} \int (\theta - \hat\theta)^2\, p(\theta \mid x)\, d\theta = \int \frac{\partial}{\partial \hat\theta}\left[ (\theta - \hat\theta)^2\, p(\theta \mid x) \right] d\theta = \int -2(\theta - \hat\theta)\, p(\theta \mid x)\, d\theta$$


1. Quadratic Cost Solution (or the MMSE estimator)

• Setting this equal to zero gives

$$\int -2(\theta - \hat\theta)\, p(\theta \mid x)\, d\theta = 0$$

$$\Leftrightarrow \int 2\hat\theta\, p(\theta \mid x)\, d\theta = \int 2\theta\, p(\theta \mid x)\, d\theta$$

$$\Leftrightarrow \hat\theta \underbrace{\int p(\theta \mid x)\, d\theta}_{=1} = \int \theta\, p(\theta \mid x)\, d\theta$$

$$\Leftrightarrow \hat\theta = \int \theta\, p(\theta \mid x)\, d\theta$$


1. Quadratic Cost Solution (or the MMSE estimator)

• Thus, we have the minimum:

$$\hat\theta_{MMSE} = \int \theta\, p(\theta \mid x)\, d\theta = E(\theta \mid x),$$

i.e., the mean of the posterior PDF, p(θ | x).²

• This is called the minimum mean square error estimator (MMSE estimator), because it minimizes the average squared error.

² The prior PDF, p(θ), refers to the parameter distribution before any observations are made. The posterior PDF, p(θ | x), refers to the parameter distribution after observing the data.


2. Absolute Cost Solution

• If we choose the absolute value as the cost function, we have to minimize

$$\arg\min_{\hat\theta} \int \left| \theta - \hat\theta \right| p(\theta \mid x)\, d\theta$$

• This can be shown to be equivalent to the following condition

$$\int_{-\infty}^{\hat\theta} p(\theta \mid x)\, d\theta = \int_{\hat\theta}^{\infty} p(\theta \mid x)\, d\theta$$


2. Absolute Cost Solution

• In other words, the estimate is the value which divides the probability mass into equal proportions:

$$\int_{-\infty}^{\hat\theta} p(\theta \mid x)\, d\theta = \frac{1}{2}$$

• Thus, we have arrived at the definition of the median of the posterior PDF.


3. Hit-or-miss Cost Solution (or the MAP estimator)

• For the hit-or-miss case, we also need to minimize the inner integral:

$$\hat\theta = \arg\min_{\hat\theta} \int C(\theta - \hat\theta)\, p(\theta \mid x)\, d\theta$$

with

$$C(x) = \begin{cases} 0, & |x| < \delta \\ 1, & |x| > \delta \end{cases}$$


3. Hit-or-miss Cost Solution (or the MAP estimator)

• The integral becomes

$$\int C(\theta - \hat\theta)\, p(\theta \mid x)\, d\theta = \int_{-\infty}^{\hat\theta - \delta} 1 \cdot p(\theta \mid x)\, d\theta + \int_{\hat\theta + \delta}^{\infty} 1 \cdot p(\theta \mid x)\, d\theta$$

or in a simplified form

$$\int C(\theta - \hat\theta)\, p(\theta \mid x)\, d\theta = 1 - \int_{\hat\theta - \delta}^{\hat\theta + \delta} 1 \cdot p(\theta \mid x)\, d\theta$$


3. Hit-or-miss Cost Solution (or the MAP estimator)

• This is minimized by maximizing

$$\int_{\hat\theta - \delta}^{\hat\theta + \delta} p(\theta \mid x)\, d\theta$$

• For small δ and smooth p(θ | x) the maximum of the integral occurs at the maximum of p(θ | x).

• Therefore, the estimator is the mode (the highest value) of the posterior PDF. Thus the name Maximum a Posteriori (MAP) estimator.


3. Hit-or-miss Cost Solution (or the MAP estimator)

• Note that the MAP estimator

$$\hat\theta_{MAP} = \arg\max_{\theta} p(\theta \mid x)$$

is calculated as (using Bayes’ rule):

$$\hat\theta_{MAP} = \arg\max_{\theta} \frac{p(x \mid \theta)\, p(\theta)}{p(x)}$$


3. Hit-or-miss Cost Solution (or the MAP estimator)

• Since p(x) does not depend on θ, it is equivalent to maximize only the numerator:

$$\hat\theta_{MAP} = \arg\max_{\theta} p(x \mid \theta)\, p(\theta)$$

• Incidentally, this is close to the ML estimator:

$$\hat\theta_{ML} = \arg\max_{\theta} p(x \mid \theta)$$

The only difference is the inclusion of the prior PDF.


Summary

• To summarize, the three most widely used Bayesian estimators are

1. The MMSE: $\hat\theta_{MMSE} = E(\theta \mid x)$

2. The median: the $\hat\theta$ for which $\int_{-\infty}^{\hat\theta} p(\theta \mid x)\, d\theta = \frac{1}{2}$

3. The MAP: $\hat\theta_{MAP} = \arg\max_{\theta} p(x \mid \theta)\, p(\theta)$
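A minimal sketch of how all three estimators could be read off a numerically gridded posterior (Python with NumPy is assumed; the Beta-shaped posterior is only an illustrative stand-in for p(θ | x)):

import numpy as np

# any gridded posterior p(theta | x) will do; here a Beta(4, 2)-shaped stand-in
theta = np.linspace(1e-6, 1 - 1e-6, 100001)
posterior = theta**3 * (1 - theta)
posterior /= posterior.sum()                      # normalize the grid weights

theta_mmse = np.sum(theta * posterior)            # 1. posterior mean (MMSE)
theta_med = theta[np.searchsorted(np.cumsum(posterior), 0.5)]   # 2. posterior median
theta_map = theta[np.argmax(posterior)]           # 3. posterior mode (MAP)

print(theta_mmse, theta_med, theta_map)           # about 0.667, 0.686, 0.750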


Example

• Consider the case of tossing a coin three times, resulting in three heads.

• In the example, we used the Gaussian prior

$$p(\mu) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{1}{2\sigma^2} (\mu - 0.5)^2 \right).$$

• Now µ̂_MAP becomes

$$\hat\mu_{MAP} = \arg\max_{\mu} p(x \mid \mu)\, p(\mu) = \arg\max_{\mu} \left[ \mu^{\#\text{heads}} (1 - \mu)^{\#\text{tails}} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{1}{2\sigma^2} (\mu - 0.5)^2 \right) \right]$$


Example

• Let’s simplify the arithmetic by setting #heads = 3 and #tails = 0:

$$\hat\mu_{MAP} = \arg\max_{\mu} \left[ \mu^3 \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{1}{2\sigma^2} (\mu - 0.5)^2 \right) \right]$$

• Equivalently, we can maximize its logarithm:

$$\arg\max_{\mu} \left[ 3\ln\mu - \ln\sqrt{2\pi\sigma^2} - \frac{1}{2\sigma^2} (\mu - 0.5)^2 \right]$$


Example

• Now,

$$\frac{\partial}{\partial\mu} \ln\left[ p(x \mid \mu)\, p(\mu) \right] = \frac{3}{\mu} - \frac{\mu - 0.5}{\sigma^2} = 0,$$

when

$$\mu^2 - 0.5\mu - 3\sigma^2 = 0.$$

This happens when

$$\mu = \frac{0.5 \pm \sqrt{0.25 - 4 \cdot 1 \cdot (-3\sigma^2)}}{2} = 0.25 \pm \frac{\sqrt{0.25 + 12\sigma^2}}{2}.$$


Example

• If we substitute the value used in the example, σ = 0.1,

$$\hat\mu_{MAP} = 0.25 + \frac{\sqrt{0.37}}{2} \approx 0.554.$$

• Thus, we have found the analytical solution for the maximum of the posterior curve shown earlier.
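This result can also be checked numerically, e.g. with the following sketch (Python with NumPy assumed; a simple grid search replaces the closed-form root):

import numpy as np

sigma = 0.1
mu = np.linspace(1e-6, 1 - 1e-6, 1000001)

# unnormalized log-posterior: 3 ln(mu) - (mu - 0.5)^2 / (2 sigma^2)
log_post = 3 * np.log(mu) - (mu - 0.5)**2 / (2 * sigma**2)

mu_map_numeric = mu[np.argmax(log_post)]
mu_map_formula = 0.25 + np.sqrt(0.25 + 12 * sigma**2) / 2

print(mu_map_numeric, mu_map_formula)   # both approximately 0.554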


Vector Parameter Case for MMSE

• In the vector parameter case, the MMSE estimator is

$$\hat\theta_{MMSE} = E(\theta \mid x)$$

or more explicitly

$$\hat\theta_{MMSE} = \begin{pmatrix} \int \theta_1\, p(\theta \mid x)\, d\theta \\ \int \theta_2\, p(\theta \mid x)\, d\theta \\ \vdots \\ \int \theta_p\, p(\theta \mid x)\, d\theta \end{pmatrix}$$


Vector Parameter Case for MMSE

• In the linear model case, there exists a straightforward solution. If the observed data can be modeled as

$$x = H\theta + w,$$

where θ ∼ N(µ_θ, C_θ) and w ∼ N(0, C_w), then

$$E(\theta \mid x) = \mu_\theta + C_\theta H^T (H C_\theta H^T + C_w)^{-1} (x - H\mu_\theta)$$


Vector Parameter Case for MMSE

• It is possible to derive an alternative form resembling the LS estimator (exercise):

$$E(\theta \mid x) = \mu_\theta + (C_\theta^{-1} + H^T C_w^{-1} H)^{-1} H^T C_w^{-1} (x - H\mu_\theta).$$

• Note that with µ_θ = 0, C_θ = I and C_w = σ_w² I this becomes (σ_w² I + Hᵀ H)⁻¹ Hᵀ x, a regularized form of the LS estimator; the plain LS estimator is recovered when the prior becomes uninformative (C_θ⁻¹ → 0).
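A small numerical sketch (Python with NumPy assumed; the dimensions and covariances are arbitrary test values) confirming that the two forms of the linear-model MMSE estimator coincide:

import numpy as np

rng = np.random.default_rng(0)
N, p = 20, 3
H = rng.standard_normal((N, p))
mu_theta = rng.standard_normal(p)
C_theta = 0.5 * np.eye(p)                  # prior covariance (test value)
C_w = 0.1 * np.eye(N)                      # noise covariance (test value)

theta = rng.multivariate_normal(mu_theta, C_theta)
x = H @ theta + rng.multivariate_normal(np.zeros(N), C_w)

# form 1: mu + C_theta H^T (H C_theta H^T + C_w)^{-1} (x - H mu)
form1 = mu_theta + C_theta @ H.T @ np.linalg.solve(H @ C_theta @ H.T + C_w, x - H @ mu_theta)

# form 2: mu + (C_theta^{-1} + H^T C_w^{-1} H)^{-1} H^T C_w^{-1} (x - H mu)
A = np.linalg.inv(C_theta) + H.T @ np.linalg.inv(C_w) @ H
form2 = mu_theta + np.linalg.solve(A, H.T @ np.linalg.inv(C_w) @ (x - H @ mu_theta))

print(np.allclose(form1, form2))           # True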


Vector Parameter Case for the MAP

• The MAP estimator can also be extended to vector parameters:

$$\hat\theta_{MAP} = \arg\max_{\theta} p(\theta \mid x)$$

or, using Bayes’ rule,

$$\hat\theta_{MAP} = \arg\max_{\theta} p(x \mid \theta)\, p(\theta)$$

• Note that in general this is different from p scalar MAPs. A scalar MAP would maximize for each parameter θ_i individually, whereas the vector MAP seeks the global maximum over the whole parameter vector.


Example: MMSE Estimation of Sinusoidal Parameters

• Consider the data model

$$x[n] = a \cos 2\pi f_0 n + b \sin 2\pi f_0 n + w[n], \quad n = 0, 1, \ldots, N-1$$

or in vector form

$$x = H\theta + w,$$

where

$$H = \begin{pmatrix} 1 & 0 \\ \cos 2\pi f_0 & \sin 2\pi f_0 \\ \cos 4\pi f_0 & \sin 4\pi f_0 \\ \vdots & \vdots \\ \cos(2(N-1)\pi f_0) & \sin(2(N-1)\pi f_0) \end{pmatrix} \quad \text{and} \quad \theta = \begin{pmatrix} a \\ b \end{pmatrix}$$


Example: MMSE Estimation of Sinusoidal Parameters

• We depart from the classical model by assuming that a and b are random variables with prior PDF θ ∼ N(0, σ_θ² I). Also w is assumed Gaussian (N(0, σ² I)) and independent of θ.

• Using the second version of the formula for the linear model (given above), we get the MMSE estimator:

$$E(\theta \mid x) = \mu_\theta + (C_\theta^{-1} + H^T C_w^{-1} H)^{-1} H^T C_w^{-1} (x - H\mu_\theta)$$


Example: MMSE Estimation of Sinusoidal Parameters

or, in our case,³

$$E(\theta \mid x) = \left( \frac{1}{\sigma_\theta^2} I + H^T \frac{1}{\sigma^2} I H \right)^{-1} H^T \frac{1}{\sigma^2} I\, x = \left( \frac{1}{\sigma_\theta^2} I + \frac{1}{\sigma^2} H^T H \right)^{-1} H^T \frac{1}{\sigma^2}\, x$$

³ Note the correspondence with ridge regression: ridge regression is equivalent to the Bayesian estimator with a Gaussian prior on the coefficients, and the LASSO is equivalent to the Bayesian estimator with a Laplacian prior.
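The footnote's correspondence with ridge regression can be sketched numerically as follows (Python with NumPy assumed; zero prior mean, C_θ = σ_θ² I and C_w = σ² I, so the regularization parameter becomes σ²/σ_θ²):

import numpy as np

rng = np.random.default_rng(1)
N, p = 50, 4
H = rng.standard_normal((N, p))
x = rng.standard_normal(N)
sigma2, sigma2_theta = 2.0, 0.5            # noise and prior variances (test values)

# Bayesian MMSE estimate with a zero-mean Gaussian prior (the formula above)
bayes = np.linalg.solve(np.eye(p) / sigma2_theta + H.T @ H / sigma2, H.T @ x / sigma2)

# ridge regression with regularization parameter alpha = sigma^2 / sigma_theta^2
alpha = sigma2 / sigma2_theta
ridge = np.linalg.solve(H.T @ H + alpha * np.eye(p), H.T @ x)

print(np.allclose(bayes, ridge))           # True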


Example: MMSE Estimation of Sinusoidal Parameters

• In earlier examples we have seen that the columns of H are nearly orthogonal (exactly orthogonal if f0 = k/N):

$$H^T H \approx \frac{N}{2} I$$

• Thus,

$$E(\theta \mid x) \approx \left( \frac{1}{\sigma_\theta^2} I + \frac{N}{2\sigma^2} I \right)^{-1} H^T \frac{1}{\sigma^2}\, x = \frac{\frac{1}{\sigma^2}}{\frac{1}{\sigma_\theta^2} + \frac{N}{2\sigma^2}}\, H^T x.$$


Example: MMSE Estimation of Sinusoidal Parameters

• In all, the MMSE estimates become

$$\hat a_{MMSE} = \frac{1}{1 + \frac{2\sigma^2/N}{\sigma_\theta^2}} \left[ \frac{2}{N} \sum_{n=0}^{N-1} x[n] \cos 2\pi f_0 n \right]$$

$$\hat b_{MMSE} = \frac{1}{1 + \frac{2\sigma^2/N}{\sigma_\theta^2}} \left[ \frac{2}{N} \sum_{n=0}^{N-1} x[n] \sin 2\pi f_0 n \right]$$


Example: MMSE Estimation of Sinusoidal Parameters

• For comparison, recall that the classical MVU estimator is

$$\hat a_{MVU} = \frac{2}{N} \sum_{n=0}^{N-1} x[n] \cos 2\pi f_0 n$$

$$\hat b_{MVU} = \frac{2}{N} \sum_{n=0}^{N-1} x[n] \sin 2\pi f_0 n$$
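To make the comparison concrete, here is a minimal Monte Carlo sketch (Python with NumPy assumed; this is not the course's BayesSinusoid.m script, and the parameter values are illustrative):

import numpy as np

rng = np.random.default_rng(2)
N, f0, M = 100, 0.15, 500                   # samples, normalized frequency, number of runs
sigma2_theta, sigma2 = 0.5, 4.0             # prior and noise variances (illustrative)
n = np.arange(N)
c, s = np.cos(2 * np.pi * f0 * n), np.sin(2 * np.pi * f0 * n)

err_mvu, err_mmse = [], []
for _ in range(M):
    a, b = rng.normal(0.0, np.sqrt(sigma2_theta), size=2)        # true parameters drawn from the prior
    x = a * c + b * s + rng.normal(0.0, np.sqrt(sigma2), size=N)

    a_mvu, b_mvu = 2 / N * np.sum(x * c), 2 / N * np.sum(x * s)  # classical (MVU) estimates

    shrink = 1 / (1 + (2 * sigma2 / N) / sigma2_theta)           # Bayesian shrinkage factor
    a_mmse, b_mmse = shrink * a_mvu, shrink * b_mvu              # MMSE estimates (orthogonal approximation)

    err_mvu.append((a_mvu - a) ** 2 + (b_mvu - b) ** 2)
    err_mmse.append((a_mmse - a) ** 2 + (b_mmse - b) ** 2)

print("classical MSE:", np.mean(err_mvu), "  Bayesian MSE:", np.mean(err_mmse))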


Example: MMSE Estimation of Sinusoidal Parameters

• The difference can be interpreted as a weighting between the prior knowledge and the data.

• If the prior knowledge is unreliable (σ_θ² large), then 1/(1 + (2σ²/N)/σ_θ²) ≈ 1 and the two estimators are almost equal.

• If the data is unreliable (σ² large), then the coefficient 1/(1 + (2σ²/N)/σ_θ²) is small, making the estimate close to the mean of the prior PDF.


Example: MMSE Estimation of Sinusoidal Parameters

• An example run is illustrated below. In this case, N = 100, f0 = 15/N, σ_θ² = 0.48566 and σ² = 4.1173. Altogether M = 500 tests were performed.

• Since the prior PDF has a small variance, the estimator gains a lot from using it. This is seen as a significant difference between the MSE’s of the two estimators.


Example: MMSE Estimation of Sinusoidal Parameters

[Figure: histograms of the estimates over the M = 500 runs — classical estimator of a, MSE = 0.072474; classical estimator of b, MSE = 0.092735; Bayesian estimator of a, MSE = 0.061919; Bayesian estimator of b, MSE = 0.076355.]


Example: MMSE Estimation of Sinusoidal Parameters

• If the prior has a higher variance, the Bayesian approach does not perform that much better. In the pictures below, σ_θ² = 2.1937 and σ² = 1.9078. The difference in performance between the two approaches is negligible.


Example: MMSE Estimation of Sinusoidal Parameters

[Figure: histograms of the estimates — classical estimator of a, MSE = 0.040066; classical estimator of b, MSE = 0.034727; Bayesian estimator of a, MSE = 0.03951; Bayesian estimator of b, MSE = 0.034477.]


Example: MMSE Estimation of Sinusoidal Parameters

• The program code is available at http://www.cs.tut.fi/courses/SGN-2606/BayesSinusoid.m


Example: MAP Estimator

• Assume that

$$p(x[n] \mid \theta) = \begin{cases} \theta \exp(-\theta x[n]), & \text{if } x[n] > 0 \\ 0, & \text{if } x[n] < 0 \end{cases}$$

with x[n] conditionally IID, and the prior of θ:

$$p(\theta) = \begin{cases} \lambda \exp(-\lambda\theta), & \text{if } \theta > 0 \\ 0, & \text{if } \theta < 0 \end{cases}$$

• Now, θ is the unknown RV and λ is known.


Example: MAP Estimator

• Then the MAP estimator is found by maximizing p(θ | x) or, equivalently, p(x | θ)p(θ).

• Because both PDF’s have an exponential form, it’s easier to maximize the logarithm instead:

$$\hat\theta = \arg\max_{\theta} \left( \ln p(x \mid \theta) + \ln p(\theta) \right).$$


Example: MAP Estimator

• Now,

$$\ln p(x \mid \theta) + \ln p(\theta) = \ln\left[ \prod_{n=0}^{N-1} \theta \exp(-\theta x[n]) \right] + \ln\left[ \lambda \exp(-\lambda\theta) \right]$$

$$= \ln\left[ \theta^N \exp\left( -\theta \sum_{n=0}^{N-1} x[n] \right) \right] + \ln\left[ \lambda \exp(-\lambda\theta) \right]$$

$$= N \ln\theta - N\theta\bar{x} + \ln\lambda - \lambda\theta$$

• Differentiation produces

$$\frac{d}{d\theta}\left[ \ln p(x \mid \theta) + \ln p(\theta) \right] = \frac{N}{\theta} - N\bar{x} - \lambda$$


Example: MAP Estimator

• Setting it equal to zero produces the MAP estimator:

$$\hat\theta = \frac{1}{\bar{x} + \frac{\lambda}{N}}$$
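A small sketch comparing this with the ML estimate 1/x̄ on simulated data (Python with NumPy assumed; the values of θ, λ and N are illustrative):

import numpy as np

rng = np.random.default_rng(3)
theta_true, lam, N = 2.0, 1.0, 25                    # illustrative values
x = rng.exponential(scale=1 / theta_true, size=N)    # p(x[n] | theta) = theta exp(-theta x[n])

xbar = x.mean()
theta_ml = 1 / xbar                    # ML estimate
theta_map = 1 / (xbar + lam / N)       # MAP estimate; the prior pulls the estimate towards smaller values

print(theta_ml, theta_map)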


Example: Deconvolution

• Consider the situation where a signal s[n] passes through a channel with impulse response h[n] and is further corrupted by noise w[n]:

$$x[n] = h[n] * s[n] + w[n] = \sum_{k=0}^{K} h[k]\, s[n-k] + w[n], \quad n = 0, 1, \ldots, N-1$$


Example: Deconvolution

• Since convolution commutes, we can write this as

$$x[n] = \sum_{k=0}^{n_s - 1} h[n-k]\, s[k] + w[n]$$

• In matrix form this is expressed by

$$\begin{pmatrix} x[0] \\ x[1] \\ \vdots \\ x[N-1] \end{pmatrix} = \begin{pmatrix} h[0] & 0 & \cdots & 0 \\ h[1] & h[0] & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ h[N-1] & h[N-2] & \cdots & h[N-n_s] \end{pmatrix} \begin{pmatrix} s[0] \\ s[1] \\ \vdots \\ s[n_s - 1] \end{pmatrix} + \begin{pmatrix} w[0] \\ w[1] \\ \vdots \\ w[N-1] \end{pmatrix}$$


Example: Deconvolution

• Thus, we have again the linear model

$$x = Hs + w$$

where the unknown parameter θ is the original signal s.

• The noise is assumed Gaussian: w[n] ∼ N(0, σ²).

• A reasonable assumption for the signal is that s ∼ N(0, C_s) with [C_s]_ij = r_ss[i − j], where r_ss is the autocorrelation function of s.

• According to the linear-model formula given earlier, the MMSE estimator is

$$E(s \mid x) = \mu_s + C_s H^T (H C_s H^T + C_w)^{-1} (x - H\mu_s) = C_s H^T (H C_s H^T + \sigma^2 I)^{-1} x$$
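A minimal sketch of this estimator (Python with NumPy assumed; the channel h, the exponential autocorrelation model and the noise level are illustrative choices):

import numpy as np

rng = np.random.default_rng(4)
ns, sigma2 = 50, 0.1
h = np.array([1.0, 0.7, 0.3])                  # illustrative channel impulse response
N = ns + len(h) - 1

# convolution matrix H such that H @ s computes h * s
H = np.zeros((N, ns))
for k, hk in enumerate(h):
    H += hk * np.eye(N, ns, -k)

# signal prior: zero mean, autocorrelation r_ss[k] = 0.9^|k|
idx = np.arange(ns)
Cs = 0.9 ** np.abs(idx[:, None] - idx[None, :])

s = rng.multivariate_normal(np.zeros(ns), Cs)              # draw a signal from the prior
x = H @ s + rng.normal(0.0, np.sqrt(sigma2), size=N)       # pass it through the channel, add noise

# MMSE deconvolution: s_hat = Cs H^T (H Cs H^T + sigma^2 I)^{-1} x
s_hat = Cs @ H.T @ np.linalg.solve(H @ Cs @ H.T + sigma2 * np.eye(N), x)

print("MSE of the MMSE estimate:", np.mean((s_hat - s) ** 2))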


Example: Deconvolution

• In general, the form of the estimator varies a lot between different cases. However, as a special case:

• When H = I, the channel is the identity and only noise is present. In this case

$$\hat s = C_s (C_s + \sigma^2 I)^{-1} x$$

This case is called the Wiener filter. For example, in the single data point case,

$$\hat s[0] = \frac{r_{ss}[0]}{r_{ss}[0] + \sigma^2}\, x[0]$$

Thus, the variance of the noise acts as a parameter that expresses the reliability of the data relative to the prior.