
Estimation Theory, PhD Course, Ghent University, Belgium


FACULTY OF ENGINEERING AND ARCHITECTURE

Mathematical Techniques in Engineering Science

Module Statistics

Lecture 7+8

Estimation of parameters: Fisher estimation and Bayesian estimation

Stijn De Vuyst

30 November 2016


Statistics Lecture 7+8

Fisher estimation
- Likelihood function
- Score function
- Fisher information
- MSE: bias and variance
- Unbiased estimators: Cramer-Rao Lower Bound
- Biased estimators: sufficient statistics, Rao-Blackwellisation
- Maximum-likelihood estimator
- The EM algorithm, example: censored data

Bayesian estimation


Estimation of parameters: two approaches

(Diagram: a population X with parameter θ; a sample x yields an estimate θ̂.)

Classical framework
Developed in the 1920s and 1930s by Ronald Fisher, Karl Pearson, Jerzy Neyman, . . . Later also C.R. Rao, H. Cramer, Egon Pearson, D. Blackwell, . . .
- θ is unknown, but deterministic
- θ ∈ S, the parameter space

Bayesian framework
18th-century concepts by Thomas Bayes and Pierre-Simon Laplace; huge following after the 1950s due to the availability of computer-intensive methods
- θ is an unknown realisation of a random variable Θ
- Θ ∈ S


Classical setting: Fisher estimation

(Diagram: a population, system, process, . . . X with parameter θ produces data, observations, a sample x, from which an estimate θ̂ is computed. Here θ is a scalar, but it could also be a vector θ in some parameter space S.)

The sample: n independent members taken from the population (n is the sample size)
- X = (X1, X2, . . . , Xn) before observation
- x = (x1, x2, . . . , xn) after observation
- X ∈ Ω, where Ω = R^n for real populations, Ω = {0, 1}^n for Bernoulli populations, . . .

The ‘model’: likelihood function
- p(x; θ) = Prob[observe X = x if the true parameter is θ]
p(x; θ) is called the likelihood function, ln p(x; θ) the log-likelihood
→ can be either a density (X continuous) or a mass function (X discrete)


Example: likelihood function for a Bernoulli population
Assume a Bernoulli population: X ∼ Bern(θ),
i.e. X = 1 with probability θ and X = 0 otherwise
- The observed sample (n = 6) is x = (0, 0, 1, 0, 1, 0)
- Likelihood: p(x; θ) = ∏_{i=1}^{6} p(xi; θ) = (1 − θ)^4 θ^2, θ ∈ S = [0, 1]

(Plot: the likelihood p(x; θ) = (1 − θ)^4 θ^2 as a function of θ ∈ [0, 1], maximal at θ̂_ML = 1/3.)

Maximum-likelihood estimate for parameter θ

θ̂_ML = arg max_θ p(x; θ) = (count of 1s in the data)/n = c/n = 2/6 = 1/3
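To make this concrete, here is a minimal numerical sketch (in Python, assuming NumPy is available) that evaluates the likelihood of this sample on a grid over S = [0, 1] and locates its maximum; it recovers the closed-form answer c/n = 1/3.

```python
# Grid evaluation of the Bernoulli likelihood for the sample x = (0, 0, 1, 0, 1, 0).
import numpy as np

x = np.array([0, 0, 1, 0, 1, 0])           # observed sample, n = 6
thetas = np.linspace(0.0, 1.0, 10001)      # grid over the parameter space S = [0, 1]

# likelihood p(x; theta) = theta^(number of 1s) * (1 - theta)^(number of 0s)
lik = thetas**x.sum() * (1.0 - thetas)**(len(x) - x.sum())

print(thetas[np.argmax(lik)])              # ~0.333, i.e. the ML estimate c/n = 2/6 = 1/3
```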


Score function

The score function of the model

S(θ, x) = ∂/∂θ ln p(x; θ) = (∂/∂θ p(x; θ)) / p(x; θ)

- S(θ, x) indicates the relative change in likelihood
- indicates the sensitivity of the log-likelihood to its parameter θ

Expected value and variance of the score
If X is not yet observed, the score S(θ, X) at θ is a random variable. What are its mean and variance?
- The expected score is 0:

E[S(θ,X)] = ∫_Ω (∂/∂θ ln p(x; θ)) p(x; θ) dx = ∫_Ω ((∂/∂θ p(x; θ)) / p(x; θ)) p(x; θ) dx (REG)= ∂/∂θ ∫_Ω p(x; θ) dx = ∂/∂θ 1 = 0

- The variance of the score is called the Fisher information J(θ):
Var[S(θ,X)] = E[S²(θ,X)] = E[(∂/∂θ ln p(X; θ))²] ≜ J(θ)
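A small Monte Carlo sketch (an assumed Bernoulli example, using NumPy) checks both facts numerically: the simulated score has mean close to 0 and variance close to the Fisher information J(θ) = n/(θ(1 − θ)) of a Bernoulli sample of size n.

```python
# Monte Carlo check: E[S(theta, X)] ~ 0 and Var[S(theta, X)] ~ J(theta) for a Bern(theta) sample.
import numpy as np

rng = np.random.default_rng(0)
theta, n, reps = 0.3, 6, 200_000

c = rng.binomial(n, theta, size=reps)            # count of 1s in each simulated sample
score = c / theta - (n - c) / (1 - theta)        # S(theta, x) = d/dtheta ln p(x; theta)

print(score.mean())                              # ~ 0
print(score.var(), n / (theta * (1 - theta)))    # both ~ J(theta)
```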


Fisher Information
- J(θ) is the variance of the score function S(θ,X), averaged over all possible samples X in Ω
- J(θ) measures how much you can expect to learn from the sample X about the parameter θ

Property

J(θ) = E[(∂/∂θ ln p(X; θ))²] = −E[∂²/∂θ² ln p(X; θ)]

Proof: The first equality is the definition of the Fisher information.
The second follows from E[S(θ,X)] = 0, ∀θ, which means that also:

0 = ∂/∂θ E[S(θ,X)] = ∂/∂θ ∫_Ω (∂/∂θ ln p) p dx
(REG)= ∫_Ω [(∂²/∂θ² ln p) p + (∂/∂θ ln p)(∂/∂θ p)] dx
= ∫_Ω (∂²/∂θ² ln p) p dx + ∫_Ω (∂/∂θ ln p)² p dx = E[∂²/∂θ² ln p] + E[(∂/∂θ ln p)²]   QED

(!) Note we assume sufficient ‘regularity’ (REG) of the likelihood function p(x; θ), so that differentiation over θ and integration over x can be switched
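For a single Bernoulli observation (an assumed illustration, using SymPy) the property can also be verified symbolically: both expressions reduce to 1/(θ(1 − θ)).

```python
# Symbolic check of J(theta) = E[(d ln p / d theta)^2] = -E[d^2 ln p / d theta^2]
# for one Bernoulli observation.
import sympy as sp

theta, x = sp.symbols('theta x', positive=True)
logp = x * sp.log(theta) + (1 - x) * sp.log(1 - theta)   # Bernoulli log-likelihood

def expect(f):
    # expectation over x in {0, 1} with Prob[X = 1] = theta
    return sp.simplify(theta * f.subs(x, 1) + (1 - theta) * f.subs(x, 0))

J1 = expect(sp.diff(logp, theta)**2)                 # E[(score)^2]
J2 = sp.simplify(-expect(sp.diff(logp, theta, 2)))   # -E[second derivative]
print(sp.simplify(J1 - J2))                          # 0: the two expressions agree
```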


Estimators for a parameter θ

Definition: An estimator θ̂ is a statistic Ω → S : x ↦ θ̂(x) (not depending on any unknown parameters!),
giving values that are hopefully ‘close’ to the true θ.
(!) After observation, θ̂(x) is a deterministic number;
before observation, θ̂(X) is a random variable. → θ̂ is a shorthand notation for either, depending on the context.

MEAN → bias
- E[θ̂ − θ] = E[θ̂(X)] − θ is the bias; if the bias = 0 for all θ ∈ S → the estimator is ‘unbiased’;
if the estimator is not unbiased → the estimator is biased


Estimators for a parameter θ

VARIANCE → Mean Square Error
- The variance of estimator θ̂ is the expected square deviation from E[θ̂]:
Var[θ̂] = E[(θ̂(X) − E[θ̂(X)])²] = E[((θ̂ − θ) − (E[θ̂] − θ))²]
= E[(θ̂ − θ)²] − 2(E[θ̂] − θ)E[θ̂ − θ] + (E[θ̂] − θ)²
= E[(θ̂ − θ)²] [= MSE] − (E[θ̂] − θ)² [= bias²]

The Mean Square Error (MSE) is the expected square deviation from the true θ.
⇒ MSE(θ̂) = bias² + Var[θ̂]

Minimum Variance Unbiased estimator (MVU)
θ̂ is unbiased and has lower variance than any other unbiased estimator, for all θ ∈ S

−→ estimator is ‘MVU’
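A Monte Carlo sketch of the decomposition (an assumed example: the biased variance estimator (1/n) Σ (Xi − X̄)² for a normal population), showing that the simulated MSE matches bias² + Var.

```python
# MSE = bias^2 + variance, checked by simulation for the biased variance estimator.
import numpy as np

rng = np.random.default_rng(1)
mu, sigma2, n, reps = 5.0, 4.0, 10, 200_000

x = rng.normal(mu, np.sqrt(sigma2), size=(reps, n))
est = x.var(axis=1)                       # (1/n) sum (x_i - xbar)^2, biased for sigma^2

bias = est.mean() - sigma2
mse = np.mean((est - sigma2)**2)
print(mse, bias**2 + est.var())           # the two numbers agree up to Monte Carlo noise
```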


Estimators for a parameter θ
Often, the asymptotic distribution of an estimator is of interest
→ behaviour of θ̂(X) when the sample size n becomes very large?

An estimator θ̂_n = θ̂(X1, . . . , Xn) of θ is consistent if and only if
θ̂_n converges to θ (‘in probability’) for n → ∞, ∀θ ∈ S, i.e.
lim_{n→∞} Prob[|θ̂_n − θ| > ε] = 0, ∀ε > 0, or plim_{n→∞} θ̂_n = θ, ∀θ ∈ S

Consistency vs. bias, examples:
- θ̂_n = X̄ : unbiased and consistent
- θ̂_n = (X1 + X2 + X3)/3 for n > 3 : unbiased but not consistent
- θ̂_n = −1/n + (1/n) Σ_{i=1}^{n} Xi : biased but consistent (a small simulation sketch of this case follows below)
- θ̂_n = a ≠ θ : biased and not consistent
(Plots: sampling distributions of each estimator around θ for n = 1, 2, 3, 5, 10, 50.)
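A quick simulation sketch (an assumed N(θ, 1) population) of the third case: θ̂_n = −1/n + X̄ has bias −1/n, which vanishes as n grows, and its spread shrinks, so it is biased but consistent.

```python
# 'Biased but consistent': bias -1/n and standard deviation both shrink with n.
import numpy as np

rng = np.random.default_rng(2)
theta = 2.0
for n in (1, 2, 3, 5, 10, 50, 1000):
    x = rng.normal(theta, 1.0, size=(100_000, n))
    est = -1.0 / n + x.mean(axis=1)
    print(n, est.mean() - theta, est.std())   # bias ~ -1/n, spread ~ 1/sqrt(n)
```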


Unbiased estimators: Cramer-Rao Lower Bound (CRLB)
There may be many plausible estimators θ̂ for θ.
? Which is the ‘best’?
Several criteria for a suitable estimator are possible, but suppose we aim for an MVU estimator (unbiased and minimal MSE).

Lower bound for the MSE of unbiased estimators
Given the model p(x; θ), there is a lower bound on the MSE that any unbiased estimator θ̂ can possibly achieve:
MSE(θ̂(X)) ≥ 1/J(θ)   → the ‘Cramer-Rao Lower Bound’ (CRLB)
If θ̂ reaches this bound, MSE(θ̂(X)) = 1/J(θ) → the estimator is ‘efficient’.
- the CRLB is the inverse of the Fisher information
- having a lot of information in the sample about the true θ (high J(θ)) allows for estimators with very low variance
- efficient ⇒ MVU, but MVU ⇏ efficient, because the CRLB cannot always be reached by MVU estimators


Cramer-Rao Lower Bound (CRLB): proof

- The Cauchy-Schwarz inequality, best known in Euclidean vector spaces R^n:
u = (u1, . . . , un) ∈ R^n is an n-dimensional vector
||u|| = √(u1² + . . . + un²) is the Euclidean length of u
inner (dot) product: u · v = ||u|| ||v|| cos α = u1 v1 + . . . + un vn, with cos α ∈ [−1, 1]
(Diagram: vectors u and v at angle α; ||u|| cos α is the length of the projection of u onto v.)
Cauchy-Schwarz:

(u · v)² ≤ ||u||² ||v||², with equality iff u = kv
or: (Σ_i ui vi)² ≤ (Σ_i ui²)(Σ_i vi²), with equality iff ui = k vi, ∀i
- If n → ∞, R^n becomes a Hilbert space or ‘function space’:
(∫ u(x)v(x) dx)² ≤ (∫ u(x)² dx)(∫ v(x)² dx), with equality iff u(x) = k v(x), ∀x


Cramer-Rao Lower Bound (CRLB): proof

θ̂(x) is an unbiased estimator for θ, so E[θ̂(X) − θ] = 0
⇒ 0 = ∂/∂θ E[θ̂(X) − θ] = ∂/∂θ ∫ (θ̂(x) − θ) p(x; θ) dx
(REG)= ∫ ∂/∂θ ((θ̂(x) − θ) p(x; θ)) dx
= ∫ (0 − 1) p(x; θ) dx [= −1] + ∫ (θ̂(x) − θ) S(θ, x) p(x; θ) dx   [since ∂/∂θ p(x; θ) = S(θ, x) p(x; θ)]

⇒ 1 = ∫_Ω [(θ̂(x) − θ)√p(x; θ)] [√p(x; θ) S(θ, x)] dx = ∫_Ω u(x) v(x) dx,
with u(x) = (θ̂(x) − θ)√p(x; θ) and v(x) = √p(x; θ) S(θ, x).

In particular, for these two functions:
∫ u(x)² dx = ∫ (θ̂(x) − θ)² p(x; θ) dx = E[(θ̂ − θ)²] = MSE(θ̂)
∫ v(x)² dx = ∫ S²(θ, x) p(x; θ) dx = E[S²(θ, X)] = J(θ)


Cramer-Rao Lower Bound (CRLB): proof

So, due to the Cauchy-Schwarz inequality in Hilbert space:
(∫ u(x)v(x) dx)² [= 1] ≤ ∫ u(x)² dx [= MSE(θ̂)] · ∫ v(x)² dx [= J(θ)]
which proves the theorem: MSE(θ̂) = Var[θ̂] ≥ 1/J(θ)   QED

Efficient form

The bound becomes an equality if (and only if) u(x) = k v(x), i.e. iff
S(θ, x) = k(θ)[θ̂(x) − θ]   (the ‘efficient form’)
If the score function can be written as k(θ)[θ̂ − θ] for all θ ∈ S → the estimator θ̂ is ‘efficient’


Example: estimate the variance of a normal population

Assume a zero-mean normal population: X ∼ N(0, σ²).
? How to estimate σ² (= θ) given only the data x = (x1, . . . , xn)?

- Likelihood: p(x; θ) = ∏_{i=1}^{n} (1/√(2πθ)) exp(−xi²/(2θ))
- Log-likelihood: ln p(x; θ) = −n ln√(2π) − (n/2) ln θ − (1/2) Σ_{i=1}^{n} xi²/θ
- Score: S(θ, x) = ∂/∂θ ln p(x; θ) = −n/(2θ) + (1/2) Σ_{i=1}^{n} xi²/θ² = [n/(2θ²)] ((1/n) Σ_{i=1}^{n} xi² − θ), i.e. k(θ)(θ̂(x) − θ) with k(θ) = n/(2θ²) and θ̂(x) = (1/n) Σ_{i=1}^{n} xi²

The score function can be written in efficient form!

so θ̂(x) = (1/n) Σ_{i=1}^{n} xi² is an unbiased and efficient estimator for θ = σ²
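A Monte Carlo sketch of this example (assumed parameter values): the Fisher information here is J(θ) = n/(2θ²) (from the second derivative of the log-likelihood), so the CRLB is 2θ²/n; the simulation shows that θ̂ = (1/n) Σ xi² is unbiased and attains this bound.

```python
# theta_hat = (1/n) sum x_i^2 for X ~ N(0, theta): unbiased, variance ~ CRLB = 2*theta^2/n.
import numpy as np

rng = np.random.default_rng(3)
theta, n, reps = 3.0, 20, 400_000          # theta = sigma^2

x = rng.normal(0.0, np.sqrt(theta), size=(reps, n))
theta_hat = (x**2).mean(axis=1)

print(theta_hat.mean(), theta)               # ~ theta  -> unbiased
print(theta_hat.var(), 2 * theta**2 / n)     # ~ CRLB   -> efficient
```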


Example: estimate the intensity of a Poisson process
A Poisson process with intensity λ is a point process such that the times between ‘events’ are independent and exponentially distributed with mean τ = 1/λ.
(Diagram: events on the time axis between 0 and t; the inter-event times are ∼ Expon(λ) and N = n events are observed.) The number of events N in an interval of length t is Poiss(λt).

- Likelihood: p(n; λ) = Prob[n events in an interval of length t] = e^{−λt} (λt)^n / n!
- Log-likelihood: ln p(n; λ) = −λt + n ln(λt) − ln n!
- Score: S(λ, n) = ∂/∂λ ln p(n; λ) = −t + n/λ = (t/λ)(n/t − λ), i.e. k(λ)(λ̂(n) − λ) with k(λ) = t/λ and λ̂(n) = n/t

This is the efficient form, so λ̂(n) = n/t is an unbiased, efficient estimator for λ!
(!) However, the inverse 1/λ̂ = t/n is not an efficient estimator for τ:
p(n; τ) = e^{−t/τ} (t/τ)^n / n!, so that the score is S(τ, n) = −n/τ + t/τ²
This cannot be written in efficient form (the would-be factor k(τ) would have to depend on the data n), so no unbiased efficient estimator for τ exists!
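A Monte Carlo sketch for the first part of this example (assumed values for λ and t): λ̂ = N/t is unbiased and its variance attains the CRLB 1/J(λ) = λ/t.

```python
# lambda_hat = N/t for N ~ Poiss(lambda * t): unbiased and efficient (variance = lambda/t).
import numpy as np

rng = np.random.default_rng(4)
lam, t, reps = 2.5, 10.0, 500_000

N = rng.poisson(lam * t, size=reps)      # events counted in an interval of length t
lam_hat = N / t

print(lam_hat.mean(), lam)               # ~ lambda  -> unbiased
print(lam_hat.var(), lam / t)            # ~ CRLB    -> efficient
```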


Biased estimators

Should we always try to find unbiased estimators? No!

- They may not exist: e.g. no unbiased estimator for 1/p from a Bern(p) population exists
- They may be unreasonable: e.g. the MVU estimate of p from X ∼ Geom(p) is p̂(X) = 1⟨X=1⟩;
this estimate is always 0 or 1
- They may have an extremely large variance (= MSE)

So unbiased estimators do not always minimize the MSE: MSE(θ̂) = bias² + Var[θ̂].
→ Sometimes it is better to sacrifice unbiasedness for lower variance.

Minimising the MSE
We require:
- the concept of sufficient statistics
- the Rao-Blackwell theorem


Sufficient statistics

Recall: a statistic T(x) is any function of the sample data, not depending on unknown parameters.
It could also be vector-valued: T(x) : Ω → R^m, with m < n typically.

A statistic T(x) is sufficient with respect to the model p(x; θ) if
p(x | T(x); θ) = p(x | T(x)), ∀x,
i.e. if the distribution of X, given that T(X) = t, is independent of θ
→ “All you can learn about θ from the data X, you can also learn from the statistic T(X)”
- If X is a book in which θ is a character, then a summary T(X) is sufficient if it gives all information about θ that is also in the book
- Sufficiency can be checked using the Neyman-Fisher criterion


Sufficient statistics

Neyman-Fisher factorisation criterion
A statistic T(x) is sufficient with respect to the model p(x; θ)
⇔ p(x; θ) = g(x) · h(T(x), θ), ∀x ∈ Ω,
with g(x) independent of θ and h depending on x only through T(x)

Proof: (assuming X is discrete)
- First note that if t = T(x), then “T(X) = t, X = x” and “X = x” are the same event!
→ p(x; θ) = p(x, t; θ)
(⇒) p(x; θ) = p(x, t; θ) = p(x|t; θ) · p(t; θ) [sufficiency]= p(x|t) · p(t; θ), so take g(x) = p(x|t) and h(t, θ) = p(t; θ)
(⇐) p(x|t; θ) = p(x, t; θ) / p(t; θ) = p(x, t; θ) / Σ_{x′} p(x′, t; θ) 1⟨T(x′)=t⟩ = g(x) h(t, θ) / Σ_{x′} g(x′) h(t, θ) 1⟨T(x′)=t⟩, which is independent of θ, so it equals p(x|t) → sufficiency   QED


Example: sample mean for a Bernoulli population
Assume again a Bernoulli population: X ∼ Bern(θ),
i.e. p(x; θ) = θ^x (1 − θ)^{1−x} for x ∈ {0, 1}
- Sample size n
- Take as statistic the sample mean T(X) = X̄ = (1/n) Σ_{i=1}^{n} Xi = C/n, with C the count of 1s in the sample

p(x; θ) = ∏_{i=1}^{n} p(xi; θ) = ∏_{i=1}^{n} θ^{xi} (1 − θ)^{1−xi} = θ^{Σxi} (1 − θ)^{n−Σxi} = θ^{nT(x)} (1 − θ)^{n−nT(x)} · 1, with h(T(x), θ) = θ^{nT(x)} (1 − θ)^{n−nT(x)} and g(x) = 1

Neyman-Fisher checks out, so the sample mean is a sufficient statistic for θ

→ T(x) is also efficient, since S(θ, x) = [n/(θ(1 − θ))] (T(x) − θ)


Rao-Blackwellisation of an estimator

Rao-Blackwell Theorem

For a model p(x; θ), let θ̂(x) be an estimator for θ such that Var[θ̂] exists. If T(x) is a sufficient statistic, then for the new estimator
θ̂*(t) = E[θ̂(X) | T(X) = t],
1) the new estimator θ̂* is a statistic, i.e. it does not depend on θ
2) if θ̂ is unbiased, then θ̂* is also unbiased
3) MSE(θ̂*) ≤ MSE(θ̂) → so the new estimator may be ‘better’!
4) MSE(θ̂*) = MSE(θ̂) iff θ̂(x) depends on x only through T(x)

- This process of improving existing estimators is called ‘Rao-Blackwellisation’
- The process is idempotent: repeating it will give no further improvement
- The proof is essentially based on the law of total expectation:
Let f(t) = E[ · | T = t]; then E[ · ] = E_T[f(T)] = E_T[E_X[ · | T]]
(inner expectation over all X for which T(X) is fixed; outer expectation over all T)


Rao-Blackwellisation of an estimator

Proof:

1) θ̂* is a statistic because of the sufficiency of T(x):
→ θ̂*(t) = Σ_x θ̂(x) p(x|t; θ), and by sufficiency p(x|t; θ) = p(x|t) does not depend on θ, so neither does θ̂*
2) θ = E[θ̂] [θ̂ unbiased] = E_T[E_X[θ̂(X)|T]] = E_T[θ̂*(T)] = E[θ̂*]

3) Since both estimators are unbiased, their MSE equals their variance, so

MSE(θ̂) − MSE(θ̂*) = Var[θ̂] − Var[θ̂*] = E[θ̂²] − (E[θ̂])² [= θ²] − E[θ̂*²] + (E[θ̂*])² [= θ²]
= E[θ̂²(X)] − E[θ̂*²(T)] = E_T[ E_X[θ̂²(X)|T] − θ̂*²(T) ]
= E_T[ E_X[θ̂²(X)|T] − (E_X[θ̂(X)|T])² ] = E_T[ Var[θ̂(X)|T] ] ≥ 0
4) Equality holds iff Var[θ̂(X)|T = t] = 0, ∀t → given T(X) = t, θ̂ is fixed,
so θ̂(x) depends on x only through T(x)


Example: estimate the maximum of a uniform distribution
Observe X1, . . . , Xn ∼ Unif(0, a); how to estimate the upper bound a?
(Diagram: observations x2, x4, x1, x3 on the interval [0, a]; max(x) = t, with a unknown.)

- Original (naive) estimator: since E[Xi] = a/2, one could propose
â(x) = 2x̄ = (2/n) Σ_{i=1}^{n} xi → E[â] = a, MSE(â) = a²/(3n) (exercise)
- T(x) = max(x) is sufficient for a, since Neyman-Fisher checks out:

p(x; a) = ∏_{i=1}^{n} (1/a) 1⟨0 ≤ xi ≤ a⟩ = (1/a^n) 1⟨T(x) ≤ a⟩ · ∏_{i=1}^{n} 1⟨0 ≤ xi⟩
- Rao-Blackwellised new estimator (suppose n > 1):

â*(t) = E[â(X) | T(X) = t] = E[(2/n)(Σ_{i=1}^{n−1} Xi + t) | T(X) = t] = 2t/n + (n − 1) t/n
= ((n + 1)/n) t = ((n + 1)/n) max(x) → E[â*] = a, MSE(â*) = a²/(n(n + 2)) (exercise)

We find that indeed MSE(â*) < MSE(â), ∀n > 1
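A Monte Carlo sketch of this comparison (assumed values a = 10, n = 8): the simulated MSEs of the naive estimator 2X̄ and of its Rao-Blackwellised version ((n+1)/n) max(x) match a²/(3n) and a²/(n(n+2)) respectively.

```python
# Rao-Blackwellisation in action: MSE of 2*xbar versus MSE of (n+1)/n * max(x).
import numpy as np

rng = np.random.default_rng(5)
a, n, reps = 10.0, 8, 300_000

x = rng.uniform(0.0, a, size=(reps, n))
a_naive = 2.0 * x.mean(axis=1)                 # naive estimator based on the sample mean
a_rb = (n + 1) / n * x.max(axis=1)             # Rao-Blackwellised estimator

print(np.mean((a_naive - a)**2), a**2 / (3 * n))        # ~ a^2/(3n)
print(np.mean((a_rb - a)**2), a**2 / (n * (n + 2)))     # ~ a^2/(n(n+2)), much smaller
```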


The Maximum-likelihood Estimator

For a model p(x; θ), the maximum-likelihood estimator θ̂_ML (MLE) for θ is the value of θ for which the model produces the highest probability of observing the sample X = x,
θ̂_ML(x) = arg max_{θ∈S} p(x; θ)
(Plot: likelihood p(x; θ) as a function of θ, maximal at θ̂_ML.)

- Finding θ̂_ML is a maximisation problem:
∂/∂θ p(x; θ) = 0 ⇒ ∂/∂θ ln p(x; θ) [the score function] = 0 ⇒ S(θ, x) = 0

→ so it involves finding the zeroes of the score function; this usually requires numerical (search) algorithms


The Maximum-likelihood Estimator

Properties

- Any unbiased, efficient estimator θ̂ is also the MLE:
the score has the efficient form S(θ, x) = k(θ)(θ̂(x) − θ), so
S(θ, x) = 0 at θ = θ̂(x) → θ̂ is the MLE.
The converse is not true: not all MLEs are efficient.

- Under some regularity conditions, however, for increasing sample size n → ∞, the MLE
  - is consistent: plim_{n→∞} θ̂_{ML,n} = θ
  - is asymptotically efficient: lim_{n→∞} Var[θ̂_{ML,n}] / (1/(n J(θ))) = 1
  - is asymptotically normal: θ̂_{ML,n} → N(θ, 1/(n J(θ))) as n → ∞ (here J(θ) is the Fisher information of a single observation); a simulation sketch follows below
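A Monte Carlo sketch of the asymptotic normality (an assumed exponential example, not from the slides): for X ∼ Expon(λ) the MLE is λ̂ = 1/X̄ and the per-observation Fisher information is J(λ) = 1/λ², so for large n the standardised MLE (λ̂ − λ)/√(λ²/n) should look approximately standard normal.

```python
# Asymptotic normality of the MLE for the rate of an exponential population.
import numpy as np

rng = np.random.default_rng(6)
lam, n, reps = 2.0, 400, 100_000

x = rng.exponential(1.0 / lam, size=(reps, n))
lam_hat = 1.0 / x.mean(axis=1)                 # MLE of the rate

z = (lam_hat - lam) / np.sqrt(lam**2 / n)      # standardise with 1/(n J(lambda)) = lambda^2/n
print(z.mean(), z.std())                       # ~ 0 and ~ 1
print(np.mean(np.abs(z) < 1.96))               # ~ 0.95 if the normal approximation holds
```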


EM algorithm (Expectation/Maximisation) for finding MLE

Observed data vs. complete data
The log-likelihood ln p(x; θ) may be a complicated function of θ, so that
finding arg max_θ ln p(x; θ) → is difficult
But in the case where the observed data x is only part of the underlying complete data (x [observed], y [hidden]),
often the complete-data log-likelihood problem
finding arg max_θ ln p(x, y; θ) → is easy

EM-algorithm

- Numerical search algorithm: θ_0 →E→M→ θ_1 →E→M→ θ_2 →E→M→ . . . → θ̂_ML
- Guaranteed to converge to a local likelihood maximum


EM algorithm

p(x, y; θ) = p(x; θ) p(y|x; θ) → ln p(x; θ) [observed LL, max is difficult] = ln p(x, y; θ) [complete LL, max is easy] − ln p(y|x; θ) [hidden data, conditional on x]

EM approaches the argmax of the observed LL by iteratively maximising the complete LL:

E-step (expectation)
So we need to maximise ln p(x, y; θ) . . . but how, if y is unknown!?
- Trick 1: Replace the complete LL by its expected value:
L_x(θ) = E[ln p(x, Y; θ)] = ∫ ln p(x, y; θ) p(y|x; θ) dy
- Trick 2: Use the current estimate θ_k of θ to fix the distribution of the hidden data
→ replace p(y|x; θ) by p(y|x; θ_k) and calculate
L_x(θ|θ_k) = ∫ ln p(x, y; θ) p(y|x; θ_k) dy

M-step (maximisation)
The next estimate of θ is: θ_{k+1} ← arg max_θ L_x(θ|θ_k)


EM algorithm

- It can be shown that, for the observed LL:
ln p(x; θ_{k+1}) ≥ ln p(x; θ_k)
So if the likelihood has a local maximum, the EM algorithm will converge to it.
- In fact, the EM algorithm is especially useful when the parameter to be estimated is a vector

θ = (θ1, . . . , θh)

so that the ‘search space’ S is very large.


Example: censored data

An electricity company has a power line to a part of the city with fluctuating daily demand. It

is known/assumed that the demand W of one day, measured in MWh, is N(µ, 1). That is,
the variance is known (σ = 1 MWh) but the mean is not.

To estimate the mean daily power demand µ = E[W ], the company asks n = 5 employees to

measure the power, on 5 different days and each with a different power meter. Unfortunately,

the meters have a limited range ri, i = 1, . . . , n. If Wi > ri, the meter fails (×) and does not

give a reading.

employee (i)   meter range ri (MWh)   measurement xi (MWh)
1              7                      ×
2              5                      4.2
3              8                      ×
4              6                      4.7
5              10                     6.9

→ We try to find the MLE for µ. The mean of the uncensored measurements is x̄ = (1/3)(4.2 + 4.7 + 6.9) ≈ 5.27


Example: censored data — direct maximisation of the observed LL
Suppose the first m ≤ n measurements succeeded, x = (x1, . . . , xm) (observed),
and the rest failed, Y = (Y_{m+1}, . . . , Y_n) (hidden) → Yi > ri, m < i ≤ n

p(x; µ) = ∏_{i=1}^{m} φ(xi − µ) · ∏_{i=m+1}^{n} (1 − Φ(ri − µ))
ℓ_obs(µ) = ln p(x; µ) = −(m/2) ln(2π) − Σ_{i=1}^{m} (1/2)(xi − µ)² + Σ_{i=m+1}^{n} ln(1 − Φ(ri − µ))
µ̂_ML satisfies ℓ′_obs(µ) = 0, or:
m(µ − x̄) = Σ_{i=m+1}^{n} φ(ri − µ) / (1 − Φ(ri − µ))
— a transcendental equation, difficult to solve; it can only be done numerically

→ so let us use the EM algorithm instead!
(Plot: observed LL ℓ_obs(µ) = ln p(x; µ) as a function of µ over [5.0, 8.0]; the maximum can be found using numerical techniques; x̄ is marked on the µ-axis.)


Example: censored data — E-step
The complete LL is ln p(x, Y; µ) = −(n/2) ln(2π) − (1/2) Σ_{i=1}^{m} (xi − µ)² − (1/2) Σ_{i=m+1}^{n} (Yi − µ)²

- Trick 1: Replace the complete LL by its expected value:
E[ln p(x, Y; µ)] = −(1/2) Σ_{i=1}^{m} (xi − µ)² − (1/2) Σ_{i=m+1}^{n} E[(Yi − µ)²] + c   (c: some constant independent of µ)
E[(Yi − µ)²] = ∫_{ri}^{∞} (y − µ)² p(y; µ) dy, with the truncated density p(yi; µ) = φ(yi − µ) / (1 − Φ(ri − µ))

- Trick 2: . . . and use the current estimate µ_k for the distribution of the hidden data:
E_{µ_k}[(Yi − µ)²] = ∫_{ri}^{∞} (y − µ)² p(y; µ_k) dy = ∫_{ri}^{∞} (y² − 2yµ + µ²) p(y; µ_k) dy
= −2µ ∫_{ri}^{∞} y p(y; µ_k) dy [= E_{µ_k}[Yi] = E_{µ_k}[W | W > ri]] + µ² ∫_{ri}^{∞} p(y; µ_k) dy [= 1] + c
= −2µ (µ_k + φ(ri − µ_k) / (1 − Φ(ri − µ_k))) + µ² + c


Example: censored data — M-step
L_x(µ|µ_k) = −(1/2) Σ_{i=1}^{m} (xi − µ)² − (1/2) Σ_{i=m+1}^{n} [ −2µ (µ_k + φ(ri − µ_k)/(1 − Φ(ri − µ_k))) + µ² ] + c
L′_x(µ|µ_k) = 0 ⇔ m x̄ − nµ + (n − m) µ_k + Σ_{i=m+1}^{n} φ(ri − µ_k)/(1 − Φ(ri − µ_k)) = 0

So we update: µ_{k+1} ← (m/n) x̄ + ((n − m)/n) µ_k + (1/n) Σ_{i=m+1}^{n} φ(ri − µ_k)/(1 − Φ(ri − µ_k))

(Plot: observed LL ℓ_obs(µ) as a function of µ, with the EM iterates µ_0, µ_1, µ_2 marked.)
Started with µ_0 = x̄; convergence is very fast, only 2 or 3 iterations are required here.
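A sketch of this EM iteration on the slide's data, with σ = 1 known (it assumes SciPy's norm.pdf/norm.cdf for φ and Φ; the censored meters are those with ranges 7 and 8):

```python
# EM for the censored-normal mean (sigma = 1): iterate the update formula above.
import numpy as np
from scipy.stats import norm

x = np.array([4.2, 4.7, 6.9])     # observed measurements (m = 3)
r = np.array([7.0, 8.0])          # ranges of the meters that failed (censored: W > r)
m, n = len(x), len(x) + len(r)
xbar = x.mean()

mu = xbar                         # start with mu_0 = xbar
for k in range(50):
    hazard = norm.pdf(r - mu) / (1.0 - norm.cdf(r - mu))   # phi/(1 - Phi) per censored meter
    mu_new = m / n * xbar + (n - m) / n * mu + hazard.sum() / n
    if abs(mu_new - mu) < 1e-10:
        break
    mu = mu_new

print(mu)                         # ML estimate of the mean demand, somewhat above xbar ~ 5.27
```

Only a handful of iterations are needed before the update stabilises, in line with the convergence remark above.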


Example: censored data — what if σ is also unknown!?
No problem: the EM algorithm can be used to approximate θ = (µ, σ²):
µ_{k+1} ← (m/n) x̄ + ((n − m)/n) µ_k + (1/n) Σ_{i=m+1}^{n} σ_k φ((ri − µ_k)/σ_k) / (1 − Φ((ri − µ_k)/σ_k))
σ²_{k+1} ← (1/n) Σ_{i=1}^{m} xi² + ((n − m)/n)(µ_k² + σ_k²) + (1/n) Σ_{i=m+1}^{n} σ_k (µ_k + ri) φ((ri − µ_k)/σ_k) / (1 − Φ((ri − µ_k)/σ_k)) − µ²_{k+1}

(Plot: the EM iterates (µ_k, σ_k) converging to (µ̂_ML, σ̂_ML); the observed LL at the optimum is −5.91.)
Started with µ_0 = x̄, σ_0² = 1; convergence is again very fast, only 6 or 7 iterations are required here.

Bayesian estimation
(Slides 34-38 cover Bayesian estimation; no further content was recoverable from this extraction.)