
The Expectation-Maximization (EM) Algorithm

0.

The EM algorithm, theoretical foundations:

the E step [and the M step]

CMU, 2008 fall, Eric Xing, HW4, pr. 1.1-3

1.

The EM (Expectation-Maximization) algorithm makes it possible to build probabilistic models that, on the one hand, depend on a set of parameters $\theta$ and, on the other hand, include, besides the usual ("observable" or "visible") variables $x$, unknown ("unobservable", "hidden" or "latent") variables $z$.

In general, in such models the parameters $\theta$ cannot be estimated directly in a way that guarantees reaching the maximum of the likelihood of the observable data $x$.

Instead, the EM algorithm proceeds iteratively, which makes it a very convenient way to estimate the parameters $\theta$.

We define the log-likelihood of the observable data ($x$) as $\log P(x \mid \theta)$, and the log-likelihood of the complete data (observable, $x$, and unobservable, $z$) as $\log P(x, z \mid \theta)$.

Remark: Throughout this exercise, the function $\log$ is taken to have a fixed base greater than 1.

2.

a. The log-likelihood of the observable data ($x$) can be expressed in terms of the unobservable data ($z$) as follows [a]:

$$\ell(\theta) \overset{\text{not.}}{=} \log P(x \mid \theta) = \log\left(\sum_z P(x, z \mid \theta)\right)$$

From now on, $q$ denotes a probability distribution defined over the hidden / unobservable variables $z$.

Use Jensen's inequality to prove that the following inequality holds:

$$\log P(x \mid \theta) \ge \sum_z q(z) \log\left(\frac{P(x, z \mid \theta)}{q(z)}\right) \qquad (1)$$

for any (fixed) $x$, for any value of the parameter $\theta$, and for any probability distribution $q$ defined over the unobservable variables $z$.

[a] Note that $x$, the vector of observable data, is fixed (given), whereas $z$, the vector of unobservable data, is free (variable).

3.

Remark (1)

The meaning of inequality (1) is the following:

The function (in fact, any function of the form)

$$F(q, \theta) \overset{\text{def.}}{=} \sum_z q(z) \log\left(\frac{P(x, z \mid \theta)}{q(z)}\right)$$

is a lower bound for the log-likelihood function of the incomplete / observable data, $\ell(\theta) \overset{\text{not.}}{=} \log P(x \mid \theta)$.

Note that $F$ is a function of two variables, and the first one is not numeric (as $\theta$ is) but functional.

Moreover, the expression of $F$ is in fact an expectation,

$$E_{q(z)}\left[\log \frac{P(x, z \mid \theta)}{q(z)}\right],$$

where $x$, $q$ and the parameter $\theta$ are held fixed and $z$ is allowed to vary.

4.

Solution

Jensen's inequality, in the setting of probability theory:

Let $X$ be a (univariate) random variable.
If $\varphi$ is a (real) convex function, then $\varphi(E[X]) \le E[\varphi(X)]$;
if $\varphi$ is a concave function, then $\varphi(E[X]) \ge E[\varphi(X)]$.

Here we use the function $\log$ with base greater than 1 (a concave function), so Jensen's inequality gives $\log(E[X]) \ge E[\log(X)]$.

The log-likelihood of the observable data is:

$$\ell(\theta) \overset{\text{not.}}{=} \log P(x \mid \theta) = \log\left(\sum_z P(x, z \mid \theta)\right) = \log\left(\sum_z q(z)\,\frac{P(x, z \mid \theta)}{q(z)}\right) \overset{\text{def.}}{=} \log\left(E_{q(z)}\left[\frac{P(x, z \mid \theta)}{q(z)}\right]\right)$$

By Jensen's inequality (replacing $X$ above with $\frac{P(x, z \mid \theta)}{q(z)}$), it follows that:

$$\ell(\theta) \overset{\text{not.}}{=} \log P(x \mid \theta) \ge E_{q(z)}\left[\log \frac{P(x, z \mid \theta)}{q(z)}\right] \overset{\text{def.}}{=} \sum_z q(z) \log \frac{P(x, z \mid \theta)}{q(z)},$$

which is exactly inequality (1).
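As a quick numerical sanity check of the concave form of Jensen's inequality used above, $\log(E[X]) \ge E[\log X]$, here is a minimal Python sketch. The values and weights are randomly generated for illustration only, and the helper name `jensen_gap` is hypothetical, not part of the original slides.

```python
import math
import random

def jensen_gap(values, weights):
    """Return log(E[X]) - E[log X] for a positive discrete random variable X.

    `values` are the outcomes of X and `weights` their probabilities.
    Jensen's inequality for the concave function log says the gap is >= 0.
    """
    mean = sum(w * v for v, w in zip(values, weights))
    mean_log = sum(w * math.log(v) for v, w in zip(values, weights))
    return math.log(mean) - mean_log

# Illustrative (made-up) distribution: six positive outcomes, random weights.
random.seed(0)
values = [random.uniform(0.1, 5.0) for _ in range(6)]
raw = [random.random() for _ in range(6)]
weights = [r / sum(raw) for r in raw]

gap = jensen_gap(values, weights)
print("log(E[X]) - E[log X] =", gap)
```

The gap vanishes when $X$ is constant, which is the equality case of Jensen's inequality.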

5.

Notation

From now on, as a constant reminder that the distribution $q$ refers to the unobservable data $z$, we will write $q(z)$ instead of $q$.

Consequently, in what follows, depending on the context, $q(z)$ will denote either the distribution $q$ or the value of this distribution for some value of [the unobservable variable] $z$.

(Admittedly, this slight ambiguity may mislead the inexperienced reader.)

6.

b. Recall the definition of relative entropy (also called the Kullback-Leibler divergence):

$$KL(q(z) \,\|\, P(z \mid x, \theta)) = -\sum_z q(z) \log\left(\frac{P(z \mid x, \theta)}{q(z)}\right)$$

Show that

$$\log P(x \mid \theta) = F(q(z), \theta) + KL(q(z) \,\|\, P(z \mid x, \theta)).$$

Remark (2):

The meaning of the equality to be proven here is very interesting: the difference between the objective function $\ell(\theta) \overset{\text{not.}}{=} \log P(x \mid \theta)$ and its lower bound $F(q(z), \theta)$ (see part a) is $KL(q(z) \,\|\, P(z \mid x, \theta))$. The final, and most important, part of our problem is built precisely on this fact.

7.

Remark (3)

The EM algorithm rests on two basic ideas:

1. Instead of computing the maximum of the log-likelihood function $\log P(x \mid \theta)$ with respect to $\theta$, the EM algorithm maximizes its lower bound, $F(q(z), \theta)$, with respect to both arguments, $q(z)$ and $\theta$.

2. To search for the maximum (in fact, a local maximum) of the lower bound $F(q(z), \theta)$, the EM algorithm applies coordinate ascent: after $\theta^{(0)}$ is initialized, possibly at random, the function $F(q(z), \theta)$ is maximized iteratively, in alternation: first with respect to the distribution $q(z)$, then with respect to the parameter $\theta$.

E step: $q^{(t)}(z) = \arg\max_{q(z)} F(q(z), \theta^{(t)})$

M step: $\theta^{(t+1)} = \arg\max_{\theta} F(q^{(t)}(z), \theta)$

8.

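Solution

$$\begin{aligned}
F(q(z), \theta) &\overset{\text{def.}}{=} \sum_z q(z) \log\left(\frac{P(x, z \mid \theta)}{q(z)}\right)
= \sum_z q(z) \log\left(\frac{P(z \mid x, \theta) \cdot P(x \mid \theta)}{q(z)}\right) \\
&= \sum_z q(z) \left[\log \frac{P(z \mid x, \theta)}{q(z)} + \log P(x \mid \theta)\right] \\
&= \sum_z q(z) \log\left(\frac{P(z \mid x, \theta)}{q(z)}\right) + \sum_z q(z) \log P(x \mid \theta) \\
&= -KL(q(z) \,\|\, P(z \mid x, \theta)) + \log P(x \mid \theta) \cdot \underbrace{\sum_z q(z)}_{=1}
\end{aligned}$$

$$\Rightarrow\ \log P(x \mid \theta) = F(q(z), \theta) + KL(q(z) \,\|\, P(z \mid x, \theta)).$$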
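The decomposition $\log P(x \mid \theta) = F(q(z), \theta) + KL(q(z) \,\|\, P(z \mid x, \theta))$ can be checked numerically on a tiny discrete example. The sketch below is only an illustration: the joint table `joint` (standing for $P(x, z \mid \theta)$ at one fixed $x$) and the distribution `q` are made-up numbers, and the helper names are hypothetical.

```python
import math

def lower_bound(q, joint):
    """F(q, theta) = sum_z q(z) * log(P(x, z | theta) / q(z))."""
    return sum(qz * math.log(pz / qz) for qz, pz in zip(q, joint))

def kl_divergence(q, p):
    """KL(q || p) = -sum_z q(z) * log(p(z) / q(z))."""
    return -sum(qz * math.log(pz / qz) for qz, pz in zip(q, p))

# Made-up joint values P(x, z | theta) for a fixed x and z in {0, 1, 2}.
joint = [0.10, 0.25, 0.05]
# Any distribution over z with full support (also made up).
q = [0.2, 0.5, 0.3]

log_px = math.log(sum(joint))                   # log P(x | theta)
posterior = [pz / sum(joint) for pz in joint]   # P(z | x, theta)

F = lower_bound(q, joint)
KL = kl_divergence(q, posterior)
F_at_posterior = lower_bound(posterior, joint)  # the bound at q = posterior

print(log_px, F + KL, F_at_posterior)
```

Choosing `q` equal to the posterior makes the KL term zero, so the bound touches $\log P(x \mid \theta)$.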

9.

Remark (4)

By the property $KL(p \,\|\, q) \ge 0$ for all $p, q$, we have $KL(q(z) \,\|\, P(z \mid x, \theta)) \ge 0$.
Therefore, from the equality just proven in part b we obtain (once again, after the result of part a) that $F(q(z), \theta)$ is a lower bound for the log-likelihood of the observable data, $\ell(\theta) \overset{\text{not.}}{=} \log P(x \mid \theta)$.

10.

c. Let $\theta^{(t)}$ be the value obtained for the parameter(s) $\theta$ at iteration $t$ of the EM algorithm. With this value held fixed, show that the maximum of $F$ with respect to the argument / distribution $q(z)$ is attained for the distribution $P(z \mid x, \theta^{(t)})$, and that the value of the maximum is:

$$\max_{q(z)} F(q(z), \theta^{(t)}) = E_{P(z \mid x, \theta^{(t)})}[\log P(x, z \mid \theta^{(t)})] + H(P(z \mid x, \theta^{(t)}))$$

11.

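Solution

We have to maximize $F(q(z), \theta^{(t)})$, the lower bound of the log-likelihood of the observable data $x$, with respect to the distribution $q(z)$.

On the one hand, the result of part a tells us that $F(q(z), \theta) \le \log P(x \mid \theta)$ for any value of $\theta$; in particular, for $\theta^{(t)}$ we have

$$\log P(x \mid \theta^{(t)}) \ge F(q(z), \theta^{(t)})$$

On the other hand, replacing $\theta$ with $\theta^{(t)}$ in the equality proven in part b gives:

$$\log P(x \mid \theta^{(t)}) = F(q(z), \theta^{(t)}) + KL(q(z) \,\|\, P(z \mid x, \theta^{(t)}))$$

Finally, if we choose $q(z) = P(z \mid x, \theta^{(t)})$, then the term $KL(q(z) \,\|\, P(z \mid x, \theta^{(t)}))$ on the right-hand side of the equality above becomes zero (see exercise [PS-31] in the Probability and Statistics chapter), so $F$ reaches its upper bound $\log P(x \mid \theta^{(t)})$.

Therefore, the value $\max_{q(z)} F(q(z), \theta^{(t)})$ is attained for the distribution $q(z) = P(z \mid x, \theta^{(t)})$. Substituting this $q(z)$ into the definition of $F$ and splitting the logarithm of the ratio gives the value of the maximum:

$$\sum_z P(z \mid x, \theta^{(t)}) \log P(x, z \mid \theta^{(t)}) - \sum_z P(z \mid x, \theta^{(t)}) \log P(z \mid x, \theta^{(t)}) = E_{P(z \mid x, \theta^{(t)})}[\log P(x, z \mid \theta^{(t)})] + H(P(z \mid x, \theta^{(t)})).$$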

12.

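Remark (5)

Writing $G_t(\theta) \overset{\text{def.}}{=} F(P(z \mid x, \theta^{(t)}), \theta)$ and $Q(\theta \mid \theta^{(t)}) = E_{P(z \mid x, \theta^{(t)})}[\log P(x, z \mid \theta)]$, the computation above shows that $G_t(\theta^{(t)}) = \log P(x \mid \theta^{(t)}) = Q(\theta^{(t)} \mid \theta^{(t)}) + H(P(z \mid x, \theta^{(t)}))$. Proceeding in the same way, one can easily prove the equality

$$G_t(\theta) = Q(\theta \mid \theta^{(t)}) + H[P(z \mid x, \theta^{(t)})]$$

Since the term $H[P(z \mid x, \theta^{(t)})]$ in this last equality does not depend on $\theta$, it follows immediately that

$$\arg\max_{\theta} G_t(\theta) = \arg\max_{\theta} Q(\theta \mid \theta^{(t)})$$

Consequently,

$$\theta^{(t+1)} \overset{\text{def.}}{=} \arg\max_{\theta} F(P(z \mid x, \theta^{(t)}), \theta) = \arg\max_{\theta} G_t(\theta) = \arg\max_{\theta} Q(\theta \mid \theta^{(t)})$$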

14.

From:

"What is the expectation maximization algorithm?", Chuong B. Do, Serafim Batzoglou, Nature Biotechnology, vol. 26, no. 8, 2008, pp. 897-899

15.

The preceding equality is responsible for the following (and usual!) reformulation of the EM algorithm:

E′ step: compute $Q(\theta \mid \theta^{(t)}) = E_{P(z \mid x, \theta^{(t)})}[\log P(x, z \mid \theta)]$

M′ step: compute $\theta^{(t+1)} = \arg\max_{\theta} Q(\theta \mid \theta^{(t)})$

16.

The EM algorithm:

monotonicity / convergence

adapted by Liviu Ciortuz, following

en.wikipedia.org/wiki/Expectation-maximization

17.

To maximize the log-likelihood function of the observable data, i.e. $\ell(\theta) \overset{\text{def.}}{=} \log P(x \mid \theta)$, where the (unspecified) base of the logarithm is taken to be greater than 1, the EM algorithm proceeds iteratively, optimizing in the M step of each iteration ($t$) an "auxiliary" function

$$Q(\theta \mid \theta^{(t)}) \overset{\text{def.}}{=} E_{P(z \mid x, \theta^{(t)})}[\log P(x, z \mid \theta)],$$

which is the expectation of the log-likelihood of the complete data (observable and unobservable) with respect to the conditional distribution $P(z \mid x, \theta^{(t)})$.
We consider the iterations $t = 0, 1, \ldots$ and $\theta^{(t+1)} = \arg\max_{\theta} Q(\theta \mid \theta^{(t)})$, with $\theta^{(0)}$ chosen arbitrarily.

Prove that for any fixed (arbitrary) $t$ and for any $\theta$ such that $Q(\theta \mid \theta^{(t)}) \ge Q(\theta^{(t)} \mid \theta^{(t)})$ the following inequality holds:

$$\log P(x \mid \theta) - \log P(x \mid \theta^{(t)}) \ge Q(\theta \mid \theta^{(t)}) - Q(\theta^{(t)} \mid \theta^{(t)}) \qquad (2)$$

18.

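Remarks

1. The immediate meaning of relation (2):
Any improvement in the value of the auxiliary function $Q(\theta \mid \theta^{(t)})$ leads to an improvement at least as large in the value of the objective function, $\ell(\theta)$.

2. If in inequality (2) we replace $\theta$ with $\theta^{(t+1)} = \arg\max_{\theta} Q(\theta \mid \theta^{(t)})$, we get

$$\log P(x \mid \theta^{(t+1)}) \ge \log P(x \mid \theta^{(t)}).$$

In the end, we have $\ell(\theta^{(0)}) \le \ldots \le \ell(\theta^{(t)}) \le \ell(\theta^{(t+1)}) \le \ldots$. This (monotone) sequence is bounded above by 0 (see the definition of $\ell$), so it converges to some value $\ell^*$. In certain cases / under certain conditions, this value is a maximum (in general, a local one) of the log-likelihood function.

[Figure: the curves $\log P(x \mid \theta)$ and $Q(\theta \mid \theta^{(t)})$ plotted against $\theta$, with $\theta^{(t)}$ and $\theta^{(t+1)}$ marked on the horizontal axis; braces compare the increase from $\log P(x \mid \theta^{(t)})$ to $\log P(x \mid \theta^{(t+1)})$ with the increase from $Q(\theta^{(t)} \mid \theta^{(t)})$ to $Q(\theta^{(t+1)} \mid \theta^{(t)})$.]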

19.

Remarks (cont.)

3. By the same inequality (2), in the M step of iteration $t$ of the EM algorithm it is enough, instead of taking $\theta^{(t+1)} = \arg\max_{\theta} Q(\theta \mid \theta^{(t)})$, to choose $\theta^{(t+1)}$ such that $Q(\theta^{(t+1)} \mid \theta^{(t)}) > Q(\theta^{(t)} \mid \theta^{(t)})$. This is the "generalized" version of the EM algorithm.

20.

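[Figure: $\log P(x \mid \theta)$ plotted against $\theta$, together with the curves $Q(\theta \mid \theta^{(t)})$ and $Q(\theta \mid \theta^{(t+1)})$; the points $\theta^{(t)}$, $\theta^{(t+1)}$, $\theta^{(t+2)}$ are marked on the horizontal axis, the values $G_t(\theta^{(t)})$ and $G_{t+1}(\theta^{(t+1)})$ on the log-likelihood curve, and braces indicate the entropies $H[P(z \mid x; \theta^{(t)})]$ and $H[P(z \mid x; \theta^{(t+1)})]$.]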

23.

The EM algorithm for solving

a Bernoulli mixture model

CMU, 2008 fall, Eric Xing, HW4, pr. 1.4-7

24.

Suppose I have two unfair coins. The first lands on heads with probability $p$, and the second lands on heads with probability $q$.

Imagine $n$ tosses, where for each toss I choose to use the first coin with probability $\pi$ and choose to use the second with probability $1 - \pi$. The outcome of each toss $i$ is $x_i \in \{0, 1\}$.

Suppose I tell you the outcomes of the $n$ tosses, $x \overset{\text{not.}}{=} \{x_1, x_2, \ldots, x_n\}$, but I don't tell you which coin I used on which toss.

[Diagram: $Z \sim$ Bernoulli selects the coin, the first with probability $\pi$ and the second with probability $1 - \pi$; for the first coin $X \sim$ Bernoulli with $P(X = 1) = p$ and $P(X = 0) = 1 - p$, for the second coin $X \sim$ Bernoulli with $P(X = 1) = q$ and $P(X = 0) = 1 - q$.]

Given only the outcomes, $x$, your job is to compute estimates for $\theta$, the set of all parameters, $\theta = \{p, q, \pi\}$, using the EM algorithm.
To compute these estimates, we will create a latent variable $Z$, where $z_i \in \{0, 1\}$ indicates the coin used for the $i$-th toss. For example, $z_2 = 1$ indicates the first coin was used on the second toss.

We define the "incomplete" data log-likelihood as $\log P(x \mid \theta)$ and the "complete" data log-likelihood as $\log P(x, z \mid \theta)$.

25.

a. Show that $E[z_i \mid x_i, \theta] = P(z_i = 1 \mid x_i, \theta)$.

b. Use Bayes' rule to compute $P(z_i = 1 \mid x_i, \theta^{(t)})$, where $\theta^{(t)}$ denotes the parameters at iteration $t$.

c. Write down the complete log-likelihood, $\log P(x, z \mid \theta)$.

d. E-Step: Show that the expected log-likelihood of the complete data, $Q(\theta \mid \theta^{(t)}) \overset{\text{not.}}{=} E_{P(z \mid x, \theta^{(t)})}[\log P(x, z \mid \theta)]$, is given by

$$Q(\theta \mid \theta^{(t)}) = \sum_{i=1}^{n} \Big[ E[z_i \mid x_i, \theta^{(t)}] \cdot \big(\log \pi + x_i \log p + (1 - x_i) \log(1 - p)\big) + \big(1 - E[z_i \mid x_i, \theta^{(t)}]\big) \cdot \big(\log(1 - \pi) + x_i \log q + (1 - x_i) \log(1 - q)\big) \Big]$$

e. M-Step: Describe the process you would use to obtain the update equations for $p^{(t+1)}$, $q^{(t+1)}$ and $\pi^{(t+1)}$.

26.

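Solution

a.

$$E[z_i \mid x_i, \theta] = \sum_{z_i \in \{0,1\}} z_i\, P(z_i \mid x_i, \theta) = 0 \cdot P(z_i = 0 \mid x_i, \theta) + 1 \cdot P(z_i = 1 \mid x_i, \theta)$$

$$\Rightarrow\ E[z_i \mid x_i, \theta] = P(z_i = 1 \mid x_i, \theta)$$

b.

$$\begin{aligned}
P(z_i = 1 \mid x_i, \theta)
&= \frac{P(x_i \mid z_i = 1, \theta)\, P(z_i = 1 \mid \theta)}{P(x_i \mid z_i = 1, \theta)\, P(z_i = 1 \mid \theta) + P(x_i \mid z_i = 0, \theta)\, P(z_i = 0 \mid \theta)} \\
&= \frac{p^{x_i} (1 - p)^{1 - x_i}\, \pi}{p^{x_i} (1 - p)^{1 - x_i}\, \pi + q^{x_i} (1 - q)^{1 - x_i}\, (1 - \pi)}
\end{aligned}$$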

27.

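c.

Note: From a methodological point of view, to compute $P(x, z \mid \theta)$ we can take inspiration from the previous part: the denominator of the fraction that gives $P(z_i \mid x_i, \theta)$ contains $P(x_i, z_i = 1 \mid \theta)$ and $P(x_i, z_i = 0 \mid \theta)$. Starting from these two expressions, we aim to write them in a unified way, that is, as a single expression.

$$\begin{aligned}
\log P(x, z \mid \theta)
&\overset{\text{i.i.d.}}{=} \log \prod_{i=1}^{n} P(x_i, z_i \mid \theta) = \log \prod_{i=1}^{n} P(x_i \mid z_i, \theta) \cdot P(z_i \mid \theta) \\
&= \log \prod_{i=1}^{n} \left(p^{x_i} (1 - p)^{1 - x_i} \pi\right)^{z_i} \left(q^{x_i} (1 - q)^{1 - x_i} (1 - \pi)\right)^{1 - z_i} \\
&= \sum_{i=1}^{n} \log \left( \left(p^{x_i} (1 - p)^{1 - x_i} \pi\right)^{z_i} \left(q^{x_i} (1 - q)^{1 - x_i} (1 - \pi)\right)^{1 - z_i} \right) \\
&= \sum_{i=1}^{n} \left[ z_i \log\left(p^{x_i} (1 - p)^{1 - x_i} \pi\right) + (1 - z_i) \log\left(q^{x_i} (1 - q)^{1 - x_i} (1 - \pi)\right) \right]
\end{aligned}$$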

28.

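d.

$$\mu_i^{(t)} \overset{\text{not.}}{=} E[z_i \mid x_i, \theta^{(t)}] \overset{a,\,b}{=} \frac{(p^{(t)})^{x_i} (1 - p^{(t)})^{1 - x_i}\, \pi^{(t)}}{(p^{(t)})^{x_i} (1 - p^{(t)})^{1 - x_i}\, \pi^{(t)} + (q^{(t)})^{x_i} (1 - q^{(t)})^{1 - x_i}\, (1 - \pi^{(t)})}$$

$$\Rightarrow\ Q(\theta \mid \theta^{(t)}) = \sum_{i=1}^{n} \Big[ \mu_i^{(t)} \big(\log \pi + x_i \log p + (1 - x_i) \log(1 - p)\big) + \big(1 - \mu_i^{(t)}\big) \big(\log(1 - \pi) + x_i \log q + (1 - x_i) \log(1 - q)\big) \Big]$$

e.

$$\begin{aligned}
\frac{\partial Q(\theta \mid \theta^{(t)})}{\partial p} = 0
&\Leftrightarrow \sum_{i=1}^{n} \mu_i^{(t)} \left(\frac{x_i}{p} - \frac{1 - x_i}{1 - p}\right) = 0
\Leftrightarrow \frac{1}{p} \sum_{i=1}^{n} \mu_i^{(t)} x_i = \frac{1}{1 - p} \sum_{i=1}^{n} \mu_i^{(t)} (1 - x_i) \\
&\Leftrightarrow (1 - p) \sum_{i=1}^{n} \mu_i^{(t)} x_i = p \sum_{i=1}^{n} \mu_i^{(t)} (1 - x_i)
\Leftrightarrow \sum_{i=1}^{n} \mu_i^{(t)} x_i = p \left( \sum_{i=1}^{n} \mu_i^{(t)} (1 - x_i) + \sum_{i=1}^{n} \mu_i^{(t)} x_i \right) \\
&\Leftrightarrow \sum_{i=1}^{n} \mu_i^{(t)} x_i = p \sum_{i=1}^{n} \mu_i^{(t)}
\ \Rightarrow\ p^{(t+1)} = \frac{\sum_{i=1}^{n} \mu_i^{(t)} x_i}{\sum_{i=1}^{n} \mu_i^{(t)}} \in [0, 1]
\end{aligned}$$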

30.

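$$\frac{\partial Q(\theta \mid \theta^{(t)})}{\partial q} = 0
\Leftrightarrow \sum_{i=1}^{n} (1 - \mu_i^{(t)}) \left(\frac{x_i}{q} - \frac{1 - x_i}{1 - q}\right) = 0
\ \Rightarrow\ q^{(t+1)} = \frac{\sum_{i=1}^{n} (1 - \mu_i^{(t)})\, x_i}{\sum_{i=1}^{n} (1 - \mu_i^{(t)})} \in [0, 1]$$

$$\begin{aligned}
\frac{\partial Q(\theta \mid \theta^{(t)})}{\partial \pi} = 0
&\Leftrightarrow \sum_{i=1}^{n} \left(\frac{\mu_i^{(t)}}{\pi} - \frac{1 - \mu_i^{(t)}}{1 - \pi}\right) = 0
\Leftrightarrow \frac{1}{\pi} \sum_{i=1}^{n} \mu_i^{(t)} = \frac{1}{1 - \pi} \sum_{i=1}^{n} (1 - \mu_i^{(t)}) \\
&\Leftrightarrow (1 - \pi) \sum_{i=1}^{n} \mu_i^{(t)} = \pi \sum_{i=1}^{n} (1 - \mu_i^{(t)})
\Leftrightarrow \sum_{i=1}^{n} \mu_i^{(t)} = \pi \left( \sum_{i=1}^{n} (1 - \mu_i^{(t)}) + \sum_{i=1}^{n} \mu_i^{(t)} \right) \\
&\Leftrightarrow \sum_{i=1}^{n} \mu_i^{(t)} = \pi \sum_{i=1}^{n} 1
\Leftrightarrow \sum_{i=1}^{n} \mu_i^{(t)} = n \pi
\ \Rightarrow\ \pi^{(t+1)} = \frac{1}{n} \sum_{i=1}^{n} \mu_i^{(t)} \in [0, 1]
\end{aligned}$$

Note: It can be easily shown that the second derivatives ($\frac{\partial^2}{\partial p^2}$, $\frac{\partial^2}{\partial q^2}$ and $\frac{\partial^2}{\partial \pi^2}$) are all negative on the respective domains of definition. Therefore, the above solutions ($p^{(t+1)}$, $q^{(t+1)}$ and $\pi^{(t+1)}$) maximize the $Q$ functional.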

31.

The EM algorithm for mixtures of Bernoulli distributions: pseudo-code

Initialization:

  assign arbitrary values ($\pi^{(0)}$, $p^{(0)}$, $q^{(0)}$, in the interval $(0, 1)$) to the parameters $\pi$, $p$ and $q$, respectively;

Iterative body:

  for $t = 0, \ldots, T - 1$ (with $T$ fixed in advance)

  (or: until the log-likelihood of the observable data no longer increases significantly),

  (or: until $\|\pi^{(t)} - \pi^{(t+1)}\|_2 < \varepsilon$, $\|p^{(t)} - p^{(t+1)}\|_2 < \varepsilon$, $\|q^{(t)} - q^{(t+1)}\|_2 < \varepsilon$, with $\varepsilon$ fixed in advance)

  do

    E step:

    $$\mu_i^{(t)} = \frac{(p^{(t)})^{x_i} (1 - p^{(t)})^{1 - x_i}\, \pi^{(t)}}{(p^{(t)})^{x_i} (1 - p^{(t)})^{1 - x_i}\, \pi^{(t)} + (q^{(t)})^{x_i} (1 - q^{(t)})^{1 - x_i}\, (1 - \pi^{(t)})};$$

    M step:

    $$p^{(t+1)} = \frac{\sum_{i=1}^{n} \mu_i^{(t)} x_i}{\sum_{i=1}^{n} \mu_i^{(t)}}, \qquad q^{(t+1)} = \frac{\sum_{i=1}^{n} (1 - \mu_i^{(t)})\, x_i}{\sum_{i=1}^{n} (1 - \mu_i^{(t)})}, \qquad \pi^{(t+1)} = \frac{1}{n} \sum_{i=1}^{n} \mu_i^{(t)};$$

Return $\pi^{(t+1)}$, $p^{(t+1)}$, $q^{(t+1)}$;
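The pseudo-code above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function names, the stopping tolerance `eps`, the toss sequence `xs` and the initial values are all assumptions made for the example. It assumes the data contain both heads and tails and that the initial parameters lie strictly inside $(0, 1)$, so that no denominator or logarithm argument becomes zero.

```python
import math

def e_step(xs, p, q, pi):
    """E step: mu_i = P(z_i = 1 | x_i, theta), computed with Bayes' rule."""
    mus = []
    for x in xs:
        first = pi * (p if x == 1 else 1.0 - p)            # P(x_i, z_i = 1)
        second = (1.0 - pi) * (q if x == 1 else 1.0 - q)   # P(x_i, z_i = 0)
        mus.append(first / (first + second))
    return mus

def m_step(xs, mus):
    """M step: closed-form updates for p, q and pi."""
    n, total = len(xs), sum(mus)
    p = sum(m * x for m, x in zip(mus, xs)) / total
    q = sum((1.0 - m) * x for m, x in zip(mus, xs)) / (n - total)
    return p, q, total / n

def log_likelihood(xs, p, q, pi):
    """Log-likelihood of the observable data, log P(x | theta)."""
    return sum(
        math.log(pi * (p if x == 1 else 1.0 - p)
                 + (1.0 - pi) * (q if x == 1 else 1.0 - q))
        for x in xs
    )

def em_bmm(xs, p, q, pi, max_iter=100, eps=1e-10):
    """Run EM until the log-likelihood stops increasing by more than eps."""
    history = [log_likelihood(xs, p, q, pi)]
    for _ in range(max_iter):
        p, q, pi = m_step(xs, e_step(xs, p, q, pi))
        history.append(log_likelihood(xs, p, q, pi))
        if history[-1] - history[-2] < eps:
            break
    return p, q, pi, history

# Made-up toss outcomes and starting point, for illustration only.
xs = [1, 0, 0, 1, 1, 0, 1, 1, 0, 1]
p_hat, q_hat, pi_hat, history = em_bmm(xs, p=0.6, q=0.3, pi=0.5)
print(p_hat, q_hat, pi_hat, history)
```

Adding the M-step formulas shows that $\pi^{(t+1)} p^{(t+1)} + (1 - \pi^{(t+1)}) q^{(t+1)} = \frac{1}{n} \sum_i x_i$, so after one iteration the fitted marginal probability of heads already equals the empirical frequency. With single tosses as observations, the three parameters are not individually identifiable from $x$ alone.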

32.

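L. Ciortuz, S. Ciobanu:

Exemplifying a simpler version of the EM / BMM algorithm

starting from the dataset used at CMU, 2005 fall, T. Mitchell, A. Moore, midterm, pr. 1.3

A. supervised

  i   Coin Z_i   Result X_i
  1   1          1 (Head)
  2   0          0 (Tail)
  3   0          0 (Tail)
  4   0          0 (Tail)
  5   0          1 (Head)

$(X \mid Z = 1) \sim \text{Bernoulli}(\theta)$
$(X \mid Z = 0) \sim \text{Bernoulli}(2\theta)$,
with $\theta \in (0, 1/2)$.

⇒

B. unsupervised

  i   Coin Z_i   Result X_i
  1   ?          1 (Head)
  2   ?          0 (Tail)
  3   ?          0 (Tail)
  4   ?          0 (Tail)
  5   ?          1 (Head)

$Z \sim \text{Bernoulli}(\pi)$, with $\pi \in (0, 1)$,
$X \sim \pi\, \text{Bernoulli}(\theta) + (1 - \pi)\, \text{Bernoulli}(2\theta)$,
and $\theta \in (0, 1/2)$.

[Diagram: $Z \sim$ Bernoulli selects coin 1 with probability $\pi$ and coin 0 with probability $1 - \pi$; coin 1 gives heads with probability $\theta$ and tails with probability $1 - \theta$, coin 0 gives heads with probability $2\theta$ and tails with probability $1 - 2\theta$.]

Note: In the next five slides we will assume that $\pi$ is fixed.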

33.

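E step:

$$E[z_i \mid x_i, \theta^{(t)}] = P(z_i = 1 \mid x_i, \theta^{(t)}) \overset{\text{Bayes}}{=} \frac{\pi\, (\theta^{(t)})^{x_i} (1 - \theta^{(t)})^{1 - x_i}}{\pi\, (\theta^{(t)})^{x_i} (1 - \theta^{(t)})^{1 - x_i} + (1 - \pi)\, (2\theta^{(t)})^{x_i} (1 - 2\theta^{(t)})^{1 - x_i}} \overset{\text{not.}}{=} \mu_i^{(t)} \qquad (3)$$

$$\begin{aligned}
\ell_{\text{compl}}(\theta) \overset{\text{not.}}{=} \ln P(X, Z \mid \theta)
={}& Z_1 \ln(\theta\pi) + (1 - Z_1) \ln(2\theta(1 - \pi)) \\
&+ Z_2 \ln((1 - \theta)\pi) + (1 - Z_2) \ln((1 - 2\theta)(1 - \pi)) \\
&+ Z_3 \ln((1 - \theta)\pi) + (1 - Z_3) \ln((1 - 2\theta)(1 - \pi)) \\
&+ Z_4 \ln((1 - \theta)\pi) + (1 - Z_4) \ln((1 - 2\theta)(1 - \pi)) \\
&+ Z_5 \ln(\theta\pi) + (1 - Z_5) \ln(2\theta(1 - \pi))
\end{aligned}$$

$$\begin{aligned}
Q(\theta \mid \theta^{(t)}) \overset{\text{not.}}{=} E[\ell_{\text{compl}}(\theta)]
={}& E[Z_1] \ln(\theta\pi) + (1 - E[Z_1]) \ln(2\theta(1 - \pi)) \\
&+ E[Z_2] \ln((1 - \theta)\pi) + (1 - E[Z_2]) \ln((1 - 2\theta)(1 - \pi)) \\
&+ E[Z_3] \ln((1 - \theta)\pi) + (1 - E[Z_3]) \ln((1 - 2\theta)(1 - \pi)) \\
&+ E[Z_4] \ln((1 - \theta)\pi) + (1 - E[Z_4]) \ln((1 - 2\theta)(1 - \pi)) \\
&+ E[Z_5] \ln(\theta\pi) + (1 - E[Z_5]) \ln(2\theta(1 - \pi)) \\
\overset{\pi\ \text{const.}}{=}{}& \text{const.} + E[Z_1] \ln\theta + (1 - E[Z_1]) \ln\theta \\
&+ E[Z_2] \ln(1 - \theta) + (1 - E[Z_2]) \ln(1 - 2\theta) \\
&+ E[Z_3] \ln(1 - \theta) + (1 - E[Z_3]) \ln(1 - 2\theta) \\
&+ E[Z_4] \ln(1 - \theta) + (1 - E[Z_4]) \ln(1 - 2\theta) \\
&+ E[Z_5] \ln\theta + (1 - E[Z_5]) \ln\theta \\
={}& \text{const.} + 2 \ln\theta + (E[Z_2] + E[Z_3] + E[Z_4]) \ln(1 - \theta) + \big(3 - (E[Z_2] + E[Z_3] + E[Z_4])\big) \ln(1 - 2\theta)
\end{aligned}$$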

34.

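M step:

$$\theta^{(t+1)} \overset{\text{def.}}{=} \arg\max_{\theta \in (0, 1/2)} Q(\theta \mid \theta^{(t)})$$

$$\begin{aligned}
\frac{\partial}{\partial\theta} Q(\theta \mid \theta^{(t)}) = 0
&\Leftrightarrow \frac{2}{\theta} = \frac{E[Z_2] + E[Z_3] + E[Z_4]}{1 - \theta} + \frac{2\,[3 - (E[Z_2] + E[Z_3] + E[Z_4])]}{1 - 2\theta} \\
&\Leftrightarrow 2(1 - \theta)(1 - 2\theta) = (E[Z_2] + E[Z_3] + E[Z_4])\,\theta(1 - 2\theta) + 2\big(3 - (E[Z_2] + E[Z_3] + E[Z_4])\big)\,\theta(1 - \theta) \\
&\Leftrightarrow 2(1 - \theta)(1 - 2\theta) = (\mu_2^{(t)} + \mu_3^{(t)} + \mu_4^{(t)})\,\theta(1 - 2\theta) + 2\big(3 - (\mu_2^{(t)} + \mu_3^{(t)} + \mu_4^{(t)})\big)\,\theta(1 - \theta) \\
&\Leftrightarrow a\theta^2 + b\theta + c = 0 \ \text{ with } a = 10,\ b = -12 + \mu_2^{(t)} + \mu_3^{(t)} + \mu_4^{(t)},\ c = 2 \quad (a, b, c \in \mathbb{R}) \qquad (4)
\end{aligned}$$

Note that

$$\frac{\partial^2}{\partial\theta^2} Q(\theta \mid \theta^{(t)}) = -\frac{2}{\theta^2} - \frac{\overbrace{\mu_2^{(t)} + \mu_3^{(t)} + \mu_4^{(t)}}^{\ge 0}}{(1 - \theta)^2} - \frac{4\,\big(\overbrace{3 - (\mu_2^{(t)} + \mu_3^{(t)} + \mu_4^{(t)})}^{\ge 0}\big)}{(1 - 2\theta)^2} < 0,$$

so $Q(\theta \mid \theta^{(t)})$ is a strictly concave function, and therefore it has a unique maximum on $(0, 1/2)$.

$\mu_i^{(t)} \in [0, 1]$ due to relationship (3), therefore $\Delta = b^2 - 4ac = b^2 - 80 \ge 1 > 0$, which implies that the solutions of the above quadratic equation (denoted $\theta_1$ and $\theta_2$, respectively) are real numbers. It can be shown (by using high school mathematics) that $\theta_1 > 0$, $\theta_2 > 0$, and moreover $\theta_1 < 1/2$ and $\theta_2 > 1/2$. Therefore, the solution $\frac{-b - \sqrt{b^2 - 4ac}}{2a}$ is the one corresponding to the maximum value of $Q(\theta \mid \theta^{(t)})$ in the interval $(0, 1/2)$.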

35.

So, the updating rule for the M-step is

$$\theta^{(t+1)} = \frac{-b - \sqrt{b^2 - 4ac}}{2a}. \qquad (5)$$

Now, to summarize, the updating rules for this version of the EM algorithm are the relationships (3) and (5).
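The two updating rules, (3) for the E step and (5) for the M step, can be run directly on the five-toss dataset. The sketch below is an illustration under stated assumptions: the function names are hypothetical, the coefficients $a = 10$, $c = 2$ and the use of $\mu_2, \mu_3, \mu_4$ are specific to this dataset (two heads, three tails), and the starting point $\theta^{(0)} = 0.45$ with $\pi = 0.5$ is the one used in the execution trace shown in the slides.

```python
import math

XS = [1, 0, 0, 0, 1]  # dataset from the slides: 1 = Head, 0 = Tail

def e_step(theta, pi):
    """Relation (3): mu_i = P(z_i = 1 | x_i, theta), with pi held fixed."""
    mus = []
    for x in XS:
        first = pi * (theta if x == 1 else 1.0 - theta)
        second = (1.0 - pi) * (2.0 * theta if x == 1 else 1.0 - 2.0 * theta)
        mus.append(first / (first + second))
    return mus

def m_step(mus):
    """Relation (5): smaller root of the quadratic (4) for this dataset."""
    a = 10.0
    b = -12.0 + mus[1] + mus[2] + mus[3]   # mu_2 + mu_3 + mu_4 (the tails)
    c = 2.0
    return (-b - math.sqrt(b * b - 4.0 * a * c)) / (2.0 * a)

def log_likelihood(theta, pi):
    """ln P(X | theta) for the observable data."""
    return sum(
        math.log(pi * (theta if x == 1 else 1.0 - theta)
                 + (1.0 - pi) * (2.0 * theta if x == 1 else 1.0 - 2.0 * theta))
        for x in XS
    )

theta, pi = 0.45, 0.5
trace = [(theta, log_likelihood(theta, pi))]
for _ in range(10):
    theta = m_step(e_step(theta, pi))
    trace.append((theta, log_likelihood(theta, pi)))

for t, (th, ll) in enumerate(trace):
    print(t, round(th, 7), round(ll, 6))
```

Each entry of `trace` holds $\theta^{(t)}$ and the corresponding log-likelihood, which makes it easy to see the monotone increase discussed earlier.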

36.

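Execution of two iterations of the EM/BMM algorithm, when $\pi = 0.5$ (fixed)

[plots made by Sebastian Ciobanu]

[Two plots over $\theta \in [0, 0.5]$, with the vertical axis ranging from −10 to −4 and the points $\theta^{(0)}$, $\theta^{(1)}$, $\theta^{(2)}$ marked. The first shows $\ln(p(X \mid \theta))$, $Q(\theta \mid \theta^{(0)})$ and $Q(\theta \mid \theta^{(1)})$. The second shows $\ln(p(X \mid \theta))$, $Q(\theta \mid \theta^{(0)}) + \ln 2 \cdot H(Z \mid X, \theta^{(0)})$ and $Q(\theta \mid \theta^{(1)}) + \ln 2 \cdot H(Z \mid X, \theta^{(1)})$.]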

37.

  iteration (t)   θ(t)        log-likelihood
  0               0.45        −4.157875
  1               0.3187987   −3.426862
  2               0.2738492   −3.366261
  3               0.2675115   −3.365075
  4               0.266764    −3.365059
  5               0.2666779   −3.365058
  6               0.266668    −3.365058
  7               0.2666668   −3.365058
  8               0.2666667   −3.365058
  9               0.2666667   −3.365058
  10              0.2666667   −3.365058

$\mu_1^{(10)} = 0.3333333$, $\mu_2^{(10)} = 0.6111111$, $\mu_3^{(10)} = 0.6111111$, $\mu_4^{(10)} = 0.6111111$, $\mu_5^{(10)} = 0.3333333$

38.

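When $\pi$ is free:

E step:

$$E[z_i \mid x_i, \theta^{(t)}, \pi^{(t)}] = P(z_i = 1 \mid x_i, \theta^{(t)}, \pi^{(t)}) \overset{\text{Bayes}}{=} \frac{\pi^{(t)} (\theta^{(t)})^{x_i} (1 - \theta^{(t)})^{1 - x_i}}{\pi^{(t)} (\theta^{(t)})^{x_i} (1 - \theta^{(t)})^{1 - x_i} + (1 - \pi^{(t)}) (2\theta^{(t)})^{x_i} (1 - 2\theta^{(t)})^{1 - x_i}} \overset{\text{not.}}{=} \mu_i^{(t)}$$

$$\begin{aligned}
Q(\theta, \pi \mid \theta^{(t)}, \pi^{(t)}) \overset{\text{not.}}{=} E[\ell_{\text{compl}}(\theta, \pi)]
={}& E[Z_1] \ln(\theta\pi) + (1 - E[Z_1]) \ln(2\theta(1 - \pi)) \\
&+ E[Z_2] \ln((1 - \theta)\pi) + (1 - E[Z_2]) \ln((1 - 2\theta)(1 - \pi)) \\
&+ E[Z_3] \ln((1 - \theta)\pi) + (1 - E[Z_3]) \ln((1 - 2\theta)(1 - \pi)) \\
&+ E[Z_4] \ln((1 - \theta)\pi) + (1 - E[Z_4]) \ln((1 - 2\theta)(1 - \pi)) \\
&+ E[Z_5] \ln(\theta\pi) + (1 - E[Z_5]) \ln(2\theta(1 - \pi))
\end{aligned}$$

M step: $\theta^{(t+1)}$ is the same as before, but using the $\mu_i^{(t)}$ from this E step.

$$\pi^{(t+1)} \overset{\text{def.}}{=} \arg\max_{\pi \in (0, 1)} Q(\theta, \pi \mid \theta^{(t)}, \pi^{(t)})$$

$$\begin{aligned}
\frac{\partial}{\partial\pi} Q(\theta, \pi \mid \theta^{(t)}, \pi^{(t)}) = 0
&\Leftrightarrow \frac{\partial}{\partial\pi} \Big[ \text{const.} + \Big(\sum_{i=1}^{5} E[Z_i]\Big) \ln\pi + \Big(5 - \sum_{i=1}^{5} E[Z_i]\Big) \ln(1 - \pi) \Big] = 0 \\
&\Leftrightarrow \frac{E[Z_1] + E[Z_2] + E[Z_3] + E[Z_4] + E[Z_5]}{\pi} = \frac{5 - (E[Z_1] + E[Z_2] + E[Z_3] + E[Z_4] + E[Z_5])}{1 - \pi} \\
&\Leftrightarrow \pi = \frac{1}{5} \big(E[Z_1] + E[Z_2] + E[Z_3] + E[Z_4] + E[Z_5]\big)
\end{aligned}$$

Therefore $\pi^{(t+1)} = \frac{1}{5} \big(\mu_1^{(t)} + \mu_2^{(t)} + \mu_3^{(t)} + \mu_4^{(t)} + \mu_5^{(t)}\big) \in [0, 1]$. Note: $Q(\theta, \pi \mid \theta^{(t)}, \pi^{(t)})$ is concave w.r.t. $\pi$.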
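One full EM iteration with $\pi$ free can be sketched as follows. The code is an illustration only: `em_iteration` is a hypothetical name, the constants 12, 80 and 20 come from the quadratic (4) for this particular dataset, and the starting values $\theta^{(0)} = 0.45$, $\pi^{(0)} = 0.2$ are the ones listed in the execution table of the slides.

```python
import math

XS = [1, 0, 0, 0, 1]  # dataset from the slides: 1 = Head, 0 = Tail

def mixture_prob(x, theta, pi):
    """Return (P(x, z = 1), P(x, z = 0)) for one toss."""
    first = pi * (theta if x == 1 else 1.0 - theta)
    second = (1.0 - pi) * (2.0 * theta if x == 1 else 1.0 - 2.0 * theta)
    return first, second

def em_iteration(theta, pi):
    """One E step followed by the M-step updates for theta and pi."""
    mus = []
    for x in XS:
        first, second = mixture_prob(x, theta, pi)
        mus.append(first / (first + second))
    # theta update: smaller root of 10*theta^2 + b*theta + 2 = 0, as in (4).
    b = -12.0 + sum(m for m, x in zip(mus, XS) if x == 0)
    new_theta = (-b - math.sqrt(b * b - 80.0)) / 20.0
    # pi update: average of the responsibilities.
    new_pi = sum(mus) / len(XS)
    return new_theta, new_pi, mus

def log_likelihood(theta, pi):
    """ln P(X | theta, pi) for the observable data."""
    return sum(math.log(sum(mixture_prob(x, theta, pi))) for x in XS)

theta0, pi0 = 0.45, 0.2
theta1, pi1, mus = em_iteration(theta0, pi0)
print(mus, theta1, pi1, log_likelihood(theta1, pi1))
```

Because the $\theta$-dependent and $\pi$-dependent terms of $Q$ separate, the two updates can be computed independently from the same responsibilities.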

39.

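Execution of the EM/BMM algorithm, when $\pi$ is also free

[plots made by Sebastian Ciobanu]

[Two plots. The first shows the log-likelihood (vertical axis, from −5.0 to −3.5) against the number of iterations (horizontal axis, from 0 to 20). The second has $\theta$ on the horizontal axis (from 0.0 to 1.0) and $\pi$ on the vertical axis (from −0.2 to 0.8).]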

40.

  t   µ1(t)          µ2(t)          µ3(t)          µ4(t)          µ5(t)          θ(t)           π(t)           log-likelihood
  0   -              -              -              -              -              0.45           0.2            −5.4036356831
  1   0.1111111111   0.5789473684   0.5789473684   0.5789473684   0.1111111111   0.2615013333   0.3918128655   −3.3694085874
  2   0.2436363636   0.4993525294   0.4993525294   0.4993525294   0.2436363636   0.2499117539   0.3970660631   −3.3650619849
  3   0.2477120572   0.4968812604   0.4968812604   0.4968812604   0.2477120572   0.2495757662   0.3972135791   −3.3650583379
  4   0.2478268932   0.496811612    0.496811612    0.496811612    0.2478268932   0.249566316    0.3972177245   −3.365058335
  5   0.2478301205   0.4968096546   0.4968096546   0.4968096546   0.2478301205   0.2495660504   0.397217841    −3.365058335
  6   0.2478302112   0.4968095996   0.4968095996   0.4968095996   0.2478302112   0.249566043    0.3972178443   −3.365058335

41.

Revisiting the case when π is fixed

By the linearity of expectation, the expectation of a sum of random variables is the sum of their expectations. Using this simple property, we now show that our EM algorithm for estimating the parameters of a mixture of Bernoulli distributions can be reformulated in another, very convenient form.

In the solution given earlier for the E step (when the selection probability, $\pi$, was considered fixed), we obtained

$$\mu_i^{(t)} = P(z_i = 1 \mid x_i, \theta^{(t)}) \overset{\text{Bayes}}{=} \frac{\pi\, (\theta^{(t)})^{x_i} (1 - \theta^{(t)})^{1 - x_i}}{\pi\, (\theta^{(t)})^{x_i} (1 - \theta^{(t)})^{1 - x_i} + (1 - \pi)\, (2\theta^{(t)})^{x_i} (1 - 2\theta^{(t)})^{1 - x_i}}.$$

We can write,
for the case $x_i = 1$ (that is, for $x_i \equiv$ head),

$$q_1^{(t)} \overset{\text{not.}}{=} P(z_i = 1 \mid x_i = 1, \theta^{(t)}) = \frac{\pi\, \theta^{(t)}}{\pi\, \theta^{(t)} + (1 - \pi)\, 2\theta^{(t)}}, \qquad (6)$$

and for the case $x_i = 0$ (that is, for $x_i \equiv$ tail),

$$q_0^{(t)} \overset{\text{not.}}{=} P(z_i = 1 \mid x_i = 0, \theta^{(t)}) = \frac{\pi\, (1 - \theta^{(t)})}{\pi\, (1 - \theta^{(t)}) + (1 - \pi)\, (1 - 2\theta^{(t)})}. \qquad (7)$$

42.

Let $z_1$ be the (unobservable) number of heads produced by coin 1, out of the total of $n_S \overset{\text{not.}}{=} 2$ heads in our data. (The subscript $S$ comes from the Romanian "stemă", head.)

Similarly, let $z_0$ be the (unobservable) number of tails produced by coin 1, out of the total of $n_B \overset{\text{not.}}{=} 3$ tails in our data. (The subscript $B$ comes from the Romanian "ban", tail.)

The corresponding likelihood function of the data [in the table] is

$$L(\theta) = \theta^{z_1} (1 - \theta)^{z_0} \cdot (2\theta)^{n_S - z_1} (1 - 2\theta)^{n_B - z_0},$$

and the log-likelihood function is

$$\ell(\theta) = z_1 \ln\theta + z_0 \ln(1 - \theta) + (n_S - z_1) \ln(2\theta) + (n_B - z_0) \ln(1 - 2\theta),$$

which leads immediately to the new expression of the auxiliary function:

$$Q(\theta \mid \theta^{(t)}) \overset{\text{def.}}{=} E[\ell(\theta)] \overset{\text{lin. of exp.}}{=} E[z_1] \ln\theta + E[z_0] \ln(1 - \theta) + (n_S - E[z_1]) \ln(2\theta) + (n_B - E[z_0]) \ln(1 - 2\theta).$$

43.

In the E step, the expectations

• $E[z_1]$, the expected number of heads produced by coin 1 out of the total of $n_S \overset{\text{not.}}{=} 2$ heads in our data, and

• $E[z_0]$, the expected number of tails produced by coin 1 out of the total of $n_B \overset{\text{not.}}{=} 3$ tails in our data,

are computed, at iteration $t$, using either the linearity of expectation or the formula for the mean of the binomial distribution:

$$E[z_1] = n_S \cdot q_1^{(t)} \quad \text{and} \quad E[z_0] = n_B \cdot q_0^{(t)} \qquad (8)$$

In this new formulation of the EM algorithm for our problem, the E step consists of formulas (6) and (7), and the M step is worked out as follows.
We know that

$$\theta^{(t+1)} = \arg\max_{\theta \in (0, 1/2)} Q(\theta \mid \theta^{(t)}). \qquad (9)$$

44.

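If the maximum of the function $Q(\theta \mid \theta^{(t)})$ lies in the interior of the interval $(0, 1/2)$, then we necessarily have

$$\begin{aligned}
&\frac{\partial}{\partial\theta} \Big( E[z_1] \ln\theta + E[z_0] \ln(1 - \theta) + (n_S - E[z_1]) \ln(2\theta) + (n_B - E[z_0]) \ln(1 - 2\theta) \Big) = 0 \\
&\Leftrightarrow \frac{E[z_1]}{\theta} - \frac{E[z_0]}{1 - \theta} + \frac{2(n_S - E[z_1])}{2\theta} - \frac{2(n_B - E[z_0])}{1 - 2\theta} = 0 \\
&\Leftrightarrow \frac{n_S\, q_1^{(t)}}{\theta} - \frac{n_B\, q_0^{(t)}}{1 - \theta} + \frac{2(n_S - n_S\, q_1^{(t)})}{2\theta} - \frac{2(n_B - n_B\, q_0^{(t)})}{1 - 2\theta} = 0 \\
&\Leftrightarrow \frac{n_S\, q_1^{(t)}}{\theta} - \frac{n_B\, q_0^{(t)}}{1 - \theta} + \frac{n_S (1 - q_1^{(t)})}{\theta} - \frac{2 n_B (1 - q_0^{(t)})}{1 - 2\theta} = 0 \\
&\Leftrightarrow n_S (1 - \theta)(1 - 2\theta) - n_B\, q_0^{(t)}\, \theta (1 - 2\theta) - 2 n_B (1 - q_0^{(t)})\, \theta (1 - \theta) = 0 \\
&\Leftrightarrow n_S + \theta \big({-3 n_S} - n_B\, q_0^{(t)} - 2 n_B + 2 n_B\, q_0^{(t)}\big) + \theta^2 \big(2 n_S + 2 n_B\, q_0^{(t)} + 2 n_B - 2 n_B\, q_0^{(t)}\big) = 0 \\
&\Leftrightarrow 2(n_S + n_B)\, \theta^2 - \big(3 n_S + 2 n_B - n_B\, q_0^{(t)}\big)\, \theta + n_S = 0. \qquad (10)
\end{aligned}$$

It is easy to see that equation (10) is in fact the same as equation (4), which we obtained in the M step when the selection probability $\pi$ was considered fixed: with $n_S = 2$ and $n_B = 3$ the coefficients are $10$, $-(12 - 3 q_0^{(t)})$ and $2$, and $3 q_0^{(t)} = \mu_2^{(t)} + \mu_3^{(t)} + \mu_4^{(t)}$. Hence the solution of the optimization problem (9), that is, the update rule for the parameter $\theta$ in the M step of this new version of the EM algorithm, has exactly the form (5).

To recap, the update rules for this version of the EM algorithm are: in the E step, relation (7); in the M step, the variant of rule (5) for equation (10).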
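The count-based version needs only $q_0^{(t)}$ from relation (7) and the two counts $n_S$ and $n_B$. The following sketch shows one M-step update obtained from equation (10). It is an illustration under assumptions: the function names are hypothetical, $n_S > 0$ is assumed, and the smaller root is taken because the quadratic in (10) is positive at $\theta = 0$ and non-positive at $\theta = 1/2$.

```python
import math

def q0(theta, pi):
    """Relation (7): P(z_i = 1 | x_i = 0, theta), with pi fixed."""
    first = pi * (1.0 - theta)
    return first / (first + (1.0 - pi) * (1.0 - 2.0 * theta))

def m_step_from_counts(theta, pi, n_s, n_b):
    """Solve equation (10) for theta and return the root in (0, 1/2]."""
    a = 2.0 * (n_s + n_b)
    b = -(3.0 * n_s + 2.0 * n_b - n_b * q0(theta, pi))
    c = float(n_s)
    return (-b - math.sqrt(b * b - 4.0 * a * c)) / (2.0 * a)

# Counts for the five-toss dataset: n_S = 2 heads, n_B = 3 tails.
theta_next = m_step_from_counts(theta=0.45, pi=0.5, n_s=2, n_b=3)
print(theta_next)
```

The cost of one iteration no longer depends on the number of tosses, only on the two counts.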

45.

Remarks:

1. Formula (10) shows that $q_1^{(t)}$ no longer needs to be computed in the E step. Even if $q_1^{(t)}$ had to be computed, the E step would still be more efficient in the new version than in the "classical" version of the EM algorithm for mixtures of Bernoulli distributions, because there we computed $n$ expectations $\mu_i^{(t)}$. This matters a great deal when there are very many instances $x_i$!

2. This last kind of estimation of the parameter(s) of a probability distribution with the EM algorithm is possible because relations (8) hold.

We illustrate this important "dependence" in the accompanying figure (to help the reader remember it!). Recall that $n_S$ and $n_B$ are the observable data / variables, while $\tilde z_1$ and $\tilde z_0$ are the unobservable variables.

[Diagram: the coin-selection tree shown earlier ($Z \sim$ Bernoulli with probabilities $\pi$ and $1 - \pi$; heads probabilities $\theta$ and $2\theta$, tails probabilities $1 - \theta$ and $1 - 2\theta$), annotated with $\tilde z_1 \sim \text{Binomial}(n_S, q_1^{(t)})$ and $\tilde z_0 \sim \text{Binomial}(n_B, q_0^{(t)})$.]

46.

Estimating the parameters of a mixture of

vectors of Bernoulli variables, which are independent and [respectively] identically distributed

A. when all variables are observable: MLE, in the direct manner;

B. when some variables are unobservable: the EM algorithm

Liviu Ciortuz, following

"What is the expectation maximization algorithm?", Chuong B. Do, Serafim Batzoglou,

Nature Biotechnology, vol. 26, no. 8, 2008, pp. 897-899

47.

Consider the following probabilistic experiment:

We have two coins, A and B.
We carry out 5 rounds of the following kind:

− We choose one of the coins A and B at random, with equal probability (1/2);

− We toss the coin that was just chosen ($Z$) 10 times and record the result, summarized as the number ($X$) of heads obtained.

48.

A. In this part we assume that the experiment produced the following result:

  i   Z_i   X_i
  1   B     5H (5T)
  2   A     9H (1T)
  3   A     8H (2T)
  4   B     4H (6T)
  5   A     7H (3T)

The meaning of the random variables $Z_i$ and $X_i$, for $i = 1, \ldots, 5$, in the table above is immediate.

49.

i. Compute $\hat\theta_A$ and $\hat\theta_B$, the probabilities of heads for the two coins, using the classical definition of the probability of random events, namely the ratio between the number of favorable cases and the number of possible cases, over the whole experiment.

Answer:

Looking at the data in the table from the problem statement, we immediately get

$$\hat\theta_A = \frac{24}{24 + 6} = 0.8 \quad \text{and} \quad \hat\theta_B = \frac{9}{9 + 11} = 0.45.$$

Note that the terms 6 and 11 in the denominators of these fractions are the numbers of tails obtained when tossing coin A and coin B, respectively: $6T = 1T + 2T + 3T$, $11T = 5T + 6T$.

50.

Remark

If, instead of the binary variables $Z_i \in \{A, B\}$ for $i = 1, \ldots, 5$, we naturally introduce the indicator variables $Z_{i,A} \in \{0, 1\}$ and $Z_{i,B} \in \{0, 1\}$, also for $i = 1, \ldots, 5$, defined by $Z_{i,A} = 1$ iff $Z_i = A$, and $Z_{i,B} = 1$ iff $Z_i = B$, then the computations needed to obtain the probabilities / parameters $\hat\theta_A$ and $\hat\theta_B$ can be summarized as in the table below [a].

  i   Z_{i,A}   Z_{i,B}   X_i
  1   0         1         5H
  2   1         0         9H
  3   1         0         8H
  4   0         1         4H
  5   1         0         7H

⇒

[a] We present this "trick" as preparation for solving part B of this problem (later on).

52.

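⇒

  X_i · Z_{i,A}   X_i · Z_{i,B}
  0H (0T)         5H (5T)
  9H (1T)         0H (0T)
  8H (2T)         0H (0T)
  0H (0T)         4H (6T)
  7H (3T)         0H (0T)

$\sum_{i=1}^{5} X_i \cdot Z_{i,A} = 24H$, $\quad \sum_{i=1}^{5} X_i \cdot Z_{i,B} = 9H$

$\sum_{i=1}^{5} (10 - X_i) \cdot Z_{i,A} = 6T$, $\quad \sum_{i=1}^{5} (10 - X_i) \cdot Z_{i,B} = 11T$

⇒

$$\hat\theta_A = \frac{24}{24 + 6} = 0.8, \qquad \hat\theta_B = \frac{9}{9 + 11} = 0.45$$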
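The indicator-variable bookkeeping from the tables above translates directly into code. The sketch below is a minimal illustration; the helper name `mle_heads_probability` is hypothetical, while the data are the five rounds from part A.

```python
TOSSES_PER_ROUND = 10
heads = [5, 9, 8, 4, 7]             # X_i: number of heads in round i
coins = ["B", "A", "A", "B", "A"]   # Z_i: coin used in round i

def mle_heads_probability(coin):
    """Estimate the heads probability of `coin` from the complete data."""
    indicators = [1 if z == coin else 0 for z in coins]   # Z_{i,coin}
    n_heads = sum(x * z for x, z in zip(heads, indicators))
    n_tails = sum((TOSSES_PER_ROUND - x) * z for x, z in zip(heads, indicators))
    return n_heads / (n_heads + n_tails)

theta_a = mle_heads_probability("A")
theta_b = mle_heads_probability("B")
print(theta_a, theta_b)
```

Writing the estimate as weighted sums over all rounds is what allows the 0/1 indicators to be replaced by their conditional expectations when the coin identities are unobserved.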

53.

ii. Compute $L_1(\theta_A, \theta_B) \overset{\text{not.}}{=} P(X, Z \mid \theta_A, \theta_B)$, the likelihood function of the data $X \overset{\text{not.}}{=} \langle X_1, \ldots, X_5 \rangle$ and $Z \overset{\text{not.}}{=} \langle Z_1, \ldots, Z_5 \rangle$, with respect to the parameters $\theta_A$ and $\theta_B$ of the distributions that model the tossing of the two coins.

Remark: We assume that we recorded the sequence of [all] outcomes obtained when tossing the two coins. Spelling out this sequence is not essential. Without this assumption, we would have to work with the binomial distribution.

54.

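Answer:

Computing the likelihood of the data:

$$\begin{aligned}
L_1(\theta_A, \theta_B) &\overset{\text{def.}}{=} P(X, Z_A, Z_B \mid \theta_A, \theta_B) \overset{\text{cond. indep.}}{=} \prod_{i=1}^{5} P(X_i, Z_{i,A}, Z_{i,B} \mid \theta_A, \theta_B) \\
&= \prod_{i=1}^{5} P(X_i \mid Z_{i,A}, Z_{i,B}; \theta_A, \theta_B) \cdot P(Z_{i,A}, Z_{i,B} \mid \theta_A, \theta_B) \\
&= P(X_1 \mid Z_{1,B} = 1, \theta_B) \cdot \tfrac{1}{2} \cdot P(X_2 \mid Z_{2,A} = 1, \theta_A) \cdot \tfrac{1}{2} \cdot P(X_3 \mid Z_{3,A} = 1, \theta_A) \cdot \tfrac{1}{2} \cdot P(X_4 \mid Z_{4,B} = 1, \theta_B) \cdot \tfrac{1}{2} \cdot P(X_5 \mid Z_{5,A} = 1, \theta_A) \cdot \tfrac{1}{2} \\
&= \theta_B^5 (1 - \theta_B)^5 \cdot \theta_A^9 (1 - \theta_A) \cdot \theta_A^8 (1 - \theta_A)^2 \cdot \theta_B^4 (1 - \theta_B)^6 \cdot \theta_A^7 (1 - \theta_A)^3 \cdot \frac{1}{2^5} \\
&= \frac{1}{2^5}\, \theta_A^{24} (1 - \theta_A)^6\, \theta_B^9 (1 - \theta_B)^{11}
\end{aligned}$$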

55.

iii. Compute $\hat\theta_A \overset{\text{not.}}{=} \arg\max_{\theta_A} \log L_1(\theta_A, \theta_B)$ and $\hat\theta_B \overset{\text{not.}}{=} \arg\max_{\theta_B} \log L_1(\theta_A, \theta_B)$ using first-order partial derivatives.

Remarks:

1. The base of the logarithm, fixed but left unspecified above, is taken to be greater than 1 (for example 2, $e$ or 10).

2. If you work correctly, you will get the same result as in part i.

56.

Answer:

The log-likelihood function of the complete data is:

$$\log L_1(\theta_A, \theta_B) = -5 \log 2 + 24 \log\theta_A + 6 \log(1 - \theta_A) + 9 \log\theta_B + 11 \log(1 - \theta_B)$$

Therefore, the maximum of this function with respect to the parameter $\theta_A$ is computed as follows:

$$\frac{\partial \log L_1(\theta_A, \theta_B)}{\partial\theta_A} = 0 \Leftrightarrow \frac{24}{\theta_A} - \frac{6}{1 - \theta_A} = 0 \Leftrightarrow \frac{4}{\theta_A} = \frac{1}{1 - \theta_A} \Leftrightarrow 4 - 4\theta_A = \theta_A \Leftrightarrow \hat\theta_A = 0.8$$

Similarly, the computation for $\frac{\partial \log L_1(\theta_A, \theta_B)}{\partial\theta_B}$ gives $\hat\theta_B = 0.45$. [a]

The two values obtained, $\hat\theta_A$ and $\hat\theta_B$, are the maximum likelihood estimates (MLE) of the probabilities of heads for coin A and coin B, respectively.

[a] It is easy to check that the roots of the first-order partial derivatives of the log-likelihood function are indeed maximum points. To do so, one studies the signs of these derivatives.
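For completeness, the computation for $\theta_B$ that the answer describes as "similar" can be written out as follows. This is a worked sketch that mirrors the $\theta_A$ derivation, taking the natural logarithm for concreteness; any other base greater than 1 only rescales the derivatives by a positive constant.

```latex
\frac{\partial \log L_1(\theta_A, \theta_B)}{\partial \theta_B} = 0
\;\Leftrightarrow\; \frac{9}{\theta_B} - \frac{11}{1 - \theta_B} = 0
\;\Leftrightarrow\; 9\,(1 - \theta_B) = 11\,\theta_B
\;\Leftrightarrow\; 9 = 20\,\theta_B
\;\Leftrightarrow\; \hat\theta_B = \frac{9}{20} = 0.45

% Second-order check (the sign does not depend on the base of the logarithm):
\frac{\partial^2 \log L_1(\theta_A, \theta_B)}{\partial \theta_B^2}
= -\frac{9}{\theta_B^2} - \frac{11}{(1 - \theta_B)^2} < 0
\quad \text{for all } \theta_B \in (0, 1)
```

The negative second derivative confirms that the stationary point is a maximum; the same argument applies to $\theta_A$ with the counts 24 and 6.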

57.

Remarks

1. We have shown, on this particular case, that the method of computing the probabilities ($\hat\theta_A$ and $\hat\theta_B$) directly from the observed data (as we know it from high school) in fact coincides with estimation in the maximum likelihood sense (MLE).

2. In part B we will show how the same parameters $\theta_A$ and $\theta_B$ can be estimated when part of the data, namely the variables $Z_i$ (for $i = 1, \ldots, 5$), is unobservable.

58.

B. In this part we repeat the experiment from part A, but this time we assume that the values of the variables $Z_i$ are not known.

  i   Z_i   X_i
  1   ?     5H (5T)
  2   ?     9H (1T)
  3   ?     8H (2T)
  4   ?     4H (6T)
  5   ?     7H (3T)

59.

iv. For convenience, instead of the "unobservable" variables $Z_i$, for $i = 1, \ldots, 5$, we will consider the indicator variables (also unobservable) $Z_{i,A}, Z_{i,B} \in \{0, 1\}$, with $Z_{i,A} = 1$ iff $Z_{i,B} = 0$ and $Z_{i,B} = 1$ iff $Z_{i,A} = 0$. Clearly, since the variables $Z_i$ are random, the variables $Z_{i,A}$ and $Z_{i,B}$ are random too.

Using Bayes' theorem, compute the expectations of the unobservable variables $Z_{i,A}$ and $Z_{i,B}$ conditioned on the observable variables $X_i$. Take the parameters of the (Bernoulli) distributions that model the tossing of coins A and B to be $\theta_A^{(0)} = 0.6$ and $\theta_B^{(0)} = 0.5$, respectively.

So, what is required is: $E[Z_{i,A} \mid X_i, \theta_A^{(0)}, \theta_B^{(0)}]$ and $E[Z_{i,B} \mid X_i, \theta_A^{(0)}, \theta_B^{(0)}]$ for $i = 1, \ldots, 5$.

As before, the prior probabilities $P(Z_{i,A} = 1)$ and $P(Z_{i,B} = 1)$ are taken to be equal to 1/2.

60.

Note

The EM algorithm lets us estimate the parameters $\theta_A$ and $\theta_B$ iteratively, as a function of the values of the observable variables, $X_i$, and of the initial values assigned to the parameters (in our case, $\theta_A^{(0)} = 0.6$ and $\theta_B^{(0)} = 0.5$).

We summarize the computations of the first iteration of the EM algorithm, detailed in parts iv, v and vi below, in the following form, which resembles to some extent the tables from part A:

  i   Z_{i,A}   Z_{i,B}   X_i
  1   −         −         5H
  2   −         −         9H
  3   −         −         8H
  4   −         −         4H
  5   −         −         7H

E ⇒

  E[Z_{i,A}]   E[Z_{i,B}]
  0.45H        0.55H
  0.80H        0.20H
  0.73H        0.27H
  0.35H        0.65H
  0.65H        0.35H

M ⇒

61.

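Note (cont.)

M ⇒

i   X_i · E[Z_{i,A}]   X_i · E[Z_{i,B}]
1   2.2H (2.2T)        2.8H (2.8T)
2   7.2H (0.8T)        1.8H (0.2T)
3   5.9H (1.5T)        2.1H (0.5T)
4   1.4H (2.1T)        2.6H (3.9T)
5   4.5H (1.9T)        2.5H (1.1T)

$$\sum_{i=1}^{5} X_i \cdot E[Z_{i,A}] = 21.3\,H, \qquad \sum_{i=1}^{5} X_i \cdot E[Z_{i,B}] = 11.7\,H$$

$$\sum_{i=1}^{5} (10 - X_i) \cdot E[Z_{i,A}] = 8.7\,T, \qquad \sum_{i=1}^{5} (10 - X_i) \cdot E[Z_{i,B}] = 8.3\,T$$

⇒

$$\hat{\theta}_A^{(1)} = \frac{21.3}{21.3 + 8.7} \approx 0.71, \qquad \hat{\theta}_B^{(1)} = \frac{11.7}{11.7 + 8.3} \approx 0.58$$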

62.

Note (cont.)

We will show that:

• it is indeed possible to compute the expectations of the unobservable variables $Z_{i,A}$ and $Z_{i,B}$, conditioned on the observable variables $X_i$ and as a function of the values initially assigned to the parameters $\theta_A$ and $\theta_B$ (0.6 and 0.5 in the problem statement, although in general they can be assigned at random);

• compared with the summary table from the previous part, where all variables were observable and the products $X_i \cdot Z_{i,A}$ and $X_i \cdot Z_{i,B}$ were computed, here the variables $Z_{i,A}$ and $Z_{i,B}$ are replaced in those products by the expectations $E[Z_{i,A}]$ and $E[Z_{i,B}]$. In fact, instead of computing $\sum_i X_i \cdot Z_{i,A}$ one computes the expectation $E[\sum_i X_i \cdot Z_{i,A}]$, and similarly for B.

Notation: For simplicity, in the above (including the preceding tables), $E[Z_{i,A}]$ stands for $E[Z_{i,A} \mid X_i, \theta^{(0)}]$ and $E[Z_{i,B}]$ stands for $E[Z_{i,B} \mid X_i, \theta^{(0)}]$, where $\theta^{(0)} \stackrel{\text{not.}}{=} (\theta_A^{(0)}, \theta_B^{(0)})$.

63.

Answer (iv)

Since the variables $Z_{i,A}$ take Boolean values (0 or 1), it follows that

$$E[Z_{i,A} \mid X_i, \theta^{(0)}] = 0 \cdot P(Z_{i,A} = 0 \mid X_i, \theta^{(0)}) + 1 \cdot P(Z_{i,A} = 1 \mid X_i, \theta^{(0)}) = P(Z_{i,A} = 1 \mid X_i, \theta^{(0)})$$

The probabilities $P(Z_{i,A} = 1 \mid X, \theta^{(0)}) = P(Z_{i,A} = 1 \mid X_i, \theta^{(0)})$, for $i = 1, \ldots, 5$, can be computed using Bayes' theorem:

$$P(Z_{i,A} = 1 \mid X_i, \theta^{(0)}) = \frac{P(X_i \mid Z_{i,A} = 1, \theta^{(0)}) \cdot P(Z_{i,A} = 1 \mid \theta^{(0)})}{\sum_{j \in \{0,1\}} P(X_i \mid Z_{i,A} = j, \theta^{(0)}) \cdot P(Z_{i,A} = j \mid \theta^{(0)})} = \frac{P(X_i \mid Z_{i,A} = 1, \theta_A^{(0)})}{P(X_i \mid Z_{i,A} = 1, \theta_A^{(0)}) + P(X_i \mid Z_{i,B} = 1, \theta_B^{(0)})}$$

Here we used the fact that $P(Z_{i,A} = 1 \mid \theta^{(0)}) = P(Z_{i,B} = 1 \mid \theta^{(0)}) = 1/2$ (see the problem statement).

64.

For example, for $i = 1$ we have:

$$E[Z_{1,A} \mid X_1, \theta^{(0)}] = \frac{0.6^5 (1 - 0.6)^5}{0.6^5 (1 - 0.6)^5 + 0.5^5 (1 - 0.5)^5} = \frac{1}{1 + \left(\frac{0.25}{0.24}\right)^5} \approx 0.45$$

The other expectations, $E[Z_{i,A} \mid X_i, \theta^{(0)}]$ for $i = 2, \ldots, 5$ and $E[Z_{i,B} \mid X_i, \theta^{(0)}]$ for $i = 1, \ldots, 5$, are computed in the same way as $E[Z_{1,A} \mid X_1, \theta^{(0)}]$. We recorded these values in the second table of the Note above.

Remark: Once $E[Z_{i,A} \mid X_i, \theta^{(0)}]$ has been computed, $E[Z_{i,B} \mid X_i, \theta^{(0)}] = 1 - E[Z_{i,A} \mid X_i, \theta^{(0)}]$ follows immediately, because $Z_{i,A} + Z_{i,B} = 1$.
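The Bayes computation in the answer to point iv can be checked numerically. The sketch below is only an illustration: the function name `posterior_coin_a` and the variable names are hypothetical, and it assumes, as in the problem statement, 10 tosses per round and equal priors of 1/2 (which therefore cancel).

```python
def posterior_coin_a(heads, theta_a, theta_b, tosses=10):
    """Return E[Z_{i,A} | X_i] = P(Z_{i,A} = 1 | X_i), assuming equal priors of 1/2."""
    tails = tosses - heads
    # Likelihood of the observed round under each coin (the priors cancel out).
    like_a = theta_a ** heads * (1 - theta_a) ** tails
    like_b = theta_b ** heads * (1 - theta_b) ** tails
    return like_a / (like_a + like_b)

heads_per_round = [5, 9, 8, 4, 7]          # the observed X_i
expect_a = [posterior_coin_a(x, 0.6, 0.5) for x in heads_per_round]
expect_b = [1 - p for p in expect_a]       # because Z_{i,A} + Z_{i,B} = 1

for i, (pa, pb) in enumerate(zip(expect_a, expect_b), start=1):
    print(f"i={i}: E[Z_A]={pa:.2f}  E[Z_B]={pb:.2f}")
```

Rounded to two decimals, the printed values can be compared with the E-step table in the Note.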

65.

v. Compute the expectation of the complete-data log-likelihood function, for X (observable) and Z (unobservable):

$$L_2(\theta_A, \theta_B) \stackrel{\text{def.}}{=} E_{P(Z \mid X, \theta^{(0)})}[\log P(X, Z \mid \theta)],$$

where $\theta \stackrel{\text{not.}}{=} (\theta_A, \theta_B)$ and $\theta^{(0)} \stackrel{\text{not.}}{=} (\theta_A^{(0)}, \theta_B^{(0)})$.

The meaning of this notation is the following:

the function $L_2(\theta_A, \theta_B)$ is the expectation of the random variable given by the complete-data log-likelihood (observable and unobservable data), and this expectation is taken with respect to the conditional probability distribution of the unobservable data, $P(Z \mid X, \theta^{(0)})$.

Remark: In working out the computation, first use the linearity of expectation, then the results from point iv.

66.

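Answer:

The expectation of the complete-data log-likelihood function, $L_2(\theta_A, \theta_B)$, is computed as follows:

$$\begin{aligned}
L_2(\theta_A, \theta_B) &\stackrel{\text{def.}}{=} E_{P(Z \mid X, \theta^{(0)})}[\log P(X, Z \mid \theta)] \\
&\stackrel{\text{cond. indep.}}{=} E_{P(Z \mid X, \theta^{(0)})}\Big[\log \prod_{i=1}^{5} P(X_i, Z_{i,A}, Z_{i,B} \mid \theta_A, \theta_B)\Big] \\
&\stackrel{\text{mult. rule}}{=} E_{P(Z \mid X, \theta^{(0)})}\Big[\log \prod_{i=1}^{5} P(X_i \mid Z_{i,A}, Z_{i,B}; \theta_A, \theta_B) \cdot P(Z_{i,A}, Z_{i,B} \mid \theta_A, \theta_B)\Big]
\end{aligned}$$

From here on we again omit the probability distribution with respect to which the expectation is taken, since it is understood, and write: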

67.

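$$\begin{aligned}
L_2(\theta_A, \theta_B)
&= E\Big[\log \prod_{i=1}^{5} (\theta_A^{Z_{i,A}})^{X_i} \cdot [(1 - \theta_A)^{Z_{i,A}}]^{10 - X_i} \cdot (\theta_B^{Z_{i,B}})^{X_i} \cdot [(1 - \theta_B)^{Z_{i,B}}]^{10 - X_i} \cdot \frac{1}{2}\Big] \\
&= E\Big[\sum_{i=1}^{5} \big[X_i \cdot Z_{i,A} \cdot \log \theta_A + (10 - X_i) \cdot Z_{i,A} \cdot \log(1 - \theta_A) \\
&\qquad\qquad + X_i \cdot Z_{i,B} \cdot \log \theta_B + (10 - X_i) \cdot Z_{i,B} \cdot \log(1 - \theta_B) - \log 2\big]\Big] \\
&\stackrel{\text{lin. of exp.}}{=} \sum_{i=1}^{5} \big[X_i \cdot E[Z_{i,A}] \cdot \log \theta_A + (10 - X_i) \cdot E[Z_{i,A}] \cdot \log(1 - \theta_A) \\
&\qquad\qquad + X_i \cdot E[Z_{i,B}] \cdot \log \theta_B + (10 - X_i) \cdot E[Z_{i,B}] \cdot \log(1 - \theta_B) - \log 2\big] \\
&= \sum_{i=1}^{5} \log\Big[\theta_A^{X_i \cdot E[Z_{i,A}]} \cdot (1 - \theta_A)^{(10 - X_i) \cdot E[Z_{i,A}]} \cdot \theta_B^{X_i \cdot E[Z_{i,B}]} \cdot (1 - \theta_B)^{(10 - X_i) \cdot E[Z_{i,B}]} \cdot \frac{1}{2}\Big] \\
&= \log\Big(\theta_A^{2.2} (1 - \theta_A)^{2.2} \, \theta_B^{2.8} (1 - \theta_B)^{2.8} \cdot \ldots \cdot \theta_A^{4.5} (1 - \theta_A)^{1.9} \, \theta_B^{2.5} (1 - \theta_B)^{1.1} \cdot \frac{1}{2^5}\Big).
\end{aligned}$$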

68.

In the last equality above, the fractional quantities come from the simple computations $X_1 \cdot E[Z_{1,A} \mid X_1, \theta^{(0)}] \approx 2.2$, $X_1 \cdot E[Z_{1,B} \mid X_1, \theta^{(0)}] \approx 2.8$, ..., $X_5 \cdot E[Z_{5,A} \mid X_5, \theta^{(0)}] \approx 4.5$, $X_5 \cdot E[Z_{5,B} \mid X_5, \theta^{(0)}] \approx 2.5$ (see the tables in the preceding Note).

69.

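vi. Compute $\theta_A^{(1)} \stackrel{\text{not.}}{=} \arg\max_{\theta_A} L_2(\theta_A, \theta_B)$ and $\theta_B^{(1)} \stackrel{\text{not.}}{=} \arg\max_{\theta_B} L_2(\theta_A, \theta_B)$.

Answer:

The values of the parameters $\theta_A$ and $\theta_B$ at which the expectation of the complete-data log-likelihood reaches its maximum are obtained from the first-order partial derivatives: (a)

$$\begin{aligned}
\frac{\partial L_2(\theta_A, \theta_B)}{\partial \theta_A} = 0
&\Rightarrow \frac{\partial}{\partial \theta_A}\big(2.2 \log \theta_A + 2.2 \log(1 - \theta_A) + \ldots + 4.5 \log \theta_A + 1.9 \log(1 - \theta_A)\big) = 0 \\
&\Rightarrow \frac{2.2}{\theta_A} - \frac{2.2}{1 - \theta_A} + \ldots + \frac{4.5}{\theta_A} - \frac{1.9}{1 - \theta_A} = 0 \Rightarrow \ldots \Rightarrow \theta_A^{(1)} \approx 0.71.
\end{aligned}$$

Similarly, we obtain $\theta_B^{(1)} \approx 0.58$.

(a) It is immediate that the second-order derivatives are negative over the whole domain of definition.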

70.

C. Formalize the E and M steps of the EM algorithm for estimating the parameters $\theta_A$ and $\theta_B$ under the conditions of part B.

Answer:

The formulas used by the EM algorithm for the given problem (i.e. estimating $\theta_A$ and $\theta_B$ when the variables $Z_i$ are unobservable) are derived as follows:

71.

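Denoting by

− $x_i$ the value of the variable $X_i$,

− $\theta_A^{(t)}$ and $\theta_B^{(t)}$ the estimates of the parameters $\theta_A$ and $\theta_B$ at iteration $t$ of the EM algorithm,

− $p_{i,A}^{(t+1)}$ and $p_{i,B}^{(t+1)}$ the expectations $E[Z_{i,A} \mid X_i, \theta^{(t)}]$ and $E[Z_{i,B} \mid X_i, \theta^{(t)}]$, respectively,

we have, at the E step:

$$p_{i,A}^{(t+1)} = \frac{(\theta_A^{(t)})^{x_i} (1 - \theta_A^{(t)})^{10 - x_i}}{(\theta_A^{(t)})^{x_i} (1 - \theta_A^{(t)})^{10 - x_i} + (\theta_B^{(t)})^{x_i} (1 - \theta_B^{(t)})^{10 - x_i}}$$

$$p_{i,B}^{(t+1)} = \frac{(\theta_B^{(t)})^{x_i} (1 - \theta_B^{(t)})^{10 - x_i}}{(\theta_A^{(t)})^{x_i} (1 - \theta_A^{(t)})^{10 - x_i} + (\theta_B^{(t)})^{x_i} (1 - \theta_B^{(t)})^{10 - x_i}}$$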

73.

M step:

As before, in the formulas below we use the simplified notations

$E[Z_{i,A}] \stackrel{\text{not.}}{=} E[Z_{i,A} \mid X_i, \theta^{(t)}]$ and $E[Z_{i,B}] \stackrel{\text{not.}}{=} E[Z_{i,B} \mid X_i, \theta^{(t)}]$.

With these notations, proceeding as in the computation from part B, point v, we have:

$$L_2(\theta_A, \theta_B) = \log \prod_{i=1}^{5} \theta_A^{x_i E[Z_{i,A}]} (1 - \theta_A)^{(10 - x_i) E[Z_{i,A}]} \, \theta_B^{x_i E[Z_{i,B}]} (1 - \theta_B)^{(10 - x_i) E[Z_{i,B}]}$$

74.

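Therefore,

$$\begin{aligned}
\frac{\partial}{\partial \theta_A} L_2(\theta_A, \theta_B) = 0
&\Rightarrow \frac{1}{\theta_A} \sum_{i=1}^{5} x_i E[Z_{i,A}] = \frac{1}{1 - \theta_A} \sum_{i=1}^{5} (10 - x_i) E[Z_{i,A}] \\
&\Rightarrow (1 - \theta_A) \sum_{i=1}^{5} x_i E[Z_{i,A}] = \theta_A \sum_{i=1}^{5} (10 - x_i) E[Z_{i,A}] \\
&\Rightarrow \sum_{i=1}^{5} x_i E[Z_{i,A}] = 10\, \theta_A \sum_{i=1}^{5} E[Z_{i,A}] \\
&\Rightarrow \theta_A = \frac{\sum_{i=1}^{5} x_i E[Z_{i,A}]}{10 \sum_{i=1}^{5} E[Z_{i,A}]} \quad \text{and, similarly,} \quad \theta_B = \frac{\sum_{i=1}^{5} x_i E[Z_{i,B}]}{10 \sum_{i=1}^{5} E[Z_{i,B}]}
\end{aligned}$$

So, at the M step of the EM algorithm we have:

$$\theta_A^{(t+1)} = \frac{\sum_{i=1}^{5} x_i \, p_{i,A}^{(t+1)}}{10 \sum_{i=1}^{5} p_{i,A}^{(t+1)}} \quad \text{and} \quad \theta_B^{(t+1)} = \frac{\sum_{i=1}^{5} x_i \, p_{i,B}^{(t+1)}}{10 \sum_{i=1}^{5} p_{i,B}^{(t+1)}}$$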
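The E-step and M-step relations above translate almost line by line into code. The following is a minimal sketch, not a reference implementation: the function name `em_two_coins` and its arguments are hypothetical, and it assumes 10 tosses per round and equal priors for the two coins, as in the problem statement.

```python
def em_two_coins(heads, theta_a, theta_b, tosses=10, iterations=10):
    """Run a fixed number of EM iterations for the two-coin problem."""
    for _ in range(iterations):
        # E step: p_a[i] = E[Z_{i,A} | x_i, theta^(t)], p_b[i] = 1 - p_a[i].
        p_a = []
        for x in heads:
            like_a = theta_a ** x * (1 - theta_a) ** (tosses - x)
            like_b = theta_b ** x * (1 - theta_b) ** (tosses - x)
            p_a.append(like_a / (like_a + like_b))
        p_b = [1 - p for p in p_a]
        # M step: expected heads divided by expected tosses, for each coin.
        theta_a = sum(x * p for x, p in zip(heads, p_a)) / (tosses * sum(p_a))
        theta_b = sum(x * p for x, p in zip(heads, p_b)) / (tosses * sum(p_b))
    return theta_a, theta_b

x_obs = [5, 9, 8, 4, 7]
first = em_two_coins(x_obs, 0.6, 0.5, iterations=1)
tenth = em_two_coins(x_obs, 0.6, 0.5, iterations=10)
print("after 1 iteration:  theta_A=%.2f theta_B=%.2f" % first)
print("after 10 iterations: theta_A=%.2f theta_B=%.2f" % tenth)
```

The values after one iteration can be compared with $\theta_A^{(1)}$ and $\theta_B^{(1)}$ computed by hand in part B.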

75.

Remark

Implementing the EM algorithm with the relations obtained for the E step and the M step, after 10 iterations we obtain the values $\theta_A^{(10)} \approx 0.80$ and $\theta_B^{(10)} \approx 0.52$.

It is interesting to note that the estimate obtained for the parameter $\theta_A$ is now at the same level as the one obtained by maximum likelihood estimation (MLE) when all variables were observed (0.80, see the solution to part A, point i), while the estimate for the parameter $\theta_B$ dropped from 0.58, the value obtained at the first EM iteration, to a value (0.52) that is considerably closer to the MLE estimate (0.45).

76.


a mixture of K categorical distributions

CMU, 2015 spring, Tom Mitchell, Nina Balcan, HW6, pr. 1

78.

Mixture models are helpful for modelling unknown subpopulations in data. If we have a collection of data points $X = \{X_1, \ldots, X_n\}$, where each $X_i$ is [independently] drawn from one of $K$ possible distributions, we can introduce a discrete-valued random variable $Z_i \in \{1, \ldots, K\}$ that indicates which distribution $X_i$ is drawn from.

This exercise deals with the categorical mixture model, where each observation $X_i$ is a discrete value drawn from a categorical distribution. The parameter for a categorical distribution is a $K$-dimensional vector $\pi$ that lists the probability of each of $K$ possible values (therefore, $\sum_k \pi_k = 1$).

For example, suppose our categorical mixture model has 3 underlying distributions. Then, each $Z_i$ could take on one of three values, $\{1, 2, 3\}$, with respective probabilities $\pi_1, \pi_2, \pi_3$; equivalently, $Z_i \sim \text{Categorical}(\pi)$, where $\pi = (\pi_1, \pi_2, \pi_3) \in \mathbb{R}_+^3$, with $\pi_1 + \pi_2 + \pi_3 = 1$. The observation $X_i$ is then generated from another categorical distribution, depending on the value of $Z_i$.

The generative process for a categorical mixture model is summarized as follows:

$$Z_i \sim \text{Categorical}(\pi)$$
$$X_i \sim \text{Categorical}(\theta_{Z_i})$$
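To make the generative process concrete, here is a small sampling sketch. Everything in it is illustrative: the function name `sample_categorical_mixture`, the parameter values, and the use of 0-based indices for components and observed values (the text uses 1-based indices) are assumptions, not part of the exercise.

```python
import random

def sample_categorical_mixture(n, pi, theta, rng):
    """Draw n pairs (z_i, x_i) with z_i ~ Categorical(pi), x_i ~ Categorical(theta[z_i])."""
    samples = []
    for _ in range(n):
        # Pick the mixture component, then a value from that component's distribution.
        z = rng.choices(range(len(pi)), weights=pi)[0]
        x = rng.choices(range(len(theta[z])), weights=theta[z])[0]
        samples.append((z, x))
    return samples

rng = random.Random(0)                      # fixed seed, for reproducibility
pi = [0.5, 0.3, 0.2]                        # K = 3 mixing probabilities
theta = [[0.7, 0.1, 0.1, 0.1],              # one categorical over M = 4 values per component
         [0.1, 0.7, 0.1, 0.1],
         [0.25, 0.25, 0.25, 0.25]]
samples = sample_categorical_mixture(1000, pi, theta, rng)
print(samples[:5])
```

In the learning problem only the second element of each pair (the observation) would be kept; the first element plays the role of the latent $Z_i$.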

79.

For this model, where we observe $X$ but not $Z$, we want to learn the parameters of the $K$ categorical components $\Theta = \{\pi, \theta_1, \ldots, \theta_K\}$, where each $\theta_k \in \mathbb{R}^M$ is the parameter for the categorical distribution associated with the $k$-th mixture component. (This implies that each $X_i$ can take on one of $M$ possible values.) We will use the EM algorithm to accomplish this.

A note on notation and a hint:

When working with categorical distributions it is helpful to use indicator functions. The indicator function $1_{\{x=j\}}$ has the value 1 when $x = j$ and 0 otherwise. With this notation, we can express the probability that a random variable drawn from a categorical distribution (e.g., $Y \sim \text{Categorical}(\phi)$, where $\phi \in \mathbb{R}^N$) takes on a particular value as

$$P(Y) = \prod_{i=1}^{N} \phi_i^{1_{\{Y=i\}}}.$$

80.

a. What is the joint distribution $P(X, Z; \Theta)$?

b. What is the posterior distribution of the latent variables, $P(Z \mid X; \Theta)$?

c. Compute the expectation of the log-likelihood,

$$Q(\Theta \mid \Theta') = E_{Z \mid X; \Theta'}[\log P(X, Z; \Theta)]$$

d. What is the update step for $\Theta$? That is, what $\Theta$ maximizes $Q(\Theta \mid \Theta')$?
Hint: Make sure your solution enforces the constraint that parameters to categorical distributions must sum to 1. Lagrange multipliers are a great way to solve constrained optimization problems.

81.

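Answer:

a.

$$P(X, Z; \Theta) = \prod_{i=1}^{n} P(X_i, Z_i; \Theta) = \prod_{i=1}^{n} P(X_i \mid Z_i; \Theta) P(Z_i; \Theta) = \prod_{i=1}^{n} \prod_{k=1}^{K} \Big[\pi_k \prod_{j=1}^{M} \theta_{kj}^{1_{\{X_i = v_j\}}}\Big]^{1_{\{Z_i = k\}}}$$

b.

$$\begin{aligned}
P(Z_i = k \mid X_i; \Theta) &\stackrel{\text{Bayes F.}}{=} \frac{P(X_i \mid Z_i = k; \Theta) P(Z_i = k; \Theta)}{P(X_i; \Theta)} \\
&= \frac{P(X_i \mid Z_i = k; \Theta) P(Z_i = k; \Theta)}{\sum_{l=1}^{K} P(Z_i = l; \Theta) P(X_i \mid Z_i = l; \Theta)} \\
&= \frac{\pi_k \prod_{j=1}^{M} \theta_{kj}^{1_{\{X_i = v_j\}}}}{\sum_{l=1}^{K} \pi_l \prod_{j=1}^{M} \theta_{lj}^{1_{\{X_i = v_j\}}}}
\end{aligned}$$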

82.

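c.

$$\begin{aligned}
Q(\Theta \mid \Theta') &\stackrel{\text{def.}}{=} E_{Z \mid X; \Theta'}[\log P(X, Z; \Theta)] \\
&\stackrel{\text{a.}}{=} E_{Z \mid X; \Theta'}\Big[\log \prod_{i=1}^{n} \prod_{k=1}^{K} \Big(\pi_k \prod_{j=1}^{M} \theta_{kj}^{1_{\{X_i = v_j\}}}\Big)^{1_{\{Z_i = k\}}}\Big] \\
&= E_{Z \mid X; \Theta'}\Big[\sum_{i=1}^{n} \sum_{k=1}^{K} 1_{\{Z_i = k\}} \Big(\log \pi_k + \sum_{j=1}^{M} 1_{\{X_i = v_j\}} \log \theta_{kj}\Big)\Big] \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} \underbrace{E_{Z \mid X; \Theta'}[1_{\{Z_i = k\}}]}_{P(Z_i = k \mid X_i; \Theta')} \Big(\log \pi_k + \sum_{j=1}^{M} 1_{\{X_i = v_j\}} \log \theta_{kj}\Big)
\end{aligned}$$

Using

$$\gamma_{ik} \stackrel{\text{not.}}{=} E_{Z \mid X; \Theta'}[1_{\{Z_i = k\}}] = P(Z_i = k \mid X_i; \Theta') \stackrel{\text{b.}}{=} \frac{\pi'_k \prod_{j=1}^{M} (\theta'_{kj})^{1_{\{X_i = v_j\}}}}{\sum_{l=1}^{K} \pi'_l \prod_{j=1}^{M} (\theta'_{lj})^{1_{\{X_i = v_j\}}}},$$

leads to

$$Q(\Theta \mid \Theta') = \sum_{i=1}^{n} \sum_{k=1}^{K} \gamma_{ik} \Big(\log \pi_k + \sum_{j=1}^{M} 1_{\{X_i = v_j\}} \log \theta_{kj}\Big)$$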
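The quantities $\gamma_{ik}$ are the only part of $Q$ that depends on the old parameters, so the E step reduces to computing them. The sketch below is a hypothetical illustration: observations are encoded as 0-based indices $j$ of the values $v_j$, so that the product $\prod_j (\theta'_{kj})^{1_{\{X_i = v_j\}}}$ collapses to a single table lookup; the function name and the toy parameters are assumptions.

```python
def responsibilities(xs, pi, theta):
    """gamma[i][k] = P(Z_i = k | X_i = xs[i]) under the current parameters (pi, theta)."""
    gamma = []
    for x in xs:
        # Numerator of Bayes' formula for each component k.
        joint = [pi[k] * theta[k][x] for k in range(len(pi))]
        total = sum(joint)                 # P(X_i), the normalizing constant
        gamma.append([j / total for j in joint])
    return gamma

xs = [0, 1, 1, 2]                          # toy observations, as value indices
pi = [0.5, 0.5]
theta = [[0.8, 0.1, 0.1],
         [0.2, 0.4, 0.4]]
gamma = responsibilities(xs, pi, theta)
for row in gamma:
    print([round(g, 3) for g in row])
```

Each row of `gamma` sums to 1, which is the property used later to show that $\lambda_\pi = n$.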

83.

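d. To find the update rules for the parameters, we solve the following constrained optimization problem:

$$\max_{\pi, \theta_{1:K}} Q(\Theta \mid \Theta') \quad \text{subject to} \quad \sum_{k=1}^{K} \pi_k = 1, \qquad \sum_{j=1}^{M} \theta_{kj} = 1, \ \forall k = 1, \ldots, K$$

We introduce a Lagrange multiplier for each constraint of this problem and form the Lagrangian functional:

$$L = Q(\Theta \mid \Theta') + \lambda_\pi \Big(1 - \sum_{k=1}^{K} \pi_k\Big) + \sum_{k=1}^{K} \lambda_{\theta_k} \Big(1 - \sum_{j=1}^{M} \theta_{kj}\Big)$$

Now we will differentiate the Lagrangian functional with respect to each parameter we want to optimize, set the derivative to zero, and solve for the optimal value.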

Now we will differentiate the Lagrangian functional with respect to each parameter wewant to optimize, set the derivative to zero, and solve for the optimal value.

84.

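We'll start with optimizing for $\pi$, the parameter to the categorical distribution for the latent variables $Z$.

$$\frac{\partial L}{\partial \pi_k} = 0 \Leftrightarrow \sum_{i=1}^{n} \gamma_{ik} \cdot \frac{1}{\pi_k} - \lambda_\pi = 0 \Leftrightarrow \pi_k^* = \frac{1}{\lambda_\pi} \sum_{i=1}^{n} \gamma_{ik}$$

We can find the value of the Lagrange multiplier $\lambda_\pi$ by enforcing the constraint:

$$\begin{aligned}
\sum_{k=1}^{K} \pi_k^* = 1 &\Leftrightarrow \sum_{k=1}^{K} \frac{1}{\lambda_\pi} \sum_{i=1}^{n} \gamma_{ik} = 1 \Leftrightarrow \frac{1}{\lambda_\pi} \sum_{k=1}^{K} \sum_{i=1}^{n} \gamma_{ik} = 1 \\
&\Leftrightarrow \lambda_\pi = \sum_{k=1}^{K} \sum_{i=1}^{n} \gamma_{ik} \Leftrightarrow \lambda_\pi = \sum_{i=1}^{n} \underbrace{\sum_{k=1}^{K} \gamma_{ik}}_{1} \Leftrightarrow \lambda_\pi = n
\end{aligned}$$

Substituting this value for $\lambda_\pi$ back into the expression of $\pi_k^*$ gives the answer:

$$\pi_k^* = \frac{1}{n} \sum_{i=1}^{n} \gamma_{ik}$$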

85.

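The process for optimizing the observation parameters $\theta_{kj}$ is similar. First, we differentiate the Lagrangian $L$ with respect to $\theta_{kj}$, set the derivative to zero, and [finally] solve.

$$\frac{\partial L}{\partial \theta_{kj}} = 0 \Leftrightarrow \sum_{i=1}^{n} \gamma_{ik} 1_{\{X_i = v_j\}} \cdot \frac{1}{\theta_{kj}} - \lambda_{\theta_k} = 0 \Rightarrow \theta_{kj}^* = \frac{1}{\lambda_{\theta_k}} \sum_{i=1}^{n} \gamma_{ik} 1_{\{X_i = v_j\}}$$

Again, we can find the value of the Lagrange multiplier $\lambda_{\theta_k}$ by enforcing the summation constraint:

$$\begin{aligned}
\sum_{j=1}^{M} \theta_{kj}^* = 1 &\Leftrightarrow \sum_{j=1}^{M} \frac{1}{\lambda_{\theta_k}} \sum_{i=1}^{n} \gamma_{ik} 1_{\{X_i = v_j\}} = 1 \Leftrightarrow \frac{1}{\lambda_{\theta_k}} \sum_{j=1}^{M} \sum_{i=1}^{n} \gamma_{ik} 1_{\{X_i = v_j\}} = 1 \\
&\Leftrightarrow \lambda_{\theta_k} = \sum_{j=1}^{M} \sum_{i=1}^{n} \gamma_{ik} 1_{\{X_i = v_j\}}
\end{aligned}$$

Substituting this value for $\lambda_{\theta_k}$ back into the expression of $\theta_{kj}^*$ gives the answer:

$$\theta_{kj}^* = \frac{\sum_{i=1}^{n} \gamma_{ik} 1_{\{X_i = v_j\}}}{\sum_{l=1}^{M} \sum_{i=1}^{n} \gamma_{ik} 1_{\{X_i = v_l\}}} = \frac{\sum_{i=1}^{n} \gamma_{ik} 1_{\{X_i = v_j\}}}{\sum_{i=1}^{n} \sum_{l=1}^{M} \gamma_{ik} 1_{\{X_i = v_l\}}}$$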
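The two closed-form updates, $\pi_k^*$ and $\theta_{kj}^*$, amount to normalizing soft counts. A minimal sketch under the same assumptions as before (hypothetical function name, observations encoded as 0-based value indices, responsibilities supplied as a list of rows):

```python
def m_step(xs, gamma, num_values):
    """Return (pi, theta) maximizing Q for fixed responsibilities gamma[i][k]."""
    n = len(xs)
    num_components = len(gamma[0])
    # pi_k = (1/n) * sum_i gamma_ik
    pi = [sum(g[k] for g in gamma) / n for k in range(num_components)]
    theta = []
    for k in range(num_components):
        # Soft count of each value j under component k: sum_i gamma_ik * 1{X_i = v_j}.
        counts = [0.0] * num_values
        for x, g in zip(xs, gamma):
            counts[x] += g[k]
        total = sum(counts)                # plays the role of lambda_{theta_k}
        theta.append([c / total for c in counts])
    return pi, theta

xs = [0, 1, 1, 2]
gamma = [[0.8, 0.2], [0.2, 0.8], [0.2, 0.8], [0.2, 0.8]]
pi, theta = m_step(xs, gamma, num_values=3)
print(pi)
print(theta)
```

Note that the sketch divides by `total` without a guard; a component whose responsibilities are all zero would need special handling, which the derivation above does not discuss.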

86.

Using the EM algorithm for

learning a categorical distribution

[and implicitly a multinomial distribution]

Application to the ABO blood model


Anirban DasGupta, Probability for Statistics and MachineLearning, Springer, 2011, Ex. 20.10

87.

Human blood groups are determined by the variants ("alleles") of a gene located on chromosome 9 (more precisely, at position 9q34.2), called the ABO gene. Each of us has a pair of such chromosomes, and therefore two copies of the ABO gene: one inherited from the father and one inherited from the mother.

The three types of alleles of the ABO gene are denoted A, B and O. Alleles A and B are dominant with respect to allele O. (In other words, allele O is recessive with respect to each of the alleles A and B.) Alleles A and B are codominant. Hence the blood groups can be specified according to the following table.

blood group   inherited alleles
A             AA, AO
B             BB, BO
AB            AB
O             OO

We assume that, in a given population, each of the alleles A, B and O has the same frequency in men as in women. In what follows, these frequencies (denoted $p_A$, $p_B$ and $p_O$, respectively) are considered unknown a priori, and we are going to determine them.

88.

[Figure: Location of the ABO gene on chromosome 9. Source: wiki]

[Figure: The ABO protein. Source: wiki]

89.

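[Figure: the ABO model as input and output of the EM algorithm. Parameters: the probabilities pA, pB, pO of the alleles A, B, O of the ABO gene on the homologous chromosomes 9 and 9'. Unobservable data: the genotypes AA, AO, BB, BO, AB, OO. Observable data: the blood-type counts, which the EM algorithm takes as input; its output is the allele probabilities.]

blood type   genotypes   probability       count
A            AA, AO      pA^2 + 2 pA pO    nA
B            BB, BO      pB^2 + 2 pB pO    nB
AB           AB          2 pA pB           nAB
O            OO          pO^2              nO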

90.

a. Consider a population sample of $n$ persons in which $n_A$ persons have blood group A, $n_B$ persons have blood group B, $n_{AB}$ persons have blood group AB and $n_O$ persons have blood group O. (Clearly, $n = n_A + n_B + n_{AB} + n_O$.)

Starting from these "observable" data, derive the EM algorithm for determining the probabilities $p_A$, $p_B$ and $p_O$. (Clearly, since they sum to 1, it is enough to determine only two of them.)

91.

Hints

1. Assuming that the process by which an individual's alleles inherited from the parents are combined behaves like independent random events, the probabilities of the combinations (in genetic terms: the "genotypes") AA, AO, BB, BO, AB and OO in a given population are $p_A^2$, $2 p_A p_O$, $p_B^2$, $2 p_B p_O$, $2 p_A p_B$ and $p_O^2$, respectively.

2. Writing $n_A = n_{AA} + n_{AO}$ and $n_B = n_{BB} + n_{BO}$, where the numbers $n_{AA}$, $n_{AO}$, $n_{BB}$ and $n_{BO}$ have meanings analogous to those of $n_A$, $n_B$, $n_{AB}$ and $n_O$ specified above, it is natural to treat $n_{AA}$ and $n_{BB}$ as "unobservable" data in the formulation of the EM algorithm. The parameters of the model are the probabilities $p_A$, $p_B$ and $p_O$.

92.

3. At the E step of the EM algorithm you will compute $\hat{n}_{AA}$ and $\hat{n}_{BB}$, which are the "expected" number of occurrences of the allele combination AA and the "expected" number of occurrences of the allele combination BB, respectively, in the given population.

You will then write the complete-data log-likelihood function ("observable" and "unobservable" data), expressed using the distribution $\text{Multinomial}(n; p_A^2, 2 p_A p_O, p_B^2, 2 p_B p_O, 2 p_A p_B, p_O^2)$.

4. At the M step of the EM algorithm, starting from the expected log-likelihood computed at the E step, you will establish the update rules for the probabilities $p_A$, $p_B$ and $p_O$.

Note: since these probabilities sum to 1, the optimization problem to be solved at the M step is a constrained one. The method of Lagrange multipliers can be useful here.

93.

Solution

Notations:

• parameters: $p = \{p_A, p_B, p_O\}$ (in fact, it will be enough to estimate $p_A$ and $p_B$, since $p_A + p_B + p_O = 1$)
• observable data: $n_{obs} = \{n_A, n_B, n_{AB}, n_O\}$
• unobservable data: $n_{unobs} = \{n_{AA}, n_{AO}, n_{BB}, n_{BO}\}$ (in fact, we will restrict $n_{unobs}$ to $\{n_{AA}, n_{BB}\}$)
• complete data: $n_{compl} = n_{obs} \cup n_{unobs}$
• $n = n_A + n_B + n_{AB} + n_O$, $n_A = n_{AA} + n_{AO}$, $n_B = n_{BB} + n_{BO}$.

The likelihood of the complete data:

$$L(p) \stackrel{\text{not.}}{=} P(n_{compl} \mid p) = \frac{n!}{n_{AA}!\, n_{AO}!\, n_{BB}!\, n_{BO}!\, n_{AB}!\, n_O!} \cdot (p_A^2)^{n_{AA}} \cdot (2 p_A p_O)^{n_{AO}} \cdot (p_B^2)^{n_{BB}} \cdot (2 p_B p_O)^{n_{BO}} \cdot (2 p_A p_B)^{n_{AB}} \cdot (p_O^2)^{n_O}$$

94.

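The log-likelihood function:

$$\begin{aligned}
\ell(p) &\stackrel{\text{def.}}{=} \ln L(p) \\
&= \ln c + n_{AA} \ln(p_A^2) + n_{AO} \ln(2 p_A p_O) + n_{BB} \ln(p_B^2) + n_{BO} \ln(2 p_B p_O) + n_{AB} \ln(2 p_A p_B) + n_O \ln(p_O^2) \\
&= \ln c' + 2 n_{AA} \ln p_A + n_{AO} (\ln p_A + \ln p_O) + 2 n_{BB} \ln p_B + n_{BO} (\ln p_B + \ln p_O) + n_{AB} (\ln p_A + \ln p_B) + 2 n_O \ln p_O \\
&= \ln c' + 2 n_{AA} \ln p_A + (n_A - n_{AA})(\ln p_A + \ln p_O) + 2 n_{BB} \ln p_B + (n_B - n_{BB})(\ln p_B + \ln p_O) + n_{AB} (\ln p_A + \ln p_B) + 2 n_O \ln p_O,
\end{aligned}$$

where $c$ and $c'$ are constants which do not depend on the parameter $p$.

The "auxiliary" function:

$$\begin{aligned}
Q(p \mid p^{(t)}) &\stackrel{\text{def.}}{=} E[\ell(p) \mid n_{obs}; p^{(t)}] \\
&= \ln c' + 2 \hat{n}_{AA} \ln p_A + (n_A - \hat{n}_{AA})(\ln p_A + \ln p_O) + 2 \hat{n}_{BB} \ln p_B + (n_B - \hat{n}_{BB})(\ln p_B + \ln p_O) + n_{AB} (\ln p_A + \ln p_B) + 2 n_O \ln p_O,
\end{aligned}$$

where

$$\hat{n}_{AA} \stackrel{\text{not.}}{=} E[n_{AA} \mid n_{obs}; p^{(t)}] = E[n_{AA} \mid n_A, n_B, n_{AB}, n_O; p_A^{(t)}, p_B^{(t)}, p_O^{(t)}]$$

$$\hat{n}_{BB} \stackrel{\text{not.}}{=} E[n_{BB} \mid n_{obs}; p^{(t)}] = E[n_{BB} \mid n_A, n_B, n_{AB}, n_O; p_A^{(t)}, p_B^{(t)}, p_O^{(t)}]$$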

95.

E step: As indicated by the expression of $Q$, here we have to compute $\hat{n}_{AA}$ and $\hat{n}_{BB}$, the expected numbers of individuals having the genotype (i.e., the allele pair) AA and BB, respectively.

Firstly, taking into account that $n_{AA}$ (the number of individuals with genotype AA), when seen as a random variable, follows a binomial distribution with parameters $n_A$ and $\frac{(p_A^{(t)})^2}{(p_A^{(t)})^2 + 2 p_A^{(t)} p_O^{(t)}}$, its expectation will be:

$$\hat{n}_{AA} \stackrel{\text{not.}}{=} E[n_{AA} \mid n_A, n_B, n_{AB}, n_O; p_A^{(t)}, p_B^{(t)}, p_O^{(t)}] = \frac{(p_A^{(t)})^2}{(p_A^{(t)})^2 + 2 p_A^{(t)} p_O^{(t)}} \cdot n_A$$

Similarly,

$$\hat{n}_{BB} \stackrel{\text{not.}}{=} E[n_{BB} \mid n_A, n_B, n_{AB}, n_O; p_A^{(t)}, p_B^{(t)}, p_O^{(t)}] = \frac{(p_B^{(t)})^2}{(p_B^{(t)})^2 + 2 p_B^{(t)} p_O^{(t)}} \cdot n_B$$

96.

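M step:

Given the constraint $p_A + p_B + p_O = 1$, we introduce the Lagrange variable ("multiplier") $\lambda$ and solve the optimisation problem

$$\begin{aligned}
p^{(t+1)} &\stackrel{\text{not.}}{=} (p_A^{(t+1)}, p_B^{(t+1)}, p_O^{(t+1)}) \\
&= \arg\max_{p_A, p_B, p_O} \big[Q(p_A, p_B, p_O \mid p_A^{(t)}, p_B^{(t)}, p_O^{(t)}) + \lambda (1 - (p_A + p_B + p_O))\big] \\
&= \arg\max_{p_A, p_B, p_O} \big[\ln c' + 2 \hat{n}_{AA} \ln p_A + (n_A - \hat{n}_{AA})(\ln p_A + \ln p_O) + 2 \hat{n}_{BB} \ln p_B \\
&\qquad\qquad + (n_B - \hat{n}_{BB})(\ln p_B + \ln p_O) + n_{AB} (\ln p_A + \ln p_B) + 2 n_O \ln p_O + \lambda (1 - (p_A + p_B + p_O))\big]
\end{aligned}$$

Taking the partial derivatives of the objective function w.r.t. $p_A$, $p_B$ and $p_O$, setting them to zero and solving leads to:

$$\frac{1}{p_A} (2 \hat{n}_{AA} + n_A - \hat{n}_{AA} + n_{AB}) - \lambda = 0 \Rightarrow \hat{p}_A = \frac{1}{\lambda} (\hat{n}_{AA} + n_A + n_{AB})$$

$$\frac{1}{p_B} (2 \hat{n}_{BB} + n_B - \hat{n}_{BB} + n_{AB}) - \lambda = 0 \Rightarrow \hat{p}_B = \frac{1}{\lambda} (\hat{n}_{BB} + n_B + n_{AB})$$

$$\frac{1}{p_O} (n_A - \hat{n}_{AA} + n_B - \hat{n}_{BB} + 2 n_O) - \lambda = 0 \Rightarrow \hat{p}_O = \frac{1}{\lambda} (n_A - \hat{n}_{AA} + n_B - \hat{n}_{BB} + 2 n_O)$$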

97.

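Now, enforcing the constraint $\hat{p}_A + \hat{p}_B + \hat{p}_O = 1$ gives

$$\frac{1}{\lambda} (\hat{n}_{AA} + n_A + n_{AB} + \hat{n}_{BB} + n_B + n_{AB} + n_A - \hat{n}_{AA} + n_B - \hat{n}_{BB} + 2 n_O) = 1 \Leftrightarrow \frac{2}{\lambda} (n_A + n_B + n_{AB} + n_O) = 1$$

Since $n = n_A + n_B + n_{AB} + n_O$, it follows immediately that $\frac{1}{\lambda} 2n = 1$.

Therefore $\lambda = 2n$.

Together with the results obtained on the previous slide, this leads to the following updating relations:

$$\hat{p}_A^{(t+1)} = \frac{1}{2n} (\hat{n}_{AA} + n_A + n_{AB})$$

$$\hat{p}_B^{(t+1)} = \frac{1}{2n} (\hat{n}_{BB} + n_B + n_{AB})$$

$$\hat{p}_O^{(t+1)} = \frac{1}{2n} (n_A - \hat{n}_{AA} + n_B - \hat{n}_{BB} + 2 n_O) = \frac{1}{2n} (\hat{n}_{AO} + \hat{n}_{BO} + 2 n_O),$$

where $\hat{n}_{AO} = n_A - \hat{n}_{AA}$ and $\hat{n}_{BO} = n_B - \hat{n}_{BB}$.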

98.

b. Implement the EM algorithm you designed in part a and run it on the input $n_A = 186$, $n_B = 38$, $n_{AB} = 13$, $n_O = 284$ ($n = 521$).

As initial values for the parameters, first use $p_A = p_B = p_O = \frac{1}{3}$, and then $p_A = p_O = 0.1$ and $p_B = 0.8$.

As a stopping criterion, require that $p_A^{(t)}$, $p_B^{(t)}$ and $p_O^{(t)}$ each differ by no more than $10^{-4}$ from the values computed at the previous iteration. Compare the results obtained for the two initializations.

99.

Starting values: pA = pB = pO = 1/3.

Iterations:

t   pA       pB       pO
1   0.2505   0.0611   0.6884
2   0.2185   0.0505   0.7311
3   0.2142   0.0502   0.7357
4   0.2137   0.0501   0.7362
5   0.2136   0.0501   0.7363
6   0.2136   0.0501   0.7363

Starting values: pA = pO = 0.01, pB = 0.98.

Iterations:

t   pA       pB       pO
1   0.2505   0.0847   0.6648
2   0.2193   0.0511   0.7296
3   0.2143   0.0502   0.7355
4   0.2137   0.0501   0.7362
5   0.2136   0.0501   0.7363
6   0.2136   0.0501   0.7363

Note: The final results are the same.
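The iterations tabulated above can be reproduced with a few lines of code. This is a sketch only: the function name `em_abo`, its argument order and the `max_iter` safeguard are assumptions; the E and M steps are the updating relations derived in part a, and the stopping rule is the $10^{-4}$ criterion from part b.

```python
def em_abo(n_a, n_b, n_ab, n_o, p_a, p_b, p_o, tol=1e-4, max_iter=1000):
    """EM for the ABO allele probabilities; returns ((p_a, p_b, p_o), iterations used)."""
    n = n_a + n_b + n_ab + n_o
    iteration = 0
    for iteration in range(1, max_iter + 1):
        # E step: expected numbers of AA and BB genotypes among groups A and B.
        n_aa = n_a * p_a ** 2 / (p_a ** 2 + 2 * p_a * p_o)
        n_bb = n_b * p_b ** 2 / (p_b ** 2 + 2 * p_b * p_o)
        # M step: allele counts divided by the total number of alleles, 2n.
        new = ((n_aa + n_a + n_ab) / (2 * n),
               (n_bb + n_b + n_ab) / (2 * n),
               (n_a - n_aa + n_b - n_bb + 2 * n_o) / (2 * n))
        converged = max(abs(u - v) for u, v in zip(new, (p_a, p_b, p_o))) <= tol
        p_a, p_b, p_o = new
        if converged:
            break
    return (p_a, p_b, p_o), iteration

counts = (186, 38, 13, 284)                 # n_A, n_B, n_AB, n_O
first_step, _ = em_abo(*counts, 1 / 3, 1 / 3, 1 / 3, max_iter=1)
run_uniform, _ = em_abo(*counts, 1 / 3, 1 / 3, 1 / 3)
run_skewed, _ = em_abo(*counts, 0.01, 0.98, 0.01)
print([round(v, 4) for v in first_step])
print([round(v, 4) for v in run_uniform])
print([round(v, 4) for v in run_skewed])
```

The first printed line corresponds to row $t = 1$ of the first table; the other two lines show the estimates at which each run stops.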

100.

Using the EM algorithm for

learning a multinomial distribution

which depends on a single parameter

Application to a bioinformatics task:

computing probabilities for two linked bi-allelic loci

in haploid cells


Brani Vidakovic, Georgia Institute of Technology,Bayesian Statistics course (ISyE 8843A), 2004,

Handout 12, sec. 1.2.1

101.

Biological background: Meiosis

Source: wiki

102.

Biological background (cont’d)

[Figure: a male (father) diploid cell and a female (mother) diploid cell, each carrying the loci A/a and B/b on the two copies of chromosome i, undergo meiosis (including crossover) and produce haploid cells; fecundation then yields the offspring diploid cell.]

Haploid cell probabilities, P for the father and P' for the mother:

haploid cell   P           P'
AB             (1−p)/2     (1−p')/2
Ab             p/2         p'/2
aB             p/2         p'/2
ab             (1−p)/2     (1−p')/2

103.

Biological background (cont’d)

Note: Alleles A and B are dominant w.r.t. a and b, respectively.

Genotypes (numbered 1 to 16; each is a pair at the A/a locus and a pair at the B/b locus), grouped by offspring phenotype:

genotypes                                                                                   phenotype   count   probability
1-9:   AA/BB, Aa/BB, aA/BB, AA/Bb, Aa/Bb, aA/Bb, AA/bB, Aa/bB, aA/bB                        AB          nAB     (2 + ψ)/4
10-12: AA/bb, Aa/bb, aA/bb                                                                  Ab          nAb     (1 − ψ)/4
13-15: aa/BB, aa/Bb, aa/bB                                                                  aB          naB     (1 − ψ)/4
16:    aa/bb                                                                                ab          nab     ψ/4

with $\psi \stackrel{\text{not.}}{=} (1 - p)(1 - p')$.

For each group of genotypes, the table lists the offspring phenotype, the count (i.e. the number of individuals having that phenotype in a given population), and the corresponding probability.

Note that

$$P(ab) = \frac{(1 - p)(1 - p')}{4}$$

and

$$P(Ab) = P(aB) = \frac{p}{2} \cdot \frac{p'}{2} + \frac{p}{2} \cdot \frac{1 - p'}{2} + \frac{1 - p}{2} \cdot \frac{p'}{2} = \frac{1 - (1 - p)(1 - p')}{4}.$$

104.

Let $X$ be a random variable that takes values in the set $\{v_1, v_2, v_3, v_4\}$ and follows the distribution $\text{Multinomial}\left(n; \frac{2 + \psi}{4}, \frac{1 - \psi}{4}, \frac{1 - \psi}{4}, \frac{\psi}{4}\right)$, where $\psi$ is a parameter with values in the interval (0, 1). The four probabilities listed in the definition of this multinomial distribution correspond directly to the values $v_1$, $v_2$, $v_3$ and $v_4$.

a. Suppose that $n_1$, $n_2$, $n_3$ and $n_4$ are the numbers of "realizations" of the values $v_1$, $v_2$, $v_3$ and $v_4$ among the $n$ "observations". To fix ideas, we take $n_1 = 125$, $n_2 = 18$, $n_3 = 20$ and $n_4 = 34$.

Compute the maximum likelihood estimate (MLE) of the parameter $\psi$.

105.

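Solution

Denoting $n = n_1 + n_2 + n_3 + n_4$, and knowing that $(n_1, n_2, n_3, n_4) \sim \text{Multinomial}\left(n; \frac{2 + \psi}{4}, \frac{1 - \psi}{4}, \frac{1 - \psi}{4}, \frac{\psi}{4}\right)$, it follows that the likelihood function is

$$L(\psi) \stackrel{\text{def.}}{=} P(D \mid \psi) = \frac{n!}{n_1!\, n_2!\, n_3!\, n_4!} \cdot \left(\frac{2 + \psi}{4}\right)^{n_1} \cdot \left(\frac{1 - \psi}{4}\right)^{n_2} \cdot \left(\frac{1 - \psi}{4}\right)^{n_3} \cdot \left(\frac{\psi}{4}\right)^{n_4},$$

while

$$\ell(\psi) \stackrel{\text{def.}}{=} \ln L(\psi) = \ln c + n_1 \ln(2 + \psi) + (n_2 + n_3) \ln(1 - \psi) + n_4 \ln \psi,$$

where $c$ is constant w.r.t. $\psi$.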

106.

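$$\begin{aligned}
\frac{\partial}{\partial \psi} \ell(\psi) &= \frac{n_1}{2 + \psi} - \frac{n_2 + n_3}{1 - \psi} + \frac{n_4}{\psi} \\
&= \frac{n_1 \psi (1 - \psi) - (n_2 + n_3) \psi (2 + \psi) + n_4 (2 + \psi)(1 - \psi)}{(2 + \psi)(1 - \psi)\psi} \\
&= \frac{n_1 \psi - n_1 \psi^2 - 2 n_2 \psi - n_2 \psi^2 - 2 n_3 \psi - n_3 \psi^2 + 2 n_4 - 2 n_4 \psi + n_4 \psi - n_4 \psi^2}{(2 + \psi)(1 - \psi)\psi} \\
&= \frac{-\psi^2 (n_1 + n_2 + n_3 + n_4) + \psi (n_1 - 2 n_2 - 2 n_3 - n_4) + 2 n_4}{(2 + \psi)(1 - \psi)\psi} \\
&= \frac{-n \psi^2 + \psi (n_1 - 2 n_2 - 2 n_3 - n_4) + 2 n_4}{(2 + \psi)(1 - \psi)\psi}
\end{aligned}$$

Note that the denominator $(2 + \psi)(1 - \psi)\psi$ is always positive for $\psi \in (0, 1)$.
Substituting 125, 18, 20 and 34 for $n_1$, $n_2$, $n_3$ and $n_4$, respectively, the numerator becomes $-197 \psi^2 + 15 \psi + 68$.
By analysing the sign of this expression, it is easy to see that for our data the function $\ell(\psi)$ reaches its maximum on the interval (0, 1) at

$$\hat{\psi} = \frac{15 + \sqrt{15^2 + 4 \cdot 197 \cdot 68}}{2 \cdot 197} = \frac{15 + \sqrt{53809}}{394} \approx \frac{15 + 231.968}{394} \approx 0.626821.$$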
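The positive root of the numerator can be computed directly, and checked by plugging it back into the derivative of $\ell(\psi)$. In this sketch the function names `psi_mle` and `score` are hypothetical; the quadratic is the one obtained above, rewritten as $n\psi^2 - b\psi - c = 0$ with $b = n_1 - 2n_2 - 2n_3 - n_4$ and $c = 2n_4$.

```python
import math

def psi_mle(n1, n2, n3, n4):
    """Positive root of -n*psi^2 + (n1 - 2*n2 - 2*n3 - n4)*psi + 2*n4 = 0."""
    n = n1 + n2 + n3 + n4
    b = n1 - 2 * n2 - 2 * n3 - n4
    c = 2 * n4
    # Equivalent form n*psi^2 - b*psi - c = 0; the '+' sign gives the positive root.
    return (b + math.sqrt(b * b + 4 * n * c)) / (2 * n)

def score(psi, n1, n2, n3, n4):
    """Derivative of the log-likelihood with respect to psi."""
    return n1 / (2 + psi) - (n2 + n3) / (1 - psi) + n4 / psi

psi_hat = psi_mle(125, 18, 20, 34)
print(f"psi_hat = {psi_hat:.6f}, score at psi_hat = {score(psi_hat, 125, 18, 20, 34):.2e}")
```

A score that is numerically zero at `psi_hat` confirms that the root is a stationary point of $\ell(\psi)$.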

107.

b. Now let the random variable $X' \sim \text{Multinomial}\left(n; \frac{1}{2}, \frac{\psi}{4}, \frac{1 - \psi}{4}, \frac{1 - \psi}{4}, \frac{\psi}{4}\right)$. The values taken by $X'$, corresponding to these five probabilities, are (in order) $v_1$, $v_1$, $v_2$, $v_3$ and $v_4$.

Observe that, relative to the multinomial distribution from part a, we have "decomposed" the probability $\frac{2 + \psi}{4}$ into two probabilities, $\frac{1}{2}$ and $\frac{\psi}{4}$, as if they corresponded to disjoint events.

Let $n_{11}$, $n_{12}$, $n_2$, $n_3$ and $n_4$ be the respective numbers of "realizations" of these values, but this time $n_{11}$ and $n_{12}$ are unobservable. Instead, we provide the sum $n_1 = n_{11} + n_{12}$ as "observable" data, together with the other observable data, $n_2$, $n_3$ and $n_4$.

Estimate the parameter $\psi$ using the EM algorithm. How does this estimate compare with the value obtained in part a?

Hint: Work out the formulas for the E step and the M step. Then implement the EM algorithm and run it (starting, for example, from the initial value 0.5 for $\psi$) until the value of this parameter no longer changes up to the sixth decimal place.

108.

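Solution: E-step

The likelihood function is now

$$\begin{aligned}
P(D \mid \psi) &= \frac{n!}{n_{11}!\, n_{12}!\, n_2!\, n_3!\, n_4!} \cdot \left(\frac{1}{2}\right)^{n_{11}} \cdot \left(\frac{\psi}{4}\right)^{n_{12}} \cdot \left(\frac{1 - \psi}{4}\right)^{n_2} \cdot \left(\frac{1 - \psi}{4}\right)^{n_3} \cdot \left(\frac{\psi}{4}\right)^{n_4} \\
&= \frac{n!}{n_{11}!\, n_{12}!\, n_2!\, n_3!\, n_4!} \cdot \left(\frac{1}{2}\right)^{n_{11}} \cdot \left(\frac{1 - \psi}{4}\right)^{n_2 + n_3} \cdot \left(\frac{\psi}{4}\right)^{n_{12} + n_4},
\end{aligned}$$

while the log-likelihood function can be written as

$$\ln P(D \mid \psi) = \ln d + (n_2 + n_3) \ln(1 - \psi) + (n_{12} + n_4) \ln \psi,$$

where $d$ is constant w.r.t. $\psi$.
Therefore, the "auxiliary" function will be

$$Q(\psi \mid \psi^{(t)}) = E_{n_{12} \mid n_1, n_2, n_3, n_4, \psi^{(t)}}[\ln P(D \mid \psi)] = \ln d + (n_2 + n_3) \ln(1 - \psi) + (\hat{n}_{12} + n_4) \ln \psi,$$

where

$$\hat{n}_{12} \stackrel{\text{not.}}{=} E[n_{12} \mid n_1, n_2, n_3, n_4, \psi^{(t)}] = \frac{\frac{\psi^{(t)}}{4}}{\frac{1}{2} + \frac{\psi^{(t)}}{4}} \cdot n_1 = \frac{\psi^{(t)}}{2 + \psi^{(t)}} \cdot n_1.$$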

109.

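M-step

$$\frac{\partial}{\partial \psi} E[\ln P(D \mid \psi)] = -\frac{n_2 + n_3}{1 - \psi} + \frac{\hat{n}_{12} + n_4}{\psi}$$

$$\Rightarrow \frac{\partial^2}{\partial \psi^2} E[\ln P(D \mid \psi)] = -\frac{n_2 + n_3}{(1 - \psi)^2} - \frac{\hat{n}_{12} + n_4}{\psi^2} \le 0$$

⇒ the auxiliary function is concave.

$$\begin{aligned}
\frac{\partial}{\partial \psi} E[\ln P(D \mid \psi)] = 0 &\Leftrightarrow (\hat{n}_{12} + n_4)(1 - \psi) = \psi (n_2 + n_3) \\
&\Leftrightarrow \psi (\hat{n}_{12} + n_2 + n_3 + n_4) = \hat{n}_{12} + n_4 \\
&\Leftrightarrow \psi^{(t+1)} = \frac{\hat{n}_{12} + n_4}{n - \hat{n}_{11}}
\end{aligned}$$

where

$$\hat{n}_{11} \stackrel{\text{not.}}{=} E[n_{11} \mid n_1, n_2, n_3, n_4, \psi^{(t)}] = n_1 - \hat{n}_{12} = \frac{2}{2 + \psi^{(t)}} \cdot n_1.$$

A Matlab implementation of this EM algorithm produces $\psi = 0.62682139$.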
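The Matlab code itself is not reproduced in these slides. As a stand-in, here is a short Python sketch of the same two updates; the function name `em_psi`, the tolerance and the iteration cap are assumptions, while the E step and M step are exactly the formulas for $\hat{n}_{12}$ and $\psi^{(t+1)}$ above (with $n - \hat{n}_{11} = \hat{n}_{12} + n_2 + n_3 + n_4$).

```python
def em_psi(n1, n2, n3, n4, psi=0.5, tol=1e-7, max_iter=1000):
    """EM for the single-parameter multinomial; returns (psi, iterations used)."""
    for iteration in range(1, max_iter + 1):
        # E step: expected part of n1 that falls in the cell with probability psi/4.
        n12 = psi / (2 + psi) * n1
        # M step: closed-form maximizer of the auxiliary function Q.
        new_psi = (n12 + n4) / (n12 + n2 + n3 + n4)
        if abs(new_psi - psi) < tol:
            return new_psi, iteration
        psi = new_psi
    return psi, max_iter

psi_em, used = em_psi(125, 18, 20, 34)
print(f"psi = {psi_em:.6f} after {used} iterations")
```

The printed estimate can be compared with the maximum likelihood estimate from part a.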

110.

Plots made by Sebastian Ciobanu

[Plot: the log-likelihood ln(p(X|ψ)) and the auxiliary functions Q(ψ|ψ(0)) and Q(ψ|ψ(1)) as functions of ψ ∈ [0, 1], with the iterates ψ(0), ψ(1) and ψ(2) marked on the horizontal axis.]

111.


a mixture of K categorical distributions,

applied to the problem of Word Sense Disambiguation,

i.e. identifying the semantic domains associated to words in a

text document

CMU, 2012 fall, Eric Xing, Aarti Singh, HW3, pr. 3

112.

The objective of this exercise is to derive the update equations of the EM algorithm for optimizing the latent variables [designating the semantic domains] involved in generating a text document.
Each word will be seen as a random variable $w$ that can take values $1, \ldots, V$ from the vocabulary of words. In fact, we will denote each $w$ by an array of $V$ components such that $w(i) = 1$ if $w$ takes the value of the $i$-th word in the vocabulary. Hence, $\sum_{i=1}^{V} w(i) = 1$.

Given a document containing words $w_j$, $j = 1, \ldots, N$, where $N$ is the length of the document, we will assume that these words are generated from a mixture of $K$ discrete topics:

$$P(w) = \sum_{s=1}^{K} \pi_s P(w \mid \beta_s) \quad \text{and} \quad P(w \mid \beta_s) = \prod_{i=1}^{V} \beta_s(i)^{w(i)},$$

where
$\pi_s$ denotes the prior [probability] for the latent topic variable $Z = s$,
$\beta_s \stackrel{\text{not.}}{=} (\beta_s(1), \ldots, \beta_s(i), \ldots, \beta_s(V))$, with $\beta_s(i) \ge 0$ for $i = 1, \ldots, V$ and $\sum_{i=1}^{V} \beta_s(i) = 1$ for each $s = 1, \ldots, K$, and
$\beta_s(i) \stackrel{\text{not.}}{=} P(w(i) = 1 \mid Z = s)$.

113.

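a. In the expectation step, for each word $w_j$, compute

$$q_j(s) \stackrel{\text{not.}}{=} P(Z_j = s \mid w_j; \theta),$$

the probability that $w_j$ belongs to each of the $K$ topics, where $\theta$ is the set of parameters of this mixture model.

Answer:

$$\begin{aligned}
q_j(s) \stackrel{\text{not.}}{=} P(Z_j = s \mid w_j; \theta) &\stackrel{\text{Bayes T.}}{=} \frac{P(w_j \mid Z_j = s; \theta)\, P(Z_j = s \mid \theta)}{P(w_j \mid \theta)} \\
&= \frac{\pi_s P(w_j \mid \beta_s)}{\sum_{s'=1}^{K} \pi_{s'} P(w_j \mid \beta_{s'})} = \frac{\pi_s \prod_{l=1}^{V} \beta_s(l)^{w_j(l)}}{\sum_{s'=1}^{K} \pi_{s'} \prod_{l=1}^{V} \beta_{s'}(l)^{w_j(l)}}
\end{aligned}$$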

114.

b. In the maximization step, compute the $\theta$ that maximizes the lower bound of the log-likelihood of the data

$$\ell(w; \theta) \stackrel{\text{not.}}{=} \log \prod_{j=1}^{N} P(w_j; \theta)$$

Hint: Summing over the latent topic variable, we can write $\ell(w; \theta)$ as

$$\ell(w; \theta) = \sum_{j=1}^{N} \log \sum_{s} P(w_j, s; \theta) = \sum_{j=1}^{N} \log \sum_{s} q_j(s) \frac{P(w_j, s; \theta)}{q_j(s)}$$

Further on, using Jensen's inequality (see pr. EM-2) we get:

$$\ell(w; \theta) \ge \sum_{j=1}^{N} \sum_{s} q_j(s) \log \frac{P(w_j, s; \theta)}{q_j(s)} = \sum_{j=1}^{N} \Big(\sum_{s} q_j(s) \log P(w_j, s; \theta) + H(q_j)\Big)$$

Hence compute $\theta$ as:

$$\arg\max_{\theta} \sum_{j=1}^{N} \sum_{s} q_j(s) \log P(w_j, s; \theta)$$

115.

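Answer:

$$\begin{aligned}
\theta &= \arg\max_{\theta} \sum_{j=1}^{N} \sum_{s=1}^{K} q_j(s) \log P(w_j, Z_j = s \mid \theta) \\
&= \arg\max_{\theta} \sum_{j=1}^{N} \sum_{s=1}^{K} q_j(s) \log \big(P(w_j \mid Z_j = s; \theta)\, P(Z_j = s \mid \theta)\big) \\
&= \arg\max_{\theta} \sum_{j=1}^{N} \sum_{s=1}^{K} q_j(s) \log \Big(\pi_s \prod_{l=1}^{V} \beta_s(l)^{w_j(l)}\Big) \\
&= \arg\max_{\theta} \sum_{j=1}^{N} \sum_{s=1}^{K} \Big[q_j(s) \log \pi_s + q_j(s) \sum_{l=1}^{V} \log \beta_s(l)^{w_j(l)}\Big] \\
&= \arg\max_{\theta} \sum_{j=1}^{N} \sum_{s=1}^{K} \Big[q_j(s) \log \pi_s + q_j(s) \sum_{l=1}^{V} w_j(l) \log \beta_s(l)\Big] \qquad (11)
\end{aligned}$$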

116.

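To optimize $\beta_s(l)$:

After eliminating from (11) the terms which are constant with respect to $\beta_s$ we get:

$$\sum_{j=1}^{N} q_j(s) \sum_{l=1}^{V} w_j(l) \log \beta_s(l)$$

We will use a Lagrangian to constrain $\beta_s$ to be a probability distribution:

$$L(\beta_s(l)) = \sum_{j=1}^{N} q_j(s) \sum_{l=1}^{V} w_j(l) \log \beta_s(l) + \lambda \Big(\sum_{l=1}^{V} \beta_s(l) - 1\Big)$$

Then, solving for $\beta_s(l)$:

$$\begin{aligned}
\frac{\partial}{\partial \beta_s(l)} L(\beta_s(l)) = 0 &\Leftrightarrow \sum_{j=1}^{N} \frac{q_j(s)\, w_j(l)}{\beta_s(l)} + \lambda = 0 \Leftrightarrow \frac{1}{\beta_s(l)} \sum_{j=1}^{N} q_j(s)\, w_j(l) + \lambda = 0 \\
&\Leftrightarrow \frac{1}{\beta_s(l)} = \frac{-\lambda}{\sum_{j=1}^{N} q_j(s)\, w_j(l)} \Leftrightarrow \beta_s(l) = \frac{\sum_{j=1}^{N} q_j(s)\, w_j(l)}{-\lambda} \qquad (12)
\end{aligned}$$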

117.

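Knowing that $\sum_{l=1}^{V} \beta_s(l) = 1$, we have:

$$\sum_{l=1}^{V} \frac{\sum_{j=1}^{N} q_j(s)\, w_j(l)}{-\lambda} = 1 \Leftrightarrow -\lambda = \sum_{l=1}^{V} \sum_{j=1}^{N} q_j(s)\, w_j(l)$$

Hence, substituting for $-\lambda$ in (12), we get:

$$\beta_s(l) = \frac{\sum_{j=1}^{N} q_j(s)\, w_j(l)}{\sum_{l'=1}^{V} \sum_{j=1}^{N} q_j(s)\, w_j(l')} = \frac{\sum_{j=1}^{N} q_j(s)\, w_j(l)}{\sum_{j=1}^{N} \sum_{l'=1}^{V} q_j(s)\, w_j(l')} = \frac{\sum_{j=1}^{N} q_j(s)\, w_j(l)}{\sum_{j=1}^{N} q_j(s) \underbrace{\sum_{l'=1}^{V} w_j(l')}_{1}} = \frac{\sum_{j=1}^{N} q_j(s)\, w_j(l)}{\sum_{j=1}^{N} q_j(s)}$$

Note: Intuitively the last expression can be interpreted as the portion [of word occurrences] that had $w(l) = 1$ among all words [in the given document] which are deemed to belong to cluster $s$.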

118.

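To optimize $\pi_s$, we proceed similarly:

We begin by removing from (11) the terms that are constant with respect to $\pi_s$, thus getting:

$$\sum_{j=1}^{N} q_j(s) \log \pi_s$$

So, using the Lagrangian with the constraint that $\sum_{s=1}^{K} \pi_s = 1$,

$$L(\pi_s) = \sum_{j=1}^{N} q_j(s) \log \pi_s + \lambda \Big(\sum_{s'=1}^{K} \pi_{s'} - 1\Big),$$

and solving for $\pi_s$, we have:

$$\frac{\partial}{\partial \pi_s} L(\pi_s) = 0 \Leftrightarrow \sum_{j=1}^{N} \frac{q_j(s)}{\pi_s} + \lambda = 0 \Leftrightarrow \frac{1}{\pi_s} \sum_{j=1}^{N} q_j(s) = -\lambda \Leftrightarrow \pi_s = \frac{\sum_{j=1}^{N} q_j(s)}{-\lambda} \qquad (13)$$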

119.

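Since $\sum_{s=1}^{K} \pi_s = 1$, it gives us:

$$\sum_{s=1}^{K} \frac{\sum_{j=1}^{N} q_j(s)}{-\lambda} = 1 \Leftrightarrow \frac{1}{-\lambda} \sum_{s=1}^{K} \sum_{j=1}^{N} q_j(s) = 1 \Leftrightarrow -\lambda = \sum_{s=1}^{K} \sum_{j=1}^{N} q_j(s)$$

Substituting for $-\lambda$ in (13), we get:

$$\pi_s = \frac{\sum_{j=1}^{N} q_j(s)}{\sum_{s'=1}^{K} \sum_{j=1}^{N} q_j(s')} = \frac{\sum_{j=1}^{N} q_j(s)}{\sum_{j=1}^{N} \underbrace{\sum_{s'=1}^{K} q_j(s')}_{1}} = \frac{\sum_{j=1}^{N} q_j(s)}{N}$$

Note: Intuitively the last expression can be interpreted as the portion [of word occurrences] that belongs to cluster $s$ among the total of $N$ words.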

120.

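To summarize:

E step:

$$q_j(s) = \frac{\pi_s \prod_{l=1}^{V} \beta_s(l)^{w_j(l)}}{\sum_{s'=1}^{K} \pi_{s'} \prod_{l=1}^{V} \beta_{s'}(l)^{w_j(l)}} \quad \text{for } j = 1, \ldots, N \text{ and } s = 1, \ldots, K$$

M step:

$$\beta_s(l) = \frac{\sum_{j=1}^{N} q_j(s)\, w_j(l)}{\sum_{j=1}^{N} q_j(s)} \quad \text{for } s = 1, \ldots, K \text{ and } l = 1, \ldots, V$$

$$\pi_s = \frac{\sum_{j=1}^{N} q_j(s)}{N} \quad \text{for } s = 1, \ldots, K$$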
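The summary above fits in one short function. The sketch below is illustrative only: the name `em_step`, the toy document and the initial parameters are assumptions. Each word $w_j$ is stored as its 0-based vocabulary index rather than as a one-hot array, so $\prod_l \beta_s(l)^{w_j(l)}$ becomes a single lookup `beta[s][w]`.

```python
def em_step(words, pi, beta):
    """One EM iteration for the topic mixture; words[j] is a vocabulary index."""
    num_topics, vocab_size = len(pi), len(beta[0])
    # E step: q[j][s] = P(Z_j = s | w_j; theta).
    q = []
    for w in words:
        joint = [pi[s] * beta[s][w] for s in range(num_topics)]
        total = sum(joint)
        q.append([v / total for v in joint])
    # M step: pi_s is the average responsibility, beta_s(l) a normalized soft count.
    new_pi = [sum(qj[s] for qj in q) / len(words) for s in range(num_topics)]
    new_beta = []
    for s in range(num_topics):
        weight = sum(qj[s] for qj in q)     # sum_j q_j(s)
        row = [0.0] * vocab_size
        for w, qj in zip(words, q):
            row[w] += qj[s] / weight
        new_beta.append(row)
    return new_pi, new_beta, q

words = [0, 0, 1, 2, 2, 2]                  # toy document over a vocabulary of size 3
pi = [0.5, 0.5]
beta = [[0.6, 0.3, 0.1],
        [0.1, 0.3, 0.6]]
new_pi, new_beta, q = em_step(words, pi, beta)
print(new_pi)
print(new_beta)
```

Calling `em_step` repeatedly, feeding `new_pi` and `new_beta` back in, gives the full EM loop; a stopping rule is left out of this sketch.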

121.

The EM algorithm for

solving mixtures of Poisson distributions

University of Utah, 2008 fall, Hal Daumé III, HW9, pr. 1

122.

The Poisson distribution is a distribution over non-negative count values; for a count $x$ with parameter $\lambda$, the Poisson has the form

$$\text{Poisson}(x \mid \lambda) = \frac{1}{e^{\lambda}} \cdot \frac{\lambda^x}{x!}.$$

It can be proven that the maximum likelihood estimate of $\lambda$ given a sequence of counts $x_1, \ldots, x_n$ is simply $\frac{1}{n} \sum_{i=1}^{n} x_i$, the mean of the counts.

[Plot: the Poisson probability mass function p(x) for x from 0 to 20, for λ = 1, λ = 4 and λ = 10.]

Let's consider a generalization of this: the Poisson mixture model. This is actually used in web server monitoring. The number of accesses to a web server in a minute typically follows a Poisson distribution.

123.

Suppose we have $n$ web servers we are monitoring and we monitor each for $M$ minutes. Thus, we have $n \times M$ counts; call $x_{i,m}$ the number of hits to web server $i$ in minute $m$. Our goal is to cluster the web servers according to their hit frequency.

Using the EM algorithm, construct a Poisson mixture model for this problem and compute the expectation and maximization steps for this model.

124.

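Hint: Suppose we want $K$ clusters; let $z_i$ be the latent variable telling us which cluster the web server $i$ belongs to (from 1 to $K$). Let $\lambda_k$ denote the parameter for the Poisson distribution for cluster $k$, and $\pi_k$ the prior probability of cluster $k$. Then, the complete data likelihood is:

$$L(\bar{\lambda}, \bar{\pi}) \stackrel{\text{def.}}{=} p(\bar{x}, \bar{z} \mid \bar{\lambda}, \bar{\pi}) \stackrel{\text{i.i.d.}}{=} \prod_{i=1}^{n} p(x_i, z_i \mid \bar{\lambda}, \bar{\pi}) = \prod_{i=1}^{n} p(x_i \mid z_i, \bar{\lambda}, \bar{\pi}) \cdot \underbrace{p(z_i \mid \bar{\lambda}, \bar{\pi})}_{\pi_k \text{ for } z_i = k}$$

$$= \prod_{i=1}^{n} \prod_{k=1}^{K} \left[ \pi_k \prod_{m=1}^{M} \text{Poisson}(x_{i,m} \mid \lambda_k) \right]^{1_{\{z_i = k\}}}$$

where

$\bar{x} \stackrel{\text{not.}}{=} (x_1, \ldots, x_n)$, with $x_i \stackrel{\text{not.}}{=} (x_{i,1}, \ldots, x_{i,M})$ for $i = 1, \ldots, n$,

$\bar{z} \stackrel{\text{not.}}{=} (z_1, \ldots, z_n)$, $\bar{\pi} \stackrel{\text{not.}}{=} (\pi_1, \ldots, \pi_K)$, $\bar{\lambda} \stackrel{\text{not.}}{=} (\lambda_1, \ldots, \lambda_K)$, and

$1_{\{z_i = k\}}$ is one if $z_i = k$ and zero otherwise.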

125.

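Solution

The likelihood function:

$$L(\bar{\lambda}, \bar{\pi}) = \prod_{i=1}^{n} \prod_{k=1}^{K} \left[ \pi_k \prod_{m=1}^{M} \frac{1}{e^{\lambda_k}} \cdot \frac{\lambda_k^{x_{i,m}}}{x_{i,m}!} \right]^{1_{\{z_i = k\}}}$$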

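The log-likelihood function:

$$\ell(\bar{\lambda}, \bar{\pi}) \stackrel{\text{def.}}{=} \ln L(\bar{\lambda}, \bar{\pi}) = \sum_{i=1}^{n} \sum_{k=1}^{K} 1_{\{z_i = k\}} \left[ \ln \pi_k + \sum_{m=1}^{M} \ln \left( \frac{1}{e^{\lambda_k}} \cdot \frac{\lambda_k^{x_{i,m}}}{x_{i,m}!} \right) \right]$$

$$= \sum_{i=1}^{n} \sum_{k=1}^{K} 1_{\{z_i = k\}} \left[ \ln \pi_k + \sum_{m=1}^{M} \left( -\lambda_k + x_{i,m} \ln \lambda_k - \ln (x_{i,m}!) \right) \right]$$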

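The auxiliary function:

$$Q(\bar{\lambda}, \bar{\pi} \mid \bar{\lambda}^{(t)}, \bar{\pi}^{(t)}) \stackrel{\text{def.}}{=} E_{P(z_i = k \mid x_i, \bar{\lambda}^{(t)}, \bar{\pi}^{(t)})} [\ell(\bar{\lambda}, \bar{\pi})]$$

$$= \sum_{i=1}^{n} \sum_{k=1}^{K} P(z_i = k \mid x_i, \bar{\lambda}^{(t)}, \bar{\pi}^{(t)}) \left[ \ln \pi_k + \sum_{m=1}^{M} \left( -\lambda_k + x_{i,m} \ln \lambda_k - \ln (x_{i,m}!) \right) \right]$$

In what follows, $p_{ik}^{(t)}$ denotes the posterior probability $P(z_i = k \mid x_i, \bar{\lambda}^{(t)}, \bar{\pi}^{(t)})$.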
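E-step: the posterior probabilities that appear in $Q$ are not expanded in the text; one way to write them explicitly, applying Bayes' rule to the mixture model from the hint, is sketched below. The index $k'$ is introduced here only as a summation variable.

```latex
P(z_i = k \mid x_i, \bar{\lambda}^{(t)}, \bar{\pi}^{(t)})
  = \frac{\pi_k^{(t)} \prod_{m=1}^{M} \mathrm{Poisson}(x_{i,m} \mid \lambda_k^{(t)})}
         {\sum_{k'=1}^{K} \pi_{k'}^{(t)} \prod_{m=1}^{M} \mathrm{Poisson}(x_{i,m} \mid \lambda_{k'}^{(t)})}
  = \frac{\pi_k^{(t)} \, e^{-M \lambda_k^{(t)}} \, (\lambda_k^{(t)})^{\sum_{m=1}^{M} x_{i,m}}}
         {\sum_{k'=1}^{K} \pi_{k'}^{(t)} \, e^{-M \lambda_{k'}^{(t)}} \, (\lambda_{k'}^{(t)})^{\sum_{m=1}^{M} x_{i,m}}}
```

The factorials $x_{i,m}!$ do not depend on $k$, so they cancel between numerator and denominator.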

126.

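M-step:

Since $\sum_{k=1}^{K} \pi_k = 1$, we have to introduce the Lagrange multiplier $\lambda$ (not to be confused with the Poisson parameters $\lambda_k$):

$$\frac{\partial}{\partial \pi_k} \left[ Q(\bar{\lambda}, \bar{\pi} \mid \bar{\lambda}^{(t)}, \bar{\pi}^{(t)}) + \lambda \left( 1 - \sum_{k=1}^{K} \pi_k \right) \right] = 0 \;\Leftrightarrow\; \sum_{i=1}^{n} \frac{p_{ik}^{(t)}}{\pi_k} - \lambda = 0 \;\Leftrightarrow\; \pi_k^{(t+1)} = \frac{1}{\lambda} \sum_{i=1}^{n} p_{ik}^{(t)}$$

It is easy to see that $\pi_k^{(t+1)}$ is indeed a maximum point for $Q$ w.r.t. the parameter $\pi_k$.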

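Imposing the constraint $\sum_{k=1}^{K} \pi_k^{(t+1)} = 1$ leads to:

$$\frac{1}{\lambda} \sum_{k=1}^{K} \sum_{i=1}^{n} p_{ik}^{(t)} = 1 \;\Leftrightarrow\; \frac{1}{\lambda} \sum_{i=1}^{n} \underbrace{\sum_{k=1}^{K} p_{ik}^{(t)}}_{1} = 1 \;\Leftrightarrow\; \frac{n}{\lambda} = 1 \;\Leftrightarrow\; \lambda = n.$$

Therefore, $\pi_k^{(t+1)} = \frac{1}{n} \sum_{i=1}^{n} p_{ik}^{(t)}$, which is always a non-negative quantity.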
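The claim that the stationary point in $\pi_k$ is a maximum is stated without proof. One way to check it, sketched here on the assumption that $\sum_{i=1}^{n} p_{ik}^{(t)} > 0$ for every $k$, is to look at the second derivatives of the Lagrangian ($k'$ is just another cluster index):

```latex
\frac{\partial^2}{\partial \pi_k^2}
  \left[ Q + \lambda \Bigl( 1 - \sum_{k'=1}^{K} \pi_{k'} \Bigr) \right]
  = -\frac{1}{\pi_k^2} \sum_{i=1}^{n} p_{ik}^{(t)} < 0,
\qquad
\frac{\partial^2}{\partial \pi_k \, \partial \pi_{k'}}
  \left[ Q + \lambda \Bigl( 1 - \sum_{k''=1}^{K} \pi_{k''} \Bigr) \right]
  = 0 \quad \text{for } k \neq k'.
```

The Hessian is diagonal with negative entries, so $Q$ is strictly concave in $\bar{\pi}$; since the constraint is linear, the stationary point is the constrained maximum.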

128.

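Now, regarding the parameters $\lambda_k$:

$$\frac{\partial}{\partial \lambda_k} Q(\bar{\lambda}, \bar{\pi} \mid \bar{\lambda}^{(t)}, \bar{\pi}^{(t)}) = 0 \;\Leftrightarrow\; \sum_{i=1}^{n} p_{ik}^{(t)} \left( \sum_{m=1}^{M} \left( -1 + x_{i,m} \cdot \frac{1}{\lambda_k} \right) \right) = 0$$

$$\Leftrightarrow\; \sum_{i=1}^{n} p_{ik}^{(t)} \left( -M + \frac{1}{\lambda_k} \sum_{m=1}^{M} x_{i,m} \right) = 0 \;\Leftrightarrow\; \frac{1}{\lambda_k} \sum_{i=1}^{n} p_{ik}^{(t)} \left( \sum_{m=1}^{M} x_{i,m} \right) = M \sum_{i=1}^{n} p_{ik}^{(t)}$$

$$\Leftrightarrow\; \lambda_k^{(t+1)} = \frac{\sum_{i=1}^{n} p_{ik}^{(t)} \left( \sum_{m=1}^{M} x_{i,m} \right)}{M \sum_{i=1}^{n} p_{ik}^{(t)}}$$

It can be easily shown that for each $k \in \{1, \ldots, K\}$, the value $\lambda_k^{(t+1)}$ is positive (provided that not all the counts $x_{i,m}$ are zero), and it maximizes $Q$ w.r.t. the parameter $\lambda_k$.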
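The update rules derived above for $\pi_k$ and $\lambda_k$ can be put together into a small EM loop. The following is a minimal sketch, not a reference implementation: the function names, the log-space computation of the posteriors, the fixed number of iterations, the toy data and the initial values are all assumptions made for illustration.

```python
import math


def e_step(x, pi, lam):
    """Posterior p[i][k] = P(z_i = k | x_i, lam, pi), computed in log space."""
    p = []
    for xi in x:
        # ln(x_im!) is the same for every k and cancels in the normalization.
        logs = [math.log(pi[k]) + sum(-lam[k] + xm * math.log(lam[k]) for xm in xi)
                for k in range(len(pi))]
        top = max(logs)
        weights = [math.exp(v - top) for v in logs]
        total = sum(weights)
        p.append([wgt / total for wgt in weights])
    return p


def m_step(x, p):
    """Closed-form updates: pi_k = mass_k / n, lam_k = weighted mean hit rate."""
    n, M, K = len(x), len(x[0]), len(p[0])
    mass = [sum(p[i][k] for i in range(n)) for k in range(K)]
    pi = [mass[k] / n for k in range(K)]
    lam = [sum(p[i][k] * sum(x[i]) for i in range(n)) / (M * mass[k])
           for k in range(K)]
    return pi, lam


def fit(x, pi, lam, iterations=50):
    """Alternate E and M steps for a fixed (assumed) number of iterations."""
    for _ in range(iterations):
        pi, lam = m_step(x, e_step(x, pi, lam))
    return pi, lam


# Hypothetical data: n = 4 servers observed for M = 3 minutes each.
x = [[1, 0, 2], [2, 1, 1], [9, 11, 10], [12, 8, 10]]
pi, lam = fit(x, pi=[0.5, 0.5], lam=[1.0, 5.0])
```

On this toy data the two low-traffic servers and the two high-traffic servers end up in separate clusters, with estimated rates close to 7/6 and 10 hits per minute.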

129.


a Gamma Mixture Model

G. Vegas-Sánchez-Ferrero, M. Martín-Fernández, J. Miguel Sanches

A Gamma Mixture Model for IVUS Imaging, 2014

[ adapted by Liviu Ciortuz ]

130.

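The Gamma Mixture Model

$$p(x \mid \theta) = \sum_{k=1}^{K} \pi_k \, \text{Gamma}(x \mid \theta_k), \quad \text{where } \text{Gamma}(x \mid r_k, \beta_k) = \frac{1}{\beta_k^{r_k} \, \Gamma(r_k)} \, x^{r_k - 1} e^{-\frac{x}{\beta_k}},$$

with $\theta_k = (r_k, \beta_k)$ for $k = 1, \ldots, K$, $r \stackrel{\text{not.}}{=} (r_1, \ldots, r_K)$, $\beta \stackrel{\text{not.}}{=} (\beta_1, \ldots, \beta_K)$, and $\pi \stackrel{\text{not.}}{=} (\pi_1, \ldots, \pi_K)$,

$\pi_k > 0$ for $k = 1, \ldots, K$ and $\sum_{k=1}^{K} \pi_k = 1$,

$\theta = (r, \beta, \pi)$.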

Consider $x_1, \ldots, x_n \in \mathbb{R}_+$, each $x_i$ being generated by one component of the above mixture, denoted by $z_i \in \{1, \ldots, K\}$.

Find the maximum likelihood estimate of the parameters $\theta$ by using the EM algorithm.
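To make the density in the problem statement concrete, the sketch below evaluates the Gamma mixture at a point. The function names and the parameter values are hypothetical; `beta` is used as the scale parameter, matching the formula above.

```python
import math


def gamma_pdf(x, r, beta):
    """Gamma(x | r, beta) = x^(r-1) * exp(-x / beta) / (beta^r * Gamma(r)), x > 0."""
    return x ** (r - 1) * math.exp(-x / beta) / (beta ** r * math.gamma(r))


def mixture_pdf(x, pi, r, beta):
    """p(x | theta) = sum_k pi_k * Gamma(x | r_k, beta_k)."""
    return sum(pi[k] * gamma_pdf(x, r[k], beta[k]) for k in range(len(pi)))


# Hypothetical parameters for K = 2 components.
pi = [0.3, 0.7]
r = [1.0, 2.0]
beta = [2.0, 1.0]
density = mixture_pdf(1.0, pi, r, beta)
```

With $r_k = 1$ the Gamma component reduces to an exponential density with mean $\beta_k$, which gives a simple sanity check for `gamma_pdf`.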

Solution

• The likelihood of the “complete” data (x, z), with x n
