Maximum Likelihood and Parameter Estimation for HMM
Lecture #8
Background Readings: Chapter 3.3 in Biological Sequence Analysis, Durbin et al., 2001. © Shlomo Moran, following Danny Geiger and Nir Friedman
2
Parameter Estimation for HMM
An HMM model is defined by the parameters: mkl and ek(b), for all states k,l and all symbols b. Let θ denote the collection of these parameters:
[Figure: an HMM with hidden states s_1, s_2, ..., s_{L-1}, s_L and emitted symbols X_1, X_2, ..., X_{L-1}, X_L; each transition s_{i-1} → s_i carries probability m_kl and each emission s_i → X_i carries probability e_k(b).]

$$\theta = \{\, m_{kl} : k, l \text{ are states} \,\} \cup \{\, e_k(b) : k \text{ is a state},\ b \text{ is a letter} \,\}$$
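A minimal way to hold these parameters in code is a pair of nested dictionaries. The sketch below is purely illustrative and not part of the lecture: the two states "F"/"L" and symbols "H"/"T" (a fair/loaded coin) are hypothetical placeholders.

```python
# Hypothetical container for theta: transition probabilities m[k][l] and
# emission probabilities e[k][b], for an illustrative two-state HMM.
theta = {
    "m": {  # m[k][l] = probability of moving from state k to state l
        "F": {"F": 0.95, "L": 0.05},
        "L": {"F": 0.10, "L": 0.90},
    },
    "e": {  # e[k][b] = probability that state k emits symbol b
        "F": {"H": 0.5, "T": 0.5},   # "fair" state
        "L": {"H": 0.9, "T": 0.1},   # "loaded" state
    },
}

# Each transition row and each emission table must sum to 1.
for table in theta.values():
    for k, row in table.items():
        assert abs(sum(row.values()) - 1.0) < 1e-9
```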
3
Parameter Estimation for HMM
To determine the values of the parameters in θ, we use a training set {x^1,...,x^n}, where each x^j is a sequence assumed to fit the model. Given the parameters θ, each sequence x^j has an assigned probability p(x^j|θ).
4
Maximum Likelihood Parameter Estimation for HMM
The elements of the training set {x^1,...,x^n} are assumed to be independent, so p(x^1,..., x^n|θ) = ∏_j p(x^j|θ). ML parameter estimation looks for the θ which maximizes the above.
The exact method for finding or approximating this θ depends on the nature of the training set used.
5
Data for HMM
The training set is characterized by:
1. For each x^j, whether the states s^j_i are known (the symbols x^j_i are usually known).
2. Its size (number of sequences).
6
Case 1: ML when Sequences are fully known
We know the complete structure of each sequence in the training set {x1,...,xn}. We wish to estimate mkl and ek(b) for all pairs of states k, l and symbols b.
By the ML method, we look for parameters θ* which maximize the probability of the sample set: p(x^1,...,x^n|θ*) = max_θ p(x^1,...,x^n|θ).
7
Case 1: Sequences are fully known
For each x^j we have:
$$p(x^j\mid\theta)=\prod_{i=1}^{L} m_{s_{i-1}s_i}\, e_{s_i}(x_i^j)$$
Let $M_{kl}^j = |\{\, i : s_{i-1}^j=k,\ s_i^j=l \,\}|$ (the superscript $j$ indicates the sequence $x^j$), and let $E_k^j(b) = |\{\, i : s_i^j=k,\ x_i^j=b \,\}|$. Then:
$$\mathrm{prob}(x^j\mid\theta)=\prod_{(k,l)} m_{kl}^{M_{kl}^j}\ \prod_{(k,b)}\big[e_k(b)\big]^{E_k^j(b)}$$
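As a sanity check on this identity, here is a small Python sketch (not from the lecture; the two-state model, the begin state "B" playing the role of s_0, and the particular sequences are all hypothetical) that computes the probability of a fully observed sequence both as the product over positions and via the counts M^j_kl and E^j_k(b):

```python
from collections import Counter
import math

# Hypothetical two-state HMM (states "F", "L"; symbols "H", "T"); "B" acts as the begin state s_0.
m = {"B": {"F": 0.5, "L": 0.5},
     "F": {"F": 0.95, "L": 0.05},
     "L": {"F": 0.10, "L": 0.90}}
e = {"F": {"H": 0.5, "T": 0.5},
     "L": {"H": 0.9, "T": 0.1}}

s = ["F", "F", "L", "L", "F"]   # a known state path s_1..s_L
x = ["H", "T", "H", "H", "T"]   # the observed symbols x_1..x_L

# Direct form: product over positions of m_{s_{i-1} s_i} * e_{s_i}(x_i).
p_direct, prev = 1.0, "B"
for k, b in zip(s, x):
    p_direct *= m[prev][k] * e[k][b]
    prev = k

# Count form: prod_{(k,l)} m_kl^{M_kl} * prod_{(k,b)} e_k(b)^{E_k(b)}.
M = Counter(zip(["B"] + s[:-1], s))   # M_kl = #{i : s_{i-1}=k, s_i=l}
E = Counter(zip(s, x))                # E_k(b) = #{i : s_i=k, x_i=b}
p_counts = 1.0
for (k, l), c in M.items():
    p_counts *= m[k][l] ** c
for (k, b), c in E.items():
    p_counts *= e[k][b] ** c

assert math.isclose(p_direct, p_counts)
print(p_direct)
```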
8
Case 1 (cont)
By the independence of the x^j's, p(x^1,...,x^n|θ) = ∏_j p(x^j|θ).
Thus, if M_kl = #(transitions from k to l) in the training set, and E_k(b) = #(emissions of symbol b from state k) in the training set, we have:
$$\mathrm{prob}(x^1,\ldots,x^n\mid\theta)=\prod_{(k,l)} m_{kl}^{M_{kl}}\ \prod_{(k,b)}\big[e_k(b)\big]^{E_k(b)}$$
9
Case 1 (cont)
So we need to find m_kl's and e_k(b)'s which maximize:
$$\prod_{(k,l)} m_{kl}^{M_{kl}}\ \prod_{(k,b)}\big[e_k(b)\big]^{E_k(b)}$$
Subject to: for all states k,
$$\sum_{l} m_{kl}=1 \quad\text{and}\quad \sum_{b} e_k(b)=1 \qquad [\,m_{kl},\ e_k(b)\ge 0\,]$$
10
Case 1 (cont)
Rewriting, we need to maximize:
$$F=\prod_{(k,l)} m_{kl}^{M_{kl}}\ \prod_{(k,b)}\big[e_k(b)\big]^{E_k(b)} = \prod_{k}\Big[\prod_{l} m_{kl}^{M_{kl}}\Big]\times\prod_{k}\Big[\prod_{b}\big[e_k(b)\big]^{E_k(b)}\Big]$$
Subject to: for all k, $\sum_{l} m_{kl}=1$ and $\sum_{b} e_k(b)=1$.
11
Case 1 (cont)
If we maximize, for each k,
$$\prod_{l} m_{kl}^{M_{kl}} \quad\text{s.t.}\quad \sum_{l} m_{kl}=1$$
and also
$$\prod_{b}\big[e_k(b)\big]^{E_k(b)} \quad\text{s.t.}\quad \sum_{b} e_k(b)=1,$$
then we also maximize F. Each of the above is a simpler ML problem, similar to ML parameter estimation for a die, treated next.
12
ML parameter estimation for a die
Let X be a random variable with 6 values x_1,…,x_6 denoting the six outcomes of a (possibly unfair) die. Here the parameters are θ = {q_1,q_2,q_3,q_4,q_5,q_6}, with ∑_i q_i = 1. Assume that the data is one sequence:
Data = (x_6,x_1,x_1,x_3,x_2,x_2,x_3,x_4,x_5,x_2,x_6)
So we have to maximize
$$\mathrm{prob}(Data\mid\theta)= q_1^{2}\cdot q_2^{3}\cdot q_3^{2}\cdot q_4\cdot q_5\cdot q_6^{2}$$
subject to: q_1+q_2+q_3+q_4+q_5+q_6 = 1 [and q_i ≥ 0], i.e.,
$$\mathrm{prob}(Data\mid\theta)= q_1^{2}\cdot q_2^{3}\cdot q_3^{2}\cdot q_4\cdot q_5\cdot\Big(1-\sum_{i=1}^{5} q_i\Big)^{2}$$
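In code this is just counting and multiplying. A small sketch (the candidate q values below are arbitrary illustrative numbers, not from the lecture):

```python
from collections import Counter

data = [6, 1, 1, 3, 2, 2, 3, 4, 5, 2, 6]        # the Data sequence above
n = Counter(data)                                # n_i = number of times side i appears
# n == Counter({2: 3, 6: 2, 1: 2, 3: 2, 4: 1, 5: 1})

q = {1: 0.1, 2: 0.3, 3: 0.2, 4: 0.1, 5: 0.1, 6: 0.2}   # a candidate theta; sums to 1

likelihood = 1.0
for side, count in n.items():
    likelihood *= q[side] ** count               # q_1^2 * q_2^3 * q_3^2 * q_4 * q_5 * q_6^2
print(likelihood)
```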
13
Side comment: Sufficient Statistics
To compute the probability of the data in the die example we only need to record the number of times n_i the die fell on side i (namely, n_1, n_2,…,n_6). We do not need to recall the entire sequence of outcomes.
{n_i | i = 1,…,6} is called a sufficient statistics for the multinomial sampling.
$$\mathrm{prob}(Data\mid\theta)= q_1^{n_1}\, q_2^{n_2}\, q_3^{n_3}\, q_4^{n_4}\, q_5^{n_5}\Big(1-\sum_{i=1}^{5} q_i\Big)^{n_6}$$
14
Sufficient Statistics
A sufficient statistics is a function of the data that summarizes the relevant information for the likelihood. Formally, s(Data) is a sufficient statistics if for any two datasets Data and Data′:
s(Data) = s(Data′) ⇒ P(Data|θ) = P(Data′|θ)
[Figure: many different datasets map to the same statistics.]
Exercise: Define a concise "sufficient statistics" for the HMM model when the sequences are fully known.
15
Maximum Likelihood Estimate
By the ML approach, we look for parameters that maximize the probability of the data (i.e., the likelihood function). We will find the parameters by considering the corresponding log-likelihood function:
$$\log\big(\mathrm{prob}(Data\mid\theta)\big) = \log\Big[q_1^{n_1}\, q_2^{n_2}\, q_3^{n_3}\, q_4^{n_4}\, q_5^{n_5}\Big(1-\sum_{i=1}^{5} q_i\Big)^{n_6}\Big] = \sum_{i=1}^{5} n_i\log q_i + n_6\log\Big(1-\sum_{i=1}^{5} q_i\Big)$$
A necessary condition for a (local) maximum is:
$$\frac{\partial}{\partial q_j}\log\big(\mathrm{prob}(Data\mid\theta)\big) = \frac{n_j}{q_j} - \frac{n_6}{1-\sum_{i=1}^{5} q_i} = 0$$
16
Finding the Maximum
Rearranging terms (with $q_6 = 1-\sum_{i=1}^{5} q_i$):
$$\frac{n_j}{q_j} = \frac{n_6}{1-\sum_{i=1}^{5} q_i} = \frac{n_6}{q_6}$$
Divide the jth equation by the ith equation:
$$\frac{q_j}{q_i} = \frac{n_j}{n_i}, \qquad\text{i.e.}\qquad q_j = \frac{n_j}{n_i}\, q_i$$
Sum from j = 1 to 6:
$$1=\sum_{j=1}^{6} q_j = \sum_{j=1}^{6}\frac{n_j}{n_i}\, q_i = \frac{n}{n_i}\, q_i$$
The only local (and hence global) maximum is given by the relative frequencies:
$$q_i = \frac{n_i}{n}, \qquad i=1,\ldots,6$$
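In code, the ML estimate is just the vector of normalized counts. A quick sketch (the perturbation values are hypothetical) checking that moving any q_i away from n_i/n, while keeping the sum at 1, can only lower the likelihood:

```python
from collections import Counter

data = [6, 1, 1, 3, 2, 2, 3, 4, 5, 2, 6]
n = Counter(data)
N = len(data)

q_mle = {side: n[side] / N for side in range(1, 7)}     # q_i = n_i / n

def likelihood(q):
    result = 1.0
    for side, count in n.items():
        result *= q[side] ** count
    return result

# Shift a little probability mass from side 2 to side 1 (the sum stays 1):
q_other = dict(q_mle)
q_other[1] += 0.05
q_other[2] -= 0.05

assert likelihood(q_other) < likelihood(q_mle)
print(q_mle, likelihood(q_mle))
```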
17
Generalization to any number k of outcomes
Let X be a random variable with k values x_1,…,x_k denoting the k outcomes of independent and identically distributed experiments, with parameters θ = {q_1,q_2,...,q_k} (q_i is the probability of x_i). Again, the data is one sequence of length n, in which x_i appears n_i times.
Then we have to maximize
$$\mathrm{prob}(Data\mid\theta)= q_1^{n_1}\, q_2^{n_2}\cdots q_k^{n_k}, \qquad (n_1+\ldots+n_k = n)$$
subject to: q_1+q_2+ ... + q_k = 1, i.e.,
$$\mathrm{prob}(Data\mid\theta)= q_1^{n_1}\cdots q_{k-1}^{n_{k-1}}\Big(1-\sum_{i=1}^{k-1} q_i\Big)^{n_k}$$
18
Generalization for k outcomes (cont)
By a treatment identical to the die case, the maximum is obtained when, for all i:
$$\frac{n_i}{q_i} = \frac{n_k}{q_k}$$
Hence the MLE is given by the relative frequencies:
$$q_i = \frac{n_i}{n}, \qquad i=1,\ldots,k$$
19
Fractional Exponents
Some models allow the n_i's to be fractions (e.g., if we are uncertain of a die outcome, we may count it as "6" with 20% confidence and "5" with 80%). We still have, for θ = (q_1,..,q_k):
$$\mathrm{prob}(Data\mid\theta)= q_1^{n_1}\, q_2^{n_2}\cdots q_k^{n_k}, \qquad (n_1+\ldots+n_k = n)$$
And the same analysis yields:
$$q_i = \frac{n_i}{n}, \qquad i=1,\ldots,k$$
20
Summary: for an experiment with k outcomes x_1,..,x_k, the ML problem is the following: Given a statistics (n_1, n_2,..,n_k) of positive real numbers (n_i is the number of observations of x_i), find parameters θ = {q_1,q_2,...,q_k} (q_i is the probability of x_i) which maximize the likelihood
$$\mathrm{prob}(Data\mid\theta)= q_1^{n_1}\, q_2^{n_2}\cdots q_k^{n_k}, \qquad (n_1+\ldots+n_k = n)$$
subject to: q_1+q_2+ ... + q_k = 1.
The unique θ = (q_1,..,q_k) which maximizes the likelihood is given by
$$q_i = \frac{n_i}{n}, \qquad i=1,\ldots,k$$
21
Frequency Vector of Statistics (for single random variable)
The frequency vector of a statistics (n_1, n_2,..,n_k), where n_1+...+n_k = n, is the statistics (p_1,..,p_k) obtained by letting p_i = n_i/n (i = 1,..,k).
Consider two statistics of a random variable with 4 outcomes: n = 125 with statistics (10,25,80,10), and n = 250 with statistics (20,50,160,20). Both statistics give the same frequency vector (0.08, 0.2, 0.64, 0.08), and hence are optimized by the same parameters.
We can treat a frequency vector (p1,..,pk) as any other statistics, and find the parameters which maximize its likelihood.
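The example above can be checked directly; a minimal sketch:

```python
def frequency_vector(counts):
    n = sum(counts)
    return [c / n for c in counts]

stat_a = [10, 25, 80, 10]    # n = 125
stat_b = [20, 50, 160, 20]   # n = 250

# Both statistics yield the frequency vector (0.08, 0.2, 0.64, 0.08),
# hence both are maximized by the same parameters.
assert frequency_vector(stat_a) == frequency_vector(stat_b)
print(frequency_vector(stat_a))
```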
22
ML for frequency vectors
For single-die experiments, the likelihood of any data which has frequency vector P is maximized by the same parameters which maximize the likelihood of the "statistics" P (prove!). The "ML problem for frequency vectors" is the same as the original problem, but we restrict the statistics to frequency vectors:
Given a frequency vector (p_1,..,p_k) of observed Data, find parameters θ = (q_1,q_2,..,q_k) which maximize the likelihood
$$\mathrm{prob}(Data\mid\theta)= q_1^{p_1}\cdot q_2^{p_2}\cdots q_k^{p_k}$$
23
ML for frequency vectors
Since a "frequency vector" can be viewed as a probability vector, our solution for the "single die" ML problem implies the following:
Let P = (p_1,..,p_k) be a probability k-vector. Then among all probability k-vectors Q = (q_1,..,q_k), the likelihood
$$\prod_{i=1}^{k} q_i^{p_i}$$
is maximized iff Q = P. The value of this maximum likelihood is
$$\prod_{i=1}^{k} p_i^{p_i}$$
24
Side Trip: Maximum Likelihood and Entropy
The logarithm of the maximum likelihood of the frequency vector (p_1,p_2,..,p_k) is given by
$$\log\big(p_1^{p_1}\, p_2^{p_2}\cdots p_k^{p_k}\big) = \sum_{i=1}^{k} p_i\log p_i.$$
This is a negative number, and its absolute value
$$-\sum_{i=1}^{k} p_i\log p_i$$
is known as the entropy of (p_1,..,p_k).
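A direct transcription in Python (natural logarithms; the frequency vector below is just an illustrative value):

```python
import math

def entropy(p):
    """Entropy of a probability vector p: -sum_i p_i * log(p_i)."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

p = [0.08, 0.2, 0.64, 0.08]
log_max_likelihood = sum(pi * math.log(pi) for pi in p)   # log of prod_i p_i^{p_i}, a negative number
assert math.isclose(entropy(p), -log_max_likelihood)      # the entropy is its absolute value
print(log_max_likelihood, entropy(p))
```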
25
ML and “Relative Entropy”
Thus, MLE for frequency vectors implies:
Let P = (p_1,..,p_k) be a probability vector. Then among all probability vectors Q = (q_1,..,q_k), the sum
$$\sum_{i=1}^{k} p_i\log q_i$$
attains its unique maximum when Q = P.
This is equivalent to the following: Let P = (p_1,..,p_k) be a probability vector. Then for all probability vectors Q = (q_1,..,q_k) it holds that
$$\sum_{i=1}^{k} p_i\log(p_i/q_i) \ \ge\ 0, \quad\text{with equality iff } P = Q.$$
26
Relative Entropy
For P = (p_1,..,p_k) and Q = (q_1,..,q_k) as above, the sum
$$D(P\|Q) = \sum_{i=1}^{k} p_i\log(p_i/q_i)$$
is called the relative entropy of P and Q, also known as the Kullback-Leibler distance. So the ML formula for frequency vectors implies that the relative entropy of P and Q is nonnegative, and it equals 0 iff P = Q.
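A short sketch of D(P||Q) (natural logarithms; the vectors are illustrative), checking the two properties just stated:

```python
import math

def relative_entropy(p, q):
    """D(P||Q) = sum_i p_i * log(p_i / q_i); assumes q_i > 0 wherever p_i > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

P = [0.08, 0.2, 0.64, 0.08]
Q = [0.25, 0.25, 0.25, 0.25]

assert relative_entropy(P, Q) > 0                    # nonnegative, here strictly positive since P != Q
assert math.isclose(relative_entropy(P, P), 0.0)     # equals 0 iff P == Q
print(relative_entropy(P, Q))
```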
27
Relative entropy of scoring functions
Recall that we have defined the scoring function of alignments via
$$\sigma(a,b) = \log\frac{P(a,b)}{Q(a)\,Q(b)} = \log\frac{P(a,b)}{Q(a,b)},$$
where P(a,b), Q(a,b) are the probabilities of (a,b) in the "Match" and "Random" models.
The relative entropy between P and Q is:
$$D(P\|Q) = \sum_{a,b} P(a,b)\log\frac{P(a,b)}{Q(a,b)} = \sum_{a,b} P(a,b)\,\sigma(a,b)$$
Large relative entropy means that P(a,b), the distribution of the "match" model, is significantly different from Q(a,b), the distribution of the "random" model. (This relative entropy is called the mutual information of P and Q, denoted I(P,Q); I(P,Q) = 0 iff P = Q.)
28
Back to ML of fully known training sets
Recall: Mkl = #(transitions from k to l) in the training set, and Ek(b) = #(emissions of symbol b from state k) in the training set. We need to maximize:
$$\mathrm{prob}(x^1,\ldots,x^n\mid\theta)=\prod_{(k,l)} m_{kl}^{M_{kl}}\ \prod_{(k,b)}\big[e_k(b)\big]^{E_k(b)}$$
29
Rearranging terms we need to:
Maximize
$$\prod_{(k,l)} m_{kl}^{M_{kl}}\ \prod_{(k,b)}\big[e_k(b)\big]^{E_k(b)}$$
Subject to: for all states k,
$$\sum_{l} m_{kl}=1, \quad \sum_{b} e_k(b)=1, \quad m_{kl},\ e_k(b)\ge 0.$$
For this, we need to maximize the likelihoods of two dice for each state k: one for the transitions {m_kl | l = 1,..,m} and one for the emissions {e_k(b) | b ∈ Σ}.
30
Apply to HMM (cont.)
We apply the “dice likelihood” technique to get for each k the parameters {mkl|l=1,..,m} and {ek(b)|b∈Σ}:
$$m_{kl} = \frac{M_{kl}}{\sum_{l'} M_{kl'}}, \qquad e_k(b) = \frac{E_k(b)}{\sum_{b'} E_k(b')},$$
which gives the optimal ML parameters.
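Putting Case 1 together, here is a sketch that counts transitions and emissions over a fully labelled training set and then normalizes per state. The two-state model, the begin state "B" (absorbing the s_0 → s_1 transition), and the tiny training set are all hypothetical illustrations, not data from the lecture:

```python
from collections import defaultdict

# Fully known training set: each element is (state path s_1..s_L, emitted symbols x_1..x_L).
training_set = [
    (["F", "F", "L", "L"], ["H", "T", "H", "H"]),
    (["F", "L", "L", "F"], ["T", "H", "H", "T"]),
]

M = defaultdict(float)   # M[(k, l)] = #(transitions k -> l), including from the begin state "B"
E = defaultdict(float)   # E[(k, b)] = #(emissions of symbol b from state k)

for s, x in training_set:
    prev = "B"
    for k, b in zip(s, x):
        M[(prev, k)] += 1
        E[(k, b)] += 1
        prev = k

def normalize(counts):
    """Turn counts[(k, v)] into probabilities that sum to 1 over v for each k."""
    totals = defaultdict(float)
    for (k, _), c in counts.items():
        totals[k] += c
    return {(k, v): c / totals[k] for (k, v), c in counts.items()}

m = normalize(M)   # m_kl   = M_kl   / sum_l' M_kl'
e = normalize(E)   # e_k(b) = E_k(b) / sum_b' E_k(b')
print(m)
print(e)
```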
32
Adding pseudo counts in HMM
We may modify the actual counts using our prior knowledge/belief (e.g., when the sample set is too small): r_kl is our prior belief on transitions from k to l, and r_k(b) is our prior belief on emissions of b from state k.
Then:
$$m_{kl} = \frac{M_{kl}+r_{kl}}{\sum_{l'}\big(M_{kl'}+r_{kl'}\big)}, \qquad e_k(b) = \frac{E_k(b)+r_k(b)}{\sum_{b'}\big(E_k(b')+r_k(b')\big)}$$
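Pseudocounts are simply added to the raw counts before the same per-state normalization; the emission probabilities e_k(b) are smoothed in exactly the same way with r_k(b). A self-contained sketch with made-up counts and a uniform prior r_kl = 1:

```python
from collections import defaultdict

def normalize(counts):
    """Turn counts[(k, l)] into probabilities that sum to 1 over l for each k."""
    totals = defaultdict(float)
    for (k, _), c in counts.items():
        totals[k] += c
    return {(k, l): c / totals[k] for (k, l), c in counts.items()}

# Hypothetical raw transition counts for two states; "F -> L" was never observed.
M = {("F", "F"): 6.0, ("F", "L"): 0.0, ("L", "F"): 1.0, ("L", "L"): 3.0}
r = 1.0   # a uniform pseudocount r_kl = 1 for every pair (an illustrative prior)

m_raw = normalize(M)                                        # m_FL = 0: the transition is deemed impossible
m_smoothed = normalize({kl: c + r for kl, c in M.items()})  # m_kl = (M_kl + r_kl) / sum_l' (M_kl' + r_kl')
print(m_raw)
print(m_smoothed)
```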
33
Case 2: State paths are unknown
For a given θ we have: prob(x^1,..., x^n|θ) = prob(x^1|θ) ⋅⋅⋅ prob(x^n|θ)
(since the x^j are independent).
For each sequence x, prob(x|θ) = ∑_s prob(x,s|θ), the sum taken over all state paths s which emit x.
34
Thus, for the n sequences (x^1,..., x^n) we have:
$$\mathrm{prob}(x^1,\ldots,x^n\mid\theta)=\sum_{(s^1,\ldots,s^n)}\mathrm{prob}(x^1,\ldots,x^n,s^1,\ldots,s^n\mid\theta),$$
where the summation is taken over all tuples of n state paths (s^1,..., s^n) which generate (x^1,..., x^n). For simplicity, we will assume that n = 1.
35
So we need to maximize prob(x|θ) = ∑_s prob(x,s|θ), where the summation is over all state paths s which produce the output sequence x. Finding the θ* which maximizes ∑_s prob(x,s|θ) is hard. [Unlike finding the θ* which maximizes prob(x,s|θ) for a single fully known sequence (x,s).]
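To see where the difficulty comes from, here is a brute-force sketch of prob(x|θ) = ∑_s prob(x,s|θ) that enumerates every state path (the two-state model and the sequence are hypothetical). The number of paths grows as (#states)^L, and θ appears inside every term of the sum, which is why maximizing over θ is no longer a simple counting problem:

```python
from itertools import product

# Hypothetical two-state HMM; "B" acts as the begin state s_0.
m = {"B": {"F": 0.5, "L": 0.5},
     "F": {"F": 0.95, "L": 0.05},
     "L": {"F": 0.10, "L": 0.90}}
e = {"F": {"H": 0.5, "T": 0.5},
     "L": {"H": 0.9, "T": 0.1}}
states = ["F", "L"]

def joint_prob(x, s):
    """prob(x, s | theta) = prod_i m_{s_{i-1} s_i} * e_{s_i}(x_i)."""
    p, prev = 1.0, "B"
    for k, b in zip(s, x):
        p *= m[prev][k] * e[k][b]
        prev = k
    return p

x = ["H", "H", "T", "H"]

# prob(x | theta): sum over all |states|^L state paths that could have emitted x.
prob_x = sum(joint_prob(x, s) for s in product(states, repeat=len(x)))
print(prob_x)
```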
36
ML Parameter Estimation for HMM
The general process for finding θ in this case is:
1. Start with an initial value of θ.
2. Find θ′ so that prob(x|θ′) > prob(x|θ).
3. Set θ = θ′.
4. Repeat until some convergence criterion is met.
A general algorithm of this type is the Expectation Maximization algorithm, which we will meet later. For the specific case of HMM, it is the Baum-Welch training.
37
Baum-Welch training
We start with some values of m_kl and e_k(b), which define prior values of θ. Then we use an iterative algorithm which attempts to replace θ by a θ* such that
prob(x|θ*) > prob(x|θ).
This is done by "imitating" the algorithm for Case 1, where all states are known:
38
Baum-Welch training
In Case 1 we computed the optimal values of m_kl and e_k(b) (for the optimal θ) by simply counting the number M_kl of transitions from state k to state l, and the number E_k(b) of emissions of symbol b from state k, in the training set. This was possible since we knew all the states.
[Figure: a known transition s_{i-1} = k → s_i = l, with observed emissions x_{i-1} = b and x_i = c.]
39
Baum-Welch training
When the states are unknown, the counting process is replaced by an averaging process: for each edge s_{i-1} → s_i we compute the average number of "k to l" transitions, for all possible pairs (k,l), over this edge. Then, for each k and l, we take M_kl to be the sum over all edges.
[Figure: unknown states s_{i-1} = ?, s_i = ?, with observed emissions x_{i-1} = b and x_i = c.]
40
Baum-Welch training
Similarly, for each emission edge s_i → b and each state k, we compute the average number of times that s_i = k, which is the expected number of "k → b" emissions on this edge. Then we take E_k(b) to be the sum over all such edges. These expected values are computed assuming the current parameters θ:
[Figure: an emission edge from an unknown state s_i = ? to the observed symbol b.]
41
The expected values of Mkl and Ek(b)
Given the current distribution θ, Mkl and Ek(b) are defined by:
$$M_{kl}=\sum_{s} M^{s}_{kl}\,\mathrm{prob}(s\mid x,\theta),$$
where M^s_kl is the number of k to l transitions in the state sequence s.
$$E_k(b)=\sum_{s} E^{s}_{k}(b)\,\mathrm{prob}(s\mid x,\theta),$$
where E^s_k(b) is the number of times k emits b in the state sequence s with output x.
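A literal (exponential-time) sketch of these two formulas, weighting each path's counts by its posterior prob(s|x,θ) = prob(x,s|θ)/prob(x|θ). The model and sequence are hypothetical; in practice Baum-Welch computes the same expectations efficiently with the forward-backward algorithm rather than by enumerating paths:

```python
from collections import defaultdict
from itertools import product

# Hypothetical two-state HMM; "B" acts as the begin state s_0.
m = {"B": {"F": 0.5, "L": 0.5},
     "F": {"F": 0.95, "L": 0.05},
     "L": {"F": 0.10, "L": 0.90}}
e = {"F": {"H": 0.5, "T": 0.5},
     "L": {"H": 0.9, "T": 0.1}}
states = ["F", "L"]

def joint_prob(x, s):
    """prob(x, s | theta)."""
    p, prev = 1.0, "B"
    for k, b in zip(s, x):
        p *= m[prev][k] * e[k][b]
        prev = k
    return p

x = ["H", "H", "T", "H"]
paths = list(product(states, repeat=len(x)))
prob_x = sum(joint_prob(x, s) for s in paths)

M = defaultdict(float)   # expected M_kl   = sum_s M^s_kl   * prob(s | x, theta)
E = defaultdict(float)   # expected E_k(b) = sum_s E^s_k(b) * prob(s | x, theta)

for s in paths:
    posterior = joint_prob(x, s) / prob_x    # prob(s | x, theta)
    prev = "B"
    for k, b in zip(s, x):
        M[(prev, k)] += posterior            # this path contributes one k -> l transition here
        E[(k, b)] += posterior               # and one emission of b from state k
        prev = k

print(dict(M))   # normalizing these as in Case 1 gives the next theta of the Baum-Welch iteration
print(dict(E))
```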