
Maximum Likelihood and Parameter Estimation For HMM

Lecture #8

Background Readings: Chapter 3.3 in Biological Sequence Analysis, Durbin et al., 2001. © Shlomo Moran, following Danny Geiger and Nir Friedman


2

Parameter Estimation for HMM

An HMM model is defined by the parameters: mkl and ek(b), for all states k,l and all symbols b. Let θ denote the collection of these parameters:

[Figure: HMM with hidden states s1, s2, ..., sL-1, sL emitting symbols X1, X2, ..., XL-1, XL; a transition k → l has probability mkl, and state k emits letter b with probability ek(b).]

θ = {m_{kl} : k, l are states} ∪ {e_k(b) : k is a state, b is a letter}


3

Parameter Estimation for HMM

To determine the values of (the parameters in) θ, use a training set = {x1,...,xn}, where each xj is a sequence which is assumed to fit the model. Given the parameters θ, each sequence xj has an assigned probability p(xj|θ).



4

Maximum Likelihood Parameter Estimation for HMM

The elements of the training set {x1,...,xn} are assumed to be independent, so p(x1,..., xn|θ) = ∏j p(xj|θ). ML parameter estimation looks for the θ which maximizes the above.

The exact method for finding or approximating this θ depends on the nature of the training set used.


5

Data for HMM

The training set is characterized by:
1. For each xj, the information available on the states s^j_i (the symbols x^j_i are usually known).
2. Its size (the number of sequences in it).



6

Case 1: ML when Sequences are fully known

We know the complete structure of each sequence in the training set {x1,...,xn}. We wish to estimate mkl and ek(b) for all pairs of states k, l and symbols b.

By the ML method, we look for parameters θ* which maximize the probability of the sample set: p(x1,...,xn|θ*) = maxθ p(x1,...,xn|θ).



7

Case 1: Sequences are fully known


For each xj we have:

p(x^j|θ) = ∏_{i=1}^{L_j} m_{s_{i-1} s_i} · e_{s_i}(x^j_i)

Let M^j_{kl} = |{i : s^j_{i-1} = k, s^j_i = l}| (the superscript j indicates the jth sequence).
Let E^j_k(b) = |{i : s^j_i = k, x^j_i = b}|.

Then: prob(x^j|θ) = ∏_{(k,l)} m_{kl}^{M^j_{kl}} · ∏_{(k,b)} [e_k(b)]^{E^j_k(b)}


8

Case 1 (cont)


By the independence of the xj’s, p(x1,...,xn|θ) = ∏j p(xj|θ).

Thus, if Mkl = #(transitions from k to l) in the training set, and Ek(b) = #(emissions of symbol b from state k) in the training set, we have:

prob(x^1,...,x^n|θ) = ∏_{(k,l)} m_{kl}^{M_{kl}} · ∏_{(k,b)} [e_k(b)]^{E_k(b)}
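These counts and the resulting likelihood are easy to compute. Below is a Python sketch (the two-state model, its probabilities, and the training sequences are invented for illustration) that tallies Mkl and Ek(b) from fully labeled sequences and evaluates ∏ m_kl^{M_kl} · ∏ e_k(b)^{E_k(b)}:

```python
from collections import Counter
from math import prod

def count_stats(labeled_seqs):
    """Count transitions M[(k,l)] and emissions E[(k,b)] over a training
    set of fully labeled sequences, given as (states, symbols) pairs."""
    M, E = Counter(), Counter()
    for states, symbols in labeled_seqs:
        for k, l in zip(states, states[1:]):
            M[(k, l)] += 1
        for k, b in zip(states, symbols):
            E[(k, b)] += 1
    return M, E

def likelihood(M, E, m, e):
    """prob(x1,..,xn | theta) = prod m_kl^M_kl * prod e_k(b)^E_k(b)."""
    return (prod(m[kl] ** c for kl, c in M.items())
            * prod(e[kb] ** c for kb, c in E.items()))

# Toy two-state model (hypothetical numbers, not from the lecture):
m = {('A','A'): 0.9, ('A','B'): 0.1, ('B','A'): 0.2, ('B','B'): 0.8}
e = {('A','x'): 0.5, ('A','y'): 0.5, ('B','x'): 0.1, ('B','y'): 0.9}
data = [("AAB", "xxy"), ("BBA", "yyx")]
M, E = count_stats(data)
```

Because the sequences are fully labeled, each count is a plain tally; the likelihood factors into one term per parameter.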


9

Case 1 (cont)

So we need to find mkl’s and ek(b)’s which maximize:

∏_{(k,l)} m_{kl}^{M_{kl}} · ∏_{(k,b)} [e_k(b)]^{E_k(b)}

Subject to: for all states k, ∑_l m_{kl} = 1 and ∑_b e_k(b) = 1 [and m_{kl}, e_k(b) ≥ 0].


10

Case 1 (cont)

Rewriting, we need to maximize:

F = [∏_{(k,l)} m_{kl}^{M_{kl}}] × [∏_{(k,b)} [e_k(b)]^{E_k(b)}]

Subject to: for all k, ∑_l m_{kl} = 1, and ∑_b e_k(b) = 1.


11

Case 1 (cont)

If we maximize for each k: ∏_l m_{kl}^{M_{kl}} subject to ∑_l m_{kl} = 1, and also ∏_b [e_k(b)]^{E_k(b)} subject to ∑_b e_k(b) = 1, then we will also maximize F. Each of the above is a simpler ML problem, similar to ML parameter estimation for a die, treated next.


12

ML parameter estimation for a die

Let X be a random variable with 6 values x1,…,x6 denoting the six outcomes of a (possibly unfair) die. Here the parameters are θ = {q1,q2,q3,q4,q5,q6}, ∑ qi = 1. Assume that the data is one sequence:

Data = (x6,x1,x1,x3,x2,x2,x3,x4,x5,x2,x6)

So we have to maximize

prob(Data|θ) = q1^2 · q2^3 · q3^2 · q4 · q5 · q6^2

Subject to: q1+q2+q3+q4+q5+q6 = 1 [and qi ≥ 0]

i.e., prob(Data|θ) = q1^2 · q2^3 · q3^2 · q4 · q5 · (1 − ∑_{i=1}^5 qi)^2


13

Side comment: Sufficient Statistics

To compute the probability of the data in the die example we only need to record the number of times ni the die fell on side i (namely, n1, n2,…,n6). We do not need to recall the entire sequence of outcomes.

{ni | i=1,…,6} is called a sufficient statistics for the multinomial sampling.

prob(Data|θ) = q1^{n1} · q2^{n2} · q3^{n3} · q4^{n4} · q5^{n5} · (1 − ∑_{i=1}^5 qi)^{n6}


14

Sufficient Statistics

A sufficient statistics is a function of the data that summarizes the relevant information for the likelihood. Formally, s(Data) is a sufficient statistics if for any two datasets Data and Data’:

s(Data) = s(Data’) ⇒ P(Data|θ) = P(Data’|θ)


Exercise: Define a concise “sufficient statistics” for the HMM model, when the sequences are fully known.


15

Maximum Likelihood Estimate

By the ML approach, we look for parameters that maximize the probability of the data (i.e., the likelihood function). We will find the parameters by considering the corresponding log-likelihood function:

log(prob(Data|θ)) = log[q1^{n1} · q2^{n2} · q3^{n3} · q4^{n4} · q5^{n5} · (1 − ∑_{i=1}^5 qi)^{n6}]
                  = ∑_{i=1}^5 ni log qi + n6 log(1 − ∑_{i=1}^5 qi)

A necessary condition for a (local) maximum is:

∂/∂qj log(prob(Data|θ)) = nj/qj − n6/(1 − ∑_{i=1}^5 qi) = 0,  j = 1,…,5


16

Finding the Maximum

Rearranging terms (writing q6 = 1 − ∑_{i=1}^5 qi, so the condition reads nj/qj = n6/q6), and dividing the jth equation by the ith equation:

qj = (nj/ni) · qi

Sum from j=1 to 6:

1 = ∑_{j=1}^6 qj = (qi/ni) ∑_{j=1}^6 nj = (n/ni) · qi

The only local – and hence global – maximum is given by the relative frequency:

qi = ni/n,  i = 1,…,6
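The relative-frequency solution can be sanity-checked numerically on the 11-throw sequence from the die example (a sketch; comparing against the uniform die is just a spot check, not a proof):

```python
from collections import Counter
from math import log

def die_mle(data):
    """ML estimate for a (possibly unfair) die: the relative
    frequencies q_i = n_i / n derived above."""
    n = len(data)
    return {face: count / n for face, count in Counter(data).items()}

def log_likelihood(q, data):
    """log prob(Data | q) = sum_i n_i log q_i."""
    return sum(log(q[x]) for x in data)

# The 11-throw sequence Data = (x6,x1,x1,x3,x2,x2,x3,x4,x5,x2,x6):
data = [6, 1, 1, 3, 2, 2, 3, 4, 5, 2, 6]
q = die_mle(data)
```

The estimated q assigns, e.g., q2 = 3/11, and its log-likelihood is at least that of any other parameter vector, such as the fair die.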


17

Generalization to any number k of outcomes

Let X be a random variable with k values x1,…,xk denoting the k outcomes of Independently and Identically Distributed experiments, with parameters θ = {q1,q2,...,qk} (qi is the probability of xi). Again, the data is one sequence of length n, in which xi appears ni times.

Then we have to maximize

Subject to: q1+q2+ ....+ qk=1

prob(Data|θ) = q1^{n1} · q2^{n2} · … · qk^{nk},  (n1 + n2 + … + nk = n)

i.e., prob(Data|θ) = q1^{n1} · … · q_{k−1}^{n_{k−1}} · (1 − ∑_{i=1}^{k−1} qi)^{nk}


18

Generalization for k outcomes (cont)

By treatment identical to the die case, the maximum is obtained when for all i:

ni/qi = nk/qk

Hence the MLE is given by the relative frequencies:

qi = ni/n,  i = 1,…,k


19

Fractional Exponents

Some models allow the ni’s to be fractions (e.g., if we are uncertain of a die outcome, we may consider it “6” with 20% confidence and “5” with 80%). We still have for θ = (q1,..,qk):

prob(Data|θ) = q1^{n1} · q2^{n2} · … · qk^{nk},  (n1 + … + nk = n)

And the same analysis yields:

qi = ni/n,  i = 1,…,k


20

Summary: for an experiment with k outcomes x1,..,xk, the ML problem is the following: Given a statistics (n1, n2,..,nk) of positive real numbers (ni is the number of observations of xi), find parameters θ = {q1,q2,...,qk} (qi is the probability of xi) which maximize the likelihood.

Subject to: q1+q2+ ....+ qk=1

prob(Data|θ) = q1^{n1} · q2^{n2} · … · qk^{nk},  (n1 + … + nk = n)

The unique θ = (q1, q2,.., qk) which maximizes the likelihood is given by qi = ni/n, i = 1,..,k.


21

Frequency Vector of Statistics (for single random variable)

The frequency vector of a statistics (n1, n2,..,nk), where n1+…+nk = n, is the statistics (p1,..,pk) obtained by letting pi = ni/n (i = 1,..,k).

Consider two statistics of a random variable with 4 outcomes: n = 125 with statistics (10,25,80,10), and n = 250 with statistics (20,50,160,20). Both give the same frequency vector (0.08, 0.2, 0.64, 0.08), and hence are optimized by the same parameters.

We can treat a frequency vector (p1,..,pk) as any other statistics, and find the parameters which maximize its likelihood.


22

ML for frequency vectors

For single-die experiments, the likelihood of any data which has frequency vector P is maximized by the same parameters which maximize the likelihood of the “statistics” P (prove!). The “ML problem for frequency vectors” is the same as the original problem, but we restrict the statistics to frequency vectors:

Given a frequency vector of observed Data, (p1,..,pk), find parameters θ = (q1, q2,.., qk) which maximize the likelihood

prob(Data|θ) = q1^{p1} · q2^{p2} · … · qk^{pk}


23

ML for frequency vectors

Since a “frequency vector” can be viewed as a probability vector, our solution for the “single die” ML problem implies the following:

Let P = (p1,..,pk) be a probability k-vector. Then among all probability k-vectors Q = (q1,..,qk), the likelihood ∏_{i=1}^k qi^{pi} is maximized iff Q = P.

The value of this maximum likelihood is ∏_{i=1}^k pi^{pi}.
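This claim can be probed numerically. The sketch below compares ∏ qi^{pi} at Q = P against a few other probability vectors (all vectors here are invented for illustration):

```python
from math import prod

def freq_likelihood(P, Q):
    """Likelihood of a frequency vector P under parameters Q:
    prod_i q_i ** p_i."""
    return prod(q ** p for p, q in zip(P, Q))

P = [0.5, 0.3, 0.2]              # an arbitrary probability vector
best = freq_likelihood(P, P)     # claimed maximum: prod_i p_i ** p_i
rivals = [[1/3, 1/3, 1/3], [0.6, 0.2, 0.2], [0.4, 0.4, 0.2]]
```

Evaluating `freq_likelihood(P, Q)` at each rival Q gives a strictly smaller value than `best`, consistent with the statement that Q = P is the unique maximizer.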


24

Side Trip: Maximum Likelihood and Entropy

The logarithm of the maximum likelihood of the frequency vector (p1, p2,.., pk) is given by

log(p1^{p1} · p2^{p2} · … · pk^{pk}) = ∑_{i=1}^k pi log pi.

This is a negative number, and its absolute value −∑_{i=1}^k pi log pi is known as the entropy of (p1,..,pk).


25

ML and “Relative Entropy”

Thus, MLE for frequency vectors implies:

Let P = (p1,..,pk) be a probability vector. Then among all probability vectors Q = (q1,..,qk), the sum ∑_{i=1}^k pi log qi gets its unique maximum when Q = P.

This is equivalent to the following: Let P = (p1,..,pk) be a probability vector. Then for all probability vectors Q = (q1,..,qk) it holds that

∑_{i=1}^k pi log(pi/qi) ≥ 0, with equality iff P = Q.


26

Relative Entropy

For P = (p1,..,pk) and Q = (q1,..,qk) as above, the sum

D(P||Q) = ∑_{i=1}^k pi log(pi/qi)

is called the relative entropy of P and Q, or the Kullback-Leibler distance. So the ML formula for frequency vectors implies that the relative entropy of P and Q is nonnegative, and it equals 0 iff P = Q.
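A short numeric sketch of D(P||Q) (the probability vectors are invented for illustration):

```python
from math import log

def relative_entropy(P, Q):
    """D(P||Q) = sum_i p_i log(p_i / q_i); terms with p_i = 0
    contribute 0 by convention."""
    return sum(p * log(p / q) for p, q in zip(P, Q) if p > 0)

P = [0.5, 0.3, 0.2]
Q = [0.25, 0.25, 0.5]
```

D(P||P) is exactly 0, while D(P||Q) for P ≠ Q comes out strictly positive, matching the statement above.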


27

Relative entropy of scoring functions

Recall that we have defined the scoring function of alignments via

σ(a,b) = log [P(a,b) / (Q(a)Q(b))] = log [P(a,b) / Q(a,b)]

where P(a,b), Q(a,b) are the probabilities of (a,b) in the “Match” and “Random” models.

The relative entropy between P and Q is:

D(P||Q) = ∑_{a,b} P(a,b) log[P(a,b)/Q(a,b)] = ∑_{a,b} P(a,b) σ(a,b)

Large relative entropy means that P(a,b), the distribution of the “match” model, is significantly different from Q(a,b), the distribution of the “random” model. (This relative entropy is called “the mutual information of P and Q”, denoted I(P,Q). I(P,Q) = 0 iff P = Q.)


28

Back to ML of fully known training sets


Recall: Mkl = #(transitions from k to l) in the training set, and Ek(b) = #(emissions of symbol b from state k) in the training set. We need to maximize:

prob(x^1,...,x^n|θ) = ∏_{(k,l)} m_{kl}^{M_{kl}} · ∏_{(k,b)} [e_k(b)]^{E_k(b)}


29

Rearranging terms, we need to:

Maximize ∏_{(k,l)} m_{kl}^{M_{kl}} · ∏_{(k,b)} [e_k(b)]^{E_k(b)}

Subject to: for all states k, ∑_l m_{kl} = 1 and ∑_b e_k(b) = 1, with m_{kl}, e_k(b) ≥ 0.

For this, we need to maximize the likelihoods of two dice for each state k: one for the transitions {m_{kl} | l = 1,..,m} and one for the emissions {e_k(b) | b ∈ Σ}.


30

Apply to HMM (cont.)

We apply the “dice likelihood” technique to get for each k the parameters {mkl|l=1,..,m} and {ek(b)|b∈Σ}:


m_{kl} = M_{kl} / ∑_{l'} M_{kl'},  and  e_k(b) = E_k(b) / ∑_{b'} E_k(b')

which gives the optimal ML parameters.
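This normalization step can be sketched in a few lines of Python (the counts below are invented toy numbers):

```python
def ml_parameters(M, E):
    """Turn transition counts M[k][l] and emission counts E[k][b] into
    ML estimates by normalizing each state's counts to sum to 1:
    m_kl = M_kl / sum_l' M_kl',  e_k(b) = E_k(b) / sum_b' E_k(b')."""
    m = {k: {l: c / sum(row.values()) for l, c in row.items()}
         for k, row in M.items()}
    e = {k: {b: c / sum(row.values()) for b, c in row.items()}
         for k, row in E.items()}
    return m, e

# Toy counts (invented numbers, not from the lecture):
M = {'A': {'A': 9, 'B': 1}, 'B': {'A': 2, 'B': 8}}
E = {'A': {'x': 3, 'y': 1}, 'B': {'x': 1, 'y': 3}}
m, e = ml_parameters(M, E)
```

Each state's transition row and emission row become probability vectors, one independent "die" per state, as the previous slide describes.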


32

Adding pseudo counts in HMM

We may modify the actual count by our prior knowledge/belief (e.g., when the sample set is too small): rkl is our prior belief on transitions from k to l, and rk(b) is our prior belief on emissions of b from state k.


then m_{kl} = (M_{kl} + r_{kl}) / ∑_{l'} (M_{kl'} + r_{kl'}),  and  e_k(b) = (E_k(b) + r_k(b)) / ∑_{b'} (E_k(b') + r_k(b'))
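The pseudocount correction is a small change to the same normalization. The sketch below assumes a uniform pseudocount r for every entry, which is a simplification; the slide allows arbitrary priors rkl and rk(b):

```python
def ml_with_pseudocounts(M, E, r=1.0):
    """ML estimates with a uniform pseudocount r added to every
    transition and emission count before normalizing (uniform prior
    is an assumption made for this sketch)."""
    m = {k: {l: (c + r) / (sum(row.values()) + r * len(row))
             for l, c in row.items()} for k, row in M.items()}
    e = {k: {b: (c + r) / (sum(row.values()) + r * len(row))
             for b, c in row.items()} for k, row in E.items()}
    return m, e

# A zero count: the transition A -> B was never observed.
M = {'A': {'A': 9, 'B': 0}}
E = {'A': {'x': 4, 'y': 0}}
m, e = ml_with_pseudocounts(M, E)
```

The unobserved transition A → B now gets probability 1/11 instead of 0, which is the usual reason for pseudocounts on small samples.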


33

Case 2: State paths are unknown

For a given θ we have: prob(x1,..., xn|θ) = prob(x1|θ) · · · prob(xn|θ)

(since the xj are independent)


For each sequence x, prob(x|θ) = ∑s prob(x,s|θ), the sum taken over all state paths s which emit x.


34

Thus, for the n sequences (x1,..., xn) we have:

prob(x1,..., xn|θ) = ∑_{(s1,..., sn)} prob(x1,..., xn, s1,..., sn|θ),

where the summation is taken over all tuples of n state paths (s1,..., sn) which generate (x1,..., xn). For simplicity, we will assume that n = 1.



35

So we need to maximize prob(x|θ) = ∑s prob(x,s|θ), where the summation is over all the state paths s which produce the output sequence x. Finding the θ* which maximizes ∑s prob(x,s|θ) is hard. [Unlike finding the θ* which maximizes prob(x,s|θ) for a single pair (x,s).]



36

ML Parameter Estimation for HMM

The general process for finding θ in this case is:
1. Start with an initial value of θ.
2. Find θ’ so that prob(x|θ’) > prob(x|θ).
3. Set θ = θ’.
4. Repeat until some convergence criterion is met.

A general algorithm of this type is the Expectation Maximization algorithm, which we will meet later. For the specific case of HMM, it is the Baum-Welch training.


37

Baum-Welch training

We start with some values of mkl and ek(b), which define prior values of θ. Then we use an iterative algorithm which attempts to replace θ by a θ* s.t.

prob(x|θ*) > prob(x|θ)

This is done by “imitating” the algorithm for Case 1, where all states are known:



38

Baum-Welch training

In case 1 we computed the optimal values of mkl and ek(b), (for the optimal θ) by simply counting the number Mkl of transitions from state k to state l, and the number Ek(b) of emissions of symbol b from state k, in the training set. This was possible since we knew all the states.

[Figure: a known transition si-1 = k → si = l, with emissions xi-1 = b and xi = c.]


39

Baum-Welch training

When the states are unknown, the counting process is replaced by an averaging process: for each edge si-1 → si we compute the average (expected) number of “k to l” transitions, for all possible pairs (k,l), over this edge. Then, for each k and l, we take Mkl to be the sum over all edges.

[Figure: an edge si-1 → si with unknown states, emitting xi-1 = b and xi = c.]


40

Baum-Welch training

Similarly, for each emitted symbol b and each state k, we compute the average number of times that si = k, which is the expected number of “k → b” emissions at this position. Then we take Ek(b) to be the sum over all such positions. These expected values are computed by assuming the current parameters θ:

[Figure: an unknown state si with its emitted symbol b.]


41

The expected values of Mkl and Ek(b)

Given the current distribution θ, Mkl and Ek(b) are defined by:


Mkl = ∑s M^s_{kl} · prob(s|x,θ),

where M^s_{kl} is the number of k to l transitions in the state path s.

Ek(b) = ∑s E^s_k(b) · prob(s|x,θ),

where E^s_k(b) is the number of times k emits b in the state path s with output x.
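These definitions can be made concrete by brute-force enumeration over all state paths, feasible only for tiny models (the two-state model and its start distribution below are invented; Baum-Welch computes the same expected counts efficiently with the forward-backward algorithm, covered later):

```python
from itertools import product

def expected_counts(x, states, start, m, e):
    """Compute M_kl = sum_s M^s_kl prob(s|x,theta) and
    E_k(b) = sum_s E^s_k(b) prob(s|x,theta) by enumerating every
    state path s that could emit x (exponential in len(x))."""
    joint = {}
    for s in product(states, repeat=len(x)):
        p = start[s[0]] * e[(s[0], x[0])]        # prob(x, s | theta)
        for i in range(1, len(x)):
            p *= m[(s[i-1], s[i])] * e[(s[i], x[i])]
        joint[s] = p
    px = sum(joint.values())                      # prob(x | theta)
    M, E = {}, {}
    for s, p in joint.items():
        w = p / px                                # prob(s | x, theta)
        for k, l in zip(s, s[1:]):
            M[(k, l)] = M.get((k, l), 0.0) + w
        for k, b in zip(s, x):
            E[(k, b)] = E.get((k, b), 0.0) + w
    return M, E

# Invented two-state model with start probabilities (the lecture
# leaves the initial distribution implicit):
states = ('A', 'B')
start = {'A': 0.5, 'B': 0.5}
m = {('A','A'): 0.9, ('A','B'): 0.1, ('B','A'): 0.2, ('B','B'): 0.8}
e = {('A','x'): 0.5, ('A','y'): 0.5, ('B','x'): 0.1, ('B','y'): 0.9}
M, E = expected_counts("xy", states, start, m, e)
```

Since prob(s|x,θ) sums to 1 over all paths, the expected transition counts sum to L−1 and the expected emission counts sum to L, exactly as the hard counts would for a single known path.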