Page 1:

Lecture #8:

- Parameter Estimation for HMM with Hidden States: the Baum-Welch Training
- Viterbi Training

- Extensions of HMM

Background Readings: Chapters 3.3, 11.2 in the text book, Biological Sequence Analysis, Durbin et al., 2001.

Shlomo Moran, following Dan Geiger and Nir Friedman

Page 2:

Parameter Estimation in HMM: What we saw until now

Page 3:

The Parameters

To determine the values of the parameters θ, use a training set {x1,...,xn}, where each xj is a sequence which is assumed to fit the model.

An HMM model is defined by the probability parameters: mkl and ek(b), for all states k,l and all symbols b.

θ denotes the collection of these parameters.

Page 4:

Maximum Likelihood Parameter Estimation for HMM

looks for θ which maximizes:

p(x1,..., xn|θ) = ∏j p(xj|θ).

Page 5:

Finding ML parameters for HMM when all states are known:

Let Mkl = #(transitions from k to l) in the training set, and Ek(b) = #(emissions of symbol b from state k) in the training set. We look for parameters θ = {mkl, ek(b)} that:

Maximize   ∏(k,l) mkl^Mkl · ∏(k,b) ek(b)^Ek(b)

Subject to: for all states k,  ∑l mkl = 1  and  ∑b ek(b) = 1.

The optimal ML parameters θ are given by:

mkl = Mkl / ∑l' Mkl'   and   ek(b) = Ek(b) / ∑b' Ek(b')
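To make the counting estimate concrete, here is a minimal sketch (my own illustration, not code from the lecture; the function and variable names are hypothetical) that computes mkl and ek(b) from a training set in which the state paths are known:

```python
from collections import defaultdict

def ml_estimate(training_set):
    """training_set: a list of (states, symbols) pairs with known state paths."""
    M = defaultdict(float)   # M[(k, l)] = number of k -> l transitions
    E = defaultdict(float)   # E[(k, b)] = number of emissions of b from state k
    for states, symbols in training_set:
        for k, l in zip(states, states[1:]):
            M[(k, l)] += 1
        for k, b in zip(states, symbols):
            E[(k, b)] += 1
    # Normalize each row, exactly as in the ML formulas above.
    m = {(k, l): c / sum(c2 for (k2, _), c2 in M.items() if k2 == k)
         for (k, l), c in M.items()}
    e = {(k, b): c / sum(c2 for (k2, _), c2 in E.items() if k2 == k)
         for (k, b), c in E.items()}
    return m, e
```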

Page 6:

Case 2: State paths are unknown

We need to maximize p(x|θ) = ∑s p(x,s|θ), where the summation is over all state paths s which produce the output sequence x. Finding θ* which maximizes ∑s p(x,s|θ) is hard. [Unlike finding θ* which maximizes p(x,s|θ) for a single sequence (x,s).]

[Figure: the HMM chain s1 → s2 → … → sL, where each si emits xi]

Page 7:

Parameter Estimation when States are Unknown

The general process for finding θ in this case is:
1. Start with an initial value of θ.
2. Find θ* so that p(x|θ*) > p(x|θ).
3. Set θ = θ*.
4. Repeat until some convergence criterion is met.

A general algorithm of this type is the Expectation Maximization algorithm, which we will meet later. For the specific case of HMM, it is the Baum-Welch training.

Page 8:

Baum Welch Training

We start with some values of mkl and ek(b), which define prior values of θ. Then we use an iterative algorithm which attempts to replace θ by a θ* s.t.

p(x|θ*) > p(x|θ).
This is done by “imitating” the algorithm for Case 1, where all states are known:


Page 9:

When states were known, we counted…

In case 1 we computed the optimal values of mkl and ek(b), (for the optimal θ) by simply counting the number Mkl of transitions from state k to state l, and the number Ek(b) of emissions of symbol b from state k, in the training set. This was possible since we knew all the states.

[Figure: a transition si-1 = k → si = l, with emissions xi-1 = b and xi = c]

Page 10:

When states are unknown, Mkl and Ek(b) are taken as averages:

Mkl and Ek(b) are computed according to the current distribution θ, that is:


Mkl = ∑s Mkl(s)·p(s|x,θ), where Mkl(s) is the number of k → l transitions in the state path s.

Similarly, Ek(b) = ∑s Ek(b,s)·p(s|x,θ), where Ek(b,s) is the number of times k emits b when the path s produces the output x.

Page 11:

Computing averages of state-transitions:

Since the number of state paths s is exponential in L, it is too expensive to compute Mkl = ∑s Mkl(s)·p(s|x,θ) in the naïve way. Hence, we use dynamic programming: for each pair (k,l) and for each edge si-1 → si we compute the average number of “k to l” transitions over this edge. Then we take Mkl to be the sum over all edges.

[Figure: an edge si-1 → si with unknown states, emitting xi-1 = b and xi = c]

Page 12:

…and of Letter-Emissions

Similarly, for each emission edge si → xi where xi = b, and each state k, we compute the average number of times that si = k, which is the expected number of “k → b” emissions on this edge. Then we take Ek(b) to be the sum over all such edges. These expected values are computed by assuming the current parameters θ:


Page 13:

Baum Welch: step E for Mkl

Count the average number of state transitions

For computing the averages, Baum Welch computes for each index i and states k,l, the following probability:


P(si-1=k, si=l | x,θ)

For this, it uses the forward and backward algorithms.

Page 14:

Baum Welch: Step E for Mkl

Claim: By the probability distribution of the HMM,

p(si-1=k, si=l | x, θ) = Fk(i-1)·mkl·el(xi)·Bl(i) / p(x|θ)

(mkl and el(xi) are the parameters defined by θ, and Fk(i-1), Bl(i) are the outputs of the forward / backward algorithms.)
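Once F, B and p(x|θ) are available, this probability can be evaluated directly. A minimal sketch, under my own indexing assumptions (F[i][k] = Fk(i) with F[0] holding the begin-state values, B[i][l] = Bl(i), x[i] the i-th symbol, parameters stored as m[(k,l)] and e[(l,b)], px = p(x|θ)):

```python
def transition_posterior(i, k, l, x, F, B, m, e, px):
    """P(s_{i-1} = k, s_i = l | x, theta), as in the claim above."""
    return F[i - 1][k] * m[(k, l)] * e[(l, x[i])] * B[i][l] / px
```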

Page 15:

Proof of claim

P(x1,…,xL, si-1=k, si=l | θ) = P(x1,…,xi-1, si-1=k | θ) · mkl·el(xi) · P(xi+1,…,xL | si=l, θ)

= Fk(i-1) · mkl·el(xi) · Bl(i)

(The first factor is computed by the forward algorithm and the last factor by the backward algorithm.)

Dividing by p(x|θ) gives:

p(si-1=k, si=l | x, θ) = Fk(i-1) · mkl·el(xi) · Bl(i) / p(x|θ)

Page 16:

Step E for Mkl (end)

For each pair (k,l), compute the expected number of state transitions from k to l, as the sum of the expected number of k to l transitions over all L edges:

Mkl = (1/p(x|θ)) · ∑i=1..L p(si-1=k, si=l, x | θ)
    = (1/p(x|θ)) · ∑i=1..L Fk(i-1)·mkl·el(xi)·Bl(i)
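A sketch of this accumulation for a single sequence, under the same assumed conventions as before (F[i][k] = Fk(i) with F[0] holding the begin-state values, B[i][l] = Bl(i), x[i] the i-th symbol, px = p(x|θ)):

```python
def expected_transitions(x, L, F, B, m, e, px, states):
    """E-step accumulation of Mkl for one sequence, summing over the L edges."""
    M = {(k, l): 0.0 for k in states for l in states}
    for i in range(1, L + 1):
        for k in states:
            for l in states:
                M[(k, l)] += F[i - 1][k] * m[(k, l)] * e[(l, x[i])] * B[i][l]
    for kl in M:
        M[kl] /= px
    return M
```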

Page 17:

Step E for Mkl , with many sequences:

Claim: When we have n independent input sequences (x1,..., xn ) of lengths L1 .. Ln , then Mkl is given by:

Mkl = ∑j=1..n (1/p(xj|θ)) · ∑i=1..Lj p(si-1=k, si=l, xj | θ)
    = ∑j=1..n (1/p(xj|θ)) · ∑i=1..Lj Fjk(i-1)·mkl·el(xji)·Bjl(i)

where Fjk and Bjl are the forward and backward values computed for xj under θ, and xji is the i-th symbol of xj.

Page 18:

Proof of Claim: When we have n independent input sequences (x1,..., xn), the probability space is the product of n spaces:

{ ((s1,x1), (s2,x2), .., (sn,xn)) },  where for j = 1,..,n, (sj, xj) is an HMM (state path and output) of length Lj, and sj ranges over all possible state paths of length Lj.

The probability of a simple event in this space with parameters θ is given by:

p( (s1,x1), (s2,x2), .., (sn,xn) | θ ) = ∏j=1..n p( (sj,xj) | θ )

Page 19:

Proof of Claim (cont):

The probability of that simple event given x=(x1 ,..,xn ):

p( (s1,x1), (s2,x2), .., (sn,xn) | x, θ ) = ∏j=1..n [ p( (sj,xj) | θ ) / p( xj | θ ) ]

The probability of the compound event (sj,xj ) given x=(x1 ,..,xn ):

p( (sj,xj) | x, θ ) = p( (sj,xj) | θ ) / p( xj | θ )

Page 20:

Proof of Claim (end):

By the same argument used for a single sequence, we get that Mjkl, the contribution of xj to Mkl, is given by

Mjkl = (1/p(xj|θ)) · ∑i=1..Lj p(si-1=k, si=l, xj | θ),

and the claim follows by taking the sum over all j.

Page 21:

Baum-Welch: Step E for Ek(b): count the expected number of letter emissions

For each state k and each symbol b, and for each i where xi = b, compute the expected number of times that si = k.


p(si=k | x1,…,xL) = p(x1,…,xL, si=k) / p(x1,…,xL) = Fk(i)·Bk(i) / p(x1,…,xL)

Page 22:

Baum-Welch: Step E for Ek(b)

For each state k and each symbol b, compute the expected number of emissions of b from k as the sum of the expected number of times that si = k, over all i’s for which xi = b.

Ek(b) = (1/p(x|θ)) · ∑{i : xi=b} Fk(i)·Bk(i)
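The corresponding accumulation in code, a sketch under the same assumed conventions (F[i][k] = Fk(i), B[i][k] = Bk(i), x[i] the i-th symbol, px = p(x|θ)):

```python
def expected_emissions(x, L, F, B, px, states, alphabet):
    """E-step accumulation of Ek(b) for one sequence."""
    E = {(k, b): 0.0 for k in states for b in alphabet}
    for i in range(1, L + 1):
        for k in states:
            E[(k, x[i])] += F[i][k] * B[i][k] / px
    return E
```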

Page 23:

Step E for Ek(b), many sequences

Exercise: when we have n sequences (x1,..., xn ), the expected number of emissions of b from k is given by:

Ek(b) = ∑j=1..n (1/p(xj|θ)) · ∑{i : xji=b} Fjk(i)·Bjk(i)

Page 24:

Summary: the E part of the Baum Welch training

This part computes the expected numbers Mkl of k→l transitions for all pairs of states k and l, and the expected numbers Ek(b) of emissions of symbol b from state k, for all states k and symbols b.

The next step is the M step, which is identical to the computation of optimal ML parameters when all states are known.

Page 25:

Baum-Welch: step M

mkl = Mkl / ∑l' Mkl'   and   ek(b) = Ek(b) / ∑b' Ek(b')

Use the Mkl’s, Ek(b)’s to compute the new values of mkl and ek(b). These values define θ*.

The correctness of the EM algorithm implies that, if θ* ≠ θ, then:

p(x1,..., xn|θ*) > p(x1,..., xn|θ), i.e., θ* increases the probability of the data, unless it is equal to θ. (This will follow from the correctness of the EM algorithm, to be proved later.)

This procedure is iterated, until some convergence criterion is met.
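The M step itself is easy to write down. Here is a minimal sketch (my own) that turns the accumulated expectations into new parameters, exactly as in the normalization formulas above:

```python
def m_step(M, E):
    """Turn expected counts M[(k, l)] and E[(k, b)] into new parameters.

    In practice pseudocounts are usually added first, so that no row sums to zero.
    """
    row_M, row_E = {}, {}
    for (k, _), c in M.items():
        row_M[k] = row_M.get(k, 0.0) + c
    for (k, _), c in E.items():
        row_E[k] = row_E.get(k, 0.0) + c
    m = {(k, l): c / row_M[k] for (k, l), c in M.items()}
    e = {(k, b): c / row_E[k] for (k, b), c in E.items()}
    return m, e
```

One Baum-Welch iteration is then: run the E-step accumulations (the expected_transitions and expected_emissions sketches, summed over all training sequences), apply m_step, and repeat while the likelihood keeps increasing.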

Page 26:

Viterbi training: Maximizing the probability of the most probable path

Page 27:

Assume that rather than finding θ which maximizes the likelihood of the input x1,..,xn, we wish to maximize the probability of a most probable path, i.e. to find parameters θ and state paths s(x1),..,s(xn) such that the value of

p(s(x1),..,s(xn), x1,..,xn | θ)

is maximized. Clearly, s(xj) should be the most probable path for xj under the parameters θ. We assume only one sequence (n=1).

This is done by Viterbi Training

Page 28:

Viterbi training

Start from given values of mkl and ek(b), which define prior values of θ. Each iteration consists of:
Step 1: Use Viterbi's algorithm to find a most probable path s(x), which maximizes p(s(x), x|θ).


Page 29:

Viterbi training (cont)

Step 2. Use the ML method for HMM when the states are known to find θ* which maximizes p(s(x), x|θ*).

Note: If after Step 2 we have p(s(x), x|θ*) = p(s(x), x|θ), then it must be that θ = θ*. In this case the next iteration will be identical to the current one, and hence we may terminate the algorithm.


Page 30:

Viterbi training (cont)

Step 3. If θ ≠ θ*, set θ ← θ* and repeat.


If θ=θ* , stop.
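A compact sketch of the whole procedure (my own illustration; it reuses the ml_estimate counting sketch from the known-states case, works with plain probabilities rather than log-space, and keeps the initial-state distribution fixed, so for real data one would add pseudocounts and logarithms):

```python
def viterbi_path(x, states, m, e, initial):
    """A most probable state path for the output x.

    Assumptions: x is a string or list of symbols, m[(k, l)] and e[(k, b)] are the
    current parameters, and initial[k] is an initial state distribution."""
    V = [{k: initial[k] * e.get((k, x[0]), 0.0) for k in states}]
    back = []
    for b in x[1:]:
        prev, col, ptr = V[-1], {}, {}
        for l in states:
            k_best = max(states, key=lambda k: prev[k] * m.get((k, l), 0.0))
            col[l] = prev[k_best] * m.get((k_best, l), 0.0) * e.get((l, b), 0.0)
            ptr[l] = k_best
        V.append(col)
        back.append(ptr)
    last = max(states, key=lambda k: V[-1][k])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

def viterbi_training(x, states, m, e, initial, max_iter=50):
    for _ in range(max_iter):
        s = viterbi_path(x, states, m, e, initial)   # Step 1: most probable path
        m_new, e_new = ml_estimate([(s, list(x))])   # Step 2: counting ML estimate
        if m_new == m and e_new == e:                # Step 3: parameters unchanged
            break
        m, e = m_new, e_new
    return m, e
```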

Page 31:

Extensions of HMM

Page 32:

1. Monitoring probabilities of repetitions

Markov chains are rather limited in describing sequences of symbols with non-random structures. For instance, a Markov chain forces the distribution of segments in which some state is repeated k+1 times to be (1-p)·p^k, for some p.

[Figure: a state A with a self-loop, producing runs A A A A …]

By adding states we may bypass this restriction:

Page 33:

1. State duplications

An extension of Markov chains which allows the distribution of segments in which a state is repeated k+1 times to have any desired value:

Assign k+1 states to represent the same “real” state. This may model k repetitions (or less) with any desired probability.

[Figure: duplicated states A1 → A2 → A3 → A4, all representing the same real state A]
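A tiny numerical illustration (my own, with arbitrary numbers): with four copies A1 → A2 → A3 → A4 of the same real state, the probabilities of continuing along the chain can be chosen to realize any run-length distribution on {1, 2, 3, 4}:

```python
# p[i] is the assumed probability of moving from Ai to A(i+1);
# otherwise the run ends after Ai (A4 always ends the run).
p = [0.9, 0.5, 0.2]

stay, pmf = 1.0, {}
for length, p_next in enumerate(p, start=1):
    pmf[length] = stay * (1.0 - p_next)   # the run ends after the `length`-th copy of A
    stay *= p_next
pmf[4] = stay                             # the run reached A4

print(pmf)   # any pmf on run lengths 1..4 is achievable by choosing p;
             # a single self-looping state could only give (1-p)*p**k
```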

Page 34:

2. Silent states
- States which do not emit symbols.
- Can be used to model repetitions.
- Also used to allow arbitrary jumps (may be used to model deletions).
- Need to generalize the Forward and Backward algorithms to arbitrary acyclic digraphs, to account for the silent states:

[Figure: a layer of silent states above a chain of regular (emitting) states]

Page 35:

E.g., the forward algorithm should look like this:

Directed cycles of silent (or other) states complicate things, and should be avoided.

For a regular vertex v of states, which emits the symbol xv:

Fl(v) = el(xv) · ∑{u: (u,v)∈E} ∑k Fk(u)·mkl

and for a vertex z of silent states:

Fl(z) = ∑{u: (u,z)∈E} ∑k Fk(u)·mkl

[Figure: an acyclic digraph with a layer of silent states and a layer of regular states above the symbols]
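A sketch of this generalized forward pass (my own illustration, not the lecture's code), assuming the vertices of the digraph are given in topological order:

```python
def forward_dag(vertices, pred, emit, m, e, n_states, start):
    """Forward values F[v][l] on an acyclic digraph with silent vertices.

    Assumptions: `vertices` is in topological order, `pred[v]` lists the
    predecessors of v, `emit[v]` is the symbol of a regular vertex (None if v is
    silent), m[k][l] and e[l][b] are the HMM parameters, and `start` is a source
    vertex given a uniform state distribution."""
    F = {v: [0.0] * n_states for v in vertices}
    F[start] = [1.0 / n_states] * n_states
    for v in vertices:
        if v == start:
            continue
        for l in range(n_states):
            total = sum(F[u][k] * m[k][l]
                        for u in pred[v] for k in range(n_states))
            # silent vertex: no emission factor; regular vertex: multiply by e_l(x_v)
            F[v][l] = total if emit[v] is None else e[l][emit[v]] * total
    return F
```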

Page 36:

3. High Order Markov Chains

Markov chains in which the transition probabilities depend on the last k states:

P(xi|xi-1,...,x1) = P(xi|xi-1,...,xi-k)

Can be represented by a standard Markov chain with more states, e.g. for k = 2:

[Figure: the four pair states AA, AB, BA, BB and the transitions between them]
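A small sketch (my own, with toy numbers) of this construction for k = 2 over the alphabet {A, B}: every second-order probability P(xi | xi-2, xi-1) becomes a first-order transition between pair states:

```python
from itertools import product

alphabet = "AB"
# p2[(a, b)][c] is an assumed 2nd-order probability P(x_i = c | x_{i-2} = a, x_{i-1} = b)
p2 = {(a, b): {"A": 0.7, "B": 0.3} for a, b in product(alphabet, alphabet)}

# First-order transitions between pair states: (a, b) -> (b, c) with prob P(c | a, b);
# all other pair transitions have probability 0.
first_order = {}
for (a, b), dist in p2.items():
    for c, prob in dist.items():
        first_order[(a + b, b + c)] = prob

print(first_order[("AB", "BA")])   # = P(x_i = A | x_{i-2} = A, x_{i-1} = B) = 0.7
```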

Page 37:

4. Inhomogeneous Markov Chains

An important task in analyzing DNA sequences is recognizing the genes which code for proteins.

A triplet of 3 nucleotides – a codon – codes for an amino acid. It is known that in parts of DNA which code for genes, the three codon positions have different statistics. Thus a Markov chain model for DNA should represent not only the nucleotide (A, C, G or T), but also its position within the codon – the same nucleotide in different positions will have different transition probabilities. Used in the GENEMARK gene-finding program (1993).
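A minimal sketch (my own illustration, not the GENEMARK implementation) of such an inhomogeneous chain: the transition table used at position i is selected by the codon position i mod 3:

```python
import math

def log_prob_coding(seq, T, initial):
    """Score a DNA string under a codon-position-dependent Markov chain.

    Assumptions: T[c][prev][next] is a transition probability for codon position
    c in {0, 1, 2}, and initial[base] is a distribution for the first nucleotide."""
    logp = math.log(initial[seq[0]])
    for i in range(1, len(seq)):
        codon_pos = i % 3                     # the codon position cycles 0, 1, 2, ...
        logp += math.log(T[codon_pos][seq[i - 1]][seq[i]])
    return logp
```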