Page 1:

Lecture #8:

- Parameter Estimation for HMM with Hidden States: the Baum-Welch Training
- Viterbi Training

- Extensions of HMM

Background Readings: Chapters 3.3, 11.2 in the text book, Biological Sequence Analysis, Durbin et al., 2001.

Shlomo Moran, following Dan Geiger and Nir Friedman

Page 2:

Parameter Estimation in HMM: What we saw until now

Page 3:

The Parameters

To determine the values of the parameters θ, use a training set {x1,...,xn}, where each xj is a sequence which is assumed to fit the model.

An HMM model is defined by the probability parameters: mkl and ek(b), for all states k,l and all symbols b.

θ denotes the collection of these parameters.

Page 4:

Maximum Likelihood Parameter Estimation for HMM

looks for θ which maximizes:

p(x1,..., xn|θ) = ∏j p(xj|θ).

Page 5:

Finding ML parameters for HMM when all states are known:

Let Mkl = #(transitions from k to l) in the training set, and Ek(b) = #(emissions of symbol b from state k) in the training set. We look for parameters θ = {mkl, ek(b)} that:

Maximize   ∏(k,l) mkl^Mkl · ∏(k,b) ek(b)^Ek(b)

Subject to: for all states k,  ∑l mkl = 1  and  ∑b ek(b) = 1.

The optimal ML parameters θ are given by:

mkl = Mkl / ∑l' Mkl'   and   ek(b) = Ek(b) / ∑b' Ek(b')
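To make the counting estimate concrete, here is a minimal sketch (my own illustration, not code from the lecture; the function and variable names are hypothetical) that computes mkl and ek(b) from a training set in which the state paths are known:

```python
from collections import defaultdict

def ml_estimate(training_set):
    """training_set: a list of (states, symbols) pairs with known state paths."""
    M = defaultdict(float)   # M[(k, l)] = number of k -> l transitions
    E = defaultdict(float)   # E[(k, b)] = number of emissions of b from state k
    for states, symbols in training_set:
        for k, l in zip(states, states[1:]):
            M[(k, l)] += 1
        for k, b in zip(states, symbols):
            E[(k, b)] += 1
    # Normalize each row, exactly as in the ML formulas above.
    m = {(k, l): c / sum(c2 for (k2, _), c2 in M.items() if k2 == k)
         for (k, l), c in M.items()}
    e = {(k, b): c / sum(c2 for (k2, _), c2 in E.items() if k2 == k)
         for (k, b), c in E.items()}
    return m, e
```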

Page 6:

Case 2: State paths are unknown

We need to maximize p(x|θ) = ∑s p(x,s|θ), where the summation is over all state paths s which produce the output sequence x. Finding θ* which maximizes ∑s p(x,s|θ) is hard. [Unlike finding θ* which maximizes p(x,s|θ) for a single sequence (x,s).]

[Figure: the HMM chain s1 → s2 → … → sL, where each si emits xi]

Page 7:

Parameter Estimation when States are Unknown

The general process for finding θ in this case is:
1. Start with an initial value of θ.
2. Find θ* so that p(x|θ*) > p(x|θ).
3. Set θ = θ*.
4. Repeat until some convergence criterion is met.

A general algorithm of this type is the Expectation Maximization algorithm, which we will meet later. For the specific case of HMM, it is the Baum-Welch training.

Page 8:

Baum Welch Training

We start with some values of mkl and ek(b), which define prior values of θ. Then we use an iterative algorithm which attempts to replace θ by a θ* s.t.

p(x|θ*) > p(x|θ).
This is done by “imitating” the algorithm for Case 1, where all states are known:


Page 9:

When states were known, we counted…

In case 1 we computed the optimal values of mkl and ek(b), (for the optimal θ) by simply counting the number Mkl of transitions from state k to state l, and the number Ek(b) of emissions of symbol b from state k, in the training set. This was possible since we knew all the states.

[Figure: a transition si-1 = k → si = l, with emissions xi-1 = b and xi = c]

Page 10:

When states are unknown, Mkl and Ek(b) are taken as averages:

Mkl and Ek(b) are computed according to the current distribution θ, that is:


Mkl = ∑s Mkl(s)·p(s|x,θ), where Mkl(s) is the number of k → l transitions in the state path s.

Similarly, Ek(b) = ∑s Ek(b,s)·p(s|x,θ), where Ek(b,s) is the number of times k emits b when the path s produces the output x.

Page 11:

Computing averages of state-transitions:

Since the number of state paths s is exponential in L, it is too expensive to compute Mkl = ∑s Mkl(s)·p(s|x,θ) in the naïve way. Hence, we use dynamic programming: for each pair (k,l) and for each edge si-1 → si we compute the average number of “k to l” transitions over this edge. Then we take Mkl to be the sum over all edges.

[Figure: an edge si-1 → si with unknown states, emitting xi-1 = b and xi = c]

Page 12:

…and of Letter-Emissions

Similarly, for each emission edge si → xi where xi = b, and each state k, we compute the average number of times that si = k, which is the expected number of “k → b” emissions on this edge. Then we take Ek(b) to be the sum over all such edges. These expected values are computed by assuming the current parameters θ:


Page 13:

Baum Welch: step E for Mkl

Count the average number of state transitions

For computing the averages, Baum Welch computes for each index i and states k,l, the following probability:


P(si-1=k, si=l | x,θ)

For this, it uses the forward and backward algorithms.

Page 14:

Baum Welch: Step E for Mkl

Claim: By the probability distribution of the HMM,

p(si-1=k, si=l | x, θ) = Fk(i-1)·mkl·el(xi)·Bl(i) / p(x|θ)

(mkl and el(xi) are the parameters defined by θ, and Fk(i-1), Bl(i) are the outputs of the forward / backward algorithms.)
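Once F, B and p(x|θ) are available, this probability can be evaluated directly. A minimal sketch, under my own indexing assumptions (F[i][k] = Fk(i) with F[0] holding the begin-state values, B[i][l] = Bl(i), x[i] the i-th symbol, parameters stored as m[(k,l)] and e[(l,b)], px = p(x|θ)):

```python
def transition_posterior(i, k, l, x, F, B, m, e, px):
    """P(s_{i-1} = k, s_i = l | x, theta), as in the claim above."""
    return F[i - 1][k] * m[(k, l)] * e[(l, x[i])] * B[i][l] / px
```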

Page 15:

Proof of claim

P(x1,…,xL, si-1=k, si=l | θ) = P(x1,…,xi-1, si-1=k | θ) · mkl·el(xi) · P(xi+1,…,xL | si=l, θ)

= Fk(i-1) · mkl·el(xi) · Bl(i)

(The first factor is computed by the forward algorithm and the last factor by the backward algorithm.)

Dividing by p(x|θ) gives:

p(si-1=k, si=l | x, θ) = Fk(i-1) · mkl·el(xi) · Bl(i) / p(x|θ)

Page 16:

Step E for Mkl (end)

For each pair (k,l), compute the expected number of state transitions from k to l, as the sum of the expected number of k to l transitions over all L edges:

Mkl = (1/p(x|θ)) · ∑i=1..L p(si-1=k, si=l, x | θ)
    = (1/p(x|θ)) · ∑i=1..L Fk(i-1)·mkl·el(xi)·Bl(i)
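A sketch of this accumulation for a single sequence, under the same assumed conventions as before (F[i][k] = Fk(i) with F[0] holding the begin-state values, B[i][l] = Bl(i), x[i] the i-th symbol, px = p(x|θ)):

```python
def expected_transitions(x, L, F, B, m, e, px, states):
    """E-step accumulation of Mkl for one sequence, summing over the L edges."""
    M = {(k, l): 0.0 for k in states for l in states}
    for i in range(1, L + 1):
        for k in states:
            for l in states:
                M[(k, l)] += F[i - 1][k] * m[(k, l)] * e[(l, x[i])] * B[i][l]
    for kl in M:
        M[kl] /= px
    return M
```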

Page 17:

Step E for Mkl , with many sequences:

Claim: When we have n independent input sequences (x1,..., xn ) of lengths L1 .. Ln , then Mkl is given by:

Mkl = ∑j=1..n (1/p(xj|θ)) · ∑i=1..Lj p(si-1=k, si=l, xj | θ)
    = ∑j=1..n (1/p(xj|θ)) · ∑i=1..Lj Fjk(i-1)·mkl·el(xji)·Bjl(i)

where Fjk and Bjl are the forward and backward values computed for xj under θ, and xji is the i-th symbol of xj.

Page 18:

Proof of Claim: When we have n independent input sequences (x1,..., xn), the probability space is the product of n spaces:

{ ((s1,x1), (s2,x2), .., (sn,xn)) },  where for j = 1,..,n, (sj, xj) is an HMM (state path and output) of length Lj, and sj ranges over all possible state paths of length Lj.

The probability of a simple event in this space with parameters θ is given by:

p( (s1,x1), (s2,x2), .., (sn,xn) | θ ) = ∏j=1..n p( (sj,xj) | θ )

Page 19:

Proof of Claim (cont):

The probability of that simple event given x=(x1 ,..,xn ):

p( (s1,x1), (s2,x2), .., (sn,xn) | x, θ ) = ∏j=1..n [ p( (sj,xj) | θ ) / p( xj | θ ) ]

The probability of the compound event (sj,xj ) given x=(x1 ,..,xn ):

p( (sj,xj) | x, θ ) = p( (sj,xj) | θ ) / p( xj | θ )

Page 20:

Proof of Claim (end):

By the same argument used for a single sequence, we get that Mjkl, the contribution of xj to Mkl, is given by

Mjkl = (1/p(xj|θ)) · ∑i=1..Lj p(si-1=k, si=l, xj | θ),

and the claim follows by taking the sum over all j.

Page 21:

Baum-Welch: Step E for Ek(b): count the expected number of letter emissions

For each state k and each symbol b, and for each i where xi = b, compute the expected number of times that si = k.


p(si=k | x1,…,xL) = p(x1,…,xL, si=k) / p(x1,…,xL) = Fk(i)·Bk(i) / p(x1,…,xL)

Page 22:

Baum-Welch: Step E for Ek(b)

For each state k and each symbol b, compute the expected number of emissions of b from k as the sum of the expected number of times that si = k, over all i’s for which xi = b.

Ek(b) = (1/p(x|θ)) · ∑{i : xi=b} Fk(i)·Bk(i)
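The corresponding accumulation in code, a sketch under the same assumed conventions (F[i][k] = Fk(i), B[i][k] = Bk(i), x[i] the i-th symbol, px = p(x|θ)):

```python
def expected_emissions(x, L, F, B, px, states, alphabet):
    """E-step accumulation of Ek(b) for one sequence."""
    E = {(k, b): 0.0 for k in states for b in alphabet}
    for i in range(1, L + 1):
        for k in states:
            E[(k, x[i])] += F[i][k] * B[i][k] / px
    return E
```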

Page 23:

Step E for Ek(b), many sequences

Exercise: when we have n sequences (x1,..., xn ), the expected number of emissions of b from k is given by:

Ek(b) = ∑j=1..n (1/p(xj|θ)) · ∑{i : xji=b} Fjk(i)·Bjk(i)

Page 24:

Summary: the E part of the Baum Welch training

This part computes the expected numbers Mkl of k→l transitions for all pairs of states k and l, and the expected numbers Ek(b) of emissions of symbol b from state k, for all states k and symbols b.

The next step is the M step, which is identical to the computation of optimal ML parameters when all states are known.

Page 25:

Baum-Welch: step M

mkl = Mkl / ∑l' Mkl'   and   ek(b) = Ek(b) / ∑b' Ek(b')

Use the Mkl’s, Ek(b)’s to compute the new values of mkl and ek(b). These values define θ*.

The correctness of the EM algorithm implies that, if θ* ≠ θ, then:

p(x1,..., xn|θ*) > p(x1,..., xn|θ), i.e., θ* increases the probability of the data, unless it is equal to θ. (This will follow from the correctness of the EM algorithm, to be proved later.)

This procedure is iterated, until some convergence criterion is met.
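The M step itself is easy to write down. Here is a minimal sketch (my own) that turns the accumulated expectations into new parameters, exactly as in the normalization formulas above:

```python
def m_step(M, E):
    """Turn expected counts M[(k, l)] and E[(k, b)] into new parameters.

    In practice pseudocounts are usually added first, so that no row sums to zero.
    """
    row_M, row_E = {}, {}
    for (k, _), c in M.items():
        row_M[k] = row_M.get(k, 0.0) + c
    for (k, _), c in E.items():
        row_E[k] = row_E.get(k, 0.0) + c
    m = {(k, l): c / row_M[k] for (k, l), c in M.items()}
    e = {(k, b): c / row_E[k] for (k, b), c in E.items()}
    return m, e
```

One Baum-Welch iteration is then: run the E-step accumulations (the expected_transitions and expected_emissions sketches, summed over all training sequences), apply m_step, and repeat while the likelihood keeps increasing.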

Page 26:

Viterbi training: Maximizing the probability of the most probable path

Page 27:

Assume that rather than finding θ which maximizes the likelihood of the input x1,..,xn, we wish to maximize the probability of a most probable path, i.e. to find parameters θ and state paths s(x1),..,s(xn) such that the value of

p(s(x1),..,s(xn), x1,..,xn | θ)

is maximized. Clearly, s(xj) should be the most probable path for xj under the parameters θ. We assume only one sequence (n=1).

This is done by Viterbi Training

Page 28:

Viterbi training

Start from given values of mkl and ek(b), which define prior values of θ. Each iteration consists of:
Step 1: Use Viterbi's algorithm to find a most probable path s(x), which maximizes p(s(x), x|θ).


Page 29:

Viterbi training (cont)

Step 2. Use the ML method for HMM when the states are known to find θ* which maximizes p(s(x), x|θ*).

Note: If after Step 2 we have p(s(x), x|θ*) = p(s(x), x|θ), then it must be that θ = θ*. In this case the next iteration will be identical to the current one, and hence we may terminate the algorithm.


Page 30:

Viterbi training (cont)

Step 3. If θ ≠ θ*, set θ ← θ* and repeat.


If θ=θ* , stop.
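A compact sketch of the whole procedure (my own illustration; it reuses the ml_estimate counting sketch from the known-states case, works with plain probabilities rather than log-space, and keeps the initial-state distribution fixed, so for real data one would add pseudocounts and logarithms):

```python
def viterbi_path(x, states, m, e, initial):
    """A most probable state path for the output x.

    Assumptions: x is a string or list of symbols, m[(k, l)] and e[(k, b)] are the
    current parameters, and initial[k] is an initial state distribution."""
    V = [{k: initial[k] * e.get((k, x[0]), 0.0) for k in states}]
    back = []
    for b in x[1:]:
        prev, col, ptr = V[-1], {}, {}
        for l in states:
            k_best = max(states, key=lambda k: prev[k] * m.get((k, l), 0.0))
            col[l] = prev[k_best] * m.get((k_best, l), 0.0) * e.get((l, b), 0.0)
            ptr[l] = k_best
        V.append(col)
        back.append(ptr)
    last = max(states, key=lambda k: V[-1][k])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

def viterbi_training(x, states, m, e, initial, max_iter=50):
    for _ in range(max_iter):
        s = viterbi_path(x, states, m, e, initial)   # Step 1: most probable path
        m_new, e_new = ml_estimate([(s, list(x))])   # Step 2: counting ML estimate
        if m_new == m and e_new == e:                # Step 3: parameters unchanged
            break
        m, e = m_new, e_new
    return m, e
```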

Page 31:

Extensions of HMM

Page 32:

1. Monitoring probabilities of repetitions

Markov chains are rather limited in describing sequences of symbols with non-random structures. For instance, a Markov chain forces the distribution of segments in which some state is repeated k+1 times to be (1-p)·p^k, for some p.

[Figure: a state A with a self-loop, producing runs A A A A …]

By adding states we may bypass this restriction:

Page 33:

1. State duplications

An extension of Markov chains which allows the distribution of segments in which a state is repeated k+1 times to have any desired value:

Assign k+1 states to represent the same “real” state. This may model k repetitions (or less) with any desired probability.

[Figure: duplicated states A1 → A2 → A3 → A4, all representing the same real state A]
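A tiny numerical illustration (my own, with arbitrary numbers): with four copies A1 → A2 → A3 → A4 of the same real state, the probabilities of continuing along the chain can be chosen to realize any run-length distribution on {1, 2, 3, 4}:

```python
# p[i] is the assumed probability of moving from Ai to A(i+1);
# otherwise the run ends after Ai (A4 always ends the run).
p = [0.9, 0.5, 0.2]

stay, pmf = 1.0, {}
for length, p_next in enumerate(p, start=1):
    pmf[length] = stay * (1.0 - p_next)   # the run ends after the `length`-th copy of A
    stay *= p_next
pmf[4] = stay                             # the run reached A4

print(pmf)   # any pmf on run lengths 1..4 is achievable by choosing p;
             # a single self-looping state could only give (1-p)*p**k
```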

Page 34:

2. Silent states
- States which do not emit symbols.
- Can be used to model repetitions.
- Also used to allow arbitrary jumps (may be used to model deletions).
- Need to generalize the Forward and Backward algorithms to arbitrary acyclic digraphs, to account for the silent states:

[Figure: a layer of silent states above a chain of regular (emitting) states]

Page 35:

E.g., the forward algorithm should look like this:

Directed cycles of silent (or other) states complicate things, and should be avoided.

For a regular vertex v of states, which emits the symbol xv:

Fl(v) = el(xv) · ∑{u: (u,v)∈E} ∑k Fk(u)·mkl

and for a vertex z of silent states:

Fl(z) = ∑{u: (u,z)∈E} ∑k Fk(u)·mkl

[Figure: an acyclic digraph with a layer of silent states and a layer of regular states above the symbols]
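A sketch of this generalized forward pass (my own illustration, not the lecture's code), assuming the vertices of the digraph are given in topological order:

```python
def forward_dag(vertices, pred, emit, m, e, n_states, start):
    """Forward values F[v][l] on an acyclic digraph with silent vertices.

    Assumptions: `vertices` is in topological order, `pred[v]` lists the
    predecessors of v, `emit[v]` is the symbol of a regular vertex (None if v is
    silent), m[k][l] and e[l][b] are the HMM parameters, and `start` is a source
    vertex given a uniform state distribution."""
    F = {v: [0.0] * n_states for v in vertices}
    F[start] = [1.0 / n_states] * n_states
    for v in vertices:
        if v == start:
            continue
        for l in range(n_states):
            total = sum(F[u][k] * m[k][l]
                        for u in pred[v] for k in range(n_states))
            # silent vertex: no emission factor; regular vertex: multiply by e_l(x_v)
            F[v][l] = total if emit[v] is None else e[l][emit[v]] * total
    return F
```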

Page 36:

3. High Order Markov Chains

Markov chains in which the transition probabilities depend on the last k states:

P(xi|xi-1,...,x1) = P(xi|xi-1,...,xi-k)

Can be represented by a standard Markov chain with more states, e.g. for k = 2:

[Figure: the four pair states AA, AB, BA, BB and the transitions between them]
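A small sketch (my own, with toy numbers) of this construction for k = 2 over the alphabet {A, B}: every second-order probability P(xi | xi-2, xi-1) becomes a first-order transition between pair states:

```python
from itertools import product

alphabet = "AB"
# p2[(a, b)][c] is an assumed 2nd-order probability P(x_i = c | x_{i-2} = a, x_{i-1} = b)
p2 = {(a, b): {"A": 0.7, "B": 0.3} for a, b in product(alphabet, alphabet)}

# First-order transitions between pair states: (a, b) -> (b, c) with prob P(c | a, b);
# all other pair transitions have probability 0.
first_order = {}
for (a, b), dist in p2.items():
    for c, prob in dist.items():
        first_order[(a + b, b + c)] = prob

print(first_order[("AB", "BA")])   # = P(x_i = A | x_{i-2} = A, x_{i-1} = B) = 0.7
```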

Page 37:

4. Inhomogeneous Markov Chains

An important task in analyzing DNA sequences is recognizing the genes which code for proteins.

A triplet of 3 nucleotides – a codon – codes for an amino acid. It is known that in parts of DNA which code for genes, the three codon positions have different statistics. Thus a Markov chain model for DNA should represent not only the nucleotide (A, C, G or T), but also its position within the codon – the same nucleotide in different positions will have different transition probabilities. Used in the GENEMARK gene-finding program (1993).
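A minimal sketch (my own illustration, not the GENEMARK implementation) of such an inhomogeneous chain: the transition table used at position i is selected by the codon position i mod 3:

```python
import math

def log_prob_coding(seq, T, initial):
    """Score a DNA string under a codon-position-dependent Markov chain.

    Assumptions: T[c][prev][next] is a transition probability for codon position
    c in {0, 1, 2}, and initial[base] is a distribution for the first nucleotide."""
    logp = math.log(initial[seq[0]])
    for i in range(1, len(seq)):
        codon_pos = i % 3                     # the codon position cycles 0, 1, 2, ...
        logp += math.log(T[codon_pos][seq[i - 1]][seq[i]])
    return logp
```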