
Parameter Estimation For HMM




Page 1: Parameter Estimation For HMM


Parameter Estimation For HMM

Background Readings: Chapter 3.3 in the book, Biological Sequence Analysis, Durbin et al., 2001.

Page 2: Parameter Estimation For HMM


Reminder: Hidden Markov Model

Markov chain transition probabilities: p(S_{i+1} = t | S_i = s) = a_st

Emission probabilities: p(X_i = b | S_i = s) = e_s(b)

[Figure: the HMM as a chain of hidden states S_1, S_2, …, S_{L-1}, S_L, where each state S_i emits the symbol x_i.]

The joint probability of a state path s and an output sequence x is:

$$p(s, x) = p(s_1,\dots,s_L;\, x_1,\dots,x_L) = \prod_{i=1}^{L} a_{s_{i-1} s_i}\, e_{s_i}(x_i)$$

Page 3: Parameter Estimation For HMM

Reminder: Most probable state path

Given an output sequence x = (x1,…,xL),

a most probable path s* = (s*_1,…,s*_L) is one which maximizes p(s|x):

$$s^* = \underset{(s_1,\dots,s_L)}{\arg\max}\; p(s_1,\dots,s_L \mid x_1,\dots,x_L)$$

Page 4: Parameter Estimation For HMM

Reminder: Viterbi's algorithm for most probable path

Initialization: v_0(0) = 1, v_k(0) = 0 for k > 0. (We add the special initial state 0.)

For i = 1 to L do, for each state l:

$$v_l(i) = e_l(x_i)\, \max_k \{ v_k(i-1)\, a_{kl} \}$$

$$ptr_i(l) = \underset{k}{\arg\max}\, \{ v_k(i-1)\, a_{kl} \} \quad \text{[storing the previous state for reconstructing the path]}$$

Termination: the probability of the most probable path is

$$p(s_1^*,\dots,s_L^*;\, x_1,\dots,x_L) = \max_k \{ v_k(L) \}$$
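The recursion above translates directly into code. Below is a minimal Python sketch (the function name and the nested-dict layout of `a` and `e` are illustrative assumptions; `a` must include a row for the special initial state 0, and a real implementation would work in log space to avoid numerical underflow):

```python
def viterbi(x, states, a, e):
    """Most probable state path for the observed sequence x.
    states: the non-initial states; 0 is the special initial state.
    a[k][l]: transition probability from state k to state l.
    e[l][b]: emission probability of symbol b from state l."""
    v_prev = {k: (1.0 if k == 0 else 0.0) for k in [0] + list(states)}
    ptr = []                                   # ptr[i][l]: best predecessor of state l at step i+1
    for xi in x:
        v_cur, ptr_i = {0: 0.0}, {}
        for l in states:
            # choose the predecessor k maximizing v_k(i-1) * a_kl
            k_best = max(v_prev, key=lambda k: v_prev[k] * a[k].get(l, 0.0))
            v_cur[l] = e[l][xi] * v_prev[k_best] * a[k_best].get(l, 0.0)
            ptr_i[l] = k_best
        ptr.append(ptr_i)
        v_prev = v_cur
    last = max(states, key=lambda k: v_prev[k])  # termination: max_k v_k(L)
    path = [last]
    for ptr_i in reversed(ptr[1:]):              # reconstruct the path from the stored pointers
        path.append(ptr_i[path[-1]])
    path.reverse()
    return path, v_prev[last]
```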

Page 5: Parameter Estimation For HMM


Predicting CpG islands via most probable path:

Output symbols: A, C, G, T (4 letters). Markov chain states: 4 "-" states and 4 "+" states, two for each letter (8 states in total). The transition probabilities a_st and emission probabilities e_k(b) will be discussed soon. The most probable path found by Viterbi's algorithm predicts the CpG islands.

An experiment (Durbin et al., pp. 60-61) shows that the predicted islands are shorter than the assumed ones. In addition, quite a few "false negatives" are found.

Page 6: Parameter Estimation For HMM


Reminder: finding most probable state

1. The forward algorithm finds f_k(i) = P(x_1,…,x_i, s_i=k) for k = 1,…,m.
2. The backward algorithm finds b_k(i) = P(x_{i+1},…,x_L | s_i=k) for k = 1,…,m.
3. Return p(s_i=k | x) = f_k(i) b_k(i) / p(x) for k = 1,…,m, where p(x) = ∑_k f_k(L).

To compute p(s_i=k | x) for every i, simply run the forward and backward algorithms once, and compute f_k(i) b_k(i) for every i and k.


f_l(i) = p(x_1,…,x_i, s_i=l): the probability of a path which emits (x_1,…,x_i) and in which s_i=l. b_l(i) = p(x_{i+1},…,x_L | s_i=l): the probability that a path whose state at step i is l goes on to emit (x_{i+1},…,x_L).
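Both passes, and the posterior computation built from them, can be sketched in a few lines of Python (same assumed nested-dict parameter layout as in the Viterbi sketch above; again, a real implementation uses log space or scaling to avoid underflow):

```python
def forward(x, states, a, e):
    """f[i][l] = f_l(i+1) = p(x_1..x_{i+1}, s_{i+1} = l); 0 is the silent initial state."""
    f = [{0: 1.0, **{l: 0.0 for l in states}}]
    for xi in x:
        prev = f[-1]
        f.append({l: e[l][xi] * sum(prev[k] * a[k].get(l, 0.0) for k in prev)
                  for l in states})
    return f[1:]                                   # drop the i = 0 column

def backward(x, states, a, e):
    """b[i][k] = b_k(i+1) = p(x_{i+2}..x_L | s_{i+1} = k)."""
    b = [{k: 1.0 for k in states}]                 # b_k(L) = 1
    for i in range(len(x) - 1, 0, -1):
        nxt = b[0]
        b.insert(0, {k: sum(a[k].get(l, 0.0) * e[l][x[i]] * nxt[l] for l in states)
                     for k in states})
    return b

def state_posteriors(x, states, a, e):
    """p(s_i = k | x) = f_k(i) * b_k(i) / p(x), for every position i and state k."""
    f, b = forward(x, states, a, e), backward(x, states, a, e)
    px = sum(f[-1][k] for k in states)             # p(x) = sum_k f_k(L)
    return [{k: f[i][k] * b[i][k] / px for k in states} for i in range(len(x))]
```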

Page 7: Parameter Estimation For HMM


Finding the probability that a letter is in a CpG island via the algorithm for most probable state:

The probability that an occurrence of G is in a CpG island (+ state) is:

$$\sum_{s^+} p(S_i = s^+ \mid x) \;=\; \frac{1}{p(x)} \sum_{s^+} f_{s^+}(i)\, b_{s^+}(i)$$

where the summation is formally over the four "+" states, but actually only the state G+ needs to be considered (why?).


Page 8: Parameter Estimation For HMM


Parameter Estimation for HMM

An HMM is defined by the parameters a_kl and e_k(b), for all states k, l and all symbols b. Let θ denote the collection of these parameters.


Page 9: Parameter Estimation For HMM


Parameter Estimation for HMM

To determine the values of (the parameters in) θ, use a training set = {x^1,...,x^n}, where each x^j is a sequence which is assumed to fit the model.

Given the parameters θ, each sequence x^j has an assigned probability p(x^j|θ) (or p(x^j|θ, HMM)).


Page 10: Parameter Estimation For HMM


ML Parameter Estimation for HMM

The elements of the training set {x^1,...,x^n} are assumed to be independent, so p(x^1,...,x^n|θ) = ∏_j p(x^j|θ).

ML parameter estimation looks for θ which maximizes the above.

The exact method for finding or approximating this θ depends on the nature of the training set used.

Page 11: Parameter Estimation For HMM


Data for HMM

Possible properties of (the sequences in) the training set:
1. For each x^j, what is our information on the states s_i? (The symbols x_i are usually known.)
2. The size (number of sequences) of the training set.


Page 12: Parameter Estimation For HMM


Case 1: Sequences are fully known

We know the complete structure of each sequence in the training set {x^1,...,x^n}, i.e., both the state paths and the emitted symbols. We wish to estimate a_kl and e_k(b) for all pairs of states k, l and symbols b.

By the ML method, we look for the parameters θ* which maximize the probability of the sample set: p(x^1,...,x^n|θ*) = max_θ p(x^1,...,x^n|θ).


Page 13: Parameter Estimation For HMM


Case 1: ML method


For each x^j we have:

$$p(x^j \mid \theta) = \prod_{i=1}^{L} a_{s_{i-1} s_i}\, e_{s_i}(x_i^j)$$

Let m_kl = |{i : s_{i-1}=k, s_i=l}| and m_k(b) = |{i : s_i=k, x_i=b}| (both counted in x^j). Then:

$$p(x^j \mid \theta) = \prod_{(k,l)} a_{kl}^{m_{kl}} \prod_{(k,b)} [e_k(b)]^{m_k(b)}$$

Page 14: Parameter Estimation For HMM


Case 1 (cont)


By the independence of the x^j's, p(x^1,...,x^n|θ) = ∏_j p(x^j|θ).

Thus, if Akl = #(transitions from k to l) in the training set, and Ek(b) = #(emissions of symbol b from state k) in the training set, we have:

$$p(x^1,\dots,x^n \mid \theta) = \prod_{(k,l)} a_{kl}^{A_{kl}} \prod_{(k,b)} [e_k(b)]^{E_k(b)}$$

Page 15: Parameter Estimation For HMM


Case 1 (cont)

So we need to find the a_kl's and e_k(b)'s which maximize:

$$\prod_{(k,l)} a_{kl}^{A_{kl}} \prod_{(k,b)} [e_k(b)]^{E_k(b)}$$

Subject to: for all states k,

$$\sum_{l} a_{kl} = 1 \quad \text{and} \quad \sum_{b} e_k(b) = 1 \quad [\text{and } a_{kl}, e_k(b) \ge 0]$$

Page 16: Parameter Estimation For HMM


Case 1 (cont)

Rewriting, we need to maximize:

$$F = \prod_{(k,l)} a_{kl}^{A_{kl}} \prod_{(k,b)} [e_k(b)]^{E_k(b)} = \Big[ \prod_k \prod_l a_{kl}^{A_{kl}} \Big] \Big[ \prod_k \prod_b [e_k(b)]^{E_k(b)} \Big]$$

Subject to: for all k, ∑_l a_kl = 1 and ∑_b e_k(b) = 1.

Page 17: Parameter Estimation For HMM


Case 1 (cont)

If for each k we maximize

$$\prod_l a_{kl}^{A_{kl}} \ \text{ s.t. } \sum_l a_{kl} = 1, \qquad \text{and also} \qquad \prod_b [e_k(b)]^{E_k(b)} \ \text{ s.t. } \sum_b e_k(b) = 1,$$

then we will also maximize F. Each of the above is a simpler ML problem, which is treated next.

Page 18: Parameter Estimation For HMM


A simpler case: ML parameter estimation for a die

Let X be a random variable with 6 values x_1,…,x_6 denoting the six outcomes of a die. Here the parameters are θ = {θ_1, θ_2, θ_3, θ_4, θ_5, θ_6}, with ∑ θ_i = 1.

Assume that the data is one sequence:

Data = (x_6, x_1, x_1, x_3, x_2, x_2, x_3, x_4, x_5, x_2, x_6)

So we have to maximize

$$P(Data \mid \theta) = \theta_1^2\, \theta_2^3\, \theta_3^2\, \theta_4\, \theta_5\, \theta_6^2$$

Subject to: θ_1 + θ_2 + θ_3 + θ_4 + θ_5 + θ_6 = 1 [and θ_i ≥ 0], i.e.,

$$P(Data \mid \theta) = \theta_1^2\, \theta_2^3\, \theta_3^2\, \theta_4\, \theta_5\, \Big(1 - \sum_{i=1}^{5} \theta_i\Big)^2$$

Page 19: Parameter Estimation For HMM


Side comment: Sufficient Statistics

To compute the probability of the data in the die example, we only need to record the number of times N_i the die fell on side i (namely N_1, N_2,…,N_6). We do not need to recall the entire sequence of outcomes.

{N_i | i = 1,…,6} is called a sufficient statistic for multinomial sampling:

$$P(Data \mid \theta) = \theta_1^{N_1}\, \theta_2^{N_2}\, \theta_3^{N_3}\, \theta_4^{N_4}\, \theta_5^{N_5}\, \Big(1 - \sum_{i=1}^{5} \theta_i\Big)^{N_6}$$

Page 20: Parameter Estimation For HMM


Sufficient Statistics

A sufficient statistic is a function of the data that summarizes the relevant information for the likelihood.

Formally, s(Data) is a sufficient statistic if for any two datasets Data and Data':

s(Data) = s(Data') ⟹ P(Data|θ) = P(Data'|θ)
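For example, the die data above and any reordering of it, such as (x_1, x_1, x_2, x_2, x_2, x_3, x_3, x_4, x_5, x_6, x_6), share the same counts (N_1,…,N_6) = (2, 3, 2, 1, 1, 2), and therefore have the same likelihood for every θ.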


Exercise: Define a sufficient statistic for the HMM model.

Page 21: Parameter Estimation For HMM


Maximum Likelihood Estimate

By the ML approach, we look for parameters that maximize the probability of the data (i.e., the likelihood function). Usually one maximizes the log-likelihood function instead, which is easier to do and gives an identical answer:

$$\log P(Data \mid \theta) = N_1 \log\theta_1 + N_2 \log\theta_2 + N_3 \log\theta_3 + N_4 \log\theta_4 + N_5 \log\theta_5 + N_6 \log\Big(1 - \sum_{i=1}^{5} \theta_i\Big) = \sum_{i=1}^{5} N_i \log\theta_i + N_6 \log\Big(1 - \sum_{i=1}^{5} \theta_i\Big)$$

A necessary condition for a maximum is, for j = 1,…,5:

$$\frac{\partial \log P(Data \mid \theta)}{\partial \theta_j} = \frac{N_j}{\theta_j} - \frac{N_6}{1 - \sum_{i=1}^{5} \theta_i} = 0$$

Page 22: Parameter Estimation For HMM


Finding the Maximum

Rearranging terms:

$$\frac{N_j}{\theta_j} = \frac{N_6}{1 - \sum_{i=1}^{5} \theta_i} = \frac{N_6}{\theta_6}$$

Divide the j-th equation by the i-th equation:

$$\frac{\theta_j}{\theta_i} = \frac{N_j}{N_i}$$

Sum from j = 1 to 6:

$$\frac{1}{\theta_i} = \frac{\sum_{j=1}^{6} N_j}{N_i} = \frac{N}{N_i}$$

Hence the MLE is given by:

$$\theta_i = \frac{N_i}{N}, \quad i = 1,\dots,6$$
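As a worked example, for the die data above we have (N_1,…,N_6) = (2, 3, 2, 1, 1, 2) and N = 11, so the MLE is θ̂ = (2/11, 3/11, 2/11, 1/11, 1/11, 2/11).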

Page 23: Parameter Estimation For HMM


Generalization to a distribution with any number k of outcomes

Let X be a random variable with k values x_1,…,x_k denoting the k outcomes of iid experiments, with parameters θ = {θ_1, θ_2,...,θ_k} (θ_i is the probability of x_i). Again, the data is one sequence of length n:

Data = (x_{i_1}, x_{i_2},...,x_{i_n})

Then we have to maximize

$$P(Data \mid \theta) = \theta_1^{n_1}\, \theta_2^{n_2} \cdots \theta_k^{n_k}, \quad (n_1 + \dots + n_k = n)$$

Subject to: θ_1 + θ_2 + … + θ_k = 1, i.e.,

$$P(Data \mid \theta) = \theta_1^{n_1} \cdots \theta_{k-1}^{n_{k-1}} \Big(1 - \sum_{i=1}^{k-1} \theta_i\Big)^{n_k}$$

Page 24: Parameter Estimation For HMM


Generalization for k outcomes (cont.)

By treatment identical to the die case, the maximum is obtained when for all i:

$$\frac{\theta_i}{n_i} = \frac{\theta_k}{n_k}$$

Hence the MLE is given by:

$$\theta_i = \frac{n_i}{n}, \quad i = 1,\dots,k$$

Page 25: Parameter Estimation For HMM


Fractional Exponents

Some models allow n_i's which are not integers (e.g., when we are uncertain of a die outcome, and consider it "6" with 20% confidence and "5" with 80%). We still can have

$$P(Data \mid \theta) = \theta_1^{n_1}\, \theta_2^{n_2} \cdots \theta_k^{n_k}, \quad (n_1 + \dots + n_k = n)$$

and the same analysis yields:

$$\theta_i = \frac{n_i}{n}, \quad i = 1,\dots,k$$

Page 26: Parameter Estimation For HMM


Apply the ML method to HMM

Let A_kl = #(transitions from k to l) in the training set, and E_k(b) = #(emissions of symbol b from state k) in the training set. We need to:


Maximize

$$\prod_{(k,l)} a_{kl}^{A_{kl}} \prod_{(k,b)} [e_k(b)]^{E_k(b)}$$

Subject to: for all states k, ∑_l a_kl = 1 and ∑_b e_k(b) = 1.

Page 27: Parameter Estimation For HMM


Apply to HMM (cont.)

We apply the previous technique to get, for each k, the parameters {a_kl | l = 1,…,m} and {e_k(b) | b ∈ Σ}:


$$a_{kl} = \frac{A_{kl}}{\sum_{l'} A_{kl'}}, \qquad e_k(b) = \frac{E_k(b)}{\sum_{b'} E_k(b')}$$

which gives the optimal ML parameters.
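With fully labeled sequences this estimator is just counting and normalizing. A minimal Python sketch, assuming each training example is a (state path, emitted symbols) pair of equal length (function and variable names are illustrative):

```python
from collections import defaultdict

def count_stats(training_set):
    """Sufficient statistics: A[k][l] = A_kl transition counts,
    E[k][b] = E_k(b) emission counts."""
    A = defaultdict(lambda: defaultdict(float))
    E = defaultdict(lambda: defaultdict(float))
    for path, symbols in training_set:
        for k, l in zip(path, path[1:]):       # transitions k -> l along the path
            A[k][l] += 1
        for k, b in zip(path, symbols):        # emissions of b from state k
            E[k][b] += 1
    return A, E

def normalize(counts):
    """Turn each row of counts into a probability distribution."""
    return {k: {v: c / sum(row.values()) for v, c in row.items()}
            for k, row in counts.items()}

def ml_estimate(training_set):
    A, E = count_stats(training_set)
    return normalize(A), normalize(E)          # a_kl = A_kl / sum_l' A_kl', etc.
```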

Page 28: Parameter Estimation For HMM


Adding pseudocounts in HMM

If the sample set is too small, we may get a biased result. In this case we modify the actual counts by our prior knowledge/belief: r_kl is our prior belief on transitions from k to l, and r_k(b) is our prior belief on emissions of b from state k.


Then:

$$a_{kl} = \frac{A_{kl} + r_{kl}}{\sum_{l'} (A_{kl'} + r_{kl'})}, \qquad e_k(b) = \frac{E_k(b) + r_k(b)}{\sum_{b'} (E_k(b') + r_k(b'))}$$
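In code, only the counts change before normalization. A sketch building on the count_stats/normalize helpers above, where r_a[k][l] and r_e[k][b] hold the pseudocounts r_kl and r_k(b) (names are illustrative):

```python
def ml_estimate_with_pseudocounts(training_set, r_a, r_e):
    A, E = count_stats(training_set)           # actual counts, as before
    for k, row in r_a.items():                 # add prior transition pseudocounts
        for l, r in row.items():
            A[k][l] += r
    for k, row in r_e.items():                 # add prior emission pseudocounts
        for b, r in row.items():
            E[k][b] += r
    return normalize(A), normalize(E)
```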

Page 29: Parameter Estimation For HMM


Summary of Case 1: Sequences are fully known

We know the complete structure of each sequence in the training set {x^1,...,x^n}. We wish to estimate a_kl and e_k(b) for all pairs of states k, l and symbols b.

We just showed a method which finds the (unique) parameters θ* maximizing p(x^1,...,x^n|θ*) = max_θ p(x^1,...,x^n|θ).


Page 30: Parameter Estimation For HMM


Case 2: State paths are unknown:

In this case only the values of the x_i's of the input sequences are known. This is an ML problem with "missing data". We wish to find θ* so that p(x|θ*) = max_θ p(x|θ). For each sequence x,

p(x|θ) = ∑_s p(x, s|θ), taken over all state paths s.


Page 31: Parameter Estimation For HMM


Case 2: State paths are unknown

So we need to maximize p(x|θ) = ∑_s p(x, s|θ), where the summation is over all state paths s which produce the output sequence x. Finding the θ* which maximizes ∑_s p(x, s|θ) is hard. [Unlike finding the θ* which maximizes p(x, s|θ) for a single pair (x, s).]


Page 32: Parameter Estimation For HMM


ML Parameter Estimation for HMM

The general process for finding θ in this case is:
1. Start with an initial value of θ.
2. Find θ' so that p(x^1,..., x^n|θ') > p(x^1,..., x^n|θ).
3. Set θ = θ'.
4. Repeat until some convergence criterion is met.

A general algorithm of this type is the Expectation Maximization (EM) algorithm, which we will meet later. For the specific case of HMMs, it is the Baum-Welch training algorithm.

Page 33: Parameter Estimation For HMM


Baum-Welch training

We start with some values of a_kl and e_k(b), which define the initial value of θ. Baum-Welch training is an iterative algorithm which attempts to replace θ by a θ* s.t. p(x|θ*) > p(x|θ). Each iteration consists of a few steps:


Page 34: Parameter Estimation For HMM


Baum-Welch: Step 1

Count the expected number of state transitions: for each sequence x^j, for each i, and for each pair of states k, l, compute the posterior state transition probabilities:

P(s_{i-1}=k, s_i=l | x^j, θ)

Page 35: Parameter Estimation For HMM


Baum-Welch training

Claim:

$$p(s_{i-1}=k,\; s_i=l \mid x^j, \theta) \;=\; \frac{f_k^j(i-1)\, a_{kl}\, e_l(x_i^j)\, b_l^j(i)}{p(x^j \mid \theta)}$$

where f^j_k and b^j_l are the quantities computed by the forward and backward algorithms for x^j under θ.

Page 36: Parameter Estimation For HMM


Step 1: Computing P(s_{i-1}=k, s_i=l | x^j, θ)

Splitting the path at the transition from s_{i-1} to s_i:

P(x_1,…,x_L, s_{i-1}=k, s_i=l) = P(x_1,…,x_{i-1}, s_{i-1}=k) · a_kl e_l(x_i) · P(x_{i+1},…,x_L | s_i=l) = f_k(i-1) a_kl e_l(x_i) b_l(i),

where f_k(i-1) is computed via the forward algorithm and b_l(i) via the backward algorithm. Dividing by p(x^j) gives:

$$p(s_{i-1}=k,\; s_i=l \mid x^j) \;=\; \frac{f_k(i-1)\, a_{kl}\, e_l(x_i)\, b_l(i)}{\sum_{k',l'} f_{k'}(i-1)\, a_{k'l'}\, e_{l'}(x_i)\, b_{l'}(i)} \;=\; \frac{f_k(i-1)\, a_{kl}\, e_l(x_i)\, b_l(i)}{p(x^j)}$$


Page 37: Parameter Estimation For HMM


Step 1 (end)

For each pair (k, l), compute the expected number of state transitions from k to l:

$$A_{kl} = \sum_{j=1}^{n} \frac{1}{p(x^j)} \sum_{i=1}^{L} p(s_{i-1}=k,\, s_i=l,\, x^j) = \sum_{j=1}^{n} \frac{1}{p(x^j)} \sum_{i=1}^{L} f_k^j(i-1)\, a_{kl}\, e_l(x_i^j)\, b_l^j(i)$$

Page 38: Parameter Estimation For HMM


Baum-Welch: Step 2

For each state k and each symbol b, compute the expected number of emissions of b from k:

$$E_k(b) = \sum_{j=1}^{n} \frac{1}{p(x^j)} \sum_{i:\, x_i^j = b} f_k^j(i)\, b_k^j(i)$$

Page 39: Parameter Estimation For HMM


Baum-Welch: Step 3

Use the A_kl's and E_k(b)'s to compute the new values of a_kl and e_k(b):

$$a_{kl} = \frac{A_{kl}}{\sum_{l'} A_{kl'}}, \qquad e_k(b) = \frac{E_k(b)}{\sum_{b'} E_k(b')}$$

These values define θ*.

It can be shown that p(x^1,...,x^n|θ*) > p(x^1,...,x^n|θ), i.e., θ* increases the probability of the data. This procedure is iterated until some convergence criterion is met.
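Putting Steps 1-3 together, one Baum-Welch iteration can be sketched as follows (Python; it reuses the forward/backward sketches given earlier and the normalize helper from the Case 1 sketch, keeps the initial-state row of a fixed for brevity, and omits pseudocounts and the convergence test):

```python
from collections import defaultdict

def baum_welch_iteration(seqs, states, a, e):
    """One update theta -> theta*: returns new (a, e) with higher data likelihood."""
    A = defaultdict(lambda: defaultdict(float))   # expected transition counts A_kl
    E = defaultdict(lambda: defaultdict(float))   # expected emission counts E_k(b)
    for x in seqs:
        f = forward(x, states, a, e)              # f[i][k] = f_k(i+1) (0-based lists)
        b = backward(x, states, a, e)
        px = sum(f[-1][k] for k in states)        # p(x | theta)
        # Step 2: expected emissions, E_k(b) += f_k(i) b_k(i) / p(x)
        for i, xi in enumerate(x):
            for k in states:
                E[k][xi] += f[i][k] * b[i][k] / px
        # Step 1: expected transitions, A_kl += f_k(i-1) a_kl e_l(x_i) b_l(i) / p(x)
        for i in range(1, len(x)):
            for k in states:
                for l in states:
                    A[k][l] += f[i-1][k] * a[k].get(l, 0.0) * e[l][x[i]] * b[i][l] / px
    new_a, new_e = normalize(A), normalize(E)     # Step 3: renormalize
    new_a[0] = a[0]    # initial-state distribution kept fixed in this sketch
    return new_a, new_e
```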

Page 40: Parameter Estimation For HMM


Case 2: State paths are unknown: Viterbi training

We also start from given values of a_kl and e_k(b), which define the initial value of θ. Viterbi training attempts to maximize the probability of a most probable path, i.e., to maximize

p(s(x^1),…,s(x^n) | θ, x^1,…,x^n),

where s(x^j) is the most probable path (under θ) for x^j.


Page 41: Parameter Estimation For HMM


Case 2: State paths are unknown: Viterbi training

Each iteration:
1. Find a set {s(x^j)} of most probable paths, which maximize p(s(x^1),…,s(x^n) | θ, x^1,…,x^n).
2. Find θ* which maximizes p(s(x^1),…,s(x^n) | θ*, x^1,…,x^n). (Note: in step 1 the maximizing arguments are the paths; in step 2 it is θ*.)
3. Set θ = θ*, and repeat. Stop when the paths are no longer changed.


Page 42: Parameter Estimation For HMM


Case 2: State paths are unknown: Viterbi training

p(s(x^1),…,s(x^n) | θ*, x^1,…,x^n) can be expressed in closed form (since we are using a single path for each x^j), so this time convergence is achieved when the optimal paths are no longer changed.
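A minimal sketch of the whole loop, reusing the viterbi and ml_estimate sketches from earlier (the initial-state row of a is again kept fixed, and every state and symbol is assumed to retain a nonzero count; in practice pseudocounts would guard against zeros):

```python
def viterbi_training(seqs, states, a, e):
    paths = None
    while True:
        # 1. most probable path for each sequence under the current theta
        new_paths = [viterbi(x, states, a, e)[0] for x in seqs]
        if new_paths == paths:        # stop when the paths no longer change
            return a, e
        paths = new_paths
        # 2. re-estimate theta from the now-labeled (path, sequence) pairs, as in Case 1
        new_a, new_e = ml_estimate(list(zip(paths, seqs)))
        new_a[0] = a[0]               # keep the initial-state distribution fixed
        a, e = new_a, new_e
```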
