HMM for CpG Islands. Parameter Estimation for HMM. Maximum Likelihood and the Information Inequality. Lecture #7. Background Readings: Chapter 3.3 in the textbook, Biological Sequence Analysis, Durbin et al., 2001. Shlomo Moran, following Danny Geiger and Nir Friedman.


Page 1:

HMM for CpG Islands

Parameter Estimation For HMM

Maximum Likelihood and the Information Inequality

Lecture #7

Background Readings: Chapter 3.3 in the textbook, Biological Sequence Analysis, Durbin et al., 2001. Shlomo Moran, following Danny Geiger and Nir Friedman

Page 2:

HMM for CpG Islands

Page 3:

Reminder: Hidden Markov Model

[Figure: a chain of hidden states s1 → s2 → … → sL-1 → sL, each state si emitting a symbol xi.]

p(s, x) = p(s1,…,sL; x1,…,xL) = ∏i=1..L m_{s_{i-1} s_i} · e_{s_i}(xi)
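To make the formula concrete, here is a minimal Python sketch that evaluates p(s, x) by multiplying one transition and one emission factor per position. The two-state model and all numeric values are made up for illustration; m0 plays the role of the initial transition m_{s_0 s_1}.

```python
# Toy two-state HMM; all numbers are illustration values, not from the lecture.
m0 = {'+': 0.5, '-': 0.5}                          # initial distribution over s_1
m = {('+', '+'): 0.8, ('+', '-'): 0.2,             # transition probabilities m_kl
     ('-', '+'): 0.1, ('-', '-'): 0.9}
e = {'+': {'A': 0.1, 'C': 0.4, 'G': 0.4, 'T': 0.1},   # emission probabilities e_k(b)
     '-': {'A': 0.3, 'C': 0.2, 'G': 0.2, 'T': 0.3}}

def joint_prob(states, symbols):
    """p(s, x) = m0[s_1] e[s_1](x_1) * prod_{i>=2} m[s_{i-1}, s_i] e[s_i](x_i)."""
    p = m0[states[0]] * e[states[0]][symbols[0]]
    for prev, cur, b in zip(states, states[1:], symbols[1:]):
        p *= m[(prev, cur)] * e[cur][b]
    return p

print(joint_prob('++-', 'CGT'))   # 0.5*0.4 * 0.8*0.4 * 0.2*0.3 = 0.00384
```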

Next we apply HMMs to the problem of recognizing CpG islands.


Page 4:

Hidden Markov Model for CpG Islands

The states: Domain(Si) = {+, -} × {A, C, G, T} (8 values)

In this representation, P(xi | si) = 0 or 1, depending on whether xi is consistent with si. E.g., xi = G is consistent with si = (+, G) and with si = (-, G), but not with any other state of si.

[Figure: the eight states A+, A-, …, G+, …, T+, T-, each emitting its own letter.]
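A sketch of this state space and the 0/1 emission rule (our own encoding of the slide's description):

```python
from itertools import product

# The 8 states of the CpG model: a (sign, letter) pair for each of +/- and A,C,G,T.
STATES = list(product('+-', 'ACGT'))

def emission(state, letter):
    """P(x_i = letter | s_i = state): 1 if the letters agree, else 0."""
    return 1.0 if state[1] == letter else 0.0

assert emission(('+', 'G'), 'G') == 1.0    # G is consistent with (+, G) ...
assert emission(('-', 'G'), 'G') == 1.0    # ... and with (-, G),
assert emission(('+', 'A'), 'G') == 0.0    # but with no other state
```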

Page 5:

Reminder: Most Probable State Path

Given an output sequence x = (x1,…,xL), a most probable path s* = (s*1,…,s*L) is one which maximizes p(s|x):

s* = (s*1,…,s*L) = argmax over (s1,…,sL) of p(s1,…,sL | x1,…,xL)

Page 6:

[Figure: a state path such as …, A-, C-, T-, T+, G+, … over the letters A, C, T, T, G.]

Predicting CpG islands via a most probable path:

Output symbols: A, C, G, T (4 letters). Markov chain states: 4 "-" states and 4 "+" states, two for each letter (8 states total). A most probable path (found by Viterbi's algorithm) predicts the CpG islands.

An experiment (Durbin et al., pp. 60-61) shows that the predicted islands are shorter than the assumed ones. In addition, quite a few "false negatives" are found.
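The slides do not spell out the algorithm here, but a compact log-space Viterbi sketch might look as follows; the parameter dictionaries (m0, m, e) have the same format as in the first toy sketch above.

```python
import math

def _log(p):
    # tolerate zero probabilities (e.g., the 0/1 emissions of the CpG model)
    return math.log(p) if p > 0 else float('-inf')

def viterbi(x, states, m0, m, e):
    """A most probable state path: argmax_s p(s, x), computed in log space."""
    V = [{k: _log(m0[k]) + _log(e[k][x[0]]) for k in states}]
    back = []
    for b in x[1:]:
        col, ptr = {}, {}
        for l in states:
            best = max(states, key=lambda k: V[-1][k] + _log(m[(k, l)]))
            col[l] = V[-1][best] + _log(m[(best, l)]) + _log(e[l][b])
            ptr[l] = best
        V.append(col)
        back.append(ptr)
    path = [max(states, key=lambda k: V[-1][k])]   # traceback from the best end state
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]   # predicted CpG islands = maximal runs of '+' states
```

For the CpG model, x is the DNA string and states would be the eight (sign, letter) pairs.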

Page 7:

Reminder: Most Probable State

Given an output sequence x = (x1,…,xL), si is a most probable state (at location i) if:

si = argmaxk p(Si = k | x)

Note that p(Si = k | x) = p(Si = k, x) / p(x) ∝ p(Si = k, x).

Page 8:

Finding the probability that a letter is in a CpG island, via the algorithm for the most probable state:

The probability that the occurrence of G in the i-th location is in a CpG island (a "+" state) is:

∑_{s+} p(Si = s+ | x) = (1/p(x)) · ∑_{s+} F(Si = s+) · B(Si = s+)

where the summation is formally over the 4 "+" states, but actually only the state G+ needs to be considered (why?).

[Figure: the state path …, A-, C-, T-, T+, G+ over the letters A, C, T, T, G, with position i marked at the letter G.]
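A forward-backward sketch of this computation, again with parameter dictionaries in the format of the toy sketches above; F and B are the forward and backward tables named on the slide.

```python
def posterior(x, states, m0, m, e):
    """p(S_i = k | x) = F_i(k) * B_i(k) / p(x) for every position i and state k."""
    F = [{k: m0[k] * e[k][x[0]] for k in states}]              # forward table
    for b in x[1:]:
        F.append({l: e[l][b] * sum(F[-1][k] * m[(k, l)] for k in states)
                  for l in states})
    B = [{k: 1.0 for k in states}]                             # backward, built right to left
    for b in reversed(x[1:]):
        B.append({k: sum(m[(k, l)] * e[l][b] * B[-1][l] for l in states)
                  for k in states})
    B.reverse()
    px = sum(F[-1][k] for k in states)                         # p(x)
    return [{k: F[i][k] * B[i][k] / px for k in states} for i in range(len(x))]
```

The probability that position i lies in a CpG island is then the sum of the "+" entries of posterior(x, …)[i].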

Page 9:

Parameter Estimation for HMM


Page 10:

Defining the Parameters

An HMM model is defined by the parameters: mkl and ek(b), for all states k,l and all symbols b. Let θ denote the collection of these parameters:

θ = {mkl : k, l are states} ∪ {ek(b) : k is a state, b is a letter}

[Figure: a transition k → l labelled mkl, and an emission of letter b from state k labelled ek(b).]

Page 11:

Training Sets

To determine the values of the parameters in θ, we use a training set {x1,...,xn}, where each xj is a sequence which is assumed to fit the model. Given the parameters θ, each sequence xj has an assigned probability p(xj|θ).


Page 12:

Maximum Likelihood Parameter Estimation for HMM

The elements of the training set {x1,...,xn} are assumed to be independent, so p(x1,...,xn|θ) = ∏j p(xj|θ).

ML parameter estimation looks for θ which maximizes the above.

The exact method for finding or approximating this θ depends on the nature of the training set used.


Page 13:

Data for HMM

The training set is characterized by:
1. For each xj, the information on the states sji (the symbols xji are usually known).
2. Its size (the sum of the lengths of all sequences).


Page 14:

Case 1: ML when Sequences are fully known

We know the complete structure of each sequence in the training set {x1,...,xn}. We wish to estimate mkl and ek(b) for all pairs of states k, l and symbols b.

By the ML method, we look for parameters θ* which maximize the probability of the sample set: p(x1,...,xn | θ*) = maxθ p(x1,...,xn | θ).


Page 15:

Case 1: Sequences are fully known

For each xj we have:

p(xj | θ) = ∏i=1..Lj m_{s_{i-1} s_i} · e_{s_i}(xji)

Let Mkl = |{i : si-1 = k, si = l}| and Ek(b) = |{i : si = k, xi = b}| (counts within xj). Then:

p(xj | θ) = ∏(k,l) mkl^{Mkl} · ∏(k,b) [ek(b)]^{Ek(b)}
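A quick numerical check of this regrouping, with made-up two-state parameters as in the earlier sketches:

```python
from collections import Counter

# Toy parameters (same format as the earlier sketch; values are made up):
m = {('+', '+'): 0.8, ('+', '-'): 0.2, ('-', '+'): 0.1, ('-', '-'): 0.9}
e = {'+': {'A': 0.1, 'C': 0.4, 'G': 0.4, 'T': 0.1},
     '-': {'A': 0.3, 'C': 0.2, 'G': 0.2, 'T': 0.3}}

s, x = '++--', 'CGTA'                       # one labelled sequence x^j with its states

# Term-by-term product over positions: prod_i m_{s_{i-1} s_i} * e_{s_i}(x_i)
# (the initial transition m_{s_0 s_1} is omitted for simplicity)
term_by_term = e[s[0]][x[0]]
for prev, cur, b in zip(s, s[1:], x[1:]):
    term_by_term *= m[(prev, cur)] * e[cur][b]

# Grouped by counts: prod_{k,l} m_kl^{M_kl} * prod_{k,b} e_k(b)^{E_k(b)}
M = Counter(zip(s, s[1:]))                  # M_kl = |{i : s_{i-1}=k, s_i=l}|
E = Counter(zip(s, x))                      # E_k(b) = |{i : s_i=k, x_i=b}|
grouped = 1.0
for (k, l), c in M.items():
    grouped *= m[(k, l)] ** c
for (k, b), c in E.items():
    grouped *= e[k][b] ** c

assert abs(term_by_term - grouped) < 1e-12  # same likelihood, just regrouped
```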

Page 16:

Case 1 (cont)

By the independence of the xj's, p(x1,...,xn | θ) = ∏j p(xj | θ).

Thus, if Mkl = #(transitions from k to l) in the training set, and Ek(b) = #(emissions of symbol b from state k) in the training set, we have:

p(x1,...,xn | θ) = ∏(k,l) mkl^{Mkl} · ∏(k,b) [ek(b)]^{Ek(b)}

Page 17:

Case 1 (cont)

So we need to find the mkl's and ek(b)'s which maximize:

∏(k,l) mkl^{Mkl} · ∏(k,b) [ek(b)]^{Ek(b)}

subject to: for all states k, ∑l mkl = 1 and ∑b ek(b) = 1 [and mkl, ek(b) ≥ 0].

Page 18:

Case 1 (cont)

Rewriting, we need to maximize:

F = ∏(k,l) mkl^{Mkl} · ∏(k,b) [ek(b)]^{Ek(b)} = ∏k [∏l mkl^{Mkl}] · ∏k [∏b [ek(b)]^{Ek(b)}]

subject to: for all k, ∑l mkl = 1 and ∑b ek(b) = 1.

Page 19:

Case 1 (cont)

If, for each k, we maximize ∏l mkl^{Mkl} subject to ∑l mkl = 1, and also ∏b [ek(b)]^{Ek(b)} subject to ∑b ek(b) = 1, then we also maximize F.

Each of the above is a simpler ML problem, similar to ML parameter estimation for a die, which we treat next.

Page 20:

ML Parameter Estimation for a Single Die


Page 21:

Defining The Problem

Let X be a random variable with 6 values x1,…,x6, denoting the six outcomes of a (possibly unfair) die. Here the parameters are θ = {θ1, θ2, θ3, θ4, θ5, θ6}, with ∑i θi = 1.

Assume that the data is one sequence:

Data = (x6, x1, x1, x3, x2, x2, x3, x4, x5, x2, x6)

So we have to maximize

P(Data | θ) = θ1² θ2³ θ3² θ4 θ5 θ6²

subject to: θ1 + θ2 + θ3 + θ4 + θ5 + θ6 = 1 [and θi ≥ 0], i.e.,

P(Data | θ) = θ1² θ2³ θ3² θ4 θ5 (1 - ∑i=1..5 θi)²

Page 22:

Side comment: Sufficient Statistics

To compute the probability of the data in the die example, we only need to record the number Ni of times the die fell on each side i (namely N1, N2,…,N6). We do not need to recall the entire sequence of outcomes.

P(Data | θ) = θ1^{N1} θ2^{N2} θ3^{N3} θ4^{N4} θ5^{N5} (1 - ∑i=1..5 θi)^{N6}

{Ni | i = 1…6} is called a sufficient statistics for the multinomial sampling.

Page 23:

Sufficient Statistics

A sufficient statistics is a function of the data that summarizes the relevant information for the likelihood.

Formally, s(Data) is a sufficient statistics if for any two datasets Data and Data′:

s(Data) = s(Data′)  ⇒  P(Data | θ) = P(Data′ | θ)

[Figure: many datasets mapping to a smaller set of statistics values.]

Exercise: define a sufficient statistics for the HMM model.
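A tiny check with made-up die parameters: two datasets with the same counts have exactly the same likelihood.

```python
import math
from collections import Counter

def die_likelihood(theta, data):
    """P(Data | theta) for i.i.d. tosses; depends on Data only through the counts N_i."""
    p = 1.0
    for outcome in data:
        p *= theta[outcome]
    return p

theta = {1: 0.1, 2: 0.2, 3: 0.2, 4: 0.1, 5: 0.15, 6: 0.25}    # made-up parameters
d1 = [6, 1, 1, 3, 2, 2, 3, 4, 5, 2, 6]
d2 = sorted(d1)                                    # a different sequence with the same N_i
assert Counter(d1) == Counter(d2)                  # same sufficient statistics ...
assert math.isclose(die_likelihood(theta, d1),
                    die_likelihood(theta, d2))     # ... hence the same likelihood
```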

Page 24:

Maximum Likelihood Estimate

By the ML approach, we look for parameters that maximize the probability of the data (i.e., the likelihood function). We find the parameters by considering the corresponding log-likelihood function:

log P(Data | θ) = ∑i=1..5 Ni log θi + N6 log(1 - ∑i=1..5 θi)

A necessary condition for a (local) maximum is, for j = 1,…,5:

∂ log P(Data | θ) / ∂θj = Nj/θj - N6/(1 - ∑i=1..5 θi) = 0

Page 25:

Finding the Maximum

Rearranging terms: Nj/θj = N6/(1 - ∑i=1..5 θi) = N6/θ6, so Nj/θj = Ni/θi for all i, j.

Divide the j-th equation by the i-th equation: θj/θi = Nj/Ni, i.e., θj = (Nj/Ni)·θi.

Sum from j = 1 to 6: 1 = ∑j θj = (θi/Ni)·∑j Nj = (θi/Ni)·N, where N = N1 + … + N6.

So there is only one local, and hence global, maximum. Hence the MLE is given by:

θi = Ni/N, for i = 1,…,6.
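For the Data sequence of the earlier slide, the MLE is just the vector of relative frequencies; a small sanity check (the perturbation test is ours):

```python
import math
from collections import Counter

data = [6, 1, 1, 3, 2, 2, 3, 4, 5, 2, 6]          # the Data sequence from the earlier slide
N = Counter(data)
n = len(data)                                     # n = 11
mle = {i: N[i] / n for i in range(1, 7)}          # theta_i = N_i / N
print(mle)   # {1: 2/11, 2: 3/11, 3: 2/11, 4: 1/11, 5: 1/11, 6: 2/11}

def log_likelihood(theta):
    return sum(N[i] * math.log(theta[i]) for i in range(1, 7))

# Moving probability mass away from the relative frequencies lowers the log-likelihood:
bumped = dict(mle)
bumped[1] += 0.01
bumped[2] -= 0.01
assert log_likelihood(mle) > log_likelihood(bumped)
```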

Page 26:

Note: Fractional Exponents are possible

Some models allow the Ni's to be fractions (e.g., if we are uncertain of a die outcome, we may count it as "6" with 20% confidence and as "5" with 80%). Our analysis did not assume that the Ni are integers, so it applies to fractional exponents as well.


Page 27:

Generalization to distributions with any number k of outcomes

Let X be a random variable with k values x1,…,xk, denoting the k outcomes of independently and identically distributed experiments, with parameters θ = {θ1, θ2,..., θk} (θi is the probability of xi). Again, the data is one sequence of length n, in which xi appears ni times.

Then we have to maximize

P(Data | θ) = θ1^{n1} θ2^{n2} ⋯ θk^{nk},  (n1 + n2 + … + nk = n)

subject to: θ1 + θ2 + … + θk = 1, i.e.,

P(Data | θ) = θ1^{n1} ⋯ θ_{k-1}^{n_{k-1}} (1 - ∑i=1..k-1 θi)^{nk}

Page 28:

Generalization for n outcomes (cont)

i k

i k

n n

By treatment identical to the die case, the maximum is obtained when for all i:

Hence the MLE is given by the relative frequencies:

1,..,ii

ni k

n

30

Page 29:

ML for a Single Die, Normalized Version

Consider two experiments with a three-sided die:
1. 10 tosses: 2 × x1, 3 × x2, 5 × x3.
2. 1000 tosses: 200 × x1, 300 × x2, 500 × x3.

Clearly, both imply the same ML parameters. In general, when formulating ML for a single die, we can ignore the actual number n of tosses and just use the fraction of each outcome.

Page 30:

Normalized version of ML (cont.)

Thus we can replace the numbers of outcomes ni by pi = ni/n, and get the following normalized setting of the ML problem for a single die:

Given positive numbers p1,…,pk s.t. p1 + … + pk = 1, find parameters θ1,…,θk which maximize:

P(Data | θ) = θ1^{p1} θ2^{p2} ⋯ θk^{pk}

And the same analysis yields that a maximum is obtained when:

θi = pi, for i = 1,…,k.

Page 31:

Implication: the Kullback-Leibler Information Inequality

Page 32:

Rephrasing the ML inequality

We can rephrase the "ML for a single die" inequality:

Let P = (p1,…,pk) be a probability distribution over a k-set. For any other such distribution Q = (q1,…,qk), let the likelihood be P(Data | Q) = q1^{p1} q2^{p2} ⋯ qk^{pk}. Then P(Data | Q) is maximized only when Q = P.

Taking logarithms, we get:

Let P = (p1,…,pk) be a probability distribution over a k-set. For any other such distribution Q = (q1,…,qk), consider the sum R(Q) = log P(Data | Q) = ∑i pi log qi. Then R has a unique maximum at Q = P.

Page 33:

The Kullback-Leibler Information Inequality

Given probability distributions P = (p1,…,pk) and Q = (q1,…,qk) over a k-set, the relative entropy of P and Q is defined by:

D(P || Q) = ∑i pi log(pi/qi)

Then D(P || Q) ≥ 0, with equality only when P = Q.

Page 34:

Proof of the information inequality

By the logarithmic version of the "normalized maximum likelihood" (two slides back):

D(P || Q) = ∑i pi log(pi/qi) = ∑i pi log pi - ∑i pi log qi ≥ 0,

and equality holds only when pi = qi for all i. QED
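A direct numerical check of the inequality (the two distributions are made up):

```python
import math

def kl(P, Q):
    """Relative entropy D(P || Q) = sum_i p_i log(p_i / q_i)."""
    return sum(p * math.log(p / q) for p, q in zip(P, Q) if p > 0)

P = [0.5, 0.25, 0.25]
Q = [0.4, 0.2, 0.4]
assert kl(P, Q) > 0          # D(P||Q) >= 0 ...
assert kl(P, P) == 0         # ... with equality only when P = Q
```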

Page 35:

Using the Solution for the "Die Maximum Likelihood" to Find Parameters for HMM When All States are Known

Page 36:

The Parameters

Let Mkl = #(transitions from k to l) in the training set, and Ek(b) = #(emissions of symbol b from state k) in the training set. We need to:

Maximize ∏(k,l) mkl^{Mkl} · ∏(k,b) [ek(b)]^{Ek(b)}

subject to: for all states k, ∑l mkl = 1 and ∑b ek(b) = 1, with mkl, ek(b) ≥ 0.

Page 37:

Apply to HMM (cont.)

We apply the previous technique to get, for each k, the parameters {mkl | l = 1,..,m} and {ek(b) | b ∈ Σ}:

mkl = Mkl / ∑l′ Mkl′, and ek(b) = Ek(b) / ∑b′ Ek(b′)

which gives the optimal ML parameters.
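In code, the whole of Case 1 collapses to counting and normalizing; a sketch over a toy labelled training set (the sequences and their state labels are made up):

```python
from collections import Counter

def ml_estimate(training_set):
    """ML parameters from fully labelled sequences:
    m_kl = M_kl / sum_l' M_kl'   and   e_k(b) = E_k(b) / sum_b' E_k(b')."""
    M, E = Counter(), Counter()
    for states, symbols in training_set:
        M.update(zip(states, states[1:]))          # transition counts M_kl
        E.update(zip(states, symbols))             # emission counts E_k(b)
    m_row, e_row = Counter(), Counter()
    for (k, _), c in M.items():
        m_row[k] += c                              # sum_l' M_kl'
    for (k, _), c in E.items():
        e_row[k] += c                              # sum_b' E_k(b')
    m_hat = {(k, l): c / m_row[k] for (k, l), c in M.items()}
    e_hat = {(k, b): c / e_row[k] for (k, b), c in E.items()}
    return m_hat, e_hat

m_hat, e_hat = ml_estimate([('++--', 'CGTA'), ('+-', 'GA')])
print(m_hat)   # e.g. m_hat[('+', '+')] = 1/3: one '+'->'+' among three transitions from '+'
```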

Page 38:

Summary of Case 1: Sequences are fully known

We know the complete structure of each sequence in the training set {x1,...,xn}. We wish to estimate mkl and ek(b) for all pairs of states k, l and symbols b.

When everything is known, we can find the (unique set of) parameters θ* which maximizes p(x1,...,xn | θ*) = maxθ p(x1,...,xn | θ).

Page 39:

Adding pseudo counts in HMM

We may modify the actual counts by our prior knowledge/belief (e.g., when the sample set is too small): rkl is our prior belief on transitions from k to l, and rk(b) is our prior belief on emissions of b from state k. Then:

mkl = (Mkl + rkl) / ∑l′ (Mkl′ + rkl′), and ek(b) = (Ek(b) + rk(b)) / ∑b′ (Ek(b′) + rk(b′))
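A sketch of the modified estimator; for brevity it takes a single uniform pseudocount for transitions and one for emissions, in place of the per-pair rkl and rk(b) of the slide:

```python
def ml_with_pseudocounts(M, E, r_m, r_e, states, alphabet):
    """ML estimates with pseudocounts added to the observed counts M_kl and E_k(b)."""
    m_hat, e_hat = {}, {}
    for k in states:
        row = sum(M.get((k, l), 0) + r_m for l in states)      # sum_l' (M_kl' + r_kl')
        for l in states:
            m_hat[(k, l)] = (M.get((k, l), 0) + r_m) / row
        row = sum(E.get((k, b), 0) + r_e for b in alphabet)    # sum_b' (E_k(b') + r_k(b'))
        for b in alphabet:
            e_hat[(k, b)] = (E.get((k, b), 0) + r_e) / row
    return m_hat, e_hat

# With r_m = r_e = 1 (Laplace smoothing), unseen transitions and emissions get
# small nonzero probabilities instead of zeros:
m_hat, e_hat = ml_with_pseudocounts({('+', '+'): 3}, {('+', 'G'): 2},
                                    r_m=1, r_e=1, states='+-', alphabet='ACGT')
```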

Page 40:

Case 2: State Paths are Unknown

Here we use ML with hidden parameters.

Page 41:

Die likelihood with hidden parameters

Let X be a random variable with 3 values: 0, 1, 2. Hence the parameters are θ = {θ0, θ1, θ2}, with ∑i θi = 1.

Assume that the data is a sequence of 2 tosses which we don't see, but we know that the sum of the outcomes is 2.

The problem: find parameters which maximize the likelihood (probability) of the observed data.

Basic fact: the probability of an event is the sum of the probabilities of the simple events it contains.

The probability space here is the set of all sequences of 2 tosses:

(0,0), (0,1), (0,2), (1,0), …, (2,2)

Page 42:

Defining The Problem

Thus, we need to find parameters θ which maximize:

Pr(sum = 2 | θ) = Pr{(1,1), (2,0), (0,2)} = θ1² + 2·θ0·θ2

Finding an optimal solution is in general a difficult task, hence we use the following procedure:
1. "Guess" initial parameters.
2. Repeatedly improve the parameters using the EM algorithm (to be studied later in this course).

Next, we exemplify the EM algorithm on the above example.

Page 43:

E step: Average Counts

Assume our initial parameters are: θ0 = 0.5, θ1 = θ2 = 0.25. Then:

Pr(1,1) = 0.25² = 0.0625
Pr(2,0) = Pr(0,2) = 0.5 × 0.25 = 0.125
Pr(sum = 2 | θ) = 0.0625 + 2 × 0.125 = 0.3125

We use the probabilities of the events to generate "average counts" of the outcomes:
The average count of 0 is 2 × 0.125 = 0.25.
The average count of 1 is 2 × 0.0625 = 0.125.
The average count of 2 is 2 × 0.125 = 0.25.

Page 44:

M step: Updating the probabilities by the average counts

The total of all average counts is 2 × 0.25 + 0.125 = 0.625. The new parameters λ0, λ1, λ2 are the relative frequencies of the average counts:

λ0 = λ2 = 0.25 / 0.625 = 0.4, λ1 = 0.125 / 0.625 = 0.2.

The probabilities of the simple events according to the new parameters:

Pr(1,1) = 0.2² = 0.04
Pr(2,0) = Pr(0,2) = 0.4 × 0.4 = 0.16

The probability of the event by the new parameters:

Pr(sum = 2 | λ) = 0.04 + 2 × 0.16 = 0.36 > 0.3125 = Pr(sum = 2 | θ).
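The arithmetic of the two steps can be verified with a short sketch (the function and variable names are ours, not from the slides):

```python
def em_step(theta):
    """One E step + M step for the 'two hidden tosses, observed sum = 2' example."""
    t0, t1, t2 = theta
    p11 = t1 * t1                       # Pr(1,1)
    p20 = t2 * t0                       # Pr(2,0) = Pr(0,2)
    # E step: average counts of outcomes 0, 1, 2 given that the sum is 2
    avg = [2 * p20, 2 * p11, 2 * p20]   # 0 and 2 each occur once in (2,0) and in (0,2)
    # M step: new parameters = relative frequencies of the average counts
    total = sum(avg)
    return tuple(c / total for c in avg)

theta = (0.5, 0.25, 0.25)               # the initial parameters of the E-step slide
lam = em_step(theta)
print(lam)                              # (0.4, 0.2, 0.4), as computed above

likelihood = lambda t: t[1] ** 2 + 2 * t[0] * t[2]   # Pr(sum = 2 | t)
assert likelihood(lam) > likelihood(theta)           # 0.36 > 0.3125
```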

Page 45:

Summary of the algorithm:

• Start with some estimated parameters θ.
• Use these parameters to define average counts of the outcomes.
• Define new parameters λ by the relative frequencies of the average counts.

We will show that this algorithm never decreases, and usually increases, the likelihood of the data.

An application of this algorithm to HMMs is known as the Baum-Welch algorithm, which we will see next.