What if a new genome comes? We just sequenced the porcupine genome We know CpG islands play the same role in this genome However, we have no known CpG islands for porcupines We suspect the frequency and characteristics of CpG islands are quite different in porcupines How do we adjust the parameters in our model? LEARNING


What if a new genome comes?

• We just sequenced the porcupine genome

• We know CpG islands play the same role in this genome

• However, we have no known CpG islands for porcupines

• We suspect the frequency and characteristics of CpG islands are quite different in porcupines

How do we adjust the parameters in our model?

LEARNING

Problem 3: Learning

Re-estimate the parameters of the model based on training data

Two learning scenarios

1. Estimation when the “right answer” is known

Examples:

GIVEN: a genomic region x = x1…x1,000,000 where we have good (experimental) annotations of the CpG islands

GIVEN: the casino player allows us to observe him one evening, as he changes dice and produces 10,000 rolls

2. Estimation when the “right answer” is unknown

Examples:

GIVEN: the porcupine genome; we don’t know how frequent the CpG islands are there, nor do we know their composition

GIVEN: 10,000 rolls of the casino player, but we don’t see when he changes dice

QUESTION: Update the parameters θ of the model to maximize P(x|θ)

1. When the right answer is known

Given x = x1…xN for which the true path π = π1…πN is known,

Define:

Akl = # times the k→l transition occurs in π

Ek(b) = # times state k in π emits b in x

We can show that the maximum likelihood parameters θ (those that maximize P(x|θ)) are:

$$a_{kl} = \frac{A_{kl}}{\sum_i A_{ki}}, \qquad e_k(b) = \frac{E_k(b)}{\sum_c E_k(c)}$$

1. When the right answer is known

Intuition: When we know the underlying states, the best estimate is the average frequency of transitions and emissions that occur in the training data

Drawback: Given little data, there may be overfitting: P(x|θ) is maximized, but θ is unreasonable. 0 probabilities – VERY BAD

Example: Given 10 casino rolls, we observe

x = 2, 1, 5, 6, 1, 2, 3, 6, 2, 3

π = F, F, F, F, F, F, F, F, F, F

Then: aFF = 1; aFL = 0

eF(1) = eF(3) = eF(6) = .2; eF(2) = .3; eF(4) = 0; eF(5) = .1
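The counting estimate can be checked directly on the ten rolls of this example. The following is a minimal Python sketch, not part of the original slides; the function name `ml_estimate` and the dictionary layout are illustrative assumptions.

```python
from collections import Counter

def ml_estimate(x, pi):
    """ML transition/emission estimates from symbols x and a known state path pi."""
    A = Counter(zip(pi, pi[1:]))   # A[(k, l)]: number of k -> l transitions in pi
    E = Counter(zip(pi, x))        # E[(k, b)]: number of times state k emits b
    a_total, e_total = Counter(), Counter()
    for (k, _), n in A.items():
        a_total[k] += n
    for (k, _), n in E.items():
        e_total[k] += n
    # Normalize each count by the total count leaving (or emitted from) state k.
    a = {kl: n / a_total[kl[0]] for kl, n in A.items()}
    e = {kb: n / e_total[kb[0]] for kb, n in E.items()}
    return a, e

x = [2, 1, 5, 6, 1, 2, 3, 6, 2, 3]
pi = ["F"] * len(x)
a, e = ml_estimate(x, pi)
print(a[("F", "F")])           # 1.0
print(e.get(("F", 4), 0.0))    # 0.0, because the face 4 was never observed
```

Events that never occur in the training data are simply absent from the returned dictionaries, which is exactly the zero-probability problem the slide warns about.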

Pseudocounts

Solution for small training sets:

Add pseudocounts

Akl = # times the k→l transition occurs in π + rkl

Ek(b) = # times state k in π emits b in x + rk(b)

rkl, rk(b) are pseudocounts representing our prior belief

Larger pseudocounts ⇒ strong prior belief

Small pseudocounts (ε < 1): just to avoid 0 probabilities

Equivalently, if $A_{kl}$ and $E_k(b)$ denote the raw counts without the pseudocounts, then

$$a_{kl} = \frac{A_{kl} + r_{kl}}{\sum_{l'} \left(A_{kl'} + r_{kl'}\right)}, \qquad e_k(b) = \frac{E_k(b) + r_k(b)}{\sum_{b'} \left(E_k(b') + r_k(b')\right)}$$


Pseudocounts

Example: dishonest casino

We will observe player for one day, 600 rolls

Reasonable pseudocounts:

r0F = r0L = rF0 = rL0 = 1;

rFL = rLF = rFF = rLL = 1;

rF(1) = rF(2) = … = rF(6) = 20 (strong belief fair is fair)

rL(1) = rL(2) = … = rL(6) = 5 (wait and see for loaded)

Above #s pretty arbitrary – assigning priors is an art
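As a rough illustration of how pseudocounts remove zero probabilities, the sketch below applies the fair-die pseudocount of 20 per face to the ten rolls of the earlier overfitting example. The helper name `estimate_with_pseudocounts` is hypothetical, and reusing those ten rolls (rather than 600) is an assumption made only to keep the example small.

```python
from collections import Counter

def estimate_with_pseudocounts(counts, pseudo):
    """Normalize (raw count + pseudocount) over all outcomes listed in `pseudo`."""
    total = sum(counts.get(k, 0) + r for k, r in pseudo.items())
    return {k: (counts.get(k, 0) + r) / total for k, r in pseudo.items()}

rolls = [2, 1, 5, 6, 1, 2, 3, 6, 2, 3]            # all emitted by the fair die
counts = Counter(rolls)                           # raw counts E_F(b)
pseudo = {face: 20 for face in range(1, 7)}       # r_F(b) = 20 for every face
e_F = estimate_with_pseudocounts(counts, pseudo)

for face in range(1, 7):
    print(face, round(e_F[face], 4))
```

With a strong prior the estimates stay close to 1/6, and the unseen face 4 gets probability 20/130 instead of 0.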


2. When the right answer is unknown

We don’t know the true Akl, Ek(b)

Idea:

• We estimate our “best guess” on what Akl, Ek(b) are

• We update the parameters of the model, based on our guess

• We repeat

2. When the right answer is unknown

The general process for finding θ in this case is:

1. Start with an initial value of θ.

2. Find θ’ so that p(x1,..., xn|θ’) > p(x1,..., xn|θ).

3. Set θ = θ’.

4. Repeat until some convergence criterion is met.

A general algorithm of this type is the Expectation Maximization algorithm, which we will meet later. For the specific case of HMMs, it is Baum-Welch training.

2. When the right answer is unknown

We don’t know the true Akl, Ek(b)

Starting with our best guess of a model M with parameters θ:

Given x = x1…xN for which the true path π = π1…πN is unknown,

We can get to a provably more likely parameter set θ = (akl, ek(b))

Principle: EXPECTATION MAXIMIZATION

1. E-STEP: Estimate Akl, Ek(b) in the training data

2. M-STEP: Update θ = (akl, ek(b)) according to Akl, Ek(b)

3. Repeat 1 & 2, until convergence

Baum Welch training

We start with some values of akl and ek(b), which define prior values of θ. Baum-Welch training is an iterative algorithm which attempts to replace θ by a θ* s.t.

p(x|θ*) > p(x|θ)

Each iteration consists of a few steps:

[Figure: HMM chain of hidden states s1, s2, …, si, …, sL-1, sL, each state si emitting the symbol Xi]

Baum Welch training

In case 1 we computed the optimal values of akl and ek(b) (for the optimal θ) by simply counting the number Akl of transitions from state k to state l, and the number Ek(b) of emissions of symbol b from state k, in the training set. This was possible since we knew all the states.

[Figure: edge between the known states si-1 = k and si = l, which emit xi-1 = b and xi = c]

Baum Welch training

When the states are unknown, the counting process is replaced by an averaging process: for each edge si-1 → si we compute the average number of “k to l” transitions, for all possible pairs (k,l), over this edge. Then, for each k and l, we take Akl to be the sum over all edges.

[Figure: edge between the unknown states si-1 = ? and si = ?, which emit xi-1 = b and xi = c]

Baum Welch training

Similarly, for each edge si → xi = b and each state k, we compute the average number of times that si = k, which is the expected number of “k → b” emissions on this edge. Then we take Ek(b) to be the sum over all such edges. These expected values are computed as follows:

[Figure: edge between the unknown states si-1 = ? and si = ?, which emit xi-1 = b and xi = c]

Baum Welch: step 1a. Count expected number of state transitions

For each i and for each k, l, compute the posterior state transition probabilities:

P(si-1=k, si=l | x,θ)

For this, we use the forward and backward algorithms

[Figure: HMM chain s1, …, si-1, si, …, sL with emissions X1, …, Xi-1, Xi, …, XL]

Estimating new parameters

• So,

$$A_{kl} = \sum_i P(\pi_i = k, \pi_{i+1} = l \mid x, \theta) = \sum_i \frac{f_k(i)\, a_{kl}\, e_l(x_{i+1})\, b_l(i+1)}{P(x \mid \theta)}$$

• Similarly,

$$E_k(b) = \frac{1}{P(x \mid \theta)} \sum_{\{i \,\mid\, x_i = b\}} f_k(i)\, b_k(i)$$

[Figure: transition from state k at position i to state l at position i+1; fk(i) accounts for x1…xi, the edge contributes akl and el(xi+1), and bl(i+1) accounts for xi+2…xN]

Reminder: finding posterior state probabilities

• p(si=k, x) = fk(i) bk(i), since given si = k the emissions x1,…,xi and xi+1,…,xL are conditionally independent. The values {fk(i) bk(i)} for every i, k are computed by one run of the forward and backward algorithms.

fk(i) = p(x1,…,xi, si=k), the probability that a path emits (x1,…,xi) and that state si=k.

bk(i) = p(xi+1,…,xL | si=k), the probability that a path emits (xi+1,…,xL), given that state si=k.

[Figure: HMM chain of hidden states s1, s2, …, si, …, sL-1, sL, each state si emitting the symbol Xi]

Baum Welch: Step 1a (cont)

Claim:

$$p(s_{i-1}=k, s_i=l \mid x, \theta) = \frac{f_k(i-1)\, a_{kl}\, e_l(x_i)\, b_l(i)}{p(x \mid \theta)}$$

(akl and el(xi) are the parameters defined by θ, and fk(i-1), bl(i) are the forward and backward functions)

[Figure: HMM chain s1, …, si-1, si, …, sL with emissions X1, …, Xi-1, Xi, …, XL]

Step 1a: Computing P(si-1=k, si=l | x,θ)

P(x1,…,xL, si-1=k, si=l | θ) = P(x1,…,xi-1, si-1=k | θ) akl el(xi) P(xi+1,…,xL | si=l, θ)

= fk(i-1) akl el(xi) bl(i)

The first factor is obtained via the forward algorithm and the last factor via the backward algorithm. Dividing by p(x|θ):

$$p(s_{i-1}=k, s_i=l \mid x, \theta) = \frac{f_k(i-1)\, a_{kl}\, e_l(x_i)\, b_l(i)}{p(x \mid \theta)}$$

[Figure: HMM chain s1, s2, …, si-1, si, …, sL-1, sL with emissions X1, X2, …, Xi-1, Xi, …, XL-1, XL]

Step 1a (end)

For each pair (k,l), compute the expected number of state transitions from k to l, as the sum of the expected number of k to l transitions over all L edges:

$$A_{kl} = \frac{1}{p(x \mid \theta)} \sum_{i=1}^{L} p(s_{i-1}=k, s_i=l, x \mid \theta) = \frac{1}{p(x \mid \theta)} \sum_{i=1}^{L} f_k(i-1)\, a_{kl}\, e_l(x_i)\, b_l(i)$$

Step 1a for many sequences:

When we have n input sequences (x1,..., xn), then Akl is given by:

$$A_{kl} = \sum_{j=1}^{n} \frac{1}{p(x^j)} \sum_{i=1}^{L} p(s_{i-1}=k, s_i=l, x^j) = \sum_{j=1}^{n} \frac{1}{p(x^j)} \sum_{i=1}^{L} f_k^j(i-1)\, a_{kl}\, e_l(x_i^j)\, b_l^j(i)$$

where $f_k^j$ and $b_l^j$ are the forward and backward functions for $x^j$ under θ.

Baum-Welch: Step 1b. Count expected number of symbol emissions

For each state k and each symbol b, for each i where Xi=b, compute the expected number of times that si=k.

$$p(s_i=k \mid x_1,\ldots,x_L) = \frac{p(s_i=k, x_1,\ldots,x_L)}{p(x_1,\ldots,x_L)} = \frac{f_k(i)\, b_k(i)}{p(x_1,\ldots,x_L)}$$

[Figure: HMM chain of hidden states s1, s2, …, si, …, sL-1, sL with emissions X1, X2, …, Xi = b, …, XL-1, XL]

Baum-Welch: Step 1b

For each state k and each symbol b, compute the expected number of emissions of b from k as the sum of the expected number of times that si = k, over all i’s for which xi = b.

$$E_k(b) = \frac{1}{p(x \mid \theta)} \sum_{i:\, x_i = b} f_k(i)\, b_k(i)$$

Step 1b for many sequences

When we have n sequences (x1,..., xn), the expected number of emissions of b from k is given by:

$$E_k(b) = \sum_{j=1}^{n} \frac{1}{p(x^j)} \sum_{i:\, x_i^j = b} f_k^j(i)\, b_k^j(i)$$

Summary of Steps 1a and 1b: the E part of the Baum Welch training

These steps compute the expected numbers Akl of k→l transitions for all pairs of states k and l, and the expected numbers Ek(b) of emissions of symbol b from state k, for all states k and symbols b.

The next step is the M step, which is identical to the computation of optimal ML parameters when all states are known.

Baum-Welch: step 2

Use the Akl’s, Ek(b)’s to compute the new values of akl and ek(b):

$$a_{kl} = \frac{A_{kl}}{\sum_{l'} A_{kl'}}, \qquad e_k(b) = \frac{E_k(b)}{\sum_{b'} E_k(b')}$$

These values define θ*.

The correctness of the EM algorithm implies that: p(x1,..., xn|θ*) ≥ p(x1,..., xn|θ)

i.e., θ* never decreases (and typically increases) the probability of the data

This procedure is iterated, until some convergence criterion is met.

The Baum-Welch Algorithm

Initialization:

Pick the best-guess for model parameters (or arbitrary)

Iteration:

1. Forward

2. Backward

3. Calculate Akl, Ek(b)

4. Calculate new model parameters akl, ek(b)

5. Calculate new log-likelihood log P(x | θ)

GUARANTEED NOT TO DECREASE, BY EXPECTATION-MAXIMIZATION

Until P(x | θ) does not change much
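The iteration can be sketched in a few lines of plain Python. This is an illustrative sketch under several assumptions that are not in the slides: states and symbols are encoded as integers, the initial-state distribution `init` is held fixed, no pseudocounts are added, and no scaling is used (so it is only suitable for short sequences). The function names and the toy observation sequence are hypothetical.

```python
import math

def forward(x, init, a, e):
    """f[i][k] = p(x_0..x_i, state_i = k)."""
    K = len(init)
    f = [[init[k] * e[k][x[0]] for k in range(K)]]
    for sym in x[1:]:
        f.append([e[l][sym] * sum(f[-1][k] * a[k][l] for k in range(K))
                  for l in range(K)])
    return f

def backward(x, a, e):
    """b[i][k] = p(x_{i+1}..x_{N-1} | state_i = k)."""
    K = len(a)
    b = [[1.0] * K]
    for sym in reversed(x[1:]):
        b.insert(0, [sum(a[k][l] * e[l][sym] * b[0][l] for l in range(K))
                     for k in range(K)])
    return b

def baum_welch_step(x, init, a, e):
    """One E-step plus M-step; returns new a, new e and log P(x | old parameters)."""
    K, M, N = len(a), len(e[0]), len(x)
    f, b = forward(x, init, a, e), backward(x, a, e)
    px = sum(f[-1])
    # E-step: expected transition counts A[k][l] and emission counts E[k][s].
    A = [[sum(f[i][k] * a[k][l] * e[l][x[i + 1]] * b[i + 1][l]
              for i in range(N - 1)) / px for l in range(K)] for k in range(K)]
    E = [[sum(f[i][k] * b[i][k] for i in range(N) if x[i] == s) / px
          for s in range(M)] for k in range(K)]
    # M-step: normalize the expected counts row by row.
    new_a = [[A[k][l] / sum(A[k]) for l in range(K)] for k in range(K)]
    new_e = [[E[k][s] / sum(E[k]) for s in range(M)] for k in range(K)]
    return new_a, new_e, math.log(px)

# Toy run: 0 = head, 1 = tail; two hidden states with arbitrary starting values.
x = [0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0]
init = [0.5, 0.5]
a = [[0.9, 0.1], [0.1, 0.9]]
e = [[0.5, 0.5], [0.75, 0.25]]
log_likelihoods = []
for _ in range(5):
    a, e, ll = baum_welch_step(x, init, a, e)
    log_likelihoods.append(ll)
print(log_likelihoods)
```

The printed log-likelihoods should form a non-decreasing sequence, which is the stopping signal used in the loop on the slide. For realistic sequence lengths the forward and backward values would need scaling or log-space arithmetic.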

Viterbi training: maximizing the probability of the most probable path

States are unknown. Viterbi training attempts to maximize the probability of a most probable path, i.e. the value of

p(s(x1),…,s(xn), x1,…,xn | θ)

where s(xj) is the most probable (under θ) path for xj. We assume only one sequence (n=1).

[Figure: HMM chain of hidden states s1, s2, …, si, …, sL-1, sL, each state si emitting the symbol Xi]

Viterbi training (cont)

Start from given values of akl and ek(b), which define prior values of θ. Each iteration:

Step 1: Use Viterbi’s algorithm to find a most probable path s(x), which maximizes p(s(x), x|θ).

Viterbi training (cont)

Step 2: Use the ML method for an HMM with known states (treating s(x) as the true path) to find θ* which maximizes p(s(x), x|θ*)

Note: In Step 1 the maximizing argument is the path s(x); in Step 2 it is the parameters θ*.

Viterbi training (cont)

Step 3: Set θ=θ*, and repeat. Stop when paths are not changed.

Claim 2: If s(x) is the optimal path in step 1 of two different iterations, then in both iterations θ has the same values, and hence p(s(x), x|θ) will not increase in any later iteration. Hence the algorithm can terminate in this case.

Note (to check): Can such a repetition occur only in two consecutive iterations?
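The three steps of Viterbi training can be sketched as follows. This is a hedged illustration, not the slides' own code: states and symbols are integer-encoded, the initial distribution `init` is kept fixed, and a small pseudocount `pseudo` is added in Step 2 (an assumption that goes beyond the plain ML update on the slide) so that no probability becomes zero and the logarithms stay finite. All names are hypothetical.

```python
import math

def viterbi(x, init, a, e):
    """Most probable state path for x, computed in log space."""
    K = len(init)
    v = [[math.log(init[k]) + math.log(e[k][x[0]]) for k in range(K)]]
    back = []
    for sym in x[1:]:
        row, ptr = [], []
        for l in range(K):
            best = max(range(K), key=lambda k: v[-1][k] + math.log(a[k][l]))
            ptr.append(best)
            row.append(v[-1][best] + math.log(a[best][l]) + math.log(e[l][sym]))
        v.append(row)
        back.append(ptr)
    path = [max(range(K), key=lambda k: v[-1][k])]
    for ptr in reversed(back):          # trace the back-pointers to the start
        path.insert(0, ptr[path[0]])
    return path

def viterbi_training(x, init, a, e, pseudo=1.0, max_iter=50):
    K, M = len(a), len(e[0])
    prev_path = None
    path = None
    for _ in range(max_iter):
        path = viterbi(x, init, a, e)               # Step 1: best path under theta
        if path == prev_path:                       # Step 3: stop when path repeats
            break
        A = [[pseudo] * K for _ in range(K)]        # Step 2: count along the path
        E = [[pseudo] * M for _ in range(K)]
        for k, l in zip(path, path[1:]):
            A[k][l] += 1
        for k, s in zip(path, x):
            E[k][s] += 1
        a = [[c / sum(r) for c in r] for r in A]
        e = [[c / sum(r) for c in r] for r in E]
        prev_path = path
    return a, e, path

x = [0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0]            # 0 = head, 1 = tail
a, e, path = viterbi_training(
    x, [0.5, 0.5], [[0.9, 0.1], [0.1, 0.9]], [[0.5, 0.5], [0.75, 0.25]])
print(path)
```

Unlike Baum-Welch, each iteration commits to a single path, so the counts are integers rather than expectations.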

Coin-Tossing Example

[Figure: HMM for L tosses; hidden states H1, H2, …, Hi, …, HL-1, HL (Fair/Loaded) emit X1, X2, …, Xi, …, XL-1, XL (Head/Tail)]

Start: Fair with probability 1/2, Loaded with probability 1/2

Transitions: stay in the same state with probability 0.9, switch state with probability 0.1

Fair coin: head 1/2, tail 1/2

Loaded coin: head 3/4, tail 1/4

Example: Homogeneous HMM, one sample

Start with some probability tables, then iterate until convergence.

E-step: Compute p(hi | hi-1, x1,…,xL) from p(hi, hi-1 | x1,…,xL), which is computed using the forward-backward algorithm as explained earlier.

M-step: Update the parameters simultaneously:

λ ← Σi [ p(hi=1 | hi-1=1, x1,…,xL) + p(hi=0 | hi-1=0, x1,…,xL) ] / (L-1)

The transition table and the start distribution are

$$p(h_i \mid h_{i-1}) = \begin{pmatrix} \lambda & 1-\lambda \\ 1-\lambda & \lambda \end{pmatrix}, \qquad p(h_1 \mid h_0) = p(h_1) = \begin{pmatrix} 0.5 \\ 0.5 \end{pmatrix}$$

[Figure: HMM chain of hidden states H1, H2, …, Hi, …, HL-1, HL, each state Hi emitting Xi]

Coin-Tossing Example

$$P = \begin{pmatrix} 0.9 & 0.1 \\ 0.1 & 0.9 \end{pmatrix}$$

Numeric example: 3 tosses

Outcomes: head, head, tail

Coin-Tossing Example

Numeric example: 3 tosses

Outcomes: head, head, tail

Recall:

F(hi) = P(x1,…,xi,hi) = Σ_{hi-1} P(x1,…,xi-1, hi-1) P(hi | hi-1) P(xi | hi)

{step 1 - forward}

First coin is loaded: P(x1=head,h1=loaded) = P(loaded1) P(head | loaded1) = 0.5*0.75 = 0.375

First coin is fair: P(x1=head,h1=fair) = P(fair1) P(head | fair1) = 0.5*0.5 = 0.25

Coin-Tossing Example - forward

Numeric example: 3 tosses. Outcomes: head, head, tail

P(x1,…,xi,hi) = Σ_{hi-1} P(x1,…,xi-1, hi-1) P(hi | hi-1) P(xi | hi)

{step 1}

P(x1=head,h1=loaded) = P(loaded1) P(head | loaded1) = 0.5*0.75 = 0.375

P(x1=head,h1=fair) = P(fair1) P(head | fair1) = 0.5*0.5 = 0.25

{step 2}

P(x1=head, x2=head, h2=loaded) = Σ_{h1} P(x1,h1) P(h2 | h1) P(x2 | h2)

= p(x1=head, loaded1) P(loaded2 | loaded1) P(x2=head | loaded2) + p(x1=head, fair1) P(loaded2 | fair1) P(x2=head | loaded2)

= 0.375*0.9*0.75 + 0.25*0.1*0.75 = 0.253125 + 0.01875 = 0.271875

P(x1=head, x2=head, h2=fair) = p(x1=head, loaded1) P(fair2 | loaded1) P(x2=head | fair2) + p(x1=head, fair1) P(fair2 | fair1) P(x2=head | fair2)

= 0.375*0.1*0.5 + 0.25*0.9*0.5 = 0.01875 + 0.1125 = 0.13125

Coin-Tossing Example - forward

Numeric example: 3 tosses. Outcomes: head, head, tail

P(x1,…,xi,hi) = Σ_{hi-1} P(x1,…,xi-1, hi-1) P(hi | hi-1) P(xi | hi)

{step 2}

P(x1=head, x2=head, h2=loaded) = 0.271875

P(x1=head, x2=head, h2=fair) = 0.13125

{step 3}

P(x1=head, x2=head, x3=tail, h3=loaded) = Σ_{h2} P(x1, x2, h2) P(h3 | h2) P(x3 | h3)

= p(x1=head, x2=head, loaded2) P(loaded3 | loaded2) P(x3=tail | loaded3) + p(x1=head, x2=head, fair2) P(loaded3 | fair2) P(x3=tail | loaded3)

= 0.271875*0.9*0.25 + 0.13125*0.1*0.25 = 0.06445

P(x1=head, x2=head, x3=tail, h3=fair) = p(x1=head, x2=head, loaded2) P(fair3 | loaded2) P(x3=tail | fair3) + p(x1=head, x2=head, fair2) P(fair3 | fair2) P(x3=tail | fair3)

= 0.271875*0.1*0.5 + 0.13125*0.9*0.5 = 0.07265

Coin-Tossing Example - backward

Numeric example: 3 tosses. Outcomes: head, head, tail

b(hi) = P(xi+1,…,xL | hi) = Σ_{hi+1} P(hi+1 | hi) P(xi+1 | hi+1) b(hi+1)

{step 1}

P(x3=tail | h2=loaded) = P(h3=loaded | h2=loaded) P(x3=tail | h3=loaded) + P(h3=fair | h2=loaded) P(x3=tail | h3=fair) = 0.9*0.25 + 0.1*0.5 = 0.275

P(x3=tail | h2=fair) = P(h3=loaded | h2=fair) P(x3=tail | h3=loaded) + P(h3=fair | h2=fair) P(x3=tail | h3=fair) = 0.1*0.25 + 0.9*0.5 = 0.475

Coin-Tossing Example - backward

Numeric example: 3 tosses. Outcomes: head, head, tail

b(hi) = P(xi+1,…,xL | hi) = Σ_{hi+1} P(hi+1 | hi) P(xi+1 | hi+1) b(hi+1)

{step 1}

P(x3=tail | h2=loaded) = 0.275

P(x3=tail | h2=fair) = 0.475

{step 2}

P(x2=head, x3=tail | h1=loaded) = P(loaded2 | loaded1) * P(head | loaded) * 0.275 + P(fair2 | loaded1) * P(head | fair) * 0.475 = 0.9*0.75*0.275 + 0.1*0.5*0.475 = 0.209

P(x2=head, x3=tail | h1=fair) = P(loaded2 | fair1) * P(head | loaded) * 0.275 + P(fair2 | fair1) * P(head | fair) * 0.475 = 0.1*0.75*0.275 + 0.9*0.5*0.475 = 0.234

Coin-Tossing Example

Outcomes: head, head, tail

p(x1,…,xL, hi, hi+1) = f(hi) p(hi+1 | hi) p(xi+1 | hi+1) b(hi+1)

Recall {step 1}:

P(x1=head,h1=loaded) = P(loaded1) P(head | loaded1) = 0.5*0.75 = 0.375

P(x1=head,h1=fair) = P(fair1) P(head | fair1) = 0.5*0.5 = 0.25

f(h1=loaded) = 0.375, f(h1=fair) = 0.25

b(h2=loaded) = 0.275, b(h2=fair) = 0.475

Coin-Tossing Example

Outcomes: head, head, tail

f(h1=loaded) = 0.375, f(h1=fair) = 0.25

b(h2=loaded) = 0.275, b(h2=fair) = 0.475

p(x1,…,xL, h1, h2) = f(h1) p(h2 | h1) p(x2 | h2) b(h2)

p(x1,…,xL, h1=loaded, h2=loaded) = 0.375*0.9*0.75*0.275 = 0.0696

p(x1,…,xL, h1=loaded, h2=fair) = 0.375*0.1*0.5*0.475 = 0.0089

p(x1,…,xL, h1=fair, h2=loaded) = 0.25*0.1*0.75*0.275 = 0.00516

p(x1,…,xL, h1=fair, h2=fair) = 0.25*0.9*0.5*0.475 = 0.0534

Coin-Tossing Example

p(hi | hi-1, x1,…,xL) = p(x1,…,xL, hi, hi-1) / p(hi-1, x1,…,xL)

= f(hi-1) p(hi | hi-1) p(xi | hi) b(hi) / (f(hi-1) b(hi-1))

where the denominator is p(hi-1, x1,…,xL) = f(hi-1) b(hi-1).

M-step

M-step: Update the parameters simultaneously (in this case we only have one parameter, λ):

λ ← Σi [ p(hi=loaded | hi-1=loaded, x1,…,xL) + p(hi=fair | hi-1=fair, x1,…,xL) ] / (L-1)
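The forward, backward and pairwise quantities of the head, head, tail example can be recomputed with a short script. This is only a checking sketch; the dictionary names (`start`, `trans`, `emit`) and the single-letter state and symbol codes are assumptions introduced here, while the parameter values are the ones given in the coin-tossing example.

```python
S = ["L", "F"]                                    # Loaded, Fair
start = {"L": 0.5, "F": 0.5}
trans = {"L": {"L": 0.9, "F": 0.1}, "F": {"L": 0.1, "F": 0.9}}
emit = {"L": {"H": 0.75, "T": 0.25}, "F": {"H": 0.5, "T": 0.5}}
x = ["H", "H", "T"]

# Forward values f[i][h] = P(x_1..x_{i+1}, h_{i+1} = h) (lists are 0-based).
f = [{h: start[h] * emit[h][x[0]] for h in S}]
for sym in x[1:]:
    f.append({h: emit[h][sym] * sum(f[-1][g] * trans[g][h] for g in S) for h in S})

# Backward values b[i][g] = P(later tosses | h_{i+1} = g).
b = [{h: 1.0 for h in S}]
for sym in reversed(x[1:]):
    b.insert(0, {g: sum(trans[g][h] * emit[h][sym] * b[0][h] for h in S) for g in S})

px = sum(f[-1].values())                          # P(head, head, tail)

# Joint p(x, h1 = g, h2 = h) and conditional p(h2 = h | h1 = g, x).
joint12 = {(g, h): f[0][g] * trans[g][h] * emit[h][x[1]] * b[1][h]
           for g in S for h in S}
cond12 = {(g, h): joint12[(g, h)] / (f[0][g] * b[0][g]) for g in S for h in S}

print(f[2])        # forward values after the third toss
print(b[0])        # backward values at the first toss
print(joint12)
print(px)
```

A useful sanity check is that the four joint values add up to P(x), and that for each value of h1 the two conditional probabilities add up to 1.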

Variants of HMMs

Higher-order HMMs

• How do we model “memory” larger than one time point?

• P(πi+1 = l | πi = k) = akl

• P(πi+1 = l | πi = k, πi-1 = j) = ajkl

• …

• A second order HMM with K states is equivalent to a first order HMM with K² states

[Figure: first-order HMM over the pair states HH, HT, TH, TT, with transitions labeled aHHT, aHTH, aHTT, aTHH, aTHT, aTTH]
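The equivalence between a second-order HMM and a first-order HMM over pairs of states can be made concrete with a small sketch. The function name `second_to_first_order` and the numeric values of `a2` are illustrative assumptions, not values from the slides.

```python
from itertools import product

def second_to_first_order(states, a2):
    """Map a2[(j, k, l)] = P(next = l | current = k, previous = j)
    to first-order transitions between pair states (j, k) -> (k, l)."""
    a1 = {}
    for j, k, l in product(states, repeat=3):
        a1[((j, k), (k, l))] = a2[(j, k, l)]
    return a1

# Arbitrary second-order transition probabilities for a two-state (H/T) model.
a2 = {
    ("H", "H", "H"): 0.7, ("H", "H", "T"): 0.3,
    ("H", "T", "H"): 0.4, ("H", "T", "T"): 0.6,
    ("T", "H", "H"): 0.5, ("T", "H", "T"): 0.5,
    ("T", "T", "H"): 0.2, ("T", "T", "T"): 0.8,
}
a1 = second_to_first_order(["H", "T"], a2)
for (src, dst), p in sorted(a1.items()):
    print(src, "->", dst, p)
```

With K = 2 there are K² = 4 pair states, and each pair state has only K allowed successors, because the second component of the source must match the first component of the destination.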

Modeling the Duration of States

Length distribution of region X:

E[lX] = 1/(1-p)

• Geometric distribution, with mean 1/(1-p)

This is a significant disadvantage of HMMs

Several solutions exist for modeling different length distributions

[Figure: two states X and Y; X has self-loop probability p and moves to Y with probability 1-p; Y has self-loop probability q and moves to X with probability 1-q]

Solution 1: Chain several states

[Figure: several copies of state X chained in sequence before Y; the last copy of X has self-loop probability p and moves to Y with probability 1-p; Y has self-loop probability q and returns to X with probability 1-q]

Disadvantage: Still very inflexible

lX = C + geometric with mean 1/(1-p)

Solution 2: Negative binomial distribution

Duration in X: m turns, where

– During first m – 1 turns, exactly n – 1 arrows to next state are followed

– During mth turn, an arrow to next state is followed

$$P(l_X = m) = \binom{m-1}{n-1} (1-p)^{n-1+1}\, p^{(m-1)-(n-1)} = \binom{m-1}{n-1} (1-p)^{n}\, p^{m-n}$$

[Figure: chain of n copies of state X, each with self-loop probability p and probability 1 – p of moving to the next copy; the last copy moves to Y with probability 1 – p]
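The negative binomial duration formula is easy to tabulate. The sketch below is an illustration with assumed values n = 3 and p = 0.9 (the slides do not fix p); `duration_pmf` is a hypothetical helper name, and `math.comb` requires Python 3.8 or later.

```python
from math import comb

def duration_pmf(m, n, p):
    """P(l_X = m) for a chain of n copies of X, each with self-loop probability p."""
    if m < n:
        return 0.0                      # at least one turn is spent in every copy
    return comb(m - 1, n - 1) * (1 - p) ** n * p ** (m - n)

n, p = 3, 0.9
pmf = {m: duration_pmf(m, n, p) for m in range(1, 2001)}
total = sum(pmf.values())
mean = sum(m * q for m, q in pmf.items())
print(round(total, 6), round(mean, 6))
```

For n = 1 the formula reduces to the geometric distribution of a single state, and in general the mean duration is n/(1 – p), which the truncated sum above approximates.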

Example: genes in prokaryotes

• EasyGene: prokaryotic gene-finder (Larsen TS, Krogh A)

• Negative binomial with n = 3

Solution 3: Duration modeling

Upon entering a state:

1. Choose duration d, according to probability distribution

2. Generate d letters according to emission probs

3. Take a transition to next state according to transition probs

Disadvantage: Increase in complexity:

Time: O(D²)

Space: O(D)

where D = maximum duration of state

Learning – EM in ABO locus

Tutorial #08

© Ydo Wexler & Dan Geiger

Example: The ABO locus

A locus is a particular place on the chromosome. Each locus’ state (called genotype) consists of two alleles – one paternal and one maternal. Some loci (plural of locus) determine distinguished features. The ABO locus, for example, determines blood type.

The ABO locus has six possible genotypes {a/a, a/o, b/o, b/b, a/b, o/o}. The first two genotypes determine blood type A, the next two determine blood type B, then blood type AB, and finally blood type O.

We wish to estimate the proportion in a population of the 6 genotypes.

Suppose we randomly sampled N individuals and found that Na/a have genotype a/a, Na/b have genotype a/b, etc. Then, the MLE is given by:

$$\theta_{a/a} = \frac{N_{a/a}}{N},\quad \theta_{a/o} = \frac{N_{a/o}}{N},\quad \theta_{b/b} = \frac{N_{b/b}}{N},\quad \theta_{b/o} = \frac{N_{b/o}}{N},\quad \theta_{a/b} = \frac{N_{a/b}}{N},\quad \theta_{o/o} = \frac{N_{o/o}}{N}$$

The ABO locus (Cont.)

However, testing individuals for their genotype is very expensive. Can we estimate the proportions of genotypes using the common, cheap blood test, whose outcome is one of the four blood types (A, B, AB, O)?

The problem is that among individuals measured to have blood type A, we don’t know how many have genotype a/a and how many have genotype a/o. So what can we do?

We use the Hardy-Weinberg equilibrium rule, which tells us that in equilibrium the frequencies θa, θb, θo of the three alleles a, b, o in the population determine the frequencies of the genotypes as follows:

θa/b = 2θaθb, θa/o = 2θaθo, θb/o = 2θbθo, θa/a = [θa]², θb/b = [θb]², θo/o = [θo]².

So now we have three parameters that we need to estimate.

The Likelihood Function

Let X be a random variable with 6 values xa/a, xa/o, xb/b, xb/o, xa/b, xo/o denoting the six genotypes. The parameters are θ = {θa, θb, θo}.

The probability P(X = xa/b | θ) = 2θaθb.

The probability P(X = xo/o | θ) = θoθo. And so on for the other four genotypes.

What is the probability of Data = {B, A, B, B, O, A, B, A, O, B, AB}?

$$P(\mathrm{Data} \mid \theta) = \left(\theta_a^2 + 2\theta_a\theta_o\right)^{3} \left(\theta_b^2 + 2\theta_b\theta_o\right)^{5} \left(2\theta_a\theta_b\right)^{1} \left(\theta_o^2\right)^{2}$$

Obtaining the maximum of this function yields the MLE.

ABO loci as a special case of HMM

Model the ABO sampling as an HMM with 6 states (genotypes): a/a, a/b, a/o, b/b, b/o, o/o, and 4 outputs (blood types): A, B, AB, O. Assume 3 transition types: a, b and o, where a state is determined by 2 successive transitions. The probability of transition x is θx.

An emission is produced at every other state, and is determined by the state. E.g., ea/o(A) = 1, since a/o produces blood type A.

[Figure: chain of allele transitions (a, o, b, a, …) in which every second state is a genotype, e.g. a/o emitting A and a/b emitting AB]

A faster and simpler EM for ABO loci

The problem can be solved via Baum-Welch EM training. This is quite inefficient: for L samples it requires running the forward and backward algorithms on an HMM of length 2L, even though there are only 6 distinct genotypes. Direct application of the EM algorithm yields a simpler and more efficient way.

The EM algorithm in Bayes’ nets

E-step

Go over the data:

Sum the expectations of the hidden variables that you get from each data element

M-step

For every hidden variable x

Update your belief according to the expectation you calculated in the last E-step

EM - ABO Example

[Figure: hidden variable a/b/o (allele) with an edge to the observed variable A / B / AB / O (blood type)]

Data:

type    #people
A       100
B       200
AB      50
O       50

θ = {θa, θb, θo} is the parameter we need to evaluate

We choose a “reasonable” θ = {0.2, 0.2, 0.6}

EM - ABO Example

E-step:

$$E[l \mid \theta] = \sum_{i=1}^{n} E[L_1 = l \mid m_i, \theta]$$

$$E[L_1 = l \mid m_i, \theta] = P(L_1 = l \mid m_i, \theta) = \frac{P(L_1 = l, m_i \mid \theta)}{\sum_{l'} P(L_1 = l', m_i \mid \theta)}$$

M-step:

$$\theta_l^{t+1} = \frac{E[l \mid \theta^{t}]}{\sum_{l'} E[l' \mid \theta^{t}]}$$

With l = allele and m = blood type

EM - ABO Example

E-step:

We compute all the necessary elements:

P(L1=a, M=A) = θa(θa+θo); P(L1=b, M=A) = 0; P(L1=o, M=A) = θaθo

P(L1=a, M=B) = 0; P(L1=b, M=B) = θb(θb+θo); P(L1=o, M=B) = θbθo

P(L1=a, M=AB) = θaθb; P(L1=b, M=AB) = θaθb; P(L1=o, M=AB) = 0

P(L1=a, M=O) = 0; P(L1=b, M=O) = 0; P(L1=o, M=O) = θo²

EM - ABO Example

θ0 = {0.2, 0.2, 0.6}

n = 400 (data size)

Data:

type    #people
A       100
B       200
AB      50
O       50

E-step (1st step):

$$E[a \mid \theta^0] = \sum_{i=1}^{n} \frac{P(L_1=a, m_i)}{\sum_{l \in \{a,b,o\}} P(L_1=l, m_i)} = \sum_{m} n_m \frac{P(L_1=a, m)}{\sum_{l \in \{a,b,o\}} P(L_1=l, m)}$$

$$= 100\,\frac{P(a, A)}{\sum_{l} P(l, A)} + 200\,\frac{P(a, B)}{\sum_{l} P(l, B)} + 50\,\frac{P(a, AB)}{\sum_{l} P(l, AB)} + 50\,\frac{P(a, O)}{\sum_{l} P(l, O)}$$

$$= 100\,\frac{\theta_a(\theta_a+\theta_o)}{\theta_a(\theta_a+2\theta_o)} + 200 \cdot 0 + 50\,\frac{\theta_a\theta_b}{2\theta_a\theta_b} + 50 \cdot 0 \approx 80$$

EM - ABO Example

θ0 = {0.2, 0.2, 0.6}

n = 400 (data size)

E-step (1st step):

$$E[b \mid \theta^0] = \sum_{m} n_m \frac{P(L_1=b, m)}{\sum_{l \in \{a,b,o\}} P(L_1=l, m)} = 100 \cdot 0 + 200\,\frac{\theta_b(\theta_b+\theta_o)}{\theta_b(\theta_b+2\theta_o)} + 50\,\frac{\theta_a\theta_b}{2\theta_a\theta_b} + 50 \cdot 0 \approx 140$$

EM - ABO Example

θ0 = {0.2, 0.2, 0.6}

n = 400 (data size)

E-step (1st step):

$$E[o \mid \theta^0] = \sum_{m} n_m \frac{P(L_1=o, m)}{\sum_{l \in \{a,b,o\}} P(L_1=l, m)} = 100\,\frac{\theta_a\theta_o}{\theta_a(\theta_a+2\theta_o)} + 200\,\frac{\theta_b\theta_o}{\theta_b(\theta_b+2\theta_o)} + 50 \cdot 0 + 50 \approx 180$$

EM - ABO Example

θ0 = {0.2, 0.2, 0.6}

n = 400 (data size)

M-step (1st step):

$$\theta_a^1 = \frac{E[a \mid \theta^0]}{\sum_l E[l \mid \theta^0]} = \frac{80}{80+140+180} = 0.20$$

$$\theta_b^1 = \frac{E[b \mid \theta^0]}{\sum_l E[l \mid \theta^0]} = \frac{140}{80+140+180} = 0.35$$

$$\theta_o^1 = \frac{E[o \mid \theta^0]}{\sum_l E[l \mid \theta^0]} = \frac{180}{80+140+180} = 0.45$$

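EM - ABO Example

θ1 = {0.2, 0.35, 0.45}

E-step (2nd step):

$$E[a \mid \theta^1] = \sum_{m} n_m \frac{P(L_1=a, m)}{\sum_{l \in \{a,b,o\}} P(L_1=l, m)} = 100\,\frac{\theta_a(\theta_a+\theta_o)}{\theta_a(\theta_a+2\theta_o)} + 50\,\frac{\theta_a\theta_b}{2\theta_a\theta_b} \approx 84$$

$$E[b \mid \theta^1] = 200\,\frac{\theta_b(\theta_b+\theta_o)}{\theta_b(\theta_b+2\theta_o)} + 50\,\frac{\theta_a\theta_b}{2\theta_a\theta_b} \approx 152$$

$$E[o \mid \theta^1] = 100\,\frac{\theta_a\theta_o}{\theta_a(\theta_a+2\theta_o)} + 200\,\frac{\theta_b\theta_o}{\theta_b(\theta_b+2\theta_o)} + 50 \approx 164$$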

EM - ABO Example

θ1 = {0.2, 0.35, 0.45}

M-step (2nd step):

$$\theta_a^2 = \frac{E[a \mid \theta^1]}{\sum_l E[l \mid \theta^1]} = \frac{84}{84+152+164} = 0.21$$

$$\theta_b^2 = \frac{E[b \mid \theta^1]}{\sum_l E[l \mid \theta^1]} = \frac{152}{84+152+164} = 0.38$$

$$\theta_o^2 = \frac{E[o \mid \theta^1]}{\sum_l E[l \mid \theta^1]} = \frac{164}{84+152+164} = 0.41$$

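EM - ABO Example

θ2 = {0.21, 0.38, 0.41}

E-step (3rd step):

$$E[a \mid \theta^2] = 100\,\frac{\theta_a(\theta_a+\theta_o)}{\theta_a(\theta_a+2\theta_o)} + 50\,\frac{\theta_a\theta_b}{2\theta_a\theta_b} \approx 84$$

$$E[b \mid \theta^2] = 200\,\frac{\theta_b(\theta_b+\theta_o)}{\theta_b(\theta_b+2\theta_o)} + 50\,\frac{\theta_a\theta_b}{2\theta_a\theta_b} \approx 156$$

$$E[o \mid \theta^2] = 100\,\frac{\theta_a\theta_o}{\theta_a(\theta_a+2\theta_o)} + 200\,\frac{\theta_b\theta_o}{\theta_b(\theta_b+2\theta_o)} + 50 \approx 160$$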

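EM - ABO Example

M-step (3rd step):

$$\theta_a^3 = \frac{E[a \mid \theta^2]}{\sum_l E[l \mid \theta^2]} = \frac{84}{84+156+160} = 0.21$$

$$\theta_b^3 = \frac{E[b \mid \theta^2]}{\sum_l E[l \mid \theta^2]} = \frac{156}{84+156+160} = 0.39$$

$$\theta_o^3 = \frac{E[o \mid \theta^2]}{\sum_l E[l \mid \theta^2]} = \frac{160}{84+156+160} = 0.40$$

Compared with θ2 = {0.21, 0.38, 0.41}: almost no change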
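The E-step and M-step of the ABO example can be run end to end with a short script. This is a sketch under the same modeling assumptions as the slides (Hardy-Weinberg proportions, one tracked allele L1 per person); the function name `em_step` and the data layout are hypothetical. Exact floating-point arithmetic gives values that differ slightly from the rounded counts quoted in the example (for instance about 82.1 rather than 80 for the first expected count of allele a).

```python
def em_step(theta, counts):
    """One EM iteration for allele frequencies theta = {a, b, o}
    given blood-type counts = {A, B, AB, O}."""
    a, b, o = theta["a"], theta["b"], theta["o"]
    # joint[m][l] = P(L1 = l, M = m | theta), as tabulated in the E-step slide.
    joint = {
        "A":  {"a": a * (a + o), "b": 0.0,         "o": a * o},
        "B":  {"a": 0.0,         "b": b * (b + o), "o": b * o},
        "AB": {"a": a * b,       "b": a * b,       "o": 0.0},
        "O":  {"a": 0.0,         "b": 0.0,         "o": o * o},
    }
    # E-step: expected number of people whose allele L1 equals l.
    expected = {l: 0.0 for l in "abo"}
    for m, n_m in counts.items():
        z = sum(joint[m].values())                 # P(M = m | theta)
        for l in "abo":
            expected[l] += n_m * joint[m][l] / z
    # M-step: renormalize the expected counts.
    total = sum(expected.values())
    return {l: expected[l] / total for l in "abo"}, expected

counts = {"A": 100, "B": 200, "AB": 50, "O": 50}
theta = {"a": 0.2, "b": 0.2, "o": 0.6}
history = []
for step in range(1, 21):
    theta, expected = em_step(theta, counts)
    history.append((dict(theta), dict(expected)))
    print(step, {l: round(v, 4) for l, v in theta.items()})
```

Printing the estimates at every step makes it easy to see how quickly the allele frequencies stabilize for this data set.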