Hidden Markov Models


Page 1:

Hidden Markov Models

Page 2:

The three main questions on HMMs

1. Evaluation

GIVEN an HMM M and a sequence x,
FIND Prob[ x | M ]

2. Decoding

GIVEN an HMM M and a sequence x,
FIND the sequence of states π that maximizes P[ x, π | M ]

3. Learning

GIVEN an HMM M with unspecified transition/emission probabilities, and a sequence x,

FIND parameters θ = (ei(.), aij) that maximize P[ x | θ ]

Page 3:

Decoding

GIVEN x = x1x2……xN

We want to find π = π1, ……, πN, such that P[ x, π ] is maximized

π* = argmaxπ P[ x, π ]

We can use dynamic programming!

Let Vk(i) = max{π1,…,πi-1} P[x1…xi-1, π1, …, πi-1, xi, πi = k] = Probability of the most likely sequence of states ending at state πi = k

[Figure: trellis with K states per column over positions x1, x2, x3, …, xK, showing a path of states threading through the columns]

Page 4:

The Viterbi Algorithm

Similar to “aligning” a set of states to a sequence

Time: O(K²N)

Space: O(KN)

[Figure: the K × N dynamic-programming matrix Vj(i), with rows State 1 … K and columns x1 x2 x3 ……… xN]
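To make the recurrence concrete, here is a minimal Viterbi sketch in Python/NumPy. It is not from the slides: the function signature, the use of a begin-state vector a0 in place of transitions from state 0, and the convention that x holds integer symbol indices are illustrative assumptions. It follows the initialization, iteration, and termination summarized on Page 10.

import numpy as np

def viterbi(x, a, e, a0):
    """Most likely state path for x under an HMM.

    a[k, l] = transition probability a_kl (K x K)
    e[k, b] = emission probability e_k(b) (K x B)
    a0[k]   = a_0k, probability of entering state k from the begin state
    x       = observation sequence as integer symbol indices
    Returns (pi*, P(x, pi*)).
    """
    K, N = a.shape[0], len(x)
    V = np.zeros((K, N))                     # V[k, i] holds V_k(i+1) in slide numbering
    ptr = np.zeros((K, N), dtype=int)        # back-pointers for the traceback

    V[:, 0] = a0 * e[:, x[0]]                # V_l(1) = e_l(x_1) a_0l
    for i in range(1, N):                    # V_l(i) = e_l(x_i) max_k V_k(i-1) a_kl
        for l in range(K):
            scores = V[:, i - 1] * a[:, l]
            ptr[l, i] = int(np.argmax(scores))
            V[l, i] = e[l, x[i]] * scores[ptr[l, i]]

    best = int(np.argmax(V[:, N - 1]))       # P(x, pi*) = max_k V_k(N)
    path = [best]
    for i in range(N - 1, 0, -1):            # follow back-pointers to recover pi*
        path.append(int(ptr[path[-1], i]))
    return path[::-1], float(V[best, N - 1])

The two nested loops over positions and states give the O(K²N) time and O(KN) space quoted above.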

Page 5:

Evaluation

We demonstrated algorithms that allow us to compute:

P(x) Probability of x given the model

P(xi…xj) Probability of a substring of x given the model

P(πi = k | x) Probability that the ith state is k, given x

A more refined measure of which states x may be in

Page 6:

Motivation for the Backward Algorithm

We want to compute

P(πi = k | x),

the probability distribution on the ith position, given x

We start by computing

P(πi = k, x) = P(x1…xi, πi = k, xi+1…xN)

= P(x1…xi, πi = k) P(xi+1…xN | x1…xi, πi = k)

= P(x1…xi, πi = k) P(xi+1…xN | πi = k)

Then, P(πi = k | x) = P(πi = k, x) / P(x)

The first factor is the forward probability fk(i); the second is the backward probability bk(i)

Page 7:

The Backward Algorithm – derivation

Define the backward probability:

bk(i) = P(xi+1…xN | πi = k)

= Σ{πi+1…πN} P(xi+1, xi+2, …, xN, πi+1, …, πN | πi = k)

= Σl Σ{πi+1…πN} P(xi+1, xi+2, …, xN, πi+1 = l, πi+2, …, πN | πi = k)

= Σl el(xi+1) akl Σ{πi+2…πN} P(xi+2, …, xN, πi+2, …, πN | πi+1 = l)

= Σl el(xi+1) akl bl(i+1)

Page 8:

The Backward Algorithm

We can compute bk(i) for all k, i, using dynamic programming

Initialization:

bk(N) = ak0, for all k

Iteration:

bk(i) = Σl el(xi+1) akl bl(i+1)

Termination:

P(x) = Σl a0l el(x1) bl(1)
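A sketch of this recursion in Python/NumPy, under the same illustrative conventions as the Viterbi sketch above; a_end[k] stands for the end-state transitions a_k0, and if the model has no explicit end state a vector of ones can be passed instead.

import numpy as np

def backward(x, a, e, a0, a_end):
    """Backward values b[k, i] = P(x_{i+1}..x_N | pi_i = k), plus P(x)."""
    K, N = a.shape[0], len(x)
    b = np.zeros((K, N))
    b[:, N - 1] = a_end                            # initialization: b_k(N) = a_k0
    for i in range(N - 2, -1, -1):                 # iteration: b_k(i) = sum_l e_l(x_{i+1}) a_kl b_l(i+1)
        b[:, i] = a @ (e[:, x[i + 1]] * b[:, i + 1])
    px = float(np.sum(a0 * e[:, x[0]] * b[:, 0]))  # termination: P(x) = sum_l a_0l e_l(x_1) b_l(1)
    return b, px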

Page 9:

Computational Complexity

What are the running time and space required for Forward and Backward?

Time: O(K²N)

Space: O(KN)

Useful implementation technique to avoid underflows

Viterbi: sum of logs

Forward/Backward: rescaling at each position by multiplying by a constant
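As an example of the rescaling trick for Forward, here is one reasonable implementation in Python/NumPy. It is a sketch, not the slides' prescription: the choice of scaling constant (the column sum) and the omission of an explicit end state are assumptions.

import numpy as np

def forward_scaled(x, a, e, a0):
    """Forward algorithm with per-position rescaling to avoid underflow.

    Each column of f is divided by its sum; log P(x) is recovered from the
    scaling constants. Assumes no explicit end state (otherwise multiply the
    last column by a_k0 before the final sum).
    """
    K, N = a.shape[0], len(x)
    f = np.zeros((K, N))
    scale = np.zeros(N)

    f[:, 0] = a0 * e[:, x[0]]
    scale[0] = f[:, 0].sum()
    f[:, 0] /= scale[0]
    for i in range(1, N):
        f[:, i] = e[:, x[i]] * (f[:, i - 1] @ a)   # f_l(i) = e_l(x_i) sum_k f_k(i-1) a_kl
        scale[i] = f[:, i].sum()
        f[:, i] /= scale[i]                        # rescale by a constant at each position
    return f, float(np.sum(np.log(scale)))         # log P(x) = sum_i log(scale_i)

For Viterbi, the analogous fix is to store log Vk(i) and replace products with sums of logs.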

Page 10:

Viterbi, Forward, Backward

VITERBI

Initialization:

V0(0) = 1

Vk(0) = 0, for all k > 0

Iteration:

Vl(i) = el(xi) maxk Vk(i-1) akl

Termination:

P(x, π*) = maxk Vk(N)

FORWARD

Initialization:

f0(0) = 1

fk(0) = 0, for all k > 0

Iteration:

fl(i) = el(xi) Σk fk(i-1) akl

Termination:

P(x) = Σk fk(N) ak0

BACKWARD

Initialization:

bk(N) = ak0, for all k

Iteration:

bk(i) = Σl el(xi+1) akl bl(i+1)

Termination:

P(x) = Σk a0k ek(x1) bk(1)
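A small usage example tying the viterbi, forward_scaled, and backward sketches above together on the dishonest-casino HMM referred to later in these slides. The specific parameter values below (0.95/0.05 transitions, a loaded die weighted toward 6, uniform start) are illustrative assumptions, not given anywhere in this deck.

import numpy as np

# Illustrative dishonest-casino HMM: state 0 = Fair, state 1 = Loaded.
a  = np.array([[0.95, 0.05],
               [0.10, 0.90]])                   # a_kl (assumed values)
e  = np.array([[1/6] * 6,
               [0.1, 0.1, 0.1, 0.1, 0.1, 0.5]]) # e_k(b), die faces 1..6 as columns 0..5
a0 = np.array([0.5, 0.5])                       # begin transitions a_0k (assumed)
a_end = np.ones(2)                              # no explicit end state

rolls = [r - 1 for r in [2, 1, 5, 6, 1, 2, 3, 6, 2, 3]]   # the 10 rolls from Page 28

path, p_best = viterbi(rolls, a, e, a0)
f, log_px = forward_scaled(rolls, a, e, a0)
b, px = backward(rolls, a, e, a0, a_end)

print("Viterbi path (0=Fair, 1=Loaded):", path)
print("log P(x) from Forward :", log_px)
print("log P(x) from Backward:", np.log(px))    # the two terminations should agree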

Page 11:

Posterior Decoding

We can now calculate

P(πi = k | x) = fk(i) bk(i) / P(x)

Then, we can ask

What is the most likely state at position i of sequence x:

Define π^ by Posterior Decoding:

π^i = argmaxk P(πi = k | x)
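A sketch of posterior decoding built on the unscaled Forward and Backward recursions from the previous pages, under the same illustrative conventions as before; for long sequences the rescaled versions should be substituted.

import numpy as np

def posterior_decode(x, a, e, a0, a_end):
    """Returns P(pi_i = k | x) for all k, i, and the path pi^_i = argmax_k P(pi_i = k | x)."""
    K, N = a.shape[0], len(x)

    f = np.zeros((K, N))                         # forward: f_k(i)
    f[:, 0] = a0 * e[:, x[0]]
    for i in range(1, N):
        f[:, i] = e[:, x[i]] * (f[:, i - 1] @ a)

    b = np.zeros((K, N))                         # backward: b_k(i)
    b[:, N - 1] = a_end
    for i in range(N - 2, -1, -1):
        b[:, i] = a @ (e[:, x[i + 1]] * b[:, i + 1])

    px = np.sum(f[:, N - 1] * a_end)             # P(x)
    posterior = f * b / px                       # P(pi_i = k | x) = f_k(i) b_k(i) / P(x)
    return posterior, posterior.argmax(axis=0)

The argmax is taken independently at each position, which is why the resulting path may be invalid (it can use transitions of probability 0), as the next slide points out.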

Page 12:

Posterior Decoding

• For each state k, Posterior Decoding gives us a curve of the likelihood of that state at each position

That is sometimes more informative than the Viterbi path π*

• Posterior Decoding may give an invalid sequence of states

Why?

Page 13:

Posterior Decoding

• P(πi = k | x) = Σπ P(π | x) 1(πi = k)

= Σ{π : πi = k} P(π | x)

[Figure: for a state l, the posterior curve P(πi = l | x) plotted across positions x1 x2 x3 …… xN]

1(ψ) = 1 if ψ is true, 0 otherwise

Page 14:

Posterior Decoding

• Example: How do we compute P(πi = l, πj = l′ | x)?

P(πi = l, πj = l′ | x) = fl(i) P(xi+1…xj, πj = l′ | πi = l) bl′(j) / P(x)

[Figure: posterior curves P(πi = l | x) and P(πj = l′ | x) across positions x1 x2 x3 …… xN]

Page 15:

[Figure: eight-state model with states A+ C+ G+ T+ and A- C- G- T-]

A modeling Example

CpG islands in DNA sequences

Page 16:

Example: CpG Islands

CpG nucleotides in the genome are frequently methylated

(Written CpG to avoid confusion with the CG base pair)

C → methyl-C → T

Methylation is often suppressed around genes and promoters → CpG islands

Page 17:

Example: CpG Islands

• In CpG islands,

CG is more frequent

Other pairs (AA, AG, AT…) have different frequencies

Question: Detect CpG islands computationally

Page 18:

A model of CpG Islands – (1) Architecture

[Figure: states A+ C+ G+ T+ (CpG Island) and A- C- G- T- (Not CpG Island), with transitions within and between the two groups]

Page 19:

A model of CpG Islands – (2) Transitions

How do we estimate the parameters of the model?

Emission probabilities: 1/0 (each state emits its own nucleotide with probability 1 and all others with probability 0)

1. Transition probabilities within CpG islands

Established from many known (experimentally verified)

CpG islands

(Training Set)

2. Transition probabilities within other regions

Established from many known non-CpG islands

+       A     C     G     T
A    .180  .274  .426  .120
C    .171  .368  .274  .188
G    .161  .339  .375  .125
T    .079  .355  .384  .182

-       A     C     G     T
A    .300  .205  .285  .210
C    .233  .298  .078  .302
G    .248  .246  .298  .208
T    .177  .239  .292  .292

Page 20:

Log Likelihoods – Telling “Prediction” from “Random”

        A       C       G       T
A   -0.740  +0.419  +0.580  -0.803
C   -0.913  +0.302  +1.812  -0.685
G   -0.624  +0.461  +0.331  -0.730
T   -1.169  +0.573  +0.393  -0.679

Another way to see effects of transitions:

Log likelihoods

L(u, v) = log[ P(uv | + ) / P(uv | -) ]

Given a region x = x1…xN

A quick-&-dirty way to decide whether the entire region x is a CpG island:

P(x is CpG) > P(x is not CpG)  ⟺  Σi L(xi, xi+1) > 0
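A sketch of this quick-and-dirty test in Python, using the log-likelihood ratio table above; the helper name and the two toy input strings are illustrative.

import numpy as np

# L(u, v) = log[ P(uv | +) / P(uv | -) ], rows = u, columns = v, in order A C G T
L = np.array([[-0.740,  0.419,  0.580, -0.803],
              [-0.913,  0.302,  1.812, -0.685],
              [-0.624,  0.461,  0.331, -0.730],
              [-1.169,  0.573,  0.393, -0.679]])
IDX = {"A": 0, "C": 1, "G": 2, "T": 3}

def cpg_score(x):
    """Sum of L(x_i, x_{i+1}) over the region; > 0 suggests the whole region is a CpG island."""
    return sum(L[IDX[u], IDX[v]] for u, v in zip(x, x[1:]))

print(cpg_score("CGCGCGCG"))   # clearly positive: called CpG
print(cpg_score("ATATATAT"))   # clearly negative: called non-CpG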

Page 21:

A model of CpG Islands – (2) Transitions

• What about transitions between (+) and (-) states?

• They affect:

Avg. length of CpG island

Avg. separation between two CpG islands

[Figure: two-state model with states X and Y, self-transition probabilities p and q, and exit probabilities 1-p and 1-q]

Length distribution of region X:

P[lX = 1] = 1-p

P[lX = 2] = p(1-p)

P[lX = k] = p^(k-1)(1-p)

E[lX] = 1/(1-p)

Geometric distribution, with mean 1/(1-p)

Page 22:

[Figure: the eight-state model (A+ C+ G+ T+ / A- C- G- T-) with an arrow labeled 1–p for leaving the (+) group]

A model of CpG Islands – (2) Transitions

No reason to favor exiting/entering (+) and (-) regions at a particular nucleotide

To determine transition probabilities between (+) and (-) states

1. Estimate the average length of a CpG island: lCpG = 1/(1-p), so p = 1 – 1/lCpG

2. For each pair of (+) states k, l, let akl ← p × akl

3. For each (+) state k and (-) state l, let akl = (1-p)/4 (better: take the frequency of l in the (-) regions into account)

4. Do the same for (-) states

A problem with this model: real CpG island lengths do not follow a geometric (exponential-type) distribution

This is a defect of HMMs – compensated for by their ease of analysis & computation
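A sketch of steps 1-4 in Python, using the (+) and (-) transition tables from Page 19. The expected lengths passed in at the bottom are illustrative assumptions, and the exits are spread uniformly over the four states of the other group, i.e. the simple (1-p)/4 choice rather than the frequency-weighted one the slide recommends.

import numpy as np

# Within-region transition tables from Page 19 (rows/columns in order A C G T).
A_plus = np.array([[.180, .274, .426, .120],
                   [.171, .368, .274, .188],
                   [.161, .339, .375, .125],
                   [.079, .355, .384, .182]])
A_minus = np.array([[.300, .205, .285, .210],
                    [.233, .298, .078, .302],
                    [.248, .246, .298, .208],
                    [.177, .239, .292, .292]])

def combined_transitions(len_plus, len_minus):
    """8x8 transition matrix over (A+, C+, G+, T+, A-, C-, G-, T-)."""
    p = 1 - 1 / len_plus                   # step 1: p = 1 - 1/l_CpG
    q = 1 - 1 / len_minus                  # same idea for the (-) regions
    T = np.zeros((8, 8))
    T[:4, :4] = p * A_plus                 # step 2: a_kl <- p * a_kl within (+)
    T[:4, 4:] = (1 - p) / 4                # step 3: (+) -> (-) exits, uniform over A-,C-,G-,T-
    T[4:, 4:] = q * A_minus                # step 4: the same for the (-) states
    T[4:, :4] = (1 - q) / 4
    return T

T = combined_transitions(len_plus=1000, len_minus=100000)   # assumed average lengths
print(T.sum(axis=1))   # each row sums to ~1 (up to rounding in the published tables)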

Page 23:

Applications of the model

Given a DNA region x,

The Viterbi algorithm predicts locations of CpG islands

Given a nucleotide xi, (say xi = A)

The Viterbi parse tells whether xi is in a CpG island in the most likely general scenario

The Forward/Backward algorithms can calculate

P(xi is in a CpG island) = P(πi = A+ | x)

Posterior Decoding can assign locally optimal predictions of CpG islands

π^i = argmaxk P(πi = k | x)

Page 24:

What if a new genome comes?

• We just sequenced the porcupine genome

• We know CpG islands play the same role in this genome

• However, we have no known CpG islands for porcupines

• We suspect the frequency and characteristics of CpG islands are quite different in porcupines

How do we adjust the parameters in our model?

LEARNING

Page 25:

Problem 3: Learning

Re-estimate the parameters of the model based on training data

Page 26:

Two learning scenarios

1. Estimation when the “right answer” is known

Examples: GIVEN: a genomic region x = x1…x1,000,000 where we have good

(experimental) annotations of the CpG islands

GIVEN: the casino player allows us to observe him one evening, as he changes dice and produces 10,000 rolls

2. Estimation when the “right answer” is unknown

Examples: GIVEN: the porcupine genome; we don’t know how frequent the CpG islands are there, nor do we know their composition

GIVEN: 10,000 rolls of the casino player, but we don’t see when he changes dice

QUESTION: Update the parameters θ of the model to maximize P(x | θ)

Page 27:

1. When the right answer is known

Given x = x1…xN

for which the true path π = π1…πN is known,

Define:

Akl = # of times the k→l transition occurs in π

Ek(b) = # of times state k in π emits b in x

We can show that the maximum-likelihood parameters (maximizing P(x | θ)) are:

akl = Akl / Σi Aki          ek(b) = Ek(b) / Σc Ek(c)
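A sketch of this counting-and-normalizing step in Python. The function name and the optional pseudocount arguments (which anticipate the pseudocount slide two pages ahead) are illustrative; states that never appear in pi get an all-zero row unless pseudocounts are used.

def ml_estimate(x, pi, states=None, symbols=None, r_a=0.0, r_e=0.0):
    """Maximum-likelihood parameters from a sequence x with known state path pi.

    A_kl = # of k->l transitions in pi (+ pseudocount r_a)
    E_k(b) = # of times state k emits b (+ pseudocount r_e)
    a_kl = A_kl / sum_i A_ki      e_k(b) = E_k(b) / sum_c E_k(c)
    """
    states = sorted(set(pi)) if states is None else list(states)
    symbols = sorted(set(x)) if symbols is None else list(symbols)

    A = {k: {l: r_a for l in states} for k in states}
    E = {k: {b: r_e for b in symbols} for k in states}
    for k, l in zip(pi, pi[1:]):     # count transitions along the known path
        A[k][l] += 1
    for k, b in zip(pi, x):          # count emissions in each state
        E[k][b] += 1

    def normalize(row):
        total = sum(row.values())
        return {key: (v / total if total > 0 else 0.0) for key, v in row.items()}

    return {k: normalize(A[k]) for k in states}, {k: normalize(E[k]) for k in states}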

Page 28:

1. When the right answer is known

Intuition: When we know the underlying states, the best estimate is the average frequency of transitions & emissions that occur in the training data

Drawback: Given little data, there may be overfitting: P(x | θ) is maximized, but θ is unreasonable

0 probabilities – VERY BAD

Example: Given 10 casino rolls, we observe

x = 2, 1, 5, 6, 1, 2, 3, 6, 2, 3   π = F, F, F, F, F, F, F, F, F, F

Then: aFF = 1; aFL = 0

eF(1) = eF(3) = eF(6) = .2; eF(2) = .3; eF(4) = 0; eF(5) = .1
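Running the ml_estimate sketch from the previous page on these 10 rolls reproduces the numbers above (face 6 appears twice, so eF(6) = .2):

x  = [2, 1, 5, 6, 1, 2, 3, 6, 2, 3]
pi = ["F"] * 10
a, e = ml_estimate(x, pi, states=["F", "L"], symbols=[1, 2, 3, 4, 5, 6])
print(a["F"])   # {'F': 1.0, 'L': 0.0}
print(e["F"])   # {1: 0.2, 2: 0.3, 3: 0.2, 4: 0.0, 5: 0.1, 6: 0.2}  <- the 0 for face 4 is the problem
print(a["L"])   # all zeros: the Loaded state was never visited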

Page 29:

Pseudocounts

Solution for small training sets:

Add pseudocounts

Akl = # of times the k→l transition occurs in π + rkl

Ek(b) = # of times state k in π emits b in x + rk(b)

rkl, rk(b) are pseudocounts representing our prior belief

Larger pseudocounts → strong prior belief

Small pseudocounts (< 1): just to avoid 0 probabilities
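Continuing the 10-roll example from Page 28 with the same sketch, a small pseudocount is enough to remove the zero probabilities; the value 0.5 is arbitrary, in the spirit of small pseudocounts that just avoid zeros.

a, e = ml_estimate(x, pi, states=["F", "L"], symbols=[1, 2, 3, 4, 5, 6], r_a=0.5, r_e=0.5)
print(e["F"][4])    # now a small positive value instead of 0
print(a["F"]["L"])  # likewise no longer exactly 0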

Page 30:

Pseudocounts

Example: dishonest casino

We will observe player for one day, 600 rolls

Reasonable pseudocounts:

r0F = r0L = rF0 = rL0 = 1;

rFL = rLF = rFF = rLL = 1;

rF(1) = rF(2) = … = rF(6) = 20 (strong belief fair is fair)

rL(1) = rL(2) = … = rL(6) = 5 (wait and see for loaded)

Above #s pretty arbitrary – assigning priors is an art

Page 31:

2. When the right answer is unknown

We don’t know the true Akl, Ek(b)

Idea:

• We estimate our “best guess” on what Akl, Ek(b) are

• We update the parameters of the model, based on our guess

• We repeat
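This guess-update-repeat idea is the Baum-Welch / EM procedure. A sketch of one round is below; the concrete update formulas are the standard ones built from the forward and backward quantities defined earlier, not something spelled out on this slide, and unscaled recursions are used for brevity.

import numpy as np

def baum_welch_step(x, a, e, a0, a_end):
    """One round of: guess A_kl, E_k(b) under the current parameters, then update."""
    K, N, B = a.shape[0], len(x), e.shape[1]

    f = np.zeros((K, N))                             # forward
    f[:, 0] = a0 * e[:, x[0]]
    for i in range(1, N):
        f[:, i] = e[:, x[i]] * (f[:, i - 1] @ a)
    b = np.zeros((K, N))                             # backward
    b[:, N - 1] = a_end
    for i in range(N - 2, -1, -1):
        b[:, i] = a @ (e[:, x[i + 1]] * b[:, i + 1])
    px = np.sum(f[:, N - 1] * a_end)

    # "best guess" of the counts, as expectations under the current model:
    #   E_k(b) accumulates P(pi_i = k | x) over positions where x_i = b
    #   A_kl  accumulates f_k(i) a_kl e_l(x_{i+1}) b_l(i+1) / P(x)
    A = np.zeros((K, K))
    E = np.zeros((K, B))
    post = f * b / px
    for i in range(N - 1):
        A += np.outer(f[:, i], e[:, x[i + 1]] * b[:, i + 1]) * a / px
    for i in range(N):
        E[:, x[i]] += post[:, i]

    # update the parameters from the guessed counts (same normalization as Page 27)
    return A / A.sum(axis=1, keepdims=True), E / E.sum(axis=1, keepdims=True)

Repeating this step until P(x | θ) stops improving is the "repeat" part of the idea above.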