CS5263 Bioinformatics Lecture 11: Markov Chain and Hidden Markov Models

Page 1: CS5263 Bioinformatics Lecture 11: Markov Chain and Hidden Markov Models

CS5263 Bioinformatics

Lecture 11: Markov Chain and Hidden Markov Models

Page 2: CS5263 Bioinformatics Lecture 11: Markov Chain and Hidden Markov Models

In the next few lectures

• Hidden Markov Models
  – The theory
  – Probabilistic treatment of sequence alignments using HMMs
  – Application of HMMs to biological sequence modeling and discovery of features such as genes

Page 3: CS5263 Bioinformatics Lecture 11: Markov Chain and Hidden Markov Models

Markov models

• A sequence of random variables is a k-th order Markov chain if, for all i, the i-th value is independent of all but the previous k values:

  P(xi | x1, …, xi-1) = P(xi | xi-k, …, xi-1)

• First order: P(xi | x1, …, xi-1) = P(xi | xi-1)
• Second order: P(xi | x1, …, xi-1) = P(xi | xi-2, xi-1)

• 0th order: P(xi | x1, …, xi-1) = P(xi) (independence)

Page 4: CS5263 Bioinformatics Lecture 11: Markov Chain and Hidden Markov Models

Probability to generate a sequence with a 1st order Markov model:

P(x) = P(x1) P(x2 | x1) P(x3 | x2) … P(xL | xL-1)

Page 5: CS5263 Bioinformatics Lecture 11: Markov Chain and Hidden Markov Models

CpG islands

• For biological reasons, CpG (a C followed by a G) is very rare in mammalian genomes

• But it is relatively more common in promoter regions

• Detecting CpG islands helps predict genes
  – Model: consider all 16 conditional probabilities
    • P(G | C), P(T | C), P(A | C), P(C | C), P(G | T) …

Page 6: CS5263 Bioinformatics Lecture 11: Markov Chain and Hidden Markov Models

A 1st order Markov model for CpG islands

• Essentially a finite state automaton
• Transitions are probabilistic

• 4 states: A, C, G, T
• 16 transitions: ast = P(xi = t | xi-1 = s)
• Begin/End states (for convenience)

Page 7: CS5263 Bioinformatics Lecture 11: Markov Chain and Hidden Markov Models

Probability of a sequence

• What’s the probability of ACGGCTA?
  – For a 0th order Markov model, simply P(A)P(C)…
  – For a 1st order Markov model like the one above:

    P(A) * P(C|A) * P(G|C) … P(A|T)

    = aBA aAC aCG … aTA

Page 8: CS5263 Bioinformatics Lecture 11: Markov Chain and Hidden Markov Models

Probability of a sequence

• What’s the probability of ACGGCTA?

  P(A) * P(C|A) * P(G|C) … P(A|T)

  = aBA aAC aCG … aTA

• Equivalent: follow the path in the automaton, and multiply the probability of each arc on the path

• Given a sequence, there is only one path you can walk.
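As a concrete illustration, a minimal Python sketch of this path product; the transition values below are made-up placeholders, not trained CpG parameters:

# Hypothetical 1st-order transition probabilities a[s][t] = P(t | s);
# 'B' is the begin state. The values are placeholders, not trained ones.
a = {
    'B': {'A': 0.25, 'C': 0.25, 'G': 0.25, 'T': 0.25},
    'A': {'A': 0.2, 'C': 0.3, 'G': 0.3, 'T': 0.2},
    'C': {'A': 0.3, 'C': 0.3, 'G': 0.1, 'T': 0.3},
    'G': {'A': 0.2, 'C': 0.3, 'G': 0.3, 'T': 0.2},
    'T': {'A': 0.3, 'C': 0.2, 'G': 0.3, 'T': 0.2},
}

def chain_prob(x, a):
    """Probability of x under a 1st order Markov chain: follow the unique
    path B -> x1 -> x2 -> ... and multiply the arc probabilities."""
    p = 1.0
    prev = 'B'
    for s in x:
        p *= a[prev][s]
        prev = s
    return p

print(chain_prob("ACGGCTA", a))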

Page 9: CS5263 Bioinformatics Lecture 11: Markov Chain and Hidden Markov Models

Training

• Estimate the parameters of the model
  – Training sequences: sequences with labels (CpG islands or not CpG islands)
  – Test sequences: sequences for testing (labels removed)

• Two models
  – CpG model learned from known CpG islands
  – Background model learned from known non-CpG islands

• What to learn?
  – Transition frequencies
  – ast = #(s→t) / #(s→·)
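A minimal sketch of this counting step, assuming the training data arrives as plain strings already separated by label; the names here are only illustrative:

from collections import defaultdict

def train_transitions(sequences):
    """Estimate a_st = #(s -> t) / #(s -> .) from labeled training sequences."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for s, t in zip(seq, seq[1:]):
            counts[s][t] += 1
    a = {}
    for s, row in counts.items():
        total = sum(row.values())
        a[s] = {t: n / total for t, n in row.items()}
    return a

# One model per label (toy data):
a_plus = train_transitions(["CGCGCGGC", "GCGCGC"])      # known CpG islands
a_minus = train_transitions(["ATTTAAAGT", "TTATAATA"])  # known non-CpG sequences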

Page 10: CS5263 Bioinformatics Lecture 11: Markov Chain and Hidden Markov Models

Training

• Parameters learned from known CpG islands and non-CpG islands

Page 11: CS5263 Bioinformatics Lecture 11: Markov Chain and Hidden Markov Models

Discrimination / Classification

• Given a sequence, is it a CpG island or not?
• Log likelihood ratio
• X = ACGGCTA

• Compute log ( P(X | CpG+) / P(X | CpG-) )
• or log ( P(CpG+ | X) / P(CpG- | X) ) given priors
• Q: how to get a+BA and a-BA? (B: begin state)

P(X | CpG+) = a+BA a+AC a+CG … a+TA

P(X | CpG-) = a-BA a-AC a-CG … a-TA
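A sketch of the log-odds classification, reusing the hypothetical train_transitions output above; a_plus and a_minus are assumed to contain every needed transition, including entries for the begin state:

import math

def log_odds(x, a_plus, a_minus, begin='B'):
    """log( P(x | CpG+) / P(x | CpG-) ): positive means the CpG+ model fits better."""
    score = 0.0
    prev = begin
    for s in x:
        score += math.log(a_plus[prev][s]) - math.log(a_minus[prev][s])
        prev = s
    return score

# Classify as CpG island if log_odds(x, a_plus, a_minus) exceeds a chosen threshold.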

Page 12: CS5263 Bioinformatics Lecture 11: Markov Chain and Hidden Markov Models

CpG island scores

Figure 3.2 (Durbin book) The histogram of length-normalized scores for all the sequences. CpG islands are shown with dark grey and non-CpG with light grey.

Page 13: CS5263 Bioinformatics Lecture 11: Markov Chain and Hidden Markov Models

Questions

• Q1: Given a short sequence, is it more likely to come from the feature model or the background model?

• Q2: Given a long sequence, where are the features in it (if any)?
  – Approach 1: score (e.g.) 100 bp windows
    • Pro: simple
    • Con: arbitrary, fixed length, inflexible
  – Approach 2: combine the + and - models.

Page 14: CS5263 Bioinformatics Lecture 11: Markov Chain and Hidden Markov Models

Combined model

Now, given a sequence, we cannot directly tell which state a symbol is in; there are two states for each symbol. (hidden Markov model => hidden states)

Page 15: CS5263 Bioinformatics Lecture 11: Markov Chain and Hidden Markov Models

Decoding

• Given a sequence, e.g. ACGTACGTATA…

• What’s the most probable path?

Page 16: CS5263 Bioinformatics Lecture 11: Markov Chain and Hidden Markov Models

Most probable path

A C G T A C G T A T

[Trellis diagram: starting from a begin state B, each position has two candidate states, e.g. A+ and A- for an observed A, with arcs between consecutive columns.]

Find a path with the following objective:

Maximize the product of probabilities for the arcs on the path

(equivalently, maximize the sum of log probabilities for the arcs on the path)

Probability of a path is the product of all probabilities for the arcs on the path

Page 17: CS5263 Bioinformatics Lecture 11: Markov Chain and Hidden Markov Models

Most probable path

V(+, i+1) = max { V(+, i) + w(xi+, xi+1+),
                  V(-, i) + w(xi-, xi+1+) }

V(-, i+1) = max { V(+, i) + w(xi+, xi+1-),
                  V(-, i) + w(xi-, xi+1-) }

[Trellis diagram as before: states xi+ and xi- for each position x1 … x10, starting from B.]

w(s, t) = log(ast)

Page 18: CS5263 Bioinformatics Lecture 11: Markov Chain and Hidden Markov Models

Simpler but more general case

• We can go to any state at any time; the transition doesn't depend on the input
• Each state can emit any symbol with some probability (possibly zero)

[State diagram: Fair and LOADED states; each stays in its own state with probability 0.95 and switches to the other with probability 0.05.]

P(1|F) = 1/6  P(2|F) = 1/6  P(3|F) = 1/6  P(4|F) = 1/6  P(5|F) = 1/6  P(6|F) = 1/6

P(1|L) = 1/10  P(2|L) = 1/10  P(3|L) = 1/10  P(4|L) = 1/10  P(5|L) = 1/10  P(6|L) = 1/2

Probability of a path is the product of all probabilities for the arcs on the path, times the probabilities of emitting the sequence of symbols
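This dishonest-casino model can be written down directly; a minimal sketch, with variable names chosen only for the examples that follow:

# States, transition probabilities and emission probabilities of the
# fair/loaded dice HMM described above.
states = ['F', 'L']

trans = {                      # trans[s][t] = P(next state = t | current state = s)
    'F': {'F': 0.95, 'L': 0.05},
    'L': {'F': 0.05, 'L': 0.95},
}

emit = {                       # emit[s][x] = P(emit symbol x | state s)
    'F': {x: 1/6 for x in '123456'},
    'L': {**{x: 1/10 for x in '12345'}, '6': 1/2},
}

start = {'F': 0.5, 'L': 0.5}   # begin-state probabilities (assumed uniform here)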

Page 19: CS5263 Bioinformatics Lecture 11: Markov Chain and Hidden Markov Models

• Two types of rewards
  – Mileage: for using an arc (fixed)
  – Bonus: for visiting a node (depending on the token you show)

Observed rolls: 2 5 2 1 3 6 2 6 1 6

[Trellis diagram: from B, states F and L at each of the 10 positions.]

Fair: P(s) = 1/6, for s in [1..6]
Loaded: P(s) = 1/10, for s in [1..5]; P(6) = 1/2

Page 20: CS5263 Bioinformatics Lecture 11: Markov Chain and Hidden Markov Models

V(F, i+1) = max { V(F, i) + w(F, F) + r(F, xi+1),
                  V(L, i) + w(L, F) + r(F, xi+1) }

V(L, i+1) = max { V(F, i) + w(F, L) + r(L, xi+1),
                  V(L, i) + w(L, L) + r(L, xi+1) }

[Trellis diagram: from B, states F and L at each position x1 … x10.]

Fair: P(s) = 1/6, for s in [1..6]
Loaded: P(s) = 1/10, for s in [1..5]; P(6) = 1/2

r(F, x) = log(eF(x))    w(F, L) = log(aFL)

Page 21: CS5263 Bioinformatics Lecture 11: Markov Chain and Hidden Markov Models

FSA interpretation

[State diagram, probability version: Fair and LOADED; stay with 0.95, switch with 0.05.]

P(1|F) = 1/6  P(2|F) = 1/6  P(3|F) = 1/6  P(4|F) = 1/6  P(5|F) = 1/6  P(6|F) = 1/6

P(1|L) = 1/10  P(2|L) = 1/10  P(3|L) = 1/10  P(4|L) = 1/10  P(5|L) = 1/10  P(6|L) = 1/2

[State diagram, log-score version: Fair and LOADED; stay arcs score 2.3, switch arcs score -0.7.]

r(F,1) = 0.5  r(F,2) = 0.5  r(F,3) = 0.5  r(F,4) = 0.5  r(F,5) = 0.5  r(F,6) = 0.5

r(L,1) = 0  r(L,2) = 0  r(L,3) = 0  r(L,4) = 0  r(L,5) = 0  r(L,6) = 1.6

r = log(10 * p)

V(L, i+1) = max { V(L, i) + W(L, L) + r(L, xi+1), V(F, i) + W(F, L) + r(L, xi+1) }

V(F, i+1) = max { V(L, i) + W(L, F) + r(F, xi+1), V(F, i) + W(F, F) + r(F, xi+1) }
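A tiny sketch of this rescaling (natural log of 10 * p, which reproduces the rounded scores on the slide); since every arc and emission gets the same additive shift of ln 10, all paths over the same sequence shift equally and the best path is unchanged:

import math

def rescale(p):
    """Log score used on the slide: r = ln(10 * p)."""
    return math.log(10 * p)

print(round(rescale(1/6), 1))   # 0.5  (fair-die emissions)
print(round(rescale(1/2), 1))   # 1.6  (loaded die emits 6)
print(round(rescale(0.95), 1))  # 2.3  (stay transitions)
print(round(rescale(0.05), 1))  # -0.7 (switch transitions)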

Page 22: CS5263 Bioinformatics Lecture 11: Markov Chain and Hidden Markov Models

More general cases

[Trellis diagram: a fully connected FSA with k states (1, 2, 3, …, k); every state appears at every position x1 … x10, starting from B.]

Page 23: CS5263 Bioinformatics Lecture 11: Markov Chain and Hidden Markov Models

V(1, i+1) = max { V(1,i) + w(1,1) + r(1, xi+1),
                  V(2,i) + w(2,1) + r(1, xi+1),
                  V(3,i) + w(3,1) + r(1, xi+1),
                  ……
                  V(k,i) + w(k,1) + r(1, xi+1) }

V(2, i+1) = max { V(1,i) + w(1,2) + r(2, xi+1),
                  V(2,i) + w(2,2) + r(2, xi+1),
                  V(3,i) + w(3,2) + r(2, xi+1),
                  ……
                  V(k,i) + w(k,2) + r(2, xi+1) }

......

A FSA with k states, fully connected

Page 24: CS5263 Bioinformatics Lecture 11: Markov Chain and Hidden Markov Models

Generalize

V(j, i+1) = max { V(1,i) + w(1, j) + r(j, xi+1),
                  V(2,i) + w(2, j) + r(j, xi+1),
                  V(3,i) + w(3, j) + r(j, xi+1),
                  ……
                  V(k,i) + w(k, j) + r(j, xi+1) }

Or simply:

V(j, i+1) = maxl { V(l,i) + w(l, j) + r(j, xi+1) }

Page 25: CS5263 Bioinformatics Lecture 11: Markov Chain and Hidden Markov Models

• The previous equation assumes a completely connected structure; in practice this may not be the case.

[State diagram: Fair in the middle, going to Load_1 and Load_2 with 0.05 each and staying with 0.9; Load_1 and Load_2 each stay with 0.9 and return to Fair with 0.1; no direct transition between Load_1 and Load_2 (probability 0).]

P(1|F) = 1/6  P(2|F) = 1/6  P(3|F) = 1/6  P(4|F) = 1/6  P(5|F) = 1/6  P(6|F) = 1/6

P(1|L1) = 1/10  P(2|L1) = 1/10  P(3|L1) = 1/10  P(4|L1) = 1/10  P(5|L1) = 1/10  P(6|L1) = 1/2

P(1|L2) = 1/2  P(2|L2) = 1/10  P(3|L2) = 1/10  P(4|L2) = 1/10  P(5|L2) = 1/10  P(6|L2) = 1/10

Page 26: CS5263 Bioinformatics Lecture 11: Markov Chain and Hidden Markov Models

[Same state diagram as above: Load_1 – Fair – Load_2.]

Take advantage of the sparse structure:

V(F, i+1) = max { V(L1, i) + w(L1, F) + r(F, xi+1),
                  V(L2, i) + w(L2, F) + r(F, xi+1),
                  V(F, i) + w(F, F) + r(F, xi+1) }

V(L1, i+1) = max { V(L1, i) + w(L1, L1) + r(L1, xi+1),
                   V(F, i) + w(F, L1) + r(L1, xi+1) }

V(L2, i+1) = max { V(L2, i) + w(L2, L2) + r(L2, xi+1),
                   V(F, i) + w(F, L2) + r(L2, xi+1) }

Page 27: CS5263 Bioinformatics Lecture 11: Markov Chain and Hidden Markov Models

The Viterbi Algorithm

Input: x = x1……xN

Initialization:
  P0(0) = 1 (the 0 subscript is the start state)
  Pk(0) = 0, for all k > 0 (the 0 in parentheses is the imaginary first position)

Iteration:
  for each position i = 1..N
    for each state j = 1..K
      Pj(i) = ej(xi) maxk { akj Pk(i-1) }
      Ptrj(i) = argmaxk { akj Pk(i-1) }
    end
  end

Termination:
  Prob(x, π*) = maxk Pk(N)

Traceback:
  πN* = argmaxk Pk(N)
  πi-1* = Ptrπi*(i)

Page 28: CS5263 Bioinformatics Lecture 11: Markov Chain and Hidden Markov Models

The Viterbi Algorithm (in log space)

Input: x = x1……xN

Initialization:
  V0(0) = 0 (the 0 subscript is the start state)
  Vk(0) = -inf, for all k > 0 (the 0 in parentheses is the imaginary first position)

Iteration:
  for each position i = 1..N
    for each state j = 1..K
      Vj(i) = rj(xi) + maxk { wkj + Vk(i-1) }      where rj(xi) = log(ej(xi)), wkj = log(akj)
      Ptrj(i) = argmaxk { wkj + Vk(i-1) }
    end
  end

Termination:
  Prob(x, π*) = exp{ maxk Vk(N) }

Traceback:
  πN* = argmaxk Vk(N)
  πi-1* = Ptrπi*(i)
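A runnable sketch of this log-space Viterbi; it assumes the states/start/trans/emit dictionaries from the earlier casino sketch, and it folds the begin-state probability into position 1 instead of using an imaginary position 0:

import math

def viterbi(x, states, start, trans, emit):
    """Most probable state path in log space: V_j(i) = r_j(x_i) + max_k (w_kj + V_k(i-1))."""
    V = [{s: math.log(start[s]) + math.log(emit[s][x[0]]) for s in states}]
    ptr = [{}]
    for i in range(1, len(x)):
        V.append({})
        ptr.append({})
        for j in states:
            best_k = max(states, key=lambda k: V[i-1][k] + math.log(trans[k][j]))
            V[i][j] = math.log(emit[j][x[i]]) + V[i-1][best_k] + math.log(trans[best_k][j])
            ptr[i][j] = best_k
    # Traceback from the best final state.
    last = max(states, key=lambda k: V[-1][k])
    path = [last]
    for i in range(len(x) - 1, 0, -1):
        path.append(ptr[i][path[-1]])
    return list(reversed(path)), V[-1][last]

# e.g. with the casino dictionaries from the earlier sketch:
# path, logprob = viterbi("265116666664", states, start, trans, emit)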

Page 29: CS5263 Bioinformatics Lecture 11: Markov Chain and Hidden Markov Models

The Viterbi Algorithm

Similar to “aligning” a set of states to a sequence

Time: O(K^2 N)

Space: O(KN)

[DP matrix: rows are states 1..K, columns are positions x1 x2 x3 … xN; cell (j, i) holds Vj(i).]

Page 30: CS5263 Bioinformatics Lecture 11: Markov Chain and Hidden Markov Models

• So far so good

• but

Page 31: CS5263 Bioinformatics Lecture 11: Markov Chain and Hidden Markov Models

Problem

• Is the most probable path necessarily the best one?
  – Single optimal vs. multiple sub-optimal
    • What if there are many sub-optimal paths with slightly lower probabilities?
  – Global optimal vs. local optimal
    • What's best globally may not be the best for each individual position

Page 32: CS5263 Bioinformatics Lecture 11: Markov Chain and Hidden Markov Models

For example

• Probability for path 1-2-5-7 is 0.4
• Probability for path 1-3-6-7 is 0.3
• Probability for path 1-4-6-7 is 0.3
• What's the most probable state at step 2?
  – 0.4 goes through 5
  – 0.6 goes through 6
  – Viterbi may not be the only interesting answer

Page 33: CS5263 Bioinformatics Lecture 11: Markov Chain and Hidden Markov Models

Another example

• The dishonest casino
• Say x = 12341623162616364616234161221341
• Most probable path: π = FF……F
• However: marked letters are more likely to be L than unmarked letters
• Another way to interpret the problem:
  – With Viterbi, every position is assigned a label
  – Confidence level for each assignment?

Page 34: CS5263 Bioinformatics Lecture 11: Markov Chain and Hidden Markov Models

Posterior decoding

• Viterbi finds the single path π with the highest probability P(x, π)
  – That probability is usually very tiny anyway

• What we want to know instead is P(πi = k | x), the probability that position i is in state k given the whole sequence (these sum to 1 over k for each i)

Page 35: CS5263 Bioinformatics Lecture 11: Markov Chain and Hidden Markov Models

Probability of a sequence

• The probability that a sequence is generated by a model: P(X | M)

• Sometimes simply written as P(X)
• Sometimes written as P(X | M, θ) or P(X | θ) to emphasize that we are looking for a θ that optimizes the likelihood

• Not equal to the probability of a path, P(X, π)
  – There are many possible paths that can generate X
  – Each with a different probability
  – P(X) = Σπ P(X, π) = Σπ P(X | π) P(π)
  – Why do we need this?
  – How to compute it without summing over all paths (exponentially many)?

Page 36: CS5263 Bioinformatics Lecture 11: Markov Chain and Hidden Markov Models

The forward algorithm

• Define fk(i) as the probability of emitting the prefix x1…xi and ending in state k at step i: fk(i) = P(x1…xi, πi = k)
  – Sum over all possible previous paths
  – Remember the way we counted the number of alignments?

Page 37: CS5263 Bioinformatics Lecture 11: Markov Chain and Hidden Markov Models

The forward algorithm

Page 38: CS5263 Bioinformatics Lecture 11: Markov Chain and Hidden Markov Models

The forward algorithm

We can compute fk(i) for all k, i, using dynamic programming!

Initialization:
  f0(0) = 1
  fk(0) = 0, for all k > 0

Iteration:
  fk(i) = ek(xi) Σj fj(i-1) ajk

Termination:
  Prob(x) = Σk fk(N)
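A sketch of the forward recursion in ordinary probability space (fine for short sequences; long ones would need scaling or log-sum-exp). It takes the same hypothetical dictionaries as the Viterbi sketch and folds the begin-state probability into position 1:

def forward(x, states, start, trans, emit):
    """f_k(i) = e_k(x_i) * sum_j f_j(i-1) * a_jk; returns the full table and P(x)."""
    f = [{k: start[k] * emit[k][x[0]] for k in states}]
    for i in range(1, len(x)):
        f.append({
            k: emit[k][x[i]] * sum(f[i-1][j] * trans[j][k] for j in states)
            for k in states
        })
    prob_x = sum(f[-1][k] for k in states)
    return f, prob_x

# e.g. f, px = forward("265116666664", states, start, trans, emit)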

Page 39: CS5263 Bioinformatics Lecture 11: Markov Chain and Hidden Markov Models

Relation between Forward and Viterbi

VITERBI

Initialization:
  P0(0) = 1
  Pk(0) = 0, for all k > 0

Iteration:
  Pk(i) = ek(xi) maxj Pj(i-1) ajk

Termination:
  Prob(x, π*) = maxk Pk(N)

FORWARD

Initialization:
  f0(0) = 1
  fk(0) = 0, for all k > 0

Iteration:
  fk(i) = ek(xi) Σj fj(i-1) ajk

Termination:
  Prob(x) = Σk fk(N)

Page 40: CS5263 Bioinformatics Lecture 11: Markov Chain and Hidden Markov Models

The backward algorithm

• Define bk(i) as the probability of emitting the rest of the sequence, given that position i is in state k: bk(i) = P(xi+1…xN | πi = k)
  – Sum over all possible paths from i to N

Page 41: CS5263 Bioinformatics Lecture 11: Markov Chain and Hidden Markov Models

Initialization: bk(N) = 1, for all k

Recursion: bk(i) = Σl akl el(xi+1) bl(i+1)

Note: bk(i) does not include the emission probability of xi
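A matching sketch of the backward recursion, using the same hypothetical dictionaries as before:

def backward(x, states, trans, emit):
    """b_k(i) = sum_l a_kl * e_l(x_{i+1}) * b_l(i+1), with b_k(N) = 1."""
    n = len(x)
    b = [dict() for _ in range(n)]
    b[n-1] = {k: 1.0 for k in states}
    for i in range(n - 2, -1, -1):
        for k in states:
            b[i][k] = sum(trans[k][l] * emit[l][x[i+1]] * b[i+1][l] for l in states)
    return b

# e.g. b = backward("265116666664", states, trans, emit)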

Page 42: CS5263 Bioinformatics Lecture 11: Markov Chain and Hidden Markov Models

• fk(i): prob to get to pos i in state k and emit xi

• bk(i): prob from i to end, given i is in state k

• What is fk(i) bk(i)?

Page 43: CS5263 Bioinformatics Lecture 11: Markov Chain and Hidden Markov Models

Is your guess correct?

Page 44: CS5263 Bioinformatics Lecture 11: Markov Chain and Hidden Markov Models

• What we need is P(πi = k | x)

• But: P(πi = k | x) = P(πi = k, x) / P(x) = fk(i) bk(i) / P(x)

• We have P(x) already from the forward algorithm

• Can verify: Σk P(πi = k | x) = 1

Page 45: CS5263 Bioinformatics Lecture 11: Markov Chain and Hidden Markov Models

The forward-backward algorithm

• Compute fk(i), for each state k and position i

• Compute bk(i), for each state k and position i

• Compute P(x) = Σk fk(N)

• Compute P(πi = k | x) = fk(i) * bk(i) / P(x)
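Putting the two sketches together gives the posterior probabilities and a posterior decoding (reusing the hypothetical forward and backward functions above):

def posterior(x, states, start, trans, emit):
    """P(pi_i = k | x) = f_k(i) * b_k(i) / P(x) for every position i and state k."""
    f, prob_x = forward(x, states, start, trans, emit)
    b = backward(x, states, trans, emit)
    return [{k: f[i][k] * b[i][k] / prob_x for k in states} for i in range(len(x))]

# Posterior decoding: pick the individually most likely state at each position.
# post = posterior("265116666664", states, start, trans, emit)
# decoded = [max(states, key=lambda k: p[k]) for p in post]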

Page 46: CS5263 Bioinformatics Lecture 11: Markov Chain and Hidden Markov Models

fk(i) bk(i) is the probability of x with the constraint that the path passes through state k at step i; dividing by P(x) gives P(πi = k | x).

Space: O(KN)
Time: O(K^2 N)

[Diagram: forward probabilities fk(i) cover the prefix of the state/sequence matrix, backward probabilities bk(i) cover the suffix.]

Page 47: CS5263 Bioinformatics Lecture 11: Markov Chain and Hidden Markov Models

Relation to another F-B algorithm

• We've seen a forward-backward algorithm before, in linear-space sequence alignment
  – Hirschberg's algorithm
  – Also known as the forward-backward alignment algorithm


Page 48: CS5263 Bioinformatics Lecture 11: Markov Chain and Hidden Markov Models

What’s P(i=k | x) good for?

• For each position, you can assign a probability (in [0, 1]) to the states that the system might be in at that point – confidence level

• Assign each symbol to the most-likely state according to this probability rather than the state on the most-probable path – posterior decoding

π̂i = argmaxk P(πi = k | x)

Page 49: CS5263 Bioinformatics Lecture 11: Markov Chain and Hidden Markov Models

For example

If P(fair) > 0.5, the roll is more likely to be generated by a fair die than a loaded die

Page 50: CS5263 Bioinformatics Lecture 11: Markov Chain and Hidden Markov Models

CpG islands again

• Data: 41 human sequences, totaling 60kbp, including 48 CpG islands of about 1kbp each

• Viterbi:
  – Found 46 of 48 islands
  – plus 121 "false positives"
  – After post-processing: 46/48, 67 false positives

• Posterior decoding:
  – Same 2 false negatives (46/48)
  – plus 236 false positives
  – After post-processing: 46/48, 83 false positives

Post-processing: merge predictions within 500 bp; discard predictions shorter than 500 bp

Page 51: CS5263 Bioinformatics Lecture 11: Markov Chain and Hidden Markov Models

What if a new genome comes?

We just sequenced the porcupine genome

We know CpG islands play the same role in this genome

However, we have no known CpG islands for porcupines

We suspect the frequency and characteristics of CpG islands are quite different in porcupines

How do we adjust the parameters in our model?

- LEARNING

Page 52: CS5263 Bioinformatics Lecture 11: Markov Chain and Hidden Markov Models

Learning

• When the states are known
  – We've already done that
  – Estimate parameters from labeled data (known CpG or non-CpG)
  – "Supervised" learning

• When the states are unknown
  – Estimate parameters without labeled data
  – "Unsupervised" learning

Page 53: CS5263 Bioinformatics Lecture 11: Markov Chain and Hidden Markov Models

Basic idea

1. We estimate our “best guess” on the model parameters θ

2. We use θ to predict the unknown labels

3. We re-estimate a new set of θ

4. Repeat 2 & 3

Two ways

Page 54: CS5263 Bioinformatics Lecture 11: Markov Chain and Hidden Markov Models

Viterbi Training: given θ, estimate π; then re-estimate θ

1. Make initial estimates of the parameters θ
2. Find the Viterbi path π for each training sequence
3. Count transitions/emissions on those paths, getting a new θ
4. Repeat 2 & 3

• Not rigorously optimizing desired likelihood, but still useful & commonly used.

• (Arguably good if you’re doing Viterbi decoding.)
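A compact sketch of this loop, reusing the hypothetical viterbi function above; the fixed iteration count and the +1 pseudocounts are simplifications, not part of the slide's recipe:

def viterbi_training(seqs, states, start, trans, emit, n_iter=10):
    """Repeat: decode each training sequence with Viterbi, then re-count
    transitions and emissions from the decoded paths to get new parameters."""
    for _ in range(n_iter):
        t_counts = {s: {t: 1.0 for t in states} for s in states}
        e_counts = {s: {x: 1.0 for x in emit[s]} for s in states}
        for x in seqs:
            path, _ = viterbi(x, states, start, trans, emit)
            for i, s in enumerate(path):
                e_counts[s][x[i]] += 1
                if i > 0:
                    t_counts[path[i-1]][s] += 1
        trans = {s: {t: c / sum(row.values()) for t, c in row.items()}
                 for s, row in t_counts.items()}
        emit = {s: {x: c / sum(row.values()) for x, c in row.items()}
                for s, row in e_counts.items()}
    return trans, emit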

Page 55: CS5263 Bioinformatics Lecture 11: Markov Chain and Hidden Markov Models

Baum-Welch Training: given θ, estimate the π ensemble; then re-estimate θ

• Instead of estimating the new θ from the most probable path
• We can re-estimate θ from all possible paths
• For example, suppose that according to Viterbi, position i is in state k and position (i+1) is in state l
  – This contributes 1 count towards the frequency with which the transition k→l is used
  – In Baum-Welch, this transition is counted only partially, according to the probability that this arc is taken by some path

Page 56: CS5263 Bioinformatics Lecture 11: Markov Chain and Hidden Markov Models
Page 57: CS5263 Bioinformatics Lecture 11: Markov Chain and Hidden Markov Models

Estimated number of k→l transitions (summing, over positions, the probability of using that arc):

A_kl = Σi fk(i) akl el(xi+1) bl(i+1) / P(x)
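A sketch of this E-step quantity using the forward/backward sketches above; A[k][l] accumulates the expected number of k→l transitions over one sequence:

def expected_transitions(x, states, start, trans, emit):
    """E-step of Baum-Welch: A_kl = sum_i f_k(i) * a_kl * e_l(x_{i+1}) * b_l(i+1) / P(x)."""
    f, prob_x = forward(x, states, start, trans, emit)
    b = backward(x, states, trans, emit)
    A = {k: {l: 0.0 for l in states} for k in states}
    for i in range(len(x) - 1):
        for k in states:
            for l in states:
                A[k][l] += f[i][k] * trans[k][l] * emit[l][x[i+1]] * b[i+1][l] / prob_x
    return A

# The M-step then renormalizes A (plus the expected emission counts) into new parameters.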

Page 58: CS5263 Bioinformatics Lecture 11: Markov Chain and Hidden Markov Models

Why is this working?

• The proof is complicated (ch. 11 in the Durbin book)
• But basically:
  – The forward-backward algorithm computes the likelihood of the data given the HMM
  – When we re-estimate θ, we maximize P(sequence | θ)
  – Effect: in each iteration, the likelihood of the sequence improves
  – Therefore, the procedure is guaranteed to converge (though not necessarily to a global optimum)

Page 59: CS5263 Bioinformatics Lecture 11: Markov Chain and Hidden Markov Models
Page 60: CS5263 Bioinformatics Lecture 11: Markov Chain and Hidden Markov Models

Expectation-Maximization (EM)

• The Baum-Welch algorithm is a special case of the expectation-maximization (EM) algorithm, a widely used technique in statistics for learning parameters from unlabeled data

• E-step: compute the expectations (e.g. the probability for each position to be in a certain state)

• M-step: maximum-likelihood parameter estimation

• We’ll see EM and similar techniques again in motif finding

• k-means clustering has some similar flavor

Page 61: CS5263 Bioinformatics Lecture 11: Markov Chain and Hidden Markov Models

HMM summary

• Viterbi – best single path

• Forward – sum over all paths

• Backward – similar

• Baum-Welch – training via EM and forward-backward

• Viterbi training – another "EM"-like procedure, but based on the single best path