Page 1

Hidden Markov Models

As used to summarize multiple sequence alignments, and score new sequences

Page 2

Time flies like an arrow, but fruit flies like a banana

• The Markov Property: the state of a system depends only on its previous state.
  – Works very well for speech recognition: a sound is modified by the sound immediately preceding it.
  – For DNA and protein sequences, there is no reason to think that the subunit just upstream is the only influence: probably the whole neighborhood is important.
  – So, HMMs are just an approximation, but they work very well.

• In a first-order Markov model, the current event depends only on the immediately preceding event. Second-order models use the two preceding events, third-order models use the three preceding, etc.
  – As far as I know, all the HMMs for protein sequences that are readily available are first-order models.

Page 3

Hidden Markov Models

• Position-specific substitution matrices (PSSMs) have an important weakness: they don't deal very well with gaps caused by small insertions and deletions.
  – As we know, gaps are part of most sequence alignments.

• The standard solution these days: profile hidden Markov models (HMMs).
  – Like PSSMs, they are a statistical model of a multiple sequence alignment, but they easily incorporate gaps.
  – HMMs can be used to evaluate a query sequence: to determine how likely the query is to fit the model.

Page 4

Hidden Markov Models

• The basic problem is: given a series of observations, determine the most probable set of hidden states that generated them.
  – For matching a query sequence to a model, each residue in the query is assigned one of 3 states: it is a match/mismatch to a residue in the model, or it is deleted, or it is inserted.

• But to understand HMMs, we need a simpler and more concrete example, taken from Wikipedia, which we can call Life in St. Paul.

– I have a friend (Carol) in St. Paul who has only 3 interests in life: going for a walk, shopping, and cleaning her apartment.

– St. Paul weather can be either sunny or rainy. Carol decides which of her 3 activities she will do based on the weather.

– Every day she e-mails me to tell me what she is doing. (exciting!)

– From the sequence of her activities, I infer the most likely weather for each day.

Page 5

Terminology

• The HMM for Life in St. Paul has 2 states: rainy and sunny.
  – These states are hidden: we don't know them and can only infer them from Carol's behavior.

• For each state, there is a set of transition probabilities: what will happen the next day.
  – Which depends only on what happened today (the Markov property):
  – prob R -> R (stays rainy) = 0.7
  – prob R -> S (goes from rain to sun) = 0.3
  – prob S -> S (stays sunny) = 0.6
  – prob S -> R (goes from sun to rain) = 0.4

• For each state, there is a set of emission probabilities: what Carol's behavior will be under different weather conditions.
  – These behaviors are our observations: she tells us what she is doing each day.
  – If the state is Rainy:
    • walk = 0.1
    • clean = 0.5
    • shop = 0.4
  – If the state is Sunny:
    • walk = 0.6
    • clean = 0.1
    • shop = 0.3

• A path is a sequence of states through the model: what the state is for each day. We are trying to determine the optimal (most likely) path that fits our observations.
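
To keep the terminology straight, here is the St. Paul model written out as plain Python data (a minimal sketch; the dictionary layout is just one convenient way to hold the numbers above, not any particular library's format):

    # The two hidden states and their transition probabilities:
    # P(tomorrow's weather | today's weather).
    transition = {
        "Rainy": {"Rainy": 0.7, "Sunny": 0.3},
        "Sunny": {"Sunny": 0.6, "Rainy": 0.4},
    }

    # Emission probabilities: P(Carol's activity | today's weather).
    emission = {
        "Rainy": {"walk": 0.1, "clean": 0.5, "shop": 0.4},
        "Sunny": {"walk": 0.6, "clean": 0.1, "shop": 0.3},
    }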

Page 6

State Diagrams

• One way to draw a state diagram is to have arrows going back and forth between states. Note that there is also a transition back to the same state.
  – This represents eternity: each day is in one or the other state.

Page 7

State Diagrams

• Another way is to move forward in time, with a new pair of Rainy and Sunny states for each new day. I like this because it is a closer approximation to progressing through a sequence, trying to get a match between the query sequence and the model.

Page 8

Basic Questions

• We are interested in 3 basic questions:
  – Given the model and its parameters (the transition and emission probabilities), what is the most likely path that explains our observations?

• This is done using the Viterbi algorithm.

– What is the overall probability that the model fits our observations?

• This is the most common use of HMMs in bioinformatics: determining the probability that the query sequence fits the model

• Uses the forward algorithm, which is very similar to Viterbi

– Given a set of observations, how can we accurately estimate parameters for the model?

• This uses the Baum-Welch algorithm.

Page 9

Viterbi Algorithm

• Used to determine the most probable path (series of states) given a set of observed data.
  – The forward algorithm is closely related and can be performed at the same time: it generates the overall probability of the observed sequence.
  – Based on the idea that if you always take the most likely next step, you will get the most likely overall path. Given a first-order model, this seems quite reasonable.
  – Very similar to dynamic programming.

• Go through the observations from beginning to end, inferring a state of the hidden machine for each observation.

• Keep track of 3 values: overall probability (prob), Viterbi path (v_path), and Viterbi probability (v_prob), i.e. the probability of the observations given the Viterbi path.

• Test all possible next steps. The probability of a possible step given its corresponding observation is the probability of the transition to that next state times the emission probability of the observation in that state.

– for overall probability prob, multiply each new prob by the old prob, then add together (i.e. overall probability is sum of all possible next steps)

– for Viterbi prob, just take the highest next-step probability and multiply it by the previous v_prob

– for Viterbi path, add next state to list of previous states
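
A short Python sketch of the procedure exactly as described above: pick the single most likely next step for the Viterbi values, and sum the next-step probabilities for the running overall probability. A full Viterbi/forward implementation keeps a value for every state at every step; this simplified, greedy version is only meant to mirror the worked example on the following slides, and it re-declares the St. Paul parameters so it runs on its own.

    # Simplified trace following the slides' procedure (not a full
    # dynamic-programming Viterbi/forward pass).
    transition = {"Rainy": {"Rainy": 0.7, "Sunny": 0.3},
                  "Sunny": {"Sunny": 0.6, "Rainy": 0.4}}
    emission = {"Rainy": {"walk": 0.1, "clean": 0.5, "shop": 0.4},
                "Sunny": {"walk": 0.6, "clean": 0.1, "shop": 0.3}}

    def trace(observations):
        # Start: take the state most likely to emit the first observation.
        first = observations[0]
        state = max(emission, key=lambda s: emission[s][first])
        prob = v_prob = emission[state][first]
        v_path = [state]
        for obs in observations[1:]:
            # Probability of each possible next step = transition prob * emission prob.
            steps = {nxt: transition[state][nxt] * emission[nxt][obs]
                     for nxt in transition[state]}
            prob *= sum(steps.values())        # overall probability: sum over all next steps
            state = max(steps, key=steps.get)  # Viterbi: keep only the best next step
            v_prob *= steps[state]
            v_path.append(state)
        return prob, v_prob, "".join(s[0] for s in v_path)

    print(trace(["walk", "clean", "shop", "walk"]))
    # prob ~ 0.0144, v_prob ~ 0.006, v_path = 'SRRS' (the worked example that follows)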

Page 10

Our Example

• Observations: WCSW

• Start at the beginning. We will simply assume that the day before the beginning was sunny. The observation is W (walk), so go with the most probable state.
  – P(Walk | Sunny) = 0.6; P(Walk | Rainy) = 0.1
  – prob (overall probability) = 0.6
  – v_prob (Viterbi probability) = 0.6
  – v_path = S

• Step 2 (next day): observation is C (clean)
  – Given that the previous day was sunny:

1. stays sunny. Transition prob = 0.6. Prob of cleaning on a sunny day (emission prob) = 0.1. Total probability for this possibility (i.e. it stays sunny) = 0.6 x 0.1 = 0.06.

2. turns rainy. Transition prob = 0.4. Prob of cleaning on a rainy day = 0.5. Total prob for this step = 0.4 x 0.5 = 0.20. Therefore this is the most likely state of step 2

• Update overall probability: prob = 0.6 x 0.06 + 0.6 x 0.2 = 0.156
• Update Viterbi probability: v_prob = 0.6 x 0.2 = 0.12
• Update Viterbi path: v_path = SR

Page 11

More

• Step 3. Obs = S (shop)
  – Given that the previous day was Rainy:

1. stays rainy = 0.7. Prob of shopping on a rainy day = 0.4. Prob for this step = 0.7 x 0.4 = 0.28

2. turns sunny = 0.3. Prob of shopping on a sunny day = 0.3. Total = 0.3 x 0.3 = 0.09

– Best is R. v_path is now SRR
– Overall prob = previous prob (0.156) x sum of this step's possibilities: 0.156 x (0.28 + 0.09) = 0.0577
– v_prob = previous v_prob x best current prob = 0.12 x 0.28 = 0.0336

Page 12

Still More

• Step 4 (last). Obs = W
• The previous day was R:

1. R-> R transition prob = 0.7. Emission probability of a walk on a rainy day is 0.1. Total for this step = 0.7 x 0.1 = 0.07.

2. R -> S prob = 0.3. Emission prob = 0.6. Total = 0.3 x 0.6 = 0.18.
  – Best is S (turns sunny). v_path = SRRS
  – Overall prob = previous (0.0577) x sum of current possibilities (0.07 + 0.18 = 0.25) = 0.0577 x 0.25 = 0.0144.
  – v_prob = previous (0.0336) x best (0.18) = 0.00605.

• We have now determined the best set of states that would give us the observed data (v_path = SRRS), and the overall probability of the observed data given the model (prob = 0.0144).

• It is obvious that multiplying probabilities soon gets to very small numbers, so in general they are converted to logarithms and added.

Page 13

Results from a Longer Run

activity: CCSCCCSCWCWSCSCSSSCSSSWWWWSCCSSWCCSWWSCWSCSSCSSSWW

weather: RRRRRRSSSRSSRSRSRRRRRRSSSSSSRRRSRRRSSSRSSRRRSSRRSS

predicted: RRRRRRRRSRSSRRRRRRRRRRSSSSSRRRRSRRRSSSRSSRRRRRRRSS

• Viterbi prob = 5.48e-30
• Overall prob = 1.92e-24
• Percent correct = 86.0%

Page 14

Estimating Model Parameters

• The parameters that need to be estimated are the transition and emission probabilities. Since each set must sum to 1, our St. Paul weather HMM needs only 2 transition probabilities and 4 emission probabilities (2 per state) to be estimated.
  – Sequence models have very large numbers of parameters to estimate, which requires large amounts of data.

• If you have training data with known states (e.g. you know when it was sunny or rainy, plus your friend's activities, for a period of time), parameters come from simply calculating the proportion of each transition or emission.
  – Example: on rainy days, Carol took 6 walks, cleaned 30 times, and shopped 24 times. Emission probability estimates are 6/60 = 0.1, 30/60 = 0.5, 24/60 = 0.4. (Very frequentist!)
  – Problem of 0 counts: add a "pseudocount" so the probability isn't zero (because the event might actually happen). Typically add 1 to the numerator and the total number of possibilities to the denominator.
    • e.g. for walks, emission prob = (6 + 1) / (60 + 3) = 7/63 = 0.111
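
As a concrete illustration of the counting described above, here is a small Python sketch that turns the rainy-day counts from this slide into emission probabilities, with and without the Laplace pseudocount:

    # Rainy-day counts from the example above: 6 walks, 30 cleans, 24 shops.
    counts = {"walk": 6, "clean": 30, "shop": 24}
    total = sum(counts.values())  # 60 rainy-day observations

    # Plain frequentist estimate: proportion of each activity.
    plain = {a: c / total for a, c in counts.items()}

    # Laplace pseudocounts: add 1 to each count, and the number of
    # possible activities (3) to the total.
    pseudo = {a: (c + 1) / (total + len(counts)) for a, c in counts.items()}

    print(plain)   # {'walk': 0.1, 'clean': 0.5, 'shop': 0.4}
    print(pseudo)  # walk = 7/63 ~ 0.111, clean = 31/63 ~ 0.492, shop = 25/63 ~ 0.397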

Page 15

More Parameter Estimation

• If you don't know the underlying states (you just have a series of observations), you can use the Baum-Welch algorithm to estimate parameters.
  – This needs the backward algorithm in addition to the forward algorithm. The backward algorithm works exactly the same as the forward algorithm, except that you calculate from the end to the beginning.

• The Baum-Welch algorithm starts with a set of parameters that have been estimated or guessed.
  – Then feed each training sequence through the model, with overall probabilities calculated both forward and backward.
  – Then use these probabilities to weight the contribution of each sequence to the transition and emission probabilities.
  – Then use this to adjust the parameters.

Page 16

Profile HMMs

• Start with a multiple sequence alignment: we want to make a statistical model of it.

Page 17
Page 18

Building the Model: Match States

• If we were just trying to do an ungapped alignment, we could use a very simple, unbranched HMM, where every match state can only transition to another match state

– Each match state has an emission probability for each amino acid, with different emission probabilities for every match state

– This is essentially what a PSSM is: the scores in each column of a PSSM could be scaled to 0-1, giving emission probabilities.

– The transition probabilities are all 1: the only choice is to move to the next match state.
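
Because every transition probability in this unbranched model is 1, scoring a sequence against it reduces to multiplying the match-state emission probabilities column by column, exactly as a PSSM does. A minimal Python sketch (the three-column DNA emission table is made up purely for illustration, not taken from any real model):

    # Ungapped model: one match state per column, transitions all 1.0,
    # so the probability of a sequence is just the product of the emissions.
    match_emissions = [
        {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
        {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
        {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    ]

    def ungapped_score(seq):
        prob = 1.0
        for column, base in zip(match_emissions, seq):
            prob *= column[base]    # transition probability is 1, so it is omitted
        return prob

    print(ungapped_score("AGT"))    # 0.7 * 0.7 * 0.25 = 0.1225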

Page 19

Insertion States

• In a multiple alignment, there are often columns which are gaps in most sequences but have amino acids in a few sequences.
  – These columns are best dealt with as insertion states.

• As you move through the model, emitting the query sequence, insertion states emit the extra amino acids found in these columns.

• Insertion states have emission probabilities that are usually the same as the overall proportion of each amino acid in the database.

• Insertion states loop back on themselves, meaning that multiple residues can be inserted here.

• Insertion states are entered from one match state and then exited to the next match state: insertion occurs between adjacent amino acids.

Page 20

Deletion States

• Deletions are places in multiple alignments where most of the sequences have amino acids but a few of them have gaps.

• Deletion states are used to skip over match states.
  – You can skip multiple match states by moving from one deletion state to the next.
  – Deletion states act like affine gap penalties: the transition probability for entering a deletion state from a match state is equivalent to the gap opening penalty, and the transition probability from one deletion state to the next is the gap extension penalty.

• Unlike match states and insertion states, deletion states are silent: they emit nothing.

Page 21

The Full HMM

• A full HMM for making a profile of a multiple alignment.

• Note that it is also necessary to have begin and end states, and these are often a bit more complicated than shown here. They are needed to deal with local alignments as opposed to global ones.

• HMMs also usually have transitions directly from insertion states to deletion states; these are considered very improbable, but their existence helps in model building.

Page 22

Variable start and end points

• We can assume that most proteins will have amino acids before and after the region matched by the model.

• It is also possible to have sequences that only match part of the model.

• We model this by:
  – Allowing any insertion at the beginning or end to occur with no penalty. These are called "free insert states".
  – Allowing transitions from the start to any point in the model, and from any point in the model to the end state.

• This is equivalent to filling out a dynamic programming matrix, where all possible alignments are considered.

Page 23

Determining the Parameters

• Start with a good multiple alignment, which generally involves hand refinement. This is often referred to as "curated".

• Most of the columns will become match states.
  – Pseudocounts are needed to make sure that each match state has a non-zero probability of emitting all 20 amino acids.

• Insertion and deletion states are given some low background transition probabilities to start with.
  – Columns that are mostly gaps are considered insertion states, and gaps in columns with mostly amino acids are modeled with deletion states.

Page 24

Using Profile HMMs

• The query sequence is used as a set of observations that are emitted from the HMM if the proper path is followed.

• Either the Viterbi algorithm (to give the probability for the most likely path) or the forward algorithm (to give the probability of emission by all possible paths) can be used.

• To get log-odds scores, the probability of emitting the sequence from the model is compared to the probability of generating it randomly given the background frequencies of each amino acid.

Page 25

A Profile HMM Example

• We are going to use a DNA example because then we only have 4 emission probabilities to deal with.

• This is a section of a repeated sequence in Bacillus megaterium.

• There are 15 sequences, and the alignment is 16 bases long.

• The first project is to parameterize the model, to estimate transition and emission probabilities.

• Then we will use the model to score various sequences.

GG-GGAAAAACGTATT
TG-GGACAAAAGTATT
TG-GAACAAAAGTATG
TACGGACAAAATTATT
T--GAAGAAAAGTATG
TA-GAACAAAAGTAGG
TG-GAACAAACGCATT
CGGGACAAA-AGTATT
TGGGGTAAA-AGTATT
TGAGACAAA-AGTAGT
TGAGACAAA-AGTATA
TGGGACAAAGAGTATT
TG-AAACAAAGATATT
CG-GAACAAAAGTATT
TA-GGACAAAAGTGTT

Page 26

Setting Up the Model

• The first problem is dealing with gaps: which ones should be inserts and which ones should be deletes?
  – A standard rule of thumb (that we are going to violate): if more than ½ the sequences have the same gap, it's an insert, and if less than ½ have it, it's a delete.

• Nine sequences have a gap in column 3, and one also has a gap in column 2.
  – By the above rule, column 3 should be an insert and column 2 a deletion, but this means that we will have a transition directly from deletion to insertion, and we would like to avoid that if possible.
  – So, let's call columns 2 and 3 deletions.

• Four sequences have a gap in column 10. This could be a deletion, but I want to make it an insertion just so we have one of them.

GG-GGAAAAACGTATT
TG-GGACAAAAGTATT
TG-GAACAAAAGTATG
TACGGACAAAATTATT
T--GAAGAAAAGTATG
TA-GAACAAAAGTAGG
TG-GAACAAACGCATT
CGGGACAAA-AGTATT
TGGGGTAAA-AGTATT
TGAGACAAA-AGTAGT
TGAGACAAA-AGTATA
TGGGACAAAGAGTATT
TG-AAACAAAGATATT
CG-GAACAAAAGTATT
TA-GGACAAAAGTGTT

Page 27

More Set Up

• Considering the gaps in columns 2 and 3 as delete states means that the bases in these columns in other sequences are a match state (i.e. aligned bases).

• However, considering column 10 as an insertion means that the bases in column 10 come from an insert state, not a match state, so the model doesn’t have a match state for column 10.

• The final model thus has 15 match states, with the corresponding insert and delete states.

– Most of the insert and delete states aren’t used in our sequences, so they will have low probabilities. But still, they need to be in the model.

• To simplify things a bit, we are doing this as a global model, so we don't need to worry about free insert states before or after the model, and we aren't going to allow entering or exiting the model anyplace other than the beginning or the end.

column:  1    2      3      4    5    6    7    8    9    10   11    12    13    14    15    16
state:   M1   M2/D1  M3/D2  M4   M5   M6   M7   M8   M9   I9   M10   M11   M12   M13   M14   M15

Page 28

Parameterization

• What parameters do we need?

• Emissions:
  – Each match state needs emission probabilities for all 4 bases.
  – The insert states also need emission probabilities for all 4 bases.
    • These are generally based on the random model: the background frequencies of the bases in the relevant genome or database.

• Transitions:
  – For the specific cases in columns 2 and 3, we need match -> delete (M->D) probabilities and a delete -> delete (D->D) probability.
  – For column 10, we need an M->I probability and an I->I probability (for which we have no data).
  – We also need general M->M, M->D, and M->I probabilities for the other columns.
  – Several other numbers also, but many of them will be covered by the need to have all transition probabilities coming from a given state sum to one.

Page 29

Emission Probabilities

• Background levels (i.e. what the probabilities for each base would be if bases were simply chosen at random)
  – We will use these for emissions from insert states.
  – We could use base frequencies just from this data set: A = 0.44, C = 0.08, G = 0.25, T = 0.23.
    • These numbers are very biased. And, if you use them, you are assuming that whatever sequence you run the model against is a member of this family. That is, you are assuming the thing you are trying to prove. (Bad!)

– So, let us use overall base frequencies. These sequences are all from B. megaterium, which has a GC content of about 38%.

• Thus, G = C = 0.19 and A = T = 0.31.

• Specific emission probabilities for each match state.
  – Count up the number of each base (not including gaps) in each column.
  – We are clearly missing some data: we need pseudocounts to fill in.

Page 30

Emission Pseudocounts

• The simplest way to do pseudocounts is the Laplace method: adding 1 to the numerator and 4 (i.e. the total number of base types) to the denominator:
  – Freq(C in column 1) = (count of C's + 1) / (total number of bases + 4) = (2 + 1) / (15 + 4) = 0.158
  – As compared to the actual frequency = 2/15 = 0.133
  – There are no A's in column 1, so the probability of A from column 1 = 1/19 = 0.052

• A somewhat more sophisticated method is to use overall base frequencies for each base.

  – Freq(C in column 1) = (count of C's + 0.19) / (total number of bases + 1) = 2.19/16 = 0.137
  – Freq(A in column 1) = 0.31/16 = 0.019

• The base frequency method could be altered by multiplying the pseudocounts by some constant, as an estimate of our uncertainty of how likely we are to find a sequence with an A first.

– For example, to be more equivalent to the Laplace method, multiply by 4:

– Freq(C in column 1) = (count of C’s + (4 * 0.19) ) / (total number of bases + 4) = 2.76/19 = 0.145

  – Freq(A in column 1) = (4 * 0.31)/19 = 1.24/19 = 0.065
  – Note how different the probabilities are for A.

• We will just say that how to apply pseudocounts is an area of heuristics and active research.

• We will use the overall base frequency method.
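
A small Python sketch of the two pseudocount recipes above, applied to column 1 of the alignment (15 bases, 2 of them C, none A, as in the text):

    # Column 1 of the alignment: 15 bases, 2 C's, no A's.
    count_C, count_A, total = 2, 0, 15
    bg = {"A": 0.31, "C": 0.19, "G": 0.19, "T": 0.31}  # B. megaterium background

    # Laplace: add 1 per base type to the count, 4 to the total.
    laplace_C = (count_C + 1) / (total + 4)       # 3/19
    laplace_A = (count_A + 1) / (total + 4)       # 1/19

    # Background-frequency pseudocounts: add the background frequency
    # to the count and 1 to the total.
    bgfreq_C = (count_C + bg["C"]) / (total + 1)  # 2.19/16
    bgfreq_A = (count_A + bg["A"]) / (total + 1)  # 0.31/16

    print(laplace_C, laplace_A)   # ~0.158, ~0.053
    print(bgfreq_C, bgfreq_A)     # ~0.137, ~0.019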

Page 31

Transition Frequencies

• Overall M->D. There are 225 total transitions that occur here, and 9 of them are to delete states. So, we calculate the M->D background frequency as 9/225 = 0.040.
  – For D->D, there is 1 case out of 9 delete states where the sequence continues the deletion, so we will use 1/9 = 0.111 as the D->D probability.
  – Then, D->M = 1 – (D->D) = 0.888

• Overall M->I. There are 10 transitions to an insert state (column 10). So, the overall M->I probability is 10/225 = 0.044.
  – There are no cases of I->I in the data. So, we arbitrarily decide to make it the same as D->D (0.111), since we arbitrarily decided which columns to treat as inserts and which to treat as deletes.
  – I->M is thus the same as D->M = 0.888

• This leaves the background M->M transition, which is 1 – (M->I + M->D) = 1 – (0.040 + 0.044) = 0.916.

• We also need a low probability for I->D and D->I, which shouldn't happen, so we set them at 0.0001.

Page 32

Specific Transitions

• We need to deal with the columns where we have inserts and deletes.

• Column 2 has 1 M->D and 14 M->M.
  – We need to add in pseudocounts from the overall data, so: prob(M->D in column 2) = (M->D count + 0.04) / (total transitions in column 2 + 1) = 1.04/16 = 0.065.
  – M->I in column 2 is the background level, 0.044
  – M->M for column 2 is 1 – 0.065 – 0.044 = 0.891

• Column 3 has 8 M->D and 6 M->M (there is also the D->D already covered).
  – Prob(M->D in column 3) = 8.04/15 = 0.536
  – Prob(M->M in column 3) = 1 – 0.536 – 0.044 = 0.420

• Column 10 is an insert with 10 M->I and 5 M->M transitions.
  – Prob(M->I in column 10) = 10.044/16 = 0.628
  – Prob(M->D in column 10) = 0.04 (background)
  – Prob(M->M in column 10) = 1 – 0.628 – 0.04 = 0.332

Page 33

Emission Probability Tables

match A C G T

1 0.028 0.130 0.078 0.764

2 0.229 0.005 0.750 0.015

3 0.349 0.154 0.464 0.033

4 0.090 0.005 0.890 0.014

5 0.653 0.005 0.328 0.014

6 0.653 0.255 0.015 0.077

7 0.403 0.505 0.078 0.014

8 0.965 0.005 0.015 0.014

9 0.965 0.005 0.015 0.014

10 0.778 0.130 0.078 0.014

11 0.090 0.005 0.828 0.077

12 0.028 0.067 0.015 0.889

13 0.903 0.005 0.078 0.014

14 0.028 0.005 0.140 0.827

15 0.090 0.005 0.203 0.702

A C G T

overall 0.31 0.19 0.19 0.31

Page 34

Transitions

Specific

M1->M2 0.891

M1->D 0.065

M2->M3 0.420

M2->D 0.536

M9->M10 0.332

M9->I 0.628

Default

M->M 0.916

M->I 0.044

M->D 0.040

D->M 0.888

D->I 0.0001

D->D 0.111

I->M 0.888

I->I 0.111

I->D 0.0001

Page 35

Scoring a Sequence

• Whew! We have now estimated parameters for all transitions and emissions.

• Scoring a sequence: we are going to use both the Viterbi algorithm and the forward algorithm to determine the most likely path through the model and the overall probability of emitting that sequence.
  – Note that we really should convert everything to logarithms.
  – Also, it is standard practice to express emission probabilities as odds ratios, which means dividing them by the overall base frequencies.
  – We are not going to do either of these things here, in the interest of simplification and clarity.

• Let's just score the first sequence in the list:
  – GG-GGAAAAACGTATT
  – Remove the gap, since a sequence derived from real data is not going to come with a gap (the gap came from a multiple alignment program):
  – GGGGAAAAACGTATT

Page 36

Scoring

• GGGGAAAAACGTATT

• Base 1 is G. To start the global model off, we are going to require that this be a match state.
  – The emission probability for G in M1 is 0.078, so this is the initial overall probability and Viterbi probability.

• Base 2 is also G. There are 3 possibilities for this base: it might be a match state (M2), or it might be the result of an insert state, or it might be the result of entering a delete state (and thus match a later base). We choose the most likely:
  – M1->M2 has a 0.891 probability, and the probability of emitting a G in column 2 is 0.750. So, this probability is 0.891 * 0.750 = 0.668
  – M1->D = 0.065
  – M1->I, then emitting a G from the insert state = 0.044 * 0.19 = 0.008
  – M1->M2 is most likely.

• So, Viterbi probability = previous prob * this prob = 0.078 * 0.668 = 0.052.
• Overall prob = 0.078 * (0.668 + 0.065 + 0.008) = 0.078 * 0.741 = 0.058

Page 37

More Scoring

• Base 3 is also a G.
  – M2->M3 has 0.420 probability and a 0.464 chance of emitting a G. 0.420 * 0.464 = 0.195
  – M2->D has 0.536 probability
  – M2->I, then emitting a G from the insert state = 0.044 * 0.19 = 0.008
  – Choose M2->D. Viterbi = 0.052 * 0.536 = 0.028.
  – Overall = 0.058 * (0.195 + 0.536 + 0.008) = 0.058 * 0.739 = 0.043.

• We are now in a delete state between M2 and M4; we skipped the M3 state. Since delete states are silent, the G in position 3 hasn't been emitted yet.
  – From the delete state we can either move to another delete state (skipping the M4 state in addition to M3) or we can move to M4 and emit the G.
  – D->M4 = 0.888 and M4 emitting a G = 0.890, so prob = 0.888 * 0.890 = 0.790
  – D->D = 0.111
  – Move to M4. Viterbi = 0.028 * 0.790 = 0.022.
  – Overall = 0.043 * (0.790 + 0.111) = 0.043 * 0.901 = 0.039.

• We can now move on to base 4 (another G).
• Our path so far: M1->M2->D->M4. We have emitted the first 3 bases.
• GGGGAAAAACGTATT

Page 38

Still More Scoring

• GGG GAAAA ACGTATT

• The next several bases are easy. Since the probability of moving to a delete or insert state is low, we just have to be sure that the M->M probability times the emission probability stays above 0.044.

• M4->M5 : G prob = 0.916 * 0.328 = 0.300
  – Viterbi prob = 0.022 * 0.300 = 0.0066
  – Overall prob = 0.039 * (0.300 + 0.040 + (0.044 * 0.19)) = 0.039 * 0.3484 = 0.0136

• M5->M6 : A prob = 0.916 * 0.653 = 0.598
  – Viterbi prob = 0.0066 * 0.598 = 0.00395
  – Overall prob = 0.0136 * (0.598 + 0.040 + (0.044 * 0.31)) = 0.0136 * 0.6516 = 0.0089

• M6->M7 : A prob = 0.916 * 0.403 = 0.369
  – Viterbi prob = 0.00395 * 0.369 = 0.00146
  – Overall prob = 0.0089 * (0.369 + 0.040 + (0.044 * 0.31)) = 0.0089 * 0.423 = 0.00376

• M7->M8 : A prob = 0.916 * 0.965 = 0.884
  – Viterbi prob = 0.00146 * 0.884 = 0.00129
  – Overall prob = 0.00376 * (0.884 + 0.040 + (0.044 * 0.31)) = 0.00376 * 0.938 = 0.00353

• M8->M9 : A prob = 0.916 * 0.965 = 0.884
  – Viterbi prob = 0.00129 * 0.884 = 0.00114
  – Overall prob = 0.00353 * (0.884 + 0.040 + (0.044 * 0.31)) = 0.00353 * 0.938 = 0.00331

Page 39

Yet More

• At this point we have emitted positions 1-8, and the most probable path is M1->M2->D->M4->M5->M6->M7->M8->M9

• GGG GAAAA ACGTATT

• Since the transition out of M9 is not the standard one, we need to pause and think it through.
  – M9->M10 = 0.332. Emission prob for A from M10 is 0.778. 0.332 * 0.778 = 0.258
  – M9->I = 0.628. Emission prob for A from an insert state (i.e. the background probability) is 0.31. 0.628 * 0.31 = 0.195.
  – Thus our best choice, the most probable path, is M9->M10. However, looking at the aligned sequences we can see that this is the wrong choice.

• Don't despair: correction occurs in the next step.
  – Viterbi prob = 0.00114 * 0.258 = 0.000294
  – Overall prob = 0.00331 * (0.258 + 0.195 + 0.040) = 0.00331 * 0.493 = 0.00163

GG-GGAAAAACGTATT
TG-GGACAAAAGTATT
TG-GAACAAAAGTATG
TACGGACAAAATTATT
T--GAAGAAAAGTATG
TA-GAACAAAAGTAGG
TG-GAACAAACGCATT
CGGGACAAA-AGTATT
TGGGGTAAA-AGTATT
TGAGACAAA-AGTAGT
TGAGACAAA-AGTATA
TGGGACAAAGAGTATT
TG-AAACAAAGATATT
CG-GAACAAAAGTATT
TA-GGACAAAAGTGTT

Page 40

Yet Still More

• At this point we have emitted positions 1-9, and the most probable path is M1->M2->D->M4->M5->M6->M7->M8->M9->M10

• GGG GAAAAA CGTATT

• At M10, we can:
  – Move to M11 and emit a C. Prob = 0.916 * 0.005 = 0.0046
  – Move to an insert state and emit a C. Prob = 0.044 * 0.19 = 0.0083.
  – Move to a delete state. Prob = 0.04. This would be the best choice, but it leads to a mess: deleting all the remaining match states, then inserting all the remaining bases in the query sequence at the end. It clearly shows the need for dynamic programming.
    • And while we are at it, switching to logarithms at the beginning would greatly ease the calculations.
  – So, to continue our example, we move from M10 to an insert state and emit a C.

• Viterbi prob = 0.000294 * 0.0083 = 2.44 x 10^-6
• Overall prob = 0.00163 * (0.0046 + 0.0083) = 2.10 x 10^-5

Page 41

To the End…

• Our path so far:
  – M1->M2->D->M4->M5->M6->M7->M8->M9->M10->I
  – GGG GAAAAAC GTATT

• From the insert state we can:
  – I->I and emit a G, with probability 0.111 * 0.19 = 0.0211
  – I->M11, with prob 0.888 * 0.828 = 0.735

• Viterbi prob = 2.44 x 10^-6 * 0.735 = 1.79 x 10^-6
• Overall prob = 2.10 x 10^-5 * (0.0211 + 0.735) = 1.58 x 10^-5

• The remaining steps are all match states, so we skip the calculations:
  – Final Viterbi probability = 4.46 x 10^-7
  – Final overall prob = 6.79 x 10^-6

Page 42

Final probability

• We need to know what the probability would be for the random model, with every base inserted according to its overall frequency in the genome.

• GGGGAAAAACGTATT has 6 G/C and 9 A/T, so the random probability is: (0.19)^6 * (0.31)^9 = 1.24 x 10^-9

• We compare to the overall probability of 6.79 x 10^-6 by dividing, giving 5459. This means that this sequence is 5459 times more likely to have been generated by the model than by chance.
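
The same comparison in a few lines of Python (this just reproduces the arithmetic on this slide; the model probability of 6.79 x 10^-6 is taken from the forward calculation above):

    # Null (random) model: every base drawn independently from the
    # background frequencies of the genome.
    bg = {"A": 0.31, "C": 0.19, "G": 0.19, "T": 0.31}
    seq = "GGGGAAAAACGTATT"

    random_prob = 1.0
    for base in seq:
        random_prob *= bg[base]       # 0.19**6 * 0.31**9 ~ 1.24e-9

    model_prob = 6.79e-6              # overall (forward) probability from the previous slides
    print(model_prob / random_prob)   # ~5459: the model fits far better than chance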

Page 43

InterPro

• Maybe the most comprehensive site today (2011). InterProScan covers several different HMM databases, including TIGRFAMs, PFAM, and others.

• http://www.ebi.ac.uk/interpro/
• Sequences are generally classified as members of a superfamily, family, or subfamily.
• Different member databases supply different information.
• It also has a programming (web services) interface, for doing many submissions automatically.

Page 44

Sequences to try

• >BMQ_3096 | QMB1551_chromosome:3062647-3060926|

MVNKILKAPEGPPSDVERIKDESNYLRGTLGETMLDRISSGISEDDNRLMKFHGSYLQDD RDLRNERQKQKLEPAYQFMLRVRTPGGVSTPEQWLVMDDLAQKYGNGTLKLTTRQAFQMH GILKWNMKKTIQEIHASLLDTIAACGDVNRNVMCNPNPYQSEVHAEVFEWSKKLSDYLLP RTRAYHELWLDEEKVISTPEVEEEVEPMYGPLYLPRKFKIGVAVPPSNDIDVYSQDLGFI AILEDEKLVGFNVAIGGGMGMTHGDKATYPQLAKVIGFCRPDQILEVAEKVITIQRDYGN RSVRKNARFKYTVDRLGLETVKEELENRLGWSLDEAKSYHFDHNGDRYGWEKGVKGKWHF TLFVQGGRIADFEDYKLMTGLREIAKVHSGDFRLTANQNLIIANVSTEKKKQISDLIEQY GLTDGKHYSALRRSSLACVSLPTCGLAMAEAERYLPVLLEKIEAIVDENGLRDKEITIRM TGCPNGCARPALGEIAFIGKAPGKYNMYLGAAFDGSRLSKMYRENISEEQILNELRVLLP RYAKEREEGEHFGDYVIRAGVIEAVTDGTNFHA

– http://www.ebi.ac.uk/Tools/services/web_iprscan/toolresult.ebi?tool=iprscan&jobId=iprscan-I20111003-174548-0114-52596225-oy

• >BMQ_0452 | QMB1551_chromosome:415338-416369| MWKKRIWVIASIVIVCLLIPYSLPIVFALLTALALEGIVQKLHQSFKMKRVYAVLMTFIL YVVSLILIGFFIVRTIVTQVVALSKIAPSFVKEIYETAIYPTIIKWKYYSNALPTEVISS IERTLEKTVNSLDSLLKSTVEIIISFAATVPGFLLEFLIYLVALVFISLELPAIKNKIKS FLTEETKYKLQVVITQLIQAGVGFIKAQIILSVMTFVMAYVGLWLLDVPYTALLSLLIVI VDILPILGTGSVLVPWGIFALTQGHDSLAIGLFILFGVITVVRRVVEPKVFSTNMGISPL AALVSLYIGFKLLGFVGLFLGPALVILYDALRKVGIIQINFKL

– http://www.ebi.ac.uk/Tools/services/web_iprscan/toolresult.ebi?tool=iprscan&jobId=iprscan-I20111003-173646-0614-88204841-pg

Page 45

Full Length HMMs

• The J. Craig Venter Institute (formerly known as TIGR) has a large collection of HMMs, called TIGRFAMs.
  – They are classified according to their isology: how specific they are. The least specific are motifs, then domains, then families (super and sub), and the most specific are equivalogs.

– Equivalogs are homologous proteins which all have the same function. They are usually full length proteins, not just domains.

• A good hit on an equivalog is strong evidence that you have determined the protein’s function.

– Lots of useful information about the HMMs also.

– Two cutoff scores: above the trusted cutoff there should be no false positives. Below the noise cutoff the match is of no value.

– http://www.jcvi.org/cgi-bin/tigrfams/index.cgi

• Panther seems to do similar things, with explicit linking to experimental data.

Page 46

Protein Domain HMMs

• Domains are areas of proteins that perform some function important to the protein's overall function.
  – It is not uncommon to have several different domains in the same protein.
  – Across species lines, the domains that perform a specific task can sometimes be found together in a single protein, and other times found in separate proteins.

• PFAM: probably stands for "Protein Families". http://pfam.sanger.ac.uk/
  – PFAM HMMs are generally domains, as opposed to the full-length proteins found in TIGRFAMs.

– Nice sequence logos!

Page 47

Domain HMMs based on 3-D Structure

• Superfamily: HMMs of domains classified according to the SCOP (Structural Classification of Proteins) scheme, which is based on 3-dimensional protein structure and grouped by evolutionary relationship.
  – SCOP: http://scop.mrc-lmb.cam.ac.uk/scop/index.html Look at the statistics and the top of the hierarchy.

• CATH: 3-D structures chopped up into domains, then used to find families of evolutionarily related proteins. Based on the PDB database of structures.

Page 48

HMMs based on motifs and special features

• Prosite and PRINTS: contain short patterns ("motifs") and rules for distinguishing them in different proteins.

• TM-HMM: finds segments that go through membranes.
  – This may be the easiest way to detect an integral membrane protein.
  – A related HMM program scans for N-terminal signal peptides, for secretion as well as membrane insertion.