Hidden Markov Model
Ed Anderson and Sasha Tkachev
Who Was Markov?
Graduate of Saint Petersburg University (1878), where he became a professor in 1886
Applied the method of continued fractions, pioneered by his teacher Pafnuty Chebyshev, to probability theory
Proved the central limit theorem under fairly general assumptions
Most remembered for his study of Markov chains, sequences of random variables in which the future variable is determined by the present variable but is independent of the way in which the present state arose from its predecessors. This work launched the theory of stochastic processes.
In 1923 Norbert Wiener became the first to treat rigorously a continuous Markov process. The foundation of a general theory was provided during the 1930s by Andrei Kolmogorov.
Excerpted from: http://www-groups.dcs.st-and.ac.uk/~history/Mathematicians/Markov.html
Andrei A. Markov
Born: 14 June 1856 in Ryazan, Russia
Died: 20 July 1922 in Petrograd, Russia
What is the Hidden Markov Model?
Clipped from http://www.nist.gov/dads/HTML/hiddenMarkovModel.html
What Makes HMM Useful?
Efficiency:
  The algorithms are simple enough to run in real time for speech recognition.
  Speed is advantageous when dealing with large biological data sets.
Strong theoretical basis:
  Probability distributions must sum to 1.
  Scores are not influenced by ad hoc criteria.
  Scores may be compared across different experiments of varying size and complexity.
Well suited for analyzing noisy, time-phased, or sequentially connected events.
What are HMM’s Limitations?
Model building is not so easy:
  “Since HMM training algorithms are local optimizers, it pays to build HMMs on pre-aligned data whenever possible… the parameter space may be complex, with many spurious local optima that can trap a training algorithm.” (1)
Distance between related states must be constant:
  A disadvantage when analyzing distant and arbitrarily spaced items:
    Amino acids in folded proteins
    RNA base pairs
(1) Eddy, S.R., Profile hidden Markov models, Bioinformatics Review, Vol. 14, no. 9, 1998, p. 757
A Concrete Example
Example adapted from http://en.wikipedia.org/wiki/Viterbi_algorithm
Can you guess the weather based on a person’s activity? Use the Forward algorithm to calculate the probabilities.
(A) Transition Probabilities: P(weather tomorrow | weather today)

Today    Tomorrow: Rain    Tomorrow: Sun
Rain     0.7               0.3
Sun      0.4               0.6

(Π) Initial State Probabilities (typical weather)

Rain     0.6
Sun      0.4

(B) Emission Probabilities: P(activity | weather)

IF weather    THEN Walk    Shop    Clean
Rain          0.1          0.4     0.5
Sun           0.6          0.3     0.1
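As a rough sketch of the Forward algorithm mentioned above, the Python below sums the probability of the observations Walk, Shop, Clean over every hidden weather sequence. The numbers come from tables (A), (B) and (Π); the names STATES, INITIAL, TRANSITION, EMISSION and forward are illustrative choices, not part of the original slides.

```python
# Weather HMM from tables (A), (B) and (Pi) on the slide.
STATES = ("Rain", "Sun")
INITIAL = {"Rain": 0.6, "Sun": 0.4}
TRANSITION = {  # TRANSITION[today][tomorrow]
    "Rain": {"Rain": 0.7, "Sun": 0.3},
    "Sun": {"Rain": 0.4, "Sun": 0.6},
}
EMISSION = {  # EMISSION[weather][activity]
    "Rain": {"Walk": 0.1, "Shop": 0.4, "Clean": 0.5},
    "Sun": {"Walk": 0.6, "Shop": 0.3, "Clean": 0.1},
}


def forward(observations):
    """Return P(observations), summed over all hidden weather paths."""
    # alpha[s] = P(observations so far, current weather = s)
    alpha = {s: INITIAL[s] * EMISSION[s][observations[0]] for s in STATES}
    for activity in observations[1:]:
        alpha = {
            s: EMISSION[s][activity]
            * sum(alpha[prev] * TRANSITION[prev][s] for prev in STATES)
            for s in STATES
        }
    return sum(alpha.values())


if __name__ == "__main__":
    # Prints 0.033612, the total over all eight possible weather paths.
    print(f"{forward(['Walk', 'Shop', 'Clean']):.6f}")
```

Note that this total answers "how likely are these activities at all?", which is a different question from "which weather sequence is most likely?".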
Observations: 1 = Walk, 2 = Shop, 3 = Clean
Each cell lists p(weather n | weather n-1) followed by P(activity | weather). For observation 1 the first factor is the initial state probability. The Probability column is the product of all six factors.

Hidden States     1: Walk    2: Shop    3: Clean   Probability
Sun-Sun-Sun       0.4 0.6    0.6 0.3    0.6 0.1    0.002592
Sun-Sun-Rain      0.4 0.6    0.6 0.3    0.4 0.5    0.008640   False Maximum
Sun-Rain-Sun      0.4 0.6    0.4 0.4    0.3 0.1    0.001152
Sun-Rain-Rain     0.4 0.6    0.4 0.4    0.7 0.5    0.013440   True Maximum
Rain-Sun-Sun      0.6 0.1    0.3 0.3    0.6 0.1    0.000324
Rain-Sun-Rain     0.6 0.1    0.3 0.3    0.4 0.5    0.001080
Rain-Rain-Sun     0.6 0.1    0.7 0.4    0.3 0.1    0.000504
Rain-Rain-Rain    0.6 0.1    0.7 0.4    0.7 0.5    0.005880

Step products highlighted on the slide:
Observation 1: .24 (Sun), .06 (Rain)
Observation 2: .18 (Sun to Sun), .16 (Sun to Rain)
Observation 3: .20 (Sun to Rain), .35 (Rain to Rain)
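The eight path probabilities can be reproduced with a short brute-force sketch. It also shows one plausible reading of the "False Maximum" label: choosing the locally best step each time (.24, then .18, then .20) ends on Sun-Sun-Rain rather than the true best path. The function names path_probability and greedy_path are illustrative assumptions; the probabilities are those of the weather example.

```python
from itertools import product

STATES = ("Sun", "Rain")
INITIAL = {"Rain": 0.6, "Sun": 0.4}
TRANSITION = {  # TRANSITION[today][tomorrow]
    "Rain": {"Rain": 0.7, "Sun": 0.3},
    "Sun": {"Rain": 0.4, "Sun": 0.6},
}
EMISSION = {  # EMISSION[weather][activity]
    "Rain": {"Walk": 0.1, "Shop": 0.4, "Clean": 0.5},
    "Sun": {"Walk": 0.6, "Shop": 0.3, "Clean": 0.1},
}
OBSERVATIONS = ("Walk", "Shop", "Clean")


def path_probability(path, observations=OBSERVATIONS):
    """Joint probability of one hidden path and the observed activities."""
    prob = INITIAL[path[0]] * EMISSION[path[0]][observations[0]]
    for prev, cur, activity in zip(path, path[1:], observations[1:]):
        prob *= TRANSITION[prev][cur] * EMISSION[cur][activity]
    return prob


def greedy_path(observations=OBSERVATIONS):
    """Take the locally best state at every step; this can miss the optimum."""
    path = [max(STATES, key=lambda s: INITIAL[s] * EMISSION[s][observations[0]])]
    for activity in observations[1:]:
        prev = path[-1]
        path.append(
            max(STATES, key=lambda s: TRANSITION[prev][s] * EMISSION[s][activity])
        )
    return tuple(path)


if __name__ == "__main__":
    table = {p: path_probability(p) for p in product(STATES, repeat=3)}
    for path, prob in sorted(table.items(), key=lambda item: -item[1]):
        print("-".join(path), f"{prob:.6f}")
    print("greedy choice:", "-".join(greedy_path()))
```

Enumerating every path is fine for three observations, but the number of paths doubles with each extra observation in this two-state model.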
How to Avoid False Optima?
Is it necessary to calculate every possible path?
The Viterbi algorithm can help.
Example from http://www.telecom.tuc.gr/~ntsourak/demo_viterbi.htm
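A minimal Viterbi sketch for the weather example, assuming the same transition, emission and initial probabilities. Instead of scoring every path, it keeps only the best path ending in each state at each step. The function name viterbi and the data layout are illustrative, not taken from the linked demo.

```python
STATES = ("Sun", "Rain")
INITIAL = {"Rain": 0.6, "Sun": 0.4}
TRANSITION = {  # TRANSITION[today][tomorrow]
    "Rain": {"Rain": 0.7, "Sun": 0.3},
    "Sun": {"Rain": 0.4, "Sun": 0.6},
}
EMISSION = {  # EMISSION[weather][activity]
    "Rain": {"Walk": 0.1, "Shop": 0.4, "Clean": 0.5},
    "Sun": {"Walk": 0.6, "Shop": 0.3, "Clean": 0.1},
}


def viterbi(observations):
    """Return (best_path, probability) for the observed activities."""
    # best[s] = (probability, path) of the most likely path ending in state s
    best = {s: (INITIAL[s] * EMISSION[s][observations[0]], (s,)) for s in STATES}
    for activity in observations[1:]:
        new_best = {}
        for s in STATES:
            # Best predecessor for state s, before emitting the new activity.
            prob, path = max(
                (best[prev][0] * TRANSITION[prev][s], best[prev][1])
                for prev in STATES
            )
            new_best[s] = (prob * EMISSION[s][activity], path + (s,))
        best = new_best
    prob, path = max(best.values())
    return path, prob


if __name__ == "__main__":
    path, prob = viterbi(["Walk", "Shop", "Clean"])
    # Prints: Sun-Rain-Rain 0.013440
    print("-".join(path), f"{prob:.6f}")
```

The work grows with (number of observations) x (number of states squared), so long observation sequences stay tractable.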
HMM In Speech Recognition
Handling a single word: evaluate each HMM according to the input, using the Viterbi search.
Every senone gets an HMM:
Adapted from Shir, O. M., Speech Recognition Seminar, 10/15/03
Leiden Institute of Advanced Computer Science
[Figure: the words ONE, TWO and THREE broken into the sound units W AH N, T UW and TH R IY, each modeled by a 5-state HMM]
HMM In Speech Recognition
Taken from Shir, O. M., Speech Recognition Seminar, 10/15/03
Leiden Institute of Advanced Computer Science
[Figure: HMM states plotted against time. Legend: state with best path-score; state with path-score < best; state without a valid path-score]
$$P_j(t) = \max_i \left[ P_i(t-1) \, a_{ij} \, b_j(t) \right]$$
where
P_j(t): total path-score ending up at state j at time t
a_{ij}: state transition probability, i to j
b_j(t): score for state j, given the input at time t
HMM in Bioinformatics
Sequence profiling
Gene finding
Protein secondary structure prediction
Radiation hybrid mapping
Genetic linkage mapping
Phylogenetic analysis
HMM in Sequence Profiling
Review – Lecture 7 Highlights
Emission probabilities and transition probabilities
HMM in Sequence Profiling
Log-odds scores are comparable across sequences of different lengths.
Taken from lecture 7 slides, apparently from Krogh, “Computational Methods in Molecular Biology”, pages 45-63, Elsevier, 1998.
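To make the log-odds idea concrete, here is a small sketch under stated assumptions: a hypothetical three-column DNA profile with made-up emission probabilities, no gaps, and a uniform background (null) model. None of these numbers come from the lecture slides.

```python
import math

# Hypothetical 3-column DNA profile: emission probabilities per match column.
PROFILE = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
]
NULL = {base: 0.25 for base in "ACGT"}  # assumed uniform background model


def log_odds_bits(sequence, profile=PROFILE, null=NULL):
    """Sum over columns of log2(P(residue | model) / P(residue | null))."""
    if len(sequence) != len(profile):
        raise ValueError("this ungapped sketch needs one residue per column")
    return sum(
        math.log2(column[residue] / null[residue])
        for column, residue in zip(profile, sequence)
    )


if __name__ == "__main__":
    print(f"AGC: {log_odds_bits('AGC'):.2f} bits")  # matches the profile
    print(f"TTT: {log_odds_bits('TTT'):.2f} bits")  # looks like background
```

Because every residue's model probability is divided by its background probability, the raw probability's tendency to shrink with sequence length largely cancels, which is what makes scores for sequences of different lengths easier to compare.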
Why HMM for Sequence Analysis?
Position-specific scoring methods make intuitive sense.
BLAST and FASTA use pair-wise alignment as opposed to profile scoring.
Profile methods have historically used ad hoc scoring systems.
HMM gap penalties are grounded in probability theory.
HMMs provide a coherent, probabilistic model. (2)
(2) Eddy, Sean R., Profile hidden Markov models, Bioinformatics Review, Vol. 14 no. 9, 1998, pp. 755-763
Profile HMM Software
‘Motif’ models have strings of match states separated by a small number of insert states.
‘Profile’ models have insert and delete states associated with each match state. (3)
[Figure] (4)
(3) Eddy, Sean R., Profile hidden Markov models, Bioinformatics Review, Vol. 14 no. 9, 1998, pp. 755-763
(4) Ibid., Figure 3 on page 758.
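The ‘profile’ layout can be sketched as a list of allowed state transitions. This assumes the classic match/insert/delete topology in which every match column k has its own insert state I<k> and delete state D<k>, with Begin written as M0 and End as M<length+1>. The function name and state labels are illustrative; HMMER's own architecture differs in detail.

```python
def profile_transitions(length):
    """List allowed (from_state, to_state) pairs for a toy profile HMM.

    Assumed layout: match column k has states M<k>, I<k>, D<k>;
    Begin is M0 and End is M<length + 1>.
    """
    edges = []
    for k in range(length + 1):
        # Column 0 (Begin) has no delete state.
        sources = [f"M{k}", f"I{k}"] + ([f"D{k}"] if k > 0 else [])
        for src in sources:
            edges.append((src, f"I{k}"))  # insert residues after column k
            edges.append((src, f"M{k + 1}"))  # move on to the next match column
            if k < length:
                edges.append((src, f"D{k + 1}"))  # skip (delete) the next column
    return edges


if __name__ == "__main__":
    for src, dst in profile_transitions(2):
        print(f"{src} -> {dst}")
```

A ‘motif’ style model would, by contrast, keep only the match-to-match chain plus a few insert states between blocks.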
HMMER Architecture (5)
Both local and global profile alignment.
(5) Eddy, Sean R., HMMER User Guide, Version 2.3.2; Oct 2003. http://hmmer.wustl.edu/.
How Does it Work? (6)
Generative models work by recursive enumeration of possible sequences from a finite set of rules.
The Plan 7 architecture explicitly models the entire target sequence, regardless of how much of that sequence matches the main model.
All alignments to a Plan 7 model are “global” alignments!
(6) Adapted from Eddy, Sean R., HMMER User Guide, Version 2.3.2; Oct 2003. http://hmmer.wustl.edu/.
HMMER Programs (7)
hmmalign - align sequences to an HMM profile
hmmbuild - build a profile HMM from an alignment
hmmcalibrate - calibrate HMM search statistics
hmmconvert - convert between profile HMM file formats
hmmemit - generate sequences from a profile HMM
hmmfetch - retrieve an HMM from an HMM database
hmmindex - create a binary SSI index for an HMM database
hmmpfam - search one or more sequences against an HMM database
hmmsearch - search a sequence database with a profile HMM
HMMER’s native alignment format is called Stockholm format, the format of the Pfam protein database, which allows extensive markup and annotation.
HMMER can read alignments in several common formats, including the output of the CLUSTAL family of programs, Wisconsin/GCG MSF format, the input format for the PHYLIP phylogenetic analysis programs, and “aligned FASTA” format (where the sequences in a FASTA file contain gap symbols, so that they are all the same length).
(7) Excerpted from Eddy, Sean R., HMMER User Guide, Version 2.3.2; Oct 2003. http://hmmer.wustl.edu/.
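The “aligned FASTA” condition (all sequences the same length once gap symbols are counted) is easy to check before handing a file to an alignment-based tool. The sketch below is a simplified parser written for illustration; the function names and the toy sequences are assumptions, and it is not how HMMER itself reads files.

```python
def read_fasta(text):
    """Parse FASTA text into {name: sequence}, joining wrapped lines."""
    records, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            parts = line[1:].split()
            name = parts[0] if parts else f"seq{len(records) + 1}"
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def is_aligned(records):
    """True when every sequence, gap symbols included, has the same length."""
    return len({len(seq) for seq in records.values()}) <= 1


# Toy data for illustration only.
ALIGNED = ">seq1\nMK-LV\n>seq2\nMKALV\n>seq3\nM--LV\n"
UNALIGNED = ">seq1\nMKLV\n>seq2\nMKALV\n"

if __name__ == "__main__":
    print(is_aligned(read_fasta(ALIGNED)))  # True
    print(is_aligned(read_fasta(UNALIGNED)))  # False
```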
Building a profile with hmmbuild (8)
> hmmbuild globin.hmm globins50.msf
hmmbuild - build a hidden Markov model from an alignment
HMMER 2.3 (April 2003)
Copyright (C) 1992-2003 HHMI/Washington University School of Medicine
Freely distributed under the GNU General Public License (GPL)
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Alignment file: globins50.msf
File format: MSF
Search algorithm configuration: Multiple domain (hmmls)
Model construction strategy: MAP (gapmax hint: 0.50)
Null model used: (default)
Prior used: (default)
Sequence weighting method: G/S/C tree weights
New HMM file: globin.hmm
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Alignment: #1
Number of sequences: 50
Number of columns: 308

Constructed a profile HMM (length 143)
Average score: 189.04 bits
Minimum score: -17.62 bits
Maximum score: 234.09 bits
Std. deviation: 53.18 bits
(8) Adapted from Eddy, Sean R., HMMER User Guide, Version 2.3.2; Oct 2003. http://hmmer.wustl.edu/.
Calibrating the profile (9)
> hmmcalibrate globin.hmm
hmmcalibrate -- calibrate HMM search statistics
HMMER 2.3 (April 2003)
Copyright (C) 1992-2003 HHMI/Washington University School of Medicine
Freely distributed under the GNU General Public License (GPL)
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
HMM file: globin.hmm
Length distribution mean: 325
Length distribution s.d.: 200
Number of samples: 5000
random seed: 1051632537
histogram(s) saved to: [not saved]
POSIX threads: 4
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
HMM : globins50
mu : -39.897396
lambda : 0.226086
max : -9.567000
(9) Adapted from Eddy, Sean R., HMMER User Guide, Version 2.3.2; Oct 2003. http://hmmer.wustl.edu/.
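One way to read the calibration output: assuming mu and lambda are the location and scale parameters of an extreme value (Gumbel) distribution fitted to scores of random sequences, a bit score can be turned into a P-value and an E-value as sketched below. The formula and the function names are assumptions added here for illustration, not text from the user guide excerpt.

```python
import math

# Values reported by hmmcalibrate in the output above.
MU = -39.897396
LAMBDA = 0.226086


def evd_pvalue(score, mu=MU, lam=LAMBDA):
    """P(random score >= score) under an assumed Gumbel(mu, lambda) fit."""
    # 1 - exp(-x), written with expm1 to stay accurate when x is tiny.
    return -math.expm1(-math.exp(-lam * (score - mu)))


def evalue(score, n_sequences, mu=MU, lam=LAMBDA):
    """Expected number of random hits at or above score in n_sequences."""
    return n_sequences * evd_pvalue(score, mu, lam)


if __name__ == "__main__":
    for score in (-40.0, 0.0, 50.0):
        print(f"score {score:6.1f} bits  P = {evd_pvalue(score):.3g}")
```

Under this reading, a higher lambda makes P-values fall off faster as the score rises, and mu marks where random scores are typically found.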
Searching the sequence DB (10)
Header Section
> hmmsearch globin.hmm Artemia.fa
hmmsearch - search a sequence database with a profile HMM
HMMER 2.3 (April 2003)
Copyright (C) 1992-2003 HHMI/Washington University School of Medicine
Freely distributed under the GNU General Public License (GPL)
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
HMM file: globin.hmm [globins50]
Sequence database: Artemia.fa
per-sequence score cutoff: [none]
per-domain score cutoff: [none]
per-sequence Eval cutoff: <= 10
per-domain Eval cutoff: [none]
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Query HMM: globins50
Accession: [none]
Description: [none]
[HMM has been calibrated; E-values are empirical estimates]
(10) Adapted from Eddy, Sean R., HMMER User Guide, Version 2.3.2; Oct 2003. http://hmmer.wustl.edu/.
Searching the sequence DB (cont.) (11)
Sequence Top Hits Section
(11) Adapted from Eddy, Sean R., HMMER User Guide, Version 2.3.2; Oct 2003. http://hmmer.wustl.edu/.
Searching the sequence DB (cont.) (12)
Alignment Output Section
(12) Adapted from Eddy, Sean R., HMMER User Guide, Version 2.3.2; Oct 2003. http://hmmer.wustl.edu/.
Searching the sequence DB (cont.) (13)
Score Histogram Section
(13) Adapted from Eddy, Sean R., HMMER User Guide, Version 2.3.2; Oct 2003. http://hmmer.wustl.edu/.
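The build, calibrate and search steps shown on the preceding slides can be chained from a script. This is a hedged sketch: it uses only the three command lines that appear above (HMMER 2.3 program names), the wrapper functions are hypothetical, and it assumes the programs are installed and on PATH.

```python
import shutil
import subprocess


def hmmer2_commands(alignment, hmm_file, sequence_db):
    """Return the three command lines from the slides, in run order."""
    return [
        ["hmmbuild", hmm_file, alignment],
        ["hmmcalibrate", hmm_file],
        ["hmmsearch", hmm_file, sequence_db],
    ]


def run_pipeline(alignment, hmm_file, sequence_db):
    """Run build, calibrate and search; return the hmmsearch text output."""
    output = ""
    for argv in hmmer2_commands(alignment, hmm_file, sequence_db):
        if shutil.which(argv[0]) is None:
            raise FileNotFoundError(f"{argv[0]} not found on PATH")
        result = subprocess.run(argv, check=True, capture_output=True, text=True)
        output = result.stdout  # the last command's output is the search report
    return output


if __name__ == "__main__":
    for argv in hmmer2_commands("globins50.msf", "globin.hmm", "Artemia.fa"):
        print(" ".join(argv))
```

Calling run_pipeline("globins50.msf", "globin.hmm", "Artemia.fa") would repeat the globin example, provided those files exist in the working directory.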
Local versus Global Alignment (14)
HMMER does not do local (Smith/Waterman) and global (Needleman/Wunsch) style alignments in the same way that most computational biology analysis programs do.
To HMMER, whether local or global alignments are allowed is part of the model, rather than being accomplished by running a different algorithm.
You must choose what kind of alignments you want to allow when you build the model.
By default, hmmbuild builds models that allow alignments that are global with respect to the HMM and local with respect to the sequence, and that allow multiple domains to hit per sequence.
(14) Adapted from Eddy, Sean R., HMMER User Guide, Version 2.3.2; Oct 2003. http://hmmer.wustl.edu/.
Experimental Observations
My tests on the clipped SH3 domain sequence in the Krogh paper. (15)
The insert gap penalty was small but significant.
The number of inserts had a linear, negative effect on the score.
Relative to the overall score, the inserts and deletes had a small effect.
(15) Krogh, “Computational Methods in Molecular Biology”, pages 45-63, Elsevier, 1998.
Avg Log Odds by Domain: 8.63, -1.46, 14.77
[Figure: Insert Region Log Odds Correlated to Number of Inserts. x-axis: Total Inserts (0 to 8); y-axis: Log Odds (-1.52 to -1.34)]