Prof. Yechiam Yemini (YY)
Computer Science Department, Columbia University
Chapter 4: Hidden Markov Models
4.4 HMM Applications
Overview
Alignment with HMMs
Gene prediction
Protein structure prediction
Sequence Analysis Using HMM
Step 1: construct an HMM model
Design an HMM generator for the observed sequences
Assign hidden states to underlying sequence regions
Model the questions to be answered in terms of hidden pathways
Step 2: train the HMM
Supervised/unsupervised
Step 3: analyze sequences
Viterbi decoding: compute the most likely hidden pathway
Forward/backward: compute the likelihood of a sequence (& p-value/Z-score)
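As a concrete illustration of step 3, here is a minimal log-space Viterbi decoder in Python; the two-state high-GC/low-GC model and all of its probabilities are invented for illustration:

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely hidden-state path for an observation sequence (log space)."""
    V = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            # Best predecessor state for s at position t
            prev = max(states, key=lambda p: V[t - 1][p] + math.log(trans_p[p][s]))
            V[t][s] = (V[t - 1][prev] + math.log(trans_p[prev][s])
                       + math.log(emit_p[s][obs[t]]))
            back[t][s] = prev
    # Trace back the optimal path
    path = [max(states, key=lambda s: V[-1][s])]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]

# Toy two-state model: H = high-GC region, L = low-GC region
states = ["H", "L"]
start = {"H": 0.5, "L": 0.5}
trans = {"H": {"H": 0.9, "L": 0.1}, "L": {"H": 0.1, "L": 0.9}}
emit = {"H": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
        "L": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}}
print(viterbi("GGCACTGAA", states, start, trans, emit))
```

Decoding labels each position with its most likely hidden region, exactly the "parse" used by the gene-finding models later in this chapter.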
Alignment
Profile HMM
[Figure: profile HMM with start (S) and end (E) states, a chain of match (M) states, and insert (I) and delete (D) states at each position; an example match-state emission distribution is A=.6, C=.2, T=.1, G=.1]
Searching With Profile HMM
Is the target sequence S generated by profile H?
Viterbi: use H to compute the most probable alignment of S
Forward: compute the probability that H generates S
Example: searching for globins
Create a profile HMM from known globin sequences
Use the forward algorithm to search SWISS-PROT for globins
Compare against the null hypothesis (random sequence)
Result: globins are highly distinct
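The forward-algorithm search sketched above can be spelled out in code. The two-state nucleotide model and uniform background below are toy stand-ins for a real globin profile HMM and SWISS-PROT statistics:

```python
import math

def forward_loglik(obs, states, start_p, trans_p, emit_p):
    """log P(obs | HMM) via the forward algorithm, rescaled to avoid underflow."""
    f = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
    loglik = 0.0
    for t in range(1, len(obs)):
        z = sum(f.values())          # rescale, accumulating the log normalizer
        loglik += math.log(z)
        f = {s: emit_p[s][obs[t]] * sum(f[p] / z * trans_p[p][s] for p in states)
             for s in states}
    return loglik + math.log(sum(f.values()))

def null_loglik(obs, bg):
    """Log-likelihood under the null hypothesis: i.i.d. background frequencies."""
    return sum(math.log(bg[c]) for c in obs)

# Toy two-state model standing in for a real profile HMM
states = ["H", "L"]
start = {"H": 0.5, "L": 0.5}
trans = {"H": {"H": 0.9, "L": 0.1}, "L": {"H": 0.1, "L": 0.9}}
emit = {"H": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
        "L": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}}
bg = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

seq = "GGCGCG"
log_odds = forward_loglik(seq, states, start, trans, emit) - null_loglik(seq, bg)
# A positive log-odds score favors the profile over the null model
```

A database search scores every candidate sequence this way and reports those whose log-odds (or derived p/Z-score) clears a threshold.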
Pairwise Alignment With HMM
Key idea: recast DP as Viterbi decoding
“Alignment” = sequence of pairs {<X,Y>, <X,_>, <_,Y>}
Consider an HMM that emits pairs (pair HMM):
[Figure: pair HMM. States: Begin; M, emitting an aligned pair <X,Y> with probability pXY; X, emitting <X,_> with probability qX; Y, emitting <_,Y> with probability qY; End. Transitions: M→X and M→Y with probability δ; X→X and Y→Y with probability ε; any emitting state→End with probability τ; M→M with probability 1−2δ−τ; X→M and Y→M with probability 1−ε−τ]
Viterbi Decoding Pairwise Alignment
[Figure: the pair HMM of the previous slide, with a Viterbi trellis over states {B, M, X, Y, E} and positions 1–6 for X=TGCGCA, Y=TGGCTA; the decoded path spells out an alignment of match (XY) and gap (X−, −Y) columns]
Viterbi decoding: optimum path of pairs
Forward step: select among {XY, −Y, X−}
Comparing With Standard DP
Compute the relationship between scores and probabilities
Consider an HMM that emits pairs (pair HMM)
[Figure: the affine-gap DP recurrence (states M, X, Y; match score s(Xi,Yj), gap-open penalty −d, gap-extend penalty −e) side by side with the pair HMM of the previous slides]
The HMM parameters <δ,ε,τ,P,q> map to the DP parameters <s(x,y),d,e>
e.g., e = −log[ε/(1−τ)]
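This conversion can be computed directly. The gap-extension formula e = −log[ε/(1−τ)] is the slide's; the gap-open expression below follows the standard pair-HMM derivation (as in Durbin et al. 1998), so treat it as a sketch rather than the lecture's own formula:

```python
import math

def gap_penalties(delta, epsilon, tau):
    """Convert pair-HMM transition parameters to affine gap penalties (d, e)."""
    e = -math.log(epsilon / (1 - tau))                        # gap extension
    d = -math.log(delta * (1 - epsilon - tau)
                  / ((1 - tau) * (1 - 2 * delta - tau)))      # gap open (Durbin et al.)
    return d, e

# Arbitrary illustrative parameters
d, e = gap_penalties(delta=0.2, epsilon=0.1, tau=0.1)
```

With scores defined this way, running standard affine-gap DP is equivalent (up to the log transform) to Viterbi decoding of the pair HMM.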
Gene Predictions
Schematics
[Figure: a prokaryotic gene: promoter, start, genes, stop, terminator. A eukaryotic gene: promoter, start, exons interleaved with introns (donor and acceptor splice sites), stop]
Modeling Gene Structure With HMM
Gene regions are modeled as HMM states
Emission probabilities reflect nucleotide statistics
[Figure: splice-site state example]
Simple Gene Models
Consider a region of random length L
Markov model: L is geometrically distributed
P[L=k] = p^k(1−p);  E[L] = p/(1−p)
How do we model an ORF?
Codons can be modeled as higher-order states
[Figure: a single state with self-loop probability p and emission distribution A=.29, C=.31, G=.04, T=.36]
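A short simulation (with an arbitrary self-loop probability p) confirms these geometric length statistics:

```python
import random

# Empirical check that a state with self-loop probability p yields
# P[L=k] = p^k (1-p) and E[L] = p/(1-p); parameters are arbitrary.
def region_length(p, rng):
    """Number of self-loop steps taken before leaving the state."""
    k = 0
    while rng.random() < p:
        k += 1
    return k

rng = random.Random(0)
p = 0.9
lengths = [region_length(p, rng) for _ in range(100_000)]
mean = sum(lengths) / len(lengths)            # should be close to p/(1-p) = 9
frac_zero = lengths.count(0) / len(lengths)   # should be close to 1-p = 0.1
```

This geometric shape is exactly the limitation addressed later by the semi-Markov extension: real exon and intron lengths are not geometric.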
VEIL: Viterbi Exon-Intron Locator
Contains 9 hidden states
Each state is a Markovian model of a region:
exons, introns, intergenic regions, splice sites, etc.
VEIL architecture states: Upstream, Start Codon, Exon, Stop Codon, Downstream, Intron, 3’ Splice Site, 5’ Splice Site, 5’ Poly-A Site
• Enter an exon via a start codon or an intron (3’ splice site)
• Exit via a 5’ splice site or one of three stop codons (taa, tag, tga)
[Figure: VEIL architecture and exon HMM model]
(Henderson, Salzberg, & Fasman 1997)
Genie (Kulp 96)
Uses a generalized HMM (GHMM)
Edges in the model are complete HMMs
States are neural networks for signal finding
• J5’ – 5’ UTR
• EI – Initial Exon
• E – Exon, Internal Exon
• I – Intron
• EF – Final Exon
• ES – Single Exon
• J3’ – 3’ UTR
[Figure: Genie GHMM with signal states Begin Sequence, Start Translation, Donor splice site, Acceptor splice site, Stop Translation, End Sequence]
GeneScan (Burge & Karlin 97)
[Figure: GeneScan state diagram: intergenic region, promoter, 5’ UTR, exon states Einit, E0, E1, E2, Eterm plus a single-exon gene state, intron states I0, I1, I2, 3’ UTR, poly-A signal]
J. Mol. Bio. (1997) 268, 78-94
Base states model the overall parse of the gene:
5’ UTR/3’ UTR; promoter; poly-A…
and the exon-intron-exon structure of the gene
Intron states model 3 scenarios:
I0: the intron falls between two codons
I1: the intron falls right after the first codon base
I2: the intron falls right after the second codon base
Exon states model the respective scenarios
GeneScan Strategy
[Figure: the GeneScan state diagram of the previous slide (forward strand)]
HMM: sequence generator; state=region
Decoding: parse sequence into regions
Viterbi: maximum likelihood decoder via DP
These structures admit generalizations
Extending The HMM Model
Problem: the HMM does not model nature well
Sequence lengths are not geometrically distributed
Some regions have special features (e.g., splice sites use GT…)
Challenge: generalize HMM while retaining algorithms
[Figure: the GeneScan model decoding a raw sequence ….ACAGTTATAGCGTAATTGCGAGCTATCATGAG…… into gene regions]
GeneScan: Extended Sequence Generator Model
Add to state k a length distribution fk(L)
Semi-Markov model
Extend the Viterbi DP recursion to reflect this
Incorporate special region structures
Weight matrix model for splice sites
…. Preserve the recursion structure
[Figure: the GeneScan state diagram (intergenic region, promoter, 5’ UTR, exon and intron states, 3’ UTR, poly-A signal, single-exon gene)]
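The extended recursion can be sketched generically. The following semi-Markov Viterbi is a toy illustration with invented state names, length tables, and segment scores, not GeneScan's actual implementation:

```python
import math

def semi_markov_viterbi(obs, states, start_p, trans_p, length_p, seg_logp):
    """Generalized (semi-Markov) Viterbi: V[i][s] is the best log-probability
    of parsing obs[:i] with a final segment labeled s; segments have explicit
    length distributions length_p[s] = {duration: probability}."""
    n = len(obs)
    V = [{s: -math.inf for s in states} for _ in range(n + 1)]
    back = [{s: None for s in states} for _ in range(n + 1)]
    # Initial segment starting at position 0
    for s in states:
        for d, pd in length_p[s].items():
            if d <= n:
                V[d][s] = math.log(start_p[s]) + math.log(pd) + seg_logp(s, obs[:d])
                back[d][s] = (0, None)
    # Extend parses one whole segment at a time
    for i in range(1, n + 1):
        for s in states:
            for d, pd in length_p[s].items():
                j = i - d
                if j < 1:
                    continue
                emit = math.log(pd) + seg_logp(s, obs[j:i])
                for p in states:
                    if p == s:
                        continue
                    cand = V[j][p] + math.log(trans_p[p][s]) + emit
                    if cand > V[i][s]:
                        V[i][s] = cand
                        back[i][s] = (j, p)
    # Trace back the segmentation as (state, start, end) triples
    i = n
    s = max(states, key=lambda t: V[n][t])
    segments = []
    while i > 0:
        j, p = back[i][s]
        segments.append((s, j, i))
        i, s = j, p
    return segments[::-1]

# Toy model: state "A" explains runs of 'a', state "B" runs of 'b';
# both emit fixed-length-2 segments and alternate A<->B.
states = ["A", "B"]
start = {"A": 0.5, "B": 0.5}
trans = {"A": {"B": 1.0}, "B": {"A": 1.0}}
length = {"A": {2: 1.0}, "B": {2: 1.0}}
def seg_logp(s, seg):
    return 0.0 if all(c == s.lower() for c in seg) else -100.0

print(semi_markov_viterbi("aabb", states, start, trans, length, seg_logp))
# → [('A', 0, 2), ('B', 2, 4)]
```

Because the recursion maximizes over segment durations d as well as predecessor states, arbitrary length distributions fk(L) are supported at the cost of an extra factor of the maximum duration in running time.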
Measuring Classifier Performance
The challenge: how do we evaluate the performance of a classifier?
E.g., consider a test to decide whether a person is sick (p) or healthy (n)
TP=True Positive; TN=True Negative; FP=False Positive; FN=False Negative
Sensitivity: SN=TP/(TP+FN), the percentage of correct predictions when the actual state is positive
Specificity: SP=TN/(TN+FP), the percentage of correct predictions when the actual state is negative
Receiver Operating Characteristic (ROC) curves represent the tradeoff between sensitivity and specificity
[Figure: a 2×2 confusion matrix (predicted P’/N’ vs. actual P/N, with cells TP, FN, FP, TN) and an ROC curve plotting sensitivity against 1−specificity]
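The definitions above translate directly into code; the confusion-matrix counts here are hypothetical:

```python
def sn_sp(tp, tn, fp, fn):
    """Sensitivity SN = TP/(TP+FN); specificity SP = TN/(TN+FP)."""
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical gene-prediction counts, for illustration only
sn, sp = sn_sp(tp=80, tn=90, fp=10, fn=20)

# Each decision threshold of a classifier yields one ROC point (1-SP, SN);
# sweeping the threshold traces the full ROC curve.
roc_point = (1 - sp, sn)
```

A predictor dominating another has its ROC curve everywhere above and to the left; the diagonal corresponds to random guessing.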
Performance Comparisons
Identifying Transmembrane Proteins
Transmembrane Proteins
Transmembrane proteins are the key tools for cell interactions
Signaling, transport, adhesion…
A common structural motif: core helices
Example: receptors
Function: receive signals, activate pathways
Operations:
1. Signal recognition
2. Signal transduction
3. Pathway activation
4. Down-regulation
www.pdb.org
Example: The Insulin Pathway
From Wikipedia
[Figure: insulin binds to the InsR receptor; the phosphorylated InsR complex activates the pathway; after metabolic network processing, Glut-4 is transported to the membrane to generate glucose influx]
TMHMM [Sonnhammer et al 98]
A Hidden Markov Model for Predicting Transmembrane Helices in Protein Sequences. In J. Glasgow et al., eds., Proc. Sixth Int. Conf. on Intelligent Systems for Molecular Biology, 175-182. AAAI Press, 1998.
An HMM to identify transmembrane proteins
Architecture: the helix core is key
Globular and loop domains
Modeling challenge: variable length
Training: the maximization step of Baum-Welch uses simulated annealing
3-stage training
TMHMM Performance
Protein Structure Prediction
Proteins Are Formed From Peptide Bonds
Ramachandran Regions
Levels of Protein Structure
Structure Formation
Helix is Formed Through H Bonds
Tertiary Structures: β-Barrel Example
Why Is Structure Important?
Protein functions are handled by structural elements
HIV protease drug binding site
Enzyme active site
Antibody-protein interfaces
Structure Determination
Map: sequence → structure
Structure determination is handled through crystallography
X-ray; NMR
PDB: a database of structures
The Problem Of Structure Prediction
Derive conformation geometry from sequence
Ab-initio techniques
Homology techniques
Secondary structure is organized in folds: α-helices, β-sheets…
The Structure Prediction Problem
Given: an amino acid sequence
Output: structure annotation {H,B…}
HMM Model
Hidden states: folds
Observable: AA sequence
Viterbi decoding: find the most likely annotation
Training: supervised learning using known structures
Performance: HMM works well for limited protein types
E.g., TMHMM…
For general structure predictions neural nets are superior
(e.g., PHD, B. Rost 93)
Why?
o [HMM works as long as the underlying conformation is consistent with the Markovian model. Protein conformations may depend on complex long-range non-Markovian interactions. E.g., consider the impact of a single AA change on hemoglobin conformation in sickle-cell anemia.]
Example (Chu et al, 2004)
Step 1: align sequences to partition into regions
Step 2: build an extended HMM (semi-Markov lengths)
www.gatsby.ucl.ac.uk/~chuwei/paper/icml04_talk.pdf
Performance
Challenge: long-range interactions (e.g., β-sheets)
Conclusions
HMMs are very useful when a Markovian model is appropriate
They solve three central problems:
Decoding sequences to classify their components
Computing the likelihood of the classification
Training
They may be extended to handle non-Markovian elements
But they can lose their predictive power when Markovian assumptions diverge from nature