HMM Sampling and Applications to Gene Finding and Alignment
European Conference on Computational Biology 2003
Simon Cawley* and Lior Pachter+
and thanks to Eli Rusman
* Affymetrix
+ UC Berkeley Mathematics Dept
Conservation of alternative splicing between human and mouse
• Modrek and Lee: 40-60% of human genes have alternative splice forms. Nature Genetics 2002.
• Nurtdinov et al.: 75% of human alternative splice forms are conserved in mouse. Human Molecular Genetics 2003.
Can we develop ab-initio methods for detecting conserved alternative splice sites?
Sequence Alignment
[Figure: alignment grid for the sequences AACATTAG and AAGATTACCACA]
Finding the optimal alignment
[Figure: alignment grid for the sequences AACATTAG and AAGATTACCACA, labeled "max"]
$a_{i,j} = w\,a_{i-1,j} + w\,a_{i,j-1} + s_{i,j}\,a_{i-1,j-1}$
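The recursion can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the boundary condition a[0][0] = 1, the single match/mismatch pair standing in for s_{i,j}, and the `combine` parameter are all assumptions made for the example. With `combine=sum` the table holds forward values; with `combine=max` the same recursion scores the single best alignment, which is one plausible reading of the "max" label.

```python
def alignment_table(x, y, w, match, mismatch, combine=sum):
    """Fill a[i][j] for the prefixes x[:i] and y[:j].

    w        -- gap probability (assumed constant)
    match    -- s_{i,j} when x[i-1] == y[j-1] (assumed value)
    mismatch -- s_{i,j} otherwise (assumed value)
    combine  -- sum for forward values, max for the best single alignment
    """
    T, U = len(x), len(y)
    a = [[0.0] * (U + 1) for _ in range(T + 1)]
    a[0][0] = 1.0  # assumed boundary: the empty alignment has weight 1
    for i in range(T + 1):
        for j in range(U + 1):
            if i == 0 and j == 0:
                continue
            terms = []
            if i > 0:
                terms.append(w * a[i - 1][j])      # gap in y
            if j > 0:
                terms.append(w * a[i][j - 1])      # gap in x
            if i > 0 and j > 0:
                s = match if x[i - 1] == y[j - 1] else mismatch
                terms.append(s * a[i - 1][j - 1])  # aligned pair
            a[i][j] = combine(terms)
    return a


forward = alignment_table("AACATTAG", "AAGATTACCACA", w=0.1, match=0.6, mismatch=0.1)
best = alignment_table("AACATTAG", "AAGATTACCACA", w=0.1, match=0.6, mismatch=0.1, combine=max)
print(forward[-1][-1], best[-1][-1])
```

The table takes O(TU) time and O(TU) memory to fill.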
Sampling to find alternative alignments
[Figure: alignment grid for the sequences AACATTAG and AAGATTACCACA]
• $a_{i,j}$: alignment forward variables for positions [1,i] and [1,j] in each sequence
• $s_{i,j}$: match/mismatch probabilities for positions i,j in each sequence
• $w$: gap probabilities
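Sampling an alignment amounts to a stochastic traceback through the forward table: at each cell, one of the three predecessor moves is chosen with probability proportional to its term in the recursion. The sketch below stores the full table, so it is not the linear-space variant; the parameter values and function names are assumptions for illustration only.

```python
import random


def forward_table(x, y, w, match, mismatch):
    """Forward values a[i][j] with the assumed boundary a[0][0] = 1."""
    a = [[0.0] * (len(y) + 1) for _ in range(len(x) + 1)]
    a[0][0] = 1.0
    for i in range(len(x) + 1):
        for j in range(len(y) + 1):
            if i > 0:
                a[i][j] += w * a[i - 1][j]
            if j > 0:
                a[i][j] += w * a[i][j - 1]
            if i > 0 and j > 0:
                s = match if x[i - 1] == y[j - 1] else mismatch
                a[i][j] += s * a[i - 1][j - 1]
    return a


def sample_alignment(x, y, w, match, mismatch, rng):
    """Draw one alignment with probability proportional to its weight."""
    a = forward_table(x, y, w, match, mismatch)
    i, j = len(x), len(y)
    columns = []
    while i > 0 or j > 0:
        moves, weights = [], []
        if i > 0 and j > 0:
            s = match if x[i - 1] == y[j - 1] else mismatch
            moves.append((1, 1))
            weights.append(s * a[i - 1][j - 1])
        if i > 0:
            moves.append((1, 0))
            weights.append(w * a[i - 1][j])
        if j > 0:
            moves.append((0, 1))
            weights.append(w * a[i][j - 1])
        di, dj = rng.choices(moves, weights=weights)[0]
        columns.append((x[i - 1] if di else "-", y[j - 1] if dj else "-"))
        i, j = i - di, j - dj
    columns.reverse()
    return "".join(c[0] for c in columns), "".join(c[1] for c in columns)


rng = random.Random(0)
for _ in range(3):
    top, bottom = sample_alignment("AACATTAG", "AAGATTACCACA", 0.1, 0.6, 0.1, rng)
    print(top)
    print(bottom)
    print()
```

Each call to `sample_alignment` refills the table for simplicity; to draw k samples, the table would be filled once and only the traceback repeated, so each extra sample costs O(T+U).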
Linear Space Sampling
Sequences of length T and U; to obtain k samples:
• Time complexity: O(TU+k(T+U))
• Memory requirements: O(T+U)
Hirschberg's divide and conquer algorithm:
• Time complexity: O(TU)
• Memory requirements: O(T+U)
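The slides do not spell out the linear-space sampling algorithm itself, so the sketch below illustrates only the simpler ingredient behind the memory bound: the forward recursion reads just the previous row, so the total forward value can be computed while keeping two rows. The boundary values and the match/mismatch parameters are assumptions for the example.

```python
def forward_total(x, y, w, match, mismatch):
    """Return the final forward value a[T][U] using O(U) memory."""
    # Row i = 0: only gaps in x are possible, so a[0][j] = w**j (assumed boundary).
    prev = [w ** j for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        cur = [w * prev[0]] + [0.0] * len(y)
        for j in range(1, len(y) + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            cur[j] = w * prev[j] + w * cur[j - 1] + s * prev[j - 1]
        prev = cur  # the older row is no longer needed
    return prev[-1]


print(forward_total("AACATTAG", "AAGATTACCACA", w=0.1, match=0.6, mismatch=0.1))
```

Discarding rows loses the information a traceback needs, which is why sampling (like Hirschberg's algorithm for the optimal alignment) requires additional work to stay within O(T+U) memory.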
Alternative Splicing in Mammalian Genomes
[Figure: a pre-mRNA is spliced and translated into Protein I; alternative splicing followed by translation yields Protein II]
Cross-species simultaneous gene finding and alignment
M. Alexandersson, S. Cawley, L. Pachter, SLAM: Cross-species gene finding and alignment with a generalized pair hidden Markov model, Genome Research 13 (2003), pp. 496-502.
Modeling gene features
[Figure: gene structure drawn 5’ to 3’ for human and mouse: Exon 1, Intron 1, Exon 2, Intron 2, Exon 3, and three regions labeled CNS]
The SLAM hidden Markov model
SLAM components
• Splice site detector
  – VLMM
• Intron and intergenic regions
  – 2nd order Markov chain
  – independent geometric lengths
• Coding sequence
  – PHMM on protein level
  – generalized length distribution
• Conserved non-coding sequence
  – PHMM on DNA level
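To make the "2nd order Markov chain" component concrete: each base is scored conditional on the two bases before it. The sketch below is a generic illustration, not SLAM's code; the training sequence, the pseudocount, and the function names are hypothetical.

```python
import math
from collections import defaultdict


def train_second_order(seqs, alphabet="ACGT", pseudocount=1.0):
    """Estimate P(base | previous two bases) from training sequences."""
    counts = defaultdict(lambda: defaultdict(float))
    for s in seqs:
        for k in range(2, len(s)):
            counts[s[k - 2:k]][s[k]] += 1.0

    def prob(context, base):
        row = counts[context]
        total = sum(row.values()) + pseudocount * len(alphabet)
        return (row[base] + pseudocount) / total

    return prob


def log_likelihood(seq, prob):
    """Log-probability of seq given its first two bases."""
    return sum(math.log(prob(seq[k - 2:k], seq[k])) for k in range(2, len(seq)))


prob = train_second_order(["ACGACGACG"])  # toy training data, for illustration only
print(log_likelihood("ACGACG", prob))
```

A state that emits from such a chain and exits with a fixed probability at each step has a geometric length distribution, which corresponds to the "independent geometric lengths" item.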
SLAM input and output
• Input:
  – Pair of homologous sequences.
• Output:
  – CDS and CNS predictions in both sequences.
  – Protein predictions.
  – Protein and CNS alignment.
http://bio.math.berkeley.edu/slam/
Methodology for identifying alternative splice sites
• Compiled SLAM gene predictions for the human, mouse and rat genomes.
• Identified a set of 3400 human/mouse/rat gene triples with consistent predictions from hs/mm and hs/rn analyses.
• For each triple, sampled sub-optimal parses from the hs/mm and hs/rn runs.
• Collected alternative exons (non-Viterbi exons) that appeared in both the hs/mm and hs/rn runs.
• Examined overlap with RefSeq genes, mRNAs and ESTs
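The collection step above can be sketched as set operations. This assumes, purely for illustration, that each parse is reduced to a list of human exon coordinates (start, end) and that two exons count as the same only when both coordinates agree; the slides do not state the matching criterion, and the function name and data are hypothetical.

```python
def conserved_alternative_exons(viterbi_mm, samples_mm, viterbi_rn, samples_rn):
    """Exons seen in sampled parses of both runs but absent from the Viterbi parses.

    Each parse is a list of (start, end) human exon coordinates (assumed format).
    """
    alt_mm = set().union(*samples_mm) - set(viterbi_mm)  # non-Viterbi exons, hs/mm run
    alt_rn = set().union(*samples_rn) - set(viterbi_rn)  # non-Viterbi exons, hs/rn run
    return alt_mm & alt_rn


# Toy coordinates, not real predictions.
viterbi = [(100, 200), (300, 400)]
samples_hs_mm = [[(100, 200), (300, 400)], [(100, 200), (310, 400)], [(100, 200), (500, 560)]]
samples_hs_rn = [[(100, 200), (310, 400)], [(100, 220), (300, 400)]]
print(conserved_alternative_exons(viterbi, samples_hs_mm, viterbi, samples_hs_rn))
```

Requiring an exon to appear in both runs filters out alternatives supported by only one species pair.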
SLAM whole genome predictions
• Built a whole genome homology map (Colin Dewey)
  http://baboon.math.berkeley.edu/~cdewey/homologyMaps/
• Pre-aligned the homologous blocks to reduce the SLAM search space (Nicolas Bray using AVID)
  http://baboon.math.berkeley.edu/mavid/
  http://hanuman.math.berkeley.edu/kbrowser/
• Ran SLAM on the resulting blocks
  http://bio.math.berkeley.edu/slam/mouse/
  http://bio.math.berkeley.edu/slam/rat/
[Figure labels: human, mouse, rat]
Comparing predicted alternative exons to ESTs and mRNAs

                   human/mouse/rat alternative exons   human/mouse alternative exons
                   EST/mRNA     No EST/mRNA            EST/mRNA     No EST/mRNA
Gene count         29           344                    461          3296
Alt. Exon count    29           441                    557          7240
Shifties           28           209                    262          2227
Newbies            1            232                    295          5013
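A quick arithmetic check on the table: in every column the Shifties and Newbies counts add up to the alternative exon count, and the share of predicted alternative exons with EST/mRNA support can be read off directly. The dictionary layout and the helper name below are illustrative only; the numbers are copied from the table.

```python
# Counts copied from the table: (EST/mRNA support, no EST/mRNA support).
table = {
    "human/mouse/rat": {"alt_exons": (29, 441), "shifties": (28, 209), "newbies": (1, 232)},
    "human/mouse": {"alt_exons": (557, 7240), "shifties": (262, 2227), "newbies": (295, 5013)},
}


def supported_fraction(counts):
    """Fraction of exons that overlap an EST or mRNA."""
    with_support, without_support = counts
    return with_support / (with_support + without_support)


for name, rows in table.items():
    # Shifties and Newbies partition the alternative exons in every column.
    assert all(s + n == a for s, n, a in zip(rows["shifties"], rows["newbies"], rows["alt_exons"]))
    share = supported_fraction(rows["alt_exons"])
    print(f"{name}: {share:.1%} of alternative exons have EST/mRNA support")
```

In both comparisons, roughly 6 to 7 percent of the predicted alternative exons overlap an EST or mRNA.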
Conclusions
• Sampling is memory efficient, fast, and should be used routinely for alignment applications.
• Conserved alternative splice forms can be detected ab-initio.
• The extent of alternative splicing conservation is currently unclear. Sampling provides an alternative approach for investigating this problem, one that is not sensitive to biases in EST data.
• Problem: design effective and scalable validation strategies for alternative splice sites.