Statistical modeling and classification in
Biological Sequence Space
April 26, 04; 9.520 Gene Yeo
Poggio, Burge @MIT
Framework/Issues
• “Build” models around known biology
  – In the process, extend knowledge about known biology
• “Predict” new examples
• “Validate” predictions by
  – prediction accuracy
  – experimental validation
  – higher-level traits of predictions
  – conservation in other genomes
Biological sequences
• DNA, RNA and proteins: macromolecules built up from smaller units.
• DNA: units are the nucleotide residues A, C, G and T
• RNA: units are the nucleotide residues A, C, G and U
• Proteins: units are the amino acid residues A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W and Y.
• To a considerable extent, the chemical properties of DNA, RNA and protein molecules are encoded in the linear sequence of these basic units: their primary structure.
• Statistical models can be descriptive and/or predictive.
• Given a known biological signal -> describe the signal with statistical modeling & find unknown examples of the same signal
  – Gene-finding (protein-coding genes)
  – Noncoding RNA genes
  – Protein domains
• Warning: although successful, models are not to be taken literally.
• Most important: biological confirmation of predictions is almost always necessary.
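Different models
[Figure: modeling tasks plotted by complexity (vertical axis) against molecule type: DNA, RNA, Protein (horizontal axis)]
• Protein structure (a variety of methods)
• Splice site motif (WMM, MM, SVM, NN)
• Protein gene (HMM, NN)
• RNA gene (Covariation, SCFG, NN, SVM)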
With so many genomes being sequenced, it remains important to be able to identify genes and the signals within and around genes computationally.
A case study in computational biology: modeling signals in genes
What is a (protein-coding) gene?
[Figure: DNA -> (transcription) -> mRNA -> (translation) -> Protein]
DNA:     CCTGAGCCAACTATTGATGAA
mRNA:    CCUGAGCCAACUAUUGAUGAA
Protein: PEPTIDE
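The DNA -> mRNA -> protein example above can be reproduced in a few lines. This is a minimal sketch, assuming the DNA shown is the coding (sense) strand, so transcription amounts to replacing T with U, and assuming translation starts at the first base with the standard genetic code. The function names `transcribe` and `translate` are illustrative, not from the slides.

```python
# Standard genetic code, with codons enumerated in U, C, A, G order
# at each of the three codon positions ("*" marks a stop codon).
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def transcribe(dna):
    """Assumes `dna` is the coding (sense) strand: replace T with U."""
    return dna.upper().replace("T", "U")


def translate(mrna):
    """Translate codon by codon from position 0, stopping at a stop codon."""
    protein = []
    for pos in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[pos:pos + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


mrna = transcribe("CCTGAGCCAACTATTGATGAA")
print(mrna)             # CCUGAGCCAACUAUUGAUGAA
print(translate(mrna))  # PEPTIDE
```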
Some facts about human genes
• Comprise about 3% of the genome
• Average gene length: ~8,000 bp
• Average of 5-6 exons/gene
• Average exon length: ~200 bp
• Average intron length: ~2,000 bp
• ~8% of genes have a single exon
The idea behind a HMM genefinder
• States represent standard gene features: intergenic region, exon, intron, perhaps more (promoter, 5’UTR, 3’UTR, poly-A, ...).
• Observations embody state-dependent statistics, such as base composition, dependence, and signal features.
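To make the state/observation idea concrete, here is a toy Viterbi decoder. It is only a sketch: the three states, the "stay with probability 0.8" transitions, and the GC-rich exon / AT-rich intron emission probabilities are all hypothetical values chosen for illustration, not the parameters of any real gene finder such as GENSCAN.

```python
import math

# Hypothetical toy model: every number below is an illustrative assumption.
STATES = ("intergenic", "exon", "intron")
START = {s: 1.0 / 3.0 for s in STATES}
TRANS = {s: {t: (0.8 if s == t else 0.1) for t in STATES} for s in STATES}
EMIT = {
    "intergenic": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "exon":       {"A": 0.10, "C": 0.40, "G": 0.40, "T": 0.10},  # GC-rich
    "intron":     {"A": 0.40, "C": 0.10, "G": 0.10, "T": 0.40},  # AT-rich
}


def viterbi(seq):
    """Return the most probable state path for a non-empty DNA string."""
    # Work in log space to avoid numerical underflow on long sequences.
    scores = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    backpointers = []
    for base in seq[1:]:
        new_scores, pointers = {}, {}
        for t in STATES:
            best = max(STATES, key=lambda s: scores[s] + math.log(TRANS[s][t]))
            new_scores[t] = (scores[best] + math.log(TRANS[best][t])
                             + math.log(EMIT[t][base]))
            pointers[t] = best
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


print(viterbi("GCGCGCGCGCATATATATAT"))
```

With these made-up parameters the GC-rich half is labeled exon and the AT-rich half intron. A real gene finder adds many more states, length distributions, and signal models on top of the same dynamic-programming idea.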
[Figure: HMM gene-finder state diagram, drawn for both the forward (+) and reverse (-) strands. States include the intergenic region, promoter, 5'UTR, exon states (E0, E1, E2, Es and others), intron states (I0, I1, I2), 3'UTR and poly-A.]
62001 AGGACAGGTA CGGCTGTCAT CACTTAGACC TCACCCTGTG GAGCCACACC
62051 CTAGGGTTGG CCAATCTACT CCCAGGAGCA GGGAGGGCAG GAGCCAGGGC
62101 TGGGCATAAA AGTCAGGGCA GAGCCATCTA TTGCTTACAT TTGCTTCTGA
62151 CACAACTGTG TTCACTAGCA ACCTCAAACA GACACCATGG TGCACCTGAC
62201 TCCTGAGGAG AAGTCTGCCG TTACTGCCCT GTGGGGCAAG GTGAACGTGG
62251 ATGAAGTTGG TGGTGAGGCC CTGGGCAGGT TGGTATCAAG GTTACAAGAC
62301 AGGTTTAAGG AGACCAATAG AAACTGGGCA TGTGGAGACA GAGAAGACTC
62351 TTGGGTTTCT GATAGGCACT GACTCTCTCT GCCTATTGGT CTATTTTCCC
62401 ACCCTTAGGC TGCTGGTGGT CTACCCTTGG ACCCAGAGGT TCTTTGAGTC
62451 CTTTGGGGAT CTGTCCACTC CTGATGCTGT TATGGGCAAC CCTAAGGTGA
62501 AGGCTCATGG CAAGAAAGTG CTCGGTGCCT TTAGTGATGG CCTGGCTCAC
62551 CTGGACAACC TCAAGGGCAC CTTTGCCACA CTGAGTGAGC TGCACTGTGA
62601 CAAGCTGCAC GTGGATCCTG AGAACTTCAG GGTGAGTCTA TGGGACCCTT
62651 GATGTTTTCT TTCCCCTTCT TTTCTATGGT TAAGTTCATG TCATAGGAAG
62701 GGGAGAAGTA ACAGGGTACA GTTTAGAATG GGAAACAGAC GAATGATTGC
62751 ATCAGTGTGG AAGTCTCAGG ATCGTTTTAG TTTCTTTTAT TTGCTGTTCA
62801 TAACAATTGT TTTCTTTTGT TTAATTCTTG CTTTCTTTTT TTTTCTTCTC
62851 CGCAATTTTT ACTATTATAC TTAATGCCTT AACATTGTGT ATAACAAAAG
62901 GAAATATCTC TGAGATACAT TAAGTAACTT AAAAAAAAAC TTTACACAGT
62951 CTGCCTAGTA CATTACTATT TGGAATATAT GTGTGCTTAT TTGCATATTC
63001 ATAATCTCCC TACTTTATTT TCTTTTATTT TTAATTGATA CATAATCATT
63051 ATACATATTT ATGGGTTAAA GTGTAATGTT TTAATATGTG TACACATATT
63101 GACCAAATCA GGGTAATTTT GCATTTGTAA TTTTAAAAAA TGCTTTCTTC
63151 TTTTAATATA CTTTTTTGTT TATCTTATTT CTAATACTTT CCCTAATCTC
63201 TTTCTTTCAG GGCAATAATG ATACAATGTA TCATGCCTCT TTGCACCATT
63251 CTAAAGAATA ACAGTGATAA TTTCTGGGTT AAGGCAATAG CAATATTTCT
63301 GCATATAAAT ATTTCTGCAT ATAAATTGTA ACTGATGTAA GAGGTTTCAT
63351 ATTGCTAATA GCAGCTACAA TCCAGCTACC ATTCTGCTTT TATTTTATGG
63401 TTGGGATAAG GCTGGATTAT TCTGAGTCCA AGCTAGGCCC TTTTGCTAAT
63451 CATGTTCATA CCTCTTATCT TCCTCCCACA GCTCCTGGGC AACGTGCTGG
63501 TCTGTGTGCT GGCCCATCAC TTTGGCAAAG AATTCACCCC ACCAGTGCAG
63551 GCTGCCTATC AGAAAGTGGT GGCTGGTGTG GCTAATGCCC TGGCCCACAA
63601 GTATCACTAA GCTCGCTTTC TTGCTGTCCA ATTTCTATTA AAGGTTCCTT
63651 TGTTCCCTAA GTCCAACTAC TAAACTGGGG GATATTATGA AGGGCCTTGA
63701 GCATCTGGAT TCTGCCTAAT AAAAAACATT TATTTTCATT GCAATGATGT
GENSCAN (Burge & Karlin)
Splice sites can be an important signal
Regular expressions can be limiting
5’ splice junction in eukaryotes: (C/A)AG GT(A/G)AGT
3’ splice junction: (T/C)≥11 N (C/T)AG G
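A short sketch of why a regular expression is limiting: the pattern below encodes the usual 5' splice junction consensus, (C/A)AG followed by GT(A/G)AGT, and a candidate site either matches or does not. The second test site is a made-up example that differs from the consensus at a single position; the regular expression rejects it outright, with no notion of "almost matching".

```python
import re

# 5' splice junction consensus written as a regular expression:
# (C or A) A G | G T (A or G) A G T
DONOR_CONSENSUS = re.compile(r"[CA]AGGT[AG]AGT")


def matches_consensus(site):
    """True only if the 9-mer matches the consensus at every position."""
    return DONOR_CONSENSUS.fullmatch(site) is not None


print(matches_consensus("CAGGTAAGT"))  # True: perfect consensus
print(matches_consensus("CAGGTTAGT"))  # False: one mismatch at position +3
```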
Most protein binding sites are characterized by some degree of sequence specificity, but seeking a consensus sequence is often an inadequate way to recognize sites.
Position-specific distributions came to represent the variability in motif composition.
Position-specific scoring matrix (PSSM)
S = S1 S2 S3 S4 S5 S6 S7 S8 S9
Odds ratio:
$$ R = \frac{P(S \mid +)}{P(S \mid -)} = \frac{P_{-3}(S_1)\,P_{-2}(S_2)\,P_{-1}(S_3) \cdots P_{5}(S_8)\,P_{6}(S_9)}{P_{bg}(S_1)\,P_{bg}(S_2)\,P_{bg}(S_3) \cdots P_{bg}(S_8)\,P_{bg}(S_9)} $$
Score:
$$ s = \log_2 R $$

Pos   -3   -2   -1   +1   +2   +3   +4   +5   +6
A    0.3  0.6  0.1  0.0  0.0  0.4  0.7  0.1  0.1
C    0.4  0.1  0.0  0.0  0.0  0.1  0.1  0.1  0.2
G    0.2  0.2  0.8  1.0  0.0  0.4  0.1  0.8  0.2
T    0.1  0.1  0.1  0.0  1.0  0.1  0.1  0.0  0.5
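The score s = log2 R can be computed directly from the position-specific frequencies. The sketch below uses the frequencies from the table on this slide (positions -3 to +6 of a 5' splice site) and assumes a uniform background of 0.25 per base, which the slide does not specify. The function name `log_odds_score` is illustrative.

```python
import math

POSITIONS = (-3, -2, -1, 1, 2, 3, 4, 5, 6)
# Position-specific frequencies, one tuple per base, ordered as POSITIONS.
PSSM = {
    "A": (0.3, 0.6, 0.1, 0.0, 0.0, 0.4, 0.7, 0.1, 0.1),
    "C": (0.4, 0.1, 0.0, 0.0, 0.0, 0.1, 0.1, 0.1, 0.2),
    "G": (0.2, 0.2, 0.8, 1.0, 0.0, 0.4, 0.1, 0.8, 0.2),
    "T": (0.1, 0.1, 0.1, 0.0, 1.0, 0.1, 0.1, 0.0, 0.5),
}
# Assumption: uniform background model (not given on the slide).
BACKGROUND = {base: 0.25 for base in "ACGT"}


def log_odds_score(site):
    """s = log2( P(S|+) / P(S|-) ) for a 9-base candidate site."""
    if len(site) != len(POSITIONS):
        raise ValueError("site must be exactly 9 bases long")
    score = 0.0
    for i, base in enumerate(site):
        p = PSSM[base][i]
        if p == 0.0:
            # A base never observed at this position rules the site out.
            return float("-inf")
        score += math.log2(p / BACKGROUND[base])
    return score


print(log_odds_score("CAGGTAAGT"))  # strongly positive: close to consensus
print(log_odds_score("CAGGTTAGT"))  # lower, but still scored
print(log_odds_score("CAGATAAGT"))  # -inf: A is never seen at position +1
```

Unlike a regular expression, the matrix grades sites continuously. In practice, zero frequencies are usually replaced by small pseudocounts so that a single unseen base does not force a score of minus infinity.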
Ok, so we got the genes
• Here’s another catch: there isn’t just one version of each gene, but sometimes several.
• Molecular biology (transcription, splicing)
• Signals are modeled as states (HMM) or separately, e.g. as PSSMs
Eg. alternative splicing - CD44
Zhu et al Science… (2003)
Human chromosome 11p…
Alternative splicing
• is a major determinant of protein diversity (Lander 2001, Zavolan 2003)
• 30-50% of human diseases involve alt. splicing
Defining constitutive and alternative exons
Constitutive exon
Skipped exon
3’ alternative exon
5’ alternative exon
Intron retention
Mutually exclusive exons
[Figure: Fragile X Related Gene, FXR1]
Conserved alternative, skipped exon - FXR1
[Figure: Myotonic Dystrophy-containing WD Repeat, DMWD]
Another example of a gene containing a CSE: DMWD
Predicting new alternatively spliced exons
1. The problem is ‘ill-posed’
2. High-dimensional space
3. Avoid overfitting the data
4. Simple feature selection
5. Unbalanced data set sizes
6. Labels are more “flexible”
Eg. of experimentally validated
Biological sequence space: challenges
• Models that “represent” as much of the biology as possible.
• Biologically motivated features are important
• Validating attributes:
  – Conservation of events is key in computational biology
  – Higher-level consistency with known biology
• Experimental validation of predictions is essential
Modeling higher order interactions: Yeast Phe tRNA
[Figure: secondary structure and tertiary structure]
If time permits
The Hammerhead Ribozyme
[Figure: secondary structure and tertiary structure]
One example of how to model and predict RNA 2° structure
Method of Covariation / Compensatory changes
Seq1: A C G A A A G U
Seq2: U A G U A A U A
Seq3: A G G U G A C U
Seq4: C G G C A A U G
Seq5: G U G G G A A C
• Covariation (using comparative genomics)
Mutual information statistic for a pair of columns in a multiple alignment:
$$ M_{ij} = \sum_{x,y} f^{(i,j)}_{x,y} \, \log_2 \frac{f^{(i,j)}_{x,y}}{f^{(i)}_{x}\, f^{(j)}_{y}} $$
• sum over x, y = A, C, G, U
• $f^{(i,j)}_{x,y}$ = fraction of seqs w/ nt. x in col. i, nt. y in col. j
• $f^{(i)}_{x}$ = fraction of seqs w/ nt. x in col. i
• $M_{ij}$ is maximal (2 bits) if x and y individually appear at random (A, C, G, U equally likely), but are perfectly correlated (e.g., always complementary)
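The mutual information statistic is easy to compute on the five-sequence alignment from the covariation slide. This is a minimal sketch; the function name `mutual_information` and the zero-based column indices are choices made here for illustration.

```python
import math
from collections import Counter

# The five aligned sequences from the covariation slide (Seq1 to Seq5).
ALIGNMENT = [
    "ACGAAAGU",
    "UAGUAAUA",
    "AGGUGACU",
    "CGGCAAUG",
    "GUGGGAAC",
]


def mutual_information(alignment, i, j):
    """M_ij in bits for columns i and j (zero-based) of the alignment."""
    n = len(alignment)
    pair_counts = Counter((seq[i], seq[j]) for seq in alignment)
    col_i = Counter(seq[i] for seq in alignment)
    col_j = Counter(seq[j] for seq in alignment)
    total = 0.0
    # Only observed pairs contribute; unobserved pairs have f_xy = 0.
    for (x, y), count in pair_counts.items():
        f_xy = count / n
        f_x = col_i[x] / n
        f_y = col_j[y] / n
        total += f_xy * math.log2(f_xy / (f_x * f_y))
    return total


# First and last columns always pair (A-U, U-A, C-G, G-C): high M_ij.
print(round(mutual_information(ALIGNMENT, 0, 7), 3))
# Third and sixth columns are invariant (always G and always A): M_ij = 0.
print(mutual_information(ALIGNMENT, 2, 5))
```

The first and last columns score about 1.92 bits, just under the 2-bit maximum, because with five sequences the four nucleotides cannot be equally frequent. Invariant columns carry no covariation signal, even if they are in fact base-paired.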
Inferring 2° Structure from Covariation
Stochastic Context-Free Grammars (SCFGs)
• A generalized model which is capable of handling non-local dependencies between words in a language (or bases in an RNA)
Ref:
Durbin et al. “Biological Sequence Analysis” 1998
An SCFG Model of RNA 2° Structure
“Production Rules”:
• P → aWb (“pair”)
• L → aW (“left bulge/loop”)
• R → Wa (“right bulge/loop”)
• B → SS (“bifurcation”)
• S → W (“start”)
• E → ε (“end”)
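To illustrate how such production rules generate base-paired structure, the sketch below samples a sequence and its dot-bracket structure from the grammar. Several things here are assumptions rather than part of the slides: W is taken to stand for any of P, L, R, B or E, the rule probabilities in `W_WEIGHTS` are made up, only Watson-Crick pairs are emitted by the pair rule, and a depth limit forces termination.

```python
import random

BASES = "ACGU"
PAIRS = ("AU", "UA", "GC", "CG")   # assumption: Watson-Crick pairs only

# Hypothetical probabilities for which nonterminal W expands to.
W_CHOICES = ("P", "L", "R", "B", "E")
W_WEIGHTS = (0.50, 0.15, 0.15, 0.05, 0.15)


def expand(state, rng, depth=0, max_depth=12):
    """Return (sequence, dot-bracket structure) derived from `state`."""
    if state == "E" or depth >= max_depth:
        return "", ""                          # E -> epsilon (forced at max depth)
    if state == "B":                           # B -> S S
        left = expand("S", rng, depth + 1, max_depth)
        right = expand("S", rng, depth + 1, max_depth)
        return left[0] + right[0], left[1] + right[1]
    # All remaining rules contain exactly one W on the right-hand side.
    w = rng.choices(W_CHOICES, weights=W_WEIGHTS)[0]
    seq, struct = expand(w, rng, depth + 1, max_depth)
    if state == "S":                           # S -> W
        return seq, struct
    if state == "P":                           # P -> a W b (a pairs with b)
        a, b = rng.choice(PAIRS)
        return a + seq + b, "(" + struct + ")"
    if state == "L":                           # L -> a W
        return rng.choice(BASES) + seq, "." + struct
    if state == "R":                           # R -> W a
        return seq + rng.choice(BASES), struct + "."
    raise ValueError(f"unknown state: {state}")


rng = random.Random(0)
sequence, structure = expand("S", rng)
print(sequence)
print(structure)
```

Because the pair rule emits both partners at once, bases that are far apart in the sequence stay complementary. This is the non-local dependency that an HMM cannot express.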
last page
• Some of the slides were obtained from various places:
  – slides available online (primarily from lectures by Terry Speed)
  – slides from Chris Burge and Dirk Holste