Simple cluster structure of triplet distributions in genetic texts
Andrei Zinovyev
Institut des Hautes Études Scientifiques, Bures-sur-Yvette
Markov chain models
Transition probabilities = Frequencies of N-grams
… AGGTC G ATC …
… A GGTCG A TC …
… AG GTCGA T C …
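The relation between transition probabilities and N-gram frequencies can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the function name `transition_probabilities` and the maximum-likelihood estimate (count of an N-gram divided by the count of its prefix) are choices made here, not something specified on the slide.

```python
from collections import Counter

def transition_probabilities(seq, order):
    """Estimate P(next base | previous `order` bases) from N-gram counts.

    Illustrative maximum-likelihood estimate: the count of each
    (order + 1)-gram divided by the total count of its `order`-base prefix.
    """
    n = order + 1
    # Count every overlapping N-gram, shifting the window by one base.
    ngrams = Counter(seq[i:i + n] for i in range(len(seq) - n + 1))
    # Total count of each context (the N-gram without its last base).
    contexts = Counter()
    for gram, count in ngrams.items():
        contexts[gram[:-1]] += count
    return {gram: count / contexts[gram[:-1]] for gram, count in ngrams.items()}

seq = "AGGTCGATCAATCCGTATTGACAATCCAATCCGTATTGACATGACAATCCAACATGACAATC"
probs = transition_probabilities(seq, order=2)  # keys are triplets such as "AGG"
```

For a 5th-order model the same function would be called with `order=5`, so that the keys become hexamers.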
AGGTCGATCAATCCGTATTGACAATCCAATCCGTATTGACATGACAATCCAACATGACAATC
Sliding window of width W
$f_{AAA}, f_{AAC}, \ldots, f_{GGG}, \ldots = f_{ijk}, \quad i, j, k \in \{A, C, G, T\}$
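Protein-coding sequences
AGGTCGATGAATCCGTATTGACAAATGAATCCGTAATGACATGACAATCCAACATGACAAT
(Figure labels: bacterial gene, correct frame; triplet distributions $f_{ijk}$, $f^{(1)}_{ijk}$, $f^{(2)}_{ijk}$)
$$f^{(1)}_{ijk} \equiv P^{(1)} f_{ijk} = \sum_{l,m,n} f_{lij} f_{kmn}$$
$$f^{(2)}_{ijk} \equiv P^{(2)} f_{ijk} = \sum_{l,m,n} f_{lmi} f_{jkn}$$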
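The two phase-shift operators can be evaluated directly from a 4×4×4 array of codon frequencies. The sketch below assumes the bases are indexed in the order A, C, G, T and that neighbouring codons are independent, which is what the product form of the sums implies; the function name `shifted_distributions` and the random example data are illustrative only.

```python
import numpy as np

def shifted_distributions(f):
    """Return (P1 f, P2 f) for a 4x4x4 array f[i, j, k] of triplet frequencies.

    P1 f[i, j, k] = sum_{l,m,n} f[l, i, j] * f[k, m, n]
    P2 f[i, j, k] = sum_{l,m,n} f[l, m, i] * f[j, k, n]
    """
    f = np.asarray(f, dtype=float)
    last_two = f.sum(axis=0)         # [i, j]: sum over l of f[l, i, j]
    first_one = f.sum(axis=(1, 2))   # [k]:    sum over m, n of f[k, m, n]
    last_one = f.sum(axis=(0, 1))    # [i]:    sum over l, m of f[l, m, i]
    first_two = f.sum(axis=2)        # [j, k]: sum over n of f[j, k, n]
    p1 = last_two[:, :, None] * first_one[None, None, :]
    p2 = last_one[:, None, None] * first_two[None, :, :]
    return p1, p2

# Hypothetical codon usage, drawn at random and normalised to sum to 1.
rng = np.random.default_rng(0)
f = rng.random((4, 4, 4))
f /= f.sum()
p1, p2 = shifted_distributions(f)
```

Because each sum factorises into a product of two marginals, no explicit triple loop over l, m, n is needed.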
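“Shadow” genes
TCCAGCTTA TGAGGCATAACTGTTTACTGAGGCCAT ACT GTACTGTTAGGTTGTACTGTTA
AGGTCGAATACTCCGTATTGACAAATGACTCCGGTATGACATGACAATCCAACATGACAAT
(Figure label: shadow gene)
$$\hat{f}_{ijk} = C^{R} f_{ijk} \equiv f_{\hat{k}\hat{j}\hat{i}}, \qquad \hat{A} = T, \quad \hat{C} = G$$
$$\hat{f}^{(1)}_{ijk} = P^{(1)} \hat{f}_{ijk}, \qquad \hat{f}^{(2)}_{ijk} = P^{(2)} \hat{f}_{ijk}$$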
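As a small illustration of the shadow-gene transformation, the sketch below builds the triplet distribution of a toy sequence and maps it to the distribution seen when the complementary strand is read in its own direction. The helper names and the toy sequence are assumptions made for this example; the sequence length is a multiple of three so that the frames on both strands line up.

```python
from itertools import product

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}
TRIPLETS = ["".join(t) for t in product("ACGT", repeat=3)]

def reverse_complement(s):
    """Complement every base and reverse the order."""
    return "".join(COMPLEMENT[b] for b in reversed(s))

def triplet_frequencies(seq):
    """Frequencies of non-overlapping triplets, starting at the first base."""
    counts = dict.fromkeys(TRIPLETS, 0)
    for i in range(0, len(seq) - 2, 3):
        counts[seq[i:i + 3]] += 1
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

def shadow_distribution(f):
    """Shadow distribution: the value at ijk is f at the reverse complement of ijk."""
    return {t: f[reverse_complement(t)] for t in TRIPLETS}

seq = "ATGGCCAAATAG"                # toy in-frame sequence (hypothetical)
f = triplet_frequencies(seq)
f_hat = shadow_distribution(f)
```

Counting triplets on `reverse_complement(seq)` gives the same distribution as `f_hat`, which is a quick way to check the index order in the formula.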
When can we detect genes (by their content)?
1. When non-coding regions are very different in base composition (e.g., different GC-content)
2. When distances between the phases are large:
(Figure labels: $P^{(1)} f_{ijk}$, $P^{(2)} f_{ijk}$, $f_{ijk}$, non-coding)
$$M = \sum_{i,j,k} f_{ijk} \log_2 \frac{f_{ijk}}{p_i p_j p_k}$$
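The quantity M can be computed numerically as follows. The slide does not define $p_i$, $p_j$, $p_k$; this sketch assumes they are the marginal base frequencies at the first, second and third triplet positions, so that M measures how far $f_{ijk}$ is from a product of independent positions. The function name is illustrative.

```python
import numpy as np

def mutual_information(f):
    """M = sum_ijk f_ijk * log2(f_ijk / (p_i * p_j * p_k)) for a 4x4x4 array.

    Assumption: p_i, p_j, p_k are the position-specific marginals of f.
    """
    f = np.asarray(f, dtype=float)
    p1 = f.sum(axis=(1, 2))   # base frequencies at triplet position 1
    p2 = f.sum(axis=(0, 2))   # base frequencies at triplet position 2
    p3 = f.sum(axis=(0, 1))   # base frequencies at triplet position 3
    independent = p1[:, None, None] * p2[None, :, None] * p3[None, None, :]
    mask = f > 0              # terms with f_ijk = 0 contribute nothing
    return float(np.sum(f[mask] * np.log2(f[mask] / independent[mask])))

# A distribution with fully independent positions gives M close to 0.
p = np.array([0.1, 0.2, 0.3, 0.4])
product_distribution = p[:, None, None] * p[None, :, None] * p[None, None, :]
m_independent = mutual_information(product_distribution)
```

Under this assumption a larger M means a more pronounced codon structure and therefore, by the formulas for the shifted distributions, larger distances between the three phases.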
Simple experiment
1. Only the forward strands of genomes are used for triplet counting
2. At every p-th position x in the sequence, a window (x - W/2, x + W/2) of size W is opened, centered at x
3. Every window, starting from its first base pair, is divided into W/3 non-overlapping triplets, and the frequencies of all triplets $f_{ijk}$ are calculated
4. The dataset consists of N = [L/p] points, where L is the entire length of the sequence
5. Every data point $X_i = \{x_{is}\}$ corresponds to one window and has 64 coordinates, corresponding to the frequencies of all possible triplets, s = 1,…,64
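The five steps above can be sketched as a short function that turns a sequence into an N × 64 data matrix. Several details are assumptions of this sketch rather than part of the slide: only windows lying fully inside the sequence are kept (so slightly fewer than L/p points are produced), triplets containing characters other than A, C, G, T are skipped, and the names `window_triplet_frequencies` and `TRIPLET_INDEX` are illustrative.

```python
from itertools import product
import numpy as np

# Fixed ordering s = 0..63 of the 64 triplets (AAA, AAC, ..., TTT).
TRIPLET_INDEX = {"".join(t): s for s, t in enumerate(product("ACGT", repeat=3))}

def window_triplet_frequencies(seq, W, p):
    """Return an (N, 64) array with one row of triplet frequencies per window."""
    rows = []
    # Open a window every p positions; keep only windows inside the sequence.
    for start in range(0, len(seq) - W + 1, p):
        window = seq[start:start + W]
        counts = np.zeros(64)
        # Non-overlapping triplets, starting from the first base of the window.
        for i in range(0, W - 2, 3):
            s = TRIPLET_INDEX.get(window[i:i + 3])
            if s is not None:
                counts[s] += 1
        total = counts.sum()
        rows.append(counts / total if total > 0 else counts)
    return np.array(rows).reshape(-1, 64)

seq = "ATG" * 40                                   # toy sequence, length 120
X = window_triplet_frequencies(seq, W=30, p=3)     # windows stay in frame
X_shifted = window_triplet_frequencies(seq, W=30, p=1)
```

With `p=3` every window of the toy sequence sees only ATG, while with `p=1` successive windows cycle through the three phases (ATG, TGA, GAT), which is the effect the experiment exploits.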
Principal Component Analysis
(Figure labels: maximal dispersion, 1st principal axis, 2nd principal axis)
ViDaExpert tool
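The slides use the ViDaExpert tool for this step. As a stand-in, a plain NumPy projection onto the first two principal axes might look like the sketch below; it is not ViDaExpert's implementation, and the function name and the random example data are assumptions made for illustration.

```python
import numpy as np

def pca_project(X, n_components=2):
    """Project the rows of X onto the first principal axes using an SVD."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing dispersion.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    axes = vt[:n_components]
    explained = singular_values**2 / np.sum(singular_values**2)
    return centered @ axes.T, axes, explained[:n_components]

# Hypothetical data: 200 windows with 64 triplet frequencies each.
rng = np.random.default_rng(0)
X = rng.random((200, 64))
X /= X.sum(axis=1, keepdims=True)
projected, axes, explained = pca_project(X)
```

Plotting `projected[:, 0]` against `projected[:, 1]` gives a two-dimensional view of the 64-dimensional window data.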
Caulobacter crescentus (GenBank NC_002696)
(Figure labels: $f_{ijk}$, $P^{(1)} f_{ijk}$, $P^{(2)} f_{ijk}$)
“Path” of sliding window
Helicobacter pylori (GenBank NC_000921)
Saccharomyces cerevisiae chromosome IV
Model sequences (random codon usage)
Model sequences (random codon usage + 50% of frequencies are set to 0)
Graph of coding phase
Assessment

| Sequence | L | W | % of coding bases | Sn1 | Sp1 | Sn2 | Sp2 |
|---|---|---|---|---|---|---|---|
| Helicobacter pylori, complete genome (NC_000921) | 1643831 | 300 | 90 | 0.93 | 0.97 | 0.93 | 0.98 |
| Caulobacter crescentus, complete genome (NC_002696) | 4016947 | 300 | 91 | 0.93 | 0.97 | 0.94 | 0.98 |
| Prototheca wickerhamii mitochondrion (NC_001613) | 55328 | 120 | 49 | 0.82 | 0.93 | 0.84 | 0.95 |
| Saccharomyces cerevisiae chromosome III (NC_001135) | 316613 | 399 | 69 | 0.90 | 0.88 | 0.90 | 0.90 |
| Saccharomyces cerevisiae chromosome IV (NC_001136) | 1531929 | 399 | 73 | 0.89 | 0.91 | 0.92 | 0.92 |
| Model text RANDOM | 100000 | 500 | 49 | 0.90 | 0.61 | 0.82 | 0.77 |
| Model text RANDOM_BIAS | 100000 | 500 | 45 | 0.99 | 0.83 | 0.94 | 0.90 |

$$Sn = \frac{TP}{TP + FN}, \qquad Sp = \frac{TP}{TP + FP}$$
Completely blind prediction
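The two measures can be computed per base from an annotation and a prediction. The sketch below assumes both are given as equal-length sequences of 0/1 flags (1 = coding); the function name and the toy data are illustrative. Note that Sp as defined here, TP/(TP + FP), is the quantity often called precision in other fields.

```python
def sensitivity_specificity(actual, predicted):
    """Return (Sn, Sp) with Sn = TP/(TP+FN) and Sp = TP/(TP+FP)."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a and p)        # coding, predicted coding
    fn = sum(1 for a, p in pairs if a and not p)    # coding, missed
    fp = sum(1 for a, p in pairs if p and not a)    # non-coding, predicted coding
    sn = tp / (tp + fn) if tp + fn else float("nan")
    sp = tp / (tp + fp) if tp + fp else float("nan")
    return sn, sp

actual    = [1, 1, 1, 1, 0, 0, 0, 0]   # toy annotation (hypothetical)
predicted = [1, 1, 1, 0, 1, 0, 0, 0]   # toy prediction (hypothetical)
sn, sp = sensitivity_specificity(actual, predicted)   # (0.75, 0.75)
```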
Dependence on window size
(Plot: Sn and Sp, vertical axis from 0.75 to 1, versus window size, horizontal axis from 0 to 1300)
Dependence on window size
(Figure panels: W = 51, W = 252, W = 900, W = 2000)
State of the art: GLIMMER strategy
1. Use a Markov model (MM) of 5th order (hexamers)
2. Use interpolation for transition probabilities
3. Use long ORFs (>500 bp) as the learning dataset
Problems:
1. The number of hexamers to be evaluated is still big
2. Applicable only to assembled genomes of good quality (<1 frameshift/1000 bp)
What can we learn from this game?
• Learning can be replaced with self-learning
• Bacterial gene-finders work relatively well when the concentration of coding sequences is high
• Correlations in the order of codons are small
• Codon usage is approximately the same along the genome
• The method presented allows self-learning on pieces of even unassembled DNA (>150 bp)
• The method gives an alternative to the HMM view of the problem of gene recognition
Acknowledgements
Professor Alexander Gorban
Professor Misha Gromov
My coordinates: http://www.ihes.fr/~zinovyev