Simple cluster structure of triplet distributions in genetic texts Andrei Zinovyev

Simple cluster structure of triplet distributions in genetic texts

Andrei Zinovyev

Institut des Hautes Études Scientifiques, Bures-sur-Yvette

Transition probabilities = Frequencies of N-grams

…AGGTCGATC …


Markov chain models
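The identity "transition probabilities = frequencies of N-grams" can be illustrated with a minimal sketch. It assumes a Markov chain of order 2, so that the probability of base k after the pair ij is estimated as the count of triplet ijk divided by the count of all triplets starting with ij. The function name `transition_probabilities` is hypothetical and is not taken from the talk.

```python
from collections import Counter

def transition_probabilities(seq, order=2):
    """Estimate P(next base | previous `order` bases) from (order+1)-gram counts."""
    n = order + 1
    # Count all overlapping N-grams of length order + 1.
    ngrams = Counter(seq[i:i + n] for i in range(len(seq) - n + 1))
    # Count how often each context (the first `order` letters) occurs.
    contexts = Counter()
    for gram, count in ngrams.items():
        contexts[gram[:-1]] += count
    # Conditional probability = N-gram count / context count.
    return {gram: count / contexts[gram[:-1]] for gram, count in ngrams.items()}

probs = transition_probabilities("AGGTCGATCAATCCGTATTGACAATCC")
for gram in sorted(probs):
    print(f"P({gram[-1]} | {gram[:-1]}) = {probs[gram]:.2f}")
```

For every context that occurs in the sequence, the estimated probabilities of the possible next bases sum to 1.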

AGGTCGATCAATCCGTATTGACAATCCAATCCGTATTGACATGACAATCCAACATGACAATC





Sliding window of width W

Triplet frequencies in the window: $f_{AAA}, f_{AAC}, \dots, f_{GGG}$, i.e. $f_{ijk}$ with $i, j, k \in \{A, C, G, T\}$

AGGTCGATGAATCCGTATTGACAAATGAATCCGTAATGACATGACAATCCAACATGACAAT

Protein-coding sequences

bacterial gene

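correct frame

Triplet distributions: $f_{ijk}$ (correct frame), $f^{(1)}_{ijk}$ and $f^{(2)}_{ijk}$ (frame shifted by one and by two positions)

$$f^{(1)}_{ijk} = P^{(1)} f_{ijk} = \sum_{l,m,n} f_{lij} f_{kmn}$$

$$f^{(2)}_{ijk} = P^{(2)} f_{ijk} = \sum_{l,m,n} f_{lmi} f_{jkn}$$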
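The two phase-shift operators can be sketched numerically. The sketch assumes the codon distribution is stored as a 4x4x4 array `f[i, j, k]` that sums to 1, and that neighbouring codons are independent, so the shifted distribution is a product of marginals: the last two letters of one codon times the first letter of the next for $P^{(1)}$, and the last letter of one codon times the first two letters of the next for $P^{(2)}$. The function name `phase_shifts` and the random test distribution are illustrative only.

```python
import numpy as np

def phase_shifts(f):
    """Return (P1 f, P2 f) for a 4x4x4 codon distribution f.

    P1 f[i, j, k] = sum_{l,m,n} f[l, i, j] * f[k, m, n]
    P2 f[i, j, k] = sum_{l,m,n} f[l, m, i] * f[j, k, n]
    """
    last_two = f.sum(axis=0)        # [i, j] = sum_l f[l, i, j]
    first_one = f.sum(axis=(1, 2))  # [k]    = sum_{m,n} f[k, m, n]
    last_one = f.sum(axis=(0, 1))   # [i]    = sum_{l,m} f[l, m, i]
    first_two = f.sum(axis=2)       # [j, k] = sum_n f[j, k, n]
    p1 = last_two[:, :, None] * first_one[None, None, :]
    p2 = last_one[:, None, None] * first_two[None, :, :]
    return p1, p2

# Hypothetical codon usage, drawn at random for illustration.
rng = np.random.default_rng(0)
f = rng.random((4, 4, 4))
f /= f.sum()
p1, p2 = phase_shifts(f)
print(p1.sum(), p2.sum())  # both shifted distributions are normalised
```

Because each shifted distribution is built from marginals of `f`, it is again a probability distribution over the 64 triplets.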

TCCAGCTTA TGAGGCATAACTGTTTACTGAGGCCAT ACT GTACTGTTAGGTTGTACTGTTA

AGGTCGAATACTCCGTATTGACAAATGACTCCGGTATGACATGACAATCCAACATGACAAT

“Shadow” genes

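Shadow gene (a gene on the complementary strand):

$$\hat{f}_{ijk} = C^{R} f_{ijk} = f_{\hat{k}\hat{j}\hat{i}}, \qquad \hat{A} = T, \quad \hat{C} = G$$

$$\hat{f}^{(1)}_{ijk} = P^{(1)} \hat{f}_{ijk}, \qquad \hat{f}^{(2)}_{ijk} = P^{(2)} \hat{f}_{ijk}$$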
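A minimal sketch of the shadow (reverse-complement) transformation of a triplet distribution follows. It assumes the distribution is a 4x4x4 array indexed in the order A, C, G, T, so that the complement of index i is 3 - i; the function name `shadow` is hypothetical.

```python
import numpy as np

BASES = "ACGT"  # with this ordering, the complement of index i is 3 - i

def shadow(f):
    """Distribution seen on the forward strand for genes on the opposite strand.

    shadow(f)[i, j, k] = f[comp(k), comp(j), comp(i)]
    """
    # Reversing every axis complements each letter; the transpose reverses the order.
    return f[::-1, ::-1, ::-1].transpose(2, 1, 0)

# Example: a distribution concentrated on the codon ATG.
f = np.zeros((4, 4, 4))
f[BASES.index("A"), BASES.index("T"), BASES.index("G")] = 1.0
g = shadow(f)
i, j, k = np.argwhere(g == 1.0)[0]
print(BASES[i] + BASES[j] + BASES[k])  # reverse complement of ATG
```

Applying `shadow` twice returns the original distribution, as expected for a reverse-complement operation.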

When can we detect genes (by their content)?

1. When non-coding regions are very different in base composition (e.g., different GC-content)

2. When distances between the phases are large:

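[Figure: the coding distribution $f_{ijk}$, its phase shifts $P^{(1)} f_{ijk}$ and $P^{(2)} f_{ijk}$, and the non-coding distribution]

$$M = \sum_{ijk} f_{ijk} \log_2 \frac{f_{ijk}}{p_i p_j p_k}$$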
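The quantity M, the sum over all triplets of $f_{ijk} \log_2 (f_{ijk} / (p_i p_j p_k))$, can be computed as in the sketch below. The slide does not define $p_i$; the sketch assumes it is the overall frequency of base i among the letters of the counted triplets, which is an assumption made here for illustration. The function name `triplet_information` is hypothetical.

```python
import numpy as np

def triplet_information(f):
    """M = sum_{ijk} f_ijk * log2(f_ijk / (p_i p_j p_k)) for a 4x4x4 array f.

    Assumption: p is the base composition averaged over the three codon positions.
    """
    p = (f.sum(axis=(1, 2)) + f.sum(axis=(0, 2)) + f.sum(axis=(0, 1))) / 3.0
    product = p[:, None, None] * p[None, :, None] * p[None, None, :]
    mask = f > 0  # terms with f_ijk = 0 contribute nothing
    return float(np.sum(f[mask] * np.log2(f[mask] / product[mask])))

uniform = np.full((4, 4, 4), 1.0 / 64.0)
print(triplet_information(uniform))  # no structure beyond base composition
```

Under this assumption M is zero when the triplet distribution is fully explained by base composition and grows as the distribution departs from it.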

Simple experiment

1. Only the forward strands of genomes are used for triplet counting

2. Every p positions in the sequence, a window (x - W/2, x + W/2) of size W, centered at position x, is opened

3. Every window, starting from its first base pair, is divided into W/3 non-overlapping triplets, and the frequencies $f_{ijk}$ of all triplets are calculated

4. The dataset consists of N = [L/p] points, where L is the entire length of the sequence

5. Every data point $X_i = \{x_{is}\}$ corresponds to one window and has 64 coordinates, corresponding to the frequencies of all possible triplets, s = 1,…,64
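The steps above can be sketched as follows. The default values `W=300` and `p=12`, the function name `window_triplet_frequencies` and the handling of sequence ends (windows that would run past either end are skipped, so the number of points is slightly below L/p) are assumptions of this sketch, not details given in the talk.

```python
from itertools import product
import numpy as np

TRIPLETS = ["".join(t) for t in product("ACGT", repeat=3)]
INDEX = {t: s for s, t in enumerate(TRIPLETS)}

def window_triplet_frequencies(seq, W=300, p=12):
    """Return (centers, X): X[i] holds the 64 triplet frequencies of window i."""
    centers, rows = [], []
    for start in range(0, len(seq) - W + 1, p):
        window = seq[start:start + W]
        counts = np.zeros(64)
        # Non-overlapping triplets, counted from the first base of the window.
        for t in range(0, len(window) - 2, 3):
            s = INDEX.get(window[t:t + 3])
            if s is not None:  # skip triplets containing ambiguous symbols
                counts[s] += 1
        total = counts.sum()
        if total > 0:
            centers.append(start + W // 2)
            rows.append(counts / total)
    return np.array(centers), np.array(rows)

# Toy sequence made of a repeated codon; a real genome would be read from a file.
centers, X = window_triplet_frequencies("ATG" * 200, W=30, p=1)
print(X.shape)
```

Shifting the window by one base changes which reading phase the triplets are counted in, which is what separates the phases in the 64-dimensional space.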

Principal Component Analysis

[Figure: data cloud with the 1st principal axis along the direction of maximal dispersion and the 2nd principal axis orthogonal to it]
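A minimal sketch of the projection onto the first principal axes is given below. It uses a plain singular value decomposition in NumPy as a stand-in; it is not the ViDaExpert implementation, and the function name `pca_project` and the random test data are illustrative only.

```python
import numpy as np

def pca_project(X, n_components=2):
    """Project the rows of X onto the directions of maximal dispersion."""
    Xc = X - X.mean(axis=0)  # centre the data cloud
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    axes = Vt[:n_components]               # principal axes (unit vectors)
    explained = (S ** 2) / np.sum(S ** 2)  # fraction of variance per axis
    return Xc @ axes.T, axes, explained[:n_components]

# Hypothetical data: 200 points with 64 coordinates, as for triplet frequencies.
rng = np.random.default_rng(0)
X = rng.random((200, 64))
coords, axes, explained = pca_project(X)
print(coords.shape, explained)
```

Each window then becomes a point in the plane of the first two principal axes, which is the kind of view the following slides show.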

ViDaExpert tool


Caulobacter crescentus (GenBank NC_002696)

[Figure: clusters labelled $f_{ijk}$, $P^{(1)} f_{ijk}$ and $P^{(2)} f_{ijk}$]

“Path” of sliding window


Helicobacter pylori (GenBank NC_000921)


Saccharomyces cerevisiae chromosome IV


Model sequences (random codon usage)

Model sequences (random codon usage, with 50% of frequencies set to 0)

Graph of coding phase


Assessment


Sequence                                               L        W    % of coding bases  Sn1   Sp1   Sn2   Sp2
Helicobacter pylori, complete genome (NC_000921)       1643831  300  90                 0.93  0.97  0.93  0.98
Caulobacter crescentus, complete genome (NC_002696)    4016947  300  91                 0.93  0.97  0.94  0.98
Prototheca wickerhamii mitochondrion (NC_001613)       55328    120  49                 0.82  0.93  0.84  0.95
Saccharomyces cerevisiae chromosome III (NC_001135)    316613   399  69                 0.90  0.88  0.90  0.90
Saccharomyces cerevisiae chromosome IV (NC_001136)     1531929  399  73                 0.89  0.91  0.92  0.92
Model text RANDOM                                      100000   500  49                 0.90  0.61  0.82  0.77
Model text RANDOM_BIAS                                 100000   500  45                 0.99  0.83  0.94  0.90

$$Sn = \frac{TP}{TP + FN}, \qquad Sp = \frac{TP}{TP + FP}$$
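With Sn = TP / (TP + FN) and Sp = TP / (TP + FP), the two scores can be computed from a known annotation and a prediction as in the sketch below. Counting at the level of individual bases is an assumption of this sketch, and the function name `sensitivity_specificity` is hypothetical.

```python
def sensitivity_specificity(true_coding, predicted_coding):
    """Return (Sn, Sp) for two equal-length sequences of 0/1 coding flags."""
    tp = fn = fp = 0
    for truth, pred in zip(true_coding, predicted_coding):
        if truth and pred:
            tp += 1  # coding base predicted as coding
        elif truth and not pred:
            fn += 1  # coding base missed
        elif pred and not truth:
            fp += 1  # non-coding base predicted as coding
    sn = tp / (tp + fn) if tp + fn else float("nan")
    sp = tp / (tp + fp) if tp + fp else float("nan")
    return sn, sp

truth = [1, 1, 1, 1, 0, 0]
pred = [1, 1, 0, 0, 1, 0]
print(sensitivity_specificity(truth, pred))
```

Note that Sp as defined here is the fraction of predicted coding bases that are truly coding, which is the convention of the formulas above.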

Completely blind prediction

Dependence on window size

[Figure: Sn and Sp as functions of window size; horizontal axis: window size, 0 to 1300; vertical axis: 0.75 to 1]

Dependence on window size

[Figure panels: W = 51, W = 252, W = 900, W = 2000]

State of the art: GLIMMER strategy

1. Use a Markov model (MM) of 5th order (hexamers)
2. Use interpolation for transition probabilities
3. Use long ORFs (>500 bp) as the learning dataset

Problems:
1. The number of hexamers to be evaluated is still big
2. Applicable only to collected genomes of good quality (<1 frameshift/1000 bp)

What can we learn from this game?

• Learning can be replaced with self-learning

• Bacterial gene-finders work relatively well when the concentration of coding sequences is high

• Correlations in the order of codons are small

• Codon usage is approximately the same along the genome

• The method presented allows self-learning on pieces of even uncollected DNA (>150 bp)

• The method gives an alternative to the HMM view on the problem of gene recognition

Acknowledgements

Professor Alexander Gorban
Professor Misha Gromov

My coordinates: http://www.ihes.fr/~zinovyev
