Seven clusters and four types of symmetry in
microbial genomes
Andrei Zinovyev
Bioinformatics service
Math@Bio group of M.Gromov
Tatyana Popova
R&D Centre in Biberach, Germany
Alexander Gorban
Centre for Mathematical Modelling
Symbol of GofG’05
Genomic sequence as a text in unknown language
tagggrcgcacgtggtgagctgatgctaggg
frequency dictionaries:
t a g g g r c g c a c g t g g t g a g c t g a t g c t a g g g ($N = 4 = 4^1$)
ta gg gr cg ca cg tg gt ga gc tg at gc ta gg ($N = 16 = 4^2$)
tag ggr cgc acg tgg tga gct gat gct agg ($N = 64 = 4^3$)
tagg grcg cacg tggt gagc tgat gcta gggr ($N = 256 = 4^4$)
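A minimal sketch of how such a frequency dictionary could be computed. The function name `frequency_dictionary` is hypothetical, and the sketch assumes words are read without overlap from the first letter. $N = 4^k$ is the maximum dictionary size for a four-letter alphabet; the sample text also contains the ambiguity code `r`, which is simply counted as it appears.

```python
from collections import Counter

def frequency_dictionary(text, k):
    """Relative frequencies of non-overlapping words of length k, read from position 0."""
    words = [text[i:i + k] for i in range(0, len(text) - k + 1, k)]
    counts = Counter(words)
    total = len(words)
    return {word: count / total for word, count in counts.items()}

text = "tagggrcgcacgtggtgagctgatgctaggg"  # sample text from the slide
for k in (1, 2, 3, 4):
    dictionary = frequency_dictionary(text, k)
    # 4 ** k is the number of possible words; len(dictionary) is how many occur
    print(k, 4 ** k, len(dictionary))
```

For a text this short most of the $4^k$ possible words never occur, which is why the dictionaries are estimated on longer fragments in what follows.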
gggrcgccacgttggtgagctgatgctagggrcgacgtgg
tagggrcgcacgtggtgagctgatgctagggrcgacgtgg
agggrcgcacgtggtgagctgatgctagggrcgacgtggc
..cgtggtgagctgatgctagggrcgcacgtggtgagctgatgctagggrcgacgtggtgagctgatgctagggrcgc…
From text to geometry
cgtggtgagctgatgctagggrcgcacgtggtgagctgatgctagggrcgacgtggtgagctgatgctagggrcgc
107
cgtggtgagctgatgctagggrcgcacggtgagctgatgctagggrcgcacacttgagctgatgctagggrcgcacaattcgtgagctgatgctagggrcgcacggtg……gagctgatgctagggrcgcacaagtga
length~200-400
10000-20000 fragments
$R^N$
Method of visualization: principal components analysis
$R^N \to R^2$
PCA plot
Caulobacter crescentus
singles N=4
doublets N=16
triplets N=64
quadruplets N=256
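The pipeline sketched on these slides (cut the sequence into fragments, turn each fragment into a frequency vector in $R^N$, project to $R^2$ with PCA) might look roughly as follows for triplets. This is an illustration under assumptions, not the authors' code: the function names, the step of 30 letters and the random placeholder sequence are all hypothetical, and NumPy's SVD stands in for whatever PCA implementation was actually used. Only the fragment width (300, inside the 200-400 range given above) comes from the slides.

```python
import itertools
import random

import numpy as np

ALPHABET = "acgt"
TRIPLETS = ["".join(p) for p in itertools.product(ALPHABET, repeat=3)]
INDEX = {t: i for i, t in enumerate(TRIPLETS)}

def triplet_vector(fragment):
    """64-dimensional frequency vector of non-overlapping triplets in a fragment."""
    v = np.zeros(len(TRIPLETS))
    for i in range(0, len(fragment) - 2, 3):
        j = INDEX.get(fragment[i:i + 3])  # triplets with other symbols are skipped
        if j is not None:
            v[j] += 1
    total = v.sum()
    return v / total if total > 0 else v

def fragment_vectors(sequence, width=300, step=30):
    """Slide a window along the sequence; one frequency vector per window."""
    starts = range(0, len(sequence) - width + 1, step)
    return np.array([triplet_vector(sequence[s:s + width]) for s in starts])

def pca_2d(X):
    """Project the rows of X onto the first two principal components."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T

# Placeholder input: a random sequence, NOT genomic data, so no clusters are expected.
random.seed(0)
sequence = "".join(random.choice(ALPHABET) for _ in range(3000))
X = fragment_vectors(sequence)
Y = pca_2d(X)
print(X.shape, Y.shape)
```

Plotting the two columns of `Y` for a real genome is what would produce a PCA plot of the kind shown for Caulobacter crescentus.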
!!!
the information in a genomic sequence is encoded by non-overlapping triplets (Nature, 1961)
First explanation
cgtggtgagctgatgctagggrcgcacgtggtgagctgatgctagggrcgacgtggtgagctgatgctagggrcgc
tga tgc tag ggr cgc acg tgg
ctg atg cta ggg rcg cac gtg
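The two readings above differ only in where the first triplet starts. A small sketch of this, using a hypothetical helper `read_in_frame` and a substring of the sequence shown on the slide:

```python
def read_in_frame(sequence, shift):
    """Split a sequence into non-overlapping triplets, starting at `shift` (0, 1 or 2)."""
    return [sequence[i:i + 3] for i in range(shift, len(sequence) - 2, 3)]

sequence = "ctgatgctagggrcgcacgtgg"  # substring of the slide's example sequence
for shift in range(3):
    print(shift, " ".join(read_in_frame(sequence, shift)))
# shift 0: ctg atg cta ggg rcg cac gtg
# shift 1: tga tgc tag ggr cgc acg tgg
# shift 2: gat gct agg grc gca cgt
```

The same letters give three different triplet dictionaries, so a window that is out of phase with the coding frame lands at a different point of the frequency space.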
Basic 7-cluster structure
gtgagctgatgctagggrcgcacgtggtgagc
gct gat gct agg grc gca cgt
gtgaatcggtgggtgagtgtgctgctatgagc
atc ggt ggg tga gtg tgc tgc
tcg gtg ggt gag tgt gct gct
cgg tgg gtg agt gtg ctg ctg
Non-coding parts
gtgagctgatgctagggr cgcacgaat
Point mutations: insertions, deletions
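One way to read the slides above is that a fragment falls into one of seven situations: three reading-frame phases on the forward strand, three on the complementary strand, or non-coding. The sketch below enumerates the six coding readings of a fragment. The function names are hypothetical, and the complement table is extended with the IUPAC codes `r` (purine) and `y` (pyrimidine) only because `r` appears in the example sequence.

```python
# Complement table for a, c, g, t plus the IUPAC ambiguity codes r and y.
COMPLEMENT = str.maketrans("acgtry", "tgcayr")

def reverse_complement(sequence):
    """Sequence of the complementary strand, read in its own 5'-to-3' direction."""
    return sequence.translate(COMPLEMENT)[::-1]

def six_frame_readings(sequence):
    """Triplet readings for 3 phase shifts on each of the 2 strands."""
    readings = {}
    for strand, text in (("+", sequence), ("-", reverse_complement(sequence))):
        for shift in range(3):
            readings[(strand, shift)] = [
                text[i:i + 3] for i in range(shift, len(text) - 2, 3)
            ]
    return readings

fragment = "gtgagctgatgctagggrcgcacgtggtgagc"  # sequence from the slide
for (strand, shift), triplets in six_frame_readings(fragment).items():
    print(strand, shift, " ".join(triplets))
```

An insertion or deletion of a single letter moves everything downstream from one of these readings to another, which is how point mutations of that kind change the phase.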
The flower-like 7 clusters structure is flat
[Figure: plot with vertical axis from 0 to 0.16 and horizontal axis from 1 to 29]
Seven classes vs Seven clusters
Stanford
TIGR
Georgia Institute of Technology
Computational gene prediction
Accuracy >90%
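Mean-field approximation for triplet frequencies
$F_{IJK} \approx P^1_I P^2_J P^3_K$
$F_{IJK}$: frequency of triplet $IJK$ ($I, J, K \in \{A, C, G, T\}$)
$F_{AAA}, F_{AAT}, F_{AAC}, \ldots, F_{GGC}, F_{GGG}$: 64 numbers
$P^j_i$: position-specific letter frequencies, 12 numbers
triplet frequencies = position-specific letter frequencies + correlations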
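A minimal sketch of this decomposition, assuming codons are read in the correct frame. The function names and the toy coding sequence are hypothetical; the 12 position-specific frequencies are obtained as marginals of the 64 codon frequencies, and the correlations are taken to be whatever the product does not explain.

```python
import itertools
from collections import Counter

LETTERS = "acgt"

def codon_frequencies(coding_sequence):
    """64 codon frequencies F_IJK of a sequence read in frame from position 0."""
    codons = [coding_sequence[i:i + 3] for i in range(0, len(coding_sequence) - 2, 3)]
    codons = [c for c in codons if all(ch in LETTERS for ch in c)]
    counts = Counter(codons)
    total = sum(counts.values())
    all_codons = ("".join(p) for p in itertools.product(LETTERS, repeat=3))
    return {c: counts[c] / total for c in all_codons}

def position_frequencies(F):
    """12 numbers: frequency of each letter at codon positions 1, 2 and 3."""
    P = [{letter: 0.0 for letter in LETTERS} for _ in range(3)]
    for codon, f in F.items():
        for j, letter in enumerate(codon):
            P[j][letter] += f
    return P

def mean_field(P):
    """Product approximation P1_I * P2_J * P3_K for all 64 triplets."""
    return {
        i + j + k: P[0][i] * P[1][j] * P[2][k]
        for i, j, k in itertools.product(LETTERS, repeat=3)
    }

sequence = "atggatgctagggtcgcacgctaa"  # toy example, not a real gene
F = codon_frequencies(sequence)
P = position_frequencies(F)
M = mean_field(P)
correlations = {codon: F[codon] - M[codon] for codon in F}
print(P[2]["g"] + P[2]["c"])  # GC-content at the third codon position
```

Because `M` is built from the marginals of `F`, both sum to one and the correlation terms sum to zero.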
Why hexagonal symmetry?
0-+
-+0
+0-
+-0
-0+
0+-
GC-content = $P_C + P_G$
Genome codon usage and mean-field approximation
ggtgaATG gat gct agg … gtc gca cgc TAAtgagct
…
correct frameshift
64 frequencies $F_{IJK}$
…
ggtgaATG gat gct agg … gtc gca cgc TAAtgagct
12 frequencies $P^1_I$, $P^2_J$, $P^3_K$
$P^j_i$ are linear functions of GC-content
eubacteria
archaea
THE MYSTERY OF TWO STRAIGHT LINES ???
$R^{12}$, $R^{64}$
$F_{IJK} = P^1_I P^2_J P^3_K$ + correlations
Codon usage signature
0-+
19 possible eubacterial signatures
Example: Palindromic signatures
Four symmetry types of the basic 7-cluster structure
eubacteria
flower-like
degenerated
perpendicular triangles
parallel triangles
B. halodurans (GC = 44%)
S. coelicolor (GC = 72%)
F. nucleatum (GC = 27%)
E. coli (GC = 51%)
Web-site
http://www.ihes.fr/~zinovyev/7clusters
cluster structures in genomic sequences
Human genome (chr19)
non-repetitive sequences
repetitive sequences
singles doublets triplets
Letter frequencies (3 dimensions)
[Figure: values for a, c, g, t plotted against components 1, 2, 3; vertical axis from -0.8 to 0.8]
GC-content (50%)
Purine-Pyrimidine (33%)
Amino-Keto (17%)
Non-linear good 2D representation (elastic principal manifolds)
[Figure: map with corners labeled A, T, G, C; scale from 0% to 100%]
Measuring densities
[Figure: two panels, each with corners labeled A, T, G, C]
Contrasting density distribution (two ideas)
• Noise is Gaussian
• Noise is smooth
Contrasted density
[Figure: two panels, each with corners labeled A, T, G, C]
Excluding repeats
[Figure: two panels, each with corners labeled A, T, G, C]
Papers (type Zinovyev in Google)
Gorban A, Zinovyev A. PCA deciphers genome. 2005. Arxiv preprint.
Gorban A, Popova T, Zinovyev A. Codon usage trajectories and 7-cluster structure of 143 complete bacterial genomic sequences. 2005. Physica A 353, 365-387.
Gorban A, Popova T, Zinovyev A. Four basic symmetry types in the universal 7-cluster structure of microbial genomic sequences. 2005. In Silico Biology 5, 0025.
Gorban A, Zinovyev A, Popova T. Seven clusters in genomic triplet distributions. 2003. In Silico Biology 3, 0039.
Zinovyev A, Gorban A, Popova T. Self-Organizing Approach for Automated Gene Identification. 2003. Open Systems and Information Dynamics 10 (4).
People
Dr. Tanya Popova, Institute of Computational Modeling, Russia
Professor Alexander Gorban, University of Leicester, UK