Codons, Genes and Networks
Bioinformatics service
Math@Bio group of M.Gromov
Andrei Zinovyev
Plan of the talk
Part I: 7-cluster structure of the genome (codons and genes)
Part II: Coding and non-coding DNA scaling laws (genes and networks)
Part I: 7-clusters genome structure
Dr. Tatyana Popova
R&D Centre in Biberach, Germany
Prof. Alexander Gorban
Centre for Mathematical Modelling
Genomic sequence as a text in unknown language
tagggacgcacgtggtgagctgatgctaggg
Frequency dictionaries:
t a g g g a c g c a c g t g g t g a g c t g a t g c t a g g g    ($N = 4 = 4^1$)
ta gg ga cg ca cg tg gt ga gc tg at gc ta gg    ($N = 16 = 4^2$)
tag gga cgc acg tgg tga gct gat gct agg    ($N = 64 = 4^3$)
tagg gacg cacg tggt gagc tgat gcta gggr    ($N = 256 = 4^4$)
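A minimal sketch of how such frequency dictionaries can be computed, assuming non-overlapping words read from the start of the text. The function name `frequency_dictionary` and its `shift` parameter are illustrative and not taken from the talk.

```python
from collections import Counter

def frequency_dictionary(text, k, shift=0):
    """Normalized frequencies of non-overlapping k-letter words, read from offset `shift`."""
    words = [text[i:i + k] for i in range(shift, len(text) - k + 1, k)]
    counts = Counter(words)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

seq = "tagggacgcacgtggtgagctgatgctaggg"  # the example text from the slide
for k in (1, 2, 3, 4):
    d = frequency_dictionary(seq, k)
    # 4**k is the dictionary size N; a short text fills only part of it
    print(k, 4 ** k, len(d))
```

A trailing piece shorter than `k` letters is dropped, so the frequencies always sum to 1.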
gggrcgccacgttggtgagctgatgctagggrcgacgtgg
tagggrcgcacgtggtgagctgatgctagggrcgacgtgg
agggrcgcacgtggtgagctgatgctagggrcgacgtggc
..cgtggtgagctgatgctagggacgcacgtggtgagctgatgctagggacgacgtggtgagctgatgctagggacgc…
From text to geometry
cgtggtgagctgatgctagggacgcacgtggtgagctgatgctagggacgacgtggtgagctgatgctagggacgc
$10^7$
cgtggtgagctgatgctagggacgcacggtgagctgatgctagggacgcacacttgagctgatgctagggacgcacaattcgtgagctgatgctagggacgcacggtg……gagctgatgctagggacgcacaagtga
length~200-400
10000-20000 fragments
$\mathbb{R}^N$
Method of visualization: principal components analysis
$\mathbb{R}^N \to \mathbb{R}^2$
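The pipeline from text to a 2D picture can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the "genome" is a random sequence standing in for real data, the window step is an arbitrary choice, and PCA is done by a plain SVD in NumPy.

```python
import itertools
import random
import numpy as np

TRIPLETS = ["".join(p) for p in itertools.product("acgt", repeat=3)]
INDEX = {w: i for i, w in enumerate(TRIPLETS)}

def triplet_vector(fragment):
    """64-dimensional vector of non-overlapping triplet frequencies."""
    v = np.zeros(64)
    for i in range(0, len(fragment) - 2, 3):
        v[INDEX[fragment[i:i + 3]]] += 1
    return v / v.sum()

def pca_project(X, n_components=2):
    """Project the rows of X onto the first principal components (via SVD)."""
    Xc = X - X.mean(axis=0)                      # center the data
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

# Hypothetical stand-in for a genome: a random sequence, not real data.
random.seed(0)
genome = "".join(random.choice("acgt") for _ in range(30000))
window, step = 300, 30   # window is within the slide's ~200-400 range; step is an assumption
fragments = [genome[i:i + window] for i in range(0, len(genome) - window + 1, step)]
X = np.array([triplet_vector(f) for f in fragments])   # points in R^64
Y = pca_project(X)                                      # points in R^2
print(X.shape, Y.shape)
```

On a random sequence the projection shows a single structureless cloud; the cluster structure discussed in the talk only appears for real genomic text.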
PCA plot
Caulobacter crescentus
singles N=4
doublets N=16
triplets N=64
quadruplets N=256
!!!
The information in the genomic sequence is encoded by non-overlapping triplets (Nature, 1961)
First explanation
cgtggtgagctgatgctagggrcgcacgtggtgagctgatgctagggrcgacgtggtgagctgatgctagggrcgc
tga tgc tag ggr cgc acg tgg
ctg atg cta ggg rcg cac gtg
Basic 7-cluster structure
gtgagctgatgctagggrcgcacgtggtgagc
gct gat gct agg grc gca cgt
gtgaatcggtgggtgagtgtgctgctatgagc
atc ggt ggg tga gtg tgc tgc
tcg gtg ggt gag tgt gct gct
cgg tgg gtg agt gtg ctg ctg
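The shifted readings above can be reproduced with a short sketch. It assumes the usual interpretation of the 7-cluster structure (three reading phases on each strand plus non-coding text); the helper names `read_triplets` and `reverse_complement` are illustrative only.

```python
COMPLEMENT = str.maketrans("acgt", "tgca")

def reverse_complement(seq):
    """Sequence of the complementary strand, read in its own direction."""
    return seq.translate(COMPLEMENT)[::-1]

def read_triplets(seq, shift):
    """Non-overlapping triplets of seq, starting at offset shift (0, 1 or 2)."""
    return [seq[i:i + 3] for i in range(shift, len(seq) - 2, 3)]

fragment = "gtgaatcggtgggtgagtgtgctgctatgagc"
for shift in range(3):
    # three phases on the given strand, three on the complementary one
    print("forward ", shift, " ".join(read_triplets(fragment, shift)))
    print("backward", shift, " ".join(read_triplets(reverse_complement(fragment), shift)))
```

Each of the six readings yields its own triplet frequency vector, so a window cut from a coding region lands in a different place depending on its phase and strand.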
Non-coding parts
gtgagctgatgctagggr cgcacgaat
Point mutations: insertions, deletions
a
The flower-like 7-cluster structure is flat
[Plot: vertical axis from 0 to 0.16; horizontal axis from 1 to 29]
Seven classes vs Seven clusters
Stanford, TIGR, Georgia Institute of Technology
Hong-Yu Ou, Feng-Biao Guo and Chun-Ting Zhang (2003). Analysis of nucleotide distribution in the genome of Streptomyces coelicolor A3(2) using the Z curve method. FEBS Letters 540(1-3), 188-194.
Audic, S. and J. Claverie. Self-identification of protein-coding regions in microbial genomes. Proc Natl Acad Sci U S A, 95(17):10026-31, 1998.
Lomsadze A., Ter-Hovhannisyan V., Chernoff YO, Borodovsky M. Gene identification in novel eukaryotic genomes by self-training algorithm. Nucleic Acids Research, 2005, Vol. 33, No. 20.
Computational gene prediction
Accuracy >90%
Mean-field approximation for triplet frequencies
$F_{IJK} \approx P^1_I P^2_J P^3_K$
$F_{IJK}$: frequency of triplet IJK ($I, J, K \in \{A, C, G, T\}$):
$F_{AAA}, F_{AAT}, F_{AAC}, \dots, F_{GGC}, F_{GGG}$: 64 numbers
position-specific letter frequency + correlations
$P^j_i$: 12 numbers
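A small sketch of the mean-field construction: the 12 position-specific letter frequencies are estimated from a list of codons, and their products give the 64 approximate triplet frequencies. The codon list is a toy example assembled from the codons visible on the slides, and the function name `mean_field` is an assumption for illustration.

```python
from collections import Counter
from itertools import product

def mean_field(codons):
    """Return (F, P, F_mf): observed triplet frequencies, position-specific
    letter frequencies, and the mean-field product P1[I] * P2[J] * P3[K]."""
    n = len(codons)
    F = {w: c / n for w, c in Counter(codons).items()}
    position_counts = [Counter(codon[j] for codon in codons) for j in range(3)]
    P = [{x: counts[x] / n for x in "acgt"} for counts in position_counts]
    F_mf = {
        "".join(t): P[0][t[0]] * P[1][t[1]] * P[2][t[2]]
        for t in product("acgt", repeat=3)
    }
    return F, P, F_mf

# Toy codon list (not a real gene).
codons = "atg gat gct agg gtc gca cgc taa".split()
F, P, F_mf = mean_field(codons)
# The difference F - F_mf is the "correlations" part of the decomposition.
correlations = {w: F.get(w, 0.0) - F_mf[w] for w in F_mf}
```

The 12 numbers in `P` sum to 1 within each position, so they carry 9 independent parameters, against 63 for the full triplet table.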
Why hexagonal symmetry?
0-+
-+0
+0-
+-0
-0+
0+-
GC-content = PC + PG
Genome codon usage and mean-field approximation
ggtgaATG gat gct agg … gtc gca cgc TAAtgagct
…
correct frameshift
64 frequencies $F_{IJK}$
…
ggtgaATG gat gct agg … gtc gca cgc TAAtgagct
12 frequencies $P^1_I$, $P^2_J$, $P^3_K$
$P^j_i$ (frequency of letter $i$ at codon position $j$) are linear functions of GC-content
eubacteria
archaea
THE MYSTERY OF TWO STRAIGHT LINES ???
$\mathbb{R}^{12}$  $\mathbb{R}^{64}$
$F_{IJK} = P^1_I P^2_J P^3_K$ + correlations
Codon usage signature
0-+
19 possible eubacterial signatures
Example: Palindromic signatures
Four symmetry types of the basic 7-cluster structure
eubacteria
flower-like, degenerated, perpendicular triangles, parallel triangles
B. halodurans (GC=44%)
S. coelicolor (GC=72%)
F. nucleatum (GC=27%)
E. coli (GC=51%)
Using branching principal components to analyze 7-clusters genome structures
Streptomyces coelicolor
Bacillus halodurans
Escherichia coli
Fusobacterium nucleatum
Web-site
http://www.ihes.fr/~zinovyev/7clusters
cluster structures in genomic sequences
Papers (type Zinovyev in Google)
Gorban A, Zinovyev A. PCA deciphers genome. 2005. arXiv preprint.
Gorban A, Popova T, Zinovyev A. Codon usage trajectories and 7-cluster structure of 143 complete bacterial genomic sequences. 2005. Physica A 353, 365-387.
Gorban A, Popova T, Zinovyev A. Four basic symmetry types in the universal 7-cluster structure of microbial genomic sequences. 2005. In Silico Biology 5, 0025.
Gorban A, Zinovyev A, Popova T. Seven clusters in genomic triplet distributions. 2003. In Silico Biology 3, 0039.
Zinovyev A, Gorban A, Popova T. Self-Organizing Approach for Automated Gene Identification. 2003. Open Systems and Information Dynamics 10 (4).
Part II: Coding and non-coding DNA scaling laws
Dr. Thomas Fink
Bioinformatics service
Dr. Sebastian Ahnert
Cavendish Laboratory, University of Cambridge
C-value and G-value paradox
Neither genome length nor gene number accounts for the complexity of an organism
Drosophila melanogaster (fruit fly) C=120Mb
Podisma pedestris (mountain grasshopper) C=1650 Mb
Non-linear growth of regulation
Mattick, J. S. Nature Reviews Genetics 5, 316–323 (2004).
“Amount of regulation” scales non-linearly with the number of genes: every new gene with a new function requires specific regulation, but the regulators also need to be regulated
[Figure: log number of regulatory genes vs. log number of genes, for bacteria and archaea; slope = 1.96, slope = 1]
Complexity ceiling for prokaryotes
Adding a new function S requires adding a regulatory overhead R; the total increase is N = R + S
Since $R \sim N^2$, at some point R > S, i.e. the gain from a new function is too expensive for an organism: it requires too much regulation to be integrated
There is a maximum possible genome length for prokaryotes (~10 Mb)
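One way to make the ceiling argument explicit is sketched below. It assumes, hypothetically, that the regulatory overhead is exactly quadratic, $R = aN^2$, with an unspecified constant $a$ that does not appear in the talk; $N$ is read as the total number of genes and $S$ as the number of non-regulatory genes.

```latex
\begin{aligned}
N &= R + S, \qquad R = aN^2 \quad\Longrightarrow\quad S = N - aN^2 \\
\frac{dR}{dN} &= 2aN, \qquad \frac{dS}{dN} = 1 - 2aN \\
\frac{dR}{dN} > \frac{dS}{dN} &\iff N > \frac{1}{4a}, \qquad \frac{dS}{dN} = 0 \ \text{ at } \ N = \frac{1}{2a}
\end{aligned}
```

Under this assumption, beyond $N = 1/(4a)$ each added gene brings more regulation than function, and at $N = 1/(2a)$ adding genes yields no new function at all, which is the sense in which a ceiling exists.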
How did eukaryotes bypass this limitation?
Presumably, they invented a cheaper (digital) regulatory system, based on RNA
This regulatory information is stored in the “non-coding” DNA
Simple model: accelerated networks
Node is a gene (c genes)
Edge is a “regulation” (n edges)
$n = c^2$
Connectivity < $k_{max}$: regulators are only proteins
Connectivity > $k_{max}$: deficit of regulations is taken from non-coding DNA
How much regulation does the genome need to take from non-coding DNA?
$n_{deficit} = \frac{k_{max}}{2} \, \frac{c}{c_{max}} \, (c - c_{max})$
$c_{max}$ (prokaryotic ceiling)
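The deficit can be checked numerically with a minimal sketch. It assumes the accelerated-network law in the form $n = \alpha c^2$, where $\alpha$ is a hypothetical proportionality constant fixed by requiring the average connectivity $2n/c$ to equal $k_{max}$ at $c = c_{max}$ (so $\alpha = k_{max} / (2 c_{max})$). Protein regulators can then carry at most $k_{max} c / 2$ edges, and the rest is the deficit.

```python
def regulation_deficit(c, c_max, k_max):
    """Edges that protein regulators cannot carry once connectivity exceeds k_max.

    Assumes n = alpha * c**2 with alpha = k_max / (2 * c_max) (an assumption
    made for this sketch), so that
    deficit = n - k_max * c / 2 = (k_max / 2) * (c / c_max) * (c - c_max).
    """
    if c <= c_max:
        return 0.0  # below the ceiling, proteins carry all the regulation
    return 0.5 * k_max * (c / c_max) * (c - c_max)

# Illustrative numbers only; they are not taken from the talk.
c_max, k_max = 10.0, 4.0
for c in (5.0, 10.0, 25.0):
    print(c, regulation_deficit(c, c_max, k_max))
```

For $c \gg c_{max}$ the deficit grows like $c^2$, which is the source of the quadratic scaling of non-coding DNA proposed later in the talk.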
These regulations must be encoded in the non-coding part of genome, therefore
N: non-coding DNA length
C: coding DNA length
Cprok: ceiling for prokaryotes (~10 Mb)
some coefficient
Observation: coding length vs non-coding
=1
Minimum non-coding length needed for the «deficit» regulation
Hypothesis
Prokaryotes: <Non-coding length> = <Coding length> (little constant add-on, promoters, UTRs…)
15% ≈ 1/7
Eukaryotes: $N_{reg} = \frac{\beta}{2} \, \frac{C}{C_{maxprok}} \, (C - C_{maxprok}) \sim C^2$,
Cmaxprok ≈ 10Mb ≈
This is the amount necessary for regulation, but repeats, genome parasites, etc., might make a genome much bigger
This is only a hypothesis, but…
Prediction on the Nreg for human:
Nreg = 87 Mb = 3% of genome length
C = 48 Mb = 1.7%
Nreg+C = 4.7%
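The arithmetic behind this prediction can be sketched as follows. The coefficient lost in extraction is written here as `beta`, which is an assumption of this sketch, and the formula is taken in the form $N_{reg} = \frac{\beta}{2} \frac{C}{C_{maxprok}} (C - C_{maxprok})$ with the approximate slide values C = 48 Mb and Cmaxprok = 10 Mb.

```python
def n_reg(C, C_maxprok, beta=1.0):
    """Minimum regulatory non-coding length (same units as C) under the hypothesis.

    `beta` stands in for the coefficient whose symbol is missing in the source.
    """
    return 0.5 * beta * (C / C_maxprok) * (C - C_maxprok)

C, C_maxprok = 48.0, 10.0                 # Mb, approximate values from the slides
estimate = n_reg(C, C_maxprok)            # about 91 Mb with beta = 1
implied_beta = 87.0 / estimate            # coefficient that would give the quoted 87 Mb
implied_genome = 87.0 / 0.03              # genome length implied by "87 Mb = 3%"
print(round(estimate, 1), round(implied_beta, 2), round(implied_genome))
```

With a coefficient of exactly 1 and a ceiling of exactly 10 Mb the formula gives a value slightly above the quoted 87 Mb; the slide values are rounded, so a small difference in either input accounts for the gap.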
Thank you for your attention
Questions?