Codons, Genes and Networks

Bioinformatics service

Math@Bio group of M. Gromov

Andrei Zinovyev

Plan of the talk

Part I: 7-cluster structure of the genome (codons and genes)

Part II: Coding and non-coding DNA scaling laws (genes and networks)

Part I: 7-cluster genome structure

Dr. Tatyana Popova

R&D Centre in Biberach, Germany

Prof. Alexander Gorban

Centre for Mathematical Modelling

Genomic sequence as a text in an unknown language

tagggacgcacgtggtgagctgatgctaggg

Frequency dictionaries:

t a g g g a c g c a c g t g g t g a g c t g a t g c t a g g g

$N = 4 = 4^1$

ta gg ga cg ca cg tg gt ga gc tg at gc ta gg

$N = 16 = 4^2$

tag gga cgc acg tgg tga gct gat gct agg

$N = 64 = 4^3$

tagg gacg cacg tggt gagc tgat gcta gggr

$N = 256 = 4^4$

gggrcgccacgttggtgagctgatgctagggrcgacgtgg

tagggrcgcacgtggtgagctgatgctagggrcgacgtgg

agggrcgcacgtggtgagctgatgctagggrcgacgtggc

..cgtggtgagctgatgctagggacgcacgtggtgagctgatgctagggacgacgtggtgagctgatgctagggacgc…
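The frequency dictionaries above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `frequency_dictionary` is hypothetical, and words are read without overlap starting from the first letter, as in the slide's examples.

```python
from collections import Counter


def frequency_dictionary(text: str, k: int) -> dict[str, float]:
    """Frequencies of non-overlapping words of length k, read from position 0.

    A trailing piece shorter than k letters is ignored.
    """
    words = [text[i:i + k] for i in range(0, len(text) - k + 1, k)]
    counts = Counter(words)
    total = sum(counts.values())
    return {word: count / total for word, count in sorted(counts.items())}


seq = "tagggacgcacgtggtgagctgatgctaggg"  # example string from the slide
for k in (1, 2, 3, 4):
    d = frequency_dictionary(seq, k)
    # The dictionary has at most 4**k entries (4, 16, 64, 256).
    print(f"k={k}: {len(d)} distinct words out of at most {4 ** k}")
```

Each dictionary is a point in a space of dimension $4^k$, which is what makes the geometric treatment possible.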

From text to geometry

cgtggtgagctgatgctagggacgcacgtggtgagctgatgctagggacgacgtggtgagctgatgctagggacgc

$10^7$

cgtggtgagctgatgctagggacgcacggtgagctgatgctagggacgcacacttgagctgatgctagggacgcacaattcgtgagctgatgctagggacgcacggtg……gagctgatgctagggacgcacaagtga

length ~200-400

10000-20000 fragments

$\mathbb{R}^N$

Method of visualization: principal component analysis

$\mathbb{R}^N \to \mathbb{R}^2$

PCA plot
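The pipeline from text to a 2D plot can be sketched as follows. This is only an illustration under stated assumptions: the window width and step, the function names, and the use of power iteration with deflation (chosen to keep the example dependency-free) are not taken from the talk, and the random test genome has no cluster structure of its own.

```python
import random
from itertools import product

TRIPLETS = ["".join(p) for p in product("acgt", repeat=3)]  # 64 coordinates
INDEX = {t: i for i, t in enumerate(TRIPLETS)}


def fragments(genome, width=300, step=60):
    """Sliding window over the genome (width and step are illustrative)."""
    return [genome[i:i + width] for i in range(0, len(genome) - width + 1, step)]


def triplet_vector(fragment):
    """Frequencies of non-overlapping triplets: one point in R^64."""
    counts = [0.0] * len(TRIPLETS)
    for i in range(0, len(fragment) - 2, 3):
        j = INDEX.get(fragment[i:i + 3])
        if j is not None:  # triplets with other letters are skipped
            counts[j] += 1.0
    total = sum(counts)
    return [x / total for x in counts] if total else counts


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def principal_axes(rows, n_axes=2, iters=100, seed=0):
    """Leading principal axes by power iteration with deflation."""
    dim = len(rows[0])
    mean = [sum(r[j] for r in rows) / len(rows) for j in range(dim)]
    X = [[r[j] - mean[j] for j in range(dim)] for r in rows]  # centred data
    rng = random.Random(seed)
    axes = []
    for _ in range(n_axes):
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        for _ in range(iters):
            scores = [dot(row, v) for row in X]
            # w = X^T (X v), i.e. the scatter matrix applied to v
            w = [sum(s * row[j] for s, row in zip(scores, X)) for j in range(dim)]
            for a in axes:  # remove directions already found
                d = dot(w, a)
                w = [x - d * y for x, y in zip(w, a)]
            norm = dot(w, w) ** 0.5
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        axes.append(v)
    coords = [[dot(row, a) for a in axes] for row in X]
    return axes, coords


rng = random.Random(1)
genome = "".join(rng.choice("acgt") for _ in range(1500))  # stand-in sequence
points = [triplet_vector(f) for f in fragments(genome)]
axes, coords = principal_axes(points)
print(len(points), "fragments projected from R^64 to R^2")
```

On a real genome, `coords` would be the input of the scatter plot; in practice a numerical library would replace the hand-written power iteration.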

Caulobacter crescentus

singles N=4

doublets N=16

triplets N=64

quadruplets N=256

!!!

The information in a genomic sequence is encoded by non-overlapping triplets (Nature, 1961)

First explanation

cgtggtgagctgatgctagggrcgcacgtggtgagctgatgctagggrcgacgtggtgagctgatgctagggrcgc

tga tgc tag ggr cgc acg tgg

ctg atg cta ggg rcg cac gtg

Basic 7-cluster structure

gtgagctgatgctagggrcgcacgtggtgagc

gct gat gct agg grc gca cgt

gtgaatcggtgggtgagtgtgctgctatgagc

atc ggt ggg tga gtg tgc tgc

tcg gtg ggt gag tgt gct gct

cgg tgg gtg agt gtg ctg ctg
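The three phase shifts shown above can be reproduced with a short sketch. The function names are hypothetical, and the reading of the seven clusters as three shifts on each strand plus the non-coding fragments is stated here as an interpretation of the slides, not as the authors' exact procedure.

```python
def reading_frames(fragment: str) -> dict[int, list[str]]:
    """Non-overlapping triplets of a fragment for each of the three phase shifts."""
    return {
        shift: [fragment[i:i + 3] for i in range(shift, len(fragment) - 2, 3)]
        for shift in range(3)
    }


COMPLEMENT = str.maketrans("acgt", "tgca")


def reverse_complement(fragment: str) -> str:
    """The same fragment as read on the complementary strand."""
    return fragment.translate(COMPLEMENT)[::-1]


window = "atcggtgggtgagtgtgctgc"  # part of the slide's example sequence
for shift, triplets in reading_frames(window).items():
    print(shift, " ".join(triplets))

# Three more frames come from the complementary strand.
for shift, triplets in reading_frames(reverse_complement(window)).items():
    print("rc", shift, " ".join(triplets))
```

A fragment inside a gene falls into one of these six phases depending on where the window starts and which strand carries the gene, while a fragment outside genes has no preferred phase.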

Non-coding parts

gtgagctgatgctagggr cgcacgaat

Point mutations: insertions, deletions

a

The flower-like 7-cluster structure is flat

[Figure: bar chart with vertical axis from 0 to 0.16 and horizontal axis from 1 to 29]

Seven classes vs Seven clusters

Stanford, TIGR, Georgia Institute of Technology

Hong-Yu Ou, Feng-Biao Guo and Chun-Ting Zhang (2003). Analysis of nucleotide distribution in the genome of Streptomyces coelicolor A3(2) using the Z curve method. FEBS Letters 540(1-3), 188-194.

Audic, S. and J. Claverie. Self-identification of protein-coding regions in microbial genomes. Proc Natl Acad Sci U S A, 95(17):10026-31, 1998.

Lomsadze A., Ter-Hovhannisyan V., Chernoff YO, Borodovsky M. Gene identification in novel eukaryotic genomes by self-training algorithm. Nucleic Acids Research, 2005, Vol. 33, No. 20.

Computational gene prediction

Accuracy >90%

Mean-field approximation for triplet frequencies

$F_{IJK} \approx P^1_I P^2_J P^3_K$

$F_{IJK}$: frequency of triplet $IJK$ ($I, J, K \in \{A, C, G, T\}$):

$F_{AAA}, F_{AAT}, F_{AAC}, \dots, F_{GGC}, F_{GGG}$: 64 numbers

position-specific letter frequency + correlations

$P^j_I$: 12 numbers
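A small sketch of the mean-field approximation: the 12 position-specific frequencies are computed from a table of 64 triplet frequencies, and their products give the approximated table. The function names and the two-codon toy table are illustrative assumptions, not data from the talk.

```python
from itertools import product

LETTERS = "acgt"


def position_frequencies(F):
    """P[j][I]: frequency of letter I at codon position j (3 x 4 = 12 numbers)."""
    P = [{letter: 0.0 for letter in LETTERS} for _ in range(3)]
    for codon, freq in F.items():
        for j, letter in enumerate(codon):
            P[j][letter] += freq
    return P


def mean_field(P):
    """Mean-field table: product of the three position-specific frequencies."""
    return {
        "".join(c): P[0][c[0]] * P[1][c[1]] * P[2][c[2]]
        for c in product(LETTERS, repeat=3)
    }


# Toy codon usage (hypothetical): only two codons, equally frequent.
F = {"atg": 0.5, "gcc": 0.5}
P = position_frequencies(F)
F_mf = mean_field(P)

# What the mean-field table misses is the "correlations" term.
correlations = {c: F.get(c, 0.0) - F_mf[c] for c in F_mf}
print(F_mf["atg"], correlations["atg"])
```

In the toy table the mean field spreads the weight over eight codons, so the correlation term is large; the closer a real codon usage is to its mean-field table, the more of its variation is captured by the 12 numbers alone.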

Why hexagonal symmetry?

0-+

-+0

+0-

+-0

-0+

0+-

GC-content = $P_C + P_G$

Genome codon usage and mean-field approximation

ggtgaATG gat gct agg … gtc gca cgc TAAtgagct

…

correct frameshift

64 frequencies FIJK

…

ggtgaATG gat gct agg … gtc gca cgc TAAtgagct

12 frequencies $P^1_I$, $P^2_J$, $P^3_K$

$P^j_I$ are linear functions of GC-content

eubacteria

archaea

THE MYSTERY OF TWO STRAIGHT LINES ???

$\mathbb{R}^{12} \to \mathbb{R}^{64}$

$F_{IJK} = P^1_I P^2_J P^3_K + \text{correlations}$

Codon usage signature

0-+

19 possible eubacterial signatures

Example: Palindromic signatures

Four symmetry types of the basic 7-cluster structure

eubacteria

flower-like, degenerated, perpendicular triangles, parallel triangles

B. halodurans (GC=44%)

S. coelicolor (GC=72%)

F. nucleatum (GC=27%)

E. coli (GC=51%)

Using branching principal components to analyze 7-clusters genome structures

Streptomyces coelicolor

Bacillus halodurans

Escherichia coli

Fusobacterium nucleatum

Web-site

http://www.ihes.fr/~zinovyev/7clusters

cluster structures in genomic sequences

Papers (type Zinovyev in Google)

Gorban A, Zinovyev A. PCA deciphers genome. 2005. Arxiv preprint

Gorban A, Popova T, Zinovyev A. Codon usage trajectories and 7-cluster structure of 143 complete bacterial genomic sequences. 2005. Physica A 353, 365-387

Gorban A, Popova T, Zinovyev A. Four basic symmetry types in the universal 7-cluster structure of microbial genomic sequences. 2005. In Silico Biology 5, 0025

Gorban A, Zinovyev A, Popova T. Seven clusters in genomic triplet distributions. 2003. In Silico Biology 3, 0039

Zinovyev A, Gorban A, Popova T. Self-Organizing Approach for Automated Gene Identification. 2003. Open Systems and Information Dynamics 10 (4)

Part II: Coding and non-coding DNA scaling laws

Dr. Thomas Fink

Bioinformatics service

Dr. Sebastian Ahnert

Cavendish Laboratory, University of Cambridge

C-value and G-value paradox

Neither genome length nor gene number accounts for the complexity of an organism

Drosophila melanogaster (fruit fly) C=120Mb

Podisma pedestris (mountain grasshopper) C=1650 Mb

Non-linear growth of regulation

Mattick, J. S. Nature Reviews Genetics 5, 316–323 (2004).

“Amount of regulation” scales non-linearly with the number of genes: every new gene with a new function requires specific regulation, but the regulators also need to be regulated

[Figure: log number of regulatory genes vs. log number of genes, for bacteria and archaea; lines with slope = 1.96 and slope = 1]

Complexity ceiling for prokaryotes

Adding a new function $S$ requires adding a regulatory overhead $R$; the total increase is $N = R + S$

Since $R \sim N^2$, at some point $R > S$, i.e. the gain from a new function becomes too expensive for the organism: integrating it requires too much regulation

There is a maximum possible genome length for prokaryotes (~10 Mb)

How did eukaryotes bypass this limitation?

Presumably, they invented a cheaper (digital) regulatory system, based on RNA

This regulatory information is stored in the “non-coding” DNA

Simple model: accelerated networks

A node is a gene ($c$ genes)

An edge is a “regulation” ($n$ edges)

$n \sim c^2$

Connectivity < $k_{\max}$: regulators are only proteins

Connectivity > $k_{\max}$: the deficit of regulations is taken from non-coding DNA

How much regulation does a genome need to take from non-coding DNA?

$n_{\text{deficit}} = \frac{k_{\max}}{2 c_{\max}} \, c \, (c - c_{\max})$

$c_{\max}$: prokaryotic ceiling
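One way to arrive at the deficit formula is sketched below. It rests on two assumptions that are not spelled out on the slide: the quadratic law is written as $n = \alpha c^2$ with a constant $\alpha$ (a symbol introduced here for illustration), and connectivity means the mean degree $k = 2n/c$.

```latex
\begin{align*}
  k(c) &= \frac{2n}{c} = 2\alpha c,
  &
  k_{\max} &= 2\alpha c_{\max}
  \;\Rightarrow\;
  \alpha = \frac{k_{\max}}{2 c_{\max}}, \\
  n_{\text{proteins}} &= \frac{k_{\max}\, c}{2}
  && \text{(edges available when connectivity is capped at } k_{\max}\text{)}, \\
  n_{\text{deficit}} &= \alpha c^2 - \frac{k_{\max}\, c}{2}
  = \frac{k_{\max}}{2 c_{\max}}\, c^2 - \frac{k_{\max}}{2}\, c
  = \frac{k_{\max}}{2 c_{\max}}\, c\,(c - c_{\max}).
\end{align*}
```

The deficit vanishes at $c = c_{\max}$ and grows quadratically beyond it.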

These regulations must be encoded in the non-coding part of the genome, therefore

$N = \frac{\beta}{2} \frac{C}{C_{prok}} (C - C_{prok})$

$N$: non-coding DNA length

$C$: coding DNA length

$C_{prok}$: ceiling for prokaryotes (~10 Mb)

$\beta$: some coefficient

Observation: coding length vs non-coding length

[Figure: coding length vs non-coding length; the curve for $\beta = 1$ marks the minimum non-coding length needed for the «deficit» regulation]

Hypothesis

Prokaryotes: <Non-coding length> = <Coding length> (little constant add-on, promoters, UTRs…)

15% ≈ 1/7

Eukaryotes: $N_{reg} = \frac{\beta}{2} \frac{C}{C_{maxprok}} (C - C_{maxprok}) \sim C^2$,

Cmaxprok ≈ 10Mb ≈

This is the amount necessary for regulation, but repeats, genome parasites, etc., might make a genome much bigger

This is only a hypothesis, but…

Prediction of $N_{reg}$ for human:

Nreg = 87 Mb = 3% of genome length

C = 48 Mb = 1.7%

Nreg+C = 4.7%
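The arithmetic behind this kind of prediction can be checked with a short sketch. It assumes the coefficient equals 1, the rounded ceiling of 10 Mb, and zero deficit below the ceiling; the function name `n_reg` is hypothetical. With these rounded inputs the result is about 91 Mb, the same order as the 87 Mb quoted above; the slide presumably used a less rounded ceiling.

```python
def n_reg(coding_mb: float, ceiling_mb: float = 10.0, beta: float = 1.0) -> float:
    """Minimum regulatory non-coding length in Mb.

    Implements beta/2 * C/C_maxprok * (C - C_maxprok). Returning 0 below the
    ceiling is an assumption: there, proteins alone are taken to suffice.
    """
    if coding_mb <= ceiling_mb:
        return 0.0
    return beta / 2 * coding_mb / ceiling_mb * (coding_mb - ceiling_mb)


human_coding_mb = 48.0  # coding length quoted on the slide
estimate = n_reg(human_coding_mb)
print(f"N_reg is about {estimate:.1f} Mb for C = {human_coding_mb:.0f} Mb")
```

Because the formula is quadratic in the coding length, the estimate is sensitive to the value taken for the prokaryotic ceiling.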

Thank you for your attention

Questions?
