Codons, Genes and Networks

Bioinformatics service

Math@Bio group of M.Gromov

Andrei Zinovyev

Plan of the talk

Part I: 7-clusters structure of genome (codons and genes)

Part II: Coding and non-coding DNA scaling laws (genes and networks)

Part I: 7-clusters genome structure

Dr. Tatyana Popova

R&D Centre in Biberach, Germany

Prof. Alexander Gorban

Centre for Mathematical Modelling

Genomic sequence as a text in unknown language

tagggacgcacgtggtgagctgatgctaggg

frequency dictionaries:

t a g g g a c g c a c g t g g t g a g c t g a t g c t a g g g
N = 4 = 4^1

ta gg ga cg ca cg tg gt ga gc tg at gc ta gg
N = 16 = 4^2

tag gga cgc acg tgg tga gct gat gct agg
N = 64 = 4^3

tagg gacg cacg tggt gagc tgat gcta gggr
N = 256 = 4^4
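A frequency dictionary of this kind can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the talk; the function name `frequency_dictionary` is hypothetical, and the sequence is the 31-letter example from the slide.

```python
from collections import Counter

def frequency_dictionary(seq, k):
    """Count non-overlapping words of length k, reading from position 0."""
    words = [seq[i:i + k] for i in range(0, len(seq) - k + 1, k)]
    return Counter(words)

seq = "tagggacgcacgtggtgagctgatgctaggg"
for k in (1, 2, 3, 4):
    d = frequency_dictionary(seq, k)
    # 4**k is the number of possible words (the dimension N on the slide)
    print(k, 4 ** k, sum(d.values()), d.most_common(3))
```

For k = 3 the example sequence yields 10 complete triplets, of which `gct` occurs twice; the trailing incomplete word is dropped.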

gggrcgccacgttggtgagctgatgctagggrcgacgtgg

tagggrcgcacgtggtgagctgatgctagggrcgacgtgg

agggrcgcacgtggtgagctgatgctagggrcgacgtggc

..cgtggtgagctgatgctagggacgcacgtggtgagctgatgctagggacgacgtggtgagctgatgctagggacgc…

From text to geometry

cgtggtgagctgatgctagggacgcacgtggtgagctgatgctagggacgacgtggtgagctgatgctagggacgc

cgtggtgagctgatgctagggacgcacggtgagctgatgctagggacgcacacttgagctgatgctagggacgcacaattcgtgagctgatgctagggacgcacggtg……gagctgatgctagggacgcacaagtga

length~200-400

10000-20000 fragments

Method of visualization: principal components analysis
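The pipeline from text to geometry can be sketched as follows: cut the sequence into windows, turn each window into a 64-dimensional vector of triplet frequencies, and project the vectors onto the leading principal components. This is a dependency-free sketch under assumptions: the window width and step, the function names, and the use of power iteration (instead of a library PCA routine) are illustrative choices, and the demo runs on a random sequence rather than a real genome.

```python
import itertools
import math
import random

TRIPLETS = ["".join(p) for p in itertools.product("acgt", repeat=3)]

def triplet_vector(fragment):
    """64 frequencies of non-overlapping triplets read from the fragment start."""
    counts = dict.fromkeys(TRIPLETS, 0)
    for i in range(0, len(fragment) - 2, 3):
        word = fragment[i:i + 3]
        if word in counts:              # skip words with ambiguous letters
            counts[word] += 1
    total = sum(counts.values()) or 1
    return [counts[t] / total for t in TRIPLETS]

def fragments(seq, width=300, step=150):
    """Sliding windows; width and step are illustrative assumptions."""
    return [seq[i:i + width] for i in range(0, len(seq) - width + 1, step)]

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def top_components(rows, n_components=2, iters=100):
    """Leading principal axes by power iteration with deflation."""
    dim = len(rows[0])
    mean = [sum(r[j] for r in rows) / len(rows) for j in range(dim)]
    X = [[r[j] - mean[j] for j in range(dim)] for r in rows]
    rng = random.Random(0)
    axes = []
    for _ in range(n_components):
        v = [rng.gauss(0, 1) for _ in range(dim)]
        for _ in range(iters):
            scores = [dot(row, v) for row in X]              # X v
            v = [sum(s * row[j] for s, row in zip(scores, X))
                 for j in range(dim)]                        # X^T (X v)
            for a in axes:                                   # remove found axes
                d = dot(v, a)
                v = [x - d * y for x, y in zip(v, a)]
            norm = math.sqrt(dot(v, v)) or 1.0
            v = [x / norm for x in v]
        axes.append(v)
    projected = [[dot(row, a) for a in axes] for row in X]
    return axes, projected

# Demo on a random sequence (a stand-in for a real genome).
rng = random.Random(1)
demo = "".join(rng.choice("acgt") for _ in range(3000))
rows = [triplet_vector(f) for f in fragments(demo)]
axes, projected = top_components(rows)
print(len(rows), projected[0])
```

For the 10000-20000 fragments of a real genome, pure Python becomes slow and a numerical library would be the practical choice; the structure of the computation stays the same.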

PCA plot

Caulobacter crescentus

singles N=4

doublets N=16

triplets N=64

quadruplets N=256

the information in genomic sequence is encoded by non-overlapping triplets (Nature, 1961)

First explanation

cgtggtgagctgatgctagggrcgcacgtggtgagctgatgctagggrcgacgtggtgagctgatgctagggrcgc

tga tgc tag ggr cgc acg tgg

ctg atg cta ggg rcg cac gtg

Basic 7-cluster structure

gtgagctgatgctagggrcgcacgtggtgagc

gct gat gct agg grc gca cgt

gtgaatcggtgggtgagtgtgctgctatgagc

atc ggt ggg tga gtg tgc tgc

tcg gtg ggt gag tgt gct gct

cgg tgg gtg agt gtg ctg ctg
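The three readings above differ only in the starting offset (phase) of the triplet grid. A minimal sketch of that shift follows; the function names are hypothetical, and the interpretation of the seven clusters as three phases on each strand plus non-coding regions comes from the papers listed later in the talk rather than from this slide.

```python
COMPLEMENT = str.maketrans("acgt", "tgca")

def reverse_complement(seq):
    """Sequence of the opposite strand, read in its own 5'->3' direction."""
    return seq.translate(COMPLEMENT)[::-1]

def read_in_phase(seq, phase):
    """Non-overlapping triplets starting at offset phase (0, 1 or 2)."""
    return [seq[i:i + 3] for i in range(phase, len(seq) - 2, 3)]

window = "atcggtgggtgagtgtgctgcta"
for phase in range(3):
    print(phase, " ".join(read_in_phase(window, phase)))
```

Applying `read_in_phase` to `reverse_complement(window)` gives the three readings of the opposite strand.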

Non-coding parts

gtgagctgatgctagggr cgcacgaat

Point mutations: insertions, deletions

The flower-like 7 clusters structure is flat


Seven classes vs Seven clusters

Stanford, TIGR, Georgia Institute of Technology

Hong-Yu Ou, Feng-Biao Guo and Chun-Ting Zhang (2003). Analysis of nucleotide distribution in the genome of Streptomyces coelicolor A3(2) using the Z curve method. FEBS Letters 540(1-3), 188-194.

Audic, S. and J. Claverie. Self-identification of protein-coding regions in microbial genomes. Proc Natl Acad Sci U S A, 95(17):10026-31, 1998.

Lomsadze A., Ter-Hovhannisyan V., Chernoff YO, Borodovsky M. Gene identification in novel eukaryotic genomes by self-training algorithm. Nucleic Acids Research, 2005, Vol. 33, No. 20.

Computational gene prediction

Accuracy >90%

Mean-field approximation for triplet frequencies

$F_{IJK} \approx P^1_I P^2_J P^3_K$

$F_{IJK}$: frequency of triplet IJK ($I, J, K \in \{A, C, G, T\}$):

$F_{AAA}, F_{AAT}, F_{AAC}, \ldots, F_{GGC}, F_{GGG}$: 64 numbers

position-specific letter frequency + correlations

$P^j_i$: 12 numbers
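The reduction from 64 triplet frequencies to 12 position-specific letter frequencies, and the product that rebuilds an approximate 64-number table, can be sketched as below. The function names and the six-codon toy input are illustrative assumptions; a real analysis would use all codons of a genome.

```python
import itertools
from collections import Counter

LETTERS = "acgt"

def triplet_frequencies(codons):
    """F[IJK]: observed frequency of each of the 64 triplets."""
    counts = Counter(codons)
    total = sum(counts.values())
    return {"".join(t): counts["".join(t)] / total
            for t in itertools.product(LETTERS, repeat=3)}

def position_frequencies(F):
    """P[j][letter]: frequency of a letter at codon position j (12 numbers)."""
    P = [dict.fromkeys(LETTERS, 0.0) for _ in range(3)]
    for triplet, f in F.items():
        for j, letter in enumerate(triplet):
            P[j][letter] += f
    return P

def mean_field(P):
    """Product approximation P1[I] * P2[J] * P3[K] for every triplet."""
    return {a + b + c: P[0][a] * P[1][b] * P[2][c]
            for a, b, c in itertools.product(LETTERS, repeat=3)}

codons = "gat gct agg gtc gca cgc".split()   # toy input
F = triplet_frequencies(codons)
P = position_frequencies(F)
M = mean_field(P)
gc3 = P[2]["c"] + P[2]["g"]                  # GC-content at the third position
print(round(P[0]["g"], 3), gc3, round(sum(M.values()), 6))
```

The difference `F[t] - M[t]` for each triplet `t` is the part the slide labels as correlations.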

Why hexagonal symmetry?

GC-content = PC + PG

Genome codon usage and mean-field approximation

ggtgaATG gat gct agg … gtc gca cgc TAAtgagct

correct frameshift

64 frequencies FIJK

ggtgaATG gat gct agg … gtc gca cgc TAAtgagct

12 frequencies $P^1_I$, $P^2_J$, $P^3_K$

$P^j_i$ are linear functions of GC-content

eubacteria

archaea

THE MYSTERY OF TWO STRAIGHT LINES ???

$\mathbb{R}^{12} \to \mathbb{R}^{64}$

$F_{IJK} = P^1_I P^2_J P^3_K$ + correlations

Codon usage signature

19 possible eubacterial signatures

Example: Palindromic signatures

Four symmetry types of the basic 7-cluster structure

eubacteria

flower-like

degenerated

perpendicular triangles

parallel triangles

B. halodurans (GC=44%)

S. coelicolor (GC=72%)

F. nucleatum (GC=27%)

E. coli (GC=51%)

Using branching principal components to analyze 7-clusters genome structures

Streptomyces coelicolor

Bacillus halodurans

Escherichia coli

Fusobacterium nucleatum

Using branching principal components to analyze 7-clusters genome structures

Web-site

http://www.ihes.fr/~zinovyev/7clusters

cluster structures in genomic sequences

Papers (type Zinovyev in Google)

Gorban A, Zinovyev A. PCA deciphers genome. 2005. Arxiv preprint

Gorban A, Popova T, Zinovyev A. Codon usage trajectories and 7-cluster structure of 143 complete bacterial genomic sequences. 2005. Physica A 353, 365-387

Gorban A, Popova T, Zinovyev A. Four basic symmetry types in the universal 7-cluster structure of microbial genomic sequences. 2005. In Silico Biology 5, 0025

Gorban A, Zinovyev A, Popova T. Seven clusters in genomic triplet distributions. 2003. In Silico Biology. V.3, 0039.

Zinovyev A, Gorban A, Popova T. Self-Organizing Approach for Automated Gene Identification. 2003. Open Systems and Information Dynamics 10 (4).

Part II: Coding and non-coding DNA scaling laws

Dr. Thomas Fink

Bioinformatics service

Dr. Sebastian Ahnert

Cavendish Laboratory, University of Cambridge

C-value and G-value paradox

Neither genome length nor gene number accounts for the complexity of an organism

Drosophila melanogaster (fruit fly) C=120Mb

Podisma pedestris (mountain grasshopper) C=1650 Mb

Non-linear growth of regulation

Mattick, J. S. Nature Reviews Genetics 5, 316–323 (2004).

“Amount of regulation” scales non-linearly with the number of genes: every new gene with a new function requires specific regulation, but the regulators also need to be regulated

[Plot: log number of regulatory genes vs. log number of genes, for bacteria and archaea; lines with slope = 1.96 and slope = 1]

Complexity ceiling for prokaryotes

Adding a new function S requires adding a regulatory overhead R; the total increase is N = R + S

Since R ~ N^2, at some point R > S, i.e. the gain from a new function is too expensive for an organism: it requires too much regulation to be integrated

There is a maximum possible genome length for prokaryotes (~10 Mb)

How did eukaryotes bypass this limitation?

Presumably, they invented a cheaper (digital) regulatory system, based on RNA

This regulatory information is stored in the “non-coding” DNA

Simple model: Accelerated networks

Node is a gene (c genes)

Edge is a “regulation” (n edges)

n = c^2

Connectivity < k_max, regulators are only proteins

Connectivity > k_max, deficit of regulations is taken from non-coding DNA

How much regulation does a genome need to take from non-coding DNA?

$n_{deficit} = \frac{k_{max}}{2}\,\frac{c}{c_{max}}\,(c - c_{max})$

cmax (prokaryotic ceiling)
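The deficit formula can be recovered in a few lines. This is a sketch under stated assumptions rather than the derivation shown in the talk: it assumes $n = \alpha c^2$ with a proportionality constant $\alpha$ (the slide's $n = c^2$ is the case $\alpha = 1$), that the mean connectivity is $k = 2n/c$ because every edge touches two nodes, and that $c > c_{max}$.

```latex
\begin{align*}
k &= \frac{2n}{c} = 2\alpha c, \qquad k_{max} = 2\alpha c_{max} \\
n_{protein} &= \frac{k_{max}\, c}{2}
  \quad \text{(edges available once connectivity saturates at } k_{max}\text{)} \\
n_{deficit} &= \alpha c^2 - \frac{k_{max}\, c}{2}
  = \frac{k_{max}}{2 c_{max}}\, c^2 - \frac{k_{max}}{2}\, c
  = \frac{k_{max}}{2}\,\frac{c}{c_{max}}\,(c - c_{max})
\end{align*}
```

The deficit vanishes at $c = c_{max}$ and grows quadratically in $c$ beyond it.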

These regulations must be encoded in the non-coding part of the genome, therefore

$N = \frac{\beta}{2}\,\frac{C}{C_{prok}}\,(C - C_{prok})$

N – non-coding DNA length

C – coding DNA length

C_prok – ceiling for prokaryotes (~10 Mb)

β – some coefficient

Observation: coding length vs non-coding

Minimum non-coding length needed for the «deficit» regulation

Hypothesis

Prokaryotes: <Non-coding length> = <Coding length> (little constant add-on, promoters, UTRs…)

15% ≈ 1/7

Eukaryotes: $N_{reg} = \frac{\beta}{2}\,\frac{C}{C_{maxprok}}\,(C - C_{maxprok}) \sim C^2$,

Cmaxprok ≈ 10Mb ≈

This is the amount necessary for regulation, but repeats, genome parasites, etc., might make a genome much bigger

This is only a hypothesis, but…

Prediction on the Nreg for human:

Nreg = 87 Mb = 3% of genome length

C = 48 Mb = 1.7%

Nreg+C = 4.7%
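The numbers on this slide can be checked with simple arithmetic. The sketch below also evaluates the quadratic formula for the human coding length; the symbol `beta` for the coefficient, its value of 1, and the ceiling of exactly 10 Mb are assumptions made for illustration, so the formula output is not expected to reproduce the 87 Mb exactly.

```python
def n_reg(c_mb, c_prok_mb=10.0, beta=1.0):
    """Minimum regulatory non-coding length (Mb) from the quadratic formula.

    beta = 1 and c_prok_mb = 10 are illustrative assumptions.
    """
    return beta / 2 * c_mb / c_prok_mb * (c_mb - c_prok_mb)

coding = 48.0          # Mb, from the slide
reg = 87.0             # Mb, from the slide
genome = reg / 0.03    # genome length implied by "87 Mb = 3%"

print(round(genome))                              # 2900
print(round(100 * coding / genome, 1))            # 1.7
print(round(100 * (coding + reg) / genome, 1))    # 4.7
print(round(n_reg(coding), 1))                    # 91.2 with beta = 1
```

Under these assumptions the formula gives about 91 Mb, so the slide's 87 Mb corresponds to a coefficient slightly below 1 or a slightly different ceiling.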

Thank you for your attention

Questions?
