

Physica A 388 (2009) 3809–3825

Contents lists available at ScienceDirect

Physica A

journal homepage: www.elsevier.com/locate/physa

Genetic code evolution as an initial driving force for molecular evolution

Dirson Jian Li ∗, Shengli Zhang

Department of Applied Physics, Xi’an Jiaotong University, Xi’an 710049, China

Article info

Article history:
Received 20 November 2008
Received in revised form 24 May 2009
Available online 11 June 2009

PACS:
87.10.-e
87.18.-h
87.23.Kg

Keywords:
Molecular evolution
Variation of amino acid frequency
Variation of genomic base composition
Genetic code evolution

Abstract

There is an intrinsic relationship between molecular evolution in the primordial period and the genomic or proteomic properties of contemporary species. Genomic data may help us understand the driving force of the evolution of life at the molecular level. In the absence of evidence, numerous problems in molecular evolution fell into a twilight zone of speculation and controversy in the past. Here we show that the delicate variation patterns of genomic base compositions and amino acid frequencies resulted from the genetic code evolution, which underlies molecular evolution. The theoretical results agree very well with the experimental observations, not only in the evolutionary trends of amino acid frequencies and genomic base compositions but also in many detailed features. Conversely, the genomic data of contemporary species can help us unravel the genetic code chronology and the amino acid chronology. Our results may shed light on the intrinsic mechanism of molecular evolution and of genetic code evolution.

© 2009 Elsevier B.V. All rights reserved.

1. Introduction

The driving force is a core problem in the theory of the evolution of life. A qualified mechanism for the driving force should explain the evolutionary trends of both molecular evolution and macroevolution. The driving force must be persistently effective from the primordial period to the present day. Based on some fundamental genomic properties, such as amino acid frequencies, genomic base compositions and protein length distributions, we may unravel a panoramic picture of the evolution of life. The discovery of the genetic code and the study of genetic code evolution provide a great opportunity to understand the mechanism of molecular evolution [1–5]. Rich information on primordial evolution has been stored in the present variation patterns of amino acid frequencies and genomic base compositions. We found that the genetic code evolution determined the evolution of amino acid frequencies and genomic base compositions. So the genetic code evolution was the initial driving force in molecular evolution. Conversely, a thorough study of the composition variations of proteins and DNAs can help us unravel the genetic code evolution in detail.

The experimental observations in this paper are based on the biological data of contemporary species. The amino acid frequencies are obtained from two databases: (i) 106 proteomes (85 eubacteria, 12 archaebacteria, 7 eukaryotes and 2 viruses) in the Prediction of Entire Proteomes (PEP) [6], and (ii) the genomes of 803 microbes in the NCBI database. Two sets of experimental observations, based on PEP and on NCBI respectively, have been obtained in the paper, and their results agree with each other. So the properties of the variation of amino acid frequencies are universal rules, which are independent of the choice of sample species. The GC contents are obtained from the Genome Properties system [7]. These species are representatives of the three domains in studying the evolutionary trends of amino acid frequencies and genomic base compositions. Fig. 6a–d are plotted according to Fig. 9-1, Fig. 9-6, Fig. 9-8 and Fig. 9-7 in Ref. [8], respectively. The genetic code multiplicity in Fig. 11 is plotted based on Fig. 1 in Ref. [9]. The data on the gain–loss of amino acids in Table 1 are obtained from Ref. [10].

The organization of this paper is as follows. The experimental observations and theoretical results on the variations of amino acid frequencies and of base compositions are presented separately. Their relationships are presented next. Finally, a theoretical model based on genetic code evolution is described in detail.

∗ Corresponding author. E-mail address: [email protected] (D.J. Li).

0378-4371/$ – see front matter © 2009 Elsevier B.V. All rights reserved.
doi:10.1016/j.physa.2009.06.017

3810 D.J. Li, S. Zhang / Physica A 388 (2009) 3809–3825

Table 1
Linear incorrelation between variation trends of amino acid frequencies and gain–loss of amino acids in modern time.

Amino acid    Evol. trend (×10^−6)    Gain–loss (×10^−4)
G             −354                    −63
A             −765                    −239
D             −62.9                   −39
V             −245                    98
P             −292                    −139
S             128                     167
E             77.0                    −137
L             −61.0                   −17
T             −42.9                   91
R             −475                    38
Q             9.95                    20
I             590                     89
N             490                     73
H             −57.0                   73
K             734                     −65
C             −14.7                   67
F             214                     42
Y             187                     −5
M             −1.27                   88
W             −59.5                   2
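The relationship summarized in Table 1 can be checked numerically. The following is a minimal sketch that computes the Pearson correlation coefficient between the two rows of the table. Using Pearson's r is an assumption made here for illustration; the text does not state which statistic the authors used, and the function name `pearson_r` is hypothetical.

```python
import math

# Values copied from Table 1, in the chronological amino acid order of the paper.
AMINO_ACIDS = "G A D V P S E L T R Q I N H K C F Y M W".split()
EVOL_TREND = [-354, -765, -62.9, -245, -292, 128, 77.0, -61.0, -42.9, -475,
              9.95, 590, 490, -57.0, 734, -14.7, 214, 187, -1.27, -59.5]  # x 1e-6
GAIN_LOSS = [-63, -239, -39, 98, -139, 167, -137, -17, 91, 38,
             20, 89, 73, 73, -65, 67, 42, -5, 88, 2]  # x 1e-4


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences.

    Assumed statistic: the paper does not say how the correlation was assessed.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# The common scale factors (1e-6 and 1e-4) do not affect the coefficient.
r = pearson_r(EVOL_TREND, GAIN_LOSS)
print(f"Pearson r = {r:.3f}, r^2 = {r * r:.3f}")
```

The squared coefficient gives the fraction of variance in one row that a straight-line fit on the other row would explain, which is one way to judge how weak the linear relationship is.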

2. Variation of amino acid frequencies

2.1. Experimental observations

2.1.1. Choosing orders of species properly to observe variation trends of amino acid frequencies

The amino acid frequencies vary only slightly among species and were routinely assumed to be constant [11,12]. Consequently, studies on the variation of amino acid frequencies have been almost ignored in the past. However, it is easy to observe the variation trends of amino acid frequencies if we choose proper orders for species. In this section, several orders of species are introduced based on the amino acid chronology or similar criteria. We can thus obtain the variation trends of amino acid frequencies, which help reveal the mechanism of the variation of amino acid frequencies in evolution.

The chronology in which amino acids were recruited into the genetic code, from the earliest to the latest, can be estimated as: G, A, D, V, P, S, E, L, T, R, Q, I, N, H, K, C, F, Y, M, W [13]. Let a(i), i = 1, ..., 20 denote the 20 amino acids in this chronological order. According to this amino acid chronology, some amino acids such as G and V were recruited into the genetic code earlier than other amino acids such as H, Q and W. Let

$$f(a(i), \xi) = \frac{N(a(i), \xi)}{\sum_{j=1}^{20} N(a(j), \xi)}, \qquad \xi = 1, \ldots, n_s$$

denote the frequency of amino acid a(i) for the species in PEP (n_s = 106) or in NCBI (n_c = 803), where N(a(j), ξ) denotes the total number of amino acids a(j) in all the protein sequences of species ξ. If we sort the species properly, we can observe the variation trends of amino acid frequencies as f(a(i), ξ) varies with respect to ξ in a certain order.

We can introduce Late–early Ratio Orders to sort the species properly and thereby observe the variation trends of amino acid frequencies. We define R_{(a(i)...a(j))/(a(k)...a(l))}(ξ) as the ratio of the average of the amino acid frequencies f(a(i), ξ), ..., f(a(j), ξ) to the average of the amino acid frequencies f(a(k), ξ), ..., f(a(l), ξ), where a(i), ..., a(j) are some later recruited amino acids and a(k), ..., a(l) are some earlier recruited amino acids. In the class of Late–early Ratio Orders, we arrange the 106 species in PEP or the 803 species in NCBI from small to large by the values of R_{(a(i)...a(j))/(a(k)...a(l))}(ξ). By definition, a Late–early Ratio Order generally indicates the chronology in evolution. When the earlier and later recruited amino acids are specified concretely, we can define

$$R_{10/10}(\xi) = \frac{\sum_{i=11}^{20} f(a(i), \xi)}{\sum_{i=1}^{10} f(a(i), \xi)}$$

and obtain R10/10 order. Similarly, we obtain RHQW/GV order and R1/G order, where

$$R_{HQW/GV}(\xi) = \frac{\frac{1}{3} \sum_{a = H, Q, W} f(a, \xi)}{\frac{1}{2} \sum_{a = G, V} f(a, \xi)},$$

and

$$R_{1/G}(\xi) = \frac{1}{f(G, \xi)}.$$

The R10/10 order, RHQW/GV order and R1/G order are particular cases of Late–early Ratio Orders.

If we choose orders of species improperly, the variation trends of amino acid frequencies cannot be observed. An improper choice is the Random Ratio Order. In this order, we arrange the species by a ratio R_{(a(i)...a(j))/(a(k)...a(l))}(ξ) in which the amino acids in the numerator and denominator are chosen randomly among the 20 amino acids. For instance, the species can be arranged by the RAGHCN/LVQW order, where

$$R_{AGHCN/LVQW}(\xi) = \frac{\frac{1}{5} \sum_{a = A, G, H, C, N} f(a, \xi)}{\frac{1}{4} \sum_{a = L, V, Q, W} f(a, \xi)}.$$

With these improper orders, we cannot find obvious variation trends of amino acid frequencies, because f(a(i), ξ) varies randomly with ξ.
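The definitions in this subsection can be sketched in a few lines of code. The example below computes the frequencies f(a, ξ) from a list of protein sequences, evaluates the three Late–early ratios defined above, and sorts species by R10/10. The function names and the two toy proteomes are hypothetical and serve only as an illustration; this is not the authors' program.

```python
from collections import Counter

# Amino acid chronology from the text, earliest to latest recruited [13].
CHRONOLOGY = "GADVPSELTRQINHKCFYMW"


def amino_acid_frequencies(proteins):
    """Return f(a, xi) for one species, given its protein sequences."""
    counts = Counter()
    for seq in proteins:
        # Count only the 20 standard amino acids; other symbols are skipped.
        counts.update(ch for ch in seq.upper() if ch in CHRONOLOGY)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no standard amino acids found")
    return {a: counts[a] / total for a in CHRONOLOGY}


def ratio(freqs, numerator, denominator):
    """Average frequency of one amino acid group over that of another."""
    num = sum(freqs[a] for a in numerator) / len(numerator)
    den = sum(freqs[a] for a in denominator) / len(denominator)
    return num / den  # raises ZeroDivisionError if the denominator group is absent


def r_10_10(freqs):
    # Both groups have ten members, so the ratio of averages equals the ratio of sums.
    return ratio(freqs, CHRONOLOGY[10:], CHRONOLOGY[:10])


def r_hqw_gv(freqs):
    return ratio(freqs, "HQW", "GV")


def r_1_g(freqs):
    return 1.0 / freqs["G"]


# Hypothetical toy proteomes, far smaller than any real one.
species = {
    "toy_species_1": ["GAVGAD", "GGAVLP"],
    "toy_species_2": ["KNIFYG", "MKNWIA"],
}
freqs = {name: amino_acid_frequencies(p) for name, p in species.items()}

# R10/10 order: species arranged from small to large ratio.
order = sorted(freqs, key=lambda name: r_10_10(freqs[name]))
print(order)
```

A Random Ratio Order can be obtained from the same `ratio` helper by passing arbitrary groups, for example `ratio(f, "AGHCN", "LVQW")`.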

[Figure 1. Panels (a) and (b). y-axis: amino acid frequency, from 0 to 0.1, labelled "observation" in (a) and "simulation" in (b). x-axis: for each amino acid, species 1...106 in (a) and 1...30 in (b); amino acids in the order G, A, D, V, P, S, E, L, T, R, Q, I, N, H, K, C, F, Y, M, W.]

Fig. 1. Variation trends of amino acid frequencies. (a) Experimental observations of the variation trends of amino acid frequencies are based on the data of 106 species in PEP, which agree with the results based on the data of 803 species in NCBI (Fig. 2). Each column represents a variation trend of one of the 20 amino acids, where the 106 species are aligned from left to right by R10/10 order. The 20 amino acids are aligned chronologically from left to right. (b) Theoretical results of the variation trends of amino acid frequencies. The 30 simulated species are aligned by t order. The variation trends in the simulation fit the experimental observations for each amino acid.

We can also introduce the Lav order to observe the variation trends of amino acid frequencies roughly. In this order, the species are arranged from short to long by the average protein length Lav(ξ) of species ξ. By definition, the Lav order is independent of the choice of amino acids.

2.1.2. General variation trends of amino acid frequencies

When the species are sorted properly, we obtain a set of variation trends for the 20 amino acids. The variation trends are generally the same for different proper orders of species, and they are also the same whether based on the data of species in PEP or on the data of species in NCBI. So the general variation trends of amino acid frequencies are intrinsic properties of species.

When the species are sorted by the R10/10 order, the amino acid frequencies based on the 106 species in PEP are shown in Fig. 1a, and the result based on the 803 species in NCBI is shown in Fig. 2. The two results show the same variation trend for each of the 20 amino acids. Roughly speaking, the frequencies of G, A, D, V, P, L, T, R, H, W tend to decrease, the frequencies of S, E, I, N, K, F, Y tend to increase, and the frequencies of Q, C, M tend to stay constant. The magnitudes of the variations differ: the frequencies of G, A, V, P, R decrease more rapidly than those of D, L, T, H, W, while the frequencies of I, N, K, F, Y increase more rapidly than those of S, E. We found that the evolutionary trends of amino acids are related to the amino acid chronology [13]: most of the amino acids whose frequencies tend to decrease (or increase) are among the earlier (or later) recruited amino acids according to the amino acid chronology. We observe generally the same variation trends by the R10/10 order, RHQW/GV order, R1/G order and Lav order. The variation trends do not depend on the choice of amino acids used to define the order, because they are generally the same even by the Lav order, which is independent of the choice of amino acids. If the species are sorted improperly, we can only observe a completely random variation of amino acid frequencies.
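One simple way to quantify whether a frequency tends to decrease, increase or stay constant along an order of species is the least-squares slope of the frequency against the series number. The sketch below is an assumption made for illustration: the text does not state how the trends (or the "Evol. trend" values of Table 1) were computed, and the names `trend_slope`, `classify` and the tolerance value are hypothetical.

```python
def trend_slope(values):
    """Least-squares slope of values against their series number 1..n."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two species")
    mean_x = (n + 1) / 2
    mean_y = sum(values) / n
    sxy = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values, start=1))
    sxx = sum((i - mean_x) ** 2 for i in range(1, n + 1))
    return sxy / sxx


def classify(slope, tolerance):
    """Label a slope; `tolerance` is an arbitrary threshold chosen by the user."""
    if slope > tolerance:
        return "increase"
    if slope < -tolerance:
        return "decrease"
    return "constant"


# Hypothetical frequencies of three amino acids for five species already
# sorted by some proper order (for example the R10/10 order).
toy = {
    "G": [0.090, 0.085, 0.081, 0.076, 0.070],
    "K": [0.040, 0.046, 0.051, 0.058, 0.063],
    "Q": [0.0350, 0.0351, 0.0349, 0.0350, 0.0350],
}
labels = {a: classify(trend_slope(v), tolerance=1e-3) for a, v in toy.items()}
print(labels)
```

Comparing the absolute slopes of different amino acids gives a rough measure of which frequencies change more rapidly.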

2.1.3. Fine structures of the variation of amino acid frequencies

The variation of amino acid frequencies can be observed in detail when studying the 803 species in NCBI. We obtain 803 discrete dots (ξ, f(a(i), ξ)) for each amino acid. We then obtain smoothed lines through these dots by the Savitzky–Golay method [14]. The detailed structures in the variation of amino acid frequencies can be observed explicitly from the fluctuations in the smoothed lines.

We define the fine structures as the overall profiles of the variations of amino acid frequencies. The smoothed lines corresponding to fine structures are obtained by choosing larger spans in the Savitzky–Golay method (black lines in Fig. 2). These profiles are in general S-shaped or inverse S-shaped. For example, the profile of the 803 dots (ξ, f(V, ξ)) is S-shaped and the profile of the 803 dots (ξ, f(F, ξ)) is inverse S-shaped (see subplot V and subplot F in Fig. 2). Generally speaking, the variation trends are steeper at both ends. For the amino acids G, A, D, V, P, T, R with decreasing variation trends, the smoothed lines go down quickly at both ends; hence we obtain S-shaped profiles. For the amino acids S, I, N, K, F, Y with increasing variation trends, the smoothed lines go up quickly at both ends; hence we obtain inverse S-shaped profiles. For the other amino acids Q, C, M, L, H, W, E, we observe deformed S-shaped or wavy profiles. In particular, we found that the profiles are clearly S-shaped or inverse S-shaped for most of the amino acids whose average frequencies are high, while the profiles are deformed S-shaped for the amino acids whose average frequencies are low.

Fig. 2. Fine structures and superfine structures of the variation of amino acid frequencies in experimental observations. This is based on the data of 803 species in NCBI (celeste dots), which are aligned by R10/10 order for each amino acid. The S-shaped or inverse S-shaped profiles are the fine structures (smoothed lines: black, span = 401, degree = 3). The detailed fluctuations represent the superfine structures (smoothed lines: red, span = 201, degree = 7). For each subplot, the x-axis represents the species, the y-axis represents amino acid frequencies, and the smoothed line for the variation of amino acid frequency is obtained by the Savitzky–Golay method. These conventions are valid for other similar figures in this paper. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

2.1.4. Superfine structures of the variation of amino acid frequencies

We define the superfine structures as the fluctuations in the profiles of the variations of amino acid frequencies. The smoothed lines corresponding to superfine structures are obtained by choosing smaller spans in the Savitzky–Golay method (red lines in Fig. 2). The superfine structures exist against a background of stochastic fluctuations. We can observe some characteristic fluctuations in the smoothed lines. For example, we observe an obvious wavy fluctuation for amino acid W (for series numbers of species around 300 in subplot W in Fig. 2), a peak for amino acid S (for series numbers of species around 500 in subplot S in Fig. 2), and two peaks for amino acid M (for series numbers of species around 400 and 600 in subplot M in Fig. 2). Some obvious superfine structures can be observed both in the result based on NCBI (Fig. 2) and in the result based on PEP (Fig. 1a).
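The smoothing step used for both the fine and the superfine structures can be illustrated with a small, dependency-free sketch of a Savitzky–Golay smoother: a polynomial of a given degree is fitted by least squares inside a sliding window, and the fitted value at the window centre replaces the data point. The "span" and "degree" quoted in the caption of Fig. 2 presumably correspond to the window length and the polynomial degree. The function names and the synthetic data are hypothetical, and the edge handling (reusing the first and last window polynomials) is an assumption. In practice a library routine such as `scipy.signal.savgol_filter(values, window_length, polyorder)` would normally be used.

```python
import math


def _solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def _fit_poly(ys, degree):
    """Least-squares polynomial coefficients for ys at offsets scaled to [-1, 1]."""
    h = len(ys) // 2
    xs = [k / h for k in range(-h, h + 1)]
    # Normal equations (X^T X) c = X^T y for the monomial basis 1, x, x^2, ...
    a = [[sum(x ** (j + k) for x in xs) for k in range(degree + 1)]
         for j in range(degree + 1)]
    b = [sum(x ** j * y for x, y in zip(xs, ys)) for j in range(degree + 1)]
    return _solve(a, b)


def _eval_poly(coeffs, x):
    return sum(c * x ** j for j, c in enumerate(coeffs))


def savgol_smooth(values, span, degree):
    """Smooth `values` with an odd window `span` and polynomial `degree`."""
    n = len(values)
    if span % 2 == 0 or not 3 <= span <= n or not 0 <= degree < span:
        raise ValueError("span must be odd, 3 <= span <= len(values), degree < span")
    h = span // 2
    out = [0.0] * n
    for i in range(h, n - h):
        # The fitted value at the window centre (scaled offset 0) is coeffs[0].
        out[i] = _fit_poly(values[i - h:i + h + 1], degree)[0]
    # Edges (assumed handling): evaluate the first and last window polynomials.
    first = _fit_poly(values[:span], degree)
    last = _fit_poly(values[n - span:], degree)
    for i in range(h):
        out[i] = _eval_poly(first, (i - h) / h)
        out[n - 1 - i] = _eval_poly(last, (h - i) / h)
    return out


# Synthetic "dots": a slow decline plus a small oscillation.
dots = [0.08 - 0.0002 * i + 0.003 * math.sin(0.9 * i) for i in range(101)]
fine = savgol_smooth(dots, span=51, degree=3)       # larger span: overall profile
superfine = savgol_smooth(dots, span=11, degree=3)  # smaller span: keeps fluctuations
```

A larger span averages over more species and retains only the overall profile, whereas a smaller span follows local fluctuations more closely, which is the distinction the text draws between fine and superfine structures.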

2.1.5. Variation trends of amino acid frequencies for three domains

It is instructive to compare the variation trends of amino acid frequencies for the three domains (eubacteria, archaebacteria, and eukaryotes). The variation trends of all 20 amino acid frequencies are the same for the three domains (Fig. 4). But the ranges of amino acid frequencies differ among the three domains. Roughly speaking, the ranges of the variations of amino acid frequencies for eubacteria coincide with those for archaebacteria. However, there are obvious differences between the ranges for archaebacteria and eukaryotes (Fig. 4). For example, in the case of amino acid R, the amino acid frequency at the left end for eukaryotes is obviously less than the amino acid frequency at the left end for eubacteria or archaebacteria, whereas in the case of amino acid N the value for eukaryotes is obviously greater than the values for eubacteria or archaebacteria (Fig. 4).

2.2. Theoretical results

2.2.1. General variation trends of amino acid frequencies

The variation trends of amino acid frequencies simulated by the model (Figs. 1b and 3) agree very well with the variation trends in the experimental observations (Figs. 1a and 2). The mechanism of the variation of amino acid frequencies is based faithfully on the genetic code multiplicity. The variation trends and magnitudes of amino acid frequencies are crucially related to the placements of the amino acids in the genetic code multiplicity (Fig. 11). For example, the amino acids F and Y occupy similar positions in the genetic code multiplicity, so their evolutionary trends and magnitudes agree with each other. Similar results hold for the amino acids G and A, for P and H, for R, D and E, and for L and S (Figs. 1, 2 and 11). Hence, the genetic code evolution underlies the variation of amino acid frequencies.

The magnitudes of the variations of amino acid frequencies in the theoretical results are about half of those in the corresponding experimental observations. We did not add any further parameters to the model to boost the magnitudes of variation and thus obtain better agreement, because we insisted that the model be based faithfully on the genetic code multiplicity.

2.2.2. Fine structures of the variation of amino acid frequencies

Fine structures of the variation of amino acid frequencies clearly deviate from linear relationships in the experimental observations (black lines in Fig. 2). According to the simulation, the fine structures result from the fact that the sample density of species in the available database is not constant. The number of species in NCBI varies with average protein length along a bell-shaped outline (Fig. 9c), which means that the sample density of species in the available database is also bell-shaped. The number of species with moderate R10/10 is greater than the number of species with larger or smaller R10/10.

If we choose a linear relationship between the series number of species Ns and the parameter t, namely t = t1(Ns), without considering the variation of sample density (Fig. 9a), the profiles of the variation trends of amino acid frequencies in the theoretical results are generally straight for all the amino acids, which disagrees with the experimental observations. In order to obtain better simulations, we introduced a non-linear relationship t = t3(Ns) (Fig. 9a) so as to obtain a reasonable sample density (Fig. 9b). We thus obtained the fine structures of the variations of amino acid frequencies (black lines in Fig. 3), which generally agree with the experimental observations in Fig. 2. If we choose an improper function t = t2(Ns), which corresponds to an unreasonable sample density, the simulated fine structures disagree with the experimental observations. Therefore, the S-shaped or inverse S-shaped profiles are simply a consequence of the sample density of species in the available databases.
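The argument that a bell-shaped sample density alone can turn a straight trend into an S-shaped profile can be illustrated with a short sketch. A frequency that changes linearly with the parameter t is plotted against the series number Ns for two mappings: a linear one (uniform density) and one in which the sorted species sit at the quantiles of a normal distribution (bell-shaped density). The normal quantile mapping, the parameter values and the function names are assumptions made here for illustration; the functions t1, t2 and t3 used in the paper are those of its Fig. 9a and are not reproduced.

```python
from statistics import NormalDist


def t_linear(n):
    """Uniform sample density: t grows linearly with the series number."""
    return [(k + 0.5) / n for k in range(n)]


def t_bell(n, sigma=0.15):
    """Bell-shaped sample density (assumed normal, centred at t = 0.5).

    The k-th sorted species is placed at the k-th quantile of the density,
    so many species crowd around t = 0.5 and few lie near the ends.
    """
    dist = NormalDist(mu=0.5, sigma=sigma)
    return [dist.inv_cdf((k + 0.5) / n) for k in range(n)]


def frequency(t, f_start=0.09, f_end=0.05):
    """A hypothetical frequency that decreases linearly with t."""
    return f_start + (f_end - f_start) * t


def step(profile, k):
    """Absolute change of the profile between neighbouring series numbers."""
    return abs(profile[k + 1] - profile[k])


n = 803
profile_linear = [frequency(t) for t in t_linear(n)]
profile_bell = [frequency(t) for t in t_bell(n)]

print("linear density: end step %.2e, middle step %.2e"
      % (step(profile_linear, 0), step(profile_linear, n // 2)))
print("bell density:   end step %.2e, middle step %.2e"
      % (step(profile_bell, 0), step(profile_bell, n // 2)))
```

With the bell-shaped density the profile changes quickly at both ends and slowly in the middle, even though the underlying dependence on t is a straight line, which is the qualitative effect described in the text.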

2.2.3. Superfine structures of the variation of amino acid frequencies

We can observe the superfine structures of the variation of amino acid frequencies in the theoretical results (red lines in Fig. 3). In the simulation, the amino acid frequencies are calculated from a sufficiently large number of simulated protein sequences so as to avoid random errors. All the fluctuations in Fig. 3 recur in detail if we run the simulation program again. Therefore, the fluctuations of these smoothed lines are intrinsic properties of the variation of amino acid frequencies, which are determined by the genetic code multiplicity according to the mechanism of the model.

The simulation results indicate that the superfine structures in the experimental observations resulted from the genetic code multiplicity. Some characteristics of the superfine structures, such as the obvious wavy fluctuation for amino acid W and the double peak for amino acid M, can be observed in the simulation results.

2.2.4. Variation trends of amino acid frequencies for three domains

According to the phylogeny of the three domains, Eukarya and Archaea separated later in evolution, while Bacteria and the common ancestor of Eukarya and Archaea separated earlier [15]. Hence, we can explain the differences in the variation of amino acid frequencies among the three domains in the experimental observations in terms of this phylogeny. The unicellular organisms appeared earlier and eukaryotes appeared later in history. First, we input the initial values iAAF_NCBI (Table 2) to simulate the evolution of amino acid frequencies for Bacteria and Archaea. Then, we input the medium amino acid frequencies iAAF_medium from the first step as initial amino acid frequencies to simulate the evolution of amino acid frequencies for Eukarya. The simulation results (Fig. 5) agree in general with the experimental observations (Fig. 4). We found that the initial frequencies for Eukaryotes are generally less (greater) than the initial frequencies for Archaea and Eubacteria for the amino acids with decreasing (increasing) variation trends.


Fig. 3. Fine structures and superfine structures of the variation of amino acid frequencies in theoretical results. This is based on the data of nm = 40 species (celeste dots) in the simulation. The calculation of amino acid frequencies for each simulated species is based on np = 400,000 protein sequences generated by the model. The species are aligned by t order in the x-axis for each amino acid. The S-shaped or inverse S-shaped profiles are the fine structures (smoothed lines: black, span = 31, degree = 3). The detailed fluctuations represent the superfine structures (smoothed lines: red, span = 7, degree = 3). The theoretical results agree with experimental observations (Fig. 2) in general. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

3. Variation of genetic code compositions

3.1. Experimental observations

It is well known that base compositions in genomes vary greatly, a phenomenon often referred to as GC pressure or GA pressure in molecular evolution. There are delicate variation patterns of genomic base compositions in contemporary species, which are discussed in detail in Ref. [8].

Precise correlations between genomic GC content (or genomic GA content) and the GC content (or GA content) at the first, second, or third codon positions can be clearly observed (Fig. 6a, b) [16,8]. There are also correlations between codon position GC content and codon position GA content (Fig. 6c), and between genomic GC content and genomic GA content (Fig. 6d) [8]. Moreover, within each genome there are correlations between the GC content of genes and the codon position GC content of genes; hence we can obtain three slopes of the corresponding correlations at the first, second and third codon positions for a species. The slopes corresponding to the three codon positions each vary with genomic GC content among contemporary species (see Fig. 9-4 in Ref. [8]).
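The quantities compared in these correlations can be computed directly from a coding sequence. The following minimal sketch evaluates the overall GC and GA content and the corresponding contents at the first, second and third codon positions. The function names and the short example sequence are hypothetical, and the sequence is assumed to be in frame with a length that is a multiple of three.

```python
def base_content(seq, bases):
    """Fraction of positions in `seq` occupied by any of the given bases."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return sum(seq.count(b) for b in bases) / len(seq)


def codon_position_content(cds, bases):
    """Return the content of `bases` at codon positions 1, 2 and 3."""
    cds = cds.upper()
    if len(cds) == 0 or len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a positive multiple of 3")
    # cds[pos::3] collects every third base starting at codon position pos + 1.
    return tuple(base_content(cds[pos::3], bases) for pos in range(3))


# Hypothetical four-codon example: ATG GGC GCC TGA.
cds = "ATGGGCGCCTGA"
gc_total = base_content(cds, "GC")
gc1, gc2, gc3 = codon_position_content(cds, "GC")
ga_total = base_content(cds, "GA")
ga1, ga2, ga3 = codon_position_content(cds, "GA")
print(gc_total, (gc1, gc2, gc3))
print(ga_total, (ga1, ga2, ga3))
```

Applying these functions to every gene of a genome, and then to many genomes, would give the points from which correlations of the kind shown in Fig. 6a–d could be drawn.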


Fig. 4. Variation of amino acid frequencies for three domains in experimental observations. The variation trends of amino acid frequencies are the same in general for the three domains, which are aligned from left to right for each amino acid (Eubacteria: green dots, Archaebacteria: blue squares, and Eukaryotes: red diamonds). The amino acid frequencies are about the same for Archaea and Eubacteria. However, there are obvious differences in amino acid frequencies between Archaea (Eubacteria) and eukaryotes (namely, there is an upward shift when increasing or a downward shift when decreasing for the eukaryotes).

3.2. Theoretical results

The mechanism of the evolution of genomic base compositions can also be explained by the same model, based on genetic code multiplicity and codon chronology. The simulated correlations between genomic base compositions and codon position base compositions (Fig. 6e–h) generally agree with the corresponding experimental observations (Fig. 6a–d). It is noteworthy that simulations and experimental observations agree in many details, which strongly supports the validity of our simulations.

In Fig. 6e, we observe a step in the middle of the line corresponding to the first codon position and a junction between the lines corresponding to the first and second codon positions; the step and the junction can also be observed in the plots based on biological data [16,8,17]. The slope of the line corresponding to the third codon position is the steepest, because G and C occupy all the third positions of the earliest codons for the 20 amino acids, and A and U occupy all the third positions of the latest codons for the 20 amino acids, whereas the base compositions at the first and second positions are approximately invariant (Tables 5 and 6). The lower and upper limits of the GC content also result from the base compositions at the codon positions. In Fig. 6f, the simulated slope corresponding to the third codon position is the greatest, which agrees with the experimental observation [8]. In Fig. 6g, the slopes and variation range in the simulation agree with the experimental observation [8]. And in Fig. 6h, the amplitude of the deviation from the central declining line is large, which agrees with the experimental observation (see Fig. 9-7 in Ref. [8]). Finally, the simulated correlation between genomic GC content and the three slopes of the correlations of GC content in genes agrees in principle with the experimental observations. Thus, we show that the delicate structure in the correlations of genomic base compositions mainly comes from genetic code multiplicity and chronology (Fig. 6i–l) [9,18].

Fig. 5. Variation of amino acid frequencies for three domains in theoretical results. The theoretical results agree with the experimental observations (Fig. 4) in general. The evolutionary trends of amino acid frequencies are the same for the three domains in the simulation. The amino acid frequencies are the same for Archaea and Eubacteria, while there are also upward or downward shifts of amino acid frequencies between Archaea (Eubacteria) and eukaryotes in the simulation. According to the phylogeny of the three domains, Eukarya appeared latest in evolution. Therefore, we first input the initial amino acid frequencies iAAF_NCBI to simulate the evolution of amino acid frequencies for both Eubacteria (green dots) and Archaea (blue squares). Then, we input iAAF_medium (the values of the amino acid frequencies at medium time in the above simulation) as initial amino acid frequencies to simulate the evolution of amino acid frequencies for Eukaryotes (red diamonds).

4. Relationships among amino acid frequencies, genomic base compositions and average protein lengths

4.1. Experimental observations

4.1.1. Relationships between amino acid frequencies and genomic base compositions

There is an explicit relationship between amino acid frequencies and genomic base compositions. We observe that the genomic GC content decreases linearly with the ratio R10/10 (Fig. 7a) [19,20]. Similar results also hold for other Late–early Ratio Orders. So the genomic GC content and the amino acid frequencies are not independent variables when we discuss the evolutionary pressure in molecular evolution. It is not wise to take genomic GC content as the initial evolutionary pressure, since we can explain the relationships among amino acid frequencies, genomic base compositions and average protein lengths in a unified theoretical framework.

D.J. Li, S. Zhang / Physica A 388 (2009) 3809–3825 3817


Fig. 6. Correlations of genomic GC content or GA content and codon position GC content or GA content. The results for the 1st, 2nd, and 3rd codon positions are represented by blue circles, green squares and red triangles respectively. (a)–(d) Correlations of base compositions in codon positions and genomic base compositions in experimental observations; (e)–(h) Correlations of base compositions in codon positions and genomic base compositions in the theoretical results. Simulations by the model are based on genetic code multiplicity, and agree with the above experimental observations both in evolutionary trends and in some detailed characters; (i)–(l) Correlations of the number of bases in codon positions and the total number of bases, based on Table 6, which mainly determine the results in the corresponding simulations.

4.1.2. Relationships between amino acid frequencies and average protein lengths

There is a non-linear correlation between the average protein length l̄ and the ratio RHQW/GV (Fig. 8). The distribution of all species forms a bowed line in the l̄ − RHQW/GV plane, and closely related species cluster together in this plane (Fig. 8). Similar results are also valid for other Late–early Ratio Orders. We choose RHQW/GV so that the distribution of archaebacteria can be separated from the distribution of bacteria. This bowed distribution can be interpreted as an evolutionary flow. We also noticed that the more advanced prokaryotes, which have larger genome sizes, are located in the midstream of this evolutionary flow.
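To make the two ratios concrete, the following minimal sketch evaluates them for the iAAFPEP frequencies listed in Table 2. It rests on two assumptions that are not spelled out in this section: that R10/10 is the summed frequency of the ten latest amino acids divided by that of the ten earliest in the chronology G A D V P S E L T R Q I N H K C F Y M W (the order of the first line of Table 4), and that RHQW/GV = (fH + fQ + fW)/(fG + fV). The names `IAAF_PEP`, `CHRONOLOGY`, `ratio` and `late_early_ratio` are illustrative only.

```python
# Amino acid frequencies (x 10^-4) from the iAAF_PEP row of Table 2.
IAAF_PEP = {
    "G": 692, "A": 831, "D": 524, "V": 674, "P": 489,
    "S": 703, "E": 647, "L": 991, "T": 537, "R": 559,
    "Q": 394, "I": 577, "N": 400, "H": 226, "K": 555,
    "C": 141, "F": 404, "Y": 299, "M": 238, "W": 119,
}

# Assumed chronological order (earliest first), as in the first line of Table 4.
CHRONOLOGY = "GADVPSELTRQINHKCFYMW"


def ratio(freq, numerator, denominator):
    """Summed frequency of the numerator residues over that of the denominator residues."""
    return sum(freq[a] for a in numerator) / sum(freq[a] for a in denominator)


def late_early_ratio(freq, n_late=10, n_early=10):
    """Assumed late-early ratio: the n_late latest over the n_early earliest amino acids."""
    return ratio(freq, CHRONOLOGY[-n_late:], CHRONOLOGY[:n_early])


r_10_10 = late_early_ratio(IAAF_PEP)      # 3353 / 6647
r_hqw_gv = ratio(IAAF_PEP, "HQW", "GV")   # 739 / 1366
print(round(r_10_10, 3), round(r_hqw_gv, 3))
```

Other Late–early Ratio Orders could be explored in the same way by changing `n_late` and `n_early`, under the same assumed definition.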

4.2. Theoretical results

4.2.1. Relationships between amino acid frequencies and genomic base compositions

According to the simulation, we found that the genetic code multiplicity can influence both amino acid frequencies and genomic GC content. So the evolutionary pressure in the overall molecular evolution originated in the genetic code evolution. The evolutionary pressure influences the amino acid frequencies, genomic base composition and the average protein length all together.

When the parameter t is fixed, both protein sequences and the corresponding DNA sequences can be generated together by the model. Subsequently, both amino acid frequencies and genomic GC content can be calculated. Thus we can explain that the ratio R10/10 increases when the genomic GC content decreases (Fig. 7a). When the parameter t increases, more and more late amino acids join the protein sequences, so the ratio R10/10 will also increase, whereas the genomic GC content will decrease according to the codon chronology (Table 5). Thus we can explain that the ratio R10/10 increases


Fig. 7. The relationship between the genomic GC content and the variation of amino acid frequencies (axes: genomic GC content and R10/10; legend: Eubacteria, Archaebacteria, Eukaryote). (a) The GC content declines with respect to the ratio R10/10 according to the biological data. (b) The simulation agrees with the experimental observation in the variation trend between GC content and R10/10.

Fig. 8. The relationship between the average protein length and the ratio RHQW/GV. The species in the three domains cluster together in three areas respectively. The genome sizes of species are represented by tails below the corresponding dots (larger genome size: long red tail; medium genome size: medium green tail; small genome size: short blue tail). Inset: the simulation of the relationship between average protein length and the ratio RHQW/GV, especially the bending direction, agrees with experimental observations.

when the genomic GC content decreases (Fig. 7b). The variation of amino acid frequencies can influence the slopes for the first and second codon positions in Fig. 6a and e.

4.2.2. Relationships between amino acid frequencies and average protein lengths

The nonlinear relationship between amino acid frequencies and average protein length can be explained by the model. The bending direction of the distribution of species in the l̄ − RHQW/GV plane can be simulated by our model, which agrees with the experimental observation (Fig. 8). The bending direction is also the same for the distribution of species in the l̄ − R10/10 plane. According to the simulation by the model, the bending direction is sensitively related to the genetic code multiplicity. By varying the substitution rules a little, the bending curve tends to be straight in the simulation. Such a detailed agreement between theoretical result and experimental observation verifies the validity of the referred genetic code multiplicity in the model (Fig. 11 and Ref. [9]).


Table 2. The amino acid frequencies as initial values in the simulations of variation of amino acid frequencies.

                       fG    fA    fD    fV    fP    fS    fE    fL    fT    fR    fQ    fI    fN    fH    fK    fC    fF    fY    fM    fW
iAAFPEP (×10−4)       692   831   524   674   489   703   647   991   537   559   394   577   400   226   555   141   404   299   238   119
iAAFNCBI (×10−4)      721   906   536   698   427   607   624  1025   527   537   366   672   413   204   560    97   414   312   238   116
iAAFmedium (×10−4)    639   804   516   619   410   622   597  1051   464   520   365   811   421   195   721    84   464   348   237   114

5. Genetic code evolution as an initial driving force for molecular evolution

5.1. Timing the mechanism of the variation of amino acid frequencies

There is evidence that the variation of amino acid frequencies formed before the stage when the three domains began to branch. The variation of amino acid frequencies for the three domains may help us time the mechanism of the variation of amino acid frequencies. The evolutionary trends of amino acid frequencies are the same for the three domains (Fig. 4). And we have explained the variation patterns of amino acid frequencies for the three domains according to the phylogeny of the three domains in Section 2.2.4. These results indicate that the mechanism of the variation of amino acid frequencies should originate in the period before the separation between Bacteria and the ancestor of Archaea and Eukarya.

We also show that the variation of amino acid frequencies is largely unrelated to the gain–loss of amino acids in modern times. The trend of amino acid gain and loss in protein evolution has been reported in Refs. [10,21]. There is a set of data of gain–loss for 20 amino acids in Ref. [10] according to the protein evolution in modern times (Table 1). We can also obtain a set of data of variation trends for 20 amino acids according to Fig. 1 by least squares (Table 1). The correlation coefficient between gain–loss and variation trends is 0.393, which indicates that there is only a weak correlation between the variation trends and the gain and loss of amino acids in modern times. Therefore, it is reasonable to suppose that the mechanism of the variation of amino acid frequencies appeared in an early stage of evolution.
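The two numerical steps mentioned above, a least-squares trend and a correlation coefficient, can be sketched as follows. The data of Table 1 are not reproduced in this section, so the functions are shown on their own, and treating the reported coefficient as the ordinary Pearson coefficient is an assumption. The names `least_squares_slope` and `pearson_r` are illustrative.

```python
from math import sqrt


def least_squares_slope(x, y):
    """Slope of the ordinary least-squares line y = a + b*x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx


def pearson_r(u, v):
    """Pearson correlation coefficient between two equally long samples."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    suv = sum((ui - mean_u) * (vi - mean_v) for ui, vi in zip(u, v))
    suu = sum((ui - mean_u) ** 2 for ui in u)
    svv = sum((vi - mean_v) ** 2 for vi in v)
    return suv / sqrt(suu * svv)
```

In this reading, `least_squares_slope` would be applied once per amino acid to obtain its variation trend, and `pearson_r` would then compare the 20 trends with the 20 gain–loss values of Ref. [10].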

5.2. Initial driving force for molecular evolution

All the theoretical results in Sections 2–4 are obtained by a model based on the genetic code multiplicity and chronology. So the variation patterns of amino acid frequencies and genomic base compositions can be explained in a unified theoretical framework. It can be inferred that (i) the pattern of the variation of the compositions of protein and DNA was fixed in the period when the genetic code evolved; (ii) the magnitudes of the evolutionary trends have been amplifying ever since the genetic code was established. Thus, we conjecture that the genetic code evolution is an initial driving force for molecular evolution.

5.3. Cracking the genetic code evolution

If our conjecture is reasonable, we have found a new method to crack the genetic code evolution in primordial time by studying the variation of amino acid frequencies and genomic base compositions of contemporary species. Some detailed disagreements remain between theoretical results and experimental observations. For instance, it is hard at present to compare the superfine structures in detail between theoretical results and experimental observations. Improved understanding of the variations of amino acid frequencies and genomic base compositions may enable us to understand the genetic code evolution more explicitly.

6. The model

6.1. Outline of the model

We proposed a model based on the evolution of the genetic code to explain the variations of amino acid frequencies and base compositions. The model mainly consists of three parts: (i) generating void protein sequences by formal language; (ii) generating protein sequences based on genetic code multiplicity; and (iii) generating DNA sequences based on codon chronology.

The genetic code multiplicity (Fig. 11) is the core in simulation of variation of amino acid frequencies. The genetic code chronology (Table 5) is the core in simulation of variation of genomic base compositions. This is an elaborate model and there is only one adjustable parameter t, which indicates the time of the evolution and plays a central role in simulations of the variations of amino acid frequencies and base compositions.

6.2. Simulation of the variation of amino acid frequencies

6.2.1. Generation of void protein sequences by formal language

There are two steps in the generation of protein sequences in the model: (1) generating void protein sequences according to tree adjoining grammars in Fig. 10 [22]; (2) the leaf π in the tree adjoining grammar will be substituted by amino acids


Table 3. The genetic code multiplicity and substitution probability of amino acids.

π → π1 [p1 = pD + pR + pE + p6, p6 = pP + pH]
    π1 → R [pR/p1]
    π1 → E [pE/p1]
    π1 → D [pD/p1]
    π1 → π6 [p6/p1]
        π6 → H [pH/p6]
        π6 → P [pP/p6]
π → π2 [p2 = pF + pY]
    π2 → Y [pY/p2]
    π2 → F [pF/p2]
π → π3 [p3 = pI + pA + pG + p7, p7 = pN + pQ + pM + pW]
    π3 → π10 [pI/p3]
        π10 → π11 [1 − κ]
            π11 → π3 [1]
        π10 → I [κ]
    π3 → A [pA/p3]
    π3 → G [pG/p3]
    π3 → π7 [p7/p3]
        π7 → π8 [p8/p7, p8 = pN + pQ]
            π8 → N [pN/p8]
            π8 → Q [pQ/p8]
        π7 → π9 [p9/p7, p9 = pM + pW]
            π9 → W [pW/p9]
            π9 → M [pM/p9]
π → π4 [p4 = pV + pT + pK]
    π4 → T [pT/p4]
    π4 → K [pK/p4]
    π4 → V [pV/p4]
π → π5 [p5 = pS + pL]
    π5 → L [pL/p5]
    π5 → S [pS/p5]
π → C [pC]

based on the genetic code multiplicity in Ref. [9] (Fig. 11 and Table 3), where the amino acid chronology has been considered according to Ref. [13] and the probabilities for substitutions are determined by pa in Table 2.

There are no rigorous restrictions in choosing grammar rules of the formal language, because they do not essentially determine the variation trends of amino acid frequencies in our final results. In the model, we choose tree adjoining grammar to generate void protein sequences. There is one initial tree and two auxiliary trees in the grammar rules, where S and T are inner nodes and π is a leaf (Fig. 10). We set a parameter t in the model as the probability of the substitution of inner nodes in the adjoining grammar. The length of a sequence may increase if an inner node has been substituted by the corresponding auxiliary tree (with probability t) or keep constant if the node has not been substituted (with probability 1 − t). When t increases, there will be a greater probability to generate longer void protein sequences in each principal cycle.
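The effect of t on sequence length can be illustrated with a deliberately simplified stand-in for the grammar. The trees of Fig. 10 are not reproduced here, so the sketch below is not the tree adjoining grammar itself: it assumes a single inner node that, with probability t, is replaced by a hypothetical auxiliary tree contributing one extra leaf π and a fresh inner node, and otherwise stops growing. The function names and the `max_len` cap are illustrative.

```python
import random


def void_sequence_length(t, rng, max_len=10_000):
    """Length (number of pi leaves) of one toy void sequence.

    Toy rule (an assumption, not the grammar of Fig. 10): start with one leaf;
    with probability t an adjunction adds one more leaf and the process repeats,
    with probability 1 - t the sequence is finished.
    """
    length = 1
    while length < max_len and rng.random() < t:
        length += 1
    return length


def mean_length(t, n_samples=20_000, seed=0):
    """Average toy sequence length for a given substitution probability t."""
    rng = random.Random(seed)
    total = sum(void_sequence_length(t, rng) for _ in range(n_samples))
    return total / n_samples
```

Under this toy rule the expected length is 1/(1 − t), so larger t yields longer void sequences, which is the qualitative behaviour described in the text; the actual length distribution of the model depends on the real grammar rules.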

6.2.2. Fill in the protein sequences based on genetic code multiplicity

In the second step, the void character π in the void protein sequences will be substituted by 20 amino acids by calling a subprogram. The rules of substitutions in the subprogram are based rigorously on the genetic code multiplicity (Fig. 11, Table 3). Namely, π may be substituted by any of π1, π2, π3, π4, π5 or amino acid C at the first level of substitution; π1 may be substituted by any of R, E, D or π6 at the second level of substitution, and so on; π6 may be substituted by either H or P at the third level of substitution, and so on (Fig. 11, Table 3). The depth of substitutions of the genetic code multiplicity tree is four levels at most, namely from π to N, Q, W or M (Fig. 11, Table 3). The process of substitution of π will not finish until π is eventually substituted by one of the 20 amino acids. Thus, the protein sequences consisting of amino acids have been generated.

The probabilities for all the possible substitutions of each node in the genetic code multiplicity tree (Fig. 11, Table 3) are constant parameters in the model, set according to the average amino acid frequencies of contemporary species. The initial amino acid frequencies (iAAF) at the beginning of each cycle in the program are given by the average amino acid frequencies of species in PEP (ns = 106) or NCBI (ns = 803) (Table 2):

$$
f_{a(i)} = \frac{\sum_{\xi=1}^{n_s} N(a(i), \xi)}{\sum_{j=1}^{20} \sum_{\xi=1}^{n_s} N(a(j), \xi)}.
$$
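As a rough illustration of this pooled average, the sketch below assumes that N(a(i), ξ) denotes the number of occurrences of amino acid a(i) in the proteome of species ξ (the symbol is not defined in this section). The input format, a list of protein strings per species, and the name `pooled_frequencies` are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "GADVPSELTRQINHKCFYMW"


def pooled_frequencies(proteomes):
    """Pooled amino acid frequencies over all species.

    `proteomes` is an iterable with one entry per species; each entry is an
    iterable of protein sequences written in one-letter code. Counts are summed
    over species before normalising, as in the formula above.
    """
    counts = Counter()
    for proteins in proteomes:
        for protein in proteins:
            counts.update(ch for ch in protein if ch in AMINO_ACIDS)
    total = sum(counts.values())
    return {a: counts[a] / total for a in AMINO_ACIDS}


# Two tiny made-up "species", for illustration only.
toy = [["GA", "G"], ["AAV"]]
freq = pooled_frequencies(toy)
```

Because counts are pooled before normalising, species with larger proteomes contribute more to the average than small ones.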

We choose the average amino acid frequencies as the initial amino acid frequencies in order that the average amino acid frequencies in the simulation results agree with the average amino acid frequencies in observations. But these constant parameters do not influence the variation trends in the simulations. The corresponding probabilities for 20 amino acids can be calculated as follows:

$$
p_{a(i)} = \frac{f_{a(i)}}{\sum_{j=1}^{20} f_{a(j)} + f_{\mathrm{term}}},
$$

where fterm = fI (namely, π10 will be substituted by π11 and by I with equal probability in Fig. 11). The probabilities (the numbers in the brackets in Table 3) of substitutions between pairs of adjoining nodes in the genetic code multiplicity tree


Fig. 9. Explanation of the fine structure of the variation of amino acid frequencies. (a) The functions between t and Ns. The range of parameter t is from tmin = 0.05 to tmax = 0.3. The series numbers for the simulated species are Ns = 1, 2, ..., nm, with nm = 40. The functions for the three curves are, respectively: (1) $t_1(N_s) = t_{\max} - (N_s - 1)(t_{\max} - t_{\min})/(n_m - 1)$; (2) $t_2(N_s) = t_{\max} - (t_{\max} - t_{\min})\sqrt{1 - (N_s - n_m)^2/(n_m - 1)^2}$; and (3) $t_3(N_s) = (t_{\max} + t_{\min})/2 - 0.000017\,(N_s - (n_m + 1)/2)^3$. (b) The distribution of numbers of Ns with respect to the parameter t (axes: range of parameter t, in bins of width 0.05 from 0–0.05 to 0.3–0.35, and number of parameters Ns). (c) The distribution of 803 species in NCBI with respect to average protein length.

Fig. 10. The tree adjoining grammar rules in the model. There are one initial tree and two auxiliary trees in the grammar. π in the trees are leaves, which will be replaced by amino acids according to Table 3.

(Fig. 11) can be calculated by the initial amino acid frequencies (Table 2). For instance, in the substitution of π6 by H, the probability is pH/(pH + pP); while the probability of substitution of π by π2 is p2 = pF + pY (Table 3).
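The substitution of π by an amino acid can be sketched as a random walk down the tree of Table 3. The sketch below uses the iAAFPEP frequencies of Table 2 and κ = 1/2, as stated above. Two points are assumptions rather than facts taken from the paper: the weights at each node are renormalised to sum to one (in particular at the root, where p1 + ... + p5 + pC is slightly below one because of fterm), and the relative weight of the branch π7 → π9 is taken as p9/p7. All identifiers are illustrative.

```python
import random

# iAAF_PEP (x 10^-4) from Table 2.
F = {
    "G": 692, "A": 831, "D": 524, "V": 674, "P": 489,
    "S": 703, "E": 647, "L": 991, "T": 537, "R": 559,
    "Q": 394, "I": 577, "N": 400, "H": 226, "K": 555,
    "C": 141, "F": 404, "Y": 299, "M": 238, "W": 119,
}
F_TERM = F["I"]                           # f_term = f_I, as stated in the text
TOTAL = sum(F.values()) + F_TERM
P = {a: f / TOTAL for a, f in F.items()}  # p_a(i)
KAPPA = 0.5                               # pi10 -> I and pi10 -> pi11 equally likely


def psum(letters):
    """Sum of p_a over a group of amino acids, e.g. p6 = pP + pH."""
    return sum(P[a] for a in letters)


# Children of each inner node with unnormalised weights (Table 3);
# random.choices rescales the weights at every node (assumption).
TREE = {
    "pi":   [("pi1", psum("DREPH")), ("pi2", psum("FY")), ("pi3", psum("IAGNQMW")),
             ("pi4", psum("VTK")), ("pi5", psum("SL")), ("C", P["C"])],
    "pi1":  [("R", P["R"]), ("E", P["E"]), ("D", P["D"]), ("pi6", psum("PH"))],
    "pi6":  [("H", P["H"]), ("P", P["P"])],
    "pi2":  [("Y", P["Y"]), ("F", P["F"])],
    "pi3":  [("pi10", P["I"]), ("A", P["A"]), ("G", P["G"]), ("pi7", psum("NQMW"))],
    "pi10": [("I", KAPPA), ("pi11", 1 - KAPPA)],
    "pi11": [("pi3", 1.0)],
    "pi7":  [("pi8", psum("NQ")), ("pi9", psum("MW"))],
    "pi8":  [("N", P["N"]), ("Q", P["Q"])],
    "pi9":  [("W", P["W"]), ("M", P["M"])],
    "pi4":  [("T", P["T"]), ("K", P["K"]), ("V", P["V"])],
    "pi5":  [("L", P["L"]), ("S", P["S"])],
}


def substitute(rng):
    """Replace one void character pi; return (amino acid, number of substitution steps)."""
    node, steps = "pi", 0
    while node in TREE:
        children, weights = zip(*TREE[node])
        node = rng.choices(children, weights=weights)[0]
        steps += 1
    return node, steps


rng = random.Random(1)
samples = [substitute(rng) for _ in range(2000)]
prob_h_given_pi6 = P["H"] / (P["H"] + P["P"])   # the example quoted in the text
```

A void sequence of n characters would then be filled by calling `substitute` n times. How the model couples this step to the parameter t is not captured by this sketch.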

6.2.3. Calculation of amino acid frequencies and their variation trends

When the parameter t is fixed, we can generate sufficiently many protein sequences so that the amino acid frequency for each of the 20 amino acids converges to a constant value. The amino acid frequencies then vary with parameter t.

We can study the evolution of amino acid frequencies by adjusting the parameter t from tmin to tmax. We observed that the amino acid frequency for each amino acid may increase or decrease when t increases. For the initial parameter tmin, we input pa(i) as initial values of amino acid frequencies. We chose nm sample species Ns = 1, 2, ..., nm and obtained nm sets of amino acid frequencies by the model. Hence, we can obtain the variation trends of amino acid frequencies in the simulation.


Fig. 11. The genetic code multiplicity tree (root π, inner nodes π1–π11, and the 20 amino acids as leaves). This tree is based faithfully on the genetic code multiplicity in Fig. 1 in Ref. [9]. The substitutions from π to amino acids in the model are based on this genetic code multiplicity tree. The probabilities for substitutions between adjoining nodes in this tree are shown in Table 3.

We found that the variation trends of amino acid frequencies are determined by the genetic code multiplicity and are nearly independent of the initial values of the amino acid frequencies.
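The assignment of a value of t to each simulated species Ns follows the three functions given in the caption of Fig. 9. A minimal sketch of these functions, using the constants quoted there (tmin = 0.05, tmax = 0.3, nm = 40), is given below; the Python names are illustrative.

```python
from math import sqrt

T_MIN, T_MAX, N_M = 0.05, 0.3, 40   # values quoted in the caption of Fig. 9


def t1(ns):
    """Linear decrease from T_MAX (ns = 1) to T_MIN (ns = N_M)."""
    return T_MAX - (ns - 1) * (T_MAX - T_MIN) / (N_M - 1)


def t2(ns):
    """Elliptic-arc decrease between the same two end points."""
    return T_MAX - (T_MAX - T_MIN) * sqrt(1 - (ns - N_M) ** 2 / (N_M - 1) ** 2)


def t3(ns):
    """Cubic curve centred on the middle of the species range."""
    return (T_MAX + T_MIN) / 2 - 0.000017 * (ns - (N_M + 1) / 2) ** 3


schedule = [(ns, t1(ns), t2(ns), t3(ns)) for ns in range(1, N_M + 1)]
```

All three curves decrease as Ns runs from 1 to nm; they differ in how densely the sampled values of t cover the interval, which is what panel (b) of Fig. 9 summarises.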

6.2.4. The mechanism of the variation of amino acid frequencies in the simulation

According to the simulation by the model, the genetic code multiplicity plays a central role in the evolution of amino acid frequencies. It takes more steps of substitution to replace π in the void protein sequences by a late recruited amino acid. The placements of amino acids in the genetic code multiplicity tree essentially influence the variation of amino acid frequencies in the simulation. When we generate shorter protein sequences, the late recruited amino acids have less opportunity to join the protein sequences. When the parameter t increases, the increment of protein length in one principal cycle also becomes greater than before. Hence, there will be more opportunities for the late recruited amino acids to join the protein sequences. So the amino acid frequencies will vary with the parameter t. The only continuous variable t in the model can be interpreted as the time in the evolution. The other constant parameters pa(i) and the grammar rules do not essentially influence the variation trends. And there are no other parameters in the model to deliberately influence certain amino acid frequencies or genomic base compositions.
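The number of substitution steps needed to reach each amino acid can be read directly off Table 3. The sketch below stores the parent of every node and counts the steps back to the root π. It gives the minimum number of steps only, since the optional loop π10 → π11 → π3 is ignored; the names `PARENT`, `depth` and `DEPTH` are illustrative.

```python
# Parent of every node in the multiplicity tree, transcribed from Table 3
# (the loop node pi11 is left out, so depths are minimum step counts).
PARENT = {
    "pi1": "pi", "pi2": "pi", "pi3": "pi", "pi4": "pi", "pi5": "pi", "C": "pi",
    "R": "pi1", "E": "pi1", "D": "pi1", "pi6": "pi1",
    "H": "pi6", "P": "pi6",
    "Y": "pi2", "F": "pi2",
    "pi10": "pi3", "A": "pi3", "G": "pi3", "pi7": "pi3",
    "I": "pi10",
    "pi8": "pi7", "pi9": "pi7",
    "N": "pi8", "Q": "pi8", "W": "pi9", "M": "pi9",
    "T": "pi4", "K": "pi4", "V": "pi4",
    "L": "pi5", "S": "pi5",
}


def depth(node):
    """Minimum number of substitutions needed to go from pi to `node`."""
    steps = 0
    while node != "pi":
        node = PARENT[node]
        steps += 1
    return steps


DEPTH = {a: depth(a) for a in "GADVPSELTRQINHKCFYMW"}
deepest = sorted(a for a, d in DEPTH.items() if d == max(DEPTH.values()))
```

The maximum depth obtained this way is four, reached by N, Q, W and M, in line with the depth stated in Section 6.2.2.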

6.3. Simulation of the variation of genomic base compositions

6.3.1. Codon chronology

The codon chronology can be reconstructed based on the amino acid chronology and the primacy of thermostability and complementarity (Table 4) [5,13,18,23]. The result of codon chronology in Table 4 is almost the same as the chronology in Ref. [5], but the amino acid chronology in the first line of Table 4 in our calculation is replaced by a new amino acid chronology obtained by the same authors in Ref. [13], which was published later than Ref. [5]. In our results, the amino acids are sorted chronologically in the first line in Table 4. The 32 pairs of complementary codons are sorted chronologically from top to bottom in Table 4; they correspond to the above amino acid chronology and form a lower triangle in Table 4.

The codon chronology in Table 4 can explain the relationship between GC content and codon position GC content very well. But the declining relationship between genomic GC content and genomic GA content in experimental observations (Fig. 6d) cannot, in principle, be reproduced by the codon chronology in Table 4 because of the rigorous restriction of codon complementarity. We have to modify the chronology slightly, at the sacrifice of the codon complementarity, so that the relationship between genomic GC content and genomic GA content in experimental observation (Fig. 6d) can be simulated by the model. We only rearranged the codon chronology for amino acids S and R and obtained a modified codon chronology (Table 5). Namely, we moved the positions of codons AGC and AGU earlier in Table 5 than in Table 4 for amino acid S; and we moved the positions of codons AGG and AGA earlier in Table 5 than in Table 4 for amino acid R.

According to the codon chronology in Table 5, there are 32 stages in the evolution of genetic codes. For each stage, there is a definitive correspondence between a codon and a certain amino acid. For each stage, therefore, we can count the total numbers of bases G, C, A or T for all the amino acids in the first, second or third codon positions respectively according to Table 5 (Table 6). And we also obtained the total number of G and C and the total number of G and A in the first, second and third codon positions respectively, as well as the total numbers of G and C or G and A in all the three codon positions (Table 6). The results in Table 6 partially indicate the variation trends of genomic base compositions. And we can plot the relationships in Fig. 6i–l according to Table 6.
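The counting that leads from Table 5 to Table 6 can be sketched in a few lines. The example below transcribes only the first and the last stage of Table 5 and counts, for each codon position, how many of the 20 codons carry a given base or one of a group of bases. The column groups G, C, U, GC and CU are those that appear in the header of Table 6; the function names are illustrative.

```python
# Codons of stage 1 and stage 32 of Table 5, in the column order
# G A D V P S E L T R Q I N H K C F Y M W.
STAGE_1 = ("GGC GCC GAC GUC CCC AGC GAG CUC ACC AGG "
           "CAG AUC AAC CAC AAG UGC UUC UAC AUG UGG").split()
STAGE_32 = ("GGU GCU GAU GUA CCU UCU GAA UUA ACU CGU "
            "CAA AUA AAU CAU AAA UGU UUU UAU AUG UGG").split()


def count_bases(codons, bases):
    """Number of codons with one of `bases` at the 1st, 2nd and 3rd position."""
    return tuple(sum(codon[pos] in bases for codon in codons) for pos in range(3))


def table6_row(codons):
    """Per-position counts for the column groups used in Table 6."""
    return {group: count_bases(codons, group) for group in ("G", "C", "U", "GC", "CU")}


row_1 = table6_row(STAGE_1)
row_32 = table6_row(STAGE_32)
gc_total_1 = sum(row_1["GC"])     # total number of G and C at stage 1
gc_total_32 = sum(row_32["GC"])   # total number of G and C at stage 32
```

Applying `table6_row` to every row of Table 5 would give the full set of counts; only two rows are transcribed here to keep the example short.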


Table 4. The codon chronology based on the amino acid chronology and complementarity.

Amino acid chronology (first line): G A D V P S E L T R Q I N H K C F Y M W (stop) (S) (L) (R)

Stage  Codon pair      Stage  Codon pair
1      GGC GCC         17     GCA UGC
2      GAC GUC         18     ACA UGU
3      GGG CCC         19     GAA UUC
4      GGA UCC         20     AAA UUU
5      GAG CUC         21     GUA UAC
6      GGU ACC         22     AUA UAU
7      GCG CGC         23     CAU AUG
8      CCG CGG         24     CCA UGG
9      UCG CGA         25     CUA UAG
10     ACG CGU         26     UCA UGA
11     CUG CAG         27     GCU AGC
12     GAU AUC         28     ACU AGU
13     GUU AAC         29     CAA UUG
14     AUU AAU         30     UAA UUA
15     GUG CAC         31     CCU AGG
16     CUU AAG         32     UCU AGA

Table 5The modified codon chronology.

G A D V P S∗ E L T R∗ Q I N H K C F Y M W

1 GGC GCC GAC GUC CCC AGC GAG CUC ACC AGG CAG AUC AAC CAC AAG UGC UUC UAC AUG UGG2 GGG GCG GAC GUC CCC AGC GAG CUC ACC AGG CAG AUC AAC CAC AAG UGC UUC UAC AUG UGG3 GGG GCG GAU GUU CCC AGC GAG CUC ACC AGG CAG AUC AAC CAC AAG UGC UUC UAC AUG UGG4 GGA GCG GAU GUU CCG AGU GAG CUC ACC AGG CAG AUC AAC CAC AAG UGC UUC UAC AUG UGG5 GGU GCG GAU GUU CCG AGU GAG CUC ACC AGG CAG AUC AAC CAC AAG UGC UUC UAC AUG UGG6 GGU GCG GAU GUU CCG AGU GAA CUG ACC AGG CAG AUC AAC CAC AAG UGC UUC UAC AUG UGG7 GGU GCG GAU GUU CCG AGU GAA CUG ACG AGG CAG AUC AAC CAC AAG UGC UUC UAC AUG UGG8 GGU GCA GAU GUU CCG AGU GAA CUG ACG AGA CAG AUC AAC CAC AAG UGC UUC UAC AUG UGG9 GGU GCA GAU GUU CCA AGU GAA CUG ACG CGC CAG AUC AAC CAC AAG UGC UUC UAC AUG UGG10 GGU GCA GAU GUU CCA UCC GAA CUG ACG CGG CAG AUC AAC CAC AAG UGC UUC UAC AUG UGG11 GGU GCA GAU GUU CCA UCC GAA CUG ACA CGA CAG AUC AAC CAC AAG UGC UUC UAC AUG UGG12 GGU GCA GAU GUU CCA UCC GAA CUU ACA CGA CAA AUC AAC CAC AAG UGC UUC UAC AUG UGG13 GGU GCA GAU GUU CCA UCC GAA CUU ACA CGA CAA AUU AAC CAC AAG UGC UUC UAC AUG UGG14 GGU GCA GAU GUG CCA UCC GAA CUU ACA CGA CAA AUU AAU CAC AAG UGC UUC UAC AUG UGG15 GGU GCA GAU GUG CCA UCC GAA CUU ACA CGA CAA AUA AAU CAC AAG UGC UUC UAC AUG UGG16 GGU GCA GAU GUA CCA UCC GAA CUU ACA CGA CAA AUA AAU CAU AAG UGC UUC UAC AUG UGG17 GGU GCA GAU GUA CCA UCC GAA CUA ACA CGA CAA AUA AAU CAU AAA UGC UUC UAC AUG UGG18 GGU GCU GAU GUA CCA UCC GAA CUA ACA CGA CAA AUA AAU CAU AAA UGU UUC UAC AUG UGG19 GGU GCU GAU GUA CCA UCC GAA CUA ACU CGA CAA AUA AAU CAU AAA UGU UUC UAC AUG UGG20 GGU GCU GAU GUA CCA UCC GAA CUA ACU CGA CAA AUA AAU CAU AAA UGU UUU UAC AUG UGG21 GGU GCU GAU GUA CCA UCC GAA CUA ACU CGA CAA AUA AAU CAU AAA UGU UUU UAC AUG UGG22 GGU GCU GAU GUA CCA UCC GAA CUA ACU CGA CAA AUA AAU CAU AAA UGU UUU UAU AUG UGG23 GGU GCU GAU GUA CCA UCC GAA CUA ACU CGA CAA AUA AAU CAU AAA UGU UUU UAU AUG UGG24 GGU GCU GAU GUA CCA UCC GAA CUA ACU CGA CAA AUA AAU CAU AAA UGU UUU UAU AUG UGG25 GGU GCU GAU GUA CCU UCC GAA CUA ACU 
CGA CAA AUA AAU CAU AAA UGU UUU UAU AUG UGG26 GGU GCU GAU GUA CCU UCC GAA UUG ACU CGA CAA AUA AAU CAU AAA UGU UUU UAU AUG UGG27 GGU GCU GAU GUA CCU UCG GAA UUG ACU CGA CAA AUA AAU CAU AAA UGU UUU UAU AUG UGG28 GGU GCU GAU GUA CCU UCA GAA UUG ACU CGA CAA AUA AAU CAU AAA UGU UUU UAU AUG UGG29 GGU GCU GAU GUA CCU UCU GAA UUG ACU CGA CAA AUA AAU CAU AAA UGU UUU UAU AUG UGG30 GGU GCU GAU GUA CCU UCU GAA UUA ACU CGA CAA AUA AAU CAU AAA UGU UUU UAU AUG UGG31 GGU GCU GAU GUA CCU UCU GAA UUA ACU CGA CAA AUA AAU CAU AAA UGU UUU UAU AUG UGG32 GGU GCU GAU GUA CCU UCU GAA UUA ACU CGU CAA AUA AAU CAU AAA UGU UUU UAU AUG UGG


Table 6. The number of bases at codon positions based on the modified codon chronology.

       G            C            U            GC                 CU
Stage  1st 2nd 3rd  1st 2nd 3rd  1st 2nd 3rd  1st 2nd 3rd Total  1st 2nd 3rd Total
1        5   5   6    4   3  14    4   5   0    9   8  20    37    8   8  14    30
2        5   5   8    4   3  12    4   5   0    9   8  20    37    8   8  12    28
3        5   5   8    4   3  10    4   5   2    9   8  18    35    8   8  12    28
4        5   5   8    4   3   9    4   5   2    9   8  17    34    8   8  11    27
5        5   5   8    4   3   8    4   5   4    9   8  16    33    8   8  12    28
6        5   5   8    4   3   7    4   5   4    9   8  15    32    8   8  11    27
7        5   5   9    4   3   7    4   5   4    9   8  16    33    8   8  11    27
8        5   5   7    4   3   7    4   5   4    9   8  14    31    8   8  11    27
9        5   5   6    5   3   8    4   5   4   10   8  14    32    8   8  12    28
10       5   4   7    5   4   7    5   5   3   10   8  14    32   10   9  10    29
11       5   4   5    5   4   7    5   5   3   10   8  12    30   10   9  10    29
12       5   4   3    5   4   7    5   5   4   10   8  10    28   10   9  11    30
13       5   4   3    5   4   6    5   5   5   10   8   9    27   10   9  11    30
14       5   4   4    5   4   5    5   5   5   10   8   9    27   10   9  10    29
15       5   4   4    5   4   5    5   5   4   10   8   9    27   10   9   9    28
16       5   4   3    5   4   4    5   5   5   10   8   7    25   10   9   9    28
17       5   4   2    5   4   4    5   5   4   10   8   6    24   10   9   8    27
18       5   4   2    5   4   3    5   5   6   10   8   5    23   10   9   9    28
19       5   4   2    5   4   3    5   5   7   10   8   5    23   10   9  10    29
20       5   4   2    5   4   2    5   5   8   10   8   4    22   10   9  10    29
21       5   4   2    5   4   2    5   5   8   10   8   4    22   10   9  10    29
22       5   4   2    5   4   1    5   5   9   10   8   3    21   10   9  10    29
23       5   4   2    5   4   1    5   5   9   10   8   3    21   10   9  10    29
24       5   4   2    5   4   1    5   5   9   10   8   3    21   10   9  10    29
25       5   4   2    5   4   1    5   5  10   10   8   3    21   10   9  11    30
26       5   4   3    5   4   1    6   5  10   10   8   4    22   11   9  11    31
27       5   4   4    5   4   0    6   5  10   10   8   4    22   11   9  10    30
28       5   4   3    5   4   0    6   5  10   10   8   3    21   11   9  10    30
29       5   4   3    4   4   0    6   5  11    9   8   3    20   10   9  11    30
30       5   4   2    4   4   0    6   5  11    9   8   2    19   10   9  11    30
31       5   4   2    4   4   0    6   5  11    9   8   2    19   10   9  11    30
32       5   4   2    4   4   0    6   5  12    9   8   2    19   10   9  12    31

6.3.2. Generation of DNA sequences according to codon chronology

There is a one-to-one relationship between amino acids and codons at each of the 32 stages according to the codon chronology in Table 5. We separated the interval [tmin, tmax] equally into 32 subintervals so that each value of parameter t lies in one of them. Thus, the amino acids in the protein sequences generated in the second part of the model can be replaced by the corresponding codons at certain stages. So there is also a one-to-one correspondence between protein sequences and DNA sequences at each stage. Thus a group of DNA sequences can be obtained in the model as soon as the protein sequences have been generated, which is based on both the genetic code multiplicity and the codon chronology.
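A minimal sketch of this step is given below. It assumes tmin = 0.05 and tmax = 0.3 (the values quoted in the caption of Fig. 9), assumes that larger t corresponds to a later stage, and transcribes only the first row of Table 5 for the translation example. Codons are written with U, as in Table 5. The function names are illustrative.

```python
T_MIN, T_MAX, N_STAGES = 0.05, 0.3, 32   # t range assumed from the caption of Fig. 9

ORDER = "GADVPSELTRQINHKCFYMW"           # column order of Table 5
STAGE_1 = ("GGC GCC GAC GUC CCC AGC GAG CUC ACC AGG "
           "CAG AUC AAC CAC AAG UGC UUC UAC AUG UGG").split()


def stage_of(t):
    """1-based index of the subinterval of [T_MIN, T_MAX] that contains t."""
    if not T_MIN <= t <= T_MAX:
        raise ValueError("t must lie in [T_MIN, T_MAX]")
    k = int((t - T_MIN) / (T_MAX - T_MIN) * N_STAGES)
    return min(k, N_STAGES - 1) + 1      # t == T_MAX falls in the last stage


def to_codons(protein, codon_row):
    """Replace each amino acid by its single codon at the given stage."""
    table = dict(zip(ORDER, codon_row))
    return "".join(table[a] for a in protein)


example = to_codons("GAD", STAGE_1)
```

With all 32 rows of Table 5 available, `codon_row` would be chosen as row `stage_of(t)` of the table.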

6.3.3. Calculation of genomic base compositions and their variation trends

For a fixed value of parameter t, which corresponds to a certain stage, we can obtain numerous DNA sequences. Then we can calculate the compositions of bases G, C, A, T. Finally, we can obtain the GC content or the GA content at certain codon positions and the total GC content or GA content in all the DNA sequences. When the parameter t varies, we can study the evolution of genomic base compositions or base compositions at certain codon positions.
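The calculation of position-specific and total base content can be sketched as follows for a single coding sequence. The function name `content_by_position` and its interface are illustrative; a full simulation would pool many sequences before computing the fractions.

```python
def content_by_position(seq, bases="GC"):
    """Fraction of `bases` at codon positions 1, 2, 3 and in the whole sequence.

    `seq` is a coding sequence whose length is a positive multiple of three.
    Returns ([pos1, pos2, pos3], total).
    """
    if not seq or len(seq) % 3:
        raise ValueError("sequence length must be a positive multiple of 3")
    per_position = []
    for pos in range(3):
        column = seq[pos::3]             # every third base, starting at `pos`
        per_position.append(sum(b in bases for b in column) / len(column))
    total = sum(b in bases for b in seq) / len(seq)
    return per_position, total


gc_positions, gc_total = content_by_position("GGCGCCGAC", "GC")
ga_positions, ga_total = content_by_position("GGCGCCGAC", "GA")
```

Passing `bases="GA"` gives the GA content in the same way, so the correlations of Fig. 6 could be assembled from repeated calls for different values of t.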

6.3.4. The mechanism of the variation of genomic base compositions in the simulation

According to the simulation in the model, the variation of base compositions in contemporary species can be explained by the codon chronology, genetic code multiplicity and amino acid chronology together. For instance, we can explain the relationship between genomic GC content and GC content at the first, second and third codon positions. The plot of the relationship between total GC numbers and GC number at certain codon positions in Fig. 6i is approximately the same as the corresponding experimental observation in Fig. 6a. The upper limit (about 75%) and lower limit (about 25%) of genomic GC content are due to the abundance of G and C in the earliest codons and the lack of G and C in the latest codons in Table 5. The characteristic "step" at the middle for the third codon position in experimental observations is due to the leap of numbers of G and C at the third codon position in Table 6. There are always 8 G and C at the second codon position in Table 6, so there is no step for the second codon position in the experimental observation. In the simulation, the GC content for each codon position increases with genomic GC content (Fig. 6e), which agrees with the experimental observation (Fig. 6a). When t increases, there will be fewer opportunities for bases G and C to join the DNA sequences. The other relationships in Fig. 6 can also be explained similarly.
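The contribution of the codon chronology alone to the range of GC content can be estimated with a small calculation. The sketch below weights the GC count of each amino acid's codon by the iAAFPEP frequencies of Table 2, for the first and the last stage of Table 5. Holding the amino acid frequencies fixed across stages is a simplifying assumption made here; in the model the frequencies also change with t. The name `weighted_gc` is illustrative.

```python
# iAAF_PEP (x 10^-4) from Table 2.
IAAF_PEP = {
    "G": 692, "A": 831, "D": 524, "V": 674, "P": 489,
    "S": 703, "E": 647, "L": 991, "T": 537, "R": 559,
    "Q": 394, "I": 577, "N": 400, "H": 226, "K": 555,
    "C": 141, "F": 404, "Y": 299, "M": 238, "W": 119,
}
ORDER = "GADVPSELTRQINHKCFYMW"   # column order of Table 5
STAGE_1 = ("GGC GCC GAC GUC CCC AGC GAG CUC ACC AGG "
           "CAG AUC AAC CAC AAG UGC UUC UAC AUG UGG").split()
STAGE_32 = ("GGU GCU GAU GUA CCU UCU GAA UUA ACU CGU "
            "CAA AUA AAU CAU AAA UGU UUU UAU AUG UGG").split()


def weighted_gc(freq, codon_row):
    """Expected GC fraction when amino acids occur with frequencies `freq`."""
    total = sum(freq.values())
    gc = sum(freq[a] * sum(base in "GC" for base in codon)
             for a, codon in zip(ORDER, codon_row))
    return gc / (3 * total)


gc_first = weighted_gc(IAAF_PEP, STAGE_1)
gc_last = weighted_gc(IAAF_PEP, STAGE_32)
uniform = dict.fromkeys(ORDER, 1)        # equal weights reproduce the Table 6 totals
gc_first_uniform = weighted_gc(uniform, STAGE_1)    # 37 / 60
gc_last_uniform = weighted_gc(uniform, STAGE_32)    # 19 / 60
```

With equal weights the function returns 37/60 and 19/60, the GC totals of Table 6 divided by the 60 bases of 20 codons. With fixed frequencies the span between the first and last stage is narrower than the limits quoted in the text, which suggests that the accompanying shift in amino acid frequencies also contributes to the full range.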


7. Conclusion

The variation patterns of amino acid frequencies and genomic base compositions can be explained in a unified framework based on genetic code evolution. The theoretical results agree with the experimental observations not only in general variation trends but also in many detailed characteristics. We conclude that the evolution of the genetic code is the initial driving force in molecular evolution.

Acknowledgments

We thank Hefeng Wang for valuable discussions. This work was supported by the Key Grant Project of the Chinese Ministry of Education under Grant No. 708082 and by the NSF of China under Grant No. 10374075.

References

[1] R.D. Knight, L.F. Landweber, The early evolution of the genetic code, Cell 101 (2000) 569–572.
[2] E. Szathmáry, Why are there four letters in the genetic alphabet? Nature Rev. Genet. 4 (2003) 995–1001.
[3] F.H.C. Crick, The origin of the genetic code, J. Mol. Biol. 38 (1968) 376–379.
[4] S. Osawa, Recent evidence for evolution of the genetic code, Microbiol. Rev. 56 (1992) 229–264.
[5] E.N. Trifonov, A. Kirzhner, V.M. Kirzhner, I.N. Berezovsky, Distinct stages of protein evolution as suggested by protein sequence analysis, J. Mol. Evol. 53 (2001) 394–401.
[6] P. Carter, J. Liu, B. Rost, PEP: Predictions for entire proteomes, Nucleic Acids Res. 31 (2003) 410–413.
[7] D.H. Haft, J.D. Selengut, L.M. Brinkac, N. Zafar, O. White, Genome properties: A system for the investigation of prokaryotic genetic content for microbiology, genome annotation and comparative genomics, Bioinformatics 21 (2005) 293–306.
[8] D.R. Forsdyke, Evolutionary Bioinformatics, Springer, New York, 2006.
[9] J.E.M. Hornos, Y.M.M. Hornos, Algebraic model for the evolution of the genetic code, Phys. Rev. Lett. 71 (1993) 4401–4404.
[10] I.K. Jordan, A universal trend of amino acid gain and loss in protein evolution, Nature 433 (2005) 633–638.
[11] B. Rost, Did evolution leap to create the protein universe? Curr. Opin. Struct. Biol. 12 (2002) 409–416.
[12] J.-F. Liu, B. Rost, Comparing function and structure between entire proteomes, Protein Sci. 10 (2001) 1970–1979.
[13] E.N. Trifonov, The triplet code from first principle, J. Biomol. Struct. Dyn. 22 (2004) 1–11.
[14] A. Savitzky, M.J.E. Golay, Smoothing and differentiation of data, Anal. Chem. 36 (1964) 1627–1639.
[15] C.R. Woese, O. Kandler, M.L. Wheelis, Towards a natural system of organisms: Proposal for the domains Archaea, Bacteria, and Eucarya, Proc. Natl. Acad. Sci. USA 87 (1990) 4576–4579.
[16] A. Muto, S. Osawa, The guanine and cytosine content of genomic DNA and bacterial evolution, Proc. Natl. Acad. Sci. USA 84 (1987) 166–169.
[17] A. Gorban, Codon usage trajectories and 7-cluster structure of 143 complete bacterial genomic sequences, Physica A 353 (2005) 365–387.
[18] E.N. Trifonov, Consensus temporal order of amino acids and evolution of the triplet code, Gene 261 (2000) 139–151.
[19] N. Sueoka, Correlation between base composition of deoxyribonucleic acid and amino acid composition of protein, Proc. Natl. Acad. Sci. USA 47 (1961) 1141–1149.
[20] X. Gu, D. Hewett-Emmett, W.-H. Li, Directional mutational pressure affects the amino acid composition and hydrophobicity of proteins in bacteria, Genetica 103 (1998) 383–391.
[21] L.D. Hurst, E.J. Feil, E.P.C. Rocha, Causes of trends in amino-acid gain and loss, Nature 442 (2006) E11–E12.
[22] A.K. Joshi, Y. Schabes, in: G. Rozenberg, A. Salomaa (Eds.), Handbook of Formal Languages, Springer, Heidelberg, 1997, pp. 69–214.
[23] M. Eigen, P. Schuster, The hypercycle. A principle of natural self-organization. Part C: The realistic hypercycle, Naturwissenschaften 65 (1978) 341–369.