Genomics 86 (2005) 462–475

www.elsevier.com/locate/ygeno

A comparative categorization of gene flux in diverse microbial species

Arnim Wiezer a,1, Rainer Merkl b,*

a Göttingen Genomics Laboratory, Grisebachstrasse 8, D-37077 Göttingen, Germany

b Institut für Biophysik und Physikalische Biochemie, Universität Regensburg, Universitätsstrasse 31, D-93053 Regensburg, Germany

Received 30 November 2004; accepted 25 May 2005

Available online 18 July 2005

Abstract

Microbial genomes harbor genomic islands (GIs), genes presumably acquired via horizontal gene transfer (HGT). We compared GIs of hyperthermophilic, thermophilic, mesophilic, and pathogenic/nonpathogenic species and of small and large genomes. The COG database was used to characterize gene-encoded functions. Putative donors were determined to quantify gene flux between superkingdoms. In hyperthermophiles, more than 10% of the genes were on average acquired across the superkingdom border. For thermophiles and particularly mesophiles, we identified a nearly unidirectional export from bacteria to archaea. Additionally, we analyzed GI composition for Escherichia, and pairs of Listeria, Rhizobiales, Methanosarcinaceae, and Thermus thermophilus/Deinococcus radiodurans. For Escherichia and Listeria, the composition of GIs in pathogenic and nonpathogenic species did not differ significantly with respect to encoded COG classes. The analysis of related genomes showed that the composition of GIs cannot be explained with trends of gene content known to depend on genome size.

© 2005 Elsevier Inc. All rights reserved.

Keywords: Horizontal gene transfer; Lateral gene transfer; Gene flux; COG database; Genomic islands; Xenologous genes

Introduction

Horizontal gene transfer (HGT) is considered a strong evolutionary force enhancing the genetic diversity of microbes [1]. This effect brings new genes into a genome, even from taxonomically unrelated species. Thus, HGT offers the means for a rapid adaptation to environmental demands [2]. This speed distinguishes HGT from other processes constantly contributing to the evolution of genomes, such as point mutations, genetic rearrangements, gene loss, and genesis (the de novo generation of genes). Meanwhile, the existence of HGT is generally accepted, although its quantification [3] and the assessment of its impact on microbial genomes are still matters of debate [4]. One reason for dissent is differing results of various studies aimed at the quantification of HGT; for a comparison, see, e.g., [5] or [6]. Their outcomes frequently vary to a great extent with regard to the fraction of genes considered as being horizontally transferred. However, it might be that the different methods identify specific classes of alien genes and that they are sensitive to different time intervals of genome evolution [7].

0888-7543/$ - see front matter © 2005 Elsevier Inc. All rights reserved.

doi:10.1016/j.ygeno.2005.05.014

* Corresponding author. Fax: +49(0)941 943 2813.

E-mail address: [email protected] (R. Merkl).

1 Current address: QIAGEN, Qiagen-Strasse 1, D-40724 Hilden, Germany.

Gene clusters having conspicuous compositions or prominent gene-encoded functions are frequently named “genomic islands” (GIs). GIs were identified both in pathogenic [8] and nonpathogenic microbes [9]. It has been observed that pathogenicity islands, which are a subset of GIs, frequently have considerable lengths [10]. Recent studies of GI composition have shown that housekeeping genes related to cell surface, DNA binding, and pathogenicity were overrepresented in GIs; see, e.g., [11]. As proposed by the complexity hypothesis [12], a reason for this asymmetry might be that informational genes, as opposed to housekeeping genes, are typically members of large and complex systems, rendering their exchange among genomes rather unlikely.

Most frequently, HGT has been studied on the genome scale; see, e.g., [13]. In this article, we present an analysis of GIs based on larger sets of genomic data and on pairs of taxonomically related genomes. The study focuses on three questions. (i) To what extent do bacteria and archaea transfer genes across the borders of prokaryotic superkingdoms? (ii) Are GIs of pathogenic and nonpathogenic species differently composed with respect to gene-encoded functions? (iii) Do species-specific variations in HGT activity explain differing genome sizes of taxonomically related species?

We will show that certain amounts of gene flux occur between hyperthermophiles across the superkingdom border. Among thermophilic and especially mesophilic species, gene flux is largely unidirectional from bacteria to archaea, if evaluated on the superkingdom level. The pairwise genome comparison of pathogenic/nonpathogenic species belonging to the genera Escherichia or Listeria illustrates that gene-encoded functions found in respective GIs do not differ significantly. Finally, we will correlate genome size and the amount of GI content for closely related species. These findings indicate that species-specific variations in HGT activity do not explain the genome sizes of closely related species of Methanosarcinaceae or Rhizobiales.

Results

Two complementary approaches were combined to study gene flux: the COG database [14,15], which classifies genes on the protein-sequence level, and SIGI [16], which analyzes codon usage and predicts GIs.

Categorizing gene flux

The COG database is a classification system for gene-encoded functions. Each element of the database (a cluster of orthologous groups of genes, a COG) is a set of genes that code for the same function and originate from different species. COGs are organized in functional categories (also named COG classes in the following) describing the role of gene-encoded functions. To date, the database classifies the genes of 63 prokaryotic species into 4873 COGs. A minimal COG would consist of three genes from phylogenetically distinct species exhibiting significant similarity to each other on the protein-sequence level. Due to this concept, the fraction of genes that belong to COGs varies among completely sequenced genomes. On average, 70–75% of the protein-coding genes of a genome are classified in the COG database. The STRING database [17] extends the COG concept and comprises additional genomes. We retrieved several genomes to supplement the groups of thermophiles and hyperthermophiles. In addition, we implemented the program COGGER according to the COG concept [14], to allow the analysis of genomes which are not elements of the COG or the STRING database. For the work described here, we processed the genomes of Escherichia blattae [18], Methanosarcina mazei [19], Thermus thermophilus [20], and Listeria monocytogenes [21]. In addition, we have created a new COG class X (see Table 2) which contains the COGs describing integrases, transposases, and inactivated derivatives to isolate these functions, which are exceedingly overrepresented in GIs. Gene-encoded functions related to COG classes Z (cytoskeleton) and Y (nuclear structure) did not occur in the data sets and were not processed any further.

Recently, a novel method, named SIGI, was introduced for the analysis of GIs. It is based on the genome theory [22] and the taxonomical relatedness of codon usage [23]. The algorithm focuses on the identification of recently acquired genes and additionally aims at the prediction of the putative donor. It determines the codon usage (CU) of each individual gene in a genome. Clusters of genes whose codon usage deviated suspiciously from the genome-specific mean were labeled as GIs; these genes were named putatively alien genes (pA). For the prediction of the putative donor, each CU was compared with a list of about 400 codon frequency tables (CUT) representative of microbial species. For each gene, SIGI considered the taxonomical relation of the three species from CUT that had the codon usage most similar to CU. These three species were labeled in a taxonomy tree. The taxon subsuming these three entries as child nodes was predicted as the putative source of the considered pA gene. If all three entries were bacteria or archaea, the donor was predicted as putatively bacterial (pAb) or archeal (pAa), respectively. In the worst case, i.e., if the three high scorers belonged to (say) bacteria and archaea, or bacteria and phages, codon usage was interpreted as unspecific (pAu). During the design phase, two types of test beds were applied to test SIGI's predictive power. The analyses have shown that on average 75% of the predictions of the putative origin were correct on the species level. In no case were more than 1% of the predictions wrong on the level of the superclass. These results indicate that codon usages of archaea and bacteria are quite distinct.
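The final labeling step of the donor prediction can be illustrated with a few lines of Python. This is a minimal sketch under stated assumptions, not SIGI's implementation: the function name classify_donor and the plain-string superkingdom labels are hypothetical, and the three best-matching codon usage tables are taken as given instead of being scored.

```python
from typing import Sequence


def classify_donor(top3: Sequence[str]) -> str:
    """Label a putatively alien gene from its three best codon-usage matches.

    top3 holds the superkingdom (or other group, e.g. "phage") of the three
    codon frequency tables most similar to the gene's codon usage.
    The labels "bacteria" and "archaea" are illustrative assumptions.
    """
    if len(top3) != 3:
        raise ValueError("expected exactly three best-matching entries")
    groups = set(top3)
    if groups == {"bacteria"}:
        return "pAb"  # all three matches are bacterial
    if groups == {"archaea"}:
        return "pAa"  # all three matches are archeal
    return "pAu"      # mixed matches: codon usage is unspecific


if __name__ == "__main__":
    print(classify_donor(["bacteria", "bacteria", "bacteria"]))
    print(classify_donor(["bacteria", "archaea", "bacteria"]))
```

Only the case distinction stated in the text is reproduced here; how the similarity between a gene's codon usage and each table is measured is left out on purpose.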

Gene flux between superkingdoms and in specific habitats

All genomes listed under Materials were analyzed using SIGI. For each gene, the putative source was determined and, according to SIGI's prediction, added to histograms summing up the absolute numbers of putatively native (# pN) or putatively alien (# pA) genes. pA genes were separated into three classes according to their predicted origin and counted as # pAa, # pAb, or # pAu. Table 1 lists representative findings for genomes having more than 50 pA genes. Among hyperthermophiles, gene flux was approximately balanced: At least 10–15% of the genes seemed to be transferred across the superkingdom border. The portion of pA genes having a distinct archeal codon usage dropped to zero for thermophilic and mesophilic bacteria. GIs of thermophilic archaea harbored 29% of pAb genes; this fraction increased to 50% for mesophilic archaea. The fraction of genes with an unspecific codon usage was >40% in archaea and hyperthermophilic bacteria and <35% in thermophilic and mesophilic bacteria. A comparison of the summed # pAa, # pAb and # pAu genes for hyperthermophilic and mesophilic species with a χ² test showed a highly significant deviation from each other for both archaea and bacteria (P << 0.01).

Table 1

Occurrence of putatively alien genes in GIs of microbial species

| Group / species | # pA | [%] | # pAa | [%] | # pAb | [%] | # pAu | [%] |
|---|---|---|---|---|---|---|---|---|
| Hyperthermophilic archaea | 984 | 6 | 414 | 41 | 109 | 11 | 461 | 48 |
| Aeropyrum pernix | 186 | 9 | 96 | 52 | 26 | 14 | 64 | 34 |
| Sulfolobus solfataricus | 253 | 7 | 124 | 49 | 26 | 10 | 103 | 41 |
| Archaeoglobus fulgidus | 169 | 5 | 49 | 29 | 20 | 12 | 100 | 59 |
| Pyrococcus abyssi | 120 | 5 | 58 | 48 | 6 | 5 | 56 | 47 |
| Pyrococcus horikoshii | 129 | 5 | 34 | 26 | 19 | 15 | 76 | 59 |
| Thermococcus kodakaraensis | 127 | 5 | 53 | 42 | 12 | 9 | 62 | 49 |
| Hyperthermophilic bacteria | 477 | 6 | 48 | 15 | 236 | 40 | 193 | 44 |
| Thermotoga maritima | 143 | 6 | 28 | 20 | 50 | 35 | 65 | 45 |
| Aquifex aeolicus | 52 | 3 | 12 | 23 | 13 | 25 | 27 | 52 |
| Thermoanaerobacter tengcongensis | 282 | 9 | 8 | 3 | 173 | 61 | 101 | 36 |
| Thermophilic archaea | 315 | 6 | 77 | 23 | 88 | 29 | 150 | 48 |
| Thermoplasma volcanium | 137 | 8 | 42 | 31 | 32 | 23 | 63 | 46 |
| Picrophilus torridus | 88 | 5 | 2 | 2 | 44 | 50 | 42 | 48 |
| Methanothermobacter thermoautotrophicus | 90 | 5 | 33 | 37 | 12 | 13 | 45 | 50 |
| Thermophilic bacteria | 760 | 6 | 2 | 0 | 524 | 65 | 234 | 34 |
| Geobacillus kaustophilus | 394 | 11 | 0 | 0 | 288 | 73 | 106 | 27 |
| Thermus thermophilus HB27 | 125 | 5 | 0 | 0 | 73 | 58 | 52 | 42 |
| Thermosynechococcus elongatus | 55 | 2 | 0 | 0 | 34 | 62 | 21 | 38 |
| Symbiobacterium thermophilum | 186 | 5 | 2 | 1 | 129 | 69 | 55 | 30 |
| Mesophilic archaea | 916 | 8 | 85 | 8 | 451 | 50 | 380 | 43 |
| Halobacterium salinarum | 68 | 4 | 0 | 0 | 41 | 60 | 27 | 40 |
| Methanosarcina acetivorans | 538 | 11 | 28 | 5 | 317 | 59 | 193 | 36 |
| Methanosarcina mazei | 310 | 8 | 57 | 18 | 93 | 30 | 160 | 52 |
| Mesophilic bacteria | 2182 | 12 | 4 | 0 | 1431 | 68 | 747 | 31 |
| Bacillus subtilis | 469 | 10 | 0 | 0 | 355 | 76 | 114 | 24 |
| Escherichia coli K-12 | 633 | 12 | 4 | 1 | 451 | 71 | 178 | 28 |
| Mesorhizobium loti | 1028 | 15 | 0 | 0 | 564 | 55 | 464 | 45 |

For the listed genomes, the occurrence of pA genes having an archeal (# pAa), bacterial (# pAb) or unspecific codon usage (# pAu) was determined. Amounts are given as absolute numbers and in percentage related to the number of genome-specific pA genes. The genomes were grouped into hyperthermophilic, thermophilic, and mesophilic species. For each group, the sum of # pA genes and the mean percentage values are listed.

For a characterization of gene-encoded functions, the COG classification was determined for each gene and, according to SIGI's prediction, added to histograms summing for each COG class k the absolute frequencies # p_kN and # p_kA and, more specifically, # p_kAa, # p_kAb, or # p_kAu. Absolute frequencies were used to calculate relative frequency values f_k(pA), f_k(pN) and ratios r(k) = f_k(pA)/f_k(pN). An example may illustrate the determination of relative frequencies and ratio values. If we consider for COG class X # pA = 219 and # pN = 381, then we get f_X(pA) = 219/995 = 0.22, f_X(pN) = 381/22872 = 0.017, and r(X) = f_X(pA)/f_X(pN) = 13.2, computed from the unrounded fractions (see numbers for COG class X of data set “Archaea” in Table 2). Here, 995 is the sum of all pA genes and 22872 is the sum of all pN genes determined in the data set Archaea. This ratio value indicates that genes assigned to COG class X are 13 times as frequent in archeal GIs as among pN genes. Genes not assigned to a COG class were, according to their putative origin, collected in separate groups named unCOGed.
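The worked example for COG class X can be reproduced with a short Python sketch. The function name cog_ratio and its argument names are illustrative assumptions; the numbers are those quoted in the text for the data set Archaea.

```python
def cog_ratio(n_k_pa: int, n_pa: int, n_k_pn: int, n_pn: int) -> float:
    """Return r(k) = f_k(pA) / f_k(pN) for one COG class k.

    n_k_pa: number of pA genes assigned to class k
    n_pa:   number of all pA genes in the data set
    n_k_pn: number of pN genes assigned to class k
    n_pn:   number of all pN genes in the data set
    """
    if n_pa <= 0 or n_pn <= 0:
        raise ValueError("totals must be positive")
    if n_k_pn == 0:
        raise ValueError("r(k) is undefined when no pN gene falls into class k")
    f_pa = n_k_pa / n_pa  # relative frequency among putatively alien genes
    f_pn = n_k_pn / n_pn  # relative frequency among putatively native genes
    return f_pa / f_pn


if __name__ == "__main__":
    # COG class X, data set Archaea (Table 2)
    print(round(cog_ratio(219, 995, 381, 22872), 1))
```

With the unrounded fractions the ratio rounds to 13.2, the value listed for class X in Table 2; dividing the rounded fractions 0.22 and 0.017 would give a slightly different number.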

To test the statistical significance of differences in histograms, χ² tests were computed on absolute gene numbers. In bacteria (P << 0.01), archaea (P << 0.01), hyperthermophilic archaea (P << 0.01), mesophilic archaea (P = 0.028), and mesophilic bacteria (P << 0.01), the distributions of COG classes resulting from pA and pN genes were statistically significantly different. In hyperthermophilic bacteria, the skew was not significant at the 10% level. This might, however, be due to the small number of related pA genes. In addition, for each COG class k, the imbalance between p_kA and p_kN genes was assessed with a χ² test. Classes having a statistically significant skew (P < 0.01) are printed in boldface in Table 2.
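One plausible way to set up the per-class test is a 2 × 2 contingency table (class k versus all other classes, pA versus pN) evaluated with Pearson's χ² statistic. The text does not state how the table was constructed or whether a continuity correction was applied, so the layout below, the absence of a correction, and the function names are assumptions made for illustration only.

```python
import math


def chi2_2x2(a: int, b: int, c: int, d: int) -> tuple[float, float]:
    """Pearson chi-square test (no continuity correction) for the table
    [[a, b], [c, d]]; returns (statistic, p-value) with one degree of freedom."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column must contain at least one gene")
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # Survival function of the chi-square distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


def class_skew(n_k_pa: int, n_pa: int, n_k_pn: int, n_pn: int) -> tuple[float, float]:
    """Test whether class k is unevenly distributed between pA and pN genes."""
    return chi2_2x2(n_k_pa, n_pa - n_k_pa, n_k_pn, n_pn - n_k_pn)


if __name__ == "__main__":
    # COG class X in the data set Archaea (Table 2)
    stat, p = class_skew(219, 995, 381, 22872)
    print(f"chi2 = {stat:.1f}, significant at 1% level: {p < 0.01}")
```

For class X in archaea the statistic is very large and the skew is significant at the 1% level under these assumptions; a class with identical proportions among pA and pN genes yields a statistic of zero.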

In archaea (compare Table 2 and Figs. 1C and 2C), the imbalance between pA and pN genes and the overrepresentation in GIs of genes classified as belonging to class M (cell wall/membrane/envelope biogenesis) were more pronounced than in bacteria. This overrepresentation of class M is independent of the habitat or the taxonomical group (compare Fig. 3). Fig. 3A reveals that archeal GIs harbor more gene-encoded functions related to class F (nucleotide transport and metabolism) and X (integrases, transposases, and inactivated derivatives) than bacterial ones. Fig. 3B indicates that GIs of mesophilic species harbor more gene-encoded functions related to metabolism than hyperthermophilic ones.
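The log ratios plotted in Fig. 3A can be sketched for a single class using the r(k) values that Table 2 lists for class M. The base of the logarithm is not stated in the source, so base 10 is an assumption here (the sign of the ratio does not depend on it), and the names log_ratio and archaeal_bias are hypothetical.

```python
import math

# r(k) values for COG class M as listed in Table 2
R_CLASS_M = {"HT Arch": 4.1, "HT Bact": 2.2, "MS Arch": 2.7, "MS Bact": 1.3}


def log_ratio(r_a: float, r_b: float) -> float:
    """ratio(Ga, Gb) = log(r_Ga(k) / r_Gb(k)); base 10 is an assumption."""
    if r_a <= 0 or r_b <= 0:
        raise ValueError("r(k) values must be positive")
    return math.log10(r_a / r_b)


def archaeal_bias(r: dict[str, float]) -> str:
    """Compare archaea with bacteria separately for both habitats (Fig. 3A)."""
    hyper = log_ratio(r["HT Arch"], r["HT Bact"])  # ratio(G1, G2)
    meso = log_ratio(r["MS Arch"], r["MS Bact"])   # ratio(G3, G4)
    if hyper > 0 and meso > 0:
        return "over"   # stronger in archeal GIs in both habitats
    if hyper < 0 and meso < 0:
        return "under"  # weaker in archeal GIs in both habitats
    return "mixed"


if __name__ == "__main__":
    print(archaeal_bias(R_CLASS_M))
```

For class M both ratios are positive, which agrees with the statement that its overrepresentation is more pronounced in archaea than in bacteria. The sketch ignores the significance filter mentioned in the legend of Fig. 3.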

A second, significant difference in GI composition between bacteria and archaea was related to the putative origin of horizontally transferred genes: The proportions described above for the putative donors were similar in all COG classes. For hyperthermophilic archaea, GIs were composed similarly to those seen in mesophilic archaea (compare Figs. 1A, 1B, and 3B). The fraction of pAb genes was, however, lower. In summary, it was approximately 11%, and only a fraction of pA genes classified as belonging to the COG classes M (cell wall/membrane/envelope biogenesis), Q (secondary metabolites biosynthesis, transport, and catabolism), and X (integrases, transposases, and inactivated derivatives) had a distinct bacterial codon usage (P < 0.01). In GIs of mesophilic bacteria, the portion of genes presumably originating from archaea (pAa) was minimal (in summary <1%). In each individual COG class, this fraction was less than 2%.

One might argue that, owing to an overrepresentation of bacterial genomes, the composition of the COG database is biased toward gene-encoded functions preferentially found in bacteria. If this were the case, archaea-specific gene-encoded functions might be underrepresented among the COGs, giving rise to a skewed distribution.

Table 2

Composition of microbial GIs with respect to COG classes and the putative origin of pA genes

| Class | Function | Archaea # pN | Archaea # pA | Archaea r(k) | Archaea # pAa | Archaea r(k), pAa | Archaea # pAb | Archaea r(k), pAb | HT Arch r(k) | MS Arch r(k) | HT Bact r(k) | MS Bact r(k) | Bacteria # pN | Bacteria # pA | Bacteria r(k) | Bacteria # pAa | Bacteria r(k), pAa | Bacteria # pAb | Bacteria r(k), pAb |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| J | Translation, ribosomal structure and biogenesis | 2001 | 16 | 0.2 | 5 | 0.2 | 1 | 0.0 | 0.2 | 0.1 | 0.2 | 0.3 | 7124 | 113 | 0.3 | 0 | 0.0 | 87 | 0.3 |
| A | RNA processing and modification | 21 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 22 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 |
| K | Transcription | 1138 | 43 | 0.9 | 17 | 1.1 | 12 | 0.9 | 1.0 | 0.6 | 0.8 | 1.5 | 7562 | 654 | 1.5 | 6 | 1.6 | 474 | 1.5 |
| L | Replication, recombination and repair | 943 | 53 | 1.3 | 19 | 1.4 | 13 | 1.2 | 1.0 | 1.2 | 1.7 | 1.0 | 4676 | 276 | 1.0 | 3 | 1.3 | 191 | 1.0 |
| B | Chromatin structure and dynamics | 47 | 1 | 0.5 | 1 | 1.5 | 0 | 0.0 | 0.8 | 0.0 | 3.3 | 0.6 | 32 | 1 | 0.5 | 0 | 0.0 | 1 | 0.7 |
| D | Cell cycle control, cell division, chromos. partitioning | 181 | 8 | 1.0 | 1 | 0.4 | 1 | 0.5 | 1.2 | 0.4 | 0.4 | 0.5 | 1122 | 34 | 0.5 | 0 | 0.0 | 21 | 0.4 |
| V | Defense mechanisms | 286 | 13 | 1.0 | 1 | 0.3 | 7 | 2.1 | 0.8 | 1.0 | 2.3 | 1.5 | 1667 | 148 | 1.5 | 2 | 2.4 | 121 | 1.7 |
| T | Signal transduction mechanisms | 448 | 15 | 0.8 | 5 | 0.8 | 6 | 1.1 | 1.1 | 0.5 | 1.1 | 1.0 | 4687 | 260 | 1.0 | 2 | 0.9 | 214 | 1.1 |
| M | Cell wall/membrane/envelope biogenesis | 562 | 96 | 3.9 | 35 | 4.4 | 31 | 4.7 | 4.1 | 2.7 | 2.2 | 1.3 | 6029 | 452 | 1.3 | 6 | 2.0 | 332 | 1.3 |
| N | Cell motility | 240 | 4 | 0.4 | 1 | 0.3 | 1 | 0.4 | 0.2 | 0.6 | 1.0 | 1.8 | 2166 | 220 | 1.7 | 0 | 0.0 | 187 | 2.0 |
| W | Extracellular structures | 1 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | – | 0.0 | 0.0 | 1.5 | 21 | 2 | 1.6 | 0 | 0.0 | 2 | 2.2 |
| U | Intracell. trafficking, secretion, and vesicular transport | 292 | 5 | 0.4 | 2 | 0.5 | 3 | 0.9 | 0.2 | 0.8 | 0.2 | 2.1 | 2637 | 320 | 2.1 | 0 | 0.0 | 258 | 2.3 |
| O | Posttransl. modification, protein turnover, chaperones | 820 | 15 | 0.4 | 5 | 0.4 | 6 | 0.6 | 0.2 | 0.6 | 0.1 | 0.6 | 4219 | 134 | 0.5 | 0 | 0.0 | 107 | 0.6 |
| C | Energy production and conversion | 1910 | 27 | 0.3 | 8 | 0.3 | 7 | 0.3 | 0.2 | 0.4 | 0.8 | 0.6 | 6448 | 206 | 0.6 | 3 | 1.0 | 162 | 0.6 |
| G | Carbohydrate transport and metabolism | 1047 | 31 | 0.7 | 13 | 0.9 | 8 | 0.7 | 0.6 | 0.4 | 1.7 | 0.8 | 7772 | 379 | 0.8 | 3 | 0.8 | 308 | 0.9 |
| E | Amino acid transport and metabolism | 2132 | 45 | 0.5 | 19 | 0.6 | 15 | 0.6 | 0.4 | 0.6 | 0.9 | 0.5 | 10,617 | 334 | 0.5 | 7 | 1.3 | 245 | 0.5 |
| F | Nucleotide transport and metabolism | 656 | 7 | 0.3 | 3 | 0.3 | 1 | 0.1 | 0.3 | 0.4 | 0.1 | 0.3 | 2710 | 52 | 0.3 | 0 | 0.0 | 42 | 0.4 |
| H | Coenzyme transport and metabolism | 1264 | 8 | 0.3 | 1 | 0.1 | 3 | 0.2 | 0.2 | 0.2 | 0.3 | 0.5 | 4458 | 138 | 0.5 | 0 | 0.0 | 104 | 0.5 |
| I | Lipid transport and metabolism | 582 | 7 | 0.3 | 1 | 0.1 | 0 | 0.0 | 0.2 | 0.3 | 0.4 | 0.7 | 3831 | 149 | 0.7 | 0 | 0.0 | 112 | 0.7 |
| P | Inorganic ion transport and metabolism | 1268 | 38 | 0.7 | 14 | 0.8 | 12 | 0.8 | 0.8 | 0.5 | 1.1 | 0.8 | 6461 | 283 | 0.8 | 5 | 1.6 | 210 | 0.8 |
| Q | Sec. metabolites biosynthesis, transport and catabolism | 401 | 39 | 2.2 | 10 | 1.8 | 15 | 3.2 | 2.3 | 1.8 | 1.2 | 1.1 | 2878 | 173 | 1.0 | 0 | 0.0 | 127 | 1.0 |
| R | General function prediction only | 3950 | 204 | 1.2 | 66 | 1.1 | 72 | 1.6 | 1.1 | 1.4 | 1.1 | 1.0 | 14,159 | 834 | 1.0 | 6 | 0.9 | 633 | 1.0 |
| S | Function unknown | 2301 | 101 | 1.0 | 17 | 0.5 | 22 | 0.8 | 1.2 | 0.8 | 0.5 | 1.3 | 9331 | 708 | 1.3 | 1 | 0.2 | 495 | 1.2 |
| X | Integrases, transposases and inactivated derivatives | 381 | 219 | 13.2 | 82 | 15.1 | 32 | 7.2 | 14.4 | 9.3 | 7.7 | 8.9 | 1272 | 668 | 9.0 | 11 | 17.6 | 368 | 6.7 |
| SUM | | 22,872 | 995 | | 326 | 33% | 268 | 27% | | | | | 111,901 | 6538 | | 55 | 0.8% | 4801 | 73% |
| unCOGed | pAa [%] | | 904 | | 181 | 20% | 277 | 31% | 28% | 2% | 4% | 0.6% | | 5379 | | 33 | 0.6% | 2839 | 53% |
| unCOGed | pAb [%] | | | | | | | | 17% | 59% | 44% | 53% | | | | | | | |

The first column lists for each COG class k the one-letter code and its function. The following columns give for archaea and bacteria the absolute numbers of unsuspicious genes (# pN), of putatively alien genes (# pA), and the ratio of normalized fractions. The columns # pAa and # pAb give absolute numbers for pA genes originating presumably from archaea or bacteria. For hyperthermophilic and mesophilic archaea (columns HT Arch and MS Arch) and bacteria (columns HT Bact and MS Bact), the ratio values r(k) = f_k(pA)/f_k(pN) are given. Ratio values printed in bold result from (# pA, # pN) pairs, whose skew is statistically significant on the 1% level. The row labeled SUM gives the sum of pA, pAa, and pAb genes for the respective species group and the respective fraction values in percentage. The last row lists the number of pA genes, which are not elements of the COG database (unCOGed) and for unCOGed pAa and pAb genes their contribution to the respective number of all unCOGed genes. All r(k) values were rounded to one decimal place.

Fig. 1. Classification of putatively alien (pA) and putatively native (pN) genes in genomic islands of mesophilic (A) and hyperthermophilic (B) archaea. According to SIGI's prediction, pA genes were grouped with respect to their likely donors. Bars pAb give the fraction of pA genes presumably originating from a bacterial donor. pAa are genes presumably imported from an archeal source. The codon usage in pAu genes is unspecific. For abbreviations of COG classes, see Table 2. The bars on the right named unCOGed give the proportions of pAb, pAa, and pAu genes among those genes, which are not elements of the COG database. (C) A plot of log(r(k)) values indicating the over- or underrepresentation of each COG class k. For columns marked with an asterisk, the skew of respective # pA and # pN values is statistically significant (P < 0.01, determined for each pair with a χ² test).

Therefore, the putative origin of unCOGed genes was analyzed. The fractions of pAa and pAb genes among the unCOGed ones were similar to the numbers listed in Table 1; see last row of Table 2 and Figs. 1 and 2. In addition, the annotation of unCOGed genes was interpreted. In all groups, more than 80% of these genes were annotated as hypothetical or with a putative function. The remaining 20% were frequently annotated as transposases or as elements of transposons or prophages. This was also true for archeal GIs; e.g., in S. solfataricus the content of transposons ISC1043, ISC1058, ISC1212, ISC1217, ISC1225, and ISC1234 was predicted as GIs. In summary, there was no evidence for the assumption that the above presented imbalance of putative donors might be an artifact originating from an underrepresentation of archaea-specific gene-encoded functions in the COG database.

Gene flux in related species

For a comparison of pathogenic and nonpathogenic species and the analysis of small and large genomes of species belonging to the same genera, four genomes of Escherichia and genome pairs of Listeria and Methanosarcinaceae were analyzed. Fig. 4 summarizes the findings for Escherichia genomes, namely Escherichia coli O157:H7 [24], E. coli O157:H7 EDL933 [25], E. coli K-12 [26], and E. blattae [18]. The first two of these are pathogenic. Concerning the composition of GIs, the E. coli genomes had similar prevalences. In all three genomes, COG classes X (integrases, transposases, and inactivated derivatives), U (intracellular trafficking, secretion, and vesicular transport), N (cell motility), and M (cell wall/membrane/envelope biogenesis) were overrepresented in GIs; compare ratio values in Fig. 4B. The histograms also showed that there were no drastic differences between pathogenic and nonpathogenic E. coli strains with respect to GI composition. GIs in E. coli K-12 tended to harbor more gene-encoded functions related to classes N (cell motility) and U (intracellular trafficking, secretion, and vesicular transport). Among the individual COG functions that were most strongly overrepresented (P < 0.05) in GIs of all E. coli strains are gene-encoded functions related to P pilus assembly. Concerning GIs in E. blattae, the underrepresentation of classes N (cell motility) and U (intracellular trafficking, secretion, and vesicular transport) is striking. The GI composition of E. coli K-12 and E. blattae differed statistically significantly (P << 0.01). The analyzed genomic data set for E. blattae consisted of five contigs covering >98% of the genome. Due to the high coverage, we do not expect significant deviations from our findings for the complete genome.

Fig. 2. Classification of putatively alien (pA) and putatively native (pN) genes in genomic islands of mesophilic and hyperthermophilic bacteria. For details, see legend to Fig. 1.

Among the two Listeria species analyzed here, Listeria innocua is nonpathogenic and Listeria monocytogenes is pathogenic. The genomes have nearly the same size of 3.01 and 2.94 Mb, respectively [21]. SIGI identified in L. innocua 343 pA genes (11%) and in L. monocytogenes 271 (9.4%). The median of GI length was 8 genes in both species. The fraction of hypothetical genes in GIs was in both cases approx. 50% and among pN genes <15%. The COG histograms of GIs differed only marginally (see Fig. 5); a χ² test did not detect significant differences (P > 0.9). Besides COG1396, a predicted transcriptional regulator, none of the individual COG functions was significantly overrepresented (P > 0.05) in GIs. Again, these examples support the notion that pathogenicity does not massively influence the content of GIs on the level of COG classes.

Fig. 3. Log ratios indicating the prevalences for COG classes in genomic islands of hyperthermophilic or mesophilic species. (A) For each COG class k the ratios ratio(G1, G2) = log(r_G1(k)/r_G2(k)) and ratio(G3, G4) = log(r_G3(k)/r_G4(k)) were determined for the groups G1 = hyperthermophilic archaea, G2 = hyperthermophilic bacteria, G3 = mesophilic archaea, G4 = mesophilic bacteria. Only those r(k) values were considered that belong to COG classes having a statistically significant skew (P < 0.05). If both ratio values are positive (negative), the considered class is over- (under-) represented in all archeal GIs, irrespective of the habitat. (B) For each COG class k the ratios ratio(G1, G3) = log(r_G1(k)/r_G3(k)) and ratio(G2, G4) = log(r_G2(k)/r_G4(k)) were determined as described above. If both ratios were positive (negative), the considered class was over- (under-) represented in GIs of hyperthermophilic species, irrespective of the considered superkingdom.

The genomes of the two completely sequenced archeal

Methanosarcinaceae differ considerably in size. The

genome of Methanosarcina acetivorans [27] is 39% (1.65

Mb) larger than the one ofMethanosarcina mazei [19]. SIGI

identified in M. mazei 310 pA genes (8%) and in M.

acetivorans 538 (11%). GIs in the genome of M. acetivor-

ans were significantly longer: The median of GI length was

in M. mazei 9 genes and in M. acetivorans 13 genes. The

distributions of GI lengths were statistically significantly

different (P << 0.01, determined with a KS test). The

histograms of COG classes are plotted in Fig. 6. In both

species, the overall distributions reflect the general preva-

lence of archeal GIs (see Fig. 6A). Again, the composition

of GIs differed only marginally with respect to COG classes.

The two COG classifications of all pA genes and those of

pAb genes did not differ statistically significantly (P > 0.1).

The comparison of the two proteomes by means of

bidirectional BLAST analysis identified 1051 (23%) spe-

cies-specific genes in the genome of M. acetivorans. This

fraction was roughly consistent with values determined for

other genomes; see, e.g., [28]. Of these genes 59% were

located in the genome half located around the terminus of

replication. The same area harbored 58% of pA genes. The

overrepresentation of both the species-specific and the pA

genes in the ter half was statistically significant (P << 0.01,

determined with a v2 test).Sinorhizobium meliloti and Mesorhizobium loti are both

Rhizobiales. The genome of M. loti [29] (7.60 MB, 6746

genes) is larger than that of S. meliloti [30] (6.69 MB, 3341

genes) and harbors a symbiosis island [31]. SIGI identified

in the genome of M. loti 1028 (15%) and in the genome of

S. meliloti 209 (5%) pA genes. The difference in GI

composition was statistically significant (P < 0.025).

Interestingly, the fraction of genes belonging to class X is

larger in GIs of the smaller genome. S. meliloti contained 22

GIs with a median length of 7 genes, whereas M. loti had 56

GIs with a median length of 11 genes. To identify the

gene-encoded functions most strongly overrepresented in GIs

of M. loti, we determined the five COG classes with the

largest ratio r(k). In GIs of M. loti, genes related to class T

(signal transduction mechanisms), U (intracellular trafficking,

secretion, and vesicular transport), C (energy production and

conversion), H (coenzyme transport and metabolism), and Q

(secondary metabolites biosynthesis, transport and catabolism)

were most strongly overrepresented; compare Fig. 7.

Fig. 4. Classification of putatively alien (pA) genes in genomic islands (GIs) of Escherichia. (A) Plot of fraction values fk(pA) for all COG classes as determined in GIs of E. coli K-12, E. coli O157:H7 EDL933, E. coli O157:H7, and E. blattae. (B) Ratio values r(k) = fk(pA)/fk(pN) illustrating the under-/overrepresentation of COG classes in GIs of the respective genomes. A positive log(r(k)) value indicates an overrepresentation, a negative log(r(k)) value the underrepresentation of the respective class k in GIs. For COG classes, see Table 2.

A. Wiezer, R. Merkl / Genomics 86 (2005) 462–475 469

In the genomes of Thermus thermophilus [20] and

Deinococcus radiodurans [32], which are two representatives

of a distinct bacterial phylum, SIGI found only a small

number of pA genes. For T. thermophilus, SIGI identified

97 (4.8%) pA genes and for D. radiodurans on chromosome

one 32 (1.3%) and on chromosome two 34 (9.5%) pA genes,

respectively. Ten of these 34 pA genes on chromosome two

belong to class M (cell wall/membrane/envelope biogenesis)

and constitute a cluster consisting of genes DRA0026–

DRA0048, coding mostly for transferases. T. thermophilus

is an extremely thermophilic bacterium. In its GIs, 10% of

the genes were related to COG class M (cell wall/

membrane/envelope biogenesis) and 6% to class V

(defense mechanisms, data not shown). The two species

analyzed here are naturally transformable [32–34]. D.

radiodurans lives in locations rich in organic nutrients,

which are populated by a plethora of microbes and a diverse

gene pool. It is unclear why these genomes harbor only a

small number of pA genes.

Discussion

Each analysis of HGT raises the question of whether the

approach of identifying horizontally transferred genes is

valid. A variety of methods have been developed to identify

GIs. The underlying concepts are based on the analysis of

sequence composition [11,16,35–39] or gene neighborhood

[40], on phylogenetic studies [41] or on a combination of

these approaches [42,43]. Here, we first discuss SIGI’s potential to identify GIs and then the results.

Codon usage analysis reliably allows identification of GIs

It has been argued that approaches based on codon usage

analysis generate large numbers of false positive or false

negative hits. However, it has been shown that compositional

analyses recognize the vast majority of transfer events [44].

The risk of generating false positive hits can be reduced by

focusing on the prediction of pA gene clusters. Such an

approach elegantly exploits and combines biological evi-

dence and statistical principles. As has been established,

genomic islands frequently have a size of 10–200 kb [10].

This is an important premise, because the probability of

predicting false positives decreases drastically for gene

clusters. These arguments were considered in the design of

SIGI [16] and another recently introduced algorithm [11].
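The benefit of predicting clusters rather than single genes can be made concrete with a simple bound. As a rough sketch, assume (this is not stated in the source) that each native gene is misclassified as pA independently with probability p. A run of n consecutive genes then consists entirely of false positives with probability

```latex
P(\text{all } n \text{ genes are false positives}) = p^{n},
\qquad \text{e.g. } p = 0.1,\; n = 5 \;\Rightarrow\; p^{n} = 10^{-5}.
```

Neighboring genes are not truly independent in composition, so the real decrease is probably less steep; the value p = 0.1 is purely illustrative.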

Fig. 5. Classification of putatively alien (pA) genes in genomic islands of Listeria. For details, see legend of Fig. 4.

There is a second line of arguments that lends additional

support to algorithms analyzing compositional complexity.

The assumption that these methods might overlook genes

acquired by horizontal transfer could be valid for more

ancient events, which were subject to the amelioration

process [1] for a longer period of time. More recently

acquired genes were detected to a great extent by methods

based on compositional complexity [5,44]. Lawrence and

Ochman estimated the age of imported genes [35] and

concluded that most were relatively recent, i.e., they were

acquired within the last few million years; see, e.g., [45].

This suggests that older imports have been purged from the

genomes presumably because these genes did not improve

fitness [4]. Given this reasoning, one might expect

amelioration not to affect the majority of pA genes.

Gene flux depends on lifestyle and is directional in

mesophilic species

In hyperthermophilic species, genes were exchanged

across the superkingdom border to nearly the same extent

in both directions. SIGI’s prediction (see Table 1) coincides

with the evidence for massive HGT in Thermotoga maritima

from archaea [13]. However, the absolute numbers of pA

genes found in these genomes are relatively small, if

compared to mesophilic ones. For mesophilic species, we

observe a directed gene flux from bacteria to

archaea. This observation is in agreement with

findings obtained for pathogenic bacteria. Such genomes

were shown to harbor only a low number of archeal genes

[46]. These results and our findings are in contrast to the

hypothesis introduced by Faguy [47], postulating that HGT from

archaea to E. coli contributes to the emergence of novel

diseases. According to the results presented above, the

transfer of archeal genes to genomes of mesophilic bacteria

is rare. Interestingly, no pathogenic archaeon is known so far.

Compared to bacterial genomes, archeal ones contain

fewer genes related to carbohydrate transport

and metabolism, cell wall/membrane/envelope biogenesis,

and inorganic ion transport and metabolism [48]. The lower

number of genes encoding functions for cell wall synthesis

was frequently explained by the fact that archaea possess a

different cell wall demanding fewer enzymes [49]. SIGI’s

output suggests that 17% of related archeal genes were

horizontally acquired; most likely one-third of these pA

genes originate from bacteria. In the case of archaea, the

habitat appears to have a minor effect on gene-encoded

functions acquired via HGT: The overall classification of pA

genes identified in GIs of hyperthermophilic archaea is

similar to that determined for mesophilic archaea. The

observation that hyperthermophilic archaea harbor fewer

genes acquired from bacteria is in agreement with previous

results [50].

Fig. 6. Classification of putatively alien (pA) genes in genomic islands of Methanosarcina acetivorans (first column) and Methanosarcina mazei (second column). For abbreviations of COG classes, see Table 2. (B) log(rMa(k)/rMm(k)) values illustrate the under-/overrepresentation of COG classes in the genome of M. acetivorans if compared to M. mazei. A positive log(ratio) value indicates an overrepresentation of the respective class k in GIs of M. acetivorans. Black bars give the ratio derived from all pA genes; gray bars indicate the skew for pA genes originating from bacterial donors.

GI content and pathogenicity

The genes distinguishing pathogenic from nonpathogenic

species are often clustered in pathogenicity islands [8,10].

For the samples analyzed here, the content of pA gene

clusters was nearly identical to the composition found in

nonpathogenic species with respect to the COG classification.

There are several reasons that might explain this

correspondence. First, only 50% of pA genes are classified

as COGs. It might be that pathogenicity is encoded, at least

to some extent, in those genes that are not yet elements of

the COG scheme. Second, the clusters identified by SIGI

may be marginally or not at all related to pathogenicity.

Among the COG classes, there is one candidate frequently

overrepresented in GIs that could indicate pathogenicity:

class X. However, the analysis of archeal and bacterial

genomes (see Figs. 1 and 2) and the comparison of

eubacterial genomes (see Fig. 4) or of Listeria demonstrated

that the fraction of these genes including integrases and

transposases is not a marker for pathogenicity. Related

enzymes occurred with similar frequency in both pathogenic

and nonpathogenic species. On the other hand, it has been

shown that SIGI identifies operons related to pathogenicity

[16]. Algorithms like SIGI preferentially detect recently

acquired GIs. Pathogenicity islands, which are a subset of

GIs, will be missed, if the DNA composition is incon-

spicuous or already ameliorated toward the codon usage of

the host. As argued above, genomes may not harbor a large

number of ameliorated genes. If, however, strength and

impact of HGT were complementary to habitat- and

phylum-specific needs, one would expect similar gene-

encoded functions in newly acquired GIs, in both patho-

genic and nonpathogenic species. The findings presented

here support this notion, which makes the correspondence

of GI composition plausible. This concept might also

explain the underrepresentation of gene-encoded functions

related to classes N (cell motility) and U (intracellular

trafficking, secretion, and vesicular transport) in GIs of E.

blattae. This species was isolated from the hindgut of a

cockroach [51]. Likely, the composition of GIs is related to

the commensal lifestyle of E. blattae.

One might argue that most of the GIs of closely related

species we compared here were acquired before speciation.

If this is the case, our results demonstrate that since then,

composition of GIs was rather stable with respect to a

general classification of gene-encoded functions. This

stability would again support the notion that respective

gene-encoded functions enhance fitness; otherwise, a purg-

ing of such GIs seems more plausible. Our hypothesis that

the evolution of a pathogenic lifestyle is not correlated with

a specific composition of GIs is independent of acquisition

time. If GIs have been acquired before speciation, pathogenic

species did not accumulate significant amounts of

genes with different functions since then. If GIs were

acquired after speciation, accumulated functions are similar

in both pathogenic and nonpathogenic species.

Fig. 7. (A) Classification of putatively alien (pA) genes in genomic islands (GIs) of two Rhizobiales. For abbreviations of COG classes, see Table 2. Dark gray bars give the distribution of COG classes in GIs of Sinorhizobium meliloti; light gray bars give the distribution of COG classes in GIs of Mesorhizobium loti. (B) log(rMl(k)/rSm(k)) values show the over-/underrepresentation of COG classes in the genome of M. loti if compared to S. meliloti. A positive log(ratio) value indicates an overrepresentation of the respective class k in GIs of M. loti.

There is another effect to be considered: Genomes of

pathogenic species are frequently characterized by a

massive loss of genes [52–54]. This effect is probably

due to the lifestyle of most pathogens as intracellular and

opportunistic parasites. Such environments offer rich

resources, making gene functions obsolete.

In summary and for the cases studied here, it seems

plausible to assume that pathogenicity does not massively

depend on newly imported genes.

HGT activity does not explain genome size

The genomes of the two (archeal) Methanosarcina and

the two (bacterial) Rhizobiales species differ considerably in

size. A statistically significant correlation of gene content

with genome size is known for the following COG classes

[48]: J, L, D, F (negative correlation) and K, N, T, C, Q

(positive correlation). For the Methanosarcina species, the

composition of GIs does not differ significantly with respect

to the COG classification (compare Fig. 6B). For GIs of

Rhizobiales, COG classes T (signal transduction mecha-

nisms), C (energy production and conversion), and Q

(secondary metabolites biosynthesis, transport and catabo-

lism) follow the genome-size-specific trend. Additionally,

COG classes M (cell wall/membrane/envelope biogenesis),

U (intracellular trafficking, secretion, and vesicular trans-

port), and H (coenzyme transport and metabolism) are

overrepresented in GIs (compare part B in Fig. 7). There is

no evidence for a correlation of these functions with genome

size [48]. These findings suggest that the composition of

GIs does not merely reflect the size dependency of genome

composition.

Most of the species-specific genes accumulated in the

genome of M. acetivorans and not found in the genome of

M. mazei were not classified as pA. This result indicates that

codon usage in these genes does not significantly deviate

from the preferences in native genes. There is evidence that

gene genesis is two times more frequent than HGT [55].

Therefore, it is plausible to assume that this phenomenon is

the source for those species-specific genes having an

unsuspicious codon usage. The accumulation of species-

specific and pA genes in the ter half supports the notion of a

distinct pattern of genome organization: essential genes are

enriched near the origin of replication; the area near the

replication terminus seems to be the place for gene genesis

or HGT. This notion is also supported by the finding that the

ter region is an area for hyperrecombination [56].
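The enrichment of species-specific genes in the ter half of M. acetivorans can be checked roughly with a goodness-of-fit calculation. The sketch below assumes a null expectation of 50% of the genes per genome half and reconstructs the counts from the rounded percentages (59% of 1051 is about 620); the expectation actually used in the study is not stated here, and the label "ori" for the other genome half is introduced only for this example, so the number is indicative only.

```latex
\chi^2 = \sum_{i \in \{\mathrm{ter},\,\mathrm{ori}\}} \frac{(O_i - E_i)^2}{E_i}
       = \frac{(620 - 525.5)^2}{525.5} + \frac{(431 - 525.5)^2}{525.5}
       \approx 34.0
```

With one degree of freedom, the critical value at P = 0.01 is 6.63, so under this assumption the deviation is consistent with the reported P << 0.01.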

What is the origin of pAu genes?

In all cases studied above, at least 30% of pA genes had a

codon usage classified neither as bacterial nor as archeal.

We cannot exclude that the approach implemented with

SIGI or the limited knowledge of codon usage did not allow

us to resolve this uncertainty. Alternatively, these genes may

have been transferred frequently among environments, each of

which modulated codon usage in its own way and to a different

degree. The annotations of pA genes without a COG assignment

indicated for many of these genes an origin in

transposons, prophages, or viruses, making a frequent

transfer plausible. An exhaustive analysis of these genes is

an interesting task for the future.

Materials and methods

Categorizing gene-encoded function in genomic islands

The genomes under consideration were first analyzed

using SIGI. Then, for each gene, the COG classification

was determined and, according to SIGI’s prediction, added to

one of two histograms summing the COG classifications for putatively

native (pN) and putatively alien (pA) genes. These

occurrences (# pN, # pA) were used to calculate relative

frequencies fk(pN) and fk(pA) and ratios r(k) = fk(pA)/fk(pN).

To illustrate and scale the skew in the distribution of COG

classes among pA and pN genes, the logarithm of the

frequency ratio, log(fk(pA)/fk(pN)) (log odds), was computed. The

conversion to relative frequencies is a necessary step, if one

wants to compare relative abundances of COG classes

among pA and pN genes or between different species

groups. For the study of gene flux between superkingdoms,

SIGI’s predictions were sorted into three classes: Genes with

a putatively bacterial donor (pAb), genes with a putatively

archeal donor (pAa), and genes having a codon usage that

could not be clearly assigned as bacterial or archeal (pAu).

A χ² test was used on absolute gene numbers (# pN, # pA,

or respective groups) to assess statistical relevance. The size

of GIs was determined in multiples of genes and summed in

a histogram to determine the median length. Resulting

distributions were compared using the Kolmogorov-Smir-

nov (KS) test.
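The quantities defined in this paragraph can be sketched in a few lines. The example below is a minimal illustration, not the authors' implementation: the function names, the base-10 logarithm, and the toy counts are assumptions, and both count tables are assumed to be non-empty. It computes the relative frequencies fk, the log ratios log(r(k)), and the Pearson χ² statistic on the absolute counts.

```python
import math


def relative_frequencies(counts):
    """Convert absolute counts per COG class into fractions f_k."""
    total = sum(counts.values())
    return {k: n / total for k, n in counts.items()}


def log_ratios(pa_counts, pn_counts):
    """log10 of r(k) = f_k(pA) / f_k(pN) for classes observed in both sets."""
    f_pa = relative_frequencies(pa_counts)
    f_pn = relative_frequencies(pn_counts)
    return {
        k: math.log10(f_pa[k] / f_pn[k])
        for k in f_pa
        if f_pa[k] > 0 and f_pn.get(k, 0) > 0
    }


def chi_square(pa_counts, pn_counts):
    """Pearson chi-square statistic and degrees of freedom for a 2 x K table."""
    # Keep only classes that occur at least once, so expected counts are > 0.
    classes = [
        k for k in sorted(set(pa_counts) | set(pn_counts))
        if pa_counts.get(k, 0) + pn_counts.get(k, 0) > 0
    ]
    row_totals = [sum(pa_counts.values()), sum(pn_counts.values())]
    grand_total = sum(row_totals)
    stat = 0.0
    for k in classes:
        observed = [pa_counts.get(k, 0), pn_counts.get(k, 0)]
        col_total = sum(observed)
        for obs, row_total in zip(observed, row_totals):
            expected = row_total * col_total / grand_total
            stat += (obs - expected) ** 2 / expected
    return stat, len(classes) - 1


# Toy counts per COG class; hypothetical, NOT data from the study.
pa_example = {"X": 30, "L": 10, "J": 10}
pn_example = {"X": 10, "L": 40, "J": 50}

print(log_ratios(pa_example, pn_example))
print(chi_square(pa_example, pn_example))
```

Working on absolute counts for the test and on relative frequencies for the ratios mirrors the distinction made in the text: the test needs sample sizes, whereas the ratios must be comparable between gene sets of different size.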

The functional category L (DNA replication, recombi-

nation, and repair) of the COG database subsumes several

COGs annotated as transposases and integrases. These gene-

encoded functions are exceedingly overrepresented in GIs.

To isolate these protein functions, we have created a new

category named X (integrases, transposases, and inactivated

derivatives) which harbors the COG functions COG0582,

COG0675, COG1662, COG1943, COG2452, COG2801,

COG2826, COG2963, COG3039, COG3293, COG3316,

COG3328, COG3335, COG3385, COG3415, COG3436,

COG3464, COG3547, COG3666, COG3676, COG3677,

COG4584, COG4644, COG5421, COG5433, COG5558,

and COG5659.
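The reassignment to category X amounts to a lookup against the list above. A minimal sketch, assuming each gene carries a COG identifier and its original one-letter class (the function name `functional_class` is hypothetical):

```python
# COG identifiers moved to the new category X, as listed in the text above.
CLASS_X_COGS = frozenset({
    "COG0582", "COG0675", "COG1662", "COG1943", "COG2452", "COG2801",
    "COG2826", "COG2963", "COG3039", "COG3293", "COG3316", "COG3328",
    "COG3335", "COG3385", "COG3415", "COG3436", "COG3464", "COG3547",
    "COG3666", "COG3676", "COG3677", "COG4584", "COG4644", "COG5421",
    "COG5433", "COG5558", "COG5659",
})


def functional_class(cog_id, original_class):
    """Return 'X' for integrase/transposase COGs, else the original class."""
    return "X" if cog_id in CLASS_X_COGS else original_class


# A COG from the list is moved to X; any other COG keeps the class passed in.
print(functional_class("COG0582", "L"))
print(functional_class("COG1396", "K"))
```

The class letter passed for COG1396 in the second call is illustrative; the function simply returns whatever class the caller supplies for COGs outside the list.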

Material

The data sets consisted of the following genomes.

Hyperthermophilic archaea: Archaeoglobus fulgidus,

Aeropyrum pernix, Methanocaldococcus jannaschii, Meth-

anopyrus kandleri, Pyrobaculum aerophilum, Pyrococcus

abyssi, Pyrococcus horikoshii, Sulfolobus solfataricus,

Thermococcus kodakaraensis;

Mesophilic archaea: Halobacterium sp. NRC-1, Meth-

anosarcina acetivorans str. C2A, Methanosarcina mazei,

Methanothermobacter thermautotrophicus;

Archaea: all archeal genomes annotated in the COG

database;

Hyperthermophilic bacteria: Aquifex aeolicus, Thermo-

toga maritima, Thermoanaerobacter tengcongensis;

Thermophilic bacteria: Geobacillus kaustophilus, Ther-

mus thermophilus HB 27, Thermosynechococcus elongatus,

Symbiobacterium thermophilum;

Mesophilic bacteria: Synechocystis, Nostoc sp. PCC

7120, Fusobacterium nucleatum, Deinococcus radiodurans,

Corynebacterium glutamicum, Mycobacterium tuberculosis

H37Rv, Mycobacterium tuberculosis CDC1551, Mycobac-

terium leprae, Clostridium acetobutylicum, Lactococcus

lactis, Streptococcus pyogenes M1 GAS, Streptococcus

pneumoniae TIGR4, Staphylococcus aureus N315, Listeria

innocua, Bacillus subtilis, Bacillus halodurans, Ureaplasma

urealyticum, Mycoplasma pulmonis, Mycoplasma pneumo-

niae, Mycoplasma genitalium, Escherichia coli K-12,

Escherichia coli O157:H7 EDL933, Escherichia coli

O157:H7, Yersinia pestis, Salmonella typhimurium LT2,

Buchnera sp. APS, Vibrio cholerae, Pseudomonas aerugi-

nosa, Haemophilus influenzae, Pasteurella multocida,

Xylella fastidiosa 9a5c, Neisseria meningitidis MC58,

Neisseria meningitidis Z2491, Ralstonia solanacearum,

Helicobacter pylori 26695, Helicobacter pylori J99, Cam-

pylobacter jejuni, Agrobacterium tumefaciens, Sinorhi-

zobium meliloti 1021, Brucella melitensis, Mesorhizobium

loti, Caulobacter crescentus, Rickettsia prowazekii, Rick-

ettsia conorii, Chlamydia trachomatis, Chlamydophila

pneumoniae CWL029, Treponema pallidum, Borrelia

burgdorferi;

Bacteria: all bacterial genomes annotated in the COG

database.

Acknowledgments

The project was carried out within the framework of the

Competence Network Göttingen BiotechGenoMik financed

by the German Federal Ministry of Education and Research

(BMBF). We thank H. Liesegang for his help in preparing

genomic data sets and the referees for their constructive

comments.

References

[1] J.G. Lawrence, H. Ochman, Amelioration of bacterial genomes: rates

of change and exchange, J. Mol. Evol. 44 (1997) 383–397.

[2] W.F. Doolittle, Phylogenetic classification and the universal tree,

Science 284 (1999) 2124–2129.

[3] B. Snel, P. Bork, M.A. Huynen, Genomes in flux: the evolution of

archaeal and proteobacterial gene content, Genome Res. 12 (2002)

17–25.

[4] C.G. Kurland, B. Canbäck, O.G. Berg, Horizontal gene transfer: a

critical view, Proc. Natl. Acad. Sci. U. S. A. 100 (2003) 9658–9662.

[5] M.A. Ragan, On surrogate methods for detecting lateral gene transfer,

FEMS Microbiol. Lett. 201 (2001) 187–191.

[6] O. Zhaxybayeva, P. Lapierre, J.P. Gogarten, Genome mosaicism and

organismal lineages, Trends Genet. 20 (2004) 254–260.

[7] M.A. Ragan, Detection of lateral gene transfer among microbial

genomes, Curr. Opin. Genet. Dev. 11 (2001) 620–626.

[8] G. Blum, M. Ott, A. Lischewski, A. Ritter, H. Imrich, H. Tschäpe,

et al., Excision of large DNA regions termed pathogenicity islands

from tRNA-specific loci in the chromosome of an Escherichia coli

wild-type pathogen, Infect. Immun. 62 (1994) 606–614.

[9] U. Dobrindt, B. Hochhut, U. Hentschel, J. Hacker, Genomic islands in

pathogenic and environmental microorganisms, Nat. Rev., Microbiol.

2 (2004) 414–424.

[10] J. Hacker, J.B. Kaper, Pathogenicity islands and the evolution of

microbes, Annu. Rev. Microbiol. 54 (2000) 641–679.

[11] Y. Nakamura, T. Itoh, H. Matsuda, T. Gojobori, Biased biological

functions of horizontally transferred genes in prokaryotic genomes,

Nat. Genet. 36 (2004) 760–766.

[12] R. Jain, M.C. Rivera, J.A. Lake, Horizontal gene transfer among

genomes: the complexity hypothesis, Proc. Natl. Acad. Sci. U. S. A.

96 (1999) 3801–3806.

[13] K.E. Nelson, R.A. Clayton, S.R. Gill, M.L. Gwinn, R.J. Dodson, D.H.

Haft, et al., Evidence for lateral gene transfer between Archaea and

Bacteria from genome sequence of Thermotoga maritima, Nature 399

(1999) 323–329.

[14] R.L. Tatusov, E.V. Koonin, D.J. Lipman, A genomic perspective on

protein families, Science 278 (1997) 631–637.

[15] R.L. Tatusov, N.D. Fedorova, J.D. Jackson, A.R. Jacobs, B. Kiryutin,

E.V. Koonin, et al., The COG database: an updated version includes

Eukaryotes, BMC Bioinformatics 4 (2003) 41.

[16] R. Merkl, SIGI: Score-based identification of genomic islands, BMC

Bioinformatics 5 (2004) 22.

[17] C. von Mering, M. Huynen, D. Jaeggi, S. Schmidt, P. Bork, B. Snel,

STRING: a database of predicted functional associations between

proteins, Nucleic Acids Res. 31 (2003) 258–261.

[18] A. Wiezer, Entschlüsselung der Genomsequenz von Escherichia

blattae und komparative Bioinformatik mikrobieller Genome, Institut

für Mikrobiologie und Genetik, Georg-August-Universität Göttingen,

2004.

[19] U. Deppenmeier, A. Johann, T. Hartsch, R. Merkl, R.A. Schmitz,

R. Martinez-Arias, et al., The genome of Methanosarcina mazei:

evidence for lateral gene transfer between Bacteria and Archaea,

J. Mol. Microbiol. Biotechnol. 4 (2002) 453–461.

[20] A. Henne, H. Brüggemann, C. Raasch, A. Wiezer, T. Hartsch, H.

Liesegang, et al., The genome sequence of the extreme thermophile

Thermus thermophilus, Nat. Biotechnol. 22 (2004) 547–553.

[21] P. Glaser, L. Frangeul, C. Buchrieser, C. Rusniok, A. Amend, F.

Baquero, et al., Comparative genomics of Listeria species, Science

294 (2001) 849–852.

[22] R. Grantham, C. Gautier, M. Gouy, R. Mercier, A. Pavé, Codon

catalog usage and the genome hypothesis, Nucleic Acids Res. 8 (1980)

r49–r62.

[23] R. Sandberg, C.I. Brändén, I. Ernberg, J. Cöster, Quantifying the

species-specificity in genomic signatures, synonymous codon choice,

amino acid usage and G+C content, Gene 311 (2003) 35–42.

[24] T. Hayashi, K. Makino, M. Ohnishi, K. Kurokawa, K. Ishii, K.

Yokoyama, et al., Complete genome sequence of enterohemorrhagic

Escherichia coli O157:H7 and genomic comparison with a laboratory

strain K-12, DNA Res. 8 (2001) 11–22.

[25] N.T. Perna, G. Plunkett III, V. Burland, B. Mau, J.D. Glasner, D.J.

Rose, et al., Genome sequence of enterohaemorrhagic Escherichia

coli O157:H7, Nature 409 (2001) 529–533.

[26] F.R. Blattner, G. Plunkett III, C.A. Bloch, N.T. Perna, V. Burland, M.

Riley, et al., The complete genome sequence of Escherichia coli K-12,

Science 277 (1997) 1453–1474.

[27] J.E. Galagan, C. Nusbaum, A. Roy, M.G. Endrizzi, P. Macdonald,

W. FitzHugh, et al., The genome of M. acetivorans reveals

extensive metabolic and physiological diversity, Genome Res. 12

(2002) 532–542.

[28] K.E. Nelson, I.T. Paulsen, J.F. Heidelberg, C.M. Fraser, Status of

genome projects for nonpathogenic bacteria and archaea, Nat.

Biotechnol. 18 (2000) 1049–1054.

[29] T. Kaneko, Y. Nakamura, S. Sato, E. Asamizu, T. Kato, S. Sasamoto,

et al., Complete genome structure of the nitrogen-fixing symbiotic

bacterium Mesorhizobium loti, DNA Res. 7 (2000) 331–338.

[30] F. Galibert, T.M. Finan, S.R. Long, A. Pühler, P. Abola, F. Ampe,

et al., The composite genome of the legume symbiont Sinorhizobium

meliloti, Science 293 (2001) 668–672.

[31] J.T. Sullivan, C.W. Ronson, Evolution of rhizobia by acquisition of a

500-kb symbiosis island that integrates into a phe-tRNA gene, Proc.

Natl. Acad. Sci. U. S. A. 95 (1998) 5145–5149.

[32] O. White, J.A. Eisen, J.F. Heidelberg, E.K. Hickey, J.D. Peterson, R.J.

Dodson, et al., Genome sequence of the radioresistant bacterium

Deinococcus radiodurans R1, Science 286 (1999) 1571–1577.

[33] Y. Koyama, T. Hoshino, N. Tomizuka, K. Furukawa, Genetic trans-

formation of the extreme thermophile Thermus thermophilus and of

other Thermus spp, J. Bacteriol. 166 (1986) 338–340.

[34] A. Friedrich, C. Prust, T. Hartsch, A. Henne, B. Averhoff, Molecular

analyses of the natural transformation machinery and identification

of pilus structures in the extremely thermophilic bacterium Thermus

thermophilus strain HB27, Appl. Environ. Microbiol. 68 (2002)

745–755.

[35] J.G. Lawrence, H. Ochman, Molecular archaeology of the Escherichia

coli genome, Proc. Natl. Acad. Sci. U. S. A. 95 (1998) 9413–9417.

[36] S. Karlin, Detecting anomalous gene clusters and pathogenicity

islands in diverse bacterial genomes, Trends Microbiol. 9 (2001)

335–343.

[37] H.C. Wang, J. Badger, P. Kearney, M. Li, Analysis of codon usage

patterns of bacterial genomes using the self-organizing map, Mol.

Biol. Evol. 18 (2001) 792–800.

[38] Q. Tu, D. Ding, Detecting pathogenicity islands and anomalous gene

clusters by iterative discriminant analysis, FEMS Microbiol. Lett. 221

(2003) 269–275.

[39] T. Abe, S. Kanaya, M. Kinouchi, Y. Ichiba, T. Kozuki, T. Ikemura,

Informatics for unveiling hidden genome signatures, Genome Res. 13

(2003) 693–702.

[40] A. Ruepp, W. Graml, M.L. Santos-Martinez, K.K. Koretke, C.

Volker, H.W. Mewes, et al., The genome sequence of the

thermoacidophilic scavenger Thermoplasma acidophilum, Nature

407 (2000) 508–513.

[41] C.L. Nesbø, S. L’Haridon, K.O. Stetter, W.F. Doolittle, Phylogenetic

analyses of two ‘‘archaeal’’ genes in Thermotoga maritima reveal

multiple transfers between Archaea and Bacteria, Mol. Biol. Evol. 18

(2001) 362–375.

[42] S. Garcia-Vallve, A. Romeu, J. Palau, Horizontal gene transfer in

bacterial and archaeal complete genomes, Genome Res. 10 (2000)

1719–1725.

[43] C.L. Nesbø, W.F. Doolittle, Targeting clusters of transferred genes in

Thermotoga maritima, Environ. Microbiol. 5 (2003) 1144–1154.

[44] V. Daubin, E. Lerat, G. Perrière, The source of laterally transferred

genes in bacterial genomes, Genome Biol. 4 (2003) R57.

[45] F. de la Cruz, J. Davies, Horizontal gene transfer and the origin

of species: lessons from bacteria, Trends Microbiol. 8 (2000)

128–133.

[46] U. Gophna, R.L. Charlebois, W.F. Doolittle, Have archaeal genes

contributed to bacterial virulence?, Trends Microbiol. 12 (2004)

213–219.

[47] D.M. Faguy, Lateral gene transfer (LGT) between Archaea and

Escherichia coli is a contributor to the emergence of novel infectious

disease, BMC Infect. Dis. 3 (2003) 13.

[48] K.T. Konstantinidis, J.M. Tiedje, Trends between gene content and

genome size in prokaryotic species with larger genomes, Proc. Natl.

Acad. Sci. U. S. A. 101 (2004) 3160–3165.

[49] O. Kandler, H. König, Cell wall polymers in Archaea (Archaebac-

teria), Cell. Mol. Life Sci. 54 (1998) 305–308.

[50] K.S. Makarova, L. Aravind, M.Y. Galperin, N.V. Grishin, R.L.

Tatusov, Y.I. Wolf, et al., Comparative genomics of the Archaea

(Euryarchaeota): evolution of conserved protein families, the stable

core, and the variable shell, Genome Res. 9 (1999) 608–628.

[51] N.R. Burgess, S.N. McDermott, J. Whiting, Aerobic bacteria occur-

ring in the hind-gut of the cockroach, Blatta orientalis, J. Hyg. (Lond.)

71 (1973) 1–7.

[52] J.O. Andersson, S.G. Andersson, Insights into the evolutionary

process of genome degradation, Curr. Opin. Genet. Dev. 9 (1999)

664–671.

[53] S.T. Cole, K. Eiglmeier, J. Parkhill, K.D. James, N.R. Thomson, P.R.

Wheeler, et al., Massive gene decay in the leprosy bacillus, Nature 409

(2001) 1007–1011.

[54] A. Mira, H. Ochman, N.A. Moran, Deletional bias and the evolution

of bacterial genomes, Trends Genet. 17 (2001) 589–596.

[55] V. Kunin, C.A. Ouzounis, The balance of driving forces during

genome evolution in prokaryotes, Genome Res. 13 (2003)

1589–1594.

[56] J. Louarn, F. Cornet, V. François, J. Patte, J.M. Louarn, Hyper-

recombination in the terminus region of the Escherichia coli

chromosome: possible relation to nucleoid organization, J. Bacteriol.

176 (1994) 7524–7531.