www.elsevier.com/locate/ygeno
Genomics 86 (2005) 462–475
A comparative categorization of gene flux in diverse microbial species
Arnim Wiezer a,1, Rainer Merkl b,*
a Göttingen Genomics Laboratory, Grisebachstrasse 8, D-37077 Göttingen, Germany
b Institut für Biophysik und Physikalische Biochemie, Universität Regensburg, Universitätsstrasse 31, D-93053 Regensburg, Germany
Received 30 November 2004; accepted 25 May 2005
Available online 18 July 2005
Abstract
Microbial genomes harbor genomic islands (GIs), genes presumably acquired via horizontal gene transfer (HGT). We compared GIs of
hyperthermophilic, thermophilic, mesophilic, and pathogenic/nonpathogenic species and of small and large genomes. The COG database was
used to characterize gene-encoded functions. Putative donors were determined to quantify gene flux between superkingdoms. In
hyperthermophiles, more than 10% of the genes were on average acquired across the superkingdom border. For thermophiles and particularly
mesophiles, we identified a nearly unidirectional export from bacteria to archaea. Additionally, we analyzed GI composition for Escherichia,
and pairs of Listeria, Rhizobiales, Methanosarcinaceae, and Thermus thermophilus/Deinococcus radiodurans. For Escherichia and Listeria,
the composition of GIs in pathogenic and nonpathogenic species did not differ significantly with respect to encoded COG classes. The analysis
of related genomes showed that the composition of GIs cannot be explained with trends of gene content known to depend on genome size.
© 2005 Elsevier Inc. All rights reserved.
Keywords: Horizontal gene transfer; Lateral gene transfer; Gene flux; COG database; Genomic islands; Xenologous genes
Introduction
Horizontal gene transfer (HGT) is considered a strong
evolutionary force enhancing the genetic diversity of
microbes [1]. This effect brings new genes into a genome,
even from taxonomically unrelated species. Thus, HGT
offers the means for a rapid adaptation to environmental
demands [2]. This speed distinguishes HGT from other
processes constantly contributing to the evolution of
genomes, such as point mutations, genetic rearrangements,
gene loss, and genesis (the de novo generation of genes).
The existence of HGT is now generally accepted,
although its quantification [3] and the assessment of its
impact on microbial genomes are still matters of debate [4].
One reason for dissent is that studies aimed at the
quantification of HGT have produced differing results; for a
comparison, see, e.g., [5] or [6]. Their outcomes frequently
vary to a great extent with regard to the fraction of genes
considered as being horizontally transferred. However, it
might be that the different methods identify specific classes
of alien genes and that they are sensitive to different time
intervals of genome evolution [7].
0888-7543/$ - see front matter © 2005 Elsevier Inc. All rights reserved.
doi:10.1016/j.ygeno.2005.05.014
* Corresponding author. Fax: +49(0)941 943 2813.
E-mail address: [email protected] (R. Merkl).
1 Current address: QIAGEN, Qiagen-Strasse 1, D-40724 Hilden, Germany.
Gene clusters having conspicuous compositions or prominent
gene-encoded functions are frequently named "genomic
islands" (GIs). GIs were identified both in pathogenic [8] and
nonpathogenic microbes [9]. It has been observed that
pathogenicity islands, which are a subset of GIs, frequently
have considerable lengths [10]. Recent studies of GI
composition have shown that housekeeping genes related to
cell surface, DNA binding, and pathogenicity were
overrepresented in GIs; see, e.g., [11]. As proposed by the
complexity hypothesis [12], a reason for this asymmetry might be
that informational genes, as opposed to housekeeping genes,
are typically members of large and complex systems,
rendering their exchange among genomes rather unlikely.
Most frequently, HGT has been studied on the genome
scale; see, e.g., [13]. In this article, we present an analysis of
GIs based on larger sets of genomic data and on pairs of
taxonomically related genomes. The study focuses on three
questions. (i) To what extent do bacteria and archaea transfer
genes across the borders of prokaryotic superkingdoms? (ii)
Are GIs of pathogenic and nonpathogenic species differently
composed with respect to gene-encoded functions? (iii) Do
species-specific variations in HGT activity explain differing
genome sizes of taxonomically related species?
We will show that certain amounts of gene flux occur
between hyperthermophiles across the superkingdom border.
Among thermophilic and especially mesophilic species,
gene flux is largely unidirectional from bacteria to archaea,
if evaluated on the superkingdom level. The pairwise
genome comparison of pathogenic/nonpathogenic species
belonging to the genera Escherichia or Listeria illustrates
that gene-encoded functions found in respective GIs do not
differ significantly. Finally, we will correlate genome size
and the amount of GI content for closely related species.
These findings indicate that species-specific variations in
HGT activity do not explain the genome sizes of closely
related species of Methanosarcinaceae or Rhizobiales.
Results
Two complementary approaches were combined to study
gene flux: the COG database [14,15], which classifies genes
on the protein sequence level, and SIGI [16], which analyzes
codon usage and predicts GIs.
Categorizing gene flux
The COG database is a classification system for gene-
encoded functions. Each element of the database (a cluster
of orthologous groups of genes, a COG) is a set of genes
that code for the same function and originate from different
species. COGs are organized in functional categories (also
named COG classes in the following) describing the role of
gene-encoded functions. To date, the database classifies the
genes of 63 prokaryotic species into 4873 COGs. A minimal
COG would consist of three genes from phylogenetically
distinct species exhibiting significant similarity to each
other on the protein-sequence level. Due to this concept, the
fraction of genes that belong to COGs varies among
completely sequenced genomes. On average, 70–75% of
the protein-coding genes of a genome are classified in the
COG database. The STRING database [17] extends the
COG concept and comprises additional genomes. We
retrieved several genomes to supplement the groups of
thermophiles and hyperthermophiles. In addition, we
implemented the program COGGER according to the
COG concept [14], to allow the analysis of genomes which
are not elements of the COG or the STRING database. For
the work described here, we processed the genomes of
Escherichia blattae [18], Methanosarcina mazei [19],
Thermus thermophilus [20], and Listeria monocytogenes
[21]. In addition, we created a new COG class X (see
Table 2), which contains the COGs describing integrases,
transposases, and inactivated derivatives, to isolate these
functions, which are strongly overrepresented in GIs.
Gene-encoded functions related to COG classes Z (cytoskeleton)
and Y (nuclear structure) did not occur in the data
sets and were not processed any further.
Recently, a novel method, named SIGI, was introduced
for the analysis of GIs. It is based on the genome theory [22]
and the taxonomical relatedness of codon usage [23]. The
algorithm focuses on the identification of recently acquired
genes and additionally aims at the prediction of the putative
donor. It determines the codon usage (CU) of each
individual gene in a genome. Clusters of genes whose codon
usage deviated conspicuously from the genome-specific
mean were labeled as GIs; these genes were named
putatively alien genes (pA). For the prediction of the
putative donor, each CU was compared with a list of about
400 codon frequency tables (CUT) representative of microbial
species. For each gene, SIGI considered the taxonomical
relation of those three species from CUT that had the
codon usage most similar to CU. These three species were
labeled in a taxonomy tree. The taxon subsuming these three
entries as child nodes was predicted as the putative source of
the considered pA gene. If all three entries were bacteria or
archaea, the donor was predicted as putatively bacterial
(pAb) or archeal (pAa). In the worst case, i.e., if the three
high scorers belonged to (say) bacteria and archaea, or
bacteria and phages, codon usage was interpreted as
unspecific (pAu). During the design phase, two types of
test beds were applied to test SIGI's predictive power. The
analyses have shown that on average 75% of the predictions of
the putative origin were correct on the species level. In no
case were more than 1% of the predictions wrong on the
level of the superclass. These results indicate that codon
usages of archaea and bacteria are quite distinct.
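The donor assignment described above can be sketched in a few lines. This is only an illustrative reading of the text, not SIGI's actual implementation: the lineage tuples, the function names `shared_taxon` and `donor_label`, and the reduction of the taxonomy tree to a common-prefix search are all assumptions made for this example.

```python
from typing import List, Sequence, Tuple

# A lineage is written root first, e.g. ("Bacteria", "Firmicutes", "Bacillus subtilis").
Lineage = Tuple[str, ...]


def shared_taxon(lineages: Sequence[Lineage]) -> List[str]:
    """Return the path from the root to the taxon that subsumes all lineages."""
    shared: List[str] = []
    for ranks in zip(*lineages):
        if len(set(ranks)) != 1:
            break
        shared.append(ranks[0])
    return shared


def donor_label(top_three: Sequence[Lineage]) -> str:
    """Map the three best-matching codon usage tables to pAb, pAa, or pAu."""
    if len(top_three) != 3:
        raise ValueError("expected the three highest-scoring entries")
    shared = shared_taxon(top_three)
    if shared and shared[0] == "Bacteria":
        return "pAb"
    if shared and shared[0] == "Archaea":
        return "pAa"
    # Mixed superkingdoms (or phages among the hits): unspecific codon usage.
    return "pAu"


# Hypothetical top-three hit lists for two genes.
bacterial_hits = [
    ("Bacteria", "Firmicutes", "Bacillus subtilis"),
    ("Bacteria", "Firmicutes", "Listeria innocua"),
    ("Bacteria", "Proteobacteria", "Escherichia coli"),
]
mixed_hits = [
    ("Bacteria", "Firmicutes", "Bacillus subtilis"),
    ("Archaea", "Euryarchaeota", "Methanosarcina mazei"),
    ("Archaea", "Crenarchaeota", "Sulfolobus solfataricus"),
]
```

With these inputs, the first gene would be labeled pAb because all three lineages share the root "Bacteria", whereas the second would be labeled pAu.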
Gene flux between superkingdoms and in specific habitats
All genomes listed under Materials were analyzed using
SIGI. For each gene, the putative source was determined
and, according to SIGI's prediction, added to histograms
summing up the absolute numbers of putatively native
(# pN) or putatively alien (# pA) genes. pA genes were
separated into three classes according to their predicted
origin and counted as # pAa, # pAb, or # pAu. Table 1 lists
representative findings for genomes having more than 50 pA
genes. Among hyperthermophiles, gene flux was approximately
balanced: At least 10–15% of the genes seemed to
be transferred across the superkingdom border. The portion
of pA genes having a distinct archeal codon usage dropped
to zero for thermophilic and mesophilic bacteria. GIs of
thermophilic archaea harbored 29% of pAb genes; this
fraction increased to 50% for mesophilic archaea. The
fraction of genes with an unspecific codon usage was >40%
in archaea and hyperthermophilic bacteria and <35% in
thermophilic and mesophilic bacteria. A comparison
of the summed # pAa, # pAb, and # pAu genes for
hyperthermophilic and mesophilic species with a χ² test
showed a highly significant deviation from each other for
both archaea and bacteria (P << 0.01).

Table 1
Occurrence of putatively alien genes in GIs of microbial species

Species                                       # pA  [%]  # pAa  [%]  # pAb  [%]  # pAu  [%]
Hyperthermophilic archaea                      984    6    414   41    109   11    461   48
  Aeropyrum pernix                             186    9     96   52     26   14     64   34
  Sulfolobus solfataricus                      253    7    124   49     26   10    103   41
  Archaeoglobus fulgidus                       169    5     49   29     20   12    100   59
  Pyrococcus abyssi                            120    5     58   48      6    5     56   47
  Pyrococcus horikoshii                        129    5     34   26     19   15     76   59
  Thermococcus kodakaraensis                   127    5     53   42     12    9     62   49
Hyperthermophilic bacteria                     477    6     48   15    236   40    193   44
  Thermotoga maritima                          143    6     28   20     50   35     65   45
  Aquifex aeolicus                              52    3     12   23     13   25     27   52
  Thermoanaerobacter tengcongensis             282    9      8    3    173   61    101   36
Thermophilic archaea                           315    6     77   23     88   29    150   48
  Thermoplasma volcanium                       137    8     42   31     32   23     63   46
  Picrophilus torridus                          88    5      2    2     44   50     42   48
  Methanothermobacter thermoautotrophicus       90    5     33   37     12   13     45   50
Thermophilic bacteria                          760    6      2    0    524   65    234   34
  Geobacillus kaustophilus                     394   11      0    0    288   73    106   27
  Thermus thermophilus HB 27                   125    5      0    0     73   58     52   42
  Thermosynechococcus elongatus                 55    2      0    0     34   62     21   38
  Symbiobacterium thermophilum                 186    5      2    1    129   69     55   30
Mesophilic archaea                             916    8     85    8    451   50    380   43
  Halobacterium salinarum                       68    4      0    0     41   60     27   40
  Methanosarcina acetivorans                   538   11     28    5    317   59    193   36
  Methanosarcina mazei                         310    8     57   18     93   30    160   52
Mesophilic bacteria                           2182   12      4    0   1431   68    747   31
  Bacillus subtilis                            469   10      0    0    355   76    114   24
  Escherichia coli K-12                        633   12      4    1    451   71    178   28
  Mesorhizobium loti                          1028   15      0    0    564   55    464   45

For the listed genomes, the occurrence of pA genes having an archeal (# pAa),
bacterial (# pAb), or unspecific codon usage (# pAu) was determined.
Amounts are given as absolute numbers and in percentage related to the
number of genome-specific pA genes. The genomes were grouped into
hyperthermophilic, thermophilic, and mesophilic species. For each group, the
sum of # pA genes and the mean percentage values are listed.
For a characterization of gene-encoded functions, the
COG classification was determined for each gene and,
according to SIGI's prediction, added to histograms summing
for each COG class k the absolute frequencies # p_kN and
# p_kA and, more specifically, # p_kAa, # p_kAb, or # p_kAu.
Absolute frequencies were used to calculate relative
frequency values f_k(pA) and f_k(pN) and ratios
r(k) = f_k(pA)/f_k(pN).
An example may illustrate the determination of relative
frequencies and ratio values. If we consider for COG class X
# pA = 219 and # pN = 381, then we get f_X(pA) = 219/995 =
0.220, f_X(pN) = 381/22,872 = 0.0167, and r(X) = f_X(pA)/
f_X(pN) = 13.2 when computed from the unrounded
frequencies (see numbers for COG class X of
data set "Archaea" in Table 2). Here, 995 is the sum of all pA
genes and 22,872 is the sum of all pN genes determined in the
data set Archaea. This ratio value indicates that genes
assigned to COG class X are 13 times as frequent in archeal
GIs as among pN genes. Genes not assigned to a COG
class were, according to their putative origin, collected in
separate groups named unCOGed.
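The calculation of relative frequencies and ratio values can be reproduced with a short sketch. The function name `cog_ratios` and the dictionary layout are illustrative assumptions; the counts for classes X and M and the totals are taken from the data set Archaea in Table 2, with all remaining classes lumped into a hypothetical "other" entry.

```python
from typing import Dict


def cog_ratios(pa_counts: Dict[str, int], pn_counts: Dict[str, int]) -> Dict[str, float]:
    """Compute r(k) = f_k(pA) / f_k(pN) for every COG class k in pa_counts."""
    total_pa = sum(pa_counts.values())
    total_pn = sum(pn_counts.values())
    ratios: Dict[str, float] = {}
    for k, n_pa in pa_counts.items():
        f_pa = n_pa / total_pa
        f_pn = pn_counts.get(k, 0) / total_pn
        # r(k) is undefined when the class does not occur among pN genes.
        ratios[k] = f_pa / f_pn if f_pn > 0 else float("nan")
    return ratios


# Archaea, Table 2: 995 pA genes and 22,872 pN genes in total.
pa = {"X": 219, "M": 96, "other": 995 - 219 - 96}
pn = {"X": 381, "M": 562, "other": 22872 - 381 - 562}

r = cog_ratios(pa, pn)
print(round(r["X"], 1), round(r["M"], 1))
```

Rounded to one decimal place, the two ratios agree with the r(k) values listed for classes X and M in the Archaea columns of Table 2.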
To test the statistical significance of differences in
histograms, χ² tests were computed on absolute gene
numbers. In bacteria (P << 0.01), archaea (P << 0.01),
hyperthermophilic archaea (P << 0.01), mesophilic archaea
(P = 0.028), and mesophilic bacteria (P << 0.01) the
distributions of COG classes resulting from pA and pN
genes were statistically significantly different. In
hyperthermophilic bacteria, the skew was not significant at the
10% level. This might, however, be due to the small number
of related pA genes. In addition, for each COG class k,
the imbalance between p_kA and p_kN genes was assessed
with a χ² test. Classes having a statistically significant
skew (P < 0.01) are printed in bold face in Table 2.
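A per-class test of this kind can be sketched as follows. The text does not specify how the contingency table was set up, so the 2×2 layout used here (class k versus all other classes, pA versus pN) and the absence of a continuity correction are assumptions, as is the function name `chi2_2x2`. For one degree of freedom the P value can be computed exactly from the complementary error function, so only the standard library is needed.

```python
import math
from typing import Tuple


def chi2_2x2(a: int, b: int, c: int, d: int) -> Tuple[float, float]:
    """Pearson chi-square test (no continuity correction) for the table [[a, b], [c, d]].

    Returns the test statistic and the P value for one degree of freedom.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column needs at least one observation")
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Archaea, Table 2: class X (219 of 995 pA genes, 381 of 22,872 pN genes).
stat_x, p_x = chi2_2x2(219, 995 - 219, 381, 22872 - 381)
# Archaea, Table 2: class D (8 of 995 pA genes, 181 of 22,872 pN genes).
stat_d, p_d = chi2_2x2(8, 995 - 8, 181, 22872 - 181)
```

Under these assumptions, class X (r(k) = 13.2) is strongly skewed, whereas class D (r(k) = 1.0) shows no detectable imbalance between pA and pN genes.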
In archaea (compare Table 2 and Figs. 1C and 2C), the
imbalance between pA and pN genes and the
overrepresentation in GIs of genes classified as belonging to class M
(cell wall/membrane/envelope biogenesis) was more
pronounced than in bacteria. This overrepresentation of class M
is independent of the habitat or the taxonomical group
(compare Fig. 3). Fig. 3A reveals that archeal GIs harbor
more gene-encoded functions related to class F (nucleotide
transport and metabolism) and X (integrases, transposases,
and inactivated derivatives) than bacterial ones. Fig. 3B
indicates that GIs of mesophilic species harbor more
gene-encoded functions related to metabolism than
hyperthermophilic ones.
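The group comparison underlying Fig. 3 can be illustrated with the ratio definition given in its legend, ratio(G1, G2) = log(r_G1(k)/r_G2(k)). In the sketch below, the function names and the choice of base 10 for the logarithm are assumptions (the sign of the result does not depend on the base); the r(k) values for class M are taken from Table 2. Note that the legend restricts the plot to classes with a statistically significant skew, a filter this sketch omits.

```python
import math


def log_ratio(r_a: float, r_b: float) -> float:
    """Log ratio of two r(k) values; positive if the class is more enriched in group a."""
    return math.log10(r_a / r_b)


def trend(ratio_1: float, ratio_2: float) -> str:
    """Summarize two log ratios as a consistent over- or underrepresentation."""
    if ratio_1 > 0 and ratio_2 > 0:
        return "over"
    if ratio_1 < 0 and ratio_2 < 0:
        return "under"
    return "mixed"


# r(k) for class M (cell wall/membrane/envelope biogenesis), Table 2.
r_m = {"HT Arch": 4.1, "MS Arch": 2.7, "HT Bact": 2.2, "MS Bact": 1.3}

ratio_ht = log_ratio(r_m["HT Arch"], r_m["HT Bact"])  # archaea vs. bacteria, hyperthermophiles
ratio_ms = log_ratio(r_m["MS Arch"], r_m["MS Bact"])  # archaea vs. bacteria, mesophiles
label = trend(ratio_ht, ratio_ms)
```

Both log ratios are positive for class M, which corresponds to the statement that this class is more strongly overrepresented in archeal GIs irrespective of the habitat.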
A second, significant difference in GI composition
between bacteria and archaea was related to the putative
origin of horizontally transferred genes: The proportions
described above for the putative donors were similar in all
COG classes. For hyperthermophilic archaea, GIs were
composed similarly to those seen in mesophilic archaea
(compare Figs. 1A, 1B, and 3B). The fraction of pAb genes
was, however, lower. In summary, it was approximately 11%,
and only a fraction of pA genes classified as belonging to
the COG classes M (cell wall/membrane/envelope biogenesis), Q
(secondary metabolites biosynthesis, transport, and
catabolism), and X (integrases, transposases, and inactivated
derivatives) had a distinct bacterial codon usage (P <
0.01). In GIs of mesophilic bacteria, the portion of genes
presumably originating from archaea (pAa) was minimal (in
summary <1%). In each individual COG class, this fraction
was less than 2%.
One might argue that the composition of the COG
database is, due to an overrepresentation of bacterial
genomes, biased toward gene-encoded functions
preferentially found in bacteria. If this were the case,
archaea-specific gene-encoded functions might be underrepresented
among the COGs, giving rise to a skewed distribution.
Table 2
Composition of microbial GIs with respect to COG classes and the putative origin of pA genes

                                                                 Archaea                                      HT    MS    HT    MS    Bacteria
                                                                                                            Arch  Arch  Bact  Bact
COG class                                                        # pN  # pA  r(k) # pAa  r(k) # pAb  r(k)  r(k)  r(k)  r(k)  r(k)    # pN  # pA  r(k) # pAa  r(k) # pAb  r(k)
J  Translation, ribosomal structure and biogenesis               2001    16   0.2     5   0.2     1   0.0   0.2   0.1   0.2   0.3    7124   113   0.3     0   0.0    87   0.3
A  RNA processing and modification                                 21     0   0.0     0   0.0     0   0.0   0.0   0.0   0.0   0.0      22     0   0.0     0   0.0     0   0.0
K  Transcription                                                 1138    43   0.9    17   1.1    12   0.9   1.0   0.6   0.8   1.5    7562   654   1.5     6   1.6   474   1.5
L  Replication, recombination and repair                          943    53   1.3    19   1.4    13   1.2   1.0   1.2   1.7   1.0    4676   276   1.0     3   1.3   191   1.0
B  Chromatin structure and dynamics                                47     1   0.5     1   1.5     0   0.0   0.8   0.0   3.3   0.6      32     1   0.5     0   0.0     1   0.7
D  Cell cycle control, cell division, chromos. partitioning       181     8   1.0     1   0.4     1   0.5   1.2   0.4   0.4   0.5    1122    34   0.5     0   0.0    21   0.4
V  Defense mechanisms                                             286    13   1.0     1   0.3     7   2.1   0.8   1.0   2.3   1.5    1667   148   1.5     2   2.4   121   1.7
T  Signal transduction mechanisms                                 448    15   0.8     5   0.8     6   1.1   1.1   0.5   1.1   1.0    4687   260   1.0     2   0.9   214   1.1
M  Cell wall/membrane/envelope biogenesis                         562    96   3.9    35   4.4    31   4.7   4.1   2.7   2.2   1.3    6029   452   1.3     6   2.0   332   1.3
N  Cell motility                                                  240     4   0.4     1   0.3     1   0.4   0.2   0.6   1.0   1.8    2166   220   1.7     0   0.0   187   2.0
W  Extracellular structures                                         1     0   0.0     0   0.0     0   0.0     –   0.0   0.0   1.5      21     2   1.6     0   0.0     2   2.2
U  Intracell. trafficking, secretion, and vesicular transport     292     5   0.4     2   0.5     3   0.9   0.2   0.8   0.2   2.1    2637   320   2.1     0   0.0   258   2.3
O  Posttransl. modification, protein turnover, chaperones         820    15   0.4     5   0.4     6   0.6   0.2   0.6   0.1   0.6    4219   134   0.5     0   0.0   107   0.6
C  Energy production and conversion                              1910    27   0.3     8   0.3     7   0.3   0.2   0.4   0.8   0.6    6448   206   0.6     3   1.0   162   0.6
G  Carbohydrate transport and metabolism                         1047    31   0.7    13   0.9     8   0.7   0.6   0.4   1.7   0.8    7772   379   0.8     3   0.8   308   0.9
E  Amino acid transport and metabolism                           2132    45   0.5    19   0.6    15   0.6   0.4   0.6   0.9   0.5  10,617   334   0.5     7   1.3   245   0.5
F  Nucleotide transport and metabolism                            656     7   0.3     3   0.3     1   0.1   0.3   0.4   0.1   0.3    2710    52   0.3     0   0.0    42   0.4
H  Coenzyme transport and metabolism                             1264     8   0.3     1   0.1     3   0.2   0.2   0.2   0.3   0.5    4458   138   0.5     0   0.0   104   0.5
I  Lipid transport and metabolism                                 582     7   0.3     1   0.1     0   0.0   0.2   0.3   0.4   0.7    3831   149   0.7     0   0.0   112   0.7
P  Inorganic ion transport and metabolism                        1268    38   0.7    14   0.8    12   0.8   0.8   0.5   1.1   0.8    6461   283   0.8     5   1.6   210   0.8
Q  Sec. metabolites biosynthesis, transport and catabolism        401    39   2.2    10   1.8    15   3.2   2.3   1.8   1.2   1.1    2878   173   1.0     0   0.0   127   1.0
R  General function prediction only                              3950   204   1.2    66   1.1    72   1.6   1.1   1.4   1.1   1.0  14,159   834   1.0     6   0.9   633   1.0
S  Function unknown                                              2301   101   1.0    17   0.5    22   0.8   1.2   0.8   0.5   1.3    9331   708   1.3     1   0.2   495   1.2
X  Integrases, transposases and inactivated derivatives           381   219  13.2    82  15.1    32   7.2  14.4   9.3   7.7   8.9    1272   668   9.0    11  17.6   368   6.7
SUM                                                            22,872   995         326   33%   268   27%                         111,901  6538          55  0.8%  4801   73%
unCOGed                                                                 904         181   20%   277   31%                                  5379          33  0.6%  2839   53%
unCOGed pAa [%]                                                                                             28%    2%    4%  0.6%
unCOGed pAb [%]                                                                                             17%   59%   44%   53%

The first column lists for each COG class k the one-letter code and its function. The following columns give for archaea and bacteria the absolute numbers of unsuspicious genes (# pN), of putatively alien genes
(# pA), and the ratio of normalized fractions. The columns # pAa and # pAb give absolute numbers for pA genes originating presumably from archaea or bacteria. For hyperthermophilic and mesophilic archaea
(columns HT Arch and MS Arch) and bacteria (columns HT Bact and MS Bact), the ratio values r(k) = f_k(pA)/f_k(pN) are given. Ratio values printed in bold result from (# pA, # pN) pairs whose skew is statistically
significant on the 1% level. The row labeled SUM gives the sum of pA, pAa, and pAb genes for the respective species group and the respective fraction values in percentage. The last rows list the number of pA
genes that are not elements of the COG database (unCOGed) and, for unCOGed pAa and pAb genes, their contribution to the respective number of all unCOGed genes. All r(k) values were rounded to one
decimal place.
Fig. 1. Classification of putatively alien (pA) and putatively native (pN) genes in genomic islands of mesophilic (A) and hyperthermophilic (B) archaea.
According to SIGI's prediction, pA genes were grouped with respect to their likely donors. Bars pAb give the fraction of pA genes presumably originating from
a bacterial donor. pAa are genes presumably imported from an archeal source. The codon usage in pAu genes is unspecific. For abbreviations of COG classes,
see Table 2. The bars on the right named unCOGed give the proportions of pAb, pAa, and pAu genes among those genes that are not elements of the COG
database. (C) A plot of log(r(k)) values indicating the over- or underrepresentation of each COG class k. For columns marked with an asterisk, the skew of
respective # pA and # pN values is statistically significant (P < 0.01, determined for each pair with a χ² test).
Therefore, the putative origin of unCOGed genes was
analyzed. The fractions of pAa and pAb genes among the
unCOGed ones were similar to the numbers listed in Table 1;
see last rows of Table 2 and Figs. 1 and 2. In addition, the
annotation of unCOGed genes was interpreted. In all groups,
more than 80% of these genes were annotated as hypothetical
or with a putative function. The remaining 20% were
frequently annotated as transposases or as elements of
transposons or prophages. This was also true for archeal
GIs; e.g., in S. solfataricus the content of transposons
ISC1043, ISC1058, ISC1212, ISC1217, ISC1225, and
ISC1234 was predicted as GIs. In summary, there was no
evidence for the assumption that the imbalance of putative
donors presented above might be an artifact originating
from an underrepresentation of archaea-specific
gene-encoded functions in the COG database.
Gene flux in related species
For a comparison of pathogenic and nonpathogenic species
and the analysis of small and large genomes of species
belonging to the same genera, four genomes of Escherichia
and genome pairs of Listeria and Methanosarcinaceae were
analyzed. Fig. 4 summarizes the findings for Escherichia
genomes, namely Escherichia coli O157:H7 [24], E. coli
O157:H7 EDL933 [25], E. coli K-12 [26], and E. blattae
[18]. The first two of these species are pathogenic.
Concerning the composition of GIs, the E. coli genomes had similar
prevalences. In all three genomes, COG classes X (integrases,
transposases, and inactivated derivatives), U (intracellular
trafficking, secretion, and vesicular transport), N (cell
motility), and M (cell wall/membrane/envelope biogenesis)
were overrepresented in GIs; compare ratio values in Fig. 4B.
The histograms also showed that there were no drastic
differences between pathogenic and nonpathogenic E. coli
species with respect to GI composition. GIs in E. coli K-12
tended to harbor more gene-encoded functions related to
classes N (cell motility) and U (intracellular trafficking,
secretion, and vesicular transport). Among the individual
COG functions that were most strongly overrepresented
(P < 0.05) in GIs of all E. coli species are gene-encoded
functions related to P pilus assembly. Concerning GIs in E.
blattae, the underrepresentation of classes N (cell motility)
and U (intracellular trafficking, secretion, and vesicular
transport) is striking. The GI composition of E. coli K-12
and E. blattae differed statistically significantly (P << 0.01).
The analyzed genomic data set for E. blattae consisted of five
contigs covering >98% of the genome. Due to the high
coverage, we do not expect significant deviations from our
findings for the complete genome.

Fig. 2. Classification of putatively alien (pA) and putatively native (pN) genes in genomic islands of mesophilic and hyperthermophilic bacteria. For details, see
legend to Fig. 1.
Of the two Listeria species analyzed here, Listeria
innocua is nonpathogenic and Listeria monocytogenes is
pathogenic. The genomes have nearly the same size of 3.01
and 2.94 Mb, respectively [21]. SIGI identified 343 pA genes
(11%) in L. innocua and 271 (9.4%) in L. monocytogenes. The
median of GI length was 8 genes in both species. The number of
hypothetical genes in GIs was approx. 50% in both cases
and <15% among pN genes. The COG histograms of GIs
differed only marginally (see Fig. 5); a χ² test did not detect
significant differences (P > 0.9). Besides COG1396, a
predicted transcriptional regulator, none of the individual
COG functions was significantly overrepresented (P >
0.05) in GIs. Again, these examples support the notion that
pathogenicity does not massively influence the content of
GIs on the level of COG classes.

Fig. 3. Log ratios indicating the prevalences for COG classes in genomic islands of hyperthermophilic or mesophilic species. (A) For each COG class k the
ratios ratio(G1, G2) = log(r_G1(k)/r_G2(k)) and ratio(G3, G4) = log(r_G3(k)/r_G4(k)) were determined for the groups G1 = hyperthermophilic archaea, G2 =
hyperthermophilic bacteria, G3 = mesophilic archaea, G4 = mesophilic bacteria. Only those r(k) values were considered that belong to COG classes having a
statistically significant skew (P < 0.05). If both ratio values are positive (negative), the considered class is over- (under-) represented in all archeal GIs,
irrespective of the habitat. (B) For each COG class k the ratios ratio(G1, G3) = log(r_G1(k)/r_G3(k)) and ratio(G2, G4) = log(r_G2(k)/r_G4(k)) were determined as
described above. If both ratios were positive (negative), the considered class was over- (under-) represented in GIs of hyperthermophilic species, irrespective of
the considered superkingdom.
The genomes of the two completely sequenced archeal
Methanosarcinaceae differ considerably in size. The
genome of Methanosarcina acetivorans [27] is 39% (1.65
Mb) larger than the one of Methanosarcina mazei [19]. SIGI
identified 310 pA genes (8%) in M. mazei and 538 (11%) in
M. acetivorans. GIs in the genome of M. acetivorans
were significantly longer: The median of GI length was
9 genes in M. mazei and 13 genes in M. acetivorans. The
distributions of GI lengths were statistically significantly
different (P << 0.01, determined with a KS test). The
histograms of COG classes are plotted in Fig. 6. In both
species, the overall distributions reflect the general
prevalence of archeal GIs (see Fig. 6A). Again, the composition
of GIs differed only marginally with respect to COG classes.
The two COG classifications of all pA genes and those of
pAb genes did not differ statistically significantly (P > 0.1).
The comparison of the two proteomes by means of
bidirectional BLAST analysis identified 1051 (23%)
species-specific genes in the genome of M. acetivorans. This
fraction was roughly consistent with values determined for
other genomes; see, e.g., [28]. Of these genes, 59% were
located in the genome half around the terminus of
replication. The same area harbored 58% of pA genes. The
overrepresentation of both the species-specific and the pA
genes in the ter half was statistically significant (P << 0.01,
determined with a χ² test).
Sinorhizobium meliloti and Mesorhizobium loti are both
Rhizobiales. The genome of M. loti [29] (7.60 Mb, 6746
genes) is larger than that of S. meliloti [30] (6.69 Mb, 3341
genes) and harbors a symbiosis island [31]. SIGI identified
1028 (15%) pA genes in the genome of M. loti and 209 (5%)
in the genome of S. meliloti. The difference in GI
composition was statistically significant (P < 0.025).
Interestingly, the fraction of genes belonging to class X is
larger in GIs of the smaller genome. S. meliloti contained 22
GIs with a median length of 7 genes; the genome of M. loti
had 56 GIs with a median length of 11 genes. To identify the
gene-encoded functions that were most strongly
overrepresented in GIs of M. loti, the five COG classes having
the largest ratio r(k) were determined. In GIs of M. loti,
genes related to class T (signal transduction mechanisms), U
(intracellular trafficking, secretion, and vesicular transport),
C (energy production and conversion), H (coenzyme
transport and metabolism), and Q (secondary metabolites
biosynthesis, transport, and catabolism) were most strongly
overrepresented; compare Fig. 7.
In the genomes of Thermus thermophilus [20] and
Deinococcus radiodurans [32], which are two representatives
of a distinct bacterial phylum, SIGI found only a small
number of pA genes. For T. thermophilus, SIGI identified
97 (4.8%) pA genes and for D. radiodurans 32 (1.3%) on
chromosome one and 34 (9.5%) on chromosome two.
Ten of these 34 pA genes on chromosome two
belong to class M (cell wall/membrane/envelope biogenesis)
and constitute a cluster consisting of genes DRA0026–DRA0048,
coding mostly for transferases. T. thermophilus
is an extremely thermophilic bacterium. In its GIs, 10% of
the genes were related to COG class M (cell wall/
membrane/envelope biogenesis) and 6% to class V
(defense mechanisms; data not shown). The two species
analyzed here are naturally transformable [32–34]. D.
radiodurans lives in locations rich in organic nutrients,
which are populated by a plethora of microbes and a diverse
gene pool. It is unclear why these genomes harbor only a
small number of pA genes.

Fig. 4. Classification of putatively alien (pA) genes in genomic islands (GIs) of Escherichia. (A) Plot of fraction values f_k(pA) for all COG classes as
determined in GIs of E. coli K-12, E. coli O157:H7 EDL933, E. coli O157:H7, and E. blattae. (B) Ratio values r(k) = f_k(pA)/f_k(pN) illustrating the under-/
overrepresentation of COG classes in GIs of the respective genomes. A positive log(r(k)) value indicates an overrepresentation, a negative log(r(k)) value the
underrepresentation of the respective class k in GIs. For COG classes, see Table 2.
Discussion
Each analysis of HGT raises the question of whether the
approach of identifying horizontally transferred genes is
valid. A variety of methods have been developed to identify
GIs. The underlying concepts are based on the analysis of
sequence composition [11,16,35–39] or gene neighborhood
[40], on phylogenetic studies [41] or on a combination of
these approaches [42,43]. Here, we will discuss SIGI's
potential to identify GIs first and the results second.
Codon usage analysis reliably allows identification of GIs
It has been argued that approaches based on codon usage
analysis generate large numbers of false positive or false
negative hits. However, it has been shown that compositional
analyses recognize the vast majority of transfer events [44].
The risk of generating false positive hits can be reduced by
focusing on the prediction of pA gene clusters. Such an
approach elegantly exploits and combines biological
evidence and statistical principles. As has been established,
genomic islands frequently have a size of 10–200 kb [10].
This is an important premise, because the probability of
predicting false positives decreases drastically for gene
clusters. These arguments were considered in the design of
SIGI [16] and another recently introduced algorithm [11].
There is a second line of argument that lends additional
support to algorithms analyzing compositional complexity.
The assumption that these methods might overlook genes
acquired by horizontal transfer could be valid for more
ancient events, which were subject to the amelioration
process [1] for a longer period of time. More recently
acquired genes were detected to a great extent by methods
based on compositional complexity [5,44]. Lawrence and
Ochman estimated the age of imported genes [35] and
concluded that most were relatively recent, i.e., they were
acquired within the last few million years; see, e.g., [45].
This suggests that older imports have been purged from the
genomes, presumably because these genes did not improve
fitness [4]. Given this reasoning, one might expect
amelioration not to affect the majority of pA genes.

Fig. 5. Classification of putatively alien (pA) genes in genomic islands of Listeria. For details, see legend of Fig. 4.
Gene flux depends on lifestyle and is directional in
mesophilic species
In hyperthermophilic species, genes were exchanged
across the superkingdom border to nearly the same extent
in both directions. SIGI’s prediction (see Table 1) coincides
with the evidence for massive HGT from archaea into
Thermotoga maritima [13]. However, the absolute numbers of pA
genes found in these genomes are relatively small
compared to mesophilic ones. For mesophilic species, we
observe a directed gene flux from bacteria to
archaea. This statement subsumes and is in agreement with
findings obtained for pathogenic bacteria. Such genomes
were shown to harbor only a low number of archeal genes
[46]. These results and our findings are in contrast to the
hypothesis introduced by Faguy [47], postulating that HGT from
archaea to E. coli contributes to the emergence of novel
diseases. According to the results presented above, the
transfer of archeal genes to genomes of mesophilic bacteria
is rare. Interestingly, no pathogenic archaeon is known so far.
Compared to bacterial genomes, archeal ones contain
fewer genes related to carbohydrate transport
and metabolism, cell wall/membrane/envelope biogenesis,
and inorganic ion transport and metabolism [48]. The lower
number of genes encoding functions for cell wall synthesis
was frequently explained by the fact that archaea possess a
different cell wall demanding fewer enzymes [49]. SIGI’s
output suggests that 17% of related archeal genes were
horizontally acquired; most likely one-third of these pA
genes originate from bacteria. In the case of archaea, the
habitat appears to have a minor effect on gene-encoded
functions acquired via HGT: The overall classification of pA
genes identified in GIs of hyperthermophilic archaea is
similar to that determined for mesophilic archaea. The
observation that hyperthermophilic archaea harbor fewer
genes acquired from bacteria is in agreement with previous
results [50].
Fig. 6. (A) Classification of putatively alien (pA) genes in genomic islands of Methanosarcina acetivorans (first column) and Methanosarcina mazei (second
column). For abbreviations of COG classes, see Table 2. (B) log(r_Ma(k)/r_Mm(k)) values illustrate the under-/overrepresentation of COG classes in the genome of
M. acetivorans if compared to M. mazei. A positive log(ratio) value indicates an overrepresentation of the respective class k in GIs of M. acetivorans. Black
bars give the ratio derived from all pA genes; gray bars indicate the skew for pA genes originating from bacterial donors.
GI content and pathogenicity
The genes distinguishing pathogenic from nonpathogenic
species are often clustered in pathogenicity islands [8,10].
For the samples analyzed here, the content of pA gene
clusters in pathogenic species was nearly identical to the
composition found in nonpathogenic species with respect to
the COG classification. There are several reasons that might explain this
correspondence. First, only 50% of pA genes are classified
as COGs. It might be that pathogenicity is encoded, at least
to some extent, in those genes that are not yet elements of
the COG scheme. Second, the clusters identified by SIGI
may be marginally or not at all related to pathogenicity.
Among the COG classes, there is one candidate frequently
overrepresented in GIs that could indicate pathogenicity:
class X. However, the analysis of archeal and bacterial
genomes (see Figs. 1 and 2) and the comparison of
eubacterial genomes (see Fig. 4) or of Listeria demonstrated
that the fraction of these genes including integrases and
transposases is not a marker for pathogenicity. Related
enzymes occurred with similar frequency in both pathogenic
and nonpathogenic species. On the other hand, it has been
shown that SIGI identifies operons related to pathogenicity
[16]. Algorithms like SIGI preferentially detect recently
acquired GIs. Pathogenicity islands, which are a subset of
GIs, will be missed, if the DNA composition is incon-
spicuous or already ameliorated toward the codon usage of
the host. As argued above, genomes may not harbor a large
number of ameliorated genes. If, however, strength and
impact of HGT were complementary to habitat- and
phylum-specific needs, one would expect similar gene-
encoded functions in newly acquired GIs, in both patho-
genic and nonpathogenic species. The findings presented
here support this notion, which makes the correspondence
of GI composition plausible. This concept might also
explain the underrepresentation of gene-encoded functions
related to classes N (cell motility) and U (intracellular
trafficking, secretion, and vesicular transport) in GIs of E.
blattae. This species was isolated from the hindgut of a
cockroach [51]. Likely, the composition of GIs is related to
the commensal lifestyle of E. blattae.
One might argue that most of the GIs of closely related
species we compared here were acquired before speciation.
If this is the case, our results demonstrate that since then,
composition of GIs was rather stable with respect to a
general classification of gene-encoded functions. This
stability would again support the notion that respective
gene-encoded functions enhance fitness; otherwise, a purg-
ing of such GIs seems more plausible. Our hypothesis that
the evolution of a pathogenic lifestyle is not correlated with
a specific composition of GIs is independent of acquisition
time. If GIs have been acquired before speciation, pathogenic
species did not accumulate significant amounts of
genes with different functions since then. If GIs were
acquired after speciation, accumulated functions are similar
in both pathogenic and nonpathogenic species.
Fig. 7. (A) Classification of putatively alien (pA) genes in genomic islands (GIs) of two Rhizobiales. For abbreviations of COG classes, see Table 2. Dark gray
bars give the distribution of COG classes in GIs of Sinorhizobium meliloti; light gray bars give the distribution of COG classes in GIs of Mesorhizobium loti.
(B) log(r_Ml(k)/r_Sm(k)) values show the over-/underrepresentation of COG classes in the genome of M. loti if compared to S. meliloti. A positive log(ratio) value
indicates an overrepresentation of the respective class k in GIs of M. loti.
There is another effect to be considered: Genomes of
pathogenic species are frequently characterized by a
massive loss of genes [52–54]. This effect is probably
due to the lifestyle of most pathogens as intracellular and
opportunistic parasites. Such environments offer rich
resources, making many gene functions dispensable.
In summary and for the cases studied here, it seems
plausible to assume that pathogenicity does not massively
depend on newly imported genes.
HGT activity does not explain genome size
The genomes of the two (archeal) Methanosarcina and
the two (bacterial) Rhizobiales species differ considerably in
size. A statistically significant correlation of gene content
with genome size is known for the following COG classes
[48]: J, L, D, F (negative correlation) and K, N, T, C, Q
(positive correlation). For the Methanosarcina species, the
composition of GIs does not differ significantly with respect
to the COG classification (compare Fig. 6B). For GIs of
Rhizobiales, COG classes T (signal transduction mecha-
nisms), C (energy production and conversion), and Q
(secondary metabolites biosynthesis, transport and catabo-
lism) follow the genome-size-specific trend. Additionally,
COG classes M (cell wall/membrane/envelope biogenesis),
U (intracellular trafficking, secretion, and vesicular trans-
port), and H (coenzyme transport and metabolism) are
overrepresented in GIs (compare part B in Fig. 7). There is
no evidence for a correlation of these functions with genome
size [48]. These findings suggest that the composition of
GIs does not merely reflect the size dependency of genome
composition.
Most of the species-specific genes accumulated in the
genome of M. acetivorans and not found in the genome of
M. mazei were not classified as pA. This result indicates that
codon usage in these genes does not significantly deviate
from the preferences in native genes. There is evidence that
gene genesis is twice as frequent as HGT [55].
Therefore, it is plausible to assume that this phenomenon is
the source of those species-specific genes having an
inconspicuous codon usage. The accumulation of species-
specific and pA genes in the ter half supports the notion of a
distinct pattern of genome organization: essential genes are
enriched near the origin of replication; the area near the
replication terminus seems to be the place for gene genesis
or HGT. This notion is also supported by the finding that the
ter region is an area for hyperrecombination [56].
What is the origin of pAu genes?
In all cases studied above, at least 30% of pA genes had a
codon usage classified neither as bacterial nor as archeal.
We cannot exclude that the approach implemented with
SIGI or the limited knowledge of codon usage did not allow
us to resolve this uncertainty. Alternatively, it could be that
these genes were frequently transferred among environ-
ments, each modulating codon usage specifically, however,
with different prevalences. The annotations of pA genes
lacking a COG assignment indicated for many of these genes an origin in
transposons, prophages, or viruses, making frequent
transfer plausible. An exhaustive analysis of these genes is
an interesting task for the future.
Materials and methods
Categorizing gene-encoded function in genomic islands
The genomes under consideration were first analyzed
using SIGI. Then, for each gene, the COG classification
was determined and, according to SIGI’s prediction, added to
histograms summing the COG classifications for putatively
native (pN) and putatively alien (pA) genes. These
occurrences (# pN, # pA) were used to calculate relative
frequencies f_k(pN), f_k(pA) and ratios r(k) = f_k(pA)/f_k(pN).
To illustrate and scale the skew in the distribution of COG
classes among pA and pN genes, the logarithm of the
frequency ratios f_k(pA)/f_k(pN) (log odds) was computed. The
conversion to relative frequencies is a necessary step, if one
wants to compare relative abundances of COG classes
among pA and pN genes or between different species
groups. For the study of gene flux between superkingdoms,
SIGI’s predictions were sorted into three classes: Genes with
a putatively bacterial donor (pAb), genes with a putatively
archeal donor (pAa), and genes having a codon usage that
could not be clearly assigned as bacterial or archeal (pAu).
A χ² test was used on absolute gene numbers (# pN, # pA,
or respective groups) to assess statistical relevance. The size
of GIs was determined in multiples of genes and summed in
a histogram to determine the median length. Resulting
distributions were compared using the Kolmogorov-Smir-
nov (KS) test.
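The frequency and ratio calculations described above can be sketched as follows. This is an illustrative reconstruction, not the code used in the study: the function names and the toy gene lists are hypothetical, the logarithm base is not stated in the text and base 10 is assumed here, and the χ² statistic is shown only for the simplest case of a 2 × 2 table (one COG class versus all others, pA versus pN).

```python
import math
from collections import Counter
from typing import Dict, Iterable


def relative_frequencies(cog_classes: Iterable[str]) -> Dict[str, float]:
    """Turn per-gene COG class letters into relative frequencies f_k."""
    counts = Counter(cog_classes)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {k: n / total for k, n in counts.items()}


def log_ratios(pa_classes: Iterable[str], pn_classes: Iterable[str]) -> Dict[str, float]:
    """log10 of r(k) = f_k(pA) / f_k(pN); base 10 is an assumption.
    Classes missing from either set are skipped (ratio undefined)."""
    f_pa = relative_frequencies(pa_classes)
    f_pn = relative_frequencies(pn_classes)
    return {k: math.log10(f_pa[k] / f_pn[k]) for k in sorted(f_pa) if k in f_pn}


def chi_square_2x2(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-square statistic for the table [[a, b], [c, d]],
    computed on absolute gene counts."""
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        raise ValueError("every row and column needs a non-zero total")
    return n * (a * d - b * c) ** 2 / denominator


# Toy data: COG class letters of 10 pA genes and 10 pN genes.
pa = ["X"] * 4 + ["L"] * 2 + ["K"] * 4
pn = ["X"] * 1 + ["L"] * 4 + ["K"] * 5

scores = log_ratios(pa, pn)
# Class X versus all other classes: pA has 4 vs 6 genes, pN has 1 vs 9.
chi2 = chi_square_2x2(4, 6, 1, 9)
print(scores)
print(chi2)
```

A positive score marks a class that is overrepresented among pA genes. In the toy example class X is enriched, but the χ² statistic stays below 3.841, the 5% critical value for one degree of freedom, which shows why the test has to be run on absolute counts and not on the relative frequencies. With counts this small the χ² approximation is also unreliable.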
The functional category L (DNA replication, recombi-
nation, and repair) of the COG database subsumes several
COGs annotated as transposases and integrases. These gene-
encoded functions are exceedingly overrepresented in GIs.
To isolate these protein functions, we have created a new
category named X (integrases, transposases, and inactivated
derivatives) which harbors the COG functions COG0582,
COG0675, COG1662, COG1943, COG2452, COG2801,
COG2826, COG2963, COG3039, COG3293, COG3316,
COG3328, COG3335, COG3385, COG3415, COG3436,
COG3464, COG3547, COG3666, COG3676, COG3677,
COG4584, COG4644, COG5421, COG5433, COG5558,
and COG5659.
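The reassignment to category X amounts to a lookup against the listed COG identifiers. The sketch below is a hedged illustration: the set contains the 27 identifiers given above, whereas the function name and the identifier COG9999 are hypothetical and serve only as an example of a COG that is not in the list.

```python
# The 27 COG identifiers moved into the new category X, as listed in the text.
CATEGORY_X_COGS = frozenset({
    "COG0582", "COG0675", "COG1662", "COG1943", "COG2452", "COG2801",
    "COG2826", "COG2963", "COG3039", "COG3293", "COG3316", "COG3328",
    "COG3335", "COG3385", "COG3415", "COG3436", "COG3464", "COG3547",
    "COG3666", "COG3676", "COG3677", "COG4584", "COG4644", "COG5421",
    "COG5433", "COG5558", "COG5659",
})


def assign_category(cog_id: str, original_category: str) -> str:
    """Return 'X' for integrases, transposases, and inactivated derivatives;
    otherwise keep the category given by the COG database."""
    return "X" if cog_id in CATEGORY_X_COGS else original_category


print(assign_category("COG0582", "L"))   # listed identifier, moved to X
print(assign_category("COG9999", "L"))   # hypothetical identifier, stays in L
```

Applying the lookup before the histograms are built keeps the mobile-element functions from inflating the counts of class L.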
Material
The data sets consisted of the following genomes.
Hyperthermophilic archaea: Archaeoglobus fulgidus,
Aeropyrum pernix, Methanocaldococcus jannaschii, Meth-
anopyrus kandleri, Pyrobaculum aerophilum, Pyrococcus
abyssi, Pyrococcus horikoshii, Sulfolobus solfataricus,
Thermococcus kodakaraensis;
Mesophilic archaea: Halobacterium sp. NRC-1, Meth-
anosarcina acetivorans str. C2A, Methanosarcina mazei,
Methanothermobacter thermautotrophicus;
Archaea: all archeal genomes annotated in the COG
database;
Hyperthermophilic bacteria: Aquifex aeolicus, Thermo-
toga maritima, Thermoanaerobacter tengcongensis;
Thermophilic bacteria: Geobacillus kaustophilus, Ther-
mus thermophilus HB 27, Thermosynechococcus elongatus,
Symbiobacterium thermophilum;
Mesophilic bacteria: Synechocystis, Nostoc sp. PCC
7120, Fusobacterium nucleatum, Deinococcus radiodurans,
Corynebacterium glutamicum, Mycobacterium tuberculosis
H37Rv, Mycobacterium tuberculosis CDC1551, Mycobac-
terium leprae, Clostridium acetobutylicum, Lactococcus
lactis, Streptococcus pyogenes M1 GAS, Streptococcus
pneumoniae TIGR4, Staphylococcus aureus N315, Listeria
innocua, Bacillus subtilis, Bacillus halodurans, Ureaplasma
urealyticum, Mycoplasma pulmonis, Mycoplasma pneumo-
niae, Mycoplasma genitalium, Escherichia coli K-12,
Escherichia coli O157:H7 EDL933, Escherichia coli
O157:H7, Yersinia pestis, Salmonella typhimurium LT2,
Buchnera sp. APS, Vibrio cholerae, Pseudomonas aerugi-
nosa, Haemophilus influenzae, Pasteurella multocida,
Xylella fastidiosa 9a5c, Neisseria meningitidis MC58,
Neisseria meningitidis Z2491, Ralstonia solanacearum,
Helicobacter pylori 26695, Helicobacter pylori J99, Cam-
pylobacter jejuni, Agrobacterium tumefaciens, Sinorhi-
zobium meliloti 1021, Brucella melitensis, Mesorhizobium
loti, Caulobacter crescentus, Rickettsia prowazekii, Rick-
ettsia conorii, Chlamydia trachomatis, Chlamydophila
pneumoniae CWL029, Treponema pallidum, Borrelia
burgdorferi;
Bacteria: all bacterial genomes annotated in the COG
database.
Acknowledgments
The project was carried out within the framework of the
Competence Network Göttingen BiotechGenoMik financed
by the German Federal Ministry of Education and Research
(BMBF). We thank H. Liesegang for his help in preparing
genomic data sets and the referees for their constructive
comments.
References
[1] J.G. Lawrence, H. Ochman, Amelioration of bacterial genomes: rates
of change and exchange, J. Mol. Evol. 44 (1997) 383–397.
[2] W.F. Doolittle, Phylogenetic classification and the universal tree,
Science 284 (1999) 2124–2129.
[3] B. Snel, P. Bork, M.A. Huynen, Genomes in flux: the evolution of
archaeal and proteobacterial gene content, Genome Res. 12 (2002)
17–25.
[4] C.G. Kurland, B. Canback, O.G. Berg, Horizontal gene transfer: a
critical view, Proc. Natl. Acad. Sci. U. S. A. 100 (2003) 9658–9662.
[5] M.A. Ragan, On surrogate methods for detecting lateral gene transfer,
FEMS Microbiol. Lett. 201 (2001) 187–191.
[6] O. Zhaxybayeva, P. Lapierre, J.P. Gogarten, Genome mosaicism and
organismal lineages, Trends Genet. 20 (2004) 254–260.
[7] M.A. Ragan, Detection of lateral gene transfer among microbial
genomes, Curr. Opin. Genet. Dev. 11 (2001) 620–626.
[8] G. Blum, M. Ott, A. Lischewski, A. Ritter, H. Imrich, H. Tschape,
et al., Excision of large DNA regions termed pathogenicity islands
from tRNA-specific loci in the chromosome of an Escherichia coli
wild-type pathogen, Infect. Immun. 62 (1994) 606–614.
[9] U. Dobrindt, B. Hochhut, U. Hentschel, J. Hacker, Genomic islands in
pathogenic and environmental microorganisms, Nat. Rev. Microbiol.
2 (2004) 414–424.
[10] J. Hacker, J.B. Kaper, Pathogenicity islands and the evolution of
microbes, Annu. Rev. Microbiol. 54 (2000) 641–679.
[11] Y. Nakamura, T. Itoh, H. Matsuda, T. Gojobori, Biased biological
functions of horizontally transferred genes in prokaryotic genomes,
Nat. Genet. 36 (2004) 760–766.
[12] R. Jain, M.C. Rivera, J.A. Lake, Horizontal gene transfer among
genomes: the complexity hypothesis, Proc. Natl. Acad. Sci. U. S. A.
96 (1999) 3801–3806.
[13] K.E. Nelson, R.A. Clayton, S.R. Gill, M.L. Gwinn, R.J. Dodson, D.H.
Haft, et al., Evidence for lateral gene transfer between Archaea and
Bacteria from genome sequence of Thermotoga maritima, Nature 399
(1999) 323–329.
[14] R.L. Tatusov, E.V. Koonin, D.J. Lipman, A genomic perspective on
protein families, Science 278 (1997) 631–637.
[15] R.L. Tatusov, N.D. Fedorova, J.D. Jackson, A.R. Jacobs, B. Kiryutin,
E.V. Koonin, et al., The COG database: an updated version includes
Eukaryotes, BMC Bioinformatics 4 (2003) 41.
[16] R. Merkl, SIGI: Score-based identification of genomic islands, BMC
Bioinformatics 5 (2004) 22.
[17] C. von Mering, M. Huynen, D. Jaeggi, S. Schmidt, P. Bork, B. Snel,
STRING: a database of predicted functional associations between
proteins, Nucleic Acids Res. 31 (2003) 258–261.
[18] A. Wiezer, Entschlüsselung der Genomsequenz von Escherichia
blattae und komparative Bioinformatik mikrobieller Genome, Institut
für Mikrobiologie und Genetik, Georg-August-Universität Göttingen,
2004.
[19] U. Deppenmeier, A. Johann, T. Hartsch, R. Merkl, R.A. Schmitz,
R. Martinez-Arias, et al., The genome of Methanosarcina mazei:
evidence for lateral gene transfer between Bacteria and Archaea,
J. Mol. Microbiol. Biotechnol. 4 (2002) 453–461.
[20] A. Henne, H. Bruggemann, C. Raasch, A. Wiezer, T. Hartsch, H.
Liesegang, et al., The genome sequence of the extreme thermophile
Thermus thermophilus, Nat. Biotechnol. 22 (2004) 547–553.
[21] P. Glaser, L. Frangeul, C. Buchrieser, C. Rusniok, A. Amend, F.
Baquero, et al., Comparative genomics of Listeria species, Science
294 (2001) 849–852.
[22] R. Grantham, C. Gautier, M. Gouy, R. Mercier, A. Pave, Codon
catalog usage and the genome hypothesis, Nucleic Acids Res. 8 (1980)
r49–r62.
[23] R. Sandberg, C.I. Branden, I. Ernberg, J. Coster, Quantifying the
species-specificity in genomic signatures, synonymous codon choice,
amino acid usage and G+C content, Gene 311 (2003) 35–42.
[24] T. Hayashi, K. Makino, M. Ohnishi, K. Kurokawa, K. Ishii, K.
Yokoyama, et al., Complete genome sequence of enterohemorrhagic
Escherichia coli O157:H7 and genomic comparison with a laboratory
strain K-12, DNA Res. 8 (2001) 11–22.
[25] N.T. Perna, G. Plunkett III, V. Burland, B. Mau, J.D. Glasner, D.J.
Rose, et al., Genome sequence of enterohaemorrhagic Escherichia
coli O157:H7, Nature 409 (2001) 529–533.
[26] F.R. Blattner, G. Plunkett III, C.A. Bloch, N.T. Perna, V. Burland, M.
Riley, et al., The complete genome sequence of Escherichia coli K-12,
Science 277 (1997) 1453–1474.
[27] J.E. Galagan, C. Nusbaum, A. Roy, M.G. Endrizzi, P. Macdonald,
W. FitzHugh, et al., The genome of M. acetivorans reveals
extensive metabolic and physiological diversity, Genome Res. 12
(2002) 532–542.
[28] K.E. Nelson, I.T. Paulsen, J.F. Heidelberg, C.M. Fraser, Status of
genome projects for nonpathogenic bacteria and archaea, Nat.
Biotechnol. 18 (2000) 1049–1054.
[29] T. Kaneko, Y. Nakamura, S. Sato, E. Asamizu, T. Kato, S. Sasamoto,
et al., Complete genome structure of the nitrogen-fixing symbiotic
bacterium Mesorhizobium loti, DNA Res. 7 (2000) 331–338.
[30] F. Galibert, T.M. Finan, S.R. Long, A. Puhler, P. Abola, F. Ampe,
et al., The composite genome of the legume symbiont Sinorhizobium
meliloti, Science 293 (2001) 668–672.
[31] J.T. Sullivan, C.W. Ronson, Evolution of rhizobia by acquisition of a
500-kb symbiosis island that integrates into a phe-tRNA gene, Proc.
Natl. Acad. Sci. U. S. A. 95 (1998) 5145–5149.
[32] O. White, J.A. Eisen, J.F. Heidelberg, E.K. Hickey, J.D. Peterson, R.J.
Dodson, et al., Genome sequence of the radioresistant bacterium
Deinococcus radiodurans R1, Science 286 (1999) 1571–1577.
[33] Y. Koyama, T. Hoshino, N. Tomizuka, K. Furukawa, Genetic trans-
formation of the extreme thermophile Thermus thermophilus and of
other Thermus spp, J. Bacteriol. 166 (1986) 338–340.
[34] A. Friedrich, C. Prust, T. Hartsch, A. Henne, B. Averhoff, Molecular
analyses of the natural transformation machinery and identification
of pilus structures in the extremely thermophilic bacterium Thermus
thermophilus strain HB27, Appl. Environ. Microbiol. 68 (2002)
745–755.
[35] J.G. Lawrence, H. Ochman, Molecular archaeology of the Escherichia
coli genome, Proc. Natl. Acad. Sci. U. S. A. 95 (1998) 9413–9417.
[36] S. Karlin, Detecting anomalous gene clusters and pathogenicity
islands in diverse bacterial genomes, Trends Microbiol. 9 (2001)
335–343.
[37] H.C. Wang, J. Badger, P. Kearney, M. Li, Analysis of codon usage
patterns of bacterial genomes using the self-organizing map, Mol.
Biol. Evol. 18 (2001) 792–800.
[38] Q. Tu, D. Ding, Detecting pathogenicity islands and anomalous gene
clusters by iterative discriminant analysis, FEMS Microbiol. Lett. 221
(2003) 269–275.
[39] T. Abe, S. Kanaya, M. Kinouchi, Y. Ichiba, T. Kozuki, T. Ikemura,
Informatics for unveiling hidden genome signatures, Genome Res. 13
(2003) 693–702.
[40] A. Ruepp, W. Graml, M.L. Santos-Martinez, K.K. Koretke, C.
Volker, H.W. Mewes, et al., The genome sequence of the
thermoacidophilic scavenger Thermoplasma acidophilum, Nature
407 (2000) 508–513.
[41] C.L. Nesbø, S. L’Haridon, K.O. Stetter, W.F. Doolittle, Phylogenetic
analyses of two “archaeal” genes in Thermotoga maritima reveal
multiple transfers between Archaea and Bacteria, Mol. Biol. Evol. 18
(2001) 362–375.
[42] S. Garcia-Vallve, A. Romeu, J. Palau, Horizontal gene transfer in
bacterial and archaeal complete genomes, Genome Res. 10 (2000)
1719–1725.
[43] C.L. Nesbø, W.F. Doolittle, Targeting clusters of transferred genes in
Thermotoga maritima, Environ. Microbiol. 5 (2003) 1144–1154.
[44] V. Daubin, E. Lerat, G. Perriere, The source of laterally transferred
genes in bacterial genomes, Genome Biol. 4 (2003) R57.
[45] F. de la Cruz, J. Davies, Horizontal gene transfer and the origin
of species: lessons from bacteria, Trends Microbiol. 8 (2000)
128–133.
[46] U. Gophna, R.L. Charlebois, W.F. Doolittle, Have archaeal genes
contributed to bacterial virulence?, Trends Microbiol. 12 (2004)
213–219.
[47] D.M. Faguy, Lateral gene transfer (LGT) between Archaea and
Escherichia coli is a contributor to the emergence of novel infectious
disease, BMC Infect. Dis. 3 (2003) 13.
[48] K.T. Konstantinidis, J.M. Tiedje, Trends between gene content and
genome size in prokaryotic species with larger genomes, Proc. Natl.
Acad. Sci. U. S. A. 101 (2004) 3160–3165.
[49] O. Kandler, H. Konig, Cell wall polymers in Archaea (Archaebac-
teria), Cell. Mol. Life Sci. 54 (1998) 305–308.
[50] K.S. Makarova, L. Aravind, M.Y. Galperin, N.V. Grishin, R.L.
Tatusov, Y.I. Wolf, et al., Comparative genomics of the Archaea
(Euryarchaeota): evolution of conserved protein families, the stable
core, and the variable shell, Genome Res. 9 (1999) 608–628.
[51] N.R. Burgess, S.N. McDermott, J. Whiting, Aerobic bacteria occur-
ring in the hind-gut of the cockroach, Blatta orientalis, J. Hyg. (Lond.)
71 (1973) 1–7.
[52] J.O. Andersson, S.G. Andersson, Insights into the evolutionary
process of genome degradation, Curr. Opin. Genet. Dev. 9 (1999)
664–671.
[53] S.T. Cole, K. Eiglmeier, J. Parkhill, K.D. James, N.R. Thomson, P.R.
Wheeler, et al., Massive gene decay in the leprosy bacillus, Nature 409
(2001) 1007–1011.
[54] A. Mira, H. Ochman, N.A. Moran, Deletional bias and the evolution
of bacterial genomes, Trends Genet. 17 (2001) 589–596.
[55] V. Kunin, C.A. Ouzounis, The balance of driving forces during
genome evolution in prokaryotes, Genome Res. 13 (2003)
1589–1594.
[56] J. Louarn, F. Cornet, V. Francois, J. Patte, J.M. Louarn, Hyper-
recombination in the terminus region of the Escherichia coli
chromosome: possible relation to nucleoid organization, J. Bacteriol.
176 (1994) 7524–7531.