Long Simple Sequence Repeats in Host-Adapted Pathogens Localize Near Genes Encoding Antigens, Housekeeping Genes, and Pseudogenes
Xiangxue Guo · Jan Mrazek
Received: 1 April 2008 / Accepted: 3 September 2008 / Published online: 17 October 2008
© Springer Science+Business Media, LLC 2008
Abstract Simple sequence repeats (SSRs) in DNA
sequences are tandem iterations of a single nucleotide or a
short oligonucleotide. SSRs are subject to slipped-strand
mutations and a common source of phase variation in
bacteria and antigenic variation in pathogens. Significantly
long SSRs are generally rare in prokaryotic genomes, and
long SSRs composed of iterations of mono-, di-, tri-, and
tetranucleotides are mostly restricted to host-adapted
pathogens. We present new results concerning associations
between long SSRs and genes related to different cellular
functions in genomes of host-adapted pathogens. We found
that in the majority of the analyzed genomes, at least some
of the genes associated with SSRs encode potential anti-
gens, which is expected if the primary function of SSRs is
their contribution to antigenic variation. However, we also
found a number of long SSRs associated with housekeep-
ing genes, including rRNA and tRNA genes, genes
encoding ribosomal proteins, amino acyl-tRNA syntheta-
ses, chaperones, and important metabolic enzymes. Many
of these genes are probably essential and it is unlikely that
they are phase-variable. Few statistically significant asso-
ciations between SSRs and gene functional classifications
were detected, suggesting that most long SSRs are not
related to a particular cellular function or process. Long
SSRs in Mycobacterium leprae are mostly associated with
pseudogenes and may be contributing to gene loss fol-
lowing the adaptation to an obligate pathogenic lifestyle.
We speculate that long SSRs (LSSRs) may have played a similar role in
genome reduction of other host-adapted pathogens.
Keywords Tandem repeats · Phase variation · Contingency loci · Antigenic variation · Genome reduction · Pathogen evolution
Introduction
Simple sequence repeats (SSRs) are tandem iterations of a
single nucleotide or a short oligonucleotide in a DNA
sequence. Long SSRs (LSSRs) are common in eukaryotes
(Kashi and King 2006; Toth et al. 2000) but rare in most
prokaryotic genomes (Field and Wills 1998; Mrazek et al.
2007). SSRs have some unusual properties that differenti-
ate them from regular DNA sequences. SSRs are subject to
slipped-strand mutations and, consequently, hypermutable
with respect to their lengths (Kashi and King 2006; Toth
et al. 2000). Mutations in SSRs located in protein coding
regions can cause frameshifts and deactivate and subse-
quently reactivate the affected genes. Gene expression can
also be influenced by mutations in SSRs located in regu-
latory regions, where such mutations can alter the activity
of promoters (Groisman and Casadesus 2005; Karlin et al.
1996; Moxon et al. 1994; Rocha 2003; Rocha and Blan-
chard 2002). Such SSR-facilitated mutations can be
beneficial in some circumstances. In particular, mutations
in SSRs can cause phase variation, that is, reversible and
inheritable switching between two phenotypes. In this
model, SSRs can act as an on/off switch for a particular
gene or operon (Groisman and Casadesus 2005; Moxon
et al. 1994; van der Woude and Baumler 2004). In patho-
gens, SSRs often influence genes encoding antigens and
frequent mutations in these SSRs can increase antigenic
X. Guo · J. Mrazek (✉)
Department of Microbiology, University of Georgia,
Athens, GA 30602-2605, USA
e-mail: [email protected]
J. Mrazek
Institute of Bioinformatics, University of Georgia, Athens,
GA 30602, USA
J Mol Evol (2008) 67:497–509
DOI 10.1007/s00239-008-9166-5
variation within the pathogen population and aid evasion of
the host immune system (Groisman and Casadesus 2005;
Rocha 2003; Roske et al. 2001; van der Woude and
Baumler 2004). The role of SSRs in promoting antigenic
variation is generally recognized, and in several cases the
effects of SSR mutations on gene expression were experimentally
verified (reviewed by Groisman and Casadesus
2005; Moxon et al. 1994; van der Woude and Baumler
2004).
Some SSRs can also affect three-dimensional structures
and physical properties of both DNA and protein molecules
(Dunker et al. 2005; Htun and Dahlberg 1989; Li et al.
2004; Shafer and Smirnov 2000). For example, several
inherited human neurodegenerative diseases are caused by
expansion of CAG repeats in protein coding regions, which
are not deleterious per se but change properties of the
encoded proteins (Perutz 1999). Other types of SSRs can
promote structural transitions in DNA molecules (Htun and
Dahlberg 1989; Shafer and Smirnov 2000; Sinden 1994).
Our previous investigation of SSRs in Mycoplasma gen-
omes suggested that the physiological roles of SSRs may
not be limited to phase variation, and can include organi-
zation of the chromosome and influence on protein structure
and function (Mrazek 2006). Through analysis of SSRs in
more than 300 prokaryotic genomes, we have shown a large
variance among prokaryotes in terms of SSR content
(Mrazek et al. 2007). In this work, we present detailed
analysis of the relationships of SSRs with genes related to
various cellular functions and physiological processes in
host-adapted pathogens whose genomes exhibit significant
overrepresentations of long SSRs. We use comparisons
among SSRs of different lengths to identify a subset of long
SSRs that are likely maintained by selection. In the majority
of the analyzed genomes, some of these long SSRs are
associated with genes for potential antigens but many long
SSRs are located near housekeeping genes. We interpret our
data as an indication that physiological roles of SSRs in
bacterial genomes likely extend beyond their direct
involvement in phase variation. We present several
hypotheses on possible biological roles of long SSRs.
Materials and Methods
Genome Sequences and Annotations
In a previous work, we analyzed SSR content in more than
300 prokaryotic genomes and found that only a few prokaryotes
feature multiple significantly long tandem repeats
of mono-, di-, tri-, and tetranucleotides (Mrazek et al.
2007). Those which do are mostly host-adapted pathogens,
which do not readily survive in the environment outside a
host. The 11 genomes of host-adapted pathogens that
contain multiple significantly long SSRs were analyzed in
this work.
Complete DNA sequences were downloaded from the
National Center for Biotechnology Information FTP server
at ftp://ftp.ncbi.nih.gov/genomes/Bacteria/. We used three
standardized gene function classifications to determine
which types of genes are significantly often (or signifi-
cantly rarely) associated with SSRs. Gene assignments into
clusters of orthologous groups (COGs) (Tatusov et al.
1997, 2003) were obtained from the IMG database
(http://img.jgi.doe.gov/). COGs are divided into 25 gene
function classes (for details see the COG database at
http://www.ncbi.nlm.nih.gov/COG/). The COG database
only includes protein-coding genes and we added rRNA
and tRNA genes as separate classes. Genes without COG
assignments were included in another separate class. These
genes do not have homologues in the genomes used to
construct the COG database and mostly represent unique
genes found only in the particular organism or a narrow
group of related species.
Gene ontology (GO) terms assigned to individual genes
were downloaded from the UniProt database
(http://www.pir.uniprot.org/). The GO classification assigns to genes
standardized functional descriptions referring to cellular
localization, protein function, and biological process. The
Kyoto Encyclopedia of Genes and Genomes (KEGG)
orthology terms were downloaded from
http://www.genome.jp/kegg/. The KEGG database uses three
hierarchical levels of functional classification. Only five very
general categories (metabolism, cellular processes, human
diseases, genetic information processing, and environ-
mental information processing) are included at the first
level, while the third level classifies genes in specific
biological pathways. The second level provides interme-
diate classification of genes (e.g., membrane transport, cell
motility, carbohydrate metabolism).
Simple Sequence Repeats
We define SSRs as perfect, uninterrupted tandem iterations
of a single nucleotide or a short oligonucleotide. We
measure the length of an SSR in nucleotides (base pairs;
bp) rather than number of repetitive units, which allows
accounting for partial copies and facilitates comparisons
among SSRs of different lengths. Every SSR is character-
ized by the length of the repetitive unit k and the length of
the complete repeat l. By this definition, an SSR of length $l$
composed of iterations of a $k$-mer starts at position $i$ in a
sequence of nucleotides $\{x_j\}$ if $x_j = x_{j+k}$ for all
$j \ge i$, $j \le i + l - k - 1$, and simultaneously $x_{i-1} \ne x_{i-1+k}$
and $x_{i+l-k} \ne x_{i+l}$. This definition can be applied to all SSRs
of length $l \ge k$. Repeats of a longer oligonucleotide that
also qualify as repeats of a shorter oligonucleotide are only
counted as the shorter oligonucleotide SSR. We analyze the
SSR counts $N_k(l)$ in a given genome as a function of $k$ and $l$
(Mrazek 2006; Mrazek et al. 2007). In this work, we
concentrate on iterations of mono-, di-, tri-, and tetramers
($k \le 4$), which are overrepresented in genomes of several
host-adapted pathogens (Mrazek et al. 2007). The functions
$N_k(l)$ are generated for the genomic DNA sequence and
multiple random sequences created by stochastic models of
varying complexity (Mrazek 2006). Comparisons among
SSRs of varying lengths and with random sequences allow
easy detection of anomalies in SSR counts.
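The SSR definition above can be illustrated with a minimal Python sketch. This is not the authors' software: the function names (`find_ssrs`, `has_shorter_period`, `ssr_counts`), the 0-based coordinates, the `min_len` filter, and the toy sequence are all assumptions made for illustration. The sketch scans for maximal runs satisfying $x_j = x_{j+k}$, discards runs that also qualify as repeats of a shorter unit, and tabulates the counts by length as $N_k(l)$.

```python
from collections import Counter


def has_shorter_period(segment, k):
    """True if the segment is also a tandem repeat of a unit shorter than k."""
    return any(
        all(segment[t] == segment[t + p] for t in range(len(segment) - p))
        for p in range(1, k)
    )


def find_ssrs(seq, k, min_len=None):
    """Return (start, length) pairs for maximal perfect repeats of a k-mer.

    Coordinates are 0-based. min_len is an illustrative filter; by default it
    equals k, the smallest length allowed by the definition in the text.
    """
    if min_len is None:
        min_len = k
    n = len(seq)
    ssrs = []
    i = 0
    while i + k <= n:
        # Count consecutive positions j >= i that satisfy x_j == x_{j+k}.
        m = 0
        while i + m + k < n and seq[i + m] == seq[i + m + k]:
            m += 1
        length = m + k  # the run covers seq[i : i + length]
        segment = seq[i:i + length]
        if length >= min_len and not has_shorter_period(segment, k):
            ssrs.append((i, length))
        # The mismatch at position i + m ends this run; the next run starts after it.
        i += m + 1
    return ssrs


def ssr_counts(seq, k, min_len=None):
    """N_k(l): number of SSRs with unit length k, keyed by total length l."""
    return Counter(length for _, length in find_ssrs(seq, k, min_len))


toy = "GGATATATATCCCCCG"  # made-up sequence, not genomic data
print(find_ssrs(toy, 1, min_len=4))   # homopolymer runs of at least 4 bp
print(find_ssrs(toy, 2, min_len=6))   # dinucleotide repeats of at least 6 bp
print(ssr_counts(toy, 2, min_len=6))
```

Because lengths are measured in nucleotides, a run such as ATATATA (7 bp) is recorded with its partial final copy, which matches the rationale given above for not counting whole repeat units.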
Definition of Long SSRs
Previous analyses of SSRs in prokaryotes generally used an
arbitrary length cutoff, sometimes as short as 6 bp (Gur-Arie
et al. 2000). Although such short SSRs can be polymorphic
in length and cause phase variation, the vast majority of such
short SSRs occur by coincidence and probably do not have
any specific function. We developed methods to distinguish
long SSRs likely maintained by selection from the randomly
occurring SSRs based on comparisons among SSR counts of
different lengths and stochastic models (Mrazek 2006;
Mrazek et al. 2007). In most prokaryotes, SSRs occur
approximately as expected in random sequences, except that
mononucleotide SSRs (i.e., uninterrupted runs of the same
nucleotide) of a length >8 bp are underrepresented, possibly
due to negative selection (Fig. 1). In some genomes, long
SSRs are overrepresented (more abundant than expected in
random sequences) and their counts decline approximately
monotonically with an increasing SSR length (Fig. 2). These
longer-than-expected SSRs can arise from spontaneous
expansion of shorter SSRs. In prokaryotes, this pattern often
applies to SSRs composed of tandem iterations of 5- to 11-
mers but rarely to those composed of 1- to 4-mers (Mrazek
et al. 2007). In contrast, several genomes feature bimodal or
discontinuous $N_k(l)$ plots, where the SSR counts initially
follow the expected counts or even drop below the expected
counts but exhibit a separate peak and/or a discontinuity at
greater SSR lengths (Fig. 3). This discontinuity is difficult to
explain by spontaneous expansion of shorter repeats.
Although mutation-based explanations cannot be summarily
dismissed, explaining the discontinuity of the $N_k(l)$ plots
(Fig. 3) requires multiple assumptions about the character of
the mutational biases involved, and following Occam’s razor
principle we consider selection a more likely explanation.
Hence, we label the SSRs following the discontinuity in the
$N_k(l)$ plots long SSRs (LSSRs) (Fig. 3) and we assume that
most LSSRs are maintained by selection, with the implica-
tion that they have some role in the organism’s physiology.
Note that this method for determining the length cutoff for
LSSRs emphasizes comparisons among SSRs of different
length and only applies to discontinuous or bimodal $N_k(l)$
plots, whereas comparisons with random sequences are used
only for reference. Bimodal distributions of SSR counts are
found almost exclusively in genomes of some (but not all)
host-adapted pathogens (Mrazek et al. 2007). Prokaryotic
pathogens with an excess of LSSRs in their genomes are
listed in Table 1.
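As a rough illustration of reading a length cutoff off a discontinuous $N_k(l)$ plot, the sketch below looks for the first gap of empty length bins and treats everything after it as LSSRs. This is a hypothetical heuristic rather than the procedure used in this work, where the discontinuity is identified from the plots themselves; the function name `lssr_cutoff`, the `min_gap` parameter, and the example counts are invented for illustration.

```python
def lssr_cutoff(counts, min_gap=2):
    """Return the first SSR length after a gap in the length distribution.

    counts maps SSR length (bp) to the observed count N_k(l) for one unit
    size k. A "gap" is a stretch of at least min_gap consecutive lengths with
    zero observed SSRs. Returns None when the distribution has no such gap,
    i.e. when counts decline continuously and no LSSR group is separated.
    """
    lengths = sorted(length for length, count in counts.items() if count > 0)
    for previous, current in zip(lengths, lengths[1:]):
        if current - previous - 1 >= min_gap:
            return current
    return None


# Invented counts: a declining main distribution and a separate secondary peak.
bimodal = {1: 1000, 2: 300, 3: 90, 4: 20, 5: 3, 6: 1, 12: 2, 13: 5, 14: 4, 15: 1}
monotonic = {1: 1000, 2: 300, 3: 90, 4: 20, 5: 3, 6: 1}

print(lssr_cutoff(bimodal))    # length at which the secondary group begins
print(lssr_cutoff(monotonic))  # None: no discontinuity, so no LSSR cutoff
```

A gap-based rule would miss a secondary peak that is not separated by empty bins, so a plot-based judgment as described above remains necessary in such cases.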
[Fig. 1 plots: panels "1-mers" (left) and "2-mers" (right); x-axis: Length (nucleotides), 0 to 50; y-axis: Count, logarithmic scale from 0.1 to 10,000,000]
Fig. 1 Mononucleotide (left) and dinucleotide (right) SSRs in E. coli K12. Filled circles show counts of the corresponding SSR types of the
length given by the abscissa. Gray lines signify expected counts
assessed from simulations using different stochastic models of
varying degrees of complexity. Dashed lines refer to homogeneous
Bernoulli and Markov models, whereas solid gray lines refer to
heterogeneous combinations of Bernoulli, Markov, and periodic
Markov models, where a different model is used for every gene and
intergenic region (Mrazek 2006; Mrazek et al. 2007). The plot is
expected to be linear under homogeneous stochastic models, and
deviations from linearity can be interpreted as under- or overrepre-
sentations of SSRs of the corresponding lengths
[Fig. 2 plot: panel "7-mers"; x-axis: Length (nucleotides), 0 to 50; y-axis: Count, logarithmic scale from 0.1 to 10,000,000]
Fig. 2 Heptanucleotide SSRs in Nostoc PCC 7120. See legend to
Fig. 1
SSR Associations with Genes of Different Functional Assignments
Once LSSRs are identified we investigate their relationship
to genes of different functional classifications. First, we
define SSR-associated genes as genes that have an LSSR
located in their coding region or in their flanking intergenic
regions (downstream or upstream). For all SSR-associated
genes, we find the KEGG orthology terms, COG function
classes, and GO terms assigned to them and identify the
ontology terms and functional classifications that occur
significantly more or less often among the SSR-associated
[Fig. 3 plots: panels "1-mers" (left) and "2-mers" (right), each with the secondary peak labeled "LSSRs"; x-axis: Length (nucleotides), 0 to 50; y-axis: Count, logarithmic scale from 0.1 to 10,000,000]
Fig. 3 Mononucleotide SSRs in Mycoplasma hyopneumoniae (left) and dinucleotide SSRs in Lawsonia intracellularis (right). SSRs comprising the secondary peak are considered long SSRs (LSSRs). See legend to Fig. 1
Table 1 Overrepresented COG function classes, KEGG terms, and GO terms among SSR-associated genes of different pathogenic bacteria
Genome | COG | KEGG | GO
Helicobacter pylori J99 | Cell cycle control, cell division, chromosome partitioning (+); inorganic ion transport and metabolism (+); p = 0.023, p′ = 0.005 | |
Haemophilus influenzae 86-028NP | Cell wall, membrane, envelope biogenesis (++); p = 0.129, p′ = 0.085 | Glycan biosynthesis and metabolism (+++) | Lipopolysaccharide biosynthetic process (++); transferase activity (+++); outer membrane (+)
Lawsonia intracellularis PHE/MN1-00 | Not classified^a (++); p = 0.227, p′ = 0.622 | | Protein folding (++); unfolded protein binding (+)
Mycobacterium leprae TN | p = 0.269, p′ = 0.269 | |
Mycoplasma capricolum ATCC 27343 | Defense mechanisms (+); not classified^a (+++); p = 0.026, p′ = 0.007 | |
Mycoplasma gallisepticum R | Not classified^a (+++); p = 0.002, p′ = 0.393 | |
Mycoplasma genitalium G37 | p = 0.161, p′ = 0.584 | |
Mycoplasma hyopneumoniae 232 | rRNA (++); not classified^a (++); p = 0.012, p′ = 0.395 | |
Mycoplasma mycoides SC str. PG1 | Defense mechanisms (+); not classified^a (+++); p = 0.029, p′ = 0.040 | |
Mycoplasma pulmonis UAB CTIP | Replication, recombination and repair (++); p = 0.166, p′ = 0.114 | Replication and repair (+++) | DNA binding (++); DNA methylation (+++); site-specific DNA-methyltransferase (+++); N-methyltransferase activity (+++)
Xanthomonas oryzae MAFF 311018 | Carbohydrate transport and metabolism (++); p = 0.268, p′ = 0.106 | |
Note: Symbols in parentheses signify the level of overrepresentation of that particular functional description: + corresponds to probability 0.01 < p ≤ 0.05; ++, to 0.001 < p ≤ 0.01; and +++, to p ≤ 0.001. No correction for multiple hypothesis testing has been used. The overall goodness-of-fit test was performed for COG classification and the relevant p-values are listed in the table. The first p-value relates to the test involving all classes listed in Tables 3 and 4, whereas the second p-value, labeled p′, excludes the classes with uncertain functional assignments “general function prediction only,” “function unknown,” and “not classified.” The goodness-of-fit test was not performed for the KEGG and GO classifications. See Methods for details
a The “not classified” COG category includes genes that were not classified in the COG database. These generally represent unique genes that do not have homologues in genomes that were used in the design of the COG database
genes compared to the complete gene collection of the
organism. If all genes were equally likely to be associated
with SSRs, one would expect to find on average
$\bar{m}_X = \frac{M_X}{M}\, m$ SSR-associated genes with the functional assignment
X, where M is the number of all annotated genes, m is the
number of all SSR-associated genes, and $M_X$ is the number
of all annotated genes with the functional assignment X.
For M large and $\bar{m}_X$ small, the observed counts $m_X$ of SSR-associated
genes with the functional assignment X follow
approximately the Poisson distribution
$\Pr(m_X) = \exp(-\bar{m}_X)\, \frac{\bar{m}_X^{\,m_X}}{m_X!}$.
The Poisson approximation is used to estimate
the probability that the number of SSR-associated
genes in a particular gene function class exceeds a given
value. These statistical assessments are used to identify
potential trends which are subsequently investigated at the
level of individual genes.
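The Poisson assessment described above can be sketched in a few lines of Python. The function names are illustrative, and the choice to report the upper tail as P(count ≥ observed) is an assumption about how "exceeds a given value" is applied. The worked example uses the rRNA row of Table 3 (3 SSR-associated genes observed, 0.426 expected).

```python
import math


def expected_count(total_genes, ssr_genes, class_genes):
    """Expected SSR-associated genes in class X: (M_X / M) * m."""
    return class_genes * ssr_genes / total_genes


def poisson_upper_tail(observed, mean):
    """P(count >= observed) for a Poisson variable with the given mean.

    Assumption: the tail includes the observed value itself.
    """
    below = sum(
        math.exp(-mean) * mean**j / math.factorial(j) for j in range(observed)
    )
    return 1.0 - below


def significance_symbol(p):
    """Map a tail probability to the symbols used in Table 1."""
    if p <= 0.001:
        return "+++"
    if p <= 0.01:
        return "++"
    if p <= 0.05:
        return "+"
    return ""


# rRNA genes of M. hyopneumoniae (Table 3): observed 3, expected 0.426.
p_rrna = poisson_upper_tail(3, 0.426)
print(round(p_rrna, 4), significance_symbol(p_rrna))
```

Under these assumptions the rRNA class falls in the 0.001 < p ≤ 0.01 band, in line with the symbol reported for it in Table 3. Underrepresentation could be assessed the same way with the lower tail.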
In addition to testing over- and underrepresentation of
genes of specific function classes among SSR-associated
genes, we evaluate the overall goodness of fit between the
observed and the expected counts for the COG function
classes using the chi-square test and Monte Carlo simulations.
The $\chi^2$ value is calculated by the standard formula
$\chi^2 = \sum_X \frac{(m_X - \bar{m}_X)^2}{\bar{m}_X}$.
However, the standard way of converting
the $\chi^2$ value into a probability using the $\chi^2$ distribution is
not applicable because the expected counts are too small to
provide accurate results. Instead, we determine the p-value
by performing random simulations, where the m SSR-
associated genes are randomly assigned to the gene func-
tion classes proportionately to the expected counts �mX and
the p-value is assessed as the fraction of simulations that
yield $\chi^2$ greater than or equal to that obtained with the real data.
The p-values reported in Table 1 are based on $10^6$ simulations.
This method is only applied to COG classification
because it involves a relatively small number of function
classes and only few genes are assigned to more than one
class.
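A minimal sketch of this simulation-based goodness-of-fit test follows, assuming each of the m SSR-associated genes is assigned independently to one class with probability proportional to its expected count. The function names, the fixed random seed, the small tolerance for floating-point ties, and the example counts are illustrative choices and are not taken from the original analysis.

```python
import random
from collections import Counter


def chi_square(observed, expected):
    """Standard chi-square statistic summed over all function classes."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def monte_carlo_p_value(observed, expected, n_sim=10000, seed=0):
    """Fraction of simulations with chi-square >= the observed value.

    observed: SSR-associated gene counts per class (m_X).
    expected: expected counts per class; all must be positive.
    """
    rng = random.Random(seed)
    classes = range(len(expected))
    m = sum(observed)
    chi2_observed = chi_square(observed, expected)
    hits = 0
    for _ in range(n_sim):
        # Assign m genes to classes in proportion to the expected counts.
        draw = Counter(rng.choices(classes, weights=expected, k=m))
        simulated = [draw[c] for c in classes]
        if chi_square(simulated, expected) >= chi2_observed - 1e-12:
            hits += 1
    return hits / n_sim


# Made-up example with three classes and ten SSR-associated genes.
expected = [2.0, 3.0, 5.0]
print(monte_carlo_p_value([2, 3, 5], expected, n_sim=500))    # perfect fit
print(monte_carlo_p_value([10, 0, 0], expected, n_sim=2000))  # extreme skew
```

The resolution of the estimate is 1/n_sim, so a run on the scale reported here ($10^6$ simulations) is needed to resolve p-values near 0.001 reliably.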
Results
Functional Classification of Genes Associated with LSSRs
Table 1 shows a summary of overrepresented gene func-
tion classes and ontology terms among SSR-associated
genes in genomes of pathogenic bacteria with high counts
of LSSRs. Results of the overall goodness-of-fit test are
shown for COG classifications. Note that different func-
tional descriptions in Table 1 can arise from the same set of
SSR-associated genes. For example, the GO terms “DNA
binding,” “DNA methylation,” “site-specific DNA-methyltransferase,”
and “N-methyltransferase activity” for M.
pulmonis all relate to several SSR-associated genes
encoding DNA methyltransferases. The tests for individual
function classes are conducted independently and are less
conservative than overall goodness-of-fit ($\chi^2$) tests, which
indicate that LSSRs in most genomes are distributed
approximately randomly among different gene function
classes (p > 0.05). We next discuss the SSR-associated
genes in genomes where the $\chi^2$ test indicates nonrandom
distribution or one or more function classes shows an
excess of SSR-associated genes with p < 0.01 (symbol ++
or +++).
Haemophilus influenzae
The class “cell wall, membrane, envelope biogenesis” in
COG classification is overrepresented among SSR-associated
genes. This is consistent with the overrepresented KEGG
classification “glycan biosynthesis and metabolism” and
GO terms “lipopolysaccharide biosynthetic process” and
“transferase activity.” All these overrepresented functional
descriptions refer to a set of genes involved in lipopoly-
saccharide biosynthesis. Mutations in these genes can
affect properties of the cell surface, which is consistent
with the role of the LSSRs in antigenic variation.
Helicobacter pylori
The distribution of SSR-associated genes among the COG
classes is nonrandom (indicated by the $\chi^2$ test; Table 1),
with main contributions from the classes “Cell cycle control,
cell division, chromosome partitioning” and
“Inorganic ion transport and metabolism.” H. pylori features
a number of SSR-associated outer membrane proteins
and transporters (see also Table 2), which leads to the
marginal overrepresentation of the latter class. The former
class includes four SSR-associated genes encoding ATP-binding
protein (mrp), rod shape-determining protein
(mreB), and two proteins assigned to COGs described as
“ATPases involved in chromosome partitioning,” which
are annotated as hypothetical protein (HP0059) and pathogenicity
island protein CagA (HP0547). The CagA protein
is a major virulence protein involved in disruptions of the
host epithelium (Amieva et al. 2003). Locations of LSSRs
near genes encoding outer membrane proteins and viru-
lence genes are consistent with their possible role in
antigenic variation.
Mycoplasma capricolum and Mycoplasma mycoides
M. capricolum and M. mycoides each contain two SSR-
associated genes for ABC transporters (Table 2) classified
in the COG class “Defense mechanisms,” which results
in a marginal overrepresentation (probability < 0.05 but
> 0.01) of this function class among SSR-associated genes.
Mycoplasmas do not have the outer membrane or cell wall,
and components of ABC transporters could be recognized
as antigens. Hence the overrepresentation of this group
among SSR-associated genes is consistent with a role of
SSRs in antigenic variation.
Lawsonia intracellularis and Mycoplasma hyopneumoniae
These genomes have more LSSRs and SSR-associated
genes than other bacteria and their SSR-associated genes
are distributed mostly randomly among different function
classes (Tables 1, 3, and 4). Notably, all three rRNA genes
Table 2 SSR-associated genes potentially contributing to antigenic variation
Classification Locus SSR location Description
Cell wall, membrane, envelope biogenesis HP0855 Up Alginate O-acetylation protein (algI)
HP1105 Up LPS biosynthesis protein
HP1341 Down Siderophore-mediated iron transport protein (tonB)
LI0092 Up ADP-heptose:LPS heptosyltransferase (rfaF/rfaQ)
LI0319 Up ADP-heptose:LPS heptosyltransferase
LI0452 Up 3-Deoxy-D-manno-octulosonic acid (KDO) 8-phosphate synthase (kdsA)
LI0661 Down Glucosamine 6-phosphate synthetase (glmS)
LI0730 Down Predicted UDP-glucose 6-dehydrogenase (ugd)
LI0920 Down Membrane protein related to metalloendopeptidases
LI0984 Up Phosphomannose isomerase/GDP-mannose pyrophosphorylase (xanB)
MCAP_0063 Up Glycosyl transferase, group 2 family protein
MCAP_0270 Down ABC transporter, permease protein, putative
mhp676 Down Glucose-inhibited division protein B (gidB)
NTHI0365 In UDP-galactose–lipo-oligosaccharide galactosyltransferase (lgtC)
NTHI0677 In UDP-Gal–lipo-oligosaccharide galactosyltransferase (lic2A)
NTHI0913 Up UDP-glucose–lipo-oligosaccharide glucosyltransferase (lex2B)
NTHI1597 In LicA
NTHI1750 In Putative glycosyl transferase, glycosyl transferase family 8 protein
XOO_0008 In TonB protein (tonB)
NTHI0585 In Autotransported protein Lav (lav)
NTHI0472 In CMP-Neu5Ac–lipo-oligosaccharide α 2–3 sialyltransferase (lic3A)
NTHI1034 In CMP-Neu5Ac–lipo-oligosaccharide α 2–3 sialyltransferase (lic3A2)
Defense mechanism HP0464 In Type I restriction enzyme R protein (hsdR)
HP1521 Up Type III restriction enzyme R protein (res)
MCAP_0587 Up ABC transporter, ATP-binding protein, putative
MCAP_0655 Up ABC transporter, ATP-binding protein
mhp025 Down ABC transporter ATP binding protein
mhp686 Down Multidrug resistance protein homologue (pr2)
MSC_0398 Up Na+ ABC transporter, ATP-binding component (natA)
MSC_0704 Up ABC transporter, ATP-binding component (Na+)
Lipoprotein MCAP_0431 [up], MCAP_0432 [up], MCAP_0433 [up], MCAP_0470 [in]; MCAP_0593(vmcD)
[up], MCAP_0594(vmcC) [up], MCAP_0595(vmcB) [up], MCAP_0629 (vmcE) [up] and
MCAP_0630 (vmcF) [up]; MG_338 [in]; MSC_0397 (lpp) [down], MSC_0390 (vmm) [up];
MSC_0847(lpp) [up], MSC_1005(lpp) [up], MYPU_0190 [up], MYPU_4780 [up], MYPU_6520
[down]
Outer membrane protein HP0009(omp1) [up], HP0025(omp2) [up], HP0227(omp5) [down], HP0722(omp16) [up],
HP0725(omp17) [up], HP0896(omp19) [*2086–18], HP0912(omp20) [up], HP1342(omp29) [up]
Note: Genes are labeled with gene tags used in GenBank annotation. The initial letters signify the organism: HI: Haemophilus influenzae 86-028NP; HP: Helicobacter pylori 26695; jhp: Helicobacter pylori J99; LI: Lawsonia intracellularis PHE/MN1-00; MCAP: Mycoplasma capricolum subsp. capricolum ATCC 27343; MG: Mycoplasma genitalium G37; mhp: Mycoplasma hyopneumoniae 232; ML: Mycobacterium leprae TN; MSC: Mycoplasma mycoides subsp. mycoides SC str. PG1; MYPU: Mycoplasma pulmonis UAB CTIP; XOO: Xanthomonas oryzae pv. oryzae MAFF 311018. “Up,” “down,” or “in” signify the SSR location upstream, downstream, or within the associated gene, respectively
of M. hyopneumoniae are associated with LSSRs (Table 3).
Many hypothetical genes of unknown function in L. in-
tracellularis have proximal LSSRs, and some of them
might be involved in antigenic variation. Surprisingly, four
of six genes labeled with the GO term “protein folding” in
L. intracellularis are associated with LSSRs (Table 1).
These include genes for molecular chaperones DnaK,
DnaJ, CbpA, and GrpE.
Mycoplasma pulmonis
Five SSR-associated genes of M. pulmonis encode DNA methylases,
which give rise to all overrepresented functional classifications
for this genome in Table 1. Some of these DNA methylases might
indirectly contribute to antigenic variation by altering
methylation patterns in distant regions of the chromosome
and influencing expression levels of other genes (Rocha
and Blanchard 2002).
LSSRs and Antigenic Variation
LSSRs associated with genes whose products are accessi-
ble on the surface of the cell or function in biogenesis of
cellular surface structures could directly contribute to
antigenic variation. These SSR-associated genes are listed
in Table 2. Most SSR-associated genes of H. influenzae
belong to two main groups: cell envelope biogenesis genes
and those related to outer membrane, intracellular traf-
ficking and secretion (Table 2). In M. capricolum, some
SSR-associated genes encode putative membrane proteins,
including three ABC transporters, five genes of the vmc
cluster, and several lipoproteins of the Lpp family. The vmc
genes contain dinucleotide SSRs (TA iterations) in their
putative promoter regions, which govern the phase variable
expression of these genes (Wise et al. 2006). M. mycoides
has two SSR-associated genes encoding ATPase compo-
nents of ABC-type multidrug transport systems, which are
potential surface antigens (Blanchard et al. 1996; Raheri-
son et al. 2002; Subramaniam et al. 2000). Both have a 14-
bp run of T in their upstream flanking regions. Many
putative lipoproteins and variable surface proteins in M.
mycoides are also associated with LSSRs (Table 2). Some
SSR-associated genes in H. pylori, L. intracellularis, M.
hyopneumoniae, and X. oryzae also encode proteins located
on the cell surface and could influence the interaction of
the pathogen with its host (Table 2).
Table 3 COG classifications of SSR-associated genes of Mycoplasma hyopneumoniae
COG classification All genes With SSR Expected Difference
Energy production and conversion 25 4 3.552
Cell cycle control, cell division, chromosome partitioning 4 0 0.568
Amino acid transport and metabolism 25 0 3.552 –
Nucleotide transport and metabolism 21 5 2.983
Carbohydrate transport and metabolism 57 8 8.098
Coenzyme transport and metabolism 7 0 0.994
Lipid transport and metabolism 5 0 0.71
Translation, ribosomal structure, and biogenesis 94 12 13.354
Transcription 17 0 2.415
Replication, recombination, and repair 50 2 7.103 –
Cell wall/membrane/envelope biogenesis 8 1 1.137
Posttranslational modification, protein turnover, chaperones 18 1 2.557
Inorganic ion transport and metabolism 19 1 2.699
Signal transduction mechanisms 3 1 0.426
Intracellular trafficking, secretion, and vesicular transport 12 1 1.705
Defense mechanisms 23 2 3.268
rRNA 3 3 0.426 ++
tRNA 30 6 4.262
General function prediction only 45 5 6.393
Function unknown 18 1 2.557
Not classified 260 52 36.938 ++
Note: The table lists the number of all genes and the number of SSR-associated genes in each COG function category, the expected number of
SSR-associated genes, and the significance of the difference between the observed and the expected numbers (see Methods). (–) Probability
<5%; (++) probability <1%
LSSRs Are Often Associated with Housekeeping and Essential Genes
Surprisingly, we found that SSRs are also associated with
many housekeeping genes, such as rRNA and tRNA genes,
ribosomal protein genes, amino acyl-tRNA synthetases,
and chaperones (Table 5). Some SSR-associated genes
encode enzymes of central pathways of energy metabolism,
including citrate synthase and isocitrate dehydrogenase
functioning in the citrate cycle, and a subunit of ATP
synthase. The rod shape-determining protein MreB and
putative chromosome partitioning ATPases are encoded by
SSR-associated genes in H. pylori and L. intracellularis,
respectively. Many rRNA and tRNA genes and genes
encoding enzymes that contribute to protein synthesis are
among SSR-associated genes in M. hyopneumoniae, L.
intracellularis, and, to a lesser extent, several other gen-
omes (Table 5). Also among SSR-associated genes in some
of the analyzed genomes are those encoding general
chaperones such as DnaK, DnaJ, and GroEL, the single-
stranded DNA-binding protein, DNA helicases, gyrases,
recombinases, and a subunit of DNA polymerase III. Many of
those genes are probably essential for a living cell, and
their inactivation would likely be lethal.
Table 6 shows locations of LSSRs with respect to pre-
dicted housekeeping operons in M. hyopneumoniae and
L. intracellularis. The operon predictions were adopted
from the Microbes Online database
(http://www.microbesonline.org/) (Price et al. 2005). In most cases, LSSRs
are located upstream of the first gene in an operon where
they could have a direct effect on transcription initiation
(Table 6).
SSRs May Contribute to Genome Reduction in M. leprae
The M. leprae genome contains 16 LSSRs, more than any
other mycobacterium (Mrazek et al. 2007). Interestingly,
most of the SSR-associated “genes” are pseudogenes of
diverse original functions (Table 7). Apart from three
membrane proteins, other SSR-associated pseudogenes are
similar to transcriptional regulators, or genes involved in
the metabolism of cofactors and vitamins, amino acids, and
nucleotides. LSSRs are often located within or upstream of
Table 4 COG classifications of SSR-associated genes in Lawsonia intracellularis
COG classification                                              All genes  With SSR  Expected  Difference
Energy production and conversion                                2          0         0.235
Cell cycle control, cell division, chromosome partitioning      49         9         5.765
Amino acid transport and metabolism                             19         2         2.235
Nucleotide transport and metabolism                             82         3         9.647     –
Carbohydrate transport and metabolism                           44         5         5.176
Coenzyme transport and metabolism                               52         3         6.118
Lipid transport and metabolism                                  66         8         7.765
Translation, ribosomal structure, and biogenesis                28         1         3.294
Transcription                                                   122        12        14.353
Replication, recombination, and repair                          33         3         3.882
Cell wall/membrane/envelope biogenesis                          68         8         8
Cell motility                                                   134        11        15.765
Posttranslational modification, protein turnover, chaperones    55         8         6.471
Inorganic ion transport and metabolism                          54         7         6.353
Secondary metabolites biosynthesis, transport                   43         3         5.059
Signal transduction mechanisms                                  15         1         1.765
Intracellular trafficking, secretion, and vesicular transport   123        18        14.471
Defense mechanisms                                              67         7         7.882
rRNA                                                            40         7         4.706
tRNA                                                            44         5         5.176
General function prediction only                                7          0         0.824
Function unknown                                                294        52        34.588    ++
Not classified                                                  2          0         0.235
Note: The table lists the number of all genes and the number of SSR-associated genes in each COG function category, the expected number of
SSR-associated genes, and the significance of the difference between the observed and the expected numbers (see Methods). (–) Probability
<5%; (++) probability <1%
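As an illustration of how the Expected column of Table 4 can be read, the following Python sketch assumes that SSR-associated genes are distributed among categories in proportion to category size. The fraction 2/17 is inferred here from the tabulated values (for example, 8 expected among 68 genes) and is not stated in the text. The one-sided binomial tail is likewise only a stand-in for the test described in Methods, so its probabilities need not agree with the significance marks in the table.

```python
from math import comb

# Assumed fraction of SSR-associated genes; inferred from the Expected column
# of Table 4 (e.g., 68 genes -> 8 expected), not taken from the Methods.
SSR_FRACTION = 2 / 17


def expected_count(n_genes, fraction=SSR_FRACTION):
    """Expected number of SSR-associated genes in a category of n_genes."""
    return n_genes * fraction


def binomial_tail(n_genes, observed, fraction=SSR_FRACTION):
    """One-sided binomial probability of a count at least as far from the
    expectation as `observed` (upper tail if above, lower tail otherwise).
    This is a hypothetical test, not necessarily the one used in the paper."""
    pmf = [
        comb(n_genes, k) * fraction**k * (1 - fraction) ** (n_genes - k)
        for k in range(n_genes + 1)
    ]
    if observed >= expected_count(n_genes, fraction):
        return sum(pmf[observed:])
    return sum(pmf[: observed + 1])


# (all genes, genes with SSR) copied from three rows of Table 4
rows = {
    "Nucleotide transport and metabolism": (82, 3),
    "Transcription": (122, 12),
    "Function unknown": (294, 52),
}

for name, (n_genes, observed) in rows.items():
    print(name, round(expected_count(n_genes), 3), binomial_tail(n_genes, observed))
```

Under these assumptions the expected counts reproduce the tabulated values to three decimals, while the tail probabilities should be taken only as a rough guide to which categories deviate from a proportional distribution.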
504 J Mol Evol (2008) 67:497–509
Table 5 SSR-associated housekeeping genes
COG classification
  Locus        SSR location     Description

Energy production and conversion
  HP0026       Down             Citrate synthase
  LI0276       Down             Cytochrome bd-type quinol oxidase, subunit 2 (cyoA)
  LI0790       Up               Ferredoxin oxidoreductase, α subunit (vorB)
  mhp476       Down             ATP synthase subunit B (atpD)
Cell division, chromosome partitioning
  LIB024       In               Chromosome partitioning ATPase (parA)
  LI0814       Up               Chromosome partitioning ATPase (soj)
  HP0743       Up               Rod shape-determining protein (mreB)
Translation, ribosomal structure, and biogenesis
  mhp186       Up               30S ribosomal protein S10 (rps10)
  mhp094       Up               30S ribosomal protein S16 (rpsP)
  mhp307       Up               30S ribosomal protein S6 (rpsF)
  mhp638       Up               50S ribosomal protein L10 (rplJ)
  mhp459       Up               50S ribosomal protein L11 (rplK)
  MYPU_1300    Up               30S ribosomal protein S1
  MYPU_4670    Down             50S ribosomal protein L19
  LI0983       Down             Ribosomal protein L17 (rplQ)
  LI0560       Down             Ribosomal protein S9
  mhp416       Up               Asparaginyl-tRNA synthetase (asnS)
  mhp030       In               Aspartyl/glutamyl-tRNA amidotransferase subunit B (gatB)
  LI0986       Up               Histidyl-tRNA synthetase (hisS)
  mhp128       Down             Seryl-tRNA synthetase (serS)
  mhp106       Up               Phenylalanyl-tRNA synthetase α chain (pheS)
  mhp105       Down             Phenylalanyl-tRNA synthetase β subunit (pheT)
  ML0238       Down             Methionine–tRNA ligase (metG)
  mhp430       Up               Elongation factor P (efp)
Replication, recombination and repair
  LI0287       Up               ATP-dependent exoDNAse (exonuclease V), α subunit-helicase superfamily I member (recD)
  LI0194       Down             Primosomal protein N′ (replication factor Y) superfamily II helicase (priA)
  HP0911       Down             Helicase, single-stranded DNA-dependent ATPase (rep)
  LIC081       In               Replicative DNA helicase (recQ)
  LI0257       Up               DNA polymerase III, α subunit (dnaE)
  MG_031       Up               DNA polymerase III, α subunit, Gram+ (polC-1)
  mhp270       Down             DNA gyrase subunit B (gyrB)
  LI0492       Up               Single-stranded DNA-binding protein
Posttranslational modification, protein turnover, chaperones
  LI0124       Up               DnaJ-class molecular chaperone (cbpA)
  LI0685       Down             DnaJ-class molecular chaperone (dnaJ)
  LI0912       Up               Molecular chaperone (dnaK)
  MYPU_2230    Up               Molecular chaperone (dnaK)
  LI1048       Up, Down/In^a    Molecular chaperone (grpE)
  HP0010       Down             Chaperonin (groEL)
Carbohydrate transport and metabolism
  XOO_2190     Up               ABC-type sugar transport ATPase
  XOO_2191     Up               Glucose-6-phosphate 1-dehydrogenase
  XOO_0301     Up               Glucose dehydrogenase
rRNA
  MG_rrnA-16S [up]; mhprRNA-16S [up]; mhprRNA-23S [down]; mhprRNA-5S [up, down^a]
the first in a string of pseudogenes transcribed in the same
direction and constituting a putative operon (Table 7).
Searching the translated nucleotide database using the
tblastx program (McGinnis and Madden 2004), we found
that those SSRs are absent from homologous active genes
in the genomes of other mycobacteria (data not shown). If
LSSRs arose after inactivation of the genes due to relaxed
selective constraints, one would expect LSSRs to be
located anywhere within the operon. The locations
upstream of the first pseudogene of a putative operon
suggest that the LSSRs contributed directly to the inacti-
vation of these operons. It has been argued that M. leprae is
in an early stage of genome reduction following a recent
adaptation to an obligate pathogenic lifestyle (Cole et al.
2001), and it is possible that SSRs contribute to permanent
loss of genes and operons from the chromosome.
Discussion
Many cases of antigenic variation facilitated by SSRs have
been documented (Groisman and Casadesus 2005; Moxon
et al. 1994; van der Woude and Baumler 2004) and we
expected to detect an overrepresentation of SSR-associated
genes among those related to cell surface structures and
pathogen-host interactions in the analyzed genomes. This
was the case for H. influenzae and H. pylori but to a lesser
extent for other analyzed genomes, where the distribution
of SSR-associated genes among different gene function
classes was mostly random (Table 1).
Surprisingly, we found a large number of SSR-associ-
ated housekeeping genes, such as rRNA and tRNA genes,
ribosomal protein genes, amino acyl-tRNA synthetases,
chaperones, and some metabolic enzymes, many of which
are probably essential, and their inactivation by SSR-
induced mutations could be lethal. These SSRs are unlikely
to cause phase variation, which is generally viewed as
reversible and inheritable alterations of the cell phenotype
(van der Woude and Baumler 2004). How can LSSRs
affecting housekeeping genes be beneficial to pathogens? If
mutations in these LSSRs affect expression of house-
keeping genes, they could alter growth rates of different
cell lineages, which could possibly further increase the
antigenic variance within the pathogen population in
combination with other mechanisms. However, we are not
aware of any other evidence supporting this hypothesis,
and it is unclear whether it could be effective. Alterna-
tively, LSSRs associated with housekeeping genes could
serve to influence expression of these genes by affecting
physical properties of DNA or RNA molecules, or they
could function as regulatory elements. The frequent loca-
tion of LSSRs upstream of the first gene in an operon is
consistent with their possible role in transcription initiation.
The apparently random association of LSSRs with gene
function classes in some of the analyzed genomes could
also indicate that the SSRs are not directly related to the
adjacent genes and, instead, influence some genomewide
processes, such as organization and stability of the chro-
mosome, replication, or chromosome segregation. It is also
possible that some LSSRs arise from spontaneous expan-
sions of shorter SSRs in the absence of selective
constraints, a hypothesis that we previously deemed less
likely. However, as stated earlier, explaining the bimodal
distributions of SSR counts (Fig. 3) by spontaneous
expansion alone requires additional assumptions. More-
over, considering that the bimodal distributions of SSR
counts are generally restricted to a subset of host-adapted
pathogens (Mrazek et al. 2007), it is easy to envision the
pathogen-host interactions as a source of selective con-
straints affecting these genomes, whereas it is unclear how
the spontaneous expansion should be restricted to this
group of bacteria.
An intriguing explanation of LSSR association with
genes relates to LSSRs located near pseudogenes in M.
Table 5 continued
COG classification
  Locus        SSR location     Description

tRNA
  MCAP_0062 [up]; mhptRNA-Asn4 [up]; mhptRNA-Gly [up]; mhptRNA-His [up]; mhptRNA-Leu4 [down];
  mhptRNA-Ser2 [up]; mhptRNA-Tyr [up]; MLt18 (leuV) [down]; MLt27 (valV) [down];
  MYPU_TRNA_LEU_1 [up]
Note: Genes are labeled with gene tags used in GenBank annotation. The initial letters signify the organism: HI = Haemophilus influenzae
86-028NP; HP = Helicobacter pylori 26695; jhp = Helicobacter pylori J99; LI = Lawsonia intracellularis PHE/MN1-00; MCAP = Mycoplasma
capricolum subsp. capricolum ATCC 27343; MG = Mycoplasma genitalium G37; mhp = Mycoplasma hyopneumoniae 232; ML = Mycobacterium
leprae TN; MSC = Mycoplasma mycoides subsp. mycoides SC str. PG1; MYPU = Mycoplasma pulmonis UAB CTIP; XOO = Xanthomonas
oryzae pv. oryzae MAFF 311018. "Up," "down," or "in" signify the SSR location upstream, downstream, or within the associated gene,
respectively
^a The GrpE gene of L. intracellularis has one LSSR upstream and another overlapping the stop codon. The 5S rRNA gene of M. hyopneumoniae
has two LSSRs near both ends of the gene. According to the annotation, LSSRs are located inside the gene, but alignments with other 5S rRNA
genes suggest that LSSRs could be outside the section corresponding to the 5S rRNA (Mrazek 2006)
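The "up," "down," and "in" labels of Table 5 can be made concrete with a short sketch. The function below is hypothetical and only illustrates the orientation logic: it assumes 1-based inclusive coordinates on the forward strand, treats any overlap with the gene as "in," and ignores the distance cutoff that decides whether an SSR counts as associated with a gene at all (see Methods).

```python
def ssr_location(gene_start, gene_end, strand, ssr_start, ssr_end):
    """Classify an SSR as 'up', 'down', or 'in' relative to one gene.

    Coordinates are 1-based and inclusive, with gene_start <= gene_end and
    ssr_start <= ssr_end; strand is '+' or '-'. Any overlap is reported as
    'in' (an assumption made for this sketch).
    """
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    if ssr_end >= gene_start and ssr_start <= gene_end:
        return "in"
    ssr_is_left_of_gene = ssr_end < gene_start
    if strand == "+":
        # Forward-strand genes are transcribed left to right.
        return "up" if ssr_is_left_of_gene else "down"
    # Reverse-strand genes are transcribed right to left.
    return "down" if ssr_is_left_of_gene else "up"


# Made-up coordinates, for illustration only
print(ssr_location(100, 200, "+", 50, 70))    # SSR left of a forward gene
print(ssr_location(100, 200, "-", 50, 70))    # SSR left of a reverse gene
print(ssr_location(100, 200, "+", 195, 215))  # SSR overlapping the gene end
```

An SSR that overlaps the stop codon, as described in the footnote for the GrpE gene, would be reported as "in" under this overlap rule.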
leprae. Unlike its close relatives, M. leprae has never been
cultivated outside a host and has been proposed to be in an
early process of genome reduction following its adaptation
to the obligate pathogenic lifestyle (Cole et al. 2001; Lerat
and Ochman 2004). The abundance of LSSRs in or near
pseudogenes could be explained by two mechanisms: (i)
LSSRs directly contribute to inactivation of genes and/or
operons, or (ii) LSSRs arise after inactivation of a gene or
operon due to relaxed selective constraints. If the latter
model is correct, one might expect LSSRs to occur equally
Table 6 Location of LSSRs in the housekeeping operons in M. hyopneumoniae and L. intracellularis
Operon                                                    SSR                     Gene locus                        SSR location
M. hyopneumoniae
ATP synthases and hypothetical proteins                   T(20)                   mhp476–mhp482                     |←←←←←←←
30S and 50S ribosomal proteins                            T(21)                   mhp186–mhp206                     |→→→…→→→
Ribosomal proteins for protein synthesis                  A(44)                   mhp094–mhp099                     |→→→→→→
30S ribosomal proteins, single-strand binding,
  and GTP-binding proteins                                A(21)                   mhp304–mhp307                     ←←←←|
50S ribosomal proteins and RNA polymerases                T(20)                   mhp634–mhp638                     ←←←←←|
50S ribosomal proteins                                    T(20)                   mhp458–mhp459                     ←←|
tRNA synthetase, DNA helicase, and hypothetical proteins  T(26)                   mhp416–mhp420                     |→→→→→
tRNA amidotransferase and amidase                         [AAAC](4)AA             mhp029–mhp030                     →→*
Seryl-tRNA synthetase and phosphopyruvate hydratase       [AAAT](5)               mhp128–mhp129                     |←←
Phenylalanyl-tRNA synthetases                             T(23)                   mhp105–mhp106                     ←|←
Elongation factor and transketolase                       T(18)                   mhp430–mhp431                     |→→
p97 cilium adhesion paralogues                            A(14)                   mhp271–mhp272                     |→→
5S rRNA                                                   A(49),T(27)             rRNA-5S                           |→|
16S and 23S rRNAs                                         T(20)                   rRNA-16S-23S                      ←←|
tRNA-Thr1, Val, Glu, and Asn4                             T(18)                   mhptRNA-Thr1, -Val, -Glu, -Asn4   ←←←←|
tRNA-Gly                                                  T(26)                   mhptRNA-Gly                       |→
tRNA-His                                                  A(21)                   mhptRNA-His                       |→
tRNA-Leu4 and Ser2                                        T(20)                   mhptRNA-Leu4, -Ser2               ←|←
tRNA-Tyr and Gln                                          T(21)                   mhptRNA-Gln, -Tyr                 ←←|
L. intracellularis
Cytochrome bd-type quinol oxidases                        [TA](7)                 LI0275–LI0276                     →|→
Ferredoxin oxidoreductases                                [TA](7)                 LI0790–LI0792                     |→→→
Chromosome partitioning ATPase                            [ATT](6)AT              LIB024                            →*
Chromosome partitioning ATPases, transcriptional regulator,
  and ADP-heptose synthase                                [AC](8)                 LI0814–LI0816                     |→→→
Ribosomal proteins                                        [TTA](7)T               LI0959–LI0983                     →→→…→→|
Ribosomal proteins                                        [AT](6)                 LI0559–LI0560                     →→|
tRNA synthetases                                          [TTA](5)TT              LI0986–LI0987                     |→→
ExoDNAse and cell-wall-associated hydrolase               [ATTA](4)               LI0286–LI0287                     ←←|
Replication helicase, UDP-glucose pyrophosphorylase,
  phosphomannomutase                                      [ATA](7)                LI0191–LI0194                     →→→→|
Replicative DNA helicase                                  [AT](7)                 LIC081                            →*
DNA polymerase and ATP synthase subunits                  [GTTA](4)               LI0257–LI0258                     |→→
Porphyrin oxidase and oxidoreductases                     [AGA](5)A               LI0429                            |→
Molecular chaperone and ATPase                            [TTA](5)TT              LI0124–LI0126                     |→→→
Molecular chaperone DnaJ                                  A(14)                   LI0685                            ←|
Molecular chaperone DnaK                                  A(12)                   LI0912                            |→
Molecular chaperone GrpE                                  [TATT](4)T, [TTA](6)T   LI1048                            |←|
Note: Arrows signify the gene orientation, vertical bars denote locations of intergenic SSRs, and asterisks indicate locations of SSRs inside the
gene to the left. The numbers in the SSR sequences indicate the number of copies of the preceding segment followed by any remaining partial
copy. For example, [AAAC](4)AA refers to the sequence AAACAAACAAACAAACAA
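The SSR notation defined in the note to Table 6 can be expanded mechanically. The following Python sketch is one possible reading of that notation: a motif (a single base, or several bases in square brackets) is followed by a copy number in parentheses and an optional trailing partial copy. The function names are introduced here for illustration only.

```python
import re

# motif in brackets or a single base, copy number in parentheses,
# then an optional partial copy, e.g. "[AAAC](4)AA" or "T(20)"
_SSR_PATTERN = re.compile(r"(?:\[([ACGT]+)\]|([ACGT]))\((\d+)\)([ACGT]*)")


def expand_ssr(notation):
    """Return the nucleotide sequence described by one SSR in table notation."""
    match = _SSR_PATTERN.fullmatch(notation.strip())
    if match is None:
        raise ValueError(f"unrecognized SSR notation: {notation!r}")
    bracketed, single, copies, partial = match.groups()
    motif = bracketed or single
    return motif * int(copies) + partial


def expand_all(entry):
    """Expand a table cell that lists several SSRs separated by commas."""
    return [expand_ssr(part) for part in entry.split(",")]


print(expand_ssr("[AAAC](4)AA"))  # AAACAAACAAACAAACAA, as in the table note
print(len(expand_ssr("T(20)")))   # 20
print([len(s) for s in expand_all("[TATT](4)T, [TTA](6)T")])  # [17, 19]
```

Expanding the entries this way gives the total length of each repeat, which makes mononucleotide runs and longer-motif repeats directly comparable.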
likely anywhere in an operon. However, LSSRs in M.
leprae are located upstream of the translation start site or
within the coding regions of the first gene in a string of
pseudogenes (possible operon; Table 7), which suggests
that the LSSRs directly contributed to the inactivation of
these operons. A recent analysis of distribution of short
homo-oligonucleotide runs (i.e., tandem repeats of a single
nucleotide ≥6 bp long) in bacterial protein-coding genes
detected a bias toward location near the 5′ ends of genes,
possibly due to selection to reduce the metabolic cost of
synthesis of nonfunctional peptides resulting from frame-
shift mutations in these short SSRs (van Passel and
Ochman 2007). The location of LSSRs in M. leprae is
consistent with this observation: disrupting the initiation of
transcription would prevent the synthesis not only of
nonfunctional peptides but also of nonfunctional mRNAs.
Our data suggest that LSSRs are contributing to permanent
gene loss in M. leprae, and their location is selected to
minimize a potential detrimental effect of synthesis of
nonfunctional RNAs and peptides. One might speculate
that LSSRs may have played similar roles in genome
reduction of other host-adapted pathogens, and that a
proliferation of LSSRs could be involved at some stage of
genome reduction following adaptation to an obligate
pathogenic lifestyle (Moran 2002). Some of the SSRs in
present-day genomes could simply be left over from the
period of genome reduction. However, this hypothesis is
contradicted by the absence of LSSRs in genomes of
obligate endosymbiotic bacteria (Mrazek et al. 2007),
which have undergone a similar process of genome
reduction (Moran 2002, 2003). Nevertheless, it is possible
that proliferation of LSSRs in the early stage of genome
reduction is a general phenomenon, whereas their sub-
sequent long-term preservation is determined by species-
specific constraints.
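The short homo-oligonucleotide runs discussed above (tandem repeats of a single nucleotide at least 6 bp long) are straightforward to locate in a coding sequence. The sketch below is not the procedure of van Passel and Ochman (2007); it is a minimal illustration that assumes an uppercase DNA string and reports, for each run, its 0-based start, its length, and its start as a fraction of the gene length, so that a bias toward the 5′ end would appear as fractions close to zero.

```python
import re


def homopolymer_runs(cds, min_len=6):
    """Find runs of a single nucleotide at least min_len bases long.

    Returns a list of (start, length, relative_position) tuples, where start
    is 0-based and relative_position = start / len(cds).
    """
    if min_len < 2:
        raise ValueError("min_len must be at least 2")
    # One base followed by at least (min_len - 1) repeats of the same base.
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_len - 1))
    return [
        (m.start(), m.end() - m.start(), m.start() / len(cds))
        for m in pattern.finditer(cds)
    ]


# Made-up coding sequence, for illustration only
example = "ATG" + "A" * 7 + "CGTACGTACG" + "T" * 6 + "GCA"
for start, length, rel in homopolymer_runs(example):
    print(start, length, round(rel, 2))
```

Collecting the relative positions over all genes of a genome would give the kind of positional distribution referred to in the text, under the stated assumptions.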
Acknowledgments We thank Dr. Anne Summers for critical read-
ing of the manuscript and Drs. Mark Schell, Duncan Krause, and
other colleagues at the UGA Department of Microbiology for stim-
ulating discussions.
References
Amieva MR, Vogelmann R, Covacci A, Tompkins LS, Nelson WJ,
Falkow S (2003) Disruption of the epithelial apical-junctional
complex by Helicobacter pylori CagA. Science 300:1430–1434
Blanchard B, Saillard C, Kobisch M, Bove JM (1996) Analysis of
putative ABC transporter genes in Mycoplasma hyopneumoniae.
Microbiology 142(Pt 7):1855–1862
Cole ST, Eiglmeier K, Parkhill J, James KD, Thomson NR, Wheeler
PR, Honore N, Garnier T, Churcher C, Harris D, Mungall K,
Basham D, Brown D, Chillingworth T, Connor R, Davies RM,
Devlin K, Duthoy S, Feltwell T, Fraser A, Hamlin N, Holroyd S,
Hornsby T, Jagels K, Lacroix C, Maclean J, Moule S, Murphy L,
Oliver K, Quail MA, Rajandream MA, Rutherford KM, Rutter S,
Seeger K, Simon S, Simmonds M, Skelton J, Squares R, Squares
S, Stevens K, Taylor K, Whitehead S, Woodward JR, Barrell BG
(2001) Massive gene decay in the leprosy bacillus. Nature
409:1007–1011
Dunker AK, Cortese MS, Romero P, Iakoucheva LM, Uversky VN
(2005) Flexible nets. The roles of intrinsic disorder in protein
interaction networks. FEBS J 272:5129–5148
Field D, Wills C (1998) Abundant microsatellite polymorphism in
Saccharomyces cerevisiae, and the different distributions of
microsatellites in eight prokaryotes and S. cerevisiae, result from
strong mutation pressures and a variety of selective forces. Proc
Natl Acad Sci USA 95:1647–1652
Groisman EA, Casadesus J (2005) The origin and evolution of human
pathogens. Mol Microbiol 56:1–7
Gur-Arie R, Cohen CJ, Eitan Y, Shelef L, Hallerman EM, Kashi Y
(2000) Simple sequence repeats in Escherichia coli: abundance,
distribution, composition, and polymorphism. Genome Res
10:62–71
Htun H, Dahlberg JE (1989) Topology and formation of triple-
stranded H-DNA. Science 243:1571–1576
Karlin S, Mrazek J, Campbell AM (1996) Frequent oligonucleotides
and peptides of the Haemophilus influenzae genome. Nucleic
Acids Res 24:4263–4272
Kashi Y, King DG (2006) Simple sequence repeats as advantageous
mutators in evolution. Trends Genet 22:253–259
Lerat E, Ochman H (2004) Ψ-Φ: Exploring the outer limits of
bacterial pseudogenes. Genome Res 14:2273–2278
Li YC, Korol AB, Fahima T, Nevo E (2004) Microsatellites within
genes: structure, function, and evolution. Mol Biol Evol 21:991–
1007
McGinnis S, Madden TL (2004) BLAST: at the core of a powerful
and diverse set of sequence analysis tools. Nucleic Acids Res
32:W20–W25
Moran NA (2002) Microbial minimalism: genome reduction in
bacterial pathogens. Cell 108:583–586
Table 7 SSR-associated pseudogenes in Mycobacterium leprae
Putative operon                           SSR         Graph       Pseudogene locus
Acyl CoA synthesis                        G(22)       →*→→→→→     ML0163–ML0168
Precorrin-3 methylase and reductase       [TA](10)    ←←|         ML1449–ML1450
Group II intron maturase and transposase  [AC](8)A    ←←←|        ML1823–ML1825
Sugar transporter                         [AAG](7)A   ←|          ML2344
Transcriptional regulator                 [AAG](7)A   |→          ML2345
Membrane proteins, methyltransferase, sigma factor,
  PPE-protein and transposase             [AT](10)    ←←←←←←←←|   ML2368–ML2375
PE protein                                [TA](11)    →*          ML2477
Note: Arrows signify the gene orientation, vertical bars denote locations of intergenic SSRs, and asterisks indicate locations of SSRs inside the
gene to the left. The original function of the pseudogenes was assessed by sequence similarity searches
Moran NA (2003) Tracing the evolution of gene loss in obligate
bacterial symbionts. Curr Opin Microbiol 6:512–518
Moxon ER, Rainey PB, Nowak MA, Lenski RE (1994) Adaptive
evolution of highly mutable loci in pathogenic bacteria. Curr
Biol 4:24–33
Mrazek J (2006) Analysis of distribution indicates diverse functions
of simple sequence repeats in Mycoplasma genomes. Mol Biol
Evol 23:1370–1385
Mrazek J, Guo X, Shah A (2007) Simple sequence repeats in
prokaryotic genomes. Proc Natl Acad Sci USA 104:8472–8477
Perutz MF (1999) Glutamine repeats and neurodegenerative diseases:
molecular aspects. Trends Biochem Sci 24:58–63
Price MN, Huang KH, Alm EJ, Arkin AP (2005) A novel method for
accurate operon predictions in all sequenced prokaryotes.
Nucleic Acids Res 33:880–892
Raherison S, Gonzalez P, Renaudin H, Charron A, Bebear C, Bebear
CM (2002) Evidence of active efflux in resistance to ciproflox-
acin and to ethidium bromide by Mycoplasma hominis.
Antimicrob Agents Chemother 46:672–679
Rocha EP (2003) An appraisal of the potential for illegitimate
recombination in bacterial genomes and its consequences: from
duplications to genome reduction. Genome Res 13:1123–1132
Rocha EP, Blanchard A (2002) Genomic repeats, genome plasticity
and the dynamics of Mycoplasma evolution. Nucleic Acids Res
30:2031–2042
Roske K, Blanchard A, Chambaud I, Citti C, Jacobs E (2001) Phase
variation among major surface antigens of Mycoplasma penetrans. Infect Immun 69:7642–7651
Shafer RH, Smirnov I (2000) Biological aspects of DNA/RNA
quadruplexes. Biopolymers 56:209–227
Sinden RR (1994) DNA structure and function. Academic Press, San
Diego, CA
Subramaniam S, Frey J, Huang B, Djordjevic S, Kwang J (2000)
Immunoblot assays using recombinant antigens for the detection
of Mycoplasma hyopneumoniae antibodies. Vet Microbiol
75:99–106
Tatusov RL, Koonin EV, Lipman DJ (1997) A genomic perspective
on protein families. Science 278:631–637
Tatusov RL, Fedorova ND, Jackson JD, Jacobs AR, Kiryutin B,
Koonin EV, Krylov DM, Mazumder R, Mekhedov SL, Nikols-
kaya AN, Rao BS, Smirnov S, Sverdlov AV, Vasudevan S, Wolf
YI, Yin JJ, Natale DA (2003) The COG database: an updated
version includes eukaryotes. BMC Bioinform 4:41
Toth G, Gaspari Z, Jurka J (2000) Microsatellites in different
eukaryotic genomes: survey and analysis. Genome Res 10:967–
981
van der Woude MW, Baumler AJ (2004) Phase and antigenic
variation in bacteria. Clin Microbiol Rev 17:581–611
van Passel MW, Ochman H (2007) Selection on the genic location of
disruptive elements. Trends Genet 23:601–604
Wise KS, Foecking MF, Roske K, Lee YJ, Lee YM, Madan A,
Calcutt MJ (2006) Distinctive repertoire of contingency genes
conferring mutation-based phase variation and combinatorial
expression of surface lipoproteins in Mycoplasma capricolum subsp. capricolum of the Mycoplasma mycoides phylogenetic
cluster. J Bacteriol 188:4926–4941