Long Simple Sequence Repeats in Host-Adapted Pathogens Localize Near Genes Encoding Antigens, Housekeeping Genes, and Pseudogenes

Xiangxue Guo · Jan Mrázek

Received: 1 April 2008 / Accepted: 3 September 2008 / Published online: 17 October 2008

© Springer Science+Business Media, LLC 2008

Abstract Simple sequence repeats (SSRs) in DNA sequences are tandem iterations of a single nucleotide or a short oligonucleotide. SSRs are subject to slipped-strand mutations and are a common source of phase variation in bacteria and antigenic variation in pathogens. Significantly long SSRs are generally rare in prokaryotic genomes, and long SSRs composed of iterations of mono-, di-, tri-, and tetranucleotides are mostly restricted to host-adapted pathogens. We present new results concerning associations between long SSRs and genes related to different cellular functions in genomes of host-adapted pathogens. We found that in the majority of the analyzed genomes, at least some of the genes associated with SSRs encode potential antigens, which is expected if the primary function of SSRs is their contribution to antigenic variation. However, we also found a number of long SSRs associated with housekeeping genes, including rRNA and tRNA genes, genes encoding ribosomal proteins, aminoacyl-tRNA synthetases, chaperones, and important metabolic enzymes. Many of these genes are probably essential and it is unlikely that they are phase-variable. Few statistically significant associations between SSRs and gene functional classifications were detected, suggesting that most long SSRs are not related to a particular cellular function or process. Long SSRs in Mycobacterium leprae are mostly associated with pseudogenes and may be contributing to gene loss following the adaptation to an obligate pathogenic lifestyle. We speculate that LSSRs may have played a similar role in genome reduction of other host-adapted pathogens.

Keywords Tandem repeats · Phase variation · Contingency loci · Antigenic variation · Genome reduction · Pathogen evolution

Introduction

Simple sequence repeats (SSRs) are tandem iterations of a single nucleotide or a short oligonucleotide in a DNA sequence. Long SSRs (LSSRs) are common in eukaryotes (Kashi and King 2006; Tóth et al. 2000) but rare in most prokaryotic genomes (Field and Wills 1998; Mrázek et al. 2007). SSRs have some unusual properties that differentiate them from regular DNA sequences. SSRs are subject to slipped-strand mutations and, consequently, are hypermutable with respect to their lengths (Kashi and King 2006; Tóth et al. 2000). Mutations in SSRs located in protein-coding regions can cause frameshifts that deactivate and subsequently reactivate the affected genes. Gene expression can also be influenced by mutations in SSRs located in regulatory regions, where such mutations can alter the activity of promoters (Groisman and Casadesus 2005; Karlin et al. 1996; Moxon et al. 1994; Rocha 2003; Rocha and Blanchard 2002). Such SSR-facilitated mutations can be beneficial in some circumstances. In particular, mutations in SSRs can cause phase variation, that is, reversible and heritable switching between two phenotypes. In this model, SSRs can act as an on/off switch for a particular gene or operon (Groisman and Casadesus 2005; Moxon et al. 1994; van der Woude and Bäumler 2004). In pathogens, SSRs often influence genes encoding antigens, and frequent mutations in these SSRs can increase antigenic variation within the pathogen population and aid evasion of the host immune system (Groisman and Casadesus 2005; Rocha 2003; Roske et al. 2001; van der Woude and Bäumler 2004). The role of SSRs in promoting antigenic variation is generally recognized, and in several cases the effects of SSR mutations on gene expression were experimentally verified (reviewed by Groisman and Casadesus 2005; Moxon et al. 1994; van der Woude and Bäumler 2004).

X. Guo · J. Mrázek (✉)
Department of Microbiology, University of Georgia, Athens, GA 30602-2605, USA
e-mail: [email protected]

J. Mrázek
Institute of Bioinformatics, University of Georgia, Athens, GA 30602, USA

J Mol Evol (2008) 67:497–509
DOI 10.1007/s00239-008-9166-5

Some SSRs can also affect three-dimensional structures and physical properties of both DNA and protein molecules (Dunker et al. 2005; Htun and Dahlberg 1989; Li et al. 2004; Shafer and Smirnov 2000). For example, several inherited human neurodegenerative diseases are caused by expansion of CAG repeats in protein coding regions, which are not deleterious per se but change properties of the encoded proteins (Perutz 1999). Other types of SSRs can promote structural transitions in DNA molecules (Htun and Dahlberg 1989; Shafer and Smirnov 2000; Sinden 1994). Our previous investigation of SSRs in Mycoplasma genomes suggested that the physiological roles of SSRs may not be limited to phase variation, and can include organization of the chromosome and influence on protein structure and function (Mrázek 2006). Through analysis of SSRs in more than 300 prokaryotic genomes, we have shown a large variance among prokaryotes in terms of SSR content (Mrázek et al. 2007). In this work, we present detailed analysis of the relationships of SSRs with genes related to various cellular functions and physiological processes in host-adapted pathogens whose genomes exhibit significant overrepresentations of long SSRs. We use comparisons among SSRs of different lengths to identify a subset of long SSRs that are likely maintained by selection. In the majority of the analyzed genomes, some of these long SSRs are associated with genes for potential antigens but many long SSRs are located near housekeeping genes. We interpret our data as an indication that physiological roles of SSRs in bacterial genomes likely extend beyond their direct involvement in phase variation. We present several hypotheses on possible biological roles of long SSRs.

Materials and Methods

Genome Sequences and Annotations

In a previous work, we analyzed SSR content in more than 300 prokaryotic genomes and found that only a few prokaryotes feature multiple significantly long tandem repeats of mono-, di-, tri-, and tetranucleotides (Mrázek et al. 2007). Those which do are mostly host-adapted pathogens, which do not readily survive in the environment outside a host. The 11 genomes of host-adapted pathogens that contain multiple significantly long SSRs were analyzed in this work.

Complete DNA sequences were downloaded from the National Center for Biotechnology Information FTP server at ftp://ftp.ncbi.nih.gov/genomes/Bacteria/. We used three standardized gene function classifications to determine which types of genes are significantly often (or significantly rarely) associated with SSRs. Gene assignments into clusters of orthologous groups (COGs) (Tatusov et al. 1997, 2003) were obtained from the IMG database (http://img.jgi.doe.gov/). COGs are divided into 25 gene function classes (for details see the COG database at http://www.ncbi.nlm.nih.gov/COG/). The COG database only includes protein-coding genes and we added rRNA and tRNA genes as separate classes. Genes without COG assignments were included in another separate class. These genes do not have homologues in the genomes used to construct the COG database and mostly represent unique genes found only in the particular organism or a narrow group of related species.

Gene ontology (GO) terms assigned to individual genes were downloaded from the UniProt database (http://www.pir.uniprot.org/). The GO classification assigns to genes standardized functional descriptions referring to cellular localization, protein function, and biological process. The Kyoto Encyclopedia of Genes and Genomes (KEGG) orthology terms were downloaded from http://www.genome.jp/kegg/. The KEGG database uses three hierarchical levels of functional classification. Only five very general categories (metabolism, cellular processes, human diseases, genetic information processing, and environmental information processing) are included at the first level, while the third level classifies genes in specific biological pathways. The second level provides intermediate classification of genes (e.g., membrane transport, cell motility, carbohydrate metabolism).

Simple Sequence Repeats

We define SSRs as perfect, uninterrupted tandem iterations of a single nucleotide or a short oligonucleotide. We measure the length of an SSR in nucleotides (base pairs; bp) rather than number of repetitive units, which allows accounting for partial copies and facilitates comparisons among SSRs of different lengths. Every SSR is characterized by the length of the repetitive unit $k$ and the length of the complete repeat $l$. By this definition, an SSR of length $l$ composed of iterations of a $k$-mer starts at position $i$ in a sequence of nucleotides $\{x_j\}$ if $x_j = x_{j+k}$ for all $j \ge i$, $j \le i + l - k - 1$, and simultaneously $x_{i-1} \ne x_{i-1+k}$ and $x_{i+l-k} \ne x_{i+l}$. This definition can be applied to all SSRs of length $l \ge k$. Repeats of a longer oligonucleotide that also qualify as repeats of a shorter oligonucleotide are only counted as the shorter oligonucleotide SSR. We analyze the SSR counts $N_k(l)$ in a given genome as a function of $k$ and $l$ (Mrázek 2006; Mrázek et al. 2007). In this work, we concentrate on iterations of mono-, di-, tri-, and tetramers ($k \le 4$), which are overrepresented in genomes of several host-adapted pathogens (Mrázek et al. 2007). The functions $N_k(l)$ are generated for the genomic DNA sequence and multiple random sequences created by stochastic models of varying complexity (Mrázek 2006). Comparisons among SSRs of varying lengths and with random sequences allow easy detection of anomalies in SSR counts.
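The SSR definition above can be turned into a short scanning routine. The following Python sketch is only an illustration under stated assumptions, not the authors' software: the function names `find_ssrs` and `count_by_length`, the 0-based coordinates, and the reporting threshold `min_len` are hypothetical choices made here. It extends each run while $x_j = x_{j+k}$ holds, reports the length $l$ in nucleotides, and assigns a repeat only to its shortest unit length.

```python
from collections import Counter

def find_ssrs(seq, max_k=4, min_len=6):
    """Return maximal perfect SSRs as (start, length, k) tuples, 0-based start.

    min_len is an illustrative reporting threshold (an assumption),
    not the LSSR cutoff discussed in the text.
    """
    n = len(seq)
    found = []
    for k in range(1, max_k + 1):
        j = 0
        while j + k < n:
            if seq[j] != seq[j + k]:
                j += 1
                continue
            start = j
            # Extend while x_j == x_{j+k}; stop at the first mismatch or sequence end.
            while j + k < n and seq[j] == seq[j + k]:
                j += 1
            span = seq[start:j + k]  # length l = number of matches + k
            # A span that also has a shorter period is counted under that shorter unit.
            has_shorter_unit = any(span[p:] == span[:-p] for p in range(1, k))
            if not has_shorter_unit and len(span) >= min_len:
                found.append((start, len(span), k))
    return sorted(found)

def count_by_length(ssrs):
    """N_k(l): number of SSRs for each (k, l) pair."""
    return Counter((k, length) for _, length, k in ssrs)

if __name__ == "__main__":
    hits = find_ssrs("GCATATATATGCAAAAAAAC")
    print(hits)
    print(count_by_length(hits))
```

In this toy sequence the run of A is reported once as a mononucleotide SSR, although it also satisfies the matching condition for k = 2, 3, and 4, which mirrors the rule that repeats are counted under the shorter oligonucleotide.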

Definition of Long SSRs

Previous analyses of SSRs in prokaryotes generally used an arbitrary length cutoff, sometimes as short as 6 bp (Gur-Arie et al. 2000). Although such short SSRs can be polymorphic in length and cause phase variation, the vast majority of such short SSRs occur by coincidence and probably do not have any specific function. We developed methods to distinguish long SSRs likely maintained by selection from the randomly occurring SSRs based on comparisons among SSR counts of different lengths and stochastic models (Mrázek 2006; Mrázek et al. 2007). In most prokaryotes, SSRs occur approximately as expected in random sequences, except that mononucleotide SSRs (i.e., uninterrupted runs of the same nucleotide) of a length >8 bp are underrepresented, possibly due to negative selection (Fig. 1). In some genomes, long SSRs are overrepresented (more abundant than expected in random sequences) and their counts decline approximately monotonically with an increasing SSR length (Fig. 2). These longer-than-expected SSRs can arise from spontaneous expansion of shorter SSRs. In prokaryotes, this pattern often applies to SSRs composed of tandem iterations of 5- to 11-mers but rarely to those composed of 1- to 4-mers (Mrázek et al. 2007). In contrast, several genomes feature bimodal or discontinuous $N_k(l)$ plots, where the SSR counts initially follow the expected counts or even drop below the expected counts but exhibit a separate peak and/or a discontinuity at greater SSR lengths (Fig. 3). This discontinuity is difficult to explain by spontaneous expansion of shorter repeats. Although mutation-based explanations cannot be summarily dismissed, explaining the discontinuity of the $N_k(l)$ plots (Fig. 3) requires multiple assumptions about the character of the mutational biases involved, and following Occam's razor principle we consider selection a more likely explanation. Hence, we label the SSRs following the discontinuity in the $N_k(l)$ plots long SSRs (LSSRs) (Fig. 3) and we assume that most LSSRs are maintained by selection, with the implication that they have some role in the organism's physiology. Note that this method for determining the length cutoff for LSSRs emphasizes comparisons among SSRs of different length and only applies to discontinuous or bimodal $N_k(l)$ plots, whereas comparisons with random sequences are used only for reference. Bimodal distributions of SSR counts are found almost exclusively in genomes of some (but not all) host-adapted pathogens (Mrázek et al. 2007). Prokaryotic pathogens with an excess of LSSRs in their genomes are listed in Table 1.
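The reference comparison with random sequences can be sketched in a few lines. This is a hedged illustration only: it uses the simplest homogeneous Bernoulli model (independent draws with the observed base frequencies) rather than the Markov or heterogeneous models cited in the text, and the function names `ssr_length_counts` and `expected_counts_bernoulli`, the number of simulations, and the seed are assumptions introduced here. It does not determine the LSSR cutoff, which in the text is read from the discontinuity in the plots.

```python
import random
from collections import Counter

def ssr_length_counts(seq, k):
    """N_k(l): number of maximal perfect k-mer SSRs of each length l."""
    counts = Counter()
    n, j = len(seq), 0
    while j + k < n:
        if seq[j] != seq[j + k]:
            j += 1
            continue
        start = j
        while j + k < n and seq[j] == seq[j + k]:
            j += 1
        span = seq[start:j + k]
        # Skip spans that are really repeats of a shorter unit.
        if not any(span[p:] == span[:-p] for p in range(1, k)):
            counts[len(span)] += 1
    return counts

def expected_counts_bernoulli(seq, k, n_sim=20, seed=0):
    """Average N_k(l) over random sequences with the same length and base frequencies.

    n_sim and seed are arbitrary illustrative defaults (assumptions).
    """
    rng = random.Random(seed)
    letters = sorted(set(seq))
    weights = [seq.count(c) for c in letters]
    total = Counter()
    for _ in range(n_sim):
        sim = "".join(rng.choices(letters, weights=weights, k=len(seq)))
        total.update(ssr_length_counts(sim, k))
    return {length: c / n_sim for length, c in sorted(total.items())}

if __name__ == "__main__":
    genome = "GCATATATATGCAAAAAAAC" * 5  # toy stand-in for a genome sequence
    print("observed:", dict(ssr_length_counts(genome, 1)))
    print("expected:", expected_counts_bernoulli(genome, 1))
```

Plotting the observed and expected counts against length on a logarithmic count axis would give the kind of comparison shown in Figs. 1 to 3.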

Fig. 1 Mononucleotide (left panel, "1-mers") and dinucleotide (right panel, "2-mers") SSRs in E. coli K12, plotted as count (logarithmic scale) against length in nucleotides (0 to 50). Filled circles show counts of the corresponding SSR types of the length given by the abscissa. Gray lines signify expected counts assessed from simulations using different stochastic models of varying degrees of complexity. Dashed lines refer to homogeneous Bernoulli and Markov models, whereas solid gray lines refer to heterogeneous combinations of Bernoulli, Markov, and periodic Markov models, where a different model is used for every gene and intergenic region (Mrázek 2006; Mrázek et al. 2007). The plot is expected to be linear under homogeneous stochastic models, and deviations from linearity can be interpreted as under- or overrepresentations of SSRs of the corresponding lengths

Fig. 2 Heptanucleotide SSRs ("7-mers") in Nostoc PCC 7120, plotted as count (logarithmic scale) against length in nucleotides. See legend to Fig. 1

Fig. 3 Mononucleotide SSRs in Mycoplasma hyopneumoniae (left panel, "1-mers") and dinucleotide SSRs in Lawsonia intracellularis (right panel, "2-mers"), plotted as count (logarithmic scale) against length in nucleotides. SSRs comprising the secondary peak (marked "LSSRs" in both panels) are considered long SSRs (LSSRs). See legend to Fig. 1

Table 1 Overrepresented COG function classes, KEGG terms, and GO terms among SSR-associated genes of different pathogenic bacteria

| Genome | COG | KEGG | GO |
|---|---|---|---|
| Helicobacter pylori J99 | Cell cycle control, cell division, chromosome partitioning (+); inorganic ion transport and metabolism (+); p = 0.023, p′ = 0.005 | | |
| Haemophilus influenzae 86-028NP | Cell wall, membrane, envelope biogenesis (++); p = 0.129, p′ = 0.085 | Glycan biosynthesis and metabolism (+++) | Lipopolysaccharide biosynthetic process (++); transferase activity (+++); outer membrane (+) |
| Lawsonia intracellularis PHE/MN1-00 | Not classified (a) (++); p = 0.227, p′ = 0.622 | | Protein folding (++); unfolded protein binding (+) |
| Mycobacterium leprae TN | p = 0.269, p′ = 0.269 | | |
| Mycoplasma capricolum ATCC 27343 | Defense mechanisms (+); not classified (a) (+++); p = 0.026, p′ = 0.007 | | |
| Mycoplasma gallisepticum R | Not classified (a) (+++); p = 0.002, p′ = 0.393 | | |
| Mycoplasma genitalium G37 | p = 0.161, p′ = 0.584 | | |
| Mycoplasma hyopneumoniae 232 | rRNA (++); not classified (a) (++); p = 0.012, p′ = 0.395 | | |
| Mycoplasma mycoides SC str. PG1 | Defense mechanisms (+); not classified (a) (+++); p = 0.029, p′ = 0.040 | | |
| Mycoplasma pulmonis UAB CTIP | Replication, recombination and repair (++); p = 0.166, p′ = 0.114 | Replication and repair (+++) | DNA binding (++); DNA methylation (+++); site-specific DNA-methyltransferase (+++); N-methyltransferase activity (+++) |
| Xanthomonas oryzae MAFF 311018 | Carbohydrate transport and metabolism (++); p = 0.268, p′ = 0.106 | | |

Note: Symbols in parentheses signify the level of overrepresentation of that particular functional description: + corresponds to probability 0.01 < p ≤ 0.05; ++, to 0.001 < p ≤ 0.01; and +++, to p ≤ 0.001. No correction for multiple hypothesis testing has been used. The overall goodness-of-fit test was performed for COG classification and the relevant p-values are listed in the table. The first p-value relates to the test involving all classes listed in Tables 3 and 4, whereas the second p-value, labeled p′, excludes the classes with uncertain functional assignments “general function prediction only,” “function unknown,” and “not classified.” The goodness-of-fit test was not performed for the KEGG and GO classifications. See Methods for details

(a) The “not classified” COG category includes genes that were not classified in the COG database. These generally represent unique genes that do not have homologues in genomes that were used in the design of the COG database

SSR Associations with Genes of Different Functional Assignments

Once LSSRs are identified, we investigate their relationship to genes of different functional classifications. First, we define SSR-associated genes as genes that have an LSSR located in their coding region or in their flanking intergenic regions (downstream or upstream). For all SSR-associated genes, we find the KEGG orthology terms, COG function classes, and GO terms assigned to them and identify the ontology terms and functional classifications that occur significantly more or less often among the SSR-associated genes compared to the complete gene collection of the organism. If all genes were equally likely to be associated with SSRs, one would expect to find on average $\bar{m}_X = \frac{M_X}{M}\, m$ SSR-associated genes with the functional assignment $X$, where $M$ is the number of all annotated genes, $m$ is the number of all SSR-associated genes, and $M_X$ is the number of all annotated genes with the functional assignment $X$. For $M$ large and $\bar{m}_X$ small, the observed counts $m_X$ of SSR-associated genes with the functional assignment $X$ follow approximately the Poisson distribution $\Pr(m_X) = \exp(-\bar{m}_X)\, \frac{(\bar{m}_X)^{m_X}}{m_X!}$. The Poisson approximation is used to estimate the probability that the number of SSR-associated genes in a particular gene function class exceeds a given value. These statistical assessments are used to identify potential trends which are subsequently investigated at the level of individual genes.
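The expected count and the Poisson tail probability described above can be worked through numerically. The sketch below is an illustration under stated assumptions rather than the authors' code: the function names are hypothetical, the totals passed to `expected_count` in the usage lines are made-up numbers, and treating "exceeds a given value" as the probability of observing at least that many genes is an assumption about the tail convention. The observed and expected values for the rRNA and amino acid transport classes are taken from Table 3.

```python
import math

def expected_count(M, m, M_X):
    """Expected number of SSR-associated genes in class X: (M_X / M) * m."""
    return M_X * m / M

def poisson_pmf(count, mean):
    """Pr(count) = exp(-mean) * mean**count / count!"""
    return math.exp(-mean) * mean ** count / math.factorial(count)

def poisson_upper_tail(observed, mean):
    """P(X >= observed) for X ~ Poisson(mean); used for overrepresentation."""
    return 1.0 - sum(poisson_pmf(i, mean) for i in range(observed))

def poisson_lower_tail(observed, mean):
    """P(X <= observed) for X ~ Poisson(mean); used for underrepresentation."""
    return sum(poisson_pmf(i, mean) for i in range(observed + 1))

if __name__ == "__main__":
    # Hypothetical totals, for illustration only.
    print(expected_count(M=1000, m=100, M_X=30))
    # Table 3, rRNA: 3 SSR-associated genes observed, 0.426 expected.
    print(poisson_upper_tail(3, 0.426))
    # Table 3, amino acid transport and metabolism: 0 observed, 3.552 expected.
    print(poisson_lower_tail(0, 3.552))
```

Under these assumptions the rRNA tail probability falls between 0.001 and 0.01, and the amino acid transport tail probability falls between 0.01 and 0.05, which agrees with the significance symbols shown for those classes in Table 3.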

In addition to testing over- and underrepresentation of genes of specific function classes among SSR-associated genes, we evaluate the overall goodness of fit between the observed and the expected counts for the COG function classes using the chi-square test and Monte Carlo simulations. The $\chi^2$ value is calculated by the standard formula $\chi^2 = \sum_X \frac{(m_X - \bar{m}_X)^2}{\bar{m}_X}$. However, the standard way of converting the $\chi^2$ value into a probability using the $\chi^2$ distribution is not applicable because the expected counts are too small to provide accurate results. Instead, we determine the p-value by performing random simulations, where the $m$ SSR-associated genes are randomly assigned to the gene function classes proportionately to the expected counts $\bar{m}_X$ and the p-value is assessed as the fraction of simulations that yield $\chi^2$ higher or equal to that obtained with the real data. The p-values reported in Table 1 are based on $10^6$ simulations. This method is only applied to COG classification because it involves a relatively small number of function classes and only few genes are assigned to more than one class.
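The Monte Carlo goodness-of-fit procedure described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the default of 10,000 simulations (the text reports one million), the fixed seed, and the toy counts in the usage lines are all assumptions. Each simulation assigns the m genes to classes with probabilities proportional to the expected counts and recomputes the chi-square statistic.

```python
import random

def chi_square(observed, expected):
    """Sum over classes of (observed - expected)^2 / expected."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def monte_carlo_p_value(observed, expected, n_sim=10_000, seed=0):
    """Fraction of simulations with chi-square >= the value for the real data.

    expected must contain only positive values; n_sim and seed are
    illustrative defaults (assumptions).
    """
    rng = random.Random(seed)
    m = sum(observed)
    classes = range(len(expected))
    chi_obs = chi_square(observed, expected)
    hits = 0
    for _ in range(n_sim):
        sim = [0] * len(expected)
        # Assign each of the m genes to a class proportionately to the expected counts.
        for c in rng.choices(classes, weights=expected, k=m):
            sim[c] += 1
        if chi_square(sim, expected) >= chi_obs:
            hits += 1
    return hits / n_sim

if __name__ == "__main__":
    # Toy counts for three hypothetical classes, for illustration only.
    observed = [9, 2, 1]
    expected = [4.0, 4.0, 4.0]
    print(monte_carlo_p_value(observed, expected))
```

Because the p-value is a simulated fraction, its resolution is limited to 1/n_sim, which is one reason a large number of simulations is useful when small p-values are of interest.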

Results

Functional Classification of Genes Associated with LSSRs

Table 1 shows a summary of overrepresented gene function classes and ontology terms among SSR-associated genes in genomes of pathogenic bacteria with high counts of LSSRs. Results of the overall goodness-of-fit test are shown for COG classifications. Note that different functional descriptions in Table 1 can arise from the same set of SSR-associated genes. For example, the GO terms “DNA binding,” “DNA methylation,” “site-specific DNA-methyltransferase,” and “N-methyltransferase activity” for M. pulmonis all relate to several SSR-associated genes encoding DNA methyltransferases. The tests for individual function classes are conducted independently and are less conservative than overall goodness-of-fit ($\chi^2$) tests, which indicate that LSSRs in most genomes are distributed approximately randomly among different gene function classes (p > 0.05). We next discuss the SSR-associated genes in genomes where the $\chi^2$ test indicates nonrandom distribution or one or more function classes show an excess of SSR-associated genes with p < 0.01 (symbol ++ or +++).

Haemophilus influenzae

The class “cell wall, membrane, envelope biogenesis” in COG classification is overrepresented among SSR-associated genes. It is consistent with the overrepresented KEGG classification “glycan biosynthesis and metabolism” and GO terms “lipopolysaccharide biosynthetic process” and “transferase activity.” All these overrepresented functional descriptions refer to a set of genes involved in lipopolysaccharide biosynthesis. Mutations in these genes can affect properties of the cell surface, which is consistent with the role of the LSSRs in antigenic variation.

Helicobacter pylori

The distribution of SSR-associated genes among the COG classes is nonrandom (indicated by the $\chi^2$ test; Table 1), with main contributions from the classes “Cell cycle control, cell division, chromosome partitioning” and “Inorganic ion transport and metabolism.” H. pylori features a number of SSR-associated outer membrane proteins and transporters (see also Table 2), which leads to the marginal overrepresentation of the latter class. The former class includes four SSR-associated genes encoding ATP-binding protein (mrp), rod shape-determining protein (mreB), and two proteins assigned to COGs described as “ATPases involved in chromosome partitioning,” which are annotated as hypothetical protein (HP0059) and pathogenicity island protein CagA (HP0547). The CagA protein is a major virulence protein involved in disruptions of the host epithelium (Amieva et al. 2003). Locations of LSSRs near genes encoding outer membrane proteins and virulence genes are consistent with their possible role in antigenic variation.

Mycoplasma capricolum and Mycoplasma mycoides

M. capricolum and M. mycoides each contain two SSR-associated genes for ABC transporters (Table 2) classified in the COG class “Defense mechanisms,” which results in a marginal overrepresentation (probability < 0.05 but > 0.01) of this function class among SSR-associated genes. Mycoplasmas do not have an outer membrane or cell wall, and components of ABC transporters could be recognized as antigens. Hence the overrepresentation of this group among SSR-associated genes is consistent with the SSR role in antigenic variation.

Lawsonia intracellularis and Mycoplasma hyopneumoniae

These genomes have more LSSRs and SSR-associated genes than other bacteria and their SSR-associated genes are distributed mostly randomly among different function classes (Tables 1, 3, and 4). Notably, all three rRNA genes of M. hyopneumoniae are associated with LSSRs (Table 3). Many hypothetical genes of unknown function in L. intracellularis have proximal LSSRs, and some of them might be involved in antigenic variation. Surprisingly, four of six genes labeled with the GO term “protein folding” in L. intracellularis are associated with LSSRs (Table 1). These include genes for molecular chaperones DnaK, DnaJ, CbpA, and GrpE.

Table 2 SSR-associated genes potentially contributing to antigenic variation

| Classification | Locus | SSR location | Description |
|---|---|---|---|
| Cell wall, membrane, envelope biogenesis | HP0855 | Up | Alginate O-acetylation protein (algI) |
| | HP1105 | Up | LPS biosynthesis protein |
| | HP1341 | Down | Siderophore-mediated iron transport protein (tonB) |
| | LI0092 | Up | ADP-heptose:LPS heptosyltransferase (rfaF/rfaQ) |
| | LI0319 | Up | ADP-heptose:LPS heptosyltransferase |
| | LI0452 | Up | 3-Deoxy-D-manno-octulosonic acid (KDO) 8-phosphate synthase (kdsA) |
| | LI0661 | Down | Glucosamine 6-phosphate synthetase (glmS) |
| | LI0730 | Down | Predicted UDP-glucose 6-dehydrogenase (ugd) |
| | LI0920 | Down | Membrane protein related to metalloendopeptidases |
| | LI0984 | Up | Phosphomannose isomerase/GDP-mannose pyrophosphorylase (xanB) |
| | MCAP_0063 | Up | Glycosyl transferase, group 2 family protein |
| | MCAP_0270 | Down | ABC transporter, permease protein, putative |
| | mhp676 | Down | Glucose-inhibited division protein B (gidB) |
| | NTHI0365 | In | UDP-galactose–lipo-oligosaccharide galactosyltransferase (lgtC) |
| | NTHI0677 | In | UDP-Gal–lipo-oligosaccharide galactosyltransferase (lic2A) |
| | NTHI0913 | Up | UDP-glucose–lipo-oligosaccharide glucosyltransferase (lex2B) |
| | NTHI1597 | In | LicA |
| | NTHI1750 | In | Putative glycosyl transferase, glycosyl transferase family 8 protein |
| | XOO_0008 | In | TonB protein (tonB) |
| | NTHI0585 | In | Autotransported protein Lav (lav) |
| | NTHI0472 | In | CMP-Neu5Ac–lipo-oligosaccharide α 2–3 sialyltransferase (lic3A) |
| | NTHI1034 | In | CMP-neu5Ac–lipo-oligosaccharide α 2–3 sialyltransferase (lic3A2) |
| Defense mechanism | HP0464 | In | Type I restriction enzyme R protein (hsdR) |
| | HP1521 | Up | Type III restriction enzyme R protein (res) |
| | MCAP_0587 | Up | ABC transporter, ATP-binding protein, putative |
| | MCAP_0655 | Up | ABC transporter, ATP-binding protein |
| | mhp025 | Down | ABC transporter ATP binding protein |
| | mhp686 | Down | Multidrug resistance protein homologue (pr2) |
| | MSC_0398 | Up | Na+ ABC transporter, ATP-binding component (natA) |
| | MSC_0704 | Up | ABC transporter, ATP-binding component (Na+) |

Lipoprotein: MCAP_0431 [up], MCAP_0432 [up], MCAP_0433 [up], MCAP_0470 [in]; MCAP_0593 (vmcD) [up], MCAP_0594 (vmcC) [up], MCAP_0595 (vmcB) [up], MCAP_0629 (vmcE) [up] and MCAP_0630 (vmcF) [up]; MG_338 [in]; MSC_0397 (lpp) [down], MSC_0390 (vmm) [up]; MSC_0847 (lpp) [up], MSC_1005 (lpp) [up], MYPU_0190 [up], MYPU_4780 [up], MYPU_6520 [down]

Outer membrane protein: HP0009 (omp1) [up], HP0025 (omp2) [up], HP0227 (omp5) [down], HP0722 (omp16) [up], HP0725 (omp17) [up], HP0896 (omp19) [*2086–18], HP0912 (omp20) [up], HP1342 (omp29) [up]

Note: Genes are labeled with gene tags used in GenBank annotation. The initial letters signify the organism: HI, Haemophilus influenzae 86-028NP; HP, Helicobacter pylori 26695; jhp, Helicobacter pylori J99; LI, Lawsonia intracellularis PHE/MN1-00; MCAP, Mycoplasma capricolum subsp. capricolum ATCC 27343; MG, Mycoplasma genitalium G37; mhp, Mycoplasma hyopneumoniae 232; ML, Mycobacterium leprae TN; MSC, Mycoplasma mycoides subsp. mycoides SC str. PG1; MYPU, Mycoplasma pulmonis UAB CTIP; XOO, Xanthomonas oryzae pv. oryzae MAFF 311018. “Up,” “down,” or “in” signify the SSR location upstream, downstream, or within the associated gene, respectively

Mycoplasma pulmonis

Five SSR-associated genes encode DNA methylases, which give rise to all overrepresented functional classifications in Table 1. Some of these DNA methylases might indirectly contribute to antigenic variation by altering methylation patterns in distant regions of the chromosome and influencing expression levels of other genes (Rocha and Blanchard 2002).

LSSRs and Antigenic Variation

LSSRs associated with genes whose products are accessible on the surface of the cell or function in biogenesis of cellular surface structures could directly contribute to antigenic variation. These SSR-associated genes are listed in Table 2. Most SSR-associated genes of H. influenzae belong to two main groups: cell envelope biogenesis genes and those related to outer membrane, intracellular trafficking and secretion (Table 2). In M. capricolum, some SSR-associated genes encode putative membrane proteins, including three ABC transporters, five genes of the vmc cluster, and several lipoproteins of the Lpp family. The vmc genes contain dinucleotide SSRs (TA iterations) in their putative promoter regions, which govern the phase variable expression of these genes (Wise et al. 2006). M. mycoides has two SSR-associated genes encoding ATPase components of ABC-type multidrug transport systems, which are potential surface antigens (Blanchard et al. 1996; Raherison et al. 2002; Subramaniam et al. 2000). Both have a 14-bp run of T in their upstream flanking regions. Many putative lipoproteins and variable surface proteins in M. mycoides are also associated with LSSRs (Table 2). Some SSR-associated genes in H. pylori, L. intracellularis, M. hyopneumoniae, and X. oryzae also encode proteins located on the cell surface and could influence the interaction of the pathogen with its host (Table 2).

Table 3 COG classifications of SSR-associated genes of Mycoplasma hyopneumoniae

COG classification                                             All genes  With SSR  Expected  Difference
Energy production and conversion                                      25         4     3.552
Cell cycle control, cell division, chromosome partitioning             4         0     0.568
Amino acid transport and metabolism                                   25         0     3.552  –
Nucleotide transport and metabolism                                   21         5     2.983
Carbohydrate transport and metabolism                                 57         8     8.098
Coenzyme transport and metabolism                                      7         0     0.994
Lipid transport and metabolism                                         5         0      0.71
Translation, ribosomal structure, and biogenesis                      94        12    13.354
Transcription                                                         17         0     2.415
Replication, recombination, and repair                                50         2     7.103  –
Cell wall/membrane/envelope biogenesis                                 8         1     1.137
Posttranslational modification, protein turnover, chaperones          18         1     2.557
Inorganic ion transport and metabolism                                19         1     2.699
Signal transduction mechanisms                                         3         1     0.426
Intracellular trafficking, secretion, and vesicular transport         12         1     1.705
Defense mechanisms                                                    23         2     3.268
rRNA                                                                   3         3     0.426  ++
tRNA                                                                  30         6     4.262
General function prediction only                                      45         5     6.393
Function unknown                                                      18         1     2.557
Not classified                                                       260        52    36.938  ++

Note: The table lists the number of all genes and the number of SSR-associated genes in each COG function category, the expected number of SSR-associated genes, and the significance of the difference between the observed and the expected numbers (see Methods). (–) Probability <5%; (++) probability <1%
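The Expected column of Table 3 is consistent with multiplying the number of genes in each category by a single genome-wide fraction of SSR-associated genes (for example, 3.552/25 ≈ 0.142). The following is a minimal sketch of that arithmetic. The one-sided binomial tail used to score the difference is an assumption made for illustration only; the test actually applied is described in Methods and may differ. The names `expected_count`, `binomial_tails`, `FRACTION`, and `ROWS` are hypothetical.

```python
from math import comb


def expected_count(n_category: int, fraction: float) -> float:
    """Expected number of SSR-associated genes among n_category genes."""
    return n_category * fraction


def binomial_tails(k: int, n: int, p: float) -> tuple[float, float]:
    """Return (P[X <= k], P[X >= k]) for X ~ Binomial(n, p)."""
    pmf = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    return sum(pmf[: k + 1]), sum(pmf[k:])


# Assumption: genome-wide fraction inferred from one Table 3 row
# (3.552 expected SSR-associated genes among 25 genes).
FRACTION = 3.552 / 25

# (all genes, genes with SSR) copied from three rows of Table 3.
ROWS = {
    "rRNA": (3, 3),
    "Amino acid transport and metabolism": (25, 0),
    "Translation, ribosomal structure, and biogenesis": (94, 12),
}

for name, (n_all, n_ssr) in ROWS.items():
    lower, upper = binomial_tails(n_ssr, n_all, FRACTION)
    print(
        f"{name}: expected={expected_count(n_all, FRACTION):.3f} "
        f"P(X<={n_ssr})={lower:.4f} P(X>={n_ssr})={upper:.4f}"
    )
```

Under these assumptions the rRNA row has an upper-tail probability below 1% and the amino acid transport row a lower-tail probability between 1% and 5%, which is consistent with how those two rows are flagged in Table 3. Expected values computed this way can differ from the table in the last digit because the fraction is itself taken from a rounded entry.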


LSSRs Are Often Associated with Housekeeping and Essential Genes

Surprisingly, we found that SSRs are also associated with many housekeeping genes, such as rRNA and tRNA genes, ribosomal protein genes, amino acyl-tRNA synthetases, and chaperones (Table 5). Some SSR-associated genes encode enzymes of central pathways of energy metabolism, including citrate synthase and isocitrate dehydrogenase functioning in the citrate cycle, and a subunit of ATP synthase. The rod shape-determining protein MreB and putative chromosome partitioning ATPases are encoded by SSR-associated genes in H. influenzae and L. intracellularis, respectively. Many rRNA and tRNA genes and genes encoding enzymes that contribute to protein synthesis are among SSR-associated genes in M. hyopneumoniae, L. intracellularis, and, to a lesser extent, several other genomes (Table 5). Also among SSR-associated genes in some of the analyzed genomes are those encoding general chaperones such as DnaK, DnaJ, and GroEL, the single-stranded DNA-binding protein, DNA helicases, gyrases, recombinases, and a DNA polymerase III subunit. Many of those genes are probably essential for a living cell, and their inactivation would likely be lethal.

Table 6 shows locations of LSSRs with respect to predicted housekeeping operons in M. hyopneumoniae and L. intracellularis. The operon predictions were adopted from the Microbes Online database (http://www.microbesonline.org/) (Price et al. 2005). In most cases, LSSRs are located upstream of the first gene in an operon, where they could have a direct effect on transcription initiation (Table 6).
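The "up", "down", and "in" labels used for SSR locations depend on the orientation of the associated gene. The sketch below shows one way such a label could be derived from coordinates. It is an illustration under stated assumptions, not the procedure used in this study: coordinates are taken as 1-based inclusive positions on the forward strand, any overlap with the gene is reported as "in", and no maximum distance for calling an SSR "associated" is modeled. The function name `classify_ssr_location` is hypothetical.

```python
def classify_ssr_location(
    gene_start: int,
    gene_end: int,
    strand: str,
    ssr_start: int,
    ssr_end: int,
) -> str:
    """Label an SSR as 'up', 'down', or 'in' relative to a single gene.

    All coordinates are 1-based, inclusive, and given on the forward strand
    (start <= end). strand is '+' or '-'.
    """
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    if gene_start > gene_end or ssr_start > ssr_end:
        raise ValueError("start must not exceed end")

    # Assumption: any overlap with the annotated gene counts as 'in'.
    if ssr_start <= gene_end and ssr_end >= gene_start:
        return "in"

    ssr_left_of_gene = ssr_end < gene_start
    if strand == "+":
        # Forward-strand gene: smaller coordinates are upstream.
        return "up" if ssr_left_of_gene else "down"
    # Reverse-strand gene: smaller coordinates are downstream.
    return "down" if ssr_left_of_gene else "up"


if __name__ == "__main__":
    # A hypothetical reverse-strand gene at 100..200 with an SSR at 250..270.
    print(classify_ssr_location(100, 200, "-", 250, 270))
```

Treating partial overlaps as "in" is a simplification; an SSR that only overlaps a stop codon might reasonably be reported as both "down" and "in".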

SSRs May Contribute to Genome Reduction in M. leprae

The M. leprae genome contains 16 LSSRs, more than any other mycobacterium (Mrazek et al. 2007). Interestingly, most of the SSR-associated "genes" are pseudogenes of diverse original functions (Table 7). Apart from three membrane proteins, other SSR-associated pseudogenes are similar to transcriptional regulators, or genes involved in the metabolism of cofactors and vitamins, amino acids, and nucleotides. LSSRs are often located within or upstream of the first in a string of pseudogenes transcribed in the same direction and constituting a putative operon (Table 7). Searching the translated nucleotide database using the tblastx program (McGinnis and Madden 2004), we found that those SSRs are absent from homologous active genes in the genomes of other mycobacteria (data not shown). If LSSRs arose after inactivation of the genes due to relaxed selective constraints, one would expect LSSRs to be located anywhere within the operon. The locations upstream of the first pseudogene of a putative operon suggest that the LSSRs contributed directly to the inactivation of these operons. It has been argued that M. leprae is in an early stage of genome reduction following a recent adaptation to an obligate pathogenic lifestyle (Cole et al. 2001), and it is possible that SSRs contribute to permanent loss of genes and operons from the chromosome.

Table 4 COG classifications of SSR-associated genes in Lawsonia intracellularis

COG classification                                             All genes  With SSR  Expected  Difference
Energy production and conversion                                       2         0     0.235
Cell cycle control, cell division, chromosome partitioning            49         9     5.765
Amino acid transport and metabolism                                   19         2     2.235
Nucleotide transport and metabolism                                   82         3     9.647  –
Carbohydrate transport and metabolism                                 44         5     5.176
Coenzyme transport and metabolism                                     52         3     6.118
Lipid transport and metabolism                                        66         8     7.765
Translation, ribosomal structure, and biogenesis                      28         1     3.294
Transcription                                                        122        12    14.353
Replication, recombination, and repair                                33         3     3.882
Cell wall/membrane/envelope biogenesis                                68         8         8
Cell motility                                                        134        11    15.765
Posttranslational modification, protein turnover, chaperones          55         8     6.471
Inorganic ion transport and metabolism                                54         7     6.353
Secondary metabolites biosynthesis, transport                         43         3     5.059
Signal transduction mechanisms                                        15         1     1.765
Intracellular trafficking, secretion, and vesicular transport        123        18    14.471
Defense mechanisms                                                    67         7     7.882
rRNA                                                                  40         7     4.706
tRNA                                                                  44         5     5.176
General function prediction only                                       7         0     0.824
Function unknown                                                     294        52    34.588  ++
Not classified                                                         2         0     0.235

Note: The table lists the number of all genes and the number of SSR-associated genes in each COG function category, the expected number of SSR-associated genes, and the significance of the difference between the observed and the expected numbers (see Methods). (–) Probability <5%; (++) probability <1%

Table 5 SSR-associated housekeeping genes

  Locus       SSR location   Description

Energy production and conversion
  HP0026      Down           Citrate synthase
  LI0276      Down           Cytochrome bd-type quinol oxidase, subunit 2 (cyoA)
  LI0790      Up             Ferredoxin oxidoreductase, α subunit (vorB)
  mhp476      Down           ATP synthase subunit B (atpD)

Cell division, chromosome partitioning
  LIB024      In             Chromosome partitioning ATPase (parA)
  LI0814      Up             Chromosome partitioning ATPase (soj)
  HP0743      Up             Rod shape-determining protein (mreB)

Translation, ribosomal structure, and biogenesis
  mhp186      Up             30S ribosomal protein S10 (rps10)
  mhp094      Up             30S ribosomal protein S16 (rpsP)
  mhp307      Up             30S ribosomal protein S6 (rpsF)
  mhp638      Up             50S ribosomal protein L10 (rplJ)
  mhp459      Up             50S ribosomal protein L11 (rplK)
  MYPU_1300   Up             30S ribosomal protein S1
  MYPU_4670   Down           50S ribosomal protein L19
  LI0983      Down           Ribosomal protein L17 (rplQ)
  LI0560      Down           Ribosomal protein S9
  mhp416      Up             Asparaginyl-tRNA synthetase (asnS)
  mhp030      In             Aspartyl/glutamyl-tRNA amidotransferase subunit B (gatB)
  LI0986      Up             Histidyl-tRNA synthetase (hisS)
  mhp128      Down           Seryl-tRNA synthetase (serS)
  mhp106      Up             Phenylalanyl-tRNA synthetase α chain (pheS)
  mhp105      Down           Phenylalanyl-tRNA synthetase β subunit (pheT)
  ML0238      Down           Methionine–tRNA ligase (metG)
  mhp430      Up             Elongation factor P (efp)

Replication, recombination and repair
  LI0287      Up             ATP-dependent exoDNAse (exonuclease V), α subunit-helicase superfamily I member (recD)
  LI0194      Down           Primosomal protein N' (replication factor Y) superfamily II helicase (priA)
  HP0911      Down           Helicase, single-stranded DNA-dependent ATPase (rep)
  LIC081      In             Replicative DNA helicase (recQ)
  LI0257      Up             DNA polymerase III, α subunit (dnaE)
  MG_031      Up             DNA polymerase III, α subunit, Gram+ (polC-1)
  mhp270      Down           DNA gyrase subunit B (gyrB)
  LI0492      Up             Single-stranded DNA-binding protein

Posttranslational modification, protein turnover, chaperones
  LI0124      Up             DnaJ-class molecular chaperone (cbpA)
  LI0685      Down           DnaJ-class molecular chaperone (dnaJ)
  LI0912      Up             Molecular chaperone (dnaK)
  MYPU_2230   Up             Molecular chaperone (dnaK)
  LI1048      Up, Down/In^a  Molecular chaperone (grpE)
  HP0010      Down           Chaperonin (groEL)

Carbohydrate transport and metabolism
  XOO_2190    Up             ABC-type sugar transport ATPase
  XOO_2191    Up             Glucose-6-phosphate 1-dehydrogenase
  XOO_0301    Up             Glucose dehydrogenase

rRNA
  MG_rrnA-16S [up]; mhprRNA-16S [up]; mhprRNA-23S [down]; mhprRNA-5S [up, down^a]

Discussion

Many cases of antigenic variation facilitated by SSRs have been documented (Groisman and Casadesus 2005; Moxon et al. 1994; van der Woude and Baumler 2004), and we expected to detect an overrepresentation of SSR-associated genes among those related to cell surface structures and pathogen-host interactions in the analyzed genomes. This was the case for H. influenzae and H. pylori but to a lesser extent for other analyzed genomes, where the distribution of SSR-associated genes among different gene function classes was mostly random (Table 1).

Surprisingly, we found a large number of SSR-associated housekeeping genes, such as rRNA and tRNA genes, ribosomal protein genes, amino acyl-tRNA synthetases, chaperones, and some metabolic enzymes. Many of these genes are probably essential, and their inactivation by SSR-induced mutations could be lethal. These SSRs are unlikely to cause phase variation, which is generally viewed as reversible and inheritable alterations of the cell phenotype (van der Woude and Baumler 2004). How can LSSRs affecting housekeeping genes be beneficial to pathogens? If mutations in these LSSRs affect expression of housekeeping genes, they could alter growth rates of different cell lineages, which could possibly further increase the antigenic variance within the pathogen population in combination with other mechanisms. However, we are not aware of any other evidence supporting this hypothesis, and it is unclear whether it could be effective. Alternatively, LSSRs associated with housekeeping genes could serve to influence expression of these genes by affecting physical properties of DNA or RNA molecules, or they could function as regulatory elements. The frequent location of LSSRs upstream of the first gene in an operon is consistent with their possible role in transcription initiation. The apparently random association of LSSRs with gene function classes in some of the analyzed genomes could also indicate that the SSRs are not directly related to the adjacent genes and, instead, influence some genomewide processes, such as organization and stability of the chromosome, replication, or chromosome segregation. It is also possible that some LSSRs arise from spontaneous expansions of shorter SSRs in the absence of selective constraints, a hypothesis that we previously deemed less likely. However, as stated earlier, explaining the bimodal distributions of SSR counts (Fig. 3) by spontaneous expansion alone requires additional assumptions. Moreover, considering that the bimodal distributions of SSR counts are generally restricted to a subset of host-adapted pathogens (Mrazek et al. 2007), it is easy to envision the pathogen-host interactions as a source of selective constraints affecting these genomes, whereas it is unclear how the spontaneous expansion should be restricted to this group of bacteria.

An intriguing explanation of LSSR association with genes relates to LSSRs located near pseudogenes in M. leprae. Unlike its close relatives, M. leprae has never been cultivated outside a host and has been proposed to be in an early process of genome reduction following its adaptation to the obligate pathogenic lifestyle (Cole et al. 2001; Lerat and Ochman 2004). The abundance of LSSRs in or near pseudogenes could be explained by two mechanisms: (i) LSSRs directly contribute to inactivation of genes and/or operons, or (ii) LSSRs arise after inactivation of a gene or operon due to relaxed selective constraints. If the latter model is correct, one might expect LSSRs to be equally likely to occur anywhere in an operon. However, LSSRs in M. leprae are located upstream of the translation start site or within the coding regions of the first gene in a string of pseudogenes (possible operon; Table 7), which suggests that the LSSRs directly contributed to the inactivation of these operons. A recent analysis of distribution of short homo-oligonucleotide runs (i.e., tandem repeats of a single nucleotide ≥6 bp long) in bacterial protein-coding genes detected a bias toward location near the 5′ ends of genes, possibly due to selection to reduce the metabolic cost of synthesis of nonfunctional peptides resulting from frameshift mutations in these short SSRs (van Passel and Ochman 2007). The location of LSSRs in M. leprae is consistent with this observation: disrupting the initiation of transcription would prevent the synthesis not only of nonfunctional peptides but also of nonfunctional mRNAs. Our data suggest that LSSRs are contributing to permanent gene loss in M. leprae, and their location is selected to minimize a potential detrimental effect of synthesis of nonfunctional RNAs and peptides. One might speculate that LSSRs may have played similar roles in genome reduction of other host-adapted pathogens, and that a proliferation of LSSRs could be involved at some stage of genome reduction following an adaptation to obligate pathogenic lifestyle (Moran 2002). Some of the SSRs in present-day genomes could simply be left over from the period of genome reduction. However, this hypothesis is contradicted by the absence of LSSRs in genomes of obligate endosymbiotic bacteria (Mrazek et al. 2007), which have undergone a similar process of genome reduction (Moran 2002, 2003). Nevertheless, it is possible that proliferation of LSSRs in the early stage of genome reduction is a general phenomenon, whereas their subsequent long-term preservation is determined by species-specific constraints.

Table 5 continued

tRNA
  MCAP_0062 [up]; mhptRNA-Asn4 [up]; mhptRNA-Gly [up]; mhptRNA-His [up]; mhptRNA-Leu4 [down]; mhptRNA-Ser2 [up]; mhptRNA-Tyr [up]; MLt18 (leuV) [down]; MLt27 (valV) [down]; MYPU_TRNA_LEU_1 [up]

Note: Genes are labeled with gene tags used in GenBank annotation. The initial letters signify the organism: HI—Haemophilus influenzae 86-028NP; HP—Helicobacter pylori 26695; jhp—Helicobacter pylori J99; LI—Lawsonia intracellularis PHE/MN1-00; MCAP—Mycoplasma capricolum subsp. capricolum ATCC 27343; MG—Mycoplasma genitalium G37; mhp—Mycoplasma hyopneumoniae 232; ML—Mycobacterium leprae TN; MSC—Mycoplasma mycoides subsp. mycoides SC str. PG1; MYPU—Mycoplasma pulmonis UAB CTIP; XOO—Xanthomonas oryzae pv. oryzae MAFF 311018. "Up," "down," or "in" signify the SSR location upstream, downstream, or within the associated gene, respectively.
^a The GrpE gene of L. intracellularis has one LSSR upstream and another overlapping the stop codon. The 5S rRNA gene of M. hyopneumoniae has two LSSRs near both ends of the gene. According to the annotation, LSSRs are located inside the gene, but alignments with other 5S rRNA genes suggest that LSSRs could be outside the section corresponding to the 5S rRNA (Mrazek 2006)

Table 6 Location of LSSRs in the housekeeping operons in M. hyopneumoniae and L. intracellularis

Operon                                                                                SSR                    Gene locus                       SSR location

M. hyopneumoniae
ATP synthases and hypothetical proteins                                               T(20)                  mhp476–mhp482                    |///////
30S and 50S ribosomal proteins                                                        T(21)                  mhp186–mhp206                    |???…???
Ribosomal proteins for protein synthesis                                              A(44)                  mhp094–mhp099                    |??????
30S ribosomal proteins, single-strand binding, and GTP-binding proteins               A(21)                  mhp304–mhp307                    ////|
50S ribosomal proteins and RNA polymerases                                            T(20)                  mhp634–mhp638                    /////|
50S ribosomal proteins                                                                T(20)                  mhp458–mhp459                    //|
tRNA synthetase, DNA helicase, and hypothetical proteins                              T(26)                  mhp416–mhp420                    |?????
tRNA amidotransferase and amidase                                                     [AAAC](4)AA            mhp029–mhp030                    ??*
Seryl-tRNA synthetase and phosphopyruvate hydratase                                   [AAAT](5)              mhp128–mhp129                    |//
Phenylalanyl-tRNA synthetases                                                         T(23)                  mhp105–mhp106                    /|/
Elongation factor and transketolase                                                   T(18)                  mhp430–mhp431                    |??
p97 cilium adhesion paralogues                                                        A(14)                  mhp271–mhp272                    |??
5S rRNA                                                                               A(49),T(27)            rRNA-5S                          |?|
16S and 23S rRNAs                                                                     T(20)                  rRNA-16S-23S                     //|
tRNA-Thr1, Val, Glu, and Asn4                                                         T(18)                  mhptRNA-Thr1, -Val, -Glu, -Asn4  ////|
tRNA-Gly                                                                              T(26)                  mhptRNA-Gly                      |?
tRNA-His                                                                              A(21)                  mhptRNA-His                      |?
tRNA-Leu4 and Ser2                                                                    T(20)                  mhptRNA-Leu4, -Ser2              /|/
tRNA-Tyr and Gln                                                                      T(21)                  mhptRNA-Gln, -Tyr                //|

L. intracellularis
Cytochrome bd-type quinol oxidases                                                    [TA](7)                LI0275–LI0276                    ?|?
Ferredoxin oxidoreductases                                                            [TA](7)                LI0790–LI0792                    |???
Chromosome partitioning ATPase                                                        [ATT](6)AT             LIB024                           ?*
Chromosome partitioning ATPases, transcriptional regulator, and ADP-heptose synthase  [AC](8)                LI0814–LI0816                    |???
Ribosomal proteins                                                                    [TTA](7)T              LI0959–LI0983                    ???…??|
Ribosomal proteins                                                                    [AT](6)                LI0559–LI0560                    ??|
tRNA synthetases                                                                      [TTA](5)TT             LI0986–LI0987                    |??
ExoDNAse and cell-wall-associated hydrolase                                           [ATTA](4)              LI0286–LI0287                    //|
Replication helicase, UDP-glucose pyrophosphorylase, phosphomannomutase               [ATA](7)               LI0191–LI0194                    ????|
Replicative DNA helicase                                                              [AT](7)                LIC081                           ?*
DNA polymerase and ATP synthase subunits                                              [GTTA](4)              LI0257–LI0258                    |??
Porphyrin oxidase and oxidoreductases                                                 [AGA](5)A              LI0429                           |?
Molecular chaperone and ATPase                                                        [TTA](5)TT             LI0124–LI0126                    |???
Molecular chaperone DnaJ                                                              A(14)                  LI0685                           /|
Molecular chaperone DnaK                                                              A(12)                  LI0912                           |?
Molecular chaperone GrpE                                                              [TATT](4)T, [TTA](6)T  LI1048                           |/|

Note: Arrows signify the gene orientation, vertical bars denote locations of intergenic SSRs, and asterisks indicate locations of SSRs inside the gene to the left. The numbers in the SSR sequences indicate the number of copies of the preceding segment followed by any remaining partial copy. For example, [AAAC](4)AA refers to the sequence AAACAAACAAACAAACAA
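The repeat shorthand used in Tables 6 and 7, in which [AAAC](4)AA stands for AAACAAACAAACAAACAA, can be expanded mechanically. The sketch below is one possible reading of that notation and is offered only as an illustration: it assumes a single repeat per string (entries that list two repeats separated by a comma would need to be split first), accepts a bare nucleotide such as T(20) for mononucleotide runs, and requires the trailing partial copy to be a prefix of the repeat unit. The names `expand_ssr` and `_SSR_PATTERN` are hypothetical.

```python
import re

# Either a bracketed unit like [AAAC] or a single bare nucleotide like T,
# followed by a copy number in parentheses and an optional partial copy.
_SSR_PATTERN = re.compile(r"^(?:\[([ACGT]+)\]|([ACGT]))\((\d+)\)([ACGT]*)$")


def expand_ssr(notation: str) -> str:
    """Expand shorthand such as '[AAAC](4)AA' or 'T(20)' into a sequence."""
    match = _SSR_PATTERN.match(notation.strip())
    if match is None:
        raise ValueError(f"unrecognized SSR notation: {notation!r}")

    unit = match.group(1) or match.group(2)
    copies = int(match.group(3))
    partial = match.group(4)

    # Assumption: the trailing partial copy must be a prefix of the unit.
    if not unit.startswith(partial):
        raise ValueError(f"partial copy {partial!r} is not a prefix of {unit!r}")

    return unit * copies + partial


if __name__ == "__main__":
    for text in ("[AAAC](4)AA", "T(20)", "[TTA](7)T"):
        sequence = expand_ssr(text)
        print(text, len(sequence), sequence)
```

Expanding the shorthand this way makes it easy to compare repeat lengths across unit sizes, for example a 20-bp run of T against a 22-bp [TTA](7)T tract.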

Acknowledgments We thank Dr. Anne Summers for critical reading of the manuscript and Drs. Mark Schell, Duncan Krause, and other colleagues at the UGA Department of Microbiology for stimulating discussions.

References

Amieva MR, Vogelmann R, Covacci A, Tompkins LS, Nelson WJ, Falkow S (2003) Disruption of the epithelial apical-junctional complex by Helicobacter pylori CagA. Science 300:1430–1434
Blanchard B, Saillard C, Kobisch M, Bove JM (1996) Analysis of putative ABC transporter genes in Mycoplasma hyopneumoniae. Microbiology 142(Pt 7):1855–1862
Cole ST, Eiglmeier K, Parkhill J, James KD, Thomson NR, Wheeler PR, Honore N, Garnier T, Churcher C, Harris D, Mungall K, Basham D, Brown D, Chillingworth T, Connor R, Davies RM, Devlin K, Duthoy S, Feltwell T, Fraser A, Hamlin N, Holroyd S, Hornsby T, Jagels K, Lacroix C, Maclean J, Moule S, Murphy L, Oliver K, Quail MA, Rajandream MA, Rutherford KM, Rutter S, Seeger K, Simon S, Simmonds M, Skelton J, Squares R, Squares S, Stevens K, Taylor K, Whitehead S, Woodward JR, Barrell BG (2001) Massive gene decay in the leprosy bacillus. Nature 409:1007–1011
Dunker AK, Cortese MS, Romero P, Iakoucheva LM, Uversky VN (2005) Flexible nets. The roles of intrinsic disorder in protein interaction networks. FEBS J 272:5129–5148
Field D, Wills C (1998) Abundant microsatellite polymorphism in Saccharomyces cerevisiae, and the different distributions of microsatellites in eight prokaryotes and S. cerevisiae, result from strong mutation pressures and a variety of selective forces. Proc Natl Acad Sci USA 95:1647–1652
Groisman EA, Casadesus J (2005) The origin and evolution of human pathogens. Mol Microbiol 56:1–7
Gur-Arie R, Cohen CJ, Eitan Y, Shelef L, Hallerman EM, Kashi Y (2000) Simple sequence repeats in Escherichia coli: abundance, distribution, composition, and polymorphism. Genome Res 10:62–71
Htun H, Dahlberg JE (1989) Topology and formation of triple-stranded H-DNA. Science 243:1571–1576
Karlin S, Mrazek J, Campbell AM (1996) Frequent oligonucleotides and peptides of the Haemophilus influenzae genome. Nucleic Acids Res 24:4263–4272
Kashi Y, King DG (2006) Simple sequence repeats as advantageous mutators in evolution. Trends Genet 22:253–259
Lerat E, Ochman H (2004) Ψ-Φ: Exploring the outer limits of bacterial pseudogenes. Genome Res 14:2273–2278
Li YC, Korol AB, Fahima T, Nevo E (2004) Microsatellites within genes: structure, function, and evolution. Mol Biol Evol 21:991–1007
McGinnis S, Madden TL (2004) BLAST: at the core of a powerful and diverse set of sequence analysis tools. Nucleic Acids Res 32:W20–W25
Moran NA (2002) Microbial minimalism: genome reduction in bacterial pathogens. Cell 108:583–586

Table 7 SSR-associated pseudogenes in Mycobacterium leprae

Putative operon                                                                   SSR         Graph       Pseudogene locus
Acyl CoA synthesis                                                                G(22)       ?*?????     ML0163–ML0168
Precorrin-3 methylase and reductase                                               [TA](10)    //|         ML1449–ML1450
Group II intron maturase and transposase                                          [AC](8)A    ///|        ML1823–ML1825
Sugar transporter                                                                 [AAG](7)A   /|          ML2344
Transcriptional regulator                                                         [AAG](7)A   |?          ML2345
Membrane proteins, methyltransferase, sigma factor, PPE-protein and transposase   [AT](10)    ////////|   ML2368–ML2375
PE protein                                                                        [TA](11)    ?*          ML2477

Note: Arrows signify the gene orientation, vertical bars denote locations of intergenic SSRs, and asterisks indicate locations of SSRs inside the gene to the left. The original function of the pseudogenes was assessed by sequence similarity searches


Moran NA (2003) Tracing the evolution of gene loss in obligate bacterial symbionts. Curr Opin Microbiol 6:512–518
Moxon ER, Rainey PB, Nowak MA, Lenski RE (1994) Adaptive evolution of highly mutable loci in pathogenic bacteria. Curr Biol 4:24–33
Mrazek J (2006) Analysis of distribution indicates diverse functions of simple sequence repeats in Mycoplasma genomes. Mol Biol Evol 23:1370–1385
Mrazek J, Guo X, Shah A (2007) Simple sequence repeats in prokaryotic genomes. Proc Natl Acad Sci USA 104:8472–8477
Perutz MF (1999) Glutamine repeats and neurodegenerative diseases: molecular aspects. Trends Biochem Sci 24:58–63
Price MN, Huang KH, Alm EJ, Arkin AP (2005) A novel method for accurate operon predictions in all sequenced prokaryotes. Nucleic Acids Res 33:880–892
Raherison S, Gonzalez P, Renaudin H, Charron A, Bebear C, Bebear CM (2002) Evidence of active efflux in resistance to ciprofloxacin and to ethidium bromide by Mycoplasma hominis. Antimicrob Agents Chemother 46:672–679
Rocha EP (2003) An appraisal of the potential for illegitimate recombination in bacterial genomes and its consequences: from duplications to genome reduction. Genome Res 13:1123–1132
Rocha EP, Blanchard A (2002) Genomic repeats, genome plasticity and the dynamics of Mycoplasma evolution. Nucleic Acids Res 30:2031–2042
Roske K, Blanchard A, Chambaud I, Citti C, Jacobs E (2001) Phase variation among major surface antigens of Mycoplasma penetrans. Infect Immun 69:7642–7651
Shafer RH, Smirnov I (2000) Biological aspects of DNA/RNA quadruplexes. Biopolymers 56:209–227
Sinden RR (1994) DNA structure and function. Academic Press, San Diego, CA
Subramaniam S, Frey J, Huang B, Djordjevic S, Kwang J (2000) Immunoblot assays using recombinant antigens for the detection of Mycoplasma hyopneumoniae antibodies. Vet Microbiol 75:99–106
Tatusov RL, Koonin EV, Lipman DJ (1997) A genomic perspective on protein families. Science 278:631–637
Tatusov RL, Fedorova ND, Jackson JD, Jacobs AR, Kiryutin B, Koonin EV, Krylov DM, Mazumder R, Mekhedov SL, Nikolskaya AN, Rao BS, Smirnov S, Sverdlov AV, Vasudevan S, Wolf YI, Yin JJ, Natale DA (2003) The COG database: an updated version includes eukaryotes. BMC Bioinform 4:41
Toth G, Gaspari Z, Jurka J (2000) Microsatellites in different eukaryotic genomes: survey and analysis. Genome Res 10:967–981
van der Woude MW, Baumler AJ (2004) Phase and antigenic variation in bacteria. Clin Microbiol Rev 17:581–611
van Passel MW, Ochman H (2007) Selection on the genic location of disruptive elements. Trends Genet 23:601–604
Wise KS, Foecking MF, Roske K, Lee YJ, Lee YM, Madan A, Calcutt MJ (2006) Distinctive repertoire of contingency genes conferring mutation-based phase variation and combinatorial expression of surface lipoproteins in Mycoplasma capricolum subsp. capricolum of the Mycoplasma mycoides phylogenetic cluster. J Bacteriol 188:4926–4941
