
Supplementary Text for Fuso paper.docxmbio.asm.org/.../mbo006142056s1.docx · Web viewThe syntenic fraction for a given orthogroup was calculated as the sum of the syntenic fraction


Supplemental information for: Abigail Manson McGuire1, Kyla Cochrane2, Allison D. Griggs1,

Brian J. Haas1, Thomas Abeel1,4, Qiandong Zeng1, Justin B. Nice5, Hanlon MacDonald5, Bruce W.

Birren1, Bryan W. Berger3,5, Emma Allen-Vercoe2, Ashlee M. Earl1*. Evolution of Invasion in a

Diverse Set of Fusobacterium.

SUPPLEMENTAL RESULTS

Adaptive radiation of fusobacterial lineages. To gain greater insight into how active

and passive invasion strategies may have evolved, we looked closely at the phylogenetic

relationship among the active and passive invader species (Figure 1). While our maximum

likelihood tree agreed with previous reports of Fusobacterium phylogeny, with Lineage 2

(containing Clades C and D) in the more basal position relative to Lineage 3 (containing Clade E)

(1, 2), there was very low (4%) bootstrap support for this node, indicating a very weak basis for

sequential ordering of that division. Bootstrap support for all other nodes in the tree was

excellent. When individual gene trees for each of the 498 core orthogroups were examined for

relatedness, only 37% were found to support the topology predicted by maximum likelihood

analysis, while 27% supported an alternate topology placing Lineage 3 in the more basal

position relative to Lineage 2, and 32% represented topologies where these clades diverged

from Lineage 1 (containing Clades A and B) via a common ancestor. These data strongly suggest

that the relative placement of Lineage 2 and Lineage 3 is poorly defined due to their near

simultaneous divergence, likely representing an adaptive radiation where the last common

ancestor diversified into three major lineages containing five clades, three clades acquiring (or

retaining) the potential to actively invade and the others lacking this capability.

ANI-based species definition. Average nucleotide identity (ANI) values of 94-95% represent a commonly accepted

threshold for species designation (3, 4). Pairwise comparisons between members of different F.

nucleatum subspecies were lower than this threshold (89-93%), while values obtained between

members of the same subspecies were above this threshold (96-99%) (Figure S1d). These

results indicate that each of the four F. nucleatum subspecies in our analysis would be

considered a separate species, according to ANI-based species definition.

ANI values within F. periodonticum ranged from 92-98%, overlapping the ANI species

line of 94-95%. F. periodonticum 2_1_31 and F. periodonticum D10 would be contained within

a single species (pairwise ANI values of 98%), while F. periodonticum ATCC 33693 and F.

periodonticum 1_1_41FAA would each be placed in a separate species (pairwise ANI values of

92-94% to other strains; see Figure S1e). For all other species with more than one strain in our

analysis (F. necrophorum, F. gonidiaformans, and F. ulcerans), all pairwise intraspecies ANI

values were >95%, indicating that they represent a single species.

Species-specific orthogroups. Species-specific orthogroups, or groups of orthologous

genes, primarily contain genes annotated as encoding “hypothetical proteins”, with no

indication of protein function. Pfam, GO, KEGG, and other functional annotations provide some

clues to their function.

F. nucleatum. Orthogroups present in all 14 F. nucleatum strains, but no other strains in

our analysis, include those encoding two MORN2 domain containing proteins, two periplasmic

solute-binding proteins, several transporters, a polysaccharide deacetylase, and a protein

containing a NUDIX domain. NUDIX domains have previously been associated with infection in

other pathogenic bacteria (5). Many F. nucleatum-specific orthogroups include genes with no

functional annotation.

There are also numerous subspecies-specific orthogroups (see Table S3), which are

genes present in all sequenced members of a single F. nucleatum subspecies, and absent in all other

sequenced F. nucleatum strains. Most of these subspecies-specific orthogroups contain genes

of unknown function. For F. nucleatum animalis, subspecies-specific orthogroups include those

encoding a YadA family adhesin, a RelB/Stbe addiction module toxin, several transporters, a

bacterial surface protein, and several proteins with likely roles in gene regulation. F. nucleatum

vincentii-specific orthogroups include a TonB-dependent receptor component, several

transporters, a YadA family adhesin, and an orthogroup with MORN2 domains. F. nucleatum

nucleatum-specific orthogroups include those encoding a MORN2 protein, and several

transporters. F. nucleatum polymorphum-specific orthogroups include those encoding two

MORN2 proteins, several transporters, regulators, and a colicin. Orthogroups missing in

individual subspecies also encode transporters, MORN2 proteins, autotransporter β-domain

containing proteins (which is a component of RadD-family adhesins), and regulatory proteins,

as well as many proteins of unknown function.

F. periodonticum. Orthogroups present only in all four F. periodonticum strains include

3 MORN2-containing orthogroups, two orthogroups encoding extracellular solute-binding

proteins, several orthogroups encoding transporters, several encoding regulatory proteins, an

orthogroup encoding a TonB family protein, and an orthogroup encoding a tetratricopeptide-

repeat (TPR) containing protein. TPR containing proteins contain a set of repeats, which can

fold together to form a solenoid domain (6). Many F. periodonticum-specific orthogroups

include genes with no functional annotation.

F. necrophorum. Orthogroups present only in both F. necrophorum strains include three

orthogroups encoding either YadA family or haemagglutinin adhesins, an orthogroup encoding

a TonB family protein, two orthogroups encoding extracellular solute-binding proteins, two

orthogroups encoding haemolysin secretion/activation proteins, an orthogroup with genes

encoding a POTRA domain (POTRA domains are known to be involved in the production of

haemolysins (7)), an orthogroup encoding an Omptin domain (Omptins are known to be involved

in pathogenesis in other organisms (8)), and several orthogroups containing transporters.

Many F. necrophorum-specific orthogroups include genes with no functional annotation.

F. gonidiaformans. There are fewer F. gonidiaformans-specific orthogroups because

these are the two smallest genomes in our data set. Orthogroups present only in both F.

gonidiaformans strains include two groups encoding transporters, a group encoding a

regulator, and several encoding proteins of unknown function.

F. ulcerans. Orthogroups present only in both F. ulcerans strains include 17 orthogroups

encoding bacterial DNA-binding proteins with the Pfam domain PF00216, nine orthogroups

encoding regulatory proteins, two orthogroups encoding FadA adhesin domains, 15

orthogroups with EAL and GGDEF signaling domains, and four orthogroups encoding

extracellular solute-binding proteins.

F. varium and F. mortiferum. Because there is only one representative for each of these

two species in our dataset, all strain-specific orthogroups are included in the list in Table S3.

Most of these contain strain-specific genes with no functional annotation.

Species-specific gene family expansions. We identified expansions of KEGG Pathways,

Gene Ontology categories and Pfam protein domain families within individual species.

F. nucleatum. In this highly invasive species, we observed expansions of gene families

related primarily to membrane components, transporters, and pathogenesis (see Table S4).

Other expansions included several Pfam domains of unknown function (DUF1703, DUF1311,

DUF3601, and DUF1016), as well as a large expansion of MORN2 domains. In F. nucleatum, we

also observed an expansion of genes related to addiction-module toxin-antitoxin systems,

which are usually related to maintaining the stability of extrachromosomal elements, and can

be involved in pathogenesis. We have fully resolved plasmids for the manually finished F.

nucleatum animalis 7-1, F. nucleatum animalis 4-8, and F. nucleatum vincentii 3-1-27.

The most striking subspecies-specific characteristics in F. nucleatum polymorphum are

expansions in amino acid metabolism genes (Table S4). Categories that were highly expanded

include branched chain amino acid biosynthesis, leucine biosynthesis, lysine biosynthesis, and

C5-branched dibasic acid metabolism. We also observed expansions in vitamin B6 metabolism

and pantothenate biosynthesis in this subspecies. In F. nucleatum polymorphum ATCC 10953,

previous research showed that proteins involved in amino acid metabolism change

concentration in alkaline biofilms, along with a shift to more efficient metabolism (9). Other

subspecies-specific gene family expansions are related to transposase activity (see Table S4).

F. periodonticum. Similar to F. nucleatum polymorphum, we observed expansions of

amino acid metabolism genes. Phylogenetic profiles for genes related to amino acid

metabolism varied widely across the organisms in our data set. This was in agreement with

previous studies, which showed that Fusobacterium spp. strains within different niches of the

oral cavity utilize different sets of amino acids (10). Several categories of amino acid

metabolism genes were expanded or only present in F. periodonticum, as well as F. nucleatum

subsp. polymorphum, particularly branched-chain amino acid metabolism (including leucine and

threonine; see Table S4). Pantothenate and CoA biosynthesis also followed this same profile of

being expanded only in F. nucleatum polymorphum and F. periodonticum. We also saw a

striking expansion of the MORN2 Pfam domains. There were also many MORN2 proteins in the

active invaders F. nucleatum, F. ulcerans and F. varium, but the most striking expansion was

observed in F. periodonticum. Each F. periodonticum genome had approximately 50 proteins

containing MORN2 domains.

F. necrophorum. This species had expansions of genes containing adhesion-related

Pfam domains (“haemagglutinin”, “YadA-like C-terminal domain”, and “Hep Hag”). F.

necrophorum also had a significant expansion of TonB-dependent receptor genes. These

membrane proteins sense and transmit signals from outside the cell into the cytoplasm. F.

necrophorum had significantly reduced numbers of genes related to the membrane (GO terms

“intrinsic to membrane”, “integral to membrane”, and “plasma membrane”, as well as “MORN2”).

F. gonidiaformans. This species was missing many functions relative to the other

organisms, since its two representative genome sequences were the smallest in our data set.

The categories most significantly reduced included those related to the membrane (“membrane

part”, “intrinsic/integral to the membrane”, “plasma membrane”, “outer membrane”, and

“MORN2”) and categories related to transport (“transporter activity”, “transport”, “active

transmembrane transporter activity”, “antiporter activity”, “transmembrane transporter

activity”, “substrate-specific transporter activity”, and “drug transport”). This large reduction in

membrane components was consistent with the fact that F. gonidiaformans is not believed to

be able to invade host cells independently. F. gonidiaformans has approximately one-third as

many genes as other fusobacteria belonging to the “membrane part” GO category.

F. ulcerans and F. varium. These species exhibited very striking expansions of DNA

binding proteins, including transcription factors. The most striking expanded category was

PF00216 (bacterial DNA-binding proteins; Table S4). There were 50-60 of these in each F.

ulcerans and F. varium genome, and fewer than five in other organisms. The exact function of

these domains is not clear, but they are annotated as being histone-like proteins involved in

wrapping and stabilizing DNA. Many categories of transcriptional regulators were highly

expanded in F. ulcerans and F. varium. These species had the largest genomes in our collection

(~3500 genes vs. 2000 in F. nucleatum). Since the number of regulatory proteins tends to go up

as the square of the genome size (11-13), an expansion of regulatory proteins was expected in

these two species' genomes, just as we would expect fewer regulatory proteins in F. necrophorum and F.

gonidiaformans. Overall, the expansion of regulatory proteins in F. ulcerans and F. varium

(approximately three times as many as in F. nucleatum) is consistent with the expansion

expected due to their larger genome size.
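
As a rough check of this scaling argument, the sketch below squares the ratio of the approximate gene counts quoted above (~3500 vs. ~2000). It is only an illustration: the function name is hypothetical, gene count is used as a stand-in for genome size, and the exponent of 2 is taken from the cited scaling relationship (11-13) rather than fitted to any data.

```python
def expected_regulator_ratio(genes_large: float, genes_small: float,
                             exponent: float = 2.0) -> float:
    """Expected fold-change in regulator count if regulators scale as
    (gene count) ** exponent. The exponent of 2 is an assumption taken
    from the scaling relationship cited in the text."""
    return (genes_large / genes_small) ** exponent


# Approximate gene counts quoted in the text.
ratio = expected_regulator_ratio(3500, 2000)
print(f"expected fold-change in regulators: {ratio:.2f}")  # prints 3.06
```

Under these assumptions the expected fold-change is about 3, in line with the roughly threefold expansion reported above.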

F. ulcerans and F. varium also contained expansions of the GGDEF and EAL domains,

which are related to signaling, as well as expansions of genes related to metal binding, and

numerous types of transporter domains. The most significantly expanded KEGG pathway was

ko00633 (nitrotoluene degradation). Because these are larger genomes, we saw more

significant gene expansions than gene reductions. However, we did observe that the histidine

metabolism pathway (ko00340) was completely absent in F. ulcerans and F. varium.

F. mortiferum. Because there is only one sequenced F. mortiferum genome, we do not

have additional strains for comparison. In F. mortiferum ATCC 9817, we observed modest

expansions of carbohydrate transporters and sugar phosphotransferase systems as compared

to other fusobacterial species, indicating adaptation to a specific metabolic environment.

Protein families expanded in active invaders cluster in the genome. Using the finished

F. nucleatum genomes, we performed a computational analysis to quantitate the proximity

between FadA, RadD, and MORN2 protein families, and members of the following categories

(see Materials and Methods): 1) all Pfam domains, 2) all gene ontology categories, 3) proteins

containing signal peptides as predicted by SignalP (14), and 4) proteins containing

transmembrane domains predicted by TMHMM (15). These Pfam and gene ontology categories

include groupings related to phage and other mobile elements. We observed that genes

encoding active-specific protein families associated with one another with statistical

significance (Table S6), suggesting that they physically cluster. The most significant clustering

occurred between groups of genes encoding MORN2 proteins (see Table S6 and Figure 3). F.

nucleatum genomes contained 5-9 separate clusters of MORN2 genes, with each cluster

containing 2-4 MORN2 genes, and F. periodonticum genomes contained 7-11 separate clusters

of MORN2 genes, with each cluster containing 4-5 MORN2 genes. There were also differences

in the gene content of MORN2 clusters between strains, even between strains of the same

subspecies, suggesting that these clusters are hotspots for variation.

In addition to physical clustering between different Pfam families expanded in active

invaders, we also observed that the expanded Pfam families localized with other adhesins and

virulence-associated genes. For instance, there were 17 instances where a FadA domain-

containing gene was located near an OmpA family protein. OmpA proteins are membrane-

embedded β-barrels, often involved in bacterial pathogenesis, adhesion, and invasion (16).

Genes encoding proteins with signal peptides, indicating an extracellular role, were highly

clustered with MORN2-, FadA-, and RadD-containing genes, as well as genes related to

transport, cell membrane, cell wall organization, and peptidoglycan biosynthesis. There was

also an association between chorismate mutase (CM)-domain containing proteins and proteins

containing MORN2 domains. In each of the seven finished genomes, all three CM-domains

were either found within the same gene as a MORN2 domain, or were located within two

adjacent genes in the same operon (FN0043, FN0044, and FN0045 in F. nucleatum subsp.

nucleatum ATCC 25586).

We also observed that proteins related to phage and transposition significantly

clustered with MORN2-containing proteins (Table S6). IS elements (see Materials and

Methods) were also 1.6 times more likely to be found within 2 Kb of a MORN2 region than

other genes in the genome. IS elements have previously been shown to associate with outer

membrane proteins in F. nucleatum (17): In F. nucleatum subsp. nucleatum 25586, IS elements

were observed to flank outer membrane proteins, including a RadD family member containing

an autotransporter β-domain. However, prophage predicted by the PHAST tool (18) did not

overlap with any MORN2, FadA, or RadD regions, so prophage do not appear to promote

present-day evolution and movement of these regions. Additionally, active-specific genomic

regions did not appear to have been recently horizontally acquired from non-fusobacterial

sources. We searched for genomic islands using the IslandViewer software (19) (see Materials

and Methods), and found very few predictions in these genomes. Together, these data suggest

that MORN2 genes have been carried in Fusobacterium for a long period of time, and are being

actively reshuffled.

Active-specific features present in F. nucleatum from cancerous tumors. To further

validate our lists of active-specific orthogroups, as well as functional categories over-

represented in the active invaders, we analyzed six additional sequenced F. nucleatum genomes

isolated from cancerous tumors (see Materials and Methods). In these cancer strains, we

observed similar copy numbers for MORN2 (27-31 copies), FadA (3-6 copies), and RadD (5-11

copies) family genes as for the active invaders. In addition, all 44 of the active-specific

orthogroups were also present in each of the six cancer strains, supporting the idea that these

six F. nucleatum strains are similar to the other active invader strains in our dataset.

MORN2 domains in other organisms. The massive expansion of MORN2 domains

observed in the active invader clades is a feature highly specific to the Fusobacterium genus.

MORN2 domains are present in about 10% of the >5000 bacterial strains represented in the

Pfam database. However, most of these species contain only a small number of MORN2

domains: only 23 bacteria in the Pfam database contain >100 MORN2 domains, and 22 of these

23 are actively invading Fusobacterium. The only non-fusobacterial species in Pfam containing

over 100 MORN2 domains is Helicobacter bilis ATCC 43879.

The 65 genomes in the Pfam database containing 20-100 MORN2 domains (more than

the passively invading Fusobacterium spp., but fewer than most members of the active invader

clade) comprise a diverse set, including three Leptotrichiaceae (28-60 MORN2 domains); five

Bacteroidetes (Cytophagia, Flavobacteria, and Saprospira, with 26-51 MORN2 domains each);

16 Shewanella strains (each containing 26-29 MORN2s); two Vibrio species (29 and 34

MORN2s); three strains of the opportunistic pathogen Myroides odoratimimus (20-26 MORN2s) (20);

and two pneumonia-causing Parachlamydia acanthamoebae (21 MORN2s) (21). A variety of

additional organisms contain smaller numbers of MORN2 domains, including numerous Yersinia

pestis strains (9-18 MORN2s); several Shigella strains (9-10 MORN2s); numerous Salmonella

strains (9 MORN2s); numerous Campylobacter strains (6 MORN2s); numerous Neisseria strains

(4 MORN2s); numerous Chlamydia strains (4 MORN2s); several Pseudomonas strains (3

MORN2s); and a small number of E. coli strains (10 MORN2s), including an avian-pathogenic

strain causing extraintestinal colibacillosis. Across all organisms, MORN2 domains are often

found clustered, with multiple copies within a single gene. As in Fusobacterium, the role of

MORN2 domains in these organisms is unclear.

Chromosomal rearrangements in Fusobacterium. Perhaps because of evolution

through duplication in Fusobacterium genomes, genes containing MORN2 domains correlate

with plasticity in genomic architecture. Actively invading Fusobacterium genomes exhibit an

exceptional level of genomic rearrangement (Figure S3), whereas genomes of passive invaders

with fewer MORN2 genes have fewer rearrangements (p=9e-6), as measured by syntenic

fraction (Figure S3; see Methods). The rate observed within some Fusobacterium spp.,

including F. nucleatum, is even higher than the rates observed in H. pylori and Y. pestis, species

known to undergo chromosomal rearrangements at unusually high rates (22-24). Even the

level of rearrangements present within individual subspecies of F. nucleatum is quite high.

The association between MORN2 genes and large-scale synteny breaks, potentially

indicating a causal relationship, was confirmed (p=0.005; see example shown in Figure 3) by

comparing a metric for average syntenic conservation (see Materials and Methods) between

MORN2-containing orthogroups (0.31) and all orthogroups (0.43) in our set of finished genomes,

in which synteny was less in question. Recombination between non-randomly

distributed, repetitive chromosomal sequences is a widely conserved mechanism to promote

genome diversity in prokaryotes, including H. pylori and M. tuberculosis (24). Therefore, it is

possible that the repetitive MORN2 regions could be driving this elevated rate of chromosomal

rearrangement in Fusobacterium, which could allow the bacteria to rapidly adapt to their varied

ecological niches and diverse environmental stresses within the host.

SUPPLEMENTAL MATERIALS AND METHODS

Sequencing. 21 new draft genomes were produced as part of the reference genome

collection for the Human Microbiome Project at the Broad Institute. Assembly statistics can be

seen in Table S1. Genomes were sequenced using 454. Sample preparation was performed as

described in Lennon et al. (25). We selected five strains for finishing (F. nucleatum subsp.

vincentii 3_1_27, F. nucleatum subsp. vincentii 3_1_36A2, F. nucleatum subsp. animalis 21_1A,

F. nucleatum subsp. animalis 4_8, and F. nucleatum subsp. animalis 7_1). Gaps within scaffolds

were spanned by PCR amplicons that were then Sanger sequenced using end primers or

internal walking primers. Gaps between scaffolds were addressed using a combinatorial PCR

approach where scaffold edge primers were used in amplification pairs with every other

scaffold edge. Reaction products were subjected to Sanger sequencing and reads were

incorporated as appropriate using Consed (26). Three of the finished genomes (F. nucleatum

subsp. vincentii 3_1_27, F. nucleatum subsp. animalis 4_8, and F. nucleatum subsp. animalis

7_1) have plasmids, and F. nucleatum subsp. vincentii 3_1_27 also had a second 373 Kb

chromosome. Assembly statistics for these finished genomes and plasmids can be found in

Table S1.

We also used the following six F. nucleatum strains isolated from cancerous tumors,

sequenced at the Broad Institute after our comparative analysis was initiated, to further

validate our results (NCBI accessions are indicated in parentheses): F. nucleatum CTI-1

(AXNZ01000000), F. nucleatum CTI-2 (AXNY01000000), F. nucleatum CTI-3 (AXNX01000000), F.

nucleatum CTI-5 (AXNW01000000), F. nucleatum CTI-6 (AXNV01000000), and F. nucleatum CTI-

7 (AXNU01000000).

Genome annotation. To assure consistency and to reduce artifacts among the genomes

being analyzed, all genomes were re-annotated in a uniform manner using the Broad Institute’s

prokaryotic annotation pipeline. The protein-coding genes were predicted with Prodigal (27)

and filtered to remove genes with >=70% overlap to tRNAs or rRNAs. The tRNAs were

identified by tRNAscan-SE (28). The rRNA genes were predicted using RNAmmer (29). The

gene product names were assigned based on top blast hits against SwissProt protein database

(>=70% identity and >=70% query coverage), and protein family profile search against the

TIGRfam hmmer equivalogs. Additional annotation analyses performed include PFAM (30),

TIGRfam (31), KEGG (32), COG (33), GO (34), EC (35), SignalP (14), and TMHMM (15).

Summaries of gene counts can be found in Table S1.
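
The overlap filter applied to the predicted protein-coding genes can be sketched as below. This is a minimal illustration under stated assumptions, not the annotation pipeline itself: the function names are hypothetical, coordinates are taken to be 1-based and inclusive on a single contig, strand is ignored, and the overlap is expressed as a fraction of the gene's length (the text does not state which length the 70% refers to).

```python
def overlap_fraction(gene: tuple[int, int], feature: tuple[int, int]) -> float:
    """Fraction of the gene's length covered by the feature.

    Coordinates are (start, end), 1-based and inclusive (an assumption).
    """
    g_start, g_end = gene
    f_start, f_end = feature
    overlap = min(g_end, f_end) - max(g_start, f_start) + 1
    return max(overlap, 0) / (g_end - g_start + 1)


def filter_genes(genes, rna_features, max_overlap=0.70):
    """Keep genes whose overlap with every tRNA/rRNA feature is below max_overlap."""
    return [
        g for g in genes
        if all(overlap_fraction(g, f) < max_overlap for f in rna_features)
    ]


# Hypothetical coordinates for illustration only.
genes = [(100, 199), (300, 399), (500, 599)]
rnas = [(120, 199), (380, 450)]
kept = filter_genes(genes, rnas)
print(kept)  # [(300, 399), (500, 599)]
```

In this toy example the first gene is removed because 80 of its 100 bases overlap an RNA feature, while the second gene (20% overlap) is retained.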

Orthogroup clustering. SYNERGY2 (36-38), available at

http://sourceforge.net/projects/synergytwo/, was used to identify orthogroups in our set of 27

genomes. Orthogroups contain orthologs, which are vertically inherited genes that likely have

the same function, and also possibly paralogs, which are duplicated genes that may have

different function.

Phylogenetic trees and renaming strains. Phylogenetic trees were generated by

applying RAxML (39) to a concatenated alignment of 498 single-copy core orthogroups

(excluding orthogroups with paralogs) across all 27 organisms (including the Leptotrichia

outgroup). Bootstrapping was performed using RAxML’s rapid bootstrapping algorithm. 16S

phylogenetic trees were generated using ClustalW (40) alignments and Phylip’s DNAdist and

Fitch algorithms (41).

Using these trees, we were able to assign organisms to appropriate species and

subspecies and rename them accordingly (see Table S1 and Figure S1a-b). To accurately

rename the subspecies within F. nucleatum, we added one additional, recently completed

genome to our analysis to clarify phylogenetic relationships (F. nucleatum subsp.

animalis ATCC 51191; this genome was not available when we began our analysis). We

constructed orthogroups using OrthoMCL (42), using the 14 F. nucleatum genomes in Table S1,

plus F. nucleatum subsp. animalis ATCC 51191, and constructed the phylogenetic tree in Figure

S1b.

Since we had no complete genome sequence for the fifth known subspecies of F. nucleatum

(F. nucleatum subsp. fusiforme), we also generated a 16S rRNA-based tree to verify this new

strain naming (see Figure S1c). We used the 16S sequence for a F. nucleatum subsp. fusiforme

strain, as well as 16S sequences for all of the other genomes in our analysis.

ANI and shared-gene analysis. SYNERGY2 orthogroups were used to determine shared

gene content in pairwise genome comparisons. For a genome pair (genome 1 and genome 2),

the total number of genes in genome 1 was determined and the number of genes in genome 1

shared with genome 2 (based on shared ortholog group membership) was determined. Percent

shared gene content was calculated by dividing the number of genome 1 genes shared with

genome 2 by the number of genes in genome 1. Nucleotide alignments of shared genes were

used to determine the numbers of identical and different nucleotide residues in shared genes.

Percent ANI was calculated by dividing the number of identical nucleotide residues in shared

genes by the total number of nucleotide residues.
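
The two calculations described above can be sketched as follows. This is a minimal illustration rather than the code used for the analysis: the function names, the dictionaries mapping gene IDs to orthogroup IDs, and the decision to skip gapped alignment columns are all assumptions made for the example.

```python
def percent_shared(genome1_orthogroups: dict[str, str],
                   genome2_orthogroups: dict[str, str]) -> float:
    """Percent of genome 1 genes whose orthogroup also has a member in genome 2.

    Each dict maps a gene ID to its orthogroup ID (hypothetical input format).
    """
    groups2 = set(genome2_orthogroups.values())
    shared = sum(1 for og in genome1_orthogroups.values() if og in groups2)
    return 100.0 * shared / len(genome1_orthogroups)


def percent_ani(alignments: list[tuple[str, str]]) -> float:
    """Percent identical nucleotides across pairwise alignments of shared genes."""
    identical = total = 0
    for seq1, seq2 in alignments:
        for a, b in zip(seq1, seq2):
            if a == "-" or b == "-":
                continue  # assumption: gapped columns are not counted
            total += 1
            identical += a == b
    return 100.0 * identical / total


# Toy data for illustration only.
g1 = {"g1_a": "OG1", "g1_b": "OG2", "g1_c": "OG3", "g1_d": "OG4"}
g2 = {"g2_a": "OG1", "g2_b": "OG2", "g2_c": "OG5"}
shared_pct = percent_shared(g1, g2)
ani_pct = percent_ani([("ACGTACGT", "ACGTACGA"), ("GGCC-A", "GGCCTA")])
print(f"shared gene content: {shared_pct:.1f}%, ANI: {ani_pct:.1f}%")
```

Because the denominator is the gene count of genome 1, percent shared gene content is not symmetric: swapping the two genomes generally gives a different value.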

Identification of fusobacterial clades. BAPS 6.0 (43, 44) was used to cluster the 26

fusobacterial genomes. We used a concatenated alignment of single copy core orthogroups

from SYNERGY as input, using the BAPS module for genetic mixture analysis of sequences or

linked loci. Using values of k ranging from 3 to 8, we identified an optimum clustering yielding 5

groupings, by examining the output log(ml) values.

Gene families. We selected the gene categories (PFAM, GO, and KEGG) most expanded

or reduced when comparing active and passive invaders by using Fisher’s test (Q<0.05). We

compared members of each species to members of all other species. We compared members

of each F. nucleatum subspecies to members of all other F. nucleatum subspecies. We also

compared a group consisting of active invaders (F. nucleatum, F. periodonticum, F. ulcerans, F.

varium) to the group of passive invaders (F. necrophorum and F. gonidiaformans). We filtered

the results, removing all gene categories where the smaller group contains greater than 150

members, as well as those gene groupings with less than a 20% difference in copy number

between the two sets.
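
The test and the two filters above can be sketched in Python as follows. This is only an illustration under stated assumptions: the 2x2 table layout (genes in or not in a category, for each of the two groups), the reading of "smaller group" as the lower of the two category counts, and the use of mean copy number per genome for the 20% threshold are all guesses, and the function names are hypothetical. The multiple-testing correction behind the Q value is not specified in the text and is not shown.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]].

    Assumed layout: a/b = genes in/not in the category for group A,
    c/d = the same counts for group B.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x category genes falling in group A.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    p = sum(q for q in map(prob, range(lo, hi + 1)) if q <= p_obs * (1 + 1e-9))
    return min(1.0, p)


def keep_category(count_a, count_b, n_genomes_a, n_genomes_b,
                  max_small=150, min_rel_diff=0.20):
    """Apply the two filters (assumed interpretation, see lead-in)."""
    if min(count_a, count_b) > max_small:
        return False
    mean_a = count_a / n_genomes_a
    mean_b = count_b / n_genomes_b
    larger = max(mean_a, mean_b)
    if larger == 0:
        return False
    return abs(mean_a - mean_b) / larger >= min_rel_diff
```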

Comparison of sequence distances for MORN2 domains. Protein alignments for 3570

individual MORN2 domains from Fusobacteriaceae and Leptotrichiaceae were extracted from

the Pfam database (30). This dataset includes 3422 domains from the Fusobacteriaceae and

148 domains from the Leptotrichiaceae (including Sebaldella, Ilyobacter, Leptotrichia, and

Streptobacillus). Sequence identities were computed between all pairwise combinations of

MORN2 domains, using the Pfam alignment. For each gene, an average sequence distance was

computed for all MORN2 domains within 1 kb, 2 kb, 5 kb, and 10 kb, and for all MORN2 domains

further away than 10 kb in the same genome. Average sequence distances were also computed

for domains located within the same proteins. The distributions of values were compared using

the paired Wilcoxon rank-sum test in R.
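
A minimal Python sketch of the per-domain averaging step is given below. It assumes that each domain is represented by a genomic coordinate and its row from the Pfam alignment, that sequence distance is one minus fractional identity over columns without gaps, and that the windows are cumulative (domains within 1 kb are also within 2 kb). None of these details are stated in the text, and the function names are hypothetical. The statistical comparison of the resulting distributions is not shown.

```python
from statistics import mean


def sequence_distance(seq_a, seq_b):
    """1 - fractional identity over columns where neither sequence has a gap."""
    columns = [(x, y) for x, y in zip(seq_a, seq_b)
               if x not in "-." and y not in "-."]
    if not columns:
        return None
    identical = sum(x == y for x, y in columns)
    return 1.0 - identical / len(columns)


def mean_distance_by_window(focal, others, windows=(1000, 2000, 5000, 10000)):
    """Average distance from one focal domain to other domains in the same genome.

    focal and each entry of others are (position_bp, aligned_sequence) tuples.
    """
    focal_pos, focal_seq = focal
    scored = []
    for pos, seq in others:
        d = sequence_distance(focal_seq, seq)
        if d is not None:
            scored.append((abs(pos - focal_pos), d))

    summary = {}
    for w in windows:
        near = [d for gap, d in scored if gap <= w]
        summary[f"within_{w}"] = mean(near) if near else None
    # Everything beyond the largest window forms the background class.
    far = [d for gap, d in scored if gap > max(windows)]
    summary[f"beyond_{max(windows)}"] = mean(far) if far else None
    return summary
```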

Co-localization of proteins containing MORN2 domains with known virulence factors.

In order to examine the co-localization of MORN2, FadA, and RadD family proteins with other

known virulence factors, we performed a computational analysis of the proximity between

members of these three expanded Pfam families, and members of all other gene families. We

performed this analysis for all GO terms, Pfam domains, proteins containing transmembrane

domains predicted by TMHMM (15), and proteins with signal peptides predicted by SignalP

(14). For each gene category, we counted all pairs of genes within a certain distance (1, 2, 5,

10, or 20 genes away) from a MORN2, fadA, or radD gene. In order to calculate an over-

representation statistic, we repeated the same analysis using 1000 randomly chosen sets of

genes of the same size as our sets of MORN2, fadA, or radD genes. For each gene category, we

compared the number of observed instances co-localized with a MORN2, fadA, or radD gene, to

the number observed in our randomly chosen gene sets. We calculated a mean and standard

deviation for our randomly chosen gene sets, and calculated a z-score that relates the number

of nearby pairings found in the test set to the number found in the control set.
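
The randomization procedure described above can be sketched in Python as follows. The sketch assumes a single contig represented as an ordered list of per-gene category sets, measures distance in gene positions, and uses the sample standard deviation. These representation choices and the function names are hypothetical and are not taken from the original analysis.

```python
import random
from statistics import mean, stdev


def count_colocalized(gene_categories, target_indices, category, window):
    """Count (target, neighbor) pairs where the neighbor carries the category.

    gene_categories is a list of sets, one per gene, in genomic order.
    A neighbor is any other gene at most `window` genes away from the target.
    """
    n = len(gene_categories)
    count = 0
    for t in target_indices:
        for j in range(max(0, t - window), min(n, t + window + 1)):
            if j != t and category in gene_categories[j]:
                count += 1
    return count


def colocalization_z(gene_categories, target_indices, category, window,
                     n_random=1000, seed=0):
    """z-score of the observed count against random gene sets of equal size."""
    rng = random.Random(seed)
    n = len(gene_categories)
    k = len(target_indices)
    observed = count_colocalized(gene_categories, target_indices, category, window)
    null = [
        count_colocalized(gene_categories, rng.sample(range(n), k), category, window)
        for _ in range(n_random)
    ]
    sd = stdev(null)
    if sd == 0:
        return None  # no variation in the random sets; z-score undefined
    return (observed - mean(null)) / sd
```

In this sketch the function would be called once per gene category and per window size (1, 2, 5, 10, or 20 genes), with `target_indices` holding the positions of the MORN2, fadA, or radD genes.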

Mobilome identification. To predict prophage, we used the PHAST (18) algorithm.

Regions with a PHAST score greater than 70 were considered as potential phage regions in our

analysis. To identify genomic island regions, we used the SigiHMM (45) and IslandPath-DIMOB

(46) algorithms implemented in IslandViewer (19). We did not use IslandViewer’s “IslandPick”

algorithm (47), because there were too few closely related genomes in the IslandPick database.

To locate IS elements, we used ISfinder (48), with an E-value cutoff of 1e-5.

Multiple alignments. In order to examine larger-scale rearrangements in our set of

finished genomes, multiple alignments were constructed using Mauve (49). We generated an

alignment using progressiveMauve for the seven finished F. nucleatum genomes. We first used

the Mauve Contig Mover to reorder and reorient the draft genome contigs (50). F. nucleatum

subsp. nucleatum 25586 was used as the reference for the Mauve Contig Mover.

Calculation of Syntenic Conservation. Synteny between genomes was quantified using

syntenic fraction, a metric described by Wapinski et al. (37). For a pair of genes in the same

orthologous cluster, syntenic fraction is the percentage of their neighbors that are orthologous

to each other. A window of 5 kb upstream and downstream of a gene was used to define its neighbors.

The syntenic fraction for a given orthogroup was calculated as the sum of the syntenic fraction

of all orthologous gene pairs in that orthogroup. Self-comparisons were not included in these

averages. An average was calculated for all orthogroups, and compared to the average for all

MORN2 orthogroups only. Rates of genomic rearrangement are highly dependent on

phylogenetic distance.
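
For a single pair of orthologous genes, the syntenic fraction might be computed along the lines of the Python sketch below. The exact normalization used in (37) is not restated here, so the choice to divide matched neighbors by the total number of neighbors of both genes is an assumption, as are the use of a single coordinate per gene and the function names.

```python
def neighbor_orthogroups(genes, focal_id, window=5000):
    """Orthogroup IDs of genes within `window` bp of the focal gene.

    genes is a list of (gene_id, position_bp, orthogroup_id) tuples for one
    genome; orthogroup_id may be None for unassigned genes.
    """
    focal_pos = next(pos for gid, pos, _ in genes if gid == focal_id)
    return [og for gid, pos, og in genes
            if gid != focal_id and abs(pos - focal_pos) <= window]


def syntenic_fraction(genes_a, id_a, genes_b, id_b, window=5000):
    """Fraction of the two genes' neighbors that have an ortholog in the
    other gene's neighborhood (assumed normalization, see lead-in)."""
    na = neighbor_orthogroups(genes_a, id_a, window)
    nb = neighbor_orthogroups(genes_b, id_b, window)
    total = len(na) + len(nb)
    if total == 0:
        return None  # neither gene has neighbors in the window
    set_a = {og for og in na if og is not None}
    set_b = {og for og in nb if og is not None}
    matched = sum(og in set_b for og in na) + sum(og in set_a for og in nb)
    return matched / total
```

Unassigned neighbors stay in the denominator in this sketch, so a gene surrounded by genome-specific genes scores lower than one surrounded by conserved neighbors.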

To compare rates between active and passive invaders, we examined syntenic

conservation between all pairwise combinations of F. gonidiaformans and F. necrophorum spp.

(average phylogenetic distance 0.19 ± 0.001), as well as between all pairwise combinations of F.

nucleatum and F. periodonticum spp. (which have a similar average phylogenetic distance of

0.14 ± 0.004). To compare overall syntenic values between pairs of genomes, we summed the

values for all orthogroups in these genomes. The distributions were compared by t-test to

obtain a p-value.

SUPPLEMENTAL REFERENCES

1. Citron DM. 2002. Update on the taxonomy and clinical aspects of the genus Fusobacterium. Clin Infect Dis 35:S22-27.

2. Munoz R, Yarza P, Ludwig W, Euzeby J, Amann R, Schleifer KH, Glockner FO, Rossello-Mora R. 2011. Release LTPs104 of the All-Species Living Tree. Syst Appl Microbiol 34:169-170.

3. Konstantinidis KT, Tiedje JM. 2005. Genomic insights that advance the species definition for prokaryotes. Proc Natl Acad Sci U S A 102:2567-2572.

4. Goris J, Konstantinidis KT, Klappenbach JA, Coenye T, Vandamme P, Tiedje JM. 2007. DNA-DNA hybridization values and their relationship to whole-genome sequence similarities. Int J Syst Evol Microbiol 57:81-91.

5. Luo Y, Liu Y, Sun D, Ojcius DM, Zhao J, Lin X, Wu D, Zhang R, Chen M, Li L, Yan J. 2011. InvA protein is a Nudix hydrolase required for infection by pathogenic Leptospira in cell lines and animals. J Biol Chem 286:36852-36863.

6. Blatch GL, Lassle M. 1999. The tetratricopeptide repeat: a structural motif mediating protein-protein interactions. Bioessays 21:932-939.

7. Sanchez-Pulido L, Devos D, Genevrois S, Vicente M, Valencia A. 2003. POTRA: a conserved domain in the FtsQ family and a class of beta-barrel outer membrane proteins. Trends Biochem Sci 28:523-526.

8. Hritonenko V, Stathopoulos C. 2007. Omptin proteins: an expanding family of outer membrane proteases in Gram-negative Enterobacteriaceae. Mol Membr Biol 24:395-406.

9. Chew J, Zilm PS, Fuss JM, Gully NJ. 2012. A proteomic investigation of Fusobacterium nucleatum alkaline-induced biofilms. BMC Microbiol 12:189.

10. Gharbia SE, Shah HN. 1991. Comparison of the amino acid uptake profile of reference and clinical isolates of Fusobacterium nucleatum subspecies. Oral Microbiol Immunol 6:264-269.

11. Cases I, de Lorenzo V, Ouzounis CA. 2003. Transcription regulation and environmental adaptation in bacteria. Trends Microbiol 11:248-253.

12. Molina N, van Nimwegen E. 2009. Scaling laws in functional genome content across prokaryotic clades and lifestyles. Trends Genet 25:243-247.

13. van Nimwegen E. 2003. Scaling laws in the functional content of genomes. Trends Genet 19:479-484.

14. Petersen TN, Brunak S, von Heijne G, Nielsen H. 2011. SignalP 4.0: discriminating signal peptides from transmembrane regions. Nat Methods 8:785-786.

15. Krogh A, Larsson B, von Heijne G, Sonnhammer EL. 2001. Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J Mol Biol 305:567-580.

16. Confer AW, Ayalew S. 2013. The OmpA family of proteins: roles in bacterial pathogenesis and immunity. Vet Microbiol 163:207-222.

17. Kapatral V, Anderson I, Ivanova N, Reznik G, Los T, Lykidis A, Bhattacharyya A, Bartman A, Gardner W, Grechkin G, Zhu L, Vasieva O, Chu L, Kogan Y, Chaga O, Goltsman E, Bernal A, Larsen N, D'Souza M, Walunas T, Pusch G, Haselkorn R, Fonstein M, Kyrpides N, Overbeek R. 2002. Genome sequence and analysis of the oral bacterium Fusobacterium nucleatum strain ATCC 25586. J Bacteriol 184:2005-2018.

18. Zhou Y, Liang Y, Lynch KH, Dennis JJ, Wishart DS. 2011. PHAST: a fast phage search tool. Nucleic Acids Res 39:W347-352.

19. Langille MG, Brinkman FS. 2009. IslandViewer: an integrated interface for computational identification and visualization of genomic islands. Bioinformatics 25:664-665.

20. Benedetti P, Rassu M, Pavan G, Sefton A, Pellizzer G. 2011. Septic shock, pneumonia, and soft tissue infection due to Myroides odoratimimus: report of a case and review of Myroides infections. Infection 39:161-165.

21. Greub G. 2009. Parachlamydia acanthamoebae, an emerging agent of pneumonia. Clin Microbiol Infect 15:18-28.

22. Darling AE, Miklos I, Ragan MA. 2008. Dynamics of genome rearrangement in bacterial populations. PLoS Genet 4:e1000128.

23. Lara-Ramirez EE, Segura-Cabrera A, Guo X, Yu G, Garcia-Perez CA, Rodriguez-Perez MA. 2011. New implications on genomic adaptation derived from the Helicobacter pylori genome comparison. PLoS One 6:e17300.

24. Aras RA, Kang J, Tschumi AI, Harasaki Y, Blaser MJ. 2003. Extensive repetitive DNA facilitates prokaryotic genome plasticity. Proc Natl Acad Sci U S A 100:13579-13584.

25. Lennon NJ, Lintner RE, Anderson S, Alvarez P, Barry A, Brockman W, Daza R, Erlich RL, Giannoukos G, Green L, Hollinger A, Hoover CA, Jaffe DB, Juhn F, McCarthy D, Perrin D, Ponchner K, Powers TL, Rizzolo K, Robbins D, Ryan E, Russ C, Sparrow T, Stalker J, Steelman S, Weiand M, Zimmer A, Henn MR, Nusbaum C, Nicol R. 2010. A scalable, fully automated process for construction of sequence-ready barcoded libraries for 454. Genome Biol 11:R15.

26. Gordon D, Abajian C, Green P. 1998. Consed: a graphical tool for sequence finishing. Genome Res 8:195-202.

27. Hyatt D, Chen GL, Locascio PF, Land ML, Larimer FW, Hauser LJ. 2010. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics 11:119.

28. Lowe TM, Eddy SR. 1997. tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res 25:955-964.

29. Lagesen K, Hallin P, Rodland EA, Staerfeldt HH, Rognes T, Ussery DW. 2007. RNAmmer: consistent and rapid annotation of ribosomal RNA genes. Nucleic Acids Res 35:3100-3108.

30. Finn RD, Tate J, Mistry J, Coggill PC, Sammut SJ, Hotz HR, Ceric G, Forslund K, Eddy SR, Sonnhammer EL, Bateman A. 2008. The Pfam protein families database. Nucleic Acids Res 36:D281-288.

31. Haft DH, Loftus BJ, Richardson DL, Yang F, Eisen JA, Paulsen IT, White O. 2001. TIGRFAMs: a protein family resource for the functional identification of proteins. Nucleic Acids Res 29:41-43.

32. Ogata H, Goto S, Sato K, Fujibuchi W, Bono H, Kanehisa M. 1999. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res 27:29-34.

33. Tatusov RL, Koonin EV, Lipman DJ. 1997. A genomic perspective on protein families. Science 278:631-637.

34. Conesa A, Gotz S, Garcia-Gomez JM, Terol J, Talon M, Robles M. 2005. Blast2GO: a universal tool for annotation, visualization and analysis in functional genomics research. Bioinformatics 21:3674-3676.

35. Tian W, Arakaki AK, Skolnick J. 2004. EFICAz: a comprehensive approach for accurate genome-scale enzyme function inference. Nucleic Acids Res 32:6226-6239.

36. Griggs A, Wapinski I, Wortman J, Haas B. 2014. SYNERGY2: Accurate and scalable ortholog identification. In preparation.

37. Wapinski I, Pfeffer A, Friedman N, Regev A. 2007. Automatic genome-wide reconstruction of phylogenetic gene trees. Bioinformatics 23:i549-558.

38. Wapinski I, Pfeffer A, Friedman N, Regev A. 2007. Natural history and evolutionary principles of gene duplication in fungi. Nature 449:54-61.

39. Stamatakis A. 2006. RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics 22:2688-2690.

40. Thompson JD, Higgins DG, Gibson TJ. 1994. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 22:4673-4680.

41. Felsenstein J. 1989. PHYLIP -- Phylogeny Inference Package (Version 3.2). Cladistics 5:164-166.

42. Li L, Stoeckert CJ, Jr., Roos DS. 2003. OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Res 13:2178-2189.

43. Corander J, Marttinen P, Siren J, Tang J. 2008. Enhanced Bayesian modelling in BAPS software for learning genetic structures of populations. BMC Bioinformatics 9:539.

44. Corander J, Tang J. 2007. Bayesian analysis of population structure based on linked molecular information. Math Biosci 205:19-31.

45. Waack S, Keller O, Asper R, Brodag T, Damm C, Fricke WF, Surovcik K, Meinicke P, Merkl R. 2006. Score-based prediction of genomic islands in prokaryotic genomes using hidden Markov models. BMC Bioinformatics 7:142.

46. Hsiao W, Wan I, Jones SJ, Brinkman FS. 2003. IslandPath: aiding detection of genomic islands in prokaryotes. Bioinformatics 19:418-420.

47. Langille MG, Hsiao WW, Brinkman FS. 2008. Evaluation of genomic island predictors using a comparative genomics approach. BMC Bioinformatics 9:329.

48. Siguier P, Perochon J, Lestrade L, Mahillon J, Chandler M. 2006. ISfinder: the reference centre for bacterial insertion sequences. Nucleic Acids Res 34:D32-36.

49. Darling AE, Mau B, Perna NT. 2010. progressiveMauve: multiple genome alignment with gene gain, loss and rearrangement. PLoS One 5:e11147.

50. Rissman AI, Mau B, Biehl BS, Darling AE, Glasner JD, Perna NT. 2009. Reordering contigs of draft genomes using the Mauve aligner. Bioinformatics 25:2071-2073.
