Upload
others
View
6
Download
0
Embed Size (px)
Citation preview
Novel Phylogenetic Approaches to Problems in
Microbial Genomics
by
Lawrence A. David
Submitted to the Computational & Systems Biology Initiativein partial fulfillment of the requirements for the degree of
PhD
at the
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
September 2010
c� Massachusetts Institute of Technology 2010. All rights reserved.
Author . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .Computational & Systems Biology Initiative
August 26th, 2010
Certified by. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .Eric J. Alm
Assistant ProfessorThesis Supervisor
Accepted by . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .Chris Burge
Chair, Ph.D. Graduate Committee
Report Documentation Page Form ApprovedOMB No. 0704-0188
Public reporting burden for the collection of information is estimated to average 1 hour per response, including the time for reviewing instructions, searching existing data sources, gathering andmaintaining the data needed, and completing and reviewing the collection of information. Send comments regarding this burden estimate or any other aspect of this collection of information,including suggestions for reducing this burden, to Washington Headquarters Services, Directorate for Information Operations and Reports, 1215 Jefferson Davis Highway, Suite 1204, ArlingtonVA 22202-4302. Respondents should be aware that notwithstanding any other provision of law, no person shall be subject to a penalty for failing to comply with a collection of information if itdoes not display a currently valid OMB control number.
1. REPORT DATE SEP 2010 2. REPORT TYPE
3. DATES COVERED 00-00-2010 to 00-00-2010
4. TITLE AND SUBTITLE Novel Phylogenetic Approaches to Problems in Microbial Genomics
5a. CONTRACT NUMBER
5b. GRANT NUMBER
5c. PROGRAM ELEMENT NUMBER
6. AUTHOR(S) 5d. PROJECT NUMBER
5e. TASK NUMBER
5f. WORK UNIT NUMBER
7. PERFORMING ORGANIZATION NAME(S) AND ADDRESS(ES) Massachusetts Institute of Technology,77 Massachusetts Avenue,Cambridge,MA,02139
8. PERFORMING ORGANIZATIONREPORT NUMBER
9. SPONSORING/MONITORING AGENCY NAME(S) AND ADDRESS(ES) 10. SPONSOR/MONITOR’S ACRONYM(S)
11. SPONSOR/MONITOR’S REPORT NUMBER(S)
12. DISTRIBUTION/AVAILABILITY STATEMENT Approved for public release; distribution unlimited
13. SUPPLEMENTARY NOTES
14. ABSTRACT Present day microbial genomes are the handiwork of over 3 billion years of evolution. Comparisonsbetween these genomes enable stepping backwards through past evolutionary events, and can beformalized using binary tree models known as phylogenies. In this thesis, I present three new phylogeneticmethods for gaining insight into how microbes evolve. In Chapter 1, I introduce the algorithm AdaptML,which uses strain ecology information to identify genetically- and ecologically-distinct bacterialpopulations. Analysis of 1000 marine Vibrionaceae strains by AdaptML finds evidence that nicheadaptation may influence patterns of genetic differentiation in bacteria. In Chapter 2, I introduce thealgorithm AnGST, which can infer the evolutionary history of a gene family in a chronological context.Analysis of 3968 gene families drawn from 100 modern day organisms with AnGST reveals genomicevidence for a massive expansion in microbial genetic diversity during the Archean eon and the gradualoxygenation of the biosphere over the past 3 billion years. Lastly, I introduce in Chapter 3 the algorithmGAnG, which can construct prokaryotic species trees from thousands of distinct gene trees. GAnG analysisof archaeal gene trees supports hypotheses that the Nanoarchaeota diverged from the last ancestor of theArchaea prior to the Crenarchaeota/Euryarchaeota split.
15. SUBJECT TERMS
16. SECURITY CLASSIFICATION OF: 17. LIMITATION OF ABSTRACT Same as
Report (SAR)
18. NUMBEROF PAGES
126
19a. NAME OFRESPONSIBLE PERSON
a. REPORT unclassified
b. ABSTRACT unclassified
c. THIS PAGE unclassified
Standard Form 298 (Rev. 8-98) Prescribed by ANSI Std Z39-18
Novel Phylogenetic Approaches to Problems in Microbial
Genomics
by
Lawrence A. David
Submitted to the Computational & Systems Biology Initiativeon August 26th, 2010, in partial fulfillment of the
requirements for the degree ofPhD
Abstract
Present day microbial genomes are the handiwork of over 3 billion years of evolution.Comparisons between these genomes enable stepping backwards through past evolu-tionary events, and can be formalized using binary tree models known as phylogenies.In this thesis, I present three new phylogenetic methods for gaining insight into howmicrobes evolve. In Chapter 1, I introduce the algorithm AdaptML, which uses strainecology information to identify genetically- and ecologically-distinct bacterial popu-lations. Analysis of 1000 marine Vibrionaceae strains by AdaptML finds evidencethat niche adaptation may influence patterns of genetic differentiation in bacteria.In Chapter 2, I introduce the algorithm AnGST, which can infer the evolutionaryhistory of a gene family in a chronological context. Analysis of 3968 gene familiesdrawn from 100 modern day organisms with AnGST reveals genomic evidence fora massive expansion in microbial genetic diversity during the Archean eon and thegradual oxygenation of the biosphere over the past 3 billion years. Lastly, I intro-duce in Chapter 3 the algorithm GAnG, which can construct prokaryotic species treesfrom thousands of distinct gene trees. GAnG analysis of archaeal gene trees supportshypotheses that the Nanoarchaeota diverged from the last ancestor of the Archaeaprior to the Crenarchaeota/Euryarchaeota split.
Thesis Supervisor: Eric J. AlmTitle: Assistant Professor
2
Acknowledgments
Funding
I am deeply grateful for the research support provided by:
• National Defense Science & Engineering Graduate Fellowship (DoD)
• Whitaker Health Sciences Fund Fellowship (MIT)
Personal
Many people have deliberately or unwittingly helped me complete my doctoral work.My words of gratitude below are only a small installment against a lifetime of welcomedebt.
Foremost, I would like to thank my advisor, Eric Alm, who gave me the freedomto work on the problems I found interesting and the scientific training to solve them.His unique enthusiasm over these past years has helped fuel my research and providedme with personal joys that I will always cherish.
I have been also remarkably fortunate to have joined a group of wonderful col-leagues here at MIT. Mark Smith, Chris Smillie, Greg Fournier, Sarah Preheim, InesBaptista, Matt Blackburn, Yonatan Friedman, Alexandra Konings, Claudia Bauer,and Albert Wang have been a source of thoughtful discussion and much needed lev-ity. In particular, the old guard of Arne Materna, Sean Clarke, Sonia Timberlake,and Jesse Shapiro is enshrined in some of my fondest memories of graduate school.Classmates in my department, especially Robin Friedman, Charles Lin, and MichelleChan, have been inspirational young-scientists who I have looked to for advice andmotivation.
A host of caring individuals enabled me to reach the closing stages of my graduatecareer. Sheila Frankel, Darlene Strother, Darlene Ray, James Long, and Bonnie LeeWhang supplied administrative and moral support. My graduate committee, whosemembership has included Ed DeLong, Drew Endy, Manolis Kellis, Dianne Newman,Howard Ochman, (and unofficially Martin Polz), generously set aside time to provideme with professional advice and letters of recommendation. [greg] ensured that mycluster jobs ran smoothly and provided crucial UNIX humor.
I also acknowledge my friends and family, who conspired to make graduate schoolthe shortest five years of my life. Within the Parsons laboratory, David Gonzalez-Rodriguez, Piyatida Hoisungwan, Marcos, and Crystal Ng, unfailingly made me ex-cited to come to lab each morning. MacGregor’s F-entry has been my adopted un-dergraduate family, and I hold dear my memories of living among them. My sisterhas continuously given me her love, and my father and mother have selflessly workedto give me the opportunities I presently enjoy. Finally, my wife Christina has simplybeen the nicest person I’ve ever met.
3
Contents
I Introduction 5
II Contributions 13
1 Resource partitioning and sympatric differentiation among closely
related bacterioplankton 15
2 Rapid evolutionary innovation during an Archean Genetic Expan-
sion 51
3 Building prokaryotic species trees from thousands of gene trees 92
III Conclusion 113
IV Bibliography 116
4
Part I
Introduction
5
Overview
Present day microbial genomes are the handiwork of over 3 billion years of evolution.
Comparisons between these genomes enable stepping backwards through evolutionary
history – genetic features present across a wide diversity of genomes likely arose
more anciently than features found in subsets of related genomes. This intuition is
formalized using binary tree models of sequence evolution known as phylogenetic trees.
Phylogenies propose a series of ancestral sequence divergence events that explain the
similarity of extant sequences. These trees can in turn be used to build models for
the evolution of organismal phenotypes, such as preferred environment or lifestyle.
However, prokaryotes’ capacity for horizontal gene transfer (HGT) can require regions
of the same genome to be associated with different phylogenetic trees, and ultimately
obscure which phylogenetic tree best represents overall genome evolution.
In this thesis, I present three novel phylogenetic approaches for inferring micro-
bial evolutionary history through the comparison of gene sequences. The remainder
of Part I briefly describes the research context in which I developed: AdaptML, an
algorithm for detecting signatures of ecological adaptation influencing bacterial ge-
netic differentiation; and AnGST, an algorithm for inferring the series of HGT, gene
duplication, and gene loss events that gave rise to a gene family. I go on in Chapter 1
of Part II to use AdaptML to identify genetically- and ecologically-distinct clusters of
Vibrionaceae coexisting in a marine environment. In Chapter 2, I use AnGST to infer
patterns in microbial genome evolution over the past 3.8 billion years. I use AnGST
again in Chapter 3 in the development of GAnG, a new method for constructing
prokaryotic species trees from thousands of gene trees. I conclude this thesis in Part
III with a summary of the chapters and a brief discussion of ongoing and future work
with AdaptML, AnGST, and GAnG.
6
Detecting relationships between genetic and ecolog-
ical differentiation in bacteria
Distinct groups of closely-related bacteria, or phylogenetic clusters, are a recurring
pattern of genetic differentiation among bacterial isolate housekeeping genes [1–3].
Ecological adaptation is suspected of playing a role in cluster formation [4]. Ac-
cording to the ecotype model, genetically-distinct bacterial populations form when
bacterial populations adapt to an ecological niche and are repeatedly purged of ge-
netic variation through periodic selection events [5]. However, a theoretical study
has shown that genetically distinct sub-populations can form under a neutral model
that either prohibits recombination, or simulates high within-cluster recombination
[6]. Alternatively, a recent phylogenetic analysis of eight sequenced Vibrio isolates
has found evidence for a combined model featuring both ecological adaptation and
neutral processes contributing to genetic differentiation. Under this model, the intro-
duction of niche-adaptive alleles initially erodes sympatry in a bacterial population.
Reduced gene flow between niche-adapted bacteria and the remaining population
subsequently yields genetically-distinct subgroups [7]. Ultimately, if niche adapta-
tion drives the formation of genetically-distinct bacterial groups, members of each
group should inhabit a common niche. Mathematical models capable of identifying
both genetically- and ecologically-cohesive bacterial groups can thus be used to help
resolve the role of ecological adaptation in the genetic differentiation of bacteria.
Several existing statistical methods, such as the Fst test, the P test, and Unifrac,
can evaluate the null hypothesis that phylogenetic clusters do not exhibit distinct
ecological associations. These tests assume that bacterial sequences are annotated
with ecological metadata describing the environment each sequence was harvested
from. The Fst test compares the genetic diversity among bacteria annotated as
sharing the same environment to the genetic diversity measured across all sampled
sequences. Low genetic diversity within a particular environment, coupled with high
genetic diversity between environments, is evidence for rejection of the null hypothesis
of no association between genetic clustering and bacterial ecology [8]. Alternatively,
7
the P test builds a phylogeny of strain sequences, labels leaves by their environmental
association, and uses a parsimony model to infer the number of times ancestral strains
on the tree changed environmental associations. Low parsimony scores are evidence
for rejecting the null hypothesis [8]. Lastly, the Unifrac model combines elements of
both the Fst and P test, utilizing genetic distances and strain tree topology to test
the relationship between strain genetic clustering and associated environment [9].
One weakness, however, of the Fst, P, and Unifrac statistics is their potential for
erroneously reporting no association between genetic clustering and ecology when the
ecological forces driving cluster formation are unmeasured or improperly annotated.
For example, consider a bacterial sequence cluster caused by adaptation to conditions
between 20◦-30◦C. An association between this cluster and ecology would go unrec-
ognized by Fst, P, or Unifrac analyses if temperature data was not collected, or if
temperature data were discretized into only two ranges: < 25◦C and ≥25◦C. Thus,
these statistics may not be appropriate for analyzing the evolution of bacteria whose
niche composition is unknown or highly uncertain, as environmental parameters de-
scribing these bacteria’s niche may not have been measured.
Another inference algorithm, Ecotype Simulation (ES), can identify genetically-
and ecologically-distinct clusters in a manner insensitive to how ecological parameters
are measured [10]. ES finds ecotypes by fitting a maximum likelihood model onto a
gene phylogeny. This model estimates the rates of ecotype formation, periodic se-
lection, and genetic drift, as well as the total number of ecotypes present. Identified
ecotypes can subsequently be analyzed using ecological measurements and multivari-
ate statistics in order to confirm that niche-adaptation has taken place and identify
environmental parameters that define the niche. Recent application of this approach
discovered ecotypes among Bacillus strains sampled from Death Valley, CA, which
could be distinguished by adaptation to solar exposure and soil texture [11]. One
drawback to the ES algorithm, however, is that it cannot detect nascent ecotype
formation events that have not yet undergone multiple series of periodic selections.
In Chapter 1, I present a new method named AdaptML, which uses a maxi-
mum likelihood model to identify genetically- and ecologically-coherent clusters of
8
bacterial strains. This model explicitly combines genetic information embedded in
sequence-based phylogenies with environmental sampling data. Recent niche adap-
tation events, characterized by ecologically coherent clusters with minimal genetic
distinction from a parent clade, can be captured by the model. Although AdaptML
cannot detect ecological associations with unmeasured environmental parameters, the
algorithm can account for environmental parameter discretization schemes that would
generally confound previous methods for detecting ecological associations. To do this,
I introduce the model concept of a “habitat.” Habitats are characterized by discrete
probability distributions describing the likelihood that a strain adapted to a habitat
will be sampled from a given ecological state (e.g. at a particular location in an estu-
ary). Habitats are not defined a priori but rather learned directly from the sequence
and ecological data using an Expectation Maximization routine. Once habitats are
defined, I learn a maximum likelihood model for the evolution of habitat association
on the tree. Randomization experiments can be used to determine which sequence
clusters show a statistically-significant association with a given habitat.
9
Inferring the evolutionary history of microbial gene
families
Microbial genomes do not evolve solely by point mutation [12, 13]. Comparison of
gamma-proteobacterial genomes suggests gene loss events eliminated thousands of
genes from the ancestor of the Buchnera following its adoption of an endosymbiotic
lifestyle [14]. Genes can be gained, via either the duplication of small regions of the
genome [15], or via the duplication of the entire genome itself, as has been shown for
yeast [16]. Gene gain is also possible via HGT and is a well-known source of genomic
diversity among the prokaryotes [12]. Cases of HGT have also been identified between
eukaryotes [17, 18] and even from bacteria to animals [19, 20]. Models that can infer
when genes have undergone loss, duplication, or HGT, and when genes have been
vertically inherited, are necessary for understanding the relative contribution of these
four mechanisms to genome evolution.
Algorithms for inferring the evolutionary history of gene families vary according to
their reliance on phylogenetic models and how they account for gene gain events (Ta-
ble 1). Phylogeny-free methods utilize features such as GC-bias to detect xenologous
genes [21, 22], or within-genome BLAST searches to find evidence for past duplication
events [23]. More complex approaches, known as presence-absence models, construct
a phylogeny of sampled species and identify which leaves on the tree are represented
in a gene family of interest. Parsimony algorithms can then be used to identify a set of
ancestral gene duplication, gene loss, or HGT events to explain the observed pattern
of gene presence and absence on the species tree [24–27]. However, presence/absence
algorithms may underestimate the amount of HGT in a gene family history, since
frequent HGT events can produce presence/absence patterns similar to those caused
by gene birth at a deep node, followed by vertical descent. More sensitive models
capable of differentiating between these scenarios utilize gene sequence information,
in addition to a species tree. Quartet methods quantify how strongly quartets of
orthologous genes support each of the three possible 4-taxon trees representing their
evolutionary history [28, 29]. Quartets that strongly support topologies discordant
10
ModelSpecies
treeGenetree
Unc.genetrees
FindsHGT
Findsdup.
Refs.
GC-bias No No - Yes No [21, 22]BLAST-hits No No - Yes Yes [23]Presence/absence Yes No - Yes Yes [24–27]Quartet mapping Partial Partial Yes Yes No [28, 29]Parsimony recon-ciliation
Yes Yes No Yes Yes [30, 31]
Probabilistic rec-onciliation
Yes Yes Yes Yes No [33, 34]
Table 1: Selection of existing models used to infer gene family evolutionaryhistories: Models are characterized by their explicit usage of species trees and gene trees,their consideration of gene tree uncertainty, and their ability to detect HGT and dupli-cation events. Note that only References [27, 31] can find HGT and duplication eventssimultaneously.
with the expected species tree are evidence for HGT within the gene family. More
elaborate “reconciliation” models compare full gene and species trees in order to in-
fer a precise phylogenetic location for each inferred evolutionary event. Parsimony
reconciliation models [30, 31], however, will infer spurious events if phylogenetic con-
struction errors are present in the gene tree [32]. Newer probabilistic reconciliation
algorithms have been developed to deal with these potential inaccuracies [33, 34].
Gene family evolutionary history models can also be partitioned according to
whether they account for gene gain using duplication or HGT events. With the
exception of Snel and Charleston’s algorithms [27, 31], evolutionary history models
usually account for only one of these two events. The specificity of these models may
be caused by self-reinforcing biases associated with the expected modes of eukaryotic
and prokaryotic genome evolution. The relative rarity of reported HGT events among
eukaryotes, compared to duplication events, likely encourages analyses of eukaryotic
genome evolution using tools specialized only to detect gene duplications. By contrast,
recognition of how HGT can accelerate prokaryotic adaptation and blur species lines
has probably reduced interest in broad surveys of potential prokaryotic duplication.
Exceptions to this proposed bias among prokaryotic studies do exist, however, as
11
Gevers et al. cataloged gene duplications in 106 bacterial genomes [23] and Snel
searched for both HGT and duplication among 17 archaeal and bacterial genomes
[27]. Increasing examples of HGT among eukaryotes [17, 18] are also fueling new
interest in systematically searching for HGT across the eukaryotes [35]. Bias against
the creation of models that account for both HGT and duplication is also likely due
to issues of model complexity. In certain scenarios, gene duplication and gene loss
can produce gene tree topologies similar to those yielded by HGT [36]. A combined
HGT/duplication inference model must be capable of recognizing this scenario and
proposing plausible HGT and duplication scenarios. Moreover, a combined model
requires defining a metric to choose which of these scenarios is preferable.
In Chapter 2, I present a new reconciliation method for inferring a set of gene loss,
gene duplication, and HGT events that explain topological incongruities between a
species tree and a gene tree. I named this algorithm the Analyzer of Gene & Species
Trees, or AnGST. AnGST was inspired by a gene family evolution model originally
designed for problems in biogeography and the inference of gene duplication and gene
loss events [30]. Also referred to as a host-parasite model, this approach seeks to infer
which ancestral genome (the host) on the reference tree possessed each ancestral gene
copy (the parasite). AnGST employs a generalized parsimony framework in order to
choose when duplication events should be inferred instead of HGT scenarios. This
framework assigns scores to each type of evolution event and returns the evolutionary
history with the lowest overall score. AnGST can further minimize reconciliation
scores by reconciling multiple gene tree bootstraps simultaneously and combining
their lowest scoring subtrees into a single chimeric gene tree. This bootstrap amalga-
mation step reduces the opportunity for poorly resolved gene tree subtrees to cause
the spurious inference of evolutionary events.
12
Part II
Contributions
13
Chapter 1
Resource partitioning andsympatric differentiation amongclosely related bacterioplankton
Dana E. Hunt*, Lawrence A. David*, Dirk Gevers, Sarah P. Preheim,Eric J. Alm, Martin F. Polz
*These authors contributed equally to this work.
This chapter is presented as it originally appeared in Science 320, 1081 (2008).Corresponding Supplementary Material is appended.
Chapter 1
Resource partitioning and
sympatric differentiation among
closely related bacterioplankton
Identifying ecologically differentiated populations within complex micro-
bial communities remains challenging, yet is critical for interpreting the
evolution and ecology of microbes in the wild. Here we describe spatial
and temporal resource partitioning among Vibrionaceae strains coexist-
ing in coastal bacterioplankton. A quantitative model (AdaptML) estab-
lishes the evolutionary history of ecological differentiation, thus revealing
populations specific for seasons and life-styles (combinations of free-living,
particle, or zooplankton associations). These ecological population bound-
aries frequently occur at deep phylogenetic levels (consistent with named
species); however, recent and perhaps ongoing adaptive radiation is evi-
dent in Vibrio splendidus, which comprises numerous ecologically distinct
populations at different levels of phylogenetic differentiation. Thus, envi-
ronmental specialization may be an important correlate or even trigger of
speciation among sympatric microbes.
15
Microbes dominate biomass and control biogeochemical cycling in the ocean, but
we know little about the mechanisms and dynamics of their functional differentiation
in the environment. Culture-independent analysis typically reveals vast microbial di-
versity, and although some taxa and gene families are differentially distributed among
environments [37, 38], it is not clear to what extent coexisting genotypic diversity can
be divided into functionally cohesive populations [37, 39]. First, we lack broad surveys
of nonpathogenic free-living bacteria that establish robust associations of individual
strains with spatiotemporal conditions [40, 41]; second, it remains controversial what
level of genetic diversification reflects ecological differentiation. Phylogenetic clus-
ters have been proposed to correspond to ecological populations that arise by neutral
diversification after niche-specific selective sweeps [5]. Clusters are indeed observed
among closely related isolates (e.g., when examined by multilocus sequence analysis)
[4] and in culture-independent analyses of coastal bacterioplankton [42]. Yet recent
theoretical studies suggest that clusters can result from neutral evolution alone [6],
and evidence for clusters as ecologically distinct populations remains sparse, having
been most conclusively demonstrated for cyanobacteria along ocean-scale gradients
[43] and in a depth profile of a microbial mat [44]). Further, horizontal gene transfer
(HGT) may erode the ecological cohesion of clusters if adaptive genes are transferred
[45], and recombination can homogenize genes between ecologically distinct popu-
lations [46]. Thus, exploring the relationship between phylogenetic and ecological
differentiation is a critical step toward understanding the evolutionary mechanisms
of bacterial speciation [6].
In this study, we investigated ecological differentiation by spatial and temporal
resource partitioning in coastal waters among coexisting bacteria of the family Vibri-
onaceae, which are ubiquitous, metabolically versatile heterotrophs [47]. The coastal
ocean is well suited to test population-level effects of microhabitat preferences, be-
cause tidal mixing and oceanic circulation ensure a high probability of migration, re-
ducing biogeographic effects on population structure. In the plankton, heterotrophs
may adopt alternate ecological strategies: exploiting either the generally lower concen-
tration but more evenly distributed dissolved nutrients or attaching to and degrading
16
small suspended organic particles, originating from algal exopolysaccharides and de-
tritus [39]. Bacterial microhabitat preferences may develop because resources are
distributed on the same scale as the dispersal range of individuals, due to turbulent
mixing and active motility [48]. Of potential microhabitats, particles represent abun-
dant but relatively short-lived resources, as labile components are rapidly utilized (on
time scales of hours to days) [49, 50], implying that particle colonization is a dynamic
process. Moreover, particulate matter may change composition with macroecologi-
cal conditions (such as seasonal algal blooms). Zooplankton provide additional, more
stable microhabitats; vibrios attach to and metabolize chitinous zooplankton exoskele-
tons [51, 52] but may also live in the gut or occupy niches specific to pathogens. The
extent to which microenvironmental preferences contribute to resource partitioning in
this complex ecological landscape remains an important question in microbial ecology
[53].
We aimed to conservatively identify ecologically coherent groups by examining
distribution patterns of Vibrionaceae genotypes among free- living and associated
(with suspended particles and zooplankton) compartments of the planktonic environ-
ment under different macroecological conditions (spring and fall) (Figs. 1.3 & 1.5).
Because the level of genetic differentiation at which ecological preferences develop is
not known, we focused on a range of relationships (0 to 10% small subunit riboso-
mal RNA (rRNA) divergence) among co-occurring vibrios [54]. Particle-associated
and free-living cells were separated into four consecutive size fractions by sequential
filtration (four replicate water samples, each subsampled with at least four replicate
filters per size fraction); each fraction contained organisms and dead organic material
of different origins (detailed in the supporting online material [SOM]; Section 1.2).
For simplicity, we refer to these fractions as enriched in zooplankton (≥63 mm), in
large (5 to 63 mm) and small (1 to 5 mm) particles, and in free-living cells (0.22 to 1
mm) (Fig. 1.5B). The 1- to 5-mm size fraction was somewhat ambiguous, probably
containing small particles as well as large or dividing cells; however, it provided a firm
buffer between obviously particle-associated (>5 mm) and free-living (<1 mm) cells.
Vibrionaceae strains were isolated by plating filters on selective media, previously
17
shown by quantitative polymerase chain reaction to yield good correspondence be-
tween genotypes recovered in culture and those present in environmental samples [54].
Roughly 1000 isolates were characterized by partial sequencing of a protein-coding
gene (hsp60 ). To obtain added resolution, between one and three additional gene
fragments (mdh, adk, and pgi) were sequenced for over half of the isolates (SOM),
including V. splendidus strains, the most abundant group [54].
Our rationale for testing environmental associations grows out of the following
considerations. First, as in most ecological sampling, the true habitats or niches
are unknown and can only be observed as projections onto the sampling dimensions
(“projected habitats”). Thus, associations can be detected as distinct distributions
of groups of strains if habitats/niches are differentially apportioned among samples.
Second, the lack of an accepted microbial species concept implies that it is imprudent
to use any measure of genetic relationships to define a priori the populations whose
environmental association should be assessed. Therefore, we first tested the null
hypothesis that there is no environmental association across the phylogeny of the
strains. We then refined such estimates by developing a new model to simultaneously
identify populations and their projected habitats. Finally, these model-based results
were tested with nonparametric empirical statistics.
The initial null hypothesis of no association between phylogeny and ecology is
strongly rejected (seasons: p < 10−79; size fractions: p < 10−49) by comparing the
parsimony score of observed environments on the tree to that expected by chance [55]
(SOM), confirming the visual impression of differential patterns of clustering among
seasons and size fractions (Fig. 1.1A). This result is robust toward uncertainty in the
phylogeny, which should diminish but not strengthen associations, and is confirmed
by introducing additional uncertainty in the phylogeny (Fig. 1.6). The observed
overall association with season and size fraction therefore suggests that water-column
vibrios partition resources, but neither provides insights into the phylogenetic bounds
of populations or the composition of their habitats.
We therefore developed an evolutionary model (AdaptML) to identify popula-
tions as groups of related strains sharing a common projected habitat, which reflects
18
their relative abundance in the measured environmental categories (size fractions and
seasons) (SOM). In practice, the model inputs are the phylogeny, season, and size
fraction of the strains. It then maps changes in environmental preference onto the
tree by predicting projected habitats for each extant and ancestral strain in the phy-
logeny. Although similar in spirit to existing parsimony, likelihood, and Bayesian
methods, which map ancestral states onto trees [56], the model accounts for the com-
plexities and uncertainties of environmental sampling. First, projected habitats can
span multiple sampling dimensions to account for complex life cycles (such as time
spent in multiple true habitats) and problems inherent in environmental sampling:
Discrete samples rarely equate to true habitats, and true habitats are frequently mis-
placed among their typical sample categories (for example, zooplankton fragments
may also be found in smaller size fractions). Second, projected habitats can span
multiple phylogenetic clusters to allow for the possibility that clusters may arise neu-
trally or that the relevant parameters differentiating them ecologically have not been
measured.
Briefly, AdaptML builds a hidden Markov model for the evolution of habitat
associations: Adjacent nodes on the phylogeny transition between habitats according
to a probability function that is dependent on branch length and a transition rate,
which is learned from the data (SOM) (Fig. 1.7). Subsequently, we optimize the
model parameters (the transition rate and the composition of each projected habitat)
to maximize the likelihood of the observed data. Finally, we use a simple ad hoc rule
for reducing noninformative parameters: We merge habitats that converge to similar
distributions (simple correlation of distribution vectors >90%) during the model-
fitting procedure (SOM). This reproducibly identified six nonredundant habitats for
the observed data set (HA to HF in Figs. 1.1B and 1.9). Moreover, the algorithm acts
conservatively, as suggested by two tests. First, the model did not overfit the data
when there was no ecological signal present: When the environments were shuffled,
only a single generalist habitat (evenly distributed over all size fractions and seasons)
was recovered. Second, when simulated habitats were used to generate environmental
assignments, the model usually identified a number of habitats equal to or less than
19
the true number present (Fig. 1.10).
The analysis suggests that a single bacterial family coexisting in the water column
resolves into a striking number of ecologically distinct populations with clearly iden-
tifiable preferences (habitats). The algorithm identified 25 populations, associated
with one of the six habitats defined by distinct distributions of isolates over seasons
and size fractions (Fig. 1.1 and Fig. 1.11). Most clusters have a strong seasonal
signal; interestingly, two pairs of highly similar habitats are observed in both seasons
(Fig. 1.1B). The first of the habitat pairs corresponds to populations occurring both
free-living and on particles but lacking zooplankton-associated isolates (HB and HC);
the second indicates a preference for zooplankton and large particles (HE and HF )
(Fig. 1.1B). The remaining two habitats were season-specific. Habitat HA combines
all primarily free-living populations in the fall, whereas habitat HD identifies a second
particle- and zooplankton-associated group in spring, but unlike HE and HF it has a
higher proportion of large particles and maps onto a single small group (G25) (Fig.
1.1). However, we cannot place high confidence in the absence of the free-living habi-
tat in the spring, because relatively few strains were recovered from that fraction.
Moreover, the distribution of individual populations among seasons and size frac-
tions varies considerably, with remarkably narrow preferences for some populations
whereas others are more broadly distributed. For example, V. ordalii (G3) is almost
exclusively free-living in both seasons, whereas V. alginolyticus (G5) has a significant
representation in both zooplankton and free-living size fractions but occurs exclu-
sively in the fall (Fig. 1.1, A and B). The sequences of three additional genes for V.
alginolyticus isolates were identical, arguing against misidentification due to recombi-
nation or additional population substructuring. Similarly, there was good agreement
when two different gene phylogenies (hsp60 and mdh) were used to identify habitats
for V. splendidus (Fig. 1.12), although fewer habitats were identified using the mdh
tree, most likely because it is less well-resolved. Overall, across all vibrios sampled,
association with the zooplankton-enriched and free-living fractions dominated, and
although several populations contain particle-associated isolates, only a few appear
to be specifically particle-adapted. Because vibrios are generally regarded as particle
20
and zooplankton specialists [47], this observed partitioning offers new insight into
their ecology.
Thus, in spite of the highly variable conditions of the water column, popula-
tions appear to finely partition resources, especially because our habitat estimates
are conservative, as clusters occupying the same habitat may be differentiated along
additional (unobserved) resource axes. For example, different zooplankton-associated
groups may be host- or body region-specific, and the strong seasonal signal of most
clusters may be due to a variety of factors; however, temperature is a likely candi-
date because it has so far arisen as the strongest correlate of microbial population
changes both over a seasonal cycle [57] and along ocean-scale gradients [43]. Fi-
nally, populations, which appear unassociated in our study, may be true generalists
with respect to the resource space sampled or may be adapted to environments not
sampled in this study, such as animal intestines or sediments [47]. Despite these un-
certainties, the observed strong partitioning among associated and free-living clusters
may have important implications for population biology in the bacterioplankton. As
recently suggested [6], for attached bacteria, the effective population size (Ne) may
be considerably smaller than the census size because colonization serves as a pop-
ulation bottleneck, whereas in free-living clusters, Ne may be closer to the census
size. Although computing the true magnitude of Ne in microbial populations remains
controversial [58], it is an important parameter that determines the relative strength
of selection and drift. Thus, attached and free-living populations may evolve under
different constraints [6].
The phylogenetic structure of populations also provides insights into the history
of habitat switches. Deeply branching populations may have remained associated
with habitats over long evolutionary time, and shallow branches may have diversified
more recently (Fig. 1.1, A and B). These stable habitat-associated clusters roughly
correlate to named species within the Vibrionaceae. For example, V. ordalii (G3) and
Enterovibrio norvegicus (G2) both represent clusters without close relatives contain-
ing > 50 isolates, which are overwhelmingly predicted to follow primarily free-living
(HA) and free-living/particle-associated lifestyles (HC), respectively (Fig. 1.1A). On
21
the other hand, some very closely related clusters are associated with different habi-
tats; V. splendidus, which is composed of strains that are ∼99% identical in rDNA
gene sequence [54], differentiates into 15 microdiverse habitat-associated clusters, of
which one is distributed roughly evenly among both seasons, and 9 and 5 predom-
inantly occur in spring and fall, respectively. Thus, V. splendidus appears to have
ecologically diversified, possibly by invading new niches or partitioning resources at
increasingly fine scales.
Recent or perhaps ongoing radiation by sympatric resource partitioning is most
strongly suggested for two nested clusters within V. splendidus, where groups of
strains differing by as little as a single nucleotide in hsp60 display distinct ecological
preferences (Fig. 1.1A, insets, and Fig. 1.3). These strains were isolated from multiple
independent samples and thus do not represent clonal expansion, suggesting that this
may reflect a true habitat switch; nonetheless, homologous recombination could also
move alleles between distantly related, ecologically distinct clusters, creating spurious
phylogenetic relationships, which can be detected by comparison with other genes.
Multilocus sequence analysis shows that for nested cluster I, a close relationship
was artificially created because hsp60 gene phylogeny is discordant with three other
genes (Fig. 1.2). However, this still represents a habitat switch, just at a slightly
larger sequence distance, as I.A is nested within the much larger G16 cluster in
both the hsp60 and the mdh-pgi -adk phylogenies. For the second nested cluster,
the three additional genes confirm partial separation of the subclusters II.A and II.B
by a single base pair difference in one of the genes, whereas the other genes consist
of identical alleles. This reinforces the idea that subcluster II.A is not incorrectly
grouped because of recombination, despite its distinct ecological affiliation (Fig. 1.2).
In combination, these data support the idea that there is ecological differentiation
among recently diverged genotypes and show that such changes might be recognized
in protein-coding genes as soon as they accumulate (neutral) sequence changes.
How might adaptation to a new habitat relate to speciation, the generation of dis-
tinct clusters of closely related bacteria? Mathematical modeling has recently shown
that the dynamics of speciation depend on the ratio of homologous recombination to
22
mutation rates (r/m) [6]. When this ratio per allele exceeds ∼1, populations tran-
sition from essentially clonal to sexual, with the major consequence that selection is
probably required for the formation of clusters [6]. Our preliminary multilocus se-
quence analysis on a set of strains with similar taxonomic composition suggests that
their r/m is well above that threshold. Thus, our observations of habitat separation
for highly similar but clearly distinct genotypes suggest that ecological selection may
have triggered phylogenetic differentiation. A plausible mechanism is that differen-
tial distribution among habitats (possibly caused by few adaptive loci) is sufficient
to depress gene flow between associated genotypes [6, 59]. Consequently, mutations
will no longer be homogenized but instead accumulate within specialized populations,
even for ecologically neutral genes. Over time, genetic isolation may increase because
homologous recombination rates decrease log-linearly with sequence distance [60]. We
detected associations with different habitats among sister clades over a wide range
of phylogenetic distances, possibly representing populations at various stages of dif-
ferentiation (Fig. 1.1A). Although we cannot determine whether clusters represent
transiently adapted populations or nascent species, our observations of differential
distributions of genotypes suggest that there exists a small-scale adaptive landscape
in the water column allowing the initiation of (sympatric) speciation within this com-
munity.
Although it has recently been suggested that microbial lineages remain specific
to macroenvironments over long evolutionary times [61], this study demonstrates
switches in ecological associations within a bacterial family coexisting in the coastal
ocean. In the V. splendidus clade, speciation could be ongoing, but the divergence
between most other ecologically defined groups appears large. This is consistent with
our previous suggestion that rRNA gene clusters, which are roughly congruent with
the deeply divergent protein-coding gene clusters detected here, represent ecological
populations [42]. However, the example of V. splendidus highlights the fact that using
marker genes to assess community-wide diversity may not capture some ecological
specialization. Moreover, different groups of organisms could evolve under different
constraints, and the mechanisms suggested here apply to the invasion of new habitats
23
and are thus different from (but compatible with) the widely discussed niche-specific
selective sweeps [10]. Why V. splendidus appears to have radiated recently into new
habitats whereas other groups appear to be more constant is not known but may
be related to its high heterogeneity in genome architecture [54]. This could indicate
a large (flexible) gene pool that, if shared by horizontal gene transfer, gives rise to
large numbers of ecologically adaptive phenotypes. It will therefore be important
to compare whole genomes within recently ecologically diverged clusters to identify
specific changes leading to adaptive evolution.
24
1.1 Figures
that clusters can result from neutral evolution alone(9), and evidence for clusters as ecologicallydistinct populations remains sparse, having beenmost conclusively demonstrated for cyanobacteriaalong ocean-scale gradients (10) and in a depthprofile of a microbial mat (11). Further, horizontalgene transfer (HGT) may erode the ecologicalcohesion of clusters if adaptive genes are transferred(12), and recombination can homogenize genesbetween ecologically distinct populations (13).Thus, exploring the relationship between phyloge-
netic and ecological differentiation is a critical steptoward understanding the evolutionary mecha-nisms of bacterial speciation (9).
In this study, we investigated ecological dif-ferentiation by spatial and temporal resourcepartitioning in coastal waters among coexistingbacteria of the family Vibrionaceae, which areubiquitous, metabolically versatile heterotrophs(14). The coastal ocean is well suited to testpopulation-level effects of microhabitat prefer-ences, because tidal mixing and oceanic circula-tion ensure a high probability of migration,reducing biogeographic effects on populationstructure. In the plankton, heterotrophs mayadopt alternate ecological strategies: exploitingeither the generally lower concentration but moreevenly distributed dissolved nutrients or attach-ing to and degrading small suspended organicparticles, originating from algal exopolysaccha-rides and detritus (3). Bacterial microhabitatpreferences may develop because resources aredistributed on the same scale as the dispersal
range of individuals, due to turbulent mixing andactive motility (15). Of potential microhabitats,particles represent abundant but relatively short-lived resources, as labile components are rapidlyutilized (on time scales of hours to days) (16, 17),implying that particle colonization is a dynamicprocess. Moreover, particulate matter may changecomposition with macroecological conditions(such as seasonal algal blooms). Zooplanktonprovide additional, more stable microhabitats;vibrios attach to and metabolize chitinous zoo-plankton exoskeletons (18, 19) but may also livein the gut or occupy niches specific to pathogens.The extent to which microenvironmental prefer-ences contribute to resource partitioning in thiscomplex ecological landscape remains an impor-tant question in microbial ecology (20).
We aimed to conservatively identify ecolog-ically coherent groups by examining distributionpatterns of Vibrionaceae genotypes among free-living and associated (with suspended particlesand zooplankton) compartments of the plankton-
Fig. 1. Season and size fraction distributions and habitat predictionsmapped onto Vibrionaceae isolate phylogeny inferred by maximumlikelihood analysis of partial hsp60 gene sequences. Projected habitatsare identified by colored circles at the parent nodes. (A) Phylogenetictree of all strains, with outer and inner rings indicating seasons and sizefractions of strain origin, respectively. Ecological populations predictedby the model are indicated by alternating blue and gray shading ofclusters if they pass an empirical confidence threshold of 99.99% (seeSOM for details). Bootstrap confidence levels are shown in fig. S10. (B)Ultrametric tree summarizing habitat-associated populations identifiedby the model and the distribution of each population among seasons and
size fractions. The habitat legend matches the colored circles in (A) and(B) with the habitat distribution over seasons and size fractions inferredby the model. Distributions are normalized by the total number of countsin each environmental category to reduce the effects of uneven sampling.The insets at the lower right of (A) show two nested clusters (I.A and I.Band II.A and II.B) for which recent ecological differentiation is inferred,including habitat predictions at each node. The closest named species tonumbered groups are as follows: G1, V. calviensis; G2, Enterovibrionorvegicus; G3, V. ordalii; G4, V. rumoiensis; G5, V. alginolyticus; G6, V.aestuarianus; G7, V. fischeri/logei; G8, V. fischeri; G9, V. superstes; G10,V. penaeicida; G11 to G25, V. splendidus.
1Department of Civil and Environmental Engineering,Massachusetts Institute of Technology (MIT), Cambridge, MA02139, USA. 2Computational and Systems Biology Initiative,MIT, Cambridge, MA 02139, USA. 3Laboratory of Microbiol-ogy, Ghent University, Gent 9000, Belgium. 4Bioinformaticsand Evolutionary Genomics Group, Ghent University, Gent9000, Belgium.5Department of Biological Engineering, MIT,Cambridge, MA 02139, USA, and Broad Instilate of MIT andHarvard University, Cambridge, MA 02139, USA.
*These authors contributed equally to this work.†To whom correspondence should be addressed. E-mail:[email protected] (E.J.A.); [email protected] (M.F.P.)
23 MAY 2008 VOL 320 SCIENCE www.sciencemag.org1082
REPORTS
on
May
26,
200
8 w
ww
.sci
ence
mag
.org
Dow
nloa
ded
from
Figure 1.1: Season and size fraction distributions and habitat predictions mapped ontoVibrionaceae isolate phylogeny inferred by maximum likelihood analysis of partial hsp60
gene sequences. Projected habitats are identified by colored circles at the parent nodes.(A) Phylogenetic tree of all strains, with outer and inner rings indicating seasons andsize fractions of strain origin, respectively. Ecological populations predicted by the modelare indicated by alternating blue and gray shading of clusters if they pass an empiricalconfidence threshold of 99.99% (see SOM for details). Bootstrap confidence levels are shownin Fig. 1.14. (B) Ultrametric tree summarizing habitat-associated populations identifiedby the model and the distribution of each population among seasons and size fractions. Thehabitat legend matches the colored circles in (A) and (B) with the habitat distribution overseasons and size fractions inferred by the model. Distributions are normalized by the totalnumber of counts in each environmental category to reduce the effects of uneven sampling.The insets at the lower right of (A) show two nested clusters (I.A and I.B and II.A and II.B)for which recent ecological differentiation is inferred, including habitat predictions at eachnode. The closest named species to numbered groups are as follows: G1, V. calviensis; G2,Enterovibrio norvegicus; G3, V. ordalii ; G4, V. rumoiensis; G5, V. alginolyticus; G6, V.
aestuarianus; G7, V. fischeri/logei ; G8, V. fischeri ; G9, V. superstes; G10, V. penaeicida;G11 to G25, V. splendidus.
25
ic environment under different macroecologicalconditions (spring and fall) (fig. S1 and table S1).Because the level of genetic differentiation atwhich ecological preferences develop is notknown, we focused on a range of relationships[0 to 10% small subunit ribosomal RNA (rRNA)divergence] among co-occurring vibrios (21).Particle-associated and free-living cells wereseparated into four consecutive size fractions bysequential filtration (four replicate water samples,each subsampled with at least four replicatefilters per size fraction); each fraction containedorganisms and dead organic material of differentorigins [detailed in the supporting online material(SOM)]. For simplicity, we refer to these frac-tions as enriched in zooplankton (!63 mm), inlarge (5 to 63 mm) and small (1 to 5 mm) particles,and in free-living cells (0.22 to 1 mm) (fig. S1B).The 1- to 5-mm size fraction was somewhat am-biguous, probably containing small particles aswell as large or dividing cells; however, it pro-vided a firm buffer between obviously particle-associated (>5 mm) and free-living (<1 mm)cells. Vibrionaceae strains were isolated by plat-ing filters on selective media, previously shownby quantitative polymerase chain reaction to yieldgood correspondence between genotypes recov-ered in culture and those present in environmentalsamples (21). Roughly 1000 isolates were char-acterized by partial sequencing of a protein-coding gene (hsp60). To obtain added resolution,between one and three additional gene fragments
(mdh, adk, and pgi) were sequenced for over halfof the isolates (SOM), including V. splendidusstrains, the most abundant group (21).
Our rationale for testing environmental asso-ciations grows out of the following consider-ations. First, as in most ecological sampling, thetrue habitats or niches are unknown and can onlybe observed as projections onto the sampling di-mensions (“projected habitats”). Thus, associationscan be detected as distinct distributions of groupsof strains if habitats/niches are differentially ap-portioned among samples. Second, the lack of anaccepted microbial species concept implies that itis imprudent to use anymeasure of genetic relation-ships to define a priori the populations whoseenvironmental association should be assessed.Therefore, we first tested the null hypothesis thatthere is no environmental association across thephylogeny of the strains. We then refined such es-timates by developing a new model to simulta-neously identify populations and their projectedhabitats. Finally, these model-based results weretested with nonparametric empirical statistics.
The initial null hypothesis of no associationbetween phylogeny and ecology is strongly rejected(seasons: P < 10"79; size fractions: P < 10"49) bycomparing the parsimony score of observed envi-ronments on the tree to that expected by chance(22) (SOM), confirming the visual impression ofdifferential patterns of clustering among seasonsand size fractions (Fig. 1A). This result is robusttoward uncertainty in the phylogeny, whichshould diminish but not strengthen associations,and is confirmed by introducing additional uncer-tainty in the phylogeny (fig. S2). The observedoverall association with season and size fractiontherefore suggests that water-column vibrios par-tition resources, but neither provides insights intothe phylogenetic bounds of populations or thecomposition of their habitats.
We therefore developed an evolutionarymodel(AdaptML) to identify populations as groups ofrelated strains sharing a common projected hab-itat, which reflects their relative abundance in themeasured environmental categories (size frac-tions and seasons) (SOM). In practice, the modelinputs are the phylogeny, season, and size frac-tion of the strains. It then maps changes in envi-ronmental preference onto the tree by predictingprojected habitats for each extant and ancestralstrain in the phylogeny. Although similar in spiritto existing parsimony, likelihood, and Bayesianmethods, which map ancestral states onto trees(23), the model accounts for the complexities anduncertainties of environmental sampling. First,projected habitats can span multiple samplingdimensions to account for complex life cycles(such as time spent in multiple true habitats) andproblems inherent in environmental sampling:Discrete samples rarely equate to true habitats,and true habitats are frequently misplaced amongtheir typical sample categories (for example,zooplankton fragments may also be found insmaller size fractions). Second, projected habitatscan span multiple phylogenetic clusters to allow
for the possibility that clusters may arise neutrallyor that the relevant parameters differentiatingthem ecologically have not been measured.
Briefly, AdaptML builds a hidden Markovmodel for the evolution of habitat associations:Adjacent nodes on the phylogeny transition be-tween habitats according to a probability functionthat is dependent on branch length and a tran-sition rate, which is learned from the data (SOM)(fig. S3). Subsequently, we optimize the modelparameters (the transition rate and the compo-sition of each projected habitat) to maximize thelikelihood of the observed data. Finally, we use asimple ad hoc rule for reducing noninformativeparameters: We merge habitats that converge tosimilar distributions (simple correlation of distri-bution vectors >90%) during the model-fittingprocedure (SOM). This reproducibly identifiedsix nonredundant habitats for the observed dataset (HA to HF in Fig. 1B and fig. S5). Moreover,the algorithm acts conservatively, as suggestedby two tests. First, the model did not overfit thedata when there was no ecological signal present:When the environments were shuffled, only asingle generalist habitat (evenly distributed overall size fractions and seasons) was recovered.Second, when simulated habitats were used togenerate environmental assignments, the modelusually identified a number of habitats equal to orless than the true number present (fig. S6).
The analysis suggests that a single bacterialfamily coexisting in the water column resolvesinto a striking number of ecologically distinctpopulations with clearly identifiable preferences(habitats). The algorithm identified 25 popula-tions, associated with one of the six habitatsdefined by distinct distributions of isolates overseasons and size fractions (Fig. 1 and fig. S7).Most clusters have a strong seasonal signal; in-terestingly, two pairs of highly similar habitatsare observed in both seasons (Fig. 1B). The firstof the habitat pairs corresponds to populationsoccurring both free-living and on particles butlacking zooplankton-associated isolates (HB andHC); the second indicates a preference for zoo-plankton and large particles (HE and HF) (Fig.1B). The remaining two habitats were season-specific. Habitat HA combines all primarily free-living populations in the fall, whereas habitat HD
identifies a second particle- and zooplankton-associated group in spring, but unlike HE and HF
it has a higher proportion of large particles andmaps onto a single small group (G25) (Fig. 1).However, we cannot place high confidence in theabsence of the free-living habitat in the spring,because relatively few strains were recoveredfrom that fraction. Moreover, the distribution ofindividual populations among seasons and sizefractions varies considerably, with remarkablynarrow preferences for some populations whereasothers are more broadly distributed. For ex-ample, V. ordalii (G3) is almost exclusively free-living in both seasons, whereas V. alginolyticus(G5) has a significant representation in bothzooplankton and free-living size fractions but
Fig. 2. Multilocus sequence analysis of nestedclusters (IA and IB and IIA and IIB) with differentialhabitat association by comparison of partial hsp60(left) and concatenated partial mdh, adk, and pgi(right) gene phylogenies. Habitat predictions(indicated by colored boxes) and the numberingof clusters correspond to Fig. 1. Scale bar is in unitsof nucleotide substitutions per site.
www.sciencemag.org SCIENCE VOL 320 23 MAY 2008 1083
REPORTS
on
May
26,
200
8 w
ww
.sci
ence
mag
.org
Dow
nloa
ded
from
Figure 1.2: Multilocus sequence analysis of nested clusters (IA and IB and IIA and IIB)with differential habitat association by comparison of partial hsp60 (left) and concatenatedpartial mdh, adk, and pgi (right) gene phylogenies. Habitat predictions (indicated by coloredboxes) and the numbering of clusters correspond to Fig. 1.1. Scale bar is in units ofnucleotide substitutions per site.
26
1.2 Supplementary Material
1.2.1 Sampling rationale
To investigate partitioning of Vibrionaceae strains in the water column, we exam-
ined their distribution among the free-living and associated (with particles and zoo-
plankton) fractions of the bacterioplankton community at two time points. This was
achieved by sequential filtration with decreasing pore size cutoffs and subsequent cul-
tivation on Vibrio selective media (Fig. 1.5B). Here, we give additional details on
sampling protocols and rationale supplementing the overview given in the main text.
Filtration is commonly used in oceanography to separate particle-associated and
free-living populations by retention of particles on filters, although the filter size cut off
for collecting particle-attached bacteria has varied in past studies between 0.8 and 10
µm [62–65]. To obtain higher ecological resolution, we used sequential filtration since
alternate types of particulate organic matter and organisms (e.g., phytoplankton,
zooplankton) will have distribution maxima in different size fractions thus enabling
differentiation of associated bacterial genotypes.
We collected a total of four size fractions with different expected composition
of particles and organisms (Fig. 1.5B). The largest fraction (≥63 m) was visually
enriched in zooplankton and detrital material (e.g., pieces of macroalgae, terrestrial
plant material); however, large gelatinous material [frequently part of marine snow,
which represents particles >0.5 mm [66]] was likely not collected since it is disrupted
by the pressure on the plankton nets used for collection. All other fractions were
collected by gravity rather than vacuum filtration to minimize disruption of frag-
ile particles. The large particle fraction (63-5 µm) likely contains zooplankton fecal
pellets, dead and living algae, and other detritus. The composition of the 5-1 µm
size fraction is somewhat ambiguous since it may contain both cells attached to very
small particles as well as large or dividing cells; however, it provides a firm buffer
between obviously particle-attached (>5 µm) and free-living (<1µm) cells. Particu-
late material in this size range may include small algae, bacterial cell walls, as well as
fragments of larger particles, which have broken apart; nonetheless, the small size of
27
such particles in unlikely to sustain a resident bacterial population. Free-living bac-
teria, observed in the 1-0.22 µm size fraction, likely live on dissolved organic matter
produced by living algae, cell lysis and the dissolution of particles.
1.2.2 Sample collection
Coastal ocean water samples were collected at high tide on the marine end of the
Plum Island Estuary (NE Massachusetts) (Fig. 1.5A) on two days representing spring
(4/28/06) and fall (9/6/06) conditions in the coastal ocean. Nutrient concentrations,
water temperature and chlorophyll levels were measured on both sampling dates (Fig.
1.3).
Two replicate samples of the largest size fraction (enriched in zooplankton) were
collected by filtering ∼100 L each through a 63 µm plankton net, which was sub-
sequently washed with sterile seawater (Fig. 1.5B). Particle-associated and free-
living bacterial populations were collected from quadruplicate water samples, which
were independently 2 pre-filtered through the 63 µm plankton net (to remove the
zooplankton-enriched fraction) into 4 L nalgene bottles (Fig. 1.5B). For each bottle,
water was sequentially filtered through 5, 1 and 0.22 µm pore size filters, collecting
at least four replicate filters per size fraction. To avoid disruption of fragile parti-
cles, the 63-5 and 5-1 µm fractions were collected on polycarbonate membrane filters
(Sterlitech) using gravity filtration followed by washing with 5 ml of sterile (0.22
µm-filtered and Tindalized) seawater to remove free-living bacteria that might have
been retained on the filter. The <1 µm fraction containing free-living bacteria was
collected on 0.22 µm Supor-200 filters (Pall) by applying gentle vacuum pressure.
After size fractionation, particles and zooplankton were broken up before plating
(Fig. 1.5B). The zooplankton sample was homogenized using a tissue grinder (VWR
Scientific) and vortexed for 20 minutes at low speed before concentration on 0.22 µm
Supor-200 filters (Pall). These filters were then plated directly on selective media.
Similarly, 5 µm and 1 µm filters were placed in 50 ml conical tubes with 50 ml sterile
seawater and vortexed at low speed for 20 min to break up particles and detach
bacteria from the filters. The supernatant was concentrated on 0.22 µm filters, and
28
both the original and supernatant filters were placed directly on media to collect
isolates.
1.2.3 Strain isolation and identification
Isolates were obtained from TCBS plates (Accumedia or Difo) with 2% NaCl since
this media has been shown to yield good correspondence in phylogenetic groups of
vibrios detected by quantitative PCR and isolation [54]. After 2-3 days of growth,
colonies were counted and re-streaked a total of three times alternately on Tryptic
Soy Broth (TSB) (Difco) with 2% NaCl and media.
For classification of strains by sequencing, purified isolates were grown in marine
TSB broth overnight; DNA was extracted using either a tissue DNA kit (Qiagen) or
Lyse- N-Go (Pierce). Following the rationale of multilocus sequence analysis (MLSA),
housekeeping genes were used for further strain characterization since these are un-
likely to be under environmental selection. The partial hsp60 gene sequence was
amplified for all isolates as described previously [67]. For isolates with an hsp60
sequence differing by more than 2% from an already characterized strain, the 16S
rRNA gene was PCR amplified using primers 27F- 1492R and sequenced using the
27F primer [68]. The 16S sequence was used to identify the organism using the
RDP classifier [69] and BLAST [70]. In cases where the hsp60 gene either failed
to amplify or the sequence diverged greatly from other vibrios, 16S rRNA gene se-
quencing confirmed that these isolates largely belonged to the genera Pseudomonas,
Shewanella, Pseudoalteromonas, and Agaravorans (RDP Classifier) [69]. We excluded
non-Vibrionaceae strains from further analysis.
To confirm relationships for V. splendidus, the most highly represented group
among isolates, an additional gene (mdh) was sequenced. The partial mdh gene was
amplified using primers mdh mod.for (5’- GAY CTD AGY CAY ATC CCW AC -3’)
and mdh mod.rev (5’- GCT TCW ACM ACY TCD GTR CCY G -3’) (S. Preheim,
unpublished data). Two additional housekeeping gene sequences were obtained (pgi,
adk) for select groups of strains, using pgi.for (5 GAC CTW GGY CCW TAC ATG
GT 3 - 3) and pgi.rev (5-CMG CRC CRT GGA AGT TGT TRT-3) (S. Preheim,
29
unpublished data) and adk.for (5- GTA TTC CAC AAA TYT CTA CTG G-3) and
adk.rev (5- GCT TCT TTA CCG TAG TA- 3) [71]. All of these genes were amplified
using the following PCR conditions: 2 min at 94◦C followed by 32 cycles of 1 min each
at 94◦C, 46◦ and 72◦C with a final step of 6 min at 72◦C. Most genes were sequenced
at least twice using forward and reverse primers. All sequencing was performed at
the Bay Paul Center at the Marine Biological Laboratory, Woods Hole MA.
1.2.4 Phylogenetic tree construction and representation
The partial hsp60, mdh, adk, and pgi gene sequences yielded unambiguous align-
ments of 541, 422, 372, 395 nucleotides, respectively. Phylogenetic relationships were
reconstructed using PhyML v.2.4.4 [72] with following parameter settings: DNA sub-
stitution was modeled using the HKY parameter [73]; the transition/transversion
ratio was set to 4.0; PhyML estimated the proportion of invariable nucleotide sites;
the gamma distribution parameter was set to 1.0; 4 gamma rate categories were used;
a BIONJ tree was initially used; and, both tree topology and branch lengths were
optimized by PhyML. Circular tree figures were drawn using the online iTOL software
package [74]. To prevent numerical instabilities in AdaptMLs maximum likelihood
computations, branches with zero length were assigned the minimal observed non-zero
branch length: 0.001.
1.2.5 Empirical statistical testing
We employed empirical statistics to quantify evidence for differential environmental
distribution of phylogenetic groups (Fig. 1.6). We first tested the overall association
of phylogeny with our environmental data using a non-parametric parsimony-based
metric. We assigned a different character to each of the environmental categories,
and calculated the minimum number of character transitions needed to explain the
data given the observed hsp60 phylogeny. Although this test is likely to be overly
conservative given the heterogeneous nature of our observed clusters, it nonetheless
supported a highly significant correlation between phylogeny and both size fraction (p
30
< 10−49) and season (p < 10−79). Exact p-values were computed based on the algo-
rithm of [55]. It was not possible to compute p-values for both season and size fraction
together because the computational complexity of the algorithm grows exponentially
with the number of character states.
We also employed non-parametric empirical statistics to test specific model pre-
dictions. We tested the hypothesis that each of the clusters identified by the model
would be likely to arise by chance. To do this, we produced a 2 × 8 contingency table
to test for any associations between cluster membership and distribution across envi-
ronments. We used the Fisher exact test [75] as implemented in the R programming
language to evaluate the significance of each association. The results are shown in
Figure 1.4.
1.2.6 Overview of AdaptML
We developed a maximum likelihood method to help identify the boundaries of eco-
logically distinct populations and infer the ancestral habitat association of internal
nodes in the strain phylogeny. The key to our method is a hidden variable mapping a
’projected habitat’ to each node. We mathematically characterize each habitat as a
discrete probability distribution, which describes the likelihood that a strain adapted
to that habitat will be observed in each of our eight environmental categories. These
distributions, which we refer to as emission probabilities in accordance with terminol-
ogy used in machine learning, are not known a priori and must be learned from the
data. Because of this probabilistic definition of habitats, a phylogenetic group span-
ning several environmental categories can still be considered an ecologically distinct
population.
Using the habitat variables and isolate sequence-based phylogeny, we built a
hidden-Markov model (HMM) [75] describing the evolution of habitat association
(Figure 1.7). The probability that adjacent nodes in the phylogeny share the same
habitat is a function of both the branch length separating them and a parameter that
represents the rate at which a lineage can transition between habitats (the transition
rate). The observed variables the environmental category from which each strain
31
was sampled occur only at the leaves of the phylogeny. The parameters necessary
for our model can be learned from the data according to the following algorithm:
1. Initialize parameters: We initialize 16 habitats, each with random emission
probability distributions over the 8 environmental categories. The transition
rate parameter is initialized to 10−1 transitions per substitution/site (relative
to the gene phylogeny branch lengths).
2. Infer the observed datas likelihood, given phylogeny and parameter
estimates: We use a dynamic programming algorithm and the model param-
eters (transition rate and emission probabilities) to compute the likelihood of
the observed data (environmental category for each isolate). Our computation
proceeds in a manner identical to Felsenstein’s “pruning” method of computing
likelihoods on a phylogenetic tree [76].
3. Optimize parameters to maximize likelihood of observed data: We es-
timate the probability that each internal node is associated with a given habitat
by summing over all possible habitat assignments at other nodes (E-step). These
probabilities are used (M-step) to update the:
(a) Transition rate parameter : We numerically optimize the transition rate to
maximize the likelihood of the observed data.
(b) Emission probability matrix : We update the emission probability matrix
by taking the matrix that maximizes the likelihood of the observed data
given the marginal likelihoods for the habitat assignments at each of the
phylogenys leaves.
We note that separating these two steps represents an approximation, as these
two parameters are not strictly independent. The approximation, however,
speeds up the implementation considerably as only one parameter (instead of 1
+ 16 × 7 = 113) is optimized numerically.
4. Test for convergence: If the model parameters do not change significantly
from the previous iteration, then the emission probabilities and the transition
32
rate are considered to have converged: continue to step (5). (A typical tra-
jectory of emission probability convergence is shown in Figure 1.8. Note: the
approximation identified in step 3 can lead to fluctuation near a likelihood max-
imum rather than actual convergence). Otherwise, return to step (2).
5. Test for model complexity/redundancy. If a pair of habitats has emission
probability distributions that exhibit correlations greater than 0.90, they are
merged into a single habitat and the algorithm continues from step (2). If
no habitats are merged, the parameter estimation loop terminates. Although
our approach employed manual inspection and empirical testing rather than
a likelihood-based criterion for reducing model complexity [such as the AIC
[77]], our algorithm can be easily extended to include a likelihood criterion.
To test for overfitting, we performed simulations as described in the main text
and Figure 1.10. We found that our scheme acts conservatively since it usually
underestimated the true number of habitats. Figure 1.13 shows how the inferred
habitats identified by the model vary with different cutoffs.
Once a set of model parameters has been learned, we utilize the following protocol
to identify ecologically distinct and statistically significant populations.
1. Infer node habitat assignments that maximize the joint probability
of the observed data: We rely upon a joint likelihood calculation to infer a
single habitat assignment per ancestral node. To compute this likelihood, we
use the parameter estimates inferred by the algorithm described above, which
sums over all habitat assignments. Phylogenetic groups that share a common
habitat association are taken as candidate ecological populations if they pass
an empirical significance test.
2. Empirical testing to identify ecologically distinct populations: The
assignment of nodes to habitats in step (1) identifies the most likely set of
population boundaries, but may include some weakly predicted clusters. To
filter low confidence ecological groupings, we estimate empirical p-values for
33
each clade and only report statistically significant (p < 0.0001) populations
(Figure 1A & 1B). Empirical p-values are computed by comparing the likelihood
of the parent node for a cluster to the likelihood observed at the same node in
randomized trials where environmental assignments are shuffled, but phylogeny
is maintained. For comparison, all possible clusters (with no significance cutoff)
can be inferred from the full model results shown in Figure 1.11.
1.2.7 Detailed description of maximum likelihood model
Conditional likelihoods
The conditional likelihood describes the likelihood L that the leaves of a subtree
exhibit their observed states, conditional on the subtree’s root node k taking state s.
This likelihood can be defined recursively, assuming two child nodes l and m
Lk(s) =
��
x
P (x|s, tl)Lkl(x)
�×
��
y
P (y|s, tm)Lkm(y)
�(1.1)
where the function P (x|s, t) represents the probability of transitioning between
states x and s along some interval t. To reduce the number of fitted parameters
early in our model, we use the simplifying assumption that all state transitions can
be described using the same transition rate parameter µ. Thus, we compute the
probability of transitions between states as
P (x|y, t) =1
h
�1− e
−hµt�
(1.2)
and the probability of remaining in the same state
P (x|x, t) =1
h
�1 + (h− 1)× e
−hµt�
(1.3)
which is analogous to the Jukes-Cantor model for nucleotide sequence evolution
[78]. Note that if k is a leaf node in state s with observed environment o, the likelihood
is drawn from the emission probability matrix Pe:
34
Lk(s) = Pe(o|s) (1.4)
Marginal and joint likelihoods
To compute the marginal likelihood that a node k has state s, we combine the con-
ditional likelihoods:
MLk(s) =1
K×
�
t
��
x
P (x|s, tkl)Ll(x)
�(1.5)
where the l are drawn from the set of nodes adjacent to k, and K is a normalization
factor such that
�
s
MLk(s) = 1 (1.6)
To compute the likelihood of the observed data for the single best assignment of
habitats to nodes (17), the summation terms in the likelihood formula are replaced
with max operations. This is equivalent to maximizing the “joint” likelihood:
Lk(s) =�max
xP (x|s, tkl)Ll(x)
�×
�max
yP (y|s, tkm)Lm(y)
�(1.7)
The backtracking process used to keep track of the max arguments is analogous
to the Viterbi algorithm for finding the most probable state path in an HMM (16).
Parameter estimation
As probability distributions, each set of emission probabilities must satisfy
�
o
Pe(o|s) = 1 (1.8)
Because leaf nodes are independent samples of their emission probability distri-
butions (given their state assignment), we use a weighted-average approach to calcu-
lating the emission probability matrix Pe
35
Pe(o|s) =1
K×
�
k
MLk(s)×∆(o, ok) (1.9)
where MLk(s) is the marginal likelihood that node k is adapted to habitat s, the
function∆( x, y) equals 1 if and only if x equals y (and is 0 otherwise), and K is a
normalization factor.
We learn the transition rate parameter µ by numerical optimization to maximize
the likelihood of the observed data (summed over all values for the hidden variables).
1.2.8 Defining ecologically coherent and significant clusters
We use empirical testing to estimate the statistical significance associated with the
likelihood value computed for each node in the maximum (joint) likelihood assignment
of habitats to nodes (given the final parameter estimates). Each trial preserved
the phylogeny, the inferred habitat assignments, and the habitat and transition rate
parameters, but environmental categories at the leaves were shuffled. The maximum
’joint’ likelihoods at each node were compiled over all trials and used as empirical
background probability distributions.
We use empirical testing to estimate the statistical significance associated with
the likelihood value computed for each node in the maximum (joint) likelihood assign-
ment of habitats to nodes (given the final parameter estimates). Each trial preserved
the phylogeny, the inferred habitat assignments, the habitat and transition rate pa-
rameters, and the frequency of the various environmental categories; however, each
leaf was randomly assigned an environmental category in each trial. The maximum
joint likelihoods at each node were compiled over all trials and used as empirical
background probability distributions.
Using the habitat associations learned from the original (non-randomized) data
set, we identified internal nodes where habitat transitions were inferred to take place.
These nodes were then iterated through using a post-fix traversal:
36
Nested clusters
At each of these transitional nodes, a second, pre-fix traversal took place. If the sub-
tree rooted by the current node possessed a likelihood greater than that observed in
99.99% of the random trials and 90% of its leaves shared the current nodes habitat
assignment, we recognized the subtree as an ecologically coherent and statistically
significant cluster. This latter cutoff (90% coherence) was necessary to ensure mean-
ingful clusters because the likelihood at a parent node can be unusually high when two
nearly significant (but non-identical) child groups are combined. Requiring a major-
ity of child nodes to have the same predicted model state as the parent has the effect
of identifying clusters that correspond more directly to single putative populations.
Making the threshold too high (e.g., 100%) would eliminate some larger clusters that
have a small, internal nested cluster.
To enable the discovery of nested clusters, identified clusters were pruned from
the phylogeny. Nodes representing clades that contained the cluster had their joint-
likelihoods modified so that they no longer incorporated information from the pruned
groups. If the clade rooted by the current node did not satisfy both the p-value and
90% coherence thresholds, the second recursion would descend to the current nodes
children. The second recursion only terminated if the current node was either a leaf,
itself a habitat transition node, or determined to root an ecologically coherent and
statistically significant cluster.
37
Supplementary Figures
11
III. WEBLINK TO AdaptML APPLICATION
An online version of AdaptML can be found at: http://www.almlab.org/adaptml/. Here,
users can upload their own phylogenetic trees and sampling metadata for automated
analysis. Copies of the AdaptML source code can be found at this address as well.
IV. SUPPLEMENTAL TABLE AND FIGURES
This section contains Supplementary Table S1 & S2 and Supplementary Figures S1 to
S10. See main text for details.
Table S1. Temperature and nutrient concentrations on sampling dates.
Temperature Chlorophyll a1 DOC2 TDN2 NO3-+NO2
- NH4+ TDP2 PO4
3-
[˚C] [µg/L] [mg C/L] [mg N/L] [µg N/L] [µg N/L] [µg P/L] [µg P/L]
Spring (4/28/06) 11 4.07 2.11 0.17 9 189 18 14
Fall(9/6/06) 16 6.03 2.28 0.27 5 144 24 25
1 measured using overnight extraction in 90% acetone (19)2 DOC = dissolved organic carbon, TDN = Total Dissolved Nitrogen, TDP = total dissolved phosphorous. All chemical analyses were performed at the University of New Hampshire, Durham, NH.
Figure 1.3: Temperature and nutrient concentrations on sampling dates.
38
12
Table S2. Empirical significance testing of ecologically differentiated clusters predicted
by the model. Fisher’s exact test was used to identify significant associations between
clusters and environments as described in the SOM. Group numbers correspond to Figure
1 in the main text.
Group P-value
1 1.16 E-07
2 1.04 E-16
3 5.06 E-15
4 1.70 E-01
5 1.11 E-02
6 1.26 E-04
7 7.54 E-10
8 2.46 E-05
9 2.35 E-07
10 7.19 E-07
11 1.26 E-13
12 2.62 E-03
13 5.13 E-10
14 1.16 E-06
15 1.01 E-02
16 1.13 E-08
17 4.03 E-07
18 2.81 E-03
19 1.08 E-08
20 1.80 E-06
21 2.31 E-13
22 7.02 E-03
23 3.32 E-06
24 6.06 E-14
25 2.76 E-03
Figure 1.4: Empirical significance testing of ecologically differentiated clusters predictedby the model. Fishers exact test was used to identify significant associations betweenclusters and environments as described in the SOM. Group numbers correspond to Figure1.1 in the main text.
39
5 µm filter
1 µm filter
63µm plankton net
[ ]4 x
[ 4 x
[ ]4 x
2 x
0.22 µm filter
tissue grinder vortexfilter plated on
selective media
4 x
63 µm prefiltered
seawater
]
A
! BMap courtesy of the Gulf of Maine Council
on the Marine Environment
Figure 1.5: Sampling site and outline of sampling strategy for determination of bacterialdistribution among seasons and size fractions in coastal water. (A) Sampling locationshown on a map of North America (left) with a white box depicting the bounds of thepicture at right: the Gulf of Maine. The arrow points to the sampling location, Plum IslandSound, MA. (B) Protocol for obtaining size fractionated of bacterial seawater isolates usingsequential filtration and plating on selective media.
40
14
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1300
400
500
600
Pars
imon
y Sc
ore
Sequence Fraction0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
!60
!40
!20
0
log(
P!va
lue)
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1!60
!40
!20
0
Fig. S2. Empirical statistical testing for ecological association of phylogenetic clades.
The likelihood that the observed parsimony score (or lower) for size fraction data might
have arisen by chance was calculated [using the method of (15)] for a series of trees,
inferred using subsets of strains from the hsp60 gene alignment. To test the effect of
statistical uncertainty on the inferred association, 1%, 10%, 50%, 90%, or 100% of the
sequence data was randomly selected and used to construct the phylogeny (with 3
replicates each). The parsimony score for each such tree is shown with green dots
corresponding to the left axis; the p-value of obtaining that parsimony score by chance is
shown with blue dots corresponding to the right axis. These results are based on size
fraction data only, but similar results are obtained with season data.
Figure 1.6: Empirical statistical testing for ecological association of phylogenetic clades.The likelihood that the observed parsimony score (or lower) for size fraction data might havearisen by chance was calculated [using the method of [55]] for a series of trees, inferred usingsubsets of strains from the hsp60 gene alignment. To test the effect of statistical uncertaintyon the inferred association, 1%, 10%, 50%, 90%, or 100% of the sequence data was randomlyselected and used to construct the phylogeny (with 3 replicates each). The parsimony scorefor each such tree is shown with green dots corresponding to the left axis; the p-value ofobtaining that parsimony score by chance is shown with blue dots corresponding to the rightaxis. These results are based on size fraction data only, but similar results are obtainedwith season data.
41
15
Fig. S3. Overview of hidden Markov model components. (A) Habitats represent hidden
(latent) variables in the model; experiments do not directly measure what habitat a strain
is adapted to. Two adjacent nodes on the phylogeny may differ in their habitat
assignment according to the rate of habitat transition (arrows between habitats). In our
simple model, all transition rates are equivalent. (B) Associated with each habitat is an
emission probability distribution describing how likely a strain associated with a
particular habitat (H1-3) is sampled from a given environment (blue or black). Bars in this
cartoon depict hypothetical probability distributions; strains adapted to Habitat 1 have
higher probability of being observed in the black environment than in the blue
environment. (C) Our model maps a habitat onto each node in the phylogeny such that it
maximizes the probability of the observed data (at the leaves). Observations are limited
to the leaves of the phylogeny, as we cannot directly sample ancestral strains. In the
example shown, the mostly blue group is mapped to habitat 3, the mostly black group to
habitat 1, and the heterogeneous group is mapped to habitat 2.
H1
H2
H3
H1
H2
H3
A B
C
0.0 1.0
Figure 1.7: Overview of hidden Markov model components. (A) Habitats represent hid-den (latent) variables in the model; experiments do not directly measure what habitat astrain is adapted to. Two adjacent nodes on the phylogeny may differ in their habitatassignment according to the rate of habitat transition (arrows between habitats). In oursimple model, all transition rates are equivalent. (B) Associated with each habitat is anemission probability distribution describing how likely a strain associated with a particularhabitat (H1-H3) is sampled from a given environment (blue or black). Bars in this car-toon depict hypothetical probability distributions; strains adapted to habitat 1 have higherprobability of being observed in the black environment than in the blue environment. (C)Our model maps a habitat onto each node in the phylogeny such that it maximizes theprobability of the observed data (at the leaves). Observations are limited to the leaves ofthe phylogeny, as we cannot directly sample ancestral strains. In the example shown, themostly blue group is mapped to habitat 3, the mostly black group to habitat 1, and theheterogeneous group is mapped to habitat 2.
42
16
0 5 10 15 20 25 30 35 40 45 50!2840
!2820
!2800
likel
ihoo
d
iteration
0 5 10 15 20 25 30 35 40 45 500
0.5
1
para
met
er c
hang
e
likelihoodemission probtransition rate
Fig. S4. Convergence of iterative parameter optimization. During the parameter
optimization, both the average change in probability for components of the emission
probability matrix (red line) and the change in transition rate (black line) decrease
rapidly. The overall log-likelihood of the observed data converges in concert with the
parameter estimates (blue line).
Figure 1.8: Convergence of iterative parameter optimization. During the parameter opti-mization, both the average change in probability for components of the emission probabilitymatrix (red line) and the change in transition rate (black line) decrease rapidly. The overalllog-likelihood of the observed data converges in concert with the parameter estimates (blueline).
43
17
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
HA HB HC HD HE HF
Fig. S5. Reproducibility of inferred habitats. Sixty independent trials of the iterative
habitat learning process were performed. Shown are the frequencies of occurrence of the
habitats presented in Figure 1. The habitat similarity cut-off for counting a match was an
average emission probability difference < 0.10 over all environmental categories. As
expected, the least abundant habitat in our data, HD (see Fig. S4) is the least reproducible,
although even this habitat is identified in 83% of trials.
Figure 1.9: Reproducibility of inferred habitats. Sixty independent trials of the iterativehabitat learning process were performed. Shown are the frequencies of occurrence of thehabitats presented in Figure 1.1. The habitat similarity cut-off for counting a match wasan average emission probability difference < 0.10 over all environmental categories. Asexpected, the least abundant habitat in our data, HD (see Fig. 1.8) is the least reproducible,although even this habitat is identified in 83% of trials.
44
18
0 1 2 3 4 5 60
1
2
3
4
5
6
actual habitats
infe
rred
habi
tats
y=xy = mxsimulation run
Fig. S6. Number of habitats inferred from simulated data. We generated 120 simulated
datasets and compared the number of inferred habitats to the true number. Each dataset
was randomly assigned between 2 and 6 habitats; these habitats were distributed over
environmental categories by randomly partitioning the interval (0,1) according to a
uniform distribution, but requiring that each successive partitioning must occur at a larger
value than the last (random partitioning without this restriction leads to a large number of
similar ‘generalist’ habitats). These habitats were mapped onto the set of clusters learned
from the Vibrionaceae data (Fig. 1, main text). Shown are the number of habitats
inferred for these trials versus the number actually present (a small amount of Gaussian
noise was added to each data point so that the data points could be discerned). These
results suggest that the habitat inference algorithm is generally conservative, inferring
fewer rather than more than the true number of habitats.
Figure 1.10: Number of habitats inferred from simulated data. We generated 120 sim-ulated datasets and compared the number of inferred habitats to the true number. Eachdataset was randomly assigned between 2 and 6 habitats; these habitats were distributedover environmental categories by randomly partitioning the interval (0,1) according to auniform distribution, but requiring that each successive partitioning must occur at a largervalue than the last (random partitioning without this restriction leads to a large number ofsimilar generalist habitats). These habitats were mapped onto the set of clusters learnedfrom the Vibrionaceae data (Figure 1A). Shown are the number of habitats inferred for thesetrials versus the number actually present (a small amount of Gaussian noise was added toeach data point so that the data points could be discerned). These results suggest that thehabitat inference algorithm is generally conservative, inferring fewer rather than more thanthe true number of habitats.
45
19
HA
HDHE,HF
HB,HC
Habitat:
< 1 !m
5-63 !m> 63 !m
1-5 !m
Size Fraction: Seasonspringfall
Legend:
habitat 12
habitat 0
habitat 15
habitat 5
habitat 1
habitat 13
F
1
5
Z
SPR
FAL
1F 71
1F 108
FF 182
FF 96
1F 2
5F 155
1F 193
5F 183
5F 172
1F 238
1F 174
5F 304
5F 49
FF 351
1F 180
FF 295
5F 143
1F 66
FF 337
ZF 23
1F 280
1F 94
5F 119
1F 201
1F 90
FF 79
5F 73
1F 79
FF 316
FF 77
1F 211
5F 269
1F 164
1F 24
5F 279
5F 249
1F 17
1F 260
5F 98
FF 302
FF 7
1F 39
5F 186
FF 480
1F 205
FF 277
FF 384
FF 8
1F 121
FF 301
FF 15
FF 63
FF 257
1F 287
FF 9
1F 81
5F 259
1F 4
1F 36
FF 1505F 2121F 125
FF 345FF 51
ZF 265F 245FF 2265F 148
FF 492
5F 323
FF 262FF 65FF 72
1F 230
FF 468
1F 20
5F 174
FF 5
5F 31
FF 472
1F 1045F 266
5F 222
FF 32
FF 84FF 342
1F 115
1F 99
5F 132
FF 62
1F 73
5F 316
FF 313
1F 213
5F 2685F 651F 221
5F 224FF 45FF 447FF 92
5F 256FF 41F 2405F 3091F 92FF 3545F 325FF 851F 261FF 452
FF 285
FS 108
FF 124
1S 256
FS 123
1F 85
FF 190
FF 493
FF 121
FF 228
FF 106
FF 123
FF 186
FF 113
FS 184
FF 33
FF 477
1F 215
1S 138
ZF 187
ZF 153
FF 211
FF 283
1F 75
FF 119
FF 142
FF 240
FF 346
FF 165
FF 373
FF 220
FF 340
1F 156
5F 251
FF 298
5F 197
FF 163
FF 454
FF 219
FF 256
FF 133
FF 131
FF 214
FF 164
FF 122
FF 231
FF 162
FF 476
5F 169
FF 458
FF 212
1F 234
FF 104
1F 161
FF 203
FF 216
ZF 161
FF 230
FF 179
FF 217
1F 160
FF 173
FF 252
FF 130
FF 2065F
136FF 320FF 2321S 101F 30
ZF 141FF 349
FS 259
FF 2915F 45
5F 25
1S 257
FF 344
FF 93
1F 69
FS 219
FF 118
FS 103
FS 148
1F 128
FF 187
FF 495
FS 144
FF 68
FS 143
FF 197
FS 272
FS 2171S
1235F 177FS
1291S 143FF
991F 2591F
236FF 284
FF 222FS
215FF 443
FS 214
FF 263FF
317FF 159FS
2381F 247
FS 203
FF 236
FF 167
FF 166
FF 279FF
19FF 498FF
2FF 155FS
134FF 483FS
104FF 380FF
489FF 31FF
451FF 188FF
134
FF 343FS
206FF 288
ZS 109
1S 209
1S 242
5S 187
1S 45
1S 136
FF 255
FF 497
FF 209
1F 223
FF 286
FF 461
FF 195
ZF 157
1F 146
ZF 155
FF 100
FF 482
1F 151
FF 109
FF 464
ZF 53
ZF 45
5F 247
ZF 105
5F 112
5F 80
ZF 22
ZF 35
ZF 13
5F 76
ZF 165
ZF 34
1F 295
FF 474
FF 457
FF 491
FF 154
ZF 75
ZF 225
ZF 18
FF 310
FF 178
5F 299
FF 82
1F 257
5F 90
FF 296
FF 181
ZF 61
1F 220
1F 218
FF 273
ZF 71
5F 322
5S 230
5S 9
5S 257
ZF 117
5F 3
ZF 133
ZS 56
ZS 52
ZF 12
FF 57
5F 153
ZF 55
5F 227
5F 101
5F 200
5F 36
5F 240
5F 188
5F 37
5F 294
5F 9
FF 50
5F 74
5F 77
FF 449
ZF 131
ZF 101
ZF 29
5F 163
ZF 177
FF 201
5F 226
FF 147
5F 84
5F 264
FF 193
ZF 69
5F 255
ZF 115
ZF 265
1S 269
ZF 6
FF 61
5F 113
ZF 103
5F 311FF
445
1F 158
5F 126
FF 145
FF 377
FF 247
FF 161
FF 86
1F 297
FF 348
1F 153
1S 29
ZS 971F
101
FF 172
ZF 181
1F 38
5F 237
1F 149
5F 69
FF 54
5F 296
ZS 111
1F 25
5S 106
ZS 161
ZS 25
FF 20
5S 246
FF 341
1F 55
FF 6
FF 319
FF 198
FF 215
FF 207
1F 283
FF 139
FF 309
FF 183
FF 95
1F 157
FF 251
FF 138
FF 144
FS 302
1S 206
1F 57
ZS 55
5S 141
1S 35
FF 175
5F 106
5F 18
ZF 209
ZF 79
ZF 77
5F 286
5S 55
5S 37
1F 317
FF 156
ZF 2
FS 202
ZS 156
ZS 121
ZS 166
ZS 95
5S 51
ZS 202
5S 163
ZF 7
ZS 23
1S 261
1S 169
FF 1
1F 150
FF 24
1F 145
5F 102
ZS 207
5S 190
1S 119
ZS 201
5S 279
ZS 5
1S 296
5S 153
5S 262
ZS 119
ZS 183
ZS 35
ZS 105
1S 48
1S 46
5S 101
5S 67
5S 162
ZS 173
ZF 90
5S 102
ZS 101
ZS 136
5S 292
5S 213
5S 214
5S 86
ZS 28
ZS 127
5S 139
5S 89
ZF 97
1F 275
FS 213
FF 12
5F 297
5F 284
5F 51
5S 18
5F 216
5S 118
ZS 192
5S 264
5S 69
ZS 135
1F 251
1S 33
5S 64
1S 84
1S 65
5S 16
1S 63
5S 191
ZS 46
ZS 19
ZS 58
5S 122
5S 124
ZS 41
1S 285
ZS 211
ZS 178
ZS 138
1S 142
5S 116
5S 226
ZS 16
5S 249
FF 462
1S 271
ZS 176
5S 4
ZS 213
ZS 158
ZS 168
ZS 198
1S 141
ZS 193
ZS 66
ZS 210
ZS 180
ZS 181
ZS 187
ZS 195
5S 92
5S 2
5S 1
FF 91
FF 446
5S 81
ZS 208
ZS 185
5S 225
ZS 47
1S 314
ZS 8
FS 258
5S 156
5S 245
5S 252
ZS 10
ZS 29
5S 215
ZS 93
5S 57
ZS 31
ZS 163
ZS 113
1S 27
1F 197
ZF 259
5S 151
5S 111
5S 114
FF 352
FS 176
FF 308
FF 143
FF 374
FF 191
ZS 125
ZS 171
ZS 107
ZS 115
5S 192
ZF 31
ZF 59
ZS 2
1S 317
1S 319
ZS 117
5S 272
1F 29
5S 210
ZS 7
FF 125
FF 126
1S 146
1S 247
FS 178
FS 274
1S 130
5S 223
1S 113
1S 260
1S 124
1S 134
5S 161
5S 228
5S 209
5S 23
5S 235
5S 155
5S 242
5S 136
FF 259
FF 30
1S 70
1S 309
5S 5
5S 232
5S 267
ZS 103
1S 153
5S 285
ZS 36
1S 129
ZS 59
5S 134
5S 133
ZF 197
5S 135
5S 143
5S 185
ZS 53
ZS 90
ZS 68
ZS 72
ZS 78
5S 14
ZS 74
ZS 76
5S 283
ZS 84
ZS 82
ZS 86
ZS 88
1S 311
ZS 80
ZS 18
5S 123
1S 131
1S 14
ZS 20
5S 208
5S 238
5S 28
5S 268
ZS 12
5S 7
FF 185
FF 184
5S 73
5S 104
5F 57
5S 83
FF 460
ZF 43
ZF 93
5F 61
1F 263
1F 267
FF 465
FF 245
1F 200
5F 277
FF 102
FF 189
5F 70
1F 177
5F 129
5F 180
FF 174
1F 310
FF 488
5F 19
FF 233FF 484
5F 283
FF 481
ZS 184ZS 214
ZS 50
ZS 186
ZS 67
ZS 44
ZS 49
ZS 137
ZS 17
5F 275
FF 292
FF 132
ZF 41
FF 13
5F 302
FF 459
ZF 20
FF 500
FF 225
5F 261
FF 264
1F 31
FF 223
ZS 160
ZS 64
ZS 70
1S 278
5S 221
ZS 151
ZS 196
ZS 92
FF 149
FF 466
FF 36
5F 47
FF 53
5F 85
5F 29
5F 79
5F 135
FF 210
FF 148
5F 231
FF 47
FF 141
FF 22
ZF 2615S 217ZF 2031F 63
FF 3151F 42
1F 199
5F 44
1F 60
5F 4
1F 308
1F 9
5F 320
1F 195
ZF 76
FF 496
FF 14
1F 303
FF 375
FF 169
1F 249
FF 266
FF 249
1F 299
ZF 193
5F 273
FF 265
1F 187
1F 292
FF 3
1F 269
FF 2801F 183 FF 250
FF 382
FF 237
FF 2601F 185
FF 140
FF 386
FF 350 5F 324
1F 279
FF 268
FF 376
FF 213FF
305
5F 15F
60
FF 312
1F 300
5F 315
FF 290
1F 181
FF 59
FF 442
5F 318
FF 107
1F 110
FF 475
FF 267
1F 155
FF 105
FF 385
5F 20
FF 272
1F 277
ZF 111
FF 299
1F 169
FF 112
1F 148
1F 285
FF 168
1F 293
1F 253
ZS 13
ZF 170
5F 281
ZF 57
ZF 14
5F 133
ZS 63
ZF 30
ZF 207
ZF 270
FF 75
ZF 28
1F 289
5F 59
1F 97
1F 111
1F 175
1F 127
FF 160
1F 273
1F 53
FF 274
1F 124
5F 301
ZF 264
ZF 255
ZF 99
5F 171
ZF 205
ZF 65
1S 120
5S 239
5S 240
FS 285
FS 286
5S 41
1S 77
5S 149
1S 193
FS 181
FF 379
1F 265
FF 135
FF 307
FF 176
FF 204
5F 195
FF 202
FF 101
FF 239
1F 282
5F 189
FF 234
FF 221
FF 339
FF 94
5F 257
1F 171
FF 371
FF 199
FF 108
FF 304
FF 115
FF 152
FF 117
ZF 223
ZF 192
5F 165
5F 235
ZF 72
ZF 89
ZF 175
FF 170
ZF 63
1F 33
ZF 167
ZF 15
ZF 137
5S 76
1F 165
5S 78
5F 207
ZF 32
5F 205
5F 40
1S 139
ZF 8
5F 92
5S 49
5F 117
1F 255
ZF 199
ZF 33
ZF 21
ZF 269
5F 142
ZF 5
1F 243
5F 290
5S 207
ZF 17
ZF 135
ZF 9
5F 122
1S 297
ZF 10
FF 58
1S 140
FF 372
ZF 159
5F 213
FF 192
5F 289
5F 192
1F 203
ZF 219
ZF 215
ZF 195
ZF 113
ZF 91
5F 161
ZF 92
5F 201
5F 123
5F 63
ZF 100
ZF 49
5S 8
ZS 60
5S 40
ZS 37
ZS 38
ZS 99
5S 115
ZS 199
5S 62
ZS 40
5S 200
ZS 61
ZS 189
ZS 11
ZS 15
5S 22
ZS 33
ZS 209
1S 315
1S 105
5S 46
ZS 205
5S 100
ZS 22
ZS 133
ZS 139
ZS 65
5S 269
ZS 190
ZF 201
ZF 81
ZF 189
ZF 173
ZF 50
ZF 257
ZF 129
1F 77
ZF 27
ZF 107
1F 315
1F 179
ZF 163
ZF 67
5F 157
ZF 95
FF 52
ZF 109
5F 99
5F 293
ZF 19
ZF 36
1F 245
1F 117
ZF 25
ZF 47
1F 209
ZF 73
1F 28
5F 306
5F 23
5F 97
ZF 11
ZF 221
5F 308
ZF 83
5F 94
1F 227
5F 7
1F 45
5F 234
5F 33
5F 17
FF 229
5F 53
FF 246
FF 227
ZF 211
FF 254
5F 146
5F 238
ZF 213
1F 123
1F 154
1S 229
1S 237
5S 119
5S 189
5S 186
1S 159
FS 201
5S 202
5S 198
5S 183
1S 128
1S 175
FS 186
1S 165
1S 152
5S 53
FF 136
FF 200
FF 196
FF 110
FF 238
1F 207
FF 146
1F 152
5F 104
FF 21
FF 224
1F 126
5S 43
5S 159
1S 196
5S 130
5S 12
1F 189
1F 191
1F 120
5F 253
5F 6
Fig. S7. Inferred habitat associations for all ancestors of sequenced Vibrio strains. The
rings surrounding the tree represent the season (outer) and size fraction (inner) from
which strains were isolated. The maximum likelihood assignment of nodes to habitats is
shown for all nodes, regardless of the confidence of each prediction (only confident
assignments are shown in main text Fig. 1A). Colored circles on each branch indicate the
habitat assignment (HA-F, as in Fig. 1 main text) for the node immediately below that
branch (see above legend for color scheme). Branch lengths are adjusted to aid
visualization and do not represent evolutionary distances.
Figure 1.11: Inferred habitat associations for all ancestors of sequenced Vibrio strains. Therings surrounding the tree represent the season (outer) and size fraction (inner) from whichstrains were isolated. The maximum likelihood assignment of nodes to habitats is shownfor all nodes, regardless of the confidence of each prediction (only confident assignmentsare shown in Figure 1.1A). Colored circles on each branch indicate the habitat assignment(HA-HF , as in Figure 1.1A) for the node immediately below that branch (see above legendfor color scheme). Branch lengths are adjusted to aid visualization and do not representevolutionary distances.
46
20
B
A
mdh hsp60
HBHCHDHEHF
0 0.05 0.1 0.15 0.2 0.25 0.3 0.350
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
genetic distance
fract
ion
habitat identity50th percentile of <dist>95th percentile of <dist>
Fig. S8. Comparison of habitat inference on different gene phylogenies (hsp60 and mdh)
for Vibrio splendidus strains. (A) Juxtaposition of habitats learned from the mdh and
hsp60 datasets for V. splendidus strains only; habitats are labeled to allow comparison
with habitats predicted for all Vibrionaceae in Fig. 1, main text. Emission probabilities
are normalized by the total number of isolates obtained in each environmental sample to
reduce the effects of sampling bias. As expected, fewer habitats are identified from the
mdh phylogeny, which has a lower rate of nucleotide divergence and thus is less well-
resolved. (B) Comparison of habitat assignments to nodes. Because it is difficult to map
the internal nodes between topologically distinct trees, the habitat assignment for the last
common ancestor of each pair of V. splendidus strains was compared. If both
corresponded to the “same” habitat (HC, HE, or HF in both phylogenies), they were
considered to be in agreement, otherwise they were considered to be in disagreement.
The fraction of nodes in agreement is shown as a function of increasing genetic distance
between the pairs of strains considered (in the hsp60 phylogeny). The black and red lines
indicate distances that include 50% and 95% of strains within the same cluster in main
text Figure 1, respectively.
Figure 1.12: Comparison of habitat inference on different gene phylogenies (hsp60 andmdh) for Vibrio splendidus strains. (A) Juxtaposition of habitats learned from the mdh andhsp60 datasets for V. splendidus strains only; habitats are labeled to allow comparison withhabitats predicted for all Vibrionaceae in Fig. 1.1. Emission probabilities are normalizedby the total number of isolates obtained in each environmental sample to reduce the effectsof sampling bias. As expected, fewer habitats are identified from the mdh phylogeny, whichhas a lower rate of nucleotide divergence and thus is less well-resolved. (B) Comparisonof habitat assignments to nodes. Because it is difficult to map the internal nodes betweentopologically distinct trees, the habitat assignment for the last common ancestor of eachpair of V. splendidus strains was compared. If both corresponded to the same habitat (HC ,HE , or HF in both phylogenies), they were considered to be in agreement, otherwise theywere considered to be in disagreement. The fraction of nodes in agreement is shown as afunction of increasing genetic distance between the pairs of strains considered (in the hsp60
phylogeny). The black and red lines indicate distances that include 50% and 95% of strainswithin the same cluster in main text Figure 1.1, respectively.
47
21
85% 90% 95% 99%
Fig. S9. Influence of the model complexity/redundancy parameter on inferred habitats.
Clusters are merged during the model fitting procedure when the vectors describing their
distribution across environments are more than 90% correlated. As this cutoff is varied,
slightly different habitats are observed. At 85%, habitat HC is not recovered; at higher
values additional habitats become more redundant suggesting that the 90% cutoff allows
conservative recovery of characteristic projected habitats.
Figure 1.13: Influence of the model complexity/redundancy parameter on inferred habi-tats. Clusters are merged during the model fitting procedure when the vectors describingtheir distribution across environments are more than 90% correlated. As this cutoff is var-ied, slightly different habitats are observed. At 85%, habitat HC is not recovered; at highervalues additional habitats become more redundant suggesting that the 90% cutoff allowsconservative recovery of characteristic projected habitats.
48
22
0.01
5S OUTGROUP
1 FAL2
F FAL96
F FAL182
1 FAL71
1 FAL108
F FAL7
F FAL302
5 FAL98
1 FAL260
1 FAL17
5 FAL249
5 FAL279
1 FAL24
1 FAL164
5 FAL269
1 FAL211
F FAL77
F FAL316
1 FAL79
5 FAL73
F FAL79
1 FAL90
1 FAL201
5 FAL119
1 FAL94
1 FAL280
Z FAL23
F FAL337
1 FAL66
5 FAL143
F FAL295
1 FAL180
F FAL351
5 FAL49
5 FAL304
1 FAL174
1 FAL238
5 FAL172
5 FAL183
5 FAL155
1 FAL193
F FAL277
1 FAL39
5 FAL186
F FAL480
1 FAL205
F FAL384
F FAL8
F FAL9
1 FAL287
F FAL257
1 FAL121
F FAL301
F FAL15
F FAL63
5 FAL148
F FAL226
5 FAL245
Z FAL26
F FAL51
F FAL345
1 FAL81
5 FAL259
1 FAL125
5 FAL212
F FAL150
1 FAL4
1 FAL36
F FAL285
F FAL72
F FAL65
F FAL262
F FAL492
5 FAL323
F FAL452
1 FAL261
F FAL85
5 FAL325
F F
AL354
1 FA
L92
5 FA
L309
1 FA
L240
F F
AL4
5 FA
L256
1 FA
L230
F F
AL468
F F
AL92
F F
AL447
F F
AL45
5 FA
L224
1 FA
L20
5 FA
L174
1 FA
L221
5 FA
L65
5 FA
L268
F F
AL5
5 FA
L31
1 FA
L213
F F
AL313
5 FA
L316
5 FA
L222
5 FA
L266
F F
AL472
1 FA
L104
F F
AL342
F F
AL32
F F
AL84
1 FA
L73
F F
AL62
5 FA
L132
1 FA
L115
1 FA
L99
F SPR123
F SPR108
F FAL124
1 SPR256
F FAL477
F FAL33
F SPR184
F FAL113
F FAL186
F FAL123
F FAL106
F FAL228
F FAL121
F FAL493
1 FAL85
F FAL190
1 FAL215
1 SPR138
F FAL240
Z FAL187
Z FAL153
F FAL142
F FAL211
F FAL283
1 FAL75
F FAL119
F FAL454
F FAL163
5 FAL197
F FAL298
5 FAL251
1 FAL156
F FAL340
F FAL220
F FAL373
F FAL346
F FAL165
F FAL219
F FAL256
F FAL133
F FAL131
F FAL214
F FAL164
F FAL122
F FAL231
F FAL162
F FAL476
Z FAL161
F FAL216
F FAL203
1 FAL161
F FAL104
1 FAL234
F FAL212
5 FAL169
F FAL458
F F
AL230
F F
AL179
F F
AL217
1 FA
L160
F F
AL173
F F
AL252
5 FA
L25 5 F
AL4
5
1 S
PR
10
F F
AL2
32
F F
AL3
20
5 F
AL1
36
F F
AL1
30
F F
AL2
06F F
AL2
91
F S
PR
259
F F
AL3
49
1 F
AL3
0
Z F
AL1
41
F F
AL2
88
F F
AL4
43
F S
PR
215
F F
AL2
22
1 S
PR
257
F F
AL3
44
F FA
L284
1 FA
L236
1 FA
L259
F FA
L99
1 S
PR
143
F S
PR
129
5 FA
L177
1 S
PR
123
F S
PR
217
F S
PR
272
F FA
L197
F S
PR
143
F FA
L68
F S
PR
144
F FA
L495
F FA
L187
1 FA
L128
F FA
L93
1 FA
L69
F S
PR
148
F S
PR
103
F S
PR
219
F FA
L118
F S
PR
206
F F
AL3
43
1 F
AL2
47
F S
PR
238
F F
AL1
59
F F
AL3
17
F S
PR
214
F F
AL2
63
F S
PR
203
F F
AL2
36
F F
AL1
34
F F
AL1
88
F F
AL4
51
F F
AL3
1
F F
AL4
89
F F
AL3
80
F S
PR
104
F F
AL4
83
F S
PR
134
F F
AL1
55
F F
AL2
F F
AL4
98
F F
AL1
9
F F
AL2
79
F F
AL1
67
F F
AL1
66
Z S
PR
109
1 S
PR
136
1 S
PR
45
5 S
PR
187
1 S
PR
209
1 S
PR
242
5 SPR25
7
5 SPR23
0
5 SPR9
5 FA
L322
Z FAL133
Z FAL117
5 FAL3
Z SPR56
Z SPR52
Z FAL12
F FAL57
5 FAL153
Z FAL55
5 FAL240
5 FAL36
5 FAL200
5 FAL227
5 FAL101
F FAL50
5 FAL9
5 FAL294
5 FAL188
5 FAL37
F FAL449
5 FAL74
5 FAL77
5 FAL255
5 FAL163
Z FAL29
Z FAL131
Z FAL101
Z FAL69
F FAL193
5 FAL264
5 FAL84
Z FAL177
F FAL201
5 FAL226
F FAL147
Z FAL115
Z FAL6
Z FAL265
1 SPR269
F FAL61
5 FAL113
Z FAL103
5 S
PR
7
1 F
AL1
53
F F
AL3
48
1 F
AL2
97F FA
L86
F FA
L161
5 FA
L311
F FA
L445F FA
L247F FA
L377F FA
L1451 FA
L1585 FA
L1261
FAL2
5
Z S
PR
1111 S
PR
29Z S
PR
971 FA
L101F
FAL1
725 FA
L296
F FA
L545
FAL6
91 FA
L149
Z FA
L181
1 FA
L38
5 FA
L237
5 S
PR
106Z
SP
R16
1Z S
PR
25
F FA
L20
5 S
PR
246
1 FA
L57
1 SP
R20
6
F FA
L207
F FA
L215
F FA
L198
F FA
L319
F FA
L6
F FA
L341
1 FA
L55
F SP
R30
2
F FA
L144
1 FA
L283
F FA
L139
F FA
L95
F FA
L309
F FA
L183
F FA
L138
1 FA
L157
F FA
L251
1 S
PR
35
Z S
PR
55
5 S
PR
141
5 S
PR
37
5 S
PR
55
5 FA
L286
5 FA
L18
F FA
L175
5 FA
L106
Z FA
L77
Z FA
L209
Z FA
L79
Z FA
L2
1 FA
L317
F FA
L156
Z SP
R121
F SP
R202
Z SP
R156
Z SP
R166
Z SP
R95
5 SP
R51
Z SPR20
2
Z SP
R23
5 SP
R163
Z FA
L7
1 SPR26
1
1 SPR16
9
5 FA
L102
1 FA
L145
F FA
L24
F FA
L1
1 FA
L150
Z SPR10
5
Z SPR20
7
5 SPR19
0
1 SPR11
9
Z SPR20
1
5 SPR27
9
5 SPR15
3
Z SPR5
1 SPR29
6
Z SPR35
Z SPR18
3
5 SPR262
Z SPR119
1 SPR48
1 SPR46
Z SPR13
6
5 SPR16
2
5 SPR10
1
5 SPR67
Z SPR10
1
5 SPR10
2
Z SPR17
3
Z FAL9
0
5 SPR64
5 SPR292
5 SPR213
Z SPR127
Z SPR28
5 SPR214
5 SPR86
Z FAL97
5 SPR139
5 SPR89
1 SPR33
1 FAL275
F SPR213
5 FAL51
5 FAL284
F FAL12
5 FAL297
1 FAL251
5 SPR18
5 FAL216
Z SPR135
5 SPR69
5 SPR264
5 SPR118
Z SPR192
5 SPR191
1 SPR84
1 SPR65
5 SPR16
1 SPR63
Z SPR138
1 SPR285
Z SPR41
5 SPR124
5 SPR122
Z SPR58
Z SPR46
Z SPR19
Z SPR211
Z SPR178
1 SPR142
5 SPR116
1 SPR271
5 SPR249
5 SPR226
Z SPR16
F FAL462
Z SPR176
5 SPR4
Z SPR168
Z SPR213
Z SPR158
Z SPR195
Z SPR198
1 SPR141
Z SPR187
Z SPR193
Z SPR66
Z SPR181
Z SPR210
Z SPR180
5 SPR1
5 SPR92
5 SPR2
F FAL91
F FAL446
5 SPR81
5 SPR245
Z SPR208
Z SPR185
5 SPR156
5 SPR225
Z SPR47
F SPR258
1 SPR314
Z SPR8
Z SPR31
Z SPR29
5 SPR252
Z SPR10
5 SPR57
5 SPR215
Z SPR93
1 SPR27
Z SPR163
Z SPR113
5 SPR151
1 FAL197
Z FAL259
5 SPR111
5 SPR114
F FAL191
F FAL374
F FAL143
F FAL308
F FAL352
F SPR176
1 SPR319
1 SPR317
5 SPR192
Z SPR115
Z SPR107
Z SPR125
Z SPR171
Z FAL31
Z FAL59
Z SPR2
Z SPR7
5 SPR210
1 FAL29
Z SPR117
5 SPR272
1 SPR146
F FAL125
F FAL126
1 SPR247
F SPR178
F SPR274
5 SPR209
5 SPR228
5 SPR161
1 SPR130
5 SPR223
1 SPR134
1 SPR124
1 SPR113
1 SPR260
5 SPR155
5 SPR23
5 SPR235
5 SPR242
5 SPR136
F FAL259
F FAL30
1 SPR70
1 SPR309
5 SPR5
Z SPR103
5 SPR232
5 SPR267
Z SPR59
1 SPR153
5 SPR285
Z SPR36
1 SPR129
5 SPR135
Z FAL197
5 SPR134
5 SPR133
5 SPR143
5 SPR185
Z SPR53
Z SPR12
5 SPR268
5 SPR28
5 SPR238
5 SPR208
Z SPR20
Z SPR90
Z SPR68
Z SPR72
Z SPR78
5 SPR14
Z SPR74
1 SPR14
1 SPR131
5 SPR123
Z SPR18
Z SPR80
1 SPR311
Z SPR88
Z SPR86
Z SPR82
Z SPR84
Z SPR76
5 SPR283
F FAL185
F FAL184
5 SPR73
5 SPR104
5 FAL57
5 SPR83
Z SPR13
1 FAL253
F FAL102
5 FAL277
1 FAL267
1 FAL263
Z FAL93
F FAL460
Z FAL43
5 FAL61
1 FAL200
F FAL465
F FAL245
F FAL189
5 FAL70
1 FAL177
1 FAL310
F FAL174
5 FAL129
5 FAL180
F FAL488
5 FAL19
F FAL2231 FA
L31F FAL2645 FA
L261F FAL225
F FAL500
Z FAL20F FA
L459
5 FAL302
F FAL13Z FA
L41F FAL132
F FAL292
F FAL481
5 FAL283
F FAL233
F FAL484
5 FAL275
Z SP
R17
Z SP
R137
Z SP
R49
Z S
PR
44
Z S
PR
67
Z S
PR
186
Z S
PR
50
Z S
PR
184
Z S
PR
214
Z SPR70
Z SPR160
Z SPR64
1 SPR278
Z SPR92
Z SPR1965 S
PR
221Z SP
R151F FA
L22
F FAL149F FA
L466F FAL141F FA
L53
F FAL36
5 FAL47
F FAL47
5 FAL231F FA
L148F FAL2105 FA
L1355 FAL795 FA
L855 FAL29
Z F
AL261
5 SP
R217
Z F
AL203
1 FA
L63
1 FA
L249
5 FA
L4
1 FA
L60
5 FA
L44
1 FA
L199
F F
AL315
1 FA
L42
F F
AL169
F F
AL375
1 FA
L303
F F
AL14
F F
AL496
Z F
AL76
1 FA
L195
5 FA
L320
1 FA
L308
1 FA
L9
1 FA
L269
F F
AL3
1 FA
L292
1 FA
L187
F F
AL265
5 FA
L273
Z F
AL193
1 FA
L299
F F
AL266
F F
AL249
F F
AL237
F F
AL382
F F
AL250F
FA
L280
1 F
AL1
83
F F
AL3
50
F F
AL3
86
F F
AL1
40
F F
AL2
60
1 F
AL1
85
F F
AL376
F F
AL268
5 FA
L324
1 FA
L279
1 F
AL2
93
F F
AL1
68
1 F
AL2
85
1 F
AL1
48
F F
AL1
12
1 F
AL1
69
F F
AL2
99
Z F
AL1
11
1 F
AL2
77
F F
AL2
72
5 F
AL2
0
F F
AL3
85
F F
AL1
05
1 F
AL1
55
F F
AL2
67
F F
AL4
75
1 F
AL1
10
F F
AL1
07
5 F
AL3
18
F F
AL4
42
F F
AL2
13
F F
AL3
05
F F
AL5
9
1 F
AL1
81
F F
AL2
90
5 F
AL3
15
1 F
AL3
00
F F
AL3
125 F
AL15 F
AL6
0
Z FAL65
Z FAL170
5 FAL281
Z SPR63
Z FAL57
Z FAL14
5 FAL133
F FAL75
Z FAL270
Z FAL30
Z FAL207
5 FAL59
Z FAL28
1 FAL289
1 FAL124
F FAL274
1 FAL53
1 FAL273
F FAL160
1 FAL127
1 FAL175
1 FAL97
1 FAL111
Z FAL205
5 FAL171
5 FAL301
Z FAL264
Z FAL255
Z FAL99
F FAL307
F FAL135
1 FAL265
F FAL379
F SPR181
1 SPR193
1 SPR120
5 SPR239
5 SPR149
1 SPR77
5 SPR41
F SPR286
5 SPR240
F SPR285
F FAL101
F FAL202
5 FAL195
F FAL176
F FAL204
F FAL239
1 FAL282
F FAL339
5 FAL189
F FAL234
F FAL221
1 FAL171
F FAL94
5 FAL257
F FAL170
F FAL117
F FAL371
F FAL199
F FAL108
F FAL304
F FAL115
F FAL152
Z FAL223
Z FAL192
Z FAL175
5 FAL165
5 FAL235
Z FAL72
Z FAL89
Z FAL49
Z FAL167
Z FAL63
1 FAL33
Z FAL15
Z FAL137
5 SPR76
1 FAL165
5 SPR78
5 FAL207
1 FAL243
Z FAL5
5 FAL142
Z FAL269
Z FAL21
Z FAL33
Z FAL199
Z FAL32
5 FAL205
5 FAL40
1 SPR139
Z FAL8
5 FAL92
1 FAL255
5 SPR49
5 FAL117
Z FAL17
5 FAL290
5 SPR207
Z FAL100
Z FAL113
Z FAL195
Z FAL215
Z FAL219
1 FAL203
5 FAL192
5 FAL289
F FAL192
5 FAL213
Z FAL159
F FAL372
1 SPR140
F FAL58
Z FAL10
Z FAL135
Z FAL9
5 FAL122
1 SPR297
Z FAL92
Z FAL91
5 FAL161
5 FAL63
5 FAL201
5 FAL123
Z SPR190
5 SPR8
Z SPR60
Z SPR38
5 SPR40
Z SPR37
Z SPR40
Z SPR199
Z SPR99
5 SPR115
5 SPR62
Z SPR15
Z SPR11
Z SPR189
5 SPR200
Z SPR61
5 SPR22
Z SPR33
1 SPR105
Z SPR209
1 SPR315
5 SPR46
Z SPR205
5 SPR100
Z SPR22
5 SPR269
Z SPR65
Z SPR133
Z SPR139
5 FAL99
Z FAL109
Z FAL173
Z FAL189
Z FAL201
Z FAL81
Z FAL129
Z FAL50
Z FAL257
1 FAL77
F FAL52
Z FAL95
5 FAL157
Z FAL67
Z FAL163
1 FAL179
1 FAL315
Z FAL27
Z FAL107
5 FA
L293
Z FA
L19
1 FAL1
26
F FAL1
36
F FAL2
00
F FAL2
24
F FAL2
38
F FAL1
96
F FAL110
1 FAL207
F FAL146
F FAL21
1 FAL152
5 FAL104
5 SPR12
5 SPR13
0
5 SPR43
5 SPR15
9
1 SPR19
6
1 FAL1
20
1 FAL1
89
1 FAL1
91
5 FAL2
53
5 FAL6
5 SPR183
5 SPR119
1 SPR229
1 SPR237
5 SPR198
5 SPR189
5 SPR186
5 SPR202
1 SPR159
F SPR201
5 SPR53
1 SPR152
1 SPR165
F SPR186
1 SPR128
1 SPR175
1 FAL154
Z FAL213
1 FAL123
Z FAL36
1 FAL245
1 FAL117
5 FAL238
5 FAL146
5 FAL7
1 FAL227
5 FAL94
Z FAL83
Z FAL221
5 FAL308
F FAL254
Z FAL211
F FAL227
F FAL246
5 FAL53
F FAL229
5 FAL17
5 FAL33
1 FAL45
5 FAL234
Z FAL11
Z FAL47
Z FAL25
1 FAL28
1 FAL209
Z FAL73
5 FAL97
5 FAL306
5 FAL23
1 FA
L151
F FA
L109
F FA
L464
Z FA
L53
Z FA
L45
5 FA
L80
5 FA
L112
5 FA
L247
Z FA
L105
5 FA
L299
F FA
L178
F FA
L310
Z FA
L18
Z FA
L225
Z FA
L75
Z FA
L22
Z FA
L35
F FA
L154
F FA
L491
F FA
L457
F FA
L474
1 FA
L295
Z FA
L34
Z FA
L165
Z FA
L13
5 FA
L76
5 FA
L90
F FA
L82
1 FA
L257
F FA
L273
1 FA
L218
Z FA
L61
1 FA
L220
Z FA
L71
F FA
L181
F FA
L296
F FA
L255
F FA
L482
F FA
L100
Z FA
L155
Z FA
L157
1 FA
L146
F FA
L195
F FA
L286
F FA
L461
1 FA
L223
F FA
L209
F FA
L497
Fig. S10. Statistical uncertainty in the hsp60 gene tree by constructing 100 trees using
non-parametric bootstrap re-sampling from the hsp60 alignment. Clades supported in
greater than 80% of bootstraps are indicated with a black dot.
Figure 1.14: Statistical uncertainty in the hsp60 gene tree by constructing 100 treesusing non-parametric bootstrap re-sampling from the hsp60 alignment. Clades supportedin greater than 80% of bootstraps are indicated with a black dot.
49
Chapter 2
Rapid evolutionary innovationduring an Archean Genetic
Expansion
Lawrence A. David, Eric J. Alm
This chapter is presented as it was submitted for publication. CorrespondingSupplementary Material is appended.
Chapter 2
Rapid evolutionary innovation
during an Archean Genetic
Expansion
A natural history of Precambrian life remains elusive because of the rarity of microbial
fossils and biomarkers [79, 80]. The composition of modern day genomes, however,
may bear imprints of ancient biogeochemical events [81–83]. We have employed an
explicit model of macroevolution including gene birth, transfer, duplication and loss
to map the evolutionary history of 3,968 gene families across the three domains of life.
We observe that horizontal gene transfer (HGT) is the primary source of new genes
in prokaryotes, while duplication dominates in eukaryotes. Inter-domain gene trans-
fer is rare compared to intra-domain transfer with the notable exception of massive
Bacteria-Eukarya transfer events that correspond to the endosymbiosis of the mito-
chondria and chloroplasts [84, 85]. Surprisingly, we find that a brief period of genetic
innovation during the Archean eon gave rise to 27% of major modern gene families.
Genes born during this period are especially likely to be involved in electron trans-
port, while later genes exhibit a gradually increasing usage of molecular oxygen. Our
results demonstrate that reconstructing the complex interplay between organismal
and geochemical evolution over Earth history is becoming a tractable goal.
51
Introduction
Describing the emergence of life on our planet is one of the grand challenges of the
Biological and Earth sciences. Yet the roughly three-billion-year history of life pre-
ceding the emergence of hard-shelled metazoans remains obscure [79]. To date, the
best understood event in early Earth history is the Great Oxidation Event, which is
believed to follow the invention of oxygenic photosynthesis by the ancestors of modern
cyanobacteria [86] (though the precise timeline remains controversial [80]). If DNA
sequences from extant organisms bear an imprint of this event, then we can use them
to make and test predictions [81–83]; e.g., genes that use molecular oxygen will be
confined to a group of organisms emerging after the Great Oxidation Event. Transfer
of genes across species, however, can obscure patterns of descent and disrupt our abil-
ity to correlate gene histories with the geochemical record [87]. Widely distributed
genes, for example, may descend from a Last Universal Common Ancestor (LUCA)
as widely believed to be the case for the translational machinery [88], or may have
been dispersed by HGT [12, 89], as in the case of antibiotic resistance cassettes.
Methods Overview
We developed a new phylogenomic method, AnGST (Analyzer of Gene and Species
Trees), to account for the confounding effects of HGT by comparing individual gene
phylogenies to the phylogeny of organisms (the Tree of Life). We refer to this process
as tree reconciliation and provide a detailed description of the AnGST algorithm in
the Supplementary Information. Unlike some previous methods [24, 25, 27], AnGST
uses the topology of the gene family tree rather than just its presence/absence across
genomes and can infer duplication, HGT, and loss events. Importantly, AnGST also
accounts for uncertainty in gene trees by incorporating reconciliation into the tree-
building process: the tree that minimizes the evolutionary cost function, but is still
supported by the sequence data, is chosen as the best gene tree. Simulated trees
inferred with this method are more accurate than trees based on a maximum likeli-
hood model of sequence data alone (Figure 2.5). Thus, tree building methods such as
52
AnGST that explicitly model macroevolutionary events may have utility in phyloge-
netic inference [33]. We used a previously described Tree of Life [90] to reconcile gene
families, although we note that our key results were consistent when using 30 alterna-
tive reference trees, including those that used the Archaea or Eukarya as outgroups
(Figs. 2.11, 2.12). Ensuring proper causality in a large reconciliation (i.e., avoiding
the “grandfather paradox” in which a gene is inferred to be its own ancestor) is a
computationally intractable problem in general [31], which we overcome by explicitly
modeling the timing of evolutionary events based on a chronogram constructed from
our reference tree. A conservative set of eight temporal constraints was selected from
the geochemical and paleontological literature (Table 2.1), and the PhyloBayes soft-
ware package was used to infer a range of divergence times for each ancestral lineage
on the reference tree [91]. We did not apply temporal constraints to lineage ages on
the gene trees.
Results
Domain-specific Macroevolutionary Trends
For 3,968 extant gene families [106], AnGST predicted a total of 109,452 speciation,
38,575 HGT, 14,021 gene duplication, and 35,252 gene loss events (Figure 2.1A). The
abundance of HGT events (on average 9.7 per gene family) underscores the evolu-
tionary importance of gene transfer in prokaryotic genome structure (Figure 2.15).
Domain-specific preferences in the types of macroevolutionary events emerge in Fig-
ure 2.1B. On a per-gene basis, gene transfer is 2.1 times more likely in bacteria than
in eukaryotes, while duplications are 4.4 times more likely in eukaryotes than in bac-
teria. The rate of HGT in eukaryotes is likely to be an overestimate because we did
not consider eukaryote-only gene families. The bias toward duplication in eukary-
otes is consistent with known domain-specific traits, such as unequal crossing-over,
whole-genome duplication events, and reduced selection against large genome sizes
[16, 107, 108]. Interdomain transfers comprise a minority (16.1%) of HGT events
but exhibit significant over-representation of HGT from alpha-proteobacteria to an-
53
Event Constraint Evidence1 Last universal common ances-
tor arises< 3850 Ma Carbon isotope fractionation
[92, 93]2 Cyanobacteria emerge > 2500 Ma Traces of an aerobic nitro-
gen cycle [94], changes inredox-metal enrichments [95],and sulfur isotope fractiona-tion data [96, 97] indicate oxy-genic photosynthesis; tracesof 2α-methylhopane biomark-ers [98] indicate cyanobacterialpresence
3 Eukaryotes diverge from Ar-chaea
> 2670 Ma Preserved sterane biomarkers[98, 99]
4 Akinetes diverge fromcyanobacteria lacking celldifferentiation
> 1500 Ma Akinete microfossils [100]
5 Archaeplastida emerge > 1198 Ma Red algae microfossils [101]6 Animals emerge > 635 Ma Preserved demosponge ster-
anes [102]7 Tetrapods emerge (a) < 385 Ma
(b) > 359 Ma(a) Tetrapod precursor dating[103]; (b) Tetrapod fossil dat-ing [104]
8 Buchnera diverge from Wig-
glesworthia
> 160 Ma Fossil history of Buchnera’saphid hosts [105]
Table 2.1: Temporal constraints used to construct chronogram. Eight temporal con-straints that could be directly linked to fossil or geochemical evidence were used to estimatedivergence times on the Tree of Life (Figure 2.10).
54
cient eukaryotes (p=3.3 × 10−7 Wilcoxon rank sum test) and from cyanobacteria to
plants (p=8.3 × 10−6 Wilcoxon rank sum test). These results likely reflect the an-
cient endosymbioses that gave rise to the mitochondrial and chloroplast organelles
[84, 85]. Functional analysis of HGT from the alpha-proteobacteria to the ancestral
eukaryotes reveals significant enrichment for energy metabolism genes (p=1.6× 10−6
Fisher’s Exact Test), further supporting an association between these HGT and an
energy-producing endosymbiosis (Figure 2.18). HGT from cyanobacteria to Arabidop-
sis thaliana are also enriched for energy-producing genes (p=3.9×10−3 Fisher’s Exact
Test), as well as translation-related genes (p=4.4 × 10−5 Fisher’s Exact Test) which
likely reflect the migration of 70S ribosomal proteins from the chloroplast to the plant
nucleus [109].
An Archean Genetic Expansion
Gene histories reveal dramatic changes in the rates of gene birth, duplication, loss,
and HGT over geologic time scales (Figure 2.2). The most striking feature of the
overall gene flux depicted in Figure 2 is a burst of de novo gene family birth between
3.33-2.85 Ga which we refer to as the Archean Genetic Expansion (AGE). This win-
dow gave rise to 26.8% of extant gene families and coincides with a rapid bacterial
cladogenesis. A spike in the rate of gene loss (∼3.1 Ga) follows the AGE and may
represent consolidation of newly evolved phenotypes, as ancestral genomes became
specialized for newly emerging niches. After 2.85 Ga, the rates of both gene loss and
gene transfer stabilize at roughly modern-day levels. The rates of de novo gene birth
and duplication after the AGE appear to show opposite trends: de novo gene family
birth rates decrease and duplication rates increase over time. The near absence of de
novo birth in modern times likely reflects the fact that ORFan gene families, which
are widespread across all major prokaryotic groups, are not considered in this study
[110]. The excess of gene duplications and ORFans in modern genomes suggests that
novel genes from both sources experience high turnover and rarely persist over long
evolutionary time scales.
What evolutionary factors were responsible for the period of innovation marked by
55
the AGE? While we cannot provide an unequivocal answer to this question using gene
birth dates alone, we can ask whether the functions of genes born during this time
suggest plausible hypotheses. In general, birth of metabolic genes is enriched during
the AGE, especially those involved in energy production and coenzyme metabolism
(Table 2.17), but further inspection also reveals an enrichment for metabolic gene
family birth prior to the AGE. To focus on specific metabolic changes linked to the
AGE we: (i) grouped genes according to the metabolites they used; and (ii) we directly
compared the occurrence of these metabolites in genes born during the AGE to their
abundance prior to the AGE. The results are striking: the AGE-specific metabolites
(positive bars, Figure 2.2 inset) include most of the compounds annotated as redox/e−
transfer (blue bars), with Fe-S-, Fe-, and O2-binding gene families showing the most
significant enrichment (False Discovery Rate < 5%, Fisher’s exact test). Gene families
that use ubiquinone and FAD (key metabolites in respiration pathways) are also
enriched, albeit at slightly lower significance levels (False Discovery Rate < 10%).
The ubiquitous NADH and NADPH are a notable exception to this trend and appear
to have played a role early in life history. By contrast, enzymes linked to nucleotides
(green bars) exhibited strong enrichment in genes of more ancient origin than the
AGE.
The observed metabolite usage bias suggests that the AGE was associated with an
expansion in microbial respiratory and electron transport capabilities. Proving this
association to be causal is beyond the power of our phylogenomic model. Yet this
hypothesis is appealing because more efficient energy conservation pathways could
increase the total free energy budget available to the biosphere, possibly enabling
the support of more complex ecosystems and a concomitant expansion of species and
genetic diversity. We note, however, that while the use of oxygen as a terminal electron
acceptor would have significantly increased biological energy budgets, oxygen-utilizing
genes are only enriched toward the end of the AGE (Figure 2.14). Thus, the earliest
redox genes we identified as part of the AGE were likely to be used in anaerobic
respiration or oxygenic/anoxygenic photosynthesis, although some may have been
co-opted later for use in aerobic respiration pathways.
56
Phylogenomic evidence for ancient changes in global redox potential
Our metabolic analysis supports an increasingly oxygenated biosphere following the
AGE, as the fraction of proteins utilizing oxygen gradually increases from the AGE
until the present day (Figure 2.3; p=3.4×10−8, two-sided Kolmogorov-Smirnov test).
Further indirect evidence of rising oxygen levels comes from compounds that are sen-
sitive to global redox potential. We observe significant increases over time in the usage
of the transition metals copper and molybdenum (Figure 2.3; False Discovery Rate
< 5%, two-sided Kolmogorov-Smirnov test), which is in agreement with geochemical
models of these metals’ solubility in increasingly oxidizing oceans [82, 83] and with
the growth of molybdenum enrichments from black shales that suggests molybdenum
began accumulating in the oceans only after the Archean eon [111]. Our prediction of
a significant increase in nickel utilization accords with geochemical modeling predic-
tions of a 10X increase in dissolved nickel concentration between the Proterozoic and
modern day [82], but conflicts with a recent analysis of banded iron formations that
inferred monotonically decreasing maximum deposited nickel concentrations from the
Archean onwards [112]. The abundance of enzymes using oxidized forms of nitrogen
(N2O and NO3) also grows significantly over time, with 1/3 of nitrate-binding gene
families appearing at the beginning of the AGE and 3/4 of nitrous oxide-binding gene
families appearing by the AGEs end. The timing of these gene family births provides
phylogenomic evidence for an aerobic nitrogen cycle by the Late Archean [94].
One striking discrepancy between our phylogenomic patterns and geochemical pre-
dictions, however, is a modest but significant increase in iron-using genes over time
(Figure 2.3; False Discovery Rate < 5%, two-sided Kolmogorov-Smirnov test). The
cessation of iron formation deposition roughly 1.8 Ga and ocean chemistry models in-
dicate that iron usage should decrease following the Archean, as declining iron solubil-
ity in oxygenated ocean surface waters and sulfide-mediated iron removal from anoxic
deeper waters combined to reduce overall iron bioavailability [113]. The counterin-
tuitive phylogenomic prediction may reflect the confounding effect of evolutionary
inertia, whereby microbes in the face of declining iron availability could have found
57
more success evolving a handful of metal-acquisition proteins (e.g. siderophores),
rather than replacing a host of iron-binding proteins. Alternatively, the insolubility
of iron in modern oceans may be offset by large existing organic pools of reduced iron.
A precise timeline for oxygen availability is beyond the resolution of our relaxed
molecular clock approach and remains a contentious topic in organic geochemistry
[80, 98, 99]. Nonetheless, our results suggest an Archean biosphere containing some
of the basic components required for oxygenic photosynthesis and respiration, despite
the fact that appreciable oxygen levels do not appear in the geological record until
much later (roughly 2.5 Ga) [114, 115]. Although our results are consistent with recent
biomarker-based evidence for early oxygenesis [99], special caution should be used in
comparing the molecular and geological dates. Divergence times for deep nodes on
our reference chronogram are uncertain (Figure 2.16), and they are partially based on
the constraint that the LUCA could be as old as the earliest evidence for life (3.85 Ga)
[92], even though the LUCA is likely a descendant of the first life form. Furthermore,
although the PhyloBayes dates include uncertainty estimates that are accurate given
the assumptions of the CIR model [91], an alternative, semi-parametric approach
implemented in r8s [116] results in a much younger date of 2.75-2.5 Ga for the AGE
(compared to 3.33-2.85 Ga for PhyloBayes) which is closer to the Great Oxidation
Event (Figure 2.12). Here we present mainly results from the PhyloBayes program,
because it allowed us to explicitly account for uncertainty in the timing of inferred
events. With such disparate phylogenetic estimates of the timing of the most ancient
lineages, a chronology for the AGE and the evolution of oxygen-producing genes will
require a careful integration of both geochemical and genetic data.
Implications
Using just eight temporal constraints as our geochemical and paleontological guides,
we have shown that whole genome sequence data can be used to infer details of
microbial evolution during the Archean eon and can recount global changes in redox-
sensitive compound bioavailability following the evolution of oxygenesis. Still, our
phylogenomic approach represents only a first step toward linking whole genome
58
sequence data to early Earth history. By connecting events in gene histories to events
in Earth history, hypotheses of enzyme or pathway presence/absence can be used
to make testable predictions about when metabolic signatures should first appear in
the geochemical record. Conversely, geochemical hypotheses may be tested against
predictions of extant metabolisms (as we demonstrate using the rise of oxygen). This
may admit useful new lines of evidence for geochemical theories that suffer from gaps
in the rock record. Successive refinement of phylogenomic models against geochemical
constraints may eventually yield an abundant and reliable source of Precambrian
fossils: modern-day genomes.
59
2.1 Figures
Phanerozoic
Proterozoic
Archean
a
a
Chloroflexi
Aquificae
Thermotogae
DeinococcalesPlanctomycetes
Bacteroidetes
Chlamydiae
Firmicutes
Transfer
Loss
Birth Chromalveolata
Plantae
Fungi
Metazoa
Crenarcheota
Euryarchaeota
Nanoarchaeota
a
a
0.5 Ga
A
B
Alm, Figure 1
Event Rate
Arch
aea
Bacte
ria
Eu
karya
Transfer Origins
L
D
B
H
0.181
0.188
0.065
0.022
L
D
B
H
0.151
0.171
0.039
0.016
L
D
B
H
0.102
0.083
0.171
0.008
E
A
B
3%
40%
57%
E
A
B
2%
5%
93%
E
A
B
34%
5%
61%
Archae. fulgid.
Methan. acetiv.
Methan. jannas.
Methan. kandle.
Methan. therma.
Pyroco. DSM
Nanoar. equita
.
Sulfol. s
olfat.
Sacc
ha. ce
revi
.
Caenor. e
legan.
.
Dan
io rer
io.
Pan troglo
.Mus
mus
cul.
Thermo. tengco.
Clostr. tetani.
Lister. innocu.Bacill. anthra.
Oceano. iheyen.
Entero. faecal.Lactoc. lactis.
Desulf.
vulg
ar.
Woline. succin.
.
.
Mesorh. loti.
Sinorh. melilo.
Ralsto. solana.
Chromo. violac.
Neisse. mening.
Xylell. fastid.
Pseudo. aerugi.
Salm
on. e
nte
ri.E
scher. co
li.S
hig
el. fle
xne.
Wiggle. glossi.
Therm
o. m
ariti.
Aquife. aeolic
.
Therm
u. th
erm
o.
Dein
oc.
radio
d.
Dehalo
. eth
eno.
Pro
chl.
m.9
313.
Pro
chl.
m.C
CM
P.
Syn
ech
. elo
nga.
.
Bacte
r. theta
i.
Chlam
y. tracho.
Bifid
o. lo
ngum
.
Cory
ne. g
luta
m.
Figure 2.1: Evolutionary events by lineage. (A) The number of macroevolutionary eventsis mapped to each lineage on an ultrametric Tree of Life and visualized using the iToLwebsite [74]. Pie chart area denotes the number of events, and color indicates event type:gene birth (red), duplication (blue), HGT (green), and loss (yellow). (B) The averagenumber of events per gene copy is separated by domain and the origins of HGT events aredepicted: Bacteria (green), Archaea (beige), Eukarya (violet).
60
*Ubiquino.Fructose**Fe
FADH2Ferrocyt.H2O2
Glucose**O2
Oxaloace.Gen e- acceptor
Zn*FAD
.Glyceron.
Oxogluta.Alcohol
AcetateThiamin .Carboxyl.
Phosphoe.NAD+
CysteineBiotin
H2OH+
Glyceral.Methioni.
Glutamat.CTP
CoANH3
CO2Mn
Fructose.Fumarate
tRNANADPH
THFAMP
Aspartat.GMP
eATP
GlycineGlutamin.
PyruvatePyridoxa.
PPi*Orthopho.
FormateDNA
UDPGDP
GTPNTP
CMP**PRPP
**UMP
3.5
3.0
2.5
2.0
Tim
e [G
a]
1.5
1.0
0.5
0.0
4 2 0 2 4 8 10
Transfer
Loss
Birth
DuplicationDuplication
[events / genome / 10 Ma]Gene loss rate Gene gain rate
1 2 30
Archean Gene Expansion
ArcheanProterozoic
ProterozoicPhanerozoic
Present
0
2[log (enrich)] in AGE
12
Redox / e- transfer TCA / GlycolysisCarboxylic AcidAmino AcidNucleotide
Carbohydrate
Other
2[log (enrich)] pre-AGE
Figure 2.2: Rates of macroevolutionary events over time. The figure shows the rates ofmacroevolutionary events (colors as in Figure 2.1) as a function of time. Shown are theaverage rates of evolutionary events per lineage (events / 10 Ma / lineage). Events thatincrease gene count are plotted to the right, and gene loss events are shown to the left (yellowcurve). Genes already present at the LUCA are not included in the analysis of birth ratesbecause the time over which those genes formed is not known. The AGE was also detectedwhen alternative chronograms were considered (Figure 2.12). Inset: metabolites or classes ofmetabolites ordered according to the number of gene families that use them that were bornduring the AGE compared to the number born before the expansion. Metabolites whoseenrichments are statistically significant at a False Discovery Rate < 10% or < 5% (Fisher’sExact Test) are identified using one or two asterisks, respectively. Bars are colored byfunctional annotation or compound type (functional annotations were assigned manually).Metabolites were obtained from the KEGG database release 51.0 [117] and associated withCOGs using the MicrobesOnline September 2008 database [118]. Metabolites associatedwith fewer than 20 COGs or sharing more than 2/3 of gene families with other includedmetabolites are omitted.
61
0.00.51.01.52.02.53.03.5
Time [Ga]
Fe*ZnMnCu*Mo*CoNi*
19.2
2.7
2.8
4.8
2.5
24.7
43.4
40.0
0.0
0.0
0.0
0.0
24.0
15.0
4.8
4.6
4.1
4.0
22.8
44.7
2.9 4.0
Tran
sit
ion
Meta
ls
(174
)
Oxygen* (88)
36.0
Gen
e fr
actio
n n
orm
aliz
ed
to p
resen
t day
1.0
1.1
1.2
1.3
1.4
≥ 1.5
0.9
0.8
0.7
0.6
≤ 0.5
NH3
NO3*N2O*N2
2.4
0.9
5.8
90.8
1.6
0.9
2.8
94.6
0.0
0.0
0.0
Nit
ro
gen
Co
mp
.
(115
)
100.0
SSO4SO3H2SS2O3
14.7
8.0
16.2
54.0
7.1
14.4
8.1
17.0
50.6
19.3
9.7
6.5
22.6Su
lfu
r C
om
p.
(79)
42.0
9.9
CO2CHO2
CH2OCH4OCH4
5.1
3.7
14.0
77.0
0.2
3.8
2.6
14.0
79.6
4.4
2.2
10.0
0.0
C1
Co
mp
.
(214
)
83.3
0.0
1.3
Alm, Figure 3
Figure 2.3: Genome utilization of redox-sensitive compounds over time. The first barillustrates a gradual increase in the fraction of enzymes that bind molecular oxygen predictedto be present over Earth history (p=1.3×10−7, two-sided Kolmogorov-Smirnov test). Colorsindicate abundance normalized to present-day values. The lower four panels group transitionmetals, nitrogen compounds, sulfur compounds, and C1 compounds. The fraction of eachgroup’s associated genes that bind a given compound, normalized to present-day fractions,is shown over time using a color gradient. Enclosed boxes show raw fractional values atthree time points: 3.5 Ga (left); 2.5 Ga (middle); and the present day (right). For example,18.9% of transition metal-binding genes are predicted to have bound Mn at 2.5 Ga, a value1.26 times the size of the modern day percentage of 15.0%. Values within parenthesesgive the overall number of gene families in each group. To determine which compoundsshowed divergent genome utilization over time, the timing of copy number changes for eachcompound’s associated genes was compared to a background model derived from all othercompounds. Compounds whose utilization significantly differs from the background modelare marked with an asterisk (False Discovery Rate < 5%, two-sided Kolmogorov-Smirnovtest). Nitrite and nitric oxide are not shown due to their COG-binding similarity to nitrateand nitrous oxide, respectively.
62
2.2 Supplementary Material
2.2.1 Overview
We developed a phylogenomic method that we named AnGST (Analyzer of Gene
and Species Trees), which “reconciles” any observed differences between a gene tree
and a reference tree (species tree) by inferring a minimal set of evolutionary events,
including horizontal gene transfer (HGT), gene duplication (DUP), gene loss (LOS),
speciation (SPC) and exactly one gene birth or genesis event (GEN). Each event type
is assigned a unique cost, and the overall sum of costs associated with a reconciliation
is minimized (i.e., we use a generalized parsimony criterion). We address previously
described shortcomings of similar parsimony-based models of host-parasite evolution
[119] by accounting for phylogenetic uncertainty (using a new approach described
below) and directly estimating event costs from our large dataset. We divide the
gene-tree/species-tree reconciliation process into two components:
• The basic reconciliation step assumes a known gene tree and species tree and
identifies the set of evolutionary events (HGT, DUP, LOS, SPC, GEN) needed
to explain any discordance between the trees
• The tree amalgamation step accounts for gene tree uncertainty by incor-
porating tree construction into the reconciliation process: multiple gene tree
bootstraps are provided to AnGST and the algorithm retains and combines
bootstrap subtrees which yield the most conservative reconciliation consistent
with the sequence data.
The estimation of event costs from the input data is based on reducing large fluc-
tuations in ancient genome sizes. This method is presented in Section 2.2.5 together
with a sensitivity analysis for the resulting parameters.
The AnGST software package is implemented in the Python programming lan-
guage and can be downloaded from: http://almlab.mit.edu/ALM/Software/.
63
2.2.2 Basic Reconciliation Algorithm
First assumptions
The basic reconciliation step requires a rooted, strictly bifurcating gene tree G and
species tree S. Each tree is composed of a set of nodes linked to one another by a
set of connecting edges. We assume that each node g in G can be mapped to a node
s in S, a mapping we abbreviate as g:s. This mapping describes which (extant or
ancestral) genome hosted a given (extant or ancestral) gene copy. Maps are known
with certainty for extant genes, but must be inferred for ancestral gene copies.
Algorithm explanation
Our goal in gene/species tree reconciliation is to recover the optimal set of evolu-
tionary events that explain any topological discordance between the gene and species
trees. A brute-force search through all possible evolutionary histories is intractable,
as the number of possible histories grows exponentially with increasing tree size [120].
However, for a given gene and species tree pair, there are only |S| possible mappings
for the root node of the gene tree, gr. If the optimal reconciliation is already known
for each possible mapping gr:sr, where sr is a node in S, a new outgroup for the
gene tree can be added (making gr a child of the new root node gn), and optimal
reconciliations for the larger gene tree can be quickly computed using the following
method:
1. For each possible pair of mappings (gr:sr, gn:sn) where sr and sn are nodes in
S
(a) Choose the most parsimonious explanation for how a gene copy in sn de-
scended into sr.
(b) Concatenate this history to the known optimal reconciliation for gr:sr, to
produce the optimal reconciliation for the (gr:sr, gn:sn) pair
2. Identify optimal reconciliations for each mapping gn:sn by selecting the minimal
overall reconciliation cost associated with (gr:sr, gn:sn) as sr is varied over the
64
nodes of S.
Using the above method, the reconciliation problem can be formulated in a dy-
namic programming framework, yielding computational complexity that is a polynomial-
time function of gene tree size. The AnGST program implements this algorithm as a
post-fix traversal of the gene tree. At each node, reconciliations from child subtrees
are combined in mini-reconciliations, which explain how the gene copy at g coalesced
from two child copies c1 andc2 (i.e., whether HGT, speciation, or duplication oc-
curred), assuming the mappings g:s, c1:s1, and c2:s2. This is repeated for each s, s1,
and s2 ∈ S. Mini-reconciliations return optimal duplication-loss or HGT scenarios if s
is the last common ancestor of s1 and s2, or if s is identical to either s1 or s2. All other
combinations of s, s1, and s2, yield mini-reconciliations that we refer to as complex
scenarios. We include these scenarios in the pseudocode below to aid understand-
ing of basic reconciliation design, but we do not provide a method for their solution
since complex scenarios can be safely ignored without loss of reconciliation optimal-
ity (see Running Time discussion below). If g is a leaf node, mini-reconciliations are
unnecessary since the true mapping from g to the species tree is known. Once all
combinations have been evaluated, we retain the optimal reconciliation associated
with each possible mapping of g to the species tree. Pseudocode for the reconciliation
algorithm is provided on the following page in Python style.
65
Pseudocode:
% Main %• Reconcile(gene_tree.root)% Methods %• define Reconcile(node):
• child_1, child_2 = ChildNodes(node) %strictly bifurcating tree• if child_1 AND child_2 are null: %is a leaf node
• for node_map in AllNodes(species_tree): • if node_map is KnownHostGenome(node):
• node.reconciliation_cost(node_map) = 0 %correct answer is known for leaves• else:
• node.reconciliation_cost(node_map) = maxint • return
• Reconcile(child_1) %post-fix traversal• Reconcile(child_2)• for node_map in AllNodes(species_tree): %try all possible hosts for ancestor
• for child_1_map in AllNodes(species_tree): %try all possible hosts for children• for child_2_map in AllNodes(species_tree):
• events = MiniReconcile(node_map, child_1_map, child_2_map)• prior_events_1 = child_1.reconciliation_cost(child_1_map)• prior_events_2 = child_2.reconciliation_cost(child_2_map)• overall_cost = Cost(events + prior_events_1 + prior_events_2)• cost_matrix(node_map, child_1_map, child_2_map) = overall_cost
• for node_map in AllNodes(species_tree):• node.reconciliation_cost(node_map) = Min(cost_matrix(node_map, :, :))
• return
• define MiniReconcile(node_map, child_1_map, child_2_map):• % compute DupLoss scenarios• if node_map is ancestral to child_1_map AND child_2_map:
• if node_map is last_common_ancestor of child_1_map AND child_2_map:• %%% See Page46 for DupLoss pseudocode• duploss_events = DupLoss(node_map, child_1_map, child_2_map)
• else:• duploss_events = ComplexScenario()• %ComplexScenario() not implemented -- see Methods Section 1.1.2 Running Time discussion for explanation
• else:• duploss_events = maxint % impossible to reconcile with only dup-loss
• % compute HGT scenarios• if node_map is child_1_map:
• hgt_events = {HGT from node_map to child_2_map}• elif node_map is child_2_map:
• hgt_events = {HGT from node_map to child_1_map}• else:
• hgt_events = ComplexScenario()• return MinCost(hgt_events,duploss_events)
5
An example reconciliation:
An AnGST reconciliation of two simple, but discordant, gene and species trees is
provided in Figure 2.4. Here, we assume that we know the true mappings from the
leaves in G to S: g1:sA, g2:sC , g3:sB. Because AnGST uses a post-fix traversal of G
and the mapping of G’s leaves to S is trivial, we first investigate how g4 is mapped
to nodes in S. We initialize the algorithm by assigning infinite reconciliation cost to
leaf mappings which deviate from the known leaf mappings (e.g. g1:sB); thus, there
is only one valid mapping for g1 and g2.
In Scenario α, g4 is mapped to sA (g4:sA) and we infer one HGT event using
the mini-reconciliation algorithm (since g4 is mapped to the same lineage as one
of its child nodes). Similarly, if we consider g4:sC , we infer one HGT from sC to
sA (Scenario β). In the case of g4:sE (Scenario γ), g4 is mapped to the LCA of
sA and sE and a duplication-loss scenario is invoked by the mini-reconciler. Other
more complex scenarios exist (e.g., s4:sD), but these can be ignored without affecting
overall reconciliation optimality (see Running Time section below). Once optimal
reconciliations have been found for each possible g4:s mapping, AnGST recurses to
g5 and repeats the process. In the next mini-reconciliation, there are multiple valid
g4:s mappings. Thus AnGST must iterate through prospective mappings for both g5
and g4 (although for the sake of illustrative simplicity, we only enumerate a fraction
of these scenarios).
In the first mapping shown for g5 (g5:sD, g4:sA, g3:sB), there is 1 SPC (since
sA and sB are direct vertical descendants of sD) and this cost is added to the 1
HGT already inferred in Scenario α, which resulted in g4:sA. For the combination
(g5:sC ,g4:sC ,g3:sB), a cost of 1 HGT (because g5 and g4 share the same mapping) is
added to the cost for Scenario β. The last mapping shown is (g5:sE,g4:sE,g3:sB). A
mini-reconciliation that posits HGT will imply forward-in-time gene transfers – an
evolutionary event we do not allow (see Temporal constraints on HGT below). In-
stead, a DUP in sE and subsequent losses among sA and sC are needed to correctly
explain the mapping of s5:sE, s4:sE, and g3:sB. The g5 mapping that leads to the
67
optimal reconciliation is a function of the chosen evolutionary event costs. With a
cost structure: CSPC=0, CHGT =1, CLOS=2, CDUP =3, the optimal mappings would
be g5:sD, g4:sA, and the associated reconciliation would be a GEN event at sD, fol-
lowed by SPC at sD, and an HGT from sA to sC . However, if CSPC=0, CHGT =10,
CLOS=2, CDUP =3, the optimal mapping would be g5:sE, g4:sE, and the associated
reconciliation would be an initial GEN event at sE, followed by a DUP in sE, 2 SPCs
each at sE and sD, and LOS in lineages sA, sB, and sC .
Running time
O(|G|*|S|3) is an upper bound on run-time complexity of AnGST, where |S| and
|G| are the number of nodes in those trees, respectively. Running times can be
significantly reduced without loss of reconciliation optimality, however, with a simple
speedup. When performing mini-reconciliations on all combinations of g:s, c1:s1, and
c2:s2 for s, s1, and s2 ∈ S, any complex scenario (s is not s1 or s2, and s is not the
last common ancestor of both s1 or s2) will require at least two HGT (one to s1 and
another to s2), or one HGT to the last common ancestor of s1 and s2 followed by a
duplication-loss scenario originating at that ancestor. These more complex scenarios
will therefore always be suboptimal with respect to non-complex scenarios and their
evaluation can be skipped during the reconciliation process. The resulting reduction
in mapping search space lowers AnGST run-time complexity to O(|G|*|S|2). When
temporal constraints on HGT are enforced (see below), this speedup cannot be fully
exploited, as nodes ancestral to s1 and s2 are potentially optimal values for s in HGT
scenarios.
In practice, on 3.0Ghz single-cores with access to 8GB of memory, an AnGST run
reconciling 100 bootstrap trees from one gene family against a reference tree of 100
species would take roughly: 0.1 minutes for gene trees with 10 leaves, 4 minutes for
gene trees with 50 leaves, 13 minutes for gene trees with 100 leaves, 27 minutes for
gene trees with 150 leaves, 37 minutes for gene trees with 200 leaves.
68
Temporal constraints on HGT
If provided a chronogram as a reference tree, AnGST will restrict the set of possible
inferred gene transfers to only those between contemporaneous lineages. This feature
eliminates the possibility of inferring multiple HGT events which are chronologically
impossible [31]. Any non-zero chronological overlap is sufficient to allow transfers.
But, if a gene transfer is inferred from node s1 to node s2, subsequent transfers of the
gene copy in s2 may only occur with lineages which exist during the range T1 ∩ T2,
where T1 and T2 are the times spanned by the parent edges of s1 and s2, respectively. A
feature enabling transfers forward in time (which may represent “phantom transfers”
from unsampled taxa [121]) has been built into AnGST, but remains off by default
and was not used in our analyses.
Gene tree rooting
Bootstrap trees are assumed to be unrooted. All possible rootings of these bootstrap
trees are evaluated during the reconciliation process. The resulting gene tree is rooted
on the branch that results in the overall lowest reconciliation score.
2.2.3 Bootstrap tree amalgamation
Errors or uncertainty in gene phylogenies can lead to the inference of spurious macroevo-
lutionary events [32] and is a particular concern for deeply branching phylogenies
[122]. AnGST resolves uncertainty by incorporating reconciliation into the tree-
building process: the tree with the lowest reconciliation cost is chosen from a large
ensemble of trees consistent with the sequence data. To generate an ensemble of
suitable trees, AnGST considers the set of all trees that contains only bipartitions
observed in a set of input trees, which we generate with non-parametric bootstrap-
ping. Thus, AnGST typically outputs chimeric trees that do not match any of the
input bootstrap trees exactly, although every bipartition in the AnGST tree occurs
in at least one of the bootstraps. In simulations, we observe these trees to be signif-
icantly more accurate than trees based on sequence likelihood alone, although they
69
generally have lower likelihood (see Chimeric tree fidelity below and Figure 2.5). Any
number of bootstrap trees can be used, but we found limited increase in accuracy in
simulated data as a result of using more than 10 (data not shown).
We implement this approach in the following manner (see Figure 2.6 for an exam-
ple). Given n gene tree bootstraps {G1, G2, ... Gn} and a reference tree S, AnGST
will begin the basic reconciliation algorithm starting on tree G1. Each time AnGST
evaluates an internal node g of G1, it also evaluates the set of internal nodes I = {g1,
g2, ... gk} in other bootstrap gene trees that define the same bipartition as g. The
optimal reconciliation at this node is the lowest scoring scenario/topology observed
in any of the bootstrap trees. That is, a distinct solution is computed for each pos-
sible mapping (gi:s for gi ∈ I, s ∈ S), and only the best solution is retained for each
value of s. These |S| optimal mappings and their reconciliations are subsequently
shared across all the nodes in I. This last step creates “chimeric” gene trees, as the
reconciliation at g in G1 may now refer to a topology found in bootstrap Gi.
2.2.4 Simulation and Benchmarking
Simulation
Benchmarking
We used simulations to benchmark the performance of AnGST. Ten independent
gene trees births were simulated on each of the 199 extant and ancestral lineages of
the reference tree. A simple Poisson statistics-based model of HGT, DUP, and LOS
was used to generate random gene histories and associated gene trees; the average
simulated gene family underwent 0.21 HGT, 0.05 DUP, 0.76 SPC, and 0.26 LOS per
extant gene copy. (For comparison, our analysis of the COG dataset inferred 0.29
HGT, 0.10 DUP, 0.83 SPC, and 0.27 LOS per extant gene copy.) Synthetic amino
acid sequences were generated using these simulated trees and the SeqGen software
(v.1.3.2) [123]. Trees were reconstructed from the synthetic sequences using either the
BIONJ algorithm (implemented in PhyML), PhyML (v.2.4.5) [72], or AnGST via 100
PhyML-generated bootstrap topologies (see Section 2.2.8 for PhyML parameters). A
70
subset of 75 gene families were used to learn costs for HGT and DUP (see Methods
Section 2.2.5). A cost combination of CHGT =4,CDUP =3 minimized genome size flux
using this gene family subset (compared to CHGT =3, CDUP =2 learned for the COG
dataset).
Chimeric tree fidelity
Following reconciliation, nodes deep in the interior of the resultant gene tree can
contain topologies not found in any of the inputted bootstraps (although all possible
bipartitions of these subtrees will exist in at least one of the bootstraps). Thus, the
potential search space of topologies is vast. We tested the fidelity of the chimeric
gene trees learned during the reconciliation process using the Robinson-Foulds (RF)
statistic [124], which measures the number of bipartitions not shared by a pair of
trees. A 0 RF score indicates perfect concordance (all bipartitions of the candidate
and reference tree are identical) and increasing RF scores denote higher phyloge-
netic discordance. Analysis of the 225 gene trees with a minimal level of complexity
(more than 10 leaves) demonstrates that AnGST trees are significantly more accu-
rate than trees generated by BIONJ (p=9.110-8 Wilcoxon rank sum test) or PhyML
alone (p=1.810-2 Wilcoxon rank sum test). Interestingly, this increase in topological
accuracy comes with a likelihood tradeoff in comparison to the PhyML algorithm
(p=2.710-39 Wilcoxon rank sum test). As an aside, we note that the PhyML like-
lihoods in these analyses are in agreement with previous simulations which showed
PhyML capable of constructing trees with higher likelihood than the true topologies
[72].
Inferred birth date accuracy
We benchmarked the accuracy of gene family birth dates predicted by AnGST using
the 747 synthetic gene families that included more than one extant gene copy. A
comparison of inferred birth events and the simulated age of birth events is shown in
Figure 2.7A. There is a strong correlation between inferred and simulated ages (0.88)
and 76% of births are predicted to within 250 My of their simulated age. These
71
results are especially promising given the noisy processes (sequence simulation and
phylogenetic inference) separating simulation of a gene family and its reconciliation.
Moreover, we see no obvious evidence of inference bias which may lead to the false
inference of a birth spike. Direct comparison of birth counts during the AGE (2.9-3.3
Ga) to simulated births during the same period (Figure 2.7B) did not show a bias
towards over-counting births. We did, however, observe a bias toward gene birth prior
to the AGE, suggesting that our set of very ancient genes (born prior to 3.3 Ga) may
be inflated.
2.2.5 Parameter learning
Minimizing genome size flux
We address the problem of assigning the costs to each event type in a manner similar
to some previous studies [25, 27, 125]: we use predictions of ancestral genome sizes
to constrain the costs CDUP and CHGT (these are the only free parameters as we
can assume CLOS=1 and CSPC=0 without loss of generality). However, we chose
to minimize differences in genome size between parent and child nodes (a metric we
refer to as genome size flux) rather than constraining overall genome size over time for
two reasons: first, gene acquisition rates may not have been constant over time and
ancient genomes may have been smaller (or larger) than modern day genomes [125];
second, the extinction of ancient gene families would lead to a trend of smaller inferred
ancestral genome sizes at earlier times even if actual genome sizes were constant. A
grid search of cost space showed genome size flux to be minimized at: CHGT =3 and
CDUP =2 (Figures 2.8A, 2.9).
Sensitivity analysis
We investigated the extent to which the high fraction of overall gene birth detected
during the Archean Gene Expansion was dependent on model parameters (Figure
2.8B). Gene birth patterns were invariant over a broad range of CDUP . Gene birth
from 2.8-3.4 Ga dissipated only at low CHGT values. However, this regime of CHGT
72
resulted in unrealistic genome size distributions: ancestral genomes were much smaller
than present day ones, and most genes were predicted to have been born on terminal
branches and spread via HGT.
2.2.6 Reference tree construction
Building a Chronogram of Life
We used a previously reported Tree of Life as the template for a reference chronogram
[90]. This template was constructed using a concatenation of 31 translation-related
orthologs. All of the species represented in our gene family dataset were present
in this template tree. Divergence times were estimated using PhyloBayes (v.2.3c)
[126]. Since autocorrelated molecular clock models have been shown to outperform
uncorrelated ones in some cases similar to this study [91], we ran PhyloBayes with
a CIR process model of rate correlation. Eight sets of temporal constraints that
could be directly linked to fossil or geochemical evidence were used and are displayed
in Figure 2.10. Benchmarking PhyloBayes runs in parallel (n=95) established that
predicted divergence times and model likelihood converged after a burn-in of roughly
1500 model cycles. Final divergence time estimates were estimated following a burn-
in of 2500 cycles, after which trees were sampled every 20 cycles until the 3500th
cycle.
2.2.7 Alternate reference trees
We tested the extent to which the AGE was sensitive to the topology of the reference
tree and to the molecular clock model used in chronogram construction. We built 10
separate reference phylogenies using non-parametric bootstrapping of the Ciccarelli et
al. gene alignment [90] (see Section 2.2.8 for PhyML parameters) and rooted each with
either the Bacteria, the Archaea, or the Eukarya as the outgroup. Unequivocal errors
in phylogeny that may be due to sequence alignment construction errors were observed
for the Bdellovibrio, Shigella, Treponema, and Helicobacter pylori taxa; these were
resolved by manual pruning and re-grafting. Each phylogeny was then converted to an
73
ultrametric tree using r8s (v.1.71) [116] under a penalized likelihood model (with an
additive penalty function, truncated Newton nonlinear optimization, cross-validation
enabled, a cross validation start value of 10, a cross validation smoothing increment
of 3, and the number of smoothing values tried set to 4). The same set of temporal
constraints was used as for PhyloBayes. For the purposes of computational economy,
a subset of 250 COGs was randomly selected from our dataset and reconciled against
each of the 10 alternate chronograms.
Predicted birth ages were robust to the usage of alternative chronograms. The
median gene family birth date difference between any two alternative chronograms
is 0.09 Ga (Fig. 2.11) Elevated rates of gene birth during the Late Archean were
observed in all 30 of the alternative chronograms and on average, 19% of the 250
chosen COGs were predicted to be born during a 200-My window (Fig. 2.12). How-
ever, the timing of AGE-like window diverged from that reported by PhyloBayes
and spanned 2.7-2.5 Ga (compared to 3.3-2.9 Ga for PhyloBayes). Inspection of the
r8s chronograms suggests this temporal discrepancy may be related to differences in
dating the cyanobacteria under the two models. For both the r8s and PhyloBayes
chronograms, AGE-like events coincided with the relatively brief period during which
the major bacterial phyla, such as the Firmicutes and Proteobacteria, diverged from
the last bacterial common ancestor. This period of compressed cladogenesis predates
the appearance of the cyanobacteria by roughly 100-200 My in both models. Our
r8s analysis places the initial occurrence of the cyanobacteria at 2.5 Ga, which is
precisely the minimum age constraint for the appearance of this clade (see Section
2.2.6). Using the same constraints, PhyloBayes predicts the cyanobacteria to have
emerged 3.0 Ga. We confirmed the importance of the cyanobacteria in dating the
AGE by reducing the minimum constrained age of this clade during r8s chronogram
construction; the resultant model yielded a younger AGE (data not shown).
2.2.8 Gene tree construction
Families of orthologous genes used in this study are based upon functionally anno-
tated orthologous groups from the COG database [127], as extended to a wider set
74
of genomes in the eggNOG database [106]. Due to computational limitations, we re-
stricted this study to a subset of 100 of these genomes (11 eukaryotic, 12 archaeal, and
67 bacterial) broadly distributed across the Tree of Life. Sequences were downloaded
from the eggNOG database in September of 2008.
eggNOG-derived families were filtered to ensure usable levels of sequence conser-
vation with the aim of excluding the most error-prone phylogenies. We performed
this filtering in an iterative fashion: First, we excised poorly aligned regions of se-
quence [90, 128], using Gblocks (0.91b) [129] with the minimum number of sequences
for a flank position set to half the number of sequences in the alignment, the maxi-
mum number of contiguous non-conserved positions set to 8, the minimum length of
a block set to 2, and the allowed gap positions set to all. Second, we excluded genes
with more than 20% of their sequence in these excised regions from each gene family.
Third, Muscle (v3.7) [130] was used with default settings to realign the remaining se-
quences. This process then returned to the first step, unless no sequences or regions
were removed in the first or second steps in which case the process terminated. Of
the original 4872 COGs, 788 lost more than 25% of their original gene copies during
this process; these COGs were considered likely to be error-prone and thus excluded
from further analysis. Another 101 COGs were not analyzed due to their high gene
copy numbers and the extreme computational demands of running AnGST on those
large families. A distribution of gene copy numbers within each gene family is shown
in Figure 2.13.
Phylogenetic trees were constructed for the remaining gene families using ver-
sion 2.4.5 of PhyML [72] and the following parameters: 100 bootstrap trees, a JTT
substitution model, 0.0 percentage of the sites were invariable, 4 substitution rate
categories, a gamma distribution parameter of 1.0, a BIONJ-based starting tree, and
both tree topology and branch length optimization were enabled.
75
Figure 2.4: Example of a basic reconciliation. An AnGST reconciliation of twosimple, but discordant, gene (G) and species (S) trees is shown. The mapping of leaves ofG to S: g1:sA, g2:sC , g3:sB is indicated with color (e.g., g1 and sA are both shown in blue).Reconciliation proceeds in a post-fix manner through the gene tree, first evaluating possiblemappings from g4 to nodes in the S. Once the reconciliation process is completed at g4,the algorithm continues at g5. A detailed explanation of this reconciliation is provided inSection 2.2.2.
76
−20 −10 0 100
2
4
6
8
10
RF
Scor
e
! log likelihood
AnGSTBIONJPhyML
Figure 2.5: AnGST trees are more accurate than likelihood trees in simulation studies.We simulated the evolution of sequence data using 225 randomly generated gene treeswith more than 10 leaves. Gene trees were reconstructed from synthetic sequence datausing either BIONJ (red), PhyML (black), or AnGST (blue). Phylogenetic accuracy wasevaluated by Robinson-Foulds (RF) score. A 0 RF score indicates perfect concordance(all bipartitions of the candidate and reference tree are identical) and increasing RF scoresdenote higher phylogenetic discordance.The logarithm of sequence likelihood given each treemodel, relative to the likelihood calculated with the true gene topology, is plotted on the Xaxis. Mean RF scores and relative log likelihoods are drawn with rectangles whose heightand width reflect standard errors of the mean; protruding lines are standard deviations.PhyML-based trees enjoy significantly higher likelihood scores than the AnGST chimerictrees (p = 2.7×10−39 Wilcoxon rank sum test), but the AnGST-based trees are significantlymore similar to the correct gene tree topologies (p = 1.8 × 10−2 Wilcoxon rank sum test).Outlying points beyond axes were not drawn to facilitate viewing mean values, but wereincluded in mean and significance estimations.
77
Figure 2.6: Amalgamation algorithm for phylogenetic uncertainty. An AnGSTreconciliation of four gene tree bootstrap topologies {G1, G2, G3, and G4} and species treeS is shown. Leaf nodes on each bootstrap map to leaves on S according to color. Thereconciliation begins on one of the bootstrap trees, G1 (Step 1) and proceeds to an interiornode (Step 2). The reconciliation does not consider other topologies for this subtree, as itonly contains two leaves. When the reconciliation reaches the parent node node g1 (Step 3),AnGST considers subtrees from other bootstraps with alternative topologies (but identicalleaves). Corresponding subtrees are found on G3 and G4 and rooted at nodes g3 and g4
respectively (Step 4). Reconciliations are performed in parallel at g1, g3, and g4. For themapping of these internal nodes to lineage A on the species tree, the reconciliation at g3 isoptimal (since its topology matches the reference one) and the corresponding subtree in G3
is substituted for the mappings g1:sA and g4:sA.
78
−4 −3.5 −3 −2.5 −2 −1.5 −1 −0.5 0−4
−3.5
−3
−2.5
−2
−1.5
−1
−0.5
0
Simulated Birthdate [Ga]
Infe
rred
Birth
date
[Ga]
−4 −3.5 −3 −2.5 −2 −1.5 −1 −0.5 00
0.5
1
1.5
2
2.5
Birthdate [Ga]
Birth
bia
s ra
tio
B
A
Figure 2.7: Benchmarking AnGST inference accuracy. A) A scatter plot of simu-lated gene family birth dates and inferred birth dates. Points drawn signify midpoints ofbranches associated with birth events. A slight amount of Gaussian noise with distributionN(µ=0, σ=0.025) has been added to each point so that overlapping points can be distin-guished. The correlation coefficient is 0.88 and 76% of predicted births are within 250 My(bounded by red dashed lines) of their true ages. B) Birth prediction bias is plotted as afunction of time. Predicted births have been normalized by the number of simulated birthsassociated with a given age. The AGE (2.9-3.3 Ga) is highlighted in pink.
79
012345678
0
5
10
5
10
15
20
25
30
35
CHGT
Genome size flux sensitivity to HGT and DUP costs
CDUP
Gen
ome
size
flux
10
15
20
25
30
0 1 2 3 4 5 6 7 8 9
0
2
4
6
8
100
0.05
0.1
0.15
0.2
0.25
0.3
0.35
0.4
CHGT
Gene birth spike sensitivity to HGT and DUP costs
CDUP
Frac
tion
of a
ll CO
Gs
born
from
2.8−3
.4 G
a
0.05
0.1
0.15
0.2
0.25
0.3
0.35
B
A
Figure 2.8: AnGST parameter learning and sensitivity analysis. A) We performeda grid search over the costs CHGT and CDUP with the intention of minimizing averagegenome size flux between inferred ancestral genomes. The costs CLOS and CSPC were fixedat 1.0 and 0.0, respectively. Flux can clearly be minimized along the CHGT axis, but isless sensitive to changes in CDUP . A minimum point does exist, however, at CHGT =3.0,CDUP =2.0. B) A sensitivity analysis for our detection of a high fraction of births from 2.8-3.4 Ga was performed over the same parameter space evaluated in A). Comparable fractionsof overall gene birth to the AGE were detected in parameter space near the genome sizeflux minimum. Note that axes are reversed in panels A and B in order to facilitate viewingparameter sensitivity landscapes.
80
10
Legend:
Dataset sizes
A
Color ranges:
foo_ffd9b3
foo_ffecd9
foo_d9d9ff
foo_b3b3ff
foo_d7ffce
foo_9ee08c
Thermo. a
cidop.
Archae. fulgid.
Haloba. sp.
Methan. acetiv.
Methan. jannas.
Methan. kandle.
Methan. therma.
Pyroco. DSM
Nanoar. equita
.
Aeropy. p
ernix.
Sulfol. s
olfat.
Pyroba. a
eroph.
Gia
rdi. lam
bli.
Pla
sm
o. fa
lcip
.C
rypto
. hom
ini.
Ara
bid
. th
alia
.S
acc
ha. ce
revi
.
Caenor. e
legan.
Dro
sop.
mel
ano.
Dan
io rer
io.
Hom
o sa
pien
.
Pan tr
oglo
.Mus
mus
cul.
Thermo. tengco.
Clostr. tetani.
Onion phytop.Mycopl. genita.
Mycopl. gallis.
Staphy. aureus.
Lister. innocu.Bacill. anthra.
Bacill. subtil.
Oceano. iheyen.
Lactob. planta.Entero. faecal.Lactoc. lactis.Strept. pneumo.
Desu
lf. v
ulgar.
Bdello. b
acter.
Geobac. su
lfur.
Sol
iba.
usita
t.
Acido
b. s
p.
Woline. s
uccin.Helico. pylori.Helico. hepati.
Campyl. jejuni.
Wolbac. sp.Ricket. prowaz.Caulob. cresce.Rhodop. palust.
Bradyr. japoni.
Mesorh. loti.
Brucel. abortu.
Sinorh. melilo.
Agroba. tumefa.
Nitros. europa.
Bordet. pertus.
Ralsto. solana.
Chromo. violac.
Neisse. mening.
Coxiel. burnet.
Xantho. campes.
Xylell. fastid.
Pseudo. aerugi.
Shewan. oneide.
Photob. profun.
Vibrio choler.
Haem
op. influe.
Salm
on. e
nte
ri.E
scher. co
li.S
hig
el. fle
xne.
Yersin
. pestis.
Wiggle. glossi.
Buch
ne. a
phid
i.
Therm
o. m
ariti.
Aquife. aeolic
.
Fusoba. nucle
a.
Therm
u. th
erm
o.
Dein
oc. ra
dio
d.
Dehalo
. eth
eno.
Glo
eob. vi
ola
c.Syn
ech.
sp.
Pro
chl.
m.9
313.
Pro
chl.
m.C
CM
P.
Syn
ech
. elo
nga.
Nost
oc
sp.
Porphy. gingiv.
Bacte
r. theta
i.C
hloro. tepidu.Chlam
y. tracho.
Chlam
y. pneumo.
Tre
pon. p
allid
.
Borre
l. burg
do.
Lepto
s. inte
rr.R
hodop. b
altic.
Bifid
o. lo
ngum
.
Tro
phe. w
hip
pl.
Mycoba. tu
berc
.
Cory
ne. g
luta
m.
Stre
pt. c
oelic
.
183
2507
Figure 2.9: Inferred ancient genome sizes for CHGT=3.0, CDUP=2.0. Circle areasscale absolutely with genome sizes. Our optimization metric, genome size flux, aims tominimize the average difference between parent and child genomes. Ancestral genomes arepredicted to be smaller than modern day ones; this may reflect the evolution of increasinglycomplex genomes, and/or the extinction of ancestral gene families. Genome sizes for theLUCA (183 genes) and the modern-day genome of E. coli (2507 genes) are labeled in darkblue. Metazoan genomes appear only slightly larger than prokaryotic ones because theCOGs used in this study were originally defined using only unicellular organismsTatusov2000, which thus biased our analyses of eukaryotic genomes towards only microbially-relatedgenes.
81
10
Color ranges:
euk
arc
bac
Giardia lamblia ATCC 50803
Plasmodium
falciparum
Cryptosporidium hom
inis
Saccharom
yces cerevisiae
Caenorhabditis elegans
Dro
sophila
mela
nogaste
r
Hom
o s
apie
ns
Pan tro
glo
dyte
s
Mus m
uscu
lus
Danio
rerio
Arabidopsis thaliana
Therm
opla
sm
a a
cid
ophilu
m D
SM
1728
Arc
haeoglo
bus fu
lgid
us D
SM
4304
Halo
bacte
rium
sp N
RC
!1 M
eth
anosarc
ina a
cetivora
ns C
2A
Pyro
coccus furiosus D
SM
3638
Meth
anocald
ococcus jannaschii D
SM
2661
Meth
anopyru
s k
andle
ri A
V19
Meth
anoth
erm
obacte
r th
erm
auto
trophic
us s
tr D
elta H
Nanoarc
haeum
equita
ns K
in4!
M
Aero
pyru
m p
ern
ix K
1S
ulfo
lobus s
olfa
taric
us
Pyro
baculu
m a
ero
philu
m Therm
oanaero
bacte
r te
ngcongensis
MB
4
Clo
str
idiu
m teta
ni E
88
Mycopla
sm
a g
enitaliu
m G
37
Mycopla
sm
a g
alli
septicum
R
Onio
n y
ello
ws p
hyto
pla
sm
a O
Y!M
Lact
obaci
llus
pla
nta
rum
WC
FS1
Ente
roco
ccus
faeca
lis V
583
Lact
ococ
cus
lact
is s
ubsp
lact
is Il
1403
Strep
toco
ccus
pne
umon
iae
TIG
R4
Liste
ria in
nocua C
lip11262
Bacillus a
nthra
cis str Ste
rne
Bacillus subtili
s subsp subtilis str 1
68
Oceanobacillu
s iheyensis H
TE831
Sta
phyloco
ccus
aureus
subsp
aure
us NCTC
8325
Chlorobium tepidum TLS
Porphyromonas gingivalis W83
Bacteroides thetaiotaomicron VPI!5482
Chlamydia trachomatis DUW!3CX
Chlamydophila pneumoniae AR39
Bifidobacterium longum NCC2705Tropheryma whipplei str TwistStreptomyces coelicolor A3(2)
Mycobacterium tuberculosis CDC1551
Corynebacterium glutamicum ATCC 13032
Treponema pallidum subsp pallidum str Nichols
Borrelia burgdorferi B31
Leptospira interrogans serovar Lai str 56601
Pirellula sp
Fusobacterium nucleatum subsp nucleatum ATCC 25586
Thermotoga m
aritima M
SB8
Aquifex aeolicus VF5
Synechococcus s
p W
H 8
102
Pro
chlo
rococcus m
arin
us s
tr MIT
9313
Pro
chlo
rococcus m
arin
us s
ubsp m
arin
us s
tr CC
MP
1375
Syn
ech
oco
ccus e
longatu
s PC
C 6
301
Nosto
c sp P
CC
7120
Gloeobacter violaceus P
CC 7421
Thermus therm
ophilus HB27
Deinococcus radiodurans R
1
Dehalococcoides ethenogenes 195
Desulfo
vib
rio v
ulg
aris
str H
ildenboro
ugh
Bdello
vib
rio b
acte
riovoru
s
Geobacte
r sulfu
rreducens P
CA
Solib
acte
r usita
tus E
llin6076
Acid
obacte
ria b
acte
rium
Ellin
345
Wolb
achia
sp w
Mel
Ric
kettsia
pro
wazekii s
tr M
adrid E
Caulo
bacte
r cre
scentu
s C
B15
Rhodopseudom
onas p
alu
str
is C
GA
009
Bra
dyrh
izobiu
m japonic
um
US
DA
110
Mesorh
izobiu
m loti M
AFF303099
Bru
cella
abortus
Sin
orh
izobiu
m m
eliloti 1
021
Agro
bact
erium
tum
efa
ciens
str C
58
Nitr
oso
monas
euro
paea A
TCC 1
9718
Bor
dete
lla p
ertu
ssis
Ral
ston
ia s
olan
acea
rum
GM
I100
0
Chro
mobact
erium
vio
lace
um A
TCC
12472
Neis
seria
menin
gitidis
MC
58
Pseudomonas aeruginosa PAO1
Shewanella oneidensis MR!1Photobacterium profundum
Vibrio cholerae O1 biovar El Tor str N16961Haemophilus influenzae Rd KW20
Wigglesworthia glossinidia endosymbiont of Glossina brevipalpis
Buchnera aphidicola str APS (Acyrthosiphon pisum)
Yersinia pestis KIM
Escherichia coli K!12
Shigella flexneri 2a str 301
Salmonella enterica subsp enterica serovar Typhi str Ty2
Xanthomonas campestris pv campestris
str ATCC 33913
Xylella fastid
iosa 9a5c
Coxiella b
urnetii
RSA 493
Cam
pylo
bacte
r jeju
ni s
ubsp je
juni N
CT
C 1
1168
Wolinella s
uccin
ogenes D
SM
1740
Helicobacte
r pylo
ri 2
6695
Helicobacte
r hepaticus A
TC
C 5
1449
2
1
3
4
5
6
7b
7a
8
Event Constraint
1Last universal common ancestor emerges
> 3850 Ma
2Cyanobacteria emerge
> 2500 Ma
3Eukaryotes diverge from Archaea
> 2670 Ma
4Akinetes diverge from cyanobacteria lacking cell differentiation
> 1500 Ma
5Archaeplastida emerge
> 1198 Ma
6 Animals emerge > 635 Ma
7 Tetrapods emerge(a) < 385 Ma(b) > 359 Ma
8Buchnera diverge from Wigglesworthia > 160 Ma
Figure 2.10: Temporal constraints. Eight fossil and biogeochemical constraints wereused to constrain the chronogram (evidence cited can be found in Table 2.1). Those con-straints are overlaid onto the reference phylogeny.
82
0 0.5 1 1.5 2 2.5 3 3.5 40
0.5
1
1.5
2
2.5
3
3.5
4
Gene family birth using topology A [Ga]
Gen
e fa
mily
birt
h us
ing
topo
logy
B [G
a]
Supplementary Figure 8: Sensitivity of predicted birth ages to variation in reference tree
topology. Ten bootstraps of the reference tree were rooted using either the Bacteria, Archaea, or
Eukarya as an outgroup and subsequently processed with r8s, producing 30 alternative reference
chronograms. Birth dates were inferred for 250 gene families using each chronogram. We graph
1000 random combinations of alternative chronogram pairs and gene families in the scatter plot
above (correlation coefficient = 0.86). The median gene family birth date difference between
any two alternative chronograms is 0.09 Ga.
2010-03-03429A-Z
25
Figure 2.11: Sensitivity of predicted birth ages to variation in reference treetopology. Ten bootstraps of the reference tree were rooted using either the Bacteria,Archaea, or Eukarya as an outgroup and subsequently processed with r8s, producing 30alternative reference chronograms. Birth dates were inferred for 250 gene families usingeach chronogram. We graph 1000 random combinations of alternative chronogram pairsand gene families in the scatter plot above (correlation coefficient = 0.86). The mediangene family birth date difference between any two alternative chronograms is 0.09 Ga.
83
Supplementary Figure 9: Gene family birth using 30 alternative reference tree topologies.
Shown above are cumulative distribution functions (CDFs) of total COG birth over time for the
30 alternative reference chronograms (light gray lines). Mean CDFs for the Bacteria, Archaea,
and Eukarya as outgroups are shown using green, red, and blue dashed lines, respectively.
Overall (solid black line), the period 2.7-2.5 Ga witnesses a gene family birth spike of on
average 0.23 families born per 1 Ma and accounts for the birth of 19% of the COG families
studied. By contrast, birth rates average 0.07 families born per 1 Ma from 2.5 Ga-present day.
−3.5 −3 −2.5 −2 −1.5 −1 −0.50
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
Time [Ga]
0.00.51.01.52.02.53.03.54.0
Exis
tin
g C
OG
fracti
on
Overall average CDFAverage CDF w/ euk. outgroupAverage CDF w/ bac. outgroupAverage CDF w/ arc. outgroupIndividual chronogram CDF
2010-03-03429A-Z
26
Figure 2.12: Gene family birth using 30 alternative reference tree topologies.Shown above are cumulative distribution functions (CDFs) of total COG birth over time forthe 30 alternative reference chronograms (light gray lines). Mean CDFs for the Bacteria,Archaea, and Eukarya as outgroups are shown using green, red, and blue dashed lines,respectively. Overall (solid black line), the period 2.7-2.5 Ga witnesses a gene family birthspike of on average 0.23 families born per 1 Ma and accounts for the birth of 19% of theCOG families studied. By contrast, birth rates average 0.07 families born per 1 Ma from2.5 Ga-present day.
84
0 50 100 150 200 250 300 350 4000
200
400
600
800
1000
1200
1400
Num
ber o
f gen
e fa
milie
s
Gene family size
Figure 2.13: Histogram of COG family sizes. The median COG family in our datasetpossesses 18 gene copies, and 93% of COG families have 100 or fewer gene copies.
85
4.0 3.5 3.0 2.5 2.00
0.05
0.1
Time [Ga]
O2−
bind
ing
CO
G fr
actio
n
**
Figure 2.14: O2-utilizing gene birth over time. The fraction of compound-bindingCOG births which bind O2 is shown over time. A chi-square test was used to comparethe overall number of COGs born and the number of O2-binding COGs born in 100 Mywindows: prior to the AGE (3.7 Ga), at the height of the AGE (3.25 Ga), and at the tail ofthe AGE (2.85 Ga). Comparisons with p < 0.05 are denoted with asterisks on the graph.These data suggest that changes in O2 usage came toward the end of the AGE.
86
0 50 100 150 200 250 300 350 4000
20
40
60
80
100
120
HG
T ev
ents
Gene family size
y = 0.085x
y = 0.303x
LCA age > 2.8 GaLCA age <= 2.8 Ga
Figure 2.15: HGT counts vs. gene family size. The average gene family reconciliationyields 9.7 inferred HGT events. The number of HGT events inferred grows with the numberof gene copies in a COG family. Gene family HGT counts also grow with the age of the lastcommon ancestor of all genomes represented in the family, suggesting that HGT is morefrequent among gene families spanning wider phyletic range. We note that y-intercepts forthe above line fittings have been forced to equal 0.
87
Supplementary Figure 13: Confidence intervals for divergence times on reference
chronogram. Confidence intervals (95%) were estimated by PhyloBayes and are shown next to
each divergence point on the tree. Values are in units of Ga.
Thermoplas.acidoArchaeoglo.fulgiHalobacter.sp.Methanosar.aceti1.06-0.481.34-0.78
Methanocal.jannaMethanopyr.kandlMethanothe.therm1.05-0.621.27-0.83Pyrococcus.furio
1.62-1.041.8-1.25
2.02-1.59
Nanoarchae.equitAeropyrum.perniSulfolobus.solfa1.56-1.0Pyrobaculu.aerop1.77-1.37
2.01-1.65
2.33-2.05
Giardia.lamblPlasmodium.falciCryptospor.homin1.26-0.82Arabidopsi.thaliSaccharomy.cerevCaenorhabd.elegaDrosophila.melanDanio.rerioHomo.sapiePan.trogl0.19-0.1Mus.muscu0.34-0.21
0.57-0.370.97-0.7
1.1-0.821.4-1.04
1.61-1.241.77-1.37
2.08-1.77
2.82-2.67
Thermoanae.tengcClostridiu.tetan1.95-1.57Onion.yelloMycoplasma.genitMycoplasma.galli0.72-0.352.03-1.51Staphyloco.aureuListeria.innocBacillus.anthrBacillus.subti0.74-0.27Oceanobaci.iheye0.9-0.41
1.19-0.711.6-1.05
Lactobacil.plantEnterococc.faecaLactococcu.lactiStreptococ.pneum0.44-0.210.95-0.571.26-0.77
1.9-1.36
2.54-2.17
3.15-2.87
Desulfovib.vulgaBdellovibr.bacteGeobacter.sulfu2.51-2.332.64-2.36Solibacter.usitaAcidobacte.sp.1.96-1.31
2.97-2.78
Wolinella.succiHelicobact.pylorHelicobact.hepat1.4-1.261.42-1.28Campylobac.jejun
1.89-1.71Wolbachia.sp.Rickettsia.prowa2.14-1.14Caulobacte.crescRhodopseud.palusBradyrhizo.japon0.61-0.17Mesorhizob.lotiBrucella.abort0.82-0.27Sinorhizob.melilAgrobacter.tumef0.6-0.16
0.99-0.351.53-0.7
2.02-1.252.65-2.23
Nitrosomon.europBordetella.pertuRalstonia.solan1.11-0.61Chromobact.violaNeisseria.menin1.07-0.631.72-1.23
1.98-1.68
Coxiella.burneXanthomona.campeXylella.fasti0.8-0.41Pseudomona.aerugShewanella.oneidPhotobacte.profuVibrio.chole0.84-0.6Haemophilu.influSalmonella.enterEscherichi.coliShigella.flexn0.66-0.590.67-0.6Yersinia.pesti
0.83-0.74Wiggleswor.glossBuchnera.aphid0.54-0.39
0.9-0.781.0-0.86
1.09-0.951.2-1.06
1.94-1.732.25-1.97
2.34-2.08
2.5-2.21
2.94-2.73
3.03-2.84
3.14-2.93
Thermotoga.maritAquifex.aeoli2.43-1.85Fusobacter.nucle3.0-2.75Thermus.thermDeinococcu.radio1.68-1.07Dehalococc.ethen2.98-2.76Gloeobacte.violaSynechococ.sp.Prochloroc.marin0.71-0.24Prochloroc.marin1.16-0.7Synechococ.elongNostoc.sp.1.54-1.5
1.77-1.672.09-1.92
3.14-2.96
3.22-3.03
3.29-3.09
Porphyromo.gingiBacteroide.theta0.58-0.19Chlorobium.tepid2.63-1.38Chlamydia.trachChlamydoph.pneum0.88-0.23
3.14-2.42
Treponema.palliBorrelia.burgd2.7-2.19Leptospira.inter2.76-2.3Rhodopirel.balti3.05-2.61Bifidobact.longuTropheryma.whipp1.36-0.72Mycobacter.tuberCorynebact.gluta0.87-0.47Streptomyc.coeli1.62-1.06
2.01-1.363.18-2.76
3.29-3.05
3.37-3.18
3.39-3.21
100
3.85-3.77
0.1 Ma
2010-03-03429A-Z
30
Figure 2.16: Confidence intervals for divergence times on reference chronogram.Confidence intervals (95%) were estimated by PhyloBayes and are shown next to eachdivergence point on the tree. Values are in units of Ga.
88
Meta-function Function COG CodeNumber of
COGs
Fraction of genes studied
Fraction of AGE births
AGE birth enrichment
Fraction of pre-AGE
births
pre-AGE birth enrichment
pre-AGE vs. AGE p-value
Information storage & processing
Translation
Information storage & processing
RNA proc.Information storage &
processingTranscriptionInformation storage &
processingReplication, recombination
Information storage & processing
Chromatin struct.
Cellular processes & signaling
Cell cycle control
Cellular processes & signaling
Defense mech.
Cellular processes & signaling
Signal transduction
Cellular processes & signaling
Cell wall/membraneCellular processes & signaling Cell motility
Cellular processes & signaling
Cytoskeleton
Cellular processes & signaling
Intracell. trafficking
Cellular processes & signaling
Post-trans. modification
Metabolism
Energy prod. & conv.
Metabolism
Carb. trans. & met.
Metabolism
Amino acid trans. & met.
MetabolismNucleotide trans. & met.
MetabolismCoenzyme trans. & met.
Metabolism
Lipid trans. & met.
Metabolism
Inorganic ion trans. & met.
Metabolism
Secondary metabolites
Poorly characterizedFunc. unknown
Poorly characterizedGeneral func. pred.
J 197 0.049 0.061 1.234 0.150 3.042 0.000A 17 0.004 0.001 0.219 0.002 0.377 1.000K 173 0.043 0.030 0.681 0.042 0.962 0.246L 155 0.039 0.034 0.868 0.068 1.746 0.002B 11 0.003 0.000 0.000 0.002 0.655 0.338
D 56 0.014 0.013 0.903 0.012 0.854 1.000V 29 0.007 0.008 1.083 0.001 0.097 0.033T 106 0.027 0.028 1.044 0.020 0.762 0.405M 141 0.035 0.050 1.424 0.060 1.702 0.487N 82 0.021 0.031 1.493 0.015 0.718 0.064Z 5 0.001 0.000 0.000 0.000 0.000 1.000U 122 0.031 0.032 1.047 0.022 0.710 0.274O 157 0.039 0.043 1.101 0.042 1.070 1.000
C 211 0.053 0.079 1.499 0.068 1.289 0.490G 186 0.047 0.056 1.205 0.051 1.087 0.730E 226 0.057 0.079 1.394 0.109 1.923 0.054F 83 0.021 0.026 1.228 0.059 2.835 0.001H 155 0.039 0.056 1.439 0.069 1.772 0.326I 72 0.018 0.024 1.306 0.032 1.795 0.334P 182 0.046 0.058 1.276 0.046 0.998 0.298Q 70 0.018 0.015 0.847 0.004 0.216 0.045
S 1186 0.298 0.183 0.615 0.071 0.238 0.000R 560 0.141 0.142 1.010 0.113 0.804 0.105
Figure 2.17: Function of gene births prior to and during the AGE. Functionalenrichment of gene birth from 2.8-3.3 Ga is shown for the 20 COG functional categories. Atwo-tailed Fisher exact test was used to compute the p-value of a difference in total COGbirths prior to the AGE vs. during the AGE, for each functional category (last column).
89
Meta-function FunctionCOG Code
HGT out of cyano into
plant
HGT out of cyano
Fisher p-value
HGT out of alpha-proteos
into euks
HGT out of alpha-proteos
Fisher p-value
Information storage & processing
Translation
Information storage & processing
RNA proc.Information storage
& processingTranscriptionInformation storage
& processingReplication, recombination
Information storage & processing
Chromatin struct.
Cellular processes & signaling
Cell cycle control
Cellular processes & signaling
Defense mech.
Cellular processes & signaling
Signal transduction
Cellular processes & signaling
Cell wall/membraneCellular processes & signaling Cell motility
Cellular processes & signaling
Cytoskeleton
Cellular processes & signaling
Intracell. trafficking
Cellular processes & signaling
Post-trans. modification
Metabolism
Energy prod. & conv.
Metabolism
Carb. trans. & met.
Metabolism
Amino acid trans. & met.
MetabolismNucleotide trans. & met.
MetabolismCoenzyme trans. & met.
Metabolism
Lipid trans. & met.
Metabolism
Inorganic ion trans. & met.
Metabolism
Secondary metabolites
Poorly characterized
Func. unknownPoorly characterized General func. pred.
J 38 1225 4.41E-05 13 779 9.34E-02A 0 1 1.00E+00 0 0 1.00E+00K 2 297 3.34E-01 0 231 1.78E-01L 4 550 1.52E-01 0 324 8.23E-02B 0 7 1.00E+00 0 8 1.00E+00
D 3 164 7.42E-01 0 98 6.28E-01V 0 116 4.25E-01 0 59 1.00E+00T 2 341 2.54E-01 3 209 4.84E-01M 5 735 6.06E-02 0 513 6.24E-03N 0 80 6.36E-01 0 168 4.23E-01Z 0 0 1.00E+00 0 0 1.00E+00U 6 273 3.20E-01 1 305 3.79E-01O 11 549 3.72E-01 9 367 1.55E-02
C 26 949 3.95E-03 21 621 1.55E-06G 9 541 7.21E-01 2 335 5.86E-01E 16 1205 6.23E-01 5 701 5.57E-01F 6 486 8.49E-01 1 270 5.32E-01H 13 1057 5.12E-01 7 397 1.98E-01I 9 295 5.06E-02 4 247 3.33E-01P 12 918 6.76E-01 5 472 1.00E+00Q 3 163 7.41E-01 0 106 6.30E-01
S 16 1601 8.04E-02 5 1103 3.73E-02R 19 1496 4.34E-01 9 797 7.17E-01
Figure 2.18: Biases in gene function associated with ancient endosymbioses.Shown here is a functional breakdown of HGT from cyanobacteria into Arabidopsis thaliana,and from the alpha-proteobacteria to ancient eukaryotes (which we defined as eukaryoticlineages predating the divergence of Arabidopsis). The significance of functional enrichmentsfor genes associated with the chloroplast are calculated by comparing the observed numberof HGT from cyanobacteria into the plant lineage, to the total number of HGT originatingin the cyanobacteria, so as to account for functional biases associated with HGT out of thecyanobacteria. Similar statistics are performed on mitochondria-related genes. P-valuesthat fall below a 5% False Discovery Rate cutoff in are underlined.
90
Chapter 3
Building prokaryotic species treesfrom thousands of gene trees
Lawrence A. David, Albert Y. Wang, Eric J. Alm
Experiments in this chapter were performed in collaboration with Albert Wang inthe Alm laboratory.
Chapter 3
Building prokaryotic species trees
from thousands of gene trees
3.1 Abstract
Sequenced-based approaches for reconstructing prokaryotic species trees from more
than one gene utilize either the concatenation of sequences (supermatrices) or the
merger of separate gene trees (supertrees) [131]. Yet, supermatrices ignore differ-
ences in evolutionary rate and nucleotide compositional bias between concatenated
genes, which can ultimately reduce the accuracy of inferred phylogenetic trees [132].
Supertree methods can account for these inter-gene heterogeneities, but a commonly-
used supertree technique, matrix representation, violates the site-independence as-
sumption underlying many phylogenetic construction algorithms. Here, I extend an
alternative supertree method known as Gene Tree Parsimony (GTP), which chooses
the species tree as the topology that requires the least number of duplication events
to be inferred when compared to a set of gene trees [133]. My GTP model, named
GAnG, can account for horizontal gene transfer events (HGT) and is suitable for
use with prokaryotic gene trees. Preliminary GAnG analysis of 250 archaeal gene
trees built from a subset of the sequenced archaeal genomes supports hypotheses that
Nanoarchaeota diverged from the last ancestor of the Archaea prior to the Crenar-
chaeota/Euryarchaeota split, but also appears sensitive to HGT artifacts between the
92
Crenarchaeota and the Thermoplasma.
93
3.2 Introduction
Prokaryotic species phylogenies are frequently built by concatenating sequences from
more than one gene [131], in order to sample a range of phylogenetically-informative
characters [134], and to mitigate the role of topological artifacts caused by long branch
attraction [122, 135] and nucleotide composition bias [136, 137]. These genes can
be restricted to “core” sets that are topologically congruent [90, 138], under the
assumption that horizontal gene transfer (HGT) artifacts would only be present if
identical HGTs affected all of the genes in the core set. However, reconstructions of
organismal histories using core genes have been criticized for ignoring the majority
of phylogenetic signal present in genomes and have been referred to as “The tree[s]
of one percent” [139]. In order to claim that species trees represent overall genome
evolution, computational tools are required that can incorporate genetic information
from hundreds, or even thousands, of genes.
Previous studies that have built prokaryotic species trees from multiple gene fam-
ilies have utilized either “supertree” or “supermatrix” approaches [90, 138, 140, 141].
Under the supermatrix approach, sequences from orthologous gene families are con-
catenated into a single sequence, which can subsequently be inputted into a standard
phylogenetic construction algorithm. This method is robust to gene families with lim-
ited taxonomic distribution, as simulations have shown supermatrices to perform well
even in regimes where only 10% of species possess a given gene copy [142]. However,
the concatenation of prokaryotic genes risks assembly of sequences with divergent
evolutionary histories due to HGT [143]. Supermatrix approaches must therefore as-
sume that the phylogenetic signal from vertically-descended genes overwhelms any
signal from transferred regions of the alignment. Moreover, species trees cannot be
constructed in a computationally-efficient way if differences in gene evolutionary rate
or nucleotide composition are taken into account. Simulations have shown that par-
titioning concatenations and fitting these parameters on a per-gene basis improved
the fidelity of reconstructed phylogenies in nearly all reported scenarios, and in some
cases enabled the reliable recovery of clades that were never inferred by a conven-
94
tional supermatrix procedure [132]. However, supermatrices that model evolutionary
heterogeneity between genes cannot utilize existing tree building algorithms and have
so far required exhaustively searching tree space in order to identify an optimal tree
[132, 134].
Supertree methods provide an alternative phylogenetic framework that can model
evolutionary parameters on a per-gene basis, while still constructing a species phy-
logeny in a computationally-efficient manner. Under this framework, gene trees are
built separately for each orthologous gene family and subsequently merged to form
a species tree. Supertree methods are distinguished by how gene tree merger takes
place. According to the most popular merger method [144], matrix representation
using parsimony (MRP), a binary character matrix is built that possesses a column
for each internal node in the set of gene trees, and a row for each of the species sam-
pled. Species with a gene copy descended from a given internal node all share a “1”
in the corresponding column of the matrix, and all other species share a “0” [145]. A
consensus tree is formed by inputting the character matrix into a sequence-based phy-
logenetic construction algorithm. Because the supertree framework allows trees to be
built for individual gene families, the model is capable of accounting for differential
evolutionary rates and nucleotide compositions between genes, unlike conventional
supermatrices.
However, the MRP algorithm in particular has been criticized for its analogy
between the tree matrix and a nucleic acid sequence matrix [133]. This analogy
violates the site-independence model assumption of many phylogenetic reconstruction
algorithms [76], since sibling leaves on a gene tree will be partitioned similarly for
each MRP matrix column that corresponds to one of these leaves’ ancestral nodes.
In practice, this site dependence should manifest as a supertree bias towards subtree
topologies from large gene trees with many ancestral nodes and therefore more sites
in the tree matrix. This bias is likely deleterious, as large gene trees can be caused
by frequent HGT and duplication events that impede partitioning gene families into
orthologous groups.
Gene tree parsimony (GTP) provides an alternative supertree merger criterion
95
that has received less statistical criticism than MRP, but that also cannot be used
on prokaryotic gene trees in present implementations. First introduced by Slowinski
in 1997, the GTP approach chooses the species tree as the topology that requires
the least number of duplication events to be inferred when species and gene trees are
compared (a step called a reconciliation) [146]. This reconciliation-based approach
avoids the matrix construction step that has led to criticism of MRP supertree meth-
ods. However, existing implementations of GTP can only model gene duplication and
gene loss events [147, 148], limiting their usage to eukaryotic phylogenies.
Here, I enable the use of GTP supertrees on prokaryotic trees, through a new
program that I have named GAnG. This algorithm reconciles gene trees and species
trees using a generalized parsimony model I previously developed, the Analyzer of
Gene & Species Trees, or AnGST [149], which accounts for HGTs, as well as gene
duplication and gene loss events, in gene family evolution. Experiments with simu-
lated data show that GAnG can accurately quantify species tree accuracy. As a proof
of concept, I also used the algorithm to learn a phylogeny of 12 sequenced archaeal
species from 250 gene trees with divergent topologies and that have likely undergone
HGT. Results of this analysis support the basal position of Nanoarchaeum equitans
on the archaeal tree.
96
3.3 Methods
3.3.1 Approach overview
The GAnG algorithm is an iterative process composed of four primary steps:
1. Initial topology generation and reconciliation: An initial reference tree
is constructed using one gene, or a concatenation of multiple genes. Individual
gene trees are also constructed for all orthologous gene families. After trees
have been built, each of the gene trees is reconciled against the species tree
using AnGST.
2. Proposal of tree refinements: The HGTs inferred from reconciliations be-
tween the species tree and gene trees are mined to propose subtree-prune-and-
regraft (SPR) moves. These SPRs are used to generate candidate species tree
topologies (Section 3.3.2). Alternatively, new rootings of the species tree can
be evaluated (Section 3.3.3).
3. Evaluation of candidate refinements: Each candidate species tree is eval-
uated by comparison to the set of gene trees using AnGST, as described in
Section 3.3.4.
4. Selection of a new species tree: The best scoring candidate tree is identified.
If this tree has a better reconciliation score than the current species tree, the
species tree is replaced with the candidate tree and the algorithm returns to
Step 2. Otherwise, the process terminates and returns the present species tree.
3.3.2 Generating candidate species trees
GAnG searches species tree space using an iterative subtree-prune-and-regraft (SPR)
strategy that generates a candidate species trees from an existing species tree by
pruning off subtrees and regrafting them onto new locations on the tree. This heuristic
approach is necessary because it is unlikely that a polynomial-time algorithm exists
for finding an optimal species tree using AnGST [150]. An exhaustive search strategy
97
is also not possible, since the number of possible rooted binary trees grows super-
exponentially with the number of species s, following the function (2s−3)!! [151] (for
the 14 genomes in Fig. 3.2, there are nearly eight trillion possible organismal trees).
SPR moves enable dramatic changes in tree topology using relatively few topo-
logical operations and are used by tree construction algorithms like PhyML 3.0 [152],
FastTree [153] and RaxML [154] to avoid being caught in local minima in tree space.
An SPR-based strategy for refining species trees has the added advantage of com-
plementing AnGST reconciliation process. High-frequency HGTs can be eliminated
if an SPR between the transferred nodes (in the reverse direction of the HGT) is
performed on the species tree. Consequently, GAnG generates a candidate species
tree for each of the SPR moves associated with the 100 most frequently inferred HGT
on the present species tree.
3.3.3 Rooting the species tree
Unlike typical supermatrix or supertree methods, GAnG does not require an outgroup
sequence to to infer a rooted species tree. Since the species tree root orients the
direction of vertical inheritance at internal nodes, alternative rootings of the same
unrooted species tree will have distinct AnGST scores when reconciled against a
gene tree. Therefore, in addition to incorporating SPR moves, the candidate tree
generation routine can also vary the position of the root node on the species tree.
3.3.4 Scoring a candidate species tree
A score S for a putative species tree topology T quantifies how well the species tree
fits the set of gene trees according to the AnGST model:
S(T ) =�
G
Rec(G|T ) (3.1)
where Rec(G|T ) is the AnGST reconciliation score for a gene tree G, given the can-
didate species tree T . The AnGST model assigns a score of 3, 2, and 1 to each HGT,
duplication, and loss event inferred, respectively [149].
98
3.4 Preliminary results
I have performed two preliminary tests using the GAnG algorithm, using simulated
and real-world data generated from my previous study of microbial genome evolution
[149]. These tests suggest:
1. The AnGST-based scoring function can discern between species trees of varying
accuracy (Section 3.4.1)
2. HGTs can be used to propose SPR moves on the species tree that lead to lower
AnGST reconciliation scores (Section 3.4.2).
3.4.1 AnGST scores increase with true species tree permu-
tation
I evaluated how AnGST reconciliation scores changed as increasing phylogenetic noise
was added to the Tree of Life [90], in order to test with real-world gene trees if the
GAnG tree search process could be attracted towards the correct species tree. This
experiment used 125 microbial gene families randomly selected from my previous
study of microbial genome evolution [149]. A total of 100 sequenced genomes spanning
all three domains of life were represented in these gene trees. Because the true species
tree for the 100 sampled genomes is not known, I could not directly evaluate whether
or not the GAnG algorithm could use the gene trees to recover the true species tree.
Instead, I tested if the GAnG algorithm behaved in a manner consistent with less
accurate species trees receiving higher reconciliation scores than more accurate species
trees. I simulated a continuum of species tree accuracy by taking a topology (the Tree
of Life) likely similar to the true tree, and randomly permuting it with between 1 and
10 SPR random moves. This procedure was used to generate 300 alternate species
trees. Gene tree reconciliation scores against the species trees increased linearly with
the number of SPRs performed on the Tree of Life (Figure 3.1; R2 = 0.67), suggesting
that more accurate species trees will receive lower reconciliation scores than less
accurate species trees.
99
3.4.2 HGTs can be used to find a better-scoring tree
The GAnG algorithm was run to completion on a set of 12 sequenced archaeal species
and 250 gene trees randomly chosen from my previous study on microbial genome
evolution [149]. The initial species tree topology was pruned from the Ciccarelli &
Bork Tree of Life (Fig. 3.2A) [90]. The algorithm terminated after four iterations,
yielding a refined species tree whose reconciliations with gene trees were an average of
3.7% lower than the initial tree (Figure 3.2B). A single SPR move of the Crenarchaeota
to a basal euryarchaeal lineage provided the most dramatic change in the refined
topology, shifting N. equitans to a basal position on the archaeal tree.
100
3.5 Discussion
GAnG employs the AnGST reconciliation model to become the first GTP-based su-
pertree method capable of accounting for HGT events and therefore appropriate for
inferring prokaryotic species trees. Usage of the AnGST model carries two additional
benefits. First, the AnGST algorithm includes a bootstrap amalgamation step that
constructs a chimeric gene tree from the bootstrap subtrees that best conform to the
species tree. As a result, gene families that evolved by vertical descent, but whose
phylogenetic reconstructions are sensitive to sequence sampling variation, will be less
likely to mislead the GTP process. Second, HGT events inferred by AnGST are
directed between specific branches on the species tree and can therefore provide a
heuristic guide for refining species trees. Frequent HGTs between two lineages may
be the result of a topological error on the species tree and can be resolved by making
these lineages sibling to one another on an updated tree.
Preliminary analyses of GAnG on simulated data demonstrate that the algorithm’s
scoring function will be attracted towards correct species topologies (Fig. 3.1). The
observed linear relationship between species tree score and tree randomization demon-
strates also suggests that GAnG can accurately quantify relative differences in inac-
curacy among trees. Thus, future implementations of GAnG’s tree search algorithm
may be able to employ more sophisticated search algorithms, such as gradient descent.
The GAnG analysis on real-world data showed the algorithm capable of inferring
a prokaryotic species tree largely in line with prior archaeal phylogenies, despite not
relying on a “core” set of gene families free of HGT (Fig. 3.2). A single SPR move of
the Crenarchaeota to a basal euryarchaeal lineage provided the most dramatic refine-
ment to the starting Ciccarelli & Bork archaeal tree. This SPR shifted N. equitans
from its initial sibling position with the Crenarchaeota (a position not supported by
the archaeal phylogenetic literature) to a more basal position outgrouping both the
Euryarchaeota and the Crenarchaeota. This outgroup location of N. equitans has
been reported previously [155], but remains controversial [156]. The refined archaeal
tree also includes a paraphyletic euryarchaeal clade that features the Thermoplas-
101
matales and Crenarchaeota as sibling to one another. A close relationship between
the Thermoplasmatales and the Crenarchaeota has been reported by previous phy-
logenetic studies [157–159] and has been attributed to HGT between thermoplasma
species and the Crenarchaeota [138].
Finally, these experiments both indicate that GAnG could be run on datasets
composed of thousands of gene trees. The analysis of the archaeal tree with 250 gene
trees took on the order of 3 days on a computer cluster. GAnG running time increases
linearly with the number of gene trees, so analysis of a 1000 gene tree set would
require approximately two weeks of compute time – a time-consuming experiment,
but not prohibitively long. Moreover, a potential running time speedup is currently
in development (see Future Directions below).
102
3.6 Future Directions
Future studies will investigate several potential improvements to the GAnG model,
specifically in species tree scoring, running time improvement, and gene tree weight-
ing. The benefits of each of these model changes will be investigated using gene and
species tree simulation software that I have previously written during the development
of AnGST [149].
A more accurate and more efficient version of GAnG will also be tested on a
much larger archaeal dataset of 9053 gene families drawn from 70 sequenced archaeal
genomes [160]. The resulting archaeal tree will be used to test hypotheses involving
the basal position of the Korarchaeota and Thaumarchaeota on archaeal phylogenies
and whether N. equitans represents a separate archaeal phylum.
3.6.1 Model improvements
Scoring modifications
The scoring function Eq. 3.1 may be biased by larger gene trees, which are likely to
have a higher variance in reconciliation scores. An alternative scoring function robust
to this bias is:
S(T ) =�
G
Sign(Rec(G|T )− Rec(G|R)) (3.2)
This score is negative if there are more gene trees whose reconciliation scores are lower
with the candidate tree than with the present species tree R. In this case, a candidate
species tree would be accepted and become the new present species tree. Evidence
that S(T ) will be positive for incorrect candidate trees is presented in Section 3.4.1
of the Preliminary Results. Summary of a tree’s fitness using the sign function also
facilitates an algorithmic speedup described below in Reducing running time.
Alternatively, if we assume that reconciliation score variance grows with reconcili-
ation score, exceptional changes in reconciliation score can be captured by the scoring
103
metric:
S(T ) =�
G
Log
�Rec(G|T )− Rec(G|R)
Rec(G|R)
�(3.3)
In contrast to Equation 3.1, however, this scoring function may be biased by smaller
gene trees, which are likely to have smaller reconciliation scores.
Reducing running time
I can avoid reconciling the entire set of gene trees if it can be quickly determined
that a candidate species tree is less fit than the current species tree. I can perform
this speedup using the scoring function in Equation 3.2 and by assuming that gene
trees evolve independently from one another. Under this model, determining the sign
of S(T ) can be likened to the problem of determining if a coin is fair using a finite
number of coin tosses. To compute if there is a bias towards [Rec(G|T )−Rec(G|R)]
being positive or negative, we can count the total number of G for which this value is
negative, N−, or non-zero, N , and estimate the probability that a candidate species
tree that changes the reconciliation score of G causes a lower reconciliation score:
p =N−
N(3.4)
However, p is an estimated probability with a standard error sp
sp =
�p(1− p)
N(3.5)
and a maximum error of
E = Z × sp (3.6)
where Z is taken from a table of Z-values and associated confidence levels, calculated
using a normal distribution [75].
If p − E > 0.5, the candidate topology can be accepted at the chosen confidence
level. Similarly, if p + E > 0.5, the candidate topology can be rejected. Otherwise,
the maximum error is too high to determine if the tree should be accepted or rejected
104
and more gene trees need to be reconciled. One way to determine the additional
number of reconciliations to perform in this case would be to calculate E = |0.5− p|,
or the maximum error associated with p so that it can be determined if p is positive
or negative. It can be shown that the number of reconciliations necessary to achieve
this error, NE, is equal to:
NE =Z2p(1− p)
E2(3.7)
Gene tree weighting
Core gene approaches to species tree construction usually exclude gene families that
show evidence of HGT [90]. However, the strict exclusion of gene families that carry
only weak signals of HGT may be overly conservative. A single HGT event causes
only one bifurcation on a gene tree to not be explained by vertical inheritance (i.e.
a speciation event). A large gene tree with relatively few HGT can thus still be
informative for phylogenomic purposes. This intuition can be incorporated into a
scoring function by multiplying each gene tree score by a weighting factor w(G). One
method for calculating this weight would be to take the ratio of the number of HGT
possible given a gene family’s taxonomic distribution (which can be approximated by
the square of the number of species L represented in the gene family), to the number
of HGT inferred on the gene tree, HGT(G):
w(G) =L(G)2
HGT(G)(3.8)
Translation-related gene families, which have been previously been relied upon for
constructing archaeal phylogenies [138], are weighted more highly according to this
scheme than the average gene family (p < 10−30, Rank Sum test; Fig. 3.3). A caveat
to this weighting scheme, however, is that HGT(G) will change as the species tree is
refined by GAnG.
I will also evaluate via simulation an alternative gene tree weighting function that
uses gene tree branch lengths to downweight the influence of gene families suspected of
105
bearing transferred or duplicated genes. This approach relies on the assumption that
genetic distances between gene copies descended from an HGT event will be shorter
than expected, whereas genetic distances between certain gene copies descended from
a duplication/loss scenarios will be longer than expected (see Figure 3.4 for an il-
lustration of these phenomena). The following weighting function accounts for this
evidence of non-vertical descent:
w(G) = 1/ minr
�
si,sj
(rD(si, sj|G)−D(si, sj|R)) (3.9)
This metric iterates over all pairs of species si and sj represented in a gene tree G,
and computes their pairwise distance on G using the function D. The difference
between this distance and the expected distance (taken from the reference tree R) is
then summed. To account for gene-specific rates of evolution, a rate term r is fit to
G using a minimization function.
3.6.2 A tree of all sequenced archaea
Once I have completed optimizing the GAnG algorithm, I will construct an archaeal
phylogeny using the arCOG dataset of 9504 archaeal gene families, which span 88%
of sequenced archaeal genomes [160]. Due in part to the challenges of cultivating
archaeal species [161], only 70 genomes have so far been sequenced from this domain
of life; 4 of these genomes are the sole representatives of 3 of the 5 known archaeal
phyla [155, 162–164] (Fig. 3.5). This uneven taxon sampling, along with suspected
HGT events and unequal rates of lineage evolution, makes resolution of deep nodes
on the archaeal phylogeny sensitive to the choice of analyzed genes. For example, a
phylogeny built from a core set of 27 large and 23 small ribosomal subunit proteins
supports a basal position of N. equitans on the archaeal tree, but the phylogeny of
just the small subunit proteins suggests this species may only be a rapidly-evolving
euryarchaeote [156].
An archaeal species tree built using the full arCOG dataset should correctly posi-
tion N. equitans, unless phylogenetic artifacts systematically bias a large proportion
106
of genes in archaeal genomes. This tree may help resolve ongoing debates regarding
the origins of other archaeal phyla. Previous studies with varying sets of core genes
have placed the Korarchaeota either basal on the archaeal tree [165, 166], deep within
the Crenarchaeota [162], or sibling to the Euryarchaeota [167]. Core gene studies have
also reported conflicting positions for the Thaumarchaeota, positioning the phylum
either within the Crenarchaeota [162] or basal on the archaeal tree [167, 168]. The
GAnG algorithm offers a quantitative method for testing hypotheses of korarchaeal
and thaumarchaeal origins against thousands of gene trees. Lessons learned from re-
solving these questions in archaeal history may also prove insightful for future efforts
that use GAnG to reconstruct a larger three domain Tree of Life.
107
0 1 2 3 4 5 6 7 8 9 10
50
55
60
65
Degree of reference tree randomization [SPRs]
AnG
ST s
core
Figure 3.1: Average AnGST gene tree reconciliation scores as species tree ran-domization increases: Between 1-10 random SPR moves were made to 300 copies of theTree of Life [90], to produce a continuum of species tree accuracy. Each of the these speciestrees was reconciled against a set of 125 gene trees and the average AnGST score for eachreconciliation is plotted on the y-axis. The R2 of the best fit line (shown in red) to thesedata is 0.67.
108
njplottemp_3 Mon May 03 22:14:21 2010 Page 1 of 1
Thermoplasma acidophilum
Archaeoglobus fulgidus
Halobacterium sp.
Methanosarcina acetivorans
760.385
760.385
1087.550
327.167
Methanocaldococcus jannaschii
Methanopyrus kandleri
Methanothermobacter thermautotrophicus
857.400
857.400
1092.450
235.052
Pyrococcus furiosus
257.340
1349.790
469.760
207.521
1804.730
247.419
Nanoarchaeum equitans
Aeropyrum pernix
Sulfolobus solfataricus
1244.660
1244.660
Pyrobaculum aerophilum
255.626
1500.290
1803.860
303.570
368.697
369.573
Homo Sapiens
540.331
2713.764
Escherichia coli
1109.430
3823.1945e+002
njplottemp_2 Mon May 03 22:14:04 2010 Page 1 of 1
Halobacterium sp.
Methanosarcina acetivorans
535.540
535.540
Archaeoglobus fulgidus
112.407
647.947
Methanopyrus kandleri
Methanothermobacter thermautotrophicus
556.634
556.634
Methanocaldococcus jannaschii
107.244
663.877
117.217
101.286
Pyrococcus furiosus
60.834
825.998
Pyrobaculum aerophilum
Aeropyrum pernix
621.057
621.057
Sulfolobus solfataricus
78.368
699.424
145.682
272.256
Thermoplasma acidophilum
20.789
992.469
Nanoarchaeum equitans
758.779
1751.248
Homo Sapiens
470.469
2221.716
Escherichia coli
1628.284
3850.0005e+002
A B
Ma Ma
Figure 3.2: An archaeal phylogeny refined using 250 gene trees and HGT-proposed SPRs: (A) Initial phylogeny of archaeal species pruned from the Ciccarelli &Bork Tree of Life [90]. Crenarchaeal lineages are labeled in blue, euryarchaeal lineages arelabeled in red, and the nanoarchaeal lineage is labeled in green. The most dramatic acceptedSPR move on this topology is shown using a black arrow. (B) The resulting archaeal treeafter 4 iterations of tree refinement. The average AnGST gene tree reconciliation scoreusing this refined tree decreased 3.7% relative to the initial phylogeny.
109
0 1 2 3 4 5 6 7 80
10
20
30
40
50
60
70
80
90
100
Log(SpeciesCount2)
HG
T
All GenesTranslation Genes
Figure 3.3: Inferred HGT as a function of the number of species representedin a gene tree: Each of the 9053 arCOG gene trees was reconciled against an archaealspecies tree (Fig. 3.5). The number of inferred HGT for each gene family is plotted againstthe square of the number of species represented in the gene family (a rough approximationof the number of possible HGT) on a log-scale. The 201 gene families annotated by thearCOG database as involved in translation are shown in red, and all other gene familiesare shown in blue. Gene tree weights, as calculated by Eq. 3.8 are significantly higher fortranslation-associated gene families (p < 10−30, Rank Sum test).
110
A
B
C
D
E
F
A
C
B
D
E
F
A
B
C
D
E
F
A
C
B
D
E
F
L
LL
L
D
L
L
T
T
Duplication HGT
Sp
ec
ies
Tre
eG
en
e T
re
e
Figure 3.4: Expected gene tree branch lengths following duplication or HGT:Two hypothetical evolutionary histories are shown for a gene family. On the left, an ances-tral duplication event (D), followed by four loss events (L), creates higher than expectedgenetic distance between species A and B, and between species C and D. On the right,HGT events (T) cause lower than expected genetic distance between species A and C, andbetween species B and D.
111
0.6
Methanococcus maripaludis S2
Thermococcus gammatolerans EJ3
Thermofilum pendens Hrk 5
Methanocaldococcus jannaschii
Sulfolobus islandicus M.16.4
Methanopyrus kandleri
Sulfolobus islandicus L.S.2.15
Halobacterium sp.
Nanoarchaeum equitans
Caldivirga maquilingensis IC-167
Sulfolobus islandicus M.14.25
Ignicoccus hospitalis KIN4/IHyperthermus butylicus
Picrophilus torridus DSM 9790
Methanococcoides burtonii DSM 6242
Pyrobaculum islandicum DSM 4184
Sulfolobus islandicus Y.N.15.51
Pyrobaculum arsenaticum DSM 13514
Halorubrum lacusprofundi ATCC 49239
Methanococcus vannielii SB
Methanosarcina acetivorans
Methanococcus aeolicus Nankai-3
Pyrobaculum calidifontis JCM 11548
Methanosphaerula palustris E1-9c
uncultured methanogenic archaeon
Candidatus Methanoregula boonei 6A8
Thermoplasma acidophilum
Methanosaeta thermophila PT
Candidatus Korarchaeum cryptofilum OPF8
Methanocaldococcus fervens AG86
Thermoproteus neutrophilus V24Sta
Methanothermobacter thermautotrophicus
Desulfurococcus kamchatkensis 1221n
Methanocaldococcus vulcanius M7
Sulfolobus islandicus M.16.27
Aeropyrum pernix
Methanococcus maripaludis C5
Methanospirillum hungatei JF-1
Sulfolobus islandicus Y.G.57.14
Methanoculleus marisnigri JR1
Cenarchaeum symbiosum
Halomicrobium mukohataei DSM 12286
Methanobrevibacter smithii ATCC 35061
Natronomonas pharaonis
Nitrosopumilus maritimus SCM1
Haloarcula marismortui ATCC 43049
Pyrococcus abyssi
Metallosphaera sedula DSM 5348
Halobacterium salinarum R1
Methanosphaera stadtmanae
Pyrobaculum aerophilum
Thermococcus kodakarensis KOD1
Sulfolobus tokodaii
Archaeoglobus fulgidus
Staphylothermus marinus F1
Pyrococcus furiosus
Sulfolobus solfataricus
Methanosarcina mazei
Haloquadratum walsbyi
Thermoplasma volcanium
Methanococcus maripaludis C7
Thermococcus onnurineus NA1Thermococcus sibiricus MM 739
Pyrococcus horikoshii
Methanocorpusculum labreanum Z
Methanococcus maripaludis C6
Methanosarcina barkeri str. Fusaro
Halorhabdus utahensis DSM 12940
Sulfolobus acidocaldarius DSM 639
Figure 3.5: Maximum likelihood tree of archaeal species: This tree of 70 archaealspecies will be the starting topology for the GAnG analysis of 9504 arCOG gene trees. Thetree was built using a maximum likelihood analysis of 53 concatenated ribosomal proteins[138, 160].
112
Part III
Conclusion
113
This thesis contributes new algorithms for comparative microbial genomics, partic-
ularly in the subfields of microbial ecology and microbial genome evolution. In Chap-
ter 1 of Part II, I presented the algorithm AdaptML, which can identify genetically-
and ecologically-distinct bacterial populations. I used this tool to identify clusters of
marine vibrio that appear to have differentiated in response to their nutrient prefer-
ences. Code for AdaptML has been made public and with the help of Albert Wang, an
undergraduate researcher in Eric Alm’s laboratory, I have created a webserver where
users can submit their own phylogenetic and ecological data for AdaptML to pro-
cess online (http://almlab.mit.edu/adaptml/). Outside investigators have used these
tools to run AnGST on their own datasets. Fred Cohan’s group ran the algorithm to
find ecotypes in the genus Bacillus that would have been overlooked with traditional
sequence divergence-based thresholds for calling species [11]. Oakley and colleagues
used AdaptML to help confirm evidence for niche partitioning among sampled Desul-
fobulbus [169]. Other investigators have promoted using AdaptML for future studies
of microbial evolution and ecology. Daniel Falush has suggested using AdaptML to
infer the evolutionary history of microbe-host adaptation [170], and Ford Doolittle &
Olga Zhaxybayeva have pointed out the algorithm’s advantages over strict threshold-
based approaches to microbial ecology [171].
In Chapter 2, I introduced the algorithm AnGST, which reconciles an ultrametric
reference tree and a gene tree to infer HGT, gene duplication, and gene family birth
events in a chronological context. Implicit in these reconciliations are known con-
straints on organismal history drawn from paleontological and geochemical records.
Analysis of 100 sequenced genomes with AnGST produced evidence for a massive ex-
pansion of microbial genetic diversity during the Archean eon, as well as the gradual
oxygenation of the biosphere over the past 3 Ga. This later finding is in agreement
with other studies that suggest secular changes in earth geochemistry were recorded
in, and can now be mined from, microbial genomes [81, 83, 172, 173]. Further evidence
of the link between the AnGST analysis and biogeochemistry could be uncovered by
follow up studies on enzyme families whose first appearance can be dated using the
sedimentary record. For example, Form 1 RuBisCO makes specific contacts with
114
molecular oxygen and first appeared 2.7-2.9 Ga according to carbon isotope fraction-
ation results [174]. AnGST’s predicted birth date for this enzyme’s small subunit,
between 2.0-3.0 Ga, is wide, but not incompatible with the fractionation results.
Identification and analysis of other biomarkers whose age can be independently ver-
ified by AnGST will lead to potentially fruitful collaborations between genomicists,
paleontologists, and biogeochemists.
Lastly, in Chapter 3, I introduced the supertree algorithm GAnG, which is capable
of constructing prokaryotic species trees from thousands of gene trees. Preliminary
GAnG analysis of 250 archaeal gene trees built from a subset of the sequenced ar-
chaeal genomes supports the hypothesis that Nanoarchaeota diverged from the last
ancestor of the Archaea prior to the Crenarchaeota/Euryarchaeota split, but also
appears sensitive to HGT from the Crenarchaeota to the Thermoplasma. Further
improvements to the GAnG scoring function and gene tree weighting scheme are on-
going. An improved version of GAnG will be used to build a species tree relating
70 sequenced Archaea from all five known phyla. One important observation during
that tree’s construction will be the topology of the species tree scoring landscape near
variations on deep node branching order. A relatively flat landscape in that regime
may suggest that the phyla radiated so rapidly during early archaeal evolution that
their branching order cannot be resolved [141] or that HGT dominated vertical de-
scent during the formation of the archaeal phyla [175]. By contrast, a bowl-shaped
landscape with a clear scoring minimum will provide support that a Tree of Life exists
for the Archaea, and will encourage attempts to construct a universal Tree of Life
with sequenced genomes from all three domains of life.
115
Part IV
Bibliography
116
Bibliography
[1] Daniel Godoy, Gaynor Randle, An-drew J Simpson, David M Aanensen,Tyrone L Pitt, Reimi Kinoshita, andBrian G Spratt. Multilocus sequence typ-ing and evolutionary relationships amongthe causative agents of melioidosis andglanders, Burkholderia pseudomallei andBurkholderia mallei. J Clin Microbiol,41(5):2068–79, May 2003.
[2] Fergus G Priest, Margaret Barker, Les W JBaillie, Edward C Holmes, and Martin C JMaiden. Population structure and evolu-tion of the Bacillus cereus group. J Bacte-
riol, 186(23):7959–70, Dec 2004.
[3] William P Hanage, Christophe Fraser, andBrian G Spratt. Fuzzy species among re-combinogenic bacteria. BMC Biol, 3:6, Jan2005.
[4] Dirk Gevers, Frederick M Cohan, Jef-frey G Lawrence, Brian G Spratt, Tom Co-enye, Edward J Feil, Erko Stackebrandt,Yves Van de Peer, et al. Opinion: Re-evaluating prokaryotic species. Nat Rev
Microbiol, 3(9):733–9, Sep 2005.
[5] Frederick M Cohan and Elizabeth B Perry.A systematics for discovering the funda-mental units of bacterial diversity. Curr
Biol, 17(10):R373–86, May 2007.
[6] Christophe Fraser, William P Hanage, andBrian G Spratt. Recombination and thenature of bacterial speciation. Science,315(5811):476–80, Jan 2007.
[7] B. Jesse Shapiro, Jonathan Friedman,Otto X. Cordero, Sarah Preheim, SoniaTimberlake, Martin Polz, and Eric Alm.Recombination in the core and flexiblegenome drives ecological differentiation insympatric ocean microbes. In progress.
[8] Andrew P Martin. Phylogenetic ap-proaches for describing and comparing thediversity of microbial communities. Appl
Environ Microbiol, 68(8):3673–82, Aug2002.
[9] Catherine Lozupone and Rob Knight.UniFrac: a new phylogenetic method forcomparing microbial communities. Appl
Environ Microbiol, 71(12):8228–35, Dec2005.
[10] Alexander Koeppel, Elizabeth B Perry,Johannes Sikorski, Danny Krizanc, An-drew Warner, David M Ward, Alejandro PRooney, Evelyne Brambilla, et al. Iden-tifying the fundamental units of bacterialdiversity: a paradigm shift to incorporateecology into bacterial systematics. Proc
Natl Acad Sci USA, 105(7):2504–9, Feb2008.
[11] Nora Connor, Johannes Sikorski, Alejan-dro P Rooney, Sarah Kopac, Alexander FKoeppel, Andrew Burger, Scott G Cole,Elizabeth B Perry, et al. Ecology of speci-ation in the genus Bacillus. Appl Environ
Microbiol, 76(5):1349–58, Mar 2010.
[12] H Ochman, J G Lawrence, and E A Gro-isman. Lateral gene transfer and thenature of bacterial innovation. Nature,405(6784):299–304, May 2000.
[13] Eugene V Koonin. Darwinian evolution inthe light of genomics. Nucleic Acids Res,37(4):1011–34, Mar 2009.
[14] N A Moran and A Mira. The processof genome shrinkage in the obligate sym-biont Buchnera aphidicola. Genome Biol-
ogy, 2(12), Jan 2001.
[15] M M Riehle, A F Bennett, and A D Long.Genetic architecture of thermal adaptation
117
in Escherichia coli. Proc Natl Acad Sci
USA, 98(2):525–30, Jan 2001.
[16] Manolis Kellis, Bruce W Birren, and Eric SLander. Proof and evolutionary analy-sis of ancient genome duplication in theyeast Saccharomyces cerevisiae. Nature,428(6983):617–24, Apr 2004.
[17] Nancy A Moran and Tyler Jarvik. Lat-eral transfer of genes from fungi underliescarotenoid production in aphids. Science,328(5978):624–7, Apr 2010.
[18] Mary E Rumpho, Jared M Worful, JunghoLee, Krishna Kannan, Mary S Tyler, De-bashish Bhattacharya, Ahmed Moustafa,and James R Manhart. Horizontal genetransfer of the algal nuclear gene psbOto the photosynthetic sea slug Elysiachlorotica. Proc Natl Acad Sci USA,105(46):17867–71, Nov 2008.
[19] Julie C Dunning Hotopp, Michael E Clark,Deodoro C S G Oliveira, Jeremy M Foster,Peter Fischer, Monica C Munoz Torres,Jonathan D Giebel, Nikhil Kumar, et al.Widespread lateral gene transfer from in-tracellular bacteria to multicellular eukary-otes. Science, 317(5845):1753–6, Sep 2007.
[20] Natsuko Kondo, Naruo Nikoh, NobuyukiIjichi, Masakazu Shimada, and TakemaFukatsu. Genome fragment of Wolbachiaendosymbiont transferred to X chromo-some of host insect. Proc Natl Acad Sci
USA, 99(22):14280–5, Oct 2002.
[21] J G Lawrence and H Ochman. Moleculararchaeology of the Escherichia coli genome.Proc Natl Acad Sci USA, 95(16):9413–7,Aug 1998.
[22] Yoji Nakamura, Takeshi Itoh, Hideo Mat-suda, and Takashi Gojobori. Biased bio-logical functions of horizontally transferredgenes in prokaryotic genomes. Nat Genet,36(7):760–6, Jul 2004.
[23] Dirk Gevers, Klaas Vandepoele, CedricSimillon, and Yves Van de Peer. Gene du-plication and biased functional retention ofparalogs in bacterial genomes. Trends Mi-
crobiol, 12(4):148–54, Apr 2004.
[24] Eric Alm, Katherine Huang, and AdamArkin. The evolution of two-componentsystems in bacteria reveals different strate-gies for niche adaptation. PLoS Comput
Biol, 2(11):e143, Nov 2006.
[25] Victor Kunin and Christos A Ouzou-nis. The balance of driving forces duringgenome evolution in prokaryotes. Genome
Res, 13(7):1589–94, Jul 2003.
[26] Boris G Mirkin, Trevor I Fenner, Michael YGalperin, and Eugene V Koonin. Algo-rithms for computing parsimonious evolu-tionary scenarios for genome evolution, thelast universal common ancestor and dom-inance of horizontal gene transfer in theevolution of prokaryotes. BMC Evol Biol,3:2, Jan 2003.
[27] Berend Snel, Peer Bork, and Martijn AHuynen. Genomes in flux: the evolution ofarchaeal and proteobacterial gene content.Genome Res, 12(1):17–25, Jan 2002.
[28] Olga Zhaxybayeva, J Peter Gogarten,Robert L Charlebois, W Ford Doolittle,and R Thane Papke. Phylogenetic analysesof cyanobacterial genomes: quantificationof horizontal gene transfer events. Genome
Res, 16(9):1099–108, Sep 2006.
[29] Vincent Daubin, Nancy A Moran, andHoward Ochman. Phylogenetics and thecohesion of bacterial genomes. Science,301(5634):829–32, Aug 2003.
[30] R D Page and M A Charleston. Fromgene to organismal phylogeny: reconciledtrees and the gene tree/species tree prob-lem. Mol Phylogenet Evol, 7(2):231–40,Apr 1997.
[31] M A Charleston. Jungles: a new solutionto the host/parasite phylogeny reconcilia-tion problem. Mathematical Biosciences,149(2):191–223, May 1998.
[32] Matthew W Hahn. Bias in phylogenetictree reconciliation methods: implicationsfor vertebrate genome evolution. Genome
Biol, 8(7):R141, Jan 2007.
[33] Matthew D Rasmussen and Manolis Kel-lis. Accurate gene-tree reconstruction by
118
learning gene- and species-specific sub-stitution rates across multiple completegenomes. Genome Res, 17(12):1932–42,Dec 2007.
[34] Orjan Akerborg, Bengt Sennblad, Lars Ar-vestad, and Jens Lagergren. SimultaneousBayesian gene tree reconstruction and rec-onciliation analysis. Proc Natl Acad Sci
USA, 106(14):5714–9, Apr 2009.
[35] Patrick J Keeling and Jeffrey D Palmer.Horizontal gene transfer in eukaryotic evo-lution. Nat Rev Genet, 9(8):605–18, Aug2008.
[36] J Peter Gogarten and Jeffrey P Townsend.Horizontal gene transfer, genome innova-tion and evolution. Nat Rev Microbiol,3(9):679–87, Sep 2005.
[37] Stephen J Giovannoni and Ulrich Stingl.Molecular diversity and ecology of micro-bial plankton. Nature, 437(7057):343–8,Sep 2005.
[38] Edward F DeLong, Christina M Preston,Tracy Mincer, Virginia Rich, Steven J Hal-lam, Niels-Ulrik Frigaard, Asuncion Mar-tinez, Matthew B Sullivan, et al. Commu-nity genomics among stratified microbialassemblages in the ocean’s interior. Sci-
ence, 311(5760):496–503, Jan 2006.
[39] Martin F Polz, Dana E Hunt, Sarah PPreheim, and Daniel M Weinreich. Pat-terns and mechanisms of genetic and phe-notypic differentiation in marine microbes.Philos Trans R Soc Lond, B, Biol Sci,361(1475):2009–21, Nov 2006.
[40] Alban Ramette and James M Tiedje. Mul-tiscale responses of microbial life to spatialdistance and environmental heterogeneityin a patchy ecosystem. Proc Natl Acad Sci
USA, 104(8):2761–6, Feb 2007.
[41] Jennifer B Hughes Martiny, Brendan J MBohannan, James H Brown, Robert K Col-well, Jed A Fuhrman, Jessica L Green,M Claire Horner-Devine, Matthew Kane,et al. Microbial biogeography: putting mi-croorganisms on the map. Nat Rev Micro-
biol, 4(2):102–12, Feb 2006.
[42] Silvia G Acinas, Vanja Klepac-Ceraj,Dana E Hunt, Chanathip Pharino, IvicaCeraj, Daniel L Distel, and Martin FPolz. Fine-scale phylogenetic architectureof a complex bacterial community. Nature,430(6999):551–4, Jul 2004.
[43] Zackary I Johnson, Erik R Zinser, Alli-son Coe, Nathan P McNulty, E Malcolm SWoodward, and Sallie W Chisholm. Nichepartitioning among Prochlorococcus eco-types along ocean-scale environmental gra-dients. Science, 311(5768):1737–40, Mar2006.
[44] Mike J Ferris, Michael Kuhl, AndreaWieland, and David M Ward. Cyanobac-terial ecotypes in different optical microen-vironments of a 68 degrees C hot springmat community revealed by 16S-23S rRNAinternal transcribed spacer region varia-tion. Appl Environ Microbiol, 69(5):2893–8, May 2003.
[45] W Ford Doolittle and R Thane Papke. Ge-nomics and the bacterial species problem.Genome Biology, 7(9):116, Jan 2006.
[46] Adam C Retchless and Jeffrey G Lawrence.Temporal fragmentation of speciation inbacteria. Science, 317(5841):1093–6, Aug2007.
[47] J R Thompson and M F Polz. The Biology
of Vibrios. ASM Press, Washington, DC,2006.
[48] Thomas Kiørboe, Hans-Peter Grossart,Helle Ploug, and Kam Tang. Mechanismsand rates of bacterial colonization of sink-ing aggregates. Appl Environ Microbiol,68(8):3996–4006, Aug 2002.
[49] L Pomeroy, R Hanson, P Mcgillivary,B Sherr, D Kirchman, and D Deibel. Mi-crobiology and Chemistry of Fecal Prod-ucts of Pelagic Tunicates: Rates and Fates.Bull. Mar. Sci., 35(3):426–439, 1984.
[50] C Panagiotopoulos, R Sempere, andI Obernosterer. Bacterial degradationof large particles in the southern IndianOcean using in vitro incubation exper-iments. Organic Geochemistry, 33:985–1000, Jan 2002.
119
[51] J F Heidelberg, K B Heidelberg, and R RColwell. Bacteria of the gamma-subclassProteobacteria associated with zooplank-ton in Chesapeake Bay. Appl Environ Mi-
crobiol, 68(11):5498–507, Nov 2002.
[52] Dana E Hunt, Dirk Gevers, Nisha M Va-hora, and Martin F Polz. Conservationof the chitin utilization pathway in theVibrionaceae. Appl Environ Microbiol,74(1):44–51, Jan 2008.
[53] Jakob Pernthaler and Rudolf Amann. Fateof heterotrophic microbes in pelagic habi-tats: focus on populations. Microbiol Mol
Biol Rev, 69(3):440–61, Sep 2005.
[54] Janelle R Thompson, Sarah Pacocha,Chanathip Pharino, Vanja Klepac-Ceraj,Dana E Hunt, Jennifer Benoit, RamahiSarma-Rupavtarm, Daniel L Distel, et al.Genotypic diversity within a naturalcoastal bacterioplankton population. Sci-
ence, 307(5713):1311–3, Feb 2005.
[55] W Maddison and M Slatkin. Null modelsfor the number of evolutionary steps in acharacter on a phylogenetic tree. Evolu-
tion, 45(5):1184–1197, Jan 1991.
[56] Fredrik Ronquist. Bayesian inference ofcharacter evolution. Trends Ecol Evol
(Amst), 19(9):475–81, Sep 2004.
[57] Janelle R Thompson, Mark A Randa,Luisa A Marcelino, Aoy Tomita-Mitchell,Eelin Lim, and Martin F Polz. Diversityand dynamics of a north atlantic coastalVibrio community. Appl Environ Micro-
biol, 70(7):4103–10, Jul 2004.
[58] Vincent Daubin and Nancy A Moran.Comment on ”The origins of genome com-plexity”. Science, 306(5698):978, Nov2004.
[59] Frederick M Cohan. Sexual isolation andspeciation in bacteria. Genetica, 116(2-3):359–70, Nov 2002.
[60] J Majewski. Sexual isolation in bacteria.FEMS Microbiol Lett, 199(2):161–9, May2001.
[61] C von Mering, P Hugenholtz, J Raes, S GTringe, T Doerks, L J Jensen, N Ward, andP Bork. Quantitative phylogenetic assess-ment of microbial communities in diverseenvironments. Science, 315(5815):1126–30,Feb 2007.
[62] S G Acinas, J Anton, and F Rodrıguez-Valera. Diversity of free-living andattached bacteria in offshore WesternMediterranean waters as depicted by anal-ysis of genes encoding 16S rRNA. Appl En-
viron Microbiol, 65(2):514–22, Feb 1999.
[63] B C Crump, E V Armbrust, and J ABaross. Phylogenetic analysis of particle-attached and free-living bacterial commu-nities in the Columbia river, its estuary,and the adjacent coastal ocean. Appl Env-
iron Microbiol, 65(7):3192–204, Jul 1999.
[64] L Riemann and A Winding. Commu-nity Dynamics of Free-living and Particle-associated Bacterial Assemblages during aFreshwater Phytoplankton Bloom. Micro-
bial Ecology, 42(3):274–285, Oct 2001.
[65] N Selje and M Simon. Composition anddynamics of particle-associated and free-living bacterial communities in the Weserestuary, Germany. Aquatic Microbial Ecol-
ogy, 30:221–237, Jan 2003.
[66] E Delong, D Franks, and A Alldredge. Phy-logenetic diversity of aggregate-attachedvs. free-living marine bacterial assem-blages. Limnology and Oceanography,38(5):924–934, Jan 1993.
[67] S H Goh, S Potter, J O Wood, S M Hem-mingsen, R P Reynolds, and A W Chow.HSP60 gene sequences as universal targetsfor microbial species identification: stud-ies with coagulase-negative staphylococci.J Clin Microbiol, 34(4):818–23, Apr 1996.
[68] Erko Stackebrandt and M Goodfellow, ed-itors. Nucleic acid techniques in bacterial
systematics. Wiley and Sons, Chichester,UK, 1991.
[69] J R Cole, B Chai, R J Farris, Q Wang,A S Kulam-Syed-Mohideen, D M McGar-rell, A M Bandela, E Cardenas, et al. The
120
ribosomal database project (RDP-II): in-troducing myRDP space and quality con-trolled public data. Nucleic Acids Res,35(Database issue):D169–72, Jan 2007.
[70] S F Altschul, W Gish, W Miller, E W My-ers, and D J Lipman. Basic local alignmentsearch tool. J Mol Biol, 215(3):403–10, Oct1990.
[71] Scott R Santos and Howard Ochman. Iden-tification and phylogenetic sorting of bac-terial lineages with universally conservedgenes and proteins. Environmental Micro-
biology, 6(7):754–9, Jul 2004.
[72] Stephane Guindon and Olivier Gascuel. Asimple, fast, and accurate algorithm to es-timate large phylogenies by maximum like-lihood. Syst Biol, 52(5):696–704, Oct 2003.
[73] M Hasegawa, Y Iida, T Yano, F Takaiwa,and M Iwabuchi. Phylogenetic relation-ships among eukaryotic kingdoms inferredfrom ribosomal RNA sequences. J Mol
Evol, 22(1):32–8, Jan 1985.
[74] Ivica Letunic and Peer Bork. InteractiveTree Of Life (iTOL): an online tool forphylogenetic tree display and annotation.Bioinformatics, 23(1):127–8, Jan 2007.
[75] John A. Rice. Mathematical Statistics and
Data Analysis. Duxbury Press: Belmont,CA, 1994.
[76] J. Felsenstein. Inferring phylogenies. Sin-auer Associates, Inc., Sunderland, MA,2003.
[77] Akaike. A new look at the statistical modelidentification. Automatic Control, IEEE
Transactions on, 19(6):716 – 723, 1974.
[78] Richard Durbin, Sean R. Eddy, AndersKrogh, and Graeme J. Mitchison. Biologi-
cal Sequence Analysis: Probabilistic Models
of Proteins and Nucleic Acids. CambridgeUniversity Press, 1998.
[79] E G Nisbet and N H Sleep. The habi-tat and nature of early life. Nature,409(6823):1083–91, Feb 2001.
[80] Birger Rasmussen, Ian R Fletcher,Jochen J Brocks, and Matt R Kilburn.Reassessing the first appearance of eu-karyotes and cyanobacteria. Nature,455(7216):1101–4, Oct 2008.
[81] Christopher L Dupont, Song Yang, BrianPalenik, and Philip E Bourne. Modern pro-teomes contain putative imprints of ancientshifts in trace metal geochemistry. Proc
Natl Acad Sci USA, 103(47):17822–7, Nov2006.
[82] M Saito, D Sigman, and F Morel. Thebioinorganic chemistry of the ancientocean: the co-evolution of cyanobacterialmetal requirements and biogeochemical cy-cles at the Archean-Proterozoic bound-ary? Inorganica Chimica Acta, 356:308–318, Jan 2003.
[83] A Zerkle, C House, and S Brantley. Bio-geochemical signatures through time asinferred from whole microbial genomes.American Journal of Science, 305:467–502,Jan 2005.
[84] D Yang, Y Oyaizu, H Oyaizu, G J Olsen,and C R Woese. Mitochondrial origins.Proc Natl Acad Sci USA, 82(13):4443–7,Jul 1985.
[85] S J Giovannoni, S Turner, G J Olsen,S Barns, D J Lane, and N R Pace. Evo-lutionary relationships among cyanobacte-ria and green chloroplasts. J Bacteriol,170(8):3584–92, Aug 1988.
[86] D J De Marais. Evolution. When did pho-tosynthesis emerge on Earth? Science,289(5485):1703–5, Sep 2000.
[87] J Gogarten, W Doolittle, and JeffreyLawrence. Prokaryotic Evolution inLight of Gene Transfer. Mol Biol Evol,19(12):2226, Dec 2002.
[88] R Jain, M C Rivera, and J A Lake. Hor-izontal gene transfer among genomes: thecomplexity hypothesis. Proc Natl Acad Sci
USA, 96(7):3801–6, Mar 1999.
[89] Mark A Ragan and Robert G Beiko. Lat-eral genetic transfer: open issues. Phi-
los Trans R Soc Lond, B, Biol Sci,364(1527):2241–51, Aug 2009.
121
[90] Francesca D Ciccarelli, Tobias Doerks,Christian von Mering, Christopher JCreevey, Berend Snel, and Peer Bork.Toward automatic reconstruction of ahighly resolved tree of life. Science,311(5765):1283–7, Mar 2006.
[91] Thomas Lepage, David Bryant, HervePhilippe, and Nicolas Lartillot. A gen-eral comparison of relaxed molecular clockmodels. Mol Biol Evol, 24(12):2669–80,Dec 2007.
[92] S J Mojzsis, G Arrhenius, K D McKee-gan, T M Harrison, A P Nutman, andC R Friend. Evidence for life on Earthbefore 3,800 million years ago. Nature,384(6604):55–9, Nov 1996.
[93] M Rosing. 13C-Depleted carbon mi-croparticles in ¿3700-Ma sea-floor sedimen-tary rocks from west greenland. Science,283(5402):674–6, Jan 1999.
[94] Jessica Garvin, Roger Buick, Ariel D An-bar, Gail L Arnold, and Alan J Kauf-man. Isotopic evidence for an aerobic ni-trogen cycle in the latest Archean. Science,323(5917):1045–8, Feb 2009.
[95] Ariel D Anbar, Yun Duan, Timothy WLyons, Gail L Arnold, Brian Kendall,Robert A Creaser, Alan J Kaufman,Gwyneth W Gordon, et al. A whiff of oxy-gen before the great oxidation event? Sci-
ence, 317(5846):1903–6, Sep 2007.
[96] Alan J Kaufman, David T Johnston, JamesFarquhar, Andrew L Masterson, Timo-thy W Lyons, Steve Bates, Ariel D Anbar,Gail L Arnold, et al. Late Archean bio-spheric oxygenation and atmospheric evo-lution. Science, 317(5846):1900–3, Sep2007.
[97] Christopher T Reinhard, Rob Raiswell,Clint Scott, Ariel D Anbar, and Timo-thy W Lyons. A late Archean sulfidic seastimulated by early oxidative weathering ofthe continents. Science, 326(5953):713–6,Oct 2009.
[98] J J Brocks, G A Logan, R Buick, andR E Summons. Archean molecular fossilsand the early rise of eukaryotes. Science,285(5430):1033–6, Aug 1999.
[99] Jacob R Waldbauer, Laura S Sherman,Dawn Y Sumner, and Roger E Summons.Late Archean molecular fossils from theTransvaal Supergroup record the antiquityof microbial diversity and aerobiosis. Pre-
cambrian Research, 169(1-4):28–47, Dec2009.
[100] S Golubic, V N Sergeev, and A HKnoll. Mesoproterozoic Archaeoellipsoides:akinetes of heterocystous cyanobacteria.Lethaia, 28:285–98, Jan 1995.
[101] N Butterfield. Bangiomorpha pubescens n.gen., n. sp.: implications for the evolutionof sex, multicellularity, and the Mesopro-terozoic/ Neoproterozoic radiation of eu-karyotes. Paleobiology, 26(3):386–404, Jan2000.
[102] Gordon D Love, Emmanuelle Grosjean,Charlotte Stalvies, David A Fike, John PGrotzinger, Alexander S Bradley, Amy EKelly, Maya Bhatia, et al. Fossil steroidsrecord the appearance of Demospongiaeduring the Cryogenian period. Nature,457(7230):718–21, Feb 2009.
[103] Edward B Daeschler, Neil H Shubin, andFarish A Jenkins. A Devonian tetrapod-like fish and the evolution of the tetrapodbody plan. Nature, 440(7085):757–63, Apr2006.
[104] E Daeschler, N Shubin, K Thomson, andW Amaral. A Devonian Tetrapod fromNorth America. Science, 265(5172):639–642, Jul 1994.
[105] N Moran, M Munson, P Baumann, andH Ishikawa. A Molecular Clock in En-dosymbiotic Bacteria is Calibrated Usingthe Insect Hosts. Proceedings of the Royal
Society B: Biological Sciences, 253:167–171, Jan 1993.
[106] Lars Juhl Jensen, Philippe Julien, MichaelKuhn, Christian von Mering, Jean Muller,Tobias Doerks, and Peer Bork. eggNOG:automated construction and annotation oforthologous groups of genes. Nucleic Acids
Res, 36(Database issue):D250–4, Jan 2008.
[107] J G Jelesko, R Harper, M Furuya, andW Gruissem. Rare germinal unequal
122
crossing-over leading to recombinant geneformation and gene duplication in Ara-bidopsis thaliana. Proc Natl Acad Sci USA,96(18):10302–7, Aug 1999.
[108] Michael Lynch and John S Conery. Theorigins of genome complexity. Science,302(5649):1401–4, Nov 2003.
[109] M Sugiura, T Hirose, and M Sugita. Evo-lution and mechanism of translation inchloroplasts. Annu Rev Genet, 32:437–59,Jan 1998.
[110] D Fischer and D Eisenberg. Finding fam-ilies for genomic ORFans. Bioinformatics,15(9):759–62, Sep 1999.
[111] C Scott, T W Lyons, A Bekker, Y Shen,S W Poulton, X Chu, and A D Anbar.Tracing the stepwise oxygenation of theProterozoic ocean. Nature, 452(7186):456–9, Mar 2008.
[112] Kurt O Konhauser, Ernesto Pecoits, Ste-fan V Lalonde, Dominic Papineau, Euan GNisbet, Mark E Barley, Nicholas T Arndt,Kevin Zahnle, et al. Oceanic nickeldepletion and a methanogen famine be-fore the Great Oxidation Event. Nature,458(7239):750–3, Apr 2009.
[113] D Canfield. A new model for Proterozoicocean chemistry. Nature, 396(6710):450–453, Jan 1998.
[114] A Bekker, H D Holland, P-L Wang,D Rumble, H J Stein, J L Hannah,L L Coetzee, and N J Beukes. Datingthe rise of atmospheric oxygen. Nature,427(6970):117–20, Jan 2004.
[115] D Canfield. The early history of atmo-spheric oxygen: homage to Robert M. Gar-rels. Annu. Rev. Earth Planet. Sci., 33:1–36, Jan 2005.
[116] Michael J Sanderson. r8s: inferring abso-lute rates of molecular evolution and di-vergence times in the absence of a molecu-lar clock. Bioinformatics, 19(2):301–2, Jan2003.
[117] M Kanehisa and S Goto. KEGG: kyoto en-cyclopedia of genes and genomes. Nucleic
Acids Res, 28(1):27–30, Jan 2000.
[118] Eric J Alm, Katherine H Huang, Morgan NPrice, Richard P Koche, Keith Keller,Inna L Dubchak, and Adam P Arkin. TheMicrobesOnline Web site for comparativegenomics. Genome Res, 15(7):1015–22, Jul2005.
[119] JP Huelsenbeck, B Rannala, and B Larget.Tangled Trees: Phylogenies, Cospeciation,
and Coevolution. University of ChicagoPress, Chicago, 2002.
[120] L Arvestad, A Berglund, J Lagergren, andB Sennblad. Gene tree reconstruction andorthology analysis based on an integratedmodel for duplications and sequence evo-lution. RECOMB’04, pages 326–335, Jan2004.
[121] Dave MacLeod, Robert L Charlebois, FordDoolittle, and Eric Bapteste. Deductionof probable events of lateral gene transferthrough comparison of phylogenetic treesby recursive consolidation and rearrange-ment. BMC Evol Biol, 5(1):27, Jan 2005.
[122] J Bergsten. A review of long-branch at-traction. Cladistics, 21:163–193, Jan 2005.
[123] A Rambaut and N Grass. Seq-Gen: anapplication for the Monte Carlo simulationof DNA sequence evolution along phyloge-netic trees. Bioinformatics, 13(3):235–238,Jan 1997.
[124] DF Robinson and LR Foulds. Comparisonof phylogenetic trees. Mathematical Bio-
sciences, 53(1-2):131–147, 1981.
[125] Tal Dagan and William Martin. Ances-tral genome sizes specify the minimum rateof lateral gene transfer during prokary-ote evolution. Proc Natl Acad Sci USA,104(3):870–5, Jan 2007.
[126] Nicolas Lartillot and Herve Philippe. ABayesian mixture model for across-site het-erogeneities in the amino-acid replacementprocess. Mol Biol Evol, 21(6):1095–109,Jun 2004.
[127] Roman L Tatusov, Natalie D Fedorova,John D Jackson, Aviva R Jacobs, BorisKiryutin, Eugene V Koonin, Dmitri MKrylov, Raja Mazumder, et al. The COG
123
database: an updated version includes eu-karyotes. BMC Bioinformatics, 4:41, Sep2003.
[128] Gerard Talavera and Jose Castresana. Im-provement of phylogenies after removingdivergent and ambiguously aligned blocksfrom protein sequence alignments. Syst
Biol, 56(4):564–77, Aug 2007.
[129] J Castresana. Selection of conserved blocksfrom multiple alignments for their usein phylogenetic analysis. Mol Biol Evol,17(4):540–52, Apr 2000.
[130] Robert C Edgar. MUSCLE: multiplesequence alignment with high accuracyand high throughput. Nucleic Acids Res,32(5):1792–7, Jan 2004.
[131] Frederic Delsuc, Henner Brinkmann, andHerve Philippe. Phylogenomics and the re-construction of the tree of life. Nat Rev
Genet, 6(5):361–75, May 2005.
[132] Fengrong Ren, Hiroshi Tanaka, and Zi-heng Yang. A likelihood look at thesupermatrix-supertree controversy. Gene,441(1-2):119–25, Jul 2009.
[133] J Slowinski and R Page. How shouldspecies phylogenies be inferred from se-quence data? Syst Biol, 48(4):814–825,Jan 1999.
[134] Eric Bapteste, Henner Brinkmann,Jennifer A Lee, Dorothy V Moore,Christoph W Sensen, Paul Gordon,Laure Durufle, Terry Gaasterland, et al.The analysis of 100 genes supports thegrouping of three highly divergent amoe-bae: Dictyostelium, Entamoeba, andMastigamoeba. Proc Natl Acad Sci USA,99(3):1414–9, Feb 2002.
[135] J Felsenstein. Cases in which parsimonyor compatibility methods will be positivelymisleading. Syst Biol, 27(4):401–410, Jan1978.
[136] P J Lockhart, M A Steel, M D Hendy, andD Penny. Recovering evolutionary trees un-der a more realistic model of sequence evo-lution. Mol Biol Evol, 11(4):605–12, Jul1994.
[137] Matthew J Phillips, Frederic Delsuc, andDavid Penny. Genome-scale phylogeny andthe detection of systematic biases. Mol Biol
Evol, 21(7):1455–8, Jul 2004.
[138] Oriane Matte-Tailliez, Celine Brochier,Patrick Forterre, and Herve Philippe. Ar-chaeal phylogeny based on ribosomal pro-teins. Mol Biol Evol, 19(5):631–9, May2002.
[139] Tal Dagan and William Martin. The treeof one percent. Genome Biol, 7(10):118,Jan 2006.
[140] Vincent Daubin and Howard Ochman.Quartet mapping and the extent of lateraltransfer in bacterial genomes. Mol Biol
Evol, 21(1):86–9, Jan 2004.
[141] Pere Puigbo, Yuri I Wolf, and Eugene VKoonin. Search for a ’Tree of Life’ in thethicket of the phylogenetic forest. Journal
of Biology, 8(6):59, Jan 2009.
[142] Herve Philippe, Elizabeth A Snell, EricBapteste, Philippe Lopez, Peter W H Hol-land, and Didier Casane. Phylogenomics ofeukaryotes: impact of missing data on largealignments. Mol Biol Evol, 21(9):1740–52,Sep 2004.
[143] Herve Philippe and Christophe J Douady.Horizontal gene transfer and phylogenet-ics. Curr Opin Microbiol, 6(5):498–505,Oct 2003.
[144] Olaf R P Bininda-Emonds. The evolutionof supertrees. Trends Ecol Evol (Amst),19(6):315–22, Jun 2004.
[145] M A Ragan. Phylogenetic inference basedon matrix representation of trees. Mol Phy-
logenet Evol, 1(1):53–8, Mar 1992.
[146] J B Slowinski, A Knight, and A P Rooney.Inferring species trees from gene trees: aphylogenetic analysis of the Elapidae (Ser-pentes) based on the amino acid sequencesof venom proteins. Mol Phylogenet Evol,8(3):349–62, Dec 1997.
[147] R D Page. GeneTree: comparing gene andspecies phylogenies using reconciled trees.Bioinformatics, 14(9):819–20, Jan 1998.
124
[148] A Wehe, M Bansal, J Burleigh, and O Eu-lenstein. DupTree: A program for large-scale phylogenetic analyses using gene treeparsimony. Bioinformatics, May 2008.
[149] LA David and EJ Alm. Rapid evolution-ary innovation during an Archean GeneticExpansion. Submitted, 2010.
[150] M Bansal, J Burleigh, O Eulenstein,and A Wehe. Heuristics for the gene-duplication problem: A (n) speed-up forthe local search. RECOMB’07, pages 238–252, Jan 2007.
[151] L Billera, S Holmes, and K Vogtmann.Geometry of the Space of PhylogeneticTrees. Advances in Applied Mathematics,27(4):733–767, Jan 2001.
[152] Stephane Guindon, Jean-Francois Du-fayard, Vincent Lefort, Maria Anisi-mova, Wim Hordijk, and Olivier Gascuel.New algorithms and methods to estimatemaximum-likelihood phylogenies: assess-ing the performance of PhyML 3.0. Syst
Biol, 59(3):307–21, May 2010.
[153] Morgan N Price, Paramvir S Dehal, andAdam P Arkin. FastTree 2–approximatelymaximum-likelihood trees for large align-ments. PLoS One, 5(3):e9490, Jan 2010.
[154] A Stamatakis, T Ludwig, and H Meier.RAxML-III: a fast program for maximumlikelihood-based inference of large phyloge-netic trees. Bioinformatics, 21(4):456–63,Feb 2005.
[155] Elizabeth Waters, Michael J Hohn, IvanAhel, David E Graham, Mark D Adams,Mary Barnstead, Karen Y Beeson, LisaBibbs, et al. The genome of Nanoarchaeumequitans: insights into early archaeal evo-lution and derived parasitism. Proc Natl
Acad Sci USA, 100(22):12984–8, Oct 2003.
[156] Celine Brochier, Simonetta Gribaldo,Yvan Zivanovic, Fabrice Confalonieri, andPatrick Forterre. Nanoarchaea: represen-tatives of a novel archaeal phylum or afast-evolving euryarchaeal lineage relatedto Thermococcales? Genome Biology,6(5):R42, Jan 2005.
[157] Yuri I Wolf, Igor B Rogozin, Nick V Gr-ishin, and Eugene V Koonin. Genometrees and the tree of life. Trends Genet,18(9):472–9, Sep 2002.
[158] Alexei I Slesarev, Katja V Mezhevaya,Kira S Makarova, Nikolai N Polushin,Olga V Shcherbinina, Vera V Shakhova,Galina I Belova, L Aravind, et al.The complete genome of hyperthermophileMethanopyrus kandleri AV19 and mono-phyly of archaeal methanogens. Proc Natl
Acad Sci USA, 99(7):4644–9, Apr 2002.
[159] Ji Qi, Bin Wang, and Bai-Iin Hao. Wholeproteome prokaryote phylogeny withoutsequence alignment: a K-string composi-tion approach. J Mol Evol, 58(1):1–11, Jan2004.
[160] Kira S Makarova, Alexander V Sorokin,Pavel S Novichkov, Yuri I Wolf, and Eu-gene V Koonin. Clusters of orthologousgenes for 41 archaeal genomes and implica-tions for evolutionary genomics of archaea.Biol Direct, 2:33, Jan 2007.
[161] Ken Nealson. A Korarchaeote yields togenome sequencing. Proc Natl Acad Sci
USA, 105(26):8805–6, Jul 2008.
[162] James G Elkins, Mircea Podar, David EGraham, Kira S Makarova, Yuri Wolf,Lennart Randau, Brian P Hedlund, CelineBrochier-Armanet, et al. A korarchaealgenome reveals insights into the evolutionof the Archaea. Proc Natl Acad Sci USA,105(23):8102–7, Jun 2008.
[163] Steven J Hallam, Konstantinos T Kon-stantinidis, Nik Putnam, Christa Schleper,Yoh ichi Watanabe, Junichi Sugahara,Christina Preston, Jose de la Torre,et al. Genomic analysis of the unculti-vated marine crenarchaeote Cenarchaeumsymbiosum. Proc Natl Acad Sci USA,103(48):18296–301, Nov 2006.
[164] Harald Huber, Michael J Hohn, Rein-hard Rachel, Tanja Fuchs, Verena C Wim-mer, and Karl O Stetter. A new phy-lum of Archaea represented by a nano-sized hyperthermophilic symbiont. Nature,417(6884):63–7, May 2002.
125
[165] S M Barns, C F Delwiche, J D Palmer, andN R Pace. Perspectives on archaeal diver-sity, thermophily and monophyly from en-vironmental rRNA sequences. Proc Natl
Acad Sci USA, 93(17):9188–93, Aug 1996.
[166] Thomas A Auchtung, Cristina D Takacs-Vesbach, and Colleen M Cavanaugh. 16SrRNA phylogenetic investigation of thecandidate division ”Korarchaeota”. Appl
Environ Microbiol, 72(7):5077–82, Jul2006.
[167] Anja Spang, Roland Hatzenpichler,Celine Brochier-Armanet, Thomas Rattei,Patrick Tischler, Eva Spieck, WolfgangStreit, David A Stahl, et al. Distinct geneset in two different lineages of ammonia-oxidizing archaea supports the phylumThaumarchaeota. Trends Microbiol,18(8):331–340, Aug 2010.
[168] Celine Brochier-Armanet, Bastien Bous-sau, Simonetta Gribaldo, and PatrickForterre. Mesophilic Crenarchaeota: pro-posal for a third archaeal phylum, theThaumarchaeota. Nat Rev Microbiol,6(3):245–52, Mar 2008.
[169] Brian B Oakley, Franck Carbonero,Christopher J van der Gast, Robert JHawkins, and Kevin J Purdy. Evolution-ary divergence and biogeography of sym-patric niche-differentiated bacterial popu-lations. ISME J, 4(4):488–97, Apr 2010.
[170] Daniel Falush. Toward the use of genomicsto study microevolutionary change in bac-teria. PLoS Genet, 5(10):e1000627, Oct2009.
[171] W Doolittle and O Zhaxybayeva. Metage-nomics and the Units of Biological Orga-nization. BioScience, 60(2):102–112, Jan2010.
[172] Christopher L Dupont, Andrew Butcher,Ruben E Valas, Philip E Bourne, and Gus-tavo Caetano-Anolles. History of biologicalmetal utilization inferred through phyloge-nomic analysis of protein structures. Proc
Natl Acad Sci USA, 107(23):10567–72, Jun2010.
[173] Jason Raymond and Daniel Segre. Theeffect of oxygen on biochemical networksand the evolution of complex life. Science,311(5768):1764–7, Mar 2006.
[174] E G Nisbet, N V Grassineau, C J Howe, P IAbell, M Regelous, and R E R Nisbet. Theage of Rubisco: the evolution of oxygenicphotosynthesis. Geobiology, 5:311–335, Jan2007.
[175] Eugene V Koonin. The Biological Big Bangmodel for the major transitions in evolu-tion. Biol Direct, 2:21, Jan 2007.
126