Novel Phylogenetic Approaches to Problems in Microbial ...Novel Phylogenetic Approaches to Problems in Microbial Genomics by Lawrence A. David Submitted to the Computational & Systems

Novel Phylogenetic Approaches to Problems in

Microbial Genomics

by

Lawrence A. David

Submitted to the Computational & Systems Biology Initiativein partial fulfillment of the requirements for the degree of

PhD

at the

MASSACHUSETTS INSTITUTE OF TECHNOLOGY

September 2010

c� Massachusetts Institute of Technology 2010. All rights reserved.

Author . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .Computational & Systems Biology Initiative

August 26th, 2010

Certified by. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .Eric J. Alm

Assistant ProfessorThesis Supervisor

Accepted by . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .Chris Burge

Chair, Ph.D. Graduate Committee

Report Documentation Page Form ApprovedOMB No. 0704-0188

Public reporting burden for the collection of information is estimated to average 1 hour per response, including the time for reviewing instructions, searching existing data sources, gathering andmaintaining the data needed, and completing and reviewing the collection of information. Send comments regarding this burden estimate or any other aspect of this collection of information,including suggestions for reducing this burden, to Washington Headquarters Services, Directorate for Information Operations and Reports, 1215 Jefferson Davis Highway, Suite 1204, ArlingtonVA 22202-4302. Respondents should be aware that notwithstanding any other provision of law, no person shall be subject to a penalty for failing to comply with a collection of information if itdoes not display a currently valid OMB control number.

1. REPORT DATE SEP 2010 2. REPORT TYPE

3. DATES COVERED 00-00-2010 to 00-00-2010

4. TITLE AND SUBTITLE Novel Phylogenetic Approaches to Problems in Microbial Genomics

5a. CONTRACT NUMBER

5b. GRANT NUMBER

5c. PROGRAM ELEMENT NUMBER

6. AUTHOR(S) 5d. PROJECT NUMBER

5e. TASK NUMBER

5f. WORK UNIT NUMBER

7. PERFORMING ORGANIZATION NAME(S) AND ADDRESS(ES) Massachusetts Institute of Technology,77 Massachusetts Avenue,Cambridge,MA,02139

8. PERFORMING ORGANIZATIONREPORT NUMBER

9. SPONSORING/MONITORING AGENCY NAME(S) AND ADDRESS(ES) 10. SPONSOR/MONITOR’S ACRONYM(S)

11. SPONSOR/MONITOR’S REPORT NUMBER(S)

12. DISTRIBUTION/AVAILABILITY STATEMENT Approved for public release; distribution unlimited

13. SUPPLEMENTARY NOTES

14. ABSTRACT Present day microbial genomes are the handiwork of over 3 billion years of evolution. Comparisonsbetween these genomes enable stepping backwards through past evolutionary events, and can beformalized using binary tree models known as phylogenies. In this thesis, I present three new phylogeneticmethods for gaining insight into how microbes evolve. In Chapter 1, I introduce the algorithm AdaptML,which uses strain ecology information to identify genetically- and ecologically-distinct bacterialpopulations. Analysis of 1000 marine Vibrionaceae strains by AdaptML finds evidence that nicheadaptation may influence patterns of genetic differentiation in bacteria. In Chapter 2, I introduce thealgorithm AnGST, which can infer the evolutionary history of a gene family in a chronological context.Analysis of 3968 gene families drawn from 100 modern day organisms with AnGST reveals genomicevidence for a massive expansion in microbial genetic diversity during the Archean eon and the gradualoxygenation of the biosphere over the past 3 billion years. Lastly, I introduce in Chapter 3 the algorithmGAnG, which can construct prokaryotic species trees from thousands of distinct gene trees. GAnG analysisof archaeal gene trees supports hypotheses that the Nanoarchaeota diverged from the last ancestor of theArchaea prior to the Crenarchaeota/Euryarchaeota split.

15. SUBJECT TERMS

16. SECURITY CLASSIFICATION OF: 17. LIMITATION OF ABSTRACT Same as

Report (SAR)

18. NUMBEROF PAGES

126

19a. NAME OFRESPONSIBLE PERSON

a. REPORT unclassified

b. ABSTRACT unclassified

c. THIS PAGE unclassified

Standard Form 298 (Rev. 8-98) Prescribed by ANSI Std Z39-18

Novel Phylogenetic Approaches to Problems in Microbial

Genomics

by

Lawrence A. David

Submitted to the Computational & Systems Biology Initiativeon August 26th, 2010, in partial fulfillment of the

requirements for the degree ofPhD

Abstract

Present day microbial genomes are the handiwork of over 3 billion years of evolution.Comparisons between these genomes enable stepping backwards through past evolu-tionary events, and can be formalized using binary tree models known as phylogenies.In this thesis, I present three new phylogenetic methods for gaining insight into howmicrobes evolve. In Chapter 1, I introduce the algorithm AdaptML, which uses strainecology information to identify genetically- and ecologically-distinct bacterial popu-lations. Analysis of 1000 marine Vibrionaceae strains by AdaptML finds evidencethat niche adaptation may influence patterns of genetic differentiation in bacteria.In Chapter 2, I introduce the algorithm AnGST, which can infer the evolutionaryhistory of a gene family in a chronological context. Analysis of 3968 gene familiesdrawn from 100 modern day organisms with AnGST reveals genomic evidence fora massive expansion in microbial genetic diversity during the Archean eon and thegradual oxygenation of the biosphere over the past 3 billion years. Lastly, I intro-duce in Chapter 3 the algorithm GAnG, which can construct prokaryotic species treesfrom thousands of distinct gene trees. GAnG analysis of archaeal gene trees supportshypotheses that the Nanoarchaeota diverged from the last ancestor of the Archaeaprior to the Crenarchaeota/Euryarchaeota split.

Thesis Supervisor: Eric J. AlmTitle: Assistant Professor

2

Acknowledgments

Funding

I am deeply grateful for the research support provided by:

• National Defense Science & Engineering Graduate Fellowship (DoD)

• Whitaker Health Sciences Fund Fellowship (MIT)

Personal

Many people have deliberately or unwittingly helped me complete my doctoral work.My words of gratitude below are only a small installment against a lifetime of welcomedebt.

Foremost, I would like to thank my advisor, Eric Alm, who gave me the freedomto work on the problems I found interesting and the scientific training to solve them.His unique enthusiasm over these past years has helped fuel my research and providedme with personal joys that I will always cherish.

I have been also remarkably fortunate to have joined a group of wonderful col-leagues here at MIT. Mark Smith, Chris Smillie, Greg Fournier, Sarah Preheim, InesBaptista, Matt Blackburn, Yonatan Friedman, Alexandra Konings, Claudia Bauer,and Albert Wang have been a source of thoughtful discussion and much needed lev-ity. In particular, the old guard of Arne Materna, Sean Clarke, Sonia Timberlake,and Jesse Shapiro is enshrined in some of my fondest memories of graduate school.Classmates in my department, especially Robin Friedman, Charles Lin, and MichelleChan, have been inspirational young-scientists who I have looked to for advice andmotivation.

A host of caring individuals enabled me to reach the closing stages of my graduatecareer. Sheila Frankel, Darlene Strother, Darlene Ray, James Long, and Bonnie LeeWhang supplied administrative and moral support. My graduate committee, whosemembership has included Ed DeLong, Drew Endy, Manolis Kellis, Dianne Newman,Howard Ochman, (and unofficially Martin Polz), generously set aside time to provideme with professional advice and letters of recommendation. [greg] ensured that mycluster jobs ran smoothly and provided crucial UNIX humor.

I also acknowledge my friends and family, who conspired to make graduate schoolthe shortest five years of my life. Within the Parsons laboratory, David Gonzalez-Rodriguez, Piyatida Hoisungwan, Marcos, and Crystal Ng, unfailingly made me ex-cited to come to lab each morning. MacGregor’s F-entry has been my adopted un-dergraduate family, and I hold dear my memories of living among them. My sisterhas continuously given me her love, and my father and mother have selflessly workedto give me the opportunities I presently enjoy. Finally, my wife Christina has simplybeen the nicest person I’ve ever met.

3

Contents

I Introduction 5

II Contributions 13

1 Resource partitioning and sympatric differentiation among closely

related bacterioplankton 15

2 Rapid evolutionary innovation during an Archean Genetic Expan-

sion 51

3 Building prokaryotic species trees from thousands of gene trees 92

III Conclusion 113

IV Bibliography 116

4

Part I

Introduction

5

Overview

Present day microbial genomes are the handiwork of over 3 billion years of evolution.

Comparisons between these genomes enable stepping backwards through evolutionary

history – genetic features present across a wide diversity of genomes likely arose

more anciently than features found in subsets of related genomes. This intuition is

formalized using binary tree models of sequence evolution known as phylogenetic trees.

Phylogenies propose a series of ancestral sequence divergence events that explain the

similarity of extant sequences. These trees can in turn be used to build models for

the evolution of organismal phenotypes, such as preferred environment or lifestyle.

However, prokaryotes’ capacity for horizontal gene transfer (HGT) can require regions

of the same genome to be associated with different phylogenetic trees, and ultimately

obscure which phylogenetic tree best represents overall genome evolution.

In this thesis, I present three novel phylogenetic approaches for inferring micro-

bial evolutionary history through the comparison of gene sequences. The remainder

of Part I briefly describes the research context in which I developed: AdaptML, an

algorithm for detecting signatures of ecological adaptation influencing bacterial ge-

netic differentiation; and AnGST, an algorithm for inferring the series of HGT, gene

duplication, and gene loss events that gave rise to a gene family. I go on in Chapter 1

of Part II to use AdaptML to identify genetically- and ecologically-distinct clusters of

Vibrionaceae coexisting in a marine environment. In Chapter 2, I use AnGST to infer

patterns in microbial genome evolution over the past 3.8 billion years. I use AnGST

again in Chapter 3 in the development of GAnG, a new method for constructing

prokaryotic species trees from thousands of gene trees. I conclude this thesis in Part

III with a summary of the chapters and a brief discussion of ongoing and future work

with AdaptML, AnGST, and GAnG.

6

Detecting relationships between genetic and ecolog-

ical differentiation in bacteria

Distinct groups of closely-related bacteria, or phylogenetic clusters, are a recurring

pattern of genetic differentiation among bacterial isolate housekeeping genes [1–3].

Ecological adaptation is suspected of playing a role in cluster formation [4]. Ac-

cording to the ecotype model, genetically-distinct bacterial populations form when

bacterial populations adapt to an ecological niche and are repeatedly purged of ge-

netic variation through periodic selection events [5]. However, a theoretical study

has shown that genetically distinct sub-populations can form under a neutral model

that either prohibits recombination, or simulates high within-cluster recombination

[6]. Alternatively, a recent phylogenetic analysis of eight sequenced Vibrio isolates

has found evidence for a combined model featuring both ecological adaptation and

neutral processes contributing to genetic differentiation. Under this model, the intro-

duction of niche-adaptive alleles initially erodes sympatry in a bacterial population.

Reduced gene flow between niche-adapted bacteria and the remaining population

subsequently yields genetically-distinct subgroups [7]. Ultimately, if niche adapta-

tion drives the formation of genetically-distinct bacterial groups, members of each

group should inhabit a common niche. Mathematical models capable of identifying

both genetically- and ecologically-cohesive bacterial groups can thus be used to help

resolve the role of ecological adaptation in the genetic differentiation of bacteria.

Several existing statistical methods, such as the Fst test, the P test, and Unifrac,

can evaluate the null hypothesis that phylogenetic clusters do not exhibit distinct

ecological associations. These tests assume that bacterial sequences are annotated

with ecological metadata describing the environment each sequence was harvested

from. The Fst test compares the genetic diversity among bacteria annotated as

sharing the same environment to the genetic diversity measured across all sampled

sequences. Low genetic diversity within a particular environment, coupled with high

genetic diversity between environments, is evidence for rejection of the null hypothesis

of no association between genetic clustering and bacterial ecology [8]. Alternatively,

7

the P test builds a phylogeny of strain sequences, labels leaves by their environmental

association, and uses a parsimony model to infer the number of times ancestral strains

on the tree changed environmental associations. Low parsimony scores are evidence

for rejecting the null hypothesis [8]. Lastly, the Unifrac model combines elements of

both the Fst and P test, utilizing genetic distances and strain tree topology to test

the relationship between strain genetic clustering and associated environment [9].

One weakness, however, of the Fst, P, and Unifrac statistics is their potential for

erroneously reporting no association between genetic clustering and ecology when the

ecological forces driving cluster formation are unmeasured or improperly annotated.

For example, consider a bacterial sequence cluster caused by adaptation to conditions

between 20◦-30◦C. An association between this cluster and ecology would go unrec-

ognized by Fst, P, or Unifrac analyses if temperature data was not collected, or if

temperature data were discretized into only two ranges: < 25◦C and ≥25◦C. Thus,

these statistics may not be appropriate for analyzing the evolution of bacteria whose

niche composition is unknown or highly uncertain, as environmental parameters de-

scribing these bacteria’s niche may not have been measured.

Another inference algorithm, Ecotype Simulation (ES), can identify genetically-

and ecologically-distinct clusters in a manner insensitive to how ecological parameters

are measured [10]. ES finds ecotypes by fitting a maximum likelihood model onto a

gene phylogeny. This model estimates the rates of ecotype formation, periodic se-

lection, and genetic drift, as well as the total number of ecotypes present. Identified

ecotypes can subsequently be analyzed using ecological measurements and multivari-

ate statistics in order to confirm that niche-adaptation has taken place and identify

environmental parameters that define the niche. Recent application of this approach

discovered ecotypes among Bacillus strains sampled from Death Valley, CA, which

could be distinguished by adaptation to solar exposure and soil texture [11]. One

drawback to the ES algorithm, however, is that it cannot detect nascent ecotype

formation events that have not yet undergone multiple series of periodic selections.

In Chapter 1, I present a new method named AdaptML, which uses a maxi-

mum likelihood model to identify genetically- and ecologically-coherent clusters of

8

bacterial strains. This model explicitly combines genetic information embedded in

sequence-based phylogenies with environmental sampling data. Recent niche adap-

tation events, characterized by ecologically coherent clusters with minimal genetic

distinction from a parent clade, can be captured by the model. Although AdaptML

cannot detect ecological associations with unmeasured environmental parameters, the

algorithm can account for environmental parameter discretization schemes that would

generally confound previous methods for detecting ecological associations. To do this,

I introduce the model concept of a “habitat.” Habitats are characterized by discrete

probability distributions describing the likelihood that a strain adapted to a habitat

will be sampled from a given ecological state (e.g. at a particular location in an estu-

ary). Habitats are not defined a priori but rather learned directly from the sequence

and ecological data using an Expectation Maximization routine. Once habitats are

defined, I learn a maximum likelihood model for the evolution of habitat association

on the tree. Randomization experiments can be used to determine which sequence

clusters show a statistically-significant association with a given habitat.

9

Inferring the evolutionary history of microbial gene

families

Microbial genomes do not evolve solely by point mutation [12, 13]. Comparison of

gamma-proteobacterial genomes suggests gene loss events eliminated thousands of

genes from the ancestor of the Buchnera following its adoption of an endosymbiotic

lifestyle [14]. Genes can be gained, via either the duplication of small regions of the

genome [15], or via the duplication of the entire genome itself, as has been shown for

yeast [16]. Gene gain is also possible via HGT and is a well-known source of genomic

diversity among the prokaryotes [12]. Cases of HGT have also been identified between

eukaryotes [17, 18] and even from bacteria to animals [19, 20]. Models that can infer

when genes have undergone loss, duplication, or HGT, and when genes have been

vertically inherited, are necessary for understanding the relative contribution of these

four mechanisms to genome evolution.

Algorithms for inferring the evolutionary history of gene families vary according to

their reliance on phylogenetic models and how they account for gene gain events (Ta-

ble 1). Phylogeny-free methods utilize features such as GC-bias to detect xenologous

genes [21, 22], or within-genome BLAST searches to find evidence for past duplication

events [23]. More complex approaches, known as presence-absence models, construct

a phylogeny of sampled species and identify which leaves on the tree are represented

in a gene family of interest. Parsimony algorithms can then be used to identify a set of

ancestral gene duplication, gene loss, or HGT events to explain the observed pattern

of gene presence and absence on the species tree [24–27]. However, presence/absence

algorithms may underestimate the amount of HGT in a gene family history, since

frequent HGT events can produce presence/absence patterns similar to those caused

by gene birth at a deep node, followed by vertical descent. More sensitive models

capable of differentiating between these scenarios utilize gene sequence information,

in addition to a species tree. Quartet methods quantify how strongly quartets of

orthologous genes support each of the three possible 4-taxon trees representing their

evolutionary history [28, 29]. Quartets that strongly support topologies discordant

10

ModelSpecies

treeGenetree

Unc.genetrees

FindsHGT

Findsdup.

Refs.

GC-bias No No - Yes No [21, 22]BLAST-hits No No - Yes Yes [23]Presence/absence Yes No - Yes Yes [24–27]Quartet mapping Partial Partial Yes Yes No [28, 29]Parsimony recon-ciliation

Yes Yes No Yes Yes [30, 31]

Probabilistic rec-onciliation

Yes Yes Yes Yes No [33, 34]

Table 1: Selection of existing models used to infer gene family evolutionaryhistories: Models are characterized by their explicit usage of species trees and gene trees,their consideration of gene tree uncertainty, and their ability to detect HGT and dupli-cation events. Note that only References [27, 31] can find HGT and duplication eventssimultaneously.

with the expected species tree are evidence for HGT within the gene family. More

elaborate “reconciliation” models compare full gene and species trees in order to in-

fer a precise phylogenetic location for each inferred evolutionary event. Parsimony

reconciliation models [30, 31], however, will infer spurious events if phylogenetic con-

struction errors are present in the gene tree [32]. Newer probabilistic reconciliation

algorithms have been developed to deal with these potential inaccuracies [33, 34].

Gene family evolutionary history models can also be partitioned according to

whether they account for gene gain using duplication or HGT events. With the

exception of Snel and Charleston’s algorithms [27, 31], evolutionary history models

usually account for only one of these two events. The specificity of these models may

be caused by self-reinforcing biases associated with the expected modes of eukaryotic

and prokaryotic genome evolution. The relative rarity of reported HGT events among

eukaryotes, compared to duplication events, likely encourages analyses of eukaryotic

genome evolution using tools specialized only to detect gene duplications. By contrast,

recognition of how HGT can accelerate prokaryotic adaptation and blur species lines

has probably reduced interest in broad surveys of potential prokaryotic duplication.

Exceptions to this proposed bias among prokaryotic studies do exist, however, as

11

Gevers et al. cataloged gene duplications in 106 bacterial genomes [23] and Snel

searched for both HGT and duplication among 17 archaeal and bacterial genomes

[27]. Increasing examples of HGT among eukaryotes [17, 18] are also fueling new

interest in systematically searching for HGT across the eukaryotes [35]. Bias against

the creation of models that account for both HGT and duplication is also likely due

to issues of model complexity. In certain scenarios, gene duplication and gene loss

can produce gene tree topologies similar to those yielded by HGT [36]. A combined

HGT/duplication inference model must be capable of recognizing this scenario and

proposing plausible HGT and duplication scenarios. Moreover, a combined model

requires defining a metric to choose which of these scenarios is preferable.

In Chapter 2, I present a new reconciliation method for inferring a set of gene loss,

gene duplication, and HGT events that explain topological incongruities between a

species tree and a gene tree. I named this algorithm the Analyzer of Gene & Species

Trees, or AnGST. AnGST was inspired by a gene family evolution model originally

designed for problems in biogeography and the inference of gene duplication and gene

loss events [30]. Also referred to as a host-parasite model, this approach seeks to infer

which ancestral genome (the host) on the reference tree possessed each ancestral gene

copy (the parasite). AnGST employs a generalized parsimony framework in order to

choose when duplication events should be inferred instead of HGT scenarios. This

framework assigns scores to each type of evolution event and returns the evolutionary

history with the lowest overall score. AnGST can further minimize reconciliation

scores by reconciling multiple gene tree bootstraps simultaneously and combining

their lowest scoring subtrees into a single chimeric gene tree. This bootstrap amalga-

mation step reduces the opportunity for poorly resolved gene tree subtrees to cause

the spurious inference of evolutionary events.

12

Part II

Contributions

13

Chapter 1

Resource partitioning andsympatric differentiation amongclosely related bacterioplankton

Dana E. Hunt*, Lawrence A. David*, Dirk Gevers, Sarah P. Preheim,Eric J. Alm, Martin F. Polz

*These authors contributed equally to this work.

This chapter is presented as it originally appeared in Science 320, 1081 (2008).Corresponding Supplementary Material is appended.

Chapter 1

Resource partitioning and

sympatric differentiation among

closely related bacterioplankton

Identifying ecologically differentiated populations within complex micro-

bial communities remains challenging, yet is critical for interpreting the

evolution and ecology of microbes in the wild. Here we describe spatial

and temporal resource partitioning among Vibrionaceae strains coexist-

ing in coastal bacterioplankton. A quantitative model (AdaptML) estab-

lishes the evolutionary history of ecological differentiation, thus revealing

populations specific for seasons and life-styles (combinations of free-living,

particle, or zooplankton associations). These ecological population bound-

aries frequently occur at deep phylogenetic levels (consistent with named

species); however, recent and perhaps ongoing adaptive radiation is evi-

dent in Vibrio splendidus, which comprises numerous ecologically distinct

populations at different levels of phylogenetic differentiation. Thus, envi-

ronmental specialization may be an important correlate or even trigger of

speciation among sympatric microbes.

15

Microbes dominate biomass and control biogeochemical cycling in the ocean, but

we know little about the mechanisms and dynamics of their functional differentiation

in the environment. Culture-independent analysis typically reveals vast microbial di-

versity, and although some taxa and gene families are differentially distributed among

environments [37, 38], it is not clear to what extent coexisting genotypic diversity can

be divided into functionally cohesive populations [37, 39]. First, we lack broad surveys

of nonpathogenic free-living bacteria that establish robust associations of individual

strains with spatiotemporal conditions [40, 41]; second, it remains controversial what

level of genetic diversification reflects ecological differentiation. Phylogenetic clus-

ters have been proposed to correspond to ecological populations that arise by neutral

diversification after niche-specific selective sweeps [5]. Clusters are indeed observed

among closely related isolates (e.g., when examined by multilocus sequence analysis)

[4] and in culture-independent analyses of coastal bacterioplankton [42]. Yet recent

theoretical studies suggest that clusters can result from neutral evolution alone [6],

and evidence for clusters as ecologically distinct populations remains sparse, having

been most conclusively demonstrated for cyanobacteria along ocean-scale gradients

[43] and in a depth profile of a microbial mat [44]). Further, horizontal gene transfer

(HGT) may erode the ecological cohesion of clusters if adaptive genes are transferred

[45], and recombination can homogenize genes between ecologically distinct popu-

lations [46]. Thus, exploring the relationship between phylogenetic and ecological

differentiation is a critical step toward understanding the evolutionary mechanisms

of bacterial speciation [6].

In this study, we investigated ecological differentiation by spatial and temporal

resource partitioning in coastal waters among coexisting bacteria of the family Vibri-

onaceae, which are ubiquitous, metabolically versatile heterotrophs [47]. The coastal

ocean is well suited to test population-level effects of microhabitat preferences, be-

cause tidal mixing and oceanic circulation ensure a high probability of migration, re-

ducing biogeographic effects on population structure. In the plankton, heterotrophs

may adopt alternate ecological strategies: exploiting either the generally lower concen-

tration but more evenly distributed dissolved nutrients or attaching to and degrading

16

small suspended organic particles, originating from algal exopolysaccharides and de-

tritus [39]. Bacterial microhabitat preferences may develop because resources are

distributed on the same scale as the dispersal range of individuals, due to turbulent

mixing and active motility [48]. Of potential microhabitats, particles represent abun-

dant but relatively short-lived resources, as labile components are rapidly utilized (on

time scales of hours to days) [49, 50], implying that particle colonization is a dynamic

process. Moreover, particulate matter may change composition with macroecologi-

cal conditions (such as seasonal algal blooms). Zooplankton provide additional, more

stable microhabitats; vibrios attach to and metabolize chitinous zooplankton exoskele-

tons [51, 52] but may also live in the gut or occupy niches specific to pathogens. The

extent to which microenvironmental preferences contribute to resource partitioning in

this complex ecological landscape remains an important question in microbial ecology

[53].

We aimed to conservatively identify ecologically coherent groups by examining

distribution patterns of Vibrionaceae genotypes among free- living and associated

(with suspended particles and zooplankton) compartments of the planktonic environ-

ment under different macroecological conditions (spring and fall) (Figs. 1.3 & 1.5).

Because the level of genetic differentiation at which ecological preferences develop is

not known, we focused on a range of relationships (0 to 10% small subunit riboso-

mal RNA (rRNA) divergence) among co-occurring vibrios [54]. Particle-associated

and free-living cells were separated into four consecutive size fractions by sequential

filtration (four replicate water samples, each subsampled with at least four replicate

filters per size fraction); each fraction contained organisms and dead organic material

of different origins (detailed in the supporting online material [SOM]; Section 1.2).

For simplicity, we refer to these fractions as enriched in zooplankton (≥63 mm), in

large (5 to 63 mm) and small (1 to 5 mm) particles, and in free-living cells (0.22 to 1

mm) (Fig. 1.5B). The 1- to 5-mm size fraction was somewhat ambiguous, probably

containing small particles as well as large or dividing cells; however, it provided a firm

buffer between obviously particle-associated (>5 mm) and free-living (<1 mm) cells.

Vibrionaceae strains were isolated by plating filters on selective media, previously

17

shown by quantitative polymerase chain reaction to yield good correspondence be-

tween genotypes recovered in culture and those present in environmental samples [54].

Roughly 1000 isolates were characterized by partial sequencing of a protein-coding

gene (hsp60 ). To obtain added resolution, between one and three additional gene

fragments (mdh, adk, and pgi) were sequenced for over half of the isolates (SOM),

including V. splendidus strains, the most abundant group [54].

Our rationale for testing environmental associations grows out of the following

considerations. First, as in most ecological sampling, the true habitats or niches

are unknown and can only be observed as projections onto the sampling dimensions

(“projected habitats”). Thus, associations can be detected as distinct distributions

of groups of strains if habitats/niches are differentially apportioned among samples.

Second, the lack of an accepted microbial species concept implies that it is imprudent

to use any measure of genetic relationships to define a priori the populations whose

environmental association should be assessed. Therefore, we first tested the null

hypothesis that there is no environmental association across the phylogeny of the

strains. We then refined such estimates by developing a new model to simultaneously

identify populations and their projected habitats. Finally, these model-based results

were tested with nonparametric empirical statistics.

The initial null hypothesis of no association between phylogeny and ecology is

strongly rejected (seasons: p < 10−79; size fractions: p < 10−49) by comparing the

parsimony score of observed environments on the tree to that expected by chance [55]

(SOM), confirming the visual impression of differential patterns of clustering among

seasons and size fractions (Fig. 1.1A). This result is robust toward uncertainty in the

phylogeny, which should diminish but not strengthen associations, and is confirmed

by introducing additional uncertainty in the phylogeny (Fig. 1.6). The observed

overall association with season and size fraction therefore suggests that water-column

vibrios partition resources, but neither provides insights into the phylogenetic bounds

of populations or the composition of their habitats.

We therefore developed an evolutionary model (AdaptML) to identify popula-

tions as groups of related strains sharing a common projected habitat, which reflects

18

their relative abundance in the measured environmental categories (size fractions and

seasons) (SOM). In practice, the model inputs are the phylogeny, season, and size

fraction of the strains. It then maps changes in environmental preference onto the

tree by predicting projected habitats for each extant and ancestral strain in the phy-

logeny. Although similar in spirit to existing parsimony, likelihood, and Bayesian

methods, which map ancestral states onto trees [56], the model accounts for the com-

plexities and uncertainties of environmental sampling. First, projected habitats can

span multiple sampling dimensions to account for complex life cycles (such as time

spent in multiple true habitats) and problems inherent in environmental sampling:

Discrete samples rarely equate to true habitats, and true habitats are frequently mis-

placed among their typical sample categories (for example, zooplankton fragments

may also be found in smaller size fractions). Second, projected habitats can span

multiple phylogenetic clusters to allow for the possibility that clusters may arise neu-

trally or that the relevant parameters differentiating them ecologically have not been

measured.

Briefly, AdaptML builds a hidden Markov model for the evolution of habitat

associations: Adjacent nodes on the phylogeny transition between habitats according

to a probability function that is dependent on branch length and a transition rate,

which is learned from the data (SOM) (Fig. 1.7). Subsequently, we optimize the

model parameters (the transition rate and the composition of each projected habitat)

to maximize the likelihood of the observed data. Finally, we use a simple ad hoc rule

for reducing noninformative parameters: We merge habitats that converge to similar

distributions (simple correlation of distribution vectors >90%) during the model-

fitting procedure (SOM). This reproducibly identified six nonredundant habitats for

the observed data set (HA to HF in Figs. 1.1B and 1.9). Moreover, the algorithm acts

conservatively, as suggested by two tests. First, the model did not overfit the data

when there was no ecological signal present: When the environments were shuffled,

only a single generalist habitat (evenly distributed over all size fractions and seasons)

was recovered. Second, when simulated habitats were used to generate environmental

assignments, the model usually identified a number of habitats equal to or less than

19

the true number present (Fig. 1.10).

The analysis suggests that a single bacterial family coexisting in the water column

resolves into a striking number of ecologically distinct populations with clearly iden-

tifiable preferences (habitats). The algorithm identified 25 populations, associated

with one of the six habitats defined by distinct distributions of isolates over seasons

and size fractions (Fig. 1.1 and Fig. 1.11). Most clusters have a strong seasonal

signal; interestingly, two pairs of highly similar habitats are observed in both seasons

(Fig. 1.1B). The first of the habitat pairs corresponds to populations occurring both

free-living and on particles but lacking zooplankton-associated isolates (HB and HC);

the second indicates a preference for zooplankton and large particles (HE and HF )

(Fig. 1.1B). The remaining two habitats were season-specific. Habitat HA combines

all primarily free-living populations in the fall, whereas habitat HD identifies a second

particle- and zooplankton-associated group in spring, but unlike HE and HF it has a

higher proportion of large particles and maps onto a single small group (G25) (Fig.

1.1). However, we cannot place high confidence in the absence of the free-living habi-

tat in the spring, because relatively few strains were recovered from that fraction.

Moreover, the distribution of individual populations among seasons and size frac-

tions varies considerably, with remarkably narrow preferences for some populations

whereas others are more broadly distributed. For example, V. ordalii (G3) is almost

exclusively free-living in both seasons, whereas V. alginolyticus (G5) has a significant

representation in both zooplankton and free-living size fractions but occurs exclu-

sively in the fall (Fig. 1.1, A and B). The sequences of three additional genes for V.

alginolyticus isolates were identical, arguing against misidentification due to recombi-

nation or additional population substructuring. Similarly, there was good agreement

when two different gene phylogenies (hsp60 and mdh) were used to identify habitats

for V. splendidus (Fig. 1.12), although fewer habitats were identified using the mdh

tree, most likely because it is less well-resolved. Overall, across all vibrios sampled,

association with the zooplankton-enriched and free-living fractions dominated, and

although several populations contain particle-associated isolates, only a few appear

to be specifically particle-adapted. Because vibrios are generally regarded as particle

20

and zooplankton specialists [47], this observed partitioning offers new insight into

their ecology.

Thus, in spite of the highly variable conditions of the water column, popula-

tions appear to finely partition resources, especially because our habitat estimates

are conservative, as clusters occupying the same habitat may be differentiated along

additional (unobserved) resource axes. For example, different zooplankton-associated

groups may be host- or body region-specific, and the strong seasonal signal of most

clusters may be due to a variety of factors; however, temperature is a likely candi-

date because it has so far arisen as the strongest correlate of microbial population

changes both over a seasonal cycle [57] and along ocean-scale gradients [43]. Fi-

nally, populations, which appear unassociated in our study, may be true generalists

with respect to the resource space sampled or may be adapted to environments not

sampled in this study, such as animal intestines or sediments [47]. Despite these un-

certainties, the observed strong partitioning among associated and free-living clusters

may have important implications for population biology in the bacterioplankton. As

recently suggested [6], for attached bacteria, the effective population size (Ne) may

be considerably smaller than the census size because colonization serves as a pop-

ulation bottleneck, whereas in free-living clusters, Ne may be closer to the census

size. Although computing the true magnitude of Ne in microbial populations remains

controversial [58], it is an important parameter that determines the relative strength

of selection and drift. Thus, attached and free-living populations may evolve under

different constraints [6].

The phylogenetic structure of populations also provides insights into the history

of habitat switches. Deeply branching populations may have remained associated

with habitats over long evolutionary time, and shallow branches may have diversified

more recently (Fig. 1.1, A and B). These stable habitat-associated clusters roughly

correlate to named species within the Vibrionaceae. For example, V. ordalii (G3) and

Enterovibrio norvegicus (G2) both represent clusters without close relatives contain-

ing > 50 isolates, which are overwhelmingly predicted to follow primarily free-living

(HA) and free-living/particle-associated lifestyles (HC), respectively (Fig. 1.1A). On

21

the other hand, some very closely related clusters are associated with different habi-

tats; V. splendidus, which is composed of strains that are ∼99% identical in rDNA

gene sequence [54], differentiates into 15 microdiverse habitat-associated clusters, of

which one is distributed roughly evenly among both seasons, and 9 and 5 predom-

inantly occur in spring and fall, respectively. Thus, V. splendidus appears to have

ecologically diversified, possibly by invading new niches or partitioning resources at

increasingly fine scales.

Recent or perhaps ongoing radiation by sympatric resource partitioning is most

strongly suggested for two nested clusters within V. splendidus, where groups of

strains differing by as little as a single nucleotide in hsp60 display distinct ecological

preferences (Fig. 1.1A, insets, and Fig. 1.3). These strains were isolated from multiple

independent samples and thus do not represent clonal expansion, suggesting that this

may reflect a true habitat switch; nonetheless, homologous recombination could also

move alleles between distantly related, ecologically distinct clusters, creating spurious

phylogenetic relationships, which can be detected by comparison with other genes.

Multilocus sequence analysis shows that for nested cluster I, a close relationship

was artificially created because hsp60 gene phylogeny is discordant with three other

genes (Fig. 1.2). However, this still represents a habitat switch, just at a slightly

larger sequence distance, as I.A is nested within the much larger G16 cluster in

both the hsp60 and the mdh-pgi -adk phylogenies. For the second nested cluster,

the three additional genes confirm partial separation of the subclusters II.A and II.B

by a single base pair difference in one of the genes, whereas the other genes consist

of identical alleles. This reinforces the idea that subcluster II.A is not incorrectly

grouped because of recombination, despite its distinct ecological affiliation (Fig. 1.2).

In combination, these data support the idea that there is ecological differentiation

among recently diverged genotypes and show that such changes might be recognized

in protein-coding genes as soon as they accumulate (neutral) sequence changes.

How might adaptation to a new habitat relate to speciation, the generation of dis-

tinct clusters of closely related bacteria? Mathematical modeling has recently shown

that the dynamics of speciation depend on the ratio of homologous recombination to

22

mutation rates (r/m) [6]. When this ratio per allele exceeds ∼1, populations tran-

sition from essentially clonal to sexual, with the major consequence that selection is

probably required for the formation of clusters [6]. Our preliminary multilocus se-

quence analysis on a set of strains with similar taxonomic composition suggests that

their r/m is well above that threshold. Thus, our observations of habitat separation

for highly similar but clearly distinct genotypes suggest that ecological selection may

have triggered phylogenetic differentiation. A plausible mechanism is that differen-

tial distribution among habitats (possibly caused by few adaptive loci) is sufficient

to depress gene flow between associated genotypes [6, 59]. Consequently, mutations

will no longer be homogenized but instead accumulate within specialized populations,

even for ecologically neutral genes. Over time, genetic isolation may increase because

homologous recombination rates decrease log-linearly with sequence distance [60]. We

detected associations with different habitats among sister clades over a wide range

of phylogenetic distances, possibly representing populations at various stages of dif-

ferentiation (Fig. 1.1A). Although we cannot determine whether clusters represent

transiently adapted populations or nascent species, our observations of differential

distributions of genotypes suggest that there exists a small-scale adaptive landscape

in the water column allowing the initiation of (sympatric) speciation within this com-

munity.

Although it has recently been suggested that microbial lineages remain specific

to macroenvironments over long evolutionary times [61], this study demonstrates

switches in ecological associations within a bacterial family coexisting in the coastal

ocean. In the V. splendidus clade, speciation could be ongoing, but the divergence

between most other ecologically defined groups appears large. This is consistent with

our previous suggestion that rRNA gene clusters, which are roughly congruent with

the deeply divergent protein-coding gene clusters detected here, represent ecological

populations [42]. However, the example of V. splendidus highlights the fact that using

marker genes to assess community-wide diversity may not capture some ecological

specialization. Moreover, different groups of organisms could evolve under different

constraints, and the mechanisms suggested here apply to the invasion of new habitats

23

and are thus different from (but compatible with) the widely discussed niche-specific

selective sweeps [10]. Why V. splendidus appears to have radiated recently into new

habitats whereas other groups appear to be more constant is not known but may

be related to its high heterogeneity in genome architecture [54]. This could indicate

a large (flexible) gene pool that, if shared by horizontal gene transfer, gives rise to

large numbers of ecologically adaptive phenotypes. It will therefore be important

to compare whole genomes within recently ecologically diverged clusters to identify

specific changes leading to adaptive evolution.

24

1.1 Figures

that clusters can result from neutral evolution alone(9), and evidence for clusters as ecologicallydistinct populations remains sparse, having beenmost conclusively demonstrated for cyanobacteriaalong ocean-scale gradients (10) and in a depthprofile of a microbial mat (11). Further, horizontalgene transfer (HGT) may erode the ecologicalcohesion of clusters if adaptive genes are transferred(12), and recombination can homogenize genesbetween ecologically distinct populations (13).Thus, exploring the relationship between phyloge-

netic and ecological differentiation is a critical steptoward understanding the evolutionary mecha-nisms of bacterial speciation (9).

In this study, we investigated ecological dif-ferentiation by spatial and temporal resourcepartitioning in coastal waters among coexistingbacteria of the family Vibrionaceae, which areubiquitous, metabolically versatile heterotrophs(14). The coastal ocean is well suited to testpopulation-level effects of microhabitat prefer-ences, because tidal mixing and oceanic circula-tion ensure a high probability of migration,reducing biogeographic effects on populationstructure. In the plankton, heterotrophs mayadopt alternate ecological strategies: exploitingeither the generally lower concentration but moreevenly distributed dissolved nutrients or attach-ing to and degrading small suspended organicparticles, originating from algal exopolysaccha-rides and detritus (3). Bacterial microhabitatpreferences may develop because resources aredistributed on the same scale as the dispersal

range of individuals, due to turbulent mixing andactive motility (15). Of potential microhabitats,particles represent abundant but relatively short-lived resources, as labile components are rapidlyutilized (on time scales of hours to days) (16, 17),implying that particle colonization is a dynamicprocess. Moreover, particulate matter may changecomposition with macroecological conditions(such as seasonal algal blooms). Zooplanktonprovide additional, more stable microhabitats;vibrios attach to and metabolize chitinous zoo-plankton exoskeletons (18, 19) but may also livein the gut or occupy niches specific to pathogens.The extent to which microenvironmental prefer-ences contribute to resource partitioning in thiscomplex ecological landscape remains an impor-tant question in microbial ecology (20).

We aimed to conservatively identify ecolog-ically coherent groups by examining distributionpatterns of Vibrionaceae genotypes among free-living and associated (with suspended particlesand zooplankton) compartments of the plankton-

Fig. 1. Season and size fraction distributions and habitat predictionsmapped onto Vibrionaceae isolate phylogeny inferred by maximumlikelihood analysis of partial hsp60 gene sequences. Projected habitatsare identified by colored circles at the parent nodes. (A) Phylogenetictree of all strains, with outer and inner rings indicating seasons and sizefractions of strain origin, respectively. Ecological populations predictedby the model are indicated by alternating blue and gray shading ofclusters if they pass an empirical confidence threshold of 99.99% (seeSOM for details). Bootstrap confidence levels are shown in fig. S10. (B)Ultrametric tree summarizing habitat-associated populations identifiedby the model and the distribution of each population among seasons and

size fractions. The habitat legend matches the colored circles in (A) and(B) with the habitat distribution over seasons and size fractions inferredby the model. Distributions are normalized by the total number of countsin each environmental category to reduce the effects of uneven sampling.The insets at the lower right of (A) show two nested clusters (I.A and I.Band II.A and II.B) for which recent ecological differentiation is inferred,including habitat predictions at each node. The closest named species tonumbered groups are as follows: G1, V. calviensis; G2, Enterovibrionorvegicus; G3, V. ordalii; G4, V. rumoiensis; G5, V. alginolyticus; G6, V.aestuarianus; G7, V. fischeri/logei; G8, V. fischeri; G9, V. superstes; G10,V. penaeicida; G11 to G25, V. splendidus.

1Department of Civil and Environmental Engineering,Massachusetts Institute of Technology (MIT), Cambridge, MA02139, USA. 2Computational and Systems Biology Initiative,MIT, Cambridge, MA 02139, USA. 3Laboratory of Microbiol-ogy, Ghent University, Gent 9000, Belgium. 4Bioinformaticsand Evolutionary Genomics Group, Ghent University, Gent9000, Belgium.5Department of Biological Engineering, MIT,Cambridge, MA 02139, USA, and Broad Instilate of MIT andHarvard University, Cambridge, MA 02139, USA.

*These authors contributed equally to this work.†To whom correspondence should be addressed. E-mail:[email protected] (E.J.A.); [email protected] (M.F.P.)

23 MAY 2008 VOL 320 SCIENCE www.sciencemag.org1082

REPORTS

on

May

26,

200

8 w

ww

.sci

ence

mag

.org

Dow

nloa

ded

from

Figure 1.1: Season and size fraction distributions and habitat predictions mapped ontoVibrionaceae isolate phylogeny inferred by maximum likelihood analysis of partial hsp60

gene sequences. Projected habitats are identified by colored circles at the parent nodes.(A) Phylogenetic tree of all strains, with outer and inner rings indicating seasons andsize fractions of strain origin, respectively. Ecological populations predicted by the modelare indicated by alternating blue and gray shading of clusters if they pass an empiricalconfidence threshold of 99.99% (see SOM for details). Bootstrap confidence levels are shownin Fig. 1.14. (B) Ultrametric tree summarizing habitat-associated populations identifiedby the model and the distribution of each population among seasons and size fractions. Thehabitat legend matches the colored circles in (A) and (B) with the habitat distribution overseasons and size fractions inferred by the model. Distributions are normalized by the totalnumber of counts in each environmental category to reduce the effects of uneven sampling.The insets at the lower right of (A) show two nested clusters (I.A and I.B and II.A and II.B)for which recent ecological differentiation is inferred, including habitat predictions at eachnode. The closest named species to numbered groups are as follows: G1, V. calviensis; G2,Enterovibrio norvegicus; G3, V. ordalii ; G4, V. rumoiensis; G5, V. alginolyticus; G6, V.

aestuarianus; G7, V. fischeri/logei ; G8, V. fischeri ; G9, V. superstes; G10, V. penaeicida;G11 to G25, V. splendidus.

25

ic environment under different macroecologicalconditions (spring and fall) (fig. S1 and table S1).Because the level of genetic differentiation atwhich ecological preferences develop is notknown, we focused on a range of relationships[0 to 10% small subunit ribosomal RNA (rRNA)divergence] among co-occurring vibrios (21).Particle-associated and free-living cells wereseparated into four consecutive size fractions bysequential filtration (four replicate water samples,each subsampled with at least four replicatefilters per size fraction); each fraction containedorganisms and dead organic material of differentorigins [detailed in the supporting online material(SOM)]. For simplicity, we refer to these frac-tions as enriched in zooplankton (!63 mm), inlarge (5 to 63 mm) and small (1 to 5 mm) particles,and in free-living cells (0.22 to 1 mm) (fig. S1B).The 1- to 5-mm size fraction was somewhat am-biguous, probably containing small particles aswell as large or dividing cells; however, it pro-vided a firm buffer between obviously particle-associated (>5 mm) and free-living (<1 mm)cells. Vibrionaceae strains were isolated by plat-ing filters on selective media, previously shownby quantitative polymerase chain reaction to yieldgood correspondence between genotypes recov-ered in culture and those present in environmentalsamples (21). Roughly 1000 isolates were char-acterized by partial sequencing of a protein-coding gene (hsp60). To obtain added resolution,between one and three additional gene fragments

(mdh, adk, and pgi) were sequenced for over halfof the isolates (SOM), including V. splendidusstrains, the most abundant group (21).

Our rationale for testing environmental asso-ciations grows out of the following consider-ations. First, as in most ecological sampling, thetrue habitats or niches are unknown and can onlybe observed as projections onto the sampling di-mensions (“projected habitats”). Thus, associationscan be detected as distinct distributions of groupsof strains if habitats/niches are differentially ap-portioned among samples. Second, the lack of anaccepted microbial species concept implies that itis imprudent to use anymeasure of genetic relation-ships to define a priori the populations whoseenvironmental association should be assessed.Therefore, we first tested the null hypothesis thatthere is no environmental association across thephylogeny of the strains. We then refined such es-timates by developing a new model to simulta-neously identify populations and their projectedhabitats. Finally, these model-based results weretested with nonparametric empirical statistics.

The initial null hypothesis of no associationbetween phylogeny and ecology is strongly rejected(seasons: P < 10"79; size fractions: P < 10"49) bycomparing the parsimony score of observed envi-ronments on the tree to that expected by chance(22) (SOM), confirming the visual impression ofdifferential patterns of clustering among seasonsand size fractions (Fig. 1A). This result is robusttoward uncertainty in the phylogeny, whichshould diminish but not strengthen associations,and is confirmed by introducing additional uncer-tainty in the phylogeny (fig. S2). The observedoverall association with season and size fractiontherefore suggests that water-column vibrios par-tition resources, but neither provides insights intothe phylogenetic bounds of populations or thecomposition of their habitats.

We therefore developed an evolutionarymodel(AdaptML) to identify populations as groups ofrelated strains sharing a common projected hab-itat, which reflects their relative abundance in themeasured environmental categories (size frac-tions and seasons) (SOM). In practice, the modelinputs are the phylogeny, season, and size frac-tion of the strains. It then maps changes in envi-ronmental preference onto the tree by predictingprojected habitats for each extant and ancestralstrain in the phylogeny. Although similar in spiritto existing parsimony, likelihood, and Bayesianmethods, which map ancestral states onto trees(23), the model accounts for the complexities anduncertainties of environmental sampling. First,projected habitats can span multiple samplingdimensions to account for complex life cycles(such as time spent in multiple true habitats) andproblems inherent in environmental sampling:Discrete samples rarely equate to true habitats,and true habitats are frequently misplaced amongtheir typical sample categories (for example,zooplankton fragments may also be found insmaller size fractions). Second, projected habitatscan span multiple phylogenetic clusters to allow

for the possibility that clusters may arise neutrallyor that the relevant parameters differentiatingthem ecologically have not been measured.

Briefly, AdaptML builds a hidden Markovmodel for the evolution of habitat associations:Adjacent nodes on the phylogeny transition be-tween habitats according to a probability functionthat is dependent on branch length and a tran-sition rate, which is learned from the data (SOM)(fig. S3). Subsequently, we optimize the modelparameters (the transition rate and the compo-sition of each projected habitat) to maximize thelikelihood of the observed data. Finally, we use asimple ad hoc rule for reducing noninformativeparameters: We merge habitats that converge tosimilar distributions (simple correlation of distri-bution vectors >90%) during the model-fittingprocedure (SOM). This reproducibly identifiedsix nonredundant habitats for the observed dataset (HA to HF in Fig. 1B and fig. S5). Moreover,the algorithm acts conservatively, as suggestedby two tests. First, the model did not overfit thedata when there was no ecological signal present:When the environments were shuffled, only asingle generalist habitat (evenly distributed overall size fractions and seasons) was recovered.Second, when simulated habitats were used togenerate environmental assignments, the modelusually identified a number of habitats equal to orless than the true number present (fig. S6).

The analysis suggests that a single bacterialfamily coexisting in the water column resolvesinto a striking number of ecologically distinctpopulations with clearly identifiable preferences(habitats). The algorithm identified 25 popula-tions, associated with one of the six habitatsdefined by distinct distributions of isolates overseasons and size fractions (Fig. 1 and fig. S7).Most clusters have a strong seasonal signal; in-terestingly, two pairs of highly similar habitatsare observed in both seasons (Fig. 1B). The firstof the habitat pairs corresponds to populationsoccurring both free-living and on particles butlacking zooplankton-associated isolates (HB andHC); the second indicates a preference for zoo-plankton and large particles (HE and HF) (Fig.1B). The remaining two habitats were season-specific. Habitat HA combines all primarily free-living populations in the fall, whereas habitat HD

identifies a second particle- and zooplankton-associated group in spring, but unlike HE and HF

it has a higher proportion of large particles andmaps onto a single small group (G25) (Fig. 1).However, we cannot place high confidence in theabsence of the free-living habitat in the spring,because relatively few strains were recoveredfrom that fraction. Moreover, the distribution ofindividual populations among seasons and sizefractions varies considerably, with remarkablynarrow preferences for some populations whereasothers are more broadly distributed. For ex-ample, V. ordalii (G3) is almost exclusively free-living in both seasons, whereas V. alginolyticus(G5) has a significant representation in bothzooplankton and free-living size fractions but

Fig. 2. Multilocus sequence analysis of nestedclusters (IA and IB and IIA and IIB) with differentialhabitat association by comparison of partial hsp60(left) and concatenated partial mdh, adk, and pgi(right) gene phylogenies. Habitat predictions(indicated by colored boxes) and the numberingof clusters correspond to Fig. 1. Scale bar is in unitsof nucleotide substitutions per site.

www.sciencemag.org SCIENCE VOL 320 23 MAY 2008 1083

REPORTS

on

May

26,

200

8 w

ww

.sci

ence

mag

.org

Dow

nloa

ded

from

Figure 1.2: Multilocus sequence analysis of nested clusters (IA and IB and IIA and IIB)with differential habitat association by comparison of partial hsp60 (left) and concatenatedpartial mdh, adk, and pgi (right) gene phylogenies. Habitat predictions (indicated by coloredboxes) and the numbering of clusters correspond to Fig. 1.1. Scale bar is in units ofnucleotide substitutions per site.

26

1.2 Supplementary Material

1.2.1 Sampling rationale

To investigate partitioning of Vibrionaceae strains in the water column, we exam-

ined their distribution among the free-living and associated (with particles and zoo-

plankton) fractions of the bacterioplankton community at two time points. This was

achieved by sequential filtration with decreasing pore size cutoffs and subsequent cul-

tivation on Vibrio selective media (Fig. 1.5B). Here, we give additional details on

sampling protocols and rationale supplementing the overview given in the main text.

Filtration is commonly used in oceanography to separate particle-associated and

free-living populations by retention of particles on filters, although the filter size cut off

for collecting particle-attached bacteria has varied in past studies between 0.8 and 10

µm [62–65]. To obtain higher ecological resolution, we used sequential filtration since

alternate types of particulate organic matter and organisms (e.g., phytoplankton,

zooplankton) will have distribution maxima in different size fractions thus enabling

differentiation of associated bacterial genotypes.

We collected a total of four size fractions with different expected composition

of particles and organisms (Fig. 1.5B). The largest fraction (≥63 m) was visually

enriched in zooplankton and detrital material (e.g., pieces of macroalgae, terrestrial

plant material); however, large gelatinous material [frequently part of marine snow,

which represents particles >0.5 mm [66]] was likely not collected since it is disrupted

by the pressure on the plankton nets used for collection. All other fractions were

collected by gravity rather than vacuum filtration to minimize disruption of frag-

ile particles. The large particle fraction (63-5 µm) likely contains zooplankton fecal

pellets, dead and living algae, and other detritus. The composition of the 5-1 µm

size fraction is somewhat ambiguous since it may contain both cells attached to very

small particles as well as large or dividing cells; however, it provides a firm buffer

between obviously particle-attached (>5 µm) and free-living (<1µm) cells. Particu-

late material in this size range may include small algae, bacterial cell walls, as well as

fragments of larger particles, which have broken apart; nonetheless, the small size of

27

such particles in unlikely to sustain a resident bacterial population. Free-living bac-

teria, observed in the 1-0.22 µm size fraction, likely live on dissolved organic matter

produced by living algae, cell lysis and the dissolution of particles.

1.2.2 Sample collection

Coastal ocean water samples were collected at high tide on the marine end of the

Plum Island Estuary (NE Massachusetts) (Fig. 1.5A) on two days representing spring

(4/28/06) and fall (9/6/06) conditions in the coastal ocean. Nutrient concentrations,

water temperature and chlorophyll levels were measured on both sampling dates (Fig.

1.3).

Two replicate samples of the largest size fraction (enriched in zooplankton) were

collected by filtering ∼100 L each through a 63 µm plankton net, which was sub-

sequently washed with sterile seawater (Fig. 1.5B). Particle-associated and free-

living bacterial populations were collected from quadruplicate water samples, which

were independently 2 pre-filtered through the 63 µm plankton net (to remove the

zooplankton-enriched fraction) into 4 L nalgene bottles (Fig. 1.5B). For each bottle,

water was sequentially filtered through 5, 1 and 0.22 µm pore size filters, collecting

at least four replicate filters per size fraction. To avoid disruption of fragile parti-

cles, the 63-5 and 5-1 µm fractions were collected on polycarbonate membrane filters

(Sterlitech) using gravity filtration followed by washing with 5 ml of sterile (0.22

µm-filtered and Tindalized) seawater to remove free-living bacteria that might have

been retained on the filter. The <1 µm fraction containing free-living bacteria was

collected on 0.22 µm Supor-200 filters (Pall) by applying gentle vacuum pressure.

After size fractionation, particles and zooplankton were broken up before plating

(Fig. 1.5B). The zooplankton sample was homogenized using a tissue grinder (VWR

Scientific) and vortexed for 20 minutes at low speed before concentration on 0.22 µm

Supor-200 filters (Pall). These filters were then plated directly on selective media.

Similarly, 5 µm and 1 µm filters were placed in 50 ml conical tubes with 50 ml sterile

seawater and vortexed at low speed for 20 min to break up particles and detach

bacteria from the filters. The supernatant was concentrated on 0.22 µm filters, and

28

both the original and supernatant filters were placed directly on media to collect

isolates.

1.2.3 Strain isolation and identification

Isolates were obtained from TCBS plates (Accumedia or Difo) with 2% NaCl since

this media has been shown to yield good correspondence in phylogenetic groups of

vibrios detected by quantitative PCR and isolation [54]. After 2-3 days of growth,

colonies were counted and re-streaked a total of three times alternately on Tryptic

Soy Broth (TSB) (Difco) with 2% NaCl and media.

For classification of strains by sequencing, purified isolates were grown in marine

TSB broth overnight; DNA was extracted using either a tissue DNA kit (Qiagen) or

Lyse- N-Go (Pierce). Following the rationale of multilocus sequence analysis (MLSA),

housekeeping genes were used for further strain characterization since these are un-

likely to be under environmental selection. The partial hsp60 gene sequence was

amplified for all isolates as described previously [67]. For isolates with an hsp60

sequence differing by more than 2% from an already characterized strain, the 16S

rRNA gene was PCR amplified using primers 27F- 1492R and sequenced using the

27F primer [68]. The 16S sequence was used to identify the organism using the

RDP classifier [69] and BLAST [70]. In cases where the hsp60 gene either failed

to amplify or the sequence diverged greatly from other vibrios, 16S rRNA gene se-

quencing confirmed that these isolates largely belonged to the genera Pseudomonas,

Shewanella, Pseudoalteromonas, and Agaravorans (RDP Classifier) [69]. We excluded

non-Vibrionaceae strains from further analysis.

To confirm relationships for V. splendidus, the most highly represented group

among isolates, an additional gene (mdh) was sequenced. The partial mdh gene was

amplified using primers mdh mod.for (5’- GAY CTD AGY CAY ATC CCW AC -3’)

and mdh mod.rev (5’- GCT TCW ACM ACY TCD GTR CCY G -3’) (S. Preheim,

unpublished data). Two additional housekeeping gene sequences were obtained (pgi,

adk) for select groups of strains, using pgi.for (5 GAC CTW GGY CCW TAC ATG

GT 3 - 3) and pgi.rev (5-CMG CRC CRT GGA AGT TGT TRT-3) (S. Preheim,

29

unpublished data) and adk.for (5- GTA TTC CAC AAA TYT CTA CTG G-3) and

adk.rev (5- GCT TCT TTA CCG TAG TA- 3) [71]. All of these genes were amplified

using the following PCR conditions: 2 min at 94◦C followed by 32 cycles of 1 min each

at 94◦C, 46◦ and 72◦C with a final step of 6 min at 72◦C. Most genes were sequenced

at least twice using forward and reverse primers. All sequencing was performed at

the Bay Paul Center at the Marine Biological Laboratory, Woods Hole MA.

1.2.4 Phylogenetic tree construction and representation

The partial hsp60, mdh, adk, and pgi gene sequences yielded unambiguous align-

ments of 541, 422, 372, 395 nucleotides, respectively. Phylogenetic relationships were

reconstructed using PhyML v.2.4.4 [72] with following parameter settings: DNA sub-

stitution was modeled using the HKY parameter [73]; the transition/transversion

ratio was set to 4.0; PhyML estimated the proportion of invariable nucleotide sites;

the gamma distribution parameter was set to 1.0; 4 gamma rate categories were used;

a BIONJ tree was initially used; and, both tree topology and branch lengths were

optimized by PhyML. Circular tree figures were drawn using the online iTOL software

package [74]. To prevent numerical instabilities in AdaptMLs maximum likelihood

computations, branches with zero length were assigned the minimal observed non-zero

branch length: 0.001.

1.2.5 Empirical statistical testing

We employed empirical statistics to quantify evidence for differential environmental

distribution of phylogenetic groups (Fig. 1.6). We first tested the overall association

of phylogeny with our environmental data using a non-parametric parsimony-based

metric. We assigned a different character to each of the environmental categories,

and calculated the minimum number of character transitions needed to explain the

data given the observed hsp60 phylogeny. Although this test is likely to be overly

conservative given the heterogeneous nature of our observed clusters, it nonetheless

supported a highly significant correlation between phylogeny and both size fraction (p

30

< 10−49) and season (p < 10−79). Exact p-values were computed based on the algo-

rithm of [55]. It was not possible to compute p-values for both season and size fraction

together because the computational complexity of the algorithm grows exponentially

with the number of character states.

We also employed non-parametric empirical statistics to test specific model pre-

dictions. We tested the hypothesis that each of the clusters identified by the model

would be likely to arise by chance. To do this, we produced a 2 × 8 contingency table

to test for any associations between cluster membership and distribution across envi-

ronments. We used the Fisher exact test [75] as implemented in the R programming

language to evaluate the significance of each association. The results are shown in

Figure 1.4.

1.2.6 Overview of AdaptML

We developed a maximum likelihood method to help identify the boundaries of eco-

logically distinct populations and infer the ancestral habitat association of internal

nodes in the strain phylogeny. The key to our method is a hidden variable mapping a

’projected habitat’ to each node. We mathematically characterize each habitat as a

discrete probability distribution, which describes the likelihood that a strain adapted

to that habitat will be observed in each of our eight environmental categories. These

distributions, which we refer to as emission probabilities in accordance with terminol-

ogy used in machine learning, are not known a priori and must be learned from the

data. Because of this probabilistic definition of habitats, a phylogenetic group span-

ning several environmental categories can still be considered an ecologically distinct

population.

Using the habitat variables and isolate sequence-based phylogeny, we built a

hidden-Markov model (HMM) [75] describing the evolution of habitat association

(Figure 1.7). The probability that adjacent nodes in the phylogeny share the same

habitat is a function of both the branch length separating them and a parameter that

represents the rate at which a lineage can transition between habitats (the transition

rate). The observed variables the environmental category from which each strain

31

was sampled occur only at the leaves of the phylogeny. The parameters necessary

for our model can be learned from the data according to the following algorithm:

1. Initialize parameters: We initialize 16 habitats, each with random emission

probability distributions over the 8 environmental categories. The transition

rate parameter is initialized to 10−1 transitions per substitution/site (relative

to the gene phylogeny branch lengths).

2. Infer the observed datas likelihood, given phylogeny and parameter

estimates: We use a dynamic programming algorithm and the model param-

eters (transition rate and emission probabilities) to compute the likelihood of

the observed data (environmental category for each isolate). Our computation

proceeds in a manner identical to Felsenstein’s “pruning” method of computing

likelihoods on a phylogenetic tree [76].

3. Optimize parameters to maximize likelihood of observed data: We es-

timate the probability that each internal node is associated with a given habitat

by summing over all possible habitat assignments at other nodes (E-step). These

probabilities are used (M-step) to update the:

(a) Transition rate parameter : We numerically optimize the transition rate to

maximize the likelihood of the observed data.

(b) Emission probability matrix : We update the emission probability matrix

by taking the matrix that maximizes the likelihood of the observed data

given the marginal likelihoods for the habitat assignments at each of the

phylogenys leaves.

We note that separating these two steps represents an approximation, as these

two parameters are not strictly independent. The approximation, however,

speeds up the implementation considerably as only one parameter (instead of 1

+ 16 × 7 = 113) is optimized numerically.

4. Test for convergence: If the model parameters do not change significantly

from the previous iteration, then the emission probabilities and the transition

32

rate are considered to have converged: continue to step (5). (A typical tra-

jectory of emission probability convergence is shown in Figure 1.8. Note: the

approximation identified in step 3 can lead to fluctuation near a likelihood max-

imum rather than actual convergence). Otherwise, return to step (2).

5. Test for model complexity/redundancy. If a pair of habitats has emission

probability distributions that exhibit correlations greater than 0.90, they are

merged into a single habitat and the algorithm continues from step (2). If

no habitats are merged, the parameter estimation loop terminates. Although

our approach employed manual inspection and empirical testing rather than

a likelihood-based criterion for reducing model complexity [such as the AIC

[77]], our algorithm can be easily extended to include a likelihood criterion.

To test for overfitting, we performed simulations as described in the main text

and Figure 1.10. We found that our scheme acts conservatively since it usually

underestimated the true number of habitats. Figure 1.13 shows how the inferred

habitats identified by the model vary with different cutoffs.

Once a set of model parameters has been learned, we utilize the following protocol

to identify ecologically distinct and statistically significant populations.

1. Infer node habitat assignments that maximize the joint probability

of the observed data: We rely upon a joint likelihood calculation to infer a

single habitat assignment per ancestral node. To compute this likelihood, we

use the parameter estimates inferred by the algorithm described above, which

sums over all habitat assignments. Phylogenetic groups that share a common

habitat association are taken as candidate ecological populations if they pass

an empirical significance test.

2. Empirical testing to identify ecologically distinct populations: The

assignment of nodes to habitats in step (1) identifies the most likely set of

population boundaries, but may include some weakly predicted clusters. To

filter low confidence ecological groupings, we estimate empirical p-values for

33

each clade and only report statistically significant (p < 0.0001) populations

(Figure 1A & 1B). Empirical p-values are computed by comparing the likelihood

of the parent node for a cluster to the likelihood observed at the same node in

randomized trials where environmental assignments are shuffled, but phylogeny

is maintained. For comparison, all possible clusters (with no significance cutoff)

can be inferred from the full model results shown in Figure 1.11.

1.2.7 Detailed description of maximum likelihood model

Conditional likelihoods

The conditional likelihood describes the likelihood L that the leaves of a subtree

exhibit their observed states, conditional on the subtree’s root node k taking state s.

This likelihood can be defined recursively, assuming two child nodes l and m

Lk(s) =

��

x

P (x|s, tl)Lkl(x)

�×

��

y

P (y|s, tm)Lkm(y)

�(1.1)

where the function P (x|s, t) represents the probability of transitioning between

states x and s along some interval t. To reduce the number of fitted parameters

early in our model, we use the simplifying assumption that all state transitions can

be described using the same transition rate parameter µ. Thus, we compute the

probability of transitions between states as

P (x|y, t) =1

h

�1− e

−hµt�

(1.2)

and the probability of remaining in the same state

P (x|x, t) =1

h

�1 + (h− 1)× e

−hµt�

(1.3)

which is analogous to the Jukes-Cantor model for nucleotide sequence evolution

[78]. Note that if k is a leaf node in state s with observed environment o, the likelihood

is drawn from the emission probability matrix Pe:

34

Lk(s) = Pe(o|s) (1.4)

Marginal and joint likelihoods

To compute the marginal likelihood that a node k has state s, we combine the con-

ditional likelihoods:

MLk(s) =1

K×

�

t

��

x

P (x|s, tkl)Ll(x)

�(1.5)

where the l are drawn from the set of nodes adjacent to k, and K is a normalization

factor such that

�

s

MLk(s) = 1 (1.6)

To compute the likelihood of the observed data for the single best assignment of

habitats to nodes (17), the summation terms in the likelihood formula are replaced

with max operations. This is equivalent to maximizing the “joint” likelihood:

Lk(s) =�max

xP (x|s, tkl)Ll(x)

�×

�max

yP (y|s, tkm)Lm(y)

�(1.7)

The backtracking process used to keep track of the max arguments is analogous

to the Viterbi algorithm for finding the most probable state path in an HMM (16).

Parameter estimation

As probability distributions, each set of emission probabilities must satisfy

�

o

Pe(o|s) = 1 (1.8)

Because leaf nodes are independent samples of their emission probability distri-

butions (given their state assignment), we use a weighted-average approach to calcu-

lating the emission probability matrix Pe

35

Pe(o|s) =1

K×

�

k

MLk(s)×∆(o, ok) (1.9)

where MLk(s) is the marginal likelihood that node k is adapted to habitat s, the

function∆( x, y) equals 1 if and only if x equals y (and is 0 otherwise), and K is a

normalization factor.

We learn the transition rate parameter µ by numerical optimization to maximize

the likelihood of the observed data (summed over all values for the hidden variables).

1.2.8 Defining ecologically coherent and significant clusters

We use empirical testing to estimate the statistical significance associated with the

likelihood value computed for each node in the maximum (joint) likelihood assignment

of habitats to nodes (given the final parameter estimates). Each trial preserved

the phylogeny, the inferred habitat assignments, and the habitat and transition rate

parameters, but environmental categories at the leaves were shuffled. The maximum

’joint’ likelihoods at each node were compiled over all trials and used as empirical

background probability distributions.

We use empirical testing to estimate the statistical significance associated with

the likelihood value computed for each node in the maximum (joint) likelihood assign-

ment of habitats to nodes (given the final parameter estimates). Each trial preserved

the phylogeny, the inferred habitat assignments, the habitat and transition rate pa-

rameters, and the frequency of the various environmental categories; however, each

leaf was randomly assigned an environmental category in each trial. The maximum

joint likelihoods at each node were compiled over all trials and used as empirical

background probability distributions.

Using the habitat associations learned from the original (non-randomized) data

set, we identified internal nodes where habitat transitions were inferred to take place.

These nodes were then iterated through using a post-fix traversal:

36

Nested clusters

At each of these transitional nodes, a second, pre-fix traversal took place. If the sub-

tree rooted by the current node possessed a likelihood greater than that observed in

99.99% of the random trials and 90% of its leaves shared the current nodes habitat

assignment, we recognized the subtree as an ecologically coherent and statistically

significant cluster. This latter cutoff (90% coherence) was necessary to ensure mean-

ingful clusters because the likelihood at a parent node can be unusually high when two

nearly significant (but non-identical) child groups are combined. Requiring a major-

ity of child nodes to have the same predicted model state as the parent has the effect

of identifying clusters that correspond more directly to single putative populations.

Making the threshold too high (e.g., 100%) would eliminate some larger clusters that

have a small, internal nested cluster.

To enable the discovery of nested clusters, identified clusters were pruned from

the phylogeny. Nodes representing clades that contained the cluster had their joint-

likelihoods modified so that they no longer incorporated information from the pruned

groups. If the clade rooted by the current node did not satisfy both the p-value and

90% coherence thresholds, the second recursion would descend to the current nodes

children. The second recursion only terminated if the current node was either a leaf,

itself a habitat transition node, or determined to root an ecologically coherent and

statistically significant cluster.

37

Supplementary Figures

11

III. WEBLINK TO AdaptML APPLICATION

An online version of AdaptML can be found at: http://www.almlab.org/adaptml/. Here,

users can upload their own phylogenetic trees and sampling metadata for automated

analysis. Copies of the AdaptML source code can be found at this address as well.

IV. SUPPLEMENTAL TABLE AND FIGURES

This section contains Supplementary Table S1 & S2 and Supplementary Figures S1 to

S10. See main text for details.

Table S1. Temperature and nutrient concentrations on sampling dates.

Temperature Chlorophyll a1 DOC2 TDN2 NO3-+NO2

- NH4+ TDP2 PO4

3-

[˚C] [µg/L] [mg C/L] [mg N/L] [µg N/L] [µg N/L] [µg P/L] [µg P/L]

Spring (4/28/06) 11 4.07 2.11 0.17 9 189 18 14

Fall(9/6/06) 16 6.03 2.28 0.27 5 144 24 25

1 measured using overnight extraction in 90% acetone (19)2 DOC = dissolved organic carbon, TDN = Total Dissolved Nitrogen, TDP = total dissolved phosphorous. All chemical analyses were performed at the University of New Hampshire, Durham, NH.

Figure 1.3: Temperature and nutrient concentrations on sampling dates.

38

12

Table S2. Empirical significance testing of ecologically differentiated clusters predicted

by the model. Fisher’s exact test was used to identify significant associations between

clusters and environments as described in the SOM. Group numbers correspond to Figure

1 in the main text.

Group P-value

1 1.16 E-07

2 1.04 E-16

3 5.06 E-15

4 1.70 E-01

5 1.11 E-02

6 1.26 E-04

7 7.54 E-10

8 2.46 E-05

9 2.35 E-07

10 7.19 E-07

11 1.26 E-13

12 2.62 E-03

13 5.13 E-10

14 1.16 E-06

15 1.01 E-02

16 1.13 E-08

17 4.03 E-07

18 2.81 E-03

19 1.08 E-08

20 1.80 E-06

21 2.31 E-13

22 7.02 E-03

23 3.32 E-06

24 6.06 E-14

25 2.76 E-03

Figure 1.4: Empirical significance testing of ecologically differentiated clusters predictedby the model. Fishers exact test was used to identify significant associations betweenclusters and environments as described in the SOM. Group numbers correspond to Figure1.1 in the main text.

39

5 µm filter

1 µm filter

63µm plankton net

[ ]4 x

[ 4 x

[ ]4 x

2 x

0.22 µm filter

tissue grinder vortexfilter plated on

selective media

4 x

63 µm prefiltered

seawater

]

A

! BMap courtesy of the Gulf of Maine Council

on the Marine Environment

Figure 1.5: Sampling site and outline of sampling strategy for determination of bacterialdistribution among seasons and size fractions in coastal water. (A) Sampling locationshown on a map of North America (left) with a white box depicting the bounds of thepicture at right: the Gulf of Maine. The arrow points to the sampling location, Plum IslandSound, MA. (B) Protocol for obtaining size fractionated of bacterial seawater isolates usingsequential filtration and plating on selective media.

40

14

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1300

400

500

600

Pars

imon

y Sc

ore

Sequence Fraction0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

!60

!40

!20

0

log(

P!va

lue)

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1!60

!40

!20

0

Fig. S2. Empirical statistical testing for ecological association of phylogenetic clades.

The likelihood that the observed parsimony score (or lower) for size fraction data might

have arisen by chance was calculated [using the method of (15)] for a series of trees,

inferred using subsets of strains from the hsp60 gene alignment. To test the effect of

statistical uncertainty on the inferred association, 1%, 10%, 50%, 90%, or 100% of the

sequence data was randomly selected and used to construct the phylogeny (with 3

replicates each). The parsimony score for each such tree is shown with green dots

corresponding to the left axis; the p-value of obtaining that parsimony score by chance is

shown with blue dots corresponding to the right axis. These results are based on size

fraction data only, but similar results are obtained with season data.

Figure 1.6: Empirical statistical testing for ecological association of phylogenetic clades.The likelihood that the observed parsimony score (or lower) for size fraction data might havearisen by chance was calculated [using the method of [55]] for a series of trees, inferred usingsubsets of strains from the hsp60 gene alignment. To test the effect of statistical uncertaintyon the inferred association, 1%, 10%, 50%, 90%, or 100% of the sequence data was randomlyselected and used to construct the phylogeny (with 3 replicates each). The parsimony scorefor each such tree is shown with green dots corresponding to the left axis; the p-value ofobtaining that parsimony score by chance is shown with blue dots corresponding to the rightaxis. These results are based on size fraction data only, but similar results are obtainedwith season data.

41

15

Fig. S3. Overview of hidden Markov model components. (A) Habitats represent hidden

(latent) variables in the model; experiments do not directly measure what habitat a strain

is adapted to. Two adjacent nodes on the phylogeny may differ in their habitat

assignment according to the rate of habitat transition (arrows between habitats). In our

simple model, all transition rates are equivalent. (B) Associated with each habitat is an

emission probability distribution describing how likely a strain associated with a

particular habitat (H1-3) is sampled from a given environment (blue or black). Bars in this

cartoon depict hypothetical probability distributions; strains adapted to Habitat 1 have

higher probability of being observed in the black environment than in the blue

environment. (C) Our model maps a habitat onto each node in the phylogeny such that it

maximizes the probability of the observed data (at the leaves). Observations are limited

to the leaves of the phylogeny, as we cannot directly sample ancestral strains. In the

example shown, the mostly blue group is mapped to habitat 3, the mostly black group to

habitat 1, and the heterogeneous group is mapped to habitat 2.

H1

H2

H3

H1

H2

H3

A B

C

0.0 1.0

Figure 1.7: Overview of hidden Markov model components. (A) Habitats represent hid-den (latent) variables in the model; experiments do not directly measure what habitat astrain is adapted to. Two adjacent nodes on the phylogeny may differ in their habitatassignment according to the rate of habitat transition (arrows between habitats). In oursimple model, all transition rates are equivalent. (B) Associated with each habitat is anemission probability distribution describing how likely a strain associated with a particularhabitat (H1-H3) is sampled from a given environment (blue or black). Bars in this car-toon depict hypothetical probability distributions; strains adapted to habitat 1 have higherprobability of being observed in the black environment than in the blue environment. (C)Our model maps a habitat onto each node in the phylogeny such that it maximizes theprobability of the observed data (at the leaves). Observations are limited to the leaves ofthe phylogeny, as we cannot directly sample ancestral strains. In the example shown, themostly blue group is mapped to habitat 3, the mostly black group to habitat 1, and theheterogeneous group is mapped to habitat 2.

42

16

0 5 10 15 20 25 30 35 40 45 50!2840

!2820

!2800

likel

ihoo

d

iteration

0 5 10 15 20 25 30 35 40 45 500

0.5

1

para

met

er c

hang

e

likelihoodemission probtransition rate

Fig. S4. Convergence of iterative parameter optimization. During the parameter

optimization, both the average change in probability for components of the emission

probability matrix (red line) and the change in transition rate (black line) decrease

rapidly. The overall log-likelihood of the observed data converges in concert with the

parameter estimates (blue line).

Figure 1.8: Convergence of iterative parameter optimization. During the parameter opti-mization, both the average change in probability for components of the emission probabilitymatrix (red line) and the change in transition rate (black line) decrease rapidly. The overalllog-likelihood of the observed data converges in concert with the parameter estimates (blueline).

43

17

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

HA HB HC HD HE HF

Fig. S5. Reproducibility of inferred habitats. Sixty independent trials of the iterative

habitat learning process were performed. Shown are the frequencies of occurrence of the

habitats presented in Figure 1. The habitat similarity cut-off for counting a match was an

average emission probability difference < 0.10 over all environmental categories. As

expected, the least abundant habitat in our data, HD (see Fig. S4) is the least reproducible,

although even this habitat is identified in 83% of trials.

Figure 1.9: Reproducibility of inferred habitats. Sixty independent trials of the iterativehabitat learning process were performed. Shown are the frequencies of occurrence of thehabitats presented in Figure 1.1. The habitat similarity cut-off for counting a match wasan average emission probability difference < 0.10 over all environmental categories. Asexpected, the least abundant habitat in our data, HD (see Fig. 1.8) is the least reproducible,although even this habitat is identified in 83% of trials.

44

18

0 1 2 3 4 5 60

1

2

3

4

5

6

actual habitats

infe

rred

habi

tats

y=xy = mxsimulation run

Fig. S6. Number of habitats inferred from simulated data. We generated 120 simulated

datasets and compared the number of inferred habitats to the true number. Each dataset

was randomly assigned between 2 and 6 habitats; these habitats were distributed over

environmental categories by randomly partitioning the interval (0,1) according to a

uniform distribution, but requiring that each successive partitioning must occur at a larger

value than the last (random partitioning without this restriction leads to a large number of

similar ‘generalist’ habitats). These habitats were mapped onto the set of clusters learned

from the Vibrionaceae data (Fig. 1, main text). Shown are the number of habitats

inferred for these trials versus the number actually present (a small amount of Gaussian

noise was added to each data point so that the data points could be discerned). These

results suggest that the habitat inference algorithm is generally conservative, inferring

fewer rather than more than the true number of habitats.

Figure 1.10: Number of habitats inferred from simulated data. We generated 120 sim-ulated datasets and compared the number of inferred habitats to the true number. Eachdataset was randomly assigned between 2 and 6 habitats; these habitats were distributedover environmental categories by randomly partitioning the interval (0,1) according to auniform distribution, but requiring that each successive partitioning must occur at a largervalue than the last (random partitioning without this restriction leads to a large number ofsimilar generalist habitats). These habitats were mapped onto the set of clusters learnedfrom the Vibrionaceae data (Figure 1A). Shown are the number of habitats inferred for thesetrials versus the number actually present (a small amount of Gaussian noise was added toeach data point so that the data points could be discerned). These results suggest that thehabitat inference algorithm is generally conservative, inferring fewer rather than more thanthe true number of habitats.

45

19

HA

HDHE,HF

HB,HC

Habitat:

< 1 !m

5-63 !m> 63 !m

1-5 !m

Size Fraction: Seasonspringfall

Legend:

habitat 12

habitat 0

habitat 15

habitat 5

habitat 1

habitat 13

F

1

5

Z

SPR

FAL

1F 71

1F 108

FF 182

FF 96

1F 2

5F 155

1F 193

5F 183

5F 172

1F 238

1F 174

5F 304

5F 49

FF 351

1F 180

FF 295

5F 143

1F 66

FF 337

ZF 23

1F 280

1F 94

5F 119

1F 201

1F 90

FF 79

5F 73

1F 79

FF 316

FF 77

1F 211

5F 269

1F 164

1F 24

5F 279

5F 249

1F 17

1F 260

5F 98

FF 302

FF 7

1F 39

5F 186

FF 480

1F 205

FF 277

FF 384

FF 8

1F 121

FF 301

FF 15

FF 63

FF 257

1F 287

FF 9

1F 81

5F 259

1F 4

1F 36

FF 1505F 2121F 125

FF 345FF 51

ZF 265F 245FF 2265F 148

FF 492

5F 323

FF 262FF 65FF 72

1F 230

FF 468

1F 20

5F 174

FF 5

5F 31

FF 472

1F 1045F 266

5F 222

FF 32

FF 84FF 342

1F 115

1F 99

5F 132

FF 62

1F 73

5F 316

FF 313

1F 213

5F 2685F 651F 221

5F 224FF 45FF 447FF 92

5F 256FF 41F 2405F 3091F 92FF 3545F 325FF 851F 261FF 452

FF 285

FS 108

FF 124

1S 256

FS 123

1F 85

FF 190

FF 493

FF 121

FF 228

FF 106

FF 123

FF 186

FF 113

FS 184

FF 33

FF 477

1F 215

1S 138

ZF 187

ZF 153

FF 211

FF 283

1F 75

FF 119

FF 142

FF 240

FF 346

FF 165

FF 373

FF 220

FF 340

1F 156

5F 251

FF 298

5F 197

FF 163

FF 454

FF 219

FF 256

FF 133

FF 131

FF 214

FF 164

FF 122

FF 231

FF 162

FF 476

5F 169

FF 458

FF 212

1F 234

FF 104

1F 161

FF 203

FF 216

ZF 161

FF 230

FF 179

FF 217

1F 160

FF 173

FF 252

FF 130

FF 2065F

136FF 320FF 2321S 101F 30

ZF 141FF 349

FS 259

FF 2915F 45

5F 25

1S 257

FF 344

FF 93

1F 69

FS 219

FF 118

FS 103

FS 148

1F 128

FF 187

FF 495

FS 144

FF 68

FS 143

FF 197

FS 272

FS 2171S

1235F 177FS

1291S 143FF

991F 2591F

236FF 284

FF 222FS

215FF 443

FS 214

FF 263FF

317FF 159FS

2381F 247

FS 203

FF 236

FF 167

FF 166

FF 279FF

19FF 498FF

2FF 155FS

134FF 483FS

104FF 380FF

489FF 31FF

451FF 188FF

134

FF 343FS

206FF 288

ZS 109

1S 209

1S 242

5S 187

1S 45

1S 136

FF 255

FF 497

FF 209

1F 223

FF 286

FF 461

FF 195

ZF 157

1F 146

ZF 155

FF 100

FF 482

1F 151

FF 109

FF 464

ZF 53

ZF 45

5F 247

ZF 105

5F 112

5F 80

ZF 22

ZF 35

ZF 13

5F 76

ZF 165

ZF 34

1F 295

FF 474

FF 457

FF 491

FF 154

ZF 75

ZF 225

ZF 18

FF 310

FF 178

5F 299

FF 82

1F 257

5F 90

FF 296

FF 181

ZF 61

1F 220

1F 218

FF 273

ZF 71

5F 322

5S 230

5S 9

5S 257

ZF 117

5F 3

ZF 133

ZS 56

ZS 52

ZF 12

FF 57

5F 153

ZF 55

5F 227

5F 101

5F 200

5F 36

5F 240

5F 188

5F 37

5F 294

5F 9

FF 50

5F 74

5F 77

FF 449

ZF 131

ZF 101

ZF 29

5F 163

ZF 177

FF 201

5F 226

FF 147

5F 84

5F 264

FF 193

ZF 69

5F 255

ZF 115

ZF 265

1S 269

ZF 6

FF 61

5F 113

ZF 103

5F 311FF

445

1F 158

5F 126

FF 145

FF 377

FF 247

FF 161

FF 86

1F 297

FF 348

1F 153

1S 29

ZS 971F

101

FF 172

ZF 181

1F 38

5F 237

1F 149

5F 69

FF 54

5F 296

ZS 111

1F 25

5S 106

ZS 161

ZS 25

FF 20

5S 246

FF 341

1F 55

FF 6

FF 319

FF 198

FF 215

FF 207

1F 283

FF 139

FF 309

FF 183

FF 95

1F 157

FF 251

FF 138

FF 144

FS 302

1S 206

1F 57

ZS 55

5S 141

1S 35

FF 175

5F 106

5F 18

ZF 209

ZF 79

ZF 77

5F 286

5S 55

5S 37

1F 317

FF 156

ZF 2

FS 202

ZS 156

ZS 121

ZS 166

ZS 95

5S 51

ZS 202

5S 163

ZF 7

ZS 23

1S 261

1S 169

FF 1

1F 150

FF 24

1F 145

5F 102

ZS 207

5S 190

1S 119

ZS 201

5S 279

ZS 5

1S 296

5S 153

5S 262

ZS 119

ZS 183

ZS 35

ZS 105

1S 48

1S 46

5S 101

5S 67

5S 162

ZS 173

ZF 90

5S 102

ZS 101

ZS 136

5S 292

5S 213

5S 214

5S 86

ZS 28

ZS 127

5S 139

5S 89

ZF 97

1F 275

FS 213

FF 12

5F 297

5F 284

5F 51

5S 18

5F 216

5S 118

ZS 192

5S 264

5S 69

ZS 135

1F 251

1S 33

5S 64

1S 84

1S 65

5S 16

1S 63

5S 191

ZS 46

ZS 19

ZS 58

5S 122

5S 124

ZS 41

1S 285

ZS 211

ZS 178

ZS 138

1S 142

5S 116

5S 226

ZS 16

5S 249

FF 462

1S 271

ZS 176

5S 4

ZS 213

ZS 158

ZS 168

ZS 198

1S 141

ZS 193

ZS 66

ZS 210

ZS 180

ZS 181

ZS 187

ZS 195

5S 92

5S 2

5S 1

FF 91

FF 446

5S 81

ZS 208

ZS 185

5S 225

ZS 47

1S 314

ZS 8

FS 258

5S 156

5S 245

5S 252

ZS 10

ZS 29

5S 215

ZS 93

5S 57

ZS 31

ZS 163

ZS 113

1S 27

1F 197

ZF 259

5S 151

5S 111

5S 114

FF 352

FS 176

FF 308

FF 143

FF 374

FF 191

ZS 125

ZS 171

ZS 107

ZS 115

5S 192

ZF 31

ZF 59

ZS 2

1S 317

1S 319

ZS 117

5S 272

1F 29

5S 210

ZS 7

FF 125

FF 126

1S 146

1S 247

FS 178

FS 274

1S 130

5S 223

1S 113

1S 260

1S 124

1S 134

5S 161

5S 228

5S 209

5S 23

5S 235

5S 155

5S 242

5S 136

FF 259

FF 30

1S 70

1S 309

5S 5

5S 232

5S 267

ZS 103

1S 153

5S 285

ZS 36

1S 129

ZS 59

5S 134

5S 133

ZF 197

5S 135

5S 143

5S 185

ZS 53

ZS 90

ZS 68

ZS 72

ZS 78

5S 14

ZS 74

ZS 76

5S 283

ZS 84

ZS 82

ZS 86

ZS 88

1S 311

ZS 80

ZS 18

5S 123

1S 131

1S 14

ZS 20

5S 208

5S 238

5S 28

5S 268

ZS 12

5S 7

FF 185

FF 184

5S 73

5S 104

5F 57

5S 83

FF 460

ZF 43

ZF 93

5F 61

1F 263

1F 267

FF 465

FF 245

1F 200

5F 277

FF 102

FF 189

5F 70

1F 177

5F 129

5F 180

FF 174

1F 310

FF 488

5F 19

FF 233FF 484

5F 283

FF 481

ZS 184ZS 214

ZS 50

ZS 186

ZS 67

ZS 44

ZS 49

ZS 137

ZS 17

5F 275

FF 292

FF 132

ZF 41

FF 13

5F 302

FF 459

ZF 20

FF 500

FF 225

5F 261

FF 264

1F 31

FF 223

ZS 160

ZS 64

ZS 70

1S 278

5S 221

ZS 151

ZS 196

ZS 92

FF 149

FF 466

FF 36

5F 47

FF 53

5F 85

5F 29

5F 79

5F 135

FF 210

FF 148

5F 231

FF 47

FF 141

FF 22

ZF 2615S 217ZF 2031F 63

FF 3151F 42

1F 199

5F 44

1F 60

5F 4

1F 308

1F 9

5F 320

1F 195

ZF 76

FF 496

FF 14

1F 303

FF 375

FF 169

1F 249

FF 266

FF 249

1F 299

ZF 193

5F 273

FF 265

1F 187

1F 292

FF 3

1F 269

FF 2801F 183 FF 250

FF 382

FF 237

FF 2601F 185

FF 140

FF 386

FF 350 5F 324

1F 279

FF 268

FF 376

FF 213FF

305

5F 15F

60

FF 312

1F 300

5F 315

FF 290

1F 181

FF 59

FF 442

5F 318

FF 107

1F 110

FF 475

FF 267

1F 155

FF 105

FF 385

5F 20

FF 272

1F 277

ZF 111

FF 299

1F 169

FF 112

1F 148

1F 285

FF 168

1F 293

1F 253

ZS 13

ZF 170

5F 281

ZF 57

ZF 14

5F 133

ZS 63

ZF 30

ZF 207

ZF 270

FF 75

ZF 28

1F 289

5F 59

1F 97

1F 111

1F 175

1F 127

FF 160

1F 273

1F 53

FF 274

1F 124

5F 301

ZF 264

ZF 255

ZF 99

5F 171

ZF 205

ZF 65

1S 120

5S 239

5S 240

FS 285

FS 286

5S 41

1S 77

5S 149

1S 193

FS 181

FF 379

1F 265

FF 135

FF 307

FF 176

FF 204

5F 195

FF 202

FF 101

FF 239

1F 282

5F 189

FF 234

FF 221

FF 339

FF 94

5F 257

1F 171

FF 371

FF 199

FF 108

FF 304

FF 115

FF 152

FF 117

ZF 223

ZF 192

5F 165

5F 235

ZF 72

ZF 89

ZF 175

FF 170

ZF 63

1F 33

ZF 167

ZF 15

ZF 137

5S 76

1F 165

5S 78

5F 207

ZF 32

5F 205

5F 40

1S 139

ZF 8

5F 92

5S 49

5F 117

1F 255

ZF 199

ZF 33

ZF 21

ZF 269

5F 142

ZF 5

1F 243

5F 290

5S 207

ZF 17

ZF 135

ZF 9

5F 122

1S 297

ZF 10

FF 58

1S 140

FF 372

ZF 159

5F 213

FF 192

5F 289

5F 192

1F 203

ZF 219

ZF 215

ZF 195

ZF 113

ZF 91

5F 161

ZF 92

5F 201

5F 123

5F 63

ZF 100

ZF 49

5S 8

ZS 60

5S 40

ZS 37

ZS 38

ZS 99

5S 115

ZS 199

5S 62

ZS 40

5S 200

ZS 61

ZS 189

ZS 11

ZS 15

5S 22

ZS 33

ZS 209

1S 315

1S 105

5S 46

ZS 205

5S 100

ZS 22

ZS 133

ZS 139

ZS 65

5S 269

ZS 190

ZF 201

ZF 81

ZF 189

ZF 173

ZF 50

ZF 257

ZF 129

1F 77

ZF 27

ZF 107

1F 315

1F 179

ZF 163

ZF 67

5F 157

ZF 95

FF 52

ZF 109

5F 99

5F 293

ZF 19

ZF 36

1F 245

1F 117

ZF 25

ZF 47

1F 209

ZF 73

1F 28

5F 306

5F 23

5F 97

ZF 11

ZF 221

5F 308

ZF 83

5F 94

1F 227

5F 7

1F 45

5F 234

5F 33

5F 17

FF 229

5F 53

FF 246

FF 227

ZF 211

FF 254

5F 146

5F 238

ZF 213

1F 123

1F 154

1S 229

1S 237

5S 119

5S 189

5S 186

1S 159

FS 201

5S 202

5S 198

5S 183

1S 128

1S 175

FS 186

1S 165

1S 152

5S 53

FF 136

FF 200

FF 196

FF 110

FF 238

1F 207

FF 146

1F 152

5F 104

FF 21

FF 224

1F 126

5S 43

5S 159

1S 196

5S 130

5S 12

1F 189

1F 191

1F 120

5F 253

5F 6

Fig. S7. Inferred habitat associations for all ancestors of sequenced Vibrio strains. The

rings surrounding the tree represent the season (outer) and size fraction (inner) from

which strains were isolated. The maximum likelihood assignment of nodes to habitats is

shown for all nodes, regardless of the confidence of each prediction (only confident

assignments are shown in main text Fig. 1A). Colored circles on each branch indicate the

habitat assignment (HA-F, as in Fig. 1 main text) for the node immediately below that

branch (see above legend for color scheme). Branch lengths are adjusted to aid

visualization and do not represent evolutionary distances.

Figure 1.11: Inferred habitat associations for all ancestors of sequenced Vibrio strains. Therings surrounding the tree represent the season (outer) and size fraction (inner) from whichstrains were isolated. The maximum likelihood assignment of nodes to habitats is shownfor all nodes, regardless of the confidence of each prediction (only confident assignmentsare shown in Figure 1.1A). Colored circles on each branch indicate the habitat assignment(HA-HF , as in Figure 1.1A) for the node immediately below that branch (see above legendfor color scheme). Branch lengths are adjusted to aid visualization and do not representevolutionary distances.

46

20

B

A

mdh hsp60

HBHCHDHEHF

0 0.05 0.1 0.15 0.2 0.25 0.3 0.350

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

genetic distance

fract

ion

habitat identity50th percentile of <dist>95th percentile of <dist>

Fig. S8. Comparison of habitat inference on different gene phylogenies (hsp60 and mdh)

for Vibrio splendidus strains. (A) Juxtaposition of habitats learned from the mdh and

hsp60 datasets for V. splendidus strains only; habitats are labeled to allow comparison

with habitats predicted for all Vibrionaceae in Fig. 1, main text. Emission probabilities

are normalized by the total number of isolates obtained in each environmental sample to

reduce the effects of sampling bias. As expected, fewer habitats are identified from the

mdh phylogeny, which has a lower rate of nucleotide divergence and thus is less well-

resolved. (B) Comparison of habitat assignments to nodes. Because it is difficult to map

the internal nodes between topologically distinct trees, the habitat assignment for the last

common ancestor of each pair of V. splendidus strains was compared. If both

corresponded to the “same” habitat (HC, HE, or HF in both phylogenies), they were

considered to be in agreement, otherwise they were considered to be in disagreement.

The fraction of nodes in agreement is shown as a function of increasing genetic distance

between the pairs of strains considered (in the hsp60 phylogeny). The black and red lines

indicate distances that include 50% and 95% of strains within the same cluster in main

text Figure 1, respectively.

Figure 1.12: Comparison of habitat inference on different gene phylogenies (hsp60 andmdh) for Vibrio splendidus strains. (A) Juxtaposition of habitats learned from the mdh andhsp60 datasets for V. splendidus strains only; habitats are labeled to allow comparison withhabitats predicted for all Vibrionaceae in Fig. 1.1. Emission probabilities are normalizedby the total number of isolates obtained in each environmental sample to reduce the effectsof sampling bias. As expected, fewer habitats are identified from the mdh phylogeny, whichhas a lower rate of nucleotide divergence and thus is less well-resolved. (B) Comparisonof habitat assignments to nodes. Because it is difficult to map the internal nodes betweentopologically distinct trees, the habitat assignment for the last common ancestor of eachpair of V. splendidus strains was compared. If both corresponded to the same habitat (HC ,HE , or HF in both phylogenies), they were considered to be in agreement, otherwise theywere considered to be in disagreement. The fraction of nodes in agreement is shown as afunction of increasing genetic distance between the pairs of strains considered (in the hsp60

phylogeny). The black and red lines indicate distances that include 50% and 95% of strainswithin the same cluster in main text Figure 1.1, respectively.

47

21

85% 90% 95% 99%

Fig. S9. Influence of the model complexity/redundancy parameter on inferred habitats.

Clusters are merged during the model fitting procedure when the vectors describing their

distribution across environments are more than 90% correlated. As this cutoff is varied,

slightly different habitats are observed. At 85%, habitat HC is not recovered; at higher

values additional habitats become more redundant suggesting that the 90% cutoff allows

conservative recovery of characteristic projected habitats.

Figure 1.13: Influence of the model complexity/redundancy parameter on inferred habi-tats. Clusters are merged during the model fitting procedure when the vectors describingtheir distribution across environments are more than 90% correlated. As this cutoff is var-ied, slightly different habitats are observed. At 85%, habitat HC is not recovered; at highervalues additional habitats become more redundant suggesting that the 90% cutoff allowsconservative recovery of characteristic projected habitats.

48

22

0.01

5S OUTGROUP

1 FAL2

F FAL96

F FAL182

1 FAL71

1 FAL108

F FAL7

F FAL302

5 FAL98

1 FAL260

1 FAL17

5 FAL249

5 FAL279

1 FAL24

1 FAL164

5 FAL269

1 FAL211

F FAL77

F FAL316

1 FAL79

5 FAL73

F FAL79

1 FAL90

1 FAL201

5 FAL119

1 FAL94

1 FAL280

Z FAL23

F FAL337

1 FAL66

5 FAL143

F FAL295

1 FAL180

F FAL351

5 FAL49

5 FAL304

1 FAL174

1 FAL238

5 FAL172

5 FAL183

5 FAL155

1 FAL193

F FAL277

1 FAL39

5 FAL186

F FAL480

1 FAL205

F FAL384

F FAL8

F FAL9

1 FAL287

F FAL257

1 FAL121

F FAL301

F FAL15

F FAL63

5 FAL148

F FAL226

5 FAL245

Z FAL26

F FAL51

F FAL345

1 FAL81

5 FAL259

1 FAL125

5 FAL212

F FAL150

1 FAL4

1 FAL36

F FAL285

F FAL72

F FAL65

F FAL262

F FAL492

5 FAL323

F FAL452

1 FAL261

F FAL85

5 FAL325

F F

AL354

1 FA

L92

5 FA

L309

1 FA

L240

F F

AL4

5 FA

L256

1 FA

L230

F F

AL468

F F

AL92

F F

AL447

F F

AL45

5 FA

L224

1 FA

L20

5 FA

L174

1 FA

L221

5 FA

L65

5 FA

L268

F F

AL5

5 FA

L31

1 FA

L213

F F

AL313

5 FA

L316

5 FA

L222

5 FA

L266

F F

AL472

1 FA

L104

F F

AL342

F F

AL32

F F

AL84

1 FA

L73

F F

AL62

5 FA

L132

1 FA

L115

1 FA

L99

F SPR123

F SPR108

F FAL124

1 SPR256

F FAL477

F FAL33

F SPR184

F FAL113

F FAL186

F FAL123

F FAL106

F FAL228

F FAL121

F FAL493

1 FAL85

F FAL190

1 FAL215

1 SPR138

F FAL240

Z FAL187

Z FAL153

F FAL142

F FAL211

F FAL283

1 FAL75

F FAL119

F FAL454

F FAL163

5 FAL197

F FAL298

5 FAL251

1 FAL156

F FAL340

F FAL220

F FAL373

F FAL346

F FAL165

F FAL219

F FAL256

F FAL133

F FAL131

F FAL214

F FAL164

F FAL122

F FAL231

F FAL162

F FAL476

Z FAL161

F FAL216

F FAL203

1 FAL161

F FAL104

1 FAL234

F FAL212

5 FAL169

F FAL458

F F

AL230

F F

AL179

F F

AL217

1 FA

L160

F F

AL173

F F

AL252

5 FA

L25 5 F

AL4

5

1 S

PR

10

F F

AL2

32

F F

AL3

20

5 F

AL1

36

F F

AL1

30

F F

AL2

06F F

AL2

91

F S

PR

259

F F

AL3

49

1 F

AL3

0

Z F

AL1

41

F F

AL2

88

F F

AL4

43

F S

PR

215

F F

AL2

22

1 S

PR

257

F F

AL3

44

F FA

L284

1 FA

L236

1 FA

L259

F FA

L99

1 S

PR

143

F S

PR

129

5 FA

L177

1 S

PR

123

F S

PR

217

F S

PR

272

F FA

L197

F S

PR

143

F FA

L68

F S

PR

144

F FA

L495

F FA

L187

1 FA

L128

F FA

L93

1 FA

L69

F S

PR

148

F S

PR

103

F S

PR

219

F FA

L118

F S

PR

206

F F

AL3

43

1 F

AL2

47

F S

PR

238

F F

AL1

59

F F

AL3

17

F S

PR

214

F F

AL2

63

F S

PR

203

F F

AL2

36

F F

AL1

34

F F

AL1

88

F F

AL4

51

F F

AL3

1

F F

AL4

89

F F

AL3

80

F S

PR

104

F F

AL4

83

F S

PR

134

F F

AL1

55

F F

AL2

F F

AL4

98

F F

AL1

9

F F

AL2

79

F F

AL1

67

F F

AL1

66

Z S

PR

109

1 S

PR

136

1 S

PR

45

5 S

PR

187

1 S

PR

209

1 S

PR

242

5 SPR25

7

5 SPR23

0

5 SPR9

5 FA

L322

Z FAL133

Z FAL117

5 FAL3

Z SPR56

Z SPR52

Z FAL12

F FAL57

5 FAL153

Z FAL55

5 FAL240

5 FAL36

5 FAL200

5 FAL227

5 FAL101

F FAL50

5 FAL9

5 FAL294

5 FAL188

5 FAL37

F FAL449

5 FAL74

5 FAL77

5 FAL255

5 FAL163

Z FAL29

Z FAL131

Z FAL101

Z FAL69

F FAL193

5 FAL264

5 FAL84

Z FAL177

F FAL201

5 FAL226

F FAL147

Z FAL115

Z FAL6

Z FAL265

1 SPR269

F FAL61

5 FAL113

Z FAL103

5 S

PR

7

1 F

AL1

53

F F

AL3

48

1 F

AL2

97F FA

L86

F FA

L161

5 FA

L311

F FA

L445F FA

L247F FA

L377F FA

L1451 FA

L1585 FA

L1261

FAL2

5

Z S

PR

1111 S

PR

29Z S

PR

971 FA

L101F

FAL1

725 FA

L296

F FA

L545

FAL6

91 FA

L149

Z FA

L181

1 FA

L38

5 FA

L237

5 S

PR

106Z

SP

R16

1Z S

PR

25

F FA

L20

5 S

PR

246

1 FA

L57

1 SP

R20

6

F FA

L207

F FA

L215

F FA

L198

F FA

L319

F FA

L6

F FA

L341

1 FA

L55

F SP

R30

2

F FA

L144

1 FA

L283

F FA

L139

F FA

L95

F FA

L309

F FA

L183

F FA

L138

1 FA

L157

F FA

L251

1 S

PR

35

Z S

PR

55

5 S

PR

141

5 S

PR

37

5 S

PR

55

5 FA

L286

5 FA

L18

F FA

L175

5 FA

L106

Z FA

L77

Z FA

L209

Z FA

L79

Z FA

L2

1 FA

L317

F FA

L156

Z SP

R121

F SP

R202

Z SP

R156

Z SP

R166

Z SP

R95

5 SP

R51

Z SPR20

2

Z SP

R23

5 SP

R163

Z FA

L7

1 SPR26

1

1 SPR16

9

5 FA

L102

1 FA

L145

F FA

L24

F FA

L1

1 FA

L150

Z SPR10

5

Z SPR20

7

5 SPR19

0

1 SPR11

9

Z SPR20

1

5 SPR27

9

5 SPR15

3

Z SPR5

1 SPR29

6

Z SPR35

Z SPR18

3

5 SPR262

Z SPR119

1 SPR48

1 SPR46

Z SPR13

6

5 SPR16

2

5 SPR10

1

5 SPR67

Z SPR10

1

5 SPR10

2

Z SPR17

3

Z FAL9

0

5 SPR64

5 SPR292

5 SPR213

Z SPR127

Z SPR28

5 SPR214

5 SPR86

Z FAL97

5 SPR139

5 SPR89

1 SPR33

1 FAL275

F SPR213

5 FAL51

5 FAL284

F FAL12

5 FAL297

1 FAL251

5 SPR18

5 FAL216

Z SPR135

5 SPR69

5 SPR264

5 SPR118

Z SPR192

5 SPR191

1 SPR84

1 SPR65

5 SPR16

1 SPR63

Z SPR138

1 SPR285

Z SPR41

5 SPR124

5 SPR122

Z SPR58

Z SPR46

Z SPR19

Z SPR211

Z SPR178

1 SPR142

5 SPR116

1 SPR271

5 SPR249

5 SPR226

Z SPR16

F FAL462

Z SPR176

5 SPR4

Z SPR168

Z SPR213

Z SPR158

Z SPR195

Z SPR198

1 SPR141

Z SPR187

Z SPR193

Z SPR66

Z SPR181

Z SPR210

Z SPR180

5 SPR1

5 SPR92

5 SPR2

F FAL91

F FAL446

5 SPR81

5 SPR245

Z SPR208

Z SPR185

5 SPR156

5 SPR225

Z SPR47

F SPR258

1 SPR314

Z SPR8

Z SPR31

Z SPR29

5 SPR252

Z SPR10

5 SPR57

5 SPR215

Z SPR93

1 SPR27

Z SPR163

Z SPR113

5 SPR151

1 FAL197

Z FAL259

5 SPR111

5 SPR114

F FAL191

F FAL374

F FAL143

F FAL308

F FAL352

F SPR176

1 SPR319

1 SPR317

5 SPR192

Z SPR115

Z SPR107

Z SPR125

Z SPR171

Z FAL31

Z FAL59

Z SPR2

Z SPR7

5 SPR210

1 FAL29

Z SPR117

5 SPR272

1 SPR146

F FAL125

F FAL126

1 SPR247

F SPR178

F SPR274

5 SPR209

5 SPR228

5 SPR161

1 SPR130

5 SPR223

1 SPR134

1 SPR124

1 SPR113

1 SPR260

5 SPR155

5 SPR23

5 SPR235

5 SPR242

5 SPR136

F FAL259

F FAL30

1 SPR70

1 SPR309

5 SPR5

Z SPR103

5 SPR232

5 SPR267

Z SPR59

1 SPR153

5 SPR285

Z SPR36

1 SPR129

5 SPR135

Z FAL197

5 SPR134

5 SPR133

5 SPR143

5 SPR185

Z SPR53

Z SPR12

5 SPR268

5 SPR28

5 SPR238

5 SPR208

Z SPR20

Z SPR90

Z SPR68

Z SPR72

Z SPR78

5 SPR14

Z SPR74

1 SPR14

1 SPR131

5 SPR123

Z SPR18

Z SPR80

1 SPR311

Z SPR88

Z SPR86

Z SPR82

Z SPR84

Z SPR76

5 SPR283

F FAL185

F FAL184

5 SPR73

5 SPR104

5 FAL57

5 SPR83

Z SPR13

1 FAL253

F FAL102

5 FAL277

1 FAL267

1 FAL263

Z FAL93

F FAL460

Z FAL43

5 FAL61

1 FAL200

F FAL465

F FAL245

F FAL189

5 FAL70

1 FAL177

1 FAL310

F FAL174

5 FAL129

5 FAL180

F FAL488

5 FAL19

F FAL2231 FA

L31F FAL2645 FA

L261F FAL225

F FAL500

Z FAL20F FA

L459

5 FAL302

F FAL13Z FA

L41F FAL132

F FAL292

F FAL481

5 FAL283

F FAL233

F FAL484

5 FAL275

Z SP

R17

Z SP

R137

Z SP

R49

Z S

PR

44

Z S

PR

67

Z S

PR

186

Z S

PR

50

Z S

PR

184

Z S

PR

214

Z SPR70

Z SPR160

Z SPR64

1 SPR278

Z SPR92

Z SPR1965 S

PR

221Z SP

R151F FA

L22

F FAL149F FA

L466F FAL141F FA

L53

F FAL36

5 FAL47

F FAL47

5 FAL231F FA

L148F FAL2105 FA

L1355 FAL795 FA

L855 FAL29

Z F

AL261

5 SP

R217

Z F

AL203

1 FA

L63

1 FA

L249

5 FA

L4

1 FA

L60

5 FA

L44

1 FA

L199

F F

AL315

1 FA

L42

F F

AL169

F F

AL375

1 FA

L303

F F

AL14

F F

AL496

Z F

AL76

1 FA

L195

5 FA

L320

1 FA

L308

1 FA

L9

1 FA

L269

F F

AL3

1 FA

L292

1 FA

L187

F F

AL265

5 FA

L273

Z F

AL193

1 FA

L299

F F

AL266

F F

AL249

F F

AL237

F F

AL382

F F

AL250F

FA

L280

1 F

AL1

83

F F

AL3

50

F F

AL3

86

F F

AL1

40

F F

AL2

60

1 F

AL1

85

F F

AL376

F F

AL268

5 FA

L324

1 FA

L279

1 F

AL2

93

F F

AL1

68

1 F

AL2

85

1 F

AL1

48

F F

AL1

12

1 F

AL1

69

F F

AL2

99

Z F

AL1

11

1 F

AL2

77

F F

AL2

72

5 F

AL2

0

F F

AL3

85

F F

AL1

05

1 F

AL1

55

F F

AL2

67

F F

AL4

75

1 F

AL1

10

F F

AL1

07

5 F

AL3

18

F F

AL4

42

F F

AL2

13

F F

AL3

05

F F

AL5

9

1 F

AL1

81

F F

AL2

90

5 F

AL3

15

1 F

AL3

00

F F

AL3

125 F

AL15 F

AL6

0

Z FAL65

Z FAL170

5 FAL281

Z SPR63

Z FAL57

Z FAL14

5 FAL133

F FAL75

Z FAL270

Z FAL30

Z FAL207

5 FAL59

Z FAL28

1 FAL289

1 FAL124

F FAL274

1 FAL53

1 FAL273

F FAL160

1 FAL127

1 FAL175

1 FAL97

1 FAL111

Z FAL205

5 FAL171

5 FAL301

Z FAL264

Z FAL255

Z FAL99

F FAL307

F FAL135

1 FAL265

F FAL379

F SPR181

1 SPR193

1 SPR120

5 SPR239

5 SPR149

1 SPR77

5 SPR41

F SPR286

5 SPR240

F SPR285

F FAL101

F FAL202

5 FAL195

F FAL176

F FAL204

F FAL239

1 FAL282

F FAL339

5 FAL189

F FAL234

F FAL221

1 FAL171

F FAL94

5 FAL257

F FAL170

F FAL117

F FAL371

F FAL199

F FAL108

F FAL304

F FAL115

F FAL152

Z FAL223

Z FAL192

Z FAL175

5 FAL165

5 FAL235

Z FAL72

Z FAL89

Z FAL49

Z FAL167

Z FAL63

1 FAL33

Z FAL15

Z FAL137

5 SPR76

1 FAL165

5 SPR78

5 FAL207

1 FAL243

Z FAL5

5 FAL142

Z FAL269

Z FAL21

Z FAL33

Z FAL199

Z FAL32

5 FAL205

5 FAL40

1 SPR139

Z FAL8

5 FAL92

1 FAL255

5 SPR49

5 FAL117

Z FAL17

5 FAL290

5 SPR207

Z FAL100

Z FAL113

Z FAL195

Z FAL215

Z FAL219

1 FAL203

5 FAL192

5 FAL289

F FAL192

5 FAL213

Z FAL159

F FAL372

1 SPR140

F FAL58

Z FAL10

Z FAL135

Z FAL9

5 FAL122

1 SPR297

Z FAL92

Z FAL91

5 FAL161

5 FAL63

5 FAL201

5 FAL123

Z SPR190

5 SPR8

Z SPR60

Z SPR38

5 SPR40

Z SPR37

Z SPR40

Z SPR199

Z SPR99

5 SPR115

5 SPR62

Z SPR15

Z SPR11

Z SPR189

5 SPR200

Z SPR61

5 SPR22

Z SPR33

1 SPR105

Z SPR209

1 SPR315

5 SPR46

Z SPR205

5 SPR100

Z SPR22

5 SPR269

Z SPR65

Z SPR133

Z SPR139

5 FAL99

Z FAL109

Z FAL173

Z FAL189

Z FAL201

Z FAL81

Z FAL129

Z FAL50

Z FAL257

1 FAL77

F FAL52

Z FAL95

5 FAL157

Z FAL67

Z FAL163

1 FAL179

1 FAL315

Z FAL27

Z FAL107

5 FA

L293

Z FA

L19

1 FAL1

26

F FAL1

36

F FAL2

00

F FAL2

24

F FAL2

38

F FAL1

96

F FAL110

1 FAL207

F FAL146

F FAL21

1 FAL152

5 FAL104

5 SPR12

5 SPR13

0

5 SPR43

5 SPR15

9

1 SPR19

6

1 FAL1

20

1 FAL1

89

1 FAL1

91

5 FAL2

53

5 FAL6

5 SPR183

5 SPR119

1 SPR229

1 SPR237

5 SPR198

5 SPR189

5 SPR186

5 SPR202

1 SPR159

F SPR201

5 SPR53

1 SPR152

1 SPR165

F SPR186

1 SPR128

1 SPR175

1 FAL154

Z FAL213

1 FAL123

Z FAL36

1 FAL245

1 FAL117

5 FAL238

5 FAL146

5 FAL7

1 FAL227

5 FAL94

Z FAL83

Z FAL221

5 FAL308

F FAL254

Z FAL211

F FAL227

F FAL246

5 FAL53

F FAL229

5 FAL17

5 FAL33

1 FAL45

5 FAL234

Z FAL11

Z FAL47

Z FAL25

1 FAL28

1 FAL209

Z FAL73

5 FAL97

5 FAL306

5 FAL23

1 FA

L151

F FA

L109

F FA

L464

Z FA

L53

Z FA

L45

5 FA

L80

5 FA

L112

5 FA

L247

Z FA

L105

5 FA

L299

F FA

L178

F FA

L310

Z FA

L18

Z FA

L225

Z FA

L75

Z FA

L22

Z FA

L35

F FA

L154

F FA

L491

F FA

L457

F FA

L474

1 FA

L295

Z FA

L34

Z FA

L165

Z FA

L13

5 FA

L76

5 FA

L90

F FA

L82

1 FA

L257

F FA

L273

1 FA

L218

Z FA

L61

1 FA

L220

Z FA

L71

F FA

L181

F FA

L296

F FA

L255

F FA

L482

F FA

L100

Z FA

L155

Z FA

L157

1 FA

L146

F FA

L195

F FA

L286

F FA

L461

1 FA

L223

F FA

L209

F FA

L497

Fig. S10. Statistical uncertainty in the hsp60 gene tree by constructing 100 trees using

non-parametric bootstrap re-sampling from the hsp60 alignment. Clades supported in

greater than 80% of bootstraps are indicated with a black dot.

Figure 1.14: Statistical uncertainty in the hsp60 gene tree by constructing 100 treesusing non-parametric bootstrap re-sampling from the hsp60 alignment. Clades supportedin greater than 80% of bootstraps are indicated with a black dot.

49

Chapter 2

Rapid evolutionary innovationduring an Archean Genetic

Expansion

Lawrence A. David, Eric J. Alm

This chapter is presented as it was submitted for publication. CorrespondingSupplementary Material is appended.

Chapter 2

Rapid evolutionary innovation

during an Archean Genetic

Expansion

A natural history of Precambrian life remains elusive because of the rarity of microbial

fossils and biomarkers [79, 80]. The composition of modern day genomes, however,

may bear imprints of ancient biogeochemical events [81–83]. We have employed an

explicit model of macroevolution including gene birth, transfer, duplication and loss

to map the evolutionary history of 3,968 gene families across the three domains of life.

We observe that horizontal gene transfer (HGT) is the primary source of new genes

in prokaryotes, while duplication dominates in eukaryotes. Inter-domain gene trans-

fer is rare compared to intra-domain transfer with the notable exception of massive

Bacteria-Eukarya transfer events that correspond to the endosymbiosis of the mito-

chondria and chloroplasts [84, 85]. Surprisingly, we find that a brief period of genetic

innovation during the Archean eon gave rise to 27% of major modern gene families.

Genes born during this period are especially likely to be involved in electron trans-

port, while later genes exhibit a gradually increasing usage of molecular oxygen. Our

results demonstrate that reconstructing the complex interplay between organismal

and geochemical evolution over Earth history is becoming a tractable goal.

51

Introduction

Describing the emergence of life on our planet is one of the grand challenges of the

Biological and Earth sciences. Yet the roughly three-billion-year history of life pre-

ceding the emergence of hard-shelled metazoans remains obscure [79]. To date, the

best understood event in early Earth history is the Great Oxidation Event, which is

believed to follow the invention of oxygenic photosynthesis by the ancestors of modern

cyanobacteria [86] (though the precise timeline remains controversial [80]). If DNA

sequences from extant organisms bear an imprint of this event, then we can use them

to make and test predictions [81–83]; e.g., genes that use molecular oxygen will be

confined to a group of organisms emerging after the Great Oxidation Event. Transfer

of genes across species, however, can obscure patterns of descent and disrupt our abil-

ity to correlate gene histories with the geochemical record [87]. Widely distributed

genes, for example, may descend from a Last Universal Common Ancestor (LUCA)

as widely believed to be the case for the translational machinery [88], or may have

been dispersed by HGT [12, 89], as in the case of antibiotic resistance cassettes.

Methods Overview

We developed a new phylogenomic method, AnGST (Analyzer of Gene and Species

Trees), to account for the confounding effects of HGT by comparing individual gene

phylogenies to the phylogeny of organisms (the Tree of Life). We refer to this process

as tree reconciliation and provide a detailed description of the AnGST algorithm in

the Supplementary Information. Unlike some previous methods [24, 25, 27], AnGST

uses the topology of the gene family tree rather than just its presence/absence across

genomes and can infer duplication, HGT, and loss events. Importantly, AnGST also

accounts for uncertainty in gene trees by incorporating reconciliation into the tree-

building process: the tree that minimizes the evolutionary cost function, but is still

supported by the sequence data, is chosen as the best gene tree. Simulated trees

inferred with this method are more accurate than trees based on a maximum likeli-

hood model of sequence data alone (Figure 2.5). Thus, tree building methods such as

52

AnGST that explicitly model macroevolutionary events may have utility in phyloge-

netic inference [33]. We used a previously described Tree of Life [90] to reconcile gene

families, although we note that our key results were consistent when using 30 alterna-

tive reference trees, including those that used the Archaea or Eukarya as outgroups

(Figs. 2.11, 2.12). Ensuring proper causality in a large reconciliation (i.e., avoiding

the “grandfather paradox” in which a gene is inferred to be its own ancestor) is a

computationally intractable problem in general [31], which we overcome by explicitly

modeling the timing of evolutionary events based on a chronogram constructed from

our reference tree. A conservative set of eight temporal constraints was selected from

the geochemical and paleontological literature (Table 2.1), and the PhyloBayes soft-

ware package was used to infer a range of divergence times for each ancestral lineage

on the reference tree [91]. We did not apply temporal constraints to lineage ages on

the gene trees.

Results

Domain-specific Macroevolutionary Trends

For 3,968 extant gene families [106], AnGST predicted a total of 109,452 speciation,

38,575 HGT, 14,021 gene duplication, and 35,252 gene loss events (Figure 2.1A). The

abundance of HGT events (on average 9.7 per gene family) underscores the evolu-

tionary importance of gene transfer in prokaryotic genome structure (Figure 2.15).

Domain-specific preferences in the types of macroevolutionary events emerge in Fig-

ure 2.1B. On a per-gene basis, gene transfer is 2.1 times more likely in bacteria than

in eukaryotes, while duplications are 4.4 times more likely in eukaryotes than in bac-

teria. The rate of HGT in eukaryotes is likely to be an overestimate because we did

not consider eukaryote-only gene families. The bias toward duplication in eukary-

otes is consistent with known domain-specific traits, such as unequal crossing-over,

whole-genome duplication events, and reduced selection against large genome sizes

[16, 107, 108]. Interdomain transfers comprise a minority (16.1%) of HGT events

but exhibit significant over-representation of HGT from alpha-proteobacteria to an-

53

Event Constraint Evidence1 Last universal common ances-

tor arises< 3850 Ma Carbon isotope fractionation

[92, 93]2 Cyanobacteria emerge > 2500 Ma Traces of an aerobic nitro-

gen cycle [94], changes inredox-metal enrichments [95],and sulfur isotope fractiona-tion data [96, 97] indicate oxy-genic photosynthesis; tracesof 2α-methylhopane biomark-ers [98] indicate cyanobacterialpresence

3 Eukaryotes diverge from Ar-chaea

> 2670 Ma Preserved sterane biomarkers[98, 99]

4 Akinetes diverge fromcyanobacteria lacking celldifferentiation

> 1500 Ma Akinete microfossils [100]

5 Archaeplastida emerge > 1198 Ma Red algae microfossils [101]6 Animals emerge > 635 Ma Preserved demosponge ster-

anes [102]7 Tetrapods emerge (a) < 385 Ma

(b) > 359 Ma(a) Tetrapod precursor dating[103]; (b) Tetrapod fossil dat-ing [104]

8 Buchnera diverge from Wig-

glesworthia

> 160 Ma Fossil history of Buchnera’saphid hosts [105]

Table 2.1: Temporal constraints used to construct chronogram. Eight temporal con-straints that could be directly linked to fossil or geochemical evidence were used to estimatedivergence times on the Tree of Life (Figure 2.10).

54

cient eukaryotes (p=3.3 × 10−7 Wilcoxon rank sum test) and from cyanobacteria to

plants (p=8.3 × 10−6 Wilcoxon rank sum test). These results likely reflect the an-

cient endosymbioses that gave rise to the mitochondrial and chloroplast organelles

[84, 85]. Functional analysis of HGT from the alpha-proteobacteria to the ancestral

eukaryotes reveals significant enrichment for energy metabolism genes (p=1.6× 10−6

Fisher’s Exact Test), further supporting an association between these HGT and an

energy-producing endosymbiosis (Figure 2.18). HGT from cyanobacteria to Arabidop-

sis thaliana are also enriched for energy-producing genes (p=3.9×10−3 Fisher’s Exact

Test), as well as translation-related genes (p=4.4 × 10−5 Fisher’s Exact Test) which

likely reflect the migration of 70S ribosomal proteins from the chloroplast to the plant

nucleus [109].

An Archean Genetic Expansion

Gene histories reveal dramatic changes in the rates of gene birth, duplication, loss,

and HGT over geologic time scales (Figure 2.2). The most striking feature of the

overall gene flux depicted in Figure 2 is a burst of de novo gene family birth between

3.33-2.85 Ga which we refer to as the Archean Genetic Expansion (AGE). This win-

dow gave rise to 26.8% of extant gene families and coincides with a rapid bacterial

cladogenesis. A spike in the rate of gene loss (∼3.1 Ga) follows the AGE and may

represent consolidation of newly evolved phenotypes, as ancestral genomes became

specialized for newly emerging niches. After 2.85 Ga, the rates of both gene loss and

gene transfer stabilize at roughly modern-day levels. The rates of de novo gene birth

and duplication after the AGE appear to show opposite trends: de novo gene family

birth rates decrease and duplication rates increase over time. The near absence of de

novo birth in modern times likely reflects the fact that ORFan gene families, which

are widespread across all major prokaryotic groups, are not considered in this study

[110]. The excess of gene duplications and ORFans in modern genomes suggests that

novel genes from both sources experience high turnover and rarely persist over long

evolutionary time scales.

What evolutionary factors were responsible for the period of innovation marked by

55

the AGE? While we cannot provide an unequivocal answer to this question using gene

birth dates alone, we can ask whether the functions of genes born during this time

suggest plausible hypotheses. In general, birth of metabolic genes is enriched during

the AGE, especially those involved in energy production and coenzyme metabolism

(Table 2.17), but further inspection also reveals an enrichment for metabolic gene

family birth prior to the AGE. To focus on specific metabolic changes linked to the

AGE we: (i) grouped genes according to the metabolites they used; and (ii) we directly

compared the occurrence of these metabolites in genes born during the AGE to their

abundance prior to the AGE. The results are striking: the AGE-specific metabolites

(positive bars, Figure 2.2 inset) include most of the compounds annotated as redox/e−

transfer (blue bars), with Fe-S-, Fe-, and O2-binding gene families showing the most

significant enrichment (False Discovery Rate < 5%, Fisher’s exact test). Gene families

that use ubiquinone and FAD (key metabolites in respiration pathways) are also

enriched, albeit at slightly lower significance levels (False Discovery Rate < 10%).

The ubiquitous NADH and NADPH are a notable exception to this trend and appear

to have played a role early in life history. By contrast, enzymes linked to nucleotides

(green bars) exhibited strong enrichment in genes of more ancient origin than the

AGE.

The observed metabolite usage bias suggests that the AGE was associated with an

expansion in microbial respiratory and electron transport capabilities. Proving this

association to be causal is beyond the power of our phylogenomic model. Yet this

hypothesis is appealing because more efficient energy conservation pathways could

increase the total free energy budget available to the biosphere, possibly enabling

the support of more complex ecosystems and a concomitant expansion of species and

genetic diversity. We note, however, that while the use of oxygen as a terminal electron

acceptor would have significantly increased biological energy budgets, oxygen-utilizing

genes are only enriched toward the end of the AGE (Figure 2.14). Thus, the earliest

redox genes we identified as part of the AGE were likely to be used in anaerobic

respiration or oxygenic/anoxygenic photosynthesis, although some may have been

co-opted later for use in aerobic respiration pathways.

56

Phylogenomic evidence for ancient changes in global redox potential

Our metabolic analysis supports an increasingly oxygenated biosphere following the

AGE, as the fraction of proteins utilizing oxygen gradually increases from the AGE

until the present day (Figure 2.3; p=3.4×10−8, two-sided Kolmogorov-Smirnov test).

Further indirect evidence of rising oxygen levels comes from compounds that are sen-

sitive to global redox potential. We observe significant increases over time in the usage

of the transition metals copper and molybdenum (Figure 2.3; False Discovery Rate

< 5%, two-sided Kolmogorov-Smirnov test), which is in agreement with geochemical

models of these metals’ solubility in increasingly oxidizing oceans [82, 83] and with

the growth of molybdenum enrichments from black shales that suggests molybdenum

began accumulating in the oceans only after the Archean eon [111]. Our prediction of

a significant increase in nickel utilization accords with geochemical modeling predic-

tions of a 10X increase in dissolved nickel concentration between the Proterozoic and

modern day [82], but conflicts with a recent analysis of banded iron formations that

inferred monotonically decreasing maximum deposited nickel concentrations from the

Archean onwards [112]. The abundance of enzymes using oxidized forms of nitrogen

(N2O and NO3) also grows significantly over time, with 1/3 of nitrate-binding gene

families appearing at the beginning of the AGE and 3/4 of nitrous oxide-binding gene

families appearing by the AGEs end. The timing of these gene family births provides

phylogenomic evidence for an aerobic nitrogen cycle by the Late Archean [94].

One striking discrepancy between our phylogenomic patterns and geochemical pre-

dictions, however, is a modest but significant increase in iron-using genes over time

(Figure 2.3; False Discovery Rate < 5%, two-sided Kolmogorov-Smirnov test). The

cessation of iron formation deposition roughly 1.8 Ga and ocean chemistry models in-

dicate that iron usage should decrease following the Archean, as declining iron solubil-

ity in oxygenated ocean surface waters and sulfide-mediated iron removal from anoxic

deeper waters combined to reduce overall iron bioavailability [113]. The counterin-

tuitive phylogenomic prediction may reflect the confounding effect of evolutionary

inertia, whereby microbes in the face of declining iron availability could have found

57

more success evolving a handful of metal-acquisition proteins (e.g. siderophores),

rather than replacing a host of iron-binding proteins. Alternatively, the insolubility

of iron in modern oceans may be offset by large existing organic pools of reduced iron.

A precise timeline for oxygen availability is beyond the resolution of our relaxed

molecular clock approach and remains a contentious topic in organic geochemistry

[80, 98, 99]. Nonetheless, our results suggest an Archean biosphere containing some

of the basic components required for oxygenic photosynthesis and respiration, despite

the fact that appreciable oxygen levels do not appear in the geological record until

much later (roughly 2.5 Ga) [114, 115]. Although our results are consistent with recent

biomarker-based evidence for early oxygenesis [99], special caution should be used in

comparing the molecular and geological dates. Divergence times for deep nodes on

our reference chronogram are uncertain (Figure 2.16), and they are partially based on

the constraint that the LUCA could be as old as the earliest evidence for life (3.85 Ga)

[92], even though the LUCA is likely a descendant of the first life form. Furthermore,

although the PhyloBayes dates include uncertainty estimates that are accurate given

the assumptions of the CIR model [91], an alternative, semi-parametric approach

implemented in r8s [116] results in a much younger date of 2.75-2.5 Ga for the AGE

(compared to 3.33-2.85 Ga for PhyloBayes) which is closer to the Great Oxidation

Event (Figure 2.12). Here we present mainly results from the PhyloBayes program,

because it allowed us to explicitly account for uncertainty in the timing of inferred

events. With such disparate phylogenetic estimates of the timing of the most ancient

lineages, a chronology for the AGE and the evolution of oxygen-producing genes will

require a careful integration of both geochemical and genetic data.

Implications

Using just eight temporal constraints as our geochemical and paleontological guides,

we have shown that whole genome sequence data can be used to infer details of

microbial evolution during the Archean eon and can recount global changes in redox-

sensitive compound bioavailability following the evolution of oxygenesis. Still, our

phylogenomic approach represents only a first step toward linking whole genome

58

sequence data to early Earth history. By connecting events in gene histories to events

in Earth history, hypotheses of enzyme or pathway presence/absence can be used

to make testable predictions about when metabolic signatures should first appear in

the geochemical record. Conversely, geochemical hypotheses may be tested against

predictions of extant metabolisms (as we demonstrate using the rise of oxygen). This

may admit useful new lines of evidence for geochemical theories that suffer from gaps

in the rock record. Successive refinement of phylogenomic models against geochemical

constraints may eventually yield an abundant and reliable source of Precambrian

fossils: modern-day genomes.

59

2.1 Figures

Phanerozoic

Proterozoic

Archean

a

a

Chloroflexi

Aquificae

Thermotogae

DeinococcalesPlanctomycetes

Bacteroidetes

Chlamydiae

Firmicutes

Transfer

Loss

Birth Chromalveolata

Plantae

Fungi

Metazoa

Crenarcheota

Euryarchaeota

Nanoarchaeota

a

a

0.5 Ga

A

B

Alm, Figure 1

Event Rate

Arch

aea

Bacte

ria

Eu

karya

Transfer Origins

L

D

B

H

0.181

0.188

0.065

0.022

L

D

B

H

0.151

0.171

0.039

0.016

L

D

B

H

0.102

0.083

0.171

0.008

E

A

B

3%

40%

57%

E

A

B

2%

5%

93%

E

A

B

34%

5%

61%

Archae. fulgid.

Methan. acetiv.

Methan. jannas.

Methan. kandle.

Methan. therma.

Pyroco. DSM

Nanoar. equita

.

Sulfol. s

olfat.

Sacc

ha. ce

revi

.

Caenor. e

legan.

.

Dan

io rer

io.

Pan troglo

.Mus

mus

cul.

Thermo. tengco.

Clostr. tetani.

Lister. innocu.Bacill. anthra.

Oceano. iheyen.

Entero. faecal.Lactoc. lactis.

Desulf.

vulg

ar.

Woline. succin.

.

.

Mesorh. loti.

Sinorh. melilo.

Ralsto. solana.

Chromo. violac.

Neisse. mening.

Xylell. fastid.

Pseudo. aerugi.

Salm

on. e

nte

ri.E

scher. co

li.S

hig

el. fle

xne.

Wiggle. glossi.

Therm

o. m

ariti.

Aquife. aeolic

.

Therm

u. th

erm

o.

Dein

oc.

radio

d.

Dehalo

. eth

eno.

Pro

chl.

m.9

313.

Pro

chl.

m.C

CM

P.

Syn

ech

. elo

nga.

.

Bacte

r. theta

i.

Chlam

y. tracho.

Bifid

o. lo

ngum

.

Cory

ne. g

luta

m.

Figure 2.1: Evolutionary events by lineage. (A) The number of macroevolutionary eventsis mapped to each lineage on an ultrametric Tree of Life and visualized using the iToLwebsite [74]. Pie chart area denotes the number of events, and color indicates event type:gene birth (red), duplication (blue), HGT (green), and loss (yellow). (B) The averagenumber of events per gene copy is separated by domain and the origins of HGT events aredepicted: Bacteria (green), Archaea (beige), Eukarya (violet).

60

*Ubiquino.Fructose**Fe

FADH2Ferrocyt.H2O2

Glucose**O2

Oxaloace.Gen e- acceptor

Zn*FAD

.Glyceron.

Oxogluta.Alcohol

AcetateThiamin .Carboxyl.

Phosphoe.NAD+

CysteineBiotin

H2OH+

Glyceral.Methioni.

Glutamat.CTP

CoANH3

CO2Mn

Fructose.Fumarate

tRNANADPH

THFAMP

Aspartat.GMP

eATP

GlycineGlutamin.

PyruvatePyridoxa.

PPi*Orthopho.

FormateDNA

UDPGDP

GTPNTP

CMP**PRPP

**UMP

3.5

3.0

2.5

2.0

Tim

e [G

a]

1.5

1.0

0.5

0.0

4 2 0 2 4 8 10

Transfer

Loss

Birth

DuplicationDuplication

[events / genome / 10 Ma]Gene loss rate Gene gain rate

1 2 30

Archean Gene Expansion

ArcheanProterozoic

ProterozoicPhanerozoic

Present

0

2[log (enrich)] in AGE

12

Redox / e- transfer TCA / GlycolysisCarboxylic AcidAmino AcidNucleotide

Carbohydrate

Other

2[log (enrich)] pre-AGE

Figure 2.2: Rates of macroevolutionary events over time. The figure shows the rates ofmacroevolutionary events (colors as in Figure 2.1) as a function of time. Shown are theaverage rates of evolutionary events per lineage (events / 10 Ma / lineage). Events thatincrease gene count are plotted to the right, and gene loss events are shown to the left (yellowcurve). Genes already present at the LUCA are not included in the analysis of birth ratesbecause the time over which those genes formed is not known. The AGE was also detectedwhen alternative chronograms were considered (Figure 2.12). Inset: metabolites or classes ofmetabolites ordered according to the number of gene families that use them that were bornduring the AGE compared to the number born before the expansion. Metabolites whoseenrichments are statistically significant at a False Discovery Rate < 10% or < 5% (Fisher’sExact Test) are identified using one or two asterisks, respectively. Bars are colored byfunctional annotation or compound type (functional annotations were assigned manually).Metabolites were obtained from the KEGG database release 51.0 [117] and associated withCOGs using the MicrobesOnline September 2008 database [118]. Metabolites associatedwith fewer than 20 COGs or sharing more than 2/3 of gene families with other includedmetabolites are omitted.

61

0.00.51.01.52.02.53.03.5

Time [Ga]

Fe*ZnMnCu*Mo*CoNi*

19.2

2.7

2.8

4.8

2.5

24.7

43.4

40.0

0.0

0.0

0.0

0.0

24.0

15.0

4.8

4.6

4.1

4.0

22.8

44.7

2.9 4.0

Tran

sit

ion

Meta

ls

(174

)

Oxygen* (88)

36.0

Gen

e fr

actio

n n

orm

aliz

ed

to p

resen

t day

1.0

1.1

1.2

1.3

1.4

≥ 1.5

0.9

0.8

0.7

0.6

≤ 0.5

NH3

NO3*N2O*N2

2.4

0.9

5.8

90.8

1.6

0.9

2.8

94.6

0.0

0.0

0.0

Nit

ro

gen

Co

mp

.

(115

)

100.0

SSO4SO3H2SS2O3

14.7

8.0

16.2

54.0

7.1

14.4

8.1

17.0

50.6

19.3

9.7

6.5

22.6Su

lfu

r C

om

p.

(79)

42.0

9.9

CO2CHO2

CH2OCH4OCH4

5.1

3.7

14.0

77.0

0.2

3.8

2.6

14.0

79.6

4.4

2.2

10.0

0.0

C1

Co

mp

.

(214

)

83.3

0.0

1.3

Alm, Figure 3

Figure 2.3: Genome utilization of redox-sensitive compounds over time. The first barillustrates a gradual increase in the fraction of enzymes that bind molecular oxygen predictedto be present over Earth history (p=1.3×10−7, two-sided Kolmogorov-Smirnov test). Colorsindicate abundance normalized to present-day values. The lower four panels group transitionmetals, nitrogen compounds, sulfur compounds, and C1 compounds. The fraction of eachgroup’s associated genes that bind a given compound, normalized to present-day fractions,is shown over time using a color gradient. Enclosed boxes show raw fractional values atthree time points: 3.5 Ga (left); 2.5 Ga (middle); and the present day (right). For example,18.9% of transition metal-binding genes are predicted to have bound Mn at 2.5 Ga, a value1.26 times the size of the modern day percentage of 15.0%. Values within parenthesesgive the overall number of gene families in each group. To determine which compoundsshowed divergent genome utilization over time, the timing of copy number changes for eachcompound’s associated genes was compared to a background model derived from all othercompounds. Compounds whose utilization significantly differs from the background modelare marked with an asterisk (False Discovery Rate < 5%, two-sided Kolmogorov-Smirnovtest). Nitrite and nitric oxide are not shown due to their COG-binding similarity to nitrateand nitrous oxide, respectively.

62

2.2 Supplementary Material

2.2.1 Overview

We developed a phylogenomic method that we named AnGST (Analyzer of Gene

and Species Trees), which “reconciles” any observed differences between a gene tree

and a reference tree (species tree) by inferring a minimal set of evolutionary events,

including horizontal gene transfer (HGT), gene duplication (DUP), gene loss (LOS),

speciation (SPC) and exactly one gene birth or genesis event (GEN). Each event type

is assigned a unique cost, and the overall sum of costs associated with a reconciliation

is minimized (i.e., we use a generalized parsimony criterion). We address previously

described shortcomings of similar parsimony-based models of host-parasite evolution

[119] by accounting for phylogenetic uncertainty (using a new approach described

below) and directly estimating event costs from our large dataset. We divide the

gene-tree/species-tree reconciliation process into two components:

• The basic reconciliation step assumes a known gene tree and species tree and

identifies the set of evolutionary events (HGT, DUP, LOS, SPC, GEN) needed

to explain any discordance between the trees

• The tree amalgamation step accounts for gene tree uncertainty by incor-

porating tree construction into the reconciliation process: multiple gene tree

bootstraps are provided to AnGST and the algorithm retains and combines

bootstrap subtrees which yield the most conservative reconciliation consistent

with the sequence data.

The estimation of event costs from the input data is based on reducing large fluc-

tuations in ancient genome sizes. This method is presented in Section 2.2.5 together

with a sensitivity analysis for the resulting parameters.

The AnGST software package is implemented in the Python programming lan-

guage and can be downloaded from: http://almlab.mit.edu/ALM/Software/.

63

2.2.2 Basic Reconciliation Algorithm

First assumptions

The basic reconciliation step requires a rooted, strictly bifurcating gene tree G and

species tree S. Each tree is composed of a set of nodes linked to one another by a

set of connecting edges. We assume that each node g in G can be mapped to a node

s in S, a mapping we abbreviate as g:s. This mapping describes which (extant or

ancestral) genome hosted a given (extant or ancestral) gene copy. Maps are known

with certainty for extant genes, but must be inferred for ancestral gene copies.

Algorithm explanation

Our goal in gene/species tree reconciliation is to recover the optimal set of evolu-

tionary events that explain any topological discordance between the gene and species

trees. A brute-force search through all possible evolutionary histories is intractable,

as the number of possible histories grows exponentially with increasing tree size [120].

However, for a given gene and species tree pair, there are only |S| possible mappings

for the root node of the gene tree, gr. If the optimal reconciliation is already known

for each possible mapping gr:sr, where sr is a node in S, a new outgroup for the

gene tree can be added (making gr a child of the new root node gn), and optimal

reconciliations for the larger gene tree can be quickly computed using the following

method:

1. For each possible pair of mappings (gr:sr, gn:sn) where sr and sn are nodes in

S

(a) Choose the most parsimonious explanation for how a gene copy in sn de-

scended into sr.

(b) Concatenate this history to the known optimal reconciliation for gr:sr, to

produce the optimal reconciliation for the (gr:sr, gn:sn) pair

2. Identify optimal reconciliations for each mapping gn:sn by selecting the minimal

overall reconciliation cost associated with (gr:sr, gn:sn) as sr is varied over the

64

nodes of S.

Using the above method, the reconciliation problem can be formulated in a dy-

namic programming framework, yielding computational complexity that is a polynomial-

time function of gene tree size. The AnGST program implements this algorithm as a

post-fix traversal of the gene tree. At each node, reconciliations from child subtrees

are combined in mini-reconciliations, which explain how the gene copy at g coalesced

from two child copies c1 andc2 (i.e., whether HGT, speciation, or duplication oc-

curred), assuming the mappings g:s, c1:s1, and c2:s2. This is repeated for each s, s1,

and s2 ∈ S. Mini-reconciliations return optimal duplication-loss or HGT scenarios if s

is the last common ancestor of s1 and s2, or if s is identical to either s1 or s2. All other

combinations of s, s1, and s2, yield mini-reconciliations that we refer to as complex

scenarios. We include these scenarios in the pseudocode below to aid understand-

ing of basic reconciliation design, but we do not provide a method for their solution

since complex scenarios can be safely ignored without loss of reconciliation optimal-

ity (see Running Time discussion below). If g is a leaf node, mini-reconciliations are

unnecessary since the true mapping from g to the species tree is known. Once all

combinations have been evaluated, we retain the optimal reconciliation associated

with each possible mapping of g to the species tree. Pseudocode for the reconciliation

algorithm is provided on the following page in Python style.

65

Pseudocode:

% Main %• Reconcile(gene_tree.root)% Methods %• define Reconcile(node):

• child_1, child_2 = ChildNodes(node) %strictly bifurcating tree• if child_1 AND child_2 are null: %is a leaf node

• for node_map in AllNodes(species_tree): • if node_map is KnownHostGenome(node):

• node.reconciliation_cost(node_map) = 0 %correct answer is known for leaves• else:

• node.reconciliation_cost(node_map) = maxint • return

• Reconcile(child_1) %post-fix traversal• Reconcile(child_2)• for node_map in AllNodes(species_tree): %try all possible hosts for ancestor

• for child_1_map in AllNodes(species_tree): %try all possible hosts for children• for child_2_map in AllNodes(species_tree):

• events = MiniReconcile(node_map, child_1_map, child_2_map)• prior_events_1 = child_1.reconciliation_cost(child_1_map)• prior_events_2 = child_2.reconciliation_cost(child_2_map)• overall_cost = Cost(events + prior_events_1 + prior_events_2)• cost_matrix(node_map, child_1_map, child_2_map) = overall_cost

• for node_map in AllNodes(species_tree):• node.reconciliation_cost(node_map) = Min(cost_matrix(node_map, :, :))

• return

• define MiniReconcile(node_map, child_1_map, child_2_map):• % compute DupLoss scenarios• if node_map is ancestral to child_1_map AND child_2_map:

• if node_map is last_common_ancestor of child_1_map AND child_2_map:• %%% See Page46 for DupLoss pseudocode• duploss_events = DupLoss(node_map, child_1_map, child_2_map)

• else:• duploss_events = ComplexScenario()• %ComplexScenario() not implemented -- see Methods Section 1.1.2 Running Time discussion for explanation

• else:• duploss_events = maxint % impossible to reconcile with only dup-loss

• % compute HGT scenarios• if node_map is child_1_map:

• hgt_events = {HGT from node_map to child_2_map}• elif node_map is child_2_map:

• hgt_events = {HGT from node_map to child_1_map}• else:

• hgt_events = ComplexScenario()• return MinCost(hgt_events,duploss_events)

5

An example reconciliation:

An AnGST reconciliation of two simple, but discordant, gene and species trees is

provided in Figure 2.4. Here, we assume that we know the true mappings from the

leaves in G to S: g1:sA, g2:sC , g3:sB. Because AnGST uses a post-fix traversal of G

and the mapping of G’s leaves to S is trivial, we first investigate how g4 is mapped

to nodes in S. We initialize the algorithm by assigning infinite reconciliation cost to

leaf mappings which deviate from the known leaf mappings (e.g. g1:sB); thus, there

is only one valid mapping for g1 and g2.

In Scenario α, g4 is mapped to sA (g4:sA) and we infer one HGT event using

the mini-reconciliation algorithm (since g4 is mapped to the same lineage as one

of its child nodes). Similarly, if we consider g4:sC , we infer one HGT from sC to

sA (Scenario β). In the case of g4:sE (Scenario γ), g4 is mapped to the LCA of

sA and sE and a duplication-loss scenario is invoked by the mini-reconciler. Other

more complex scenarios exist (e.g., s4:sD), but these can be ignored without affecting

overall reconciliation optimality (see Running Time section below). Once optimal

reconciliations have been found for each possible g4:s mapping, AnGST recurses to

g5 and repeats the process. In the next mini-reconciliation, there are multiple valid

g4:s mappings. Thus AnGST must iterate through prospective mappings for both g5

and g4 (although for the sake of illustrative simplicity, we only enumerate a fraction

of these scenarios).

In the first mapping shown for g5 (g5:sD, g4:sA, g3:sB), there is 1 SPC (since

sA and sB are direct vertical descendants of sD) and this cost is added to the 1

HGT already inferred in Scenario α, which resulted in g4:sA. For the combination

(g5:sC ,g4:sC ,g3:sB), a cost of 1 HGT (because g5 and g4 share the same mapping) is

added to the cost for Scenario β. The last mapping shown is (g5:sE,g4:sE,g3:sB). A

mini-reconciliation that posits HGT will imply forward-in-time gene transfers – an

evolutionary event we do not allow (see Temporal constraints on HGT below). In-

stead, a DUP in sE and subsequent losses among sA and sC are needed to correctly

explain the mapping of s5:sE, s4:sE, and g3:sB. The g5 mapping that leads to the

67

optimal reconciliation is a function of the chosen evolutionary event costs. With a

cost structure: CSPC=0, CHGT =1, CLOS=2, CDUP =3, the optimal mappings would

be g5:sD, g4:sA, and the associated reconciliation would be a GEN event at sD, fol-

lowed by SPC at sD, and an HGT from sA to sC . However, if CSPC=0, CHGT =10,

CLOS=2, CDUP =3, the optimal mapping would be g5:sE, g4:sE, and the associated

reconciliation would be an initial GEN event at sE, followed by a DUP in sE, 2 SPCs

each at sE and sD, and LOS in lineages sA, sB, and sC .

Running time

O(|G|*|S|3) is an upper bound on run-time complexity of AnGST, where |S| and

|G| are the number of nodes in those trees, respectively. Running times can be

significantly reduced without loss of reconciliation optimality, however, with a simple

speedup. When performing mini-reconciliations on all combinations of g:s, c1:s1, and

c2:s2 for s, s1, and s2 ∈ S, any complex scenario (s is not s1 or s2, and s is not the

last common ancestor of both s1 or s2) will require at least two HGT (one to s1 and

another to s2), or one HGT to the last common ancestor of s1 and s2 followed by a

duplication-loss scenario originating at that ancestor. These more complex scenarios

will therefore always be suboptimal with respect to non-complex scenarios and their

evaluation can be skipped during the reconciliation process. The resulting reduction

in mapping search space lowers AnGST run-time complexity to O(|G|*|S|2). When

temporal constraints on HGT are enforced (see below), this speedup cannot be fully

exploited, as nodes ancestral to s1 and s2 are potentially optimal values for s in HGT

scenarios.

In practice, on 3.0Ghz single-cores with access to 8GB of memory, an AnGST run

reconciling 100 bootstrap trees from one gene family against a reference tree of 100

species would take roughly: 0.1 minutes for gene trees with 10 leaves, 4 minutes for

gene trees with 50 leaves, 13 minutes for gene trees with 100 leaves, 27 minutes for

gene trees with 150 leaves, 37 minutes for gene trees with 200 leaves.

68

Temporal constraints on HGT

If provided a chronogram as a reference tree, AnGST will restrict the set of possible

inferred gene transfers to only those between contemporaneous lineages. This feature

eliminates the possibility of inferring multiple HGT events which are chronologically

impossible [31]. Any non-zero chronological overlap is sufficient to allow transfers.

But, if a gene transfer is inferred from node s1 to node s2, subsequent transfers of the

gene copy in s2 may only occur with lineages which exist during the range T1 ∩ T2,

where T1 and T2 are the times spanned by the parent edges of s1 and s2, respectively. A

feature enabling transfers forward in time (which may represent “phantom transfers”

from unsampled taxa [121]) has been built into AnGST, but remains off by default

and was not used in our analyses.

Gene tree rooting

Bootstrap trees are assumed to be unrooted. All possible rootings of these bootstrap

trees are evaluated during the reconciliation process. The resulting gene tree is rooted

on the branch that results in the overall lowest reconciliation score.

2.2.3 Bootstrap tree amalgamation

Errors or uncertainty in gene phylogenies can lead to the inference of spurious macroevo-

lutionary events [32] and is a particular concern for deeply branching phylogenies

[122]. AnGST resolves uncertainty by incorporating reconciliation into the tree-

building process: the tree with the lowest reconciliation cost is chosen from a large

ensemble of trees consistent with the sequence data. To generate an ensemble of

suitable trees, AnGST considers the set of all trees that contains only bipartitions

observed in a set of input trees, which we generate with non-parametric bootstrap-

ping. Thus, AnGST typically outputs chimeric trees that do not match any of the

input bootstrap trees exactly, although every bipartition in the AnGST tree occurs

in at least one of the bootstraps. In simulations, we observe these trees to be signif-

icantly more accurate than trees based on sequence likelihood alone, although they

69

generally have lower likelihood (see Chimeric tree fidelity below and Figure 2.5). Any

number of bootstrap trees can be used, but we found limited increase in accuracy in

simulated data as a result of using more than 10 (data not shown).

We implement this approach in the following manner (see Figure 2.6 for an exam-

ple). Given n gene tree bootstraps {G1, G2, ... Gn} and a reference tree S, AnGST

will begin the basic reconciliation algorithm starting on tree G1. Each time AnGST

evaluates an internal node g of G1, it also evaluates the set of internal nodes I = {g1,

g2, ... gk} in other bootstrap gene trees that define the same bipartition as g. The

optimal reconciliation at this node is the lowest scoring scenario/topology observed

in any of the bootstrap trees. That is, a distinct solution is computed for each pos-

sible mapping (gi:s for gi ∈ I, s ∈ S), and only the best solution is retained for each

value of s. These |S| optimal mappings and their reconciliations are subsequently

shared across all the nodes in I. This last step creates “chimeric” gene trees, as the

reconciliation at g in G1 may now refer to a topology found in bootstrap Gi.

2.2.4 Simulation and Benchmarking

Simulation

Benchmarking

We used simulations to benchmark the performance of AnGST. Ten independent

gene trees births were simulated on each of the 199 extant and ancestral lineages of

the reference tree. A simple Poisson statistics-based model of HGT, DUP, and LOS

was used to generate random gene histories and associated gene trees; the average

simulated gene family underwent 0.21 HGT, 0.05 DUP, 0.76 SPC, and 0.26 LOS per

extant gene copy. (For comparison, our analysis of the COG dataset inferred 0.29

HGT, 0.10 DUP, 0.83 SPC, and 0.27 LOS per extant gene copy.) Synthetic amino

acid sequences were generated using these simulated trees and the SeqGen software

(v.1.3.2) [123]. Trees were reconstructed from the synthetic sequences using either the

BIONJ algorithm (implemented in PhyML), PhyML (v.2.4.5) [72], or AnGST via 100

PhyML-generated bootstrap topologies (see Section 2.2.8 for PhyML parameters). A

70

subset of 75 gene families were used to learn costs for HGT and DUP (see Methods

Section 2.2.5). A cost combination of CHGT =4,CDUP =3 minimized genome size flux

using this gene family subset (compared to CHGT =3, CDUP =2 learned for the COG

dataset).

Chimeric tree fidelity

Following reconciliation, nodes deep in the interior of the resultant gene tree can

contain topologies not found in any of the inputted bootstraps (although all possible

bipartitions of these subtrees will exist in at least one of the bootstraps). Thus, the

potential search space of topologies is vast. We tested the fidelity of the chimeric

gene trees learned during the reconciliation process using the Robinson-Foulds (RF)

statistic [124], which measures the number of bipartitions not shared by a pair of

trees. A 0 RF score indicates perfect concordance (all bipartitions of the candidate

and reference tree are identical) and increasing RF scores denote higher phyloge-

netic discordance. Analysis of the 225 gene trees with a minimal level of complexity

(more than 10 leaves) demonstrates that AnGST trees are significantly more accu-

rate than trees generated by BIONJ (p=9.110-8 Wilcoxon rank sum test) or PhyML

alone (p=1.810-2 Wilcoxon rank sum test). Interestingly, this increase in topological

accuracy comes with a likelihood tradeoff in comparison to the PhyML algorithm

(p=2.710-39 Wilcoxon rank sum test). As an aside, we note that the PhyML like-

lihoods in these analyses are in agreement with previous simulations which showed

PhyML capable of constructing trees with higher likelihood than the true topologies

[72].

Inferred birth date accuracy

We benchmarked the accuracy of gene family birth dates predicted by AnGST using

the 747 synthetic gene families that included more than one extant gene copy. A

comparison of inferred birth events and the simulated age of birth events is shown in

Figure 2.7A. There is a strong correlation between inferred and simulated ages (0.88)

and 76% of births are predicted to within 250 My of their simulated age. These

71

results are especially promising given the noisy processes (sequence simulation and

phylogenetic inference) separating simulation of a gene family and its reconciliation.

Moreover, we see no obvious evidence of inference bias which may lead to the false

inference of a birth spike. Direct comparison of birth counts during the AGE (2.9-3.3

Ga) to simulated births during the same period (Figure 2.7B) did not show a bias

towards over-counting births. We did, however, observe a bias toward gene birth prior

to the AGE, suggesting that our set of very ancient genes (born prior to 3.3 Ga) may

be inflated.

2.2.5 Parameter learning

Minimizing genome size flux

We address the problem of assigning the costs to each event type in a manner similar

to some previous studies [25, 27, 125]: we use predictions of ancestral genome sizes

to constrain the costs CDUP and CHGT (these are the only free parameters as we

can assume CLOS=1 and CSPC=0 without loss of generality). However, we chose

to minimize differences in genome size between parent and child nodes (a metric we

refer to as genome size flux) rather than constraining overall genome size over time for

two reasons: first, gene acquisition rates may not have been constant over time and

ancient genomes may have been smaller (or larger) than modern day genomes [125];

second, the extinction of ancient gene families would lead to a trend of smaller inferred

ancestral genome sizes at earlier times even if actual genome sizes were constant. A

grid search of cost space showed genome size flux to be minimized at: CHGT =3 and

CDUP =2 (Figures 2.8A, 2.9).

Sensitivity analysis

We investigated the extent to which the high fraction of overall gene birth detected

during the Archean Gene Expansion was dependent on model parameters (Figure

2.8B). Gene birth patterns were invariant over a broad range of CDUP . Gene birth

from 2.8-3.4 Ga dissipated only at low CHGT values. However, this regime of CHGT

72

resulted in unrealistic genome size distributions: ancestral genomes were much smaller

than present day ones, and most genes were predicted to have been born on terminal

branches and spread via HGT.

2.2.6 Reference tree construction

Building a Chronogram of Life

We used a previously reported Tree of Life as the template for a reference chronogram

[90]. This template was constructed using a concatenation of 31 translation-related

orthologs. All of the species represented in our gene family dataset were present

in this template tree. Divergence times were estimated using PhyloBayes (v.2.3c)

[126]. Since autocorrelated molecular clock models have been shown to outperform

uncorrelated ones in some cases similar to this study [91], we ran PhyloBayes with

a CIR process model of rate correlation. Eight sets of temporal constraints that

could be directly linked to fossil or geochemical evidence were used and are displayed

in Figure 2.10. Benchmarking PhyloBayes runs in parallel (n=95) established that

predicted divergence times and model likelihood converged after a burn-in of roughly

1500 model cycles. Final divergence time estimates were estimated following a burn-

in of 2500 cycles, after which trees were sampled every 20 cycles until the 3500th

cycle.

2.2.7 Alternate reference trees

We tested the extent to which the AGE was sensitive to the topology of the reference

tree and to the molecular clock model used in chronogram construction. We built 10

separate reference phylogenies using non-parametric bootstrapping of the Ciccarelli et

al. gene alignment [90] (see Section 2.2.8 for PhyML parameters) and rooted each with

either the Bacteria, the Archaea, or the Eukarya as the outgroup. Unequivocal errors

in phylogeny that may be due to sequence alignment construction errors were observed

for the Bdellovibrio, Shigella, Treponema, and Helicobacter pylori taxa; these were

resolved by manual pruning and re-grafting. Each phylogeny was then converted to an

73

ultrametric tree using r8s (v.1.71) [116] under a penalized likelihood model (with an

additive penalty function, truncated Newton nonlinear optimization, cross-validation

enabled, a cross validation start value of 10, a cross validation smoothing increment

of 3, and the number of smoothing values tried set to 4). The same set of temporal

constraints was used as for PhyloBayes. For the purposes of computational economy,

a subset of 250 COGs was randomly selected from our dataset and reconciled against

each of the 10 alternate chronograms.

Predicted birth ages were robust to the usage of alternative chronograms. The

median gene family birth date difference between any two alternative chronograms

is 0.09 Ga (Fig. 2.11) Elevated rates of gene birth during the Late Archean were

observed in all 30 of the alternative chronograms and on average, 19% of the 250

chosen COGs were predicted to be born during a 200-My window (Fig. 2.12). How-

ever, the timing of AGE-like window diverged from that reported by PhyloBayes

and spanned 2.7-2.5 Ga (compared to 3.3-2.9 Ga for PhyloBayes). Inspection of the

r8s chronograms suggests this temporal discrepancy may be related to differences in

dating the cyanobacteria under the two models. For both the r8s and PhyloBayes

chronograms, AGE-like events coincided with the relatively brief period during which

the major bacterial phyla, such as the Firmicutes and Proteobacteria, diverged from

the last bacterial common ancestor. This period of compressed cladogenesis predates

the appearance of the cyanobacteria by roughly 100-200 My in both models. Our

r8s analysis places the initial occurrence of the cyanobacteria at 2.5 Ga, which is

precisely the minimum age constraint for the appearance of this clade (see Section

2.2.6). Using the same constraints, PhyloBayes predicts the cyanobacteria to have

emerged 3.0 Ga. We confirmed the importance of the cyanobacteria in dating the

AGE by reducing the minimum constrained age of this clade during r8s chronogram

construction; the resultant model yielded a younger AGE (data not shown).

2.2.8 Gene tree construction

Families of orthologous genes used in this study are based upon functionally anno-

tated orthologous groups from the COG database [127], as extended to a wider set

74

of genomes in the eggNOG database [106]. Due to computational limitations, we re-

stricted this study to a subset of 100 of these genomes (11 eukaryotic, 12 archaeal, and

67 bacterial) broadly distributed across the Tree of Life. Sequences were downloaded

from the eggNOG database in September of 2008.

eggNOG-derived families were filtered to ensure usable levels of sequence conser-

vation with the aim of excluding the most error-prone phylogenies. We performed

this filtering in an iterative fashion: First, we excised poorly aligned regions of se-

quence [90, 128], using Gblocks (0.91b) [129] with the minimum number of sequences

for a flank position set to half the number of sequences in the alignment, the maxi-

mum number of contiguous non-conserved positions set to 8, the minimum length of

a block set to 2, and the allowed gap positions set to all. Second, we excluded genes

with more than 20% of their sequence in these excised regions from each gene family.

Third, Muscle (v3.7) [130] was used with default settings to realign the remaining se-

quences. This process then returned to the first step, unless no sequences or regions

were removed in the first or second steps in which case the process terminated. Of

the original 4872 COGs, 788 lost more than 25% of their original gene copies during

this process; these COGs were considered likely to be error-prone and thus excluded

from further analysis. Another 101 COGs were not analyzed due to their high gene

copy numbers and the extreme computational demands of running AnGST on those

large families. A distribution of gene copy numbers within each gene family is shown

in Figure 2.13.

Phylogenetic trees were constructed for the remaining gene families using ver-

sion 2.4.5 of PhyML [72] and the following parameters: 100 bootstrap trees, a JTT

substitution model, 0.0 percentage of the sites were invariable, 4 substitution rate

categories, a gamma distribution parameter of 1.0, a BIONJ-based starting tree, and

both tree topology and branch length optimization were enabled.

75

Figure 2.4: Example of a basic reconciliation. An AnGST reconciliation of twosimple, but discordant, gene (G) and species (S) trees is shown. The mapping of leaves ofG to S: g1:sA, g2:sC , g3:sB is indicated with color (e.g., g1 and sA are both shown in blue).Reconciliation proceeds in a post-fix manner through the gene tree, first evaluating possiblemappings from g4 to nodes in the S. Once the reconciliation process is completed at g4,the algorithm continues at g5. A detailed explanation of this reconciliation is provided inSection 2.2.2.

76

−20 −10 0 100

2

4

6

8

10

RF

Scor

e

! log likelihood

AnGSTBIONJPhyML

Figure 2.5: AnGST trees are more accurate than likelihood trees in simulation studies.We simulated the evolution of sequence data using 225 randomly generated gene treeswith more than 10 leaves. Gene trees were reconstructed from synthetic sequence datausing either BIONJ (red), PhyML (black), or AnGST (blue). Phylogenetic accuracy wasevaluated by Robinson-Foulds (RF) score. A 0 RF score indicates perfect concordance(all bipartitions of the candidate and reference tree are identical) and increasing RF scoresdenote higher phylogenetic discordance.The logarithm of sequence likelihood given each treemodel, relative to the likelihood calculated with the true gene topology, is plotted on the Xaxis. Mean RF scores and relative log likelihoods are drawn with rectangles whose heightand width reflect standard errors of the mean; protruding lines are standard deviations.PhyML-based trees enjoy significantly higher likelihood scores than the AnGST chimerictrees (p = 2.7×10−39 Wilcoxon rank sum test), but the AnGST-based trees are significantlymore similar to the correct gene tree topologies (p = 1.8 × 10−2 Wilcoxon rank sum test).Outlying points beyond axes were not drawn to facilitate viewing mean values, but wereincluded in mean and significance estimations.

77

Figure 2.6: Amalgamation algorithm for phylogenetic uncertainty. An AnGSTreconciliation of four gene tree bootstrap topologies {G1, G2, G3, and G4} and species treeS is shown. Leaf nodes on each bootstrap map to leaves on S according to color. Thereconciliation begins on one of the bootstrap trees, G1 (Step 1) and proceeds to an interiornode (Step 2). The reconciliation does not consider other topologies for this subtree, as itonly contains two leaves. When the reconciliation reaches the parent node node g1 (Step 3),AnGST considers subtrees from other bootstraps with alternative topologies (but identicalleaves). Corresponding subtrees are found on G3 and G4 and rooted at nodes g3 and g4

respectively (Step 4). Reconciliations are performed in parallel at g1, g3, and g4. For themapping of these internal nodes to lineage A on the species tree, the reconciliation at g3 isoptimal (since its topology matches the reference one) and the corresponding subtree in G3

is substituted for the mappings g1:sA and g4:sA.

78

−4 −3.5 −3 −2.5 −2 −1.5 −1 −0.5 0−4

−3.5

−3

−2.5

−2

−1.5

−1

−0.5

0

Simulated Birthdate [Ga]

Infe

rred

Birth

date

[Ga]

−4 −3.5 −3 −2.5 −2 −1.5 −1 −0.5 00

0.5

1

1.5

2

2.5

Birthdate [Ga]

Birth

bia

s ra

tio

B

A

Figure 2.7: Benchmarking AnGST inference accuracy. A) A scatter plot of simu-lated gene family birth dates and inferred birth dates. Points drawn signify midpoints ofbranches associated with birth events. A slight amount of Gaussian noise with distributionN(µ=0, σ=0.025) has been added to each point so that overlapping points can be distin-guished. The correlation coefficient is 0.88 and 76% of predicted births are within 250 My(bounded by red dashed lines) of their true ages. B) Birth prediction bias is plotted as afunction of time. Predicted births have been normalized by the number of simulated birthsassociated with a given age. The AGE (2.9-3.3 Ga) is highlighted in pink.

79

012345678

0

5

10

5

10

15

20

25

30

35

CHGT

Genome size flux sensitivity to HGT and DUP costs

CDUP

Gen

ome

size

flux

10

15

20

25

30

0 1 2 3 4 5 6 7 8 9

0

2

4

6

8

100

0.05

0.1

0.15

0.2

0.25

0.3

0.35

0.4

CHGT

Gene birth spike sensitivity to HGT and DUP costs

CDUP

Frac

tion

of a

ll CO

Gs

born

from

2.8−3

.4 G

a

0.05

0.1

0.15

0.2

0.25

0.3

0.35

B

A

Figure 2.8: AnGST parameter learning and sensitivity analysis. A) We performeda grid search over the costs CHGT and CDUP with the intention of minimizing averagegenome size flux between inferred ancestral genomes. The costs CLOS and CSPC were fixedat 1.0 and 0.0, respectively. Flux can clearly be minimized along the CHGT axis, but isless sensitive to changes in CDUP . A minimum point does exist, however, at CHGT =3.0,CDUP =2.0. B) A sensitivity analysis for our detection of a high fraction of births from 2.8-3.4 Ga was performed over the same parameter space evaluated in A). Comparable fractionsof overall gene birth to the AGE were detected in parameter space near the genome sizeflux minimum. Note that axes are reversed in panels A and B in order to facilitate viewingparameter sensitivity landscapes.

80

10

Legend:

Dataset sizes

A

Color ranges:

foo_ffd9b3

foo_ffecd9

foo_d9d9ff

foo_b3b3ff

foo_d7ffce

foo_9ee08c

Thermo. a

cidop.

Archae. fulgid.

Haloba. sp.

Methan. acetiv.

Methan. jannas.

Methan. kandle.

Methan. therma.

Pyroco. DSM

Nanoar. equita

.

Aeropy. p

ernix.

Sulfol. s

olfat.

Pyroba. a

eroph.

Gia

rdi. lam

bli.

Pla

sm

o. fa

lcip

.C

rypto

. hom

ini.

Ara

bid

. th

alia

.S

acc

ha. ce

revi

.

Caenor. e

legan.

Dro

sop.

mel

ano.

Dan

io rer

io.

Hom

o sa

pien

.

Pan tr

oglo

.Mus

mus

cul.

Thermo. tengco.

Clostr. tetani.

Onion phytop.Mycopl. genita.

Mycopl. gallis.

Staphy. aureus.

Lister. innocu.Bacill. anthra.

Bacill. subtil.

Oceano. iheyen.

Lactob. planta.Entero. faecal.Lactoc. lactis.Strept. pneumo.

Desu

lf. v

ulgar.

Bdello. b

acter.

Geobac. su

lfur.

Sol

iba.

usita

t.

Acido

b. s

p.

Woline. s

uccin.Helico. pylori.Helico. hepati.

Campyl. jejuni.

Wolbac. sp.Ricket. prowaz.Caulob. cresce.Rhodop. palust.

Bradyr. japoni.

Mesorh. loti.

Brucel. abortu.

Sinorh. melilo.

Agroba. tumefa.

Nitros. europa.

Bordet. pertus.

Ralsto. solana.

Chromo. violac.

Neisse. mening.

Coxiel. burnet.

Xantho. campes.

Xylell. fastid.

Pseudo. aerugi.

Shewan. oneide.

Photob. profun.

Vibrio choler.

Haem

op. influe.

Salm

on. e

nte

ri.E

scher. co

li.S

hig

el. fle

xne.

Yersin

. pestis.

Wiggle. glossi.

Buch

ne. a

phid

i.

Therm

o. m

ariti.

Aquife. aeolic

.

Fusoba. nucle

a.

Therm

u. th

erm

o.

Dein

oc. ra

dio

d.

Dehalo

. eth

eno.

Glo

eob. vi

ola

c.Syn

ech.

sp.

Pro

chl.

m.9

313.

Pro

chl.

m.C

CM

P.

Syn

ech

. elo

nga.

Nost

oc

sp.

Porphy. gingiv.

Bacte

r. theta

i.C

hloro. tepidu.Chlam

y. tracho.

Chlam

y. pneumo.

Tre

pon. p

allid

.

Borre

l. burg

do.

Lepto

s. inte

rr.R

hodop. b

altic.

Bifid

o. lo

ngum

.

Tro

phe. w

hip

pl.

Mycoba. tu

berc

.

Cory

ne. g

luta

m.

Stre

pt. c

oelic

.

183

2507

Figure 2.9: Inferred ancient genome sizes for CHGT=3.0, CDUP=2.0. Circle areasscale absolutely with genome sizes. Our optimization metric, genome size flux, aims tominimize the average difference between parent and child genomes. Ancestral genomes arepredicted to be smaller than modern day ones; this may reflect the evolution of increasinglycomplex genomes, and/or the extinction of ancestral gene families. Genome sizes for theLUCA (183 genes) and the modern-day genome of E. coli (2507 genes) are labeled in darkblue. Metazoan genomes appear only slightly larger than prokaryotic ones because theCOGs used in this study were originally defined using only unicellular organismsTatusov2000, which thus biased our analyses of eukaryotic genomes towards only microbially-relatedgenes.

81

10

Color ranges:

euk

arc

bac

Giardia lamblia ATCC 50803

Plasmodium

falciparum

Cryptosporidium hom

inis

Saccharom

yces cerevisiae

Caenorhabditis elegans

Dro

sophila

mela

nogaste

r

Hom

o s

apie

ns

Pan tro

glo

dyte

s

Mus m

uscu

lus

Danio

rerio

Arabidopsis thaliana

Therm

opla

sm

a a

cid

ophilu

m D

SM

1728

Arc

haeoglo

bus fu

lgid

us D

SM

4304

Halo

bacte

rium

sp N

RC

!1 M

eth

anosarc

ina a

cetivora

ns C

2A

Pyro

coccus furiosus D

SM

3638

Meth

anocald

ococcus jannaschii D

SM

2661

Meth

anopyru

s k

andle

ri A

V19

Meth

anoth

erm

obacte

r th

erm

auto

trophic

us s

tr D

elta H

Nanoarc

haeum

equita

ns K

in4!

M

Aero

pyru

m p

ern

ix K

1S

ulfo

lobus s

olfa

taric

us

Pyro

baculu

m a

ero

philu

m Therm

oanaero

bacte

r te

ngcongensis

MB

4

Clo

str

idiu

m teta

ni E

88

Mycopla

sm

a g

enitaliu

m G

37

Mycopla

sm

a g

alli

septicum

R

Onio

n y

ello

ws p

hyto

pla

sm

a O

Y!M

Lact

obaci

llus

pla

nta

rum

WC

FS1

Ente

roco

ccus

faeca

lis V

583

Lact

ococ

cus

lact

is s

ubsp

lact

is Il

1403

Strep

toco

ccus

pne

umon

iae

TIG

R4

Liste

ria in

nocua C

lip11262

Bacillus a

nthra

cis str Ste

rne

Bacillus subtili

s subsp subtilis str 1

68

Oceanobacillu

s iheyensis H

TE831

Sta

phyloco

ccus

aureus

subsp

aure

us NCTC

8325

Chlorobium tepidum TLS

Porphyromonas gingivalis W83

Bacteroides thetaiotaomicron VPI!5482

Chlamydia trachomatis DUW!3CX

Chlamydophila pneumoniae AR39

Bifidobacterium longum NCC2705Tropheryma whipplei str TwistStreptomyces coelicolor A3(2)

Mycobacterium tuberculosis CDC1551

Corynebacterium glutamicum ATCC 13032

Treponema pallidum subsp pallidum str Nichols

Borrelia burgdorferi B31

Leptospira interrogans serovar Lai str 56601

Pirellula sp

Fusobacterium nucleatum subsp nucleatum ATCC 25586

Thermotoga m

aritima M

SB8

Aquifex aeolicus VF5

Synechococcus s

p W

H 8

102

Pro

chlo

rococcus m

arin

us s

tr MIT

9313

Pro

chlo

rococcus m

arin

us s

ubsp m

arin

us s

tr CC

MP

1375

Syn

ech

oco

ccus e

longatu

s PC

C 6

301

Nosto

c sp P

CC

7120

Gloeobacter violaceus P

CC 7421

Thermus therm

ophilus HB27

Deinococcus radiodurans R

1

Dehalococcoides ethenogenes 195

Desulfo

vib

rio v

ulg

aris

str H

ildenboro

ugh

Bdello

vib

rio b

acte

riovoru

s

Geobacte

r sulfu

rreducens P

CA

Solib

acte

r usita

tus E

llin6076

Acid

obacte

ria b

acte

rium

Ellin

345

Wolb

achia

sp w

Mel

Ric

kettsia

pro

wazekii s

tr M

adrid E

Caulo

bacte

r cre

scentu

s C

B15

Rhodopseudom

onas p

alu

str

is C

GA

009

Bra

dyrh

izobiu

m japonic

um

US

DA

110

Mesorh

izobiu

m loti M

AFF303099

Bru

cella

abortus

Sin

orh

izobiu

m m

eliloti 1

021

Agro

bact

erium

tum

efa

ciens

str C

58

Nitr

oso

monas

euro

paea A

TCC 1

9718

Bor

dete

lla p

ertu

ssis

Ral

ston

ia s

olan

acea

rum

GM

I100

0

Chro

mobact

erium

vio

lace

um A

TCC

12472

Neis

seria

menin

gitidis

MC

58

Pseudomonas aeruginosa PAO1

Shewanella oneidensis MR!1Photobacterium profundum

Vibrio cholerae O1 biovar El Tor str N16961Haemophilus influenzae Rd KW20

Wigglesworthia glossinidia endosymbiont of Glossina brevipalpis

Buchnera aphidicola str APS (Acyrthosiphon pisum)

Yersinia pestis KIM

Escherichia coli K!12

Shigella flexneri 2a str 301

Salmonella enterica subsp enterica serovar Typhi str Ty2

Xanthomonas campestris pv campestris

str ATCC 33913

Xylella fastid

iosa 9a5c

Coxiella b

urnetii

RSA 493

Cam

pylo

bacte

r jeju

ni s

ubsp je

juni N

CT

C 1

1168

Wolinella s

uccin

ogenes D

SM

1740

Helicobacte

r pylo

ri 2

6695

Helicobacte

r hepaticus A

TC

C 5

1449

2

1

3

4

5

6

7b

7a

8

Event Constraint

1Last universal common ancestor emerges

> 3850 Ma

2Cyanobacteria emerge

> 2500 Ma

3Eukaryotes diverge from Archaea

> 2670 Ma

4Akinetes diverge from cyanobacteria lacking cell differentiation

> 1500 Ma

5Archaeplastida emerge

> 1198 Ma

6 Animals emerge > 635 Ma

7 Tetrapods emerge(a) < 385 Ma(b) > 359 Ma

8Buchnera diverge from Wigglesworthia > 160 Ma

Figure 2.10: Temporal constraints. Eight fossil and biogeochemical constraints wereused to constrain the chronogram (evidence cited can be found in Table 2.1). Those con-straints are overlaid onto the reference phylogeny.

82

0 0.5 1 1.5 2 2.5 3 3.5 40

0.5

1

1.5

2

2.5

3

3.5

4

Gene family birth using topology A [Ga]

Gen

e fa

mily

birt

h us

ing

topo

logy

B [G

a]

Supplementary Figure 8: Sensitivity of predicted birth ages to variation in reference tree

topology. Ten bootstraps of the reference tree were rooted using either the Bacteria, Archaea, or

Eukarya as an outgroup and subsequently processed with r8s, producing 30 alternative reference

chronograms. Birth dates were inferred for 250 gene families using each chronogram. We graph

1000 random combinations of alternative chronogram pairs and gene families in the scatter plot

above (correlation coefficient = 0.86). The median gene family birth date difference between

any two alternative chronograms is 0.09 Ga.

2010-03-03429A-Z

25

Figure 2.11: Sensitivity of predicted birth ages to variation in reference treetopology. Ten bootstraps of the reference tree were rooted using either the Bacteria,Archaea, or Eukarya as an outgroup and subsequently processed with r8s, producing 30alternative reference chronograms. Birth dates were inferred for 250 gene families usingeach chronogram. We graph 1000 random combinations of alternative chronogram pairsand gene families in the scatter plot above (correlation coefficient = 0.86). The mediangene family birth date difference between any two alternative chronograms is 0.09 Ga.

83

Supplementary Figure 9: Gene family birth using 30 alternative reference tree topologies.

Shown above are cumulative distribution functions (CDFs) of total COG birth over time for the

30 alternative reference chronograms (light gray lines). Mean CDFs for the Bacteria, Archaea,

and Eukarya as outgroups are shown using green, red, and blue dashed lines, respectively.

Overall (solid black line), the period 2.7-2.5 Ga witnesses a gene family birth spike of on

average 0.23 families born per 1 Ma and accounts for the birth of 19% of the COG families

studied. By contrast, birth rates average 0.07 families born per 1 Ma from 2.5 Ga-present day.

−3.5 −3 −2.5 −2 −1.5 −1 −0.50

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Time [Ga]

0.00.51.01.52.02.53.03.54.0

Exis

tin

g C

OG

fracti

on

Overall average CDFAverage CDF w/ euk. outgroupAverage CDF w/ bac. outgroupAverage CDF w/ arc. outgroupIndividual chronogram CDF

2010-03-03429A-Z

26

Figure 2.12: Gene family birth using 30 alternative reference tree topologies.Shown above are cumulative distribution functions (CDFs) of total COG birth over time forthe 30 alternative reference chronograms (light gray lines). Mean CDFs for the Bacteria,Archaea, and Eukarya as outgroups are shown using green, red, and blue dashed lines,respectively. Overall (solid black line), the period 2.7-2.5 Ga witnesses a gene family birthspike of on average 0.23 families born per 1 Ma and accounts for the birth of 19% of theCOG families studied. By contrast, birth rates average 0.07 families born per 1 Ma from2.5 Ga-present day.

84

0 50 100 150 200 250 300 350 4000

200

400

600

800

1000

1200

1400

Num

ber o

f gen

e fa

milie

s

Gene family size

Figure 2.13: Histogram of COG family sizes. The median COG family in our datasetpossesses 18 gene copies, and 93% of COG families have 100 or fewer gene copies.

85

4.0 3.5 3.0 2.5 2.00

0.05

0.1

Time [Ga]

O2−

bind

ing

CO

G fr

actio

n

**

Figure 2.14: O2-utilizing gene birth over time. The fraction of compound-bindingCOG births which bind O2 is shown over time. A chi-square test was used to comparethe overall number of COGs born and the number of O2-binding COGs born in 100 Mywindows: prior to the AGE (3.7 Ga), at the height of the AGE (3.25 Ga), and at the tail ofthe AGE (2.85 Ga). Comparisons with p < 0.05 are denoted with asterisks on the graph.These data suggest that changes in O2 usage came toward the end of the AGE.

86

0 50 100 150 200 250 300 350 4000

20

40

60

80

100

120

HG

T ev

ents

Gene family size

y = 0.085x

y = 0.303x

LCA age > 2.8 GaLCA age <= 2.8 Ga

Figure 2.15: HGT counts vs. gene family size. The average gene family reconciliationyields 9.7 inferred HGT events. The number of HGT events inferred grows with the numberof gene copies in a COG family. Gene family HGT counts also grow with the age of the lastcommon ancestor of all genomes represented in the family, suggesting that HGT is morefrequent among gene families spanning wider phyletic range. We note that y-intercepts forthe above line fittings have been forced to equal 0.

87

Supplementary Figure 13: Confidence intervals for divergence times on reference

chronogram. Confidence intervals (95%) were estimated by PhyloBayes and are shown next to

each divergence point on the tree. Values are in units of Ga.

Thermoplas.acidoArchaeoglo.fulgiHalobacter.sp.Methanosar.aceti1.06-0.481.34-0.78

Methanocal.jannaMethanopyr.kandlMethanothe.therm1.05-0.621.27-0.83Pyrococcus.furio

1.62-1.041.8-1.25

2.02-1.59

Nanoarchae.equitAeropyrum.perniSulfolobus.solfa1.56-1.0Pyrobaculu.aerop1.77-1.37

2.01-1.65

2.33-2.05

Giardia.lamblPlasmodium.falciCryptospor.homin1.26-0.82Arabidopsi.thaliSaccharomy.cerevCaenorhabd.elegaDrosophila.melanDanio.rerioHomo.sapiePan.trogl0.19-0.1Mus.muscu0.34-0.21

0.57-0.370.97-0.7

1.1-0.821.4-1.04

1.61-1.241.77-1.37

2.08-1.77

2.82-2.67

Thermoanae.tengcClostridiu.tetan1.95-1.57Onion.yelloMycoplasma.genitMycoplasma.galli0.72-0.352.03-1.51Staphyloco.aureuListeria.innocBacillus.anthrBacillus.subti0.74-0.27Oceanobaci.iheye0.9-0.41

1.19-0.711.6-1.05

Lactobacil.plantEnterococc.faecaLactococcu.lactiStreptococ.pneum0.44-0.210.95-0.571.26-0.77

1.9-1.36

2.54-2.17

3.15-2.87

Desulfovib.vulgaBdellovibr.bacteGeobacter.sulfu2.51-2.332.64-2.36Solibacter.usitaAcidobacte.sp.1.96-1.31

2.97-2.78

Wolinella.succiHelicobact.pylorHelicobact.hepat1.4-1.261.42-1.28Campylobac.jejun

1.89-1.71Wolbachia.sp.Rickettsia.prowa2.14-1.14Caulobacte.crescRhodopseud.palusBradyrhizo.japon0.61-0.17Mesorhizob.lotiBrucella.abort0.82-0.27Sinorhizob.melilAgrobacter.tumef0.6-0.16

0.99-0.351.53-0.7

2.02-1.252.65-2.23

Nitrosomon.europBordetella.pertuRalstonia.solan1.11-0.61Chromobact.violaNeisseria.menin1.07-0.631.72-1.23

1.98-1.68

Coxiella.burneXanthomona.campeXylella.fasti0.8-0.41Pseudomona.aerugShewanella.oneidPhotobacte.profuVibrio.chole0.84-0.6Haemophilu.influSalmonella.enterEscherichi.coliShigella.flexn0.66-0.590.67-0.6Yersinia.pesti

0.83-0.74Wiggleswor.glossBuchnera.aphid0.54-0.39

0.9-0.781.0-0.86

1.09-0.951.2-1.06

1.94-1.732.25-1.97

2.34-2.08

2.5-2.21

2.94-2.73

3.03-2.84

3.14-2.93

Thermotoga.maritAquifex.aeoli2.43-1.85Fusobacter.nucle3.0-2.75Thermus.thermDeinococcu.radio1.68-1.07Dehalococc.ethen2.98-2.76Gloeobacte.violaSynechococ.sp.Prochloroc.marin0.71-0.24Prochloroc.marin1.16-0.7Synechococ.elongNostoc.sp.1.54-1.5

1.77-1.672.09-1.92

3.14-2.96

3.22-3.03

3.29-3.09

Porphyromo.gingiBacteroide.theta0.58-0.19Chlorobium.tepid2.63-1.38Chlamydia.trachChlamydoph.pneum0.88-0.23

3.14-2.42

Treponema.palliBorrelia.burgd2.7-2.19Leptospira.inter2.76-2.3Rhodopirel.balti3.05-2.61Bifidobact.longuTropheryma.whipp1.36-0.72Mycobacter.tuberCorynebact.gluta0.87-0.47Streptomyc.coeli1.62-1.06

2.01-1.363.18-2.76

3.29-3.05

3.37-3.18

3.39-3.21

100

3.85-3.77

0.1 Ma

2010-03-03429A-Z

30

Figure 2.16: Confidence intervals for divergence times on reference chronogram.Confidence intervals (95%) were estimated by PhyloBayes and are shown next to eachdivergence point on the tree. Values are in units of Ga.

88

Meta-function Function COG CodeNumber of

COGs

Fraction of genes studied

Fraction of AGE births

AGE birth enrichment

Fraction of pre-AGE

births

pre-AGE birth enrichment

pre-AGE vs. AGE p-value

Information storage & processing

Translation


RNA proc.Information storage &

processingTranscriptionInformation storage &

processingReplication, recombination


Chromatin struct.

Cellular processes & signaling

Cell cycle control


Defense mech.


Signal transduction


Cell wall/membraneCellular processes & signaling Cell motility


Cytoskeleton


Intracell. trafficking


Post-trans. modification

Metabolism

Energy prod. & conv.

Metabolism

Carb. trans. & met.

Metabolism

Amino acid trans. & met.

MetabolismNucleotide trans. & met.

MetabolismCoenzyme trans. & met.

Metabolism

Lipid trans. & met.

Metabolism

Inorganic ion trans. & met.

Metabolism

Secondary metabolites

Poorly characterizedFunc. unknown

Poorly characterizedGeneral func. pred.

J 197 0.049 0.061 1.234 0.150 3.042 0.000A 17 0.004 0.001 0.219 0.002 0.377 1.000K 173 0.043 0.030 0.681 0.042 0.962 0.246L 155 0.039 0.034 0.868 0.068 1.746 0.002B 11 0.003 0.000 0.000 0.002 0.655 0.338

D 56 0.014 0.013 0.903 0.012 0.854 1.000V 29 0.007 0.008 1.083 0.001 0.097 0.033T 106 0.027 0.028 1.044 0.020 0.762 0.405M 141 0.035 0.050 1.424 0.060 1.702 0.487N 82 0.021 0.031 1.493 0.015 0.718 0.064Z 5 0.001 0.000 0.000 0.000 0.000 1.000U 122 0.031 0.032 1.047 0.022 0.710 0.274O 157 0.039 0.043 1.101 0.042 1.070 1.000

C 211 0.053 0.079 1.499 0.068 1.289 0.490G 186 0.047 0.056 1.205 0.051 1.087 0.730E 226 0.057 0.079 1.394 0.109 1.923 0.054F 83 0.021 0.026 1.228 0.059 2.835 0.001H 155 0.039 0.056 1.439 0.069 1.772 0.326I 72 0.018 0.024 1.306 0.032 1.795 0.334P 182 0.046 0.058 1.276 0.046 0.998 0.298Q 70 0.018 0.015 0.847 0.004 0.216 0.045

S 1186 0.298 0.183 0.615 0.071 0.238 0.000R 560 0.141 0.142 1.010 0.113 0.804 0.105

Figure 2.17: Function of gene births prior to and during the AGE. Functionalenrichment of gene birth from 2.8-3.3 Ga is shown for the 20 COG functional categories. Atwo-tailed Fisher exact test was used to compute the p-value of a difference in total COGbirths prior to the AGE vs. during the AGE, for each functional category (last column).

89

Meta-function FunctionCOG Code

HGT out of cyano into

plant

HGT out of cyano

Fisher p-value

HGT out of alpha-proteos

into euks

HGT out of alpha-proteos

Fisher p-value


Translation


RNA proc.Information storage

& processingTranscriptionInformation storage

& processingReplication, recombination


Chromatin struct.


Cell cycle control


Defense mech.


Signal transduction


Cell wall/membraneCellular processes & signaling Cell motility


Cytoskeleton


Intracell. trafficking


Post-trans. modification

Metabolism

Energy prod. & conv.

Metabolism

Carb. trans. & met.

Metabolism

Amino acid trans. & met.

MetabolismNucleotide trans. & met.

MetabolismCoenzyme trans. & met.

Metabolism

Lipid trans. & met.

Metabolism

Inorganic ion trans. & met.

Metabolism

Secondary metabolites

Poorly characterized

Func. unknownPoorly characterized General func. pred.

J 38 1225 4.41E-05 13 779 9.34E-02A 0 1 1.00E+00 0 0 1.00E+00K 2 297 3.34E-01 0 231 1.78E-01L 4 550 1.52E-01 0 324 8.23E-02B 0 7 1.00E+00 0 8 1.00E+00

D 3 164 7.42E-01 0 98 6.28E-01V 0 116 4.25E-01 0 59 1.00E+00T 2 341 2.54E-01 3 209 4.84E-01M 5 735 6.06E-02 0 513 6.24E-03N 0 80 6.36E-01 0 168 4.23E-01Z 0 0 1.00E+00 0 0 1.00E+00U 6 273 3.20E-01 1 305 3.79E-01O 11 549 3.72E-01 9 367 1.55E-02

C 26 949 3.95E-03 21 621 1.55E-06G 9 541 7.21E-01 2 335 5.86E-01E 16 1205 6.23E-01 5 701 5.57E-01F 6 486 8.49E-01 1 270 5.32E-01H 13 1057 5.12E-01 7 397 1.98E-01I 9 295 5.06E-02 4 247 3.33E-01P 12 918 6.76E-01 5 472 1.00E+00Q 3 163 7.41E-01 0 106 6.30E-01

S 16 1601 8.04E-02 5 1103 3.73E-02R 19 1496 4.34E-01 9 797 7.17E-01

Figure 2.18: Biases in gene function associated with ancient endosymbioses.Shown here is a functional breakdown of HGT from cyanobacteria into Arabidopsis thaliana,and from the alpha-proteobacteria to ancient eukaryotes (which we defined as eukaryoticlineages predating the divergence of Arabidopsis). The significance of functional enrichmentsfor genes associated with the chloroplast are calculated by comparing the observed numberof HGT from cyanobacteria into the plant lineage, to the total number of HGT originatingin the cyanobacteria, so as to account for functional biases associated with HGT out of thecyanobacteria. Similar statistics are performed on mitochondria-related genes. P-valuesthat fall below a 5% False Discovery Rate cutoff in are underlined.

90

Chapter 3

Building prokaryotic species treesfrom thousands of gene trees

Lawrence A. David, Albert Y. Wang, Eric J. Alm

Experiments in this chapter were performed in collaboration with Albert Wang inthe Alm laboratory.

Chapter 3

Building prokaryotic species trees

from thousands of gene trees

3.1 Abstract

Sequenced-based approaches for reconstructing prokaryotic species trees from more

than one gene utilize either the concatenation of sequences (supermatrices) or the

merger of separate gene trees (supertrees) [131]. Yet, supermatrices ignore differ-

ences in evolutionary rate and nucleotide compositional bias between concatenated

genes, which can ultimately reduce the accuracy of inferred phylogenetic trees [132].

Supertree methods can account for these inter-gene heterogeneities, but a commonly-

used supertree technique, matrix representation, violates the site-independence as-

sumption underlying many phylogenetic construction algorithms. Here, I extend an

alternative supertree method known as Gene Tree Parsimony (GTP), which chooses

the species tree as the topology that requires the least number of duplication events

to be inferred when compared to a set of gene trees [133]. My GTP model, named

GAnG, can account for horizontal gene transfer events (HGT) and is suitable for

use with prokaryotic gene trees. Preliminary GAnG analysis of 250 archaeal gene

trees built from a subset of the sequenced archaeal genomes supports hypotheses that

Nanoarchaeota diverged from the last ancestor of the Archaea prior to the Crenar-

chaeota/Euryarchaeota split, but also appears sensitive to HGT artifacts between the

92

Crenarchaeota and the Thermoplasma.

93

3.2 Introduction

Prokaryotic species phylogenies are frequently built by concatenating sequences from

more than one gene [131], in order to sample a range of phylogenetically-informative

characters [134], and to mitigate the role of topological artifacts caused by long branch

attraction [122, 135] and nucleotide composition bias [136, 137]. These genes can

be restricted to “core” sets that are topologically congruent [90, 138], under the

assumption that horizontal gene transfer (HGT) artifacts would only be present if

identical HGTs affected all of the genes in the core set. However, reconstructions of

organismal histories using core genes have been criticized for ignoring the majority

of phylogenetic signal present in genomes and have been referred to as “The tree[s]

of one percent” [139]. In order to claim that species trees represent overall genome

evolution, computational tools are required that can incorporate genetic information

from hundreds, or even thousands, of genes.

Previous studies that have built prokaryotic species trees from multiple gene fam-

ilies have utilized either “supertree” or “supermatrix” approaches [90, 138, 140, 141].

Under the supermatrix approach, sequences from orthologous gene families are con-

catenated into a single sequence, which can subsequently be inputted into a standard

phylogenetic construction algorithm. This method is robust to gene families with lim-

ited taxonomic distribution, as simulations have shown supermatrices to perform well

even in regimes where only 10% of species possess a given gene copy [142]. However,

the concatenation of prokaryotic genes risks assembly of sequences with divergent

evolutionary histories due to HGT [143]. Supermatrix approaches must therefore as-

sume that the phylogenetic signal from vertically-descended genes overwhelms any

signal from transferred regions of the alignment. Moreover, species trees cannot be

constructed in a computationally-efficient way if differences in gene evolutionary rate

or nucleotide composition are taken into account. Simulations have shown that par-

titioning concatenations and fitting these parameters on a per-gene basis improved

the fidelity of reconstructed phylogenies in nearly all reported scenarios, and in some

cases enabled the reliable recovery of clades that were never inferred by a conven-

94

tional supermatrix procedure [132]. However, supermatrices that model evolutionary

heterogeneity between genes cannot utilize existing tree building algorithms and have

so far required exhaustively searching tree space in order to identify an optimal tree

[132, 134].

Supertree methods provide an alternative phylogenetic framework that can model

evolutionary parameters on a per-gene basis, while still constructing a species phy-

logeny in a computationally-efficient manner. Under this framework, gene trees are

built separately for each orthologous gene family and subsequently merged to form

a species tree. Supertree methods are distinguished by how gene tree merger takes

place. According to the most popular merger method [144], matrix representation

using parsimony (MRP), a binary character matrix is built that possesses a column

for each internal node in the set of gene trees, and a row for each of the species sam-

pled. Species with a gene copy descended from a given internal node all share a “1”

in the corresponding column of the matrix, and all other species share a “0” [145]. A

consensus tree is formed by inputting the character matrix into a sequence-based phy-

logenetic construction algorithm. Because the supertree framework allows trees to be

built for individual gene families, the model is capable of accounting for differential

evolutionary rates and nucleotide compositions between genes, unlike conventional

supermatrices.

However, the MRP algorithm in particular has been criticized for its analogy

between the tree matrix and a nucleic acid sequence matrix [133]. This analogy

violates the site-independence model assumption of many phylogenetic reconstruction

algorithms [76], since sibling leaves on a gene tree will be partitioned similarly for

each MRP matrix column that corresponds to one of these leaves’ ancestral nodes.

In practice, this site dependence should manifest as a supertree bias towards subtree

topologies from large gene trees with many ancestral nodes and therefore more sites

in the tree matrix. This bias is likely deleterious, as large gene trees can be caused

by frequent HGT and duplication events that impede partitioning gene families into

orthologous groups.

Gene tree parsimony (GTP) provides an alternative supertree merger criterion

95

that has received less statistical criticism than MRP, but that also cannot be used

on prokaryotic gene trees in present implementations. First introduced by Slowinski

in 1997, the GTP approach chooses the species tree as the topology that requires

the least number of duplication events to be inferred when species and gene trees are

compared (a step called a reconciliation) [146]. This reconciliation-based approach

avoids the matrix construction step that has led to criticism of MRP supertree meth-

ods. However, existing implementations of GTP can only model gene duplication and

gene loss events [147, 148], limiting their usage to eukaryotic phylogenies.

Here, I enable the use of GTP supertrees on prokaryotic trees, through a new

program that I have named GAnG. This algorithm reconciles gene trees and species

trees using a generalized parsimony model I previously developed, the Analyzer of

Gene & Species Trees, or AnGST [149], which accounts for HGTs, as well as gene

duplication and gene loss events, in gene family evolution. Experiments with simu-

lated data show that GAnG can accurately quantify species tree accuracy. As a proof

of concept, I also used the algorithm to learn a phylogeny of 12 sequenced archaeal

species from 250 gene trees with divergent topologies and that have likely undergone

HGT. Results of this analysis support the basal position of Nanoarchaeum equitans

on the archaeal tree.

96

3.3 Methods

3.3.1 Approach overview

The GAnG algorithm is an iterative process composed of four primary steps:

1. Initial topology generation and reconciliation: An initial reference tree

is constructed using one gene, or a concatenation of multiple genes. Individual

gene trees are also constructed for all orthologous gene families. After trees

have been built, each of the gene trees is reconciled against the species tree

using AnGST.

2. Proposal of tree refinements: The HGTs inferred from reconciliations be-

tween the species tree and gene trees are mined to propose subtree-prune-and-

regraft (SPR) moves. These SPRs are used to generate candidate species tree

topologies (Section 3.3.2). Alternatively, new rootings of the species tree can

be evaluated (Section 3.3.3).

3. Evaluation of candidate refinements: Each candidate species tree is eval-

uated by comparison to the set of gene trees using AnGST, as described in

Section 3.3.4.

4. Selection of a new species tree: The best scoring candidate tree is identified.

If this tree has a better reconciliation score than the current species tree, the

species tree is replaced with the candidate tree and the algorithm returns to

Step 2. Otherwise, the process terminates and returns the present species tree.

3.3.2 Generating candidate species trees

GAnG searches species tree space using an iterative subtree-prune-and-regraft (SPR)

strategy that generates a candidate species trees from an existing species tree by

pruning off subtrees and regrafting them onto new locations on the tree. This heuristic

approach is necessary because it is unlikely that a polynomial-time algorithm exists

for finding an optimal species tree using AnGST [150]. An exhaustive search strategy

97

is also not possible, since the number of possible rooted binary trees grows super-

exponentially with the number of species s, following the function (2s−3)!! [151] (for

the 14 genomes in Fig. 3.2, there are nearly eight trillion possible organismal trees).

SPR moves enable dramatic changes in tree topology using relatively few topo-

logical operations and are used by tree construction algorithms like PhyML 3.0 [152],

FastTree [153] and RaxML [154] to avoid being caught in local minima in tree space.

An SPR-based strategy for refining species trees has the added advantage of com-

plementing AnGST reconciliation process. High-frequency HGTs can be eliminated

if an SPR between the transferred nodes (in the reverse direction of the HGT) is

performed on the species tree. Consequently, GAnG generates a candidate species

tree for each of the SPR moves associated with the 100 most frequently inferred HGT

on the present species tree.

3.3.3 Rooting the species tree

Unlike typical supermatrix or supertree methods, GAnG does not require an outgroup

sequence to to infer a rooted species tree. Since the species tree root orients the

direction of vertical inheritance at internal nodes, alternative rootings of the same

unrooted species tree will have distinct AnGST scores when reconciled against a

gene tree. Therefore, in addition to incorporating SPR moves, the candidate tree

generation routine can also vary the position of the root node on the species tree.

3.3.4 Scoring a candidate species tree

A score S for a putative species tree topology T quantifies how well the species tree

fits the set of gene trees according to the AnGST model:

S(T ) =�

G

Rec(G|T ) (3.1)

where Rec(G|T ) is the AnGST reconciliation score for a gene tree G, given the can-

didate species tree T . The AnGST model assigns a score of 3, 2, and 1 to each HGT,

duplication, and loss event inferred, respectively [149].

98

3.4 Preliminary results

I have performed two preliminary tests using the GAnG algorithm, using simulated

and real-world data generated from my previous study of microbial genome evolution

[149]. These tests suggest:

1. The AnGST-based scoring function can discern between species trees of varying

accuracy (Section 3.4.1)

2. HGTs can be used to propose SPR moves on the species tree that lead to lower

AnGST reconciliation scores (Section 3.4.2).

3.4.1 AnGST scores increase with true species tree permu-

tation

I evaluated how AnGST reconciliation scores changed as increasing phylogenetic noise

was added to the Tree of Life [90], in order to test with real-world gene trees if the

GAnG tree search process could be attracted towards the correct species tree. This

experiment used 125 microbial gene families randomly selected from my previous

study of microbial genome evolution [149]. A total of 100 sequenced genomes spanning

all three domains of life were represented in these gene trees. Because the true species

tree for the 100 sampled genomes is not known, I could not directly evaluate whether

or not the GAnG algorithm could use the gene trees to recover the true species tree.

Instead, I tested if the GAnG algorithm behaved in a manner consistent with less

accurate species trees receiving higher reconciliation scores than more accurate species

trees. I simulated a continuum of species tree accuracy by taking a topology (the Tree

of Life) likely similar to the true tree, and randomly permuting it with between 1 and

10 SPR random moves. This procedure was used to generate 300 alternate species

trees. Gene tree reconciliation scores against the species trees increased linearly with

the number of SPRs performed on the Tree of Life (Figure 3.1; R2 = 0.67), suggesting

that more accurate species trees will receive lower reconciliation scores than less

accurate species trees.

99

3.4.2 HGTs can be used to find a better-scoring tree

The GAnG algorithm was run to completion on a set of 12 sequenced archaeal species

and 250 gene trees randomly chosen from my previous study on microbial genome

evolution [149]. The initial species tree topology was pruned from the Ciccarelli &

Bork Tree of Life (Fig. 3.2A) [90]. The algorithm terminated after four iterations,

yielding a refined species tree whose reconciliations with gene trees were an average of

3.7% lower than the initial tree (Figure 3.2B). A single SPR move of the Crenarchaeota

to a basal euryarchaeal lineage provided the most dramatic change in the refined

topology, shifting N. equitans to a basal position on the archaeal tree.

100

3.5 Discussion

GAnG employs the AnGST reconciliation model to become the first GTP-based su-

pertree method capable of accounting for HGT events and therefore appropriate for

inferring prokaryotic species trees. Usage of the AnGST model carries two additional

benefits. First, the AnGST algorithm includes a bootstrap amalgamation step that

constructs a chimeric gene tree from the bootstrap subtrees that best conform to the

species tree. As a result, gene families that evolved by vertical descent, but whose

phylogenetic reconstructions are sensitive to sequence sampling variation, will be less

likely to mislead the GTP process. Second, HGT events inferred by AnGST are

directed between specific branches on the species tree and can therefore provide a

heuristic guide for refining species trees. Frequent HGTs between two lineages may

be the result of a topological error on the species tree and can be resolved by making

these lineages sibling to one another on an updated tree.

Preliminary analyses of GAnG on simulated data demonstrate that the algorithm’s

scoring function will be attracted towards correct species topologies (Fig. 3.1). The

observed linear relationship between species tree score and tree randomization demon-

strates also suggests that GAnG can accurately quantify relative differences in inac-

curacy among trees. Thus, future implementations of GAnG’s tree search algorithm

may be able to employ more sophisticated search algorithms, such as gradient descent.

The GAnG analysis on real-world data showed the algorithm capable of inferring

a prokaryotic species tree largely in line with prior archaeal phylogenies, despite not

relying on a “core” set of gene families free of HGT (Fig. 3.2). A single SPR move of

the Crenarchaeota to a basal euryarchaeal lineage provided the most dramatic refine-

ment to the starting Ciccarelli & Bork archaeal tree. This SPR shifted N. equitans

from its initial sibling position with the Crenarchaeota (a position not supported by

the archaeal phylogenetic literature) to a more basal position outgrouping both the

Euryarchaeota and the Crenarchaeota. This outgroup location of N. equitans has

been reported previously [155], but remains controversial [156]. The refined archaeal

tree also includes a paraphyletic euryarchaeal clade that features the Thermoplas-

101

matales and Crenarchaeota as sibling to one another. A close relationship between

the Thermoplasmatales and the Crenarchaeota has been reported by previous phy-

logenetic studies [157–159] and has been attributed to HGT between thermoplasma

species and the Crenarchaeota [138].

Finally, these experiments both indicate that GAnG could be run on datasets

composed of thousands of gene trees. The analysis of the archaeal tree with 250 gene

trees took on the order of 3 days on a computer cluster. GAnG running time increases

linearly with the number of gene trees, so analysis of a 1000 gene tree set would

require approximately two weeks of compute time – a time-consuming experiment,

but not prohibitively long. Moreover, a potential running time speedup is currently

in development (see Future Directions below).

102

3.6 Future Directions

Future studies will investigate several potential improvements to the GAnG model,

specifically in species tree scoring, running time improvement, and gene tree weight-

ing. The benefits of each of these model changes will be investigated using gene and

species tree simulation software that I have previously written during the development

of AnGST [149].

A more accurate and more efficient version of GAnG will also be tested on a

much larger archaeal dataset of 9053 gene families drawn from 70 sequenced archaeal

genomes [160]. The resulting archaeal tree will be used to test hypotheses involving

the basal position of the Korarchaeota and Thaumarchaeota on archaeal phylogenies

and whether N. equitans represents a separate archaeal phylum.

3.6.1 Model improvements

Scoring modifications

The scoring function Eq. 3.1 may be biased by larger gene trees, which are likely to

have a higher variance in reconciliation scores. An alternative scoring function robust

to this bias is:

S(T ) =�

G

Sign(Rec(G|T )− Rec(G|R)) (3.2)

This score is negative if there are more gene trees whose reconciliation scores are lower

with the candidate tree than with the present species tree R. In this case, a candidate

species tree would be accepted and become the new present species tree. Evidence

that S(T ) will be positive for incorrect candidate trees is presented in Section 3.4.1

of the Preliminary Results. Summary of a tree’s fitness using the sign function also

facilitates an algorithmic speedup described below in Reducing running time.

Alternatively, if we assume that reconciliation score variance grows with reconcili-

ation score, exceptional changes in reconciliation score can be captured by the scoring

103

metric:

S(T ) =�

G

Log

�Rec(G|T )− Rec(G|R)

Rec(G|R)

�(3.3)

In contrast to Equation 3.1, however, this scoring function may be biased by smaller

gene trees, which are likely to have smaller reconciliation scores.

Reducing running time

I can avoid reconciling the entire set of gene trees if it can be quickly determined

that a candidate species tree is less fit than the current species tree. I can perform

this speedup using the scoring function in Equation 3.2 and by assuming that gene

trees evolve independently from one another. Under this model, determining the sign

of S(T ) can be likened to the problem of determining if a coin is fair using a finite

number of coin tosses. To compute if there is a bias towards [Rec(G|T )−Rec(G|R)]

being positive or negative, we can count the total number of G for which this value is

negative, N−, or non-zero, N , and estimate the probability that a candidate species

tree that changes the reconciliation score of G causes a lower reconciliation score:

p =N−

N(3.4)

However, p is an estimated probability with a standard error sp

sp =

�p(1− p)

N(3.5)

and a maximum error of

E = Z × sp (3.6)

where Z is taken from a table of Z-values and associated confidence levels, calculated

using a normal distribution [75].

If p − E > 0.5, the candidate topology can be accepted at the chosen confidence

level. Similarly, if p + E > 0.5, the candidate topology can be rejected. Otherwise,

the maximum error is too high to determine if the tree should be accepted or rejected

104

and more gene trees need to be reconciled. One way to determine the additional

number of reconciliations to perform in this case would be to calculate E = |0.5− p|,

or the maximum error associated with p so that it can be determined if p is positive

or negative. It can be shown that the number of reconciliations necessary to achieve

this error, NE, is equal to:

NE =Z2p(1− p)

E2(3.7)

Gene tree weighting

Core gene approaches to species tree construction usually exclude gene families that

show evidence of HGT [90]. However, the strict exclusion of gene families that carry

only weak signals of HGT may be overly conservative. A single HGT event causes

only one bifurcation on a gene tree to not be explained by vertical inheritance (i.e.

a speciation event). A large gene tree with relatively few HGT can thus still be

informative for phylogenomic purposes. This intuition can be incorporated into a

scoring function by multiplying each gene tree score by a weighting factor w(G). One

method for calculating this weight would be to take the ratio of the number of HGT

possible given a gene family’s taxonomic distribution (which can be approximated by

the square of the number of species L represented in the gene family), to the number

of HGT inferred on the gene tree, HGT(G):

w(G) =L(G)2

HGT(G)(3.8)

Translation-related gene families, which have been previously been relied upon for

constructing archaeal phylogenies [138], are weighted more highly according to this

scheme than the average gene family (p < 10−30, Rank Sum test; Fig. 3.3). A caveat

to this weighting scheme, however, is that HGT(G) will change as the species tree is

refined by GAnG.

I will also evaluate via simulation an alternative gene tree weighting function that

uses gene tree branch lengths to downweight the influence of gene families suspected of

105

bearing transferred or duplicated genes. This approach relies on the assumption that

genetic distances between gene copies descended from an HGT event will be shorter

than expected, whereas genetic distances between certain gene copies descended from

a duplication/loss scenarios will be longer than expected (see Figure 3.4 for an il-

lustration of these phenomena). The following weighting function accounts for this

evidence of non-vertical descent:

w(G) = 1/ minr

�

si,sj

(rD(si, sj|G)−D(si, sj|R)) (3.9)

This metric iterates over all pairs of species si and sj represented in a gene tree G,

and computes their pairwise distance on G using the function D. The difference

between this distance and the expected distance (taken from the reference tree R) is

then summed. To account for gene-specific rates of evolution, a rate term r is fit to

G using a minimization function.

3.6.2 A tree of all sequenced archaea

Once I have completed optimizing the GAnG algorithm, I will construct an archaeal

phylogeny using the arCOG dataset of 9504 archaeal gene families, which span 88%

of sequenced archaeal genomes [160]. Due in part to the challenges of cultivating

archaeal species [161], only 70 genomes have so far been sequenced from this domain

of life; 4 of these genomes are the sole representatives of 3 of the 5 known archaeal

phyla [155, 162–164] (Fig. 3.5). This uneven taxon sampling, along with suspected

HGT events and unequal rates of lineage evolution, makes resolution of deep nodes

on the archaeal phylogeny sensitive to the choice of analyzed genes. For example, a

phylogeny built from a core set of 27 large and 23 small ribosomal subunit proteins

supports a basal position of N. equitans on the archaeal tree, but the phylogeny of

just the small subunit proteins suggests this species may only be a rapidly-evolving

euryarchaeote [156].

An archaeal species tree built using the full arCOG dataset should correctly posi-

tion N. equitans, unless phylogenetic artifacts systematically bias a large proportion

106

of genes in archaeal genomes. This tree may help resolve ongoing debates regarding

the origins of other archaeal phyla. Previous studies with varying sets of core genes

have placed the Korarchaeota either basal on the archaeal tree [165, 166], deep within

the Crenarchaeota [162], or sibling to the Euryarchaeota [167]. Core gene studies have

also reported conflicting positions for the Thaumarchaeota, positioning the phylum

either within the Crenarchaeota [162] or basal on the archaeal tree [167, 168]. The

GAnG algorithm offers a quantitative method for testing hypotheses of korarchaeal

and thaumarchaeal origins against thousands of gene trees. Lessons learned from re-

solving these questions in archaeal history may also prove insightful for future efforts

that use GAnG to reconstruct a larger three domain Tree of Life.

107

0 1 2 3 4 5 6 7 8 9 10

50

55

60

65

Degree of reference tree randomization [SPRs]

AnG

ST s

core

Figure 3.1: Average AnGST gene tree reconciliation scores as species tree ran-domization increases: Between 1-10 random SPR moves were made to 300 copies of theTree of Life [90], to produce a continuum of species tree accuracy. Each of the these speciestrees was reconciled against a set of 125 gene trees and the average AnGST score for eachreconciliation is plotted on the y-axis. The R2 of the best fit line (shown in red) to thesedata is 0.67.

108

njplottemp_3 Mon May 03 22:14:21 2010 Page 1 of 1

Thermoplasma acidophilum

Archaeoglobus fulgidus

Halobacterium sp.

Methanosarcina acetivorans

760.385

760.385

1087.550

327.167

Methanocaldococcus jannaschii

Methanopyrus kandleri

Methanothermobacter thermautotrophicus

857.400

857.400

1092.450

235.052

Pyrococcus furiosus

257.340

1349.790

469.760

207.521

1804.730

247.419

Nanoarchaeum equitans

Aeropyrum pernix

Sulfolobus solfataricus

1244.660

1244.660

Pyrobaculum aerophilum

255.626

1500.290

1803.860

303.570

368.697

369.573

Homo Sapiens

540.331

2713.764

Escherichia coli

1109.430

3823.1945e+002

njplottemp_2 Mon May 03 22:14:04 2010 Page 1 of 1

Halobacterium sp.


535.540

535.540


112.407

647.947



556.634

556.634


107.244

663.877

117.217

101.286

Pyrococcus furiosus

60.834

825.998


Aeropyrum pernix

621.057

621.057


78.368

699.424

145.682

272.256


20.789

992.469


758.779

1751.248

Homo Sapiens

470.469

2221.716

Escherichia coli

1628.284

3850.0005e+002

A B

Ma Ma

Figure 3.2: An archaeal phylogeny refined using 250 gene trees and HGT-proposed SPRs: (A) Initial phylogeny of archaeal species pruned from the Ciccarelli &Bork Tree of Life [90]. Crenarchaeal lineages are labeled in blue, euryarchaeal lineages arelabeled in red, and the nanoarchaeal lineage is labeled in green. The most dramatic acceptedSPR move on this topology is shown using a black arrow. (B) The resulting archaeal treeafter 4 iterations of tree refinement. The average AnGST gene tree reconciliation scoreusing this refined tree decreased 3.7% relative to the initial phylogeny.

109

0 1 2 3 4 5 6 7 80

10

20

30

40

50

60

70

80

90

100

Log(SpeciesCount2)

HG

T

All GenesTranslation Genes

Figure 3.3: Inferred HGT as a function of the number of species representedin a gene tree: Each of the 9053 arCOG gene trees was reconciled against an archaealspecies tree (Fig. 3.5). The number of inferred HGT for each gene family is plotted againstthe square of the number of species represented in the gene family (a rough approximationof the number of possible HGT) on a log-scale. The 201 gene families annotated by thearCOG database as involved in translation are shown in red, and all other gene familiesare shown in blue. Gene tree weights, as calculated by Eq. 3.8 are significantly higher fortranslation-associated gene families (p < 10−30, Rank Sum test).

110

A

B

C

D

E

F

A

C

B

D

E

F

A

B

C

D

E

F

A

C

B

D

E

F

L

LL

L

D

L

L

T

T

Duplication HGT

Sp

ec

ies

Tre

eG

en

e T

re

e

Figure 3.4: Expected gene tree branch lengths following duplication or HGT:Two hypothetical evolutionary histories are shown for a gene family. On the left, an ances-tral duplication event (D), followed by four loss events (L), creates higher than expectedgenetic distance between species A and B, and between species C and D. On the right,HGT events (T) cause lower than expected genetic distance between species A and C, andbetween species B and D.

111

0.6

Methanococcus maripaludis S2

Thermococcus gammatolerans EJ3

Thermofilum pendens Hrk 5


Sulfolobus islandicus M.16.4


Sulfolobus islandicus L.S.2.15

Halobacterium sp.


Caldivirga maquilingensis IC-167


Ignicoccus hospitalis KIN4/IHyperthermus butylicus

Picrophilus torridus DSM 9790

Methanococcoides burtonii DSM 6242

Pyrobaculum islandicum DSM 4184

Sulfolobus islandicus Y.N.15.51

Pyrobaculum arsenaticum DSM 13514

Halorubrum lacusprofundi ATCC 49239

Methanococcus vannielii SB


Methanococcus aeolicus Nankai-3

Pyrobaculum calidifontis JCM 11548

Methanosphaerula palustris E1-9c

uncultured methanogenic archaeon

Candidatus Methanoregula boonei 6A8


Methanosaeta thermophila PT

Candidatus Korarchaeum cryptofilum OPF8

Methanocaldococcus fervens AG86

Thermoproteus neutrophilus V24Sta


Desulfurococcus kamchatkensis 1221n

Methanocaldococcus vulcanius M7


Aeropyrum pernix

Methanococcus maripaludis C5

Methanospirillum hungatei JF-1

Sulfolobus islandicus Y.G.57.14

Methanoculleus marisnigri JR1

Cenarchaeum symbiosum

Halomicrobium mukohataei DSM 12286

Methanobrevibacter smithii ATCC 35061

Natronomonas pharaonis

Nitrosopumilus maritimus SCM1

Haloarcula marismortui ATCC 43049

Pyrococcus abyssi

Metallosphaera sedula DSM 5348

Halobacterium salinarum R1

Methanosphaera stadtmanae


Thermococcus kodakarensis KOD1

Sulfolobus tokodaii


Staphylothermus marinus F1

Pyrococcus furiosus


Methanosarcina mazei

Haloquadratum walsbyi

Thermoplasma volcanium


Thermococcus onnurineus NA1Thermococcus sibiricus MM 739

Pyrococcus horikoshii

Methanocorpusculum labreanum Z


Methanosarcina barkeri str. Fusaro

Halorhabdus utahensis DSM 12940

Sulfolobus acidocaldarius DSM 639

Figure 3.5: Maximum likelihood tree of archaeal species: This tree of 70 archaealspecies will be the starting topology for the GAnG analysis of 9504 arCOG gene trees. Thetree was built using a maximum likelihood analysis of 53 concatenated ribosomal proteins[138, 160].

112

Part III

Conclusion

113

This thesis contributes new algorithms for comparative microbial genomics, partic-

ularly in the subfields of microbial ecology and microbial genome evolution. In Chap-

ter 1 of Part II, I presented the algorithm AdaptML, which can identify genetically-

and ecologically-distinct bacterial populations. I used this tool to identify clusters of

marine vibrio that appear to have differentiated in response to their nutrient prefer-

ences. Code for AdaptML has been made public and with the help of Albert Wang, an

undergraduate researcher in Eric Alm’s laboratory, I have created a webserver where

users can submit their own phylogenetic and ecological data for AdaptML to pro-

cess online (http://almlab.mit.edu/adaptml/). Outside investigators have used these

tools to run AnGST on their own datasets. Fred Cohan’s group ran the algorithm to

find ecotypes in the genus Bacillus that would have been overlooked with traditional

sequence divergence-based thresholds for calling species [11]. Oakley and colleagues

used AdaptML to help confirm evidence for niche partitioning among sampled Desul-

fobulbus [169]. Other investigators have promoted using AdaptML for future studies

of microbial evolution and ecology. Daniel Falush has suggested using AdaptML to

infer the evolutionary history of microbe-host adaptation [170], and Ford Doolittle &

Olga Zhaxybayeva have pointed out the algorithm’s advantages over strict threshold-

based approaches to microbial ecology [171].

In Chapter 2, I introduced the algorithm AnGST, which reconciles an ultrametric

reference tree and a gene tree to infer HGT, gene duplication, and gene family birth

events in a chronological context. Implicit in these reconciliations are known con-

straints on organismal history drawn from paleontological and geochemical records.

Analysis of 100 sequenced genomes with AnGST produced evidence for a massive ex-

pansion of microbial genetic diversity during the Archean eon, as well as the gradual

oxygenation of the biosphere over the past 3 Ga. This later finding is in agreement

with other studies that suggest secular changes in earth geochemistry were recorded

in, and can now be mined from, microbial genomes [81, 83, 172, 173]. Further evidence

of the link between the AnGST analysis and biogeochemistry could be uncovered by

follow up studies on enzyme families whose first appearance can be dated using the

sedimentary record. For example, Form 1 RuBisCO makes specific contacts with

114

molecular oxygen and first appeared 2.7-2.9 Ga according to carbon isotope fraction-

ation results [174]. AnGST’s predicted birth date for this enzyme’s small subunit,

between 2.0-3.0 Ga, is wide, but not incompatible with the fractionation results.

Identification and analysis of other biomarkers whose age can be independently ver-

ified by AnGST will lead to potentially fruitful collaborations between genomicists,

paleontologists, and biogeochemists.

Lastly, in Chapter 3, I introduced the supertree algorithm GAnG, which is capable

of constructing prokaryotic species trees from thousands of gene trees. Preliminary

GAnG analysis of 250 archaeal gene trees built from a subset of the sequenced ar-

chaeal genomes supports the hypothesis that Nanoarchaeota diverged from the last

ancestor of the Archaea prior to the Crenarchaeota/Euryarchaeota split, but also

appears sensitive to HGT from the Crenarchaeota to the Thermoplasma. Further

improvements to the GAnG scoring function and gene tree weighting scheme are on-

going. An improved version of GAnG will be used to build a species tree relating

70 sequenced Archaea from all five known phyla. One important observation during

that tree’s construction will be the topology of the species tree scoring landscape near

variations on deep node branching order. A relatively flat landscape in that regime

may suggest that the phyla radiated so rapidly during early archaeal evolution that

their branching order cannot be resolved [141] or that HGT dominated vertical de-

scent during the formation of the archaeal phyla [175]. By contrast, a bowl-shaped

landscape with a clear scoring minimum will provide support that a Tree of Life exists

for the Archaea, and will encourage attempts to construct a universal Tree of Life

with sequenced genomes from all three domains of life.

115

Part IV

Bibliography

116

Bibliography

[1] Daniel Godoy, Gaynor Randle, An-drew J Simpson, David M Aanensen,Tyrone L Pitt, Reimi Kinoshita, andBrian G Spratt. Multilocus sequence typ-ing and evolutionary relationships amongthe causative agents of melioidosis andglanders, Burkholderia pseudomallei andBurkholderia mallei. J Clin Microbiol,41(5):2068–79, May 2003.

[2] Fergus G Priest, Margaret Barker, Les W JBaillie, Edward C Holmes, and Martin C JMaiden. Population structure and evolu-tion of the Bacillus cereus group. J Bacte-

riol, 186(23):7959–70, Dec 2004.

[3] William P Hanage, Christophe Fraser, andBrian G Spratt. Fuzzy species among re-combinogenic bacteria. BMC Biol, 3:6, Jan2005.

[4] Dirk Gevers, Frederick M Cohan, Jef-frey G Lawrence, Brian G Spratt, Tom Co-enye, Edward J Feil, Erko Stackebrandt,Yves Van de Peer, et al. Opinion: Re-evaluating prokaryotic species. Nat Rev

Microbiol, 3(9):733–9, Sep 2005.

[5] Frederick M Cohan and Elizabeth B Perry.A systematics for discovering the funda-mental units of bacterial diversity. Curr

Biol, 17(10):R373–86, May 2007.

[6] Christophe Fraser, William P Hanage, andBrian G Spratt. Recombination and thenature of bacterial speciation. Science,315(5811):476–80, Jan 2007.

[7] B. Jesse Shapiro, Jonathan Friedman,Otto X. Cordero, Sarah Preheim, SoniaTimberlake, Martin Polz, and Eric Alm.Recombination in the core and flexiblegenome drives ecological differentiation insympatric ocean microbes. In progress.

[8] Andrew P Martin. Phylogenetic ap-proaches for describing and comparing thediversity of microbial communities. Appl

Environ Microbiol, 68(8):3673–82, Aug2002.

[9] Catherine Lozupone and Rob Knight.UniFrac: a new phylogenetic method forcomparing microbial communities. Appl

Environ Microbiol, 71(12):8228–35, Dec2005.

[10] Alexander Koeppel, Elizabeth B Perry,Johannes Sikorski, Danny Krizanc, An-drew Warner, David M Ward, Alejandro PRooney, Evelyne Brambilla, et al. Iden-tifying the fundamental units of bacterialdiversity: a paradigm shift to incorporateecology into bacterial systematics. Proc

Natl Acad Sci USA, 105(7):2504–9, Feb2008.

[11] Nora Connor, Johannes Sikorski, Alejan-dro P Rooney, Sarah Kopac, Alexander FKoeppel, Andrew Burger, Scott G Cole,Elizabeth B Perry, et al. Ecology of speci-ation in the genus Bacillus. Appl Environ

Microbiol, 76(5):1349–58, Mar 2010.

[12] H Ochman, J G Lawrence, and E A Gro-isman. Lateral gene transfer and thenature of bacterial innovation. Nature,405(6784):299–304, May 2000.

[13] Eugene V Koonin. Darwinian evolution inthe light of genomics. Nucleic Acids Res,37(4):1011–34, Mar 2009.

[14] N A Moran and A Mira. The processof genome shrinkage in the obligate sym-biont Buchnera aphidicola. Genome Biol-

ogy, 2(12), Jan 2001.

[15] M M Riehle, A F Bennett, and A D Long.Genetic architecture of thermal adaptation

117

in Escherichia coli. Proc Natl Acad Sci

USA, 98(2):525–30, Jan 2001.

[16] Manolis Kellis, Bruce W Birren, and Eric SLander. Proof and evolutionary analy-sis of ancient genome duplication in theyeast Saccharomyces cerevisiae. Nature,428(6983):617–24, Apr 2004.

[17] Nancy A Moran and Tyler Jarvik. Lat-eral transfer of genes from fungi underliescarotenoid production in aphids. Science,328(5978):624–7, Apr 2010.

[18] Mary E Rumpho, Jared M Worful, JunghoLee, Krishna Kannan, Mary S Tyler, De-bashish Bhattacharya, Ahmed Moustafa,and James R Manhart. Horizontal genetransfer of the algal nuclear gene psbOto the photosynthetic sea slug Elysiachlorotica. Proc Natl Acad Sci USA,105(46):17867–71, Nov 2008.

[19] Julie C Dunning Hotopp, Michael E Clark,Deodoro C S G Oliveira, Jeremy M Foster,Peter Fischer, Monica C Munoz Torres,Jonathan D Giebel, Nikhil Kumar, et al.Widespread lateral gene transfer from in-tracellular bacteria to multicellular eukary-otes. Science, 317(5845):1753–6, Sep 2007.

[20] Natsuko Kondo, Naruo Nikoh, NobuyukiIjichi, Masakazu Shimada, and TakemaFukatsu. Genome fragment of Wolbachiaendosymbiont transferred to X chromo-some of host insect. Proc Natl Acad Sci

USA, 99(22):14280–5, Oct 2002.

[21] J G Lawrence and H Ochman. Moleculararchaeology of the Escherichia coli genome.Proc Natl Acad Sci USA, 95(16):9413–7,Aug 1998.

[22] Yoji Nakamura, Takeshi Itoh, Hideo Mat-suda, and Takashi Gojobori. Biased bio-logical functions of horizontally transferredgenes in prokaryotic genomes. Nat Genet,36(7):760–6, Jul 2004.

[23] Dirk Gevers, Klaas Vandepoele, CedricSimillon, and Yves Van de Peer. Gene du-plication and biased functional retention ofparalogs in bacterial genomes. Trends Mi-

crobiol, 12(4):148–54, Apr 2004.

[24] Eric Alm, Katherine Huang, and AdamArkin. The evolution of two-componentsystems in bacteria reveals different strate-gies for niche adaptation. PLoS Comput

Biol, 2(11):e143, Nov 2006.

[25] Victor Kunin and Christos A Ouzou-nis. The balance of driving forces duringgenome evolution in prokaryotes. Genome

Res, 13(7):1589–94, Jul 2003.

[26] Boris G Mirkin, Trevor I Fenner, Michael YGalperin, and Eugene V Koonin. Algo-rithms for computing parsimonious evolu-tionary scenarios for genome evolution, thelast universal common ancestor and dom-inance of horizontal gene transfer in theevolution of prokaryotes. BMC Evol Biol,3:2, Jan 2003.

[27] Berend Snel, Peer Bork, and Martijn AHuynen. Genomes in flux: the evolution ofarchaeal and proteobacterial gene content.Genome Res, 12(1):17–25, Jan 2002.

[28] Olga Zhaxybayeva, J Peter Gogarten,Robert L Charlebois, W Ford Doolittle,and R Thane Papke. Phylogenetic analysesof cyanobacterial genomes: quantificationof horizontal gene transfer events. Genome

Res, 16(9):1099–108, Sep 2006.

[29] Vincent Daubin, Nancy A Moran, andHoward Ochman. Phylogenetics and thecohesion of bacterial genomes. Science,301(5634):829–32, Aug 2003.

[30] R D Page and M A Charleston. Fromgene to organismal phylogeny: reconciledtrees and the gene tree/species tree prob-lem. Mol Phylogenet Evol, 7(2):231–40,Apr 1997.

[31] M A Charleston. Jungles: a new solutionto the host/parasite phylogeny reconcilia-tion problem. Mathematical Biosciences,149(2):191–223, May 1998.

[32] Matthew W Hahn. Bias in phylogenetictree reconciliation methods: implicationsfor vertebrate genome evolution. Genome

Biol, 8(7):R141, Jan 2007.

[33] Matthew D Rasmussen and Manolis Kel-lis. Accurate gene-tree reconstruction by

118

learning gene- and species-specific sub-stitution rates across multiple completegenomes. Genome Res, 17(12):1932–42,Dec 2007.

[34] Orjan Akerborg, Bengt Sennblad, Lars Ar-vestad, and Jens Lagergren. SimultaneousBayesian gene tree reconstruction and rec-onciliation analysis. Proc Natl Acad Sci

USA, 106(14):5714–9, Apr 2009.

[35] Patrick J Keeling and Jeffrey D Palmer.Horizontal gene transfer in eukaryotic evo-lution. Nat Rev Genet, 9(8):605–18, Aug2008.

[36] J Peter Gogarten and Jeffrey P Townsend.Horizontal gene transfer, genome innova-tion and evolution. Nat Rev Microbiol,3(9):679–87, Sep 2005.

[37] Stephen J Giovannoni and Ulrich Stingl.Molecular diversity and ecology of micro-bial plankton. Nature, 437(7057):343–8,Sep 2005.

[38] Edward F DeLong, Christina M Preston,Tracy Mincer, Virginia Rich, Steven J Hal-lam, Niels-Ulrik Frigaard, Asuncion Mar-tinez, Matthew B Sullivan, et al. Commu-nity genomics among stratified microbialassemblages in the ocean’s interior. Sci-

ence, 311(5760):496–503, Jan 2006.

[39] Martin F Polz, Dana E Hunt, Sarah PPreheim, and Daniel M Weinreich. Pat-terns and mechanisms of genetic and phe-notypic differentiation in marine microbes.Philos Trans R Soc Lond, B, Biol Sci,361(1475):2009–21, Nov 2006.

[40] Alban Ramette and James M Tiedje. Mul-tiscale responses of microbial life to spatialdistance and environmental heterogeneityin a patchy ecosystem. Proc Natl Acad Sci

USA, 104(8):2761–6, Feb 2007.

[41] Jennifer B Hughes Martiny, Brendan J MBohannan, James H Brown, Robert K Col-well, Jed A Fuhrman, Jessica L Green,M Claire Horner-Devine, Matthew Kane,et al. Microbial biogeography: putting mi-croorganisms on the map. Nat Rev Micro-

biol, 4(2):102–12, Feb 2006.

[42] Silvia G Acinas, Vanja Klepac-Ceraj,Dana E Hunt, Chanathip Pharino, IvicaCeraj, Daniel L Distel, and Martin FPolz. Fine-scale phylogenetic architectureof a complex bacterial community. Nature,430(6999):551–4, Jul 2004.

[43] Zackary I Johnson, Erik R Zinser, Alli-son Coe, Nathan P McNulty, E Malcolm SWoodward, and Sallie W Chisholm. Nichepartitioning among Prochlorococcus eco-types along ocean-scale environmental gra-dients. Science, 311(5768):1737–40, Mar2006.

[44] Mike J Ferris, Michael Kuhl, AndreaWieland, and David M Ward. Cyanobac-terial ecotypes in different optical microen-vironments of a 68 degrees C hot springmat community revealed by 16S-23S rRNAinternal transcribed spacer region varia-tion. Appl Environ Microbiol, 69(5):2893–8, May 2003.

[45] W Ford Doolittle and R Thane Papke. Ge-nomics and the bacterial species problem.Genome Biology, 7(9):116, Jan 2006.

[46] Adam C Retchless and Jeffrey G Lawrence.Temporal fragmentation of speciation inbacteria. Science, 317(5841):1093–6, Aug2007.

[47] J R Thompson and M F Polz. The Biology

of Vibrios. ASM Press, Washington, DC,2006.

[48] Thomas Kiørboe, Hans-Peter Grossart,Helle Ploug, and Kam Tang. Mechanismsand rates of bacterial colonization of sink-ing aggregates. Appl Environ Microbiol,68(8):3996–4006, Aug 2002.

[49] L Pomeroy, R Hanson, P Mcgillivary,B Sherr, D Kirchman, and D Deibel. Mi-crobiology and Chemistry of Fecal Prod-ucts of Pelagic Tunicates: Rates and Fates.Bull. Mar. Sci., 35(3):426–439, 1984.

[50] C Panagiotopoulos, R Sempere, andI Obernosterer. Bacterial degradationof large particles in the southern IndianOcean using in vitro incubation exper-iments. Organic Geochemistry, 33:985–1000, Jan 2002.

119

[51] J F Heidelberg, K B Heidelberg, and R RColwell. Bacteria of the gamma-subclassProteobacteria associated with zooplank-ton in Chesapeake Bay. Appl Environ Mi-

crobiol, 68(11):5498–507, Nov 2002.

[52] Dana E Hunt, Dirk Gevers, Nisha M Va-hora, and Martin F Polz. Conservationof the chitin utilization pathway in theVibrionaceae. Appl Environ Microbiol,74(1):44–51, Jan 2008.

[53] Jakob Pernthaler and Rudolf Amann. Fateof heterotrophic microbes in pelagic habi-tats: focus on populations. Microbiol Mol

Biol Rev, 69(3):440–61, Sep 2005.

[54] Janelle R Thompson, Sarah Pacocha,Chanathip Pharino, Vanja Klepac-Ceraj,Dana E Hunt, Jennifer Benoit, RamahiSarma-Rupavtarm, Daniel L Distel, et al.Genotypic diversity within a naturalcoastal bacterioplankton population. Sci-

ence, 307(5713):1311–3, Feb 2005.

[55] W Maddison and M Slatkin. Null modelsfor the number of evolutionary steps in acharacter on a phylogenetic tree. Evolu-

tion, 45(5):1184–1197, Jan 1991.

[56] Fredrik Ronquist. Bayesian inference ofcharacter evolution. Trends Ecol Evol

(Amst), 19(9):475–81, Sep 2004.

[57] Janelle R Thompson, Mark A Randa,Luisa A Marcelino, Aoy Tomita-Mitchell,Eelin Lim, and Martin F Polz. Diversityand dynamics of a north atlantic coastalVibrio community. Appl Environ Micro-

biol, 70(7):4103–10, Jul 2004.

[58] Vincent Daubin and Nancy A Moran.Comment on ”The origins of genome com-plexity”. Science, 306(5698):978, Nov2004.

[59] Frederick M Cohan. Sexual isolation andspeciation in bacteria. Genetica, 116(2-3):359–70, Nov 2002.

[60] J Majewski. Sexual isolation in bacteria.FEMS Microbiol Lett, 199(2):161–9, May2001.

[61] C von Mering, P Hugenholtz, J Raes, S GTringe, T Doerks, L J Jensen, N Ward, andP Bork. Quantitative phylogenetic assess-ment of microbial communities in diverseenvironments. Science, 315(5815):1126–30,Feb 2007.

[62] S G Acinas, J Anton, and F Rodrıguez-Valera. Diversity of free-living andattached bacteria in offshore WesternMediterranean waters as depicted by anal-ysis of genes encoding 16S rRNA. Appl En-

viron Microbiol, 65(2):514–22, Feb 1999.

[63] B C Crump, E V Armbrust, and J ABaross. Phylogenetic analysis of particle-attached and free-living bacterial commu-nities in the Columbia river, its estuary,and the adjacent coastal ocean. Appl Env-

iron Microbiol, 65(7):3192–204, Jul 1999.

[64] L Riemann and A Winding. Commu-nity Dynamics of Free-living and Particle-associated Bacterial Assemblages during aFreshwater Phytoplankton Bloom. Micro-

bial Ecology, 42(3):274–285, Oct 2001.

[65] N Selje and M Simon. Composition anddynamics of particle-associated and free-living bacterial communities in the Weserestuary, Germany. Aquatic Microbial Ecol-

ogy, 30:221–237, Jan 2003.

[66] E Delong, D Franks, and A Alldredge. Phy-logenetic diversity of aggregate-attachedvs. free-living marine bacterial assem-blages. Limnology and Oceanography,38(5):924–934, Jan 1993.

[67] S H Goh, S Potter, J O Wood, S M Hem-mingsen, R P Reynolds, and A W Chow.HSP60 gene sequences as universal targetsfor microbial species identification: stud-ies with coagulase-negative staphylococci.J Clin Microbiol, 34(4):818–23, Apr 1996.

[68] Erko Stackebrandt and M Goodfellow, ed-itors. Nucleic acid techniques in bacterial

systematics. Wiley and Sons, Chichester,UK, 1991.

[69] J R Cole, B Chai, R J Farris, Q Wang,A S Kulam-Syed-Mohideen, D M McGar-rell, A M Bandela, E Cardenas, et al. The

120

ribosomal database project (RDP-II): in-troducing myRDP space and quality con-trolled public data. Nucleic Acids Res,35(Database issue):D169–72, Jan 2007.

[70] S F Altschul, W Gish, W Miller, E W My-ers, and D J Lipman. Basic local alignmentsearch tool. J Mol Biol, 215(3):403–10, Oct1990.

[71] Scott R Santos and Howard Ochman. Iden-tification and phylogenetic sorting of bac-terial lineages with universally conservedgenes and proteins. Environmental Micro-

biology, 6(7):754–9, Jul 2004.

[72] Stephane Guindon and Olivier Gascuel. Asimple, fast, and accurate algorithm to es-timate large phylogenies by maximum like-lihood. Syst Biol, 52(5):696–704, Oct 2003.

[73] M Hasegawa, Y Iida, T Yano, F Takaiwa,and M Iwabuchi. Phylogenetic relation-ships among eukaryotic kingdoms inferredfrom ribosomal RNA sequences. J Mol

Evol, 22(1):32–8, Jan 1985.

[74] Ivica Letunic and Peer Bork. InteractiveTree Of Life (iTOL): an online tool forphylogenetic tree display and annotation.Bioinformatics, 23(1):127–8, Jan 2007.

[75] John A. Rice. Mathematical Statistics and

Data Analysis. Duxbury Press: Belmont,CA, 1994.

[76] J. Felsenstein. Inferring phylogenies. Sin-auer Associates, Inc., Sunderland, MA,2003.

[77] Akaike. A new look at the statistical modelidentification. Automatic Control, IEEE

Transactions on, 19(6):716 – 723, 1974.

[78] Richard Durbin, Sean R. Eddy, AndersKrogh, and Graeme J. Mitchison. Biologi-

cal Sequence Analysis: Probabilistic Models

of Proteins and Nucleic Acids. CambridgeUniversity Press, 1998.

[79] E G Nisbet and N H Sleep. The habi-tat and nature of early life. Nature,409(6823):1083–91, Feb 2001.

[80] Birger Rasmussen, Ian R Fletcher,Jochen J Brocks, and Matt R Kilburn.Reassessing the first appearance of eu-karyotes and cyanobacteria. Nature,455(7216):1101–4, Oct 2008.

[81] Christopher L Dupont, Song Yang, BrianPalenik, and Philip E Bourne. Modern pro-teomes contain putative imprints of ancientshifts in trace metal geochemistry. Proc

Natl Acad Sci USA, 103(47):17822–7, Nov2006.

[82] M Saito, D Sigman, and F Morel. Thebioinorganic chemistry of the ancientocean: the co-evolution of cyanobacterialmetal requirements and biogeochemical cy-cles at the Archean-Proterozoic bound-ary? Inorganica Chimica Acta, 356:308–318, Jan 2003.

[83] A Zerkle, C House, and S Brantley. Bio-geochemical signatures through time asinferred from whole microbial genomes.American Journal of Science, 305:467–502,Jan 2005.

[84] D Yang, Y Oyaizu, H Oyaizu, G J Olsen,and C R Woese. Mitochondrial origins.Proc Natl Acad Sci USA, 82(13):4443–7,Jul 1985.

[85] S J Giovannoni, S Turner, G J Olsen,S Barns, D J Lane, and N R Pace. Evo-lutionary relationships among cyanobacte-ria and green chloroplasts. J Bacteriol,170(8):3584–92, Aug 1988.

[86] D J De Marais. Evolution. When did pho-tosynthesis emerge on Earth? Science,289(5485):1703–5, Sep 2000.

[87] J Gogarten, W Doolittle, and JeffreyLawrence. Prokaryotic Evolution inLight of Gene Transfer. Mol Biol Evol,19(12):2226, Dec 2002.

[88] R Jain, M C Rivera, and J A Lake. Hor-izontal gene transfer among genomes: thecomplexity hypothesis. Proc Natl Acad Sci

USA, 96(7):3801–6, Mar 1999.

[89] Mark A Ragan and Robert G Beiko. Lat-eral genetic transfer: open issues. Phi-

los Trans R Soc Lond, B, Biol Sci,364(1527):2241–51, Aug 2009.

121

[90] Francesca D Ciccarelli, Tobias Doerks,Christian von Mering, Christopher JCreevey, Berend Snel, and Peer Bork.Toward automatic reconstruction of ahighly resolved tree of life. Science,311(5765):1283–7, Mar 2006.

[91] Thomas Lepage, David Bryant, HervePhilippe, and Nicolas Lartillot. A gen-eral comparison of relaxed molecular clockmodels. Mol Biol Evol, 24(12):2669–80,Dec 2007.

[92] S J Mojzsis, G Arrhenius, K D McKee-gan, T M Harrison, A P Nutman, andC R Friend. Evidence for life on Earthbefore 3,800 million years ago. Nature,384(6604):55–9, Nov 1996.

[93] M Rosing. 13C-Depleted carbon mi-croparticles in ¿3700-Ma sea-floor sedimen-tary rocks from west greenland. Science,283(5402):674–6, Jan 1999.

[94] Jessica Garvin, Roger Buick, Ariel D An-bar, Gail L Arnold, and Alan J Kauf-man. Isotopic evidence for an aerobic ni-trogen cycle in the latest Archean. Science,323(5917):1045–8, Feb 2009.

[95] Ariel D Anbar, Yun Duan, Timothy WLyons, Gail L Arnold, Brian Kendall,Robert A Creaser, Alan J Kaufman,Gwyneth W Gordon, et al. A whiff of oxy-gen before the great oxidation event? Sci-

ence, 317(5846):1903–6, Sep 2007.

[96] Alan J Kaufman, David T Johnston, JamesFarquhar, Andrew L Masterson, Timo-thy W Lyons, Steve Bates, Ariel D Anbar,Gail L Arnold, et al. Late Archean bio-spheric oxygenation and atmospheric evo-lution. Science, 317(5846):1900–3, Sep2007.

[97] Christopher T Reinhard, Rob Raiswell,Clint Scott, Ariel D Anbar, and Timo-thy W Lyons. A late Archean sulfidic seastimulated by early oxidative weathering ofthe continents. Science, 326(5953):713–6,Oct 2009.

[98] J J Brocks, G A Logan, R Buick, andR E Summons. Archean molecular fossilsand the early rise of eukaryotes. Science,285(5430):1033–6, Aug 1999.

[99] Jacob R Waldbauer, Laura S Sherman,Dawn Y Sumner, and Roger E Summons.Late Archean molecular fossils from theTransvaal Supergroup record the antiquityof microbial diversity and aerobiosis. Pre-

cambrian Research, 169(1-4):28–47, Dec2009.

[100] S Golubic, V N Sergeev, and A HKnoll. Mesoproterozoic Archaeoellipsoides:akinetes of heterocystous cyanobacteria.Lethaia, 28:285–98, Jan 1995.

[101] N Butterfield. Bangiomorpha pubescens n.gen., n. sp.: implications for the evolutionof sex, multicellularity, and the Mesopro-terozoic/ Neoproterozoic radiation of eu-karyotes. Paleobiology, 26(3):386–404, Jan2000.

[102] Gordon D Love, Emmanuelle Grosjean,Charlotte Stalvies, David A Fike, John PGrotzinger, Alexander S Bradley, Amy EKelly, Maya Bhatia, et al. Fossil steroidsrecord the appearance of Demospongiaeduring the Cryogenian period. Nature,457(7230):718–21, Feb 2009.

[103] Edward B Daeschler, Neil H Shubin, andFarish A Jenkins. A Devonian tetrapod-like fish and the evolution of the tetrapodbody plan. Nature, 440(7085):757–63, Apr2006.

[104] E Daeschler, N Shubin, K Thomson, andW Amaral. A Devonian Tetrapod fromNorth America. Science, 265(5172):639–642, Jul 1994.

[105] N Moran, M Munson, P Baumann, andH Ishikawa. A Molecular Clock in En-dosymbiotic Bacteria is Calibrated Usingthe Insect Hosts. Proceedings of the Royal

Society B: Biological Sciences, 253:167–171, Jan 1993.

[106] Lars Juhl Jensen, Philippe Julien, MichaelKuhn, Christian von Mering, Jean Muller,Tobias Doerks, and Peer Bork. eggNOG:automated construction and annotation oforthologous groups of genes. Nucleic Acids

Res, 36(Database issue):D250–4, Jan 2008.

[107] J G Jelesko, R Harper, M Furuya, andW Gruissem. Rare germinal unequal

122

crossing-over leading to recombinant geneformation and gene duplication in Ara-bidopsis thaliana. Proc Natl Acad Sci USA,96(18):10302–7, Aug 1999.

[108] Michael Lynch and John S Conery. Theorigins of genome complexity. Science,302(5649):1401–4, Nov 2003.

[109] M Sugiura, T Hirose, and M Sugita. Evo-lution and mechanism of translation inchloroplasts. Annu Rev Genet, 32:437–59,Jan 1998.

[110] D Fischer and D Eisenberg. Finding fam-ilies for genomic ORFans. Bioinformatics,15(9):759–62, Sep 1999.

[111] C Scott, T W Lyons, A Bekker, Y Shen,S W Poulton, X Chu, and A D Anbar.Tracing the stepwise oxygenation of theProterozoic ocean. Nature, 452(7186):456–9, Mar 2008.

[112] Kurt O Konhauser, Ernesto Pecoits, Ste-fan V Lalonde, Dominic Papineau, Euan GNisbet, Mark E Barley, Nicholas T Arndt,Kevin Zahnle, et al. Oceanic nickeldepletion and a methanogen famine be-fore the Great Oxidation Event. Nature,458(7239):750–3, Apr 2009.

[113] D Canfield. A new model for Proterozoicocean chemistry. Nature, 396(6710):450–453, Jan 1998.

[114] A Bekker, H D Holland, P-L Wang,D Rumble, H J Stein, J L Hannah,L L Coetzee, and N J Beukes. Datingthe rise of atmospheric oxygen. Nature,427(6970):117–20, Jan 2004.

[115] D Canfield. The early history of atmo-spheric oxygen: homage to Robert M. Gar-rels. Annu. Rev. Earth Planet. Sci., 33:1–36, Jan 2005.

[116] Michael J Sanderson. r8s: inferring abso-lute rates of molecular evolution and di-vergence times in the absence of a molecu-lar clock. Bioinformatics, 19(2):301–2, Jan2003.

[117] M Kanehisa and S Goto. KEGG: kyoto en-cyclopedia of genes and genomes. Nucleic

Acids Res, 28(1):27–30, Jan 2000.

[118] Eric J Alm, Katherine H Huang, Morgan NPrice, Richard P Koche, Keith Keller,Inna L Dubchak, and Adam P Arkin. TheMicrobesOnline Web site for comparativegenomics. Genome Res, 15(7):1015–22, Jul2005.

[119] JP Huelsenbeck, B Rannala, and B Larget.Tangled Trees: Phylogenies, Cospeciation,

and Coevolution. University of ChicagoPress, Chicago, 2002.

[120] L Arvestad, A Berglund, J Lagergren, andB Sennblad. Gene tree reconstruction andorthology analysis based on an integratedmodel for duplications and sequence evo-lution. RECOMB’04, pages 326–335, Jan2004.

[121] Dave MacLeod, Robert L Charlebois, FordDoolittle, and Eric Bapteste. Deductionof probable events of lateral gene transferthrough comparison of phylogenetic treesby recursive consolidation and rearrange-ment. BMC Evol Biol, 5(1):27, Jan 2005.

[122] J Bergsten. A review of long-branch at-traction. Cladistics, 21:163–193, Jan 2005.

[123] A Rambaut and N Grass. Seq-Gen: anapplication for the Monte Carlo simulationof DNA sequence evolution along phyloge-netic trees. Bioinformatics, 13(3):235–238,Jan 1997.

[124] DF Robinson and LR Foulds. Comparisonof phylogenetic trees. Mathematical Bio-

sciences, 53(1-2):131–147, 1981.

[125] Tal Dagan and William Martin. Ances-tral genome sizes specify the minimum rateof lateral gene transfer during prokary-ote evolution. Proc Natl Acad Sci USA,104(3):870–5, Jan 2007.

[126] Nicolas Lartillot and Herve Philippe. ABayesian mixture model for across-site het-erogeneities in the amino-acid replacementprocess. Mol Biol Evol, 21(6):1095–109,Jun 2004.

[127] Roman L Tatusov, Natalie D Fedorova,John D Jackson, Aviva R Jacobs, BorisKiryutin, Eugene V Koonin, Dmitri MKrylov, Raja Mazumder, et al. The COG

123

database: an updated version includes eu-karyotes. BMC Bioinformatics, 4:41, Sep2003.

[128] Gerard Talavera and Jose Castresana. Im-provement of phylogenies after removingdivergent and ambiguously aligned blocksfrom protein sequence alignments. Syst

Biol, 56(4):564–77, Aug 2007.

[129] J Castresana. Selection of conserved blocksfrom multiple alignments for their usein phylogenetic analysis. Mol Biol Evol,17(4):540–52, Apr 2000.

[130] Robert C Edgar. MUSCLE: multiplesequence alignment with high accuracyand high throughput. Nucleic Acids Res,32(5):1792–7, Jan 2004.

[131] Frederic Delsuc, Henner Brinkmann, andHerve Philippe. Phylogenomics and the re-construction of the tree of life. Nat Rev

Genet, 6(5):361–75, May 2005.

[132] Fengrong Ren, Hiroshi Tanaka, and Zi-heng Yang. A likelihood look at thesupermatrix-supertree controversy. Gene,441(1-2):119–25, Jul 2009.

[133] J Slowinski and R Page. How shouldspecies phylogenies be inferred from se-quence data? Syst Biol, 48(4):814–825,Jan 1999.

[134] Eric Bapteste, Henner Brinkmann,Jennifer A Lee, Dorothy V Moore,Christoph W Sensen, Paul Gordon,Laure Durufle, Terry Gaasterland, et al.The analysis of 100 genes supports thegrouping of three highly divergent amoe-bae: Dictyostelium, Entamoeba, andMastigamoeba. Proc Natl Acad Sci USA,99(3):1414–9, Feb 2002.

[135] J Felsenstein. Cases in which parsimonyor compatibility methods will be positivelymisleading. Syst Biol, 27(4):401–410, Jan1978.

[136] P J Lockhart, M A Steel, M D Hendy, andD Penny. Recovering evolutionary trees un-der a more realistic model of sequence evo-lution. Mol Biol Evol, 11(4):605–12, Jul1994.

[137] Matthew J Phillips, Frederic Delsuc, andDavid Penny. Genome-scale phylogeny andthe detection of systematic biases. Mol Biol

Evol, 21(7):1455–8, Jul 2004.

[138] Oriane Matte-Tailliez, Celine Brochier,Patrick Forterre, and Herve Philippe. Ar-chaeal phylogeny based on ribosomal pro-teins. Mol Biol Evol, 19(5):631–9, May2002.

[139] Tal Dagan and William Martin. The treeof one percent. Genome Biol, 7(10):118,Jan 2006.

[140] Vincent Daubin and Howard Ochman.Quartet mapping and the extent of lateraltransfer in bacterial genomes. Mol Biol

Evol, 21(1):86–9, Jan 2004.

[141] Pere Puigbo, Yuri I Wolf, and Eugene VKoonin. Search for a ’Tree of Life’ in thethicket of the phylogenetic forest. Journal

of Biology, 8(6):59, Jan 2009.

[142] Herve Philippe, Elizabeth A Snell, EricBapteste, Philippe Lopez, Peter W H Hol-land, and Didier Casane. Phylogenomics ofeukaryotes: impact of missing data on largealignments. Mol Biol Evol, 21(9):1740–52,Sep 2004.

[143] Herve Philippe and Christophe J Douady.Horizontal gene transfer and phylogenet-ics. Curr Opin Microbiol, 6(5):498–505,Oct 2003.

[144] Olaf R P Bininda-Emonds. The evolutionof supertrees. Trends Ecol Evol (Amst),19(6):315–22, Jun 2004.

[145] M A Ragan. Phylogenetic inference basedon matrix representation of trees. Mol Phy-

logenet Evol, 1(1):53–8, Mar 1992.

[146] J B Slowinski, A Knight, and A P Rooney.Inferring species trees from gene trees: aphylogenetic analysis of the Elapidae (Ser-pentes) based on the amino acid sequencesof venom proteins. Mol Phylogenet Evol,8(3):349–62, Dec 1997.

[147] R D Page. GeneTree: comparing gene andspecies phylogenies using reconciled trees.Bioinformatics, 14(9):819–20, Jan 1998.

124

[148] A Wehe, M Bansal, J Burleigh, and O Eu-lenstein. DupTree: A program for large-scale phylogenetic analyses using gene treeparsimony. Bioinformatics, May 2008.

[149] LA David and EJ Alm. Rapid evolution-ary innovation during an Archean GeneticExpansion. Submitted, 2010.

[150] M Bansal, J Burleigh, O Eulenstein,and A Wehe. Heuristics for the gene-duplication problem: A (n) speed-up forthe local search. RECOMB’07, pages 238–252, Jan 2007.

[151] L Billera, S Holmes, and K Vogtmann.Geometry of the Space of PhylogeneticTrees. Advances in Applied Mathematics,27(4):733–767, Jan 2001.

[152] Stephane Guindon, Jean-Francois Du-fayard, Vincent Lefort, Maria Anisi-mova, Wim Hordijk, and Olivier Gascuel.New algorithms and methods to estimatemaximum-likelihood phylogenies: assess-ing the performance of PhyML 3.0. Syst

Biol, 59(3):307–21, May 2010.

[153] Morgan N Price, Paramvir S Dehal, andAdam P Arkin. FastTree 2–approximatelymaximum-likelihood trees for large align-ments. PLoS One, 5(3):e9490, Jan 2010.

[154] A Stamatakis, T Ludwig, and H Meier.RAxML-III: a fast program for maximumlikelihood-based inference of large phyloge-netic trees. Bioinformatics, 21(4):456–63,Feb 2005.

[155] Elizabeth Waters, Michael J Hohn, IvanAhel, David E Graham, Mark D Adams,Mary Barnstead, Karen Y Beeson, LisaBibbs, et al. The genome of Nanoarchaeumequitans: insights into early archaeal evo-lution and derived parasitism. Proc Natl

Acad Sci USA, 100(22):12984–8, Oct 2003.

[156] Celine Brochier, Simonetta Gribaldo,Yvan Zivanovic, Fabrice Confalonieri, andPatrick Forterre. Nanoarchaea: represen-tatives of a novel archaeal phylum or afast-evolving euryarchaeal lineage relatedto Thermococcales? Genome Biology,6(5):R42, Jan 2005.

[157] Yuri I Wolf, Igor B Rogozin, Nick V Gr-ishin, and Eugene V Koonin. Genometrees and the tree of life. Trends Genet,18(9):472–9, Sep 2002.

[158] Alexei I Slesarev, Katja V Mezhevaya,Kira S Makarova, Nikolai N Polushin,Olga V Shcherbinina, Vera V Shakhova,Galina I Belova, L Aravind, et al.The complete genome of hyperthermophileMethanopyrus kandleri AV19 and mono-phyly of archaeal methanogens. Proc Natl

Acad Sci USA, 99(7):4644–9, Apr 2002.

[159] Ji Qi, Bin Wang, and Bai-Iin Hao. Wholeproteome prokaryote phylogeny withoutsequence alignment: a K-string composi-tion approach. J Mol Evol, 58(1):1–11, Jan2004.

[160] Kira S Makarova, Alexander V Sorokin,Pavel S Novichkov, Yuri I Wolf, and Eu-gene V Koonin. Clusters of orthologousgenes for 41 archaeal genomes and implica-tions for evolutionary genomics of archaea.Biol Direct, 2:33, Jan 2007.

[161] Ken Nealson. A Korarchaeote yields togenome sequencing. Proc Natl Acad Sci

USA, 105(26):8805–6, Jul 2008.

[162] James G Elkins, Mircea Podar, David EGraham, Kira S Makarova, Yuri Wolf,Lennart Randau, Brian P Hedlund, CelineBrochier-Armanet, et al. A korarchaealgenome reveals insights into the evolutionof the Archaea. Proc Natl Acad Sci USA,105(23):8102–7, Jun 2008.

[163] Steven J Hallam, Konstantinos T Kon-stantinidis, Nik Putnam, Christa Schleper,Yoh ichi Watanabe, Junichi Sugahara,Christina Preston, Jose de la Torre,et al. Genomic analysis of the unculti-vated marine crenarchaeote Cenarchaeumsymbiosum. Proc Natl Acad Sci USA,103(48):18296–301, Nov 2006.

[164] Harald Huber, Michael J Hohn, Rein-hard Rachel, Tanja Fuchs, Verena C Wim-mer, and Karl O Stetter. A new phy-lum of Archaea represented by a nano-sized hyperthermophilic symbiont. Nature,417(6884):63–7, May 2002.

125

[165] S M Barns, C F Delwiche, J D Palmer, andN R Pace. Perspectives on archaeal diver-sity, thermophily and monophyly from en-vironmental rRNA sequences. Proc Natl

Acad Sci USA, 93(17):9188–93, Aug 1996.

[166] Thomas A Auchtung, Cristina D Takacs-Vesbach, and Colleen M Cavanaugh. 16SrRNA phylogenetic investigation of thecandidate division ”Korarchaeota”. Appl

Environ Microbiol, 72(7):5077–82, Jul2006.

[167] Anja Spang, Roland Hatzenpichler,Celine Brochier-Armanet, Thomas Rattei,Patrick Tischler, Eva Spieck, WolfgangStreit, David A Stahl, et al. Distinct geneset in two different lineages of ammonia-oxidizing archaea supports the phylumThaumarchaeota. Trends Microbiol,18(8):331–340, Aug 2010.

[168] Celine Brochier-Armanet, Bastien Bous-sau, Simonetta Gribaldo, and PatrickForterre. Mesophilic Crenarchaeota: pro-posal for a third archaeal phylum, theThaumarchaeota. Nat Rev Microbiol,6(3):245–52, Mar 2008.

[169] Brian B Oakley, Franck Carbonero,Christopher J van der Gast, Robert JHawkins, and Kevin J Purdy. Evolution-ary divergence and biogeography of sym-patric niche-differentiated bacterial popu-lations. ISME J, 4(4):488–97, Apr 2010.

[170] Daniel Falush. Toward the use of genomicsto study microevolutionary change in bac-teria. PLoS Genet, 5(10):e1000627, Oct2009.

[171] W Doolittle and O Zhaxybayeva. Metage-nomics and the Units of Biological Orga-nization. BioScience, 60(2):102–112, Jan2010.

[172] Christopher L Dupont, Andrew Butcher,Ruben E Valas, Philip E Bourne, and Gus-tavo Caetano-Anolles. History of biologicalmetal utilization inferred through phyloge-nomic analysis of protein structures. Proc

Natl Acad Sci USA, 107(23):10567–72, Jun2010.

[173] Jason Raymond and Daniel Segre. Theeffect of oxygen on biochemical networksand the evolution of complex life. Science,311(5768):1764–7, Mar 2006.

[174] E G Nisbet, N V Grassineau, C J Howe, P IAbell, M Regelous, and R E R Nisbet. Theage of Rubisco: the evolution of oxygenicphotosynthesis. Geobiology, 5:311–335, Jan2007.

[175] Eugene V Koonin. The Biological Big Bangmodel for the major transitions in evolu-tion. Biol Direct, 2:21, Jan 2007.

126

Documents

Novel Phylogenetic Approaches to Problems in Microbial ...Novel Phylogenetic Approaches to Problems in Microbial Genomics by Lawrence A. David Submitted to the Computational & Systems