Gene tree rooting methods give distributions that mimic the coalescent process

Accepted Manuscript

Yuan Tian, Laura S. Kubatko

PII: S1055-7903(13)00346-1

DOI: http://dx.doi.org/10.1016/j.ympev.2013.09.004

Reference: YMPEV 4705

To appear in: Molecular Phylogenetics and Evolution

Received Date: 13 March 2013

Revised Date: 28 August 2013

Accepted Date: 6 September 2013

Please cite this article as: Tian, Y., Kubatko, L.S., Gene tree rooting methods give distributions that mimic the coalescent process, Molecular Phylogenetics and Evolution (2013), doi: http://dx.doi.org/10.1016/j.ympev.2013.09.004

GENE TREE ROOTING METHODS GIVE DISTRIBUTIONS THAT MIMIC

THE COALESCENT PROCESS

Yuan Tian and Laura S. Kubatko*,+

*Departments of Statistics and Evolution, Ecology, and Organismal Biology, The Ohio State

University, Columbus, OH 43210

Author for correspondence:

Laura S. Kubatko

Departments of Statistics and

Evolution, Ecology, and Organismal Biology

The Ohio State University

404 Cockins Hall, 1958 Neil Avenue

Columbus, OH 43210

Phone: 614-247-8846

FAX: 614-292-2866

INTRODUCTION

In multi-locus phylogenetic analysis, a phylogenetic tree based on a single gene (i.e., a

gene tree) may not agree with the species tree, the tree that represents the actual evolutionary

pathway (Pamilo and Nei, 1988; Maddison 1997). The many possible causes of this discord

are well-known, and include processes such as incomplete lineage sorting (ILS), gene

duplication, hybridization, non-neutral evolution, and horizontal gene transfer (Hein 1993;

Maddison 1997; Sang and Zhong 2000; Bayzid and Warnow 2012). An appropriate

probabilistic model that links gene trees and species trees should involve the phylogenetic

relationships of species as well as the genealogical history of each gene (Anderson et al.

2012). Such models are necessary to carry out accurate inference of the species phylogeny

from a multi-locus data set.

The coalescent process is a retrospective model of population genetics that is commonly

used to model ILS. The coalescent model is based on tracing the evolutionary history of

sampled genes by considering the time from the present back to their most recent common

ancestor (Kingman 2000), and can be derived as the limiting distribution (as the population

size becomes large) that results from the Wright-Fisher and other commonly-used population

genetics models (Wakeley 2008; Kingman 1982; Takahata and Nei 1985). Under the

coalescent model, the probability distribution of gene trees given a fixed species tree

topology and branch lengths can be computed (Degnan and Salter 2005). The coalescent

model is also used as the basis for different methods to estimate species trees using either

multi-locus DNA sequence data or a set of observed gene trees (Kubatko et al. 2009; Than et

al. 2008; Liu and Pearl 2007; Liu et al. 2010; Heled and Drummond 2010). These methods

have been widely applied to the multi-locus data sets that are commonly produced by

next-generation sequencing techniques.

The coalescent process has been so widely applied in part because it is believed that ILS

is a predominant cause of the incongruence observed between gene trees and species trees

(Liu et al. 2010). Indeed, the predictions made by the coalescent model in terms of the

distribution of gene trees are consistent with several observed data sets (Ebersberger et al.

2007; Ane 2010; Kubatko et al. 2011). However, a recent study of seven large multi-partition

genome-level data sets has suggested that random rooting might be another potential

explanation for the apparent fit to the coalescent model (Rosenfeld et al. 2012). Here, we

consider whether the signature evidenced by this gene tree topology distribution is unique to

the coalescent process. In particular, we show that in the case of four taxa, this distribution

can be mimicked by generating gene trees from a single species tree and then rooting these

gene trees using either the assumption of a molecular clock or outgroup rooting.

Below we briefly review the coalescent process, focusing on the gene tree distribution in

the four-taxon case. We then use simulation to show that both the coalescent and

non-coalescent models described above lead to nearly identical distributions on the set of

gene trees. Further, we study the behavior of several coalescent-based methods of species tree

estimation when the data come from a single tree. We conclude that a model without ILS can

produce a distribution that mimics that of the coalescent process. Furthermore, under

different gene tree distributions, not all of the methods of species tree inference examined

here can correctly estimate the true species tree with high frequency.

Gene tree distribution under the coalescent process

Consider a 4-taxon, asymmetric bifurcating species tree as shown in Figure 1 (a) (thick

lines). A gene tree that has the same coalescent history as the species tree is nested within the

species tree (Fig. 1(a), thin lines). Note that this species tree contains two internal branch

lengths that we denote by x and y that are given in coalescent units (number of 2Ne

generations). In the gene tree, the coalescent times are denoted by t1, t2, and t3, from the most

recent to the oldest, respectively. The probability of observing gene tree g given a particular

species tree S can be calculated using the formula (Degnan and Salter 2005)

\[
P(G = g \mid S) = \sum_{\text{histories}} P(G = g, \text{history} \mid S) = \sum_{\text{histories}} \prod_{b} w_b \, P_{u(b),v(b)}(t_b),
\]

where the product on the right-hand side is taken over all branches $b$, $w_b$ is the probability

of getting a sequence of coalescent events that is consistent with $g$, and $P_{u(b),v(b)}(t_b)$ is the

probability that $u(b)$ lineages coalesce into $v(b)$ lineages along branch $b$, which has length $t_b$

(see, e.g., Rosenberg 2007). This formula can be used to obtain expressions for the

probabilities of all 15 gene tree topologies that are possible for the species tree in Figure 1 (a)

when one lineage per species is sampled. Figure 1 (b) to (e) show the entire probability

distribution on the set of 15 gene trees for several choices of the species tree branch lengths x

and y. Note that when x and y are long, only one tree has substantial probability (Figure 1 (b)).

As y becomes shorter, there are three asymmetric trees that have substantial probability

(Figure 1(c)). If y is long but x becomes shorter, the distribution of mass shifts to two

asymmetric trees as well as a symmetric tree (Figure 1 (d)). When both x and y are very short,

many of the 15 trees have nearly equal probability (Figure 1 (e)).
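The per-branch term in the formula above, the probability that u lineages coalesce into v lineages along a branch of length t, has a known closed form under Kingman's coalescent. The sketch below assumes the standard expression usually attributed to Tavaré (1984), which is not written out in this manuscript; the function name `coalescent_transition_prob` is illustrative and does not come from any of the software packages discussed here.

```python
from math import exp, factorial


def coalescent_transition_prob(u, v, t):
    """Probability that u lineages coalesce into v lineages within t coalescent units.

    Assumed closed form (Tavare 1984); returns 0.0 when v is not in 1..u.
    """
    if not (1 <= v <= u):
        return 0.0
    total = 0.0
    for k in range(v, u + 1):
        # Product term over y = 0, ..., k-1 of (v + y)(u - y) / (u + y).
        prod = 1.0
        for y in range(k):
            prod *= (v + y) * (u - y) / (u + y)
        coeff = (2 * k - 1) * (-1) ** (k - v) / (
            factorial(v) * factorial(k - v) * (v + k - 1)
        )
        total += exp(-k * (k - 1) * t / 2.0) * coeff * prod
    return total


if __name__ == "__main__":
    # Two lineages entering a branch of length 1.0: coalesce vs. fail to coalesce.
    print(coalescent_transition_prob(2, 1, 1.0), coalescent_transition_prob(2, 2, 1.0))
```

For two lineages this reduces to the familiar 1 - exp(-t) for coalescence within the branch and exp(-t) for failure to coalesce, which is why short internal branches spread probability over several gene tree topologies.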

The fact that under the coalescent model for four taxa one tree will have most of the

probability while two others occur with smaller, but equal, probability (and all other trees

have much lower probability) has been noted to be a signature of the coalescent model.

This type of distribution has been observed in empirical studies as well (Ebersberger et al.

2007; Ane 2010; Kubatko et al. 2011; Rosenfeld et al. 2012), and for this reason has been

used as evidence of the importance of incorporating the coalescent model into species tree

inference methodology. It is thus of interest to determine whether other features of the

process of molecular evolution coupled with tree estimation methodology can produce a

distribution on gene trees that shows this feature. This is the question that we examine here.
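The signature described above can be reproduced with a small Monte Carlo simulation of the multispecies coalescent. This is a minimal sketch, not the exact calculation of Degnan and Salter (2005). It assumes that y is the length of the branch ancestral to A and B only and that x is the length of the branch ancestral to A, B, and C, which is how the description of Figure 1 reads; all function names are hypothetical.

```python
import random
from collections import Counter


def coalesce_in_branch(lineages, length, rng):
    """Run Kingman's coalescent on `lineages` for `length` coalescent units.

    length=None means the root population: run until one lineage remains.
    """
    lineages = list(lineages)
    elapsed = 0.0
    while len(lineages) > 1:
        k = len(lineages)
        elapsed += rng.expovariate(k * (k - 1) / 2.0)  # waiting time with k lineages
        if length is not None and elapsed > length:
            break  # the branch ends before the next coalescence
        i, j = rng.sample(range(k), 2)  # every pair is equally likely to merge
        merged = frozenset([lineages[i], lineages[j]])
        lineages = [x for n, x in enumerate(lineages) if n not in (i, j)] + [merged]
    return lineages


def to_newick(tree):
    """Canonical topology string, e.g. (((A,B),C),D)."""
    if isinstance(tree, str):
        return tree
    return "(" + ",".join(sorted(to_newick(child) for child in tree)) + ")"


def simulate_topology(x, y, rng):
    """One gene tree topology for the species tree (((A,B):y,C):x,D)."""
    ab = coalesce_in_branch(["A", "B"], y, rng)
    # Restarting the clock at each branch boundary is valid because
    # exponential waiting times are memoryless.
    abc = coalesce_in_branch(ab + ["C"], x, rng)
    root = coalesce_in_branch(abc + ["D"], None, rng)
    return to_newick(root[0])


def gene_tree_distribution(x, y, n_genes=10000, seed=0):
    """Estimated topology frequencies from n_genes simulated gene trees."""
    rng = random.Random(seed)
    counts = Counter(simulate_topology(x, y, rng) for _ in range(n_genes))
    return {topology: c / n_genes for topology, c in counts.items()}


if __name__ == "__main__":
    for topology, freq in sorted(gene_tree_distribution(10.0, 0.55).items()):
        print(topology, round(freq, 3))
```

With a long x and a moderate y, nearly all of the mass falls on (((A,B),C),D) and on the two trees (((A,C),B),D) and (((B,C),A),D), with the latter two at roughly equal frequency.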

METHODS

Sequence simulation and gene tree estimation

Five 4-taxon, asymmetric bifurcating gene trees with topology (((AB)C)D) and different

branch length parameters t1, t2, and t3 as depicted in Figure 2 (a) were chosen as “true gene

trees” in our simulation study. All five of the true gene trees used satisfied the molecular

clock assumption. Multiple sequence alignments of 500 base pairs (bp) were generated with

the program Seq-Gen (Rambaut and Grassly 1997) under the HKY85 substitution model

(Hasegawa, Kishino and Yano 1985) for each of the following five input true gene trees:

(((A:0.2,B:0.2):0.15,C:0.35):0.4,D:0.75), (((A:0.25,B:0.25):0.015,C:0.265):0.15,D:0.415),

(((A:0.05,B:0.05):0.0005,C:0.0505):0.15,D:0.2005),

(((A:0.15,B:0.15):0.025,C:0.175):0.015,D:0.18), and

(((A:0.2,B:0.2):0.0075,C:0.2075):0.0075,D:0.215).

This procedure was repeated 10,000 times for each tree, to generate 10,000 alignments of

length 500 for each model tree.
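The relationship between the branch length parameters t1, t2, and t3 and a clock-like Newick string can be sketched as follows. The helper name `clock_tree_newick` is hypothetical; it places the internal nodes at heights t1, t1 + t2, and t1 + t2 + t3, as drawn in Figure 2 (a), so that every tip is the same distance from the root.

```python
def clock_tree_newick(t1, t2, t3):
    """Newick string for (((A,B),C),D) under a molecular clock.

    Node heights are t1 (A,B), t1 + t2 ((A,B),C), and t1 + t2 + t3 (root).
    """
    h1 = t1
    h2 = t1 + t2
    h3 = t1 + t2 + t3

    def fmt(value):
        # Six significant digits hides floating-point noise such as 0.35000000000000003.
        return f"{value:.6g}"

    return (
        f"(((A:{fmt(h1)},B:{fmt(h1)}):{fmt(t2)},C:{fmt(h2)})"
        f":{fmt(t3)},D:{fmt(h3)});"
    )


if __name__ == "__main__":
    print(clock_tree_newick(0.2, 0.15, 0.4))
```

A string produced this way could then be passed to a sequence simulator as the generating tree; the simulator settings themselves are not part of this sketch.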

From these alignments, gene trees were then estimated under the molecular clock

assumption and rooted using maximum likelihood (ML) as implemented in the program

PAUP* 4.0b10 (Swofford 2002). PAUP* was also used to obtain ML gene tree estimates

without the molecular clock assumption, and the resulting trees were rooted by the outgroup

rooting method with taxon D specified as the outgroup. In both the molecular clock rooting

and outgroup rooting methods, the frequencies of the resulting rooted gene tree topology

estimates were recorded. These were used both to examine the induced gene tree distribution

and to assess the performance of several current coalescent-based methods of species tree

estimation.

Species tree estimation

The simulated data sets were used to estimate species trees under the coalescent model.

Each data set was randomly divided into 100 groups, with 100 genes in each group. Each

group could be considered as a multi-locus data set of 100 gene trees for inference of species

trees. Three coalescent-based software packages were used to estimate species trees: STEM

(Kubatko et al. 2009), MP-EST (Liu et al. 2010), and the minimize deep coalescences method

(MDC), as implemented in PhyloNet (MDC-PhyloNet; Than and Nakhleh 2009). For each of

these three methods, the input data were the trees estimated for each simulated gene

alignment under the ML method with the assumption of a molecular clock. For MP-EST and

MDC-PhyloNet, the gene trees estimated using outgroup rooting were also used for analysis

(note that this is not possible for STEM, because branch lengths that satisfy the molecular

clock are required). For each method, we recorded the proportion of the 100 data sets (each

consisting of 100 genes) that correctly recovered the tree that generated the data.
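The grouping and scoring steps described above can be sketched in a few lines. This is only an illustration of the bookkeeping, under the assumption that the estimated trees are available as topology strings; the function names and the seed are hypothetical, and the species tree estimation itself (STEM, MP-EST, MDC-PhyloNet) is not reproduced.

```python
import random


def split_into_groups(items, n_groups, seed=0):
    """Randomly partition `items` into n_groups groups of equal size."""
    items = list(items)
    if len(items) % n_groups != 0:
        raise ValueError("number of items must be divisible by n_groups")
    random.Random(seed).shuffle(items)
    size = len(items) // n_groups
    return [items[i * size:(i + 1) * size] for i in range(n_groups)]


def proportion_correct(estimated_topologies, true_topology):
    """Fraction of estimated species trees whose topology equals the true one."""
    matches = sum(t == true_topology for t in estimated_topologies)
    return matches / len(estimated_topologies)


if __name__ == "__main__":
    groups = split_into_groups(range(10000), 100)
    print(len(groups), len(groups[0]))
```

Each of the 100 groups would be passed to a species tree method, and `proportion_correct` applied to the 100 resulting topologies gives the kind of percentage reported in Table 1.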

RESULTS

Estimation of the gene tree distribution under different rooting methods

Figure 2 (b) to (f) show the proportion of times each gene tree topology (red bars) was estimated

in the five data sets that were simulated with different choices of the true gene tree branch

lengths t1, t2, and t3. We see that estimating ML gene trees under the molecular clock

assumption can lead to an estimated gene tree distribution that mimics that which would be

expected under the coalescent model. For example, Figure 2 (b) also shows the expected

distribution under the coalescent model for the species tree in Figure 1 (a) (tree in thick lines)

with x=10.0 and y=10.0 (green bars). Note the agreement between this expected distribution

and the observed distribution.

The proportion of times that each gene tree topology is estimated when using outgroup rooting is also

shown in Figure 2 (b) to (f) (blue bars), along with the predicted gene tree distribution under

the coalescent model for the species tree in Figure 1 (a) with a different choice of species tree

branch lengths x and y (purple bars). As was the case with molecular clock rooting, the close

agreement between these distributions is noted. In all sub-figures of Figure 2, the species tree

branch lengths used to generate the distributions shown with the green and purple bars were

selected to try to mimic the distribution that was observed for the molecular clock and

outgroup rooting methods, respectively.
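When x is long, the lineage from D almost always coalesces last, and the coalescent distribution reduces to the classical three-taxon result (Pamilo and Nei 1988): the matching topology has probability 1 - (2/3)exp(-y) and each of the two alternatives has probability (1/3)exp(-y). The sketch below uses this approximation to show how a value of y could be chosen to match an observed frequency of the matching topology. It is an assumption about one way to select such a value, not a description of the procedure used to produce Figure 2, and the function names are hypothetical.

```python
from math import exp, log


def outgroup_rooted_probs(y):
    """Approximate coalescent probabilities of the three D-outgroup topologies.

    Assumes x is long enough that A, B, and C coalesce before reaching D.
    """
    p_discordant = exp(-y) / 3.0
    return {
        "(((A,B),C),D)": 1.0 - 2.0 * p_discordant,
        "(((A,C),B),D)": p_discordant,
        "(((B,C),A),D)": p_discordant,
    }


def branch_length_from_match_freq(p_match):
    """Invert p_match = 1 - (2/3) exp(-y) to recover y."""
    if not (1.0 / 3.0 < p_match < 1.0):
        raise ValueError("p_match must lie strictly between 1/3 and 1")
    return -log(1.5 * (1.0 - p_match))


if __name__ == "__main__":
    # y = 0.37 is one of the values that appears in the Figure 2 legends.
    print(outgroup_rooted_probs(0.37))
```

Because outgroup rooting can only return these three topologies, a single parameter y is enough to match the frequency of the most common tree; the two alternatives are then forced to be equal under the coalescent.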

Species tree estimation

The five data sets were used to compare the ability of the three methods examined

(STEM, MP-EST, and MDC-PhyloNet) to correctly estimate the species tree. For all of the

methods, the 100 species trees estimated by each method were compared with the tree

(((AB)C)D) that was used to simulate the sequences. The frequency of different tree

topologies estimated is shown in Table 1 when both molecular clock rooting and outgroup

rooting were used. Most of the gene trees that are estimated from the alignments generated

from the true gene tree (((A:0.2,B:0.2):0.15,C:0.35):0.4,D:0.75) have the topology

(((A,B),C),D) (Figure 2 (b)). It is thus not surprising that all three methods estimated the

correct tree topology (((A,B),C),D) under both the molecular clock rooting and the outgroup

rooting in this case. For the data set that is simulated from the true gene tree

(((A:0.25,B:0.25):0.015,C:0.265):0.15,D:0.415), all methods other than STEM recovered the

correct topology in 100% of the data sets, but only 89% of species trees estimated by STEM had the same topology as

the true tree. With the input true gene tree (((A:0.05,B:0.05):0.0005,C:0.0505):0.15,D:0.2005),

fewer gene trees with the topology (((A,B),C),D) were estimated (Figure 2 (d)). With

molecular clock rooting, 57% and 59% of species trees estimated by MP-EST and

MDC-PhyloNet, respectively, were consistent with the true tree, while 38% of species trees

estimated by STEM had the same topology as the true tree. With outgroup rooting, 67% of

species trees estimated by MP-EST agreed with the true tree and MDC-PhyloNet estimated

64% of the species trees that were consistent with the true tree.

In general, when fewer of the gene trees estimated from the alignments simulated from

the true gene tree have the topology (((A,B),C),D) (Figure 2 (b) to (d)), the methods estimated

fewer correct trees (Table 1). Note that in the same data set, STEM seems to have estimated

fewer correct trees than MDC-PhyloNet and MP-EST. However, with the true gene tree

(((A:0.15,B:0.15):0.025,C:0.175):0.015,D:0.18), which had a relatively shorter branch length

between the outgroup D and other species, 93% of the trees estimated by STEM had the same

topology as the true gene tree under the molecular clock assumption, while MDC-PhyloNet

and MP-EST only estimated 0% and 8% correct tree topologies with molecular clock rooting,

respectively. If the branch length between C and (A,B) also decreases, as in true gene tree

(((A:0.2,B:0.2):0.0075,C:0.2075):0.0075,D:0.215), STEM and MDC-PhyloNet only estimate

the correct topology 40% and 59% of the time, respectively, with molecular clock rooting, while 94%

of the trees estimated by MP-EST under the molecular clock assumption resulted in the correct

estimated topology. Note that for both of the input true gene trees with a short branch length

between D and ((A,B),C), MDC-PhyloNet and MP-EST estimate the correct tree under the

outgroup rooting, with D used as the outgroup.

DISCUSSION AND CONCLUSION

In four-taxon problems, observation of a distribution on estimated gene trees that consists

of one tree with high frequency and two specific other trees with lower but equal frequency

has been used to indicate that the data “fit” a coalescent model. Here we show that there is at

least one other way to induce a similar distribution on estimated gene trees that only requires

a single underlying gene tree for all genes (rather than variation in the gene trees for each

gene as predicted by the coalescent). In particular, here we used five single gene trees with

branch lengths that agree with the molecular clock assumption to simulate multiple sequence

alignments. To estimate gene trees from these alignments, we used one of two methods for

rooting the trees: molecular clock rooting or outgroup rooting. Both of these methods are

commonly used when rooted gene trees are desired. In particular, estimation subject to a

molecular clock is commonly carried out when the gene trees will be used as input to a

species tree estimation package, such as STEM, that requires estimated coalescent times.

Both molecular clock rooting and outgroup rooting are options for pre-processing multi-locus

data when species tree estimation methods that require rooted tree topologies, such as

MP-EST and MDC-PhyloNet, are used.

We demonstrate that the observation of a gene tree distribution with one topology with

the majority of the probability and two others with lower but equal probability cannot

necessarily be taken as a signature of the coalescent process. Our study generated

non-coalescent gene tree distributions that successfully mimicked several typical gene tree

distributions of the coalescent process: only one tree has substantial probability (Fig. 2 (b));

three asymmetric trees, two of which have lower but equal probability (Fig. 2 (c) and (d));

two asymmetric trees and one symmetric tree, with one of the asymmetric trees and the

symmetric tree having lower but equal probability (Fig. 2 (e)). In particular, similar

distributions result when various methods of rooting single-gene phylogenies are used, even

when all of the data are generated from a single underlying phylogeny. Studies of the

probability distributions on gene trees can be found in the earlier literature as well. For

example, Cavalli-Sforza and Edwards (1967) noted that under the Yule model (Yule 1924),

symmetric tree topologies have twice the probability of asymmetric tree topologies for four

taxa. Tajima (1983) showed that the unlabeled asymmetric tree occurs with twice the

probability of the unlabeled symmetric tree under the within-population coalescent model.

These relationships make it clear that, even if gene tree topologies could be estimated without

error in multi-locus studies, it is not an easy task to assess fit to a model of ILS using

topology information alone.
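The factor-of-two relationships quoted above can be checked with a short counting argument. The derivation below is a sketch under the assumption that, at every coalescent (or Yule branching) event, each pair of lineages is equally likely to be the one that joins; the symbols $P_{\text{asym}}$ and $P_{\text{sym}}$ are introduced here for illustration only.

```latex
% Four lineages: the first coalescence merges one of \binom{4}{2} = 6 pairs,
% leaving three lineages. The second merges one of \binom{3}{2} = 3 pairs:
% two of these involve the already-merged lineage (asymmetric shape),
% and one joins the two untouched lineages (symmetric shape).
\[
P_{\text{asym}} = \frac{2}{3}, \qquad P_{\text{sym}} = \frac{1}{3}
\quad\Longrightarrow\quad
\frac{P_{\text{asym}}}{P_{\text{sym}}} = 2 \quad \text{(unlabeled shapes)}.
\]
% Labeled topologies: there are 12 asymmetric and 3 symmetric rooted
% topologies, equiprobable within each shape.
\[
P(\text{one labeled asymmetric topology}) = \frac{2/3}{12} = \frac{1}{18},
\qquad
P(\text{one labeled symmetric topology}) = \frac{1/3}{3} = \frac{1}{9}
= 2 \cdot \frac{1}{18}.
\]
```

The same counts therefore give opposite-looking ratios depending on whether shapes or labeled topologies are compared, which is one reason topology frequencies alone are hard to interpret.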

It is worth noting that with the same input true gene tree, molecular clock and outgroup

rooting may generate different gene tree distributions. As shown in Fig. 2 (b), (d), and (e),

the gene tree distributions generated by those two different rooting methods can be

mimicked by species trees with very different x and y values, respectively. This difference

arises from how the two rooting methods work. In outgroup rooting, species D was

chosen as the outgroup. Thus, only three gene tree topologies (((A,B),C),D), (((A,C),B),D),

and (((B,C),A),D) will appear in the gene tree distribution. Notice that these trees have

different unrooted topologies. If the molecular clock method only produces those three

topologies (Fig. 2(c), (d)), the distributions generated by two different rooting methods will

be mimicked by very similar species trees (with similar x and y values) (Fig. 1(c)). Otherwise,

when the molecular clock method only produces gene trees with topology (((A,B),C),D)

(Fig. 2 (b)), or it produces gene trees with topologies other than the three generated

by outgroup rooting (Fig. 2 (d), (e)), very different species trees will be needed to mimic the

gene tree distributions (Fig. 1 (b), (d), (e)).
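The counting behind this paragraph can be made explicit by enumerating the rooted topologies on four taxa. The sketch below is illustrative only, with hypothetical helper names: it lists all 15 rooted binary topologies, keeps those in which D is sister to the other three taxa (the only trees outgroup rooting on D can return), and maps each rooted tree to its unrooted quartet split.

```python
from itertools import combinations


def rooted_trees(taxa):
    """All rooted binary topologies on `taxa`, as nested 2-tuples of leaf names."""
    taxa = tuple(sorted(taxa))
    if len(taxa) == 1:
        return [taxa[0]]
    first, rest = taxa[0], taxa[1:]
    trees = []
    # Keep `first` on the left side of the root so each split is counted once;
    # r is the number of other taxa that share its side.
    for r in range(len(rest)):
        for others in combinations(rest, r):
            left = (first,) + others
            right = tuple(t for t in rest if t not in others)
            for left_tree in rooted_trees(left):
                for right_tree in rooted_trees(right):
                    trees.append((left_tree, right_tree))
    return trees


def leaves(tree):
    """Set of leaf names below a node."""
    if isinstance(tree, str):
        return frozenset([tree])
    return leaves(tree[0]) | leaves(tree[1])


def cherries(tree):
    """Pairs of leaves that are each other's sisters."""
    if isinstance(tree, str):
        return []
    left, right = tree
    if isinstance(left, str) and isinstance(right, str):
        return [frozenset([left, right])]
    return cherries(left) + cherries(right)


def unrooted_split(tree):
    """Quartet split (e.g. AB|CD) obtained by ignoring the root of a 4-taxon tree."""
    pair = cherries(tree)[0]
    return frozenset([pair, leaves(tree) - pair])


if __name__ == "__main__":
    all_trees = rooted_trees("ABCD")
    d_outgroup = [t for t in all_trees if "D" in t]  # D is a child of the root
    print(len(all_trees), len(d_outgroup))
    print({unrooted_split(t) for t in d_outgroup})
```

Each of the three unrooted quartets corresponds to five rooted topologies, and outgroup rooting on D selects exactly one of the five for each quartet, whereas molecular clock rooting can in principle return any of the 15.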

Under different gene tree distributions, not all of the species tree inference packages that

were applied to the estimated gene trees showed good ability to correctly estimate the

underlying tree that generated the data when the data were not generated from the coalescent

model. The two methods that do not use branch lengths in the estimated gene trees, MP-EST

and MDC-PhyloNet, can estimate the underlying tree when the true gene tree has a relatively

shorter branch length between the outgroup D and other species as well as a relatively shorter

branch length between species C and (A,B). It is not surprising that when fewer of the

estimated gene trees simulated from the true gene tree have the topology (((A,B),C),D)

(Figure 2 (b) to (d)), the methods estimated fewer correct trees (Table 1), while STEM seems

to have estimated fewer correct trees than MDC-PhyloNet and MP-EST. However, STEM

shows much better performance in estimating the underlying tree when the branch length

between the outgroup D and the other species decreases, which will induce a gene tree

distribution of two asymmetric trees and one symmetric tree. Interestingly, when the branch

length between C and (A,B) also decreases, MP-EST estimated more correct species trees

than the other two methods. From these results, we can conclude that not all of the methods

of species tree inference examined here can correctly estimate the species tree with high

frequency when the gene tree topology distribution mimics what would be expected from the

coalescent process, but each of the three methods estimates the true species tree with high

frequency under its “preferred” gene tree distribution.

Remarkably, even if the input true gene trees have a short branch length between D and

((A,B),C), MDC-PhyloNet and MP-EST can still estimate the correct species tree under the

outgroup rooting, with D used as the outgroup. This suggests that correctly choosing the

outgroup will greatly increase the accuracy in estimating species trees. This conclusion is also

consistent with the study by Rosenfeld et al. (2012), which concluded that outgroup choice is

extremely important in phylogenetic analysis. Another interesting question is whether

methods of species tree estimation that also estimate population genetic parameters such as

species tree branch lengths and effective population sizes (for example, BEST (Liu and Pearl

2007) and *BEAST (Heled and Drummond 2010)) are affected in either estimation of

topology or estimation of parameters for data generated under these types of models.

ACKNOWLEDGEMENTS

The ideas in this study were motivated by a seminar talk given by Rob DeSalle in the

Department of Evolution, Ecology, and Organismal Biology at The Ohio State University in

February 2012.

REFERENCES

Anderson, C.N., Liu, L., Pearl, D., Edwards, S.V., 2012. Tangled trees: the challenge of

inferring species trees from coalescent and noncoalescent genes. Methods Mol. Biol. 856,

3-28.

Ané, C., 2010. Reconstructing concordance trees and testing the coalescent model from

genome-wide data sets. Chapter 3, p.35-52 in Estimating species trees: Practical and

Theoretical Aspects, L. Knowles and L. Kubatko, eds. Wiley-Blackwell.

Bayzid, M.S., Warnow, T., 2012. Estimating optimal species trees from incomplete gene trees

under deep coalescence. J. Comput. Biol. 19(6), 591-605.

Cavalli-Sforza, L.L., Edwards, A.W.F., 1967. Phylogenetic analysis: models and estimation

procedures. Amer. J. Hum. Genet. 19, 233-257.

Degnan, J.H., Salter, L.A., 2005. Gene tree distributions under the coalescent process.

Evolution 59(1), 24-37.

Ebersberger, I., Galgoczy, P., Taudien, S., Taenzer, S., Platzer, M., Von Haeseler, A., 2007.

Mapping human genetic ancestry. Mol. Biol. Evol. 24(10), 2266-2276.

Hasegawa, M., Kishino, H., Yano, T., 1985. Dating of human-ape splitting by a molecular

clock of mitochondrial DNA. J. Mol. Evol. 22, 160–174.

Hein, J., 1993. A heuristic method to reconstruct the history of sequences subject to

recombination. J. Mol. Evol. 36, 396–405.

Heled, J., Drummond, A.J., 2010. Bayesian inference of species trees from multilocus data.

Mol. Biol. Evol. 27, 570–580.

Kingman, J.F.C., 1982. The coalescent. Stoch. Proc. Appl. 13, 235–248.

Kingman, J.F.C., 2000. Origins of the coalescent 1974–1982. Genetics 156, 1461–1463.

Kubatko, L.S., 2009. Identifying hybridization events in the presence of coalescence via

model selection, Syst. Biol. 58(5), 478-488.

Kubatko, L.S., Carstens, B.C., Knowles, L.L., 2009. STEM: Species Tree Estimation using

Maximum likelihood for gene trees under coalescence, Bioinformatics 25 (7), 971-973.

Kubatko, L., Gibbs, H.L., Bloomquist, E.W., 2011. Inferring species-level phylogenies and

taxonomic distinctiveness using multi-locus data in Sistrurus rattlesnakes, Syst. Biol.

60(4), 393-409.

Liu, L., Pearl, D.K., 2007. Species trees from gene trees: reconstructing Bayesian posterior

distributions of a species phylogeny using estimated gene tree distributions. Syst. Biol.

56, 504-514.

Liu, L., Yu, L.L., Edwards, S.V., 2010. A maximum pseudo-likelihood approach for

estimating species trees under the coalescent model. BMC Evol. Biol. 10, 302.

Maddison, W.P., 1997. Gene trees in species trees. Syst. Biol. 46, 523–536.

Pamilo, P., Nei, M., 1988. Relationships between gene trees and species trees. Mol. Biol.

Evol. 5(5), 568–583.

Rambaut, A., Grassly, N.C., 1997. Seq-Gen: An application for the Monte Carlo simulation of

DNA sequence evolution along phylogenetic trees. Comput. Appl. Biosci. 13, 235–238.

Rosenberg, N.A., 2007. Counting coalescent histories. J. Comput. Biol. 14, 360-377.

Rosenfeld, J.A., Payne, A., DeSalle, R., 2012. Random roots and lineage sorting. Mol

Phylogenet Evol 64(1), 12-20.

Sang, T., Zhong, Y., 2000. Testing hybridization hypotheses based on incongruent gene trees.

Syst. Biol. 49, 422–434.

Swofford, D.L., 2002. PAUP*: phylogenetic analysis using parsimony (*and other methods),

version 4. Sinauer Associates, Sunderland, Massachusetts, USA.

Tajima, F., 1983. Evolutionary relationship of DNA sequences in finite populations. Genetics

105, 437-460.

Takahata, N., Nei, M., 1985. Gene genealogy and variance of interpopulational nucleotide

differences. Genetics 110, 325–344.

Than, C., Nakhleh, L., 2009. Species tree inference by minimizing deep coalescences. PLoS

Comput Biol 5(9).

Than, C., Ruths, D., Nakhleh, L., 2008. PhyloNet: a software package for analyzing and

reconstructing reticulate evolutionary relationships. BMC Bioinformatics, 9, 322.

Wakeley, J., 2008. Coalescent Theory: An Introduction. Roberts & Company Publishers,

Greenwood Village, Colorado.

Yule, G.U., 1924. A mathematical theory of evolution, based on the conclusions of Dr. J. C.

Willis, F.R.S. Philosophical Transactions of the Royal Society of London, Series B 213,

21-87.

Table 1. Frequency of each species tree topology estimated by STEM, MDC-PhyloNet (PhyloNet), and MP-EST (mpest) over 100 data sets, under molecular clock rooting and outgroup rooting. A dash indicates that STEM was not run with outgroup-rooted gene trees.

True gene tree branch lengths:
  Tree 1: t1=0.2,  t2=0.15,   t3=0.4
  Tree 2: t1=0.25, t2=0.015,  t3=0.15
  Tree 3: t1=0.05, t2=0.0005, t3=0.15
  Tree 4: t1=0.15, t2=0.025,  t3=0.015
  Tree 5: t1=0.2,  t2=0.0075, t3=0.0075

Rooting method: Molecular clock

Topology   | Tree 1              | Tree 2              | Tree 3              | Tree 4              | Tree 5
           | STEM PhyloNet mpest | STEM PhyloNet mpest | STEM PhyloNet mpest | STEM PhyloNet mpest | STEM PhyloNet mpest
(((AB)C)D) | 100%     100%  100% |  89%     100%  100% |  38%      59%   57% |  93%       0%    8% |  40%      59%   94%
(((AB)D)C) |   0%       0%    0% |   0%       0%    0% |   0%       0%    0% |   2%       0%    7% |   6%       0%    0%
(((AC)B)D) |   0%       0%    0% |   5%       0%    0% |  32%      20%   22% |   0%       0%    6% |  14%       3%    3%
(((AC)D)B) |   0%       0%    0% |   0%       0%    0% |   0%       0%    0% |   1%       0%    7% |   2%       0%    0%
(((AD)B)C) |   0%       0%    0% |   0%       0%    0% |   0%       0%    0% |   0%       0%    2% |   2%       0%    0%
(((AD)C)B) |   0%       0%    0% |   0%       0%    0% |   0%       0%    0% |   0%       0%    7% |   0%       0%    0%
(((BC)A)D) |   0%       0%    0% |   6%       0%    0% |  30%      21%   21% |   0%       0%    9% |  16%       3%    3%
(((BC)D)A) |   0%       0%    0% |   0%       0%    0% |   0%       0%    0% |   3%       0%   10% |   2%       0%    0%
(((BD)A)C) |   0%       0%    0% |   0%       0%    0% |   0%       0%    0% |   0%       0%    9% |   3%       0%    0%
(((BD)C)A) |   0%       0%    0% |   0%       0%    0% |   0%       0%    0% |   0%       0%    4% |   0%       0%    0%
(((CD)A)B) |   0%       0%    0% |   0%       0%    0% |   0%       0%    0% |   0%       0%    7% |   0%       0%    0%
(((CD)B)A) |   0%       0%    0% |   0%       0%    0% |   0%       0%    0% |   0%       0%   10% |   0%       0%    0%
((AB)(CD)) |   0%       0%    0% |   0%       0%    0% |   0%       0%    0% |   1%      32%    6% |  11%      35%    0%
((AC)(BD)) |   0%       0%    0% |   0%       0%    0% |   0%       0%    0% |   0%      33%    4% |   1%       0%    0%
((AD)(BC)) |   0%       0%    0% |   0%       0%    0% |   0%       0%    0% |   0%      35%    4% |   3%       0%    0%

Rooting method: Outgroup

Topology   | Tree 1              | Tree 2              | Tree 3              | Tree 4              | Tree 5
           | STEM PhyloNet mpest | STEM PhyloNet mpest | STEM PhyloNet mpest | STEM PhyloNet mpest | STEM PhyloNet mpest
(((AB)C)D) |    -     100%  100% |    -     100%  100% |    -      64%   67% |    -     100%  100% |    -     100%  100%
(((AB)D)C) |    -       0%    0% |    -       0%    0% |    -       0%    0% |    -       0%    0% |    -       0%    0%
(((AC)B)D) |    -       0%    0% |    -       0%    0% |    -      20%   21% |    -       0%    0% |    -       0%    0%
(((AC)D)B) |    -       0%    0% |    -       0%    0% |    -       0%    0% |    -       0%    0% |    -       0%    0%
(((AD)B)C) |    -       0%    0% |    -       0%    0% |    -       0%    0% |    -       0%    0% |    -       0%    0%
(((AD)C)B) |    -       0%    0% |    -       0%    0% |    -       0%    0% |    -       0%    0% |    -       0%    0%
(((BC)A)D) |    -       0%    0% |    -       0%    0% |    -      16%   12% |    -       0%    0% |    -       0%    0%
(((BC)D)A) |    -       0%    0% |    -       0%    0% |    -       0%    0% |    -       0%    0% |    -       0%    0%
(((BD)A)C) |    -       0%    0% |    -       0%    0% |    -       0%    0% |    -       0%    0% |    -       0%    0%
(((BD)C)A) |    -       0%    0% |    -       0%    0% |    -       0%    0% |    -       0%    0% |    -       0%    0%
(((CD)A)B) |    -       0%    0% |    -       0%    0% |    -       0%    0% |    -       0%    0% |    -       0%    0%
(((CD)B)A) |    -       0%    0% |    -       0%    0% |    -       0%    0% |    -       0%    0% |    -       0%    0%
((AB)(CD)) |    -       0%    0% |    -       0%    0% |    -       0%    0% |    -       0%    0% |    -       0%    0%
((AC)(BD)) |    -       0%    0% |    -       0%    0% |    -       0%    0% |    -       0%    0% |    -       0%    0%
((AD)(BC)) |    -       0%    0% |    -       0%    0% |    -       0%    0% |    -       0%    0% |    -       0%    0%

[Figure 1. (a) Four-taxon species tree on taxa A, B, C, and D with internal branch lengths x and y. (b) to (e) Bar charts of the probabilities (vertical axis, 0 to 1) of the 15 rooted gene tree topologies, (((AB)C)D) through ((AD)(BC)), for species tree branch lengths x=10, y=10; x=10, y=0.55; x=0.7, y=10; and x=0.4, y=0.2. Image not reproduced.]

[Figure 2. (a) Rooted gene tree on taxa A, B, C, and D with branch length parameters t1, t2, and t3 and node heights t1, t1+t2, and t1+t2+t3. (b) to (f) Bar charts of the frequencies (vertical axis, 0% to 100%) of the 15 rooted gene tree topologies, (((AB)C)D) through ((AD)(BC)), for the five true gene trees: t1=0.2, t2=0.15, t3=0.4; t1=0.25, t2=0.015, t3=0.15; t1=0.05, t2=0.0005, t3=0.15; t1=0.15, t2=0.025, t3=0.015; and t1=0.2, t2=0.0075, t3=0.0075. Each panel shows four series: molecular clock rooting, a coalescent distribution with the listed x and y, outgroup rooting, and a coalescent distribution with the listed x and y. Legend pairs as extracted (molecular clock; outgroup rooting): x=0.7, y=10 and x=10, y=2.1; x=0.4, y=0.2 and x=10, y=0.37; x=5, y=0.06 and x=5, y=0.06; x=10, y=0.55 and x=10, y=0.35; x=10, y=10 and x=5.0, y=1.6. Image not reproduced.]


Graphical Abstract (for review)

[Figure: repeat of the bar charts comparing molecular clock rooting and outgroup rooting percentages (0.00% to 100.00%) across the 15 rooted four-taxon gene tree topologies.]

Highlights:

In multilocus phylogenetics, gene trees are often estimated separately at each locus.

For a single true gene tree, multilocus data were simulated.

Rooted gene trees were estimated for each locus, and their frequencies recorded.

The estimated gene tree distribution may be close to the coalescent in this case.

Species tree estimation based on the coalescent shows variable performance here.

