Accepted Manuscript
Gene tree rooting methods give distributions that mimic the coalescent process
Yuan Tian, Laura S. Kubatko
PII: S1055-7903(13)00346-1
DOI: http://dx.doi.org/10.1016/j.ympev.2013.09.004
Reference: YMPEV 4705
To appear in: Molecular Phylogenetics and Evolution
Received Date: 13 March 2013
Revised Date: 28 August 2013
Accepted Date: 6 September 2013
Please cite this article as: Tian, Y., Kubatko, L.S., Gene tree rooting methods give distributions that mimic the
coalescent process, Molecular Phylogenetics and Evolution (2013), doi: http://dx.doi.org/10.1016/j.ympev.
2013.09.004
This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers
we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and
review of the resulting proof before it is published in its final form. Please note that during the production process
errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
GENE TREE ROOTING METHODS GIVE DISTRIBUTIONS THAT MIMIC
THE COALESCENT PROCESS
Yuan Tian and Laura S. Kubatko*,+
*Departments of Statistics and Evolution, Ecology, and Organismal Biology, The Ohio State
University, Columbus, OH 43210
[email protected], [email protected]
Author for correspondence:
Laura S. Kubatko
Departments of Statistics and
Evolution, Ecology, and Organismal Biology
The Ohio State University
404 Cockins Hall, 1958 Neil Avenue
Columbus, OH 43210
E-mail: [email protected]
Phone: 614-247-8846
FAX: 614-292-2866
INTRODUCTION
In multi-locus phylogenetic analysis, a phylogenetic tree based on a single gene (i.e., a
gene tree) may not agree with the species tree, the tree that represents the actual evolutionary
pathway (Pamilo and Nei 1988; Maddison 1997). The many possible causes of this discord
are well-known, and include processes such as incomplete lineage sorting (ILS), gene
duplication, hybridization, non-neutral evolution, and horizontal gene transfer (Hein 1993;
Maddison 1997; Sang and Zhong 2000; Bayzid and Warnow 2012). An appropriate
probabilistic model that links gene trees and species trees should involve the phylogenetic
relationships of species as well as the genealogical history of each gene (Anderson et al.
2012). Such models are necessary to carry out accurate inference of the species phylogeny
from a multi-locus data set.
The coalescent process is a retrospective model of population genetics that is commonly
used to model ILS. The coalescent model is based on tracing the evolutionary history of
sampled genes by considering the time from the present back to their most recent common
ancestor (Kingman 2000), and can be derived as the limiting distribution (as the population
size becomes large) that results from the Wright-Fisher and other commonly-used population
genetics models (Wakeley 2008; Kingman 1982; Takahata and Nei 1985). Under the
coalescent model, the probability distribution of gene trees given a fixed species tree
topology and branch lengths can be computed (Degnan and Salter 2005). The coalescent
model is also used as the basis for different methods to estimate species trees using either
multi-locus DNA sequence data or a set of observed gene trees (Kubatko et al. 2009; Than et
al. 2008; Liu and Pearl 2007; Liu et al. 2010; Heled and Drummond 2010). These methods
have been widely applied to the multi-locus data sets that are commonly produced by
next-generation sequencing techniques.
The coalescent process has been so widely applied in part because it is believed that ILS
is a predominant cause of the incongruence observed between gene trees and species trees
(Liu et al. 2010). Indeed, the predictions made by the coalescent model in terms of the
distribution of gene trees are consistent with several observed data sets (Ebersberger et al.
2007; Ané 2010; Kubatko et al. 2011). However, a recent study of seven large multi-partition
genome-level data sets has suggested that random rooting might be another potential
explanation for the apparent fit to the coalescent model (Rosenfeld et al. 2012). Here, we
consider whether the signature evidenced by this gene tree topology distribution is unique to
the coalescent process. In particular, we show that in the case of four taxa, this distribution
can be mimicked by generating gene trees from a single species tree and then rooting these
gene trees using either the assumption of a molecular clock or outgroup rooting.
Below we briefly review the coalescent process, focusing on the gene tree distribution in
the four-taxon case. We then use simulation to show that both the coalescent and
non-coalescent models described above lead to nearly identical distributions on the set of
gene trees. Further, we study the behavior of several coalescent-based methods of species tree
estimation when the data come from a single tree. We conclude that a model without ILS can
produce a distribution that mimics that of the coalescent process. Furthermore, under
different gene tree distributions, not all of the methods of species tree inference examined
here can correctly estimate the true species tree with high frequency.
Gene tree distribution under the coalescent process
Consider a 4-taxon, asymmetric bifurcating species tree as shown in Figure 1 (a) (thick
lines). A gene tree that has the same coalescent history as the species tree is nested within the
species tree (Fig. 1(a), thin lines). Note that this species tree contains two internal branch
lengths that we denote by x and y that are given in coalescent units (number of 2Ne
generations). In the gene tree, the coalescent times are denoted by t1, t2, and t3, from the most
recent to the oldest, respectively. The probability of observing gene tree g given a particular
species tree S can be calculated using the formula (Degnan and Salter 2005)
\[
P(G = g \mid S) = \sum_{\text{histories}} P(G = g, \text{history} \mid S) = \sum_{\text{histories}} \prod_{b} w_b \, P_{u(b),v(b)}(t_b),
\]
where the product on the right-hand side is taken over all branches b, w_b is the probability
of getting a sequence of coalescent events that is consistent with g, and P_{u(b),v(b)}(t_b) is the
probability that u(b) lineages coalesce into v(b) lineages along branch b, which has length t_b
(see, e.g., Rosenberg 2007). This formula can be used to obtain expressions for the
probabilities of all 15 gene tree topologies that are possible for the species tree in Figure 1 (a)
when one lineage per species is sampled. Figure 1 (b) to (e) show the entire probability
distribution on the set of 15 gene trees for several choices of the species tree branch lengths x
and y. Note that when x and y are long, only one tree has substantial probability (Figure 1 (b)).
As y becomes shorter, there are three asymmetric trees that have substantial probability
(Figure 1(c)). If y is long but x becomes shorter, the distribution of mass shifts to two
asymmetric trees as well as a symmetric tree (Figure 1 (d)). When both x and y are very short,
many of the 15 trees have nearly equal probability (Figure 1 (e)).
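The qualitative behavior described above can be reproduced with a small Monte Carlo simulation. The sketch below does not implement the analytical formula of Degnan and Salter (2005); it simulates the coalescent directly within the species tree (((A,B),C),D) with one lineage sampled per species. It assumes that y is the length of the branch ancestral to A and B and that x is the length of the branch ancestral to A, B, and C, which is our reading of the description of Figure 1. All function names are illustrative.

```python
import random
from collections import Counter


def coalesce(lineages, duration, rng):
    """Run the coalescent on `lineages` for `duration` coalescent units.

    Each pair of lineages coalesces at rate 1, so with k lineages the waiting
    time to the next event is exponential with rate k(k-1)/2.
    """
    lineages = list(lineages)
    elapsed = 0.0
    while len(lineages) > 1:
        k = len(lineages)
        elapsed += rng.expovariate(k * (k - 1) / 2.0)
        if elapsed > duration:
            break  # the branch ends before the next coalescence
        i, j = rng.sample(range(k), 2)
        merged = frozenset([lineages[i], lineages[j]])
        lineages = [lin for n, lin in enumerate(lineages) if n not in (i, j)]
        lineages.append(merged)
    return lineages


def sample_gene_tree(x, y, rng):
    """Sample one rooted gene tree topology from species tree (((A,B),C),D)."""
    ab = coalesce(["A", "B"], y, rng)                # branch of length y (assumed)
    abc = coalesce(ab + ["C"], x, rng)               # branch of length x (assumed)
    root = coalesce(abc + ["D"], float("inf"), rng)  # root population
    return root[0]


def to_newick(node):
    """Write a nested-frozenset topology as a canonical parenthesized string."""
    if isinstance(node, str):
        return node
    return "(" + ",".join(sorted(to_newick(child) for child in node)) + ")"


def gene_tree_distribution(x, y, n_genes, rng):
    """Estimate topology frequencies from `n_genes` simulated gene trees."""
    counts = Counter(to_newick(sample_gene_tree(x, y, rng)) for _ in range(n_genes))
    return {topology: c / n_genes for topology, c in counts.items()}


if __name__ == "__main__":
    rng = random.Random(2013)
    for x, y in [(10, 10), (10, 0.55), (0.7, 10), (0.4, 0.2)]:
        freqs = gene_tree_distribution(x, y, 2000, rng)
        top = sorted(freqs.items(), key=lambda kv: -kv[1])[:4]
        print(f"x={x}, y={y}:", ", ".join(f"{t} {p:.3f}" for t, p in top))
```

With x long and y short, the simulated frequencies should concentrate on the three asymmetric trees that place D as sister to the other three taxa, with the two non-matching trees at roughly equal frequency.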
The fact that under the coalescent model for four taxa one tree will have most of the
probability while two others occur with smaller, but equal, probability (and all other trees
have much lower probability) has been noted to be a signature of the coalescent model.
This type of distribution has been observed in empirical studies as well (Ebersberger et al.
2007; Ané 2010; Kubatko et al. 2011; Rosenfeld et al. 2012), and for this reason has been
used as evidence of the importance of incorporating the coalescent model into species tree
inference methodology. It is thus of interest to determine whether other features of the
process of molecular evolution coupled with tree estimation methodology can produce a
distribution on gene trees that shows this feature. This is the question that we examine here.
METHODS
Sequence simulation and gene tree estimation
Five 4-taxon, asymmetric bifurcating gene trees with topology (((AB)C)D) and different
branch length parameters t1, t2, and t3 as depicted in Figure 2 (a) were chosen as “true gene
trees” in our simulation study. All five of the true gene trees used satisfied the molecular
clock assumption. Multiple sequence alignments of 500 base pairs (bp) were generated with
the program Seq-Gen (Rambaut and Grassly 1997) under the HKY85 substitution model
(Hasegawa, Kishino and Yano 1985) for each of the following five input true gene trees:
(((A:0.2,B:0.2):0.15,C:0.35):0.4,D:0.75), (((A:0.25,B:0.25):0.015,C:0.265):0.15,D:0.415),
(((A:0.05,B:0.05):0.0005,C:0.0505):0.15,D:0.2005),
(((A:0.15,B:0.15):0.025,C:0.175):0.015,D:0.18), and
(((A:0.2,B:0.2):0.0075,C:0.2075):0.0075,D:0.215).
This procedure was repeated 10,000 times for each tree, to generate 10,000 alignments of
length 500 for each model tree.
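As a minimal sketch of how the branch length parameters t1, t2, and t3 relate to a clock-like tree string, the Python function below builds a Newick string for the topology (((A,B),C),D). It assumes the labelling suggested by Figure 2 (a), in which A and B have branch length t1, C has branch length t1+t2, and D has branch length t1+t2+t3, so that all tips are equidistant from the root. The name clock_newick is hypothetical and is not part of Seq-Gen or PAUP*.

```python
def clock_newick(t1, t2, t3):
    """Newick string for (((A,B),C),D) with all tips equidistant from the root.

    t1, t2, and t3 are the successive intervals between nodes, measured from
    the tips toward the root (an assumed reading of Figure 2 (a)).
    """
    return (
        f"(((A:{t1:g},B:{t1:g}):{t2:g},"
        f"C:{t1 + t2:g}):{t3:g},"
        f"D:{t1 + t2 + t3:g})"
    )


if __name__ == "__main__":
    print(clock_newick(0.2, 0.15, 0.4))
```

For example, clock_newick(0.2, 0.15, 0.4) returns the first tree string listed above.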
From these alignments, gene trees were then estimated under the molecular clock
assumption and rooted using maximum likelihood (ML) as implemented in the program
PAUP* 4.0b10 (Swofford 2002). PAUP* was also used to obtain ML gene tree estimates
without the molecular clock assumption, and the resulting trees were rooted by the outgroup
rooting method with taxon D specified as the outgroup. In both the molecular clock rooting
and outgroup rooting methods, the frequencies of the resulting rooted gene tree topology
estimates were recorded. These were used both to examine the induced gene tree distribution
and to assess the performance of several current coalescent-based methods of species tree
estimation.
Species tree estimation
The simulated data sets were used to estimate species trees under the coalescent model.
Each data set was randomly divided into 100 groups, with 100 genes in each group. Each
group could be considered as a multi-locus data set of 100 gene trees for inference of species
trees. Three coalescent-based software packages were used to estimate species trees: STEM
(Kubatko et al. 2009), MP-EST (Liu et al. 2010), and the minimize deep coalescences method
(MDC), as implemented in PhyloNet (MDC-PhyloNet; Than and Nakhleh 2009). For each of
these three methods, the input data were the trees estimated for each simulated gene
alignment under the ML method with the assumption of a molecular clock. For MP-EST and
MDC-PhyloNet, the gene trees estimated using outgroup rooting were also used for analysis
(note that this is not possible for STEM, because branch lengths that satisfy the molecular
clock are required). For each method, we recorded the proportion of the 100 data sets (each
consisting of 100 genes) that correctly recovered the tree that generated the data.
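The grouping and scoring steps described above can be sketched as follows. This is an illustration under stated assumptions, not the scripts used in the study: the function names are hypothetical, genes are represented only by their indices, and the species tree estimates are taken to be topology strings produced elsewhere by STEM, MP-EST, or MDC-PhyloNet.

```python
import random


def split_into_groups(n_genes, n_groups, rng):
    """Randomly partition gene indices 0..n_genes-1 into equally sized groups."""
    if n_genes % n_groups != 0:
        raise ValueError("n_genes must be divisible by n_groups")
    indices = list(range(n_genes))
    rng.shuffle(indices)
    size = n_genes // n_groups
    return [indices[i * size:(i + 1) * size] for i in range(n_groups)]


def proportion_correct(estimated_topologies, true_topology):
    """Fraction of estimated species trees that match the generating topology."""
    matches = sum(1 for t in estimated_topologies if t == true_topology)
    return matches / len(estimated_topologies)


if __name__ == "__main__":
    groups = split_into_groups(10000, 100, random.Random(0))
    print(len(groups), "groups of", len(groups[0]), "genes")
```

Each of the 10,000 simulated genes appears in exactly one group, so the 100 species tree estimates per data set are based on non-overlapping sets of loci.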
RESULTS
Estimation of the gene tree distribution under different rooting methods
Figure 2 (b) to (f) shows the proportion of times each gene tree (red bars) was estimated
in the five data sets that are simulated with different choices of the true gene tree branch
lengths t1, t2, and t3. We see that estimating ML gene trees under the molecular clock
assumption can lead to an estimated gene tree distribution that mimics that which would be
expected under the coalescent model. For example, Figure 2 (b) also shows the expected
distribution under the coalescent model for the species tree in Figure 1 (a) (tree in thick lines)
with x=10.0 and y=10.0 (green bars). Note the agreement between this expected distribution
and the observed distribution.
The proportion of times that each gene tree is estimated when using outgroup rooting is also
shown in Figure 2 (b) to (f) (blue bars), along with the predicted gene tree distribution under
the coalescent model for the species tree in Figure 1 (a) with a different choice of species tree
branch lengths x and y (purple bars). As was the case with molecular clock rooting, these
distributions agree closely. In all sub-figures of Figure 2, the species tree
branch lengths used to generate the distributions shown with the green and purple bars were
selected to try to mimic the distribution that was observed for the molecular clock and
outgroup rooting methods, respectively.
Species tree estimation
The five data sets were used to compare the ability of the three methods examined
(STEM, MP-EST, and MDC-PhyloNet) to correctly estimate the species tree. For all of the
methods, the 100 species trees estimated by each method were compared with the tree
(((AB)C)D) that was used to simulate the sequences. The frequency of different tree
topologies estimated is shown in Table 1 when both molecular clock rooting and outgroup
rooting were used. Most of the gene trees that are estimated from the alignments generated
from the true gene tree (((A:0.2,B:0.2):0.15,C:0.35):0.4,D:0.75) have the topology
(((A,B),C),D) (Figure 2 (b)). It is thus not surprising that all three methods estimated the
correct tree topology (((A,B),C),D) under both the molecular clock rooting and the outgroup
rooting in this case. For the data set that is simulated from the true gene tree
(((A:0.25,B:0.25):0.015,C:0.265):0.15,D:0.415), all methods other than STEM recovered the
correct topology in 100% of data sets, but only 89% of species trees estimated by STEM had the same topology as
the true tree. With the input true gene tree (((A:0.05,B:0.05):0.0005,C:0.0505):0.15,D:0.2005),
fewer gene trees with the topology (((A,B),C),D) were estimated (Figure 2 (d)). With
molecular clock rooting, 57% and 59% of species trees estimated by MP-EST and
MDC-PhyloNet, respectively, were consistent with the true tree, while 38% of species trees
estimated by STEM had the same topology as the true tree. With outgroup rooting, 67% of
species trees estimated by MP-EST and 64% of those estimated by MDC-PhyloNet agreed
with the true tree.
In general, when fewer of the gene trees estimated from the alignments simulated from
the true gene tree have the topology (((A,B),C),D) (Figure 2 (b) to (d)), the methods estimated
fewer correct trees (Table 1). Note that in the same data set, STEM seems to have estimated
fewer correct trees than MDC-PhyloNet and MP-EST. However, with the true gene tree
(((A:0.15,B:0.15):0.025,C:0.175):0.015,D:0.18), which had a relatively shorter branch length
between the outgroup D and other species, 93% of the trees estimated by STEM had the same
topology as the true gene tree under the molecular clock assumption, while MDC-PhyloNet
and MP-EST only estimated 0% and 8% correct tree topologies with molecular clock rooting,
respectively. If the branch length between C and (A,B) also decreases, as in true gene tree
(((A:0.2,B:0.2):0.0075,C:0.2075):0.0075,D:0.215), STEM and MDC-PhyloNet only estimate
the correct topology 40% and 59% of the time, respectively, with molecular clock rooting, while 94%
of the trees estimated by MP-EST under the molecular clock assumption resulted in the correct
estimated topology. Note that for both of the input true gene trees with a short branch length
between D and ((A,B),C), MDC-PhyloNet and MP-EST estimate the correct tree under the
outgroup rooting, with D used as the outgroup.
DISCUSSION AND CONCLUSION
In four-taxon problems, observation of a distribution on estimated gene trees that consists
of one tree with high frequency and two specific other trees with lower but equal frequency
has been used to indicate that the data “fit” a coalescent model. Here we show that there is at
least one other way to induce a similar distribution on estimated gene trees that only requires
a single underlying gene tree for all genes (rather than variation in the gene trees for each
gene as predicted by the coalescent). In particular, here we used five single gene trees with
branch lengths that agree with the molecular clock assumption to simulate multiple sequence
alignments. To estimate gene trees from these alignments, we used one of two methods for
rooting the trees: molecular clock rooting or outgroup rooting. Both of these methods are
commonly used when rooted gene trees are desired. In particular, estimation subject to a
molecular clock is commonly carried out when the gene trees will be used as input to a
species tree estimation package, such as STEM, that requires estimated coalescent times.
Both molecular clock rooting and outgroup rooting are options for pre-processing multi-locus
data when species tree estimation methods that require rooted tree topologies, such as
MP-EST and MDC-PhyloNet, are used.
We demonstrate that the observation of a gene tree distribution with one topology with
the majority of the probability and two others with lower but equal probability cannot
necessarily be taken as a signature of the coalescent process. Our study generated
non-coalescent gene tree distributions that successfully mimicked several typical gene tree
distributions of the coalescent process: only one tree has substantial probability (Fig. 2 (b));
three asymmetric trees, two of which have lower but equal probability (Fig. 2 (c) and (d));
two asymmetric trees and one symmetric tree, with one of the asymmetric trees and the
symmetric tree having lower but equal probability (Fig. 2 (e)). In particular, similar
distributions result when various methods of rooting single-gene phylogenies are used, even
when all of the data are generated from a single underlying phylogeny. Studies of
probability distributions on gene trees can be found in the earlier literature as well. For
example, Cavalli-Sforza and Edwards (1967) noted that under the Yule model (Yule 1924),
symmetric tree topologies have twice the probability of asymmetric tree topologies for four
taxa. Tajima (1983) showed that the unlabeled asymmetric tree occurs with twice the
probability of the unlabeled symmetric tree under the within-population coalescent model.
These relationships make it clear that, even if gene tree topologies could be estimated without
error in multi-locus studies, it is not an easy task to assess fit to a model of ILS using
topology information alone.
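The two classical results cited above are consistent with each other, as a short exact enumeration shows. The sketch below assumes a model in which, going backward in time, every pair of current lineages is equally likely to be the next to merge, as in the within-population coalescent. It is intended only as an illustration of the four-taxon probabilities, and the function names are hypothetical.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import combinations


def topology_probabilities(taxa):
    """Exact probabilities of labelled rooted topologies under random joining."""
    probs = defaultdict(Fraction)

    def recurse(lineages, p):
        if len(lineages) == 1:
            probs[lineages[0]] += p
            return
        pairs = list(combinations(range(len(lineages)), 2))
        for i, j in pairs:
            merged = frozenset([lineages[i], lineages[j]])
            rest = [lin for k, lin in enumerate(lineages) if k not in (i, j)]
            # every pair is equally likely to be the next to coalesce
            recurse(rest + [merged], p / len(pairs))

    recurse(list(taxa), Fraction(1))
    return dict(probs)


def is_symmetric(tree):
    """True if both children of the root are internal nodes (four-taxon case)."""
    return all(isinstance(child, frozenset) for child in tree)


if __name__ == "__main__":
    probs = topology_probabilities("ABCD")
    sym = [p for t, p in probs.items() if is_symmetric(t)]
    asym = [p for t, p in probs.items() if not is_symmetric(t)]
    print(len(sym), "symmetric trees, each with probability", sym[0])
    print(len(asym), "asymmetric trees, each with probability", asym[0])
    print("total symmetric:", sum(sym), "total asymmetric:", sum(asym))
```

Each of the 3 labelled symmetric topologies has probability 1/9 and each of the 12 labelled asymmetric topologies has probability 1/18, so a labelled symmetric tree is twice as probable as a labelled asymmetric tree, while the unlabeled asymmetric shape (total 2/3) is twice as probable as the unlabeled symmetric shape (total 1/3).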
It is worth noting that with the same input true gene tree, molecular clock and outgroup
rooting may generate different gene tree distributions. As shown in Fig. 2 (b), (d), and (e),
the gene tree distributions generated by the two rooting methods can be
mimicked by species trees with very different x and y values. This difference
arises from how the two rooting methods work. In outgroup rooting, species D was
chosen as the outgroup. Thus, only three gene tree topologies (((A,B),C),D), (((A,C),B),D),
and (((B,C),A),D) will appear in the gene tree distribution. Notice that these trees have
different unrooted topologies. If the molecular clock method only produces those three
topologies (Fig. 2(c), (d)), the distributions generated by two different rooting methods will
be mimicked by very similar species trees (with similar x and y values) (Fig. 1(c)). Otherwise,
when the molecular clock method only produces gene trees with topology (((A,B),C),D)
(Fig. 2 (b)), or produces gene trees with topologies other than the three generated
by outgroup rooting (Fig. 2 (d), (e)), very different species trees are needed to mimic the
gene tree distributions (Fig. 1 (b), (d), (e)).
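The restriction to three rooted topologies under outgroup rooting follows directly from the three possible unrooted four-taxon trees. The sketch below makes this explicit, representing each unrooted tree by its internal split and rooting it on a chosen outgroup. The function name root_on_outgroup is hypothetical and does not correspond to any PAUP* command.

```python
def root_on_outgroup(split, outgroup):
    """Root an unrooted four-taxon tree, given as its internal split, on `outgroup`.

    `split` is a pair of two-taxon sets, e.g. ({"A", "B"}, {"C", "D"}).
    The taxon paired with the outgroup in the split becomes sister to the
    other pair in the rooted tree.
    """
    side_a, side_b = (set(s) for s in split)
    with_out, cherry = (side_a, side_b) if outgroup in side_a else (side_b, side_a)
    if outgroup not in with_out:
        raise ValueError("outgroup is not a taxon in the tree")
    (partner,) = with_out - {outgroup}
    first, second = sorted(cherry)
    return f"((({first},{second}),{partner}),{outgroup})"


if __name__ == "__main__":
    unrooted = [
        ({"A", "B"}, {"C", "D"}),
        ({"A", "C"}, {"B", "D"}),
        ({"A", "D"}, {"B", "C"}),
    ]
    for split in unrooted:
        print(root_on_outgroup(split, "D"))
```

The three unrooted trees map one-to-one onto (((A,B),C),D), (((A,C),B),D), and (((B,C),A),D), so no other rooted topology can appear when D is used as the outgroup.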
Under different gene tree distributions, not all of the species tree inference packages that
were applied to the estimated gene trees showed good ability to correctly estimate the
underlying tree that generated the data when the data were not generated from the coalescent
model. The two methods that do not use branch lengths in the estimated gene trees, MP-EST
and MDC-PhyloNet, can estimate the underlying tree when the true gene tree has a relatively
shorter branch length between the outgroup D and other species as well as a relatively shorter
branch length between species C and (A,B). It is not surprising that when fewer of the
estimated gene trees simulated from the true gene tree have the topology (((A,B),C),D)
(Figure 2 (b) to (d)), the methods estimated fewer correct trees (Table 1), while STEM seems
to have estimated fewer correct trees than MDC-PhyloNet and MP-EST. However, STEM
shows much better performance in estimating the underlying tree when the branch length
between the outgroup D and the other species decreases, which will induce a gene tree
distribution of two asymmetric trees and one symmetric tree. Interestingly, when the branch
length between C and (A,B) also decreases, MP-EST estimated more correct species trees
than the other two methods. From these results, we can conclude that not all of the methods
of species tree inference examined here can correctly estimate the species tree with high
frequency when the gene tree topology distribution mimics what would be expected from the
coalescent process, but each of the three methods estimates the true species tree with high
frequency under its “preferred” gene tree distribution.
Remarkably, even if the input true gene trees have a short branch length between D and
((A,B),C), MDC-PhyloNet and MP-EST can still estimate the correct species tree under the
outgroup rooting, with D used as the outgroup. This suggests that correctly choosing the
outgroup will greatly increase the accuracy in estimating species trees. This conclusion is also
consistent with the study by Rosenfeld et al. (2012), which concluded that outgroup choice is
extremely important in phylogenetic analysis. Another interesting question is whether
methods of species tree estimation that also estimate population genetic parameters such as
species tree branch lengths and effective population sizes (for example, BEST (Liu and Pearl
2007) and *BEAST (Heled and Drummond 2010)) are affected in either estimation of
topology or estimation of parameters for data generated under these types of models.
ACKNOWLEDGEMENTS
The ideas in this study were motivated by a seminar talk given by Rob DeSalle in the
Department of Evolution, Ecology, and Organismal Biology at The Ohio State University in
February 2012.
REFERENCES
Anderson, C.N., Liu, L., Pearl, D., Edwards, S.V., 2012. Tangled trees: the challenge of
inferring species trees from coalescent and noncoalescent genes. Methods Mol. Biol. 856,
3-28.
Ané, C., 2010. Reconstructing concordance trees and testing the coalescent model from
genome-wide data sets. Chapter 3, p.35-52 in Estimating species trees: Practical and
Theoretical Aspects, L. Knowles and L. Kubatko, eds. Wiley-Blackwell.
Bayzid, M.S., Warnow, T., 2012. Estimating optimal species trees from incomplete gene trees
under deep coalescence. J Comput Biol.19(6), 591-605.
Cavalli-Sforza, L.L., Edwards, A.W.F., 1967. Phylogenetic analysis: models and estimation
procedures. Amer. J. Hum. Genet. 19, 233-257.
Degnan, J.H., Salter, L.A., 2005. Gene tree distributions under the coalescent process.
Evolution 59(1), 24-37.
Ebersberger, I., Galgoczy, P., Taudien, S., Taenzer, S., Platzer, M., Von Haeseler, A., 2007.
Mapping human genetic ancestry. Mol. Biol. Evol. 24(10), 2266-2276.
Hasegawa, M., Kishino, H., Yano, T., 1985. Dating of human-ape splitting by a molecular
clock of mitochondrial DNA. J. Mol. Evol. 22, 160–174.
Hein, J., 1993. A heuristic method to reconstruct the history of sequences subject to
recombination. J. Mol. Evol. 36, 396–405.
Heled, J., Drummond, A.J., 2010. Bayesian inference of species trees from multilocus data.
Mol. Biol. Evol. 27, 570–580.
Kingman, J.F.C., 1982. The coalescent. Stoch. Proc. Appl. 13, 235–248.
Kingman, J.F.C., 2000. Origins of the coalescent 1974–1982. Genetics 156, 1461–1463.
Kubatko, L.S., 2009. Identifying hybridization events in the presence of coalescence via
model selection, Syst. Biol. 58(5), 478-488.
Kubatko, L.S., Carstens, B.C., Knowles, L.L., 2009. STEM: Species Tree Estimation using
Maximum likelihood for gene trees under coalescence, Bioinformatics 25 (7), 971-973.
Kubatko, L., Gibbs, H.L., Bloomquist, E.W., 2011. Inferring species-level phylogenies and
taxonomic distinctiveness using multi-locus data in Sistrurus rattlesnakes, Syst. Biol.
60(4), 393-409.
Liu, L., Pearl, D.K., 2007. Species trees from gene trees: reconstructing Bayesian posterior
distributions of a species phylogeny using estimated gene tree distributions. Syst. Biol.
56, 504-514.
Liu, L., Yu, L.L., Edwards, S.V., 2010. A maximum pseudo-likelihood approach for
estimating species trees under the coalescent model. BMC Evol. Biol. 10, 302.
Maddison, W.P., 1997. Gene trees in species trees. Syst. Biol. 46, 523–536.
Pamilo, P., Nei, M., 1988. Relationships between gene trees and species trees. Mol. Biol.
Evol. 5(5), 568–583.
Rambaut, A., Grassly, N.C., 1997. Seq-Gen: An application for the Monte Carlo simulation of
DNA sequence evolution along phylogenetic trees. Comput. Appl. Biosci. 13, 235–238.
Rosenberg, N.A., 2007. Counting coalescent histories. J. Comput. Biol. 14, 360-377.
Rosenfeld, J.A., Payne, A., DeSalle, R., 2012. Random roots and lineage sorting. Mol
Phylogenet Evol 64(1), 12-20.
Sang, T., Zhong, Y., 2000. Testing hybridization hypotheses based on incongruent gene trees.
Syst. Biol. 49, 422–434.
Swofford, D.L., 2002. PAUP*: phylogenetic analysis using parsimony (*and other methods),
version 4. Sinauer Associates, Sunderland, Massachusetts, USA.
Tajima, F., 1983. Evolutionary relationship of DNA sequences in finite populations. Genetics
105, 437-460.
Takahata, N., Nei, M., 1985. Gene genealogy and variance of interpopulational nucleotide
differences. Genetics 110, 325–344.
Than, C., Nakhleh, L., 2009. Species tree inference by minimizing deep coalescences. PLoS
Comput Biol 5(9).
Than, C., Ruths, D., Nakhleh, L., 2008. PhyloNet: a software package for analyzing and
reconstructing reticulate evolutionary relationships. BMC Bioinformatics, 9, 322.
Wakeley, J., 2008. Coalescent Theory: An Introduction. Roberts & Company Publishers,
Greenwood Village, Colorado.
Yule, G.U., 1924. A mathematical theory of evolution, based on the conclusions of Dr. J. C.
Willis, F.R.S. Philosophical Transactions of the Royal Society of London, Series B, 213:
Table 1. Frequency of the species tree topologies estimated by STEM, MDC-PhyloNet (PhyloNet), and MP-EST (mpest) under molecular clock rooting and outgroup rooting. A dash marks STEM under outgroup rooting, for which it was not run.

Branch lengths of the true gene tree:
Set 1: t1=0.2, t2=0.15, t3=0.4
Set 2: t1=0.25, t2=0.015, t3=0.15
Set 3: t1=0.05, t2=0.0005, t3=0.15
Set 4: t1=0.15, t2=0.025, t3=0.015
Set 5: t1=0.2, t2=0.0075, t3=0.0075

Rooting method: Molecular clock
Topology   | Set 1               | Set 2               | Set 3               | Set 4               | Set 5
           | STEM PhyloNet mpest | STEM PhyloNet mpest | STEM PhyloNet mpest | STEM PhyloNet mpest | STEM PhyloNet mpest
(((AB)C)D) | 100%     100%  100% |  89%     100%  100% |  38%      59%   57% |  93%       0%    8% |  40%      59%   94%
(((AB)D)C) |   0%       0%    0% |   0%       0%    0% |   0%       0%    0% |   2%       0%    7% |   6%       0%    0%
(((AC)B)D) |   0%       0%    0% |   5%       0%    0% |  32%      20%   22% |   0%       0%    6% |  14%       3%    3%
(((AC)D)B) |   0%       0%    0% |   0%       0%    0% |   0%       0%    0% |   1%       0%    7% |   2%       0%    0%
(((AD)B)C) |   0%       0%    0% |   0%       0%    0% |   0%       0%    0% |   0%       0%    2% |   2%       0%    0%
(((AD)C)B) |   0%       0%    0% |   0%       0%    0% |   0%       0%    0% |   0%       0%    7% |   0%       0%    0%
(((BC)A)D) |   0%       0%    0% |   6%       0%    0% |  30%      21%   21% |   0%       0%    9% |  16%       3%    3%
(((BC)D)A) |   0%       0%    0% |   0%       0%    0% |   0%       0%    0% |   3%       0%   10% |   2%       0%    0%
(((BD)A)C) |   0%       0%    0% |   0%       0%    0% |   0%       0%    0% |   0%       0%    9% |   3%       0%    0%
(((BD)C)A) |   0%       0%    0% |   0%       0%    0% |   0%       0%    0% |   0%       0%    4% |   0%       0%    0%
(((CD)A)B) |   0%       0%    0% |   0%       0%    0% |   0%       0%    0% |   0%       0%    7% |   0%       0%    0%
(((CD)B)A) |   0%       0%    0% |   0%       0%    0% |   0%       0%    0% |   0%       0%   10% |   0%       0%    0%
((AB)(CD)) |   0%       0%    0% |   0%       0%    0% |   0%       0%    0% |   1%      32%    6% |  11%      35%    0%
((AC)(BD)) |   0%       0%    0% |   0%       0%    0% |   0%       0%    0% |   0%      33%    4% |   1%       0%    0%
((AD)(BC)) |   0%       0%    0% |   0%       0%    0% |   0%       0%    0% |   0%      35%    4% |   3%       0%    0%

Rooting method: Outgroup
Topology   | Set 1               | Set 2               | Set 3               | Set 4               | Set 5
           | STEM PhyloNet mpest | STEM PhyloNet mpest | STEM PhyloNet mpest | STEM PhyloNet mpest | STEM PhyloNet mpest
(((AB)C)D) |   -      100%  100% |   -      100%  100% |   -       64%   67% |   -      100%  100% |   -      100%  100%
(((AB)D)C) |   -        0%    0% |   -        0%    0% |   -        0%    0% |   -        0%    0% |   -        0%    0%
(((AC)B)D) |   -        0%    0% |   -        0%    0% |   -       20%   21% |   -        0%    0% |   -        0%    0%
(((AC)D)B) |   -        0%    0% |   -        0%    0% |   -        0%    0% |   -        0%    0% |   -        0%    0%
(((AD)B)C) |   -        0%    0% |   -        0%    0% |   -        0%    0% |   -        0%    0% |   -        0%    0%
(((AD)C)B) |   -        0%    0% |   -        0%    0% |   -        0%    0% |   -        0%    0% |   -        0%    0%
(((BC)A)D) |   -        0%    0% |   -        0%    0% |   -       16%   12% |   -        0%    0% |   -        0%    0%
(((BC)D)A) |   -        0%    0% |   -        0%    0% |   -        0%    0% |   -        0%    0% |   -        0%    0%
(((BD)A)C) |   -        0%    0% |   -        0%    0% |   -        0%    0% |   -        0%    0% |   -        0%    0%
(((BD)C)A) |   -        0%    0% |   -        0%    0% |   -        0%    0% |   -        0%    0% |   -        0%    0%
(((CD)A)B) |   -        0%    0% |   -        0%    0% |   -        0%    0% |   -        0%    0% |   -        0%    0%
(((CD)B)A) |   -        0%    0% |   -        0%    0% |   -        0%    0% |   -        0%    0% |   -        0%    0%
((AB)(CD)) |   -        0%    0% |   -        0%    0% |   -        0%    0% |   -        0%    0% |   -        0%    0%
((AC)(BD)) |   -        0%    0% |   -        0%    0% |   -        0%    0% |   -        0%    0% |   -        0%    0%
((AD)(BC)) |   -        0%    0% |   -        0%    0% |   -        0%    0% |   -        0%    0% |   -        0%    0%
[Figure 1. (a) Four-taxon asymmetric species tree on taxa A, B, C, and D with internal branch lengths x and y. (b) to (e) Probability (vertical axis, 0 to 1) of each of the 15 rooted gene tree topologies under the coalescent model for x=10, y=10 (b); x=10, y=0.55 (c); x=0.7, y=10 (d); and x=0.4, y=0.2 (e).]
[Figure 2. (a) Four-taxon gene tree on taxa A, B, C, and D with branch length parameters t1, t2, and t3 and the derived lengths t1, t1+t2, and t1+t2+t3. (b) to (f) Frequency (vertical axis, 0% to 100%) of each of the 15 rooted gene tree topologies for each true gene tree. Legend entries give the rooting method and the species tree branch lengths x and y of the coalescent distribution matched to it:
(b) t1=0.2, t2=0.15, t3=0.4: molecular clock, x=10, y=10; outgroup rooting, x=5.0, y=1.6
(c) t1=0.25, t2=0.015, t3=0.15: molecular clock, x=10, y=0.55; outgroup rooting, x=10, y=0.35
(d) t1=0.05, t2=0.0005, t3=0.15: molecular clock, x=5, y=0.06; outgroup rooting, x=5, y=0.06
(e) t1=0.15, t2=0.025, t3=0.015: molecular clock, x=0.7, y=10; outgroup rooting, x=10, y=2.1
(f) t1=0.2, t2=0.0075, t3=0.0075: molecular clock, x=0.4, y=0.2; outgroup rooting, x=10, y=0.37]
A B C D
t 1
t 2
t 3
t1 +t2 +t3
t1
t1 +t2
(a) (b)
(c) (d)
(f) (e) t1=0.15, t2=0.025, t3=0.015 t1=0.2, t2=0.0075, t3=0.0075
t1=0.25, t2=0.015, t3=0.15 t1=0.05, t2=0.0005, t3=0.15
t1=0.2, t2=0.15, t3=0.4
0.00%$
10.00%$
20.00%$
30.00%$
40.00%$
50.00%$
60.00%$
70.00%$
80.00%$
90.00%$
100.00%$
(((AB)C)D)$
(((AB)D)C)$
(((AC)B)D)$
(((AC)D)B)$
(((AD)B)C)$
(((AD)C)B)$
(((BC)A)D)$
(((BC)D)A)$
(((BD)A)C)$
(((BD)C)A)$
(((CD)A)B)$
(((CD)B)A)$
((AB)(CD))$
((AC)(BD))$
((AD)(BC))$
Molecular$clock$
x=0.7,$y=10$$
Outgroup$rooEng$
x=10,$y=2.1$
0.00%$
10.00%$
20.00%$
30.00%$
40.00%$
50.00%$
60.00%$
70.00%$
80.00%$
90.00%$
100.00%$
(((AB)C)D)$
(((AB)D)C)$
(((AC)B)D)$
(((AC)D)B)$
(((AD)B)C)$
(((AD)C)B)$
(((BC)A)D)$
(((BC)D)A)$
(((BD)A)C)$
(((BD)C)A)$
(((CD)A)B)$
(((CD)B)A)$
((AB)(CD))$
((AC)(BD))$
((AD)(BC))$
Molecular$clock$
x=0.4,$y=0.2$
Outgroup$rooEng$
x=10,$y=0.37$
0.00%$
10.00%$
20.00%$
30.00%$
40.00%$
50.00%$
60.00%$
70.00%$
80.00%$
90.00%$
100.00%$
(((AB)C)D)$
(((AB)D)C)$
(((AC)B)D)$
(((AC)D)B)$
(((AD)B)C)$
(((AD)C)B)$
(((BC)A)D)$
(((BC)D)A)$
(((BD)A)C)$
(((BD)C)A)$
(((CD)A)B)$
(((CD)B)A)$
((AB)(CD))$
((AC)(BD))$
((AD)(BC))$
Molecular$clock$
x=5,$y=0.06$$
Outgroup$rooEng$
x=5,$y=0.06$
0.00%$
10.00%$
20.00%$
30.00%$
40.00%$
50.00%$
60.00%$
70.00%$
80.00%$
90.00%$
100.00%$
(((AB)C)D)$
(((AB)D)C)$
(((AC)B)D)$
(((AC)D)B)$
(((AD)B)C)$
(((AD)C)B)$
(((BC)A)D)$
(((BC)D)A)$
(((BD)A)C)$
(((BD)C)A)$
(((CD)A)B)$
(((CD)B)A)$
((AB)(CD))$
((AC)(BD))$
((AD)(BC))$
Molecular$clock$
x=10,$y=0.55$$
Outgroup$rooEng$
x=10,$y=0.35$
0.00%$
10.00%$
20.00%$
30.00%$
40.00%$
50.00%$
60.00%$
70.00%$
80.00%$
90.00%$
100.00%$
(((AB)C)D)$
(((AB)D)C)$
(((AC)B)D)$
(((AC)D)B)$
(((AD)B)C)$
(((AD)C)B)$
(((BC)A)D)$
(((BC)D)A)$
(((BD)A)C)$
(((BD)C)A)$
(((CD)A)B)$
(((CD)B)A)$
((AB)(CD))$
((AC)(BD))$
((AD)(BC))$
Molecular$clock$
x=10,$y=10$$
Outgroup$rooEng$
x=5.0,$y=1.6$
[Figure: (a) tree diagram on taxa A, B, C, D, annotated with x and y. Panels (b) through (e): plots over the 15 rooted four-taxon topologies, (((AB)C)D) through ((AD)(BC)), on a vertical axis from 0 to 1, titled x=0.7, y=10; x=0.4, y=0.2; x=10, y=0.55; and x=10, y=10.]
Highlights:
In multi-locus phylogenetics, gene trees are often estimated separately at each locus.
For a single true gene tree, multi-locus data were simulated.
Rooted gene trees were estimated for each locus, and their frequencies recorded.
The estimated gene tree distribution may be close to the coalescent distribution in this case.
Species tree estimation based on the coalescent shows variable performance here.