Upload
others
View
11
Download
0
Embed Size (px)
Citation preview
Reconstruction of species trees from gene trees using ASTRAL
Siavash Mirarab University of California, San Diego (ECE)
sour
ce: h
ttp://
ww
w.ev
ogen
eao.
com
/
in collaboration with Warnow lab (UIUC)
Phylogenomics
2
More data (#genes)
Inference error
“gene” here refers to a portion of the genome (not a functional gene)
gene 999gene 2ACTGCACACCG ACTGC-CCCCG AATGC-CCCCG -CTGCACACGG
CTGAGCATCG CTGAGC-TCG ATGAGC-TC- CTGA-CAC-G
AGCAGCATCGTG AGCAGC-TCGTG AGCAGC-TC-TG C-TA-CACGGTG
CAGGCACGCACGAA AGC-CACGC-CATA ATGGCACGC-C-TA AGCTAC-CACGGAT
gene 1000gene 1
Orangutan
Gorilla
Chimpanzee
Human
Gene tree discordance
3
Orang.Gorilla ChimpHuman Orang.Gorilla Chimp Human
gene1000gene 1
Gene tree discordance
3
OrangutanGorilla ChimpHuman
The species tree
A gene treeOrang.Gorilla ChimpHuman Orang.Gorilla Chimp Human
gene1000gene 1
Gene tree discordance
3
OrangutanGorilla ChimpHuman
The species tree
A gene treeOrang.Gorilla ChimpHuman Orang.Gorilla Chimp Human
Causes of gene tree discordance include:• Incomplete Lineage Sorting (ILS) • Duplication and loss • Horizontal Gene Transfer (HGT)
gene1000gene 1
Incomplete Lineage Sorting (ILS)
• A random process related to the coalescence of alleles across various populations
4
Tracing alleles through generations
Incomplete Lineage Sorting (ILS)
• A random process related to the coalescence of alleles across various populations
4
Tracing alleles through generations
Incomplete Lineage Sorting (ILS)
• A random process related to the coalescence of alleles across various populations
• Omnipresent: possible for every tree
• Likely for short branches or large population sizes
4
Tracing alleles through generations
MSC and Identifiability• A statistical model called multi-species coalescent
(MSC) can generate ILS.
5
MSC and Identifiability• A statistical model called multi-species coalescent
(MSC) can generate ILS.
• Any species tree defines a unique distribution on the set of all possible gene trees
5
MSC and Identifiability• A statistical model called multi-species coalescent
(MSC) can generate ILS.
• Any species tree defines a unique distribution on the set of all possible gene trees
• In principle, the species tree can be identified despite high discordance from the gene tree distribution
5
6
gene 2
ACTGCACACCG ACTGC-CCCCG AATGC-CCCCG -CTGCACACGG
CTGAGCATCG CTGAGC-TCG ATGAGC-TC- CTGA-CAC-G
CAGGCACGCACGAA AGC-CACGC-CATA ATGGCACGC-C-TA AGCTAC-CACGGAT
gene 1000
gene 1
Summary method
Orangutan
Gorilla
Chimpanzee
Human
Approach 2: Summary methods
Approach 1: ConcatenationOrangutan
Gorilla
Chimpanzee
Human
Phylogeny inferenceACTGCACACCG
ACTGC-CCCCG AATGC-CCCCG -CTGCACACGG
CTGAGCATCG CTGAGC-TCG ATGAGC-TC- CTGA-CAC-G
CAGAGCACGCACGAA AGCA-CACGC-CATA ATGAGCACGC-C-TA AGC-TAC-CACGGAT
supermatrixgene 2 gene 1000gene 1
Gene tree estimation
Orang.
GorillaChimp
Humangene
1
Orang.
Gorilla Chimp
Humangene
2
Orang.
Gorilla
Chimp
Humangene
100
0
Multi-gene tree estimation pipelines
6
There are also other very good approaches: co-estimation (e.g., *BEAST),
site-based (e.g., SVDQuartets, SNAPP)
6
gene 2
ACTGCACACCG ACTGC-CCCCG AATGC-CCCCG -CTGCACACGG
CTGAGCATCG CTGAGC-TCG ATGAGC-TC- CTGA-CAC-G
CAGGCACGCACGAA AGC-CACGC-CATA ATGGCACGC-C-TA AGCTAC-CACGGAT
gene 1000
gene 1
Summary method
Orangutan
Gorilla
Chimpanzee
Human
Approach 2: Summary methods
Approach 1: ConcatenationOrangutan
Gorilla
Chimpanzee
Human
Phylogeny inferenceACTGCACACCG
ACTGC-CCCCG AATGC-CCCCG -CTGCACACGG
CTGAGCATCG CTGAGC-TCG ATGAGC-TC- CTGA-CAC-G
CAGAGCACGCACGAA AGCA-CACGC-CATA ATGAGCACGC-C-TA AGC-TAC-CACGGAT
supermatrixgene 2 gene 1000gene 1
Gene tree estimation
Orang.
GorillaChimp
Humangene
1
Orang.
Gorilla Chimp
Humangene
2
Orang.
Gorilla
Chimp
Humangene
100
0
Multi-gene tree estimation pipelines
6
Statistically inconsistent [Roch and Steel, 2014]
Can be statistically consistent given true gene treesData
Error
Error-free gene trees
6
gene 2
ACTGCACACCG ACTGC-CCCCG AATGC-CCCCG -CTGCACACGG
CTGAGCATCG CTGAGC-TCG ATGAGC-TC- CTGA-CAC-G
CAGGCACGCACGAA AGC-CACGC-CATA ATGGCACGC-C-TA AGCTAC-CACGGAT
gene 1000
gene 1
Summary method
Orangutan
Gorilla
Chimpanzee
Human
Approach 2: Summary methods
Approach 1: ConcatenationOrangutan
Gorilla
Chimpanzee
Human
Phylogeny inferenceACTGCACACCG
ACTGC-CCCCG AATGC-CCCCG -CTGCACACGG
CTGAGCATCG CTGAGC-TCG ATGAGC-TC- CTGA-CAC-G
CAGAGCACGCACGAA AGCA-CACGC-CATA ATGAGCACGC-C-TA AGC-TAC-CACGGAT
supermatrixgene 2 gene 1000gene 1
Gene tree estimation
Orang.
GorillaChimp
Humangene
1
Orang.
Gorilla Chimp
Humangene
2
Orang.
Gorilla
Chimp
Humangene
100
0
Multi-gene tree estimation pipelines
6
Statistically inconsistent [Roch and Steel, 2014]
Can be statistically consistent given true gene trees
STAR, STELLS, GLASS, BUCKy (population tree), MP-EST, NJst (ASTRID), … ASTRAL, ASTRAL-II, ASTRAL-III
Data
Error
Error-free gene trees
ASTRAL used by the biologists• Plants: Wickett, et al., 2014, PNAS • Birds: Prum, et al., 2015, Nature • Xenoturbella, Cannon et al., 2016, Nature • Xenoturbella, Rouse et al., 2016, Nature • Flatworms: Laumer, et al., 2015, eLife • Shrews: Giarla, et al., 2015, Syst. Bio. • Frogs: Yuan et al., 2016, Syst. Bio. • Tomatoes: Pease, et al., 2016, PLoS Bio. • Angiosperms: Huang et al., 2016, MBE • Worms: Andrade, et al., 2015, MBE
AST
RA
L-II
AST
RA
L
This talk: ASTRAL• Outlines of the method
• Accuracy in simulation studies
• With strong model violations
• The impact of
• Fragmentary sequence data
• Sampling multiple individuals
8
Unrooted quartets under MSC model
For a quartet (4 species), the unrooted species tree topology has at least 1/3 probability in gene trees (Allman, et al. 2010)
9
Orang.
Gorilla Chimp
HumanOrang.
GorillaChimp
Human
Orang.
Gorilla
Chimp
Human
θ2=15% θ3=15%θ1=70%Gorilla
Orang.
Chimp
Human
d=0.8
Unrooted quartets under MSC model
For a quartet (4 species), the unrooted species tree topology has at least 1/3 probability in gene trees (Allman, et al. 2010)
9
Orang.
Gorilla Chimp
HumanOrang.
GorillaChimp
Human
Orang.
Gorilla
Chimp
Human
θ2=15% θ3=15%θ1=70%Gorilla
Orang.
Chimp
Human
d=0.8
The most frequent gene tree = The most likely species tree
Unrooted quartets under MSC model
For a quartet (4 species), the unrooted species tree topology has at least 1/3 probability in gene trees (Allman, et al. 2010)
9
Orang.
Gorilla Chimp
HumanOrang.
GorillaChimp
Human
Orang.
Gorilla
Chimp
Human
θ2=15% θ3=15%θ1=70%Gorilla
Orang.
Chimp
Human
d=0.8
The most frequent gene tree = The most likely species tree
shorter branches ⇒ more discordance ⇒ a harder species tree reconstruction problem
10
Orang.Gorilla ChimpHuman Rhesus
More than 4 speciesFor >4 species, the species tree topology can be different from the most like gene tree (called anomaly zone) (Degnan, 2013)
10
Orang.Gorilla ChimpHuman Rhesus
1. Break gene trees into (n 4 ) quartets of species
2. Find the dominant tree for all quartets of taxa 3. Combine quartet trees
Some tools (e.g.. BUCKy-p [Larget, et al., 2010])
More than 4 speciesFor >4 species, the species tree topology can be different from the most like gene tree (called anomaly zone) (Degnan, 2013)
10
Orang.Gorilla ChimpHuman Rhesus
1. Break gene trees into (n 4 ) quartets of species
2. Find the dominant tree for all quartets of taxa 3. Combine quartet trees
Some tools (e.g.. BUCKy-p [Larget, et al., 2010])
More than 4 speciesFor >4 species, the species tree topology can be different from the most like gene tree (called anomaly zone) (Degnan, 2013)
Alternative: weight all 3(n
4 ) quartet topologies by their frequency
and find the optimal tree
Maximum Quartet Support Species Tree
• Optimization problem:
11
Find the species tree with the maximum number of induced quartet trees shared with the collection of input gene trees
the set of quartet trees induced by T
a gene treeScore(T ) =
mX
1
|Q(T ) \Q(ti)|
Maximum Quartet Support Species Tree
• Optimization problem:
• Theorem: Statistically consistent under the multi-species coalescent model when solved exactly
11
Find the species tree with the maximum number of induced quartet trees shared with the collection of input gene trees
the set of quartet trees induced by T
a gene treeScore(T ) =
mX
1
|Q(T ) \Q(ti)|
ASTRAL-I and ASTRAL-II [Mirarab, et al., Bioinformatics, 2014] [Mirarab and Warnow, Bioinformatics, 2015]
• Solve the problem exactly using dynamic programming
• Constrains the search space to make large datasets feasible
• The constrained version remains statistically consistent
• Running time: polynomially increases with the number of genes and the number of species
12
ASTRAL-I and ASTRAL-II [Mirarab, et al., Bioinformatics, 2014] [Mirarab and Warnow, Bioinformatics, 2015]
• Solve the problem exactly using dynamic programming
• Constrains the search space to make large datasets feasible
• The constrained version remains statistically consistent
• Running time: polynomially increases with the number of genes and the number of species
• ASTRAL-II:
• Increased the search space
• Improved the running time
• Can handle polytomies (lack of resolution) in input gene trees
12
Simulation study
13
Truegenetrees Sequencedata
Es�matedspeciestree
Finch Falcon Owl Eagle Pigeon
Es�matedgenetreesFinch Owl Falcon Eagle Pigeon
True(model)speciestree
Simulation study
• Evaluate using the FN rate: the percentage of branches in the true tree that are missing from the estimated tree
13
Truegenetrees Sequencedata
Es�matedspeciestree
Finch Falcon Owl Eagle Pigeon
Es�matedgenetreesFinch Owl Falcon Eagle Pigeon
True(model)speciestree
Number of species impacts estimation error in the species tree
14
1000 genes, “medium” levels of ILS, simulated species trees [S. Mirarab, T. Warnow, 2015]
ASTRAL: accurate and scalable
15
4%
8%
12%
16%
10 50 100 200 500 1000number of species
Spec
ies
tree
topo
logi
cal e
rror (
FN)
ASTRAL−IIMP−EST
1000 genes, “medium” levels of ILS, simulated species trees[S. Mirarab, T. Warnow, 2015]
ASTRAL: accurate and scalable
15
4%
8%
12%
16%
10 50 100 200 500 1000number of species
Spec
ies
tree
topo
logi
cal e
rror (
FN)
ASTRAL−IIMP−EST
4%
8%
12%
16%
10 50 100 200 500 1000number of species
Spec
ies
tree
topo
logi
cal e
rror (
FN)
ASTRAL−IIMP−EST
1000 genes, “medium” levels of ILS, simulated species trees[S. Mirarab, T. Warnow, 2015]
Comparison to concatenation: depends on the ILS level
16
200 species, recent ILS
1000 genes 200 genes 50 genes
0%
10%
20%
30%
10M 2M 500K 10M 2M 500K 10M 2M 500Ktree length (controls the amount of ILS)
Spec
ies
tree
topo
logi
cal e
rror (
FN) ASTRAL−II
NJstCA−ML
more ILS more ILS more ILSL M H L M H L M H
Comparison to concatenation: depends on the ILS level
16
200 species, recent ILS
1000 genes 200 genes 50 genes
0%
10%
20%
30%
10M 2M 500K 10M 2M 500K 10M 2M 500Ktree length (controls the amount of ILS)
Spec
ies
tree
topo
logi
cal e
rror (
FN) ASTRAL−II
NJstCA−ML
more ILS more ILS more ILSL M H L M H L M H
Horizontal Gene Transfer (HGT)[R. Davidson et al., BMC Genomics. 16 (2015)]
~70%~55% ~35% ~30% discordance
Model violation: the simulated discordance is due to both ILS and randomly distributed HGT
50 species, 10 genes
Horizontal Gene Transfer (HGT)[R. Davidson et al., BMC Genomics. 16 (2015)]
~70%~55%
~35% ~30% discordance
Randomly distributed HGT is tolerated with enough genes
50 species, varying # genes
# genes
UCSD Work 1. Statistical support 2. ASTRAL-III
• Better running time • GPU and CPU Parallelism • Multi-Individual datasets
3. Impact of fragmentary data
19
Branch support• Traditional approach:
Multi-locus bootstrapping (MLBS)
• Slow: requires bootstrapping all genes(e.g., 100×m ML trees)
• Inaccurate and hard to interpret [Mirarab et al., Sys bio, 2014; Bayzid et al., PLoS One, 2015]
• We can do better!
20
[Mirarab et al., Sys bio, 2014]
Local posterior probability• Recall quartet frequencies follow a multinomial distribution
• P ( topology seen in m1 / m gene trees is the species tree ) = P ( θ1 > 1/3 )
21
Orang.
Gorilla Chimp
HumanOrang.
GorillaChimp
Human
Orang.
Gorilla
Chimp
Human
m 2 = 63 m3 = 57m 1 = 80
θ2 θ3θ1
Local posterior probability• Recall quartet frequencies follow a multinomial distribution
• P ( topology seen in m1 / m gene trees is the species tree ) = P ( θ1 > 1/3 )
• Can be analytically solved
• We implemented this idea in astral and called the measure “the local posterior probability” (localPP)
21
Orang.
Gorilla Chimp
HumanOrang.
GorillaChimp
Human
Orang.
Gorilla
Chimp
Human
m 2 = 63 m3 = 57m 1 = 80
θ2 θ3θ1
Quartet support v.s. localPP
22
!3Background: How many targets are enough? The number of loci required for confidently resolving a phylogenetic relationship depends on its “hardness”. Discordant gene trees due to incomplete lineage sorting (ILS) can make branches of a species tree very difficult to resolve. ILS is a function of the branch length and population size. This means that shorter branches or larger population sizes increase ILS, and can make species tree reconstruction more difficult. The co-PI has recently developed a new approach for estimating the support of a branch in a coalescent-based framework based on the proportion of gene trees that support quartet topologies in gene trees[31]. This new approach is analogous to finding the probability that a three-sided die is loaded towards each facet by observing outcomes of many tosses. We can also ask the reverse question: if we have an estimate of the degree of bias of a die, how many tosses are needed to confidently find its loaded side. Similarly, this approach[31] sheds light on the relationship between the number of genes required to achieve a level of support for a branch of a certain length in coalescent units (Fig. 3). While the co-PI’s method of calculating support, which is implemented in ASTRAL [32, 33], gives the basic mathematical framework, more method development is needed (see below).
Research Approach Specimens Suitable specimens of more than 80 species of Sabellidae and >120 Terebelliformia species have been collected from localities around the world in the last 15 years by the PI. Further collecting will be undertaken for some key taxa for a few more transcriptomes and for targeted DNA capture. We also have agreements from other experts in Sabellidae and Terebelliformia to provide other needed ethanol-preserved specimens for targeted capture. We will obtain ~ 50% of the known sabellid diversity and ~25% of terebelliforms for targeted capture sequencing (~500 species total). Our plan is to sample all genera, particularly focusing on their type species. This should allow for the generation of robust phylogenies, from which we will revise the taxonomy. Specimens will be vouchered at the SIO Benthic Invertebrate Collection and biodiversity information curated on the Encyclopedia of Life. Sequencing (transcriptomes and targeted capture of DNA) A targeted capture approach will be used with an appropriate number of loci (see below) to generate robust phylogenies of Sabellidae and Terebelliformia. Two different sets of targets will be designed from transcriptomes, across Sabellidae and Terebelliformia repectively. We have already generated new transcriptomes for nine sabellids, a serpulid and a fabriciid (bold terminals in Fig. 4), and nine Terebelliformia (bold terminals in Fig. 5) and combined this data with the few publicly available transcriptomes. Based on direct sequencing data already obtained for several genes for Sabellidae and Terebelliformia [Rouse in prep.; Stiller et al. in prep.], our sampling arguably spans the extant diversity of each clade. Several more transcriptomes will be needed to ensure the targeted capture methods will be robust. Methods for generating these transcriptomes will follow those previously used by us[6, 34]. How many targets are needed? A targeted capture pipeline typically starts from a large number of loci sequenced from a small number of species using genome-wide approaches (e.g., transcriptomics)[7, 8, 35]. This initial dataset is then used to select a smaller subset of loci for the targeted capture phase, which involves a larger set of species. To reduce the cost and effort, one would want to minimize the number of loci in the second phase, as long as sufficient loci are selected to confidently resolve relationships of interest. To date the number of loci has been determined in an ad hoc manner, either by the ultra conserved elements discovered[7, 8] or limitations of the targeted capture technology[13]. Initial analyses of our two datasets demonstrates how the number of
Fig. 3. Impact of number of genes (colors) and amount of ILS (x-axis) on support (posterior probability) for a branch (y-axis). Under the coalescent model, for four species separated with an unrooted branch of length d, the probability of a gene tree being identical to the species tree is 1-2/3e-d (we use this as a measure of ILS). We measure support using local pp [31] for various numbers of gene trees (assuming no gene tree estimation error). More ILS (ie., low 1-2/3e-d) requires more genes to get high support.
quartet frequency
Increased number of genes (m ) ⇒ increased support
Decreased discordance ⇒ increased support
23
localPP is more accurate than bootstrapping
250 500
1000 1500
0.00
0.25
0.50
0.75
1.00
0.00
0.25
0.50
0.75
1.00
0.00 0.25 0.50 0.75 1.000.00 0.25 0.50 0.75 1.00
False Positive Rate
Recall
MLBS Local PP
250 500
1000 1500
0.00
0.25
0.50
0.75
1.00
0.00
0.25
0.50
0.75
1.00
0.00 0.25 0.50 0.75 1.000.00 0.25 0.50 0.75 1.00
False Positive Rate
Recall
MLBS Local PP
250 500
1000 1500
0.00
0.25
0.50
0.75
1.00
0.00
0.25
0.50
0.75
1.00
0.00 0.25 0.50 0.75 1.000.00 0.25 0.50 0.75 1.00
False Positive Rate
Recall
MLBS Local PP
250 500
1000 1500
0.00
0.25
0.50
0.75
1.00
0.00
0.25
0.50
0.75
1.00
0.00 0.25 0.50 0.75 1.000.00 0.25 0.50 0.75 1.00
False Positive Rate
Recall
MLBS Local PP
Avian simulated dataset (48 taxa, 1000 genes)
the ROC curves show (fig. 5), for the same number offalse positives branches, local posterior probabilities result inbetter recall than MLBS. This pattern is more pronouncedfor shorter alignments, which have increased gene treeerror. For example, for the 250 bp model condition, if wechoose a support threshold that results in 0.01 FPR, with localposterior values, we still recover 84% of correct branches,whereas with MLBS, the same FPR results in retaining70% of correct branches. Thus, for a desired level of precision,better recall can be obtained using local posteriorprobabilities.
Branch LengthBranch length accuracy on the avian dataset was a function ofgene tree estimation error whether ASTRAL or MP-EST wasused (table 4). With true gene trees, branch length log error
FIG. 4. Branch length accuracy on the A-200 dataset with Medium ILS. See supplementary figures S7 and S8, Supplementary Material online forlow and high ILS. The estimated branch length is plotted against the true branch length in log scale (base 10). Blue line: a fitted generalizedadditive model with smoothing (Wood 2011).
Table 3. Branch Length Accuracy on the A-200 Dataset.
Dataset n Log Err RMSE
True gt Est. gt True gt Est. gt
Low ILS 1,000 0.10 0.42 5.57 6.75Low ILS 200 0.16 0.44 6.22 6.99Low ILS 50 0.25 0.48 6.84 7.29Med ILS 1,000 0.03 0.20 0.22 0.86Med ILS 200 0.07 0.22 0.44 0.91Med ILS 50 0.13 0.26 0.74 1.05High ILS 1,000 0.06 0.15 0.03 0.13High ILS 200 0.11 0.18 0.07 0.15High ILS 50 0.18 0.24 0.14 0.19
NOTE.—Logarithmic error (Log Err) and RMSE are shown for true species treesscored with true gene trees or estimated gene trees (Est. gt).
250 500
1000 1500
0.00
0.25
0.50
0.75
1.00
0.00
0.25
0.50
0.75
1.00
0.00 0.25 0.50 0.75 1.000.00 0.25 0.50 0.75 1.00False Positive Rate
Rec
all
MLBS Local PP
FIG. 5. ROC curve for the avian dataset based on MLBS and local PPPPsupport values. Boxes show different numbers of sites per gene (con-trolling gene tree estimation error).
Fast Coalescent-Based Support Computation . doi:10.1093/molbev/msw079 MBE
9
by guest on May 28, 2016
http://mbe.oxfordjournals.org/
Dow
nloaded from
[Sayyari and Mirarab, MBE, 2016]
100Xfaster
ASTRAL can also estimate internal branch lengths
24
True gene trees
1500 1000
500 250
−7.5
−5.0
−2.5
0.0
2.5
−7.5
−5.0
−2.5
0.0
2.5
−7.5 −5.0 −2.5 0.0 2.5 −7.5 −5.0 −2.5 0.0 2.5true branch length (log scale)
estim
ated
bra
nch
leng
th (l
og s
cale
)
−7.5
−5.0
−2.5
0.0
2.5
−7.5 −5.0 −2.5 0.0 2.5true branch length (log scale)
estim
ated
bra
nch
leng
th (l
og s
cale
)
With true gene trees, ASTRAL correctly estimates BL
ASTRAL can also estimate internal branch lengths
24
True gene trees
1500 1000
500 250
−7.5
−5.0
−2.5
0.0
2.5
−7.5
−5.0
−2.5
0.0
2.5
−7.5 −5.0 −2.5 0.0 2.5 −7.5 −5.0 −2.5 0.0 2.5true branch length (log scale)
estim
ated
bra
nch
leng
th (l
og s
cale
)
−7.5
−5.0
−2.5
0.0
2.5
−7.5 −5.0 −2.5 0.0 2.5true branch length (log scale)
estim
ated
bra
nch
leng
th (l
og s
cale
)
was only 0.06, corresponding to branches that are about 14%shorter or longer than the true branch. As gene tree estima-tion error increases with reduced number of sites (see table 1for gene tree error statistics), the branch length error also
increases. Thus, while 1,500 bp genes give 0.17 log error,250 bp genes result in 0.59 error, which corresponds tobranches that are on average 3.9 times too short or long.Moreover, unlike true gene trees, the error in branch lengthsestimated based on estimated gene trees is biased towardunderestimation (fig. 6), a pattern that increases in intensitywith shorter alignments.
ASTRAL and MP-EST have similar branch length accuracymeasured by log error for highly accurate gene trees, butASTRAL has an advantage with increased gene tree error(supplementary fig. S10, Supplementary Material online andtable 4). Error measured by root mean squared error (RMSE)(which emphasizes the accuracy of long branches) is compa-rable for the two methods, but MP-EST has a slight advantagegiven accurate gene trees.
Biological DatasetsFor each biological dataset, we show MLBS support and thelocal posterior probabilities, computed based on RAxML genetrees available from respective publications. We also collapsegene tree branches with <33% bootstrap support and usethese collapsed gene trees to draw local posterior probabili-ties. For ease of discussion, we show local posterior probabil-ities as percentages and refer to them simply as posterior orcollapsed posterior (for values based on collapsed gene trees).We discuss the confidence in important branches in eachtree.
1KP. Three of the key relationships studied by Wickett et al.(2014) are the sister branch to land plants, the base of theangiosperms, and the relationship among Bryophytes (horn-worts, liverworts, and mosses). In the ASTRAL tree, manybranches have full support regardless of the measure of thesupport used, but the remaining branches reveal interestingpatterns (fig. 7). The sister relationship betweenZygnematales and land plants receives a moderate 80% BS,but has 100% posterior. Wickett et al. (2014) also recoveredthis relationship by concatenation of various data partitions.There are 12 other branches that have collapsed posteriorsthat are at least 10% higher than BS (fig. 7); no branch hassubstantially higher BS than collapsed posterior. Collapsedposterior for monophyly of Bryophytes and for Amborellaas sister to other angiosperms are 100% (compared with97% and 93% BS, respectively).
When we collapse low-support branches in gene trees,posterior goes up for several branches: nine branches haveimprovements of 10% or more, and only two branches havecomparable reductions. An interesting case is Coleochaetalesas sister to Zygnematalesþ land plants, which has only 62%BS and 61% posterior, but has 100% collapsed posterior.Finally, note that several branches have low posterior, evenafter collapsing.
Our estimated branch lengths are short for several nodes.For example, the branch that unites (ChloranthalesþMagnoliids) and Eudicots has a length of 0.14 in coalescentunits. Other branches that have been historically hard to re-cover also tend to have short branches; however, these arenot necessarily extremely short branches that would
FIG. 6. ASTRAL branch length accuracy on the avian dataset. Logtransformed estimated branch lengths are shown versus true branchlengths, and a generalized additive model is fitted to the data. Onebranch with length 10"6 is trimmed out here, but full results, includ-ing MP-EST, is shown in supplementary figure S9, SupplementaryMaterial online.
Table 4. Branch Length Accuracy for the Avian Dataset.
No. of sites Log Err RMSE
ASTRAL MPEST ASTRAL MPEST
True gt. 0.06 (0.10) 0.07 (0.11) 0.44 (0.44) 0.30 (0.30)1,500 0.17 (0.20) 0.14 (0.18) 0.83 (0.83) 0.70 (0.70)1,000 0.22 (0.27) 0.22 (0.25) 1.08 (1.07) 1.01 (1.00)500 0.37 (0.42) 0.42 (0.46) 1.65 (1.64) 1.65 (1.64)250 0.59 (0.63) 0.81 (0.84) 2.25 (2.24) 2.28 (2.26)
NOTE.—Logarithmic error and root mean squared error are shown for true speciestrees scored with true gene trees or estimated gene trees with various numbers ofsites using ASTRAL and MP-EST. An extremely short branch with length 10"6 wasremoved from the calculations, but error including that branch is shownparenthetically.
Sayyari and Mirarab . doi:10.1093/molbev/msw079 MBE
10
by guest on May 28, 2016
http://mbe.oxfordjournals.org/
Dow
nloaded from
low gene tree error
High gene tree error
Moderate g.t. error
Medium g.t. error
With true gene trees, ASTRAL correctly estimates BLWith error-prone estimated gene trees, ASTRAL underestimates BL
Running time complexity & ASTRAL-III
(unpublished work)
25
Maryam Rabiee Hashemi Chao ZhangJohn Yin
ASTRAL-III new features• Improved running time with a single CPU
• We have ~3-5X running time improvement, and guaranteed polynomial running time
• ~8 hours for 1000 genes, 1000 species
• Handling datasets with multiple individuals
• GPU and CPU parallelism
26
Parallelism
Can infer trees with 10,000 species & 400 genes in a day27
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
GPU (1000 species)
GPU (500 species)
0
5
10
15
20
25
1 2 4 8 12 16 20 24# cores
Spee
d−up
●
●
1000 species
500 species
# CPU cores (comet)
Multiple individuals• What if we sample multiple
individuals from each species?
• In recently diverged species individuals may have different trees for each gene
• Sampling multiple individuals may provide useful information
• The ASTRAL approach can be extended to these types of data
28
Multiple individuals helpful?
29
0.0
0.1
0.2
0.3
Variable effort500 genes; 200 species
Fixed effortgenes x inds = 1000; 200 species
Fixed effort500 genes; species x inds = 200
Spec
ies
tree
erro
r (R
F di
stan
ce) 1 ind.
2 ind.
5 ind.
Yes, it marginally helps accuracy
Simulations with very shallow trees (500K generations in total)
Multiple individuals helpful?
29
0.0
0.1
0.2
0.3
Variable effort500 genes; 200 species
Fixed effortgenes x inds = 1000; 200 species
Fixed effort500 genes; species x inds = 200
Spec
ies
tree
erro
r (R
F di
stan
ce) 1 ind.
2 ind.
5 ind.
Yes, it marginally helps accuracy
But not if sequencing effort is kept fixed
Simulations with very shallow trees (500K generations in total)
Multiple individuals helpful?
29
0.0
0.1
0.2
0.3
Variable effort500 genes; 200 species
Fixed effortgenes x inds = 1000; 200 species
Fixed effort500 genes; species x inds = 200
Spec
ies
tree
erro
r (R
F di
stan
ce) 1 ind.
2 ind.
5 ind.
Yes, it marginally helps accuracy
But not if sequencing effort is kept fixed
Simulations with very shallow trees (500K generations in total)
Impact of fragmentary data
30
Erfan Sayyari James B. Whitfield
Two types of missing data• Missing genes: a taxon
misses the gene fully (taxon occupancy)
• Fragmentary data: the species is partially sequenced for some genes
31
Figure from Hosner et al, 2016
Does missing data matter?• Missing genes:
• Evidence suggests summary methods are often robust
32
Does missing data matter?• Missing genes:
• Evidence suggests summary methods are often robust
• Fragmentary data:
• Less extensively studied
• Hosner et al. indicated:
• Fragments matter
• They suggest removing genes with many fragments
32
Our study• Empirical data:
• An insect dataset of 144 species and 1478 genes
• 600 million years of evolution
• Transcriptomics, so plenty of missing data of both types
• Simulated data:
• 100 taxon, medium ILS, 1000 genes
• Added fragmentation with patterns similar to the insect data
33
Fragmentary data hurt gene trees and species trees
34
●
●
0.25
0.50
0.75
1.00
no−filtering 10 20 25 33 50 66 75 80 orig−seqFiltering method
NR
F D
ista
nce
with fragments
without fragments
Species tree error: 2.8% errorG
ene
tree
erro
r
Fragmentary data hurt gene trees and species trees
Fragmentary data dramatically increase
gene tree and the species tree error
34
●
●
0.25
0.50
0.75
1.00
no−filtering 10 20 25 33 50 66 75 80 orig−seqFiltering method
NR
F D
ista
nce
with fragments
without fragments
Species tree error: 2.8% errorG
ene
tree
erro
r
Species tree error: 7.0% error
How do we solve this?• Hosner et al. suggested removing entire genes.
• This is too conservative in our opinion.
35
How do we solve this?• Hosner et al. suggested removing entire genes.
• This is too conservative in our opinion.
• We simply remove the fragmentary sequence from the gene and keep the rest of the gene
• What is fragmentary?
• Anything that has at most X% of sites
• We study different X values
35
Filtering helps to some level (simulations)
36
●●
●●
●
●● ● ●
●
0.25
0.50
0.75
1.00
no−filtering 10 20 25 33 50 66 75 80 orig−seqFiltering method
NR
F D
ista
nce
Gene tree error Species tree error
How about real data?
37
38
no-filtering
Dipt
era
Mec
opte
ra
Siph
onap
tera
Lepi
dopt
era
Trich
opte
ra
Cole
opte
ra
Stre
psip
tera
Neur
opte
rida
Hym
enop
tera
Psoc
odea
Hem
ipte
ra
Thys
anop
tera
Blat
tode
a
Man
tode
a
Gry
llobl
atto
dea
Orth
opte
raPh
asm
ida
Embi
idin
a
Plec
opte
ra
Derm
apte
ra
Ephe
mer
opte
ra
Odo
nata
Thys
anur
a
Arch
aeog
nath
a
Dipl
ura
Colle
mbo
laIm
porta
nt
Ingr
oup A B C
B/C D
D/B/
C E F G H I J K L M N P
0.00
0.25
0.50
0.75
1.00
0.00
0.25
0.50
0.75
1.00
Prop
ortio
n of
rele
vant
gen
e tre
es
Strongly Supported Weakly Supported Weakly Reject Strongly Reject
66% filtering
Gene trees seem to improve (less gene tree discordance)
38
no-filtering
Dipt
era
Mec
opte
ra
Siph
onap
tera
Lepi
dopt
era
Trich
opte
ra
Cole
opte
ra
Stre
psip
tera
Neur
opte
rida
Hym
enop
tera
Psoc
odea
Hem
ipte
ra
Thys
anop
tera
Blat
tode
a
Man
tode
a
Gry
llobl
atto
dea
Orth
opte
raPh
asm
ida
Embi
idin
a
Plec
opte
ra
Derm
apte
ra
Ephe
mer
opte
ra
Odo
nata
Thys
anur
a
Arch
aeog
nath
a
Dipl
ura
Colle
mbo
laIm
porta
nt
Ingr
oup A B C
B/C D
D/B/
C E F G H I J K L M N P
0.00
0.25
0.50
0.75
1.00
0.00
0.25
0.50
0.75
1.00
Prop
ortio
n of
rele
vant
gen
e tre
es
Strongly Supported Weakly Supported Weakly Reject Strongly Reject
66% filtering
Gene trees seem to improve (less gene tree discordance)
The species tree also improves
39
IngroupABC
B/CD
D/B/CEFGHIJKLMNP
no filter
ing
20
25 33 50
66 75 80
Supported0% 50% 100%
Weakly Rejected(<95%)
StronglyRejected(>=95%)
b)
• ASTRAL trees differed with concatenation initially
• After filtering, ASTRAL and concatenation were very similar
• Differences on controversial nodes
ASTRAL …• Reconstructs species trees from gene trees with both
• high accuracy
• scalability to large datasets • robust to some levels of model violations
• Like any other summary method, remains sensitive to gene tree estimation error
• try best to obtain highly accurate gene trees
• removing fragmentary data seems to help
40
Future of ASTRAL• Can it be changed to use characters directly?
possible but slow for binary characters …
• More broadly, can alternative scalable methods be developed for better gene tree estimation?
41
Future of ASTRAL• Can it be changed to use characters directly?
possible but slow for binary characters …
• More broadly, can alternative scalable methods be developed for better gene tree estimation?
• ASTRAL scales to 10K leaves. We have 90K bacterial genomes.
• Can we scale further?
41
S.M. Bayzid Théo Zimmermann
Erfan Sayyari Shubhanshu Shekhar
Maryam Rabiee Hashemi
Chao ZhangTandy Warnow
John Yin
James B. Whitfield
Theoretical sample complexity results
How many genes are enough to reconstruct the tree?
43
●
●
●
●
●
●●
●
●
●
●
●
●
●
●●●●●● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●●
●●
●●
●●
●●
●●
●●
●●
●●
●
0
10000
20000
30000
40000
50000
60000
70000
010
020
030
0
1 f
# ge
nes
ε0.050.10.20.33
●
●
●
BalancedCaterpillarTheoretical
Gene trees seem to improve (bootstrap)
44
c)
●
●●
●
●
● ● ●
●
●
●
●
●
●
● ●
●
●●
●
●
●●
●
●
●●
●
●
●● ●
10%
20%
30%
40%
50%
no−filtering20 25 33 50 66 75 80Fragment filtering threshold
Perc
ent
Bootstrap support ● ● ● ●=0 <=33 >=75 =100
d)
Gene trees seem to improve (less gene tree discordance)
45
e)
● ●●
●● ● ●
●
0.00
0.25
0.50
0.75
1.00
no-filtering 20 25 33 50 66 75 90Filtering method
FN d
istan
ce b
etwe
en s
pecie
s an
d ge
ne tr
ees
0.6 0.7 0.8 0.9
occupancy
Impact of gene tree error (using true gene trees)
46
Spec
ies
tree
topo
logi
cal
erro
r (FN
)
10M 2M 500K
0.0
0.1
0.2
0.3
0.4
0.0
0.1
0.2
0.3
0.4
1e−061e−07
50 200 1000 50 200 1000 50 200 1000genes
Spec
ies
tree
topo
logi
cal e
rror (
RF)
ASTRAL−II ASTRAL−II (true gt) CA−ML
Figure S6: Comparison of ASTRAL-II run on estimated and true gene trees with 200 taxa and
varying tree shapes and number of genes.
9
Medium ILSLow ILS High ILS
10M 2M 500K
0.0
0.1
0.2
0.3
0.4
0.0
0.1
0.2
0.3
0.41e−06
1e−07
50 200 1000 50 200 1000 50 200 1000genes
Spec
ies
tree
topo
logi
cal e
rror (
RF)
ASTRAL−II ASTRAL−II (true gt) CA−ML
Figure S6: Comparison of ASTRAL-II run on estimated and true gene trees with 200 taxa and
varying tree shapes and number of genes.
9
Impact of gene tree error (using true gene trees)
• When we divide our 50 replicates into low, medium, or high gene tree estimation error, ASTRAL tends to be better with low error
46
Spec
ies
tree
topo
logi
cal
erro
r (FN
)
10M 2M 500K
0.0
0.1
0.2
0.3
0.4
0.0
0.1
0.2
0.3
0.4
1e−061e−07
50 200 1000 50 200 1000 50 200 1000genes
Spec
ies
tree
topo
logi
cal e
rror (
RF)
ASTRAL−II ASTRAL−II (true gt) CA−ML
Figure S6: Comparison of ASTRAL-II run on estimated and true gene trees with 200 taxa and
varying tree shapes and number of genes.
9
Medium ILSLow ILS High ILS
10M 2M 500K
0.0
0.1
0.2
0.3
0.4
0.0
0.1
0.2
0.3
0.41e−06
1e−07
50 200 1000 50 200 1000 50 200 1000genes
Spec
ies
tree
topo
logi
cal e
rror (
RF)
ASTRAL−II ASTRAL−II (true gt) CA−ML
Figure S6: Comparison of ASTRAL-II run on estimated and true gene trees with 200 taxa and
varying tree shapes and number of genes.
9
ASTRAL-I versus ASTRAL-II
47
10M 2M 500K
0.0
0.2
0.4
0.6
0.8
0.0
0.2
0.4
0.6
0.8
1e−061e−07
50 200 1000 50 200 1000 50 200 1000genes
Spec
ies
tree
topo
logi
cal e
rror (
RF)
ASTRAL−I ASTRAL−II ASTRAL−II + true st
10M 2M 500K
0
3
6
9
12
0
3
6
9
12
1e−061e−07
50 200 1000 50 200 1000 50 200 1000genes
Run
ning
tim
e (h
ours
)
ASTRAL−I ASTRAL−II ASTRAL−II + true st
Figure S2: Comparison of various variants of ASTRAL with 200 taxa and varying tree shapes
and number of genes. Species tree accuracy (top) and running times (bottom) are shown. ASTRAL-II +true st shows the case where the true species tree is added to the search space; this is included to approximatean ideal (e.g. exact) solution to the quartet problem.
5
Medium ILSLow ILS High ILS
Spec
ies
tree
topo
logi
cal
erro
r (FN
)
10M 2M 500K
0.0
0.2
0.4
0.6
0.8
0.0
0.2
0.4
0.6
0.8
1e−061e−07
50 200 1000 50 200 1000 50 200 1000genes
Spec
ies
tree
topo
logi
cal e
rror (
RF)
ASTRAL−I ASTRAL−II ASTRAL−II + true st
10M 2M 500K
0
3
6
9
12
0
3
6
9
12
1e−061e−07
50 200 1000 50 200 1000 50 200 1000genes
Run
ning
tim
e (h
ours
)
ASTRAL−I ASTRAL−II ASTRAL−II + true st
Figure S2: Comparison of various variants of ASTRAL with 200 taxa and varying tree shapes
and number of genes. Species tree accuracy (top) and running times (bottom) are shown. ASTRAL-II +true st shows the case where the true species tree is added to the search space; this is included to approximatean ideal (e.g. exact) solution to the quartet problem.
5200 species, deep ILS
ASTRAL-I versus ASTRAL-II
47
10M 2M 500K
0.0
0.2
0.4
0.6
0.8
0.0
0.2
0.4
0.6
0.8
1e−061e−07
50 200 1000 50 200 1000 50 200 1000genes
Spec
ies
tree
topo
logi
cal e
rror (
RF)
ASTRAL−I ASTRAL−II ASTRAL−II + true st
10M 2M 500K
0
3
6
9
12
0
3
6
9
12
1e−061e−07
50 200 1000 50 200 1000 50 200 1000genes
Run
ning
tim
e (h
ours
)
ASTRAL−I ASTRAL−II ASTRAL−II + true st
Figure S2: Comparison of various variants of ASTRAL with 200 taxa and varying tree shapes
and number of genes. Species tree accuracy (top) and running times (bottom) are shown. ASTRAL-II +true st shows the case where the true species tree is added to the search space; this is included to approximatean ideal (e.g. exact) solution to the quartet problem.
5
Medium ILSLow ILS High ILS
Spec
ies
tree
topo
logi
cal
erro
r (FN
)
10M 2M 500K
0.0
0.2
0.4
0.6
0.8
0.0
0.2
0.4
0.6
0.8
1e−061e−07
50 200 1000 50 200 1000 50 200 1000genes
Spec
ies
tree
topo
logi
cal e
rror (
RF)
ASTRAL−I ASTRAL−II ASTRAL−II + true st
10M 2M 500K
0
3
6
9
12
0
3
6
9
12
1e−061e−07
50 200 1000 50 200 1000 50 200 1000genes
Run
ning
tim
e (h
ours
)
ASTRAL−I ASTRAL−II ASTRAL−II + true st
Figure S2: Comparison of various variants of ASTRAL with 200 taxa and varying tree shapes
and number of genes. Species tree accuracy (top) and running times (bottom) are shown. ASTRAL-II +true st shows the case where the true species tree is added to the search space; this is included to approximatean ideal (e.g. exact) solution to the quartet problem.
5200 species, deep ILS
Dynamic programming
48
A L-A
L
Dynamic programming
48
A
A-A’A’A-A’A’A-A’A’
L-A
• Recursively break subsets of species into smaller subsets
L
Dynamic programming
48
A
A-A’A’A-A’A’A-A’A’
L-A
A-A’
A’
L-A
• Recursively break subsets of species into smaller subsets
• : Compare u against input gene trees and compute quartets from gene trees satisfied by u
u
L
w(u)
Dynamic programming
48
A
A-A’A’A-A’A’A-A’A’
L-A
A-A’
A’
L-A
• Recursively break subsets of species into smaller subsets
• : Compare u against input gene trees and compute quartets from gene trees satisfied by u
Exponential
u
L
w(u)
Constrained version
49
A
A-A’A’A-A’A’A-A’A’
L-A
L
A1
A4
A8
A5
A2
A6
A3A7
A9
A10
Constrained version
49
A
A-A’A’A-A’A’A-A’A’
L-A
L
A1
A4
A8
A5
A2
A6
A3A7
A9
A10
• Restrict “branches” in the species tree to a given constraint set X.
Deep ILS
50
1000 genes 200 genes 50 genes
10%
20%
30%
10M 2M 500K 10M 2M 500K 10M 2M 500Ktree length (controls the amount of ILS)
Spec
ies
tree
topo
logi
cal e
rror (
FN)
ASTRAL−IINJstCA−ML
200 species, simulated species trees, deep ILS [Mirarab and Warnow, ISMB, 2015]