57
ASTRAL: Fast coalescent-based computation of the species tree topology, branch lengths, and local branch support Siavash Mirarab University of California, San Diego Orangutan Gorilla Chimpanzee Human Joint work with Tandy Warnow Erfan Sayyari

ASTRAL: Fast coalescent-based computation of the species ...eceweb.ucsd.edu/~smirarab/assets/evolution-workshop.pdfASTRAL-I [Mirarab, et al., Bioinformatics, 2014] • ASTRAL solves

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

Page 1: ASTRAL: Fast coalescent-based computation of the species ...eceweb.ucsd.edu/~smirarab/assets/evolution-workshop.pdfASTRAL-I [Mirarab, et al., Bioinformatics, 2014] • ASTRAL solves

ASTRAL: Fast coalescent-based computation of the species tree topology, branch lengths, and local branch support

Siavash Mirarab University of California, San Diego

OrangutanGorilla ChimpanzeeHuman

Joint work with Tandy Warnow Erfan Sayyari

Page 2: ASTRAL: Fast coalescent-based computation of the species ...eceweb.ucsd.edu/~smirarab/assets/evolution-workshop.pdfASTRAL-I [Mirarab, et al., Bioinformatics, 2014] • ASTRAL solves

Gene tree discordance

2

OrangutanGorilla ChimpHuman

The species tree

A gene treeOrang.Gorilla ChimpHuman Orang.Gorilla Chimp Human

Causes of gene tree discordance include:• Incomplete Lineage Sorting (ILS) • Duplication and loss • Horizontal Gene Transfer (HGT)

gene1000gene 1

“gene”: recombination-free

orthologous stretches of the genome

Page 3: ASTRAL: Fast coalescent-based computation of the species ...eceweb.ucsd.edu/~smirarab/assets/evolution-workshop.pdfASTRAL-I [Mirarab, et al., Bioinformatics, 2014] • ASTRAL solves

Incomplete Lineage Sorting (ILS)• A random process related to the

coalescence of alleles across various populations

3

Tracing alleles through generations

Page 4: ASTRAL: Fast coalescent-based computation of the species ...eceweb.ucsd.edu/~smirarab/assets/evolution-workshop.pdfASTRAL-I [Mirarab, et al., Bioinformatics, 2014] • ASTRAL solves

Incomplete Lineage Sorting (ILS)• A random process related to the

coalescence of alleles across various populations

3

Tracing alleles through generations

Page 5: ASTRAL: Fast coalescent-based computation of the species ...eceweb.ucsd.edu/~smirarab/assets/evolution-workshop.pdfASTRAL-I [Mirarab, et al., Bioinformatics, 2014] • ASTRAL solves

Incomplete Lineage Sorting (ILS)• A random process related to the

coalescence of alleles across various populations

• Omnipresent; most likely for short branches or large population sizes

3

Tracing alleles through generations

Page 6: ASTRAL: Fast coalescent-based computation of the species ...eceweb.ucsd.edu/~smirarab/assets/evolution-workshop.pdfASTRAL-I [Mirarab, et al., Bioinformatics, 2014] • ASTRAL solves

Incomplete Lineage Sorting (ILS)• A random process related to the

coalescence of alleles across various populations

• Omnipresent; most likely for short branches or large population sizes

• Multi-species coalescent: models ILS

• The species tree defines the probability distribution on gene trees, and is identifiable from the distribution on gene tree topologies [Degnan and Salter, Int. J. Org. Evolution, 2005]

3

Tracing alleles through generations

Page 7: ASTRAL: Fast coalescent-based computation of the species ...eceweb.ucsd.edu/~smirarab/assets/evolution-workshop.pdfASTRAL-I [Mirarab, et al., Bioinformatics, 2014] • ASTRAL solves

6

gene 2

ACTGCACACCG ACTGC-CCCCG AATGC-CCCCG -CTGCACACGG

CTGAGCATCG CTGAGC-TCG ATGAGC-TC- CTGA-CAC-G

CAGGCACGCACGAA AGC-CACGC-CATA ATGGCACGC-C-TA AGCTAC-CACGGAT

gene 1000

gene 1

Summary method

Orangutan

Gorilla

Chimpanzee

Human

Approach 2: Summary methods

Approach 1: ConcatenationOrangutan

Gorilla

Chimpanzee

Human

Phylogeny inferenceACTGCACACCG

ACTGC-CCCCG AATGC-CCCCG -CTGCACACGG

CTGAGCATCG CTGAGC-TCG ATGAGC-TC- CTGA-CAC-G

CAGAGCACGCACGAA AGCA-CACGC-CATA ATGAGCACGC-C-TA AGC-TAC-CACGGAT

supermatrixgene 2 gene 1000gene 1

Gene tree estimation

Orang.

GorillaChimp

Humangene

1

Orang.

Gorilla Chimp

Humangene

2

Orang.

Gorilla

Chimp

Humangene

100

0

Multi-gene species tree estimation

4

Page 8: ASTRAL: Fast coalescent-based computation of the species ...eceweb.ucsd.edu/~smirarab/assets/evolution-workshop.pdfASTRAL-I [Mirarab, et al., Bioinformatics, 2014] • ASTRAL solves

6

gene 2

ACTGCACACCG ACTGC-CCCCG AATGC-CCCCG -CTGCACACGG

CTGAGCATCG CTGAGC-TCG ATGAGC-TC- CTGA-CAC-G

CAGGCACGCACGAA AGC-CACGC-CATA ATGGCACGC-C-TA AGCTAC-CACGGAT

gene 1000

gene 1

Summary method

Orangutan

Gorilla

Chimpanzee

Human

Approach 2: Summary methods

Approach 1: ConcatenationOrangutan

Gorilla

Chimpanzee

Human

Phylogeny inferenceACTGCACACCG

ACTGC-CCCCG AATGC-CCCCG -CTGCACACGG

CTGAGCATCG CTGAGC-TCG ATGAGC-TC- CTGA-CAC-G

CAGAGCACGCACGAA AGCA-CACGC-CATA ATGAGCACGC-C-TA AGC-TAC-CACGGAT

supermatrixgene 2 gene 1000gene 1

Gene tree estimation

Orang.

GorillaChimp

Humangene

1

Orang.

Gorilla Chimp

Humangene

2

Orang.

Gorilla

Chimp

Humangene

100

0

Multi-gene species tree estimation

4

Statistically inconsistent [Roch and Steel,2014]

Can be statistically consistent given true gene trees

Page 9: ASTRAL: Fast coalescent-based computation of the species ...eceweb.ucsd.edu/~smirarab/assets/evolution-workshop.pdfASTRAL-I [Mirarab, et al., Bioinformatics, 2014] • ASTRAL solves

6

gene 2

ACTGCACACCG ACTGC-CCCCG AATGC-CCCCG -CTGCACACGG

CTGAGCATCG CTGAGC-TCG ATGAGC-TC- CTGA-CAC-G

CAGGCACGCACGAA AGC-CACGC-CATA ATGGCACGC-C-TA AGCTAC-CACGGAT

gene 1000

gene 1

Summary method

Orangutan

Gorilla

Chimpanzee

Human

Approach 2: Summary methods

Approach 1: ConcatenationOrangutan

Gorilla

Chimpanzee

Human

Phylogeny inferenceACTGCACACCG

ACTGC-CCCCG AATGC-CCCCG -CTGCACACGG

CTGAGCATCG CTGAGC-TCG ATGAGC-TC- CTGA-CAC-G

CAGAGCACGCACGAA AGCA-CACGC-CATA ATGAGCACGC-C-TA AGC-TAC-CACGGAT

supermatrixgene 2 gene 1000gene 1

Gene tree estimation

Orang.

GorillaChimp

Humangene

1

Orang.

Gorilla Chimp

Humangene

2

Orang.

Gorilla

Chimp

Humangene

100

0

Multi-gene species tree estimation

4

Statistically inconsistent [Roch and Steel,2014]

Can be statistically consistent given true gene trees

STAR, STELLS, GLASS, BUCKy (population tree), MP-EST, NJst (ASTRID), … ASTRAL, ASTRAL-II

Page 10: ASTRAL: Fast coalescent-based computation of the species ...eceweb.ucsd.edu/~smirarab/assets/evolution-workshop.pdfASTRAL-I [Mirarab, et al., Bioinformatics, 2014] • ASTRAL solves

6

gene 2

ACTGCACACCG ACTGC-CCCCG AATGC-CCCCG -CTGCACACGG

CTGAGCATCG CTGAGC-TCG ATGAGC-TC- CTGA-CAC-G

CAGGCACGCACGAA AGC-CACGC-CATA ATGGCACGC-C-TA AGCTAC-CACGGAT

gene 1000

gene 1

Summary method

Orangutan

Gorilla

Chimpanzee

Human

Approach 2: Summary methods

Approach 1: ConcatenationOrangutan

Gorilla

Chimpanzee

Human

Phylogeny inferenceACTGCACACCG

ACTGC-CCCCG AATGC-CCCCG -CTGCACACGG

CTGAGCATCG CTGAGC-TCG ATGAGC-TC- CTGA-CAC-G

CAGAGCACGCACGAA AGCA-CACGC-CATA ATGAGCACGC-C-TA AGC-TAC-CACGGAT

supermatrixgene 2 gene 1000gene 1

Gene tree estimation

Orang.

GorillaChimp

Humangene

1

Orang.

Gorilla Chimp

Humangene

2

Orang.

Gorilla

Chimp

Humangene

100

0

Multi-gene species tree estimation

4

Statistically inconsistent [Roch and Steel,2014]

Can be statistically consistent given true gene trees

STAR, STELLS, GLASS, BUCKy (population tree), MP-EST, NJst (ASTRID), … ASTRAL, ASTRAL-II

There are also other approaches: co-estimation (e.g., *BEAST),

site-based (SVDQuartets)

Page 11: ASTRAL: Fast coalescent-based computation of the species ...eceweb.ucsd.edu/~smirarab/assets/evolution-workshop.pdfASTRAL-I [Mirarab, et al., Bioinformatics, 2014] • ASTRAL solves

Properties of quartet trees in presence of ILS

5

• For 4 species, the dominant quartet topology is the species tree [Allman, et al. 2010]

Orang.

Gorilla Chimp

Human Orang.

GorillaChimp

Human

Orang.

Gorilla

Chimp

Human

Dominant

30%p1 = 30%p2 = 40%p3 =

Orang.

GorillaChimp

Human Orang.

Gorilla Chimp

Human

Orang.

GorillaChimp

Human

Orang.

Gorilla

Chimp

Human

Orang.

Gorilla Chimp

Human

Orang.

Gorilla

Chimp

Human

Page 12: ASTRAL: Fast coalescent-based computation of the species ...eceweb.ucsd.edu/~smirarab/assets/evolution-workshop.pdfASTRAL-I [Mirarab, et al., Bioinformatics, 2014] • ASTRAL solves

Properties of quartet trees in presence of ILS

5

• For 4 species, the dominant quartet topology is the species tree [Allman, et al. 2010]

• For >4 species, the dominant topology may be different from the species tree [Degnan and Rosenberg, 2006]

1. Breakup input each gene tree into trees on 4 taxa (quartet trees)

2. Find all (4n) dominant quartet topologies

3. Combine dominant quartet treesOrang.

Gorilla Chimp

Human Orang.

GorillaChimp

Human

Orang.

Gorilla

Chimp

Human

Dominant

30%p1 = 30%p2 = 40%p3 =

Orang.

GorillaChimp

Human Orang.

Gorilla Chimp

Human

Orang.

GorillaChimp

Human

Orang.

Gorilla

Chimp

Human

Orang.

Gorilla Chimp

Human

Orang.

Gorilla

Chimp

Human ✓n

4

✓n

4

Page 13: ASTRAL: Fast coalescent-based computation of the species ...eceweb.ucsd.edu/~smirarab/assets/evolution-workshop.pdfASTRAL-I [Mirarab, et al., Bioinformatics, 2014] • ASTRAL solves

Properties of quartet trees in presence of ILS

5

• For 4 species, the dominant quartet topology is the species tree [Allman, et al. 2010]

• For >4 species, the dominant topology may be different from the species tree [Degnan and Rosenberg, 2006]

1. Breakup input each gene tree into trees on 4 taxa (quartet trees)

2. Find all (4n) dominant quartet topologies

3. Combine dominant quartet trees

• Alternative: weight 3(4n) quartet topology by their frequency and find the optimal tree

Orang.

Gorilla Chimp

Human Orang.

GorillaChimp

Human

Orang.

Gorilla

Chimp

Human

Dominant

30%p1 = 30%p2 = 40%p3 =

Orang.

GorillaChimp

Human Orang.

Gorilla Chimp

Human

Orang.

GorillaChimp

Human

Orang.

Gorilla

Chimp

Human

Orang.

Gorilla Chimp

Human

Orang.

Gorilla

Chimp

Human ✓n

4

✓n

4

✓n

4

Page 14: ASTRAL: Fast coalescent-based computation of the species ...eceweb.ucsd.edu/~smirarab/assets/evolution-workshop.pdfASTRAL-I [Mirarab, et al., Bioinformatics, 2014] • ASTRAL solves

Maximum Quartet Support Species Tree [Mirarab, et al., ECCB, 2014]

• Optimization Problem (suspected NP-Hard):

• Theorem: Statistically consistent under the multi-species coalescent model when solved exactly

6

Find the species tree with the maximum number of induced quartet trees shared with the collection of input gene trees

Set of quartet trees induced by T

a gene tree all input gene trees

Score(T ) =X

t2T|Q(T ) \Q(t)|

Page 15: ASTRAL: Fast coalescent-based computation of the species ...eceweb.ucsd.edu/~smirarab/assets/evolution-workshop.pdfASTRAL-I [Mirarab, et al., Bioinformatics, 2014] • ASTRAL solves

ASTRAL-I [Mirarab, et al., Bioinformatics, 2014]

• ASTRAL solves the problem exactly using dynamic programming:

• Exponential running time (feasible for <18 species)

7

Page 16: ASTRAL: Fast coalescent-based computation of the species ...eceweb.ucsd.edu/~smirarab/assets/evolution-workshop.pdfASTRAL-I [Mirarab, et al., Bioinformatics, 2014] • ASTRAL solves

ASTRAL-I [Mirarab, et al., Bioinformatics, 2014]

• ASTRAL solves the problem exactly using dynamic programming:

• Exponential running time (feasible for <18 species)

• Introduced a constrained version of the problem

• Draws the set of branches in the species tree from a given set X = {all bipartitions in all gene trees}

• Given many genes, each species tree branch likely appears in at least one of the gene trees

• Theorem: the constrained version remains statistically consistent

• Running time: for n species and k species

7

O(n2k|X |2)

Page 17: ASTRAL: Fast coalescent-based computation of the species ...eceweb.ucsd.edu/~smirarab/assets/evolution-workshop.pdfASTRAL-I [Mirarab, et al., Bioinformatics, 2014] • ASTRAL solves

ASTRAL-II [Mirarab and Warnow, Bioinformatics, 2015]

1. Faster calculation of the score function inside DP

• O(nk ) instead of O(n2k) for n species and k genes

8

O(nk|X |2) O(n2k|X |2)

Page 18: ASTRAL: Fast coalescent-based computation of the species ...eceweb.ucsd.edu/~smirarab/assets/evolution-workshop.pdfASTRAL-I [Mirarab, et al., Bioinformatics, 2014] • ASTRAL solves

ASTRAL-II [Mirarab and Warnow, Bioinformatics, 2015]

1. Faster calculation of the score function inside DP

• O(nk ) instead of O(n2k) for n species and k genes

2. Add extra bipartitions to the set X using heuristic approaches

• Resolving consensus trees by subsampling taxa

• Using quartet-based distances to find likely branches

• Complete incomplete gene trees

8

O(nk|X |2) O(n2k|X |2)

Page 19: ASTRAL: Fast coalescent-based computation of the species ...eceweb.ucsd.edu/~smirarab/assets/evolution-workshop.pdfASTRAL-I [Mirarab, et al., Bioinformatics, 2014] • ASTRAL solves

ASTRAL-II [Mirarab and Warnow, Bioinformatics, 2015]

1. Faster calculation of the score function inside DP

• O(nk ) instead of O(n2k) for n species and k genes

2. Add extra bipartitions to the set X using heuristic approaches

• Resolving consensus trees by subsampling taxa

• Using quartet-based distances to find likely branches

• Complete incomplete gene trees

3. Ability to take as input gene trees with polytomies

8

O(nk|X |2) O(n2k|X |2)

Page 20: ASTRAL: Fast coalescent-based computation of the species ...eceweb.ucsd.edu/~smirarab/assets/evolution-workshop.pdfASTRAL-I [Mirarab, et al., Bioinformatics, 2014] • ASTRAL solves

Simulation study

• Using SimPhy. Vary many parameters:

• Number of species: 10 – 1000

• Number of genes: 50 – 1000

• Amount of ILS: low, medium, high

• Deep versus recent speciation

• 11 model conditions (50 replicas each) with heterogenous gene tree error

• Compare to NJst, MP-EST, concatenation (CA-ML) using FastTree-II

• Evaluate accuracy using FN rate: the percentage of branches in the true tree that are missing from the estimated tree

9

Truegenetrees Sequencedata

Es�matedspeciestree

Finch Falcon Owl Eagle Pigeon

Es�matedgenetreesFinch Owl Falcon Eagle Pigeon

True(model)speciestree

ASTRAL-II

look at all pairs of leaves chosen each from one of the children ofu. For each such pair of leaves, there are

�u0

2

�quartet trees that put

that pair together, where u0 is the number of leaves outside the nodeu. This will examine each pair of nodes in each of the input k nodesexactly once and would therefore require O(n2k) computations.The final score can be normalized by the maximum number of inputquartet trees that include a pair of taxa.

Given the similarity matrix, we calculate an UPGMA tree andadd all its bipartitions to the set X. This heuristic adds relatively fewbipartitions, but the matrix is used in the next heuristic, which is ourmain addition mechanism.

Greedy: We estimate the greedy consensus of the gene trees atdifferent threshold levels (0, 1/100, 2/100, 5/100, 1/10, 1/4, 1/3).For each polytomy in each greedy consensus tree, we resolve thepolytomy in multiple ways and add bipartitions implied by thoseresolutions to the set X. First, we resolve the polytomy by applyingthe UPGMA algorithm to the similarity matrix, starting from theclades given by the polytomy. Then, we sample one taxon fromeach side of the ploytomy randomly, and use the greedy consensusof the gene trees restricted to this subsample to find a resolutionof the polytomy (we randomly resolve any multifunctions in thisgreedy consensus on indued subsample). We repeat this process atleast 10 times, but if the subsampled greedy consensus trees includesufficiently frequent bipartitions (defined as > 1%), we do morerounds of random sampling (we increase the number of iterationsby two every time this happens). For each random subsamplearound a polytomy, we also resolve it by calculating an UPGMAtree on the subsampled similarity matrix. Finally, for the two firstgreedy threshold values and the first 10 random subsamples, wealso use a third strategy that can potentially add a larger number ofbipartitions. For each subsampled taxon x, we resolve the polytomyas a caterpillar tree by sorting the remaining taxa according to theirsimilarity with x.

Gene tree polytomies: When gene trees include polytomies, wealso add new bipartitions to set X. We first compute the greedyconsensus of the input gene trees with threshold 0 and if thegreedy consensus has polytomies, we resolve them using UPGMA;we repeat this process twice to account for uncertainty in greedyconsensus estimation. Then, for each gene tree polytomy, we use thetwo resolved consensus trees to infer a resolution of the polytomyand we add the implied resolutions to set X.

3.3 Multi-furcating input gene trees

Extending ASTRAL to inputs that include polytomies requiressolving the weighted quartet tree problem when each node of theinput defines not a tripartition, but a multi-partition of the setof taxa. We start by a basic observation: every resolved quartettree induced by a gene tree maps to two nodes in the gene treeregardless of whether the gene tree is binary or not. In other words,induced quartet trees that map to only one node of the gene tree areunresolved. When maximizing the quartet support, these unresolvedgene tree quartet trees are inconsequential and need to be ignored.Now, consider a polytomy of degree d. There are

�d3

�ways to select

three sides of the polytomy. Each of these ways of selecting threesides defines a tripartition of a subset of taxa. Any selection of twotaxa from one side of this tripartition and one taxon from each of theremaining two sides still defines an induced resolved quartet tree,

0

5

10

15

20

0% 20% 40% 60% 80%RF distance (true species tree vs true gene trees)

dens

ity

rate1e−06 1e−07

tree height10M 2M 500K

(a) True gene tree discordance

0

1

2

3

4

0% 25% 50% 75% 100%RF distance (true vs estimated)

dens

ity

(b) Gene tree estimation error

Fig. 1. Characteristics of the simulation (a) RF distance between the truespecies tree and the true gene trees (50 replicates of 1000 genes) for DatasetI. Tree height directly affects the amount of true discordance; the speciationrate affects true gene tree discordance only with 10M tree length. (b) RFdistance between true gene trees and estimated gene trees for Dataset I. Seealso Figure S1 for inter and intra-replicate gene tree error distributions.

and each induced resolved quartet tree would still map to exactlytwo nodes in our multi-furcating tree. Thus, all the algorithmicassumptions of ASTRAL remain intact, as long as for each multi-furcating node in an input gene tree, we treat it as a collection of

�d3

tripartitions. Note that in the presence of polytomies, the runningtime analysis can change because analyzing each multi-furcatingnode requires time cubic in its degree and the degree can increase inprinciple with n. Thus, the running time depends on the patterns ofthe multi-furcations and cannot be studied in a general case.

Statistical Consistency: ASTRAL-I was statistically consistent, andchanges from ASTRAL-I to ASTRAL-II either affect running time,or enlarge the search space, which does not negate consistency.

Theorem 3: ASTRAL-II is statistically consistent for binarycomplete input gene trees.

4 EXPERIMENTAL SETUPSimulation Procedure: We used SimPhy, a tool developed by Malloet al. (2015), to simulate species trees and gene trees (producedin mutation units), and then used Indelible to simulate sequencesdown the gene trees with varying length and model parameters. Weestimated gene trees on these simulated gene alignments, which wethen used in coalescent-based analyses.

We simulated 10 model conditions, which we divide into twodatasets, with one model condition appearing in both datasets. We

3

Page 21: ASTRAL: Fast coalescent-based computation of the species ...eceweb.ucsd.edu/~smirarab/assets/evolution-workshop.pdfASTRAL-I [Mirarab, et al., Bioinformatics, 2014] • ASTRAL solves

Tree error, varying level of ILS

10

200 species, recent ILS

1000 genes 200 genes 50 genes

0%

10%

20%

30%

10M 2M 500K 10M 2M 500K 10M 2M 500Ktree length (controls the amount of ILS)

Spec

ies

tree

topo

logi

cal e

rror (

FN) ASTRAL−II

NJstCA−ML

more ILS more ILS more ILSL M H L M H L M H

Page 22: ASTRAL: Fast coalescent-based computation of the species ...eceweb.ucsd.edu/~smirarab/assets/evolution-workshop.pdfASTRAL-I [Mirarab, et al., Bioinformatics, 2014] • ASTRAL solves

Tree error, varying level of ILS

10

200 species, recent ILS

1000 genes 200 genes 50 genes

0%

10%

20%

30%

10M 2M 500K 10M 2M 500K 10M 2M 500Ktree length (controls the amount of ILS)

Spec

ies

tree

topo

logi

cal e

rror (

FN) ASTRAL−II

NJstCA−ML

more ILS more ILS more ILSL M H L M H L M H

Page 23: ASTRAL: Fast coalescent-based computation of the species ...eceweb.ucsd.edu/~smirarab/assets/evolution-workshop.pdfASTRAL-I [Mirarab, et al., Bioinformatics, 2014] • ASTRAL solves

Impact of gene tree error (using true gene trees)

11

Spec

ies

tree

topo

logi

cal

erro

r (FN

)

10M 2M 500K

0.0

0.1

0.2

0.3

0.4

0.0

0.1

0.2

0.3

0.4

1e−061e−07

50 200 1000 50 200 1000 50 200 1000genes

Spec

ies

tree

topo

logi

cal e

rror (

RF)

ASTRAL−II ASTRAL−II (true gt) CA−ML

Figure S6: Comparison of ASTRAL-II run on estimated and true gene trees with 200 taxa and

varying tree shapes and number of genes.

9

Medium ILSLow ILS High ILS

10M 2M 500K

0.0

0.1

0.2

0.3

0.4

0.0

0.1

0.2

0.3

0.41e−06

1e−07

50 200 1000 50 200 1000 50 200 1000genes

Spec

ies

tree

topo

logi

cal e

rror (

RF)

ASTRAL−II ASTRAL−II (true gt) CA−ML

Figure S6: Comparison of ASTRAL-II run on estimated and true gene trees with 200 taxa and

varying tree shapes and number of genes.

9

Page 24: ASTRAL: Fast coalescent-based computation of the species ...eceweb.ucsd.edu/~smirarab/assets/evolution-workshop.pdfASTRAL-I [Mirarab, et al., Bioinformatics, 2014] • ASTRAL solves

Impact of gene tree error (using true gene trees)

• When we divide our 50 replicates into low, medium, or high gene tree estimation error, ASTRAL tends to be better with low error

11

Spec

ies

tree

topo

logi

cal

erro

r (FN

)

10M 2M 500K

0.0

0.1

0.2

0.3

0.4

0.0

0.1

0.2

0.3

0.4

1e−061e−07

50 200 1000 50 200 1000 50 200 1000genes

Spec

ies

tree

topo

logi

cal e

rror (

RF)

ASTRAL−II ASTRAL−II (true gt) CA−ML

Figure S6: Comparison of ASTRAL-II run on estimated and true gene trees with 200 taxa and

varying tree shapes and number of genes.

9

Medium ILSLow ILS High ILS

10M 2M 500K

0.0

0.1

0.2

0.3

0.4

0.0

0.1

0.2

0.3

0.41e−06

1e−07

50 200 1000 50 200 1000 50 200 1000genes

Spec

ies

tree

topo

logi

cal e

rror (

RF)

ASTRAL−II ASTRAL−II (true gt) CA−ML

Figure S6: Comparison of ASTRAL-II run on estimated and true gene trees with 200 taxa and

varying tree shapes and number of genes.

9

Page 25: ASTRAL: Fast coalescent-based computation of the species ...eceweb.ucsd.edu/~smirarab/assets/evolution-workshop.pdfASTRAL-I [Mirarab, et al., Bioinformatics, 2014] • ASTRAL solves

Impact of HGT [Davidson, et al., BMC-genomics, 2015]

• Random HGT, with 50 species, moderate levels of ILS

• ASTRAL does well on ILS+HGT

0.1

0.2

0.3

10 50 100 200 400 1000number of genes

spec

ies

tee

erro

r

HGT rate(expected transfers per gene)

00.82820

Page 26: ASTRAL: Fast coalescent-based computation of the species ...eceweb.ucsd.edu/~smirarab/assets/evolution-workshop.pdfASTRAL-I [Mirarab, et al., Bioinformatics, 2014] • ASTRAL solves

Going beyond the topology

• Branch length (BL):

• ASTRAL can compute BL in coalescent units for internal branches

• Simply a function of the amount of discordance

13

[Sayyari and Mirarab, MBE, 2016]

Page 27: ASTRAL: Fast coalescent-based computation of the species ...eceweb.ucsd.edu/~smirarab/assets/evolution-workshop.pdfASTRAL-I [Mirarab, et al., Bioinformatics, 2014] • ASTRAL solves

Going beyond the topology

• Branch length (BL):

• ASTRAL can compute BL in coalescent units for internal branches

• Simply a function of the amount of discordance

• Branch support:

• Multi-locus bootstrapping: — Slow (requires bootstrapping each gene)— Inaccurate [Mirarab et al., Sys bio, 2014; Bayzid et al., PLoS One, 2015]

• Local posterior probability: ASTRAL’s own support values that don’t require bootstrapping

13

[Sayyari and Mirarab, MBE, 2016]

Page 28: ASTRAL: Fast coalescent-based computation of the species ...eceweb.ucsd.edu/~smirarab/assets/evolution-workshop.pdfASTRAL-I [Mirarab, et al., Bioinformatics, 2014] • ASTRAL solves

Branch support: idea• Observing quartet topologies in

gene trees is just like tossing a three-sided unbalanced die

14

Orang.

Gorilla Chimp

Human Orang.

GorillaChimp

Human

Orang.

Gorilla

Chimp

Human

f3 = 63 f2 = 57 f1 = 80

Page 29: ASTRAL: Fast coalescent-based computation of the species ...eceweb.ucsd.edu/~smirarab/assets/evolution-workshop.pdfASTRAL-I [Mirarab, et al., Bioinformatics, 2014] • ASTRAL solves

Branch support: idea• Observing quartet topologies in

gene trees is just like tossing a three-sided unbalanced die

• P (a quartet tree seen f1 times in k gene trees is in the species tree) = P (a three-sided die tossed k times is biased towards a side that shows up f1 times)

14

Orang.

Gorilla Chimp

Human Orang.

GorillaChimp

Human

Orang.

Gorilla

Chimp

Human

f3 = 63 f2 = 57 f1 = 80

Page 30: ASTRAL: Fast coalescent-based computation of the species ...eceweb.ucsd.edu/~smirarab/assets/evolution-workshop.pdfASTRAL-I [Mirarab, et al., Bioinformatics, 2014] • ASTRAL solves

Branch support: idea• Observing quartet topologies in

gene trees is just like tossing a three-sided unbalanced die

• P (a quartet tree seen f1 times in k gene trees is in the species tree) = P (a three-sided die tossed k times is biased towards a side that shows up f1 times)

• Can be analytically solved (Bayesian with a prior)

14

Orang.

Gorilla Chimp

Human Orang.

GorillaChimp

Human

Orang.

Gorilla

Chimp

Human

f3 = 63 f2 = 57 f1 = 80

Page 31: ASTRAL: Fast coalescent-based computation of the species ...eceweb.ucsd.edu/~smirarab/assets/evolution-workshop.pdfASTRAL-I [Mirarab, et al., Bioinformatics, 2014] • ASTRAL solves

Branch support: idea• Observing quartet topologies in

gene trees is just like tossing a three-sided unbalanced die

• P (a quartet tree seen f1 times in k gene trees is in the species tree) = P (a three-sided die tossed k times is biased towards a side that shows up f1 times)

• Can be analytically solved (Bayesian with a prior)

• ML or MAP BL can be computed as a function of f1

14

Orang.

Gorilla Chimp

Human Orang.

GorillaChimp

Human

Orang.

Gorilla

Chimp

Human

f3 = 63 f2 = 57 f1 = 80

Page 32: ASTRAL: Fast coalescent-based computation of the species ...eceweb.ucsd.edu/~smirarab/assets/evolution-workshop.pdfASTRAL-I [Mirarab, et al., Bioinformatics, 2014] • ASTRAL solves

Prior• We assume a-priori all three topologies are equally likely

• We assume branch lengths are generated through a Yule process with rate λ

• Default: λ =0.5 —> branch lengths become uniformly distributed

15

Pr(✓1 >1

3) = Pr(✓2 >

1

3) = Pr(✓3 >

1

3) =

1

3

Page 33: ASTRAL: Fast coalescent-based computation of the species ...eceweb.ucsd.edu/~smirarab/assets/evolution-workshop.pdfASTRAL-I [Mirarab, et al., Bioinformatics, 2014] • ASTRAL solves

Quartet support versus posterior probability

16

!3Background: How many targets are enough? The number of loci required for confidently resolving a phylogenetic relationship depends on its “hardness”. Discordant gene trees due to incomplete lineage sorting (ILS) can make branches of a species tree very difficult to resolve. ILS is a function of the branch length and population size. This means that shorter branches or larger population sizes increase ILS, and can make species tree reconstruction more difficult. The co-PI has recently developed a new approach for estimating the support of a branch in a coalescent-based framework based on the proportion of gene trees that support quartet topologies in gene trees[31]. This new approach is analogous to finding the probability that a three-sided die is loaded towards each facet by observing outcomes of many tosses. We can also ask the reverse question: if we have an estimate of the degree of bias of a die, how many tosses are needed to confidently find its loaded side. Similarly, this approach[31] sheds light on the relationship between the number of genes required to achieve a level of support for a branch of a certain length in coalescent units (Fig. 3). While the co-PI’s method of calculating support, which is implemented in ASTRAL [32, 33], gives the basic mathematical framework, more method development is needed (see below).

Research Approach Specimens Suitable specimens of more than 80 species of Sabellidae and >120 Terebelliformia species have been collected from localities around the world in the last 15 years by the PI. Further collecting will be undertaken for some key taxa for a few more transcriptomes and for targeted DNA capture. We also have agreements from other experts in Sabellidae and Terebelliformia to provide other needed ethanol-preserved specimens for targeted capture. We will obtain ~ 50% of the known sabellid diversity and ~25% of terebelliforms for targeted capture sequencing (~500 species total). Our plan is to sample all genera, particularly focusing on their type species. This should allow for the generation of robust phylogenies, from which we will revise the taxonomy. Specimens will be vouchered at the SIO Benthic Invertebrate Collection and biodiversity information curated on the Encyclopedia of Life. Sequencing (transcriptomes and targeted capture of DNA) A targeted capture approach will be used with an appropriate number of loci (see below) to generate robust phylogenies of Sabellidae and Terebelliformia. Two different sets of targets will be designed from transcriptomes, across Sabellidae and Terebelliformia repectively. We have already generated new transcriptomes for nine sabellids, a serpulid and a fabriciid (bold terminals in Fig. 4), and nine Terebelliformia (bold terminals in Fig. 5) and combined this data with the few publicly available transcriptomes. Based on direct sequencing data already obtained for several genes for Sabellidae and Terebelliformia [Rouse in prep.; Stiller et al. in prep.], our sampling arguably spans the extant diversity of each clade. Several more transcriptomes will be needed to ensure the targeted capture methods will be robust. Methods for generating these transcriptomes will follow those previously used by us[6, 34]. How many targets are needed? A targeted capture pipeline typically starts from a large number of loci sequenced from a small number of species using genome-wide approaches (e.g., transcriptomics)[7, 8, 35]. This initial dataset is then used to select a smaller subset of loci for the targeted capture phase, which involves a larger set of species. To reduce the cost and effort, one would want to minimize the number of loci in the second phase, as long as sufficient loci are selected to confidently resolve relationships of interest. To date the number of loci has been determined in an ad hoc manner, either by the ultra conserved elements discovered[7, 8] or limitations of the targeted capture technology[13]. Initial analyses of our two datasets demonstrates how the number of

Fig. 3. Impact of number of genes (colors) and amount of ILS (x-axis) on support (posterior probability) for a branch (y-axis). Under the coalescent model, for four species separated with an unrooted branch of length d, the probability of a gene tree being identical to the species tree is 1-2/3e-d (we use this as a measure of ILS). We measure support using local pp [31] for various numbers of gene trees (assuming no gene tree estimation error). More ILS (ie., low 1-2/3e-d) requires more genes to get high support.

quartet frequency

Page 34: ASTRAL: Fast coalescent-based computation of the species ...eceweb.ucsd.edu/~smirarab/assets/evolution-workshop.pdfASTRAL-I [Mirarab, et al., Bioinformatics, 2014] • ASTRAL solves

ASTRAL

• Input: a set of unrooted gene trees

• Output: an unrooted species tree

• Optimization score: The number of quartet trees shared between the species tree and the input gene trees

17

Page 35: ASTRAL: Fast coalescent-based computation of the species ...eceweb.ucsd.edu/~smirarab/assets/evolution-workshop.pdfASTRAL-I [Mirarab, et al., Bioinformatics, 2014] • ASTRAL solves

Multiple quartets

1818

a

b

c

d

e

f

gh

Q

|C1|=m1

|C2|=m2

|C3|=m3

|C4|=m4

SQ = {ac|dh, ac|df, bc|ef, . . .}

Fi = (fi1, fi2, fi3) for 1 i m

m = #SQ = m1 ⇤m2 ⇤m3 ⇤m4

• Locality Assumption: All four clusters around a branch Q are correct

• Compute support for each branch independently from others

Page 36: ASTRAL: Fast coalescent-based computation of the species ...eceweb.ucsd.edu/~smirarab/assets/evolution-workshop.pdfASTRAL-I [Mirarab, et al., Bioinformatics, 2014] • ASTRAL solves

Multiple quartets

1818

a

b

c

d

e

f

gh

Q

|C1|=m1

|C2|=m2

|C3|=m3

|C4|=m4

SQ = {ac|dh, ac|df, bc|ef, . . .}

Fi = (fi1, fi2, fi3) for 1 i m

m = #SQ = m1 ⇤m2 ⇤m3 ⇤m4

• Locality Assumption: All four clusters around a branch Q are correct

• Compute support for each branch independently from others

• Assuming independence between quartets would be too liberal (m x k tosses)

Page 37: ASTRAL: Fast coalescent-based computation of the species ...eceweb.ucsd.edu/~smirarab/assets/evolution-workshop.pdfASTRAL-I [Mirarab, et al., Bioinformatics, 2014] • ASTRAL solves

Multiple quartets

1818

a

b

c

d

e

f

gh

Q

|C1|=m1

|C2|=m2

|C3|=m3

|C4|=m4

SQ = {ac|dh, ac|df, bc|ef, . . .}

Fi = (fi1, fi2, fi3) for 1 i m

m = #SQ = m1 ⇤m2 ⇤m3 ⇤m4

• Locality Assumption: All four clusters around a branch Q are correct

• Compute support for each branch independently from others

• Assuming independence between quartets would be too liberal (m x k tosses)

• Conservative approach: All quartet frequencies are noisy estimates of a single hidden true frequency

• Simply average quartet frequencies • k tosses, each time reading the results

m times with some noise

Page 38: ASTRAL: Fast coalescent-based computation of the species ...eceweb.ucsd.edu/~smirarab/assets/evolution-workshop.pdfASTRAL-I [Mirarab, et al., Bioinformatics, 2014] • ASTRAL solves

Speed• Can analytically compute local posterior

probabilities and branch lengths (using results from Allman, et. al, 2012 for BL).

1919

Page 39: ASTRAL: Fast coalescent-based computation of the species ...eceweb.ucsd.edu/~smirarab/assets/evolution-workshop.pdfASTRAL-I [Mirarab, et al., Bioinformatics, 2014] • ASTRAL solves

Speed• Can analytically compute local posterior

probabilities and branch lengths (using results from Allman, et. al, 2012 for BL).

• We don’t need to list all n choose 4 quartet frequencies.

• We can use extensions of algorithmic tricks in ASTRAL to compute average quartet frequencies for each branch in O(nk)

1919

Page 40: ASTRAL: Fast coalescent-based computation of the species ...eceweb.ucsd.edu/~smirarab/assets/evolution-workshop.pdfASTRAL-I [Mirarab, et al., Bioinformatics, 2014] • ASTRAL solves

Simulation studies• Violations of assumptions

• Estimated gene trees instead of true gene trees • Estimated species trees, where the locality assumption

is not always correct

• Datasets: — 201-taxon ASTRAL-II dataset— An avian simulated dataset

• Support accuracy:The number of false positive and false negatives above a certain threshold of support

20

Page 41: ASTRAL: Fast coalescent-based computation of the species ...eceweb.ucsd.edu/~smirarab/assets/evolution-workshop.pdfASTRAL-I [Mirarab, et al., Bioinformatics, 2014] • ASTRAL solves

21

Results (Avian, ROC)

250 500

1000 1500

0.00

0.25

0.50

0.75

1.00

0.00

0.25

0.50

0.75

1.00

0.00 0.25 0.50 0.75 1.000.00 0.25 0.50 0.75 1.00

False Positive Rate

Recall

MLBS Local PP

250 500

1000 1500

0.00

0.25

0.50

0.75

1.00

0.00

0.25

0.50

0.75

1.00

0.00 0.25 0.50 0.75 1.000.00 0.25 0.50 0.75 1.00

False Positive Rate

Recall

MLBS Local PP

250 500

1000 1500

0.00

0.25

0.50

0.75

1.00

0.00

0.25

0.50

0.75

1.00

0.00 0.25 0.50 0.75 1.000.00 0.25 0.50 0.75 1.00

False Positive Rate

Recall

MLBS Local PP

250 500

1000 1500

0.00

0.25

0.50

0.75

1.00

0.00

0.25

0.50

0.75

1.00

0.00 0.25 0.50 0.75 1.000.00 0.25 0.50 0.75 1.00

False Positive Rate

Recall

MLBS Local PP

Avian simulated dataset (48 taxa, 1000 genes)

the ROC curves show (fig. 5), for the same number offalse positives branches, local posterior probabilities result inbetter recall than MLBS. This pattern is more pronouncedfor shorter alignments, which have increased gene treeerror. For example, for the 250 bp model condition, if wechoose a support threshold that results in 0.01 FPR, with localposterior values, we still recover 84% of correct branches,whereas with MLBS, the same FPR results in retaining70% of correct branches. Thus, for a desired level of precision,better recall can be obtained using local posteriorprobabilities.

Branch LengthBranch length accuracy on the avian dataset was a function ofgene tree estimation error whether ASTRAL or MP-EST wasused (table 4). With true gene trees, branch length log error

FIG. 4. Branch length accuracy on the A-200 dataset with Medium ILS. See supplementary figures S7 and S8, Supplementary Material online forlow and high ILS. The estimated branch length is plotted against the true branch length in log scale (base 10). Blue line: a fitted generalizedadditive model with smoothing (Wood 2011).

Table 3. Branch Length Accuracy on the A-200 Dataset.

Dataset n Log Err RMSE

True gt Est. gt True gt Est. gt

Low ILS 1,000 0.10 0.42 5.57 6.75Low ILS 200 0.16 0.44 6.22 6.99Low ILS 50 0.25 0.48 6.84 7.29Med ILS 1,000 0.03 0.20 0.22 0.86Med ILS 200 0.07 0.22 0.44 0.91Med ILS 50 0.13 0.26 0.74 1.05High ILS 1,000 0.06 0.15 0.03 0.13High ILS 200 0.11 0.18 0.07 0.15High ILS 50 0.18 0.24 0.14 0.19

NOTE.—Logarithmic error (Log Err) and RMSE are shown for true species treesscored with true gene trees or estimated gene trees (Est. gt).

250 500

1000 1500

0.00

0.25

0.50

0.75

1.00

0.00

0.25

0.50

0.75

1.00

0.00 0.25 0.50 0.75 1.000.00 0.25 0.50 0.75 1.00False Positive Rate

Rec

all

MLBS Local PP

FIG. 5. ROC curve for the avian dataset based on MLBS and local PPPPsupport values. Boxes show different numbers of sites per gene (con-trolling gene tree estimation error).

Fast Coalescent-Based Support Computation . doi:10.1093/molbev/msw079 MBE

9

by guest on May 28, 2016

http://mbe.oxfordjournals.org/

Dow

nloaded from

Page 42: ASTRAL: Fast coalescent-based computation of the species ...eceweb.ucsd.edu/~smirarab/assets/evolution-workshop.pdfASTRAL-I [Mirarab, et al., Bioinformatics, 2014] • ASTRAL solves

Precision and recall at high support

201-taxon datasets (simPhy)

22

increase observed gene tree discordance and branch lengthsare a function of observed discordance.

AvianOn the avian dataset, we compare local posterior probabilitiesagainst branch support generated using site-only MLBS withestimated gene trees and ASTRAL species trees. Here, we alsostudy the impact of increasing levels of gene tree estimationerror by decreasing the number of sites per gene from1,500 bp to 250 bp.

Posterior and MLBSThe precision of local PP is 100% for the 0.99 threshold, re-gardless of the numbers of sites, but the recall ranges from

81% for the 1,500 bp model condition to 69% for 250 bp(supplementary table S1 and figs. S5 and S6, SupplementaryMaterial online). Precision is at least 99.8% for the 0.95 thresh-old, and the recall is between 71.5% and 84.7%, depending onthe model condition (an improvement of 2–5% comparedwith the 0.99 threshold). Lowering the support threshold allthe way to 0.7 still retains at least 99.1% accuracy and in-creases the recall to between 78.3% and 91.4%. Therefore, thelocal posterior probabilities allow very few false positives withhigh support but also miss some true positives (and thus maybe conservative).

Nevertheless, local posterior probabilities are less conser-vative than MLBS support values and have better recall. As

A

B

FIG. 3. Evaluation of local PP on the A-200 dataset with ASTRAL species trees. See supplementary figures S2–S4, Supplementary Material online forother species trees. (A) Precision and recall of branches with local PP above a threshold ranging from 0.9 to 1.0 using estimated gene trees (solid) ortrue gene trees (dotted). (B) ROC curve (recall vs. FPR) for varying thresholds (figure trimmed at 0.4 FPR). Columns show different levels of ILS.

Sayyari and Mirarab . doi:10.1093/molbev/msw079 MBE

8

by guest on May 28, 2016

http://mbe.oxfordjournals.org/

Dow

nloaded from

increase observed gene tree discordance and branch lengthsare a function of observed discordance.

AvianOn the avian dataset, we compare local posterior probabilitiesagainst branch support generated using site-only MLBS withestimated gene trees and ASTRAL species trees. Here, we alsostudy the impact of increasing levels of gene tree estimationerror by decreasing the number of sites per gene from1,500 bp to 250 bp.

Posterior and MLBSThe precision of local PP is 100% for the 0.99 threshold, re-gardless of the numbers of sites, but the recall ranges from

81% for the 1,500 bp model condition to 69% for 250 bp(supplementary table S1 and figs. S5 and S6, SupplementaryMaterial online). Precision is at least 99.8% for the 0.95 thresh-old, and the recall is between 71.5% and 84.7%, depending onthe model condition (an improvement of 2–5% comparedwith the 0.99 threshold). Lowering the support threshold allthe way to 0.7 still retains at least 99.1% accuracy and in-creases the recall to between 78.3% and 91.4%. Therefore, thelocal posterior probabilities allow very few false positives withhigh support but also miss some true positives (and thus maybe conservative).

Nevertheless, local posterior probabilities are less conser-vative than MLBS support values and have better recall. As

A

B

FIG. 3. Evaluation of local PP on the A-200 dataset with ASTRAL species trees. See supplementary figures S2–S4, Supplementary Material online forother species trees. (A) Precision and recall of branches with local PP above a threshold ranging from 0.9 to 1.0 using estimated gene trees (solid) ortrue gene trees (dotted). (B) ROC curve (recall vs. FPR) for varying thresholds (figure trimmed at 0.4 FPR). Columns show different levels of ILS.

Sayyari and Mirarab . doi:10.1093/molbev/msw079 MBE

8

by guest on May 28, 2016

http://mbe.oxfordjournals.org/

Dow

nloaded from

increase observed gene tree discordance and branch lengthsare a function of observed discordance.

AvianOn the avian dataset, we compare local posterior probabilitiesagainst branch support generated using site-only MLBS withestimated gene trees and ASTRAL species trees. Here, we alsostudy the impact of increasing levels of gene tree estimationerror by decreasing the number of sites per gene from1,500 bp to 250 bp.

Posterior and MLBSThe precision of local PP is 100% for the 0.99 threshold, re-gardless of the numbers of sites, but the recall ranges from

81% for the 1,500 bp model condition to 69% for 250 bp(supplementary table S1 and figs. S5 and S6, SupplementaryMaterial online). Precision is at least 99.8% for the 0.95 thresh-old, and the recall is between 71.5% and 84.7%, depending onthe model condition (an improvement of 2–5% comparedwith the 0.99 threshold). Lowering the support threshold allthe way to 0.7 still retains at least 99.1% accuracy and in-creases the recall to between 78.3% and 91.4%. Therefore, thelocal posterior probabilities allow very few false positives withhigh support but also miss some true positives (and thus maybe conservative).

Nevertheless, local posterior probabilities are less conser-vative than MLBS support values and have better recall. As

A

B

FIG. 3. Evaluation of local PP on the A-200 dataset with ASTRAL species trees. See supplementary figures S2–S4, Supplementary Material online forother species trees. (A) Precision and recall of branches with local PP above a threshold ranging from 0.9 to 1.0 using estimated gene trees (solid) ortrue gene trees (dotted). (B) ROC curve (recall vs. FPR) for varying thresholds (figure trimmed at 0.4 FPR). Columns show different levels of ILS.

Sayyari and Mirarab . doi:10.1093/molbev/msw079 MBE

8

by guest on May 28, 2016

http://mbe.oxfordjournals.org/

Dow

nloaded from

Page 43: ASTRAL: Fast coalescent-based computation of the species ...eceweb.ucsd.edu/~smirarab/assets/evolution-workshop.pdfASTRAL-I [Mirarab, et al., Bioinformatics, 2014] • ASTRAL solves

Precision and recall at high support

201-taxon datasets (simPhy)

22

increase observed gene tree discordance and branch lengthsare a function of observed discordance.

AvianOn the avian dataset, we compare local posterior probabilitiesagainst branch support generated using site-only MLBS withestimated gene trees and ASTRAL species trees. Here, we alsostudy the impact of increasing levels of gene tree estimationerror by decreasing the number of sites per gene from1,500 bp to 250 bp.

Posterior and MLBSThe precision of local PP is 100% for the 0.99 threshold, re-gardless of the numbers of sites, but the recall ranges from

81% for the 1,500 bp model condition to 69% for 250 bp(supplementary table S1 and figs. S5 and S6, SupplementaryMaterial online). Precision is at least 99.8% for the 0.95 thresh-old, and the recall is between 71.5% and 84.7%, depending onthe model condition (an improvement of 2–5% comparedwith the 0.99 threshold). Lowering the support threshold allthe way to 0.7 still retains at least 99.1% accuracy and in-creases the recall to between 78.3% and 91.4%. Therefore, thelocal posterior probabilities allow very few false positives withhigh support but also miss some true positives (and thus maybe conservative).

Nevertheless, local posterior probabilities are less conser-vative than MLBS support values and have better recall. As

A

B

FIG. 3. Evaluation of local PP on the A-200 dataset with ASTRAL species trees. See supplementary figures S2–S4, Supplementary Material online forother species trees. (A) Precision and recall of branches with local PP above a threshold ranging from 0.9 to 1.0 using estimated gene trees (solid) ortrue gene trees (dotted). (B) ROC curve (recall vs. FPR) for varying thresholds (figure trimmed at 0.4 FPR). Columns show different levels of ILS.

Sayyari and Mirarab . doi:10.1093/molbev/msw079 MBE

8

by guest on May 28, 2016

http://mbe.oxfordjournals.org/

Dow

nloaded from

increase observed gene tree discordance and branch lengthsare a function of observed discordance.

AvianOn the avian dataset, we compare local posterior probabilitiesagainst branch support generated using site-only MLBS withestimated gene trees and ASTRAL species trees. Here, we alsostudy the impact of increasing levels of gene tree estimationerror by decreasing the number of sites per gene from1,500 bp to 250 bp.

Posterior and MLBSThe precision of local PP is 100% for the 0.99 threshold, re-gardless of the numbers of sites, but the recall ranges from

81% for the 1,500 bp model condition to 69% for 250 bp(supplementary table S1 and figs. S5 and S6, SupplementaryMaterial online). Precision is at least 99.8% for the 0.95 thresh-old, and the recall is between 71.5% and 84.7%, depending onthe model condition (an improvement of 2–5% comparedwith the 0.99 threshold). Lowering the support threshold allthe way to 0.7 still retains at least 99.1% accuracy and in-creases the recall to between 78.3% and 91.4%. Therefore, thelocal posterior probabilities allow very few false positives withhigh support but also miss some true positives (and thus maybe conservative).

Nevertheless, local posterior probabilities are less conser-vative than MLBS support values and have better recall. As

A

B

FIG. 3. Evaluation of local PP on the A-200 dataset with ASTRAL species trees. See supplementary figures S2–S4, Supplementary Material online forother species trees. (A) Precision and recall of branches with local PP above a threshold ranging from 0.9 to 1.0 using estimated gene trees (solid) ortrue gene trees (dotted). (B) ROC curve (recall vs. FPR) for varying thresholds (figure trimmed at 0.4 FPR). Columns show different levels of ILS.

Sayyari and Mirarab . doi:10.1093/molbev/msw079 MBE

8

by guest on May 28, 2016

http://mbe.oxfordjournals.org/

Dow

nloaded from

Page 44: ASTRAL: Fast coalescent-based computation of the species ...eceweb.ucsd.edu/~smirarab/assets/evolution-workshop.pdfASTRAL-I [Mirarab, et al., Bioinformatics, 2014] • ASTRAL solves

Branch length accuracy

23

was only 0.06, corresponding to branches that are about 14%shorter or longer than the true branch. As gene tree estima-tion error increases with reduced number of sites (see table 1for gene tree error statistics), the branch length error also

increases. Thus, while 1,500 bp genes give 0.17 log error,250 bp genes result in 0.59 error, which corresponds tobranches that are on average 3.9 times too short or long.Moreover, unlike true gene trees, the error in branch lengthsestimated based on estimated gene trees is biased towardunderestimation (fig. 6), a pattern that increases in intensitywith shorter alignments.

ASTRAL and MP-EST have similar branch length accuracymeasured by log error for highly accurate gene trees, butASTRAL has an advantage with increased gene tree error(supplementary fig. S10, Supplementary Material online andtable 4). Error measured by root mean squared error (RMSE)(which emphasizes the accuracy of long branches) is compa-rable for the two methods, but MP-EST has a slight advantagegiven accurate gene trees.

Biological DatasetsFor each biological dataset, we show MLBS support and thelocal posterior probabilities, computed based on RAxML genetrees available from respective publications. We also collapsegene tree branches with <33% bootstrap support and usethese collapsed gene trees to draw local posterior probabili-ties. For ease of discussion, we show local posterior probabil-ities as percentages and refer to them simply as posterior orcollapsed posterior (for values based on collapsed gene trees).We discuss the confidence in important branches in eachtree.

1KP. Three of the key relationships studied by Wickett et al.(2014) are the sister branch to land plants, the base of theangiosperms, and the relationship among Bryophytes (horn-worts, liverworts, and mosses). In the ASTRAL tree, manybranches have full support regardless of the measure of thesupport used, but the remaining branches reveal interestingpatterns (fig. 7). The sister relationship betweenZygnematales and land plants receives a moderate 80% BS,but has 100% posterior. Wickett et al. (2014) also recoveredthis relationship by concatenation of various data partitions.There are 12 other branches that have collapsed posteriorsthat are at least 10% higher than BS (fig. 7); no branch hassubstantially higher BS than collapsed posterior. Collapsedposterior for monophyly of Bryophytes and for Amborellaas sister to other angiosperms are 100% (compared with97% and 93% BS, respectively).

When we collapse low-support branches in gene trees,posterior goes up for several branches: nine branches haveimprovements of 10% or more, and only two branches havecomparable reductions. An interesting case is Coleochaetalesas sister to Zygnematalesþ land plants, which has only 62%BS and 61% posterior, but has 100% collapsed posterior.Finally, note that several branches have low posterior, evenafter collapsing.

Our estimated branch lengths are short for several nodes.For example, the branch that unites (ChloranthalesþMagnoliids) and Eudicots has a length of 0.14 in coalescentunits. Other branches that have been historically hard to re-cover also tend to have short branches; however, these arenot necessarily extremely short branches that would

FIG. 6. ASTRAL branch length accuracy on the avian dataset. Logtransformed estimated branch lengths are shown versus true branchlengths, and a generalized additive model is fitted to the data. Onebranch with length 10"6 is trimmed out here, but full results, includ-ing MP-EST, is shown in supplementary figure S9, SupplementaryMaterial online.

Table 4. Branch Length Accuracy for the Avian Dataset.

No. of sites Log Err RMSE

ASTRAL MPEST ASTRAL MPEST

True gt. 0.06 (0.10) 0.07 (0.11) 0.44 (0.44) 0.30 (0.30)1,500 0.17 (0.20) 0.14 (0.18) 0.83 (0.83) 0.70 (0.70)1,000 0.22 (0.27) 0.22 (0.25) 1.08 (1.07) 1.01 (1.00)500 0.37 (0.42) 0.42 (0.46) 1.65 (1.64) 1.65 (1.64)250 0.59 (0.63) 0.81 (0.84) 2.25 (2.24) 2.28 (2.26)

NOTE.—Logarithmic error and root mean squared error are shown for true speciestrees scored with true gene trees or estimated gene trees with various numbers ofsites using ASTRAL and MP-EST. An extremely short branch with length 10"6 wasremoved from the calculations, but error including that branch is shownparenthetically.

Sayyari and Mirarab . doi:10.1093/molbev/msw079 MBE

10

by guest on May 28, 2016

http://mbe.oxfordjournals.org/

Dow

nloaded from

Page 45: ASTRAL: Fast coalescent-based computation of the species ...eceweb.ucsd.edu/~smirarab/assets/evolution-workshop.pdfASTRAL-I [Mirarab, et al., Bioinformatics, 2014] • ASTRAL solves

1KP dataset

24

rest of land plants

Page 46: ASTRAL: Fast coalescent-based computation of the species ...eceweb.ucsd.edu/~smirarab/assets/evolution-workshop.pdfASTRAL-I [Mirarab, et al., Bioinformatics, 2014] • ASTRAL solves

Summary• ASTRAL-II can infer species tree from gene trees in

datasets with 1000 genes and 1000 taxa in a day of single cpu running time.

• ASTRAL mostly dominates other summary methods. However, Concatenation is better when gene trees have high error.

• ASTRAL’s way of computing support is more accurate than MLBS while being much faster

25

Page 47: ASTRAL: Fast coalescent-based computation of the species ...eceweb.ucsd.edu/~smirarab/assets/evolution-workshop.pdfASTRAL-I [Mirarab, et al., Bioinformatics, 2014] • ASTRAL solves

Phylogenomic species tree reconstruction

X

gene 999gene 2ACTGCACACCG ACTGC-CCCCG AATGC-CCCCG -CTGCACACGG

CTGAGCATCG CTGAGC-TCG ATGAGC-TC- CTGA-CAC-G

AGCAGCATCGTG AGCAGC-TCGTG AGCAGC-TC-TG C-TA-CACGGTG

CAGGCACGCACGAA AGC-CACGC-CATA ATGGCACGC-C-TA AGCTAC-CACGGAT

gene 1000gene 1

“gene” here refers to a portion of the genome (not a functional gene)

Orangutan

Gorilla

Chimpanzee

Human

I’ll use the term “gene” to refer to “c-genes”: recombination-free orthologous stretches of the genome

Page 48: ASTRAL: Fast coalescent-based computation of the species ...eceweb.ucsd.edu/~smirarab/assets/evolution-workshop.pdfASTRAL-I [Mirarab, et al., Bioinformatics, 2014] • ASTRAL solves

ASTRAL on biological datasets

X

• 1KP: 103 plant species, 400-800 genes

• Yang, et al. 96 Caryophyllales species, 1122 genes

• Dentinger, et al. 39 mushroom species, 208 genes

• Giarla and Esselstyn. 19 Philippine shrew species, 1112 genes

• Laumer, et al. 40 flatworm species, 516 genes

• Grover, et al. 8 cotton species, 52 genes

• Hosner, Braun, and Kimball. 28 quail species, 11 genes

• Simmons and Gatesy. 47 angiosperm species, 310 genes

Phylotranscriptomic analysis of the origin and earlydiversification of land plantsNorman J. Wicketta,b,1,2, Siavash Mirarabc,1, Nam Nguyenc, Tandy Warnowc, Eric Carpenterd, Naim Matascie,f,Saravanaraj Ayyampalayamg, Michael S. Barkerf, J. Gordon Burleighh, Matthew A. Gitzendannerh,i, Brad R. Ruhfelh,j,k,Eric Wafulal, Joshua P. Derl, Sean W. Grahamm, Sarah Mathewsn, Michael Melkoniano, Douglas E. Soltish,i,k,Pamela S. Soltish,i,k, Nicholas W. Milesk, Carl J. Rothfelsp,q, Lisa Pokornyp,r, A. Jonathan Shawp, Lisa DeGironimos,Dennis W. Stevensons, Barbara Sureko, Juan Carlos Villarrealt, Béatrice Roureu, Hervé Philippeu,v, Claude W. dePamphilisl,Tao Chenw, Michael K. Deyholosd, Regina S. Baucomx, Toni M. Kutchany, Megan M. Augustiny, Jun Wangz, Yong Zhangv,Zhijian Tianz, Zhixiang Yanz, Xiaolei Wuz, Xiao Sunz, Gane Ka-Shu Wongd,z,aa,2, and James Leebens-Mackg,2

aChicago Botanic Garden, Glencoe, IL 60022; bProgram in Biological Sciences, Northwestern University, Evanston, IL 60208; cDepartment of Computer Science,University of Texas, Austin, TX 78712; dDepartment of Biological Sciences, University of Alberta, Edmonton, AB, Canada T6G 2E9; eiPlant Collaborative,Tucson, AZ 85721; fDepartment of Ecology and Evolutionary Biology, University of Arizona, Tucson, AZ 85721; gDepartment of Plant Biology, University ofGeorgia, Athens, GA 30602; hDepartment of Biology and iGenetics Institute, University of Florida, Gainesville, FL 32611; jDepartment of Biological Sciences,Eastern Kentucky University, Richmond, KY 40475; kFlorida Museum of Natural History, Gainesville, FL 32611; lDepartment of Biology, Pennsylvania StateUniversity, University Park, PA 16803; mDepartment of Botany and qDepartment of Zoology, University of British Columbia, Vancouver, BC, Canada V6T 1Z4;nArnold Arboretum of Harvard University, Cambridge, MA 02138; oBotanical Institute, Universität zu Köln, Cologne D-50674, Germany; pDepartment ofBiology, Duke University, Durham, NC 27708; rDepartment of Biodiversity and Conservation, Real Jardín Botanico-Consejo Superior de InvestigacionesCientificas, 28014 Madrid, Spain; sNew York Botanical Garden, Bronx, NY 10458; tDepartment fur Biologie, Systematische Botanik und Mykologie,Ludwig-Maximilians-Universitat, 80638 Munich, Germany; uDépartement de Biochimie, Centre Robert-Cedergren, Université de Montréal, SuccursaleCentre-Ville, Montreal, QC, Canada H3C 3J7; vCNRS, Station d’ Ecologie Experimentale du CNRS, Moulis, 09200, France; wShenzhen Fairy Lake BotanicalGarden, The Chinese Academy of Sciences, Shenzhen, Guangdong 518004, China; xDepartment of Ecology and Evolutionary Biology, University ofMichigan, Ann Arbor, MI 48109; yDonald Danforth Plant Science Center, St. Louis, MO 63132; zBGI-Shenzhen, Bei shan Industrial Zone, Yantian District,Shenzhen 518083, China; and aaDepartment of Medicine, University of Alberta, Edmonton, AB, Canada T6G 2E1

Edited by Paul O. Lewis, University of Connecticut, Storrs, CT, and accepted by the Editorial Board September 29, 2014 (received for review December 23, 2013)

Reconstructing the origin and evolution of land plants and theiralgal relatives is a fundamental problem in plant phylogenetics, andis essential for understanding how critical adaptations arose, in-cluding the embryo, vascular tissue, seeds, and flowers. Despiteadvances in molecular systematics, some hypotheses of relationshipsremain weakly resolved. Inferring deep phylogenies with bouts ofrapid diversification can be problematic; however, genome-scaledata should significantly increase the number of informative charac-ters for analyses. Recent phylogenomic reconstructions focused onthe major divergences of plants have resulted in promising but in-consistent results. One limitation is sparse taxon sampling, likelyresulting from the difficulty and cost of data generation. To addressthis limitation, transcriptome data for 92 streptophyte taxa weregenerated and analyzed along with 11 published plant genomesequences. Phylogenetic reconstructions were conducted using upto 852 nuclear genes and 1,701,170 aligned sites. Sixty-nine analyseswere performed to test the robustness of phylogenetic inferences topermutations of the datamatrix or to phylogenetic method, includingsupermatrix, supertree, and coalescent-based approaches, maximum-likelihood and Bayesian methods, partitioned and unpartitioned ana-lyses, and amino acid versus DNA alignments. Among otherresults, we find robust support for a sister-group relationshipbetween land plants and one group of streptophyte green al-gae, the Zygnematophyceae. Strong and robust support for aclade comprising liverworts and mosses is inconsistent with awidely accepted view of early land plant evolution, and suggeststhat phylogenetic hypotheses used to understand the evolution offundamental plant traits should be reevaluated.

land plants | Streptophyta | phylogeny | phylogenomics | transcriptome

The origin of embryophytes (land plants) in the Ordovicianperiod roughly 480 Mya (1–4) marks one of the most im-

portant events in the evolution of life on Earth. The early evo-lution of embryophytes in terrestrial environments was facilitatedby numerous innovations, including parental protection for thedeveloping embryo, sperm and egg production in multicellularprotective structures, and an alternation of phases (often referred toas generations) in which a diploid sporophytic life history stagegives rise to a multicellular haploid gametophytic phase. With

Significance

Early branching events in the diversification of land plants andclosely related algal lineages remain fundamental and un-resolved questions in plant evolutionary biology. Accuratereconstructions of these relationships are critical for testing hy-potheses of character evolution: for example, the origins of theembryo, vascular tissue, seeds, and flowers. We investigatedrelationships among streptophyte algae and land plants usingthe largest set of nuclear genes that has been applied to thisproblem to date. Hypothesized relationships were rigorouslytested through a series of analyses to assess systematic errors inphylogenetic inference caused by sampling artifacts and modelmisspecification. Results support some generally accepted phy-logenetic hypotheses, while rejecting others. This work providesa new framework for studies of land plant evolution.

Author contributions: N.J.W., S. Mirarab, T.W., S.W.G., M.M., D.E.S., P.S.S., D.W.S., M.K.D.,J.W., G.K.-S.W., and J.L.-M. designed research; N.J.W., S. Mirarab, N.N., T.W., E.C., N.M., S.A.,M.S.B., J.G.B., M.A.G., B.R.R., E.W., J.P.D., S.W.G., S. Mathews, M.M., D.E.S., P.S.S., N.W.M.,C.J.R., L.P., A.J.S., L.D., D.W.S., B.S., J.C.V., B.R., H.P., C.W.d., T.C., M.K.D., M.M.A., J.W., Y.Z.,Z.T., Z.Y., X.W., X.S., G.K.-S.W., and J.L.-M. performed research; S. Mirarab, N.N., T.W., N.M.,S.A., M.S.B., J.G.B., M.A.G., E.W., J.P.D., S.W.G., S. Mathews, M.M., D.E.S., P.S.S., N.W.M., C.J.R.,L.P., A.J.S., L.D., D.W.S., B.S., J.C.V., H.P., C.W.d., T.C., M.K.D., R.S.B., T.M.K., M.M.A., J.W., Y.Z.,G.K.-S.W., and J.L.-M. contributed new reagents/analytic tools; N.J.W., S. Mirarab, N.N., E.C.,N.M., S.A., M.S.B., J.G.B., M.A.G., B.R.R., E.W., B.R., H.P., and J.L.-M. analyzed data; N.J.W.,S. Mirarab, T.W., S.W.G., M.M., D.E.S., D.W.S., H.P., G.K.-S.W., and J.L.-M. wrote the paper;and N.M. archived data.

The authors declare no conflict of interest.

This article is a PNAS Direct Submission. P.O.L. is a guest editor invited by theEditorial Board.

Freely available online through the PNAS open access option.

Data deposition: The sequences reported in this paper have been deposited in theiplant datastore database, mirrors.iplantcollaborative.org/onekp_pilot, and the Na-tional Center for Biotechnology Information Sequence Read Archive, www.ncbi.nlm.nih.gov/sra [accession no. PRJEB4921 (ERP004258)].1N.J.W. and S. Mirarab contributed equally to this work.2To whom correspondence may be addressed. Email: [email protected],[email protected], or [email protected].

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1323926111/-/DCSupplemental.

www.pnas.org/cgi/doi/10.1073/pnas.1323926111 PNAS Early Edition | 1 of 10

EVOLU

TION

PNASPL

US

Page 49: ASTRAL: Fast coalescent-based computation of the species ...eceweb.ucsd.edu/~smirarab/assets/evolution-workshop.pdfASTRAL-I [Mirarab, et al., Bioinformatics, 2014] • ASTRAL solves

Future datasets• 1200 plants with ~ 400 genes (1KP consortium)

• 250 avian species with 2000 genes (with LSU, UF, and Smithsonian)

• 200 avian species with whole genomes (with Genome 10K, international)

• 250 suboscine species (birds) with ~2000 genes (with LSU and Tulane)

• 140 Insects with 1400 genes (with U. Illinois at Urbana-Champaign)

X

Page 50: ASTRAL: Fast coalescent-based computation of the species ...eceweb.ucsd.edu/~smirarab/assets/evolution-workshop.pdfASTRAL-I [Mirarab, et al., Bioinformatics, 2014] • ASTRAL solves

Shortcomings of ASTRAL-I

• Even the constrained version was too slow for more than about 200 species and hundreds of genes

• The constraint set X did not include true species tree branches for some challenging datasets, resulting in low accuracy in some cases

• Input gene trees could not have polytomies

X

Page 51: ASTRAL: Fast coalescent-based computation of the species ...eceweb.ucsd.edu/~smirarab/assets/evolution-workshop.pdfASTRAL-I [Mirarab, et al., Bioinformatics, 2014] • ASTRAL solves

ASTRAL-I versus ASTRAL-II

X

10M 2M 500K

0.0

0.2

0.4

0.6

0.8

0.0

0.2

0.4

0.6

0.8

1e−061e−07

50 200 1000 50 200 1000 50 200 1000genes

Spec

ies

tree

topo

logi

cal e

rror (

RF)

ASTRAL−I ASTRAL−II ASTRAL−II + true st

10M 2M 500K

0

3

6

9

12

0

3

6

9

12

1e−061e−07

50 200 1000 50 200 1000 50 200 1000genes

Run

ning

tim

e (h

ours

)

ASTRAL−I ASTRAL−II ASTRAL−II + true st

Figure S2: Comparison of various variants of ASTRAL with 200 taxa and varying tree shapes

and number of genes. Species tree accuracy (top) and running times (bottom) are shown. ASTRAL-II +true st shows the case where the true species tree is added to the search space; this is included to approximatean ideal (e.g. exact) solution to the quartet problem.

5

Medium ILSLow ILS High ILS

Spec

ies

tree

topo

logi

cal

erro

r (FN

)

10M 2M 500K

0.0

0.2

0.4

0.6

0.8

0.0

0.2

0.4

0.6

0.8

1e−061e−07

50 200 1000 50 200 1000 50 200 1000genes

Spec

ies

tree

topo

logi

cal e

rror (

RF)

ASTRAL−I ASTRAL−II ASTRAL−II + true st

10M 2M 500K

0

3

6

9

12

0

3

6

9

12

1e−061e−07

50 200 1000 50 200 1000 50 200 1000genes

Run

ning

tim

e (h

ours

)

ASTRAL−I ASTRAL−II ASTRAL−II + true st

Figure S2: Comparison of various variants of ASTRAL with 200 taxa and varying tree shapes

and number of genes. Species tree accuracy (top) and running times (bottom) are shown. ASTRAL-II +true st shows the case where the true species tree is added to the search space; this is included to approximatean ideal (e.g. exact) solution to the quartet problem.

5200 species, deep ILS

Page 52: ASTRAL: Fast coalescent-based computation of the species ...eceweb.ucsd.edu/~smirarab/assets/evolution-workshop.pdfASTRAL-I [Mirarab, et al., Bioinformatics, 2014] • ASTRAL solves

ASTRAL-I versus ASTRAL-II

X

10M 2M 500K

0.0

0.2

0.4

0.6

0.8

0.0

0.2

0.4

0.6

0.8

1e−061e−07

50 200 1000 50 200 1000 50 200 1000genes

Spec

ies

tree

topo

logi

cal e

rror (

RF)

ASTRAL−I ASTRAL−II ASTRAL−II + true st

10M 2M 500K

0

3

6

9

12

0

3

6

9

12

1e−061e−07

50 200 1000 50 200 1000 50 200 1000genes

Run

ning

tim

e (h

ours

)

ASTRAL−I ASTRAL−II ASTRAL−II + true st

Figure S2: Comparison of various variants of ASTRAL with 200 taxa and varying tree shapes

and number of genes. Species tree accuracy (top) and running times (bottom) are shown. ASTRAL-II +true st shows the case where the true species tree is added to the search space; this is included to approximatean ideal (e.g. exact) solution to the quartet problem.

5

Medium ILSLow ILS High ILS

Spec

ies

tree

topo

logi

cal

erro

r (FN

)

10M 2M 500K

0.0

0.2

0.4

0.6

0.8

0.0

0.2

0.4

0.6

0.8

1e−061e−07

50 200 1000 50 200 1000 50 200 1000genes

Spec

ies

tree

topo

logi

cal e

rror (

RF)

ASTRAL−I ASTRAL−II ASTRAL−II + true st

10M 2M 500K

0

3

6

9

12

0

3

6

9

12

1e−061e−07

50 200 1000 50 200 1000 50 200 1000genes

Run

ning

tim

e (h

ours

)

ASTRAL−I ASTRAL−II ASTRAL−II + true st

Figure S2: Comparison of various variants of ASTRAL with 200 taxa and varying tree shapes

and number of genes. Species tree accuracy (top) and running times (bottom) are shown. ASTRAL-II +true st shows the case where the true species tree is added to the search space; this is included to approximatean ideal (e.g. exact) solution to the quartet problem.

5200 species, deep ILS

Page 53: ASTRAL: Fast coalescent-based computation of the species ...eceweb.ucsd.edu/~smirarab/assets/evolution-workshop.pdfASTRAL-I [Mirarab, et al., Bioinformatics, 2014] • ASTRAL solves

X

4%

8%

12%

16%

10 50 100 200 500 1000number of species

Spec

ies

tree

topo

logi

cal e

rror (

FN)

ASTRAL−IIMP−EST

1000 genes, “medium” levels of recent ILS

Tree error, varying # of species

Page 54: ASTRAL: Fast coalescent-based computation of the species ...eceweb.ucsd.edu/~smirarab/assets/evolution-workshop.pdfASTRAL-I [Mirarab, et al., Bioinformatics, 2014] • ASTRAL solves

X

4%

8%

12%

16%

10 50 100 200 500 1000number of species

Spec

ies

tree

topo

logi

cal e

rror (

FN)

ASTRAL−IIMP−EST

4%

8%

12%

16%

10 50 100 200 500 1000number of species

Spec

ies

tree

topo

logi

cal e

rror (

FN)

ASTRAL−IIMP−EST

1000 genes, “medium” levels of recent ILS

Tree error, varying # of species

Page 55: ASTRAL: Fast coalescent-based computation of the species ...eceweb.ucsd.edu/~smirarab/assets/evolution-workshop.pdfASTRAL-I [Mirarab, et al., Bioinformatics, 2014] • ASTRAL solves

X

4%

8%

12%

16%

10 50 100 200 500 1000number of species

Spec

ies

tree

topo

logi

cal e

rror (

FN)

ASTRAL−IIMP−EST

4%

8%

12%

16%

10 50 100 200 500 1000number of species

Spec

ies

tree

topo

logi

cal e

rror (

FN)

ASTRAL−IIMP−EST

4%

8%

12%

16%

10 50 100 200 500 1000number of species

Spec

ies

tree

topo

logi

cal e

rror (

FN)

ASTRAL−IINJstMP−EST

1000 genes, “medium” levels of recent ILS

Tree error, varying # of species

Page 56: ASTRAL: Fast coalescent-based computation of the species ...eceweb.ucsd.edu/~smirarab/assets/evolution-workshop.pdfASTRAL-I [Mirarab, et al., Bioinformatics, 2014] • ASTRAL solves

X

0

10

20

10 50 100 200 500 1000number of species

Run

ning

tim

e (h

ours

)

ASTRAL−IINJstMP−EST

Running time, varying # of species

1000 genes, “medium” levels of recent ILS

Page 57: ASTRAL: Fast coalescent-based computation of the species ...eceweb.ucsd.edu/~smirarab/assets/evolution-workshop.pdfASTRAL-I [Mirarab, et al., Bioinformatics, 2014] • ASTRAL solves

Acknowledgments

Jim5Leebens9mack5(UGA)

Norman5Wickett5(U5Chicago)

Gane5Wong5(U5of5Alberta)

Keshav5Pingali

5S.M.5Bayzid5

Tandy&Warnow

Théo55Zimmermann

Travel funding to ISMB/ECCB 2015 was generously provided by akamia.