New methods for estimating species trees from gene trees

Tandy Warnow

March 12, 2012

Phylogeny (evolutionary tree)

[Figure: tree on Orangutan, Gorilla, Chimpanzee, and Human. From the Tree of Life Website, University of Arizona.]

DNA Sequence Evolution

[Figure: the sequence AAGACTT evolves down a tree, with a time axis marked -3 mil yrs, -2 mil yrs, and -1 mil yrs. Intermediate sequences include TGGACTT, AAGGCCT, AGGGCAT, TAGCCCT, and AGCACTT; the leaf sequences are AGGGCAT, TAGCCCA, TAGACTT, AGCACAA, and AGCGCTT.]

Input: unaligned sequences

S1 = AGGCTATCACCTGACCTCCA
S2 = TAGCTATCACGACCGC
S3 = TAGCTGACCGC
S4 = TCACGACCGACA

Phase 1: Multiple Sequence Alignment

S1 = -AGGCTATCACCTGACCTCCA
S2 = TAG-CTATCAC--GACCGC--
S3 = TAG-CT-------GACCGC--
S4 = -------TCAC--GACCGACA
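A small sanity check can make the relationship between the two phases concrete: an alignment only inserts gaps, so deleting the gap characters from each aligned row must give back the input sequence, and all rows must have the same length. The sketch below uses the four sequences from the slide; the function name `is_valid_alignment` is illustrative, not taken from any tool mentioned in the talk.

```python
# Unaligned input sequences and their alignment, copied from the slides.
unaligned = {
    "S1": "AGGCTATCACCTGACCTCCA",
    "S2": "TAGCTATCACGACCGC",
    "S3": "TAGCTGACCGC",
    "S4": "TCACGACCGACA",
}
aligned = {
    "S1": "-AGGCTATCACCTGACCTCCA",
    "S2": "TAG-CTATCAC--GACCGC--",
    "S3": "TAG-CT-------GACCGC--",
    "S4": "-------TCAC--GACCGACA",
}

def is_valid_alignment(unaligned, aligned):
    """True if all rows have equal length and differ from the input only by gaps."""
    same_length = len({len(row) for row in aligned.values()}) == 1
    gaps_only = all(
        aligned[name].replace("-", "") == seq for name, seq in unaligned.items()
    )
    return same_length and gaps_only

print(is_valid_alignment(unaligned, aligned))
```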

Phase 2: Construct tree

S1 = -AGGCTATCACCTGACCTCCA
S2 = TAG-CTATCAC--GACCGC--
S3 = TAG-CT-------GACCGC--
S4 = -------TCAC--GACCGACA

Progress on Gene Tree and Alignment Estimation

• Statistical performance of phylogeny estimation methods

• Co-estimation of alignments and trees (SATé)

• “Alignment-free” phylogeny estimation (DACTAL)

• Phylogenetic analysis and alignment of NGS data (SEPP)

• Taxon identification of short reads from same gene (metagenomic analysis) (TIPP)

Tomorrow’s talk will cover SATé, SEPP, and TIPP

Single gene vs. multi-gene analyses

• Most methods analyze single genes (or other genomic region). These produce estimated “gene trees”.

• But species trees are estimated using multiple genes.

Multi-gene analyses

After alignment of each gene dataset:

• Combined analysis: Concatenate (“combine”) alignments for different genes, and run phylogeny estimation methods

• Supertree: Compute a tree on each gene alignment and combine the gene trees
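To illustrate the combined-analysis option, here is a minimal sketch of concatenating per-gene alignments into one supermatrix. It assumes each alignment is a dictionary from species name to aligned sequence, and it pads species that lack a gene with `?` characters, a common convention for missing data. The function name and the assignment of sequences to species S1 to S3 are hypothetical.

```python
def concatenate(gene_alignments, missing="?"):
    """Concatenate per-gene alignments, padding absent species with `missing`."""
    taxa = sorted({name for aln in gene_alignments for name in aln})
    supermatrix = {name: "" for name in taxa}
    for aln in gene_alignments:
        width = len(next(iter(aln.values())))  # all rows of one gene share a length
        for name in taxa:
            supermatrix[name] += aln.get(name, missing * width)
    return supermatrix

# Hypothetical example: S1 lacks gene 2 and S3 lacks gene 1.
gene1 = {"S1": "TCTAATGGAA", "S2": "GCTAAGGGAA"}
gene2 = {"S2": "GGTAACCCTC", "S3": "GCTAAACCTC"}
for name, row in concatenate([gene1, gene2]).items():
    print(name, row)
```

The supertree option would instead estimate one tree per dictionary and merge the trees afterwards.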

Not all genes present in all species

[Figure: three gene alignments (gene 1, gene 2, gene 3), each covering a different subset of the species. The genes are analyzed separately, and the resulting trees are combined by a supertree method.]

Two competing approaches

[Figure: gene 1, gene 2, ..., gene k, with the combined analysis shown as the alternative to the supertree approach.]

Constructing trees from subtrees

Let T|A denote the induced subtree of T on the leafset A

T|{a,c,d,f}

Question: given induced subtrees of T for many subsets of taxa -- can you produce the tree T?
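The notation T|A can be made concrete with a short sketch. It assumes a rooted tree written as nested tuples with string leaf labels (a representation chosen here for illustration): leaves outside A are pruned, and any node left with a single child is suppressed.

```python
def restrict(tree, leafset):
    """Return the subtree of `tree` induced by `leafset`, or None if no leaf survives."""
    if isinstance(tree, str):
        return tree if tree in leafset else None
    kept = [sub for sub in (restrict(child, leafset) for child in tree) if sub is not None]
    if not kept:
        return None
    if len(kept) == 1:
        return kept[0]  # suppress a node that is left with only one child
    return tuple(kept)

# Hypothetical six-leaf tree; the leaf subset {a,c,d,f} is the one named on the slide.
T = ((("a", "b"), "c"), (("d", "e"), "f"))
print(restrict(T, {"a", "c", "d", "f"}))
```

The supertree question runs in the other direction: given several such restrictions, recover a tree T consistent with all of them.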

Supertree estimation

Challenges:

• Tree compatibility is NP-complete (therefore, even if subtrees are correct, supertree estimation is hard)

• Estimated subtrees have error

Advantages:

• Estimating individual gene trees can be computationally feasible (compared to the combined analysis of many genes)

• Can use different types of data for each gene

Many Supertree Methods

• MRP

• weighted MRP

• MRF

• MRD

• Robinson-Foulds Supertrees

• Min-Cut

• Modified Min-Cut

• Semi-strict Supertree

• QMC

• Q-imputation

• SDM

• PhySIC

• Majority-Rule Supertrees

• Maximum Likelihood Supertrees

• and many more ...

Matrix Representation with Parsimony (MRP)

(Most commonly used and most accurate)
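As a rough illustration of the "matrix representation" step, the sketch below encodes each internal node of each source tree as one column: 1 for taxa inside the cluster, 0 for other taxa of that source tree, and ? for taxa the source tree does not contain. It assumes rooted source trees written as nested tuples and omits the parsimony search that is then run on the matrix with external software; the names are illustrative.

```python
def clusters(tree):
    """Return (leaf set, clusters of internal nodes) for a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree]), []
    leaves, found = frozenset(), []
    for child in tree:
        child_leaves, child_clusters = clusters(child)
        leaves |= child_leaves
        found.extend(child_clusters)
    found.append(leaves)
    return leaves, found

def mrp_matrix(source_trees):
    """Build the MRP character matrix as a dict: taxon -> string over {0, 1, ?}."""
    taxa = sorted(set().union(*(clusters(t)[0] for t in source_trees)))
    rows = {x: "" for x in taxa}
    for tree in source_trees:
        leaves, found = clusters(tree)
        for cluster in found:
            if cluster == leaves:
                continue  # the root cluster carries no information
            for x in taxa:
                rows[x] += "1" if x in cluster else "0" if x in leaves else "?"
    return rows

# Two hypothetical source trees with overlapping leaf sets.
t1 = (("a", "b"), ("c", "d"))
t2 = (("b", "c"), "e")
for taxon, row in mrp_matrix([t1, t2]).items():
    print(taxon, row)
```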

Quantifying topological error

True Tree Estimated Tree

• False positive (FP): $b \in B(T_{est}) \setminus B(T_{true})$

• False negative (FN): $b \in B(T_{true}) \setminus B(T_{est})$
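The two error types are plain set differences, which a short sketch can show. As a simplifying assumption, trees here are rooted nested tuples and B(T) is taken to be the set of non-trivial clusters rather than the bipartitions of an unrooted tree; the example trees are hypothetical.

```python
def nontrivial_clusters(tree):
    """Clusters of internal nodes, excluding the full leaf set."""
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        leaves = frozenset().union(*(visit(child) for child in node))
        found.add(leaves)
        return leaves

    root = visit(tree)
    found.discard(root)
    return found

def fp_fn(true_tree, est_tree):
    """Return (false positives, false negatives) as sets of clusters."""
    b_true = nontrivial_clusters(true_tree)
    b_est = nontrivial_clusters(est_tree)
    return b_est - b_true, b_true - b_est

true_tree = ((("a", "b"), "c"), ("d", "e"))
est_tree = ((("a", "c"), "b"), ("d", "e"))
fp, fn = fp_fn(true_tree, est_tree)
print(len(fp), len(fn))
```

Dividing the FN count by the number of non-trivial clusters in the true tree gives an FN rate.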

FN rate of MRP vs. combined analysis

[Figure: plot with x-axis Scaffold Density (%).]

SuperFine-boosting: improves accuracy of MRP

(Swenson et al., Syst. Biol. 2012)

SuperFine

• First, construct a supertree with low false positives: the Strict Consensus Merger

• Then, refine the tree to reduce false negatives by resolving each polytomy using a “base” supertree method (e.g., MRP or Quartet Max Cut)

Obtaining a supertree with low FP

The Strict Consensus Merger (SCM)

SCM of two trees

• Computes the strict consensus on the common leaf set

• Then superimposes the two trees, contracting more edges in the presence of “collisions”
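The first of these two steps can be sketched directly. The code below assumes rooted nested-tuple trees that share at least two leaves, restricts both trees to their common leaf set, and keeps only the clusters found in both. The second step (superimposing the trees and handling collisions) is not shown, and the function names are illustrative.

```python
def leaves(tree):
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))

def restrict(tree, keep):
    """Induced subtree on `keep`, suppressing nodes left with one child."""
    if isinstance(tree, str):
        return tree if tree in keep else None
    kept = [sub for sub in (restrict(child, keep) for child in tree) if sub is not None]
    if not kept:
        return None
    return kept[0] if len(kept) == 1 else tuple(kept)

def clusters(tree):
    """Set of clusters of the internal nodes of `tree`."""
    if isinstance(tree, str):
        return set()
    found = {leaves(tree)}
    for child in tree:
        found |= clusters(child)
    return found

def strict_consensus_on_common_leaves(t1, t2):
    """Clusters of the strict consensus of t1 and t2 on their shared leaves."""
    common = leaves(t1) & leaves(t2)  # assumed to contain at least two leaves
    return clusters(restrict(t1, common)) & clusters(restrict(t2, common))

t1 = ((("a", "b"), "c"), ("d", "x"))
t2 = ((("a", "b"), "d"), ("c", "y"))
print(sorted(sorted(c) for c in strict_consensus_on_common_leaves(t1, t2)))
```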

Strict Consensus Merger (SCM)

Performance of SCM

• Low false positive (FP) rate (estimated supertree has few false edges)

• High false negative (FN) rate (estimated supertree is missing many true edges)

Theoretical results for SCM

• SCM can be computed in polynomial time

• For certain types of inputs, the SCM method solves the NP-hard “Tree Compatibility” problem

• All splits in the SCM “appear” in at least one source tree (and project onto each source tree)

Resolving a single polytomy, v, using MRP

• Step 1: Reduce each source tree to a tree on leafset {1,2,...,d}, where d = degree(v)

• Step 2: Apply MRP to the collection of reduced source trees, to produce a tree t on {1,2,...,d}

• Step 3: Replace the star tree at v by tree t

Part 1 of SuperFine

Part 2 of SuperFine

Theorem

Given

– a set of source trees,

– SCM tree T,

– and a polytomy in T,

after relabelling and reducing, each source tree has at most one leaf with each label.

Step 2: Apply MRP to the collection of reduced source trees

Replace polytomy using tree from MRP


SuperFine is also much faster

MRP: 8-12 sec.

SuperFine: 2-3 sec.

[Figure: three panels, each with x-axis Scaffold Density (%).]

Limitations of Supertree Methods

• Traditional supertree methods assume that the true gene trees match the true species tree.

• This is known to be unrealistic in some situations, due to processes such as

– Deep coalescence (“incomplete lineage sorting”)

– Gene duplication and loss

– Horizontal gene transfer

Multiple populations/species

[Figure: multiple populations/species over time, ending at the present. Courtesy James Degnan.]

Gene tree in a species tree

[Figure courtesy James Degnan.]

Deep Coalescence

• Population-level process, also called “Incomplete Lineage Sorting”

• Gene trees can differ from species trees due to short times between speciation events (population size also impacts this probability)

• Causes difficulty in estimating some species trees (such as human-chimp-gorilla)

MDC Problem

• MDC (minimize deep coalescence) problem:

– given set of true gene trees, find the species tree that implies the fewest deep coalescence events

• Posed by Wayne Maddison, Syst Biol 1997

Counting deep coalescences

Extra Lineages XL(T,t)

• T is the species tree

• t is the gene tree

• XL(T,t): the number of extra lineages, under the best embedding of t into T

Two MDC problems

Score pair of trees:

• Input: rooted binary gene tree t and species tree T

• Output: XL(T,t)

Find best species tree:

• Input: set X of rooted, binary gene trees on set S

• Output: species tree T on S that minimizes $XL(T,X) = \sum_{t \in X} XL(T,t)$.

Limitations of methods for MDC

Current methods typically assume that input gene trees are correct, binary, rooted trees containing all the taxa. However:

• Estimated gene trees are usually partially incorrect, are often unrooted, and may not be complete.

• Assuming all gene tree incompatibility is due to deep coalescence is likely problematic.

Minimizing Deep Coalescence (MDC)

• Than and Nakhleh (PLoS Comp Biol 2009): algorithms for MDC which assume all gene trees are correct, rooted, binary trees.

• Yu, Warnow, and Nakhleh (RECOMB 2011 and J Comp Biol 2011) extend T&N 2009 to handle estimated gene trees that are unrooted and have errors.

• Bayzid and Warnow (J Comp Biol, in press) extend T&N 2009 to handle incomplete gene trees.

Search: main results in T&N 2009

• Theorem: Let X be a set of k rooted binary gene trees on taxon set S, and let C be a set of subsets of the taxon set. Then a species tree T that optimizes MDC with $\mathrm{Clusters}(T) \subseteq C$ can be found in time that is polynomial in |C|, n, and k, where n = |S|.

• Exact MDC: Let C be all possible subsets of S

• “Heuristic” MDC: Let C be the set of “clusters” of the input gene trees (where a cluster is the set of leaves below a node in a tree)

T&N 2009: B-maximal clusters and $k_B(t)$

T is a species tree, and t is a gene tree, both rooted and binary

Definitions

• B is a cluster of T

• Y is a B-maximal cluster in t if (i) Y is a cluster of t, (ii) $Y \subseteq B$, and (iii) $Y \not\subset Z$ for any other cluster Z of t such that $Z \subseteq B$.

• $k_B(t)$ is the number of B-maximal clusters in t
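The definition of a B-maximal cluster translates almost word for word into code. This sketch assumes rooted nested-tuple trees and treats every leaf as a (singleton) cluster; the function names and the example tree are illustrative.

```python
def all_clusters(tree):
    """All clusters of a rooted nested-tuple tree, including single leaves."""
    if isinstance(tree, str):
        return [frozenset([tree])]
    found, here = [], frozenset()
    for child in tree:
        below = all_clusters(child)
        found.extend(below)
        here |= below[-1]  # the last entry is the child's own cluster
    found.append(here)
    return found

def b_maximal_clusters(B, gene_tree):
    """Clusters Y of gene_tree with Y inside B and no larger cluster Z inside B."""
    inside = [Y for Y in all_clusters(gene_tree) if Y <= B]
    return {Y for Y in inside if not any(Y < Z for Z in inside)}

# Hypothetical gene tree; B stands for a cluster of some species tree.
t = (("a", "d"), ("b", "c"))
B = frozenset("abc")
maximal = b_maximal_clusters(B, t)
print(sorted(sorted(Y) for Y in maximal), "k_B(t) =", len(maximal))
```

Here {b,c} fits inside B as a whole, while a is separated from it by d, so two lineages must pass through B.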

Calculating XL(T,t)

Lemma (T&N 2009): Let T be a binary species tree and t be a binary rooted gene tree. Then for an optimal embedding of t into T:

– $k_B(t)$ is the number of lineages on the edge “above” the subtree for B in T

– $XL(T,t) = \sum_B [k_B(t) - 1]$, where B ranges over the clusters of T.
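The lemma gives a direct way to score a pair of trees. The following sketch, assuming rooted nested-tuple trees on the same leaf set, sums $k_B(t) - 1$ over the clusters B of the species tree; it is written for clarity rather than efficiency, and the names are illustrative.

```python
def all_clusters(tree):
    """All clusters of a rooted nested-tuple tree, including single leaves."""
    if isinstance(tree, str):
        return [frozenset([tree])]
    found, here = [], frozenset()
    for child in tree:
        below = all_clusters(child)
        found.extend(below)
        here |= below[-1]  # the last entry is the child's own cluster
    found.append(here)
    return found

def k(B, gene_clusters):
    """Number of B-maximal clusters among the gene tree's clusters."""
    inside = [Y for Y in gene_clusters if Y <= B]
    return sum(1 for Y in inside if not any(Y < Z for Z in inside))

def extra_lineages(species_tree, gene_tree):
    """XL(T, t) as the sum over clusters B of T of k_B(t) - 1."""
    gene_clusters = all_clusters(gene_tree)
    return sum(k(B, gene_clusters) - 1 for B in all_clusters(species_tree))

T = (("a", "b"), "c")
print(extra_lineages(T, (("a", "b"), "c")))  # gene tree matches the species tree
print(extra_lineages(T, (("a", "c"), "b")))  # gene tree conflicts on the cluster {a,b}
```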

Calculating XL(T,X)

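Define $\mathrm{Cost}_B(t) = k_B(t) - 1$, and therefore

$$XL(T,t) = \sum_B \mathrm{Cost}_B(t)$$

Given set X of gene trees, define

$$
\begin{aligned}
XL(T,X) &= \sum_{t \in X} XL(T,t) \\
        &= \sum_{t \in X} \sum_B \mathrm{Cost}_B(t) \\
        &= \sum_B \sum_{t \in X} \mathrm{Cost}_B(t) \\
        &= \sum_B w(B)
\end{aligned}
$$

where $w(B) = \sum_{t \in X} \mathrm{Cost}_B(t)$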

Graph Algorithm for MDC

Graph G(X):

• Vertex set: v corresponds to non-trivial $S(v) \subset S$, where S(v) is the cluster of T below node v

• Edges: (v,w) present iff clusters S(v) and S(w) can co-exist as clusters in a tree

• Vertex weight: $\mathrm{Weight}(v) = \sum_{t \in X} \mathrm{Cost}_{S(v)}(t)$

Theorem: There exists a binary rooted tree T on S such that XL(T,X) = W iff there exists an (n-2)-clique in G(X) of weight W, where |S| = n.

Hence, MDC can be solved by finding an (n-2)-clique of minimum total weight in G(X).
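The edge condition has a simple test: two clusters can co-exist in a rooted tree exactly when they are disjoint or one contains the other. The sketch below builds the edge list of the compatibility graph for a given list of candidate clusters; the function names and the example clusters are illustrative.

```python
from itertools import combinations

def compatible(A, B):
    """Two clusters can appear in the same rooted tree iff disjoint or nested."""
    return A.isdisjoint(B) or A <= B or B <= A

def compatibility_edges(cluster_list):
    """Edges of the compatibility graph on the given clusters."""
    return [(A, B) for A, B in combinations(cluster_list, 2) if compatible(A, B)]

# Hypothetical non-trivial clusters on the taxon set {a, b, c, d}.
candidates = [frozenset("ab"), frozenset("cd"), frozenset("bc"), frozenset("abc")]
for A, B in compatibility_edges(candidates):
    print(sorted(A), sorted(B))
```

With n = 4, each edge is already an (n-2)-clique, and each of the three edges found here is the set of non-trivial clusters of one binary rooted tree.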

T&N algorithm for MDC

• Because of the structure of the graph, we can find a min cost max clique (of size n-2) in polynomial time (in the size of the graph), using dynamic programming. But the graph has $2^n$ vertices!

• However, if we constrain the set C of permitted clusters for the species tree, we can find an optimal constrained solution in $O(|C|^2 nk)$ time (the “heuristic” algorithm in T&N 2009).
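One way to picture the constrained search is a dynamic program over the permitted clusters, processed from small to large: the best subtree on a cluster A costs w(A) plus the best way of splitting A into two permitted clusters. The code below is a sketch in that spirit, not the T&N implementation. It assumes rooted binary nested-tuple gene trees on a common leaf set, and all names are illustrative.

```python
def all_clusters(tree):
    """All clusters of a rooted nested-tuple tree, including single leaves."""
    if isinstance(tree, str):
        return [frozenset([tree])]
    found, here = [], frozenset()
    for child in tree:
        below = all_clusters(child)
        found.extend(below)
        here |= below[-1]
    found.append(here)
    return found

def k(B, gene_clusters):
    """Number of B-maximal clusters among the gene tree's clusters."""
    inside = [Y for Y in gene_clusters if Y <= B]
    return sum(1 for Y in inside if not any(Y < Z for Z in inside))

def constrained_mdc(gene_trees, C):
    """Return (cost, tree) minimizing XL(T, X) over binary T with clusters from C."""
    per_gene = [all_clusters(t) for t in gene_trees]
    taxa = per_gene[0][-1]  # assumes every gene tree has the same leaf set
    allowed = set(C) | {frozenset([x]) for x in taxa} | {taxa}
    best = {}
    for A in sorted(allowed, key=len):  # smaller clusters are solved first
        if len(A) == 1:
            best[A] = (0, next(iter(A)))
            continue
        w = sum(k(A, gc) - 1 for gc in per_gene)
        options = [
            (w + best[A1][0] + best[A - A1][0], (best[A1][1], best[A - A1][1]))
            for A1 in best
            if A1 < A and (A - A1) in best
        ]
        if options:
            best[A] = min(options, key=lambda option: option[0])
    return best.get(taxa)  # None if C admits no binary tree

gene_trees = [(("a", "b"), ("c", "d")), (("a", "b"), ("c", "d")), (("a", "c"), ("b", "d"))]
C = {c for t in gene_trees for c in all_clusters(t)}  # the "heuristic" choice of C
cost, tree = constrained_mdc(gene_trees, C)
print(cost, tree)
```

Taking C to be all subsets of S would give the exact version, at the price of exponentially many clusters.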

Yu, Warnow and Nakhleh (2011)

• Allows for error in estimated gene trees.

• RECOMB 2011 and J Comp Biol 2011

Yu, Warnow and Nakhleh (2011)

Modify gene trees to reduce false positive error:

• Unroot trees

• Use bootstrap (or other statistical techniques) to identify the edges that are potentially incorrect

• Contract the low support edges

Result: estimated gene trees that are likely to be unrooted contractions of the true gene tree.
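The contraction step can be sketched as follows. For simplicity the tree is kept rooted here, with each internal node written as `(support, children)` where `support` is the bootstrap value of the edge above that node; this representation and the threshold of 75 are assumptions made for the example, not choices stated in the talk.

```python
def contract(node, threshold):
    """Contract every edge whose support is below `threshold`."""
    if isinstance(node, str):
        return node
    support, children = node
    new_children = []
    for child in children:
        child = contract(child, threshold)
        if not isinstance(child, str) and child[0] < threshold:
            new_children.extend(child[1])  # merge the child's children into this node
        else:
            new_children.append(child)
    return (support, new_children)

# Hypothetical gene tree: the edge above (a, b) has support 40, the one above (c, d) has 90.
tree = (100, [(40, ["a", "b"]), (90, ["c", "d"])])
print(contract(tree, 75))
```

The weakly supported grouping of a and b disappears, leaving a polytomy that any resolution of the true gene tree could refine.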

New MDC problem

• Input: set $X = \{t_1, t_2, \ldots, t_k\}$ of incompletely resolved, unrooted gene trees.

• Output: set $X' = \{t'_1, t'_2, \ldots, t'_k\}$ (such that each $t'_i$ is a resolved, rooted version of $t_i$, $i = 1, 2, \ldots, k$) and species tree T that minimizes $XL(T,X')$.

In other words, we treat $t_i$ as a constraint on the true gene tree for gene i.

Search: main theoretical result in T&N 2009

• Theorem: Let X be a set of k rooted binary gene trees on taxon set S, and let C be a set of clusters on the taxon set. Then a species tree T that optimizes MDC with $\mathrm{Clusters}(T) \subseteq C$ can be found in $O(|C|^2 nk)$ time, where |S| = n.

Search: main theoretical result in YWN 2011

• Theorem: Let X be a set of k unrooted and not necessarily binary gene trees on taxon set S, and let C be a set of clusters on the taxon set. Then a species tree T that optimizes MDC with $\mathrm{Clusters}(T) \subseteq C$ can be found in $O(|C|^2 nk)$ time, where |S| = n.

Scoring: main theoretical result

• Theorem: Let t be an unrooted and not necessarily binary gene tree, and let T be a rooted binary species tree, both on S. Then a rooted refinement t* of t that minimizes XL(T,t*) can be found in $O(n^2)$ time, where |S| = n.

Note: brute-force is exponential, even if t is rooted and the maximum degree in t is low

Simplest case: t is rooted

• Input: rooted tree t, not necessarily binary, and binary rooted species tree T

• Output: refinement t* of t, minimizing XL(T,t*)

Recall that $XL(T,t^*) = \sum_B [k_B(t^*) - 1]$

Refining rooted tree t

Def.: $F_B(t)$ denotes the number of nodes in t that have at least one B-maximal child.

Lemma: If t’ is a binary refinement of t, then $F_B(t) \le k_B(t')$.

Theorem: For all rooted trees t, there exists t*, a binary refinement of t, such that for all clusters B of T, $k_B(t^*) = F_B(t)$.
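The quantity $F_B(t)$ can be computed with one pass over the tree. A child is B-maximal exactly when its cluster lies inside B but its parent's cluster does not, so the sketch below counts the nodes that have such a child. It assumes rooted nested-tuple trees (not necessarily binary) and a cluster B that is a proper subset of the leaf set; the names and example trees are illustrative.

```python
def leafset(tree):
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leafset(child) for child in tree))

def F(B, tree):
    """Number of nodes of `tree` with at least one B-maximal child (B a proper subset)."""
    if isinstance(tree, str):
        return 0
    count = sum(F(B, child) for child in tree)
    # A child inside B is B-maximal only if this node itself is not inside B.
    if not leafset(tree) <= B and any(leafset(child) <= B for child in tree):
        count += 1
    return count

B = frozenset("ab")
star = ("a", "b", "c", "d")               # one polytomy holds both a and b
partly = (("a", "e"), "b", ("c", "d"))    # a and b hang from different nodes
print(F(B, star), F(B, partly))
```

For the star tree, a refinement such as ((a,b),(c,d)) reaches $k_B = 1$, matching $F_B$; for the second tree no refinement can join a and b, so two lineages remain.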

Computing t*

• Algorithm: Refine around each high degree node v in t using the subtree of T defined by the LCAs in T of the children of v.

• Order in which you visit each high degree node does not impact the output

• Can be computed in $O(n^2)$ time

Proof of optimality

Recall: $F_B(t)$ denotes the number of nodes in t that have at least one B-maximal child.

Theorem: The tree t* produced by the algorithm satisfies $k_B(t^*) = F_B(t)$ for every cluster B of T. Hence, t* is optimal.

Proof: Algorithm is locally optimal.

Finding the best species tree, given rooted non-binary trees

• Same basic graph-theoretic approach and DP algorithms work

• Same graph G(X), but redefine $\mathrm{Cost}_B(t) = F_B(t) - 1$ and keep $\mathrm{weight}(v) = \sum_{t \in X} \mathrm{Cost}_{S(v)}(t)$

General case: t unrooted, non-binary

Input: unrooted, non-binary gene tree t and rooted binary species tree T

Output: rooted, binary tree t* refining t such that XL(T,t*) is minimized

Clearly this is solvable in $O(n^3)$ time.

Better $O(n^2)$ algorithm: find root, then refine optimally.

Summary of YWN 2011

• Extends all results from Than and Nakhleh 2009 to partially resolved, unrooted gene trees

• Suggests contraction of low support edges and suppression of root before species tree estimation

• Gives polynomial time DP algorithm for constrained search for species tree (using only clusters from input gene trees)
