Phylogenetic Tree Reconstruction Tandy Warnow The Program in Evolutionary Dynamics at Harvard University The University of Texas at Austin


Phylogenetic Tree Reconstruction

Tandy Warnow

The Program in Evolutionary Dynamics at Harvard University

The University of Texas at Austin


Phylogeny

Orangutan Gorilla Chimpanzee Human

From the Tree of Life website, University of Arizona


Reconstructing the “Tree” of Life

Handling large datasets: millions of species


Cyber Infrastructure for Phylogenetic Research

Purpose: to create a national infrastructure of hardware, algorithms, database technology, etc., necessary to infer the Tree of Life.

Group: 40 biologists, computer scientists, and mathematicians from 13 institutions.

Funding: $11.6 M (large ITR grant from NSF).


CIPRes Members

University of New Mexico: Bernard Moret, David Bader, Tiffani Williams

UCSD/SDSC: Fran Berman, Alex Borchers, David Stockwell, Phil Bourne, John Huelsenbeck, Mark Miller, Michael Alfaro, Tracy Zhao

University of Connecticut: Paul O Lewis

University of Pennsylvania: Susan Davidson, Junhyong Kim, Sampath Kannan

UT Austin: Tandy Warnow, David M. Hillis, Warren Hunt, Robert Jansen, Randy Linder, Lauren Meyers, Daniel Miranker, Usman Roshan, Luay Nakhleh

University of Arizona: David R. Maddison

University of British Columbia: Wayne Maddison

North Carolina State University: Spencer Muse

American Museum of Natural History: Ward C. Wheeler

UC Berkeley: Satish Rao, Steve Evans, Richard M Karp, Brent Mishler, Elchanan Mossel, Eugene W. Myers, Christos M. Papadimitriou, Stuart J. Russell

SUNY Buffalo: William Piel

Florida State University: David L. Swofford, Mark Holder

Yale University: Michael Donoghue, Paul Turner

sanofi-aventis: Lisa Vawter


Phylogeny Problem

[Figure: five aligned sequences (TAGCCCA, TAGACTT, TGCACAA, TGCGCTT, AGGGCAT) labeled U, V, W, X, Y, and a tree whose leaves are U, V, W, X, Y.]


Steps in a phylogenetic analysis

• Gather data
• Align sequences
• Reconstruct phylogeny on the multiple alignment - often obtaining a large number of trees
• Compute consensus (or otherwise estimate the reliable components of the evolutionary history)
• Perform post-tree analyses.


CIPRES research in algorithms

• Heuristics for NP-hard problems in phylogeny reconstruction
• Compact representation of sets of trees
• Reticulate evolution reconstruction
• Performance of phylogeny reconstruction methods under stochastic models of evolution
• Gene order phylogeny
• Genomic alignment
• Lower bounds for MP
• Distance-based reconstruction
• Gene family evolution
• High-throughput phylogenetic placement
• Multiple sequence alignment


DNA Sequence Evolution

[Figure: the sequence AAGACTT evolving down a rooted tree along a time axis (-3 mil yrs, -2 mil yrs, -1 mil yrs, today). Intermediate sequences: AAGGCCT and TGGACTT, then AGGGCAT, TAGCCCT, and AGCACTT. Sequences at the leaves today: AGGGCAT, TAGCCCA, TAGACTT, AGCACAA, AGCGCTT.]


Markov models of site evolution

Simplest (Jukes-Cantor):

• The model tree is a pair (T,{p(e)}), where T is a rooted binary tree, and p(e) is the probability of a substitution on the edge e

• The state at the root is random

• If a site changes on an edge, it changes with equal probability to each of the remaining states

• The evolutionary process is Markovian

More complex models (such as the General Markov model) are also considered, with little change to the theory.
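The Jukes-Cantor process described above can be sketched as a short simulation. This is only an illustration under stated assumptions: the tree is encoded as nested tuples with string leaf names, the root state is drawn uniformly from A, C, G, T, and `p` maps each child node to the substitution probability on the edge above it. The names `simulate` and `evolve_site` are hypothetical, not taken from the talk.

```python
import random

STATES = "ACGT"

def evolve_site(state, prob, rng):
    """With probability prob, replace state by one of the three other
    states, each chosen with equal probability; otherwise keep it."""
    if rng.random() < prob:
        return rng.choice([s for s in STATES if s != state])
    return state

def simulate(tree, p, k, rng):
    """Evolve k independent sites down a rooted tree.

    tree: nested tuples; a leaf is a string name (assumed encoding).
    p:    dict mapping each non-root node to the substitution
          probability p(e) on the edge leading to it (assumed encoding).
    Returns a dict from leaf name to its simulated sequence.
    """
    leaves = {}

    def descend(node, seq):
        if isinstance(node, str):
            leaves[node] = "".join(seq)
            return
        for child in node:
            # Markov property: the child depends only on its parent.
            child_seq = [evolve_site(s, p[child], rng) for s in seq]
            descend(child, child_seq)

    root_seq = [rng.choice(STATES) for _ in range(k)]  # random root state
    descend(tree, root_seq)
    return leaves

tree = (("U", "V"), ("W", "X"))
p = {("U", "V"): 0.1, ("W", "X"): 0.1, "U": 0.2, "V": 0.2, "W": 0.2, "X": 0.2}
sequences = simulate(tree, p, k=7, rng=random.Random(0))
```

Keying `p` by the child node works here because every subtree in the example is distinct; a tree with two identical subtrees would need explicit edge identifiers.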


Phylogeny Reconstruction

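[Figure: five aligned sequences (TAGCCCA, TAGACTT, TGCACAA, TGCGCTT, AGGGCAT) labeled U, V, W, X, Y, and the tree on leaves U, V, W, X, Y to be reconstructed from them.]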


Phylogenetic reconstruction methods

1. Hill-climbing heuristics for hard optimization criteria (Maximum Parsimony and Maximum Likelihood)

[Figure: cost plotted over the space of phylogenetic trees, with a local optimum and the global optimum marked.]

2. Polynomial time distance-based methods: Neighbor Joining, etc.


Performance criteria

• Running time.

• Space.

• Statistical performance issues (e.g., statistical consistency) with respect to a Markov model of evolution.

• “Topological accuracy” with respect to the underlying true tree. Typically studied in simulation.

• Accuracy with respect to a particular criterion (e.g. tree length or likelihood score), on real data.


Maximum Parsimony

• Input: Set S of n aligned sequences of length k

• Output: A phylogenetic tree T
– leaf-labeled by sequences in S
– additional sequences of length k labeling the internal nodes of T

such that $\sum_{(i,j) \in E(T)} H(i,j)$ is minimized, where $H(i,j)$ is the Hamming distance between the sequences labeling nodes i and j.


Theoretical results

• Neighbor Joining is polynomial time, and statistically consistent under typical models of evolution.

• Maximum Parsimony is NP-hard, and even exact solutions are not statistically consistent under typical models.

• Maximum Likelihood is of unknown computational complexity, but statistically consistent under typical models.


Problems with NJ

• Theory: The convergence rate is exponential: the number of sites needed to obtain an accurate reconstruction of the tree with high probability grows exponentially in the evolutionary diameter.

• Empirical: NJ has poor performance on datasets with some large leaf-to-leaf distances.


Neighbor joining has poor accuracy on large diameter model trees

[Nakhleh et al. ISMB 2001]

Simulation study based upon fixed edge lengths, K2P model of evolution, sequence lengths fixed to 1000 nucleotides.

Error rates reflect proportion of incorrect edges in inferred trees.

[Plot: error rate of NJ (y-axis, 0 to 0.8) versus number of taxa (x-axis, 0 to 1600).]


Solving NP-hard problems exactly is … unlikely

• Number of (unrooted) binary trees on n leaves is (2n-5)!!

• If each tree on 1000 taxa could be analyzed in 0.001 seconds, examining every tree would take roughly 6 x 10^2846 millennia

#leaves   #trees
4         3
5         15
6         105
7         945
8         10395
9         135135
10        2027025
20        2.2 x 10^20
100       1.7 x 10^182
1000      1.9 x 10^2860
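The (2n-5)!! count can be checked directly with exact integer arithmetic. The sketch below is an illustration only; the function names are hypothetical. It reproduces the small entries of the table exactly and, for large n, gives about 1.7 x 10^182 trees on 100 leaves and about 1.9 x 10^2860 trees on 1000 leaves.

```python
def num_unrooted_binary_trees(n):
    """Return (2n-5)!! = 1 * 3 * 5 * ... * (2n-5) for n >= 3 leaves."""
    if n < 3:
        raise ValueError("need at least 3 leaves")
    count = 1
    for odd in range(3, 2 * n - 4, 2):  # 3, 5, ..., 2n-5
        count *= odd
    return count

def scientific(value):
    """Format a (possibly huge) positive integer as 'd.d x 10^e'.

    The mantissa is truncated, not rounded. Going through str() avoids
    float overflow for values far beyond 1e308.
    """
    digits = str(value)
    second = digits[1] if len(digits) > 1 else "0"
    return f"{digits[0]}.{second} x 10^{len(digits) - 1}"

for n in (4, 5, 6, 7, 8, 9, 10, 20, 100, 1000):
    print(n, scientific(num_unrooted_binary_trees(n)))
```

Python integers are arbitrary precision, so the product is exact even at 1000 leaves; only the printed form is abbreviated.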


Approaches for “solving” MP/ML

1. Hill-climbing heuristics (which can get stuck in local optima)
2. Randomized algorithms for getting out of local optima
3. Approximation algorithms for MP (based upon Steiner Tree approximation algorithms).

[Figure: cost plotted over the space of phylogenetic trees, with a local optimum and the global optimum marked.]


How good an MP analysis do we need?

• Our research shows that we need to get within 0.01% of optimal (or even better, on large datasets) to return reasonable estimates of the true tree’s “topology”


Datasets

• 1322 lsu rRNA of all organisms
• 2000 Eukaryotic rRNA
• 2594 rbcL DNA
• 4583 Actinobacteria 16s rRNA
• 6590 ssu rRNA of all Eukaryotes
• 7180 three-domain rRNA
• 7322 Firmicutes bacteria 16s rRNA
• 8506 three-domain+2org rRNA
• 11361 ssu rRNA of all Bacteria
• 13921 Proteobacteria 16s rRNA

Obtained from various researchers and online databases


Problems with current techniques for MP

[Bar chart: average MP score above optimal at 24 hours, shown as a percentage of the optimal (y-axis, 0 to 0.1), for TNT on datasets 1 through 10.]

Average MP scores above optimal of best methods at 24 hours across 10 datasets

Best current techniques fail to reach 0.01% of optimal at the end of 24 hours, on large datasets


Problems with current techniques for MP

[Plot: Performance of TNT with time. Average MP score above optimal, shown as a percentage of the optimal (y-axis, 0 to 0.2), versus hours (x-axis, 0 to 24).]

Shown here is the performance of the TNT heuristic search for maximum parsimony on a real dataset of almost 14,000 sequences. The required level of accuracy with respect to MP score is no more than 0.01% error (otherwise high topological error results). (“Optimal” here means best score to date, using any method for any amount of time.)


Empirical problems with existing methods

• Heuristics for Maximum Parsimony (MP) and Maximum Likelihood (ML) cannot handle large datasets (take too long!) – we need new heuristics for MP/ML that can analyze large datasets

• Polynomial time methods have poor topological accuracy on large datasets – we need better polynomial time methods


“Boosting” phylogeny reconstruction methods

• DCMs “boost” the performance of phylogeny reconstruction methods.

[Diagram: Base method M → DCM → DCM-M]


DCMs: Divide-and-conquer for improving phylogeny reconstruction


DCMs (Disk-Covering Methods)

• DCMs for polynomial time methods improve topological accuracy (empirical observation), and have provable theoretical guarantees under Markov models of evolution

• DCMs for hard optimization problems reduce running time needed to achieve good levels of accuracy (empirical observation)

• Each DCM is designed by considering the kinds of datasets the base method will do well or poorly on, and these designs are then tested on real and simulated data.


DCM1 Decompositions

Input: Set S of sequences, distance matrix d, threshold value $q \in \{d_{ij}\}$

1. Compute threshold graph $G_q = (V, E)$, where $V = S$ and $E = \{(i,j) : d(i,j) \le q\}$

2. If the graph is not triangulated, add additional edges to triangulate.

DCM1 decomposition: compute the maximal cliques
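The threshold graph and the maximal-clique step can be sketched as follows. This is an illustration under assumptions, not the DCM1 implementation: the triangulation step is omitted, so the example uses a distance matrix whose threshold graph is already triangulated, and maximal cliques are listed with a plain Bron-Kerbosch search (a general-purpose choice made here for brevity). The function names are hypothetical.

```python
from itertools import combinations

def threshold_graph(d, q):
    """Adjacency sets of the graph with an edge (i, j) iff d[i][j] <= q."""
    n = len(d)
    adj = {i: set() for i in range(n)}
    for i, j in combinations(range(n), 2):
        if d[i][j] <= q:
            adj[i].add(j)
            adj[j].add(i)
    return adj

def maximal_cliques(adj):
    """List all maximal cliques using basic Bron-Kerbosch recursion."""
    cliques = []

    def expand(current, candidates, excluded):
        if not candidates and not excluded:
            cliques.append(frozenset(current))
            return
        for v in list(candidates):
            expand(current | {v}, candidates & adj[v], excluded & adj[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand(set(), set(adj), set())
    return cliques

# Toy input: five taxa placed on a line, so d[i][j] = |i - j|.
d = [[abs(i - j) for j in range(5)] for i in range(5)]
subproblems = maximal_cliques(threshold_graph(d, q=2))
```

Each clique is one subproblem; with this toy matrix and q = 2 the subproblems overlap in two taxa, which is what lets the subtrees be merged afterwards.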


DCM1-boosting distance-based methods [Nakhleh et al. ISMB 2001]

• DCM1-boosting makes distance-based methods more accurate

• Theoretical guarantees that DCM1-NJ converges to the true tree from polynomial length sequences

[Plot: error rate (y-axis, 0 to 0.8) versus number of taxa (x-axis, 0 to 1600) for NJ and DCM1-NJ.]


Major challenge: MP and ML

• Maximum Parsimony (MP) and Maximum Likelihood (ML) remain the methods of choice for most systematists

• The main challenge here is to make it possible to obtain good solutions to MP or ML in reasonable time periods on large datasets


Maximum Parsimony

• Input: Set S of n aligned sequences of length k

• Output: A phylogenetic tree T
– leaf-labeled by sequences in S
– additional sequences of length k labeling the internal nodes of T

such that $\sum_{(i,j) \in E(T)} H(i,j)$ is minimized.


Maximum parsimony (example)

• Input: Four sequences
– ACT
– ACA
– GTT
– GTA

• Question: which of the three trees has the best MP score?


Maximum Parsimony

[Figure: the three possible unrooted trees on the leaves ACT, ACA, GTT, GTA.]


Maximum Parsimony

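[Figure: the three trees drawn with internal-node labels and edge costs.
Tree 1: internal labels GTT and GTA, nonzero edge costs 1, 2, 2, MP score = 5.
Tree 2: internal labels ACA and ACT, edge costs 3, 1, 3, MP score = 7.
Tree 3: internal labels ACA and GTA, edge costs 1, 2, 1, MP score = 4 (optimal MP tree).]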


Maximum Parsimony: computational complexity

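[Figure: the optimal MP tree from the example, with leaves ACT, ACA, GTT, GTA, internal labels ACA and GTA, edge costs 1, 2, 1, MP score = 4.]

Finding the optimal MP tree is NP-hard

Optimal labeling can be computed in linear time O(nk)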
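Scoring a fixed tree can be illustrated with Fitch's algorithm, one standard way to obtain the parsimony length of a given topology in O(nk) time for a fixed alphabet. The talk does not name the algorithm it has in mind, so this choice is an assumption, as are the nested-tuple tree encoding and the function name `fitch_score`. Each unrooted four-leaf tree is rooted on its middle edge, which does not change the score.

```python
def fitch_score(tree, seqs):
    """Parsimony length of a rooted binary tree under Fitch's algorithm.

    tree: nested 2-tuples whose leaves are names (assumed encoding).
    seqs: dict from leaf name to an aligned sequence.
    """
    total = 0

    def state_sets(node):
        nonlocal total
        if isinstance(node, str):
            return [{c} for c in seqs[node]]
        left, right = (state_sets(child) for child in node)
        merged = []
        for a, b in zip(left, right):
            common = a & b
            if common:
                merged.append(common)   # no change needed at this site
            else:
                merged.append(a | b)    # one substitution is forced
                total += 1
        return merged

    state_sets(tree)
    return total

seqs = {name: name for name in ("ACT", "ACA", "GTT", "GTA")}
tree_a = (("ACT", "GTT"), ("ACA", "GTA"))
tree_b = (("ACT", "GTA"), ("ACA", "GTT"))
tree_c = (("ACT", "ACA"), ("GTT", "GTA"))
scores = [fitch_score(t, seqs) for t in (tree_a, tree_b, tree_c)]
```

Fitch returns the length under an optimal internal labeling, so a tree drawn with one particular labeling can show a larger length than the value computed here.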


Problems with current techniques for MP

[Plot: Performance of TNT with time. Average MP score above optimal, shown as a percentage of the optimal (y-axis, 0 to 0.2), versus hours (x-axis, 0 to 24).]

Even the best of the current methods do not reach 0.01% of “optimal” on large datasets in 24 hours. (“Optimal” means best score to date, using any method over any amount of time.)


Observations

• The best MP heuristics cannot get acceptably good solutions within 24 hours on most of these large datasets.

• Datasets of these sizes may need months (or years) of further analysis to reach reasonable solutions.

• Apparent convergence can be misleading.


How can we improve upon existing techniques?


Tree Bisection and Reconnection (TBR)

Delete an edge

Reconnect the trees with a new edge that bifurcates an edge in each tree


A conjecture as to why current techniques are poor:

• Our studies suggest that trees with near optimal scores tend to be topologically close (RF distance less than 15%) to the other near optimal trees.

• The standard technique (TBR) for moving around tree space explores O(n^3) trees, which are mostly topologically distant.

• So TBR may be useful initially (to reach near optimality) but then more “localized” searches are more productive.
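The RF (Robinson-Foulds) distance used above counts the bipartitions that appear in exactly one of two trees. The sketch below is a minimal illustration that assumes trees on the same leaf set, encoded as nested tuples with string leaf names; the normalization by 2(n-3), the maximum for two binary trees on n leaves, is one common convention and is assumed here. The function names are hypothetical.

```python
def leaf_set(node):
    """All leaf names below a node."""
    if isinstance(node, str):
        return frozenset([node])
    return frozenset().union(*(leaf_set(child) for child in node))

def bipartitions(tree):
    """Nontrivial bipartitions of a tree, each stored as the side that
    does not contain a fixed reference leaf."""
    all_leaves = leaf_set(tree)
    reference = min(all_leaves)
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        side = below if reference not in below else all_leaves - below
        # Keep only splits with at least two leaves on each side.
        if 1 < len(side) < len(all_leaves) - 1:
            splits.add(side)
        return below

    visit(tree)
    return splits

def rf_distance(t1, t2):
    """Number of bipartitions found in exactly one of the two trees."""
    return len(bipartitions(t1) ^ bipartitions(t2))

def normalized_rf(t1, t2):
    n = len(leaf_set(t1))
    return rf_distance(t1, t2) / (2 * (n - 3))

t1 = ((("A", "B"), "C"), ("D", "E"))
t2 = ((("A", "C"), "B"), ("D", "E"))
```

Here `t1` and `t2` share the split {D, E} but disagree on the other one, so they differ by two bipartitions out of a possible four.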


Using DCMs differently

• Observation: DCMs make small local changes to the tree

• New algorithmic strategy: use DCMs iteratively and/or recursively to improve heuristics on large datasets

• However, the initial DCMs for MP
– produced large subproblems and
– took too long to compute

• We needed a decomposition strategy that produces small subproblems quickly.


New DCM3 decomposition

Input: Set S of sequences, and guide-tree T

1. Compute short subtree graph G(S,T), based upon T

2. Find clique separator in the graph G(S,T) and form subproblems

DCM3 decompositions
(1) can be obtained in O(n) time
(2) yield small subproblems
(3) can be used iteratively


Strict Consensus Merger (SCM)


Iterative-DCM3

[Diagram: a cycle in which tree T is passed through DCM3 and the base method to yield a new tree T’.]


New DCMs

• DCM3
1. Compute subproblems using DCM3 decomposition
2. Apply base method to each subproblem to yield subtrees
3. Merge subtrees using the Strict Consensus Merger technique
4. Randomly refine to make it binary

• Recursive-DCM3

• Iterative DCM3
1. Compute a DCM3 tree
2. Perform local search and go to step 1

• Recursive-Iterative DCM3
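The control flow of DCM3 and Iterative DCM3 listed above can be sketched as two small functions. This is only a skeleton under assumptions: every step (`decompose`, `base_method`, `merge`, `refine`, `local_search`) is a hypothetical callable supplied by the caller, and the fixed number of rounds is a stopping rule chosen here for illustration, since the slide gives none.

```python
def dcm3(guide_tree, decompose, base_method, merge, refine):
    """One DCM3 pass: decompose, solve subproblems, merge, refine."""
    subproblems = decompose(guide_tree)               # step 1
    subtrees = [base_method(s) for s in subproblems]  # step 2
    merged = merge(subtrees)                          # step 3 (e.g. SCM)
    return refine(merged)                             # step 4

def iterative_dcm3(start_tree, rounds, local_search, **steps):
    """Alternate a DCM3 pass with local search, feeding each result
    back in as the next guide tree."""
    tree = start_tree
    for _ in range(rounds):
        tree = local_search(dcm3(tree, **steps))
    return tree

# Toy stand-ins so the skeleton runs: a "tree" is just an integer here.
toy_steps = dict(
    decompose=lambda t: [t, t],
    base_method=lambda s: s + 1,
    merge=max,
    refine=lambda t: t,
)
result = iterative_dcm3(0, rounds=3, local_search=lambda t: t + 10, **toy_steps)
```

A recursive variant would call `dcm3` again inside `base_method` whenever a subproblem is still too large, which this sketch leaves out.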


Comparison of DCM decompositions (Maximum subset size)

DCM2 subproblems are almost as large as the full dataset size on datasets 1 through 4. On datasets 5-10 DCM2 was too slow to compute a decomposition within 24 hours.

[Bar chart: maximum subset size as a percent of the full dataset size (y-axis, 0 to 100) for Rec-DCM3, DCM3, and DCM2 on datasets 1 through 10.]


Comparison of DCMs (4583 sequences)

Base method is the TNT-ratchet. DCM2 takes almost 10 hours to produce a tree and is too slow to run on larger datasets. Rec-I-DCM3 is the best method at all times.

[Plot: average MP score above optimal, shown as a percentage of the optimal (y-axis, 0 to 0.3), versus hours (x-axis, 0 to 24), for DCM2, TNT, DCM3, Rec-DCM3, I-DCM3, and Rec-I-DCM3.]


Comparison of DCMs (13,921 sequences)

Base method is the TNT-ratchet. Note the improvement in DCMs as we move from the default to recursion to iteration to recursion+iteration. On very large datasets Rec-I-DCM3 gives significant improvements over unboosted TNT.

[Plot: average MP score above optimal, shown as a percentage of the optimal (y-axis, 0 to 0.4), versus hours (x-axis, 0 to 24), for TNT, DCM3, Rec-DCM3, I-DCM3, and Rec-I-DCM3.]


Rec-I-DCM3 significantly improves performance

Comparison of TNT to Rec-I-DCM3(TNT) on one large dataset

[Plot: average MP score above optimal, shown as a percentage of the optimal (y-axis, 0 to 0.2), versus hours (x-axis, 0 to 24), for the current best techniques and the DCM boosted version of best techniques.]


Rec-I-DCM3(TNT) vs. TNT (Comparison of scores at 24 hours)

The base method is the default TNT technique, the current best method for MP on large datasets. Rec-I-DCM3 significantly improves upon the unboosted TNT.

[Bar chart: average MP score above optimal at 24 hours, shown as a percentage of the optimal (y-axis, 0 to 0.1), for TNT and Rec-I-DCM3 on datasets 1 through 10.]


Observations

• Rec-I-DCM3 improves upon the best performing heuristics for MP.

• The improvement increases with the difficulty of the dataset.

• DCMs also boost the performance of ML heuristics (not shown).


Questions

• Tree shape (including branch lengths) has an impact on phylogeny reconstruction - but what model of tree shape to use?

• What is the sequence length requirement for Maximum Likelihood? (Result by Szekely and Steel is worse than that for Neighbor Joining.)

• Why is MP not so bad?


General comments

• There is interesting computer science research to be done in computational phylogenetics, with a tremendous potential for impact.

• Algorithm development must be tested on both real and simulated data.

• The interplay between data, stochastic models of evolution, optimization problems, and algorithms is important and instructive.


Reconstructing the “Tree” of Life

Handling large datasets: millions of species

The “Tree of Life” is not really a tree: reticulate evolution


Ringe-Warnow Phylogenetic Tree of Indo-European

(dates not meant seriously)


Websites for more information

• Please see the CIPRES web site at http://www.phylo.org for more about biological phylogenetics

• The Computational Phylogenetics for Historical Linguistics project web site has papers, data, and additional material, as well as a workshop announcement for March 21, 2005 on Mathematical Modelling and Analysis of Language Diversification: http://www.cs.rice.edu/~nakhleh/CPHL


Acknowledgements

• NSF
• The David and Lucile Packard Foundation
• The Radcliffe Institute for Advanced Study, and the Program in Evolutionary Dynamics at Harvard
• The Institute for Cellular and Molecular Biology at UT-Austin
• Collaborators: Usman Roshan, Bernard Moret, and Tiffani Williams