How to Compute a Large Tree (with or without an alignment)

Source: tandy.cs.illinois.edu/Warnow-largetrees.pdf



This talk

•  Part 1: How to get a good alignment
•  Part 2: How to get a good tree from a good alignment
•  Part 3: How to get a good tree without an alignment


1kp: Thousand Transcriptome Project

•  Plant Tree of Life based on transcriptomes of ~1200 species
•  More than 13,000 gene families (most not single copy)

Gene Tree Incongruence

G. Ka-Shu Wong (U Alberta), N. Wickett (Northwestern), J. Leebens-Mack (U Georgia), N. Matasci (iPlant), T. Warnow (UIUC), S. Mirarab (UCSD), N. Nguyen (UCSD)

Plus many many other people…

Challenge: Alignment of datasets with > 100,000 sequences


Multiple Sequence Alignment (MSA): an important grand challenge [1]

Aligned sequences:
S1 = -AGGCTATCACCTGACCTCCA
S2 = TAG-CTATCAC--GACCGC--
S3 = TAG-CT-------GACCGC--
…
Sn = -------TCAC--GACCGACA

Unaligned sequences:
S1 = AGGCTATCACCTGACCTCCA
S2 = TAGCTATCACGACCGC
S3 = TAGCTGACCGC
…
Sn = TCACGACCGACA

•  Novel techniques needed for scalability and accuracy
•  NP-hard problems and large datasets
•  Current methods do not provide good accuracy
•  Few methods can analyze even moderately large datasets
•  Many important applications besides phylogenetic estimation

[1] Frontiers in Massive Data Analysis, National Academies Press, 2013
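
As a small illustration of how the aligned and unaligned versions of S1 … Sn above relate to each other, the following Python sketch checks two basic properties of an alignment: every row has the same length, and deleting the gap characters from a row gives back the original sequence. The function name `is_valid_alignment` is hypothetical and chosen only for this example.

```python
# Illustrative sketch: an MSA inserts gap characters ("-") so that all rows
# have equal length; removing the gaps must return the input sequences.
aligned = {
    "S1": "-AGGCTATCACCTGACCTCCA",
    "S2": "TAG-CTATCAC--GACCGC--",
    "S3": "TAG-CT-------GACCGC--",
    "Sn": "-------TCAC--GACCGACA",
}
unaligned = {
    "S1": "AGGCTATCACCTGACCTCCA",
    "S2": "TAGCTATCACGACCGC",
    "S3": "TAGCTGACCGC",
    "Sn": "TCACGACCGACA",
}


def is_valid_alignment(aligned, unaligned, gap="-"):
    """Return True if all rows have equal length and degap to the inputs."""
    row_lengths = {len(row) for row in aligned.values()}
    if len(row_lengths) != 1:
        return False
    return all(
        aligned[name].replace(gap, "") == sequence
        for name, sequence in unaligned.items()
    )


print(is_valid_alignment(aligned, unaligned))
```

This only checks that the alignment is well formed; it says nothing about whether the homologies it asserts are correct.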


1000-taxon models, ordered by difficulty (Liu et al., 2009)


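Re-aligning on a tree (boosting an MSA method)

[Diagram: a tree over four groups of sequences A, B, C, D, processed in a four-step cycle]

1.  Decompose dataset (into subsets A, B, C, D)
2.  Align subsets
3.  Merge subset-alignments (into ABCD)
4.  Estimate ML tree on merged alignment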


SATé and PASTA Algorithms

[Diagram: a cycle between "Tree" and "Alignment"]

•  Obtain initial alignment and estimated ML tree
•  Use tree to compute new alignment
•  Estimate ML tree on new alignment

Repeat until termination condition, and return the alignment/tree pair with the best ML score
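
The iteration described on this slide can be sketched as a short Python loop. This is only a skeleton under stated assumptions: `realign`, `estimate_tree`, and `ml_score` are hypothetical callables standing in for the real components (the tree-guided re-alignment, the ML tree search, and the likelihood score), and the termination condition is assumed to be a fixed number of iterations.

```python
def iterate_alignment_and_tree(initial_alignment, initial_tree,
                               realign, estimate_tree, ml_score,
                               max_iterations=3):
    """Alternate between re-aligning on the current tree and re-estimating
    the tree, keeping the alignment/tree pair with the best ML score.

    All three callables are placeholders supplied by the caller:
      realign(tree) -> alignment
      estimate_tree(alignment) -> tree
      ml_score(alignment, tree) -> float (higher is better)
    """
    alignment, tree = initial_alignment, initial_tree
    best_score = ml_score(alignment, tree)
    best_pair = (alignment, tree)
    for _ in range(max_iterations):  # assumed termination condition
        alignment = realign(tree)            # use tree to compute new alignment
        tree = estimate_tree(alignment)      # estimate ML tree on new alignment
        score = ml_score(alignment, tree)
        if score > best_score:
            best_score = score
            best_pair = (alignment, tree)
    return best_pair
```

Because the best-scoring pair is tracked separately from the current pair, a later iteration that scores worse does not overwrite an earlier, better result.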


Tree Error – Simulated data

[Figure: RNASim. Tree error (FN rate, 0.00 to 0.20) vs. number of sequences (10000, 50000, 100000, 200000) for Clustal-Omega, Muscle, Mafft, Starting Tree, SATe2, PASTA, and Reference Alignment]

•  Simulated RNASim datasets from 10K to 200K taxa
•  Limited to 24 hours using 12 CPUs
•  Not all methods could run (missing bars could not finish)
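
The tree error plotted here is the false-negative (FN) rate: the fraction of internal branches (bipartitions) of the true tree that are missing from the estimated tree. A minimal Python sketch of that computation follows; it assumes trees are given as nested tuples of leaf names over the same leaf set, which is a representation chosen only for this example.

```python
def leaf_set(tree):
    """All leaf names in a tree given as nested tuples of strings."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def bipartitions(tree, all_leaves):
    """Non-trivial bipartitions, each stored as the side without a fixed
    reference leaf so that equal splits compare equal across trees."""
    reference = min(all_leaves)
    splits = set()

    def leaves_below(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(leaves_below(child) for child in node))
        # Skip trivial splits (one leaf, or all but one leaf, on a side).
        if 1 < len(below) < len(all_leaves) - 1:
            splits.add(below if reference not in below else all_leaves - below)
        return below

    leaves_below(tree)
    return splits


def fn_rate(true_tree, estimated_tree):
    """Fraction of true-tree bipartitions absent from the estimated tree.
    Assumes the true tree has at least one internal branch."""
    leaves = leaf_set(true_tree)
    true_splits = bipartitions(true_tree, leaves)
    estimated_splits = bipartitions(estimated_tree, leaves)
    return len(true_splits - estimated_splits) / len(true_splits)


true_tree = ((("A", "B"), "C"), ("D", "E"))
estimated_tree = ((("A", "C"), "B"), ("D", "E"))
print(fn_rate(true_tree, estimated_tree))
```

In the example, the true tree has two internal branches (AB|CDE and ABC|DE) and the estimated tree recovers only the second, so half of the true branches are missing.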


PASTA Running Time and Scalability

[Figure, excerpted from p. 10 of "PASTA: ultra-large multiple sequence alignment": running time (minutes, 0 to 1250) vs. number of sequences (10,000 to 200,000); speedup (1 to 8) vs. number of threads (1 to 12) for PASTA and SATe2]

Fig. 5. Running time comparison of PASTA and SATe. (a) Running time profiling on one iteration for RNASim datasets with 10K and 50K sequences (the dotted region indicates the last pairwise merge). (b) Running time for one iteration of PASTA with 12 CPUs as a function of the number of sequences (the solid line is fitted to first two points). (c) Scalability for PASTA and SATe with increased number of CPUs.

… reason SATe uses so much time is that all mergers are done hierarchically using either Opal (for small datasets) or Muscle (on larger datasets), and both are computationally expensive with increased number of sequences. For example, the last pairwise merge within SATe, shown by the dotted area in Figure 5a, is entirely serial and takes up a large chunk of the total time. PASTA solves this problem by using transitivity for all but the initial pairwise mergers, and therefore scales well with increased dataset size, as shown in Figure 5b (the sub-linear scaling is due to a better use of parallelism with increased number of sequences). Finally, Figure 5c shows that PASTA is highly parallelizable, and has a much better speed-up with increasing number of threads than SATe does. While PASTA has a much improved parallelization, it does not quite scale up linearly, because FastTree-2 does not scale up well with increased thread count.

Divide-and-Conquer strategy: impact of guide tree. We also investigated the impact of the use of the guide tree for computing the subset decomposition, and hence defining the Type 1 sub-alignments. We compared results obtained using three different decompositions: the decomposition computed by PASTA on the HMM-based starting tree, the decomposition computed by PASTA on the true (model) tree, and a random decomposition into subsets of size 200, all on the RNASim 10k dataset. PASTA alignments and trees had roughly the same accuracy when the guide tree was either the true tree or the HMM-based starting tree (Table 3). However, when based on a random decomposition, tree error increased dramatically from 10.5% to 52.3%, and alignment scores also dropped substantially. Thus, the guide-tree based dataset decomposition used by PASTA provides substantial improvements over random decompositions, and the default technique for getting the starting tree works quite well.

•  One iteration
•  Using
   –  12 CPUs
   –  1 node on Lonestar TACC
   –  Maximum 24 GB memory
•  Showing wall clock running time
   –  ~1 hour for 10k taxa
   –  ~17 hours for 200k taxa
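
As a back-of-the-envelope check on these two wall-clock figures, the Python snippet below estimates a scaling exponent. It assumes, purely for illustration, that running time follows a power law (time proportional to n^k) through the two approximate data points; the slide itself gives only the two rounded values.

```python
import math

# Approximate values read from the slide (both are rounded).
n_small, hours_small = 10_000, 1.0
n_large, hours_large = 200_000, 17.0

# Solve hours_large / hours_small = (n_large / n_small) ** k for k.
size_ratio = n_large / n_small       # 20x more sequences
time_ratio = hours_large / hours_small  # 17x more time
exponent = math.log(time_ratio) / math.log(size_ratio)

print(f"sequences x{size_ratio:.0f}, time x{time_ratio:.0f}, k = {exponent:.2f}")
```

An exponent slightly below 1 means that a 20-fold increase in sequences costs a bit less than a 20-fold increase in time. With only two rounded data points this is a rough indication, not a measured complexity.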



[Figure: histogram of sequence lengths (0 to 2000) against counts (0 to 12000). Mean: 317, Median: 266]

1KP dataset: more than 100,000 p450 amino-acid sequences, many fragmentary


All standard multiple sequence alignment methods we tested performed poorly on datasets with fragments.


1kp: Thousand Transcriptome Project

Challenge: Alignment of datasets with > 100,000 sequences, many of which are fragmentary


UPP

UPP = “Ultra-large multiple sequence alignment using Phylogeny-aware Profiles”

Nguyen, Mirarab, and Warnow. Genome Biology, 2015.

Purpose: highly accurate large-scale multiple sequence alignments, even in the presence of fragmentary sequences.


Uses an ensemble of HMMs


UPP Algorithmic Approach

1.  Select small random subset of full-length sequences, and build “backbone alignment”

2.  Construct an “Ensemble of Hidden Markov Models” on the backbone alignment

3.  Add all remaining sequences to the backbone alignment using the Ensemble of HMMs
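
Step 1 of this approach can be sketched in a few lines of Python. The sketch makes two assumptions that are not stated on the slide: a sequence counts as "full-length" when its length is within a relative tolerance of the median length (the `tolerance=0.25` default is an illustrative choice), and the function name `select_backbone` is hypothetical. Steps 2 and 3 require an HMM toolkit and are not shown.

```python
import random
from statistics import median


def select_backbone(sequences, backbone_size, tolerance=0.25, seed=0):
    """Split sequence names into a random backbone of 'full-length'
    sequences and the remaining query sequences.

    sequences: dict mapping name -> unaligned sequence string
    tolerance: assumed definition of full-length, i.e. length within
               tolerance * median of the median length
    """
    median_length = median(len(seq) for seq in sequences.values())
    full_length = sorted(
        name for name, seq in sequences.items()
        if abs(len(seq) - median_length) <= tolerance * median_length
    )
    rng = random.Random(seed)  # fixed seed so the example is reproducible
    sample_size = min(backbone_size, len(full_length))
    backbone = set(rng.sample(full_length, sample_size))
    queries = set(sequences) - backbone
    return backbone, queries


# Toy data: five full-length sequences and three fragments.
toy = {f"full{i}": "A" * 100 for i in range(5)}
toy.update({f"frag{i}": "A" * 20 for i in range(3)})
backbone, queries = select_backbone(toy, backbone_size=3)
print(sorted(backbone), len(queries))
```

Restricting the backbone to full-length sequences keeps fragments out of the alignment that the HMMs are built on; the fragments are handled later as query sequences.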


RNASim Million Sequences: tree error

Using 12 TACC processors:
•  UPP(Fast,NoDecomp) took 2.2 days,
•  UPP(Fast) took 11.9 days, and
•  PASTA took 10.3 days


RNASim Million Sequences: alignment error

Notes:
•  We show alignment error using the average of SP-FN and SP-FP.
•  UPP variants have better alignment scores than PASTA.
•  (Not shown: Total Column Scores; PASTA is more accurate than UPP)
•  No other methods tested could complete on these data.
•  PASTA under-aligns: its alignment is 43 times wider than the true alignment (~900 Gb of disk space). UPP alignments were closer in length to the true alignment (0.93 to 1.38 times as wide).
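
The alignment error used here averages SP-FN (the fraction of residue pairs aligned in the true alignment that are missing from the estimated alignment) and SP-FP (the fraction of residue pairs aligned in the estimated alignment that are not in the true alignment). The Python sketch below computes both on a toy example; the function names and the dictionary representation of an alignment are assumptions made for illustration.

```python
from itertools import combinations


def homologous_pairs(alignment, gap="-"):
    """Set of residue pairs placed in the same column. A residue is
    identified by (sequence name, index in the ungapped sequence)."""
    names = sorted(alignment)
    next_index = {name: 0 for name in names}
    width = len(next(iter(alignment.values())))
    pairs = set()
    for column in range(width):
        residues = []
        for name in names:
            if alignment[name][column] != gap:
                residues.append((name, next_index[name]))
                next_index[name] += 1
        pairs.update(combinations(residues, 2))
    return pairs


def sp_error(true_alignment, estimated_alignment):
    """Return (SP-FN, SP-FP, their average). Assumes both alignments
    contain at least one aligned pair of residues."""
    true_pairs = homologous_pairs(true_alignment)
    estimated_pairs = homologous_pairs(estimated_alignment)
    sp_fn = len(true_pairs - estimated_pairs) / len(true_pairs)
    sp_fp = len(estimated_pairs - true_pairs) / len(estimated_pairs)
    return sp_fn, sp_fp, (sp_fn + sp_fp) / 2


true_alignment = {"a": "AC-G", "b": "ACTG"}
estimated_alignment = {"a": "ACG-", "b": "ACTG"}
print(sp_error(true_alignment, estimated_alignment))
```

In the toy example each alignment has three aligned pairs and they share two, so SP-FN and SP-FP are both one third. An alignment that is much wider than the true one, as noted for PASTA above, asserts fewer pairs overall, which tends to raise SP-FN while keeping SP-FP low.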


1000M2 model condition

UPP is more robust to fragmentary sequences than PASTA

[Figure S32: Alignment and tree error of PASTA and UPP(Default) on the fragmentary 1000M2 datasets. (a) Average alignment error: mean alignment error vs. % fragmentary (0, 12.5, 25, 50). (b) Average tree error: Delta FN tree error vs. % fragmentary (0, 12.5, 25, 50).]

Under high rates of evolution, PASTA is badly impacted by fragmentary sequences (the same is true for other methods). Under low rates of evolution, PASTA can still be highly accurate (data not shown). UPP continues to have good accuracy even on datasets with many fragments under all rates of evolution.


Nguyen et al. Genome Biology (2015) 16:124, page 6 of 15

Table 2. Average alignment SP-error, tree error, and TC score across most full-length datasets

Method   ROSE NT        RNASim 10K     Indelible 10K  ROSE AA        CRW            10 AA          HomFam (17)    HomFam (2)

Average alignment SP-error
UPP      7.8 (1)        9.5 (1)        1.7 (2)        2.9 (1)        12.5 (1)       24.2 (1)       23.3 (1)       20.8 (2)
PASTA    7.8 (1)        15.0 (2)       0.4 (1)        3.1 (1)        12.8 (1)       24.0 (1)       22.5 (1)       17.3 (1)
MAFFT    20.6 (2)       25.5 (3)       41.4 (3)       4.9 (2)        28.3 (2)       23.5 (1)       25.3 (2)       20.7 (2)
Muscle   20.6 (2)       64.7 (5)       62.4 (4)       5.5 (3)        30.7 (3)       30.2 (2)       48.1 (4)       X
Clustal  49.2 (3)       35.3 (4)       X              6.5 (4)        43.3 (4)       24.3 (1)       27.7 (3)       29.4 (3)

Average ΔFN error
UPP      1.3 (1)        0.8 (1)        0.3 (1)        1.8 (1)        7.8 (2)        3.4 (2)        NA             NA
PASTA    1.3 (1)        0.4 (1)        <0.1 (1)       1.3 (1)        5.1 (1)        3.3 (1)        NA             NA
MAFFT    5.8 (2)        3.5 (2)        24.8 (3)       4.5 (3)        10.1 (3)       2.3 (1)        NA             NA
Muscle   8.4 (3)        7.3 (3)        32.5 (4)       3.1 (2)        5.5 (1)        12.6 (3)       NA             NA
Clustal  24.3 (4)       10.4 (4)       X              4.2 (3)        34.1 (4)       3.5 (2)        NA             NA

Average TC score
UPP      37.8 (1)       0.5 (2)        11.0 (3)       2.6 (2)        1.4 (1)        11.4 (1)       47.3 (1)       40.3 (3)
PASTA    37.8 (1)       2.3 (1)        48.0 (1)       5.4 (1)        2.3 (1)        12.1 (1)       46.1 (2)       50.0 (1)
MAFFT    31.4 (2)       0.4 (2)        7.8 (4)        0.6 (3)        0.7 (2)        12.1 (1)       45.5 (2)       46.9 (2)
Muscle   9.8 (3)        <0.0 (2)       18.3 (2)       2.7 (2)        0.7 (2)        10.5 (2)       27.7 (4)       X
Clustal  5.7 (4)        0.2 (2)        X              3.1 (2)        0.1 (2)        11.8 (1)       38.6 (3)       31.0 (4)

We report the average alignment SP-error (the average of SPFN and SPFP errors) (top), average ΔFN error (middle), and average TC score (bottom), for the collection of full-length datasets. All scores represent percentages and so are out of 100. Results marked with an X indicate that the method failed to terminate within the time limit (24 hours on a 12-core machine). Muscle failed to align two of the HomFam datasets; we report separate average results on the 17 HomFam datasets for all methods and the two HomFam datasets for all but Muscle. We did not test tree error on the HomFam datasets (therefore, the ΔFN error is indicated by “NA”). The tier ranking for each method is shown parenthetically.

… memory error message were marked as failures. For experiments on the million-sequence RNASim dataset, we ran the methods on a dedicated machine with 256 GB of main memory and 12 cores until an alignment was generated or the method failed. We also performed a limited number of experiments on TACC with UPP’s internal checkpointing mechanism, to explore performance when time is not limited. All methods other than Muscle had parallel implementations and were able to take advantage of the 12 available cores.

On full-length datasets (Table 2) where nearly all methods were able to complete, PASTA was nearly always in …

Table 3. Average alignment SP-error and tree error across fragmentary datasets

Method   ROSE NT        RNASim 10K     Indelible 10K  CRW (16S.3 and 16S.T)

Average alignment SP-error
UPP      8.3 (1)        11.8 (1)       2.7 (1)        16.1 (1)
PASTA    25.2 (2)       47.7 (4)       8.8 (2)        23.3 (2)
MAFFT    32.5 (3)       25.5 (2)       51.3 (3)       24.5 (3)
Muscle   35.3 (4)       82.2 (5)       77.6 (4)       70.6 (5)
Clustal  62.0 (5)       35.0 (3)       X              46.7 (4)

Average ΔFN error
UPP      1.9 (1)        3.1 (1)        2.5 (1)        7.4 (2)
PASTA    25.2 (3)       21.9 (3)       9.0 (2)        8.2 (2)
MAFFT    18.0 (2)       6.2 (2)        35.6 (3)       2.5 (1)
Muscle   27.5 (4)       43.6 (5)       45.2 (4)       30.1 (3)
Clustal  47.8 (5)       26.3 (4)       X              37.4 (4)

We report the average alignment error (top) and average ΔFN error (bottom) on the collection of fragmentary datasets. Clustal-Omega failed to align any of the Indelible 10000M2 fragmentary datasets and thus we mark the results with an X. The tier ranking for each method is shown in parentheses.


UPP Running Time

[Figure: wall clock align time (hr, 0 to 15) vs. number of sequences (50000 to 200000) for UPP(Fast)]

Wall-clock time used (in hours) given 12 processors


PASTA and UPP: boosters of MSA methods

•  PASTA
   –  Combines iteration and divide-and-conquer to “boost” a preferred MSA method to large datasets; we showed results based on MAFFT
•  UPP
   –  Step 1: Constructs a “backbone” tree and an alignment on a small random subset of the sequences
   –  Step 2: Aligns all the remaining sequences to the backbone alignment
   –  We showed results where default PASTA computed the backbone alignment and tree.

Note: PASTA and UPP can be used with any MSA method.


Part 2: To get a large tree from an MSA?

Basic approach: maximum likelihood

Leading ML methods for large datasets:
   –  RAxML (and ExaML), and
   –  FastTree-2


Figure 3. Comparison of ML methods on the 16S.B.ALL dataset.

Liu K, Linder CR, Warnow T (2011) RAxML and FastTree: Comparing Two Methods for Large-Scale Maximum Likelihood Phylogeny Estimation. PLoS ONE 6(11): e27731. doi:10.1371/journal.pone.0027731
http://journals.plos.org/plosone/article?id=info:doi/10.1371/journal.pone.0027731


Figure 1. Missing branch rates of ML methods on the simulated 1000-taxon datasets.

Liu K, Linder CR, Warnow T (2011) RAxML and FastTree: Comparing Two Methods for Large-Scale Maximum Likelihood Phylogeny Estimation. PLoS ONE 6(11): e27731. doi:10.1371/journal.pone.0027731
http://journals.plos.org/plosone/article?id=info:doi/10.1371/journal.pone.0027731


FastTree vs. RAxML

•  FastTree-2 (Morgan Price et al.): very fast, produces tree topologies just about as accurate as RAxML's, and can handle datasets up to 1,000,000 sequences.

•  RAxML (Stamatakis et al.): not nearly fast enough on large numbers of taxa, but can do multi-locus analyses well.


FastTree vs. RAxML

Main advantage of RAxML over FastTree is really only on extremely accurate alignments, and even there it is probably not about the tree topology but some other parameter.


Part 3: To get a large tree without an MSA?

•  Alignment-free estimation? No evidence (yet) that it is as accurate as good two-phase methods.

•  But almost alignment-free estimation seems feasible!


DACTAL

•  Divide-And-Conquer Trees (Almost) without alignments

•  Nelesen et al., ISMB 2012 and Bioinformatics 2012

•  Input: unaligned sequences
•  Output: Tree (but no alignment)


DACTAL

[Diagram: iterative cycle]

Unaligned Sequences
→ (BLAST-based) → Overlapping subsets
→ (RAxML(MAFFT)) → A tree for each subset
→ (Supertree method: SuperFine) → A tree for the entire dataset
→ (pRecDCM3) → Overlapping subsets, and repeat


Analysis of the 16S.T dataset (7350 RNA sequences)

Default: start with two-phase tree, decompose into 200-taxon subsets, run FastTree(MAFFT) on subsets, combine using SuperFine+MRP


[Figure: missing branch rate (0 to 0.45) and run time (h, 0 to 120) on the 1000-taxon model conditions 1000M3, 1000L2, 1000S3, 1000S2, 1000L1, 1000M2, 1000L3, 1000S1, 1000M1 for ML(Muscle), ML(Prank+GT), ML(Opal), ML(MAFFT), SATé, DACTAL, and ML(TrueAln)]

We show results with 10 DACTAL iterations; SATe-1 used


Summary: To get a large MSA

•  Fewer than 200 sequences? MAFFT likely best for slowly evolving loci, otherwise PASTA or UPP.
•  More than 200 sequences? PASTA or UPP.
•  Fragments? Use UPP.
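
These rules of thumb can be written down as a small Python function. This is one reading of the slide, not a definitive procedure: the function name and parameters are hypothetical, the fragment rule is assumed to take precedence over dataset size, and a dataset of exactly 200 sequences (which the slide does not cover) is treated as large.

```python
def suggest_msa_methods(num_sequences, has_fragments, slowly_evolving):
    """Return the MSA methods suggested by the summary above.

    Assumptions made for this sketch:
      - fragmentary datasets always go to UPP, whatever their size
      - exactly 200 sequences is treated like 'more than 200'
    """
    if has_fragments:
        return ["UPP"]
    if num_sequences < 200 and slowly_evolving:
        return ["MAFFT"]
    return ["PASTA", "UPP"]


print(suggest_msa_methods(150, has_fragments=False, slowly_evolving=True))
print(suggest_msa_methods(150, has_fragments=False, slowly_evolving=False))
print(suggest_msa_methods(100_000, has_fragments=True, slowly_evolving=False))
```

Whether a locus counts as "slowly evolving" is left to the caller, since the slide gives no numeric threshold.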


Summary: To get a large tree from an MSA?

•  Maximum likelihood is a good approach (high accuracy for tree topology point estimate).

•  For under 100 sequences, RAxML is good enough.

•  For larger datasets, try FastTree-2.


Open Problems in MSA/tree estimation

•  Statistical co-estimation of alignments and trees
•  Phylogeny estimation given indels
•  Exploring alignment uncertainty
•  Estimating better alignments by combining estimated alignments


Acknowledgments

Papers available at http://tandy.cs.illinois.edu/papers.html

PASTA and UPP at https://github.com/smirarab

Funding: NSF ABI-1458652 and III:AF:1513629, a Founder Professorship from the Grainger Foundation, and HHMI (to S.M.)

Computational support: TACC (PASTA and UPP)