How to Compute a Large Tree (with or without an alignment)

This talk

•  Part 1: How to get a good alignment
•  Part 2: How to get a good tree from a good alignment
•  Part 3: How to get a good tree without an alignment

1kp: Thousand Transcriptome Project

•  Plant Tree of Life based on transcriptomes of ~1200 species
•  More than 13,000 gene families (most not single copy)

Gene Tree Incongruence

G. Ka-Shu Wong U Alberta

N. Wickett Northwestern

J. Leebens-Mack U Georgia

N. Matasci iPlant

T. Warnow (UIUC), S. Mirarab (UCSD), N. Nguyen (UCSD)

Challenge: Alignment of datasets with > 100,000 sequences

Plus many many other people…

Multiple Sequence Alignment (MSA): an important grand challenge [1]

S1 = -AGGCTATCACCTGACCTCCA
S2 = TAG-CTATCAC--GACCGC--
S3 = TAG-CT-------GACCGC--
…
Sn = -------TCAC--GACCGACA

S1 = AGGCTATCACCTGACCTCCA
S2 = TAGCTATCACGACCGC
S3 = TAGCTGACCGC
…
Sn = TCACGACCGACA

•  Novel techniques needed for scalability and accuracy
•  NP-hard problems and large datasets
•  Current methods do not provide good accuracy
•  Few methods can analyze even moderately large datasets
•  Many important applications besides phylogenetic estimation

[1] Frontiers in Massive Data Analysis, National Academies Press, 2013
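As a small illustration of how the aligned and unaligned displays above relate: an alignment only inserts gap characters, so every aligned row should have the same width, and deleting the gaps from a row should give back the unaligned sequence. The sketch below checks this for the four example sequences; the helper names `degap` and `is_consistent_alignment` are ours, not from the talk.

```python
# Rows of the example alignment and the corresponding unaligned sequences.
aligned = {
    "S1": "-AGGCTATCACCTGACCTCCA",
    "S2": "TAG-CTATCAC--GACCGC--",
    "S3": "TAG-CT-------GACCGC--",
    "Sn": "-------TCAC--GACCGACA",
}
unaligned = {
    "S1": "AGGCTATCACCTGACCTCCA",
    "S2": "TAGCTATCACGACCGC",
    "S3": "TAGCTGACCGC",
    "Sn": "TCACGACCGACA",
}


def degap(row: str) -> str:
    """Remove gap characters from one aligned row."""
    return row.replace("-", "")


def is_consistent_alignment(aligned: dict, unaligned: dict) -> bool:
    """True if all rows share one width and each row degaps to its input."""
    same_width = len({len(row) for row in aligned.values()}) == 1
    same_residues = all(degap(aligned[name]) == seq
                        for name, seq in unaligned.items())
    return same_width and same_residues


print(is_consistent_alignment(aligned, unaligned))
```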

1000-taxon models, ordered by difficulty (Liu et al., 2009)

Re-aligning on a tree (boosting an MSA method)

Decompose dataset

Align subsets

Merge subset-alignments

Estimate ML tree on merged alignment

SATé and PASTA Algorithms

Obtain initial alignment and estimated ML tree

Use tree to compute new alignment

Estimate ML tree on new alignment

Repeat until termination condition, and return the alignment/tree pair with the best ML score
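The iteration described above can be sketched as a short driver loop. This is only an outline under stated assumptions: the three callables (`initial`, `realign_on_tree`, `estimate_ml_tree`) are hypothetical placeholders for whatever alignment and ML tree tools are plugged in, and a fixed iteration count stands in for the termination condition. It is not the SATé or PASTA implementation.

```python
from typing import Callable, List, Tuple

Alignment = List[str]
Tree = str  # for example a Newick string; any tree representation would do


def iterate_alignment_and_tree(
    sequences: List[str],
    initial: Callable[[List[str]], Tuple[Alignment, Tree, float]],
    realign_on_tree: Callable[[List[str], Tree], Alignment],
    estimate_ml_tree: Callable[[Alignment], Tuple[Tree, float]],
    iterations: int = 3,
) -> Tuple[Alignment, Tree, float]:
    """Alternate re-alignment and tree estimation; keep the best-scoring pair.

    All three callables are assumptions made for illustration:
    `initial` returns a starting (alignment, tree, ML score),
    `realign_on_tree` computes a new alignment guided by the current tree,
    `estimate_ml_tree` returns (tree, ML score) for an alignment.
    """
    alignment, tree, score = initial(sequences)
    best = (alignment, tree, score)
    for _ in range(iterations):  # stand-in for the termination condition
        alignment = realign_on_tree(sequences, tree)
        tree, score = estimate_ml_tree(alignment)
        if score > best[2]:  # higher (less negative) log-likelihood is better
            best = (alignment, tree, score)
    return best
```

Returning the best-scoring pair rather than the last one matters because the ML score is not guaranteed to improve in every iteration.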

Tree Error – Simulated data

[Figure: tree error on RNASim datasets with 10,000, 50,000, 100,000, and 200,000 sequences. Methods shown: Clustal-Omega, Muscle, Mafft, Starting Tree, SATe2, PASTA, Reference Alignment.]

•  Simulated RNASim datasets from 10K to 200K taxa
•  Limited to 24 hours using 12 CPUs
•  Not all methods could run (missing bars could not finish)

PASTA Running Time and Scalability

[Figure: running time against number of sequences (10,000 to 200,000) and against number of threads (1 to 12), for PASTA and SATe2.]

Fig. 5. Running time comparison of PASTA and SATe. (a) Running time profiling on one iteration for RNASim datasets with 10K and 50K sequences (the dotted region indicates the last pairwise merge). (b) Running time for one iteration of PASTA with 12 CPUs as a function of the number of sequences (the solid line is fitted to first two points). (c) Scalability for PASTA and SATe with increased number of CPUs.

reason SATe uses so much time is that all mergers are done hierarchically using either Opal (for small datasets) or Muscle (on larger datasets), and both are computationally expensive with increased number of sequences. For example, the last pairwise merge within SATe, shown by the dotted area in Figure 5a, is entirely serial and takes up a large chunk of the total time. PASTA solves this problem by using transitivity for all but the initial pairwise mergers, and therefore scales well with increased dataset size, as shown in Figure 5b (the sub-linear scaling is due to a better use of parallelism with increased number of sequences). Finally, Figure 5c shows that PASTA is highly parallelizable, and has a much better speed-up with increasing number of threads than SATe does. While PASTA has a much improved parallelization, it does not quite scale up linearly, because FastTree-2 does not scale up well with increased thread count.

Divide-and-Conquer strategy: impact of guide tree. We also investigated the impact of the use of the guide tree for computing the subset decomposition, and hence defining the Type 1 sub-alignments. We compared results obtained using three different decompositions: the decomposition computed by PASTA on the HMM-based starting tree, the decomposition computed by PASTA on the true (model) tree, and a random decomposition into subsets of size 200, all on the RNASim 10k dataset. PASTA alignments and trees had roughly the same accuracy when the guide tree was either the true tree or the HMM-based starting tree (Table 3). However, when based on a random decomposition, tree error increased dramatically from 10.5% to 52.3%, and alignment scores also dropped substantially. Thus, the guide-tree based dataset decomposition used by PASTA provides substantial improvements over random decompositions, and the default technique for getting the starting tree works quite well.

•  One iteration
•  Using
   –  12 CPUs
   –  1 node on Lonestar TACC
   –  Maximum 24 GB memory
•  Showing wall clock running time
   –  ~1 hour for 10k taxa
   –  ~17 hours for 200k taxa
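As a quick worked check of the sub-linear scaling claim, the two approximate timings above (about 1 hour for 10k taxa, about 17 hours for 200k taxa) can be turned into an empirical scaling exponent. The sketch assumes a simple power-law model, time proportional to n**k, which is our simplification and not a model from the talk; because the timings are rounded, the result is only indicative.

```python
import math


def scaling_exponent(n_small: float, hours_small: float,
                     n_large: float, hours_large: float) -> float:
    """Exponent k in time ~ n**k implied by two (size, time) measurements."""
    return math.log(hours_large / hours_small) / math.log(n_large / n_small)


# Approximate wall-clock figures quoted on the slide.
k = scaling_exponent(10_000, 1.0, 200_000, 17.0)
print(f"dataset grew {200_000 / 10_000:.0f}x, time grew {17.0 / 1.0:.0f}x, "
      f"exponent k = {k:.2f}")
```

An exponent below 1 means the running time grew more slowly than the number of sequences over this range.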


[Figure: histogram of sequence lengths (x-axis: Length, 0 to 2000; y-axis: Counts). Mean: 317, Median: 266.]

1KP dataset: more than 100,000 p450 amino-acid sequences, many fragmentary


All standard multiple sequence alignment methods we tested performed poorly on datasets with fragments.


Challenge: Alignment of datasets with > 100,000 sequences, many of which are fragmentary

UPP

UPP = “Ultra-large multiple sequence alignment using Phylogeny-aware Profiles”

Nguyen, Mirarab, and Warnow. Genome Biology, 2015.

Purpose: highly accurate large-scale multiple sequence alignments, even in the presence of fragmentary sequences.

Uses an ensemble of HMMs

UPP Algorithmic Approach

1.  Select small random subset of full-length sequences, and build “backbone alignment”

2.  Construct an “Ensemble of Hidden Markov Models” on the backbone alignment

3.  Add all remaining sequences to the backbone alignment using the Ensemble of HMMs
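Step 1 above can be sketched in a few lines. The sketch is illustrative only: the talk does not say how "full-length" is decided, so the rule used here (length within a `tolerance` fraction of the median length) and the parameter names `backbone_size`, `tolerance`, and `seed` are assumptions. Steps 2 and 3 (building the HMM ensemble and adding the remaining sequences) need an HMM toolkit and are not shown.

```python
import random
from statistics import median
from typing import Dict, List, Tuple


def split_backbone_and_queries(
    seqs: Dict[str, str],
    backbone_size: int,
    tolerance: float = 0.25,   # assumed definition of "full-length"
    seed: int = 0,
) -> Tuple[List[str], List[str]]:
    """Pick a random backbone among full-length sequences; the rest are queries."""
    med = median(len(s) for s in seqs.values())
    # Assumption: a sequence counts as full-length if its length is within
    # `tolerance` * median of the median length.
    full_length = sorted(name for name, s in seqs.items()
                         if abs(len(s) - med) <= tolerance * med)
    rng = random.Random(seed)
    backbone = rng.sample(full_length, min(backbone_size, len(full_length)))
    chosen = set(backbone)
    # Everything else, including fragments, is added to the backbone later.
    queries = [name for name in seqs if name not in chosen]
    return backbone, queries
```

Restricting the backbone to full-length sequences keeps fragments out of the alignment that the profile models are built from.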

RNASim Million Sequences: tree error

Using 12 TACC processors:
•  UPP(Fast,NoDecomp) took 2.2 days,
•  UPP(Fast) took 11.9 days, and
•  PASTA took 10.3 days

RNASim Million Sequences: alignment error

Notes:
•  We show alignment error using average of SP-FN and SP-FP.
•  UPP variants have better alignment scores than PASTA.
•  (Not shown: Total Column Scores – PASTA more accurate than UPP)
•  No other methods tested could complete on these data
•  PASTA under-aligns: its alignment is 43 times wider than the true alignment (~900 Gb of disk space). UPP alignments were closer in length to the true alignment (0.93 to 1.38 times as wide).
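The alignment error used in these notes can be computed from pairs of residues placed in the same column. SP-FN is the fraction of residue pairs aligned in the reference alignment that are missing from the estimated one, and SP-FP is the fraction of pairs in the estimated alignment that are not in the reference. The following is a minimal sketch for small inputs, with alignments given as dictionaries from sequence name to gapped row (a representation chosen here for illustration); enumerating all pairs is quadratic per column and would not be practical at the million-sequence scale discussed above.

```python
from itertools import combinations
from typing import Dict, Set, Tuple

Residue = Tuple[str, int]          # (sequence name, index in ungapped sequence)
Pair = Tuple[Residue, Residue]


def homology_pairs(alignment: Dict[str, str]) -> Set[Pair]:
    """All pairs of residues that share a column in the alignment."""
    names = sorted(alignment)
    width = len(alignment[names[0]])
    next_index = {name: 0 for name in names}
    pairs: Set[Pair] = set()
    for col in range(width):
        residues = []
        for name in names:
            if alignment[name][col] != "-":
                residues.append((name, next_index[name]))
                next_index[name] += 1
        pairs.update(combinations(residues, 2))
    return pairs


def sp_error(reference: Dict[str, str],
             estimated: Dict[str, str]) -> Tuple[float, float, float]:
    """Return (SP-FN, SP-FP, their average) as fractions between 0 and 1."""
    ref, est = homology_pairs(reference), homology_pairs(estimated)
    sp_fn = len(ref - est) / len(ref) if ref else 0.0
    sp_fp = len(est - ref) / len(est) if est else 0.0
    return sp_fn, sp_fp, (sp_fn + sp_fp) / 2
```

An alignment that is too wide (under-aligned) places few residues in shared columns, so it tends to raise SP-FN rather than SP-FP.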

[Figure panels: (a) Average alignment error and (b) Average tree error, for PASTA and UPP(Default), with 0, 12.5, 25, and 50% fragmentary sequences.]

Figure S32: Alignment and tree error of PASTA and UPP on the fragmentary 1000M2 datasets.

1000M2 model condition

UPP is more robust to fragmentary sequences than PASTA

Under high rates of evolution, PASTA is badly impacted by fragmentary sequences (the same is true for other methods). Under low rates of evolution, PASTA can still be highly accurate (data not shown). UPP continues to have good accuracy even on datasets with many fragments under all rates of evolution.

Nguyen et al. Genome Biology (2015) 16:124 Page 6 of 15

Table 2 Average alignment SP-error, tree error, and TC score across most full-length datasets

Method  ROSE NT        RNASim 10K     Indelible 10K  ROSE AA        CRW            10 AA          HomFam (17)    HomFam (2)

Average alignment SP-error
UPP     7.8 (1)        9.5 (1)        1.7 (2)        2.9 (1)        12.5 (1)       24.2 (1)       23.3 (1)       20.8 (2)
PASTA   7.8 (1)        15.0 (2)       0.4 (1)        3.1 (1)        12.8 (1)       24.0 (1)       22.5 (1)       17.3 (1)
MAFFT   20.6 (2)       25.5 (3)       41.4 (3)       4.9 (2)        28.3 (2)       23.5 (1)       25.3 (2)       20.7 (2)
Muscle  20.6 (2)       64.7 (5)       62.4 (4)       5.5 (3)        30.7 (3)       30.2 (2)       48.1 (4)       X
Clustal 49.2 (3)       35.3 (4)       X              6.5 (4)        43.3 (4)       24.3 (1)       27.7 (3)       29.4 (3)

Average ΔFN error
UPP     1.3 (1)        0.8 (1)        0.3 (1)        1.8 (1)        7.8 (2)        3.4 (2)        NA             NA
PASTA   1.3 (1)        0.4 (1)        <0.1 (1)       1.3 (1)        5.1 (1)        3.3 (1)        NA             NA
MAFFT   5.8 (2)        3.5 (2)        24.8 (3)       4.5 (3)        10.1 (3)       2.3 (1)        NA             NA
Muscle  8.4 (3)        7.3 (3)        32.5 (4)       3.1 (2)        5.5 (1)        12.6 (3)       NA             NA
Clustal 24.3 (4)       10.4 (4)       X              4.2 (3)        34.1 (4)       3.5 (2)        NA             NA

Average TC score
UPP     37.8 (1)       0.5 (2)        11.0 (3)       2.6 (2)        1.4 (1)        11.4 (1)       47.3 (1)       40.3 (3)
PASTA   37.8 (1)       2.3 (1)        48.0 (1)       5.4 (1)        2.3 (1)        12.1 (1)       46.1 (2)       50.0 (1)
MAFFT   31.4 (2)       0.4 (2)        7.8 (4)        0.6 (3)        0.7 (2)        12.1 (1)       45.5 (2)       46.9 (2)
Muscle  9.8 (3)        <0.0 (2)       18.3 (2)       2.7 (2)        0.7 (2)        10.5 (2)       27.7 (4)       X
Clustal 5.7 (4)        0.2 (2)        X              3.1 (2)        0.1 (2)        11.8 (1)       38.6 (3)       31.0 (4)

We report the average alignment SP-error (the average of SPFN and SPFP errors) (top), average ΔFN error (middle), and average TC score (bottom), for the collection of full-length datasets. All scores represent percentages and so are out of 100. Results marked with an X indicate that the method failed to terminate within the time limit (24 hours on a 12-core machine). Muscle failed to align two of the HomFam datasets; we report separate average results on the 17 HomFam datasets for all methods and the two HomFam datasets for all but Muscle. We did not test tree error on the HomFam datasets (therefore, the ΔFN error is indicated by “NA”). The tier ranking for each method is shown parenthetically.

memory error message were marked as failures. For experiments on the million-sequence RNASim dataset, we ran the methods on a dedicated machine with 256 GB of main memory and 12 cores until an alignment was generated or the method failed. We also performed a limited number of experiments on TACC with UPP’s internal checkpointing mechanism, to explore performance when time is not limited. All methods other than Muscle had parallel implementations and were able to take advantage of the 12 available cores.

On full-length datasets (Table 2) where nearly all methods were able to complete, PASTA was nearly always in

Table 3 Average alignment SP-error and tree error across fragmentary datasets

Method  ROSE NT        RNASim 10K     Indelible 10K  CRW (16S.3 and 16S.T)

Average alignment SP-error
UPP     8.3 (1)        11.8 (1)       2.7 (1)        16.1 (1)
PASTA   25.2 (2)       47.7 (4)       8.8 (2)        23.3 (2)
MAFFT   32.5 (3)       25.5 (2)       51.3 (3)       24.5 (3)
Muscle  35.3 (4)       82.2 (5)       77.6 (4)       70.6 (5)
Clustal 62.0 (5)       35.0 (3)       X              46.7 (4)

Average ΔFN error
UPP     1.9 (1)        3.1 (1)        2.5 (1)        7.4 (2)
PASTA   25.2 (3)       21.9 (3)       9.0 (2)        8.2 (2)
MAFFT   18.0 (2)       6.2 (2)        35.6 (3)       2.5 (1)
Muscle  27.5 (4)       43.6 (5)       45.2 (4)       30.1 (3)
Clustal 47.8 (5)       26.3 (4)       X              37.4 (4)

We report the average alignment error (top) and average ΔFN error (bottom) on the collection of fragmentary datasets. Clustal-Omega failed to align any of the Indelible 10000M2 fragmentary datasets and thus we mark the results with an X. The tier ranking for each method is shown in parentheses.

UPP Running Time

[Figure: running time of UPP(Fast) against number of sequences (50,000 to 200,000).]

Wall-clock time used (in hours) given 12 processors

PASTA and UPP: boosters of MSA methods

•  PASTA
   –  Combines iteration and divide-and-conquer to “boost” a preferred MSA method to large datasets; we showed results based on MAFFT
•  UPP
   –  Step 1: Constructs a “backbone” tree and an alignment on a small random subset of the sequences
   –  Step 2: Aligns all the remaining sequences to the backbone alignment
   –  We showed results where default PASTA computed the backbone alignment and tree.

Note: PASTA and UPP can be used with any MSA method.

Part 2: To get a large tree from an MSA?

Basic approach: maximum likelihood

Leading ML methods for large datasets:
–  RAxML (and ExaML), and
–  FastTree-2

Figure 3. Comparison of ML methods on the 16S.B.ALL dataset.

Liu K, Linder CR, Warnow T (2011) RAxML and FastTree: Comparing Two Methods for Large-Scale Maximum Likelihood Phylogeny Estimation. PLoS ONE 6(11): e27731. doi:10.1371/journal.pone.0027731 http://journals.plos.org/plosone/article?id=info:doi/10.1371/journal.pone.0027731

Figure 1. Missing branch rates of ML methods on the simulated 1000-taxon datasets.

Liu K, Linder CR, Warnow T (2011) RAxML and FastTree: Comparing Two Methods for Large-Scale Maximum Likelihood Phylogeny Estimation. PLoS ONE 6(11): e27731. doi:10.1371/journal.pone.0027731
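The missing branch rate in the caption above is the fraction of internal branches (bipartitions of the leaf set) of the true tree that do not appear in the estimated tree. A minimal sketch follows, assuming trees are written as nested tuples of leaf names; that representation and the function names are chosen for illustration and are not taken from the cited paper. The repeated leaf-set computation is quadratic and only suitable for small examples.

```python
from typing import FrozenSet, Set, Tuple, Union

Tree = Union[str, Tuple["Tree", ...]]


def leaves(tree: Tree) -> FrozenSet[str]:
    """Leaf names below a node; a leaf is represented by its name."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def bipartitions(tree: Tree) -> Set[FrozenSet[str]]:
    """Non-trivial bipartitions, each stored as the side without the anchor leaf."""
    all_leaves = leaves(tree)
    anchor = min(all_leaves)  # fixed leaf used to pick a canonical side
    splits: Set[FrozenSet[str]] = set()

    def visit(node: Tree) -> None:
        if isinstance(node, str):
            return
        clade = leaves(node)
        side = all_leaves - clade if anchor in clade else clade
        # Skip trivial splits (empty side, single leaf, or all but one leaf).
        if 2 <= len(side) <= len(all_leaves) - 2:
            splits.add(side)
        for child in node:
            visit(child)

    visit(tree)
    return splits


def missing_branch_rate(true_tree: Tree, estimated_tree: Tree) -> float:
    """Fraction of true-tree bipartitions absent from the estimated tree."""
    true_splits = bipartitions(true_tree)
    if not true_splits:  # a star tree has no internal branches to miss
        return 0.0
    return len(true_splits - bipartitions(estimated_tree)) / len(true_splits)
```

For example, with five leaves the trees `((("A","B"),"C"),("D","E"))` and `((("A","C"),"B"),("D","E"))` share the split separating D and E from the rest but differ on the other internal branch.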

FastTree vs. RAxML

•  FastTree-2 (Morgan Price et al.): very fast, produces tree topologies just about as accurate as RAxML's, and can handle datasets of up to 1,000,000 sequences.

•  RAxML (Stamatakis et al.): not nearly fast enough on large numbers of taxa, but can do multi-locus analyses well.

FastTree vs. RAxML

The main advantage of RAxML over FastTree is really only on extremely accurate alignments, and even there it is probably not about the tree topology but some other parameter.

Part 3: To get a large tree without an MSA?

•  Alignment-free estimation? No evidence (yet) that it is as accurate as good two-phase methods.

•  But almost alignment-free estimation seems feasible!

DACTAL

•  Divide-And-Conquer Trees (Almost) without alignments

•  Nelesen et al., ISMB 2012 and Bioinformatics 2012

•  Input: unaligned sequences
•  Output: Tree (but no alignment)

DACTAL

[Diagram: iterative pipeline]

Unaligned Sequences → (BLAST-based) → Overlapping subsets → (RAxML(MAFFT)) → A tree for each subset → (Supertree method: SuperFine) → A tree for the entire dataset → (pRecDCM3) → Overlapping subsets

Analysis of the 16S.T dataset (7350 RNA sequences)

Default: start with two-phase tree, decompose into 200-taxon subsets, run FastTree(MAFFT) on subsets, combine using SuperFine+MRP

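[Figure: two panels of results over the 1000-taxon model conditions 1000M3, 1000L2, 1000S3, 1000S2, 1000L1, 1000M2, 1000L3, 1000S1, 1000M1 (in the first panel, 1000M2, 1000L3, 1000S1, and 1000M1 are marked with an asterisk). Methods shown: ML(Muscle), ML(Prank+GT), ML(Opal), ML(MAFFT), SATé, DACTAL, ML(TrueAln).]

We show results with 10 DACTAL iterations; SATe-1 used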

Summary: To get a large MSA

•  Fewer than 200 sequences? MAFFT likely best for slowly evolving loci, otherwise PASTA or UPP.
•  More than 200 sequences? PASTA or UPP.
•  Fragments? Use UPP.
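The rules of thumb above can be written down as a small decision helper. This is only a restatement of the slide as code: the function and parameter names are ours, and the slide does not say what to do with exactly 200 sequences, so the sketch treats that case as large.

```python
def suggest_msa_method(num_sequences: int,
                       has_fragments: bool,
                       slowly_evolving: bool = False) -> str:
    """Return the method family suggested by the summary slide."""
    if has_fragments:
        # Fragmentary sequences: the slide recommends UPP regardless of size.
        return "UPP"
    if num_sequences < 200 and slowly_evolving:
        # Small datasets of slowly evolving loci: MAFFT is likely best.
        return "MAFFT"
    # Everything else, including exactly 200 sequences (our assumption).
    return "PASTA or UPP"


print(suggest_msa_method(150, has_fragments=False, slowly_evolving=True))
print(suggest_msa_method(150_000, has_fragments=True))
```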

Summary: To get a large tree from an MSA?

•  Maximum likelihood is a good approach (high accuracy for the tree topology point estimate).

•  For under 100 sequences, RAxML is good enough.

•  For larger datasets, try FastTree-2.

Open Problems in MSA/tree estimation

•  Statistical co-estimation of alignments and trees
•  Phylogeny estimation given indels
•  Exploring alignment uncertainty
•  Estimating better alignments by combining estimated alignments


Acknowledgments

Papers available at http://tandy.cs.illinois.edu/papers.html

PASTA and UPP at https://github.com/smirarab

Funding: NSF ABI-1458652 and III:AF:1513629, a Founder Professorship from the Grainger Foundation, and HHMI (to S.M.)

Computational support: TACC (PASTA and UPP)
