How to Compute a Large Tree (with or without an...

How to Compute a Large Tree (with or without an alignment)

This talk

•  Part 1: How to get a good alignment
•  Part 2: How to get a good tree from a good alignment
•  Part 3: How to get a good tree without an alignment

1kp: Thousand Transcriptome Project

•  Plant Tree of Life based on transcriptomes of ~1200 species
•  More than 13,000 gene families (most not single copy)

Gene Tree Incongruence

G. Ka-Shu Wong (U Alberta)
N. Wickett (Northwestern)
J. Leebens-Mack (U Georgia)
N. Matasci (iPlant)
T. Warnow (UIUC), S. Mirarab (UCSD), N. Nguyen (UCSD)
Plus many many other people…

Challenge: Alignment of datasets with > 100,000 sequences

Multiple Sequence Alignment (MSA): an important grand challenge [1]

S1 = -AGGCTATCACCTGACCTCCA
S2 = TAG-CTATCAC--GACCGC--
S3 = TAG-CT-------GACCGC--
…
Sn = -------TCAC--GACCGACA

S1 = AGGCTATCACCTGACCTCCA
S2 = TAGCTATCACGACCGC
S3 = TAGCTGACCGC
…
Sn = TCACGACCGACA

Novel techniques needed for scalability and accuracy
•  NP-hard problems and large datasets
•  Current methods do not provide good accuracy
•  Few methods can analyze even moderately large datasets

Many important applications besides phylogenetic estimation

[1] Frontiers in Massive Data Analysis, National Academies Press, 2013
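The relationship between the two sequence blocks on this slide can be checked mechanically: deleting the gap characters from each aligned row gives back the unaligned input sequence. The following is a minimal sketch of that check in Python; the function names are illustrative and not part of any tool mentioned in the talk.

```python
# Rows of the example alignment from the slide; "-" marks a gap.
alignment = {
    "S1": "-AGGCTATCACCTGACCTCCA",
    "S2": "TAG-CTATCAC--GACCGC--",
    "S3": "TAG-CT-------GACCGC--",
    "Sn": "-------TCAC--GACCGACA",
}

def strip_gaps(row):
    """Return the sequence spelled by an aligned row (gaps removed)."""
    return row.replace("-", "")

def is_alignment_of(aligned, unaligned):
    """True if `aligned` is a valid alignment of the sequences in `unaligned`."""
    same_names = aligned.keys() == unaligned.keys()
    one_width = len({len(row) for row in aligned.values()}) == 1
    same_letters = all(strip_gaps(aligned[n]) == unaligned.get(n) for n in aligned)
    return same_names and one_width and same_letters

# Recover the unaligned input from the alignment.
unaligned = {name: strip_gaps(row) for name, row in alignment.items()}
print(unaligned["S3"])
print(is_alignment_of(alignment, unaligned))
```

Any MSA method has to satisfy this property; what distinguishes methods is where they place the gaps.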

1000-taxon models, ordered by difficulty (Liu et al., 2009)

Re-aligning on a tree (boosting an MSA method)

[Diagram: a four-step cycle on a tree with leaf groups A, B, C, D. Decompose dataset (into subsets A, B, C, D) → Align subsets → Merge subset-alignments (into ABCD) → Estimate ML tree on merged alignment.]

SATé and PASTA Algorithms

[Diagram: a loop between "Tree" and "Alignment".]

•  Obtain initial alignment and estimated ML tree
•  Use tree to compute new alignment
•  Estimate ML tree on new alignment
•  Repeat until termination condition, and return the alignment/tree pair with the best ML score
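The loop on this slide can be sketched as a small driver function. This is only an outline under stated assumptions: `initial_align`, `estimate_ml_tree`, and `realign_on_tree` are hypothetical callables standing in for the real aligner, ML tree estimator, and tree-guided re-alignment step, and the termination condition is simplified to a fixed iteration count.

```python
def iterate_alignment_and_tree(sequences, initial_align, estimate_ml_tree,
                               realign_on_tree, max_iterations=3):
    """Alternate between re-alignment and tree estimation; keep the best pair.

    estimate_ml_tree(alignment) is assumed to return (tree, ml_score),
    where a higher score is better (e.g. a log-likelihood).
    """
    alignment = initial_align(sequences)
    tree, score = estimate_ml_tree(alignment)
    best = (score, alignment, tree)

    for _ in range(max_iterations):          # simplified termination condition
        alignment = realign_on_tree(sequences, tree)
        tree, score = estimate_ml_tree(alignment)
        if score > best[0]:
            best = (score, alignment, tree)

    best_score, best_alignment, best_tree = best
    return best_alignment, best_tree, best_score
```

Returning the best-scoring pair rather than the last one matters because a later iteration is not guaranteed to improve the ML score.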

Tree Error – Simulated data

[Figure: RNASim. Tree error (FN rate, 0.00 to 0.20) at 10,000, 50,000, 100,000, and 200,000 sequences for Clustal-Omega, Muscle, Mafft, Starting Tree, SATe2, PASTA, and Reference Alignment.]

•  Simulated RNASim datasets from 10K to 200K taxa
•  Limited to 24 hours using 12 CPUs
•  Not all methods could run (missing bars could not finish)
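The FN (false negative, or missing branch) rate used as the tree error here is the fraction of non-trivial bipartitions of the true tree that are absent from the estimated tree. A minimal sketch follows, assuming trees are given as nested Python tuples of leaf names; this representation and the function names are choices made for the example, not the format used by the tools in the talk.

```python
def bipartitions(tree):
    """Non-trivial bipartitions of a tree given as nested tuples of leaf names.

    Each bipartition is stored as the side that does NOT contain a fixed
    anchor leaf, so the result does not depend on where the tree is rooted.
    """
    def leaves_of(node):
        if isinstance(node, str):
            return frozenset([node])
        return frozenset().union(*(leaves_of(child) for child in node))

    all_leaves = leaves_of(tree)
    anchor = min(all_leaves)
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        side = all_leaves - below if anchor in below else below
        if 2 <= len(side) <= len(all_leaves) - 2:   # skip trivial splits
            splits.add(side)
        return below

    visit(tree)
    return splits

def fn_rate(true_tree, estimated_tree):
    """Fraction of true-tree bipartitions missing from the estimated tree."""
    true_splits = bipartitions(true_tree)
    if not true_splits:
        return 0.0
    return len(true_splits - bipartitions(estimated_tree)) / len(true_splits)

true_tree = (("A", "B"), ("C", "D"), "E")
estimate = (("A", "B"), "C", ("D", "E"))
print(fn_rate(true_tree, estimate))
```

In this toy case the true tree has two internal branches and the estimate recovers only the one separating A and B from the rest.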

PASTA Running Time and Scalability

[Figure from "PASTA: ultra-large multiple sequence alignment". Panel (b): running time (minutes, 0 to 1250) versus number of sequences (10,000 to 200,000). Panel (c): speedup versus number of threads (1 to 12) for PASTA and SATe2.]

Fig. 5. Running time comparison of PASTA and SATe. (a) Running time profiling on one iteration for RNASim datasets with 10K and 50K sequences (the dotted region indicates the last pairwise merge). (b) Running time for one iteration of PASTA with 12 CPUs as a function of the number of sequences (the solid line is fitted to first two points). (c) Scalability for PASTA and SATe with increased number of CPUs.

… reason SATe uses so much time is that all mergers are done hierarchically using either Opal (for small datasets) or Muscle (on larger datasets), and both are computationally expensive with increased number of sequences. For example, the last pairwise merge within SATe, shown by the dotted area in Figure 5a, is entirely serial and takes up a large chunk of the total time. PASTA solves this problem by using transitivity for all but the initial pairwise mergers, and therefore scales well with increased dataset size, as shown in Figure 5b (the sub-linear scaling is due to a better use of parallelism with increased number of sequences). Finally, Figure 5c shows that PASTA is highly parallelizable, and has a much better speed-up with increasing number of threads than SATe does. While PASTA has a much improved parallelization, it does not quite scale up linearly, because FastTree-2 does not scale up well with increased thread count.

Divide-and-Conquer strategy: impact of guide tree. We also investigated the impact of the use of the guide tree for computing the subset decomposition, and hence defining the Type 1 sub-alignments. We compared results obtained using three different decompositions: the decomposition computed by PASTA on the HMM-based starting tree, the decomposition computed by PASTA on the true (model) tree, and a random decomposition into subsets of size 200, all on the RNASim 10k dataset. PASTA alignments and trees had roughly the same accuracy when the guide tree was either the true tree or the HMM-based starting tree (Table 3). However, when based on a random decomposition, tree error increased dramatically from 10.5% to 52.3%, and alignment scores also dropped substantially. Thus, the guide-tree based dataset decomposition used by PASTA provides substantial improvements over random decompositions, and the default technique for getting the starting tree works quite well.
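To make the contrast with a random decomposition concrete, here is a minimal sketch of one way to decompose a dataset along a guide tree: repeatedly delete the edge that splits the remaining leaves most evenly until every piece has at most `max_size` leaves. The balanced-edge rule, the edge-list tree representation, and the function name are assumptions made for this example; the excerpt above does not spell out PASTA's exact rule.

```python
from collections import defaultdict

def decompose_by_guide_tree(edges, leaves, max_size):
    """Split the leaves of an unrooted guide tree into subsets of <= max_size.

    edges: iterable of (node, node) pairs; leaves: the leaf (sequence) names.
    Subsets consist of leaves that are close together in the tree.
    """
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    leaves = set(leaves)
    subsets = []
    pending = [min(leaves)]              # one start node per tree component
    while pending:
        start = pending.pop()
        # Depth-first scan of the component containing `start`.
        parent, order, stack = {start: None}, [], [start]
        while stack:
            node = stack.pop()
            order.append(node)
            for neighbor in sorted(adjacency[node]):
                if neighbor != parent[node]:
                    parent[neighbor] = node
                    stack.append(neighbor)
        # Count the leaves below every node (children before parents).
        below = {node: int(node in leaves) for node in order}
        for node in reversed(order):
            if parent[node] is not None:
                below[parent[node]] += below[node]
        total = below[start]
        if total <= max_size:
            subsets.append(sorted(n for n in order if n in leaves))
            continue
        # Delete the edge whose removal splits the leaves most evenly.
        child = min((n for n in order if parent[n] is not None),
                    key=lambda n: (abs(total - 2 * below[n]), n))
        adjacency[child].discard(parent[child])
        adjacency[parent[child]].discard(child)
        pending.extend([child, parent[child]])
    return sorted(subsets)

# Toy guide tree ((A,B),(C,D)) with internal nodes u and v.
edges = [("A", "u"), ("B", "u"), ("u", "v"), ("C", "v"), ("D", "v")]
print(decompose_by_guide_tree(edges, ["A", "B", "C", "D"], max_size=2))
```

Because each subset is a connected piece of the guide tree, it groups closely related sequences, which is the property a random decomposition lacks.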

•  One iteration
•  Using
    •  12 CPUs
    •  1 node on Lonestar TACC
    •  Maximum 24 GB memory
•  Showing wall clock running time
    •  ~1 hour for 10k taxa
    •  ~17 hours for 200k taxa

1KP dataset: more than 100,000 p450 amino-acid sequences, many fragmentary

[Figure: histogram of sequence lengths. x-axis: Length (0 to 2000); y-axis: Counts (0 to 12,000). Mean: 317; median: 266.]

All standard multiple sequence alignment methods we tested performed poorly on datasets with fragments.

Challenge: Alignment of datasets with > 100,000 sequences, many of which are fragmentary


UPP

UPP = “Ultra-large multiple sequence alignment using Phylogeny-aware Profiles”
Nguyen, Mirarab, and Warnow. Genome Biology, 2014.
Purpose: highly accurate large-scale multiple sequence alignments, even in the presence of fragmentary sequences.

Uses an ensemble of HMMs

UPP Algorithmic Approach

1.  Select small random subset of full-length sequences, and build “backbone alignment”

2.  Construct an “Ensemble of Hidden Markov Models” on the backbone alignment

3.  Add all remaining sequences to the backbone alignment using the Ensemble of HMMs
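Step 1 of this approach is simple enough to sketch directly. The code below picks a random backbone from the sequences judged to be full-length and leaves everything else, including fragments, as query sequences for step 3. The rule used here for "full-length" (length within 25% of the median) and the parameter names are assumptions made for illustration; steps 2 and 3 need an HMM toolkit and are not shown.

```python
import random
from statistics import median

def select_backbone(sequences, backbone_size, tolerance=0.25, seed=0):
    """Split unaligned sequences into a random full-length backbone and queries.

    sequences: dict mapping name -> unaligned sequence string.
    tolerance: assumed cutoff; a sequence counts as full-length if its length
               is within this fraction of the median length.
    """
    med = median(len(seq) for seq in sequences.values())
    full_length = sorted(
        name for name, seq in sequences.items()
        if abs(len(seq) - med) <= tolerance * med
    )
    rng = random.Random(seed)                       # seeded for repeatability
    chosen = set(rng.sample(full_length, min(backbone_size, len(full_length))))
    backbone = sorted(chosen)
    queries = sorted(name for name in sequences if name not in chosen)
    return backbone, queries

toy = {
    "full1": "A" * 100, "full2": "C" * 104, "full3": "G" * 98, "full4": "T" * 101,
    "frag1": "A" * 30, "frag2": "C" * 25,
}
backbone, queries = select_backbone(toy, backbone_size=3)
print(backbone, queries)
```

Keeping fragments out of the backbone means the backbone alignment and tree are estimated only from sequences that a standard method can handle well.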

RNASim Million Sequences: tree error

Using 12 TACC processors:
•  UPP(Fast,NoDecomp) took 2.2 days,
•  UPP(Fast) took 11.9 days, and
•  PASTA took 10.3 days

RNASim Million Sequences: alignment error

Notes:
•  We show alignment error using average of SP-FN and SP-FP.
•  UPP variants have better alignment scores than PASTA.
•  (Not shown: Total Column Scores – PASTA more accurate than UPP)
•  No other methods tested could complete on these data
•  PASTA under-aligns: its alignment is 43 times wider than true alignment (~900 Gb of disk space). UPP alignments were closer in length to true alignment (0.93 to 1.38 times as wide).
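The alignment error reported here, the average of SP-FN and SP-FP, can be computed from the sets of residue pairs that each alignment places in the same column. The sketch below uses small in-memory alignments (dicts of equal-length gapped strings); the function names are chosen for this example, and evaluating alignments with a million sequences would need a more memory-efficient approach.

```python
from itertools import combinations

def homologous_pairs(alignment):
    """All pairs of residues that share a column.

    A residue is identified as (sequence name, index in the ungapped sequence).
    """
    names = sorted(alignment)
    width = len(alignment[names[0]])
    position = {name: 0 for name in names}
    pairs = set()
    for column in range(width):
        present = []
        for name in names:
            if alignment[name][column] != "-":
                present.append((name, position[name]))
                position[name] += 1
        pairs.update(combinations(present, 2))
    return pairs

def sp_error(reference, estimated):
    """Return (SP-FN, SP-FP, their average) for an estimated alignment."""
    ref = homologous_pairs(reference)
    est = homologous_pairs(estimated)
    sp_fn = len(ref - est) / len(ref) if ref else 0.0   # true pairs missed
    sp_fp = len(est - ref) / len(est) if est else 0.0   # wrong pairs added
    return sp_fn, sp_fp, (sp_fn + sp_fp) / 2

reference = {"a": "AC-G", "b": "ACTG"}
estimated = {"a": "ACG-", "b": "ACTG"}
print(sp_error(reference, estimated))
```

An under-aligned result, like the very wide PASTA alignment noted above, tends to have few false positive pairs but many false negatives, which is why the two rates are averaged.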

UPP is more robust to fragmentary sequences than PASTA

1000M2 model condition

[Figure S32: Alignment and tree error of PASTA and UPP on the fragmentary 1000M2 datasets. (a) Average alignment error (mean alignment error, 0.0 to 0.6) and (b) average tree error (Delta FN tree error, 0.0 to 0.4), each against % fragmentary (0, 12.5, 25, 50), for PASTA and UPP(Default).]

Under high rates of evolution, PASTA is badly impacted by fragmentary sequences (the same is true for other methods). Under low rates of evolution, PASTA can still be highly accurate (data not shown). UPP continues to have good accuracy even on datasets with many fragments under all rates of evolution.

Nguyen et al. Genome Biology (2015) 16:124

Table 2 Average alignment SP-error, tree error, and TC score across most full-length datasets

Method    ROSE NT        RNASim 10K     Indelible 10K  ROSE AA        CRW            10 AA          HomFam (17)    HomFam (2)

Average alignment SP-error
UPP       7.8 (1)        9.5 (1)        1.7 (2)        2.9 (1)        12.5 (1)       24.2 (1)       23.3 (1)       20.8 (2)
PASTA     7.8 (1)        15.0 (2)       0.4 (1)        3.1 (1)        12.8 (1)       24.0 (1)       22.5 (1)       17.3 (1)
MAFFT     20.6 (2)       25.5 (3)       41.4 (3)       4.9 (2)        28.3 (2)       23.5 (1)       25.3 (2)       20.7 (2)
Muscle    20.6 (2)       64.7 (5)       62.4 (4)       5.5 (3)        30.7 (3)       30.2 (2)       48.1 (4)       X
Clustal   49.2 (3)       35.3 (4)       X              6.5 (4)        43.3 (4)       24.3 (1)       27.7 (3)       29.4 (3)

Average ΔFN error
UPP       1.3 (1)        0.8 (1)        0.3 (1)        1.8 (1)        7.8 (2)        3.4 (2)        NA             NA
PASTA     1.3 (1)        0.4 (1)        <0.1 (1)       1.3 (1)        5.1 (1)        3.3 (1)        NA             NA
MAFFT     5.8 (2)        3.5 (2)        24.8 (3)       4.5 (3)        10.1 (3)       2.3 (1)        NA             NA
Muscle    8.4 (3)        7.3 (3)        32.5 (4)       3.1 (2)        5.5 (1)        12.6 (3)       NA             NA
Clustal   24.3 (4)       10.4 (4)       X              4.2 (3)        34.1 (4)       3.5 (2)        NA             NA

Average TC score
UPP       37.8 (1)       0.5 (2)        11.0 (3)       2.6 (2)        1.4 (1)        11.4 (1)       47.3 (1)       40.3 (3)
PASTA     37.8 (1)       2.3 (1)        48.0 (1)       5.4 (1)        2.3 (1)        12.1 (1)       46.1 (2)       50.0 (1)
MAFFT     31.4 (2)       0.4 (2)        7.8 (4)        0.6 (3)        0.7 (2)        12.1 (1)       45.5 (2)       46.9 (2)
Muscle    9.8 (3)        <0.0 (2)       18.3 (2)       2.7 (2)        0.7 (2)        10.5 (2)       27.7 (4)       X
Clustal   5.7 (4)        0.2 (2)        X              3.1 (2)        0.1 (2)        11.8 (1)       38.6 (3)       31.0 (4)

We report the average alignment SP-error (the average of SPFN and SPFP errors) (top), average ΔFN error (middle), and average TC score (bottom), for the collection of full-length datasets. All scores represent percentages and so are out of 100. Results marked with an X indicate that the method failed to terminate within the time limit (24 hours on a 12-core machine). Muscle failed to align two of the HomFam datasets; we report separate average results on the 17 HomFam datasets for all methods and the two HomFam datasets for all but Muscle. We did not test tree error on the HomFam datasets (therefore, the ΔFN error is indicated by “NA”). The tier ranking for each method is shown parenthetically.

… memory error message were marked as failures. For experiments on the million-sequence RNASim dataset, we ran the methods on a dedicated machine with 256 GB of main memory and 12 cores until an alignment was generated or the method failed. We also performed a limited number of experiments on TACC with UPP’s internal checkpointing mechanism, to explore performance when time is not limited. All methods other than Muscle had parallel implementations and were able to take advantage of the 12 available cores. On full-length datasets (Table 2) where nearly all methods were able to complete, PASTA was nearly always in …

Table 3 Average alignment SP-error and tree error across fragmentary datasets

Method    ROSE NT        RNASim 10K     Indelible 10K  CRW (16S.3 and 16S.T)

Average alignment SP-error
UPP       8.3 (1)        11.8 (1)       2.7 (1)        16.1 (1)
PASTA     25.2 (2)       47.7 (4)       8.8 (2)        23.3 (2)
MAFFT     32.5 (3)       25.5 (2)       51.3 (3)       24.5 (3)
Muscle    35.3 (4)       82.2 (5)       77.6 (4)       70.6 (5)
Clustal   62.0 (5)       35.0 (3)       X              46.7 (4)

Average ΔFN error
UPP       1.9 (1)        3.1 (1)        2.5 (1)        7.4 (2)
PASTA     25.2 (3)       21.9 (3)       9.0 (2)        8.2 (2)
MAFFT     18.0 (2)       6.2 (2)        35.6 (3)       2.5 (1)
Muscle    27.5 (4)       43.6 (5)       45.2 (4)       30.1 (3)
Clustal   47.8 (5)       26.3 (4)       X              37.4 (4)

We report the average alignment error (top) and average ΔFN error (bottom) on the collection of fragmentary datasets. Clustal-Omega failed to align any of the Indelible 10000M2 fragmentary datasets and thus we mark the results with an X. The tier ranking for each method is shown in parentheses.

UPP Running Time

[Figure: wall clock alignment time (hr, 0 to 15) versus number of sequences (50,000 to 200,000) for UPP(Fast).]

Wall-clock time used (in hours) given 12 processors

PASTA and UPP: boosters of MSA methods

•  PASTA
    –  Combines iteration and divide-and-conquer to “boost” a preferred MSA method to large datasets; we showed results based on MAFFT
•  UPP
    –  Step 1: Constructs a “backbone” tree and an alignment on a small random subset of the sequences
    –  Step 2: Aligns all the remaining sequences to the backbone alignment
    –  We showed results where default PASTA computed the backbone alignment and tree.

Note: PASTA and UPP can be used with any MSA method.

Part 2: To get a large tree from an MSA?

Basic approach: maximum likelihood

Leading ML methods for large datasets:
    –  RAxML (and ExaML), and
    –  FastTree-2

Figure 3. Comparison of ML methods on the 16S.B.ALL dataset.

Figure 1. Missing branch rates of ML methods on the simulated 1000-taxon datasets.

Liu K, Linder CR, Warnow T (2011) RAxML and FastTree: Comparing Two Methods for Large-Scale Maximum Likelihood Phylogeny Estimation. PLoS ONE 6(11): e27731. doi:10.1371/journal.pone.0027731
http://journals.plos.org/plosone/article?id=info:doi/10.1371/journal.pone.0027731

FastTree vs. RAxML

•  FastTree-2 (Morgan Price et al.): very fast, tree topologies just about as accurate as RAxML's, and can handle datasets up to 1,000,000 sequences.

•  RAxML (Stamatakis et al.): not nearly fast enough on large numbers of taxa, but can do multi-locus analyses well.

The main advantage of RAxML over FastTree is really only on extremely accurate alignments, and even there it is probably not about the tree topology but some other parameter.

Part 3: To get a large tree without an MSA?

•  Alignment-free estimation? No evidence (yet) that it is as accurate as good two-phase methods.

•  But almost alignment-free estimation seems feasible!

DACTAL

•  Divide-And-Conquer Trees (Almost) without alignments
•  Nelesen et al., ISMB 2012 and Bioinformatics 2012
•  Input: unaligned sequences
•  Output: Tree (but no alignment)

DACTAL

[Diagram: the DACTAL cycle. Unaligned Sequences → Overlapping subsets (BLAST-based) → A tree for each subset, via RAxML(MAFFT) → A tree for the entire dataset, via the supertree method SuperFine → Overlapping subsets again, via pRecDCM3.]

Analysis of the 16S.T dataset (7350 RNA sequences)

Default: start with two-phase tree, decompose into 200-taxon subsets, run FastTree(MAFFT) on subsets, combine using SuperFine+MRP
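The default pipeline described here can be outlined as a short loop. This is a sketch only: `starting_tree`, `decompose`, `subset_tree`, and `supertree` are hypothetical callables standing in for the two-phase starting tree, the decomposition into overlapping subsets, FastTree(MAFFT) on one subset, and SuperFine+MRP, and none of them is implemented here.

```python
def dactal_sketch(sequences, starting_tree, decompose, subset_tree, supertree,
                  iterations=10, subset_size=200):
    """Outline of a DACTAL-style divide-and-conquer tree estimation.

    sequences: dict mapping name -> unaligned sequence.
    No alignment of the full dataset is ever computed; only the subsets
    are aligned (inside subset_tree).
    """
    tree = starting_tree(sequences)
    for _ in range(iterations):
        # Overlapping subsets of at most subset_size taxa, guided by the tree.
        subsets = decompose(tree, subset_size)
        # Align and build a tree on each subset independently.
        subset_trees = [
            subset_tree({name: sequences[name] for name in subset})
            for subset in subsets
        ]
        # Merge the overlapping subset trees into one tree on all taxa.
        tree = supertree(subset_trees)
    return tree
```

The subsets must overlap so that the supertree step has shared taxa to connect the subset trees; the per-subset work is independent and could run in parallel.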

[Figure: two bar charts over the 1000-taxon model conditions 1000M3, 1000L2, 1000S3, 1000S2, 1000L1, 1000M2*, 1000L3*, 1000S1*, 1000M1*. Top: Missing Branch Rate (0 to 0.45). Bottom: Runtime (h) (0 to 120). Methods: ML(Muscle), ML(Prank+GT), ML(Opal), ML(MAFFT), SATé, DACTAL, ML(TrueAln).]

We show results with 10 DACTAL iterations; SATe-1 used

Summary: To get a large MSA

•  Fewer than 200 sequences? MAFFT likely best for slowly evolving loci, otherwise PASTA or UPP.
•  More than 200 sequences? PASTA or UPP.
•  Fragments? Use UPP.

Summary: To get a large tree from an MSA?

•  Maximum likelihood good approach (high accuracy for tree topology point estimate).

•  For under 100 sequences, RAxML is good enough.

•  For larger datasets, try FastTree-2.
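The rules of thumb on the two summary slides can be written down as a pair of small functions. This is just a restatement of the slides in code; the function names are made up, and the handling of the boundary cases (exactly 200 or exactly 100 sequences), which the slides leave open, is an assumption.

```python
def recommend_msa_methods(num_sequences, has_fragments, slowly_evolving=False):
    """Suggested alignment methods, following the summary slide."""
    if has_fragments:
        return ["UPP"]
    if num_sequences < 200 and slowly_evolving:
        return ["MAFFT"]
    # Assumption: exactly 200 sequences is treated like the larger case.
    return ["PASTA", "UPP"]

def recommend_tree_method(num_sequences):
    """Suggested maximum likelihood tree method for a given alignment size."""
    # Assumption: exactly 100 sequences is treated like the larger case.
    return "RAxML" if num_sequences < 100 else "FastTree-2"

print(recommend_msa_methods(150, has_fragments=False, slowly_evolving=True))
print(recommend_msa_methods(100000, has_fragments=True))
print(recommend_tree_method(50), recommend_tree_method(100000))
```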

Open Problems in MSA/tree estimation

•  Statistical co-estimation of alignments and trees
•  Phylogeny estimation given indels
•  Exploring alignment uncertainty
•  Estimating better alignments by combining estimated alignments

Acknowledgments

Papers available at http://tandy.cs.illinois.edu/papers.html
PASTA and UPP at https://github.com/smirarab
Funding: NSF ABI-1458652 and III:AF:1513629, a Founder Professorship from the Grainger Foundation, and HHMI (to S.M.)
Computational support: TACC (PASTA and UPP)