14
Bhattacharjee and Bayzid BMC Genomics (2020) 21:497 https://doi.org/10.1186/s12864-020-06892-5 METHODOLOGY ARTICLE Open Access Machine learning based imputation techniques for estimating phylogenetic trees from incomplete distance matrices Ananya Bhattacharjee 1,2 and Md. Shamsuzzoha Bayzid 1* Abstract Background: With the rapid growth rate of newly sequenced genomes, species tree inference from genes sampled throughout the whole genome has become a basic task in comparative and evolutionary biology. However, substantial challenges remain in leveraging these large scale molecular data. One of the foremost challenges is to develop efficient methods that can handle missing data. Popular distance-based methods, such as NJ (neighbor joining) and UPGMA (unweighted pair group method with arithmetic mean) require complete distance matrices without any missing data. Results: We introduce two highly accurate machine learning based distance imputation techniques. These methods are based on matrix factorization and autoencoder based deep learning architectures. We evaluated these two methods on a collection of simulated and biological datasets. Experimental results suggest that our proposed methods match or improve upon the best alternate distance imputation techniques. Moreover, these methods are scalable to large datasets with hundreds of taxa, and can handle a substantial amount of missing data. Conclusions: This study shows, for the first time, the power and feasibility of applying deep learning techniques for imputing distance matrices. Thus, this study advances the state-of-the-art in phylogenetic tree construction in the presence of missing data. The proposed methods are available in open source form at https://github.com/Ananya- Bhattacharjee/ImputeDistances. Keywords: Phylogenetic trees, Species trees, Gene trees, Missing data, Imputation, Deep learning, Matrix factorization, Autoencoder Background Phylogenetic trees, also known as evolutionary trees, rep- resent the evolutionary history of a group of entities (i.e., species, genes, etc.). Phylogenetic trees provide insights into basic biology, including how life evolved, the mech- anisms of evolution and how it modifies function and structure. One of the ambitious goals of modern sci- ence is to construct the “Tree of Life” – the evolutionary relationships among all the organisms on earth. Central *Correspondence: [email protected] 1 Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, 1205, Dhaka, Bangladesh Full list of author information is available at the end of the article to assembling the tree of life is the ability to efficiently analyze a vast amount of genomic data. The field of phylogenetics has experienced tremendous advancements over the last few decades. Sophisticated and highly accurate statistical methods for reconstructing gene trees and species trees are mostly based on maximum likelihood or Markov Chain Monte Carlo (MCMC) meth- ods, and probabilistic models of sequence evolution (see [1] for example). Various coalescent-based species tree methods – with statistical guarantees of returning the true tree with high probability given a sufficiently large num- ber of estimated gene trees that are error-free – have been developed, and are increasingly popular [212]. However, © The Author(s). 2020 Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

METHODOLOGYARTICLE OpenAccess … · 2020-07-20 · BhattacharjeeandBayzidBMCGenomics (2020) 21:497 Page2of14 manyofthesemethodsarenotscalabletoanalyzephy-logenomic datasets that

  • Upload
    others

  • View
    3

  • Download
    0

Embed Size (px)

Citation preview

Page 1: METHODOLOGYARTICLE OpenAccess … · 2020-07-20 · BhattacharjeeandBayzidBMCGenomics (2020) 21:497 Page2of14 manyofthesemethodsarenotscalabletoanalyzephy-logenomic datasets that

Bhattacharjee and Bayzid BMCGenomics (2020) 21:497 https://doi.org/10.1186/s12864-020-06892-5

METHODOLOGY ARTICLE Open Access

Machine learning based imputationtechniques for estimating phylogenetic treesfrom incomplete distance matricesAnanya Bhattacharjee1,2 and Md. Shamsuzzoha Bayzid1*

Abstract

Background: With the rapid growth rate of newly sequenced genomes, species tree inference from genes sampledthroughout the whole genome has become a basic task in comparative and evolutionary biology. However,substantial challenges remain in leveraging these large scale molecular data. One of the foremost challenges is todevelop efficient methods that can handle missing data. Popular distance-based methods, such as NJ (neighborjoining) and UPGMA (unweighted pair group method with arithmetic mean) require complete distance matriceswithout any missing data.

Results: We introduce two highly accurate machine learning based distance imputation techniques. These methodsare based on matrix factorization and autoencoder based deep learning architectures. We evaluated these twomethods on a collection of simulated and biological datasets. Experimental results suggest that our proposedmethods match or improve upon the best alternate distance imputation techniques. Moreover, these methods arescalable to large datasets with hundreds of taxa, and can handle a substantial amount of missing data.

Conclusions: This study shows, for the first time, the power and feasibility of applying deep learning techniques forimputing distance matrices. Thus, this study advances the state-of-the-art in phylogenetic tree construction in thepresence of missing data. The proposed methods are available in open source form at https://github.com/Ananya-Bhattacharjee/ImputeDistances.

Keywords: Phylogenetic trees, Species trees, Gene trees, Missing data, Imputation, Deep learning, Matrixfactorization, Autoencoder

BackgroundPhylogenetic trees, also known as evolutionary trees, rep-resent the evolutionary history of a group of entities (i.e.,species, genes, etc.). Phylogenetic trees provide insightsinto basic biology, including how life evolved, the mech-anisms of evolution and how it modifies function andstructure. One of the ambitious goals of modern sci-ence is to construct the “Tree of Life” – the evolutionaryrelationships among all the organisms on earth. Central

*Correspondence: [email protected] of Computer Science and Engineering, Bangladesh University ofEngineering and Technology, 1205, Dhaka, BangladeshFull list of author information is available at the end of the article

to assembling the tree of life is the ability to efficientlyanalyze a vast amount of genomic data.The field of phylogenetics has experienced tremendous

advancements over the last few decades. Sophisticatedand highly accurate statistical methods for reconstructinggene trees and species trees are mostly based on maximumlikelihood or Markov Chain Monte Carlo (MCMC) meth-ods, and probabilistic models of sequence evolution (see[1] for example). Various coalescent-based species treemethods – with statistical guarantees of returning the truetree with high probability given a sufficiently large num-ber of estimated gene trees that are error-free – have beendeveloped, and are increasingly popular [2–12]. However,

© The Author(s). 2020 Open Access This article is licensed under a Creative Commons Attribution 4.0 International License,which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriatecredit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes weremade. The images or other third party material in this article are included in the article’s Creative Commons licence, unlessindicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and yourintended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directlyfrom the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The CreativeCommons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data madeavailable in this article, unless otherwise stated in a credit line to the data.

Page 2: METHODOLOGYARTICLE OpenAccess … · 2020-07-20 · BhattacharjeeandBayzidBMCGenomics (2020) 21:497 Page2of14 manyofthesemethodsarenotscalabletoanalyzephy-logenomic datasets that

Bhattacharjee and Bayzid BMCGenomics (2020) 21:497 Page 2 of 14

many of these methods are not scalable to analyze phy-logenomic datasets that contain hundreds or thousandsof genes and taxa [13, 14]. Therefore, developing fast yetreasonably accurate methods is one of the foremost chal-lenges in large-scale phylogenomic analyses. Distance-based methods represent an attractive class of meth-ods for large-scale analyses due to their computationalefficiency. Although these methods are generally not asaccurate as the computationally demanding Bayesian orlikelihood based methods, several studies [10, 11, 15–19]have provided support for the ability of the distance-basedmethods in estimating accurate phylogenetic trees. There-fore, the trees estimated by distance-based methods canbe used as guide trees (also known as starting trees) forother sophisticated methods as well as for divide-and-conquer based boosting methods [14, 20–24]. Moreover,under various challenging model conditions, distance-based methods become the only viable option for con-structing phylogenetic trees. Whole genome sequencesare one such case, where the traditional approach of mul-tiple sequence alignments may not work [25]. Auch et al.(2006) proposed a distance-based method to infer phy-logeny from whole genome sequences and discussed thepotential risks associated with other approaches [26]. Gaoet al. (2007) also introduced a composite vector approachfor whole genome data, where distances are computedbased on the frequency of appearance of overlappingoligopeptides [27]. Therefore, notable progress has beenmade towards developing various distance-based meth-ods [1, 10, 11, 16, 17, 19, 28–35]. Some of these methodscan also be used to analyze large-scale single nucleotidepolymorphism (SNP) data [36, 37].Missing data is considered as one of the biggest chal-

lenges in phylogenomics [38–40]. Missing data can arisefrom a combination of reasons, including data generationprotocols, failure of an experimental assay, approaches totaxon and gene sampling, and gene birth and loss [36, 41].The presence of taxa comprising a substantial amountof missing (unknown) nucleotides may significantlydeteriorate the accuracy of the phylogenetic analysis[40, 42, 43], and can affect branch length estimations intraditional Bayesian methods [44]. Therefore, many stud-ies avoid working with missing data and conduct exper-iments on the available complete dataset [39]. Severalpaleontology-oriented studies also suggest that missingdata can frequently result in poorly resolved phylogeneticrelationships [42, 45, 46].Several widely used distance-based methods, includ-

ing NJ [16], UPGMA [28], and BioNJ [17] require thatthe distance matrices do not contain any missing entries.However, only a few studies have addressed the prob-lem of imputing distance values [36, 47]. These worksmainly rely on two approaches, namely direct approachand indirect approach. Direct approaches try to construct

a tree directly from a partially filled distance matrix[1, 48]. Indirect approaches, on the other hand, estimatethe missing entries, and subsequently construct a phy-logenetic tree based on the complete distance matrix[49, 50]. Some studies have tried to combine the advan-tages of both approaches [43]. LASSO [36], which is aheuristic approach for reconstructing phylogenetic treesfrom distancematrices withmissing values, tries to exploitthe redundancy in a distance matrix. This method, requir-ing the molecular clock assumption (i.e., sequences evolveat a constant rate over time and among different organ-isms [51, 52]), has been shown to be relatively less accu-rate by Xia et al. (2018), as significant differences wereobserved between the original trees and the trees recon-structed by LASSO from incomplete distance matrices[47]. Xia et al. (2018) proposed a least square method withmultivariate optimization, which achieved a high accuracyfor estimating trees from distance matrices with missingentries [47] .In this paper, we propose two statistical and machine

learning (ML) based methods for imputing missing val-ues in distant matrices. These methods do not requireany particular assumptions (e.g., molecular clock) and canhandle large numbers of missing entries. Our techniquesare based on matrix factorization (MF) [53] and autoen-coders (AE) [54]. We assessed the performance of MFand AE on a collection of real biological and simulateddatasets. MF and AE were compared with the methodsproposed by Xia et al. (2018) [47] (implemented in theDAMBE software package [55, 56]) and Kettleborough etal. (2015) [36] (implemented in the LASSO software pack-age [57]). Experimental results suggest that MF and AEare more accurate and robust than DAMBE and LASSOunder various model conditions, and can handle largenumbers of missing values.

ResultsWe compared our methods (MF and AE) with two of themost accurate existing methods: 1) DAMBE (the impu-tation method proposed by Xia et al. (2018) [47], and 2)LASSO [36]. We used a collection of previously studiedsimulated and biological datasets to evaluate the perfor-mance of these methods. We compared the estimatedspecies trees to the model species tree (for the simulateddatasets) or to the trees estimated on the full data withoutany missing entries (for the biological datasets), to eval-uate the accuracy of various imputation techniques. Wehave used normalized Robinson-Foulds (RF) distance [58]to measure the tree error. The RF distance between twotrees is the sum of the bipartitions (splits) induced by onetree but not by the other, and vice versa. Normalized RFdistance (RF rate) is obtained by dividing the RF distanceby the maximum possible RF distance. This error rateaccounts for the number of different bipartitions between

Page 3: METHODOLOGYARTICLE OpenAccess … · 2020-07-20 · BhattacharjeeandBayzidBMCGenomics (2020) 21:497 Page2of14 manyofthesemethodsarenotscalabletoanalyzephy-logenomic datasets that

Bhattacharjee and Bayzid BMCGenomics (2020) 21:497 Page 3 of 14

the inferred and the true phylogenies, and hence relativelylower error rates indicate better performance.Similar to previous studies [47], we generated missing

entries in two ways: i) modifying the input sequencesin a way that results in missing entries in the dis-tance matrix (indirect approach), and ii) directly deletingentries from a given distance matrix (direct approach).There are n(n−1)

2 distance values in a complete distancematrix of n taxa since the distance matrix is symmet-ric. For the direct approach, similar to previous studies[36, 47], we randomly removed some entries to createpartial distance matrices. See the “Datasets” section fordetails on the indirect approach. We computed distancesfrom the sequences based on the MLCompositeTN93(TN93) model [59]. TN93 model holds the assumption ofa complex but specific model of nucleotide substitution.The distance formula is derived under the homogeneityassumption, which means that the pattern of nucleotidesubstitution has not changed in the evolutionary historyof the observed sequences [60, 61]. TN93 model accountsfor the difference between transitional substitution rates,i.e., interchange of a purine nucleotide to another purine(A ↔ G), or a pyrimidine nucleotide to another pyrimi-dine (C ↔ T), and transversions (interchange of a singlepurine to a pyrimidine, or vice versa). TN93 also differen-tiates the two kinds of transitions (A ↔ G andC ↔ T). Inaddition to the TN93 model used in previous studies [47],we also applied the LogDet method [62] to observe itsimpact on the imputation process. The LogDet distancedxy between two taxa x and y is defined as follows. Let Fxybe a K × K (K = 4 for nucleotide sequences and K = 20for amino acid sequences) divergence matrix where ij-thentry is the proportion of sites in which taxa x and y havecharacter states i and j, respectively. Then, dxy is calculatedusing the following transformation [62, 63].

dxy = − ln [ detFxy]. (1)

We used MEGA-X [61, 64, 65] to compute distancesunder the TN93 and LogDet models as well as to intro-duce missing entries in the distance matrices. We usedFastME [19, 30] to construct trees from complete dis-tance matrices. A schematic diagram of the experimentalpipeline used in this study is shown in Fig. 1.

DatasetsWe have used a set of mitochondrial COI and CytBsequences from 10 Hawaiian katydid species in the genusBanza along with four outgroup species. This dataset,comprising 24 operational taxonomic units (OTUs) and10 genes which evolved under the HKY85 model [66], waspreviously studied in [47]. In order to evaluate the rela-tive performance, we followed exactly the same processused by Xia et al. (2018) [47] for modifying the sequencesto create missing entries in distance matrices. However,

Xia et al. (2018) only generated 30 missing entries inthe matrix, whereas we analyzed a wide range of missingentries (10 ∼140).We now explain how missing values were introduced

by modifying the sequences. The 24 OTUs dataset com-prises a set of mitochondrial COI and CytB sequences. Ifwe remove the COI sequence from a taxonA and the CytBsequence from another taxon B, then the (A,B) pair doesnot share any homologous sites which results in a miss-ing entry in the corresponding distance matrix. Thus, ifwe remove the COI sequence from n1 taxa and removethe CytB sequence from a different set of n2 taxa, thecorresponding distance matrix will have n1 × n2 missingentries.We used another set of simulated datasets based on

a biological dataset (37-taxon mammalian dataset [67])which was generated and subsequently analyzed in priorstudies [9, 14, 68, 69]. This dataset was generated underthe multi-species coalescent model [70] with variousmodel conditions reflecting varying amounts of gene treediscordance resulting from the incomplete lineage sort-ing (ILS) [71]. This collection of datasets was simulatedby taking the species tree estimated by MP-EST [7] on thebiological dataset studied in Song et al. (2012) [67]. Thisspecies tree had branch lengths in coalescent units thatwere scaled (multiplying or dividing by two) to vary theamount of ILS (shorter branch lengths result in more ILS).The basic model condition with moderate amount of ILSis referred to as 1X and the model conditions with higherand lower amounts of ILS are denoted by 0.5X and 2X,respectively. For each model condition, we used 10 repli-cates of data each containing 37 sequences. We analyzeda wide range of missing entries: 36 (6 × 6), 100 (10 × 10),225 (15 × 15), and 342 (19 × 18).In order to evaluate the performance of variousmethods

on relatively larger datasets, we used a dataset containing201 taxa, which was simulated and used by [72]. We ana-lyzed various numbers of missing entries: 400 (20 ×20),1,024 (32×32), 2,500 (50×50), 5,625 (75×75), and 10,100(101 ×100).We also analyzed three distance matrices, which were

computed from aligned sequences from Carnivores, Bac-ulovirus, and mtDNAPri3F84SE, and were used in previ-ous studies [73, 74]. The numbers of taxa in these matricesrange from 7 to 10. Various numbers of distance valueswere randomly removed to introduce missing data.For each model condition with a particular number

of missing entries, we generated 10 replicates of data,and reported the average RF rate and standard errorover 10 replicates. However, we deliberately analyzed onereplicate of data on 24 OTUs dataset as was done inXia et al. (2018) [47] and removed the same entries thatwere removed by [47] to compare the performance of ourproposed techniques with respect to the results reportedin [47].

Page 4: METHODOLOGYARTICLE OpenAccess … · 2020-07-20 · BhattacharjeeandBayzidBMCGenomics (2020) 21:497 Page2of14 manyofthesemethodsarenotscalabletoanalyzephy-logenomic datasets that

Bhattacharjee and Bayzid BMCGenomics (2020) 21:497 Page 4 of 14

Fig. 1 An overview of the experimental pipeline of this study. The input is either a set of sequences, or a complete distance matrix. We generateincomplete distance matrix from input sequences or input complete distance matrix by using various missingness mechanisms. Next, we applyvarious imputation techniques to impute the missing entries in the incomplete distance matrix and thereby, generating (complete) imputeddistance matrices. Next, we estimate phylogenetic trees from the imputed distance matrices using FastME. Finally, we compare the estimated treeswith the model tree to evaluate the performance of various imputation techniques

Results on sequence inputTable 1 shows the results on 24 OTUs for a wide rangeof missing entries (10 ∼140). On this particular dataset,MF achieved superior performance on small to moderatenumbers of missing entries (0 ∼40), LASSO matched orimproved upon the other methods for moderate to highnumbers of missing entries (50 ∼110), and AE outper-formed other methods in the presence of large numbers ofmissing values (110 ∼140).

Table 1 RF rates of different methods on the 24 taxa datasetwith varying numbers of missing entries. The best RF rates forvarious model conditions are shown in boldface

#Taxa #Entries #Missing RF Rate

Entries DAMBE LASSO MF AE

10 0.0476 0.2857 0 0.0952

20 0.1429 0.3333 0.1905 0.2381

30 0.2381 0.3333 0.1905 0.2381

40 0.2857 0.3333 0.2857 0.3333

50 0.3333 0.1905 0.4286 0.3333

60 0.2857 0.2381 0.3333 0.381

70 0.4286 0.2857 0.5714 0.381

24 276 80 0.4762 0.381 0.6667 0.381

90 0.5714 0.5238 0.5238 0.5238

100 0.5714 0.5714 0.7143 0.6190

110 0.7143 0.6190 0.8571 0.6190

120 0.8095 0.7619 0.8571 0.7143

130 0.8571 0.7619 0.8095 0.7619

140 N/A N/A 0.7619 0.7619

For 30 missing entries (which was the case analyzed in[47]), MF recovered 81% of the true bipartitions, whereasDAMBE and LASSO recovered 76% and 67% bipartitionsrespectively. Figure 2 shows the trees constructed by vari-ous methods with 30 missing values. With 10∼40 missingentries, MF estimated tree was closer to the tree esti-mated on the complete dataset than DAMBE and AE.Notably, with 10 missing entries, MF was able to recon-struct the tree on complete dataset, whereas DAMBE andAE incurred 5% and 10% errors, respectively. However,as we increase the number of missing entries, DAMBEstarted to outperform MF, and AE started to outper-form both DAMBE and MF. Moreover, with moderateto high numbers of missing values (50 ∼110), LASSOachieved the best performance in recovering true bipar-titions, although MF and AE were equally good in somecases. When around one-third (90) of the entries in thedistance matrix were missing, LASSO, MF, and AE recov-ered around 48% of the true bipartitions, and DAMBErecovered 43% of the bipartitions. DAMBE can not imputedistances when more than 50% of the total entries aremissing. LASSO’s performance on these model conditionswith a relatively large amount of missing data is also notsatisfactory, since LASSO failed to construct a tree on thefull set of taxa, resulting in an incomplete tree. There-fore, we did not consider DAMBE and LASSO whenmorethan 50% of the entries (i.e., 140 entries on this partic-ular dataset) are missing. On the other hand, both MFand AE were able to reconstruct around 25% of the truebipartitions even when 140 (more than 50%) entries aremissing. Although,more than 50%missing entries in a dis-tance matrix may not be a very commonmodel condition,

Page 5: METHODOLOGYARTICLE OpenAccess … · 2020-07-20 · BhattacharjeeandBayzidBMCGenomics (2020) 21:497 Page2of14 manyofthesemethodsarenotscalabletoanalyzephy-logenomic datasets that

Bhattacharjee and Bayzid BMCGenomics (2020) 21:497 Page 5 of 14

Fig. 2 Phylogenetic trees estimated on the full and incomplete dataset (30 missing entries) with 24 OTUs from 10 Hawaiian katydid species. a Treeestimated from the full data (complete distance matrix), b - e trees reconstructed from incomplete distance matrix by DAMBE, LASSO, MF, and AE,respectively. Red rectangles highlight the inconsistencies with the tree on the full dataset

the ability to handle arbitrarily large amounts of miss-ing data advances the state-of-the-art in distance matriximputation.Results on 37-taxon simulated dataset with varying

amounts of ILS, two different evolution models and vary-ing numbers of missing values are shown in Table 2. MFand AE were competitive with or better than DAMBE inmost of the cases. Unlike the 24 OTUs dataset, LASSOperformed poorly on this 37-taxon dataset, and achievedthe worst tree accuracy. As DAMBE and LASSO can not

handle distance matrices with more than 50% missingentries, only MF and AE were able to run on the dis-tance matrices with 342 (∼50%) missing entries, albeitthe RF rates were very high (due to the lack of sufficientphylogenetic information present in the highly incom-plete distance matrix). MF could not recover any internalbranches on the 1X dataset with 342 missing entries. AE,on the other hand, was able to reconstruct around 15%bipartitions. Another observation, within the scope of theexperiments performed in this study, is that the amount

Page 6: METHODOLOGYARTICLE OpenAccess … · 2020-07-20 · BhattacharjeeandBayzidBMCGenomics (2020) 21:497 Page2of14 manyofthesemethodsarenotscalabletoanalyzephy-logenomic datasets that

Bhattacharjee and Bayzid BMCGenomics (2020) 21:497 Page 6 of 14

Table 2 Average RF rates (± standard error) of different methods on the 37-taxon dataset for varying numbers of missing entries andtwo different sequence evolution models. For each model condition, we show the average RF rate and standard error over 10replicates. The best RF rates for various model conditions are shown in boldface

#Taxa #Entries Scaling Model #Missing Average RF Rate

Entries DAMBE LASSO MF AE

36 0.41±0.02 0.72±0.03 0.33±0.03 0.41±0.02

100 0.48±0.02 0.72±0.03 0.46±0.02 0.45±0.02

TN93 225 0.72±0.03 0.78±0.03 0.62±0.01 0.70±0.02

37 666 1X 342 N/A N/A 0.99±0.02 0.86±0.01

36 0.41±0.02 0.71±0.02 0.35±0.03 0.4±0.02

100 0.49±0.02 0.72±0.02 0.5±0.03 0.46±0.02

LogDet 225 0.72±0.02 0.76±0.02 0.66±0.03 0.72±0.02

342 N/A N/A 1±0 0.86±0.02

36 0.45±0.02 0.69±0.02 0.35±0.02 0.43±0.02

100 0.49±0.03 0.72±0.02 0.5±0.02 0.54±0.03

TN93 225 0.66±0.02 0.76±0.02 0.62±0.01 0.71±0.02

37 666 0.5X 342 N/A N/A 1±0 0.84±0.02

36 0.45±0.02 0.68±0.02 0.35±0.02 0.42±0.02

100 0.49±0.03 0.71±0.02 0.52±0.02 0.51±0.02

LogDet 225 0.64±0.02 0.76±0.01 0.66±0.02 0.7±0.02

342 N/A N/A 0.99±0.02 0.84±0.02

36 0.43±0.02 0.68±0.01 0.36±0.03 0.42±0.02

100 0.5±0.01 0.69±0.02 0.52±0.02 0.5±0.02

TN93 225 0.66±0.02 0.73±0.02 0.71±0.02 0.69±0.02

37 666 2X 342 N/A N/A 0.99±0.01 0.85±0.01

36 0.44±0.02 0.63±0.02 0.36±0.02 0.4±0.01

100 0.51±0.02 0.66±0.02 0.54±0.02 0.52±0.02

LogDet 225 0.66±0.02 0.73±0.01 0.7±0.02 0.69±0.02

342 N/A N/A 0.99±0.01 0.86±0.02

of ILS does not have any significant impact on the perfor-mance of various imputation techniques. However, moreexperiments are required to further investigate the impactof ILS.We also analyzed the impact of two widely used

sequence evolution models (TN93 and LogDet) on theperformance of the proposed imputation techniques. MFperformed poorly on LogDet model compared to theTN93 model, and produced higher error rates in 17 (outof 24) cases on LogDet model than the TN93 model.AE, on the other hand, showed similar (on 1X model)or slightly better (on 0.5X and 2X models) performanceunder the LogDet model. DAMBE achieved an improvedperformance under the LogDet model (compared to theTN93 model) only on the 0.5X model condition and theopposite trend is observed on the 1X and 2X model con-ditions, albeit the differences are very small (Table 2).LASSO performed slightly better on LogDet model thanon TN93.

Finally, we applied our techniques on a large datasetwith 201 taxa (Table 3). As DAMBE was too computa-tionally expensive to run on this large dataset (it did notprovide any result after 24 hours of computation), weexcluded DAMBE from this analysis. Both MF and AEoutperformed LASSO in all cases, and the improvementsare substantial. AE performed particularly well on thisdataset, as it achieved the lowest average RF rates under allmodel conditions. MF also performed well, and achievedcomparable accuracies. The improvement of AE over MFand LASSO increases as we increase the number of miss-ing entries. Even with 10,100 (∼50%) missing entries,AE was able to recover 57% true bipartitions under bothsequence evolution models. LASSO consistently achievedthe highest average RF rate under various model condi-tions. Even with only 400 (∼2%) missing entries, LASSOcould not recover more than 41% true bipartitions, whichis worse than AE’s performance on amodel condition with10,100 (∼50%) missing entries.

Page 7: METHODOLOGYARTICLE OpenAccess … · 2020-07-20 · BhattacharjeeandBayzidBMCGenomics (2020) 21:497 Page2of14 manyofthesemethodsarenotscalabletoanalyzephy-logenomic datasets that

Bhattacharjee and Bayzid BMCGenomics (2020) 21:497 Page 7 of 14

Table 3 Average RF rates (± standard error) of different methods on the 201-taxon dataset. The best RF rates for various modelconditions are shown in boldface

#Taxa #Entries Model #Missing Average RF Rate

Entries LASSO MF AE

400 0.6±0.02 0.36±0.04 0.36±0.01

1024 0.61±0.02 0.4±0.05 0.39±0.04

TN93 2500 0.62±0.02 0.41±0.03 0.4±0.02

5625 0.63±0.03 0.44±0.03 0.41±0.03

201 20100 10100 N/A 0.59±0.02 0.43±0.01

400 0.59±0.02 0.38±0.02 0.37±0.01

1024 0.62±0.01 0.4±0.03 0.38±0.02

LogDet 2500 0.61±0.02 0.41±0.02 0.4±0.02

5625 0.62±0.02 0.46±0.03 0.43±0.02

10100 N/A 0.58±0.03 0.43±0.01

Results on distance matrix inputResults on Carnivores, Baculovirus and mtD-NAPri3F84SE are shown in Tables 4, 5, and 6. On theCarnivores dataset (Table 4), LASSO and AE producedthe best results. Even with 25 (more than 50%) missingentries, AE was able to reconstruct more than 25% of thetrue bipartitions. The performance of MF was worse thanLASSO, AE, and DAMBE. On the Baculovirus dataset,MF achieved the lowest RF rates for relatively lowernumbers of missing entries. However, as we increase thenumber of missing entries, LASSO and AE started tooutperform other methods. On the mtDNAPri3F84SEdataset, these methods showed a mixed performance, andno method consistently outperformed the others. How-ever, DAMBE and LASSO achieved better performancethan MF and AE.

Running timeWe performed the experiments on a computer with i5-3230M, 2.6 GHz CPU with 12 GB RAM. The runningtime of MF on the 24-taxon dataset ranges between 7 ∼15minutes for various numbers of missing entries. DAMBEtakes only a few seconds with 10missing entries, but as weincrease the number of missing entries to 130, the runningtime of DAMBE increases to 2 minutes. AE was faster,requiring only around 30 seconds for this dataset. LASSO

was the fastest, which took only a second. Notably, unlikeMF and DAMBE, the running times of LASSO and AE donot change much as we increase the number of missingentries.On the 37-taxon dataset, MF takes around 30 min-

utes while DAMBE takes 12 ∼15 minutes. AE is fasterthan MF and DAMBE, taking only around 45 seconds.LASSO was the fastest method which took only a second.On 201-taxon dataset, DAMBE was too computationallyexpensive to run, and did not produce any result after24 hours of computation. MF took 4 ∼6 hours, whereasAE took only 20 ∼30 minutes. LASSO was the fastestmethod which took only a second, although the accuracywas substantially worse than both MF and AE.For relatively smaller matrices (Carnivores, Baculovirus

and mtDNAPri3F84SE datasets), DAMBE is very fast, andfinished in a second. MF took around 45 seconds, and AEtook 20 seconds. Overall, AE and LASSO scale well tolarge datasets and their running times are less sensitive tothe number of taxa and the number of missing entries.

DiscussionWe extensively evaluated MF and AE on a collection ofreal and simulated datasets. Previous studies [36, 47] lim-ited their evaluation studies to a small number of datasetswith limited numbers of taxa. Moreover, previous studies

Table 4 Average RF rates (± standard error) of different methods on the Carnivores dataset. The best RF rates for various modelconditions are shown in boldface

#Taxa #Entries #Missing Average RF Rate

Entries DAMBE LASSO MF AE

5 0.29±0.06 0.14±0.06 0.37±0.1 0.23±0.07

10 0.6±0.03 0.23±0.07 0.71±0.06 0.23±0.07

10 45 15 0.63±0.07 0.26±0.02 0.83±0.09 0.57±0.04

20 0.77±0.03 0.4±0.06 0.94±0.07 0.63±0.05

25 N/A N/A 0.94±0.05 0.74±0.05

Page 8: METHODOLOGYARTICLE OpenAccess … · 2020-07-20 · BhattacharjeeandBayzidBMCGenomics (2020) 21:497 Page2of14 manyofthesemethodsarenotscalabletoanalyzephy-logenomic datasets that

Bhattacharjee and Bayzid BMCGenomics (2020) 21:497 Page 8 of 14

Table 5 Average RF rates (± standard error) of different methods on the Baculovirus dataset. The best RF rates for various modelconditions are shown in boldface

#Taxa #Entries #Missing Average RF Rate

Entries DAMBE LASSO MF AE

4 0.27±0.08 0.29±0.03 0.17±0.15 0.39±0.04

8 0.5±0.11 0.33±0.08 0.5±0.1 0.39±0.08

9 36 12 0.7±0.07 0.47±0.06 0.49±0.05 0.5±0

16 0.7±0.06 0.5±0.07 0.67±0.05 0.57±0.08

20 N/A N/A 0.67±0.11 0.67±0.11

did not explore the model conditions with more than 10%missing values. We tried to address these issues by eval-uating our methods on six different datasets with variouschallenging model conditions. We analyzed a 201-taxondataset, whereas previous comparative studies were lim-ited to less than 30 taxa. Furthermore, we analyzed theimpact of varying amounts of ILS on the performance ofvarious imputation techniques.In general, MF and AE based methods produced more

accurate trees than the existing methods. DAMBE wascomparable to MF and AE when the numbers of miss-ing entries were relatively small. However, DAMBE didnot perform well with moderate to high numbers of miss-ing entries. Although LASSO was previously shown to beless accurate than DAMBE [47], we found several caseswhere LASSO performed better than DAMBE. For rel-atively lower numbers of taxa, LASSO works very well,even when 25 ∼45% entries are missing. But on the 37-taxon and 201-taxon datasets, LASSO consistently per-formed poorly compared to other methods. On the otherhand, MF and AE achieved superior tree accuracy onmost of the model conditions. One prominent outcomeof this study is the introduction of methods that caneffectively analyze large datasets. While DAMBE failed toproduce any results after 24 hours of computation andLASSO could not recover more than 40% true bipartitionson the 201-taxon dataset, the AE-based method consis-tently recovered around 60% bipartitions on this largedataset under various model conditions. Even the MF-based method, although less scalable than AE, showed

promising performance, especially with relatively lowernumbers of missing entries. The ability to analyze largedatasets with hundreds of taxa makes our proposedmeth-ods applicable to large scale phylogenomic analyses.Another important aspect is that both DAMBE and

LASSO failed to handle distance matrices with more than50% missing entries. However, MF and AE can handle anarbitrarily large amount of missing data. Sequence datamay contain substantial amounts of missing information,resulting in distance matrices with large numbers of miss-ing entries. We note that the presence of a substantialnumber of missing values in distance matrices may resultin inaccurate trees and researchers will tend to approachthese trees with care. However, the ability to constructtrees in the presence of arbitrarily large numbers of miss-ing entries will help us estimate starting trees (guide trees)on extremely challenging model conditions with high lev-els of missing data. These guide trees can be improved byfurther analysis (e.g., divide-and-conquer based boostingtechniques [14, 20–23]).Our extensive experimental studies on six different

datasets suggest that AE-based method is more accurateand robust than others under most of the model con-ditions. Especially, on moderate to large-scale datasetsand in the presence of relatively higher levels of missingdata, AE is substantially better than the existing meth-ods – making it a suitable candidate for large-scale phy-logenomic analyses. This demonstrates the power of MLbased techniques in capturing the latent representationsin large-scale phylogenetic datasets, despite the presence

Table 6 Average RF (± standard error) of different methods on the mtDNAPri3F84SE dataset. The best RF rates for various modelconditions are shown in boldface

#Taxa #Entries #Missing Average RF Rate

Entries DAMBE LASSO MF AE

2 0.05±0.05 0.1±0.04 0.4±0.15 0.15±0.09

5 0.2±0.08 0.2±0.08 0.55±0.08 0.5±0.1

7 21 7 0.4±0.11 0.3±0.13 0.75±0.07 0.8±0.19

10 0.65±0.17 0.5±0.16 0.8±0.04 0.7±0.04

12 N/A N/A 0.9±0.05 0.85±0.05

Page 9: METHODOLOGYARTICLE OpenAccess … · 2020-07-20 · BhattacharjeeandBayzidBMCGenomics (2020) 21:497 Page2of14 manyofthesemethodsarenotscalabletoanalyzephy-logenomic datasets that

Bhattacharjee and Bayzid BMCGenomics (2020) 21:497 Page 9 of 14

of missing data. However, future works will need to inves-tigate how to help the researchers choose the right impu-tation approaches on relatively small datasets as variousmethods have shown mixed performance on very smalldatasets (≤ 10 taxa).Although we investigated a collection of datasets under

various practical model conditions, this study can beextended in several directions. This study investigated rel-atively long sequences (250∼2600 bp); subsequent studiesshould investigate the relative performance of variousmethods on very short sequences. This study analyzedsmall to large scale dataset (7 ∼201 taxa). Ultra largedatasets with thousands of taxa need to be analyzed,especially to demonstrate the power of ML based tech-niques in leveraging the latent features of phylogeneticdata. Although we have appropriately adapted the MFand AE based techniques for imputing distance matri-ces, further parameter tuning and customization in theunderlying deep learning architecture may improve theperformance of our proposed techniques. We leave theseas future works.

ConclusionsIn this study, we have presented two imputation tech-niques, inspired frommatrix factorization and deep learn-ing architecture, to reconstruct phylogenetic trees frompartial distance matrices. Experimental results on bothsimulated and real biological datasets show that ourmeth-ods match or improve upon the alternate best techniquesunder various model conditions with varying numbers oftaxa, sequence lengths, and amounts of gene tree discor-dance. We also evaluated these methods using differentDNA sequence evolution models and missingness mech-anisms.Estimating phylogenetic trees in the presence of missing

data is sufficiently complex and hence existing meth-ods cannot fully comprehend or predict the relationshipsamong the taxa from partial distance matrices. Thus, thegoal here should be the creation of an appropriate modelto capture the underlying data distribution; the modelshould account for as much phylogenetic data as possi-ble to impute the missing entries. This view emphasizesthe importance of ML for distance matrix imputation.Moreover, we aimed to develop appropriate unsupervisedmodels. Unsupervised learning approaches have advan-tages over supervised methods particularly when the dataare heterogeneous, which are often so with various phy-logenetic dataset and therefore the supervised modelstrained on distance matrices on a particular set of taxamay not be generalizable to a new set of taxa.We have shown that MF and AE are robust, and can

handle high amounts of missing data. Unlike other meth-ods [36], MF and AE do not require the molecular clockassumption. Moreover, deep learning basedmethods (e.g.,

autoencoders) are able to automatically learn latent repre-sentations and complex inter-variable associations, whichis not possible for heuristic based methods. Therefore,this study lays a firm and broad foundation for applyingML based techniques in various problems in phyloge-nomics. Considering the rapidly increasing amount ofphylogenomic datasets, and the prevalence of accompany-ing missing data, the timing of our proposed approachesseems appropriate. We believe that the proposed impu-tation techniques represent a major step towards solvingreal world instances in phylogenomics.

MethodsMatrix factorization (MF)Matrix factorization (MF) has become popular since 2006,when one group of competitors for the Netflix Prize thatyear used this technique [53, 75]. This method is usuallybeing applied in recommender systems [76], and is usedto discover latent features between two interacting enti-ties.Matrix factorization is a class of collaborative filteringalgorithms [77], which predicts users’ future interest byanalyzing their past behavior.Intuitively, there should be some latent features behind

how a certain user rates an item. For example, movie rat-ings by users generally rely on many features, includinggenre, actors, etc. If a certain individual gives high ratingsto action movies, we can expect him to do the same toanother action movie which is not yet rated by him. Dis-covering the latent features will thus help predict users’future preferences.Matrix Factorization has previously been used in imput-

ing missing data in various domains of bioinformatics,including analyzing scRNA-seq with missing data [78],handling missing data in genome-wide association stud-ies (GWAS) [79], and identifying cancerous genes [80]. Inthis study, we have adapted this idea for imputing missingentries in a distance matrix for phylogenetic estimation. Ifthe distance between two taxa A and B is not known, wecan predict the distance by analyzing their distances withother taxa using the concept of matrix factorization (withappropriate customization).Let S be a set of N OTUs (operational taxonomic units).

Let R be an |N | × |N | distance matrix comprising the dis-tances between any two OTUs. If we want to find K latentfeatures of distances, we need to find two matrices X andY, where the dimensions of X and Y are |N | ×K . We usedK = N in our implementation. The product of X and YT

will then approximate R as follows.

R ≈ X × YT = R̂ (2)

However, as matrix R (and R̂) are symmetric, meaningthat rij = rji (and r̂ij = r̂ji), we only consider the lowertriangular portion of the matrix. We impute a distance r̂ijbetween two OTUs as follows.

Page 10: METHODOLOGYARTICLE OpenAccess … · 2020-07-20 · BhattacharjeeandBayzidBMCGenomics (2020) 21:497 Page2of14 manyofthesemethodsarenotscalabletoanalyzephy-logenomic datasets that

Bhattacharjee and Bayzid BMCGenomics (2020) 21:497 Page 10 of 14

r̂ij =K∑

k=1xikykj (3)

We randomly initialize X and Y and try to determinethe error between R and the product of X and Y. Thenwe iteratively update X and Y so that the error is reduced.We considered the squared error as the errors can be bothpositive and negative. We used a regularization parame-ter β to avoid overfitting. Thus, we calculate the error asfollows.

e2ij = (rij − r̂ij)2 + β

2

K∑

k=1(||X||2 + ||Y ||2)

= (rij −K∑

k=1xikykj)2 + β

2

K∑

k=1(||X||2 + ||Y ||2)

(4)

In order to minimize the error defined in Eqn. 4, thedirections for modifying xik and ykj need to be identified.This means we need to find the gradient at current values,which we do by differentiating Eqn. 4 with respect to xikand ykj separately. Thus, we use the following update rules.

xik(updated) = xik + α∂

∂xike2ij = xik + α(2eijykj − βxik)

(5)

ykj(updated) = ykj + α∂

∂ykje2ij = ykj + α(2eijxik − βykj)

(6)

In Eqns. 5 and 6, α is a constant which determines therate to approach the minimum error. We experimentedwith a range of values (10−4 ∼ 10−1) of α and β from theimplementations in [81], and set α = 0.002 and β = 0.02 asthese values provided reliable performance. However, fur-ther parameter tuningmay improve both the accuracy andconvergence time.We perform these steps iteratively untilthe total error E (= ∑

eij ) converges to a pre-specifiedthreshold value (10−6) or 10,000 iterations take place.Algorithm 1 shows our MF-based imputation process.

Autoencoder (AE)An autoencoder (AE) is a type of artificial neural net-work that learns to copy its input to its output. This isachieved by learning efficient data codings in an unsuper-vised manner to recreate the input. An autoencoder firstcompresses the input into a latent space representationand then reconstructs the output from that representa-tion. It tries to learn a function g(f (x)) ≈ x, where f (x)encodes the input x and g(f (x)) reconstructs the input

Algorithm 1 Imputation Method using Matrix Factoriza-tion1: R ← Actual N × N distance matrix with missing

values2: Randomly initialize matrices X and Y of dimen-

sions N × K3: Initialze α and β

4: while iteration number ≤ 10, 000 do5: for all known values rij in R do6: e2ij ← (rij − ∑K

k=1 xikykj)2 + β2

∑Kk=1(||X||2 +

||Y ||2)7: for k = 1 to K do8: xik(updated) ← xik + α(2eijykj − βxik)9: ykj(updated) ← ykj + α(2eijxik − βykj)

10: end for11: end for12: Calculate Total Error E ← ∑

eij13: if E ≤ 10−6 then14: break15: end if16: end while17: Complete Distance Matrix Rimputed ← X.YT

18: Replace themissing values in Rwith the reconstructedvalues in Rimputed, leaving the non-missing entries inR unchanged.

19: Return R

x using decoder. Figure 3a shows a general overview ofautoencoders.Autoencoders have been used in integrative analyses

of biomedical big data. Its ability to reduce the dimen-sion and extract non-linear features [82] have been lever-aged by many studies. In one oncology study, autoen-coders have been able to extract cellular features, whichcan correlate with drug sensitivity involved with cancercell lines [83]. An autoencoder was also used to dis-cover two liver cancer sub-types that had distinguishablechances of survival [84]. Moreover, some recent successfuldata imputation methods have been developed based onautoencoders [85–87]. Autoimpute [85] can be an exam-ple which imputes single cell RNA-seq gene expressiondata. Autoencoder-based methods such as [86] and [87]have surpassed older ML techniques on various real lifedatasets.In this study, we developed an undercomplete autoen-

coder [54] to predict the missing values in a distancematrix. The goal of an undercomplete autoencoder isto learn the most salient features of data by putting aconstraint on the amount of information that can flowthrough the network. We do not need any regularizationterm here because an undercomplete autoencoder maxi-mizes the probability of data rather than copying the inputto the output.

Page 11: METHODOLOGYARTICLE OpenAccess … · 2020-07-20 · BhattacharjeeandBayzidBMCGenomics (2020) 21:497 Page2of14 manyofthesemethodsarenotscalabletoanalyzephy-logenomic datasets that

Bhattacharjee and Bayzid BMCGenomics (2020) 21:497 Page 11 of 14

Fig. 3 a General overview of an autoencoder. b A schematic of our proposed autoencoder model. The X ’s in the dropout layers symbolically denotethat their weights will be set to zero

Our architecture has been inspired by an open sourcelibrary, FancyImpute [88], which is a library for imputa-tion algorithms and is implemented in Python language.Our model has 3 hidden layers with ReLU (Rectified Lin-ear Unit) activation functions [89]. The dropout rate is setto 0.75, which appears to work better than other values.A sigmoid function [90] is used as the activation func-tion for output layer. FancyImpute iteratively updates theimputed values where a prediction from the previous iter-ation is updated according to Eq. 7. We used the defaultpredefined weights from the FancyImpute library.

x′ = (1 − w)x + wp (7)

In Eq. 7, x′ = updated value, x = old value, w = predefinedweight, and p = predicted value from the autoencoder.

Our model takes as input a distance matrix R with miss-ing entries. First, the missing values in R are replacedwith random values. Next, using the architecture shown inFig. 3b, our model tries to fit the input (R) to output (R′).It tries to progressively improve the prediction by mini-mizing the reconstruction error (loss function) where theerror is computed based on the non-missing entries ofthe original matrix. We have used the mean squared error(MSE) as the reconstruction error function L(R,R′), whichminimizes the difference between the input R and theautoencoder’s output R′ considering only the non-missingentries. Let NM be the set of non-missing entries in R.Then, L(R,R′) is computed as follows.

L(R,R′) =∑

i∈NM|Ri − R′

i|2. (8)

Page 12: METHODOLOGYARTICLE OpenAccess … · 2020-07-20 · BhattacharjeeandBayzidBMCGenomics (2020) 21:497 Page2of14 manyofthesemethodsarenotscalabletoanalyzephy-logenomic datasets that

Bhattacharjee and Bayzid BMCGenomics (2020) 21:497 Page 12 of 14

Algorithm 2 Imputation Method using Autoencoder1: Replace the missing entries in the distance matrix R

with random values2: while iteration number = 10, 000 do3: Fit an autoencoder to the observed entries in the

original matrix by minimizing the reconstructionerror as shown in Eqn. 8

4: Reconstruct output for the missing entries5: Update the missing values using Eqn. 76: if reconstruction error ≤ 10−6 then7: break8: end if9: end while

10: Replace the missing values in R with the recon-structed values, leaving the non-missing entries in Runchanged

We replace the missing entries with the imputed valuesand keep the original non-missing values unchanged oncea certain number of iterations (10,000) have taken placeor the reconstruction error has gone below a pre-specifiedthreshold value (10−6). Algorithm 2 shows our AE-basedimputation process.

Software implementationThe proposedmethods have been developed in Python 3.5using various libraries, namely, easygui, pandas, numpy,matplotlib, seaborn, tensorflow, and keras. The methodshave been developed as cross-platform applications.The 201-taxon dataset is available at https://sites.

google.com/eng.ucsd.edu/datasets/astral/astral-ii [72].The 37-taxon dataset is available at https://sites.google.

com/eng.ucsd.edu/datasets/binning [67].The 24-taxon dataset is available at https://doi.org/10.

7717/peerj.5321/supp-1 [47].The 10-, 9-, and 7-taxon datasets are available in the

DAMBE software package (http://dambe.bio.uottawa.ca/DAMBE/dambe.aspx) [56].

AbbreviationsAE: Autoencoder; GWAS: Genome-wide association study; ILS: Incompletelineage sorting; MCMC: Markov chain Monte Carlo; MF: Matrix factorization;ML: Machine learning; MSE: Mean-squared error; NJ: Neighbor joining; OTU:Operational taxonomic unit; ReLU: Rectified Linear Unit; RF: Robinson Foulds;scRNA-seq: Single-cell RNA sequencing; SNP: Single nucleotide polymorphism;UPGMA: Unweighted pair group method with arithmetic mean

AcknowledgementsNot applicable.

Authors’ contributionsMSB and AB conceived the study; AB and MSB designed the methods; ABimplemented the methods and performed the experiments; MSB and ABinterpreted the results; MSB and AB wrote the paper; Both authors have readand approved the manuscript.

FundingThe authors received no financial support for this research.

Availability of data andmaterialThe proposed methods are available in open source form at https://github.com/Ananya-Bhattacharjee/ImputeDistances. All the datasets analyzed in thispaper are from previously published studies and are publicly available.

Ethics approval and consent to participateNot applicable.

Consent for publicationNot applicable.

Competing interestsThe authors declare that they have no competing interests.

Author details1Department of Computer Science and Engineering, Bangladesh University ofEngineering and Technology, 1205, Dhaka, Bangladesh. 2Department ofComputer Science and Engineering, Eastern University, Dhaka, Bangladesh.

Received: 27 January 2020 Accepted: 7 July 2020

References1. Felsenstein J. Inferring Phylogenies. Vol 2. Sunderland: Sinauer Associates;

2004, p. 664.2. Drummond AJ, Rambaut A. BEAST: Bayesian evolutionary analysis by

sampling trees. BMC Evol Biol. 2007;7:214.3. Kubatko LS, Carstens BC, Knowles LL. STEM: Species tree estimation

using maximum likelihood for gene trees under coalescence.Bioinformatics. 2009;25:971–973.

4. Liu L, Yu L, Pearl DK, Edwards SV. Estimating species phylogenies usingcoalescence times among sequences. Syst Biol. 2009;58(5):468–477.

5. Larget B, Kotha SK, Dewey CN, Ané C. BUCKy: Gene tree/species treereconciliation with the Bayesian concordance analysis. Bioinformatics.2010;26(22):2910–1.

6. Liu L. BEST: Bayesian estimation of species trees under the coalescentmodel. Bioinformatics. 2008;24:2542–3.

7. Liu L, Yu L, Edwards SV. A maximum pseudo-likelihood approach forestimating species trees under the coalescent model. BMC Evol Biol.2010;10:302.

8. Reaz R, Bayzid MS, Rahman MS. Accurate phylogenetic treereconstruction from quartets: A heuristic approach. PLoS One. 2014;9(8):104008.

9. Mirarab S, Reaz R, Bayzid MS, Zimmermann T, Swenson MS, Warnow T.ASTRAL: Genome-scale coalescent-based species tree estimation.Bioinformatics. 2014;30(17):541–8.

10. Liu L, Yu L. Estimating species trees from unrooted gene trees. Syst Biol.2011;60(5):661–7.

11. Vachaspati P, Warnow T. ASTRID: Accurate species trees from internodedistances. BMC Genomics. 2015;16(10):3.

12. Islam M, Sarker K, Das T, Reaz R, Bayzid MS. STELAR: A statisticallyconsistent coalescent-based species tree estimation method bymaximizing triplet consistency. BMC Genomics. 2020;21(1):1–13.

13. Bayzid MS, Warnow T. Naive binning improves phylogenomic analyses.Bioinformatics. 2013;29(18):2277–84.

14. Bayzid MS, Hunt T, Warnow T. Disk covering methods improvephylogenomic analyses. BMC Genomics. 2014;15(6):7.

15. Sourdis J, Nei M. Relative efficiencies of the maximum parsimony anddistance-matrix methods in obtaining the correct phylogenetic tree. MolBiol Evol. 1988;5(3):298–311.

16. Saitou N, Imanishi T. Relative efficiencies of the Fitch-Margoliash,maximum-parsimony, maximum-likelihood, minimum-evolution, andneighbor-joining methods of phylogenetic tree construction in obtainingthe correct tree. Mol Biol Evol. 1989;6(5):514.

17. Gascuel O. BIONJ: An improved version of the NJ algorithm based on asimple model of sequence data. Mol Biol Evol. 1997;14(7):685–95.

18. Rosenberg MS, Kumar S. Traditional phylogenetic reconstructionmethods reconstruct shallow and deep evolutionary relationships equallywell. Mol Biol Evol. 2001;18(9):1823–7.

Page 13: METHODOLOGYARTICLE OpenAccess … · 2020-07-20 · BhattacharjeeandBayzidBMCGenomics (2020) 21:497 Page2of14 manyofthesemethodsarenotscalabletoanalyzephy-logenomic datasets that

Bhattacharjee and Bayzid BMCGenomics (2020) 21:497 Page 13 of 14

19. Desper R, Gascuel O. Fast and accurate phylogeny reconstructionalgorithms based on the minimum-evolution principle. In: Lecture Notesin Computer Science. Springer; 2002. p. 357–374. https://doi.org/10.1007/3-540-45784-4_27.

20. Huson D, Nettles S, Warnow T. Disk-Covering, a fast converging methodfor phylogenetic tree reconstruction. J Comput Biol. 1999;6(3):369–86.

21. Huson D, Vawter L, Warnow T. Solving large scale phylogeneticproblems using DCM2. In: Proceedings of the 7th InternationalConference on Intelligent Systems for Molecular Biology (ISMB’99).Palo Alto: AAAI Press; 1999. p. 118–129.

22. Roshan U, Moret BME, Williams TL, Warnow T. Rec-I-DCM3: A fastalgorithmic technique for reconstructing large phylogenetic trees. In:Proceedings. 2004 IEEEComputational SystemsBioinformatics Conference,2004. CSB 2004.. IEEE; 2004. https://doi.org/10.1109/csb.2004.1332422.

23. Nakhleh L, Roshan U, James KS, Sun J, Warnow T. Designing fastconverging phylogenetic methods. Bioinformatics. 2001;17:190–8.

24. Roshan U, Moret BME, Williams TL, Warnow T. Performance of supertreemethods on various dataset decompositions. In: Bininda-Emonds ORP,editor. Phylogenetic Supertrees: Combining Information to Reveal TheTree of Life. Dordrecht; 2004. p. 301–328. Volume 3 of ComputationalBiology, Kluwer Academics, (Andreas Dress, series editor).

25. Deng R, Huang M, Wang J, Huang Y, Yang J, Feng J, Wang X. PTreeRec:Phylogenetic tree reconstruction based on genome blast distance.Comput Biol Chem. 2006;30(4):300–2.

26. Auch AF, Henz SR, Holland BR, Göker M. Genome BLAST distancephylogenies inferred from whole plastid and whole mitochondriongenome sequences. BMC Bioinformatics. 2006;7(1):350.

27. Gao L, Qi J. Whole genome molecular phylogeny of large dsDNA virusesusing composition vector method. BMC Evol Biol. 2007;7(1):41.

28. Sokal RR. A statistical method for evaluating systematic relationships. UnivKansas Sci Bull. 1958;38:1409–38.

29. Desper R, Gascuel O. Fast and accurate phylogeny reconstructionalgorithms based on the minimum-evolution principle. In: InternationalWorkshop on Algorithms in Bioinformatics. Springer; 2002. p. 357–374.https://doi.org/10.1007/3-540-45784-4_27.

30. Desper R, Gascuel O. Theoretical foundation of the balanced minimumevolution method of phylogenetic inference and its relationship toweighted least-squares tree fitting. Mol Biol Evol. 2004;21(3):587–98.

31. Cao MD, Allison L, Dix TI, Bodén M. Robust estimation of evolutionarydistances with information theory. Mol Biol Evol. 2016;33(5):1349–57.

32. Bogusz M, Whelan S. Phylogenetic tree estimation with and withoutalignment: New distance methods and benchmarking. Syst Biol.2017;66(2):218–31.

33. Balaban M, Sarmashghi S, Mirarab S. APPLES: Scalable Distance-BasedPhylogenetic Placement with or without Alignments. Syst Biol. 2019;69(3):566–78.

34. Moshiri N. TreeN93: A non-parametric distance-based method forinferring viral transmission clusters. bioRxiv. 2018. https://doi.org/10.1101/383190.

35. Allman ES, Long C, Rhodes JA. Species tree inference from genomicsequences using the log-det distance. SIAM J Appl Algebra Geom.2019;3(1):107–27.

36. Kettleborough G, Dicks J, Roberts IN, Huber KT. Reconstructing (super)trees from data sets with missing distances: not all is lost. Mol Biol Evol.2015;32(6):1628–42.

37. Joly S, Bryant D, Lockhart PJ. Flexible methods for estimating geneticdistances from single nucleotide polymorphisms. Methods Ecol Evol.2015;6(8):938–948.

38. Sanderson MJ, Purvis A, Henze C. Phylogenetic supertrees: Assemblingthe trees of life. Trends Ecol Evol. 1998;13(3):105–9.

39. Wiens JJ. Missing data and the design of phylogenetic analyses. J BiomedInform. 2006;39(1):34–42.

40. Bayzid MS, Warnow T. Estimating optimal species trees from incompletegene trees under deep coalescence. J Comput Biol. 2012;19(6):591–605.

41. Christensen S, Molloy EK, Vachaspati P, Warnow T. OCTAL: Optimalcompletion of gene trees in polynomial time. Algoritm Mol Biol.2018;13(1):6.

42. Huelsenbeck JP. When are fossils better than extant taxa in phylogeneticanalysis?. Syst Biol. 1991;40(4):458–69.

43. Makarenkov V, Lapointe F-J. A weighted least-squares approach forinferring phylogenies from incomplete distance matrices. Bioinformatics.2004;20(13):2113–21.

44. Lemmon AR, Brown JM, Stanger-Hall K, Lemmon EM. The effect ofambiguous data on phylogenetic estimates obtained by maximumlikelihood and bayesian inference. Syst Biol. 2009;58(1):130–45.

45. Gauthier J. Saurischian monophyly and the origin of birds. Mem CalifAcad Sci. 1986;8:1–55.

46. Langer MC, Ferigolo J, Schultz CL. Heterochrony and tooth evolution inhyperodapedontine rhynchosaurs (reptilia, diapsida). Lethaia. 2000;33(2):119–28.

47. Xia X. Imputing missing distances in molecular phylogenetics. PeerJ.2018;6:5321.

48. Guénoche A, Leclerc B. The triangles method to build X-trees fromincomplete distance matrices. RAIRO Oper Res. 2001;35(2):283–300.

49. De Soete G. Additive-tree representations of incomplete dissimilaritydata. Qual Quant. 1984;18(4):387–93.

50. Lapointe FJ, Kirsch JA. Estimating phylogenies from lacunose distancematrices, with special reference to DNA hybridization data. Mol Biol Evol.1995;12:266–84.

51. Robinson NE, Robinson AB. Molecular clocks. Proc Nat Acad Sci.2001;98(3):944–9.

52. Ho S. The molecular clock and estimating species divergence. Nat Educ.2008;1(1):1–2.

53. Koren Y, Bell R, Volinsky C. Matrix factorization techniques forrecommender systems. Computer. 2009;42(8):30–7. https://doi.org/10.1109/mc.2009.263.

54. Goodfellow I, Bengio Y, Courville A. Deep Learning. AdaptiveComputation and Machine Learning series. Cambridge: MIT press; 2016.

55. Xia X, Xie Z. DAMBE: Software package for data analysis in molecularbiology and evolution. J Hered. 2001;92(4):371–3.

56. Xia X. DAMBE7: New and improved tools for data analysis in molecularbiology and evolution. Mol Biol Evol. 2018;35(6):1550–2.

57. The UEA Computational Biology Laboratory. https://www.uea.ac.uk/computing/lasso. Accessed 08 July 2019.

58. Robinson DF, Foulds LR. Comparison of phylogenetic trees. Math Biosci.1981;53(1-2):131–47.

59. Tamura K, Nei M. Estimation of the number of nucleotide substitutions inthe control region of mitochondrial DNA in humans and chimpanzees.Mol Biol Evol. 1993;10(3):512–26.

60. Tamura K, Kumar S. Evolutionary distance estimation underheterogeneous substitution pattern among lineages. Mol Biol Evol.2002;19(10):1727–36.

61. Tamura K, Dudley J, Nei M, Kumar S. MEGA4: Molecular evolutionarygenetics analysis (MEGA) software version 4.0. Mol Biol Evol. 2007;24(8):1596–9.

62. Lockhart PJ, Steel MA, Hendy MD, Penny D. Recovering evolutionarytrees under a more realistic model of sequence evolution. Mol Biol Evol.1994;11(4):605–12.

63. Steel M. Recovering a tree from the leaf colourations it generates under amarkov model. Appl Math Lett. 1994;7(2):19–23.

64. Kumar S, Stecher G, Li M, Knyaz C, Tamura K. MEGA X: Molecularevolutionary genetics analysis across computing platforms. Mol Biol Evol.2018;35(6):1547–9.

65. Tamura K, Stecher G, Peterson D, Filipski A, Kumar S. MEGA6: Molecularevolutionary genetics analysis version 6.0. Mol Biol Evol. 2013;30(12):2725–9.

66. Hasegawa M, Kishino H, Yano T-a. Dating of the human-ape splitting bya molecular clock of mitochondrial DNA. J Mol Evol. 1985;22(2):160–74.

67. Song S, Liu L, Edwards SV, Wu S. Resolving conflict in eutherian mammalphylogeny using phylogenomics and the multispecies coalescent model.Proc Nat Acad Sci. 2012;109(37):14942–7.

68. Mirarab S, Bayzid MS, Warnow T. Evaluating summary methods formultilocus species tree estimation in the presence of incomplete lineagesorting. Syst Biol. 2014;65(3):366–80.

69. Mirarab S, Bayzid MS, Boussau B, Warnow T. Statistical binning enablesan accurate coalescent-based estimation of the avian tree. Science.2014;346(6215):1250463.

70. Kingman JFC. The coalescent. Stoch Process Appl. 1982;13:235–48.71. Maddison WP. Gene trees in species trees. Syst Biol. 1997;46:523–36.72. Mirarab S, Warnow T. ASTRAL-II: Coalescent-based species tree

estimation with many hundreds of taxa and thousands of genes.Bioinformatics. 2015;31(12):44–52.

73. Xia X. Information-theoretic indices and an approximate significance testfor testing the molecular clock hypothesis with genetic distances. MolPhylogenet Evol. 2009;52(3):665–76.

Page 14: METHODOLOGYARTICLE OpenAccess … · 2020-07-20 · BhattacharjeeandBayzidBMCGenomics (2020) 21:497 Page2of14 manyofthesemethodsarenotscalabletoanalyzephy-logenomic datasets that

Bhattacharjee and Bayzid BMCGenomics (2020) 21:497 Page 14 of 14

74. Xia X. Rapid evolution of animal mitochondrial DNA. Rapidly EvolvingGenes Genet Syst. 201273–82. https://doi.org/10.1093/acprof:oso/9780199642274.003.0008.

75. Funk S. Netflix Update: Try This at Home. https://sifter.org/~simon/journal/20061211.html. Accessed 08 July 2019.

76. Ricci F, Rokach L, Shapira B. Introduction to recommender systemshandbook. In: Recommender Systems Handbook. Springer; 2011. p. 1–35.https://doi.org/10.1007/978-0-387-85820-3_1.

77. Terveen L, Hill W. Beyond recommender systems: Helping people helpeach other. HCI New Millennium. 2001;1(2001):487–509.

78. Linderman GC, Zhao J, Kluger Y. Zero-preserving imputation ofscrna-seq data using low-rank approximation. bioRxiv. 2018. https://doi.org/10.1101/397588.

79. Jiang B, Ma S, Causey J, Qiao L, Hardin MP, Bitts I, Johnson D, Zhang S,Huang X. SparRec: An effective matrix completion framework of missingdata imputation for GWAS. Sci Rep. 2016;6:35534.

80. Ma S, Johnson D, Ashby C, Xiong D, Cramer CL, Moore JH, Zhang S,Huang X. SPARCoC: A new framework for molecular pattern discoveryand cancer gene identification. PloS One. 2015;10(3):0117135.

81. Töscher A, Jahrer M. The bigchaos solution to the netflix prize 2008.Netflix Prize, Report. 2008.

82. Hinton GE, Salakhutdinov RR. Reducing the dimensionality of data withneural networks. Science. 2006;313(5786):504–7.

83. Ding MQ, Chen L, Cooper GF, Young JD, Lu X. Precision oncologybeyond targeted therapy: Combining omics data with machine learningmatches the majority of cancer cells to effective therapeutics. Mol CancerRes. 2018;16(2):269–278.

84. Chaudhary K, Poirion OB, Lu L, Garmire LX. Deep learning–basedmulti-omics integration robustly predicts survival in liver cancer. ClinCancer Res. 2018;24(6):1248–59.

85. Talwar D, Mongia A, Sengupta D, Majumdar A. AutoImpute:Autoencoder based imputation of single-cell RNA-seq data. Sci Rep.2018;8(1):16329.

86. Beaulieu-Jones BK, Moore JH. Missing data imputation in the electronichealth record using deeply learned autoencoders. In: Pacific Symposiumon Biocomputing 2017. Singapore: World Scientific; 2017. p. 207–218.

87. Gondara L, Wang K. Mida: Multiple imputation using denoisingautoencoders. In: Advances in Knowledge Discovery and Data Mining.Springer; 2018. p. 260–272. https://doi.org/10.1007/978-3-319-93040-4_21.

88. Rubinsteyn A. https://github.com/iskandr/fancyimpute. Accessed 08 July2019.

89. Hahnloser RH, Sarpeshkar R, Mahowald MA, Douglas RJ, Seung HS.Digital selection and analogue amplification coexist in a cortex-inspiredsilicon circuit. Nature. 2000;405(6789):947.

90. Han J, Moraga C. The influence of the sigmoid function parameters onthe speed of backpropagation learning. In: Lecture Notes in ComputerScience. Springer; 1995. p. 195–201. https://doi.org/10.1007/3-540-59497-3_175.

Publisher’s NoteSpringer Nature remains neutral with regard to jurisdictional claims inpublished maps and institutional affiliations.