A meta-analysis of computational biology benchmarks reveals predictors of programming accuracy

1. A meta-analysis of computational biology benchmarks reveals predictors of programming accuracy. Paul Gardner, University of Canterbury, Christchurch, New Zealand.

2. Hard work from...

3. ResBaz: I want to say a big thank you to the organisers of ResBaz and NeSI and Aleksandra and...! Everything you are about to see is built using tools you have learned at ResBaz... Warning: the following research is a work in progress; conclusions may change (after I've triple-checked data & claims).

4. Pretend we want to build a phylogenetic tree...

5. Building trees... Bioinformaticians are bad, impatient & intolerant people! Once you have gathered your data, you are faced with a problem... Parsimony (useful if we want to publish in Cladistics), 47 methods: ARB, FootPrinter, LVB, Parsimov, POY, Bionumerics, Freqpars, MALIGN, PAST, PRAP, BIRCH, Gambit, MEGA, PAUP*, PSODA, Bosque, GAPars, Mesquite, PAUPRat, RA, BPAnalysis, GelCompar-II, Murka, PaupUp, SeaView, CAFCA, GeneTree, Network, phangorn, SeqState, CRANN, gmaes, NimbleTree, PHYLIP, Simplot, DAMBE, Hennig86, NONA, PhyloNet, sog, EMBOSS, IDEA, Notung, Phylo_win, TCS, TNT. Source: Felsenstein, http://evolution.genetics.washington.edu/phylip/software.html

6. Building trees... Maximum likelihood, 97 methods: ALIFRITZ, EMBOSS, MOLPHY, PHYLLAB, rRNA-phylogeny, aLRT, EREM, MrAIC, PhyloCoCo, SeaView, ARB, fastDNAml, MrModeltest, Phylo_win, Segminator, Bio++, fastDNAmlRev, MrMTgui, PHYML, SEMPHY, Bionumerics, FASTML, MultiPhyl, PhyML-Multi, SeqPup, BIRCH, FastTree, NEPAL, PhyNav, SeqState, BootPHYML, GARLI, NHML, PHYSIG, SIMMAP, Bosque, GZ-Gamma, nhPhyML, PLATO, Simplot, CodeAxe, HY-PHY, NimbleTree, Porn*, SLR, CoMET, IQPNNI, p4, PRAP, Spectronet, Concaterpillar, Kakusan4, PAL, PROCOV, Spectrum, CONSEL, Leaphy, PAML, ProtTest, SplitsTree, Crux, Mac5, PARAT, PTP, SSA, DAMBE, McRate, PARBOOT, r8s-bootstrap, TipDate, DART, Mesquite, PASSML, Rate4Site, Treefinder, Darwin, MetaPIGA, PAUP*, rate-evolution, TREE-PUZZLE, dnarates, MixtureTree, PAUPRat, RAxML, Vanilla, DPRML, Modelfit, PaupUp, raxmlGUI, DT-ModSel, ModelGenerator, phangorn, RevDNArates.

7. Building trees... Bayesian methods, 28 methods: AMBIORE, BEST, IMa2, p4, SIMMAP, ANC-GENE, Bio++, Mesquite, PAL, tracer, BAli-Phy, bms_runner, MrBayes, PAML, Vanilla, BAMBE, burntrees, MrBayesPlugin, PHASE, BayesPhylogenies, Cadence, MrBayes-tree-scanners, PHYLLAB, BEAST, Crux, Multidivtime, PhyloBayes. Source: Felsenstein, http://evolution.genetics.washington.edu/phylip/software.html

8. How can we choose software? Which of the 172 methods do you use?

9. Can we trust the authors of software? We can read all the manuscripts & manuals describing 172 software packages. But...

10. How should we choose software? Some possibilities (assuming you don't create another method...): Do you know the developer? Are they famous? Select the most recently published tool? Has the software been widely adopted? Is it published in a good journal? Is the software fast? We could test the software...

11. Neutral comparison studies (a.k.a. benchmarks) A. The main focus of the article is the comparison itself. B. The authors should be reasonably neutral. C. The evaluation criteria, methods, and data sets should be chosen in a rational way.

12. Try approaching software like a scientist. Are any good controls available? Positive: databases, publications, simulation, ... Negative: randomized, select relevant negative data, ... Some common accuracy metrics: sensitivity (true positive rate), specificity (true negative rate), Matthews correlation coefficient, area under an ROC curve. [Figure: ROC curves (true positive rate vs. false positive rate) for Pfam, Treefam, Custom, PROVEAN, Polyphen2, FATHMM and FATHMM (unweighted).] Wheeler et al. (2016) A Profile-Based Method for Measuring the Impact of Genetic Variation. bioRxiv.
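
The first three metrics on this slide can all be computed from the four counts of a confusion matrix. The following is a minimal sketch under that assumption; the function name `accuracy_metrics` and the example counts are made up for illustration and do not come from any benchmark in this talk.

```python
import math


def accuracy_metrics(tp, fp, tn, fn):
    """Compute common accuracy metrics from confusion-matrix counts.

    tp/fn: positive controls that were found/missed.
    tn/fp: negative controls that were rejected/wrongly reported.
    """
    # Sensitivity (true positive rate): fraction of real positives recovered.
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    # Specificity (true negative rate): fraction of real negatives rejected.
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    # Matthews correlation coefficient: ranges from -1 to +1.
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, report 0.0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {"sensitivity": sensitivity, "specificity": specificity, "mcc": mcc}


if __name__ == "__main__":
    # Hypothetical counts for one tool on one control data set.
    print(accuracy_metrics(tp=90, fp=20, tn=80, fn=10))
```

The area under an ROC curve is left out because it needs the tool's scores at every threshold, not a single set of counts.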

13. Benchmarks are useful, and fun...

14. Tools can be slow and inaccurate! [Figure, panel A: sum of log odds scores, phylum level. Deviation vs. log2 of runtime (minutes) for CLARK, Kraken, OneCodex, LMAT, MGRAST, MetaPhlAn, mOTU, Genometa, QIIME, EBI, MetaPhyler, MEGAN, taxatortk and GOTTCHA; runtime annotations at ~30 mins, ~17 hrs and ~23 days.]

15. Is there really a relationship between speed & accuracy? Can we run a meta-analysis of bioinformatic benchmarks? What factors are predictive of accuracy? Training articles: initially 10 (historical knowledge). Candidate articles: ((bioinformatics) AND (algorithmic OR algorithms OR biotechnologies OR computational OR kernel OR methods OR procedure OR programs OR software OR technologies)) AND (accuracy OR analysis OR assessment OR benchmark OR benchmarking OR biases OR comparing OR comparison OR comparisons OR comprehensive OR effectiveness OR estimation OR evaluation OR metrics OR efficiency OR performance OR perspective OR quality OR rated OR robust OR strengths OR suitable OR suitability OR superior OR survey OR weaknesses) AND (benchmark OR competing OR complexity OR cputime OR duration OR fast OR faster OR perform OR performance OR slow OR speed OR time); 568,130 articles. Background articles: (bioinformatics [TIAB] 2013:2015 [dp]) #sorted on first author; 154,485 articles.

16. Hunting for relevant articles. After trying Abstrackr (& getting annoyed)... Workflow: training articles + background articles; remove high freq. words; compute word & di-word freqs; compute word scores, $lo(word) = \log_2 \frac{f_{training}(word) + \delta}{f_{background}(word) + \delta}$, where $\delta$ is a small pseudocount; score & rank candidate articles by $\sum_i lo(w_i)$; manually evaluate high scoring articles (yes/no); build model. Example word scores:
logOdds  tnFreq  bgFreq  word
 5.28    0.0019  0.0000  benchmarking
 5.21    0.0061  0.0002  benchmark
 4.91    0.0011  0.0000  noisy
 4.85    0.0022  0.0001  metrics
 4.85    0.0003  0.0000  encouragingly
 ...
-7.90    0.0000  0.0024  disease
-8.02    0.0000  0.0026  associated
-8.09    0.0000  0.0027  mirnas

17. Word and article scores. We can use the same scoring scheme for words that we use for scoring biological sequences: $logOdds(word) = \log_2 \frac{f_{training}(word) + \delta}{f_{background}(word) + \delta}$ and $articleScore = \sum_{word \in article} logOdds(word)$, where $\delta$ is a small pseudocount. [Figure: head & tail word scores; axis: word score (bits). Lowest-scoring words: expression, mirnas, associated, patients, binding, mirna, expressed, network, involved, regulated, levels, revealed, database, mutations, drug, response, tumor, system, activity, induced, ... Highest-scoring words: benchmarking, sequencers, benchtop, merits, correctness, benchmark, kernels, convolution, winner, supertree, structal, seeker, choosing, corpora, supermatrix, phenocopy, epistasis, segmod, encad, balibase.]
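
A minimal sketch of this scoring scheme, using only the Python standard library. Several details are assumptions made for illustration and are not taken from the slides: the whitespace tokenizer, the pseudocount value of `1e-5`, summing over every word occurrence in an article, and the function names. Removing high-frequency words and counting di-words are both left out to keep the sketch short.

```python
import math
from collections import Counter


def word_freqs(articles):
    """Relative frequency of each lower-cased word across a list of texts."""
    counts = Counter(word for text in articles for word in text.lower().split())
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def log_odds_scores(training, background, pseudocount=1e-5):
    """Score each word, in bits: positive means enriched in the training set."""
    f_training = word_freqs(training)
    f_background = word_freqs(background)
    vocabulary = set(f_training) | set(f_background)
    return {
        word: math.log2(
            (f_training.get(word, 0.0) + pseudocount)
            / (f_background.get(word, 0.0) + pseudocount)
        )
        for word in vocabulary
    }


def article_score(text, scores):
    """Sum the word scores over an article; unseen words contribute zero."""
    return sum(scores.get(word, 0.0) for word in text.lower().split())


if __name__ == "__main__":
    # Toy texts invented for this example.
    training = ["benchmark accuracy speed", "benchmark comparison"]
    background = ["disease expression", "disease patients associated"]
    scores = log_odds_scores(training, background)
    for title in ["a new benchmark", "a new disease"]:
        print(title, round(article_score(title, scores), 2))
```

The pseudocount keeps the ratio finite for words that appear in only one of the two sets, which is why such words get large but bounded scores.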

18. Iteratively checking articles... 1. Score and rank candidate articles 2. Check the highest scoring articles, add to either training or background articles 3. Return to 1.
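
The three steps can be sketched as a loop. This is an illustration under stated assumptions, not the script used for the talk: the manual check is stood in for by a hypothetical callable `is_relevant`, words are scored with a log-odds ratio using an assumed pseudocount of `1e-5`, and `batch_size` and `max_rounds` are invented parameters.

```python
import math
from collections import Counter


def log_odds(training, background, pseudocount=1e-5):
    """Word scores in bits: log2 ratio of training to background frequency."""
    def freqs(docs):
        counts = Counter(w for d in docs for w in d.lower().split())
        total = sum(counts.values()) or 1
        return {w: c / total for w, c in counts.items()}

    f_tr, f_bg = freqs(training), freqs(background)
    return {
        w: math.log2((f_tr.get(w, 0.0) + pseudocount) / (f_bg.get(w, 0.0) + pseudocount))
        for w in set(f_tr) | set(f_bg)
    }


def iterative_screen(training, background, candidates, is_relevant,
                     batch_size=5, max_rounds=10):
    """Repeat: score and rank candidates, check the top batch, update the sets."""
    training, background = list(training), list(background)
    remaining, found = list(candidates), []
    for _ in range(max_rounds):
        if not remaining:
            break
        # 1. Score and rank the candidate articles with the current word scores.
        scores = log_odds(training, background)
        remaining.sort(
            key=lambda doc: sum(scores.get(w, 0.0) for w in doc.lower().split()),
            reverse=True,
        )
        batch, remaining = remaining[:batch_size], remaining[batch_size:]
        # 2. Check the highest scoring articles; add each to training or background.
        for doc in batch:
            if is_relevant(doc):
                training.append(doc)
                found.append(doc)
            else:
                background.append(doc)
        # 3. Return to 1: the next round re-scores with the updated sets.
    return found, training, background
```

In practice `is_relevant` would be a person reading the abstract; passing it in as a function just makes the loop easy to try on toy data.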

19. So far we have... found 35 matching articles. Manually extracted ranks, IF, H, ...: 84 benchmarks (method accuracies and speeds); 203 bioinformatic methods; 63 journals (47 Bioinformatics, 17 BMC Bioinformatics, ...); 124 author GoogleScholar profiles. Methods: abyss, bwasw, dialigntx, gossamer, mafftfftns2, mpest, paralign, repeatfinder, seqmap, ssake, velvet, antepiseeker, caml, diffsplice, gottcha, mafftlinsi, mpjclustalw, pass, repeatgluer, sga, ssap, wmrpmp, apg, camp, diginormvelvet, greedyft, maq, mpsclustalw, perm, repeatscout, sharcgs, ssearch, woodhams, barry, ce, dima, gsnap, mats, mrfast, phylonetft, rmap, shrimp, ssm, wublast, bfast, celera, djigsaw, heidge, megan, mrpml, piler, rnacofold, simulatedannealing, sst, xalign, bismark, clark, downhillsimplex, hmmer, metaphlan, mrpmp, poa, rnaduplex, sl, st, xcmswithcorrection, biss, clc, dsgseq, idbaud, metaphyler, mrsfast, poy, rnahybrid, smalt, starbeast, xcmswithoutretentiontime, boost, clustalomega, ebi, igtpduplossft, mgrast, msinspect, poystar, rnaplex, snap, strcutal, zema, bowtie, clustalw, edenanonstrict, inchworm, minia, multalin, pragcz, rnaup, snpruler, swissmodel, bowtie2, comus, edenastrict, infernal, mira, muscle, probalign, rsearch, soap, taipan, bratbw, coprarna, edit, intarna, mlclustalw, musclemaxiters, probcons, rsmatch, soap2, targetrna, bsmap, cosine, epimode, kalign, mlclustalwquicktree, mzmine, probtree, sam, soapdenovo, targetrna2, bsseeker, cro, erpin, kbsps, mlmafft, ncbiblast, pso, sate, spades, taxatortk, buckycon, cufflinks, fa, kraken, mlmafftparttree, nest, pt, scro, sparse, tcoffee, buckymrbayes, dali, fasta, kthse, mlmuscle, newbler, qiime, scwrl, sparseassembler, team, buckymrbayesspa, de, fasttree, leidnl, mlopal, novoalign, qsra, scwrlcons, spcomp, tmap, buckypop, dexseq, gassst, lmat, mlprankgt, oases, ravenna, segemehl, specarray, transabyss, buckyraxml, dialign, genometa, lsqman, modellerv, onecodex, raxml, segmodencad, spt, trinity, builder, dialign22, gojobori, mafft, mosaik, openms, raxmllimited, seqgsea, srmapper, upmes, bwa, dialignt, goldman, mafftfftns, motu, pairfold, rdiffparam, seqman, ssaha, vcake.

20. Possible predictors of accuracy... [Figure: frequency histograms of each candidate predictor: number of citations (#citations), journal impact factor (journal.IF), journal H5 index from GoogleScholar (journal.H5), corresponding author's H-index (author.H), corresponding author's M-index (author.M), and relative age.]

21. I have found no *significant* predictors of accuracy! Z = 1.52; p = 0.94. [Figure: correlations with accuracy rank (Spearman's rho) for author.M, author.H, journal.H5, relative age, speed, #citations and journal.IF.] [Figure: Accuracy vs. Speed: mean normalised accuracy rank against mean normalised speed rank, with quadrants fast+accurate, fast+inaccurate, slow+inaccurate and slow+accurate; * = hi profile journal; o = hi profile author; x = hi cited.]
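
The axes in the accuracy vs. speed plot are mean normalised ranks. One plausible way to compute such a value is sketched below. The exact normalisation used for the talk is not stated on the slides, so this assumes that each benchmark gives an ordered list of methods (best first), that a rank is scaled to the range 0 (best) to 1 (worst), and that a method's scaled ranks are averaged over all benchmarks it appears in. The rankings in the example are toy data.

```python
from collections import defaultdict


def mean_normalised_ranks(benchmarks):
    """Average each method's normalised rank over the benchmarks it appears in.

    benchmarks maps a benchmark name to a list of methods ordered best first.
    Within a benchmark of n methods, position 0 maps to 0.0 and n - 1 to 1.0.
    """
    per_method = defaultdict(list)
    for ranking in benchmarks.values():
        n = len(ranking)
        for position, method in enumerate(ranking):
            # A benchmark with a single method carries no ranking information.
            normalised = position / (n - 1) if n > 1 else 0.0
            per_method[method].append(normalised)
    return {method: sum(ranks) / len(ranks) for method, ranks in per_method.items()}


if __name__ == "__main__":
    # Invented rankings, for illustration only.
    toy = {
        "benchmark_1": ["muscle", "mafft", "clustalw"],
        "benchmark_2": ["mafft", "muscle"],
    }
    print(mean_normalised_ranks(toy))
```

Normalising by the number of methods makes a rank of 3rd out of 3 comparable with 10th out of 10, which a raw rank would not be.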

22. IF & #citations. IF: Spearman's ρ = 0.104; p-value = 0.20. #cites: Spearman's ρ = 0.101; p-value = 0.18. [Figure: Accuracy vs. IF: mean normalised accuracy rank against journal impact factor (log scale, 0.5 to 50).] [Figure: Accuracy vs. #citations: mean normalised accuracy rank against number of citations (log scale, 1 to 100,000).]
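
Spearman's correlation is the Pearson correlation of the ranks of the two variables. A minimal standard-library sketch follows; tied values get the average of their ranks, and p-values are not computed here. The function names are chosen for this example, and the numbers in the usage lines are invented, not the data behind the slide.

```python
import math


def average_ranks(values):
    """1-based ranks of the values; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient; nan if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        return float("nan")
    return covariance / (spread_x * spread_y)


def spearman_rho(x, y):
    """Spearman's rank correlation between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length, at least 2 long")
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Invented values: a predictor against an accuracy rank.
    predictor = [1.2, 4.5, 2.7, 9.8, 3.3]
    accuracy_rank = [0.9, 0.4, 0.6, 0.5, 0.1]
    print(spearman_rho(predictor, accuracy_rank))
```

Working on ranks is what makes the measure suitable for predictors such as citation counts, which span several orders of magnitude.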

23. Conclusions: Nothing appears to be predictive of accuracy¹. Fast software undergoes more developmental iterations. Can heuristic approaches produce a better result than mathematically complete approaches? It doesn't appear to matter how famous you are, the journals you publish in, whether you're early or late, or how often your work is cited: you can still write great software! ¹ There is still a chance I have screwed something up...

24. Thanks: Stephanie McGimpsey, Fatemeh Ashari Ghomi, Sinan Uğur Umu. Funded by: Rutherford Discovery Fellowship, BPRC and Biological Heritage: National Science Challenge.