Developing a flexible platform for high-throughput phylogenomics: Case study, conclusions and lessons for the future



Joe Parker, Georgia Tsagkogeorga, James A. Cotton and Stephen J. Rossiter
Queen Mary University London

In case you hadn't noticed...
  "Recent advances in next-generation sequencing (NGS) technologies now allow us to [...] in more detail than ever before" - Every Grant Application Ever

Lab Interests
  Ecology and evolution of traits
  Echolocation, sociality
  NGS data for population genetics and phylogenomics

The task
  Phylogeny estimation/comparison
  Molecular correlates of evolution: site substitutions, dN/dS, composition
  Simulation
  Dataset limitations
  (R-L): Joe Parker; Georgia Tsagkogeorga; Kalina Davies; Steve Rossiter; Xiuguang Mao; Seb Bailey

The parameters
  De novo genomes: four taxa
  2,321 protein-coding loci
  801,301 codons
  Published: 18 genomes
  ~69,000 simulated datasets
  ~3,500 cluster cores

Development cycle
  Design
  Wireframe & specify tests
  Implement
    Alignment: loadSequences(), getSubstitutions()
    Phylogeny: trimTaxa(), getMRCA()
    DataSeries: calculateECDF(), randomise()
    Regression: getResiduals(), predictInterval()
  Review, refine & refactor

Serialisation
  Process data remotely
  Freeze-dry objects, download to desktop
  Implement new methods directly on previously-analysed data

Conclusions
  Junichuro Ayoyama

Distributions
  Genome-scale data provides context
  Identify outliers: genes / taxa / trees
  Compare values across biological systems

Parameter investigation
  Multiple configurations
  Hyperparameters empirically investigated
  Determine sensitivity of results

Lessons
  Well-defined research questions:
    Find the best tree
    Estimate dN/dS
  Questions arise from data:
    How many genes have at least k substitutions in k or more taxa?
  Data-hypothesis-analysis cycle implies feature creep
  Use of available databases, e.g. ontology; orthology; expression
  Sequence reads = observations

Unlimited flexibility, finite time
  Development trade-off: off-the-shelf vs. bespoke

Exploratory work
  Real-time genomic transects?
  Essential fundamental data missing from nearly every system: diversity; structure; substitution rates; dN/dS; recombination; dispersal; lateral transfer

Thanks
  Steve Rossiter (1), James Cotton (2), Elia Stupka (3) & Georgia Tsagkogeorga (1)
    (1) School of Biological and Chemical Sciences, Queen Mary, University of London
    (2) Wellcome Trust Sanger Institute
    (3) Center for Translational Genomics and Bioinformatics, San Raffaele Institute, Milan
  Chris Walker & Dan Traynor, Queen Mary GridPP High-throughput Cluster
  Chaz Mein & Anna Terry, Barts and The London Genome Centre
  Mahesh Pancholi, School of Biological and Chemical Sciences
  BBSRC (UK); Queen Mary, University of London
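
Appendix sketch: DataSeries, ECDF and serialisation

The "Development cycle", "Serialisation" and "Distributions" slides describe a DataSeries class with a calculateECDF() method, objects that are "freeze-dried" for download, and new methods applied to previously-analysed data to identify outliers. The slides do not show the implementation or name its language, so what follows is only a minimal Python sketch under stated assumptions: the snake_case method names, the choice of JSON as the serialisation format, the to_json()/from_json() helpers, the upper_tail_outliers() function, its 0.95 threshold and the example values are all hypothetical and are not taken from the talk.

```python
import json
from bisect import bisect_right


class DataSeries:
    """Hypothetical stand-in for the DataSeries class named on the slides."""

    def __init__(self, values):
        # Keep a sorted copy so each ECDF look-up is a binary search.
        self.values = sorted(float(v) for v in values)

    def calculate_ecdf(self, x):
        """Empirical CDF: fraction of observations less than or equal to x."""
        if not self.values:
            raise ValueError("cannot compute an ECDF for an empty series")
        return bisect_right(self.values, x) / len(self.values)

    def to_json(self):
        """'Freeze-dry' the series to a string (assumed format: JSON)."""
        return json.dumps({"values": self.values})

    @classmethod
    def from_json(cls, text):
        """Restore a series that was serialised with to_json()."""
        return cls(json.loads(text)["values"])


def upper_tail_outliers(series, threshold=0.95):
    """A 'new method' written after the analysis: values in the upper ECDF tail."""
    return [v for v in series.values if series.calculate_ecdf(v) > threshold]


# Illustrative per-gene values only; these are not data from the talk.
series = DataSeries([0.1, 0.2, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 5.0])

# Serialise (as if on the cluster), then restore (as if on the desktop).
restored = DataSeries.from_json(series.to_json())

print(restored.calculate_ecdf(0.2))   # fraction of values <= 0.2
print(upper_tail_outliers(restored))  # values above the 95th percentile
```

The point of the round trip is that upper_tail_outliers() can be written after the expensive analysis has finished and still run on the restored object, which is one reading of "implement new methods directly on previously-analysed data".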