STORI: Selectable Taxon Ortholog Retrieval Iteratively


DESCRIPTION

Stern, J. G. (2013). STORI: Selectable Taxon Ortholog Retrieval Iteratively. Master's Thesis, Georgia Institute of Technology.

Speciation and gene duplication are fundamental evolutionary processes that enable biological innovation. For over a decade, biologists have endeavored to distinguish orthology (homology caused by speciation) from paralogy (homology caused by duplication). Disentangling orthology and paralogy is useful to diverse fields such as phylogenetics, protein engineering, and genome content comparison.

A common step in ortholog detection is the computation of Bidirectional Best Hits (BBH). However, we found this computation impractical for more than 24 Eukaryotic proteomes. Attempting to retrieve orthologs in less time than previous methods require, we developed a novel algorithm and implemented it as a suite of Perl scripts. This software, Selectable Taxon Ortholog Retrieval Iteratively (STORI), retrieves orthologous protein sequences for a set of user-defined proteomes and query sequences. While the time complexity of the BBH method is O(#taxa²), we found that the average CPU time used by STORI may increase linearly with the number of taxa.

To demonstrate one aspect of STORI’s usefulness, we used this software to infer the orthologous sequences of 26 ribosomal proteins (rProteins) from the large ribosomal subunit (LSU), for a set of 115 Bacterial and 94 Archaeal proteomes. Next, we used established tree-search methods to seek the most probable evolutionary explanation of these data. The current implementation of STORI runs on Red Hat Enterprise Linux 6.0 with installations of Moab 5.3.7, Perl 5 and several Perl modules. STORI is available at: .


  • STORI: SELECTABLE TAXON ORTHOLOG RETRIEVAL

    ITERATIVELY

    A Thesis Presented to

    The Academic Faculty

    by

    Joshua Gallant Stern

    In Partial Fulfillment of the Requirements for the Degree

    Master of Science in the School of Biology

    Georgia Institute of Technology December 2013

  • STORI: SELECTABLE TAXON ORTHOLOG RETRIEVAL

    ITERATIVELY

    Approved by:

    Dr. Eric A. Gaucher, Advisor, School of Biology, Georgia Institute of Technology

    Dr. Brian K. Hammer, School of Biology, Georgia Institute of Technology

    Dr. I. King Jordan, School of Biology, Georgia Institute of Technology

    Dr. Terry W. Snell, School of Biology, Georgia Institute of Technology

    Dr. Christine M. Dunham, Department of Biochemistry, Emory University

    Date Approved: August 5, 2013

  • iii

    ACKNOWLEDGEMENTS

    I am grateful to my thesis advisor, Prof. Eric A. Gaucher, for his mentoring,

    support and attention. I am thankful for the guidance and support of my Committee. The

    Georgia Institute of Technology Presidential Fellowship and the National Aeronautics

    and Space Administration's Earth and Space Science Fellowship (grant NNX10AT21H)

    supported the work described in this Thesis. This work also used the Georgia Institute of

    Technology's Partnership for an Advanced Computing Environment (PACE), and the

    National Science Foundation's Extreme Science and Engineering Discovery Environment

    (XSEDE TG-MCB120032). I also thank my friends and family for their love and support.

  • iv

    TABLE OF CONTENTS

    Page

    ACKNOWLEDGEMENTS iii

    LIST OF TABLES vi

    LIST OF FIGURES vii

    SUMMARY viii

    CHAPTER

    1 INTRODUCTION 1

    2 MATERIAL AND METHODS 7

    STORI algorithm 7

    Retrieving Sequences 16

    Tree Inference 17

    3 RESULTS 21

    Benchmarking Compute Time 21

    Benchmarking Accuracy 23

    The most probable history of the Bacterial 50S, given its sequences 25

    4 DISCUSSION 29

    APPENDIX A: SELECTABLE TAXON ORTHOLOG RETRIEVAL ITERATIVELY (STORI) USER'S GUIDE 32

    APPENDIX B: SEQUENCE ACCESSIONS 39

    APPENDIX C: TAXA SUBSETS AND MULTIPLE SEQUENCE ALIGNMENTS 56

    APPENDIX D: PAML & PHASE OPTIMIZED TOPOLOGIES 57

    APPENDIX E: THE MOST LIKELY MODEL OF BACTERIAL AND ARCHAEAL HISTORY 75

    APPENDIX F: BLAST SEARCHES IN ONE ITERATION OF STORI 76

  • v

    APPENDIX G: REFERENCE SET BUILDING PROCEDURE 77

    REFERENCES 78

  • vi

    LIST OF TABLES

    Page

    Table 1: Model selection statistics for BBH CPU time data 23

    Table 2: Model selection statistics for STORI retrieval convergence time data 23

    Table 3: Approximately Unbiased p-values for phylogenetic model selection 27

  • vii

    LIST OF FIGURES

    Page

    Figure 1: CPU time to Bidirectional Best Hits vs. size of taxa set 3

    Figure 2: CPU time to STORI retrieval convergence vs. size of taxa set 4

    Figure 3: Schematic of STORI algorithm, found in STORI.pl 8

    Figure 4: Algorithmic flow diagrams of STORI "front and middle ends" 15

    Figure 5: Alternative topological models of Bacterial phylogeny 20

    Figure 6: Accuracy of families retrieved by each replicate run 24

    Figure 7: Likelihood-optimized phylogeny of Bacteria and Archaea 26

  • viii

    SUMMARY

    Speciation and gene duplication are fundamental evolutionary processes that

    enable biological innovation. For over a decade, biologists have endeavored to

    distinguish orthology (homology caused by speciation) from paralogy (homology caused

    by duplication). Disentangling orthology and paralogy is useful to diverse fields such as

    phylogenetics, protein engineering, and genome content comparison.

    A common step in ortholog detection is the computation of Bidirectional Best

    Hits (BBH). However, we found this computation impractical for more than 24

    Eukaryotic proteomes. Attempting to retrieve orthologs in less time than previous

    methods require, we developed a novel algorithm and implemented it as a suite of Perl

    scripts. This software, Selectable Taxon Ortholog Retrieval Iteratively (STORI), retrieves

    orthologous protein sequences for a set of user-defined proteomes and query sequences.

    While the time complexity of the BBH method is O(#taxa²), we found that the average

    CPU time used by STORI may increase linearly with the number of taxa.

    To demonstrate one aspect of STORI's usefulness, we used this software to infer

    the orthologous sequences of 26 ribosomal proteins (rProteins) from the large ribosomal

    subunit (LSU), for a set of 115 Bacterial and 94 Archaeal proteomes. Next, we used

    established tree-search methods to seek the most probable evolutionary explanation of

    these data. The current implementation of STORI runs on Red Hat Enterprise Linux 6.0

    with installations of Moab 5.3.7, Perl 5 and several Perl modules. STORI is available at:

    .

  • 1

    CHAPTER 1

    INTRODUCTION

    Evolutionary biology owes four decades of progress to shared protein and nucleic

    acid sequence data (Fuchs & Cameron, 1991; Strasser, 2010; Mushegian, 2011). For

    example, comparisons of ribosomal RNA gene sequences shared by multiple laboratories

    studying diverse organisms led to the discovery of Archaea (Woese & Fox, 1977).

    Follow-up work revealed evolutionary histories specific to the Archaeal, Eukaryal, and

    Bacterial spaces of life's tree (Battistuzzi & Hedges, 2009; Gribaldo & Brochier, 2009;

    Yoon et al., 2008). Phylogenetic study is useful for exploring the thermostability of

    Earth's earliest biomolecules (Gaucher et al., 2008); understanding the invention of novel

    receptor specificities (Bridgham et al., 2011); and providing context to antimicrobial drug

    resistance (Dridi et al., 2009).

    Although initial phylogenies represented single gene families (typically, the gene

    for the ribosome's small subunit RNA), increasing availability of whole genomes allowed

    follow-up studies to use multiple protein families as phylogenetic markers (Pupko et al.,

    2002) yielding reconstructions with improved accuracy (Rokas et al., 2003). Increased

    taxonomic and genomic sampling (e.g., Wu et al., 2009; Lang et al., 2013) improves

    phylogenetic accuracy (Nabhan & Sarkar, 2011).

    Another means of improving phylogenetic accuracy is to use orthologous (rather

    than paralogous) genes when generating a multiple alignment of the sequence data (Cao

    et al., 2000; Philippe et al., 2011). Orthologous genes share a common ancestor because

    of speciation, and are distinct from paralogous genes, which result from gene duplication

    (Kristensen et al., 2011).

    Popular sources of phylogenetic data include the ribosomal protein (rProtein)

    genes, notwithstanding their low abundance relative to all prokaryotic gene families

  • 2

    (Dagan & Martin, 2006). The ribosome's essential role of translating information to

    function (Fox & Naik, 2004) deters gene loss (Makarova et al., 2001) or horizontal

    transfer (Sorek et al., 2007). Lecompte et al. (2002) exemplifies the standard method of

    rProtein retrieval: seed sequences from curated proteomes are selected, and used as

    queries in BLASTP (Altschul et al., 1997) similarity searches of additional proteome

    databases. Although tedious, this process is straightforward for Bacteria and Archaea. We

    found that Eukaryotic retrievals are complicated when ostensibly orthologous query

    sequences to the same target proteome produce different best hits (J. G. S. & E. A. G.,

    unpublished data).
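The complication described above can be illustrated with a minimal sketch. It assumes that each seed query has already been searched against one target proteome and that the hits are available as (subject ID, bit score) pairs. The function names, sequence identifiers, and scores below are hypothetical and are not taken from STORI or from any BLAST output.

```python
from collections import Counter

def best_hit(hits):
    """Return the subject ID with the highest score, or None if there are no hits."""
    if not hits:
        return None
    return max(hits, key=lambda hit: hit[1])[0]

def best_hit_tally(hits_by_query):
    """Count how many seed queries choose each target sequence as their best hit."""
    tally = Counter()
    for hits in hits_by_query.values():
        top = best_hit(hits)
        if top is not None:
            tally[top] += 1
    return tally

# Hypothetical scores: three seeds from one presumed family, one target proteome.
hits_by_query = {
    "seed_A": [("target_p1", 210.0), ("target_p2", 95.0)],
    "seed_B": [("target_p1", 180.0), ("target_p2", 120.0)],
    "seed_C": [("target_p2", 150.0), ("target_p1", 140.0)],
}

tally = best_hit_tally(hits_by_query)
is_unanimous = len(tally) == 1  # False here: the seeds disagree on the best hit
```

When the tally contains more than one target sequence, the seeds disagree, and a manual retrieval would need some rule (or human judgment) to choose among the candidates.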

    As an alternative to manual sequence retrieval, one can employ a variety of

    ortholog retrieval services (Zhou & Landweber, 2007; Chen et al., 2007; Kuzniar et al.,

    2008; Schmitt et al., 2011; Powell et al., 2011). These services usually depend on large

    databases of pre-computed information. For example, the Clusters of Orthologous Groups

    algorithm (Tatusov et al., 1997) requires computation of Bidirectional Best Hits (BBH),

    by storing the results of a BLAST search of every sequence of every proteome against a

    database of all other sequences.
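As a rough illustration of the BBH idea for a single pair of proteomes, the sketch below keeps only those sequence pairs in which each member is the other's top-scoring hit. It is a simplified sketch under stated assumptions, not the Clusters of Orthologous Groups implementation: the score dictionaries, sequence names, and function names are hypothetical, and ties are resolved by whichever hit is encountered first.

```python
def best_hits(scores):
    """Map each query to its top-scoring subject.

    scores: dict mapping (query, subject) -> similarity score.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}

def bidirectional_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs in which a's best hit is b and b's best hit is a."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)

# Hypothetical scores for proteome A searched against B, and B against A.
a_vs_b = {("a1", "b1"): 300, ("a1", "b2"): 80,
          ("a2", "b1"): 120, ("a2", "b2"): 90}
b_vs_a = {("b1", "a1"): 290, ("b1", "a2"): 110,
          ("b2", "a2"): 95, ("b2", "a1"): 70}

pairs = bidirectional_best_hits(a_vs_b, b_vs_a)
```

In this toy example a2's best hit is b1, but b1's best hit is a1, so only the pair (a1, b1) is reciprocal. Both search directions must be computed and stored for every pair of proteomes before the reciprocity test can be applied.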

    On our Department's shared compute cluster (730 AMD Opteron CPUs; six

    cores/CPU at 2400 MHz), we found it impractical to perform BBH computation for

    more than 24 Eukaryotic proteomes. In Figure 1, we show the effect of increased taxon

    sampling on the CPU time to find best hits of every protein sequence in the sample. The

    first part of the present study demonstrates a novel algorithm that may reduce the

    ortholog retrieval time for taxa sets containing >90 Bacteria, Archaea, or Eukaryotes

    (Figure 2). This algorithm is a new and potentially faster way to understand the data

    deluge created by DNA and protein sequencing technologies.
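The quadratic growth of an all-against-all comparison can be made concrete by counting ordered proteome pairs, since each proteome must be searched against every other proteome. The sketch below is only a counting argument: it assumes proteomes of comparable size, ignores per-search cost, and uses a hypothetical function name. It does not model STORI's own running time.

```python
def ordered_proteome_pairs(n_taxa):
    """Number of (query proteome, target proteome) searches in an all-vs-all comparison."""
    return n_taxa * (n_taxa - 1)

# Doubling the number of taxa roughly quadruples the number of proteome-level searches.
counts = {n: ordered_proteome_pairs(n) for n in (24, 48, 96)}
growth = counts[48] / counts[24]
```

For 24 taxa this gives 552 ordered pairs, and for 96 taxa it gives 9120, which is consistent with the O(#taxa²) behavior of BBH computation described in this chapter.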

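  • 3

    [Figure 1: CPU time to Bidirectional Best Hits vs. size of taxa set]

  • 4

    [Figure 2: CPU time to STORI retrieval convergence vs. size of taxa set]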

    The microprocessor-fueled boom in molecular data has accompanied increasingly

    sophisticated methods of modeling evolutionary history. Felsenstein (2004) provides an

    illuminating account of phylogenetic development, from the first taxonomy inferred

    using numerical methods (Michener & Sokal, 1957) to contemporary Bayesian Inference

    and Maximum Likelihood algorithms (Yang & Rannala, 2012). We summarize the

    typical practice presently. Given a hypothesis, or model, of character rep