Click here to load reader

BMC Genomics BioMed Central - · PDF fileBioMed Central Page 1 of 19 (page number not for citation purposes) BMC Genomics ... Biology, University of Florida, PO Box 118526, Gainesville,

  • View

  • Download

Embed Size (px)

Text of BMC Genomics BioMed Central - · PDF fileBioMed Central Page 1 of 19 (page number not for...

  • BioMed CentralBMC Genomics


    Open AcceMethodology articleComparison of next generation sequencing technologies for transcriptome characterizationP Kerr Wall1, Jim Leebens-Mack2, Andr S Chanderbali3, Abdelali Barakat4, Erik Wolcott1, Haiying Liang4, Lena Landherr1, Lynn P Tomsho5, Yi Hu1, John E Carlson4, Hong Ma1, Stephan C Schuster5, Douglas E Soltis3, Pamela S Soltis6, Naomi Altman7 and Claude W dePamphilis*1

    Address: 1Department of Biology, Institute of Molecular Evolutionary Genetics, and The Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, PA 16802, USA, 2Department of Plant Biology, University of Georgia, Athens, GA 30602, USA, 3Department of Biology, University of Florida, PO Box 118526, Gainesville, FL, 32611, USA, 4The School of Forest Resources, Department of Horticulture, and Huck Institutes of the Life Sciences, Pennsylvania State University, 323 Forest Resources Building, University Park, PA 16802, USA, 5Center for Comparative Genomics, Center for Infectious Disease Dynamics, The Pennsylvania State University, University Park, PA 16802, USA, 6Florida Museum of Natural History, University of Florida, P.O. Box 117800, Gainesville, FL, 32611, USA and 7Department of Statistics and The Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, PA 16802, USA

    Email: P Kerr Wall - [email protected]; Jim Leebens-Mack - [email protected]; Andr S Chanderbali - [email protected]; Abdelali Barakat - [email protected]; Erik Wolcott - [email protected]; Haiying Liang - [email protected]; Lena Landherr - [email protected]; Lynn P Tomsho - [email protected]; Yi Hu - [email protected]; John E Carlson - [email protected]; Hong Ma - [email protected]; Stephan C Schuster - [email protected]; Douglas E Soltis - [email protected]; Pamela S Soltis - [email protected]; Naomi Altman - [email protected]; Claude W dePamphilis* - [email protected]

    * Corresponding author

    AbstractBackground: We have developed a simulation approach to help determine the optimal mixtureof sequencing methods for most complete and cost effective transcriptome sequencing. Wecompared simulation results for traditional capillary sequencing with "Next Generation" (NG) ultrahigh-throughput technologies. The simulation model was parameterized using mappings of 130,000cDNA sequence reads to the Arabidopsis genome (NCBI Accession SRA008180.19). We alsogenerated 454-GS20 sequences and de novo assemblies for the basal eudicot California poppy(Eschscholzia californica) and the magnoliid avocado (Persea americana) using a variety of methodsfor cDNA synthesis.

    Results: The Arabidopsis reads tagged more than 15,000 genes, including new splice variants andextended UTR regions. Of the total 134,791 reads (13.8 MB), 119,518 (88.7%) mapped exactly toknown exons, while 1,117 (0.8%) mapped to introns, 11,524 (8.6%) spanned annotated intron/exonboundaries, and 3,066 (2.3%) extended beyond the end of annotated UTRs. Sequence-basedinference of relative gene expression levels correlated significantly with microarray data. Asexpected, NG sequencing of normalized libraries tagged more genes than non-normalized libraries,although non-normalized libraries yielded more full-length cDNA sequences. The Arabidopsis datawere used to simulate additional rounds of NG and traditional EST sequencing, and variouscombinations of each. Our simulations suggest a combination of FLX and Solexa sequencing foroptimal transcriptome coverage at modest cost. We have also developed ESTcalc http://

    Published: 1 August 2009

    BMC Genomics 2009, 10:347 doi:10.1186/1471-2164-10-347

    Received: 1 August 2008Accepted: 1 August 2009

    This article is available from:

    2009 Wall et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

    Page 1 of 19(page number not for citation purposes)

  • BMC Genomics 2009, 10:347, an online webtool, which allows users to explore the resultsof this study by specifying individualized costs and sequencing characteristics.

    Conclusion: NG sequencing technologies are a highly flexible set of platforms that can be scaledto suit different project goals. In terms of sequence coverage alone, the NG sequencing is adramatic advance over capillary-based sequencing, but NG sequencing also presents significantchallenges in assembly and sequence accuracy due to short read lengths, method-specificsequencing errors, and the absence of physical clones. These problems may be overcome by hybridsequencing strategies using a mixture of sequencing methodologies, by new assemblers, and bysequencing more deeply. Sequencing and microarray outcomes from multiple experiments suggestthat our simulator will be useful for guiding NG transcriptome sequencing projects in a wide rangeof organisms.

    BackgroundSequencing technology has made great advances over thelast 30 years since the development of chain-terminatinginhibitor-based technologies [1]. Traditional sequencingapproaches require cloning of DNA fragments into bacte-rial vectors for amplification and sequencing of individualtemplates using vector-based primers. This approach wasadapted for cDNA libraries [2] and, with the advent ofcapillary sequencing, became suitable for high-through-put sequencing of large samples of transcripts, termedExpressed Sequence Tags (ESTs). ESTs have become aninvaluable resource for gene discovery, genome annota-tion, alternative splicing, SNP discovery, molecular mark-ers for population analysis, and expression analysis inanimal, plant, and microbial species [3]. Otherapproaches for analyzing transcriptomes include serialanalysis of gene expression (SAGE) [4], massively parallelsignature sequencing (MPSS) [5], and microarrays [6,7].These approaches, which involve the sequencing orhybridizing of small concatamers of cDNA derived frommRNA by reverse transcription, have been used success-fully in analyzing the expression of genomes (transcrip-tomes) at a very large scale, usually from species with asequenced genome or an existing and extensive EST dataset. Although several alternatives have been describedsince the emergence of EST sequencing projects, none hasyet totally supplanted the use of bacterial vectors andSanger sequencing.

    In 2005, two new sequencing technologies were intro-duced both based on sequencing by synthesis, whichpromised to replace or enhance traditional sequencingmethods. The 454 system, usingpyrosequencing technology [8], and the Solexa system, which detects fluorescence sig-nals [9]. Both execute millions of sequencing reactions inparallel, producing data at ultrahigh rates [10]. Althoughread lengths are much shorter with these new methodsthan with capillary sequencing (averaging 100230 bpand 300400 bp for 454FLX and 454Titanium, respec-

    tively, and 35 to up to 76 b for Illumina Solexa platforms),respectively, both platforms generate sufficient data tocompletely re-sequence bacterial genomes in a single run[8,11-13]. In the past year, Applied Biosystems has intro-duced their SOLiD sequencer, another short-read 2035 bp platform, withread lengths anticipated to be 50 bp in the upcomingSOLiD3 release. The three platforms offer a variety ofexperimental approaches for characterizing a transcrip-tome, including single-end and paired-end cDNAsequencing, tag profiling (3' end sequencing especiallyappropriate to estimating expression level), methylationassays, small RNA sequencing, sample tagging ("barcod-ing") to permit small subsample identification, and splicevariant analyses. Several challenges face investigators hop-ing to use these methods, including the relatively largecost of most NG experiments and intense demands fordata storage and analysis on the scale required for NGdatasets, and rapidly evolving technologies. Initial studiesreported success with 454 sequencing of chloroplastgenomes [14,15], small RNAs [16-19], and transcrip-tomes of organisms with [20-22] or without [23] exten-sive genomic sequence information.

    These Next Generation (NG) sequencing methods prom-ise a cost-effective means of either deeply sampling orfully sequencing an organism's transcriptome, with evensmall experiments tagging a very large number ofexpressed genes. However, prior transcriptome sequenc-ing studies have been largely exploratory, only hinting atthe potential for NG transcriptome sequencing at differ-ent scales. There is a great need for quantitative studiesand analysis tools that help investigators optimally designNG sequencing experiments to address specific goals.

    A complete solution to this problem would involve realis-tic models for each technology, accounting for the cost oflibrary generation and data collection, the characteristicsof cDNA libraries, transcript abundance distributions,read length distributions, and the error rates in sequence

    Page 2 of 19(page number not for citation purposes)

  • BMC Genomics 2009, 10:347


Search related