mRNAmarkup: quality control and annotation of de novo transcriptome assemblies

Volker P. Brendel 1,2, Daniel S. Standage 1,3

1 Department of Biology, Indiana University, Bloomington, IN 47405
2 School of Informatics and Computing, Indiana University, Bloomington, IN 47405
3 Department of Genetics, Development, and Cell Biology, Iowa State University, Ames, IA 50011

Computational Genome Science Laboratory, Indiana University

Overview

mRNAmarkup is a tool for annotation of de novo assembled transcripts with a focus on 1) careful consideration and handling of potential contaminants and assembly artifacts; 2) a portable, lightweight implementation; and 3) a modular, extensible design. The mRNAmarkup workflow, as shown in Figure 1, is best described as a series of filters. At each stage of the workflow, an analysis is performed: transcripts matching certain search criteria are set aside and annotated, while transcripts that do not match continue to subsequent analysis steps. The final product is a single set of vetted and annotated transcripts suitable for additional analysis.
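The "series of filters" idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not mRNAmarkup's own code: the function name `run_cascade`, the stage labels, and the predicate callbacks are hypothetical stand-ins for the actual database searches.

```python
from typing import Callable, Dict, Iterable, List, Tuple

# A stage is a (label, predicate) pair. The predicate is a hypothetical
# stand-in for a database search: it returns True when a transcript matches.
Stage = Tuple[str, Callable[[str], bool]]


def run_cascade(transcripts: Iterable[str], stages: List[Stage]) -> Dict[str, List[str]]:
    """Assign each transcript to the first stage it matches.

    Transcripts matching a stage are set aside under that stage's label;
    the rest continue to the next stage. Whatever is left after the last
    stage is collected under the label "unmatched".
    """
    remaining = list(transcripts)
    bins: Dict[str, List[str]] = {}
    for label, matches in stages:
        hit: List[str] = []
        miss: List[str] = []
        for transcript in remaining:
            (hit if matches(transcript) else miss).append(transcript)
        bins[label] = hit
        remaining = miss
    bins["unmatched"] = remaining
    return bins


# Toy usage: set membership stands in for a real search result.
contaminants = {"t1"}
reference_hits = {"t2", "t3"}
example_stages: List[Stage] = [
    ("contaminant", lambda t: t in contaminants),
    ("reference_protein", lambda t: t in reference_hits),
]
example_bins = run_cascade(["t1", "t2", "t3", "t4"], example_stages)
```

Because each transcript is assigned to the first stage it matches, the order of the stages determines both the annotation a transcript receives and how many sequences reach the later, more expensive searches.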

Filtered annotation strategy

Rather than running BLAST comparisons against massive and comprehensive protein databases (such as NCBI’s nr), mRNAmarkup instead searches a much smaller database of reference proteins from closely related species. Not only does this drastically reduce search time, but it also has a substantial effect on the BLAST statistics computed for each sequence match. Only transcripts with no match in the reference protein database are subject to additional searches against more comprehensive protein databases.
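One way database size affects BLAST statistics is through the E-value, which scales with the size of the search space. The sketch below uses the standard Karlin-Altschul relation E = m * n * 2^(-S'), where S' is the bit score, m the query length, and n the database length. It ignores BLAST's effective-length corrections, and the database sizes are hypothetical values chosen only for illustration.

```python
import math


def evalue(bit_score: float, query_len: int, db_len: int) -> float:
    """Expected number of chance hits: E = m * n * 2**(-S').

    Simplified form that uses raw lengths rather than BLAST's
    effective (edge-corrected) lengths.
    """
    return query_len * db_len * 2.0 ** (-bit_score)


# Hypothetical database sizes in residues (assumptions, not measured values).
small_db = 10_000_000        # a reference protein set from related species
large_db = 10_000_000_000    # a comprehensive protein database

e_small = evalue(50.0, 500, small_db)
e_large = evalue(50.0, 500, large_db)

# The same alignment (same bit score) is 1000 times less significant
# in the database that is 1000 times larger.
fold_change = e_large / e_small
```

Under this relation, a match with a given bit score receives a smaller E-value in a smaller database, so a fixed E-value cutoff admits weaker alignments there than it would in a comprehensive database.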

Another benefit of mRNAmarkup’s unique annotation strategy is the granular organization of outputs. A single, combined file containing all non-contaminant transcripts, suitable for whole-transcriptome analysis, is provided for convenience. However, the annotated transcripts are also grouped according to the database against which they match, facilitating the selection of subsets of the mRNAmarkup output for particular analysis tasks. For example, probable full-length cDNAs matching the reference protein database are well-suited for codon usage analysis and, when a genome is available, training spliced alignment models.

mRNAmarkup results also have secondary utility as a rough measure of a given transcriptome assembly’s quality. Transcriptome assembly is still one of the most difficult problems in genome informatics, and assembling with different parameters and different software packages will almost always produce very different results. Identifying and interpreting those differences on a whole-transcriptome scale remains a significant challenge. The breakdown of transcripts according to their mRNAmarkup annotation provides a convenient starting point for assessing which assembly (or assemblies) is best suited for a particular analysis.

Quality control: detecting contaminants and resolving chimeras

mRNAmarkup’s quality control features are crucial to eliminating potential errors in functional analysis. Two very common sources of error in de novo assembled transcriptomes are contaminants and improperly fused chimeric sequences. Experimental protocols designed to select, enrich, and sequence mRNAs of interest are common, and at least one widely used transcript assembler, Trinity (Grabherr, Haas, et al, 2011), has a setting to minimize chimeric fusions. Anecdotal evidence, however, suggests that post-assembly checks for contaminants and chimeras are usually overlooked, with potentially confounding effects on downstream analysis.

The first stage in mRNAmarkup’s workflow is to screen for contamination from vector sequence, bacteria, microsporidia, viruses, and organellar genomes. These contaminant sequences are set aside and excluded from further analysis, although depending on their origin (such as an endosymbiotic organism) they may be of interest for related study.

In subsequent stages of the workflow, multiple distinct matches against protein databases are used to identify potentially chimeric sequences (illustrated in Figure 2). For cases in which a transcript’s protein matches are in opposite orientations, mRNAmarkup utilizes the user-configurable maximum overlap and minimum extension parameters to split the chimeric sequence into 2 or 3 distinct transcripts. For cases in which a transcript’s protein hits are in the same orientation, the transcript is simply flagged as a potential chimera and no further attempt is made to split the transcript.
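The orientation rule can be illustrated with a small Python sketch for the simplest case of two protein hits on one transcript. It is not mRNAmarkup's algorithm: the names `Hit` and `resolve_pair`, the default thresholds, the reading of "extension" as the part of each hit lying outside the other, and the midpoint cut are all assumptions made for illustration. The three-way split mentioned above is not covered.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Hit:
    """One protein match on a transcript (1-based, inclusive coordinates)."""
    protein: str
    start: int
    end: int
    strand: int  # +1 or -1


def resolve_pair(a: Hit, b: Hit, transcript_len: int,
                 max_overlap: int = 30, min_extension: int = 60
                 ) -> Tuple[str, Optional[List[Tuple[int, int]]]]:
    """Classify two distinct protein hits on a single transcript.

    Returns ("flag", None) for hits in the same orientation,
    ("split", pieces) when opposite-orientation hits can be separated,
    and ("keep", None) when the thresholds are not met.
    The thresholds and their meaning are assumptions for this sketch.
    """
    left, right = sorted((a, b), key=lambda h: h.start)
    if left.strand == right.strand:
        # Same orientation: flag only, no attempt to split.
        return "flag", None
    overlap = max(0, left.end - right.start + 1)
    # Assumed meaning of "extension": bases of each hit outside the other hit.
    left_ext = (left.end - left.start + 1) - overlap
    right_ext = (right.end - right.start + 1) - overlap
    if overlap > max_overlap or min(left_ext, right_ext) < min_extension:
        return "keep", None
    # Cut midway between the end of the left hit and the start of the right hit.
    cut = (left.end + right.start) // 2
    return "split", [(1, cut), (cut + 1, transcript_len)]
```

For example, `resolve_pair(Hit("P1", 10, 600, 1), Hit("P2", 650, 1400, -1), 1500)` yields a split into the pieces (1, 625) and (626, 1500), whereas the same two hits in the same orientation are only flagged.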

mRNAmarkup analysis of the recently published transcriptome of the paper wasp Polistes canadensis (Ferreira et al, 2013) identified 2,596 previously unaccounted-for chimeric transcripts. All but 3 of these transcripts were successfully resolved by mRNAmarkup, producing a total of 6,496 non-chimeric transcripts from the original 2,596. Analyses of additional unpublished data sets have yielded similar numbers and similar results.

For comparison, both the 2,596 unresolved chimeric transcripts and the 6,496 resolved transcripts were annotated using FastAnnotator (Chen et al, 2012) and BLAST2GO (Götz et al, 2008). Neither tool flagged any of the sequences as potentially problematic, although as Figure 2 demonstrates, the use of chimeric transcripts in downstream analysis (such as genome annotation) clearly leads to unreliable results.

It is noteworthy that we had previously attempted to analyze the entire P. canadensis transcriptome using FastAnnotator, but were unable to produce usable results within a reasonable time frame. We did not attempt a full-transcriptome annotation with BLAST2GO, considering the amount of time required to access remote databases for the mapping and annotation steps.

Implementation and availability

Existing transcript annotation tools such as BLAST2GO (Götz et al, 2008), FastAnnotator (Chen et al, 2012), and Trinotate (http://trinotate.sourceforge.net/) require web-based data submission, access to remote databases, and/or an extensive and complicated local setup (if local setup is even offered at all). In contrast, mRNAmarkup has been implemented with a focus on simplicity, automation, data provenance, reproducibility, and reducing external software dependencies to facilitate easy local setup and portability across different operating systems. The main procedure is implemented as a batch script, with several supplementary programs included in the source code distribution. The only external software dependency is the NCBI BLAST+ suite, a common fixture on most bioinformatics systems. In addition, mRNAmarkup utilizes a modular design that supports the use of additional user-supplied databases and/or analysis steps not included in the main software distribution.
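Since BLAST+ is the only external dependency, a quick pre-flight check before a local run might look like the sketch below. The helper name `missing_tools` is hypothetical, and the default list of BLAST+ programs is an assumption: `blastx`, `blastn`, and `makeblastdb` are real BLAST+ executables, but which of them mRNAmarkup actually calls is not stated here.

```python
import shutil
from typing import Iterable, List


def missing_tools(tools: Iterable[str] = ("blastx", "blastn", "makeblastdb")) -> List[str]:
    """Return the names of the given executables that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]
```

An empty list from `missing_tools()` means all of the listed programs were found on the current PATH.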

The latest stable release of mRNAmarkup is always available from the Brendel Group website. Additionally, a public development repository is maintained at GitHub.

BrendelGroup @ IU: http://brendelgroup.org/bioinformatics2go/mRNAmarkup.php

BrendelGroup @ GitHub: https://github.com/BrendelGroup/mRNAmarkup

Acknowledgements

The authors express appreciation to Dr. Jenna Woody for her work during the initial stages of mRNAmarkup’s development. This work was made possible by the generous support of the NSF Plant Genome Research Program.

References

Chen TW, et al (2012) FastAnnotator: an efficient transcript annotation web tool. BMC Genomics, 13(Suppl 7):S9, doi:10.1186/1471-2164-13-S7-S9.

Ferreira PG, et al (2013) Transcriptome analyses of primitively eusocial wasps reveal novel insights into the evolution of sociality and the origin of alternative phenotypes. Genome Biol, 14:R20, doi:10.1186/gb-2013-14-2-r20.

Götz S, et al (2008) High-throughput functional annotation and data mining with the Blast2GO suite. Nucleic Acids Res, 36(10):3420-3435, doi:10.1093/nar/gkn176.

Grabherr MG, Haas BJ, et al (2011) Full-length transcriptome assembly from RNA-seq data without a reference genome. Nat Biotechnol, 29(7):644-652, doi:10.1038/nbt.1883.

Figure 2: Resolution of chimeric sequences. RNA-Seq reads from Polistes dominula assembled with Trinity and analyzed with mRNAmarkup are shown by red glyphs. Orange glyphs correspond to reference proteins, and blue glyphs correspond to automated gene annotations. The transcript highlighted in green was produced by Trinity, and is clearly chimeric in light of the aligned protein evidence (and additional transcript evidence not shown). The transcripts highlighted in red were resolved by mRNAmarkup.

Figure 1: mRNAmarkup workflow. After an initial screening step to remove contaminants, mRNAmarkup’s primary search steps compare transcripts to a database of reference proteins, then, if unmatched, to a more comprehensive protein database, and, if still unmatched, to a database of hypothetical proteins. Transcripts with no matches in these searches are then searched against a database of conserved protein domains and, finally, if still unmatched, against a database of microRNA sequences.