[Methods in Molecular Biology] Jasmonate Signaling Volume 1011 || Analysis of RNA-Seq Data with TopHat and Cufflinks for Genome-Wide Expression Analysis of Jasmonate-Treated Plants


Alain Goossens and Laurens Pauwels (eds.), Jasmonate Signaling: Methods and Protocols, Methods in Molecular Biology, vol. 1011, DOI 10.1007/978-1-62703-414-2_24, © Springer Science+Business Media, LLC 2013

Chapter 24

Analysis of RNA-Seq Data with TopHat and Cufflinks for Genome-Wide Expression Analysis of Jasmonate-Treated Plants and Plant Cultures

Jacob Pollier, Stephane Rombauts, and Alain Goossens

Abstract

The recent development of various deep sequencing techniques has led to the most powerful transcript profiling method available to date, RNA sequencing or RNA-Seq. Besides the identification of new genes and new splice variants of known genes, RNA-Seq allows comparison of the whole transcriptome of any organism under two or more experimental conditions, such as before and after jasmonate treatment. However, the vast amounts of data generated during RNA-Seq experiments require complex computational methods for read mapping and expression quantification. Here, we describe a detailed protocol for the analysis of deep sequencing data, starting from the raw RNA-Seq reads. First, a quality check is performed on the raw reads to assess the quality of the sequencing. Subsequently, adapters and low-quality sequences are trimmed off the raw reads. The resulting processed reads are mapped to the reference genome, and the mapped reads are counted to generate expression data for the annotated genes for each sample. This method can be used for the analysis of RNA-Seq data of any organism for which a reference genome is available.

Key words Transcript profiling, Gene expression, Transcriptome, RNA sequencing, RNA-Seq, FastQC, TopHat, Cufflinks

1 Introduction

The treatment of plants or plant cell cultures with jasmonates triggers an extensive transcriptional reprogramming of the cells, leading to transcriptional activation or repression of entire metabolic pathways [1, 2]. Since many of these pathways lead to the production of secondary metabolites, comparing the transcriptome before and after jasmonate treatment may allow the identification of candidate genes for the biosynthesis of secondary metabolites [2, 3]. In tobacco Bright Yellow 2 (BY-2) cells, for instance, the biosynthesis of nicotine is elicited by jasmonate treatment. Genome-wide transcript profiling of jasmonate-elicited BY-2 cultures has led to a set of tobacco genes [4] from which several new regulators, transporters, and enzymes involved in nicotine biosynthesis were identified in subsequent functional screens [5–9].

Various techniques can be used for genome-wide transcript profiling, including hybridization-based approaches like microarrays, and tag-based sequencing approaches such as cDNA-AFLP [10] (see Chapter 23), serial analysis of gene expression (SAGE) [11], and massively parallel signature sequencing (MPSS) [12]. However, the development of deep sequencing technologies has led to a method that is undoubtedly the most powerful transcript profiling technique available to date, RNA sequencing or RNA-Seq [13–15]. RNA-Seq not only allows the identification of new genes and new splice variants of known genes, but it also allows comparison of the whole transcriptome under two or more conditions [16], such as before and after jasmonate treatment. For RNA-Seq, RNA is isolated and converted to a set of cDNAs sheared into fragments to which adapters are attached. Subsequently, the cDNA fragments are subjected to deep sequencing, resulting in millions of short sequence fragments or reads (typically 30–400 nucleotides long, depending on the sequencing technology used) from one end (single-end) or both ends (paired-end) of the cDNA fragments. To obtain genome-wide quantitative transcript data, the reads are mapped on a reference genome or a de novo assembled set of transcripts [13].

To analyze the vast amounts of data generated during an RNA-Seq experiment, complex computational methods for read mapping, transcriptome reconstruction, and expression quantification are required [16, 17]. Several methods and software packages exist, but here we use the pipeline relying on TopHat [15] for read mapping and Cufflinks [18] for expression quantification. The method presented here aims to compare the transcriptome under two different conditions (e.g., before and after jasmonate treatment), starting from the raw RNA-Seq reads. First, a quality check is performed on the raw reads, and the adapters and low-quality sequences are trimmed. Subsequently, the processed reads are mapped to the reference genome with TopHat, and the resulting alignment files are used as input for Cufflinks, which generates normalized expression data for each of the analyzed raw sequencing files.

2 Materials

The software used for the analysis of deep sequencing data in this protocol requires a 64-bit computer running Linux with at least 16 GB of RAM. During processing of RNA-Seq data, several hundred GB of disk space may be required.

2.1 Quality Control

The various RNA-Seq protocols available to date still suffer from biases and sequencing artifacts, such as GC bias, read errors, primer and adapter contaminations, and PCR bias. To ensure meaningful downstream processing of the obtained RNA-Seq data, a quality check should be performed on the raw sequencing data [19, 20]. In this protocol, the quality control is done with FastQC, a commonly used program that provides an overview of whether the raw RNA-Seq data have any problems or biases to consider before further analysis. FastQC is a freely available program and can be downloaded from http://www.bioinformatics.babraham.ac.uk/projects/fastqc/.

2.2 Adapter Trimming

Various programs are available to trim adapters from the RNA-Seq data. In this protocol, adapter trimming is performed with fastx_clipper. The fastx_clipper program is part of the FASTX-Toolkit, which is freely available and can be downloaded from http://hannonlab.cshl.edu/fastx_toolkit/.

2.3 Quality Trimming

The quality trimming step will trim ambiguous (N) and low-quality residues from the ends of the reads. In this protocol, the quality trimming is performed with fastq_quality_trimmer. Like the fastx_clipper program, the fastq_quality_trimmer program is part of the FASTX-Toolkit.

2.4 Read Mapping

In the read mapping step, the processed sequencing reads are aligned to the reference genome. In this protocol, read mapping is performed with TopHat [15], which uses the widely used Bowtie program [21, 22] as alignment engine. In addition to the reference genome, TopHat (freely available at http://tophat.cbcb.umd.edu/) needs Bowtie [21, 22] (freely available at http://bowtie-bio.sourceforge.net/index.shtml/) and SAM Tools [23] (freely available at http://samtools.sourceforge.net/) to be installed.

2.5 Read Counting

The counting of the mapped reads is performed with Cufflinks [18], a software program freely available at http://cufflinks.cbcb.umd.edu/. Cufflinks quantifies the expression of each gene and reports it in "fragments per kilobase of transcript per million fragments mapped" or FPKM [18]. The FPKM value is a measure of the expression of a transcript, normalized by transcript length and the total number of fragments. As such, the FPKM value can be used to compare the expression of the genes in the analyzed samples. However, one should be aware that Cufflinks (like other software) uses an annotation of the reference genome described in GFF (or GTF) format. This means that the results depend on the quality of the provided annotation. The reported FPKM values relate only to the genes described in the annotation file; genes missing from that file will not be reported, even if reads map to them. Incorrect gene models will yield altered FPKM values.
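As a rough illustration of the FPKM definition, the following shell sketch computes the plain ratio for a single transcript. The function name fpkm and the example numbers are hypothetical. Cufflinks itself estimates abundances with a statistical model, so the values it reports need not equal this simple calculation.

```shell
#!/bin/sh
# fpkm FRAGMENTS TRANSCRIPT_LENGTH_BP TOTAL_MAPPED_FRAGMENTS
# Hypothetical helper: plain FPKM ratio, not the Cufflinks estimator.
fpkm() {
    awk -v n="$1" -v len="$2" -v total="$3" 'BEGIN {
        # n / (len / 1e3) / (total / 1e6) is the same as n * 1e9 / (len * total)
        printf "%.2f\n", n * 1e9 / (len * total)
    }'
}

# Example: 500 fragments on a 2,000-bp transcript, 10 million mapped fragments
fpkm 500 2000 10000000    # prints 25.00
```

With the same number of fragments, a transcript twice as long, or a library twice as deep, gives half the FPKM value, which is what makes the measure comparable between genes and samples.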



3 Methods

All programs are operated through the UNIX shell. The working directory in which the commands (prefix $) given in this protocol are executed should contain the (zipped) raw sequencing data in FASTQ format. This protocol was designed with the following versions of the above-described programs:

- FastQC version 0.9.1.
- FastX version 0.0.13.
- Bowtie version 2.0.0b6 [22].
- TopHat version 2.0.3 [15].
- SAM Tools 0.1.18 [23].
- Cufflinks 1.3.0 [18].

3.1 Quality Control

1. Unzip the first raw sequence file:

$ gunzip name.fastq.gz

2. Run FastQC on the unzipped file:

$ fastqc name.fastq

3. Rezip the raw sequencing file:

$ gzip name.fastq

4. Repeat steps 1–3 for all the raw sequence files (see Note 1).

As output, for each of the analyzed raw sequencing files, the FastQC program will generate a folder containing a permanent HTML report that provides an overview of whether the raw RNA-Seq data have problems or biases to consider before further analysis.

3.2 Adapter Trimming

1. Unzip the first raw sequence file (see Note 2):

$ gunzip -c name.fastq.gz >name.fastq

2. Trim an adapter from the unzipped data file (see Note 3):

$ fastx_clipper -i name.fastq -o newname.fastq -l 20 -v -a ADAPTERSEQUENCE

3. Remove the original fastq file:

$ rm name.fastq

4. Zip the trimmed file:

$ gzip newname.fastq

5. Repeat steps 1–4 for all the raw sequence files and all used adapters (see Note 4).

The output of the fastx_clipper program is a new fastq file containing the adapter-trimmed sequences. The original raw sequencing files are unchanged and will not be used any more in the downstream processing.
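To get a quick impression of whether an adapter is still present after clipping, one can count the reads that contain the adapter sequence. The sketch below assumes an unzipped FASTQ file with exactly four lines per read. count_adapter_reads is a hypothetical helper and not part of the FASTX-Toolkit.

```shell
#!/bin/sh
# count_adapter_reads FASTQ ADAPTER
# Hypothetical helper: prints "<reads containing ADAPTER> <total reads>".
# Assumes an uncompressed FASTQ file with four lines per record.
count_adapter_reads() {
    awk -v adapter="$2" '
        NR % 4 == 2 {                          # sequence line of each record
            total++
            if (index($0, adapter) > 0) hits++ # exact substring match only
        }
        END { printf "%d %d\n", hits, total }
    ' "$1"
}

# Usage (file and adapter names as in the protocol):
#   count_adapter_reads name.fastq ADAPTERSEQUENCE
#   count_adapter_reads newname.fastq ADAPTERSEQUENCE
```

Only exact matches of the full adapter string are counted, so partial adapters at read ends or adapters with sequencing errors are missed. The count is an indication, not a replacement for the FastQC report.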

3.3 Quality Trimming

1. Unzip the first adapter-trimmed sequence file:

$ gunzip -c name.fastq.gz >name.fastq

2. Perform the quality trimming (see Note 5):

$ fastq_quality_trimmer -i name.fastq -o newname.fastq -v -t 20 -l 65

3. Zip the quality-trimmed file:

$ gzip newname.fastq

4. Remove the original fastq file:

$ rm name.fastq

5. Repeat steps 1–4 for all the adapter-trimmed sequence files (see Note 6).

The output of the quality trimming step is a new (zipped) fastq file containing the quality-trimmed sequences that will be used for downstream processing. The file containing the adapter-trimmed sequences is unchanged, and is not needed any more after this step. To assess the effects of the adapter and quality trimming steps, a new quality control can be performed on the processed reads (Fig. 1).

3.4 Read Mapping

1. Build the Bowtie index files from the reference genome (see Note 7):

$ bowtie2-build genomename.fasta genomename

2. Unzip the first quality-trimmed sequence file:

$ gunzip -c name.fastq.gz >name.fastq

3. Make an output directory for the first sequence file:

$ mkdir dirname

4. Map the reads of the first sample to the reference genome (see Note 8):

$ tophat2 -o ./dirname genomename name.fastq

5. Remove the quality-trimmed file:

$ rm name.fastq

6. Repeat steps 2–5 to map the reads of the other samples to the reference genome (see Note 9).

TopHat will write its output into the defined folder. In addition to a set of intermediate files, the output includes a file called accepted_hits.bam, which contains the read alignments and will be used for the read counting. The file containing the quality-trimmed sequences is unchanged, and is not needed any more after this step.
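Because the sequence names in the alignments are taken from the genome FASTA file used to build the index, and Note 10 warns that Cufflinks reports 0 FPKM for all genes when these names differ from those in the annotation, it can be worth comparing the two files before read counting. The following sketch is one way to do so. missing_seqids is a hypothetical helper, and it assumes that the aligner uses the FASTA header up to the first whitespace as the sequence name.

```shell
#!/bin/sh
# missing_seqids GENOME_FASTA ANNOTATION_GFF
# Hypothetical helper: prints each sequence name found in column 1 of the
# annotation that has no matching header in the FASTA file.
# Prints nothing when all names match.
missing_seqids() {
    awk '
        FNR == NR {                        # first file: genome FASTA
            if (substr($0, 1, 1) == ">") {
                name = substr($1, 2)       # header up to the first whitespace
                seen[name] = 1
            }
            next
        }
        /^##FASTA/ { exit }                # stop at an embedded sequence section
        /^#/ || NF == 0 { next }           # skip comments and blank lines
        !($1 in seen) && !reported[$1]++ { print $1 }
    ' "$1" "$2"
}

# Usage (file names as in the protocol):
#   missing_seqids genomename.fasta genome.gff3
```

Any name printed by this check (for example chr1 in the annotation versus Chr1 in the FASTA file) should be reconciled in one of the two files before running Cufflinks.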


Fig. 1 Box and whisker plot of the per base sequence quality generated by the FastQC quality control program before (a) and after (b) adapter and quality trimming of the raw RNA-Seq reads. For each of the base positions (X-axis), the quality scores are plotted (Y-axis), with higher scores representing better base calls. The background green, orange, and red colors represent base calls of good, reasonable, and poor quality, respectively. In most RNA-Seq platforms, it is normal to see the base call quality degrading with the base position (a), which is improved after quality trimming of the reads (b)
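Besides the per base quality plot, the effect of trimming can be checked numerically by comparing the number of reads and their lengths before and after processing. The sketch below is a minimal example under the assumption of an unzipped FASTQ file with four lines per read. fastq_length_stats is a hypothetical helper and not part of FastQC or the FASTX-Toolkit.

```shell
#!/bin/sh
# fastq_length_stats FASTQ
# Hypothetical helper: prints the number of reads and the shortest, mean,
# and longest read length of an uncompressed four-line-per-record FASTQ file.
fastq_length_stats() {
    awk '
        NR % 4 == 2 {                      # sequence line of each record
            len = length($0)
            n++
            sum += len
            if (n == 1 || len < min) min = len
            if (len > max) max = len
        }
        END {
            if (n == 0) { print "reads=0"; exit }
            printf "reads=%d min=%d mean=%.1f max=%d\n", n, min, sum / n, max
        }
    ' "$1"
}

# Usage: run on the same sample before and after trimming, e.g.
#   fastq_length_stats name.fastq
#   fastq_length_stats newname.fastq
```

A large drop in the number of reads, or a minimum length below the value passed to the -l option, suggests that the trimming parameters should be revisited.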


3.5 Read Counting

1. Count the reads of the first sample with Cufflinks (see Note 10):

$ cufflinks -v --compatible-hits-norm -u -o ./dirname -G genome.gff3 dirname/accepted_hits.bam

2. Repeat step 1 to count the reads of the other samples (see Note 11).

As output of the read counting, for each sequencing file, a file (genes.fpkm_tracking) is generated containing the FPKM values of all the genes annotated on the reference genome. The FPKM values can be copied into an Excel table and used to compare the transcriptome in the different conditions.

4 Notes

1. Running of the quality control can be automated for all the raw RNA-Seq files by making use of the UNIX foreach command (a built-in of the csh and tcsh shells) in step 1:

$ foreach i (*.fastq.gz)
    gunzip $i
    foreach j (*.fastq)
        fastqc $j
        gzip $j
    end
end

2. Unlike the quality control, trimming of the adapters will modify the original files. In order not to lose the original data, the raw sequencing data are unzipped whilst keeping the original files unchanged.

3. With this command, the specified adapter sequence will be trimmed, and sequences shorter than 20 nucleotides, or sequences with unknown (N) nucleotides, will be discarded. As output, a new fastq file is generated containing the adapter-trimmed sequences. The -v (verbose) parameter will create a short summary with information on the number of reads that were processed and trimmed or discarded. When using sequence data generated by the Illumina/Solexa platform, an invalid quality score value error may occur depending on the CASAVA software version that generated the original fastq files. In that case, add the -Q 33 parameter to the command line:

$ fastx_clipper -i name.fastq -o newname.fastq -Q 33 -l 20 -v -a ADAPTERSEQUENCE

4. Trimming of the adapters can be automated for all the raw RNA-Seq files by making use of the foreach command in step 1:

$ foreach i (*.fastq.gz)
    echo $i
    set name = `basename $i .fastq.gz`
    echo $name
    gunzip -c ${name}.fastq.gz >${name}.fastq
    fastx_clipper -i ${name}.fastq -o new_${name}.fastq -l 20 -v -a ADAPTERSEQUENCE
    gzip new_${name}.fastq
    rm ${name}.fastq
end

5. When using sequence data generated by the Illumina/Solexa platform, an invalid quality score value error may occur. In this case, add the -Q 33 parameter to the command line:

$ fastq_quality_trimmer -i name.fastq -o newname.fastq -v -t 20 -l 65 -Q 33

The -v (verbose) parameter will create a short summary with information on the number of reads that were processed and trimmed or discarded. The -t parameter defines the minimum acceptable quality of the base calling. In this case, the quality threshold (Q) is 20, meaning that the probability (P) of an incorrect base call is 1 %, according to the formula Q = −10 log₁₀(P). A Q20 base call accuracy of 99 % means that a read of 100 bp in which every base has a quality of 20 is expected to contain one error on average. Given that the reads will be trimmed from the ends until the quality reaches the minimum required value of 20, it is good practice to set a minimum length (-l option) for the reads that are reported in the output. This parameter depends on the length of the input reads and on what will be done after trimming. In any case, one should keep in mind that the shorter a read, the less specific it becomes.

6. Quality trimming can be automated for all the adapter-trimmed RNA-Seq files by making use of the foreach command:

$ foreach i (*.fastq.gz)
    echo $i
    set name = `basename $i .fastq.gz`
    echo $name
    gunzip -c ${name}.fastq.gz >${name}.fastq
    fastq_quality_trimmer -i ${name}.fastq -o new_${name}.fastq -v -t 20 -l 65
    gzip new_${name}.fastq
    rm ${name}.fastq
end

7. The bowtie2-build algorithm builds a Bowtie index from the FASTA file of the reference genome. The Bowtie index is used to align the reads to the genome and consists of a set of six files with suffixes .1.bt2, .2.bt2, .3.bt2, .4.bt2, .rev.1.bt2, and .rev.2.bt2. As input file, a FASTA file of the complete genome or a comma-separated list of FASTA files containing the reference sequence (e.g., FASTA files of the chromosomes) is used. The genome name defined for the index files will be used as the base name of the set of six files.

8. The command given in the protocol runs the TopHat script with the default parameters. However, these default parameters are set to process mammalian RNA-Seq reads, and hence, when working with other organisms, such as plants, stricter settings of certain parameters will keep the number of false positives low. For instance, the command given below restricts the maximum intron size to 6,000 bp (in Arabidopsis, over 99.9 % of the introns are shorter than 6,000 bp):

$ tophat2 -I 6000 -o ./dirname genomename name.fastq

For more detailed information about the options available in TopHat, use the help command:

$ tophat2 -h

9. Read mapping can be automated by making use of the foreach command:

$ foreach i (*.fastq.gz)
    echo $i
    set name = `basename $i .fastq.gz`
    mkdir ${name}
    gunzip -c ${name}.fastq.gz > ${name}.fastq
    tophat2 -o ./${name} genomename ${name}.fastq
    rm ${name}.fastq
end

10. By adding the --compatible-hits-norm option, Cufflinks will normalize the gene expression according to the number of fragments that are compatible with the reference annotation, and not the total number of mapped fragments, as is the default. For more detailed information about the options available in Cufflinks, use the help command:

$ cufflinks -h

Furthermore, it is important that the sequence names in the genome annotation file are the same as the sequence names in the header of the accepted_hits.bam file. If they are not the same, Cufflinks will give the expression of all genes as 0 FPKM.

11. Read counting can be automated for all the TopHat output files by making use of the foreach command (replace dirname with the list of TopHat output directories):

$ foreach i (dirname)
    set dirname = `basename $i`
    echo ${dirname}
    cufflinks -v --compatible-hits-norm -u -o ./${dirname} -G genome.gff3 ${dirname}/accepted_hits.bam
end
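Once Cufflinks has been run for all samples, the per-sample genes.fpkm_tracking files can also be combined on the command line instead of by hand. The sketch below is one possible approach and not part of Cufflinks. merge_fpkm is a hypothetical helper that assumes each output directory contains a tab-separated genes.fpkm_tracking file with header columns named tracking_id and FPKM.

```shell
#!/bin/sh
# merge_fpkm DIR [DIR ...]
# Hypothetical helper: prints a tab-separated table with one row per gene
# (tracking_id) and one FPKM column per Cufflinks output directory.
merge_fpkm() {
    # Turn each directory argument into the path of its tracking file
    for d in "$@"; do
        set -- "$@" "$d/genes.fpkm_tracking"
        shift
    done
    awk -F '\t' -v OFS='\t' '
        FNR == 1 {                         # header line of each file
            nfiles++
            dir = FILENAME
            sub(/\/genes\.fpkm_tracking$/, "", dir)
            sample[nfiles] = dir
            for (i = 1; i <= NF; i++) {    # locate columns by name
                if ($i == "tracking_id") idcol = i
                if ($i == "FPKM") valcol = i
            }
            next
        }
        {
            if (!($idcol in known)) { known[$idcol] = 1; ids[++nids] = $idcol }
            fpkm[$idcol, nfiles] = $valcol
        }
        END {
            line = "tracking_id"
            for (f = 1; f <= nfiles; f++) line = line OFS sample[f]
            print line
            for (k = 1; k <= nids; k++) {
                line = ids[k]
                for (f = 1; f <= nfiles; f++)
                    line = line OFS (((ids[k], f) in fpkm) ? fpkm[ids[k], f] : "NA")
                print line
            }
        }
    ' "$@"
}

# Usage (directory names are placeholders for the TopHat/Cufflinks folders):
#   merge_fpkm control treated > fpkm_table.txt
```

The resulting table can be opened in a spreadsheet program. Genes absent from one of the files are marked NA rather than 0, so that missing values are not mistaken for unexpressed genes.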


Acknowledgements

This work was supported by the European Framework Programme 7 project SMARTCELL (FP7 KBBE 222716).

References

1. De Geyter N, Gholami A, Goormachtig S, Goossens A (2012) Transcriptional machineries in jasmonate-elicited plant secondary metabolism. Trends Plant Sci 17:349–359

2. Pauwels L, Inzé D, Goossens A (2009) Jasmonate-inducible gene: what does it mean? Trends Plant Sci 14:87–91

3. Pollier J, Moses T, Goossens A (2011) Combinatorial biosynthesis in plants: a (p)review on its potential and future exploitation. Nat Prod Rep 28:1897–1916

4. Goossens A, Häkkinen ST, Laakso I, Seppänen-Laakso T, Biondi S, De Sutter V, Lammertyn F, Nuutila AM, Söderlund H, Zabeau M, Inzé D, Oksman-Caldentey KM (2003) A functional genomics approach toward the understanding of secondary metabolism in plant cells. Proc Natl Acad Sci USA 100:8595–8600

5. De Boer K, Tilleman S, Pauwels L, Vanden Bossche R, De Sutter V, Vanderhaeghen R, Hilson P, Hamill JD, Goossens A (2011) APETALA2/ETHYLENE RESPONSE FACTOR and basic helix-loop-helix tobacco transcription factors cooperatively mediate jasmonate-elicited nicotine biosynthesis. Plant J 66:1053–1065

6. De Sutter V, Vanderhaeghen R, Tilleman S, Lammertyn F, Vanhoutte I, Karimi M, Inzé D, Goossens A, Hilson P (2005) Exploration of jasmonate signalling via automated and standardized transient expression assays in tobacco cells. Plant J 44:1065–1076

7. Häkkinen ST, Tilleman S, Šwiatek A, De Sutter V, Rischer H, Vanhoutte I, Van Onckelen H, Hilson P, Inzé D, Oksman-Caldentey KM, Goossens A (2007) Functional characterisation of genes involved in pyridine alkaloid biosynthesis in tobacco. Phytochemistry 68:2773–2785

8. Lackman P, González-Guzmán M, Tilleman S, Carqueijeiro I, Cuéllar Pérez A, Moses T, Seo M, Kanno Y, Häkkinen ST, Van Montagu MCE, Thevelein JM, Maaheimo H, Oksman-Caldentey KM, Rodriguez PL, Rischer H, Goossens A (2011) Jasmonate signaling involves the abscisic acid receptor PYL4 to regulate metabolic reprogramming in Arabidopsis and tobacco. Proc Natl Acad Sci USA 108:5891–5896

9. Morita M, Shitan N, Sawada K, Van Montagu MCE, Inzé D, Rischer H, Goossens A, Oksman-Caldentey KM, Moriyama Y, Yazaki K (2009) Vacuolar transport of nicotine is mediated by a multidrug and toxic compound extrusion (MATE) transporter in Nicotiana tabacum. Proc Natl Acad Sci USA 106:2447–2452

10. Bachem CWB, van der Hoeven RS, de Bruijn SM, Vreugdenhil D, Zabeau M, Visser RGF (1996) Visualization of differential gene expression using a novel method of RNA fingerprinting based on AFLP: analysis of gene expression during potato tuber development. Plant J 9:745–753

11. Velculescu VE, Zhang L, Vogelstein B, Kinzler KW (1995) Serial analysis of gene expression. Science 270:484–487

12. Brenner S, Johnson M, Bridgham J, Golda G, Lloyd DH, Johnson D, Luo S, McCurdy S, Foy M, Ewan M, Roth R, George D, Eletr S, Albrecht G, Vermaas E, Williams SR, Moon K, Burcham T, Pallas M, DuBridge RB, Kirchner J, Fearon K, Mao J, Corcoran K (2000) Gene expression analysis by massively parallel signature sequencing (MPSS) on microbead arrays. Nat Biotechnol 18:630–634

13. Wang Z, Gerstein M, Snyder M (2009) RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet 10:57–63

14. Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B (2008) Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods 5:621–628

15. Trapnell C, Pachter L, Salzberg SL (2009) TopHat: discovering splice junctions with RNA-Seq. Bioinformatics 25:1105–1111

16. Trapnell C, Roberts A, Goff L, Pertea G, Kim D, Kelley DR, Pimentel H, Salzberg SL, Rinn JL, Pachter L (2012) Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nat Protoc 7:562–578

17. Garber M, Grabherr MG, Guttman M, Trapnell C (2011) Computational methods for transcriptome annotation and quantification using RNA-seq. Nat Methods 8:469–477

18. Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ, Salzberg SL, Wold BJ, Pachter L (2010) Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotechnol 28:511–515

19. Wang L, Wang S, Li W (2012) RSeQC: quality control of RNA-seq experiments. Bioinformatics 28:2184–2185

20. Patel RK, Jain M (2012) NGS QC Toolkit: a toolkit for quality control of next generation sequencing data. PLoS One 7:e30619

21. Langmead B, Trapnell C, Pop M, Salzberg SL (2009) Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 10:R25

22. Langmead B, Salzberg SL (2012) Fast gapped-read alignment with Bowtie 2. Nat Methods 9:357–359

23. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, Subgroup GPDP (2009) The sequence alignment/map format and SAMtools. Bioinformatics 25:2078–2079
