Analyzing RNA Sequence Data Tutorial - Golden Helix, Inc...Analyzing RNA Sequence Data Tutorial, Release 8.1 Figure 2-7. 4 x 4 scatter plot of ﬁrst four principal components calculated

Analyzing RNA Sequence Data TutorialRelease 8.1

Golden Helix, Inc.

Feb 12, 2019

Contents

1. Overview 2

2. Sample Statistics and Normalization 3A. Sample Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3B. Minimum Read Threshold . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5C. Upper Quartile Normalization and Log Transform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

3. Numeric Association and P-P Plots 10A. Numeric Association Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10B. Visualizing Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

4. DESeq Analysis and Visualization 13A. DESeq Variance Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13B. Plotting Variance Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

5. Visualizing Expression Data 22

i

Analyzing RNA Sequence Data Tutorial, Release 8.1

Updated: January 9, 2019

Level: Advanced

Packages: RNA-Seq, Power Seat

The following tutorial is designed to systematically introduce you to a number of techniques for analyzing your RNA-Seq or other high throughput sequencing data output within SVS. It is not meant to replicate all the workflows youmight use in a complete analysis, but instead touch on a sampling of the more typical scenarios you may come acrossin your own studies.

To complete this tutorial you will need to download and unzip the following file, which includes a starter project forthis tutorial.

Download

RNASeq_Analysis.zip

Files included in the above ZIP file:

• Analyzing RNA Sequence Data Tutorial - Starter project containing isoforms and gene counts data.

Contents 1

http://doc.goldenhelix.com/SVS/tutorials/rna_seq_analysis/RNASeq_Analysis.zip

1. Overview

The following tutorial is designed to systematically introduce you to a number of techniques for analyzing your RNA-Seq or other high throughput sequencing data within SVS. It is not meant to replicate all the workflows you might usein a complete analysis, but instead touch on a sampling of the more typical scenarios you may come across in yourown studies.

Facets of analysis covered in this tutorial include:

• Sample Statistics and Upper Quartile Normalization

• Numeric Association and Visualization

• DESeq and Visualization

• BAM Visualization

The previously downloaded project contains two spreadsheets, Isoforms counts - Sheet 1 and Breast Cancer Sub-types + Gene Counts - Sheet 1. This data was obtained from RNA-seq on several breast cancer cell lines and wasgenerated by the Golden Helix pipeline.

The datasets represent how many reads mapped to each gene or isoform (columns) for each sample (rows), howeveronly the Gene Counts spreadsheet will be used in this tutorial. The steps in this tutorial can be applied to any datasetthat is compatible with SVS and was generated by a secondary analysis RNA-Seq pipeline. This tutorial also assumesthat the count data has not yet been normalized. As you will see, this tutorial covers different normalization techniques,including one that is built-in to the DESeq Analysis tool.

The Golden Helix pipeline generated isoform and gene counts in .tsv format which were then imported into SVS usingImport RNA-Seq Tabularized Quantification in the project navigator Import menu. This import tool also recognizesmarker map data included in the .tsv file and applies the marker map to the imported spreadsheet.

You can find an example of the format that is used by our Tabularized import tool in the SVS Manual.

If you are an SVS customer and have RNA-Seq data in a different format, contact our Bioinformatics Support Teamand ask about how to import your data into SVS.

2

http://doc.goldenhelix.com/SVS/latest/svsmanual/importing_data.html#rna-seq-tabularized-quantification

2. Sample Statistics and Normalization

RNA-Seq data generated as raw count data from RSEM contains proportional values as a result of reads that havethe potential for multiple alignments to various transcripts found in the reference genome. This creates the possibilityfor read data to be non-uniquely mapped. SVS can assist in validating the read depth obtained on your samples andperforming an upper quartile normalization step as outlined by Bullard et al. BMC Bioinformatics 2010.

A. Sample Statistics

First examine the summary statistics of the read counts per sample. To do this, transpose the dataset and calculatecolumn statistics on the transposed spreadsheet. Then create a side-by-side boxplot of the transposed columns andvisually examine the sample distributions.

• Open Breast Cancer Subtypes + Gene Counts - Sheet 1. and select Edit >Transpose Spreadsheet. ClickOK.

• From the transposed spreadsheet, choose Numeric > Statistics (per Column).

• Leave Minimum, First Quartile, Maximum, Third Quartile, Median, Sum and Outlier Thresholdschecked. Uncheck all other output options (except the Integer columns and Real columns options insidethe Include in calculations: box).

• The dialog should look like Figure 2-1. Click OK.

The resulting spreadsheet contains the selected statistics for each sample. The target depth had been set at 12.5 millionreads, so the values in the Sum column should be close to this value.

• From Column Statistics, right-click on the Sum column header (8th column) and select Sort Descending.

This information displays the number of reads obtained for each sample. Sample BT549 has the closest approximationto the target read depth at 11,160,803.71 reads. The remaining reads for this sample were collected, but were not foundin alignment to the reference genome.

Also notice that all the values in the Minimum and Q1 columns are 0. This not only demonstrates the skewness in thedata but also the fact that the dataset is oversaturated with 0 values.

Note: Selecting additional options in the Column Statistics dialog could also provide the mean and variance of thecount data.

Next, visually compare the read depth distribution over all samples by creating a side-by-side boxplot.

• From the transposed spreadsheet Breast Cancer Subtypes + Gene Counts Transposed - Sheet 1, choose Plot> Columnwise Side-by-Side Boxplot.

Notice that one sample (HS578T) contains an extreme outlier compared to all other samples’ data. It will be importantto keep this outlier in mind when reviewing the analysis results later in the tutorial.

3

http://www.biomedcentral.com/1471-2105/11/94


Figure 2-1. Column Statistics dialog

Figure 2-2. Columnwise Side-by-Side Boxplot of all samples

4 2. Sample Statistics and Normalization


This plot also shows that each sample has a skewed (non-normal) distribution. Later in the tutorial, you will apply anormalization procedure to this dataset before performing an association test. The other analysis method used, DESeq,has a built-in normalization procedure.

B. Minimum Read Threshold

As shown previously, the gene count data is overly saturated with 0 counts. To account for this we will be using a toolthat allows you to require a minimum read count in a minimum number of samples.

• From Breast Cancer Subtypes + Gene Counts - Sheet 1, choose RNA-Seq > Activate Genes by MinimumRead Threshold

• Enter 5 in the first box and change the drop down value to occurrences. Enter 10 in the second box.

• The dialog should look like figure 2-3. Click OK

Figure 2-3. Activate Genes by Minimum Read Threshold dialog

14,804 gene counts should have passed the provided threshold values, with 14,807 columns remaining active. Createa subset of the spreadsheet.

• Choose Select > Column > Column Subset Spreadsheet

• Rename the subset to Minimum Read Threshold

C. Upper Quartile Normalization and Log Transform

Close all open spreadsheets and plots by selecting Window > Close All from the Project Navigator.

The next step is to normalize the data using an Upper-Quartile normalization and Log Transformation procedureavailable in SVS. The transformed dataset can then be used for association testing.

• Open the Minimum Read Threshold spreadsheet

• Select RNA-Seq > Normalization and Log Transformation.

• Leave the default settings as seen in Figure 2-4 and click OK.

This produces two output spreadsheets, Minimum Read Threshold - Upper Quartile Normalized and MinimumRead Threshold - Upper Quartile Normalized – log2 Transformed. The second spreadsheet has both proceduresapplied and should be used for further analysis.

Plot another side-by-side boxplot here to view and compare the normalized distributions.

B. Minimum Read Threshold 5


Figure 2-4. Normalization and Log Transformation settings.

• From Minimum Read Threshold - Upper Quartile Normalized – log2 Transformed, choose Edit > Trans-pose Spreadsheet.

• Select Current Spreadsheet after Spreadsheet as child of: and click OK.

• From the transposed spreadsheet choose Plot > Columnwise Side-by-Side Boxplots.

As evidenced in Figure 2-5, the normalization procedure has worked as expected and the distributions across allsamples look considerably more comparable.

This breast cancer samples in this dataset comprise three subtypes, Basal, Claudin and Luminal. Next, perform PCAAnalysis to examine and compare relationships between the three subtypes.

• From the Minimum Read Threshold - Upper Quartile Normalized - log2 Transformed spreadsheet, selectNumeric > Numeric Principal Component Analysis.

• Click Yes when asked if you want to continue.

• After Data Centering, select the checkbox for Center data by marker, which in this case means to center bygene.

• The dialog should look like Figure 2-6. Click Run.

Two output spreadsheets are created, Principal Components (Center by Marker) and PC Eigenvalues (Center byMarker). The PC Eigenvalues (Center by Marker) spreadsheet lists the top 10 calculated Eigenvalues. Since thereare three major subtypes, one would expect the top two or three eigenvalues to be able to differentiate the majority ofthe data. In other words, the eigenvalues should begin to stabilize after the first three, which seems to be the case.

Next, create an NxN PCA plot of the top four principal components. You would expect to lose differentiation bysubtype in the fourth principal component and creating a 4x4 PCA plot should confirm this.

Before creating the plot, the subtype information needs to be joined with the Principal Components.

• Open Breast Cancer Subtypes + Gene Counts - Sheet 1.

• Choose Select > Column > Inactivate all columns. Then click on column 1 to reactivate the Type column.This will create spreadsheet Breast Cancer Subtypes + Gene Counts - Sheet 2, which will have one columnactive.

• Now select File > Join or Merge Spreadsheets.



Figure 2-5. Columnwise Side-by-Side Boxplot of normalized samples

Figure 2-6. Numeric Principal Component Analysis options

C. Upper Quartile Normalization and Log Transform 7


• Choose Principal Components (Center by Marker) from the list.

• Enter Pheno + PCs as the New dataset name: and click OK.

Create the scatter plot from this spreadsheet.

• From Pheno + PCs - Sheet 1, select Plot > N x N Scatter Plot.

• Click Add Columns and select the first four columns (EV=10.4881, EV=4.58711, EV=2.99062, andEV=2.4402). Click OK.

• After Optionally choose grouping variable click Select Column.

• Select Type and click OK. Then click OK to create the plot.

This produces PCA scatter plots, colored by sample type, of all combinations of the first four principal componentstaken two at a time (Figure 2-7). We see that the scatter plot between the first two principal components (shown in twodifferent directions in the upper-left corner) is well-segregated by subtype.

Looking in further detail, we see that the first principal component exhibits good segregation by all subtypes, and thatthe second principal component is good at segregating Basal from the other two subtypes. The 3rd and 4th componentsdo not really segregate any of the three subtypes at all.

Note: Upper quartile normalization and log transformation is an important step that is performed on the raw countdata from RNA-Seq output created by the RSEM pipeline. Data imported from other RNA-Seq pipelines such as FPMis also supported and requires upper quartile normalization as well. However, RNA-Seq FPKM data produced fromTop Hat has already been normalized and only requires a log2 transformation of the data as well as PCA.



Figure 2-7. 4 x 4 scatter plot of first four principal components calculated from RNA-Seq expression count databetween three breast cancer cell lines colored by type.

C. Upper Quartile Normalization and Log Transform 9

3. Numeric Association and P-P Plots

Next, perform a numeric association test on the transformed dataset. This type of test requires a binary dependentcolumn–thus, only two subtypes can be tested per run. The second and third columns in the dataset contain binaryexpansions of the Claudin and Luminal subtypes.

A. Numeric Association Test

• Open the Minimum Read Threshold - Upper Quartile Normalized - log2 Transformed spreadsheet.

• Right-click on the first column header (Type) and select Activate by Category.

• Select the Claudin and Luminal categories (and deselect Basal) by first clicking on Claudin, then, whileholding down the Ctrl key, clicking on Luminal.

• The dialog should look like Figure 3-1. Click OK.

Figure 3-1. Activate by Category dialog

This will inactivate the samples with Basal subtype, allowing the association test to compare only the (active) rowscorresponding to the Claudin and Luminal subtypes.

• Make Type=Luminal? the dependent column by clicking once on its column name header. The column willturn magenta.

• Select Numeric > Numeric Association Tests.

• Click Yes when asked if you would like to continue.

• Match the settings to Figure 3-2 by checking T-test.

10


• Click Run.

Figure 3-2. Numeric Association Tests dialog

B. Visualizing Results

Visualize the log-10 P-value by creating a numeric plot.

• From the Association Tests spreadsheet, right-click on the T Test -log10 P column header and select PlotVariable in GenomeBrowse.

From the association results, you could mark the genes with the largest -log10 P-values for further investigation. Thenext step in this tutorial will lead you through an analysis method specifically designed for RNA-Seq data.

B. Visualizing Results 11


Figure 3-3. -log10 P-value Plot

12 3. Numeric Association and P-P Plots

4. DESeq Analysis and Visualization

DESeq is an analysis tool for analyzing variance in numerical count data produced from high throughput analysistools, such as RNA-Seq. These analysis techniques were first published in a paper from Anders & Huber, 2010.

This technique provides an internal normalization technique and therefore the test should be performed on the originalMinimum Read Threshold spreadsheet that has not yet been normalized.

A. DESeq Variance Analysis

• Open Minimum Read Threshold.

• Click on the Type column name header, which turns the column magenta and designates the column as thedependent variable.

• Select RNA-Seq > DESeq Analysis.

• In the first tab, select the analysis options that are shown in Figure 4-1. The options in the second tab will remainthe default options.

• Click OK.

Two output spreadsheet are created, P-values for Conditions A (Claudin) and B (Luminal) and VST-TransformedCounts for G/T with the Best 50 P-values. The first spreadsheet contains the analysis results while the second is anintermediate spreadsheet that will be used for plotting purposes.

B. Plotting Variance Results

Calling Differential Expression

• Open the spreadsheet P-Values for Conditions A (Claudin) and B (Luminal).

• Select Plot > XY Scatter Plots.

• In the left column, select Log 2 (Combined Base Mean) as the Independent Axis (X), in the right column selectLog 2 Fold Change as the Dependent Axis (Y).

• Click Plot.

Now color the plot based on a p-value threshold.

• In the Graph Control Interface, click on the Log 2 Fold Change node and select the Color tab below.

• Click the radio button By Variable and then click the Select Variable... button.

• Choose FDR-corrected p-value (fourth down), select < in the bottom right and type 0.001.

13

http://genomebiology.com/2010/11/10/R106


Figure 4-1. DESeq Analysis dialog

14 4. DESeq Analysis and Visualization


• Click OK.

The image should appear much like Figure 4-2. The blue color indicates p-values less than 0.001 after they are adjustedfor multiple testing with the Benjamini-Hochberg false discovery rate (FDR) procedure.

This plot demonstrates how the overall counts (combined base mean) affect the amount of fold change needed todetect differential expression. As the overall counts become larger, the surrounding noise decreases, making smallerfold changes significant. On the other hand, the surrounding noise around smaller overall counts causes them to requirelarge fold changes to be significant.

Volcano Plot

• Open P-Values for Conditions A (Claudin) and B (Luminal).

• Select Plot > XY Scatter Plots.

• In the left column, select Log 2 Fold Change as the Independent Axis (X) and in the right column select -Log10 P-Value the Dependent Axis (Y).

• Click Plot.

• In the Graph Control Interface, select the -Log10 P-Value node and click on the Color tab below.

• Click the By Variable radio button and then click the Select Variable... button.

• Choose FDR-corrected p-value, select > in the bottom right, type 1E-6, and click OK.

The plot should appear much like Figure 4-3.

Dendrograms and Heat Map

• Open VST-Transformed Counts for G/T with the Best 50 P-Values.

• Select RNA-Seq > Dendrograms and Heatmap.

• Leave the default options as they are and click OK.

The plot (Figure 4-4) shows two well-segregated groups that correspond to Claudin and Luminal subtypes. In theheatmap, blue represents low levels of expression while red represents high levels.

Plotting P-values in genomic space

• Open P-Values for Conditions A (Claudin) and B (Luminal) and right-click on the -Log 10 P-Value columnheader. Select Sort Descending.

This brings the most interesting potentially predictor genes to the top of the list. LEPREL1 is indicated as the top geneon this list.

• Right-click on the column header for the -Log10 P-Value column.

• Select Plot Variable in GenomeBrowse.

This provides a visual display of the -log 10 P values in genomic space. The chromosomes and genes are displayedon the bottom X axis and the -log 10 P values on the Y axis.

• In the Plot Tree select either -Log 10 P-Value graph node. Then select the Style tab in the Controls window.

• Select Chromosome from the drop down menu under Style By:.

This creates a Manhattan Plot where each chromosome is a different color (Figure 4-5).

B. Plotting Variance Results 15


Figure 4-2. Log2 Fold Change plotted against Log10 Combined Base Mean, colored by FDR-corrected P-value.



Figure 4-3. Volcano plot of Log 2 Fold change versus –Log 10 P-Values adjusted for FDR.



Figure 4-4. Dendrograms and Heatmap of top 50 Genes



Figure 4-5. Manhattan plot of the -log10 P values

• Click on the uppermost yellow-orange dot in Chromosome 3.

The Console in the lower-left corner should now show the name LEPREL1, which corresponds to the best predictorgene shown in the spreadsheet P-Values for Conditions A (Claudin) and B (Luminal).

• Zoom in on this gene by selecting on the zoom toolbar, then pressing your left mouse button to the left ofthis gene and while holding down your left mouse button dragging the cursor across to the right of this gene,finally releasing the mouse button to complete the zoom. Repeat this process until the dot representing the genenot only has a label, but has become a horizontal oval taking up at least a quarter of the horizontal distanceacross the plot.

Notice (Figure 4-6) that according to the RefSeq Genes 105 Interim v1, NCBI track in the bottom plot, the geneticcoordinates taken up by “LEPREL1” are instead supposed to be gene P3H2. What is going on? Where is LEPREL1?

Note: You may want to vertically expand the bottom plot, RefSeq Genes 105 Interim v1, NCBI, in order to be ableto see gene P3H2, or to assure yourself that you haven’t missed anything in the bottom plot.

• Now hover the mouse over the P3H2 gene in the bottom plot until a box forms around its name, then click.Information about gene P3H2 should appear in the Data Console, which is in the lower-left corner of the plot.

• Vertically expand the Data Console box, if necessary, then scroll the Data Console down to the GeneID field(near the bottom).

• Click on the 55214 Gene ID.

You will now be brought to an NCBI web page about gene P3H2. Note that this gene is also known as “MCVD;MLAT4; LEPREL1”. (Mystery solved!)



Figure 4-6. Close-up plot of the gene with the best -log10 P value



Several other links to databases are available from the information in the Data Console about gene P3H2 (LEPREL1),or about any other gene on whose name you click in the gene track.

Note: If the name of a gene in your RNA-Seq data is completely current, you may zoom in on that gene simply byentering its name into the Genomic Location Bar (at the top of the plot window above the Full Domain Band).


5. Visualizing Expression Data

Now to add the BAM files to our P-Value plot for visualization. The view should still be zoomed into the area coveringthe P3H2 (LEPREL1) gene, which was found to have a very small p-value.

• From the plot we have just been viewing (Plot of Column -Log10 P-Value from P-Values for Conditions A(Claudin) and B (Luminal)), do the following:

• Click the Plot button and choose Example Samples in the left-most pane. Most of these samples correspond tothis Breast Cancer dataset. You will also see a few DNA-Seq files that we host on our server in this list.

• Check a couple of Luminal samples, such as HCC1428_12.5M and MCF7_12.5M, then click Plot.

• Now check a couple of Claudin samples, such as BT549_12.5M and HS578T_12.5M, then click Plot & Close.

The first two BAM files correspond to the Claudin group and show high levels of expressions for the P3H2 (LEPREL1)gene. The 3rd and 4th BAM files correspond to the Luminal group and show low levels of expression for the gene.

This discrepancy in expression levels between the breast cancer subtypes follows our expectations based on the smallDESeq p-value.

22


Figure 5-1. Pile-up and Coverage Plots for Breast Cancer samples

23

Documents

Analyzing RNA Sequence Data Tutorial - Golden Helix, Inc...Analyzing RNA Sequence Data Tutorial, Release 8.1 Figure 2-7. 4 x 4 scatter plot of ﬁrst four principal components calculated