Detecting intron retention in NGS data - William Ritchie

DESCRIPTION

Intron retention (IR) is widely recognized as a consequence of mis-splicing that leads to failed excision of intronic sequences from pre-messenger RNAs. Our bioinformatic analyses of transcriptomic and proteomic data of normal white blood cell differentiation reveal IR as a physiological mechanism. IR actively regulates the expression of nearly one hundred functionally related genes, including those that determine the nuclear shape that is unique to granulocytes. Retention of introns in specific genes is associated with downregulation of splicing factors and is conserved between human and mouse. IR led to reduced mRNA and protein levels by triggering the nonsense-mediated decay (NMD) pathway. In contrast to the prevalent view that NMD is limited to mRNAs encoding aberrant proteins, our data establish that IR coupled with NMD is a conserved mechanism in normal granulopoiesis. Physiological IR may provide an energetically favorable level of dynamic gene expression control prior to sustained gene translation. Our findings were published in Cell on 1 August 2013.

IRFinder: an algorithm to detect retained introns from mRNA-seq data

William Ritchie1,2,3, Rob Middleton3, Justin J-L Wong1,2, Jeff Holst1,2,4, John EJ Rasko1,2,5

1 Gene & Stem Cell Therapy Program, Centenary Institute, Camperdown, Australia
2 Sydney Medical School, University of Sydney, Australia
3 Bioinformatics Laboratory, Centenary Institute, Camperdown, Australia
4 Origins of Cancer Laboratory, Centenary Institute, Camperdown, Australia
5 Cell and Molecular Therapies, Royal Prince Alfred Hospital, Camperdown, Australia

[email protected]; [email protected]

Introduction

Alternative splicing of RNA is a fundamental biological process that affects almost all multi-exonic genes to promote protein diversity. Of the three modes of alternative splicing, which also include exon skipping and alternative splice site usage, intron retention (IR) is the least understood. Using deep mRNA-seq and a novel bioinformatic algorithm we termed IRFinder, we discovered that IR is a major regulator of gene expression that has been completely overlooked. Specifically, we showed that:

• Orchestrated intron retention is a normal physiological event that affects nearly a hundred genes during the development of granulocytes.
• Intron-retaining transcripts are exported to the cytoplasm and trigger nonsense-mediated decay to massively downregulate protein expression.

IRFinder corrects mapping bias

When mapping sequencing reads to a transcriptome, certain algorithms may favor splitting reads across two exons rather than continuing them into the intron. IRFinder uses the STAR algorithm with specific options to eliminate this bias.

STAR algorithm, Dobin et al, 2012

Using random hexamer priming to create a cDNA library for sequencing can introduce bias because certain hexamers prime more often than others. IRFinder tests and eliminates this bias with a normalizing step after initial alignment.

Hansen et al, 2010

Numerous introns contain repeat regions and low complexity sequences. These regions are difficult to map to and make IR difficult to detect. IRFinder creates a mappability index based on how genomic sequences remap to their own genome. Sequences with low mappability are excluded from the analysis performed by IRFinder.
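The mappability idea can be illustrated with a minimal sketch. It assumes a toy genome held in memory and uses exact k-mer uniqueness as a stand-in for remapping reads with an aligner; it also ignores the reverse strand and mismatches. The function name `unique_kmer_mask` and the parameter `k` are hypothetical and are not taken from IRFinder.

```python
from collections import Counter


def unique_kmer_mask(genome: str, k: int) -> list[bool]:
    """For each start position, report whether the k-mer starting there
    occurs exactly once in the genome (a crude proxy for 'mappable').

    Assumption: exact matching on the forward strand only; a real
    mappability index would remap simulated reads with an aligner.
    """
    if k <= 0 or k > len(genome):
        return []
    kmers = [genome[i:i + k] for i in range(len(genome) - k + 1)]
    counts = Counter(kmers)
    return [counts[kmer] == 1 for kmer in kmers]


if __name__ == "__main__":
    # "ACGT" occurs twice, so reads starting at positions 0 and 4
    # could not be placed unambiguously.
    print(unique_kmer_mask("ACGTACGTTTGCA", 4))
```

Positions flagged `False` would be the ones excluded from downstream coverage calculations in this simplified picture.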

Quantifying intron retention

IRFinder first searches for evidence of IR. Two criteria are required: at least 3 non-identical sequencing reads across the exon-intron boundary at each end of the intron, and sequence coverage across 90% of the intron (after excluding non-mappable sequences).
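The two evidence criteria can be written down as a small check. This is a sketch under stated assumptions, not IRFinder's implementation: reads are represented by hashable identifiers so that duplicates collapse in a set, coverage is a per-base depth list over the intron, and `mappable` is a per-base boolean mask. All names and the data layout are hypothetical.

```python
def has_ir_evidence(left_reads, right_reads, coverage, mappable,
                    min_reads=3, min_fraction=0.9):
    """Return True if an intron passes both evidence criteria.

    left_reads / right_reads: hashable read descriptors spanning the
        5' and 3' exon-intron boundaries (identical reads collapse).
    coverage: per-base read depth along the intron.
    mappable: per-base booleans; False positions are ignored.
    """
    # Criterion 1: at least `min_reads` non-identical reads at each boundary.
    if len(set(left_reads)) < min_reads or len(set(right_reads)) < min_reads:
        return False
    # Criterion 2: coverage over at least `min_fraction` of mappable bases.
    depths = [d for d, ok in zip(coverage, mappable) if ok]
    if not depths:
        return False
    covered = sum(1 for d in depths if d > 0)
    return covered / len(depths) >= min_fraction


if __name__ == "__main__":
    cov = [2] * 19 + [0]            # 19 of 20 bases covered
    mask = [True] * 20
    print(has_ir_evidence(["r1", "r2", "r3"], ["r4", "r5", "r6"], cov, mask))
```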

The number of transcripts with retained introns (I) is estimated by the median coverage of the mappable intronic sequence. Using the median avoids overestimating intronic expression due to small peaks of expression within introns. These small peaks may represent unknown exons or small non-coding RNAs expressed from the intronic region.
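A short sketch shows why the median is the more robust choice here. The depth values are made up for illustration, and the function name `intronic_abundance` and the per-base `mappable` mask are assumptions rather than IRFinder's own interface.

```python
from statistics import mean, median


def intronic_abundance(coverage, mappable):
    """Estimate I as the median depth over mappable intronic bases."""
    depths = [d for d, ok in zip(coverage, mappable) if ok]
    return median(depths) if depths else 0


if __name__ == "__main__":
    # Hypothetical intron: flat background of ~4-5 reads with a small
    # peak (60, 62) such as an unannotated exon or small non-coding RNA.
    coverage = [4, 5, 4, 60, 62, 5, 4]
    mappable = [True] * len(coverage)
    print("median:", intronic_abundance(coverage, mappable))
    print("mean:  ", round(mean(coverage), 2))
```

In this toy example the median stays at the background level while the mean is pulled upward by the peak.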

[Figure S4: mouse Map4k4 and Sp100 in Prom., Myel. and Gran. samples; exon labels 18, 19 and 21 to 24]

The number of correctly spliced transcripts (N) is estimated by the number of reads that are split across one of the flanking exons of the retained intron and another exon of the same transcript.

Significant changes in IR of a given gene between two experiments are assessed using the Bayesian method for comparing digital counts developed by Audic and Claverie (1997).
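For readers unfamiliar with the Audic and Claverie test, the sketch below implements its published probability of observing y counts in one experiment given x counts in the other, for library sizes n1 and n2: p(y|x) = (n2/n1)^y (x+y)! / (x! y! (1 + n2/n1)^(x+y+1)). Which counts IRFinder passes to the test, and how it turns tails into a p-value, are not stated here, so the choice of inputs and the `lower_tail` helper are assumptions for illustration.

```python
from math import exp, lgamma, log


def audic_claverie_pmf(y, x, n1, n2):
    """P(y | x) under the Audic-Claverie model.

    x, y: counts observed in experiments 1 and 2.
    n1, n2: total counts (library sizes) of experiments 1 and 2.
    Computed in log space with lgamma to avoid overflowing factorials.
    """
    r = n2 / n1
    log_p = (y * log(r)
             + lgamma(x + y + 1) - lgamma(x + 1) - lgamma(y + 1)
             - (x + y + 1) * log(1 + r))
    return exp(log_p)


def lower_tail(y, x, n1, n2):
    """P(Y <= y | x): small values suggest y is unexpectedly low."""
    return sum(audic_claverie_pmf(k, x, n1, n2) for k in range(y + 1))


if __name__ == "__main__":
    # Hypothetical counts: 40 in experiment 1, 12 in experiment 2,
    # with equal library sizes.
    print(lower_tail(12, 40, 1_000_000, 1_000_000))
```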

Results

We have developed a novel algorithm, IRFinder, to detect intron retention in mRNA transcripts from next-generation sequencing data.

1. IRFinder uses parallel processes to generate results faster
2. IRFinder can detect changes in IR between multiple samples
3. IR occurs in hundreds of genes across various tissues
4. If you have mRNA sequencing data, you should use IRFinder!

Intron retention is widespread

[Figure: intron retention across 16 tissues (adipose, adrenal, brain, breast, colon, kidney, heart, liver, lung, lymph node, prostate, skeletal muscle, leukocyte, ovary, testis, thyroid); 6680 genes]

We ran IRFinder on mRNA-seq data from 16 different tissue types from the BodyMap project. We discovered 6680/20636 genes that showed intron retention. We found that retained introns could either be specific to one tissue type or common to multiple tissue types.

[Figure labels: IR; Map; Map again; MMP1; MMP2; RNA-seq read; exons in the genome; C1orf101]

Cell 154, 583–595, August 1, 2013