Cancer Systems Biology: RNA-Seq and Differential Expression Analysis
Taking advantage of a Measurement Revolution
July 25, 2013 Anne Deslattes Mays
Wellstein/Riegel Laboratory Mentor: Anton Wellstein, MD, PhD
7/25/13 Wellstein/Riegel Laboratory 1
Talk Outline
• On the Shoulders of Giants
• Sequencing Timeline
• RNASeq for Everyone
• RNA-Sequencing Details
• Differential Expression Analysis
• Causality
• Cancer Therapeutics Example
• Ask Bigger Questions – Sequencing Everything
Rosalind Franklin “pioneered use of x-rays to create images of unorganized matter – such as large biological molecules – not just single crystals”
http://www.pbs.org/wgbh/aso/databank/entries/bofran.html
“Franklin made equipment adjustments to produce an extremely fine beam of x-rays. She extracted finer DNA fibers than ever before and arranged them in parallel bundles. Studied fibers’ reactions to humid conditions. … allowed her to discover crucial keys to DNA’s structure…. Wilkins shared this with Watson & Crick at Cambridge without her knowledge…”
Sequencing Timeline
Human Sequencing Timeline
Key Technical Advances: Celera Human Sequence done in one location on the largest supercomputer in private hands at that time
Cancer Systems Biology: Taking advantage of the measurement revolution
Declining sequencing costs, decreasing computing costs. How do you leverage all this data?
GEO May 25, 2012
GEO June 25, 2013
Here is an example RNA-Seq Workflow
Experimental Design
Sample Collection
Quality Control / Read Trimming
Differential Analysis
Transcript Identification
Pathway Analysis
Feature Discovery
Sequencing
http://rnaseq.uoregon.edu/index.html
Replicates: Type I and Type II errors
Detecting Signal vs. Noise
What is the goal of the sequencing experiment?
Before library construction, RNA quality must be assessed
1. Most vendors and cores will assess the quality of the RNA before sequencing
2. RNA quality is important to determine before sequencing begins
Garbage in == garbage out
RNA-Seq
Three steps to get to a fresh sequence with the Illumina Genome Sequence Analyzer
• Library generation
• Cluster generation
• Sequencing
Library Construction: Messenger RNAs are poly-A selected from total RNA and fragmented, and cDNA is synthesized
Before Library Construction
1. Poly-A selection (total RNA -> mRNA)
2. mRNA fragmentation
3. First strand synthesis (here we stop if we want to maintain strand specificity)
4. Second strand synthesis
Other techniques
1. Ribo-Zero
2. RiboMinus
Library Construction: End repair and adenylation result in adapter-ligation-ready constructs
cDNA (single or double stranded)
1. cDNA is blunt-end repaired and phosphorylated (B)
2. An A base is added to prepare for indexed adapter ligation (C)
Library Construction: Adapter ligation results in cluster-generation-ready constructs
Index adapter ligation and product ready for amplification on the cBot or the cluster station
1. Strand-specific tags are added to the A base: ligate index adapter (D)
2. Denature and amplify for final product (E)
Cluster Generation: In the Illumina cBot system, single molecules are isothermally amplified in a flow cell to prepare them for sequencing
Single DNA molecules hybridize to the lawn of oligos grafted to the surface of the flow cell
1. Oligo lawn
2. Oligos hybridize to the adapters that had been ligated to the library fragments, which flow through the cell
Cluster Generation: Bound fragments are extended to make copies, and reverse strands are cleaved and washed away
Bridge amplifications resulting in 100s of millions of unique clusters
1. Each fragment is clonally amplified through a series of extensions and isothermal bridge amplifications
2. Reverse strands are cleaved and washed away
3. Ends are blocked
4. Sequencing primer is hybridized to the DNA template
5. Libraries are ready for sequencing
Sequencing: 100s of millions of clusters sequenced simultaneously
4 fluorescently labeled, reversibly terminated nucleotides
1. Each base competes for addition
2. Natural competition ensures highest accuracy
3. After each round of synthesis, clusters are excited by a laser, emitting a color that identifies the newly added base
4. Fluorescent label and blocking group are removed, allowing for addition of the next nucleotide
5. Proprietary (Illumina) chemistry reads a base in each cycle
6. Allows for accurate sequencing through difficult regions such as homopolymers and repetitive sequence
There are other ways to Inquire about the Transcriptome
• Array Based Technologies
  – Affymetrix
  – Agilent
  – Known genes and hybridization protocols
• Microarray
  – 20,000+ array experiments on a single platform
  – Edge effects
  – False positives / false negatives
• Bead-based arrays
• Tiling arrays
• SAGE
What is unique about RNA-Seq?
• Allows you to discover and profile the entire transcriptome of any organism
• No probes or primers to design
• Novel transcripts
• Novel isoforms
• Alternative splice sites
• Rare transcripts
• cSNPs
All of this in one experiment
After sequencing, we must then perform RNA-Seq Data Analysis
After sequencing…
1. Quality control – trim your reads
2. Count reads
   • Align to genome
   • Align to transcriptome
3. Interpret data
   • Statistical tests (differential expression analysis)
   • Visualization (mapped reads)
   • Pathway analysis
Not so simple – big data, big compute requirements
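The quality-control step listed first can be illustrated with a minimal sketch of 3' quality trimming. This is not how Trimmomatic or any specific tool is implemented; the function names, the Phred+33 encoding and the threshold of 20 are assumptions chosen for illustration.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_string]


def trim_3prime(sequence, quality_string, min_quality=20):
    """Drop bases from the 3' end until the last base meets min_quality."""
    scores = phred_scores(quality_string)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_quality:
        end -= 1
    return sequence[:end], quality_string[:end]


# Hypothetical read: the last three bases have low quality ('#' is Phred 2).
trimmed_seq, trimmed_qual = trim_3prime("ACGTACGTAC", "IIIIIII###")
print(trimmed_seq, trimmed_qual)
```

Real trimmers usually add sliding-window and adapter-removal logic on top of this simple end trimming.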
RNASeq flow chart – reference (steps 1-4): http://trinityrnaseq.sourceforge.net/genome_guided_trinity.html
Step 1: align-reads
Align-reads: GSNAP is used to align reads to the genome sequence.
Flow chart elements: FASTQ PE* reads; Reference Genome Assembly WGS; Existing gene models (gtf files w/ tss ids)*; Gene models mapped to reference; gsnap; Trimmomatic; FastQC; trimmed PE* reads; Quality control consensus; per read length graphs; GSNAP-aligned BAM files; samtools; Gsnap.CoordSorted.bam
• Tss ids = transcription start site ids, in a gtf file format
• PE = paired end
• The gene models that are built with the PASA pipeline can be input to TopHat
Legend
• Unshaded rectangle: code to be run (a process)
• Shaded rectangle: a file or a graphic, which may be an input and/or an output
• Dark rectangle: a file that can be displayed as a track in crop-pedia
RNA Alternative Splicing: Why you need gapped aligners
RNASeq flow chart – reference (steps 1-4): http://trinityrnaseq.sourceforge.net/genome_guided_trinity.html
Step 2: assemble-reads
assemble-reads: Trinity is used to assemble the RNA-Seq reads in each partition. This can be done in a massively parallel manner, typically requires little RAM compared with whole de novo RNA-Seq assemblies, and can be executed on standard hardware. The first step (prep_rnaseq_alignments_for_genome_assisted_assembly.pl) partitions the reads according to covered regions.
Flow chart elements:
Gsnap.CoordSorted.bam
prep_rnaseq_alignments_for_genome_assisted_assembly.pl
find Dir_* -name "*reads" > read_files.list
read_files.list
GG_write_trinity_cmds.pl
Trinity_GG.cmds
ParaFly
find Dir_* -name "*inity.fasta" -exec cat {} + | inchworm_accession_incrementer.pl > Trinity_GG.fasta
Trinity_GG.fasta
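The final gathering step (collecting every per-partition assembly into Trinity_GG.fasta) can be sketched in Python as follows. This is an illustration only: the function name and the "GG<n>|" header prefix are assumptions, and the prefix is not claimed to match what the Trinity accession-incrementer script actually writes.

```python
import glob
import os


def gather_trinity_fasta(root=".", out_path="Trinity_GG.fasta"):
    """Concatenate every *inity.fasta file found under root/Dir_* into out_path.

    Each FASTA header gets a running 'GG<n>|' prefix (one number per input
    file) so that accessions from different partitions stay unique. The
    prefix scheme is illustrative, not the Trinity helper's exact output.
    Returns the number of input files that were merged.
    """
    pattern = os.path.join(root, "Dir_*", "**", "*inity.fasta")
    merged = 0
    with open(out_path, "w") as out:
        for path in sorted(glob.glob(pattern, recursive=True)):
            merged += 1
            with open(path) as handle:
                for line in handle:
                    if line.startswith(">"):
                        line = ">GG%d|%s" % (merged, line[1:])
                    if not line.endswith("\n"):
                        line += "\n"  # keep records separated across files
                    out.write(line)
    return merged


# Usage (run from the directory that holds the Dir_* partitions):
# gather_trinity_fasta(".", "Trinity_GG.fasta")
```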
RNASeq flow chart – reference (steps 1-4): http://trinityrnaseq.sourceforge.net/genome_guided_trinity.html
Steps 3 and 4: align-transcripts and assemble-transcript alignments
Flow chart elements:
Trinity_GG.fasta
Launch_PASA_pipeline.pl
Pasa_databasename.pasa_assemblies.denovo_transcript_isoforms.gtf
Pasa_databasename.pasa_assemblies.denovo_transcript_isoforms.bed
Pasa_databasename.pasa_assemblies.denovo_transcript_isoforms.gff3
Pasa_databasename.pasa_assemblies.denovo_transcript_isoforms.fasta
RNASeq flow chart – Step 5 – Tuxedo Suite: using the output of the Trinity genome-guided assembly and the PASA and keygene annotation pipelines -> call the Tuxedo suite (in parallel with calling the abundance estimator RSEM)
Flow chart elements:
Gff3 (gene model)
Gff3togtf (convert to gtf format)
Gtf (gene model)
tophat (calls Bowtie2)
Junctions.bed
Accepted.hits.sam
RNASeq Quantitation and Differential Analysis
Quantitation (matrix file with counts per isoform); model building/differential analysis
Flow chart elements:
Trinity.fasta
Trinity genome guided assembly
Tuxedo suite
Abundance estimation RSEM
Transcripts .gtf/.gff*
trimmed PE* reads
RSEM.isoform.results
Limma: model design/contrast matrix building
randomForest, pcAlg
Genie3.R, DREAM4
Accepted.hits.sam
cuffdiff2
• Transcript annotation file produced by cufflinks, cuffcompare or other source
• Counts and read group tracking files also created
isoforms.fpkm_tracking
genes.fpkm_tracking
cds.fpkm_tracking
tss_groups.fpkm_tracking
isoform_exp.diff
gene_exp.diff
tss_group_exp.diff
cds_exp.diff
How much RNA-sequencing data, how much computation power, and where do you go to compute?
How much RNA-sequencing data?
1. 20 million paired-end reads ~ 2 GB of data
2. 100 million paired-end reads ~ 10 GB of data
How much computation power?
1. The more memory and processors, the less time it takes to compute
2. If you outsource the analysis, you will still need to store the results somewhere
Where to compute?
• Amazon Web Services: S3 storage; EC2 (Elastic Compute Cloud) on-demand computational facility
• Georgetown University High Performance Computer Core: matrix.georgetown.edu
• UPENN Galaxy services
A growing number of tools enable RNA-Seq analysis
What percentage of reads are covered? What percentage of reads are mapped?
3' bias on transcript reads
1. 60-80% of reads are mapped
2. The highest percentage of mapped reads falls at the 3' end
3. Reads need to be quality trimmed
Mapping tools bias exons to known genes
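The "percentage of reads mapped" figure can be computed directly from the FLAG column of a SAM file. The sketch below is a minimal illustration (the function name and the example records are hypothetical); it relies only on the SAM convention that bit 0x4 marks an unmapped read and bits 0x100/0x800 mark secondary and supplementary records.

```python
def mapping_rate(sam_lines):
    """Return the fraction of primary SAM records that are mapped.

    Header lines (starting with '@') and blank lines are skipped. Secondary
    (0x100) and supplementary (0x800) records are ignored so that each
    sequenced read (mate) is counted once. A record is unmapped when
    bit 0x4 of its FLAG field is set.
    """
    total = mapped = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue
        flag = int(line.split("\t")[1])
        if flag & 0x100 or flag & 0x800:
            continue
        total += 1
        if not flag & 0x4:
            mapped += 1
    return mapped / total if total else 0.0


# Hypothetical records: two mapped reads, one unmapped, one secondary hit.
records = [
    "@HD\tVN:1.0",
    "r1\t0\tchr21\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    "r3\t16\tchr21\t200\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r3\t256\tchr1\t50\t0\t4M\t*\t0\t0\tACGT\tIIII",
]
print("%.1f%% mapped" % (100 * mapping_rate(records)))
```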
Galaxy is a web-based tool committed to enabling researchers (for more than just RNA-Seq)
How to visualize mapped results?
• UCSC Genome Browser
• Integrated Genome Browser (IGB)
• Integrative Genomics Viewer (IGV)
Many shared formats, reading many of the outputs generated by the programs, ability to generate one's own tracks
[Figure: UCSC Genome Browser view of chr21 (hg19, 23,600,000-23,650,000; 50 kb scale bar) showing the custom tracks C7 Random and C7 Targeted (Illumina read IDs) alongside UCSC Genes, RefSeq Genes, human mRNAs and spliced ESTs, dbSNP 137, Multiz alignment and mammalian conservation, and ENCODE tracks (transcription factor ChIP-seq, layered H3K27Ac, DNaseI hypersensitivity clusters, ChIA-PET, 5C, DNA methylation).]
What do RNA-Seq reads look like for GAPDH?
Repeat-masked, allowing 1/2 mismatched bases; BLAT'd reads viewed in IGB 6.7.2
RNA-Seq Differential Expression analysis
What does GAPDH look like in terms of quantitation?
Gene | TOTAL BM: RPKM, 3SEQ Counts, BLAT Reads | HPP: RPKM, 3SEQ Counts, BLAT Reads
CD34 0.7 340 230 8 8 14
BST1 19.7 5374 31 31
CD133 0.2 173 176 16 16 33
THY1 0 7 4 4
A12 1 0
A5 0 0
ALK 0 9 24 0 0 3
B9 0 0
C1 0 0
C2 0 0
C7 0 0
E7 0 0
E9 2 0
F6 0 0
G12 0 0
GAPDH 3013.2 727831 356289 120.8 5559 2670
H3 0 0
BLAT read raw counts ratio == 3SEQ counts ratio ~= 130 to 1; RPKM ratio ~= 24.3
RNA-Seq Quantification Challenge: A problem that exists with RNA-Seq data that doesn't exist with array data: longer transcripts produce more reads than shorter transcripts
One solution to account for this is RPKM (FPKM is used by Cufflinks)
RPKM = 10^9 × C / (N × L), which is simply C divided by N in millions and by L in kilobases
C(gene) = the number of mappable reads that fall onto a gene's exons
N = total number of mappable reads in the experiment
L(gene) = the sum of the exons in base pairs
Wold (2008)
RPKM – reads per kilobase per million
CPM – counts per million
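As a worked illustration of the RPKM and CPM definitions, the following sketch applies them to made-up numbers. The function names and all input values are hypothetical; only the formulas come from the slide.

```python
def rpkm(count, total_mapped_reads, exon_length_bp):
    """Reads per kilobase of exon per million mapped reads: 1e9 * C / (N * L)."""
    return 1e9 * count / (total_mapped_reads * exon_length_bp)


def cpm(count, total_mapped_reads):
    """Counts per million mapped reads: 1e6 * C / N (no length correction)."""
    return 1e6 * count / total_mapped_reads


# Hypothetical experiment with 20 million mapped reads.
N = 20_000_000
short_gene = rpkm(count=500, total_mapped_reads=N, exon_length_bp=2_000)
long_gene = rpkm(count=500, total_mapped_reads=N, exon_length_bp=4_000)

# Same read count, but the gene that is twice as long gets half the RPKM,
# while CPM is identical for both because it ignores transcript length.
print(short_gene, long_gene, cpm(500, N))
```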
RNA-Seq Quantification Challenge: the DESeq method uses the geometric mean of counts in all samples
DESeq Method: Construct a "reference sample" by taking, for each gene, the geometric mean of the counts in all samples. To get the sequencing depth of a sample relative to the reference, calculate for each gene the quotient of the count in your sample divided by the count in the reference sample. Now you have, for each gene, an estimate of the depth ratio. Take the median of all the quotients to get the relative depth of the library. The 'estimateSizeFactors' function of the DESeq package does this calculation.
DESeq: an R package that works with raw counts to determine genes differentially expressed across samples
• Simon Anders
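The median-of-ratios procedure described above can be sketched in plain Python. This is an illustration of the idea, not the DESeq package's own code; the function name, the dictionary input layout, the choice to skip genes with a zero count in any sample, and the example counts are all assumptions.

```python
import math
import statistics


def size_factors(counts):
    """Median-of-ratios size factors.

    counts maps each sample name to a list of raw per-gene counts, with
    genes in the same order for every sample.
    """
    samples = list(counts)
    n_genes = len(counts[samples[0]])
    # Use only genes with a non-zero count in every sample, so that the
    # geometric mean (computed through logarithms) is defined.
    usable = [g for g in range(n_genes)
              if all(counts[s][g] > 0 for s in samples)]
    # Log of the per-gene geometric mean across samples: the "reference sample".
    log_reference = {
        g: sum(math.log(counts[s][g]) for s in samples) / len(samples)
        for g in usable
    }
    # Size factor = median over genes of (sample count / reference count).
    return {
        s: math.exp(statistics.median(
            [math.log(counts[s][g]) - log_reference[g] for g in usable]))
        for s in samples
    }


# Hypothetical counts: sample B was sequenced twice as deeply as sample A.
example = {"A": [10, 100, 50, 0], "B": [20, 200, 100, 4]}
factors = size_factors(example)
print(factors)
```

Dividing each sample's raw counts by its size factor puts the samples on a common scale before testing for differential expression.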
Given a list of differentially expressed genes, enrichment analysis should now be performed
• Enrichment analysis lets the researcher leverage documented experiments, which provide evidence for genes' roles in pathways and functions, to interpret the results and significance of their own experiments
• DAVID
  – Gene ontology
  – Functional ontology
• REVIGO
  – Output of DAVID may be placed in REVIGO for further interpretation and statistical exploration of the significance of discovered sets of genes
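To make the idea of enrichment concrete, here is a minimal sketch of a one-sided hypergeometric over-representation test. It illustrates the general principle only: DAVID reports its own modified Fisher exact statistic, and the function name and all numbers below are hypothetical.

```python
from math import comb


def enrichment_p_value(population, annotated, selected, overlap):
    """Probability of seeing at least `overlap` annotated genes by chance.

    population: number of genes in the background
    annotated:  background genes that belong to the pathway or GO term
    selected:   number of differentially expressed genes
    overlap:    differentially expressed genes that carry the annotation
    """
    upper = min(annotated, selected)
    total = comb(population, selected)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, selected - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Hypothetical case: 100 differentially expressed genes out of 20,000,
# 10 of which fall in a 200-gene pathway (about 1 expected by chance).
p = enrichment_p_value(population=20_000, annotated=200, selected=100, overlap=10)
print(p)
```

In practice many terms are tested at once, so the resulting p-values need a multiple-testing correction.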
Using differentially expressed genes, biological pathways should be explored
• Differentially expressed genes are put into programs such as Pathway Studio or Ingenuity
• Shortest path programs and
• Canonical pathway analysis
• Enables a researcher to reverse engineer the pathways expressed in the course from a healthy response to a diseased response
• Ideally a pathway reveals the observed phenotype, connecting the gene expression program with the phenotype (genotype – gene expression program – phenotype)
RNA-Sequencing: What is it good for?
• Transcript Annotation
  – Mutation identification
  – Isoform determination
  – Alternative splice variation
• Differential Gene Expression
  – Phenotypically segregating experiments
  – Allows us to get at the "how" in looking at the response of an organism, within a particular cell population, to events
  – Good and careful design will allow us to unfold the dynamics of this response and identify targets for altering disease responses to improve one's chances of surviving
http://bayes.cs.ucla.edu/home.htm
Acknowledgements
Dr. Anton Wellstein
Dr. Anna Riegel
Dr. Marcel Schmidt
Dr. Elena Tassi
The entire lab: Elena, Virginie, Ghada, Ivana, Eveline, Khalid, Eric, the entire Wellstein/Riegel laboratory
My Committee
Dr. Yuri Gusev
Dr. Anatoly Dritschilo
Dr. Michael Johnson
Dr. Christopher Loffredo
Dr. Habtom Ressom
Dr. Terry Ryan (external committee member)
High Performance Core Group, Steve Moore, especially Woonki Chung
Amazon Cloud Services
Dr. Ann Loraine, UNC, IGB Developer
Brian Haas, Author Trinity Suite
Some Resources
• http://rnaseq.uoregon.edu/index.html
• http://dx.doi.org/10.1038/npre.2010.4282.1 (DESeq)
• http://galaxy.psu.edu/
• http://seqanswers.com/
• http://www.broadinstitute.org/igv/
• http://bioviz.org/igb/index.html
• http://www.illumina.com
• http://www.otogenetics.com
• http://www.dnanexus.com
• http://bioconductor.org/packages/2.12/bioc/html/limma.html
• http://trinityrnaseq.sourceforge.net/
• http://trinityrnaseq.sourceforge.net/genome_guided_trinity.html
• http://cufflinks.cbcb.umd.edu/
• http://brb.nci.nih.gov/BRB-ArrayTools.html
• http://www.modernatx.com/
Systems Biology History (Wikipedia)
• Systems biology roots found in
  – Quantitative modeling of enzyme kinetics
  – Mathematical modeling of population growth
  – Simulations to study neurophysiology
  – Control theory and cybernetics
• Theorists
  – Ludwig von Bertalanffy – General Systems Theory
  – Alan Lloyd Hodgkin and Andrew Fielding Huxley – constructed a mathematical model that explained the action potential propagating along the axon of a neuronal cell
  – Denis Noble – first computer model of the heart pacemaker
Scientific knowledge is limited (and advanced) by the limits (and advancements) of measurement
• Ilya Shmulevich, Genomic Signal Processing: “Validity of the model involves observation and measurement, scientific knowledge is limited by the limits of measurement”
• Erwin Schrödinger, Science, Theory and Man: “It really is the ultimate purpose of all schemes and models to serve as scaffolding for any observations that are at all means observable”