David-Emlyn Parfitt Shen Lab, Irving Cancer Research Center Using RNA Seq to conduct systems-level analysis of embryonic pluripotency, self-renewal and differentiation

David-Emlyn Parfitt, Columbia Illumina seminar 11/9/2011

Page 1

David-Emlyn Parfitt

Shen Lab, Irving Cancer Research Center

Using RNA Seq to conduct systems-level analysis of embryonic pluripotency, self-renewal and differentiation

Page 2

The molecular regulators of self-renewal and pluripotency are not completely defined or characterized

[Figure: embryonic sources of pluripotent stem cell lines]

mESC: mouse blastocyst (3.5 days), inner cell mass

mEpiSC: mouse egg cylinder (5.5 days), epiblast

hESC: human blastocyst (5-7 days)

Self-renewal and pluripotency: Nanog, Oct4, Sox2, JAK-STAT, MAPK

Novel Master Regulators?

Page 3

Defining the molecular networks associated with stem cell self-renewal, pluripotency and differentiation

[Workflow diagram]

150 Combinatory Chemical Treatments

Genome-Wide GEP Data

Algorithmic analysis (ARACNe, MINDy)

ESC/EpiSC "Interactome"

Master Regulator Analysis

Rank

In vitro and in vivo validation

Which tool to use for expression profiling?

Page 4

Gene Expression Profiling: Microarrays vs RNA-Sequencing

Arrays:

Well defined technique

High throughput

Discrete measurement

Background noise + batch effect

No distinction between isoforms/alleles

Page 5

Gene Expression Profiling: Microarrays vs RNA-Sequencing

RNA Sequencing:

[Diagram: Total RNA (polyadenylated), Fragment, Reverse-transcribe to cDNA]

Page 6

Gene Expression Profiling: Microarrays vs RNA-Sequencing

RNA Sequencing:

[Diagram: Total RNA*, Reverse-transcribe to cDNA]

Single base resolution

Low background noise

Distinction of isoform and allelic expression

Low amount of RNA needed

Algorithmic and logistic challenge

Lengthy library preparation

*Including non-coding RNAs, depending on purification protocol

Page 7

RNA-Sequencing Methodology:

Deciding the parameters

Read length?

-Efficiency vs faithfulness

Single end or paired end reads?

-Efficiency vs faithfulness

-Alignment accuracy

Number of reads?

-Depth of coverage

-Cost

How many reads are needed to effectively cover the mouse exome (~50 Mb)?
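The trade-off between number of reads and depth of coverage can be put into rough numbers. The sketch below is an illustration, not the speaker's calculation: it assumes reads are spread evenly over a ~50 Mb target (the figure quoted on the slide), which real RNA-seq data do not satisfy because transcript abundances differ by orders of magnitude. The function names are hypothetical.

```python
import math


def mean_coverage(num_reads: int, read_length_bp: int, target_size_bp: int) -> float:
    """Average number of times each base of the target is sequenced.

    Assumes reads are distributed uniformly over the target.
    """
    if target_size_bp <= 0:
        raise ValueError("target_size_bp must be positive")
    return num_reads * read_length_bp / target_size_bp


def reads_for_coverage(coverage: float, read_length_bp: int, target_size_bp: int) -> int:
    """Smallest whole number of reads that reaches the requested mean coverage."""
    if read_length_bp <= 0:
        raise ValueError("read_length_bp must be positive")
    return math.ceil(coverage * target_size_bp / read_length_bp)


TARGET_BP = 50_000_000  # ~50 Mb target, as quoted on the slide

print(mean_coverage(20_000_000, 100, TARGET_BP))   # 40.0
print(mean_coverage(100_000_000, 100, TARGET_BP))  # 200.0
print(reads_for_coverage(40, 100, TARGET_BP))      # 20000000
```

Mean coverage only bounds the problem. Whether low-abundance transcripts are detected depends on how reads are distributed between transcripts, which is why the later slides look at detection empirically.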

Page 8

Deciding the parameters:

How many 100 bp reads are necessary for comprehensive coverage of the mouse genome?

RPKM:

Normalized measurement of transcript abundance

Reads per kilobase of exon model per million mapped reads

RPKM for a particular transcript does not change when the overall number of reads changes, and it is the same for transcripts with the same abundance, whatever their length
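The RPKM definition above translates directly into a short calculation. This is a minimal sketch of the standard formula; the function name and the example counts are made up for illustration and are not data from the talk.

```python
def rpkm(read_count: int, transcript_length_bp: int, total_mapped_reads: int) -> float:
    """Reads per kilobase of exon model per million mapped reads."""
    if transcript_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and total mapped reads must be positive")
    kilobases = transcript_length_bp / 1_000
    millions_mapped = total_mapped_reads / 1_000_000
    return read_count / (kilobases * millions_mapped)


# A 2 kb transcript with 400 reads in a 20-million-read library.
print(rpkm(400, 2_000, 20_000_000))  # 10.0

# Doubling the sequencing depth doubles the count (in expectation),
# so the RPKM is unchanged.
print(rpkm(800, 2_000, 40_000_000))  # 10.0

# A transcript twice as long at the same abundance collects twice the reads,
# and again gets the same RPKM.
print(rpkm(800, 4_000, 20_000_000))  # 10.0
```

The second and third calls correspond to the two properties stated on the slide: invariance to total read number and equal values for transcripts of equal abundance.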

Page 10

Deciding the parameters:

How many 100 bp reads are necessary for comprehensive coverage of the mouse genome?

100 million 100 bp single-end (SE) reads

Page 11

Setting the transcript ‘detection’ threshold

                                          RA-72H-1   RA-72H-2   CM     CM
Number of raw reads (million)             97.3       88         87     95
Number of mapped reads (million)          97         87.7       87     94
Transcripts with RPKM > 0.01 (of 27641)   72%        77%        84%    84%

Page 12

Setting the transcript ‘detection’ threshold

                                          RA-72H-1   RA-72H-2   CM     CM
Number of raw reads (million)             97.3       88         87     95
Number of mapped reads (million)          97         87.7       87     94
Transcripts with RPKM > 1 (of 27641)      49%        48%        51%    52%
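The effect of the detection threshold in the two tables can be reproduced on any RPKM vector by counting the transcripts that exceed it. The sketch below uses a small invented list of RPKM values, not the data from the talk, and the function name is hypothetical.

```python
from typing import Iterable


def fraction_detected(rpkm_values: Iterable[float], threshold: float) -> float:
    """Fraction of transcripts whose RPKM is strictly above the threshold."""
    values = list(rpkm_values)
    if not values:
        raise ValueError("rpkm_values must not be empty")
    detected = sum(1 for value in values if value > threshold)
    return detected / len(values)


# Toy RPKM values for eight transcripts (illustrative only).
toy_rpkm = [0.0, 0.005, 0.02, 0.5, 1.0, 3.0, 12.0, 85.0]

print(fraction_detected(toy_rpkm, 0.01))  # 0.75
print(fraction_detected(toy_rpkm, 1))     # 0.375
```

As in the tables, raising the threshold from 0.01 to 1 removes the weakly expressed transcripts from the "detected" set, so the choice of threshold has to be stated alongside any detection percentage.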

Page 13

RPKM is constant, regardless of number of reads

“RPKM for a particular transcript does not change when overall number of reads changes”

[Figure: two panels, labelled r2=0.9 and r2=0.97]
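One way to check this kind of claim is to downsample a library and correlate the RPKM values before and after. The sketch below does this on simulated reads, since the talk's data are not available here. The transcript lengths, abundances, library sizes and helper names are all assumptions made for illustration, and the simulation is not the analysis behind the r2 values on the slide.

```python
import random
from collections import Counter


def pearson_r2(xs, ys):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy * sxy / (sxx * syy)


def rpkm_profile(reads, lengths_bp):
    """RPKM per transcript, given a list of transcript indices (one per read)."""
    counts = Counter(reads)
    millions_mapped = len(reads) / 1_000_000
    return [
        counts[i] / ((lengths_bp[i] / 1_000) * millions_mapped)
        for i in range(len(lengths_bp))
    ]


rng = random.Random(0)  # fixed seed so the run is repeatable
n_transcripts = 200
lengths_bp = [rng.randint(500, 5_000) for _ in range(n_transcripts)]
abundance = [rng.lognormvariate(0, 1.5) for _ in range(n_transcripts)]

# A read lands on a transcript with probability proportional to abundance x length.
weights = [a * length for a, length in zip(abundance, lengths_bp)]
full_library = rng.choices(range(n_transcripts), weights=weights, k=200_000)

# Downsample to a quarter of the reads without replacement.
subsample = rng.sample(full_library, 50_000)

r2 = pearson_r2(rpkm_profile(full_library, lengths_bp),
                rpkm_profile(subsample, lengths_bp))
print(f"r2 between full and downsampled RPKM: {r2:.3f}")
```

In this simulation the two profiles are expected to agree closely, because RPKM divides each count by the library size. The remaining disagreement comes from sampling noise in transcripts with few reads.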

Page 14

RPKM becomes relatively constant with increased read number

[Figure: median RPKM (y-axis, 0.5 to 0.95) against reads in millions (x-axis, 20 to 80); labelled values 0.749 and 0.725]

i.e. we are not detecting significantly more genes/transcripts above 20-30 million reads

Page 15

How many 100 bp reads are necessary for comprehensive coverage of the mouse genome?

[Figure: percent of final transcripts (y-axis, 0.7 to 1) against reads in millions (x-axis, 0 to 100), one curve per transcript abundance bin (RPKM): [60,), [30,60), [15,30), [7.5,15), [3.75,7.5), [0.01,3.74)]

Between 20 and 30 million 100 bp reads are sufficient to capture ~100% of the most abundant transcripts and 95% of the least abundant
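The "percent of final transcripts" curves can be sketched as follows: bin transcripts by their RPKM in the full data set, then ask what fraction of each bin is still detected at a reduced read depth. The code is a hedged illustration with invented transcript IDs and values. The bin edges follow the figure legend, except that the lowest bin is assumed to end at 3.75 so the bins are contiguous, and the detection rule (RPKM > 0.01) is taken from the earlier threshold slide.

```python
import bisect

# Lower edges of the abundance bins (RPKM); the last bin is open-ended.
BIN_EDGES = [0.01, 3.75, 7.5, 15.0, 30.0, 60.0]
BIN_LABELS = ["[0.01,3.75)", "[3.75,7.5)", "[7.5,15)", "[15,30)", "[30,60)", "[60,inf)"]


def bin_index(rpkm_value):
    """Index of the abundance bin, or None if below the lowest edge."""
    i = bisect.bisect_right(BIN_EDGES, rpkm_value) - 1
    return i if i >= 0 else None


def fraction_of_final_detected(final_rpkm, subsample_rpkm, threshold=0.01):
    """Per-bin fraction of 'final' transcripts still detected in a subsample.

    final_rpkm and subsample_rpkm map transcript ID -> RPKM; transcripts
    missing from the subsample count as undetected. Empty bins are omitted.
    """
    totals = [0] * len(BIN_LABELS)
    found = [0] * len(BIN_LABELS)
    for transcript_id, value in final_rpkm.items():
        i = bin_index(value)
        if i is None:
            continue  # not detected even in the full data set
        totals[i] += 1
        if subsample_rpkm.get(transcript_id, 0.0) > threshold:
            found[i] += 1
    return {
        label: found[i] / totals[i]
        for i, label in enumerate(BIN_LABELS)
        if totals[i] > 0
    }


# Toy example: t2 and t6 are low-abundance transcripts lost at reduced depth.
final = {"t1": 0.5, "t2": 2.0, "t3": 5.0, "t4": 40.0, "t5": 120.0, "t6": 0.02}
reduced = {"t1": 0.6, "t2": 0.0, "t3": 4.0, "t4": 41.0, "t5": 118.0}

for label, fraction in fraction_of_final_detected(final, reduced).items():
    print(label, round(fraction, 2))
```

Repeating this at several subsampled depths and plotting the per-bin fractions against read number would give saturation curves of the kind shown on the slide, with the low-abundance bins the last to saturate.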

Page 16

Acknowledgements

Shen Lab:

Michael Shen

Hui Zhao

Shen Lab Members

Califano Lab:

Andrea Califano

Mariano Alvarez

Yufeng Shen

Xiaoyun Sun

Olivier Couronne

Erin Bush