Single-Cell Transcriptome Analysis of Pluripotent Stem Cells
Nacho Caballero
Center for Regenerative Medicine, Boston University
Jun 12, 2017
From raw data to insights
Analysis pipeline: Raw data, Initial QC, Alignment and Quantification, Outlier analysis, Gene selection and clustering, Insights
Raw data
Barcoded sequencing files are demultiplexed into one pair of sequencing files per cell
@NB500996:64:HNM72BGX2:3:12510:12240:93662:N:0:TAGTCATGCTACTGTCTAGAGCTTGTCTCAATGGATCTAGAACTTCATCGCCCTCTGATC…
+
AAAAAEEEE<E/EEEEEEEEE6EE/6AEEE//E/EEE/AEA/EAEEEE</6A…
…
Millions of reads
Metadata file
Cell_id   Condition1   Condition2
Cell_01   BU3          red
Cell_02   BU3          green
Cell_03   C17          red
Cell_04   C17          green
Cell_05   BU3          red
Cell_06   BU3          green
…
Short, simple names
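As a rough illustration of the file layout, a FASTQ file stores each read as four lines (header, sequence, a `+` separator, and one quality character per base), so counting reads amounts to counting four-line records. The sketch below assumes uncompressed text input; the helper name `parse_fastq` and the toy reads are hypothetical and not taken from the slides.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    header: str
    sequence: str
    quality: str


def parse_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield one record per four-line FASTQ entry (hypothetical helper)."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        yield FastqRecord(header[1:], sequence, quality)


# Toy two-read file (made-up reads, for illustration only).
toy = io.StringIO("@read1\nACGT\n+\nAAEE\n@read2\nTTGA\n+\nEE/6\n")
records = list(parse_fastq(toy))
print(len(records), "reads")
```

Real sequencing files are usually gzip-compressed, in which case the handle could come from `gzip.open(path, "rt")` instead.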
Initial QC
Read length is often inversely correlated with base-pair sequencing quality
[Figure: average sequence quality by position in read, in three panels: good cDNA quality, average quality, bad quality]
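The quality-by-position curve can be sketched from the quality lines of a FASTQ file. This minimal example assumes the common Phred+33 encoding (score = character code minus 33); the function name and the toy quality strings are hypothetical.

```python
from typing import List


def mean_quality_by_position(quality_strings: List[str], offset: int = 33) -> List[float]:
    """Average Phred score at each read position (Phred+33 encoding assumed)."""
    longest = max(len(q) for q in quality_strings)
    totals = [0] * longest
    counts = [0] * longest
    for q in quality_strings:
        for pos, char in enumerate(q):
            totals[pos] += ord(char) - offset
            counts[pos] += 1
    # Shorter reads simply contribute to fewer positions.
    return [t / c for t, c in zip(totals, counts)]


# Toy quality strings (made up): quality drops at the last positions.
toy_qualities = ["AAEE/", "AEE6/", "EEEA"]
profile = mean_quality_by_position(toy_qualities)
print([round(q, 1) for q in profile])
```

A profile that falls steadily toward the end of the read is the pattern the slide describes for longer reads.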
More reads are generally better than longer reads (safe target: 200K reads per cell, 150 bp long)
[Figure: number of reads per cell, on an axis from 1K to 1M, across 400 cells]
The Fluidigm protocol makes it extremely easy to lose entire rows or columns
[Figure: cells arranged by rows and columns]
Alignment and Quantification
We quantify the gene expression in a cell by counting how many reads align to each gene
Example read: AGGCAGAGGGGCGAGATGCA…
1358 reads aligned to the SFTPC gene in this cell
Read type                                    Number of reads per cell
Raw                                          333,229
Unaligned                                    81,673
Aligned, but non-uniquely                    28,813
Aligned uniquely, but not to a gene          32,774
Aligned uniquely, but span multiple genes    20,838
Aligned uniquely to a single gene            167,241
40-60% of the raw reads cannot be used to quantify gene expression
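The arithmetic behind that range can be checked directly from the per-cell counts in the table. This is only a small worked example; the variable names are illustrative.

```python
# Per-cell read counts copied from the table in the slides.
reads = {
    "Raw": 333_229,
    "Unaligned": 81_673,
    "Aligned, but non-uniquely": 28_813,
    "Aligned uniquely, but not to a gene": 32_774,
    "Aligned uniquely, but span multiple genes": 20_838,
    "Aligned uniquely to a single gene": 167_241,
}

raw = reads["Raw"]
for read_type, n in reads.items():
    # Show each category as a share of the raw reads.
    print(f"{read_type:45s}{n:>9,d}{n / raw:8.1%}")

# Only reads that align uniquely to a single gene are counted toward expression.
usable = reads["Aligned uniquely to a single gene"] / raw
unusable = 1 - usable
print(f"usable: {usable:.1%}, unusable: {unusable:.1%}")
```

For this cell roughly half of the raw reads end up contributing to gene counts, which sits in the middle of the 40-60% range.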
Outlier analysis
Filter out cells with fewer than 5K aligned reads
[Figure: number of aligned reads per cell, on an axis from 1K to 1M, across 120 cells]
Filter out cells with a high percentage of mitochondrial gene counts (indicative of a broken cell membrane)
[Figure: percentage of mitochondrial gene counts per cell, on an axis from 25% to 100%, across 48 cells]
Filter out cells with fewer than 2K expressed genes
[Figure: number of expressed genes per cell, on an axis from 2K to 6K, across 30 cells]
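The three filters can be combined into a single pass over per-cell summaries, sketched below. The 5K and 2K cutoffs come from the slides; the mitochondrial cutoff of 25% is an assumption made only for this example, since the slides give no number, and the field names and cell values are hypothetical.

```python
from typing import Any, List, Mapping

MIN_ALIGNED_READS = 5_000     # cutoff stated on the slide
MIN_EXPRESSED_GENES = 2_000   # cutoff stated on the slide
MAX_MITO_FRACTION = 0.25      # ASSUMPTION: the slides give no numeric cutoff


def passes_qc(cell: Mapping[str, Any]) -> bool:
    """Return True if a cell survives all three outlier filters."""
    total = cell["total_gene_counts"]
    # A cell with no gene counts at all is treated as failing the mito filter.
    mito_fraction = cell["mito_gene_counts"] / total if total else 1.0
    return (
        cell["aligned_reads"] >= MIN_ALIGNED_READS
        and cell["expressed_genes"] >= MIN_EXPRESSED_GENES
        and mito_fraction <= MAX_MITO_FRACTION
    )


# Made-up per-cell summaries, for illustration only.
cells: List[Mapping[str, Any]] = [
    {"cell_id": "Cell_01", "aligned_reads": 180_000, "expressed_genes": 4_500,
     "mito_gene_counts": 9_000, "total_gene_counts": 150_000},
    {"cell_id": "Cell_02", "aligned_reads": 3_200, "expressed_genes": 2_100,
     "mito_gene_counts": 100, "total_gene_counts": 2_800},
    {"cell_id": "Cell_03", "aligned_reads": 140_000, "expressed_genes": 3_000,
     "mito_gene_counts": 90_000, "total_gene_counts": 120_000},
    {"cell_id": "Cell_04", "aligned_reads": 60_000, "expressed_genes": 1_200,
     "mito_gene_counts": 2_000, "total_gene_counts": 50_000},
]

kept = [cell["cell_id"] for cell in cells if passes_qc(cell)]
print(kept)
```

In this toy set each rejected cell fails exactly one filter: too few aligned reads, too many mitochondrial counts, or too few expressed genes.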
Gene selection and clustering
Normalization corrects for differences in capture efficiency, sequencing depth and other technical biases
Raw count data
Assume that most genes are not differentially expressed
Calculate scaling factors for each cell
Apply the scaling factors and log-transform
Normalized expression data
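One way to turn these steps into code is a median-of-ratios scheme, sketched below. The slides do not name a specific method, so this choice is an assumption: each cell's scaling factor is the median ratio of its counts to a per-gene reference (the geometric mean across cells), which is only reasonable if most genes are not differentially expressed. Function names and the toy counts are hypothetical.

```python
import math
from statistics import median
from typing import List


def size_factors(counts: List[List[int]]) -> List[float]:
    """Median-of-ratios scaling factors; counts[g][c] is gene g in cell c."""
    n_cells = len(counts[0])
    log_ratios: List[List[float]] = [[] for _ in range(n_cells)]
    for gene_counts in counts:
        if min(gene_counts) == 0:
            continue  # geometric mean is undefined with zeros, so skip the gene
        log_reference = sum(math.log(x) for x in gene_counts) / n_cells
        for cell, x in enumerate(gene_counts):
            log_ratios[cell].append(math.log(x) - log_reference)
    return [math.exp(median(r)) for r in log_ratios]


def normalize(counts: List[List[int]], factors: List[float]) -> List[List[float]]:
    """Divide by the per-cell factor, then log-transform with a pseudocount."""
    return [[math.log2(x / f + 1) for x, f in zip(gene_counts, factors)]
            for gene_counts in counts]


# Toy counts (made up): the second cell was sequenced twice as deeply.
counts = [
    [10, 20],
    [50, 100],
    [30, 60],
    [5, 40],   # a genuinely different gene; the median ignores it
    [0, 3],    # contains a zero, so it is skipped when computing factors
]
factors = size_factors(counts)
normalized = normalize(counts, factors)
print([round(f, 3) for f in factors])
```

Single-cell count matrices contain many zeros, so very few genes may survive the zero filter in practice; methods designed for sparse data are usually a better fit than this plain version.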
[Figure: average expression versus variance for each gene, with per-cell expression shown for three example genes: high expression with low variance, low expression with low variance, and high expression with high variance]
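The two axes of this plot are simply the per-gene mean and variance of the normalized expression values across cells. A minimal sketch, with hypothetical gene names and made-up values chosen to mimic the three example genes:

```python
from statistics import mean, variance
from typing import Dict, List, Tuple


def mean_and_variance(expression: Dict[str, List[float]]) -> Dict[str, Tuple[float, float]]:
    """Map each gene to (average expression, sample variance) across cells."""
    return {gene: (mean(values), variance(values))
            for gene, values in expression.items()}


# Made-up normalized expression values, one number per cell.
expression = {
    "gene_A": [10.1, 9.9, 10.0, 10.2],   # high expression, low variance
    "gene_B": [0.2, 0.1, 0.0, 0.1],      # low expression, low variance
    "gene_C": [12.0, 0.5, 11.0, 14.0],   # high expression, high variance
}

stats = mean_and_variance(expression)
# Rank genes from most to least variable.
by_variance = sorted(stats, key=lambda gene: stats[gene][1], reverse=True)
print(by_variance)
```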
Typical questions
What are the expression differences between my experimental groups?
What are the subpopulations in my data?
What are the gene expression patterns in each subpopulation?
Treat conditions as groups?
No: select genes and assign cells to groups
A difference between the populations (signal) should appear among the most variable genes
Variance is a necessary but insufficient indicator of population differences
Unique populations consistently over- or under-express a set of genes
[Figure: average expression versus variance for each gene]
The silhouette coefficient is a useful metric to determine the optimal number of groups
k = 2: silhouette coefficient 0.48
k = 3: silhouette coefficient 0.56
k = 4: silhouette coefficient 0.47
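For reference, the silhouette of a cell is (b - a) / max(a, b), where a is its mean distance to the other cells in its own group and b is its mean distance to the nearest other group; the coefficient is the average over all cells. The sketch below assumes Euclidean distance and hand-assigned labels rather than the output of a clustering algorithm; the function name and toy points are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def silhouette_coefficient(points: Sequence[Sequence[float]],
                           labels: Sequence[int]) -> float:
    """Mean silhouette over all points, using Euclidean distance."""
    groups = sorted(set(labels))
    if len(groups) < 2:
        raise ValueError("need at least two groups")
    scores: List[float] = []
    for i, p in enumerate(points):
        # Mean distance from p to every group, leaving p itself out.
        mean_dist: Dict[int, float] = {}
        for g in groups:
            members = [q for j, q in enumerate(points) if labels[j] == g and j != i]
            if members:
                mean_dist[g] = sum(math.dist(p, q) for q in members) / len(members)
        own = labels[i]
        if own not in mean_dist:
            scores.append(0.0)  # a group of one is scored as 0 by convention
            continue
        a = mean_dist[own]
        b = min(d for g, d in mean_dist.items() if g != own)
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


# Toy data (made up): three tight pairs of cells in a 2-D expression space.
points = [(0, 0), (0, 1), (10, 0), (10, 1), (20, 0), (20, 1)]
score_k2 = silhouette_coefficient(points, [0, 0, 0, 0, 1, 1])
score_k3 = silhouette_coefficient(points, [0, 0, 1, 1, 2, 2])
print(round(score_k2, 2), round(score_k3, 2))
```

As in the slides, the grouping with the highest coefficient (here three groups) would be preferred.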
Treat conditions as groups?
Yes: test genes for differential expression
[Figure: average expression versus variance for each gene, in two panels: differentially expressed genes and highly variable genes]
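The slides do not say which differential expression test is used. As one simple possibility, the sketch below ranks genes by Welch's t statistic between two groups of cells; it computes no p-values and ignores the zero-inflation typical of single-cell data, so treat it as an illustration of the idea only. Gene names, group names, and values are hypothetical.

```python
import math
from statistics import mean, variance
from typing import Dict, List


def welch_t(group_a: List[float], group_b: List[float]) -> float:
    """Welch's t statistic; assumes at least one group has non-zero variance."""
    standard_error = math.sqrt(
        variance(group_a) / len(group_a) + variance(group_b) / len(group_b)
    )
    return (mean(group_a) - mean(group_b)) / standard_error


# Made-up normalized expression per gene, split by experimental group.
expression: Dict[str, Dict[str, List[float]]] = {
    "gene_A": {"group_1": [8.1, 7.9, 8.3], "group_2": [2.0, 2.4, 1.9]},
    "gene_B": {"group_1": [5.0, 5.2, 4.9], "group_2": [5.1, 4.8, 5.0]},
}

t_stats = {
    gene: welch_t(groups["group_1"], groups["group_2"])
    for gene, groups in expression.items()
}
# Genes with the largest absolute statistic differ most between the groups.
ranked = sorted(t_stats, key=lambda gene: abs(t_stats[gene]), reverse=True)
print(ranked)
```

Note that gene_B has visible cell-to-cell variance but almost no difference between groups, which is the distinction the two panels draw between highly variable and differentially expressed genes.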
Insights
The ideal heatmap
Real heatmaps are a rough-draft visualization
[Figure: heatmaps of NKX2-1 and CD47 expression, comparing row scaling with global scaling]
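The difference between the two scaling modes can be sketched as follows, assuming both are z-scores: row scaling standardizes each gene with its own mean and standard deviation, while global scaling uses one mean and standard deviation for the whole matrix. Function names, gene names, and values are hypothetical.

```python
from statistics import mean, pstdev
from typing import Dict, List

Matrix = Dict[str, List[float]]


def zscore(values: List[float], center: float, spread: float) -> List[float]:
    """Standardize values; a zero spread maps everything to 0."""
    return [(v - center) / spread if spread else 0.0 for v in values]


def row_scale(matrix: Matrix) -> Matrix:
    """Each gene is scaled with its own mean and standard deviation."""
    return {gene: zscore(v, mean(v), pstdev(v)) for gene, v in matrix.items()}


def global_scale(matrix: Matrix) -> Matrix:
    """All genes share one mean and standard deviation."""
    everything = [v for values in matrix.values() for v in values]
    center, spread = mean(everything), pstdev(everything)
    return {gene: zscore(v, center, spread) for gene, v in matrix.items()}


# Made-up expression: same on/off pattern at very different magnitudes.
matrix = {
    "gene_high": [10.0, 12.0, 10.0, 12.0],
    "gene_low": [1.0, 2.0, 1.0, 2.0],
}
row_scaled = row_scale(matrix)
globally_scaled = global_scale(matrix)
print(row_scaled["gene_low"], [round(v, 2) for v in globally_scaled["gene_low"]])
```

Row scaling makes the two genes look identical and hides the difference in magnitude; global scaling keeps the magnitude but flattens the pattern of the weakly expressed gene.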
Expression patterns are better conveyed by showing individual genes
[Figure: expression of individual genes, with panels labeled Clustered and Random]
Geneset enrichment analysis depends on the quality of the geneset
MSigDB hallmark genesets only contain 4000 genes
Make your own genesets from the literature
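The slides do not specify which enrichment method is meant. One simple variant is an over-representation test: given a background of measured genes, a geneset, and a list of selected genes, the hypergeometric distribution gives the probability of seeing at least the observed overlap by chance. The sketch below assumes that variant (not the rank-based GSEA procedure); the function name and the toy numbers are hypothetical.

```python
from math import comb


def enrichment_p_value(n_background: int, n_geneset: int,
                       n_selected: int, n_overlap: int) -> float:
    """P(overlap >= n_overlap) when n_selected genes are drawn at random."""
    largest_overlap = min(n_geneset, n_selected)
    all_draws = comb(n_background, n_selected)
    tail = sum(
        comb(n_geneset, i) * comb(n_background - n_geneset, n_selected - i)
        for i in range(n_overlap, largest_overlap + 1)
    )
    return tail / all_draws


# Toy numbers (made up): 20 background genes, a geneset of 5,
# 4 selected genes, 3 of which belong to the geneset.
p_value = enrichment_p_value(20, 5, 4, 3)
print(round(p_value, 4))
```

The result depends directly on which genes the geneset contains, so a geneset that omits the genes relevant to the experiment cannot show enrichment no matter how strong the signal is.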
Takeaways
Raw data: Remember to provide a metadata file
Initial QC: More reads are usually better than longer reads
Alignment and Quantification: You will only be able to use about 50% of your reads to quantify gene expression
Outlier analysis: Assume that 50% of your cells could fail
Gene selection and clustering: High variance doesn’t imply subpopulations
Insights: Make your own gene lists!
Slides available at: bit.ly/crem_bioinformatics