Single-Cell Transcriptome Analysis of Pluripotent Stem Cells

Nacho Caballero

Center for Regenerative Medicine, Boston University

Jun 12, 2017

From raw data to insights

Analysis pipeline

Raw data -> Initial QC -> Alignment and Quantification -> Outlier analysis -> Gene selection and clustering -> Insights

Barcoded sequencing files -> Demultiplex -> One pair of sequencing files per cell

@NB500996:64:HNM72BGX2:3:12510:12240:9366 2:N:0:TAGTCATG
CTACTGTCTAGAGCTTGTCTCAATGGATCTAGAACTTCATCGCCCTCTGATC…
+
AAAAAEEEE<E/EEEEEEEEE6EE/6AEEE//E/EEE/AEA/EAEEEE</6A…
…

Millions of reads
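A minimal sketch of how such a file can be read, assuming the standard four-line FASTQ layout (header, sequence, `+` separator, quality string). The function name `read_fastq` is hypothetical, and the toy record reuses only the visible, truncated portion of the read shown above; splitting the header from the sequence after the 8-base index is an assumption based on the usual Illumina header layout.

```python
import io

def read_fastq(handle):
    """Yield (header, sequence, quality) tuples from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # separator line, starts with '+'
        quality = handle.readline().rstrip("\n")
        # Each base must have exactly one quality character.
        if not header.startswith("@") or len(sequence) != len(quality):
            raise ValueError(f"malformed FASTQ record: {header!r}")
        yield header[1:], sequence, quality

# Toy input: the visible (truncated) part of the read above, not a complete record.
example = io.StringIO(
    "@NB500996:64:HNM72BGX2:3:12510:12240:9366 2:N:0:TAGTCATG\n"
    "CTACTGTCTAGAGCTTGTCTCAATGGATCTAGAACTTCATCGCCCTCTGATC\n"
    "+\n"
    "AAAAAEEEE<E/EEEEEEEEE6EE/6AEEE//E/EEE/AEA/EAEEEE</6A\n"
)
records = list(read_fastq(example))
n_reads = len(records)
```

In a real run the handle would be one of the per-cell files produced by demultiplexing, and `n_reads` would be the number of reads for that cell.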


Metadata file

Cell_id   Condition1   Condition2
Cell_01   BU3          red
Cell_02   BU3          green
Cell_03   C17          red
Cell_04   C17          green
Cell_05   BU3          red
Cell_06   BU3          green
…


Use short, simple names


Analysis pipeline: Initial QC


Read length is often inversely correlated with base-pair sequencing quality

[Figure: average sequence quality by position in read. Panels: good cDNA quality, average quality, bad quality]
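A per-position quality profile like the one plotted in these panels can be sketched as follows. This assumes Phred+33 encoded quality strings (the usual encoding for current Illumina FASTQ files); the function name and the toy quality strings are made up for illustration.

```python
def mean_quality_by_position(quality_strings, offset=33):
    """Average Phred score at each read position (Phred+33 encoding assumed)."""
    totals, counts = [], []
    for qual in quality_strings:
        for pos, char in enumerate(qual):
            if pos == len(totals):
                # First time a read reaches this position.
                totals.append(0)
                counts.append(0)
            totals[pos] += ord(char) - offset
            counts[pos] += 1
    return [t / c for t, c in zip(totals, counts)]

# Toy quality strings (made up): 'E' is Phred 36, 'A' is 32, '6' is 21, '/' is 14.
toy_reads = ["EEEEA/", "EEEA//", "EEE6/"]
profile = mean_quality_by_position(toy_reads)
```

In this toy input the average score falls toward the end of the read, which is the pattern described above for longer reads.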

[Figure: number of reads per cell (y-axis: 0 to 1M; x-axis: cells, 0 to 400)]

More reads are generally better than longer reads (safe target: 200K reads per cell, 150 bp long)


The Fluidigm protocol makes it extremely easy to lose entire rows or columns

[Figure: cells laid out by rows and columns]


Analysis pipeline: Alignment and Quantification

We quantify the gene expression in a cell by counting how many reads align to each gene

[Figure: reads (AGGCAGAGGGGCGAGATGCA…) aligned to the SFTPC gene]

1358 reads aligned to the SFTPC gene in this cell
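The counting step can be sketched as below. This is a simplified illustration, not the quantification tool used for these slides: it assumes each read is already uniquely aligned to a half-open interval, and the gene names and coordinates are hypothetical.

```python
from collections import Counter

def count_reads_per_gene(alignments, genes):
    """Count uniquely aligned reads that overlap exactly one annotated gene.

    alignments: iterable of (chrom, start, end) tuples, one per read
    genes: dict mapping gene name -> (chrom, start, end)
    """
    counts = Counter()
    for chrom, start, end in alignments:
        hits = [
            name
            for name, (g_chrom, g_start, g_end) in genes.items()
            if g_chrom == chrom and start < g_end and end > g_start
        ]
        if len(hits) == 1:  # skip reads outside genes or spanning several genes
            counts[hits[0]] += 1
    return counts

# Hypothetical coordinates, for illustration only (not real annotation).
toy_genes = {"GENE_A": ("chr1", 100, 500), "GENE_B": ("chr1", 450, 900)}
toy_alignments = [
    ("chr1", 120, 170),   # inside GENE_A only
    ("chr1", 300, 350),   # inside GENE_A only
    ("chr1", 460, 510),   # overlaps both genes, not counted
    ("chr1", 950, 1000),  # no gene, not counted
    ("chr1", 600, 650),   # inside GENE_B only
]
gene_counts = count_reads_per_gene(toy_alignments, toy_genes)
```

Repeating this for every cell gives one count per gene per cell, which is the raw count data used in the later steps.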

Read type                                   Number of reads per cell
Raw                                         333,229
Unaligned                                   81,673
Aligned, but non-uniquely                   28,813
Aligned uniquely, but not to a gene         32,774
Aligned uniquely, but span multiple genes   20,838
Aligned uniquely to a single gene           167,241



40-60% of the raw reads cannot be used to quantify gene expression
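As a quick check of that range, the arithmetic on the per-cell numbers from the table above can be written out. The variable names are illustrative.

```python
# Reads per cell, copied from the table above.
reads_per_cell = {
    "Raw": 333_229,
    "Unaligned": 81_673,
    "Aligned, but non-uniquely": 28_813,
    "Aligned uniquely, but not to a gene": 32_774,
    "Aligned uniquely, but span multiple genes": 20_838,
    "Aligned uniquely to a single gene": 167_241,
}

raw = reads_per_cell["Raw"]
# Only reads aligned uniquely to a single gene contribute to the counts.
usable = reads_per_cell["Aligned uniquely to a single gene"] / raw
unusable = 1 - usable

for read_type, n in reads_per_cell.items():
    print(f"{read_type:45s} {n:>8,d}  {n / raw:6.1%}")
print(f"Usable for quantification: {usable:.1%}, not usable: {unusable:.1%}")
```

For this cell roughly half of the raw reads end up in the gene counts, in the middle of the 40-60% range.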


Analysis pipeline: Outlier analysis

Filter out cells with fewer than 5K aligned reads

[Figure: number of aligned reads per cell (y-axis: 0 to 1M; x-axis: cells, 0 to 120)]

Filter out cells with a high percentage of mitochondrial gene counts (indicative of a broken cell membrane)

[Figure: percentage of mitochondrial gene counts per cell (y-axis: 0 to 100%; x-axis: cells, 0 to 48)]

Filter out cells with fewer than 2K expressed genes

[Figure: number of expressed genes per cell (y-axis: 0 to 6K; x-axis: cells, 0 to 30)]
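The three filters can be combined in a short sketch. The 5K and 2K thresholds come from the text above; the mitochondrial cutoff of 25% is an assumption, since no number is given for it, and the per-cell summary values are made up.

```python
MIN_ALIGNED_READS = 5_000    # threshold stated above
MIN_EXPRESSED_GENES = 2_000  # threshold stated above
MAX_MITO_FRACTION = 0.25     # assumption: no numeric cutoff is given above

def qc_failures(cell):
    """Return the QC criteria that a cell fails (an empty list means keep it)."""
    failures = []
    if cell["aligned_reads"] < MIN_ALIGNED_READS:
        failures.append("few aligned reads")
    if cell["mito_counts"] / max(cell["gene_counts"], 1) > MAX_MITO_FRACTION:
        failures.append("high mitochondrial fraction")
    if cell["expressed_genes"] < MIN_EXPRESSED_GENES:
        failures.append("few expressed genes")
    return failures

# Made-up per-cell summaries, for illustration only.
toy_cells = {
    "Cell_01": {"aligned_reads": 167_241, "gene_counts": 150_000,
                "mito_counts": 9_000, "expressed_genes": 5_200},
    "Cell_02": {"aligned_reads": 3_100, "gene_counts": 2_800,
                "mito_counts": 200, "expressed_genes": 900},
    "Cell_03": {"aligned_reads": 80_000, "gene_counts": 70_000,
                "mito_counts": 42_000, "expressed_genes": 2_500},
    "Cell_04": {"aligned_reads": 40_000, "gene_counts": 36_000,
                "mito_counts": 2_000, "expressed_genes": 1_500},
}
report = {name: qc_failures(cell) for name, cell in toy_cells.items()}
kept = [name for name, failures in report.items() if not failures]
```

Keeping the list of failed criteria per cell, rather than a single yes/no flag, makes it easier to see which filter removes most cells in a given experiment.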


Analysis pipeline: Gene selection and clustering

Raw count data -> Normalized expression data

Assume that most genes are not differentially expressed

Calculate scaling factors for each cell

Apply the scaling factors and log-transform

Normalization corrects for differences in capture efficiency, sequencing depth and other technical bias
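One way to carry out these steps is a median-of-ratios scaling. The text above does not name a specific method, so this choice is an assumption, as are the function names and the toy counts. The sketch uses only genes detected in every cell to compute the factors.

```python
import math
from statistics import median

def size_factors(counts):
    """counts: dict gene -> list of raw counts (one per cell). Returns one factor per cell."""
    n_cells = len(next(iter(counts.values())))
    log_ratios = [[] for _ in range(n_cells)]
    for values in counts.values():
        if min(values) > 0:  # geometric mean is only defined for genes seen in all cells
            log_geo_mean = sum(math.log(v) for v in values) / n_cells
            for cell, v in enumerate(values):
                log_ratios[cell].append(math.log(v) - log_geo_mean)
    # If most genes are unchanged, the median ratio reflects technical differences.
    return [math.exp(median(ratios)) for ratios in log_ratios]

def normalize(counts):
    """Divide each cell by its scaling factor, then apply log2(x + 1)."""
    factors = size_factors(counts)
    return {
        gene: [math.log2(v / f + 1) for v, f in zip(values, factors)]
        for gene, values in counts.items()
    }

# Toy counts: cells 2 and 3 were sequenced 2x and 4x deeper than cell 1.
toy_counts = {
    "gene1": [10, 20, 40],
    "gene2": [100, 200, 400],
    "gene3": [5, 10, 20],
    "gene4": [0, 3, 0],
}
factors = size_factors(toy_counts)
normalized = normalize(toy_counts)
```

After scaling, `gene1` to `gene3` have the same value in all three cells, so the depth difference is gone, while `gene4` still stands out in the one cell that expresses it.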

[Figure: average expression vs. variance for each gene, with per-cell expression profiles of example genes: high expression low variance, low expression low variance, high expression high variance]

Typical questions

What are the expression differences between my experimental groups?

What are the subpopulations in my data?

What are the gene expression patterns in each subpopulation?

[Flowchart: Treat conditions as groups? NO: select genes and assign cells to groups]

A difference between the populations (signal) should appear among the most variable genes
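A minimal sketch of selecting the most variable genes, assuming normalized expression values as input. Ranking by plain sample variance is an assumption (the text above does not specify a statistic), and the gene names and values are made up.

```python
from statistics import mean, variance

def most_variable_genes(expression, n_top):
    """Rank genes by sample variance across cells and return the top n_top names."""
    stats = {gene: (mean(values), variance(values)) for gene, values in expression.items()}
    ranked = sorted(stats, key=lambda gene: stats[gene][1], reverse=True)
    return ranked[:n_top], stats

# Toy normalized expression for 8 cells (made-up values).
toy_expression = {
    "marker": [0, 0, 0, 0, 8, 8, 8, 8],              # two groups of cells
    "housekeeping": [10, 10, 11, 10, 10, 11, 10, 10],  # stable across cells
    "outlier": [0, 0, 0, 0, 0, 0, 0, 16],             # one extreme cell
}
top_genes, gene_stats = most_variable_genes(toy_expression, n_top=2)
```

Both `marker` and `outlier` rank above `housekeeping`, but only `marker` separates the cells into two groups, so the selected genes still need to be inspected.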

[Figure: average expression vs. variance for each gene]

Variance is a necessary but insufficient indicator of population differences


Unique populations consistently over- or under-express a set of genes


The silhouette coefficient is a useful metric to determine the optimal number of groups


[Figure: cells clustered into k = 2 groups; silhouette coefficient: 0.48]
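The silhouette coefficient can be computed directly from pairwise distances. The sketch below assumes Euclidean distance and scores singleton clusters as 0 (a common convention); the function name and the toy points are illustrative, and the clustering itself is taken as given.

```python
import math

def silhouette_coefficient(points, labels):
    """Mean silhouette value over all points, using Euclidean distance."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("need at least two clusters")
    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(math.dist(point, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster
        b = min(
            sum(math.dist(point, q) for q in other) / len(other)
            for name, other in clusters.items()
            if name != label
        )
        scores.append(0.0 if max(a, b) == 0 else (b - a) / max(a, b))
    return sum(scores) / len(scores)

# Toy one-dimensional data with two obvious groups.
toy_points = [(0.0,), (1.0,), (10.0,), (11.0,)]
s_k2 = silhouette_coefficient(toy_points, [0, 0, 1, 1])
s_k3 = silhouette_coefficient(toy_points, [0, 0, 1, 2])
```

Here the k = 2 labeling scores higher than the k = 3 labeling, which is how the coefficient is used to compare candidate numbers of groups.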

[Flowchart: Treat conditions as groups? YES: test genes for differential expression. NO: select genes]

[Figure: average expression vs. variance, with panels labeled "Differentially expressed genes" and "Highly variable genes"]

Analysis pipeline: Insights

The ideal heatmap

Real heatmaps are a rough-draft visualization

[Figure: heatmaps of NKX2-1 and CD47 expression, row scaling vs. global scaling]
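The difference between the two scalings can be sketched with z-scores. The gene names come from the figure above, but the expression values are made up, and using z-scores as the scaling is an assumption about how the heatmap colors were computed.

```python
from statistics import mean, pstdev

def zscores(values, center, spread):
    """Center and scale a list of values; a constant input maps to zeros."""
    if spread == 0:
        return [0.0 for _ in values]
    return [(v - center) / spread for v in values]

def row_scale(expression):
    """Scale each gene (row) by its own mean and standard deviation."""
    return {g: zscores(v, mean(v), pstdev(v)) for g, v in expression.items()}

def global_scale(expression):
    """Scale every gene by the mean and standard deviation of the whole matrix."""
    all_values = [v for values in expression.values() for v in values]
    center, spread = mean(all_values), pstdev(all_values)
    return {g: zscores(v, center, spread) for g, v in expression.items()}

# Made-up expression values for four cells.
toy_expression = {"NKX2-1": [0, 0, 2, 2], "CD47": [10, 10, 12, 12]}
row_scaled = row_scale(toy_expression)
global_scaled = global_scale(toy_expression)
```

With row scaling both genes show the same pattern across cells; with global scaling the lowly expressed gene is uniformly below average and the highly expressed gene uniformly above it, so the within-gene pattern is much harder to see.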


Expression patterns are better conveyed by showing individual genes


[Figure: panels labeled CLUSTERED and RANDOM]


Gene set enrichment analysis depends on the quality of the gene set


MSigDB hallmark gene sets only contain 4000 genes

Make your own gene sets from the literature
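Once a custom gene set exists, a simple way to test it against a list of selected genes is a hypergeometric over-representation test. This is a simpler relative of gene set enrichment analysis, not the GSEA algorithm itself, and the function name and gene identifiers below are hypothetical.

```python
from math import comb

def overrepresentation_p(selected, gene_set, universe):
    """P(overlap >= observed) if `selected` were drawn at random from `universe`."""
    universe = set(universe)
    selected = set(selected) & universe
    gene_set = set(gene_set) & universe  # genes outside the universe cannot be drawn
    overlap = len(selected & gene_set)
    n, k_set, m = len(universe), len(gene_set), len(selected)
    total = comb(n, m)
    tail = sum(
        comb(k_set, k) * comb(n - k_set, m - k)
        for k in range(overlap, min(k_set, m) + 1)
    )
    return tail / total

# Hypothetical identifiers: 20 measured genes, a 5-gene custom set, 5 selected genes.
toy_universe = [f"g{i}" for i in range(20)]
toy_gene_set = ["g0", "g1", "g2", "g3", "g4"]
toy_selected = ["g0", "g1", "g2", "g3", "g10"]
p_value = overrepresentation_p(toy_selected, toy_gene_set, toy_universe)
```

Genes that are in the set but absent from the measured universe are dropped before testing, which is one concrete way the quality and coverage of a gene set affect the result.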

Remember to provide a metadata file


Takeaways

More reads are usually better than longer reads

You will only be able to use about 50% of your reads to quantify gene expression

Assume that 50% of your cells could fail

High variance doesn’t imply subpopulations

Make your own gene lists!

Slides available at: bit.ly/crem_bioinformatics