Bioinformatics Analysis of Single-Cell Sequencing Data


Min Gao, PhD

Assistant Professor of Medicine & UAB Informatics Institute

3.13.2020

Outline

• Background

1. Why single cell?

2. Current platforms for scRNA-seq

3. Experimental design

• Single-cell data analysis

1. From raw reads to gene expression matrix

2. QC and Filtering

3. Normalization

4. Dimensionality Reduction

5. Clustering

6. Differential Expression gene / Marker Identification

7. Trajectory analysis

8. Multimodal analysis

• Single-cell RNAseq data analysis in U-BRITE2.0 (Zongliang Yue)

Why single-cell analysis?

Single-cell analysis reveals heterogeneity (J Hematol Oncol. 2017 Jan 21;10(1):27)

Wang Y.; Navin N. E. Mol. Cell 2015

Single cell analysis affects diverse areas of biological research

Trends Genet. 2015 Oct; 31(10): 576–586.

Single cell analysis in cancer genomics

Single cell analysis in immunology

Trends Immunol. 2017 February ; 38(2): 140–149

Scaling of scRNA-seq experiments

Svensson et al., Nature Protocols, 2018

Single Cell Platforms

Platforms: advantages and limitations (Int J Mol Sci. 2018 Mar; 19(3): 807.)

Chromium System (10x Genomics)
Advantages: High cell capture efficiency, easy to operate, end-to-end solution, multiple applications, well established platform, intensive support
Limitations: High initial cell concentration required, no user modification possible

Nadia (Dolomite Bio)
Advantages: Open platform, possibility to develop own protocols, multiple applications (PACS, DroNc-Seq)
Limitations: High initial cell concentration required, lower cell capturing efficiency, no analysis software provided, skills to operate required

InDrop System (1CellBio)
Advantages: High cell capture efficiency, open platform, possibility to develop own protocols
Limitations: High initial cell concentration required, no analysis software support, skills to operate required

Illumina Bio-Rad ddSEQ Single-Cell Isolator
Advantages: Product from industry leaders, easy to operate, end-to-end solution, kits for different starting number of cells
Limitations: High initial cell concentration required, no user modification possible, single application (RNA-Seq)

Tapestri Platform (MissionBio)
Advantages: Only platform dedicated to DNA-Seq, easy to operate, customized panels available
Limitations: Single application possible (DNA-Seq)

BD Rhapsody Single-Cell Analysis System (BD)
Advantages: Possibility to optimize costs (subsampling, archiving, targeted assays), easy to operate, end-to-end solution, protein detection promised
Limitations: Single application possible (targeted RNA-Seq)

ICELL8 Single-Cell System (Takara)
Advantages: Combined high throughput with active cell selection, easy to operate
Limitations: Bioinformatics analysis not provided, single application (RNA-Seq)

C1 System and Polaris (Fluidigm)
Advantages: Variable throughput (48–800 cells), multiple applications, customizable protocols, cell stimulation, well established platform, intensive support
Limitations: Size-based cell selection (C1)

Puncher Platform (Vycap)
Advantages: Filtering for rare cell capturing, active cell selection, visual control, high transferring efficiency, easy to operate, established WGA/WTA protocols
Limitations: Low throughput, bioinformatics analysis not provided

CellRaft AIR System (CellMicrosystems)
Advantages: Multiple applications (cultivation and tracking cell phenotypes, substance testing), active cell selection, visual control, high transfer efficiency, cost-effective manual version available
Limitations: Low throughput, bioinformatics analysis not provided, adhesive properties of cells expected (although not mandatory)

DEPArray NxT (Menarini Silicon Biosystems)
Advantages: Active cell selection, visual control, high transfer efficiency, possibility to study cell–cell interaction, established WGA/WTA protocols
Limitations: Low throughput, bioinformatics analysis not provided; compared to other low-throughput instruments, a high price of consumables (chips)

AVISO CellCelector (ALS)
Advantages: Active cell selection, visual control, multiple applications (transfer cell colonies), low price for consumables
Limitations: Low throughput, bioinformatics analysis not provided, skills to operate required, adhesive properties of cells lower transfer efficiency, risk of contamination from co-transferred medium

Droplet-based single cell platform

Single Cell Solutions: Single Cell CNV, Single Cell Gene Expression, Single Cell Immune Profiling, Single Cell ATAC

https://wp.10xgenomics.com/instruments/chromium-controller/

Lab Chip. 2019 May 14;19(10):1706-1727

Experimental Design considerations

How deep to sequence?

• The number of genes detected saturates around 1 million reads per cell

• Identifying different cell types present: 25-50k reads per cell

• Identifying transcriptional dynamics within a population: 50-100k reads per cell

• We don’t necessarily need to detect everything in every cell!

How many cells?

• Two main things to consider

1) How many cell types are there?

2) What is the proportion of the rarest cell type you’re interested in?

• 10x Genomics currently allows for each run to yield anywhere from ~500-10,000 cells

• Do you actually know how many cell types there are? What about cell states? The current trend in the field seems more focused on increasing cell #

What about replicates?

• It kind of depends.

• Technical replicate of PBMCs has near-perfect overlap

• Cells are dramatically different between patients.
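The two design questions above (how deep, how many cells) can be turned into quick arithmetic. The sketch below is only an illustration in plain Python: it assumes cells are sampled independently (a simple binomial model), and the function names and the example numbers (a 1% cell type, at least 20 cells wanted) are hypothetical, not taken from the slides.

```python
from math import comb


def total_reads(n_cells: int, reads_per_cell: int) -> int:
    """Sequencing budget: total reads needed for one run."""
    return n_cells * reads_per_cell


def prob_at_least(n_cells: int, proportion: float, min_cells: int) -> float:
    """Binomial probability of capturing at least `min_cells` cells of a type
    that makes up `proportion` of the population when `n_cells` are sampled.
    Assumption: every captured cell is an independent draw."""
    p_fewer = sum(
        comb(n_cells, k) * proportion**k * (1 - proportion) ** (n_cells - k)
        for k in range(min_cells)
    )
    return 1.0 - p_fewer


if __name__ == "__main__":
    # 5,000 cells at 50k reads per cell (within the ranges quoted above).
    print("total reads:", total_reads(5_000, 50_000))
    # Chance of seeing >= 20 cells of a hypothetical 1% cell type.
    for n in (500, 2_000, 10_000):
        print(n, "cells ->", round(prob_at_least(n, 0.01, 20), 3))
```

Under these assumptions, the rarer the cell type of interest, the more the cell number (rather than the depth per cell) drives the design.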

Challenges in processing single-cell sequencing data

Variety within and between datasets is a classic biological problem that also applies to big data. Larger data volumes bring clear opportunities, but they also bring technical, statistical and interpretive challenges.

Basic programming needed to interpret data

The information contained in single-cell data needs to be transformed into relevant biological knowledge

Single cell sequencing data analysis



Flowchart of the single-cell RNA-seq analysis

Cell Ranger

Seurat

U-BRITE2.0-Web App

U-BRITE1.0- Jupyter Notebook

https://gitlab.rc.uab.edu/MCBIOS19/single_cell_rnaseq_hands-on_1

https://gitlab.rc.uab.edu/attis2020/single_cell_attis2020

attis2020-scrnaseq-demo.informatics.uab.edu


http://138.26.131.247:8501/







From raw reads to gene expression matrix

Cook D. Introduction to the single-cell RNA sequencing workflow. 2018

The commands for cellranger count

cellranger count --help
cellranger count (3.1.0)
Copyright (c) 2019 10x Genomics, Inc. All rights reserved.
-------------------------------------------------------------------------------
'cellranger count' quantifies single-cell gene expression.
The commands below should be preceded by 'cellranger':

Usage:
    count --id=ID [--fastqs=PATH] [--sample=PREFIX] --transcriptome=ref-DIR [options]
    count <run_id> [options]
    count -h | --help | --version

Example

cellranger count --id=run_count_1kpbmcs \
    --fastqs=/path_to_fastq/pbmc_1k_v3_fastqs \
    --sample=pbmc_1k_v3 \
    --transcriptome=/path_to_ref/run_cellranger_count/refdata-cellranger-GRCh38-3.0.0

• cellranger count --id=sample345 \
    --transcriptome=/refdata-cellranger-GRCh38-3.0.0 \
    --fastqs=/fastq_path \
    --expect-cells=1000

The outputs of cellranger count

• Outputs:
- Run summary HTML: web_summary.html
- Run summary CSV: metrics_summary.csv
- BAM: possorted_genome_bam.bam
- BAM index: possorted_genome_bam.bam.bai
- Filtered feature-barcode matrices MEX: filtered_feature_bc_matrix
- Filtered feature-barcode matrices HDF5: filtered_feature_bc_matrix.h5
- Unfiltered feature-barcode matrices MEX: raw_feature_bc_matrix
- Unfiltered feature-barcode matrices HDF5: raw_feature_bc_matrix.h5
- Secondary analysis output CSV: analysis
- Per-molecule read information: molecule_info.h5
- Loupe Cell Browser file: cloupe.cloupe

• filtered_feature_bc_matrix
├── barcodes.tsv.gz
├── features.tsv.gz
└── matrix.mtx.gz

These three files are the input for Seurat or other software.
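To make the three-file MEX layout concrete, here is a minimal sketch in plain Python of how the files fit together. It assumes the standard Matrix Market coordinate format (1-based "row column value" triplets, with rows as features and columns as barcodes) and uses tiny in-memory strings in place of the real gzipped files; the barcodes, gene IDs and the `parse_mex` helper are made up for illustration. In practice a loader from Seurat or a similar package would be used.

```python
def parse_mex(matrix_text: str, features_text: str, barcodes_text: str) -> dict:
    """Combine matrix.mtx, features.tsv and barcodes.tsv contents into
    a nested dict: barcode -> {gene name -> UMI count}."""
    # features.tsv columns: feature ID, feature name, feature type.
    genes = [line.split("\t")[1] for line in features_text.strip().splitlines()]
    barcodes = barcodes_text.strip().splitlines()

    # Drop the '%%MatrixMarket' header and '%' comment lines.
    lines = [l for l in matrix_text.strip().splitlines() if not l.startswith("%")]
    n_rows, n_cols, _n_entries = map(int, lines[0].split())
    if n_rows != len(genes) or n_cols != len(barcodes):
        raise ValueError("matrix dimensions do not match features/barcodes")

    counts = {bc: {} for bc in barcodes}
    for line in lines[1:]:
        row, col, value = line.split()
        # Indices in the file are 1-based: row = feature, column = barcode.
        counts[barcodes[int(col) - 1]][genes[int(row) - 1]] = int(value)
    return counts


if __name__ == "__main__":
    # Toy stand-ins for the decompressed files (hypothetical content).
    features = "ENSG_A\tCD3E\tGene Expression\nENSG_B\tLYZ\tGene Expression\n"
    barcodes = "AAAC-1\nAAAG-1\nAAAT-1\n"
    matrix = (
        "%%MatrixMarket matrix coordinate integer general\n"
        "%\n"
        "2 3 4\n"
        "1 1 5\n"
        "2 2 7\n"
        "1 3 1\n"
        "2 3 2\n"
    )
    print(parse_mex(matrix, features, barcodes))
```

Only non-zero counts are stored, which is why the sparse triplet format suits single-cell matrices where most entries are zero.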







QC and Filtering

• Goal: Remove low-quality cells and potential doublets

• Common parameters worth exploring

UMI distribution

Number of genes detected

Percent of UMIs aligning to mitochondrial genes

Cells with unusually high nUMI/nGene could be doublets (~90 doublets per 1000 cells)

A high percentage of mitochondrial reads is associated with cell death (loss of membrane integrity > loss of cytoplasmic RNA > relative enrichment of mitochondrial content)

Install Seurat3 and load the package

Load PBMC data to Seurat

Cell QC – filter cells
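The QC parameters listed above (nUMI, nGene, percent mitochondrial) can be sketched in a few lines of plain Python. This is an illustration of the idea, not the Seurat code from the slides. The default cutoffs (more than 200 and fewer than 2,500 genes, less than 5% mitochondrial) are assumptions borrowed from common PBMC tutorial defaults and should be tuned per dataset; the `MT-` prefix assumes human gene symbols.

```python
def qc_metrics(cell_counts: dict) -> dict:
    """Per-cell QC metrics from a {gene: UMI count} dict."""
    n_umi = sum(cell_counts.values())
    n_gene = sum(1 for c in cell_counts.values() if c > 0)
    # Assumption: mitochondrial genes are named with an 'MT-' prefix (human).
    mito = sum(c for g, c in cell_counts.items() if g.startswith("MT-"))
    pct_mito = 100.0 * mito / n_umi if n_umi else 0.0
    return {"n_umi": n_umi, "n_gene": n_gene, "pct_mito": pct_mito}


def filter_cells(counts: dict, min_genes: int = 200, max_genes: int = 2500,
                 max_pct_mito: float = 5.0) -> dict:
    """Keep cells with min_genes < nGene < max_genes and pct_mito < max_pct_mito.
    Too few genes suggests an empty droplet or low-quality cell; too many
    suggests a possible doublet; high mito suggests a dying cell."""
    kept = {}
    for barcode, cell_counts in counts.items():
        m = qc_metrics(cell_counts)
        if min_genes < m["n_gene"] < max_genes and m["pct_mito"] < max_pct_mito:
            kept[barcode] = cell_counts
    return kept


if __name__ == "__main__":
    # Toy cells with only a handful of genes, so the cutoffs are scaled down.
    toy = {
        "good": {"CD3E": 5, "LYZ": 3, "MS4A1": 2, "MT-CO1": 1},
        "dying": {"CD3E": 1, "LYZ": 1, "MT-CO1": 8, "MT-ND1": 10},
        "empty": {"LYZ": 1},
        "doublet_like": {g: 10 for g in ("A", "B", "C", "D", "E", "F")},
    }
    kept = filter_cells(toy, min_genes=2, max_genes=5, max_pct_mito=20.0)
    print(sorted(kept))
```

Looking at the distributions of these metrics before picking cutoffs is usually more informative than applying fixed values.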







Normalization

Goal: Make profiles of each cell comparable

Simplest approach: Scale each cell's library size to some arbitrary value (e.g. 10,000)

It’s common to “regress out” the effect of nUMI and percent mito on each cell

Data normalization

Detection of highly variable genes

Scaling the data
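As a rough illustration of the "simplest approach" above, the sketch below scales each cell to a library size of 10,000, log-transforms, and then z-scores one gene across cells. It is plain Python written for this walkthrough; the function names are hypothetical, and it does not include the regression step ("regressing out" nUMI and percent mito) mentioned on the slide.

```python
import math


def log_normalize(cell_counts: dict, scale_factor: float = 10_000) -> dict:
    """Library-size normalization: divide each count by the cell's total,
    multiply by `scale_factor`, then apply log(1 + x)."""
    total = sum(cell_counts.values())
    if total == 0:
        return {gene: 0.0 for gene in cell_counts}
    return {gene: math.log1p(count / total * scale_factor)
            for gene, count in cell_counts.items()}


def scale_gene(values: list) -> list:
    """Z-score one gene across cells (mean 0, standard deviation 1), so that
    highly expressed genes do not dominate downstream steps such as PCA."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1) if n > 1 else 0.0
    sd = math.sqrt(var)
    return [(v - mean) / sd if sd > 0 else 0.0 for v in values]


if __name__ == "__main__":
    shallow = {"CD3E": 1, "LYZ": 3}    # 4 UMIs in total
    deep = {"CD3E": 10, "LYZ": 30}     # same proportions, 10x the depth
    # After normalization the two cells have identical profiles.
    print(log_normalize(shallow))
    print(log_normalize(deep))
    print(scale_gene([1.0, 2.0, 3.0]))
```

The toy example shows the point of the goal stated above: two cells with the same composition but different sequencing depth become directly comparable.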







Dimensionality Reduction

Goal: To visualize the structure of our data

Principal Component Analysis is one of the techniques used for dimensionality reduction

• Dimensionality simply refers to the number of features (i.e. input variables) in your dataset.

• Why reduce the dimensions? High-dimensional data are difficult to train on and need more computational power and time, and they cannot be visualized directly.

• PCA is a variance maximizer. It projects the original data onto the directions where variance is maximum. Variance is the measure of how spread out the data is.

Perform linear dimensional reduction (PCA)

Visualizing the PCA results
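To illustrate "PCA is a variance maximizer", the following plain-Python sketch finds the first principal component of a tiny dataset by power iteration on the covariance matrix. Power iteration is chosen here only because it needs no external libraries; it is an assumption of this sketch, not the method used by Seurat, and real analyses would use an optimized routine.

```python
import math


def first_principal_component(rows: list, n_iter: int = 200):
    """Return (direction, scores, variance) for the first principal component.
    `rows` is a list of cells, each a list of feature values."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration: repeatedly multiply by cov and renormalize. The start
    # vector must not be exactly orthogonal to the true first component.
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(n_iter):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    # Project each centered cell onto the direction of maximum variance.
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    variance = sum(s * s for s in scores) / (n - 1)
    return v, scores, variance


if __name__ == "__main__":
    # Toy data that varies mostly along the first feature.
    data = [[-2.0, 0.1], [-1.0, -0.1], [0.0, 0.0], [1.0, 0.1], [2.0, -0.1]]
    direction, scores, variance = first_principal_component(data)
    print("direction:", [round(x, 3) for x in direction])
    print("variance along PC1:", round(variance, 3))
```

The variance of the projected scores is close to the variance of the most spread-out feature, which is what "projecting onto the direction of maximum variance" means in practice.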







Clustering: Assign cells to groups of similar cells

Clustering

Nature Reviews Genetics volume 20, pages 273–282 (2019)

Cell clustering

Cell clusters
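The idea of assigning cells to groups of similar cells can be shown with a very small k-means example in plain Python. Note the swap: k-means is used here only because it is short and self-contained; it is not the graph-based clustering that Seurat uses by default, and the number of clusters `k` has to be supplied by the user. The coordinates stand in for cells in a reduced space such as the first two principal components.

```python
def kmeans(points: list, k: int, n_iter: int = 50):
    """Very small k-means: returns (labels, centers).
    Initialization is deterministic: the first point, then repeatedly the
    point farthest from all centers chosen so far."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    centers = [list(points[0])]
    while len(centers) < k:
        farthest = max(points, key=lambda p: min(dist2(p, c) for c in centers))
        centers.append(list(farthest))

    labels = None
    for _ in range(n_iter):
        # Assignment step: each cell goes to its nearest center.
        new_labels = [min(range(k), key=lambda i: dist2(p, centers[i]))
                      for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for i in range(k):
            members = [p for p, l in zip(points, labels) if l == i]
            if members:
                centers[i] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


if __name__ == "__main__":
    # Two well-separated toy groups of "cells" in a 2-D reduced space.
    cells = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
             (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
    labels, centers = kmeans(cells, k=2)
    print(labels)
```

Cluster labels are arbitrary integers; what matters is which cells share a label.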






Differential Expression gene / Marker Identification

Goal: Find out what’s different between clusters

Lots of options: some complex, some simple

• Pairwise comparisons (e.g. Cluster A vs. Cluster B)

• Marker identification (e.g. Cluster A vs. combined Clusters B, C, D)

With large numbers of cells per cluster, many genes are often identified as “significant”. Effect-size cutoffs (e.g. fold change) may be needed to filter down to a smaller list of genes to follow up on.

Soneson & Robinson, Nature Methods, 2018
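The effect-size filter described above can be sketched as follows. This plain-Python example computes a log2 fold change of mean expression for one cluster against all other cells (the "marker identification" comparison) and keeps genes above a cutoff. It deliberately omits the statistical test that a real workflow would run alongside it; the pseudocount of 1 and the cutoff of 1.0 are illustrative assumptions, as are the function names.

```python
import math


def log2_fold_changes(expr: dict, labels: list, cluster, pseudocount: float = 1.0) -> dict:
    """expr maps gene -> list of expression values (one per cell);
    labels gives the cluster of each cell, in the same order.
    Returns gene -> log2(mean in cluster / mean in all other cells)."""
    result = {}
    for gene, values in expr.items():
        inside = [v for v, l in zip(values, labels) if l == cluster]
        outside = [v for v, l in zip(values, labels) if l != cluster]
        mean_in = sum(inside) / len(inside)
        mean_out = sum(outside) / len(outside)
        # The pseudocount avoids division by zero for unexpressed genes.
        result[gene] = math.log2((mean_in + pseudocount) / (mean_out + pseudocount))
    return result


def markers(expr: dict, labels: list, cluster, min_log2fc: float = 1.0) -> list:
    """Genes whose log2 fold change passes the effect-size cutoff,
    ordered from largest to smallest fold change."""
    lfc = log2_fold_changes(expr, labels, cluster)
    passing = [g for g, v in lfc.items() if v >= min_log2fc]
    return sorted(passing, key=lambda g: -lfc[g])


if __name__ == "__main__":
    # Toy expression values for six cells in two clusters.
    expr = {
        "CD3E": [7, 7, 7, 1, 1, 1],
        "ACTB": [5, 5, 5, 5, 5, 5],
        "LYZ": [0, 0, 0, 7, 7, 7],
    }
    labels = ["A", "A", "A", "B", "B", "B"]
    print("markers of A:", markers(expr, labels, "A"))
    print("markers of B:", markers(expr, labels, "B"))
```

A gene such as ACTB, expressed equally everywhere, has a fold change of zero and drops out regardless of how many cells are in each cluster.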

Finding differentially expressed genes (cluster biomarkers)

Visualizing marker genes







Trajectory analysis

Pseudotime is a measure of how much progress an individual cell has made through a process such as cell differentiation.

In single-cell expression studies of processes such as cell differentiation, captured cells might be widely distributed in terms of progress.

Pseudotime is an abstract unit of progress: it's simply the distance between a cell and the start of the trajectory, measured along the shortest path.

Nat Methods. 2017 Oct; 14(10): 979–982.
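The definition of pseudotime as a shortest-path distance from the start of the trajectory can be illustrated with Dijkstra's algorithm on a small graph. In the plain-Python sketch below the trajectory graph is simply given as input (trajectory tools learn it from the data), and the node names and edge lengths are hypothetical.

```python
import heapq


def pseudotime(graph: dict, root: str) -> dict:
    """Shortest-path distance from `root` to every reachable node.
    `graph` maps node -> {neighbor: edge length}; lengths must be >= 0."""
    dist = {root: 0.0}
    heap = [(0.0, root)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry; a shorter path was already found
        for neighbor, length in graph.get(node, {}).items():
            candidate = d + length
            if candidate < dist.get(neighbor, float("inf")):
                dist[neighbor] = candidate
                heapq.heappush(heap, (candidate, neighbor))
    return dist


if __name__ == "__main__":
    # Toy branching trajectory: a stem cell state, a progenitor state,
    # and two differentiated fates (all names and lengths are made up).
    trajectory = {
        "stem": {"progenitor": 1.0},
        "progenitor": {"stem": 1.0, "fate_A": 2.0, "fate_B": 1.5},
        "fate_A": {"progenitor": 2.0},
        "fate_B": {"progenitor": 1.5},
    }
    for state, t in sorted(pseudotime(trajectory, "stem").items(), key=lambda x: x[1]):
        print(state, t)
```

Cells captured at the same real time point can therefore land at very different pseudotime values, which is the situation described above.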







Single cell VDJ seq data analysis

Samples: Thousands of single B cells from a healthy control and SLE patients

Method: 10X Genomics - Single Cell Immune Profiling Solution - 5’ Gene Expression + V(D)J Enriched Libraries

Goal: Compare the IG genes of the healthy control and the patients

Cell proportion for different contig numbers in autoAb+ and healthy samples

Only cells with exactly two contigs (one heavy chain and one light chain) were used for the following analysis.
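The contig-based cell selection described above can be sketched in plain Python. The input here is a hypothetical list of (barcode, chain) pairs; the chain labels IGH (heavy) and IGK/IGL (light) are an assumption about how the contig annotations are encoded, and the function names are made up for this illustration.

```python
from collections import Counter, defaultdict

HEAVY_CHAINS = {"IGH"}
LIGHT_CHAINS = {"IGK", "IGL"}


def chains_per_cell(contigs: list) -> dict:
    """Group contig chain labels by cell barcode."""
    by_cell = defaultdict(list)
    for barcode, chain in contigs:
        by_cell[barcode].append(chain)
    return dict(by_cell)


def contig_number_proportions(contigs: list) -> dict:
    """Proportion of cells having 1, 2, 3, ... contigs."""
    by_cell = chains_per_cell(contigs)
    tally = Counter(len(chains) for chains in by_cell.values())
    return {n: count / len(by_cell) for n, count in tally.items()}


def paired_cells(contigs: list) -> list:
    """Barcodes of cells with exactly two contigs: one heavy and one light."""
    kept = []
    for barcode, chains in chains_per_cell(contigs).items():
        n_heavy = sum(c in HEAVY_CHAINS for c in chains)
        n_light = sum(c in LIGHT_CHAINS for c in chains)
        if len(chains) == 2 and n_heavy == 1 and n_light == 1:
            kept.append(barcode)
    return sorted(kept)


if __name__ == "__main__":
    toy_contigs = [
        ("cell1", "IGH"), ("cell1", "IGK"),                    # kept
        ("cell2", "IGH"),                                      # one contig only
        ("cell3", "IGH"), ("cell3", "IGK"), ("cell3", "IGL"),  # three contigs
        ("cell4", "IGK"), ("cell4", "IGL"),                    # two light chains
    ]
    print(contig_number_proportions(toy_contigs))
    print(paired_cells(toy_contigs))
```

Requiring one heavy and one light chain, rather than just two contigs, excludes cells such as `cell4` in the toy data that carry two light chains.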

IGHV for autoAb+ and healthy samples

autoAb+: SLE#1, SLE#2 and SLE#3

Healthy: HC

Single-cell RNAseq data analysis in U-BRITE2.0 (Zongliang Yue)

Acknowledgement

Department of Pathology

Dr. Casey T. Weaver

Division of Clinical Immunology and Rheumatology

Dr. John D. Mountz

Dr. Hui-Chen Hsu

Informatics Institute (UAB) School of Medicine

Zongliang Yue

Jelai Wang

Dr. Jake Chen

Dr. Alexander Rosenberg

Dr. Zechen Chong

Dr. James J. Cimino
