Bioinformatics Analysis of Single-Cell Sequencing Data Min Gao, PhD Assistant Professor of Medicine & UAB Informatics Institute 3.13.2020

Page 1: Bioinformatics Analysis of Single-Cell Sequencing Data

Bioinformatics Analysis of Single-Cell Sequencing Data

Min Gao, PhD

Assistant Professor of Medicine & UAB Informatics Institute

3.13.2020

Page 2: Bioinformatics Analysis of Single-Cell Sequencing Data

Outline

• Background

1. Why single cell?

2. Current platforms for scRNA-seq

3. Experimental design

• Single-cell data analysis

1. From raw reads to gene expression matrix

2. QC and Filtering

3. Normalization

4. Dimensionality Reduction

5. Clustering

6. Differentially Expressed Gene / Marker Identification

7. Trajectory analysis

8. Multimodal analysis

• Single-cell RNAseq data analysis in U-BRITE2.0 (Zongliang Yue)

Page 3: Bioinformatics Analysis of Single-Cell Sequencing Data

Why single-cell analysis?

Single-cell analysis reveals heterogeneity

J Hematol Oncol. 2017 Jan 21;10(1):27

Page 4: Bioinformatics Analysis of Single-Cell Sequencing Data

Wang Y.; Navin N. E. Mol. Cell 2015

Single cell analysis affects diverse areas of biological research

Page 5: Bioinformatics Analysis of Single-Cell Sequencing Data

Trends Genet. 2015 Oct; 31(10): 576–586.

Single cell analysis in cancer genomics

Page 6: Bioinformatics Analysis of Single-Cell Sequencing Data

Single cell analysis in immunology

Trends Immunol. 2017 February ; 38(2): 140–149

Page 7: Bioinformatics Analysis of Single-Cell Sequencing Data

Scaling of scRNA-seq experiments

Svensson et al., Nature Protocols, 2018

Page 8: Bioinformatics Analysis of Single-Cell Sequencing Data

Single Cell Platforms

Source: Int J Mol Sci. 2018 Mar; 19(3): 807.

• Chromium System (10x Genomics)
    Advantages: High cell capture efficiency, easy to operate, end-to-end solution, multiple applications, well established platform, intensive support
    Limitations: High initial cell concentration required, no user modification possible

• Nadia (Dolomite Bio)
    Advantages: Open platform, possibility to develop own protocols, multiple applications (PACS, DroNc-Seq)
    Limitations: High initial cell concentration required, lower cell capturing efficiency, no analysis software provided, skills to operate required

• InDrop System (1CellBio)
    Advantages: High cell capture efficiency, open platform, possibility to develop own protocols
    Limitations: High initial cell concentration required, no analysis software support, skills to operate required

• Illumina Bio-Rad ddSEQ Single-Cell Isolator
    Advantages: Product from industry leaders, easy to operate, end-to-end solution, kits for different starting number of cells
    Limitations: High initial cell concentration required, no user modification possible, single application (RNA-Seq)

• Tapestri Platform (MissionBio)
    Advantages: Only platform dedicated to DNA-Seq, easy to operate, customized panels available
    Limitations: Single application possible (DNA-Seq)

• BD Rhapsody Single-Cell Analysis System (BD)
    Advantages: Possibility to optimize costs (subsampling, archiving, targeted assays), easy to operate, end-to-end solution, protein detection promised
    Limitations: Single application possible (targeted RNA-Seq)

• ICELL8 Single-Cell System (Takara)
    Advantages: Combined high throughput with active cell selection, easy to operate
    Limitations: Bioinformatics analysis not provided, single application (RNA-Seq)

• C1 System and Polaris (Fluidigm)
    Advantages: Variable throughput (48–800 cells), multiple applications, customizable protocols, cell stimulation, well established platform, intensive support
    Limitations: Size-based cell selection (C1)

• Puncher Platform (Vycap)
    Advantages: Filtering for rare cell capturing, active cell selection, visual control, high transferring efficiency, easy to operate, established WGA/WTA protocols
    Limitations: Low throughput, bioinformatics analysis not provided

• CellRaft AIR System (CellMicrosystems)
    Advantages: Multiple applications (cultivation and tracking cell phenotypes, substance testing), active cell selection, visual control, high transfer efficiency, cost-effective manual version available
    Limitations: Low throughput, bioinformatics analysis not provided, adhesive properties of cells expected (although not mandatory)

• DEPArray NxT (Menarini Silicon Biosystems)
    Advantages: Active cell selection, visual control, high transfer efficiency, possibility to study cell–cell interaction, established WGA/WTA protocols
    Limitations: Low throughput, bioinformatics analysis not provided; compared to other low-throughput instruments, a high price of consumables (chips)

• AVISO CellCelector (ALS)
    Advantages: Active cell selection, visual control, multiple applications (transfer cell colonies), low price for consumables
    Limitations: Low throughput, bioinformatics analysis not provided, skills to operate required, adhesive properties of cells lower transfer efficiency, risk of contamination from co-transferred medium

Page 9: Bioinformatics Analysis of Single-Cell Sequencing Data

Droplet-based single cell platform

Single Cell Solutions:

  • Single Cell CNV
  • Single Cell Gene Expression
  • Single Cell Immune Profiling
  • Single Cell ATAC

https://wp.10xgenomics.com/instruments/chromium-controller/

Lab Chip. 2019 May 14;19(10):1706-1727

Page 10: Bioinformatics Analysis of Single-Cell Sequencing Data

Experimental Design considerations

How deep to sequence?
  • The number of genes detected saturates around 1 million reads per cell
  • Identifying different cell types present: 25-50k reads per cell
  • Identifying transcriptional dynamics within a population: 50-100k reads per cell
  • We don't necessarily need to detect everything in every cell!

How many cells?
  • Two main things to consider:
    1) How many cell types are there?
    2) What is the proportion of the rarest cell type you're interested in?
  • 10x Genomics currently allows each run to yield anywhere from ~500-10,000 cells
  • Do you actually know how many cell types there are? What about cell states? The current trend in the field seems more focused on increasing the number of cells

What about replicates?
  • It depends.
  • Technical replicates of PBMCs have near-perfect overlap
  • Cells are dramatically different between patients.
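The "How many cells?" question can be made concrete with a small calculation. The sketch below is an illustration, not part of the original slides: it assumes cells are sampled independently, so the number of captured cells of a rare type follows a binomial distribution. The function names and the 95% confidence default are hypothetical choices.

```python
from math import exp, lgamma, log, log1p


def prob_at_least(n_cells, proportion, min_cells):
    """P(X >= min_cells) for X ~ Binomial(n_cells, proportion), with 0 < proportion < 1."""
    below = 0.0
    for k in range(min(min_cells, n_cells + 1)):
        # Binomial probability of exactly k rare cells, computed in log space
        # so that large binomial coefficients do not overflow.
        log_pmf = (
            lgamma(n_cells + 1) - lgamma(k + 1) - lgamma(n_cells - k + 1)
            + k * log(proportion)
            + (n_cells - k) * log1p(-proportion)
        )
        below += exp(log_pmf)
    return max(0.0, 1.0 - below)


def cells_needed(proportion, min_cells, confidence=0.95, max_cells=10_000):
    """Smallest number of cells that reaches the target confidence, or None."""
    for n_cells in range(min_cells, max_cells + 1):
        if prob_at_least(n_cells, proportion, min_cells) >= confidence:
            return n_cells
    return None
```

For example, `cells_needed(0.01, 50)` asks how many cells must be captured to see at least 50 cells of a type that makes up 1% of the sample; a result of `None` means the target is out of reach within `max_cells`.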

Page 11: Bioinformatics Analysis of Single-Cell Sequencing Data

Challenges in processing single-cell sequencing data

Variety within and across datasets is a classic biological problem that also applies to big data. Larger volumes of data bring clear opportunities, but technical, statistical and interpretative challenges rise alongside them.

Basic programming needed to interpret data

The information contained in single-cell data needs to be transformed into relevant biological knowledge

Page 12: Bioinformatics Analysis of Single-Cell Sequencing Data

Single cell sequencing data analysis

1. From raw reads to gene expression matrix

2. QC and Filtering

3. Normalization

4. Dimensionality Reduction

5. Clustering

6. Differentially Expressed Gene / Marker Identification

7. Trajectory analysis

8. Multimodal analysis

Page 13: Bioinformatics Analysis of Single-Cell Sequencing Data

Flowchart of the single-cell RNA-seq analysis

Cell Ranger

Seurat

U-BRITE2.0-Web App

U-BRITE1.0- Jupyter Notebook

https://gitlab.rc.uab.edu/MCBIOS19/single_cell_rnaseq_hands-on_1

https://gitlab.rc.uab.edu/attis2020/single_cell_attis2020

attis2020-scrnaseq-demo.informatics.uab.edu

Page 14: Bioinformatics Analysis of Single-Cell Sequencing Data

Single cell sequencing data analysis

1. From raw reads to gene expression matrix

Page 15: Bioinformatics Analysis of Single-Cell Sequencing Data

From raw reads to gene expression matrix

Cook D. Introduction to the single-cell RNA sequencing workflow. 2018

Page 16: Bioinformatics Analysis of Single-Cell Sequencing Data

The commands for cellranger count

cellranger count --help
cellranger count (3.1.0)
Copyright (c) 2019 10x Genomics, Inc. All rights reserved.
-------------------------------------------------------------------------------
'cellranger count' quantifies single-cell gene expression.
The commands below should be preceded by 'cellranger':

Usage:
    count --id=ID [--fastqs=PATH] [--sample=PREFIX] --transcriptome=ref-DIR [options]
    count <run_id> [options]
    count -h | --help | --version

Example

cellranger count --id=run_count_1kpbmcs \
    --fastqs=/path_to_fastq/pbmc_1k_v3_fastqs \
    --sample=pbmc_1k_v3 \
    --transcriptome=/path_to_ref/run_cellranger_count/refdata-cellranger-GRCh38-3.0.0

Page 17: Bioinformatics Analysis of Single-Cell Sequencing Data

The outputs of cellranger count

cellranger count --id=sample345 \
    --transcriptome=/refdata-cellranger-GRCh38-3.0.0 \
    --fastqs=/fastq_path \
    --expect-cells=1000

Outputs:
- Run summary HTML: web_summary.html
- Run summary CSV: metrics_summary.csv
- BAM: possorted_genome_bam.bam
- BAM index: possorted_genome_bam.bam.bai
- Filtered feature-barcode matrices MEX: filtered_feature_bc_matrix
- Filtered feature-barcode matrices HDF5: filtered_feature_bc_matrix.h5
- Unfiltered feature-barcode matrices MEX: raw_feature_bc_matrix
- Unfiltered feature-barcode matrices HDF5: raw_feature_bc_matrix.h5
- Secondary analysis output CSV: analysis
- Per-molecule read information: molecule_info.h5
- Loupe Cell Browser file: cloupe.cloupe

filtered_feature_bc_matrix
├── barcodes.tsv.gz
├── features.tsv.gz
└── matrix.mtx.gz

The files in filtered_feature_bc_matrix are the input to Seurat or other software.
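To show what the three files in filtered_feature_bc_matrix contain, here is a minimal standard-library sketch that reads them into a sparse dictionary. It is not how Seurat loads the data. It assumes the usual layout: matrix.mtx.gz in Matrix Market coordinate format with genes as rows and cells as columns, and the gene name in the second column of features.tsv.gz. The function names are hypothetical.

```python
import gzip
from pathlib import Path


def read_tsv_column(path, column):
    """Return one tab-separated column of a gzipped text file."""
    with gzip.open(path, "rt") as handle:
        return [line.rstrip("\n").split("\t")[column] for line in handle if line.strip()]


def load_mex(directory):
    """Load a feature-barcode matrix as {(gene_index, cell_index): count}."""
    directory = Path(directory)
    barcodes = read_tsv_column(directory / "barcodes.tsv.gz", 0)
    # Assumed features.tsv.gz columns: feature ID, feature name, feature type.
    gene_names = read_tsv_column(directory / "features.tsv.gz", 1)
    counts = {}
    with gzip.open(directory / "matrix.mtx.gz", "rt") as handle:
        # Lines starting with "%" are the Matrix Market header and comments.
        body = (line for line in handle if line.strip() and not line.startswith("%"))
        n_rows, n_cols, _n_entries = (int(x) for x in next(body).split())
        if (n_rows, n_cols) != (len(gene_names), len(barcodes)):
            raise ValueError("matrix shape does not match features/barcodes")
        for line in body:
            row, col, value = line.split()
            # Matrix Market indices are 1-based; rows are genes, columns are cells.
            counts[(int(row) - 1, int(col) - 1)] = int(value)
    return gene_names, barcodes, counts
```

Only non-zero entries are stored, which matches the sparsity of single-cell count data.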

Page 18: Bioinformatics Analysis of Single-Cell Sequencing Data

Single cell sequencing data analysis

2. QC and Filtering

Page 19: Bioinformatics Analysis of Single-Cell Sequencing Data

QC and Filtering

• Goal: Remove low-quality cells and potential doublets

• Common parameters worth exploring

UMI distribution

Number of genes detected

Percent of UMIs aligning to mitochondrial genes

Unusually high nUMI/nGene could indicate doublets (~90 doublets per 1000 cells)

A high percentage of mitochondrial UMIs is associated with cell death (loss of membrane integrity → loss of cytoplasmic content → enrichment of mitochondrial content)
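The QC parameters above can be computed per cell and turned into a filter. The following is a sketch only: the thresholds are placeholder assumptions that should be chosen from the distributions in your own data, the `MT-` prefix assumes human gene symbols, and the function names are hypothetical.

```python
def qc_metrics(cell_counts, mito_prefix="MT-"):
    """Compute QC metrics for one cell; cell_counts maps gene symbol -> UMI count."""
    n_umi = sum(cell_counts.values())
    n_gene = sum(1 for count in cell_counts.values() if count > 0)
    mito_umi = sum(
        count for gene, count in cell_counts.items() if gene.startswith(mito_prefix)
    )
    percent_mito = 100.0 * mito_umi / n_umi if n_umi else 0.0
    return {"n_umi": n_umi, "n_gene": n_gene, "percent_mito": percent_mito}


def filter_cells(cells, min_genes=200, max_genes=2500, max_percent_mito=5.0):
    """Return barcodes that pass QC; cells maps barcode -> cell_counts."""
    kept = []
    for barcode, cell_counts in cells.items():
        metrics = qc_metrics(cell_counts)
        # Too few genes suggests a low-quality cell; too many suggests a doublet.
        enough_genes = min_genes <= metrics["n_gene"] <= max_genes
        # A high mitochondrial percentage suggests a dying cell.
        healthy = metrics["percent_mito"] <= max_percent_mito
        if enough_genes and healthy:
            kept.append(barcode)
    return kept
```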

Page 20: Bioinformatics Analysis of Single-Cell Sequencing Data

Install Seurat3 and load the package

Page 21: Bioinformatics Analysis of Single-Cell Sequencing Data

Load PBMC data to Seurat

Page 22: Bioinformatics Analysis of Single-Cell Sequencing Data

Cell QC – filter cells

Page 23: Bioinformatics Analysis of Single-Cell Sequencing Data

Single cell sequencing data analysis

3. Normalization

Page 24: Bioinformatics Analysis of Single-Cell Sequencing Data

Normalization

Goal: Make the profiles of each cell comparable

Simplest approach: Scale each cell's library size to some arbitrary value (e.g. 10,000)

It's common to "regress out" the effect of nUMI and percent mito on each cell
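The library-size scaling described above can be sketched in a few lines. The log transform after scaling is an added assumption here (a common follow-up step), and the function name is hypothetical; this is not a re-implementation of any particular package.

```python
import math


def log_normalize(cell_counts, scale_factor=10_000):
    """Scale one cell's counts to a fixed library size, then apply log(1 + x)."""
    total = sum(cell_counts.values())
    if total == 0:
        return {gene: 0.0 for gene in cell_counts}
    return {
        gene: math.log1p(count / total * scale_factor)
        for gene, count in cell_counts.items()
    }
```

Two cells with the same relative expression but different sequencing depth, for example `{"A": 1, "B": 3}` and `{"A": 10, "B": 30}`, end up with matching normalized profiles, which is the point of the step.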

Page 25: Bioinformatics Analysis of Single-Cell Sequencing Data

Data normalization

Page 26: Bioinformatics Analysis of Single-Cell Sequencing Data

Detection of highly variable genes

Page 27: Bioinformatics Analysis of Single-Cell Sequencing Data

Scaling the data

Page 28: Bioinformatics Analysis of Single-Cell Sequencing Data

Single cell sequencing data analysis

4. Dimensionality Reduction

Page 29: Bioinformatics Analysis of Single-Cell Sequencing Data

Dimensionality Reduction

Goal: To visualize the structure of our data

Principal Component Analysis is one of the techniques used for dimensionality reduction

• Dimensionality simply refers to the number of features (i.e. input variables) in your dataset.

• Why reduce the dimensions? High-dimensional data are difficult to train on and need more computational power and time. Data with very many dimensions also cannot be visualized directly.

• PCA is a variance maximizer. It projects the original data onto the directions where variance is maximum. Variance is the measure of how spread out the data is.
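To illustrate "projecting onto the direction of maximum variance", the sketch below finds the first principal component with power iteration on the covariance matrix. This is a teaching example under simple assumptions (a small dense matrix with at least two rows, cells as rows and features as columns); real tools use optimized linear algebra. The function name is hypothetical.

```python
import math


def first_principal_component(rows, iterations=100):
    """Return (direction, scores) for the first principal component of rows."""
    n_rows, n_cols = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n_rows for j in range(n_cols)]
    centered = [[row[j] - means[j] for j in range(n_cols)] for row in rows]
    # Sample covariance matrix of the features.
    cov = [
        [sum(r[i] * r[j] for r in centered) / (n_rows - 1) for j in range(n_cols)]
        for i in range(n_cols)
    ]
    # Power iteration: repeated multiplication by cov converges to the direction
    # of maximum variance, unless the start vector is exactly orthogonal to it.
    vec = [float(j + 1) for j in range(n_cols)]
    for _ in range(iterations):
        new = [sum(cov[i][j] * vec[j] for j in range(n_cols)) for i in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in new))
        if norm == 0.0:
            break
        vec = [x / norm for x in new]
    # Project each centered cell onto the principal direction.
    scores = [sum(r[j] * vec[j] for j in range(n_cols)) for r in centered]
    return vec, scores
```

For points lying on the line y = 2x, the returned direction is proportional to (1, 2), and the scores place each point along that line.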

Page 30: Bioinformatics Analysis of Single-Cell Sequencing Data

Perform linear dimensional reduction(PCA)

Page 31: Bioinformatics Analysis of Single-Cell Sequencing Data

Visualizing the PCA results

Page 32: Bioinformatics Analysis of Single-Cell Sequencing Data

Single cell sequencing data analysis

5. Clustering

Page 33: Bioinformatics Analysis of Single-Cell Sequencing Data

Clustering

Clustering: Assign cells to groups of similar cells

Nature Reviews Genetics 20, 273–282 (2019)
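As a toy illustration of "assigning cells to groups of similar cells", here is a small k-means sketch applied to cell coordinates (for example, the first few principal components). k-means is a stand-in chosen for brevity; single-cell toolkits commonly use graph-based clustering instead. The deterministic farthest-first initialization and the function name are assumptions of this example.

```python
import math


def kmeans(points, k, iterations=50):
    """Cluster points into k groups; returns (labels, centroids)."""
    points = [tuple(p) for p in points]
    # Farthest-first initialization: start from the first point, then repeatedly
    # add the point farthest from all centroids chosen so far.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each cell joins its nearest centroid.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        updated = []
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                updated.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                updated.append(centroids[j])
        if updated == centroids:
            break
        centroids = updated
    return labels, centroids
```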

Page 34: Bioinformatics Analysis of Single-Cell Sequencing Data

Cell clustering

Page 35: Bioinformatics Analysis of Single-Cell Sequencing Data

Cell clusters

Page 36: Bioinformatics Analysis of Single-Cell Sequencing Data

Single cell sequencing data analysis

6. Differentially Expressed Gene / Marker Identification

Page 37: Bioinformatics Analysis of Single-Cell Sequencing Data

Differentially Expressed Gene / Marker Identification

Goal: Find out what's different between clusters

Lots of options, some complex, some simple:
  • Pairwise comparisons (e.g. Cluster A vs. Cluster B)
  • Marker identification (e.g. Cluster A vs. combined Clusters B, C, D)

Often many genes are identified as "significant" because of the large number of cells per cluster. You may need to apply effect-size (e.g. fold change) cutoffs to narrow the results to a smaller list of genes to follow up on

Soneson & Robinson, Nature Methods, 2018
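The effect-size cutoff mentioned above can be sketched as a "one cluster vs. all other cells" comparison that keeps genes above a log2 fold-change threshold. This example covers only the effect-size filter and performs no statistical test. The pseudocount of 1, the threshold default, and the function name are assumptions.

```python
import math


def find_markers(expression, clusters, target, min_log2fc=1.0, pseudocount=1.0):
    """Rank genes by log2 fold change of the target cluster vs. all other cells.

    expression maps gene -> list of values (one per cell);
    clusters is the list of cluster labels in the same cell order.
    """
    inside = [i for i, label in enumerate(clusters) if label == target]
    outside = [i for i, label in enumerate(clusters) if label != target]
    markers = {}
    for gene, values in expression.items():
        mean_in = sum(values[i] for i in inside) / len(inside)
        mean_out = sum(values[i] for i in outside) / len(outside)
        # The pseudocount avoids division by zero for genes absent in one group.
        log2fc = math.log2((mean_in + pseudocount) / (mean_out + pseudocount))
        if log2fc >= min_log2fc:
            markers[gene] = log2fc
    return dict(sorted(markers.items(), key=lambda item: item[1], reverse=True))
```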

Page 38: Bioinformatics Analysis of Single-Cell Sequencing Data

Finding differentially expressed genes (cluster biomarkers)

Page 39: Bioinformatics Analysis of Single-Cell Sequencing Data

Visualizing marker genes

Page 40: Bioinformatics Analysis of Single-Cell Sequencing Data

Visualizing marker genes

Page 41: Bioinformatics Analysis of Single-Cell Sequencing Data

Visualizing marker genes

Page 42: Bioinformatics Analysis of Single-Cell Sequencing Data

Single cell sequencing data analysis

7. Trajectory analysis

Page 43: Bioinformatics Analysis of Single-Cell Sequencing Data

Trajectory analysis

Pseudotime is a measure of how much progress an individual cell has made through a process such as cell differentiation.

In single-cell expression studies of processes such as cell differentiation, captured cells might be widely distributed in terms of progress.

Pseudotime is an abstract unit of progress: it's simply the distance between a cell and the start of the trajectory, measured along the shortest path.

Nat Methods. 2017 Oct; 14(10): 979–982.
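The definition of pseudotime as a shortest-path distance from the start of the trajectory can be illustrated with Dijkstra's algorithm. The sketch assumes the trajectory has already been learned and is given as a weighted graph of cells; how trajectory tools construct that graph is outside this example, and the function name is hypothetical.

```python
import heapq


def pseudotime(graph, root):
    """Shortest-path distance from root to every reachable cell.

    graph maps node -> {neighbor: edge_length}.
    """
    distances = {root: 0.0}
    queue = [(0.0, root)]
    while queue:
        dist, node = heapq.heappop(queue)
        if dist > distances.get(node, float("inf")):
            continue  # stale queue entry; a shorter path was already found
        for neighbor, length in graph.get(node, {}).items():
            candidate = dist + length
            if candidate < distances.get(neighbor, float("inf")):
                distances[neighbor] = candidate
                heapq.heappush(queue, (candidate, neighbor))
    return distances
```

Cells that cannot be reached from the root do not appear in the result, which mirrors the idea that pseudotime is only defined along the trajectory.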

Page 44: Bioinformatics Analysis of Single-Cell Sequencing Data

Single cell sequencing data analysis

8. Multimodal analysis

Page 45: Bioinformatics Analysis of Single-Cell Sequencing Data

Single cell VDJ seq data analysis

Samples: Thousands of single B cells from a healthy control and SLE patients

Method: 10x Genomics Single Cell Immune Profiling Solution, 5' Gene Expression + V(D)J Enriched Libraries

Goal: Compare the IG genes of the healthy control and the patients

Page 46: Bioinformatics Analysis of Single-Cell Sequencing Data

Cell proportion for different contig numbers in autoAb+ and healthy samples

Only cells with exactly two contigs (one heavy chain and one light chain) were used for the following analysis.
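The contig filter described above can be sketched as follows. The input format (a list of barcode and chain pairs) is a hypothetical simplification of a contig annotation table, and the chain labels IGH, IGK and IGL are assumed to mark heavy and light chains.

```python
from collections import defaultdict

HEAVY_CHAINS = {"IGH"}
LIGHT_CHAINS = {"IGK", "IGL"}


def paired_cells(contigs):
    """Return barcodes with exactly two contigs: one heavy and one light chain.

    contigs is an iterable of (barcode, chain) pairs.
    """
    chains_by_cell = defaultdict(list)
    for barcode, chain in contigs:
        chains_by_cell[barcode].append(chain)
    return sorted(
        barcode
        for barcode, chains in chains_by_cell.items()
        if len(chains) == 2
        and sum(chain in HEAVY_CHAINS for chain in chains) == 1
        and sum(chain in LIGHT_CHAINS for chain in chains) == 1
    )
```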

Page 47: Bioinformatics Analysis of Single-Cell Sequencing Data

IGHV for autoAb+ and healthy samples

autoAb+: SLE#1, SLE#2 and SLE#3

Healthy: HC

Page 48: Bioinformatics Analysis of Single-Cell Sequencing Data

Single-cell RNAseq data analysis in U-BRITE2.0 (Zongliang Yue)

Page 49: Bioinformatics Analysis of Single-Cell Sequencing Data

Acknowledgement

Department of Pathology

Dr. Casey T. Weaver

Division of Clinical Immunology and Rheumatology

Dr. John D. Mountz

Dr. Hui-Chen Hsu

Informatics Institute (UAB) School of Medicine

Zongliang Yue

Jelai Wang

Dr. Jake Chen

Dr. Alexander Rosenberg

Dr. Zechen Chong

Dr. James J. Cimino