Bioinformatics Analysis of Single-Cell
Sequencing Data
Min Gao, PhD
Assistant Professor of Medicine & UAB Informatics Institute
3.13.2020
• Background
1. Why single cell?
2. Current platforms for scRNA-seq
3. Experimental design
• Single-cell data analysis
1. From raw reads to gene expression matrix
2. QC and Filtering
3. Normalization
4. Dimensionality Reduction
5. Clustering
6. Differential Gene Expression / Marker Identification
7. Trajectory analysis
8. Multimodal analysis
• Single-cell RNAseq data analysis in U-BRITE2.0 (Zongliang Yue)
Outline
Single-cell analysis reveals heterogeneity
J Hematol Oncol. 2017 Jan 21;10(1):27
Why single-cell analysis?
Wang Y.; Navin N. E. Mol. Cell 2015
Single cell analysis affects diverse areas of biological research
Trends Genet. 2015 Oct; 31(10): 576–586.
Single cell analysis in cancer genomics
Single cell analysis in immunology
Trends Immunol. 2017 February ; 38(2): 140–149
Scaling of scRNA-seq experiments
Svensson et al., Nature Protocols, 2018
Platforms | Advantage | Limitation
Chromium System (10x Genomics) | High cell capture efficiency, easy to operate, end-to-end solution, multiple applications, well established platform, intensive support | High initial cell concentration required, no user modification possible
Nadia (Dolomite Bio) | Open platform, possibility to develop own protocols, multiple applications (PACS, DroNc-Seq) | High initial cell concentration required, lower cell capturing efficiency, no analysis software provided, skills to operate required
InDrop System (1CellBio) | High cell capture efficiency, open platform, possibility to develop own protocols | High initial cell concentration required, no analysis software support, skills to operate required
Illumina Bio-Rad ddSEQ Single-Cell Isolator | Product from industry leaders, easy to operate, end-to-end solution, kits for different starting number of cells | High initial cell concentration required, no user modification possible, single application (RNA-Seq)
Tapestri Platform (MissionBio) | Only platform dedicated to DNA-Seq, easy to operate, customized panels available | Single application possible (DNA-Seq)
BD Rhapsody Single-Cell Analysis System (BD) | Possibility to optimize costs (subsampling, archiving, targeted assays), easy to operate, end-to-end solution, protein detection promised | Single application possible (targeted RNA-Seq)
ICELL8 Single-Cell System (Takara) | Combined high throughput with active cell selection, easy to operate | Bioinformatics analysis not provided, single application (RNA-Seq)
C1 System and Polaris (Fluidigm) | Variable throughput (48–800 cells), multiple applications, customizable protocols, cell stimulation, well established platform, intensive support | Size-based cell selection (C1)
Puncher Platform (Vycap) | Filtering for rare cell capturing, active cell selection, visual control, high transferring efficiency, easy to operate, established WGA/WTA protocols | Low throughput, bioinformatics analysis not provided
CellRaft AIR System (CellMicrosystems) | Multiple applications (cultivation and tracking cell phenotypes, substance testing), active cell selection, visual control, high transfer efficiency, cost-effective manual version available | Low throughput, bioinformatics analysis not provided, adhesive properties of cells expected (although not mandatory)
DEPArray NxT (Menarini Silicon Biosystems) | Active cell selection, visual control, high transfer efficiency, possibility to study cell–cell interaction, established WGA/WTA protocols | Low throughput, bioinformatics analysis not provided; compared to other low-throughput instruments, a high price of consumables (chips)
AVISO CellCelector (ALS) | Active cell selection, visual control, multiple applications (transfer cell colonies), low price for consumables | Low throughput, bioinformatics analysis not provided, skills to operate required, adhesive properties of cells lower transfer efficiency, risk of contamination from co-transferred medium
Int J Mol Sci. 2018 Mar; 19(3): 807.
Single Cell Platforms
Droplet-based single cell platform
Single Cell Solutions:
• Single Cell CNV
• Single Cell Gene Expression
• Single Cell Immune Profiling
• Single Cell ATAC
https://wp.10xgenomics.com/instruments/chromium-controller/
Lab Chip. 2019 May 14;19(10):1706-1727
Experimental Design considerations
How deep to sequence?
• The number of genes detected saturates around 1 million reads per cell
• Identifying the different cell types present: 25-50k reads per cell
• Identifying transcriptional dynamics within a population: 50-100k reads per cell
• We don’t necessarily need to detect everything in every cell!
How many cells?
• Two main things to consider:
1) How many cell types are there?
2) What is the proportion of the rarest cell type you’re interested in?
• 10x Genomics currently allows each run to yield anywhere from ~500 to 10,000 cells
• Do you actually know how many cell types there are? What about cell states? The current trend in the field seems more focused on increasing cell number
What about replicates?
• It depends.
• A technical replicate of PBMCs has near-perfect overlap
• Cells are dramatically different between patients.
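The two "how many cells" questions can be turned into a quick calculation. The sketch below is an illustration under a simple binomial sampling assumption (every captured cell independently belongs to the rare type with a fixed probability); the function names and the example numbers are hypothetical and not from the slides.

```python
from math import comb


def prob_at_least(n_cells: int, fraction: float, min_cells: int) -> float:
    """P(X >= min_cells) for X ~ Binomial(n_cells, fraction).

    Assumption: each captured cell is an independent draw, and `fraction`
    is the true proportion of the rare cell type in the sample.
    """
    p_fewer = sum(
        comb(n_cells, k) * fraction**k * (1 - fraction) ** (n_cells - k)
        for k in range(min_cells)
    )
    return 1.0 - p_fewer


def total_reads(n_cells: int, reads_per_cell: int) -> int:
    """Sequencing budget implied by a cell number and a per-cell depth."""
    return n_cells * reads_per_cell


if __name__ == "__main__":
    # Hypothetical scenario: rarest type is 1% of cells, we want >= 20 of them,
    # sequenced at 50k reads per cell.
    for n in (500, 2000, 10000):
        print(n, round(prob_at_least(n, 0.01, 20), 3), total_reads(n, 50_000))
```

Running the loop for a few run sizes shows how quickly the chance of capturing enough rare cells rises with cell number, and what that costs in total reads.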
Challenges in processing single-cell sequencing data
Variety within and across data sets is a classic biological problem that also applies to big data. Larger data volumes bring clear opportunities, but they also bring technical, statistical, and interpretative challenges.
Basic programming needed to interpret data
The information contained in single-cell data needs to be transformed into relevant biological knowledge
Single cell sequencing data analysis
1. From raw reads to gene expression matrix
2. QC and Filtering
3. Normalization
4. Dimensionality Reduction
5. Clustering
6. Differential Gene Expression / Marker Identification
7. Trajectory analysis
8. Multimodal analysis
Flowchart of the single-cell RNA-seq analysis
Cell Ranger
Seurat
U-BRITE2.0-Web App
U-BRITE1.0- Jupyter Notebook
https://gitlab.rc.uab.edu/MCBIOS19/single_cell_rnaseq_hands-on_1
https://gitlab.rc.uab.edu/attis2020/single_cell_attis2020
attis2020-scrnaseq-demo.informatics.uab.edu
From raw reads to gene expression matrix
Cook D. Introduction to the single-cell RNA sequencing workflow. 2018
cellranger count --help
cellranger count (3.1.0)
Copyright (c) 2019 10x Genomics, Inc. All rights reserved.
-------------------------------------------------------------------------------
'cellranger count' quantifies single-cell gene expression.
The commands below should be preceded by 'cellranger':
Usage:
    count
        --id=ID
        [--fastqs=PATH]
        [--sample=PREFIX]
        --transcriptome=ref-DIR
        [options]
    count <run_id> [options]
    count -h | --help | --version
The commands for cellranger count
cellranger count --id=run_count_1kpbmcs \
    --fastqs=/path_to_fastq/pbmc_1k_v3_fastqs \
    --sample=pbmc_1k_v3 \
    --transcriptome=/path_to_ref/run_cellranger_count/refdata-cellranger-GRCh38-3.0.0
Example
• cellranger count --id=sample345 \
    --transcriptome=/refdata-cellranger-GRCh38-3.0.0 \
    --fastqs=/fastq_path \
    --expect-cells=1000
• Outputs:
- Run summary HTML: web_summary.html
- Run summary CSV: metrics_summary.csv
- BAM: possorted_genome_bam.bam
- BAM index: possorted_genome_bam.bam.bai
- Filtered feature-barcode matrices MEX: filtered_feature_bc_matrix
- Filtered feature-barcode matrices HDF5: filtered_feature_bc_matrix.h5
- Unfiltered feature-barcode matrices MEX: raw_feature_bc_matrix
- Unfiltered feature-barcode matrices HDF5: raw_feature_bc_matrix.h5
- Secondary analysis output CSV: analysis
- Per-molecule read information: molecule_info.h5
- Loupe Cell Browser file: cloupe.cloupe
The outputs of cellranger count
• filtered_feature_bc_matrix
├── barcodes.tsv.gz
├── features.tsv.gz
└── matrix.mtx.gz
Seurat or other software
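For readers who want to see what is inside the three files before handing them to Seurat, here is a minimal sketch of a reader using only the Python standard library. It assumes the Cell Ranger 3 layout shown above: one barcode per line, tab-separated features (ID, name, type), and a Matrix Market coordinate file with 1-based "feature cell count" triplets. The function names are hypothetical, and a real analysis would normally use an established loader instead.

```python
import gzip
from pathlib import Path


def read_lines(path):
    """Read a gzip-compressed text file into a list of lines."""
    with gzip.open(path, "rt") as handle:
        return [line.rstrip("\n") for line in handle]


def read_mex(matrix_dir):
    """Load a feature-barcode matrix directory into plain Python objects.

    Returns (gene_names, barcodes, counts), where counts maps
    (gene_index, cell_index) -> UMI count with 0-based indices.
    """
    matrix_dir = Path(matrix_dir)
    barcodes = read_lines(matrix_dir / "barcodes.tsv.gz")
    # Assumed column order in features.tsv.gz: feature ID, name, type.
    features = [line.split("\t") for line in read_lines(matrix_dir / "features.tsv.gz")]
    gene_names = [fields[1] for fields in features]

    counts = {}
    with gzip.open(matrix_dir / "matrix.mtx.gz", "rt") as handle:
        line = handle.readline()
        while line.startswith("%"):  # skip the header and comment lines
            line = handle.readline()
        n_genes, n_cells, _n_entries = map(int, line.split())
        if n_genes != len(gene_names) or n_cells != len(barcodes):
            raise ValueError("matrix dimensions do not match features/barcodes")
        for line in handle:
            if not line.strip():
                continue
            gene_i, cell_j, value = line.split()
            counts[(int(gene_i) - 1, int(cell_j) - 1)] = int(value)
    return gene_names, barcodes, counts
```

Only non-zero entries are stored, which mirrors why the matrix is shipped in a sparse format: most gene/cell combinations have a count of zero.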
QC and Filtering
• Goal: Remove low-quality cells and potential doublets
• Common parameters worth exploring
UMI distribution
Number of genes detected
Percent of UMIs aligning to mitochondrial genes
Cells with unusually high nUMI/nGene could be doublets (~90 doublets per 1000 cells)
A high fraction of mitochondrial transcripts is associated with cell death (loss of membrane integrity → loss of cytoplasmic RNA → enrichment of mitochondrial content)
Install Seurat3 and load the package
Load PBMC data to Seurat
Cell QC – filter cells
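The three QC parameters listed above can be computed per cell with a few lines of code. The sketch below is a plain-Python illustration, not the Seurat implementation: the data layout (a dict of per-cell gene counts), the "MT-" prefix for human mitochondrial genes, and the default thresholds are all assumptions chosen for the example and should be tuned to each dataset.

```python
def qc_metrics(cell_counts):
    """Return nUMI, nGene and percent mitochondrial UMIs for one cell.

    cell_counts: dict mapping gene name -> UMI count (hypothetical layout).
    Assumption: mitochondrial genes are named with an "MT-" prefix.
    """
    n_umi = sum(cell_counts.values())
    n_gene = sum(1 for count in cell_counts.values() if count > 0)
    mito = sum(c for gene, c in cell_counts.items() if gene.upper().startswith("MT-"))
    pct_mito = 100.0 * mito / n_umi if n_umi else 0.0
    return {"n_umi": n_umi, "n_gene": n_gene, "pct_mito": pct_mito}


def filter_cells(cells, min_genes=200, max_genes=2500, max_pct_mito=5.0):
    """Keep cells whose metrics fall inside the (illustrative) thresholds.

    Too few genes suggests an empty droplet or low-quality cell, too many
    suggests a doublet, and a high mitochondrial share suggests a dying cell.
    """
    kept = {}
    for barcode, counts in cells.items():
        metrics = qc_metrics(counts)
        if (min_genes < metrics["n_gene"] < max_genes
                and metrics["pct_mito"] < max_pct_mito):
            kept[barcode] = counts
    return kept
```

Plotting the distribution of each metric before choosing cutoffs is usually more informative than applying fixed numbers blindly.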
Normalization
Goal: Make the expression profiles of cells comparable
Simplest approach: Scale each cell’s library size to an arbitrary value (e.g. 10,000)
It’s also common to “regress out” the effect of nUMI and percent mito on each cell
Data normalization
Detection of highly variable genes
Scaling the data
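The "simplest approach" and the scaling step can be written out directly. This is a minimal sketch assuming log-normalization of the form log(1 + count / total × scale factor) followed by a per-gene z-score across cells; the function names are hypothetical, and regression of nUMI or percent mito is not included.

```python
from math import log1p, sqrt


def log_normalize(cell_counts, scale_factor=10_000):
    """Scale one cell to a common library size, then log-transform.

    cell_counts: dict mapping gene name -> UMI count for a single cell.
    """
    total = sum(cell_counts.values())
    if total == 0:
        return {gene: 0.0 for gene in cell_counts}
    return {gene: log1p(count / total * scale_factor)
            for gene, count in cell_counts.items()}


def scale_gene(values):
    """Z-score one gene across cells (mean 0, unit variance).

    values: normalized expression of a single gene, one entry per cell
    (at least two cells are needed to estimate the variance).
    """
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    sd = sqrt(variance)
    return [(v - mean) / sd if sd > 0 else 0.0 for v in values]
```

After normalization, two cells with the same relative composition but different sequencing depth get the same profile, which is exactly what makes them comparable.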
Dimensionality Reduction
Goal: To visualize the structure of our data
Principal Component Analysis is one of the techniques used for dimensionality reduction
• Dimensionality simply refers to the number of features (i.e. input variables) in your dataset.
• Why reduce the dimensions? High-dimensional data are difficult to train on and need more computational power and time. Very high-dimensional data also cannot be visualized directly.
• PCA is a variance maximizer. It projects the original data onto the directions where variance is maximum. Variance is the measure of how spread out the data is.
Perform linear dimensional reduction (PCA)
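To make "variance maximizer" concrete, the sketch below finds the first principal component of a small cells × genes table by power iteration on the covariance matrix. It is an illustration only: real pipelines use optimized linear algebra and compute many components, and the function name, iteration count and seeded random start are assumptions of this example.

```python
import random
from math import sqrt


def first_principal_component(rows, n_iter=200, seed=0):
    """Return (direction, variance, scores) for the first principal component.

    rows: list of cells, each a list of feature values (e.g. scaled genes).
    """
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in rows]
    # Sample covariance matrix between features.
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]

    # Power iteration: repeatedly applying cov pulls a random start vector
    # toward the direction of maximum variance.
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    for _step in range(n_iter):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]

    length = sqrt(sum(x * x for x in v))
    v = [x / length for x in v]
    variance = sum(v[a] * sum(cov[a][b] * v[b] for b in range(d)) for a in range(d))
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, variance, scores


if __name__ == "__main__":
    direction, variance, scores = first_principal_component(
        [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
    )
    print(direction, variance, scores)
```

The sign of the returned direction is arbitrary, so only the axis it defines and the spread of the scores along it are meaningful.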
Visualizing the PCA results
Clustering: Assign cells to groups of similar cells
Clustering
Nature Reviews Genetics volume 20, pages 273–282 (2019)
Cell clustering
Cell clusters
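As a toy illustration of "assign cells to groups of similar cells", the sketch below runs k-means on cell coordinates (for example, PCA scores). Note the swap: single-cell toolkits typically cluster on a nearest-neighbour graph rather than with k-means; k-means is used here only because it fits in a few lines. The function names and the farthest-point initialization are choices made for this example.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, n_iter=50):
    """Cluster points into k groups; returns (labels, centers).

    points: list of coordinate tuples, one per cell.
    Initialization is deterministic: start from the first point, then
    repeatedly add the point farthest from all chosen centers.
    """
    centers = [list(points[0])]
    while len(centers) < k:
        farthest = max(
            points, key=lambda p: min(squared_distance(p, c) for c in centers)
        )
        centers.append(list(farthest))

    labels = None
    for _step in range(n_iter):
        # Assignment step: each cell joins its nearest center.
        new_labels = [
            min(range(k), key=lambda i: squared_distance(p, centers[i]))
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        for i in range(k):
            members = [p for p, label in zip(points, new_labels) if label == i]
            if members:
                centers[i] = [sum(col) / len(members) for col in zip(*members)]
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centers
```

Unlike graph-based methods, k-means needs the number of clusters up front, which is rarely known for a new tissue.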
Differential Gene Expression / Marker Identification
Goal: Find out what’s different between clusters
Lots of options: some complex, some simple
• Pairwise comparisons (e.g. Cluster A vs. Cluster B)
• Marker identification (e.g. Cluster A vs. combined Clusters B, C, D)
Often, many genes are identified as “significant” because of the large number of cells per cluster. Effect-size cutoffs (e.g. fold change) may be needed to narrow the list down to candidates worth following up.
Soneson & Robinson, Nature Methods, 2018
Finding differentially expressed genes (cluster biomarkers)
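One simple way to combine a significance test with an effect-size cutoff is sketched below. It assumes a Wilcoxon rank-sum test (normal approximation with tie correction) and a log2 fold change computed from group means with a pseudocount; both are choices made for this illustration, the thresholds are hypothetical, and multiple-testing correction is deliberately left out to keep the example short.

```python
from math import erfc, log2, sqrt


def log2_fold_change(group_a, group_b, pseudocount=1.0):
    """log2 ratio of mean expression, with a pseudocount to avoid log(0)."""
    mean_a = sum(group_a) / len(group_a)
    mean_b = sum(group_b) / len(group_b)
    return log2(mean_a + pseudocount) - log2(mean_b + pseudocount)


def rank_sum_p_value(group_a, group_b):
    """Two-sided Wilcoxon rank-sum p-value (normal approximation)."""
    n_a, n_b = len(group_a), len(group_b)
    n = n_a + n_b
    ordered = sorted((v, i) for i, v in enumerate(list(group_a) + list(group_b)))
    ranks = [0.0] * n
    tie_term = 0
    start = 0
    while start < n:
        end = start
        while end + 1 < n and ordered[end + 1][0] == ordered[start][0]:
            end += 1
        midrank = (start + end) / 2 + 1  # average rank of the tied block
        for pos in range(start, end + 1):
            ranks[ordered[pos][1]] = midrank
        block = end - start + 1
        tie_term += block**3 - block
        start = end + 1
    u = sum(ranks[:n_a]) - n_a * (n_a + 1) / 2
    variance = n_a * n_b / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if variance <= 0:
        return 1.0  # all values identical: no evidence of a difference
    z = (u - n_a * n_b / 2) / sqrt(variance)
    return erfc(abs(z) / sqrt(2))


def find_markers(expr_cluster, expr_rest, min_log2_fc=1.0, alpha=0.05):
    """Genes passing both the p-value and the effect-size cutoff.

    expr_cluster / expr_rest: dict gene -> list of expression values for
    cells inside / outside the cluster (hypothetical layout).
    """
    markers = []
    for gene, inside in expr_cluster.items():
        lfc = log2_fold_change(inside, expr_rest[gene])
        p = rank_sum_p_value(inside, expr_rest[gene])
        if abs(lfc) >= min_log2_fc and p < alpha:
            markers.append((gene, lfc, p))
    return sorted(markers, key=lambda m: m[2])
```

With thousands of cells per cluster the p-value alone flags far too many genes, so the fold-change filter does most of the work in practice.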
Visualizing marker genes
Trajectory analysis
Pseudotime is a measure of how much progress an individual cell has made through a process such as cell differentiation.
In single-cell expression studies of processes such as cell differentiation, captured cells might be widely distributed in terms of progress.
Pseudotime is an abstract unit of progress: it's simply the distance between a cell and the start of the trajectory, measured along the shortest path.
Nat Methods. 2017 Oct; 14(10): 979–982.
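The definition of pseudotime as a shortest-path distance from the start of the trajectory can be sketched with Dijkstra's algorithm. The example assumes a trajectory graph has already been built (for instance from a reduced-dimension embedding); the adjacency-dict layout, the cell names and the edge weights are hypothetical.

```python
import heapq


def pseudotime(graph, root):
    """Shortest-path distance from the root cell to every reachable cell.

    graph: dict mapping cell -> {neighbour: edge_length}, with edges listed
    in both directions for an undirected trajectory.
    """
    distance = {root: 0.0}
    heap = [(0.0, root)]
    while heap:
        dist, cell = heapq.heappop(heap)
        if dist > distance.get(cell, float("inf")):
            continue  # stale heap entry; a shorter path was already found
        for neighbour, length in graph[cell].items():
            candidate = dist + length
            if candidate < distance.get(neighbour, float("inf")):
                distance[neighbour] = candidate
                heapq.heappush(heap, (candidate, neighbour))
    return distance


if __name__ == "__main__":
    # Toy branching trajectory: stem -> progenitor -> two fates.
    toy = {
        "stem": {"progenitor": 1.0},
        "progenitor": {"stem": 1.0, "fate_a": 2.0, "fate_b": 0.5},
        "fate_a": {"progenitor": 2.0},
        "fate_b": {"progenitor": 0.5},
    }
    print(pseudotime(toy, "stem"))
```

Cells on different branches can share the same pseudotime value, which is why branch assignment is reported separately from progress.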
Multimodal analysis
Single cell VDJ seq data analysis
Samples: Thousands of single B cells from a healthy control and SLE patients
Method: 10x Genomics - Single Cell Immune Profiling Solution - 5’ Gene Expression + V(D)J Enriched Libraries
Goal: Compare the IG genes of the healthy control and the patients
Cell proportion for different contig numbers in autoAb+ and healthy samples
Only cells with exactly two contigs (one heavy chain and one light chain) were used for the following analysis.
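The contig filter and the IGHV comparison can be sketched as follows. The record layout is an assumption: each contig is a dict with "barcode", "chain" (IGH for heavy, IGK or IGL for light) and "v_gene" keys, loosely mirroring a contig annotation table. The function names are hypothetical and this is not the code used for the slides.

```python
from collections import Counter, defaultdict


def paired_cells(contigs):
    """Keep cells with exactly two contigs: one heavy and one light chain."""
    by_cell = defaultdict(list)
    for contig in contigs:
        by_cell[contig["barcode"]].append(contig)

    paired = {}
    for barcode, cell_contigs in by_cell.items():
        chains = sorted(c["chain"] for c in cell_contigs)
        # After sorting, a valid pair is ["IGH", "IGK"] or ["IGH", "IGL"].
        if (len(cell_contigs) == 2 and chains[0] == "IGH"
                and chains[1] in ("IGK", "IGL")):
            paired[barcode] = cell_contigs
    return paired


def ighv_usage(paired):
    """Fraction of paired cells using each heavy-chain V gene."""
    counts = Counter(
        contig["v_gene"]
        for cell_contigs in paired.values()
        for contig in cell_contigs
        if contig["chain"] == "IGH"
    )
    total = sum(counts.values())
    return {gene: n / total for gene, n in counts.items()} if total else {}
```

Computing `ighv_usage` separately for each sample gives per-sample frequencies that can be compared between the autoAb+ and healthy groups.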
IGHV for autoAb+ and healthy samples
autoAb+: SLE#1, SLE#2 and SLE#3
Healthy: HC
Single-cell RNAseq data analysis in U-BRITE2.0 (Zongliang Yue)
Acknowledgement
Department of Pathology
Dr. Casey T. Weaver
Division of Clinical Immunology and Rheumatology
Dr. John D. Mountz
Dr. Hui-Chen Hsu
Informatics Institute (UAB) School of Medicine
Zongliang Yue
Jelai Wang
Dr. Jake Chen
Dr. Alexander Rosenberg
Dr. Zechen Chong
Dr. James J. Cimino