Upload
others
View
3
Download
0
Embed Size (px)
Citation preview
OverviewofsinglecellRNA-Seq analysis
VishalKoparde,Ph.D.CCRCollaborativeBioinformaticsResource(CCBR)
Leidos BiomedicalResearch,Inc.
Outline
• BulkRNA-Seq vsscRNA-Seq• scRNA-Seq applications&challenges• scRNA-Seq assays• scRNA-Seq pipeline(UMI,QC,downstreamanalysis)• t-SNEvisualization• CCBRscRNA-Seq pipeline
BulkRNA-Seq:Averagingmaskscellularheterogeneity
BulkRNA-Seq
• Thebasicunitofresearchisthecell.Butweareusuallyanalyzingpopulationsofcellsandthiscan:• Leadtofalsepositivesfromunderestimatingbiologicalvariability• Missimportantbiologicaldivisions
Mean=.959Stddev=.048SampleSize=2
BulkRNA-Seq
• Impossibletoconcludeifthesedifferencesareduetoauniformpropertyofallcellswithinatumor/genegrouporduetoadditiveeffectsofjustafewsub-cloneswithinatumor/genegroup
• ConclusionsfrombulkanalysiscanberepresentativeofNOTHING
scRNA-Seq
scRNA-Seq evolution
scRNA-Seq hasbeenaroundforadecade,buthasnowmaturedwithpracticalapplications
PotentialofscRNA-Seq
• InvestigateTranscriptHeterogeneity• Differencesintranscriptabundance• Usageofalternatetranscriptioninitiationandpolyadenylationsites• Alternativesplicinganddifferentialexpressionoftranscriptisoforms
Challenges:scRNA-Seq
• Limitedamountsofmaterial• Illumina/PacBio require1ng-500ng• 1bacteriumor1cancercellcontain1fg-1pg• ~10-50pgoftotalRNAà 1-5%ofwhichismRNA• WTA/WGAbecomereallyimportant
Challenges:scRNA-Seq
• Relativelyhighcorrelationisexpectedbetweensamples
• Highlyvariableandnoisy• Manydifferencesevenbetweensametypeofcells• Biologicalsignaldiminishesatcellularlevelandbecomescomparabletotechnicalvariability
Challenges:scRNA-Seq
• Viability• Cellstress• Dependsonapplication• Budgetconsiderations• Technicalvariation• Cellcapture• Librarypreparation• Inherentsequencingbiases
scRNA-Seq:Differentassays
• TherearemanyscRNA-Seqassaysoutthere:• Somearecommercialized… growing• Fulltranscriptvs3’• Isoformvsgene• Lessormoreautomated• Differentlevelsofthroughput• Differencesincosts• Differencesindownstreambioinformaticsanalysis
SmartSeq2
• FulltranscriptscRNA-Seq• Selectsforpoly-Atail• 5’switchingforfulltranscriptlengthcapture• StandardIlluminasequencing• Off-the-shelfproducts
• Hundredsofsamples• NoUMIsused
DropSeq
• Droplet-based• Throughputfromhundredstothousands• 3’Endonly• Singlecelltranscriptomesattachedtomicroparticles• UMIs(uniquemolecularidentifiers)• Cellbarcodes• Drop-seq toolssoftware
10XGenomics
• Droplet-based• GEMà GelbeadinEMulsion
• 3’mRNA• Standardizedinstrumentationandreagents• Morehigh-throughputscalingtotensofthousandsofcells• Lessprocessingtime• Cellrangersoftware
10XGenomicsCellRanger
• cellranger mkfastq … BCLtofastq• cellranger count… fastqtocountmatrix• cellranger aggr … combinefastqs frommultiplerunsintoonecountmatrix• cellranger reanalyze… countmatrixtodimensionalityreduction/clustering• PythonandRsupportincluded
Analysispipelinesdifferalot… why?
• scRNA-Seqdatasetscomeindifferentflavors:• “Fishingexpedition”:Dissociatetissueà getsubpopulations!• CasevsControl• Moleculartransformationsovertime(Pseudotime)• Perturbationseg.CRISPRScreening
• Veryhardtohaveasingledo-it-allpipelineforscRNA-Seqanalysis
Newpipelineskeeppoppingup…
https://github.com/seandavi/awesome-single-cell
Rawdata
AssaysdifferbyFASTQcontent
• Dropseq and10XisreallyjustsingleendRNASeqonapercelllevel
UMI
• UniqueMolecularIdentifiersareshort(4-10bp)randombarcodesaddedtotranscriptsduringreverse-transcription.• TheyenablesequencingreadstobeassignedtoindividualtranscriptmoleculesandthustheremovalofamplificationnoiseandbiasesfromscRNASeqdata
IssueswithUMI
• DifferentUMIdoesn’tnecessarilymeandifferentmolecule
• Differenttranscriptdoesn’tnecessarilymeandifferentmolecule
• SameUMIdoesn’tnecessarilymeansamemolecule
• Cell-rangeperformsbuilt-inerrorcorrectionforUMI-Read-Transcriptmapping
scRNA-Seq pipeline:QC
• ReadalignmentsandmuchoftheQCcanbehandledsimilartobulkRNA-Seq
• Cellrangerusesmaskedhumanandmousegenometoexclude“problem”regionsapriori
scRNA-Seq pipeline:QC
scRNA-Seq pipeline:Downstreamanalysis
• Completeend-to-endpipelineinR• SINCERAusedbyCincinnatiChildren’sHospital
• Rpackages• Seurat• scater• SC3
CommonstepsofascRNA-Seq pipeline
• Aligningtogenome• QC• Readcounting• Filtering• Normalization• Clustering
• Selectionoftoolsvarybetweendatasettodataset… dependonthebiggerbiologicalquestion
CountMatrix
• 5000-12000genespercellon10X
• MostentriesarezeroàSparsematrix
• Differentefficientrepresentationofsparsematrixinmemoryareused
CountMatrixascoordinatelist
Geneshavedifferentdistributions
Geneshavedifferentdistributions
Filtering:Numberofreads
• Cellbarcodeswithlownumberofreadsareexcluded
user-defined
arbitrarycutoff
Filtering:Numberofgenes
user-defined
arbitrarycutoff
Filtering:NumberofUMIs
user-defined
arbitrarycutoff
Filtering:Mitochondrial%
user-defined
arbitrarycutoff
Filtering:Outliersare“multipleoffenders”
user-defined
arbitrarycutoff
AutomaticFiltering
Normalization
• scRNASeqassaysarepronetoconfoundingbatcheffects,hencewenormalize• ManyRNA-Seqnormalizationtechniquescandirectlybeapplied• Exploringtheideaofcombiningorselectingfromacollectionofnormalizationorcorrectionmethodsbestforaspecificcase• SomebelievethatUMIbaseanalysisneednotbenormalizedbetweensamplesasabsolutecountofmoleculesarereported
Normalization
• No.ofreadspercellvaryhighlyandlibrarysizenormalizationisrequired• MethodslikeRSEMalreadyincorporatelibrarysizeanddonotrequirenormalization• “normalizationfactor”isamultiplicationfactor,whichisanestimateofthelibrarysizerelativetotheothercells• TMM,UQ,RLE
NormalizationRaw
CPM RLEUQ
Visualization
• PCA(lineardimensionalityreduction)• Smallpercentageofvarianceisexplainedbyfirst2or3PCs
• t-SNE(non-lineardimensionalityreduction)coloredwithk-meansclustering• Notstraightforwardtodeterminek• kcanbesettothenumberofsignificantPCswhichcanbedeterminedbyajackstrawplot(computationallyintensive)• Iterativeprocesswithgradualfinetuning
PCAvs.t-SNE
Whatist-SNE?
• t-DistributedStochasticNeighborEmbedding• “Dimensionalityreduction”-->representmulti-dimensionalitydatain2or3-Dspace• PCA:• Linear• Unabletointerpretcomplex
polynomialrelationshipbetweenfeatures
• Focusesonrepresentingdissimilarpointsfartherawayinlowerdimensions
• t-SNE:• Non-linear• Focusesonrepresentingsimilar
datapoints closetogetherinlowerdimensions
• notcapableofretainingboththelocalandglobalstructureofthedataatthesametime
Thingstorememberaboutt-SNE
• Canarriveatdifferentsolutionsdependingontheinitial“random”distributions… Hence,itisimportanttousethesame“seed”everytime
• Notcapableofretainingboththelocalandglobalstructureofthedataatthesametime
• OneoftheunderlyingalgorithmsusedinthelatestiPhoneXFacialrecognitionsoftware
Algorithm FERaccuracy
PCA 75.40%
LDA 75.90%
LLE 87.70%
SNE 90.60%
t-SNE 94.50%
CCBRscRNASeqpipeline
Development:https://github.com/CCBR/scRNASeq Production:https://github.com/CCBR/Pipeliner
CCBRscRNASeqpipeline
• Startingpoints• 10Xgenomicsfastq• 10Xgenomicscountmatrix
• DataFilteringetc.• Downstreamanalysis
• k-meansclustering• PCA• tSNE plot• markergenelists
• Alltasksaresubmittedasslurm jobstobiowulf
CCBRPipeliner:examplevisualizations
CCBRPipeliner:What’sinthe“pipeline”?
• Futureenhancements:• ComprehensiveQCatacellularlevel
• Smoothmergingofmultipledatasetswithminimalbatcheffect(newerversionofSeurat)
• Pseudotimeanalysis
• Shinyappforinteractive3-Dt-SNEforeffectivevisualizationwhichwillallowpseudocoloringbasedon:• Clusternumber• Gene• Genesets /Pathways
Takehomemessage
• “scRNA Seqanditsanalysis”fieldischangingatarapidpace• ManyofthethingsthatyouheardaboutscRNA Seqanalysistodaywillbeobsoleteinafewmonths!
THANKYOUFORYOURATTENTION