
EDACC Primary Analysis Pipelines Cristian Coarfa Bioinformatics Research Laboratory Molecular and Human Genetics


  • EDACC Primary Analysis Pipelines

    Cristian Coarfa, Bioinformatics Research Laboratory, Molecular and Human Genetics

  • Data Levels

  • Data Types Submitted To EDACC
    • ChIP-Seq
    • Shotgun Bisulfite Sequencing (Methyl-C)
    • Reduced Representation Bisulfite Sequencing (RRBS)
    • MRE-Seq
    • MeDIP-Seq
    • Chromatin Accessibility
    • small RNA-Seq
    • mRNA-Seq

  • Read Mapping
    • Common processing step to all pipelines
    • High throughput
      • Sequence space: Illumina
      • Color space: SOLiD
    • Quick and accurate anchoring
    • Read size varies from 36 to 76 bp
    • Short read aligners
      • 1st generation: Maq, soap
        • Ungapped alignment
      • 2nd generation: bowtie, bwa, soap 2
        • Trade off between speed and sensitivity; good enough for many applications
    • Mapping tools
      • Robust to indels
      • Sensitive to variable number of mismatches

  • Pash 3.0: Positional Hashing
    • Regular read mapping
    • Bisulfite sequencing mapping
    • Integrates basepair variation with epigenetic variation
    • SAM output, for easy integration with other analysis tools
    • Accuracy without sacrificing efficiency
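The idea of anchoring reads by hashing can be sketched as follows. This is a generic seed-and-vote illustration of k-mer hashing, not Pash's actual algorithm or code; the function names `build_kmer_index` and `anchor_read` and the choice of exact k-mer matches are assumptions made for the example.

```python
from collections import Counter, defaultdict


def build_kmer_index(reference, k):
    """Hash every k-mer of the reference to the list of positions where it occurs."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def anchor_read(read, index, k):
    """Return (candidate_start, supporting_kmers) for the best-supported anchor.

    Each k-mer hit votes for the reference position where the read would
    start (hit position minus offset within the read). Returns None when
    no k-mer of the read occurs in the index.
    """
    votes = Counter()
    for offset in range(len(read) - k + 1):
        for ref_pos in index.get(read[offset:offset + k], ()):
            votes[ref_pos - offset] += 1
    if not votes:
        return None
    return votes.most_common(1)[0]


# Toy reference and a read copied from positions 5..14 (hypothetical data)
reference = "ACGTTGCAAGGCTTACCGATAG"
index = build_kmer_index(reference, k=4)
print(anchor_read(reference[5:15], index, k=4))
```

A real mapper would follow the anchor with an alignment step that tolerates mismatches and indels; the sketch stops at the anchoring stage.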

  • Bisulfite Sequencing
    • Current tools: BSMAP, RMAP-BS, mrsFast, Zoom
    • Pash 3.0
      • Integrates mutation discovery with basepair-level methylation discovery
      • Speedup
    • General approach
      • Convert Cs to Ts in reads and/or reference
      • Use mappings, reads and reference to determine methylated sites
    • Pash 3
      • Generate and hash all possible k-mers for reads
      • Example: CTT yields CCC, CCT, CTC, CTT
      • Map against forward and reverse complement chromosome strands
    • Superior sensitivity to other tools, without loss of efficiency
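The two strategies on this slide can be illustrated with a short sketch. It assumes that every T in a read may stem either from a genuine T or from an unmethylated C, so each T expands to both C and T; the helper names are hypothetical and this is not Pash's implementation.

```python
from itertools import product


def in_silico_convert(seq):
    """General approach: convert every C to T (applied to reads and/or reference)."""
    return seq.upper().replace("C", "T")


def bisulfite_kmer_variants(kmer):
    """Enumerate all reference k-mers a bisulfite-converted read k-mer could come from.

    A read T may correspond to a reference C (unmethylated, converted) or T;
    every other base must match itself.
    """
    choices = [("C", "T") if base == "T" else (base,) for base in kmer.upper()]
    return ["".join(combo) for combo in product(*choices)]


def reverse_complement(seq):
    """Reverse complement, for matching against the opposite chromosome strand."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


print(bisulfite_kmer_variants("CTT"))  # ['CCC', 'CCT', 'CTC', 'CTT']
```

The number of variants doubles with every T in the k-mer, which is why the k-mer length matters for the hashing cost.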

  • Galaxy/Genboree
    • Developed at Penn State University
    • Benefits
      • Rapid deployment tool
      • Share pipelines with others
    • Alan Harris, Sriram Raghuram
      • Deployed Galaxy/Genboree
      • Integration with Genboree
      • API for upload/download
      • Adaptors for LFF file format support
      • EDACC XML validation tools
    • Sriram Raghuram, Andrew Jackson, Cristian Coarfa
      • Integration with compute clusters
    • Arpit Tandon, Sriram Raghuram
      • Deployed analysis tools

  • Primary Analysis Pipelines
    • Implemented & exposed via Galaxy/Genboree
      • Read mapping
      • Bisulfite Sequencing read mapping
      • Peak calling (ChIP-Seq, MeDIP-Seq)
        • MACS (Harvard), FindPeaks (UBC)
      • Chromatin accessibility
        • HotSpot (UW)
      • Small RNA-seq
    • Coming soon
      • mRNA seq
        • Expression, alternative splicing
        • Gene fusion
    • Typical user interaction
      • Use Galaxy for user input
      • Submit jobs to a cluster
      • Upload results to Genboree

  • Reads Mapping

  • ChIP-Seq
    • Select uniquely mapping reads
    • Build read density maps
      • Extend each read 200 bp along the mapping strand
      • Remove monoclonal reads
      • Generate WIG data
        • Can be visualized in Genboree and UCSC
    • Peak calling
      • FindPeaks, MACS
    • Interpret peaks
      • Overlap with genomic features of interest: gene promoters, etc.
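The density-map steps above can be sketched roughly as follows. The sketch assumes reads are already uniquely mapped to one chromosome and given as 0-based half-open `(start, end, strand)` tuples, and it treats "monoclonal" reads as reads sharing the same 5' position and strand; both are assumptions for illustration, and the function names are hypothetical.

```python
from collections import Counter


def density_segments(reads, extension=200):
    """Build a read density map as (start, end, depth) segments.

    Duplicate reads (same 5' end and strand) are collapsed to one, then each
    read is extended `extension` bp from its 5' end along its mapping strand.
    """
    five_prime_ends = {
        (start if strand == "+" else end, strand) for start, end, strand in reads
    }
    diff = Counter()  # +1 where an extended read begins, -1 where it ends
    for five_prime, strand in five_prime_ends:
        if strand == "+":
            lo, hi = five_prime, five_prime + extension
        else:
            lo, hi = max(five_prime - extension, 0), five_prime
        diff[lo] += 1
        diff[hi] -= 1
    segments, depth, prev = [], 0, None
    for pos in sorted(diff):
        if prev is not None and depth > 0:
            segments.append((prev, pos, depth))
        depth += diff[pos]
        prev = pos
    return segments


def to_wig(segments, chrom):
    """Format segments as variableStep WIG text (WIG positions are 1-based)."""
    lines = []
    for start, end, depth in segments:
        lines.append(f"variableStep chrom={chrom} span={end - start}")
        lines.append(f"{start + 1} {depth}")
    return "\n".join(lines)


reads = [(100, 136, "+"), (100, 136, "+"), (150, 186, "+"), (500, 536, "-")]
print(to_wig(density_segments(reads), "chr1"))
```

The resulting text can be loaded as a custom track; peak callers such as FindPeaks or MACS work from the mapped reads themselves rather than from this summary.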

  • MeDIP-Seq
    • Select uniquely mapping reads
    • Build read density maps
    • Determine methylated CpGs
    • FindPeaks

  • Finding methylated CpGs

  • MeDIP-Seq Signal Visualization

  • MRE-Seq
    • Select uniquely mapping reads
    • Determine unmethylated CpGs

  • Bisulfite Sequencing
    • Shotgun Bisulfite Sequencing (Methyl-C)
      • Genome wide
    • Reduced Representation Bisulfite Sequencing (RRBS)
      • Enzyme cocktail
    • Map using Pash
    • Build methylation maps

  • Bisulfite Sequencing Read Mapping

  • Methylation Maps

    Position   Strand  CHHStatus  Methylation  Unmethylated  TotalReads
    50100242   +       CG         1            0             1
    50100243   -       CG         40           11            51
    50100250   +       CG         1            0             1
    50100251   -       CG         37           8             46
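A map like the one above could be derived from mapped bisulfite reads along these lines. The sketch assumes ungapped alignments given as `(start, strand, seq)` with 0-based starts and sequences in forward-strand orientation (as in SAM), and it counts only C/T evidence on the plus strand and G/A evidence on the minus strand. The function name and input layout are hypothetical, and the total here is simply methylated plus unmethylated, which need not match how the pipeline counts total reads.

```python
from collections import defaultdict


def methylation_map(reference, alignments):
    """Count methylated and unmethylated read bases at each covered cytosine.

    Plus-strand reads: reference C read as C is methylated, read as T is unmethylated.
    Minus-strand reads: reference G read as G is methylated, read as A is unmethylated.
    Returns rows of (position, strand, methylated, unmethylated, total).
    """
    counts = defaultdict(lambda: [0, 0])  # (pos, strand) -> [methylated, unmethylated]
    for start, strand, seq in alignments:
        ref_base, unmeth_base = ("C", "T") if strand == "+" else ("G", "A")
        for offset, read_base in enumerate(seq.upper()):
            pos = start + offset
            if pos >= len(reference) or reference[pos] != ref_base:
                continue
            if read_base == ref_base:
                counts[(pos, strand)][0] += 1
            elif read_base == unmeth_base:
                counts[(pos, strand)][1] += 1
    return [
        (pos, strand, meth, unmeth, meth + unmeth)
        for (pos, strand), (meth, unmeth) in sorted(counts.items())
    ]


# Toy reference and alignments (hypothetical data)
reference = "ACGTACGT"
alignments = [(0, "+", "ACGT"), (0, "+", "ATGT"), (4, "-", "ACGT"), (4, "-", "ACAT")]
for row in methylation_map(reference, alignments):
    print(*row, sep="\t")
```

Dividing the methylated count by the total gives a per-site methylation level; for the second row of the slide's table that would be 40 out of 51 reads.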

  • Small RNA-Seq
    • Trim adapters
    • Map reads onto target genome
      • Up to 100 locations per read
    • Interpret
      • Overlap with miRNAs, piRNAs, sno/scaRNAs
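The first two steps can be sketched as below. The adapter sequence is a placeholder (substitute the one used for your library), matching is exact for simplicity, and the function names and the `hits_per_read` layout are assumptions for illustration rather than the pipeline's actual code.

```python
def trim_adapter(read, adapter, min_overlap=5):
    """Remove a 3' adapter from a read using exact matching.

    Handles a full adapter anywhere in the read, or an adapter prefix of at
    least `min_overlap` bases (min_overlap must be >= 1) at the read's 3' end.
    """
    idx = read.find(adapter)
    if idx != -1:
        return read[:idx]
    for overlap in range(min(len(adapter) - 1, len(read)), min_overlap - 1, -1):
        if read.endswith(adapter[:overlap]):
            return read[:-overlap]
    return read


def keep_mappable(hits_per_read, max_locations=100):
    """Keep reads that map to at least one and at most `max_locations` locations."""
    return {
        read_id: hits
        for read_id, hits in hits_per_read.items()
        if 0 < len(hits) <= max_locations
    }


ADAPTER = "TCGTATGCCGTCTTCTGCTTG"  # placeholder adapter, for illustration only
insert = "ACGTACGTAGCTAGCTAGGA"
print(trim_adapter(insert + ADAPTER[:10], ADAPTER))
```

Production trimmers also tolerate sequencing errors in the adapter; exact matching keeps the sketch short.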

  • Exercise
    • Download the input MeDIP-Seq file from the workshop wiki
    • Analyze it using FindPeaks in Galaxy
    • Obtain results in Genboree LFF format
    • Upload the results to a Genboree database
    • View the results in a tabular view
    • Find the largest peaks
    • Explore them in the Genboree browser
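Outside the tabular view, the "find the largest peaks" step could also be done on the LFF file directly. The sketch below assumes a tab-delimited LFF annotation layout whose first ten columns are class, name, type, subtype, chromosome, start, stop, strand, phase and score; verify this against your own file before relying on it. Whether "largest" means highest score or greatest length is left as a parameter, and the function name is hypothetical.

```python
def largest_peaks(lff_text, top_n=10, by="score"):
    """Return the top_n peaks from LFF text, ranked by 'score' or 'length'.

    Lines with fewer than ten tab-separated fields (comments, section
    headers, blank lines) are skipped.
    """
    peaks = []
    for line in lff_text.splitlines():
        fields = line.split("\t")
        if len(fields) < 10:
            continue
        start, stop = int(fields[5]), int(fields[6])
        peaks.append(
            {
                "name": fields[1],
                "chrom": fields[4],
                "start": start,
                "stop": stop,
                "length": stop - start + 1,
                "score": float(fields[9]),
            }
        )
    return sorted(peaks, key=lambda peak: peak[by], reverse=True)[:top_n]


# Two made-up records, for illustration only
example = (
    "Peaks\tpeak1\tMeDIP\tFindPeaks\tchr1\t100\t400\t+\t.\t12.5\n"
    "Peaks\tpeak2\tMeDIP\tFindPeaks\tchr2\t50\t150\t+\t.\t30.0\n"
)
for peak in largest_peaks(example, top_n=2, by="length"):
    print(peak["name"], peak["chrom"], peak["start"], peak["stop"], peak["length"])
```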