EDACC Primary Analysis Pipelines
Cristian Coarfa, Bioinformatics Research Laboratory, Molecular and Human Genetics

Page 1: EDACC Primary Analysis Pipelines

EDACC Primary Analysis Pipelines

Cristian Coarfa
Bioinformatics Research Laboratory
Molecular and Human Genetics

Page 2: EDACC Primary Analysis Pipelines

Data Levels

Page 3: EDACC Primary Analysis Pipelines

Data Types Submitted To EDACC

• ChIP-Seq
• Shotgun Bisulfite Sequencing
  – Methyl-C
• Reduced Representation Bisulfite Sequencing
  – RRBS
• MRE-Seq
• MeDIP-Seq
• Chromatin Accessibility
• small RNA-Seq
• mRNA-Seq

Page 4: EDACC Primary Analysis Pipelines

Read Mapping

• Common processing step to all pipelines
• High throughput
  – Sequence space: Illumina
  – Color space: SOLiD
• Quick and accurate anchoring
• Read size varies from 36 to 76 bp
• Short read aligners
  – 1st generation: Maq, SOAP
    • Ungapped alignment
  – 2nd generation: Bowtie, BWA, SOAP2
    • Trade sensitivity for speed; good enough for many applications
• Mapping tools
  – Robust to indels
  – Sensitive to variable number of mismatches

Page 5: EDACC Primary Analysis Pipelines

Pash 3.0

• Positional Hashing

• Regular read mapping
• Bisulfite sequencing mapping
• Integrate basepair variation with epigenetic variation
• SAM output, easy integration with other analysis tools
• Accuracy without sacrificing efficiency

Page 6: EDACC Primary Analysis Pipelines

Bisulfite Sequencing

• Current tools: BSMAP, RMAP-BS, mrsFast, Zoom
• Pash 3.0
  – Integrate mutation discovery with basepair-level methylation discovery
  – Speedup
• General approach
  – Convert C's to T's in reads and/or reference
  – Use mappings, reads and reference to determine methylated sites
• Pash 3
  – Generate and hash all possible kmers for reads
  – Example: CTT yields CCC, CCT, CTC, CTT
  – Map against forward and reverse complement chromosome strands
• Superior sensitivity to other tools, without loss of efficiency
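The k-mer enumeration on this slide can be illustrated with a minimal sketch. It is not Pash's actual implementation; the function name `bisulfite_source_kmers` is hypothetical, and the sketch only covers the C-to-T case (a read from the reverse complement strand would need the analogous G-to-A treatment). The idea: after bisulfite treatment a T in a read may come from a genomic T or from an unmethylated C, so each T expands to two candidates.

```python
from itertools import product

def bisulfite_source_kmers(kmer):
    """Return every k-mer that could read as `kmer` after C-to-T conversion.

    Hypothetical helper for illustration only, not part of Pash.
    """
    # A T in the read may originate from T or from an unmethylated C;
    # every other base is kept unchanged.
    choices = [("C", "T") if base == "T" else (base,) for base in kmer]
    return ["".join(combo) for combo in product(*choices)]

print(bisulfite_source_kmers("CTT"))
```

For `CTT` this reproduces the four k-mers listed on the slide. The number of candidates doubles with every T, which is why an efficient hashing scheme matters for longer k-mers.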

Page 7: EDACC Primary Analysis Pipelines

Galaxy/Genboree

• Developed at Penn State University
• Benefits
  – Rapid deployment tool
  – Share pipelines w/ others
• Alan Harris, Sriram Raghuram
  – Deployed Galaxy/Genboree
  – Integration w/ Genboree
    • API for upload/download
  – Adaptors for LFF file format support
  – EDACC XML validation tools
• Sriram Raghuram, Andrew Jackson, Cristian Coarfa
  – Integration with compute clusters
• Arpit Tandon, Sriram Raghuram
  – Deployed analysis tools

http://genboree.org/galaxy

Page 8: EDACC Primary Analysis Pipelines

Primary Analysis Pipelines

• Implemented & exposed via Galaxy/Genboree
  – Read mapping
  – Bisulfite Sequencing read mapping
  – Peak calling (ChIP-Seq, MeDIP-Seq)
    • MACS (Harvard), FindPeaks (UBC)
  – Chromatin accessibility
    • HotSpot (UW)
  – Small RNA-seq
• Coming soon
  – mRNA-seq
  – Expression, alternative splicing
  – Gene fusion
• Typical user interaction
  – Use Galaxy for user input
  – Submit jobs to a cluster
  – Upload results to Genboree

Page 9: EDACC Primary Analysis Pipelines

Reads Mapping

Page 10: EDACC Primary Analysis Pipelines

ChIP-Seq

• Select uniquely mapping reads
• Build read density maps
  – Extend each read 200bp along the mapping strand
  – Remove monoclonal reads
  – Generate WIG data
  – Can be visualized in Genboree and UCSC
• Peak calling
  – FindPeaks, MACS
• Interpret peaks
  – Overlap with genomic features of interest: gene promoters, etc.
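The read density steps above can be sketched as follows. This is an illustrative sketch, not the EDACC pipeline code: the function names, the `(chrom, start, end, strand)` read tuple with 1-based inclusive coordinates, and the definition of a monoclonal read as an exact duplicate of position and strand are all assumptions made for the example.

```python
from collections import defaultdict

def fragments(reads, extend=200):
    """Drop duplicate reads, then extend each read to `extend` bp along its strand."""
    seen = set()
    for chrom, start, end, strand in reads:
        key = (chrom, start, end, strand)
        if key in seen:  # assumed meaning of "monoclonal": same position and strand
            continue
        seen.add(key)
        if strand == "+":
            yield chrom, start, start + extend - 1
        else:
            yield chrom, max(1, end - extend + 1), end

def coverage(frags):
    """Return {chrom: {position: depth}} for 1-based inclusive fragments."""
    deltas = defaultdict(lambda: defaultdict(int))
    for chrom, start, end in frags:
        deltas[chrom][start] += 1      # depth rises at the fragment start
        deltas[chrom][end + 1] -= 1    # and falls just past the fragment end
    cov = {}
    for chrom, d in deltas.items():
        depth, prev, per_pos = 0, None, {}
        for pos in sorted(d):
            if depth > 0:
                for p in range(prev, pos):
                    per_pos[p] = depth
            depth += d[pos]
            prev = pos
        cov[chrom] = per_pos
    return cov

def wig_lines(cov):
    """Yield variableStep WIG lines for every covered position."""
    for chrom in sorted(cov):
        yield f"variableStep chrom={chrom}"
        for pos in sorted(cov[chrom]):
            yield f"{pos} {cov[chrom][pos]}"
```

A call such as `"\n".join(wig_lines(coverage(fragments(reads))))` produces text that can be loaded as a WIG track. Writing one line per covered base keeps the sketch short; a production pipeline would more likely emit fixed-size bins.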

Page 11: EDACC Primary Analysis Pipelines

MeDIP-Seq

• Select uniquely mapping reads
• Build read density maps
• Determine methylated CpGs
  – FindPeaks

Page 12: EDACC Primary Analysis Pipelines

Finding methylated CpGs

Page 13: EDACC Primary Analysis Pipelines

MeDIP-Seq Signal Visualization

Page 14: EDACC Primary Analysis Pipelines

MRE-Seq

• Select uniquely mapping reads
• Determine unmethylated CpGs

Page 15: EDACC Primary Analysis Pipelines

Bisulfite Sequencing

• Shotgun Bisulfite Sequencing
  – Methyl-C
  – Genome wide
• Reduced Representation Bisulfite Sequencing
  – RRBS
  – Enzyme cocktail
• Map using Pash
• Build methylation maps

Page 16: EDACC Primary Analysis Pipelines

Bisulfite Sequencing Read Mapping

Page 17: EDACC Primary Analysis Pipelines

Methylation Maps

Position   Strand  CHHStatus  Methylation  Unmethylated  TotalReads
50100242   +       CG         1            0             1
50100243   -       CG         40           11            51
50100250   +       CG         1            0             1
50100251   -       CG         37           8             46
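A methylation map like the one above is typically summarized as a per-site methylation level. The sketch below is one way to do that, under assumptions not stated on the slide: the level is taken as methylated / (methylated + unmethylated), sites with fewer than `min_reads` informative reads are skipped, and the function name `methylation_levels` is hypothetical.

```python
ROWS = """\
50100242 + CG 1 0 1
50100243 - CG 40 11 51
50100250 + CG 1 0 1
50100251 - CG 37 8 46
"""

def methylation_levels(text, min_reads=10):
    """Map (position, strand) to the methylated fraction of informative reads."""
    levels = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        pos, strand, _context, meth, unmeth, _total = line.split()
        meth, unmeth = int(meth), int(unmeth)
        informative = meth + unmeth  # assumption: only these reads count
        if informative < min_reads:
            continue  # too few reads for a meaningful estimate
        levels[(int(pos), strand)] = meth / informative
    return levels

for site, level in methylation_levels(ROWS).items():
    print(site, round(level, 3))
```

With the default threshold only the two minus-strand sites are reported; the plus-strand sites have a single read each, which is too little evidence to call a level.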

Page 18: EDACC Primary Analysis Pipelines

Small RNA-Seq

• Trim adapters

• Map reads onto target genome
  – Up to 100 locations per read

• Interpret
  – Overlap w/ miRNAs, piRNAs, sno/scaRNAs

Page 19: EDACC Primary Analysis Pipelines

Exercise

• Download the input MeDIP-Seq file from the workshop wiki

• Analyze it using FindPeaks in Galaxy
  – Obtain results in Genboree LFF format

• Upload the results to Genboree database

• View the results in a tabular view

• Find the largest peaks

• Explore them in the Genboree browser
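For the "find the largest peaks" step, a small script can rank the peaks outside the tabular view. The sketch below assumes a tab-delimited LFF annotation layout of class, name, type, subtype, chromosome, start, stop, strand, phase, score in the first ten columns, and takes "largest" to mean highest score. Both are assumptions to check against the actual file; the function name `largest_peaks` is hypothetical.

```python
def largest_peaks(lff_text, n=5):
    """Return the `n` highest-scoring annotations from LFF text.

    Assumed column layout (tab-delimited): class, name, type, subtype,
    chromosome, start, stop, strand, phase, score.
    """
    peaks = []
    for line in lff_text.splitlines():
        # Skip blanks, comments, and bracketed section headers.
        if not line.strip() or line.startswith(("#", "[")):
            continue
        fields = line.split("\t")
        if len(fields) < 10:
            continue  # not an annotation record under the assumed layout
        peaks.append({
            "name": fields[1],
            "chrom": fields[4],
            "start": int(fields[5]),
            "stop": int(fields[6]),
            "score": float(fields[9]),
        })
    return sorted(peaks, key=lambda p: p["score"], reverse=True)[:n]
```

Usage would look like `largest_peaks(open("peaks.lff").read(), n=10)`, where `peaks.lff` is a placeholder file name. Sorting by `stop - start` instead of `score` would rank peaks by width, if that is the intended meaning of "largest".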