EDACC Primary Analysis Pipelines Cristian Coarfa Bioinformatics Research Laboratory Molecular and Human Genetics


EDACC Primary Analysis Pipelines

Cristian Coarfa
Bioinformatics Research Laboratory
Molecular and Human Genetics


Data Levels


Data Types Submitted To EDACC

• ChIP-Seq
• Shotgun Bisulfite Sequencing
  – Methyl-C
• Reduced Representation Bisulfite Sequencing
  – RRBS
• MRE-Seq
• MeDIP-Seq
• Chromatin Accessibility
• small RNA-Seq
• mRNA-Seq


Read Mapping

• Common processing step to all pipelines
• High throughput
  – Sequence space: Illumina
  – Color space: SOLiD
• Quick and accurate anchoring
• Read size varies from 36 to 76 bp
• Short read aligners
  – 1st generation: Maq, soap
    • Ungapped alignment
  – 2nd generation: bowtie, bwa, soap 2
    • Trade some sensitivity for speed; good enough for many applications
• Mapping tools
  – Robust to indels
  – Sensitive to variable number of mismatches


Pash 3.0

• Positional Hashing

• Regular reads mapping
• Bisulfite sequencing mapping
• Integrate basepair variation with epigenetic variation
• SAM output, easy integration with other analysis tools
• Accuracy without sacrificing efficiency


Bisulfite Sequencing

• Current tools: BSMAP, RMAP-BS, mrsFast, Zoom
• Pash 3.0
  – Integrate mutation discovery with basepair-level methylation discovery
  – Speedup
• General approach
  – Convert C's to T's in reads and/or reference
  – Use mappings, reads and reference to determine methylated sites
• Pash 3
  – Generate and hash all possible kmers for reads
  – CTT: CCC, CCT, CTC, CTT
  – Map against forward and reverse complement chromosome strands
• Superior sensitivity to other tools, without loss of efficiency
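The k-mer expansion on this slide can be illustrated with a short Python sketch. It is not Pash's implementation; the function names `bisulfite_convert` and `expand_kmer` are hypothetical. It only shows the idea that a T observed in a bisulfite-treated read may correspond to either a T or an unmethylated C in the reference, while an observed C must be a C.

```python
from itertools import product


def bisulfite_convert(seq: str) -> str:
    """In silico bisulfite conversion: every C is rewritten as T."""
    return seq.upper().replace("C", "T")


def expand_kmer(kmer: str) -> list[str]:
    """List every reference k-mer that could yield `kmer` after conversion.

    A T in the read may come from a reference T or from a converted
    (unmethylated) C; every other base must match the reference exactly.
    """
    choices = [("C", "T") if base == "T" else (base,) for base in kmer.upper()]
    return sorted("".join(combo) for combo in product(*choices))


if __name__ == "__main__":
    # The slide's example: CTT expands to CCC, CCT, CTC, CTT.
    print(expand_kmer("CTT"))
    print(bisulfite_convert("ACGTC"))
```

The number of candidates doubles with every T in the k-mer, which is one reason an efficient hashing scheme matters for this approach.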


Galaxy/Genboree

• Developed at Penn State University
• Benefits
  – Rapid deployment tool
  – Share pipelines w/ others
• Alan Harris, Sriram Raghuram
  – Deployed Galaxy/Genboree
  – Integration w/ Genboree
    • API for upload/download
  – Adaptors for LFF file format support
  – EDACC XML validation tools
• Sriram Raghuram, Andrew Jackson, Cristian Coarfa
  – Integration with compute clusters
• Arpit Tandon, Sriram Raghuram
  – Deployed analysis tools

http://genboree.org/galaxy


Primary Analysis Pipelines

• Implemented & exposed via Galaxy/Genboree
  – Read mapping
  – Bisulfite Sequencing read mapping
  – Peak calling (ChIP-Seq, MeDIP-Seq)
    • MACS (Harvard), FindPeaks (UBC)
  – Chromatin accessibility
    • HotSpot (UW)
  – Small RNA-seq
• Coming soon
  – mRNA seq
  – Expression, alternative splicing
  – Gene fusion
• Typical user interaction
  – Use Galaxy for user input
  – Submit jobs to a cluster
  – Upload results to Genboree


Reads Mapping


ChIP-Seq

• Select uniquely mapping reads
• Build read density maps
  – Extend each read 200bp along the mapping strand
  – Remove monoclonal reads
  – Generate WIG data
  – Can be visualized in Genboree and UCSC
• Peak calling
  – FindPeaks, MACS
• Interpret peaks
  – Overlap with genomic features of interest: gene promoters, etc.
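The read density step can be sketched in Python as follows. This is a minimal illustration, not the EDACC pipeline code: the function name `build_density`, the `(start, end, strand)` tuple layout with 0-based half-open coordinates, and the treatment of "monoclonal" reads as exact duplicates are all assumptions. Writing the segments out as WIG and clipping at chromosome ends are left out.

```python
from collections import defaultdict


def build_density(reads, extension=200):
    """Build a read density map for one chromosome.

    reads: iterable of (start, end, strand) tuples, 0-based half-open
           (an assumed layout for this sketch).
    Returns a sorted list of (segment_start, segment_end, depth) with depth > 0.
    """
    # Assumption: monoclonal reads are those with identical coordinates and
    # strand, so a set keeps one copy of each.
    unique_reads = set(reads)

    # Difference array: +1 where a fragment starts, -1 where it ends.
    delta = defaultdict(int)
    for start, end, strand in unique_reads:
        if strand == "+":
            frag_start, frag_end = start, start + extension
        else:
            # Minus-strand reads extend leftwards from their 3'-most base.
            frag_start, frag_end = max(0, end - extension), end
        delta[frag_start] += 1
        delta[frag_end] -= 1

    # Sweep the breakpoints to turn the deltas into constant-depth segments.
    segments, depth = [], 0
    positions = sorted(delta)
    for pos, next_pos in zip(positions, positions[1:]):
        depth += delta[pos]
        if depth > 0:
            segments.append((pos, next_pos, depth))
    return segments


if __name__ == "__main__":
    example = [(100, 136, "+"), (100, 136, "+"), (150, 186, "+"), (400, 436, "-")]
    for segment in build_density(example):
        print(segment)
```

Each output segment maps naturally onto one variable-width interval of a density track, which is the form a genome browser expects.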


MeDIP-Seq

• Select uniquely mapping reads
• Build read density maps
• Determine methylated CpGs
  – FindPeaks


Finding methylated CpGs


MeDIP-Seq Signal Visualization


MRE-Seq

• Select uniquely mapping reads
• Determine unmethylated CpGs


Bisulfite Sequencing

• Shotgun Bisulfite Sequencing
  – Methyl-C
  – Genome wide
• Reduced Representation Bisulfite Sequencing
  – RRBS
  – Enzyme cocktail
• Map using Pash
• Build methylation maps


Bisulfite Sequencing Read Mapping


Methylation Maps

Position   Strand  CHHStatus  Methylation  Unmethylated  TotalReads
50100242   +       CG         1            0             1
50100243   -       CG         40           11            51
50100250   +       CG         1            0             1
50100251   -       CG         37           8             46
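A per-site methylation level can be derived from such a map with a few lines of Python. The sketch below is illustrative only: the function names are hypothetical, the rows are copied from the slide, and it assumes the level is computed as methylated / (methylated + unmethylated), which the slides do not state explicitly.

```python
# Rows from the slide: (position, strand, context, methylated, unmethylated, total_reads)
METHYLATION_MAP = [
    (50100242, "+", "CG", 1, 0, 1),
    (50100243, "-", "CG", 40, 11, 51),
    (50100250, "+", "CG", 1, 0, 1),
    (50100251, "-", "CG", 37, 8, 46),
]


def methylation_level(methylated: int, unmethylated: int):
    """Fraction of informative reads supporting methylation, or None if uncovered."""
    informative = methylated + unmethylated
    return methylated / informative if informative else None


def confident_sites(rows, min_reads=10):
    """Yield (position, strand, level) for sites with enough informative reads.

    min_reads is an arbitrary illustrative threshold, not a pipeline setting.
    """
    for position, strand, _context, methylated, unmethylated, _total in rows:
        if methylated + unmethylated >= min_reads:
            yield position, strand, methylation_level(methylated, unmethylated)


if __name__ == "__main__":
    for position, strand, level in confident_sites(METHYLATION_MAP):
        print(f"{position}\t{strand}\t{level:.3f}")
```

Filtering on coverage matters here because the sites supported by a single read would otherwise report a level of 1.0 with no real evidence behind it.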


Small RNA-Seq

• Trim adapters

• Map reads onto target genome
  – up to 100 locations per read

• Interpret
  – Overlap w/ miRNAs, piRNAs, sno/scaRNAs
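The adapter trimming step can be sketched as below. This is a simplified exact-match trimmer, not the tool used in the pipeline: the function name `trim_adapter` and the `min_overlap` parameter are hypothetical, the adapter sequence in the example is only an illustration, and real trimmers also tolerate mismatches and use base qualities.

```python
def trim_adapter(read: str, adapter: str, min_overlap: int = 5) -> str:
    """Remove a 3' adapter from a read using exact matching only.

    First looks for the full adapter anywhere in the read; failing that,
    looks for a prefix of the adapter (at least min_overlap bases) that
    runs off the 3' end of the read.
    """
    full_hit = read.find(adapter)
    if full_hit != -1:
        return read[:full_hit]

    # Try the longest partial adapter prefix first.
    longest = min(len(adapter) - 1, len(read))
    for overlap in range(longest, max(min_overlap, 1) - 1, -1):
        if read.endswith(adapter[:overlap]):
            return read[:-overlap]
    return read


if __name__ == "__main__":
    adapter = "TGGAATTCTCGGGTGCCAAGG"   # example adapter, assumed for illustration
    insert = "TAGCTTATCAGACTGATGTTGA"    # a 22 nt small RNA-sized insert
    print(trim_adapter(insert + adapter[:10], adapter))
```

Because small RNAs are shorter than the read length, nearly every read contains adapter sequence, so trimming has to happen before mapping.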


Exercise

• Download the input MeDIP-Seq file from the workshop wiki

• Analyze it using FindPeaks in Galaxy
  – Obtain results in Genboree LFF format

• Upload the results to Genboree database

• View the results in a tabular view

• Find the largest peaks

• Explore them in the Genboree browser
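For those who prefer to rank peaks outside the tabular view, the sketch below sorts annotations from a tab-delimited file by score. It rests on assumptions that should be checked against the actual FindPeaks output: that each annotation line has the columns class, name, type, subtype, chromosome, start, stop, strand, phase, score in that order, and that "largest" means highest score. The function name `largest_peaks` and the sample lines are hypothetical.

```python
import csv
import io


def largest_peaks(lff_text: str, top_n: int = 5, score_col: int = 9):
    """Return the top_n annotations ranked by score, highest first.

    Assumed column order (0-based): 1=name, 4=chromosome, 5=start, 6=stop,
    9=score. Lines that are comments, section headers, or too short are skipped.
    """
    peaks = []
    for fields in csv.reader(io.StringIO(lff_text), delimiter="\t"):
        if not fields or fields[0].startswith("#") or len(fields) <= score_col:
            continue
        try:
            score = float(fields[score_col])
        except ValueError:
            continue  # non-numeric score: skip rather than guess
        peaks.append((fields[1], fields[4], int(fields[5]), int(fields[6]), score))
    return sorted(peaks, key=lambda peak: peak[4], reverse=True)[:top_n]


if __name__ == "__main__":
    # Made-up example lines, for illustration only.
    sample = (
        "Peaks\tpeak1\tMeDIP\tFindPeaks\tchr1\t1000\t1400\t+\t.\t12.5\n"
        "Peaks\tpeak2\tMeDIP\tFindPeaks\tchr1\t5000\t5300\t+\t.\t48.0\n"
        "Peaks\tpeak3\tMeDIP\tFindPeaks\tchr2\t700\t1200\t+\t.\t30.2\n"
    )
    for peak in largest_peaks(sample, top_n=2):
        print(peak)
```

If peak width is the quantity of interest instead, the sort key can be changed to `peak[3] - peak[2]`.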