Click here to load reader

Variant analysis and whole exome sequencing

  • View

  • Download

Embed Size (px)

Text of Variant analysis and whole exome sequencing

  1. 1. February 4th, 2015 Mariam Quiones, PhD Computational Molecular Biology Specialist Bioinformatics and Computational Biosciences Branch OCICB/OSMO/OD/NIAID/NIH
  2. 2. BCBB: A Branch Devoted to Bioinformatics and Computational Biosciences Researchers time is increasingly important BCBB saves our collaborators time and effort Researchers speed projects to completion using BCBB consultation and development services No need to hire extra post docs or use external consultants or developers What makes us different? Software DevelopersComputational Biologists PMs and Analysts 2
  3. 3. BCBB NIH Users: Access a menu of BCBB services on the NIAID Intranet: Outside of NIH search BCBB on the NIAID Public Internet Page: Email us at: [email protected] 3
  4. 4. Remember Sanger? Sanger introduced the dideoxy method (also known as Sanger sequencing) in December 1977 Alignment of reads using tools such as Sequencher IMAGE: 4
  6. 6. File formats for aligned reads SAM (sequence alignment map) CTTGGGCTGCGTCGTGTCTTCGCTTCACACCCGCGACGAGCGCGGCTTCT CTTGGGCTGCGTCGTGTCTTCGCTTCACACC Chr_ start end Chr2 100000 100050 Chr_ start end Chr2 50000 50050 6
  7. 7. Most commonly used alignment file formats SAM (sequence alignment map) Unified format for storing alignments to a reference genome BAM (binary version of SAM) used commonly to deliver data Compressed SAM file, is normally indexed BED Commonly used to report features described by chrom, start, end, name, score, and strand. For example: chr1 11873 14409 uc001aaa.3 0 + 7
  8. 8. Variant Analysis Class Topics Introduction Variant calling, filtering and annotation Overview of downstream analyses 8
  9. 9. Efforts at creating databases of variants: Project that started on 2002 with the goal of describing patterns of human genetic variation and create a haplotype map using SNPs present in at least 1% of the population, which were deposited in dbSNPs. 1000 Genomes Started in 2008 with a goal of using at least 1000 individuals (about 2,500 samples at 4X coverage), interrogate 1000 gene regions in 900 samples (exome analysis), find most genetic variants with allele frequencies above 1% and to a 0.1% if in coding regions as well as Indels and structural variants Make data available to the public or via Amazon Cloud 9
  10. 10. Variants are not only found in coding sequences Main Goal: Find all functional elements in the genome 10
  11. 11. Read alignment Mapping reads to reference genome, processing to improve alignment and mark PCR duplicates Variant identification Variant calling and genotyping Annotation and Filtering Removing likely false positive variants Visualization and downstream analyses (e.g. association studies) A typical variant calling pipeline 11
  12. 12. Mapping reads to reference Whole Genome Exome-Seq Useful for identifying SNPs but also structural variants in any genomic region. Interrogates variants in specific genomic regions, usually coding sequences. Cheaper and provides high coverage Read alignment Recommended alignment tools: BWA-MEM, Bowtie2, ssaha2, novoalign. See more tools: Seq Answers Forum: 12
  13. 13. How to identify structural variants? Methods and SV tools 1- Read Pair use pair ends. GASVPro (Sindi 2012), BreakDancer (Chen 2009) 2- Read depth CNVnator (Abyzov, 2010) 3- Split read PinDEL (Ye 2009) 4- Denovo assembly Learn more: Also see references: - SV - CNV 13
  14. 14. Exome-Seq Targeted exome capture targets ~20,000 variants near coding sequences and a few rare missense or loss of function variants Provides high depth of coverage for more accurate variant calling It is starting to be used as a diagnostic tool Ann Neurol. 2012 Jan;71(1):5-14. -Nimblegen -Agilent -Illumina 14
  15. 15. Analysis Workflow 15
  16. 16. Mapping subject and family sequence data with BWA (paired end) Variants called with SAMtools Filtering out common variants in 1000 genomes (>0.05 allele freq) 500K variants Annotation using ANNOVAR* 576 non- synonymous filtered variants Visualization of each gene Whole-exome sequencing reveals a heterozygous LRP5 mutation in a 6- year-old boy with vertebral compression fractures and low trabecular bone density Fahiminiya S, Majewski J, Roughley P, Roschger P, Klaushofer K, Rauch F. Bone: July 23, 2013 *Annovar annotated: the type of mutations, presence in dbSNP132, minor allele frequency in the 1000 Genomes project, SIFT, Polyphen-2 and PHASTCONS scores 16
  17. 17. Exome-seq - Also used for finding somatic mutations Exome sequencing of serous endometrial tumors identifies recurrent somatic mutations in chromatin- remodeling and ubiquitin ligase complex genes VOLUME 44 | NUMBER 12 | DECEMBER 2012 Nature Genetics Figure 1: Somatic mutations in CHD4, FBXW7 and SPOP cluster within important functional domains of the encoded proteins. 17
  18. 18. Variant Analysis Class Topics Introduction Variant calling, filtering and annotation Overview of downstream analyses 18
  19. 19. Variant identification Goals: Identify variant bases, genotype likelihood and allele frequency while avoiding instrument noise. Generate an output file in vcf format (variant call format) with genotypes assigned to each sample 19
  20. 20. Challenges of finding variants using NGS data Base calling errors Different types of errors that vary by technology, sequence cycle and sequence context Low coverage sequencing Lack of sequence from two chromosomes of a diploid individual at a site Inaccurate mapping Aligned reads should be reported with mapping quality score 20
  21. 21. Pre-Processing 21
  22. 22. Processing reads for Variant Analysis Trimming: Unless quality of read data is poor, low quality end trimming of short reads (e.g. Illumina) prior to mapping is usually not required. If adapters are present in a high number of reads, these could be trimmed to improve mapping. Recommended tools: Prinseq, Btrim64, FastQC Marking duplicates: It is not required to remove duplicate reads prior to mapping but instead it is recommended to mark duplicates after the alignment. Recommended tool: Picard. A library that is composed mainly of PCR duplicates could produce inaccurate variant calling. Clip: 22
  23. 23. Picard Tools (integrates with GATK) AddOrReplaceReadGroups.jar BamIndexStats.jar BamToBfq.jar BuildBamIndex.jar CalculateHsMetrics.jar CleanSam.jar CollectAlignmentSummaryMetrics.jar CollectCDnaMetrics.jar CollectGcBiasMetrics.jar CollectInsertSizeMetrics.jar CollectMultipleMetrics.jar CompareSAMs.jar CreateSequenceDictionary.jar EstimateLibraryComplexity.jar ExtractIlluminaBarcodes.jar ExtractSequences.jar FastqToSam.jar FixMateInformation.jar IlluminaBasecallsToSam.jar MarkDuplicates.jar MeanQualityByCycle.jar MergeBamAlignment.jar MergeSamFiles.jar NormalizeFasta.jar picard-1.45.jar QualityScoreDistribution.jar ReorderSam.jar ReplaceSamHeader.jar RevertSam.jar sam-1.45.jar SamFormatConverter.jar SamToFastq.jar SortSam.jar ValidateSamFile.jar ViewSam.jar java -jar QualityScoreDistribution.jar I=file.bam CHART=file.pdf 23
  24. 24. SNP Calling Methods Early SNP callers and some commercial packages use a simple method of counting reads for each allele that have passed a mapping quality threshold. ** This is not good enough in particular when coverage is low. It is best to use advanced SNP callers which add more statistics for more accurate variant calling of low coverage datasets and indels. -e.g. GATK 24
  25. 25. SNP and Genotype calling Germline Variant Callers Somatic Variant Callers Varscan, MuTEC Samtools: 25
  26. 26. VCF format (version 4.0) Format used to report information about a position in the genome Use by 1000 genomes to report all variants 26
  27. 27. VCF format's_VCF_files 27
  28. 28. GATK pipeline (BROAD) for variant calling 28
  29. 29. GATK Haplotype Caller shows improvements on detecting misalignments around gaps and reducing false positives 29
  30. 30. GATK uses a Base Quality Recalibration method Covariates the raw quality score the position of the base in the read the dinucleotide context and the read group 30
  31. 31. GATK uses a bayesian model

Search related