Experience of using BWA-mem & GATK HaplotypeCaller for Variant
Calling in Multiple Rat Strains
Wim Spee, Bio-informatics Engineer, Cuppen Group, Hubrecht Institute
Content
● Project overview: Euratrans (FP7)
● Pipeline and data overview
● NGS alignment and variant calling
● BAC alignment and variant calling
● NGS and BAC genotype concordance
● Heterozygosity in inbred species
● Conclusion & discussion
Euratrans (FP7)
● European consortium for large-scale functional genomics in the rat for translational research
● Rat is a popular model organism
● Multiple homozygous inbred disease model strains have been set up
– Example: SHR = Spontaneously hypertensive rat
– Set up by traditional breeding on phenotype, before NGS was established
Pipeline and Data Overview
Raw reads = WGS SOLiD fragment (50 bp) and PE (50 bp x 35 bp)
Mapping = BWA 0.5.9 colorspace
Duplicate marking = Picard MarkDuplicates
Local realignment = GATK IndelRealigner
BQSR = N/A on BWA mapped SOLiD
Call variants = GATK HaplotypeCaller multisample
VQSR = SNP array and top 33% of indels as truth variant call set
Variant evaluation = Precision and recall against 13 aligned Sanger based contigs (2.1 Mb, aligned with BWA-MEM)
NGS Alignment and Variant Calling
● “Best practice” BWA-Picard-GATK pipeline
– GATK HaplotypeCaller for variant calling
– Variant Quality Score Recalibration (VQSR) using SNP array and top 33% INDELS
HaplotypeCaller (HC) (Theory)
● Local de novo assembly based variant caller
– Calls SNP, INDEL, MNP and small SV simultaneously
– Removes mapping artifacts
– More sensitive and accurate than the Unified Genotyper (UG)
– Physical phasing of variants
– Used to run on geological timescales
– Now runs on practical timescales (v2.6.3 via Queue on SGE cluster)
● 2 days for multi-sample variant calling on 10 SOLiD WGS rat strains
Slide taken from Broad presentation
Variant Quality Score Recalibration (VQSR)
● Use known true variants to dynamically set a cutoff between true positive and false positive calls
– True positives will cluster together with the known variants and false positives will mainly be in a separate cluster
– Alternative to setting manual hard cutoffs, e.g. coverage = 20, quality = 50, etc.
● Known true variants for the rat species:
– SNP: 500,000 high quality positions from a SNP array
– INDEL: no external set available, used top 33% (by QUAL) of the call set
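The "top 33% by QUAL" selection for the INDEL truth set could be sketched as below. This is only an illustration under assumptions: the record layout (dicts with a `qual` key), the helper name `top_fraction_by_qual`, and the rounding rule (round up) are hypothetical and not taken from the pipeline; the example QUAL values are made up.

```python
import math


def top_fraction_by_qual(records, fraction=0.33):
    """Return the highest-QUAL records, keeping ceil(fraction * n) of them."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    # Sort from highest to lowest QUAL, then cut at the requested fraction.
    ranked = sorted(records, key=lambda r: r["qual"], reverse=True)
    keep = math.ceil(len(ranked) * fraction)
    return ranked[:keep]


# Hypothetical INDEL calls; positions and QUAL values are invented for illustration.
indels = [
    {"chrom": "chr1", "pos": 100, "qual": 50.0},
    {"chrom": "chr1", "pos": 220, "qual": 900.0},
    {"chrom": "chr1", "pos": 310, "qual": 30.0},
    {"chrom": "chr2", "pos": 45, "qual": 410.0},
    {"chrom": "chr2", "pos": 80, "qual": 75.0},
    {"chrom": "chr3", "pos": 12, "qual": 1200.0},
]

truth_candidates = top_fraction_by_qual(indels)
for rec in truth_candidates:
    print(rec["chrom"], rec["pos"], rec["qual"])
```

Note that a truth set drawn from the call set itself is circular by construction, which is why the INDEL results later in the deck are marked preliminary.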
VQSR (SNP): Plots
VQSR (SNP): Truth Sensitivity Tranches
VQSR (INDEL): Plots
VQSR (INDEL): Truth Sensitivity Tranches?
Which Tranche to Take?
BAC Contig Alignment and Variant Calling
● 13 BACs for rat strain LE, ca. 150 kb per BAC, 2.1 Mb in total
● BAC contig alignment
– BWA-MEM
● BAC contig variant calling
– GATK Unified Genotyper
● BAC & NGS alignment and variant calls in IGV
BAC Contig Alignment: BWA-MEM
● BWA-MEM*
– New long read & contig aligner from Heng Li
● 70 bp to a few Mb
– Can switch between end-to-end and local alignment
● Supports structural events detection from long reads and contigs
– Outputs a standard BAM file
● Useful for downstream processing
* Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM (arXiv:1303.3997v2)
BWA-MEM Settings
● Seed length
– 400 bp
● Banded alignment (space to search for optimal alignment)
– 5000 positions
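As a rough sketch, these two settings could be passed to `bwa mem` as shown below. The mapping is an assumption: the slides do not give the command line, so it is assumed here that "seed length" corresponds to `-k` (minimum seed length) and "banded alignment" to `-w` (band width). The file names are hypothetical placeholders, and the command is only built, not run.

```python
def bwa_mem_command(reference, contigs, min_seed_len=400, band_width=5000):
    """Build a bwa mem argument list for long BAC contigs.

    Assumes -k (minimum seed length) and -w (band width) are the options
    behind the "seed length" and "banded alignment" settings on the slide.
    """
    return [
        "bwa", "mem",
        "-k", str(min_seed_len),
        "-w", str(band_width),
        reference,
        contigs,
    ]


# Hypothetical file names, for illustration only.
cmd = bwa_mem_command("rnor5.fa", "LE_BAC_contigs.fa")
print(" ".join(cmd))
```

The argument list can be handed to `subprocess.run` with standard output redirected to a file if the alignment should actually be executed.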
GATK UG Settings for Variant Calling on Aligned BAC
● --genotype_likelihoods_model BOTH
● -stand_call_conf 0
● -stand_emit_conf 0
● -indelGapContinuationPenalty 30
● -indelGapOpenPenalty 60
● -minIndelCnt 1
● -L BACToBedMerged.bed
BAC & NGS Alignment and Calls
BAC & NGS Alignment and Calls(zoomed out)
BAC Multiple Local Alignment
BAC Deletion vs. Reference
Unknown Reference Sequence
Mismatch Between BAC and Reference
SOLiD Low Mapping Quality Regions
NGS and BAC Genotype Concordance
● Genotype concordance:
– GATK module to compare 2 VCF files
● Input:
– NGS call set restricted to BAC region
– BAC call set
● Filters used:
– VQSR on (NGS)
– SNP cluster (3 SNPs in 10 bp window) (NGS and BAC)
– No known repeat regions (NGS and BAC)
– No NGS LE low quality mapping regions (NGS and BAC)
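The SNP cluster filter (3 SNPs in a 10 bp window) can be illustrated with a small sliding-window check. This is a minimal sketch, not the GATK implementation: the function name is hypothetical, positions are assumed to be on a single chromosome, and the window boundary rule (first and last SNP less than 10 bp apart) is an assumption.

```python
def flag_snp_clusters(positions, cluster_size=3, window=10):
    """Return the positions that belong to a SNP cluster.

    A cluster is cluster_size consecutive SNPs whose first and last
    positions are less than `window` bp apart (single chromosome assumed).
    """
    pos = sorted(positions)
    flagged = set()
    for i in range(len(pos) - cluster_size + 1):
        j = i + cluster_size - 1
        if pos[j] - pos[i] < window:
            flagged.update(pos[i:j + 1])
    return flagged


# Hypothetical SNP positions: the first three fall within one 10 bp window.
snps = [100, 104, 108, 250, 400, 405, 411]
clustered = flag_snp_clusters(snps)
kept = [p for p in snps if p not in clustered]
print(sorted(clustered), kept)
```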
Precision and Recall (SNP)
Current: Rnor 5.0, GATK HaplotypeCaller, VQSR 99.5, BWA-MEM
No clusters  No repeats  No low quality  Match  NGS ONLY  BAC ONLY  Precision  Recall
-            -           -               2309   14        1747      99.40%     56.93%
x            -           -               2270   22        1157      99.04%     66.24%
x            x           -               2231   18        811       99.20%     73.34%
x            x           x               1944   10        192       99.49%     91.01%
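The precision and recall values above are consistent with precision = Match / (Match + NGS ONLY) and recall = Match / (Match + BAC ONLY), treating the BAC calls as truth. The slides do not state the formulas, so this is inferred from the numbers; the short check below recomputes each row from its counts, labelling rows only by how many filters are applied.

```python
def precision_recall(match, ngs_only, bac_only):
    """Precision and recall of NGS calls, with BAC calls treated as truth."""
    precision = match / (match + ngs_only)
    recall = match / (match + bac_only)
    return precision, recall


# (filters applied, Match, NGS ONLY, BAC ONLY) taken from the SNP table.
snp_rows = [
    (0, 2309, 14, 1747),
    (1, 2270, 22, 1157),
    (2, 2231, 18, 811),
    (3, 1944, 10, 192),
]

for n_filters, match, ngs_only, bac_only in snp_rows:
    p, r = precision_recall(match, ngs_only, bac_only)
    print(f"{n_filters} filters: precision {p:.2%}, recall {r:.2%}")
```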
Comparison (SNP):
Precision  Recall  Setup
97.30%     82.80%  Rnor 3.4, modified samtools, BLAT, no cluster, repeat and low qual. Same LE SOLiD data set and BAC.
99.62%     91.90%  Rnor 3.4, GATK UG, simulated reads from BAC. Illumina LE data set and same BAC. Additional filters unknown.
Precision and Recall (INDEL) Preliminary!
Current: Rnor 5.0, GATK HaplotypeCaller, VQSR 99.3, BWA-MEM
No repeats  No low quality  GT mismatch  Match  NGS ONLY  BAC ONLY  Precision  Recall
-           -               12           329    109       795       75.11%     29.27%
x           -               7            287    64        469       81.77%     37.96%
x           x               1            182    28        100       86.67%     64.54%
Comparison (INDEL):
Precision  Recall  Setup
97.80%     58.60%  Rnor 3.4, modified samtools, BLAT, no cluster, repeat and low qual. Same LE SOLiD data set and BAC.
96.25%     89.02%  Rnor 3.4, GATK UG, simulated reads from BAC. Illumina LE data set and same BAC. Additional filters unknown.
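The Match, GT mismatch, NGS ONLY and BAC ONLY categories can be thought of as a site-by-site comparison of the two call sets. The sketch below is a simplified illustration and not the logic of the GATK concordance module: call sets are assumed to be dicts keyed by a hypothetical (chrom, pos, ref, alt) tuple with the genotype string as value, and the example sites are invented.

```python
def classify_sites(ngs, bac):
    """Count concordance categories between NGS and BAC call sets.

    Both inputs map a site key, e.g. (chrom, pos, ref, alt), to a genotype.
    """
    counts = {"match": 0, "gt_mismatch": 0, "ngs_only": 0, "bac_only": 0}
    for site, genotype in ngs.items():
        if site not in bac:
            counts["ngs_only"] += 1
        elif bac[site] == genotype:
            counts["match"] += 1
        else:
            # Same variant site in both sets, but a different genotype.
            counts["gt_mismatch"] += 1
    counts["bac_only"] = sum(1 for site in bac if site not in ngs)
    return counts


# Invented example sites, for illustration only.
ngs_calls = {
    ("chr1", 100, "A", "AT"): "1/1",
    ("chr1", 250, "GC", "G"): "0/1",
    ("chr1", 400, "T", "TA"): "1/1",
}
bac_calls = {
    ("chr1", 100, "A", "AT"): "1/1",
    ("chr1", 250, "GC", "G"): "1/1",
    ("chr1", 900, "C", "CG"): "1/1",
}

concordance = classify_sites(ngs_calls, bac_calls)
print(concordance)
```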
Precision and Recall (INDEL) Improvement
● Include 2 Illumina sequenced strains in HC variant calling
● And/or include INDEL calls based on 2 Illumina sequenced strains in VQSR
– Intersection between SOLiD and Illumina INDEL calls as truth set?
● Better selection of truth INDELs from current call set?
Heterozygosity
● ~10% of LE true positive calls (vs. BAC) are heterozygous
– Remaining heterozygosity?
– Paralogous regions?
– Other mapping artifacts?
– Bias of GATK HC towards diploid heterozygous species?
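A heterozygous fraction like the ~10% above could be computed from the genotype strings of the true positive calls. The following is a minimal sketch under assumptions: genotypes are VCF-style strings such as `0/1` or `1|1`, no-calls containing `.` are skipped, and the example genotypes are invented rather than taken from the LE call set.

```python
def het_fraction(genotypes):
    """Fraction of called genotypes whose alleles differ (heterozygous)."""
    called = [gt for gt in genotypes if "." not in gt]
    if not called:
        return 0.0
    # Treat phased ("|") and unphased ("/") separators the same way.
    het = sum(1 for gt in called if len(set(gt.replace("|", "/").split("/"))) > 1)
    return het / len(called)


# Invented genotypes: 10 called sites, one of them heterozygous, one no-call.
example_gts = ["1/1", "0/1", "1/1", "1/1", "./.", "1|1",
               "1/1", "1/1", "1/1", "1/1", "1/1"]
print(f"{het_fraction(example_gts):.0%} heterozygous")
```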
Conclusions & Discussion
● External data sets are very useful (SNP array & BAC for VQSR and genotype concordance)
● GATK HaplotypeCaller works
– Better than samtools based variant calling
– To really compare with GATK UG, run on Illumina LE strain sample
● SOLiD reads too short to benefit from GATK HC?
– How to improve INDEL VQSR with no external truth set?
– How to handle heterozygous calls in inbred species?
● BWA-MEM works
– GATK UG can call SNPs / INDELs on aligned BACs
– Visualization in IGV
– How to call SVs?
Acknowledgment
Cuppen Group at the Hubrecht Institute