Genome in a Bottle: Tools for Using NIST Reference Materials
Next Generation Diagnostics Summit Short Course
August 2014
Justin Zook, Marc Salit, and the Genome in a Bottle Consortium
Learning Objectives
• How can Genome in a Bottle Reference Materials help with validating NGS assays?
• Comparing your variant calls to high-confidence calls
• Tools available for understanding potential false positives and false negatives
• Examples of how labs are using our high-confidence calls
NIST-hosted Genome in a Bottle Consortium
• Infrastructure for performance assessment of NGS
– support science-based regulatory oversight
• No widely accepted set of metrics to characterize the fidelity of variant calls from NGS…
• Genome in a Bottle Consortium is developing standards to address this…
– human genomes as Reference Materials (RMs)
  • characterized and disseminated by NIST
– tools and methods to use these RMs
  • common sequencing instruments
  • bioinformatics workflows
http://genomeinabottle.org
Whole genome sequencing technologies disagree about hundreds of thousands of variants
[Venn diagram of SNP calls from Platform #1, Platform #2, and Platform #3. Region counts, as # SNPs (% of SNPs detected by any platform): 3,198,316 (80.05%); 125,574 (3.14%); 230,311 (5.76%); 121,440 (3.04%); 208,038 (5.21%); 71,944 (1.80%); 39,604 (0.99%)]
Bioinformatics programs also disagree
O’Rawe et al. Genome Medicine 2013, 5:28
Measurement Process
Generic measurement process:
Sample
gDNA isolation
Library Prep
Sequencing
Alignment/Mapping
Variant Calling
Confidence Estimates
Downstream Analysis
• gDNA reference materials will be developed to characterize performance of a part of process
– materials will be certified for their variants against a reference sequence, with confidence estimates
NIST Human Genome RMs in the pipeline
• All 10 µg samples of DNA isolated from multistage large growth cell cultures
– all are intended to act as stable, homogeneous references suitable for use in regulated applications
– all genomes also available from Coriell repository
• Pilot Genome
– ~8400 tubes
• Ashkenazim Jewish Trio
– ~10000 son; ~2500 each parent
• Asian Trio
– ~10000 son; parents not yet planned as NIST RM
Goals for Data to Accompany RM
• ~0 false positive AND false negative calls in confident regions
• Include as much of the genome as possible in the confident regions (i.e., don’t just take the intersection)
• Avoid bias towards any particular platform
– take advantage of strengths of each platform
• Avoid bias towards any particular bioinformatics algorithms
Integration Methods to Establish Reference Variant Calls
Candidate variants
Concordant variants
Find characteristics of bias
Arbitrate using evidence of bias
Confidence Level
Zook et al., Nature Biotechnology, 2014.
Assigning confidence to genotypes
High-confidence sites
• Sequencing/bioinformatics methods agree or we understand the biases causing disagreement
• At least some methods have no evidence of bias
• Inherited as expected
Less confident sites
• In a region known to be difficult for current technologies
• State reasons for lower confidence
• If a site is near a low confidence site, make it low confidence
Reasons we exclude regions from high-confidence set
Challenges with assessing performance
• Not all variant types are equal
• Not all regions of the genome are equal
– Homopolymers, STRs, duplications
– Can be similar or different in different genomes
• Labeling difficult variants as uncertain leads to higher apparent accuracy when assessing performance
• Genotypes fall in 3+ categories (not positive/negative)
– standard diagnostic accuracy measures not well posed
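Because genotypes fall into more than two categories, one option is to tabulate a full truth-versus-call matrix before collapsing anything into positives and negatives. The following is a minimal sketch, not a GIAB tool: the category names ("hom_ref", "het", "hom_alt", "no_call"), the function name, and the dictionary layout are all assumptions made for illustration.

```python
from collections import Counter


def genotype_confusion(truth, calls):
    """Count (truth genotype, called genotype) pairs for every site in `truth`.

    `truth` and `calls` map a site key, e.g. (chrom, pos), to a genotype label.
    Sites absent from `calls` are counted under the hypothetical label "no_call".
    """
    counts = Counter()
    for site, truth_gt in truth.items():
        counts[(truth_gt, calls.get(site, "no_call"))] += 1
    return counts


if __name__ == "__main__":
    # Illustrative data only; these are not real NA12878 genotypes.
    truth = {
        ("chr1", 100): "het",
        ("chr1", 200): "hom_alt",
        ("chr1", 300): "hom_ref",
        ("chr1", 400): "het",
    }
    calls = {
        ("chr1", 100): "het",
        ("chr1", 200): "het",
        ("chr1", 300): "hom_ref",
    }
    for (truth_gt, called_gt), n in sorted(genotype_confusion(truth, calls).items()):
        print(f"truth={truth_gt:8s} called={called_gt:8s} n={n}")
```

How the off-diagonal cells (for example, a true homozygous variant called heterozygous) are mapped to FP or FN is a policy decision that should be stated alongside any reported sensitivity or specificity.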
Preliminary uses of high-confidence NIST-GIAB genotypes for NA12878
• NIST has released several versions of high-confidence genotypes for its pilot RM
• These data are presently being used for benchmarking
– prior to release of RMs
– SNPs & indels
• ~77% of the genome
NIST Plays a Role in the First FDA Authorization for Next-Generation Sequencer
November 20, 2013
Integrating NIST Call Sets into a Validation Workflow
Validation Report
False Positive Rate: FPR = FP/(FP + TN)
False Discovery Rate: FDR = FP/(FP + TP)
Sensitivity: Sens. = TP/(TP + FN)
Specificity: Spec. = TN/(FP + TN)
Balanced Accuracy: (Sens. + Spec.)/2
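The formulas above translate directly into code. This is a minimal sketch; the function names and the choice to return NaN for an empty denominator are assumptions made for illustration, not part of any NIST tool.

```python
import math


def _ratio(numerator, denominator):
    """Return numerator/denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else math.nan


def validation_metrics(tp, fp, fn, tn):
    """Compute the validation-report metrics from confusion-matrix counts."""
    sensitivity = _ratio(tp, tp + fn)
    specificity = _ratio(tn, fp + tn)
    return {
        "FPR": _ratio(fp, fp + tn),            # false positive rate
        "FDR": _ratio(fp, fp + tp),            # false discovery rate
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }


if __name__ == "__main__":
    # Made-up counts, chosen only to exercise the formulas.
    for name, value in validation_metrics(tp=90, fp=10, fn=10, tn=890).items():
        print(f"{name}: {value:.4f}")
```

Note that FPR and specificity depend on TN, which for variant calling means deciding how many reference positions count as true negatives; FDR and sensitivity do not.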
GCAT – Interactive Performance Metrics
• NIST is working with GCAT to use our highly confident variant calls
• Assess performance of many combinations of mappers and variant callers
• Currently assesses only exome sequencing
• www.bioplanet.com/gcat
GCAT Tests
GCAT Variant Calling Tests
Pre-run Tests
Upload your own variant calls
GCAT – Upload your own exome calls
Freebayes SNP calls changed very little in 2013
http://www.bioplanet.com/gcat/reports/1933-westleouzm/variant-calls/illumina-100bp-pe-exome-150x/bwamem-freebayes-0-9-10-131226/compare-1934-akckizzzfr-1931-laqgzjytqw-1935-xwckffckoa/snp/group-quality
Freebayes indel calls improved in 2013
http://www.bioplanet.com/gcat/reports/1933-westleouzm/variant-calls/illumina-100bp-pe-exome-150x/bwamem-freebayes-0-9-10-131226/compare-1934-akckizzzfr-1931-laqgzjytqw-1935-xwckffckoa/indel/group-quality
Background
• Clinical laboratory – Division of Genomic Diagnostics, certified by regulatory agencies (CAP).
• The Clinical Whole-Exome Sequencing (CWES) test requires stringent validation per CAP criteria to establish performance metrics of the test.
Utilizing NIST data in validation of CWES Test
• Sequence and call variants of NA12878 at CHOP
• CHOP ROI: Agilent SureSelect V5+ (SSV5+) baits file
• Compare CHOP dataset to NIST data set for concordance
NIST* Data Set Details:
• High quality reference data set on NA12878 (Dec. 2013)
• NIST’s highly confident Regions of Interest (ROI)
• Variants called in 219,222 regions on hg19 assembly
*NIST: National Institute of Standards and Technology
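A comparison like this is usually restricted to positions that lie both in the lab's ROI and in the high-confidence regions. The sketch below intersects two interval sets under stated assumptions: intervals are BED-style (0-based, half-open), each list is sorted and non-overlapping, and the function names and coordinates are hypothetical.

```python
def intersect_intervals(a, b):
    """Intersect two sorted, non-overlapping lists of (start, end) intervals."""
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            result.append((start, end))
        # Advance whichever interval finishes first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return result


def intersect_regions(roi, confident):
    """Intersect per-chromosome interval dicts: {chrom: [(start, end), ...]}."""
    shared = {}
    for chrom in roi.keys() & confident.keys():
        overlap = intersect_intervals(roi[chrom], confident[chrom])
        if overlap:
            shared[chrom] = overlap
    return shared


if __name__ == "__main__":
    # Made-up coordinates for illustration only.
    roi = {"chr1": [(0, 10), (20, 30)], "chr2": [(5, 15)]}
    confident = {"chr1": [(5, 25)]}
    shared = intersect_regions(roi, confident)
    total = sum(end - start for ivs in shared.values() for start, end in ivs)
    print(shared, "total bases:", total)
```

The total number of bases in the intersection is one possible starting point for a TN count, though how TN is defined remains a choice for the validating lab.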
Analytical Validation of Clinical Whole-Exome Sequencing (CWES) Test
SENSITIVITY / SPECIFICITY: RefGene +/- 15bp (SSV5+)
CHOP vs. NIST
TP: SNVs 18480, INDELs 396
FP: SNVs 26, INDELs 3
FN: SNVs 63, INDELs 30
FP: False Positive; TP: True Positive; FN: False Negative; TN: True Negative

                                    SNVs      INDELs
Sensitivity (TP/(TP+FN))            99.66%    92.96%
Specificity (TN/(TN+FP))            ~100%     ~100%
FDR (FP/(FP+TN))                    0.02%     0.08%
Accuracy ((TP+TN)/(TP+TN+FP+FN))    ~100%     ~100%

TN = NIST highly confident regions – CHOP ROIs
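The sensitivity figures can be rechecked from the reported TP and FN counts. The sketch below does that, and also computes FP/(FP+TP), the false discovery rate as defined on the Validation Report slide earlier in this deck. That is a different quantity from the FP/(FP+TN) figures in the table, which need a TN count that the slide does not list. Function names are assumptions made for illustration.

```python
# Counts as reported on the slide (CHOP calls vs. NIST high-confidence calls).
REPORTED = {
    "SNVs": {"TP": 18480, "FP": 26, "FN": 63},
    "INDELs": {"TP": 396, "FP": 3, "FN": 30},
}


def sensitivity(tp, fn):
    """Fraction of reference variants that were detected."""
    return tp / (tp + fn)


def false_discovery_rate(tp, fp):
    """Fraction of reported variants that are false positives: FP/(FP+TP)."""
    return fp / (fp + tp)


if __name__ == "__main__":
    for variant_type, c in REPORTED.items():
        sens = 100 * sensitivity(c["TP"], c["FN"])
        fdr = 100 * false_discovery_rate(c["TP"], c["FP"])
        print(f"{variant_type}: sensitivity {sens:.2f}%, FP/(FP+TP) {fdr:.2f}%")
    # SNVs: sensitivity 99.66%, FP/(FP+TP) 0.14%
    # INDELs: sensitivity 92.96%, FP/(FP+TP) 0.75%
```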
Further analysis on presumptive 93 FNs and 29 FPs
93 FNs: 63 SNVs, 30 INDELs
29 FPs: 26 SNVs, 3 INDELs
Using the GeT-RM Browser
• http://www.ncbi.nlm.nih.gov/variation/tools/get-rm/
• Allows visualization of questionable calls
GeT-RM: Load alignments for visualization
Chr6:151669820 Chr6:151669828
Difficult site in homopolymer in intron of gene AKAP12
Chr1:1666303
SNP in Gene SLC35E2, which is also in a pseudogene and a segmental duplication
[Figure labels: Segmental Duplication, Pseudogene, Structural Variant]
Feedback from MoCha lab at NCI
• We built a targeted amplicon NGS assay for detecting mutations in clinical tumor specimens
• To assess the assay’s specificity, we compared 84 runs of CEPH NA12878 data from our assay with NIST’s consensus variant list (VCF v2.15)
• We observed a high overall concordance with a few FP variants in homopolymeric regions unique to our platform
• We concluded that NIST GIAB is a useful reference standard to evaluate assay specificity
Using Genome in a Bottle calls to benchmark clinical exome sequencing at Mount Sinai School of Medicine
“We evaluate a set of NA12878 technical replicates against GIAB for each new pipeline version.”
Benchmarking somatic variant calling at Qiagen
HSPH – Brad Chapman Comparing variant callers
http://bcbio.wordpress.com/2013/10/21/updated-comparison-of-variant-detection-methods-ensemble-freebayes-and-minimal-bam-preparation-pipelines/
NextSeq: New Chemistry – Does it work?
Whole Genome Metrics                                NextSeq500          HiSeq2500
% Genome Covered (>= 10X in Q20 bases)              96%                 96%
Mean Coverage in Q20 Bases                          28.3X               31.8X
SNPs Called (% dbSNP 129)                           3,643,998 (89%)     3,664,014 (88%)
InDels Called (% dbSNP 129)                         646,907 (65.7%)     686,547 (64.5%)
Genome in a Bottle SNP Sensitivity & Precision      99.07% | 99.04%     99.25% | 99.90%
Genome in a Bottle Indel Sensitivity & Precision    86.90% | 98.85%     93.29% | 97.54%
Ion Benchmarking I
Ion Benchmarking II
Command-line tools for variant benchmarking
• USeq VCFComparator
– http://sourceforge.net/projects/useq/
• RTG vcfeval
– ftp://ftp-trace.ncbi.nih.gov/giab/ftp/tools/RTG/
• bcbio.variation
– http://bcbio.wordpress.com/2013/05/06/framework-for-evaluating-variant-detection-methods-comparison-of-aligners-and-callers/
• SMaSH
– http://smash.cs.berkeley.edu/
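To make the basic idea behind such comparisons concrete, here is a deliberately naive sketch that is not a substitute for any of the tools listed: it matches variants only on exact (chrom, pos, ref, alt), ignores genotypes, and restricts both sets to BED-style regions (0-based, half-open). All function names and the example records are hypothetical.

```python
def parse_vcf_lines(lines):
    """Return a set of (chrom, pos, ref, alt) keys from tab-separated VCF lines."""
    variants = set()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        chrom, pos, ref, alts = fields[0], int(fields[1]), fields[3], fields[4]
        for alt in alts.split(","):
            variants.add((chrom, pos, ref, alt))
    return variants


def in_regions(variant, regions):
    """True if the variant position falls in any interval for its chromosome."""
    chrom, pos, _, _ = variant
    # VCF POS is 1-based; BED-style intervals are 0-based and half-open.
    return any(start < pos <= end for start, end in regions.get(chrom, []))


def compare(truth, calls, regions):
    """Count TP, FP, and FN among variants inside the given regions."""
    truth_in = {v for v in truth if in_regions(v, regions)}
    calls_in = {v for v in calls if in_regions(v, regions)}
    return {
        "TP": len(truth_in & calls_in),
        "FP": len(calls_in - truth_in),
        "FN": len(truth_in - calls_in),
    }


if __name__ == "__main__":
    # Made-up records for illustration only.
    truth = parse_vcf_lines(["chr1\t100\t.\tA\tG", "chr1\t200\t.\tC\tT", "chr1\t5000\t.\tG\tA"])
    calls = parse_vcf_lines(["#header", "chr1\t100\t.\tA\tG", "chr1\t300\t.\tT\tC"])
    print(compare(truth, calls, {"chr1": [(0, 1000)]}))
```

Exact matching can miscount when the same indel or complex variant is written in different but equivalent ways, which is one reason to prefer a dedicated comparison tool for real validation work.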
How Can I Get Involved?
• Use our integrated SNP/indel genotypes for NA12878 and give us feedback
– Cells and DNA currently available from Coriell
– NIST RM available late 2014
• Sequence/analyze the new Genome in a Bottle samples
• Help with Structural Variant calls
• Help with analyzing data from long-read technologies
• Attend our biannual workshops (January in CA, August in MD)
• Help develop methods to measure performance using our well-characterized genomes
http://genomeinabottle.org
Email: Justin Zook - [email protected]; Marc Salit – [email protected]
Slides on slideshare at: http://www.slideshare.net/GenomeInABottle