Genome in a Bottle: Tools for Using NIST Reference Materials Next Generation Diagnostics Summit Short Course August 2014 Justin Zook, Marc Salit, and the Genome in a Bottle Consortium


Page 1: Tools for Using NIST Reference Materials

Genome in a Bottle: Tools for Using NIST Reference Materials

Next Generation Diagnostics Summit Short Course, August 2014

Justin Zook, Marc Salit, and the Genome in a Bottle Consortium

Page 2: Tools for Using NIST Reference Materials

Learning Objectives

• How can Genome in a Bottle Reference Materials help with validating NGS assays?

• Comparing your variant calls to high-confidence calls

• Tools available for understanding potential false positives and false negatives

• Examples of how labs are using our high-confidence calls

Page 3: Tools for Using NIST Reference Materials

NIST-hosted Genome in a Bottle Consortium

• Infrastructure for performance assessment of NGS
– support science-based regulatory oversight

• No widely accepted set of metrics to characterize the fidelity of variant calls from NGS…

• Genome in a Bottle Consortium is developing standards to address this…
– human genomes as Reference Materials (RMs)
  • characterized and disseminated by NIST
– tools and methods to use these RMs
  • common sequencing instruments
  • bioinformatics workflows

http://genomeinabottle.org

Page 4: Tools for Using NIST Reference Materials

Whole genome sequencing technologies disagree about 100,000’s of variants

[Figure: Venn diagram of SNP calls from Platform #1, Platform #2, and Platform #3; each region is labeled with # SNPs (% of SNPs detected by any platform)]

Region labels: 3,198,316 (80.05%); 125,574 (3.14%); 230,311 (5.76%); 121,440 (3.04%); 208,038 (5.21%); 71,944 (1.80%); 39,604 (0.99%)
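The region labels in the Venn diagram can be checked with a little arithmetic. The sketch below sums the seven counts, recomputes each percentage, and counts the SNPs outside the largest region. It assumes (the extracted text does not say so) that the largest region, 3,198,316, is the set called by all three platforms.

```python
# Counts copied from the slide's seven Venn-diagram regions.
counts = [3_198_316, 125_574, 230_311, 121_440, 208_038, 71_944, 39_604]

# Denominator for the slide's percentages: SNPs detected by any platform.
total = sum(counts)

# Recompute each region's share, rounded the way the slide reports it.
percentages = [round(100 * c / total, 2) for c in counts]

# Assumption: the largest region is the three-platform agreement,
# so everything else is a call on which the platforms disagree.
disagreements = total - max(counts)

print(total, percentages, disagreements)
```

Under that assumption, roughly 0.8 million of about 4 million SNPs are not shared by all three platforms, which is consistent with the slide title.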

Page 5: Tools for Using NIST Reference Materials

Bioinformatics programs also disagree

O’Rawe et al. Genome Medicine 2013, 5:28

Page 6: Tools for Using NIST Reference Materials

Measurement Process

Sample

gDNA isolation

Library Prep

Sequencing

Alignment/Mapping

Variant Calling

Confidence Estimates

Downstream Analysis

• gDNA reference materials will be developed to characterize performance of a part of the process
– materials will be certified for their variants against a reference sequence, with confidence estimates

(Figure label: generic measurement process)

Page 7: Tools for Using NIST Reference Materials

NIST Human Genome RMs in the pipeline

• All are 10 µg samples of DNA isolated from multistage large growth cell cultures
– all are intended to act as stable, homogeneous references suitable for use in regulated applications
– all genomes also available from Coriell repository

• Pilot Genome
– ~8400 tubes

• Ashkenazim Jewish Trio
– ~10000 son; ~2500 each parent

• Asian Trio
– ~10000 son; parents not yet planned as NIST RM

Page 8: Tools for Using NIST Reference Materials

Goals for Data to Accompany RM

• ~0 false positive AND false negative calls in confident regions

• Include as much of the genome as possible in the confident regions (i.e., don’t just take the intersection)

• Avoid bias towards any particular platform
– take advantage of strengths of each platform

• Avoid bias towards any particular bioinformatics algorithms

Page 9: Tools for Using NIST Reference Materials

Integration Methods to Establish Reference Variant Calls

Candidate variants

Concordant variants

Find characteristics of bias

Arbitrate using evidence of bias

Confidence Level

Zook et al., Nature Biotechnology, 2014.

Page 10: Tools for Using NIST Reference Materials

Assigning confidence to genotypes

High-confidence sites
• Sequencing/bioinformatics methods agree or we understand the biases causing disagreement
• At least some methods have no evidence of bias
• Inherited as expected

Less confident sites
• In a region known to be difficult for current technologies
• State reasons for lower confidence
• If a site is near a low confidence site, make it low confidence
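The last rule (a site near a low confidence site becomes low confidence) can be illustrated with a small sketch. This is not the consortium's implementation: the function name `flag_near_low_confidence` and the 10 bp `window` are hypothetical choices made only for illustration.

```python
from bisect import bisect_left


def flag_near_low_confidence(sites, low_conf_sites, window=10):
    """Map each (chrom, pos) site to True if it lies within `window` bp
    of any low-confidence site on the same chromosome.

    The window size is an arbitrary illustrative value, not a value
    taken from the slides.
    """
    # Group low-confidence positions by chromosome and sort for bisection.
    by_chrom = {}
    for chrom, pos in low_conf_sites:
        by_chrom.setdefault(chrom, []).append(pos)
    for positions in by_chrom.values():
        positions.sort()

    flagged = {}
    for chrom, pos in sites:
        positions = by_chrom.get(chrom, [])
        # First low-confidence position at or after (pos - window).
        i = bisect_left(positions, pos - window)
        flagged[(chrom, pos)] = i < len(positions) and positions[i] <= pos + window
    return flagged


example = flag_near_low_confidence(
    sites=[("chr1", 995), ("chr1", 1011), ("chr2", 1000)],
    low_conf_sites=[("chr1", 1000)],
)
print(example)
```

In this toy input only the site at chr1:995 is demoted, because it is the only one within 10 bp of the low-confidence site on the same chromosome.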

Page 11: Tools for Using NIST Reference Materials

Reasons we exclude regions from high-confidence set

Page 12: Tools for Using NIST Reference Materials

Challenges with assessing performance

• All variant types are not equal

• All regions of the genome are not equal
– Homopolymers, STRs, duplications
– Can be similar or different in different genomes

• Labeling difficult variants as uncertain leads to higher apparent accuracy when assessing performance

• Genotypes fall in 3+ categories (not positive/negative)
– standard diagnostic accuracy measures not well posed
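To see why two-class accuracy measures are not well posed for genotypes, it can help to tabulate reference and test genotypes in a 3×3 table. The sketch below is illustrative only: the category names and the toy genotype pairs are assumptions, not data from the slides.

```python
from collections import Counter

# Three genotype categories at a biallelic site (hypothetical labels).
CATEGORIES = ("hom-ref", "het", "hom-alt")


def genotype_table(pairs):
    """Count (reference genotype, test genotype) pairs into a 3x3 table."""
    counts = Counter(pairs)
    return {(ref, test): counts[(ref, test)]
            for ref in CATEGORIES for test in CATEGORIES}


# Toy data: (reference genotype, test genotype) for six sites.
pairs = [
    ("het", "het"),
    ("hom-alt", "hom-alt"),
    ("het", "hom-alt"),      # variant found, but the genotype is wrong
    ("hom-ref", "het"),      # variant called where the reference has none
    ("het", "hom-ref"),      # variant missed
    ("hom-ref", "hom-ref"),
]
table = genotype_table(pairs)
for ref in CATEGORIES:
    print(ref, [table[(ref, test)] for test in CATEGORIES])
```

The ("het", "hom-alt") cell is the awkward one: the variant was detected, so it is not a false negative, yet the genotype is wrong, so it is not a clean true positive either. How such cells are counted has to be decided before any sensitivity or specificity figure is computed.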

Page 13: Tools for Using NIST Reference Materials

Preliminary uses of high-confidence NIST-GIAB genotypes for NA12878

• NIST has released several versions of high-confidence genotypes for its pilot RM

• These data are presently being used for benchmarking
– prior to release of RMs
– SNPs & indels

• ~77% of the genome

Page 14: Tools for Using NIST Reference Materials

NIST Plays a Role in the First FDA Authorization for Next-Generation Sequencer

November 20, 2013

Page 15: Tools for Using NIST Reference Materials

Integrating NIST Call Sets into a Validation Workflow

Validation Report

False Positive Rate: FPR = FP/(FP + TN)

False Discovery Rate: FDR = FP/(FP + TP)

Sensitivity: Sens. = TP/(TP + FN)

Specificity: Spec. = TN/(FP + TN)

Balanced Accuracy: (Sens. + Spec.)/2
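The validation-report formulas translate directly into code. The sketch below is a minimal illustration: the `Counts` class, the `validation_report` function, and the example counts are all hypothetical and are not taken from any lab's workflow.

```python
import math
from dataclasses import dataclass


@dataclass
class Counts:
    tp: int  # true positives
    fp: int  # false positives
    tn: int  # true negatives
    fn: int  # false negatives


def _ratio(numerator, denominator):
    """Return numerator/denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else math.nan


def validation_report(c):
    """Compute the five metrics listed on the slide from raw counts."""
    sens = _ratio(c.tp, c.tp + c.fn)
    spec = _ratio(c.tn, c.fp + c.tn)
    return {
        "FPR": _ratio(c.fp, c.fp + c.tn),
        "FDR": _ratio(c.fp, c.fp + c.tp),
        "Sensitivity": sens,
        "Specificity": spec,
        "Balanced Accuracy": (sens + spec) / 2,
    }


# Made-up counts, for illustration only.
report = validation_report(Counts(tp=90, fp=5, tn=900, fn=10))
for name, value in report.items():
    print(f"{name}: {value:.4f}")
```

Note that FPR and specificity depend on TN, which for variant calling requires a choice of what counts as a negative (for example, confident reference bases), whereas FDR and sensitivity do not.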

Page 16: Tools for Using NIST Reference Materials

GCAT – Interactive Performance Metrics

• NIST is working with GCAT to use our highly confident variant calls

• Assess performance of many combinations of mappers and variant callers

• Currently assesses only exome sequencing

• www.bioplanet.com/gcat

Page 17: Tools for Using NIST Reference Materials

GCAT Tests

Page 18: Tools for Using NIST Reference Materials

GCAT Variant Calling Tests

Pre-run Tests

Upload your own variant calls

Page 19: Tools for Using NIST Reference Materials

GCAT – Upload your own exome calls

Page 22: Tools for Using NIST Reference Materials

Analytical Validation of Clinical Whole-Exome Sequencing (CWES) Test

Background
• Clinical laboratory – Division of Genomic Diagnostics, certified by regulatory agencies (CAP).
• CWES test requires stringent validation per CAP criteria to establish performance metrics of the test.

Utilizing NIST data in validation of CWES Test
• Sequence and call variants of NA12878 at CHOP
• CHOP ROI: Agilent SureSelect V5+ (SSV5+) baits file
• Compare CHOP dataset to NIST data set for concordance

NIST* Data Set Details:
• High quality reference data set on NA12878 (Dec. 2013)
• NIST’s highly confident Regions of Interest (ROI)
• Variants called in 219,222 regions on hg19 assembly

* NIST: National Institute of Standards and Technology

Page 23: Tools for Using NIST Reference Materials

SENSITIVITY / SPECIFICITY: RefGene +/- 15bp (SSV5+)

Call sets compared: CHOP, NIST

        SNVs     INDELs
TP      18480    396
FP      26       3
FN      63       30

FP: False Positive; TP: True Positive; FN: False Negative; TN: True Negative

                                    SNVs      INDELs
Sensitivity (TP/(TP+FN))            99.66%    92.96%
Specificity (TN/(TN+FP))            ~100%     ~100%
FDR (FP/(FP+TN))                    0.02%     0.08%
Accuracy ((TP+TN)/(TP+TN+FP+FN))    ~100%     ~100%

TN = NIST highly confident regions – CHOP ROIs
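The sensitivity figures on this slide can be reproduced from the TP and FN counts alone. The sketch below does that, and also computes the false discovery rate as FP/(FP + TP), the definition given on the validation workflow slide. The slide's FDR row is written with TN in the denominator (the false positive rate formula), and TN is not reported, so the 0.02% and 0.08% values cannot be recomputed here.

```python
# Counts copied from the slide (CHOP calls compared against NIST calls).
counts = {
    "SNVs": {"TP": 18480, "FP": 26, "FN": 63},
    "INDELs": {"TP": 396, "FP": 3, "FN": 30},
}


def sensitivity_pct(c):
    """Sensitivity = TP / (TP + FN), as a percentage."""
    return 100 * c["TP"] / (c["TP"] + c["FN"])


def fdr_pct(c):
    """False discovery rate = FP / (FP + TP), as a percentage."""
    return 100 * c["FP"] / (c["FP"] + c["TP"])


results = {
    kind: (round(sensitivity_pct(c), 2), round(fdr_pct(c), 2))
    for kind, c in counts.items()
}
for kind, (sens, fdr) in results.items():
    print(f"{kind}: sensitivity {sens}%, FP/(FP+TP) {fdr}%")
```

The recomputed sensitivities match the slide (99.66% for SNVs, 92.96% for INDELs). FP/(FP + TP) comes out at 0.14% for SNVs and 0.75% for INDELs, which differs from the slide's row because the slide uses a different denominator.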

Page 24: Tools for Using NIST Reference Materials

Further analysis on presumptive 93 FNs and 29 FPs

93 FNs: 63 SNVs, 30 INDELs

29 FPs: 26 SNVs, 3 INDELs

Page 25: Tools for Using NIST Reference Materials

Using the GeT-RM Browser

• http://www.ncbi.nlm.nih.gov/variation/tools/get-rm/

• Allows visualization of questionable calls

Page 26: Tools for Using NIST Reference Materials

GeT-RM: Load alignments for visualization

Page 27: Tools for Using NIST Reference Materials

Chr6:151669820 Chr6:151669828

Difficult site in homopolymer in intron of gene AKAP12

Page 28: Tools for Using NIST Reference Materials

Chr1:1666303

SNP in Gene SLC35E2, which is also in a pseudogene and a segmental duplication

Page 29: Tools for Using NIST Reference Materials

Segmental Duplication

Pseudogene

Structural Variant

Page 30: Tools for Using NIST Reference Materials

Feedback from MoCha lab in NCI

• We built a targeted amplicon NGS assay for detecting mutations in clinical tumor specimens

• To assess the assay’s specificity, we compared 84 runs of CEPH NA12878 data from our assay with NIST’s consensus variant list (VCF v2.15)

• We observed a high overall concordance with a few FP variants in homopolymeric regions unique to our platform

• We concluded that NIST GIAB is a useful reference standard to evaluate assay specificity

Page 31: Tools for Using NIST Reference Materials

Using Genome in a Bottle calls to benchmark clinical exome sequencing at Mount Sinai School of Medicine

“We evaluate a set of NA12878 technical replicates against GIAB for each new pipeline version.”

Page 32: Tools for Using NIST Reference Materials

Benchmarking somatic variant calling at Qiagen

Page 33: Tools for Using NIST Reference Materials

HSPH – Brad Chapman: Comparing variant callers

http://bcbio.wordpress.com/2013/10/21/updated-comparison-of-variant-detection-methods-ensemble-freebayes-and-minimal-bam-preparation-pipelines/

Page 34: Tools for Using NIST Reference Materials

NextSeq: New Chemistry – Does it work?

Whole Genome Metrics                                NextSeq500          HiSeq2500
% Genome Covered (>= 10X in Q20 bases)              96%                 96%
Mean Coverage in Q20 Bases                          28.3X               31.8X
SNPs Called (% dbSNP 129)                           3,643,998 (89%)     3,664,014 (88%)
InDels Called (% dbSNP 129)                         646,907 (65.7%)     686,547 (64.5%)
Genome in a Bottle SNP Sensitivity & Precision      99.07% | 99.04%     99.25% | 99.90%
Genome in a Bottle Indel Sensitivity & Precision    86.90% | 98.85%     93.29% | 97.54%

Page 35: Tools for Using NIST Reference Materials

Ion Benchmarking I

Page 36: Tools for Using NIST Reference Materials

Ion Benchmarking II

Page 37: Tools for Using NIST Reference Materials

Command-line tools for variant benchmarking

• USeq VCFComparator
– http://sourceforge.net/projects/useq/

• RTG vcfeval
– ftp://ftp-trace.ncbi.nih.gov/giab/ftp/tools/RTG/

• bcbio.variation
– http://bcbio.wordpress.com/2013/05/06/framework-for-evaluating-variant-detection-methods-comparison-of-aligners-and-callers/

• SMaSH
– http://smash.cs.berkeley.edu/
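At their simplest, the tools listed above compare a call set against the high-confidence calls, restricted to the high-confidence regions. The sketch below shows only that core idea in plain Python. It is not how any of the listed tools is implemented: all function names are hypothetical, variants are matched by exact (chrom, pos, ref, alt), genotypes are ignored, and BED intervals are assumed not to overlap. Real comparison tools also have to reconcile different representations of the same variant, which this sketch does not attempt.

```python
from bisect import bisect_right


def parse_bed(lines):
    """Return {chrom: sorted [(start, end), ...]} from BED lines (0-based, half-open)."""
    regions = {}
    for line in lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        regions.setdefault(chrom, []).append((int(start), int(end)))
    for intervals in regions.values():
        intervals.sort()
    return regions


def in_regions(regions, chrom, pos):
    """True if 1-based position `pos` falls inside a (non-overlapping) BED interval."""
    intervals = regions.get(chrom, [])
    # Index of the last interval whose start is <= pos - 1.
    i = bisect_right(intervals, (pos - 1, float("inf"))) - 1
    return i >= 0 and intervals[i][0] <= pos - 1 < intervals[i][1]


def parse_vcf(lines):
    """Return a set of (chrom, pos, ref, alt) keys, one per ALT allele."""
    variants = set()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        chrom, pos, _id, ref, alts = line.rstrip("\n").split("\t")[:5]
        for alt in alts.split(","):
            variants.add((chrom, int(pos), ref, alt))
    return variants


def compare(calls, truth, regions):
    """Count TP, FP, and FN among variants inside the confident regions."""
    calls = {v for v in calls if in_regions(regions, v[0], v[1])}
    truth = {v for v in truth if in_regions(regions, v[0], v[1])}
    return {"TP": len(calls & truth), "FP": len(calls - truth), "FN": len(truth - calls)}


# Toy inputs, made up for illustration.
bed = ["chr1\t50\t150"]
truth_vcf = ["#CHROM\tPOS\tID\tREF\tALT", "chr1\t100\t.\tA\tG",
             "chr1\t120\t.\tC\tT", "chr1\t500\t.\tT\tC"]
calls_vcf = ["chr1\t100\t.\tA\tG", "chr1\t130\t.\tG\tA", "chr1\t500\t.\tT\tC"]

summary = compare(parse_vcf(calls_vcf), parse_vcf(truth_vcf), parse_bed(bed))
print(summary)
```

In the toy input, the variant at position 500 is ignored by both sides because it lies outside the confident region, so it counts neither for nor against the call set.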

Page 38: Tools for Using NIST Reference Materials

How Can I Get Involved?

• Use our integrated SNP/indel genotypes for NA12878 and give us feedback
– Cells and DNA currently available from Coriell
– NIST RM available late 2014

• Sequencing/analyzing the new Genome in a Bottle samples

• Help with Structural Variant calls

• Help with analyzing data from long-read technologies

• Attend our biannual workshops (January in CA, August in MD)

• Help develop methods to measure performance using our well-characterized genomes

http://genomeinabottle.org

Email: Justin Zook - [email protected]; Marc Salit – [email protected]

Slides on slideshare at: http://www.slideshare.net/GenomeInABottle