Aug2013 NIST highly confident genotype calls for NA12878

1

Using Highly Confident Genotype Calls for NA12878 to understand sequencing accuracy

Genome in a Bottle Consortium

Justin Zook, Ph.D. and Marc Salit, Ph.D., National Institute of Standards and Technology

2

Why create a set of highly confident genotypes for a genome?

• Current validation methods have limited purview or accuracy
• Sanger confirmation
– Limited by number of sites (and sometimes it’s wrong)
• High depth NGS confirmation
– May have same systematic errors
• Genotyping microarrays
– Limited to known (easier) variants
– Problems with neighboring variants, homopolymers, duplications
• Mendelian inheritance
– Can’t account for some systematic errors
• Simulated data
– Generally not very representative of errors in real data
• Ti/Tv
– Varies by region of genome, and only gives overall statistic
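As an illustration of why Ti/Tv is only an overall statistic, the ratio can be computed from a list of SNVs in a few lines. This is a minimal sketch; the function name `ti_tv_ratio` and the `(ref, alt)` input format are assumptions for illustration, not part of the NIST pipeline.

```python
# Hypothetical helper: transition/transversion ratio for a set of SNVs.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def ti_tv_ratio(snvs):
    """snvs: iterable of (ref, alt) single-base pairs. Returns Ti/Tv as a float."""
    ti = tv = 0
    for ref, alt in snvs:
        ref, alt = ref.upper(), alt.upper()
        if ref == alt:
            continue  # not a variant, skip
        if (ref, alt) in TRANSITIONS:
            ti += 1  # purine<->purine or pyrimidine<->pyrimidine
        else:
            tv += 1  # purine<->pyrimidine
    if tv == 0:
        raise ValueError("no transversions; Ti/Tv is undefined")
    return ti / tv
```

The result is a single number per call set, so it says nothing about which individual calls are wrong.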

3

Goals for Data Integration

• Carefully define highly confident regions of the genome
– distinguish between Hom Ref and Uncertain
• ~0 false positive AND false negative calls in confident regions
• Include as much of the genome as possible in the confident regions (i.e., don’t just take the intersection)
• Avoid bias towards any particular platform
• Avoid bias towards any particular bioinformatics algorithms

4

Integrate 12 Datasets from 5 platforms

5

Integration of Data to Form Highly Confident Genotype Calls

• Candidate variants: Find all possible variant sites
• Confident variants: Find highly confident sites across multiple datasets
• Find characteristics of bias: Identify sites with atypical characteristics signifying sequencing, mapping, or alignment bias
• Arbitration: For each site, remove datasets with decreasingly atypical characteristics until all datasets agree
• Confidence Level: Even if all datasets agree, identify them as uncertain if few have typical characteristics, or if they fall in known segmental duplications or long repeats
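The arbitration and confidence-level steps can be sketched roughly as follows. This is a simplified illustration, not the actual integration code: the single per-dataset "atypicality" score, the `arbitrate` function, and the `min_typical` and `atypical_cutoff` thresholds are all assumptions made for the example.

```python
def arbitrate(calls, min_typical=2, atypical_cutoff=1.0, in_difficult_region=False):
    """calls: dict dataset -> (genotype, atypicality score; higher = more atypical).

    Returns (genotype, "confident" | "uncertain"). All thresholds are hypothetical.
    """
    if not calls:
        raise ValueError("no datasets cover this site")
    remaining = dict(calls)
    # Drop the most atypical dataset first, then the next, until genotypes agree.
    while len({gt for gt, _ in remaining.values()}) > 1:
        worst = max(remaining, key=lambda name: remaining[name][1])
        del remaining[worst]
    genotype = next(iter(remaining.values()))[0]
    # Even when datasets agree, require enough of them to look typical,
    # and never trust sites in known difficult regions.
    typical = sum(1 for _, score in remaining.values() if score < atypical_cutoff)
    confident = typical >= min_typical and not in_difficult_region
    return genotype, ("confident" if confident else "uncertain")
```

For example, with two typical datasets calling `0/1` and one highly atypical dataset calling `0/0`, the atypical dataset is removed and the site is reported as a confident `0/1`.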

6

Characteristics of Sequence Data/Genotype associated with bias

• Systematic sequencing errors
– Strand bias
– Base Quality Rank Sum Test
• Local Alignment problems
– Distance from end of read
– Read Position Rank Sum
– HaplotypeScore
• Mapping problems
– Mapping Quality
– Higher (or lower) than expected coverage (CNV)
– Length of aligned reads
• Abnormal allele balance or Quality/Depth
– Allele Balance
– Quality/Depth
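A few of these characteristics reduce to simple ratios of read counts. The sketch below uses deliberately simplified definitions chosen for illustration; the function names are hypothetical, and real annotation tools may define allele balance and strand bias differently (for example, with statistical tests rather than raw fractions).

```python
def allele_balance(ref_reads, alt_reads):
    """Fraction of reads supporting the alternate allele (assumed definition)."""
    total = ref_reads + alt_reads
    return alt_reads / total if total else float("nan")


def quality_by_depth(qual, depth):
    """Variant quality score normalized by read depth."""
    return qual / depth if depth else float("nan")


def alt_reverse_fraction(alt_fwd, alt_rev):
    """Fraction of alt-supporting reads on the reverse strand.

    Values near 0 or 1 suggest strand bias; near 0.5 is typical.
    """
    total = alt_fwd + alt_rev
    return alt_rev / total if total else float("nan")
```

A heterozygous site would be expected to show an allele balance near 0.5, so a value such as 0.1 is one signal that a dataset is atypical at that site.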

7

Regions excluded as uncertain

More recently, we also exclude homopolymers and long STRs, and 30 bp on each side of uncertain heterozygous and homozygous variant positions
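Excluding 30 bp on each side of an uncertain position amounts to padding each position and merging overlapping intervals. A minimal sketch, assuming 0-based positions and BED-style half-open intervals; the function name `pad_and_merge` is hypothetical.

```python
def pad_and_merge(positions, pad=30):
    """positions: iterable of (chrom, pos) with 0-based pos.

    Returns sorted, merged half-open intervals (chrom, start, end) to exclude.
    """
    intervals = sorted(
        (chrom, max(0, pos - pad), pos + pad + 1) for chrom, pos in positions
    )
    merged = []
    for chrom, start, end in intervals:
        # Extend the previous interval when this one overlaps or touches it.
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            merged[-1][2] = max(merged[-1][2], end)
        else:
            merged.append([chrom, start, end])
    return [tuple(iv) for iv in merged]
```

The merged intervals could then be subtracted from the confident-region bed file.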

8

Example of Arbitration: SSE suspected from strand bias

[Figure labels: Platform B, Platform A, Homopolymer, Strand Bias (SNP overrepresented on reverse strands)]

9

Verification of “Highly Confident” Genotype accuracy

• Sanger sequencing
– 100% accuracy but only 100s of sites
• X Prize Fosmid sequencing
– Artifacts at end of fosmids
• Microarrays
– Differences appear to be FP or FN in arrays
• Broad 250bp HaplotypeCaller
– Very highly concordant, except a few systematic errors and homopolymers
• Platinum genomes pedigree SNPs
– Some systematic errors are inherited; different representations of complex variants
• Real Time Genomics Trio SNPs and indels
– Some interesting sites called by RTG complex caller but have no evidence in mapped reads

10

GCAT – Interactive Performance Metrics

• NIST is working with GCAT to use our highly confident variant calls

• Assess performance of many combinations of mappers and variant callers

• www.bioplanet.com/gcat

11

Why do calls differ from our highly confident genotypes?

Calls not in Integration
• Platform-specific systematic sequencing errors for SNPs
• Analysis-specific
• Difficult to map regions
• Indels in long homopolymers

Calls specific to Integration
• Different complex variant representation
• Some are incorrectly filtered as suspected FPs

12

Illumina-specific Systematic Sequencing Errors

13

Complex variants have multiple correct representations

[Figure labels: alignments from BWA, ssaha2, CGTools, and Novoalign against the reference (Ref), showing a T insertion and a TCTCT insertion]

Comparison                   FP SNPs       FP MNPs      FP indels
Traditional comparison       0.38% (610)   100% (915)   6.5% (733)
Comparison with realignment  0.15% (249)   4.2% (38)    2.6% (298)
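One way to see that two representations are equivalent is to apply each set of variants to the reference and compare the resulting sequences. The sketch below illustrates that idea only; it is not the realignment method used for the comparison above, and the function names and the `(pos, ref_allele, alt_allele)` tuple format are assumptions.

```python
def apply_variants(ref, variants):
    """Apply non-overlapping variants (0-based pos, ref_allele, alt_allele) to ref."""
    out = []
    cursor = 0
    for pos, ref_allele, alt_allele in sorted(variants):
        if pos < cursor:
            raise ValueError("overlapping variants")
        if ref[pos:pos + len(ref_allele)] != ref_allele:
            raise ValueError("REF allele does not match reference")
        out.append(ref[cursor:pos])   # unchanged sequence before the variant
        out.append(alt_allele)        # substituted allele
        cursor = pos + len(ref_allele)
    out.append(ref[cursor:])          # unchanged tail
    return "".join(out)


def same_haplotype(ref, variants_a, variants_b):
    """True when both variant sets produce the same sequence."""
    return apply_variants(ref, variants_a) == apply_variants(ref, variants_b)
```

For the toy reference `ACTCTG`, a `CT` insertion placed after the first base and one placed after the last `T` both yield `ACTCTCTG`, so a position-by-position comparison would count a false positive where there is none. The same holds for two adjacent SNPs versus a single MNP.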

14

Uncertain variants: Difficult to map regions

15

Uncertain variants: Indels in long homopolymers

16

Uncertain variants: Regions with “decoy sequence”

17

Challenges with assessing performance

• All variant types are not equal
• Nearby variants are often difficult to align
– Multiple representations
• All regions of the genome are not equal
– Homopolymers, STRs, duplications
– Can be similar or different in different genomes
• Labeling difficult variants as uncertain leads to higher apparent accuracy when assessing performance
• Genotypes fall in 3+ categories (not positive/negative)
– standard diagnostic accuracy measures not well posed
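Because genotypes fall into three or more categories, a full confusion matrix is more informative than a single positive/negative tally. A minimal sketch, assuming genotypes are already reduced to the category labels shown; the function name and labels are hypothetical.

```python
from collections import Counter


def genotype_confusion(truth, calls, categories=("hom_ref", "het", "hom_var")):
    """truth, calls: dict site -> genotype category.

    Returns matrix[truth_category][called_category] over sites present in both.
    """
    shared = truth.keys() & calls.keys()
    counts = Counter((truth[site], calls[site]) for site in shared)
    return {t: {c: counts[(t, c)] for c in categories} for t in categories}
```

A true `hom_var` site called `het` is neither a clean false positive nor a clean false negative, which is why sensitivity and specificity need an explicit definition before they can be reported.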

18

How to incorporate inheritance in multi-platform integration

• Adding confidence
– Site follows expected inheritance pattern (and not all homozygous)
• Identifying errors
– Mendelian inheritance errors
– Sites where all family members are heterozygous
– Some CNVs
• Limitations of inheritance
– All homozygous sites can still be systematic errors
– Some errors can follow inheritance pattern (e.g., incorrect alignment around indel, some CNVs)
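The inheritance checks above can be sketched for a single trio at a diploid autosomal site. This is an illustrative simplification; the function names and the tuple-of-alleles genotype format are assumptions, and sex chromosomes and de novo mutations are not handled.

```python
def mendelian_consistent(child, mother, father):
    """Genotypes are 2-tuples of alleles, e.g. (0, 1).

    True when one child allele can come from each parent.
    """
    c1, c2 = child
    return (c1 in mother and c2 in father) or (c2 in mother and c1 in father)


def all_heterozygous(*genotypes):
    """True when every family member is heterozygous (a suspicious pattern)."""
    return all(a != b for a, b in genotypes)
```

A site where child, mother, and father are all `(0, 1)` passes the Mendelian check but is still flagged by `all_heterozygous`, matching the point that some errors follow the inheritance pattern.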

19

Availability of data, genotype calls, and methods

• Data for NA12878 is available on NCBI GIAB ftp site (see blogs on genomeinabottle.org)
– mirrored to Amazon today

• Highly confident genotype calls and bed files available on GIAB ftp site

• Pre-print of manuscript available on arxiv.org

• See genomeinabottle.org blog posts for more information

20

Acknowledgements

• GCAT – David Mittelman and Jason Wang
• FDA HPC – Mike Mikailov, Brian Fitzgerald, et al.
• HSPH – Brad Chapman, Oliver Hofmann, Win Hide
• Genome in a Bottle Consortium
– www.genomeinabottle.org
– newsletters, blogs, forums, announcements
– new partners welcome! Open to anyone
– targeting pilot reference material availability in early 2014

http://www.genomeinabottle.org/