1
Using Highly Confident Genotype Calls for NA12878 to understand sequencing accuracy
Genome in a Bottle Consortium
Justin Zook, Ph.D. and Marc Salit, Ph.D.
National Institute of Standards and Technology
2
Why create a set of highly confident genotypes for a genome?
• Current validation methods have limited purview or accuracy
• Sanger confirmation
– Limited by number of sites (and sometimes it’s wrong)
• High depth NGS confirmation
– May have same systematic errors
• Genotyping microarrays
– Limited to known (easier) variants
– Problems with neighboring variants, homopolymers, duplications
• Mendelian inheritance
– Can’t account for some systematic errors
• Simulated data
– Generally not very representative of errors in real data
• Ti/Tv
– Varies by region of genome, and only gives overall statistic
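As a reminder of what the Ti/Tv statistic measures, here is a minimal sketch that computes the ratio from a list of (ref, alt) SNP pairs. The function name and the input format are assumptions made for illustration and do not come from any GIAB tool.

```python
# Transitions are purine<->purine (A<->G) or pyrimidine<->pyrimidine (C<->T) changes.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def ti_tv_ratio(snps):
    """Return transitions / transversions for an iterable of (ref, alt) bases."""
    ti = tv = 0
    for ref, alt in snps:
        ref, alt = ref.upper(), alt.upper()
        if ref == alt:
            continue  # not a variant, skip
        if (ref, alt) in TRANSITIONS:
            ti += 1
        else:
            tv += 1
    if tv == 0:
        raise ValueError("no transversions; ratio is undefined")
    return ti / tv


# Hypothetical call set: three transitions and two transversions.
example_snps = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C"), ("T", "G")]
print(ti_tv_ratio(example_snps))
```

Because the result is one number for the whole call set, it cannot point to which individual calls are wrong, which is the limitation noted in the last bullet.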
3
Goals for Data Integration
• Carefully define highly confident regions of the genome
– distinguish between Hom Ref and Uncertain
• ~0 false positive AND false negative calls in confident regions
• Include as much of the genome as possible in the confident regions (i.e., don’t just take the intersection)
• Avoid bias towards any particular platform
• Avoid bias towards any particular bioinformatics algorithms
4
Integrate 12 Datasets from 5 platforms
5
Integration of Data to Form Highly Confident Genotype Calls
• Candidate variants: find all possible variant sites
• Confident variants: find highly confident sites across multiple datasets
• Find characteristics of bias: identify sites with atypical characteristics signifying sequencing, mapping, or alignment bias
• Arbitration: for each site, remove datasets with decreasingly atypical characteristics until all datasets agree
• Confidence level: even if all datasets agree, identify sites as uncertain if few datasets have typical characteristics, or if the sites fall in known segmental duplications or long repeats
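The arbitration and confidence steps can be sketched for a single site as follows. This is only an illustration under stated assumptions: the `atypicality` score, the `min_agreeing` threshold, and all names are hypothetical, and the actual integration pipeline may weigh characteristics quite differently.

```python
from dataclasses import dataclass


@dataclass
class DatasetCall:
    name: str           # dataset label (hypothetical)
    genotype: str       # e.g. "0/0", "0/1", "1/1"
    atypicality: float  # hypothetical score: higher means more atypical


def arbitrate(calls, in_difficult_region=False, min_agreeing=2):
    """Return (genotype or None, status) for one site."""
    # Most typical datasets first, so the most atypical one is last.
    remaining = sorted(calls, key=lambda c: c.atypicality)
    # Drop the most atypical dataset until the remaining genotypes agree.
    while len({c.genotype for c in remaining}) > 1:
        remaining.pop()
    if not remaining:
        return None, "uncertain"
    genotype = remaining[0].genotype
    # Agreement alone is not enough: too few supporting datasets, or a
    # known difficult region, still makes the site uncertain.
    if in_difficult_region or len(remaining) < min_agreeing:
        return genotype, "uncertain"
    return genotype, "confident"


site_calls = [
    DatasetCall("dataset_A", "0/1", 0.1),
    DatasetCall("dataset_B", "0/1", 0.2),
    DatasetCall("dataset_C", "0/0", 0.9),  # disagrees and looks atypical
]
print(arbitrate(site_calls))
print(arbitrate(site_calls, in_difficult_region=True))
```

Here `in_difficult_region` stands in for the check against known segmental duplications or long repeats.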
6
Characteristics of Sequence Data/Genotype associated with bias
• Systematic sequencing errors
– Strand bias
– Base Quality Rank Sum Test
• Local alignment problems
– Distance from end of read
– Read Position Rank Sum
– HaplotypeScore
• Mapping problems
– Mapping Quality
– Higher (or lower) than expected coverage – CNV
– Length of aligned reads
• Abnormal allele balance or Quality/Depth
– Allele Balance
– Quality/Depth
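Two of these characteristics are simple enough to compute directly from read counts. The sketch below assumes per-site counts of reference and alternate reads on each strand are already available. It uses a two-sided Fisher exact test as one possible strand-bias measure; that choice is an assumption for illustration, not necessarily the annotation used in the integration.

```python
from math import comb


def allele_balance(ref_reads, alt_reads):
    """Fraction of reads supporting the alternate allele."""
    total = ref_reads + alt_reads
    if total == 0:
        raise ValueError("no reads at this site")
    return alt_reads / total


def strand_bias_p(ref_fwd, ref_rev, alt_fwd, alt_rev):
    """Two-sided Fisher exact p-value for the 2x2 allele-by-strand table."""
    row1 = ref_fwd + ref_rev   # all reference reads
    row2 = alt_fwd + alt_rev   # all alternate reads
    col1 = ref_fwd + alt_fwd   # all forward-strand reads
    denom = comb(row1 + row2, col1)

    def prob(a):
        # Hypergeometric probability of `a` reference reads on the forward strand.
        return comb(row1, a) * comb(row2, col1 - a) / denom

    observed = prob(ref_fwd)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every table at least as unlikely as the observed one
    # (small tolerance guards against floating-point ties).
    return sum(p for p in map(prob, range(lo, hi + 1)) if p <= observed * (1 + 1e-9))


print(allele_balance(12, 12))
print(strand_bias_p(20, 20, 0, 15))   # alternate allele only on reverse strand
print(strand_bias_p(10, 10, 10, 10))  # no strand bias
```

A heterozygous site is expected to have an allele balance near 0.5, and a very small strand-bias p-value flags a variant seen predominantly on one strand.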
7
Regions excluded as uncertain
More recently, we also exclude homopolymers and long STRs, and 30 bp on each side of uncertain heterozygous and homozygous variant positions
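Excluding a fixed window around uncertain positions amounts to padding each position and merging overlapping intervals. The sketch below assumes 0-based positions on a single chromosome and produces half-open intervals in the BED convention; the function name is hypothetical.

```python
def pad_and_merge(positions, pad=30):
    """Pad each 0-based position by `pad` bp per side and merge overlaps.

    Returns sorted half-open (start, end) intervals, as in a BED file.
    """
    intervals = sorted((max(0, p - pad), p + pad + 1) for p in positions)
    merged = []
    for start, end in intervals:
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous interval: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(iv) for iv in merged]


# Hypothetical uncertain variant positions on one chromosome.
print(pad_and_merge([100, 140, 500]))
```

The resulting intervals would then be subtracted from the confident regions.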
8
Example of Arbitration: SSE suspected from strand bias
[Figure: reads from Platform A and Platform B. Labels: Homopolymer; Strand Bias (SNP overrepresented on reverse strands)]
9
Verification of “Highly Confident” Genotype accuracy
• Sanger sequencing
– 100% accuracy but only 100s of sites
• X Prize Fosmid sequencing
– Artifacts at end of fosmids
• Microarrays
– Differences appear to be FP or FN in arrays
• Broad 250bp HaplotypeCaller
– Very highly concordant, except a few systematic errors and homopolymers
• Platinum genomes pedigree SNPs
– Some systematic errors are inherited; different representations of complex variants
• Real Time Genomics Trio SNPs and indels
– Some interesting sites called by RTG complex caller but have no evidence in mapped reads
10
GCAT – Interactive Performance Metrics
• NIST is working with GCAT to use our highly confident variant calls
• Assess performance of many combinations of mappers and variant callers
• www.bioplanet.com/gcat
11
Why do calls differ from our highly confident genotypes?
Calls not in Integration
• Platform-specific systematic sequencing errors for SNPs
• Analysis-specific
• Difficult to map regions
• Indels in long homopolymers
Calls specific to Integration
• Different complex variant representation
• Some are incorrectly filtered as suspected FPs
12
Illumina-specific Systematic Sequencing Errors
13
Complex variants have multiple correct representations
[Figure: alignments of the same region from BWA, ssaha2, CGTools, and Novoalign against the reference (Ref). Labels: T insertion; TCTCT insertion]
Comparison method              FP SNPs        FP MNPs       FP indels
Traditional comparison         0.38% (610)    100% (915)    6.5% (733)
Comparison with realignment    0.15% (249)    4.2% (38)     2.6% (298)
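To see why a position-by-position comparison overcounts false positives, it helps to apply each representation to the reference and compare the resulting sequences. The sketch below is not the realignment-based comparison from the table. It is a toy haplotype comparison on a made-up reference, with hypothetical names throughout.

```python
def apply_variants(reference, variants):
    """Apply non-overlapping (pos, ref, alt) variants (0-based pos) to a string."""
    seq = reference
    # Apply right to left so the coordinates of earlier variants stay valid.
    for pos, ref, alt in sorted(variants, reverse=True):
        if seq[pos:pos + len(ref)] != ref:
            raise ValueError(f"reference allele mismatch at {pos}")
        seq = seq[:pos] + alt + seq[pos + len(ref):]
    return seq


reference = "AACTCTCTGG"  # made-up sequence containing a CT repeat

# The same 2-bp insertion, placed at the left or right end of the repeat.
left_aligned = [(1, "A", "ACT")]
right_aligned = [(7, "T", "TCT")]

# One MNP versus two adjacent SNPs.
mnp = [(2, "CT", "GA")]
two_snps = [(2, "C", "G"), (3, "T", "A")]

print(apply_variants(reference, left_aligned) == apply_variants(reference, right_aligned))
print(apply_variants(reference, mnp) == apply_variants(reference, two_snps))
```

In both pairs the variant records differ in position or type, yet they describe the same sequence, so counting them as discordant would inflate the apparent error rate.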
14
Uncertain variants: Difficult to map regions
15
Uncertain variants: Indels in long homopolymers
16
Uncertain variants: Regions with “decoy sequence”
17
Challenges with assessing performance
• All variant types are not equal
• Nearby variants are often difficult to align
– Multiple representations
• All regions of the genome are not equal
– Homopolymers, STRs, duplications
– Can be similar or different in different genomes
• Labeling difficult variants as uncertain leads to higher apparent accuracy when assessing performance
• Genotypes fall in 3+ categories (not positive/negative)
– standard diagnostic accuracy measures not well posed
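One way to keep all genotype categories visible, instead of collapsing them into positive/negative, is to tabulate every (truth, call) genotype pair. The sketch below assumes genotypes are given as plain labels keyed by a site identifier; the labels, site keys, and the `no_call` category are hypothetical.

```python
from collections import Counter


def genotype_confusion(truth, calls):
    """Count (truth_genotype, called_genotype) pairs over the truth sites."""
    table = Counter()
    for site, true_gt in truth.items():
        # Sites missing from the call set get their own category.
        table[(true_gt, calls.get(site, "no_call"))] += 1
    return table


truth = {
    "chr1:100": "het",
    "chr1:200": "hom_var",
    "chr1:300": "hom_ref",
    "chr1:400": "het",
}
calls = {
    "chr1:100": "het",
    "chr1:200": "het",      # variant found, but wrong genotype
    "chr1:300": "hom_ref",
}

for (true_gt, called_gt), n in sorted(genotype_confusion(truth, calls).items()):
    print(true_gt, called_gt, n)
```

A hom_var site called as het is neither a clean true positive nor a clean false negative, which is why the usual two-class measures are not well posed here.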
18
How to incorporate inheritance in multi-platform integration
• Adding confidence
– Site follows expected inheritance pattern (and not all homozygous)
• Identifying errors
– Mendelian inheritance errors
– Sites where all family members are heterozygous
– Some CNVs
• Limitations of inheritance
– All homozygous sites can still be systematic errors
– Some errors can follow inheritance pattern (e.g., incorrect alignment around indel, some CNVs)
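The inheritance checks listed above can be sketched for a single trio. This assumes diploid autosomal genotypes written as strings like "0/1"; the category names are hypothetical and CNV detection is out of scope.

```python
def alleles(genotype):
    """Split a genotype string such as "0/1" or "0|1" into its two alleles."""
    return genotype.replace("|", "/").split("/")


def is_het(genotype):
    a, b = alleles(genotype)
    return a != b


def is_mendelian_consistent(child, mother, father):
    """True if the child can inherit one allele from each parent."""
    a, b = alleles(child)
    m, f = set(alleles(mother)), set(alleles(father))
    return (a in m and b in f) or (a in f and b in m)


def classify_trio(child, mother, father):
    if not is_mendelian_consistent(child, mother, father):
        return "mendelian_error"
    genotypes = (child, mother, father)
    if all(is_het(g) for g in genotypes):
        return "all_heterozygous"   # consistent, but flagged as suspicious
    if not any(is_het(g) for g in genotypes):
        return "all_homozygous"     # consistent, but adds no confidence
    return "supports_call"


print(classify_trio("0/1", "0/0", "1/1"))
print(classify_trio("1/1", "0/0", "0/1"))
print(classify_trio("0/1", "0/1", "0/1"))
print(classify_trio("0/0", "0/0", "0/0"))
```

As the limitations bullet notes, a site classified as consistent can still be wrong, since some systematic errors follow the inheritance pattern.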
19
Availability of data, genotype calls, and methods
• Data for NA12878 is available on NCBI GIAB ftp site (see blogs on genomeinabottle.org)
– mirrored to Amazon today
• Highly confident genotype calls and bed files available on GIAB ftp site
• Pre-print of manuscript available on arxiv.org
• See genomeinabottle.org blog posts for more information
20
Acknowledgements
• GCAT – David Mittelman and Jason Wang
• FDA HPC – Mike Mikailov, Brian Fitzgerald, et al.
• HSPH – Brad Chapman, Oliver Hofmann, Win Hide
• Genome in a Bottle Consortium
– www.genomeinabottle.org
• newsletters, blogs, forums, announcements
– new partners welcome! Open to anyone
– targeting pilot reference material availability in early 2014