COMPANY CONFIDENTIAL – FOR INTERNAL USE ONLY © 2016 Illumina, Inc. All rights reserved. Illumina, 24sure, BaseSpace, BeadArray, BlueFish, BlueFuse, BlueGnome, cBot, CSPro, CytoChip, DesignStudio, Epicentre, ForenSeq, Genetic Energy, GenomeStudio, GoldenGate, HiScan, HiSeq, HiSeq X, Infinium, iScan, iSelect, MiniSeq, MiSeq, MiSeqDx, MiSeq FGx, NeoPrep, NextBio, Nextera, NextSeq, Powered by Illumina, SureMDA, TruGenome, TruSeq, TruSight, Understand Your Genome, UYG, VeraCode, verifi, VeriSeq, the pumpkin orange color, and the streaming bases design are trademarks of Illumina, Inc. and/or its affiliate(s) in the US and/or other countries. All other names, logos, and other trademarks are the property of their respective owners.

QC-ing and merging truth data
Michael Eberle
September 15, 2016

Sept2016 smallvar illumina_platinumgenomes



Page 2: Sept2016 smallvar illumina_platinumgenomes


Pedigree validation (alone)

●  Using the inheritance allows us to systematically incorporate “good” variants from different callers
   -  Systematic errors are unlikely to result in pedigree-consistent calls

●  PG merges calls from six different workflows
   -  Small variant call set includes 4,862,204 SNVs and 758,540 indels

●  Compared to single samples or trios, this call set will…
   -  Mostly fail calls co-segregating with germline or cell line CNVs
   -  Fail germline or cell line de novo mutations

●  Even with the pedigree check, problems can still arise
   -  Same variant may be included twice (merging problem)
   -  Systematic errors may create pedigree-consistent calls (SV edges)
   -  Incorrect ploidy may still produce “platinum” variants
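The pedigree check described above can be sketched roughly as follows. This is a minimal illustration, not the PG implementation: it assumes a single biallelic site, genotypes encoded as ALT allele counts (0, 1, 2), and a known inheritance pattern giving which parental haplotype each child received. The function name `pedigree_consistent` and the sample labels are hypothetical.

```python
from itertools import product


def pedigree_consistent(genotypes, inheritance):
    """Return True if some assignment of REF/ALT alleles (0/1) to the four
    parental haplotypes reproduces every called genotype.

    genotypes:   dict sample -> ALT allele count (0, 1 or 2); must contain
                 "father", "mother" and every child in `inheritance`.
    inheritance: dict child -> (father_hap, mother_hap), each 0 or 1, naming
                 which haplotype of each parent the child inherited.
    """
    # Try all 2^4 ways of placing the ALT allele on the parental haplotypes.
    for fa, fb, ma, mb in product((0, 1), repeat=4):
        father, mother = (fa, fb), (ma, mb)
        if sum(father) != genotypes["father"]:
            continue
        if sum(mother) != genotypes["mother"]:
            continue
        if all(
            father[fh] + mother[mh] == genotypes[child]
            for child, (fh, mh) in inheritance.items()
        ):
            return True
    return False


# Hypothetical inheritance pattern for three children.
inheritance = {"c1": (0, 0), "c2": (0, 1), "c3": (1, 0)}

consistent = pedigree_consistent(
    {"father": 1, "mother": 0, "c1": 1, "c2": 1, "c3": 0}, inheritance
)
all_het = pedigree_consistent(
    {"father": 1, "mother": 1, "c1": 1, "c2": 1, "c3": 1}, inheritance
)
```

With this toy inheritance pattern the first site is consistent, while the site that is het in every sample is not, which is the pattern a systematic error or a duplication would tend to produce.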

[Figure: pedigree diagram with sample labels. Top row: 89, 90, 91, 92; middle row: 77, 78; bottom row: 79-88 and 93.]

Page 3: Sept2016 smallvar illumina_platinumgenomes


Validating & filtering variants with k-mers

CCATTTGTAAAGGTATAGGTTCTGGAAGCTTAACAACGGCTGCCGTCAAAAATGATATCCTAATCTTTGGCAGGAACTTTG ALT

CCATTTGTAAAGGTATAGGTTCTGGAAGCTTAACAACGGCCGCCGTCAAAAATGATATCCTAATCTTTGGCAGGAACTTTG REF

[Figure: pileup of sequencing reads aligned beneath the ALT and REF sequences; individual read sequences omitted.]

9 reads do not fully span the 51-mer

4 reads contain base errors in the 51-mer

Page 4: Sept2016 smallvar illumina_platinumgenomes


Validating & filtering variants with k-mers

CCATTTGTAAAGGTATAGGTTCTGGAAGCTTAACAACGGCTGCCGTCAAAAATGATATCCTAATCTTTGGCAGGAACTTTG ALT

CCATTTGTAAAGGTATAGGTTCTGGAAGCTTAACAACGGCCGCCGTCAAAAATGATATCCTAATCTTTGGCAGGAACTTTG REF

[Figure: pileup of sequencing reads aligned beneath the ALT and REF sequences; individual read sequences omitted.]

Validate the REF with 6 k-mers and ALT with 5 k-mers
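A minimal sketch of the k-mer counting idea on this slide, assuming exact string matching on a single strand (no reverse complements, no quality filtering). The REF sequence is taken from the slide and ALT is built from it by substituting the variant base; the toy reads are hypothetical and built by slicing, and the helper names are illustrative only.

```python
REF = ("CCATTTGTAAAGGTATAGGTTCTGGAAGCTTAACAACGGC"
       "CGCCGTCAAAAATGATATCCTAATCTTTGGCAGGAACTTTG")
VAR_POS = 40                                   # 0-based index of the C>T SNV
ALT = REF[:VAR_POS] + "T" + REF[VAR_POS + 1:]
K = 51


def centered_kmer(seq, pos, k):
    """Return the k-mer (k odd) centered on index `pos` of `seq`."""
    half = k // 2
    if pos - half < 0 or pos + half >= len(seq):
        raise ValueError("sequence too short for a centered k-mer")
    return seq[pos - half: pos + half + 1]


def count_supporting_reads(reads, kmer):
    """Count reads containing the k-mer exactly (full span, no mismatches)."""
    return sum(1 for read in reads if kmer in read)


ref_kmer = centered_kmer(REF, VAR_POS, K)
alt_kmer = centered_kmer(ALT, VAR_POS, K)

# Hypothetical reads illustrating the cases shown on the slides.
reads = [
    ALT[:69],                    # spans the ALT 51-mer: counted for ALT
    ALT[:58],                    # does not fully span the 51-mer: not counted
    REF[:80],                    # spans the REF 51-mer: counted for REF
    REF[:30] + "A" + REF[31:],   # base error inside the 51-mer: not counted
]

ref_support = count_supporting_reads(reads, ref_kmer)
alt_support = count_supporting_reads(reads, alt_kmer)
```

Reads that are truncated or carry a base error inside the 51-mer contribute to neither count, which is why the supporting k-mer counts are lower than the raw read depth.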

Page 5: Sept2016 smallvar illumina_platinumgenomes


Inconsistent variants (homopolymer example)

TAAAGGTATAGGTTCTGGAAGCTTAACAACGGCCGCCGTC AAAATGATATCCTAATCTTTGGCAGGAACTTTGTCTTTCC ALT 1

TAAAGGTATAGGTTCTGGAAGCTTAACAACGGCCGCCGTCAAAAATGATATCCTAATCTTTGGCAGGAACTTTGTCTTTCC REF

TAAAGGTATAGGTTCTGGAAGCTTAACAACGGCCGCCGTCAAAA TGATATCCTAATCTTTGGCAGGAACTTTGTCTTTCC ALT 2

TAAAGGTATAGGTTCTGGAAGCTTAACAACGGCCGCCGTC AAA TGATATCCTAATCTTTGGCAGGAACTTTGTCTTTCC ALT 1+2

[Figure: pileup of sequencing reads aligned to the REF and ALT sequences; individual read sequences omitted. A space in an ALT sequence marks a deleted base.]

Validated the REF allele with 6 k-mers but 0 ALT k-mers

Page 6: Sept2016 smallvar illumina_platinumgenomes


K-mer reporting on consistent variants

●  Number of GT errors within the 13-member pedigree
   -  GT Pass means that there is one k-mer supporting each allele/haplotype
      §  Homozygous calls require two supporting k-mers

●  Number of k-mer errors in four founders
   -  Pass means that the haplotype-predicted k-mer is observed in the founder

●  Normalized count for each k-mer
   -  Number of times each k-mer is observed in the 13-member pedigree divided by the number of predicted haplotypes
      §  e.g. if six family members have the haplotype and we count the k-mer 60 times, then the normalized count is 60/6 = 10

●  Minimum normalized count for each variant
   -  K-mer with the lowest normalized count for each variant
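The normalized-count arithmetic above can be written out as a short sketch. The dictionary layout, the function names and the second k-mer's numbers are assumptions for illustration; only the 60/6 = 10 example comes from the slide.

```python
def normalized_counts(kmer_counts, predicted_haplotypes):
    """Divide each k-mer's pedigree-wide count by the number of predicted
    haplotypes carrying it.

    kmer_counts:          dict k-mer label -> times observed in the pedigree
    predicted_haplotypes: dict k-mer label -> number of predicted haplotypes
    """
    result = {}
    for kmer, count in kmer_counts.items():
        n = predicted_haplotypes[kmer]
        if n <= 0:
            raise ValueError(f"no predicted haplotypes for {kmer}")
        result[kmer] = count / n
    return result


def min_normalized_count(kmer_counts, predicted_haplotypes):
    """Lowest normalized count among the k-mers of one variant."""
    return min(normalized_counts(kmer_counts, predicted_haplotypes).values())


# Slide example for the REF k-mer; the ALT numbers are made up.
counts = {"ref_kmer": 60, "alt_kmer": 21}
haplotypes = {"ref_kmer": 6, "alt_kmer": 7}

per_kmer = normalized_counts(counts, haplotypes)
variant_min = min_normalized_count(counts, haplotypes)
```

A low minimum flags a variant where at least one allele is weakly supported relative to how many family members are predicted to carry it.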

[Figure: pedigree diagram with sample labels. Top row: 89, 90, 91, 92; middle row: 77, 78; bottom row: 79-88 and 93.]

Page 7: Sept2016 smallvar illumina_platinumgenomes


K-mer filtering removes ~195k “artefacts”

1 Failed/Passing means that all pedigree GTs are supported by k-mers

2 Failed/Passing means that all founder k-mers are observed in the proper founders

Page 8: Sept2016 smallvar illumina_platinumgenomes


Adding content with k-mers

●  Some k-mer failures due to conflicting representations

●  Nearby variants that are missed can cause k-mer failures

●  We only consider pedigree-consistent variants

●  Working on a modified k-mer application that can take in many variants for hypothesis testing
   -  Merge PG and GIAB-specific variants (or other putative variants)
   -  Resolve conflicting representations
   -  Recover pedigree-failed variants

●  Need more complete truth data & improved k-mer GT-ing & assessment algorithms

Page 9: Sept2016 smallvar illumina_platinumgenomes


Beyond “platinum” variants

Page 10: Sept2016 smallvar illumina_platinumgenomes


Recoverable SNVs

●  Identified 334,652 “high-quality” SNV calls that are not pedigree consistent
   -  High quality = SNV called by >1 pipeline with consistent GTs (when called) and every sample contains a GT call

●  Broke these into four categories based on likely cause
   -  CAT1 (191,087) = het in every sample (dup. or paralogous sequence)
   -  CAT2 (3,861) = GTs consistent with deletion in the pedigree
   -  CAT3 (49,800) = variant called in only one sample (cell line de novo)
   -  CAT4 (25,501) = all others (duplications and cell line artefacts)
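One way the four categories could be assigned is sketched below. The slide does not give the order in which the rules are applied or how deletion consistency is tested, so both are assumptions here: the rules are tried in CAT1, CAT2, CAT3 order, and the deletion test is passed in as a caller-supplied function. Genotypes are encoded as ALT allele counts, and `categorize_failed_snv` is a hypothetical name.

```python
def categorize_failed_snv(genotypes, deletion_consistent=None):
    """Assign a pedigree-failed SNV to CAT1..CAT4.

    genotypes:           dict sample -> ALT allele count (0, 1 or 2)
    deletion_consistent: optional callable taking `genotypes` and returning
                         True when the calls fit a deletion segregating in
                         the pedigree (its logic is not specified here).
    """
    calls = list(genotypes.values())
    if all(g == 1 for g in calls):
        return "CAT1"  # het in every sample
    if deletion_consistent is not None and deletion_consistent(genotypes):
        return "CAT2"  # GTs consistent with a deletion
    if sum(1 for g in calls if g > 0) == 1:
        return "CAT3"  # variant called in only one sample
    return "CAT4"      # all others


cat_all_het = categorize_failed_snv({"s1": 1, "s2": 1, "s3": 1})
cat_single = categorize_failed_snv({"s1": 0, "s2": 0, "s3": 1})
cat_other = categorize_failed_snv({"s1": 2, "s2": 0, "s3": 1})
cat_deletion = categorize_failed_snv(
    {"s1": 2, "s2": 0, "s3": 2}, deletion_consistent=lambda gts: True
)
```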

Page 11: Sept2016 smallvar illumina_platinumgenomes


CAT1 failed SNVs: Duplications

●  See an excess of “all-het” failed SNVs (red) versus consistent SNVs (blue)

●  Depth confirms that these are likely true variants
   -  Could incorporate these as 2 ref & 2 alt or 1 ref & 3 alt…

●  ~34% of CAT1 SNVs overlap population duplications from Sudmant et al. [2015]

[Figure panels: NA12878, NA12877, Children]

Page 12: Sept2016 smallvar illumina_platinumgenomes


CAT2 failed SNVs: Deletions

●  Clusters of CAT2 failed SNVs (red) identify deletions in this pedigree

●  Depth confirms that these are deletions in the pedigree
   -  This deletion is common (~15%) in the population

[Figure panels: NA12878, NA12877, Children]

Page 13: Sept2016 smallvar illumina_platinumgenomes


Incorporating CNVs into pedigree check

●  Many SNVs and indels that fail the pedigree check are true variants
   -  Many also show population-level HW deviations consistent with predicted call

●  Can modify the pedigree check to include CNVs to create a more complete call set
   -  Straightforward for deletions
   -  Duplications will require additional “B-Allele frequency” consideration

●  Need to improve CNV characterization first
   -  Improved breakpoint resolution
   -  Tandem duplications will work but not translocations
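To illustrate why the deletion case is comparatively simple, here is a sketch of a trio check that optionally lets a homozygous call stand for a hemizygous one (one allele plus a deleted copy, written "-"). This is an assumed simplification for a single trio, not the pedigree check used for PG; homozygous deletions and no-calls are ignored, and the function names are hypothetical.

```python
from itertools import product


def candidate_genotypes(called, allow_deletion):
    """Expand a called diploid genotype into possible true genotypes.

    called: tuple of two allele strings, e.g. ("0", "1").
    With allow_deletion, a homozygous call may really be hemizygous,
    i.e. one copy of the allele plus a deleted copy "-".
    """
    a, b = called
    candidates = {tuple(sorted((a, b)))}
    if allow_deletion and a == b:
        candidates.add(tuple(sorted((a, "-"))))
    return candidates


def trio_consistent(father, mother, child, allow_deletion=False):
    """True if the child can inherit one allele from each parent."""
    for f, m, c in product(
        candidate_genotypes(father, allow_deletion),
        candidate_genotypes(mother, allow_deletion),
        candidate_genotypes(child, allow_deletion),
    ):
        if any(tuple(sorted((x, y))) == c for x in f for y in m):
            return True
    return False


# Father 0/0, mother 1/1, child 1/1 fails a diploid check, but passes
# if the father is 0/- and the child inherited the deleted copy (1/-).
strict = trio_consistent(("0", "0"), ("1", "1"), ("1", "1"))
with_deletion = trio_consistent(
    ("0", "0"), ("1", "1"), ("1", "1"), allow_deletion=True
)
```

Duplications do not fit this scheme because extra copies change the expected allele balance rather than removing an allele, which is where the B-allele frequency would come in.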