Sept2016 smallvar 10_x

10X Genomics

Novel variants and variant validation

September 2016

2

Partitioning to Linked Reads

1.0 ng input

Poster: Methods

3

Linked read data

Confidential — Do not distribute

4

Unlinked, unphased short read SNP

5

Linked reads, phased SNP

6

Standard Short Read Alignment

Close Paralogs

Short Reads

Short Read Aligners Cannot Place Reads Correctly

7

Long Ranger – Lariat™ Aligner

1. Confident mapping provides anchors

2. Barcodes recruit short reads into paralogous loci

Close Paralogs

Lariat™ Aligner Correctly Places Short Reads Even in Paralogous Loci

Linked-Reads
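
The two steps on this slide (confident anchors, then barcode-based recruitment) can be illustrated with a toy sketch. This is not the Lariat implementation: the function name `place_reads`, the input layout, and the 50 kb window are assumptions made only for illustration.

```python
from collections import defaultdict

def place_reads(reads, window=50_000):
    """Toy barcode-aware placement (hypothetical helper, not Lariat itself).

    reads: list of dicts with 'name', 'barcode', and 'candidates',
           where candidates is a list of (chrom, pos) alignments.
    Returns {read name: chosen (chrom, pos), or None if still ambiguous}.
    """
    # Step 1: reads with a single candidate alignment act as anchors.
    anchors = defaultdict(list)
    for r in reads:
        if len(r["candidates"]) == 1:
            anchors[r["barcode"]].append(r["candidates"][0])

    placed = {}
    for r in reads:
        cands = r["candidates"]
        if not cands:
            placed[r["name"]] = None
            continue
        if len(cands) == 1:
            placed[r["name"]] = cands[0]
            continue
        # Step 2: for each candidate locus, count anchors that share the
        # read's barcode and lie within `window` bases on the same chromosome.
        scores = [
            sum(1 for ch, p in anchors[r["barcode"]]
                if ch == c[0] and abs(p - c[1]) <= window)
            for c in cands
        ]
        best = max(scores)
        # Place the read only if exactly one locus has barcode support.
        if best > 0 and scores.count(best) == 1:
            placed[r["name"]] = cands[scores.index(best)]
        else:
            placed[r["name"]] = None
    return placed
```

A read whose barcode has no nearby anchors stays unplaced in this sketch. A real aligner would score the alternatives probabilistically instead of counting anchors.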

8

Improved alignment leads to improved variant calling

• SMN1 and SMN2: part of an inverted tandem duplication on chr5
  – Differ by 8 nucleotides (3 exonic)

• SMN1: causative of spinal muscular atrophy

• SMN2: low function copy, not disease-causing

[Figure labels: Haplotype 1 Reads, Haplotype 2 Reads; Standard Genome, Chromium Genome; SMN2; NA12878 WGS 128Gb]

9

Inference

[Figure labels: chr1, chr3, chr5, chr11, chr13; source, sink]

• For every active alignment in the source whose read also has an alignment in the sink, switch the alignment in the sink to active and score the move probabilistically. If the source is left with few or no active alignments, the score goes up.
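
One source-to-sink move can be sketched as below. This is a minimal illustration, not the Long Ranger scoring model: the function `propose_move`, the per-alignment log-scores, and the fixed `molecule_penalty` charged for every molecule that still holds an active alignment are all assumptions, chosen so that emptying the source raises the score.

```python
def propose_move(active, candidates, source, sink, molecule_penalty=5.0):
    """Toy source -> sink move (hypothetical scoring, for illustration only).

    active:     {read: molecule currently holding its active alignment}
    candidates: {read: {molecule: alignment log-score}}
    Returns (new_active, delta_score); a positive delta favours the move.
    """
    new_active = dict(active)
    delta = 0.0
    for read, mol in active.items():
        # Only reads active in the source that can also align in the sink move.
        if mol == source and sink in candidates[read]:
            new_active[read] = sink
            delta += candidates[read][sink] - candidates[read][source]

    # Assumed prior: each molecule with at least one active alignment costs
    # `molecule_penalty`, so a source left empty increases the score.
    used_before = len(set(active.values()))
    used_after = len(set(new_active.values()))
    delta += molecule_penalty * (used_before - used_after)
    return new_active, delta
```

In this sketch the move would be accepted when `delta_score` is positive, after which the emptied source is treated as inactive.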

10

Inference

[Figure labels: chr1, chr3, chr5, chr11, chr13; source, sink]

• This source is also now inactive.

11

Inference

[Figure labels: chr1, chr3, chr5, chr11, chr13]

• Fast forward and we have the following active molecules left.

12

• Called by 10X data, not in GIAB 3.2.2 (whole genome, not restricted to confident regions)

• Validated with PacBio, requiring > 2 reads supporting the alt allele and > 15% allele fraction

• In regions with PacBio coverage >= 12, validation rates are 94% for 10X and 89% for TruSeq.

Novel variants

             10X    TruSeq   Diff   10X validated   TruSeq validated   Diff

SNPs         335k   292k     43k    289k            237k               52k

Deletions    76k    56k      20k    73k             54k                19k

Insertions   59k    43k      16k    58k             42k                16k

Total        470k   391k     79k    420k            333k               87k
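
The validation thresholds on this slide can be written as a small check. This is a sketch under one stated assumption: "> 2 alt alleles supported" is read as more than two PacBio reads supporting the alt allele. The function name and return labels are hypothetical.

```python
def validation_status(alt_reads, total_reads, min_coverage=12):
    """Classify one candidate variant from PacBio read counts (toy helper).

    Assumes 'alt alleles supported' means reads supporting the alt allele.
    """
    # Sites below the coverage cutoff are excluded from the validation rate.
    if total_reads < min_coverage:
        return "not_assessed"
    # Require more than 2 supporting reads AND more than 15% allele fraction.
    if alt_reads > 2 and alt_reads / total_reads > 0.15:
        return "validated"
    return "not_validated"
```

For example, 5 supporting reads out of 20 passes both thresholds, while 3 out of 30 fails on allele fraction (10%).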

13

• PacBio validation: align PacBio reads to the reference, then align them to the reference with the alt allele substituted for the reference allele. A read counts as support only if one alignment scores higher than the other.
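
The ref-versus-alt comparison can be sketched with a small dynamic-programming aligner. The slide does not say which aligner or scoring scheme was used, so the unit match/mismatch/gap scores and the helper names below are assumptions for illustration only.

```python
def apply_variant(ref, pos, ref_allele, alt_allele):
    """Return the reference with alt_allele substituted at 0-based pos."""
    if ref[pos:pos + len(ref_allele)] != ref_allele:
        raise ValueError("ref allele does not match the reference at pos")
    return ref[:pos] + alt_allele + ref[pos + len(ref_allele):]

def fit_score(read, hap, match=1, mismatch=-1, gap=-1):
    """Best score for aligning the whole read to any substring of hap."""
    prev = [0] * (len(hap) + 1)          # the read may start anywhere in hap
    for i in range(1, len(read) + 1):
        cur = [prev[0] + gap]            # read bases aligned against nothing
        for j in range(1, len(hap) + 1):
            sub = match if read[i - 1] == hap[j - 1] else mismatch
            cur.append(max(prev[j - 1] + sub, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return max(prev)                     # the read may end anywhere in hap

def read_support(read, ref, pos, ref_allele, alt_allele):
    """Return 'alt', 'ref', or None when the read does not discriminate."""
    alt_hap = apply_variant(ref, pos, ref_allele, alt_allele)
    s_ref, s_alt = fit_score(read, ref), fit_score(read, alt_hap)
    if s_alt > s_ref:
        return "alt"
    if s_ref > s_alt:
        return "ref"
    return None
```

Reads that score equally against both haplotypes, for instance reads that do not span the variant, return `None` and count toward neither allele.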

Novel variant validation method

• Can we validate this validation method?
  – Sensitivity of validation in confident region
  – Negative predictive value of “random” mutations
    • For SNPs, random is straightforward (could include TI/TV bias)
    • For indels
      – Pick length from geometric distribution
      – For deletions, the alt allele is trivial
      – For insertions, the alt allele used is the bases in the reference at that locus repeated.
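
The "random" negative-control mutations described above can be sketched as follows. The geometric parameter `p`, the uniform choice of position, the uniform choice of SNP base (no TI/TV bias), and the `(pos, ref_allele, alt_allele)` tuple layout are assumptions; the slide does not specify them.

```python
import random

def geometric_length(rng, p=0.5):
    """Draw a length >= 1 from a geometric distribution with success prob p."""
    n = 1
    while rng.random() >= p:
        n += 1
    return n

def random_snp(ref, rng):
    """Pick a position and a different base, uniformly (ignores TI/TV bias)."""
    pos = rng.randrange(len(ref))
    alt = rng.choice([b for b in "ACGT" if b != ref[pos]])
    return pos, ref[pos], alt

def random_deletion(ref, rng, p=0.5):
    """Delete a geometric-length run; the alt allele is simply empty."""
    length = min(geometric_length(rng, p), len(ref))
    pos = rng.randrange(len(ref) - length + 1)
    return pos, ref[pos:pos + length], ""

def random_insertion(ref, rng, p=0.5):
    """Insert a copy of the reference bases at the locus (a local repeat)."""
    length = min(geometric_length(rng, p), len(ref))
    pos = rng.randrange(len(ref) - length + 1)
    return pos, "", ref[pos:pos + length]
```

Each function returns 0-based coordinates on a plain sequence string. Passing a seeded `random.Random` keeps the negative-control set reproducible.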

14

• Entire 10X team, especially Patrick Marks and Deanna Church

• GIAB workshop organizers

1. Zheng, Grace XY, et al. "Haplotyping germline and cancer genomes with high-throughput linked-read sequencing." Nature Biotechnology (2016).

2. Samonte, Rhea Vallente, and Evan E. Eichler. "Segmental duplications and the evolution of the primate genome." Nature Reviews Genetics 3.1 (2002): 65-72.

3. Bishara, A., et al. "Read clouds uncover variation in complex regions of the human genome." Genome Research 25 (2015): 1570-1580.

4. Li, Heng, and Richard Durbin. "Fast and accurate short read alignment with Burrows–Wheeler transform." Bioinformatics 25.14 (2009): 1754-1760.

Acknowledgements and references

15

Addendum

16

SNP validation validation

Used for validation

17

Deletion validation validation

Used for validation

18

Insertion validation validation

Used for validation