SV-calling from 10X Linked-Read data. GIAB SV Data Jamboree. Sofia Kyriazopoulou-Panagiotopoulou, Sr Scientist, 10X Genomics [email protected]. Sept 15, 2016

Sept2016 sv 10_x


Page 1: Sept2016 sv 10_x

SV-calling from 10X Linked-Read data

GIAB SV Data Jamboree

Sofia Kyriazopoulou-Panagiotopoulou, Sr Scientist, 10X Genomics [email protected]

Sept 15, 2016

Page 2: Sept2016 sv 10_x

Making Linked-Reads

1.0ng high-molecular-weight gDNA = 300 haploid copies of the genome

[Figure: GEM workflow. Labels: Barcoded primer library; Enzyme; Oil; Collect GEMs; Primers with the same barcode; Pool; Sequence; Long input molecule; Linked-Reads]

0.5ng DNA (150 haploid copies of the genome) split into 1M partitions

Read structure: P5, 16bp BC, R1, Nmer, gDNA insert

Page 3: Sept2016 sv 10_x

What Linked-Reads are not

[Figure: chr13: BRCA2. Labels: 150X avg molecule coverage; > 30X avg read coverage; Linked-Reads]

• Each GEM contains 150/1M ≈ 1/6,700 of the genome (~500 Kb).
• If the average molecule length is 50Kb: 10 molecules/GEM.
• At an average of 30X sequencing depth, the read depth per molecule is 30/150 = 0.2X.
• Roughly 35 read-pairs per molecule.
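The per-GEM and per-molecule figures on this slide follow from a few divisions. The sketch below reproduces them; the genome size (3.2 Gb) and the read-pair length (2 x 150 bp) are assumptions that the slide does not state, and the function name is purely illustrative.

```python
# Back-of-the-envelope Linked-Read numbers.
# ASSUMPTIONS (not from the slide): haploid genome size and read-pair length.
GENOME_BP = 3.2e9          # assumed haploid human genome size
READ_PAIR_BP = 2 * 150     # assumed bases sequenced per read pair


def linked_read_stats(haploid_copies=150, partitions=1_000_000,
                      molecule_bp=50_000, read_depth=30.0):
    """Return the per-GEM and per-molecule quantities discussed on the slide."""
    # Fraction of one haploid genome that lands in each partition (GEM).
    genome_fraction_per_gem = haploid_copies / partitions
    dna_per_gem_bp = genome_fraction_per_gem * GENOME_BP
    molecules_per_gem = dna_per_gem_bp / molecule_bp
    # Total read depth is spread over all loaded copies of the genome.
    depth_per_molecule = read_depth / haploid_copies
    read_pairs_per_molecule = depth_per_molecule * molecule_bp / READ_PAIR_BP
    return {
        "dna_per_gem_bp": dna_per_gem_bp,
        "molecules_per_gem": molecules_per_gem,
        "depth_per_molecule": depth_per_molecule,
        "read_pairs_per_molecule": read_pairs_per_molecule,
    }


if __name__ == "__main__":
    for name, value in linked_read_stats().items():
        print(f"{name}: {value:,.2f}")
```

Under these assumptions each GEM holds roughly 480 Kb of DNA (about 10 molecules of 50Kb), each molecule is covered at 0.2X, and that depth corresponds to about 33 read pairs, close to the rounded values on the slide.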

Page 4: Sept2016 sv 10_x


Linked-Reads make SV detection easier

Short-read data

Barcoded short-read data

10X barcoding

Molecule inference

Phasing

Linked-Read data

Phased Linked-Read data

Page 5: Sept2016 sv 10_x


Deletion detection from Linked-Read data

Linked-Read alignment + Phasing

Coverage drop detection (HMM)

Discordant read-pair clustering

Local assembly by haplotype

Probabilistic modeling of insert sizes, errors, phasing (EM)

Final phased HET/HOM deletion calls

Candidate generation

Candidate filtering
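To make the "coverage drop detection (HMM)" step concrete, here is a minimal sketch of a two-state hidden Markov model decoded with the Viterbi algorithm. It is not the 10X pipeline: the Poisson emission model, the two mean depths, and the transition probability are all illustrative assumptions, and the function names are hypothetical.

```python
import math


def poisson_logpmf(k, lam):
    """Log-probability of observing count k under a Poisson with mean lam."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)


def viterbi_coverage_states(depths, normal_mean=30.0, dropped_mean=15.0,
                            stay_prob=0.99):
    """Label each bin 0 (normal coverage) or 1 (coverage drop).

    ASSUMPTIONS: Poisson emissions, equal start probabilities, and a
    symmetric transition matrix; none of these come from the slides.
    """
    if not depths:
        return []
    means = (normal_mean, dropped_mean)
    log_stay = math.log(stay_prob)
    log_switch = math.log(1.0 - stay_prob)
    # scores[s] = best log-probability of any state path ending in state s.
    scores = [math.log(0.5) + poisson_logpmf(depths[0], m) for m in means]
    backpointers = []
    for depth in depths[1:]:
        new_scores, pointers = [], []
        for state in (0, 1):
            candidates = [
                scores[prev] + (log_stay if prev == state else log_switch)
                for prev in (0, 1)
            ]
            best_prev = 0 if candidates[0] >= candidates[1] else 1
            new_scores.append(candidates[best_prev]
                              + poisson_logpmf(depth, means[state]))
            pointers.append(best_prev)
        scores = new_scores
        backpointers.append(pointers)
    # Trace the best path back from the best final state.
    state = 0 if scores[0] >= scores[1] else 1
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path
```

Runs of state 1 would then be reported as deletion candidates for the later filtering step. On phased data the same decoding could be applied to each haplotype's depth separately, with means chosen to match.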

Page 6: Sept2016 sv 10_x

Large SV detection from Linked-Read data

[Figure: segment order along a long molecule. Reference: A B C D E V W X Y Z. Inversion: A B C W V E D X Y Z. Aligning to the reference: segments A B C, D E, V W, X Y Z]

• If the event is heterozygous we see a mixture of the two types of signal.
• Probabilistic model of molecule length distribution and read depth to call and phase variants (deletions, inversions, duplications, translocations).
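Because reads from one input molecule carry the same barcode, two loci that are far apart on the reference but adjacent in the sample tend to share more barcodes than distance alone would predict. The sketch below illustrates that signal only; it is not the probabilistic model named on the slide. The input mapping, the function name, and both thresholds are hypothetical.

```python
from itertools import combinations


def barcode_overlap_candidates(barcodes_by_window, min_distance, min_shared):
    """Find distant window pairs that share many barcodes.

    barcodes_by_window maps (chrom, start) -> set of barcode strings
    (a hypothetical input format). Returns a list of
    (window_a, window_b, n_shared) tuples.
    """
    candidates = []
    windows = sorted(barcodes_by_window.items())
    for (win_a, bcs_a), (win_b, bcs_b) in combinations(windows, 2):
        same_chrom = win_a[0] == win_b[0]
        # Nearby windows share barcodes simply because one molecule spans both.
        if same_chrom and abs(win_b[1] - win_a[1]) < min_distance:
            continue
        n_shared = len(bcs_a & bcs_b)
        if n_shared >= min_shared:
            candidates.append((win_a, win_b, n_shared))
    return candidates


if __name__ == "__main__":
    windows = {
        ("chr1", 0): {"AAA", "CCC", "GGG"},
        ("chr1", 10_000): {"AAA", "CCC"},
        ("chr1", 100_000): {"AAA", "CCC", "GGG", "TTT"},
        ("chr1", 200_000): {"TTT"},
    }
    print(barcode_overlap_candidates(windows, min_distance=50_000, min_shared=3))
```

A real caller would compare the shared-barcode count against an expectation derived from molecule lengths and read depth rather than a fixed threshold.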

Page 7: Sept2016 sv 10_x

A/J and CEPH trio calls

[Table panels: Large-scale SVs (>30Kb); Deletions 50bp-30Kb]

• Mostly deletions

• 2-3 inversions in the son/daughter

A/J calls against hg19 and GRCh38 deposited to GIAB

Page 8: Sept2016 sv 10_x


SV-calling in hard regions

• 840bp HET deletion call in child and dad, no call in mom

• No mappability for short reads: the region is spanned by a 200Kb segmental duplication whose copy has >98% identity.

[Figure: read tracks. Labels: 10X Genomics; PCR-free Truseq; Hap 1; Hap 2; Unphased]

Page 9: Sept2016 sv 10_x


SV-calling in hard regions

• 73bp HET deletion call in child and mom, no call in dad.

• Overlaps simple repeat.

• Supported by 10X de-novo assembly.

[Figure: read tracks. Labels: 10X Genomics; PCR-free Truseq; 10X de-novo assembly]

Page 10: Sept2016 sv 10_x


Improved breakpoint resolution over short reads

• ~30Kb HET deletion call in child and dad, no call in mom.

• No read-pair support in PCR-free Truseq, low mappability at breakpoints (LINEs)

• Barcode-aware alignment allows us to get near-bp resolution.

[Figure: read tracks. Labels: 10X Genomics; PCR-free Truseq]

Page 11: Sept2016 sv 10_x


Beyond deletions

• HOM inversion

• No read-pair support, low mappability at breakpoints (LINEs)

• Breakpoints resolved within <1.5Kb in all three individuals of the trio

90Kb inversion causes molecules to “jump” between breakpoints

[Figure: read tracks. Labels: A; B; 10X Genomics; PCR-free Truseq]

Page 12: Sept2016 sv 10_x

Future development

• Not all repetitive/hard-to-map regions are resolved by the Linked-Read aligner.

• Highly polymorphic regions and assembly artifacts lead to false positives: SV whitelist/blacklist?

• How do we compare/overlap SV calls, especially in repeat-rich areas?
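On the question of how to compare SV calls, one common convention is reciprocal overlap: two calls match when each covers at least a given fraction of the other. The sketch below assumes calls are (chrom, start, end) tuples with half-open coordinates and a 50% threshold; both are assumptions, and the approach deliberately ignores breakpoint uncertainty, which is exactly what makes repeat-rich areas hard.

```python
def reciprocal_overlap(start_a, end_a, start_b, end_b):
    """Smaller of the two fractions: overlap/len(a) and overlap/len(b)."""
    overlap = min(end_a, end_b) - max(start_a, start_b)
    if overlap <= 0:
        return 0.0
    return min(overlap / (end_a - start_a), overlap / (end_b - start_b))


def same_call(call_a, call_b, min_reciprocal=0.5):
    """Treat two (chrom, start, end) calls as the same event if they are on
    the same chromosome and meet the reciprocal-overlap threshold.

    The 0.5 default is an illustrative assumption, not a GIAB standard.
    """
    chrom_a, start_a, end_a = call_a
    chrom_b, start_b, end_b = call_b
    if chrom_a != chrom_b:
        return False
    return reciprocal_overlap(start_a, end_a, start_b, end_b) >= min_reciprocal


if __name__ == "__main__":
    print(same_call(("chr1", 100, 200), ("chr1", 150, 250)))
    print(same_call(("chr1", 100, 200), ("chr1", 190, 1_000)))
```

A fixed threshold treats a small shift in a 73bp deletion and a large shift in a 30Kb deletion very differently, so comparisons in tandem repeats may need breakpoint windows or sequence-level matching instead.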

Page 13: Sept2016 sv 10_x

Thank you!

• The GIAB workshop organizers
• The entire 10X team

Sofia [email protected]