A Robust Framework for Detecting Structural Variations

Page 1: A Robust Framework for Detecting Structural Variations

A Robust Framework for Detecting Structural Variations

February 6, 2008

Seunghak Lee1, Elango Cheran1, and Michael Brudno1

1University of Toronto, Canada

Page 2: A Robust Framework for Detecting Structural Variations

What are structural variations? (1)

Structural variations are changes of 10^3 to 10^6 base pairs in the genome.

Insertion: a large contiguous fragment of DNA is inserted.

Deletion: a large contiguous fragment of DNA is deleted.

Inversion: a large contiguous fragment of DNA is inverted.

Translocation: a large contiguous fragment of DNA is moved from one chromosome to another.

Copy number variations

Page 3: A Robust Framework for Detecting Structural Variations

What are structural variations? (2)

Various examples of structural variations

Page 4: A Robust Framework for Detecting Structural Variations

Outline

Introduction
• Type of Structural Variations
• Sequencing Approaches to Detect Structural Variations
• Motivation & Research Objectives

Probabilistic Framework for Detecting Structural Variations
• Probabilistic Framework
• Flow of our Framework
• Hierarchical Clustering of Matepairs (2nd phase)
• Choosing a Unique Mapped Location for Each Matepair (3rd phase)

Experiments
• Comparison with Three Previous Studies
• DMBT1 Gene for Deletion
• Centromere and Translocations

Conclusions

Page 5: A Robust Framework for Detecting Structural Variations

Type of Structural Variations (1)

Insertion

[Figure: sequence A aligned against the reference (REF), illustrating an insertion]

Page 6: A Robust Framework for Detecting Structural Variations

Type of Structural Variations (2)

Deletion

[Figure: sequence A aligned against the reference (REF), illustrating a deletion]

Page 7: A Robust Framework for Detecting Structural Variations

Type of Structural Variations (3)

Inversion

[Figure: sequence A aligned against the reference (REF), illustrating an inversion; strand ends are labeled 5' and 3']

Page 8: A Robust Framework for Detecting Structural Variations

Type of Structural Variations (4)

Translocation

[Figure: a translocation between chr1 and chr2]

Page 9: A Robust Framework for Detecting Structural Variations

Sequencing Approaches

1. “Fine-scale structural variation of the human genome” [Tuzun et al., 2005]

• Mapped matepairs onto the reference genome
• Insertion and deletion: mapped distance inconsistent with the insert size
• Inversion: both reads map with the same orientation

2. “Paired-End Mapping Reveals Extensive Structural Variation in the Human Genome” [Korbel et al., 2007]

• Proposed a high-throughput, large-scale paired-end mapping technique
• Described the types of structural variations in detail
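
The discordance signals listed for Tuzun et al. can be illustrated with a short Python sketch. It is not the published pipeline: the `Matepair` record, the `classify` function, the 3-standard-deviation cutoff, and the cross-chromosome rule (taken from this deck's definition of a translocation) are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Matepair:
    """Hypothetical record: where each read of a matepair maps."""
    chrom1: str
    pos1: int
    strand1: str
    chrom2: str
    pos2: int
    strand2: str


def classify(mp, mean_insert, sd_insert, n_sd=3.0):
    """Label a mapped matepair using simple discordance rules (assumed cutoffs)."""
    if mp.chrom1 != mp.chrom2:
        return "translocation"      # reads land on different chromosomes
    if mp.strand1 == mp.strand2:
        return "inversion"          # both reads map with the same orientation
    span = abs(mp.pos2 - mp.pos1)   # mapped distance on the reference
    if span < mean_insert - n_sd * sd_insert:
        return "insertion"          # reference span too short: extra DNA in the sample
    if span > mean_insert + n_sd * sd_insert:
        return "deletion"           # reference span too long: DNA missing in the sample
    return "concordant"
```

For example, with a 40 kb library (standard deviation 2 kb), a matepair spanning 20 kb on the reference is labeled an insertion candidate and one spanning 61 kb a deletion candidate.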

Page 10: A Robust Framework for Detecting Structural Variations

Motivation & Research Objectives (1)

Tuzun et al. used scores that combine several factors (e.g. length, identity, and quality of the sequences).

How can we map reads onto the reference genome?

Page 11: A Robust Framework for Detecting Structural Variations

Motivation & Research Objectives (2)

Sequencing is an effective way to detect structural variants, as shown by Tuzun et al. and Korbel et al.

However, each read can have multiple mappings, and previous research relied on a priori mapped locations.

Why not develop a probabilistic model without such assumptions? Ideally, it could also be applied to short reads from next-generation sequencing (NGS) machines.

Page 12: A Robust Framework for Detecting Structural Variations

Probabilistic Framework (1)

p(Y): distribution of the mapped distances of “uniquely mapped” matepairs of various sizes

We build our probabilistic framework on p(Y).

Page 13: A Robust Framework for Detecting Structural Variations

Probabilistic Framework (2)

Insertion

P(Xi, Xj | ins=r) = P(Xi | ins=r) P(Xj | ins=r)

P(Xi | ins=r) = 1 − P(μ_Y − δ ≤ Y ≤ μ_Y + δ), where δ = |μ_Y − (s + r)| and s = mapped distance

X1, X2 = matepairs 1 and 2; Y = random variable for the mapped distances of “uniquely mapped” matepairs

[Figure: density p(Y) with the interval starting at μ_Y − δ marked; δ = 0 when μ_Y = s + r]

Page 14: A Robust Framework for Detecting Structural Variations

Probabilistic Framework (3)

Deletion

P(Xi, Xj | del=r) = P(Xi | del=r) P(Xj | del=r)

P(Xi | del=r) = 1 − P(μ_Y − δ ≤ Y ≤ μ_Y + δ), where δ = |μ_Y − (s − r)| and s = mapped distance

[Figure: density p(Y) with the interval starting at μ_Y − δ marked; δ = 0 when μ_Y = s − r]
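
A minimal sketch of the insertion and deletion probabilities, assuming p(Y) is approximated by the empirical sample of mapped distances from uniquely mapped matepairs. The function names `tail_prob` and `pair_prob` and the use of the sample mean for μ_Y are assumptions for illustration.

```python
from statistics import mean


def tail_prob(y_samples, s, r, kind):
    """1 - P(mu - delta <= Y <= mu + delta), estimated from samples of Y.

    s is the mapped distance of one matepair, r the hypothesized event size,
    kind is "ins" or "del".
    """
    mu = mean(y_samples)
    implied = s + r if kind == "ins" else s - r   # insert size implied by the event
    delta = abs(mu - implied)
    inside = sum(1 for y in y_samples if mu - delta <= y <= mu + delta)
    return 1.0 - inside / len(y_samples)


def pair_prob(y_samples, s_i, s_j, r, kind):
    """P(Xi, Xj | event of size r) as the product of the two single-matepair terms."""
    return tail_prob(y_samples, s_i, r, kind) * tail_prob(y_samples, s_j, r, kind)
```

The probability is highest when the implied insert size s + r (or s − r) coincides with μ_Y, and it drops toward zero as the implied size moves into the tails of p(Y).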

Page 15: A Robust Framework for Detecting Structural Variations

Probabilistic Framework (4)

Inversion

c − d = s(X1) − s(X2)

P(Xi, Xj | inv) = 1 − P(μ_{|Y1−Y2|} − δ ≤ |Y1 − Y2| ≤ μ_{|Y1−Y2|} + δ), where δ = |μ_{|Y1−Y2|} − (c − d)| and s(Xi) = insert size of Xi

[Figure: density p(|Y1 − Y2|) with the interval starting at μ_{|Y1−Y2|} − δ marked]

Page 16: A Robust Framework for Detecting Structural Variations

Probabilistic Framework (5)

Translocation

(c − a) − (d − b) = s(X1) − s(X2)

P(Xi, Xj | trans) = 1 − P(μ_{|Y1−Y2|} − δ ≤ |Y1 − Y2| ≤ μ_{|Y1−Y2|} + δ), where δ = |μ_{|Y1−Y2|} − [(c − a) − (d − b)]| and s(Xi) = insert size of Xi

[Figure: density p(|Y1 − Y2|) with the interval starting at μ_{|Y1−Y2|} − δ marked]
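
For inversions and translocations the model works with the distribution of |Y1 − Y2| rather than Y. A rough sketch follows, assuming this distribution is approximated by all pairwise absolute differences of the observed mapped distances; the function names are hypothetical, and a, b, c, d are the coordinates labeled in the slide figures.

```python
from itertools import combinations
from statistics import mean


def abs_diff_samples(y_samples):
    """Empirical sample of |Y1 - Y2| built from all pairs of mapped distances."""
    return [abs(y1 - y2) for y1, y2 in combinations(y_samples, 2)]


def inversion_diff(c, d):
    """Observed quantity for an inversion: c - d = s(X1) - s(X2)."""
    return c - d


def translocation_diff(a, b, c, d):
    """Observed quantity for a translocation: (c - a) - (d - b)."""
    return (c - a) - (d - b)


def rearrangement_prob(y_samples, observed_diff):
    """1 - P(mu - delta <= |Y1 - Y2| <= mu + delta) with delta = |mu - observed_diff|."""
    diffs = abs_diff_samples(y_samples)
    mu = mean(diffs)
    delta = abs(mu - observed_diff)
    inside = sum(1 for d in diffs if mu - delta <= d <= mu + delta)
    return 1.0 - inside / len(diffs)
```

Two matepairs support the same inversion or translocation well when their observed difference is close to the typical difference between two insert sizes from the library.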

Page 17: A Robust Framework for Detecting Structural Variations

Flow of our Framework (1)

1. Preprocessing step

• Get top K mappings
• Remove short mappings
• Make all possible combinations of mappings
• Discard matepairs consistent with insert size
• Remove invalid strands (-,+)
• Remove very similar mappings
• Mask repeats
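
Part of the preprocessing step can be sketched as follows. This is an illustration under assumptions, not the authors' code: the `Hit` record, the parameter names, the order of the filters, and the concordance test (opposite strands and a span within `tol` of the mean insert size) are hypothetical. Repeat masking and removal of near-duplicate mappings are omitted.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Hit:
    """Hypothetical alignment hit for one read."""
    chrom: str
    start: int
    end: int
    strand: str
    score: float


def top_hits(hits, k, min_len):
    """Keep the K best-scoring hits, then drop those shorter than min_len."""
    best = sorted(hits, key=lambda h: h.score, reverse=True)[:k]
    return [h for h in best if h.end - h.start >= min_len]


def candidate_pairs(hits1, hits2, k, min_len, mean_insert, tol):
    """Combine hits of both reads and keep only discordant combinations."""
    pairs = []
    for h1, h2 in product(top_hits(hits1, k, min_len), top_hits(hits2, k, min_len)):
        if (h1.strand, h2.strand) == ("-", "+"):
            continue  # strand combination listed as invalid on the slide
        span = abs(h2.end - h1.start)
        concordant = (
            h1.chrom == h2.chrom
            and h1.strand != h2.strand
            and abs(span - mean_insert) <= tol
        )
        if not concordant:
            pairs.append((h1, h2))  # only these can support a structural variation
    return pairs
```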

Page 18: A Robust Framework for Detecting Structural Variations

Flow of our Framework (2)

2. Clustering

• Do hierarchical clustering for each structural variation (Insertion, Deletion, Inversion, Translocation)

3. Finding structural variations

• Find initial configuration in greedy manner
• Parameter learning for the objective function
• Find a local optimum configuration

Page 19: A Robust Framework for Detecting Structural Variations

Hierarchical Clustering (1)

(ex) Insertion

[Figure: matepairs X1 and X2 mapped to the reference (REF) around an insertion in A, forming the cluster C = {X1, X2}]

• A cluster, C, is a set of matepairs explaining the same structural variation
• Linkage distance = D(X1, X2) = − ln P(X1, X2 | C)

Page 20: A Robust Framework for Detecting Structural Variations

Hierarchical Clustering (2)

Generally, linkage distance is given by,

We do hierarchical clustering for each structural variation.
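
The clustering idea can be sketched for the insertion case with D(X1, X2) = − ln P(X1, X2 | C). Several simplifications here are assumptions and not taken from the slides: p(Y) is modeled as a normal distribution with standard deviation `sigma`, the insertion size r for a pair is chosen so that both matepairs sit at the same δ = |s1 − s2| / 2 (which makes μ_Y cancel), clusters are merged by single linkage up to a cutoff `max_dist`, and genomic position is ignored.

```python
import math


def pair_distance(s1, s2, sigma):
    """Linkage distance -ln P(X1, X2 | ins) under the assumptions above.

    For Y ~ Normal(mu, sigma): 1 - P(mu - d <= Y <= mu + d) = erfc(d / (sigma * sqrt(2))).
    """
    delta = abs(s1 - s2) / 2.0
    p_single = math.erfc(delta / (sigma * math.sqrt(2.0)))
    p_pair = p_single * p_single
    return -math.log(p_pair) if p_pair > 0 else math.inf


def cluster(mapped_distances, sigma, max_dist):
    """Single-linkage agglomerative clustering of matepair indices."""
    clusters = [[i] for i in range(len(mapped_distances))]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(
                    pair_distance(mapped_distances[i], mapped_distances[j], sigma)
                    for i in clusters[a]
                    for j in clusters[b]
                )
                if best is None or d < best[0]:
                    best = (d, a, b)
        if best[0] > max_dist:
            break                      # remaining clusters are too far apart
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters
```

Matepairs whose mapped distances imply a similar insertion size end up close together, while a matepair implying a very different size stays in its own cluster.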

Page 21: A Robust Framework for Detecting Structural Variations

Choosing a Unique Mapped Location (1)

Each matepair must be assigned to a unique pair of BLAT hits and to a unique cluster.

[Figure: reads R1 and R2 with BLAT hits 1 to 5, candidate mappings M1,4, M2,4 and M3,5, and clusters C1 and C2]

Page 22: A Robust Framework for Detecting Structural Variations

Choosing a Unique Mapped Location (2)

We define an objective function J(ω)

ƒ1 corresponds to BLAT hit scores

ƒ2 corresponds to the probability

ƒ3 corresponds to the size of clusters

Page 23: A Robust Framework for Detecting Structural Variations

Choosing a Unique Mapped Location (3)

Find the initial configuration greedily

Learn parameters for the objective function J(ω). We used hill climbing search to maximize the log likelihood of P(ω|λi)

Finally, find a configuration that locally maximizes J(ω) using hill climbing search
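
The greedy start and the final hill climbing search can be sketched as below. The exact form of J(ω) is not preserved in this transcript, so the sketch assumes a weighted sum of three terms mirroring ƒ1 (hit scores), ƒ2 (log probabilities) and ƒ3 (cluster sizes). The data layout, the fixed weights and all function names are hypothetical, and the parameter-learning step is skipped.

```python
import math
from collections import Counter


def objective(config, candidates, weights):
    """Assumed J: weighted sum of hit scores, log probabilities and cluster sizes.

    candidates[m] is a list of options (cluster_id, hit_score, prob) with prob > 0;
    config[m] is the index of the option chosen for matepair m.
    """
    w1, w2, w3 = weights
    chosen = [candidates[m][i] for m, i in config.items()]
    sizes = Counter(cluster for cluster, _, _ in chosen)
    f1 = sum(score for _, score, _ in chosen)
    f2 = sum(math.log(prob) for _, _, prob in chosen)
    f3 = sum(sizes[cluster] for cluster, _, _ in chosen)
    return w1 * f1 + w2 * f2 + w3 * f3


def greedy_start(candidates):
    """Initial configuration: each matepair takes its best-scoring option."""
    return {
        m: max(range(len(opts)), key=lambda i: opts[i][1])
        for m, opts in candidates.items()
    }


def hill_climb(candidates, weights):
    """Change one matepair's choice at a time while J strictly improves."""
    config = greedy_start(candidates)
    best = objective(config, candidates, weights)
    improved = True
    while improved:
        improved = False
        for m, opts in candidates.items():
            for i in range(len(opts)):
                if i == config[m]:
                    continue
                trial = dict(config)
                trial[m] = i
                value = objective(trial, candidates, weights)
                if value > best:
                    config, best, improved = trial, value, True
    return config, best
```

Because every accepted move strictly increases J and the number of configurations is finite, the search stops at a local optimum, which need not be the global one.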

Page 24: A Robust Framework for Detecting Structural Variations

P-values

We assign p-values to express confidence in our clusters.

The p-value is the probability that the cluster is generated by the reference genome rather than by a structural variant:

Pval(Ck) = (E choose |Ck|) ∏ P(Xi | Cnull)

where E = expected number of matepairs mapped to the location of the cluster

P-values depend on the length of the cluster, the number of matepairs involved, and their probabilities.
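
A direct transcription of the p-value formula, as a sketch. It assumes E has been rounded to an integer and that the null probabilities P(Xi | Cnull) are already available; the function name is hypothetical, and the value is not capped at 1.

```python
import math


def cluster_pvalue(expected_matepairs, null_probs):
    """Pval(Ck) = (E choose |Ck|) * product of P(Xi | Cnull) over the cluster."""
    cluster_size = len(null_probs)
    return math.comb(expected_matepairs, cluster_size) * math.prod(null_probs)
```

With E = 10 and a cluster of three matepairs whose null probabilities are 0.05, 0.1 and 0.02, the binomial coefficient is 120 and the product is 0.0001, giving a value of about 0.012.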

Page 25: A Robust Framework for Detecting Structural Variations

Clustering Results

We started with ~360,000 matepairs
• ~90% were uniquely mapped
• ~90% had a concordant position (mapped at ± 2)

Through the clustering procedure above (FDR 0.2) we found
• 82 insertion clusters (53 had a uniquely mapped read)
• 175 deletion clusters (135)
• 103 inversion clusters (24)
• 55 translocation (cross-chromosome) clusters (all were required to have a uniquely mapped read)

Page 26: A Robust Framework for Detecting Structural Variations

Example Deletion

Page 27: A Robust Framework for Detecting Structural Variations

Agreement with Previous Results

Type       Total     Tuzun       Levy        Korbel      DGV-All
Insertion  82(53)    12(7)/139   6(5)/319    0(0)/34     24(13)/2216
Deletion   175(135)  21(17)/102  25(23)/344  45(36)/742  82(63)/4697
Inversion  103(24)   34(12)/56   N/A         42(8)/105   60(15)/164

We compared our clusters with the Tuzun, Levy, Korbel, and DGV datasets.

All of the correlations (except the zero overlap) are significant (p-values < 0.001) via Monte Carlo simulations.

The DMBT1 deletion was also found in the Tuzun et al. dataset (but not in the Levy dataset).
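
The slides do not describe how the Monte Carlo simulations were run. One common way to test an overlap for significance is sketched here under stated assumptions: a single linear genome of length `genome_length`, half-open intervals, predicted events relocated uniformly at random while keeping their lengths, and hypothetical function names throughout.

```python
import random


def overlaps(a, b):
    """True if half-open intervals a = (start, end) and b = (start, end) intersect."""
    return a[0] < b[1] and b[0] < a[1]


def count_hits(calls, known):
    """Number of predicted events overlapping at least one known event."""
    return sum(1 for c in calls if any(overlaps(c, k) for k in known))


def monte_carlo_pvalue(calls, known, genome_length, n_sim=1000, seed=0):
    """Fraction of random placements with at least as many overlaps as observed."""
    rng = random.Random(seed)
    observed = count_hits(calls, known)
    at_least = 0
    for _ in range(n_sim):
        shuffled = []
        for start, end in calls:
            length = end - start
            new_start = rng.randrange(0, genome_length - length)
            shuffled.append((new_start, new_start + length))
        if count_hits(shuffled, known) >= observed:
            at_least += 1
    return (at_least + 1) / (n_sim + 1)   # add-one estimate avoids a p-value of exactly 0
```

With 1,000 simulations the smallest p-value this estimate can report is 1/1001, so claiming p < 0.001 would require more simulations.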

Page 28: A Robust Framework for Detecting Structural Variations

Translocations

A large fraction (69%) of the translocations were close to the centromeres

She et al. predicted up to 200 interchromosomal rearrangement events near centromeres per million years. The two donors are ~0.2 million years apart

These could also be mis-assemblies.

Distance to centromere

                    <10^6   (10^6, 4.5*10^6]   >4.5*10^6
<10^6               22      6                  10
(10^6, 4.5*10^6]            0                  3
>4.5*10^6                                      14
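
The 69% figure can be reproduced from the distance table under one reading of it, namely that rows and columns give the distance bins of the two ends of each translocation, and that "close to the centromeres" means at least one end lies within 10^6 bp. That reading is an assumption; the counts are those from the table.

```python
# Counts by distance bin of the two ends (upper-triangular table from the slide).
counts = {
    ("<1e6", "<1e6"): 22,
    ("<1e6", "1e6-4.5e6"): 6,
    ("<1e6", ">4.5e6"): 10,
    ("1e6-4.5e6", "1e6-4.5e6"): 0,
    ("1e6-4.5e6", ">4.5e6"): 3,
    (">4.5e6", ">4.5e6"): 14,
}

total = sum(counts.values())                                        # 55 translocation clusters
near = sum(n for bins, n in counts.items() if "<1e6" in bins)       # 38 with an end within 10^6 bp
fraction = near / total                                             # about 0.69
```

The total of 55 matches the number of translocation clusters reported in the clustering results.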

Page 29: A Robust Framework for Detecting Structural Variations

Conclusions

Introduced a novel framework for finding structural variants that does not rely on ab initio mapping of matepairs to genomic positions.

Introduced a probabilistic model for structural variants.

Isolated 82 insertions, 175 deletions, and 103 inversions between the public human reference genome and the JCVI donor.

These results show statistically significant correlation with previous variation studies.

Isolated 194 novel structural variants that do not overlap any event from the Database of Genomic Variants (of these, 121 have support from a uniquely mapped matepair).
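
The counts of novel variants follow from the agreement table if each entry n(m)/N in the DGV-All column is read as n overlapping clusters, m of them with a uniquely mapped matepair, out of N database events. That reading is an assumption, but the arithmetic matches the reported totals.

```python
# (clusters, clusters with a uniquely mapped matepair) from the agreement table
totals = {"insertion": (82, 53), "deletion": (175, 135), "inversion": (103, 24)}
dgv_overlap = {"insertion": (24, 13), "deletion": (82, 63), "inversion": (60, 15)}

novel = sum(totals[t][0] - dgv_overlap[t][0] for t in totals)           # 58 + 93 + 43 = 194
novel_unique = sum(totals[t][1] - dgv_overlap[t][1] for t in totals)    # 40 + 72 + 9 = 121
```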