48
Meiotic Recombination (single-crossover) Prefix Suffix Recombination is one of the principal evolutionary forces responsible for shaping genetic variation within species. Estimating the frequency and the location of recombination is central to modern-day genetics. (e.g. disease association mapping) 1 1 L L 1 L b b b

Meiotic Recombination (single-crossover) PrefixSuffix Recombination is one of the principal evolutionary forces responsible for shaping genetic variation

Embed Size (px)

Citation preview

Page 1: Meiotic Recombination (single-crossover) PrefixSuffix  Recombination is one of the principal evolutionary forces responsible for shaping genetic variation

Meiotic Recombination (single-crossover)

Prefix Suffix

Recombination is one of the principal evolutionary forces responsible for shaping genetic variation within species.

Estimating the frequency and the location of recombination is central to modern-day genetics.

(e.g. disease association mapping)

1 1L L

1 L

b b

b

Page 2: Meiotic Recombination (single-crossover) PrefixSuffix  Recombination is one of the principal evolutionary forces responsible for shaping genetic variation

A Phylogenetic Network

00000

52

3

3 4S

p

PS

1

4

a:00010

b:10010

c:00100

10010

01100

d:10100

e:01100

00101

01101

f:01101

g:00101

00100

00010

10100

Page 3: Meiotic Recombination (single-crossover) PrefixSuffix  Recombination is one of the principal evolutionary forces responsible for shaping genetic variation

Elements of a Phylogenetic Network (single crossover

recombination)• Directed acyclic graph. • Integers from 1 to m written on the edges. Each integer written

only once. These represent mutations.• A choice of ancestral sequence at the root.• Every non-root node is labeled by a sequence obtained from its

parent(s) and any edge label on the edge into it.• A node with two edges into it is a ``recombination node”, with a

recombination point r. One parent is P and one is S.• The network derives the sequences that label the leaves.

Page 4: Meiotic Recombination (single-crossover) PrefixSuffix  Recombination is one of the principal evolutionary forces responsible for shaping genetic variation

Minimizing Recombinations

• Any set M of sequences can be generated by a phylogenetic network with enough recombinations, and one mutation per site. This is not interesting or useful.

• However, the number of (observable) recombinations is small in realistic sets of sequences. ``Observable” depends on n and m relative to the number of recombinations.

• Problem: Given a set of sequences M, find a phylogenetic network generating M, minimizing the number of recombinations. NP-hard (Wang et al 2000, Semple et al 2004)

Page 5: Meiotic Recombination (single-crossover) PrefixSuffix  Recombination is one of the principal evolutionary forces responsible for shaping genetic variation

Minimizing recombinations in unconstrained networks

• Problem: given a set of sequences M, find a phylogenetic network generating M, minimizing the number of recombinations used to generate M.

Page 6: Meiotic Recombination (single-crossover) PrefixSuffix  Recombination is one of the principal evolutionary forces responsible for shaping genetic variation

Minimization is an NP-hard Problem

There is no known efficient

solution to this problem and there likely will never be one.

What we do: Solve small data-sets optimally with algorithms that are not provably efficient but work well inpractice;

Efficiently compute lower and upper bounds on the number of needed recombinations.

Page 7: Meiotic Recombination (single-crossover) PrefixSuffix  Recombination is one of the principal evolutionary forces responsible for shaping genetic variation

Efficient Computation of Close Upper and Lower Bounds on

the Minimum Number of Recombinations in Biological Sequence Evolution

Yun S. Song, Yufeng Wu, Dan Gusfield

UC Davis

ISMB 2005

Page 8: Meiotic Recombination (single-crossover) PrefixSuffix  Recombination is one of the principal evolutionary forces responsible for shaping genetic variation

Computing close lower and upper bounds on the minimum number of recombinations needed to derive M. (ISMB 2005)

Page 9: Meiotic Recombination (single-crossover) PrefixSuffix  Recombination is one of the principal evolutionary forces responsible for shaping genetic variation

Incompatible Sites

A pair of sites (columns) of M that fail the

4-gametes test are said to be incompatible.

A site that is not in such a pair is compatible.

Page 10: Meiotic Recombination (single-crossover) PrefixSuffix  Recombination is one of the principal evolutionary forces responsible for shaping genetic variation

0 0 0 1 01 0 0 1 00 0 1 0 01 0 1 0 00 1 1 0 00 1 1 0 10 0 1 0 1

1 2 3 4 5abcdefg

1 3

4

2 5

Two nodes are connected iff the pairof sites are incompatible, i.e, fail the 4-gamete test.

Incompatibility Graph

M

THE MAIN TOOL: We represent the pairwise incompatibilities in a incompatibility graph.

Page 11: Meiotic Recombination (single-crossover) PrefixSuffix  Recombination is one of the principal evolutionary forces responsible for shaping genetic variation

Simple Fact

If sites two sites i and j are incompatible, then the sites must be together on some recombination cycle whose recombination point is between the two sites i and j.

(This is a general fact for all phylogenetic networks.)

Ex: In the prior example, sites 1, 3 are incompatible, as are 1, 4; as are 2, 5.

Page 12: Meiotic Recombination (single-crossover) PrefixSuffix  Recombination is one of the principal evolutionary forces responsible for shaping genetic variation

The grandfather of all lower bounds - HK 1985

• Arrange the nodes of the incompatibility graph on the line in order that the sites appear in the sequence. This bound requires a linear order.

• The HK bound is the minimum number of vertical lines needed to cut every edge in the incompatibility graph. Weak bound, but widely used - not only to bound the number of recombinations, but also to suggest their locations.

Page 13: Meiotic Recombination (single-crossover) PrefixSuffix  Recombination is one of the principal evolutionary forces responsible for shaping genetic variation

Justification for HK

If two sites are incompatible, there must have been some recombination where the crossover point is between the two sites.

Page 14: Meiotic Recombination (single-crossover) PrefixSuffix  Recombination is one of the principal evolutionary forces responsible for shaping genetic variation

1 2 3 4 5

HK Lower Bound

Page 15: Meiotic Recombination (single-crossover) PrefixSuffix  Recombination is one of the principal evolutionary forces responsible for shaping genetic variation

1 2 3 4 5

HK Lower Bound = 1

Page 16: Meiotic Recombination (single-crossover) PrefixSuffix  Recombination is one of the principal evolutionary forces responsible for shaping genetic variation

More general view of HKGiven a set of intervals on the line, and for each interval I, a number N(I), define the composite problem: Find the minimum number of vertical lines so that every interval I intersects at least N(I) of the vertical lines.

In HK, each incompatibility defines an interval I where N(I) = 1.

The composite problem is easy to solve by a left-to-right myopicplacement of vertical lines.

Page 17: Meiotic Recombination (single-crossover) PrefixSuffix  Recombination is one of the principal evolutionary forces responsible for shaping genetic variation

This general approach is called the Composite Method(Simon Myers 2002).

If each N(I) is a ``local” lower bound on the number ofrecombinations needed in interval I, then the solution tothe composite problem is a valid lower bound for thefull sequences. The resulting bound is called the compositebound given the local bounds.

Page 18: Meiotic Recombination (single-crossover) PrefixSuffix  Recombination is one of the principal evolutionary forces responsible for shaping genetic variation

The Composite Method (Myers & Griffiths 2003)

M

1. Given a set of intervals, and

Composite Problem: Find the minimum number of vertical lines so that every I intersects at least N(I) vertical lines.

2

1

2

2

2

31

2. for each interval I, a number N(I)

8

Page 19: Meiotic Recombination (single-crossover) PrefixSuffix  Recombination is one of the principal evolutionary forces responsible for shaping genetic variation

Haplotype Bound (Simon Myers)

• Rh = Number of distinct sequences (rows) - Number of distinct sites (columns) -1 <= minimum number of recombinations needed (folklore)

• Before computing Rh, remove any site that is compatible with all other sites. A valid lower bound results - generally increases the bound.

• Generally Rh is really bad bound, often negative, when used on large intervals, but Very Good when used as local bounds in the Composite Interval Method, and other methods.

Page 20: Meiotic Recombination (single-crossover) PrefixSuffix  Recombination is one of the principal evolutionary forces responsible for shaping genetic variation

Composite Interval Method using RH bounds

Compute Rh separately for each possible interval of sites; let N(I) = Rh(I) be the local lower bound for interval I. Then compute the composite bound using these local bounds.

Page 21: Meiotic Recombination (single-crossover) PrefixSuffix  Recombination is one of the principal evolutionary forces responsible for shaping genetic variation

Composite Subset Method (Myers)

• Let S be subset of sites, and Rh(S) be the haplotype bound for subset S. If the leftmost site in S is L and the rightmost site in S is R, then use Rh(S) as a local bound N(I) for interval I = [S,L].

• Compute Rh(S) on many subsets, and then solve the composite problem to find a composite bound.

Page 22: Meiotic Recombination (single-crossover) PrefixSuffix  Recombination is one of the principal evolutionary forces responsible for shaping genetic variation

RecMin (Myers)

• Computes Rh on subsets of sites, but limits the size and the span of the subsets. Default parameters are s = 6, w = 15 (s = size, w = span).

• Generally, impractical to set s and w large, so generally one doesn’t know if increasing the parameters would increase the bound.

• Still, RecMin often gives a bound more than three times the HK bound. Example LPL data: HK gives 22, RecMin gives 75.

Page 23: Meiotic Recombination (single-crossover) PrefixSuffix  Recombination is one of the principal evolutionary forces responsible for shaping genetic variation

Optimal RecMin Bound (ORB)

• The Optimal RecMin Bound is the lower bound that RecMin would produce if both parameters were set to their maximum possible values.

• In general, RecMin cannot compute (in practical time) the ORB.

• We have developed a practical program, HAPBOUND, based on integer linear programming that guarantees to compute the ORB, and have incorporated ideas that lead to even higher lower bounds than the ORB.

Page 24: Meiotic Recombination (single-crossover) PrefixSuffix  Recombination is one of the principal evolutionary forces responsible for shaping genetic variation

Computing the ORB

• Gross Idea: For each interval I, use ILP to find a subset of sites that maximizes Rh over all subsets in interval I. Set N(I) to the found Rh value, and use these values in the Composite Method.

• The result is guaranteed to be the ORB. i.e, the bound one would get by using all 2^m subsets in RecMin.

Page 25: Meiotic Recombination (single-crossover) PrefixSuffix  Recombination is one of the principal evolutionary forces responsible for shaping genetic variation

We have moved from doing an exponential number of simplecomputations (computing Rh for each subset), to solving a quadratic number of (possibly expensive) ILPs.

Is this a good trade-off in practice? Our experience - very much so!

Page 26: Meiotic Recombination (single-crossover) PrefixSuffix  Recombination is one of the principal evolutionary forces responsible for shaping genetic variation

How to compute Opt(I) by ILPCreate one variable Xi for each row i; one variable Yj for each column j in interval I. All variables are 0/1 variables.

Define S(i,i’) as the set of columns where rows i and i’ have different values. Each column in S(i,i’) is a witness that rows i and i’ differ.

For each pair of rows (i,i’), create the constraint: Xi + Xi’ <= 1 + ∑ [Yj: jin S(i,i’)]

Objective Function: Maximize ∑ Xi - ∑ Yj -1

Page 27: Meiotic Recombination (single-crossover) PrefixSuffix  Recombination is one of the principal evolutionary forces responsible for shaping genetic variation

Alternate way to compute Opt(I) by ILP

Create one variable Yj for each column jin interval I. All variables are 0/1 variables.

S(i,i’) as before.

For each pair of rows (i,i’) create the constraint: 1 <= ∑ [Yj: jin S(i,i’)]

Objective Function: Maximize N - ∑(Yj) -1

Finds the smallest number of columns to distinguish the rows.

First remove any duplicate rows. Let N be the number of rowsremaining.

Page 28: Meiotic Recombination (single-crossover) PrefixSuffix  Recombination is one of the principal evolutionary forces responsible for shaping genetic variation

Two critical tricks

1) Use the second ILP formulation to compute Opt(I). It solves much faster than the first (why?)

2) Reduce the number of intervals where Opt(I) is computed:

If the solution to Opt(I) uses columns thatonly span interval L, then there is noneed to find Opt(I’) in any interval I’ containing L. Each ILP computationdirectly spawns at most 4 other ILPs. Apply this idea recursively.

I

L

Page 29: Meiotic Recombination (single-crossover) PrefixSuffix  Recombination is one of the principal evolutionary forces responsible for shaping genetic variation

With the second trick we need to find Opt(I) for only 0.5% - 35% of all the C(m,2) intervals (empirical result). Surprisingly fast in practice (with either the GNU solver or CPLEX).

Page 30: Meiotic Recombination (single-crossover) PrefixSuffix  Recombination is one of the principal evolutionary forces responsible for shaping genetic variation

Bounds Higher Than the Optimal RecMin Bound

• Often the ORB underestimates the true minimum, e.g. Kreitman’s ADH data: 6 v. 7

• How to derive sharper bound? Idea: In the composite method, check if each local bound N(I) = Opt(I) is tight, and if not, increase N(I) by one.

• Small increases in local bounds can increase the composite bound.

Page 31: Meiotic Recombination (single-crossover) PrefixSuffix  Recombination is one of the principal evolutionary forces responsible for shaping genetic variation

Bounds Sharper Than Optimal RecMin Bound

• A set of sequences M is called self-derivable if there is a network generating M where the sequence at every node (leaf and intermediate) is in M.

• Observation: The haplotype bound for a set of sequences is tight if and only if the sequences are self-derivable.

• So for each interval I where Opt(I) is computed, we check self-derivability of the sequences induced by columns S*, where S* are the columns specified by Opt(I). Increase N(I) by 1 if the sequences are not self-derivable.

Page 32: Meiotic Recombination (single-crossover) PrefixSuffix  Recombination is one of the principal evolutionary forces responsible for shaping genetic variation

Self-Derivability Test

Solution in O(2^n) time by DP.

Page 33: Meiotic Recombination (single-crossover) PrefixSuffix  Recombination is one of the principal evolutionary forces responsible for shaping genetic variation

Checking if N(I) should be increased by two

If the set of sequences is not self-derivable,we can test if adding one new sequence makes

it self-derivable.the number of candidate sequences is

polynomial and for each one added to M we check self-derivability. N(I) should be increased by two if none of these sets of sequences is self-derivable.

Page 34: Meiotic Recombination (single-crossover) PrefixSuffix  Recombination is one of the principal evolutionary forces responsible for shaping genetic variation

Program HapBound

• HapBound –S. Checks each Opt(I) subset for self-derivability. Increase N(I) by 1 or 2 if possible. This often beats ORB and is still practical for most datasets.

• HapBound –M. Explicitly examine each interval directly for self-derivability. Increase local bound if possible. Derives lower bound of 7 for Kreitman’s ADH data in this mode.

Page 35: Meiotic Recombination (single-crossover) PrefixSuffix  Recombination is one of the principal evolutionary forces responsible for shaping genetic variation

ILP Compsite method efficiency

• Only C(m,2) intervals, but the time for each is more than for the Hap bound.

• With tricks we need to solve the ILP for only 0.5% of all the C(m,2) intervals (empirical result). Shockingly fast in practice.

Page 36: Meiotic Recombination (single-crossover) PrefixSuffix  Recombination is one of the principal evolutionary forces responsible for shaping genetic variation

Typical Result

With n = 40, m = 100, r = 10 (ms recombination parameter):

HK gives 4

Composite Intervals with Hap Bound or CC bound gives 6

Recmin Composite Subsets with Hap Bound and subsets of size up to 20 in windows of size 20: 9 (7 secs)

Composite Subsets with Hap Bound and subsets of size up to30 in windows of size 30: 10 (takes 10 mins on Powerbook G4)

ILP Composite gets 10 and takes 1 second on Linux Box.

Page 37: Meiotic Recombination (single-crossover) PrefixSuffix  Recombination is one of the principal evolutionary forces responsible for shaping genetic variation

HapBound vs. RecMin on LPL from Clark et al.

Program Lower Bound Time

RecMin (default) 59 3s

RecMin –s 25 –w 25

75 7944s

RecMin –s 48 –w 48

No result 5 days

HapBound ORB 75 31s

HapBound -S 78 1643s2 Ghz PC

Page 38: Meiotic Recombination (single-crossover) PrefixSuffix  Recombination is one of the principal evolutionary forces responsible for shaping genetic variation

Example where RecMin has difficulity in Finding the ORB on a

25 by 376 Data MatrixProgram Bound Time

RecMin default 36 1s

RecMin –s 30 –w 30 42 3m 25s

RecMin –s 35 –w 35 43 24m 2s

RecMin –s 40 –w 40 43 2h 9m 4s

RecMin –s 45 –w 45 43 10h 20m 59s

HapBound 44 2m 59s

HapBound -S 48 39m 30s

Page 39: Meiotic Recombination (single-crossover) PrefixSuffix  Recombination is one of the principal evolutionary forces responsible for shaping genetic variation

The Human LPL Data (Nickerson et al. 1998)

QuickTime™ and aTIFF (LZW) decompressor

are needed to see this picture.

QuickTime™ and aTIFF (LZW) decompressor

are needed to see this picture.

Our new lower and upper bounds

Optimal RecMin Bounds

(We ignored insertion/deletion, unphased sites, and sites with missing data.)

(88 Sequences, 88 sites)

Page 40: Meiotic Recombination (single-crossover) PrefixSuffix  Recombination is one of the principal evolutionary forces responsible for shaping genetic variation

History Bound (Myers & Griffiths 2003)

000

100

010

011

111

Iterate the following operations1. Remove a column with a single 0 or 12. Remove a duplicate row3. Remove any row

History bound: the minimum number of type-3 operations needed to reduce the matrix to empty

000

100

010

011

00

10

01

01

Empty.

One type-3 operation

00

10

01

M

Page 41: Meiotic Recombination (single-crossover) PrefixSuffix  Recombination is one of the principal evolutionary forces responsible for shaping genetic variation

Graphical interpretation of history bound (HistB)

• Each operation in the history computation corresponds to an operation that deconstructs the optimal, but unknown ARG.

• We deconstruct the optimal ARG by removing tree parts as long as possible; then remove an exposed recombination node; repeat.

• Removing an exposed recombination node in the ARG corresponds to a single type-3 operation. So when deconstructing the optimal ARG, the number of recombination nodes = number of type-3 operations.

• Since the optimal ARG is unknown, the history bound is the

minimum number of type-3 operations needed to make the matrix empty.

Page 42: Meiotic Recombination (single-crossover) PrefixSuffix  Recombination is one of the principal evolutionary forces responsible for shaping genetic variation

4

1

3

2 5

a: 00010

b: 10010

d: 10100

c: 00100

e: 01100

f: 00101g: 00101

2p s

a: 00010b: 10010c: 00100d: 10100

e: 01100

f: 00101g: 00101

Operations on M correspond to operations on the optimal ARG

M

Page 43: Meiotic Recombination (single-crossover) PrefixSuffix  Recombination is one of the principal evolutionary forces responsible for shaping genetic variation

4

1

3

2 5

a: 00010

b: 10010

d: 10100

c: 00100

e: 01100

f: 00101

2p s

a: 00010b: 10010c: 00100d: 10100

e: 01100

f: 00101

12345

Type-2 operation

Page 44: Meiotic Recombination (single-crossover) PrefixSuffix  Recombination is one of the principal evolutionary forces responsible for shaping genetic variation

4

1

3

a: 001

b: 101

d: 110

c: 010

e: 010

f: 010

2p s

a: 001b: 101c: 010d: 110

e: 010

f: 010

134

Type-1 operations

Page 45: Meiotic Recombination (single-crossover) PrefixSuffix  Recombination is one of the principal evolutionary forces responsible for shaping genetic variation

4

1

3

a: 001

b: 101

d: 110

c: 010

2p s

a: 001b: 101c: 010d: 110

134

Type-2 operations

Page 46: Meiotic Recombination (single-crossover) PrefixSuffix  Recombination is one of the principal evolutionary forces responsible for shaping genetic variation

4

1

3

a: 001

b: 101 c: 010

a: 001b: 101c: 010

134

Type-3 operation

Then three more Type-1 operations fully reduce M and the ARG.

Page 47: Meiotic Recombination (single-crossover) PrefixSuffix  Recombination is one of the principal evolutionary forces responsible for shaping genetic variation

History bound

• Initially required trying all n! permutations of the rows to choose the type-3 operations.

• The bound can be computed by DP in O(2n) time (Bafna, Bansal).

• On datasets where it can be computed, the history bound is observed to be higher than (or equal to) all studied lower bounds (about ten of them).

• There is no static definition for what the history bound is -- it is only defined by the algorithms that compute it! The work in this part of the talk comes out of an attempt to find a simple static definition.

Page 48: Meiotic Recombination (single-crossover) PrefixSuffix  Recombination is one of the principal evolutionary forces responsible for shaping genetic variation

Later we will discuss another lower bound, theConnected-Component Lower Bound.