Greedy method for inferring tandem duplication history Louxin Zhang, Bin Ma, Lusheng Wang and Ying...


Greedy method for inferring tandem duplication history. Louxin Zhang, Bin Ma, Lusheng Wang and Ying Xu. Bioinformatics, 2003

References:

1. Elemento, O. (2002) Reconstructing the duplication history of tandemly repeated genes. Mol. Biol. Evol.

2. Tang, M. and Waterman, M. (2001) Zinc finger gene clusters and tandem gene duplication. RECOMB.

reporter: r92922054 李明翰 b88506020 黃寶萱 b90902020 蔡明潔

Outline

Duplication model

Constructing duplication model from phylogeny

Double duplication model

Arbitrary duplication model

Discussion

Duplication

A duplication replaces a stretch of DNA containing several repeats with two identical and adjacent copies of itself.

If the stretch contains k repeats, the duplication is called a k-duplication.
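As a small illustration of the definition, the sketch below applies a k-duplication to a list of repeats. The function name `k_duplication` and the repeat labels are made up for this example; they are not taken from the paper.

```python
def k_duplication(repeats, start, k):
    """Return a new list in which the stretch repeats[start:start+k]
    is replaced by two identical, adjacent copies of itself."""
    if k < 1 or start < 0 or start + k > len(repeats):
        raise ValueError("stretch out of range")
    stretch = repeats[start:start + k]
    return repeats[:start] + stretch + stretch + repeats[start + k:]


# Hypothetical family of three repeats; duplicate the stretch "B C" (k = 2).
family = ["A", "B", "C"]
print(k_duplication(family, 1, 2))
```

A 1-duplication copies a single repeat; a 2-duplication copies two adjacent repeats as one unit, which is why the two copies interleave in the resulting order.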

DM (duplication model)

A duplication model M for tandemly repeated sequence is a directed graph.

A duplication model contains nodes, edges and blocks.

Phylogeny & DM

Node & Edge

A node in DM represents a repeat.

A directed edge (u,v) indicates that v is a child of u, so u is an ancestor of v.

Node types: root, leaf, and internal node.

Block

A block in DM represents a duplication.

Each internal node appears in a unique block.

No node is an ancestor of another in a block.

We draw a block representing a k-duplication only when k ≥ 2.

Block (Cont.)

lc(v) means the left child of v. rc(v) means the right child of v.

If the block corresponds to a k-duplication, then it contains k nodes v1, v2, …, vk from left to right. Their children are then placed from left to right in the order

lc(v1), lc(v2), …, lc(vk), rc(v1), rc(v2), …, rc(vk)

Cont.

Hence, for any i and j with 1 ≤ i < j ≤ k, the edges (vi, rc(vi)) and (vj, lc(vj)) cross each other.
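The crossing claim can be checked mechanically. The sketch below lists the children of a k-duplication block in their left-to-right order and reports which pairs of edges cross. Function names are illustrative only.

```python
def block_children_order(k):
    """Left-to-right order of the 2k children of a k-duplication block:
    lc(v1), ..., lc(vk), rc(v1), ..., rc(vk)."""
    lefts = [("lc", i) for i in range(1, k + 1)]
    rights = [("rc", i) for i in range(1, k + 1)]
    return lefts + rights


def crossing_pairs(k):
    """Pairs (i, j) with i < j whose edges (vi, rc(vi)) and (vj, lc(vj)) cross.

    vi lies to the left of vj, so the two edges cross exactly when
    rc(vi) is placed to the right of lc(vj)."""
    pos = {child: p for p, child in enumerate(block_children_order(k))}
    pairs = []
    for i in range(1, k + 1):
        for j in range(i + 1, k + 1):
            if pos[("rc", i)] > pos[("lc", j)]:
                pairs.append((i, j))
    return pairs


print(crossing_pairs(3))
```

For every k the function returns all pairs i < j, because every right child is placed after every left child of the same block.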

The left-to-right order of leaves in the model is identical to the order of the sequences on a chromosome.

Example

lc(v1),lc(v3),lc(v4),rc(v1),rc(v3),rc(v4).

An ordered phylogenetic tree for the sequences {1,2,…,n} is a rooted phylogeny whose leaves are listed from left to right in increasing order.

LEMMA 1:

l*c(u) and r*c(u) denote the leftmost and the rightmost leaf in the subtree TM(u) rooted at u, respectively.

For each internal node u in TM,

r*c(u) > r*c(lc(u)) and l*c(u) < l*c(rc(u)).

r*c(u) and l*c(u) are the biggest and smallest labels in the subtree TM(u).
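The two inequalities of Lemma 1 give a quick necessary test for whether a tree can be associated with a duplication model. The sketch below checks only these inequalities; passing the test does not prove that a model exists. Trees are encoded as nested tuples `(left, right)` with integer leaves, an encoding chosen for this example. The sample tree is an assumption: it is one tree consistent with the node labels and blocks listed in these slides, not a copy of the original figure.

```python
def leftmost(t):
    """Label of the leftmost leaf, reached by following left children."""
    while isinstance(t, tuple):
        t = t[0]
    return t


def rightmost(t):
    """Label of the rightmost leaf, reached by following right children."""
    while isinstance(t, tuple):
        t = t[1]
    return t


def satisfies_lemma1(t):
    """Check r*c(u) > r*c(lc(u)) and l*c(u) < l*c(rc(u)) at every internal node.

    This simple version re-walks paths at every node, so it is not linear time."""
    if not isinstance(t, tuple):
        return True
    left, right = t
    if not (rightmost(t) > rightmost(left) and leftmost(t) < leftmost(right)):
        return False
    return satisfies_lemma1(left) and satisfies_lemma1(right)


# Assumed example tree on leaves 1..10.
T = ((1, 6), (((2, 4), (7, 9)), ((3, 5), (8, 10))))
print(satisfies_lemma1(T))
```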

Constructing a duplication model from a phylogeny

Features:

A duplication model M has a unique associated phylogeny TM.

A phylogeny is not necessarily associated with a duplication model.

Problem: Reconstruct the duplication model M in linear time

Input: a phylogeny T

Output: the duplication model M associated with T, if one exists

Problem (Cont.)

Note: To represent a duplication model, we only need to list all non-single duplication blocks on the associated phylogeny.

[v1, v3, v4] [v5, v6] [v7, v8]
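A model can therefore be stored as a phylogeny plus a list of blocks. The sketch below uses that representation and checks two of the block properties stated earlier: no internal node is listed in two blocks, and no node in a block is an ancestor of another. It does not check the left-to-right placement of children. The `children` dictionary is an assumed tree consistent with the node labels in these slides, and `check_blocks` is a name invented for this example.

```python
# Assumed phylogeny: internal node -> (left child, right child); leaves are ints.
children = {
    "r": ("v1", "v2"), "v1": (1, 6), "v2": ("v3", "v4"),
    "v3": ("v5", "v7"), "v4": ("v6", "v8"),
    "v5": (2, 4), "v6": (3, 5), "v7": (7, 9), "v8": (8, 10),
}


def check_blocks(children, blocks):
    """Return True if every listed node is internal, appears in only one
    block, and has no ancestor inside its own block."""
    parent = {child: p for p, kids in children.items() for child in kids}

    def ancestors(v):
        found = set()
        while v in parent:
            v = parent[v]
            found.add(v)
        return found

    seen = set()
    for block in blocks:
        members = set(block)
        for v in block:
            if v not in children or v in seen:
                return False
            seen.add(v)
            if ancestors(v) & members:
                return False
    return True


blocks = [["v1", "v3", "v4"], ["v5", "v6"], ["v7", "v8"]]
print(check_blocks(children, blocks))
```

Internal nodes that appear in no listed block (here r and v2) are single duplications, which is why they need not be listed.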

Double duplication models

Given a phylogeny T on sequence family F = {1,2,…,n}, associate a pair (Lv, Rv) of indices with each node v in T as follows:

1. The i-th leaf node: (Lv, Rv) = (i, i)

2. An internal node: (Lv, Rv) = (l*c(v), r*c(v))

[Figure: example phylogeny on leaves 1 to 10 with the pair (Lv, Rv) at each node. Each leaf i is labelled i(i,i). Internal nodes: r(1,10), v1(1,6), v2(2,10), v3(2,9), v4(3,10), v5(2,4), v6(3,5), v7(7,9), v8(8,10).]

Computing (Lv, Rv) bottom-up

Lv = min{L_{lc(v)}, L_{rc(v)}}

Rv = max{R_{lc(v)}, R_{rc(v)}}

The pairs are computed recursively, bottom-up. Since T contains 2n−1 nodes, this takes linear time.
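The bottom-up rule can be sketched as a post-order traversal. The tree below is an assumption: it is one tree whose computed pairs agree with the node labels shown in the example figure, since the figure itself gives only the labels. `compute_pairs` is an illustrative name.

```python
# Assumed phylogeny: internal node -> (left child, right child); leaves are ints.
children = {
    "r": ("v1", "v2"), "v1": (1, 6), "v2": ("v3", "v4"),
    "v3": ("v5", "v7"), "v4": ("v6", "v8"),
    "v5": (2, 4), "v6": (3, 5), "v7": (7, 9), "v8": (8, 10),
}


def compute_pairs(children, root):
    """Return {node: (L, R)}, visiting each of the 2n-1 nodes exactly once."""
    pairs = {}

    def visit(v):
        if v not in children:          # leaf i gets the pair (i, i)
            pairs[v] = (v, v)
            return
        left, right = children[v]
        visit(left)
        visit(right)
        pairs[v] = (min(pairs[left][0], pairs[right][0]),
                    max(pairs[left][1], pairs[right][1]))

    visit(root)
    return pairs


pairs = compute_pairs(children, "r")
print(pairs["v3"], pairs["v4"], pairs["r"])
```

The recursion is kept for readability; for very deep trees an explicit stack would avoid Python's recursion limit.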

Constructing DDM from phylogeny

Double duplication model (DDM): a duplication model in which every duplication is a 1-duplication or a 2-duplication.

By Lemma 1, the leftmost and rightmost leaves in T are 1 and n, respectively.

Where is leaf 2 located? Leaf 2 must be right next to leaf 1 in the DDM.

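Let the leftmost path from the root r to leaf 1 be

v0 = r, v1, v2, …, v_{p−1}, vp = 1 (1)

Suppose leaf 2 lies in the right subtree of vi, and let the leftmost path from rc(vi) to leaf 2 be

u1 = rc(vi), u2, …, u_{q−1}, uq = 2 (2)

where q ≥ p − i.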

LEMMA 2. M must contain p − i − 1 double duplications

[v_{i+1}, u_{j_1}], [v_{i+2}, u_{j_2}], …, [v_{p−1}, u_{j_{p−i−1}}],

for some 1 ≤ j_1 < j_2 < … < j_{p−i−1} ≤ q − 1.

Example (figure): i = 2, p = 5, q = 6.

LEMMA 2. (Cont.)

Since 1 ≤ j_1 < j_2 < … < j_{p−i−1} ≤ q − 1, we get p − i − 1 ≤ q − 1, that is, q ≥ p − i.

PROOF. If vi+k does not belong to a double duplication block in M, the leaf labeled with 2 cannot be placed before the leftmost leaf in the subtree rooted at rc(vi+k), contradicting the fact that 2 is right next to 1 in M. Hence, vi+k must appear in a double duplication block for each k, 1 ≤ k ≤ p − i − 1. This finishes the proof.

Note: R_{u_1} > R_{u_2} > … > R_{u_{q−1}} > R_{u_q} and

R_{v_{i+1}} > R_{v_{i+2}} > … > R_{v_{p−1}}

For each block [v_{i+k}, u_{j_k}], R_{v_{i+k}} lies between R_{u_{j_k}} and R_{u_{j_k+1}}.

We can determine all the u_{j_k}’s with at most p − i + q ≤ 2q comparisons.
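Because both R sequences are decreasing, the indices j_k can be found with a single merge-like scan. The following is a minimal sketch under that reading, not the authors' implementation. `match_blocks` is a hypothetical name, and the second input pair is made-up data.

```python
def match_blocks(R_u, R_v):
    """Find 1-based indices j_1 < j_2 < ... such that, for each k,
    R_u[j_k] > R_v[k] > R_u[j_k + 1] (1-based indexing in this docstring).

    R_u: R values of u_1..u_q, strictly decreasing.
    R_v: R values of v_{i+1}..v_{p-1}, strictly decreasing.
    Returns the list of j_k, or None if some v cannot be matched."""
    q = len(R_u)
    js = []
    j = 0                                  # 0-based pointer into R_u
    for rv in R_v:
        if j >= q:
            return None
        # Move down the u-path while the next u still has a larger R value.
        while j + 1 < q and R_u[j + 1] > rv:
            j += 1
        if not (j + 1 < q and R_u[j] > rv > R_u[j + 1]):
            return None
        js.append(j + 1)                   # report a 1-based index
        j += 1                             # the next match must be strictly lower
    return js


# R values read off the assumed 10-leaf example: u-path v2, v3, v5, leaf 2; v-path v1.
print(match_blocks([10, 9, 4, 2], [6]))
# Made-up values with two blocks to match.
print(match_blocks([10, 9, 7, 5, 3, 2], [8, 4]))
```

The pointer `j` only moves forward, so the total work is bounded by the combined length of the two sequences.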

After all the duplication blocks [vi+k , u jk ] are placed on T , the leaf 2 should be right next to the leaf 1

Derive a rooted binary tree T’’ from the subtree T(u1) by

inserting a new node v’k in the edge (u_{j_k}, u_{j_k+1}) for each 1 ≤ k ≤ p − i − 1, and

assigning the subtree T(rc(v_{i+k})) rooted at rc(v_{i+k}) as the right subtree of v’k.

Note: the left child of v’k is now u_{j_k+1} in T’’.

Then, form the new phylogeny T’ from T by replacing subtree T(vi) with T’’
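The construction of T’’ amounts to splicing one new node into each matched edge of the u-path. The sketch below does this on an assumed 10-leaf tree in which vi is the root, so T’ equals T’’. The `Node` class, the function names, and the primed labels for inserted nodes are choices made for this example.

```python
class Node:
    def __init__(self, label, left=None, right=None):
        self.label, self.left, self.right = label, left, right


def tree_leaves(t):
    """Leaf labels in the order given by the tree's child pointers."""
    if t.left is None:
        return [t.label]
    return tree_leaves(t.left) + tree_leaves(t.right)


def build_t_double_prime(u_path, v_nodes, js):
    """u_path: [u_1, ..., u_q]; v_nodes: [v_{i+1}, ..., v_{p-1}];
    js: 1-based indices j_k. For each k, insert a new node in the edge
    (u_{j_k}, u_{j_k+1}) whose right subtree is T(rc(v_{i+k}))."""
    for v, j in zip(v_nodes, js):
        upper, lower = u_path[j - 1], u_path[j]
        upper.left = Node(f"{v.label}'", left=lower, right=v.right)
    return u_path[0]


# Assumed tree: r -> (v1, v2), v1 -> (1, 6), v2 -> (v3, v4), and so on.
v5 = Node("v5", Node(2), Node(4)); v7 = Node("v7", Node(7), Node(9))
v6 = Node("v6", Node(3), Node(5)); v8 = Node("v8", Node(8), Node(10))
v3 = Node("v3", v5, v7); v4 = Node("v4", v6, v8)
v2 = Node("v2", v3, v4); v1 = Node("v1", Node(1), Node(6))
r = Node("r", v1, v2)

# Here i = 0, p = 2, q = 4, and the single block is [v1, u_2] = [v1, v3].
t_new = build_t_double_prime([v2, v3, v5, v5.left], [v1], [2])
print(tree_leaves(t_new))
```

Leaf 1 and the old leftmost path are gone from the result, and leaf 2 is now the leftmost leaf, so the same step can be repeated on the smaller tree.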

Linear time (Analysis)

Since we can charge the comparisons made in different recursive steps to disjoint left paths in the input tree T, the whole algorithm takes at most 2×2n comparisons to determine all the duplication blocks. Hence it is a linear-time algorithm.

Each internal node is compared once while it lies on the path to leaf 2 (the u-path, of length q) and once while it lies on the leftmost path (the v-path, of length p), and each comparison uses its pair (Lv, Rv). Therefore at most 2×2n comparisons are made.

Arbitrary duplication models

Now, we generalize the above algorithm into arbitrary duplication models.

Again, we assume the leftmost paths leading to leaf 1 and leaf 2 in T are given in (1) and (2) respectively.

Observation:

Assume a phylogeny T is associated with a duplication model M. Then, there exist p − i − 1 double duplication blocks [vi+k , ujk ] (1≤k≤ p − i − 1) such that, after these duplications are placed in T , the leaf 2 is right next to the leaf 1. But, these double duplication blocks may not be in M.

Recall that there are two types of nodes on the leftmost path of T’. Some nodes are original ones in the input tree T ; some are inserted due to duplication blocks we have examined so far.

To extend the existing duplication blocks to larger ones, we associate a flag to each original node on the leftmost path of T’ , which indicates whether the node is in an existing duplication block or not.

Let x be an original node on the leftmost path P of T’ appearing in a duplication block [x1, x2, · · · , xt , x] of size t + 1 so far, then, there are t inserted nodes x’i right below x on the path P, which correspond to xi for i ≤ t.

To determine whether [x1, x2, · · · , xt , x] can be extended to a larger duplication block in the model with which the original tree T is associated, we need to consider x and all the x’i s (1≤i≤ t) simultaneously.

For this purpose, we introduce the concept of hyper-double (duplication) blocks.

We say that x and y form a hyper-double block [x, y] in T’ if the following three conditions hold:

(i) x is a node in some non-single duplication block that we have obtained so far;

(ii) neither x nor y is an ancestor of the other;

(iii) the block [x1, x2, · · · , xt , x] can be extended to a block [x1, x2, · · · , xt , x, y] of size t + 2 in the original tree T .

Hence, when we place a hyper-double block [x, y] in the current tree T’ , the edge (y, lc(y)) crosses not only the edge (x, rc(x)), but also the edges (x’i , rc(x’i )), 1≤ i ≤ t.

So, we have that a phylogeny T is associated with a model if and only if:

(i) there exist p − i − 1 double duplication blocks [vi+k , ujk ] (1≤k≤p − i − 1) in T such that, after these duplication blocks are placed in T, leaf 2 is right next to leaf 1, and

(ii) the tree T’ constructed above is associated with a duplication model once hyper-double duplication blocks are allowed.

To make the algorithm run in linear time, we refine the algorithm in two aspects.

First, we assign a pair (R’x , R”x ) of indices to a node x on the leftmost path of T in each recursive step: if x is in a duplication block [x1, x2, · · · , xt , x] in the current stage, we set R’x = Rx1 and R”x = Rx , which are defined in Section 2.2.1. Since R’x < Rxi < R”x for 2≤i≤t, only R’x and R”x will be examined for determining if x is in a hyper-double block in next step.

Secondly, if the duplication block [x1, x2, · · · , xt , x] is extended into a larger hyper-double block [x1, x2, · · · , xt , x,y] in a step, the binary tree T’ for the next step is constructed by inserting the right subtrees of the xi ’s and x into the edge between y and its left child lc(y).

To do these insertions, we need to point the left child of x1 to lc(y), and then point the left child of y to x.

In this way, we are able to insert all the subtrees in only two pointer operations.
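The two pointer operations can be sketched as a splice of a left-linked chain into one edge. This is an illustration under the assumption that x sits at the top of the chain and the node corresponding to x1 at the bottom. The `Node` class and all labels below are invented for the example.

```python
class Node:
    def __init__(self, label, left=None, right=None):
        self.label, self.left, self.right = label, left, right


def tree_leaves(t):
    """Leaf labels in the order given by the tree's child pointers."""
    if t.left is None:
        return [t.label]
    return tree_leaves(t.left) + tree_leaves(t.right)


def splice_chain(x, x1, y):
    """Insert the chain x -> ... -> x1 (linked through left pointers)
    into the edge between y and its left child, in two assignments."""
    x1.left = y.left    # bottom of the chain now sits above the old lc(y)
    y.left = x          # y's left child becomes the top of the chain


# Hypothetical chain: x above x2' above x1', each carrying a right subtree.
x1 = Node("x1'", Node("old"), Node("R1"))
x2 = Node("x2'", x1, Node("R2"))
x = Node("x", x2, Node("Rx"))
y = Node("y", Node("a"), Node("b"))

splice_chain(x, x1, y)
print(tree_leaves(y))
```

The order of the two assignments matters: `y.left` must be read before it is overwritten. The cost is constant no matter how many nodes the chain contains.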

DS: [v1,v2][v3,v5][v8,v6]

DS: [v1,v2][v3,v5,v4][v8,v6]

DS: [v1,v2][v3,v5,v4,v7][v8,v6]