Plgw03, 17/12/07
On the Hardness of Inferring Phylogenies
from Triplet-Dissimilarities
Ilan Gronau, Shlomo Moran
Technion – Israel Institute of Technology, Haifa, Israel
Pairwise-Distance Based Reconstruction
Input sequences:
  Butterfly (B)  …AAGT…
  Eagle (E)      …CAGA…
  Gorilla (G)    …CCGT…
  Human (H)      …AACG…
  Lion (L)       …AATA…
  Mouse (M)      …CGCG…
Step 1 (calculate): the pairwise dissimilarity matrix D
      B   E   G   H   L   M
  B   0  13  17  15  10  12
  E       0  14  13  17  11
  G           0   2  10   9
  H               0  15   8
  L                   0   6
  M                       0
Step 2 (reconstruct): an edge-weighted tree T
[Figure: the tree T with leaves B, E, G, H, M, L and its edge weights]
The tree metric DT induced by T:
      B   E   G   H   L   M
  B   0  14  15  16  14  13
  E       0  15  16  14  13
  G           0   3  11  10
  H               0  12  11
  L                   0   7
  M                       0
Optimization Criteria
We wish the tree metric DT to approximate simultaneously the $\binom{n}{2}$ pairwise distances in D.
Two “closeness” measures studied here:
• Maximal Difference (l∞): $\mathrm{MaxDiff}(D_1, D_2) = \max_{i,j} |D_1(i,j) - D_2(i,j)|$
• Maximal Distortion: $\mathrm{MaxDist}(D_1, D_2) = \max_{i,j} \frac{D_1(i,j)}{D_2(i,j)} \cdot \max_{i,j} \frac{D_2(i,j)}{D_1(i,j)}$
The input matrix D should be “close” to the tree metric DT.
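The two measures can be sketched in a few lines of Python. This is only an illustration: the function names and the toy dissimilarities are assumptions made for the example, not part of the talk. Exact fractions are used so that the distortion is not blurred by rounding.

```python
from fractions import Fraction

def max_diff(d1, d2):
    """Largest absolute difference over all pairs (the l-infinity distance)."""
    return max(abs(d1[pair] - d2[pair]) for pair in d1)

def max_dist(d1, d2):
    """Product of the largest ratio d1/d2 and the largest ratio d2/d1."""
    stretch = max(Fraction(d1[pair], d2[pair]) for pair in d1)
    shrink = max(Fraction(d2[pair], d1[pair]) for pair in d1)
    return stretch * shrink

# Hypothetical toy dissimilarities on three taxa a, b, c (not from the slides).
d = {("a", "b"): 2, ("a", "c"): 4, ("b", "c"): 4}
d_t = {("a", "b"): 3, ("a", "c"): 4, ("b", "c"): 2}

print(max_diff(d, d_t))  # 2  (from the pair b, c)
print(max_dist(d, d_t))  # 3  (largest d/d_t is 2, largest d_t/d is 3/2)
```

For two identical matrices the difference is 0 and the distortion is 1, the best possible values of the two measures.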
Maximal Difference (l∞) vs. Maximal Distortion
For the example matrices D and DT:
$\mathrm{MaxDiff}(D, D_T) = |10 - 14| = 4$
$\mathrm{MaxDist}(D, D_T) = \frac{3}{2} \cdot \frac{17}{14} = 1.821\ldots$
Goal: find an optimal tree T, which minimizes the maximal difference/distortion between D and DT.
Previous Work on Approximating Dissimilarities by Tree Distances
Negative results (NP-hardness):
• Closest tree metric (even ultrametric) to a dissimilarity matrix under l1, l2 [Day ’87]
• Closest tree metric to a dissimilarity matrix under l∞ [ABFPT99]
  Hard to approximate better than 1.125.
  Implicit: hard to approximate the closest MaxDist tree within any constant factor.
Positive results:
• Closest ultrametric to a dissimilarity matrix under l∞ [Krivanek ’88]
• 3-approximation of the closest additive metric to a given metric [ABFPT99] (implicitly, a 6-approximation for general dissimilarity matrices)
This Work: Triplet-Distances – Distances to Triplet Midpoints
[Figure: a 3-tree on the taxa i, j, k with center C(i,j,k); τT (i ; jk) is the distance from i to C(i,j,k)]
• τT (i ; jk) = τT (i ; kj)
• τT (i ; ij) = 0
• τT (i ; jj) = DT (i, j)
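Triplet-Distances Defined by 2-Distances
• Each distance matrix D defines $\binom{n}{3}$ 3-trees, with
  τ(i ; jk)= ½[D(i,j)+D(i,k)-D(j,k)].
• Any metric on 3 taxa is realizable by a 3-tree.
[Figure: a metric on the taxa i, j, k with pairwise distances 9, 7, 8 is realized by a 3-tree with center C(i,j,k) and edge weights 3, 4, 5]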
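As a quick check of the formula, the sketch below recomputes the 3-tree of the figure. The assignment of the distances 9, 7, 8 to particular taxon pairs is an assumption (the figure labels are not recoverable), and `triplet_distance` is a name chosen for the example.

```python
def triplet_distance(d, i, j, k):
    """tau(i ; jk) = 1/2 [D(i,j) + D(i,k) - D(j,k)]."""
    return (d[i][j] + d[i][k] - d[j][k]) / 2

# Assumed labelling: D(i,j) = 7, D(i,k) = 8, D(j,k) = 9.
d = {
    "i": {"i": 0, "j": 7, "k": 8},
    "j": {"i": 7, "j": 0, "k": 9},
    "k": {"i": 8, "j": 9, "k": 0},
}

# Edge weights of the 3-tree: distance of each taxon to the center C(i,j,k).
edges = {
    "i": triplet_distance(d, "i", "j", "k"),  # 3.0
    "j": triplet_distance(d, "j", "i", "k"),  # 4.0
    "k": triplet_distance(d, "k", "i", "j"),  # 5.0
}

# The 3-tree realizes the metric: each pairwise distance is a sum of two edges.
assert edges["i"] + edges["j"] == d["i"]["j"]
assert edges["i"] + edges["k"] == d["i"]["k"]
assert edges["j"] + edges["k"] == d["j"]["k"]
```

The same function also reproduces the identities listed for τT: it is symmetric in its last two arguments, gives 0 for τ(i ; ij), and gives D(i,j) for τ(i ; jj).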
Triplet-Distance Based Reconstruction
[Figure: from the six input sequences (…AAGT…, …CAGA…, …CCGT…, …AACG…, …AATA…, …CGCG…) a table τ of triplet distances is calculated, with one row per taxon (B, E, G, H, L, M) and one column per taxon pair (BB, BE, BG, …, LL, LM, MM). A tree T is reconstructed from τ, and T induces its own triplet-distance table τT.]
When the triplet distances are derived from the pairwise matrix D:
τ(i ; jk)= ½[D(i,j)+D(i,k)-D(j,k)].
Why use Triplet-Distances?
1. They enable more accurate estimations of 2-distances.
2. They are used (de facto) by known reconstruction algorithms.
Improved Estimations of Pairwise Distances
[Figure: the six input sequences and the matrix D. The entry D(H,E) = 13 is calculated (by maximum likelihood) from the two sequences Human …AACG… and Eagle …CAGA… alone.]
“Information loss”: in calculating D(H,E), all other taxa are ignored.
Improved Estimations (cont.)
Estimate D(H,E) by calculating all the 3-trees on {H, E, X}, for every X ≠ H, E.
(Or: calculate just one 3-tree, for a “trusted” third taxon X:
• V. Ranwez, O. Gascuel, Improvement of distance-based phylogenetic methods by a local maximum likelihood approach using triplets, Mol. Biol. Evol. 19(11), 1952–1963 (2002).)
[Figure: four 3-trees on H = (..AACG..), E = (..CAGA..) and a third taxon X, each with an unknown center sequence (..****..) and two edge labels:
  X = B = (..AAGT..): 3, 2
  X = M = (..CGCG..): 3, 3
  X = G = (..CCGT..): 1, 5
  X = L = (..AATA..): 2, 4]
(Implicit) Use of Triplet-Distances in 2-Distance Reconstruction Algorithms
[Figure: the matrix D is converted into the triplet-distance table τ (rows B, E, G, H, L, M; columns BB, BE, BG, …, LL, LM, MM), from which the tree T is reconstructed.]
τ(i ; jk)= ½[D(i,j)+D(i,k)-D(j,k)].
1st use: “Triplet Distances from a Single Source”
Fix a taxon r, and construct a tree T which minimizes:
$\max \{\, |\tau_T(r ; ij) - \tau(r ; ij)| : i, j \neq r \,\}$
An optimal solution can be computed in O(n²) time, and is used e.g. in:
• (FKW95): Optimal approximation of distances by ultrametric trees.
• (ABFPT99): The best known approximation of distances by general trees.
• (BB99): Fast construction of Buneman trees.
[Figure: the taxa i, j and the source taxon r]
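The single-source objective can be evaluated directly from two distance matrices. The sketch below only computes the objective for a given candidate; it does not implement the O(n²) optimization itself, and the function names and toy matrices are assumptions made for illustration.

```python
def tau(d, r, i, j):
    """Triplet distance from the source r: 1/2 [D(r,i) + D(r,j) - D(i,j)]."""
    return (d[r][i] + d[r][j] - d[i][j]) / 2

def single_source_error(d, d_t, r):
    """max |tau_T(r ; ij) - tau(r ; ij)| over all i, j != r (i = j allowed,
    which covers the 2-distances D(r, i))."""
    others = [x for x in range(len(d)) if x != r]
    return max(
        abs(tau(d_t, r, i, j) - tau(d, r, i, j))
        for i in others
        for j in others
    )

# Hypothetical input dissimilarities and a candidate tree metric on 3 taxa.
d = [[0, 7, 8], [7, 0, 9], [8, 9, 0]]
d_t = [[0, 8, 8], [8, 0, 9], [8, 9, 0]]

print(single_source_error(d, d_t, r=0))  # 1.0
```

Here the error of 1.0 comes from the 2-distance between taxa 0 and 1 (7 in the input, 8 in the candidate).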
2nd use: Saitou & Nei Neighbor Joining
The neighbor-selection criterion of NJ selects a taxon pair i, j which maximizes the sum:
$Q(i,j) = D(i,j) + \sum_{r \neq i,j} \tau(r ; ij)$
[Figure: the pair i, j at distance D(i,j), and the remaining taxa r contributing the terms τ(r ; ij)]
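The triplet form of the criterion can be compared numerically with the usual textbook form of the NJ criterion, (n-2)·D(i,j) - Σ_k D(i,k) - Σ_k D(j,k), which is minimized. That textbook form is brought in here for comparison and is not taken from the slides. The sketch uses the example matrix D of the talk; the function names are assumptions.

```python
from itertools import combinations

# Upper triangle of the example matrix D (taxa B, E, G, H, L, M).
ROWS = [
    [0, 13, 17, 15, 10, 12],
    [0, 14, 13, 17, 11],
    [0, 2, 10, 9],
    [0, 15, 8],
    [0, 6],
    [0],
]
n = len(ROWS)
d = [[0] * n for _ in range(n)]
for i, row in enumerate(ROWS):
    for offset, value in enumerate(row):
        d[i][i + offset] = d[i + offset][i] = value

def q_tau(d, i, j):
    """D(i,j) plus the sum of tau(r ; ij) over all r other than i, j."""
    taus = ((d[r][i] + d[r][j] - d[i][j]) / 2
            for r in range(len(d)) if r not in (i, j))
    return d[i][j] + sum(taus)

def q_nj(d, i, j):
    """Textbook NJ criterion (to be minimized)."""
    return (len(d) - 2) * d[i][j] - sum(d[i]) - sum(d[j])

pairs = list(combinations(range(n), 2))
# The two criteria differ by the constant factor -2 on every pair ...
assert all(q_nj(d, i, j) == -2 * q_tau(d, i, j) for i, j in pairs)
# ... so maximizing one selects the same pair as minimizing the other.
best_tau = max(pairs, key=lambda p: q_tau(d, *p))
best_nj = min(pairs, key=lambda p: q_nj(d, *p))
assert best_tau == best_nj
```

The factor -2 follows from expanding the sum of τ(r ; ij) with the formula τ(r ; ij) = ½[D(r,i)+D(r,j)-D(i,j)].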
Previous Works on Triplet-Dissimilarities/Distances
• I. Gronau, S. Moran, Neighbor Joining Algorithms for Inferring Phylogenies via LCA-Distances, Journal of Computational Biology 14(1), pp. 1-15 (2007).
Works which use the total weights of 3-trees:
• S. Joly, G. Le Calvé, Three-Way Distances, Journal of Classification 12, pp. 191-205 (1995).
• L. Pachter, D. Speyer, Reconstructing Trees from Subtree Weights, Applied Mathematics Letters 17, pp. 615-621 (2004).
• D. Levy, R. Yoshida, L. Pachter, Beyond pairwise distances: Neighbor-joining with phylogenetic diversity estimates, Mol. Biol. Evol. 23(3), 491–498 (2006).
Summary of Results
Results for Maximal Difference (l∞):
1. The decision problem is NP-hard:
   is there a tree T s.t. ||τ,τT ||∞ ≤ Δ ?
2. Hardness of approximation of the optimization problem:
   finding a tree T s.t. ||τ,τT ||∞ ≤ 1.4||τ,τOPT||∞ is NP-hard.
3. A 15-approximation algorithm, using the 6-approximation algorithm for 2-dissimilarities from [ABFPT99].
Result for Maximal Distortion:
• Hardness of approximation within any constant factor.
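While finding a suitable tree is hard, checking a given tree against the bound Δ is straightforward. The sketch below assumes a tree given as a list of weighted edges and a table τ keyed by triples (i, j, k) standing for τ(i ; jk); this representation and all names are assumptions made for the example.

```python
def tree_distances(edges, leaves):
    """Leaf-to-leaf path lengths in an edge-weighted tree of (u, v, w) edges."""
    adj = {}
    for u, v, w in edges:
        adj.setdefault(u, []).append((v, w))
        adj.setdefault(v, []).append((u, w))
    dist = {}
    for src in leaves:
        seen = {src: 0}
        stack = [src]
        while stack:  # depth-first traversal; paths in a tree are unique
            u = stack.pop()
            for v, w in adj[u]:
                if v not in seen:
                    seen[v] = seen[u] + w
                    stack.append(v)
        for dst in leaves:
            dist[src, dst] = seen[dst]
    return dist

def within_bound(tau, edges, leaves, delta):
    """Check ||tau, tau_T||_inf <= delta for the tree T given by `edges`."""
    d_t = tree_distances(edges, leaves)
    return all(
        abs(value - (d_t[i, j] + d_t[i, k] - d_t[j, k]) / 2) <= delta
        for (i, j, k), value in tau.items()
    )

# Hypothetical 3-tree with center c and edge weights 3, 4, 5.
edges = [("i", "c", 3), ("j", "c", 4), ("k", "c", 5)]
leaves = ["i", "j", "k"]
tau = {("i", "j", "k"): 3.5, ("j", "i", "k"): 4, ("k", "i", "j"): 5}

print(within_bound(tau, edges, leaves, delta=0.5))   # True
print(within_bound(tau, edges, leaves, delta=0.25))  # False
```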
NP-Hardness of the Decision Problem
We use a reduction from 3SAT
(the problem of determining whether a 3CNF formula is satisfiable).
[Example: a 3CNF formula with four clauses, each a disjunction of three literals, over the variable triples (x1, x2, x3), (x1, x2, x4), (x1, x2, x4), (x1, x3, x4). Satisfying assignment: (x1, x2, x3, x4) = (T, F, F, F).]
We show:
If one can determine for (τ,Δ) whether there exists a tree T s.t. ||τ,τT ||∞ ≤ Δ, then one can determine for every 3CNF formula φ whether it is satisfiable.
The Reduction
Given a 3CNF formula φ, we define triplet distances τ and an error bound Δ which force the output tree to imply a satisfying assignment to φ.
The set of taxa:
• Taxa T , F.
• A taxon for every literal ($x_i$, $\bar{x}_i$).
• 3 taxa for every clause Cj ($y_j^1$, $y_j^2$, $y_j^3$).
Properties Enforced by the Input (τ,Δ)
One of the following can be enforced on each taxa triplet (u,v,w):
1. taxon u is close to Path(v,w), or
2. taxon u is far from Path(v,w).
[Figure: the taxon u and the path between v and w]
Enforcing Truth Assignment
A truth assignment to φ is implied by the following:
1. T is far from F.
2. For each i, $x_i$ is far from $\bar{x}_i$, and both $x_i$ and $\bar{x}_i$ are close to Path(T ,F).
Thus we set xi =T iff xi is close to T.
[Figure: the path between T and F with the literal taxa $x_i$, $\bar{x}_i$ attached to it]
Enforcing Clauses-Satisfaction
A clause C = ($l^1 \vee l^2 \vee l^3$) is satisfied iff at least one literal $l^i$ is true, i.e. is close to T.
[Figure: the three literals $l^1$, $l^2$, $l^3$ all located close to F]
($l^1 \vee l^2 \vee l^3$) is satisfied iff it is not like this.
We need to guarantee, by the close/far relations, that all clauses avoid the above configuration.
Clauses-Satisfaction (cont.)
($l^1 \vee l^2 \vee l^3$) is satisfied iff out of the three paths Path($l^1$, $l^2$), Path($l^1$, $l^3$), Path($l^2$, $l^3$), at least two paths are close to T .
But we don’t know which two paths.
[Figure: the path between T and F with the literals $l^1$, $l^2$, $l^3$ attached]
Clauses-Satisfaction (cont.)
We attach a taxon to each such path:
• y1 is close to Path($l^2$, $l^3$)
• y2 is close to Path($l^1$, $l^3$)
• y3 is close to Path($l^1$, $l^2$)
($l^1 \vee l^2 \vee l^3$) is satisfied iff at least two yi’s can be located close to T.
[Figure: the path between T and F with the literals $l^1$, $l^2$, $l^3$ and the taxa y1, y2, y3 attached]
Clauses-Satisfaction (end)
… and at least two of the yi’s can be located close to T iff Path($y^2$, $y^3$), Path($y^1$, $y^3$), Path($y^1$, $y^2$) are all close to T.
So, ($l^1 \vee l^2 \vee l^3$) is satisfied iff all the above paths are close to T.
[Figure: the path between T and F with the literals $l^1$, $l^2$, $l^3$ and the taxa y1, y2, y3 attached]
Construction Example
[Figure: the tree constructed for the example formula and the assignment (x1, x2, x3, x4) = (T, F, F, F): T and F, the internal vertices vT and vF, edges of weight α and 2β, the literal taxa, and the clause taxa $y_1^1$, $y_1^2$, $y_1^3$, $y_2^1$, $y_2^2$, $y_2^3$]
φ is satisfiable ⇔ there is a tree T which satisfies all bounds:
A1: $\tau_T(T, F) \ge 2\alpha + 2\beta$
A2: for i=1..n: $\tau_T(T ; x_i \bar{x}_i) \le \alpha$ ; $\tau_T(F ; x_i \bar{x}_i) \le \alpha$
B1: for j=1..m: $\tau_T(y_j^1 ; l_j^2 l_j^3) \le \alpha$ ; $\tau_T(y_j^2 ; l_j^1 l_j^3) \le \alpha$ ; $\tau_T(y_j^3 ; l_j^1 l_j^2) \le \alpha$
B2: for j=1..m: $\tau_T(y_j^1 ; T F) \ge \alpha$ ; $\tau_T(y_j^2 ; T F) \ge \alpha$ ; $\tau_T(y_j^3 ; T F) \ge \alpha$
B3: for j=1..m: $\tau_T(T ; y_j^2 y_j^3) \le \alpha$ ; $\tau_T(T ; y_j^1 y_j^3) \le \alpha$ ; $\tau_T(T ; y_j^1 y_j^2) \le \alpha$
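The list of bounds can be generated mechanically from a formula. The sketch below is only an illustration of the bookkeeping: the clause encoding (signed integers, negative for a negated variable), the taxon names and the example formula are all assumptions, and the negations in the example are chosen arbitrarily.

```python
def reduction_bounds(num_vars, clauses, alpha, beta):
    """Return the bounds A1..B3 as tuples (label, (a, b, c), relation, value),
    each standing for tau_T(a ; b c) <relation> value.
    The 2-distance D_T(T, F) is written as tau_T(T ; F F)."""
    def lit(l):
        return f"x{l}" if l > 0 else f"~x{-l}"

    bounds = [("A1", ("T", "F", "F"), ">=", 2 * alpha + 2 * beta)]
    for i in range(1, num_vars + 1):
        bounds.append(("A2", ("T", lit(i), lit(-i)), "<=", alpha))
        bounds.append(("A2", ("F", lit(i), lit(-i)), "<=", alpha))
    for j, clause in enumerate(clauses, start=1):
        l = [lit(x) for x in clause]
        y = [f"y{j}^{k}" for k in (1, 2, 3)]
        for k in range(3):
            a, b = (t for t in range(3) if t != k)  # the two other indices
            bounds.append(("B1", (y[k], l[a], l[b]), "<=", alpha))
            bounds.append(("B2", (y[k], "T", "F"), ">=", alpha))
            bounds.append(("B3", ("T", y[a], y[b]), "<=", alpha))
    return bounds

# Hypothetical formula with n = 4 variables and m = 4 clauses.
clauses = [(1, 2, 3), (-1, 2, -4), (1, -2, 4), (-1, -3, -4)]
bounds = reduction_bounds(4, clauses, alpha=10, beta=1)
print(len(bounds))  # 45 = 1 + 2n + 9m
print(bounds[0])    # ('A1', ('T', 'F', 'F'), '>=', 22)
```

The number of bounds grows linearly with the size of the formula, so the construction is polynomial.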
Hardness of Approximation Results
By “stretching” the close/far restrictions, the following problems are also shown NP-hard:
Approximating Maximal Difference:
• Finding a tree T s.t. ||τ,τT ||∞ ≤ 1.4||τ,τOPT||∞
Approximating Maximal Distortion:
• Finding a tree T s.t. MaxDist(τ,τT ) ≤ C·MaxDist(τ,τOPT), for any constant C
Details in:
I. Gronau and S. Moran, On the Hardness of Inferring Phylogenies from Triplet-Dissimilarities, Theoretical Computer Science 389(1-2), December 2007, pp. 44-55.
Open Problems/Further Research
• Extending the hardness results to 3-diss tables induced by 2-diss matrices (τ(i ; jk)= ½[D(i,j)+D(i,k)-D(j,k)]).
• Extending the hardness results to “naturally looking” trees (binary trees with constant-bounded edge weights).
• Checking the performance of NJ when the neighbor-selection formula is computed from “real” 3-distances.
• Devising algorithms which use 3-distances as input.
• Does optimization of 3-diss lead to good topological accuracy (under accepted models of sequence evolution)? (It is known that optimization of 2-diss doesn’t lead to good topological accuracy.)
Thank You
Distance-Based Phylogenetic Reconstruction
• Compute distances between all taxon-pairs
• Find a tree (edge-weighted) best-describing the distances
D =
   0
   3   0
   9   8   0
  15  14  18   0
  17  16  20  22   0
  16  15  19  21   9   0
[Figure: an edge-weighted tree describing the distances in D]
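The Reduction – τ(φ)
Bounds required from the output tree T:
A1: $\tau_T(T, F) \ge 2\alpha + 2\beta$
A2: for i=1..n: $\tau_T(T ; x_i \bar{x}_i) \le \alpha$ ; $\tau_T(F ; x_i \bar{x}_i) \le \alpha$
B1: for j=1..m: $\tau_T(y_j^1 ; l_j^2 l_j^3) \le \alpha$ ; $\tau_T(y_j^2 ; l_j^1 l_j^3) \le \alpha$ ; $\tau_T(y_j^3 ; l_j^1 l_j^2) \le \alpha$
B2: for j=1..m: $\tau_T(y_j^1 ; T F) \ge \alpha$ ; $\tau_T(y_j^2 ; T F) \ge \alpha$ ; $\tau_T(y_j^3 ; T F) \ge \alpha$
B3: for j=1..m: $\tau_T(T ; y_j^2 y_j^3) \le \alpha$ ; $\tau_T(T ; y_j^1 y_j^3) \le \alpha$ ; $\tau_T(T ; y_j^1 y_j^2) \le \alpha$
[Figure: the constructed tree, with T and F, the internal vertices vT and vF, edges of weight α and 2β, the literal taxa, and the clause taxa $y_1^1$, $y_1^2$, $y_1^3$, $y_2^1$, $y_2^2$, $y_2^3$]
The input triplet distances τ(φ) which enforce these bounds:
A1: $\tau(T, F) = 2\alpha + 3\beta$
A2: for i=1..n: $\tau(T ; x_i \bar{x}_i) = \alpha - \beta$ ; $\tau(F ; x_i \bar{x}_i) = \alpha - \beta$
B1: for j=1..m: $\tau(y_j^1 ; l_j^2 l_j^3) = \alpha - \beta$ ; $\tau(y_j^2 ; l_j^1 l_j^3) = \alpha - \beta$ ; $\tau(y_j^3 ; l_j^1 l_j^2) = \alpha - \beta$
B2: for j=1..m: $\tau(y_j^1 ; T F) = \alpha + \beta$ ; $\tau(y_j^2 ; T F) = \alpha + \beta$ ; $\tau(y_j^3 ; T F) = \alpha + \beta$
B3: for j=1..m: $\tau(T ; y_j^2 y_j^3) = \alpha - \beta$ ; $\tau(T ; y_j^1 y_j^3) = \alpha - \beta$ ; $\tau(T ; y_j^1 y_j^2) = \alpha - \beta$
Other 2-distances: $\tau(s, t) = 2\alpha + 2\beta$
Other 3-distances: $\tau(s ; t u) = \alpha + 2\beta$
Δ=β.
In our constructed tree:
• All 2-distances are in [2α , 2α+2β].
• All 3-distances are in [α , α+2β].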
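The link between the input values and the bounds on the output tree is simple interval arithmetic: a tree T with ||τ,τT ||∞ ≤ Δ must have every τT value within Δ of the corresponding input value. The sketch below works this out for hypothetical values α = 10 and β = 1; the function name is an assumption.

```python
def allowed_interval(tau_value, delta):
    """Range of tau_T values compatible with an input value and error bound."""
    return (tau_value - delta, tau_value + delta)

alpha, beta = 10, 1  # hypothetical values chosen for illustration
delta = beta

close = allowed_interval(alpha - beta, delta)        # inputs of A2, B1, B3
far = allowed_interval(alpha + beta, delta)          # inputs of B2
t_to_f = allowed_interval(2 * alpha + 3 * beta, delta)  # input of A1

print(close)   # (8, 10):  upper end is alpha, i.e. tau_T <= alpha
print(far)     # (10, 12): lower end is alpha, i.e. tau_T >= alpha
print(t_to_f)  # (22, 24): lower end is 2*alpha + 2*beta
```

The upper end of the “close” interval and the lower end of the “far” interval both equal α, which matches the ≤ α and ≥ α bounds listed for the output tree.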