Lecture 3. Phylogeny methods: Branch and bound, distance methods
Joe Felsenstein
Department of Genome Sciences and Department of Biology
Lecture 3. Phylogeny methods: Branch and bound, distance methods – p.1/25
Greedy search by sequential addition
[Figure: trees built up one species at a time. Candidate placements are labelled with their parsimony scores: 8, 7, 9 when D is added, and 11, 9, 9, 9, 9 when E is added.]
Greedy search by addition of species in a fixed order (A, B, C, D, E) in the best place each time.
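The greedy strategy in the figure can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the lecture's own code: trees are rooted nested tuples, changes are counted with Fitch's algorithm, and the names `fitch_site`, `parsimony_score`, `insertions`, and `greedy_sequential_addition` are hypothetical. Because the Fitch count does not depend on where the tree is rooted, several rooted candidates may describe the same unrooted tree; ties go to the first candidate tried.

```python
from typing import Dict, Iterator, Sequence, Set, Tuple, Union

Tree = Union[str, Tuple["Tree", "Tree"]]  # a leaf name or a pair of subtrees


def fitch_site(tree: Tree, states: Dict[str, str]) -> Tuple[Set[str], int]:
    """Fitch pass for one site: (possible states at this node, changes below it)."""
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch_site(tree[0], states)
    right_set, right_cost = fitch_site(tree[1], states)
    common = left_set & right_set
    if common:
        return common, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree: Tree, seqs: Dict[str, str]) -> int:
    """Total number of changes over all sites of the aligned sequences."""
    n_sites = len(next(iter(seqs.values())))
    return sum(
        fitch_site(tree, {name: seq[i] for name, seq in seqs.items()})[1]
        for i in range(n_sites)
    )


def insertions(tree: Tree, new: str) -> Iterator[Tree]:
    """Yield every tree made by attaching leaf `new` to one branch of `tree`."""
    yield (tree, new)  # attach on the branch above this subtree
    if not isinstance(tree, str):
        left, right = tree
        for t in insertions(left, new):
            yield (t, right)
        for t in insertions(right, new):
            yield (left, t)


def greedy_sequential_addition(
    order: Sequence[str], seqs: Dict[str, str]
) -> Tuple[Tree, int]:
    """Add species in the given order, each time keeping the best placement."""
    tree: Tree = (order[0], order[1])
    for name in order[2:]:
        tree = min(insertions(tree, name), key=lambda t: parsimony_score(t, seqs))
    return tree, parsimony_score(tree, seqs)


# Hypothetical toy alignment: A and B share one base, C and D another.
seqs = {"A": "AAA", "B": "AAA", "C": "CCC", "D": "CCC"}
tree, score = greedy_sequential_addition(["A", "B", "C", "D"], seqs)
print(tree, score)
```

The search is fast because it never revisits an earlier placement, which is also why it can miss the best tree.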
Goloboff’s time-saving trick
[Figure: trees whose subtrees are labelled by the sets of species they contain, such as A, B-G, H-K, L, M-R, S-U, V-Z, A-G, and H-R.]
Goloboff’s economy in computing scores of rearranged trees. Once the “views” have been computed, they can be taken to represent subtrees, without going inside those subtrees.
Star decomposition
[Figure: a star tree of the species A to F resolved step by step into a bifurcating tree, along more than one route.]
“Star decomposition” search for best tree can happen in multiple ways.
Disk-covering
[Figure: a tree of the species A to F with branch lengths, built from overlapping subtrees.]
“Disk covering”: assembly of a tree from overlapping estimated subtrees.
Shortest Hamiltonian path problem
[Figure: four panels, (a) to (d).]
Search tree for this problem
[Figure: search tree for orderings of 10 points. From “start”, each branch adds one more point (add 1, add 2, add 3, and so on). The tips are complete orderings such as (1,2,3,4,5,6,7,8,9,10), (1,2,3,4,5,6,7,8,10,9), (1,2,3,4,5,6,7,9,8,10), (1,2,3,4,5,6,7,9,10,8), (1,2,3,4,5,6,7,10,8,9), and (1,2,3,4,5,6,7,10,9,8).]
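A search tree like this one can be explored by branch and bound. The sketch below is an illustration under stated assumptions: the function name `shortest_hamiltonian_path` and the toy points are hypothetical, and distances are assumed nonnegative, so the length of a partial path is a lower bound on the length of any path that extends it. A branch is abandoned as soon as that bound is no better than the best complete path found so far.

```python
import math
from typing import List, Sequence, Tuple


def shortest_hamiltonian_path(
    dist: Sequence[Sequence[float]],
) -> Tuple[float, List[int]]:
    """Depth-first branch and bound over orderings of the points."""
    n = len(dist)
    best_len = math.inf
    best_path: List[int] = []

    def extend(path: List[int], length: float, unused: frozenset) -> None:
        nonlocal best_len, best_path
        if length >= best_len:  # bound: this branch cannot beat the best so far
            return
        if not unused:  # a complete ordering that improves on the best
            best_len, best_path = length, list(path)
            return
        # Try the nearest unused point first so that good bounds arrive early.
        for nxt in sorted(unused, key=lambda j: dist[path[-1]][j]):
            path.append(nxt)
            extend(path, length + dist[path[-2]][nxt], unused - {nxt})
            path.pop()

    for start in range(n):
        extend([start], 0.0, frozenset(range(n)) - {start})
    return best_len, best_path


# Hypothetical example: four points on a line, listed out of order.
coords = [0, 3, 1, 2]
dist = [[abs(a - b) for b in coords] for a in coords]
best_len, best_path = shortest_hamiltonian_path(dist)
print(best_len, best_path)
```

The result is guaranteed optimal because only branches that cannot contain a better path are skipped; in the worst case the search still visits an exponential number of nodes.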
Search tree of trees
[Figure: search tree of trees. The three-species tree of A, B, and C at the root leads to the three trees obtained by adding D, and each of those leads to the five trees obtained by adding E, giving 15 five-species trees at the tips.]
same, with parsimony scores in place of trees
[Figure: the same search tree with each tree replaced by its parsimony score. The scores shown range from 3 at the root to between 8 and 11 at the tips.]
Polynomial time and exponential time
[Figure: log-log plot of Time (from $10^0$ to $10^6$) against Problem size (from 1 to 100), with one curve for the polynomial $6n^3 + 4n - 3$ and one for the exponential $e^{0.5n}$.]
How does the time taken by an algorithm depend on the size of the problem? If it is a polynomial (even one with big coefficients), with a big enough case it is faster than one that depends on the size exponentially.
NP completeness and NP hardness
[Figure: diagram of nested regions labelled P, NP, NP Complete, and NP Hard, annotated “is P = NP?” and “does this part exist?”.]
(This diagram is not quite correct; see the diagrams on the Wikipedia page for “NP-hard”.)
P = problems that can be solved by a polynomial-time algorithm.
NP = problems for which a proposed solution can be checked in polynomial time.
NP complete = problems in NP for which it can be proven that if one of them is in P, then every problem in NP is.
NP hard = problems at least as hard as every problem in NP. They need not be in NP themselves, and they might not be solvable in polynomial time.
Distance methods
These have been attractive, particularly to mathematical scientists who love geometry. This has its good and bad effects.
1. Take the sequences in all pairs.
2. For each pair compute a distance. (As we will see, this is best thought of as the length of the 2-species tree for those species.)
3. Try to find the tree which best fits the table of distances.
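Steps 1 and 2 can be sketched as follows. This is a minimal illustration that uses the uncorrected proportion of differing sites as the distance; the function names and the toy sequences are hypothetical, and a corrected distance such as the Jukes-Cantor one discussed later in the lecture would replace `p_distance` in practice.

```python
from itertools import combinations
from typing import Dict, Tuple


def p_distance(seq_a: str, seq_b: str) -> float:
    """Fraction of aligned sites at which two sequences differ (uncorrected)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    differences = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return differences / len(seq_a)


def distance_table(seqs: Dict[str, str]) -> Dict[Tuple[str, str], float]:
    """One distance for every unordered pair of species."""
    return {
        (a, b): p_distance(seqs[a], seqs[b]) for a, b in combinations(seqs, 2)
    }


# Hypothetical aligned sequences.
seqs = {"A": "ACGT", "B": "ACGA", "C": "TCGA"}
table = distance_table(seqs)
print(table)
```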
A phylogeny with branch lengths
[Figure: an unrooted tree with tips A, B, C, D, and E; its seven branch lengths are 0.10, 0.07, 0.05, 0.08, 0.03, 0.06, and 0.05.]
and the pairwise distances it predicts

     A     B     C     D     E
A    0     0.23  0.16  0.20  0.17
B    0.23  0     0.23  0.17  0.24
C    0.16  0.23  0     0.20  0.11
D    0.20  0.17  0.20  0     0.21
E    0.17  0.24  0.11  0.21  0
A phylogeny with branch lengths
[Figure: the same tree of A, B, C, D, and E, with its branches labelled v1 to v7.]
Least squares trees
Least squares methods minimize

$$Q = \sum_{i=1}^{n} \sum_{j \neq i} w_{ij} \left(D_{ij} - d_{ij}\right)^2$$

over all trees, using the distances $d_{ij}$ that they predict. Cavalli-Sforza and Edwards suggested $w_{ij} = 1$; Fitch and Margoliash suggested $w_{ij} = 1/D_{ij}^2$.
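The criterion is easy to evaluate once the observed and predicted distances are in hand. The sketch below follows the double sum literally, so each pair is counted twice; the function names and the small matrices are hypothetical.

```python
from typing import Callable, Sequence

Matrix = Sequence[Sequence[float]]


def least_squares_q(
    observed: Matrix, predicted: Matrix, weight: Callable[[float], float]
) -> float:
    """Q = sum over i and j != i of w_ij * (D_ij - d_ij)^2."""
    n = len(observed)
    return sum(
        weight(observed[i][j]) * (observed[i][j] - predicted[i][j]) ** 2
        for i in range(n)
        for j in range(n)
        if j != i
    )


def unweighted(d_obs: float) -> float:
    """Cavalli-Sforza and Edwards: w_ij = 1."""
    return 1.0


def fitch_margoliash(d_obs: float) -> float:
    """Fitch and Margoliash: w_ij = 1 / D_ij^2."""
    return 1.0 / d_obs**2


# Hypothetical observed (D) and tree-predicted (d) distances for three species.
D = [[0.0, 0.3, 0.5], [0.3, 0.0, 0.4], [0.5, 0.4, 0.0]]
d = [[0.0, 0.2, 0.5], [0.2, 0.0, 0.4], [0.5, 0.4, 0.0]]
q_unweighted = least_squares_q(D, d, unweighted)
q_fm = least_squares_q(D, d, fitch_margoliash)
print(q_unweighted, q_fm)
```

Only the first pair is misfit here, by 0.1, so the unweighted Q is 2 × 0.01 and the Fitch-Margoliash Q is that value divided by 0.3².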
Statistical assumptions of least squares trees
The implicit assumption is that the distances are (independently?) Normally distributed, with expectation $d_{ij}$ and variance proportional to $1/w_{ij}$:

$$D_{ij} \sim N\left(d_{ij},\; K/w_{ij}\right)$$

Thus the different weightings correspond to different assumptions about the error in the distances. Also, there is assumed to be no covariance of distances.

In fact, the distances will covary, since a change in an interior branch of the tree increases (or decreases) all distances whose paths go through that branch.
Matrix approach to fitting branch lengths
If we stack the distances up into a column vector $D$, we can solve the least squares equation (obtained by taking derivatives of the quadratic form $Q$):

$$D^T = (D_{12}, D_{13}, D_{14}, D_{15}, D_{23}, D_{24}, D_{25}, D_{34}, D_{35}, D_{45})$$

$$X^T D = \left(X^T X\right) v,$$

where the “design matrix” $X$ for the given tree topology has 1’s whenever a given branch lies on the path between those two species. Here is the design matrix for the tree we just saw, with one column per branch and one row per distance:

$$
X = \begin{array}{c|ccccccc}
 & v_1 & v_2 & v_3 & v_4 & v_5 & v_6 & v_7 \\ \hline
D_{12} & 1 & 1 & 0 & 0 & 0 & 0 & 1 \\
D_{13} & 1 & 0 & 1 & 0 & 0 & 1 & 0 \\
D_{14} & 1 & 0 & 0 & 1 & 0 & 0 & 1 \\
D_{15} & 1 & 0 & 0 & 0 & 1 & 1 & 0 \\
D_{23} & 0 & 1 & 1 & 0 & 0 & 1 & 1 \\
D_{24} & 0 & 1 & 0 & 1 & 0 & 0 & 0 \\
D_{25} & 0 & 1 & 0 & 0 & 1 & 1 & 1 \\
D_{34} & 0 & 0 & 1 & 1 & 0 & 1 & 1 \\
D_{35} & 0 & 0 & 1 & 0 & 1 & 0 & 0 \\
D_{45} & 0 & 0 & 0 & 1 & 1 & 1 & 1
\end{array}
$$

[Figure: the five-species tree with tips A, B, C, D, and E and branches labelled v1 to v7.]
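The normal equations can be solved directly for this design matrix. The sketch below is a pure-Python illustration with hypothetical helper names; the branch lengths in `v_true` are assumed values used only to generate a set of distances, which the solver should then recover.

```python
from typing import List

# Design matrix from the slide: rows D12 ... D45, columns v1 ... v7.
X = [
    [1, 1, 0, 0, 0, 0, 1],  # D12
    [1, 0, 1, 0, 0, 1, 0],  # D13
    [1, 0, 0, 1, 0, 0, 1],  # D14
    [1, 0, 0, 0, 1, 1, 0],  # D15
    [0, 1, 1, 0, 0, 1, 1],  # D23
    [0, 1, 0, 1, 0, 0, 0],  # D24
    [0, 1, 0, 0, 1, 1, 1],  # D25
    [0, 0, 1, 1, 0, 1, 1],  # D34
    [0, 0, 1, 0, 1, 0, 0],  # D35
    [0, 0, 0, 1, 1, 1, 1],  # D45
]


def matvec(a: List[List[float]], v: List[float]) -> List[float]:
    return [sum(x * y for x, y in zip(row, v)) for row in a]


def matmul(a: List[List[float]], b: List[List[float]]) -> List[List[float]]:
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def solve(a: List[List[float]], b: List[float]) -> List[float]:
    """Gauss-Jordan elimination with partial pivoting for a square system."""
    n = len(b)
    m = [list(map(float, row)) + [float(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_branch_lengths(x: List[List[float]], d: List[float]) -> List[float]:
    """Solve X^T D = (X^T X) v for the branch lengths v."""
    xt = [list(col) for col in zip(*x)]
    return solve(matmul(xt, x), matvec(xt, d))


# Assumed branch lengths v1..v7, used only to generate example distances.
v_true = [0.08, 0.10, 0.05, 0.07, 0.06, 0.03, 0.05]
D = matvec(X, v_true)  # distances predicted by the tree
v_hat = fit_branch_lengths(X, D)
print([round(v, 4) for v in v_hat])
```

With real data the observed distances will not fit any tree exactly, and the same call returns the least squares branch lengths for this topology.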
The Jukes-Cantor model for DNA
[Figure: the four bases A, G, C, and T, with every arrow between a pair of bases labelled with rate u/3.]
Derivation of the probability of change
1. Imagine events occurring at rate $\frac{4}{3}u$ per unit time which replace a base by one of the 4 bases chosen at random.
2. Persuade yourself that this is no different in outcome from events at rate $u$ per unit time that replace it by one of the other 3 chosen at random.
3. The probability that a branch of length $t$ has none of these (first kind of) events is $\exp\left(-\frac{4}{3}ut\right)$. (Think of the zero term of a Poisson distribution.)
4. If it does have one or more of these events, you end up with one of the 4 bases chosen at random.
5. Therefore the probability of a net change is

$$\frac{3}{4}\left(1 - e^{-\frac{4}{3}ut}\right)$$
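The steps above translate directly into a short function. This is a sketch with a hypothetical name, `jc_prob_change`, written so that each line mirrors one step of the argument.

```python
import math


def jc_prob_change(u: float, t: float) -> float:
    """Probability that the two ends of a branch differ under Jukes-Cantor."""
    # Step 3: probability of no "replace by a random base" event on the branch.
    p_no_event = math.exp(-4.0 * u * t / 3.0)
    # Step 4: after at least one event, the base is one of 4 chosen at random,
    # so it differs from the starting base with probability 3/4.
    return 0.75 * (1.0 - p_no_event)


for branch_length in (0.0, 0.1, 1.0, 10.0):
    print(branch_length, jc_prob_change(1.0, branch_length))
```

The value starts at 0 for a branch of length zero and approaches 3/4 as the branch gets long.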
The distance for the Jukes-Cantor model
[Figure: expected differences per site plotted against branch length. The curve rises from 0 toward an asymptote of 0.75; the marked example shows 0.49 differences per site corresponding to a branch length of 0.7945.]
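Solving the probability of change, $p = \frac{3}{4}\left(1 - e^{-\frac{4}{3}ut}\right)$, for the branch length gives $ut = -\frac{3}{4}\ln\left(1 - \frac{4}{3}p\right)$. A minimal sketch of that inversion, with a hypothetical function name:

```python
import math


def jc_distance(p: float) -> float:
    """Branch length ut implied by a fraction p of differing sites."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must be in [0, 0.75) for the distance to be finite")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# The example marked on the plot: 0.49 differences per site.
distance = jc_distance(0.49)
print(round(distance, 4))
```

The correction is small when p is small and grows without bound as p approaches 0.75, which is why observed fractions near the asymptote give very uncertain distances.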
If you don’t correct for “multiple hits”
[Figure: two three-species trees of A, B, and C. The branch lengths shown are 0.155, 0.155, and 0.0206 on the left tree and 0.20, 0.20, and 0.00 on the right tree.]
Left: the true tree. Right: a tree fitting the uncorrected distances.
References, page 1
Maddison, D. R. 1991. The discovery and importance of multiple islands of most-parsimonious trees. Systematic Zoology 40: 315-328. [Discusses heuristic search strategy involving ties, multiple starts]
Farris, J. S. 1970. Methods for computing Wagner trees. Systematic Zoology 19: 83-92. [Early parsimony algorithms paper is one of first to mention sequential addition strategy]
Saitou, N., and M. Nei. 1987. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Molecular Biology and Evolution 4: 406-425. [First mention of star-decomposition search for best trees, sort of]
Strimmer, K., and A. von Haeseler. 1996. Quartet puzzling: a quartet maximum likelihood method for reconstructing tree topologies. Molecular Biology and Evolution 13: 964-969. [Assembles trees out of quartets]
Huson, D., S. Nettles, L. Parida, T. Warnow, and S. Yooseph. 1998. The disk-covering method for tree reconstruction. pp. 62-75 in Proceedings of “Algorithms and Experiments” (ALEX98), Trento, Italy, Feb. 9-11, 1998, ed. R. Battiti and A. A. Bertossi. [“Disk-covering method” for long stringy trees]
References, page 2
Foulds, L. R. and R. L. Graham. 1982. The Steiner problem in phylogeny is NP-complete. Advances in Applied Mathematics 3: 43-49. [Parsimony is NP-hard]
Graham, R. L. and L. R. Foulds. 1982. Unlikelihood that minimal phylogenies for a realistic biological study can be constructed in reasonable computational time. Mathematical Biosciences 60: 133-142. [ ... and more]
Hendy, M. D. and D. Penny. 1982. Branch and bound algorithms to determine minimal evolutionary trees. Mathematical Biosciences 59: 277-290. [Introduced branch-and-bound for phylogenies]
Felsenstein, J. 2004. Inferring Phylogenies. Sinauer Associates, Sunderland, Massachusetts. [For this lecture the material is chapters 4 and 5]
Semple, C. and M. Steel. 2003. Phylogenetics. Oxford University Press, Oxford. [Also covers search strategies]
References, page 3
Felsenstein, J. 1984. Distance methods for inferring phylogenies: a justification. Evolution 38: 16-24. [Argument for statistical interpretation of distance methods]
Farris, J. S. 1985. Distance data revisited. Cladistics 1: 67-85. [Reply to my 1984 paper]
Felsenstein, J. 1986. Distance methods: reply to Farris. Cladistics 2: 130-143. [Reply to Farris 1985]
Farris, J. S. 1986. Distances and statistics. Cladistics 2: 144-157. [Debate was cut off after this]
References, page 4
Bryant, D., and P. Waddell. 1998. Rapid evaluation of least-squares and minimum-evolution criteria on phylogenetic trees. Molecular Biology and Evolution 15: 1346-1359. [Quicker least squares distance trees]
Felsenstein, J. 2004. Inferring Phylogenies. Sinauer Associates, Sunderland, Massachusetts. [See chapter 11]
Semple, C. and M. Steel. 2003. Phylogenetics. Oxford University Press, Oxford. [See pp. 145-160]
Yang, Z. 2007. Computational Molecular Evolution. Oxford University Press, Oxford. [See pages 89-93]