Phylogeny Tree Reconstruction

[Figure: two example trees on the leaves 1–5, with leaf orders 1 4 3 2 5 and 1 4 2 3 5]

Final Exam

• 24-hour take-home exam

• More straightforward questions than in the homeworks

• Please email Michael and Serafim by Friday with your preferred day to take the exam

• Exam starts Sunday, …, Thursday noon; ends Monday, ..., Friday noon

Number of labeled unrooted tree topologies

• How many possibilities are there for leaf 4?

[Figure: unrooted tree on leaves 1, 2, 3; leaf 4 can attach to any of its three edges]

For the 4th leaf, there are 3 possibilities


• How many possibilities are there for leaf n?

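For the nth leaf, there are 2n – 5 possibilities: an unrooted binary tree on n – 1 leaves has 2(n – 1) – 3 = 2n – 5 edges, and the new leaf can attach to any one of them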


• #unrooted trees for n taxa: $(2n-5)(2n-7)\cdots 3 \cdot 1 = (2n-5)! / [2^{n-3}(n-3)!]$

• #rooted trees for n taxa: $(2n-3)(2n-5)(2n-7)\cdots 3 = (2n-3)! / [2^{n-2}(n-2)!]$

n = 10: #unrooted: 2,027,025; #rooted: 34,459,425

n = 30: #unrooted: $8.7 \times 10^{36}$; #rooted: $4.95 \times 10^{38}$
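The counts above can be checked with a few lines of Python. This is a minimal sketch; the function names `double_factorial_odd`, `count_unrooted`, and `count_rooted` are illustrative and do not come from the slides.

```python
from math import factorial


def double_factorial_odd(k):
    """Return k * (k - 2) * ... * 3 * 1 for odd k (1 when k <= 1)."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def count_unrooted(n):
    """Number of labeled unrooted binary trees on n >= 3 leaves."""
    return double_factorial_odd(2 * n - 5)


def count_rooted(n):
    """Number of labeled rooted binary trees on n >= 2 leaves."""
    return double_factorial_odd(2 * n - 3)


def count_unrooted_closed_form(n):
    """Same count via (2n-5)! / [2^(n-3) * (n-3)!]."""
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))


if __name__ == "__main__":
    for n in (4, 5, 10, 30):
        print(n, count_unrooted(n), count_rooted(n))
```

The number of rooted trees on n leaves equals the number of unrooted trees on n + 1 leaves, since a root can be placed on any of the 2n – 3 edges.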

Search through tree topologies: Branch and Bound

Observation: adding a leaf (and its edge) to an existing tree can never decrease the parsimony cost

Enumerate all unrooted trees with at most N leaves:

$[i_3][i_5][i_7] \dots [i_{2N-5}]$

where each $i_k$ can take values from 0 (no edge) to k; a nonzero $i_k$ selects which of the k edges of the current tree the next leaf attaches to

At each point keep C = smallest cost so far for a complete tree

Start B&B with tree [1][0][0]…[0]

Whenever the cost of the current tree T is > C, then:

• T is not optimal

• Any tree extending T with more edges is not optimal

• Skip all such extensions: increment by 1 the rightmost nonzero counter
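The pruning idea can be sketched in Python as follows. This is an illustrative version under stated assumptions, not the slides' counter-based enumeration: it uses recursion instead of the counter vector, scores trees with Fitch parsimony summed over columns, and the names `fitch_cost` and `branch_and_bound` as well as the edge-list tree representation are hypothetical.

```python
def fitch_cost(edges, seqs):
    """Fitch parsimony cost of an unrooted binary tree, summed over columns.

    Leaves are integer indices into seqs; internal nodes are strings.
    """
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    root = next(x for x in adj if isinstance(x, int))  # root the tree at a leaf
    total = 0
    for col in range(len(seqs[0])):
        changes = 0

        def up(node, parent):
            nonlocal changes
            kids = [x for x in adj[node] if x != parent]
            if not kids:  # leaf: its state set is its own character
                return {seqs[node][col]}
            a, b = (up(k, node) for k in kids)
            if a & b:
                return a & b
            changes += 1  # disjoint child sets force one substitution
            return a | b

        below = up(adj[root][0], root)
        if seqs[root][col] not in below:
            changes += 1
        total += changes
    return total


def branch_and_bound(seqs, prune=True):
    """Return (best cost, edge list) over all unrooted trees on len(seqs) >= 3 leaves."""
    n = len(seqs)
    best = {"cost": float("inf"), "edges": None}

    def extend(edges, next_leaf):
        cost = fitch_cost(edges, seqs)
        if prune and cost > best["cost"]:
            return  # no extension of this partial tree can beat C
        if next_leaf == n:
            if cost < best["cost"]:
                best["cost"], best["edges"] = cost, list(edges)
            return
        new = f"u{next_leaf}"  # fresh internal node splitting the chosen edge
        for i, (u, v) in enumerate(edges):
            grown = edges[:i] + edges[i + 1:] + [(u, new), (new, v), (new, next_leaf)]
            extend(grown, next_leaf + 1)

    extend([(0, "u"), (1, "u"), (2, "u")], 3)  # the unique tree on three leaves
    return best["cost"], best["edges"]
```

Passing `prune=False` turns the search into plain exhaustive enumeration, which is useful for checking that pruning does not change the optimum.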

Bootstrapping to get the best trees

Main outline of algorithm

1. Select random columns from a multiple alignment, sampling with replacement, so one column can appear several times

2. Build a phylogenetic tree based on the random sample from (1)

3. Repeat (1), (2) many (say, 1000) times

4. Output the tree that is constructed most frequently
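The four steps above can be sketched as follows. This is a minimal sketch with several assumptions: each replicate draws as many columns as the original alignment has (the slides do not state the sample size), `build_tree` is a hypothetical caller-supplied function that returns a hashable tree (for example a Newick string), and the names `bootstrap_columns` and `bootstrap_trees` are illustrative.

```python
import random
from collections import Counter


def bootstrap_columns(alignment, rng):
    """Resample alignment columns with replacement (step 1).

    alignment is a list of equal-length strings, one per sequence.
    """
    m = len(alignment[0])
    cols = [rng.randrange(m) for _ in range(m)]  # a column may repeat
    return ["".join(row[c] for c in cols) for row in alignment]


def bootstrap_trees(alignment, build_tree, replicates=1000, seed=0):
    """Build a tree per replicate (steps 2-3), return the most frequent (step 4)."""
    rng = random.Random(seed)
    counts = Counter(
        build_tree(bootstrap_columns(alignment, rng)) for _ in range(replicates)
    )
    best_tree, _ = counts.most_common(1)[0]
    return best_tree, counts
```

The returned `counts` also give the fraction of replicates supporting each tree, which is how bootstrap results are usually reported.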

Probabilistic Methods

A more refined measure of evolution along a tree than parsimony

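$P(x_1, x_2, x_{root} \mid t_1, t_2) = P(x_{root})\, P(x_1 \mid t_1, x_{root})\, P(x_2 \mid t_2, x_{root})$

If we use Jukes-Cantor, for example, and $x_1 = x_{root} = A$, $x_2 = C$, $t_1 = t_2 = 1$,

$= p_A \cdot \tfrac{1}{4}(1 + 3e^{-4\alpha}) \cdot \tfrac{1}{4}(1 - e^{-4\alpha}) = (\tfrac{1}{4})^3 (1 + 3e^{-4\alpha})(1 - e^{-4\alpha})$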
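The Jukes-Cantor example can be reproduced numerically. This is a small sketch; the function names `jc_prob` and `two_leaf_joint`, the uniform root prior of 1/4, and the value of alpha used in the demo are assumptions made for illustration.

```python
import math


def jc_prob(child, parent, t, alpha):
    """Jukes-Cantor probability P(child | parent, t) with rate alpha."""
    decay = math.exp(-4.0 * alpha * t)
    if child == parent:
        return 0.25 * (1.0 + 3.0 * decay)
    return 0.25 * (1.0 - decay)


def two_leaf_joint(x1, x2, xroot, t1, t2, alpha, prior=0.25):
    """P(x1, x2, xroot | t1, t2) = P(xroot) P(x1 | t1, xroot) P(x2 | t2, xroot)."""
    return prior * jc_prob(x1, xroot, t1, alpha) * jc_prob(x2, xroot, t2, alpha)


if __name__ == "__main__":
    alpha = 0.1  # arbitrary rate chosen for the demo
    print(two_leaf_joint("A", "C", "A", 1.0, 1.0, alpha))
```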

[Figure: root xroot with two children x1 and x2, on branches of length t1 and t2]


• If we know all internal labels xu,

$P(x_1, x_2, \dots, x_N, x_{N+1}, \dots, x_{2N-1} \mid T, t) = P(x_{root}) \prod_{j \ne root} P(x_j \mid x_{parent(j)}, t_{j,parent(j)})$

• Usually we don’t know the internal labels, therefore

$P(x_1, x_2, \dots, x_N \mid T, t) = \sum_{x_{N+1}} \sum_{x_{N+2}} \cdots \sum_{x_{2N-1}} P(x_1, x_2, \dots, x_{2N-1} \mid T, t)$

[Figure: tree with leaves x1, x2, …, xN, an internal node xu, and root xroot = x2N-1]

Computing the Likelihood of a Tree

• Define $P(L_k \mid a)$: probability of the leaves in the subtree rooted at $x_k$, given that $x_k = a$

• Then, $P(L_k \mid a) = \left(\sum_b P(L_i \mid b)\, P(b \mid a, t_{ki})\right)\left(\sum_c P(L_j \mid c)\, P(c \mid a, t_{kj})\right)$

[Figure: node xk with daughter nodes xi and xj, on branches of length tki and tkj]

Felsenstein’s Likelihood Algorithm

To calculate P(x1, x2, …, xN | T, t)

Initialization: Set k = 2N – 1

Recursion: Compute $P(L_k \mid a)$ for all a

If k is a leaf node:

Set $P(L_k \mid a) = 1(a = x_k)$

If k is not a leaf node:

1. Compute $P(L_i \mid b)$, $P(L_j \mid b)$ for all b, for daughter nodes i, j

2. Set $P(L_k \mid a) = \sum_{b,c} P(b \mid a, t_{ki})\, P(L_i \mid b)\, P(c \mid a, t_{kj})\, P(L_j \mid c)$

Termination:

Likelihood at this column = $P(x_1, x_2, \dots, x_N \mid T, t) = \sum_a P(L_{2N-1} \mid a)\, P(a)$
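The recursion above can be sketched for a single alignment column as follows. The assumptions are illustrative: the tree is a nested tuple `(left, t_left, right, t_right)` with leaves given as single characters, the substitution model is Jukes-Cantor with a fixed rate, and the root prior is uniform. The names `jc`, `partial`, and `column_likelihood` are hypothetical.

```python
import math

ALPHABET = "ACGT"


def jc(b, a, t, alpha=1.0):
    """Jukes-Cantor P(b | a, t): probability that a becomes b along a branch of length t."""
    decay = math.exp(-4.0 * alpha * t)
    return 0.25 * (1.0 + 3.0 * decay) if a == b else 0.25 * (1.0 - decay)


def partial(node):
    """Return {a: P(L_k | a)} for the subtree rooted at node."""
    if isinstance(node, str):  # leaf: indicator 1(a = x_k)
        return {a: 1.0 if a == node else 0.0 for a in ALPHABET}
    left, t_left, right, t_right = node
    p_left, p_right = partial(left), partial(right)
    return {
        a: sum(jc(b, a, t_left) * p_left[b] for b in ALPHABET)
        * sum(jc(c, a, t_right) * p_right[c] for c in ALPHABET)
        for a in ALPHABET
    }


def column_likelihood(tree, prior=0.25):
    """Termination step: sum over root states of P(L_root | a) P(a)."""
    return sum(prior * p for p in partial(tree).values())


if __name__ == "__main__":
    tree = (("A", 0.1, "C", 0.2), 0.3, "G", 0.4)
    print(column_likelihood(tree))
```

Each internal node costs a constant amount of work per pair of states, so the column likelihood is computed in time linear in the number of nodes rather than by summing over all internal labelings.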


Given M (ungapped) alignment columns of N sequences,

• Define likelihood of a tree:

$L(T, t) = P(\text{Data} \mid T, t) = \prod_{m=1}^{M} P(x_{1m}, \dots, x_{Nm} \mid T, t)$

Maximum Likelihood Reconstruction:

• Given data X = (xij), find a topology T and length vector t that maximize likelihood L(T, t)
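As a much reduced illustration of maximizing L(T, t), the sketch below handles only the simplest case: two sequences, so the topology is fixed and the only free parameter is the branch length t between them. It assumes the Jukes-Cantor model with a uniform prior and uses a plain grid search; the names `log_likelihood` and `ml_branch_length` and the grid parameters are hypothetical. Searching over topologies for larger N is a far harder problem that this sketch does not address.

```python
import math


def log_likelihood(seq1, seq2, t, alpha=1.0):
    """Log of the product over columns of P(x_1m, x_2m | t) under Jukes-Cantor."""
    decay = math.exp(-4.0 * alpha * t)
    same = 0.25 * (1.0 + 3.0 * decay)
    diff = 0.25 * (1.0 - decay)
    # Summing logs avoids underflow when the number of columns M is large.
    return sum(
        math.log(0.25 * (same if a == b else diff)) for a, b in zip(seq1, seq2)
    )


def ml_branch_length(seq1, seq2, alpha=1.0, steps=2000, t_max=2.0):
    """Grid search for the branch length t in (0, t_max] maximizing the likelihood."""
    grid = [t_max * (i + 1) / steps for i in range(steps)]
    return max(grid, key=lambda t: log_likelihood(seq1, seq2, t, alpha))
```

For this two-sequence case the maximum has a closed form, t = –(1/(4α)) ln(1 – 4p/3) where p is the fraction of differing columns, which gives a convenient check on the grid search.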

Some new sequencing technologies

Molecular Inversion Probes

Single Molecule Array for Genotyping—Solexa

Nanopore Sequencing

http://www.mcb.harvard.edu/branton/index.htm

Nanopore Sequencing—Assembly

• Resulting reads are likely to look different than Sanger reads:
  - Long (perhaps 10,000 bp to 1,000,000 bp)
  - High error rate (perhaps 10% to 30%)
  - Two colors?
    - A / CTG
    - AT / CG
    - AG / CT

• How can we assemble under such conditions?

Pyrosequencing

Pyrosequencing on a chip

Mostafa Ronaghi, Stanford Genome Technologies Center

454 Life Sciences

Pyrosequencing Signal

Pyrosequencing—Assembly

• Resulting reads are likely to look different than Sanger reads:
  - Short (currently 100 to 200 bp)
  - Low error rates, except in homopolymeric runs (AAA…, CCC…, etc.)
  - Currently, it is not known how to do paired reads on a chip

Polony Sequencing
