
Phylogenetic Analysis


Page 1: Phylogenetic Analysis

Phylogenetic Analysis

Page 2: Phylogenetic Analysis

Introduction

• Intention
  – Use powerful algorithms to reconstruct the evolutionary history of all known organisms.
• Phylogenetic tree
  – It can help us understand the evolutionary relationships among species of organisms.
  – But we have to infer the evolutionary history of current organisms.

Page 3: Phylogenetic Analysis

Campanulaceae (bluebell) family

Herpesviruses

Page 4: Phylogenetic Analysis

Common Phylogenetic Tree Terminology

[Figure: a rooted tree with terminal nodes A, B, C, D, and E, labeled as follows.]

• Ancestral node or ROOT of the tree
• Internal nodes or divergence points (represent hypothetical ancestors of the taxa)
• Branches or lineages
• Terminal nodes: represent the TAXA (genes, populations, species, etc.) used to infer the phylogeny

Page 5: Phylogenetic Analysis

Three types of trees

[Figure: the same tree of Taxon A, Taxon B, Taxon C, and Taxon D drawn three ways.]

• Cladogram: branch lengths have no meaning
• Phylogram: branch lengths are proportional to genetic change (the branches carry numeric lengths)
• Ultrametric tree: branch lengths are proportional to time

All show the same evolutionary relationships, or branching orders, between the taxa.

Page 6: Phylogenetic Analysis

Phylogenetic trees diagram the evolutionary relationships between the taxa

[Figure: a tree with Taxon A, Taxon B, Taxon C, Taxon E, and Taxon D listed from top to bottom.]

((A,(B,C)),(D,E)) = the above phylogeny as nested parentheses

• There is no meaning to the spacing between the taxa, or to the order in which they appear from top to bottom.

• The dimension along the branches either can have no scale (for 'cladograms'), can be proportional to genetic distance or amount of change (for 'phylograms' or 'additive trees'), or can be proportional to time (for 'ultrametric trees' or true evolutionary trees).

• The tree and its nested-parentheses form both say that B and C are more closely related to each other than either is to A, and that A, B, and C form a clade that is a sister group to the clade composed of D and E. If the tree has a time scale, then D and E are the most closely related.
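The nested-parentheses notation can be read mechanically. The following is a minimal sketch, assuming a bare string with no branch lengths or whitespace, exactly as written on the slide; the function names are illustrative, not taken from any library. It parses the string into nested tuples and lists the leaf set of every clade.

```python
def parse_newick(text):
    """Parse a bare nested-parentheses tree such as '((A,(B,C)),(D,E))' into nested tuples."""
    pos = 0

    def node():
        nonlocal pos
        if text[pos] == "(":
            pos += 1  # consume '('
            children = [node()]
            while text[pos] == ",":
                pos += 1  # consume ','
                children.append(node())
            pos += 1  # consume ')'
            return tuple(children)
        start = pos
        while pos < len(text) and text[pos] not in ",()":
            pos += 1
        return text[start:pos]  # a leaf name

    return node()


def leaves(tree):
    """Return the set of taxon names below a node."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def clades(tree):
    """Return the leaf set of every internal node (every clade)."""
    if isinstance(tree, str):
        return []
    result = [leaves(tree)]
    for child in tree:
        result.extend(clades(child))
    return result


tree = parse_newick("((A,(B,C)),(D,E))")
groups = {"".join(sorted(c)) for c in clades(tree)}
print(groups)  # the clades BC, ABC, DE, and the full set ABCDE
```

The clade list makes the slide's statement concrete: B and C share a clade that excludes A, and ABC and DE are the two groups below the root.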

Page 7: Phylogenetic Analysis

The goal of phylogeny inference is to resolve the branching orders of lineages in evolutionary trees:

[Figure: three trees on taxa A, B, C, D, and E.]

• Completely unresolved or "star" phylogeny: a single polytomy (multifurcation)
• Partially resolved phylogeny
• Fully resolved, bifurcating phylogeny: every internal node is a bifurcation

Page 8: Phylogenetic Analysis

C-B Stewart, NHGRI lecture, 12/5/00

There are three possible unrooted trees for four taxa (A, B, C, D)

[Figure: the three unrooted trees. Tree 1 groups A with B and C with D; Tree 2 groups A with C and B with D; Tree 3 groups A with D and B with C.]

Phylogenetic tree building (or inference) methods are aimed at discovering which of the possible unrooted trees is "correct". We would like this to be the "true" biological tree, that is, one that accurately represents the evolutionary history of the taxa. However, we must settle for discovering the computationally correct or optimal tree for the phylogenetic method of choice.

Page 9: Phylogenetic Analysis

The number of unrooted trees increases in a greater than exponential manner with the number of taxa

(2N − 5)!! = # unrooted trees for N taxa
(2N − 3)!! = # rooted trees for N taxa

[Figure: example unrooted trees for 3 taxa (A to C), 4 taxa (A to D), 5 taxa (A to E), and 6 taxa (A to F).]

Page 10: Phylogenetic Analysis

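Introduction

• NP-hard optimization problem
  – Number of unrooted trees for n organisms = TU(n)
  – Number of edges in an unrooted tree for n organisms = E(n) = 2n − 3, n ≥ 2
  – $TU(n) = TU(n-1) \cdot E(n-1) = \prod_{i=2}^{n-1} E(i) = \prod_{i=3}^{n} (2i-5)$
  – Ex. [Figure: adding a leaf t to each of the three edges of the unrooted tree on x, y, z gives the three unrooted trees on x, y, z, t.]
  – Number of rooted trees for n organisms = TR(n) = TU(n) · E(n) = TU(n+1)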
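The counting argument can be checked numerically. This is a small sketch, with function names chosen here for illustration, that builds TU(n) from the recurrence (each new leaf can attach to any existing edge) and compares it with the double-factorial formulas (2N − 5)!! and (2N − 3)!!.

```python
def edges_unrooted(n):
    """E(n): number of edges in an unrooted binary tree with n leaves (n >= 2)."""
    return 2 * n - 3


def count_unrooted(n):
    """TU(n) via the recurrence TU(n) = TU(n-1) * E(n-1), starting from TU(2) = 1."""
    total = 1
    for i in range(2, n):
        total *= edges_unrooted(i)
    return total


def count_rooted(n):
    """TR(n) = TU(n) * E(n): a root can be placed on any edge of an unrooted tree."""
    return count_unrooted(n) * edges_unrooted(n)


def double_factorial(k):
    """k!! for odd k, with (-1)!! = 1!! = 1."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


for n in range(3, 11):
    assert count_unrooted(n) == double_factorial(2 * n - 5)
    assert count_rooted(n) == double_factorial(2 * n - 3) == count_unrooted(n + 1)
    print(n, count_unrooted(n), count_rooted(n))
```

For 10 taxa this already gives 2,027,025 unrooted and 34,459,425 rooted trees, which is why exhaustive search stops being practical quickly.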

Page 11: Phylogenetic Analysis

Inferring evolutionary relationships between the taxa requires rooting the tree:

To root a tree mentally, imagine that the tree is made of string. Grab the string at the root and tug on it until the ends of the string (the taxa) fall opposite the root.

[Figure: an unrooted tree on A, B, C, and D with the root position marked, and the rooted tree (leaves A, B, C, D) obtained by pulling at that point.]

Note that in this rooted tree, taxon A is no more closely related to taxon B than it is to C or D.

Page 12: Phylogenetic Analysis

Now, try it again with the root at another position:

[Figure: the same unrooted tree with the root moved, and the resulting rooted tree in which A and B form one group and C and D the other.]

Note that in this rooted tree, taxon A is most closely related to taxon B, and together they are equally distantly related to taxa C and D.

Page 13: Phylogenetic Analysis

An unrooted, four-taxon tree theoretically can be rooted in five different places to produce five different rooted trees

[Figure: the unrooted tree 1 (A and B on one side, C and D on the other) with root positions 1 to 5 marked, and the corresponding rooted trees 1a, 1b, 1c, 1d, and 1e.]

These trees show five different evolutionary relationships among the taxa!
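The five rootings can be enumerated by placing a root on each of the five edges in turn. The sketch below assumes unrooted tree 1 joins A and B at one internal node and C and D at the other; the internal node names "u" and "v" are hypothetical labels introduced only for this example.

```python
# Unrooted tree 1 as an adjacency map ("u" and "v" are made-up internal node names).
ADJ = {
    "A": ["u"], "B": ["u"], "C": ["v"], "D": ["v"],
    "u": ["A", "B", "v"], "v": ["C", "D", "u"],
}


def subtree(node, parent):
    """Nested-parentheses string for the part of the tree at `node`, seen from `parent`."""
    children = [n for n in ADJ[node] if n != parent]
    if not children:
        return node  # a leaf
    return "(" + ",".join(sorted(subtree(c, node) for c in children)) + ")"


def root_on_edge(a, b):
    """Place a root in the middle of edge a-b and write out the rooted tree."""
    return "(" + ",".join(sorted([subtree(a, b), subtree(b, a)])) + ")"


# Every undirected edge once: four pendant edges plus the internal edge u-v.
edges = sorted({tuple(sorted((a, b))) for a in ADJ for b in ADJ[a]})
rooted = [root_on_edge(a, b) for a, b in edges]

for (a, b), newick in zip(edges, rooted):
    print(f"root on edge {a}-{b}: {newick}")
```

Sorting the children before joining gives each rooted tree one canonical spelling, so the five strings can be compared directly; rooting on the internal edge yields ((A,B),(C,D)), and each pendant edge makes its taxon the outgroup.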

Page 14: Phylogenetic Analysis

All of these rearrangements show the same evolutionary relationships between the taxa

[Figure: rooted tree 1a redrawn several ways, with branches rotated around internal nodes so that A, B, C, and D appear in different orders.]

Page 15: Phylogenetic Analysis

Molecular phylogenetic tree building methods are mathematical and/or statistical methods for inferring the divergence order of taxa, as well as the lengths of the branches that connect them. There are many phylogenetic methods available today, each having strengths and weaknesses. Most can be classified as follows:

Classification by DATA TYPE and COMPUTATIONAL METHOD:

• Characters, optimality criterion: PARSIMONY, MAXIMUM LIKELIHOOD
• Distances, clustering algorithm: UPGMA, NEIGHBOR-JOINING
• Distances, optimality criterion: MINIMUM EVOLUTION, LEAST SQUARES

Page 16: Phylogenetic Analysis

parsimony

• model complexity vs. sample size
• minimize Hamming distance summed over all edges of the tree
• justification: minimum possible number of evolutionary events
• subject of serious dispute by systematic biologists

Page 17: Phylogenetic Analysis

Method
– Maximum parsimony (MP)
  • Seek the tree that minimizes the total number of evolutionary events on the edges of the tree
  • Ex. [Figure: three candidate trees on the sequences AAA, AAG, GGA, and AGA, with the number of changes marked on each edge (1, 1, 1 for the first tree; 1, 1, 2 for the other two).]
  • Requires two algorithms
    – Search over tree topologies
    – Computation of the cost of a given tree
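The second of the two algorithms, scoring one fixed tree, is commonly done with Fitch's algorithm; the slides do not name a method, so that choice is an assumption here. The sketch uses the four sequences that appear in the example figure, with made-up leaf names w, x, y, z, and scores the three possible groupings.

```python
def fitch_site(tree, column):
    """Return (state set, change count) for one site.

    `tree` is a nested 2-tuple of leaf names; `column` maps leaf name -> character.
    """
    if isinstance(tree, str):
        return {column[tree]}, 0
    left_set, left_cost = fitch_site(tree[0], column)
    right_set, right_cost = fitch_site(tree[1], column)
    common = left_set & right_set
    if common:
        # The children can agree on a state: no extra change at this node.
        return common, left_cost + right_cost
    # The children disagree: one change is needed on an edge below this node.
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, seqs):
    """Sum the per-site Fitch costs over all alignment columns."""
    length = len(next(iter(seqs.values())))
    total = 0
    for i in range(length):
        column = {name: s[i] for name, s in seqs.items()}
        total += fitch_site(tree, column)[1]
    return total


# Leaf names are hypothetical; the sequences are those in the slide figure.
seqs = {"w": "AAA", "x": "AAG", "y": "GGA", "z": "AGA"}
topologies = [
    (("w", "x"), ("y", "z")),
    (("w", "y"), ("x", "z")),
    (("w", "z"), ("x", "y")),
]
scores = [parsimony_score(t, seqs) for t in topologies]
print(scores)  # [3, 4, 4]
```

The scores 3, 4, and 4 match the edge totals in the figure (1 + 1 + 1 versus 1 + 1 + 2), so grouping AAA with AAG and GGA with AGA is the most parsimonious of the three.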

Page 18: Phylogenetic Analysis


maximum likelihood

• estimate probability that a specific evolutionary model will produce a particular phylogeny yielding the observed sequences

• many evolutionary models

Page 19: Phylogenetic Analysis

Method
– Maximum likelihood (ML)
  • Seek the tree that maximizes the likelihood P(data | tree)
  • Ex.
    – Compute the likelihood P(x1, x2, x3 | T, t1, t2, t3, t4)
    – x•: a set of sequences
    – T: a tree
    – t•: edge lengths of the tree
    [Figure: a rooted tree with nodes x1 to x5 and edge lengths t1 to t4.]
  • Requires two algorithms
    – Search over tree topologies
    – Search over all possible edge lengths t• to compute the likelihood
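To make P(x1, x2, x3 | T, t1, t2, t3, t4) concrete, here is a minimal sketch under two assumptions that the slide does not state: the substitution model is Jukes-Cantor, and the tree has root x5 with children x4 (edge t4) and x3 (edge t3), where x4 has children x1 (edge t1) and x2 (edge t2). The likelihood of one site sums over the unobserved states of x4 and x5.

```python
import math
from itertools import product

BASES = "ACGT"


def jc69(a, b, t):
    """P(b | a, t) under Jukes-Cantor; t is the expected number of substitutions per site."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if a == b else 0.25 - 0.25 * decay


def site_likelihood(x1, x2, x3, t1, t2, t3, t4):
    """Likelihood of one alignment column, summing over internal states x4 and x5."""
    total = 0.0
    for x5 in BASES:          # root state, uniform prior 1/4
        for x4 in BASES:      # state of the internal node above x1 and x2
            total += (0.25
                      * jc69(x5, x4, t4) * jc69(x5, x3, t3)
                      * jc69(x4, x1, t1) * jc69(x4, x2, t2))
    return total


def log_likelihood(seqs, lengths):
    """Sites are treated as independent, so per-site log-likelihoods add up."""
    return sum(math.log(site_likelihood(a, b, c, *lengths)) for a, b, c in zip(*seqs))


lengths = (0.1, 0.1, 0.2, 0.1)
# Sanity check: the probabilities of all 64 possible columns sum to 1.
total_probability = sum(site_likelihood(a, b, c, *lengths) for a, b, c in product(BASES, repeat=3))
print(round(total_probability, 12))
print(log_likelihood(("AAGT", "AAGT", "AGGT"), lengths))
```

The second search on the slide would wrap `log_likelihood` in a numerical optimizer over the four edge lengths; that step is left out here.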

Page 20: Phylogenetic Analysis

Distance Matrix Methods

• produce a tree such that the path distance between leaves i and j (sum of edge weights in the path between i and j) equals Dij

• this is the additive property for a distance matrix; of course, real distance matrices may not be additive

• most methods use agglomerative clustering: successively choosing pairs of nodes to combine

Page 21: Phylogenetic Analysis


Ultrametric trees

• path distance from the root to each leaf is the same

• strong molecular clock assumption - distance is proportional to evolutionary time

Page 22: Phylogenetic Analysis

Example Tree and Additive Matrix

[Figure: a tree with leaves a, b, c, d, and e and edge weights 2, 3, 3, 5, 3, 2, 1, and 1.]

     A    B    C    D    E
A    0   10   12   10    7
B         0    4    4   13
C              0    6   15
D                   0   13
E                        0
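Whether a matrix is additive or ultrametric can be tested directly with the standard four-point and three-point conditions, which the slides do not spell out. The sketch below applies both to the example matrix; helper names are illustrative.

```python
from itertools import combinations

NAMES = "ABCDE"
# Upper triangle of the example matrix, row by row.
ROWS = [[10, 12, 10, 7], [4, 4, 13], [6, 15], [13]]
UPPER = {
    (NAMES[i], NAMES[i + 1 + j]): d
    for i, row in enumerate(ROWS)
    for j, d in enumerate(row)
}


def dist(x, y):
    """Symmetric lookup into the upper triangle."""
    if x == y:
        return 0
    return UPPER[tuple(sorted((x, y)))]


def is_additive(names, tol=1e-9):
    """Four-point condition: of the three pair sums, the two largest are equal."""
    for a, b, c, d in combinations(names, 4):
        sums = sorted([dist(a, b) + dist(c, d),
                       dist(a, c) + dist(b, d),
                       dist(a, d) + dist(b, c)])
        if abs(sums[2] - sums[1]) > tol:
            return False
    return True


def is_ultrametric(names, tol=1e-9):
    """Three-point condition: of the three distances, the two largest are equal."""
    for a, b, c in combinations(names, 3):
        d = sorted([dist(a, b), dist(a, c), dist(b, c)])
        if abs(d[2] - d[1]) > tol:
            return False
    return True


print(is_additive(NAMES))     # True
print(is_ultrametric(NAMES))  # False
```

The matrix passes the four-point test but fails the three-point test (for A, B, C the distances are 4, 10, and 12), so it fits a tree exactly but not a clock-like one.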

Page 23: Phylogenetic Analysis

Distance Matrix Methods

• UPGMA
• Neighbor Joining
• Fitch-Margoliash
• Quartet Puzzling
• Witness-Antiwitness
• Double Pivot

many are "not yet in use by the systematic biology community"

Page 24: Phylogenetic Analysis

Distance Measures

• DNA hybridization amounts
• immunological distances
• genetic distances
• sequence distances (DNA, RNA, protein)

Page 25: Phylogenetic Analysis


…what distance?

• need distance measure that reflects the actual number of point mutations on the path between the leaves

• particular problem with sequence data - Hamming distance and assumption of no reversals
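One common way to correct the Hamming distance for unseen repeated and reverse changes is the Jukes-Cantor formula d = −(3/4) ln(1 − 4p/3), where p is the fraction of differing sites. The slides do not name a specific correction, so this choice is an assumption made for illustration.

```python
import math


def p_distance(s1, s2):
    """Fraction of aligned sites that differ (Hamming distance divided by length)."""
    if len(s1) != len(s2):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(s1, s2)) / len(s1)


def jukes_cantor_distance(p):
    """Estimated substitutions per site; grows without bound as p approaches 3/4."""
    if p >= 0.75:
        return math.inf  # saturated: the observed difference carries no distance signal
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


for p in (0.05, 0.3, 0.6):
    print(p, round(jukes_cantor_distance(p), 3))
```

For small p the corrected value stays close to p, while at p = 0.6 it is roughly twice as large, which shows how much the raw Hamming distance underestimates long paths.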

Page 26: Phylogenetic Analysis


UPGMA

• Unweighted Pair-Group Method with Arithmetic mean

Page 27: Phylogenetic Analysis

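UPGMA Step 1: combine B and C

[Figure: leaves a, e, c, b, and d, not yet joined.]

     A    B    C    D    E
A    0   10   12   10    7
B         0    4    4   13
C              0    6   15
D                   0   13
E                        0

The smallest distance is 4 (B–C and B–D tie; B and C are combined first).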

Page 28: Phylogenetic Analysis

UPGMA step 2: combine BC and D

[Figure: b and c joined, each on an edge of length 2.]

      A    BC    D    E
A     0    11   10    7
BC          0    5   14
D                0   13
E                     0

d(A, BC) = (10 + 12)/2 = 11
d(BC, D) = (4 + 6)/2 = 5

Page 29: Phylogenetic Analysis

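UPGMA step 3: combine A and E

       A    BCD     E
A      0   10.5     7
BCD          0   13.5
E                   0

[Figure: b and c joined on edges of length 2, then joined with d (edge 2.5, internal edge 0.5).]

Here 10.5 = (11 + 10)/2 and 13.5 = (14 + 13)/2, which averages the two merged clusters with equal weight. Weighting by cluster size, as UPGMA is usually defined, would give 32/3 ≈ 10.67 and 41/3 ≈ 13.67; the merge order is the same either way.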

Page 30: Phylogenetic Analysis

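UPGMA step 4: combine AE and BCD

       AE   BCD
AE      0    12
BCD           0

[Figure: a and e joined on edges of length 3.5; b and c joined on edges of length 2, then joined with d (edge 2.5, internal edge 0.5).]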

Page 31: Phylogenetic Analysis

UPGMA Result

[Figure: the final UPGMA tree on leaves a, e, c, b, and d with its edge lengths, shown next to the input distance matrix.]

     A    B    C    D    E
A    0   10   12   10    7
B         0    4    4   13
C              0    6   15
D                   0   13
E                        0

Page 32: Phylogenetic Analysis

UPGMA Result

[Figure: the UPGMA tree shown next to the original example tree (edge weights 2, 3, 3, 5, 3, 2, 1, and 1) for comparison.]
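The four clustering steps can be replayed in code. This is a sketch rather than a reference implementation: cluster names are built by concatenating member names, ties are broken alphabetically, and the `size_weighted` flag is an addition made here to contrast the equal-weight averaging used in the slide tables with size-weighted averaging.

```python
from itertools import combinations

NAMES = "ABCDE"
ROWS = [[10, 12, 10, 7], [4, 4, 13], [6, 15], [13]]  # upper triangle of the example matrix
MATRIX = {
    frozenset((NAMES[i], NAMES[i + 1 + j])): d
    for i, row in enumerate(ROWS)
    for j, d in enumerate(row)
}


def cluster(matrix, size_weighted):
    """Agglomerative clustering; returns the merge steps as (cluster, cluster, distance).

    size_weighted=True weights each merged cluster by its number of leaves;
    False averages the two merged clusters equally, as in the slide tables.
    """
    dist = dict(matrix)
    sizes = {name: 1 for pair in matrix for name in pair}
    steps = []
    while len(sizes) > 1:
        # Closest pair; min() keeps the first of any tied pairs (alphabetical order).
        a, b = min(combinations(sorted(sizes), 2), key=lambda p: dist[frozenset(p)])
        d_ab = dist.pop(frozenset((a, b)))
        merged = a + b
        wa, wb = (sizes[a], sizes[b]) if size_weighted else (1, 1)
        for other in sizes:
            if other not in (a, b):
                da = dist.pop(frozenset((a, other)))
                db = dist.pop(frozenset((b, other)))
                dist[frozenset((merged, other))] = (wa * da + wb * db) / (wa + wb)
        sizes[merged] = sizes.pop(a) + sizes.pop(b)
        steps.append((a, b, d_ab))
    return steps


slide_steps = cluster(MATRIX, size_weighted=False)
upgma_steps = cluster(MATRIX, size_weighted=True)
for a, b, d in slide_steps:
    print(f"join {a} and {b} at distance {d}, node height {d / 2}")
```

With equal weighting the merges happen at distances 4, 5, 7, and 12, giving node heights 2, 2.5, 3.5, and 6. With size weighting the order is unchanged but the last merge happens at 73/6 ≈ 12.17.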

Page 33: Phylogenetic Analysis

Method

• Phylogenetic reconstruction techniques
  – NJ (neighbor-joining method)
    • Starting from a star tree, a branch is successively inserted between a pair of closest neighbors and the remaining terminals in the tree
    • Characteristics
      – The fastest reconstruction method
      – Poor accuracy when the distance matrix contains large values

Page 34: Phylogenetic Analysis

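Method
• Ex.

Distance matrix:

      S1   S2   S3   S4
S1     0    4    4    3
S2          0    6    5
S3               0    2
S4                    0

Star tree: each terminal is joined to a central node X by an edge whose length is its average distance to the other terminals (S1 = 3.67, S2 = 5, S3 = 4, S4 = 3.33).

  – The cost saved by pairing S1 and S2 = old connection cost (OC) − new connection cost (NC) = 2.33
    OC = average(S1) + average(S2) = 8.67
    NC = ½(average(S1) + average(S2) + d(S1,S2)) = 6.33
  – The largest cost saving, 2.67, comes from pairing S3 and S4. Thus we pair S3 and S4.

[Figure: the star tree around X, and the tree after pairing S1 and S2 on a new internal node X.]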
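The savings for all six pairs can be computed the same way. The sketch follows the cost definitions given on the slide (not the Q-matrix form of neighbor joining found in most textbooks); function names are chosen here for illustration.

```python
from itertools import combinations

NAMES = ["S1", "S2", "S3", "S4"]
D = {
    ("S1", "S2"): 4, ("S1", "S3"): 4, ("S1", "S4"): 3,
    ("S2", "S3"): 6, ("S2", "S4"): 5, ("S3", "S4"): 2,
}


def d(x, y):
    """Symmetric distance lookup."""
    return D[(x, y)] if (x, y) in D else D[(y, x)]


def average(x):
    """Average distance from x to every other terminal (its star-tree edge length)."""
    others = [n for n in NAMES if n != x]
    return sum(d(x, o) for o in others) / len(others)


def saving(x, y):
    """Old connection cost minus new connection cost for pairing x and y."""
    old = average(x) + average(y)
    new = 0.5 * (average(x) + average(y) + d(x, y))
    return old - new


savings = {pair: saving(*pair) for pair in combinations(NAMES, 2)}
best = max(savings, key=savings.get)

for pair, value in savings.items():
    print(pair, round(value, 2))
print("pair first:", best)
```

The saving for S1 and S2 is 2.33 and the maximum, 2.67, belongs to S3 and S4, in agreement with the slide.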

Page 35: Phylogenetic Analysis

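Neighbor-Joining Result

[Figure: the neighbor-joining tree on a, e, c, b, and d (edge labels 2, 5, 3, 1, 6, 1.5, and 2) shown next to the original example tree (edge weights 2, 3, 3, 5, 3, 2, 1, and 1).]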

Page 36: Phylogenetic Analysis

Genome Rearrangement
– Generalized Nadeau-Taylor (GNT) evolution model
  • P(transposition) = α
  • P(inverted transposition) = β
  • P(inversion) = 1 − (α + β)
  • Number of events on an edge: follows a Poisson distribution
    $f(x) = \frac{\lambda^{x} e^{-\lambda}}{x!}, \quad x = 1, 2, \ldots$
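The two ingredients of the model, the Poisson probability of x events and the type of each event, are easy to write down. This sketch only draws event types; it does not apply them to a gene order, and the parameter values are arbitrary examples. The normalization check sums from x = 0, the usual Poisson support, whereas the slide lists x = 1, 2, ….

```python
import math
import random


def poisson_pmf(x, lam):
    """f(x) = lam^x * e^(-lam) / x!"""
    return lam ** x * math.exp(-lam) / math.factorial(x)


def sample_event_types(n_events, alpha, beta, rng):
    """Draw the type of each of n_events rearrangements with the GNT probabilities."""
    kinds = ["transposition", "inverted transposition", "inversion"]
    weights = [alpha, beta, 1.0 - (alpha + beta)]
    return rng.choices(kinds, weights=weights, k=n_events)


lam = 2.0  # example rate, not taken from the slides
print([round(poisson_pmf(x, lam), 4) for x in range(6)])

rng = random.Random(0)  # fixed seed so repeated runs draw the same events
events = sample_event_types(10, alpha=0.2, beta=0.1, rng=rng)
print(events)
```

Setting α = β = 0 reduces the model to inversions only.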

Page 37: Phylogenetic Analysis


Improving reconstruction algorithms

Page 38: Phylogenetic Analysis

Improving reconstruction algorithms
– Estimators of true evolutionary distance
  • Exact-IEBP (inverting the expected breakpoint distance): ML estimate of the breakpoint distance after K rearrangements
  • Approx-IEBP: approximates Exact-IEBP
  • EDE (empirically derived estimator): empirical estimate of the inversion distance after K rearrangements; it uses a nonlinear regression formula that computes the expected distance given K random rearrangements

Page 39: Phylogenetic Analysis

Conclusion

• A new generation of phylogenetic software needs
  – More sophisticated models of evolution
  – Faster optimization algorithms
  – High-performance algorithm engineering
  – Powerful modes of user interaction