1 BCB 444/544 F07 ISU Terribilini #30- Phylogenetics - Distance-Based Methods 11/02/07 BCB 444/544 Lecture 30 Phylogenetics – Distance-Based Methods #30_Nov02


Page 1: BCB 444/544

BCB 444/544

Lecture 30

Phylogenetics – Distance-Based Methods

#30_Nov02

Page 2: BCB 444/544

Required Reading (before lecture)

Wed Oct 31 - Lecture 29

Phylogenetics Basics

• Chp 10 - pp 127 - 141

Thurs Nov 1 - Lab 9

Gene & Regulatory Element Prediction

Fri Nov 2 - Lecture 30

Phylogenetics – Distance-Based Methods

• Chp 11 - pp 142 - 169

Mon Nov 5 - Lecture 31

Phylogenetics – Parsimony and ML

• Chp 11 - pp 142 - 169

Page 3: BCB 444/544

Assignments & Announcements

Mon Oct 29 - HW#5

HW#5 = Hands-on exercises with phylogenetics and tree-building software

Due: Mon Nov 5 (not Fri Nov 2 as previously posted)

Page 4: BCB 444/544

BCB 544 "Team" Projects

Last week of classes will be devoted to Projects

• Written reports due:
  • Mon Dec 3 (no class that day)
• Oral presentations (20-30 min) will be:
  • Wed-Fri Dec 5, 6, 7
  • 1 or 2 teams will present during each class period

See Guidelines for Projects posted online

Page 5: BCB 444/544

BCB 544 Only: New Homework Assignment

544 Extra#2

Due: √PART 1 - ASAP

PART 2 - meeting prior to 5 PM Fri Nov 2

Part 1 - Brief outline of Project, email to Drena & Michael

after response/approval, then:

Part 2 - More detailed outline of project

Read a few papers and summarize status of problem

Schedule meeting with Drena & Michael to discuss ideas

Page 6: BCB 444/544

Seminars this Week

BCB List of URLs for Seminars related to Bioinformatics: http://www.bcb.iastate.edu/seminars/index.html

• Nov 2 Fri - BCB Faculty Seminar 2:10 in 102 ScI
  • Bob Jernigan, BBMB, ISU
  • Control of Protein Motions by Structure

Page 7: BCB 444/544

Chp 10 - Phylogenetics

SECTION IV MOLECULAR PHYLOGENETICS

Xiong: Chp 10 Phylogenetics Basics

• Evolution and Phylogenetics
• Terminology
• Gene Phylogeny vs. Species Phylogeny
• Forms of Tree Representation
• Why Finding a True Tree is Difficult
• Procedure of Building a Phylogenetic Tree

Page 8: BCB 444/544

Tree Building Procedure

• Choose molecular markers
• Perform MSA
• Choose a model of evolution
• Determine tree building method
• Assess tree reliability

Page 9: BCB 444/544

Choice of Molecular Markers

• Very closely related organisms - use nucleic acid sequences, which show more differences than protein sequences
• For individuals within a species - use noncoding regions of mtDNA, which have a faster mutation rate
• More distantly related species - use slowly evolving nucleic acid sequences like ribosomal RNA, or protein sequences
• Very distantly related species - use highly conserved protein sequences

Page 10: BCB 444/544

Multiple Sequence Alignment

• Most critical step in tree building - cannot build correct tree without correct alignment
• Should build alignments with multiple programs, then inspect and compare to identify the most reasonable one
• Most alignments need manual editing
• Make sure important functional residues align
• Align secondary structure elements
• Use full alignment or just parts

Page 11: BCB 444/544

Automatic Editing of Alignments

• Rascal and NorMD – correct alignment errors, remove potentially unrelated or highly divergent sequences
• Gblocks – detect and eliminate poorly aligned positions and divergent regions

Page 12: BCB 444/544

How do we measure divergence between sequences?

• Simple measure – just count the number of substitutions observed between the sequences in the MSA
• Problem – number of substitutions may not represent the number of evolutionary events that actually occurred
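
The simple measure can be sketched in a few lines of Python. This is only an illustration: the function name `p_distance`, the choice to skip gapped columns, and the two toy sequences are assumptions, not part of the lecture.

```python
def p_distance(seq_a: str, seq_b: str) -> float:
    """Fraction of aligned, ungapped columns at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must come from the same alignment")
    compared = 0
    mismatches = 0
    for x, y in zip(seq_a.upper(), seq_b.upper()):
        if x == "-" or y == "-":
            continue  # assumption: columns with a gap in either sequence are skipped
        compared += 1
        if x != y:
            mismatches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return mismatches / compared


# Hypothetical aligned sequences: they differ at 2 of 10 sites.
print(p_distance("ACGTACGTAC", "ACGTTCGTAA"))
```

The count only sees the final state of each site, which is exactly the problem named in the second bullet: hidden or reverted changes are not counted.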

Page 13: BCB 444/544

Multiple Substitutions

[Figure: substitution history at a single site, with nucleotide labels C, A, A, G, T]

Just because we only see one difference does not mean that there was only one evolutionary event

Page 14: BCB 444/544

Multiple Substitutions

[Figure: substitution history at a single site, with nucleotide labels A, A, A, G, T]

Just because we see no difference does not mean that there were no evolutionary events

Page 15: BCB 444/544

Choosing Substitution Models

• Statistical models of evolution are used to correct for the multiple substitution problem
• Focus on DNA models

Page 16: BCB 444/544

Jukes-Cantor Model

• Jukes-Cantor model assumes all nucleotides are substituted with equal probability
• Can be used to correct for multiple substitutions
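
As a hedged illustration of the correction, the standard Jukes-Cantor distance is d = -(3/4) ln(1 - (4/3) p), where p is the observed proportion of differing sites. The formula is not written out on the slide; the function name and the sample values of p below are assumptions chosen for the sketch.

```python
import math


def jukes_cantor_distance(p: float) -> float:
    """Jukes-Cantor estimate of substitutions per site from the observed proportion p."""
    if not 0.0 <= p < 0.75:
        # Beyond p = 0.75 the sequences look random under this model and the log is undefined.
        raise ValueError("p must be in [0, 0.75) for the correction to be defined")
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * p)


# Illustrative values of p (not from the lecture).
for p in (0.05, 0.20, 0.50):
    print(f"p = {p:.2f}  ->  d = {jukes_cantor_distance(p):.4f}")
```

The corrected distance is always at least as large as p, and the gap widens as sequences diverge, which reflects the growing number of hidden substitutions.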

Page 17: BCB 444/544

Many Other Models

Page 18: BCB 444/544

Evolutionary Models for Protein Sequences

• PAM and JTT substitution matrices already take into account multiple substitutions
• There are also models similar to Jukes-Cantor for protein sequences

Page 19: BCB 444/544

What about differences in mutation rates between positions within a sequence?

• One of our assumptions was that all positions in a sequence are evolving at the same rate
• Bad assumption
  • Third position in a codon changes with higher frequency
  • In proteins, some amino acids can change and others cannot
• This variation is called among-site rate heterogeneity
• Many tree building programs have parameters meant to deal with this problem – adds to complexity of getting the correct tree

Page 20: BCB 444/544

Chp 11 – Phylogenetic Tree Construction Methods and Programs

SECTION IV MOLECULAR PHYLOGENETICS

Xiong: Chp 11 Phylogenetic Tree Construction Methods and Programs

• Distance-Based Methods
• Character-Based Methods
• Phylogenetic Tree Evaluation
• Phylogenetic Programs

Page 21: BCB 444/544

Tree Construction

• Two main categories of tree building methods
  • Distance-based
    • Overall similarity between sequences
  • Character-based
    • Consider the entire MSA

Page 22: BCB 444/544

Distance-Based Methods

• Given an MSA and an evolutionary model, calculate the distance between all pairs of sequences
• Construct distance matrix
• Construct phylogenetic tree based on the distance matrix
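
The first two steps might look like the following sketch. It assumes the Jukes-Cantor model as the evolutionary model; the alignment `toy_msa` and the function names are hypothetical and chosen only for illustration.

```python
import math
from itertools import combinations


def jc_distance(seq_a, seq_b):
    """Jukes-Cantor distance between two aligned sequences, ignoring gapped columns."""
    pairs = [(x, y) for x, y in zip(seq_a, seq_b) if x != "-" and y != "-"]
    p = sum(x != y for x, y in pairs) / len(pairs)
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def distance_matrix(alignment):
    """Return a symmetric nested dict: dist[a][b] is the distance between a and b."""
    names = list(alignment)
    dist = {name: {name: 0.0} for name in names}
    for a, b in combinations(names, 2):
        dist[a][b] = dist[b][a] = jc_distance(alignment[a], alignment[b])
    return dist


toy_msa = {  # hypothetical aligned sequences, not taken from the lecture
    "a": "ACGTACGTACGT",
    "b": "ACGTACGTACGA",
    "c": "ACGTTCGTACGA",
    "d": "TCGTTCGAACGA",
}
matrix = distance_matrix(toy_msa)
for name in toy_msa:
    print(name, [round(matrix[name][other], 3) for other in toy_msa])
```

The resulting matrix is the only input the tree-building step sees; the sequences themselves are no longer consulted.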

Page 23: BCB 444/544

Distance Matrices

     a    b    c    d
a    0
b    6    0
c    7    3    0
d   14   10    9    0

[Figure: tree of taxa a, b, c, d drawn against a distance scale from 0 to 8]

Page 24: BCB 444/544

Distance-Based Methods

• Two ways to construct a tree based on a distance matrix
  • Clustering
  • Optimality

Page 25: BCB 444/544

Clustering-Based Methods

• E.g., UPGMA and Neighbor-Joining
• A cluster is a set of taxa
• Interspecies distances translate into intercluster distances
• Clusters are repeatedly merged
  • “Closest” clusters merged first
  • Distances are recomputed after merging

Page 26: BCB 444/544

UPGMA

• UPGMA – Unweighted Pair Group Method Using Arithmetic Average
• Uses molecular clock assumption – all taxa evolve at a constant rate and are equally distant from the root (ultrametric tree)
• This assumption is usually wrong
• So why use UPGMA?
  • Very fast
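
A minimal sketch of UPGMA, run on the four-taxon distance matrix (a, b, c, d) shown earlier in the lecture. The function name `upgma`, the Newick-like cluster labels, and the returned list of (cluster, node height) pairs are assumptions made for this illustration, not the lecture's own notation.

```python
from itertools import combinations


def upgma(names, pair_dist):
    """Cluster taxa by UPGMA; return a list of (cluster_label, node_height)."""
    size = {name: 1 for name in names}  # cluster label -> number of taxa in it
    d = {frozenset(pair): float(value) for pair, value in pair_dist.items()}
    merges = []
    while len(size) > 1:
        # Pick the pair of current clusters with the smallest distance.
        a, b = min(combinations(size, 2), key=lambda p: d[frozenset(p)])
        merged = f"({a},{b})"
        # Under the molecular clock the new node sits halfway between a and b.
        merges.append((merged, d[frozenset((a, b))] / 2))
        # Size-weighted mean equals the arithmetic average over all original taxon pairs.
        for other in size:
            if other not in (a, b):
                d[frozenset((merged, other))] = (
                    size[a] * d[frozenset((a, other))]
                    + size[b] * d[frozenset((b, other))]
                ) / (size[a] + size[b])
        size[merged] = size[a] + size[b]
        del size[a], size[b]
    return merges


pair_dist = {
    ("a", "b"): 6, ("a", "c"): 7, ("b", "c"): 3,
    ("a", "d"): 14, ("b", "d"): 10, ("c", "d"): 9,
}
for label, height in upgma(["a", "b", "c", "d"], pair_dist):
    print(label, height)
```

Each merge is a lookup plus an averaging pass, which is why the method is fast; the price is that every leaf is forced to the same distance from the root.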

Page 27: BCB 444/544

UPGMA Example

Page 28: BCB 444/544

UPGMA Example

Page 29: BCB 444/544

UPGMA Example

Page 30: BCB 444/544

UPGMA Example

Page 31: BCB 444/544

Neighbor Joining

• Idea: Find a pair of taxa that are close to each other but far from other taxa
  • Implicitly finds a pair of neighboring taxa
• No molecular clock assumption

Page 32: BCB 444/544

Neighbor Joining

• NJ corrects for unequal evolutionary rates between sequences by using a conversion step
• The conversion step requires calculation of “r-values” and “transformed r-values”

Page 33: BCB 444/544

Neighbor Joining

The r-value for a sequence is:

$$r_i = \sum_{j \neq i} d_{ij}$$

The sum of the distances between sequence i and all other sequences

Page 34: BCB 444/544

Neighbor Joining

The transformed r-value for a sequence is:

$$r'_i = \frac{r_i}{n - 2}$$

Where n is the number of taxa

Transformed r-values are used to determine the distance of a taxon to the nearest node

Page 35: BCB 444/544

Neighbor Joining

The converted distance between two sequences is:

$$d'_{ij} = d_{ij} - \frac{1}{2}\left(r_i + r_j\right)$$

These converted distances are used in building the tree

Page 36: BCB 444/544

Neighbor Joining

The final equation we need is for computing the distance from a new cluster to each taxon. Assume taxa i and j were merged into a cluster u. The distance from taxon i to cluster u is:

$$d_{iu} = \frac{d_{ij} + r'_i - r'_j}{2}$$

Page 37: BCB 444/544

Neighbor Joining Example

     A      B      C
B    0.40
C    0.35   0.45
D    0.60   0.70   0.55

Page 38: BCB 444/544

Neighbor Joining Example

• Initialize tree into a star shape with all taxa connected to the center
• Step 1: Compute r-values and transformed r-values for all taxa

$$r_A = d_{AB} + d_{AC} + d_{AD} = 0.4 + 0.35 + 0.6 = 1.35$$

$$r'_A = \frac{r_A}{4 - 2} = \frac{1.35}{2} = 0.675$$
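
Step 1 for all four taxa can be sketched as follows, using the example distance matrix for A, B, C, D. The helper names (`d`, `r_values`, `transformed_r_values`) are illustrative assumptions.

```python
# Pairwise distances from the Neighbor Joining example matrix.
dist = {
    ("A", "B"): 0.40, ("A", "C"): 0.35, ("A", "D"): 0.60,
    ("B", "C"): 0.45, ("B", "D"): 0.70, ("C", "D"): 0.55,
}
taxa = ["A", "B", "C", "D"]


def d(i, j):
    """Symmetric lookup into the lower-triangular distance table."""
    return dist[(i, j)] if (i, j) in dist else dist[(j, i)]


def r_values(taxa):
    """r_i: sum of the distances between taxon i and all other taxa."""
    return {i: sum(d(i, j) for j in taxa if j != i) for i in taxa}


def transformed_r_values(r, n):
    """r'_i = r_i / (n - 2), where n is the number of taxa."""
    return {i: r_i / (n - 2) for i, r_i in r.items()}


r = r_values(taxa)
r_prime = transformed_r_values(r, len(taxa))
for t in taxa:
    print(t, round(r[t], 3), round(r_prime[t], 3))
```

For taxon A this reproduces the values worked out on the slide (1.35 and 0.675), up to floating-point rounding.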

Page 39: BCB 444/544

Neighbor Joining Example

• Step 2: Compute converted distances

$$d'_{AB} = d_{AB} - \frac{1}{2}\left(r_A + r_B\right) = 0.4 - \frac{1}{2}\left(1.35 + 1.55\right) = -1.05$$

Page 40: BCB 444/544

Neighbor Joining Example

• Step 3: Fill out converted distance matrix

     A       B       C
B   -1.05
C   -1.00   -1.00
D   -1.00   -1.00   -1.05
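
The converted distance matrix of Steps 2 and 3 can be reproduced with a short sketch. It applies the lecture's conversion formula (distance minus half the sum of the two r-values) to the example matrix; the variable names and the tolerance used to detect ties are assumptions.

```python
from itertools import combinations

dist = {
    ("A", "B"): 0.40, ("A", "C"): 0.35, ("A", "D"): 0.60,
    ("B", "C"): 0.45, ("B", "D"): 0.70, ("C", "D"): 0.55,
}
taxa = ["A", "B", "C", "D"]


def d(i, j):
    """Symmetric lookup into the lower-triangular distance table."""
    return dist[(i, j)] if (i, j) in dist else dist[(j, i)]


# r-values: sum of distances from each taxon to all others (Step 1).
r = {i: sum(d(i, j) for j in taxa if j != i) for i in taxa}

# Converted distance: d'_ij = d_ij - (r_i + r_j) / 2 (Step 2).
converted = {(i, j): d(i, j) - 0.5 * (r[i] + r[j]) for i, j in combinations(taxa, 2)}
for pair, value in converted.items():
    print(pair, round(value, 2))

# Pairs tied for the smallest converted distance (tolerance guards float rounding).
smallest = min(converted.values())
ties = [pair for pair, value in converted.items() if abs(value - smallest) < 1e-9]
print("closest pairs:", ties)
```

With four taxa, half the sum of two r-values equals the sum of the two transformed r-values, so either form gives the same matrix here; for other values of n the two differ, and many NJ descriptions use the (n - 2)-scaled form.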

Page 41: BCB 444/544

Neighbor Joining Example

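• Step 4: Create a node by merging closest taxa
  • In this example, the converted distance between A and B is the same as the converted distance between C and D
  • We can pick either pair to start with
  • Let’s pick A and B and create a node called U

[Figure: star tree of A, B, C, D, and the tree after joining A and B at node U with branch lengths still unknown (?)]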

Page 42: BCB 444/544

Neighbor Joining Example

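• Step 5: Compute branch lengths
  • Use the equation for computing the distance from a taxon to a node

$$d_{AU} = \frac{d_{AB} + r'_A - r'_B}{2} = \frac{0.4 + 0.675 - 0.775}{2} = 0.15$$

[Figure: node U joined to A by a branch of length 0.15 and to B by a branch of length 0.25]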
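
The branch-length calculation in Step 5 can be sketched as a small function. The function name and argument names are assumptions; the length for B is obtained by applying the same equation with the roles of the two taxa swapped.

```python
def branch_lengths(d_ij, r_prime_i, r_prime_j):
    """Lengths of the branches from taxa i and j to their new node u."""
    d_iu = (d_ij + r_prime_i - r_prime_j) / 2
    d_ju = (d_ij + r_prime_j - r_prime_i) / 2  # same equation with i and j swapped
    return d_iu, d_ju


# Values from the example: d_AB = 0.40, r'_A = 0.675, r'_B = 0.775.
d_au, d_bu = branch_lengths(0.40, 0.675, 0.775)
print(round(d_au, 2), round(d_bu, 2))
```

The two branch lengths always add up to the original distance between the merged taxa, which gives a quick sanity check (0.15 + 0.25 = 0.40 in the example).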

Page 43: BCB 444/544

Neighbor Joining Example

• Step 6: Construct reduced distance matrix by computing converted distances from each taxon to the new node U
  • In UPGMA, we simply calculated the average

$$d_{CU} = \frac{(d_{AC} - d_{UA}) + (d_{BC} - d_{UB})}{2} = \frac{(0.35 - 0.15) + (0.45 - 0.25)}{2} = 0.2$$

Page 44: BCB 444/544

Neighbor Joining Example

Our reduced distance matrix:

     U      C
C    0.20
D    0.45   0.55
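
A sketch of how the reduced matrix entries follow from Step 6, using the branch lengths 0.15 (A to U) and 0.25 (B to U) from Step 5. The function name `distance_to_new_node` and the dictionary layout are illustrative assumptions.

```python
def distance_to_new_node(d_ik, d_jk, d_iu, d_ju):
    """Distance from a remaining taxon k to node u, which replaces merged taxa i and j."""
    return ((d_ik - d_iu) + (d_jk - d_ju)) / 2


d_au, d_bu = 0.15, 0.25  # branch lengths from Step 5

d_cu = distance_to_new_node(0.35, 0.45, d_au, d_bu)  # uses d_AC and d_BC
d_du = distance_to_new_node(0.60, 0.70, d_au, d_bu)  # uses d_AD and d_BD

# C and D were not merged, so their distance carries over unchanged.
reduced = {("U", "C"): d_cu, ("U", "D"): d_du, ("C", "D"): 0.55}
for pair, value in reduced.items():
    print(pair, round(value, 2))
```

Unlike the plain averaging used in UPGMA, each original distance is first shortened by the branch already assigned to A or B, so the new node U sits at the correct position along those branches.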

Page 45: BCB 444/544

Neighbor Joining Example

• From here, we go back to step 1
• Continue until all taxa have been decomposed from the star tree

Page 46: BCB 444/544

Optimality-Based Methods

• Clustering methods produce a single tree with no ability to judge how good it is compared to alternative tree topologies
• Optimality-based methods compare all possible tree topologies and select a tree that best fits the distance matrix
• Two algorithms:
  • Fitch-Margoliash
  • Minimum evolution
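
To give a sense of what "all possible tree topologies" means in practice, the sketch below counts bifurcating topologies with the standard double-factorial formulas: (2n - 5)!! unrooted and (2n - 3)!! rooted trees for n taxa. The formulas are not stated on the slide, and the function names are assumptions.

```python
def double_factorial(k: int) -> int:
    """k!! = k * (k - 2) * (k - 4) * ... down to 1 or 2."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def unrooted_topologies(n: int) -> int:
    """Number of unrooted bifurcating tree topologies for n >= 3 taxa."""
    if n < 3:
        raise ValueError("need at least 3 taxa")
    return double_factorial(2 * n - 5)


def rooted_topologies(n: int) -> int:
    """Number of rooted bifurcating tree topologies for n >= 2 taxa."""
    if n < 2:
        raise ValueError("need at least 2 taxa")
    return double_factorial(2 * n - 3)


for n in (4, 5, 10, 20):
    print(n, unrooted_topologies(n), rooted_topologies(n))
```

The count grows so quickly that exhaustive comparison is only feasible for small numbers of taxa, which is the source of the computational cost of optimality-based methods.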

Page 47: BCB 444/544

Fitch-Margoliash

• Selects best tree among all possible trees based on minimum deviation between distances calculated in the tree and distances in the distance matrix
• Basically, a least squares method
  • Dij = distance between i and j in matrix
  • dij = distance between i and j in tree
• Objective: Find tree that minimizes

$$\sum_{1 \le i < j \le n} \left(D_{ij} - d_{ij}\right)^2$$
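
The objective can be evaluated for any candidate tree once its path lengths are known. The sketch below scores two hypothetical candidates against the four-taxon matrix (a, b, c, d) shown earlier in the lecture; both sets of tree distances are illustrative assumptions, and the unweighted sum of squares is used exactly as written above.

```python
from itertools import combinations

taxa = ["a", "b", "c", "d"]

# D_ij: distances from the four-taxon matrix shown earlier in the lecture.
matrix_dist = {
    frozenset("ab"): 6, frozenset("ac"): 7, frozenset("bc"): 3,
    frozenset("ad"): 14, frozenset("bd"): 10, frozenset("cd"): 9,
}

# d_ij: path lengths in two HYPOTHETICAL candidate trees (assumed for illustration).
candidates = {
    # Clock-like tree in which all leaves are equally far from the root.
    "ultrametric": {
        frozenset("ab"): 6.5, frozenset("ac"): 6.5, frozenset("bc"): 3,
        frozenset("ad"): 11, frozenset("bd"): 11, frozenset("cd"): 11,
    },
    # Tree ((a,b),(c,d)) with branches a=5, b=1, c=1, d=8 and internal branch 1.
    "additive": {
        frozenset("ab"): 6, frozenset("ac"): 7, frozenset("bc"): 3,
        frozenset("ad"): 14, frozenset("bd"): 10, frozenset("cd"): 9,
    },
}


def least_squares_score(matrix_dist, tree_dist, taxa):
    """Sum over all pairs i < j of (D_ij - d_ij) squared."""
    return sum(
        (matrix_dist[frozenset(pair)] - tree_dist[frozenset(pair)]) ** 2
        for pair in combinations(taxa, 2)
    )


scores = {name: least_squares_score(matrix_dist, tree, taxa) for name, tree in candidates.items()}
print(scores)
print("best candidate:", min(scores, key=scores.get))
```

A real search would generate the candidate topologies and fit branch lengths for each; here the candidates are supplied by hand so that only the scoring step is shown.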

Page 48: BCB 444/544

Minimum Evolution

• Similar to Fitch-Margoliash, but uses a different optimality criterion
• Searches for a tree with the minimum total branch length
• This is an indirect way of achieving the best fit of the branch lengths with the original data

Page 49: BCB 444/544

Summary of Distance-Based Methods

• Clustering-based methods:
  • Computationally very fast and can handle large datasets that other methods cannot
  • Not guaranteed to find the best tree
• Optimality-based methods:
  • Better overall accuracies
  • Computationally slow
• All distance-based methods lose all sequence information and cannot infer the most likely state at an internal node