BCB 444/544 F07 ISU Terribilini #30 - Phylogenetics - Distance-Based Methods 11/02/07
BCB 444/544
Lecture 30
Phylogenetics – Distance-Based Methods
#30_Nov02
Required Reading (before lecture)
Wed Oct 31 - Lecture 29
Phylogenetics Basics
• Chp 10 - pp 127 - 141
Thurs Nov 1 - Lab 9
Gene & Regulatory Element Prediction
Fri Nov 2 - Lecture 30
Phylogenetics – Distance-Based Methods
• Chp 11 - pp 142 - 169
Mon Nov 5 - Lecture 31
Phylogenetics – Parsimony and ML
• Chp 11 - pp 142 - 169
Assignments & Announcements
Mon Oct 29 - HW#5
HW#5 = Hands-on exercises with phylogenetics and tree-building software
Due: Mon Nov 5 (not Fri Nov 1 as previously posted)
BCB 544 "Team" Projects
Last week of classes will be devoted to Projects
• Written reports due: Mon Dec 3 (no class that day)
• Oral presentations (20-30 min) will be: Wed-Fri Dec 5, 6, 7
• 1 or 2 teams will present during each class period
See Guidelines for Projects posted online
BCB 544 Only: New Homework Assignment
544 Extra#2
Due: √PART 1 - ASAP
PART 2 - meeting prior to 5 PM Fri Nov 2
Part 1 - Brief outline of Project, email to Drena & Michael
after response/approval, then:
Part 2 - More detailed outline of project
Read a few papers and summarize status of problem
Schedule meeting with Drena & Michael to discuss ideas
Seminars this Week
BCB List of URLs for Seminars related to Bioinformatics: http://www.bcb.iastate.edu/seminars/index.html
• Nov 2 Fri - BCB Faculty Seminar 2:10 in 102 ScI
• Bob Jernigan BBMB, ISU
• Control of Protein Motions by Structure
Chp 10 - Phylogenetics
SECTION IV MOLECULAR PHYLOGENETICS
Xiong: Chp 10 Phylogenetics Basics
• Evolution and Phylogenetics
• Terminology
• Gene Phylogeny vs. Species Phylogeny
• Forms of Tree Representation
• Why Finding a True Tree is Difficult
• Procedure of Building a Phylogenetic Tree
Tree Building Procedure
• Choose molecular markers
• Perform MSA
• Choose a model of evolution
• Determine tree building method
• Assess tree reliability
Choice of Molecular Markers
• Very closely related organisms - use nucleic acid sequences, which will show more differences
• For individuals within a species - use noncoding regions of mtDNA, which have a faster mutation rate
• More distantly related species - use slowly evolving nucleic acid sequences like ribosomal RNA, or protein sequences
• Very distantly related species - use highly conserved protein sequences
Multiple Sequence Alignment
• Most critical step in tree building - cannot build correct tree without correct alignment
• Should build alignments with multiple programs, then inspect and compare to identify the most reasonable one
• Most alignments need manual editing
• Make sure important functional residues align
• Align secondary structure elements
• Use full alignment or just parts
Automatic Editing of Alignments
• Rascal and NorMD – correct alignment errors, remove potentially unrelated or highly divergent sequences
• Gblocks – detect and eliminate poorly aligned positions and divergent regions
How do we measure divergence between sequences?
• Simple measure – just count the number of substitutions observed between the sequences in the MSA
• Problem – number of substitutions may not represent the number of evolutionary events that actually occurred
Multiple Substitutions
[Figure: multiple substitutions at one site; nucleotides shown: C, A, A, G, T]
Just because we see only one difference does not mean that there was only one evolutionary event
Multiple Substitutions
[Figure: multiple substitutions at one site; nucleotides shown: A, A, A, G, T]
Just because we see no difference does not mean that there were no evolutionary events
Choosing Substitution Models
• Statistical models of evolution are used to correct for the multiple substitution problem
• Focus on DNA models
Jukes-Cantor Model
• Jukes-Cantor model assumes all nucleotides are substituted with equal probability
• Can be used to correct for multiple substitutions
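As a minimal sketch of the correction, the Jukes-Cantor distance can be written as $d = -\frac{3}{4}\ln\left(1 - \frac{4}{3}p\right)$, where $p$ is the observed proportion of differing sites. The Python below computes $p$ for two aligned sequences and applies the correction. The function names and the two example sequences are hypothetical, chosen only for illustration.

```python
import math

def p_distance(seq1, seq2):
    """Proportion of aligned, ungapped sites at which two sequences differ."""
    pairs = [(a, b) for a, b in zip(seq1, seq2) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no ungapped aligned sites to compare")
    return sum(a != b for a, b in pairs) / len(pairs)

def jukes_cantor(p):
    """Jukes-Cantor corrected distance: d = -(3/4) * ln(1 - 4p/3)."""
    if p >= 0.75:
        # At p >= 3/4 the logarithm is undefined (sequences look saturated)
        raise ValueError("p >= 0.75: corrected distance is undefined")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

# Hypothetical aligned sequences: 3 differences in 10 sites
s1 = "ACGTACGTAC"
s2 = "ACGTTCGAAT"

p = p_distance(s1, s2)   # observed proportion of differences
d = jukes_cantor(p)      # estimated substitutions per site
print(f"p = {p:.3f}, Jukes-Cantor d = {d:.3f}")
```

The corrected distance is always at least as large as the observed proportion, which reflects the hidden substitutions that a simple count misses.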
Many Other Models
Evolutionary Models for Protein Sequences
• PAM and JTT substitution matrices already take into account multiple substitutions
• There are also models similar to Jukes-Cantor for protein sequences
What about differences in mutation rates between positions within a sequence?
• One of our assumptions was that all positions in a sequence are evolving at the same rate
• Bad assumption
• Third position in a codon changes with higher frequency
• In proteins, some amino acids can change and others cannot
• This variation is called among-site rate heterogeneity
• Many tree building programs have parameters meant to deal with this problem – adds to complexity of getting the correct tree
Chp 11 – Phylogenetic Tree Construction Methods and Programs
SECTION IV MOLECULAR PHYLOGENETICS
Xiong: Chp 11 Phylogenetic Tree Construction Methods and Programs
• Distance-Based Methods
• Character-Based Methods
• Phylogenetic Tree Evaluation
• Phylogenetic Programs
Tree Construction
• Two main categories of tree building methods
• Distance-based: overall similarity between sequences
• Character-based: consider the entire MSA
Distance-Based Methods
• Given an MSA and an evolutionary model, calculate the distance between all pairs of sequences
• Construct distance matrix
• Construct phylogenetic tree based on the distance matrix
Distance Matrices
     a    b    c    d
a    0
b    6    0
c    7    3    0
d   14   10    9    0
[Figure: tree with leaves a, b, c, d drawn against a distance scale from 0 to 8]
Distance-Based Methods
• Two ways to construct a tree based on a distance matrix
• Clustering
• Optimality
Clustering-Based Methods
• E.g., UPGMA and Neighbor-Joining
• A cluster is a set of taxa
• Interspecies distances translate into intercluster distances
• Clusters are repeatedly merged
• "Closest" clusters merged first
• Distances are recomputed after merging
UPGMA
• UPGMA – Unweighted Pair Group Method Using Arithmetic Average
• Uses molecular clock assumption – all taxa evolve at a constant rate and are equally distant from the root (ultrametric tree)
• This assumption is usually wrong
• So why use UPGMA?
• Very fast
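The clustering procedure can be sketched in a few lines of Python. This is an illustrative sketch, not the lecture's own worked example: it uses the four-taxon matrix from the Distance Matrices slide, measures the distance between two clusters as the arithmetic mean over all member pairs, and places each new node at half the inter-cluster distance (the molecular clock assumption). The function name `upgma` and the data layout are assumptions.

```python
from itertools import combinations

def upgma(names, dist):
    """Cluster taxa by UPGMA.

    names: list of taxon labels.
    dist:  dict mapping frozenset({x, y}) to the distance between x and y.
    Returns the merges in order as (cluster_1, cluster_2, node_height).
    """
    clusters = [(name,) for name in names]  # each cluster is a tuple of taxa
    merges = []

    def cluster_distance(c1, c2):
        # Arithmetic mean of all pairwise distances between members
        total = sum(dist[frozenset((x, y))] for x in c1 for y in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Merge the two closest clusters first
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: cluster_distance(*pair))
        # Molecular clock: the new node sits halfway between the clusters
        height = cluster_distance(c1, c2) / 2
        merges.append((c1, c2, height))
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append(c1 + c2)
    return merges

# Distance matrix from the Distance Matrices slide
dist = {
    frozenset(("a", "b")): 6, frozenset(("a", "c")): 7, frozenset(("b", "c")): 3,
    frozenset(("a", "d")): 14, frozenset(("b", "d")): 10, frozenset(("c", "d")): 9,
}

merges = upgma(["a", "b", "c", "d"], dist)
for c1, c2, height in merges:
    print(c1, "+", c2, "at height", height)
```

Recomputing the averages from the original matrix at every step keeps the sketch short; a faster implementation would update the matrix in place after each merge.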
UPGMA Example
Neighbor Joining
• Idea: Find a pair of taxa that are close to each other but far from other taxa
• Implicitly finds a pair of neighboring taxa
• No molecular clock assumption
Neighbor Joining
• NJ corrects for unequal evolutionary rates between sequences by using a conversion step
• The conversion step requires calculation of "r-values" and "transformed r-values"
Neighbor Joining
The r-value for a sequence is:
$$r_i = \sum_{j \neq i} d_{ij}$$
The sum of the distances between sequence i and all other sequences
Neighbor Joining
The transformed r-value for a sequence is:
$$r'_i = \frac{r_i}{n - 2}$$
Where n is the number of taxa
Transformed r-values are used to determine the distance of a taxon to the nearest node
Neighbor Joining
The converted distance between two sequences is:
$$d'_{ij} = d_{ij} - \frac{1}{2}(r_i + r_j)$$
These converted distances are used in building the tree
Neighbor Joining
The final equation we need is for computing the distance from a new cluster to each taxon. Assume taxa i and j were merged into a cluster u. The distance from taxon i to cluster u is:
$$d_{iu} = \frac{d_{ij} + (r'_i - r'_j)}{2}$$
Neighbor Joining Example
      A      B      C
B   0.40
C   0.35   0.45
D   0.60   0.70   0.55
Neighbor Joining Example
• Initialize tree into a star shape with all taxa connected to the center
• Step 1: Compute r-values and transformed r-values for all taxa
$$r_A = d_{AB} + d_{AC} + d_{AD} = 0.4 + 0.35 + 0.6 = 1.35$$
$$r'_A = \frac{r_A}{n - 2} = \frac{1.35}{4 - 2} = 0.675$$
Neighbor Joining Example
• Step 2: Compute converted distances
$$d'_{AB} = d_{AB} - \frac{1}{2}(r_A + r_B) = 0.4 - \frac{1}{2}(1.35 + 1.55) = -1.05$$
Neighbor Joining Example
• Step 3: Fill out converted distance matrix
       A       B       C
B   -1.05
C   -1.00   -1.00
D   -1.00   -1.00   -1.05
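Steps 1 to 3 can be checked numerically. The sketch below recomputes the r-values, transformed r-values, and converted distances for the example matrix using the formulas from the preceding slides; the variable names and the dictionary layout are assumptions made for illustration.

```python
names = ["A", "B", "C", "D"]

# Pairwise distances from the Neighbor Joining Example matrix
d = {
    ("A", "B"): 0.40, ("A", "C"): 0.35, ("B", "C"): 0.45,
    ("A", "D"): 0.60, ("B", "D"): 0.70, ("C", "D"): 0.55,
}

def dist(i, j):
    """Look up a distance regardless of the order of the pair."""
    return d[(i, j)] if (i, j) in d else d[(j, i)]

n = len(names)

# Step 1: r-values and transformed r-values
r = {i: sum(dist(i, j) for j in names if j != i) for i in names}
r_prime = {i: r[i] / (n - 2) for i in names}

# Steps 2 and 3: converted distance for every pair
converted = {(i, j): dist(i, j) - 0.5 * (r[i] + r[j]) for (i, j) in d}

for taxon in names:
    print(taxon, "r =", round(r[taxon], 3), "r' =", round(r_prime[taxon], 3))
for pair, value in converted.items():
    print(pair, round(value, 2))
```

Values are rounded only when printed, since sums such as 0.4 + 0.35 + 0.6 are not exact in floating point.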
Neighbor Joining Example
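• Step 4: Create a node by merging the closest taxa (smallest converted distance)
• In this example, the converted distance between A and B is the same as the converted distance between C and D
• We can pick either pair to start with
• Let's pick A and B and create a node called U
[Figure: star tree with taxa A, B, C, D; A and B joined at new node U, with branch lengths A-U and B-U still unknown (?)]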
Neighbor Joining Example
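• Step 5: Compute branch lengths
• Use the equation for computing the distance from a taxon to a node
$$d_{AU} = \frac{d_{AB} + (r'_A - r'_B)}{2} = \frac{0.4 + (0.675 - 0.775)}{2} = 0.15$$
$$d_{BU} = d_{AB} - d_{AU} = 0.4 - 0.15 = 0.25$$
[Figure: node U joined to A by a branch of length 0.15 and to B by a branch of length 0.25]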
Neighbor Joining Example
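• Step 6: Construct reduced distance matrix by computing distances from each remaining taxon to the new node U
• In UPGMA, we simply calculated the average; here each distance is first reduced by the branch length to U
$$d_{CU} = \frac{(d_{AC} - d_{AU}) + (d_{BC} - d_{BU})}{2} = \frac{(0.35 - 0.15) + (0.45 - 0.25)}{2} = 0.2$$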
Neighbor Joining Example
Our reduced distance matrix:
      U      C
C   0.20
D   0.45   0.55
Neighbor Joining Example
• From here, we go back to step 1
• Continue until all taxa have been decomposed from the star tree
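The full loop (steps 1 to 6, repeated) can be sketched as follows. This is a minimal illustration, not a reference implementation. One assumption is worth flagging: the pair to merge is chosen with the general criterion $d_{ij} - (r_i + r_j)/(n - 2)$ from the standard formulation of neighbor joining, which coincides with the converted distance on the slides when $n = 4$. The function name, the `U1`, `U2` node labels, and the edge-list output format are hypothetical.

```python
from itertools import combinations

def neighbor_joining(names, dist):
    """Build an unrooted tree by neighbor joining.

    names: list of taxon labels.
    dist:  dict mapping frozenset({x, y}) to the distance between x and y.
    Returns a list of edges (node, node, branch_length).
    """
    nodes = list(names)
    d = dict(dist)   # working copy; distances to new nodes are added here
    edges = []
    node_count = 0

    while len(nodes) > 2:
        n = len(nodes)
        # Step 1: r-values over the current set of nodes
        r = {i: sum(d[frozenset((i, j))] for j in nodes if j != i) for i in nodes}

        # Steps 2-4: choose the pair with the smallest converted distance
        # (assumption: general form with n - 2, equal to the slides' form at n = 4)
        def converted(pair):
            i, j = pair
            return d[frozenset(pair)] - (r[i] + r[j]) / (n - 2)

        i, j = min(combinations(nodes, 2), key=converted)

        # Step 5: branch lengths from i and j to the new node u
        d_ij = d[frozenset((i, j))]
        d_iu = (d_ij + (r[i] - r[j]) / (n - 2)) / 2
        d_ju = d_ij - d_iu
        node_count += 1
        u = f"U{node_count}"   # hypothetical label for the new internal node
        edges.append((i, u, d_iu))
        edges.append((j, u, d_ju))

        # Step 6: reduced matrix, distance from every other node to u
        for k in nodes:
            if k != i and k != j:
                d_ik = d[frozenset((i, k))]
                d_jk = d[frozenset((j, k))]
                d[frozenset((u, k))] = ((d_ik - d_iu) + (d_jk - d_ju)) / 2
        nodes = [k for k in nodes if k != i and k != j] + [u]

    # Two nodes remain: connect them directly
    a, b = nodes
    edges.append((a, b, d[frozenset((a, b))]))
    return edges

names = ["A", "B", "C", "D"]
dist = {
    frozenset(("A", "B")): 0.40, frozenset(("A", "C")): 0.35,
    frozenset(("B", "C")): 0.45, frozenset(("A", "D")): 0.60,
    frozenset(("B", "D")): 0.70, frozenset(("C", "D")): 0.55,
}

edges = neighbor_joining(names, dist)
for a, b, length in edges:
    print(a, b, round(length, 3))
```

Because the converted distances for A-B and C-D are tied, which pair is merged first depends on floating-point rounding. Either choice should lead to the same unrooted tree for this matrix, since the example distances fit a tree exactly.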
Optimality-Based Methods
• Clustering methods produce a single tree with no ability to judge how good it is compared to alternative tree topologies
• Optimality-based methods compare all possible tree topologies and select a tree that best fits the distance matrix
• Two algorithms:
• Fitch-Margoliash
• Minimum evolution
Fitch-Margoliash
• Selects best tree among all possible trees based on minimum deviation between distances calculated in the tree and distances in the distance matrix
• Basically, a least squares method
• $D_{ij}$ = distance between i and j in matrix
• $d_{ij}$ = distance between i and j in tree
• Objective: Find tree that minimizes
$$\sum_{1 \le i < j \le n} (D_{ij} - d_{ij})^2$$
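To make the objective concrete, the sketch below evaluates the sum of squared deviations for two candidate four-taxon topologies against the neighbor-joining example matrix. It only scores trees with fixed branch lengths; it does not search tree space or optimize branch lengths. The helper names are hypothetical, and the branch lengths for C, D, and the internal branch (0.15, 0.40, 0.05) are assumptions obtained by completing the neighbor-joining example, since only 0.15 and 0.25 appear on the slides.

```python
from itertools import combinations

def quartet_path_lengths(pair1, pair2, leaf_len, internal):
    """Tree distances d_ij for a four-taxon tree with split pair1 | pair2."""
    tree_dist = {}
    for s, t in combinations(pair1 + pair2, 2):
        same_side = (s in pair1) == (t in pair1)
        # Taxa on opposite sides of the split also cross the internal branch
        crossing = 0.0 if same_side else internal
        tree_dist[frozenset((s, t))] = leaf_len[s] + leaf_len[t] + crossing
    return tree_dist

def least_squares_error(D, d):
    """Sum over all pairs of (D_ij - d_ij)^2."""
    return sum((D[pair] - d[pair]) ** 2 for pair in D)

# Observed distances D_ij (Neighbor Joining Example matrix)
D = {
    frozenset(("A", "B")): 0.40, frozenset(("A", "C")): 0.35,
    frozenset(("B", "C")): 0.45, frozenset(("A", "D")): 0.60,
    frozenset(("B", "D")): 0.70, frozenset(("C", "D")): 0.55,
}

# Assumed branch lengths (see the note above)
leaf_len = {"A": 0.15, "B": 0.25, "C": 0.15, "D": 0.40}
internal = 0.05

tree_ab_cd = quartet_path_lengths(("A", "B"), ("C", "D"), leaf_len, internal)
tree_ac_bd = quartet_path_lengths(("A", "C"), ("B", "D"), leaf_len, internal)

err_ab_cd = least_squares_error(D, tree_ab_cd)
err_ac_bd = least_squares_error(D, tree_ac_bd)
print("AB|CD:", round(err_ab_cd, 6))
print("AC|BD:", round(err_ac_bd, 6))
```

With these assumed branch lengths, the AB|CD topology reproduces the matrix almost exactly, while AC|BD misses four of the six distances by 0.05 each, so the least-squares criterion prefers AB|CD.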
Minimum Evolution
• Similar to Fitch-Margoliash, but uses a different optimality criterion
• Searches for a tree with the minimum total branch length
• This is an indirect way of achieving the best fit of the branch lengths with the original data
Summary of Distance-Based Methods
• Clustering-based methods:
• Computationally very fast and can handle large datasets that other methods cannot
• Not guaranteed to find the best tree
• Optimality-based methods:
• Better overall accuracies
• Computationally slow
• All distance-based methods discard the site-by-site sequence information once distances are computed, so they cannot infer the most likely state at an internal node