archibald-collins
Molecular Phylogeny
Phylogeny is the inference of evolutionary relationships. Traditionally, phylogeny relied on the comparison of morphological features between organisms. Today, molecular sequence data are mainly used for phylogenetic analyses.
One tree of life
A sketch Darwin made soon after returning from his voyage on HMS Beagle (1831–36) showed his thinking about the diversification of species from a single stock (see Figure, overleaf). This branching, extended by the concept of common descent,
Haeckel (1879) Pace (2001)
Molecular phylogeny uses trees to depict evolutionary relationships among organisms. These trees are based upon DNA and protein sequence data.
[Figure: two alternative trees of human, chimpanzee, gorilla and orangutan]
Pre-molecular analysis: the great apes (chimpanzee, gorilla and orangutan) are separate from the human.
Molecular analysis: the chimpanzee is more closely related to the human than the gorilla is.
What can we learn from a phylogenetic tree?
1. Determine the closest relatives of an organism in which we are interested
• Was the extinct quagga more like a zebra or a horse?
Which species is closest to human?
[Figure: two alternative trees of human, chimpanzee, gorilla and orangutan]
2. Help to find the relationships between species and identify new species
Example: Metagenomics
A new field in genomics that aims to study the genomes recovered from environmental samples.
A powerful tool for accessing the rich biodiversity of native environmental samples.
10^6 cells/ml seawater
10^7 virus particles/ml seawater
>99% uncultivated microbes
Incredible microbial diversity in a drop of seawater
Metagenomics
[Figure: metagenomics workflow. Community sample, community DNA (extraction bias), shear, 3–4 kb shotgun library (cloning bias), paired-end sequence (F / R), composite contig assembly (…ACGGCTGCGTTACATCGATCATTTACGAACATCGATCATTTACGATACCATTG…)]
From: “The Sorcerer II Global Ocean Sampling Expedition: Metagenomic Characterization of Viruses within Aquatic Microbial Samples”, Williamson et al., PLoS ONE, 2008
3. Discover the function of an unknown gene or protein
[Figure: tree of the proteins RBP1_HS, RBP2_pig, RBP_RAT, ALP_HS, ALPEC_BV, ALPA1_RAT and ECBLC, together with three hypothetical proteins and a node labelled X]
Relationships can be represented by a phylogenetic tree or dendrogram.
[Figure: tree with nodes labelled A, B, C, D, E and F]
Phylogenetic Tree Terminology
• Graph composed of nodes & branches
• Each branch connects two adjacent nodes
[Figure: tree with nodes labelled A, B, C, D, E and F and a node labelled R]
Phylogenetic Tree Terminology
• Rooted tree (based on a priori knowledge)
• Unrooted tree
[Figure: a rooted and an unrooted tree of human, chimp, gorilla and chicken]
Rooted vs. unrooted trees
[Figure: a rooted and an unrooted tree on the same three leaves, labelled 1, 2 and 3]
How can we build a tree with molecular data?
- Trees based on DNA sequences (rRNA)
- Trees based on protein sequences
Questions:
• Can DNA and protein sequences from the same gene produce different trees?
• Can different genes have different evolutionary histories?
• Can different regions of the same gene produce different trees?
Methods
Approach 1 - Distance methods
• Two steps:
– Compute the distance between every pair of sequences in the multiple sequence alignment (MSA).
– Find the tree that agrees most closely with the distance table.
• Algorithms:
– Neighbor joining
Approach 2 - State methods
• Algorithms:
– Maximum parsimony (MP)
– Maximum likelihood (ML)
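The first step of the distance approach above can be sketched as follows. The slides do not say which distance measure is used, so this sketch assumes the simple p-distance (the fraction of aligned sites that differ, skipping gap columns). The function names and the toy alignment are hypothetical.

```python
from itertools import combinations


def p_distance(seq_a: str, seq_b: str) -> float:
    """Fraction of differing sites between two aligned sequences.

    Assumption: sites where either sequence has a gap ('-') are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(x, y) for x, y in zip(seq_a, seq_b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in pairs) / len(pairs)


def distance_table(msa: dict) -> dict:
    """Pairwise distances for every pair of sequences in the alignment."""
    return {(a, b): p_distance(msa[a], msa[b]) for a, b in combinations(msa, 2)}


# Hypothetical toy alignment, for illustration only.
toy_msa = {
    "seq1": "ACGTACGT",
    "seq2": "ACGTACGA",
    "seq3": "ACGAAC-A",
}
table = distance_table(toy_msa)
for pair, d in table.items():
    print(pair, round(d, 3))
```

In practice a corrected distance (for example one that accounts for multiple substitutions at a site) is often preferred; the p-distance is used here only because it is the simplest to follow.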
Neighbor Joining (NJ)
• Reconstructs an unrooted tree
• Calculates branch lengths based on pairwise distances
• At each stage, the two nearest nodes of the tree are chosen and defined as neighbors in the tree. This is repeated until all of the nodes are paired together.
Star structure
Assumption: divergence of sequences is assumed to occur at a constant rate. Distance to root equals
[Figure: star tree with leaves a, b, c and d]
Basic Algorithm
Distance matrix:
    a  b  c  d
a   0  8  7  5
b   8  0  3  9
c   7  3  0  8
d   5  9  8  0
Initial star diagram:
[Figure: star tree with leaves a, b, c and d]
Selection step
Choose the two nodes with the shortest distance and fuse them. In the distance matrix above, the smallest distance is D(CB) = 3, so c and b are fused into a new node (e).
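A minimal sketch of this selection step, using the distance matrix from the walkthrough. The nested-dictionary layout and the function name are choices made for this illustration only.

```python
def closest_pair(dist):
    """Return the pair of distinct nodes with the smallest distance.

    Ties are resolved in favour of the first pair in sorted-name order.
    """
    names = sorted(dist)
    pairs = [(x, y) for i, x in enumerate(names) for y in names[i + 1:]]
    return min(pairs, key=lambda p: dist[p[0]][p[1]])


# Distance matrix from the slides.
D = {
    "a": {"a": 0, "b": 8, "c": 7, "d": 5},
    "b": {"a": 8, "b": 0, "c": 3, "d": 9},
    "c": {"a": 7, "b": 3, "c": 0, "d": 8},
    "d": {"a": 5, "b": 9, "c": 8, "d": 0},
}
pair = closest_pair(D)
print(pair, D[pair[0]][pair[1]])  # ('b', 'c') 3
```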
Next Step
Then recalculate the distances from the remaining sequences (a and d) to the new node (e), and remove the fused nodes from the table.
D(EA) = (D(AC) + D(AB) - D(CB))/2 = (7 + 8 - 3)/2 = 6
D(ED) = (D(DC) + D(DB) - D(CB))/2 = (8 + 9 - 3)/2 = 7
    a  d  e
a   0  5  6
d   5  0  7
e   6  7  0
[Figure: star tree with leaves a and d and the fused node e (c,b)]
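The update rule above can be sketched as a small function that fuses two nodes and rebuilds the table. The function name and the dictionary layout are hypothetical; the formula is the one given on the slide, D(new, k) = (D(k, x) + D(k, y) - D(x, y))/2.

```python
def fuse(dist, x, y, new):
    """Return a new distance table in which nodes x and y are replaced by `new`.

    Distance from every remaining node k to the fused node:
        D(new, k) = (D(k, x) + D(k, y) - D(x, y)) / 2
    """
    others = [k for k in dist if k not in (x, y)]
    out = {k: {m: dist[k][m] for m in others} for k in others}
    out[new] = {new: 0}
    for k in others:
        d = (dist[k][x] + dist[k][y] - dist[x][y]) / 2
        out[k][new] = d
        out[new][k] = d
    return out


# Distance matrix from the slides.
D = {
    "a": {"a": 0, "b": 8, "c": 7, "d": 5},
    "b": {"a": 8, "b": 0, "c": 3, "d": 9},
    "c": {"a": 7, "b": 3, "c": 0, "d": 8},
    "d": {"a": 5, "b": 9, "c": 8, "d": 0},
}
step1 = fuse(D, "c", "b", "e")      # fuse the closest pair (c, b) into e
print(step1["e"]["a"], step1["e"]["d"])  # 6.0 7.0

step2 = fuse(step1, "a", "d", "f")  # applying the same rule again to a and d
print(step2["e"]["f"])              # 4.0
```

Applying the same function a second time, to a and d, reproduces the value D(EF) = 4 that the walkthrough arrives at in its final step.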
Next Step
To get a tree, un-fuse c and b by calculating their distances to the new node (e).
[Figure: node e split back into c and b, with the branch lengths from c and b to e marked; a and d are still attached to the star]
Next…
[Figure: a and d are fused into a new node (f); c and b remain attached to e]
Final
D(EF) = (D(EA) + D(ED) - D(AD))/2 = (6 + 7 - 5)/2 = 4
    f  e
f   0  4
e   4  0
[Figure: final tree, with a and d joined at node f, c and b joined at node e, and f connected to e]
[Figure: the three stages side by side: (1) c and b fused into e, (2) a and d fused into f, (3) the final tree]
IMPORTANT!
• Usually we do not start from a star diagram. To choose the nodes to fuse, we have to calculate the relative distance matrix (Mij), which represents the relative distance of each node to all other nodes.
EXAMPLE
Original distance matrix:
     A    B    C    D    E
B    5
C    4    7
D    7   10    7
E    6    9    6    5
F    8   11    8    9    8
Relative distance matrix (Mij):
     A      B      C      D      E
B  -13
C  -11.5  -11.5
D  -10    -10    -10.5
E  -10    -10    -10.5  -13
F  -10.5  -10.5  -11    -11.5  -11.5
The Mij table is used only to choose the closest pair, not to calculate the distances.
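The slides do not state how Mij is computed. The sketch below assumes the standard neighbor-joining criterion, M(i,j) = d(i,j) - (r(i) + r(j))/(N - 2), where r(i) is the sum of the distances from node i to all other nodes and N is the number of nodes. With the original distance matrix from the example, this gives -13 for both the A,B pair and the D,E pair, which are the smallest values in the table. The function name and data layout are hypothetical.

```python
from itertools import combinations


def relative_distance_matrix(dist):
    """Assumed NJ criterion: M(i,j) = d(i,j) - (r(i) + r(j)) / (N - 2)."""
    names = list(dist)
    n = len(names)
    # r[i]: sum of distances from node i to every other node.
    r = {i: sum(dist[i][j] for j in names if j != i) for i in names}
    return {
        (i, j): dist[i][j] - (r[i] + r[j]) / (n - 2)
        for i, j in combinations(names, 2)
    }


# Lower triangle of the original distance matrix from the example.
names = ["A", "B", "C", "D", "E", "F"]
lower_triangle = [
    [],
    [5],
    [4, 7],
    [7, 10, 7],
    [6, 9, 6, 5],
    [8, 11, 8, 9, 8],
]
dist = {x: {y: 0.0 for y in names} for x in names}
for i, row in enumerate(lower_triangle):
    for j, value in enumerate(row):
        dist[names[i]][names[j]] = dist[names[j]][names[i]] = float(value)

M = relative_distance_matrix(dist)
smallest = min(M.values())
closest = [pair for pair, m in M.items() if m == smallest]
print(smallest, closest)  # -13.0 [('A', 'B'), ('D', 'E')]
```

As the slide notes, these values only rank candidate pairs; the distances to the new node are still computed from the original distance matrix.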
Advantages and disadvantages of the neighbor-joining method
Advantages
- It is fast and thus suited to large datasets
- It permits lineages with largely different branch lengths
Disadvantages
- Sequence information is reduced
- It gives only one possible tree
Problems with phylogenetic trees
• It is wrong to assume that branch length is proportional to speciation time (molecular clock).
• It is wrong to produce a tree based on distance values of the whole alignment.
More problems with phylogenetic trees
[Figure: five trees of the same seven taxa (Bacillus, E. coli, Pseudomonas, Salmonella, Aeromonas, Lechevaliera and Burkholderias), each with a 0.2 scale bar; the leaf labels 1 to 7 appear in a different order in each tree]
Problems with phylogenetic trees
• It is wrong to assume that branch length is proportional to speciation time (molecular clock).
• It is wrong to produce a tree based on distance values of the whole alignment: using different regions of the same alignment may produce different trees.
• What to do? Use bootstrap.
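Bootstrapped tree
[Figure: tree of Pseudomonas, Burkholderias, E. coli, Salmonella, Lechevaliera, Aeromonas and Bacillus (leaves numbered 1 to 7, scale bar 0.2), with bootstrap values of 77, 100, 83 and 58 on its internal nodes]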
• Bootstrapping is a method for estimating generalization error based on “resampling”.
• In the context of phylogenetic trees, it consists of randomly selecting positions from an alignment and constructing a tree based on these positions.
• As a result we get the percentage of times a certain node was formed.
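The resampling idea can be sketched as follows. The sketch assumes the usual form of the bootstrap, in which alignment columns are drawn with replacement until the resampled alignment has the original length. The tree-building step is passed in as a function; `closest_pair_clade` below is a deliberately trivial, hypothetical stand-in for a real tree-building method, and the toy alignment is invented for illustration.

```python
import random


def bootstrap_alignment(msa, rng):
    """Resample alignment columns with replacement, keeping the original length."""
    length = len(next(iter(msa.values())))
    cols = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[c] for c in cols) for name, seq in msa.items()}


def bootstrap_support(msa, build_clades, replicates=100, seed=0):
    """Percentage of replicates in which each clade (group of taxa) was formed.

    build_clades: any function mapping an alignment to a set of frozensets.
    """
    rng = random.Random(seed)
    counts = {}
    for _ in range(replicates):
        for clade in build_clades(bootstrap_alignment(msa, rng)):
            counts[clade] = counts.get(clade, 0) + 1
    return {clade: 100 * c / replicates for clade, c in counts.items()}


def closest_pair_clade(msa):
    """Toy stand-in for tree building: report only the most similar pair."""
    names = sorted(msa)
    pairs = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]]
    best = min(pairs, key=lambda p: sum(x != y for x, y in zip(msa[p[0]], msa[p[1]])))
    return {frozenset(best)}


# Hypothetical toy alignment.
toy_msa = {
    "s1": "ACGTACGTAC",
    "s2": "ACGTACGTAA",
    "s3": "ACCTTCGAAA",
}
support = bootstrap_support(toy_msa, closest_pair_clade, replicates=100, seed=1)
for clade, pct in support.items():
    print(sorted(clade), pct)
```

With a real method, `build_clades` would return every internal grouping of the reconstructed tree, and the percentages would be the values written on the nodes of a bootstrapped tree.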
Highly reliable node
Less reliable node
Tools for tree reconstruction
• CLUSTALX (NJ method)
• PHYLIP (PHYLogeny Inference Package): includes parsimony, distance matrix, and likelihood methods, as well as bootstrapping.
• PhyML (maximum likelihood method)
• More phylogeny programs