The distance between sequences, Part I.
Foundations
M. Elizabeth Corey, Ph.D.
UCSC Extension and UC Berkeley Extension
A simple start
Suppose we have two sequences, A and B
A = {a1, a2, …, am}
and
B = {b1, b2, …, bn}
and we want to know how similar they are.
What is the basis for their similarity?
The practical measure
What we usually do is obtain an alignment and then score using the sum of the pairwise scores:
S_{AB} = \sum_{i=1}^{n} s(a_i, b_i)
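As a rough illustration of the sum of pairwise scores, the sketch below scores two already-aligned strings position by position. The function name `pairwise_score` and the default match/mismatch values are assumptions chosen for the example, not part of the slides.

```python
def pairwise_score(a, b, match=1, mismatch=0):
    """Sum the per-position scores s(a_i, b_i) of two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    return sum(match if x == y else mismatch for x, y in zip(a, b))


# Hypothetical aligned fragments; positions 1, 3 and 6 match.
score = pairwise_score("abcnjr", "ajcjnr")
```

A real scoring function s would normally come from a substitution matrix rather than a flat match/mismatch rule.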
The nice metric
Wouldn’t it be nice if we could simply say that the distance between sequences was the geometric sum of the distances between loci in the sequences?
D_{AB} = \sum_{i=1}^{n} (a_i - b_i)^2
[Figure: dot matrix of sequence A = a b c n j r q c l c r p m (columns) against sequence B = a j c j n r c k c r b p (rows); a 1 marks each cell where the two residues match.]
Dynamic programming methods
• Computational method dating from the 40’s, introduced to biology as “Needleman-Wunsch” in 1970.
• A numerical value is assigned to every cell in the array giving the similarity/dissimilarity of residues.
• The example shown:
– match = +1
– mismatch = null (value 0)
a b c n j r q c l c r p m
a 1
j 1
c 1 1 1
j 1
n 1
r 1 4 3 3 2 2 0 0
c 3 3 4 3 3 3 3 4 3 3 1 0 0
k 3 3 3 3 3 3 3 3 3 2 1 0 0
c 2 2 3 2 2 2 2 3 2 3 1 0 0
r 2 1 1 1 1 2 1 1 1 1 2 0 0
b 1 2 1 1 1 1 1 1 1 1 1 0 0
p 0 0 0 0 0 0 0 0 0 0 0 1 0
Dynamic programming methods
• GOAL: For each cell find the maximum possible score for an alignment ending at that point.
• Searches the subrow and subcolumn, as shown, for the highest score.
• Adds this to the score for the current cell.
• Proceeds row by row through the array.
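The fill described above can be sketched as follows. This is a minimal reading of the slide, assuming the array is filled from the bottom-right corner, that the "subrow" is the rest of the next row and the "subcolumn" is the rest of the next column, and that there is no gap penalty. The function name `fill_matrix` is an assumption made for the example.

```python
def fill_matrix(a, b, match=1, mismatch=0):
    """Fill the score array for columns a and rows b, bottom-right first."""
    rows, cols = len(b), len(a)
    m = [[0] * cols for _ in range(rows)]
    for i in range(rows - 1, -1, -1):
        for j in range(cols - 1, -1, -1):
            s = match if b[i] == a[j] else mismatch
            best = 0
            if i + 1 < rows and j + 1 < cols:
                sub_row = m[i + 1][j + 1:]                         # rest of next row
                sub_col = [m[k][j + 1] for k in range(i + 1, rows)]  # rest of next column
                best = max(sub_row + sub_col)
            m[i][j] = s + best
    return m


# The two sequences from the example array.
scores = fill_matrix("abcnjrqclcrpm", "ajcjnrckcrbp")
```

With these assumptions the lower rows of `scores` agree with the completed rows shown in the example array, which suggests the reading of "subrow and subcolumn" is the intended one.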
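Maximum bipartite matching
Series of solutions, starting with Dijkstra, 1950’s.
Find the set of matches that provide maximum flow.
Each match, ai to bj, has a capacity equal to its pairwise score.
[Figure: bipartite graph joining the residues of A to the residues of B; edges are labeled with pairwise scores such as s(a1, b1).]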
Alignment’s not really the problem
• Optimal alignment falls into a set of problems with a long history in computer science.
• The underlying metric for distances between sequences falls in the province of biology.
Beguiled by a matrix (PAM)
PAM
• PAM starts with closely related sequences from 34 superfamilies, grouped into 71 evolutionary trees.
• PAM rests on a measure of amino acid “mutability”.
• PAM attempts to capture a representative slice of evolutionary behavior.
PAM (From Dayhoff, Schwartz and Orcutt)
• Obtain alignments for homologous proteins.
• Compute scoring matrix elements using:
s_{ij}(\lambda) = \lambda m_i a_{ij} / \sum_j a_{ij}, \quad i \neq j
s_{ii}(\lambda) = 1 - \lambda m_i
where a_{ij} is the substitution frequency, m_i is the mutability of i and \lambda is a proportionality constant.
• Extrapolate to longer evolutionary distances by using {S(\lambda)}^n.
Limitations of PAM matrices
PAM matrices are built from alignments with > 85% identity.
The entries in the initial scoring matrix, S(t=1) arise from short time interval substitutions; raising S(1) to a higher power may not capture some interesting substitutions with longer rate constants.
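To make the extrapolation step concrete, here is a small sketch that raises a one-step probability matrix to a higher power. The 2-state matrix `toy_s1` is invented purely for illustration (it is not Dayhoff data), and the helper names are assumptions.

```python
def mat_mul(x, y):
    """Multiply two square matrices stored as lists of rows."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, n):
    """Return m raised to the n-th power by repeated multiplication."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, m)
    return result


# Hypothetical one-step matrix: each row sums to 1.
toy_s1 = [[0.99, 0.01],
          [0.02, 0.98]]
toy_s250 = mat_pow(toy_s1, 250)
```

Every entry of the higher power is built only from the one-step entries, which is the limitation noted above: a substitution that is rare over short intervals stays governed by its short-interval rate.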
The Gutzwiller temptation
• An abstract dynamic system (M, μ, φt)
– a measurable space, M, composed of the set of all sequences.
– a measure, μ, based on transition probabilities.
– a group of automorphisms, φt, that map M onto itself and preserve μ, where the variable t runs through the integers.
What’s Bernoulli got to do with it?
• A scheme with subshift
– The measure on M is generated by the sets Ai,j,k = {a | ai = j, ai+1 = k} whose measure is given by a matrix of transition probabilities pjk >= 0.
– A future event a1 depends on a0; hence, memory.
– Realized in the geodesic flow on a compact closed surface of constant negative curvature.
System behaviors
• Ergodicity: Transition probabilities are positive recurrent and aperiodic.
• Mixing: Inheritance and Mendelian exceptions lead to mixing.
• K-systems: Speciation events rigidly segregate M; other segregations exist.
Our salad days
• Jukes-Cantor
• HKY
• Kimura 2-Parameter
• PAM
• BLOSUM
General Stationary Time-reversible Model
R =
.      πCrCA  πGrGA  πTrTA
πArAC  .      πGrGC  πTrTC
πArAG  πCrCG  .      πTrTG
πArAT  πCrCT  πGrGT  .
Time reversibility: πi rij = πj rji
(Diagonal elements such that rows sum to zero)
General Stationary Time-reversible Model
P(t) = e^{Rt}
Given rates, one can find transition probabilities, and vice-versa.
Jukes-Cantor
R =
-3a    a    a    a
  a  -3a    a    a
  a    a  -3a    a
  a    a    a  -3a
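As a check on the relation P(t) = e^{Rt}, the sketch below exponentiates the Jukes-Cantor rate matrix with a truncated Taylor series and compares it with the well-known closed form for this model. The helper names and the choice of 30 series terms are assumptions for the example; a numerical library would normally be used instead.

```python
import math


def mat_mul(x, y):
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_exp(m, terms=30):
    """Approximate e^m by the Taylor series sum of m^k / k!."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        term = [[v / k for v in row] for row in mat_mul(term, m)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return result


def transition_probabilities(a, t):
    """P(t) = e^{Rt} for the Jukes-Cantor rate matrix with rate a."""
    rt = [[(-3 * a if i == j else a) * t for j in range(4)] for i in range(4)]
    return mat_exp(rt)


def jc_closed_form(a, t, same):
    """Closed-form Jukes-Cantor probability of staying (same) or changing."""
    decay = math.exp(-4 * a * t)
    return 0.25 + 0.75 * decay if same else 0.25 - 0.25 * decay


p = transition_probabilities(0.1, 2.0)
```

Going the other way (from observed probabilities back to rates) amounts to taking a matrix logarithm, which is not shown here.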
Kimura 2-Parameter
R =
     A  C  G  T
A    .  b  a  b
C    b  .  b  a
G    a  b  .  b
T    b  a  b  .
a/b = transition/transversion bias
HKY (Hasegawa, Kishino, Yano)
R =
.   πC  πG  πT
πA  .   πG  πT
πA  πC  .   πT
πA  πC  πG  .
κ = transversion / transition
The BLOSUMn matrices
• Start with multiple, ungapped alignments of proteins found using PROTOMAT.
• Build clusters by placing together sequences with N% identity.
• Measure the score for each pair defined as:
sij = 2*log2(pij/eij)
eij is expected probability of occurrence of the i,j pair
pij is observed probability of the i,j pair.
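The log-odds score can be sketched in a few lines. The function name and the example probabilities are assumptions made for illustration only.

```python
import math


def blosum_score(p_ij, e_ij):
    """Log-odds score s_ij = 2 * log2(p_ij / e_ij)."""
    return 2 * math.log2(p_ij / e_ij)


# Hypothetical pair seen four times as often as expected by chance.
example_score = blosum_score(0.02, 0.005)
```

A pair observed exactly as often as expected scores zero, and a pair observed less often than expected scores below zero.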
Limitations
Naive approach: measure frequencies of aligned pairs and gaps in randomly selected confirmed alignments to get pij, use a “random” set of sequences to obtain eij.
• Difficulty 1: it is difficult to get a good random sample of sequences or alignments – databases are biased.
• Difficulty 2: When sequences diverge from a common ancestor recently, pij is small and s is strongly negative. When sequences diverged long ago, pij tends to eij and s approaches zero.
A short compendium of distances and scores
• Jukes-Cantor distance
• Kimura distance
• Dayhoff evolutionary distance
• BLOSUM scores
• Profile scores
• Average scores
References
• Gu, X. & Li, W, 1996. A general additive distance with time-reversibility and rate variation among nucleotide sites. Proc. Natl. Acad. Sci. USA 93: 4671-4676.
• Hasegawa, M., Kishino, H., & Yano, T., 1985. Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J. Mol. Evol. 22: 160-174.
• Sanderson, M. J. & Shaffer, H. B., 2002. Troubleshooting molecular phylogenetic analyses. Annu. Rev. Ecol. Syst. 33: 49-72.
The distance between sequences, Part II.
Careful Measures
Exceptions to Mendel’s Laws
The theory: a chromosomal basis of inheritance
Some so-called exceptions:
• linkage and recombination
• gene conversion
• transposition and mobile genetic elements
• A plethora of other mutations: point mutations, reversions, deletions, frameshifts, duplications, inversions
“Exceptions” do not result in rejection of Mendelian genetics but a better understanding of the mechanisms underlying Mendelian inheritance.
Mutation frequencies (#mutations/generation)
• Frequency of point mutation: 10^-7 to 10^-8
• Reversion of point mutations: ~10^-8. Sometimes called back mutation, sometimes called convergence.
• Reversion of deletion mutations: undetectably small.
“Loss of function” mutations result in grossly lower biological fitness. The rate of extinction due to gross “loss of function” is much greater than the rate of reversion, so the line will die long before reversion can occur. In the aggregate, the record will show a pseudo-reversion.
Mutation frequencies
• Deletions: 10^-6, dependent on chromosomal region. Caveat: may be underestimated; less detectable because they are often lethal.
• Frameshifts: 10^-6, often repaired.
• Duplications: 10^-3. E. coli: approximately 0.1% of a culture for a given region of the chromosome.
• Inversions: hard to detect, not always mutations.
• Gene conversions: still unknown. Reparative.
Mutators increase mutation frequencies by ~100; they work on “hot spots”.
Protein-based inheritance – Prions
• Proteins that change their shape in response to fluctuating environmental pressures, and then maintain that shape during mitosis and meiosis, constitute a form of cellular memory.
• Various structural conformations are propagated outside of the traditional genetic framework.
Hsp90 and Sup35
• A buffer for silent polymorphisms: Hsp90
– promotes the folding of signal transducers
– buffers the effects of many silent polymorphisms
– may serve as a capacitor of evolutionary change, storing and releasing genetic variation
• “Epigenetic inheritance”: The Sup35 prion
James Joyce’s List
Milk
Call mom!
Lettuce
Plumb the smithy of my soul for the unborn race-consciousness…
Rent
--------------------------------------
Thriving in fluctuating environments by exploiting pre-existing genetic variations.
References
Recent Publications on Conformational Change and Evolution
• Queitsch, C., Sangster, T.A. and Lindquist, S. 2002. Hsp90 as a capacitor of phenotypic variation. Nature 417: 618-624.
• Jensen, M.A., True, H.L., Chernoff, Y.O., and Lindquist, S., 2001. Molecular Population Genetics and Evolution of a Prion-like Protein in Saccharomyces cerevisiae. Genetics 159: 527-535.
• True, H.L., and Lindquist, S.L. 2000. A yeast prion provides an exploratory mechanism for genetic variation and phenotypic diversity. Nature 407: 477-483.
• Rutherford, S.L. and Lindquist, S. 1998. Hsp90 as a capacitor for morphological evolution. Nature 396: 336-342.
Mutations and time
Take a series of sequences and figure out how different they are by counting up their substitutions.
[Figure: sequences A, B and C joined pairwise, with edges labeled 5 substitutions, 3 substitutions and 6 substitutions.]
Mutations and time
What process takes us from A to B to C?
[Figure: sequences A, B and C, with edges labeled by process: gene conversion; frameshift (repairable) plus 2 point accepted mutations; no direct ancestry.]
Counting mutations
Consider a counting process {N(t), t ∈ T} where N(tj) – N(ti) is the number of mutations in the time interval (ti, tj].
[Figure: sequences A, B and C with edge counts N(AB) = 1 GC and N(BC) = 1 FS, 2 PM. There is no direct ancestry between A and C, but we can still count substitutions: N(AC) = 6 PM.]
Times on the edges of the tree
The “interoccurrence” times between mutations, τ_i = t_i – t_{i-1} for i = 1, 2, … (with t_0 = 0), are exponential variables with mean 1/λ such that
P[τ_i > h] = e^{-λh}
and
P[τ_i <= h] = 1 - e^{-λh}
for h >= 0.
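The two probabilities can be evaluated directly. In this sketch the function names and the example rate (one event per 10,000 years) are assumptions chosen for illustration.

```python
import math


def prob_wait_exceeds(rate, h):
    """P[tau > h] for an exponential waiting time with the given rate."""
    return math.exp(-rate * h)


def prob_wait_at_most(rate, h):
    """P[tau <= h], the complement of prob_wait_exceeds."""
    return 1.0 - math.exp(-rate * h)


# Hypothetical rate: one mutation per 10,000 years, so the mean wait is 10,000 years.
example_rate = 1.0 / 10000
still_waiting = prob_wait_exceeds(example_rate, 10000)
```

At h equal to the mean waiting time, the chance that no mutation has yet occurred is e^{-1}, a little under 37%.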
Edge times
Gene conversions: λgc = 1 gene conversion per 2,000* years
Frame shifts: λfs = 1 shift per 5,000 years
Point mutations: λpm = 1 point mutation per 10,000 years
*Just a wild guess
[Figure: sequences A, B and C with expected edge times 1/λgc = 2,000 yrs; 2/λpm + 1/λfs = 25,000 yrs; 6/λpm = 60,000 yrs.]
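The expected edge times follow from adding up mean waiting times, one per event. The sketch below assumes the guessed rates above and the event counts from the counting slide (1 gene conversion on A to B; 1 frameshift and 2 point mutations on B to C; 6 point mutations between A and C). The dictionary and function names are illustrative.

```python
# Mean waiting time in years for one event of each kind (the guessed rates above).
MEAN_WAIT_YEARS = {"gc": 2000, "fs": 5000, "pm": 10000}


def expected_edge_time(event_counts):
    """Sum count * mean waiting time over the event kinds on one edge."""
    return sum(count * MEAN_WAIT_YEARS[kind] for kind, count in event_counts.items())


edge_ab = expected_edge_time({"gc": 1})
edge_bc = expected_edge_time({"fs": 1, "pm": 2})
edge_ac = expected_edge_time({"pm": 6})
```

This treats the events on an edge as happening one after another, which is the simplest reading of the figure.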
Edge times
Population of A = 10^5
Population of B = 10^6
Population of C = I don’t care.
[Figure: sequences A, B and C with population-scaled edge times 1/(Na λgc) = 2 × 10^-2 yrs; 2/(Nb λpm) + 1/(Nb λfs) = 25 × 10^-3 yrs; 6/(Na λpm) = 60 × 10^-2 yrs.]
Calculating divergence times
Doolittle, R.F., Feng, D.-F., Tsang, S., Cho, G. and Little, E. “Determining Divergence Times of the Major Kingdoms of Living Organisms With a Protein Clock.” Science, 271, pp. 470-477, 1996.
Calculating divergence times
Task: Build a model for evolutionary time based on pairwise distances, dij, and the fossil record
– Start with the vertebrate fossil record: the biogeochemistry gives reliable times.
– Map the fossil-based phylogeny to the sequence-based phylogeny and compare edge lengths.
– Adjust the sequence-based time model to match the vertebrate fossil record.
Using the fossil record
Vertebrates: Time of first appearance in fossil record versus sequence similarity
[Figure: scatter plot; x-axis: Time (Ma), 0 to 600; y-axis: Distance Measure, 0 to 30.]
Readjusting the clock
After sampling the vertebrate fossil record and fitting the sequence data to the fossil record, they maintain the same clock.
Result: Eukaryotes and Prokaryotes diverged about 2.5 billion years ago.
On fitting the fossil record to sequence data
Challenges: unequal rates of change in different species due to:
– different reproductive cycles in different species
– different base population sizes in different species.
Obtaining bacterial mutation rates using vertebrate mutation rates when we are looking at the evolution of populations: how viable is it?
Population mutation
Suppose an average rate of mutation per site is about 10^-7 (ignoring duplications).
Compare lengths of reproductive cycles:
– Prokaryotes (blue-green algae and bacteria): 20 minutes to an hour per generation.
– Humans: US, average time to first child is 24.8 years.
How many times does a bacterium reproduce in the time it takes a human being to reproduce?
24 * 365 * 25 = 219,000 (at one generation per hour; about three times as many at 20 minutes per generation)
So if we are comparing bacterial mutation rates to human mutation rates and we are looking at aggregate populations, we have to adjust by a factor of up to about 10^6.
Population mutation
Size of the base population on planet earth:
5 * 10^30 prokaryotes (UG, Bill Whitman), including about a mole of bacteria
3 * 10^9 humans
How many bacteria are there, propagating how fast, in comparison to humans? Worst case ratio?
Calculate using base population * rate of generation * number of mutable genes:
(10^23 * 10^6 * 10^3) / (10^9 * 1 * 10^4) = 10^19
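The back-of-envelope arithmetic on these two slides can be replayed with exact integers. The variable names are illustrative; the inputs are the order-of-magnitude figures from the slides (population, relative generation rate, number of mutable genes).

```python
# Bacterial generations per human generation, at one bacterial generation per hour.
hours_per_day, days_per_year, years_per_human_generation = 24, 365, 25
bacterial_generations = hours_per_day * days_per_year * years_per_human_generation

# Worst-case aggregate comparison: population * generation rate * mutable genes.
bacterial_product = 10**23 * 10**6 * 10**3
human_product = 10**9 * 1 * 10**4
worst_case_ratio = bacterial_product // human_product
```

With these inputs the exponents combine as 23 + 6 + 3 - 9 - 4 = 19, so the ratio comes out at 10^19.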
One final issue: The Success Question
When mutations succeed, they succeed within an ecological niche.
So when we ask “When did a species arise?”, it is not enough to ask about the likelihood of a certain kind of mutation, one must also ask: what is the likelihood that that mutation arose in a niche that would support it?
So, don’t forget about acceptance rates.
The FOXP2 point mutation
Enard et al., “Molecular evolution of FOXP2, a gene involved in speech and language”, Nature, Vol. 418, August 22, 2002
Silent/expressed mutations in FOXP2
Edge labels are: Amino Acid / DNA substitutions
[Figure: tree of Human, Gorilla and Orangutan with internal nodes HG and OHG; the edges are labeled 0/7, 0/2, 1/2 and 2/2.]
Selective sweeps
Measures for determining the existence of a sweep:
– Tajima’s D: from Genetics, 1989 (conservative)
– Fay and Wu’s H: from “Hitchhiking under positive Darwinian selection”, Genetics, 2000.
Also, Griffiths and Tavare estimate selection using linked SNP data.
Population mutation rates
θia = 4Na μi: the population mutation rate for site i in species a, where Na is the effective population size of species a and μi is the mutation rate per generation at site i.
[Figure: histogram; x-axis: # of point mutations, 1 to 16; y-axis: 0 to 0.2.]
Tajima’s D for FOXP2
π = 0.03%
S/an = 0.079%
S is the number of segregating sites.
an = 1 + 1/2 + … + 1/(n-1), where n is the sample size.
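The quantity S/an is the estimate of the population mutation rate based on segregating sites, with an the sum of 1/i for i from 1 to n-1 and n the sample size. The sketch below computes it; the function names and the example numbers are assumptions, not values from Enard et al.

```python
def harmonic_a(n):
    """a_n = sum of 1/i for i = 1 .. n-1, where n is the sample size."""
    return sum(1.0 / i for i in range(1, n))


def segregating_site_estimate(s, n):
    """S / a_n for S segregating sites in a sample of n sequences."""
    return s / harmonic_a(n)


# Hypothetical sample: 11 segregating sites among 4 sequences.
theta_example = segregating_site_estimate(11, 4)
```

Dividing the result by the sequence length gives a per-site value that can be compared with the pairwise diversity, which is the comparison Tajima's D is built on.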
Discovering different rate constants
Finding the time of appearance of the FOXP2 segregation
• Sample current human population worldwide.
• Generate trees with different times for the human sequence data.
• Measure the likelihood of the different trees.
Multiple rates
The automorphism mapping M onto itself used to be a simple shift operation.
Now, it incorporates several underlying processes, including: – mutation of the bases (mutation rate)
– expression of the mutations (expression rate)
– stabilization of a conformational phenotype (stabilization rate)
– success of the substitution (acceptance rate)
The distance between sequences,
Part III.
Algorithms for phylogenies
Motivation
• Phylogenies provide measures of similarity and can lay a foundation for scoring alignments.
• Rate structures provide indicators for motifs.
• Branch points allow us to identify and classify interesting bases.
– If the branch points are in phenotypic trees, the mutating bases can be used as phenotypic identifiers.
– If the branch points are in genotypic trees, mutating (nonsilent) bases can be used as genetic identifiers.
What goes into a phylogeny?
Distance measures (UPGMA, NN)
Site info (MLE and Parsimony)
Substitution scores
Equilibrium distributions for MLE
Pairwise Alignment
Multiple Alignment
Phylogenies
Transitional probability data
What do we get in return?
Guide trees
Rates and probabilities
Scoring matrices Scoring matrices
Pairwise Alignment
Multiple Alignment
Phylogenies
Transitional probability data
Part III: Goals
• Depict methods for finding guide trees for progressive multiple alignment.
• Clarify the differences between MLE, Maximum Parsimony and Distance Methods and identify the optimization techniques appropriate for each.
• Define a new approach for faster identification of near-optimal phylogenies.
Progressive multiple alignment
• Choose a set of scores for sequence comparison
– Alignment scores from Needleman-Wunsch, Smith-Waterman and variants.
– Consensus word score from BLAST, PSI-BLAST and others
– Substitution (scoring) matrices: PAM, BLOSUM, Jukes-Cantor, etc.
• Construct a reputable guide tree
– Hierarchical clustering (UPGMA, Neighbor-Joining, Fitch and Margoliash)
– Maximum Parsimony (simple or weighted).
– Maximum Likelihood Estimation (MLE)
• Use the guide tree to produce an alignment
Tree evaluation - Parsimony
• Given a semi-labeled tree, it is possible to determine the tree’s internal nodes (ancestral sequences) using a parsimony algorithm.
• Evaluation function: A summation of the scored mutations in the parsimonious tree.
Parsimony - Illustrated
Leaves ABC and ADC: parent A(B or D)C, node 1, cost is 1
Leaves ABE and ACC: parent A(B or C)(E or C), node 2, cost is 2
Root ABC: node 3, cost is 3
Example: Simple Parsimony
Initialization:
Set the cost, C = 0. Set k = 2n-1, where n is the number of sequences.
Recursion to compute node, Nk:
if k is a leaf node, Nk = sequence k
if k is not a leaf node
Compute Ni and Nj for the daughter nodes of Nk.
Set N_k = N_i \cap N_j where the intersection of Ni and Nj is nonempty,
otherwise increment the cost by the number of nonmatching residues and set N_k = N_i \cup N_j.
Termination: Minimum cost of tree = C.
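The recursion above can be sketched as a short recursive function that works site by site. The tree encoding (nested pairs of leaf strings) and the function name are assumptions made for this example; the four leaves are the ones from the illustration.

```python
def simple_parsimony(node):
    """Return (candidate state sets per site, cost) for a nested-pair tree."""
    if isinstance(node, str):                      # leaf: N_k = sequence k
        return [{ch} for ch in node], 0
    left, right = node
    left_sets, left_cost = simple_parsimony(left)
    right_sets, right_cost = simple_parsimony(right)
    sets, cost = [], left_cost + right_cost
    for x, y in zip(left_sets, right_sets):
        common = x & y
        if common:                                 # nonempty intersection
            sets.append(common)
        else:                                      # mismatch: take the union, add 1
            sets.append(x | y)
            cost += 1
    return sets, cost


tree = (("ABC", "ADC"), ("ABE", "ACC"))
root_sets, total_cost = simple_parsimony(tree)
```

On the illustrated tree this gives the root ABC at a total cost of 3, matching the figure.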
Tree evaluation – Distance methods
• Given a set of alignment scores, but without assuming a tree topology, it is possible to determine a tree and its edge lengths using a distance method. This is sometimes called minimum evolution and includes the hierarchical clustering methods.
• Evaluation function: The sum of the edge lengths.
Hierarchical Clustering – Illustrated UPGMA
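[Figure: UPGMA clustering of five sequences, step by step. Sequences 1 and 2 join at node 6 (t1 = t2 = ½d12), sequences 4 and 5 join at node 7, node 7 and sequence 3 join at node 8, and nodes 6 and 8 join at the root, node 9, at height ½d68. Leaf order in the final tree: 1 2 4 5 3. From Durbin et al, 2001.]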
Algorithm: UPGMA
Input: N sequences and their relative distances, dij
Initialization:
Assign each sequence to its own cluster, Ci.
Define a leaf of T for each sequence and place at height = 0.
Iteration:
Pick two clusters Ci, Cj such that dij is minimal.
Define a new cluster k by Ck = Ci ∪ Cj.
Define a new set of distances {dkl} between Ck and all current clusters.
Define a node k with daughter nodes i and j, and place it at height ½dij.
Add k to the set of current clusters and remove i and j.
Termination:
Rooted: When only two clusters i, j, remain, add the root at height ½dij.
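A compact sketch of the clustering loop follows. It assumes the average-linkage update d_kl = (|Ci| d_il + |Cj| d_jl) / (|Ci| + |Cj|) for the new distances, stores distances in a dictionary keyed by unordered pairs, and represents each new node as a nested tuple. All names and the toy distances are illustrative.

```python
from itertools import combinations


def upgma(labels, distances):
    """labels: leaf names; distances: {frozenset({x, y}): d}. Returns (root, heights)."""
    d = dict(distances)
    size = {x: 1 for x in labels}
    height = {x: 0.0 for x in labels}
    active = list(labels)
    while len(active) > 1:
        # Pick the closest pair of current clusters.
        i, j = min(combinations(active, 2), key=lambda pair: d[frozenset(pair)])
        k = (i, j)
        height[k] = d[frozenset((i, j))] / 2
        size[k] = size[i] + size[j]
        active = [x for x in active if x not in (i, j)]
        for l in active:
            # Size-weighted average of the distances from the two merged clusters.
            d[frozenset((k, l))] = (size[i] * d[frozenset((i, l))]
                                    + size[j] * d[frozenset((j, l))]) / size[k]
        active.append(k)
    return active[0], height


# Hypothetical ultrametric distances for four leaves.
toy = {frozenset(p): v for p, v in {
    ("A", "B"): 2, ("A", "C"): 6, ("B", "C"): 6,
    ("A", "D"): 10, ("B", "D"): 10, ("C", "D"): 10}.items()}
root, heights = upgma(["A", "B", "C", "D"], toy)
```

Because the toy distances are ultrametric, the heights recovered here are exactly half the input distances at each merge.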
Tree evaluation - MLE
• Given a tree topology and sequences preassigned to each leaf, it is possible to determine a tree’s edge lengths using maximum likelihood estimation.
• Evaluation function: the likelihood of the tree.
Estimating Likelihood
• Estimate branch lengths by viewing evolution as a random process.
• Requires a probability model of evolution as a function of time.
– For DNA one can use the Jukes-Cantor model (all nucleotides have the same substitution rates), or the Kimura model (different rates for transitions and transversions).
– For proteins one can use Dayhoff, but in the probability form not the log-odds form.
Estimating Likelihood
S1, etc. are the bases or residues observed in the extant and ancestral taxa.
v = μt, where μ is the substitution rate and t is absolute time.
Pi,j(v) is the probability that the residue at node si becomes the residue at node sj in time v.
π0 is the prior probability of the bases or residues at any position.
The likelihood for this tree is:
L = π0 P0,5(v5) P5,1(v1) P5,2(v2) P0,6(v6) P6,3(v3) P6,4(v4)
Example: Likelihood
For each mutating site in a set of sequences
Initialization:
Set k = 2n-1, where n is the number of sequences.
Recursion: Compute P(Lk|a) for each symbol, a, in the alphabet as follows:
If k is a leaf node:
if xk,u = “a”, then P(Lk|a) = 1,
else P(Lk|a) = 0.
If k is not a leaf node:
Compute P(Li|a), P(Lj|a) for all a at daughters i, j.
Set P(L_k|a) = \sum_{b,c} P(b|a,t_i) P(L_i|b) P(c|a,t_j) P(L_j|c).
Termination: Likelihood for site u = \sum_a \pi_a P(L_{2n-1}|a)
(\pi_a is the equilibrium value of the probability distribution for a.)
Concluding step: Combine the likelihoods for each site.
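The recursion can be sketched for a single site as follows. This example assumes the Jukes-Cantor model with unit rate for P(b|a,t), a uniform equilibrium distribution of 1/4 per base, and a tree encoded as nested pairs of (subtree, edge time). The encoding and function names are assumptions for illustration.

```python
import math

ALPHABET = "ACGT"


def jc_prob(b, a, t, rate=1.0):
    """Jukes-Cantor P(b | a, t)."""
    decay = math.exp(-4 * rate * t)
    return 0.25 + 0.75 * decay if a == b else 0.25 - 0.25 * decay


def conditional(node, site):
    """Return {a: P(L_k | a)}; a leaf is a string, an internal node is
    ((left, t_left), (right, t_right))."""
    if isinstance(node, str):
        return {a: 1.0 if node[site] == a else 0.0 for a in ALPHABET}
    (left, t_i), (right, t_j) = node
    p_i, p_j = conditional(left, site), conditional(right, site)
    # The double sum over b and c factors into a product of two single sums.
    return {a: sum(jc_prob(b, a, t_i) * p_i[b] for b in ALPHABET)
               * sum(jc_prob(c, a, t_j) * p_j[c] for c in ALPHABET)
            for a in ALPHABET}


def site_likelihood(tree, site):
    """Sum over root symbols of pi_a * P(L_root | a), with pi_a = 1/4."""
    cond = conditional(tree, site)
    return sum(0.25 * cond[a] for a in ALPHABET)


# Two leaves that both show A at site 0, on edges of length 0.1 and 0.2.
two_leaf = (("A", 0.1), ("A", 0.2))
likelihood = site_likelihood(two_leaf, 0)
```

For the concluding step, the per-site likelihoods are usually combined by summing their logarithms over all sites.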
Maximizing Likelihood Estimation over edge times
Likelihood estimation includes a step for computing the likelihood of some character “a” at node k given the subtree of k.
While we know that there is the possibility of substitutions leading to a, these depend on how long a time we have to make those substitutions and we do not know the edge times of the tree. We must explore a series of possible times in order to maximize the likelihood.
• A method that maximizes likelihoods over edge times is what is usually referred to as MLE.
• Standard MLE procedures do not maximize likelihoods over all topologies of the tree.
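Comparisons between MLE, Parsimony and Distance Methods
MLE
– Requires semi-labeled tree: Yes
– Requires scored alignments: No
– Order: La^(2n-1); 2an^2
– Results, edge weights: Transitional probabilities
– Results, internal tree nodes: subtree probability
– Resulting tree is ultrametric: Yes
Parsimony
– Requires semi-labeled tree: Yes
– Requires scored alignments: No
– Order: 2an^2
– Results, edge weights: Mutation counts
– Results, internal tree nodes: Ancestral sequences
– Resulting tree is ultrametric: No
Distance Methods
– Requires semi-labeled tree: No
– Requires scored alignments: Yes
– Order: 2n^2
– Results, edge weights: Distance measures, e.g. alignment scores
– Results, internal tree nodes: UPGMA: a cluster of sequences
– Resulting tree is ultrametric: UPGMA: no; NN: yes.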
Exploring different topologies
• Successive addition and rearrangement
– Very common method (see Phylip programs including: PROTPARS, DNAPARS, DNACOMP, DNAML, DNAMLK, RESTML, KITSCH, FITCH, CONTML, MIX and DOLLOP)
– Sequences are taken in the order that they appear in the input file and successively added to a tree.
• MCMC
Successive addition
• Initialization:
– Place the set of sequences into L.
– Create a tree,T, with one node – the root.
• Iteration: for each sequence in L
– Remove a sequence from L and add it as a leaf to T.
– Apply a process of local rearrangement (in Felsenstein’s package, there are (n-1)(2n-3) arrangements.)
– Score each locally arranged tree.
– set T to equal the best scoring tree.
• Termination: Globally rearrange the tree by swapping subtrees, score each globally rearranged tree and accept the tree with the best score.
Markov Chain Monte Carlo
A Bayesian method for phylogenetic inference – Moderately new method rooted in molecular dynamics.
– Topologies are randomly generated and scored so that a representative set of most likely tree topologies can be identified.
Mau, Newton and Larget (1998) apply MCMC to sample trees using Bayes theorem. The following explanation is based on their methodology - the mistakes are mine, the facts and foundations, theirs.
Introduction to the method
is the set of all semi-labeled trees
Introduction to the method
Sampling the set of trees
Q1
Q2
Q3
Introduction to the method
a b c d a c bd a b c d
Q01 Q12 Q23
…
A Chain of Accepted Samples
Introduction to the method
The partitioned space with representatives {τ1, τ3}
MCMC propaganda
• allow exact inference provided certain convergence criteria are demonstrated.
• are efficient and can handle many more taxa or sequences.
• measure uncertainty during tree construction (no bootstrapping needed.)
Summary of the Algorithm
1. Choose a starting tree.
2. Perturb the current tree’s topology and branch lengths to find a new tree.
3. Measure the likelihood for the new tree.
4. Compare the new tree to the last tree and decide whether or not to accept it into the chain.
5. If you’ve got a sufficiently long chain, check the characteristics of your sample to see if there is convergence to a set of representative topologies. If so, stop. Otherwise, go to 2.
Subproblems to be discussed
1. How do we represent the tree so that it is easy to operate on? Cophenetic matrices.
2. What is our perturbation operator?
3. How do we build our sampling chain?
4. When are we done sampling?
The Cophenetic Matrix
Some Notation
τ – a topology
n – a node
a(n) – the ancestor of a node
L – a leaf node (the leaves are the current record)
I – an internal node (the historical record)
Cophenetic Trees
Labeled history (t1, t2) provides an order on coalescent levels.
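[Figure: cophenetic tree with leaves L1, L2 and L3 and internal nodes I0, I1 and I2, drawn on coalescent levels 0 to 2; t1 and t2 mark the intervals between levels.]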
Example: A Cophenetic Tree
These trees are described in terms of nodes coalescing or merging backwards in time.
t1= 0.8
t2=0.3
t3=0.7
t4=0.5
t5=0.9
t6=1.5
total: 4.7
Example: Cophenetic Matrix
Leaf    5     7     4     1     2     6     3
5       0   9.4   9.4   9.4   9.4   9.4   9.4
7             0   1.6   4.6   6.4   6.4   6.4
4                   0   4.6   6.4   6.4   6.4
1                         0   6.4   6.4   6.4
2                               0   3.6   3.6
6                                     0   2.2
3                                           0
The cophenetic matrix for the previous tree.
The tree representation (σ, a) is {(5,7,4,1,2,6,3), (4.7, 0.8, 2.3, 3.2, 1.8, 1.1)}
The Cophenetic Matrix
Theorem: For any weighted binary tree with labeled leaf nodes, the tree topology and branch lengths can be uniquely determined using the within-tree distances between all pairs of leaf nodes. (Lapointe and Legendre, 1992)
Note, each permutation of the leaf labels generates a different n x n symmetric matrix of distances.
What is the perturbation operator?
Q is the proposal function and it has two stages:
• Q1 randomly selects a new leaf order
• Q2 perturbs the values of the matrix superdiagonal.
The proposal mechanism is symmetrical: Q(n, n+1) = Q(n+1, n)
Details on Q1 and Q2
Q1 samples one of the 2^{n-1} leaf orderings of the current tree model.
Q2 simultaneously and independently modifies the elements of the superdiagonal by drawing each from a uniform distribution on (ai ± δ), where δ is constant.
By applying both types of perturbations, Q1 and Q2, all the permutations of trees can be reached.
Illustration of Q2
Subproblems to be discussed
1. How do we represent the tree so that it is easy to operate on? Use cophenetic matrices.
2. What is our perturbation operator? Q.
3. How do we build our sampling chain? Apply Metropolis-Hastings
4. When are we done?
Acceptance with Metropolis-Hastings
Given a tree τ, Metropolis-Hastings:
1. Applies Q to build a new tree, τ*.
2. Always accepts the new tree when it is more likely than the old one and sometimes accepts it when it is less likely than the old one.
Acceptance with Metropolis-Hastings – the algorithm
If P(τ*) > P(τ)
accept τ* into the chain.
else
accept τ* into the chain with probability P(τ*) / P(τ)
Acceptance with Metropolis-Hastings
The final step in evaluating the acceptance test is evaluating
P(τ*) / P(τ)
This is easy: P(τ) is approximated using the likelihood estimate (LE) of τ.
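The acceptance rule can be written as a small function. This is a sketch under the symmetric-proposal assumption stated earlier; the function name and the optional `draw` argument (a uniform random number supplied by the caller, mainly to make the behavior easy to check) are assumptions for illustration.

```python
import random


def accept_proposal(p_new, p_old, draw=None):
    """Accept if the new tree is at least as likely; otherwise accept with
    probability p_new / p_old."""
    if p_new >= p_old:
        return True
    if draw is None:
        draw = random.random()   # uniform on [0, 1)
    return draw < p_new / p_old


# A proposal half as likely as the current tree is accepted about half the time.
kept = accept_proposal(0.05, 0.10, draw=0.4)
rejected = accept_proposal(0.05, 0.10, draw=0.6)
```

In practice the likelihoods are tiny, so implementations usually compare log-likelihoods instead of raw probabilities; that refinement is left out here.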
Size of chain and convergence
• How many trees do you have to propose before you begin to get a good enough sample? Mau et al. 1998 sample over about 2500 trees for Clarkia, a phylogeny with 9 leaves.
• How do you test that you are done? At the end of the run, we say that we have convergence if there is a small set of topologies with high relative frequency in the chain.
• What’s the result? The topologies with the highest frequencies are the reported reconstructions.
Mixing
• To obtain a confidence measure, the algorithm must be run more than once: each run generates a chain of accepted trees.
• Chains “mix” well when they come up with the same representative topologies, starting from different tree topologies.
• If running a sufficient number of independent chains is computationally prohibitive, Suchard et al, 2002, provide a “poor man's estimate of the uncertainty”.
Example with binary data (from Mau, et al, 1998)
9 species of genus Clarkia (California plants)
120 restriction sites
Data translated into a 9 x 120 matrix of zeroes and ones, representing the absence or presence of a restriction site in the genome of each species.
Running the MCMC algorithm
Random starting trees
Chains of length 250,000 were subsampled at a rate of 1/100, giving 2,500 trees.
Each run took 20 minutes on a Sparc 10.
Convergence was inferred by reproducibility across runs with very different starting trees.
The most common topologies for Clarkia
A = 1,2; B = 3,4; C = 5,6; D=8,9
References
Smouse and Li (1989) introduced the Bayesian paradigm, but not the notation, to the phylogeny reconstruction problem.
Goldman (1993) used non-Bayesian Monte Carlo tests of significance to assess the adequacy of evolutionary models.
Griffiths and Tavare (1994) constructed Markov chains to compute likelihoods for ancestral inference.
Mau, Newton and Larget (1998) apply MCMC to sample trees using Bayes theorem.
Drill-down: Rates
The way I use it, and I admit this is quirky, motif means the genetic profile for a functional structure. Using the following definitions:
– Let rG be the rate of mutation for a gene.
– Let rE be the rate of expressed mutation for the protein G encodes.
– Let rS be the rate of structural mutation for the protein G encodes.
– Let rF be the rate of functional mutation for the protein G encodes.
rG > rE > rS > rF
Note that the rate of neutral mutations is rN = rG – rF.
The “true” rate of mutation for a motif is rF, the observed rate of mutation for members of a motif in a genotypic tree is rG. If we want motif branchings, we eliminate all branchings in the phylogeny occurring with rates rN.
Drill-down: Semi-labeled trees
Trees with a defined branching pattern and defined leaf labels but WITHOUT edge lengths or internal node labels.
In our terms, phylogenies with known branching patterns but without information about ancestors or mutation times.
nccbac nacbac ncbbbc nccnaa
Drill-down: Progressive Alignments
• As you move up the tree, add to sum of characters in growing alignment
Progressive Alignments
Sum of characters in growing alignment can be represented in a table of values called a frequency matrix or a profile.
Progressive Alignments
Alignments are frozen once they are made. Scores are then calculated between aligned positions tabulated in a frequency matrix, using a scoring table
Sij = 2 × G:G + 1 × A:G
     A   G   S   T
A    4   1   3  10
G        3   2   6
S            2  14
T                8
Algorithm: Neighbor-joining
Input: N sequences and their relative distances, dij
Initialization:
Define a leaf of T for each sequence.
Iteration:
Pick two nodes i, j such that dij – (ri + rj) is minimal.
Define a new set of distances, {dkl}, between k and all current nodes.
Define a node k with daughter nodes i and j, and join them to k with edge lengths eik = ½(dij + ri – rj) and ejk = dij – eik.
Add k to the set of current nodes and remove i and j.
Termination:
Unrooted: When only two nodes i, j, remain, add an edge of length dij between them.
Comparison: Neighbor-joining and UPGMA
Minimization:
– UPGMA uses dij
– Neighbor-joining uses dij – (ri + rj), where r_i = \frac{1}{|L| - 2} \sum_{k \in L} d_{ik}
Distance measures:
For distances between leaves i and j:
• dij is the same in both algorithms.
For distances between nodes k and m:
• UPGMA uses d_{ij} = \frac{1}{|C_i||C_j|} \sum_{p \in C_i, q \in C_j} d_{pq}
• Neighbor-joining uses dkm = ½ (dim + djm – dij) where i and j are the daughters of k.
Edge lengths:
UPGMA sets the height of node k to ½ the distance between daughters i, j (½ dij).
Neighbor-joining sets the edge length between k and daughter i to ½(dij + ri – rj), and between k and daughter j to dij – eik.
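One neighbor-joining selection step can be sketched as follows, using r_i = (1/(|L| - 2)) Σ_k d_ik and the corrected criterion dij – (ri + rj). The function name and the 4-leaf distance matrix are illustrative; the matrix is additive on a tree whose closest pair of leaves (0 and 2) are not neighbors.

```python
def nj_pick(d):
    """Pick the pair minimizing d_ij - (r_i + r_j) and return it with the
    edge lengths from i and j to their new parent k."""
    n = len(d)
    r = [sum(row) / (n - 2) for row in d]
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    i, j = min(pairs, key=lambda p: d[p[0]][p[1]] - (r[p[0]] + r[p[1]]))
    e_ik = 0.5 * (d[i][j] + r[i] - r[j])
    e_jk = d[i][j] - e_ik
    return (i, j), e_ik, e_jk


# Hypothetical additive distances: leaves 0,1 are neighbors (edges 1 and 4),
# leaves 2,3 are neighbors (edges 1 and 4), joined by an internal edge of 1.
toy_d = [[0, 5, 3, 6],
         [5, 0, 6, 9],
         [3, 6, 0, 5],
         [6, 9, 5, 0]]
pair, e_ik, e_jk = nj_pick(toy_d)

# The pair UPGMA would pick: the smallest raw distance.
closest_raw = min(((i, j) for i in range(4) for j in range(i + 1, 4)),
                  key=lambda p: toy_d[p[0]][p[1]])
```

Here the raw minimum picks leaves 0 and 2, which are not neighbors, while the corrected criterion picks the true neighbors 0 and 1 and recovers their edge lengths.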
Drill-down: MLE
Simplest case, site u = 3:
Leaf i: nccbabc, P(Li|c) = 1
Leaf j: ncbbcbc, P(Lj|b) = 1
Parent k with symbol a: P(Lk|a) = P(c|a,ti) P(b|a,tj)
The edges to leaves i and j are labeled P(c|a,ti) and P(b|a,tj).
Drill-down: Enumerating topologies
|T_{\text{semi-labeled}}| = (2n-3)!! = \frac{(2n-2)!}{2^{n-1} (n-1)!}
|T_{\text{unlabeled}}| = \frac{1}{n} \binom{2(n-1)}{n-1}
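To get a feel for how quickly the search space grows, the sketch below counts rooted binary trees with n labeled leaves in two equivalent ways: as the product of odd numbers up to 2n-3, and as (2n-2)! / (2^(n-1) (n-1)!). The function names are assumptions for illustration.

```python
from math import factorial


def semi_labeled_count(n):
    """Rooted binary trees on n labeled leaves: (2n-2)! / (2^(n-1) (n-1)!)."""
    return factorial(2 * n - 2) // (2 ** (n - 1) * factorial(n - 1))


def odd_product(n):
    """The same count written as (2n-3)!! = 1 * 3 * 5 * ... * (2n-3)."""
    result = 1
    for k in range(1, 2 * n - 2, 2):
        result *= k
    return result


counts = {n: semi_labeled_count(n) for n in (2, 3, 4, 10)}
```

Already at ten leaves there are tens of millions of candidate topologies, which is why exhaustive evaluation gives way to rearrangement heuristics and sampling.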
Drill-down: Acceptance with Metropolis-Hastings
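A proposed tree is accepted with probability:
\min\left\{1, \frac{P(\tau^*) Q(\tau^*, \tau)}{P(\tau) Q(\tau, \tau^*)}\right\}
However, because the proposal mechanism is symmetrical, you can step forward or backward with equal probability:
Q(\tau, \tau^*) = Q(\tau^*, \tau)
Hence our test becomes
\min\left\{1, \frac{P(\tau^*)}{P(\tau)}\right\}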