Upload
voliem
View
216
Download
1
Embed Size (px)
Citation preview
Statistics 877Statistical Methods for Molecular Biology
Phylogenetics
Cécile Ané
Departments of Botany and of StatisticsUniversity of Wisconsin - Madison
Spring 2017 - Feb. 9
Outline
1 Trait evolution on a phylogeny
2 Molecular evolution on a tree: Maximum LikelihoodEstimation
Outline
1 Trait evolution on a phylogenyExamplesNon-independence problemBrownian motion model
Trait evolution on a phylogeny
Sometimes we don’t care about the tree, but we need it.
In antelopes, are brain size and group size correlated?In plants, did the capacity to shed leaves evolve before orafter the move to freezing winters?In flowers, is the color red correlated with pollination byhummingbirds?
Is red correlated with hummingbirds?
Pollinators in Iochroma:hummingbirds, ants, butterflies,flies.
Pollinator importance = visitationrate× # ovules potentiallypollinated per visit
Predictors: flower color, reward,corolla length, etc.
Smith, Ané & Baum (2008) Evolution
Problem: units are not independent
Y4
Y3
Y1
Y2
X4
X3
X2
X1
X =Hummingbird importance
Y =
Bri
ghtn
ess
0.8
0.40.6
0.2
A sample of species do not form a “random” sample because ofa lack of independence.
Y = Xβ + ε
but ε’s are not iid:identically distributed (id) perhaps, but not independent (i).
Using the tree to parametrize the non-independence
for quantitative traits:Brownian motion model: neutral evolutionOrnstein-Uhlenbeck model: natural selection towards anselection optimum µ
0.0 0.4 0.8
78
911
Time
Tra
it va
lue
●
●
●
●
BM
0.0 0.4 0.8
68
1012
Time
Tra
it va
lue
●
●
●
●
OU ((αα == 2))
Brownian motion model for the residuals
Bt − Bs ∼ N (0, σ2(t − s))
ε ∼ N (0,Vθ)
Vθ,ij = σ2∗ time shared between tips i and j
Y1
Y4
Y2
Y30.60.4
0.8 0.2
Y1
Y4
Y2
Y3
Vθ = σ2
1 .8 0 0.8 1 0 00 0 1 .40 0 .4 1
Vθ = σ2
1 0 0 00 1 0 00 0 1 00 0 0 1
Likelihood for Brownian motion model
Y = Xβ + ε, ε ∼ N (0,Vθ)
Likelihood:
−2 log f (Y |θ) = n log(2π) + log |Vθ|+ (Y − Xβ)′ Vθ−1(Y − Xβ)
Looks simple, but tricky for a tree with 1000’s species.What is the size of Vθ?
Areas for current research
Linear time algorithm: for BM model, but also phylogeneticlogistic or Poisson regression(Ho & Ané, 2014, Goolsby, Bruggeman & Ané 2017)Asymptotic theory(Ané, 2008, Ho & Ané 2013, Ané Ho & Roch 2017)model selection: need new tools(Khabbazian, Kriebel, Rohe & Ané 2016)
Outline
2 Molecular evolution on a tree: Maximum LikelihoodEstimation
Maximum Likelihood Estimation for 2 speciesLikelihood Calculations on TreesSearch for Maximum Likelihood
Distance Between Pairs of Taxa
In a two-taxon tree, the distance between two taxa can beestimated under any model by maximum likelihood.If the distance is t and at site i one species has base A andthe other has base C, the contribution to the likelihood atthis site j is
Lj(t) = πAPAC(t) = πCPCA(t)
for a time-reversible model.The overall likelihood is
L(t) =∏
j
Lj(t)
and the log-likelihood is
`(t) =∑
j
log Lj(t) =∑
j
(logπx [j] + log Px [j]y [j](t)
)
Distance Between Pairs of Taxa
Models with free π: it is common to estimate π withobserved base frequencies.Other parameters usually estimated by maximumlikelihood.The simplest models have closed form solutions, othersrequire numerical optimization.
Notation for the Alignment
Alignment of m taxa and n sites: mn nucleotide bases.Observed base for the i th taxon and the j th site: xij .
Likelihood of a Tree
P(t): 4× 4 probability transition matrix over an edge of length t .
likelihood of the tree:
∏site j
∑πy1j
∏edge e
Pyp(e),j ,yc(e),j (te)
p(e): parent node of edge ec(e): child edge ete: length of enodes 1 (root) to m − 2: internal nodes,nodes m − 1 to 2m − 2: leaves, observed xij = ym−2+i,j
sum: over the 4m−2 possible allocations of valuesat internal nodes: y1,j , . . . , ym−2,j ∈ {A,C,G,T}
Likelihood of a Tree
a naive calculation would not be tractable for large treesfor m = 10, 8 internal nodes, 48 = 65,536 allocations, orterms in the sum
With a time-reversible model, the location of a root (wherethe CTMC begins at stationarity) does not affect thelikelihood calculation.
Felsenstein’s Pruning Algorithm
dynamic programming, or belief propagation algorithmpartial calculations, for each site and node:
IP{subtree rooted at that node} given each possiblestarting base.
leaves first, root last, then weighted average of full treestarting at each base.time complexity: ∝ number of sites, not exponentially.
Maximum Likelihood Estimation for one Tree
For a single tree topology, the ML estimation requiresoptimization of branch lengths and of any parameters inthe substitution model.Numerical optimization methods are required even forsimple models and small trees.
Tree Search
naive search for ML tree: optimize ML branch lengths etc.for each possible topology, then picking the best.
this exhaustive search is not feasible for 12+ taxa
heuristic search algorithms:
define a neighborhood structure for possible topologiesgo through neighborsjump to the first "better" neighbor with a higher likelihoodstop when no "better" neighbors.
e.g. RAxML and GARLI are two modern programs