Statistics 877 Statistical Methods for Molecular Biology ...kendzior/STAT877/SLIDES/ane1.pdf · Statistics 877 Statistical Methods for Molecular Biology Phylogenetics Cécile Ané

Statistics 877
Statistical Methods for Molecular Biology

Phylogenetics

Cécile Ané

Departments of Botany and of Statistics
University of Wisconsin - Madison

Spring 2017 - Feb. 9

Outline

1 Trait evolution on a phylogeny

2 Molecular evolution on a tree: Maximum Likelihood Estimation

Outline

1 Trait evolution on a phylogeny
- Examples
- Non-independence problem
- Brownian motion model

Trait evolution on a phylogeny

Sometimes we don’t care about the tree, but we need it.

- In antelopes, are brain size and group size correlated?
- In plants, did the capacity to shed leaves evolve before or after the move to freezing winters?
- In flowers, is the color red correlated with pollination by hummingbirds?

Is red correlated with hummingbirds?

Pollinators in Iochroma: hummingbirds, ants, butterflies, flies.

Pollinator importance = visitation rate × number of ovules potentially pollinated per visit

Predictors: flower color, reward, corolla length, etc.

Smith, Ané & Baum (2008) Evolution

Problem: units are not independent

[Figure: a four-tip phylogeny whose tips carry the values (X1, Y1), ..., (X4, Y4), with X = Hummingbird importance and Y = Brightness; edge labels 0.8, 0.4, 0.6, 0.2.]

A sample of species does not form a “random” sample because of a lack of independence.

Y = Xβ + ε

but the ε’s are not iid: identically distributed (id) perhaps, but not independent (i).

Using the tree to parametrize the non-independence

For quantitative traits:
- Brownian motion (BM) model: neutral evolution
- Ornstein-Uhlenbeck (OU) model: natural selection towards a selection optimum µ

[Figure: trait value plotted against time (0.0 to 0.8) in two panels: BM, and OU (α = 2).]
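The contrast between the two models can be sketched by simulation. The block below is a minimal sketch, not the code behind the slides: it assumes an Euler-Maruyama discretization of dX = α(µ − X) dt + σ dB, and the function name `simulate_path`, the starting value, the optimum and the step count are all illustrative choices. With α = 0 the drift vanishes and the path is a Brownian motion; with α > 0 the path is pulled towards µ.

```python
import numpy as np

def simulate_path(alpha, sigma=1.0, mu=10.0, x0=9.0, t_max=1.0, n_steps=200, seed=0):
    """Euler-Maruyama sketch of dX = alpha * (mu - X) dt + sigma dB.

    alpha = 0 gives Brownian motion (BM); alpha > 0 gives an
    Ornstein-Uhlenbeck (OU) process with optimum mu.
    All default values are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    dt = t_max / n_steps
    x = np.empty(n_steps + 1)
    x[0] = x0
    for k in range(n_steps):
        drift = alpha * (mu - x[k]) * dt          # pull towards the optimum
        noise = sigma * np.sqrt(dt) * rng.normal()  # Brownian increment
        x[k + 1] = x[k] + drift + noise
    return x

bm_path = simulate_path(alpha=0.0)   # neutral evolution
ou_path = simulate_path(alpha=2.0)   # selection towards mu
```

Setting `sigma=0.0` removes the noise and shows the deterministic part alone: the BM path stays at `x0`, while the OU path moves from `x0` towards `mu`.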

Brownian motion model for the residuals

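$$B_t - B_s \sim \mathcal{N}\big(0, \sigma^2 (t - s)\big)$$

$$\varepsilon \sim \mathcal{N}(0, V_\theta)$$

$$V_{\theta,ij} = \sigma^2 \times \text{time shared between tips } i \text{ and } j$$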

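[Figure: two trees on the tips Y1, Y2, Y3, Y4; the first is drawn with edge lengths 0.6, 0.4, 0.8, 0.2.]

First tree:

$$V_\theta = \sigma^2 \begin{pmatrix} 1 & .8 & 0 & 0 \\ .8 & 1 & 0 & 0 \\ 0 & 0 & 1 & .4 \\ 0 & 0 & .4 & 1 \end{pmatrix}$$

Second tree:

$$V_\theta = \sigma^2 \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}$$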
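The rule "covariance = σ² × shared time" can be checked on a small tree. The sketch below is illustrative only: the dictionary encoding of the tree, the internal node names `A` and `B`, and the pairing of tips as (Y1, Y2) and (Y3, Y4) are assumptions chosen to be compatible with the edge lengths and the first matrix shown above. The shared time of two tips is computed as the total length of the edges that lie on both root-to-tip paths.

```python
import numpy as np

# Hypothetical encoding: child -> (parent, length of the edge above the child).
# The pairing (Y1, Y2) and (Y3, Y4) is an assumption.
tree = {
    "A": ("root", 0.8), "B": ("root", 0.4),
    "Y1": ("A", 0.2), "Y2": ("A", 0.2),
    "Y3": ("B", 0.6), "Y4": ("B", 0.6),
}

def edges_to_root(node, tree):
    """Map each edge (named by its child node) on the path node -> root to its length."""
    edges = {}
    while node in tree:
        parent, length = tree[node]
        edges[node] = length
        node = parent
    return edges

def bm_covariance(tips, tree, sigma2=1.0):
    """V[i, j] = sigma2 * (time shared between tips i and j)."""
    paths = [edges_to_root(tip, tree) for tip in tips]
    n = len(tips)
    V = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            shared = sum(length for edge, length in paths[i].items()
                         if edge in paths[j])
            V[i, j] = sigma2 * shared
    return V

V = bm_covariance(["Y1", "Y2", "Y3", "Y4"], tree)
print(V)
```

On the diagonal, a tip shares its whole root-to-tip path with itself, which gives the variance σ² × 1 here.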

Likelihood for Brownian motion model

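$$Y = X\beta + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, V_\theta)$$

Likelihood:

$$-2 \log f(Y \mid \theta) = n \log(2\pi) + \log |V_\theta| + (Y - X\beta)' \, V_\theta^{-1} (Y - X\beta)$$

Looks simple, but tricky for a tree with 1000s of species.
What is the size of $V_\theta$?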
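A direct evaluation of this formula makes the cost visible: the covariance matrix has one row and one column per tip, so it is n × n, and a dense factorization takes on the order of n³ operations. The sketch below is a plain dense implementation under that observation, not one of the linear-time algorithms; the function names `neg2_loglik` and `gls_beta` are illustrative, and the Cholesky factorization is one common choice among several.

```python
import numpy as np

def neg2_loglik(Y, X, beta, V):
    """-2 log f(Y | theta) = n log(2 pi) + log|V| + (Y - X beta)' V^{-1} (Y - X beta)."""
    n = len(Y)
    r = Y - X @ beta
    L = np.linalg.cholesky(V)                  # V = L L', with L lower triangular
    z = np.linalg.solve(L, r)                  # z = L^{-1} r, so z'z = r' V^{-1} r
    logdet = 2.0 * np.sum(np.log(np.diag(L)))  # log|V| from the Cholesky factor
    return n * np.log(2.0 * np.pi) + logdet + z @ z

def gls_beta(Y, X, V):
    """Generalized least squares: (X' V^{-1} X)^{-1} X' V^{-1} Y."""
    Vinv_X = np.linalg.solve(V, X)
    Vinv_Y = np.linalg.solve(V, Y)
    return np.linalg.solve(X.T @ Vinv_X, X.T @ Vinv_Y)

# Made-up data for four tips, with an intercept and one predictor.
V4 = np.array([[1.0, 0.8, 0.0, 0.0],
               [0.8, 1.0, 0.0, 0.0],
               [0.0, 0.0, 1.0, 0.4],
               [0.0, 0.0, 0.4, 1.0]])
X4 = np.column_stack([np.ones(4), [0.1, 0.3, 0.6, 0.9]])
Y4 = np.array([0.2, 0.4, 0.5, 0.8])
beta_hat = gls_beta(Y4, X4, V4)
print(beta_hat, neg2_loglik(Y4, X4, beta_hat, V4))
```

For a fixed covariance matrix, the GLS estimate minimizes the quadratic term, so it also minimizes −2 log f over β.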

Areas for current research

- Linear-time algorithms: for the BM model, but also for phylogenetic logistic or Poisson regression (Ho & Ané 2014; Goolsby, Bruggeman & Ané 2017)
- Asymptotic theory (Ané 2008; Ho & Ané 2013; Ané, Ho & Roch 2017)
- Model selection: need new tools (Khabbazian, Kriebel, Rohe & Ané 2016)

Outline

2 Molecular evolution on a tree: Maximum Likelihood Estimation
- Maximum Likelihood Estimation for 2 species
- Likelihood Calculations on Trees
- Search for Maximum Likelihood

Distance Between Pairs of Taxa

In a two-taxon tree, the distance between the two taxa can be estimated under any model by maximum likelihood.

If the distance is $t$ and at site $j$ one species has base A and the other has base C, the contribution of this site to the likelihood is

$$L_j(t) = \pi_A P_{AC}(t) = \pi_C P_{CA}(t)$$

for a time-reversible model.

The overall likelihood is

$$L(t) = \prod_j L_j(t)$$

and the log-likelihood is

$$\ell(t) = \sum_j \log L_j(t) = \sum_j \left( \log \pi_{x[j]} + \log P_{x[j]\,y[j]}(t) \right)$$

where $x[j]$ and $y[j]$ are the bases observed at site $j$ in the two species.

Distance Between Pairs of Taxa

- Models with free π: it is common to estimate π with the observed base frequencies.
- Other parameters are usually estimated by maximum likelihood.
- The simplest models have closed-form solutions; others require numerical optimization.
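As one instance of a closed-form solution, the sketch below assumes the Jukes-Cantor (JC69) model, which the slides do not name: all π equal 1/4, P_xx(t) = 1/4 + (3/4) e^{−4t/3} and P_xy(t) = 1/4 − (1/4) e^{−4t/3} for x ≠ y, with t in expected substitutions per site. Under this assumption the ML distance is t̂ = −(3/4) log(1 − 4p/3), where p is the proportion of sites that differ. The two toy sequences and the function names are made up for illustration.

```python
import math

def jc69_loglik(t, seq_x, seq_y):
    """Two-taxon log-likelihood: sum over sites of log pi_x + log P_xy(t), under JC69."""
    e = math.exp(-4.0 * t / 3.0)
    p_same = 0.25 + 0.75 * e   # P_xx(t)
    p_diff = 0.25 - 0.25 * e   # P_xy(t), x != y
    return sum(math.log(0.25) + math.log(p_same if a == b else p_diff)
               for a, b in zip(seq_x, seq_y))

def jc69_distance(seq_x, seq_y):
    """Closed-form ML distance under JC69 (requires p < 3/4)."""
    p = sum(a != b for a, b in zip(seq_x, seq_y)) / len(seq_x)
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

# Toy alignment: 10 sites, 2 of which differ (p = 0.2).
x = "ACGTACGTAC"
y = "ACGTACGTTT"
t_hat = jc69_distance(x, y)
print(round(t_hat, 4), jc69_loglik(t_hat, x, y))
```

For models without a closed form, the same `jc69_loglik`-style function would instead be handed to a one-dimensional numerical optimizer.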

Notation for the Alignment

Alignment of $m$ taxa and $n$ sites: $mn$ nucleotide bases.
Observed base for the $i$th taxon and the $j$th site: $x_{ij}$.

Likelihood of a Tree

$P(t)$: $4 \times 4$ probability transition matrix over an edge of length $t$.

Likelihood of the tree:

$$\prod_{\text{site } j} \; \sum \; \pi_{y_{1,j}} \prod_{\text{edge } e} P_{y_{p(e),j},\, y_{c(e),j}}(t_e)$$

- $p(e)$: parent node of edge $e$
- $c(e)$: child node of edge $e$
- $t_e$: length of $e$
- nodes $1$ (root) to $m-2$: internal nodes
- nodes $m-1$ to $2m-2$: leaves, with observed $x_{ij} = y_{m-2+i,j}$
- sum: over the $4^{m-2}$ possible allocations of values at internal nodes, $y_{1,j}, \dots, y_{m-2,j} \in \{A, C, G, T\}$

Likelihood of a Tree

- A naive calculation would not be tractable for large trees: for $m = 10$ there are 8 internal nodes, hence $4^8 = 65{,}536$ allocations, or terms in the sum.
- With a time-reversible model, the location of the root (where the CTMC begins at stationarity) does not affect the likelihood calculation.

Felsenstein’s Pruning Algorithm

- Dynamic programming, or belief propagation algorithm.
- Partial calculations, for each site and node: P{data in the subtree rooted at that node}, given each possible starting base.
- Leaves first, root last, then a weighted average (weights π) of the full-tree likelihood starting at each base.
- Time complexity: proportional to the number of sites and to the number of nodes, not exponential in the number of taxa.
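The recursion can be sketched for one site on a toy tree. Everything specific in the block below is an assumption made for illustration: the JC69 transition matrix, the rooted four-leaf tree and its edge lengths, the node names, and the observed bases. The toy tree is drawn with a degree-2 root, so it has 3 internal nodes rather than the m − 2 of the notation above. `partial` returns, for each possible base at a node, the probability of the data observed below that node; `brute_force` sums over all allocations at internal nodes and serves only as a check.

```python
import itertools
import numpy as np

BASES = "ACGT"

def jc69_matrix(t):
    """4x4 JC69 transition matrix over an edge of length t (assumed model)."""
    e = np.exp(-4.0 * t / 3.0)
    return np.full((4, 4), 0.25 - 0.25 * e) + np.eye(4) * e

# Hypothetical rooted tree: node -> list of (child, edge length). Leaves have no entry.
children = {
    "root": [("u", 0.1), ("v", 0.2)],
    "u": [("s1", 0.3), ("s2", 0.1)],
    "v": [("s3", 0.2), ("s4", 0.4)],
}

def partial(node, site):
    """Vector over bases b of P{data below node | node has base b}."""
    if node not in children:                 # leaf: indicator of the observed base
        vec = np.zeros(4)
        vec[BASES.index(site[node])] = 1.0
        return vec
    vec = np.ones(4)
    for child, t in children[node]:          # children are handled before their parent
        vec *= jc69_matrix(t) @ partial(child, site)
    return vec

def site_likelihood(site, pi=np.full(4, 0.25)):
    """Weighted average of the root partials, with weights pi."""
    return float(pi @ partial("root", site))

def brute_force(site, pi=np.full(4, 0.25)):
    """Naive sum over all 4^(number of internal nodes) allocations, for checking."""
    internal = list(children)
    total = 0.0
    for states in itertools.product(range(4), repeat=len(internal)):
        y = dict(zip(internal, states))
        y.update({leaf: BASES.index(base) for leaf, base in site.items()})
        term = pi[y["root"]]
        for parent, kids in children.items():
            for child, t in kids:
                term *= jc69_matrix(t)[y[parent], y[child]]
        total += term
    return total

site = {"s1": "A", "s2": "A", "s3": "C", "s4": "G"}
print(site_likelihood(site), brute_force(site))
```

For a full alignment, the site likelihoods would be multiplied across sites, or their logarithms summed.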

Maximum Likelihood Estimation for one Tree

- For a single tree topology, ML estimation requires optimization of the branch lengths and of any parameters in the substitution model.
- Numerical optimization methods are required even for simple models and small trees.

Tree Search

naive search for the ML tree: optimize branch lengths etc. for each possible topology, then pick the best.

this exhaustive search is not feasible for 12+ taxa
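The size of the search space explains why. The number of unrooted binary topologies on m labeled taxa is the standard count (2m − 5)!! = 3 × 5 × ... × (2m − 5); this formula is not on the slide and is brought in here as supporting arithmetic. The function name is illustrative.

```python
def n_unrooted_topologies(m):
    """Number of unrooted binary tree topologies on m >= 3 labeled taxa: (2m - 5)!!."""
    count = 1
    for k in range(3, 2 * m - 4, 2):   # odd numbers 3, 5, ..., 2m - 5
        count *= k
    return count

for m in (4, 5, 10, 12):
    print(m, n_unrooted_topologies(m))
```

For 12 taxa this gives 654,729,075 topologies, each of which would need its own branch-length optimization.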

heuristic search algorithms:

- define a neighborhood structure for possible topologies
- go through neighbors
- jump to the first "better" neighbor, i.e. one with a higher likelihood
- stop when there are no "better" neighbors.

e.g. RAxML and GARLI are two modern programs
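The search loop itself is generic and can be sketched without any tree code. The block below is a first-improvement hill climb under stated assumptions: `hill_climb`, `neighbors` and `score` are hypothetical names, and the toy problem (integers with neighbors x ± 1) merely stands in for topologies, a topology neighborhood, and the maximized likelihood of each topology. It is not how RAxML or GARLI are implemented.

```python
def hill_climb(start, neighbors, score):
    """First-improvement hill climbing.

    neighbors(state) yields candidate states; score(state) is the value to
    maximize (for trees: the likelihood after optimizing branch lengths).
    """
    current, current_score = start, score(start)
    improved = True
    while improved:
        improved = False
        for candidate in neighbors(current):
            candidate_score = score(candidate)
            if candidate_score > current_score:   # jump to the first better neighbor
                current, current_score = candidate, candidate_score
                improved = True
                break
    return current, current_score                 # no better neighbor: local optimum

# Toy stand-in for the tree problem: states are integers.
best, best_score = hill_climb(
    start=0,
    neighbors=lambda x: (x - 1, x + 1),
    score=lambda x: -(x - 7) ** 2,
)
print(best, best_score)
```

Such a search stops at a local optimum, which need not be the ML tree; that is the price paid for avoiding the exhaustive search.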
