19
Statistics 877 Statistical Methods for Molecular Biology Phylogenetics Cécile Ané Departments of Botany and of Statistics University of Wisconsin - Madison Spring 2017 - Feb. 9

Statistics 877 Statistical Methods for Molecular Biology ...kendzior/STAT877/SLIDES/ane1.pdf · Statistics 877 Statistical Methods for Molecular Biology Phylogenetics Cécile Ané

  • Upload
    voliem

  • View
    216

  • Download
    1

Embed Size (px)

Citation preview

Page 1: Statistics 877 Statistical Methods for Molecular Biology ...kendzior/STAT877/SLIDES/ane1.pdf · Statistics 877 Statistical Methods for Molecular Biology Phylogenetics Cécile Ané

Statistics 877Statistical Methods for Molecular Biology

Phylogenetics

Cécile Ané

Departments of Botany and of StatisticsUniversity of Wisconsin - Madison

Spring 2017 - Feb. 9

Page 2: Statistics 877 Statistical Methods for Molecular Biology ...kendzior/STAT877/SLIDES/ane1.pdf · Statistics 877 Statistical Methods for Molecular Biology Phylogenetics Cécile Ané

Outline

1 Trait evolution on a phylogeny

2 Molecular evolution on a tree: Maximum LikelihoodEstimation

Page 3: Statistics 877 Statistical Methods for Molecular Biology ...kendzior/STAT877/SLIDES/ane1.pdf · Statistics 877 Statistical Methods for Molecular Biology Phylogenetics Cécile Ané

Outline

1 Trait evolution on a phylogenyExamplesNon-independence problemBrownian motion model

Page 4: Statistics 877 Statistical Methods for Molecular Biology ...kendzior/STAT877/SLIDES/ane1.pdf · Statistics 877 Statistical Methods for Molecular Biology Phylogenetics Cécile Ané

Trait evolution on a phylogeny

Sometimes we don’t care about the tree, but we need it.

In antelopes, are brain size and group size correlated?In plants, did the capacity to shed leaves evolve before orafter the move to freezing winters?In flowers, is the color red correlated with pollination byhummingbirds?

Page 5: Statistics 877 Statistical Methods for Molecular Biology ...kendzior/STAT877/SLIDES/ane1.pdf · Statistics 877 Statistical Methods for Molecular Biology Phylogenetics Cécile Ané

Is red correlated with hummingbirds?

Pollinators in Iochroma:hummingbirds, ants, butterflies,flies.

Pollinator importance = visitationrate× # ovules potentiallypollinated per visit

Predictors: flower color, reward,corolla length, etc.

Smith, Ané & Baum (2008) Evolution

Page 6: Statistics 877 Statistical Methods for Molecular Biology ...kendzior/STAT877/SLIDES/ane1.pdf · Statistics 877 Statistical Methods for Molecular Biology Phylogenetics Cécile Ané

Problem: units are not independent

Y4

Y3

Y1

Y2

X4

X3

X2

X1

X =Hummingbird importance

Y =

Bri

ghtn

ess

0.8

0.40.6

0.2

A sample of species do not form a “random” sample because ofa lack of independence.

Y = Xβ + ε

but ε’s are not iid:identically distributed (id) perhaps, but not independent (i).

Page 7: Statistics 877 Statistical Methods for Molecular Biology ...kendzior/STAT877/SLIDES/ane1.pdf · Statistics 877 Statistical Methods for Molecular Biology Phylogenetics Cécile Ané

Using the tree to parametrize the non-independence

for quantitative traits:Brownian motion model: neutral evolutionOrnstein-Uhlenbeck model: natural selection towards anselection optimum µ

0.0 0.4 0.8

78

911

Time

Tra

it va

lue

BM

0.0 0.4 0.8

68

1012

Time

Tra

it va

lue

OU ((αα == 2))

Page 8: Statistics 877 Statistical Methods for Molecular Biology ...kendzior/STAT877/SLIDES/ane1.pdf · Statistics 877 Statistical Methods for Molecular Biology Phylogenetics Cécile Ané

Brownian motion model for the residuals

Bt − Bs ∼ N (0, σ2(t − s))

ε ∼ N (0,Vθ)

Vθ,ij = σ2∗ time shared between tips i and j

Y1

Y4

Y2

Y30.60.4

0.8 0.2

Y1

Y4

Y2

Y3

Vθ = σ2

1 .8 0 0.8 1 0 00 0 1 .40 0 .4 1

Vθ = σ2

1 0 0 00 1 0 00 0 1 00 0 0 1

Page 9: Statistics 877 Statistical Methods for Molecular Biology ...kendzior/STAT877/SLIDES/ane1.pdf · Statistics 877 Statistical Methods for Molecular Biology Phylogenetics Cécile Ané

Likelihood for Brownian motion model

Y = Xβ + ε, ε ∼ N (0,Vθ)

Likelihood:

−2 log f (Y |θ) = n log(2π) + log |Vθ|+ (Y − Xβ)′ Vθ−1(Y − Xβ)

Looks simple, but tricky for a tree with 1000’s species.What is the size of Vθ?

Page 10: Statistics 877 Statistical Methods for Molecular Biology ...kendzior/STAT877/SLIDES/ane1.pdf · Statistics 877 Statistical Methods for Molecular Biology Phylogenetics Cécile Ané

Areas for current research

Linear time algorithm: for BM model, but also phylogeneticlogistic or Poisson regression(Ho & Ané, 2014, Goolsby, Bruggeman & Ané 2017)Asymptotic theory(Ané, 2008, Ho & Ané 2013, Ané Ho & Roch 2017)model selection: need new tools(Khabbazian, Kriebel, Rohe & Ané 2016)

Page 11: Statistics 877 Statistical Methods for Molecular Biology ...kendzior/STAT877/SLIDES/ane1.pdf · Statistics 877 Statistical Methods for Molecular Biology Phylogenetics Cécile Ané

Outline

2 Molecular evolution on a tree: Maximum LikelihoodEstimation

Maximum Likelihood Estimation for 2 speciesLikelihood Calculations on TreesSearch for Maximum Likelihood

Page 12: Statistics 877 Statistical Methods for Molecular Biology ...kendzior/STAT877/SLIDES/ane1.pdf · Statistics 877 Statistical Methods for Molecular Biology Phylogenetics Cécile Ané

Distance Between Pairs of Taxa

In a two-taxon tree, the distance between two taxa can beestimated under any model by maximum likelihood.If the distance is t and at site i one species has base A andthe other has base C, the contribution to the likelihood atthis site j is

Lj(t) = πAPAC(t) = πCPCA(t)

for a time-reversible model.The overall likelihood is

L(t) =∏

j

Lj(t)

and the log-likelihood is

`(t) =∑

j

log Lj(t) =∑

j

(logπx [j] + log Px [j]y [j](t)

)

Page 13: Statistics 877 Statistical Methods for Molecular Biology ...kendzior/STAT877/SLIDES/ane1.pdf · Statistics 877 Statistical Methods for Molecular Biology Phylogenetics Cécile Ané

Distance Between Pairs of Taxa

Models with free π: it is common to estimate π withobserved base frequencies.Other parameters usually estimated by maximumlikelihood.The simplest models have closed form solutions, othersrequire numerical optimization.

Page 14: Statistics 877 Statistical Methods for Molecular Biology ...kendzior/STAT877/SLIDES/ane1.pdf · Statistics 877 Statistical Methods for Molecular Biology Phylogenetics Cécile Ané

Notation for the Alignment

Alignment of m taxa and n sites: mn nucleotide bases.Observed base for the i th taxon and the j th site: xij .

Page 15: Statistics 877 Statistical Methods for Molecular Biology ...kendzior/STAT877/SLIDES/ane1.pdf · Statistics 877 Statistical Methods for Molecular Biology Phylogenetics Cécile Ané

Likelihood of a Tree

P(t): 4× 4 probability transition matrix over an edge of length t .

likelihood of the tree:

∏site j

∑πy1j

∏edge e

Pyp(e),j ,yc(e),j (te)

p(e): parent node of edge ec(e): child edge ete: length of enodes 1 (root) to m − 2: internal nodes,nodes m − 1 to 2m − 2: leaves, observed xij = ym−2+i,j

sum: over the 4m−2 possible allocations of valuesat internal nodes: y1,j , . . . , ym−2,j ∈ {A,C,G,T}

Page 16: Statistics 877 Statistical Methods for Molecular Biology ...kendzior/STAT877/SLIDES/ane1.pdf · Statistics 877 Statistical Methods for Molecular Biology Phylogenetics Cécile Ané

Likelihood of a Tree

a naive calculation would not be tractable for large treesfor m = 10, 8 internal nodes, 48 = 65,536 allocations, orterms in the sum

With a time-reversible model, the location of a root (wherethe CTMC begins at stationarity) does not affect thelikelihood calculation.

Page 17: Statistics 877 Statistical Methods for Molecular Biology ...kendzior/STAT877/SLIDES/ane1.pdf · Statistics 877 Statistical Methods for Molecular Biology Phylogenetics Cécile Ané

Felsenstein’s Pruning Algorithm

dynamic programming, or belief propagation algorithmpartial calculations, for each site and node:

IP{subtree rooted at that node} given each possiblestarting base.

leaves first, root last, then weighted average of full treestarting at each base.time complexity: ∝ number of sites, not exponentially.

Page 18: Statistics 877 Statistical Methods for Molecular Biology ...kendzior/STAT877/SLIDES/ane1.pdf · Statistics 877 Statistical Methods for Molecular Biology Phylogenetics Cécile Ané

Maximum Likelihood Estimation for one Tree

For a single tree topology, the ML estimation requiresoptimization of branch lengths and of any parameters inthe substitution model.Numerical optimization methods are required even forsimple models and small trees.

Page 19: Statistics 877 Statistical Methods for Molecular Biology ...kendzior/STAT877/SLIDES/ane1.pdf · Statistics 877 Statistical Methods for Molecular Biology Phylogenetics Cécile Ané

Tree Search

naive search for ML tree: optimize ML branch lengths etc.for each possible topology, then picking the best.

this exhaustive search is not feasible for 12+ taxa

heuristic search algorithms:

define a neighborhood structure for possible topologiesgo through neighborsjump to the first "better" neighbor with a higher likelihoodstop when no "better" neighbors.

e.g. RAxML and GARLI are two modern programs