1 Towards optimal distance functions for stochastic substitution models Ilan Gronau, Shlomo Moran, Irad Yavneh Technion, Israel


Towards optimal distance functions for stochastic substitution models

Ilan Gronau, Shlomo Moran, Irad Yavneh
Technion, Israel

Preview: The Phylogenetic Reconstruction Problem

[Figure: an evolutionary tree whose nodes are labeled with 7-letter DNA sequences (AATCCTG, ATAGCTG, AATGGGC, GAACGTA, AAACCGA, ACGGTCA, ACGGATA, ...).]

Evolution is modeled by a Tree

(All our sequences are DNA sequences, consisting of {A,G,C,T})

Phylogenetic Reconstruction

Input: the DNA sequences at the leaves.

A : AATGGGC
B : AATCCTG
C : ATAGCTG
D : GAACGTA
E : AAACCGA
F : GGGGATT
G : TCTGGGA
H : TCCGGAA
I : AGCCGTG
J : ACCGTTG

Goal: reconstruct the ‘true’ tree as accurately as possible

[Figure: the labeled sequences A to J on the left; "reconstruct" maps them to a rooted tree with leaves A to J on the right.]

Road Map
• Distance based reconstruction algorithms
• The Kimura 2 Parameter (K2P) Model
• Performance of distance methods in the K2P model
• Substitution models and substitution rate functions
• Properties of SR functions
• Unified Substitution Models
• Optimizing Distances in the K2P model
• Simulation results

Distance Based Phylogenetic Reconstruction: Exact vs. Noisy distances

[Figure: an edge-weighted ‘true’ tree over the species A to G, and the tree over A to G produced by reconstruction.]

• Exact (additive) distances between species: $D = \{d(u,v)\}_{u,v \in S}$
• Estimated distances, obtained by distance estimation using finite sampling: $\hat{D} = \{\hat{d}(u,v)\}_{u,v \in S}$

Challenge: minimize the effect of the noise introduced by the sampling

(Speaker note: mention "inherent sensitivity".)
Road Map, next: The Kimura 2 Parameter (K2P) Model

The Kimura 2 Parameter (K2P) model [Kimura80]: each edge (u, v) corresponds to a “Rate Matrix”

K2P generic rate matrix:

    A   G   C   T
A   -   α   β   β
G   α   -   β   β
C   β   β   -   α
T   β   β   α   -

• Transitions (rate α): inside {A, G} or inside {C, T}
• Transversions (rate β): between {A, G} and {C, T}
• A, G: purines; C, T: pyrimidines
• Transitions/transversions ratio: $R = \alpha / 2\beta$ (so $R \ge 1$ exactly when $\alpha \ge 2\beta$)

(Speaker note: add that usually α > β.)
K2P standard distance: Δtotal = Total substitution rate

The total substitution rate of a K2P rate matrix $R_{uv}$ is

$$\Delta_{total}(R_{uv}) = \alpha + 2\beta = \tfrac{1}{4}\,(\text{sum of off-diagonal entries of } R_{uv})$$

This is the expected number of mutations per site. It is an additive distance.

[Figure: path u - v - w. Edge (u, v) has total rate α + 2β, edge (v, w) has total rate α’ + 2β’, and the path from u to w has total rate (α+α’) + 2(β+β’).]
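The total-rate formula and its additivity can be checked with a few lines of Python. This is only a sketch: the helper names (`k2p_rate_matrix`, `total_rate`, `add_matrices`) and the numeric rates are illustrative assumptions, and the diagonal, which the slide leaves as "-", is filled in the usual way so that each row sums to zero.

```python
def k2p_rate_matrix(alpha, beta):
    """K2P rate matrix, rows and columns ordered A, G, C, T.

    Assumption: the diagonal (shown as '-' on the slide) is filled so that
    every row sums to zero, as is usual for rate matrices.
    """
    d = -(alpha + 2 * beta)
    return [
        [d, alpha, beta, beta],
        [alpha, d, beta, beta],
        [beta, beta, d, alpha],
        [beta, beta, alpha, d],
    ]


def total_rate(rate_matrix):
    """One quarter of the sum of the off-diagonal entries."""
    n = len(rate_matrix)
    off_diagonal = sum(
        rate_matrix[i][j] for i in range(n) for j in range(n) if i != j
    )
    return off_diagonal / 4


def add_matrices(r1, r2):
    """Entrywise sum; here it stands for the rate matrix of the path u - v - w."""
    return [[a + b for a, b in zip(row1, row2)] for row1, row2 in zip(r1, r2)]


r_uv = k2p_rate_matrix(alpha=0.20, beta=0.01)  # illustrative values
r_vw = k2p_rate_matrix(alpha=0.10, beta=0.03)  # illustrative values
r_uw = add_matrices(r_uv, r_vw)

# The third number is the sum of the first two, up to rounding.
print(total_rate(r_uv), total_rate(r_vw), total_rate(r_uw))
```

Each row contributes α + 2β to the off-diagonal sum, which is why dividing the sum over all four rows by 4 recovers the per-site rate.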


Estimation of Δtotal(Ruv) = dK2P(u,v) is a noisy stochastic process

u A A C A … G T C T T C G A G G C C C

v A G C A … G C C T A T G C G A C C T

K2P total rate “distance correction” procedure:

$$\hat{d}_{K2P}(u,v) = \hat{\alpha} + 2\hat{\beta}$$
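A minimal sketch of such a distance correction, assuming the standard Kimura formula written with the two quantities the talk later calls λ and μ. The function name `k2p_distance` is hypothetical; the example sequences are the visible letters of the slide's u and v with the elided middle part left out.

```python
import math

TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def k2p_distance(u, v):
    """Estimate alpha + 2*beta from two aligned DNA sequences."""
    if len(u) != len(v) or not u:
        raise ValueError("sequences must be aligned and non-empty")
    n = len(u)
    # Observed fractions of sites showing a transition / a transversion.
    ts = sum((a, b) in TRANSITIONS for a, b in zip(u, v)) / n
    tv = sum(a != b and (a, b) not in TRANSITIONS for a, b in zip(u, v)) / n
    lam = 1 - 2 * tv       # estimates exp(-4*beta)
    mu = 1 - 2 * ts - tv   # estimates exp(-2*alpha - 2*beta)
    if lam <= 0 or mu <= 0:
        raise ValueError("sequences are saturated; the correction is undefined")
    return -(math.log(lam) + 2 * math.log(mu)) / 4


u = "AACAGTCTTCGAGGCCC"
v = "AGCAGCCTATGCGACCT"
print(k2p_distance(u, v))
```

With so few sites the estimate is very noisy, which is exactly the effect the talk sets out to reduce.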

Road Map, next: Performance of distance methods in the K2P model

Check performance of K2P “standard” distances in resolving quartet-splits

There are 3 possible quartet topologies:

[Figure: the three possible quartet topologies on the leaves A, B, C, D; the internal (separating) edge has weight w_sep.]

• Distance methods reconstruct the true split by the 4-point condition. For the split AB|CD and exact distances:

$$d_{K2P}(A,B) + d_{K2P}(C,D) + 2w_{sep} = d_{K2P}(A,C) + d_{K2P}(B,D) = d_{K2P}(A,D) + d_{K2P}(B,C)$$

The 4-point condition for noisy distances is:

$$d_{K2P}(A,B) + d_{K2P}(C,D) < \min\{\, d_{K2P}(A,C) + d_{K2P}(B,D),\; d_{K2P}(A,D) + d_{K2P}(B,C) \,\}$$

(Speaker notes, translated from Hebrew: say that the degrees are 3, and point out the relation between edge length and robustness to noise. Remove the animation except for the removal of the two quartets.)
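The 4-point condition amounts to picking the pairing with the smallest sum of distances. A small sketch follows; the function name `resolve_quartet` and the example tree (leaf edges of length 1, separating edge of length 0.5) are illustrative assumptions.

```python
def resolve_quartet(dist, a="A", b="B", c="C", d="D"):
    """Return the split whose pair of distances has the smallest sum.

    `dist` maps frozenset({x, y}) to the (possibly noisy) distance between
    the leaves x and y.
    """
    def dd(x, y):
        return dist[frozenset((x, y))]

    sums = {
        ((a, b), (c, d)): dd(a, b) + dd(c, d),
        ((a, c), (b, d)): dd(a, c) + dd(b, d),
        ((a, d), (b, c)): dd(a, d) + dd(b, c),
    }
    return min(sums, key=sums.get)


# Exact additive distances for the split AB|CD.
exact = {
    frozenset("AB"): 2.0, frozenset("CD"): 2.0,
    frozenset("AC"): 2.5, frozenset("BD"): 2.5,
    frozenset("AD"): 2.5, frozenset("BC"): 2.5,
}
print(resolve_quartet(exact))
```

In the example the two larger sums exceed the smallest one by 1.0, which is twice the weight of the separating edge; noise in the estimated distances has to stay below that margin for the correct split to be found.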
We evaluate the accuracy of the K2P distance estimation by the Split Resolution Test:

[Figure: a rooted model quartet with leaves A, B, C, D. The two edges adjacent to the root have length t, the four leaf edges have length 10t, and every edge carries a K2P rate matrix.]

• t is “evolutionary time”
• The diameter of the quartet is 22t

Phase A: simulate evolution

[Figure: sequences such as CCCGGAGCTTCTG…ACAA are generated along the edges of the model quartet, down to the leaves A, B, C, D.]

Phase B: reconstruct the split by the 4p condition

[Figure: the simulated sequences at the leaves A, B, C, D.]

• Estimate distances between the sequences: $\hat{D}(i,j) = \hat{d}_{K2P}(s_i, s_j)$
• Apply the 4p condition.
• Was the correct split found?

Repeat this process 10,000 times and count the number of failures.
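Phases A and B can be put together in a short simulation. The sketch below is not the authors' code and rests on several assumptions: the true split is taken to be AB|CD, the per-edge substitution probabilities use the K2P closed forms implied by λ = e^(-4β) and μ = e^(-2α-2β), edge lengths are measured in total rate α + 2β with R = α/2β, and the defaults use far fewer sites and repetitions than the 10,000 runs of the talk. All function names are hypothetical.

```python
import math
import random

TS = {"A": "G", "G": "A", "C": "T", "T": "C"}        # transition partner
TV = {"A": "CT", "G": "CT", "C": "AG", "T": "AG"}    # transversion targets


def evolve(seq, length, ratio, rng):
    """Evolve seq along an edge whose total rate alpha + 2*beta equals length."""
    beta = length / (2 * (ratio + 1))
    alpha = 2 * ratio * beta
    p_tv = (1 - math.exp(-4 * beta)) / 2
    p_ts = (1 + math.exp(-4 * beta) - 2 * math.exp(-2 * (alpha + beta))) / 4
    out = []
    for x in seq:
        r = rng.random()
        if r < p_ts:
            out.append(TS[x])
        elif r < p_ts + p_tv:
            out.append(rng.choice(TV[x]))
        else:
            out.append(x)
    return out


def k2p_distance(u, v):
    """Standard K2P estimate; infinite when the sample is saturated."""
    n = len(u)
    ts = sum(TS[a] == b for a, b in zip(u, v)) / n
    tv = sum(b in TV[a] for a, b in zip(u, v)) / n
    lam, mu = 1 - 2 * tv, 1 - 2 * ts - tv
    if lam <= 0 or mu <= 0:
        return float("inf")
    return -(math.log(lam) + 2 * math.log(mu)) / 4


def split_test(diameter, ratio=10, sites=500, trials=200, seed=0):
    """Fraction of trials in which the 4-point condition misses AB|CD."""
    rng = random.Random(seed)
    t = diameter / 22
    failures = 0
    for _ in range(trials):
        root = [rng.choice("ACGT") for _ in range(sites)]
        x, y = evolve(root, t, ratio, rng), evolve(root, t, ratio, rng)
        a, b = evolve(x, 10 * t, ratio, rng), evolve(x, 10 * t, ratio, rng)
        c, d = evolve(y, 10 * t, ratio, rng), evolve(y, 10 * t, ratio, rng)
        ab_cd = k2p_distance(a, b) + k2p_distance(c, d)
        ac_bd = k2p_distance(a, c) + k2p_distance(b, d)
        ad_bc = k2p_distance(a, d) + k2p_distance(b, c)
        if not ab_cd < min(ac_bd, ad_bc):
            failures += 1
    return failures / trials


print(split_test(diameter=0.5))
```

Calling `split_test` over a range of diameters gives the kind of failure curve the following slides plot; the exact numbers depend on the chosen sequence length and seed.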

The split resolution test was applied to the model quartet with various diameters.

For each diameter, mark the fraction (percentage) of the simulations in which the 4p condition failed (next slide).

[Figure: copies of the model quartet (edges t and 10t), scaled to different diameters.]

Performance of K2P distances in resolving quartets, small diameters: 0.01-0.2

[Plot: performance of the K2P standard distance method in resolving quartets, R=10. x-axis: quartet diameter (total rate between furthest leaves), 0 to 0.2. y-axis: fraction of failures out of 10000 experiments. Inset: the template quartet.]

Performance for larger diameters

[Plot: performance of the K2P standard distance method in resolving quartets, for quartet ratio 0.1, R=10. x-axis: quartet diameter (= mutation rate between furthest leaves), 0.2 to 2. y-axis: fraction of failures out of 10000 simulations. Annotation: “site saturation”.]

When β < α, we can postpone the “site saturation” effect. For this, use another distance function for the same model, Δtv, which counts only transversions:

[Figure: {A, G} is mapped to {0} and {C, T} to {1}. Transitions (rate α) stay inside a class; transversions (rate β) move between the classes.]

This is actually the CFN model [Cavender78, Farris73, Neyman71]


Apply the same split resolution test on the transversions only distance:

u A A C A … G T C T T C G A G G C C C

v A G C A … G C C T A T G C G A C C T

Transversions only “distance correction” procedure:

$$\hat{d}_{tr}(u,v) = \hat{\beta}$$
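A sketch of a transversions-only correction, assuming the form -ln(λ)/4 with λ estimated as one minus twice the observed transversion fraction. The function name `transversion_distance` and the toy sequences are illustrative.

```python
import math

PURINES = set("AG")


def transversion_distance(u, v):
    """Estimate beta from the fraction of sites showing a transversion."""
    if len(u) != len(v) or not u:
        raise ValueError("sequences must be aligned and non-empty")
    # A transversion changes the class (purine vs. pyrimidine) of the base.
    tv = sum((a in PURINES) != (b in PURINES) for a, b in zip(u, v)) / len(u)
    lam = 1 - 2 * tv  # estimates exp(-4*beta)
    if lam <= 0:
        raise ValueError("sequences are saturated; the correction is undefined")
    return -math.log(lam) / 4


print(transversion_distance("AAAA", "CAAA"))  # one transversion in four sites
print(transversion_distance("AAAA", "GAAA"))  # a transition is ignored
```

Because transitions are ignored, a large α does not push this estimate towards saturation, at the price of using less of the data at small rates.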

Transversions only performs better on large rates, worse on small rates

[Plot: performance of distance methods in resolving quartets, R=10. x-axis: quartet diameter, 0 to 2. y-axis: fraction of failures out of 10000 experiments. Curves: transversions only; total K2P rate.]

(Speaker note, translated from Hebrew: point out that eventually the red curve rises too.)
Conclusion: Distance based reconstruction methods should be adaptive:

Find a distance function d which is good for the input.

[Figure: a matrix of estimated pairwise distances, $\hat{D}(u,v) = \hat{d}_{\Delta}(u,v)$.]

We take a small step in this direction:

Input: An alignment of the sequences at u, v.

Output: a (near)-optimal distance function, which minimizes the expected noise in the estimation procedure.

Example: An adaptive distance method (max-optimal) based on this talk:

[Plot: performance of distance methods in resolving quartets, R=10. x-axis: quartet diameter, 0 to 2. y-axis: fraction of failures out of 10000 experiments. Curves: max-optimal; standard K2P; transversions only.]

Road Map, next: Substitution models and Substitution Rate functions

Steps in finding optimal distance functions:
1. Define substitution model.
2. Characterize the available distance functions.
3. Select a function which is optimal for the input sequences (optimal: least sensitive to stochastic noise).

From Rate matrices to Substitution matrices

Evolution of a finite sequence by unknown model parameters α, β:

u A A C A … G T C T T C G A G G C C C
v A G C A … G C C T A T G C G A C C T

Rate matrices imply stochastic substitution matrices. The rate matrix $R_{uv}$:

    A   G   C   T
A   -   α   β   β
G   α   -   β   β
C   β   β   -   α
T   β   β   α   -

implies a stochastic substitution matrix $P_{uv}$:

    A            G            T            C
A   1-2pβ-pα     pα           pβ           pβ
G   pα           1-2pβ-pα     pβ           pβ
T   pβ           pβ           1-2pβ-pα     pα
C   pβ           pβ           pα           1-2pβ-pα
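One way to make the step from rates to substitution probabilities concrete is the K2P closed form below. It is a sketch under the assumption that the substitution matrix is the matrix exponential of the rate matrix; the expressions for pα and pβ are chosen so that 1 - 4pβ = e^(-4β) and 1 - 2pβ - 2pα = e^(-2α-2β), the identities used later in the talk. The function name and the numeric rates are illustrative, and the rows here are ordered A, G, C, T.

```python
import math


def k2p_substitution_matrix(alpha, beta):
    """Substitution matrix implied by K2P rates, rows/columns ordered A, G, C, T."""
    p_beta = (1 - math.exp(-4 * beta)) / 4
    p_alpha = (1 + math.exp(-4 * beta) - 2 * math.exp(-2 * alpha - 2 * beta)) / 4
    stay = 1 - p_alpha - 2 * p_beta
    return [
        [stay, p_alpha, p_beta, p_beta],
        [p_alpha, stay, p_beta, p_beta],
        [p_beta, p_beta, stay, p_alpha],
        [p_beta, p_beta, p_alpha, stay],
    ]


P = k2p_substitution_matrix(alpha=0.20, beta=0.01)  # illustrative rates
for row in P:
    print(["%.4f" % x for x in row])
```

Every row sums to 1, so P is stochastic, and the unknown α and β can be read back from the two distinct off-diagonal probabilities.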

A substitution model M: a set of stochastic substitution matrices, closed under matrix product:

P, Q ∈ M ⇒ PQ ∈ M

Motivation for the definition: along a path u - v - w,

$$P_{uw} = P_{uv} P_{vw}$$

Also required: P > 0 and 0 < det(P) < 1 for all P ∈ M

Model tree over M = <Tree Topology> + <DNA distribution at the root> + <M-substitution matrices at the edges>

[Figure: a tree with root r, whose DNA distribution is uniform; every edge carries a substitution matrix, e.g. P_rv on the edge (r, v).]

Distances for a given model are defined by Substitution Rate functions:

Δ: M → ℝ is an SR function for M iff for all P, Q in M:

1. Δ(PQ) = Δ(P) + Δ(Q) (additivity)

2. Δ(P) > 0 (positivity)

[Figure: path u - v - w with the substitution matrices P_uv and P_vw on its edges.]

(Speaker note, translated from Hebrew: point out that this is a generalization of the mutation rate.)
Road Map, next: Properties of SR functions

1st question: Given a model M, what are its SR functions (more precisely: its additive functions)?

SR functions are additive functions which are strictly positive

Example 1: The logdet function [Lake94, Steel93] is an SR function for the most general model, Muniv:

Muniv = {P: P is a stochastic 4×4 matrix, 0 < det(P) < 1}.

The function $\Delta_{logdet}(P) = \ln(\det(P))$ is an additive function for $M_{univ}$.

The function $d_{logdet}(u,v) = -\ln(\det(P_{uv}))$ is an SR function for $M_{univ}$.
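Additivity of logdet follows from det(PQ) = det(P)·det(Q). The sketch below checks this numerically on two stochastic matrices; the matrices and the helper names (`matmul`, `det`, `logdet_distance`) are illustrative assumptions, and the determinant is computed by plain Gaussian elimination to keep the example free of third-party libraries.

```python
import math


def matmul(p, q):
    """Product of two square matrices given as lists of rows."""
    n = len(p)
    return [[sum(p[i][k] * q[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def det(m):
    """Determinant by Gaussian elimination with partial pivoting."""
    m = [row[:] for row in m]
    n = len(m)
    result = 1.0
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(m[r][c]))
        if m[pivot][c] == 0:
            return 0.0
        if pivot != c:
            m[c], m[pivot] = m[pivot], m[c]
            result = -result
        result *= m[c][c]
        for r in range(c + 1, n):
            factor = m[r][c] / m[c][c]
            for k in range(c, n):
                m[r][k] -= factor * m[c][k]
    return result


def logdet_distance(p):
    """-ln(det(P)), positive whenever 0 < det(P) < 1."""
    return -math.log(det(p))


P = [[0.90, 0.05, 0.03, 0.02],
     [0.04, 0.90, 0.02, 0.04],
     [0.01, 0.02, 0.95, 0.02],
     [0.03, 0.03, 0.04, 0.90]]
Q = [[0.80, 0.10, 0.05, 0.05],
     [0.10, 0.80, 0.05, 0.05],
     [0.05, 0.05, 0.80, 0.10],
     [0.05, 0.05, 0.10, 0.80]]

print(logdet_distance(P) + logdet_distance(Q), logdet_distance(matmul(P, Q)))
```

The two printed values agree up to rounding, and both summands are positive because each determinant lies strictly between 0 and 1.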

Example 2: The log eigenvalue function

Assume a model M with the following property: there is a vector $v \in \mathbb{R}^4$ which is an eigenvector of each $P \in M$, i.e., $Pv = \lambda_v(P)\,v$.

The function $\Delta_v(P) = \ln(|\lambda_v(P)|)$ is an additive function for M. [e.g. Gu&Li98]

Both “logdet” and the “log eigenvalue” functions are special cases of a general technique, Generalized logdet, which is given below:

Definition: Let P be a 4 by 4 matrix. A subspace H of $\mathbb{R}^4$ is P-invariant if $PH \subseteq H$.

If H is P-invariant, then P defines a linear transformation on H. $\det(P|_H)$ is the determinant of this linear transformation.

Lemma GLD (Generalized LogDet): If H is P-invariant for all $P \in M$, then

$$\Delta_H(P) = \ln(|\det(P|_H)|)$$

is an additive function for M.

Linearity of additive functions:

1. If Δ1 and Δ2 are additive functions for M, so is c1 Δ1 + c2 Δ2.

The set of additive functions for M forms a vector space, denoted ADM.

Dimension(ADM) is the dimension of this vector space. A larger dimension implies more “independent” distance functions.

If dimension(ADM) = 1, then M admits a single distance function (up to multiplication by a scalar). Selecting the best SR function in such a model is trivial. Thus, the adaptive approach is useful only when dimension(ADM) > 1.

Road Map, next: Unified Substitution Models: models for which the adaptive approach is potentially useful.

Unified Substitution Models:

Def: A model M is unified if there is a matrix U s.t. for each P∈M it holds that:

$$U^{-1} P U = \begin{pmatrix} 1 & ? & ? & ? \\ 0 & \lambda_1(P) & ? & ? \\ 0 & 0 & \lambda_2(P) & ? \\ 0 & 0 & 0 & \lambda_3(P) \end{pmatrix}$$

Using Lemma GLD, we have:

Thm: if M is unified, then for each 3 constants $c_1, c_2, c_3$, the function

$$\Delta(P) = \sum_{i=1}^{3} c_i \ln(|\lambda_i(P)|)$$

is an additive function for M.

Strongly Unified Substitution Models

Def: A model M is strongly unified if there is a matrix U s.t. for each P∈M it holds that:

$$U^{-1} P U = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & \lambda_1(P) & 0 & 0 \\ 0 & 0 & \lambda_2(P) & 0 \\ 0 & 0 & 0 & \lambda_3(P) \end{pmatrix}$$

Thm: if M is strongly unified, then all the additive functions of M are of the form

$$\Delta(P) = \sum_{i=1}^{3} c_i \ln(\lambda_i(P))$$

A simple strongly unified model: The Jukes Cantor model [1969]

MJC = the set of matrices of the form

    A      G      T      C
A   1-3p   p      p      p
G   p      1-3p   p      p
T   p      p      1-3p   p
C   p      p      p      1-3p

with 0 < p < 0.25.

MJC is strongly unified by a matrix U.

[Matrix: U; its entries are garbled in the source.]

For all P ∈ MJC:

$$U^{-1} P U = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & \lambda_p & 0 & 0 \\ 0 & 0 & \lambda_p & 0 \\ 0 & 0 & 0 & \lambda_p \end{pmatrix}, \qquad \lambda_p = 1 - 4p$$

Claim: dimension(ADMJC) = 1

Hence the adaptive approach is irrelevant to this model.


Another model M for which dimension(ADM)=1

Recall: Muniv consists of all DNA transition matrices.

Claim 2: dimension(ADMuniv) = 1

This means that all the additive functions of Muniv are

proportional to logdet.

Hence the adaptive approach is irrelevant also to this model.

Luckily, the additive functions of “intermediate” unified models form spaces of dimension > 1, hence the adaptive approach is useful for them.

Next we return to the Kimura 2 parameter model.

Back to K2P: For every K2P Substitution Matrix P

    A            G            T            C
A   1-2pβ-pα     pα           pβ           pβ
G   pα           1-2pβ-pα     pβ           pβ
T   pβ           pβ           1-2pβ-pα     pα
C   pβ           pβ           pα           1-2pβ-pα

it holds that

$$U^{-1} P U = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & \lambda_P & 0 & 0 \\ 0 & 0 & \mu_P & 0 \\ 0 & 0 & 0 & \mu_P \end{pmatrix}$$

where U is the matrix U of the JC model, and:

$\lambda_P = 1 - 4p_\beta = e^{-4\beta}$, with $0 < \lambda_P < 1$

$\mu_P = 1 - 2p_\beta - 2p_\alpha = e^{-2\alpha - 2\beta}$, with $0 < \mu_P < 1$

Conclusion: dimension(ADMK2P) = 2.

The functions Δλ(P) = -ln(λP), Δμ(P) = -ln(μP) form a basis of ADK2P.

Each positive function of the form

$$\Delta(P) = -c_1 \ln(\lambda_P) - c_2 \ln(\mu_P)$$

is an SR function for the K2P model.

[Figure: an edge (u, v) labeled with P_uv.]

The standard “total rate” distance is:

ΔK2P(P)=-(ln(λP)+2ln(μP))/4=-Δlogdet(P)/4.

The “transversion only” distance is:

Δtr(P)=-ln(λP )/4.

Road Map, next: Optimizing Distances in the K2P model

K2P distance estimation: where the noise comes from

u A A C A … G T C T T C G A G G C C C
v A G C A … G C C T A T G C G A C C T

1. Compute $\hat{P}_{uv}$, an estimation of $P_{uv}$ (inherent noise).
2. Compute $\hat{\lambda} = \lambda(\hat{P}_{uv})$ and $\hat{\mu} = \mu(\hat{P}_{uv})$, estimations of $\lambda(P_{uv})$ and $\mu(P_{uv})$ (implied noise propagation).
3. Compute $\hat{\Delta}(\hat{P}_{uv}) = -c_1 \ln(\hat{\lambda}) - c_2 \ln(\hat{\mu})$, an estimation of $\Delta(P_{uv}) = -c_1 \ln(\lambda) - c_2 \ln(\mu)$ (“user controlled” noise propagation).

Selection of c1, c2

Given $P_{uv}$, we look for $c_1, c_2$ such that

$$d(u,v) = \Delta(P_{uv}) = -c_1 \ln(\lambda_{P_{uv}}) - c_2 \ln(\mu_{P_{uv}})$$

has a small expected relative error.

[Figure: True distance + Expected error = Estimated distance]

Expected Relative Error = Expected error / True distance

Minimizing the expected relative error

Let $d = d(u,v) = \Delta(P_{uv})$ be the exact distance;
$\hat{d} = \hat{d}(u,v) = \Delta(\hat{P}_{uv})$ is the estimated (stochastic) distance.

We would like to minimize the "Normalized Mean Square Error":

$$NMSE(\hat{d}) = E\left[\left(\frac{\hat{d} - d}{d}\right)^{2}\right]$$

In the plots we use

$$NRMSE = \sqrt{E\left[\left(\frac{\hat{d} - d}{d}\right)^{2}\right]}$$
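The NRMSE of a K2P distance function can also be estimated by simulation. The following is a rough Monte Carlo sketch, not the analytic treatment of the talk: it draws site patterns for one pair of sequences from the K2P probabilities, applies the distance -(c1·ln λ + c2·ln μ) to the estimated eigenvalues, and averages the squared relative error. The function names, the defaults and the choice to skip saturated samples are all assumptions made here.

```python
import math
import random


def sr_value(c1, c2, lam, mu):
    """-(c1*ln(lam) + c2*ln(mu)); terms with a zero coefficient are skipped."""
    total = 0.0
    for c, x in ((c1, lam), (c2, mu)):
        if c != 0:
            if x <= 0:
                return None  # saturated: the logarithm is undefined
            total -= c * math.log(x)
    return total


def nrmse(alpha, beta, c1, c2, sites=1000, trials=500, seed=0):
    """Monte Carlo estimate of the NRMSE of the distance given by (c1, c2)."""
    rng = random.Random(seed)
    lam, mu = math.exp(-4 * beta), math.exp(-2 * alpha - 2 * beta)
    p_tv = (1 - lam) / 2           # probability of a transversion at a site
    p_ts = (1 + lam - 2 * mu) / 4  # probability of a transition at a site
    d = sr_value(c1, c2, lam, mu)
    total, used = 0.0, 0
    for _ in range(trials):
        ts = tv = 0
        for _ in range(sites):
            r = rng.random()
            if r < p_ts:
                ts += 1
            elif r < p_ts + p_tv:
                tv += 1
        d_hat = sr_value(c1, c2, 1 - 2 * tv / sites, 1 - 2 * ts / sites - tv / sites)
        if d_hat is None:
            continue  # assumption: saturated samples are left out
        total += ((d_hat - d) / d) ** 2
        used += 1
    if used == 0:
        raise ValueError("every sample was saturated")
    return math.sqrt(total / used)


print(nrmse(0.2, 0.05, c1=1, c2=2))  # coefficients proportional to total rate
print(nrmse(0.2, 0.05, c1=1, c2=0))  # transversions only
```

Multiplying both coefficients by the same constant scales the exact and the estimated distance together, so with a fixed seed the returned value does not change.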

A basic property of Normalized Mean Square Error:

The NMSE of a distance function

$$\hat{\Delta}(\hat{P}_{uv}) = -\left(c_1 \ln(\hat{\lambda}) + c_2 \ln(\hat{\mu})\right)$$

depends only on the ratio $c = c_1 / c_2$.

This means that equivalent SR functions have the same NMSE

A Proper Disclosure on our optimal functions:

Since ln(·) is non-linear, we only find the c which minimizes the NMSE of a linear approximation of $\hat{\Delta}$ (using the "delta method"), i.e. of the 1st term in the Taylor expansion of $(\hat{d} - d)/d$.

The optimal c for a K2P matrix is given in closed form:

[Equation: $c_{opt}$ as an expression with exponential terms; the formula is garbled in the source.]

Hence, our approximation is imprecise when some of the (true) eigenvalues are very small

Relation between c and SR functions:

[Equation: the closed-form expression for $c_{opt}$; the formula is garbled in the source.]

Function name          Function              c      c/(1+c)
Total rate (logdet)    -ln(λP) - 2ln(μP)     1/2    1/3
Transversions only     -ln(λP)               ∞      1

As $c_{opt}/(1+c_{opt})$ grows from 1/3 to 1, the optimal rate function gradually changes from total rate to transversions only.

Optimal values of copt/(1+copt) for ti/tv ratio = 10

[Plot: x-axis: total substitution rate, 0 to 3. y-axis: C1/(C1+C2), 0 to 1. Curve: α=20β.]

As the rate grows, the relative weight of the “transversion” coefficient increases

Optimal values of c1/(c1+c2) for various transition/transversion rates

[Plot: x-axis: total substitution rate, 0 to 3. y-axis: C1/(C1+C2), 0 to 1. Curves: α=β, α=2β, α=4β, α=20β, α=200β. Annotation: α>>β, rate>2.]

Expected relative error of various distance functions: theoretical prediction

[Plot: R = 2. x-axis: total substitution rate, 0 to 2.5. y-axis: predicted NRMSE, 0 to 0.6. Curves: total rate; transversions; optimal.]

Road Map, next: Simulation results

Expected relative error of various distance functions: simulations

[Plot: R = 2. x-axis: total substitution rate, 0 to 2. y-axis: NRMSE, 0 to 0.5. Simulated curves for the actually used SR functions: standard formula (C = 0.5), 'transversions only' (C = ∞), optimal. Predicted error curves: for the standard formula, for 'transversions only', for the optimal SR function. Annotation: “small eigenvalue distortion”.]

Back to the K2P quartet resolution

Recall the performance of the two known distance functions on the “template quartet”:

[Plot: performance of distance methods in resolving quartets, R=10. x-axis: quartet diameter, 0 to 2. y-axis: fraction of failures out of 10000 experiments.]

A heuristic distance method (max-optimal) based on this talk:

Select a distance function which is optimal w.r.t. the largest of the six observed distances of the quartet (i.e., largest copt).

When α≠β, the suggested heuristic performs better than both known methods.

[Plot: performance of distance methods in resolving quartets, R=10. x-axis: quartet diameter, 0 to 2. y-axis: fraction of failures out of 10000 experiments. Curves: max-optimal; standard K2P; transversions only.]

Summary
• Adaptive approach to distance based reconstruction: adjust the distance function to the input sequences.
• Distance functions for stochastic evolutionary models are defined by SR functions.
• SR functions can be constructed by Generalized Logdet.
• When the dimension of the space of SR functions is greater than 1, the adaptive approach is applicable.
• The adaptive approach is applicable to non-trivial unified models.
• Most common models are unified.
• An analysis of the simplest non-trivial unified model, K2P, shows that the adaptive approach yields a significant improvement in accuracy.

Further Research
• Prove/Disprove: For any substitution model M, all the additive functions of M are GLD functions.
• In the K2P model:
  - Define & find optimal SR functions for: two distances, quartets, general trees.
  - Find optimal SR functions for non-homogeneous model trees.
  - Find optimal SR functions for variable rates across sites.
• Find optimal SR functions for more general evolutionary models (Tamura Nei) (analytic/heuristic methods).
• Empirical/analytical study of “plugging” adaptive distances into common reconstruction algorithms (e.g. NJ).
• Study the improvement in performance on real biological data.
• Devise algorithms which use distance-vectors.


Further research questions
• We have infinitely many additive distance functions for the K2P model.
• Which one should we use for reconstructing the tree?
• If we have the exact substitution matrices for all pairs of taxa, then all functions are equally good.
• But we have only finite sequences, whose alignments provide only estimations of the true substitution matrices.

Distances are defined by Substitution Rate functions

For each tree path u - v - w it holds that D(u,v) + D(v,w) = D(u,w).

[Figure: path u - v - w with the edges labeled D(u,v) and D(v,w), and D(u,w) = D(u,v) + D(v,w).]

Part 3.1: from Substitution models to Additive distances

From aligned sequences to joint distribution matrices

The aligned sequences provide, for each pair of DNA letters, say A and G, how many times A was mutated to G. This defines a joint distribution matrix F:

F =
     A      G      T      C
A    0.2    0.05   0.01   0.02
G    0.02   0.25   0.01   0.01
T    0.02   0.01   0.16   0.02
C    0.01   0.01   0.01   0.2

Example: A is aligned with G in 5% of the pairs.
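Counting aligned letter pairs is enough to build such a matrix. A minimal sketch, with a hypothetical function name and the letter order A, G, T, C of the table; the example sequences are the visible letters of the u and v shown elsewhere in the talk, without the elided middle part.

```python
def joint_distribution(u, v, alphabet="AGTC"):
    """Matrix F with F[i][j] = fraction of sites where u has letter i and v has letter j."""
    if len(u) != len(v) or not u:
        raise ValueError("sequences must be aligned and non-empty")
    index = {letter: i for i, letter in enumerate(alphabet)}
    counts = [[0] * len(alphabet) for _ in alphabet]
    for a, b in zip(u, v):
        counts[index[a]][index[b]] += 1
    n = len(u)
    return [[c / n for c in row] for row in counts]


u = "AACAGTCTTCGAGGCCC"
v = "AGCAGCCTATGCGACCT"
F = joint_distribution(u, v)
for letter, row in zip("AGTC", F):
    print(letter, ["%.3f" % x for x in row])
```

The diagonal holds the unchanged sites; dividing each row by its sum would give an estimate of the corresponding row of the substitution matrix.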

Joint distribution matrices are converted to distances by substitution models. These models describe how DNA sequences are transformed during evolution. The tool used for this is called “Markovian Processes”. In the following we sketch it. Additional reading is recommended…

Different biological models impose restrictions on the substitution matrices.

Our model is the Kimura 2 Parameter (K2P) model:

species C1 C2 C3 C4 … Cm

u A A C A … G T C T T C G A G G C C C

v A G C A … G C C T A T G C G A C C T

K2P distinguishes between two mutation types:
• Transitions: {A↔G, C↔T}
• Transversions: {A,G} ↔ {C,T}

K2P rate matrices have the following shape:

    A   G   T   C
A   -   α   β   β
G   α   -   β   β
T   β   β   -   α
C   β   β   α   -

All transitions have rate α

All transversions have rate β

Part 3.2: Distance functions for K2P

(Linear Algebra in the service of Biology)


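Let P, Q be two matrices in K2P. Then:

$$U^{-1} P U = \begin{pmatrix} \mu_P & 0 & 0 & 0 \\ 0 & \mu_P & 0 & 0 \\ 0 & 0 & \lambda_P & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}, \qquad U^{-1} Q U = \begin{pmatrix} \mu_Q & 0 & 0 & 0 \\ 0 & \mu_Q & 0 & 0 \\ 0 & 0 & \lambda_Q & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}$$

$$U^{-1} PQ\, U = \begin{pmatrix} \mu_P \mu_Q & 0 & 0 & 0 \\ 0 & \mu_P \mu_Q & 0 & 0 \\ 0 & 0 & \lambda_P \lambda_Q & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}$$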
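The multiplicative behaviour of the eigenvalues can be checked numerically. The sketch below is only an illustration under stated assumptions: it orders rows and columns as A, G, C, T, writes a K2P substitution matrix with a transition probability `p_alpha` and a per-target transversion probability `p_beta` (names chosen here), and uses one particular matrix U of ±1 entries whose columns are common eigenvectors. The slides do not show U explicitly, so this choice is an assumption.

```python
def k2p_matrix(p_alpha, p_beta):
    """K2P substitution matrix, rows/columns ordered A, G, C, T (assumed order)."""
    a, b = p_alpha, p_beta
    d = 1.0 - a - 2.0 * b
    return [[d, a, b, b],
            [a, d, b, b],
            [b, b, d, a],
            [b, b, a, d]]


def matmul(X, Y):
    """Plain 4x4 matrix product."""
    return [[sum(X[i][k] * Y[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


# Assumed choice of U: its columns are eigenvectors shared by every K2P matrix.
U = [[1, 1, 1, 1],
     [-1, -1, 1, 1],
     [1, -1, -1, 1],
     [-1, 1, -1, 1]]
# The columns are mutually orthogonal with squared norm 4, so U^{-1} = U^T / 4.
U_INV = [[U[j][i] / 4.0 for j in range(4)] for i in range(4)]


def diagonalize(M):
    """Return U^{-1} M U."""
    return matmul(matmul(U_INV, M), U)


P = k2p_matrix(0.10, 0.03)
Q = k2p_matrix(0.05, 0.02)
DP, DQ, DPQ = diagonalize(P), diagonalize(Q), diagonalize(matmul(P, Q))
print([round(DP[i][i], 6) for i in range(4)])    # mu_P, mu_P, lambda_P, 1
print([round(DPQ[i][i], 6) for i in range(4)])   # products of the eigenvalues
```

With this parameterization, mu = 1 - 2 p_alpha - 2 p_beta and lambda = 1 - 4 p_beta, so the diagonal of DPQ is the entrywise product of the diagonals of DP and DQ.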


$$U^{-1} P U = \begin{pmatrix} \lambda_2(P) & 0 & 0 & 0 \\ 0 & \lambda_3(P) & 0 & 0 \\ 0 & 0 & \lambda_1(P) & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}$$


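$$U^{-1} P U = \begin{pmatrix} \lambda_p & 0 & 0 & 0 \\ 0 & \lambda_p & 0 & 0 \\ 0 & 0 & \lambda_p & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}$$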


The joint distribution of each pair of vertices provides an approximation of the substitution matrices.

[Figure: vertices u, v, w with sequences ACGGTCA, ACGGATA, GGGGATT; edges labeled $P_{uv}$ and $P_{vw}$]

The common theme of all projects: start with input sequences for two or more taxa. Find a distance function which minimizes the inaccuracy (noise) introduced by the sampling process.


         A   G   C   T
    A    -   α   β   β
    G    α   -   β   β
    C    β   β   -   α
    T    β   β   α   -


         A    G    C    T
    A    -    α'   β'   β'
    G    α'   -    β'   β'
    C    β'   β'   -    α'
    T    β'   β'   α'   -


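K2P model tree = <Tree topology> + <Uniform distribution of DNA at all vertices> + <K2P instantaneous rate matrices at edges>

[Figure: tree rooted at r; each nucleotide has probability 25% at every vertex (example sequence ACGGATA); the edge into v carries the rate matrix $R_{uv}$]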


              A               G               T               C
    A    1-p_α-2p_β          p_α             p_β             p_β
    G       p_α           1-p_α-2p_β         p_β             p_β
    T       p_β              p_β          1-p_α-2p_β         p_α
    C       p_β              p_β             p_α          1-p_α-2p_β


           A      G      T      C
    A    1-3p     p      p      p
    G     p     1-3p     p      p
    T     p      p     1-3p     p
    C     p      p      p     1-3p


1 1 12 22

1 1 12 22

1 1 12 22

1 1 12 22

0

0

0

0


K2P model tree = <Tree topology> + <Uniform distribution of DNA at all vertices> + <K2P instantaneous rate matrices at edges>

Uniform distribution at every vertex:

      A      G      C      T
     0.25   0.25   0.25   0.25


K2P rate matrices have the following shape:

         A   G   T   C
    A    -   α   β   β
    G    α   -   β   β
    T    β   β   -   α
    C    β   β   α   -

All transitions have rate α.

All transversions have rate β.


Given sequences at two adjacent vertices, we define the edge length in two steps:

[Figure: adjacent vertices u and v with sequences …TCTGGGA… and …GGGGATT…]

First, align the sequences:

vertices C1 C2 C3 C4 … Cm

u A A C A … G T C T T C G A G G C C C

v A G C A … G C C T A T G C G A C C T


Natural evolutionary distance: Total substitution rate

[Figure: path through the vertices u, v, w; one edge carries a K2P rate matrix with rates α, β, the other a K2P rate matrix with rates α', β']

         A   G   C   T
    A    -   α   β   β
    G    α   -   β   β
    C    β   β   -   α
    T    β   β   α   -

         A    G    C    T
    A    -    α'   β'   β'
    G    α'   -    β'   β'
    C    β'   β'   -    α'
    T    β'   β'   α'   -

Each edge is associated with a time t and a K2P rate matrix S. The total substitution rate along an edge of length t is t(α + 2β). Total substitution rate between species = sum of the rates over the path connecting them.

Total substitution rates are exact distances, which we try to reconstruct from observing the joint distribution of sequences at u and v.
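The additivity of the total substitution rate along a path can be sketched in a few lines. The helper names `edge_rate` and `path_rate` and the numeric values are hypothetical, chosen only to illustrate the formula t(α + 2β) stated on this slide.

```python
def edge_rate(t, alpha, beta):
    """Total substitution rate of one edge: t * (alpha + 2 * beta)."""
    return t * (alpha + 2.0 * beta)


def path_rate(edges):
    """Sum of edge rates over a path; each edge is a (t, alpha, beta) triple."""
    return sum(edge_rate(t, alpha, beta) for t, alpha, beta in edges)


# Hypothetical two-edge path: the first edge has rates (alpha, beta),
# the second has different rates (alpha', beta').
d_path = path_rate([(1.0, 0.2, 0.05), (2.0, 0.1, 0.1)])
print(d_path)
```

Here the first edge contributes 1.0 * (0.2 + 0.1) and the second contributes 2.0 * (0.1 + 0.2), so the path distance is their sum.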


How do we estimate $D_{K2P}(u,v)$?

vertices C1 C2 C3 C4 … Cm

u A A C A … G T C T T C G A G G C C C

v A G C A … G C C T A T G C G A C C T

Our input is a pair of aligned sequences at u and v. They can be used to estimate the probability that a nucleotide X in u will be replaced by a nucleotide Y in v.


vertices C1 C2 C3 C4 … Cm

u A A C A … G T C T T C G A G G C C C

v A G C A … G C C T A T G C G A C C T

First step in distance estimation:

Estimate $P_{uv}$ from the joint distributions (Maximum Likelihood):

              A               G               T               C
    A    1-p_α-2p_β          p_α             p_β             p_β
    G       p_α           1-p_α-2p_β         p_β             p_β
    T       p_β              p_β          1-p_α-2p_β         p_α
    C       p_β              p_β             p_α          1-p_α-2p_β
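A minimal sketch of this estimation step follows, assuming the substitution matrix has one transition probability `p_alpha` and one per-target transversion probability `p_beta` (names chosen here). Under that assumption the estimates are the observed fraction of transition sites and half the observed fraction of transversion sites. The distance formula used afterwards is the classical Kimura two-parameter formula, which recovers t(α + 2β) when the probabilities are exact; it is not necessarily the distance function the talk goes on to recommend. The test alignment simply concatenates the letters shown on the slide and ignores the elided middle part.

```python
import math

PURINES = {"A", "G"}  # the pyrimidines are C and T


def estimate_k2p(seq_u, seq_v):
    """Estimate (p_alpha, p_beta) from two aligned sequences over A, C, G, T."""
    n = len(seq_u)
    if n == 0 or n != len(seq_v):
        raise ValueError("sequences must be aligned, non-empty, and of equal length")
    transitions = transversions = 0
    for x, y in zip(seq_u, seq_v):
        if x == y:
            continue
        if (x in PURINES) == (y in PURINES):
            transitions += 1      # A<->G or C<->T
        else:
            transversions += 1    # purine <-> pyrimidine
    # Each letter has one transition target and two transversion targets.
    return transitions / n, transversions / (2.0 * n)


def k2p_distance(p_alpha, p_beta):
    """Classical K2P distance: -1/2 ln(1-2p_a-2p_b) - 1/4 ln(1-4p_b)."""
    mu = 1.0 - 2.0 * p_alpha - 2.0 * p_beta
    lam = 1.0 - 4.0 * p_beta
    if mu <= 0.0 or lam <= 0.0:
        raise ValueError("sequences are saturated; the distance is undefined")
    return -0.5 * math.log(mu) - 0.25 * math.log(lam)


U_SEQ = "AACAGTCTTCGAGGCCC"  # letters shown for u, elided part skipped
V_SEQ = "AGCAGCCTATGCGACCT"  # letters shown for v, elided part skipped
p_alpha_hat, p_beta_hat = estimate_k2p(U_SEQ, V_SEQ)
d_hat = k2p_distance(p_alpha_hat, p_beta_hat)
print(p_alpha_hat, p_beta_hat, d_hat)
```

The 17 visible sites contain 5 transitions and 2 transversions, so the estimates are 5/17 and 1/17.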



Substitution matrix is estimated by the observed difference between the sequences.

[Figure: tree with sequences ACCGTTG, TCTGGGA, ACGGGTA, ACCCGTG, TCTGGTA and numeric labels 5, 1, 2, 3, 2]

• Errors in distance estimations are amplified when:
  • The rate is small: the signal is too weak (in extreme cases, there are no substitutions whatsoever)
  • The rate is large: recent substitutions overwrite older ones.
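Both effects can be illustrated with a back-of-the-envelope calculation. The sketch below swaps in the simpler one-parameter (Jukes-Cantor) model instead of K2P, purely to keep the algebra short; that substitution, the function name `jc_relative_error`, and the sequence length of 1000 sites are assumptions of this example. It uses a first-order (delta method) approximation of the standard deviation of the estimated distance, divided by the true distance.

```python
import math


def jc_relative_error(d, n):
    """Approximate relative standard deviation of the Jukes-Cantor distance
    estimate for true distance d and n aligned sites (delta method)."""
    # Expected fraction of differing sites at distance d.
    p = 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))
    # Binomial variance of the observed fraction.
    var_p = p * (1.0 - p) / n
    # Derivative of d(p) = -3/4 * ln(1 - 4p/3) with respect to p.
    slope = 1.0 / (1.0 - 4.0 * p / 3.0)
    return math.sqrt(var_p) * slope / d


for d in (0.001, 0.3, 3.0):
    print(d, round(jc_relative_error(d, n=1000), 4))
```

The relative error is larger at both the very small and the very large distance than at the intermediate one: a tiny rate leaves almost no differences to count, while a large rate pushes the observed fraction towards saturation where the logarithm is steep.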



How reliable

Consider “balanced” quartets. Define the “quartet ratio” to be the ratio between the middle edge and two external edges.


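The rate matrix $S_{uv}$ implies a stochastic substitution matrix $P_{uv}$:

$$P_{uv} = \exp(S_{uv})$$

[Figure: edge between u and v, labeled with $S_{uv}$ and $P_{uv}$]

$P_{uv}$ =

              A               G               T               C
    A    1-p_α-2p_β          p_α             p_β             p_β
    G       p_α           1-p_α-2p_β         p_β             p_β
    T       p_β              p_β          1-p_α-2p_β         p_α
    C       p_β              p_β             p_α          1-p_α-2p_β

$P_{uv}$ defines the joint distribution of the sequences at u, v.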
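To make the matrix exponential concrete, the sketch below builds a K2P rate matrix, scales it by an edge time t, and exponentiates it with a truncated power series. Several choices here are assumptions of the example rather than statements from the slides: the row/column order A, G, C, T, the diagonal entries set to -(α + 2β) so that rows sum to zero, the convention that the matrix being exponentiated is t times the rate matrix, and all function names. The closed-form entries are included as a cross-check.

```python
import math


def k2p_rate_matrix(alpha, beta):
    """K2P rate matrix, order A, G, C, T; diagonal chosen so each row sums to 0."""
    d = -(alpha + 2.0 * beta)
    return [[d, alpha, beta, beta],
            [alpha, d, beta, beta],
            [beta, beta, d, alpha],
            [beta, beta, alpha, d]]


def matmul(X, Y):
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def expm_series(S, terms=40):
    """Matrix exponential by a truncated power series (fine for small entries)."""
    n = len(S)
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        # term becomes S^k / k!
        term = [[x / k for x in row] for row in matmul(term, S)]
        result = [[r + s for r, s in zip(rr, tr)] for rr, tr in zip(result, term)]
    return result


def k2p_closed_form(alpha, beta, t):
    """Closed-form transition and transversion probabilities."""
    lam = math.exp(-4.0 * beta * t)
    mu = math.exp(-2.0 * (alpha + beta) * t)
    return (1.0 + lam - 2.0 * mu) / 4.0, (1.0 - lam) / 4.0


alpha, beta, t = 0.2, 0.05, 1.5  # hypothetical values
S_uv = [[t * r for r in row] for row in k2p_rate_matrix(alpha, beta)]
P_uv = expm_series(S_uv)
p_alpha, p_beta = k2p_closed_form(alpha, beta, t)
# Under the uniform distribution, the joint distribution is P_uv / 4.
F_uv = [[p / 4.0 for p in row] for row in P_uv]
print(P_uv[0][1], p_alpha)
print(P_uv[0][2], p_beta)
```

The series result agrees with the closed form, each row of P_uv sums to 1, and the entries of F_uv sum to 1.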


Performance of the standard distance method in reconstructing the split from estimated distances

[Figure: the three possible quartet topologies on {A, B, C, D}; the true quartet AB|CD has a middle edge of weight $w_{sep}$ and diameter diam; labels $\frac{1}{2} w_{sep}$]

For the true tree T with split AB|CD:

$$d_T(A,B) + d_T(C,D) + 2 w_{sep} = d_T(A,C) + d_T(B,D) = d_T(A,D) + d_T(B,C)$$

• Distance based 4-point method (FPM): reconstruction will fail if

$$\hat d(A,B) + \hat d(C,D) > \min\{\hat d(A,C) + \hat d(B,D),\ \hat d(A,D) + \hat d(B,C)\}$$
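The four-point method can be sketched as follows: compute the three pairwise sums and pick the split with the smallest one. The function names, the dictionary-of-frozensets representation, and the toy distances (external edges of length 1, middle edge of weight 0.5) are assumptions made for this example.

```python
def pair(dist, x, y):
    """Look up the distance between taxa x and y (order-independent key)."""
    return dist[frozenset((x, y))]


def four_point_method(dist, a="A", b="B", c="C", d="D"):
    """Return the split whose pair of distances has the smallest sum."""
    sums = {
        f"{a}{b}|{c}{d}": pair(dist, a, b) + pair(dist, c, d),
        f"{a}{c}|{b}{d}": pair(dist, a, c) + pair(dist, b, d),
        f"{a}{d}|{b}{c}": pair(dist, a, d) + pair(dist, b, c),
    }
    return min(sums, key=sums.get)


def quartet_distances(err_ab=0.0, err_cd=0.0):
    """Distances of a hypothetical quartet AB|CD: external edges 1, middle edge 0.5.
    err_ab and err_cd add estimation noise to d(A,B) and d(C,D)."""
    cross = 1.0 + 0.5 + 1.0
    return {
        frozenset("AB"): 2.0 + err_ab, frozenset("CD"): 2.0 + err_cd,
        frozenset("AC"): cross, frozenset("BD"): cross,
        frozenset("AD"): cross, frozenset("BC"): cross,
    }


exact = four_point_method(quartet_distances())
mild_noise = four_point_method(quartet_distances(0.2, 0.2))
heavy_noise = four_point_method(quartet_distances(0.6, 0.6))
print(exact, mild_noise, heavy_noise)
```

In this toy case the correct split survives errors of 0.2 per distance (below half the middle edge weight) but is lost when the errors reach 0.6.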


[Figure: rooted quartet tree on leaves A, B, C, D with edge lengths t and 10t]


Minimizing the expected relative error

2

2

Since ln(·) is non-linear, we only find the c which minimizes the NMSE of a linear approximation of the estimated distance (using the "delta method").

ˆ ˆˆ ˆ(ln( ) ln( )) (ln( ) ln( ))

ln( ) ln( )

c

E cE c

c

2

2ln( ) ln( )c

44

4

4 4

and the optimal is:

11

11 1

opt

c

ee

ece e


Distance based methods: The general scheme

- Compute distances between all taxon-pairs

D =
     0
     3    0
     9    8    0
    15   14   18    0
    17   16   20   22    0
    16   15   19   21    9    0

- Find a tree (edge-weighted) best-describing the distances

[Figure: the corresponding edge-weighted tree, with edge weights 4, 5, 7, 2, 1, 2, 10, 6, 1]

This Talk


Adaptive distance based algorithm for the K2P model

Find constants $\{c_1, c_2\}$ s.t. the SR function

$$\Delta_c(P) = -c_1 \ln(\lambda_P) - c_2 \ln(\mu_P)$$

is best for the input.

$$D(i,j) = \Delta_c(s_i, s_j)$$
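One way to picture such a family of functions is as weighted combinations of the logarithms of the two non-trivial eigenvalues of a K2P substitution matrix. The sketch below is an assumption-laden illustration: it takes lambda = 1 - 4 p_beta and mu = 1 - 2 p_alpha - 2 p_beta, attaches `c1` to ln(lambda) and `c2` to ln(mu) with negative signs, and uses hypothetical rates. The exact parameterization used in the talk may differ.

```python
import math


def k2p_probabilities(alpha, beta, t):
    """Exact (p_alpha, p_beta) for rates alpha, beta over time t."""
    lam = math.exp(-4.0 * beta * t)
    mu = math.exp(-2.0 * (alpha + beta) * t)
    return (1.0 + lam - 2.0 * mu) / 4.0, (1.0 - lam) / 4.0


def sr_function(p_alpha, p_beta, c1, c2):
    """Assumed form: -c1 * ln(lambda) - c2 * ln(mu)."""
    lam = 1.0 - 4.0 * p_beta
    mu = 1.0 - 2.0 * p_alpha - 2.0 * p_beta
    return -c1 * math.log(lam) - c2 * math.log(mu)


alpha, beta, t = 0.2, 0.05, 1.0  # hypothetical values
p_alpha, p_beta = k2p_probabilities(alpha, beta, t)

# c = (1/4, 1/2) recovers the total substitution rate t * (alpha + 2 * beta).
total_rate = sr_function(p_alpha, p_beta, c1=0.25, c2=0.5)
# c = (1/4, 0) looks at transversions only and yields beta * t.
transversions_only = sr_function(p_alpha, p_beta, c1=0.25, c2=0.0)
print(total_rate, transversions_only)
```

Both choices of constants give quantities that add up along a path, which is why different constants can all serve as distance functions while reacting differently to sampling noise.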



Distance based methods: An adaptive scheme

- Find a distance function d which is good for the input

- Compute distances between all taxon-pairs: $D(i,j) = d(s_i, s_j)$

- Find a tree (edge-weighted) best-describing the distances

[Figure: the distance matrix D and the corresponding edge-weighted tree]

This work
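The scheme treats the distance function as a pluggable component. A minimal sketch of the "compute distances between all taxon-pairs" step, taking the distance function as an argument, is shown below. The names `distance_matrix` and `hamming_fraction` are hypothetical, and the fraction of differing sites is used only as a simple stand-in for whichever distance function is selected.

```python
def distance_matrix(seqs, d):
    """Return the symmetric matrix D with D[i][j] = d(seqs[i], seqs[j])."""
    n = len(seqs)
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i):
            D[i][j] = D[j][i] = d(seqs[i], seqs[j])
    return D


def hamming_fraction(s, t):
    """Stand-in distance: fraction of aligned sites at which s and t differ."""
    if len(s) != len(t) or not s:
        raise ValueError("sequences must be aligned, non-empty, and of equal length")
    return sum(x != y for x, y in zip(s, t)) / len(s)


# Three of the example sequences that appear in the slides.
seqs = ["ACCGTTG", "TCTGGGA", "ACGGGTA"]
D = distance_matrix(seqs, hamming_fraction)
for row in D:
    print([round(x, 3) for x in row])
```

Swapping `hamming_fraction` for another function changes the matrix without touching the tree-building step that follows.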


$D(i,j) = d(s_i, s_j)$

Promotion: Make distance based methods adaptive


Summary of previous slides:

SR functions for K2P are of the form:

$$\Delta_c(P) = -c_1 \ln(\lambda_P) - c_2 \ln(\mu_P)$$

The constants determine the weight the function puts on the transversions.

Next we show how this weight is affected by the total substitution rate and the transition/transversion ratio.