Molecular Evolution

Stat 246, Spring 2002, Week 6a

2

Evolution using molecules

• Our DNA is inherited from our parents more or less unchanged.

• Molecular evolution is dominated by mutations that are neutral from the standpoint of natural selection.

• Mutations accumulate at fairly steady rates in surviving lineages.

• We can study the evolution of (macro) molecules and reconstruct the evolutionary history of organisms using their molecules.

3

Some important dates in history (billions of years ago)

Origin of the universe                15 ± 4
Formation of the solar system         4.6
First self-replicating system         3.5 ± 0.5
Prokaryotic-eukaryotic divergence     1.8 ± 0.3
Plant-animal divergence               1.0
Invertebrate-vertebrate divergence    0.5
Mammalian radiation beginning         0.1

Doolittle et al., 1986 (CSH)

4

5

Two important early observations

Different proteins evolve at different rates, and this seems more or less independent of the host organism, including its generation time.

It is necessary to adjust the observed percent difference between two homologous proteins to get a distance more or less linearly related to the time since their common ancestor.

6

Evolution of the globins

[Figure: Rates of macromolecular evolution. Vertical axis: corrected amino acid changes per 100 residues (0 to 220). Horizontal axis: millions of years since divergence (0 to 1400), with geological periods from Huronian to Pliocene marked. Curves labeled fibrinopeptides (1.1 MY), hemoglobin (5.8 MY) and cytochrome c (20.0 MY). Marked comparisons include mammals, birds/reptiles, reptiles/fish, carp/lamprey, mammals/reptiles, vertebrates/insects, and the separation of the ancestors of plants and animals. After Dickerson (1971).]

7

Different rates of change for different proteins

Protein                                PAMs(a) per 100 residues    Theoretical lookback
                                       per 10^8 years              time(b)
Pseudogenes                            400                         45 (c)
Fibrinopeptides                        90                          200 (c)
Lactalbumins                           27                          670 (c)
Lysozymes                              24                          750 (c)
Ribonucleases                          21                          850 (c)
Hemoglobins                            12                          1.5 (d)
Acid proteases                         8                           2.3 (d)
Triosephosphate isomerase              3                           6 (d)
Phosphoglyceraldehyde dehydrogenase    2                           9 (d)
Glutamate dehydrogenase                1                           18 (d)

(a) PAMs, accepted point mutations.
(b) Useful lookback time = 360 PAMs.
(c) Million years.
(d) Billion years.

Doolittle 1986

8

Rates of change in protein families

Protein                          Rate(a)   Protein                          Rate(a)
Fibrinopeptides                  90        Thyrotropin beta chain           7.4
Growth hormone                   37        Parathyrin                       7.3
Ig kappa chain C region          37        Parvalbumin                      7.0
Kappa casein                     33        BPTI Protease inhibitors         6.2
Ig gamma chain C region          31        Trypsin                          5.9
Lutropin beta chain              30        Melanotropin beta                5.6
Ig lambda chain C region         27        Alpha crystallin A chain         5.0
Complement C3a                   27        Endorphin                        4.8
Lactalbumin                      27        Cytochrome b5                    4.5
Epidermal growth factor          26        Insulin                          4.4
Somatotropin                     25        Calcitonin                       4.3
Pancreatic ribonuclease          21        Neurophysin 2                    3.6
Lipotropin beta                  21        Plastocyanin                     3.5
Haptoglobin alpha chain          20        Lactate dehydrogenase            3.4
Serum albumin                    19        Adenylate cyclase                3.2
Phospholipase A2                 19        Triosephosphate isomerase        2.8
Protease inhibitor PST1 type     18        Vasoactive intestinal peptide    2.6
Prolactin                        17        Corticotropin                    2.5
Pancreatic hormone               17        Glyceraldehyde 3-P DH            2.2
Carbonic anhydrase C             16        Cytochrome C                     2.2
Lutropin alpha chain             16        Plant ferredoxin                 1.9
Hemoglobin alpha chain           12        Collagen                         1.7
Hemoglobin beta chain            12        Troponin C, skeletal muscle      1.5
Lipid-binding protein A-II       10        Alpha crystallin B-chain         1.5
Gastrin                          9.8       Glucagon                         1.2
Animal lysozyme                  9.8       Glutamate DH                     0.9
Myoglobin                        8.9       Histone H2B                      0.9
Amyloid A                        8.7       Histone H2A                      0.5
Nerve growth factor              8.5       Histone H3                       0.14
Acid proteases                   8.4       Ubiquitin                        0.1
Myelin basic protein             7.4       Histone H4                       0.1

(a) Percent per 100 My. From Nei (1987) and Dayhoff et al. (1978).

9

Beta-globins (orthologues)

10 20 30 40

M V H L T P E E K S A V T A L W G K V N V D E V G G E A L G R L L V V Y P W T Q BG-human
- . . . . . . . . N . . . T . . . . . . . . . . . . . . . . . . . . . . . . . . BG-macaque
- - M . . A . . . A . . . . F . . . . K . . . . . . . . . . . . . . . . . . . . BG-bovine
- . . . S G G . . . . . . N . . . . . . I N . L . . . . . . . . . . . . . . . . BG-platypus
. . . W . A . . . Q L I . G . . . . . . . A . C . A . . . A . . . I . . . . . . BG-chicken
- . . W S E V . L H E I . T T . K S I D K H S L . A K . . A . M F I . . . . . T BG-shark

50 60 70 80

R F F E S F G D L S T P D A V M G N P K V K A H G K K V L G A F S D G L A H L D BG-human
. . . . . . . . . . S . . . . . . . . . . . . . . . . . . . . . . . . . N . . . BG-macaque
. . . . . . . . . . . A . . . . N . . . . . . . . . . . . D S . . N . M K . . . BG-bovine
. . . . A . . . . . S A G . . . . . . . . . . . . A . . . T S . G . A . K N . . BG-platypus
. . . A . . . N . . S . T . I L . . . M . R . . . . . . . T S . G . A V K N . . BG-chicken
. Y . G N L K E F T A C S Y G - - - - - . . E . A . . . T . . L G V A V T . . G BG-shark

90 100 110 120

N L K G T F A T L S E L H C D K L H V D P E N F R L L G N V L V C V L A H H F G BG-human
. . . . . . . Q . . . . . . . . . . . . . . . . K . . . . . . . . . . . . . . . BG-macaque
D . . . . . . A . . . . . . . . . . . . . . . . K . . . . . . . V . . . R N . . BG-bovine
D . . . . . . K . . . . . . . . . . . . . . . . N R . . . . . I V . . . R . . S BG-platypus
. I . N . . S Q . . . . . . . . . . . . . . . . . . . . D I . I I . . . A . . S BG-chicken
D V . S Q . T D . . K K . A E E . . . . V . S . K . . A K C F . V E . G I L L K BG-shark

130 140

K E F T P P V Q A A Y Q K V V A G V A N A L A H K Y H BG-human
. . . . . Q . . . . . . . . . . . . . . . . . . . . . BG-macaque
. . . . . V L . . D F . . . . . . . . . . . . . R . . BG-bovine
. D . S . E . . . . W . . L . S . . . H . . G . . . . BG-platypus
. D . . . E C . . . W . . L . R V . . H . . . R . . . BG-chicken
D K . A . Q T . . I W E . Y F G V . V D . I S K E . . BG-shark

. means same as reference sequence

- means deletion

10

Beta-globins: uncorrected pairwise distances

DISTANCES between protein sequences
Calculated over: 1 to 147
Below diagonal: observed number of differences
Above diagonal: number of differences per 100 amino acids

       hum   mac   bov   pla   chi   sha
hum   ----     5    16    23    31    65
mac      7  ----    17    23    30    62
bov     23    24  ----    27    37    65
pla     34    34    39  ----    29    64
chi     45    44    52    42  ----    61
sha     91    88    91    90    87  ----

11

Beta-globins: corrected pairwise distances

DISTANCES between protein sequences
Calculated over: 1 to 147
Below diagonal: observed number of differences
Above diagonal: estimated number of substitutions per 100 amino acids
Correction method: Jukes-Cantor

       hum   mac   bov   pla   chi   sha
hum   ----     5    17    27    37   108
mac      7  ----    18    27    36   102
bov     23    24  ----    32    46   110
pla     34    34    39  ----    34   106
chi     45    44    52    42  ----    98
sha     91    88    91    90    87  ----

12

UPGMA tree

[Figure: UPGMA tree of the six beta-globin sequences, with leaves BG-bovine, BG-human, BG-macaque, BG-platypus, BG-chicken and BG-shark.]
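The clustering behind a UPGMA tree can be sketched in a few lines. The code below is a minimal illustration, not the program that produced the slide: it assumes the corrected beta-globin distances from the table above as input, and the helper names (`leaf_distance`, `upgma`) are made up for this example. At each step it merges the two clusters with the smallest average leaf-to-leaf distance.

```python
from itertools import combinations

# Corrected (Jukes-Cantor) distances per 100 amino acids, copied from the
# upper triangle of the beta-globin table.
NAMES = ["hum", "mac", "bov", "pla", "chi", "sha"]
UPPER = {
    ("hum", "mac"): 5, ("hum", "bov"): 17, ("hum", "pla"): 27,
    ("hum", "chi"): 37, ("hum", "sha"): 108, ("mac", "bov"): 18,
    ("mac", "pla"): 27, ("mac", "chi"): 36, ("mac", "sha"): 102,
    ("bov", "pla"): 32, ("bov", "chi"): 46, ("bov", "sha"): 110,
    ("pla", "chi"): 34, ("pla", "sha"): 106, ("chi", "sha"): 98,
}


def leaf_distance(a, b):
    """Look up a leaf-to-leaf distance regardless of argument order."""
    return UPPER[(a, b)] if (a, b) in UPPER else UPPER[(b, a)]


def upgma(names):
    """Return a nested-tuple tree built by average-linkage clustering."""
    # Each cluster is a pair (tree, list_of_leaves).
    clusters = [(name, [name]) for name in names]

    def average(c1, c2):
        # UPGMA distance: mean of all leaf-to-leaf distances between clusters.
        total = sum(leaf_distance(a, b) for a in c1[1] for b in c2[1])
        return total / (len(c1[1]) * len(c2[1]))

    while len(clusters) > 1:
        # Pick the closest pair of clusters and merge them.
        c1, c2 = min(combinations(clusters, 2), key=lambda pair: average(*pair))
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append(((c1[0], c2[0]), c1[1] + c2[1]))
    return clusters[0][0]


tree = upgma(NAMES)
print(tree)
```

With these distances the merges happen in the order human with macaque, then bovine, platypus, chicken and finally shark. The sketch records only the topology; branch heights (half the merge distance) are left out.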

13

10 20 30 40 50 60

C C G A C A G G C A C G G T G G C T C A C A C C T G T A A T C C C A G T A C T T T G G G A G G C T G A G G C G A G A G G hum-1
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . hum-2
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . hum-3
. . . . . . . A . . . . . . . . . . . . . . . . . . . C . . . . . . . C . . . . . . . . . . . . . . . . . . . G . . . . chimp
. . . . . . . A . . . . . . . . . . . . . . . . . . . C . . . . . . . C . . . . . . . . . . . . . . . . . . . G . . . . bonob
. . . . . . . . . . . A . . . . . . . . . . . . . . . . . . . . . . . C . . . . . . . . . . C . . . . . . . . G . . . . goril
. G . . . . . A . . . . . . . . . . . . . G . . . . . . . . . . . . . C . . . . . . . . . . . . C . . . . T . G . C . . orang

70 80 90 100 110 120

A T C A C C T G A G G T C G G G A G T T T G A G A C C A G C C T G A C C A A T A T G G A G A A A C C C C A G T T A T A C hum-1
. . . . . . . . . . . . . . . . . . . . . . . . . T . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . hum-2
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . C . . . . . hum-3
. . . . . . . . . . . . . . . . . . . . C . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . chimp
. . . . . . . . . . . . . . . . . . . . C . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . bonob
. . . . . . . . . . . . . . . . . . . . C . . . . . . . . . . . . . . . . . . . . . . . . . . . T . . . . . . . . . . . goril
. . . . . . . . . . . . T . . . . . . . C . . A . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . C . . . orang

130 140 150 160 170 180

T A A A A A T A C A A A A T T A G C T G G G T G T G G T G G C G C A T G C C T G T A A T C C T A G C T A C T A G G A A G hum-1
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . hum-2
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . hum-3
. . . . . . . . . . . . . . . . . . . . . . . . . . C . . . . . . . . . . . . . . . . . . . C . . . . . . . . . . G . . chimp
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . C . . . . . . . . . . G . . bonob
. . . . . . . . . . . . . . . . . . . . . . . . G . . . . . . . . . . . . . . . . . . . . . C . . . . . . . . . . G . . goril
. . . . . . . . . . . . . . . . . . . . . . C . . . . . . . . . . . . . . . . . . . . . T . C . . . . . . . . . . G . . orang

190 200 210 220 230 240

G C T G A G G C A G G A G A A T C G C T T G A A C C C G G G A G G T G G A G G T T G A G G T G A G C T G A G A T C A C G hum-1
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . hum-2
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . hum-3
. . . . . . . . . . . . A . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . chimp
. . . . . . . . . . . . A . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . bonob
. . . . . . . . . . . . . . . . . A . . . . . . . . . . . . . A . . . . . . . . . T . . . . . . . . . . . . . . . . . A goril
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . T . . . . . . . . . . . . . . G . . orang

250 260 270 280 290 300

C C A T T G C A C T C C A G C C T G G G C A A C A A G A G C A A A A C T C C G T C T C A A A A A A T A A A T A A A T A A hum-1
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . hum-2
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . hum-3
. . . C . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A . . . . . . . . . . . . . . . . C . . . . chimp
. . . C . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A . . . . . . . . . . . . . . . . . . . . . bonob
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A . . . . . . . . . . . . . . . . . . . . . goril
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A . . . T . . . . . . . . . . . . . . . . . orang

Alu sequences (α-globin2 Alu 1, Knight et al., 1996)

14

Alu sequences: uncorrected pairwise distances

DISTANCES between nucleic acid sequences
Calculated over: 1 to 300, considering all base positions
Below diagonal: observed number of differences
Above diagonal: number of differences per 100 bases

        hum-1  hum-2  hum-3  chimp  bonob  goril  orang
hum-1    ----      0      0      4      3      5      7
hum-2       1   ----      1      4      4      5      7
hum-3       1      2   ----      4      4      5      7
chimp      12     13     13   ----      1      5      6
bonob      10     11     11      2   ----      4      5
goril      14     15     15     14     12   ----      7
orang      20     21     21     18     16     22   ----

15

Alu sequences: corrected pairwise distances

DISTANCES between nucleic acid sequences
Calculated over: 1 to 300, considering all base positions
Below diagonal: observed number of differences
Above diagonal: estimated number of substitutions per 100 bases
Correction method: Jukes-Cantor

        hum-1  hum-2  hum-3  chimp  bonob  goril  orang
hum-1    ----      0      0      4      3      5      7
hum-2       1   ----      1      4      4      5      7
hum-3       1      2   ----      4      4      5      7
chimp      12     13     13   ----      1      5      6
bonob      10     11     11      2   ----      4      6
goril      14     15     15     14     12   ----      8
orang      20     21     21     18     16     22   ----

16

Correcting distances between DNA and protein sequences

We mentioned earlier that it is necessary to adjust observed percent differences to get a distance measure which scales linearly with time. This is because we can have multiple and back substitutions at a given position along a lineage.

All of the correction methods (with names like Jukes-Cantor, Kimura 2-parameter, etc.) are justified by probabilistic arguments, whose basis is worth mastering. The same evolutionary ideas are used in scoring sequence alignments.

17

The Jukes-Cantor model of nucleotide substitution

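Consider the nt at the 2nd position in α-globin2 Alu 1.

[Diagram: the common ancestor of human and orang, joined to human (now) by a branch of t time units.]

Infinitesimal matrix (rows and columns in the order A, G, C, T):

\[
Q = \begin{pmatrix}
-3\alpha & \alpha & \alpha & \alpha \\
\alpha & -3\alpha & \alpha & \alpha \\
\alpha & \alpha & -3\alpha & \alpha \\
\alpha & \alpha & \alpha & -3\alpha
\end{pmatrix}
\]

$\alpha$ = rate of substitution of one nt by another, assumed constant.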

18

A simple 4-state Markov chain: the Kimura 2-parameter model for nucleotide change

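[Diagram: the four states drawn as a square, with A and G on the top row and T and C on the bottom row.]

Transition rates: horizontal: a; diagonal and vertical: b; self: c = -a - 2b.

Rate matrix (rows and columns in the order A, G, C, T):

\[
\begin{pmatrix}
c & a & b & b \\
a & c & b & b \\
b & b & c & a \\
b & b & a & c
\end{pmatrix}
\]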

19

Markov chain

State space = {A,C,G,T}.

$P(i,j) = \mathrm{pr}(\text{next state } S_j \mid \text{current state } S_i)$

Markov assumption:

$P(i,j) = \mathrm{pr}(\text{next state } S_j \mid \text{current state } S_i \text{ and any configuration of states before this})$

Only the present state, not previous states, affects the probabilities of moving to next states.

20

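The multiplication rule

\begin{align*}
&\mathrm{pr}(\text{state after next is } S_k \mid \text{current state is } S_i) \\
&\quad= \sum_j \mathrm{pr}(\text{state after next is } S_k,\ \text{next state is } S_j \mid \text{current state is } S_i) && \text{[addition rule]} \\
&\quad= \sum_j \mathrm{pr}(\text{next state is } S_j \mid \text{current state is } S_i)\, \mathrm{pr}(\text{state after next is } S_k \mid \text{current state is } S_i,\ \text{next state is } S_j) && \text{[multiplication rule]} \\
&\quad= \sum_j P(i,j)\, P(j,k) && \text{[Markov assumption]} \\
&\quad= (i,k)\text{-element of } P^2, \text{ where } P = (P(i,j)).
\end{align*}

More generally, pr(state t steps from now is $S_k$ | current state is $S_i$) = (i,k)-element of $P^t$.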
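The rule above can be checked numerically. The sketch below uses a hypothetical one-step transition matrix on the states A, G, C, T (the numbers are invented for illustration) and computes t-step probabilities as entries of the matrix power.

```python
STATES = ["A", "G", "C", "T"]

# Hypothetical one-step transition probabilities (each row sums to 1):
# 0.90 stay, 0.06 to the transition partner, 0.02 to each transversion.
P = [
    [0.90, 0.06, 0.02, 0.02],
    [0.06, 0.90, 0.02, 0.02],
    [0.02, 0.02, 0.90, 0.06],
    [0.02, 0.02, 0.06, 0.90],
]


def matmul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def matpow(p, t):
    """Return p raised to the non-negative integer power t."""
    n = len(p)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(t):
        result = matmul(result, p)
    return result


P2 = matpow(P, 2)
# pr(state after next is G | current state is A) = sum_j P(A,j) P(j,G)
a_to_g_two_steps = P2[STATES.index("A")][STATES.index("G")]
print(a_to_g_two_steps)
```

Summing by hand gives 0.90 x 0.06 + 0.06 x 0.90 + 0.02 x 0.02 + 0.02 x 0.02 = 0.1088 for the A to G entry of P squared, which is what the code computes.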

21

Continuous-time version

For any s, t write $p_{ij}(t) = \mathrm{pr}(S_j \text{ at time } t+s \mid S_i \text{ at time } s)$ for stationary (time-homogeneous) transition probabilities.

Write $P(t) = (p_{ij}(t))$ for the matrix of the $p_{ij}(t)$'s.

Then for any t, u: $P(t+u) = P(t)\,P(u)$.

It follows that $P(t) = \exp(Qt)$, where $Q = P'(0)$ is the derivative of $P(t)$ at t = 0.

Q is called the infinitesimal matrix of P(t), and satisfies

$P'(t) = QP(t) = P(t)Q$.

22

Interpretation of Q

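Roughly, q(i,j) is the rate of transitions of i to j, while $q(i,i) = -\sum_{j \neq i} q(i,j)$, so each row sum is 0.

If under some initial conditions, we have a Markov chain evolving in continuous time with infinitesimal matrix Q, and $p_j(t) = \mathrm{pr}(S_j \text{ at time } t)$, then for small h

\begin{align*}
p_j(t+h) &= \sum_i \mathrm{pr}(S_i \text{ at } t,\ S_j \text{ at } t+h) \\
&= \sum_i \mathrm{pr}(S_i \text{ at } t)\, \mathrm{pr}(S_j \text{ at } t+h \mid S_i \text{ at } t) \\
&\approx p_j(t)\,(1 + h\,q_{jj}) + \sum_{i \neq j} p_i(t)\, h\, q_{ij}.
\end{align*}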

23

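i.e., $h^{-1}[p_j(t+h) - p_j(t)] \approx p_j(t)\,q(j,j) + \sum_{i \neq j} p_i(t)\,q(i,j)$,

which, as $h \to 0$, becomes $p_j'(t) = \sum_i p_i(t)\,q(i,j)$, i.e. $P' = PQ = QP$ in matrix form.

Important approximation:

When t is small, $P(t) \approx I + Qt$.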

24

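Jukes-Cantor model (1969)

\[
Q = \begin{pmatrix}
-3\alpha & \alpha & \alpha & \alpha \\
\alpha & -3\alpha & \alpha & \alpha \\
\alpha & \alpha & -3\alpha & \alpha \\
\alpha & \alpha & \alpha & -3\alpha
\end{pmatrix},
\qquad
P(t) = \begin{pmatrix}
r & s & s & s \\
s & r & s & s \\
s & s & r & s \\
s & s & s & r
\end{pmatrix}
\]

$r = (1 + 3e^{-4\alpha t})/4$, $s = (1 - e^{-4\alpha t})/4$.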
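The closed form for P(t) can be checked against exp(Qt) directly. The sketch below is only a numerical sanity check: it sums a truncated Taylor series for the matrix exponential (adequate for the small matrix used here, not a general-purpose method), and the values of `alpha` and `t` are arbitrary choices for illustration.

```python
import math


def jc_rate_matrix(alpha):
    """Jukes-Cantor infinitesimal matrix: -3*alpha on the diagonal, alpha elsewhere."""
    return [[-3 * alpha if i == j else alpha for j in range(4)] for i in range(4)]


def matmul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def expm_taylor(m, terms=40):
    """Approximate exp(m) by the partial sum I + m + m^2/2! + ... (small m only)."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        # term holds m^k / k! after this update.
        term = [[x / k for x in row] for row in matmul(term, m)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return result


alpha, t = 1.0, 0.1  # arbitrary illustrative values
Qt = [[q * t for q in row] for row in jc_rate_matrix(alpha)]
P = expm_taylor(Qt)

r = (1 + 3 * math.exp(-4 * alpha * t)) / 4
s = (1 - math.exp(-4 * alpha * t)) / 4
print(P[0][0], r)
print(P[0][1], s)
```

The diagonal entries of the numerical exponential should agree with r and the off-diagonal entries with s, up to floating-point error.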

25

Let P(t) = exp(Qt). Then the A,G element of P(t) is

$\mathrm{pr}(\text{G now} \mid \text{A then}) = (1 - e^{-4\alpha t})/4$.

Same for all pairs of different nucleotides.

Overall rate of change $k = 3\alpha t$.

When k = .01, described as 1 PAM.

PAM = accepted point mutation

Put $t = .01/(3\alpha) = 1/(300\alpha)$. Then the resulting $P = P(1/(300\alpha))$ is called the PAM(1) matrix.

Why use PAMs?

26

Evolutionary time, PAM

Since sequences evolve at different rates, it is convenient to rescale time so that 1 PAM of evolutionary time corresponds to 1% expected substitutions.

For Jukes-Cantor, $k = 3\alpha t$ is the expected number of substitutions in [0,t], so is a distance. (Show this.)

Set $3\alpha t = 1/100$, or $t = 1/(300\alpha)$, so 1 PAM = $1/(300\alpha)$ years.
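To see why the rescaled time and the observed fraction of differences part company, one can tabulate the Jukes-Cantor probability that a site differs from its ancestor after a given number of PAMs. This is a small illustrative sketch: the function name is made up, and it uses the relation pr(different) = 3s with $\alpha t = k/3$.

```python
import math


def jc_fraction_different(pam):
    """Probability that a site differs from its ancestor after `pam` PAMs (Jukes-Cantor)."""
    k = pam / 100.0  # expected substitutions per site
    # pr(different) = 3 * (1 - exp(-4*alpha*t)) / 4 with alpha*t = k / 3
    return 0.75 * (1.0 - math.exp(-4.0 * k / 3.0))


for pam in (1, 10, 50, 100, 250):
    print(pam, round(100 * jc_fraction_different(pam), 2))
```

At 1 PAM the expected percent difference is just under 1%, because a few sites have already changed more than once; as the distance grows, the observed difference levels off below 75% while the PAM distance keeps increasing.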

27

Distance adjustment

For a pair of sequences, $k = 3\alpha t$ is desired, but not observable. Instead, pr(different) is observed. Convert pr(different) to k.

This is very similar to the conversion of $\theta$ = pr(recombination) to genetic distance d (expected number of crossovers) using the Haldane function

$\theta = \tfrac{1}{2}(1 - e^{-2d})$,

assuming the Poisson model.

28

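[Diagram: a common ancestor with two branches, each of length t, leading to orang (G) and human (C).]

Still 2nd position in α-globin Alu 1.

Assume that the common ancestor has A, G, C or T with probability 1/4.

Then the chance of the nt differing is

\begin{align*}
p_{\neq} &= \tfrac{3}{4}\,(1 - e^{-8\alpha t}) \\
&= \tfrac{3}{4}\,(1 - e^{-4k/3}), \quad \text{since } k = 2 \times 3\alpha t.
\end{align*}

As t grows, $p_{\neq}$ approaches 3/4.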

29

If we suppose all positions behave identically and independently, and $n_{\neq}$ differ out of n, we can invert this, obtaining

$\hat{k} = -\tfrac{3}{4} \log\left(1 - \tfrac{4}{3}\, n_{\neq}/n\right)$.

This is the corrected or adjusted fraction of differences (under this simple model). Multiply by 100 to get PAMs.

The analogous simple model for amino acid sequences has

$\hat{k} = -\tfrac{19}{20} \log\left(1 - \tfrac{20}{19}\, n_{\neq}/n\right)$.

Multiply by 100 for PAMs.

30

Illustration

1. Human and bovine beta-globins are aligned with no deletions at 145 out of 147 sites. They differ at 23 of these sites. Thus $n_{\neq}/n = 23/145$, and the corrected distance using the Jukes-Cantor formula is (natural logs)

$-\tfrac{19}{20} \log\left(1 - \tfrac{20}{19} \times \tfrac{23}{145}\right) \approx 17.4 \times 10^{-2}$.

2. The hum-1 and gorilla sequences are aligned without gaps across all 300 bp, and differ at 14 sites. Thus $n_{\neq}/n = 14/300$, and the corrected distance using the Jukes-Cantor formula is

$-\tfrac{3}{4} \log\left(1 - \tfrac{4}{3} \times \tfrac{14}{300}\right) \approx 4.8 \times 10^{-2}$.
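The two illustrations can be reproduced with a short function. This is a minimal sketch of the correction formula for an alphabet of any size; the function name and the saturation check are additions for this example.

```python
import math


def corrected_distance(n_diff, n_sites, n_states):
    """Jukes-Cantor-style corrected distance (expected substitutions per site).

    n_states is 4 for nucleotides and 20 for amino acids.
    """
    b = (n_states - 1) / n_states  # 3/4 or 19/20
    p = n_diff / n_sites           # observed fraction of differing sites
    if p >= b:
        raise ValueError("observed difference is at or beyond saturation")
    return -b * math.log(1 - p / b)


globin = corrected_distance(23, 145, 20)  # human vs bovine beta-globin
alu = corrected_distance(14, 300, 4)      # hum-1 vs gorilla Alu
print(round(100 * globin, 1), round(100 * alu, 1))
```

Multiplying by 100 gives PAMs, which can be compared with the 17 and 5 shown above the diagonal in the corrected distance tables.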

31

Correspondence between observed a.a. differences and the evolutionary distance (Dayhoff et al., 1978)

Observed percent difference    Evolutionary distance in PAMs
 1                               1
 5                               5
10                              11
15                              17
20                              23
25                              30
30                              38
35                              47
40                              56
45                              67
50                              80
55                              94
60                             112
65                             133
70                             159
75                             195
80                             246
85                             328

32

The stationary distribution

A probability distribution $\pi$ on {A,C,G,T} is a stationary distribution if

$\sum_i \pi(i)\, P(i,j) = \pi(j)$, for all j.

Fact: Given any initial distribution, the distribution at time t $\to \pi$ as $t \to \infty$.

For the Jukes-Cantor and Kimura models, $\pi$ is the uniform distribution. Assume the ancestor sequence is IID $\pi$.

33

Reversibility

A Markov chain is reversible if

$\pi(i)\, P(i,j) = P(j,i)\, \pi(j)$, for all i, j (detailed balance).

Under reversibility, the human sequence can be considered the ancestor of the orang sequence and vice versa. (Proof next slide.)

Both the Jukes-Cantor and Kimura models are

reversible.

34

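Proof

\begin{align*}
\mathrm{pr}(\text{G in orang, C in human})
&= \sum_i \mathrm{pr}(\text{ancestor base is } i)\, \mathrm{pr}(\text{G in orang} \mid i \text{ in ancestor})\, \mathrm{pr}(\text{C in human} \mid i \text{ in ancestor}) \\
&= \sum_i \pi(i)\, P(t,i,G)\, P(t,i,C) \\
&= \sum_i \pi(G)\, P(t,G,i)\, P(t,i,C) && \text{(reversibility)} \\
&= \pi(G)\, P(2t,G,C) && \text{(addition rule)} \\
&= F(2t,G,C).
\end{align*}

The matrix F(2t) is the joint distribution at times 0 and 2t. It is symmetric, and both its row and column sums give the stationary distribution $\pi$.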
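The identity in the proof can be verified numerically for the Jukes-Cantor model. The sketch below uses the closed-form P(t) with the uniform stationary distribution; the values of `alpha` and `t` are arbitrary, and the helper names are made up for this example.

```python
import math


def jc_transition(alpha, t):
    """Closed-form Jukes-Cantor transition matrix P(t)."""
    decay = math.exp(-4 * alpha * t)
    r = (1 + 3 * decay) / 4
    s = (1 - decay) / 4
    return [[r if i == j else s for j in range(4)] for i in range(4)]


PI = [0.25] * 4          # uniform stationary distribution
alpha, t = 1.0, 0.05     # arbitrary illustrative values

P_t = jc_transition(alpha, t)
P_2t = jc_transition(alpha, 2 * t)


def joint_via_ancestor(x, y):
    """Sum over the unknown ancestral base i of pi(i) P(t,i,x) P(t,i,y)."""
    return sum(PI[i] * P_t[i][x] * P_t[i][y] for i in range(4))


# F(2t, x, y) = pi(x) P(2t, x, y): treat one descendant as the "ancestor".
F = [[PI[x] * P_2t[x][y] for y in range(4)] for x in range(4)]

max_gap = max(abs(joint_via_ancestor(x, y) - F[x][y])
              for x in range(4) for y in range(4))
print(max_gap)
```

The largest discrepancy should be at the level of floating-point rounding, and each row of F sums to 1/4, the stationary probability of that base.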

35

Below diagonal: BLOSUM62 substitution matrix. Above diagonal: difference matrix obtained by subtracting the PAM 160 matrix entrywise. From Henikoff & Henikoff 1992.

C S T P A G N D E Q H R K M I L V F Y W

0 -1 1 0 2 1 1 2 1 2 0 0 2 4 1 5 1 2 -2 5 C

2 0 -2 0 -1 0 0 0 1 0 0 0 1 0 1 -1 1 1 -1 S

C 9 2 -1 -1 -1 0 0 0 0 0 0 -1 0 -1 1 0 1 1 3 T

S -1 4 2 -2 -1 -1 0 0 -1 -1 -1 1 1 0 -1 0 0 2 1 P

T -1 1 5 2 -1 -2 -2 -1 0 0 1 1 0 0 1 0 1 1 2 A

P -3 -1 -1 7 2 0 -1 -2 0 1 1 0 0 -1 0 -1 1 2 4 G

A 0 1 0 -1 4 3 -1 -1 0 0 1 -1 0 -1 0 -1 0 0 0 N

G -3 0 -2 -2 0 6 2 -1 -1 -1 0 -1 0 0 0 0 2 1 3 D

N -3 1 0 -2 -2 0 6 1 0 0 2 2 1 -1 0 0 2 2 4 E

D -3 0 -1 -1 -2 -1 1 6 0 -2 0 1 1 -1 0 0 1 3 3 Q

E -4 0 -1 -1 -1 -2 0 2 5 2 -1 0 1 0 -1 0 1 2 2 H

Q -3 0 -1 -1 -1 -2 0 0 2 5 -1 -1 0 -1 1 0 1 3 -4 R

H -3 -1 -2 -2 -2 -2 1 -1 0 0 8 1 -2 -1 1 1 2 3 1 K

R -3 -1 -1 -2 -1 -2 0 -2 0 1 0 5 -2 -1 -1 0 1 2 4 M

K -3 0 -1 -1 -1 -2 0 -1 1 1 -1 2 5 -1 1 0 0 1 3 I

M -1 -1 -1 -2 -1 -3 -2 -3 -2 0 -2 -1 -1 5 -1 0 -1 1 2 L

I -1 -2 -1 -3 -1 -4 -3 -3 -3 -3 -3 -3 -3 1 4 0 1 2 4 V

L -1 -2 -1 -3 -1 -4 -3 -4 -3 -2 -3 -2 -2 2 2 4 -1 -2 1 F

V -1 -2 0 -2 0 -3 -3 -3 -2 -2 -3 -3 -2 1 3 1 4 -1 2 Y

F -2 -2 -2 -4 -2 -3 -3 -3 -3 -3 -1 -3 -3 0 0 0 -1 6 -1 W

Y -2 -2 -2 -3 -2 -3 -2 -3 -2 -1 2 -2 -2 -1 -1 -1 -1 3 7

W -2 -3 -2 -4 -3 -2 -4 -4 -3 -2 -2 -3 -3 -1 -3 -2 -3 1 2 11

C S T P A G N D E Q H R K M I L V F Y W

36

PAM matrices

Trees are constructed for closely related (% identity > 85, i.e. less than 15% different) protein sequences from 71 families.

Internal nodes (intermediate and youngest common ancestors) are imputed by parsimony.

Number of replacements on each branch are totalled across trees; this gives a 20 by 20 matrix of counts. The count matrix is symmetrized, call it A.

37

Transition probability from i to j is estimated by

$P(i,j) = A(i,j) / \sum_k A(i,k)$.

Now P might or might not be PAM1.

To get the PAM1 matrix, P has to be calibrated.

Let Q = P - I. Q is very close to the underlying rate matrix, up to a scale factor. Set $P(\lambda) = I + \lambda Q$. Need to find $\lambda$ so that $P(\lambda)$ is PAM1.

Recall: 1 PAM = 1% expected number of substitutions, or approximately 1% = pr(ancestor is different from descendant).

38

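\begin{align*}
0.01 &= \mathrm{pr}(\text{ancestor is different from descendant}) \\
&= \sum_i \pi(i) \sum_{j \neq i} P(\lambda,i,j) \\
&= \sum_i \pi(i)\,(1 - P(\lambda,i,i)) \\
&= -\lambda \sum_i \pi(i)\, Q(i,i).
\end{align*}

So let $\lambda = -0.01 / \{\sum_i \pi(i)\, Q(i,i)\}$.

Then $P(\lambda)$ is PAM1.

PAM250 = PAM1$^{250}$, etc.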
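The calibration can be sketched on a toy example. Everything numeric below is hypothetical: a 3-letter alphabet with invented symmetrized counts A whose diagonal holds the unchanged positions, and $\pi$ taken as the normalized row sums of A (which is stationary for P when A is symmetric). The scale factor is named `lam` here.

```python
# Hypothetical symmetrized counts for a toy 3-letter alphabet; diagonal
# entries count positions that did not change.
A = [
    [900.0, 60.0, 40.0],
    [60.0, 500.0, 40.0],
    [40.0, 40.0, 320.0],
]
n = len(A)
row_sums = [sum(row) for row in A]
total = sum(row_sums)

# Estimated transition probabilities and an assumed stationary distribution.
P = [[A[i][j] / row_sums[i] for j in range(n)] for i in range(n)]
pi = [rs / total for rs in row_sums]

# Q = P - I, then choose lam so that 1% of positions are expected to change.
Q = [[P[i][j] - (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
lam = -0.01 / sum(pi[i] * Q[i][i] for i in range(n))
PAM1 = [[(1.0 if i == j else 0.0) + lam * Q[i][j] for j in range(n)]
        for i in range(n)]


def matmul(a, b):
    """Multiply two square matrices given as lists of rows."""
    size = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]


def matpow(m, power):
    """Raise a square matrix to a non-negative integer power by repeated squaring."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    base = [row[:] for row in m]
    while power > 0:
        if power % 2 == 1:
            result = matmul(result, base)
        base = matmul(base, base)
        power //= 2
    return result


PAM250 = matpow(PAM1, 250)
expected_change = sum(pi[i] * (1 - PAM1[i][i]) for i in range(n))
print(lam, expected_change)
```

By construction `expected_change` comes out as 0.01, and every row of PAM250 still sums to 1.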

39

A R N D C Q E G H I L K M F P S T W Y V

Ala Arg Asn Asp Cys Gln Glu Gly His Ile Leu Lys Met Phe Pro Ser Thr Trp Tyr Val

A Ala 9867 2 9 10 3 8 17 21 2 6 4 2 6 2 22 35 32 0 2 18

R Arg 1 9913 1 0 1 10 0 0 10 3 1 19 4 1 4 6 1 8 0 1

N Asn 4 1 9822 36 0 4 6 6 21 3 1 13 0 1 2 20 9 1 4 1

D Asp 6 0 42 9859 0 6 53 6 4 1 0 6 0 0 1 5 3 0 0 1

C Cys 1 1 0 0 9973 0 0 0 1 1 0 0 0 0 1 5 1 0 3 2

Q Gln 3 9 4 5 0 9876 27 1 23 1 3 6 4 0 6 2 2 0 0 1

E Glu 10 0 7 56 0 35 9865 4 2 3 1 4 1 0 3 4 2 1 1 2

G Gly 21 1 12 11 1 3 7 9935 1 0 1 2 1 1 3 21 3 0 0 5

H His 1 8 18 3 1 20 1 0 9912 0 1 1 0 2 3 1 1 1 4 1

I Ile 2 2 3 1 2 1 2 0 0 9872 9 2 12 7 0 1 7 0 1 33

L Leu 3 1 3 0 0 6 1 1 4 22 9947 2 45 13 3 1 3 4 2 15

K Lys 2 37 2 6 0 12 7 2 2 4 1 9926 20 0 3 8 11 0 1 1

M Met 1 1 0 0 0 2 0 0 0 5 8 4 9874 1 0 1 2 0 0 4

F Phe 1 1 1 0 0 0 0 1 2 8 6 0 4 9946 0 2 1 3 28 0

P Pro 13 5 2 1 1 8 3 2 4 1 2 2 1 1 9926 12 4 0 0 2

S Ser 28 11 34 7 11 4 6 16 2 2 1 7 4 3 17 9840 68 5 2 2

T Thr 22 2 13 4 1 3 2 2 1 11 2 8 6 1 5 32 9871 0 2 9

W Trp 0 2 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 9976 1 0

Y Tyr 1 0 3 0 3 0 1 0 4 1 1 0 0 21 0 1 1 2 9945 1

V Val 13 2 1 1 3 2 2 3 3 57 11 1 17 1 3 2 10 0 2 9901

Transition probability matrix: PAM(1) × 10^4 (after M. Dayhoff, 1978)

40

A R N D C Q E G H I L K M F P S T W Y V

Ala Arg Asn Asp Cys Gln Glu Gly His Ile Leu Lys Met Phe Pro Ser Thr Trp Tyr Val

A Ala 13 6 9 9 4 8 9 12 6 8 6 7 7 4 11 11 11 2 4 9

R Arg 3 17 4 3 2 5 3 2 6 3 2 9 4 1 4 4 3 7 2 2

N Asn 4 4 6 7 2 5 6 4 6 3 2 5 3 2 4 5 4 2 3 3

D Asp 5 4 8 11 1 7 10 5 6 3 2 5 3 1 4 5 5 1 2 3

C Cys 2 1 1 1 52 1 1 2 2 2 1 1 1 1 2 3 2 1 4 2

Q Gln 3 5 5 6 1 10 7 3 7 2 3 5 3 1 4 3 3 1 2 3

E Glu 5 4 7 11 1 9 12 5 6 3 2 5 3 1 4 5 5 1 2 3

G Gly 12 5 10 10 4 7 9 27 5 5 4 6 5 3 8 11 9 2 3 7

H His 2 5 5 4 2 7 4 2 15 2 2 3 2 2 3 3 2 2 3 2

I Ile 3 2 2 2 2 2 2 2 2 10 6 2 6 5 2 3 4 1 3 9

L Leu 6 4 4 3 2 6 4 3 5 15 34 4 20 13 5 4 6 6 7 13

K Lys 6 18 10 8 2 10 8 5 8 5 4 24 9 2 6 8 8 4 3 5

M Met 1 1 1 1 0 1 1 1 1 2 3 2 6 2 1 1 1 1 1 2

F Phe 2 1 2 1 1 1 1 1 3 5 6 1 4 32 1 2 2 4 20 3

P Pro 7 5 5 4 3 5 4 5 5 3 3 4 3 2 20 6 5 1 2 4

S Ser 9 6 8 7 7 6 7 9 6 5 4 7 5 3 9 10 9 4 4 6

T Thr 8 5 6 6 4 5 5 6 4 6 4 6 5 3 6 8 11 2 3 6

W Trp 0 2 0 0 0 0 0 0 1 0 1 0 0 1 0 1 0 55 1 0

Y Tyr 1 1 2 1 3 1 1 1 3 2 2 1 2 15 1 2 2 3 31 2

V Val 7 4 4 4 4 4 4 5 4 15 10 4 10 5 5 5 7 2 4 17

Transition probability matrix: PAM(250) × 10^2 (after M. Dayhoff, 1978)

41

BLOck SUbstitution Matrices (BLOSUM)

Built from gapless multiple alignments (blocks) of homologous protein sequences at various distances.

Basic idea: for a block, sum up matrices of counts for all pairwise comparisons, then sum across blocks and symmetrize. This symmetric count matrix is normalized so that all entries sum to 1. Call this normalized matrix F.

The row or column sums of F give $\pi$. The score for (i,j) is $\log\{F(i,j) / (\pi(i)\,\pi(j))\}$.
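The score formula can be illustrated on a toy symmetric count matrix. The counts below are invented for a 3-letter alphabet, and natural logs are used for simplicity; the published BLOSUM matrices are, as far as I know, scaled and rounded to integers, which this sketch does not attempt.

```python
import math

# Hypothetical symmetric pair counts for a toy 3-letter alphabet.
counts = [
    [40.0, 5.0, 5.0],
    [5.0, 20.0, 5.0],
    [5.0, 5.0, 10.0],
]
n = len(counts)
total = sum(sum(row) for row in counts)

# Normalize so that all entries sum to 1, then take row sums as pi.
F = [[counts[i][j] / total for j in range(n)] for i in range(n)]
pi = [sum(row) for row in F]

# Log-odds score: observed pair frequency against the product of marginals.
score = [[math.log(F[i][j] / (pi[i] * pi[j])) for j in range(n)]
         for i in range(n)]

for row in score:
    print([round(x, 2) for x in row])
```

In this toy example identical pairs are more frequent than independence would predict, so the diagonal scores are positive and the off-diagonal scores are negative.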

42

Matrices at different distances

To produce matrices at different distances, thresholds, clustering and weights are used.

For example, BLOSUM80 uses the threshold 80%. In each block, any group of sequences that are more similar than 80% are clustered. The clusters are down-weighted by their size in summing the counts.

43

For example, suppose there are 3 clusters of size 1, 5 and 10 in a block of 16 sequences.

Then each count matrix of comparing a sequence from cluster 1 to a sequence from cluster 2 is divided by 1 x 5.

Similarly, for comparing cluster 2 and cluster 3, the factor is 5 x 10.

At a lower threshold, say at 45%, the clusters are bigger, and the factors are also larger, so similar sequences are downweighted more at lower thresholds, and BLOSUM45 is better for divergent sequences than BLOSUM80.
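The down-weighting can be made concrete for a single alignment column. The sketch below is one reading of the scheme described above, with an invented column of 16 residues split into clusters of size 1, 5 and 10: each comparison between sequences from two different clusters is divided by the product of the cluster sizes, and sequences within the same cluster are not compared with each other (an assumption of this example).

```python
from collections import Counter
from itertools import combinations


def weighted_pair_counts(column, cluster_ids):
    """Weighted residue-pair counts for one alignment column.

    column[k] is the residue of sequence k; cluster_ids[k] is its cluster.
    """
    sizes = Counter(cluster_ids)
    counts = Counter()
    for (res_a, cl_a), (res_b, cl_b) in combinations(zip(column, cluster_ids), 2):
        if cl_a == cl_b:
            continue  # assumption: no comparisons within a cluster
        pair = tuple(sorted((res_a, res_b)))
        # Divide each comparison by the product of the two cluster sizes.
        counts[pair] += 1.0 / (sizes[cl_a] * sizes[cl_b])
    return counts


# Hypothetical column: cluster 0 has 1 sequence, cluster 1 has 5, cluster 2 has 10.
column = "A" + "AAAAS" + "SSSSSSSSST"
cluster_ids = [0] + [1] * 5 + [2] * 10

counts = weighted_pair_counts(column, cluster_ids)
for pair, weight in sorted(counts.items()):
    print(pair, round(weight, 2))
```

Each pair of clusters contributes a total weight of 1 however many sequences it contains, so the three cluster pairs here add up to a total weight of 3.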

44

How did variations arise?

Mutation: (a) Inherent: DNA replication errors are not always corrected. (b) External: exposure to chemicals and radiation.

Selection: Deleterious mutations are removed quickly. Neutral and, rarely, advantageous mutations are tolerated and persist.

Fixation: It takes time for a new variant to be established (having a stable frequency) in a population.

45

Modeling DNA base substitution

Strictly speaking, only applicable to regions undergoing little selection.

Assumptions

1. Site independence.
2. Site homogeneity.
3. Markovian: given current base, future substitutions independent of past.
4. Temporal homogeneity: stationary Markov chain.

46

A pair of homologous bases

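Typically, ancestor is unknown.

[Diagram: an ancestor with two branches, each T years long, leading to the observed bases A and C; the branch to A evolves under rate matrix Qh and the branch to C under Qm.]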

47

More assumptions

5. $Q_h = s_h Q$ and $Q_m = s_m Q$, for some positive $s_h$, $s_m$, and a rate matrix Q.
6. The ancestor is sampled from the stationary distribution $\pi$ of Q.
7. Q is reversible: $\pi(a)\, P(t,a,b) = P(t,b,a)\, \pi(b)$, for all a, b and $t \geq 0$ (detailed balance).

48

New picture

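[Diagram: ancestor ~ $\pi$, with two branches evolving under the same rate matrix Q; the branch to A has length $s_h T$ PAMs and the branch to C has length $s_m T$ PAMs.]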

49

Joint probability of A and C

Under the model, the joint probability is

\begin{align*}
&\sum_a \pi(a)\, P(s_h T,a,A)\, P(s_m T,a,C) \\
&\quad= \sum_a \pi(A)\, P(s_h T,A,a)\, P(s_m T,a,C) \\
&\quad= \pi(A)\, P(s_h T + s_m T,A,C) = F(t,A,C).
\end{align*}

The matrix F(t) is symmetric. It is equally valid to view A as the ancestor of C or vice versa.

$t = s_h T + s_m T$ is the "distance" between A and C.

Note: $s_h$, $s_m$ and T are not identifiable.

50

Estimating distance of two sequences

Suppose two protein sequences $a_1 \ldots a_n$ and $b_1 \ldots b_n$ are separated by t PAMs.

Under a reversible substitution model that is IID across sites, the likelihood of t is

\begin{align*}
L(t) &= \Pr(a_1 \ldots a_n,\ b_1 \ldots b_n \mid \text{model}) \\
&= \prod_k F(t,a_k,b_k) \\
&= \prod_{a,b} F(t,a,b)^{c(a,b)},
\end{align*}

where $c(a,b) = \#\{k : a_k = a,\ b_k = b\}$.

Maximizing this quantity gives the maximum likelihood estimate of t. This generalizes the distance correction with Jukes-Cantor.
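For the Jukes-Cantor model the maximum likelihood estimate can be found numerically and compared with the closed-form correction. The sketch below uses nucleotides rather than proteins, measures distance in expected substitutions per site rather than PAMs, and uses a golden-section search as one simple way to maximize a function with a single peak; all names are made up for this example.

```python
import math


def jc_log_likelihood(k, n_same, n_diff):
    """Log-likelihood of distance k (expected substitutions per site) under JC."""
    decay = math.exp(-4.0 * k / 3.0)
    f_same = 0.25 * (1 + 3 * decay) / 4  # F(k, a, a) = pi(a) P(k, a, a)
    f_diff = 0.25 * (1 - decay) / 4      # F(k, a, b) for a != b
    return n_same * math.log(f_same) + n_diff * math.log(f_diff)


def maximize(f, lo, hi, tol=1e-10):
    """Golden-section search for the maximizer of a single-peaked function."""
    g = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    while b - a > tol:
        c = b - g * (b - a)
        d = a + g * (b - a)
        if f(c) > f(d):
            b = d  # the maximum lies in [a, d]
        else:
            a = c  # the maximum lies in [c, b]
    return (a + b) / 2


# hum-1 vs gorilla Alu: 14 differences out of 300 aligned sites.
n_sites, n_diff = 300, 14
k_mle = maximize(lambda k: jc_log_likelihood(k, n_sites - n_diff, n_diff), 1e-9, 5.0)
k_closed_form = -0.75 * math.log(1 - (4.0 / 3.0) * n_diff / n_sites)
print(k_mle, k_closed_form)
```

The numerical maximizer should agree with the closed-form corrected distance (about 0.048) to several decimal places, which illustrates the sense in which maximum likelihood generalizes the Jukes-Cantor correction.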

51

M pairs

The kth pair of sequences, separated by $t_k$ PAMs, gives the count matrix $c^{(k)}$.

$\Pr(\text{all pairs}) = \prod_k \prod_{a,b} F(t_k,a,b)^{c(k,a,b)}$.

If the distances are about constant, then the percent identity should be about constant across pairs. Often, this is not so.

Parameters: Q, $t_1, t_2, \ldots, t_M$.

Maximum likelihood estimation by numerical methods.

52

One simple method

Start with an estimate of Q, Q0. Fixing Q0, maximize the pair-specific likelihoods as functions of the distances to get first estimates of the distances.

Keeping the distances fixed, maximize the overall likelihood as a function of Q.

Go back and forth until convergence.

53

The likelihood has exactly one maximum in t, so estimating distance is easy.

To find the MLE of Q, randomly perturb current estimate and update if the new candidate has a higher likelihood.
