Random Walks and BLAST Marek Kimmel (Statistics, Rice) kimmel@rice.edu 713 348 5255


Random Walks and BLAST

Marek Kimmel (Statistics, Rice) kimmel@rice.edu

713 348 5255


Outline

• Explaining the connection

• Simple RW with absorption

• Moment-generating function method

• Size and duration of excursions

• Renewal equation and general RW

• Significance of alignments in BLAST


Intuitive introduction

• Alignment as a random walk

g g a g a c t g t a g a c
g a a c g c c c t a g c c

• Scores: match = 1, mismatch = -1

• Solid symbols = ladder points, squares = excursions
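The alignment-as-random-walk picture can be sketched in a few lines of Python. This is a minimal illustration, not code from the slides: the function names `alignment_walk` and `excursion_heights` are hypothetical, and the sketch assumes that a ladder point is a position where the walk reaches a new strict minimum, with the starting point counted as the first ladder point.

```python
def alignment_walk(seq1, seq2, match=1, mismatch=-1):
    """Cumulative score after each aligned position (the walk starts at 0)."""
    path = [0]
    for x, y in zip(seq1, seq2):
        path.append(path[-1] + (match if x == y else mismatch))
    return path


def excursion_heights(path):
    """Maximum rise above the most recent ladder point, one value per excursion.

    Assumption: a ladder point is a new strict minimum of the walk.
    """
    heights = []
    ladder_level = path[0]
    current_max = path[0]
    for value in path[1:]:
        if value < ladder_level:
            # New ladder point: close the current excursion and start a new one.
            heights.append(current_max - ladder_level)
            ladder_level = value
            current_max = value
        else:
            current_max = max(current_max, value)
    # The last excursion may be unfinished when the sequences end.
    heights.append(current_max - ladder_level)
    return heights


# The two sequences shown on the slide, written without spaces.
seq1 = "ggagactgtagac"
seq2 = "gaacgccctagcc"
path = alignment_walk(seq1, seq2)
heights = excursion_heights(path)
```

For these two sequences the walk dips to -1 and then to -2, so there are two ladder points after the start and three excursions.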


Relation to BLAST

• Quality of alignment reflected by the course of the RW.

• The distribution of maximum excursion heights achievable by chance provides the null hypothesis.


Simple RW with absorbing boundaries

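We consider the case p ≠ q only.

$$\{S_i\}_{i \ge 1}\ \text{iid}, \qquad \Pr[S_i = 1] = p, \qquad \Pr[S_i = -1] = q = 1 - p$$

$$T_n = h + \sum_{i=1}^{n} S_i, \qquad T_k \in (a, b)\ \text{for all } k < n$$

$$T_n = a \Rightarrow T_{n+1} = a, \qquad T_n = b \Rightarrow T_{n+1} = b$$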


Absorption probabilities

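$$w_h = \Pr[T_n = b,\ n \text{ large enough}], \qquad u_h = \Pr[T_n = a,\ n \text{ large enough}], \qquad u_h + w_h = 1$$

• Consider the backward equation

$$w_h = p\,w_{h+1} + q\,w_{h-1}, \quad a < h < b, \qquad w_a = 0, \quad w_b = 1$$

• 2nd-order, homogeneous, linear difference equation. Try

$$w_h = C \exp(\theta h), \quad \text{where} \quad \exp(\theta h) = p \exp[\theta(h+1)] + q \exp[\theta(h-1)]$$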


Absorption probabilities

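• This provides

$$\theta_1 = 0, \qquad \theta_2 = \theta^* = \ln(q/p)$$

• Constants derived from boundary conditions

$$w_h = \frac{\exp(\theta^* h) - \exp(\theta^* a)}{\exp(\theta^* b) - \exp(\theta^* a)}, \qquad u_h = 1 - w_h$$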
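As a quick numerical check of the absorption probabilities, the sketch below evaluates the closed form for the probability of hitting b before a, using the nonzero root ln(q/p) of the characteristic equation, and measures how well it satisfies the backward equation. The function name `absorb_upper_prob` and the values a = 0, b = 10, p = 0.4 are illustrative assumptions.

```python
import math


def absorb_upper_prob(h, a, b, p):
    """Probability that a +1/-1 walk started at h hits b before a (requires p != 1/2)."""
    q = 1.0 - p
    theta_star = math.log(q / p)  # nonzero root of p*exp(theta) + q*exp(-theta) = 1
    num = math.exp(theta_star * h) - math.exp(theta_star * a)
    den = math.exp(theta_star * b) - math.exp(theta_star * a)
    return num / den


# Illustrative (assumed) boundaries and step probability.
a, b, p = 0, 10, 0.4
q = 1.0 - p
w = {h: absorb_upper_prob(h, a, b, p) for h in range(a, b + 1)}

# Largest violation of w_h = p*w_{h+1} + q*w_{h-1} over the interior states.
max_residual = max(
    abs(w[h] - p * w[h + 1] - q * w[h - 1]) for h in range(a + 1, b)
)
```

The residual should be at the level of floating-point rounding, and the boundary values are 0 at a and 1 at b.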


Mean number of steps to absorption

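• 2nd-order, inhomogeneous difference equation

$$m_h = 1 + p\,m_{h+1} + q\,m_{h-1} \qquad (*)$$

• Solution = any particular solution of (*) + general solution of the corresponding homogeneous equation

• Verify that

$$m_h = \frac{h}{q - p}$$

is a particular solution, and therefore

$$m_h = \frac{h}{q - p} + C_1 + C_2 \exp(\theta^* h), \qquad m_a = m_b = 0$$

$$m_h = \frac{w_h (b - h) + u_h (a - h)}{p - q}$$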
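The mean absorption time can be checked the same way. The sketch below evaluates the closed form (w_h(b - h) + u_h(a - h)) / (p - q), where w_h is the probability of absorption at b and u_h = 1 - w_h, and verifies the inhomogeneous recurrence numerically. The name `mean_steps` and the values a = 0, b = 10, p = 0.4 are assumptions chosen for illustration.

```python
import math


def mean_steps(h, a, b, p):
    """Mean number of steps to absorption at a or b, starting from h (p != 1/2)."""
    q = 1.0 - p
    theta_star = math.log(q / p)
    w = (math.exp(theta_star * h) - math.exp(theta_star * a)) / (
        math.exp(theta_star * b) - math.exp(theta_star * a)
    )
    u = 1.0 - w
    return (w * (b - h) + u * (a - h)) / (p - q)


# Illustrative (assumed) boundaries and step probability.
a, b, p = 0, 10, 0.4
q = 1.0 - p
m = {h: mean_steps(h, a, b, p) for h in range(a, b + 1)}

# Largest violation of m_h = 1 + p*m_{h+1} + q*m_{h-1} over the interior states.
max_residual = max(
    abs(m[h] - 1.0 - p * m[h + 1] - q * m[h - 1]) for h in range(a + 1, b)
)
```

Starting from the midpoint h = 5, the walk needs a little over 19 steps on average before it is absorbed.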


Moment-generating function approach

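$$m_S(\theta) = E[\exp(\theta S)]$$

If $S \in \{-c, -c+1, \dots, 0, \dots, d-1, d\}$, then

$$m_S(\theta) = \sum_{i=-c}^{d} \Pr[S = i] \exp(\theta i)$$

Theorem 1.1. If $\Pr[S = -c] > 0$ and $\Pr[S = d] > 0$ (and $E[S] \ne 0$), then there exists a unique $\theta^* \ne 0$ such that

$$m_S(\theta^*) = 1$$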


Moment-generating function approach

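• Simple RW:

$$m_S(\theta) = q \exp(-\theta) + p \exp(\theta), \qquad \theta^* = \ln(q/p), \qquad m_S(\theta^*) = 1$$

• Until absorption

$$T_N - h = \sum_{i=1}^{N} S_i, \qquad m_{T_N - h}(\theta) = [q \exp(-\theta) + p \exp(\theta)]^N$$

$$m_{T_N - h}(\theta^*) = [q \exp(-\theta^*) + p \exp(\theta^*)]^N = 1^N = 1$$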
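For step distributions beyond the simple +1/-1 walk, the nonzero root of the moment-generating function usually has to be found numerically. The following is a minimal bisection sketch under stated assumptions: the step distribution is passed as a dictionary from integer step to probability, the mean step is negative (so the root is positive), and the names `mgf` and `theta_star` are hypothetical.

```python
import math


def mgf(theta, dist):
    """Moment-generating function of a step; dist maps step value -> probability."""
    return sum(prob * math.exp(theta * step) for step, prob in dist.items())


def theta_star(dist, tol=1e-12):
    """Positive root of mgf(theta) = 1, assuming E[S] < 0 and Pr[S > 0] > 0."""
    mean = sum(step * prob for step, prob in dist.items())
    if mean >= 0 or max(dist) <= 0:
        raise ValueError("sketch assumes a negative mean and a possible positive step")
    hi = 1.0
    while mgf(hi, dist) <= 1.0:  # move right until the MGF exceeds 1
        hi *= 2.0
    lo = hi
    while mgf(lo, dist) >= 1.0:  # move left until the MGF drops below 1
        lo /= 2.0
    while hi - lo > tol:  # plain bisection on the sign of mgf - 1
        mid = 0.5 * (lo + hi)
        if mgf(mid, dist) < 1.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


simple_dist = {1: 0.4, -1: 0.6}             # simple RW, root should be ln(q/p)
general_dist = {2: 0.1, 1: 0.2, -1: 0.7}    # illustrative three-valued step
simple_root = theta_star(simple_dist)
general_root = theta_star(general_dist)
```

Bisection works here because the MGF is convex, equals 1 at zero, and has negative slope there when the mean step is negative, so it crosses 1 exactly once on the positive axis.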


Moment-generating function approach

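• Sticky argument now: at the time of absorption,

$$T_N - h = \begin{cases} b - h & \text{w.p. } w_h \\ a - h & \text{w.p. } 1 - w_h \end{cases}$$

$$m_{T_N - h}(\theta^*) = w_h \exp[\theta^*(b - h)] + (1 - w_h) \exp[\theta^*(a - h)]$$

• But the latter is equal to 1, so

$$w_h \exp[\theta^*(b - h)] + (1 - w_h) \exp[\theta^*(a - h)] = 1 \quad \Rightarrow \quad w_h = \frac{\exp(\theta^* h) - \exp(\theta^* a)}{\exp(\theta^* b) - \exp(\theta^* a)}$$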


Stopping time (at absorption)

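Theorem 7.1 (Wald's).

$$E\{m_S(\theta)^{-N} \exp[\theta (T_N - h)]\} = 1, \quad \text{all } \theta$$

where $N$ is the stopping time and $T_N - h = \sum_{i=1}^{N} S_i$ is the random displacement.

Differentiating with respect to $\theta$ at $\theta = 0$:

$$\frac{d}{d\theta} E\{m_S(\theta)^{-N} \exp[\theta (T_N - h)]\}\Big|_{\theta = 0} = 0$$

$$E(N)\,E(S) - E(T_N - h) = 0$$

i.e.

$$m_h\,(p - q) = u_h (a - h) + w_h (b - h)$$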
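Wald's identity E(N) E(S) = E(T_N - h) can also be checked by simulation. The sketch below is a seeded Monte Carlo experiment, not part of the slides: the function name `simulate`, the run count, and the parameters h = 5, a = 0, b = 10, p = 0.4 are all assumptions made for illustration.

```python
import random


def simulate(h, a, b, p, runs, seed=0):
    """Return (mean number of steps, fraction of runs absorbed at b)."""
    rng = random.Random(seed)
    total_steps = 0
    hits_b = 0
    for _ in range(runs):
        pos, steps = h, 0
        while a < pos < b:  # walk until one of the absorbing boundaries is hit
            pos += 1 if rng.random() < p else -1
            steps += 1
        total_steps += steps
        hits_b += pos == b
    return total_steps / runs, hits_b / runs


# Illustrative (assumed) parameters.
h, a, b, p = 5, 0, 10, 0.4
q = 1.0 - p
mean_n, w_hat = simulate(h, a, b, p, runs=20000)

# Left and right sides of E(N) * E(S) = E(T_N - h), estimated from the same runs.
lhs = mean_n * (p - q)
rhs = w_hat * (b - h) + (1.0 - w_hat) * (a - h)
```

With these parameters the two sides should agree to within Monte Carlo error, and `mean_n` should land close to the closed-form mean absorption time of about 19.2 steps.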


Asymptotics (p < q)

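• Hypotheses

$$h = 0, \qquad a = -1, \qquad b = y$$

$$w_0 = \Pr[\text{absorption at } y \mid \text{start at } 0] = \frac{1 - \exp(-\theta^*)}{\exp(\theta^* y) - \exp(-\theta^*)}$$

• So, define Y = excursion height

$$\Pr[Y \ge y] = \Pr[\text{hitting } y \text{ at least once}] = w_0 \sim [1 - \exp(-\theta^*)] \exp(-\theta^* y) = C \exp(-\theta^* y), \quad \text{as } y \to \infty$$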
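The quality of the geometric-like approximation is easy to inspect numerically. The sketch below compares the exact probability of reaching y before -1 (start at 0, p < q) with the asymptotic form C exp(-theta* y), where theta* = ln(q/p) and C = 1 - exp(-theta*). The function names and the value p = 0.25 are illustrative assumptions.

```python
import math


def tail_exact(y, p):
    """Exact Pr[walk from 0 reaches y before -1] for a +1/-1 walk with p < 1/2."""
    theta_star = math.log((1.0 - p) / p)
    return (1.0 - math.exp(-theta_star)) / (
        math.exp(theta_star * y) - math.exp(-theta_star)
    )


def tail_asymptotic(y, p):
    """Geometric-like approximation C * exp(-theta_star * y)."""
    theta_star = math.log((1.0 - p) / p)
    c = 1.0 - math.exp(-theta_star)
    return c * math.exp(-theta_star * y)


p = 0.25  # illustrative (assumed) step probability, so theta_star = ln 3
ratios = {y: tail_exact(y, p) / tail_asymptotic(y, p) for y in (1, 5, 10)}
```

The ratio equals 1 / (1 - exp(-theta*(y + 1))), so it is 1.125 at y = 1 for this choice of p and approaches 1 geometrically fast.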


Asymptotics of the mean time to absorption

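• A = Mean{# steps before absorption at -1}

• Since

$$\lim_{b \to \infty} u_0(b) = 1, \qquad b\,w_0(b) \to 0$$

we have

$$A = m_0 = \frac{b\,w_0 - u_0}{p - q} \to \frac{1}{q - p}, \quad \text{as } b \to \infty$$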
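The limit for A can be illustrated by evaluating the mean absorption time from 0 with lower boundary -1 and an increasingly distant upper boundary b. This is a small sketch under assumed values (p = 0.25); the function name `mean_time_from_zero` is hypothetical.

```python
import math


def mean_time_from_zero(b, p):
    """Mean steps to absorption at -1 or b, starting from 0 (p < 1/2)."""
    q = 1.0 - p
    theta_star = math.log(q / p)
    w0 = (1.0 - math.exp(-theta_star)) / (
        math.exp(theta_star * b) - math.exp(-theta_star)
    )
    u0 = 1.0 - w0
    return (b * w0 - u0) / (p - q)


p = 0.25  # illustrative (assumed) value, so 1 / (q - p) = 2
times = {b: mean_time_from_zero(b, p) for b in (1, 5, 40)}
limit = 1.0 / ((1.0 - p) - p)
```

With b = 1 the walk is absorbed after exactly one step, and as b grows the mean time approaches the limit 1/(q - p).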


Random walks versus alignments


Anatomy of an excursion

• $\Pr[Y_i \ge y] \sim C \exp(-\theta^* y)$

• A = E[inter-ladder pts. distance]

• A and C difficult to compute


P-values for a BLAST comparison

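• Assume comparison of two sequences of length N, with expected ladder-point distance A. This gives n = N/A excursions on average. Also, let us denote

$$Y_{\max} = \max_{1 \le i \le n} Y_i$$

• From expression (2.134) we have (since Y is geometric-like)

$$1 - \exp[-nC \exp(-\theta^* y)] \le \Pr[Y_{\max} \ge y] \le 1 - \exp\{-nC \exp[-\theta^* (y - 1)]\}$$

• Making substitutions

$$K = \frac{C}{A} \exp(-\theta^*), \qquad y = x + \frac{\ln N}{\theta^*}$$

in the bounds for $\Pr[Y_{\max} > y] = \Pr[Y_{\max} \ge y + 1]$, we obtain

$$1 - \exp[-K \exp(-\theta^* x)] \le \Pr\left[Y_{\max} > \frac{\ln N}{\theta^*} + x\right] \le 1 - \exp\{-K \exp[-\theta^* (x - 1)]\}$$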


P-values for a BLAST comparison

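• From previous slide

$$1 - \exp[-K \exp(-\theta^* x)] \le \Pr\left[Y_{\max} > \frac{\ln N}{\theta^*} + x\right] \le 1 - \exp\{-K \exp[-\theta^* (x - 1)]\}$$

• Let us assume a normalized score

$$S' = \theta^* Y_{\max} - \ln(NK)$$

• Substituting into the previous inequality, we obtain

$$\Pr[S' \ge s] \approx 1 - \exp[-\exp(-s)]$$

• So the P-value corresponding to an empirically obtained maximum score $y_{\max}$ equals

$$P\text{-value} \approx 1 - \exp[-\exp(-s')], \quad \text{where } s' = \theta^* y_{\max} - \ln(NK)$$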
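The P-value computation reduces to two one-line functions: normalize the observed maximum score, then evaluate the extreme-value tail 1 - exp(-exp(-s')). The sketch below assumes hypothetical function names and made-up parameter values; in practice the decay parameter and K come from the scoring scheme and are not computed here.

```python
import math


def normalized_score(y_max, seq_len, k, theta_star):
    """s' = theta_star * y_max - ln(N * K)."""
    return theta_star * y_max - math.log(seq_len * k)


def p_value(s_prime):
    """1 - exp(-exp(-s')), written with expm1 to stay accurate for tiny P-values."""
    return -math.expm1(-math.exp(-s_prime))


# Illustrative (assumed) parameters, not taken from the slides.
theta_star = math.log(3.0)
k = 0.1
seq_len = 1000
p_values = {
    y: p_value(normalized_score(y, seq_len, k, theta_star)) for y in (5, 8, 12)
}
```

Using `math.expm1` avoids the cancellation that 1 - exp(-x) suffers when x is very small, which is the regime of interest for significant alignments.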


P-values for a BLAST comparison

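• The expected value of the normalized score is approximately equal to Euler's constant $\gamma \approx 0.5772$

$$E[S'] \approx \gamma$$

• This yields

$$E[Y_{\max}] \approx \frac{1}{\theta^*} [\ln(NK) + \gamma]$$

• Both

$$S' = \theta^* Y_{\max} - \ln(NK)$$

and

$$\text{bit score} = \frac{\theta^* Y_{\max} - \ln K}{\ln 2}$$

are invariant with respect to multiplication of the score by a constant (why?)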


P-values for a BLAST comparison

• Expected number of excursions of height at least equal to v

$$E = \frac{N}{A}\,C \exp(-\theta^* v) \approx NK \exp(-\theta^* v)$$

• For an empirically found value of the score,

$$E' = NK \exp(-\theta^* y_{\max})$$

• By comparison with a previous formula we see

$$P\text{-value} \approx 1 - \exp(-E')$$
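The link between the expected count and the P-value can be made concrete. The sketch below computes E' = N K exp(-theta* y_max), confirms that it equals exp(-s') for the normalized score s' = theta* y_max - ln(NK), and shows that P is close to E' when E' is small. Function names and parameter values are assumptions for illustration only.

```python
import math


def e_value(y_max, seq_len, k, theta_star):
    """E' = N * K * exp(-theta_star * y_max)."""
    return seq_len * k * math.exp(-theta_star * y_max)


def p_from_e(e_prime):
    """P-value = 1 - exp(-E'), via expm1 for accuracy at small E'."""
    return -math.expm1(-e_prime)


# Illustrative (assumed) parameters, not taken from the slides.
theta_star = math.log(3.0)
k = 0.1
seq_len = 1000
y_max = 12

e_prime = e_value(y_max, seq_len, k, theta_star)
s_prime = theta_star * y_max - math.log(seq_len * k)
p = p_from_e(e_prime)
```

For small E' the two numbers are nearly interchangeable, since 1 - exp(-E') = E' - E'^2/2 + ..., which is why small E-values are often read directly as P-values.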


From the BLAST course: http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html

To assess whether a given alignment constitutes evidence for homology, it helps to know how strong an alignment can be expected from chance alone. In this context, "chance" can mean the comparison of

• (i) real but non-homologous sequences;

• (ii) real sequences that are shuffled to preserve compositional properties; or

• (iii) sequences that are generated randomly based upon a DNA or protein sequence model.

Analytic statistical results invariably use the last of these definitions of chance, while empirical results based on simulation and curve-fitting may use any of the definitions.


• As demonstrated above, scores of local alignments are covered by a well-developed theory.

• For global alignments, Monte Carlo experiments can provide rough distributional results for some specific scoring systems and sequence compositions, but these cannot be generalized easily.

– It is possible to express the score of interest in terms of standard deviations from the mean, but it is a mistake to assume that the relevant distribution is normal and convert this Z-value into a P-value; the tail behavior of global alignment scores is unknown.

– The most one can say reliably is that if 100 random alignments have score inferior to the alignment of interest, the P-value in question is likely less than 0.01.

• One further pitfall to avoid is exaggerating the significance of a result found among multiple tests.

– When many alignments have been generated, e.g. in a database search, the significance of the best must be discounted accordingly.

– An alignment with P-value 0.0001 in the context of a single trial may be assigned a P-value of only 0.1 if it was selected as the best among 1000 independent trials.
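The multiple-testing figure quoted above follows from the probability that the best of n independent trials reaches a given single-trial P-value, 1 - (1 - p)^n. A minimal sketch, with the hypothetical function name `best_of_n_p_value`:

```python
import math


def best_of_n_p_value(p_single, n_trials):
    """P-value of the best result among n independent trials: 1 - (1 - p)^n."""
    # log1p and expm1 keep the computation accurate when p_single is tiny.
    return -math.expm1(n_trials * math.log1p(-p_single))


corrected = best_of_n_p_value(0.0001, 1000)
bonferroni_bound = 0.0001 * 1000  # simple upper bound n * p
```

The exact value is slightly below the simple product n * p = 0.1, which is the figure used in the text.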


• The E-value equation applies to the comparison of two proteins of lengths m and n. How does one assess the significance of an alignment that arises from the comparison of a protein of length m to a database containing many different proteins of varying lengths?

• One view is that all proteins in the database are a priori equally likely to be related to the query. This implies that a low E-value for an alignment involving a short database sequence should carry the same weight as a low E-value for an alignment involving a long database sequence. To calculate a "database search" E-value, one simply multiplies the pairwise-comparison E-value by the number of sequences in the database.    


• An alternative view is that a query is a priori more likely to be related to a long than to a short sequence, because long sequences are often composed of multiple distinct domains.

• If we assume the a priori chance of relatedness is proportional to sequence length, then the pairwise E-value involving a database sequence of length n should be multiplied by N/n, where N is the total length of the database in residues.

• Examining the E-value equation, this can be accomplished simply by treating the database as a single long sequence of length N.

• The BLAST programs take this approach to calculating database E-value. Notice that for DNA sequence comparisons, the length of database records is largely arbitrary, and therefore this is the only really tenable method for estimating statistical significance.
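The equivalence between "multiply the pairwise E-value by N/n" and "treat the database as one sequence of length N" can be seen in a short sketch. It assumes the pairwise E-value has the standard Karlin-Altschul form E = K m n exp(-lambda S), which is not written out in this text; the function name and all numerical values are hypothetical.

```python
import math


def pairwise_e_value(m, n, score, k, lam):
    """Assumed pairwise form E = K * m * n * exp(-lambda * S)."""
    return k * m * n * math.exp(-lam * score)


# Illustrative (assumed) values, not taken from the slides.
k, lam = 0.1, 0.3
query_len = 250          # m
subject_len = 400        # n, length of one database sequence
database_len = 10_000_000  # N, total residues in the database
score = 60

# Length-proportional prior: scale the pairwise E-value by N / n ...
scaled = pairwise_e_value(query_len, subject_len, score, k, lam) * (
    database_len / subject_len
)
# ... which is the same as comparing the query to one sequence of length N.
as_single_sequence = pairwise_e_value(query_len, database_len, score, k, lam)
```

Because n cancels, the database E-value under this view no longer depends on how the database is split into records, which is the point made above for DNA comparisons.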


Comparison of two unaligned sequences

• Until now: a fixed ungapped alignment in the comparison of two sequences, each of length N.

• Now: two sequences of lengths N1 and N2 are given without any specific alignment (N1 + N2 - 1 ungapped alignments in total).

• The theory is advanced; we give only highlights of the results.

• Many conclusions of the previous sections carry over with N substituted by N1N2.


Scores

• The basic score is re-defined now

$Y_{\max}$ = max score achieved in a RW in comparing the sequences, using all possible ungapped alignments
= max of n geometric-like rv's ($C$, $A$, and $\theta^*$ as before)

• Mean number of (independent) ladder points in all alignments

$$n = \frac{N_1 N_2}{A}$$

• Since the heights of excursions are geometric-like rv's (n of them),

$$1 - \exp(-K e^{-\theta^* x}) \le \Pr\left[Y_{\max} > \frac{\ln(N_1 N_2)}{\theta^*} + x\right] \le 1 - \exp(-K e^{-\theta^* (x - 1)})$$


Scores

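• From the previous slide

$$1 - \exp(-K e^{-\theta^* x}) \le \Pr\left[Y_{\max} > \frac{\ln(N_1 N_2)}{\theta^*} + x\right] \le 1 - \exp(-K e^{-\theta^* (x - 1)})$$

• Define standardized score

$$S' = \theta^* Y_{\max} - \ln(N_1 N_2 K)$$

• Expected count of (independent) excursions of height at least y

$$E = \frac{N_1 N_2}{A}\,C e^{-\theta^* y} \approx N_1 N_2 K e^{-\theta^* y} = e^{-s'}, \qquad s' = \theta^* y - \ln(N_1 N_2 K)$$

• Similar expressions as before for expected score and P-value

$$E[S'] \approx \gamma, \qquad \Pr[S' \ge s'] \approx 1 - \exp(-e^{-s'}) = 1 - e^{-E}$$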


Karlin-Altschul sum statistic

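• Idea: add information from the r-1 “next to the highest” excursions

$$Y_{\max} = Y_1 \ge Y_2 \ge \dots \ge Y_r$$

$$S_i = \theta^* Y_i - \ln(K N_1' N_2'), \quad i = 1, \dots, r \qquad (N_1' \text{ and } N_2' \text{ reflect edge effects})$$

• It was proved that

$$f(s_1, \dots, s_r) = \exp\left(-e^{-s_r} - \sum_{k=1}^{r} s_k\right)$$

• The particular statistic used

$$T_r = S_1 + \dots + S_r, \qquad \Pr[T_r \ge t] \approx \frac{e^{-t}\, t^{r-1}}{r!\,(r-1)!}, \quad \text{for } t \text{ large enough}$$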
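The tail approximation for the sum statistic, exp(-t) t^(r-1) / (r! (r-1)!), is straightforward to evaluate. The sketch below uses the hypothetical function name `sum_statistic_tail` and is only meaningful for large t, where the approximation is stated to hold.

```python
import math


def sum_statistic_tail(t, r):
    """Large-t approximation to Pr[T_r >= t] for the sum of the r top normalized scores."""
    if r < 1:
        raise ValueError("r must be a positive integer")
    return math.exp(-t) * t ** (r - 1) / (math.factorial(r) * math.factorial(r - 1))


# For r = 1 the formula reduces to exp(-t), the large-s tail of 1 - exp(-exp(-s)).
tail_r1 = sum_statistic_tail(10.0, 1)
tail_r2 = sum_statistic_tail(10.0, 2)
tail_r3 = sum_statistic_tail(10.0, 3)
```

With r = 1 the expression agrees with the single-excursion P-value for large scores, which is a useful consistency check on the formula.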


Choice of r and multiple testing

• Usually, sum tests are performed for all “available” r

• The best P-value is accepted, following heuristic corrections (see Section 9.3.4),

$$E = 2 N_1' N_2' K e^{-\theta^* y_{\max}}, \qquad P = 1 - e^{-E}, \qquad r = 1$$

$$P_r' = \frac{P_r}{(1 - \beta)\,\beta^{r-1}}, \qquad r > 1$$


Comparison of a query sequence against a database

• Use the Poisson distribution to obtain the following probability

$$\Pr[\text{at least 1 HSP between query and database seq. with } Y \ge v] = 1 - e^{-E}$$

• Since the database is of length D, the expected # of HSPs with scores ≥ v is

$$\text{Expect} = \frac{D}{N_2}\,(1 - e^{-E}), \qquad r = 1$$

• For all other r

$$\text{Expect} = \frac{D}{N_2}\,P_r', \qquad r > 1$$

$$P = 1 - e^{-\text{Expect}}$$

• Analyze Example 9.5.2.