
Page 1: Heuristic alignment algorithms; Cost matrices

Heuristic alignment algorithms; Cost matrices

2.5 – 2.9

Thomas van Dijk

Page 2: Heuristic alignment algorithms; Cost matrices

Content

Dynamic programming
Improve to linear space

Heuristics for database searches
BLAST, FASTA

Statistical significance
What the scores mean
Score parameters (PAM, BLOSUM)

Page 3: Heuristic alignment algorithms; Cost matrices

Dynamic programming

Improve to linear space

Page 4: Heuristic alignment algorithms; Cost matrices

Dynamic programming

NW and SW run in O( nm ) time and use O( nm ) space.

For proteins, this is okay. But DNA strings are huge!

Improve this to O( n+m ) space while still using O( nm ) time.

Page 5: Heuristic alignment algorithms; Cost matrices

Basic idea

Full DP requires O( nm ) space

Page 6: Heuristic alignment algorithms; Cost matrices

Basic idea for linear space

Cells don’t directly depend on cells more than one column to the left. So keep only two columns; forget the rest.

No back-pointers!
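As a concrete sketch of this two-column, score-only sweep (Python; the substitution function score(a, b) and the linear gap penalty gap are hypothetical parameters, not fixed by the slides):

def nw_last_column(x, y, score, gap):
    """Global-alignment (Needleman-Wunsch) score in linear space: sweep the DP
    matrix column by column and keep only the previous and current column.
    Sketch with a linear gap penalty `gap` (negative) and substitution scores
    from `score(a, b)`; returns the final column of the matrix."""
    n = len(x)
    prev = [i * gap for i in range(n + 1)]            # column j = 0: only gaps
    for j in range(1, len(y) + 1):
        cur = [j * gap]                               # row i = 0: only gaps
        for i in range(1, n + 1):
            cur.append(max(
                prev[i - 1] + score(x[i - 1], y[j - 1]),   # x[i-1] aligned to y[j-1]
                prev[i] + gap,                             # y[j-1] aligned to a gap
                cur[i - 1] + gap,                          # x[i-1] aligned to a gap
            ))
        prev = cur
    return prev

The last entry of the returned column is the optimal score, but no back-pointers are stored, so the alignment itself cannot be read off.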

How to find the alignment?

Page 7: Heuristic alignment algorithms; Cost matrices

If we happen to know a cell on the optimal alignment…

Divide and conquer

Page 8: Heuristic alignment algorithms; Cost matrices

We could then repeat this trick!

Divide and conquer

Page 9: Heuristic alignment algorithms; Cost matrices

But how to find such a point?

Divide and conquer

Page 10: Heuristic alignment algorithms; Cost matrices

Determine where the optimal alignment crossed a certain column: at every cell, remember at which row it crossed the column.

Modified DP
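One way to implement this modified DP, as a sketch (Python, reusing the hypothetical score and gap parameters from the earlier sketch): alongside every cell's score, carry the row at which the best path to that cell left the middle column.

def nw_score_and_crossing(x, y, score, gap):
    """Linear-space global alignment score that also reports the row at which
    the optimal path leaves the middle column (column len(y) // 2) of the DP
    matrix. Sketch only; linear gap penalty `gap`, substitution `score(a, b)`."""
    n, m = len(x), len(y)
    mid = m // 2
    prev = [i * gap for i in range(n + 1)]                 # column j = 0
    prev_cross = [i if mid == 0 else None for i in range(n + 1)]
    for j in range(1, m + 1):
        cur = [j * gap]                                    # row i = 0
        cur_cross = [0 if j == mid else prev_cross[0]]
        for i in range(1, n + 1):
            diag = prev[i - 1] + score(x[i - 1], y[j - 1])   # x[i-1] vs y[j-1]
            up = cur[i - 1] + gap                            # x[i-1] vs a gap
            left = prev[i] + gap                             # y[j-1] vs a gap
            best = max(diag, up, left)
            if best == diag:
                c = prev_cross[i - 1]
            elif best == up:
                c = cur_cross[i - 1]
            else:
                c = prev_cross[i]
            cur.append(best)
            # In the middle column itself, the crossing row is the current row.
            cur_cross.append(i if j == mid else c)
        prev, prev_cross = cur, cur_cross
    return prev[n], prev_cross[n]

With the crossing row in hand, one would recurse on x[:row] against y[:mid] and on x[row:] against y[mid:], which is exactly the divide-and-conquer scheme of the previous slides, and the space stays linear.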

Page 11: Heuristic alignment algorithms; Cost matrices

Always only two columns at a time. Clearly O( n+m ) space.

But what about the time?

Space analysis

Page 12: Heuristic alignment algorithms; Cost matrices

Using linear space DP at every step of a divide and conquer scheme.

What are we doing?

Page 13: Heuristic alignment algorithms; Cost matrices

We do more work now, but how much? Look at case with two identical strings.

n² (the full matrix)

Time analysis

Page 14: Heuristic alignment algorithms; Cost matrices

We do more work now, but how much? Look at case with two identical strings.

n² + ½ n²

Time analysis

(The two recursive subproblems are each about ¼ n².)

Page 15: Heuristic alignment algorithms; Cost matrices

We do more work now, but how much? Look at case with two identical strings.

n² + ½ n² + ¼ n²

Time analysis

(Four subproblems of about 1/16 n² each.)

Page 16: Heuristic alignment algorithms; Cost matrices

We do more work now, but how much? Look at case with two identical strings.

n² + ½ n² + ¼ n² + … < 2n²

Time analysis

et cetera…

Page 17: Heuristic alignment algorithms; Cost matrices

We do more work now, but how much? Look at case with two identical strings.

n² + ½ n² + ¼ n² + … < 2n²

Along the same lines, the algorithm in general is still O( nm ). And actually only about twice as much work!

Time analysis

et cetera…
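Written out, the geometric series behind the bound:

\[
n^2 + \tfrac{1}{2}n^2 + \tfrac{1}{4}n^2 + \dots
  \;=\; n^2 \sum_{k=0}^{\infty} 2^{-k}
  \;=\; 2n^2 ,
\]

so any finite number of recursion levels costs strictly fewer than 2n² cell computations, and for strings of lengths n and m the same argument bounds the total work by roughly 2nm, i.e. still O( nm ).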

Page 18: Heuristic alignment algorithms; Cost matrices

Questions?

Page 19: Heuristic alignment algorithms; Cost matrices

Heuristics for database search

BLAST, FASTA

Page 20: Heuristic alignment algorithms; Cost matrices

Searching a database

Database of strings
Query string

Select best matching string(s) from the database.

Page 21: Heuristic alignment algorithms; Cost matrices

Algorithms we already know

Until now, the algorithms were exact and correct (For the model!)

But for this purpose, too slow:
Say 100M strings with 1000 chars each.
Say 10M cells per second.
That leads to roughly 3 hours of search time.
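As a quick check of the arithmetic (assuming the slide counts on the order of one DP cell per database character; a full DP matrix against a long query would be far more expensive still):

\[
10^8 \times 10^3 = 10^{11} \text{ cells}, \qquad
\frac{10^{11} \text{ cells}}{10^7 \text{ cells/s}} = 10^4 \text{ s} \approx 3 \text{ hours}.
\]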

Page 22: Heuristic alignment algorithms; Cost matrices

Getting faster

Full DP is too expensive

Space issue: fixed, with a guaranteed identical result.

Now the time issue

Page 23: Heuristic alignment algorithms; Cost matrices

BLAST and FASTA

Heuristics to avoid doing the full DP: not guaranteed to give the same result, but they tend to work well.

Idea: first spend some time analyzing the strings; then don’t calculate all DP cells, only some, based on that analysis.

Page 24: Heuristic alignment algorithms; Cost matrices

Basic idea

In a good alignment, it is likely that several short parts match exactly.

A C C A B B D B C D C B B C B A
A B B A D A C C B B C C D C D A

Page 25: Heuristic alignment algorithms; Cost matrices

k-tuples

Decompose strings into k-tuples with corresponding offset.

E.g. with k=3, “A C C A B B” becomes

0: A C C
1: C C A
2: C A B
3: A B B

Do this for the database and the query
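A sketch of this preprocessing step in Python (the names ktuples and ktuple_index are made up for this example):

from collections import defaultdict

def ktuples(s, k):
    """All k-tuples of s with their offsets, e.g. for k = 3 the string 'ACCABB'
    gives [(0, 'ACC'), (1, 'CCA'), (2, 'CAB'), (3, 'ABB')]."""
    return [(i, s[i:i + k]) for i in range(len(s) - k + 1)]

def ktuple_index(strings, k):
    """Index a collection of database strings by k-tuple:
    k-tuple -> list of (string id, offset)."""
    index = defaultdict(list)
    for sid, s in enumerate(strings):
        for off, word in ktuples(s, k):
            index[word].append((sid, off))
    return index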

Page 26: Heuristic alignment algorithms; Cost matrices

Example strings

A C C A B B D B C D C B B C B A
A B B A D A C C B B C C D C D A

Page 27: Heuristic alignment algorithms; Cost matrices

3-tuple join

First string, 3-tuples:
0: ACC  1: CCA  2: CAB  3: ABB  4: BBD  5: BDB  6: DBC  7: BCD  8: CDC  9: DCB  10: CBB  11: BBC  12: BCB  13: CBA

Second string, 3-tuples:
0: ABB  1: BBA  2: BAD  3: ADA  4: DAC  5: ACC  6: CCB  7: CBB  8: BBC  9: BCC  10: CCD  11: CDC  12: DCD  13: CDA

Tuples occurring in both strings, with the difference of their offsets (first minus second): ABB: 3 - 0 = 3, ACC: 0 - 5 = -5, CBB: 10 - 7 = 3, BBC: 11 - 8 = 3, CDC: 8 - 11 = -3.
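Joining the query's tuples against such an index and counting hits per diagonal can be sketched as follows (diagonal = database offset minus query offset; hypothetical function names, reusing ktuples from the sketch above):

from collections import Counter

def diagonal_hits(query, index, k):
    """Count hits per (string id, diagonal); many hits on one diagonal hint at
    a good ungapped match between the query and that database string."""
    counts = Counter()
    for q_off, word in ktuples(query, k):
        for sid, db_off in index.get(word, ()):
            counts[(sid, db_off - q_off)] += 1
    return counts

For the example above, with the first string as a one-entry database and the second as the query, this yields three hits on diagonal 3 and one hit each on diagonals -5 and -3.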

Page 28: Heuristic alignment algorithms; Cost matrices

Matches / hits

Lots on the same diagonal: might be a good alignment.

(Figure: dot plot of the matches, offset in query on one axis against offset in db string on the other.)

Page 29: Heuristic alignment algorithms; Cost matrices

Do e.g. “banded DP” around diagonals with many matches

Don’t do full DP

(Figure: the same dot plot, offset in query against offset in db string.)
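One way to sketch such a banded DP (a Smith-Waterman-style local score restricted to a band; d, w, score and gap are hypothetical parameters, and real FASTA/BLAST differ in many details):

def banded_local_score(x, y, d, w, score, gap):
    """Local-alignment score computed only for cells (i, j) with
    |(j - i) - d| <= w, i.e. inside a band of half-width w around diagonal d.
    Cells outside the band are treated as unreachable."""
    NEG = float("-inf")
    n, m = len(x), len(y)
    best, prev = 0, {}
    for i in range(0, n + 1):
        lo, hi = max(0, i + d - w), min(m, i + d + w)
        cur = {}
        for j in range(lo, hi + 1):
            if i == 0 or j == 0:
                cur[j] = 0                                # boundary: empty alignment
                continue
            cur[j] = max(
                0,                                                  # local restart
                prev.get(j - 1, NEG) + score(x[i - 1], y[j - 1]),   # x[i-1] vs y[j-1]
                prev.get(j, NEG) + gap,                             # x[i-1] vs a gap
                cur.get(j - 1, NEG) + gap,                          # y[j-1] vs a gap
            )
            best = max(best, cur[j])
        prev = cur
    return best

This fills roughly (2w + 1) cells per row, so the work is about O( wn ) instead of O( nm ), which is the whole point of restricting attention to promising diagonals.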

Page 30: Heuristic alignment algorithms; Cost matrices

Some options

If no diagonal with multiple matches, don’t DP at all.

Don’t just allow exact ktup matches, but generate ‘neighborhood’ for query tuples.
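A sketch of that neighborhood generation (hypothetical score function and threshold; BLAST's actual rule is to keep every word whose score against the query word, under the substitution matrix, is at least some threshold T):

from itertools import product

def neighborhood(word, alphabet, score, threshold):
    """All k-tuples over `alphabet` whose summed substitution score against
    `word` is at least `threshold`."""
    return [
        "".join(cand)
        for cand in product(alphabet, repeat=len(word))
        if sum(score(a, b) for a, b in zip(word, cand)) >= threshold
    ]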

Page 31: Heuristic alignment algorithms; Cost matrices

Personal experience: “Database architecture” practical assignment

MonetDB: main memory DBMS
SWISS-PROT decomposed into 3-tuples: 150k strings, 150M 3-tuples

Find database strings with more than one match on the same diagonal.

Page 32: Heuristic alignment algorithms; Cost matrices

Personal experience

43 char query string:
~500k matches (in ~122k strings)
~32k diagonals with more than one match (in ~25k strings)

With some implementation effort: ~1 s. (Kudos for Monet here!)

Page 33: Heuristic alignment algorithms; Cost matrices

Personal experience

From 150k strings to 15k ‘probable’ strings in 1 second.

This discards 90 percent of the database for almost no work.

And even gives extra information to speed up subsequent calculations.

… but might discard otherwise good matches.

Page 34: Heuristic alignment algorithms; Cost matrices

Personal experience

Tiny query: a 6 char query gives 122 diagonals in 119 strings, in no time at all.

An actual protein from the database: a 250 char query gives ~285k diagonals in ~99k strings, in about 5 seconds.

Page 35: Heuristic alignment algorithms; Cost matrices

BLAST/FASTA conclusion

Not guaranteed to give the same result, but they tend to work well.

Page 36: Heuristic alignment algorithms; Cost matrices

Questions?

Page 37: Heuristic alignment algorithms; Cost matrices

Statistical significance

What do the scores mean?

Score parameters (PAM, BLOSUM)

Page 38: Heuristic alignment algorithms; Cost matrices

What do the scores mean?

We are calculating ‘optimal’ scores, but what do they mean? Used log-odds to get an additive scoring scheme.

Biologically meaningful versus just the best alignment between random strings.

1. Bayesian approach
2. Classical approach
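For reference, the additive log-odds score mentioned above, written in the usual notation with match probabilities p_ab and background frequencies q_a (for an ungapped alignment of x against y):

\[
S \;=\; \sum_{i} s(x_i, y_i),
\qquad
s(a, b) \;=\; \log \frac{p_{ab}}{q_a\, q_b},
\]

so the total score is the log of the ratio between the probability of the aligned residues under the match model M and under the random model R, which is what makes the scores additive.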

Page 39: Heuristic alignment algorithms; Cost matrices

Bayesian approach

Interested in: Probability of a match, given the strings: P( M | x,y ).

Already know: Probability of the strings given the models, i.e. P( x,y | M ) and P( x,y | R ).

So … Bayes rule.

Page 40: Heuristic alignment algorithms; Cost matrices

Bayesian approach

Bayes rule gives us:

P( M | x,y ) = P( x,y | M ) P( M ) / P( x,y )

…rewrite…
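Spelled out (assuming M and R are the only two alternatives, so that P( x,y ) = P( x,y | M ) P( M ) + P( x,y | R ) P( R )):

\[
P(M \mid x, y)
\;=\; \frac{P(x, y \mid M)\, P(M)}{P(x, y \mid M)\, P(M) + P(x, y \mid R)\, P(R)}
\;=\; \frac{1}{1 + e^{-S'}}
\;=\; \sigma(S'),
\]
\[
\text{where}\quad
S' \;=\; \log\frac{P(x, y \mid M)}{P(x, y \mid R)} \;+\; \log\frac{P(M)}{P(R)} .
\]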

Page 41: Heuristic alignment algorithms; Cost matrices

Bayesian approach

P( M | x,y ) = σ( S’ )

S’ = log( P(x,y|M)/P(x,y|R) ) + log( P(M)/P(R) )

The first log term is our alignment score!

Page 42: Heuristic alignment algorithms; Cost matrices

Take care!

Requires that the substitution matrix contains probabilities.

Mind the prior probabilities: when matching against a database of N sequences, subtract a log(N) term.

Alignment score is for the optimal alignment between the strings; ignores possible other good alignments.

Page 43: Heuristic alignment algorithms; Cost matrices

Classical approach

Call the maximum score among N random strings M_N.

P( M_N < x ) means: the probability that the best match from a search of a large number N of unrelated sequences has a score lower than x.

M_N follows an extreme value distribution. Consider x = our score: if this probability is very large, it is likely that our match was not just random.
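Concretely, this extreme value distribution takes the Gumbel form familiar from Karlin-Altschul theory, with constants K and λ that depend on the scoring scheme (the explicit formula is an addition here, not on the slide):

\[
P(M_N < x) \;\approx\; \exp\!\bigl(-K N e^{-\lambda x}\bigr),
\]

so the p-value of an observed score x is roughly 1 - exp(-K N e^{-λx}), which is approximately K N e^{-λx} when it is small.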

Page 44: Heuristic alignment algorithms; Cost matrices

Correcting for length

The score is additive, so longer strings have higher scores!

Page 45: Heuristic alignment algorithms; Cost matrices

Correcting for length

If a match with any string should be equally likely, correct for this bias by subtracting log(length) from the score.

Because the score is a log-odds value, this subtraction ‘is a division’ that normalizes the score.

Page 46: Heuristic alignment algorithms; Cost matrices

Scoring parameters

How to get the values in cost matrices?

1. Just count frequencies?
2. PAM
3. BLOSUM

Page 47: Heuristic alignment algorithms; Cost matrices

Just count frequencies?

That would be a maximum likelihood estimate, but:
- You need lots of confirmed alignments.
- The alignments show different amounts of divergence.

Page 48: Heuristic alignment algorithms; Cost matrices

PAM

Proteins grouped by ‘family’; a phylogenetic tree built.

The PAM1 matrix gives substitution probabilities for 1 unit of evolutionary time.

PAMn is computed as (PAM1)^n.

PAM250 often used.
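A minimal numerical sketch of that construction (hypothetical inputs: pam1 as a 20 x 20 row-stochastic matrix of one-step substitution probabilities and bg as the background amino-acid frequencies; the 10 * log10 scaling follows the usual PAM convention):

import numpy as np

def pam_n_scores(pam1, bg, n, scale=10.0):
    """Build a PAMn score matrix from a PAM1 probability matrix: raise PAM1 to
    the n-th power, then take scaled, rounded log-odds against the background
    frequencies: scale * log10( P_n(b | a) / bg[b] )."""
    pam_n = np.linalg.matrix_power(np.asarray(pam1, dtype=float), n)
    odds = pam_n / np.asarray(bg, dtype=float)[None, :]   # divide column b by bg[b]
    return np.rint(scale * np.log10(odds)).astype(int)

PAM250 would then be pam_n_scores(pam1, bg, 250).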

Page 49: Heuristic alignment algorithms; Cost matrices

BLOSUM

Long-term PAM matrices are inaccurate: inaccuracies in PAM1 multiply! And there genuinely are differences between short-term and long-term substitution patterns.

Different BLOSUM matrices are specifically determined for different levels of divergence, which solves both problems.

Page 50: Heuristic alignment algorithms; Cost matrices

Gap penalties

No proper time-dependent model

But it seems reasonable that:
- the expected number of gaps is linear in time;
- the length distribution of gaps stays constant.

Page 51: Heuristic alignment algorithms; Cost matrices

Questions?

Page 52: Heuristic alignment algorithms; Cost matrices

What have we seen?

Linear space DP

Heuristics: BLAST, FASTA

What the scores mean

Available substitution matrices: PAM, BLOSUM

Page 53: Heuristic alignment algorithms; Cost matrices

Last chance for questions…