1
BIC I, Week 4 lectures
Rhys Price Jones and Anne Haake
Rochester Institute of Technology
2
Overview of the need for Dynamic Programming
• Consider Fibonacci
• The obvious algorithm is elegant, easily derived from the definition, and clearly correct.
    (define fib
      (lambda (n)
        (if (<= n 1)
            1
            (+ (fib (- n 2)) (fib (- n 1))))))
• But it’s hopelessly inefficient
• Why?
• Because it makes repeated recursive calls with the same argument
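To make the repeated work visible, here is a minimal sketch that wraps the same definition with a call counter. The global name `calls` is an assumption introduced only for this illustration; it is not part of the course files.

```scheme
;; Illustration only: the fib from the slide, plus a counter.
;; `calls` is a hypothetical global added for this sketch.
(define calls 0)

(define fib
  (lambda (n)
    (set! calls (+ calls 1))          ; count every invocation
    (if (<= n 1)
        1
        (+ (fib (- n 2)) (fib (- n 1))))))
```

With `calls` starting at 0, evaluating `(fib 20)` returns 10946 and leaves `calls` at 21891: almost twice as many calls as the value being computed, and the count grows at the same exponential rate as the Fibonacci numbers themselves.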
3
The Traditional Solution
• Change the order in which the computations are performed
• Change the logic of the program
  – So that it works “bottom up” instead of “top down”
  – Fill an array with calculated values starting with (fib 0), then (fib 1), then (fib 2), etc.
• You can do it manually, as in fib.ss
• That is dynamic programming!
• The main problem is that it requires thought and programming, and hence may introduce errors.
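The bottom-up idea can be sketched as follows. This is not the contents of fib.ss, which is not reproduced in these slides; the name `fib-bottom-up` is an assumption, and the sketch keeps the slide's convention that (fib 0) and (fib 1) are both 1.

```scheme
;; Sketch of a bottom-up Fibonacci using a vector as the array.
;; Assumes n is a non-negative integer.
(define fib-bottom-up
  (lambda (n)
    (let ((table (make-vector (+ n 1) 1)))   ; entries 0 and 1 start at 1
      (do ((i 2 (+ i 1)))
          ((> i n) (vector-ref table n))      ; done: return entry n
        ;; each entry is computed once, from the two entries before it
        (vector-set! table i
                     (+ (vector-ref table (- i 2))
                        (vector-ref table (- i 1))))))))
```

`(fib-bottom-up 20)` returns 10946, the same value as the recursive version, using n - 1 additions instead of thousands of calls.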
4
It’s not just Fibonacci
• Many programs “write themselves” from the specification of the problem.
• When that happens, we are extremely pleased
• Sadly, the resulting program is often inefficient
• But dynamic programming is a technique to make it efficient again.
5
Memo-izing
• Redefine the function calling mechanism so that:
  – We first check to see if we’ve made that calculation before
  – If no, go ahead and compute it but store the result in a hash table
  – If yes, look up the previously computed value in the hash table
• Do it once
• Inefficient code becomes efficient automatically with no re-programming
memolambda.ss
memofib.ss
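A rough sketch of the idea, under stated assumptions: memolambda.ss and memofib.ss are not reproduced here, so the helper `memoize` and the name `memo-fib` below are hypothetical. For portability across Scheme implementations the sketch stores results in an association list rather than a hash table, so lookups are linear rather than constant time.

```scheme
;; Hypothetical helper: wrap a one-argument function with a cache.
;; An association list stands in for the hash table mentioned above.
(define memoize
  (lambda (f)
    (let ((table '()))
      (lambda (n)
        (let ((hit (assv n table)))
          (if hit
              (cdr hit)                               ; seen before: look it up
              (let ((value (f n)))                    ; new: compute it ...
                (set! table (cons (cons n value) table)) ; ... and store it
                value)))))))

;; Same body as the slide's fib; the recursive calls go through the cache.
(define memo-fib
  (memoize
    (lambda (n)
      (if (<= n 1)
          1
          (+ (memo-fib (- n 2)) (memo-fib (- n 1)))))))
```

`(memo-fib 30)` returns 1346269, and each argument from 0 to 30 is computed only once.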
6
Another Example
• Pascal’s triangle
• Each entry is the sum of its parents
  – $C_{n,k} = C_{n-1,k-1} + C_{n-1,k}$
  – $C_{n,0} = C_{n,n} = 1$
• Leading to program
• Runs really slowly
• Replace lambda by memolambda
badcomb.ss
goodcomb.ss
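As a sketch of what a memoized version might look like (badcomb.ss and goodcomb.ss are not reproduced here, so this is not the authors' code): the names `comb` and `comb-table` are assumptions, the edge entries of the triangle are taken to be 1, and an association list keyed on the pair (n . k) stands in for a hash table.

```scheme
;; Hypothetical cache: list of ((n . k) . value) entries.
(define comb-table '())

;; Entry k of row n of Pascal's triangle, assuming 0 <= k <= n.
(define comb
  (lambda (n k)
    (if (or (= k 0) (= k n))
        1                                   ; edges of the triangle
        (let ((hit (assoc (cons n k) comb-table)))
          (if hit
              (cdr hit)                     ; already computed
              (let ((value (+ (comb (- n 1) (- k 1))
                              (comb (- n 1) k))))
                (set! comb-table
                      (cons (cons (cons n k) value) comb-table))
                value))))))
```

`(comb 4 2)` returns 6 and `(comb 20 10)` returns 184756. Without the cache, the same recursion makes a number of calls proportional to the answer itself.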
7
Review of Pattern Matching
• Does CGGA appear within the sequence ATCGCGTAACGGAGATAGGCTTA ?
• More generally, where does pattern p (length n) appear within text t (length m)?
• Boyer-Moore or Knuth-Morris-Pratt gives O(m+n) search
• If p is going to change a lot and t stays the same, a suffix tree can be built in O(m); each search is then O(n)
• If p is stable and there are lots of different t, a virtual machine can be built in O(n), and then each search is O(m)
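For comparison with the faster methods listed above, here is a minimal sketch of the naive search: try every start position in t and compare. The name `naive-search` is an assumption; the worst case is O(mn).

```scheme
;; Naive exact matching: return the first 0-based index at which
;; pattern p occurs in text t, or #f if it does not occur.
(define naive-search
  (lambda (p t)
    (let ((n (string-length p))
          (m (string-length t)))
      (let loop ((i 0))
        (cond ((> (+ i n) m) #f)                         ; ran off the end
              ((string=? p (substring t i (+ i n))) i)   ; match at i
              (else (loop (+ i 1))))))))
```

`(naive-search "CGGA" "ATCGCGTAACGGAGATAGGCTTA")` returns 9, so the answer to the question on this slide is yes: CGGA starts at the tenth nucleotide.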
8
Build a Virtual Machine
• CGGA
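[Figure: state-transition diagram of the machine for CGGA. Edge labels along the main path: C, G, G, A. Labels on the other edges: AGT, ACGT, C, C, CAT, AT, GT]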
ATCGCGTAACGGAGATAGGCTTA
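One way to sketch such a machine in code, under stated assumptions: state k means "the last k characters read match the first k characters of the pattern". The construction below is brute force, not the O(n) construction mentioned earlier, and the names `suffix?`, `next-state`, `build-machine` and `run-machine` are all hypothetical.

```scheme
(define alphabet '(#\A #\C #\G #\T))

;; #t when s1 is a suffix of s2
(define suffix?
  (lambda (s1 s2)
    (let ((l1 (string-length s1)) (l2 (string-length s2)))
      (and (<= l1 l2)
           (string=? s1 (substring s2 (- l2 l1) l2))))))

;; Next state = length of the longest prefix of p that is a suffix of
;; (first `state` characters of p) followed by ch.
(define next-state
  (lambda (p state ch)
    (let ((seen (string-append (substring p 0 state) (string ch))))
      (let loop ((k (min (string-length p) (string-length seen))))
        (if (suffix? (substring p 0 k) seen)
            k
            (loop (- k 1)))))))         ; k = 0 always succeeds

;; The machine: a vector indexed by state; each entry is an
;; association list from input character to next state.
(define build-machine
  (lambda (p)
    (let* ((n (string-length p))
           (table (make-vector (+ n 1) '())))
      (do ((state 0 (+ state 1)))
          ((> state n) table)
        (vector-set! table state
                     (map (lambda (ch) (cons ch (next-state p state ch)))
                          alphabet))))))

;; Feed text t through the machine, one character per step.
;; Returns the 0-based start positions of all matches.
;; Assumes t contains only characters from `alphabet`.
(define run-machine
  (lambda (table t)
    (let ((n (- (vector-length table) 1))
          (m (string-length t)))
      (let loop ((i 0) (state 0) (hits '()))
        (if (= i m)
            (reverse hits)
            (let ((s (cdr (assv (string-ref t i) (vector-ref table state)))))
              (loop (+ i 1)
                    s
                    (if (= s n) (cons (- (+ i 1) n) hits) hits))))))))
```

`(run-machine (build-machine "CGGA") "ATCGCGTAACGGAGATAGGCTTA")` returns `(9)`: the machine reaches its final state after reading the thirteenth character, which ends the match that starts at index 9. Once the table is built, each character of the text costs one lookup, which is where the O(m) search time comes from.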
9
First Step
• CGGA
[Figure: state machine for CGGA, as on slide 8]
ATCGCGTAACGGAGATAGGCTTA
10
Second Step
• CGGA
[Figure: state machine for CGGA, as on slide 8]
ATCGCGTAACGGAGATAGGCTTA
11
Third Step
• CGGA
[Figure: state machine for CGGA, as on slide 8]
ATCGCGTAACGGAGATAGGCTTA
12
Fourth Step
• CGGA
[Figure: state machine for CGGA, as on slide 8]
ATCGCGTAACGGAGATAGGCTTA
13
Fifth Step
• CGGA
[Figure: state machine for CGGA, as on slide 8]
ATCGCGTAACGGAGATAGGCTTA
14
Sixth Step
• CGGA
[Figure: state machine for CGGA, as on slide 8]
ATCGCGTAACGGAGATAGGCTTA
15
Seventh Step
• CGGA
[Figure: state machine for CGGA, as on slide 8]
ATCGCGTAACGGAGATAGGCTTA
16
Eighth Step
• CGGA
[Figure: state machine for CGGA, as on slide 8]
ATCGCGTAACGGAGATAGGCTTA
17
Ninth Step
• CGGA
[Figure: state machine for CGGA, as on slide 8]
ATCGCGTAACGGAGATAGGCTTA
18
Tenth Step
• CGGA
[Figure: state machine for CGGA, as on slide 8]
ATCGCGTAACGGAGATAGGCTTA
19
Eleventh Step
• CGGA
[Figure: state machine for CGGA, as on slide 8]
ATCGCGTAACGGAGATAGGCTTA
20
Twelfth Step
• CGGA
[Figure: state machine for CGGA, as on slide 8]
ATCGCGTAACGGAGATAGGCTTA
21
Thirteenth Step
• CGGA
[Figure: state machine for CGGA, as on slide 8]
ATCGCGTAACGGAGATAGGCTTA
22
Fourteenth Step
• CGGA
[Figure: state machine for CGGA, as on slide 8]
ATCGCGTAACGGAGATAGGCTTA
23
Fifteenth Step
• CGGA
[Figure: state machine for CGGA, as on slide 8]
ATCGCGTAACGGAGATAGGCTTA
24
Sixteenth Step
• CGGA
[Figure: state machine for CGGA, as on slide 8]
ATCGCGTAACGGAGATAGGCTTA
25
17th – 23rd Steps
• CGGA
[Figure: state machine for CGGA, as on slide 8]
ATCGCGTAACGGAGATAGGCTTA
26
Pattern Matching – Conclusion
• Exact pattern matching is easy
• Often the naive algorithm is good enough
• Fast algorithms are readily available
• Sadly, not much use for biological tasks
27
Why not?
• What’s the difference?
• Mutation
• Insertion/deletion gaps
• We need an inexact way to compare two (or more) biological sequences
28
Pattern Matching vs. Sequence Alignment
• In the CS world, we talk of comparing strings, or matching patterns of characters within strings
• For biological applications, we talk of comparing sequences, or aligning sequences of nucleotides (or amino acids) to each other
29
Evolutionary Relatedness
• Consider ACCGT and CACGT
• How likely is it that they are “related”?
• Possible alignments:
    ACCGT     AC-CGT
    XX|||     -|-|||
    CACGT     -CACGT
• Which is better?
30
It Depends
    ACCGT     AC-CGT
    XX|||     -|-|||
    CACGT     -CACGT
• Scoring 2 for a match, –2 for a mismatch, and –1 for a gap, the alignments score 2 versus 6
• Scoring 2 for a match, 0 for a mismatch, and –2 for a gap, they score 6 versus 4
• And we haven’t even begun to consider experimental evidence that might cause us to rank some mutations better than others!
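The arithmetic behind these comparisons can be sketched as a small scoring function. The name `score-alignment` and its argument order are assumptions; it takes the two rows of an alignment as equal-length strings, with `-` marking a gap.

```scheme
;; Score an alignment given as two equal-length strings.
;; A column containing "-" is a gap; otherwise it is a match or a mismatch.
(define score-alignment
  (lambda (top bottom match mismatch gap)
    (let loop ((i 0) (total 0))
      (if (= i (string-length top))
          total
          (let ((a (string-ref top i))
                (b (string-ref bottom i)))
            (loop (+ i 1)
                  (+ total
                     (cond ((or (char=? a #\-) (char=? b #\-)) gap)
                           ((char=? a b) match)
                           (else mismatch)))))))))
```

With the first scheme, `(score-alignment "ACCGT" "CACGT" 2 -2 -1)` returns 2 and `(score-alignment "AC-CGT" "-CACGT" 2 -2 -1)` returns 6. With the second scheme (2, 0, -2) the same calls return 6 and 4, so the preferred alignment flips.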
31
Distance measure
• Score 0 for a match
• 1 for a mismatch or gap
• Low score best!
    ACCGT     AC-CGT
    XX|||     -|-|||
    CACGT     -CACGT
• Now it’s 2 versus 2
32
Global alignment
• For two sequences
      -  A  C  C  A  C  C
   -
   A
   C
   A
   C
   A
• Use the scoring scheme to fill in the table, starting with first row and first column
33
First entries
• Using the distance measure
      -  A  C  C  A  C  C
   -  0  1  2  3  4  5  6
   A  1
   C  2
   A  3
   C  4
   A  5
• Each nucleotide<->gap costs 1 point
34
Extending inwards
• Extending the distance measure
      -  A  C  C  A  C  C
   -  0  1  2  3  4  5  6
   A  1  0  1  2  3  4  5
   C  2  1
   A  3  2
   C  4  3
   A  5  4
• Extending from North or West costs 1 point, from NW costs 0 (match) or 1 (mismatch)
• Pick cheapest of the three
35
More extension
      -  A  C  C  A  C  C
   -  0  1  2  3  4  5  6
   A  1  0  1  2  3  4  5
   C  2  1  0  1  2  3  4
   A  3  2  1  1
   C  4  3  2  1
   A  5  4  3  2
• $m_{i,j} = \min(m_{i,j-1} + g,\ m_{i-1,j} + g,\ m_{i-1,j-1} + c_{ij})$
• where $c_{ij} = 0$ for a match, 1 for a mismatch, and $g$ is the gap cost (1 in this example)
36
Getting there...
      -  A  C  C  A  C  C
   -  0  1  2  3  4  5  6
   A  1  0  1  2  3  4  5
   C  2  1  0  1  2  3  4
   A  3  2  1  1  1  2  3
   C  4  3  2  1  2
   A  5  4  3  2  1
• $m_{i,j} = \min(m_{i,j-1} + 1,\ m_{i-1,j} + 1,\ m_{i-1,j-1} + c_{ij})$
• where $c_{ij} = 0$ for a match, 1 for a mismatch
37
Almost done...
      -  A  C  C  A  C  C
   -  0  1  2  3  4  5  6
   A  1  0  1  2  3  4  5
   C  2  1  0  1  2  3  4
   A  3  2  1  1  1  2  3
   C  4  3  2  1  2  1  2
   A  5  4  3  2  1  2
• $m_{i,j} = \min(m_{i,j-1} + 1,\ m_{i-1,j} + 1,\ m_{i-1,j-1} + c_{ij})$
• where $c_{ij} = 0$ for a match, 1 for a mismatch
38
Finally, we can get a Global alignment
• One of the least-cost routes
      -  A  C  C  A  C  C
   -  0  1  2  3  4  5  6
   A  1  0  1  2  3  4  5
   C  2  1  0  1  2  3  4
   A  3  2  1  1  1  2  3
   C  4  3  2  1  2  1  2
   A  5  4  3  2  1  2  2
• Can you see how this path leads to the alignment?
    ACCACC
    AC-ACA
39
Global alignment program
• Distance measure
• Runnable program
• Dynamic Programming version
globalig.txt
globalig.ss
globaligm.ss
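The files listed above are not reproduced in these slides. As a stand-in, here is a minimal sketch of the table fill for the distance measure (0 for a match, 1 for a mismatch or gap). The name `global-distance` and the flat-vector layout are assumptions made for this illustration; it returns only the bottom-right entry and does no backtracking.

```scheme
;; Fill the global-alignment distance table for strings s (columns)
;; and t (rows) and return the bottom-right entry.
(define global-distance
  (lambda (s t)
    (let* ((cols (string-length s))
           (rows (string-length t))
           (width (+ cols 1))
           ;; (rows+1) x (cols+1) table stored row by row in one vector
           (m (make-vector (* (+ rows 1) width) 0))
           (ref (lambda (i j) (vector-ref m (+ (* i width) j))))
           (put! (lambda (i j v) (vector-set! m (+ (* i width) j) v))))
      ;; first row and first column: each nucleotide<->gap costs 1
      (do ((j 0 (+ j 1))) ((> j cols)) (put! 0 j j))
      (do ((i 0 (+ i 1))) ((> i rows)) (put! i 0 i))
      ;; interior: cheapest of West, North and North-West
      (do ((i 1 (+ i 1))) ((> i rows))
        (do ((j 1 (+ j 1))) ((> j cols))
          (let ((c (if (char=? (string-ref t (- i 1))
                               (string-ref s (- j 1)))
                       0
                       1)))
            (put! i j (min (+ (ref i (- j 1)) 1)
                           (+ (ref (- i 1) j) 1)
                           (+ (ref (- i 1) (- j 1)) c))))))
      (ref rows cols))))
```

`(global-distance "ACCACC" "ACACA")` returns 2, the bottom-right entry of the table built on the preceding slides. Because every entry is computed exactly once, the fill takes time proportional to the size of the table.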
40
Global vs Local Alignment
• Global alignment seeks the best alignment between one complete sequence and the other complete sequence. A global alignment between GATCCACCA and GTAACACA might be
    G-ATCCACCA
    |-|X|-||-|
    GTAAC-AC-A
• A local alignment is the best alignment between subsequences. A local alignment between GATCCACCA and GTAACACA might be
     gATCCACca
      |X|-||
    gtAAC-ACa
• Best local alignment depends on scoring scheme
41
Local Alignment
• For this demo, we will use a different measure
  – 2 for a match
  – -1 for a mismatch, -2 for a gap
  – Find best match within
      G C T C T G C G A A T A G
      C G T T G A G A T A C T C
42
The solution
  - G C T C T G C G A A T A G
- 0 0 0 0 0 0 0 0 0 0 0 0 0 0
C 0 0 2 0 2 0 0 2 0 0 0 0 0 0
G 0 2 0 1 0 1 2 0 4 2 0 0 0 2
T 0 0 1 2 0 2 0 1 2 3 1 2 0 0
T 0 0 0 3 1 2 1 0 0 1 2 3 1 0
G 0 2 0 1 2 0 4 2 2 0 0 1 2 3
A 0 0 1 0 0 1 2 3 1 4 2 0 3 1
G 0 2 0 0 0 0 3 1 5 3 3 1 1 5
A 0 0 1 0 0 0 1 2 3 7 5 3 3 3
T 0 0 0 3 1 2 0 0 1 5 6 7 5 3
A 0 0 0 1 2 0 1 0 0 3 7 5 9 7
C 0 0 2 0 3 1 0 3 1 1 5 6 7 8
T 0 0 0 4 2 5 3 1 2 0 3 7 5 6
C 0 0 2 2 6 4 4 5 3 1 1 5 6 4
    G C T C T G C G A A T A G
            | | x | | X | |
      C G T T G A G A - T A C T C
43
The Program
• Has dynamic programming to make it fast!
• This is basically Smith-Waterman
• Work has been done on different scoring schemes, gap penalties, etc.
• Runs in time O(mn)
localig.ss
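The file above is not reproduced here. The following is a minimal sketch of a Smith-Waterman style table fill, assuming the usual rule that no entry drops below 0. The name `local-best-score` and its parameters are hypothetical; it returns only the largest entry in the table and leaves the backtracking to the exercises.

```scheme
;; Fill the local-alignment score table for strings s (columns) and
;; t (rows) and return the largest entry.
(define local-best-score
  (lambda (s t match mismatch gap)
    (let* ((cols (string-length s))
           (rows (string-length t))
           (width (+ cols 1))
           ;; first row and first column stay at 0
           (h (make-vector (* (+ rows 1) width) 0))
           (ref (lambda (i j) (vector-ref h (+ (* i width) j))))
           (best 0))
      (do ((i 1 (+ i 1))) ((> i rows) best)
        (do ((j 1 (+ j 1))) ((> j cols))
          (let* ((sub (if (char=? (string-ref t (- i 1))
                                  (string-ref s (- j 1)))
                          match
                          mismatch))
                 ;; best of: start afresh, North-West, North, West
                 (v (max 0
                         (+ (ref (- i 1) (- j 1)) sub)
                         (+ (ref (- i 1) j) gap)
                         (+ (ref i (- j 1)) gap))))
            (vector-set! h (+ (* i width) j) v)
            (if (> v best) (set! best v))))))))
```

`(local-best-score "GCTCTGCGAATAG" "CGTTGAGATACTC" 2 -1 -2)` returns 9, the largest entry in the table on the solution slide. The two nested loops visit each of the mn interior entries once, which matches the O(mn) running time stated above.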
44
Exercises
• that we will attempt in class:
  – amend global alignment program to do the “backtracking” needed for the alignment
• that will be homework
  – amend local alignment program to do the “backtracking” needed for the alignment