Dynamic Programming: Sequence alignment CS 466 Saurabh Sinha


  • Dynamic Programming: Sequence alignment. CS 466. Saurabh Sinha

  • DNA Sequence Comparison: First Success Story

    Finding sequence similarities with genes of known function is a common approach to inferring the function of a newly sequenced gene.

    In 1984 Russell Doolittle and colleagues found similarities between a cancer-causing gene and the normal growth factor (PDGF) gene.

    A normal growth gene switched on at the wrong time causes cancer!

  • Cystic Fibrosis Cystic fibrosis (CF) is a chronic and frequently fatal genetic disease of the body's mucus glands. CF primarily affects the respiratory systems in children.

    Search for the CF gene was narrowed to ~1 Mbp, and the region was sequenced.

    The region was scanned against a database for matches to known genes. A segment in this region matched the gene for some ATP binding protein(s). These proteins are part of the ion transport channel, and CF involves sweat secretions with abnormal sodium content!

  • Role for Bioinformatics

    Similarities between a gene of known function and a gene of unknown function alert biologists to possible functions of the unknown gene.

    Computing a similarity score between two genes tells how likely it is that they have similar functions

    Dynamic programming is a technique for revealing similarities between genes

  • Motivating Dynamic Programming

  • Dynamic programming example: Manhattan Tourist Problem

    Imagine seeking a path from source to sink, traveling only eastward and southward, that passes the largest number of attractions (*) in the Manhattan grid.

    [Figure: Manhattan street grid with attractions marked *, and the source and sink labeled]

  • Manhattan Tourist Problem: Formulation

    Goal: Find the longest path in a weighted grid.

    Input: A weighted grid G with two distinct vertices, one labeled source and the other labeled sink.

    Output: A longest path in G from source to sink.

  • MTP: An Example

    [Figure: weighted grid with axes labeled "i coordinate" and "j coordinate" and the source and sink marked; the edge weights are not recoverable from the extracted text]

  • MTP: Greedy Algorithm Is Not Optimal

    [Figure: weighted grid from source to sink with the greedy path highlighted, annotated "promising start, but leads to bad choices!"]

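  • MTP: Simple Recursive Program

    MT(n,m)
      if n = 0 and m = 0
        return 0
      if n = 0
        return MT(0,m-1) + length of the edge from (0,m-1) to (0,m)
      if m = 0
        return MT(n-1,0) + length of the edge from (n-1,0) to (n,0)
      x ← MT(n-1,m) + length of the edge from (n-1,m) to (n,m)
      y ← MT(n,m-1) + length of the edge from (n,m-1) to (n,m)
      return max{x,y}

    What's wrong with this approach?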

  • Here's what's wrong

    MT(n,m) needs MT(n,m-1) and MT(n-1,m).

    Both of these need MT(n-1,m-1).

    So MT(n-1,m-1) will be computed at least twice, and the repetition compounds at every level of the recursion.

    Dynamic programming: the same idea as this recursive algorithm, but keep all intermediate results in a table and reuse them.
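
The repeated work can be made visible with a small experiment. The sketch below is an illustration only: the grid weights, the names `DOWN`, `RIGHT`, `mt_naive`, `mt_memo`, and the call counter are all assumptions introduced here, not part of the slides. It compares the plain recursion with a version that caches each result via `functools.lru_cache`.

```python
from functools import lru_cache

# Hypothetical edge weights for a grid with 3 x 3 vertices (not the grid from the slides).
# DOWN[i][j] is the weight of the edge (i, j) -> (i+1, j).
# RIGHT[i][j] is the weight of the edge (i, j) -> (i, j+1).
DOWN = [[1, 0, 2], [4, 6, 5]]
RIGHT = [[3, 2], [3, 2], [0, 7]]

calls = {"naive": 0, "memo": 0}  # how many times each function body runs


def mt_naive(n, m):
    """Plain recursion: recomputes shared subproblems."""
    calls["naive"] += 1
    if n == 0 and m == 0:
        return 0
    if n == 0:
        return mt_naive(0, m - 1) + RIGHT[0][m - 1]
    if m == 0:
        return mt_naive(n - 1, 0) + DOWN[n - 1][0]
    x = mt_naive(n - 1, m) + DOWN[n - 1][m]
    y = mt_naive(n, m - 1) + RIGHT[n][m - 1]
    return max(x, y)


@lru_cache(maxsize=None)
def mt_memo(n, m):
    """Same recursion, but each (n, m) is computed once and then reused."""
    calls["memo"] += 1
    if n == 0 and m == 0:
        return 0
    if n == 0:
        return mt_memo(0, m - 1) + RIGHT[0][m - 1]
    if m == 0:
        return mt_memo(n - 1, 0) + DOWN[n - 1][0]
    x = mt_memo(n - 1, m) + DOWN[n - 1][m]
    y = mt_memo(n, m - 1) + RIGHT[n][m - 1]
    return max(x, y)
```

On this small grid, `mt_naive(2, 2)` runs the function body 19 times, while `mt_memo(2, 2)` runs it 9 times, once per vertex. Both return 17. The gap grows exponentially with grid size.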

  • MTP: Dynamic Programming

    [Figure: grid with axes i and j, showing the first scores computed next to the source]

    $S_{1,0} = 5$, $S_{0,1} = 1$

    Calculate the optimal path score for each vertex in the graph. Each vertex's score is the maximum, over its prior vertices, of the prior vertex's score plus the weight of the edge in between.

  • MTP: Dynamic Programming (contd)

    [Figure: grid with axes i and j, scores filled in along the second anti-diagonal]

    $S_{2,0} = 8$, $S_{1,1} = 4$, $S_{0,2} = 3$

  • MTP: Dynamic Programming (contd)

    [Figure: grid with axes i and j, scores filled in along the third anti-diagonal]

    $S_{3,0} = 8$, $S_{2,1} = 9$, $S_{1,2} = 13$, $S_{0,3} = 8$

  • MTP: Dynamic Programming (contd)

    [Figure: grid with axes i and j, scores filled in along the fourth anti-diagonal, annotated "greedy alg. fails!"]

    $S_{3,1} = 9$, $S_{2,2} = 12$, $S_{1,3} = 8$

  • MTP: Dynamic Programming (contd)

    [Figure: grid with axes i and j, scores filled in along the fifth anti-diagonal]

    $S_{3,2} = 9$, $S_{2,3} = 15$

  • MTP: Dynamic Programming (contd)

    [Figure: completed grid with axes i and j, showing all back-traces]

    $S_{3,3} = 16$

    Done!

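  • MTP: Recurrence

    Computing the score for a point (i,j) by the recurrence relation:

    $$s_{i,j} = \max \begin{cases} s_{i-1,j} + \text{weight of the edge between } (i-1,j) \text{ and } (i,j) \\ s_{i,j-1} + \text{weight of the edge between } (i,j-1) \text{ and } (i,j) \end{cases}$$

    The running time is O(nm) for an n by m grid (n = # of rows, m = # of columns).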
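
A minimal bottom-up sketch of this table-filling scheme follows. The function name `manhattan_tourist` and the two-array representation of edge weights (`down` and `right`) are assumptions made for illustration. The slides do not prescribe a data layout.

```python
def manhattan_tourist(down, right):
    """Fill the score table s for a grid, row by row.

    down[i][j]  : weight of the edge (i, j) -> (i+1, j)   (n rows, m+1 columns)
    right[i][j] : weight of the edge (i, j) -> (i, j+1)   (n+1 rows, m columns)
    Returns the full table; s[n][m] is the score of a longest source-to-sink path.
    """
    n = len(down)
    m = len(right[0])
    s = [[0] * (m + 1) for _ in range(n + 1)]

    # First column and first row have a single predecessor each.
    for i in range(1, n + 1):
        s[i][0] = s[i - 1][0] + down[i - 1][0]
    for j in range(1, m + 1):
        s[0][j] = s[0][j - 1] + right[0][j - 1]

    # Every other vertex takes the better of "from above" and "from the left".
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s[i][j] = max(s[i - 1][j] + down[i - 1][j],
                          s[i][j - 1] + right[i][j - 1])
    return s
```

With the made-up weights `down = [[1, 0, 2], [4, 6, 5]]` and `right = [[3, 2], [3, 2], [0, 7]]`, the table is `[[0, 3, 5], [1, 4, 7], [5, 10, 17]]`, so the best path scores 17. Each vertex is visited once, which matches the O(nm) running time.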

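  • Manhattan Is Not A Perfect Grid

    What about diagonals? The score at point B is the best score over every vertex A that has an edge into B:

    $$s_B = \max_{A \,:\, (A,B) \text{ is an edge}} \{ s_A + \text{weight of the edge } (A,B) \}$$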

  • Manhattan Is Not A Perfect Grid (contd)

    The score for a point x is given by the recurrence relation:

    $$s_x = \max_{y \in \text{Predecessors}(x)} \{ s_y + \text{weight of the edge } (y,x) \}$$

    Predecessors(x): the set of vertices that have edges leading to x.

    The running time for a graph G(V, E) (V is the set of all vertices and E is the set of all edges) is O(E), since each edge is evaluated once.
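
The general recurrence can be sketched for an arbitrary DAG as follows. This assumes the vertices are already supplied in an order where every predecessor comes before its successors. The function name `longest_path_scores` and the dictionary-based edge representation are hypothetical choices for this example.

```python
def longest_path_scores(order, edges, source):
    """Longest-path scores from `source` in a DAG.

    order  : list of vertices such that every edge goes from an earlier
             vertex to a later one
    edges  : dict mapping (y, x) -> weight of the edge y -> x
    Returns a dict s with s[x] = best score of a path from source to x
    (-inf if x cannot be reached from the source).
    """
    # Group incoming edges by their target vertex.
    predecessors = {x: [] for x in order}
    for (y, x), weight in edges.items():
        predecessors[x].append((y, weight))

    s = {x: float("-inf") for x in order}
    s[source] = 0
    for x in order:
        # Each edge (y, x) is examined exactly once overall.
        for y, weight in predecessors[x]:
            s[x] = max(s[x], s[y] + weight)
    return s
```

For example, with edges a→b (2), a→c (5), b→c (4), b→d (1), c→d (3) and order `["a", "b", "c", "d"]`, the score of `d` is 9, reached through a→b→c→d.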

  • Traveling in the Grid

    By the time the vertex x is analyzed, the values $s_y$ for all its predecessors y should already be computed; otherwise we are in trouble.

    We need to traverse the vertices in some order.

    For a grid, we can traverse vertices row by row, column by column, or diagonal by diagonal.

  • Traversing the Manhattan Grid

    3 different strategies:
      a) Column by column
      b) Row by row
      c) Along diagonals

    [Figure: three panels a), b), c) showing the traversal orders]

  • Traversing a DAG

    A numbering of the vertices of the graph is called a topological ordering of the DAG if every edge of the DAG connects a vertex with a smaller label to a vertex with a larger label.

    How to obtain a topological ordering?
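
The slides leave this question open. One standard answer, shown here as a sketch and not as the course's own method, is Kahn's algorithm: repeatedly output a vertex that has no remaining incoming edges. The function name `topological_order` is an assumption for this example.

```python
from collections import deque


def topological_order(vertices, edges):
    """Return the vertices in a topological order (Kahn's algorithm).

    vertices : iterable of hashable vertex labels
    edges    : iterable of (y, x) pairs meaning an edge y -> x
    Raises ValueError if the graph contains a cycle.
    """
    vertices = list(vertices)
    indegree = {v: 0 for v in vertices}
    successors = {v: [] for v in vertices}
    for y, x in edges:
        successors[y].append(x)
        indegree[x] += 1

    # Vertices with no incoming edges can be numbered first.
    ready = deque(v for v in vertices if indegree[v] == 0)
    order = []
    while ready:
        y = ready.popleft()
        order.append(y)
        for x in successors[y]:
            indegree[x] -= 1          # "remove" the edge y -> x
            if indegree[x] == 0:
                ready.append(x)

    if len(order) != len(vertices):
        raise ValueError("graph has a cycle, so no topological ordering exists")
    return order
```

The position of each vertex in the returned list can serve as its label: every edge then goes from a smaller label to a larger one.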

  • Alignment

  • Aligning DNA Sequences

    V = ATCTGATG (n = 8)
    W = TGCATAC (m = 7)

    Alignment: a 2 × k matrix (k ≥ m, n), with V spelled out along one row and W along the other.

  • Longest Common Subsequence (LCS): Alignment without Mismatches

    Given two sequences $v = v_1 v_2 \ldots v_m$ and $w = w_1 w_2 \ldots w_n$, the LCS of v and w is a sequence of positions in v: $1 \le i_1 < i_2 < \ldots < i_t \le m$ and a sequence of positions in w: $1 \le j_1 < j_2 < \ldots < j_t \le n$ such that the $i_k$-th letter of v equals the $j_k$-th letter of w for every k from 1 to t, and t is maximal.

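  • LCS: Example

    i coords:       0 1 2 2 3 3 4 5 6 7 8
    elements of v:    A T - C - T G A T C
    elements of w:    - T G C A T - A - C
    j coords:       0 0 1 2 3 4 5 5 6 6 7

    Matches (shown in red in the original figure) are the columns where both rows carry the same letter.

    positions in v: 2 < 3 < 4 < 6 < 8
    positions in w: 1 < 3 < 5 < 6 < 7

    Every common subsequence is a path in the 2-D grid:
    (0,0) (1,0) (2,1) (2,2) (3,3) (3,4) (4,5) (5,5) (6,6) (7,6) (8,7)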

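  • Computing LCS

    Let $v_i$ = prefix of v of length i: $v_1 \ldots v_i$, and $w_j$ = prefix of w of length j: $w_1 \ldots w_j$.

    The length of LCS($v_i$, $w_j$) is computed by:

    $$s_{i,j} = \max \begin{cases} s_{i-1,j} \\ s_{i,j-1} \\ s_{i-1,j-1} + 1 & \text{if } v_i = w_j \end{cases}$$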
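
As an illustration, the LCS length table can be filled in a few lines. The function name `lcs_table` is hypothetical, and Python strings are 0-indexed, so the letter written $v_i$ on the slide is `v[i - 1]` in the code.

```python
def lcs_table(v, w):
    """Return the table s where s[i][j] is the LCS length of v[:i] and w[:j]."""
    n, m = len(v), len(w)
    # Row 0 and column 0 stay 0: an empty prefix shares nothing with anything.
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Skip a letter of v (vertical edge) or of w (horizontal edge) ...
            s[i][j] = max(s[i - 1][j], s[i][j - 1])
            # ... or take the diagonal edge when the letters match.
            if v[i - 1] == w[j - 1]:
                s[i][j] = max(s[i][j], s[i - 1][j - 1] + 1)
    return s
```

For the example sequences ATCTGATC and TGCATAC, the bottom-right entry `lcs_table("ATCTGATC", "TGCATAC")[8][7]` is 5, the length of the common subsequence TCTAC.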

  • LCS Problem as Manhattan Tourist Problem

    [Figure: grid with v = ATCTGATC along the i axis (0 to 8) and w = TGCATAC along the j axis (0 to 7)]

  • Edit Graph for LCS Problem

    [Figure: edit graph with v = ATCTGATC along the i axis (0 to 8) and w = TGCATAC along the j axis (0 to 7)]

    Every path is a common subsequence.

    Every diagonal edge adds an extra element to the common subsequence.

    LCS Problem: Find a path with the maximum number of diagonal edges.

  • Backtracking

    $s_{i,j}$ allows us to compute the length of the LCS for $v_i$ and $w_j$.

    $s_{m,n}$ gives us the length of the LCS for v and w.

    How do we print the actual LCS?

    At each step, we chose an optimal decision $s_{i,j} = \max(\ldots)$. Record which of the choices was made in order to obtain this max.

  • Printing LCS: Backtracking

    PrintLCS(b,v,i,j)
      if i = 0 or j = 0
        return
      if b[i,j] = "diagonal"
        PrintLCS(b,v,i-1,j-1)
        print v_i
      else if b[i,j] = "up"
        PrintLCS(b,v,i-1,j)
      else
        PrintLCS(b,v,i,j-1)
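
A runnable sketch of the record-then-backtrack idea is given below. It is one possible rendering, with assumptions: the names `lcs_with_backtrack` and `read_lcs`, the pointer labels `"diag"`, `"up"`, `"left"`, and the tie-breaking order are all choices made here. The backtrack is written as a loop that returns a string, which avoids deep recursion on long sequences.

```python
def lcs_with_backtrack(v, w):
    """Fill the LCS score table and record which choice produced each max."""
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    b = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best, move = s[i - 1][j], "up"
            if s[i][j - 1] > best:
                best, move = s[i][j - 1], "left"
            if v[i - 1] == w[j - 1] and s[i - 1][j - 1] + 1 > best:
                best, move = s[i - 1][j - 1] + 1, "diag"
            s[i][j], b[i][j] = best, move
    return s, b


def read_lcs(b, v, i, j):
    """Follow the recorded pointers from (i, j) back to the border."""
    letters = []
    while i > 0 and j > 0:
        if b[i][j] == "diag":
            letters.append(v[i - 1])   # this letter is part of the LCS
            i, j = i - 1, j - 1
        elif b[i][j] == "up":
            i -= 1
        else:
            j -= 1
    return "".join(reversed(letters))


def lcs(v, w):
    """Convenience wrapper: return one longest common subsequence."""
    _, b = lcs_with_backtrack(v, w)
    return read_lcs(b, v, len(v), len(w))
```

When several longest common subsequences exist, the tie-breaking order decides which one is returned; any of them has the same length.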

  • From LCS to Alignment

    The Longest Common Subsequence problem, the simplest form of sequence alignment, allows only insertions and deletions (no mismatches).

    In the LCS Problem, we scored 1 for matches and 0 for indels.

    Consider penalizing indels and mismatches with negative scores.

    Simplest scoring scheme:
      +1 : match premium
      -μ : mismatch penalty
      -σ : indel penalty

  • Simple Scoring

    When mismatches are penalized by -μ, indels are penalized by -σ, and matches are rewarded with +1, the resulting score is:

    #matches - μ(#mismatches) - σ(#indels)
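
To make the formula concrete, here is a small sketch that scores an alignment that is already given as two equal-length rows with `-` for gaps. The function name `alignment_score` and the parameter values in the example are assumptions for illustration.

```python
def alignment_score(row_v, row_w, mu, sigma):
    """Score = #matches - mu * #mismatches - sigma * #indels."""
    if len(row_v) != len(row_w):
        raise ValueError("both rows of an alignment must have the same length")
    matches = mismatches = indels = 0
    for a, b in zip(row_v, row_w):
        if a == "-" or b == "-":
            indels += 1            # a letter aligned against a gap
        elif a == b:
            matches += 1
        else:
            mismatches += 1
    return matches - mu * mismatches - sigma * indels
```

For the LCS-style alignment of ATCTGATC and TGCATAC, `alignment_score("AT-C-TGATC", "-TGCAT-A-C", mu=1, sigma=2)` counts 5 matches, 0 mismatches and 5 indels, giving 5 - 0 - 10 = -5.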

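  • The Global Alignment Problem

    Find the best alignment between two strings under a given scoring scheme.

    Input: Strings v and w and a scoring scheme
    Output: Alignment of maximum score

    Edge weights in the edit graph: horizontal and vertical edges (indels) = -σ; diagonal edges = 1 if match, -μ if mismatch.

    $$s_{i,j} = \max \begin{cases} s_{i-1,j-1} + 1 & \text{if } v_i = w_j \\ s_{i-1,j-1} - \mu & \text{if } v_i \neq w_j \\ s_{i-1,j} - \sigma \\ s_{i,j-1} - \sigma \end{cases}$$

    μ : mismatch penalty
    σ : indel penalty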
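
A sketch of the global alignment recurrence follows. The function name `global_alignment_score` is hypothetical. The border initialization (a prefix aligned against nothing costs σ per letter) is not stated on the slide; it is assumed here because the only paths along the border of the edit graph consist of indel edges.

```python
def global_alignment_score(v, w, mu, sigma):
    """Best global alignment score with +1 match, -mu mismatch, -sigma indel."""
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]

    # Assumed border values: only indel edges lead along the first row/column.
    for i in range(1, n + 1):
        s[i][0] = s[i - 1][0] - sigma
    for j in range(1, m + 1):
        s[0][j] = s[0][j - 1] - sigma

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diagonal = s[i - 1][j - 1] + (1 if v[i - 1] == w[j - 1] else -mu)
            s[i][j] = max(diagonal,
                          s[i - 1][j] - sigma,   # v_i aligned against a gap
                          s[i][j - 1] - sigma)   # w_j aligned against a gap
    return s[n][m]
```

With μ = 1 and σ = 2, ACGT against AGGT scores 2 (three matches, one mismatch). Setting μ = σ = 0 reduces the score to the LCS length, so ATCTGATC against TGCATAC scores 5.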

  • Scoring Matrices

    To generalize scoring, consider a (4+1) × (4+1) scoring matrix δ. In the case of an amino acid sequence alignment, the scoring matrix would be of size (20+1) × (20+1). The addition of 1 is to include the score for comparison with a gap character "-".

    This simplifies the algorithm as follows:

    $$s_{i,j} = \max \begin{cases} s_{i-1,j-1} + \delta(v_i, w_j) \\ s_{i-1,j} + \delta(v_i, -) \\ s_{i,j-1} + \delta(-, w_j) \end{cases}$$
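
The δ-based recurrence can be sketched with the scoring function passed in as a parameter. Everything named here is an assumption for illustration: `make_delta`, `align_with_matrix`, the small symmetric score table over the letters A, R, N, K, and in particular the gap score of -8, which is a made-up value.

```python
def make_delta(pair_scores, gap_score):
    """Build delta(a, b) from a symmetric pair table and a single gap score."""
    def delta(a, b):
        if a == "-" or b == "-":
            return gap_score
        # The table stores each unordered pair once, so try both orders.
        return pair_scores.get((a, b), pair_scores.get((b, a)))
    return delta


def align_with_matrix(v, w, delta):
    """Best global alignment score under an arbitrary scoring function delta."""
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        s[i][0] = s[i - 1][0] + delta(v[i - 1], "-")
    for j in range(1, m + 1):
        s[0][j] = s[0][j - 1] + delta("-", w[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s[i][j] = max(s[i - 1][j - 1] + delta(v[i - 1], w[j - 1]),
                          s[i - 1][j] + delta(v[i - 1], "-"),
                          s[i][j - 1] + delta("-", w[j - 1]))
    return s[n][m]


# Illustrative pair scores for four amino acids; the gap score is hypothetical.
PAIR_SCORES = {
    ("A", "A"): 5, ("A", "R"): -2, ("A", "N"): -1, ("A", "K"): -1,
    ("R", "R"): 7, ("R", "N"): -1, ("R", "K"): 3,
    ("N", "N"): 7, ("N", "K"): 0,
    ("K", "K"): 6,
}
delta = make_delta(PAIR_SCORES, gap_score=-8)
```

With these numbers, `align_with_matrix("ARK", "AKK", delta)` is 14: A/A scores 5, R/K scores 3, and K/K scores 6, and no alignment with gaps does better.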

  • Making a Scoring Matrix

    Scoring matrices are created based on biological evidence. Alignments can be thought of as two sequences that differ due to mutations. Some of these mutations have little effect on the protein's function; therefore some penalties, δ($v_i$, $w_j$), will be less harsh than others.

  • Scoring Matrix: Example

    Notice that although R and K are different amino acids, they have a positive score. Why? They are both positively charged amino acids, so substituting one for the other will not greatly change the function of the protein.

         A    R    N    K
    A    5   -2   -1   -1
    R    -    7   -1    3
    N    -    -    7    0
    K    -    -    -    6

  • Conservation

    Amino acid changes that tend to preserve the physico-chemical properties of the original residue:
      Polar to polar: aspartate to glutamate
      Nonpolar to nonpolar: alanine to valine
      Similarly behaving residues: leucine to isoleucine

  • Local vs. Global Alignment

    The Global Alignment Problem tries to find the longest path between vertices (0,0) and (n,m) in the edit graph.

    The Local Alignment Problem tries to find the longest path among paths between arbitrary vertices (i,j) and (i',j') in the edit graph.

    In an edit graph with negatively scored edges, a local alignment may score higher than the global alignment.
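
The text above does not give an algorithm for the local problem. A standard approach, shown here as a sketch, is the Smith-Waterman recurrence: add 0 as a fourth option at every vertex (a path may start anywhere) and take the maximum over the whole table (a path may end anywhere). The function name `local_alignment_score` and the +1 / -μ / -σ scoring are assumptions carried over from the simple scoring scheme.

```python
def local_alignment_score(v, w, mu, sigma):
    """Best local alignment score (Smith-Waterman style recurrence)."""
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diagonal = s[i - 1][j - 1] + (1 if v[i - 1] == w[j - 1] else -mu)
            s[i][j] = max(0,                      # start a fresh path here
                          diagonal,
                          s[i - 1][j] - sigma,
                          s[i][j - 1] - sigma)
            best = max(best, s[i][j])             # the path may end anywhere
    return best
```

For GGGACGT and ACGTCCC with μ = 1 and σ = 2, the local score is 4, coming from the shared substring ACGT. A global alignment of the same two strings must also pay for the unrelated flanking letters, which is why the local score can be higher.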

  • Local Alignment: Example

    [Figure: a global alignment and a local alignment compared]

  • Local Alignments: Why?

    Two gene
