
Needleman-Wunsch and Smith-Waterman Algorithm

Page 1: Needleman-Wunsch and Smith-Waterman Algorithm
Page 2: Needleman-Wunsch and Smith-Waterman Algorithm

BIOTOOLS AND DATABASES

By C. GAYATHRI (I M.Sc. BIOINFORMATICS)

Page 3: Needleman-Wunsch and Smith-Waterman Algorithm

Needleman Wunsch Algorithm

Smith Waterman Algorithm

Page 4: Needleman-Wunsch and Smith-Waterman Algorithm

Published in 1970 by SAUL NEEDLEMAN and CHRISTIAN WUNSCH

General algorithm for sequence comparison

Commonly used in bioinformatics to align protein or nucleotide sequences

An example of dynamic programming; it was the first application of dynamic programming to biological sequence comparison

Page 5: Needleman-Wunsch and Smith-Waterman Algorithm

Scores for aligned characters are specified by a SIMILARITY MATRIX. Here, S(i, j) is the similarity of characters i and j. It uses a LINEAR GAP PENALTY, here called d.

Maximizes a similarity score, to give ‘MAXIMUM MATCH’

Maximum match = largest number of residues of one sequence that can be matched with another allowing for all possible deletions

Finds the best GLOBAL alignment of any two sequences
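A minimal sketch of global alignment scoring with a similarity function S and a linear gap penalty d, as described above. The slides do not spell out this recurrence, so the three-way maximum used here is the common textbook form and is an assumption, as are the names `nw_score` and `simple_S` and the choice to treat d as a positive number subtracted per gap position.

```python
def nw_score(a, b, S, d):
    """Best global alignment score of sequences a and b.

    S(x, y) is the similarity of residues x and y.
    d is the linear gap penalty (assumed positive, subtracted per gap position).
    """
    n, m = len(a), len(b)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = -d * i              # a[:i] aligned entirely against gaps
    for j in range(1, m + 1):
        F[0][j] = -d * j              # b[:j] aligned entirely against gaps
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + S(a[i - 1], b[j - 1]),  # align two residues
                F[i - 1][j] - d,                          # gap in b
                F[i][j - 1] - d,                          # gap in a
            )
    return F[n][m]                    # global: score of the full-length alignment


def simple_S(x, y):
    """Hypothetical similarity function: +1 for identity, -1 otherwise."""
    return 1 if x == y else -1


score = nw_score("ACGT", "AGT", simple_S, d=1)
print(score)
```

With these illustrative values the best alignment pairs A, G and T and opens one gap opposite C.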

Page 6: Needleman-Wunsch and Smith-Waterman Algorithm

N-W involves an iterative matrix method of calculation

All possible pairs of residues (bases or amino acids), one from each sequence, are represented in a 2-dimensional array

All possible alignments (comparisons) are represented by pathways through this array

Page 7: Needleman-Wunsch and Smith-Waterman Algorithm

Three main steps

1. Assign similarity values

2. For each cell, compute the maximum possible score, allowing for insertions and deletions

3. Construct an alignment (pathway) back from the highest scoring cell

Page 8: Needleman-Wunsch and Smith-Waterman Algorithm

Similarity values

A numerical value is assigned to every cell (depending on the similarity/dissimilarity of the two residues)

Scores can be simple or more complicated (chemical similarities or frequency of observed substitutions)

The example shown here has match = +1, mismatch = 0

[Similarity array. Columns: M P R C L C Q R J N C B A. Rows: P B R C K C R N J C J A. A cell holds 1 where the row and column residues are identical and is blank (0) otherwise.]
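The similarity array for this example can be rebuilt directly from the two sequences. This is a small sketch using the slide's match = +1 and mismatch = 0; the function name `similarity_matrix` is made up for illustration.

```python
def similarity_matrix(rows, cols, match=1, mismatch=0):
    """Grid of per-residue similarity values, one row per residue of `rows`."""
    return [[match if r == c else mismatch for c in cols] for r in rows]


cols = "MPRCLCQRJNCBA"   # sequence written across the top of the array
rows = "PBRCKCRNJCJA"    # sequence written down the side
S = similarity_matrix(rows, cols)

print(" ", " ".join(cols))
for residue, row in zip(rows, S):
    # Print "." for 0 so the pattern of matches is easy to see
    print(residue, " ".join(str(v) if v else "." for v in row))
```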

Page 9: Needleman-Wunsch and Smith-Waterman Algorithm

Score pathways through array

The aim is to know the maximum possible score for an alignment

Searches sub rows and sub columns, for the highest score

Adds this to the score for the current cell

Proceeds row by row through the array

Gap penalty for the introduction of gaps in the alignment = 0

    M  P  R  C  L  C  Q  R  J  N  C  B  A
P   0  1  0  0  0  0  0  0  0  0  0  0  0
B   0  0  1  1  1  1  1  1  1  1  1  2  1
R   0  0  2  1  1  1  1  2  1  1  1  1  2
C   0  0  1  3  2  3  2  2  2  2  3  2  2
K   0  0  1  2  3  3  3  3  3  3  3  3  3
C   0  0  1  3  3  4  3  3  3  3  4  3  3
R   0  0  2  2  3  3  4  ?

(Rows N, J, C, J, A are not yet scored and still hold their similarity values.)

$$H_{i,j} = \max\left\{ H_{i-1,j-1} + s(a_i, b_j),\ \max_{k}\left\{ H_{i-k,j-1} - W_k + s(a_i, b_j) \right\},\ \max_{l}\left\{ H_{i-1,j-l} - W_l + s(a_i, b_j) \right\} \right\}$$
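The row-by-row scoring can be sketched as follows, assuming the zero gap penalty stated for this example so that the W terms drop out. For each cell the code looks at the direct diagonal, the sub column and the sub row, then adds the cell's own similarity value. The name `nw_fill` and the zero-based indexing are illustrative choices.

```python
def nw_fill(side, top, match=1, mismatch=0):
    """Score array for the slide's example, with gap penalty = 0."""
    n, m = len(side), len(top)
    H = [[0] * m for _ in range(n)]
    for i in range(n):                      # proceed row by row
        for j in range(m):
            s = match if side[i] == top[j] else mismatch
            best = 0                        # first row/column: nothing to extend
            if i > 0 and j > 0:
                best = H[i - 1][j - 1]      # direct diagonal
                for k in range(2, i + 1):   # sub column: cells above the diagonal
                    best = max(best, H[i - k][j - 1])
                for l in range(2, j + 1):   # sub row: cells left of the diagonal
                    best = max(best, H[i - 1][j - l])
            H[i][j] = s + best              # add the score for the current cell
    return H


H = nw_fill("PBRCKCRNJCJA", "MPRCLCQRJNCBA")
for residue, row in zip("PBRCKCRNJCJA", H):
    print(residue, " ".join(f"{v:2d}" for v in row))
```

Run on the two example sequences, the first rows should reproduce the array above and the last cell should hold the maximum match.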

Page 10: Needleman-Wunsch and Smith-Waterman Algorithm

Construct alignment

The alignment score is cumulative: it is obtained by adding along a path through the array

The best alignment has the highest score i.e. the maximum match

Maximum match = largest number resulting from summing the cell values of every pathway

The maximum match will ALWAYS be somewhere in the outer row or column shown

The alignment is constructed by working backwards from the maximum match

    M  P  R  C  L  C  Q  R  J  N  C  B  A
P   0  1  0  0  0  0  0  0  0  0  0  0  0
B   0  0  1  1  1  1  1  1  1  1  1  2  1
R   0  0  2  1  1  1  1  2  1  1  1  1  2
C   0  0  1  3  2  3  2  2  2  2  3  2  2
K   0  0  1  2  3  3  3  3  3  3  3  3  3
C   0  0  1  3  3  4  3  3  3  3  4  3  3
R   0  0  2  2  3  3  4  5  4  4  4  4  4
N   0  0  1  2  3  3  4  4  5  6  5  5  5
J   0  0  1  2  3  3  4  4  6  5  6  6  6
C   0  0  1  3  3  4  4  4  5  6  7  6  6
J   0  0  1  2  3  3  4  4  6  6  6  7  7
A   0  0  1  2  3  3  4  4  5  6  6  7  8

MP-RCLCQR-JNCBA
 | || | | | | |
-PBRCKC-RNJ-CJA
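As a quick check on the alignment above, the number of identical columns can be counted and compared with the maximum match in the array. This is only a sketch; `match_line` is an illustrative helper, and gaps are assumed to be written as "-".

```python
def match_line(top, bottom):
    """Return a line with '|' under every column holding identical residues."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    return "".join(
        "|" if x == y and x != "-" else " " for x, y in zip(top, bottom)
    )


top = "MP-RCLCQR-JNCBA"
bottom = "-PBRCKC-RNJ-CJA"
bars = match_line(top, bottom)
matches = bars.count("|")

print(top)
print(bars)
print(bottom)
print("matched residues:", matches)
```

With match = +1, mismatch = 0 and no gap penalty, the count of matched columns equals the alignment score.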

Page 11: Needleman-Wunsch and Smith-Waterman Algorithm

Statistical Significance

Maximum match is a function of sequence relationship and composition

Useful to know probability of obtaining result (maximum match) from a pair of random sequences

Estimate this experimentally

Sequences from random proteins are taken (i.e. having the same composition as the real proteins)

If the value for the random proteins is significantly different from that for the real proteins, then the difference is a function of the sequences alone and not of their composition
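One way to sketch this experiment is to shuffle each sequence (which keeps its composition) and rescore many times. Several things here are assumptions not given in the slides: the names `max_match` and `shuffled_scores`, the number of trials, the fixed seed, and the use of the longest common subsequence length as a stand-in for the maximum match under match = +1, mismatch = 0 and zero gap penalty.

```python
import random


def max_match(a, b):
    """Longest common subsequence length, used here as the maximum match."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def shuffled_scores(a, b, trials=200, seed=0):
    """Maximum match for randomly shuffled copies of a and b."""
    rng = random.Random(seed)
    scores = []
    for _ in range(trials):
        ra, rb = list(a), list(b)
        rng.shuffle(ra)              # same composition, random order
        rng.shuffle(rb)
        scores.append(max_match(ra, rb))
    return scores


real = max_match("MPRCLCQRJNCBA", "PBRCKCRNJCJA")
scores = shuffled_scores("MPRCLCQRJNCBA", "PBRCKCRNJCJA")
# Fraction of random pairs scoring at least as high as the real pair
fraction = sum(s >= real for s in scores) / len(scores)
print("real:", real, "mean random:", sum(scores) / len(scores), "fraction:", fraction)
```

A small fraction suggests the real score is unlikely to arise from composition alone.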

Page 12: Needleman-Wunsch and Smith-Waterman Algorithm

Proposed by Temple Smith and Michael Waterman in 1981

Smith-Waterman algorithm is useful for performing local sequence alignment

Determining similar regions between two nucleotide or protein sequences

Page 13: Needleman-Wunsch and Smith-Waterman Algorithm

Instead of looking at entire sequence, it compares segments of all possible lengths and optimizes the similarity measure.

For every cell the algorithm considers ALL possible paths leading to it; these paths can be of any length and can contain insertions and deletions (gaps)

Page 14: Needleman-Wunsch and Smith-Waterman Algorithm

Works effectively only when gap penalties are used

In the example shown: match = +1, mismatch = -1/3, gap = -(1 + k/3), i.e. Wk = 1 + k/3 (k = extent of gap)

Start with all cell values = 0

Looks in the sub column and sub row shown, and in the direct diagonal, for the score that is highest once the alignment score or gap penalty is taken into account

     C    A    G    C    C    U    C    G    C    U    U    A    G
A  0.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  1.0  0.0
A  0.0  1.0  0.7  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  1.0  0.7
U  0.0  0.0  0.8  0.3  0.0  0.0  0.0  0.0  0.0  1.0  1.0  0.0  0.7
G  0.0  0.0  1.0  0.3  0.0  0.0  0.7  1.0  0.0  0.0  0.7  0.7  1.0
C  1.0  0.0  0.0  2.0  1.3  0.3  1.0  0.3  2.0  0.7  0.3  0.3  0.3
C  1.0  0.7  0.0  1.0  3.0  1.7  ?

(Rows A, U, U, G, A, C, G, G are not yet scored.)

$$H_{i,j} = \max\left\{ H_{i-1,j-1} + s(a_i, b_j),\ \max_{k}\left\{ H_{i-k,j} - W_k \right\},\ \max_{l}\left\{ H_{i,j-l} - W_l \right\},\ 0 \right\}$$
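The recurrence can be sketched as below. It assumes the example's scores (match = +1, mismatch = -1/3) and reads the gap penalty as W_k = 1 + k/3. The name `sw_fill` and the extra zero-filled border row and column are implementation choices, not part of the slides.

```python
def sw_fill(side, top, match=1.0, mismatch=-1 / 3, gap=lambda k: 1 + k / 3):
    """Local-alignment score array; row 0 and column 0 are a border of zeros."""
    n, m = len(side), len(top)
    H = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if side[i - 1] == top[j - 1] else mismatch
            diag = H[i - 1][j - 1] + s                               # align residues
            up = max(H[i - k][j] - gap(k) for k in range(1, i + 1))    # gap of length k
            left = max(H[i][j - l] - gap(l) for l in range(1, j + 1))  # gap of length l
            H[i][j] = max(diag, up, left, 0.0)                       # never below zero
    return H


H = sw_fill("AAUGCCAUUGACGG", "CAGCCUCGCUUAG")
for residue, row in zip("AAUGCCAUUGACGG", H[1:]):
    print(residue, " ".join(f"{v:.1f}" for v in row[1:]))
print("highest score:", round(max(max(row) for row in H), 1))
```

The final `0` in the maximum is what lets a new local alignment start at any cell.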

Page 15: Needleman-Wunsch and Smith-Waterman Algorithm

Four possible ways of forming a path

For every residue in the query sequence:

1. Align with the next residue: score = previous score + similarity score

2. Deletion (i.e. match residue of query with a gap): score = previous score - gap penalty dependent on size of the gap

3. Insertion (i.e. match residue of db sequence with a gap): score = previous score - gap penalty dependent on size of the gap

4. Stop when the score is zero

Choose whichever of these has the highest score

Page 16: Needleman-Wunsch and Smith-Waterman Algorithm

Construct Alignment

The score in each cell is the maximum possible score for an alignment of ANY LENGTH ending at those coordinates

Trace pathway back from highest scoring cell

This cell can be anywhere in the array

Align highest scoring segment

     C    A    G    C    C    U    C    G    C    U    U    A    G
A  0.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  1.0  0.0
A  0.0  1.0  0.7  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  1.0  0.7
U  0.0  0.0  0.8  0.3  0.0  0.0  0.0  0.0  0.0  1.0  1.0  0.0  0.7
G  0.0  0.0  1.0  0.3  0.0  0.0  0.7  1.0  0.0  0.0  0.7  0.7  1.0
C  1.0  0.0  0.0  2.0  1.3  0.3  1.0  0.3  2.0  0.7  0.3  0.3  0.3
C  1.0  0.7  0.0  1.0  3.0  1.7  1.3  1.0  1.3  1.7  0.3  0.0  0.0
A  0.0  2.0  0.7  0.3  1.7  2.7  1.3  1.0  0.7  1.0  1.3  1.3  0.0
U  0.0  0.7  1.7  0.3  1.3  2.7  2.3  1.0  0.7  1.7  2.0  1.0  1.0
U  0.0  0.3  0.3  1.3  1.0  2.3  2.3  2.0  0.7  1.7  2.7  1.7  1.0
G  0.0  0.0  1.3  0.0  1.0  1.0  2.0  3.3  2.0  1.7  1.3  2.3  2.7
A  0.0  1.0  0.0  1.0  0.3  0.7  0.7  2.0  3.0  1.7  1.3  2.3  2.0
C  1.0  0.0  0.7  1.0  2.0  0.7  1.7  1.7  3.0  2.7  1.3  1.0  2.0
G  0.0  0.7  1.0  0.3  0.7  1.7  0.3  2.7  1.7  2.7  2.3  1.0  2.0
G  0.0  0.0  1.7  0.7  0.3  0.3  1.3  1.3  2.3  1.3  2.3  2.0  2.0

GCC-UCG
GCCAUUG
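The trace back from the highest scoring cell can be sketched by remembering, for every cell, which earlier cell produced its score. This is an illustrative implementation under the example's scoring (match = +1, mismatch = -1/3, gap penalty read as 1 + k/3). The function name `smith_waterman` and the `back` pointer table are assumptions, not something the slides prescribe.

```python
def smith_waterman(side, top, match=1.0, mismatch=-1 / 3, gap=lambda k: 1 + k / 3):
    """Return (score, aligned_top, aligned_side) for the best local alignment."""
    n, m = len(side), len(top)
    H = [[0.0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]   # None means "stop here"
    best, best_cell = 0.0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if side[i - 1] == top[j - 1] else mismatch
            options = [(0.0, None), (H[i - 1][j - 1] + s, (i - 1, j - 1))]
            options += [(H[i - k][j] - gap(k), (i - k, j)) for k in range(1, i + 1)]
            options += [(H[i][j - l] - gap(l), (i, j - l)) for l in range(1, j + 1)]
            H[i][j], back[i][j] = max(options, key=lambda o: o[0])
            if H[i][j] > best:                        # the best cell can be anywhere
                best, best_cell = H[i][j], (i, j)
    out_top, out_side = [], []
    i, j = best_cell
    while back[i][j] is not None:                     # walk back until a zero cell
        pi, pj = back[i][j]
        if (pi, pj) == (i - 1, j - 1):                # two residues aligned
            out_top.append(top[j - 1])
            out_side.append(side[i - 1])
        elif pj == j:                                 # side residues opposite gaps
            for r in range(i, pi, -1):
                out_top.append("-")
                out_side.append(side[r - 1])
        else:                                         # top residues opposite gaps
            for c in range(j, pj, -1):
                out_top.append(top[c - 1])
                out_side.append("-")
        i, j = pi, pj
    return best, "".join(reversed(out_top)), "".join(reversed(out_side))


score, aligned_top, aligned_side = smith_waterman("AAUGCCAUUGACGG", "CAGCCUCGCUUAG")
print(round(score, 1))
print(aligned_top)
print(aligned_side)
```

Because the walk stops at the first zero cell, only the highest scoring segment is reported rather than the full sequences.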

Page 17: Needleman-Wunsch and Smith-Waterman Algorithm

Needleman-Wunsch

1. Global alignments

2. Requires alignment score for a pair of residues to be >=0

3. No gap penalty required

4. Score cannot decrease between two cells of a pathway

5. Trace back is mostly from the last cell that has the highest score

Smith-Waterman

1. Local alignments

2. Residue alignment score may be positive or negative

3. Requires a gap penalty to work effectively

4. Score can increase, decrease or stay level between two cells of a pathway

5. Trace back is from the cell that has the highest score

Page 18: Needleman-Wunsch and Smith-Waterman Algorithm

CONCLUSION

Hence, from calculating and working many times on these algorithms with different organisms, it is found that the NW and SW algorithms are excellent methods for finding the similarity and dissimilarity between the sequences of different organisms.

Page 19: Needleman-Wunsch and Smith-Waterman Algorithm