Needleman-Wunsch and Smith-Waterman Algorithm

BIOTOOLS AND DATABASES

By C. GAYATHRI (I M.Sc. BIOINFORMATICS)

Needleman Wunsch Algorithm

Smith Waterman Algorithm

Published in 1970 by SAUL NEEDLEMAN and CHRISTIAN WUNSCH

General algorithm for sequence comparison

Commonly used in bioinformatics to align protein or nucleotide sequences

An example of dynamic programming; it was the first application of dynamic programming to biological sequence comparison.

Scores for aligned characters are specified by a SIMILARITY MATRIX. Here, S(i, j) is the similarity of characters i and j. It uses a LINEAR GAP PENALTY, here called d.

Maximizes a similarity score, to give ‘MAXIMUM MATCH’

Maximum match = largest number of residues of one sequence that can be matched with those of another, allowing for all possible deletions

Finds the best GLOBAL alignment of any two sequences

N-W involves an iterative matrix method of calculation

All possible pairs of residues (bases or amino acids), one from each sequence, are represented in a 2-dimensional array

All possible alignments (comparisons) are represented by pathways through this array

Three main steps

1. Assign similarity values

2. For each cell, find the maximum possible score, allowing for insertions and deletions

3. Construct an alignment (pathway) back from the highest scoring cell

Similarity values

A numerical value is assigned to every cell (depending on the similarity/dissimilarity of the two residues)

Values may be simple scores or more complicated ones (chemical similarities or frequency of observed substitutions)

The example shown here has match = +1, mismatch = 0

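    M  P  R  C  L  C  Q  R  J  N  C  B  A
P   0  1  0  0  0  0  0  0  0  0  0  0  0
B   0  0  0  0  0  0  0  0  0  0  0  1  0
R   0  0  1  0  0  0  0  1  0  0  0  0  0
C   0  0  0  1  0  1  0  0  0  0  1  0  0
K   0  0  0  0  0  0  0  0  0  0  0  0  0
C   0  0  0  1  0  1  0  0  0  0  1  0  0
R   0  0  1  0  0  0  0  1  0  0  0  0  0
N   0  0  0  0  0  0  0  0  0  1  0  0  0
J   0  0  0  0  0  0  0  0  1  0  0  0  0
C   0  0  0  1  0  1  0  0  0  0  1  0  0
J   0  0  0  0  0  0  0  0  1  0  0  0  0
A   0  0  0  0  0  0  0  0  0  0  0  0  1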
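As a minimal sketch of step 1, the similarity values for this example can be generated directly from the two sequences. The names `seq_cols`, `seq_rows` and `similarity` are illustrative assumptions, not part of the original method.

```python
# Sequences from the example: columns follow one sequence, rows the other.
seq_cols = "MPRCLCQRJNCBA"
seq_rows = "PBRCKCRNJCJA"

def similarity(x, y, match=1, mismatch=0):
    """Score for pairing residues x and y (match = +1, mismatch = 0 here)."""
    return match if x == y else mismatch

# One row per residue of seq_rows, one column per residue of seq_cols.
S = [[similarity(r, c) for c in seq_cols] for r in seq_rows]

for residue, row in zip(seq_rows, S):
    print(residue, *row)
```

A substitution table (chemical similarity or observed substitution frequencies) could replace `similarity` without changing the rest of the procedure.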

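Score pathways through the array to find the maximum possible score for an alignment

For each cell, searches the sub-row and sub-column (the part of the previous row to its left and the part of the previous column above it) for the highest score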

Adds this to the score for the current cell

Proceeds row by row through the array

Gap penalty for the introduction of gaps in the alignment = 0

    M  P  R  C  L  C  Q  R  J  N  C  B  A
P   0  1  0  0  0  0  0  0  0  0  0  0  0
B   0  0  1  1  1  1  1  1  1  1  1  2  1
R   0  0  2  1  1  1  1  2  1  1  1  1  2
C   0  0  1  3  2  3  2  2  2  2  3  2  2
K   0  0  1  2  3  3  3  3  3  3  3  3  3
C   0  0  1  3  3  4  3  3  3  3  4  3  3
R   0  0  2  2  3  3  4  ?
(rows N, J, C, J, A not yet scored)

\[ H_{i,j} = \max\left\{ H_{i-1,j-1} + s(a_i,b_j),\ \max_{k}\left\{ H_{i-k,j-1} - W_k + s(a_i,b_j) \right\},\ \max_{l}\left\{ H_{i-1,j-l} - W_l + s(a_i,b_j) \right\} \right\} \]
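The scoring step can be sketched in Python as follows. This is an illustration under stated assumptions rather than a definitive implementation: the function names are hypothetical, the gap penalty is fixed at 0 as in the example, and cells on the first row or column are assumed to have no predecessor (so they simply take their similarity value).

```python
def similarity(x, y):
    """Example scores: match = +1, mismatch = 0."""
    return 1 if x == y else 0

def gap_penalty(k):
    """Assumption: no gap penalty, as in the example (W_k = 0)."""
    return 0

def score_array(seq_cols, seq_rows, s=similarity, w=gap_penalty):
    """Fill the array row by row; H[i][j] ends with seq_rows[i] paired to seq_cols[j]."""
    H = [[0] * len(seq_cols) for _ in seq_rows]
    for i, r in enumerate(seq_rows):
        for j, c in enumerate(seq_cols):
            best = 0  # assumption: first row/column cells have no predecessor
            if i > 0 and j > 0:
                best = H[i - 1][j - 1]                       # direct diagonal
                for k in range(2, i + 1):                    # sub-column: H[i-k][j-1]
                    best = max(best, H[i - k][j - 1] - w(k))
                for l in range(2, j + 1):                    # sub-row: H[i-1][j-l]
                    best = max(best, H[i - 1][j - l] - w(l))
            H[i][j] = best + s(r, c)
    return H

H = score_array("MPRCLCQRJNCBA", "PBRCKCRNJCJA")
print(H[6][7])    # the cell marked "?" in the partially scored array
print(H[11][12])  # last cell of the array
```

For the example sequences this gives 5 for the cell marked "?" and 8 for the last cell.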

Construct alignment

The alignment score is cumulative, obtained by adding along a path through the array

The best alignment has the highest score i.e. the maximum match

Maximum match = largest number resulting from summing the cell values of every pathway

The maximum match will ALWAYS be somewhere in the outer row or column shown

The alignment is constructed by working backwards from the maximum match

    M  P  R  C  L  C  Q  R  J  N  C  B  A
P   0  1  0  0  0  0  0  0  0  0  0  0  0
B   0  0  1  1  1  1  1  1  1  1  1  2  1
R   0  0  2  1  1  1  1  2  1  1  1  1  2
C   0  0  1  3  2  3  2  2  2  2  3  2  2
K   0  0  1  2  3  3  3  3  3  3  3  3  3
C   0  0  1  3  3  4  3  3  3  3  4  3  3
R   0  0  2  2  3  3  4  5  4  4  4  4  4
N   0  0  1  2  3  3  4  4  5  6  5  5  5
J   0  0  1  2  3  3  4  4  6  5  6  6  6
C   0  0  1  3  3  4  4  4  5  6  7  6  6
J   0  0  1  2  3  3  4  4  6  6  6  7  7
A   0  0  1  2  3  3  4  4  5  6  6  7  8

MP-RCLCQR-JNCBA
 | || | | | | |
-PBRCKC-RNJ-CJA
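A minimal sketch of the trace-back step, assuming match = +1, mismatch = 0 and no gap penalty as in the example. The function names are hypothetical. When several pathways share the maximum match, the sketch returns just one of them, which need not be the alignment drawn above.

```python
def similarity(x, y):
    return 1 if x == y else 0

def score_array(seq_cols, seq_rows):
    """Each cell = its similarity + highest score in the sub-row and sub-column."""
    H = [[0] * len(seq_cols) for _ in seq_rows]
    for i in range(len(seq_rows)):
        for j in range(len(seq_cols)):
            prev = 0
            if i > 0 and j > 0:
                prev = max(max(H[i - 1][:j]), max(H[k][j - 1] for k in range(i)))
            H[i][j] = prev + similarity(seq_rows[i], seq_cols[j])
    return H

def trace_back(seq_cols, seq_rows, H):
    """Work backwards from the maximum match in the outer row or column."""
    last_i, last_j = len(seq_rows) - 1, len(seq_cols) - 1
    outer = [(H[last_i][j], last_i, j) for j in range(len(seq_cols))]
    outer += [(H[i][last_j], i, last_j) for i in range(len(seq_rows))]
    _, i, j = max(outer)
    top, bottom = [], []  # built right to left
    for x in reversed(seq_cols[j + 1:]):      # unpaired tail of the column sequence
        top.append(x); bottom.append("-")
    for y in reversed(seq_rows[i + 1:]):      # unpaired tail of the row sequence
        top.append("-"); bottom.append(y)
    while True:
        top.append(seq_cols[j]); bottom.append(seq_rows[i])
        if i == 0 or j == 0:
            break
        target = H[i][j] - similarity(seq_rows[i], seq_cols[j])
        if H[i - 1][j - 1] == target:         # direct diagonal, no gap
            i, j = i - 1, j - 1
            continue
        step = next(((k, j - 1) for k in range(i - 2, -1, -1)
                     if H[k][j - 1] == target), None)
        if step is None:
            step = next((i - 1, l) for l in range(j - 2, -1, -1)
                        if H[i - 1][l] == target)
        pi, pj = step
        for y in reversed(seq_rows[pi + 1:i]):  # skipped row residues face gaps
            top.append("-"); bottom.append(y)
        for x in reversed(seq_cols[pj + 1:j]):  # skipped column residues face gaps
            top.append(x); bottom.append("-")
        i, j = pi, pj
    for x in reversed(seq_cols[:j]):          # unpaired head
        top.append(x); bottom.append("-")
    for y in reversed(seq_rows[:i]):
        top.append("-"); bottom.append(y)
    return "".join(reversed(top)), "".join(reversed(bottom))

a, b = "MPRCLCQRJNCBA", "PBRCKCRNJCJA"
top, bottom = trace_back(a, b, score_array(a, b))
matches = sum(1 for x, y in zip(top, bottom) if x == y)
print(top)
print(bottom)
print("matched residues:", matches)
```

The number of matched residues in the returned alignment equals the maximum match, which is 8 for the example sequences.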

Statistical Significance

Maximum match is a function of sequence relationship and composition

Useful to know probability of obtaining result (maximum match) from a pair of random sequences

Estimate this experimentally

Random sequences are taken (i.e. having the same composition as the real proteins)

If the value for the random sequences is significantly different from that for the real proteins, then the difference is a function of the sequences alone and not of their composition
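The experimental estimate can be sketched as a shuffling test. This is an illustration under assumptions, not the original procedure: random sequences are produced by shuffling the real ones (which keeps their composition), the maximum match is computed with match = +1, mismatch = 0 and no gap penalty (under those scores it reduces to the longest common subsequence length), and the names `max_match`, `shuffled` and `significance` are hypothetical.

```python
import random
from statistics import mean, pstdev

def max_match(a, b):
    """Maximum number of matched residues (longest common subsequence length)."""
    prev = [0] * (len(a) + 1)
    for y in b:
        cur = [0]
        for j, x in enumerate(a, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def shuffled(seq, rng):
    """Random sequence with the same composition as seq."""
    residues = list(seq)
    rng.shuffle(residues)
    return "".join(residues)

def significance(a, b, trials=200, seed=0):
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    real = max_match(a, b)
    scores = [max_match(shuffled(a, rng), shuffled(b, rng)) for _ in range(trials)]
    mu, sigma = mean(scores), pstdev(scores)
    return {
        "real": real,
        "random_mean": mu,
        "random_sd": sigma,
        # standard deviations between the real score and the random mean
        "z_score": (real - mu) / sigma if sigma > 0 else None,
        # fraction of random pairs scoring at least as high as the real pair
        "p_value": (1 + sum(s >= real for s in scores)) / (trials + 1),
    }

result = significance("MPRCLCQRJNCBA", "PBRCKCRNJCJA")
print(result)
```

A small `p_value` (or a large `z_score`) suggests the maximum match reflects the order of the residues rather than the composition alone.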

Smith-Waterman Algorithm

Proposed by Temple Smith and Michael Waterman in 1981

Smith-Waterman algorithm is useful for performing local sequence alignment

Determining similar regions between two nucleotide or protein sequences

Instead of looking at entire sequence, it compares segments of all possible lengths and optimizes the similarity measure.

For every cell the algorithm calculates ALL possible paths that can be of any length and contain insertions, deletions and gaps

Works effectively only when gap penalties are used

In the example shown, match = +1, mismatch = -1/3, gap = -(1 + k/3) (k = extent of gap)

Start with all cell values = 0

For each cell, looks in the sub-column and sub-row shown and in the direct diagonal for the score that is highest once the alignment score or gap penalty is taken into account

      C    A    G    C    C    U    C    G    C    U    U    A    G
A   0.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  1.0  0.0
A   0.0  1.0  0.7  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  1.0  0.7
U   0.0  0.0  0.7  0.3  0.0  0.0  0.0  0.0  0.0  1.0  1.0  0.0  0.7
G   0.0  0.0  1.0  0.3  0.0  0.0  0.7  1.0  0.0  0.0  0.7  0.7  1.0
C   1.0  0.0  0.0  2.0  1.3  0.3  1.0  0.3  2.0  0.7  0.3  0.3  0.3
C   1.0  0.7  0.0  1.0  3.0  1.7  ?
(rows A, U, U, G, A, C, G, G not yet scored)

\[ H_{i,j} = \max\left\{ H_{i-1,j-1} + s(a_i,b_j),\ \max_{k}\left\{ H_{i-k,j} - W_k \right\},\ \max_{l}\left\{ H_{i,j-l} - W_l \right\},\ 0 \right\} \]
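A minimal Python sketch of this recurrence, using the example scores (match = +1, mismatch = -1/3, W_k = 1 + k/3). The function names are hypothetical, and exact fractions are an implementation choice made here so that the thirds do not pick up rounding error.

```python
from fractions import Fraction

def s(x, y):
    """Example similarity: match = +1, mismatch = -1/3."""
    return Fraction(1) if x == y else Fraction(-1, 3)

def W(k):
    """Example gap penalty for a gap of extent k: W_k = 1 + k/3."""
    return 1 + Fraction(k, 3)

def sw_fill(seq_cols, seq_rows):
    """H has an extra zero row and zero column at index 0."""
    H = [[Fraction(0)] * (len(seq_cols) + 1) for _ in range(len(seq_rows) + 1)]
    for i in range(1, len(seq_rows) + 1):
        for j in range(1, len(seq_cols) + 1):
            diag = H[i - 1][j - 1] + s(seq_rows[i - 1], seq_cols[j - 1])
            up = max(H[i - k][j] - W(k) for k in range(1, i + 1))    # sub-column
            left = max(H[i][j - l] - W(l) for l in range(1, j + 1))  # sub-row
            H[i][j] = max(diag, up, left, Fraction(0))  # 0 starts a new segment
    return H

a, b = "CAGCCUCGCUUAG", "AAUGCCAUUGACGG"
H = sw_fill(a, b)
best, best_i, best_j = max(
    (H[i][j], i, j) for i in range(1, len(b) + 1) for j in range(1, len(a) + 1)
)
print(float(H[6][7]))              # the cell marked "?" in the partial array
print(float(best), best_i, best_j)
```

For the example sequences the cell marked "?" evaluates to 4/3 (shown as 1.3), and the highest score is 10/3 (shown as 3.3) in row 10, column 8.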

Four possible ways of forming a path

For every residue in the query sequence

1. To align with next residue, score = previous score + similarity score

2. Deletion (i.e. match residue of query with a gap), score = previous score - gap penalty dependent on size of the gap

3. Insertion (i.e. match residue of db sequence with a gap), score = previous score - gap penalty dependent on size of the gap

4. Stop when the score is zero

Choose whichever of these has the highest score

Construct Alignment

The score in each cell is the maximum possible score for an alignment of ANY LENGTH ending at those coordinates

Trace pathway back from highest scoring cell

This cell can be anywhere in the array

Align highest scoring segment

      C    A    G    C    C    U    C    G    C    U    U    A    G
A   0.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  1.0  0.0
A   0.0  1.0  0.7  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  1.0  0.7
U   0.0  0.0  0.7  0.3  0.0  0.0  0.0  0.0  0.0  1.0  1.0  0.0  0.7
G   0.0  0.0  1.0  0.3  0.0  0.0  0.7  1.0  0.0  0.0  0.7  0.7  1.0
C   1.0  0.0  0.0  2.0  1.3  0.3  1.0  0.3  2.0  0.7  0.3  0.3  0.3
C   1.0  0.7  0.0  1.0  3.0  1.7  1.3  1.0  1.3  1.7  0.3  0.0  0.0
A   0.0  2.0  0.7  0.3  1.7  2.7  1.3  1.0  0.7  1.0  1.3  1.3  0.0
U   0.0  0.7  1.7  0.3  1.3  2.7  2.3  1.0  0.7  1.7  2.0  1.0  1.0
U   0.0  0.3  0.3  1.3  1.0  2.3  2.3  2.0  0.7  1.7  2.7  1.7  1.0
G   0.0  0.0  1.3  0.0  1.0  1.0  2.0  3.3  2.0  1.7  1.3  2.3  2.7
A   0.0  1.0  0.0  1.0  0.3  0.7  0.7  2.0  3.0  1.7  1.3  2.3  2.0
C   1.0  0.0  0.7  1.0  2.0  0.7  1.7  1.7  3.0  2.7  1.3  1.0  2.0
G   0.0  0.7  1.0  0.3  0.7  1.7  0.3  2.7  1.7  2.7  2.3  1.0  2.0
G   0.0  0.0  1.7  0.7  0.3  0.3  1.3  1.3  2.3  1.3  2.3  2.0  2.0

GCC-UCG
GCCAUUG
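The trace-back for the local alignment can be sketched as below, again with the example scores (match = +1, mismatch = -1/3, W_k = 1 + k/3). The function names are hypothetical and exact fractions are an implementation choice, so that equality tests during the trace-back are reliable.

```python
from fractions import Fraction

def s(x, y):
    return Fraction(1) if x == y else Fraction(-1, 3)

def W(k):
    return 1 + Fraction(k, 3)

def sw_align(seq_cols, seq_rows):
    n, m = len(seq_rows), len(seq_cols)
    H = [[Fraction(0)] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = H[i - 1][j - 1] + s(seq_rows[i - 1], seq_cols[j - 1])
            up = max(H[i - k][j] - W(k) for k in range(1, i + 1))
            left = max(H[i][j - l] - W(l) for l in range(1, j + 1))
            H[i][j] = max(diag, up, left, Fraction(0))
    # The highest scoring cell can be anywhere in the array.
    best, i, j = max((H[i][j], i, j) for i in range(n + 1) for j in range(m + 1))
    top, bottom = [], []  # built right to left
    while H[i][j] > 0:    # stop when the score is zero
        if H[i][j] == H[i - 1][j - 1] + s(seq_rows[i - 1], seq_cols[j - 1]):
            top.append(seq_cols[j - 1]); bottom.append(seq_rows[i - 1])
            i, j = i - 1, j - 1
            continue
        k = next((k for k in range(1, i + 1) if H[i][j] == H[i - k][j] - W(k)), None)
        if k is not None:  # gap of extent k in the column sequence
            for _ in range(k):
                i -= 1
                top.append("-"); bottom.append(seq_rows[i])
            continue
        l = next(l for l in range(1, j + 1) if H[i][j] == H[i][j - l] - W(l))
        for _ in range(l):  # gap of extent l in the row sequence
            j -= 1
            top.append(seq_cols[j]); bottom.append("-")
    return "".join(reversed(top)), "".join(reversed(bottom)), best

top, bottom, score = sw_align("CAGCCUCGCUUAG", "AAUGCCAUUGACGG")
print(top)
print(bottom)
print(float(score))
```

For the example sequences this yields the segment pair GCC-UCG / GCCAUUG with a score of 10/3.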

Needleman-Wunsch

1. Global alignments

2. Requires alignment score for a pair of residues to be >=0

3. No gap penalty required

4. Score cannot decrease between two cells of a pathway

5. Trace back is mostly from the last cell that has the highest score

Smith-Waterman

1. Local alignments

2. Residue alignment score may be positive or negative

3. Requires a gap penalty to work effectively

4. Score can increase, decrease or stay level between two cells of a pathway

5. Trace back is from the cell that has the highest score

CONCLUSION

Hence, from working through these algorithms many times on sequences from different organisms, it is found that the NW and SW algorithms are excellent methods for finding the similarity and dissimilarity between the sequences of different organisms
