26
BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio. 1990. CS 466 Saurabh Sinha

BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio. 1990. CS 466 Saurabh Sinha

Embed Size (px)

Citation preview

Page 1: BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio. 1990. CS 466 Saurabh Sinha

BLAST:Basic Local Alignment Search Tool

Altschul et al. J. Mol Bio. 1990.

CS 466

Saurabh Sinha

Page 2: BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio. 1990. CS 466 Saurabh Sinha

Motivation

• Sequence homology to a known protein suggest function of newly sequenced protein

• Bioinformatics task is to find homologous sequence in a database of sequences

• Databases of sequences growing fast

Page 3: BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio. 1990. CS 466 Saurabh Sinha

Alignment

• Natural approach to check if the “query sequence” is homologous to a sequence in the database is to compute alignment score of the two sequences

• Alignment score counts gaps (insertions, deletions) and replacements

• Minimizing the evolutionary distance

Page 4: BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio. 1990. CS 466 Saurabh Sinha

Alignment

• Global alignment: optimize the overall similarity of the two sequences

• Local alignment: find only relatively conserved subsequences

• Local similarity measures preferred for database searches– Distantly related proteins may only share isolated

regions of similarity

Page 5: BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio. 1990. CS 466 Saurabh Sinha

Alignment

• Dynamic programming is the standard approach to sequence alignment

• Algorithm is quadratic in length of the two sequences

• Not practical for searches against very large database of sequences (e.g., whole genome)

Page 6: BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio. 1990. CS 466 Saurabh Sinha

Scoring alignments

• Scoring matrix: 4 x 4 matrix (DNA) or 20 x 20 matrix (protein)

• Amino acid sequences: “PAM” matrix– Consider amino acid sequence alignment for

very closely related proteins, extract replacement frequencies (probabilities), extrapolate to greater evolutionary distances

• DNA sequences: match = +5, mismatch = -4

Page 7: BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio. 1990. CS 466 Saurabh Sinha

BLAST: the MSP

• Given two sequences of same length, the similarity score of their alignment (without gaps) is the sum of similarity values for each pair of aligned residues

• Maximal segment pair (MSP): Highest scoring pair of identical length segments from the two sequences

• The similarity score of an MSP is called the MSP score

• BLAST heuristically aims to find this

Page 8: BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio. 1990. CS 466 Saurabh Sinha

Locally maximal segment pair

• A molecular biologist may be interested in all conserved regions shared by two proteins, not just their highest scoring pair

• A segment pair (segments of identical lengths) is locally maximal if its score cannot be improved by extending or shortening in either direction

• BLAST attempts to find all locally maximal segment pairs above some score cutoff.

Page 9: BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio. 1990. CS 466 Saurabh Sinha

Rapid approximation of MSP score

• Goal is to report those database sequences that have MSP score above some threshold S.

• Statistics tells us what is the highest threshold S at which “chance similarities” are likely to appear

• Tractability to statistical analysis is one of the attractive features of the MSP score

Page 10: BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio. 1990. CS 466 Saurabh Sinha

Rapid approximation of MSP score

• BLAST minimizes time spent on database sequences whose similarity with the query has little chance of exceeding this cutoff S.

• Main strategy: seek only segment pairs (one from database, one query) that contain a word pair with score >= T

• Intuition: If the sequence pair has to score above S, its most well matched word (of some predetermined small length) must score above T

• Lower T => Fewer false negatives• Lower T => More pairs to analyze

Page 11: BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio. 1990. CS 466 Saurabh Sinha

Implementation

1. Compile a list of high scoring words

2. Scan database for hits to this word list

3. Extend hits

Page 12: BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio. 1990. CS 466 Saurabh Sinha

Step 1: Compiling list of words from query sequence

• For proteins: List of all w-length words that score at least T when compared to some word in query sequence

• Question: Does every word in the query sequence make it to the list?

• For DNA: list of all w-length words in the query sequence, often with w=12

Page 13: BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio. 1990. CS 466 Saurabh Sinha

Step 2: Scanning the database for hits

• Find exact matches to list words

• Can be done in linear time– two methods (next slides)

• Each word in list points to all occurrences of the word in word list from previous step

Page 14: BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio. 1990. CS 466 Saurabh Sinha

Scanning the database for hits

• Method 1: Let w=4, so 204 possible words• Each integer in 0 … 204-1 is an index for an

array• Array element point to list of all occurrences

of that word in query• Not all 204 elements of array are populated

– only the ones in word list from previous step

Page 15: BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio. 1990. CS 466 Saurabh Sinha

Scanning the database for hits

• Method 2: use “deterministic finite automaton” or “finite state machine”.

• Similar to the keyword trees seen in course.

• Build the finite state machine out of all words in word list from previous step

Page 16: BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio. 1990. CS 466 Saurabh Sinha

Step 3: Extending hits

• Once a word pair with score >= T has been found, extend it in each direction.

• Extend until score >= S is obtained• During extension, score may go up, and then

down, and then up again• Terminate if it goes down too much (a certain

distance below the best score found for shorter extensions)

• One implementation allows gaps during extension

Page 17: BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio. 1990. CS 466 Saurabh Sinha

BLAST: approximating the MSP

• BLAST may not find all segment pairs above threshold S

• Trying to approximate the MSP

• Bounds on the error: not hard bounds, but statistical bounds– “Highly likely” to find the MSP

Page 18: BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio. 1990. CS 466 Saurabh Sinha

Statistics

• Suppose the MSP has been calculated by BLAST (and suppose this is the true MSP)

• Suppose this observed MSP scores S.• What are the chances that the MSP score for

two unrelated sequences would be >= S?• If the chances are very low, then we can be

confident that the two sequences must not have been unrelated

Page 19: BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio. 1990. CS 466 Saurabh Sinha

Statistics

• Given two random sequences of lengths m and n

• Probability that they will produce an MSP score of >= x ?

Page 20: BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio. 1990. CS 466 Saurabh Sinha

Statistics

• Number of separate SPs with score >= x is Poisson distributed with mean y(x) = Kmn exp(-x), where

is the positive solution of

∑pipjexp(s(i,j)) = 1

• K is a constant

• s(i,j) is the scoring matrix, pi is the frequency of i in random sequences

Page 21: BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio. 1990. CS 466 Saurabh Sinha

Statistics

• Poisson distribution:Pr(x) = (e- x)/x!

• Pr(#SPs >= )

= 1 - Pr(#SPs <= -1)

=1−e−yy i

i!i=0

α −1

=1− e−yy i

i!i=0

α −1

Page 22: BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio. 1990. CS 466 Saurabh Sinha

Statistics

• For =1, Pr(#SPs >= 1) = 1-e-y(x)

• Choose S such that 1-e-y(S) is small• Suppose the probability of having at least 1 SP with

score >= S is 0.001.• This seems reasonably small• However, if you test 10000 random sequences, you

expect 10 to cross the threshold• Therefore, require “E-value” to be small.• That is, expected number of random sequence pairs

with score >= S should be small.

Page 23: BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio. 1990. CS 466 Saurabh Sinha

More statistics

• We just saw how to choose threshold S• How to choose T ?• BLAST is trying to find segment pairs

(SPs) scoring above S• If an SP scores S, what is the

probability that it will have a w-word match of score T or more?

• We want this probability to be high

Page 24: BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio. 1990. CS 466 Saurabh Sinha

More statistics: Choosing T

• Given a segment pair (from two random sequences) that scores S, what is the probability q that it will have no w-word match scoring above T?

• Want this q to be low • Obtained from simulations• Found to decrease exponentially as S

increases

Page 25: BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio. 1990. CS 466 Saurabh Sinha

BLAST is the universally used bioinformatics tool

Page 26: BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio. 1990. CS 466 Saurabh Sinha

http://flybase.org/blast/