Part 2- OUTLINE Introduction and motivation How does BLAST work? Query sequence: DNA or protein? Query type BLAST vs. PSI-BLAST


Part 2- OUTLINE

• Introduction and motivation

• How does BLAST work?

• Query sequence: DNA or protein?

• Query type

• BLAST vs. PSI-BLAST

Why BLAST? Finding homologues

• Homology: similarity between sequences that results from a common ancestor.

• Sequences that look alike probably have the same function and structure.

• Use a sequence as a search query in order to find homologous sequences in a database.

• Save time! Exploit the knowledge you have about your homologues to draw conclusions about your query.

Sequences with more than 25% identity for proteins, or 70% for nucleotides, will be considered homologous.


Why BLAST?

Answering basic questions such as:

• Which bacterial species have a protein that is related in lineage to a certain protein with known amino-acid sequence?

• Where does a certain sequence of DNA originate?

• What other genes encode proteins that exhibit structures or motifs such as ones that have just been determined?

Finding homologues

Searching a sequence database

The idea: use your sequence as a query to find homologous sequences in a sequence database.

Database

A sequence taken from Venter’s trip

Why BLAST?

[Figure: the query is searched against the database, and a matching record (hit) is found.]


Why heuristics?

Query vs. database of 10^7 sequences

Assuming 10 comparisons per second, a full comparison of the query to the database requires 11.5 days.

• 11.5 days is OK if we are doing it once.

• At least 150,000 searches are performed per day, against more than 82,000,000 sequence records in GenBank.
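The 11.5-day figure can be checked with a short calculation. This is only a sketch using the numbers quoted on the slide; the function and constant names are made up for illustration.

```python
# Back-of-the-envelope cost of an exhaustive search, using the slide's numbers.
DB_SEQUENCES = 10**7          # sequences in the database
COMPARISONS_PER_SECOND = 10   # assumed pairwise-comparison rate
SECONDS_PER_DAY = 24 * 60 * 60

def exhaustive_search_days(n_sequences, rate_per_second):
    """Days needed to compare one query against every database sequence."""
    return n_sequences / rate_per_second / SECONDS_PER_DAY

days_per_query = exhaustive_search_days(DB_SEQUENCES, COMPARISONS_PER_SECOND)
print(f"{days_per_query:.2f} days per query")
```

The result is about 11.57 days for a single query, which is why an exhaustive comparison cannot serve many searches per day.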

Why BLAST?


Terminology

• Query sequence - the sequence with which we are searching the database

• Hit – a sequence found in the database, suspected as homologous to the query sequence

Why BLAST?

BLAST (Basic Local Alignment Search Tool)

• Goal: A fast search for homologues in a huge database

• One of the most widespread bioinformatics programs:
  • Provides a solution to a fundamental need
  • Emphasizes speed over sensitivity: the databases are enormous and will only grow larger and larger…
  • Cannot guarantee optimal alignment: after finding the homologs via BLAST, an additional alignment program is needed

Altschul, S.F., Gish, W., Miller, W., Myers, E.W., and Lipman, D.J. (1990) “Basic local alignment search tool.” J. Mol. Biol. 215: 403-410

How does BLAST work?

BLAST (Basic Local Alignment Search Tool)

• The underlying hypothesis: when two sequences are similar there are short ungapped regions of high similarity between them

• The heuristic:
  1. Discard irrelevant sequences
  2. Perform exact local alignment only with the remaining sequences

How does BLAST work?

Searching a sequence database

• Idea: In order to find homologous sequences to a sequence of interest, one should compute its pairwise alignment against all known sequences in a database, and detect the best scoring significant homologs

• Query sequence - the sequence with which we are searching

• Hit – a sequence found in the database, suspected as homologous (HSP, high-scoring segment pair: the matched region)

How does BLAST work?

BLAST main paradigm

For each database record and query:
1. Look for common words instead of trying all possible alignments between two sequences.
2. If many common words are found, then the query and the record are possible homologues.

Flowchart:
Retrieve next record from database → Find common words between record and query?
• Yes: possible homologs, save record for further analysis
• No: probably not homologs, discard record
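The paradigm can be sketched as a simple filter loop. This is a minimal illustration, not BLAST's implementation: the function names, the toy records, and the `min_common` cutoff (the slide only says "many" common words) are all assumptions.

```python
def words(seq, w=3):
    """Return the set of overlapping words of length w in seq."""
    return {seq[i:i + w] for i in range(len(seq) - w + 1)}

def filter_records(query, database, w=3, min_common=2):
    """Keep records sharing at least min_common words with the query."""
    query_words = words(query, w)
    kept = []
    for name, record in database.items():        # retrieve next record
        common = query_words & words(record, w)  # find common words
        if len(common) >= min_common:            # yes: possible homolog
            kept.append(name)
        # no: probably not a homolog, so the record is discarded
    return kept

database = {
    "rec1": "MKWTDFGYPAILKG",   # toy record sharing a long stretch with the query
    "rec2": "MSSQQRRNNHHEEV",   # unrelated toy record
}
print(filter_records("WTDFGYPAILKGGTAC", database))
```

Only the records that survive this cheap test would go on to the expensive alignment step.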


Searching a sequence database

• Inputs:
  • Query sequence
  • Database of sequences
  • Word size (use default…)
  • Substitution matrix (use default…)
  • Gap penalty (use default…)

How does BLAST work?


How does BLAST work?

The parameters:

W: Word size – find W-mers in target/query; 2-3 for amino acids, 6-11 for nucleotides.

T: Threshold – focus on pairs scoring > T; usually 11-13.

X: Drop-off – stop extending when loss > X.

S: Score – the final score of the segment pair.

How do we discard irrelevant sequences quickly?

• Divide the database into words of length w (default: w = 3 for protein and w = 7 for DNA)

• Save the words in a look-up table that can be searched quickly

WTDFGYPAILKGGTAC

WTD TDF DFG FGY GYP …
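The word look-up table can be sketched with a dictionary. This is an illustrative assumption about the data structure (a hash map from word to positions); `build_lookup` and the record name `rec1` are hypothetical.

```python
from collections import defaultdict

def build_lookup(database, w=3):
    """Map every word of length w to the (record, offset) pairs where it occurs."""
    table = defaultdict(list)
    for name, seq in database.items():
        for i in range(len(seq) - w + 1):
            table[seq[i:i + w]].append((name, i))
    return table

# The example sequence from the slide, stored as a single toy record.
lookup = build_lookup({"rec1": "WTDFGYPAILKGGTAC"})
print(lookup["DFG"])
```

Looking up a word is then a constant-time dictionary access instead of a scan over the whole database.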

How does BLAST work?


BLAST: discarding sequences

• When the user enters a query sequence, it is also divided into words.

• For each word, neighbor words are defined according to a scoring matrix (e.g., BLOSUM62 for proteins) with the cutoff level (T)

How does BLAST work?

GFB: GFC (20), GPC (11), WAC (5)

BLAST: discarding sequences

• A list of the possible neighboring words is compiled; only exact matches between these words and words in the database are accepted.

• The words whose scores are greater than the threshold T will remain in the possible matching words list, while those with lower scores will be discarded.

How does BLAST work?
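A rough sketch of how a neighbor-word list could be compiled for one query word. To stay self-contained it uses a toy scoring function (+5 for identity, -1 otherwise) as a stand-in for BLOSUM62, so the scores differ from those on the slide; all names here are hypothetical.

```python
from itertools import product

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def toy_score(a, b):
    """Toy substitution score (NOT BLOSUM62): +5 for identity, -1 otherwise."""
    return 5 if a == b else -1

def neighbor_words(word, threshold, score=toy_score):
    """Return {neighbor: score} for all same-length words scoring > threshold."""
    neighbors = {}
    for candidate in product(ALPHABET, repeat=len(word)):
        s = sum(score(a, b) for a, b in zip(word, candidate))
        if s > threshold:
            neighbors["".join(candidate)] = s
    return neighbors

hits = neighbor_words("GFC", threshold=8)
print(len(hits), hits["GFC"], hits["GPC"])
```

With this toy scoring and T = 8, the word itself and every single-substitution variant pass the threshold, while words with two or more substitutions are discarded. Swapping in a real substitution matrix only changes the `score` argument.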


How does BLAST work?

The algorithm:

1. Align a query sequence with the database.

2. Find “hits”: short word pairs of length W with an ungapped alignment score of at least T.

3. Extend alignments until the score drops more than X below the best score so far. This step consumes most of the processing time (>90%).

Try to extend the alignment

• Stop extending when the score of the alignment drops X beneath the maximal score obtained so far

• Discard segments with score < S

ASKIOPLLWLAASFLHNEQAPALSDAN

JWQEOPLWPLAASOIHLFACNSIFYAS

Score=15

Score=17

Score=14
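The drop-off rule can be sketched as follows. This is a simplified illustration under stated assumptions: it extends only to the right (BLAST extends in both directions), and it uses a toy +2/-1 scoring scheme instead of a substitution matrix, so its scores do not reproduce the ones shown on the slide. The function names are hypothetical.

```python
def toy_score(a, b):
    """Toy scoring (an assumption, not BLOSUM62): +2 match, -1 mismatch."""
    return 2 if a == b else -1

def extend_right(query, record, q_start, r_start, w, x_drop, score=toy_score):
    """Extend a word hit to the right; return (best_score, best_length)."""
    # Score of the seed word itself.
    current = sum(score(query[q_start + k], record[r_start + k]) for k in range(w))
    best, best_len = current, w
    length = w
    while q_start + length < len(query) and r_start + length < len(record):
        current += score(query[q_start + length], record[r_start + length])
        length += 1
        if current > best:
            best, best_len = current, length
        elif best - current > x_drop:   # dropped more than X below the best
            break
    return best, best_len

query = "ASKIOPLLWLAASFLHNEQAPALSDAN"
record = "JWQEOPLWPLAASOIHLFACNSIFYAS"
best, length = extend_right(query, record, q_start=4, r_start=4, w=3, x_drop=2)
print(best, query[4:4 + length])
```

The extension starts from the shared word `OPL` and keeps the longest prefix that achieved the best score before the running score fell more than X below it. A final comparison of `best` against S would decide whether the segment is kept.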

How does BLAST work?


How does BLAST work?

Two-Hit Gapped BLAST

• The goal: a faster algorithm that reduces the number of extensions

• Observations: an HSP is much longer than W and often contains more than one word pair

• Idea: focus on two or more words on the same diagonal

[Figure: query plotted against a database record, with neighbor-word hits marked; A is the distance cutoff between two hits on a diagonal.]

Look for a seed: hits on the same diagonal which can be connected.

At least 2 hits on the same diagonal, with a distance smaller than a predetermined cutoff (A).

This is the filtering stage: many unrelated hits are filtered out, saving lots of time!
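The two-hit filter can be sketched by grouping word hits by diagonal. This is an illustration only: the hit list is invented, the function name is hypothetical, and requiring the two hits not to overlap is an added assumption.

```python
from collections import defaultdict

def two_hit_seeds(hits, w, max_distance):
    """Return (first, second) hit pairs on one diagonal that do not overlap
    and lie closer than max_distance. Each hit is (query_pos, record_pos)."""
    by_diagonal = defaultdict(list)
    for q, r in hits:
        by_diagonal[r - q].append(q)      # hits on one diagonal share r - q
    seeds = []
    for diag, positions in by_diagonal.items():
        positions.sort()
        for first, second in zip(positions, positions[1:]):
            gap = second - first
            if w <= gap < max_distance:   # non-overlapping, within cutoff A
                seeds.append(((first, first + diag), (second, second + diag)))
    return seeds

hits = [(4, 4), (10, 10), (2, 9), (20, 3)]   # toy word hits
print(two_hit_seeds(hits, w=3, max_distance=40))
```

Only the pair on diagonal 0 qualifies as a seed here; the isolated hits on other diagonals are filtered out without any extension being attempted.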

How does BLAST work?

Two-Hit Gapped BLAST

The new gapped BLAST algorithm:
1. Start with the two-hit method:
   (a) Find two hits of score higher than T, within a distance A.
   (b) Invoke an ungapped extension on the second hit.
2. If the HSP generated has an expected score:
   (a) Trigger a gapped extension.
   (b) If the final score has a significant E-value, report the gapped alignment.

How does BLAST work?


The result – local alignment

• The result of BLAST will be a series of local alignments between the query and the different hits found

How does BLAST work?


How does BLAST work?

The scoring system

• BLAST uses BLOSUM62 as the default scoring matrix to perform the alignment.


How does BLAST work?

E-value

• To assess the bit score we calculate the E-value: E-value = the expected number of HSPs with a score of at least S found by chance

• For each score S there is a specific E-value.

A smaller E-value means a better score.
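The relation between raw score, bit score and E-value can be sketched with the Karlin-Altschul formula E = K·m·n·e^(-λS), where m and n are the query and database lengths. The λ and K values below are illustrative placeholders only (real values depend on the scoring matrix and gap costs), and the function names are hypothetical.

```python
import math

def bit_score(raw_score, lam, k):
    """Normalize a raw score S to bits: S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)

def e_value(raw_score, query_len, db_len, lam, k):
    """Karlin-Altschul expectation: E = K * m * n * exp(-lambda * S)."""
    return k * query_len * db_len * math.exp(-lam * raw_score)

# Illustrative parameter values only; real values depend on the scoring system.
LAMBDA, K = 0.318, 0.134
for s in (40, 80, 120):
    print(s, round(bit_score(s, LAMBDA, K), 1), e_value(s, 250, 10**9, LAMBDA, K))
```

The E-value falls exponentially as the score rises, and it grows with the size of the search space, so the same raw score is less significant in a larger database.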


How does BLAST work?

E-value

Theoretically, we could trust any result with an E-value ≤ 1

In practice – BLAST uses estimations.

• E-values of 10^-4 and lower indicate a significant homology.

• E-values between 10^-4 and 10^-2 should be checked (similar domains, maybe non-homologous).

• E-values between 10^-2 and 1 do not indicate a good homology


How does BLAST work?

Low-complexity regions: filter

• Low-complexity region: a region of a sequence that is composed of few kinds of elements.

• These regions might give high scores that prevent the program from finding the actually significant sequences in the database; they should be filtered out with specialized programs.


Query sequence: DNA or protein?

• For coding sequences, we can use the DNA sequence or the protein sequence to search for similar sequences.

• Which is preferable if we want to learn about homology?

Query type

• Nucleotides: a four letter alphabet

• Amino acids: a twenty letter alphabet

• Two random DNA sequences will, on average, have 25% identity

• Two random protein sequences will, on average, have 5% identity
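The 25% and 5% figures follow from the alphabet sizes if every letter is assumed to be equally likely and positions are compared without gaps. A small sketch of that calculation (the function name is hypothetical):

```python
from fractions import Fraction

def expected_identity(alphabet_size):
    """Chance that two independent, uniformly random letters match."""
    # Sum over the k letters of (1/k) * (1/k), which equals 1/k.
    return sum(Fraction(1, alphabet_size) ** 2 for _ in range(alphabet_size))

print(float(expected_identity(4)))    # DNA alphabet
print(float(expected_identity(20)))   # protein alphabet
```

Real sequences have uneven letter frequencies, so the actual background identity is somewhat higher, but the gap between the two alphabets remains.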

Query sequence: DNA or protein?

Query types

Which search is preferable?

1. The genetic code is redundant. Some amino acids are coded by more than one codon. Therefore, the DNA sequence can change while the amino acid sequence will remain the same.

2. Nucleotides: a four letter alphabet. Amino acids: a twenty letter alphabet.

3. Protein comparison matrices are much more sensitive than those for DNA, i.e., similarity relationships are defined between two amino acids (PAM/Blosum).

4. DNA databases are much larger, meaning more random hits.

Query sequence: DNA or protein?


Amino acids are better!

• Selection (and hence conservation) works (mostly) at the protein level:

CTTTCA = Leu-Ser

TTGAGT = Leu-Ser
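The example can be verified by translating both DNA strings with the standard genetic code. This is a minimal sketch; the compact table construction (codons enumerated in T, C, A, G order) and the function name are choices made here for brevity.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons enumerated in TCAG order ('*' marks a stop).
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(dna):
    """Translate a DNA string codon by codon into one-letter amino acids."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))

print(translate("CTTTCA"), translate("TTGAGT"))
```

Both strings give Leu-Ser (`LS`) even though they differ at five of six nucleotide positions, which is why a protein-level comparison detects similarity that a DNA-level comparison would miss.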

Query sequence: DNA or protein?

1. Protein sequence comparisons typically double the evolutionary look-back time over DNA sequence comparisons.

2. Evolutionarily distant proteins will exhibit a high similarity rather than a high identity.

3. Hits can exhibit a long alignment (homology) or a short alignment (conserved domains).

Why use a nucleotide sequence after all?

Amino acids are better!

Query sequence: DNA or protein?


Query type

• The sequence query can be a nucleotide sequence or an amino acid sequence. But … we can translate the query sequence!

• The search is performed against a nucleotide or amino acid database. But … we can use translated databases! (e.g., trEMBL)

All types of searches are possible:

Query: DNA Protein

Database: DNA Protein

• Nucleotide query can be translated and searched against protein databases:
  1. Translate all reading frames (3 + 3)
  2. Find long ORF.

• Can an amino acid query be back-translated and searched against nucleotide databases?
  1. During translation we lose information.
  2. A single amino acid sequence can be back-translated to many possible nucleotide sequences.
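Six-frame translation (3 forward + 3 reverse-complement frames) can be sketched as below. This is a simplified illustration: the "ORF" here is just the longest stop-free stretch, with no start codon required, and the function names and example sequence are hypothetical.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons enumerated in TCAG order ('*' marks a stop).
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def translate(dna):
    """Translate codon by codon, ignoring an incomplete trailing codon."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))

def six_frames(dna):
    """Translate the 3 forward and 3 reverse-complement reading frames."""
    reverse = dna.translate(COMPLEMENT)[::-1]
    return [translate(strand[offset:]) for strand in (dna, reverse) for offset in range(3)]

def longest_orf(dna):
    """Longest stop-free stretch over all six frames (simplified ORF notion)."""
    stretches = [s for frame in six_frames(dna) for s in frame.split("*")]
    return max(stretches, key=len)

frames = six_frames("ATGCTTTCATAA")
print(frames)
print(longest_orf("ATGCTTTCATAA"))
```

Going the other way is not unique: each amino acid in a back-translation could come from any of its synonymous codons, so the number of candidate DNA sequences grows multiplicatively with protein length.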

Query type

1. Amino acid query against protein database (blastp)
   – Identifying a protein sequence
   – Finding similar sequences in protein databases

2. Nucleotide query against nucleotide database (blastn)
   – In non-coding regions (no ORF found): identify the query sequence or find similar sequences
   – Find primer binding sites or map short contiguous motifs

3. Translated nucleotide query against protein database (blastx)
   – Useful when the query includes a coding region, and we try to find homologous proteins
   – Used extensively in analyzing EST sequences. This search is more sensitive than nucleotide BLAST since the comparison is performed at the protein level.

4. Protein query against translated nucleotide database (tblastn)
   – Useful for finding protein homologs in unannotated nucleotide data of coding regions (e.g., ESTs, draft genome records (HTG))

5. Translated nucleotide query against translated nucleotide database (tblastx)
   – Useful for identifying novel genes in error-prone query sequences
   – Used for identifying potential proteins encoded by single-pass read ESTs

Query type


BLAST vs. PSI-BLAST

PSI-BLAST

Position Specific Iterated BLAST

• Use sequence information to build position specific scoring matrices

• More sensitive

• After one BLAST iteration, the PSI-BLAST procedure is invoked for a number of additional iterations

PSI-BLAST

Step 1:
1. Run a standard protein-protein BLAST search (BLOSUM62).
2. Build a position specific scoring matrix (PSSM) according to the MSA of the alignment results with low E-value.

Step 2:
3. Run a BLAST search using the PSSM to evaluate the alignment: PSSM vs. DB instead of seq vs. DB.
4. Update the PSSM according to the new result.
5. Go back to the beginning of step two or stop.

BLAST vs. PSI-BLAST

PSI-BLAST

• Searching with a profile
• Aligning a profile matrix to a simple sequence
  – like aligning two sequences
  – except the score for aligning a character with a matrix position is given by the matrix itself
  – not a substitution matrix
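Scoring a sequence against a profile can be sketched as follows. This is a toy illustration under stated assumptions: the PSSM is built from a made-up four-sequence alignment with a uniform background and a flat pseudocount, whereas PSI-BLAST itself uses sequence weighting and more careful priors; all names are hypothetical.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(msa, pseudocount=1.0):
    """Log-odds PSSM from equal-length aligned sequences, assuming a uniform
    background (a simplification of what PSI-BLAST actually does)."""
    background = 1.0 / len(ALPHABET)
    pssm = []
    for column in zip(*msa):
        counts = Counter(column)
        total = len(column) + pseudocount * len(ALPHABET)
        pssm.append({
            aa: math.log2((counts[aa] + pseudocount) / total / background)
            for aa in ALPHABET
        })
    return pssm

def score_window(pssm, seq, start):
    """Ungapped score of seq[start:start+len(pssm)] against the profile."""
    return sum(col[seq[start + i]] for i, col in enumerate(pssm))

def best_window(pssm, seq):
    """Slide the profile along seq; return (best_score, start)."""
    return max((score_window(pssm, seq, s), s) for s in range(len(seq) - len(pssm) + 1))

msa = ["GFCW", "GFCY", "GYCW", "GFCW"]   # toy alignment of low E-value hits
pssm = build_pssm(msa)
score, start = best_window(pssm, "AAKGFCWLL")
print(round(score, 2), start)
```

Each column carries its own scores, so a substitution at a strictly conserved position (G or C here) costs more than the same substitution at a variable position. That is the difference from a single substitution matrix applied uniformly.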

BLAST vs. PSI-BLAST


Figure from: Altschul et al. Nucleic Acids Research 25, 1997

PSI-BLAST

BLAST vs. PSI-BLAST


Testing PSI-BLAST

Compare sensitivity and speed of:

• Smith-Waterman

• Original BLAST

• Gapped BLAST

• PSI-BLAST

BLAST vs. PSI-BLAST

Testing PSI-BLAST

• All but one are true homologs
• PSI-BLAST is faster and more sensitive
• Other BLAST algorithms are good as well

BLAST vs. PSI-BLAST


The power of PSI-BLAST:

1. A much more sensitive scoring system: each position has its own pattern probabilities.

2. Different weight to conserved positions.

3. Important motifs are bounded

4. Lowers the level of random noise.

5. Finding distant relatives.

BLAST vs. PSI-BLAST

Let's sum up…

- BLAST is a fast way to find homologues

- No analytic theory that estimates the statistical significance of gapped alignments

- Gap scores have been selected by trial and error; when applying a different scoring matrix, there is no guarantee for the gap scores

- PSI-BLAST finds weak homologues fast

BLAST vs. PSI-BLAST


• Where? (to find homologues)

• Structural templates- search against the PDB

• Sequence homologues- search against SwissProt or Uniprot or UniRef90 (recommended!)

• How many?

As many as possible, as long as the MSA looks good (examples in the next hour…)

Finding & selecting homologues

• How long? (length of homologues)

• Fragments: short homologues (less than 50-60% of the query’s length) = bad alignment

• Ensure your sequences exhibit the wanted domain(s)

• The N- and C-termini tend to vary in length between homologues

• Can use HSPs or full sequences, depends on which case you are working on…

• How close? (distance from query sequence)

• All too close- no information

• Too many too far- bad alignment

• Ensure that you have a balanced collection!

Finding & selecting homologues

• From whom? (which species the sequence belongs to)

• Don’t care, all homologues are welcome

• Orthologues/paralogues may be helpful

• Sequences from distant/close species provide different types of information

• Which method? (BLAST/PSI-BLAST)

Depends on the protein, available homologues, the goal in mind…

Finding & selecting homologues