Finding Sequence Similarities

>query
AGACGAACCTAGCACAAGCGCGTCTGGAAAGACCCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGTTCAGGAGTATTTGGACTGCAATATTGGCCCTCGTTCAAGGGCGCCTACCATCACCCGACGGTCATGCCGGTCCCCAGCAGCTGCTAATAACTTCCTTCGCTACTCAAGTTACCACGCTAGCAAAACCCACGGCATACCGTTTACCCTTTAAAATCAGCTTCAACCAGCAACGAA

There are many programs used to do this. They range from relatively slow programs that find the exact best-matching alignment to ones that take progressively less exact shortcuts to speed things up. Of this latter class, the best known, and easily the most widely used, is BLAST, developed by Stephen Altschul and others and continuously refined over the last 10-15 years.

The essential idea is to compare your query sequence against a collection, or ‘database’, of target sequences, looking for the one(s) that match the query sequence best.

>target1
AAAACAGGAATATTTACCGGGACCGGGTAATGATGCATCTCGAGGTACACAATATACCTG
GAGAACCGAATTATGAGTTGGCCACCTTACTTAACGAAACCAGCAGAGAAAATCCAACAT
GGCAACACCCCTCTGACTACACTAGAAGGAACTACTATGTAAGAAAACAGCCTGTCCCTT
GCAGTTTGAATGACTGGGTGATGCGAAATGGGGGTCCTGCCATAGAGCGCTTCCATGGTT
TACCTTGCACATTTCAGAGAAGTCCTATGCCAGGAGTCCTTCCTACAGGGCCTTCCTGAA
ACTATATATGTGCTTATTCTTGTTTGATTTGGCTTTGCAG
>target2
CTCTTAATTTATTTCTCTTCCTGCAGCTCCCTCGCTTTTTCCTTTCCCTGTTACATTCAT
CTGACTTGAAGAGTTGCAAATTTTCAGTGTTTCTGTTTTTGTTGCTGATATGTTGTAAAC
TTTTTAATAAAATCTATTTCTATAG
>target3
GCAGTTTGAATGACTGGGTGATGCGAAATGGGGGTCCTGCCATAGAGCGCTTCCATGGTT
TACCTTGCACATTTCAGAGAAGTCCTATGCCAGGAGTCCTTCCTACAGGGCCTTCCTGAA
ACTATATATGTGCTTATTCTTGTTTGATTTGGCTTTGCAGCTAGGGTTTTCACCTTTTCT
GGAAAAAAAAATACTGGCTTCC
>target4
CTGCTATTAATGGGCAAAACAACTCAAATAAAGTCCCTCTGCCACCCTCAGACACTGCCC
CTGGCCCCCAGCTGCCCGCTGATCCTTGTAGCCAGAGCAGTAAAGTTTTGAAAGTGGAGC
CCAAGGAGAATAAAGTTATTAAAGAAACTGGCTTTGAACAAGGTGAAAAGTCTTGTGCAG
CACCTCTAGATCATACTGTGAAGGAAAATCTTGGACAAACTTCTAAAGAACAGGTGGTAG

(Diagram: query + database → COMPARE → LIST MATCHES)

Flavours of BLAST

BLAST can perform a number of similar tasks with different types of sequence:

BLASTn – compares a nucleotide sequence against a nucleotide sequence database – FAST

BLASTp – compares a protein sequence against a protein sequence database – FAST

BLASTx – compares a nucleotide sequence against a protein sequence database by translating the nucleotide sequence in all possible reading frames – SLOW

tBLASTn – compares a protein sequence against a nucleotide sequence database translated into all possible reading frames – SLOWER

tBLASTx – compares a nucleotide sequence against a nucleotide sequence database, translating both into all possible reading frames – EXCRUCIATINGLY SLOW

The amino acid sequence-based programs use a substitution matrix to allow some amino acids to count as effective matches with each other. These are the BLOSUM and PAM matrices you may see referred to from time to time.

How does it work?

The main task of any sequence comparison program is to test all possible mutual alignments of two sequences and see how good the match is.

query                  CTTCTCATTGCTCTTCCTAACAGTGATGATAGGCTAACCGTAATGGCGTTCAGGAGT
1st database sequence  CCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGTTC

(Figure: the query is tried at successive offsets against the 1st database sequence. Without gaps only part of the query lines up at any one offset; with a one-letter gap, shown as ‘=’, both halves align.)

1st database sequence  CCGAGCTTCTCATTGCTCTTCCTAACAGTG=TGATAGGCTAACCGTAATGGCGTTC
                            ||||||||||||||||||||||||| |||||||||||||||||||||||||
query                       CTTCTCATTGCTCTTCCTAACAGTGATGATAGGCTAACCGTAATGGCGTTCAGGAGT

BLAST achieves its speed through two strategies:

It ‘indexes’ the database sequences, so it knows where all the short subsequences are in each sequence and doesn’t have to look all the way through each sequence, letter by letter, each time.

It’s ‘word based’: it only starts looking for possible extensive alignments once it has found a seed alignment, that is, an exact match of a short word. The default seed lengths are 11 letters for BLASTn and 3 for BLASTp. This means that some good alignments are unfindable, e.g. a 50% protein match with exactly every second amino acid matching. BLAST relies on these ‘uniformly distributed’ alignments being very rare occurrences.
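The two strategies can be illustrated with a small Python sketch. This is not BLAST's actual implementation; the function names build_index and find_seeds are made up for illustration, and only exact 11-letter words are used as seeds, as described for BLASTn above. It uses the query and 1st database sequence from this slide.

```python
from collections import defaultdict


def build_index(sequence, word_size):
    """Map every word of length word_size to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(sequence) - word_size + 1):
        index[sequence[pos:pos + word_size]].append(pos)
    return index


def find_seeds(query, index, word_size):
    """Return (query_pos, db_pos) pairs where a query word matches exactly."""
    seeds = []
    for qpos in range(len(query) - word_size + 1):
        word = query[qpos:qpos + word_size]
        for dpos in index.get(word, []):
            seeds.append((qpos, dpos))
    return seeds


# Sequences from the slide (the database sequence is written without the '=' gap).
database = "CCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGTTC"
query = "CTTCTCATTGCTCTTCCTAACAGTGATGATAGGCTAACCGTAATGGCGTTCAGGAGT"

WORD_SIZE = 11  # BLASTn default seed length quoted in the slide
index = build_index(database, WORD_SIZE)
seeds = find_seeds(query, index, WORD_SIZE)

# Seeds on the same diagonal (db_pos - query_pos) belong to one ungapped alignment.
diagonals = sorted({dpos - qpos for qpos, dpos in seeds})
print(diagonals)
```

The seeds fall on two neighbouring diagonals, one for each half of the query, which is the one-letter gap in the figure seen from the index's point of view. The index is built once per database, so each query word is a dictionary lookup rather than a letter-by-letter scan.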

BLAST – Typical Output

INPUT

>partial cDNA sequence, Xenopus tropicalis
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCGTTCCCACCTCTCCTCTTTCACCATGAAGCTCAAGGACAAATTCCACTCCCCCAAAATCAAGCGCACCCCGTCCAAGAAGGGGAAGCCGGCCGACCTCACCGTCAAAACAGAAGAGAAACCCGTCAACAAAACCTTAAGCCGCTTGGAGGAACAGGAGAAAGAAGTCGTTAATGCCTTGCGTTACTTTAAGACAATTGTTGACAAGATGGCGGTGGACAAGATGGTGCTGGTGATGCTGCCAGGGTCGGCGA

OUTPUT

Query= (311 letters)
Database: NCBI Protein Reference Sequences
954,378 sequences; 347,895,532 total letters

>gi|41055060|ref|NP_957420.1| similar to guanine nucleotide-releasing factor 2 (specific for crk proto-oncogene) [Danio rerio]
Length=691

Score = 133 bits (335), Expect = 6e-31
Identities = 76/98 (77%), Positives = 82/98 (83%), Gaps = 4/98 (4%)
Frame = +2

Query  26   MSGKIE-KADSQRSHLSSFTMKLKDKFHSPKIKRTPSKKGKPA--DLTVKTEEKPVNKTL  196
            MSGKIE K +SQ+SHLSSFTMKL  KFHSPKIKRTPSKKGK    +  VKT EKPVNK +
Sbjct  1    MSGKIESKHESQKSHLSSFTMKLM-KFHSPKIKRTPSKKGKQLQPEPAVKTPEKPVNKKV  59

Query  197  SRLEEQEKEVVNALRYFKTIVDKMAVDKMVLVMLPGSA  310
            SRLEEQEK+VV+ALRYFKTIVDKM VD  VL MLPGSA
Sbjct  60   SRLEEQEKDVVSALRYFKTIVDKMNVDTKVLQMLPGSA  97
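The ‘Frame = +2’ line means the protein match was found in the second forward reading frame of the nucleotide query. As a rough illustration of what translating ‘in all possible reading frames’ involves, here is a minimal Python sketch. It assumes the standard genetic code and input containing only A, C, G and T; the helper names (reverse_complement, translate, six_frames) are illustrative and have nothing to do with BLAST's own code. The test sequence is the first 91 letters of the Xenopus cDNA above.

```python
# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3'."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons from the start of dna ('*' marks a stop)."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE[codon] for codon in codons)


def six_frames(dna):
    """Translate all three forward and all three reverse reading frames."""
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


# First 91 nt of the Xenopus tropicalis partial cDNA used as the query above.
cdna_start = (
    "CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGC"
    "CAGCGTTCCCACCTCTCCTCTTTCACCATGAAGCTC"
)
frames = six_frames(cdna_start)
print(frames["+2"])
```

Frame +2 reads MSGKIEKADSQRSHLSSFTMKL from nucleotide 26 onwards, which is where the reported alignment starts. BLASTx has to search with all six translations, which is part of why it is slower than BLASTn or BLASTp.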

When is a match significant?

RFKISDCQHPCTYSHNQYMTNHMRECPYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV

NFSWKKTSEKETNCQFDYPNDYNEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFNMCWLEVNSS

Here is a ‘typical’ weak alignment from BLASTp:

RF---KISDCQHPCTYSH-NQYMTNHMREC----PYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV
 F   K S+ +  C + + N Y  N   +C    P+      + +W    +P     +     D   I    N      M  ++
NFSWKKTSEKETNCQFDYPNDY--NEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFN------MCWLEVNSS

In fact, the sequences were randomly generated, so there is no biologically significant alignment…

E-values

The E-value is the number of matches like the discovered match that I would expect to find by chance.

An E-value of 0.0 implies that I would expect no matches like this to arise by chance, therefore…

An E-value of 1 implies that I would expect 1 match like this to arise by chance, so if I have a match with such an E-value…

E-values From First Principles

Some database statistics (23rd July 2005):

Database: NCBI RefSeq mRNA – 272,619 sequences; 503,566,580 total letters (~5.0 × 10^8)

Database: NCBI nr – 3,329,110 sequences; 14,601,814,750 total letters (~1.4 × 10^10)

Notation:

1.2e-35 = 1.2 × 10^-35

4.8 × 10^6 = 4,800,000

We will first consider searching a nucleotide sequence (‘ACGTAGACGT’) against a nucleotide database, e.g. the RefSeq mRNA database above.

Then we will consider the more complex case of amino acid sequence (protein) searches, which is of course what we mostly do.

Calculating an E-value

The RefSeq mRNA database has ~5.0 × 10^8 letters. There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG

(Figure: in this 68-letter stretch of database sequence, the query ‘A’ matches 14 times, ‘AC’ 4 times and ‘ACG’ once.)

Query = ‘A’: expected number of matches = (5.0 × 10^8) / 4 = ~1.2 × 10^8

Query = ‘AC’: expected number of matches = (5.0 × 10^8) / (4 × 4) = ~3.1 × 10^7

Query = ‘ACG’: expected number of matches = (5.0 × 10^8) / (4 × 4 × 4) = ~7.8 × 10^6

Query = ‘ACGTCGA…CTGATTCG’, a 60-mer: expected number of matches = (5.0 × 10^8) / (4 × 4 × 4 × 4 … 60 times) ≈ (5.0 × 10^8) / 10^36 = 5.0 × 10^-28 (4^60 is about 1.3 × 10^36, rounded here to 10^36)

E-value = 5.0 × 10^-28
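The arithmetic on this slide can be reproduced with a few lines of Python. This is only the slide's simplified model (every letter equally likely, exact matches only), not the statistic BLAST itself computes; the function name expected_exact_matches is made up for illustration.

```python
def expected_exact_matches(database_letters, alphabet_size, query_length):
    """Expected chance occurrences of one specific word in random sequence.

    Simplified model from the slide: each extra query letter divides the
    expected count by the alphabet size.
    """
    return database_letters / alphabet_size ** query_length


REFSEQ_MRNA_LETTERS = 5.0e8  # rounded database size quoted in the slide

for length in (1, 2, 3, 60):
    e_value = expected_exact_matches(REFSEQ_MRNA_LETTERS, 4, length)
    print(f"{length:>2}-mer: {e_value:.1e}")
```

For the 60-mer the function uses the exact value of 4^60 rather than the rounded 10^36, so it gives roughly 3.8e-28 instead of 5.0e-28; the order of magnitude is the point.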

E-values In Practice

So if I take a 60 nt sequence

>sequence
ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA

and actually BLAST it against the RefSeq mRNA database, I get:

BLAST OUTPUT
>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds
Length=6060

Score = 119 bits (60), Expect = 2e-26
Identities = 60/60 (100%), Gaps = 0/60 (0%)
Strand=Plus/Plus

Query  1     ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  60
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  2977  ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  3036

(The theoretical value was 5.0e-28.)

What do I get if I BLAST it against the larger nr database?

BLAST OUTPUT
>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds
Length=6060

Score = 119 bits (60), Expect = 6e-25
Identities = 60/60 (100%), Gaps = 0/60 (0%)
Strand=Plus/Plus

Query  1     ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  60
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  2977  ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  3036

E-values: Effect of Database Size

The nr nucleotide database has ~1.4 × 10^10 letters. There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?

(Figure: the same stretch of database sequence as before, with the matches to ‘A’, ‘AC’ and ‘ACG’ marked.)

Query = ‘A’: expected number of matches = (1.4 × 10^10) / 4 = ~3.5 × 10^9

Query = ‘AC’: expected number of matches = (1.4 × 10^10) / (4 × 4) = ~8.8 × 10^8

Query = ‘ACG’: expected number of matches = (1.4 × 10^10) / (4 × 4 × 4) = ~2.2 × 10^8

Query = ‘ACGTCGA…CTGATTCG’, a 60-mer: expected number of matches = (1.4 × 10^10) / (4 × 4 × 4 × 4 … 60 times) ≈ (1.4 × 10^10) / 10^36 = 1.4 × 10^-26

E-value = 1.4 × 10^-26

E-values: Effect of Database Size

The E-value is simply dependent on database size. BLAST the same sequence against each database:

Database   Size                   E-value
RefSeq     5.0 × 10^8 letters     5.0e-28
nr         1.4 × 10^10 letters    1.4e-26

nr is ~30× bigger.

The database was ~30 times bigger, and so the E-value was ~30 times bigger.
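A quick check of the ‘~30 times’ figure, using the exact letter counts from the database statistics slide and the two Expect values reported in the BLAST output earlier. This is a small Python sketch; the variable names are made up for illustration.

```python
# Letter counts from the database statistics slide (23rd July 2005).
REFSEQ_MRNA_LETTERS = 503_566_580
NR_LETTERS = 14_601_814_750

# Expect values reported by BLAST for the same 60 nt query.
REFSEQ_EXPECT = 2e-26
NR_EXPECT = 6e-25

size_ratio = NR_LETTERS / REFSEQ_MRNA_LETTERS
expect_ratio = NR_EXPECT / REFSEQ_EXPECT

print(f"database size ratio:     {size_ratio:.1f}")
print(f"reported E-value ratio:  {expect_ratio:.1f}")
```

Both ratios come out close to 30: the same hit, with the same score, looks about 30 times less significant simply because the database searched was about 30 times larger.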

Why were the values different?

Our calculated E-value for searching against the RefSeq mRNA database was 5.0 × 10^-28. But our actual BLAST search at NCBI gave a value of 2.0 × 10^-26, about 40× larger. Why is this?

Gapped alignments

If we were expecting N matches for a query sequence ‘ACGTACGTACGT’, imagine what would happen to N if we allowed gaps in our matches.

ACGTACGTACGT
||||||||||||
ACGTACGTACGT

Allowing gaps would give us additional possible alignments that meet our ‘match’ criteria:

ACGTACAGTACGT    ACGTACCGTACGT    etc.
|||||| ||||||    |||||| ||||||
ACGTAC-GTACGT    ACGTAC-GTACGT

We will expect many more matches in a given database if we allow our alignments to have gaps. The E-value will be larger.
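To get a feel for how quickly the set of acceptable matches grows, the Python sketch below enumerates every 13-letter string that aligns to ‘ACGTACGTACGT’ with a single one-letter gap, as in the examples above. It is a counting illustration only: real BLAST charges a penalty for each gap, so such alignments score lower than the exact match. The function name single_insertion_variants is made up for illustration.

```python
def single_insertion_variants(query, alphabet="ACGT"):
    """All distinct strings made by inserting one letter anywhere in query.

    Each variant aligns to the query with exactly one single-letter gap.
    """
    variants = set()
    for pos in range(len(query) + 1):
        for letter in alphabet:
            variants.add(query[:pos] + letter + query[pos:])
    return variants


variants = single_insertion_variants("ACGTACGTACGT")
print(len(variants))
print("ACGTACAGTACGT" in variants, "ACGTACCGTACGT" in variants)
```

With exact matching only one database string counts as a hit for this query; allowing just one single-letter gap adds 40 more candidate strings, and longer or multiple gaps add many more.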

E-values: Effect of Query Length

BLAST a 500 nt sequence against a database with BLASTn:

>sequence
ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCGATCGATCGCGCATCGATCGTCTAGATCGATCGCTCGCTGTGTAGATAGATCGGCGATAGA

Get a full-length match with sequence XYZ at an E-value of 5.0e-160.

BLAST half of the same sequence against the same database with BLASTn:

>sequence
ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCG

Get a match with sequence XYZ again, but at an E-value of 5.0e-80.

Biologically it’s the same match. Does it mean we are any less sure that this match didn’t occur by chance? The E-value is simply dependent on match length.

Why not just use identity?

At some levels this is a good question.

But consider two very different searches, both of which give a 75% identity match.

Query1 was 60 nt long:

CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG
||||||||||| || | || | || || |||| | |  | ||||||  | ||||||||||
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG

This would have an E-value of ~5.0 × 10^-19.

And Query2 was only 16 nt long:

ACGTACGTACGTACGT
||| || | |||| ||
ACGCACCTTCGTAGGT

This would have an E-value of ~3.0.

And intuitively we feel we would expect to see that sort of number of matches in the database just by chance…
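The two alignments really do have the same identity, which a few lines of Python can confirm. The function name percent_identity is made up for illustration, and it assumes the two sequences are already aligned and of equal length, as they are on this slide.

```python
def percent_identity(aligned_a, aligned_b):
    """Percentage of aligned positions where the two sequences agree."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    matches = sum(a == b for a, b in zip(aligned_a, aligned_b))
    return 100.0 * matches / len(aligned_a)


query1 = "CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG"
match1 = "CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG"

query2 = "ACGTACGTACGTACGT"
match2 = "ACGCACCTTCGTAGGT"

print(len(query1), percent_identity(query1, match1))
print(len(query2), percent_identity(query2, match2))
```

Identity alone cannot tell the 60 nt match (45 of 60 positions) from the 16 nt match (12 of 16 positions); the E-value differs between them because it also accounts for the length of the match.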

So what’s the real problem?

Basically, you are usually trying to answer the question:

Can I find the ortholog of my gene in some other species, so that I can work out what it might be doing in my organism?

The difficulty is that BLAST does not set out to address questions like orthology. BLAST only tells you about sequence similarity, with some notion of how likely a similarity is to have arisen by chance, based on some general biological principles.

You will always have to add in your own knowledge of biology, of exactly what your query sequence was, and of how it is related to your matching sequences; in particular, whether the degree of similarity matches up to the supposed evolutionary distance between the two species. You will also need to take into account the length of the reported match compared to the lengths of your query and matched sequences. And, of course, the size of the database.

Rules of Thumb

Are there any useful guidelines, though? How good does an E-value have to be before we might even think we have an ortholog?

larger = worse  ←→  smaller = better

E-value   10^-5     10^-10       10^-40        10^-100       0.0
          fantasy   borderline   encouraging   pretty good   can’t get better

But note that in some gene families with closely related members you can get an E-value of 0.0 for several different matches, and then identity may be more sensitive. Also bear in mind in cases like this that ideas of ‘functional’ orthology may break down, with more than one locus producing identical proteins which share the same function…

Protein BLAST

It’s (nearly) always better to make comparisons at the amino acid level, between protein sequences, than at the DNA level, because there are many different DNA sequences that can give exactly the same protein sequence. Does this cause us to treat expected values any differently?

If we follow the argument as before, then for an exact match of a 20 amino acid sequence in the RefSeq protein database, each additional amino acid will reduce the E-value to 1/20th (there are 20 different amino acids). And as there are 347,895,532 letters in that database:

E-value = ~3.5 × 10^8 / (20 × 20 × 20 … 20 times) = ~3.5 × 10^-18

But this is what we get if we run the BLAST at NCBI:

Score = 43.1 bits (100), Expect = 8e-04
Identities = 20/20 (100%), Positives = 20/20 (100%), Gaps = 0/20 (0%)
Frame = +3

Query  3    SSSSFRAYRAALSEVEPPCI  62
            SSSSFRAYRAALSEVEPPCI
Sbjct  972  SSSSFRAYRAALSEVEPPCI  991

Really too big a discrepancy to easily explain with hand waving…

Amino Acid Substitutions

Residue   Can be substituted by
A         S
C         (none)
F         L, W, Y
G         (none)
I         L, M, V
L         I, M, F, V
M         I, L, V
P         (none)
V         I, L, M
W         F, Y
N         D, H, S
Q         R, E, H, K
S         A, N, T
T         S
Y         H, F, W
H         N, Q, Y
K         R, Q, E
R         Q, K
D         N, E
E         D, Q, K

In fact we need to take into account amino acid substitutability as well as, as before, allowing gapped alignments. On average any residue can be substituted for by about 2 others, so each position has about a 1/7th chance of ‘matching’ (the residue itself plus about 2 substitutes, i.e. roughly 3 out of 20) rather than 1/20th.

So now we get

E-value = ~3.5 × 10^8 / (7 × 7 × 7 … 20 times) = ~4.4 × 10^-9

which is much closer to the actual BLAST value.
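The ‘about 2 others’ and ‘1/7th’ figures can be checked against the substitution table with a short Python sketch. The dictionary below is the table as read from this slide, and the calculation is the slide's back-of-envelope model, not the statistic BLAST actually reports. All variable names are made up for illustration.

```python
# Substitution table as read from the slide: residue -> acceptable substitutes.
SUBSTITUTES = {
    "A": "S", "C": "", "F": "LWY", "G": "", "I": "LMV",
    "L": "IMFV", "M": "ILV", "P": "", "V": "ILM", "W": "FY",
    "N": "DHS", "Q": "REHK", "S": "ANT", "T": "S", "Y": "HFW",
    "H": "NQY", "K": "RQE", "R": "QK", "D": "NE", "E": "DQK",
}

mean_substitutes = sum(len(s) for s in SUBSTITUTES.values()) / len(SUBSTITUTES)
# The residue itself plus its substitutes, out of 20 possible residues.
chance_of_matching = (1 + mean_substitutes) / 20

REFSEQ_PROTEIN_LETTERS = 347_895_532  # from the BLAST output header
QUERY_LENGTH = 20

exact_only = REFSEQ_PROTEIN_LETTERS / 20 ** QUERY_LENGTH
with_substitutions = REFSEQ_PROTEIN_LETTERS / 7 ** QUERY_LENGTH

print(f"mean substitutes per residue: {mean_substitutes:.1f}")
print(f"chance of 'matching': about 1 in {1 / chance_of_matching:.1f}")
print(f"E-value, exact matches only:  {exact_only:.1e}")
print(f"E-value, with substitutions:  {with_substitutions:.1e}")
```

The table averages 2.3 substitutes per residue, which gives a chance of ‘matching’ of roughly 1 in 6; the slide rounds this to 1 in 7. Using 7 moves the estimate from around 10^-18 to around 10^-9, much nearer the reported 8e-04.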

Exercises

Go to the file random-DNA-sequences.html, randomly select one of the 20 randomly generated nucleotide sequences, and do a BLASTx (translated DNA->protein) at NCBI against the nr protein database.

Did you find any ‘significant’ hits?

Repeat with a second sequence.

What conclusions might you draw from this exercise?

Try the same sequence(s) against the nr nucleotide database.

Is there any general difference?

Page 2: Finding Sequence Similarities >query AGACGAACCTAGCACAAGCGCGTCTGGAAAGACCCGCCAGCTACGGTCACCGAG CTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGTTCAGGA GTATTTGGACTGCAATATTGGCCCTCGTTCAAGGGCGCCTACCATCACCCGACG

Flavours of BLAST BLAST can perform a number of similar tasks with different types of sequence

BLASTn ndash comparing nucleotide sequence vs nucleotide sequence database - FAST

BLASTp ndash comparing protein sequence vs protein sequence database - FAST

BLASTx ndash comparing nucleotide sequence vs protein sequence database by translating the nucleotide sequence in all possible reading frames - SLOW

tBLASTn ndash comparing protein sequence vs nucleotide sequence database translated into all possible reading frames - SLOWER

tBLASTx ndash comparing nucleotide sequence vs nucleotide sequence database translating both into all possible reading frames ndash EXCRUCIATINGLY SLOW

The amino acid sequence based programs use a substitution matrix to allow some amino acids to count as effective matches with each other These are the BLOSUM and PAM matrices you may see referred to from time to time

How does it work The main task of any sequence comparison program is to test all possible mutual alignments of two sequence and see how good the match is

CTTCTCATTGCTCTTCCTAACAGTGATGATAGGCTAACCGTAATGGCGTTCAGGAGT

CCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGTTC | | | | | ||||||||||||||||||||||||| CTTCTCATTGCTCTTCCTAACAGTGATGATAGGCTAACCGTAATGGCGTTCAGGAGTCTTCTCATTGCTCTTCCTAACAGTGATGATAGGCTAACCGTAATGGCGTTCAGGAGT || | | | | | | | | |||||||||||||||||||||||| | | | | | |

CCGAGCTTCTCATTGCTCTTCCTAACAGTG=TGATAGGCTAACCGTAATGGCGTTC||||||||||||||||||||||||| ||||||||||||||||||||||||

query

1st database sequence

BLAST achieves its speed through two strategies

It lsquoindexesrsquo the database sequences so it know where all the minor subsequences are in each sequence so it doesnrsquot have to look all the way through each sequence each time letter by letter

Itrsquos lsquoword basedrsquo so that it will only start looking for possible extensive alignments once itrsquos found a seed alignment of an exact match The default seed lengths are 11 letters for BLASTn and 3 for BLASTp This means that some good alignments are un-findable eg a 50 protein match with exactly every second amino acid matching It relies on these lsquouniformly distributedrsquo alignments being very rare occurrences

BLAST ndashTypical OutputINPUT

gtpartial cDNA sequence Xenopus tropicalisCGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCGTTCCCACCTCTCCTCTTTCACCATGAAGCTCAAGGACAAATTCCACTCCCCCAAAATCAAGCGCACCCCGTCCAAGAAGGGGAAGCCGGCCGACCTCACCGTCAAAACAGAAGAGAAACCCGTCAACAAAACCTTAAGCCGCTTGGAGGAACAGGAGAAAGAAGTCGTTAATGCCTTGCGTTACTTTAAGACAATTGTTGACAAGATGGCGGTGGACAAGATGGTGCTGGTGATGCTGCCAGGGTCGGCGA

OUTPUTQuery= (311 letters) Database NCBI Protein Reference Sequences 954378 sequences 347895532 total letters

gtgi|41055060|ref|NP_9574201| similar to guanine nucleotide-releasing factor 2 (specific for crk proto-oncogene) [Danio rerio]

Length=691

Score = 133 bits (335)Expect = 6e-31 Identities = 7698 (77) Positives = 8298 (83) Gaps = 498 (4) Frame = +2

Query 26 MSGKIE-KADSQRSHLSSFTMKLKDKFHSPKIKRTPSKKGKPA--DLTVKTEEKPVNKTL 196 MSGKIE K +SQ+SHLSSFTMKL KFHSPKIKRTPSKKGK + VKT EKPVNK + Sbjct 1 MSGKIESKHESQKSHLSSFTMKLM-KFHSPKIKRTPSKKGKQLQPEPAVKTPEKPVNKKV 59

Query 197 SRLEEQEKEVVNALRYFKTIVDKMAVDKMVLVMLPGSA 310 SRLEEQEK+VV+ALRYFKTIVDKM VD VL MLPGSA Sbjct 60 SRLEEQEKDVVSALRYFKTIVDKMNVDTKVLQMLPGSA 97

When is a match significant

RFKISDCQHPCTYSHNQYMTNHMRECPYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV

NFSWKKTSEKETNCQFDYPNDYNEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFNMCWLEVNSS

RF---KISDCQHPCTYSH-NQYMTNHMREC----PYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV F K S+ + C + + N Y N +C P+ + +W +P + D I N M ++ NFSWKKTSEKETNCQFDYPNDY--NEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFN------MCWLEVNSS

Here is a lsquotypicalrsquo weak alignment from BLASTp

In fact the sequences were randomly generated so there is no biologically significant alignmenthellip

E-values

The number of matches like the discovered match that I would expect to find by chance

An E-value of 00 implies that I would expect no matches like this to arise by chance thereforehellip

An E-value of 1 implies I would expect 1 match like this to arise by chance so if I have a match with such an E-valuehellip

E-values From First Principles

Some database statistics (23rd July 2005)

Database NCBI RefSeq mRNA 272619 sequences 503566580 total letters (~50 x 108)

Database NCBI nr 3329110 sequences 14601814750 total letters (~14 x 1010)

Notation

12e-35 = 12 x 10-35

48 x 106 = 4800000

We will consider first searching a nucleotide sequence (lsquoACGTAGACGTrsquo) against a nucleotide database eg the RefSeq mRNA above

Then we will consider the more complex case of amino acid sequence (protein) searches Which is of course what we mostly do

Calculating an E-valueThe RefSeq mRNA database has ~ 50 x 108 letters There are 4 possible nucleotides - ACGT How many matches do we expect to find by chance

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGA A A A A AA A A A AA AA

Query = lsquoArsquo

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGQuery = lsquoACrsquo

AC AC AC AC

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGQuery = lsquoACGrsquo

ACG

Expected number of matches = (50 x 108) 4 = ~12 x 108

Expected number of matches = (50 x 108) (4x 4) = ~31 x 107

Expected number of matches = (50 x 108) (4 x 4 x 4) = ~81 x 106

Query = lsquoACGTCGAhellipCTGATTCGrsquo - 60-mer

Expected number of matches = (50 x 108) (4 x 4 x 4 x 4 hellip 60 times ) = (50 x 108) 1036 = 50 x 10-28

E-value = 50 x 10-28

E-values In PracticeSo if I take a 60 nt sequencegtsequenceACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA and actually BLAST it against the RefSeq mRNA database I get

BLAST OUTPUTgtgi|27469838|gb|BC0417101| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1 transcript variant 2 mRNA (cDNA clone MGC49019 IMAGE6051007) complete cds

Length=6060 Score = 119 bits (60) Expect = 2e-26 Identities = 6060 (100) Gaps = 060 (0) Strand=PlusPlus

Query 1 ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA 60 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| Sbjct 2977 ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA 3036

What do I get if I BLAST it against the larger nr databaseBLAST OUTPUTgtgi|27469838|gb|BC0417101| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1 transcript variant 2 mRNA (cDNA clone MGC49019 IMAGE6051007) complete cds

Length=6060 Score = 119 bits (60) Expect = 6e-25 Identities = 6060 (100) Gaps = 060 (0) Strand=PlusPlus

Query 1 ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA 60 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| Sbjct 2977 ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA 3036

theoretical value was 50e-28 -

E-values Effect of Database SizeThe nr mRNA database has ~ 14 x 1010 letters There are 4 possible nucleotides - ACGT How many matches do we expect to find by chance

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGA A A A A AA A A A AA AA

Query = lsquoArsquo

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGQuery = lsquoACrsquo

AC AC AC AC

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGQuery = lsquoACGrsquo

ACG

Expected number of matches = (14 x 1010) 4 = ~12 x 108

Expected number of matches = (14 x 1010) (4x 4) = ~31 x 107

Expected number of matches = (14 x 1010) (4 x 4 x 4) = ~81 x 106

Query = lsquoACGTCGAhellipCTGATTCGrsquo - 60-mer

Expected number of matches = (14 x 1010) (4 x 4 x 4 x 4 hellip 60 times ) = (14 x 1010) 1036 = 14 x 10-26

E-value = 14 x 10-26

E-values Effect of Database Size

The E-value is simply dependent on database size

RefSeq

nr

14 x 1010 letters

50 x108 letters

30 x bigger

BLAST the same sequenceagainst each E-value = 14e-26

E-value = 50e-28

The database was ~30 times bigger and so the E-value was ~30 times bigger

Why were the values differentOur calculated E-value for searching against the RefSeq mRNA database was 50 x 10-28But our actual BLAST search at NCBI gave a value of

20 x 10-26 - about 40x larger - why is this

Gapped alignments

If we were expecting N matches for a query sequence lsquoACGTACGTACGTrsquo imagine what would happen to N if we allowed gaps in our matches

ACGTACGTACGT

This would now give us additional possible alignments that would meet our lsquomatchrsquo criteria

ACGTACGTACGT ACGTACAGTACGT ACGTACCGTACGT etc|||||||||||| |||||| |||||| |||||| ||||||ACGTACGTACGT ACGTAC-GTACGT ACGTAC-GTACGT

We will expect many more matches in a given database if we allow our alignments to have gaps The E-value will be larger

E-values Effect of Query Length

Biologically itrsquos the same match Does it mean we are any less sure that this match didnrsquot occur by chance The E-value is simply dependent on match length

database

BLAST 500 nt sequence against a database

BLASTn Get a full length match with sequence XYZ at an E-value = 50e-160

gtsequence ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCGATCGATCGCGCATCGATCGTCTAGATCGATCGCTCGCTGTGTAGATAGATCGGCGATAGA

database

BLAST half of the same sequence against the same database

BLASTn

gtsequence ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCG

Get a match with sequence XYZ again but at an E-value = 50e-80

Why not just use identityAt some levels this a good question

But consider two very different searches both of which give a 75 identity match

Query1 was 60 nt longCGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG||||||||||| || | || | || || |||| | | | |||||| | ||||||||||CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCGWhich would have an E-value ~ 50 x 10-19

And Query2 only 16 nt longACGTACGTACGTACGT||| || | |||| ||ACGCACCTTCGTAGGTWhich would have an E-value ~ 30

And intuitively we feel we would expect to see that sort of number of matches in the database just by chancehellip

So whatrsquos the real problemBasically you are usually trying to answer the question

Can I find the ortholog of my gene in some other species so that I can work out what it might be doing in my organism

And the difficulty is because BLAST does not set out to address questions like orthology BLAST only tells you about sequence similarity with some notion of how likely a similarity is to have arisen by chance based on some general biological principles

You will always have to add in your own knowledge of biology and exactly what your query sequence was and how it is related to your matching sequences In particular whether the degree of similarity matches up to the supposed evolutionary distance between the two species You will also need to take into account the length of the reported match compared to the lengths of your query and matched sequences And of course the size of the database

Are there any useful guidelines though

Basically you are usually trying to answer the question

Can I find the ortholog of my gene in some other species so that I can work out what it might be doing in my organism

Rules of ThumbHow good does an E-value have to be before we might even think we have an ortholog

largerworse smallerbetter

E-values 10-5 10-10 10-40 10-100 00

fantasy borderline encouraging

pretty good canrsquot get

better

But note that in some gene families with closely related members you can get an E-value of 00 for several different matches and then identity may be more sensitive Also bear in mind in cases like this that ideas of lsquofunctionalrsquo orthology may break down with more than one locus producing identical proteins which share the same functionhellip

Protein BLASTItrsquos (nearly) always better to make comparisons at the amino acid level between protein sequences than the DNA level because there are many different DNA sequences that can give exactly the same protein sequenceDoes this cause us to treat expected values any differently

If we follow the argument as before then for an exact match of a 20 amino acid sequence in the RefSeq protein database each additional amino acid will reduce the E-value by 120th (there are 20 different amino acids) And as there are 347895532 letters in that databaseE-value = ~35 x 108 (20 x 20 x 20 hellip20 times) = ~35 x 10-18

But this is what we get of we run the blast at NCBI

Score = 431 bits (100) Expect = 8e-04 Identities = 2020 (100) Positives = 2020 (100) Gaps = 020 (0) Frame = +3

Query 3 SSSSFRAYRAALSEVEPPCI 62 SSSSFRAYRAALSEVEPPCISbjct 972 SSSSFRAYRAALSEVEPPCI 991

Really too big a discrepancy to easily explain with hand wavinghellip

Amino Acid Substitutions

A SC F LWYG I LMVL IMFVM ILVP V ILMW FY

N DHSQ REHKS ANTT SY HFW

H NQYK RQER QK

D NEE DQK

In fact we need to take into account both amino acid substitutability as well as as before allowing gapped alignments On average any residue can be substituted for by about 2 others so each position has about 17th chance of lsquomatchingrsquo rather than 120th

So now we getE-value = ~35 x 108 (7 x 7 x 7 hellip20 times) = ~44 x 10-9which is much closer to the actual BLAST value

ExercisesGo to the file random-DNA-sequenceshtml randomly select one of the 20 randomly generated nucleotide sequences and do a BLASTx (translated DNA-gtprotein) at NCBI against the nr protein database

Did you find any lsquosignificantrsquo hits

Repeat with a second sequence

What conclusions might you draw from this exercise

Try the same sequence(s) against the nr nucleotide database

Is there any general difference

  • Finding Sequence Similarities
  • Flavours of BLAST
  • How does it work?
  • BLAST - Typical Output
  • When is a match significant?
  • E-values
  • E-values From First Principles
  • Calculating an E-value
  • E-values In Practice
  • E-values Effect of Database Size
  • Slide 11
  • Why were the values different?
  • E-values Effect of Query Length
  • Why not just use identity?
  • So what’s the real problem?
  • Rules of Thumb
  • Protein BLAST
  • Amino Acid Substitutions
  • Exercises

How does it work?

The main task of any sequence comparison program is to test all possible mutual alignments of two sequences and see how good the match is.

[Figure: the query is slid along the 1st database sequence one position at a time. At most offsets only scattered letters match; the best alignment needs a one-letter gap (=) in the query.]

query                  CCGAGCTTCTCATTGCTCTTCCTAACAGTG=TGATAGGCTAACCGTAATGGCGTTC
                            ||||||||||||||||||||||||| |||||||||||||||||||||||||
1st database sequence       CTTCTCATTGCTCTTCCTAACAGTGATGATAGGCTAACCGTAATGGCGTTCAGGAGT

BLAST achieves its speed through two strategies:

It ‘indexes’ the database sequences, so it knows where all the short subsequences are in each sequence and doesn’t have to look all the way through each sequence, letter by letter, each time.

It’s ‘word based’, so it will only start looking for possible extensive alignments once it’s found a seed alignment of an exact match. The default seed lengths are 11 letters for BLASTn and 3 for BLASTp. This means that some good alignments are un-findable, e.g. a 50% protein match with exactly every second amino acid matching. It relies on these ‘uniformly distributed’ alignments being very rare occurrences.
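The two strategies can be illustrated with a toy word index in Python. This is a minimal sketch that follows the slide's simplified exact-match description; it is not BLAST's actual implementation. The peptide strings and the function names `build_index` and `find_seeds` are invented for the example.

```python
from collections import defaultdict


def build_index(sequence, word_size):
    """Map every word of length word_size to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(sequence) - word_size + 1):
        index[sequence[i:i + word_size]].append(i)
    return index


def find_seeds(query, index, word_size):
    """Return (query_pos, subject_pos) pairs where an exact word is shared."""
    seeds = []
    for q in range(len(query) - word_size + 1):
        for s in index.get(query[q:q + word_size], []):
            seeds.append((q, s))
    return seeds


WORD_SIZE = 3                 # the BLASTp default seed length quoted above
query = "ACDEFGHIKL"          # hypothetical peptide
alternating = "AWDWFWHWKW"    # 50% identical, but only every second residue matches
clustered = "ACDEFWWWWW"      # also 50% identical, with the matches side by side

print(find_seeds(query, build_index(alternating, WORD_SIZE), WORD_SIZE))
print(find_seeds(query, build_index(clustered, WORD_SIZE), WORD_SIZE))
```

Both subjects are 50% identical to the query, but only the clustered one shares any 3-letter word with it, so only that one would ever be extended into a full alignment.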

BLAST - Typical Output

INPUT

>partial cDNA sequence, Xenopus tropicalis
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCGTTCCCACCTCTCCTCTTTCACCATGAAGCTCAAGGACAAATTCCACTCCCCCAAAATCAAGCGCACCCCGTCCAAGAAGGGGAAGCCGGCCGACCTCACCGTCAAAACAGAAGAGAAACCCGTCAACAAAACCTTAAGCCGCTTGGAGGAACAGGAGAAAGAAGTCGTTAATGCCTTGCGTTACTTTAAGACAATTGTTGACAAGATGGCGGTGGACAAGATGGTGCTGGTGATGCTGCCAGGGTCGGCGA

OUTPUT

Query= (311 letters)
Database: NCBI Protein Reference Sequences
954,378 sequences; 347,895,532 total letters

>gi|41055060|ref|NP_957420.1| similar to guanine nucleotide-releasing factor 2 (specific for crk proto-oncogene) [Danio rerio]
Length=691

Score = 133 bits (335), Expect = 6e-31
Identities = 76/98 (77%), Positives = 82/98 (83%), Gaps = 4/98 (4%)
Frame = +2

Query  26   MSGKIE-KADSQRSHLSSFTMKLKDKFHSPKIKRTPSKKGKPA--DLTVKTEEKPVNKTL  196
            MSGKIE K +SQ+SHLSSFTMKL  KFHSPKIKRTPSKKGK    +  VKT EKPVNK +
Sbjct  1    MSGKIESKHESQKSHLSSFTMKLM-KFHSPKIKRTPSKKGKQLQPEPAVKTPEKPVNKKV  59

Query  197  SRLEEQEKEVVNALRYFKTIVDKMAVDKMVLVMLPGSA  310
            SRLEEQEK+VV+ALRYFKTIVDKM VD  VL MLPGSA
Sbjct  60   SRLEEQEKDVVSALRYFKTIVDKMNVDTKVLQMLPGSA  97

When is a match significant?

RFKISDCQHPCTYSHNQYMTNHMRECPYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV

NFSWKKTSEKETNCQFDYPNDYNEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFNMCWLEVNSS

RF---KISDCQHPCTYSH-NQYMTNHMREC----PYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV
 F   K S+ +  C + + N Y  N   +C    P+      + +W    +P     +     D   I    N      M  ++
NFSWKKTSEKETNCQFDYPNDY--NEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFN------MCWLEVNSS

Here is a ‘typical’ weak alignment from BLASTp.

In fact the sequences were randomly generated, so there is no biologically significant alignment…

E-values

The E-value is the number of matches like the discovered match that I would expect to find by chance.

An E-value of 0.0 implies that I would expect no matches like this to arise by chance, therefore…

An E-value of 1 implies I would expect 1 match like this to arise by chance, so if I have a match with such an E-value…

E-values From First Principles

Some database statistics (23rd July 2005):

Database: NCBI RefSeq mRNA
272,619 sequences; 503,566,580 total letters (~5.0 x 10^8)

Database: NCBI nr
3,329,110 sequences; 14,601,814,750 total letters (~1.4 x 10^10)

Notation:

1.2e-35 = 1.2 x 10^-35

4.8 x 10^6 = 4,800,000

We will consider first searching a nucleotide sequence (‘ACGTAGACGT’) against a nucleotide database, e.g. the RefSeq mRNA above.

Then we will consider the more complex case of amino acid sequence (protein) searches, which is of course what we mostly do.

Calculating an E-value

The RefSeq mRNA database has ~5.0 x 10^8 letters. There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?

[Figure: an example 68-letter stretch of database sequence, CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG, with every occurrence of the query marked beneath it. ‘A’ occurs 14 times, ‘AC’ 4 times and ‘ACG’ once.]

Query = ‘A’
Expected number of matches = (5.0 x 10^8) / 4 = ~1.2 x 10^8

Query = ‘AC’
Expected number of matches = (5.0 x 10^8) / (4 x 4) = ~3.1 x 10^7

Query = ‘ACG’
Expected number of matches = (5.0 x 10^8) / (4 x 4 x 4) = ~7.8 x 10^6

Query = ‘ACGTCGA…CTGATTCG’ (a 60-mer)
Expected number of matches = (5.0 x 10^8) / (4 x 4 x 4 x 4 … 60 times) = ~(5.0 x 10^8) / 10^36 = 5.0 x 10^-28

E-value = 5.0 x 10^-28
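The same arithmetic can be written as a small Python function. This is a sketch of the slide's first-principles model only (database letters divided by 4 for every query letter); the function name `expected_chance_matches` is made up here, and real BLAST statistics are more involved.

```python
def expected_chance_matches(db_letters, query_length, alphabet_size=4):
    """Naive expectation: every extra query letter divides the count by the alphabet size."""
    return db_letters / alphabet_size ** query_length


REFSEQ_MRNA_LETTERS = 5.0e8  # rounded database size used on the slide

for n in (1, 2, 3, 60):
    estimate = expected_chance_matches(REFSEQ_MRNA_LETTERS, n)
    print(f"query length {n:2d}: {estimate:.1e}")
```

Using the exact value of 4^60 (about 1.3 x 10^36) gives a slightly smaller 60-mer estimate than the slide's 5.0 x 10^-28, which rounds 4^60 down to 10^36. At this level of approximation the difference is not important.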

E-values In Practice

So if I take a 60 nt sequence

>sequence
ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA

and actually BLAST it against the RefSeq mRNA database, I get:

BLAST OUTPUT

>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds
Length=6060

Score = 119 bits (60), Expect = 2e-26
Identities = 60/60 (100%), Gaps = 0/60 (0%)
Strand=Plus/Plus

Query  1     ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  60
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  2977  ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  3036

What do I get if I BLAST it against the larger nr database?

BLAST OUTPUT

>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds
Length=6060

Score = 119 bits (60), Expect = 6e-25
Identities = 60/60 (100%), Gaps = 0/60 (0%)
Strand=Plus/Plus

Query  1     ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  60
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  2977  ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  3036

(The theoretical value for the RefSeq mRNA search was 5.0e-28.)

E-values Effect of Database Size

The nr database has ~1.4 x 10^10 letters. There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?

[Figure: the same example sequence and query marks as in the RefSeq mRNA calculation.]

Query = ‘A’
Expected number of matches = (1.4 x 10^10) / 4 = ~3.5 x 10^9

Query = ‘AC’
Expected number of matches = (1.4 x 10^10) / (4 x 4) = ~8.8 x 10^8

Query = ‘ACG’
Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4) = ~2.2 x 10^8

Query = ‘ACGTCGA…CTGATTCG’ (a 60-mer)
Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4 x 4 … 60 times) = ~(1.4 x 10^10) / 10^36 = 1.4 x 10^-26

E-value = 1.4 x 10^-26

E-values Effect of Database Size

The E-value is simply dependent on database size. BLAST the same sequence against each database:

Database       Size                    Calculated E-value
RefSeq mRNA    5.0 x 10^8 letters      5.0e-28
nr             1.4 x 10^10 letters     1.4e-26

The nr database was ~30 times bigger, and so the E-value was ~30 times bigger.
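As a quick check of the "~30 times" figure, the ratio of the two database sizes can be compared with the ratio of the two E-values that BLAST actually reported for the 60 nt query. This is a minimal Python sketch using only the numbers quoted in these slides; the variable names are illustrative.

```python
# Database sizes from the statistics slide (23rd July 2005).
refseq_letters = 503_566_580
nr_letters = 14_601_814_750

# E-values reported by BLAST for the same 60 nt query against each database.
refseq_expect = 2e-26
nr_expect = 6e-25

size_ratio = nr_letters / refseq_letters
expect_ratio = nr_expect / refseq_expect

print(f"database size ratio: {size_ratio:.1f}")
print(f"E-value ratio:       {expect_ratio:.1f}")
```

The two ratios come out close to each other, which is what a simple proportional dependence on database size predicts.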

Why were the values different?

Our calculated E-value for searching against the RefSeq mRNA database was 5.0 x 10^-28. But our actual BLAST search at NCBI gave a value of 2.0 x 10^-26, about 40x larger. Why is this?

Gapped alignments

If we were expecting N matches for a query sequence ‘ACGTACGTACGT’, imagine what would happen to N if we allowed gaps in our matches.

This would now give us additional possible alignments that would meet our ‘match’ criteria:

ACGTACGTACGT    ACGTACAGTACGT    ACGTACCGTACGT    etc.
||||||||||||    |||||| ||||||    |||||| ||||||
ACGTACGTACGT    ACGTAC-GTACGT    ACGTAC-GTACGT

We will expect many more matches in a given database if we allow our alignments to have gaps. The E-value will be larger.
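To get a feel for how many extra targets a single gap allows, the following Python sketch enumerates every distinct database string that would align to the query with one one-letter gap inside the query. It is an illustration under simplified assumptions (one gap, one letter, no mismatches) and does not model BLAST's gap penalties; `single_gap_targets` is a made-up name.

```python
def single_gap_targets(query, alphabet="ACGT"):
    """Distinct strings formed by inserting one letter at an internal position of the query."""
    return {
        query[:i] + base + query[i:]
        for i in range(1, len(query))
        for base in alphabet
    }


query = "ACGTACGTACGT"
targets = single_gap_targets(query)

print(len(targets))                # distinct gapped targets, versus 1 exact target
print("ACGTACAGTACGT" in targets)  # the first gapped example from the slide
print("ACGTACCGTACGT" in targets)  # the second gapped example from the slide
```

For this 12-letter query there are 34 such strings, so even the most restrictive kind of gap multiplies the number of sequences that count as a ‘match’.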

E-values Effect of Query Length

BLAST a 500 nt sequence against a database:

>sequence
ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCGATCGATCGCGCATCGATCGTCTAGATCGATCGCTCGCTGTGTAGATAGATCGGCGATAGA

BLASTn: get a full-length match with sequence XYZ at an E-value = 5.0e-160.

BLAST half of the same sequence against the same database:

>sequence
ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCG

BLASTn: get a match with sequence XYZ again, but at an E-value = 5.0e-80.

Biologically it’s the same match. Does it mean we are any less sure that this match didn’t occur by chance? The E-value is simply dependent on match length.

Why not just use identity?

At some levels this is a good question.

But consider two very different searches, both of which give a 75% identity match.

Query1 was 60 nt long:

CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG
||||||||||| || | || | || || |||| | |  | ||||||  | ||||||||||
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG

which would have an E-value ~ 5.0 x 10^-19.

And Query2 was only 16 nt long:

ACGTACGTACGTACGT
||| || | |||| ||
ACGCACCTTCGTAGGT

which would have an E-value ~ 3.0.

And intuitively we feel we would expect to see that sort of number of matches in the database just by chance…
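The identity figures for the two example alignments are easy to confirm. The Python sketch below counts matching columns in an ungapped alignment; `percent_identity` is a name chosen for this illustration.

```python
def percent_identity(a, b):
    """Percentage of aligned positions (no gaps) at which the two sequences agree."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


query1 = "CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG"
match1 = "CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG"
query2 = "ACGTACGTACGTACGT"
match2 = "ACGCACCTTCGTAGGT"

print(len(query1), percent_identity(query1, match1))
print(len(query2), percent_identity(query2, match2))
```

Both alignments come out at 75.0% identity (45 of 60 and 12 of 16 positions), even though their E-values differ by many orders of magnitude. Identity alone says nothing about how much sequence the match covers.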

So what’s the real problem?

Basically, you are usually trying to answer the question:

Can I find the ortholog of my gene in some other species, so that I can work out what it might be doing in my organism?

The difficulty is that BLAST does not set out to address questions like orthology. BLAST only tells you about sequence similarity, with some notion of how likely a similarity is to have arisen by chance, based on some general biological principles.

You will always have to add in your own knowledge of biology, of exactly what your query sequence was, and of how it is related to your matching sequences; in particular, whether the degree of similarity matches up to the supposed evolutionary distance between the two species. You will also need to take into account the length of the reported match compared to the lengths of your query and matched sequences. And, of course, the size of the database.

Are there any useful guidelines, though?


Rules of Thumb

How good does an E-value have to be before we might even think we have an ortholog?

larger/worse  <------------------------------->  smaller/better

E-values:  10^-5    10^-10    10^-40    10^-100    0.0

Labels along the scale, from worse to better: fantasy, borderline, encouraging, pretty good, can’t get better.






fantasy borderline encouraging

pretty good canrsquot get

better

But note that in some gene families with closely related members you can get an E-value of 00 for several different matches and then identity may be more sensitive Also bear in mind in cases like this that ideas of lsquofunctionalrsquo orthology may break down with more than one locus producing identical proteins which share the same functionhellip

Protein BLASTItrsquos (nearly) always better to make comparisons at the amino acid level between protein sequences than the DNA level because there are many different DNA sequences that can give exactly the same protein sequenceDoes this cause us to treat expected values any differently

If we follow the argument as before then for an exact match of a 20 amino acid sequence in the RefSeq protein database each additional amino acid will reduce the E-value by 120th (there are 20 different amino acids) And as there are 347895532 letters in that databaseE-value = ~35 x 108 (20 x 20 x 20 hellip20 times) = ~35 x 10-18

But this is what we get of we run the blast at NCBI

Score = 431 bits (100) Expect = 8e-04 Identities = 2020 (100) Positives = 2020 (100) Gaps = 020 (0) Frame = +3

Query 3 SSSSFRAYRAALSEVEPPCI 62 SSSSFRAYRAALSEVEPPCISbjct 972 SSSSFRAYRAALSEVEPPCI 991

Really too big a discrepancy to easily explain with hand wavinghellip

Amino Acid Substitutions

A SC F LWYG I LMVL IMFVM ILVP V ILMW FY

N DHSQ REHKS ANTT SY HFW

H NQYK RQER QK

D NEE DQK

In fact we need to take into account both amino acid substitutability as well as as before allowing gapped alignments On average any residue can be substituted for by about 2 others so each position has about 17th chance of lsquomatchingrsquo rather than 120th

So now we getE-value = ~35 x 108 (7 x 7 x 7 hellip20 times) = ~44 x 10-9which is much closer to the actual BLAST value

ExercisesGo to the file random-DNA-sequenceshtml randomly select one of the 20 randomly generated nucleotide sequences and do a BLASTx (translated DNA-gtprotein) at NCBI against the nr protein database

Did you find any lsquosignificantrsquo hits

Repeat with a second sequence

What conclusions might you draw from this exercise

Try the same sequence(s) against the nr nucleotide database

Is there any general difference

  • Finding Sequence Similarities
  • Flavours of BLAST
  • How does it work
  • BLAST ndashTypical Output
  • When is a match significant
  • E-values
  • E-values From First Principles
  • Calculating an E-value
  • E-values In Practice
  • E-values Effect of Database Size
  • Slide 11
  • Why were the values different
  • E-values Effect of Query Length
  • Why not just use identity
  • So whatrsquos the real problem
  • Rules of Thumb
  • Protein BLAST
  • Amino Acid Substitutions
  • Exercises
Page 6: Finding Sequence Similarities >query AGACGAACCTAGCACAAGCGCGTCTGGAAAGACCCGCCAGCTACGGTCACCGAG CTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGTTCAGGA GTATTTGGACTGCAATATTGGCCCTCGTTCAAGGGCGCCTACCATCACCCGACG

E-values

The E-value is the number of matches like the discovered match that I would expect to find by chance.

An E-value of 0.0 implies that I would expect no matches like this to arise by chance, therefore…

An E-value of 1 implies that I would expect 1 match like this to arise by chance, so if I have a match with such an E-value…

E-values: From First Principles

Some database statistics (23rd July 2005):

Database: NCBI RefSeq mRNA; 272,619 sequences; 503,566,580 total letters ($\sim 5.0 \times 10^{8}$)

Database: NCBI nr; 3,329,110 sequences; 14,601,814,750 total letters ($\sim 1.4 \times 10^{10}$)

Notation:

1.2e-35 = $1.2 \times 10^{-35}$

$4.8 \times 10^{6}$ = 4,800,000

We will first consider searching a nucleotide sequence (‘ACGTAGACGT’) against a nucleotide database, e.g. the RefSeq mRNA database above.

Then we will consider the more complex case of amino acid sequence (protein) searches, which is, of course, what we mostly do.

Calculating an E-value

The RefSeq mRNA database has $\sim 5.0 \times 10^{8}$ letters. There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?

Query = ‘A’

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG
     A   A     A   A       A           AA A     A A    AA    AA

Expected number of matches = $(5.0 \times 10^{8}) / 4 \approx 1.2 \times 10^{8}$

Query = ‘AC’

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG
         AC    AC                       AC              AC

Expected number of matches = $(5.0 \times 10^{8}) / (4 \times 4) \approx 3.1 \times 10^{7}$

Query = ‘ACG’

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG
         ACG

Expected number of matches = $(5.0 \times 10^{8}) / (4 \times 4 \times 4) \approx 7.8 \times 10^{6}$

Query = ‘ACGTCGA…CTGATTCG’ (60-mer)

Expected number of matches = $(5.0 \times 10^{8}) / 4^{60} \approx (5.0 \times 10^{8}) / 10^{36} = 5.0 \times 10^{-28}$

E-value = $5.0 \times 10^{-28}$
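The arithmetic on this slide can be sketched in a few lines of Python. This is only the naive model used here (database letters divided by 4 raised to the query length), not the statistic BLAST itself computes; the function name `expected_chance_matches` is made up for illustration. Note that the slide rounds $4^{60}$ down to $10^{36}$, so the exact naive figure for the 60-mer comes out a little below $5.0 \times 10^{-28}$, though of the same order of magnitude.

```python
def expected_chance_matches(database_letters: int, query_length: int,
                            alphabet_size: int = 4) -> float:
    """Naive expected number of exact chance matches: N / k**L.

    Assumes every letter is equally likely and ignores edge effects,
    gaps and mismatches. This is the slide's simplified model only.
    """
    return database_letters / alphabet_size ** query_length


# Letter count taken from the database statistics (23rd July 2005).
REFSEQ_MRNA_LETTERS = 503_566_580

for query_length in (1, 2, 3, 60):
    expected = expected_chance_matches(REFSEQ_MRNA_LETTERS, query_length)
    print(f"{query_length:>2} nt query: about {expected:.1e} chance matches expected")
```

Each extra nucleotide in the query divides the expected count by 4, which is why a 60-mer is so unlikely to match by chance.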

E-values: In Practice

So if I take a 60 nt sequence

>sequence
ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA

and actually BLAST it against the RefSeq mRNA database, I get:

BLAST OUTPUT
>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds
Length=6060

Score = 119 bits (60), Expect = 2e-26
Identities = 60/60 (100%), Gaps = 0/60 (0%)
Strand=Plus/Plus

Query  1     ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  60
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  2977  ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  3036

What do I get if I BLAST it against the larger nr database?

BLAST OUTPUT
>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds
Length=6060

Score = 119 bits (60), Expect = 6e-25
Identities = 60/60 (100%), Gaps = 0/60 (0%)
Strand=Plus/Plus

Query  1     ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  60
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  2977  ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  3036

For comparison, the theoretical value for the RefSeq mRNA search was 5.0e-28.

E-values: Effect of Database Size

The nr database has $\sim 1.4 \times 10^{10}$ letters. There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?

Query = ‘A’

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG
     A   A     A   A       A           AA A     A A    AA    AA

Expected number of matches = $(1.4 \times 10^{10}) / 4 \approx 3.5 \times 10^{9}$

Query = ‘AC’

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG
         AC    AC                       AC              AC

Expected number of matches = $(1.4 \times 10^{10}) / (4 \times 4) \approx 8.8 \times 10^{8}$

Query = ‘ACG’

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG
         ACG

Expected number of matches = $(1.4 \times 10^{10}) / (4 \times 4 \times 4) \approx 2.2 \times 10^{8}$

Query = ‘ACGTCGA…CTGATTCG’ (60-mer)

Expected number of matches = $(1.4 \times 10^{10}) / 4^{60} \approx (1.4 \times 10^{10}) / 10^{36} = 1.4 \times 10^{-26}$

E-value = $1.4 \times 10^{-26}$

E-values: Effect of Database Size

The E-value is simply dependent on database size.

BLAST the same sequence against each database:

RefSeq mRNA: $5.0 \times 10^{8}$ letters, E-value = 5.0e-28

nr: $1.4 \times 10^{10}$ letters (~30x bigger), E-value = 1.4e-26

The database was ~30 times bigger, and so the E-value was ~30 times bigger.
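As a quick check of this scaling, the sketch below compares the ratio predicted by the naive model with the ratio of the two Expect values reported by the actual searches (2e-26 against RefSeq mRNA, 6e-25 against nr). The helper `naive_e_value` is a hypothetical name and implements only the simplified letters-divided-by-$4^{L}$ model from these slides.

```python
REFSEQ_MRNA_LETTERS = 503_566_580   # database statistics, 23rd July 2005
NR_LETTERS = 14_601_814_750
QUERY_LENGTH = 60


def naive_e_value(database_letters: int, query_length: int) -> float:
    """Simplified model from the slides: letters / 4**query_length."""
    return database_letters / 4 ** query_length


# Ratio predicted by the naive model. The 4**60 term cancels,
# leaving just the ratio of the two database sizes.
predicted_ratio = (naive_e_value(NR_LETTERS, QUERY_LENGTH)
                   / naive_e_value(REFSEQ_MRNA_LETTERS, QUERY_LENGTH))

# Ratio of the Expect values reported by the two real BLAST searches.
observed_ratio = 6e-25 / 2e-26

print(f"predicted ratio: {predicted_ratio:.1f}")
print(f"observed ratio:  {observed_ratio:.1f}")
```

Both ratios come out close to 30, so although the absolute values differ from the theory, the dependence on database size behaves as expected.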

Why were the values different?

Our calculated E-value for searching against the RefSeq mRNA database was $5.0 \times 10^{-28}$. But our actual BLAST search at NCBI gave a value of $2.0 \times 10^{-26}$, about 40x larger. Why is this?

Gapped alignments

If we were expecting N matches for a query sequence ‘ACGTACGTACGT’, imagine what would happen to N if we allowed gaps in our matches.

ACGTACGTACGT

This would now give us additional possible alignments that would meet our ‘match’ criteria:

ACGTACGTACGT    ACGTACAGTACGT    ACGTACCGTACGT    etc.
||||||||||||    |||||| ||||||    |||||| ||||||
ACGTACGTACGT    ACGTAC-GTACGT    ACGTAC-GTACGT

We will expect many more matches in a given database if we allow our alignments to have gaps. The E-value will be larger.
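To get a feel for how quickly the number of acceptable targets grows, the sketch below enumerates every distinct database string that would align to the 12-mer query with a single one-base gap in the query. It is an illustration only: real BLAST penalises gaps in the score rather than counting each gapped alignment as an equally good match, and the function name `single_insertion_targets` is invented for this example.

```python
def single_insertion_targets(query: str, alphabet: str = "ACGT") -> set[str]:
    """All distinct strings made by inserting one base anywhere in the query.

    Each such string aligns to the query with exactly one gap in the query.
    """
    return {
        query[:position] + base + query[position:]
        for position in range(len(query) + 1)
        for base in alphabet
    }


query = "ACGTACGTACGT"
targets = single_insertion_targets(query)

# Without gaps only the query itself counts as an exact match.
print("exact targets: 1")
print(f"targets allowing one inserted base: {len(targets)}")
print("ACGTACAGTACGT" in targets, "ACGTACCGTACGT" in targets)
```

Allowing even one gap turns a single acceptable target into dozens, which is the intuition behind the larger E-value.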

E-values: Effect of Query Length

BLAST a 500 nt sequence against a database (BLASTn):

>sequence
ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCGATCGATCGCGCATCGATCGTCTAGATCGATCGCTCGCTGTGTAGATAGATCGGCGATAGA

Get a full-length match with sequence XYZ at an E-value = 5.0e-160.

BLAST half of the same sequence against the same database (BLASTn):

>sequence
ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCG

Get a match with sequence XYZ again, but at an E-value = 5.0e-80.

Biologically it’s the same match. Does it mean we are any less sure that this match didn’t occur by chance? The E-value is simply dependent on match length.

Why not just use identity?

At some levels this is a good question.

But consider two very different searches, both of which give a 75% identity match.

Query1 was 60 nt long:

CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG
||||||||||| || | || | || || |||| | |  | ||||||  | ||||||||||
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG

Which would have an E-value ~ $5.0 \times 10^{-19}$.

And Query2 only 16 nt long:

ACGTACGTACGTACGT
||| || | |||| ||
ACGCACCTTCGTAGGT

Which would have an E-value ~ 3.0.

And intuitively we feel we would expect to see that sort of number of matches in the database just by chance…
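The identity figures for the two alignments above can be checked with a short Python sketch. The function `percent_identity` is a hypothetical helper that simply counts matching positions in an ungapped alignment of equal-length strings; it says nothing about statistical significance, which is the point of the comparison.

```python
def percent_identity(query: str, subject: str) -> float:
    """Percentage of positions that match in an ungapped alignment."""
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have the same length")
    matches = sum(q == s for q, s in zip(query, subject))
    return 100.0 * matches / len(query)


# The two alignments shown on the slide.
query1 = "CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG"
subject1 = "CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG"
query2 = "ACGTACGTACGTACGT"
subject2 = "ACGCACCTTCGTAGGT"

print(f"Query1: {len(query1)} nt, {percent_identity(query1, subject1):.0f}% identity")
print(f"Query2: {len(query2)} nt, {percent_identity(query2, subject2):.0f}% identity")
```

Identity alone cannot separate the two cases: the same percentage covers 45 matching positions out of 60 in one alignment and only 12 out of 16 in the other.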

So what’s the real problem?

Basically you are usually trying to answer the question:

Can I find the ortholog of my gene in some other species, so that I can work out what it might be doing in my organism?

And the difficulty is that BLAST does not set out to address questions like orthology. BLAST only tells you about sequence similarity, with some notion of how likely a similarity is to have arisen by chance, based on some general biological principles.

You will always have to add in your own knowledge of biology, and exactly what your query sequence was and how it is related to your matching sequences; in particular, whether the degree of similarity matches up to the supposed evolutionary distance between the two species. You will also need to take into account the length of the reported match compared to the lengths of your query and matched sequences. And, of course, the size of the database.

Are there any useful guidelines, though?

Rules of Thumb

How good does an E-value have to be before we might even think we have an ortholog?

Larger E-values are worse; smaller E-values are better.

$10^{-5}$: fantasy
$10^{-10}$: borderline
$10^{-40}$: encouraging
$10^{-100}$: pretty good
0.0: can’t get better

But note that in some gene families with closely related members you can get an E-value of 0.0 for several different matches, and then identity may be more sensitive. Also bear in mind, in cases like this, that ideas of ‘functional’ orthology may break down, with more than one locus producing identical proteins which share the same function…

Protein BLAST

It’s (nearly) always better to make comparisons at the amino acid level, between protein sequences, than at the DNA level, because there are many different DNA sequences that can give exactly the same protein sequence. Does this cause us to treat expected values any differently?

If we follow the argument as before, then for an exact match of a 20 amino acid sequence in the RefSeq protein database, each additional amino acid will reduce the E-value by a factor of 20 (there are 20 different amino acids). And as there are 347,895,532 letters in that database:

E-value = $\sim (3.5 \times 10^{8}) / 20^{20} \approx 3.5 \times 10^{-18}$

But this is what we get if we run the BLAST at NCBI:

Score = 43.1 bits (100), Expect = 8e-04
Identities = 20/20 (100%), Positives = 20/20 (100%), Gaps = 0/20 (0%)
Frame = +3

Query  3    SSSSFRAYRAALSEVEPPCI  62
            SSSSFRAYRAALSEVEPPCI
Sbjct  972  SSSSFRAYRAALSEVEPPCI  991

Really too big a discrepancy to easily explain with hand waving…

Amino Acid Substitutions

Each residue, followed by the residues that can substitute for it:

A: S
C: (none)
F: L, W, Y
G: (none)
I: L, M, V
L: I, M, F, V
M: I, L, V
P: (none)
V: I, L, M
W: F, Y

N: D, H, S
Q: R, E, H, K
S: A, N, T
T: S
Y: H, F, W

H: N, Q, Y
K: R, Q, E
R: Q, K

D: N, E
E: D, Q, K

In fact we need to take into account both amino acid substitutability and, as before, gapped alignments. On average any residue can be substituted for by about 2 others, so each position has about a 1/7th chance of ‘matching’ rather than 1/20th.

So now we get:

E-value = $\sim (3.5 \times 10^{8}) / 7^{20} \approx 4.4 \times 10^{-9}$

which is much closer to the actual BLAST value.
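The two protein estimates can be reproduced with the same naive model, changing only the number of equally likely outcomes per position: 20 for an exact match, and roughly 7 once substitutions are allowed (the residue itself plus about 2 substitutes is about 3 in 20, or roughly 1 in 7). A minimal sketch, with `naive_e_value` as a made-up helper name and the observed Expect value copied from the BLAST output above:

```python
REFSEQ_PROTEIN_LETTERS = 347_895_532   # letter count quoted on the slide
QUERY_LENGTH = 20                      # 20 amino acid query
OBSERVED_EXPECT = 8e-04                # Expect reported by BLAST at NCBI


def naive_e_value(database_letters: int, query_length: int,
                  outcomes_per_position: int) -> float:
    """Simplified model: letters / outcomes_per_position**query_length."""
    return database_letters / outcomes_per_position ** query_length


exact_only = naive_e_value(REFSEQ_PROTEIN_LETTERS, QUERY_LENGTH, 20)
with_substitutions = naive_e_value(REFSEQ_PROTEIN_LETTERS, QUERY_LENGTH, 7)

print(f"exact matches only:     {exact_only:.1e}")
print(f"allowing substitutions: {with_substitutions:.1e}")
print(f"observed at NCBI:       {OBSERVED_EXPECT:.0e}")
```

Allowing substitutions closes most of the gap to the observed value, although several orders of magnitude remain, consistent with gapped alignments also contributing.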

Exercises

Go to the file random-DNA-sequences.html, randomly select one of the 20 randomly generated nucleotide sequences, and do a BLASTx (translated DNA->protein) at NCBI against the nr protein database.

Did you find any ‘significant’ hits?

Repeat with a second sequence.

What conclusions might you draw from this exercise?

Try the same sequence(s) against the nr nucleotide database.

Is there any general difference?
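If the file of random sequences is not to hand, a comparable set can be generated locally. The sketch below is an assumption about how such sequences might be produced, not the method used to build random-DNA-sequences.html: it draws each base uniformly from A, C, G and T, and the length of 60 nt and the fixed seed are arbitrary choices made for illustration.

```python
import random


def random_dna(length: int, rng: random.Random) -> str:
    """Return a random nucleotide sequence with equal base frequencies."""
    return "".join(rng.choice("ACGT") for _ in range(length))


rng = random.Random(0)  # fixed seed so the same sequences are produced each run
sequences = [random_dna(60, rng) for _ in range(20)]

# Print in FASTA format, ready to paste into the NCBI BLAST form.
for number, sequence in enumerate(sequences, start=1):
    print(f">random_{number}")
    print(sequence)
```

Because these sequences have no biological origin, any hit they produce is a chance match by construction, which is what makes them useful for calibrating your sense of a ‘significant’ E-value.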

  • Finding Sequence Similarities
  • Flavours of BLAST
  • How does it work?
  • BLAST – Typical Output
  • When is a match significant?
  • E-values
  • E-values: From First Principles
  • Calculating an E-value
  • E-values: In Practice
  • E-values: Effect of Database Size
  • Slide 11
  • Why were the values different?
  • E-values: Effect of Query Length
  • Why not just use identity?
  • So what’s the real problem?
  • Rules of Thumb
  • Protein BLAST
  • Amino Acid Substitutions
  • Exercises
Page 7: Finding Sequence Similarities >query AGACGAACCTAGCACAAGCGCGTCTGGAAAGACCCGCCAGCTACGGTCACCGAG CTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGTTCAGGA GTATTTGGACTGCAATATTGGCCCTCGTTCAAGGGCGCCTACCATCACCCGACG

E-values From First Principles

Some database statistics (23rd July 2005)

Database NCBI RefSeq mRNA 272619 sequences 503566580 total letters (~50 x 108)

Database NCBI nr 3329110 sequences 14601814750 total letters (~14 x 1010)

Notation

12e-35 = 12 x 10-35

48 x 106 = 4800000

We will consider first searching a nucleotide sequence (lsquoACGTAGACGTrsquo) against a nucleotide database eg the RefSeq mRNA above

Then we will consider the more complex case of amino acid sequence (protein) searches Which is of course what we mostly do

Calculating an E-valueThe RefSeq mRNA database has ~ 50 x 108 letters There are 4 possible nucleotides - ACGT How many matches do we expect to find by chance

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGA A A A A AA A A A AA AA

Query = lsquoArsquo

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGQuery = lsquoACrsquo

AC AC AC AC

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGQuery = lsquoACGrsquo

ACG

Expected number of matches = (50 x 108) 4 = ~12 x 108

Expected number of matches = (50 x 108) (4x 4) = ~31 x 107

Expected number of matches = (50 x 108) (4 x 4 x 4) = ~81 x 106

Query = lsquoACGTCGAhellipCTGATTCGrsquo - 60-mer

Expected number of matches = (50 x 108) (4 x 4 x 4 x 4 hellip 60 times ) = (50 x 108) 1036 = 50 x 10-28

E-value = 50 x 10-28

E-values In PracticeSo if I take a 60 nt sequencegtsequenceACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA and actually BLAST it against the RefSeq mRNA database I get

BLAST OUTPUTgtgi|27469838|gb|BC0417101| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1 transcript variant 2 mRNA (cDNA clone MGC49019 IMAGE6051007) complete cds

Length=6060 Score = 119 bits (60) Expect = 2e-26 Identities = 6060 (100) Gaps = 060 (0) Strand=PlusPlus

Query 1 ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA 60 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| Sbjct 2977 ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA 3036

What do I get if I BLAST it against the larger nr databaseBLAST OUTPUTgtgi|27469838|gb|BC0417101| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1 transcript variant 2 mRNA (cDNA clone MGC49019 IMAGE6051007) complete cds

Length=6060 Score = 119 bits (60) Expect = 6e-25 Identities = 6060 (100) Gaps = 060 (0) Strand=PlusPlus

Query 1 ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA 60 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| Sbjct 2977 ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA 3036

theoretical value was 50e-28 -

E-values Effect of Database SizeThe nr mRNA database has ~ 14 x 1010 letters There are 4 possible nucleotides - ACGT How many matches do we expect to find by chance

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGA A A A A AA A A A AA AA

Query = lsquoArsquo

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGQuery = lsquoACrsquo

AC AC AC AC

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGQuery = lsquoACGrsquo

ACG

Expected number of matches = (14 x 1010) 4 = ~12 x 108

Expected number of matches = (14 x 1010) (4x 4) = ~31 x 107

Expected number of matches = (14 x 1010) (4 x 4 x 4) = ~81 x 106

Query = lsquoACGTCGAhellipCTGATTCGrsquo - 60-mer

Expected number of matches = (14 x 1010) (4 x 4 x 4 x 4 hellip 60 times ) = (14 x 1010) 1036 = 14 x 10-26

E-value = 14 x 10-26

E-values Effect of Database Size

The E-value is simply dependent on database size

RefSeq

nr

14 x 1010 letters

50 x108 letters

30 x bigger

BLAST the same sequenceagainst each E-value = 14e-26

E-value = 50e-28

The database was ~30 times bigger and so the E-value was ~30 times bigger

Why were the values differentOur calculated E-value for searching against the RefSeq mRNA database was 50 x 10-28But our actual BLAST search at NCBI gave a value of

20 x 10-26 - about 40x larger - why is this

Gapped alignments

If we were expecting N matches for a query sequence lsquoACGTACGTACGTrsquo imagine what would happen to N if we allowed gaps in our matches

ACGTACGTACGT

This would now give us additional possible alignments that would meet our lsquomatchrsquo criteria

ACGTACGTACGT ACGTACAGTACGT ACGTACCGTACGT etc|||||||||||| |||||| |||||| |||||| ||||||ACGTACGTACGT ACGTAC-GTACGT ACGTAC-GTACGT

We will expect many more matches in a given database if we allow our alignments to have gaps The E-value will be larger

E-values Effect of Query Length

Biologically itrsquos the same match Does it mean we are any less sure that this match didnrsquot occur by chance The E-value is simply dependent on match length

database

BLAST 500 nt sequence against a database

BLASTn Get a full length match with sequence XYZ at an E-value = 50e-160

gtsequence ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCGATCGATCGCGCATCGATCGTCTAGATCGATCGCTCGCTGTGTAGATAGATCGGCGATAGA

database

BLAST half of the same sequence against the same database

BLASTn

gtsequence ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCG

Get a match with sequence XYZ again but at an E-value = 50e-80

Why not just use identityAt some levels this a good question

But consider two very different searches both of which give a 75 identity match

Query1 was 60 nt longCGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG||||||||||| || | || | || || |||| | | | |||||| | ||||||||||CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCGWhich would have an E-value ~ 50 x 10-19

And Query2 only 16 nt longACGTACGTACGTACGT||| || | |||| ||ACGCACCTTCGTAGGTWhich would have an E-value ~ 30

And intuitively we feel we would expect to see that sort of number of matches in the database just by chancehellip

So whatrsquos the real problemBasically you are usually trying to answer the question

Can I find the ortholog of my gene in some other species so that I can work out what it might be doing in my organism

And the difficulty is because BLAST does not set out to address questions like orthology BLAST only tells you about sequence similarity with some notion of how likely a similarity is to have arisen by chance based on some general biological principles

You will always have to add in your own knowledge of biology and exactly what your query sequence was and how it is related to your matching sequences In particular whether the degree of similarity matches up to the supposed evolutionary distance between the two species You will also need to take into account the length of the reported match compared to the lengths of your query and matched sequences And of course the size of the database

Are there any useful guidelines though

Basically you are usually trying to answer the question

Can I find the ortholog of my gene in some other species so that I can work out what it might be doing in my organism

Rules of ThumbHow good does an E-value have to be before we might even think we have an ortholog

largerworse smallerbetter

E-values 10-5 10-10 10-40 10-100 00

fantasy borderline encouraging

pretty good canrsquot get

better

But note that in some gene families with closely related members you can get an E-value of 00 for several different matches and then identity may be more sensitive Also bear in mind in cases like this that ideas of lsquofunctionalrsquo orthology may break down with more than one locus producing identical proteins which share the same functionhellip

Protein BLASTItrsquos (nearly) always better to make comparisons at the amino acid level between protein sequences than the DNA level because there are many different DNA sequences that can give exactly the same protein sequenceDoes this cause us to treat expected values any differently

If we follow the argument as before then for an exact match of a 20 amino acid sequence in the RefSeq protein database each additional amino acid will reduce the E-value by 120th (there are 20 different amino acids) And as there are 347895532 letters in that databaseE-value = ~35 x 108 (20 x 20 x 20 hellip20 times) = ~35 x 10-18

But this is what we get of we run the blast at NCBI

Score = 431 bits (100) Expect = 8e-04 Identities = 2020 (100) Positives = 2020 (100) Gaps = 020 (0) Frame = +3

Query 3 SSSSFRAYRAALSEVEPPCI 62 SSSSFRAYRAALSEVEPPCISbjct 972 SSSSFRAYRAALSEVEPPCI 991

Really too big a discrepancy to easily explain with hand wavinghellip

Amino Acid Substitutions

A SC F LWYG I LMVL IMFVM ILVP V ILMW FY

N DHSQ REHKS ANTT SY HFW

H NQYK RQER QK

D NEE DQK

In fact we need to take into account both amino acid substitutability as well as as before allowing gapped alignments On average any residue can be substituted for by about 2 others so each position has about 17th chance of lsquomatchingrsquo rather than 120th

So now we getE-value = ~35 x 108 (7 x 7 x 7 hellip20 times) = ~44 x 10-9which is much closer to the actual BLAST value

ExercisesGo to the file random-DNA-sequenceshtml randomly select one of the 20 randomly generated nucleotide sequences and do a BLASTx (translated DNA-gtprotein) at NCBI against the nr protein database

Did you find any lsquosignificantrsquo hits

Repeat with a second sequence

What conclusions might you draw from this exercise

Try the same sequence(s) against the nr nucleotide database

Is there any general difference

  • Finding Sequence Similarities
  • Flavours of BLAST
  • How does it work
  • BLAST ndashTypical Output
  • When is a match significant
  • E-values
  • E-values From First Principles
  • Calculating an E-value
  • E-values In Practice
  • E-values Effect of Database Size
  • Slide 11
  • Why were the values different
  • E-values Effect of Query Length
  • Why not just use identity
  • So whatrsquos the real problem
  • Rules of Thumb
  • Protein BLAST
  • Amino Acid Substitutions
  • Exercises
Page 8: Finding Sequence Similarities >query AGACGAACCTAGCACAAGCGCGTCTGGAAAGACCCGCCAGCTACGGTCACCGAG CTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGTTCAGGA GTATTTGGACTGCAATATTGGCCCTCGTTCAAGGGCGCCTACCATCACCCGACG

Calculating an E-valueThe RefSeq mRNA database has ~ 50 x 108 letters There are 4 possible nucleotides - ACGT How many matches do we expect to find by chance

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGA A A A A AA A A A AA AA

Query = lsquoArsquo

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGQuery = lsquoACrsquo

AC AC AC AC

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGQuery = lsquoACGrsquo

ACG

Expected number of matches = (50 x 108) 4 = ~12 x 108

Expected number of matches = (50 x 108) (4x 4) = ~31 x 107

Expected number of matches = (50 x 108) (4 x 4 x 4) = ~81 x 106

Query = lsquoACGTCGAhellipCTGATTCGrsquo - 60-mer

Expected number of matches = (50 x 108) (4 x 4 x 4 x 4 hellip 60 times ) = (50 x 108) 1036 = 50 x 10-28

E-value = 50 x 10-28

E-values In PracticeSo if I take a 60 nt sequencegtsequenceACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA and actually BLAST it against the RefSeq mRNA database I get

BLAST OUTPUTgtgi|27469838|gb|BC0417101| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1 transcript variant 2 mRNA (cDNA clone MGC49019 IMAGE6051007) complete cds

Length=6060 Score = 119 bits (60) Expect = 2e-26 Identities = 6060 (100) Gaps = 060 (0) Strand=PlusPlus

Query 1 ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA 60 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| Sbjct 2977 ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA 3036

What do I get if I BLAST it against the larger nr databaseBLAST OUTPUTgtgi|27469838|gb|BC0417101| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1 transcript variant 2 mRNA (cDNA clone MGC49019 IMAGE6051007) complete cds

Length=6060 Score = 119 bits (60) Expect = 6e-25 Identities = 6060 (100) Gaps = 060 (0) Strand=PlusPlus

Query 1 ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA 60 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| Sbjct 2977 ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA 3036

theoretical value was 50e-28 -

E-values Effect of Database SizeThe nr mRNA database has ~ 14 x 1010 letters There are 4 possible nucleotides - ACGT How many matches do we expect to find by chance

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGA A A A A AA A A A AA AA

Query = lsquoArsquo

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGQuery = lsquoACrsquo

AC AC AC AC

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGQuery = lsquoACGrsquo

ACG

Expected number of matches = (14 x 1010) 4 = ~12 x 108

Expected number of matches = (14 x 1010) (4x 4) = ~31 x 107

Expected number of matches = (14 x 1010) (4 x 4 x 4) = ~81 x 106

Query = lsquoACGTCGAhellipCTGATTCGrsquo - 60-mer

Expected number of matches = (14 x 1010) (4 x 4 x 4 x 4 hellip 60 times ) = (14 x 1010) 1036 = 14 x 10-26

E-value = 14 x 10-26

E-values Effect of Database Size

The E-value is simply dependent on database size

RefSeq

nr

14 x 1010 letters

50 x108 letters

30 x bigger

BLAST the same sequenceagainst each E-value = 14e-26

E-value = 50e-28

The database was ~30 times bigger and so the E-value was ~30 times bigger

Why were the values differentOur calculated E-value for searching against the RefSeq mRNA database was 50 x 10-28But our actual BLAST search at NCBI gave a value of

20 x 10-26 - about 40x larger - why is this

Gapped alignments

If we were expecting N matches for a query sequence lsquoACGTACGTACGTrsquo imagine what would happen to N if we allowed gaps in our matches

ACGTACGTACGT

This would now give us additional possible alignments that would meet our lsquomatchrsquo criteria

ACGTACGTACGT ACGTACAGTACGT ACGTACCGTACGT etc|||||||||||| |||||| |||||| |||||| ||||||ACGTACGTACGT ACGTAC-GTACGT ACGTAC-GTACGT

We will expect many more matches in a given database if we allow our alignments to have gaps The E-value will be larger

E-values Effect of Query Length

Biologically itrsquos the same match Does it mean we are any less sure that this match didnrsquot occur by chance The E-value is simply dependent on match length

database

BLAST 500 nt sequence against a database

BLASTn Get a full length match with sequence XYZ at an E-value = 50e-160

gtsequence ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCGATCGATCGCGCATCGATCGTCTAGATCGATCGCTCGCTGTGTAGATAGATCGGCGATAGA

database

BLAST half of the same sequence against the same database

BLASTn

gtsequence ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCG

Get a match with sequence XYZ again but at an E-value = 50e-80

Why not just use identityAt some levels this a good question

But consider two very different searches both of which give a 75 identity match

Query1 was 60 nt longCGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG||||||||||| || | || | || || |||| | | | |||||| | ||||||||||CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCGWhich would have an E-value ~ 50 x 10-19

And Query2 only 16 nt longACGTACGTACGTACGT||| || | |||| ||ACGCACCTTCGTAGGTWhich would have an E-value ~ 30

And intuitively we feel we would expect to see that sort of number of matches in the database just by chancehellip

So whatrsquos the real problemBasically you are usually trying to answer the question

Can I find the ortholog of my gene in some other species so that I can work out what it might be doing in my organism

And the difficulty is because BLAST does not set out to address questions like orthology BLAST only tells you about sequence similarity with some notion of how likely a similarity is to have arisen by chance based on some general biological principles

You will always have to add in your own knowledge of biology and exactly what your query sequence was and how it is related to your matching sequences In particular whether the degree of similarity matches up to the supposed evolutionary distance between the two species You will also need to take into account the length of the reported match compared to the lengths of your query and matched sequences And of course the size of the database

Are there any useful guidelines though

Basically you are usually trying to answer the question

Can I find the ortholog of my gene in some other species so that I can work out what it might be doing in my organism

Rules of Thumb
How good does an E-value have to be before we might even think we have an ortholog?

(larger = worse, smaller = better)

E-value:   10^-5     10^-10       10^-40        10^-100       0.0
Verdict:   fantasy   borderline   encouraging   pretty good   can't get better

But note that in some gene families with closely related members you can get an E-value of 0.0 for several different matches, and then identity may be more sensitive. Also bear in mind, in cases like this, that ideas of 'functional' orthology may break down, with more than one locus producing identical proteins which share the same function...

Protein BLAST
It's (nearly) always better to make comparisons at the amino acid level, between protein sequences, than at the DNA level, because there are many different DNA sequences that can give exactly the same protein sequence.
Does this cause us to treat expected values any differently?

If we follow the argument as before, then for an exact match of a 20 amino acid sequence in the RefSeq protein database, each additional amino acid will reduce the E-value to 1/20th (there are 20 different amino acids). And as there are 347,895,532 letters in that database:
E-value = ~3.5 x 10^8 / (20 x 20 x 20 ... 20 times) = ~3.5 x 10^-18

But this is what we get if we run the BLAST at NCBI:

 Score = 43.1 bits (100), Expect = 8e-04
 Identities = 20/20 (100%), Positives = 20/20 (100%), Gaps = 0/20 (0%)
 Frame = +3

Query  3    SSSSFRAYRAALSEVEPPCI  62
            SSSSFRAYRAALSEVEPPCI
Sbjct  972  SSSSFRAYRAALSEVEPPCI  991

Really too big a discrepancy to easily explain with hand waving...

Amino Acid Substitutions

Residue   Acceptable substitutes
A         S
C         (none)
F         L W Y
G         (none)
I         L M V
L         I M F V
M         I L V
P         (none)
V         I L M
W         F Y

N         D H S
Q         R E H K
S         A N T
T         S
Y         H F W

H         N Q Y
K         R Q E
R         Q K

D         N E
E         D Q K

In fact we need to take into account amino acid substitutability as well as, as before, allowing gapped alignments. On average, any residue can be substituted for by about 2 others, so about 3 of the 20 amino acids are acceptable at each position. Each position therefore has about a 1/7th chance of 'matching' rather than 1/20th.

So now we get:
E-value = ~3.5 x 10^8 / (7 x 7 x 7 ... 20 times) = ~4.4 x 10^-9
which is much closer to the actual BLAST value.
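The two back-of-envelope protein estimates can be checked with a short Python sketch. The dictionary is one reading of the substitution table on the slide, and the function and constant names are illustrative; this is the naive counting argument used here, not the statistics BLAST actually applies.

```python
# One reading of the slide's table: residue -> residues treated as acceptable substitutes.
SUBSTITUTES = {
    "A": "S", "C": "", "F": "LWY", "G": "", "I": "LMV",
    "L": "IMFV", "M": "ILV", "P": "", "V": "ILM", "W": "FY",
    "N": "DHS", "Q": "REHK", "S": "ANT", "T": "S", "Y": "HFW",
    "H": "NQY", "K": "RQE", "R": "QK",
    "D": "NE", "E": "DQK",
}

DB_LETTERS = 347_895_532  # RefSeq protein database size quoted in the text
QUERY_LENGTH = 20         # amino acids in the query


def naive_e_value(db_letters: float, query_length: int, one_in: int) -> float:
    """Database size divided by one_in ** query_length."""
    return db_letters / one_in ** query_length


average_substitutes = sum(len(v) for v in SUBSTITUTES.values()) / len(SUBSTITUTES)

print(f"average substitutes per residue: {average_substitutes:.1f}")
print(f"exact matches only (1 in 20):    {naive_e_value(DB_LETTERS, QUERY_LENGTH, 20):.1e}")
print(f"with substitutions (1 in 7):     {naive_e_value(DB_LETTERS, QUERY_LENGTH, 7):.1e}")
```

Moving from a 1-in-20 to a 1-in-7 chance per position changes the estimate by many orders of magnitude, which is why substitutability matters so much for protein E-values.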

Exercises
Go to the file random-DNA-sequences.html, randomly select one of the 20 randomly generated nucleotide sequences, and do a BLASTx (translated DNA -> protein) at NCBI against the nr protein database.

Did you find any 'significant' hits?

Repeat with a second sequence.

What conclusions might you draw from this exercise?

Try the same sequence(s) against the nr nucleotide database.

Is there any general difference?
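If the file of random sequences is not to hand, comparable test sequences can be generated locally. The following is a minimal sketch, not the script used to build the exercise file: the function names, the 300 nt length and the uniform base composition are all assumptions made for illustration.

```python
import random


def random_dna(length: int, seed=None) -> str:
    """Return a random nucleotide string with A, C, G and T equally likely."""
    rng = random.Random(seed)
    return "".join(rng.choice("ACGT") for _ in range(length))


def to_fasta(name: str, sequence: str, width: int = 60) -> str:
    """Format a sequence as a FASTA record with fixed-width lines."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return f">{name}\n" + "\n".join(lines)


# 20 sequences of 300 nt each; the length is an arbitrary choice for this sketch.
for index in range(1, 21):
    print(to_fasta(f"random_{index}", random_dna(300, seed=index)))
```

Each record can be pasted into the NCBI BLAST query box as it stands.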

  • Finding Sequence Similarities
  • Flavours of BLAST
  • How does it work?
  • BLAST - Typical Output
  • When is a match significant?
  • E-values
  • E-values From First Principles
  • Calculating an E-value
  • E-values In Practice
  • E-values: Effect of Database Size
  • Slide 11
  • Why were the values different?
  • E-values: Effect of Query Length
  • Why not just use identity?
  • So what's the real problem?
  • Rules of Thumb
  • Protein BLAST
  • Amino Acid Substitutions
  • Exercises

E-values In Practice
So if I take a 60 nt sequence
>sequence
ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA
and actually BLAST it against the RefSeq mRNA database, I get:

BLAST OUTPUT
>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds
Length=6060

 Score = 119 bits (60), Expect = 2e-26
 Identities = 60/60 (100%), Gaps = 0/60 (0%)
 Strand=Plus/Plus

Query  1     ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  60
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  2977  ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  3036

What do I get if I BLAST it against the larger nr database?

BLAST OUTPUT
>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds
Length=6060

 Score = 119 bits (60), Expect = 6e-25
 Identities = 60/60 (100%), Gaps = 0/60 (0%)
 Strand=Plus/Plus

Query  1     ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  60
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  2977  ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  3036

The theoretical value for the RefSeq search was 5.0e-28.

E-values: Effect of Database Size
The nr mRNA database has ~1.4 x 10^10 letters. There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?

Example stretch of database sequence:
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG

Query = 'A' (14 occurrences in the example stretch)
Expected number of matches = (1.4 x 10^10) / 4 = ~3.5 x 10^9

Query = 'AC' (4 occurrences in the example stretch)
Expected number of matches = (1.4 x 10^10) / (4 x 4) = ~8.8 x 10^8

Query = 'ACG' (1 occurrence in the example stretch)
Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4) = ~2.2 x 10^8

Query = 'ACGTCGA...CTGATTCG' - 60-mer
Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4 x 4 ... 60 times) = (1.4 x 10^10) / ~10^36 = ~1.4 x 10^-26

E-value = 1.4 x 10^-26
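As a rough sketch of this arithmetic (not how BLAST itself computes E-values), the expected number of exact chance matches is the database size divided by 4 raised to the query length. The function name and constant are illustrative; the database size is the approximate figure quoted for nr. Because 4^60 is about 1.3 x 10^36 rather than exactly 10^36, the computed 60-mer value comes out slightly below the rounded figure.

```python
def expected_exact_matches(db_letters: float, query_length: int,
                           alphabet_size: int = 4) -> float:
    """Naive estimate: each extra query letter divides the expected count by the alphabet size."""
    return db_letters / alphabet_size ** query_length


NR_LETTERS = 1.4e10  # approximate nr size quoted in the text

for query_length in (1, 2, 3, 60):
    estimate = expected_exact_matches(NR_LETTERS, query_length)
    print(f"query length {query_length:>2}: ~{estimate:.1e} expected chance matches")
```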

E-values: Effect of Database Size

The E-value is simply dependent on database size.

BLAST the same sequence against each:

Database   Size                   E-value
RefSeq     5.0 x 10^8 letters     5.0e-28
nr         1.4 x 10^10 letters    1.4e-26

The database was ~30 times bigger, and so the E-value was ~30 times bigger.
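The same proportionality can be checked against the Expect values BLAST actually reported for the two searches (2e-26 against RefSeq mRNA and 6e-25 against nr). The sketch below only divides the quoted figures; the dictionary layout and variable names are illustrative.

```python
# Figures quoted in the text: approximate database sizes and the Expect values BLAST reported.
SEARCHES = {
    "RefSeq mRNA": {"letters": 5.0e8, "reported_expect": 2e-26},
    "nr": {"letters": 1.4e10, "reported_expect": 6e-25},
}

size_ratio = SEARCHES["nr"]["letters"] / SEARCHES["RefSeq mRNA"]["letters"]
expect_ratio = (SEARCHES["nr"]["reported_expect"]
                / SEARCHES["RefSeq mRNA"]["reported_expect"])

print(f"nr is ~{size_ratio:.0f}x larger than RefSeq mRNA")
print(f"the reported Expect value is ~{expect_ratio:.0f}x larger")
```

The two ratios agree closely, so the reported values scale with database size in the same way as the hand-calculated ones.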

Why were the values different?
Our calculated E-value for searching against the RefSeq mRNA database was 5.0 x 10^-28. But our actual BLAST search at NCBI gave a value of 2.0 x 10^-26, about 40x larger. Why is this?

Gapped alignments

If we were expecting N matches for a query sequence 'ACGTACGTACGT', imagine what would happen to N if we allowed gaps in our matches.

ACGTACGTACGT

This would now give us additional possible alignments that would meet our 'match' criteria:

ACGTACGTACGT    ACGTACAGTACGT    ACGTACCGTACGT    etc.
||||||||||||    |||||| ||||||    |||||| ||||||
ACGTACGTACGT    ACGTAC-GTACGT    ACGTAC-GTACGT

We will expect many more matches in a given database if we allow our alignments to have gaps. The E-value will be larger.
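To get a feel for how quickly the set of acceptable targets grows, the sketch below lists every distinct sequence that would align to the query with a single one-base gap. It is a counting illustration only, under the assumption that one inserted base inside the target is the only kind of gap allowed; the function name is hypothetical and this is not how BLAST derives its gapped statistics.

```python
def single_insertion_targets(query: str, alphabet: str = "ACGT") -> set:
    """All distinct sequences made by inserting one extra base inside the query.

    Each such sequence aligns to the query with a single one-base gap.
    """
    targets = set()
    for position in range(1, len(query)):  # internal positions only
        for base in alphabet:
            targets.add(query[:position] + base + query[position:])
    return targets


query = "ACGTACGTACGT"
targets = single_insertion_targets(query)

print("exact match: 1 target sequence")
print(f"one-base gap allowed: {len(targets)} additional target sequences")
print("examples:", sorted(targets)[:3])
```

Even this most restricted form of gap multiplies the number of database sequences that count as a match, and longer or multiple gaps increase it further.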

E-values: Effect of Query Length

BLAST a 500 nt sequence against a database:

>sequence
ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCGATCGATCGCGCATCGATCGTCTAGATCGATCGCTCGCTGTGTAGATAGATCGGCGATAGA

BLASTn: get a full-length match with sequence XYZ at an E-value = 5.0e-160

BLAST half of the same sequence against the same database:

>sequence
ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCG

BLASTn: get a match with sequence XYZ again, but at an E-value = 5.0e-80

Biologically it's the same match. Does it mean we are any less sure that this match didn't occur by chance? The E-value is simply dependent on match length.
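The dependence on match length can be sketched with the same naive exact-match model, working in logarithms so that very small values do not underflow. The function name and the database size of 5.0 x 10^8 letters are assumptions for illustration, and the numbers this model produces will not match the illustrative E-values quoted above; only the trend matters, namely that the exponent shrinks roughly in proportion to the match length.

```python
import math


def log10_naive_e_value(db_letters: float, match_length: int,
                        alphabet_size: int = 4) -> float:
    """log10 of (database size / alphabet_size ** match_length), computed with logs."""
    return math.log10(db_letters) - match_length * math.log10(alphabet_size)


DB_LETTERS = 5.0e8  # assumed database size; any fixed database shows the same trend

for match_length in (500, 250):
    exponent = log10_naive_e_value(DB_LETTERS, match_length)
    print(f"{match_length} nt exact match: E-value ~ 10^{exponent:.0f}")
```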





E-values: Effect of Query Length

BLAST a 500 nt sequence against a database (BLASTn):

>sequence
ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCGATCGATCGCGCATCGATCGTCTAGATCGATCGCTCGCTGTGTAGATAGATCGGCGATAGA

Get a full-length match with sequence XYZ at an E-value = 5.0e-160.

BLAST half of the same sequence against the same database (BLASTn):

>sequence
ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCG

Get a match with sequence XYZ again, but at an E-value = 5.0e-80.

Biologically, it’s the same match. Does it mean we are any less sure that this match didn’t occur by chance? The E-value is simply dependent on match length.
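The length dependence can be sketched with the simple first-principles picture, in which each extra matching nucleotide divides the expected number of chance hits by 4. This is only a rough illustration: the database size is a made-up figure, the function name is invented for this sketch, and real BLAST E-values come from its scoring statistics, so the output will not reproduce 5.0e-160 and 5.0e-80, only the pattern.

```python
import math

def naive_log10_evalue(db_letters: float, match_length: int, alphabet_size: int = 4) -> float:
    """log10 of the naive E-value: db_letters / alphabet_size ** match_length."""
    return math.log10(db_letters) - match_length * math.log10(alphabet_size)

DB_LETTERS = 1e9  # hypothetical database size, chosen only for illustration

full = naive_log10_evalue(DB_LETTERS, 500)  # full-length exact match
half = naive_log10_evalue(DB_LETTERS, 250)  # first half of the same query

print(f"500 nt exact match: E ~ 10^{full:.0f}")
print(f"250 nt exact match: E ~ 10^{half:.0f}")
```

Halving the match length roughly halves the size of the exponent, even though the underlying biological match is the same.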

Why not just use identity?

At some levels this is a good question.

But consider two very different searches, both of which give a 75% identity match.

Query1 was 60 nt long:

CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG
||||||||||| || | || | || || |||| | |  | ||||||  | ||||||||||
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG

Which would have an E-value ~ 5.0 x 10^-19

And Query2 only 16 nt long:

ACGTACGTACGTACGT
||| || | |||| ||
ACGCACCTTCGTAGGT

Which would have an E-value ~ 3.0

And intuitively we feel we would expect to see that sort of number of matches in the database just by chance…
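As a quick check on the two alignments above, percent identity can be computed directly. This is a minimal sketch for ungapped, equal-length alignments; the function and variable names are illustrative and not part of BLAST.

```python
def percent_identity(query: str, subject: str) -> float:
    """Percentage of positions that are identical in an ungapped alignment."""
    if len(query) != len(subject):
        raise ValueError("sequences must be the same length for an ungapped comparison")
    matches = sum(q == s for q, s in zip(query, subject))
    return 100.0 * matches / len(query)

# The two query/subject pairs shown on the slide.
QUERY1 = "CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG"
SBJCT1 = "CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG"
QUERY2 = "ACGTACGTACGTACGT"
SBJCT2 = "ACGCACCTTCGTAGGT"

for name, q, s in [("Query1", QUERY1, SBJCT1), ("Query2", QUERY2, SBJCT2)]:
    print(f"{name}: length {len(q)} nt, identity {percent_identity(q, s):.0f}%")
```

Both pairs come out at 75% (45 of 60 and 12 of 16 positions), which is why identity alone cannot separate a convincing match from one expected by chance.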

So what’s the real problem?

Basically, you are usually trying to answer the question:

Can I find the ortholog of my gene in some other species, so that I can work out what it might be doing in my organism?

The difficulty is that BLAST does not set out to address questions like orthology. BLAST only tells you about sequence similarity, with some notion of how likely a similarity is to have arisen by chance, based on some general biological principles.

You will always have to add in your own knowledge of biology, of exactly what your query sequence was, and of how it is related to your matching sequences. In particular, consider whether the degree of similarity matches up to the supposed evolutionary distance between the two species. You will also need to take into account the length of the reported match compared to the lengths of your query and matched sequences. And, of course, the size of the database.

Are there any useful guidelines, though?


Rules of Thumb

How good does an E-value have to be before we might even think we have an ortholog?

larger/worse <----------------------------------------------------> smaller/better

E-value:  10^-5     10^-10      10^-40       10^-100      0.0
          fantasy   borderline  encouraging  pretty good  can’t get better

But note that in some gene families with closely related members you can get an E-value of 0.0 for several different matches, and then identity may be more sensitive. Also bear in mind, in cases like this, that ideas of ‘functional’ orthology may break down, with more than one locus producing identical proteins which share the same function…

Protein BLAST

It’s (nearly) always better to make comparisons at the amino acid level, between protein sequences, than at the DNA level, because there are many different DNA sequences that can give exactly the same protein sequence.

Does this cause us to treat expected values any differently?

If we follow the argument as before, then for an exact match of a 20 amino acid sequence in the RefSeq protein database, each additional amino acid will reduce the E-value by a factor of 20 (there are 20 different amino acids). And as there are 347,895,532 letters in that database:

E-value = ~3.5 x 10^8 / (20 x 20 x 20 … 20 times) = ~3.3 x 10^-18
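The arithmetic can be followed step by step with a short sketch. It assumes only the first-principles argument above (database letters divided by 20 for every residue in the match); the function name is made up for illustration and this is not how BLAST itself computes E-values.

```python
DB_LETTERS = 347_895_532  # letters in the RefSeq protein database, as quoted above

def naive_protein_evalue(db_letters: int, match_length: int, alphabet_size: int = 20) -> float:
    """Expected number of chance exact matches of the given length."""
    return db_letters / alphabet_size ** match_length

# Each extra residue divides the expected number of chance matches by 20.
for length in (1, 5, 10, 20):
    print(f"{length:2d} residues: E ~ {naive_protein_evalue(DB_LETTERS, length):.1e}")
```

For the full 20-residue match this gives roughly 3 x 10^-18, far smaller than anything a real search would report for such a short sequence.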

But this is what we get if we run the BLAST at NCBI:

Score = 43.1 bits (100), Expect = 8e-04
Identities = 20/20 (100%), Positives = 20/20 (100%), Gaps = 0/20 (0%)
Frame = +3

Query  3    SSSSFRAYRAALSEVEPPCI  62
            SSSSFRAYRAALSEVEPPCI
Sbjct  972  SSSSFRAYRAALSEVEPPCI  991

Really too big a discrepancy to easily explain with hand waving…

Amino Acid Substitutions

Residue   Can be substituted by
A         S
C         (none)
F         L W Y
G         (none)
I         L M V
L         I M F V
M         I L V
P         (none)
V         I L M
W         F Y

N         D H S
Q         R E H K
S         A N T
T         S
Y         H F W

H         N Q Y
K         R Q E
R         Q K

D         N E
E         D Q K

In fact, we need to take into account amino acid substitutability as well as, as before, allowing gapped alignments. On average, any residue can be substituted for by about 2 others, so each position has about a 1/7th chance of ‘matching’ rather than 1/20th.

So now we get:

E-value = ~3.5 x 10^8 / (7 x 7 x 7 … 20 times) = ~4.4 x 10^-9

which is much closer to the actual BLAST value.
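The figures above can be checked with a small sketch. The dictionary is transcribed from the substitution table on this slide as read here (an assumption if the layout was different), and the 1-in-7 figure is the slide's own rounding of about 3 acceptable residues out of 20; the names are illustrative only.

```python
# Residue -> residues that may substitute for it, as read from the slide's table.
SUBSTITUTES = {
    "A": "S", "C": "", "F": "LWY", "G": "", "I": "LMV",
    "L": "IMFV", "M": "ILV", "P": "", "V": "ILM", "W": "FY",
    "N": "DHS", "Q": "REHK", "S": "ANT", "T": "S", "Y": "HFW",
    "H": "NQY", "K": "RQE", "R": "QK",
    "D": "NE", "E": "DQK",
}

DB_LETTERS = 347_895_532  # RefSeq protein database size quoted earlier
MATCH_LENGTH = 20         # length of the example match

def naive_evalue(db_letters: int, odds_per_position: int, length: int) -> float:
    """Database letters divided by the per-position odds, once per residue."""
    return db_letters / odds_per_position ** length

mean_subs = sum(len(subs) for subs in SUBSTITUTES.values()) / len(SUBSTITUTES)
print(f"average substitutes per residue: {mean_subs:.1f}")
print(f"identity only  (1 in 20): E ~ {naive_evalue(DB_LETTERS, 20, MATCH_LENGTH):.1e}")
print(f"substitutions  (1 in 7):  E ~ {naive_evalue(DB_LETTERS, 7, MATCH_LENGTH):.1e}")
```

The table averages 2.3 substitutes per residue, consistent with "about 2 others", and the 1-in-7 estimate lands near 4.4 x 10^-9.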

Exercises

Go to the file random-DNA-sequences.html, randomly select one of the 20 randomly generated nucleotide sequences, and do a BLASTx (translated DNA->protein) at NCBI against the nr protein database.

Did you find any ‘significant’ hits?

Repeat with a second sequence.

What conclusions might you draw from this exercise?

Try the same sequence(s) against the nr nucleotide database.

Is there any general difference?
